<table>
<tr>
<td><img src="images/python-logo-master.png" style="width:252px"></td>
<td><img src="images/pandas_logo.png" style="width:525px"></td>
</tr>
</table>             

# Data Manipulation with Python and Pandas

## School of Medicine Research Computing
**Christina Gancayco**<br>
**January 15, 2019**<br>
**gancayco@virginia.edu**<br>

## What Can You Do With Pandas?
* Data cleansing and preparation
* Data analysis and modeling
* Read data from many different file types (e.g. CSV, Excel, text files)

## Outline

<p>
In this workshop we will manipulate survey data from the National Health and Nutrition Examination Survey conducted by the CDC. We will work with this dataset throughout the entire tutorial. Answers to example questions can be found in the notebook at this link: .
</p>

1. Importing pandas
2. Reading a dataset into Python
3. Properties of a DataFrame
4. Preliminary analysis
5. Pandas Operations
    * Query
    * Quantile
    * Select and drop
    * Sort values
    * Assign
    * Group by
    * Aggregate
6. "Piping"


### Importing Pandas
<p>
Before we can begin using Pandas, we first have to import the module. We will be importing Pandas as `pd` for short.
<br>
<br>
We will also use NumPy later in the workshop, so we import it now as well. Run the cell below by selecting it and either clicking the "Run" button in the menu above or pressing `Shift`+`Return`.
</p>

In [None]:
# Hashes are used to comment code
# Anything following a hash is not executed
# Comments are awesome for annotating code!

# Let's import the modules we'll be using
import pandas as pd
import numpy as np

### Reading a dataset into Python
<p>
Pandas can read data from a variety of file types. The commands to read in data start with `read_`.
<br>
<br>
In this case we will be reading in data from a CSV file `nhanes.csv`, so we will use Pandas' `read_csv` command.
</p>
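
As a minimal sketch of what `read_csv` does, the block below reads a few lines of CSV text from an in-memory buffer instead of a file on disk. The text and the variable name `sample` are made up for illustration and are not taken from `nhanes.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """id,Gender,Age
1,male,14
2,female,43
3,male,80
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# The first line of the text becomes the column labels
print(type(sample))
print(sample)
```

Reading from a string like this is only a convenience for small experiments; for a real file we pass the file name, as in the cell below.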

In [None]:
# Run this cell to read the data from nhanes.csv.

# We will assign the input data to a variable called df.
# (In this lesson we are using df, but you can name this variable anything you want.)

df = pd.read_csv("nhanes.csv")


# Let's also check out the data type of df

print(type(df))

### DataFrames
<p>
From the previous cell, we can see that `df` is a DataFrame.
<br>
<br>
<b>DataFrames</b> are 2-D tabular data structures with labeled columns and rows. Typically each row corresponds to a single observation or case of multiple measurements (represented by columns). The table below is an example representation of a DataFrame.
</p>

<table>
<tr>
<th>Index</th> <th>id</th> <th>Gender</th> <th>Age</th>
</tr>

<tr>
<td>0</td> <td>62163</td> <td>male</td> <td>14</td>
</tr>

<tr>
<td>1</td> <td>62172</td> <td>female</td> <td>43</td>
</tr>

<tr>
<td>2</td> <td>62174</td> <td>male</td> <td>80</td>
</tr>

<tr>
<td>3</td> <td>62174</td> <td>male</td> <td>80</td>
</tr>

<tr>
<td>4</td> <td>62175</td> <td>male</td> <td>5</td>
</tr>
</table>

In this case, `Index` is the label for the DataFrame's rows, and `id`, `Gender`, and `Age` are the column labels.
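
To make the row and column labels concrete, the example table above can be rebuilt by hand from a dictionary of columns. This is only a sketch for illustration; the variable name `example` is our own choice, and in the workshop the DataFrame comes from `read_csv`.

```python
import pandas as pd

# Each dictionary key becomes a column label;
# each list holds that column's values, one per row
example = pd.DataFrame({
    "id": [62163, 62172, 62174, 62174, 62175],
    "Gender": ["male", "female", "male", "male", "male"],
    "Age": [14, 43, 80, 80, 5],
})

print(example)

# Row labels are generated automatically, starting at 0
print(list(example.index))

# Column labels come from the dictionary keys
print(list(example.columns))
```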

### Properties of DataFrames

<p>
Usually the DataFrames we work with have a lot more than 5 rows and 3 columns, so it can be more difficult to visually inspect our DataFrame. There are several attributes we can explore to learn more about our DataFrame.
</p>

<table>
<tr>
<th><p style="font-size:12pt">Attribute</p></th> <th><p style="font-size:12pt">Output</p></th>
</tr>

<tr>
<td><p style="font-size:12pt">`shape`</p></td> <td><p style="font-size:12pt">Dimensions of DataFrame <br>(# rows, # columns)</p></td>
</tr>

<tr>
<td><p style="font-size:12pt">`index`</p></td> <td><p style="font-size:12pt">Row label range <br>(`Index` min and max)</p></td>
</tr>

<tr>
<td><p style="font-size:12pt">`columns`</p></td> <td><p style="font-size:12pt">Column labels <br> (e.g. `id`, `Gender`, `Age`)</p></td>
</tr>
</table>

Use of these attributes is demonstrated in the code below.

In [None]:
# Display the shape of the DataFrame assigned to the variable df
print("Shape of df is:", df.shape)

# Display the row label range of df
print("df row labels:", df.index)

# Display the column labels of df
print("df column labels:", df.columns)

### Preliminary Analysis

<p>
We can use a few different methods to preview the contents of the dataset.
</p>

<table>
<tr>
<th><p style="font-size:12pt">Method</p></th> <th><p style="font-size:12pt">Output</p></th>
</tr>

<tr>
<td><p style="font-size:12pt">`head(n)`</p></td> <td><p style="font-size:12pt">Display first `n` rows of DataFrame<br>(default n=5)</p></td>
</tr>

<tr>
<td><p style="font-size:12pt">`tail(n)`</p></td> <td><p style="font-size:12pt">Display last `n` rows of DataFrame<br>(default n=5)</p></td>
</tr>

<tr>
<td><p style="font-size:12pt">`describe()`</p></td> <td><p style="font-size:12pt">Display summary stats for each numeric column</p></td>
</tr>
</table>

In [None]:
# Using head, we can view the first five rows of the dataframe

df.head()

In [None]:
# We can also view the first ten rows if we set n equal to 10.

df.head(10)

In [None]:
# tail works similarly, but instead displays the last n rows.

df.tail()

In [None]:
# With describe, we can calculate and view preliminary statistics including mean, standard deviation, etc.

df.describe()

### Pandas Operations

There are a variety of ways we can manipulate the data using Pandas operations.

### Query

The `query` operation is used to filter your data according to some criteria.

For example, we can return all the rows in our dataframe where the column variable `SmokingStatus` is `Never`.

There are several ways we can do this.
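
As a quick preview on a small made-up table, the sketch below filters the same rows two ways, first with a boolean mask and then with `query`, and checks that the results match. The values and the name `toy` are invented for illustration.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "SmokingStatus": ["Never", "Current", "Former", "Never"],
    "BPSys": [118, 135, 142, 131],
})

# Approach 1: build a boolean mask, then index with it.
# Each condition is wrapped in parentheses before combining with &.
mask = (toy["SmokingStatus"] == "Never") & (toy["BPSys"] > 130)
by_mask = toy[mask]

# Approach 2: pass the same conditions to query() as one string.
# query accepts "and" / "or" between conditions.
by_query = toy.query("SmokingStatus == 'Never' and BPSys > 130")

print(by_mask)
print(by_mask.equals(by_query))
```

Which form to use is largely a matter of taste: the mask form is plain Python, while the `query` string is often shorter when several conditions are combined.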

In [None]:
## Query Method 1

# The following line of code will return a vector of booleans telling us which rows
# contain a value of "Never" in the SmokingStatus column

print(df.SmokingStatus == "Never")

# We can use those booleans to index our dataframe

df[df.SmokingStatus == "Never"]


#### \*Side Note

When we index or query our dataframe, we are not overwriting our original dataframe with the filtered result.

In order to save any changes made to your dataframe, you must either overwrite your current dataframe or store the result in a new variable.

e.g. `NonSmokers = df[df.SmokingStatus == "Never"]`
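
The point of this side note can be checked on a small made-up table. In the sketch below (values and variable names are our own), the row count of the original stays the same after filtering, and the filtered rows only persist once they are stored in a variable.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "SmokingStatus": ["Never", "Current", "Never", "Former"],
})

# Filtering returns a new DataFrame; the result here is simply discarded
toy[toy["SmokingStatus"] == "Never"]
rows_after_filter = len(toy)
print(rows_after_filter)   # toy still has all of its rows

# Store the result in a new variable to keep it
non_smokers = toy[toy["SmokingStatus"] == "Never"]
print(len(non_smokers))
```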

In [None]:
## Query Method 2

# Here we are using pandas' query method to filter our dataframe. Again we are looking for
# all the rows in our dataframe that have a value of "Never" in the SmokingStatus column

df.query("SmokingStatus == 'Never'")

# Note that the input argument to query is a string (e.g "SmokingStatus == 'Never'")
# Since "Never" is also a string, we need to use single quotes to differentiate it from within the larger string

In [None]:
## More Fun with Query

# We can query with numbers in addition to strings

df[df.BPSys >= 130]

df.query("Age < 18")


In [None]:
## Query with Logical Operators

# We can also use logical operators like AND (&) and OR (|) when we query.

# Here we want to return rows of our data frame that have a value of "Never"
# in the SmokingStatus column AND a value greater than 130 in the BPSys column

df[(df.SmokingStatus == "Never") & (df.BPSys > 130)]

# When we index our dataframe with logical operators, each condition must be wrapped in parentheses,
# because & and | are evaluated before comparisons like == and > in Python.
# However, if we use pandas' query method, we just put everything in a single string argument.

df.query("SmokingStatus == 'Never' & BPSys > 130")

# Here we want to return rows of our dataframe that have a value
# of "Never" OR "Former" in the SmokingStatus column

df[(df.SmokingStatus == "Never") | (df.SmokingStatus == "Former")]

#### Example 1
Query for current smokers with a BMI over 25.

The result should be 394 rows x 32 columns.

In [None]:
# Example 1

#---Your code here---#


### Quantile

The `quantile` operation can be used to determine the quantiles or percentiles for columns in your dataframe with numerical values.
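
A small worked example may help here. The sketch below uses five made-up income values (not from the survey data) so the median can be checked by eye, then uses that value to filter rows.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"Income": [10000, 20000, 30000, 40000, 50000]})

# 0.5 is the 50th percentile, i.e. the median (the middle value, 30000)
median_income = toy["Income"].quantile(0.5)
print(median_income)

# Passing a list returns several quantiles at once
quartiles = toy["Income"].quantile([0.25, 0.75])
print(quartiles)

# Keep only the rows above the median
above_median = toy[toy["Income"] > median_income]
print(above_median)
```

Note that `quantile` takes a fraction between 0 and 1, so the 90th percentile is written as `0.9`, not `90`.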

In [None]:
# Here we are determining the 50th percentile (or median) for income.

df.Income.quantile(0.5)

In [None]:
# We can then use that value to index our dataframe for rows containing
# an Income value greater than our 50th percentile, 50000.

df[df.Income > 50000]

In [None]:
# We can combine the two lines above into a single line of code.

df[df.Income > df.Income.quantile(0.5)]

#### Example 2

Find the patients who are in the 90th percentile for BMI.

The result should be 481 rows x 32 columns.

In [None]:
# Example 2

#---Your code here---#


### Select and Drop

If you only need to work with certain variables, you can select or drop columns of your dataframe until it contains only the variables of interest.
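
Selecting and dropping are two routes to the same result. The sketch below, on a made-up three-column table, keeps two columns by selecting them and then reaches the same table by dropping the third. It also shows why two sets of brackets matter when selecting. The `columns=` keyword used here is an alternative to passing `axis=1`.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "Gender": ["male", "female", "male"],
    "Age": [14, 43, 80],
})

# Select: keep only the listed columns
selected = toy[["Gender", "Age"]]

# Drop: remove the listed columns and keep the rest
dropped = toy.drop(columns=["id"])

print(list(selected.columns))
print(selected.equals(dropped))

# One set of brackets returns a Series; two sets return a DataFrame
print(type(toy["Age"]))
print(type(toy[["Age"]]))
```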

#### Select

In [None]:
# Selecting data

# To select columns you want to keep, use two sets of brackets [[]].

# If we only want the "Age" column, then we can use the following line of code:

df[["Age"]]

# If we want to keep multiple columns, we can use the same syntax and
# separate items with a comma.

df[["Gender", "Age", "Race"]]

#### Drop

In [None]:
# Dropping (removing) data

# To drop individual columns, we can use the "drop" operation

df.drop("id", axis=1)

# We need to specify axis=1 (0=rows (default), 1=columns); otherwise drop looks
# for a row labeled "id" and raises an error.

# We can drop multiple columns at once by putting the column names in a list.

df.drop(["id", "Insured", "Poverty"], axis=1)

### Sort Values

We can sort the rows of our dataframe by the values of any variable by using the `sort_values` operation. For example, we can sort them by age, so that the youngest patients appear at the top of the dataframe and the oldest appear at the bottom. We could also sort them in descending order if we want the oldest at the top.
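
On a made-up table with three patients, the effect of sorting is easy to follow. The sketch below sorts by age in both directions; note that the row labels travel with their rows rather than being renumbered.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "Age": [43, 14, 80],
})

# Youngest first (ascending is the default)
by_age = toy.sort_values(by="Age")
print(by_age)

# Oldest first
by_age_desc = toy.sort_values(by="Age", ascending=False)
print(by_age_desc)

# The original row labels are kept, so they now appear out of order
print(list(by_age.index))
```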

In [None]:
# Sort the dataframe by using "sort_values" and setting the "by" input argument to your desired variable.

df.sort_values(by="Age")

# We can sort in descending order
df.sort_values(by="Age", ascending=False)

#### \*Combining Operations

We can combine operations by stringing them together. The below example shows how we can select desired columns and sort the rows in the same line of code.
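
Longer chains can become hard to read on one line. One common layout, sketched below on made-up data, wraps the whole chain in parentheses so that each operation sits on its own line. The table and the name `result` are our own illustration.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Gender": ["male", "female", "female", "male"],
    "Age": [14, 43, 80, 5],
})

# Parentheses let the chain span several lines;
# each step receives the output of the step before it
result = (
    toy[["id", "Age"]]                         # select columns
    .query("Age >= 18")                        # keep adults
    .sort_values(by="Age", ascending=False)    # oldest first
)

print(result)
```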

In [None]:
# Combining operations

df[["Gender", "Age", "Race"]].sort_values(by="Race")

#### Example 3

Select the `id`, `Weight`, `Height`, and `BMI` columns from the dataframe and sort by `BMI` in descending order.

The result should be 5000 rows x 4 columns.

In [None]:
# Example 3

#---Your code here---#


### Assign

We can use the `assign` operation to add new column variables to our dataframe.

In [None]:
# Assign

# Let's create a new dataframe to play with containing only the Weight and Height columns.

myDF = df[["Weight", "Height"]]

# The weight and height are listed in metric units of measurement.
# Let's convert the weight from kilograms to pounds and assign the result as a new
# variable in the dataframe myDF.

myDF.assign(WeightLB = myDF.Weight*2.2)

# Here we multiplied the values in the weight column by 2.2 to convert from kg to lb
# and assigned the result to a column called WeightLB.

#### \*Side Note

As was shown with the `query` operation, in order to save any changes made to your dataframe you must either overwrite your current dataframe or store it in a new variable.
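
The sketch below makes this concrete for `assign`, using two made-up rows. Calling `assign` on its own leaves the dataframe unchanged; the new column only appears once the result is stored.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"Weight": [50.0, 80.0], "Height": [160.0, 180.0]})

# assign returns a new DataFrame; the result here is discarded
toy.assign(WeightLB=toy["Weight"] * 2.2)
has_column_before = "WeightLB" in toy.columns
print(has_column_before)

# Overwrite toy with the result to keep the new column
toy = toy.assign(WeightLB=toy["Weight"] * 2.2)
has_column_after = "WeightLB" in toy.columns
print(has_column_after)
```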

### Group by

The `groupby` operation allows you to split your data into groups, apply a function to each of those groups independently, and combine the results into a data structure.

In the example below, the dataframe is split into groups based on the different `SmokingStatus` types, and various summary statistics are performed on each group.
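
The split, apply, and combine steps can be traced by hand on a made-up table with two groups of two patients each. This is a sketch for illustration only; the weights are invented.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "SmokingStatus": ["Never", "Current", "Never", "Current"],
    "Weight": [60.0, 80.0, 70.0, 90.0],
})

# Split: one group per distinct SmokingStatus value
groups = toy.groupby("SmokingStatus")
print(groups.size())

# Apply and combine: the mean Weight is computed within each group,
# and the results are collected into one Series indexed by group
mean_weight = groups["Weight"].mean()
print(mean_weight)
```

Here the "Never" group averages 60 and 70, and the "Current" group averages 80 and 90, so each group gets its own mean.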

In [None]:
# Group by

# The groupby operation splits the data into the separate SmokingStatus types.
# This merely performs the splitting into groups.

SmokingGroups = df.groupby("SmokingStatus")

# We can use the describe function that we used previously to display 
# summary statistics for each column variable within each SmokingStatus group

SmokingGroups.describe()

In [None]:
# Group by (individual variables and statistics)

# If we only need to look at the summary statistics for one variable, we can specify that.
# In the example below, we want to look at the summary statistics for Weight within each SmokingStatus group.

SmokingGroups = df.groupby("SmokingStatus")
SmokingGroups["Weight"].describe()

# We can also look at individual stats.
SmokingGroups["Weight"].mean()
SmokingGroups["Weight"].quantile(0.9)

### Aggregate

Using the aggregate function on a grouped dataframe allows us to perform additional computations on the grouped data, such as functions from the Numpy (`np`) package.
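
Besides NumPy functions, `agg` also accepts the names of common statistics as strings. The sketch below uses that form on a made-up table as an alternative, not a replacement, for the NumPy version in the next cell.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "SmokingStatus": ["Never", "Current", "Never", "Current"],
    "Weight": [60.0, 80.0, 70.0, 90.0],
})

# A list of statistic names produces one output column per statistic
stats = toy.groupby("SmokingStatus")["Weight"].agg(["mean", "median", "max"])

print(stats)
```

The result has one row per group and one column per requested statistic.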



In [None]:
# Aggregate

SmokingGroups = df.groupby("SmokingStatus")

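# Calculate the mean using NumPy's mean function.
# Only numeric columns are selected here, since recent pandas versions may raise
# an error when asked for the mean of text columns such as Gender.
SmokingGroups[["Weight", "Height"]].agg(np.mean)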

# You can apply multiple aggregate functions at once by putting the functions in a list
SmokingGroups["Weight"].agg([np.mean, np.median, np.std])

#### Example 4

Determine the median `Income` for the different `Education` groups.

In [None]:
# Example 4

#---Your code here---#


## Write a dataframe to a CSV file

If you want to write a new or modified dataframe to a CSV file, you can use the `to_csv` function.

In the example below we are writing the dataframe `myDF` to the CSV file called `output.csv`.
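
By default `to_csv` also writes the row labels as an unnamed first column. The sketch below, on made-up data, shows the difference that `index=False` makes. It calls `to_csv` without a file name, which returns the CSV text as a string, so no file is written.

```python
import io
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"Weight": [50.0, 80.0], "Height": [160.0, 180.0]})

# With no file name, to_csv returns the CSV text instead of writing a file
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

print(with_index)      # header starts with an empty field for the row labels
print(without_index)   # header contains only the column names

# Reading the text back gives an equal DataFrame
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.equals(toy))
```

Whether to keep the index depends on the data: the default row numbers are rarely worth saving, but meaningful row labels usually are.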

In [None]:
# Write a dataframe to a CSV file

myDF.to_csv("output.csv")

## More Examples

Below are a few more examples to try on your own.

### Example 5

Query the dataframe for female patients who are 18 years old or older. Sort by `Pulse` in descending order and view the first 10 rows. You can use the `query` -> `sort_values` -> `head` functions.

The result should be 10 rows x 33 columns.

In [None]:
# Example 5

#---Your code here---#


### Example 6

Group by race and display summary statistics for systolic blood pressure (column name `BPSys`). You can use the `groupby` -> `describe` functions.

In [None]:
# Example 6

#---Your code here---#


### Example 7

Select the `id`, `Education`, `PhysActive`, and `PhysActiveDays` columns. Calculate the percentage of physically active days and assign it to a new variable called `PCTActive` (calculate by dividing `PhysActiveDays` by 7 and multiplying by 100). Sort in ascending order by `PCTActive` and display the first 10 rows. You can use `[["col1", "col2", ...]]` -> `assign` -> `query` -> `sort_values` -> `head`.

The result should be 10 rows x 5 columns.

In [None]:
# Example 7

#---Your code here---#


### Example 8

Query the dataframe for patients with `Income` below the 50th percentile. Group by `Race` and show the mean `Weight`. You can use `query` -> `groupby` -> `aggregate`.

In [None]:
# Example 8

#---Your code here---#


### Example 9

Group by `Education` and calculate mean hours of sleep (use column `SleepHrsNight`). Sort by `SleepHrsNight` in descending order. You can use `groupby` -> `agg` -> `sort_values`.

In [None]:
# Example 9

#---Your code here---#
