# Lab Exercise 05
---

## More Pandas Functions!

We have been given a dataset split into two CSV files: stroke_data_01.csv and stroke_data_02.csv. Each file contains a different set of variables for the same cohort of patients. *Source: [Stroke prediction dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?resource=download)*

stroke_data_01.csv:
- `id`: unique identifier
- `hypertension`: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- `heart_disease`: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- `avg_glucose_level`: average glucose level in blood
- `bmi`: body mass index
- `smoking_status`: "formerly smoked", "never smoked", "smokes" or "Unknown"
- `stroke`: 1 if the patient had a stroke or 0 if not

stroke_data_02.csv:
- `id`: unique identifier
- `gender`: "Male", "Female" or "Other"
- `age`: age of the patient
- `ever_married`: "No" or "Yes"
- `work_type`: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- `Residence_type`: "Rural" or "Urban"

Our first step is to load these datasets into two pandas dataframes.

In [None]:
# Import the pandas module with alias pd
import pandas as pd

# Load our data
df1 = pd.read_csv("stroke_data_01.csv")
df2 = pd.read_csv("stroke_data_02.csv")

Let's take a quick look at our first dataframe, `df1`, using the `head()` method.

In [None]:
df1.head()

It looks like all of the variables are there and match the information we were given.

Now let's check out `df2` using `head()`.

In [None]:
df2.head()

We can get some descriptive statistics quickly by using the `describe()` method.

In [None]:
df1.describe()

This method reports the number of non-NaN values (count), mean, standard deviation, minimum, maximum, and the 25th, 50th (median), and 75th percentiles for each numeric column. It is especially helpful for continuous variables.

What happens when we run `describe()` on `df2`?

In [None]:
df2.describe()

For `df2`, only `age` shows up alongside the numeric `id` column. This is because, by default, `describe()` only summarizes columns with an int or float data type. It is less useful for this dataframe because most of the columns are categorical variables (str).

Later we will discuss a method that can be very helpful for quickly viewing and understanding categorical variables.
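As a side note, `describe()` can also be pointed at the non-numeric columns through its `include` and `exclude` arguments. The sketch below uses a small made-up dataframe (the `toy` name and its values are illustrative assumptions, not rows from the lab files) to show the kind of summary this gives.

```python
import pandas as pd

# Hypothetical toy data standing in for a few columns of df2
toy = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "gender": ["Male", "Female", "Male", "Female"],
    "ever_married": ["Yes", "Yes", "Yes", "No"],
})

# Inspect the data type of each column
print(toy.dtypes)

# Exclude numeric columns so that only the string columns are summarized.
# The result has the rows: count, unique, top (most frequent value), freq.
cat_summary = toy.describe(exclude=["number"])
print(cat_summary)
```

For string columns the summary rows change from mean and percentiles to the number of distinct values and the most frequent value.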

As mentioned previously, these two dataframes are actually part of the same dataset, meaning all of the patients are the same across both dataframes. It would be much easier to work with all of the data in a single dataframe.

We can do this by joining the two dataframes. We will use the `merge()` method to match the patients based on a specific variable (`id`). We will merge `df2` into `df1` using a left join (`how='left'`).

In [None]:
df = df1.merge(df2, on='id', how='left')
df

A left join returns every row from the dataframe on which the method was called (the "left" dataframe, `df1`) and adds the columns from the "right" dataframe, `df2`, wherever a row's key (`id`) matches. Rows in `df1` with no match in `df2` receive NaN in the columns that come from `df2`.

You can modify how this is done by changing the `how` argument to "right", "inner", etc. Here are the definitions for each option (from the pandas documentation):
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

The figure below helps visualize the differences among these options.
![join](img/pandas_merge.png)
*Source: https://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/*
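To make the four `how` options concrete, here is a minimal sketch on two made-up dataframes whose keys only partly overlap. The frames `left` and `right` and their values are assumptions for illustration; in the lab data every patient appears in both files, so the options would likely all return the same rows there.

```python
import pandas as pd

# Hypothetical toy frames: ids 1-3 on the left, ids 2-4 on the right
left = pd.DataFrame({"id": [1, 2, 3], "bmi": [36.6, 32.5, 34.4]})
right = pd.DataFrame({"id": [2, 3, 4], "age": [80, 49, 79]})

# Collect the keys that survive each type of join
keys_by_how = {}
for how in ["left", "right", "inner", "outer"]:
    # indicator=True adds a "_merge" column showing where each row came from
    merged = left.merge(right, on="id", how=how, indicator=True)
    keys_by_how[how] = merged["id"].tolist()
    print(how, keys_by_how[how])

# In a left join, unmatched left rows get NaN in the right-hand columns
left_join = left.merge(right, on="id", how="left")
print(left_join)
```

The `_merge` column produced by `indicator=True` is a convenient way to check which rows failed to match after a join.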

Now that we have a single dataframe, we can start cleaning it up. First, let's look at all of the columns.

In [None]:
df.columns

For consistency, let's change the name of `Residence_type` to all lower case. We can do this using the `rename()` method.

In [None]:
df.rename(columns={'Residence_type':'residence_type'},inplace=True)
df

Next, we will make the `id` column the index column for the dataframe. We will use the `set_index()` method to do this.

In [None]:
df.set_index('id',inplace=True)
df
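Both `rename()` and `set_index()` return a new dataframe when `inplace=True` is omitted, which allows the two steps to be chained. The sketch below shows this alternative style on a made-up two-row dataframe (the `toy` frame and its values are illustrative assumptions).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the lab dataset
toy = pd.DataFrame({"id": [9046, 51676], "Residence_type": ["Urban", "Rural"]})

# Without inplace=True, each method returns a new dataframe, so calls can be chained
clean = (
    toy.rename(columns={"Residence_type": "residence_type"})
       .set_index("id")
)
print(clean)

# The original dataframe is left unchanged
print(toy.columns.tolist())

# reset_index() moves the index back into a regular column
restored = clean.reset_index()
print(restored.columns.tolist())
```

Whether to chain or to use `inplace=True` is largely a matter of style; chaining keeps the original dataframe available in case a step needs to be redone.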

Next, we will look at the frequencies for different values for each column. We will do this using the `value_counts()` method.

In [None]:
df.value_counts()

Using the method on the full dataframe is not very helpful, as it shows the frequency of each unique combination of values across all columns. Let's clean this up by selecting individual columns.

Let's see the frequency of patients based on the presence/absence of hypertension, then we will look at the frequency based on both hypertension and heart disease.

In [None]:
df.value_counts(subset=['hypertension'])

In [None]:
df.value_counts(subset=['hypertension','heart_disease'])

This made it easy to see how many patients fall within each of these 4 groups. We can extend this analysis with `groupby()`, which forms the same groups that we defined with `value_counts()`, and then calculate the median of every other numeric variable within each group.

In [None]:
# numeric_only=True skips the string columns, which have no median
df.groupby(['hypertension','heart_disease']).median(numeric_only=True)
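A grouped dataframe can also report several statistics at once through `agg()`. The following sketch uses a made-up five-row dataframe (the `toy` frame and its values are assumptions for illustration) to compute the median, mean, and group size of `bmi` per group.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the lab dataset
toy = pd.DataFrame({
    "hypertension": [0, 0, 1, 1, 1],
    "heart_disease": [0, 0, 0, 1, 1],
    "bmi": [20.0, 30.0, 28.0, 32.0, 36.0],
    "gender": ["Male", "Female", "Female", "Male", "Female"],
})
keys = ["hypertension", "heart_disease"]

# Select one column, then request several statistics per group
summary = toy.groupby(keys)["bmi"].agg(["median", "mean", "count"])
print(summary)

# numeric_only=True restricts the median to numeric columns (gender is dropped)
medians = toy.groupby(keys).median(numeric_only=True)
print(medians)
```

Only the combinations of `hypertension` and `heart_disease` that actually occur in the data appear as rows in the result.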

Let us turn our attention to the categorical (str) variables. We can quickly assess what values exist for these columns using `unique()`. This will return an array of the unique values for a given column.

In [None]:
df.residence_type.unique()

So `residence_type` has two possible values: "Rural" or "Urban".

We can use the `nunique()` method to output how many unique values exist for a column.

In [None]:
df.residence_type.nunique()
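One detail worth knowing is how the two methods treat missing values: `unique()` lists a missing value as one of the entries, while `nunique()` ignores it unless `dropna=False` is passed. A minimal sketch on made-up data (the `toy` frame is an illustrative assumption) also shows that `nunique()` can be called on a whole dataframe.

```python
import pandas as pd

# Hypothetical toy data; one residence_type value is missing
toy = pd.DataFrame({
    "residence_type": ["Urban", "Rural", "Urban", None],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
})

# unique() includes the missing value in the returned array
values = toy["residence_type"].unique()
print(values)

# nunique() on a dataframe counts distinct non-missing values per column
per_column = toy.nunique()
print(per_column)

# dropna=False counts the missing value as its own category
with_missing = toy["residence_type"].nunique(dropna=False)
print(with_missing)
```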

For some columns, we may want to change a continuous variable to a discretized value. We can accomplish this by binning the values for a column using the `cut()` or `qcut()` functions.

`cut()` allows the user to bin data for a selected column using user-defined cutoffs and labels for the bins. We will apply this function to the `age` column and create an `age_bin` column.

In [None]:
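# Bins are right-inclusive: (0,20], (20,40], (40,60], (60,inf); the open upper edge keeps ages above 80 from becoming NaN
df['age_bin'] = pd.cut(df['age'], bins=[0,20,40,60,float('inf')], labels=['<=20','21-40','41-60','>60'])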
df.value_counts(subset='age_bin')

`qcut()` is very similar, but it automatically defines the bins by calculating the `q` quantiles. The labels will be named based on the calculated bin range. We will choose 4 quantiles to mimic what we did with the previous binning.

In [None]:
df['age_bin'] = pd.qcut(df['age'], q=4)
df.value_counts(subset='age_bin')
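The difference between the two functions is easiest to see side by side. The sketch below bins a made-up series of eight ages (the `ages` values are assumptions for illustration) with fixed edges and with quartiles, and uses `retbins=True` to return the edges that `qcut()` computed.

```python
import pandas as pd

# Hypothetical ages, not taken from the lab dataset
ages = pd.Series([2, 15, 25, 35, 45, 55, 65, 82], name="age")

# cut(): fixed, user-defined edges; values outside the edges become NaN
fixed = pd.cut(ages, bins=[0, 20, 40, 60, 80])
print(fixed.value_counts(sort=False))
print("unbinned values:", fixed.isna().sum())

# qcut(): edges are chosen so each bin holds roughly the same number of values
quartiles, edges = pd.qcut(ages, q=4, retbins=True)
print(quartiles.value_counts(sort=False))
print(edges)
```

With fixed edges the bin widths are equal but the counts can differ, and the age of 82 falls outside the last edge. With quantile-based bins the counts are balanced but the bin widths depend on the data.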

# Graded Portion

---

For all three problems, you must first load the dataset "stroke_data.csv" as `df`.

## Problem 01 (5 points)

Change the values of column `Residence_type` from 'Urban' and 'Rural' to 0 and 1, respectively. Also change the values for `smoking_status` from 'formerly smoked', 'never smoked', 'smokes', and 'Unknown' to 0, 1, 2, and 3, respectively.

In [None]:
# Write your code here to answer the question

#

In [None]:
# Test the function
print(df['Residence_type'].unique().tolist()==[0,1])
print(df['smoking_status'].unique().tolist()==[0,1,2,3])

## Problem 02 (5 points)

Bin `avg_glucose_level` into 4 quantiles, label the bins as `[0,1,2,3]`, and calculate the frequency of patients for each of the 4 groups. Assign the frequency Series as a variable named `glucose_freq`.

In [None]:
# Write your code here to answer the question

#

In [None]:
# Test the function
print(glucose_freq[0]==1278)
print(glucose_freq[1]==1278)
print(glucose_freq[2]==1277)
print(glucose_freq[3]==1277)

## Problem 03 (10 points)

Calculate the mean `bmi` based on `gender`, round the mean values to 2 decimal places, then replace any `bmi` NaN values with the calculated mean `bmi` corresponding to each patient's `gender`. Modify the existing dataframe (`df`).

In [None]:
# Write your code here to answer the question

#

In [None]:
# Test the function
print(df.loc[1,'bmi']==29.07)
print(df.loc[13,'bmi']==28.65)