# Python for Data Science

## Why Python?

- Python: general-purpose programming language $\rightarrow$ large user base from diverse disciplines
    - cf. R: specializes in statistical computing/data analytics (more domain-specific)

    - Rich pool of scientific libraries with strong community support
    - Many useful libraries beyond scientific computing 
        - Natural to directly utilize scientific computing results in diverse applications such as image processing, natural language processing, gaming, etc.
    - Disciplines like machine learning: 
        - Python has become the de facto standard language in many companies, so you may be required to use it. 
        - Major deep learning frameworks such as TensorFlow and PyTorch provide Python interfaces.

### Scientific Python (SciPy) Ecosystem

Libraries are an important characteristic of how Python works: each application area has its own libraries. For scientific computing, one can combine the following packages as needed. 




| Package | Description | Logo |
|:---:|:---|:---:|
| __NumPy__ | Numerical arrays | <img src="https://numpy.org/doc/stable/_static/numpylogo.svg" width="300"/> |
| SciPy | User-friendly and efficient numerical routines:<br> numerical integration, interpolation, optimization, linear algebra, and statistics | <img src="https://scipy.org/images/logo.svg" width="200"/> |
| __Jupyter Notebook__ | Interactive programming environment | <img src="https://jupyter.org/assets/homepage/main-logo.svg" width="200"/> |
| __Matplotlib__ | Plotting | <img src="https://matplotlib.org/_static/logo2_compressed.svg" width="300"/>|
| __Pandas__ | Data analytics <br> R-like data frames | <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1024px-Pandas_logo.svg.png" width="300"/> |
| Scikit-learn | Machine learning | <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/2880px-Scikit_learn_logo_small.svg.png" width="300"/> | 

All the packages above are included in the [Anaconda distribution](https://www.anaconda.com/products/distribution) of Python. If you download Anaconda, it comes with all the useful scientific programming packages. Packages used in this workshop are in bold.




In [None]:
import IPython
print(IPython.sys_info())

## Before we move on...

We have limited resources for each user on the cloud. Don't forget to shut down the kernels you are not using. 

![](shutdown_kernel.png)

## Pandas

Pandas is a Python library for working with datasets. It provides a data frame structure similar to the one in R. 

In [None]:
# load the pandas library
import pandas as pd
# we also load numpy for array computation for convenience
import numpy as np
import platform
mimic_path =  "/Users/huazhou/Desktop/mimic-iv-1.0" if platform.uname().node == "RELMDOMFAC30610" else "/home/shared/1.0"

### Loading data

Let's load the `icustays.csv.gz` file as a pandas data frame. We need to specify in advance which columns hold date-time values.

In [None]:
icustays_df = pd.read_csv(mimic_path + "/icu/icustays.csv.gz", parse_dates=["intime", "outtime"])
print(icustays_df)

You may press `Shift-Tab` to see the function documentation interactively, or type `?pd.read_csv` in a code cell. 
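
To see what `parse_dates` changes, here is a minimal sketch on a small in-memory CSV. The rows are made up for illustration (they are not MIMIC data), and `toy_csv`, `toy_icustays`, and `stay_hours` are hypothetical names.

```python
import io
import pandas as pd

# Made-up rows in the shape of icustays.csv.gz (not real MIMIC data).
toy_csv = io.StringIO(
    "stay_id,intime,outtime\n"
    "1,2180-07-23 14:00:00,2180-07-23 23:50:47\n"
    "2,2150-11-02 19:37:00,2150-11-06 17:03:17\n"
)

# Without parse_dates the two time columns would be read as plain strings.
toy_icustays = pd.read_csv(toy_csv, parse_dates=["intime", "outtime"])
print(toy_icustays.dtypes)

# Datetime columns support arithmetic, e.g. the length of each stay in hours.
stay_hours = (toy_icustays["outtime"] - toy_icustays["intime"]).dt.total_seconds() / 3600
print(stay_hours)
```

Once the columns are parsed as date-times, comparisons such as `charttime >= intime` behave chronologically rather than alphabetically.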

In [None]:
icustays_df.__class__

The variable read in is an instance of `DataFrame`. Let's talk a little bit about what this means. 

#### Note: Object-Oriented Programming

Python has built-in support for object-oriented programming (OOP). The OOP paradigm is based on "objects", which bundle data representing the properties of the object with code in the form of methods. 

- __Class__: defines the data format (attributes) and available procedures (methods) for an object. Example: `DataFrame`.
- __Object__: An instance of a class. Example: `icustays_df`. 
- __Attributes__: Properties like column names, number of rows, number of columns, etc. __syntax: `df.attribute`__
- __Methods__: "Actions" (or procedures/functions) applied to or performed on the object. __syntax: `df.method(arg1, arg2, etc.)`__
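
The four terms above can be seen on a tiny data frame. This is a sketch with made-up values; `toy_df` is a hypothetical name, not an object defined elsewhere in this workshop.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy_df = pd.DataFrame({"stay_id": [1, 2, 3], "los": [0.5, 2.0, 1.25]})

print(type(toy_df))          # the class of the object: a pandas DataFrame
print(toy_df.shape)          # attribute: (number of rows, number of columns)
print(list(toy_df.columns))  # attribute: column labels
print(toy_df.head(2))        # method: called with parentheses and arguments
```

Attributes are read without parentheses, while methods are called with parentheses, possibly with arguments.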

Now `admissions.csv.gz`:

In [None]:
admissions_df = pd.read_csv(mimic_path + "/core/admissions.csv.gz",
                           parse_dates = ["admittime", "dischtime", "deathtime", "edregtime", "edouttime"])
print(admissions_df)

And `patients.csv.gz`:

In [None]:
patients_df = pd.read_csv(mimic_path + "/core/patients.csv.gz")
print(patients_df)

For `chartevents_filtered_itemid.csv.gz`, we learn how to read in only selected columns.

In [None]:
from timeit import default_timer as timer

In [None]:
start = timer()
chartevents_df = pd.read_csv(
    mimic_path + "/icu/chartevents_filtered_itemid.csv.gz",
    usecols = ["stay_id", "itemid", "charttime", "valuenum"],
    dtype = {"stay_id" : np.float64, "itemid" : np.float64, "charttime" : "str", "valuenum" : np.float64},
    parse_dates = ["charttime"]
    )
end = timer()
print("Elapsed time: ", end - start)

In [None]:
print(chartevents_df)
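
The effect of `usecols` and `dtype` can be checked on a small in-memory CSV. The sketch below assumes made-up rows and an extra, hypothetical `warning` column that we choose not to read.

```python
import io
import pandas as pd

# Made-up CSV with one extra column ("warning") that we do not need.
toy_csv = io.StringIO(
    "stay_id,itemid,charttime,valuenum,warning\n"
    "1,220045,2180-07-23 14:00:00,86,0\n"
    "1,223761,2180-07-23 14:11:00,98.6,0\n"
)

toy_chart = pd.read_csv(
    toy_csv,
    usecols=["stay_id", "itemid", "charttime", "valuenum"],  # skip the rest
    dtype={"stay_id": "float64", "itemid": "float64", "valuenum": "float64"},
    parse_dates=["charttime"],
)
print(toy_chart.dtypes)
```

Reading fewer columns and fixing their types up front usually reduces both memory use and loading time for large files.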

For filtering, we can use the `query` method. 

In [None]:
chartevents_df.query("stay_id == 30600691 and itemid == 220045")
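
For comparison, here is a sketch of `query` next to the equivalent boolean-mask filter, using made-up measurements (`toy_chart` is a hypothetical name).

```python
import pandas as pd

# Made-up measurements; 220045 is the heart-rate itemid used in this section.
toy_chart = pd.DataFrame({
    "stay_id": [1.0, 1.0, 2.0, 2.0],
    "itemid": [220045.0, 223761.0, 220045.0, 220045.0],
    "valuenum": [86.0, 98.6, 72.0, 75.0],
})

# query() takes the condition as a string that refers to column names.
by_query = toy_chart.query("stay_id == 2 and itemid == 220045")

# The equivalent boolean-mask form.
by_mask = toy_chart[(toy_chart["stay_id"] == 2) & (toy_chart["itemid"] == 220045)]

print(by_query)
print(by_query.equals(by_mask))
```

The `query` form is easier to place inside a method chain because it does not need to refer to the data frame by name.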

And for plotting, we use the package `matplotlib`.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
(
    chartevents_df.query("stay_id == 30600691 and itemid == 220045").
        plot.scatter(x="charttime", y="valuenum")
)

#### Note: Method Chaining
One may use method chaining to linearize method calls, as above. Because the dot operator (`.`) is evaluated from left to right, one may "chain" another method call or attribute access right after obtaining the result of the previous one. This is a "pythonic" way of avoiding nested calls.

One limitation is that only methods or attributes of the object can be chained. Here, `print()` is not a method of `DataFrame`, so we cannot chain `print` as we did with the pipe in R. One may use the [`pandas.DataFrame.pipe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html) method for such operations. 
There are also packages that implement a pipe by overloading another operator (e.g., the bitwise or (`|`) operator). 
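
As a sketch of how `pipe()` lets an ordinary function take part in a chain, consider the following. The helper `show` and the data are hypothetical and made up for illustration.

```python
import pandas as pd

toy_chart = pd.DataFrame({  # made-up values
    "stay_id": [1, 1, 2],
    "valuenum": [86.0, 90.0, 72.0],
})

def show(df):
    """Hypothetical helper: print a data frame and pass it along unchanged."""
    print(df)
    return df

result = (
    toy_chart.
    query("stay_id == 1").
    pipe(show).  # calls show() on the filtered frame
    assign(doubled=lambda df: df["valuenum"] * 2)
)
print(result)
```

Because `show` returns its input, the chain can continue after the intermediate result has been printed.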

## Target cohort (from R section)

Let's continue on with the task we did with R. We aim to develop a predictive model, which computes the chance of dying within 30 days of ICU stay `intime` based on baseline features  
- `first_careunit`  
- `age` at `intime`  
- `gender`  
- `ethnicity`  
- first measurement of the following vitals since ICU stay `intime`  
    - 220045 for heart rate   
    - 223761 for Temperature Fahrenheit  


We restrict to the first ICU stays of each unique patient. 

## Wrangling and merging data frames

Our strategy is

1. Identify and keep the first ICU stay of each patient. 

2. Identify and keep the first vital measurements during the first ICU stay of each patient.

3. Join four data frames into a single data frame.

Important data wrangling concepts: group_by, sort, slice, joins, and pivot.

### Step 1: restrict to the first ICU stay of each patient

`icustays_df` has 76,540 rows. Keeping only the first ICU stay of each patient reduces it to 53,150 rows.

In [None]:
icustays_df_1ststay = (icustays_df.sort_values(["subject_id", "intime"]).
                       groupby("subject_id").
                       head(1)) # head() is much faster than slice_head(n) in dplyr
print(icustays_df_1ststay)
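
The sort, group, and `head(1)` pattern is easier to verify on a few rows. The sketch below uses made-up stays; `toy_stays` and `first_stays` are hypothetical names.

```python
import pandas as pd

toy_stays = pd.DataFrame({  # made-up rows
    "subject_id": [10, 10, 11, 12, 12],
    "stay_id": [101, 102, 103, 104, 105],
    "intime": pd.to_datetime([
        "2180-07-23", "2180-01-05", "2150-11-02", "2171-03-01", "2171-09-15",
    ]),
})

first_stays = (
    toy_stays.
    sort_values(["subject_id", "intime"]).  # earliest stay first within each patient
    groupby("subject_id").
    head(1)                                 # keep the first row of every group
)
print(first_stays)
```

Sorting before grouping matters: `head(1)` keeps whichever row comes first in each group, so the rows must already be in time order.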

### Step 2: restrict to the first vital measurements during the ICU stay

Key data wrangling concepts: selecting columns, left_join, right_join, group_by, sort, pivot

In [None]:
chartevents_df_1ststay = (
    chartevents_df.
    merge(
        icustays_df_1ststay[["stay_id", "intime", "outtime"]],
        how = "right",
        on = "stay_id"). # 15738363 rows
    query("charttime >= intime and charttime <= outtime"). # 15700234 rows
    sort_values(["stay_id", "itemid", "charttime"]).
    groupby(["stay_id", "itemid"]).
    head(1). # 263332 rows
    drop(["charttime", "intime", "outtime"], axis="columns"). # remove unnecessary columns
    astype({"itemid": str}). # change it to string for easier renaming
    pivot(index="stay_id", columns="itemid", values="valuenum").
    rename(columns={"220045.0": "heart_rate",
            "223761.0": "temp_f"})
)

In [None]:
chartevents_df_1ststay
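
The last three steps of the chain (`astype`, `pivot`, `rename`) reshape the data from long to wide format. Here is a sketch on made-up values, assuming `itemid` was read as a float as in the cell that loads `chartevents_df`; `toy_long` and `toy_wide` are hypothetical names.

```python
import pandas as pd

toy_long = pd.DataFrame({  # made-up first measurements, one row per (stay, item)
    "stay_id": [1.0, 1.0, 2.0],
    "itemid": [220045.0, 223761.0, 220045.0],
    "valuenum": [86.0, 98.6, 72.0],
})

toy_wide = (
    toy_long.
    astype({"itemid": str}).  # the float 220045.0 becomes the string "220045.0"
    pivot(index="stay_id", columns="itemid", values="valuenum").
    rename(columns={"220045.0": "heart_rate", "223761.0": "temp_f"})
)
print(toy_wide)
```

This is why the `rename` keys carry a trailing `.0`. A stay without a given measurement gets a missing value in the corresponding column.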

### Step 3: merge DataFrames

New data wrangling concept: mutate. Pandas equivalent is `assign`. 

In [None]:
mimic_icu_cohort = (
    icustays_df_1ststay.
    # merge dataframes
    merge(admissions_df, on=["subject_id", "hadm_id"], how="left").
    merge(patients_df, on=["subject_id"], how="left").
    merge(chartevents_df_1ststay, on=["stay_id"], how="left").
    # age_intime is the age at the ICU stay intime
    assign(age_intime = lambda df: 
           df["anchor_age"] + df["intime"].dt.year - df["anchor_year"]).
    # seconds from ICU stay intime to death (infinity if there is no recorded death time)
    assign(hadm_to_death = lambda df: 
           np.where(df["deathtime"].isna(), 
                    np.inf, 
                    (df["deathtime"] - df["intime"]).dt.total_seconds())).
    # whether the patient died within 30 days (30 * 24 * 3600 = 2592000 seconds) of ICU stay intime
    assign(thirty_day_mort = lambda df: df["hadm_to_death"] <= 2592000)   
)

In [None]:
mimic_icu_cohort
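
The 30-day mortality flag can be checked on three made-up patients: one who died after 9 days, one who died after about two months, and one with no recorded death time. This is a sketch; `toy`, `to_death`, and `seconds_in_30_days` are hypothetical names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({  # made-up times; a missing deathtime means no recorded death
    "intime": pd.to_datetime(["2180-01-01", "2180-01-01", "2180-01-01"]),
    "deathtime": pd.to_datetime(["2180-01-10", "2180-03-01", None]),
})

seconds_in_30_days = 30 * 24 * 60 * 60

toy = (
    toy.
    assign(to_death=lambda df: np.where(
        df["deathtime"].isna(),
        np.inf,  # no recorded death: never within the cutoff
        (df["deathtime"] - df["intime"]).dt.total_seconds(),
    )).
    assign(thirty_day_mort=lambda df: df["to_death"] <= seconds_in_30_days)
)
print(toy)
```

Using infinity for patients without a death time keeps the final comparison a single expression, since infinity never falls below the cutoff.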

## Data visualization


Remember we want to model: 

thirty_day_mort ~ first_careunit + age_intime + gender + ethnicity + heart_rate + temp_f

Let's start with a numerical summary of variables of interest.

For a numerical column, the method `describe()` reports the mean, standard deviation, and quartiles. For a categorical column, it reports the number of unique values, the most frequent value, and its frequency. 

In [None]:
(
    mimic_icu_cohort[["first_careunit", 
        "gender", 
        "ethnicity", 
        "age_intime", 
        "heart_rate", 
        "temp_f"]].
    describe(include="all")
)


Do you spot anything unusual?

To obtain the count of each value in a categorical column, we use the `value_counts()` method.

In [None]:
mimic_icu_cohort["first_careunit"].value_counts()

In [None]:
mimic_icu_cohort["gender"].value_counts()

In [None]:
mimic_icu_cohort["ethnicity"].value_counts()

### Univariate summaries

Before we start, let's import the `seaborn` package for styling the figures a little bit (looking like ggplot2). This package is for statistical data visualization.

In [None]:
import seaborn as sns
sns.set()

Bar plot of `first_careunit`

In [None]:
mimic_icu_cohort["first_careunit"].value_counts(sort=False).plot.bar()

Histogram and boxplot of `age_intime`

In [None]:
mimic_icu_cohort["age_intime"].plot.hist()

In [None]:
mimic_icu_cohort["age_intime"].plot.box()

#### Exercises

1. Summarize discrete variables: `gender`, `ethnicity`.  
2. Summarize continuous variables: `heart_rate`, `temp_f`.
3. Is there anything unusual about `temp_f`?

### Bivariate summaries

Tally of `thirty_day_mort` vs `first_careunit`. 

We need to be a little more verbose for plotting frequencies in stacked barplot in Python. 

In [None]:
(mimic_icu_cohort.
     groupby("first_careunit")["thirty_day_mort"].value_counts().
     unstack("thirty_day_mort").iloc[:, ::-1]. # reversing column order to make True come first
     plot.bar(stacked=True)
)

In [None]:
(mimic_icu_cohort.
     groupby("first_careunit")["thirty_day_mort"].value_counts(normalize=True).
     unstack("thirty_day_mort").iloc[:, ::-1].
     plot.bar(stacked=True)
)
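
To see the table that `plot.bar(stacked=True)` receives, here is a sketch of the same `groupby`, `value_counts`, and `unstack` steps on a made-up cohort of five patients (`toy` and `counts` are hypothetical names).

```python
import pandas as pd

toy = pd.DataFrame({  # made-up cohort
    "first_careunit": ["MICU", "MICU", "MICU", "SICU", "SICU"],
    "thirty_day_mort": [True, False, False, False, True],
})

counts = (
    toy.
    groupby("first_careunit")["thirty_day_mort"].value_counts().  # long: one row per (unit, outcome)
    unstack("thirty_day_mort").                                   # wide: outcomes become columns
    iloc[:, ::-1]                                                 # reverse the column order
)
print(counts)
```

Each row of the wide table becomes one bar, and each column becomes one stacked segment of that bar.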

Tally of `thirty_day_mort` vs `gender`

In [None]:
(mimic_icu_cohort.
     groupby("gender")["thirty_day_mort"].value_counts(normalize=False).
     unstack("thirty_day_mort").iloc[:, ::-1].
     plot.bar(stacked=True)
)

In [None]:
(mimic_icu_cohort.
     groupby("gender")["thirty_day_mort"].value_counts(normalize=True).
     unstack("thirty_day_mort").iloc[:, ::-1].
     plot.bar(stacked=True)
)

In [None]:
mimic_icu_cohort.groupby("gender")["thirty_day_mort"].value_counts(normalize=True).unstack("thirty_day_mort")

#### Exercises

1. Graphical summaries of `thirty_day_mort` vs other predictors.

## Pros and Cons of Python for Data Science


__Pros__: 
- General purpose $\rightarrow$ data analysis results can be used directly in applications from other disciplines
- Wide user base $\rightarrow$ rich package ecosystem
- Readable syntax, and fast when the heavy computation is delegated to compiled libraries such as NumPy

__Cons__: 
- Fewer statistical analysis packages than R
- Libraries may be harder to understand
- Visualization code is often more verbose than in R