# Intro to Programming II: Dataframes & Sequencing

Last lecture, we discussed **variables, operators, functions,** and **modules**.

Today, we will be handling real scientific data (survey data & sequencing data) and performing preliminary dataset analysis. We will:

* Introduce conditional statements and data structures
* Familiarize ourselves with Pandas Dataframes
* Download and play with real scientific data

### 1 - Conditionals

**Conditional statements** (if/elif/else) let you run or skip a block of code depending on a boolean value.

In [None]:
if True:
  print("The boolean value is True")
else:
  print("The boolean value is False.")

As with function definitions, indentation matters. The same is true of every Python statement whose first line ends with a colon (`if`, `else`, `for`, `while`, `def`, and so on).
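
The first cell above only has two branches. As a minimal sketch of a full if/elif/else chain, the example below classifies a made-up temperature reading; the variable names and thresholds are assumptions chosen purely for illustration.

```python
# Hypothetical example: classify a temperature reading (degrees Celsius).
temperature = 31

if temperature > 30:
    label = "hot"
elif temperature > 15:
    # Only checked when the first condition was False.
    label = "mild"
else:
    # Runs when none of the conditions above were True.
    label = "cold"

print(label)  # prints "hot"
```

Python checks the conditions from top to bottom and runs only the first branch whose condition is True.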

In [None]:
x = -5

# Write code that prints "negative" if x is a negative number.
# code goes here

* Define a function (remember those?) that takes an integer as an input and identifies it as negative, positive, or zero.

### 2 - Data Structures

When working with a large amount of data, there are two problems we can run into when using variables to track data:
1. Inefficiency in variable assignment (ex. weather forecasting)
2. Size inflexibility of the program (ex. office birthdays)

To solve these problems, we frequently use **data structures**, flexible "containers" which can be referred to by a single variable name.
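
As a rough illustration of both problems, here is a sketch using made-up daily temperatures (the values and names are invented for this example). With separate variables, every new day needs a new name and a change to the formula; with a list, the same code works for any number of days.

```python
# Without a data structure: one variable per value.
temp_mon = 21.5
temp_tue = 23.0
temp_wed = 19.5
average_from_variables = (temp_mon + temp_tue + temp_wed) / 3

# With a list: one name holds every value, however many there are.
temperatures = [21.5, 23.0, 19.5]
average_from_list = sum(temperatures) / len(temperatures)

print(average_from_variables, average_from_list)
```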

In [None]:
# Lists are a common type of data structure. They are 1-dimensional and can be
# any length.
programmers =
birthmonths =

In [None]:
# What type are these lists? Hint: use type().


In [None]:
# Individual list elements can be accessed with list_name[index], counting from
# 0. You can also use this to assign individual list elements without
# reassigning the whole list. This is called indexing.

# Predict the result of running this line of code.
programmers[3]
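
The lists above are left blank here, so the sketch below demonstrates indexing on a separate, made-up list (`colors` is a hypothetical name). Python counts positions from 0, so index 3 refers to the fourth element.

```python
colors = ["red", "green", "blue", "yellow"]

first = colors[0]    # "red": positions start at 0
fourth = colors[3]   # "yellow": index 3 is the fourth element
last = colors[-1]    # "yellow": negative indices count from the end

# Assign a single element without rebuilding the whole list.
colors[1] = "purple"
print(colors)  # ['red', 'purple', 'blue', 'yellow']
```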

Data structures, like other Python objects, come with built-in methods. The built-in methods for lists can be found [here](https://www.w3schools.com/python/python_lists_methods.asp).
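
As a quick sketch of how list methods are called, the example below uses three of them on a made-up list of lab reagents (the list and its contents are hypothetical). A method is written after the list name and a dot.

```python
# Hypothetical data: reagents on a lab shelf.
reagents = ["ethanol", "agarose", "ethanol"]

reagents.append("glycerol")            # add one item to the end
n_ethanol = reagents.count("ethanol")  # how many times a value appears (2)
reagents.remove("glycerol")            # delete the first matching item

print(reagents, n_ethanol)
```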

Below:
* Add a programmer to the list. (His name is Benson. He was born in April.)
* Count the number of programmers born in April. (Bonus: Use the built-in list methods to print the name of each programmer born in April.)
* Take Benson off the programmer list.

### 3 - Pandas Dataframes

**Pandas Dataframes** provide a flexible 2-dimensional data structure, as well as many useful functions for manipulating that data. We'll use them to hold our RNA sequencing data.

The documentation for `pandas.DataFrame` can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Always read the documentation to get a sense of what you can and can't do with a module.

In [5]:
# First, load the pandas module.
import pandas as pd

In [35]:
# Next, we need some data for the dataframe.
data = {
  'name': ['Aatrox', 'Ahri', 'Akali', 'Akshan', 'Alistar', 'Amumu'],
  'hp': [650, 590, 600, 630, 685, 685],
  'winrate': [48.86, 50.14, 49.66, 51.41, 49.99, 52.34]
}

You may notice I haven't explained the curly brackets above. That's actually *another* data structure called a dictionary. You can learn more about dictionaries and other data structures [here](https://docs.python.org/3/tutorial/datastructures.html#dictionaries).
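
For a quick sense of how a dictionary behaves, here is a minimal sketch. The `champion` dictionary is a made-up example that reuses one row of the values above; a dictionary stores key/value pairs, and you look up a value by its key rather than by its position.

```python
# Hypothetical dictionary: each key maps to one value.
champion = {"name": "Ahri", "hp": 590}

print(champion["hp"])         # look up a value by its key: 590
champion["winrate"] = 50.14   # add a new key/value pair
print(list(champion.keys()))  # ['name', 'hp', 'winrate']
```

In the `data` dictionary above, each key maps to a list, and pandas turns each key into a column name.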

For brevity's sake, we'll move on to defining our dataframe.

In [7]:
df = pd.DataFrame(data)
print(df)

      name   hp  winrate
0   Aatrox  650    48.86
1     Ahri  590    50.14
2    Akali  600    49.66
3   Akshan  630    51.41
4  Alistar  685    49.99
5    Amumu  685    52.34


In [46]:
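# You can select a column from a dataframe using square brackets, much like
# indexing a list, except that you pass a column name instead of a position.
df['name']

# You can also select a column with dataframe_name.column_name. Both notations
# return the same column, but the dot form only works when the column name is a
# valid Python name that doesn't clash with an existing dataframe method.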


0     Aatrox
1       Ahri
2      Akali
3     Akshan
4    Alistar
5      Amumu
Name: name, dtype: object
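
The sketch below compares the two notations on a shortened copy of the champion data (the variable names `by_brackets`, `by_attribute`, and `subset` are assumptions for this example). It also shows one thing only the bracket notation can do: select several columns at once.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Aatrox", "Ahri", "Akali"],
    "hp": [650, 590, 600],
})

by_brackets = df["name"]   # works for any column name
by_attribute = df.name     # works only when the name is a valid identifier

print(by_brackets.equals(by_attribute))  # True

# Bracket notation also accepts a list of columns and returns a DataFrame.
subset = df[["name", "hp"]]
print(subset.shape)  # (3, 2)
```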

In [43]:
# You can also filter rows using a boolean column: only the rows where the
# condition is True are kept.
df[df.name=="Aatrox"]

Unnamed: 0,name,hp,winrate
0,Aatrox,650,48.86
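
To see what happens inside the square brackets, the sketch below builds the boolean column separately before using it as a filter. The `runs` dataframe is made-up data for illustration, not part of the lecture dataset.

```python
import pandas as pd

# Hypothetical sequencing runs.
runs = pd.DataFrame({
    "sample": ["A", "B", "C", "D"],
    "reads": [1200, 300, 950, 80],
    "passed_qc": [True, True, False, True],
})

mask = runs["reads"] > 500   # a boolean column with one value per row
print(mask.tolist())         # [True, False, True, False]

high = runs[mask]            # keeps the rows where the mask is True (A and C)

# Combine conditions with & (and) or | (or); wrap each comparison in parentheses.
high_and_passed = runs[(runs["reads"] > 500) & (runs["passed_qc"])]
print(high_and_passed["sample"].tolist())  # ['A']
```

Note that pandas uses `&` and `|` here instead of Python's `and` and `or`, because the comparison is made row by row.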


Using boolean columns, find...
* ...the names of all champions with a winrate above 50%.
* ...the names of all champions with more than 650 HP.
* ...the names of all champions with both a winrate above 50% and more than 650 HP.

### 4 - Dataframes, cont.

Let's familiarize ourselves further with dataframes by looking at real scientific data.

We'll be using [this dataset](https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data), sourced from Kaggle.com. (Kaggle is a great source of practice data for anyone interested in improving their data analysis skills.)

Before working on the data, take a moment to read over the About section.

In [50]:
# Load data from Github
!wget https://raw.githubusercontent.com/Pitt-IshiharaLab/CompBioHA_IntroProgramming2024/main/heart_failure_clinical_records_dataset.csv?token=GHSAT0AAAAAACSHU3KL62PBKGZZAKWIXATQZSXMXRA -O heart_failure_dataset.csv

--2024-05-29 16:31:24--  https://raw.githubusercontent.com/Pitt-IshiharaLab/CompBioHA_IntroProgramming2024/main/heart_failure_clinical_records_dataset.csv?token=GHSAT0AAAAAACSHU3KL62PBKGZZAKWIXATQZSXMXRA
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12239 (12K) [text/plain]
Saving to: ‘heart_failure_dataset.csv’


2024-05-29 16:31:24 (78.9 MB/s) - ‘heart_failure_dataset.csv’ saved [12239/12239]



In [57]:
df = pd.read_csv('heart_failure_dataset.csv')

In [55]:
# Use df.head() to visually assess the data. Note that head() is a method: it "belongs" to df.
df.head()

# Return to the Pandas Dataframe documentation. What parameters does head() have?
# What happens if you mess around with them?

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1
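
As a small sketch of the `n` parameter of `head()`, the example below uses a made-up ten-row dataframe (`numbers` is a hypothetical name) so the row counts are easy to check.

```python
import pandas as pd

numbers = pd.DataFrame({"value": range(10)})

print(len(numbers.head()))     # 5: the default is n=5
print(len(numbers.head(n=3)))  # 3
print(len(numbers.head(-2)))   # 8: a negative n leaves out the last 2 rows
print(len(numbers.tail(2)))    # 2: tail() returns rows from the end instead
```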


Using the Pandas Dataframe documentation:
* How many patients are in the dataset?
* How many patients were older? What was the average age?
* How many patients died during the follow-up period?
* How many deceased patients had hypertension? How does this compare to non-deceased patients?
* Does the data suggest a correlation between hypertension and death during follow-up? What about smoking? Sex? Diabetes?
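
The questions above can be approached with a handful of dataframe methods. The sketch below shows one possible set of tools on made-up records that only borrow three column names from the real dataset; the values are invented, so the numbers it prints say nothing about the actual patients.

```python
import pandas as pd

# Made-up records, NOT taken from the real dataset.
toy = pd.DataFrame({
    "age": [60, 72, 45, 80, 55, 68],
    "high_blood_pressure": [1, 0, 0, 1, 1, 0],
    "DEATH_EVENT": [1, 0, 0, 1, 0, 1],
})

n_patients = len(toy)                 # number of rows
mean_age = toy["age"].mean()          # average of a column
n_deaths = toy["DEATH_EVENT"].sum()   # summing a 0/1 column counts the 1s

# Fraction of deaths within each blood-pressure group (0 and 1).
death_rate = toy.groupby("high_blood_pressure")["DEATH_EVENT"].mean()

print(n_patients, mean_age, n_deaths)
print(death_rate)
```

Comparing a rate between groups, as `groupby` does here, is only a first look; it does not by itself show that one variable causes the other.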

You can practice on your own with [this dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data) (investigating comorbidities of stroke instead of heart failure). If you do, you can use this command to download `stroke-data.csv`:

`!wget https://raw.githubusercontent.com/Pitt-IshiharaLab/CompBioHA_IntroProgramming2024/main/healthcare-dataset-stroke-data.csv?token=GHSAT0AAAAAACSHU3KLHEPU4Q6W3EOQNW7KZSXNU7A -O stroke-data.csv`

### 5 - RNA Sequencing Data

We're sourcing our RNA sequencing data from [another Kaggle dataset](https://www.kaggle.com/datasets/usharengaraju/indian-women-in-defense/data). Again, take a moment to read over and familiarize yourself with the About section.

In [None]:
!wget https://raw.githubusercontent.com/Pitt-IshiharaLab/CompBioHA_IntroProgramming2024/main/airway_metadata.csv?token=GHSAT0AAAAAACSHU3KKIGQBBLP4BSCNEFEUZSXNPJQ -O rnaseq_metadata.csv
!wget https://raw.githubusercontent.com/Pitt-IshiharaLab/CompBioHA_IntroProgramming2024/main/airway_scaledcounts%202.csv?token=GHSAT0AAAAAACSHU3KLBMATLN4FMLNUEAIWZSXNONA -O rnaseq_scaledcounts.csv

In [58]:
metadata = pd.read_csv('rnaseq_metadata.csv')
seq_data = pd.read_csv('rnaseq_scaledcounts.csv')

In [61]:
# Always start by checking your data!
seq_data.head(n=10)

Unnamed: 0,ensgene,SRR1039508,SRR1039509,SRR1039512,SRR1039513,SRR1039516,SRR1039517,SRR1039520,SRR1039521
0,ENSG00000000003,723.0,486.0,904.0,445.0,1170.0,1097.0,806.0,604.0
1,ENSG00000000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ENSG00000000419,467.0,523.0,616.0,371.0,582.0,781.0,417.0,509.0
3,ENSG00000000457,347.0,258.0,364.0,237.0,318.0,447.0,330.0,324.0
4,ENSG00000000460,96.0,81.0,73.0,66.0,118.0,94.0,102.0,74.0
5,ENSG00000000938,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0
6,ENSG00000000971,3413.0,3916.0,6000.0,4308.0,6424.0,10723.0,5039.0,7803.0
7,ENSG00000001036,2328.0,1714.0,2640.0,1381.0,2165.0,2262.0,2175.0,1786.0
8,ENSG00000001084,670.0,372.0,692.0,448.0,917.0,807.0,744.0,685.0
9,ENSG00000001167,426.0,295.0,531.0,178.0,740.0,651.0,414.0,269.0


In [63]:
metadata

Unnamed: 0,id,dex,celltype,geo_id
0,SRR1039508,control,N61311,GSM1275862
1,SRR1039509,treated,N61311,GSM1275863
2,SRR1039512,control,N052611,GSM1275866
3,SRR1039513,treated,N052611,GSM1275867
4,SRR1039516,control,N080611,GSM1275870
5,SRR1039517,treated,N080611,GSM1275871
6,SRR1039520,control,N061011,GSM1275874
7,SRR1039521,treated,N061011,GSM1275875
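
The two tables are linked: each `id` in the metadata is a column name in the counts table. As a sketch of how that link can be used, the example below copies the first three genes and four samples from the outputs above and averages each gene within the control and treated groups. The variable names (`counts`, `meta`, `summary`) are assumptions for this example, and this is only one way to do it.

```python
import pandas as pd

# First three genes and four samples, copied from the outputs above.
counts = pd.DataFrame({
    "ensgene": ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"],
    "SRR1039508": [723.0, 0.0, 467.0],
    "SRR1039509": [486.0, 0.0, 523.0],
    "SRR1039512": [904.0, 0.0, 616.0],
    "SRR1039513": [445.0, 0.0, 371.0],
})
meta = pd.DataFrame({
    "id": ["SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513"],
    "dex": ["control", "treated", "control", "treated"],
})

# Look up which sample IDs belong to each condition.
control_ids = meta[meta["dex"] == "control"]["id"].tolist()
treated_ids = meta[meta["dex"] == "treated"]["id"].tolist()

# Average each gene (row) across the samples in a condition.
summary = pd.DataFrame({
    "ensgene": counts["ensgene"],
    "control_mean": counts[control_ids].mean(axis=1),
    "treated_mean": counts[treated_ids].mean(axis=1),
})
print(summary)
```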


In [87]:
# Remove rows with all zeros.
# Hint: Use boolean columns.
# Hint 2: The dataframe method any(axis=1) returns True for each row in which
# at least one value is True.
# Hint 3: You may want to drop() the ensgene column.
#seq_data[(seq_data.drop(columns='ensgene') != 0).any(axis=1)]

Unnamed: 0,SRR1039508,SRR1039509,SRR1039512,SRR1039513,SRR1039516,SRR1039517,SRR1039520,SRR1039521
0,723.0,486.0,904.0,445.0,1170.0,1097.0,806.0,604.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,467.0,523.0,616.0,371.0,582.0,781.0,417.0,509.0
3,347.0,258.0,364.0,237.0,318.0,447.0,330.0,324.0
4,96.0,81.0,73.0,66.0,118.0,94.0,102.0,74.0
...,...,...,...,...,...,...,...,...
38689,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38690,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38691,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38692,1.0,2.0,1.0,0.0,1.0,1.0,2.0,0.0
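
The hints above can be combined as in the sketch below, which uses a tiny made-up counts table (`toy_counts` and its gene and sample names are hypothetical) so that the result is easy to verify by eye.

```python
import pandas as pd

toy_counts = pd.DataFrame({
    "ensgene": ["gene_a", "gene_b", "gene_c"],
    "sample_1": [723.0, 0.0, 0.0],
    "sample_2": [486.0, 0.0, 2.0],
})

# Drop the text column so only numeric counts are compared with 0.
numeric = toy_counts.drop(columns="ensgene")

# True for each row that has at least one nonzero count.
has_counts = (numeric != 0).any(axis=1)

# Filter the original table, so the gene names are kept.
filtered = toy_counts[has_counts]
print(filtered["ensgene"].tolist())  # ['gene_a', 'gene_c']
```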


(Instructor note: pre-process the data and run the differential gene expression analysis ahead of time; leave links to the original data as an extension question. Use matplotlib for the most basic version of the plot.)

To visualize our dataset, we need another module: `matplotlib.pyplot`.

In [None]:
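import numpy as np
import matplotlib.pyplot as plt

# Plot sin(x) for x from 0 up to (but not including) 5, in steps of 0.1.
x = np.arange(0, 5, 0.1)
y = np.sin(x)
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.show()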
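
The same module can draw scatter plots, which is the basis of a volcano plot: log2 fold change on the x-axis against -log10 of the p-value on the y-axis. Below is a minimal sketch with made-up values for five hypothetical genes; real values would come from a differential expression analysis, which is not shown here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up fold changes and p-values for five hypothetical genes.
log2fc = np.array([-2.1, -0.3, 0.1, 1.8, 3.2])
pvalues = np.array([1e-6, 0.4, 0.8, 1e-4, 1e-9])

# Small p-values become large values, so significant genes sit high on the plot.
neg_log10_p = -np.log10(pvalues)

fig, ax = plt.subplots()
ax.scatter(log2fc, neg_log10_p)
ax.set_xlabel("log2 fold change")
ax.set_ylabel("-log10(p-value)")
ax.set_title("Toy volcano plot")
```

In a notebook the figure is displayed automatically at the end of the cell; in a plain Python script, call `plt.show()` afterwards.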

Maybe we work with bioinfokit? Admittedly, this doesn't involve all the same filtering, but it's dataframe-centered and potentially more accessible (https://www.reneshbedre.com/blog/volcano.html).

```
from bioinfokit import analys, visuz
# load dataset as pandas dataframe
df = analys.get_data('volcano').data
df.head(2)
#           GeneNames  value1  value2    log2FC       p-value
# 0  LOC_Os09g01000.1    8862   32767 -1.886539  1.250000e-55
# 1  LOC_Os12g42876.1    1099     117  3.231611  1.050000e-55

visuz.GeneExpression.volcano(df=df, lfc='log2FC', pv='p-value')
```
