# Intro to Programming II: Dataframes & Sequencing

Last lecture, we discussed **variables, operators, functions,** and **modules**, as well as five different data types.

Today, we will be focusing on one data type in particular: **booleans**. We will use booleans to handle real scientific data (survey data & sequencing data) and perform preliminary dataset analysis. In this lecture, we will:

* Introduce conditional statements and data structures
* Familiarize ourselves with Pandas Dataframes
* Analyse and plot real scientific data

### 1 - Conditionals, briefly

**Conditional statements** (if/elif/else) allow you to run or not run a line of code based on a boolean value.

In [None]:
if True:
  print("True!")

True!


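As with function definitions, indentation matters: the indented lines beneath the `if` are the ones that run only when the condition is true. The same rule applies to every Python statement whose first line ends in a colon (`if`, `for`, `while`, `def`, and so on).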
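To show how `if`, `elif`, and `else` fit together, here is a minimal sketch. The function name `describe_temperature` and its thresholds are made up for illustration.

```python
# Hypothetical example: the function name and thresholds are illustrative only.
def describe_temperature(celsius):
    # Conditions are checked top to bottom; only the first True branch runs.
    if celsius < 0:
        return "freezing"
    elif celsius < 20:
        return "cool"
    else:
        return "warm"

print(describe_temperature(-5))
print(describe_temperature(12))
print(describe_temperature(30))
```

Because the branches are tested in order, the `elif` only needs to check the upper bound: any value reaching it is already known to be 0 or higher.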

In [None]:
# Write a function that tells you (with printed text) whether or not you can vote
# (based on age).
# Also, practice leaving helpful comments!
def can_vote(age):


In [None]:
can_vote(20)


In [None]:
can_vote(16)
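If you get stuck on the exercise, one possible solution might look like the sketch below. The threshold of 18 and the wording of the messages are assumptions, so adjust them as needed.

```python
# One possible solution. The threshold of 18 is an assumption; change it
# to match the voting age where you live.
VOTING_AGE = 18

def can_vote(age):
    # age >= VOTING_AGE evaluates to a boolean, which the if statement tests.
    if age >= VOTING_AGE:
        print("You can vote!")
    else:
        print("You cannot vote yet.")

can_vote(20)
can_vote(16)
```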

### 2 - Data Structures

When working with a large amount of data, tracking every value in its own variable runs into two problems:
1. Inefficiency in variable assignment (e.g. weather forecasting)
2. Size inflexibility of the program (e.g. office birthdays)

To solve these problems, we frequently use **data structures**, flexible "containers" which can be referred to by a single variable name.

Lists are a common type of data structure. They are 1-dimensional, have an order, and can be any length.

Data structures come with additional built-in methods in Python. The built-in methods for lists can be found [here](https://www.w3schools.com/python/python_lists_methods.asp).
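As a quick illustration of list methods, the sketch below grows a list and then queries it. The variable name `office_birthmonths` and its values are placeholders.

```python
# Illustrative data; the variable name and months are placeholders.
office_birthmonths = ["April", "March", "April"]

# append() adds one element to the end, so the list grows as the office does.
office_birthmonths.append("May")

# len() reports how many elements the list holds.
print(len(office_birthmonths))

# count() reports how many times a value appears.
print(office_birthmonths.count("April"))
```

Note that `append()` and `count()` are methods called on the list itself, while `len()` is a general built-in function that takes the list as an argument.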

In [None]:
programmers = ["Chris","Elijah","Luisa","Keisuke","Benson"]
birthmonths = ["April","March","April","May","April"]

Individual list elements can be accessed with `list_name[index]`, where `index` is the element's position counting from 0. The same notation lets you assign a single element without reassigning the whole list. This is called **indexing**.

In [None]:
# Predict the result of running this line of code.
programmers[3]

'Keisuke'
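The sketch below shows a few more indexing cases, including assignment to a single element. The replacement name "Dana" is made up for illustration.

```python
programmers = ["Chris", "Elijah", "Luisa", "Keisuke", "Benson"]

# Indexing starts at 0, so index 0 is the first element.
print(programmers[0])   # Chris

# Negative indices count back from the end of the list.
print(programmers[-1])  # Benson

# Assigning to an index replaces one element and leaves the rest alone.
# "Dana" is a made-up name used only for illustration.
programmers[1] = "Dana"
print(programmers)
```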

### 3 - Pandas Dataframes

**Pandas Dataframes** provide a flexible 2-dimensional data structure (like a spreadsheet!), as well as many useful functions for manipulating that data. We'll be using them to hold our RNA sequencing data.

The documentation for `pandas.DataFrame` can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Always read the documentation to get a sense of what you can and can't do with a module.

In [None]:
# First, load the pandas module.
import pandas as pd

We'll be using [this dataset](https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data), sourced from Kaggle.com. (Kaggle is a great source of practice data for anyone interested in improving their data analysis skills.)

Before working on the data, take a moment to read over the About section.

In [None]:
# Then, make the dataframe.
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')


In [None]:
# Use df.head() to visually assess the data. ALWAYS check your data!
# Note that the functions "belong" to df.
df.head()

# Return to the Pandas Dataframe documentation. What parameters does head() have?
# What happens if you mess around with them?

In [None]:
# You can index a dataframe using square brackets, just like with lists.
df['age']

# You can also select a column with dataframe_name.column_name. The two notations
# return the same column, so use whichever appeals to you. NOTE: Dot notation
# doesn't work if the column name contains spaces or matches an existing
# dataframe method (for example, a column named "count").
df.age
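To see the difference between the two notations without needing the CSV file, here is a small sketch on a toy dataframe. The values are made up, and the column name "serum sodium" deliberately contains a space.

```python
import pandas as pd

# Toy data with made-up values; "serum sodium" deliberately contains a space.
toy = pd.DataFrame({"age": [50, 65, 72], "serum sodium": [137, 130, 141]})

# Both notations return the same column when the name is a valid identifier.
print(toy["age"].equals(toy.age))

# A name with a space only works with square brackets.
print(toy["serum sodium"].tolist())

# A list of names inside the brackets selects several columns at once.
print(toy[["age", "serum sodium"]].shape)
```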

In [None]:
# You can also filter rows with a boolean column: df.age==50 is True or False
# for each row, and only the rows marked True are kept.
df[df.age==50]
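The filtering step can be broken into its two parts, as in the sketch below. It uses a toy dataframe with made-up values in place of the real dataset.

```python
import pandas as pd

# Toy data with made-up values, standing in for the real dataset.
toy = pd.DataFrame({"age": [50, 65, 50, 72], "smoking": [0, 1, 1, 0]})

# The comparison yields a boolean Series, one True/False per row.
is_fifty = toy["age"] == 50
print(is_fifty.tolist())

# Indexing with that Series keeps only the rows marked True.
print(len(toy[is_fifty]))

# Combine conditions with & (and) or | (or); each needs its own parentheses.
older_smokers = toy[(toy["age"] > 60) & (toy["smoking"] == 1)]
print(older_smokers["age"].tolist())
```

The parentheses around each condition matter because `&` binds more tightly than `>` and `==` in Python.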

Pandas Dataframes also have built-in plotting functions! Below, write code that generates a histogram of patient ages.
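One way to approach the histogram is sketched here. The ages are made up so the example runs without the CSV file, and the bin count and labels are arbitrary choices.

```python
import pandas as pd

# Toy ages (made up); in the lecture notebook you would use df["age"] instead.
ages = pd.Series([42, 50, 50, 55, 60, 60, 61, 65, 70, 75, 80, 95], name="age")

# Series.plot.hist() draws a histogram with Matplotlib and returns the Axes.
ax = ages.plot.hist(bins=6, edgecolor="black")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Number of patients")
ax.set_title("Patient ages")
```

Try changing `bins` to see how the apparent shape of the distribution depends on that choice.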

Using the Pandas Dataframe documentation, you can already answer a wide variety of scientific questions:
* How many patients are in the dataset?
* How many patients were older than a chosen age, such as 60? What was the average age?
* How many patients died during the follow-up period?
* How many deceased patients had hypertension? How does this compare to non-deceased patients?
* Does the data suggest a correlation between hypertension and death from heart failure? What about smoking? Sex? Diabetes?
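A few of these questions can be sketched on a toy dataframe as follows. The values are made up, and the column names `high_blood_pressure` and `DEATH_EVENT` are assumptions based on the Kaggle dataset, so check `df.columns` before relying on them.

```python
import pandas as pd

# Toy stand-in for the heart failure data. The values are made up, and the
# column names are assumptions based on the Kaggle dataset; check df.columns.
toy = pd.DataFrame({
    "age": [45, 50, 60, 65, 70, 80],
    "high_blood_pressure": [0, 1, 0, 1, 1, 0],
    "DEATH_EVENT": [0, 0, 1, 0, 1, 1],
})

n_patients = len(toy)                  # number of rows
mean_age = toy["age"].mean()           # average age
n_over_60 = (toy["age"] > 60).sum()    # True counts as 1 when summed
n_deaths = toy["DEATH_EVENT"].sum()    # 1 marks a death during follow-up

# Fraction of patients with high blood pressure, split by outcome.
hbp_by_outcome = toy.groupby("DEATH_EVENT")["high_blood_pressure"].mean()

print(n_patients, mean_age, n_over_60, n_deaths)
print(hbp_by_outcome)
```

Comparing group fractions like this only hints at an association; it is not a statistical test.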

If you want another dataset to practice with on your own, [this dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data) investigates comorbidities of stroke instead of heart failure.

### 4 - RNA Sequencing Data

We're sourcing our RNA sequencing data from [another Kaggle dataset](https://www.kaggle.com/datasets/usharengaraju/indian-women-in-defense/data). Again, take a moment to read over and familiarize yourself with the About section.

In [None]:
metadata = pd.read_csv('airway_metadata.csv', index_col=0)
seq_data = pd.read_csv('airway_scaledcounts.csv', index_col=0)

In [None]:
# Always start by checking your data!
seq_data.head(n=10)

In [None]:
metadata

Part of the preprocessing for differential gene expression analysis requires removing all-zero rows, that is, genes with zero counts in every sample. Why?

For practice, try using the Pandas Dataframes documentation to remove every all-zero row from the raw sequencing data.
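If you want to check your approach, one possible way to drop rows in which every count is zero is sketched below. The gene and sample names are made up, and there are other valid solutions.

```python
import pandas as pd

# Toy count matrix: rows are genes, columns are samples. All names and
# values are made up for illustration.
counts = pd.DataFrame(
    {"sample_A": [10, 0, 3, 0], "sample_B": [7, 0, 0, 0]},
    index=["gene1", "gene2", "gene3", "gene4"],
)

# (counts != 0) is a boolean DataFrame; any(axis=1) is True for a row
# when at least one sample has a non-zero count.
has_reads = (counts != 0).any(axis=1)

# Keep only the genes with at least one read.
filtered = counts[has_reads]
print(filtered.index.tolist())
```

Note that `gene3` is kept: it has reads in one sample, so only rows that are zero everywhere are removed.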

### 5 - Plotting Data

We've preprocessed the data (performed differential gene expression analysis) for you. In this section, we'll use another module, Matplotlib, to generate plots from this data.

In [None]:
# You can find the preprocessed data in the course GitHub repository
dgea = pd.read_csv('results_edgeR.csv', index_col=0)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.1)
y = np.sin(x)
plt.plot(x, y)

In [None]:
# Volcano Plot


In [None]:
# Heatmap
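A heatmap shows a table of values as a grid of coloured cells. The sketch below uses synthetic expression values with made-up gene and sample names. It relies on Matplotlib's `imshow`, which is only one of several ways to draw a heatmap.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic expression values: rows are genes, columns are samples.
rng = np.random.default_rng(0)
expr = pd.DataFrame(
    rng.normal(0, 1, (6, 4)),
    index=[f"gene{i}" for i in range(1, 7)],
    columns=[f"sample{j}" for j in range(1, 5)],
)

fig, ax = plt.subplots()
# imshow draws one coloured cell per value in the table.
image = ax.imshow(expr.to_numpy(), cmap="viridis", aspect="auto")
# Label the ticks with the gene and sample names.
ax.set_xticks(range(expr.shape[1]))
ax.set_xticklabels(expr.columns)
ax.set_yticks(range(expr.shape[0]))
ax.set_yticklabels(expr.index)
fig.colorbar(image, ax=ax, label="Expression (arbitrary units)")
```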
