# Lab Exercise 04
---

## Pandas

Pandas is a widely used Python package for data scientists and biomedical informaticians.
It enables programmers to load, manipulate, clean, and analyze large datasets.

Use the `import` keyword to load the pandas module. Use the `as` keyword to set an alias for the module.

In [None]:
import pandas as pd

## Pandas Series

A Series is one of the two main Pandas data structures.

A Series is a one-dimensional array of data, similar to a column in a data table.



In [None]:
l = [1,2,3,4]
s = pd.Series(l)
print(s)

Each item in the Series has a corresponding index. You can set the index using the `index` argument.


In [None]:
l = [1,2,3,4]
s = pd.Series(l, index=['a','b','c','d'])
print(s)
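
As a small sketch of why the index matters, the example below reads one item from the same Series in two ways: by its label with `.loc[]` and by its integer position with `.iloc[]`. Both indexers are covered in more detail in the "Selecting data" section; the variable names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

# Same Series as above, with custom string labels as the index
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

by_label = s.loc['c']      # look up by index label
by_position = s.iloc[2]    # look up by integer position (0-based)

print(by_label, by_position)
```

Both lookups point at the same item, because label `'c'` sits at position 2.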

## Pandas DataFrame

A DataFrame is a two-dimensional array of data, similar to a table with multiple rows and columns.


In [None]:
l = [[1,2,3],[2,3,4]]
df = pd.DataFrame(l)
print(df)

Both the index and column names can be defined by the user with the `index` and `columns` arguments.

In [None]:
l = [[1,2,3],[2,3,4]]
df = pd.DataFrame(l,index=['a','b'], columns=['one','two','three'])
print(df)
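
A DataFrame can also be built from a dictionary, which some people find easier to read because each column name sits next to its values. The sketch below assumes the same numbers as the list-based example; `data` and `df_from_dict` are illustrative names.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list holds that column's values
data = {'one': [1, 2], 'two': [2, 3], 'three': [3, 4]}
df_from_dict = pd.DataFrame(data, index=['a', 'b'])

print(df_from_dict)
```

Note that the list-of-lists form is organized row by row, while the dictionary form is organized column by column.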

## Load datasets

Load data using Pandas functions. Which function to use depends on the file format. CSV (comma-separated values, `.csv`) is a common format for flat data files.
Use the `read_csv()` function to load data from this file type.


In [None]:
df = pd.read_csv("diabetes.csv")
print(df)

JSON is another popular file format for large datasets. Load this data using `read_json()`.

In [None]:
df = pd.read_json("diabetes.json")
print(df)

To see what the JSON file format looks like, you can read the file in and print it to the screen. You will notice that it looks similar to a dictionary.

In [None]:
with open("diabetes.json") as f:
    print(f.read())
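
To make the resemblance to a dictionary concrete, here is a minimal sketch that does not need any file on disk. It assumes a small made-up table (`df_small`, not the real diabetes data), converts it to JSON text with `to_json()`, and parses that text with Python's standard `json` module.

```python
import json
import pandas as pd

# Hypothetical two-row table standing in for the real dataset
df_small = pd.DataFrame({'Glucose': [148, 85], 'BMI': [33.6, 26.6]})

json_text = df_small.to_json()   # with no path given, returns the JSON as a string
parsed = json.loads(json_text)   # parse the JSON text into a Python dictionary

print(json_text)
print(type(parsed))
```

With the default settings, each column name becomes a key whose value is another dictionary mapping row labels to cell values.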

## Saving dataframes

You can write DataFrames to a file using `to_csv()` or `to_json()`.

In [None]:
df = pd.read_csv("diabetes.csv")
df.to_csv("test04.csv")

In [None]:
with open("test04.csv") as f:
    print(f.read())

You can change the delimiter (the character that separates each value in the file). For CSV, the default is a comma (`,`), but you can change it to any single character using the `sep` argument.

In [None]:
df = pd.read_csv("diabetes.csv")
df.to_csv("test04.tsv", sep='\t')

In [None]:
with open("test04.tsv") as f:
    print(f.read())
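
When you print the saved files, you may notice an extra unnamed first column: by default `to_csv()` also writes the index. The sketch below, which uses a small made-up table and an in-memory string instead of a file, shows how `index=False` leaves the index out and how the same `sep` value is needed to read the data back.

```python
import io
import pandas as pd

# Hypothetical two-row table standing in for the real dataset
df_small = pd.DataFrame({'Glucose': [148, 85], 'BMI': [33.6, 26.6]})

# index=False leaves the row labels out of the output;
# with no path given, to_csv() returns the text instead of writing a file
tsv_text = df_small.to_csv(sep='\t', index=False)
print(tsv_text)

# Reading the text back requires the same separator
df_back = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(df_back)
```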

## Selecting data

View all column names and index labels using the `columns` and `index` attributes.

In [None]:
df = pd.read_csv("diabetes.csv")
df.columns

In [None]:
df = pd.read_csv("diabetes.csv")
df.index

You can select specific rows, columns, or cells in the dataframe using the `.loc[]` indexer.
First, let's select an entire row.

In [None]:
df = pd.read_csv("diabetes.csv")
df.loc[0]

Or an entire column. As when slicing a list, the colon, `:`, selects all of the items (rows in this case). The second item in the `[]` is the column name.

In [None]:
df = pd.read_csv("diabetes.csv")
df.loc[:,'Pregnancies']

We can also select multiple columns or indices.

In [None]:
df = pd.read_csv("diabetes.csv")
df.loc[:,['Pregnancies','Glucose']]

An easier way to select a column is to use the column name with square brackets.

In [None]:
df = pd.read_csv("diabetes.csv")
df['Pregnancies']

Or using the column name as an attribute.

In [None]:
df = pd.read_csv("diabetes.csv")
df.Pregnancies

Now let's select just one item/cell in the dataframe.

In [None]:
df = pd.read_csv("diabetes.csv")
df.loc[2,'Pregnancies']

The `.iloc[]` indexer is similar to `.loc[]`, but uses integer positions instead of labels for the rows and columns. To get the same value that `df.loc[2,'Pregnancies']` returned, we have to use the integer position of both the row and the column.

In [None]:
df.iloc[2,0]
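
With the default index (0, 1, 2, ...), row labels and row positions happen to coincide, which can hide the difference between the two indexers. The following sketch assumes a small made-up table with index labels 10, 20, and 30 to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2, ...
df_small = pd.DataFrame(
    {'Pregnancies': [6, 1, 8], 'Glucose': [148, 85, 183]},
    index=[10, 20, 30],
)

by_label = df_small.loc[30, 'Pregnancies']   # row labelled 30
by_position = df_small.iloc[2, 0]            # third row, first column

print(by_label, by_position)
```

Here `df_small.loc[2, 'Pregnancies']` would raise a `KeyError`, because no row carries the label 2.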

You can also select all rows where a certain condition is met.

In [None]:
df = pd.read_csv("diabetes.csv")
df.loc[(df['Pregnancies']==3)]

In [None]:
df = pd.read_csv("diabetes.csv")
df.loc[(df['BMI']>=39.9)]
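
Conditions can also be combined. As a minimal sketch on a made-up table, `&` keeps rows where both conditions hold and `|` keeps rows where at least one holds. Each condition needs its own parentheses, because `&` and `|` bind more tightly than comparisons such as `==` and `>=`.

```python
import pandas as pd

# Hypothetical four-row table standing in for the real dataset
df_small = pd.DataFrame({
    'Pregnancies': [3, 3, 1, 0],
    'BMI': [41.2, 25.0, 43.1, 22.0],
})

# Rows where both conditions hold
both = df_small.loc[(df_small['Pregnancies'] == 3) & (df_small['BMI'] >= 39.9)]

# Rows where at least one condition holds
either = df_small.loc[(df_small['Pregnancies'] == 3) | (df_small['BMI'] >= 39.9)]

print(both)
print(either)
```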

## Checking your data

You can view the first N rows of your dataframe using the `head()` method. By default, N is 5.

In [None]:
df = pd.read_csv("diabetes.csv")
df.head()

You can change how many rows are shown by passing an integer argument to `head()`.

In [None]:
df = pd.read_csv("diabetes.csv")
df.head(10)

`tail()` gives the last N rows. It takes the same argument as `head()`.

In [None]:
df = pd.read_csv("diabetes.csv")
df.tail()

In [None]:
df = pd.read_csv("diabetes.csv")
df.tail(10)

Get the length (number of rows) of the dataframe using the `len()` function.

In [None]:
df = pd.read_csv("diabetes.csv")
len(df)

The `info()` method gives in-depth information about the dataframe.

In [None]:
df = pd.read_csv("diabetes.csv")
df.info()

## Data Cleaning

We can remove missing data using the `dropna()` method. This will identify rows that have Null values and drop those rows from the dataframe.

In [None]:
df = pd.read_csv("diabetes.csv")
print(f"Original df has {len(df)} samples.")

df = df.dropna()
print(f"Cleaned df has {len(df)} samples.")

`dropna()` returns a modified copy of the dataframe, so we must reassign the variable `df` to keep the cleaned dataframe. Alternatively, use the `inplace` argument to modify the original dataframe directly.

In [None]:
df = pd.read_csv("diabetes.csv")
print(f"Original df has {len(df)} samples.")

df.dropna(inplace=True)
print(f"Cleaned df has {len(df)} samples.")
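
Before dropping anything, it can help to check how many values are actually missing. The sketch below assumes a small made-up table with two missing entries and uses `isnull().sum()` to count them per column.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value in each column
df_small = pd.DataFrame({
    'Glucose': [148, np.nan, 183],
    'BMI': [33.6, 26.6, np.nan],
})

missing_per_column = df_small.isnull().sum()   # count of missing values in each column
cleaned = df_small.dropna()                    # new dataframe; df_small is unchanged

print(missing_per_column)
print(len(df_small), len(cleaned))
```

Only the first row survives, because every other row has a missing value in at least one column.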

Rather than removing the samples (rows) with missing data, you can replace the missing values with a user-defined value using the `fillna()` method.

In [None]:
df = pd.read_csv("diabetes.csv")
df.loc[2,'BloodPressure']

In [None]:
df.fillna(120, inplace=True)
df.loc[2,'BloodPressure']
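
A single fill value is applied to every column, which is rarely what you want when the columns measure different things. As a sketch on made-up data, `fillna()` also accepts a dictionary that maps each column name to its own replacement value; filling `BMI` with the column mean is just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value in each column
df_small = pd.DataFrame({
    'BloodPressure': [72.0, np.nan, 64.0],
    'BMI': [33.6, 26.6, np.nan],
})

# A dictionary maps each column name to its own replacement value
filled = df_small.fillna({'BloodPressure': 120, 'BMI': df_small['BMI'].mean()})

print(filled)
```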

Some datasets may have redundant data, which can cause bias in our analyses. We can remove these duplicates quickly using pandas. Use the `duplicated()` method to identify the rows that are duplicates. Then use the `drop_duplicates()` method to remove the redundant rows (keeping the first instance).

In [None]:
df = pd.read_csv("diabetes.csv")
df.duplicated()

`duplicated()` returns a Series of True/False values, one per row. We can use the `drop()` method to remove the entries that are False, which leaves only the indices where duplicate rows exist.

In [None]:
df = pd.read_csv("diabetes.csv")
df_dup = df.duplicated()
for i in df_dup.index:
    if df_dup.loc[i] == False:
        df_dup.drop(i,inplace=True)
print(df_dup)
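
The same indices can also be found without a loop by using the True/False Series as a mask, and `drop_duplicates()` then removes the repeated rows. This is a minimal sketch on a made-up table in which the third row repeats the first.

```python
import pandas as pd

# Hypothetical table: row 2 is an exact copy of row 0
df_small = pd.DataFrame({
    'Pregnancies': [6, 1, 6, 1],
    'Glucose': [148, 85, 148, 90],
})

dup_mask = df_small.duplicated()          # True for each repeat of an earlier row
dup_rows = df_small.loc[dup_mask]         # keep only the rows flagged as duplicates
deduped = df_small.drop_duplicates()      # drop repeats, keeping the first instance

print(dup_rows.index.to_list())
print(deduped)
```

Rows 1 and 3 are both kept, because they differ in the `Glucose` column; a row only counts as a duplicate when every column matches.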

If you want to change any value in the dataset, you can do so by selecting the item with the `.loc[]` indexer and assigning to it like a variable.

In [None]:
df = pd.read_csv("diabetes.csv")
print(df.loc[2,'BloodPressure'])
df.loc[2,'BloodPressure'] = 120
print(df.loc[2,'BloodPressure'])

## Some other functions and methods

You can use the `max()` and `min()` methods to find the maximum and minimum values for a dataframe or a specific column.

In [None]:
df = pd.read_csv("diabetes.csv")
df.min()

In [None]:
df = pd.read_csv("diabetes.csv")
df.Pregnancies.min()

In [None]:
df = pd.read_csv("diabetes.csv")
df['Pregnancies'].max()

You can also calculate descriptive statistics quickly using `mean()`, `median()`, and `mode()`.

In [None]:
df = pd.read_csv("diabetes.csv")
df['Pregnancies'].mean()

In [None]:
df = pd.read_csv("diabetes.csv")
df['Pregnancies'].median()

In [None]:
df = pd.read_csv("diabetes.csv")
df['Pregnancies'].mode()
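
Unlike `mean()` and `median()`, which return a single number for a column, `mode()` returns a Series, because several values can tie for most frequent. The sketch below uses a small made-up Series in which 2 and 3 each appear twice.

```python
import pandas as pd

# Hypothetical values: 2 and 3 are tied as the most frequent
values = pd.Series([1, 2, 2, 3, 3, 10])

print(values.mean())     # arithmetic mean
print(values.median())   # middle value of the sorted data
print(values.mode())     # Series holding every most-frequent value
```

If you need a single number, you can pick the first entry with `values.mode().iloc[0]`.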

# Graded portion
---

## Problem 01 (5 points)

Create a dataframe, `df_animals`, with three samples (rows) and three columns. The index names are `'a','b', 'c'` and the column names are `'name','color','mammal'`. The samples should contain the following information:
- tiger, orange, True
- elephant, grey, True
- crocodile, green, False

1 point for correct indices, 1 point for correct columns, 3 points for the correct rows (1 point each).

In [None]:
# Write your code here to answer the question

#

In [None]:
# Test the result (checks the dataframe named in the problem, df_animals)
print(df_animals.index.to_list()==['a','b','c'])
print(df_animals.columns.to_list()==['name','color','mammal'])
print(df_animals.iloc[0].to_list()==['tiger','orange',True])
print(df_animals.iloc[1].to_list()==['elephant','grey',True])
print(df_animals.iloc[2].to_list()==['crocodile','green',False])

## Problem 02 (5 points)

Read in the csv file "LE04-P02-patients.csv" as a dataframe, drop rows with missing data, drop redundant rows, and save the updated dataframe as "new_patients.json".

In [None]:
# Write your code here to answer the question

#

In [None]:
# Test the result
df = pd.read_json("new_patients.json")
print(df['Pregnancies'].isnull().sum()==0)
print(df['Glucose'].isnull().sum()==0)
print(df['BMI'].isnull().sum()==0)
# duplicated() flags repeated rows; any() is True if at least one row is flagged
if df.duplicated().any():
    print("Duplicate still exists.")
else:
    print("No duplicates.")

## Problem 03 (10 points)

Read in the JSON file "LE04-P03-patients.json" and create a dictionary, `d_p3`, which has each column name as a key and a list of the min, max, mean, and median values for that column, in that order, as the corresponding value. The test values below show the mean rounded to two decimal places.

Ex. Column "Height" has a min value of 1.2, a max value of 2.0, a mean value of 1.4, and median value of 1.5 --- `d_p3 = {"Height": [1.2,2.0,1.4,1.5]}`

In [None]:
# Write your code here to answer the question

#

In [None]:
# Test the function
print(d_p3['Pregnancies']==[0, 15, 4.12, 3.0])
print(d_p3['Glucose']==[0, 197, 117.05, 110.5])
print(d_p3['BloodPressure']==[0, 122, 68.12, 71.0])
print(d_p3['SkinThickness']==[0, 60, 20.18, 22.5])
print(d_p3['Insulin']==[0, 846, 75.17, 0.0])