# **SIG AIDA Data Science Workshop**
## _Punching Through Data with Pandas_


# Introduction
## What is Pandas?
Pandas is one of the largest and most popular Python libraries for data science, and it is widely used for general data manipulation as well.

- It allows you to load in data in the form of a **dataframe**, which is essentially a table, and then further lets you run fast calculations on the table's columns!
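
As a rough illustration of what "fast calculations on the table's columns" looks like, the sketch below builds a tiny dataframe by hand and computes a new column in one expression. The column names and numbers are made up for this example and are not taken from the workshop datasets.

```python
import pandas as pd

# A tiny, made-up table: each dict key becomes a column.
rides = pd.DataFrame({
    "trips": [120, 80, 200],
    "vehicles": [10, 8, 16],
})

# Arithmetic on columns is applied to every row at once (no explicit loop).
rides["trips_per_vehicle"] = rides["trips"] / rides["vehicles"]

print(rides)
```

The same pattern works with most arithmetic and comparison operators, which is usually much faster than looping over rows in plain Python.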


In [4]:
#@title Please run this cell for setup!

import pandas as pd
from pprint import pprint


import sqlite3
conn = sqlite3.connect('example.db')

c = conn.cursor()
c.execute("SELECT name FROM sqlite_master WHERE type='table'")
if len(c.fetchall()) > 0:
    c.execute("DROP TABLE IF EXISTS uber")
    c.execute("DROP TABLE IF EXISTS gpa")

uber_url = "https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/Uber-Jan-Feb-FOIL.csv"
uber_data = pd.read_csv(uber_url, index_col=0)
uber_data.to_sql('uber', conn)

gpa_url = "https://raw.githubusercontent.com/wadefagen/datasets/master/gpa/uiuc-gpa-dataset.csv"
gpa_data = pd.read_csv(gpa_url, index_col=0)
gpa_data.to_sql('gpa', conn)

stu19 = pd.read_excel('http://dmi.illinois.edu/stuenr/ethsexres/ethsexfa19.xls',
                      header=4, sheet_name="summary",
                      names=["term", "code", "name", "st_level", "total", "men", "women", "unknown_gender",
                             "caucasian", "asian_american", "african_american", "hispanic", "native_american",
                             "hawaiian_pacificisl", "multiracial", "international", "unknown_race",
                             "all_african_american", "all_native_american", "all_hawaiian_pacificisl",
                             "all_asian", "illinois", "non_illinois", "part_time", "full_time"])
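# if_exists='replace' lets this cell be re-run without failing on an existing 'stu19' table
stu19.to_sql('stu19', conn, if_exists='replace')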

def run_query(query):
    return pd.read_sql_query(query, conn)

print("Setup Complete!")

ValueError: Length mismatch: Expected axis has 24 elements, new values have 25 elements

## Comparing SQL and Pandas
Pandas Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

SQL Cheat Sheet: https://cdn.sqltutorial.org/wp-content/uploads/2016/04/SQL-cheat-sheet.pdf

A workshop we conducted last semester on the same topic: https://drive.google.com/file/d/1NtRa_pueIMcpig-0-78xyVSW19OvSww0/view?usp=sharing
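
To make the comparison concrete before going through the individual operations, here is a minimal sketch that runs the same request once as SQL and once as pandas. It uses a made-up table and a separate in-memory SQLite database (`conn_demo`), both of which are assumptions for this example rather than part of the workshop setup.

```python
import sqlite3
import pandas as pd

# Made-up enrollment table, just for this comparison.
colleges = pd.DataFrame({
    "name": ["Business", "Education", "Engineering", "Business"],
    "st_level": ["Undergraduate", "Graduate", "Undergraduate", "Graduate"],
    "total": [3000, 900, 8000, 1500],
})

# Load the dataframe into an in-memory SQLite database.
conn_demo = sqlite3.connect(":memory:")
colleges.to_sql("colleges", conn_demo, index=False)

# SQL: filter rows, pick columns, and sort.
sql_result = pd.read_sql_query(
    "SELECT name, total FROM colleges WHERE total > 1000 ORDER BY total DESC",
    conn_demo,
)

# pandas: the same filter, column selection, and sort.
pandas_result = (
    colleges.loc[colleges["total"] > 1000, ["name", "total"]]
    .sort_values("total", ascending=False)
    .reset_index(drop=True)
)

print(sql_result)
print(pandas_result)
conn_demo.close()
```

Both results should contain the same rows in the same order; the cells below break this down one SQL keyword at a time.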

In [None]:
stu19.head(10)

In [None]:
# Pandas Version of SELECT

stu19["name"].head(5)

# SQL equivalent:
# SELECT name FROM stu19 LIMIT 5

In [None]:
# Pandas Version of WHERE

stu19[stu19["name"] == 'Business '].head(5)

# SQL equivalent:
# SELECT * FROM stu19 WHERE name = 'Business ' LIMIT 5

In [None]:
# Pandas Version of LIKE

stu19[stu19["name"].str.contains("Bus")].head(5)

# SQL equivalent:
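# SELECT * FROM stu19 WHERE name LIKE '%Bus%' LIMIT 5
# (note: LIKE in SQLite ignores case for ASCII letters, while str.contains is case-sensitive by default)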

In [None]:
# Pandas Version of GROUP BY and aggregate functions

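# .sum can be replaced with .count, .mean, or others
# numeric_only=True skips text columns, which newer pandas versions would otherwise concatenate
stu19.groupby(["name"]).sum(numeric_only=True).head(5)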

# SQL Equivalent:
# SELECT SUM(term), ..., SUM(<last_numeric_col>) FROM stu19 GROUP BY name LIMIT 5
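
As a small follow-up to the comment about swapping `.sum` for other aggregates, the sketch below computes one aggregate and then several at once with `.agg`. The table is made up for the example; only the column names `name`, `st_level`, and `total` mirror the ones used above.

```python
import pandas as pd

enroll = pd.DataFrame({
    "name": ["Business", "Business", "Education", "Education"],
    "st_level": ["Undergraduate", "Graduate", "Undergraduate", "Graduate"],
    "total": [3000, 1500, 700, 900],
})

# One aggregate per group: SELECT name, SUM(total) FROM enroll GROUP BY name
totals = enroll.groupby("name")["total"].sum()

# Several aggregates at once with .agg
summary = enroll.groupby("name")["total"].agg(["sum", "mean", "count"])

print(totals)
print(summary)
```

Selecting the column (`["total"]`) before aggregating keeps the result focused on the values you care about.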

In [None]:
# Selecting rows by (name, st_level) pairs using a MultiIndex
new_df = stu19.set_index(["name", "st_level"])
#new_df
new_df.loc[[("Business ", "Undergraduate "),
            ("Education ", "Graduate ")]]

#new_df.loc[(["Business ", "Education "], ["Undergraduate ", "Graduate "])]

In [None]:
# Pandas Version of ORDER BY

stu19.sort_values(by = ["total", "women"], ascending=False).head(5)

# SQL Equivalent:
# SELECT * FROM stu19 ORDER BY total DESC, women DESC LIMIT 5

## Some Basic Functions (not from SQL)



#### `pd.DataFrame()`

Sometimes you'll want to create your own DataFrame. There are **many** different ways to do this, using various combinations of lists, tuples, and dictionaries.

- You can check out [this page](https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/) for a few examples
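
For a quick feel of the variety, here is a minimal sketch of three common ways to build the same DataFrame. The `members` column and its values are invented for illustration.

```python
import pandas as pd

# 1. Dict of lists: keys are column names.
from_dict = pd.DataFrame({"SIG": ["aida", "glug"], "members": [30, 25]})

# 2. List of dicts: each dict is one row.
from_records = pd.DataFrame([
    {"SIG": "aida", "members": 30},
    {"SIG": "glug", "members": 25},
])

# 3. List of tuples: each tuple is one row, with column names given separately.
from_rows = pd.DataFrame([("aida", 30), ("glug", 25)], columns=["SIG", "members"])

print(from_dict)
print(from_records)
print(from_rows)
```

All three print the same two-row table, so you can pick whichever matches the shape your data already has.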

In [None]:
# a simple dataframe built from a dictionary of two lists
data = {'SIG': ['pwny', 'glug', 'bot', 'aida', 'music', 'icpc', 'arch'], 'Coolest': ['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes']}
df_sigs = pd.DataFrame(data)
display(df_sigs)

#### `.head(n)`

- useful for when you just want to look at the first "n" rows in a dataframe 
- if n is not provided, the default is 5

In [None]:
df_sigs.head()

#### `.tail(n)`
- does the same thing as `.head()`, just from the end
- again, the default value for n is 5 (the example below passes n=1 to show only the last row)

In [None]:
df_sigs.tail(1)

#### `.describe()`
- returns summary statistics for each numeric column: count, mean, standard deviation, min, quartiles, and max
- the example below shows what you get when you run it on a column containing 1, 2, and 3

In [None]:
df = pd.DataFrame({'numeric': [1, 2, 3]})
#display(df)
df.describe()

## Some Advanced Functions

#### `df_concatenated = pd.concat(_list_of_dataframes_)`
- `pd.concat` is a top-level pandas function rather than a dataframe method: you pass it a list of dataframes and assign the result to a new variable

- as a side note, you can also combine dataframes with `merge` and `join` (the older `DataFrame.append` was removed in pandas 2.0)
- documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)!
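
To show how `concat` and `merge` differ, here is a short sketch: `concat` stacks rows, while `merge` matches rows on a key column, much like a SQL JOIN. The `names` lookup table is made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["c", 3], ["d", 4]], columns=["letter", "number"])

# concat stacks rows; ignore_index=True renumbers the index 0..n-1
# instead of keeping each input's own 0, 1 labels.
stacked = pd.concat([df1, df2], ignore_index=True)

# merge combines columns by matching values in a key column (like a SQL JOIN).
names = pd.DataFrame({"letter": ["a", "b", "c"], "word": ["apple", "banana", "cherry"]})
joined = stacked.merge(names, on="letter", how="left")

print(stacked)
print(joined)
```

With `how="left"`, every row of `stacked` is kept, and letters without a match in `names` get a missing value in the `word` column.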

In [None]:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
#       df1
#   letter    number
#     a         1
#     b         2

df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])

#       df2
#   letter    number
#     c         3
#     d         4

df_both = pd.concat([df1, df2])
print("after concat'ing df1 and df2")
display(df_both)

#### `.apply(_function_)`
- the `.apply()` function takes another function as an argument and runs it on each column of the dataframe (or on each row, with `axis=1`)

In [None]:
import numpy as np

df = pd.DataFrame([["Hello", 9]] * 3, columns=['A', 'B'])

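#       df before running the apply function
#      A       B
#    Hello     9
#    Hello     9
#    Hello     9

df

#df.apply(np.sqrt)

# With all-numeric columns, this would take the square root of every cell in the dataframe.
# Here it raises an error because column A holds strings, so it is left commented out.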


### You can also specify which columns you want the .apply() function to act on (and throw this into a new column!)

- note that `.apply()` does not modify the dataframe in place; it returns a new column (a Series) that you can assign to a new column name
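
Here is a minimal sketch of the same idea with other functions: `.apply()` on a single column passes one value at a time, while `.apply(..., axis=1)` on the whole dataframe passes one row at a time. The dataframe `df_demo` and its values are made up for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": ["Hello", "Hi", "Hey"], "B": [9, 16, 25]})

# Series.apply: the function receives one cell value at a time.
df_demo["A_len"] = df_demo["A"].apply(len)

# DataFrame.apply with axis=1: the function receives one row at a time.
df_demo["label"] = df_demo.apply(lambda row: f"{row['A']}-{row['B']}", axis=1)

print(df_demo)
```

In both cases the original columns `A` and `B` are left untouched; the results only appear because they are assigned to new columns.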

In [None]:
df['B_sqrt'] = df['B'].apply(np.sqrt)
df

#### `.to_csv(_filepath_)`
Last but certainly not least, how do you get your dataframe out?
- the filepath can be any path ending in .csv
- if the csv file doesn't exist yet, pandas will make one
- otherwise, pandas will overwrite the existing csv (so be careful!)
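
As a small sketch of a full round trip, the example below writes a dataframe to a CSV file and reads it back. It writes into a temporary folder (an assumption made here so the example can't overwrite one of your files) and passes `index=False`, which leaves out the row labels that `.to_csv()` would otherwise write as an extra first column.

```python
import os
import tempfile
import pandas as pd

df_out = pd.DataFrame({"SIG": ["aida", "glug"], "Coolest": ["yes", "no"]})

# Write to a temporary folder so the example doesn't clobber a real file.
path = os.path.join(tempfile.mkdtemp(), "sig_demo.csv")

# index=False leaves out the row labels (0, 1, ...); by default they are
# written as an extra unnamed first column.
df_out.to_csv(path, index=False)

# Read the file back to confirm what was saved.
df_back = pd.read_csv(path)
print(df_back)
```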

In [None]:
data = {'SIG': ['pwny', 'glug', 'bot', 'aida', 'music', 'icpc', 'arch'], 'Coolest': ['no', 'no', 'no', 'yes', 'no', 'no', 'no']}
df_sigs = pd.DataFrame(data)
display(df_sigs)

df_sigs.to_csv("sig_coolness.csv")

# Congratulations! You now have a csv containing the updated df_sigs dataframe created in this cell
# Since we are working in Google Colab, files are saved in the Colab session itself, so click the folder icon on the left-hand side of your Colab window to see "sig_coolness.csv"

# Practice!
Now here's a chance for you to practice!

### SQL to Pandas!

In [None]:
# Problem 1: Find a class you've taken on campus before in the dataframe 'gpa_data'
#   Note about pandas-specific syntax for AND, OR, NOT:
#   https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing
gpa_data['Subject'].unique()
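
As a quick reference for the AND, OR, NOT syntax mentioned in the note, here is a sketch on a made-up table. Only the column name `Subject` comes from the cell above; `Number` and `Instructor` are placeholders and may not match the real columns in `gpa_data`.

```python
import pandas as pd

# Made-up rows; 'Number' and 'Instructor' are placeholder column names.
courses = pd.DataFrame({
    "Subject": ["CS", "CS", "ECE", "MATH"],
    "Number": [125, 225, 110, 241],
    "Instructor": ["Lee", "Kim", "Lee", "Park"],
})

# AND is &, OR is |, NOT is ~ ; each condition needs its own parentheses.
cs_125 = courses[(courses["Subject"] == "CS") & (courses["Number"] == 125)]
cs_or_ece = courses[(courses["Subject"] == "CS") | (courses["Subject"] == "ECE")]
not_lee = courses[~(courses["Instructor"] == "Lee")]

print(cs_125)
print(cs_or_ece)
print(not_lee)
```

Python's own `and`, `or`, and `not` keywords don't work on whole columns, which is why pandas uses the symbol operators instead.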

In [None]:
# Problem 2: Find the instructor with the highest number of A's given in the dataframe 'gpa_data'


In [None]:
# Problem 3: Find the department with the largest number of instructors in the dataframe 'gpa_data'
#   Follow-up: what changes if you count only unique instructors?


In [None]:
# Problem 4: Find a GPA for each class in the dataframe 'gpa_data'
gpa_point_values = [4, 4, 3.67, 3.33, 3, 2.67, 2.33, 2, 1.67, 1.33, 1, 0.67, 0, 0]


## Data Cleaning: Student Demographic Dataset
Even before we begin our analysis, we need to be able to read in our dataset correctly! Download this dataset (305 kB): http://dmi.illinois.edu/stuenr/ethsexres/ethsexfa19.xls to your computer and open it in Excel. See if you can spot anything that would cause problems when reading this data into Python with Pandas (if you don't see any problems, try reading it in!).

The function you should be using here is `pd.read_excel("filename or URL")` (yes, we are directly reading from the URL for this example).

Here is the documentation for `pandas.read_excel`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

In [None]:
#@title Hint (double click me to open)
# double click the right hand side to close

# Hint 1: If you were to just load in the data, do you see many unnamed columns?
# Now looking in the actual excel file, you can see that there are some
# pieces of information at the top regarding the dataset (metadata). However,
# Pandas does not like this in loading in our dataset because we ideally want
# only a block of values.

# Hint 2: read_excel accepts several arguments, most importantly header.
# It sets which row (0-indexed) to use as the column names; the rows above
# it are skipped, which lets us skip the metadata rows at the top.
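
To see the effect of `header` without downloading anything, here is a sketch that uses `pd.read_csv` on a made-up block of text. `read_csv` has a `header` argument with the same meaning (the row to use for column names), so it stands in for `read_excel` here; the file contents are invented for the example.

```python
import io
import pandas as pd

# Made-up file contents: two lines of notes, then the real header row.
raw_text = (
    "Enrollment report\n"
    "Generated for demonstration only\n"
    "name,total\n"
    "Business,3000\n"
    "Education,900\n"
)

# header=2 means "the third line (0-indexed row 2) holds the column names";
# everything above it is skipped.
report = pd.read_csv(io.StringIO(raw_text), header=2)
print(report)
```

Without `header=2`, the first note line would be treated as the column names, which is the kind of problem the hints describe.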

In [None]:
url = "http://dmi.illinois.edu/stuenr/ethsexres/ethsexfa19.xls"

# load in data (you can use a variable that contains the url instead of
# directly typing it in)

# examine the first few rows to check that you've read in the right data


In [None]:
# Problem 5: Make a new dataframe combining the undergrad and graduate rows for each college (keeping the same columns)


In [None]:
# Problem 6: Using the dataframe from problem 5, make a new column showing the percentage of Hawaiian-Pacific Islanders in each college


In [None]:
# Problem 7: Explore this dataset (or any others) to your heart's content!
#   Tell us if you find anything cool!


# More!

## Plotting!
A good thing about Python libraries is that they are often built to work well with other libraries.

In this case, Pandas has built-in integration with matplotlib, one of the most popular basic plotting libraries!

Next week's workshop will cover cooler visualizations using other libraries (plot.ly specifically), so this is just an intro to what you can do with the now-cleaned data!

In [None]:
pd.options.mode.chained_assignment = None

# This is our version of the cleaned dataset from the problems above
stu19_plot = pd.read_excel('http://dmi.illinois.edu/stuenr/ethsexres/ethsexfa19.xls',
                      header=4, sheet_name="summary",
                      names=["term", "code", "name", "st_level", "total", "men", "women", "unknown_gender",
                             "caucasian", "asian_american", "african_american", "hispanic", "native_american",
                             "hawaiian_pacificisl", "multiracial", "international", "unknown_race",
                             "all_african_american", "all_native_american", "all_hawaiian_pacificisl",
                             "all_asian", "illinois", "non_illinois", "part_time", "full_time"])

# Feel free to look up any of the functions you don't recognize here
# .copy() gives us an independent dataframe, so the assignments below are safe
stu_genders = stu19_plot[['name', 'st_level', 'total', 'men', 'women', 'unknown_gender']].copy()
stu_genders['name'] = stu_genders['name'].str.strip()
stu_genders['st_level'] = stu_genders['st_level'].str.strip()
stu_genders_by_dept = stu_genders.set_index(['name', 'st_level']).sort_values(by='total')
# agg_total: total enrollment of the whole college, repeated on each of its rows
stu_genders_by_dept['agg_total'] = stu_genders_by_dept.groupby('name')['total'].transform('sum')
to_plot = stu_genders_by_dept.sort_values(by=['agg_total', 'st_level'])

# .plot.bar() allows us to easily plot a bar graph!
to_plot[['total', 'agg_total']].plot.bar(figsize=[10, 10],
                                         title="Number of Students of Each Type in Each College vs. Total Number of Students in Each College")

## An anecdote from Michael:
Pandas uses your computer's RAM to store the data it needs, primarily the dataframes you're working with. This means that if you happen to be working with a **_large_** amount of data, i.e. more than the amount of memory your computer has, Python will throw you a MemoryError and tell you it can't allocate the amount of space it needs on your computer.

This happened to me over the summer at my internship when my dataframe grew to **212 GB** but my laptop only had 32 GB of RAM.

This probably won't happen to you unless your dataset is massive, but you can check out [_this website_](https://pythonspeed.com/articles/pandas-load-less-data/) if you want to see ways people have reduced the amount of data they load into memory.
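
As one illustration of shrinking a dataframe in memory, the sketch below stores a repetitive text column as a category and a small-valued integer column as `int8`, then compares memory usage. The data is made up, and this is only one of several possible approaches, not necessarily what was used in the anecdote above.

```python
import pandas as pd

# Made-up data: a text column with only a few distinct values, repeated a lot.
big = pd.DataFrame({
    "subject": ["CS", "ECE", "MATH", "STAT"] * 25_000,
    "count": [1, 2, 3, 4] * 25_000,
})

before = big.memory_usage(deep=True).sum()

# Store repeated strings as categories and use a smaller integer type.
small = big.astype({"subject": "category", "count": "int8"})
after = small.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```

The values themselves are unchanged; only the way they are stored differs. Smaller integer types only work when every value fits in their range (for `int8`, -128 to 127).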


### Next Week: Plotting with plot.ly!
Quick preview of something we'll do!

In [None]:
import plotly.express as px


gpa = pd.read_csv("https://github.com/wadefagen/datasets/raw/master/gpa/uiuc-gpa-dataset.csv")

gpa['total_students'] = gpa['A+'] + gpa['A'] + gpa['A-'] + gpa['B'] + gpa['B+'] + gpa['B-'] + gpa['C+'] + gpa['C'] + gpa['C-'] + gpa['D+'] + gpa['D'] + gpa['D-'] + gpa['F']

gpa['GPA'] = (gpa['A+'] * 4 + gpa['A'] * 4 + gpa['A-'] * 3.67 + gpa['B'] * 3 + gpa['B+'] * 3.33 + gpa['B-'] * 2.67 + gpa['C+'] * 2.33 + gpa['C'] * 2 + gpa['C-'] * 1.67 + gpa['D+'] * 1.33 + gpa['D'] + gpa['D-'] * 0.67) / gpa['total_students']
gpa["4s given"] = (gpa['A'] + gpa['A+']) / gpa['total_students']

# Keep only the rows for the subjects we want to plot
engr_subjects = ["CS", "ECE", "ABE", "AE", "ME", "CHBE", "BIOE", "NPRE", "MSE", "CEE", "IE"]
gpa_engr = gpa[gpa['Subject'].isin(engr_subjects)]

fig = px.scatter(gpa_engr, x = '4s given', y = 'GPA', size = gpa_engr['total_students'], color = gpa_engr['Subject'])
fig.show()