# Introduction to Python

This introduction to Python will focus on the second of two main topics: tools for data manipulation, analysis, and plotting.

We will cover:
- Pandas library and dataframes,
- computing statistics from data,
- finding relationships in data,
- plotting data to show relationships.

As a reminder:
IPython notebooks are organized by "cells." Each cell can have its own code and can be run independently and in any order (although cells are usually run top to bottom). To run a cell and move to the next cell, press ```Shift+Enter```. To run a cell and stay on that cell, press ```Control+Enter```.

Questions to be discussed in groups are highlighted in <font color='green'>green</font>. If you don't understand a function that is used, try googling something like "python function-name".

In [4]:
import pandas as pd # Library for storing, manipulating, and visualizing data
import matplotlib.pyplot as plt # Library for plotting
%matplotlib inline

In [5]:
# We'll use the same data as before
file_name = 'student_data.csv'

# Dataframes
Rather than loading the csv data ourselves, we can use the Pandas library to read the csv directly and store it in a data structure called a "Dataframe." Dataframes are objects that can be used to store and manipulate 'columnar' data.

One nice feature of dataframes is that they can store both the raw data and metadata, e.g. student ID/name or assignment type.

In [6]:
# Read csv into dataframe
df = pd.read_csv(file_name)
df

| Unnamed: 0 | ID | MT1 | MT2 | HW | Final |
|---|---|---|---|---|---|
| 0 | 100 | 0.618561 | 0.523077 | 0.712692 | 0.810502 |
| 1 | 101 | 0.566880 | 0.498883 | 0.657585 | 0.537925 |
| 2 | 102 | 0.567084 | 1.000000 | 0.758540 | 0.503278 |
| 3 | 103 | 1.000000 | 0.576961 | 0.592728 | 0.987362 |
| 4 | 104 | 0.703201 | 0.529477 | 0.706027 | 0.632194 |
| 5 | 105 | 0.858572 | 0.575547 | 0.657599 | 0.659696 |
| 6 | 106 | 0.265628 | 0.503418 | 0.735388 | 0.184639 |
| 7 | 107 | 0.504815 | 0.585569 | 0.663745 | 0.629271 |
| 8 | 108 | 0.881558 | 1.000000 | 0.732801 | 0.782823 |
| 9 | 109 | 0.703076 | 0.944710 | 0.726910 | 1.000000 |
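The `Unnamed: 0` column in the output above typically appears when a csv was saved with its row index but without a header for that column. The following is a minimal sketch, assuming a file laid out like the output above; the inline csv text is a made-up stand-in for `student_data.csv`. It shows how the `index_col` argument of `read_csv` can use that column as the index while loading.

```python
import io
import pandas as pd

# Made-up stand-in for student_data.csv: the first header field is empty,
# which is what produces an 'Unnamed: 0' column on a plain read_csv.
csv_text = """,ID,MT1,MT2
0,100,0.62,0.52
1,101,0.57,0.50
2,102,0.57,1.00
"""

plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# index_col=0 treats the first column as the row index instead of as data.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```

Whether this is appropriate for the real file depends on how it was written, so treat it as one option rather than a required step.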


In [None]:
# Make student ID the 'index'
df = df.set_index('ID')
df
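As a small sketch of what `set_index` does (using a made-up three-row dataframe named `toy` rather than the course data): the call returns a new dataframe in which `ID` labels the rows and is no longer one of the columns.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({'ID': [100, 101, 102],
                    'MT1': [0.62, 0.57, 0.57],
                    'HW': [0.71, 0.66, 0.76]})

toy_indexed = toy.set_index('ID')

print(toy_indexed.index.tolist())    # row labels are now the IDs
print(toy_indexed.columns.tolist())  # 'ID' is no longer a column
print('ID' in toy.columns)           # the original frame is unchanged
```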

## Selecting Data from Dataframes
There are a number of ways of extracting subsets of the data from a dataframe.
- columns
- iloc: integer-position based, similar to NumPy indexing,
- loc: label based, more similar to a dictionary,
- ix: mixed positions and labels; it was deprecated in pandas 0.20 and removed in pandas 1.0, so use iloc or loc instead.

### Columns

In [None]:
# You can extract a column directly from the dataframe.
# Remember that you can collapse the output by clicking in the space to the left.
df[['HW']]
# What happens if you change the double brackets to single brackets?

In [None]:
# You can also extract multiple columns
# ('ID' is now the index rather than a column, so it is not listed here)
df[['MT1', 'MT2']]

### Rows

In [None]:
# iloc selects rows by integer position
df.iloc[2:5]

In [None]:
# loc selects rows by label (here, the student IDs)
df.loc[[102, 103, 104]]
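The difference between positions and labels is easiest to see when the two do not coincide. Below is a minimal sketch on a made-up dataframe named `toy` whose index starts at 100, as the student IDs do. Note that an `iloc` slice excludes its end position, while a `loc` slice includes its end label.

```python
import pandas as pd

# Made-up data: IDs start at 100, so positions and labels differ.
toy = pd.DataFrame({'MT1': [0.62, 0.57, 0.57, 1.00, 0.70]},
                   index=pd.Index([100, 101, 102, 103, 104], name='ID'))

by_position = toy.iloc[2:4]   # positions 2 and 3; the end position is excluded
by_label = toy.loc[102:104]   # labels 102 through 104; the end label is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```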

You can also select columns and rows at the same time.
<font color='green'>
1. Can you select the Final grade data for students 20-30?<br>
2. How about the MT1 and MT2 data?</font>

In [None]:
# Try here:


## Creating New Columns
In the last notebook, you created a 'final grade' for the students based on the weighting:
- MT1: 25%
- MT2: 25%
- HW: 20%
- Final: 30%

<font color='green'>
<br>1. Create a new column in the dataframe that computes the final grade.<br>
Hint: the syntax for creating a new column looks very similar to adding a new item to a dictionary!<br>
2. Create a new column that has an 'S' or 'U' based on the .6 cutoff. Hint: you'll want to use loc for this. First try finding which students have a Final Grade > .6!</font>

If you have trouble with either of these things, search the Pandas documentation. An example for calculating the Midterm Mean is given.

In [None]:
df['Midterm Mean'] = .5*df['MT1']+.5*df['MT2']
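The two patterns hinted at above can be sketched on made-up data that is unrelated to the grades. The dataframe `toy` and its column names are purely illustrative: one column is computed from existing columns, and another is filled in with `loc` for the rows that match a condition.

```python
import pandas as pd

# Made-up data, unrelated to the course grades.
toy = pd.DataFrame({'low': [10.0, 18.0, 25.0],
                    'high': [20.0, 30.0, 35.0]})

# New column computed from existing ones (dictionary-style assignment).
toy['midpoint'] = 0.5 * toy['low'] + 0.5 * toy['high']

# Start with a default label, then overwrite the rows matching a condition.
toy['label'] = 'cool'
toy.loc[toy['midpoint'] > 20, 'label'] = 'warm'

print(toy)
```

This is only one way to do a conditional assignment; the Pandas documentation describes others.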

In [None]:
# Create new columns here:


# Dataframe Utilities
## Computing Statistics
You can also compute statistics on the dataframe as a whole or subsets of the data.

In [None]:
print('Mean')
print(df.mean(numeric_only=True))  # skip non-numeric columns such as the S/U column
print('')
print('Max')
print(df.max())
print('')


print('Final')
print(df['Final'].mean())
print('')

print('Homework for Selected Students')
print(df.loc[120:130, 'HW'].mean())
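Here is a small sketch of the same ideas on a made-up dataframe named `toy` that mixes numeric and text columns. It assumes a text column like the 'S'/'U' one from the exercise, here called `Status` for illustration; passing `numeric_only=True` tells `mean` to leave such columns out.

```python
import pandas as pd

# Made-up data: two numeric columns and one text column.
toy = pd.DataFrame({'HW': [0.5, 0.7, 0.9],
                    'Final': [0.4, 0.6, 0.8],
                    'Status': ['U', 'S', 'S']},
                   index=pd.Index([120, 121, 122], name='ID'))

# numeric_only=True skips the text column when averaging.
means = toy.mean(numeric_only=True)
print(means)

# Statistics for a subset: rows labeled 120 through 121, HW column only.
subset_mean = toy.loc[120:121, 'HW'].mean()
print(subset_mean)
```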

## Plotting
There are also a number of plots from matplotlib that you can create directly from the dataframe.
### Histograms
Try making histograms for different columns. You can also compute a column's mean and standard deviation (.std()) and see if they make sense given the histogram.

In [None]:
col = 'Final'
df.plot(y=col, kind='hist')
print(df[col].mean(), df[col].std())
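To see the numbers behind a histogram, one option is to compute the bin counts directly. This sketch uses NumPy's `histogram` function on made-up scores. The choice of 5 bins on the range 0 to 1 is an assumption for illustration and is not necessarily what the plot above uses.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration only.
scores = pd.Series([0.1, 0.35, 0.45, 0.5, 0.55, 0.65, 0.9], name='Final')

# Count how many scores fall in each of 5 equal-width bins on [0, 1].
counts, edges = np.histogram(scores, bins=5, range=(0.0, 1.0))
print(counts)   # one count per bin
print(edges)    # 6 bin edges for 5 bins

# Compare with the summary statistics.
print(scores.mean(), scores.std())
```

Most of the made-up scores land in the middle bin, which is consistent with a mean near the center of the range.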

## Finding Correlations in Data
We might hope that the grade on the final exam is a good overall measure of students' knowledge.

One question you could ask is: "What other grades in the course are correlated with the grade on the final?" We can make scatter plots of the different columns versus the final exam to get a qualitative sense (we'll answer this quantitatively later in the course).

In [None]:
print(df.columns)

In [None]:
col_comparison = 'HW'
df.plot(x='Final', y=col_comparison, kind='scatter')
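Rather than editing `col_comparison` by hand for each column, one possible approach is to loop over the columns and draw the scatter plots side by side. This is a sketch on a made-up dataframe named `toy`; the `others` list and the figure size are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up grades for illustration only.
toy = pd.DataFrame({'MT1': [0.6, 0.5, 0.9, 0.7],
                    'HW': [0.7, 0.6, 0.7, 0.7],
                    'Final': [0.8, 0.5, 0.9, 0.6]})

# Every column except the one we compare against.
others = [c for c in toy.columns if c != 'Final']

# One scatter plot per column, drawn side by side in a single figure.
# squeeze=False keeps axes two-dimensional even if there is only one plot.
fig, axes = plt.subplots(1, len(others), squeeze=False,
                         figsize=(4 * len(others), 3))
for ax, col in zip(axes[0], others):
    toy.plot(x='Final', y=col, kind='scatter', ax=ax, title=col)
```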

<font color='green'>
1. Which other assessments seem to correlate with the final exam grades?<br>
2. Which do not?</font>

Note that this is generated data and not data from a real course.