# Exercises to follow up on pandas_part1 and pandas_part2

### 1: Selecting and subsetting dataframe with Boolean indexing

We will return to the student dataset used in HW2. This is the dataset that got downloaded as student.zip and had student-mat.csv inside of it.

If you still have it on your computer from HW2, you can use that original file again (assuming that in HW2 P9F you saved the altered file under a new name).

If not you can download it again:

https://archive.ics.uci.edu/ml/machine-learning-databases/00320/

Download student.zip

We recommend storing the target file student-mat.csv in the same location as this Jupyter notebook to simplify loading it; however, this is not required.

Source: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

### Guide to important columns in the dataset

You can find a description of the fields in the data file if you look at the student.txt file that is included in the zip that you downloaded.

Each row of the data file corresponds to a student.

Some columns that you will use in the next few problems are described here:

* 'school': which school a student attends
* 'age': student age
* 'famsize': student's family size (binary, two values: "LE3" - less than or equal to 3, or "GT3" - greater than 3)
* 'traveltime': home to school travel time (numeric coding: 1 is <15 min., 2 is 15 to 30 min., 3 is 30 min. to 1 hour, or 4 is >1 hour)
* 'freetime': free time after school (numeric: from 1 - very low to 5 - very high)
* 'absences': number of school absences (numeric: from 0 to 93)
* 'G1', 'G2', and 'G3': these three numeric columns (0-20) are grades

**Load the data in student-mat.csv into a dataframe. You can revisit HW2 if you run into trouble with the delimiter... Don't forget to import pandas first.**


## This code assumes that one level up from this notebook is a folder called /student/ and inside of that is a file called student-mat.csv

In [None]:
# import pandas using conventional short name
import pandas as pd

## student-mat.csv has semicolon ';' as delimiter so specify that on the import:
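The effect of the delimiter can be checked without the real file. The sketch below is only an illustration: `sample_text` is a hypothetical three-row sample written in the same semicolon-separated layout, not data from student-mat.csv.

```python
import io
import pandas as pd

# Hypothetical sample in the same semicolon-separated layout as student-mat.csv
sample_text = "school;age;absences\nGP;18;6\nGP;17;4\nMS;16;0\n"

# With the default comma separator, every line is read as one single column
wrong = pd.read_csv(io.StringIO(sample_text))
print(wrong.shape)

# Passing delimiter=';' splits each line into the three intended columns
right = pd.read_csv(io.StringIO(sample_text), delimiter=';')
print(right.shape)
print(list(right.columns))
```

If the dataframe you load shows a single wide column, a missing or wrong delimiter is the likely cause.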

In [None]:
# load the student mat using relative path
sdf = pd.read_csv('../student/student-mat.csv', delimiter=';')

In [None]:
# take a look at the top of the dataframe
sdf.head()

**Use a Boolean comparison to select and output the rows of the dataframe where <code>school</code> is 'GP'.**

**Hint:** pandas_part2.ipynb

In [None]:
# get a series of True/False values, one for each row, specifying whether 
# the 'school' column has value GP in that row
gp_sel = sdf['school']=='GP'
gp_sel

In [None]:
# use the gp_sel to pull out only those rows of the dataframe that have True
# this will correspond to rows of the dataframe that have school equal to GP
gp_df = sdf[gp_sel]
gp_df.head()

In [None]:
# or do it all at once without using
# a separate variable to hold the True/False series

# the inner part of the brackets generates the True/False series
# and then that gets passed directly to the dataframe
gp_df = sdf[sdf['school']=='GP']
gp_df.head()

**Use a Boolean comparison to select and output the rows of the dataframe where <code>absences</code> is greater than zero <i>and </i> less than 10.**


## For compound comparisons, wrap individual checks in parentheses and combine them with the & symbol
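The parentheses matter because Python evaluates `&` before `>` and `<`. The following is a minimal sketch on a hypothetical toy dataframe (the values are made up, not taken from the student data):

```python
import pandas as pd

# Hypothetical absence counts, not taken from student-mat.csv
toy = pd.DataFrame({'absences': [0, 3, 9, 10, 25]})

# Each comparison is wrapped in parentheses so it is evaluated
# before the two True/False series are combined with &
mask = (toy['absences'] > 0) & (toy['absences'] < 10)
print(mask.tolist())

# Only the rows where both conditions are True are kept
kept = toy[mask]
print(kept['absences'].tolist())
```

Both boundary values (0 and 10) are excluded here because the comparisons are strict.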

In [None]:
# use compound comparisons, each one wrapped in parentheses, combined with &
# this will return a series of True/False with True if both conditions are met
abs_sel = (sdf['absences']>0) & (sdf['absences']<10)

# use that Boolean series to pull only the dataframe rows with absences
# greater than 0 and less than 10
abs_df = sdf[abs_sel]
abs_df

In [None]:
# use the dataframe column unique() method to see absence values left over in the resulting
# dataframe. There shouldn't be any outside our desired range
abs_df['absences'].unique()

### 2: Descriptive Stats with DataFrame 

**Create a new column in your dataframe called <code>composite</code> that is the average of the scores of columns G1, G2 and G3.**

In [None]:
# make a new column by giving its name and what goes into it
# we can do element-wise math (rows lined up) by giving whole columns at once

# put the summing of the three columns in parentheses so we get the sum 
# before dividing by 3
sdf['composite'] = (sdf['G1'] + sdf['G2'] + sdf['G3'])/3
sdf['composite']
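
As a cross-check on the element-wise arithmetic, the same row-wise average can be computed with `mean(axis=1)` on the three grade columns. This is a sketch on a hypothetical two-row dataframe, offered as an alternative rather than the expected solution:

```python
import pandas as pd

# Hypothetical grades for two students
toy = pd.DataFrame({'G1': [10, 5], 'G2': [12, 6], 'G3': [14, 10]})

# Same approach as above: sum the columns, then divide by 3
by_hand = (toy['G1'] + toy['G2'] + toy['G3']) / 3
print(by_hand.tolist())

# Alternative: select the three columns and average across each row
row_mean = toy[['G1', 'G2', 'G3']].mean(axis=1)
print(row_mean.tolist())
```

The two approaches agree when no grade is missing. They can differ when a value is NaN, because `mean` skips missing values by default while the sum does not.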

**Report the average age and the standard deviation of the ages in the entire dataset** (hint: we can calculate the mean() and std() on a data frame column)

In [None]:
sdf['age'].mean()

In [None]:
sdf['age'].std()
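
Note that pandas `std()` computes the sample standard deviation by default (it divides by n - 1). The sketch below uses a hypothetical series of four ages to show the difference from the population version:

```python
import pandas as pd

# Hypothetical ages, not taken from the dataset
ages = pd.Series([15, 16, 17, 18])

age_mean = ages.mean()
print(age_mean)

# Default: sample standard deviation (ddof=1, divides by n - 1)
sample_std = ages.std()
print(sample_std)

# Population standard deviation (ddof=0, divides by n)
population_std = ages.std(ddof=0)
print(population_std)
```

For a dataset with several hundred rows the two values are close, but they are not identical.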

**What is the minimum (lowest) age in the dataset?** 

In [None]:
sdf['age'].min()

**Subselect only those rows where school equals 'GP' and store it in a new dataframe. Do the same thing for school equals 'MS'.**

In [None]:
gp_sel = sdf['school'] == 'GP'

# the copy() command here makes a "copy" of the data
# rather than just pointing a reference to it
# this isn't strictly necessary but it avoids some warnings (not errors)
# that can pop up later
# The warning will say "A value is trying to be set on a copy of a slice from a DataFrame."
# This isn't a problem for now but it can be annoying to look at.
gp_df = sdf[gp_sel].copy()
gp_df.head()

In [None]:
ms_sel = sdf['school'] == 'MS'
ms_df = sdf[ms_sel].copy()
ms_df.head()
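
The comments in the GP cell explain why `copy()` is used. A minimal sketch of what it buys you, using a hypothetical toy dataframe and a made-up `bonus` column: changes made to the copied subset do not affect the original.

```python
import pandas as pd

# Hypothetical data, not taken from student-mat.csv
toy = pd.DataFrame({'school': ['GP', 'MS', 'GP'], 'G3': [10, 12, 14]})

# Select the GP rows and make an independent copy
gp_toy = toy[toy['school'] == 'GP'].copy()

# Adding a column to the copy leaves the original dataframe alone
gp_toy['bonus'] = gp_toy['G3'] + 1
print(list(gp_toy.columns))
print(list(toy.columns))
```

The subset keeps the original row labels (0 and 2 here), which is why the index of `gp_df` and `ms_df` may have gaps.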

**Using your dataframes from the last section (one for each school), compute the mean 'composite' score for each school and store those values in variables.** Then, write an if/else block that checks whether the average composite score for school 'GP' is greater than or less than the average composite score for school 'MS'. Your code should use the if/else block to print a statement about which school's average composite score is higher.

In [None]:
# make composite score for GP, put the summing in parentheses, then divide by three
gp_df['composite'] = (gp_df['G1'] + gp_df['G2'] + gp_df['G3'])/3 
gp_df['composite']

In [None]:
# make composite score for MS, put the summing in parentheses, then divide by three
ms_df['composite'] = (ms_df['G1'] + ms_df['G2'] + ms_df['G3'])/3 
ms_df['composite']

In [None]:
# do a check on which average composite score is higher

if gp_df['composite'].mean() > ms_df['composite'].mean():
    print('gp composite mean is higher')
    
elif gp_df['composite'].mean() < ms_df['composite'].mean():
    print('gp composite mean is lower')
    
elif gp_df['composite'].mean() == ms_df['composite'].mean():
    print('gp and ms have same composite mean score')
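
The exercise also asks for the means to be stored in variables. A sketch of that variant, on a hypothetical toy dataframe with made-up composite scores, computes each mean once and ends the chain with a plain `else` (the names `gp_mean`, `ms_mean` and `message` are illustrative):

```python
import pandas as pd

# Hypothetical composite scores, not taken from the dataset
toy = pd.DataFrame({
    'school': ['GP', 'GP', 'MS', 'MS'],
    'composite': [12.0, 14.0, 9.0, 11.0],
})

# Store each school's mean composite score in a variable
gp_mean = toy[toy['school'] == 'GP']['composite'].mean()
ms_mean = toy[toy['school'] == 'MS']['composite'].mean()

# Compare the stored values; else covers the remaining (equal) case
if gp_mean > ms_mean:
    message = 'gp composite mean is higher'
elif gp_mean < ms_mean:
    message = 'gp composite mean is lower'
else:
    message = 'gp and ms have same composite mean score'

print(message)
```

Storing the means avoids recomputing them in every branch and makes the comparison easier to read.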