### Prepping Data Challenge: Passing Grades (Week 3)

This week's challenge will allow you to practice both <strong>Aggregation</strong> and <strong>Join</strong>.

The Prep School needs to understand how its students are performing, in particular how many are passing their classes, so it can put extra teaching in place where needed. The school is wondering whether Gender might be a driver of performance, so we'll keep that field in the final data set and anonymise the rest of the data.

#### Input

Again, we are going to use Week 1's data set together with an additional data set.

#### Requirements:

- Input both data sets
- Join the data sets together to give us the grades per student
- Remove the parental data fields; they aren't needed for this week's challenge
- Pivot the data to create one row of data per student and subject
- Rename the pivoted fields to Subject and Score 
- Create an average score per student based on all of their grades
- Create a field that records whether the student passed each subject
  - The pass mark is 75 or above in every subject
- Aggregate the data per student to count how many subjects each student passed
- Round the average score per student to one decimal place
- Remove any unnecessary fields and output the data

In [1]:
import pandas as pd

In [2]:
# Input the data sets.
df1 = pd.read_csv('WK1-Input.csv')
df2 = pd.read_csv('WK3- Grades Input.csv')

In [3]:
df1.head()

Unnamed: 0,id,pupil first name,pupil last name,gender,Date of Birth,Parental Contact Name_1,Parental Contact Name_2,Preferred Contact Employer,Parental Contact
0,1,Ronna,Nellies,Female,12/21/2013,Purcell,Ketti,Demizz,1
1,2,Rusty,Andriulis,Male,7/21/2012,Vassili,Rivi,Brainbox,1
2,3,Roberta,Oakeshott,Female,12/4/2011,Lind,Haskell,Centidel,2
3,4,Lola,Rubinfajn,Male,6/29/2012,Elie,Tresa,Edgeblab,2
4,5,Kamila,Benedtti,Female,7/10/2012,Adela,Clevey,Trudoo,1


In [4]:
df2.head()

Unnamed: 0,Student ID,Maths,English,Spanish,Science,Art,History,Geography
0,1,66,97,85,75,76,94,76
1,2,84,85,62,87,68,75,74
2,3,88,68,69,81,92,89,75
3,4,65,97,96,89,98,77,62
4,5,86,97,94,98,67,77,97


In [5]:
# Join the data sets together to give us the grades per student
df = pd.merge(df1, df2, left_on='id', right_on='Student ID', how='left')
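
The join above assumes one grades row per pupil. A minimal sketch of how that assumption could be checked, using two-row stand-ins for the input tables (the names `pupils`, `grades`, `joined` and `unmatched` are illustrative, not part of the notebook):

```python
import pandas as pd

# Two-row stand-ins for the pupil and grade tables (values copied from the previews above).
pupils = pd.DataFrame({"id": [1, 2], "gender": ["Female", "Male"]})
grades = pd.DataFrame({"Student ID": [1, 2], "Maths": [66, 84], "English": [97, 85]})

# validate="one_to_one" raises pandas.errors.MergeError if either key column has duplicates;
# indicator=True adds a "_merge" column recording which table(s) each row came from.
joined = pd.merge(
    pupils, grades,
    left_on="id", right_on="Student ID",
    how="left", validate="one_to_one", indicator=True,
)

# With a left join, pupils that have no grades row would be flagged "left_only".
unmatched = joined[joined["_merge"] != "both"]
```

If `unmatched` is empty, every pupil found a matching grades row and the `_merge` column can be dropped again.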

In [6]:
df.head()

Unnamed: 0,id,pupil first name,pupil last name,gender,Date of Birth,Parental Contact Name_1,Parental Contact Name_2,Preferred Contact Employer,Parental Contact,Student ID,Maths,English,Spanish,Science,Art,History,Geography
0,1,Ronna,Nellies,Female,12/21/2013,Purcell,Ketti,Demizz,1,1,66,97,85,75,76,94,76
1,2,Rusty,Andriulis,Male,7/21/2012,Vassili,Rivi,Brainbox,1,2,84,85,62,87,68,75,74
2,3,Roberta,Oakeshott,Female,12/4/2011,Lind,Haskell,Centidel,2,3,88,68,69,81,92,89,75
3,4,Lola,Rubinfajn,Male,6/29/2012,Elie,Tresa,Edgeblab,2,4,65,97,96,89,98,77,62
4,5,Kamila,Benedtti,Female,7/10/2012,Adela,Clevey,Trudoo,1,5,86,97,94,98,67,77,97


In [7]:
# Remove the parental data fields; they aren't needed for this week's challenge
df.drop(columns=['Parental Contact Name_1','Parental Contact Name_2','Preferred Contact Employer','Parental Contact'], inplace=True)

In [8]:
df.head()

Unnamed: 0,id,pupil first name,pupil last name,gender,Date of Birth,Student ID,Maths,English,Spanish,Science,Art,History,Geography
0,1,Ronna,Nellies,Female,12/21/2013,1,66,97,85,75,76,94,76
1,2,Rusty,Andriulis,Male,7/21/2012,2,84,85,62,87,68,75,74
2,3,Roberta,Oakeshott,Female,12/4/2011,3,88,68,69,81,92,89,75
3,4,Lola,Rubinfajn,Male,6/29/2012,4,65,97,96,89,98,77,62
4,5,Kamila,Benedtti,Female,7/10/2012,5,86,97,94,98,67,77,97


In [9]:
#Pivot the data to create one row of data per student and subject
#Rename the pivoted fields to Subject and Score 
df_pivot = pd.melt(df, id_vars=['Student ID','gender'], value_vars=['Maths','English','Spanish','Science','Art','History','Geography'], var_name='Subject', value_name='Score')
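
To see what the unpivot does, here is a small sketch on two students and two subjects taken from the grades preview. The frame names `wide` and `long` are hypothetical; the column names match the notebook.

```python
import pandas as pd

wide = pd.DataFrame({
    "Student ID": [1, 2],
    "gender": ["Female", "Male"],
    "Maths": [66, 84],
    "English": [97, 85],
})

# When value_vars is omitted, melt unpivots every column that is not listed in id_vars.
long = pd.melt(
    wide,
    id_vars=["Student ID", "gender"],
    var_name="Subject",
    value_name="Score",
)
```

Each wide row becomes one row per subject, so two students with two subjects give four rows, ordered subject by subject.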

In [10]:
#Create an average score per student based on all of their grades
df_pivot["Student's Avg Score"] = df_pivot.groupby(['Student ID'])['Score'].transform('mean')

In [11]:
#Create a field that records whether the student passed each subject
#Pass mark is 75 and above in all subjects
#Aggregate the data per student to count how many subjects each student passed
df_pivot['Pass'] = df_pivot['Score'].apply(lambda x: 'Pass' if x >= 75 else 'Fail')
# transform('sum') keeps each count aligned with the row's own Student ID
df_pivot['Passed Subjects'] = (df_pivot['Pass'] == 'Pass').groupby(df_pivot['Student ID']).transform('sum')

In [12]:
#Round the average score per student to one decimal place
df_pivot["Student's Avg Score"] = df_pivot["Student's Avg Score"].round(1)

In [13]:
# Keep only the output fields, one row per student
output = df_pivot[['Passed Subjects',"Student's Avg Score",'Student ID','gender']].drop_duplicates()

In [14]:
output.head(10)

Unnamed: 0,Passed Subjects,Student's Avg Score,Student ID,gender
0,6,81.3,1,Female
1,4,76.4,2,Male
2,5,80.3,3,Female
3,5,83.4,4,Male
4,6,88.0,5,Female
5,3,75.3,6,Female
6,4,81.9,7,Female
7,6,82.4,8,Male
8,5,80.6,9,Female
9,3,76.9,10,Male
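
The pass count has to line up with each row's own Student ID. The sketch below, on three subjects for students 1 and 2 from the grades preview, contrasts assigning a per-student aggregate directly (pandas aligns it on row labels, not on Student ID) with `groupby(...).transform('sum')`. The names `long`, `per_student`, `misaligned` and `aligned` are illustrative only.

```python
import pandas as pd

# Long-format scores for students 1 and 2 (Maths, English, Science from the grades preview).
long = pd.DataFrame({
    "Student ID": [1, 2, 1, 2, 1, 2],
    "Subject": ["Maths", "Maths", "English", "English", "Science", "Science"],
    "Score": [66, 84, 97, 85, 75, 87],
})
passed = long["Score"] >= 75

# Aggregated result: one value per student, indexed by Student ID (1 and 2).
per_student = passed.groupby(long["Student ID"]).sum()

# Assigning it as a column aligns on the row labels 0..5, not on Student ID:
# row 1 (student 2) receives student 1's count, row 2 (student 1) receives student 2's,
# and every other row becomes NaN.
misaligned = long.assign(**{"Passed Subjects": per_student})

# transform returns one value per original row, so each row carries its own student's count.
aligned = long.assign(
    **{"Passed Subjects": passed.groupby(long["Student ID"]).transform("sum")}
)
```

A quick check such as comparing `aligned` against a hand count for one or two students is usually enough to catch this kind of shift.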


In [15]:
#Output the data
output.to_csv('wk3-output.csv',index=False)
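
As an alternative to building the columns step by step, the per-student summary could also be produced in a single named aggregation. This is only a sketch on sample scores from the grades preview; the helper column `Passed` and the frame names `long` and `summary` are assumptions introduced for illustration.

```python
import pandas as pd

# Three subjects each for students 1 and 2 (Maths, English, Science from the grades preview).
long = pd.DataFrame({
    "Student ID": [1, 1, 1, 2, 2, 2],
    "gender": ["Female"] * 3 + ["Male"] * 3,
    "Score": [66, 97, 75, 84, 85, 87],
})

summary = (
    long.assign(Passed=long["Score"] >= 75)               # True where the pass mark is met
        .groupby(["Student ID", "gender"], as_index=False)
        .agg(**{
            "Passed Subjects": ("Passed", "sum"),          # summing booleans counts the passes
            "Student's Avg Score": ("Score", "mean"),
        })
        .round({"Student's Avg Score": 1})                 # one decimal place, per the requirements
)
```

This yields one row per student directly, so no de-duplication step is needed before writing the file.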