**Data prep**

**2022: Week 5 The Prep School - Setting Grades**

<a href = "https://preppindata.blogspot.com/2022/02/2022-week-5-prep-school-setting-grades.html" >Data source and requirements </a>

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')


**Input the csv file**

In [2]:
data= pd.read_csv("PD 2022 WK 3 Grades.csv")
data.head(10)

Unnamed: 0,Student ID,Maths,English,Spanish,Science,Art,History,Geography
0,1,66,97,85,75,76,94,76
1,2,84,85,62,87,68,75,74
2,3,88,68,69,81,92,89,75
3,4,65,97,96,89,98,77,62
4,5,86,97,94,98,67,77,97
5,6,80,78,70,89,71,67,72
6,7,68,69,100,84,94,90,68
7,8,82,76,81,96,84,87,71
8,9,100,84,79,82,60,62,97
9,10,81,71,94,73,66,63,90


**Pivot Subjects**

In [3]:
data = pd.melt(data, id_vars=['Student ID'], value_vars=["Maths","English","Spanish","Science","Art","History","Geography"],
                   var_name="Subject", value_name='Score')
data = data.sort_values(by=['Score'], ascending=False)
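
To make the reshape concrete, here is a minimal sketch on a toy frame built from the first two students' Maths and English scores in the preview above. It only illustrates how `pd.melt` turns one column per subject into one row per student and subject. The names `wide` and `long` are made up for this example.

```python
import pandas as pd

# Toy frame: first two students, two subjects only
wide = pd.DataFrame({
    "Student ID": [1, 2],
    "Maths": [66, 84],
    "English": [97, 85],
})

# One row per (student, subject) pair instead of one column per subject.
# With value_vars omitted, every non-id column is unpivoted.
long = pd.melt(wide, id_vars=["Student ID"], var_name="Subject", value_name="Score")
print(long)
```

Each student now appears once per subject, which is the shape the grading steps below work on.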


**Divide the students' grades into 6 evenly distributed groups <br>
"Evenly distributed" means the same number of students gain each grade within a subject**

In [4]:
dfs = np.array_split(data, 6)
dfs1=pd.DataFrame(dfs[0])
dfs2=pd.DataFrame(dfs[1])
dfs3=pd.DataFrame(dfs[2])
dfs4=pd.DataFrame(dfs[3])
dfs5=pd.DataFrame(dfs[4])
dfs6=pd.DataFrame(dfs[5])

dfs1['Grade']=1
dfs2['Grade']=2
dfs3['Grade']=3
dfs4['Grade']=4
dfs5['Grade']=5
dfs6['Grade']=6

final_df=pd.concat([dfs1,dfs2,dfs3,dfs4,dfs5,dfs6], ignore_index=True)
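
The heading asks for equal group sizes within a subject, whereas `np.array_split` above cuts the whole score-sorted table into six chunks across all subjects at once. A per-subject variant could look like the sketch below. The helper name `assign_grade_groups` and the toy data are hypothetical, and breaking ties by row order is an assumption, since the requirements quoted here do not say how ties should be handled.

```python
import numpy as np
import pandas as pd

def assign_grade_groups(df, n_groups=6):
    """Hypothetical helper: number groups 1 (highest scores) to n_groups within each subject."""
    # Rank scores within each subject, best first; method="first" breaks ties by row order
    rank = df.groupby("Subject")["Score"].rank(method="first", ascending=False)
    # Number of rows in each row's subject
    size = df.groupby("Subject")["Score"].transform("size")
    # Map ranks 1..size onto groups 1..n_groups of (near-)equal size
    group = np.ceil(rank * n_groups / size).astype(int)
    return df.assign(Grade=group)

# Toy data: 12 students, 2 subjects, made-up scores
toy = pd.DataFrame({
    "Student ID": list(range(1, 13)) * 2,
    "Subject": ["Maths"] * 12 + ["Art"] * 12,
    "Score": list(range(60, 72)) + list(range(95, 83, -1)),
})
graded = assign_grade_groups(toy)
print(graded.groupby(["Subject", "Grade"]).size())
```

With 12 students per subject, every subject ends up with two students in each of the six groups. When the subject size is not divisible by six, the group sizes differ by at most one.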

**Convert the groups to two different metrics:<br>
The top-scoring group gets an A, the second group a B, and so on through to the sixth group, which receives an F<br>
An A is worth 10 points for their high school application, B gets 8, C gets 6, D gets 4, E gets 2 and F gets 1.**

In [5]:
# Map group numbers to letter grades, then letter grades to application points
grade_letters = {1: 'A', 2: 'B', 3: 'C', 4: 'D', 5: 'E', 6: 'F'}
final_df['Grade'] = final_df['Grade'].map(grade_letters)

grade_points = {'A': 10, 'B': 8, 'C': 6, 'D': 4, 'E': 2, 'F': 1}
final_df['Points'] = final_df['Grade'].map(grade_points)

In [6]:
final_df["Points"]=final_df["Points"].astype(int)
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7000 entries, 0 to 6999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Student ID  7000 non-null   int64 
 1   Subject     7000 non-null   object
 2   Score       7000 non-null   int64 
 3   Grade       7000 non-null   object
 4   Points      7000 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 273.6+ KB


**Determine how many high school application points each student has received across all their subjects <br>
Work out the average total points per student by grade <br>
i.e. for all the students who got an A, how many points did they get across all their subjects?**

In [7]:
# Sum each student's points across all subjects
dfx = final_df.groupby("Student ID")['Points'].sum().reset_index()
dfx.columns = ["Student ID", "Total Points per Student"]
# Attach the per-student total to every result row
final_df = pd.merge(final_df, dfx, on='Student ID')
columns = ["Total Points per Student", "Grade", "Points", "Subject", "Score", "Student ID"]

final_df = final_df[columns]

In [8]:
# Average the per-student totals over all result rows that carry each grade
dfx = final_df.groupby("Grade")['Total Points per Student'].mean().round(decimals=2).reset_index()
dfx.columns = ["Grade", "Avg student total points per grade"]
# Attach the per-grade average to every result row
final_df = pd.merge(final_df, dfx, on='Grade')
columns = ["Avg student total points per grade", "Total Points per Student", "Grade", "Points", "Subject", "Score", "Student ID"]
final_df = final_df[columns]
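
The two group-then-merge steps above could also be written with `groupby(...).transform`, which broadcasts each group's aggregate back onto the original rows without a separate merge. The sketch below uses a made-up toy frame with the same column names. It is an alternative formulation, not the code that produced the outputs in this notebook.

```python
import pandas as pd

# Toy data: three students with two graded results each (values are made up)
toy = pd.DataFrame({
    "Student ID": [1, 1, 2, 2, 3, 3],
    "Grade":      ["A", "B", "A", "C", "B", "C"],
    "Points":     [10, 8, 10, 6, 8, 6],
})

# Total points per student, repeated on each of that student's rows
toy["Total Points per Student"] = toy.groupby("Student ID")["Points"].transform("sum")

# Mean of those totals over all rows that carry a given grade
toy["Avg student total points per grade"] = (
    toy.groupby("Grade")["Total Points per Student"].transform("mean").round(2)
)
print(toy)
```

Because `transform` keeps the original row order and index, the column reordering step is only needed if a specific layout is wanted.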


**Take the average total score you get for students who have received at least one A and remove anyone who scored less than this. <br>
Remove results where students received an A grade <br>
How many students scored more than the average if you ignore their As?**

In [9]:
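# Drop every result graded A
final_df = final_df[final_df['Grade'] != 'A']
# 'Grade' holds letters, so comparing it with 41.08 matches every row and removes nothing
final_df = final_df[final_df['Grade'] != 41.08]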
final_df

Unnamed: 0,Avg student total points per grade,Total Points per Student,Grade,Points,Subject,Score,Student ID
1167,39.00,34,B,8,English,88,773
1168,39.00,45,B,8,Art,93,11
1169,39.00,45,B,8,History,87,11
1170,39.00,41,B,8,History,93,635
1171,39.00,41,B,8,English,92,635
...,...,...,...,...,...,...,...
6995,31.69,19,F,1,Geography,60,723
6996,31.69,10,F,1,Maths,66,118
6997,31.69,10,F,1,Art,62,118
6998,31.69,10,F,1,Science,61,118
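
The cell above compares the `Grade` column with 41.08, so rows with totals below that value (for example 34 or 10) are still present in the output. One possible reading of the three requirements is sketched below on made-up toy data: compute the average total for students with at least one A, keep students whose total is at least that average, drop their A rows, and count the remaining students. Counting each student once in the average and the variable names `a_ids`, `totals`, `threshold` and `kept` are assumptions of this sketch. That 41.08 was meant as this threshold is also only a guess.

```python
import pandas as pd

# Toy data with the per-student total already attached (values are made up)
toy = pd.DataFrame({
    "Student ID": [1, 1, 2, 2, 3, 3],
    "Grade":      ["A", "B", "A", "C", "B", "C"],
    "Points":     [10, 8, 10, 6, 8, 6],
    "Total Points per Student": [18, 18, 16, 16, 14, 14],
})

# Students with at least one A, and one total per student
a_ids = toy.loc[toy["Grade"] == "A", "Student ID"].unique()
totals = toy.drop_duplicates("Student ID").set_index("Student ID")["Total Points per Student"]

# Average total among those students (each student counted once)
threshold = totals.loc[a_ids].mean()

# Keep students at or above the threshold, then drop their A results
kept = toy[toy["Total Points per Student"] >= threshold]
kept = kept[kept["Grade"] != "A"]

# Number of distinct students left
n_students = kept["Student ID"].nunique()
print(threshold, n_students)
```

Note that the `groupby("Grade")` average computed earlier weights a student by the number of rows they have in that grade, so it can differ from this per-student average when a student holds several As.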


In [10]:
#Output the data 
final_df.to_csv('PD 2022 Week 5 Output.csv', index=False)


In [11]:
print("Data Prepped!")

Data Prepped!
