# Week 11 Activity

A local school district aims to reach a 95% graduation rate by the end of the
decade by identifying students who need intervention before they drop out of
school. As a software engineer contacted by the school district, your task is to
model the factors that predict how likely a student is to pass their high school final
exam by building an intervention system based on supervised learning
techniques. The board of supervisors has asked that you find the most effective
model with the lowest computation cost, to save on the budget. You
will need to analyze the dataset on students' performance and develop a model
that predicts whether a given student will pass, indicating whether an intervention is
necessary.

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [13]:
# Load the student performance data (the path is specific to the author's machine).
df = pd.read_csv(r'C:\Users\ADAM\Downloads\student-data.csv')
# Keep the 30 feature columns plus the 'passed' target, in this order.
columns = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
           'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures',
           'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet',
           'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health',
           'absences', 'passed']
df = df[columns]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 31 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [14]:
for i in df.columns:
    print(i,df[i].unique())

school ['GP' 'MS']
sex ['F' 'M']
age [18 17 15 16 19 22 20 21]
address ['U' 'R']
famsize ['GT3' 'LE3']
Pstatus ['A' 'T']
Medu [4 1 3 2 0]
Fedu [4 1 2 3 0]
Mjob ['at_home' 'health' 'other' 'services' 'teacher']
Fjob ['teacher' 'other' 'services' 'health' 'at_home']
reason ['course' 'other' 'home' 'reputation']
guardian ['mother' 'father' 'other']
traveltime [2 1 3 4]
studytime [2 3 1 4]
failures [0 3 2 1]
schoolsup ['yes' 'no']
famsup ['no' 'yes']
paid ['no' 'yes']
activities ['no' 'yes']
nursery ['yes' 'no']
higher ['yes' 'no']
internet ['no' 'yes']
romantic ['no' 'yes']
famrel [4 5 3 1 2]
freetime [3 2 4 1 5]
goout [4 3 2 1 5]
Dalc [1 2 5 3 4]
Walc [1 3 2 4 5]
health [3 5 1 2 4]
absences [ 6  4 10  2  0 16 14  7  8 25 12 54 18 26 20 56 24 28  5 13 15 22  3 21
  1 75 30 19  9 11 38 40 23 17]
passed ['no' 'yes']


In [15]:
# Count students by outcome and compute the overall pass rate.
passed = len(df[df['passed'] == 'yes'])
failed = len(df[df['passed'] == 'no'])
grad_rate = passed / (failed + passed)
print("students passed:", passed)
print("students failed:", failed)
print("pass percent:", grad_rate * 100)

students passed: 265
students failed: 130
pass percent: 67.08860759493672
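
The gap between this pass rate and the district's 95% target can be worked out directly from the printed counts. The sketch below is only illustrative: it treats the exam pass rate as a stand-in for the graduation rate (as the `grad_rate` variable above does), and the names `n_passed`, `n_failed`, and `target_rate` are introduced here, not taken from the notebook.

```python
import math

# Counts copied from the output above.
n_passed = 265
n_failed = 130
n_total = n_passed + n_failed

pass_rate = n_passed / n_total   # observed share of passing students
target_rate = 0.95               # district goal stated in the task description

# Passing students required to reach the target in a cohort of this size,
# rounded up because a fraction of a student cannot pass.
needed = math.ceil(target_rate * n_total)
additional = needed - n_passed

print(f"pass rate: {pass_rate:.3f}")
print(f"additional passes needed to reach 95%: {additional}")
```

In a cohort of this size, the target would mean turning most of the 130 failing outcomes into passes, which is why identifying at-risk students early matters.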


In [16]:
# Rename the two binary columns so that a value of 1 has an obvious meaning.
df.rename(columns={'famsize': 'Big_fam', 'Pstatus': 'PTogether'}, inplace=True)
# Encode binary values as 1/0: yes/no, parents together (T) / apart (A),
# family size greater than 3 (GT3) / at most 3 (LE3).
df = df.replace({'yes': 1, 'no': 0, 'T': 1, 'A': 0, 'GT3': 1, 'LE3': 0})
# One-hot encode the remaining categorical columns in one call.
# dtype=int keeps the 0/1 output shown below on newer pandas versions.
df = pd.get_dummies(
    df,
    columns=['sex', 'school', 'address', 'Mjob', 'Fjob', 'reason', 'guardian'],
    prefix={'sex': 'gender', 'school': 'school', 'address': 'address',
            'Mjob': 'Motherjob', 'Fjob': 'Fatherjob',
            'reason': 'reason', 'guardian': 'guardian'},
    dtype=int,
)
df.head(10)

Unnamed: 0,age,Big_fam,PTogether,Medu,Fedu,traveltime,studytime,failures,schoolsup,famsup,...,Fatherjob_other,Fatherjob_services,Fatherjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
0,18,1,0,4,4,2,2,0,1,0,...,0,0,1,1,0,0,0,0,1,0
1,17,1,1,1,1,1,2,0,0,1,...,1,0,0,1,0,0,0,1,0,0
2,15,0,1,1,1,1,2,3,1,0,...,1,0,0,0,0,1,0,0,1,0
3,15,1,1,4,2,1,3,0,0,1,...,0,1,0,0,1,0,0,0,1,0
4,16,1,1,3,3,1,2,0,0,1,...,1,0,0,0,1,0,0,1,0,0
5,16,0,1,4,3,1,2,0,0,1,...,1,0,0,0,0,0,1,0,1,0
6,16,0,1,2,2,1,2,0,0,0,...,1,0,0,0,1,0,0,0,1,0
7,17,1,0,4,4,2,2,0,1,1,...,0,0,1,0,1,0,0,0,1,0
8,15,0,0,3,2,1,2,0,0,1,...,1,0,0,0,1,0,0,0,1,0
9,15,1,1,3,4,1,2,0,0,1,...,1,0,0,0,1,0,0,0,1,0
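
One-hot encoding a two-valued column such as `sex` yields two columns that are exact complements of each other. A possible alternative, sketched below on a small made-up frame (`toy` is not the real dataset), is to pass `drop_first=True` for binary columns so that only one indicator is kept, while multi-valued columns keep all their categories.

```python
import pandas as pd

# Hypothetical toy data that mimics two of the categorical columns.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "Mjob": ["teacher", "other", "health", "other"],
})

# Binary column: drop_first keeps a single indicator (gender_M) instead of two.
encoded = pd.get_dummies(toy, columns=["sex"], prefix="gender",
                         drop_first=True, dtype=int)

# Multi-valued column: keep one indicator per category.
encoded = pd.get_dummies(encoded, columns=["Mjob"], prefix="Motherjob", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Whether the redundant columns matter depends on the model; tree-based classifiers are largely indifferent to them, so this is a tidiness choice more than a correctness fix.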


In [20]:
from sklearn.model_selection import train_test_split
# Separate the target from the features.
df_target = df['passed']
df_feature = df.drop(['passed'], axis=1)
# Use 300 samples for training and the remainder for testing.
train = 300
test = len(df.index) - train
# Stratify on the target so both splits keep a similar pass/fail ratio.
X_train, X_test, y_train, y_test = train_test_split(df_feature, df_target, stratify=df_target, test_size=test, random_state=42)
print("Training set =", X_train.shape[0], " samples.")
print("Testing set =", X_test.shape[0], " samples.")

Training set = 300  samples.
Testing set = 95  samples.
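
The `stratify` argument is what keeps the pass/fail ratio similar in both splits. A quick way to check this is sketched below using synthetic labels with the same class counts as the dataset (265 pass, 130 fail); the placeholder feature `X` is an assumption made only so the example runs without the CSV file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's class counts: 265 pass (1), 130 fail (0).
y = np.array([1] * 265 + [0] * 130)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=95, random_state=42
)

# With stratification, all three shares should be close to each other.
print("overall pass share:", y.mean())
print("train pass share:  ", y_train.mean())
print("test pass share:   ", y_test.mean())
```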


In [21]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score
DesTr = DecisionTreeClassifier(random_state=42)
SvM = SVC(random_state=42)

DesTr.fit(X_train, y_train)
DesTr_y_pred_train = DesTr.predict(X_train)
DesTr_y_pred_test = DesTr.predict(X_test)
DesTr_train_score = f1_score(y_train.values, DesTr_y_pred_train)
DesTr_test_score = f1_score(y_test.values, DesTr_y_pred_test)
print("DecisionTreeClassifier: F1 score{train:",DesTr_train_score, ", test:",DesTr_test_score,"}")

SvM.fit(X_train, y_train)
SvM_y_pred_train = SvM.predict(X_train)
SvM_y_pred_test = SvM.predict(X_test)
SvM_train_score = f1_score(y_train.values, SvM_y_pred_train)
SvM_test_score = f1_score(y_test.values, SvM_y_pred_test) 
print("SVC: F1 score{train:",SvM_train_score, ", test:",SvM_test_score,"}")

DecisionTreeClassifier: F1 score{train: 1.0 , test: 0.6890756302521008 }
SVC: F1 score{train: 0.804 , test: 0.8050314465408805 }
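
An F1 score is easier to interpret next to a trivial baseline. The sketch below computes the F1 of a rule that predicts "pass" for every student, assuming the 95-sample test set contains 64 passing and 31 failing students. That composition is an assumption inferred from the 265/130 class counts and the stratified split; the notebook does not print it. The helper `f1_from_counts` is defined here for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 expressed in confusion-matrix counts: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Assumed test-set composition (not printed by the notebook).
n_pos, n_neg = 64, 31

# Predicting "pass" for everyone: every passing student is a true positive,
# every failing student is a false positive, and there are no false negatives.
baseline_f1 = f1_from_counts(tp=n_pos, fp=n_neg, fn=0)
print("always-pass baseline F1:", baseline_f1)
```

Under that assumption the baseline is 128/159, about 0.805, which matches the SVC test score reported above. This suggests the SVC with default settings may be doing little more than predicting the majority class, so its higher F1 relative to the decision tree should be read with caution. The decision tree's training score of 1.0 against a test score of about 0.69 points to overfitting instead.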
