# Unit 1 Assignment

In this assignment, we will focus on education. The dataset contains data about high school students, and each row represents a single student. The school administrators want to predict each student's cumulative GPA (CGPA) at the time of graduation so that they can intervene early for struggling students.

## Description of Variables

The variables are described in "High School - Data Dictionary.docx".

## Goal

Use the **high_school.csv** data set and build a model to predict **CGPA**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Section 1: (6 points in total)

### Setup - Get the Data - Split into Train/Test - Check for missing values

In [1]:
# Import common libraries to use
import numpy as np
import pandas as pd

# Set random seed for repeatability
np.random.seed(142)

In [2]:
# Ingest the data into Python and take a look at the top 5 rows
HSinfo = pd.read_csv("high_school.csv")
HSinfo.head()

|   | Gender | ParentEdu | ParentMaritalStatus | ExtraCurricular | IsFirstChild | Siblings | Transportation | AvgReadingScore | AvgWritingScore | traveltime | studytime | internet | freetime | absences | CGPA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | bachelor's degree | married | regularly | yes | 3.0 | school_bus | 71 | 74 | 2 | 2 | no | 3 | 6 | C |
| 1 | female | some college | married | sometimes | yes | 0.0 | NaN | 90 | 88 | 1 | 2 | yes | 3 | 4 | D |
| 2 | female | master's degree | single | sometimes | yes | 4.0 | school_bus | 93 | 91 | 1 | 2 | yes | 3 | 10 | B |
| 3 | male | associate's degree | married | never | no | 1.0 | NaN | 56 | 42 | 1 | 3 | yes | 2 | 2 | F |
| 4 | male | some college | married | sometimes | yes | 0.0 | school_bus | 78 | 75 | 1 | 2 | no | 3 | 4 | C |


In [3]:
# Import scikit-learn train-test-split function, then execute on the data
from sklearn.model_selection import train_test_split

train, test = train_test_split(HSinfo, test_size=0.3)
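
The split above relies on the global NumPy seed for repeatability. A minimal sketch, assuming one would rather pin the split itself and keep the class proportions similar in both parts, passes `random_state` and `stratify` directly. The toy frame and its values are hypothetical and are not taken from high_school.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for HSinfo
toy = pd.DataFrame({
    "score": range(20),
    "CGPA": ["A"] * 4 + ["C"] * 8 + ["F"] * 8,
})

# random_state pins this split regardless of the global seed;
# stratify keeps the CGPA class proportions similar in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=142, stratify=toy["CGPA"]
)

print(len(toy_train), len(toy_test))
print(toy_test["CGPA"].value_counts().sort_index())
```

Whether stratification is appropriate for this assignment is a judgment call; it mainly helps when some grades (such as A) are rare.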

In [4]:
# Check for missing values in training set to assess how to clean the data
train.isna().sum()

Gender                   0
ParentEdu               95
ParentMaritalStatus     65
ExtraCurricular         32
IsFirstChild            50
Siblings                75
Transportation         186
AvgReadingScore          0
AvgWritingScore          0
traveltime               0
studytime                0
internet                 0
freetime                 0
absences                 0
CGPA                     0
dtype: int64

In [5]:
# Check for missing values in testing set to assess how to clean the data
test.isna().sum()

Gender                  0
ParentEdu              38
ParentMaritalStatus    38
ExtraCurricular        15
IsFirstChild           34
Siblings               43
Transportation         80
AvgReadingScore         0
AvgWritingScore         0
traveltime              0
studytime               0
internet                0
freetime                0
absences                0
CGPA                    0
dtype: int64

## Data Prep (5.5 points)

In [6]:
# Import more scikit-learn classes to use for data cleaning
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

### Separate the target variable

In [7]:
# Identify the dependent variable for train and test
train_y = train['CGPA']
test_y = test['CGPA']

# Create the independent variables by removing the dependent variable for train and test
train_inputs = train.drop(['CGPA'], axis=1)
test_inputs = test.drop(['CGPA'], axis=1)

###  Identify the numerical and categorical columns

In [8]:
# Identify the datatypes in the training data
train_inputs.dtypes

Gender                  object
ParentEdu               object
ParentMaritalStatus     object
ExtraCurricular         object
IsFirstChild            object
Siblings               float64
Transportation          object
AvgReadingScore          int64
AvgWritingScore          int64
traveltime               int64
studytime                int64
internet                object
freetime                 int64
absences                 int64
dtype: object

In [9]:
# Identify the numerical columns (no binary to worry about)
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [10]:
# View numeric columns
numeric_columns

['Siblings',
 'AvgReadingScore',
 'AvgWritingScore',
 'traveltime',
 'studytime',
 'freetime',
 'absences']

In [11]:
# View categorical columns
categorical_columns

['Gender',
 'ParentEdu',
 'ParentMaritalStatus',
 'ExtraCurricular',
 'IsFirstChild',
 'Transportation',
 'internet']

### Create a Pipeline

In [12]:
# Create a numeric pipeline, impute the mean value in null fields, standardize the numbers
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='mean')),
                ('scaler', StandardScaler())])

In [13]:
# Create a categorical pipeline, impute 'unknown' in null fields, implement one-hot encoding
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [14]:
# Assign the new pipelines to the data columns
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)],
        remainder='drop')

### Transform the TRAIN data

In [15]:
# Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

# View the transformed data
train_x

array([[-1.47347293, -0.730908  , -1.01430911, ...,  0.        ,
         1.        ,  0.        ],
       [-0.09979351,  0.01102444,  0.13593754, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.5870462 , -1.47284044, -1.26991947, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 2.64756532,  0.48316326,  0.4554505 , ...,  0.        ,
         0.        ,  1.        ],
       [ 0.5870462 ,  1.62978612,  1.22228159, ...,  0.        ,
         1.        ,  0.        ],
       [-0.78663322, -0.52856279, -0.56699097, ...,  0.        ,
         0.        ,  1.        ]])

In [16]:
# View the shape of the transformed training data
train_x.shape

(1658, 33)
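
The 33 columns come from the 7 scaled numeric columns plus one indicator column per category level (including the imputed 'unknown' level). The sketch below reproduces the same pipeline layout on a tiny, hypothetical frame to show how the column count arises and what `handle_unknown='ignore'` does with a category that was never seen during fitting. The toy values and the `sparse_threshold=0` setting (used only to force a dense, printable array) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, not taken from high_school.csv
toy_train = pd.DataFrame({
    "Siblings": [0.0, 2.0, np.nan, 4.0],
    "Transportation": ["school_bus", np.nan, "private", "school_bus"],
})
toy_test = pd.DataFrame({
    "Siblings": [1.0],
    "Transportation": ["bicycle"],  # category never seen during fit
})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(
    [("num", numeric, ["Siblings"]),
     ("cat", categorical, ["Transportation"])],
    sparse_threshold=0,  # force a dense array so it prints readably
)

# Fit on the training rows only, then reuse the fitted transformer on test rows
toy_train_x = toy_preprocessor.fit_transform(toy_train)
toy_test_x = toy_preprocessor.transform(toy_test)

print(toy_train_x.shape)  # 1 numeric column + 3 one-hot columns
print(toy_test_x)         # the unseen category encodes as all zeros
```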

### Transform the TEST data

In [17]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

# View the transformed data
test_x

array([[-0.09979351,  0.68550847,  0.64715827, ...,  0.        ,
         0.        ,  1.        ],
       [-0.78663322,  0.34826645, -0.05577024, ...,  1.        ,
         0.        ,  1.        ],
       [-0.78663322, -0.730908  , -1.33382206, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.78663322,  0.14592124, -0.05577024, ...,  0.        ,
         0.        ,  1.        ],
       [-1.47347293,  0.88785368,  0.32764531, ...,  0.        ,
         0.        ,  1.        ],
       [-0.78663322, -1.00070161, -1.26991947, ...,  0.        ,
         1.        ,  0.        ]])

In [18]:
# View the shape of the transformed testing data
test_x.shape

(711, 33)

## Find the Baseline (0.5 point)

In [19]:
# Import another scikit-learn function
from sklearn.dummy import DummyClassifier

# Instantiate it with the "most_frequent" strategy. 
dummy_clf = DummyClassifier(strategy="most_frequent")

# Fit the model. This finds the most frequent (i.e., majority) class in the training set.
dummy_clf.fit(train_x, train_y)

In [20]:
# Import another scikit-learn function
from sklearn.metrics import accuracy_score

# Call the predict function of the classifier. This predicts all values as the majority class.
dummy_train_pred = dummy_clf.predict(train_x)

# Compare the predicted values with the actual values to calculate accuracy
baseline_train_acc = accuracy_score(train_y, dummy_train_pred)

# Show the baseline Train Accuracy
print('Baseline Train Accuracy: {}'.format(baseline_train_acc))

Baseline Train Accuracy: 0.32448733413751507


In [21]:
# We repeat the same steps for the test set
dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_y, dummy_test_pred)

# Show the baseline Test Accuracy
print('Baseline Test Accuracy: {}'.format(baseline_test_acc))

Baseline Test Accuracy: 0.3459915611814346
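
The "most_frequent" baseline simply predicts the majority class of the training labels for every row. A minimal pure-Python sketch of that idea follows; the helper name `majority_baseline_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Predict the most common training label everywhere and score it."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_class)
    return majority_class, hits / len(eval_labels)

# Hypothetical toy labels, not taken from high_school.csv
toy_train_labels = ["F", "F", "C", "D", "F", "B"]
toy_test_labels = ["F", "C", "F", "A"]

print(majority_baseline_accuracy(toy_train_labels, toy_test_labels))
# ('F', 0.5)
```

The baseline accuracy on any set therefore equals the share of that set belonging to the training majority class.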


# Section 2: (3 points in total)

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)



## SVM Model 1:

In [22]:
# Import the SVC class from scikit-learn
from sklearn.svm import SVC

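# Create a multiclass linear SVM (decision_function_shape='ovr' only sets the shape of the
# decision function; SVC still trains one-vs-one classifiers internally)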
svm_clf = SVC(kernel="linear", C=100, decision_function_shape='ovr')

# Fit the model to the data
svm_clf.fit(train_x, train_y)

In [23]:
#Predict the train values
train_y_pred = svm_clf.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.6743063932448733

In [24]:
#Predict the test values
test_y_pred = svm_clf.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.639943741209564

In [25]:
# Import the confusion matrix function from scikit-learn
from sklearn.metrics import confusion_matrix

# Create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[ 19,  13,   4,   0,   0],
       [ 10,  54,  21,   4,   0],
       [  4,  27,  99,  46,   1],
       [  1,   2,  48,  83,  29],
       [  0,   0,   2,  44, 200]], dtype=int64)

In [26]:
# Import the classification report function from scikit-learn
from sklearn.metrics import classification_report

# Create the classification report on test set
print(classification_report(test_y, test_y_pred))

              precision    recall  f1-score   support

           A       0.56      0.53      0.54        36
           B       0.56      0.61      0.58        89
           C       0.57      0.56      0.56       177
           D       0.47      0.51      0.49       163
           F       0.87      0.81      0.84       246

    accuracy                           0.64       711
   macro avg       0.61      0.60      0.60       711
weighted avg       0.65      0.64      0.64       711
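
The accuracy, recall, and precision figures in the report can be recomputed by hand from the confusion matrix printed above (rows are true classes, columns are predicted classes). The sketch below does this in plain Python; the variable names are illustrative.

```python
# Confusion matrix copied from the output above; rows = true, columns = predicted
labels = ["A", "B", "C", "D", "F"]
cm = [
    [19, 13, 4, 0, 0],
    [10, 54, 21, 4, 0],
    [4, 27, 99, 46, 1],
    [1, 2, 48, 83, 29],
    [0, 0, 2, 44, 200],
]

n = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))  # diagonal = correct predictions
accuracy = correct / total

# Recall: diagonal divided by the row sum (all rows truly in that class)
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(n)}
# Precision: diagonal divided by the column sum (all rows predicted as that class)
precision = {labels[j]: cm[j][j] / sum(row[j] for row in cm) for j in range(n)}

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
for label in labels:
    print(f"{label}: precision={precision[label]:.2f} recall={recall[label]:.2f}")
```

Most of the errors sit next to the diagonal, meaning the model usually confuses a grade with a neighboring grade rather than a distant one.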



### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

#### The gap between train (0.674) and test (0.640) accuracy is only about 3.4 percentage points, which is not much overfitting, but I'll lower C anyhow

In [27]:
# Lower the C value to determine impact of potential overfitting
svm_clf = SVC(kernel="linear", C=1, decision_function_shape='ovr')

svm_clf.fit(train_x, train_y)

In [28]:
#Predict the train values
train_y_pred = svm_clf.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.6737032569360676

In [29]:
#Predict the test values
test_y_pred = svm_clf.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.6413502109704642
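
To make the overfitting check explicit, the train/test gap can be expressed in percentage points. The sketch below uses the accuracies printed above for the linear SVM; the helper name `accuracy_gap_points` is hypothetical.

```python
def accuracy_gap_points(train_acc, test_acc):
    """Return train accuracy minus test accuracy, in percentage points."""
    return (train_acc - test_acc) * 100

# Accuracies copied from the cell outputs above
gap_c100 = accuracy_gap_points(0.6743063932448733, 0.639943741209564)
gap_c1 = accuracy_gap_points(0.6737032569360676, 0.6413502109704642)

print(f"C=100: {gap_c100:.1f} points, C=1: {gap_c1:.1f} points")
# C=100: 3.4 points, C=1: 3.2 points
```

Lowering C from 100 to 1 barely changes either score, which suggests the linear model was not overfitting much to begin with.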

## SVM Model 2:

In [30]:
# Create a multiclass polynomial-kernel SVM (degree 3)
pol_svm2 = SVC(kernel="poly", degree=3, coef0=1, C=10, decision_function_shape='ovr')

# Fit the model to the data
pol_svm2.fit(train_x, train_y)

In [31]:
#Predict the train values
train_y_pred = pol_svm2.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.97708082026538

In [32]:
#Predict the test values
test_y_pred = pol_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.5414908579465542

In [33]:
# Create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[ 17,  13,   6,   0,   0],
       [ 13,  45,  24,   7,   0],
       [  6,  51,  67,  49,   4],
       [  3,  10,  53,  65,  32],
       [  0,   0,  11,  44, 191]], dtype=int64)

In [34]:
# Create the classification report on test set
print(classification_report(test_y, test_y_pred))

              precision    recall  f1-score   support

           A       0.44      0.47      0.45        36
           B       0.38      0.51      0.43        89
           C       0.42      0.38      0.40       177
           D       0.39      0.40      0.40       163
           F       0.84      0.78      0.81       246

    accuracy                           0.54       711
   macro avg       0.49      0.51      0.50       711
weighted avg       0.55      0.54      0.55       711



### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

#### Yes, train accuracy (0.977) exceeds test accuracy (0.541) by about 44 percentage points, which is clearly overfitting

In [35]:
# Adjust the C value lower to decrease overfitting
pol_svm2 = SVC(kernel="poly", degree=3, coef0=1, C=.05, decision_function_shape='ovr')

# Fit the model to the data
pol_svm2.fit(train_x, train_y)

In [36]:
#Predict the train values
train_y_pred = pol_svm2.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.7002412545235223

In [37]:
#Predict the test values
test_y_pred = pol_svm2.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.6019690576652602

## SVM Model 3:

In [38]:
# Create a multiclass RBF-kernel SVM
rbf_svm = SVC(kernel="rbf", C=10, gamma=0.1, decision_function_shape='ovr')

# Fit the model to the data
rbf_svm.fit(train_x, train_y)

In [39]:
#Predict the train values
train_y_pred = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.982509047044632

In [40]:
#Predict the test values
test_y_pred = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.5513361462728551

In [41]:
# Create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[ 15,  15,   6,   0,   0],
       [  6,  48,  24,  11,   0],
       [  5,  50,  63,  54,   5],
       [  2,   7,  54,  68,  32],
       [  0,   0,   7,  41, 198]], dtype=int64)

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

#### Yes, train accuracy (0.983) exceeds test accuracy (0.551) by about 43 percentage points, which is overfitting

In [42]:
# Import what is needed for a randomized hyperparameter search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Define the sampling ranges; uniform(loc, scale) draws from [loc, loc + scale],
# so C is sampled from [0.01, 2.01] and gamma from [0.01, 0.21]
param_distribs = {
        'C': uniform(.01, 2),
        'gamma': uniform(0.01, 0.2),    
    }

rbf_svm = SVC(kernel="rbf", decision_function_shape='ovr')

# Sample 10 candidate settings (n_iter=10), each evaluated with 5-fold CV, for 50 fits in total
rbf_search = RandomizedSearchCV(rbf_svm, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='accuracy', random_state=42)

rbf_search.fit(train_x, train_y)
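
A detail that is easy to misread: `scipy.stats.uniform(loc, scale)` takes a starting point and a width, not a lower and an upper bound. The short check below shows the actual ranges for the two distributions used in the search; the variable names are illustrative.

```python
from scipy.stats import uniform

# uniform(loc, scale) is supported on [loc, loc + scale]
c_dist = uniform(0.01, 2)
gamma_dist = uniform(0.01, 0.2)

# ppf(0) and ppf(1) give the lower and upper ends of the support
c_low, c_high = c_dist.ppf(0.0), c_dist.ppf(1.0)
g_low, g_high = gamma_dist.ppf(0.0), gamma_dist.ppf(1.0)

print(f"C range:     [{c_low:.2f}, {c_high:.2f}]")
print(f"gamma range: [{g_low:.2f}, {g_high:.2f}]")
```

This is why sampled gamma values slightly above 0.2 can appear in the search results.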

In [43]:
# Show the mean cross-validated accuracy for each sampled setting
cvres = rbf_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)

0.5663596986131838 {'C': 0.759080237694725, 'gamma': 0.20014286128198325}
0.5802315000181997 {'C': 1.4739878836228102, 'gamma': 0.12973169683940733}
0.5711644159720454 {'C': 0.32203728088487305, 'gamma': 0.041198904067240534}
0.4143668328904743 {'C': 0.12616722433639893, 'gamma': 0.18323522915498705}
0.5790285007097878 {'C': 1.2122300234864176, 'gamma': 0.1516145155592091}
0.3256906781203363 {'C': 0.051168988591604896, 'gamma': 0.20398197043239888}
0.6097786918065009 {'C': 1.6748852816008435, 'gamma': 0.052467822135655234}
0.5771921522949806 {'C': 0.37364993441420125, 'gamma': 0.046680901970686764}
0.5790194008663051 {'C': 0.6184844859190755, 'gamma': 0.11495128632644756}
0.5977050194736651 {'C': 0.8738900372842315, 'gamma': 0.06824582803960838}
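
In the results above, the highest mean cross-validated accuracy is about 0.610 (C ≈ 1.67, gamma ≈ 0.052). Rather than scanning the printout, the winning setting can be read from the fitted search object. The sketch below shows this on synthetic data, which stands in for `train_x` and `train_y` purely for illustration; the smaller `n_iter` and `cv` values are also assumptions chosen to keep it quick.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for train_x / train_y (assumption for illustration)
X, y = make_classification(n_samples=150, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions={"C": uniform(0.01, 2), "gamma": uniform(0.01, 0.2)},
    n_iter=5, cv=3, scoring="accuracy", random_state=42,
)
search.fit(X, y)

# best_score_ is the highest mean cross-validated accuracy among the candidates
print(search.best_params_)
print(search.best_score_)

# With the default refit=True, best_estimator_ is already refit on all of X, y,
# so its training accuracy can be computed directly
print(search.best_estimator_.score(X, y))
```

Computing the training accuracy of `best_estimator_` this way would also let the tuned model's train and test scores be compared side by side.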


In [44]:
# Retrieve the best model found by the search (already refit on the full training set)
final_model = rbf_search.best_estimator_

test_predictions = final_model.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_predictions)

0.6033755274261603

# Section 3: (3 points in total)

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRegression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)

## SGD Model 1:

In [45]:
# Import the SGDClassifier class from scikit-learn
from sklearn.linear_model import SGDClassifier 

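# Define the model hyperparameters
# Note: the default loss is 'hinge' (a linear SVM), and eta0 is not used by the
# default learning_rate='optimal' schedule
sgd_logreg = SGDClassifier(max_iter=500, penalty=None, eta0=.07)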

# Fit the model to the data
sgd_logreg.fit(train_x, train_y)
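
In scikit-learn, `eta0` only takes effect when `learning_rate` is set to 'constant', 'invscaling', or 'adaptive'. A minimal sketch, assuming the intent was to control the step size, sets the schedule explicitly. The synthetic data, the variable names, and the `random_state` values are assumptions for illustration and are not part of the original model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for train_x / train_y (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# learning_rate='constant' makes eta0 the actual step size used in every update
sgd_constant = SGDClassifier(
    loss="hinge",              # default loss: a linear SVM
    penalty=None,
    learning_rate="constant",
    eta0=0.07,
    max_iter=500,
    random_state=0,
)
sgd_constant.fit(X, y)

print(sgd_constant.score(X, y))
```

Changing the schedule would change the accuracies reported in this notebook, so this is a variant to experiment with rather than a drop-in replacement.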

In [46]:
#Predict the train values
train_y_pred = sgd_logreg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.5518697225572979

In [47]:
#Predict the test values
test_y_pred = sgd_logreg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.49226441631504925

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

Train accuracy (0.552) is about 6 percentage points above test accuracy (0.492), so this model is not overfitting substantially.

## SGD Model 2:

In [48]:
# Define the model hyperparameters (elastic-net penalty; loss and learning-rate
# schedule are left at their defaults)
sgd_logreg = SGDClassifier(max_iter=500, penalty='elasticnet', eta0=0.05)

# Fit the model to the data
sgd_logreg.fit(train_x, train_y)

In [49]:
#Predict the train values
train_y_pred = sgd_logreg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.5591073582629674

In [50]:
#Predict the test values
test_y_pred = sgd_logreg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.5527426160337553

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

This model is not overfitting: train accuracy (0.559) and test accuracy (0.553) differ by less than 1 percentage point.

## LogisticRegression Model:

In [51]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Define the model hyperparameters
log_reg = LogisticRegression(solver='saga', penalty='elasticnet', max_iter=1000, l1_ratio=.5)

# Fit the model to the data
log_reg.fit(train_x, train_y)

In [52]:
#Predict the train values
train_y_pred = log_reg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.6700844390832328

In [53]:
#Predict the test values
test_y_pred = log_reg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.6469760900140648

In [54]:
# Create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[ 20,  13,   3,   0,   0],
       [  5,  56,  23,   5,   0],
       [  2,  26,  98,  50,   1],
       [  0,   1,  48,  82,  32],
       [  0,   0,   1,  41, 204]], dtype=int64)

In [55]:
# Create the classification report on test set
print(classification_report(test_y, test_y_pred))

              precision    recall  f1-score   support

           A       0.74      0.56      0.63        36
           B       0.58      0.63      0.61        89
           C       0.57      0.55      0.56       177
           D       0.46      0.50      0.48       163
           F       0.86      0.83      0.84       246

    accuracy                           0.65       711
   macro avg       0.64      0.61      0.63       711
weighted avg       0.65      0.65      0.65       711



### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

The model is not overfitting: train accuracy (0.670) is only about 2.3 percentage points above test accuracy (0.647).

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)

**If the train/test values listed here do not match the outputs of models, you will lose points.**
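
One way to assemble this list is to collect the accuracies printed by the cells above into a small table and sort it by test accuracy. The sketch below uses the final version of each model (after any overfitting correction), rounded to four decimals; the short model labels are my own, and the tuned RBF model has no train accuracy because the notebook did not compute one.

```python
# (model, train accuracy, test accuracy) copied from the cell outputs, rounded to 4 places.
# None marks a value that was not computed in the notebook.
results = [
    ("SVM linear, C=1", 0.6737, 0.6414),
    ("SVM poly, C=0.05", 0.7002, 0.6020),
    ("SVM rbf, randomized search", None, 0.6034),
    ("SGD, no penalty", 0.5519, 0.4923),
    ("SGD, elasticnet", 0.5591, 0.5527),
    ("LogisticRegression, elasticnet", 0.6701, 0.6470),
]

# Sort by test accuracy, highest first
ranked = sorted(results, key=lambda row: row[2], reverse=True)

for name, train_acc, test_acc in ranked:
    train_text = "n/a" if train_acc is None else f"{train_acc:.4f}"
    print(f"{name:32s} train={train_text:>6s} test={test_acc:.4f}")
```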

## Which model performs the best and why? (1 point) 

Hint: The best model is the one that has the best TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## How does your best model compare to the baseline? (1 point)
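
The comparison can be quantified directly from the printed accuracies. The sketch below assumes the logistic regression model is taken as the best one, since it has the highest recorded test accuracy, and compares it with the `DummyClassifier` baseline on the test set.

```python
# Test accuracies copied from the cell outputs above
baseline_test_acc = 0.3459915611814346  # DummyClassifier, most_frequent
best_test_acc = 0.6469760900140648      # LogisticRegression

gain_points = (best_test_acc - baseline_test_acc) * 100
ratio = best_test_acc / baseline_test_acc

print(f"Gain over baseline: {gain_points:.1f} percentage points ({ratio:.2f}x)")
# Gain over baseline: 30.1 percentage points (1.87x)
```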