# Unit 1 Assignment- Aakesh Kumar Murala U63411601

In this assignment, we will focus on education. This dataset contains data about high school students. Each row represents a single student. The school administrators want to predict a student's cumulative GPA at the time of graduation so that they can make interventions for struggling students. The goal is to predict the CGPA of a student. 

## Description of Variables

The description of variables are provided in "High School - Data Dictionary.docx"

## Goal

Use the **high_school.csv** data set and build a model to predict **CGPA**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code is also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Section 1: (6 points in total)

## Data Prep (5.5 points)

In [96]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(42)

In [100]:
#We will predict the "median_house_value" value in the data set:

highschool = pd.read_csv("high_school.csv")
highschool.head()

Unnamed: 0,Gender,ParentEdu,ParentMaritalStatus,ExtraCurricular,IsFirstChild,Siblings,Transportation,AvgReadingScore,AvgWritingScore,traveltime,studytime,internet,freetime,absences,CGPA
0,female,bachelor's degree,married,regularly,yes,3.0,school_bus,71,74,2,2,no,3,6,C
1,female,some college,married,sometimes,yes,0.0,,90,88,1,2,yes,3,4,D
2,female,master's degree,single,sometimes,yes,4.0,school_bus,93,91,1,2,yes,3,10,B
3,male,associate's degree,married,never,no,1.0,,56,42,1,3,yes,2,2,F
4,male,some college,married,sometimes,yes,0.0,school_bus,78,75,1,2,no,3,4,C


In [101]:
# Find the total number of rows

highschool.shape

(2369, 15)

In [102]:
#spiltting the dataset
from sklearn.model_selection import train_test_split

train, test = train_test_split(highschool, test_size=0.3)

In [103]:
# Total missing values in each column train
train.isna().sum()

Gender                   0
ParentEdu               96
ParentMaritalStatus     65
ExtraCurricular         32
IsFirstChild            60
Siblings                86
Transportation         190
AvgReadingScore          0
AvgWritingScore          0
traveltime               0
studytime                0
internet                 0
freetime                 0
absences                 0
CGPA                     0
dtype: int64

In [104]:
# Total missing values in each column test
test.isna().sum()

Gender                  0
ParentEdu              37
ParentMaritalStatus    38
ExtraCurricular        15
IsFirstChild           24
Siblings               32
Transportation         76
AvgReadingScore         0
AvgWritingScore         0
traveltime              0
studytime               0
internet                0
freetime                0
absences                0
CGPA                    0
dtype: int64

In [105]:
# Importing required packages
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [106]:
# Setting target variable
train_target = train[['CGPA']]
test_target = test[['CGPA']]

# removing the target variable from train and test sets respectively
train_inputs = train.drop(['CGPA'], axis=1)
test_inputs = test.drop(['CGPA'], axis=1)

#### Programatically

In [107]:
train_inputs.dtypes

Gender                  object
ParentEdu               object
ParentMaritalStatus     object
ExtraCurricular         object
IsFirstChild            object
Siblings               float64
Transportation          object
AvgReadingScore          int64
AvgWritingScore          int64
traveltime               int64
studytime                int64
internet                object
freetime                 int64
absences                 int64
dtype: object

In [108]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [109]:
numeric_columns

['Siblings',
 'AvgReadingScore',
 'AvgWritingScore',
 'traveltime',
 'studytime',
 'freetime',
 'absences']

In [110]:
categorical_columns

['Gender',
 'ParentEdu',
 'ParentMaritalStatus',
 'ExtraCurricular',
 'IsFirstChild',
 'Transportation',
 'internet']

In [111]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [112]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [113]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)],
        remainder='drop')

##### Transform: fit_transform() for TRAIN

In [114]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[ 0.62025679, -0.39354658, -0.36595899, ...,  0.        ,
         0.        ,  1.        ],
       [-0.77416458,  1.31923225,  1.70799104, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.62025679, -0.73610234, -0.81963556, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 0.62025679, -0.25652427, -0.49558087, ...,  0.        ,
         1.        ,  0.        ],
       [ 1.31746748,  0.15454265,  0.60620509, ...,  0.        ,
         0.        ,  1.        ],
       [-0.07695389,  0.2230538 ,  0.41177227, ...,  0.        ,
         1.        ,  0.        ]])

In [115]:
train_x.shape

(1658, 33)

In [116]:
#Fit and transform the test data
test_x = preprocessor.fit_transform(test_inputs)

test_x

array([[-0.08106718, -0.50478379, -0.6735128 , ...,  0.        ,
         0.        ,  1.        ],
       [-0.767243  , -1.57056336, -1.49230524, ...,  0.        ,
         1.        ,  0.        ],
       [-0.08106718,  1.49355292,  0.96407206, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.29128444,  1.4269417 ,  1.71988045, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.97746025,  1.75999781,  1.40496029, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.60510863, -0.03850522, -0.04367247, ...,  0.        ,
         0.        ,  1.        ]])

In [117]:
test_x.shape

(711, 33)

## Find the Baseline (0.5 point)

In [118]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(train_x, train_target)

In [119]:
from sklearn.metrics import accuracy_score

In [120]:
# This is the baseline Train Accuracy

dummy_train_pred = dummy_clf.predict(train_x)

baseline_train_acc = accuracy_score(train_target, dummy_train_pred)

print('Baseline Train Accuracy: {}' .format(baseline_train_acc))

Baseline Train Accuracy: 0.33715319662243665


In [121]:
# This is the baseline Test Accuracy

dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_target, dummy_test_pred)

print('Baseline Test Accuracy: {}' .format(baseline_test_acc))

Baseline Test Accuracy: 0.31645569620253167


In [27]:
# the Baseline values came around 33.7% and 31.6% for train and test respectively

# Section 2: (3 points in total)

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)



## SVM Model 1: Linear

In [122]:
from sklearn.svm import SVC
 
lin_svm = SVC(kernel="linear")

lin_svm.fit(train_x, np.array(train_target).ravel())

#### Accuracy Scores

In [123]:
from sklearn.metrics import accuracy_score

In [124]:
#Predict the train values
train_target_pred = lin_svm.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.6779252110977081

In [125]:
#Predict the test values
test_target_pred = lin_svm.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.6244725738396625

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

##### Correction Linear

In [126]:
# Reducing Complexity from C=1.0 to C=0.1 removing overfitting and improving accuracy for test

In [127]:
from sklearn.svm import SVC
 
lin_svm1 = SVC(kernel="linear", C=0.1)

lin_svm1.fit(train_x, np.array(train_target).ravel())

In [128]:
from sklearn.metrics import accuracy_score

In [129]:
#Predict the train values
train_target_pred = lin_svm1.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.6634499396863691

In [130]:
#Predict the test values
test_target_pred = lin_svm1.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.6244725738396625

In [82]:
from sklearn.model_selection import cross_val_score

# Assuming you have your X (features) and y (target) data

lin_svm = SVC(kernel="linear", C=0.1)  # Adjust C value as needed

# Perform 5-fold cross-validation
scores = cross_val_score(lin_svm, train_x, train_target, cv=5)

# Print the cross-validation scores
print(scores)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


[0.65361446 0.62349398 0.61445783 0.66163142 0.62839879]


  y = column_or_1d(y, warn=True)


In [85]:
best_accuracy = max(scores)

print(best_accuracy)

0.6616314199395771


## SVM Model 2: Polynomial

In [90]:
from sklearn.svm import SVC

pol_svm = SVC(kernel="poly", degree=3, coef0=1, C=10)

pol_svm.fit(train_x, np.array(train_target).ravel())

##### Accuracy

In [91]:
#Predict the train values
train_target_pred = pol_svm.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.9728588661037394

In [92]:
#Predict the test values
test_target_pred = pol_svm.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.510548523206751

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

##### Correction Polynomial

In [None]:
# Reduced the values of degree to 2, coef0 to 0.1 and C to 0.1 which removed overfitting by a large margin and improved accuracy for test but is not better than linear kernel

In [93]:
from sklearn.svm import SVC

pol_svm1 = SVC(kernel="poly", degree=2, coef0=0.1, C=0.1)

pol_svm1.fit(train_x, np.array(train_target).ravel())

In [94]:
#Predict the train values
train_target_pred = pol_svm1.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.6224366706875754

In [95]:
#Predict the test values
test_target_pred = pol_svm1.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.5527426160337553

## SVM Model 3: Gaussian rbf Kernel

In [306]:
rbf_svm = SVC(kernel="rbf", C=10, gamma='scale')

rbf_svm.fit(train_x, np.array(train_target).ravel())

In [308]:
#Predict the train values
train_target_pred = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.9439083232810616

In [310]:
#Predict the test values
test_target_pred = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.5668073136427567

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

##### Correction rbf

In [None]:
# Changed the value of gamma to 'auto' and reduced the value of C to 0.5 which resulted in removing overfitting and attaining better accuracy

In [None]:
# Note: C value to 0.1 was producing worst as accuracy was around 52%

In [330]:
rbf_svm = SVC(kernel="rbf", C=0.5, gamma='auto')

rbf_svm.fit(train_x, np.array(train_target).ravel())

In [332]:
#Predict the train values
train_target_pred = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_target, train_target_pred)

0.6447527141133896

In [334]:
#Predict the test values
test_target_pred = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_target, test_target_pred)

0.6090014064697609

# Section 3: (3 points in total)

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRregression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)

## SGD Model 1:

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

## SGD Model 2:

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

## LogisticRegression Model:

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)

**If the train/test values listed here do not match the outputs of models, you will lose points.**

## Which model performs the best and why? (1 point) 

Hint: The best model is the one that has the best TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## How does your best model compare to the baseline? (1 point)