This project is a machine learning project to predict student score, compare a linear regression model built from scratch with scikit-learn linear regression model.

In [23]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('data/stud.csv') #read the csv as a pandas dataFrame

In [4]:
print(f'This dataset has {df.shape[0]} samples/observations/rows')
print(f'This dataset has {df.shape[1]} features/columns')


This dataset has 1000 samples/observations/rows
This dataset has 8 features/columns


In [5]:
df.head() #returns the first 5 observations

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [6]:
df.info() #return columns, non-null count and respective data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [7]:
categorical_cols = df.select_dtypes(include='object').columns #selects only categorical columns
count_of_cols = 0 #initialize a counter
for cols in categorical_cols: #loops through the columns
    count_of_cols += 1
    print(f'{cols} - {df[cols].unique()}\n') #returns column names and a list of unique values
print(f'This dataset has {count_of_cols} columns') #returns the number of columns

gender - ['female' 'male']

race_ethnicity - ['group B' 'group C' 'group A' 'group D' 'group E']

parental_level_of_education - ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']

lunch - ['standard' 'free/reduced']

test_preparation_course - ['none' 'completed']

This dataset has 5 columns


- Insight on categorical columns
* There are 5 categorical columns
Gender - sex of the student. {Male, Female}
Race or ethnicity - {Group A - E}
Parental level of Education - {Some high school - master's degree}
lunch - type of lunch taken before the test - {standard, free/reduced}
test preparation course - whether students completed the test preparaton course or not - {none,completed}

In [8]:
numerical_cols = df.select_dtypes(include='number').columns #selects only numerical columns
count_of_cols = 0
for cols in numerical_cols: #loop through columns
    count_of_cols += 1
    print(f'{cols} - Max score : {df[cols].min()} - Min score : {df[cols].max()}') #returns columns with their minimum and maximum values
print('\n')
print(f'This dataset has {count_of_cols} columns') #returns number of numerical columns

math_score - Max score : 0 - Min score : 100
reading_score - Max score : 17 - Min score : 100
writing_score - Max score : 10 - Min score : 100


This dataset has 3 columns


- Insights
There are 3 numerical columns - three score columns : {math,reading and writing scores}
math score: minimum score of 0 and maximum score of 100
reading score: minimum score of 17 and maximum score of 100
writing score: minimum score of 10 and maximum score of 100

In [9]:
df.isnull().sum() # return the sum of missing values for each column

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

- Insight
There are no missing values

In [10]:
df.duplicated() #returns duplicated rows

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Length: 1000, dtype: bool

- Insight
There are no duplicated values

In [11]:
numerical_cols = df.select_dtypes(include='number').columns #selects only numerical columns
for col in numerical_cols:
    q1 = df[col].quantile(0.25) #25th percentile
    q3 = df[col].quantile(0.75) #75th percentile
    iqr = q3 - q1 #inter-quartiile range
    lower_fence = q1 - 1.5 * iqr  #min threshold
    upper_fence = q3 + 1.5 * iqr #max threshold
    outlier = df[(df[col] < lower_fence) | df[col] > upper_fence] # returns rows below lower fence and above upper fence
    print(f'Outliers in {col}')
    print(outlier)
    print('-'*50)

Outliers in math_score
Empty DataFrame
Columns: [gender, race_ethnicity, parental_level_of_education, lunch, test_preparation_course, math_score, reading_score, writing_score]
Index: []
--------------------------------------------------
Outliers in reading_score
Empty DataFrame
Columns: [gender, race_ethnicity, parental_level_of_education, lunch, test_preparation_course, math_score, reading_score, writing_score]
Index: []
--------------------------------------------------
Outliers in writing_score
Empty DataFrame
Columns: [gender, race_ethnicity, parental_level_of_education, lunch, test_preparation_course, math_score, reading_score, writing_score]
Index: []
--------------------------------------------------


- Insights
There are no outliers

In [13]:
df.corr(method='pearson',numeric_only=True)

Unnamed: 0,math_score,reading_score,writing_score
math_score,1.0,0.81758,0.802642
reading_score,0.81758,1.0,0.954598
writing_score,0.802642,0.954598,1.0


- Insights
The numeric columns are closely correlated and equally important 
Therefore, we add all three columns to get a new column -{total_score}

In [None]:
df['total_score'] = df[numerical_cols].sum(axis=1) #returns a new column - total_score - that sums all three scores in the dataset

In [24]:
df['average_score'] = df[numerical_cols].mean(axis=1) #returns a new column - average_socre - that averages the three columns

Dataset insights 
1. Does gender of the student affect performance
2. Does the parent's level of education affect performance
3. Did the test preparation course have an impact on the performance of students
4. Did lunch affect the student's performance
5. Does race and ethnicity affect performance
6. How many students score below 20 in maths, reading or writing
7. How many students score full marks in all three subjects
8. What's the average score of each student
9. Whats the total score of each student
etc.......

In [None]:
y = df['total_score'] # the target output 

In [None]:
cols_to_drop = ['math_score','reading_score','writing_score','total_score','average_score'] #columns to drop from training set
df.drop(columns=cols_to_drop,inplace=True,errors='ignore') #returns a dataFrame with dropped columns

In [36]:
df.head()

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,average_score
0,female,group B,bachelor's degree,standard,none,72.666667
1,female,group C,some college,standard,completed,82.333333
2,female,group B,master's degree,standard,none,92.666667
3,male,group A,associate's degree,free/reduced,none,49.333333
4,male,group C,some college,standard,none,76.333333


In [38]:
y.head()

0    218
1    247
2    278
3    148
4    229
Name: total_score, dtype: int64

In [44]:
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [45]:
preprocessor = ColumnTransformer(transformers=[
    ('onehot',OneHotEncoder(handle_unknown='ignore'),categorical_cols)
])

In [59]:
x = preprocessor.fit_transform(df)

In [47]:
x_train,x_test,y_train,y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)

In [48]:
from linear_regression import LinearRegression

In [51]:
model = LinearRegression(learning_rate=0.1,epochs=1000)
model.fit(x,y)

UFuncTypeError: Cannot cast ufunc 'subtract' output from dtype('O') to dtype('float64') with casting rule 'same_kind'