# Multiple Linear Regression
## Homework 2
## Nicholas Thomson

### Import libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

### 1. Download the dataset
The dataset was downloaded from https://www.kaggle.com/datasets/petermushemi/dataset-for-predicting-the-college-gpa-of-students?select=Gpa.csv

I intend to investigate the relationship between various factors and college GPA.

In [2]:
# Load the dataset
df = pd.read_csv('Academic.csv')

df.head()

Unnamed: 0,Study Hours per Week,Attendance Rate,Major,High School GPA,Extracurricular Activities,Part-Time Job,Library Usage per Week,Online Coursework Engagement,Sleep Hours per Night,College GPA
0,21.95,79.64,Business,2.83,4,No,16.87,11.22,5.64,2.8
1,28.61,50.5,Business,3.26,1,No,6.53,7.79,5.78,2.55
2,24.11,73.79,,3.56,3,No,17.04,8.09,7.1,2.77
3,21.8,85.44,Business,3.7,4,Yes,17.77,14.27,9.97,3.28
4,16.95,52.2,Arts,3.63,4,No,5.33,11.08,6.94,2.59


The dataset includes Study Hours per Week, Attendance Rate, Major, High School GPA, Extracurricular Activities, and several other variables.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Study Hours per Week          2000 non-null   float64
 1   Attendance Rate               1803 non-null   float64
 2   Major                         1941 non-null   object 
 3   High School GPA               2000 non-null   float64
 4   Extracurricular Activities    2000 non-null   int64  
 5   Part-Time Job                 2000 non-null   object 
 6   Library Usage per Week        2000 non-null   float64
 7   Online Coursework Engagement  1978 non-null   float64
 8   Sleep Hours per Night         1841 non-null   float64
 9   College GPA                   2000 non-null   float64
dtypes: float64(7), int64(1), object(2)
memory usage: 156.4+ KB


# 2. Data Preprocessing

The data contains two categorical variables: Major and Part-Time Job. I will convert these variables to dummy variables.

In [4]:
df_dummies = pd.get_dummies(df, dtype=float)
df_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Study Hours per Week          2000 non-null   float64
 1   Attendance Rate               1803 non-null   float64
 2   High School GPA               2000 non-null   float64
 3   Extracurricular Activities    2000 non-null   int64  
 4   Library Usage per Week        2000 non-null   float64
 5   Online Coursework Engagement  1978 non-null   float64
 6   Sleep Hours per Night         1841 non-null   float64
 7   College GPA                   2000 non-null   float64
 8   Major_Arts                    2000 non-null   float64
 9   Major_Business                2000 non-null   float64
 10  Major_Engineering             2000 non-null   float64
 11  Major_Science                 2000 non-null   float64
 12  Part-Time Job_No              2000 non-null   float64
 13  Par

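There is no need to keep two dummy variables for Part-Time Job, since each one is just the opposite of the other, so I will drop one of them. I will also drop all rows that contain NaN values. Note that `get_dummies` encodes a missing Major as zeros in all four Major columns (see row 2 below), so rows with a missing Major are kept.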

In [5]:
df_cleaned = df_dummies.drop(['Part-Time Job_No'],axis=1)
df_cleaned = df_cleaned.dropna()
df_cleaned.head()

Unnamed: 0,Study Hours per Week,Attendance Rate,High School GPA,Extracurricular Activities,Library Usage per Week,Online Coursework Engagement,Sleep Hours per Night,College GPA,Major_Arts,Major_Business,Major_Engineering,Major_Science,Part-Time Job_Yes
0,21.95,79.64,2.83,4,16.87,11.22,5.64,2.8,0.0,1.0,0.0,0.0,0.0
1,28.61,50.5,3.26,1,6.53,7.79,5.78,2.55,0.0,1.0,0.0,0.0,0.0
2,24.11,73.79,3.56,3,17.04,8.09,7.1,2.77,0.0,0.0,0.0,0.0,0.0
3,21.8,85.44,3.7,4,17.77,14.27,9.97,3.28,0.0,1.0,0.0,0.0,1.0
4,16.95,52.2,3.63,4,5.33,11.08,6.94,2.59,1.0,0.0,0.0,0.0,0.0
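The encoding and cleaning steps above can be checked on a small example. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the dataset); it shows that a missing category becomes a row of zeros rather than NaN, so `dropna()` only removes rows with missing numeric values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the two categorical columns in the dataset.
toy = pd.DataFrame({
    "Major": ["Business", np.nan, "Arts", "Science"],
    "Part-Time Job": ["No", "No", "Yes", "No"],
    "Sleep Hours per Night": [5.64, 7.10, np.nan, 6.94],
})

# One-hot encode the categorical columns, then drop the redundant "No" dummy.
encoded = pd.get_dummies(toy, dtype=float).drop(columns=["Part-Time Job_No"])

# Row 1 has a missing Major: all of its Major_* dummies are 0.0, not NaN.
major_cols = [c for c in encoded.columns if c.startswith("Major_")]
print(encoded.loc[1, major_cols])

# dropna() therefore removes only row 2, where the numeric column is missing.
cleaned = encoded.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

If a missing Major should be treated as its own category, `pd.get_dummies(..., dummy_na=True)` adds an explicit indicator column for it; whether that is appropriate here is a modeling choice.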


# Select dependent variable and independent variables
The dependent variable in this case is College GPA. The rest of the variables are the independent variables.

In [6]:
Y = df_cleaned['College GPA']
X = df_cleaned.drop(['College GPA'],axis=1) # Select all variables except for College GPA

# Split the dataset into training and testing set
The test set is 20% of the data. The remaining 80% is used for training. Rows are assigned at random, with a fixed `random_state` so the split is reproducible.

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# 3. Perform Multiple Linear Regression
Create and fit the multiple linear regression model

In [8]:
# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, Y_train)

Make predictions

In [9]:
Y_pred = model.predict(X_test)

# 4/5. Interpret/Evaluate the Model

In [10]:
mse = mean_squared_error(Y_test, Y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(Y_test, Y_pred)

print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
print(f'R-squared (R²): {r2:.2f}')

Mean Squared Error (MSE): 0.00
Root Mean Squared Error (RMSE): 0.00
R-squared (R²): 1.00


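The R-squared rounds to 1.00 and the RMSE rounds to 0.00, which means the model explains essentially all of the variance in College GPA on the test set. A fit this close to perfect is unusual for real-world data and suggests that College GPA may have been generated from a linear formula of the predictors.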
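To see how an R-squared this close to 1 can arise, it helps to compute it by hand. The sketch below uses synthetic data only: the predictor `x`, the slope of 0.025, and the noise level are assumptions chosen for illustration. It compares a target built from an exact linear rule (rounded to two decimals) with the same rule plus random noise.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=500)  # hypothetical predictor

targets = {
    # Exact linear rule, rounded to two decimals like a reported GPA.
    "exact": np.round(1.0 + 0.025 * x, 2),
    # Same rule with added noise, closer to what real data looks like.
    "noisy": 1.0 + 0.025 * x + rng.normal(0, 0.3, size=500),
}

scores = {}
for name, y in targets.items():
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
    scores[name] = r_squared(y, intercept + slope * x)
    print(f"{name}: R^2 = {scores[name]:.4f}")
```

In this sketch, only rounding error keeps the "exact" target from a perfect score, while the noisy target scores far lower. That pattern is consistent with, but does not prove, the idea that the dataset here is synthetic.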

We should also look at the coefficients of the model.

In [11]:
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})  
print(coefficients) 

                         Feature  Coefficient
0           Study Hours per Week     0.025001
1                Attendance Rate     0.010004
2                High School GPA     0.250236
3     Extracurricular Activities    -0.000084
4         Library Usage per Week     0.000016
5   Online Coursework Engagement     0.066678
6          Sleep Hours per Night    -0.000042
7                     Major_Arts    -0.000050
8                 Major_Business    -0.000341
9              Major_Engineering     0.000037
10                 Major_Science     0.000049
11             Part-Time Job_Yes    -0.000278


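High School GPA has the largest coefficient (about 0.25 College GPA points per High School GPA point), followed by Online Coursework Engagement, Study Hours per Week, and Attendance Rate. The remaining coefficients are close to zero. Each coefficient is expressed per unit of its own variable, so the magnitudes are not directly comparable across variables, and scikit-learn does not report statistical significance. Significance is checked with statsmodels below.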
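Raw coefficients depend on the units of each variable, so a common way to compare them is to express each one per standard deviation of its variable. The following is a minimal sketch on synthetic data; the column names are borrowed for readability, and the value ranges and generating rule are assumptions, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 1000

# Hypothetical features on very different scales.
X = pd.DataFrame({
    "Attendance Rate": rng.uniform(50, 100, n),   # spans about 50 units
    "High School GPA": rng.uniform(2.0, 4.0, n),  # spans about 2 units
})

# Assumed generating rule, for illustration only.
y = (0.01 * X["Attendance Rate"]
     + 0.25 * X["High School GPA"]
     + rng.normal(0, 0.05, n))

model = LinearRegression().fit(X, y)

# Raw coefficient: change in y per one unit of the feature.
raw = pd.Series(model.coef_, index=X.columns)
# Scaled coefficient: change in y per one standard deviation of the feature.
per_sd = pd.Series(model.coef_ * X.std().to_numpy(), index=X.columns)

print(pd.DataFrame({"raw": raw, "per_sd": per_sd}))
```

In this sketch the raw coefficient of High School GPA is roughly 25 times that of Attendance Rate, yet the per-standard-deviation effects come out about equal, because Attendance Rate varies over a much wider range.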

In [12]:
import statsmodels.api as sm

In [13]:
# Add a constant term to the independent variables matrix for the intercept
X = sm.add_constant(X)

# Fit the multiple linear regression model on the full cleaned dataset
# (all rows of X and Y, not only the training split)
model = sm.OLS(Y, X).fit()

In [14]:
summary = model.summary()
print(summary)

                            OLS Regression Results                            
Dep. Variable:            College GPA   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.296e+06
Date:                Sun, 04 Feb 2024   Prob (F-statistic):               0.00
Time:                        17:24:33   Log-Likelihood:                 7249.4
No. Observations:                1638   AIC:                        -1.447e+04
Df Residuals:                    1625   BIC:                        -1.440e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const           

# 6. Provide recommendations or insights

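**R-squared (R²):** The R-squared and adjusted R-squared values both round to 1.000, which suggests a near-perfect fit. This is surprising, because real-world data usually contains noise and outliers that a linear model cannot explain. A plausible explanation is that the dataset is synthetic and College GPA was generated from a linear combination of a few predictors, rather than that the model is unusually good.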

**F-statistic:** The F-statistic (3.296e+06) is very large, which indicates that the predictors jointly explain College GPA far better than an intercept-only model.

**Prob (F-statistic):** The p-value associated with the F-statistic is reported as 0.00, meaning it is too small to display, which indicates that the model as a whole is highly significant.

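**Coefficients (coef):** Each coefficient is the expected change in College GPA for a one-unit increase in that variable, holding the others fixed. Study Hours per Week, Attendance Rate, High School GPA, and Online Coursework Engagement have much larger coefficients than the other variables, whose coefficients are all close to zero. Coefficient size alone does not establish significance, since it depends on each variable's units; significance is judged from the p-values.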

Evaluating the model provides clear insight into which variables matter. The four variables with a p-value below 0.05 are Study Hours per Week, Attendance Rate, High School GPA, and Online Coursework Engagement. All other p-values are above this threshold. It is safe to say that these four factors are the most relevant for predicting College GPA in this dataset.
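The `t` and `P>|t|` columns in an OLS summary come from dividing each coefficient by its standard error. The sketch below reproduces that calculation with NumPy on synthetic data; the variables `x1` and `x2` and the generating rule are hypothetical. With many residual degrees of freedom, an absolute t-statistic above roughly 1.96 corresponds to a p-value below 0.05.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Hypothetical data: x1 drives y, x2 is unrelated to y.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 0.5 * x1 + rng.normal(scale=1.0, size=n)

# Design matrix with an intercept column, as sm.add_constant would add.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimate of the coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance uses n - p degrees of freedom (p = number of columns).
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])

# Standard errors: square roots of the diagonal of sigma^2 * (X'X)^-1.
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err

for name, b, se, t in zip(["const", "x1", "x2"], beta, std_err, t_stats):
    print(f"{name:>5}  coef={b: .3f}  std err={se:.3f}  t={t: .2f}")
```

A variable can have a small coefficient and still be significant if its standard error is smaller still, which is why the p-values, not the coefficient sizes, decide which predictors are kept.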

The people who would benefit most from this model are students who wish to get into top colleges and perform well in school. Because High School GPA is a significant predictor of College GPA, it is beneficial to push students to do well in high school and prepare them accordingly. Active engagement with coursework, reflected in attendance, study hours, and online coursework engagement, is an important factor in College GPA as well. Major, having a part-time job, library usage, and sleep hours per night do not appear significant enough to focus on when trying to improve College GPA.