# Project Description
This project aims to analyze the relationship between students' SAT scores and their corresponding GPA scores. By examining the data collected from a sample of students, we seek to understand whether there exists a linear relationship between these two variables. The analysis will involve exploring the data, performing statistical tests, and visualizing the results to draw meaningful conclusions. Understanding the connection between SAT scores and GPA can provide insights into academic performance and potentially inform educational practices.

### Importing Relevant Libraries

We begin by importing the necessary libraries for our analysis. These include:

- **NumPy**: A library for numerical operations in Python.
- **Pandas**: A powerful data manipulation library for working with structured data.
- **Matplotlib**: A plotting library for creating visualizations in Python.
- **Seaborn**: A statistical data visualization library built on top of Matplotlib.
- **LinearRegression**: A class from the scikit-learn library used for performing linear regression analysis.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

### Loading the Data

Next, we load the dataset containing students' SAT scores and GPA scores. The dataset is stored in a CSV file named 'studentgrades.csv'.


In [2]:
data = pd.read_csv('studentgrades.csv')
data

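|     | SAT  | Rand 1,2,3 | GPA  |
|-----|------|------------|------|
| 0   | 1714 | 1          | 2.40 |
| 1   | 1664 | 3          | 2.52 |
| 2   | 1760 | 3          | 2.54 |
| 3   | 1685 | 3          | 2.74 |
| 4   | 1693 | 2          | 2.83 |
| ... | ...  | ...        | ...  |
| 79  | 1936 | 3          | 3.71 |
| 80  | 1810 | 1          | 3.71 |
| 81  | 1987 | 3          | 3.73 |
| 82  | 1962 | 1          | 3.76 |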


In [3]:
data.describe()

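|       | SAT        | Rand 1,2,3 | GPA      |
|-------|------------|------------|----------|
| count | 84.0       | 84.0       | 84.0     |
| mean  | 1845.27381 | 2.059524   | 3.330238 |
| std   | 104.530661 | 0.855192   | 0.271617 |
| min   | 1634.0     | 1.0        | 2.4      |
| 25%   | 1772.0     | 1.0        | 3.19     |
| 50%   | 1846.0     | 2.0        | 3.38     |
| 75%   | 1934.0     | 3.0        | 3.5025   |
| max   | 2050.0     | 3.0        | 3.81     |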


### Creating Multiple Linear Regression and Declaring Variables

To analyze the relationship between SAT scores and GPA, we use multiple linear regression. The dependent variable (y) is the GPA, and the independent variables (x) are the SAT score and 'Rand 1,2,3', a column that takes the values 1, 2, or 3 and, judging by its name, appears to be randomly assigned.


In [4]:
x = data[['SAT', 'Rand 1,2,3']]
y = data['GPA']

### Regression itself

In [5]:
reg = LinearRegression()
reg.fit(x, y)

In [6]:
reg.coef_

array([ 0.00165354, -0.00826982])

In [7]:
reg.intercept_

0.29603261264909486
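
Read together, the intercept and coefficients above correspond to the following fitted equation (values rounded from the outputs shown, with $Rand$ standing for the 'Rand 1,2,3' column):

$\widehat{GPA} = 0.2960 + 0.001654 \cdot SAT - 0.008270 \cdot Rand$

Under this model, a 100-point increase in SAT is associated with an increase of about 0.165 in predicted GPA when 'Rand 1,2,3' is held fixed.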

### Calculating the Coefficient of Determination (R-squared)

The coefficient of determination, often denoted as R-squared, is a measure of how well the independent variable(s) explain the variability of the dependent variable. In the context of linear regression, R-squared ranges from 0 to 1, where a value closer to 1 indicates a better fit of the regression line to the data.
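
To make the definition concrete, the sketch below computes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `score`. The data are synthetic stand-ins generated for illustration (an assumption, not the notebook's dataset), so the printed value will differ from the one below.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: NOT the studentgrades.csv values)
rng = np.random.default_rng(0)
sat = rng.uniform(1600, 2100, size=84)
gpa = 0.3 + 0.0016 * sat + rng.normal(0, 0.2, size=84)

X = sat.reshape(-1, 1)
model = LinearRegression().fit(X, gpa)

# R^2 = 1 - SS_res / SS_tot
pred = model.predict(X)
ss_res = np.sum((gpa - pred) ** 2)        # variation left unexplained by the model
ss_tot = np.sum((gpa - gpa.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, gpa))     # the two values agree
```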


In [8]:
reg.score(x,y)

0.4066811952814282

### Formula for Adjusted R^2

$R^2_{adj.} = 1 - (1-R^2) \cdot \frac{n-1}{n-p-1}$

Here $n$ is the number of observations and $p$ is the number of independent variables.

In [9]:
x.shape

(84, 2)

In [10]:
r2 = reg.score(x,y)

n = x.shape[0]
p = x.shape[1]

adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2

0.39203134825134
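
As a check, substituting the values above ($R^2 \approx 0.4067$, $n = 84$, $p = 2$) into the formula gives:

$R^2_{adj.} = 1 - (1 - 0.4067) \cdot \frac{83}{81} \approx 0.3920$

The adjusted value is only slightly below the unadjusted one because the sample is large relative to the number of predictors.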


### Feature Selection

Feature selection is the process of choosing the most relevant independent variables to include in a predictive model. Our model has two independent variables (SAT and 'Rand 1,2,3'), so we can check whether each of them is worth keeping.

Here we use `f_regression` from scikit-learn, which runs a univariate F-test of each feature against the target and returns the F-statistics and their p-values. A feature with a large p-value (conventionally above 0.05) shows no significant linear relationship with the target and is a candidate for removal.

In more complex scenarios with many independent variables, techniques such as forward selection, backward elimination, or regularization methods such as Lasso can be used to identify the most influential features.
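
A minimal sketch of p-value-based screening with `f_regression` is shown below. The data are synthetic (an assumption, not the notebook's dataset): one feature drives the target and the other is a random 1, 2, 3 column. The 0.05 threshold is a common convention rather than a rule.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data (assumption): one informative feature, one pure-noise feature
rng = np.random.default_rng(42)
n = 200
informative = rng.normal(size=n)
noise = rng.integers(1, 4, size=n).astype(float)  # random values 1, 2, 3
target = 2.0 * informative + rng.normal(scale=0.5, size=n)

X = np.column_stack([informative, noise])

# f_regression tests each column on its own against the target
f_stats, p_vals = f_regression(X, target)

alpha = 0.05           # conventional significance threshold
keep = p_vals < alpha  # boolean mask of features that pass the test
print(p_vals.round(3), keep)
```

The informative column passes the test easily. The noise column will usually fail it, although a random column can fall below the threshold by chance about 5% of the time.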


In [11]:
from sklearn.feature_selection import f_regression

In [12]:
f_regression(x,y)

(array([56.04804786,  0.17558437]), array([7.19951844e-11, 6.76291372e-01]))

In [13]:
p_values = f_regression(x,y)[1]

In [14]:
p_values

array([7.19951844e-11, 6.76291372e-01])

In [15]:
p_values.round(3)

array([0.   , 0.676])

### Creating a Summary Table

A summary table gives a concise overview of the key findings. Here we list each feature alongside its regression coefficient and its p-value from the F-test above.


In [16]:
reg_summary = pd.DataFrame(data = x.columns.values, columns = ['Features'])
reg_summary

|   | Features   |
|---|------------|
| 0 | SAT        |
| 1 | Rand 1,2,3 |


In [17]:
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)

In [18]:
reg_summary

|   | Features   | Coefficients | p-values |
|---|------------|--------------|----------|
| 0 | SAT        | 0.001654     | 0.0      |
| 1 | Rand 1,2,3 | -0.00827     | 0.676    |


### Feature Scaling: Standardization

Standardization is a preprocessing technique that rescales each feature (independent variable) to have a mean of 0 and a standard deviation of 1. This puts features measured in different units on a common scale and can improve the performance of certain machine learning algorithms.
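
The transformation itself is simple: subtract each column's mean and divide by its standard deviation. The sketch below, using a few made-up rows (an assumption, not the notebook's data), checks that this matches `StandardScaler`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration: columns are SAT and a 1/2/3 value
X = np.array([[1700.0, 2.0],
              [1800.0, 1.0],
              [1900.0, 3.0],
              [2000.0, 2.0]])

# z = (x - mean) / std, column by column; np.std defaults to ddof=0
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))
```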


In [19]:
from sklearn.preprocessing import StandardScaler

In [20]:
scaler = StandardScaler()

In [21]:
scaler.fit(x)

In [22]:
x_scaled = scaler.transform(x)

In [23]:
x_scaled

array([[-1.26338288, -1.24637147],
       [-1.74458431,  1.10632974],
       [-0.82067757,  1.10632974],
       [-1.54247971,  1.10632974],
       [-1.46548748, -0.07002087],
       [-1.68684014, -1.24637147],
       [-0.78218146, -0.07002087],
       [-0.78218146, -1.24637147],
       [-0.51270866, -0.07002087],
       [ 0.04548499,  1.10632974],
       [-1.06127829,  1.10632974],
       [-0.67631715, -0.07002087],
       [-1.06127829, -1.24637147],
       [-1.28263094,  1.10632974],
       [-0.6955652 , -0.07002087],
       [ 0.25721362, -0.07002087],
       [-0.86879772,  1.10632974],
       [-1.64834403, -0.07002087],
       [-0.03150724,  1.10632974],
       [-0.57045283,  1.10632974],
       [-0.81105355,  1.10632974],
       [-1.18639066,  1.10632974],
       [-1.75420834,  1.10632974],
       [-1.52323165, -1.24637147],
       [ 1.23886453, -1.24637147],
       [-0.18549169, -1.24637147],
       [-0.5608288 , -1.24637147],
       [-0.23361183,  1.10632974],
       [ 1.68156984,

### Regression with Scaled Features

After scaling both features, we refit the regression on the scaled data. For ordinary least squares, standardization does not change the quality of the fit; what changes is the scale of the coefficients, which become directly comparable as weights.
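
The sketch below illustrates what does and does not change when an ordinary least squares model is refit on standardized features. It uses synthetic data (an assumption, not the notebook's dataset): R-squared stays the same, each coefficient is multiplied by its feature's standard deviation, and the intercept becomes the mean of the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): an SAT-like column and a random 1/2/3 column
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(1600, 2100, 84),
                     rng.integers(1, 4, 84)]).astype(float)
y = 0.3 + 0.0016 * X[:, 0] + rng.normal(0, 0.2, 84)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(X_scaled, y)

r2_raw = raw.score(X, y)
r2_scaled = scaled.score(X_scaled, y)

print(r2_raw, r2_scaled)                        # same goodness of fit
print(scaled.coef_, raw.coef_ * scaler.scale_)  # coefficients rescaled by each std
print(scaled.intercept_, y.mean())              # intercept equals the mean of y
```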


In [24]:
reg = LinearRegression()
reg.fit(x_scaled, y)

In [25]:
reg.coef_

array([ 0.17181389, -0.00703007])

In [26]:
reg.intercept_

3.330238095238095

### Creating a Summary Table with Scaled Features

Let's update the summary table to include the results from the regression analysis using scaled features.

In [27]:
reg_summary = pd.DataFrame([['Bias'], ['SAT'], ['Rand 1,2,3']], columns=['Features'])

In [28]:
reg_summary['Weights'] = reg.intercept_, reg.coef_[0], reg.coef_[1]

In [29]:
reg_summary

|   | Features   | Weights  |
|---|------------|----------|
| 0 | Bias       | 3.330238 |
| 1 | SAT        | 0.171814 |
| 2 | Rand 1,2,3 | -0.00703 |


### Making Predictions with Standardized Coefficients

After fitting the model on scaled features, we can use the standardized coefficients (weights) to make predictions on new data.


In [30]:
new_data = pd.DataFrame(data=[[1700,2],[1800,1]], columns=['SAT', 'Rand 1,2,3'])

In [31]:
new_data

|   | SAT  | Rand 1,2,3 |
|---|------|------------|
| 0 | 1700 | 2          |
| 1 | 1800 | 1          |


In [32]:
reg.predict(new_data)



array([295.39979563, 312.58821497])

These predictions are implausible (GPA values near 300) because the `new_data` values are unscaled while the model was trained on scaled features.

Solution: scale the new data with the same fitted scaler before predicting.

In [33]:
new_data_scaled = scaler.transform(new_data)

In [34]:
new_data_scaled

array([[-1.39811928, -0.07002087],
       [-0.43571643, -1.24637147]])

In [35]:
result = reg.predict(new_data_scaled)

In [36]:
result

array([3.09051403, 3.26413803])
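
Written out for the first row, the prediction is the bias plus each weight times the corresponding scaled value (numbers rounded from the outputs above):

$\hat{y} = 3.3302 + 0.1718 \cdot (-1.3981) + (-0.0070) \cdot (-0.0700) \approx 3.0905$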

In [37]:
prediction = pd.DataFrame(data=[[1700,2,result[0]],[1800,1,result[1]]], columns=['SAT', 'Rand 1,2,3', 'Predicted Values'])

In [38]:
prediction

|   | SAT  | Rand 1,2,3 | Predicted Values |
|---|------|------------|------------------|
| 0 | 1700 | 2          | 3.090514         |
| 1 | 1800 | 1          | 3.264138         |
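
One way to avoid forgetting the scaling step is to bundle the scaler and the regression into a single scikit-learn pipeline, so that `predict` scales new data automatically. The sketch below is an alternative to the manual steps above, not what this notebook does, and it is trained on synthetic data (an assumption) rather than the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption: NOT the studentgrades.csv values)
rng = np.random.default_rng(2)
train = pd.DataFrame({
    "SAT": rng.uniform(1600, 2100, 84),
    "Rand 1,2,3": rng.integers(1, 4, 84),
})
gpa = 0.3 + 0.0016 * train["SAT"] + rng.normal(0, 0.2, 84)

# The pipeline fits the scaler and then the regression on the scaled output
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(train, gpa)

# New data is passed unscaled; the pipeline applies the fitted scaler first
new_data = pd.DataFrame([[1700, 2], [1800, 1]], columns=["SAT", "Rand 1,2,3"])
pred = pipe.predict(new_data)
print(pred)
```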


### Regression Analysis without 'Rand 1,2,3' Variable

To assess the impact of removing the 'Rand 1,2,3' variable, we repeat the regression analysis with SAT as the only independent variable.


In [39]:
x = data['SAT']
y = data['GPA']

x = x.values.reshape(-1,1)

In [40]:
reg = LinearRegression()

In [41]:
reg.fit(x, y)

In [42]:
reg.coef_

array([0.00165569])

In [43]:
reg.intercept_

0.27504029966028076

In [44]:
new_data = pd.DataFrame(data=[1700,1800], columns=['SAT'])

In [45]:
reg.predict(new_data)



array([3.08970998, 3.25527879])

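These predictions are practically the same as those from the two-feature model (3.0905 and 3.2641), which shows that 'Rand 1,2,3' contributes almost nothing to the model.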

### Summary

The analysis reveals a statistically significant positive linear relationship between students' SAT scores and their GPAs: the univariate F-test p-value for SAT is about 7.2e-11, and higher SAT scores are associated with higher predicted GPAs. The two-feature model has an R-squared of about 0.41 (adjusted: about 0.39), so SAT explains a meaningful share of the variation in GPA but leaves most of it unexplained. The 'Rand 1,2,3' variable is not significant (p = 0.676), and dropping it leaves the predictions practically unchanged. This suggests that students who perform well on the SAT are more likely to achieve higher GPAs. Further investigations could identify other factors that influence GPA and explore the implications for academic achievement and educational strategies.