# Jamboree Education - Linear Regression

## Introduction
Jamboree has helped thousands of students like you make it to top colleges abroad. Be it the GMAT, GRE, or SAT, its problem-solving methods aim for maximum scores with minimum effort.
Jamboree recently launched a feature that lets students visit its website and check their probability of getting into an Ivy League college. The feature estimates the chances of graduate admission from an Indian perspective.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.regressionplots import plot_regress_exog
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.graphics.gofplots import qqplot
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
df = pd.read_csv("Jamboree_Admission.csv")
df.head()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
df.columns
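
Two of the column names used in the cells that follow, `'LOR '` and `'Chance of Admit '`, end with a trailing space, which is easy to mistype. The sketch below shows one way to strip that whitespace. It runs on a small hypothetical frame (`demo`, with made-up values) and writes to a copy (`demo_clean`), because the rest of this notebook keeps using the original names.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the trailing-space column names
demo = pd.DataFrame({
    'GRE Score': [320, 310],
    'LOR ': [4.0, 3.5],
    'Chance of Admit ': [0.80, 0.65],
})

# Strip leading/trailing whitespace from every column name, on a copy
demo_clean = demo.rename(columns=lambda name: name.strip())
print(list(demo_clean.columns))
```

If the same rename were applied to `df` itself, every later cell would need the stripped names (`'LOR'`, `'Chance of Admit'`).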

In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

In [None]:
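# Plot each variable on its own axes; repeated histplot calls without an
# explicit axes would overlay every histogram on a single plot.
cols = ['GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
        'LOR ', 'Research', 'CGPA', 'Chance of Admit ']
fig, axes = plt.subplots(2, 4, figsize=(16, 7))
for ax, col in zip(axes.ravel(), cols):
    sns.histplot(df[col], ax=ax)
plt.tight_layout()
plt.show()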

## Once you’ve ensured that students with varied merit apply for the university, you can start understanding the relationship between different factors responsible for graduate admissions.

In [None]:
# Scatter each predictor against the target on its own axes
predictors = ['GRE Score', 'TOEFL Score', 'University Rating', 'LOR ',
              'SOP', 'CGPA', 'Research']
fig, axes = plt.subplots(2, 4, figsize=(16, 7))
for ax, col in zip(axes.ravel(), predictors):
    sns.scatterplot(x=col, y='Chance of Admit ', data=df, ax=ax)
axes.ravel()[-1].set_visible(False)  # hide the unused eighth panel
plt.tight_layout()
plt.show()

In [None]:
sns.pairplot(df[['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research', 'Chance of Admit ']])
# Display the plot
plt.show()

## Drop the unique row Identifier if you see any. This step is important as you don’t want your model to build some understanding based on row numbers.

In [None]:
df = df.drop(['Serial No.'], axis=1)
df

In [None]:
df.describe()

## Use Non-graphical and graphical analysis for getting inferences about variables. 
   * This can be done by checking the distribution of variables of graduate applicants.
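
For the non-graphical side, skewness, kurtosis, and an outlier count can complement `describe()`. The following is a minimal sketch, assuming the 1.5 × IQR rule is an acceptable outlier criterion here. The helper `distribution_summary` and the synthetic `demo` frame are hypothetical stand-ins, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def distribution_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column skewness, excess kurtosis, and count of IQR-rule outliers."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it lies more than 1.5 * IQR outside the quartiles
    outside = (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)
    return pd.DataFrame({
        'skew': frame.skew(),
        'kurtosis': frame.kurt(),
        'n_outliers': outside.sum(),
    })

# Synthetic stand-in data (not the Jamboree dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'score_a': rng.normal(315, 11, 200),
    'score_b': rng.normal(8.5, 0.6, 200),
})
demo.loc[0, 'score_b'] = 20.0  # inject one obvious outlier

summary = distribution_summary(demo)
print(summary)
```

On the real data, the call would be `distribution_summary(df)` once the row identifier has been dropped.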

In [None]:
df.columns

In [None]:
# Calculate descriptive statistics
stats = df[['GRE Score', 'TOEFL Score', 'University Rating', 'CGPA', 'Chance of Admit ']].describe()

# Print the statistics
print(stats)

In [None]:
num_cols = ['GRE Score', 'TOEFL Score', 'CGPA']

# Create histograms with a KDE overlay, one panel per variable
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, num_cols):
    sns.histplot(df[col], kde=True, ax=ax)
plt.tight_layout()
plt.show()

# Create boxplots of each score grouped by university rating
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, num_cols):
    sns.boxplot(x=df['University Rating'], y=df[col], ax=ax)
plt.tight_layout()
plt.show()

# Create density plots
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, num_cols):
    sns.kdeplot(df[col], ax=ax)
plt.tight_layout()
plt.show()

# Create bar plot of the Research counts
sns.countplot(x=df['Research'])
plt.show()

In [None]:
# Calculate correlation coefficients
corr = df[['GRE Score', 'TOEFL Score', 'University Rating', 'CGPA', 'Chance of Admit ']].corr()

# Print the correlation matrix
print(corr)

In [None]:
# Visualize the correlation matrix
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Display the heatmap
plt.show()

## Check correlation among independent variables and how they interact with each other.

In [None]:
df.corr()

In [None]:
# Create a heatmap of the correlation matrix
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

# Show the plot
plt.show()

In [None]:
# Create a pairplot of the independent variables
sns.pairplot(df[['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research']], diag_kind='kde')

# Show the plot
plt.show()
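
A heatmap shows every pairwise correlation at once, but it can help to list only the strongly correlated predictor pairs. The sketch below does that with a hypothetical helper, `correlated_pairs`. The 0.8 cutoff and the synthetic `demo` frame are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Return variable pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna().reset_index()
    pairs.columns = ['var_1', 'var_2', 'corr']
    strong = pairs[pairs['corr'].abs() >= threshold]
    return strong.sort_values('corr', key=abs, ascending=False).reset_index(drop=True)

# Synthetic stand-in data: b is nearly a copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=300),
    'c': rng.normal(size=300),
})

pairs = correlated_pairs(demo)
print(pairs)
```

For this notebook the input would be the predictor columns only, so that the target does not appear in the pairs.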

## Use Linear Regression (from the statsmodels library) and explain the results.
### Test the assumptions of linear regression:
* Multicollinearity check by VIF score
* Mean of residuals
* Linearity of variables (no pattern in residual plot)
* Test for Homoscedasticity
* Normality of residuals

In [None]:
# Build the linear regression model using the statsmodels library:
X = df[['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research']]
y = df['Chance of Admit ']

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.params)

In [None]:
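# Multicollinearity check by VIF score.
# The constant stays in the design matrix so that each auxiliary regression
# has an intercept; its own VIF is not meaningful and is left out of the output.
vif = pd.DataFrame()
vif['Variables'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif[vif['Variables'] != 'const'])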
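
The VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes it directly with NumPy to make that definition concrete. `vif_manual` is a hypothetical helper that adds its own intercept, so it expects the predictors without a constant column, and the data is synthetic.

```python
import numpy as np

def vif_manual(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic stand-in data: x2 is strongly tied to x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.3 * rng.normal(size=500)
x3 = rng.normal(size=500)

vifs = vif_manual(np.column_stack([x1, x2, x3]))
print(vifs)
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity. The exact cutoff is a convention rather than a fixed rule.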

In [None]:
# Mean of residuals:
residuals = model.resid
mean_residual = np.mean(residuals)
print(f'Mean of residuals: {mean_residual}')

In [None]:
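# Linearity of variables (no pattern in residual plot):
# Residuals are plotted against the fitted values; plotted against the observed
# target they show an upward trend even when the model is correctly specified.
sns.scatterplot(x=model.fittedvalues, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Linearity of Variables')
plt.show()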

In [None]:
# Test for Homoscedasticity:
plt.scatter(model.fittedvalues, residuals)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Test for Homoscedasticity')
plt.show()

In [None]:
# Normality of residuals:
residuals = model.resid
qqplot(residuals, line='s')
plt.title('Normality of Residuals')
plt.show()
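
A Q-Q plot can be paired with a numerical normality test. The sketch below applies the Shapiro-Wilk test from SciPy, whose null hypothesis is that the sample is normally distributed. `shapiro_pvalue` is a hypothetical wrapper and the two samples are synthetic stand-ins for residuals.

```python
import numpy as np
from scipy import stats

def shapiro_pvalue(residuals) -> float:
    """Return the Shapiro-Wilk p-value; small values argue against normality."""
    statistic, p_value = stats.shapiro(residuals)
    return float(p_value)

# Synthetic stand-in residuals: one normal sample, one clearly skewed sample
rng = np.random.default_rng(0)
normal_resid = rng.normal(scale=0.05, size=300)
skewed_resid = rng.exponential(scale=0.05, size=300)

p_normal = shapiro_pvalue(normal_resid)
p_skewed = shapiro_pvalue(skewed_resid)
print("Normal sample p-value:", p_normal)
print("Skewed sample p-value:", p_skewed)
```

With several hundred observations the test can flag small departures from normality, so it is best read alongside the Q-Q plot rather than on its own.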

## Do model evaluation: MAE, RMSE, R2 score, Adjusted R2.

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model on the training data
model = sm.OLS(y_train, X_train).fit()

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
n, p = X_test.shape[0], X_test.shape[1] - 1  # p excludes the constant column
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Print the evaluation metrics
print("MAE:", mae)
print("RMSE:", rmse)
print("R2 score:", r2)
print("Adjusted R2 score:", adj_r2)
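
Adjusted R2 penalizes R2 for the number of predictors: adjusted R2 = 1 - (1 - R2)(n - 1) / (n - p - 1), where n is the number of samples and p the number of predictors, not counting the intercept. A small sketch of that formula follows. `adjusted_r2` is a hypothetical helper and the input values are made up for illustration.

```python
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Adjusted R2 for a model with an intercept and n_predictors features."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("Need more samples than predictors plus one.")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Illustrative values only: R2 = 0.80 on 100 samples with 7 predictors
example = adjusted_r2(0.80, n_samples=100, n_predictors=7)
print(round(example, 4))  # 0.7848
```

The adjusted value is never above the plain R2, and the gap widens as predictors are added relative to the sample size.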

In [None]:
# Define the regularization strengths
alpha_ridge = 1.0
alpha_lasso = 1.0

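# scikit-learn fits its own intercept, so drop the constant column added for statsmodels
X_train_sk = X_train.drop(columns='const')
X_test_sk = X_test.drop(columns='const')

# Fit Ridge regression on the training data
ridge = Ridge(alpha=alpha_ridge)
ridge.fit(X_train_sk, y_train)

# Fit Lasso regression on the training data
lasso = Lasso(alpha=alpha_lasso)
lasso.fit(X_train_sk, y_train)

# Make predictions on the test data for Ridge and Lasso
y_pred_ridge = ridge.predict(X_test_sk)
y_pred_lasso = lasso.predict(X_test_sk)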

# Calculate the evaluation metrics for Ridge and Lasso
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mse_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mse_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

# Print the evaluation metrics for Ridge and Lasso
print("Ridge Regression:")
print("MAE:", mae_ridge)
print("RMSE:", rmse_ridge)
print("R2 score:", r2_ridge)

print("Lasso Regression:")
print("MAE:", mae_lasso)
print("RMSE:", rmse_lasso)
print("R2 score:", r2_lasso)
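
Ridge and Lasso penalize coefficient size, so features on very different scales are penalized unevenly, and a fixed `alpha` of 1.0 may be far too strong for a target between 0 and 1. The sketch below shows one common alternative: standardize the features and let cross-validation choose `alpha`. The synthetic data, the alpha grid, and the 5-fold setting are all assumptions, not values from the original analysis.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales
rng = np.random.default_rng(0)
n = 400
X_demo = np.column_stack([
    rng.normal(315, 11, n),   # large-scale score
    rng.normal(8.5, 0.6, n),  # small-scale score
    rng.integers(0, 2, n),    # binary flag
])
y_demo = (0.7 + 0.005 * (X_demo[:, 0] - 315) + 0.15 * (X_demo[:, 1] - 8.5)
          + 0.03 * X_demo[:, 2] + rng.normal(scale=0.05, size=n))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Candidate regularization strengths (an assumed grid)
alphas = np.logspace(-4, 1, 30)

# Scaling lives inside the pipeline, so it is fit on the training data only
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5))
ridge_cv.fit(X_tr, y_tr)
lasso_cv.fit(X_tr, y_tr)

r2_ridge_cv = r2_score(y_te, ridge_cv.predict(X_te))
r2_lasso_cv = r2_score(y_te, lasso_cv.predict(X_te))
print("Ridge alpha:", ridge_cv[-1].alpha_, "R2:", r2_ridge_cv)
print("Lasso alpha:", lasso_cv[-1].alpha_, "R2:", r2_lasso_cv)
```

Keeping the scaler in the pipeline avoids leaking test-set statistics into training, which is the main reason for this design.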

# Insights
* GRE scores have a significant positive relationship with the Chance of Admit: a higher GRE score is associated with a better chance of getting into an Ivy League college.
* TOEFL scores have a moderate positive relationship with the Chance of Admit, indicating that a higher TOEFL score is also associated with a better chance of admission.
* The university rating has a weak positive relationship with the Chance of Admit, suggesting that a higher university rating goes with a somewhat higher chance of admission.
* Both SOP and LOR have a weak positive relationship with the Chance of Admit, indicating that a well-written SOP and a strong LOR are associated with a better chance of admission.
* CGPA has a moderate positive relationship with the Chance of Admit, suggesting that a higher CGPA is associated with a better chance of admission.
* Research experience has a strong positive relationship with the Chance of Admit, indicating that students with research experience tend to have a higher chance of admission.
* The linear regression model has a high R2 score and a low MAE and RMSE, indicating that the model has a good fit and can predict the Chance of Admit with a high degree of accuracy.
* The model can be improved by including additional predictor variables, such as work experience, publications, and extracurricular activities.
* The model can be implemented on Jamboree's website to provide students with an estimate of their chances of admission.
* The model can help Jamboree to improve their services by providing personalized estimates of the Chance of Admit to students.

# Recommendations
* Encourage students to aim for a higher GRE score to improve their chances of admission.
* Encourage students to prepare well for the TOEFL exam and aim for a high score.
* Students should consider applying to universities with higher ratings to improve their chances of admission.
* Encourage students to focus on writing a strong SOP and obtaining a strong LOR from their professors.
* Encourage students to maintain a high CGPA during their undergraduate studies.
* Encourage students to gain research experience during their undergraduate studies.
* The model can be used to predict the Chance of Admit for students with varied merit.
* Collect data on additional predictor variables to improve the model's performance.
* Implement the model on Jamboree's website to provide students with personalized estimates of their chances of admission.
* Use the model to provide personalized estimates of the Chance of Admit to students and offer targeted services to improve their chances of admission.