# Decoding Graduate Admissions: Unveiling Patterns and Factors Influencing Admission Chances


By : Addepalli Sai Rithvik 
<br>

UIN : 934007934

# INTRODUCTION 





## Research Interest/Problem Statement:
The graduate admission process is a multifaceted and crucial aspect of higher education. Aspiring graduate students are evaluated on factors such as GRE scores, TOEFL scores, undergraduate GPA, statement of purpose, and letters of recommendation. However, the relative impact of these components on the likelihood of admission is not well understood. This project addresses that gap by analyzing graduate admission data to uncover patterns and relationships among the admission criteria. The dataset caught my interest because I am a new graduate student at Texas A&M, and as an applicant I was unsure which factor would contribute the most.

## Significance of the Project:
Understanding the intricate relationships between admission criteria and the probability of admission is crucial for both prospective graduate students and university admissions committees. By deciphering the significance of each criterion, this project aims to contribute to the refinement and transparency of the graduate admission process. Providing insights into the factors that significantly influence admission chances can empower applicants to enhance their applications and guide admissions committees in making informed decisions.
- **For Students:** Gain insights into the factors influencing admission chances, enabling better preparation and decision-making.
- **For Committees:** Enhance fairness and transparency in the decision-making process, aligning with the goal of selecting candidates best suited for academic success.

## Theoretical Background:
The theoretical foundation of this project lies in the realm of statistical modeling and predictive analytics. Leveraging machine learning techniques, particularly regression analysis, allows us to explore the complex interactions between different admission criteria and the ultimate chance of admission. Additionally, the project draws on the principles of data science to extract meaningful patterns from a dataset, providing actionable insights for decision-making in the context of graduate admissions.






## Project Objectives: 
- **Data Analysis:** Explore admission data to identify trends and correlations among different components.
- **Pattern Recognition:** Uncover hidden patterns in the data to understand the nuanced relationships between admission factors.
- **Transparency Enhancement:** Provide insights to improve transparency in the admission process for prospective graduate students and admissions committees.
- **Informed Decision-Making:** Equip both students and admissions committees with valuable information for making more informed decisions.


## Objectives (Research Hypotheses):
1. **Hypothesis 1:**
   - "GRE scores positively correlate with CGPAs."
     - *Expected Outcome:* Higher GRE scores are associated with a higher CGPA.

2. **Hypothesis 2:**
   - "The mean admission chances for those with and without Research Experience are not the same."
     - *Expected Outcome:* A difference in the means for those with and without research experience.

3. **Hypothesis 3:**
   - "Statement of Purpose (SOP) scores positively correlate with CGPA."
     - *Expected Outcome:* A positive association between the quality of the Statement of Purpose and CGPA.

4. **Hypothesis 4 :**
   - "The chance of admit can be predicted by a regression model considering all the factors."
     - *Expected Outcome:* To build a regression model which can predict the chance of admit.
     
     
     
     
## Expected Outcomes:
The project seeks to contribute to the refinement of the graduate admission process by shedding light on critical factors. Improved transparency can empower students in their application journey and assist admissions committees in making fair and evidence-based decisions.



In conclusion, "Decoding Graduate Admissions" strives to bring clarity to a complex process, fostering a more transparent and equitable environment for both applicants and universities.


# DATA DESCRIPTION 



This dataset, sourced from Kaggle, is tailored for predicting graduate admissions from an Indian perspective. It encompasses various parameters that play a pivotal role in the admission process. The included features are GRE scores, TOEFL scores, undergraduate GPA, Statement of Purpose and Letter of Recommendation Strength, University Rating, Research Experience, and the Chance of Admit.

**Dataset Link:**
[Kaggle Graduate Admissions Dataset](https://www.kaggle.com/datasets/mohansacharya/graduate-admissions/data)

This dataset serves as a valuable resource for exploring and modeling the intricate dynamics of graduate admissions, providing insights into the influential factors that contribute to the likelihood of admission.


In [4]:
data.shape

(500, 9)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Serial No.         500 non-null    int64  
 1   GRE Score          500 non-null    int64  
 2   TOEFL Score        500 non-null    int64  
 3   University Rating  500 non-null    int64  
 4   SOP                500 non-null    float64
 5   LOR                500 non-null    float64
 6   CGPA               500 non-null    float64
 7   Research           500 non-null    int64  
 8   Chance of Admit    500 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB


**Dataset Overview: Graduate Admissions**

- **Number of Rows:** 500
- **Number of Columns:** 9

- The columns are: Serial No., GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research, and Chance of Admit.


**Continuous Variables:**
1. **GRE Score:**
   - Type: Quantitative (Continuous)
   - Explanation: Represents measurable quantities, such as scores obtained in the Graduate Record Examination (GRE).

2. **TOEFL Score:**
   - Type: Quantitative (Continuous)
   - Explanation: Represents measurable quantities, specifically scores obtained in the Test of English as a Foreign Language (TOEFL).

3. **CGPA:**
   - Type: Quantitative (Continuous)
   - Explanation: Represents measurable quantities, indicating the Grade Point Average (GPA) obtained during undergraduate studies. Ranges from 0.0 to 10.0

4. **SOP:**
   - Type: Quantitative (Continuous)
   - Explanation: Represents measurable quantities, reflecting the strength of the Statement of Purpose.

5. **LOR:**
   - Type: Quantitative (Continuous)
   - Explanation: Represents measurable quantities, indicating the strength of the Letter of Recommendation.

6. **Chance of Admit:**
   - Type: Quantitative (Continuous)
   - Explanation: Represents a probability, ranging from 0.0 to 1.0, and is considered continuous.

**Qualitative Variables:**
1. **University Rating :**
   - Type: Qualitative (Discrete)
   - Explanation: Represents different categories or ratings for the university/college attended during undergraduate studies.

2. **Research:**
   - Type: Qualitative (Discrete)
   - Explanation: Represents the presence or absence of a characteristic (binary), indicating whether the applicant has research experience.


# DATA PREPROCESSING 

# Preprocessing Steps


## 1. IMPORTED ALL THE REQUIRED MODULES 
- First I imported all the modules used in this project: pandas for handling dataframes, numpy for faster numerical calculations, matplotlib and seaborn for plotting, and sklearn for the machine learning models.









## 2. DROPPING UNNEEDED COLUMNS 

- In this step I dropped the column Serial No. because it carries no information beyond the row number. Dropping columns that are not required keeps the analysis focused and prevents an identifier from being treated as a predictor.
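The step above can be sketched roughly as follows. The two-row frame is a made-up stand-in for the Kaggle file, and the variable name `data` simply mirrors the cells shown later in this report.

```python
import pandas as pd

# Hypothetical two-row stand-in for the admissions data; the real notebook
# loads the Kaggle CSV instead.
data = pd.DataFrame({
    "Serial No.": [1, 2],
    "GRE Score": [337, 324],
    "CGPA": [9.65, 8.87],
})

# Drop the identifier column; errors="ignore" makes the cell safe to re-run
# after the column has already been removed.
data = data.drop(columns=["Serial No."], errors="ignore")
print(list(data.columns))
```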



### DATA BEFORE DROPPING THE COLUMN 'Serial No.'  AND AFTER DROPPING THE COLUMN

In [3]:
data.head()

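| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |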


In [67]:

data.head()

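| | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| 0 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |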


## 3.CHECKING FOR NULL VALUES 

- Before moving ahead I checked the entire dataset for null values using the pandas DataFrame.isnull() method. The dataset had no null values, so I did not have to drop any rows.

### Missing Data Check

No missing values were found in the following columns:

- GRE Score
- TOEFL Score
- University Rating
- SOP
- LOR
- CGPA
- Research
- Chance of Admit
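A check of this kind might look like the sketch below. The small frame is invented for illustration and deliberately contains one missing value, unlike the actual dataset, so that the effect of the check is visible.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing value, for illustration only.
sample = pd.DataFrame({
    "GRE Score": [337, 324, 316],
    "CGPA": [9.65, np.nan, 8.00],
})

# isnull() marks missing cells; sum() counts them per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Rows only need to be dropped when some count is non-zero.
cleaned = sample.dropna()
print(len(cleaned), "rows remain after dropping rows with nulls")
```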


## 4.UNDERSTANDING THE DATA 

- Before performing EDA, I first tried to understand the dataset by calling the pandas DataFrame.describe() method, which reports summary statistics such as the mean of each column. One interesting fact I found was that <b>none</b> of the students had a <b>100% chance of admit</b>; the maximum value is 0.97.
<br> Given below are the basic descriptive statistics of each column.

In [69]:
data.describe()

| | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 316.472 | 107.192 | 3.114 | 3.374 | 3.484 | 8.57644 | 0.56 | 0.72174 |
| std | 11.295148 | 6.081868 | 1.143512 | 0.991004 | 0.92545 | 0.604813 | 0.496884 | 0.14114 |
| min | 290.0 | 92.0 | 1.0 | 1.0 | 1.0 | 6.8 | 0.0 | 0.34 |
| 25% | 308.0 | 103.0 | 2.0 | 2.5 | 3.0 | 8.1275 | 0.0 | 0.63 |
| 50% | 317.0 | 107.0 | 3.0 | 3.5 | 3.5 | 8.56 | 1.0 | 0.72 |
| 75% | 325.0 | 112.0 | 4.0 | 4.0 | 4.0 | 9.04 | 1.0 | 0.82 |
| max | 340.0 | 120.0 | 5.0 | 5.0 | 5.0 | 9.92 | 1.0 | 0.97 |


# EXPLORATORY DATA ANALYSIS (EDA)

## Histograms for Exploratory Data Analysis (EDA)

- Histograms are essential for understanding the distribution of individual variables in a dataset. They provide a visual representation of the frequency or count of values within predefined bins or intervals. By examining histograms, we can quickly identify patterns, central tendencies, and potential outliers in the data. This visualization is crucial for gaining insights into the overall shape and characteristics of the data distribution, aiding in subsequent analysis and decision-making processes.

- After plotting the histograms I observed a few things:

    1. **GRE Score** : Most people had a GRE score slightly greater than 310.

    2. **TOEFL Score** : Most people had a TOEFL score around 110.

    3. **University Rating** : The most common university rating was 3.0.

    4. **SOP** : The most common SOP rating was 4.0.

    5. **LOR** : The most common LOR rating was 3.0.

    6. **CGPA** : Most people had a CGPA around 8.5.

    7. **Research** : More people had research experience than not.

- What was interesting was that most people did not lie at the extremes of any category.

![Histograms%20plot%20.png](attachment:Histograms%20plot%20.png)

                                   Figure 1 HISTOGRAM FOR EACH COLUMN


- I also fit normal curves to the histograms of GRE Score, TOEFL Score, and CGPA, and observed that these variables approximately follow a normal distribution.

![Histograms%20for%20TOEFL%20etc.png](attachment:Histograms%20for%20TOEFL%20etc.png)

                                       Figure 2 :  NORMAL CURVES 
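Fitting a normal curve to a column can be sketched as below. This is only an illustration: the scores are synthetic draws rather than the real 'GRE Score' column, and `scipy.stats.norm.fit` is one possible way to obtain the curve parameters, not necessarily the one used for Figure 2.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic scores standing in for a column such as 'GRE Score'.
scores = rng.normal(loc=316, scale=11, size=500)

# Maximum-likelihood estimates of the normal mean and standard deviation.
mu, sigma = stats.norm.fit(scores)

# Density values that could be overlaid on a density-scaled histogram.
grid = np.linspace(scores.min(), scores.max(), 200)
density = stats.norm.pdf(grid, loc=mu, scale=sigma)
print(f"mu={mu:.2f}, sigma={sigma:.2f}")
```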

In [8]:
df.head()

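| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |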


## OUTLIERS AND BOX PLOTS

**Importance of Handling Outliers in EDA:**
Detecting and addressing outliers during EDA is crucial for improving data quality, preserving statistical assumptions, enhancing model performance, and gaining a more accurate understanding of the data distribution.

- I found the number of outliers using the IQR method and visualised them using box plots. 
     1. Column 'GRE Score': 0 outliers
     2. Column 'TOEFL Score': 0 outliers
     3. Column 'University Rating': 0 outliers
     4. Column 'SOP': 0 outliers
     5. Column 'LOR ': 1 outliers
     6. Column 'CGPA': 0 outliers
     7. Column 'Research': 0 outliers
     8. Column 'Chance of Admit ': 2 outliers
       
       
- Since there were very few outliers in the dataset, I did not remove the rows containing them.

![box%20plot%20.png](attachment:box%20plot%20.png)

                                        Figure 3 :  BOX PLOTS
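The IQR rule used for the counts above can be sketched as follows. The helper name `count_iqr_outliers`, the conventional multiplier of 1.5, and the sample ratings are assumptions for illustration; the original cell may differ in detail.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up ratings with one unusually low value.
lor = pd.Series([3.0, 3.5, 3.5, 4.0, 4.0, 4.5, 3.0, 3.5, 4.0, 1.0])
n_outliers = count_iqr_outliers(lor)
print("outliers:", n_outliers)
```

Applying the helper to every numeric column of a dataframe would give a per-column count like the list above.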

## COUNT PLOTS 

I plotted count plots for a few columns to see the distribution of the data.



![COUNT%20PLOT%20LOR%20.png](attachment:COUNT%20PLOT%20LOR%20.png)

                                        Figure 4 :  LOR COUNT PLOT

![COUNT%20PLOTS%20SOP%20.png](attachment:COUNT%20PLOTS%20SOP%20.png)



                                        Figure 5 :  SOP COUNT PLOT

![COUNT%20PLOT%20RESEARCH%20EXP.png](attachment:COUNT%20PLOT%20RESEARCH%20EXP.png)


                                        Figure 6 :  RESEARCH EXPERIENCE COUNT PLOT

![COUNT%20PLOT%20UNI%20RATING.png](attachment:COUNT%20PLOT%20UNI%20RATING.png)


                                      Figure 7 :  UNIVERSITY RATING COUNT PLOT

###  Before starting the hypothesis testing, I plotted pair plots and a correlation matrix to see whether my hypotheses were sound ###

## PAIR PLOTS 



![pair%20plots%20.png](attachment:pair%20plots%20.png)


                                  Figure 8 :  PAIR PLOTS FOR ALL COLUMNS

# CORRELATION MATRIX PLOT

**Importance of Correlation Matrix in EDA:**
The correlation matrix is crucial for understanding relationships between variables, aiding in feature selection, identifying dependencies, and checking assumptions in statistical modeling.


![correlation%20matrix.png](attachment:correlation%20matrix.png)


                                        FIGURE 9 : CORRELATION MATRIX

***From the correlation matrix we can see that CGPA has the strongest correlation with chance of admit, while Research has the weakest.***
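Reading the target column of a correlation matrix can be sketched as below. The data is synthetic: the column names mirror the dataset, but the values and the coefficients used to generate them are invented, so the printed numbers say nothing about the real admissions data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
# Synthetic columns; names mirror the dataset but the values are made up.
cgpa = rng.normal(8.5, 0.6, n)
research = rng.integers(0, 2, n)
chance = 0.2 * cgpa + 0.02 * research + rng.normal(0, 0.03, n)
frame = pd.DataFrame(
    {"CGPA": cgpa, "Research": research, "Chance of Admit": chance}
)

# Pearson correlation of every predictor with the target, strongest first.
ranking = (
    frame.corr()["Chance of Admit"]
    .drop("Chance of Admit")
    .sort_values(ascending=False)
)
print(ranking)
```

Note that a correlation coefficient describes the strength of association, not how much a factor causally contributes to admission.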

# METHODOLOGY

## Hypothesis Testing 1 : GRE Scores and CGPA

**Null Hypothesis (H0):**
There is no significant correlation between GRE scores and CGPA.

**Alternative Hypothesis (H1):**
There is a significant correlation between GRE scores and CGPA. 



###  RATIONALE BEHIND THE HYPOTHESIS : 

- Testing the correlation between GRE scores and CGPA is essential for several reasons. The GRE score is a standardized test that assesses a student's readiness for graduate-level academic work, while CGPA reflects their academic performance during undergraduate studies. Investigating the correlation helps understand if performance on the GRE is associated with undergraduate academic achievement. This knowledge is valuable for both prospective graduate students and admission committees, providing insights into the predictive power of GRE scores on undergraduate success.

### STEPS AND METHODS FOLLOWED FOR TESTING HYPOTHESIS 1 : 
     1. I created a sub dataframe from the original containing only the columns 'GRE Score' & 'CGPA'. I did this to make calculations easier and less confusing.
     2. I first visualised the data to analyse the relationship.
     3. Finally I went ahead with the statistical test.
#### VISUAL TEST : 

- First I made a scatter plot to see how CGPA and GRE score vary together. From Figure 10 there appears to be a positive relationship between GRE scores and CGPA: it is not perfectly linear, but apart from a few exceptions, a higher CGPA goes with a higher GRE score.

#### STATISTICAL TEST : 
- Since the plot suggests a roughly linear relationship between CGPA and GRE scores, I performed a Pearson correlation test (the pearsonr method). The Pearson correlation coefficient measures the strength and direction of a linear relationship between two continuous variables. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.
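A minimal sketch of such a test with `scipy.stats.pearsonr` is given below. The two arrays are synthetic and only loosely imitate GRE and CGPA values, so the resulting coefficient is not the one reported in this project.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic CGPA values and GRE scores that depend on them plus noise.
cgpa = rng.normal(8.5, 0.6, 300)
gre = 316 + 15 * (cgpa - 8.5) + rng.normal(0, 6, 300)

# pearsonr returns the correlation coefficient and a two-sided p-value
# for the null hypothesis of zero correlation.
r, p_value = stats.pearsonr(gre, cgpa)
print(f"r={r:.4f}, p={p_value:.3g}")

alpha = 0.05
reject_null = p_value < alpha
```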
    
    

![gre%20vs%20gpa.png](attachment:gre%20vs%20gpa.png)



                                        FIGURE 10 : SCATTER PLOT OF CGPA VS GRE

#### The new dataframe created for easier computation

In [29]:
df_testing.head() 

| | GRE Score | CGPA |
|---|---|---|
| 0 | 337 | 9.65 |
| 1 | 324 | 8.87 |
| 2 | 316 | 8.0 |
| 3 | 322 | 8.67 |
| 4 | 314 | 8.21 |


## STATISTICAL TEST 

![Screenshot%202023-12-01%20at%203.05.55%20PM.png](attachment:Screenshot%202023-12-01%20at%203.05.55%20PM.png)

                                FIGURE 11 : RESULTS FROM STATISTICAL TEST

**Pearson Correlation Coefficient: 0.8259**

**P-value: 5.19e-126**

The Pearson correlation coefficient between GRE scores and CGPA is 0.8259, indicating a strong positive correlation. The extremely low p-value (5.19e-126) shows that this correlation is statistically significant. Therefore, we reject the null hypothesis: there is evidence of a correlation between GRE scores and CGPA.


## CORRELATION MATRIX

![CORRELATION%20MATRIX%20HYP1.png](attachment:CORRELATION%20MATRIX%20HYP1.png)

                                        FIGURE 12 : CORRELATION MATRIX FOR GRE AND CGPA

# Hypothesis Testing 2 : Chance of Admit With and Without Research



- **Null Hypothesis (H0):** The mean chance of admit for applicants with research experience is equal to the mean chance of admit  for applicants without research experience.
- **Alternative Hypothesis (H1):** The mean chance of admit  for applicants with research experience is significantly greater than the mean chance of  admit for applicants without research experience.


###  RATIONALE BEHIND THE HYPOTHESIS : 

- The hypothesis is designed to investigate the impact of research experience on admission chance . By comparing the mean admission probabilities between applicants with and without research experience, we aim to determine whether research experience is a significant factor in influencing admission chances. The alternative hypothesis suggests that research experience has a positive and significant effect on admission likelihood.


### STEPS AND METHODS FOLLOWED FOR TESTING HYPOTHESIS 2  : 


1. **Data Subsetting:**
   - I created a sub dataframe from the original, including only the columns 'Chance of Admit' and 'Research'. This simplification aimed to facilitate calculations and reduce complexity.

2. **Data Visualization:**
   - I initiated the analysis by visualizing the data to explore and understand the relationship between 'Chance of Admit' and 'Research'. Visualization provided insights into potential patterns or trends in the data.

3. **Statistical Testing:**
   - Subsequently, I conducted a statistical test to assess whether the difference in mean 'Chance of Admit' between the two 'Research' groups is significant. The statistical analysis aimed to provide quantitative evidence for or against a difference in means.

4. **Validation Test:**
   - To ensure the accuracy of the results, I performed a mathematical test in the final stage of the analysis. This additional test aimed to validate the statistical findings through a separate mathematical verification process.

#### VISUAL TEST : 

- First I made a box plot and a swarm plot to compare Chance of Admit between applicants with and without research experience. From these plots I interpreted that the mean for those with research experience was higher than for those without.

#### STATISTICAL TEST : 
- To confirm the hypothesis I went ahead with a <b>T test</b>. The Independent Samples T-Test is appropriate when comparing the means of two independent groups to determine if there is a significant difference between them. In our case, we are comparing the mean admission probabilities for two groups: those with research experience and those without research experience.
    
#### VALIDATION BY CALCULATION :

- Finally, to validate the findings of the statistical and visual tests, I computed the mean admission chance of both groups.
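The test and the validation by calculation can be sketched together as below. The two groups are synthetic draws with invented means and sizes, and the Welch variant (`equal_var=False`) is shown only as an optional robustness check that was not part of the original analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic admission chances for the two groups (made-up parameters).
with_research = rng.normal(0.8, 0.12, 120)
without_research = rng.normal(0.6, 0.11, 100)

# Independent-samples t-test; the p-value is two-sided by default.
t_stat, p_two_sided = stats.ttest_ind(with_research, without_research)

# Welch's variant does not assume equal variances in the two groups.
t_welch, p_welch = stats.ttest_ind(
    with_research, without_research, equal_var=False
)

# Validation by calculation: compare the group means directly.
mean_diff = with_research.mean() - without_research.mean()
print(f"t={t_stat:.2f}, p={p_two_sided:.3g}, mean difference={mean_diff:.3f}")
```

A positive t-statistic together with a positive mean difference indicates that the first group (with research) has the higher mean.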
    

![hyp%202.png](attachment:hyp%202.png)
                                    
                                    FIGURE 13 : CHANCE OF ADMIT AND RESEARCH BOX AND SWARM PLOT

## STATISTICAL TEST 

![Screenshot%202023-12-01%20at%203.54.57%20PM.png](attachment:Screenshot%202023-12-01%20at%203.54.57%20PM.png)

                                    FIGURE 14 : RESULTS FROM T TEST 

**T-Test Results:**

- **T-statistic:** 14.538797385517404
- **P-value:** 1.7977467729204891e-40

**Conclusion:**
The obtained T-statistic of 14.54 and an extremely low p-value of 1.80e-40 provide strong evidence to reject the null hypothesis. Therefore, we reject the idea that the mean admission probability for applicants with and without research experience is the same. The results suggest a significant difference in the mean admission probabilities, supporting the alternative hypothesis that applicants with research experience have a significantly greater mean admission probability compared to those without research experience.


![Screenshot%202023-12-01%20at%204.02.37%20PM.png](attachment:Screenshot%202023-12-01%20at%204.02.37%20PM.png)



                                 FIGURE 15 : MEANS COMPUTED MATHEMATICALLY

# Hypothesis Testing 3 : SOP and CGPA

**Null Hypothesis (H0):**
There is no correlation between SOP scores and CGPA.

**Alternative Hypothesis (H1):**
There is  correlation between SOP scores and CGPA. 


###  RATIONALE BEHIND THE HYPOTHESIS : 
The SOP (Statement of Purpose) is an essential component of graduate school applications and provides insights into a candidate's academic and professional goals. CGPA (Cumulative Grade Point Average) reflects the overall academic performance. The hypothesis assumes that the SOP, which articulates the candidate's intentions and aspirations, might exhibit a correlation with their academic achievement as measured by CGPA. Exploring this correlation can provide valuable insights into the relevance of SOP in the admission process.



### STEPS AND METHODS FOLLOWED FOR TESTING HYPOTHESIS 3  : 
     1. I created a sub dataframe from the original containing only the columns 'SOP' & 'CGPA'. I did this to make calculations easier and less confusing.
     2. I first visualised the data using a scatter plot to analyse the relationship.
     3. Finally I went ahead with the statistical test.
#### VISUAL TEST : 

- First I made a scatter plot of SOP and CGPA. There was some sort of relationship: higher CGPA tended to go with a higher SOP rating. The relationship was not strictly linear, but it appeared monotonic.

#### STATISTICAL TEST : 
-  Spearman's rank correlation coefficient is used when the relationship between variables may not be linear, and the data might contain outliers. In the context of the hypothesis testing for SOP scores and CGPA, this non-parametric test is suitable as it doesn't assume a specific distribution of data and is less sensitive to extreme values. Additionally, Spearman's correlation assesses monotonic relationships, making it appropriate for exploring the potential association between SOP scores and CGPA, even when the relationship is not strictly linear.
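A minimal sketch of the test with `scipy.stats.spearmanr` follows. The ratings are synthetic: they are generated on a 1 to 5 scale in 0.5 steps and loosely tied to made-up CGPA values, so the output does not reproduce the project's result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
cgpa = rng.normal(8.5, 0.6, 300)

# Made-up SOP ratings: a noisy function of CGPA, rounded to 0.5 steps
# and clipped to the 1-5 rating scale.
raw = 3.5 + 1.2 * (cgpa - 8.5) + rng.normal(0, 0.6, 300)
sop = np.clip(np.round(raw * 2) / 2, 1.0, 5.0)

# Spearman's rho is computed on ranks, so tied ratings are handled
# and only a monotonic (not linear) relationship is assumed.
rho, p_value = stats.spearmanr(sop, cgpa)
print(f"rho={rho:.4f}, p={p_value:.3g}")
```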

    


#### The new dataframe created for easier computation


In [77]:
df_testing.head()

| | CGPA | SOP |
|---|---|---|
| 0 | 9.65 | 4.5 |
| 1 | 8.87 | 4.0 |
| 2 | 8.0 | 3.0 |
| 3 | 8.67 | 3.5 |
| 4 | 8.21 | 2.0 |


![sop%20vs%20cgpa.png](attachment:sop%20vs%20cgpa.png)

                                    Figure 16 : SCATTER PLOT OF SOP AND CGPA 

**Spearman's Rank Correlation Results:**
- Spearman's Rank Correlation Coefficient: 0.7174
- P-value: 3.3614e-80

**Interpretation:**
The calculated Spearman's rank correlation coefficient of 0.7174 indicates a strong positive monotonic correlation between SOP scores and CGPA. The extremely low p-value (3.3614e-80) suggests that this correlation is statistically significant. Therefore, we reject the null hypothesis, providing evidence of a significant and positive association between SOP scores and CGPA.


![spearmans.png](attachment:spearmans.png)

                                     FIGURE 17 : RESULTS FROM THE STATISTICAL TEST

![sop%20vs%20cgpa%20corr.png](attachment:sop%20vs%20cgpa%20corr.png)


                                    FIGURE 18 : SHOWING THE CORRELATION BETWEEN SOP AND CGPA

# HYPOTHESIS 4 

<b>There exists a significant difference in the predictive performance of various regression models, including linear regression, lasso regression, and ridge regression, when applied to the task of estimating the likelihood of admission. The objective is to identify and select the best-suited regression model that optimally predicts the chance of admission among the considered models. Furthermore, the investigation will explore the impact of regularization by choosing the best alpha values for lasso and ridge regression through cross-validation, aiming to enhance the overall predictive accuracy and generalizability of the selected model </b>

**Choice of Regression for the Dataset:**

Regression is chosen for this dataset due to its simplicity and interpretability. The goal is to model the relationship between the independent variables (GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research) and the dependent variable (Chance of Admit). Linear regression provides coefficients that represent the magnitude and direction of the relationships, making it suitable for interpreting the impact of each predictor on the outcome. Additionally, it serves as a baseline model for comparison with more complex regression techniques.


# STEPS AND METHODS FOLLOWED FOR TESTING WITH REGRESSION  : 
     1. For testing with linear regression, the variable to be predicted (Y) was chance of admit. All other columns were used as independent variables.
     2. The dataset was split into 90% for training and 10% for testing. Since the dataset was small, I did not follow the 80-20 rule.
     3. The split and the model fitting were performed using the sklearn module.
     4. Finally we computed R2 and MSE for the model to evaluate its performance.
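The steps above can be sketched end to end as follows. The feature matrix and target are synthetic, and `random_state=42` is an arbitrary choice made here for reproducibility; neither is taken from the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
# Seven synthetic predictors and a target that is linear in them plus noise.
X = rng.normal(size=(500, 7))
true_coef = np.array([0.03, 0.02, 0.01, 0.01, 0.02, 0.07, 0.01])
y = 0.72 + X @ true_coef + rng.normal(0, 0.05, 500)

# 90% of the rows for training, 10% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate on the held-out rows only.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.5f}, R2={r2:.3f}")
```

With only 50 test rows, the reported metrics can vary noticeably with the random split, which is worth keeping in mind when comparing models.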

## LINEAR REGRESSION PERFORMANCE METRICS 

In [79]:
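from sklearn.metrics import mean_squared_error, r2_score

# y_test and y_pred_linear come from the train/test split and the fitted model in earlier cells
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)
print('MSE for linear regression', mse_linear)
print('r2 for linear regression ', r2_linear)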

MSE for linear regression 0.004606820638854424
r2 for linear regression  0.8044590074240376


# ASSUMPTIONS MADE BY LINEAR REGRESSION

1. **Linearity:** The relationship between the independent variables (features) and the dependent variable (target) is linear. This means that changes in the predictors are associated with constant changes in the response.
2. **Independence:** The observations are independent of each other. The value of the dependent variable for one observation should not be influenced by the values of the dependent variable for other observations.
3. **Homoscedasticity:** The variance of the errors (residuals) is constant across all levels of the independent variables. In simpler terms, the spread of residuals should be roughly constant as the predicted values increase.
4. **Normality:** The residuals are normally distributed.
5. **No Multicollinearity:** There should not be exact linear relationships among the independent variables (multicollinearity). High multicollinearity can lead to unreliable coefficient estimates.

# TEST FOR ASSUMPTIONS MADE BY LINEAR REGRESSION


## 1. LINEARITY

**Linearity Test using Residuals Plot:**

- To assess the linearity assumption in linear regression, a residuals plot was created. The plot compares the residuals (the differences between observed and predicted values) against the predicted values. A random scatter of residuals suggests that the linearity assumption is met, indicating that the model captures the underlying linear relationship between predictors and the target variable.

- Figure 19 shows no systematic pattern in the residuals, so the linearity assumption appears to hold for our model.


![LINEAR%20REGRESSION%20TEST%20FOR%20RESIDUALS%20.png](attachment:LINEAR%20REGRESSION%20TEST%20FOR%20RESIDUALS%20.png)

                                        FIGURE 19 : PLOT FOR TESTING LINEARITY

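## INDEPENDENCE TEST
Independence of the residuals was checked with the Durbin-Watson test. The statistic ranges from 0 to 4:
- a value around 2 suggests no autocorrelation
- a value less than 1 or greater than 3 suggests strong positive or negative autocorrelation, respectively.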

In [100]:
from statsmodels.stats.stattools import durbin_watson

durbinWatson = durbin_watson(residuals)
print("Durbin-Watson:", durbinWatson)

Durbin-Watson: 2.432662759659651
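To make the statistic less of a black box, the sketch below computes it by hand as the sum of squared successive differences of the residuals divided by the sum of squared residuals, and compares the result with the statsmodels function used above. The residuals here are synthetic independent draws, not the residuals of the fitted model.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
# Synthetic, independent residuals for illustration.
residuals = rng.normal(0, 1, 200)

# DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2
dw_manual = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
dw_library = durbin_watson(residuals)
print(f"manual={dw_manual:.4f}, statsmodels={dw_library:.4f}")
```

The test is most meaningful when the rows have a natural order, such as time; for cross-sectional data like this the result depends on the row order.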


##  HOMOSCEDASTICITY



- Homoscedasticity, an assumption of linear regression, can be assessed by plotting residuals against predicted values. In the presence of homoscedasticity, the residuals exhibit a random scatter around the zero line. Conversely, the presence of any discernible pattern, such as a funnel shape or systematic trend, suggests heteroscedasticity – a violation of this assumption. Therefore, a visual inspection of the residual plot helps to determine whether the variability of the residuals remains constant across all levels of the predicted values.

- From figure 19 it can be concluded that the assumption made by linear regression holds. There is no visible trend in the plot

## NORMALITY TEST

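- Normality of the residuals can be checked by plotting a histogram of the residuals to see whether it looks approximately normal, or by using a Q-Q plot, which compares the quantiles of the observed residuals against the theoretical quantiles expected under a normal distribution.
- Both plots (Figures 20 and 21) suggest that the assumption holds.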

![Q%20PLOT%20LINEAR%20REG.png](attachment:Q%20PLOT%20LINEAR%20REG.png)

                                         FIGURE 20 : Q-Q PLOT 


![NORMAL%20PLOT%20LINEAR%20REG.png](attachment:NORMAL%20PLOT%20LINEAR%20REG.png)


                                FIGURE 21 : RESIDUALS HISTOGRAM NORMAL PLOT
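The quantities behind a Q-Q plot, plus an optional formal check, can be sketched as below. The Shapiro-Wilk test is an addition for illustration and was not used in the original analysis; the residuals are synthetic normal draws.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
# Synthetic residuals standing in for the model's training residuals.
residuals = rng.normal(0, 0.06, 450)

# Without a plot argument, probplot returns the Q-Q coordinates and the
# least-squares line through them (slope, intercept, correlation r).
(theoretical_q, ordered_resid), (slope, intercept, r_fit) = stats.probplot(
    residuals, dist="norm"
)

# Shapiro-Wilk: the null hypothesis is that the sample is normal.
w_stat, p_value = stats.shapiro(residuals)
print(f"Q-Q line r={r_fit:.4f}, Shapiro W={w_stat:.4f}, p={p_value:.3f}")
```

An `r_fit` close to 1 means the points lie close to a straight line, which is what the Q-Q plot shows visually.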

# MULTICOLLINEARITY

From the correlation matrix in Figure 22, it can be seen that the predictors are related to one another. Some of these relationships are strong, but none is exactly linear.


![linear%20reg%20corr%20matrix.png](attachment:linear%20reg%20corr%20matrix.png)


                                FIGURE 22 : CORRELATION MATRIX FOR OUR TRAINING DATA

# Experimenting with Ridge Regression and Polynomial Regression

Given the observed relationships among the variables in the dataset, I decided to explore more complex regression models beyond linear regression. Specifically, I experimented with Ridge Regression and Polynomial Regression to capture potential non-linearities and account for multicollinearity.

### Ridge Regression:

- Ridge Regression is a regularization technique used to mitigate the effects of multicollinearity in a multiple regression model. It adds a penalty term, the sum of squared coefficients, to the linear regression cost function. This penalty, scaled by the hyperparameter alpha, shrinks the coefficients, preventing them from becoming too large and stabilizing the estimates when predictors are correlated. Ridge Regression is particularly useful when dealing with a dataset where predictor variables are highly correlated.

### Polynomial Regression:

- Polynomial Regression involves transforming the features of a linear regression model by introducing polynomial terms. This allows the model to capture non-linear relationships between the independent and dependent variables. The degree of the polynomial determines the complexity of the model. Polynomial Regression is valuable when the true relationship between variables is not linear and can help improve the model's fit to the data.

By experimenting with these advanced regression techniques, I aim to identify the model that best captures the underlying patterns in the dataset, considering both linear and non-linear relationships.

# RIDGE REGRESSION 

1. For testing with ridge regression, the variable to be predicted (Y) was Chance of Admit. All other columns were used as independent variables.
2. The dataset was split into 90% for training and 10% for testing. Since the dataset was small, I did not follow the 80-20 rule.
3. The model was fitted using the sklearn module.
4. Initially, I set alpha equal to 1.
5. Later, I experimented with other values of alpha.
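The steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual cell: the data is a synthetic stand-in generated with `make_regression`, and the sample size, noise level, and `random_state` values are assumptions chosen only so that the example runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the admissions data (assumption: 500 rows, 7 predictors).
X, y = make_regression(n_samples=500, n_features=7, noise=10.0, random_state=0)

# 90% training / 10% testing split, as described in the steps.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

# Ridge regression with the initial alpha of 1.
ridge_default = Ridge(alpha=1.0)
ridge_default.fit(X_train, y_train)
y_pred = ridge_default.predict(X_test)

# Evaluate on the held-out test set.
mse_ridge_default = mean_squared_error(y_test, y_pred)
r2_ridge_default = r2_score(y_test, y_pred)
print("mse for ridge regression:", mse_ridge_default)
print("R2 for ridge regression:", r2_ridge_default)
```

With the real dataset, `X` would hold the seven predictor columns and `y` the Chance of Admit column.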

## RIDGE REGRESSION PERFORMANCE METRICS

In [81]:
print('mse for ridge regression :  ' , mse_ridge_default)
print('R2 for ridge regression :  ' , r2_ridge_default)

mse for ridge regression :   0.004625297337725463
R2 for ridge regression :   0.8036747459300481


## TUNING PARAMETERS OF RIDGE REGRESSION

Alpha is a hyperparameter of Ridge Regression that controls the amount of shrinkage: the larger the value of alpha, the greater the shrinkage, and the more robust the coefficients become to collinearity.

The goal is to find an optimal balance between bias (under-fitting) and variance (over-fitting).

- If alpha = 0: the objective is the same as in ordinary linear regression, so we get the same coefficients.
- If alpha = ∞: the coefficients are zero, because they are penalized with an infinite weight. In practice, a very large alpha shrinks all coefficients towards zero and leads to a model with high bias.
- If 0 < alpha < ∞: the magnitude of alpha decides the weight given to the two parts of the cost function (the squared error and the penalty).
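The effect of alpha described above can be checked numerically. The sketch below uses synthetic data (an assumption, not the admissions dataset) and compares the size of the ridge coefficient vector for several alpha values with the ordinary least-squares coefficients. A tiny alpha is used in place of exactly 0.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data used only for illustration.
X, y = make_regression(n_samples=200, n_features=7, noise=5.0, random_state=0)

# Ordinary least squares as the alpha -> 0 reference point.
ols = LinearRegression().fit(X, y)
ols_norm = np.linalg.norm(ols.coef_)

# Coefficient norm for increasing penalty strength.
alphas = [1e-8, 1.0, 100.0, 1e6]
norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(ridge.coef_))
    print(f"alpha={alpha:g}: coefficient norm = {norms[-1]:.4f}")

print(f"OLS coefficient norm = {ols_norm:.4f}")
```

The norm for the smallest alpha matches the OLS norm, and it decreases steadily towards zero as alpha grows.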


Scikit-Learn provides a function called RidgeCV which performs ridge regression with built-in cross-validation of the alpha parameter. It will compute regressor performance for various alphas and give us best performing one:

![RIDGE%20_cv.png](attachment:RIDGE%20_cv.png)

                            FIGURE 23: RESULTS AFTER HYPERTUNING 
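A tuning step of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the cell that produced the figure: the data is synthetic, and the alpha grid (`candidate_alphas`) and the 5-fold setting are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the admissions data.
X, y = make_regression(n_samples=500, n_features=7, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

# Hypothetical grid of alpha values on a log scale.
candidate_alphas = np.logspace(-3, 3, 13)

# RidgeCV evaluates every alpha with 5-fold cross-validation on the training set.
ridge_cv = RidgeCV(alphas=candidate_alphas, cv=5)
ridge_cv.fit(X_train, y_train)

best_alpha = ridge_cv.alpha_
test_r2 = ridge_cv.score(X_test, y_test)
print("Best alpha:", best_alpha)
print("Test R2 with best alpha:", test_r2)
```

Only the training set is used to choose alpha, so the test score remains an honest estimate of performance.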

## RIDGE REGRESSION CONCLUSION

- The results before and after tuning show no significant difference, and Ridge Regression shows no significant difference from linear regression. Next, we proceed with polynomial regression to capture the relationships among the variables.

# POLYNOMIAL REGRESSION

## Polynomial Regression Analysis

In the course of our analysis, we explored various regression techniques to model the relationship between the independent variables and the dependent variable ("Chance of Admit"). After an initial attempt with Ridge Regression, which did not show a significant improvement over linear regression, we proceeded to employ Polynomial Regression.

### Polynomial Regression Overview

Polynomial regression is a versatile regression technique that allows for the modeling of nonlinear relationships between variables. By introducing polynomial terms (e.g., \(X^2\), \(X^3\)) based on the degree chosen, the model gains the capability to capture curves and bends in the data.

### Flexibility and Overfitting

The flexibility of polynomial regression comes with a caveat—overfitting. With the increased complexity of the model, there is a risk of fitting the training data too closely, capturing noise instead of the true underlying pattern. To address this, regularization techniques like Ridge Regression can be applied.

### Degree of the Polynomial

The choice of the polynomial degree is pivotal. Higher degrees do not always lead to better performance, and finding the optimal degree requires experimentation. It's crucial to strike a balance between model complexity and performance.
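One common way to run this kind of experiment is to compare degrees by cross-validated error. The sketch below is an assumption-laden illustration: the data is synthetic with a known quadratic structure, and the candidate degrees and 5-fold setting are hypothetical choices, not the procedure used in this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a squared term and an interaction term.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 3))
y = (
    1.0
    + X[:, 0]
    - 0.5 * X[:, 1] ** 2
    + 0.3 * X[:, 0] * X[:, 2]
    + rng.normal(0, 0.1, size=300)
)

# Cross-validated MSE for each candidate degree.
cv_mse = {}
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()
    print(f"degree={degree}: CV MSE = {cv_mse[degree]:.5f}")

best_degree = min(cv_mse, key=cv_mse.get)
print("Degree with lowest CV MSE:", best_degree)
```

Because the comparison uses cross-validation rather than a single split, it is less sensitive to which rows happen to land in a small test set.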


### Further regression?


Despite the introduction of polynomial terms and the exploration of various degrees, if the polynomial regression does not exhibit a significant improvement over previous models, it suggests that the underlying relationships in the data may not be adequately captured by the increased model complexity. In such cases, simpler models like linear regression might be more appropriate.

Moving forward, further experimentation with different degrees and potentially applying regularization to the polynomial regression model can be considered to refine our understanding of the relationships within the dataset.


# STEPS AND METHODS FOLLOWED FOR TESTING WITH REGRESSION:
1. For testing with polynomial regression, the variable to be predicted (Y) was Chance of Admit. All other columns were used as independent variables.
2. The dataset was split into 90% for training and 10% for testing. Since the dataset was small, I did not follow the 80-20 rule.
3. We used the fit_transform method from sklearn to transform our original features into polynomial features.
4. The model was fitted using the sklearn module.
5. Finally, we computed R-squared and MSE for the model to evaluate its performance.
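The steps above can be sketched as follows. The data here is a synthetic stand-in and the `random_state` values are assumptions; the variable names mirror those used in the metric cells of this report, but the block is an illustration, not the original cell.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the admissions data (7 predictors).
X, y = make_regression(n_samples=500, n_features=7, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

# Fit the polynomial expansion on the training data only, then reuse it for the test data.
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Ordinary linear regression on the expanded features.
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_pred_poly = poly_model.predict(X_test_poly)

mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)
print(f"MSE for Polynomial Regression: {mse_poly}")
print(f"R2 for Polynomial Regression: {r2_poly}")
```

With 7 predictors and degree 2, the expansion produces 36 columns: one constant column, 7 linear terms, and 28 squared or interaction terms.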

# POLYNOMIAL REGRESSION PERFORMANCE METRICS

In [92]:
from sklearn.metrics import mean_squared_error, r2_score
mse_poly = mean_squared_error(y_test, y_pred_poly)
print(f'Mean Squared Error (MSE) for Polynomial Regression: {mse_poly:}')
r2_poly = r2_score(y_test, y_pred_poly)
print(f'R-squared (R2) for Polynomial Regression: {r2_poly:}')

Mean Squared Error (MSE) for Polynomial Regression: 0.004373411809440357
R-squared (R2) for Polynomial Regression: 0.8143662727068834


- <b> It can be seen that polynomial regression is a slight improvement over linear regression.</b>

## EXPERIMENTING WITH L1 AND L2 REGULARIZATION

### L2 REGULARIZATION PERFORMANCE METRICS (RIDGE)

In [91]:
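from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Expand the original features into degree-2 polynomial terms
degree = 2
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Fit ridge regression (L2 penalty) on the polynomial features
alpha_ridge = 0.01
ridge_model = Ridge(alpha=alpha_ridge)
ridge_model.fit(X_train_poly, y_train)
y_pred_poly_ridge = ridge_model.predict(X_test_poly)

# Evaluate on the held-out test set
mse_poly_ridge = mean_squared_error(y_test, y_pred_poly_ridge)
print(f'Mean Squared Error (Polynomial Regression with Ridge): {mse_poly_ridge}')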


Mean Squared Error (Polynomial Regression with Ridge): 0.004740793381855659


## L1 REGULARIZATION PERFORMANCE METRICS (LASSO)

In [93]:
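from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures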
degree = 2
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
alpha_lasso = 0.01  
lasso_model = Lasso(alpha=alpha_lasso)
lasso_model.fit(X_train_poly, y_train)
y_pred_poly_lasso = lasso_model.predict(X_test_poly)
mse_poly_lasso = mean_squared_error(y_test, y_pred_poly_lasso)
print(f'Mean Squared Error (Polynomial Regression with Lasso): {mse_poly_lasso}')


Mean Squared Error (Polynomial Regression with Lasso): 0.004601864171175934




<b>It can be seen that L1 and L2 regularization do not improve the model. MSE increased in both cases.</b>

## EXPERIMENTING WITH DEGREES 

### For degree 3, the model performs even worse.

In [97]:
mse_poly = mean_squared_error(y_test, y_pred_poly)
print(f'Mean Squared Error (MSE) for Polynomial Regression: {mse_poly:}')
r2_poly = r2_score(y_test, y_pred_poly)
print(f'R-squared (R2) for Polynomial Regression: {r2_poly:}')

Mean Squared Error (MSE) for Polynomial Regression: 0.01136772163458145
R-squared (R2) for Polynomial Regression: 0.5174859743820948


## The polynomial regression model with degree 2 has been chosen as the final model. The reasons include its lower MSE.

# ASSUMPTIONS TEST FOR POLYNOMIAL REGRESSION

Polynomial regression shares many of the same assumptions as linear regression. These include:

1. **Linearity:** The relationship between the predictors (i.e., independent variables) and the response (i.e., dependent variable) is assumed to be polynomial in nature.
2. **Independence:** The residuals are assumed to be independent, meaning that there's no correlation between consecutive residuals or residuals and predictors.
3. **Homoscedasticity:** The variance of the errors is constant across all levels of the predictor variables.
4. **Normality:** For any fixed value of X, Y is normally distributed.

## 1. LINEARITY TEST 

**Linearity Test using Residuals Plot:**

- To assess this assumption for the polynomial model, a residuals plot was created. The plot compares the residuals (the differences between observed and predicted values) against the predicted values. A random scatter of residuals around zero suggests that the assumption is met, indicating that the fitted polynomial form captures the underlying relationship between the predictors and the target variable.

![PLOY.png](attachment:PLOY.png)

                                       FIGURE 24 : RESIDUAL PLOT 

## INDEPENDENCE TEST
The Durbin-Watson test was used to check independence:
- A value around 2 suggests no autocorrelation.
- A value less than 1 or greater than 3 suggests positive or negative autocorrelation, respectively.

In [98]:
durbin_watson_stat_poly = sm.stats.stattools.durbin_watson(residuals_poly)
print(f'Durbin-Watson Statistic for Polynomial Regression: {durbin_watson_stat_poly}')

Durbin-Watson Statistic for Polynomial Regression: 2.5408269288268146
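To make the statistic less of a black box: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that formula directly with NumPy and applies it to simulated series. The helper name `durbin_watson_stat` and the simulated data are illustrative assumptions, not part of the project code.

```python
import numpy as np


def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)


rng = np.random.default_rng(0)

# Independent residuals: the statistic should be close to 2.
independent = rng.normal(size=2000)

# Positively autocorrelated residuals (AR(1) with coefficient 0.9): statistic well below 2.
autocorrelated = np.empty(2000)
autocorrelated[0] = independent[0]
for t in range(1, 2000):
    autocorrelated[t] = 0.9 * autocorrelated[t - 1] + independent[t]

dw_independent = durbin_watson_stat(independent)
dw_autocorrelated = durbin_watson_stat(autocorrelated)
print("Independent residuals:", dw_independent)
print("Positively autocorrelated residuals:", dw_autocorrelated)
```

The statistic always lies between 0 and 4, which is why values near 2 are read as the neutral case.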


## NORMALITY TEST

- Normality can be checked by plotting a histogram of the residuals to see whether it looks approximately normal, or by using a Q-Q plot, which compares the quantiles of the observed data against the theoretical quantiles expected under a normal distribution.
- The assumption holds.

![normal%20poly.png](attachment:normal%20poly.png)

                                        FIGURE 25 : NORMALITY TEST FOR POLYNOMIAL REGRESSION
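The visual check can be complemented with a numerical one. The sketch below is an optional addition, not something done in this project: it applies the Shapiro-Wilk test and the Q-Q correlation from SciPy to simulated residuals. Both residual arrays are synthetic assumptions; with real data, the model's residuals would be passed in instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated residuals: one roughly normal sample, one clearly skewed sample.
residuals_normal = rng.normal(0, 0.06, size=300)
residuals_skewed = rng.exponential(0.06, size=300) - 0.06

# Shapiro-Wilk test: a small p-value is evidence against normality.
stat_normal, p_normal = stats.shapiro(residuals_normal)
stat_skewed, p_skewed = stats.shapiro(residuals_skewed)
print(f"Normal-like residuals: p = {p_normal:.4f}")
print(f"Skewed residuals:      p = {p_skewed:.2e}")

# Q-Q data: r close to 1 means the points lie near a straight line.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals_normal, dist="norm"
)
print(f"Q-Q correlation for normal-like residuals: r = {r:.4f}")
```

A formal test is most useful as a supplement to the plots, since with large samples it can flag departures that are too small to matter in practice.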

##  HOMOSCEDASTICITY



- Homoscedasticity, an assumption of linear regression, can be assessed by plotting residuals against predicted values. In the presence of homoscedasticity, the residuals exhibit a random scatter around the zero line. Conversely, the presence of any discernible pattern, such as a funnel shape or systematic trend, suggests heteroscedasticity – a violation of this assumption. Therefore, a visual inspection of the residual plot helps to determine whether the variability of the residuals remains constant across all levels of the predicted values.

- From Figure 24 it can be concluded that the homoscedasticity assumption holds: there is no visible trend in the residual plot.
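As an informal numerical companion to the visual inspection, one can check whether the size of the residuals changes with the predicted values. The sketch below does this with a Spearman rank correlation between the fitted values and the absolute residuals. This is a swapped-in rough check, not a method used in this project, and all data in it is simulated.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Simulated fitted values in a range similar to a probability-like target.
fitted = rng.uniform(0.3, 1.0, size=500)

# Case 1: constant spread (homoscedastic residuals).
resid_constant = rng.normal(0, 0.05, size=500)

# Case 2: spread grows with the fitted value (funnel shape, heteroscedastic).
resid_funnel = rng.normal(0.0, 0.1 * fitted ** 3)

# Rank correlation between fitted values and residual magnitude.
rho_constant, p_constant = spearmanr(fitted, np.abs(resid_constant))
rho_funnel, p_funnel = spearmanr(fitted, np.abs(resid_funnel))
print(f"Constant spread: rho = {rho_constant:.3f}, p = {p_constant:.3f}")
print(f"Funnel shape:    rho = {rho_funnel:.3f}, p = {p_funnel:.2e}")
```

A correlation near zero is consistent with constant spread, while a clearly positive or negative value points to a funnel-like pattern.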

# INTERPRETATION OF COEFFICIENTS FOR POLYNOMIAL REGRESSION

## Polynomial Regression Coefficients

### Coefficient Interpretation

1. **Constant Term (Intercept):**
   - $X_0 = 1.0000$ 

2. **Linear Coefficients:**
   - $X_1$ (GRE Score): -0.0013
   - $X_2$ (TOEFL Score): 0.0406
   - $X_3$ (University Rating): -0.0531
   - $X_4$ (SOP): -0.1573
   - $X_5$ (LOR): 0.0181
   - $X_6$ (CGPA): 0.6243
   - $X_7$ (Research): -0.0908

3. **Quadratic Coefficients:**
   - $X_8$ (GRE Score^2): 0.0001
   - $X_9$ (GRE Score x TOEFL Score): -0.0000
   - $X_{10}$ (GRE Score x University Rating): 0.0002
   - $X_{11}$ (GRE Score x SOP): 0.0001
   - $X_{12}$ (GRE Score x LOR): 0.0009
   - $X_{13}$ (GRE Score x CGPA): -0.0032
   - $X_{14}$ (GRE Score x Research): -0.0004
   - $X_{15}$ (TOEFL Score^2): -0.0001
   - $X_{16}$ (TOEFL Score x University Rating): 0.0007
   - $X_{17}$ (TOEFL Score x SOP): 0.0016
   - $X_{18}$ (TOEFL Score x LOR): -0.0015
   - $X_{19}$ (TOEFL Score x CGPA): 0.0004
   - $X_{20}$ (TOEFL Score x Research): 0.0008
   - $X_{21}$ (University Rating^2): -0.0014
   - $X_{22}$ (University Rating x SOP): 0.0226
   - $X_{23}$ (University Rating x LOR): -0.0051
   - $X_{24}$ (University Rating x CGPA): -0.0134
   - $X_{25}$ (University Rating x Research): 0.0026
   - $X_{26}$ (SOP^2): -0.0149
   - $X_{27}$ (SOP x LOR): 0.0036
   - $X_{28}$ (SOP x CGPA): -0.0020
   - $X_{29}$ (SOP x Research): -0.0021
   - $X_{30}$ (LOR^2): 0.0030
   - $X_{31}$ (LOR x CGPA): -0.0163
   - $X_{32}$ (LOR x Research): -0.0046
   - $X_{33}$ (CGPA^2): 0.0313
   - $X_{34}$ (CGPA x Research): 0.0322
   - $X_{35}$ (Research^2): -0.0908


## Significance of Polynomial Regression Coefficients

In the context of the polynomial regression model applied to predict the chance of admission, the coefficients play a crucial role in capturing complex relationships between independent variables (features) and the dependent variable (chance of admission). Let's delve into the significance of each component:

### 1. Constant Term (Intercept):
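   - **Coefficient for \(X_0\) (constant column):** 0.0000
   - **Significance:** \(X_0\) is the constant column of ones that the polynomial feature expansion adds. Its coefficient of 0.0000 is what one would expect if the regression estimates its intercept separately, which scikit-learn's linear models do by default; the fitted intercept itself is not reported here. In general, the intercept represents the baseline value of the chance of admission when all independent variables are zero. While it may not have a practical interpretation in this context, it is included to account for the constant term in the polynomial regression equation.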

### 2. Linear Coefficients:
   - **Coefficients for \(X_1\) to \(X_7\):** -0.0013, 0.0406, -0.0531, -0.1573, 0.0181, 0.6243, -0.0908
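   - **Significance:** These coefficients quantify the contribution of the individual linear terms to the chance of admission. For example, the positive coefficient for CGPA (0.6243) suggests that an increase in CGPA is associated with a higher chance of admission, while the negative coefficient for SOP (-0.1573) points in the opposite direction. Because the model also contains squared and interaction terms involving the same features, a linear coefficient on its own does not give the full effect of a feature, so these signs should be read together with the quadratic coefficients.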

### 3. Quadratic Coefficients:
   - **Coefficients for \(X_8\) to \(X_{35}\):** 0.0001, -0.0000, 0.0002, 0.0001, 0.0009, -0.0032, -0.0004, -0.0001, 0.0007, 0.0016, -0.0015, 0.0004, 0.0008, -0.0014, 0.0226, -0.0051, -0.0134, 0.0026, -0.0149, 0.0036, -0.0020, -0.0021, 0.0030, -0.0163, -0.0046, 0.0313, 0.0322, -0.0908
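   - **Significance:** These coefficients cover two kinds of terms. Squared terms (for example CGPA^2) capture curvature in the effect of a single feature: a positive coefficient indicates a convex relationship, while a negative coefficient suggests a concave one. Cross terms (for example CGPA x Research) represent interaction effects, where the effect of one feature depends on the value of another.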
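To make the reading of these terms concrete, a degree-2 model can be written in general form as follows. The symbols $\beta_j$ and $\beta_{jk}$ are notation introduced here for the linear and quadratic coefficients and are not taken from the project code.

$$
\hat{y} = \beta_0 + \sum_{j=1}^{7} \beta_j x_j + \sum_{j=1}^{7} \sum_{k=j}^{7} \beta_{jk} x_j x_k
$$

The effect of a one-unit change in feature $x_j$ is then not a single number but depends on the values of the features:

$$
\frac{\partial \hat{y}}{\partial x_j} = \beta_j + 2 \beta_{jj} x_j + \sum_{k \neq j} \beta_{jk} x_k
$$

Using the rounded coefficients listed above, the effect of CGPA would be approximately

$$
\frac{\partial \hat{y}}{\partial \text{CGPA}} \approx 0.6243 + 2(0.0313)\,\text{CGPA} - 0.0032\,\text{GRE} + 0.0004\,\text{TOEFL} - 0.0134\,\text{Rating} - 0.0020\,\text{SOP} - 0.0163\,\text{LOR} + 0.0322\,\text{Research}
$$

Because the reported coefficients are rounded to four decimals and GRE scores are large numbers, this expression is only a rough guide and should not be used for precise calculations.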

### Overall Significance:
- The coefficients collectively define a polynomial equation, allowing for a more flexible modeling of the relationship between features and the chance of admission.
- Signs and magnitudes of the coefficients provide insights into the impact of each feature and its interactions on the chance of admission.
- The polynomial regression model enables the capture of non-linear patterns in the data, enhancing predictive accuracy.

Understanding the significance of these coefficients is essential for interpreting the model's behavior and making informed decisions about the importance of different features in predicting the chance of admission.



# CONCLUSION 

## Project Overview and implications

In this project, a comprehensive analysis was conducted to predict the chances of admission for prospective students. The journey encompassed several key stages, each contributing to a deeper understanding of the dataset and the factors influencing admission outcomes.

### Exploratory Data Analysis (EDA)

The project commenced with Exploratory Data Analysis (EDA), a crucial phase for gaining insights into the dataset's characteristics. The EDA process involved the following key steps:

- **Data Inspection:** An initial exploration of the dataset, checking for missing values, and understanding the distribution of variables.
- **Statistical Summaries:** Obtaining descriptive statistics and visualizations to summarize the main features, uncover patterns, and identify potential outliers.
- **Correlation Analysis:** Exploring correlations between different features, laying the groundwork for understanding potential relationships.

### Relationship Exploration

Following EDA, an in-depth analysis of relationships between variables was conducted. This involved hypothesis testing to assess correlations and associations. Notably, the following relationships were explored:

1. **GRE Scores and CGPA:**
   - The hypothesis test revealed a significant correlation between GRE scores and CGPA, suggesting that variations in GRE scores are associated with meaningful changes in CGPA.

2. **Research Experience and Chance of Admit:**
   - The hypothesis test indicated that applicants with research experience have a significantly higher mean chance of admission compared to those without research experience, emphasizing the importance of research background.

3. **SOP Scores and CGPA:**
   - Another significant correlation was identified between SOP scores and CGPA, highlighting the association between Statement of Purpose scores and academic performance.

### Regression Model Testing

To predict the chances of admission more accurately, various regression models were employed and evaluated. The models included linear regression and polynomial regression with different degrees. Key steps in the modeling process included:

- **Data Splitting:** The dataset was split into training and testing sets to facilitate model training and evaluation.
- **Feature Transformation:** Polynomial regression was applied to capture non-linear relationships among variables.
- **Regularization Techniques:** L1 (Lasso) and L2 (Ridge) regularization were applied to the polynomial regression model to test whether they improved it.



Finally, the polynomial regression model was chosen as the better-performing one.

In conclusion, this project seamlessly integrated exploratory data analysis, hypothesis testing, and regression modeling to gain insights into the factors influencing admission chances. The rejection of null hypotheses and the significance of various features in the regression models collectively contribute to a robust understanding of the dataset.

Moving forward, the predictive power of the polynomial regression model, especially in capturing non-linear relationships, provides a solid foundation for enhancing admission predictions. The insights gained from this project can inform future decision-making processes in the context of student admissions.


## Interpretation of Hypothesis Testing Results in the Context of Polynomial Regression Model

### 1. Hypothesis Testing 1: GRE Scores and CGPA

- **Result:** Reject the null hypothesis, indicating a significant correlation between GRE scores and CGPA.

**Interpretation for Polynomial Regression Model:**
- The polynomial regression model might capture the correlation between GRE scores and CGPA through the corresponding coefficients. The significant correlation suggests that as GRE scores increase or decrease, there is a meaningful impact on CGPA, and this relationship is incorporated into the polynomial regression equation.

### 2. Hypothesis Testing 2: Research Experience and Chance of Admit

- **Result:** Reject the null hypothesis, suggesting that applicants with research experience have a significantly higher mean chance of admission compared to those without research experience.

**Interpretation for Polynomial Regression Model:**
- The polynomial regression model may reflect the influence of research experience on the chance of admission through specific coefficients. The rejection of the null hypothesis indicates that the presence of research experience is a meaningful factor contributing to a higher chance of admission.

### 3. Hypothesis Testing 3: SOP Scores and CGPA

- **Result:** Reject the null hypothesis, indicating a significant correlation between SOP scores and CGPA.

**Interpretation for Polynomial Regression Model:**
- Similar to the first hypothesis, the polynomial regression model could capture the correlation between SOP scores and CGPA through the relevant coefficients. The rejection of the null hypothesis suggests that variations in SOP scores are associated with meaningful changes in CGPA, and these associations are considered in the polynomial regression equation.

### Overall interpretation:

- The rejection of the null hypotheses in all three tests shows that the features examined (GRE scores, research experience, SOP scores) are significantly related to CGPA or to the chance of admission, which supports including them as predictors of the chance of admission.
- The polynomial regression model, with its flexibility to capture non-linear relationships, aligns with the observed correlations and associations identified in the hypothesis tests.
- The coefficients in the polynomial regression model corresponding to GRE scores, research experience, and SOP scores likely contribute meaningfully to the model's predictive power for the chance of admission.

These results collectively contribute to a more nuanced understanding of the relationships between input features and the target variable in the context of predicting admission chances.


## Limitations

While this project has provided valuable insights into predicting admission chances, it is essential to acknowledge certain limitations that may impact the generalizability and scope of the findings:

1. **Population Bias:**
   - The dataset used for this project primarily consists of applicants from India. As a result, the findings and models developed may not be universally applicable to a more diverse global population. Future research should aim to incorporate data from a broader range of geographical locations and demographics to enhance the model's generalizability.

2. **Limited Feature Set:**
   - The dataset includes a specific set of features such as GRE scores, CGPA, SOP scores, etc. Other potential influential factors, such as letters of recommendation from employers or additional standardized test scores, may not have been considered. The model's predictive power could be further enhanced with the inclusion of a more comprehensive set of features.

3. **Assumption of Linearity:**
   - The polynomial regression model is still linear in its coefficients and is limited to the polynomial terms of the chosen degree. While polynomial regression allows for capturing non-linear patterns, it might not fully address intricate relationships that require more sophisticated modeling techniques.

4. **Data Quality and Representativeness:**
   - The accuracy of any predictive model is heavily reliant on the quality and representativeness of the data. Inaccuracies, missing values, or biases in the dataset could impact the reliability of the model's predictions. Rigorous data validation and cleansing processes should be implemented to mitigate these concerns.

5. **Changing Admission Criteria:**
   - The admissions process for educational institutions is dynamic and subject to change. The criteria for admission may evolve over time due to policy changes, institutional priorities, or external factors. As a result, models developed based on historical data may not fully capture future shifts in admission requirements.

Acknowledging these limitations is crucial for interpreting the project's findings responsibly and guiding future research efforts to address these challenges.


## Future Research Directions

1. **Diverse Geographical Representation:**
   - Include admission data from diverse global regions to enhance the model's cross-cultural applicability.

2. **Enriched Feature Set:**
   - Incorporate additional relevant features, such as letters of recommendation from employers or additional standardized test scores, to improve the model's predictive accuracy.

3. **Advanced Modeling Techniques:**
   - Explore advanced modeling methods beyond polynomial regression for a more nuanced understanding of complex relationships.

4. **Continuous Data Monitoring:**
   - Implement a system for ongoing data monitoring and updating to ensure the model remains relevant over time.

5. **Cross-Validation and External Validation:**
   - Conduct cross-validation and external validation studies with independent datasets to validate model robustness and generalizability.


# REFERENCES
## LINKS

1. Class Notes
2. https://www.kaggle.com
3. https://www.analyticsvidhya.com
4. https://stackoverflow.com
5. https://matplotlib.org
6. https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/

## CODE

In [104]:


'''#!/usr/bin/env python
# coding: utf-8

# In[21]:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt


# In[22]:


data = pd.read_csv('admission_dataset.csv')


# In[23]:


data.head()


# In[24]:


data.shape


# In[25]:


data.info()


# In[26]:


data.describe()


# In[27]:


data.head()


# In[28]:


data = data.drop(['Serial No.'], axis=1)


# In[29]:


data.head()


# In[30]:


df = data


# In[31]:


# Checking for null values 

df.isnull().sum()



# In[32]:


df.head()


# #  EDA 

# In[33]:


num_cols = len(df.columns)
num_rows = (num_cols - 1) // 3 + 1  

fig, axes = plt.subplots(nrows=num_rows, ncols=3, figsize=(15, 5 * num_rows))
fig.suptitle('Histograms for Each Column', fontsize=16)

axes_flat = axes.flatten()

for i, column in enumerate(df.columns):
    df[column].hist(ax=axes_flat[i], bins=20, edgecolor='black', color='skyblue')
    axes_flat[i].set_title(column)
    axes_flat[i].grid(False)  # Remove grid

for i in range(num_cols, len(axes_flat)):
    fig.delaxes(axes_flat[i])

plt.tight_layout(rect=[0, 0.03, 1, 0.95])

plt.show()


# In[34]:


fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(15, 8))
fig.suptitle('Box Plots for Each Column', fontsize=16)


axes_flat = axes.flatten()


for i, column in enumerate(df.columns):
    df.boxplot(column, ax=axes_flat[i], vert=False)
    axes_flat[i].set_title(column)
    axes_flat[i].grid(False)  


plt.tight_layout(rect=[0, 0.03, 1, 0.95])


plt.show()


# In[35]:


def count_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers_count = len(df[(df[column] < lower_bound) | (df[column] > upper_bound)])
    return outliers_count


# In[36]:


for column in df.columns:
    outliers_count = count_outliers(df, column)
    print(f"Column '{column}': {outliers_count} outliers")



# In[37]:


sns.heatmap(df.corr(),annot=True)
plt.title('Correlation Matrix')
plt.xticks(rotation=60)
plt.show()

# From the correlation matrix we can see that the factor most strongly correlated with chance of admit is CGPA. The least correlated factor is Research.
# In[38]:


sns.pairplot(df)
plt.show()


# In[39]:


from scipy.stats import norm
import numpy as np 
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
fig.suptitle('Histograms with Fitted Curves for GRE Score, TOEFL Score, and CGPA', fontsize=16)


sns.histplot(df['GRE Score'], kde=True, color='skyblue', ax=axes[0], stat='density', bins=20)
axes[0].set_title('GRE Score')
mu_gre, std_gre = norm.fit(df['GRE Score'])
xmin_gre, xmax_gre = axes[0].get_xlim()
x_gre = np.linspace(xmin_gre, xmax_gre, 100)
p_gre = norm.pdf(x_gre, mu_gre, std_gre)
axes[0].plot(x_gre, p_gre, 'k', linewidth=2)


sns.histplot(df['TOEFL Score'], kde=True, color='salmon', ax=axes[1], stat='density', bins=20)
axes[1].set_title('TOEFL Score')
mu_toefl, std_toefl = norm.fit(df['TOEFL Score'])
xmin_toefl, xmax_toefl = axes[1].get_xlim()
x_toefl = np.linspace(xmin_toefl, xmax_toefl, 100)
p_toefl = norm.pdf(x_toefl, mu_toefl, std_toefl)
axes[1].plot(x_toefl, p_toefl, 'k', linewidth=2)

sns.histplot(df['CGPA'], kde=True, color='lightgreen', ax=axes[2], stat='density', bins=20)
axes[2].set_title('CGPA')
mu_cgpa, std_cgpa = norm.fit(df['CGPA'])
xmin_cgpa, xmax_cgpa = axes[2].get_xlim()
x_cgpa = np.linspace(xmin_cgpa, xmax_cgpa, 100)
p_cgpa = norm.pdf(x_cgpa, mu_cgpa, std_cgpa)
axes[2].plot(x_cgpa, p_cgpa, 'k', linewidth=2)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])

plt.show()


# In[40]:


df.columns


# In[41]:


df['SOP'].value_counts().sort_index()


# In[42]:


sns.countplot(x='SOP',data=df)
plt.show()


# In[43]:


df['University Rating'].value_counts().sort_index()


# In[44]:


sns.countplot(x='University Rating',data=df)
plt.show();


# In[45]:


df['LOR '].value_counts().sort_index()


# In[46]:


sns.countplot(x='LOR ',data=df)
plt.show();


# In[47]:


df['Research'].value_counts().sort_index()


# In[48]:


sns.countplot(x='Research',data=df)
plt.show();


# 

# ## Hypothesis Testing: GRE Scores and CGPA
# 
# **Null Hypothesis (H0):**
# There is no significant correlation between GRE scores and CGPA.
# 
# **Alternative Hypothesis (H1):**
# There is a significant correlation between GRE scores and CGPA. 
# 
# 

# In[49]:


df_testing = df[['GRE Score', 'CGPA']].copy()


# In[50]:


df_testing.head()


# In[51]:


import seaborn as sns
import matplotlib.pyplot as plt


plt.figure(figsize=(10, 6))
sns.scatterplot(x='GRE Score', y='CGPA', data=df_testing)
plt.title('Scatter Plot: GRE Scores vs. CGPA')
plt.xlabel('GRE Scores')
plt.ylabel('CGPA')
plt.show()


# #### As can be seen from the scatter plot, there appears to be a relationship between GRE scores and CGPA: with a few exceptions, the higher the CGPA, the higher the GRE score.

# ## STATISTICAL TEST

# Since the plot suggests a linear relationship between CGPA and GRE scores, we perform a Pearson correlation test. The choice of the Pearson correlation coefficient (the pearsonr method) is based on the assumption that the relationship between GRE scores and CGPA is linear. The Pearson correlation coefficient measures the strength and direction of a linear relationship between two continuous variables. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.
# In[52]:


from scipy.stats import pearsonr

correlation_coefficient, p_value = pearsonr(df_testing['GRE Score'], df_testing['CGPA'])

print(f"Pearson Correlation Coefficient: {correlation_coefficient:.4f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05  
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant correlation.")
else:
    print("Fail to reject the null hypothesis: There is no significant correlation.")


# From the p-value we can conclude that there is a significant correlation between CGPA and GRE scores.

# ## CORRELATION MATRIX

# In[53]:


import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = df_testing.corr()

plt.figure(figsize=(6, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix')
plt.show()


# # HYPOTHESIS TESTING 2 

# **Hypothesis for Independent Samples T-Test:**
# 
# - **Null Hypothesis (H0):** The mean admission probability for applicants with research experience is equal to the mean admission probability for applicants without research experience.
# - **Alternative Hypothesis (H1):** The mean admission probability for applicants with research experience is significantly greater than the mean admission probability for applicants without research experience.
# 

# The Independent Samples T-Test is appropriate when comparing the means of two independent groups to determine if there is a significant difference between them. In our case, we are comparing the mean admission probabilities for two groups: those with research experience and those without research experience.

# In[54]:


import seaborn as sns
import matplotlib.pyplot as plt




plt.figure(figsize=(10, 6))
sns.boxplot(x='Research', y='Chance of Admit ', data=df, showfliers=False)  # Excluding outliers for better visibility
sns.swarmplot(x='Research', y='Chance of Admit ', data=df, color='black', alpha=0.5)

plt.xlabel('Research Experience')
plt.ylabel('Chance of Admit')
plt.title('Admission Probability vs. Research Experience')


plt.show()


# In[55]:


from scipy.stats import ttest_ind

with_research_exp = df[df['Research'] == 1]['Chance of Admit ']
without_research_exp = df[df['Research'] == 0]['Chance of Admit ']


t_stat, p_value = ttest_ind(with_research_exp, without_research_exp, alternative='greater')


print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")


alpha = 0.05  
if p_value < alpha:
    print("Reject the null hypothesis: The mean chance of admit is greater with research experience.")
else:
    print("Fail to reject the null hypothesis: No evidence that the mean is greater with research experience.")


# In[56]:


mean_admit_with_research = df[df['Research'] == 1]['Chance of Admit '].mean()
mean_admit_without_research = df[df['Research'] == 0]['Chance of Admit '].mean()

print(f"Mean Admission Chances with Research: {mean_admit_with_research:.4f}")
print(f"Mean Admission Chances without Research: {mean_admit_without_research:.4f}")


# The large positive t-statistic (14.5) indicates that the mean admission probability for applicants with research experience is significantly higher than for applicants without research experience.
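
# A t-statistic shows that a difference is detectable, not how large it is. One common companion measure is Cohen's d, the mean difference in units of the pooled standard deviation. The helper `cohens_d` below is a hypothetical function written for this sketch, and the sample values are illustrative only.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Tiny illustrative samples (hypothetical values).
d_example = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(d_example)
```

# With the variables defined earlier in this notebook, the call would be `cohens_d(with_research_exp, without_research_exp)`.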

# # HYPOTHESIS 3 

# In[57]:


df_testing = df[['CGPA', 'SOP']].copy()


# In[58]:


import seaborn as sns
import matplotlib.pyplot as plt


plt.figure(figsize=(10, 6))
sns.regplot(x='CGPA', y='SOP', data=df_testing, scatter_kws={'alpha':0.3})
plt.title('Scatter Plot of CGPA vs SOP Quality')
plt.xlabel('CGPA')
plt.ylabel('SOP Quality')

plt.show()


# In[59]:


import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.scatterplot(x='CGPA', y='SOP', data=df_testing)
plt.title('Scatter Plot of CGPA vs SOP Quality')
plt.xlabel('CGPA')
plt.ylabel('SOP Quality')
plt.show()


# In[60]:


from scipy.stats import spearmanr


correlation, p_value = spearmanr(df_testing['CGPA'], df_testing['SOP'])

print(f"Spearman's Rank Correlation: {correlation:.4f}")
print(f"P-value: {p_value:.4g}")

if p_value <= 0.05:
    print("Reject the null hypothesis. There is a significant correlation.")
else:
    print("Fail to reject the null hypothesis. There is no significant correlation.")
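
# Spearman's coefficient is Pearson's correlation applied to the ranks of the data, which is why it suits an ordinal rating such as SOP. The sketch below checks this equivalence on a few hypothetical scores; the values are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical scores with ties, similar in spirit to ordinal SOP ratings.
cgpa_demo = np.array([7.5, 8.0, 8.0, 8.6, 9.1, 9.4])
sop_demo = np.array([2.5, 3.0, 3.5, 3.5, 4.0, 4.5])

rho_direct, _ = spearmanr(cgpa_demo, sop_demo)
# rankdata assigns average ranks to ties, matching what spearmanr does.
rho_via_ranks, _ = pearsonr(rankdata(cgpa_demo), rankdata(sop_demo))
print(rho_direct, rho_via_ranks)
```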


# In[61]:


import seaborn as sns
import matplotlib.pyplot as plt


correlation_matrix = df_testing.corr()


plt.figure(figsize=(6, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix')
plt.show()


# # RESEARCH  HYPOTHESIS

# <b> "There exists a significant difference in the predictive performance of various regression models, including linear regression, lasso regression, and ridge regression, when applied to the task of estimating the likelihood of admission. The objective is to identify and select the regression model that best predicts the chance of admission among the considered models. Furthermore, the investigation will explore the impact of regularization by choosing the best alpha values for lasso and ridge regression through cross-validation, aiming to enhance the overall predictive accuracy and generalizability of the selected model." </b>

# Null Hypothesis (H0): The combination of GRE scores, TOEFL scores, undergraduate GPA, Statement of Purpose and Letter of Recommendation Strength, University Rating, and Research Experience does not have a significant impact on the chance of admission.

# Alternative Hypothesis (H1): The combination of GRE scores, TOEFL scores, undergraduate GPA, Statement of Purpose and Letter of Recommendation Strength, University Rating, and Research Experience has a significant impact on the chance of admission.

# # LINEAR REGRESSION

# In[62]:


X = df.drop(columns=['Chance of Admit ']) 
y = df['Chance of Admit ']


# In[63]:


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)


# In[64]:


linear_model = LinearRegression()
linear_model.fit(X_train, y_train)


# In[65]:


y_pred_linear = linear_model.predict(X_test)


# In[66]:


mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)
print(mse_linear)
print(r2_linear)


# In[67]:


import matplotlib.pyplot as plt
import seaborn as sns


plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred_linear, alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2)  # Diagonal line
plt.title('Linear Regression: Actual vs. Predicted')
plt.xlabel('Actual Admission Chances')
plt.ylabel('Predicted Admission Chances')
plt.show()


# # ASSUMPTIONS MADE BY LINEAR REGRESSION

# 1. **Linearity:** The relationship between the independent variables (features) and the dependent variable (target) is linear. This means that changes in the predictors are associated with constant changes in the response.
# 2. **Independence:** The observations are independent of each other. The value of the dependent variable for one observation should not be influenced by the values of the dependent variable for other observations.
# 3. **Homoscedasticity:** The variance of the errors (residuals) is constant across all levels of the independent variables. In simpler terms, the spread of residuals should be roughly constant as the predicted values increase.
# 4. **Normality:** The residuals are normally distributed.
# 5. **No Multicollinearity:** There should not be exact linear relationships among the independent variables (multicollinearity). High multicollinearity can lead to unreliable coefficient estimates.

# # TEST FOR ASSUMPTIONS MADE BY LINEAR REGRESSION

# In[68]:


import matplotlib.pyplot as plt
import seaborn as sns


residuals = y_test - y_pred_linear


plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_pred_linear, y=residuals, color='blue', alpha=0.7)
plt.title('Residual Plot for Linearity Check')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()


# The residuals are randomly scattered with no clear pattern, which is consistent with the linearity assumption.

# ####  Independence test
# Durbin-Watson test (the statistic ranges from 0 to 4):
# - a value around 2 suggests no autocorrelation
# - a value less than 1 or greater than 3 suggests positive or negative autocorrelation, respectively.
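
# To make these thresholds concrete, the statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that formula directly; `durbin_watson_manual` is a hypothetical helper for illustration. Because the statistic depends on the order of the rows, it is most meaningful when the observations have a natural ordering.

```python
import numpy as np

def durbin_watson_manual(resid):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

dw_alternating = durbin_watson_manual([1, -1, 1, -1])  # sign flips every step
dw_constant = durbin_watson_manual([1, 1, 1, 1])        # no change between steps
print(dw_alternating, dw_constant)  # 3.0 0.0
```

# Residuals that flip sign at every step push the value toward 4 (negative autocorrelation), while residuals that barely change push it toward 0 (positive autocorrelation).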

# In[69]:


from statsmodels.stats.stattools import durbin_watson

durbinWatson = durbin_watson(residuals)
print("Durbin-Watson:", durbinWatson)


# ####  Homoscedasticity

# Homoscedasticity, an assumption of linear regression, can be assessed by plotting residuals against predicted values. In the presence of homoscedasticity, the residuals exhibit a random scatter around the zero line. Conversely, the presence of any discernible pattern, such as a funnel shape or systematic trend, suggests heteroscedasticity – a violation of this assumption. Therefore, a visual inspection of the residual plot helps to determine whether the variability of the residuals remains constant across all levels of the predicted values.

# ####  Normality

# In[70]:


import scipy.stats as stats

stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal Q-Q Plot")
plt.show()


# In[71]:


import matplotlib.pyplot as plt
import seaborn as sns


residuals = y_test - y_pred_linear

plt.figure(figsize=(8, 6))
sns.histplot(residuals, bins=45, kde=True, color='blue')
plt.title('Histogram of Residuals for Normality Check')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
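
# The Q-Q plot and histogram above are visual checks. A formal test such as Shapiro-Wilk could complement them. The sketch below runs it on a deliberately skewed synthetic sample; the data are hypothetical stand-ins for residuals.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic stand-in "residuals" (hypothetical): a strongly skewed sample.
rng = np.random.default_rng(4)
skewed_resid = rng.exponential(scale=1.0, size=200)

# H0 of Shapiro-Wilk: the sample comes from a normal distribution.
w_stat, w_pvalue = shapiro(skewed_resid)
print(f"W = {w_stat:.3f}, p = {w_pvalue:.3g}")
```

# On the notebook's data the call would be `shapiro(residuals)`. A small p-value suggests the residuals depart from normality.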


# # RIDGE REGRESSION 

# In[72]:


from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=1)
ridge_model.fit(X_train, y_train)
y_pred_ridge_default = ridge_model.predict(X_test)


# In[73]:


mse_ridge_default = mean_squared_error(y_test, y_pred_ridge_default)
r2_ridge_default = r2_score(y_test, y_pred_ridge_default)


# In[74]:


print('mse' , mse_ridge_default)
print('R2' , r2_ridge_default)
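
# The ridge model above uses a fixed `alpha=1`, while the research hypothesis calls for choosing alpha through cross-validation. One way to do that is sketched below with scikit-learn's `RidgeCV` and `LassoCV`. The data, the alpha grid, and the scaling step are assumptions of this sketch, not settings taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): five predictors, two of them irrelevant.
rng = np.random.default_rng(5)
X_cv = rng.normal(size=(300, 5))
y_cv = X_cv @ np.array([0.4, 0.3, 0.2, 0.0, 0.0]) + rng.normal(scale=0.1, size=300)

alpha_grid = np.logspace(-3, 2, 20)  # candidate penalties (assumed range)

# Scaling first keeps the penalty comparable across features.
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alpha_grid, cv=5)).fit(X_cv, y_cv)
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alpha_grid, cv=5)).fit(X_cv, y_cv)

print("Ridge alpha:", ridge_cv[-1].alpha_)
print("Lasso alpha:", lasso_cv[-1].alpha_)
```

# Fitting the search on the training split only, and keeping the test split for the final comparison, avoids leaking test information into the choice of alpha.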


# # POLYNOMIAL REGRESSION

# # POLYNOMIAL WITH L1 REGULARISATION

# # ASSUMPTIONS TEST FOR POLYNOMIAL REGRESSION

# Polynomial regression shares many of the same assumptions as linear regression. These include:
# 
# 1. **Linearity:** The relationship between the predictors (i.e., independent variables) and the response (i.e., dependent variable) is assumed to be polynomial in nature; the model itself remains linear in its coefficients.
# 2. **Independence:** The residuals are assumed to be independent, meaning that there's no correlation between consecutive residuals or residuals and predictors.
# 3. **Homoscedasticity:** The variance of the errors is constant across all levels of the predictor variables.
# 4. **Normality:** For any fixed value of X, Y is normally distributed.

# In[76]:


from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd


# Selecting features and target variable
X = df.drop(columns=['Chance of Admit '])
y = df['Chance of Admit ']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Applying Polynomial Regression
degree = 2  # You can adjust the degree of the polynomial
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_pred_poly = poly_model.predict(X_test_poly)

# Check for linearity using a residual plot
residuals_poly = y_test - y_pred_poly
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_pred_poly, y=residuals_poly, color='blue', alpha=0.7)
plt.axhline(y=0, color='red', linestyle='--', linewidth=2)
plt.title('Residual Plot for Polynomial Regression')
plt.xlabel('Predicted Values (Polynomial Regression)')
plt.ylabel('Residuals (Polynomial Regression)')
plt.show()

# Check for normality of residuals using a histogram
plt.figure(figsize=(8, 6))
sns.histplot(residuals_poly, bins=30, kde=True, color='blue')
plt.title('Histogram of Residuals for Polynomial Regression')
plt.xlabel('Residuals (Polynomial Regression)')
plt.ylabel('Frequency')
plt.show()



# Check for autocorrelation using Durbin-Watson statistic
durbin_watson_stat_poly = sm.stats.stattools.durbin_watson(residuals_poly)
print(f'Durbin-Watson Statistic for Polynomial Regression: {durbin_watson_stat_poly}')


# In[77]:


from sklearn.metrics import mean_squared_error, r2_score

# Calculate Mean Squared Error (MSE) for Polynomial Regression
mse_poly = mean_squared_error(y_test, y_pred_poly)
print(f'Mean Squared Error (MSE) for Polynomial Regression: {mse_poly:.6f}')

# Calculate R-squared (R2) for Polynomial Regression
r2_poly = r2_score(y_test, y_pred_poly)
print(f'R-squared (R2) for Polynomial Regression: {r2_poly:.4f}')


# In[78]:


coefficients_poly = poly_model.coef_

# Names of the expanded terms, in the same order as the coefficients
poly_feature_names = poly.get_feature_names_out(X.columns)

# Print each polynomial term next to its fitted coefficient
print("Coefficients of Polynomial Regression Model:")
for coef_name, coef in zip(poly_feature_names, coefficients_poly):
    print(f"{coef_name}: {coef:.4f}")







from sklearn.linear_model import Lasso, Ridge

# Polynomial regression with L2 (ridge) regularisation.
# X_train_poly and X_test_poly are the degree-2 features built earlier.
alpha_ridge = 0.01
ridge_model = Ridge(alpha=alpha_ridge)
ridge_model.fit(X_train_poly, y_train)
y_pred_poly_ridge = ridge_model.predict(X_test_poly)

mse_poly_ridge = mean_squared_error(y_test, y_pred_poly_ridge)
print(f'Mean Squared Error (Polynomial Regression with Ridge): {mse_poly_ridge}')

# Polynomial regression with L1 (lasso) regularisation on the same features.
alpha_lasso = 0.01
lasso_model = Lasso(alpha=alpha_lasso)
lasso_model.fit(X_train_poly, y_train)
y_pred_poly_lasso = lasso_model.predict(X_test_poly)

mse_poly_lasso = mean_squared_error(y_test, y_pred_poly_lasso)
print(f'Mean Squared Error (Polynomial Regression with Lasso): {mse_poly_lasso}')






'''

print(' ')

 
