#### **Problem Statement:**
In this experiment we will perform ordinary least squares (multiple) regression to predict graduate admissions from an Indian/Bangladeshi perspective. The dataset can be obtained from [Kaggle](https://www.kaggle.com/mohansacharya/graduate-admissions). The dataset (Admission_Predict.csv) contains seven features arranged into columns in a CSV file. There are 400 sample datapoints. The features are as follows:
1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose and Letter of Recommendation Strength (out of 5), stored as two separate columns, `SOP` and `LOR`
5. Undergraduate GPA (out of 10)
6. Research Experience (either 0 or 1)
7. Chance of Admit (ranging from 0 to 1)

The first column of the dataset contains a serial number, and the final column provides the probability of getting admission, i.e. the target output for each datapoint. We will use the dataset to create a linear regression model that estimates the chance of admission for a new sample student, and we will assess how well the model predicts on data it has not seen.

#### **1. Import necessary packages:**

In [1]:
# Write appropriate code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### **2. Upload and load dataset:**
First we have to upload the dataset to Google Colab to start working with it. Please download the **"Admission_Predict.csv"** dataset from [here](https://drive.google.com/file/d/1_DMXoBvuHZUpvBvCYtvMMdr5MDjbhTsu/view?usp=sharing). Then click on Files in the sidebar and drag and drop your file onto the sidebar to upload the dataset.

Now, use `data = pd.read_csv("Admission_Predict.csv")` to load the data.

In [2]:
# Write appropriate code
data = pd.read_csv("Admission_Predict.csv")
data.head()


#### **3. Preprocess the Data:**
* To inspect the loaded data use `print(data.head())`.
* After inspecting the data, did you notice that we have an extra column named `Serial No.`?
* This is certainly not a feature, so we will drop this column. Use `data.drop('Serial No.', axis=1, inplace=True)` to drop the column.
* The column `'Chance of Admit '` is also not a feature; it is our target.
  * We will store it in a separate variable `y` using `y = data['Chance of Admit ']`.
  * Convert `y` to a numpy array using `y = y.values`.
  * Drop the column from `data` using `data.drop('Chance of Admit ', axis=1, inplace=True)`.
* In `data` we are left with all 7 features. Convert it to a numpy array and store it in a new variable `X` using `X = data.values`. So `X` is the matrix of features: each column of `X` is one feature vector.

☢ Note: Be careful about the space after the column name `'Chance of Admit '`.
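
The trailing space in the column name is easy to trip over. As an optional alternative to typing it exactly, the sketch below strips whitespace from all column names before selecting the target. It runs on a tiny made-up table rather than the real file, and the names `csv_text`, `demo`, `X_demo`, and `y_demo` are illustrative assumptions, not part of the notebook's cells.

```python
import io
import pandas as pd

# Tiny stand-in for Admission_Predict.csv (values are made up for illustration).
csv_text = (
    "Serial No.,GRE Score,CGPA,Chance of Admit \n"
    "1,320,8.5,0.80\n"
    "2,310,8.0,0.65\n"
)
demo = pd.read_csv(io.StringIO(csv_text))

# Strip leading/trailing whitespace so 'Chance of Admit ' becomes 'Chance of Admit'.
demo.columns = demo.columns.str.strip()

# Target vector and feature matrix, without needing the trailing space.
y_demo = demo["Chance of Admit"].values
X_demo = demo.drop(columns=["Serial No.", "Chance of Admit"]).values
print(X_demo.shape, y_demo.shape)
```

If you use this approach on the real data, the later cells would need `'Chance of Admit'` without the trailing space.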

In [None]:
data.columns

In [None]:
# Write appropriate code
data.drop('Serial No.', axis = 1, inplace=True)
data.head()

In [None]:
y = data['Chance of Admit ']
y = y.values
print(y.shape)

In [None]:
data.drop('Chance of Admit ', axis = 1, inplace=True)
data.head()

In [None]:
X = data.values
print(X.shape)

#### **4. Add a ones column vector to X:**
Add a new column consisting of ones as the $0^{th}$ column of `X`. See the [numpy documentation](https://numpy.org/doc/stable/reference/generated/numpy.c_.html) for more details. Divide the data `X` and `y` into `x_train`, `x_test`, `y_train` and `y_test`. The training set will contain 300 datapoints and the test set will contain 100 datapoints.

$f(x) = [c_1 \; c_2 \; c_3] \cdot [1 \; x_1 \; x_2]^T$
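
To see why the ones column matters, here is a minimal sketch on a made-up two-feature matrix. The names `X_small`, `X_aug`, and `c` and all the numbers are assumptions chosen for illustration. With the ones column in place, the first coefficient acts as the intercept and the whole model becomes a single matrix-vector product.

```python
import numpy as np

# Two samples with two features each (made-up numbers).
X_small = np.array([[2.0, 3.0],
                    [4.0, 5.0]])

# Prepend a column of ones so each row becomes [1, x_1, x_2].
X_aug = np.c_[np.ones((X_small.shape[0], 1)), X_small]

# Coefficients [c_1, c_2, c_3]; c_1 multiplies the ones column (the intercept).
c = np.array([0.5, 1.0, 2.0])

# f(x) = c_1 * 1 + c_2 * x_1 + c_3 * x_2 for every row at once.
f = X_aug.dot(c)
print(X_aug)
print(f)
```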

In [None]:
# Write appropriate code
X = np.c_[np.ones((X.shape[0], 1)), X]
print(X.shape)
print(X)


In [None]:
x_train = X[:300]
y_train = y[:300]

x_test = X[300:]
y_test = y[300:]
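
The cell above takes the first 300 rows for training and the last 100 for testing. If the rows happened to be ordered in some meaningful way, a shuffled split would be a common alternative. The sketch below shows that idea on synthetic data; the names `X_all`, `y_all`, `train_idx`, `test_idx`, and the seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the split is reproducible

# Synthetic stand-in with the same shape as the notebook's X and y.
n = 400
X_all = np.c_[np.ones((n, 1)), rng.normal(size=(n, 7))]
y_all = rng.uniform(size=n)

# Shuffle the row indices, then take 300 for training and 100 for testing.
idx = rng.permutation(n)
train_idx, test_idx = idx[:300], idx[300:]

x_train_demo, y_train_demo = X_all[train_idx], y_all[train_idx]
x_test_demo, y_test_demo = X_all[test_idx], y_all[test_idx]
print(x_train_demo.shape, x_test_demo.shape)
```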

#### **5. Solve the system of equations:**
Solve the system of equations $X\beta = y$ in the least squares sense to find the values of the $\beta$ vector $(\beta_0, \beta_1, \beta_2, \ldots, \beta_n)$. You can find $\beta$ using $\beta = X^\dagger y = (X^T X)^{-1} X^T y = R^{-1} Q^T y$, where $X = QR$ is the QR decomposition of $X$. There is also a numpy function to calculate the pseudo-inverse: `np.linalg.pinv()`; see the [numpy documentation](https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html) for more details. Use `x_train` and `y_train` as the dataset.
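
The three expressions for $\beta$ above should agree whenever $X$ has full column rank. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `beta_true` are made-up names and values, not the admissions data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix with a ones column, and a known coefficient vector.
X_demo = np.c_[np.ones((50, 1)), rng.normal(size=(50, 3))]
beta_true = np.array([0.5, 1.0, -2.0, 0.3])
y_demo = X_demo.dot(beta_true) + 0.01 * rng.normal(size=50)

# 1) Pseudo-inverse: beta = X^+ y
beta_pinv = np.linalg.pinv(X_demo).dot(y_demo)

# 2) Normal equations: solve (X^T X) beta = X^T y without forming the inverse.
beta_ne = np.linalg.solve(X_demo.T.dot(X_demo), X_demo.T.dot(y_demo))

# 3) QR decomposition: X = QR, so beta solves R beta = Q^T y.
Q, R = np.linalg.qr(X_demo)  # reduced form: Q is 50x4, R is 4x4
beta_qr = np.linalg.solve(R, Q.T.dot(y_demo))

print(beta_pinv)
print(beta_ne)
print(beta_qr)
```

Solving the linear system directly, rather than computing an explicit inverse, is generally the more numerically stable choice.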

In [None]:
# Write appropriate code
# Normal equations: beta = (X^T X)^{-1} X^T y
XtX = np.matmul(x_train.T, x_train)   # X^T X
XtX_inv = np.linalg.inv(XtX)          # (X^T X)^{-1}
Xt = x_train.T                        # X^T

beta = np.matmul(XtX_inv, Xt).dot(y_train)
print(beta)

#### **6. Find predicted chance of admit:**
Find the predicted chance of admit, $\hat y$, by computing the product $X\beta$. For prediction use `x_test` as the dataset.

In [None]:
# Write appropriate code
y_hat = x_test.dot(beta)
print(x_test[-1], y_hat[-1], y_test[-1])

#### **7. Find the error vector e:**
Find the error vector, $e$, by subtracting $\hat y$ from `y_test`.

In [None]:
# Write appropriate code
e = y_test - y_hat
print(e)

#### **8. Compute the $r^2$ value:**
Recall that $r^2 = 1 - SSE / SST$, where $SSE = e^T e$ is the sum of squared errors and $SST = (y_{\text{test}} - \bar{y}_{\text{test}})^T (y_{\text{test}} - \bar{y}_{\text{test}})$ is the total sum of squares, with $\bar{y}_{\text{test}}$ the average of `y_test`.
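
As a sanity check on the formula, the sketch below wraps it in a small helper and evaluates two boundary cases. The function name `r_squared_score` and the sample values are assumptions for illustration. A perfect prediction gives $r^2 = 1$, and always predicting the mean gives $r^2 = 0$.

```python
import numpy as np

def r_squared_score(y_true, y_pred):
    """Return 1 - SSE/SST for two equal-length 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    e = y_true - y_pred                    # error vector
    sse = e.dot(e)                         # sum of squared errors, e^T e
    centered = y_true - y_true.mean()
    sst = centered.dot(centered)           # total sum of squares
    return 1.0 - sse / sst

y_true_demo = np.array([0.6, 0.7, 0.8, 0.9])

# Perfect prediction: no error at all.
perfect = r_squared_score(y_true_demo, y_true_demo)

# Baseline: always predict the mean of the targets.
mean_only = r_squared_score(y_true_demo, np.full(4, y_true_demo.mean()))

print(perfect, mean_only)
```

On held-out data $r^2$ can even be negative, which would mean the model predicts worse than the mean baseline.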

In [None]:
# Write appropriate code
sse = e.T.dot(e)
sst = (y_test - np.average(y_test)).T.dot((y_test - np.average(y_test)))
r_squared  = 1 - sse/sst

print(r_squared)

#### **9. Plot the vectors $y$, $\hat y$, and $e$:**
Plot the vectors `y_test`, $\hat y$, and $e$, and make suitable observations. Use a different color for each of the three vectors when plotting.

In [None]:
# Write appropriate code
plt.plot(y_test)
plt.plot(y_hat)
plt.plot(e)
plt.legend(["Actual", "Prediction", "Error"])
plt.show()

#### **10. Test with new data:**
Introduce a new sample student with your own data, and find where they fall.

In [None]:
[1]+[2, 10]

In [None]:
# Write appropriate code
# GRE Score	TOEFL Score	University Rating	SOP	LOR	CGPA	Research
def pred(a):
  return np.array([1]+a).dot(beta)

In [None]:
pred([320, 110, 3, 4.5, 4.5, 9.7, 1])*100
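
A slightly more defensive version of the prediction helper could check that the number of supplied features matches the coefficient vector. The sketch below is one way to do that; `predict_one` and `beta_demo` are hypothetical names, and the coefficients are made up rather than taken from the fitted model.

```python
import numpy as np

def predict_one(features, beta):
    """Predict for one sample; a leading 1 is prepended for the intercept."""
    features = np.asarray(features, dtype=float).ravel()
    if features.shape[0] + 1 != beta.shape[0]:
        raise ValueError(
            f"expected {beta.shape[0] - 1} features, got {features.shape[0]}"
        )
    return float(np.r_[1.0, features].dot(beta))

# Hypothetical coefficients: intercept plus two feature weights.
beta_demo = np.array([0.1, 0.002, 0.0])

p = predict_one([300, 5], beta_demo)
print(p)
```

Keep in mind that a linear model's output is not restricted to the range 0 to 1, so an unusual input can produce a predicted chance outside that range.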