# List Full Names of all the participants in your team below:
1. Yousuf Aziz
2. Messiah Smith-Bonet
3. Surya Muthiah Pillai
4. Yuan Meng
5. Eric Yang
6. David Tan
7. Daniel Ip
8. Joyce Sommer 
9. Kurt Su
10. Jonathan Romano
11. Xianxin Lin


Hello Machine Learning Engineer Sera Team, 

You have been given a dataset obtained from **Air Quality** measurements of Seattle City. The dataset contains 788 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device.

Number of Instances: 789 <br>
Number of Attributes: 14 (including the target variable `y`)

Attribute Information: 
* **y**  AQI Air Quality Index
* **f1** True hourly averaged concentration CO in mg/m^3 (reference analyzer)
* **f2** PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
* **f3** True hourly averaged overall Non-Methane Hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer)
* **f4** True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
* **f5** PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
* **f6** True hourly averaged NOx concentration in ppb (reference analyzer)
* **f7** PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
* **f8** True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
* **f9** PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
* **f10** PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
* **f11** Temperature in °C
* **f12** Relative Humidity (%)
* **f13** AH Absolute Humidity

There are no missing Attribute Values.

Your task is to implement a **Gaussian Radial Basis Function based Linear Regression model using Closed Form Solution** for predicting the Air Quality Index for Seattle City.

## Closed Form Solution with Basis Functions
The **genesis equation** for Linear Regression with Gaussian Basis Function is of the form:

$y(x,W) = \phi(x)W$  

* $y(x,W)$ is the predicted output,
* $\phi(x)$ is the Design Matrix
* $W = (w_{1}, ..., w_{M})$ are the parameters to be learned from training samples

### Design Matrix
Each Gaussian radial basis function $\phi_{j}$ converts the input instance to a value as shown below: <br>

$\phi_{j}(x) = \exp(-\frac{1}{2}(x - \mu_{j})^{T}\Sigma_{j}^{-1}(x - \mu_{j}))$

* $x$ is an input instance from the scaled dataset <br>
* $\mu_{j}$ is the center of the $j^{th}$ Gaussian Radial Basis Function <br>
* $\Sigma_{j}$ decides how broadly the $j^{th}$ basis function spreads (Diagonal Covariance Matrix)
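As a minimal sketch of the formula above, the snippet below evaluates a single Gaussian radial basis function for one input instance. The function name `gaussian_rbf` and the argument `sigma_diag` (the diagonal entries of the covariance matrix) are hypothetical names chosen for illustration; they are not part of the assignment code.

```python
import numpy as np

def gaussian_rbf(x, mu, sigma_diag):
    """Evaluate one Gaussian radial basis function at a single input x.

    x, mu      : 1-D arrays of length D (input instance and basis centre)
    sigma_diag : 1-D array with the D diagonal entries of the covariance matrix
    """
    diff = x - mu
    # For a diagonal covariance matrix the quadratic form
    # (x - mu)^T Sigma^{-1} (x - mu) reduces to sum(diff^2 / sigma_diag).
    return np.exp(-0.5 * np.sum(diff ** 2 / sigma_diag))

mu = np.array([0.2, 0.8])
sigma_diag = np.array([0.1, 0.1])

at_centre = gaussian_rbf(mu, mu, sigma_diag)               # exactly at the centre
far_away = gaussian_rbf(np.array([1.0, 0.0]), mu, sigma_diag)
print(at_centre, far_away)
```

The response is 1 at the centre and decays towards 0 as the input moves away from it; larger entries in `sigma_diag` make the decay slower.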

Applying all $M$ basis functions to every input instance results in a Design Matrix with one row per instance and one column per basis function, as shown below:
![!picture](https://drive.google.com/uc?export=view&id=1j1kxv6nUPPECacd-_bDg_lL1yTJS5BwA)
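One possible way to build such a Design Matrix in NumPy is sketched below. It assumes each basis function has its own diagonal covariance, stored as one row of `variances`; the helper name `design_matrix` and the toy centres and variances are illustrative assumptions, not values from the assignment.

```python
import numpy as np

def design_matrix(X, centers, variances):
    """Build the design matrix for Gaussian radial basis functions.

    X         : (N, D) scaled inputs
    centers   : (M, D) basis function centres
    variances : (M, D) diagonal covariance entries, one row per basis function
    returns   : (N, M) matrix whose entry [i, j] is phi_j(x_i)
    """
    # Broadcast to (N, M, D): difference between every input and every centre.
    diff = X[:, None, :] - centers[None, :, :]
    # Quadratic form for diagonal covariances, summed over the D features.
    quad = np.sum(diff ** 2 / variances[None, :, :], axis=2)
    return np.exp(-0.5 * quad)

rng = np.random.default_rng(0)
X = rng.random((5, 2))                                   # toy scaled inputs
centers = np.array([[0.25, 0.25], [0.75, 0.75], [0.5, 0.5]])
variances = np.full((3, 2), 0.05)

Phi = design_matrix(X, centers, variances)
print(Phi.shape)  # (5, 3)
```

Broadcasting avoids an explicit loop over basis functions; for very large datasets the intermediate `(N, M, D)` array may be too large, in which case a loop over the basis functions is a reasonable alternative.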

For finding parameters $W$ for the genesis equation above using the **closed form solution**, we pre-multiply both sides by $\phi^{-1}(x)$. We get,

$W = \phi^{-1}(x)Y$

But $\phi(x)$ is generally NOT A SQUARE MATRIX of FULL RANK! Hence, $\phi^{-1}(x)$ does not exist.

We therefore use the Moore-Penrose pseudo inverse as a generalization of the matrix inverse when the matrix may not be invertible. Hence, the final closed form solution for finding parameters $W$ with linear regression least squares solution is as follows:

$W = (\phi^{T}\phi)^{-1}\phi^{T}Y$

YOU NEED TO IMPLEMENT ABOVE EQUATION for finding $W$. 

<font color="red"> YOU CANNOT USE NUMPY linalg **pinv** https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html </font>

<font color="red">DO NOT USE SKLEARNS LINEAR REGRESSION LIBRARY DIRECTLY.</font>

<font color="green">YOU CAN USE np.linalg.inv, and np.dot FOR IMPLEMENTING PSEUDO-INVERSE</font>
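The closed form solution can be sketched with only the permitted calls, `np.linalg.inv` and `np.dot`. The example below uses a small random matrix in place of a real Design Matrix and assumes that $\phi^{T}\phi$ is invertible (that is, $\phi$ has full column rank); the name `fit_closed_form` is a hypothetical helper, not part of the assignment.

```python
import numpy as np

def fit_closed_form(Phi, Y):
    """Least squares weights W = (Phi^T Phi)^{-1} Phi^T Y.

    Assumes Phi^T Phi is invertible, i.e. Phi has full column rank.
    """
    gram = np.dot(Phi.T, Phi)                       # (M, M)
    return np.dot(np.dot(np.linalg.inv(gram), Phi.T), Y)

rng = np.random.default_rng(0)
Phi = rng.random((20, 3))                           # stand-in for a design matrix
W_true = np.array([[1.0], [-2.0], [0.5]])
Y = np.dot(Phi, W_true)                             # noise-free targets

W = fit_closed_form(Phi, Y)
print(np.allclose(W, W_true))  # True
```

Because the targets here are generated without noise, the recovered weights match `W_true` up to floating point error; on real data the same formula gives the weights that minimise the squared error.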

### **Question:** In the following code cell implement the following:
* Step 1: Import the dataset (AirQualitySeattle.csv) using Pandas Dataframe (Step 1 Implemented already)
* Step 2: Partition your dataset into training, testing and validation sets using sklearn's train_test_split function, and split the features and target labels into separate variables (Step 2 Implemented already)
* Step 3: Scale the features using sklearn's min-max scaling function (Step 3 Implemented already)
* Step 4: Convert Scaled Features and Labels into numpy arrays with dimensions required by closed form solution (Step 4 Implemented already)
* Step 5: Find the Mean ($\mu_{j}$) and Spread ($\Sigma_{j}$) for **3 basis functions** (Step 5 Implemented Already)
* Step 6: Create a Design Matrix using the scaled features, Mean ($\mu_{j}$) and Spread ($\Sigma_{j}$)
* Step 7: Train using Linear Regression algorithm with a Closed Form Solution **Hint: Use Pseudo Inverse Formula**
* Step 8: Test using Testing Dataset (Make sure you create a design matrix for Testing dataset using same Mean ($\mu_{j}$) and Spread ($\Sigma_{j}$) from Step 5)
* Step 9: Calculate Root Mean Squared Error (Erms) for Test Dataset
    * $Erms = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y\_test_{i} - y\_test\_pred_{i})^{2}}$ 
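The Erms formula in Step 9 can be sketched as a small helper. The name `rmse` and the toy values are illustrative assumptions; the inputs are flattened first so that a column vector of shape `(n, 1)` and a flat array of shape `(n,)` are not accidentally broadcast into an `(n, n)` matrix.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sets of values."""
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Errors are 0, 0 and -2, so the mean squared error is 4/3.
value = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(value)  # approximately 1.1547
```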

In [3]:
# Step 1 already implemented
import pandas as pd
import io
import requests
url="https://raw.githubusercontent.com/Mihir2/BreakoutSessionDataset/master/AirQualitySeattle.csv"
s = requests.get(url).content
data = pd.read_csv(io.StringIO(s.decode('utf-8')))
data

# Step 2 already implemented
import numpy as np
from sklearn.model_selection import train_test_split
output = data['y']                    # target: Air Quality Index
features = data.to_numpy()[:,1:]      # all columns after the first (assumes y is the first column)
x_train, x_test, y_train, y_test = train_test_split(features, output, test_size = 0.2)

# Step 3 already implemented
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
sc_xtrain = scaler.fit_transform(x_train)
sc_xtest = scaler.transform(x_test)

# Step 4 already implemented
y_train_arr = y_train.to_numpy().reshape(y_train.shape[0],1)
x_train_arr = sc_xtrain
y_test_arr  = y_test.to_numpy().reshape(y_test.shape[0],1)
x_test_arr  = sc_xtest

# Step 5 already implemented
from sklearn.cluster import MiniBatchKMeans
number_of_basis_function = 3
model = MiniBatchKMeans(n_clusters=number_of_basis_function)
# distances[i][j] is the distance from training sample i to cluster centre j
distances = model.fit_transform(x_train_arr)
basis_means = model.cluster_centers_
# Variance of each basis function: mean squared distance of its members to the centre
basis_variances = np.zeros(number_of_basis_function)
for i, label in enumerate(model.labels_):
  basis_variances[label] += distances[i][label]**2
for j in range(number_of_basis_function):
  basis_variances[j] /= np.count_nonzero(model.labels_ == j)
basis_variances = np.diag(basis_variances)

#print(basis_means)
# print(basis_variances)

## TA Answer

In [None]:
# Step 6
# x_mu[j][i] = sum over features of (x_i - mu_j), one row per basis function
x_mu = np.zeros((number_of_basis_function,x_train_arr.shape[0]))
for i in range(number_of_basis_function):
  x_mu[i] = np.sum((x_train_arr - basis_means[i]),axis=1)

# basis_variances is diagonal, so entry [i][j] is exp(-0.5 * x_mu[j][i]**2 / variance_j)
train_design_mat = np.exp(-0.5*np.multiply(np.dot(x_mu.T,np.linalg.inv(basis_variances)),x_mu.T))

# Step 7
# Closed form solution: W = (phi^T phi)^-1 phi^T Y
weights = np.dot(np.dot(np.linalg.inv(np.dot(train_design_mat.T,train_design_mat)),train_design_mat.T),y_train_arr)

# Step 8
# Same construction for the test set, reusing the means and variances from Step 5
x_mu = np.zeros((number_of_basis_function,x_test_arr.shape[0]))
for i in range(number_of_basis_function):
  x_mu[i] = np.sum((x_test_arr - basis_means[i]),axis=1)

test_design_mat = np.exp(-0.5*np.multiply(np.dot(x_mu.T,np.linalg.inv(basis_variances)),x_mu.T))
y_test_pred = np.dot(test_design_mat, weights)

# Step 9
# Root mean squared error on the test set
Erms = np.sqrt(np.sum((y_test_pred - y_test_arr)**2)/y_test_arr.shape[0])
print(Erms)

163.92466954866845
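The TA cell above forms `x_mu` by summing the signed feature differences and squaring that sum. Read literally, the basis function formula with a single variance per basis function would instead use the squared Euclidean distance $(x - \mu_{j})^{T}(x - \mu_{j})$ divided by that variance. The sketch below shows this variant for Steps 6 to 9 on synthetic data; the helper names, the `_demo` variables, and the toy centres, variances and weights are all assumptions for illustration, and no result for the Air Quality dataset is implied.

```python
import numpy as np

def rbf_design_matrix(X, means, variances):
    """Design matrix with one scalar variance per basis function.

    X         : (N, D) scaled inputs
    means     : (M, D) basis function centres
    variances : (M,)  one variance per basis function
    """
    # Squared Euclidean distance between every input and every centre: (N, M).
    sq_dist = np.sum((X[:, None, :] - means[None, :, :]) ** 2, axis=2)
    return np.exp(-0.5 * sq_dist / variances[None, :])

def fit_closed_form(Phi, Y):
    """W = (Phi^T Phi)^{-1} Phi^T Y, assuming Phi^T Phi is invertible."""
    return np.dot(np.dot(np.linalg.inv(np.dot(Phi.T, Phi)), Phi.T), Y)

# Synthetic stand-ins for the scaled train/test features (hypothetical data).
rng = np.random.default_rng(42)
x_train_demo = rng.random((80, 4))
x_test_demo = rng.random((20, 4))
means_demo = np.array([[0.2] * 4, [0.5] * 4, [0.8] * 4])
variances_demo = np.array([0.2, 0.2, 0.2])
w_true = np.array([[3.0], [-1.0], [2.0]])

# Step 6 and Step 8: design matrices built from the same means and variances.
phi_train = rbf_design_matrix(x_train_demo, means_demo, variances_demo)
phi_test = rbf_design_matrix(x_test_demo, means_demo, variances_demo)
y_train_demo = np.dot(phi_train, w_true)   # noise-free synthetic targets
y_test_demo = np.dot(phi_test, w_true)

# Step 7: closed form weights. Step 9: test RMSE.
weights_demo = fit_closed_form(phi_train, y_train_demo)
y_pred_demo = np.dot(phi_test, weights_demo)
erms_demo = np.sqrt(np.mean((y_pred_demo - y_test_demo) ** 2))
print(phi_train.shape, erms_demo)
```

With the notebook's own variables, the call would presumably look like `rbf_design_matrix(x_train_arr, basis_means, np.diag(basis_variances))`, since Step 5 stores the variances in a diagonal matrix and `np.diag` extracts them back into a 1-D array.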


## Student Response

In [None]:
# Step 6 
import math as m

# newx = model.fit_transform()
# mu = model.fit_transform()

# x_mu = distances - basis_means

basis_transpose = np.transpose(distances)

basisByTwo = basis_transpose * -0.5
sigma_inv = np.linalg.inv(basis_variances)
designmat = np.dot(sigma_inv, distances.T)*distances.T

#designmat = m.exp(ibd_bytwo)
print(designmat.shape)

# Step 7 
W=np.dot(np.dot(np.linalg.inv(np.dot(designmat.T,designmat)),designmat.T),y_train_arr)

# Step 8 


# Step 9 