Hello Machine Learning Engineer Drepung Team, 

You have been given a dataset obtained from the **COVID-19 Tracking Project** and NYTimes. Coronaviruses are a large family of viruses that may cause illness in animals or humans. In humans, several coronaviruses are known to cause respiratory infections ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS).

The number of new cases is increasing day by day around the world. This dataset contains COVID information for the United States at a daily level.

Number of Instances: 156 <br>
Number of Attributes: 7 (including the target variable `y`)

Attribute Information: 
   * **y:** Total number of tests with positive results in a single day (Numerical)
   * **f1:** date of observation
   * **f2:** number of tests with negative results (Numerical)
   * **f2:** number of tests with pending results (Numerical)
   * **f3:** Number of patients hospitalized on the date (Numerical)
   * **f4:** Number of patients on ventilator on the date (Numerical)
   * **f5:** Number of patients recovered on the date (Numerical)
   * **f6:** number of deaths (Numerical)

There are no missing Attribute Values.

Your task is to implement a Linear Regression model using the **Closed Form Solution** to predict the total number of positive results in a single day.

## Closed Form Solution
The starting equation for the Linear Regression model is of the form:

$Y = XW$, where <br>
$X$ is the input (the feature matrix), <br>
$W$ are the parameters and <br>
$Y$ is the target (the output)

To find the parameters $W$ of the above equation using the **closed form solution**, we would pre-multiply both the LHS and the RHS by $X^{-1}$. We get,

$W = X^{-1}Y$

But $X$ is generally NOT A SQUARE MATRIX of FULL RANK! Hence, $X^{-1}$ does not exist.

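We therefore use the Moore-Penrose pseudo-inverse, a generalization of the matrix inverse for matrices that may not be invertible. Pre-multiplying both sides of the model equation by $X^{T}$ gives $X^{T}Y = X^{T}XW$, and $X^{T}X$ is a square matrix that is invertible when the columns of $X$ are linearly independent. Hence, the final closed form (least squares) solution for the parameters $W$ is as follows: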

$W = (X^{T}X)^{-1}X^{T}Y$

YOU NEED TO IMPLEMENT ABOVE EQUATION for finding $W$. 
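
As a sanity check on the formula, here is a minimal sketch on made-up toy data (the matrix `X`, the weights `true_W`, and all variable names are assumptions for illustration, not part of the assignment dataset). It builds the solution from `np.linalg.inv` and `np.dot` only, and uses `np.linalg.lstsq` purely to cross-check the result.

```python
import numpy as np

# Toy data (made up for illustration): 4 samples, 2 features.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
true_W = np.array([[2.0], [-1.0]])
Y = np.dot(X, true_W)  # targets generated exactly from true_W, no noise

# X is 4x2, so X^{-1} does not exist, but X^T X is 2x2 and invertible here.
XtX = np.dot(X.T, X)
XtY = np.dot(X.T, Y)

# Closed form solution: W = (X^T X)^{-1} X^T Y
W = np.dot(np.linalg.inv(XtX), XtY)

# Cross-check against NumPy's least-squares solver.
W_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]

print(W)
print(W_lstsq)
```

Because the toy targets contain no noise, both results should match `true_W` up to floating-point error. On real data the two approaches agree only as long as $X^{T}X$ is well conditioned.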

<font color="red"> YOU CANNOT USE NUMPY linalg **pinv** https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html </font>

<font color="red">DO NOT USE SKLEARNS LINEAR REGRESSION LIBRARY DIRECTLY.</font>


<font color="green">YOU CAN USE np.linalg.inv, and np.dot FOR IMPLEMENTING PSEUDO-INVERSE</font>



### **Question:** In the following code cell implement the following:
* Step 1: Import the dataset (us_covid.csv) using Pandas Dataframe (Step 1 Implemented already)
* Step 2: Partition your dataset into training and testing sets using sklearn's train_test_split function and split the features and target labels into separate variables (Step 2 Implemented already)
* Step 3: Scale the features using sklearn's MinMaxScaler (Step 3 Implemented already)
* Step 4: Convert Scaled Features and Labels into numpy arrays with dimensions required by closed form solution (Step 4 Implemented already)
* Step 5: Train using Linear Regression algorithm with a Closed Form Solution **Hint: Use Pseudo Inverse Formula**
* Step 6: Test using Testing Dataset
* Step 7: Calculate Root Mean Squared Error for Test Dataset
    * $E_{rms} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y\_test_{i} - y\_test\_pred_{i})^{2}}$
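
The root mean squared error is the square root of the mean of the squared residuals. A minimal sketch on made-up numbers follows; the helper name `rmse` and the toy arrays are assumptions for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of the same shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy column vectors shaped (n, 1), like the arrays produced in Step 4.
y_true = np.array([[3.0], [5.0], [7.0]])
y_pred = np.array([[2.0], [5.0], [9.0]])

# Residuals are 1, 0, -2, so the squared errors are 1, 0, 4 and their mean is 5/3.
err = rmse(y_true, y_pred)
print(err)
```

Both arrays should have the same shape: subtracting an `(n,)` array from an `(n, 1)` array broadcasts to an `(n, n)` matrix and silently gives a wrong error value.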

In [None]:
# Step 1 already implemented
import pandas as pd
import io
import requests
url="https://raw.githubusercontent.com/Mihir2/BreakoutSessionDataset/master/us_covid.csv"
s = requests.get(url).content
data = pd.read_csv(io.StringIO(s.decode('utf-8')))
data

# Step 2 already implemented
import numpy as np
from sklearn.model_selection import train_test_split
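output = data['y']
# All columns after the first are used as features (assumes `y` is the first column).
# Named `features` rather than `input` to avoid shadowing the Python builtin.
features = data.to_numpy()[:, 1:]
x_train, x_test, y_train, y_test = train_test_split(features, output, test_size=0.2, random_state=42)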


# Step 3 already implemented
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
sc_xtrain = scaler.fit_transform(x_train)
sc_xtest = scaler.transform(x_test)

# Step 4 already implemented
y_train_arr = y_train.to_numpy().reshape(y_train.shape[0],1)
x_train_arr = sc_xtrain
y_test_arr  = y_test.to_numpy().reshape(y_test.shape[0],1)
x_test_arr  = sc_xtest

# Step 5
from numpy.linalg import inv

# Closed form solution: W = (X^T X)^{-1} X^T Y
xtx = np.dot(x_train_arr.T, x_train_arr)
xty = np.dot(x_train_arr.T, y_train_arr)
weights = np.dot(inv(xtx), xty)

# Step 6

y_test_preds = np.dot(x_test_arr, weights)
print(y_test_preds)

# Step 7
from sklearn.metrics import mean_squared_error

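# RMSE via sklearn's mean_squared_error, used only as a cross-check
rmse_error1 = np.sqrt(mean_squared_error(y_test_arr, y_test_preds))
# RMSE computed directly from its definition
rmse_error = np.sqrt(((y_test_arr - y_test_preds) ** 2).mean())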

print(rmse_error1)
print(rmse_error)

[[26439.10908156]
 [31207.08649332]
 [23969.91020927]
 [28651.32119697]
 [38011.726676  ]
 [55871.97419253]
 [25244.38836825]
 [18751.06056943]
 [35139.04507943]
 [18185.77403563]
 [46401.1333174 ]
 [25456.18194589]
 [65250.89602094]
 [16104.00753897]
 [52634.56888839]
 [63045.0932175 ]
 [21146.39623115]
 [34209.47442686]
 [25075.77799086]
 [46164.33515023]
 [28882.38091233]
 [58083.6310286 ]
 [39797.83816398]
 [49403.40714035]
 [69965.09932508]
 [23513.42984726]
 [26656.90629278]
 [40716.72379115]
 [50243.37461313]
 [23808.92254536]
 [55425.50710823]
 [32631.24275019]]
8207.329772196703
8207.329772196703
