# Regression Practical Assessment
This assessment determines how much you have learnt in the past sprint. The results will be used to determine how EDSA can best prepare you for the working world. The assessment consists of practical questions on regression.

The answers for this test will be input into Athena as Multiple Choice Questions. The questions are included in this notebook and are made **bold** and numbered according to the Athena Questions.

As this is a time-constrained assessment, if you are struggling with a question, move on to a task you are better prepared to answer rather than spending unnecessary time on one question.

**_Good Luck!_**

## Honour Code
I **Basheer, Ashafa**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).  

Non-compliance with the honour code constitutes a material breach of contract.

### Download the data

Download the Notebook and data files here: https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/Machine_Learning_Assessment.zip

### Imports

In [19]:
import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

### Reading in the data
For this assessment we will be using a dataset about the quality of wine. Read in the data and take a look at it.

**Note:** the variable we will be predicting is `quality`, i.e. the label is `quality`.

In [41]:
df = pd.read_csv('winequality.csv')
df.head()

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,0,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,0,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


## Task 1 - Data pre-processing

Write a function to pre-process the data so that we can run it through the regressor. The function should:
* If there are any NaN values, fill them with zeros
* Split the data into features and labels
* Standardise the features using sklearn's `StandardScaler`
* Split the data into 75% training and 25% testing data
* Set `random_state` to 16 when splitting the data

_**Function Specifications:**_
* Should take a dataframe as input.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.

**Note: be sure to pay attention to the test size and random state you use as the following questions assume you split the data correctly**
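For intuition, the standardisation step can be sketched with plain NumPy. This is a minimal illustration on made-up numbers, not the assessment data; `toy` and `standardise` are hypothetical names. It assumes the convention used by `StandardScaler` with default settings: subtract each column's mean and divide by its population standard deviation. Unlike `StandardScaler`, this sketch does not guard against constant columns.

```python
import numpy as np

def standardise(X):
    """Scale each column to zero mean and unit variance (population std, ddof=0)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical toy feature matrix: 4 samples, 2 features on different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])
toy_scaled = standardise(toy)

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns hold the same values even though the raw features differed by a factor of ten, which is the point of putting features on a common scale.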

In [135]:
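def data_preprocess(df):
    # Fill missing values with zeros (fillna returns a new dataframe)
    df = pd.DataFrame(df).fillna(0)
    # Split into features (every column except the label) and the label
    X = df.drop('quality', axis=1).to_numpy()
    y = df['quality'].to_numpy()
    # Standardise each feature to zero mean and unit variance
    X_scaled = StandardScaler().fit_transform(X)
    # 75% training / 25% testing split with a fixed seed
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.25, random_state=16)
    return (X_train, y_train), (X_test, y_test)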

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0,7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,0,6.3,0.300,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,0,8.1,0.280,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,0,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,0,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,1,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
6493,1,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.00,11.2,6
6494,1,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
6495,1,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [136]:
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[0])
print(y_train[0])
print(X_test[3])
print(y_test[3])

[ 1.75018984 -0.00412596  0.12564592  0.97278786 -0.70253493  0.51297929
 -0.36766435 -1.26942219  0.21456681  0.92881824  2.13682458  0.42611996]
7
[-0.57136659 -0.30574457 -0.54115965 -0.12776549  1.33614333 -0.37168026
  1.26631947  0.30531117  0.18121615 -0.91820734  0.32886116 -0.07697409]
6


In [107]:
y_test

array([5, 6, 8, ..., 6, 7, 7], dtype=int64)

_**Expected Outputs:**_

```python
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[0])
print(y_train[0])
print(X_test[3])
print(y_test[3])

[ 1.75018984 -0.00412596  0.12564592  0.97278786 -0.70253493  0.51297929
 -0.36766435 -1.26942219  0.21456681  0.92881824  2.13682458  0.42611996]
7
[-0.57136659 -0.30574457 -0.54115965 -0.12776549  1.33614333 -0.37168026
  1.26631947  0.30531117  0.18121615 -0.91820734  0.32886116 -0.07697409]
6

```

In [64]:
new.isna().sum()

0     0
1     1
2     2
3     2
4     1
5     0
6     0
7     0
8     0
9     1
10    1
11    0
dtype: int64

In [140]:
new= pd.DataFrame(X_test)
new.loc[16]

0     1.750190
1     2.785846
2     0.186265
3     1.798203
4    -0.303206
5     0.427367
6    -0.762074
7    -0.897856
8     1.551928
9     0.030265
10    1.668093
11   -0.328521
Name: 16, dtype: float64

In [141]:
new.loc[0:13, 0:14]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,-0.571367,-0.531959,-0.147138,-0.815611,0.243244,-0.428755,-1.212828,0.181456,0.081164,-1.367484,-0.13987,-1.418558
1,1.75019,1.88099,-0.177448,1.041572,-0.807621,0.17053,-0.987451,-1.464052,1.201746,0.279863,0.864554,-0.41237
2,-0.571367,-0.833577,-0.359304,-0.334119,-0.156085,-0.742667,1.435352,0.429167,-1.426285,-0.269252,-0.742525,1.683855
3,-0.571367,-0.305745,-0.54116,-0.127765,1.336143,-0.37168,1.266319,0.305311,0.181216,-0.918207,0.328861,-0.076974
4,-0.571367,-0.23034,-0.601778,0.353727,0.138157,-0.828279,0.421155,-0.066255,-1.279542,-0.119494,-1.278217,1.600006
5,-0.571367,0.523707,-0.844253,1.179142,-0.807621,-0.086306,-0.874763,0.499941,-0.565838,-0.269252,-0.407717,0.006875
6,-0.571367,-0.154935,-1.086728,0.284942,-0.597448,-0.771204,-0.142287,-0.313966,-1.14614,-0.51885,-1.077333,0.593818
7,-0.571367,-0.908982,-0.116829,-0.127765,0.390365,3.36672,0.87191,0.765346,0.164541,-0.169413,-0.273793,-0.915464
8,-0.571367,-0.531959,-0.723016,0.009804,-0.681518,-0.799741,0.308467,-0.420128,-1.312893,0.279863,0.395823,1.180761
9,-0.571367,-1.135196,-0.480541,-0.127765,-0.765587,-0.48583,-0.029599,-0.190111,-1.583033,1.328175,-0.072908,1.683855


**Q11. What is the result of printing out the 6th column and the 13th row of X_train?**

**Q12. What is the result of printing out the 6th column and the 13th row of X_test?**

**Q13. What is the result of printing out the 16th row of y_train?**

**Q14. What is the result of printing out the 16th row of y_test?**
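When answering Q11 to Q14, keep in mind that NumPy indexing is zero-based. Assuming the questions count rows and columns starting from 1 (an assumption; the question wording does not say so explicitly), the "13th row, 6th column" corresponds to index `[12, 5]`. A minimal sketch on hypothetical arrays `demo` and `labels`:

```python
import numpy as np

# Hypothetical 20 x 8 array whose entries encode their own position
demo = np.arange(20 * 8).reshape(20, 8)

# 13th row and 6th column, counting from 1, is index [12, 5]
row_13_col_6 = demo[12, 5]
print(row_13_col_6)  # 101, i.e. 12 * 8 + 5

# 16th entry of a 1-D label array, counting from 1, is index 15
labels = np.arange(100, 120)
sixteenth_label = labels[15]
print(sixteenth_label)  # 115
```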

## Task 2 - Train Linear Regression Model

Since this dataset is about predicting quality, which ranges from 1 to 10, let's try fitting the data to a regression model and see how well it performs.

Fit a model using sklearn's `LinearRegression` class with its default parameters. Write a function that will take as input `(X_train, y_train)` that we created previously, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `LinearRegression` model.
* The returned model should be fitted to the data.
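To make the fitted quantities concrete, here is a rough sketch of ordinary least squares on synthetic, noise-free data using `np.linalg.lstsq`. It is not the assessment solution; `X_toy`, `y_toy`, and the true parameters are made up for illustration. The recovered `intercept` and `coefs` play the same role as the `intercept_` and `coef_` attributes of a fitted `LinearRegression` model.

```python
import numpy as np

# Hypothetical synthetic data: y = 2 + 3*x1 - 1*x2, with no noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 2.0 + 3.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

# Prepend a column of ones so the first fitted parameter is the intercept
A = np.column_stack([np.ones(len(X_toy)), X_toy])
params, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
intercept, coefs = params[0], params[1:]

# Recovers approximately 2 for the intercept and [3, -1] for the coefficients
print(round(float(intercept), 3), np.round(coefs, 2))
```

There is one coefficient per feature, so a question about "the coefficient" needs an index into the coefficient array.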

In [144]:
def train_model(X_train, y_train):
    # Fit a linear regression with default parameters and return the fitted model
    lm = LinearRegression()
    return lm.fit(X_train, y_train)

In [145]:
lm = train_model(X_train, y_train)

In [152]:
coef= round(lm.coef_[2], 2)
coef

-0.26

In [149]:
round(lm.intercept_,3)

5.821

**Q15. What is the value of the *_intercept_* of the trained model rounded to 3 decimal places?**

**Q16. What is the value of the *_coefficient_* of the trained model rounded to 2 decimal places?**

## Task 3 - Test Regression Model

We would now like to test our regression model. This test should give the residual sum of squares, which for your convenience is written as
$$
RSS = \sum_{i=1}^N (p_i - y_i)^2,
$$
where $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.

_**Function Specifications:**_
* Should take a trained model and two `arrays` as input. This will be the `X_test` and `y_test` variables. 
* Should return the residual sum of squares over the input from the predicted values of `X_test` as compared to values of `y_test`.
* The output should be a `float` rounded to 2 decimal places.
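As a quick sanity check of the formula, the following sketch computes the RSS for three made-up predictions. The helper name `rss` and the toy values are hypothetical and are not part of the assessment.

```python
import numpy as np

def rss(pred, actual):
    """Residual sum of squares, rounded to 2 decimal places."""
    residuals = np.asarray(pred, dtype=float) - np.asarray(actual, dtype=float)
    return round(float(np.sum(residuals ** 2)), 2)

# Hypothetical predictions and true values
toy_pred = [2.5, 0.0, 2.0]
toy_actual = [3.0, -0.5, 2.0]

# Residuals are -0.5, 0.5 and 0.0, so RSS = 0.25 + 0.25 + 0.0
toy_rss = rss(toy_pred, toy_actual)
print(toy_rss)  # 0.5
```

Note that RSS is a sum rather than an average, so it grows with the number of test samples.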

In [158]:
def test_model(lm, X_test, y_test):
    # Predict on the test features
    y_pred = lm.predict(X_test)
    # Residual sum of squares between predictions and true labels
    rss = np.sum(np.square(y_pred - y_test))
    return round(float(rss), 2)


In [159]:
test_model(lm, X_test, y_test)

882.3

**Q17. What is the residual sum of squares value for the *_Linear Regression_* model fitted to the test set?**

## Task 4 - Train Decision Tree Regressor Model

Let us try to improve this accuracy by training a model using sklearn's `DecisionTreeRegressor` class with a random state value of 42. Write a function that will take as input `(X_train, y_train)` that we created previously, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `DecisionTreeRegressor` model with a random state value of 42.
* The returned model should be fitted to the data.
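One property worth keeping in mind: a `DecisionTreeRegressor` with default settings has no depth limit, so it can keep splitting until every distinct training point sits in its own leaf. The sketch below, on made-up data (`X_toy` and `y_toy` are hypothetical), shows the resulting zero training error. That is a sign of memorisation and does not guarantee good performance on a test set.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: 10 distinct inputs with a non-linear target
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = X_toy[:, 0] ** 2

# Default settings (no max_depth) with a fixed random_state for reproducibility
tree = DecisionTreeRegressor(random_state=42).fit(X_toy, y_toy)

# Residual sum of squares on the training data itself
train_rss = float(np.sum((tree.predict(X_toy) - y_toy) ** 2))
print(train_rss)  # 0.0: each training point ends up in its own leaf
```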

In [161]:
def train_dt_model(X_train, y_train):
    # your code here
    regr_tree = DecisionTreeRegressor(random_state=42)
    return regr_tree.fit(X_train,y_train)


In [162]:
dec_mod= train_dt_model(X_train, y_train)

Now that you have trained your model, let's see how well it does on the test set. Use the `test_model` function you previously created to do this.

In [163]:
def test_model(regre_tree, X_test, y_test):
    # your code here
    y_pred = regre_tree.predict(X_test)
    df=pd.DataFrame({"Pred": y_pred, "Real_Y":y_test})
    Rss = np.sum(np.square(df['Pred'] - df['Real_Y']))
    RSS = round(Rss, 2)
    return RSS


In [164]:
test_model(dec_mod,X_test, y_test)

1113.0

**Q18. What is the residual sum of squares value for the decision tree regression model, fitted on the test set?**

## Task 5 - Mean Absolute Error
Write a function to compute the Mean Absolute Error (MAE), which is given by:

$$
MAE = \frac{1}{N} \sum_{i=1}^N |p_i - y_i|
$$

where $p_i$ refers to the $i^{\rm th}$ `prediction`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.

_**Function Specifications:**_
* Should take two `arrays` as input. You can think of these as the `predictions` and `y_test` variables you get when testing a model. 
* Should return the mean absolute error over the input from the predicted values of `X_test` as compared to values of `y_test`.
* The output should be a `float` rounded to 3 decimal places.
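The formula can be checked by hand on a small example. The sketch below uses a vectorised NumPy version; the name `mae` and the toy inputs are hypothetical and chosen so that the arithmetic is easy to verify.

```python
import numpy as np

def mae(pred, actual):
    """Mean absolute error, rounded to 3 decimal places."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return round(float(np.mean(np.abs(pred - actual))), 3)

# Absolute errors are 1, 0 and 2, so MAE = (1 + 0 + 2) / 3
toy_mae = mae([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
print(toy_mae)  # 1.0
```

Because of the absolute value, the order of the two arguments does not change the result.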

In [174]:
def mean_abs_err(pred, y_test):
    # Average of the absolute differences between predictions and true values
    error = np.mean(np.abs(np.asarray(pred) - np.asarray(y_test)))
    return round(float(error), 3)

**Q19. What is the result of printing out** `mean_abs_err(np.array([7.5,7,1.2]),np.array([3.2,2,-2]))`**?**

In [175]:
v=mean_abs_err(np.array([7.5,7,1.2]),np.array([3.2,2,-2]))
v

4.167

**Q20. Which regression model (linear vs decision tree) has the lowest mean absolute error?**

In [171]:
pred_l= lm.predict(X_test)
pred_r= dec_mod.predict(X_test)

In [172]:
mean_abs_err(pred_l, y_test)

0.577

In [173]:
mean_abs_err(pred_r, y_test)

0.486
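In the results above, the decision tree has the larger RSS (1113.0 versus 882.3) but the smaller MAE (0.486 versus 0.577). The two metrics can rank models differently because squaring penalises large individual errors more heavily than the absolute value does. The following sketch shows the effect on made-up numbers; `pred_a`, `pred_b`, and the helper functions are hypothetical, and this is only one plausible explanation of the pattern, not a diagnosis of the models above.

```python
import numpy as np

def rss(pred, y):
    """Residual sum of squares."""
    return float(np.sum((pred - y) ** 2))

def mae(pred, y):
    """Mean absolute error."""
    return float(np.mean(np.abs(pred - y)))

actual = np.zeros(4)

# Model A: exact on three points, badly wrong on one
pred_a = np.array([0.0, 0.0, 0.0, 2.0])
# Model B: moderately wrong on every point
pred_b = np.full(4, 0.75)

print(mae(pred_a, actual), rss(pred_a, actual))  # 0.5 4.0
print(mae(pred_b, actual), rss(pred_b, actual))  # 0.75 2.25
```

Model A wins on MAE while model B wins on RSS, so the choice of metric should reflect how costly large individual errors are.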