# Multi-linear and polynomial regression

For this homework you will create a data pipeline for multi-linear and polynomial regression and compare the results. You will use the Air Quality dataset from the UCI Machine Learning Repository, found [here](http://archive.ics.uci.edu/ml/datasets/Air+Quality), to predict the nitrogen oxide (NOx(GT)) levels.

You will first perform the steps required for data exploration one at a time. For the last question, you will create a data pipeline that performs all of these steps.

## 0) Import the required packages, check the data types and number of non-null values for each column

When importing the data, you may want to use the `decimal` parameter of `read_csv`.
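As a minimal sketch of what the `decimal` parameter does, the example below parses a small made-up sample instead of the real file. It assumes semicolon-separated fields and a comma as the decimal mark; check the file you downloaded before reusing these settings. The sample rows and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample: semicolon-separated fields, comma as decimal mark
# (an assumption about the file layout, not taken from the assignment).
sample = io.StringIO(
    "Date;CO(GT);T\n"
    "10/03/2004;2,6;13,6\n"
    "11/03/2004;2,0;13,3\n"
)

# decimal="," tells pandas to read "2,6" as the float 2.6
df = pd.read_csv(sample, sep=";", decimal=",")

# Data types and non-null counts for each column
df.info()
```

Without `decimal=","`, the numeric columns in this sample would be read as strings (dtype `object`).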

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import FunctionTransformer
%matplotlib inline

# This prevents warnings from being shown when you run your code:
import warnings
warnings.filterwarnings('ignore')

## 1) Data cleaning. Drop any unnecessary columns and NaN values from the data.

## Check that the values in each column are valid inputs. Replace the invalid inputs with NaN, remove any column in which the majority of values are NaN, then replace the remaining NaN values with the median value of their column.
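The cleaning steps above could be sketched as follows on a toy data frame. The column names and the sentinel value `-200` are assumptions for illustration; consult the dataset documentation for the value that actually marks invalid readings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; -200 stands in for an invalid-reading marker (assumption)
df = pd.DataFrame({
    "a": [1.0, -200.0, 3.0, 4.0],
    "b": [-200.0, -200.0, -200.0, 5.0],
    "c": [10.0, 20.0, -200.0, 40.0],
})

SENTINEL = -200  # assumed marker for invalid inputs

# 1. Turn invalid inputs into NaN
cleaned = df.replace(SENTINEL, np.nan)

# 2. Drop columns where more than half of the values are NaN
nan_share = cleaned.isna().mean()
cleaned = cleaned.loc[:, nan_share <= 0.5]

# 3. Fill the remaining NaN values with each column's median
cleaned = cleaned.fillna(cleaned.median())

print(cleaned)
```

Here column `b` is removed because three of its four values are invalid, while the single gaps in `a` and `c` are filled with the column medians.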

## 2) Check the assumptions for a multi-linear regression model.

### a) Check for strong correlation among the independent variables (correlation greater than 0.8). If two variables are strongly correlated, remove one of them from the data.
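One possible way to apply the 0.8 threshold is sketched below on synthetic data. The variable names are hypothetical, and the use of the absolute value (so that strong negative correlations are also caught) is a design choice of this sketch rather than something the assignment prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),
    "x3": rng.normal(size=200),
})

# Absolute pairwise correlations between the independent variables
corr = X.corr().abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the second member of every pair whose correlation exceeds 0.8
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
X_reduced = X.drop(columns=to_drop)

print("dropped:", to_drop)
```

A heatmap of `corr` (for example with `sns.heatmap`) is a convenient visual check alongside this programmatic filter.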

### b) Check that the relationships between the remaining independent variables and the outcome variable are quasi-linear. Drop the variables that don't satisfy this criterion.

## 3) Now that you have selected your independent variables, split the data into a 70-30 train-test split.

## Scale the train and test data using RobustScaler and fit a linear model to the data.

## Report the values of the intercept, the coefficients, and the accuracy of the model.
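The split, scale, and fit sequence might look like the sketch below. It runs on synthetic data, and `random_state=42` is an arbitrary choice made for reproducibility. The scaler is fitted on the training set only and then applied to both sets, and `score` returns the R-squared value, which is used here as the accuracy measure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic data standing in for the selected features and the target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# 70-30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training data only, then transform both sets
scaler = RobustScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit the linear model and report its parameters and R-squared
model = LinearRegression().fit(X_train_s, y_train)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R-squared (training):", model.score(X_train_s, y_train))
print("R-squared (test):", model.score(X_test_s, y_test))
```

Fitting the scaler on the training set alone avoids leaking information from the test set into the model.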

In [None]:
# You may use the adjustedR2 function we used in class:
def adjustedR2(r2,n,k):
    return r2-(k-1)/(n-k)*(1-r2)

# Create a dataframe to store all of our metrics for the models
evaluation = pd.DataFrame({'Model': [],
                           'Root Mean Squared Error (RMSE)':[],
                           'R-squared (training)':[],
                           'Adjusted R-squared (training)':[],
                           'R-squared (test)':[],
                           'Adjusted R-squared (test)':[]})

## 4) Calculate the root mean squared error (RMSE), R-squared (training), Adjusted R-squared (training), R-squared (test), and Adjusted R-squared (test). Add these results to the evaluation dataframe.
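A sketch of how these metrics could be computed and appended as one row is shown below. The observed values and predictions are made up, the model label is hypothetical, and setting `k` to the number of features is an assumption; use whatever convention for `k` was given in class. The RMSE is computed on the test set here, which is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn import metrics

def adjustedR2(r2, n, k):
    # Same helper as in the assignment
    return r2 - (k - 1) / (n - k) * (1 - r2)

# Hypothetical observed values and model predictions
y_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
pred_train = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
y_test = np.array([2.0, 4.0, 6.0, 8.0])
pred_test = np.array([2.5, 3.5, 6.5, 7.5])
k = 2  # assumed value of k for this illustration

rmse = np.sqrt(metrics.mean_squared_error(y_test, pred_test))
r2_train = metrics.r2_score(y_train, pred_train)
r2_test = metrics.r2_score(y_test, pred_test)

evaluation = pd.DataFrame(columns=[
    "Model",
    "Root Mean Squared Error (RMSE)",
    "R-squared (training)",
    "Adjusted R-squared (training)",
    "R-squared (test)",
    "Adjusted R-squared (test)",
])

# Append one row at the next free integer index
evaluation.loc[len(evaluation)] = [
    "Multi-linear (sketch)",
    rmse,
    r2_train,
    adjustedR2(r2_train, len(y_train), k),
    r2_test,
    adjustedR2(r2_test, len(y_test), k),
]
print(evaluation)
```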

## 5) Verify that the errors are normally distributed using a histogram plot.
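One way to make this check is to plot a density histogram of the residuals and overlay a normal curve with the same mean and standard deviation. The residuals below are synthetic; in the homework they would be the difference between the observed and predicted values. The overlay is an optional addition of this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic residuals standing in for (observed - predicted)
rng = np.random.default_rng(0)
residuals = rng.normal(scale=2.0, size=500)

fig, ax = plt.subplots()

# density=True rescales the bars so the histogram integrates to 1
counts, bins, _ = ax.hist(residuals, bins=30, density=True, alpha=0.6)

# Normal density with the residuals' own mean and standard deviation
grid = np.linspace(residuals.min(), residuals.max(), 200)
ax.plot(grid, stats.norm.pdf(grid, loc=residuals.mean(), scale=residuals.std()))

ax.set_xlabel("Residual")
ax.set_ylabel("Density")
ax.set_title("Distribution of residuals")
```

If the bars follow the curve closely and are roughly symmetric around zero, the normality assumption looks plausible.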

## 6) Repeat this process for a naive model with all the variables (use the preprocessed data frame you had at the end of question 1). Add the results to the evaluation dataframe.

### Why is there such a difference between the models? Is our model wrong?

## 7) Fit a polynomial regression model of degree 2 on the selected features. Calculate the RMSE, R-squared (training), and R-squared (test) scores and add them to the evaluation dataframe.

### Which model is best? Based on which metric(s)?
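A degree-2 polynomial regression can be built by expanding the features with `PolynomialFeatures` and fitting an ordinary linear model on the expanded matrix. The sketch below uses synthetic data with a deliberately quadratic target, and `include_bias=False` is a choice made here because `LinearRegression` already fits an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with squared and interaction terms in the target
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = 1.0 + X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

# Expand two features into: x0, x1, x0^2, x0*x1, x1^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Linear regression on the expanded features is a polynomial regression
poly_model = LinearRegression().fit(X_poly, y)
poly_r2 = poly_model.score(X_poly, y)

# Plain linear fit on the raw features, for comparison
linear_r2 = LinearRegression().fit(X, y).score(X, y)

print("R-squared, degree 2:", poly_r2)
print("R-squared, linear:  ", linear_r2)
```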

## 8) Implement a preprocessing pipeline (replacing NaN values, removing columns, scaling features, etc.) and a regression pipeline. Then combine the two into a single pipeline. Implement this only for the polynomial regression.
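The overall structure could be sketched as below: a `ColumnTransformer` handles imputation, scaling, and column removal, a second pipeline handles the polynomial regression, and an outer `Pipeline` chains the two. The column names, step names, and data are hypothetical, and this is only one of several reasonable layouts.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, RobustScaler

# Synthetic frame: two useful features and one column to be removed
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "f1": rng.normal(size=200),
    "f2": rng.normal(size=200),
    "unused": rng.normal(size=200),
})
y = 1.0 + df["f1"] + df["f2"] ** 2 + rng.normal(scale=0.1, size=200)
df.loc[::10, "f1"] = np.nan  # inject missing values after building y

selected = ["f1", "f2"]  # hypothetical selected features

# Preprocessing: median imputation, robust scaling, other columns dropped
preprocessing = ColumnTransformer(
    [("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ]), selected)],
    remainder="drop",
)

# Regression: degree-2 feature expansion followed by a linear model
regression = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("linreg", LinearRegression()),
])

# Single pipeline combining both stages
full_pipeline = Pipeline([
    ("preprocess", preprocessing),
    ("regress", regression),
])

full_pipeline.fit(df, y)
predictions = full_pipeline.predict(df)
print("R-squared:", full_pipeline.score(df, y))
```

Because every step lives inside one estimator, calling `fit` on training data and `predict` on test data applies the same imputation and scaling parameters to both.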