## Linear Regression

Example in Excel

<video controls src="regression.mp4" />




The short video shows a very simple example of linear regression using Excel.

The values for the feature and target are all generated using random numbers; however, a loose relationship between the two (i.e. how they are calculated) has been maintained.

The values are regenerated by pressing F9.

In a real Linear Regression situation there are likely to be many features, not just one. 

What our simple example has in common with a more realistic case is that we are aiming to predict
the target value from a set of features (in this case a set of one).

Excel illustrates how it would make the prediction by inserting a 'best fit' line through the points. 
We will discuss best fit a bit later.

You can also have Excel write out the equation of the line it has drawn. This is essentially the Linear Regression model that 
it has calculated.

Two things should be immediately obvious:

1. Almost none of the real points lie on the line.
2. As the data points change, the line and its equation change.
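
Both observations can be reproduced outside Excel. The following is a minimal sketch, not the spreadsheet's own calculation: it assumes a made-up trend (`y = 2x + 5` plus noise) and uses `np.polyfit`, a least-squares fit, as a stand-in for Excel's trendline. The function name `fit_line` and all of the numbers are illustrative assumptions.

```python
import numpy as np

def fit_line(seed, n=20):
    """Generate noisy points around a loose linear trend and fit a straight line."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, n)                # the single feature
    y = 2.0 * x + 5.0 + rng.normal(0, 3, n)  # target: assumed trend plus random noise
    slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
    residuals = y - (slope * x + intercept)  # vertical distance of each point from the line
    return slope, intercept, residuals

for seed in (1, 2):  # changing the seed plays the role of pressing F9
    slope, intercept, residuals = fit_line(seed)
    print(f"seed {seed}: y = {slope:.2f}x + {intercept:.2f}, "
          f"largest distance from the line = {np.abs(residuals).max():.2f}")
```

Each seed gives a different set of points and therefore a different equation, and the residuals show how far the points sit from the fitted line.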


**Conclusion: Exercise!**

## Supervised learning

In order to get Excel to produce a model at all, we needed to provide both feature values and target values.

That is, we need to provide features for which we already know the answers. By providing both the features and the associated target values in this way, we allow a model to be trained so that it can predict target values (which we don't know) from a new set of features (which we do).

This is called supervised learning.


## The dataset 

For this lesson we will use a dataset which is included with the scikit-learn package.

The dataset that we will be using is the **Boston House prices** dataset.

The datasets in scikit-learn are provided as dictionary-like objects. This allows both the data and appropriate metadata, including provenance and citation information, to be included.

You can see the contents of the dictionary with the following code:


In [None]:
# Dataset  - Boston house-prices from sklearn

from sklearn import datasets
import pandas as pd

boston_data = datasets.load_boston()

#print(boston_data)

#print(boston_data['DESCR'])
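
Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the cell above runs only on older releases. As a rough sketch of the same inspection on a current install, the example below uses `load_diabetes` purely as a stand-in; it is a different dataset, and the names `demo_data` and `df_demo` are illustrative.

```python
from sklearn import datasets
import pandas as pd

# load_diabetes is used purely as a stand-in here; it is not the dataset from this lesson
demo_data = datasets.load_diabetes()

# the returned object behaves like a dictionary
print(list(demo_data.keys()))

# the features and the target are held separately, as with the Boston data
df_demo = pd.DataFrame(demo_data.data, columns=demo_data.feature_names)
df_demo['target'] = demo_data.target
print(df_demo.shape)
```

The `DESCR` entry holds the description and citation text, so `print(demo_data['DESCR'])` is the equivalent of the commented-out line above.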


In [None]:
# we can put the data (the features) into a dataframe and add the 'target values'
df_boston = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df_boston['target'] = pd.Series(boston_data.target)
df_boston.head()

In [None]:
# how big is the dataset?
df_boston.shape

In [None]:
# stats on the numerical values
df_boston.describe()

### Missing data

Most machine learning algorithms don't like missing data. 

In this particular case we don't have any, but if there were, we could adopt standard approaches: either removing the affected rows or imputing the missing values.

The following cells are just examples of what you might do.

In [None]:
df_missing = pd.read_csv("MissingData.csv")

In [None]:
# The missing values in a dataframe are represented by 'NaN' 
df_missing

In [None]:
# we can 'drop' all of the rows containing a NaN with

print(df_missing.shape)
df_missing.dropna(inplace=True)
print(df_missing.shape)

In [None]:
# a 35% reduction in data!

In [None]:
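df_missing = pd.read_csv("MissingData.csv")
# fill the gaps in each numeric column with that column's mean
# (numeric_only=True skips text columns, which have no mean)
df_missing = df_missing.fillna(df_missing.mean(numeric_only=True))
df_missing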

In [None]:
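df_missing = pd.read_csv("MissingData.csv")
# fill the gaps in a categorical column with a placeholder label
# (assigning the result back avoids relying on inplace changes to a column selection)
df_missing['CatA'] = df_missing['CatA'].fillna('Unknown')
df_missing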
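
The cells above depend on `MissingData.csv`. For readers without that file, here is a small self-contained sketch of the same two options (dropping rows and imputing values). The dataframe and its column names `NumA` and `NumB` are invented for illustration; only `CatA` echoes the column used above.

```python
import numpy as np
import pandas as pd

# hypothetical data: two numeric columns and one categorical column, each with a gap
df_demo = pd.DataFrame({
    'NumA': [1.0, 2.0, np.nan, 4.0],
    'NumB': [10.0, np.nan, 30.0, 40.0],
    'CatA': ['x', 'y', None, 'x'],
})

# how many values are missing in each column?
print(df_demo.isna().sum())

# option 1: drop every row that contains a missing value
dropped = df_demo.dropna()
print(dropped.shape)

# option 2: impute, using the column mean for numbers and a label for categories
imputed = df_demo.fillna(df_demo.mean(numeric_only=True))
imputed['CatA'] = imputed['CatA'].fillna('Unknown')
print(imputed)
```

Dropping rows is simple but can discard a lot of data; imputing keeps every row at the cost of inventing plausible values.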

## Visualisation of the data

We are looking for insights as to the nature of the data as a whole.  We can use different visualizations for different types of data.

We will use matplotlib for our visualizations

In [None]:
# only need the pyplot functions
import matplotlib.pyplot as plt

# needed by jupyter to ensure that the plots appear inline (in the usual output cell)
%matplotlib inline

In [None]:
# remind ourselves what our dataset looks like
df_boston.describe()

### We can create simple plots to look at the data

In [None]:
# both histograms ...
for col in df_boston.columns:
    df_boston[col].hist(bins = 20)
    plt.title('Histogram of ' + col)
    plt.show() 

In [None]:
#  ...   and boxplots can be useful

for col in df_boston.columns:
    df_boston.boxplot(column = col)
    plt.title('Boxplot of ' + col)
    plt.show() 


In [None]:
# variable correlations
import seaborn as sns
sns.pairplot(df_boston)
plt.show()

In [None]:
# we can look at the correlation between each pair of variables

corr = df_boston.corr()
corr

In [None]:
# or graphically with a heatmap 
import seaborn as sns

fig, ax = plt.subplots(figsize=(15,12))
heat_map = sns.heatmap(corr)
plt.show()

## Dealing with outliers

[Demonstration using Excel of the effect of outliers]

<video controls src="regression_outliers.mp4" />

From the demonstration you can see that outliers can considerably distort the position of the trendline. This is not desirable.

For a given set of values there is no universal definition of which are outliers. A common approach is to treat any value more than 2 standard deviations from the mean as an outlier.

We will adopt this approach and write a function which will list all of the rows in the dataset which contain such values.

In [None]:
# We start by creating a list of the columns which could contain outliers.
# In our dataset it is all of the columns with the exception of the target column and the CHAS
# column, which we know from the description is a categorical boolean value.

poss_outlier_columns = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']



If you go back and look at the `.describe()` values for the CHAS column, you can see that if we included it in this approach, we would effectively remove all of the rows with a value of 1.
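
To see why a rarely set boolean column behaves this way under the 2 standard deviation rule, here is a small worked sketch. The column is invented (93 zeros and 7 ones) and only loosely imitates CHAS; it is not the real data.

```python
import pandas as pd

# hypothetical binary column: 93 zeros and 7 ones, loosely imitating a rare flag like CHAS
flag = pd.Series([0] * 93 + [1] * 7)

mean = flag.mean()
sd = flag.std()
upper = mean + 2 * sd
lower = mean - 2 * sd
print(f"mean = {mean:.3f}, sd = {sd:.3f}, limits = ({lower:.3f}, {upper:.3f})")

# every 1 sits above the upper limit, so the rule would flag all of them
flagged = flag[(flag > upper) | (flag < lower)]
print(flagged.value_counts())
```

Because the ones are rare, the mean stays close to 0 and the upper limit falls well below 1, so the rule treats every 1 as an outlier while keeping every 0.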

In [None]:
# A common approach to removing outliers is to treat all data more than 2 standard deviations from the mean as outliers
# We will create a small function to do this and then pass it our dataframe and our list of columns

def get_outliers(data, columns):
    # create a list for the results
    outlier_list = []
    for col in columns:
        mean = data[col].mean()
        sd = data[col].std()
        # get the index values of all values higher or lower than the mean +/- 2 standard deviations
        outliers = data[(data[col] > mean + 2*sd) | (data[col] < mean - 2*sd)].index
        # and add those values to our list
        outlier_list.extend(outliers)
    # put our list into a set, as this will remove duplicates
    # and then return it as a list
    return list(set(outlier_list))

# create our list of outlier row indexes
boston_outliers = get_outliers(df_boston, poss_outlier_columns)

# and then drop them
df_boston = df_boston.drop(boston_outliers, axis = 0)


In [None]:
df_boston.shape

In [None]:
df_boston.describe()

In [None]:
# let's repeat some of the graphics
sns.pairplot(df_boston)
plt.show()

In [None]:
# and
corr = df_boston.corr()
fig, ax = plt.subplots(figsize=(15,12))
heat_map = sns.heatmap(corr)
plt.show()

In [None]:
# and the corr figures
print(corr)

### Normalisation

In [None]:
from sklearn.preprocessing import StandardScaler

# this function loops through the columns in a dataset and applies the given scaler to each
def scale_numeric(data, numeric_columns, scaler):
    for col in numeric_columns:
        # the scaler expects a 2D array, so reshape going in and flatten coming out
        data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1)).ravel()
    return data

# we can now define the scaler we want to use and apply it to our dataset

# Other scalers are available; see the scikit-learn documentation
scaler = StandardScaler()
df_boston = scale_numeric(df_boston, poss_outlier_columns, scaler)
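
`StandardScaler` rescales a column so that it has a mean of 0 and a standard deviation of 1. The sketch below checks this by hand on a made-up column of five values; the array and variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical column of values
values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# standardise by hand: subtract the mean, divide by the (population) standard deviation
by_hand = (values - values.mean()) / values.std()

# StandardScaler expects a 2D array, hence the reshape, as in scale_numeric above
by_scaler = StandardScaler().fit_transform(values.reshape(-1, 1)).ravel()

print(by_hand)
print(by_scaler)
print(by_scaler.mean(), by_scaler.std())
```

Note that NumPy's `std` divides by the number of values, which matches what `StandardScaler` does, whereas the pandas `.std()` used in the outlier function divides by one less than that.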


In [None]:
df_boston.describe()

In [None]:
df_boston[0:10]

So far all we have been doing is cleaning and preparing the data to make it more acceptable to the algorithm we want to use.

Now we need to make sure that we have suitable data both to create the model and to test it.

## Splitting the data

Currently our dataset includes all of the (remaining) rows, and each row includes the target column, i.e. the values that we would like the model to predict.

For a supervised learning method like regression, we need to provide the learning algorithm with both the predictor columns and the corresponding target values so that the model can be trained.

We also need to keep some of the dataframe rows back so that, after we have created the model, we have data available with which to test it.


For the names of these new dataframes we follow convention and use 'X' to indicate the predictors and 'y' for the target (the predicted). So we will end up with 4 distinct structures: X_train, X_test, y_train and y_test.

Because the need to perform this operation is so common in supervised learning methods, scikit-learn has its own function to help you.

In [None]:
# first we need to split out the predictors from the targets and put them into separate dataframes.

# the predictors
df_boston_X = pd.DataFrame(df_boston,columns=boston_data.feature_names)
df_boston_X.describe()

In [None]:
# the targets
df_boston_y = pd.DataFrame(df_boston,columns=['target'])
df_boston_y.describe()

## Now we can use the train_test_split function from sklearn

We need to provide both the predictors and the target dataframes.
We also provide a 'test_size' value to indicate the proportion of the rows to be used for the test dataframe (0.2 means 20%).

Essentially we are going to split up our original dataset into four areas:


<img src="Data_split.png" />


**Red area** = Rows of training features used to generate the model

**Purple area** = Target values used to generate the model

**Orange area** = Rows of features used to test the model

**Green area** = The actual target values from the test data used to evaluate the accuracy of the model

In [None]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(df_boston_X, df_boston_y, test_size = 0.2, random_state = 42)

In [None]:
# get shape of test and training sets that have been created
print('Training Set Row Count: ', X_train.shape[0])
print('Test Set Row Count: ', X_test.shape[0])
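
As a small self-contained illustration of what `train_test_split` returns, the sketch below splits an invented 10-row dataframe. The column names `f1` and `f2` and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# hypothetical dataset: 10 rows, two predictors and one target
df_demo = pd.DataFrame({
    'f1': np.arange(10),
    'f2': np.arange(10) * 2,
    'target': np.arange(10) * 3,
})
X = df_demo[['f1', 'f2']]
y = df_demo[['target']]

X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)   # 8 training rows, 2 test rows

# the predictors and targets stay aligned: both share the same row index
print(list(X_te.index) == list(y_te.index))

# repeating the split with the same random_state reproduces the same rows
X_tr2, X_te2, _, _ = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_te.equals(X_te2))
```

The rows are shuffled before splitting, and fixing `random_state` makes that shuffle repeatable from one run to the next.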


## Creating the model

The first step is deciding which algorithm to use. There is more than one regression algorithm, but we are going to use the LinearRegression class from scikit-learn, which we will assign to a variable called lm.



In [None]:
from sklearn.linear_model import LinearRegression

# we start by creating an object of the LinearRegression class
lm = LinearRegression()

## Fitting the model

We now need to provide the fit function with the training predictors (X_train)
and the known target values (y_train) for these training predictor rows.

In [None]:
lm.fit(X_train, y_train)

# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
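
After fitting, the model holds the equation it has calculated, much like the equation Excel wrote on the chart. The sketch below fits a `LinearRegression` to invented, noise-free data generated from `y = 3x + 2`, so the recovered values are easy to check. The data and the name `demo_lm` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical noise-free data generated from y = 3x + 2
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 2.0

demo_lm = LinearRegression()
demo_lm.fit(X, y)

# the fitted slope(s) and intercept are stored on the model after fit
print(demo_lm.coef_)        # one coefficient per feature
print(demo_lm.intercept_)

# predict applies intercept + coefficient * feature to new rows
print(demo_lm.predict(np.array([[10.0]])))
```

With several features, as in the Boston data, `coef_` holds one value per feature, and the same attributes can be read from `lm` once it has been fitted.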

## Using the model
We now have a model which we can use to predict the target values for the test predictors (X_test).

In [None]:
Y_pred = lm.predict(X_test)


In [None]:
# A quick check of the results
print(len(Y_pred))
print(Y_pred)


## All that remains is to check - How good is the model?

Bear in mind that this is only *one* model. Even using the same algorithm, changing the training set of predictors and targets could have resulted in a different model, which may have been better or worse than the one we have.

Even so, we need to have some kind of measure of how good we think the model is, or how much confidence we are prepared to place in the model.

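In order to do this we need to compare the predicted values with the actual target values that we held back in the test set. Scikit-learn provides several metrics for this.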

In [None]:
from sklearn import metrics

In [None]:

def evaluate(Y_test, Y_pred):
    # calculate the metrics we are interested in
    mse = metrics.mean_squared_error(Y_test, Y_pred)
    mae = metrics.mean_absolute_error(Y_test, Y_pred)
    r2 = metrics.r2_score(Y_test, Y_pred)

    print("Mean squared error: ", mse)
    print("Mean absolute error: ", mae)
    print("R^2 : ", r2)

    # this creates a chart plotting predicted against actual values
    # (raw strings stop Python treating '\h' as an escape sequence)
    plt.scatter(Y_test, Y_pred)
    plt.xlabel("Prices: $Y_i$")
    plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
    plt.title(r"Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")
    plt.show()

evaluate(y_test, Y_pred)
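
To make the three metrics less of a black box, the sketch below calculates them by hand on four invented actual and predicted values and checks the results against scikit-learn. The numbers and variable names are illustrative only.

```python
import numpy as np
from sklearn import metrics

# hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat

mse = np.mean(errors ** 2)                      # mean squared error
mae = np.mean(np.abs(errors))                   # mean absolute error
ss_res = np.sum(errors ** 2)                    # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation in the actual values
r2 = 1 - ss_res / ss_tot                        # R^2: share of variation explained

print(mse, mae, r2)

# the hand calculations agree with scikit-learn's functions
print(np.isclose(mse, metrics.mean_squared_error(y_true, y_hat)))
print(np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2, metrics.r2_score(y_true, y_hat)))
```

Lower values are better for the two error measures, while an R^2 closer to 1 means the predictions account for more of the variation in the actual values.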
