# Multiple Linear Regression with Mixed Categorical and Numeric Inputs, Plus Regularisation
## Newspaper Sales Prediction Example
This notebook uses multiple linear regression to predict newspaper sales from Advert Spend, Price, Front Page Story, Offered Prize Value and whether or not it was Wet that day.
  
The front page story is a categorical variable, so it needs to be one-hot encoded. No hyper-parameter searching is included, so no validation data are used.

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

## Load the Data
You should download the file called `Newspaper_Num_Cat.csv` from the course website and put it in a folder accessible to this notebook. In the code below, we assume it is in the same folder as this notebook. Change the code if it is somewhere else.
  
`genfromtxt` reads the data into a structured array, in which each column has its own name and data type. You can read about structured arrays here: https://numpy.org/doc/stable/user/basics.rec.html
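As a rough illustration of what a structured array looks like, the sketch below reads a tiny made-up CSV from a string instead of from the course file. The column names mirror some of those in `Newspaper_Num_Cat.csv`, but the rows and the `csv_text` and `demo` names are invented for this example.

```python
import io
import numpy as np

# A tiny made-up CSV standing in for the real file (values are illustrative only)
csv_text = "Price,FP_Story,Sales\n60,Sport,50611\n55,Politics,48200\n"

demo = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

print(demo.dtype.names)  # field names taken from the header row
print(demo['Price'])     # select a whole column by its name
print(demo[0])           # one row: a record holding mixed types
```

Columns are selected by name rather than by position, which is why the numeric columns have to be converted to an ordinary 2-d array before they can be passed to scikit-learn.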

In [2]:
data = np.genfromtxt("Newspaper_Num_Cat.csv", delimiter=',', names=True,
                     dtype=None, encoding='utf-8-sig')
print(data.dtype)

[('Advert_Spend', '<i8'), ('Price', '<i8'), ('FP_Story', '<U8'), ('Prize_Value', '<i8'), ('Wet', '<i8'), ('Sales', '<i8')]


## One-Hot Encode the Front Page Story Variable

The code below extracts the numeric variables and converts them into an unstructured ndarray (as opposed to the structured one we got when we loaded the data with genfromtxt). It then inserts a one-hot-encoded representation of Front Page Story where the categorical version of that used to be.

In [3]:
from sklearn.preprocessing import OneHotEncoder
from numpy.lib import recfunctions
# Needed for the function to convert from structured to unstructured ndarray

# OneHotEncoder takes the place of get_dummies from previous examples
enc = OneHotEncoder()
# Reshape from a 1-d array ['a','b','c'] to the 2-d array the encoder needs.
# The 1 means we want one column; the -1 tells numpy to work out the number
# of rows from the input data (enough rows to use up all the data).
fp = data['FP_Story'].reshape(-1, 1)
enc.fit(fp) # fit the encoder (work out which classes there are, and so which new features need to be created)
codedfp = enc.transform(fp).toarray() # transform the data: make the new features that replace the original

# Now extract the numeric columns into an unstructured ndarray
ndata = data[['Advert_Spend','Price', 'Prize_Value', 'Wet', 'Sales']] # get the named columns
ndata = recfunctions.structured_to_unstructured(ndata) # ndata is a structured array to start with; this converts it into an ordinary 2-d array

# Now we insert the one hot encoded variables into the original data
ndata = np.insert(ndata, [2], codedfp, axis=1) # insert the one-hot columns before column 2, where FP_Story sat in the file
print(ndata[0])

[ 1757    60     0     1     0     0     0    30     1 50611]
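To make the encoding step easier to follow, here is a minimal sketch on four made-up story labels (the labels and the `stories`, `enc_demo` and `coded` names are invented for illustration and are not taken from the data file). It shows that the encoder creates one column per category, ordered alphabetically, with a single 1 in each row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up story categories, for illustration only
stories = np.array(['Sport', 'Politics', 'Sport', 'Weather']).reshape(-1, 1)

enc_demo = OneHotEncoder()
coded = enc_demo.fit_transform(stories).toarray()  # fit and transform in one step

print(enc_demo.categories_[0])  # the categories found, one per new column
print(coded)                    # each row has a 1 in the column for its category
```

In the row printed by the notebook above, the five values starting at position 2 are the one-hot columns for Front Page Story.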


## Extract the Inputs and Outputs
The target output variable, `Sales`, is the last column in the file, so we put that into a variable called `y` and the other (input) columns into `X`.

In [4]:
# now we have a 2-d array with all the data; split out the features and target
cols = ndata.shape[1] # number of columns is needed to work out where the last one is
X = ndata[:,0:cols-1]
y = ndata[:,cols-1]
ndata[0]

array([ 1757,    60,     0,     1,     0,     0,     0,    30,     1,
       50611])

## Split Off 30% for Testing

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
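Because the split is random, the scores shown in this notebook will change from run to run. If you want repeatable results, one option is to pass a fixed `random_state`. The sketch below uses small made-up arrays (`X_demo`, `y_demo`) and an arbitrary seed of 42 to show that two calls with the same seed give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows, purely to demonstrate the split
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state produce the same split
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42)

print(Xa_train.shape, Xa_test.shape)     # 7 training rows and 3 test rows
print(np.array_equal(ya_test, yb_test))  # the two test sets match
```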

## Build the Regression Model
We fit the regression model next and print the R-squared value from the training data.

In [6]:
reg = LinearRegression().fit(X_train, y_train)
print(reg.score(X_train, y_train))

0.9881071616276322


## Finally, Predict on the Test Data
We predict the values for the test data and calculate the mean absolute error for that data. Try other metrics in the second line.

In [7]:
preds = reg.predict(X_test)
test_MAE = metrics.mean_absolute_error(y_test, preds)
print("Mean Absolute Error on test =",test_MAE)

Mean Absolute Error on test = 736.3282739215695
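If you want to try other metrics, the sketch below computes three common ones on a few made-up numbers (`y_true` and `y_pred` are invented here so the arithmetic is easy to check by hand). In the notebook you would pass `y_test` and `preds` instead.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions, chosen so the results are easy to verify
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 400.0])

mae = metrics.mean_absolute_error(y_true, y_pred)           # mean of |error|
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))  # root of mean squared error
r2 = metrics.r2_score(y_true, y_pred)                       # proportion of variance explained

print("MAE =", mae)
print("RMSE =", rmse)
print("R-squared =", r2)
```

MAE and RMSE are in the same units as the target (here, sales), whereas R-squared is unitless, which makes it easier to compare across data sets.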


## Now We Add Regularisation and Train Two Models Using Cross Validation

In [8]:
# Import cross validation, ridge regression and lasso
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

## First, Ridge Regression

In [9]:
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, y_train)

In [10]:
cross_val_score(ridge_reg, X_train, y_train, cv=5)

array([0.98664521, 0.98423445, 0.98472636, 0.99041139, 0.98431089])

## Now, Lasso

In [11]:
lasso_reg = Lasso(alpha=1.0)
lasso_reg.fit(X_train, y_train)

In [12]:
cross_val_score(lasso_reg, X_train, y_train, cv=5)

array([0.98670138, 0.98453562, 0.98461011, 0.99073966, 0.98442077])
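The five numbers returned by `cross_val_score` are the scores for each fold; for a regressor the default score is R-squared. Comparing models is easier with a summary such as the mean and standard deviation. The sketch below does this on synthetic data from `make_regression` (not the newspaper data), so the numbers it prints will not match those above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic regression data, standing in for X_train and y_train
X_syn, y_syn = make_regression(n_samples=200, n_features=5,
                               noise=10.0, random_state=0)

summary = {}
for name, model in [("Ridge", Ridge(alpha=1.0)), ("Lasso", Lasso(alpha=1.0))]:
    scores = cross_val_score(model, X_syn, y_syn, cv=5)  # one R-squared per fold
    summary[name] = (scores.mean(), scores.std())
    print(name, "mean R-squared =", scores.mean(), "std =", scores.std())
```

Note that `cross_val_score` fits fresh copies of the model on each fold, so it does not need the model to have been fitted beforehand.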

## We can try a number of different regularisation levels in a loop
`alpha` is the amount of regularisation to apply

In [13]:
for alpha in [1, 10, 500, 1000]:
    lasso_reg = Lasso(alpha=alpha)
    lasso_reg.fit(X_train, y_train)
    print("Alpha = ", alpha, "CV Scores = ",cross_val_score(lasso_reg, X_train, y_train, cv=5))

Alpha =  1 CV Scores =  [0.98670138 0.98453562 0.98461011 0.99073966 0.98442077]
Alpha =  10 CV Scores =  [0.98672919 0.98443755 0.984681   0.99057447 0.9843872 ]
Alpha =  500 CV Scores =  [0.92676761 0.88349333 0.91908824 0.91812085 0.88640942]
Alpha =  1000 CV Scores =  [0.86075881 0.78033027 0.83689482 0.83818316 0.73316832]


Try a few more values. See what happens if you set the regularisation strength to a very large value.
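One way to see why the scores fall as `alpha` grows is to count how many coefficients Lasso leaves non-zero. The sketch below does this on synthetic data from `make_regression` (an assumption made for illustration; the alpha values are also arbitrary). With a large enough `alpha`, every coefficient is shrunk to zero and the model just predicts a constant.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression data, standing in for X_train and y_train
X_syn, y_syn = make_regression(n_samples=200, n_features=5,
                               noise=10.0, random_state=0)

nonzero = {}
for alpha in [1, 10, 100, 1e6]:
    model = Lasso(alpha=alpha).fit(X_syn, y_syn)
    nonzero[alpha] = int(np.count_nonzero(model.coef_))  # coefficients not shrunk to zero
    print("Alpha =", alpha, "non-zero coefficients =", nonzero[alpha])
```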

## Quantile Regression
One thing we haven't covered is quantile regression. It's as simple as dropping in a `QuantileRegressor` in place of `LinearRegression`. This also lets you set the quantile to predict (the default is 0.5). See here for details: https://scikit-learn.org/stable/auto_examples/linear_model/plot_quantile_regression.html