# Ridge Regression with scikit-learn

This notebook fits and evaluates a ridge regression model using scikit-learn.

* Method: Ridge Regression
* Dataset: Big Mart dataset

## Imports

In [None]:
import numpy as np
import pandas as pd

from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

import matplotlib.pyplot as plt
%matplotlib inline

## Load the Data

In [None]:
data = pd.read_csv('/Users/robert.dempsey/Dev/daamlobd/data/bigmart/big_mart_train.csv')

In [None]:
data.dtypes

In [None]:
data.head()

In [None]:
data.describe(include='all')

## Data Preprocessing

In [None]:
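# Handle missing values and derive the outlet age
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())
# A visibility of 0 is treated as missing and replaced with the column mean
data['Item_Visibility'] = data['Item_Visibility'].replace(0, data['Item_Visibility'].mean())
# Convert the establishment year into the outlet's age as of 2013
data['Outlet_Establishment_Year'] = 2013 - data['Outlet_Establishment_Year']
data['Outlet_Size'] = data['Outlet_Size'].fillna('Small')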
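
To make the effect of these steps concrete, here is a minimal sketch on a three-row toy frame. The values are made up for illustration; only the column names mirror the Big Mart columns used above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not taken from the Big Mart data).
toy = pd.DataFrame({
    "Item_Weight": [9.0, np.nan, 12.0],
    "Item_Visibility": [0.02, 0.0, 0.04],
    "Outlet_Size": ["Medium", None, "High"],
})

# Mean imputation: the NaN weight becomes the mean of the observed weights.
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].mean())

# A visibility of exactly 0 is treated as missing and replaced by the column
# mean (note that this mean is computed with the zeros still included).
toy["Item_Visibility"] = toy["Item_Visibility"].replace(0, toy["Item_Visibility"].mean())

# Missing outlet sizes get a fixed category.
toy["Outlet_Size"] = toy["Outlet_Size"].fillna("Small")

print(toy)
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a selected column, avoids the chained-assignment warnings that recent pandas versions raise.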

In [None]:
# Create dummy variables to convert categorical data into numeric values
object_cols = list(data.select_dtypes(include=['object']).columns)
dummies = pd.get_dummies(data[object_cols], prefix=object_cols)
data.drop(object_cols, axis=1, inplace=True)
X = pd.concat([data, dummies], axis=1)
X.head()

In [None]:
X.describe()
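
As a small illustration of what the dummy-variable step produces, the sketch below applies the same calls to a toy frame. The values are invented; `Item_MRP` and `Outlet_Size` are used only as example column names.

```python
import pandas as pd

# Toy frame with one categorical (object) column and one numeric column.
toy = pd.DataFrame({
    "Outlet_Size": pd.Series(["Small", "Medium", "Small"], dtype="object"),
    "Item_MRP": [100.0, 150.0, 120.0],
})

# Same recipe as above: find object columns, one-hot encode them,
# then replace the originals with the indicator columns.
object_cols = list(toy.select_dtypes(include=["object"]).columns)
dummies = pd.get_dummies(toy[object_cols], prefix=object_cols)
encoded = pd.concat([toy.drop(columns=object_cols), dummies], axis=1)

print(list(encoded.columns))
print(encoded)
```

Each category becomes its own indicator column named `<column>_<category>`, so a column with k distinct values expands into k columns.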

## Fit a Ridge Regression Model

In [None]:
# Split into training (70%) and test (30%) sets
X = X.drop('Item_Outlet_Sales', axis=1)
X_train, X_test, Y_train, Y_test = \
    train_test_split(X, data.Item_Outlet_Sales, test_size=0.3, random_state=42)

In [None]:
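# Create an instance of a Ridge Regression model
# Note: the `normalize` parameter was removed in scikit-learn 1.2,
# so this call requires an older release.
model = Ridge(alpha=0.05, normalize=True)
model.fit(X_train, Y_train)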
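
The `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2. On newer releases, one common substitute is to scale the features in a pipeline before the ridge step. The sketch below shows that pattern on synthetic data; `X_demo`, `y_demo`, and the `alpha` value are placeholders, and because `StandardScaler` scales differently than `normalize=True` did, the `alpha` values are not interchangeable between the two approaches.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / Y_train, with features on very different scales.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([2.0, 0.03, 50.0]) + rng.normal(scale=0.1, size=200)

# Standardize the features, then fit ridge. alpha=1.0 is illustrative only and
# is NOT equivalent to alpha=0.05 with normalize=True.
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X_demo, y_demo)

print("R^2 on the synthetic data:", scaled_ridge.score(X_demo, y_demo))
```

Keeping the scaler inside the pipeline means it is fit on the training data only and reused unchanged when predicting on the test set.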

**Intercept**: the predicted value of the response variable when every predictor equals zero. Each feature **coefficient**, by contrast, represents the mean change in the response variable for a one-unit change in its predictor while holding everything else constant. It isolates the role of one variable from all others.

In [None]:
# Print the intercept coefficient
print('Estimated intercept coefficient: {}'.format(model.intercept_))

In [None]:
# Create a dataframe with the features and coefficients
fc_df = pd.DataFrame(list(zip(X.columns, model.coef_)), columns=['features', 'coefficients'])
fc_df.head()
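
The roles of `intercept_` and `coef_` can be checked numerically. The following is a minimal sketch on synthetic data (`X_demo`, `y_demo`, and `demo_model` are hypothetical names, not part of the notebook): the prediction at an all-zero input equals the intercept, and moving one feature by one unit shifts the prediction by that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: y = 5 + 3*x0 - 2*x1 + small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 5.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

demo_model = Ridge(alpha=0.05).fit(X_demo, y_demo)

# The prediction when every feature is zero equals the intercept.
at_zero = demo_model.predict(np.zeros((1, 2)))[0]

# Raising feature 0 by one unit (feature 1 held fixed) shifts the
# prediction by exactly coef_[0].
step = demo_model.predict(np.array([[1.0, 0.0]]))[0] - at_zero

print("prediction at zero:", at_zero, "intercept:", demo_model.intercept_)
print("one-unit step:", step, "coef_[0]:", demo_model.coef_[0])
```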

## Predict Sales

In [None]:
y_pred = model.predict(X_test)

In [None]:
# Create a plot to compare actual sales (Y_test) and predicted sales (y_pred)
fig = plt.figure(figsize=(20,10))
plt.scatter(Y_test, y_pred)
plt.xlabel("Actual Sales: $Y_i$")
# Raw strings keep the LaTeX backslashes from being read as escape sequences
plt.ylabel(r"Predicted Sales: $\hat{Y}_i$")
plt.title(r"Actual vs. Predicted Sales: $Y_i$ vs. $\hat{Y}_i$")
plt.show()

## Model Evaluation

### Mean Squared Error

In [None]:
# Get the Mean Squared Error (MSE) for the training data
mse = mean_squared_error(Y_train, model.predict(X_train))
print("MSE Training Data: {}".format(mse))

In [None]:
# Get the MSE for the test data
print("MSE Test Data: {}".format(mean_squared_error(Y_test, model.predict(X_test))))
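
For reference, MSE is the average of the squared differences between actual and predicted values. The sketch below computes it by hand on four made-up numbers and compares the result with `mean_squared_error`; the arrays are illustrative and unrelated to the Big Mart data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# Squared errors are 0.25, 0, 2.25, 1, so the mean is 3.5 / 4 = 0.875.
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

# The square root (RMSE) is in the same units as the target.
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn, rmse)
```

A training MSE that is much lower than the test MSE is a typical sign of overfitting, which is why the notebook reports both.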

### Variance (R^2) Score

* R^2 is the proportion of the variance in the target (here, `Item_Outlet_Sales`) that the model's predictions explain; it summarizes how well the model is predicting.
* A score of 1 means a perfect prediction.
* A score of 0 means the model does no better than always predicting the mean of y, disregarding the input features. The score can be negative when the model does worse than that baseline.
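
The score can be reproduced from its definition, R^2 = 1 - SS_res / SS_tot. The sketch below uses four made-up values (not the Big Mart data) to compare a hand computation with `r2_score` and to confirm the reference points for a perfect prediction and a constant mean prediction.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# Residual sum of squares versus total sum of squares around the mean.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Reference points: a perfect prediction and a constant mean prediction.
r2_perfect = r2_score(y_true, y_true)
r2_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# A prediction worse than the mean baseline yields a negative score.
r2_bad = r2_score(y_true, y_true[::-1])

print(r2_manual, r2_score(y_true, y_hat), r2_perfect, r2_baseline, r2_bad)
```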

In [None]:
print("Variance Score: %.2f" % r2_score(Y_test, y_pred))

## Residual Plot

**Residuals**: the differences between the predictions and the actual values (computed below as prediction minus actual).

**Interpretation**: If the model is working well, the residuals should be randomly scattered around the zero line. Structure in the residuals means the model is not capturing something, perhaps an interaction between two variables or a dependence on time. If you see such structure, revisit the model's features and parameters.

In [None]:
# Create a residual plot (prediction minus actual) against the predicted values
fig = plt.figure(figsize=(20,10))
plt.scatter(model.predict(X_train), model.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)
plt.scatter(model.predict(X_test), model.predict(X_test) - Y_test, c='g', s=40)
# Draw the zero line across the full width of the axes
plt.axhline(y=0)
plt.xlabel("Predicted Sales")
plt.ylabel("Residuals")
plt.title("Residual Plot Using Training (Blue) and Test (Green) Data")
plt.show()

**Interpretation**

The funnel shape indicates heteroskedasticity: the variance of the error terms (residuals) is not constant. Non-constant variance often arises in the presence of outliers or extreme leverage values, which get too much weight and thereby disproportionately influence the model's performance.

It may also be a sign of nonlinearity in the data that the model has not captured.
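
A funnel can also be checked numerically by comparing the spread of the residuals at low and high predicted values. The sketch below is a rough diagnostic on synthetic data, not a formal test; `pred` and `actual` are hypothetical stand-ins for `model.predict(X_train)` and `Y_train`, and the noise is deliberately generated with a spread that grows with the prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic predictions and actuals; the noise standard deviation is
# proportional to the prediction, which produces a funnel by construction.
pred = rng.uniform(100.0, 4000.0, size=2000)
actual = pred + rng.normal(scale=0.3 * pred)

# Residuals as in the plot above: prediction minus actual.
resid = pred - actual

# Sort by predicted value and compare the residual spread in each half.
order = np.argsort(pred)
low_half, high_half = np.array_split(resid[order], 2)

print("residual std, low predictions: ", low_half.std())
print("residual std, high predictions:", high_half.std())
```

With roughly constant variance the two standard deviations would be close; a much larger value in the upper half matches the funnel seen in the plot.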

## Different Alpha

### Fit a New Model

In [None]:
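# Same model with a larger regularization strength (alpha=0.5 instead of 0.05)
# Note: the `normalize` parameter was removed in scikit-learn 1.2,
# so this call requires an older release.
model_2 = Ridge(alpha=0.5, normalize=True)
model_2.fit(X_train, Y_train)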

In [None]:
print('Estimated intercept coefficient: {}'.format(model_2.intercept_))
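
Raising `alpha` strengthens the L2 penalty, which shrinks the coefficients toward zero. The sketch below illustrates this on synthetic data by fitting a plain `Ridge` at several `alpha` values and printing the size of the coefficient vector; the data and the list of alphas are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with five features.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(150, 5))
true_coef = np.array([4.0, -3.0, 2.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=150)

# Fit one model per alpha and record the L2 norm of its coefficients.
alphas = [0.05, 0.5, 5.0, 50.0]
norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    norms.append(float(np.linalg.norm(coef)))
    print(f"alpha={alpha:<5} coefficient L2 norm = {norms[-1]:.4f}")
```

The norm decreases as `alpha` grows. Whether the extra shrinkage helps on held-out data is what the evaluation cells that follow are meant to show.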

### Predict Sales

In [None]:
y2_pred = model_2.predict(X_test)

In [None]:
# Create a plot to compare actual sales (Y_test) and predicted sales (y2_pred)
fig = plt.figure(figsize=(20,10))
plt.scatter(Y_test, y2_pred)
plt.xlabel("Actual Sales: $Y_i$")
# Raw strings keep the LaTeX backslashes from being read as escape sequences
plt.ylabel(r"Predicted Sales: $\hat{Y}_i$")
plt.title(r"Actual vs. Predicted Sales: $Y_i$ vs. $\hat{Y}_i$")
plt.show()

## Model Evaluation

### Mean Squared Error

In [None]:
# Get the Mean Squared Error (MSE) for the training data
mse = mean_squared_error(Y_train, model_2.predict(X_train))
print("MSE Training Data: {}".format(mse))

In [None]:
# Get the MSE for the test data
print("MSE Test Data: {}".format(mean_squared_error(Y_test, model_2.predict(X_test))))

### Variance (R^2) Score

* R^2 is the proportion of the variance in the target (here, `Item_Outlet_Sales`) that the model's predictions explain; it summarizes how well the model is predicting.
* A score of 1 means a perfect prediction.
* A score of 0 means the model does no better than always predicting the mean of y, disregarding the input features. The score can be negative when the model does worse than that baseline.

In [None]:
print("Variance Score: %.2f" % r2_score(Y_test, y2_pred))

## Residual Plot

**Residuals**: the differences between the predictions and the actual values (computed below as prediction minus actual).

**Interpretation**: If the model is working well, the residuals should be randomly scattered around the zero line. Structure in the residuals means the model is not capturing something, perhaps an interaction between two variables or a dependence on time. If you see such structure, revisit the model's features and parameters.

In [None]:
# Create a residual plot (prediction minus actual) against the predicted values
fig = plt.figure(figsize=(20,10))
plt.scatter(model_2.predict(X_train), model_2.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)
plt.scatter(model_2.predict(X_test), model_2.predict(X_test) - Y_test, c='g', s=40)
# Draw the zero line across the full width of the axes
plt.axhline(y=0)
plt.xlabel("Predicted Sales")
plt.ylabel("Residuals")
plt.title("Residual Plot Using Training (Blue) and Test (Green) Data")
plt.show()