# Machine Learning with Linear Regression

In this project, we have a (fake) dataset from a (fake) Ecommerce company that sells clothing online but also has in-store style and clothing advice sessions. After their sessions, customers can go home and make purchases through either a mobile app or the store's website.

The company wants to decide whether to focus their efforts on their mobile app experience or their website, depending on which one of them has the greater impact.

Let's try to answer their question.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Data
The dataset contains customer info, such as Email, Address, and their Avatar. But as this is a regression project, we'll deal with the numerical features we have:

- Avg. Session Length: Average length of in-store style advice sessions.
- Time on App: Average time spent on the app, in minutes.
- Time on Website: Average time spent on the website, in minutes.
- Length of Membership: How many years the customer has been a member.

In [3]:
customers = pd.read_csv('../data/Ecommerce Customers.csv')

In [4]:
customers.head()

| Unnamed: 0 | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | mstephenson@fernandez.com | 835 Frank Tunnel\r\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
| 1 | hduke@hotmail.com | 4547 Archer Common\r\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
| 2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\r\nCobbborough,... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
| 3 | riverarebecca@gmail.com | 1414 David Throughway\r\nPort Jason, OH 22070-... | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
| 4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\r\nPort Jacobville, PR... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |


In [None]:
customers.describe()

In [None]:
customers.info()

## Exploratory Analysis
Before we begin fitting a linear regression model on the data, let's try and eyeball it first.

Visualising the relationship between time spent on Website and yearly spend.

In [None]:
sns.jointplot(x=customers['Time on Website'],y=customers['Yearly Amount Spent'])

Visualising the relationship between time spent on app, and yearly spend.

In [None]:
sns.jointplot(x=customers['Time on App'],y=customers['Yearly Amount Spent'])

Just from the above two visuals, time spent on the app appears to be more strongly correlated with yearly spend than time spent on the website. This is only a visual impression at this stage.

Let's visualise the relationship between the different variables using a seaborn pairplot.

In [None]:
sns.pairplot(customers)

In [None]:
# Restrict to numeric columns; Email, Address, and Avatar are strings.
customers.corr(numeric_only=True)

It looks like the length of membership is the feature that's the most (positively) correlated with yearly amount spent. This makes sense, as loyal customers are inclined to spend more.
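To make this kind of comparison concrete, here is a minimal sketch of ranking features by their Pearson correlation with the target. It uses synthetic data generated with NumPy, not the customer dataset; the generating weights and noise level are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the real customer dataset).
rng = np.random.default_rng(0)
n = 500
membership = rng.normal(3.5, 1.0, n)
app = rng.normal(12.0, 1.0, n)
website = rng.normal(37.0, 1.0, n)

# Assumed relationship: spend depends on membership and app time only.
spend = 62 * membership + 39 * app + rng.normal(0.0, 10.0, n)

features = {
    "Length of Membership": membership,
    "Time on App": app,
    "Time on Website": website,
}

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
correlations = {
    name: np.corrcoef(values, spend)[0, 1] for name, values in features.items()
}

# Print features from most to least positively correlated with spend.
for name, r in sorted(correlations.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {r:.2f}")
```

Correlation only measures the strength of a pairwise linear relationship, which is why the regression coefficients later in the notebook are needed to compare features while holding the others fixed.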

We can use seaborn to fit and plot a linear trend for this relationship.

In [None]:
sns.lmplot(x='Length of Membership',y='Yearly Amount Spent',data=customers)

## Splitting the Data
We're going to split the data between training and test sets, in a 70:30 ratio.

In [None]:
customers.columns

In [None]:
#Selecting only the numerical features for training the model.
X = customers[['Avg. Session Length', 'Time on App','Time on Website', 'Length of Membership']]
y = customers['Yearly Amount Spent']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# train_test_split returns the train portion before the test portion for each input.
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=101)
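As a quick sanity check, the sketch below runs the same call on toy arrays and confirms the sizes of the two sets. `train_test_split` returns the training piece before the test piece for each input, so the unpacking order matters. The toy arrays and the `_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 rows, 4 features.
# Row i of X_demo starts with the value 4 * i, and its target is i.
X_demo = np.arange(400).reshape(100, 4)
y_demo = np.arange(100)

# For each array passed in, the function returns (train, test) in that order.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# The training set should hold roughly 70% of the rows, the test set the rest.
print(len(X_tr), len(X_te))
```

Swapping the unpacked names would silently train the model on the smaller portion and evaluate it on the larger one, so a size check like this is cheap insurance.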

## Training the Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

In [None]:
lm.fit(X_train,y_train)
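For intuition about what `fit` estimates: ordinary least squares chooses the intercept and coefficients that minimise the sum of squared residuals. The sketch below solves the same kind of problem with `np.linalg.lstsq` on synthetic data. The true coefficients and noise level are assumptions, and this is not a claim about how scikit-learn implements its solver internally.

```python
import numpy as np

# Synthetic data with known (assumed) coefficients and a small amount of noise.
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(200, 4))
true_coef = np.array([2.0, -1.0, 0.5, 3.0])
true_intercept = 10.0
y_syn = X_syn @ true_coef + true_intercept + rng.normal(scale=0.01, size=200)

# Prepend a column of ones so the first fitted weight acts as the intercept.
design = np.column_stack([np.ones(len(X_syn)), X_syn])

# Least-squares solution of design @ weights ~= y_syn.
weights, *_ = np.linalg.lstsq(design, y_syn, rcond=None)
intercept_hat, coef_hat = weights[0], weights[1:]

print("intercept:", intercept_hat)
print("coefficients:", coef_hat)
```

With this little noise, the recovered values land very close to the ones used to generate the data.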

## Making Predictions

In [None]:
predictions = lm.predict(X_test) 

To visualise the predictions, let's create a scatterplot of the real test values against the predicted values.

In [None]:
plt.scatter(y_test,predictions)

Nice, it looks like our model performs fairly well: the predicted and real values fall close to a straight line without much scatter.

## Evaluation and Understanding Results
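A scatterplot is only a visual check, though. The standard way to evaluate a regression model is with error metrics, so let's calculate the mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).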

In [None]:
from sklearn import metrics

In [None]:
print('MAE:',metrics.mean_absolute_error(y_test,predictions))
print('MSE:',metrics.mean_squared_error(y_test,predictions))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,predictions)))
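To show what these three metrics measure, here is a small worked example computed by hand with NumPy. The three values in `y_true` and `y_pred` are made up for illustration and have nothing to do with the customer data.

```python
import numpy as np

# Made-up values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred          # [1, 0, -2]

mae = np.mean(np.abs(errors))     # (1 + 0 + 2) / 3 = 1.0
mse = np.mean(errors ** 2)        # (1 + 0 + 4) / 3, about 1.667
rmse = np.sqrt(mse)               # about 1.291

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

MSE penalises large errors more heavily than MAE because the errors are squared, and RMSE brings the result back into the units of the target, which here would be dollars of yearly spend.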

Let's try to interpret the coefficients for the variables.

In [None]:
cust_coeff = pd.DataFrame(lm.coef_,X.columns)
cust_coeff.columns = ['Coefficient']
cust_coeff

The coefficients mean that, holding all other features fixed,

- A 1-unit increase in Avg. Session Length is associated with an approximate \$25 increase in yearly spend.
- A 1-minute increase in Time on App is associated with an approximate \$39 increase in yearly spend.
- A 1-minute increase in Time on Website is associated with an approximate \$0.77 increase in yearly spend.
- A 1-year increase in Length of Membership is associated with an approximate \$62 increase in yearly spend.
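The "all other features stay fixed" reading can be checked directly: in a linear model, raising one feature by one unit changes the prediction by exactly that feature's coefficient. The sketch below uses the approximate coefficients quoted above. The intercept and the example customer's feature values are hypothetical, chosen only so the arithmetic can run.

```python
import numpy as np

# Approximate coefficients quoted in the text, in the order:
# Avg. Session Length, Time on App, Time on Website, Length of Membership.
coef = np.array([25.0, 39.0, 0.77, 62.0])
intercept = -1000.0  # hypothetical value, for illustration only


def predict(features):
    """Linear model prediction: intercept plus the weighted sum of features."""
    return intercept + features @ coef


# Hypothetical customer (illustrative feature values, not a real record).
customer = np.array([34.5, 12.7, 39.6, 4.1])

# Add one minute to Time on App, then separately to Time on Website.
more_app = customer + np.array([0.0, 1.0, 0.0, 0.0])
more_web = customer + np.array([0.0, 0.0, 1.0, 0.0])

app_effect = predict(more_app) - predict(customer)
web_effect = predict(more_web) - predict(customer)

print("Extra minute on app:", app_effect)
print("Extra minute on website:", web_effect)
```

The intercept cancels out in each difference, so the result does not depend on the hypothetical value chosen for it.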

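In this model, an extra minute on the app is associated with far more yearly spend than an extra minute on the website (roughly 39 versus 0.77 dollars), so the app currently looks like the stronger driver of revenue. This concludes our project!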