# Building a regression model for an e-commerce dataset

## Importing Libraries

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
import seaborn as sns
pd.options.display.float_format = '{:.5f}'.format
import warnings
import math
import scipy.stats as stats
import scipy
from sklearn.preprocessing import scale
warnings.filterwarnings('ignore')

## Loading Data

In [None]:
df = pd.read_csv("Ecom_Customers.csv")
df.head()

## EDA

In [None]:
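# numeric_only=True restricts the correlation to numeric columns (pandas 2.x raises an error on text columns otherwise)
df_kor = df.corr(numeric_only=True)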
plt.figure(figsize=(10,10))
sns.heatmap(df_kor, vmin=-1, vmax=1, cmap="viridis", annot=True, linewidth=0.1)

In [None]:
sns.pairplot(df)

## Checking Missing Data

In [None]:
df.isnull().sum()

In [None]:
df.dropna(inplace=True)

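# Alternative to dropping rows: fill missing values with the column mean.
# Run this instead of dropna() above; after dropna() there is nothing left to fill.
df["Time on App"] = df["Time on App"].fillna(df["Time on App"].mean())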
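
To see the difference between the two options, here is a minimal sketch on a small made-up DataFrame. The values are invented for illustration and do not come from Ecom_Customers.csv.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Time on App": [10.0, 12.0, np.nan, 14.0],
    "Yearly Amount Spent": [400.0, 500.0, 450.0, 600.0],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: keep all rows and fill the gap with the column mean
filled = toy.copy()
filled["Time on App"] = filled["Time on App"].fillna(filled["Time on App"].mean())

print("rows after dropna:", len(dropped))
print("rows after fillna:", len(filled))
print("value used to fill:", filled.loc[2, "Time on App"])
```

Dropping is the simpler choice when only a few rows are affected; filling keeps every row at the cost of inserting an estimated value.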

## Building a linear regression model

In [None]:
Y=df["Yearly Amount Spent"]

X=df[[ "Length of Membership", "Time on App", "Time on Website", 'Avg. Session Length']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

print('Training Data Count: {}'.format(X_train.shape[0]))
print('Testing Data Count: {}'.format(X_test.shape[0]))

In [None]:
X_train = sm.add_constant(X_train)
results = sm.OLS(y_train, X_train).fit()
results.summary()

## Understanding the outputs of the model: Is this statistically significant?

**So what do all those numbers actually mean?**

Before continuing, I will explain a few basic statistical terms, because I will decide whether my model is good enough by looking at those numbers.

**What is the p-value?**
> The p-value (probability value) measures statistical significance. Let’s say you claim that the average CTR of your brand keywords is above 70%, you test that against the null hypothesis that it is 70% or less, and you get a p-value of 0.02. This means that if the null hypothesis were true, data at least as extreme as yours would occur with a probability of at most 2%. Is it statistically significant? 0.05 is generally used as the cutoff (95% confidence level), so if your p-value is smaller than 0.05, yes! It is significant, and you can reject the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis.
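
The CTR example can be made concrete with a small sketch. The numbers (79 clicks out of 100 impressions) and the helper name `binomial_upper_tail` are made up for illustration; the test is an exact one-sided binomial test against a null CTR of 70%.

```python
from math import comb

def binomial_upper_tail(clicks, impressions, null_ctr):
    """P(X >= clicks) when X ~ Binomial(impressions, null_ctr)."""
    return sum(
        comb(impressions, k) * null_ctr**k * (1 - null_ctr) ** (impressions - k)
        for k in range(clicks, impressions + 1)
    )

# Hypothetical observation: 79 clicks out of 100 impressions
p_value = binomial_upper_tail(clicks=79, impressions=100, null_ctr=0.70)

print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant at the 0.05 level: reject the null hypothesis")
else:
    print("Not significant at the 0.05 level")
```

The p-values in the regression summary are read the same way, except that the null hypothesis there is that a variable's coefficient is zero.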

Now let’s look at the summary table. Each of my 4 variables has a p-value that shows whether its relationship with Yearly Amount Spent is statistically significant. As you can see, Time on Website is not statistically significant because its p-value is 0.180, which is above 0.05. So it will be better to drop it.

**What is R squared and Adjusted R squared?**
> R squared is a simple but powerful metric that shows how much of the variance in the target is explained by the model. It takes into account all the variables you defined in X and gives the share of variance they explain together, so it is a rough measure of your model’s explanatory power.

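**Adjusted R squared** is similar to R squared, but it penalizes the model for every additional variable, so it only increases when a new variable improves the fit more than would be expected by chance. That is why it is better to look at adjusted R squared when comparing models with different numbers of variables.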

In my model, 98.4% of the variance can be explained, which is really high. 
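
As a rough sketch of how both numbers are computed, the functions below use the standard formulas $R^2 = 1 - SS_{res}/SS_{tot}$ and $R^2_{adj} = 1 - (1 - R^2)(n - 1)/(n - p - 1)$, where $n$ is the number of samples and $p$ the number of variables (not counting the constant). The toy values are invented for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R squared for the number of variables (constant not counted)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Toy values, invented for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=4, n_features=1)
print(f"R squared: {r2:.4f}, adjusted R squared: {adj_r2:.4f}")
```

With many samples and few variables the two values are almost identical; the gap grows as variables are added without improving the fit.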

**What is Coef?**
They are the coefficients of the variables, which together with the constant give us the equation of the model.
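
To illustrate how coefficients turn into an equation, here is a minimal sketch on synthetic data generated from a known equation. It uses NumPy's least squares solver rather than statsmodels, and the variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data built from a known equation: y = 5 + 2*x1 + 3*x2
X = rng.normal(size=(50, 2))
y = 5.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Add a column of ones for the constant, similar to sm.add_constant
X_const = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: coefficients that minimize the squared error
coefs, *_ = np.linalg.lstsq(X_const, y, rcond=None)
const, b1, b2 = coefs
print(f"y = {const:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")

# A prediction is just the equation applied to new values
new_point = np.array([1.0, 0.5, -1.0])  # [constant, x1, x2]
print("prediction:", new_point @ coefs)
```

In the statsmodels summary, the `coef` column plays the role of `coefs` here: each value is the expected change in the target for a one-unit change in that variable, holding the others fixed.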

So is it over? No! I still have the Time on Website variable in my model, and it is statistically insignificant.

Now I will build another model without the Time on Website variable:

In [None]:
X2=df[["Length of Membership", "Time on App", 'Avg. Session Length']]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, Y, test_size = 0.2, random_state = 465)

print('Training Data Count:', X2_train.shape[0])
print('Testing Data Count:', X2_test.shape[0])

In [None]:
X2_train = sm.add_constant(X2_train)

results2 = sm.OLS(y2_train, X2_train).fit()
results2.summary()

R squared is still good, and no variable has a p-value higher than 0.05.

Let’s look at the model chart here:

In [None]:
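X2_test = sm.add_constant(X2_test)

y2_preds = results2.predict(X2_test)

# Font settings for the plot labels; these values are an assumption, the original definitions were not shown
ex_font = {"size": 12}
header_font = {"size": 14, "weight": "bold"}

plt.figure(dpi = 75)
plt.scatter(y2_test, y2_preds)
plt.plot(y2_test, y2_test, color="red")
plt.xlabel("Actual Scores", fontdict=ex_font)
plt.ylabel("Estimated Scores", fontdict=ex_font)
plt.title("Model: Actual vs Estimated Scores", fontdict=header_font)
plt.show()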

It seems like my predictions are really good! The actual and predicted values lie almost perfectly along a straight line.

Finally, I will check the errors.

**Errors**
When building models, comparing them and deciding which one is better is a crucial step. You should try many variations and analyze the summary of each: drop some variables, add or multiply variables together, and test again. After each round of analysis, check the p-values, the errors and the adjusted R squared. The best model will have:

- P-values smaller than 0.05
- Smaller errors
- Higher adjusted R squared

Let’s look at errors now:

In [None]:
print("Mean Absolute Error (MAE)         : {}".format(mean_absolute_error(y2_test, y2_preds)))
print("Mean Squared Error (MSE) : {}".format(mse(y2_test, y2_preds)))
print("Root Mean Squared Error (RMSE) : {}".format(rmse(y2_test, y2_preds)))
print("Mean Absolute Perc. Error (MAPE) : {}".format(np.mean(np.abs((y2_test - y2_preds) / y2_test)) * 100))

If you want to know what MSE, RMSE or MAPE is, you can read this article.

They are all different ways of measuring error. For now, we will simply prefer the model with the smaller values when comparing different models.
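
For reference, the four error measures can be written out in a few lines of NumPy. This is a minimal sketch using the standard definitions; the function name `regression_errors` and the toy numbers are made up for illustration.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))   # average absolute error
    mse = np.mean(residuals ** 2)      # average squared error
    rmse = np.sqrt(mse)                # same unit as the target
    # Percentage error; undefined if y_true contains zeros
    mape = np.mean(np.abs(residuals / y_true)) * 100
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

# Toy values, invented for illustration only
errors = regression_errors([100, 200, 300], [110, 190, 330])
for name, value in errors.items():
    print(f"{name}: {value:.4f}")
```

MSE and RMSE punish large misses more heavily than MAE, while MAPE expresses the error relative to the size of the actual values.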

So, in order to compare my model with another one, I will create one more model including Length of Membership and Time on App only.

In [None]:
X3=df[['Length of Membership', 'Time on App']]
Y = df['Yearly Amount Spent']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, Y, test_size = 0.2, random_state = 465)

X3_train = sm.add_constant(X3_train)

results3 = sm.OLS(y3_train, X3_train).fit()
results3.summary()

In [None]:
X3_test = sm.add_constant(X3_test)
y3_preds = results3.predict(X3_test)

# Font settings for the plot labels; these values are an assumption, the original definitions were not shown
ex_font = {"size": 12}
header_font = {"size": 14, "weight": "bold"}

plt.figure(dpi = 75)
plt.scatter(y3_test, y3_preds)
plt.plot(y3_test, y3_test, color="red")
plt.xlabel("Actual Scores", fontdict=ex_font)
plt.ylabel("Estimated Scores", fontdict=ex_font)
plt.title("Model Actual Scores vs Estimated Scores", fontdict=header_font)
plt.show()

print("Mean Absolute Error (MAE)         : {}".format(mean_absolute_error(y3_test, y3_preds)))
print("Mean Squared Error (MSE) : {}".format(mse(y3_test, y3_preds)))
print("Root Mean Squared Error (RMSE) : {}".format(rmse(y3_test, y3_preds)))
print("Mean Absolute Perc. Error (MAPE) : {}".format(np.mean(np.abs((y3_test - y3_preds) / y3_test)) * 100))

# Comparing Machine Learning Methods

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,10)

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [None]:
#read in the data
data = pd.read_csv('indian_liver_patient.csv')

In [None]:
data_to_use = data
del data_to_use['Gender']
data_to_use.dropna(inplace=True)

In [None]:
values = data_to_use.values

Y = values[:,9]
X = values[:,0:9]

In [None]:
random_seed = 12

outcome = []
model_names = []
models = [('LogReg', LogisticRegression()), 
          ('SVM', SVC()), 
          ('DecTree', DecisionTreeClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('LinDisc', LinearDiscriminantAnalysis()),
          ('GaussianNB', GaussianNB())]

We are going to use k-fold cross-validation to evaluate each algorithm. A for loop runs through the models, evaluates each one and stores the outcomes in the lists we created above. We’ll use 10-fold cross-validation.
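
To make the idea of k-fold cross-validation concrete, here is a minimal sketch of how the data is split, written with plain NumPy rather than scikit-learn. The helper `k_fold_indices` is hypothetical and does no shuffling.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, n_splits)
    for i, test_idx in enumerate(folds):
        # Everything outside the current fold is used for training
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=25, n_splits=10))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

Each model is trained 10 times, every time on a different 90% of the data, and scored on the remaining 10%. The mean and standard deviation of those 10 scores are what the loop reports.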

In [None]:
for model_name, model in models:
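    # shuffle=True is needed for random_state to take effect (recent scikit-learn versions raise an error otherwise);
    # with shuffling, the scores can differ slightly from the output listed below
    k_fold_validation = model_selection.KFold(n_splits=10, shuffle=True, random_state=random_seed)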
    results = model_selection.cross_val_score(model, X, Y, cv=k_fold_validation, scoring='accuracy')
    outcome.append(results)
    model_names.append(model_name)
    output_message = "%s| Mean=%f STD=%f" % (model_name, results.mean(), results.std())
    print(output_message)

Output
- LogReg| Mean=0.718633 STD=0.058744
- SVM| Mean=0.715124 STD=0.058962
- DecTree| Mean=0.637568 STD=0.108805
- KNN| Mean=0.651301 STD=0.079872
- LinDisc| Mean=0.716878 STD=0.050734
- GaussianNB| Mean=0.554719 STD=0.081961

From the above, it looks like the Logistic Regression, Support Vector Machine and Linear Discriminant Analysis methods provide the best results (based on the mean accuracy). Taking Jason’s lead, we can draw a box plot of the accuracy in each cross-validation fold to see how the models perform relative to each other and to their means.

In [None]:
fig = plt.figure()
fig.suptitle('Machine Learning Model Comparison')
ax = fig.add_subplot(111)
plt.boxplot(outcome)
ax.set_xticklabels(model_names)
plt.show()

From the box plot, it is easy to see that the three methods mentioned above (Logistic Regression, Support Vector Machine and Linear Discriminant Analysis) provide better accuracies. From here, we can start working with these three models and tune them to see whether one works a bit better than the others.
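
One possible way to start that optimization is a grid search over a model's hyperparameters. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the liver patient data, and the grid of `C` values is a hypothetical choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the liver patient data)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=12)

# Scale the features, then fit logistic regression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Candidate regularization strengths (hypothetical grid)
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Same 10-fold setup as in the comparison, with shuffling enabled
cv = KFold(n_splits=10, shuffle=True, random_state=12)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["logisticregression__C"])
print("best mean accuracy: %f" % search.best_score_)
```

The same pattern applies to the SVM and Linear Discriminant Analysis models by swapping the estimator and the parameter grid.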