# ANN: Regression (Website Traffic)

## Imports

First we handle all installations and imports.

In [3]:
# pip install scikit-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import metrics

# pip install tensorflow
import tensorflow as tf
import keras
from keras import layers

## Load and process data

The CSV file is loaded and we check the columns.

In [4]:
df = pd.read_csv("./Data/website_data.csv")

In [5]:
df.columns

Index(['Page Views', 'Session Duration', 'Bounce Rate', 'Traffic Source',
       'Time on Page', 'Previous Visits', 'Conversion Rate'],
      dtype='object')

In [6]:
df = df[['Page Views', 'Session Duration', 'Bounce Rate',
       'Time on Page', 'Previous Visits', 'Conversion Rate']]

With the code below, we try to remove possible outliers from the dataframe: any row with a z-score of 3 or more (in absolute value) in any column is dropped.

In [7]:
from scipy import stats
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
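
To make the filter above easier to follow, here is a minimal sketch of the same idea written with pandas only. The helper name `drop_zscore_outliers` and the toy data are assumptions made for illustration; the population standard deviation (`ddof=0`) is used because that is the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

def drop_zscore_outliers(frame, threshold=3.0):
    """Keep only rows where every column lies within `threshold` standard deviations."""
    # population standard deviation (ddof=0), the default used by scipy.stats.zscore
    z = (frame - frame.mean()) / frame.std(ddof=0)
    # a row survives only if ALL of its columns are below the threshold
    return frame[(z.abs() < threshold).all(axis=1)]

# toy data (hypothetical): column "a" has one extreme value, column "b" has none
toy = pd.DataFrame({
    "a": [10.0] * 19 + [1000.0],
    "b": np.arange(20, dtype=float),
})

cleaned = drop_zscore_outliers(toy)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Note that a single extreme value in one column removes the whole row, so this filter can discard more data than expected when there are many columns.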

In [8]:
df.head()

Unnamed: 0,Page Views,Session Duration,Bounce Rate,Time on Page,Previous Visits,Conversion Rate
0,5,11.051381,0.230652,3.89046,3,1.0
1,4,3.429316,0.391001,8.478174,0,1.0
2,4,1.621052,0.397986,9.63617,2,1.0
3,5,3.629279,0.180458,2.071925,3,1.0
4,5,4.235843,0.291541,1.960654,5,1.0


## Define X and y variables

In [9]:
X = df[['Page Views', 'Bounce Rate', 'Previous Visits', 'Session Duration', 'Time on Page']]
# X = df[['Bounce Rate', 'Time on Page',
#         'Previous Visits', 'Conversion Rate']]
y = df[['Conversion Rate']]

## Optimal variables

We want to optimize which variables the ANN uses. A correlation table shows the correlation between every pair of variables. Each correlation is a number between -1 and 1; highly correlated variables are redundant, which can lead to overfitting, inefficiency, and so on. The point of this step is to see whether we can drop one or more highly correlated variables before proceeding to the ANN.

In [10]:
# !pip install pandasgui
from pandasgui import show

correlations = df.corr()
show(correlations)

ModuleNotFoundError: No module named 'pandasgui'

We can see from the grid that the highest correlation is 0.204, which is considered pretty weak. So there's no significant correlation between the variables.
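
Since the `pandasgui` import failed in the cell above, the strongest correlation can also be read off without a GUI. The sketch below is one possible way to do that with pandas and NumPy only; the helper name `strongest_correlation` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def strongest_correlation(frame):
    """Return (column_1, column_2, correlation) for the strongest off-diagonal pair."""
    corr = frame.corr()
    # keep only the upper triangle; k=1 excludes the diagonal of 1.0 values
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # idxmax skips missing values, so masked-out cells are ignored
    best = pairs.abs().idxmax()
    return best[0], best[1], float(pairs.loc[best])

# toy data (hypothetical): "b" is exactly 2 * "a", "c" is only loosely related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

col_1, col_2, value = strongest_correlation(toy)
print(col_1, col_2, round(value, 3))
```

In the notebook, calling the helper on `df` would give the pair behind the highest value in the grid.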

In the next code cell, let's look at feature importance using SelectKBest.

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# convert all continuous variables to integer
# (astype(int) truncates, so e.g. 0.39 becomes 0),
# and convert all negative numbers to 0
X_cat = X.astype(int)
X_cat = X_cat.clip(lower=0)

# initialize chi2 and SelectKBest
# Note: the chi-squared test is a very common test
# in statistics and quantitative analysis;
# it checks whether variables are related
# or independent of each other
chi_2_features = SelectKBest(chi2, k=len(X_cat.columns))

# fit our data to the SelectKBest
best_features = chi_2_features.fit(X_cat,y.astype(int))

# use decimal format in table print later
pd.options.display.float_format = '{:.2f}'.format

# wrap it up, and show the results
# the higher the score, the more effect that column has on Conversion Rate
df_features = pd.DataFrame(best_features.scores_)
df_columns = pd.DataFrame(X_cat.columns)
f_scores = pd.concat([df_columns,df_features],axis=1)
f_scores.columns = ['Features','Score']
f_scores.sort_values(by='Score',ascending=False)

From this output we can conclude that Session Duration and Time on Page are the two most important features, the ones with the strongest relation to the target variable, Conversion Rate. It is weird that Bounce Rate gets a NaN.

In [None]:
print(df['Bounce Rate'].isnull().sum())
print(df['Bounce Rate'].nunique())
# No weird output so that's strange...
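
One possible explanation for the NaN, offered as an assumption rather than something the notebook confirms: the check above inspects the raw column in `df`, while `chi2` was fitted on `X_cat`, the integer-cast copy. The Bounce Rate values shown by `df.head()` all lie between 0 and 1, and `astype(int)` truncates such values to 0. The sketch below applies the same cast to those five values.

```python
import pandas as pd

# the five Bounce Rate values shown in the df.head() output
bounce = pd.Series(
    [0.230652, 0.391001, 0.397986, 0.180458, 0.291541],
    name="Bounce Rate",
)

# distinct values before the cast
raw_unique = bounce.nunique()

# same transformation that was applied to build X_cat
cast = bounce.astype(int).clip(lower=0)
cast_unique = cast.nunique()

print("unique before cast:", raw_unique)
print("unique after cast:", cast_unique)
print("values after cast:", cast.tolist())
```

If the whole column behaves like this sample, the feature passed to `chi2` is a column of zeros, which carries no information to score. Running `X_cat['Bounce Rate'].nunique()` in the notebook would confirm or rule this out.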

## Test/train/validation-split

In [11]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=101)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=101)

## Neural Network

Now the fun part... creating the model. I started with a simple model with 3 layers, but the results were very bad. I tried a lot of different combinations, and even added Dropout layers, but the metrics are still very bad. We set the target variable to be "Conversion Rate", so that's the one we're predicting.

In [None]:
model = keras.Sequential(
    [
        layers.Dense(16, activation="relu", input_shape=(5,)),
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.Dense(1)
    ]
)

model.compile(optimizer='adam', loss='mse')

model.summary()

## Start training of NN

In [None]:
model.fit(x=X_train, y=y_train, epochs=400, validation_data=(X_val, y_val))

## Training error metrics

In [None]:
loss_df = pd.DataFrame(model.history.history)
loss_df.plot()

## Test/training data eval

In [None]:
print("Test data evaluation:")
print(model.evaluate(X_test, y_test, verbose=0))
print("\nTrain data evaluation:")
print(model.evaluate(X_train, y_train, verbose=0))

## Get test predictions for evaluation metrics

In [None]:
test_predictions = model.predict(X_test)

test_predictions = pd.Series(test_predictions.reshape(len(y_test),))
pred_df = pd.DataFrame(np.asarray(y_test), columns=['Test True Y'])
pred_df = pd.concat([pred_df, test_predictions], axis=1)
pred_df.columns = ['Test True Y', 'Model Predictions']


pred_df

## Metrics

In [None]:
sns.scatterplot(x='Test True Y', y='Model Predictions', data=pred_df)

## Error regression metrics

In [None]:
# MAE - Mean absolute error
# (Conversion Rate is a rate, not a duration, so no unit is printed)
print("MAE")
print(round(metrics.mean_absolute_error(y_test, test_predictions), 2))

# MSE - Mean squared error
print("\nMSE")
print(round(metrics.mean_squared_error(y_test, test_predictions), 2))

# RMSE - Root mean squared error
print('\nRMSE:')
print(round(np.sqrt(metrics.mean_squared_error(y_test, test_predictions)), 2))

# R-squared. 0 = the model describes the dataset poorly
# 1 = the model describes the dataset perfectly
print('\nR-squared:')
print(round(metrics.r2_score(y_test, test_predictions), 2))

# Explained variance score: 0 = the model describes the dataset poorly
# 1 = the model describes the dataset perfectly
# high score = the model is a good fit for the data
# low score = the model is not a good fit for the data
# the higher the score, the better the model explains the variation in the data
# if the score is low, we might need more and better data
print("\nExplained variance score:")
print(round(metrics.explained_variance_score(y_test, test_predictions), 2))
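
To make the metrics above less of a black box, here is a minimal NumPy sketch of how each one is computed from the true values and the predictions. The helper name `regression_metrics` and the toy numbers are assumptions for illustration; in the notebook the scikit-learn functions remain the ones to use.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, R-squared and explained variance by hand."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))   # average size of the errors
    mse = np.mean(residuals ** 2)      # average squared error
    rmse = np.sqrt(mse)                # back in the units of the target

    # R-squared: 1 - (unexplained squared error / total squared spread of y_true)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    # explained variance: like R-squared, but ignores a constant bias in the errors
    explained_variance = 1.0 - np.var(residuals) / np.var(y_true)

    return {
        "mae": mae,
        "mse": mse,
        "rmse": rmse,
        "r2": r2,
        "explained_variance": explained_variance,
    }

# toy values (hypothetical)
scores = regression_metrics([1, 2, 3, 4], [1.5, 2.0, 2.5, 4.5])
for name, value in scores.items():
    print(name, round(value, 4))
```

Both R-squared and the explained variance score divide by the spread of the true values, so they become unreliable when the target barely varies.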

## Quick note

So up until now I used Conversion Rate as the target variable. The results looked very strange, so I went looking at the values in the Conversion Rate column... they are almost all "1". So let's change the target variable. Since Conversion Rate has very little variance in its values, we should also think about dropping that variable.

In [None]:
df['Conversion Rate'].value_counts()
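
A check like this is cheap to run before any modelling. The sketch below shows one way to put a number on "almost all 1"; the helper name `dominant_share` and the toy series are assumptions for illustration.

```python
import pandas as pd

def dominant_share(series):
    """Return the most frequent value and the fraction of rows that carry it."""
    # value_counts sorts from most to least frequent; normalize=True gives fractions
    shares = series.value_counts(normalize=True)
    return shares.index[0], float(shares.iloc[0])

# toy target (hypothetical): 95 of 100 rows share the same value
toy_target = pd.Series([1.0] * 95 + [0.8, 0.7, 0.9, 0.6, 0.5])

value, share = dominant_share(toy_target)
print("most common value:", value)
print("share of rows:", round(share, 2))
```

When one value covers nearly every row, a regression model has very little variation left to learn from, which would be consistent with the poor metrics above.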

So let's start over again in the "part_2" file.