# Predictive Analytics

In this course, we are working towards two types of "usage" for statistics and analytics:
1. We will see how to perform some statistical analyses for hypothesis testing using Python. This is similar to what you have been doing so far in other courses, and what you will most likely use for your thesis.
2. We will also see how to use statistics for **predictive analytics**, i.e., make predictions using digital trace data.

This notebook briefly shows what predictive analytics is, or at least how we make one type of prediction using Python. *For now, all you need to do is follow the steps and understand the logic. By the end of the course, you will be able to perform this from beginning to end*.

**Don't worry about all the commands being used here. We will learn all of them in the coming weeks :-)**

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, LinearRegression
import statsmodels.api as sm
%matplotlib inline

## The case
Our website has launched new campaigns to increase engagement. Here, engagement means that the user sees more pages (totals_pageviews) and does not leave the website right after entering through the campaign (landing_isExit, binary). The website also wants to understand which of the campaigns leads to more sales (binary, derived from totals_transactionRevenue) and more revenue (totals_transactionRevenue).

## Loading data

Here we are loading and briefly inspecting the dataset. You will learn more about this in DA2 and DA3.

In [None]:
data = pd.read_csv('googlestore.csv')

In [None]:
data.head()

In [None]:
data.columns

In [None]:
data['trafficSource_medium'].value_counts()

## Data cleaning

Here we are taking steps to prepare the variables that are important for the analysis. You will learn more about it in DA3.

In [None]:
data['landing_isExit'].value_counts()

In [None]:
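def fix_landing(landing):
    # A missing value (NaN) is treated as "did not exit on the landing page"
    if str(landing).lower() == 'nan':
        return 0
    # Any recorded value is treated as an exit on the landing page
    return 1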

In [None]:
data['isExit'] = data['landing_isExit'].apply(fix_landing)

In [None]:
data['isExit'].describe()
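The same recoding can be sketched without a helper function, assuming (as `fix_landing` does) that a missing `landing_isExit` means there was no exit on the landing page. The toy values below are made up for illustration and do not come from `googlestore.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 1.0 marks an exit on the landing page, NaN means none was recorded
toy_landing = pd.Series([1.0, np.nan, 1.0, np.nan, np.nan])

# notna() is True for non-missing values; astype(int) turns True/False into 1/0
toy_is_exit = toy_landing.notna().astype(int)

print(toy_is_exit.tolist())
```

Unlike the string comparison in `fix_landing`, this version only treats real missing values as 0, so the two approaches agree as long as the column contains no literal `'nan'` text.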

In [None]:
data['totals_pageviews'].describe()

We want to understand whether visitors who arrive through affiliate or cpc (cost-per-click) campaigns show better engagement than visitors who reach the site in other ways.

In [None]:
def check_category(source, variablename):
    # Return 1 if the traffic medium (source) equals the category name, otherwise 0
    if source == variablename:
        return 1
    return 0

In [None]:
data['cpc'] = data['trafficSource_medium'].apply(check_category, args=('cpc',))
data['affiliate'] = data['trafficSource_medium'].apply(check_category, args=('affiliate',))
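As a rough alternative sketch, the same 0/1 (dummy) variables can be built by comparing the whole column at once. The DataFrame `toy` and the medium values `'organic'` and `'referral'` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the real trafficSource_medium column
toy = pd.DataFrame({
    'trafficSource_medium': ['cpc', 'organic', 'affiliate', 'referral', 'cpc']
})

# The comparison gives True/False per row; astype(int) converts that to 1/0
toy['cpc'] = (toy['trafficSource_medium'] == 'cpc').astype(int)
toy['affiliate'] = (toy['trafficSource_medium'] == 'affiliate').astype(int)

print(toy)
```

Rows that are neither cpc nor affiliate get 0 in both columns, so they act as the reference group in the models later on.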

In [None]:
data[['cpc', 'affiliate']].describe()

In [None]:
data[['totals_pageviews', 'cpc', 'affiliate', 'isExit']].isna().sum()

In [None]:
data['pageviews'] = data['totals_pageviews'].fillna(0)

## Data exploration and visualisation

Here we are looking at the descriptive statistics of the final dataset and using visualisations to understand the relationship between variables. You will learn more about this in DA4.

In [None]:
data[['cpc', 'affiliate', 'isExit', 'pageviews']].describe().transpose()

In [None]:
data[['cpc', 'affiliate', 'isExit', 'pageviews']].groupby(['cpc', 'affiliate']).describe().transpose()

In [None]:
sns.barplot(x='cpc', y='pageviews', data=data)

In [None]:
sns.barplot(x='affiliate', y='pageviews', data=data)

In [None]:
sns.barplot(x='cpc', y='isExit', data=data)

In [None]:
sns.barplot(x='affiliate', y='isExit', data=data)
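By default, `sns.barplot` draws the mean of `y` for each category of `x`, with an error bar for the confidence interval around it. A minimal sketch of the numbers behind the bars, using made-up toy data rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy data: two cpc visitors and four visitors from other media
toy = pd.DataFrame({
    'cpc':       [1, 1, 0, 0, 0, 0],
    'pageviews': [2, 4, 5, 7, 3, 5],
    'isExit':    [1, 0, 0, 0, 1, 0],
})

# Mean per group: this is the bar height that sns.barplot would show
group_means = toy.groupby('cpc')[['pageviews', 'isExit']].mean()

print(group_means)
```

For a binary variable such as `isExit`, the group mean is simply the share of visitors in that group who exited.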

## Modelling and hypothesis testing

Here we are using traditional statistics and machine learning to understand the differences between campaigns. You will learn more about this in DA5 and DA6.

## Predictions for pageviews using Linear (OLS) regression

First, the "traditional" (frequentist) statistics.

In [None]:
ols_stat = sm.OLS(data['pageviews'], sm.add_constant(data[['cpc', 'affiliate']]))

In [None]:
result_ols = ols_stat.fit()

In [None]:
print(result_ols.summary())
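To read the summary, it may help to know that with two mutually exclusive 0/1 predictors the constant is the mean of the reference group (neither cpc nor affiliate), and each coefficient is the difference between that group's mean and the reference mean. The sketch below checks this on made-up toy data, using plain NumPy least squares instead of statsmodels.

```python
import numpy as np

# Hypothetical toy data: two cpc, two affiliate and two "other" visitors
cpc       = np.array([1, 1, 0, 0, 0, 0])
affiliate = np.array([0, 0, 1, 1, 0, 0])
pageviews = np.array([2.0, 4.0, 6.0, 8.0, 4.0, 6.0])

# Design matrix with a leading column of ones, mirroring sm.add_constant
X = np.column_stack([np.ones(len(pageviews)), cpc, affiliate])

# Ordinary least squares fit
coefs, *_ = np.linalg.lstsq(X, pageviews, rcond=None)
intercept, b_cpc, b_affiliate = coefs

# Group means computed directly, for comparison
other_mean = pageviews[(cpc == 0) & (affiliate == 0)].mean()
cpc_mean = pageviews[cpc == 1].mean()
affiliate_mean = pageviews[affiliate == 1].mean()

print(intercept, other_mean)                     # constant vs. reference-group mean
print(b_cpc, cpc_mean - other_mean)              # cpc coefficient vs. difference in means
print(b_affiliate, affiliate_mean - other_mean)  # affiliate coefficient vs. difference in means
```

Each printed pair should match up to rounding, which is why the coefficients can be read as "how many more (or fewer) pageviews than the reference group".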

### Now, using ML for predictive analytics.

In [None]:
ols_clf = LinearRegression(fit_intercept=True)

In [None]:
ols_clf.fit(data[['cpc', 'affiliate']], data['pageviews'])

In [None]:
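# Predicted pageviews for a visitor arriving via cpc (cpc=1, affiliate=0)
ols_clf.predict([[1,0]])

In [None]:
# Predicted pageviews for a visitor arriving via an affiliate (cpc=0, affiliate=1)
ols_clf.predict([[0,1]])

In [None]:
# Predicted pageviews for a visitor arriving via any other medium (cpc=0, affiliate=0)
ols_clf.predict([[0,0]])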

## Probabilities for leaving the website using Logistic Regression

First, we again use "traditional" (frequentist) statistics, but this time for a **binary** dependent variable.

In [None]:
logit_stats = sm.Logit(data['isExit'], sm.add_constant(data[['cpc', 'affiliate']]))

In [None]:
result_logit = logit_stats.fit()

In [None]:
print(result_logit.summary())
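The coefficients in a logit summary are on the log-odds scale, so they cannot be read directly as probabilities. A small sketch of the conversion follows; the coefficient values are hypothetical and are NOT the ones estimated above.

```python
import math

def logistic(log_odds):
    """Convert a log-odds value into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients for illustration only (not taken from the model above)
const, b_cpc, b_affiliate = -0.5, 0.8, -0.3

# Add up the log-odds for each type of visitor, then convert to a probability
p_other = logistic(const)                    # cpc=0, affiliate=0
p_cpc = logistic(const + b_cpc)              # cpc=1, affiliate=0
p_affiliate = logistic(const + b_affiliate)  # cpc=0, affiliate=1

print(round(p_other, 3), round(p_cpc, 3), round(p_affiliate, 3))
```

A positive coefficient raises the probability of exiting relative to the reference group, and a negative one lowers it.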

### Now, using ML for predictive analytics.

In [None]:
logit_clf = LogisticRegression(max_iter=1000, fit_intercept=True)


In [None]:
logit_clf.fit(data[['cpc', 'affiliate']], data['isExit'])

In [None]:
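# Probabilities [P(isExit=0), P(isExit=1)] for a visitor arriving via cpc
logit_clf.predict_proba([[1,0]])

In [None]:
# Probabilities [P(isExit=0), P(isExit=1)] for a visitor arriving via an affiliate
logit_clf.predict_proba([[0,1]])

In [None]:
# Probabilities [P(isExit=0), P(isExit=1)] for a visitor arriving via any other medium
logit_clf.predict_proba([[0,0]])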
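To see how the output of `predict_proba` is organised, here is a minimal sketch on made-up toy data. The names `X_toy`, `y_toy` and `toy_clf` are hypothetical and separate from the model fitted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: each row is [cpc, affiliate], y is 1 for an exit
X_toy = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0], [0, 0]])
y_toy = np.array([1, 1, 0, 0, 1, 0, 0, 1])

toy_clf = LogisticRegression(max_iter=1000)
toy_clf.fit(X_toy, y_toy)

# One row per input; the columns follow the order of toy_clf.classes_
proba = toy_clf.predict_proba([[1, 0]])
print(toy_clf.classes_)
print(proba)

# Pick out the probability of class 1 (an exit) for the cpc visitor
p_exit = proba[0, list(toy_clf.classes_).index(1)]
print(p_exit)
```

The two probabilities in each row add up to 1. Note that scikit-learn's `LogisticRegression` applies L2 regularisation by default, so its probabilities can differ slightly from those implied by an unregularised `sm.Logit` model.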