### Data Science Accelerator - Predicting High Street Recovery

The aim of this project is to predict high street recovery and identify high streets at risk, providing indicators of high street health which can be used to inform public policy and prioritise the most effective areas of intervention. This project aims to create a model which can be used to detect high streets with lower levels of recovery or those in decline. Data sources to be used include Mastercard spend data, provided weekly from 2018 until the present day and aggregated at high street level.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from functools import reduce

import re
import pickle

from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support as score

from xgboost import XGBClassifier

from highstreets.features import processing_functions as pf

from dotenv import load_dotenv, find_dotenv
import os

load_dotenv(find_dotenv())

HS_MCARD_TXN = os.environ.get("HS_MCARD_TXN")
LDN_MCARD_TXN = os.environ.get("LDN_MCARD_TXN")
O2_2021 = os.environ.get("O2_2021")
O2_2022 = os.environ.get("O2_2022")
HS_MSOA_LOOKUP = os.environ.get("HS_MSOA_LOOKUP")
PROFILE_FILE = os.environ.get("PROFILE_FILE")

%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
# read in high streets transaction data
mcard_highstreets = pd.read_csv(HS_MCARD_TXN)
mcard_london = pd.read_csv(LDN_MCARD_TXN)

# select weekday retail data
hs = mcard_highstreets[
    ["yr", "wk", "week_start", "highstreet_id", "highstreet_name", "txn_amt_wd_retail"]
]

In [None]:
# read in o2 footfall data
o2_2021 = pd.read_csv(O2_2021, encoding="utf-8")
o2_2022 = pd.read_csv(O2_2022, encoding="utf-8")
# bind rows
o2_concat = pd.concat([o2_2021, o2_2022])
o2_cleaned = o2_concat[["msoa11cd", "count_date", "count_type", "h13"]]
# msoa - highstreet lookup
hs_msoa_lookup = pd.read_csv(HS_MSOA_LOOKUP)
lookup = hs_msoa_lookup[["msoa11cd", "highstreet_name"]]
# inner join dfs
o2_hs = pd.merge(o2_cleaned, lookup, on="msoa11cd")
o2_hs = o2_hs[o2_hs["count_type"] == "Visitor"]

In [None]:
# read in highstreet profile data
hs_profiles = pd.read_excel(PROFILE_FILE)

#### Data pre-processing

The Mastercard dataset is pre-processed to ensure dates are in the correct format. Much of the pre-processing has already been performed in the processing scripts which load the Mastercard data from the database. Feature scaling is also applied before the features are passed to the algorithms, to account for the different sizes of the high streets and the resulting large differences in the scale of their spending.

In [None]:
# Mastercard: drop rows with no week_start (working on a copy), then parse dates
hs = hs[hs["week_start"].notnull()].copy()
hs["week_start"] = pd.to_datetime(hs["week_start"])

# create month yr variable
hs["month_year"] = hs["week_start"].dt.to_period("M")

In [None]:
# O2 change date format
o2_hs["count_date"] = pd.to_datetime(o2_hs["count_date"])
# o2_hs['highstreet_name'] = o2_hs['highstreet_name'].encode('UTF-8')
# group by vars
o2_hs.groupby(["count_date", "highstreet_name", "count_type"]).sum()

#### Data labelling

To feed the data into a supervised machine learning model, the data must have labels. Here, binary labels are created by categorising each high street as either at risk or not. Labels are assigned by averaging spend over a chosen number of months: if the average spend is less than 75% of the previous month's average spend, the high street is categorised as 'at risk'. This creates the response dataset, the output variable that depends on the feature variables, also known as the target, label or output.

In [None]:
labelled = pf.create_labels(hs, "2022-03", "2021-03", "2022-02")
labelled["labels"] = labelled["labels"].astype("int")
print(labelled.dtypes)
# check class imbalance
print(labelled.labels.value_counts())
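
`pf.create_labels` is defined outside this notebook, so its exact logic is not shown here. As a minimal sketch of the 75% rule described above, and assuming the label compares mean weekly spend in a target month against the mean over a baseline period, a hypothetical helper (`label_at_risk`, with illustrative argument names and toy data) might look like this:

```python
import pandas as pd


def label_at_risk(df, target_month, baseline_start, baseline_end, threshold=0.75):
    """Hypothetical labelling rule: 1 = at risk, 0 = not at risk."""
    monthly = df.assign(month=df["week_start"].dt.to_period("M"))
    in_target = monthly["month"] == pd.Period(target_month, freq="M")
    in_baseline = (monthly["month"] >= pd.Period(baseline_start, freq="M")) & (
        monthly["month"] <= pd.Period(baseline_end, freq="M")
    )
    # mean weekly spend per high street in each window
    target_mean = monthly[in_target].groupby("highstreet_name")["txn_amt_wd_retail"].mean()
    baseline_mean = monthly[in_baseline].groupby("highstreet_name")["txn_amt_wd_retail"].mean()
    # ratio below the threshold means spend has dropped by more than 25%
    ratio = (target_mean / baseline_mean).dropna()
    return (ratio < threshold).astype(int).rename("labels").to_frame()


# toy data: high street A falls to 65% of its baseline, B stays above 75%
toy = pd.DataFrame(
    {
        "highstreet_name": ["A"] * 4 + ["B"] * 4,
        "week_start": pd.to_datetime(
            ["2022-01-03", "2022-02-07", "2022-03-07", "2022-03-14"] * 2
        ),
        "txn_amt_wd_retail": [100.0, 100.0, 60.0, 70.0, 100.0, 100.0, 90.0, 95.0],
    }
)
toy_labels = label_at_risk(toy, "2022-03", "2022-01", "2022-02")
print(toy_labels)
```

With a rule of this kind the two classes are rarely balanced, which is why the class counts are printed in the cell above.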

#### Feature engineering

The mean and standard deviation of spend are calculated for each high street to transform the raw data into features. This creates the feature dataset, also known as predictors, inputs or attributes.

In [None]:
# O2 feature engineering
day_range = pd.date_range("2021-05-08", "2022-04-01", freq="d")

mean_sd_o2 = pf.create_mean_sd_o2(o2_hs, day_range)

In [None]:
# create slice of march 2021 to march 2022
months = pd.date_range("2021-03", "2022-03", freq="m").to_period("M")
mean_sd = pf.create_mean_sd_mcard(hs, months)

# # create mean and sd for the two recovery periods
# months_r1 = pd.date_range('2020-04', '2020-10', freq='m').to_period('M')
# mean_sd_r1 = create_mean_sd_mcard(hs, months_r1)

# months_r2 = pd.date_range('2021-01', '2021-04', freq='m').to_period('M')
# mean_sd_r2 = create_mean_sd_mcard(hs, months_r2)
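
The `pf.create_mean_sd_mcard` helper lives in the project package and is not reproduced here. A minimal sketch of the same idea, assuming the features are the per-high-street mean and sample standard deviation of weekly spend within a window of months (the function name, argument names and output column names below are illustrative only):

```python
import pandas as pd


def mean_sd_by_highstreet(df, start, end, value_col="txn_amt_wd_retail"):
    """Hypothetical helper: mean and s.d. of spend per high street in a month window."""
    month = df["week_start"].dt.to_period("M")
    in_window = (month >= pd.Period(start, freq="M")) & (month <= pd.Period(end, freq="M"))
    # pandas uses the sample standard deviation (ddof=1) by default
    summary = df[in_window].groupby("highstreet_name")[value_col].agg(["mean", "std"])
    return summary.rename(columns={"mean": "mean_spend", "std": "sd_spend"}).reset_index()


# toy data: the June row for A falls outside the window and is ignored
toy_spend = pd.DataFrame(
    {
        "highstreet_name": ["A", "A", "A", "A", "B", "B", "B"],
        "week_start": pd.to_datetime(
            [
                "2021-03-01", "2021-03-08", "2021-04-05", "2021-06-07",
                "2021-03-01", "2021-03-08", "2021-04-05",
            ]
        ),
        "txn_amt_wd_retail": [10.0, 20.0, 30.0, 1000.0, 5.0, 5.0, 5.0],
    }
)
toy_features = mean_sd_by_highstreet(toy_spend, "2021-03", "2021-04")
print(toy_features)
```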

To create a feature for the gradient, linear regression is fitted to each high street, with the coefficient taken as another feature to capture the overall spending trend of that high street. The dates are converted from period objects to datetime objects and then into ordinal values, which can be passed to scikit-learn. An open question in creating this feature is whether to pool the transaction values by month, average them and fit a trend line over the monthly averages, or whether to leave the data as weekly values and fit the gradient over those.

In [None]:
months = pd.date_range("2021-03", "2022-03", freq="m").to_period("M")
gradients = pf.create_gradient(hs, months)

months_r1_grad = pd.date_range("2020-04", "2020-10", freq="m").to_period("M")
gradients_r1 = pf.create_gradient(hs, months_r1_grad)

months_r2_grad = pd.date_range("2021-04", "2021-08", freq="m").to_period("M")
gradients_r2 = pf.create_gradient(hs, months_r2_grad)
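
`pf.create_gradient` is not shown in this notebook. The approach described above can be sketched as follows, assuming the regression is fitted on weekly values with dates converted to ordinals (days), so the coefficient is the change in spend per day. The helper name `spend_gradient` and the toy data are illustrative:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def spend_gradient(group, date_col="week_start", value_col="txn_amt_wd_retail"):
    """Hypothetical helper: slope of a straight line fitted to spend over time."""
    # ordinal dates give one integer per calendar day
    x = group[date_col].map(pd.Timestamp.toordinal).to_numpy().reshape(-1, 1)
    y = group[value_col].to_numpy()
    return LinearRegression().fit(x, y).coef_[0]


# toy data: A rises by 7 per week (1 per day), B falls by 7 per week
toy_weekly = pd.DataFrame(
    {
        "highstreet_name": ["A"] * 3 + ["B"] * 3,
        "week_start": pd.to_datetime(["2021-03-01", "2021-03-08", "2021-03-15"] * 2),
        "txn_amt_wd_retail": [100.0, 107.0, 114.0, 50.0, 43.0, 36.0],
    }
)
toy_gradients = {
    name: spend_gradient(group) for name, group in toy_weekly.groupby("highstreet_name")
}
print(toy_gradients)
```

Fitting on monthly averages instead would only require aggregating `group` by month before the regression; the sign of the gradient should usually agree, but the magnitude and sensitivity to noisy weeks will differ.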

In [None]:
# O2 feature engineering pre methodology change
day_range_pre = pd.date_range("2021-05-08", "2021-11-14", freq="d")
gradient_o2_pre = pf.create_gradient_o2(o2_hs, day_range_pre)
gradient_o2_pre = gradient_o2_pre.rename({"gradient": "grad_o2_pre"}, axis=1)
# O2 post meth change
day_range = pd.date_range("2021-11-15", "2022-04-01", freq="d")
gradient_o2_post = pf.create_gradient_o2(o2_hs, day_range)
gradient_o2_post = gradient_o2_post.rename({"gradient": "grad_o2_post"}, axis=1)

In [None]:
# merge all the mastercard dataframes together into one feature df
feature_list_mcard = [mean_sd, gradients, gradients_r1, gradients_r2]
features_mcard = reduce(
    lambda left, right: pd.merge(left, right, on=["highstreet_name"], how="outer"),
    feature_list_mcard,
)

In [None]:
# do same with o2
features_o2 = mean_sd_o2.merge(gradient_o2_pre, on="highstreet_name").merge(
    gradient_o2_post, on="highstreet_name"
)
# quick fix, needs work asap
# the O2 files are mis-encoded, so replace any non-ASCII character in the
# high street names (the apostrophes) with a plain apostrophe; working on
# features_o2's own column keeps each name aligned with its row after the merges
features_o2["highstreet_name"] = features_o2["highstreet_name"].str.replace(
    r"[^\x00-\x7f]", "'", regex=True
)

feature_complete = features_mcard.merge(features_o2, on="highstreet_name")
feature_complete.drop(["highstreet_name"], axis=1, inplace=True)

#### Split into test and training sets

The data is split into training and test sets with a 70/30 split. Normalisation is then applied to both sets, but the scaler is fitted on the training data only, to avoid leaking information from the test set into training. Feature normalisation (or standardisation) of the explanatory (or predictor) variables centres and scales the data by subtracting the mean and dividing by the standard deviation. If the mean and standard deviation were taken over the whole dataset, information from the test set would be introduced into the training explanatory variables. The statistics are therefore computed on the training data and then reused to normalise the test instances. In this way, we can evaluate whether the model generalises well to new, unseen data points.

In [None]:
# scikit-learn expects a 1-D array of labels, so flatten the label column
labelled = np.ravel(labelled["labels"])

# testing with different
# split data
X_train, X_test, y_train, y_test = train_test_split(
    feature_complete, labelled, test_size=0.3, random_state=0
)

print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)

# normalise
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
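
As a small check of the point made above, the sketch below uses toy arrays (not the project data) to confirm that `StandardScaler.transform` applies the mean and standard deviation learned from the training rows to the test rows, rather than recomputing them:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0]])

toy_scaler = StandardScaler().fit(toy_train)
toy_scaled_test = toy_scaler.transform(toy_test)

# manual standardisation using the training statistics only
# (StandardScaler uses the population standard deviation, as does np.std)
manual = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(np.allclose(toy_scaled_test, manual))
```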

#### Cross validation and hyperparameter tuning

The models are cross-validated on the training data, after which the best-performing model is chosen for the final evaluation on the test data. The base model, using only the Mastercard spend gradient as a feature, had an accuracy of 0.33 for both SVM and XGBoost. Adding the mean and standard deviation gave an accuracy of 0.48 for logistic regression and 0.419 for XGBoost. Including the mean and standard deviation together with the gradients over the full time period and the two recovery periods gave 0.48 for random forest and 0.465 for both XGBoost and logistic regression.

In [None]:
# parameters to test for each algorithm
log_reg_params = [{"C": 0.01}, {"C": 0.1}, {"C": 1}, {"C": 10}]
dec_tree_params = [{"criterion": "gini"}, {"criterion": "entropy"}]
rand_for_params = [{"criterion": "gini"}, {"criterion": "entropy"}]
kneighbors_params = [{"n_neighbors": 3}, {"n_neighbors": 5}]
naive_bayes_params = [{}]
svc_params = [{"C": 0.01}, {"C": 0.1}, {"C": 1}, {"C": 10}]
xgb_params = [{}]

# list of models, params etc
modelclasses = [
    ["log regression", LogisticRegression, log_reg_params],
    ["decision tree", DecisionTreeClassifier, dec_tree_params],
    ["random forest", RandomForestClassifier, rand_for_params],
    ["k neighbors", KNeighborsClassifier, kneighbors_params],
    ["naive bayes", GaussianNB, naive_bayes_params],
    ["support vector machines", SVC, svc_params],
    ["xgboost", XGBClassifier, xgb_params],
]

# loop through each model with k-fold cv
insights = []
for modelname, Model, params_list in modelclasses:
    for params in params_list:
        model = Model(**params)
        kfold = model_selection.KFold(n_splits=10, random_state=None)
        cv_results = model_selection.cross_val_score(
            model, X_train, y_train, cv=kfold, scoring="accuracy"
        )
        mean = cv_results.mean()
        sd = cv_results.std()
        insights.append((modelname, model, params, cv_results, mean, sd))

insights.sort(key=lambda x: x[-2], reverse=True)
with open("models.pickle", "wb") as f:
    for modelname, model, params, cv_results, mean, sd in insights:
        print(modelname, params, "Accuracy:", round(mean, 3))
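        # pickle the estimators; these are unfitted, because cross_val_score
        # fits clones of `model` rather than `model` itself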
        pickle.dump(model, f)

results = pd.DataFrame(
    insights, columns=["model name", "model", "parameter", "accuracy", "mean", "s.d."]
)

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results["accuracy"])
ax.set_xticklabels(results["model name"])
plt.xticks(rotation=90)
plt.show()
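
In the loop above the scaler has already been fitted on the full training set, so the validation rows of each fold contribute to the scaling statistics. One possible way to keep scaling inside each fold, sketched here on synthetic data from `make_classification` rather than the high street features, is to wrap the scaler and the classifier in a scikit-learn pipeline. The choice of `StratifiedKFold` with five shuffled folds is an assumption for illustration, not the notebook's setting:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the feature matrix and binary labels
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# the scaler is refitted on the training part of every fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=10, max_iter=150))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipeline, X_toy, y_toy, cv=cv, scoring="accuracy")

print("Mean accuracy:", round(cv_scores.mean(), 3), "s.d.:", round(cv_scores.std(), 3))
```

Stratified folds keep the at-risk proportion similar across folds, which may matter if the classes are imbalanced.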

In [None]:
# load models back into memory
models = []
with open("models.pickle", "rb") as f:
    while True:
        try:
            models.append(pickle.load(f))
        except EOFError:
            break

#### Final model evaluation after hyperparameter tuning

In [None]:
model = LogisticRegression(C=10, max_iter=150)
# model = RandomForestClassifier()
# model = XGBClassifier()
# fit classifier on the scaled training data
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Finding accuracy by comparing actual response values (y_test)
# with predicted response value (y_pred)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("F1:", metrics.f1_score(y_test, y_pred, average="weighted"))
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm)
disp.plot()
plt.show()
# per-class precision, recall, F-score and support
precision, recall, fscore, support = score(y_test, y_pred)

FP = cm.sum(axis=0) - np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)
FP = FP.astype(float)
FN = FN.astype(float)
TP = TP.astype(float)
TN = TN.astype(float)

acc = (TP + TN) / (TP + FP + FN + TN)
print(acc)

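# per-class accuracy (recall): rows of cm are the true classes, so divide
# the diagonal by the row totals (column totals would repeat the precision)
print("Class accuracy:", cm.diagonal() / cm.sum(axis=1))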
print("Precision: {}".format(precision))
print("Recall: {}".format(recall))
print("F-score: {}".format(fscore))
print("Support: {}".format(support))
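
When reading these figures it helps to recall that scikit-learn's `confusion_matrix` places the true classes on the rows and the predicted classes on the columns. The toy example below (made-up labels, not model output) shows that dividing the diagonal by the row totals gives per-class recall, while dividing by the column totals gives per-class precision:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1]

toy_cm = confusion_matrix(toy_true, toy_pred)

# rows are true classes, columns are predicted classes
toy_recall = toy_cm.diagonal() / toy_cm.sum(axis=1)
toy_precision = toy_cm.diagonal() / toy_cm.sum(axis=0)

print(toy_cm)
print("Recall per class:", toy_recall)
print("Precision per class:", toy_precision)
```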