# __Milestone 1: Business Understanding__

## Problem Statement

Predict whether customers are likely to churn based on their past behaviour and demographics. 

## Data identification

To build a machine learning model that predicts customer churn, we need a combination of features capturing each customer's interactions with our service as well as demographic information. The features we will use in our model include:

 - CustomerID
 - Gender
 - Age
 - Income
 - TotalPurchase
 - NumOfPurchases
 - Location
 - MaritalStatus
 - Education
 - SubscriptionPlan
 - Churn (label)

## Hypothesis 

## Collect and clean the data

We have collected raw data based on the desired features and target attributes for our churn prediction model. This raw data has been stored in the train.csv file in our data folder. We will now import this data into a dataframe and start cleaning the data.

### Import

In [51]:
# Suppress warnings
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

import pandas as pd # data wrangling
import seaborn as sns # data visualization
import plotly.express as px
import matplotlib.pyplot as plt

# for cat features
from category_encoders import OneHotEncoder

from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.pipeline import make_pipeline

from skimpy import clean_columns

In [52]:
df = pd.read_csv('./data/train.csv') #reading the data from the csv file to our dataframe
df.head() #display the first few data entries as well as column headings

Unnamed: 0,CustomerID,Gender,Age,Income,TotalPurchase,NumOfPurchases,Location,MaritalStatus,Education,SubscriptionPlan,Churn
0,1,,35,52850.0,1500,6.0,Urban,Married,Bachelor's,Gold,Yes
1,2,Female,25,29500.0,800,3.0,Suburban,,High School,Bronze,No
2,3,Male,45,73500.0,2000,8.0,Rural,Married,Master's,Silver,No
3,4,Female,30,,1200,5.0,Urban,Single,Bachelor's,Bronze,No
4,5,Male,55,80400.0,2500,9.0,Suburban,Married,PhD,Gold,No


We notice that our raw data has 10 features as well as a target feature called Churn. This data is not yet ready to be modelled and needs to be cleaned and prepared.

### Preprocessing data

__Removing irrelevant features__

We do not need the customer ID to determine whether a customer will churn, so it is not a relevant feature for modelling and can therefore be dropped.

In [53]:
# Removing the irrelevant feature
df.drop(
    columns='CustomerID',
    inplace=True
)

df.head() # inspecting the dataframe without the irrelevant feature

Unnamed: 0,Gender,Age,Income,TotalPurchase,NumOfPurchases,Location,MaritalStatus,Education,SubscriptionPlan,Churn
0,,35,52850.0,1500,6.0,Urban,Married,Bachelor's,Gold,Yes
1,Female,25,29500.0,800,3.0,Suburban,,High School,Bronze,No
2,Male,45,73500.0,2000,8.0,Rural,Married,Master's,Silver,No
3,Female,30,,1200,5.0,Urban,Single,Bachelor's,Bronze,No
4,Male,55,80400.0,2500,9.0,Suburban,Married,PhD,Gold,No


__Changing the target, Churn, to numeric values__

We convert the target from string values ('Yes'/'No') to integers (1/0) so that it can be used in the correlation analysis and by the regression and classification models later on.

In [54]:
# Mapping the 'Yes' and 'No' values to 1 and 0.
# Assigning the result back avoids calling an inplace method on a column selection.
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

df['Churn']

0      1
1      0
2      0
3      0
4      0
      ..
355    1
356    0
357    1
358    1
359    0
Name: Churn, Length: 360, dtype: int64

We have now converted the Churn datatype to int.
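
One thing worth checking in a conversion like this is whether every raw label was one of the expected values. The following is a minimal sketch, assuming `Series.map` is used for the conversion; the `raw` series is made-up data, not taken from train.csv. `map` turns any label missing from the dictionary into NaN, which makes stray spellings easy to spot.

```python
import pandas as pd

# Hypothetical raw labels, including one unexpected spelling
raw = pd.Series(["Yes", "No", "yes", "No"], name="Churn")

# Labels that are not keys of the dictionary become NaN
mapped = raw.map({"Yes": 1, "No": 0})

# Collect the raw labels that failed to map
unexpected = raw[mapped.isna()].unique()
print(unexpected)
```

If `unexpected` is empty, the column can safely be cast to an integer type.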

__Data profiling__

We will use the skimpy library to create a summary of the dataframe.

In [55]:
import skimpy as sk #importing the skimpy library

sk.skim(df) #create a summary of df information

Some key takeaways from this skimpy summary are that we now have 5 numeric features (including the target) and 5 categorical features. We also notice that there are missing values for the features Income, NumOfPurchases, Gender, Location, and MaritalStatus. We will need to handle these missing values.

__Handling missing values__

In [56]:
num_col = ['Income','NumOfPurchases'] # numeric features with missing values
cat_col = ['Gender','Location','MaritalStatus'] # categorical features with missing values

# Replace missing numeric values with the column mean (mean() skips NaN by default)
for col1 in num_col:
    df[col1] = df[col1].fillna(df[col1].mean())

# Replace missing categorical values with the mode of the feature
for col2 in cat_col:
    df[col2] = df[col2].fillna(df[col2].mode()[0])

df.isnull().sum()


Gender              0
Age                 0
Income              0
TotalPurchase       0
NumOfPurchases      0
Location            0
MaritalStatus       0
Education           0
SubscriptionPlan    0
Churn               0
dtype: int64

In [57]:
df.head()

Unnamed: 0,Gender,Age,Income,TotalPurchase,NumOfPurchases,Location,MaritalStatus,Education,SubscriptionPlan,Churn
0,Female,35,52850.0,1500,6.0,Urban,Married,Bachelor's,Gold,1
1,Female,25,29500.0,800,3.0,Suburban,Married,High School,Bronze,0
2,Male,45,73500.0,2000,8.0,Rural,Married,Master's,Silver,0
3,Female,30,54273.529412,1200,5.0,Urban,Single,Bachelor's,Bronze,0
4,Male,55,80400.0,2500,9.0,Suburban,Married,PhD,Gold,0


We now have no missing values in our dataframe.
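
One caveat with filling values on the full dataframe is that the mean and mode are computed from rows that may later end up in the test set. The following is a minimal sketch of learning the fill values from the training rows only; the `toy` dataframe and the fixed row split are assumptions for illustration, not our actual data or split.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "Income": [100.0, np.nan, 300.0, 400.0, np.nan, 600.0, np.nan, 800.0],
    "Gender": ["Female", "Male", np.nan, "Female", "Female", "Male", np.nan, "Male"],
})

# Illustrative split: first six rows for training, last two for testing
train, test = toy.iloc[:6], toy.iloc[6:]

# Learn the fill values from the training rows only
fill_values = {
    "Income": train["Income"].mean(),     # mean() skips NaN
    "Gender": train["Gender"].mode()[0],  # most frequent category
}

# Apply the same fill values to both parts
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```

On a dataset of this size the difference is likely small, but the pattern keeps test information out of the preprocessing step.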

__Checking the cardinality of categorical features__

In [58]:
df.select_dtypes('object').nunique()

Gender              2
Location            3
MaritalStatus       2
Education           4
SubscriptionPlan    3
dtype: int64

None of our categorical features has very low cardinality (a single value) or very high cardinality (nearly as many values as rows), so no feature needs to be dropped or regrouped on these grounds.

__High collinearity__

We will now inspect the correlation between the features to detect any cases of high collinearity.

In [59]:
corr_df = df.select_dtypes('number').corr()
corr_df

Unnamed: 0,Age,Income,TotalPurchase,NumOfPurchases,Churn
Age,1.0,0.989016,0.991159,0.97357,-0.578108
Income,0.989016,1.0,0.996362,0.979777,-0.576808
TotalPurchase,0.991159,0.996362,1.0,0.980369,-0.569293
NumOfPurchases,0.97357,0.979777,0.980369,1.0,-0.543626
Churn,-0.578108,-0.576808,-0.569293,-0.543626,1.0


In [60]:
fig = px.imshow(corr_df, color_continuous_scale='Spectral')
fig.update_layout(title='Heat Map: Correlation of Features', font=dict(size=12))
fig.show()

The correlation matrix shows that Age, Income, TotalPurchase, and NumOfPurchases are all strongly correlated with one another (every pairwise coefficient is above 0.97). As Income, TotalPurchase, and NumOfPurchases are important for churn predictions, we remove the Age feature to reduce this redundancy.
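
Reading a correlation matrix by eye gets harder as the number of features grows. The following is a minimal sketch of listing strongly correlated pairs programmatically; the `toy` data and the 0.95 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data: `b` is almost a multiple of `a`, `c` is unrelated
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "c": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) pairs and keep those above the threshold
pairs = upper.stack()
high_pairs = pairs[pairs > 0.95]
print(high_pairs)
```

Each pair in `high_pairs` is a candidate for dropping one of its two features.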

In [61]:
#dropping feature with high collinearity
df.drop(
    columns= 'Age',
    inplace= True
)

df.head()

Unnamed: 0,Gender,Income,TotalPurchase,NumOfPurchases,Location,MaritalStatus,Education,SubscriptionPlan,Churn
0,Female,52850.0,1500,6.0,Urban,Married,Bachelor's,Gold,1
1,Female,29500.0,800,3.0,Suburban,Married,High School,Bronze,0
2,Male,73500.0,2000,8.0,Rural,Married,Master's,Silver,0
3,Female,54273.529412,1200,5.0,Urban,Single,Bachelor's,Bronze,0
4,Male,80400.0,2500,9.0,Suburban,Married,PhD,Gold,0


## Storing the prepared data

__Creating a prepare data function__

We will now combine our data preparation code into a single function which will return a dataframe of prepared data ready for modelling.

In [62]:
def prepare_data(path): # path is the file location of the raw data
    prep_df = pd.read_csv(path) # read the raw data into a dataframe

    # Remove the irrelevant CustomerID feature and the collinear Age feature
    prep_df = prep_df.drop(columns=['CustomerID', 'Age'])

    # Map the 'Yes' and 'No' values to 1 and 0
    prep_df['Churn'] = prep_df['Churn'].map({'Yes': 1, 'No': 0})

    num_col = ['Income','NumOfPurchases'] # numeric features with missing values
    cat_col = ['Gender','Location','MaritalStatus'] # categorical features with missing values

    # Replace missing numeric values with the column mean
    for col in num_col:
        prep_df[col] = prep_df[col].fillna(prep_df[col].mean())

    # Replace missing categorical values with the mode of the feature
    for col in cat_col:
        prep_df[col] = prep_df[col].fillna(prep_df[col].mode()[0])

    # Standardise the column names to snake_case
    return clean_columns(prep_df)

__Calling the prepare_data function__

In [63]:
prepared_df = prepare_data('./data/train.csv')
prepared_df.to_csv('./data/prepared_data.csv', index=False) # index=False avoids writing the row index as an extra unnamed column

# __Milestone 2: Machine Learning Model Implementation__

## Data exploration

We will now explore our prepared data to gain more insight into its meaning and behaviour.

### Univariate analysis

We will start our analysis by looking at the state and behaviour of our target, Churn.

In [64]:
# Prepare data to display
labels = (
    prepared_df['churn']
    .map({0: 'No', 1: 'Yes'})
    .value_counts()
)

# Create figure using Plotly
fig = px.bar(
    x=labels.index,
    y=labels.values,
    title='Class Imbalance',
    color=labels.index
)

# Add titles & Display figure
fig.update_layout(xaxis_title='Churn', yaxis_title='Number of Customers')
fig.show()

For business purposes, we want to focus on the customers that do churn. The graph shows that a substantial share of customers has churned (about 30% of the dataset), and the business would like to reduce this number.
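
Because the classes are imbalanced, a plain random split can shift the churn ratio between the training and test sets. The following is a minimal sketch, assuming scikit-learn's `train_test_split` with its `stratify` argument; the `toy` dataframe is made up to mimic a 30/70 class split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 30 churners and 70 non-churners
toy = pd.DataFrame({"feature": range(100), "churn": [1] * 30 + [0] * 70})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["feature"]],
    toy["churn"],
    test_size=0.4,
    random_state=42,
    stratify=toy["churn"],  # preserve the class ratio in both splits
)

# Share of churners in each split
print(y_tr.mean(), y_te.mean())
```

With stratification, both splits keep roughly the same share of churners as the full dataset.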

### Bivariate/Multi-variate analysis

__Numeric Features__

We will now visualise the relationships of the numeric features against our target to understand their behaviour and impact.

In [65]:
plot_cols = ['income','total_purchase','num_of_purchases']

# Plot numeric features against target
for col in plot_cols:
    fig = px.box(data_frame=prepared_df, x=col, color='churn', title=f'BoxPlot for {col} Feature against the Target')
    fig.update_layout(xaxis_title=f'{col} Feature')
    fig.show()

From the box plots we conclude the following:

 - Customers with lower income are more likely to churn

 - Customers with lower total purchase amounts are more likely to churn

 - Customers with a lower number of purchases are also more likely to churn

__Categorical features__

In [66]:
plot_columns = ['gender','location','marital_status','education','subscription_plan']
for plot in plot_columns:
    new_df = (
        prepared_df[[plot, 'churn']]
        .groupby('churn')
        .value_counts()
        .reset_index(name='count') # name the counts column explicitly; its default name differs across pandas versions
    )

    # Plot Category feature vs label
    fig = px.bar(
        data_frame=new_df, 
        x=plot, 
        y='count', 
        facet_col='churn', 
        color=new_df['churn'].astype(str), # convert it to string to avoid continuous scale on legend
        title=f'{plot} vs Target'
    )

    fig.update_layout(xaxis_title=plot, yaxis_title='Number of Customers')
    fig.show()

Focusing on the customers that do churn, we notice from the graphs that:
 - More females are churning
 - Customers from urban areas are churning the most
 - More single customers are churning
 - Customers with bachelor's degrees are churning the most
 - Bronze level subscription plan customers are the ones that churn the most
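
Raw counts can be misleading when the categories differ in size, since a large category will show many churners even if its churn rate is low. The following is a minimal sketch of comparing churn rates instead of counts; the `toy` data is made up for illustration.

```python
import pandas as pd

# Hypothetical data: both plans have three churners, but the groups differ in size
toy = pd.DataFrame({
    "subscription_plan": ["Bronze"] * 6 + ["Gold"] * 4,
    "churn": [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the share of churners in each group
churn_rate = toy.groupby("subscription_plan")["churn"].mean()
print(churn_rate)
```

Here the two plans have the same number of churners, yet the rate is 0.5 for Bronze and 0.75 for Gold.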

## Model Evaluation

### Importing necessary libraries

In [67]:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_absolute_error

accuracy_scores = []
precisions = []
f1_scores = []
recalls = []
mae_scores = []

### Splitting the data

In [68]:
target = 'churn'
x = prepared_df.drop(columns=[target], inplace=False)
y = prepared_df[target]

x_Train, x_Test, y_Train, y_Test = train_test_split(x, y, test_size=0.4, random_state=42)

print(
    f'Training dataset \
    \nx_Train: {x_Train.shape[0]/len(x)*100:.0f}% \ny_Train: {y_Train.shape[0]/len(x)*100:.0f}% \
    \n\nValidation dataset \
    \nx_Val: {x_Test.shape[0]/len(x)*100:.0f}% \ny_Val: {y_Test.shape[0]/len(x)*100:.0f}%'
)

Training dataset     
x_Train: 60% 
y_Train: 60%     

Validation dataset     
x_Val: 40% 
y_Val: 40%


### Base accuracy

In [69]:
accuracy_Base = y_Train.value_counts(normalize=True).max()

print("Baseline Accuracy:", round(accuracy_Base, 2))

Baseline Accuracy: 0.71
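
The baseline accuracy is what a model would score by always predicting the majority class. The following is a minimal sketch of the same idea using scikit-learn's `DummyClassifier`, which also lets us see how such a baseline scores on recall; the labels below are made up, not our churn data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 7 non-churners and 3 churners
y_true = np.array([0] * 7 + [1] * 3)
X_dummy = np.zeros((10, 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
print(acc, rec)
```

The majority-class baseline reaches 70% accuracy on this toy data while finding none of the churners, which is why accuracy alone is not enough for a churn problem.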


### Linear Regression model

In [70]:
# Encode, build, and fit model
lin_model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    LinearRegression()
)
lin_model.fit(x_Train, y_Train)

# Predict on the test set and threshold the continuous output at 0.5
y_test_lin_prob = lin_model.predict(x_Test)
y_test_lin_pred = (y_test_lin_prob > 0.5).astype(int)

# Populate evaluation metrics
accuracy_scores.append(round(accuracy_score(y_Test, y_test_lin_pred),4))
precisions.append(round(precision_score(y_Test, y_test_lin_pred),4))
recalls.append(round(recall_score(y_Test, y_test_lin_pred),4))
f1_scores.append(round(f1_score(y_Test, y_test_lin_pred),4))
mae_scores.append(round(mean_absolute_error(y_Test,y_test_lin_pred),4))

### Logistic Regression model

In [71]:
# Encode, build, and fit model
log_Model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    LogisticRegression(max_iter=5000)
)
log_Model.fit(x_Train, y_Train)

# Predict on the test set
y_test_log_pred = log_Model.predict(x_Test)

# Populate evaluation metrics
accuracy_scores.append(round(accuracy_score(y_Test, y_test_log_pred),4))
precisions.append(round(precision_score(y_Test, y_test_log_pred),4))
recalls.append(round(recall_score(y_Test, y_test_log_pred),4))
f1_scores.append(round(f1_score(y_Test, y_test_log_pred),4))
mae_scores.append(round(mean_absolute_error(y_Test,y_test_log_pred),4))
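
The linear regression cell thresholds a continuous prediction at 0.5, whereas logistic regression models a probability directly. The following is a minimal sketch of how the two outputs differ; the one-dimensional arrays are made-up data for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature and a binary label
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Linear regression returns unbounded scores
lin_pred = LinearRegression().fit(X_toy, y_toy).predict(X_toy)

# Logistic regression returns class probabilities; column 1 is P(label = 1)
log_prob = LogisticRegression().fit(X_toy, y_toy).predict_proba(X_toy)[:, 1]

print(lin_pred.round(2))  # values can fall below 0 or above 1
print(log_prob.round(2))  # always within [0, 1]
```

Both can be thresholded into class labels, but only the logistic output can be read as a churn probability.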

### Decision Tree model

In [72]:
tree_hyperparam = range(1, 8)

# List of scores for visualization
train_Scores = []
test_Scores = []

for i in tree_hyperparam:
    # Encode, build, and fit model
    tree_Model = make_pipeline(
        OneHotEncoder(use_cat_names=True),
        DecisionTreeClassifier(max_depth=i, random_state=42)
    )
    tree_Model.fit(x_Train, y_Train)
    
    # Training accuracy score
    train_Scores.append(tree_Model.score(x_Train, y_Train))
    
    # Testing accuracy score
    test_Scores.append(tree_Model.score(x_Test, y_Test))

tune_data = pd.DataFrame(
    data = {'Training': train_Scores, 'Testing': test_Scores}, 
    index=tree_hyperparam
)

fig = px.line(
    data_frame=tune_data, 
    x=tree_hyperparam, 
    y=['Training', 'Testing'], 
    title="Decision Tree model training & testing curves"
)
fig.update_layout(xaxis_title ="Maximum Depth", yaxis_title="Accuracy Score")
fig.show()

# Note: tree_Model is the last model fitted in the loop above (max_depth=7)
y_test_tree_pred = tree_Model.predict(x_Test)

accuracy_scores.append(round(accuracy_score(y_Test, y_test_tree_pred),4))
precisions.append(round(precision_score(y_Test, y_test_tree_pred),4))
recalls.append(round(recall_score(y_Test, y_test_tree_pred),4))
f1_scores.append(round(f1_score(y_Test, y_test_tree_pred),4))
mae_scores.append(round(mean_absolute_error(y_Test,y_test_tree_pred),4))
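
The loop compares tree depths by their score on the test set, which means the test set influences the choice of depth. The following is a minimal sketch of one alternative, assuming scikit-learn's `GridSearchCV` with 5-fold cross-validation on the training data only; the synthetic data from `make_classification` stands in for our prepared features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared churn data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=42)

# Try each depth with 5-fold cross-validation on the training set
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": list(range(1, 8))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_depth = search.best_params_["max_depth"]
test_acc = search.score(X_te, y_te)  # the test set is used once, at the end
print(best_depth, round(test_acc, 4))
```

After fitting, `search` holds the tree refitted at the best depth, so it can be used for the final predictions directly.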

### MODEL

__Encoding categorical variables__

In [73]:
from category_encoders import OneHotEncoder

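# Note: `cols` lists the numeric columns and the target as well, so every distinct
# value of those columns becomes its own indicator column in the output below
ohe = OneHotEncoder(
    use_cat_names=True,
    cols=['Gender','Income','TotalPurchase','NumOfPurchases','Location','MaritalStatus','Education','SubscriptionPlan','Churn']
)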

encoded_df = ohe.fit_transform(df)
encoded_df.head()

Unnamed: 0,Gender_Female,Gender_Male,Income_52850.0,Income_29500.0,Income_73500.0,Income_54273.529411764706,Income_80400.0,Income_22500.0,Income_58500.0,Income_35900.0,...,MaritalStatus_Single,Education_Bachelor's,Education_High School,Education_Master's,Education_PhD,SubscriptionPlan_Gold,SubscriptionPlan_Bronze,SubscriptionPlan_Silver,Churn_1.0,Churn_0.0
0,1,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,1,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,1
2,0,1,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,1
3,1,0,0,0,0,1,0,0,0,0,...,1,1,0,0,0,0,1,0,0,1
4,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,1,1,0,0,0,1


In [74]:
df.columns


Index(['Gender', 'Income', 'TotalPurchase', 'NumOfPurchases', 'Location',
       'MaritalStatus', 'Education', 'SubscriptionPlan', 'Churn'],
      dtype='object')

In [75]:
encoded_df.mean()

Gender_Female              0.511111
Gender_Male                0.488889
Income_52850.0             0.002778
Income_29500.0             0.005556
Income_73500.0             0.002778
                             ...   
SubscriptionPlan_Gold      0.280556
SubscriptionPlan_Bronze    0.372222
SubscriptionPlan_Silver    0.347222
Churn_1.0                  0.300000
Churn_0.0                  0.700000
Length: 213, dtype: float64

In [76]:
import pandas as pd
from category_encoders import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline

# Separate features (X) and target variable (y).
# clean_columns renamed the columns to snake_case, so the target is 'churn'.
X = prepared_df.drop(columns='churn')
y = prepared_df['churn']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encode the categorical features, then fit a Random Forest Regressor
rf_regressor = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    RandomForestRegressor(n_estimators=100, random_state=42)
)

# Train the model on the training data
rf_regressor.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = rf_regressor.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R^2): {r2:.4f}")

In [None]:
df['Gender'].unique()

In [None]:
df['MaritalStatus'].value_counts()

### Label encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

lbl_encd = LabelEncoder()

In [None]:
# Encode labels in column 'Location'.
df['Location']= lbl_encd.fit_transform(df['Location']) 

In [None]:
df['Location'].unique()

In [None]:
df['Education'].unique()

In [None]:
# Encode labels in column 'Education'. 
df['Education']= lbl_encd.fit_transform(df['Education']) 

In [None]:
df['Education'].unique()

In [None]:
df['SubscriptionPlan'].unique()

In [None]:
# Encode labels in column 'SubscriptionPlan'. 
df['SubscriptionPlan']= lbl_encd.fit_transform(df['SubscriptionPlan']) 

In [None]:
df['SubscriptionPlan'].unique()

In [None]:
encoded_df = lbl_encd.fit_transform(df['MaritalStatus'])

Check the Categorical and Numerical Columns.

In [None]:
# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :',cat_col)
# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :',num_col)


In [None]:
# encoded_df holds the label-encoded MaritalStatus values from the cell above
label_encoded_data = pd.DataFrame(data = encoded_df, columns = ['Encoded_MaritalStatus'])
label_encoded_data

In [None]:
a = df.copy()
a = a.assign(MaritalStatus = label_encoded_data['Encoded_MaritalStatus'])

### One-hot encoding


In [None]:
# prepared_df uses the snake_case column names; dtype=int yields 1/0 instead of True/False
one_hot_encoded_data = pd.get_dummies(prepared_df, columns = ['gender','marital_status'], dtype=int)
one_hot_encoded_data

### Gradient Boosting Classifier accuracy

In [None]:
# Import models and utility functions
from category_encoders import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Instantiate the Gradient Boosting Classifier behind the same one-hot encoding step as the other models
gbc = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    GradientBoostingClassifier(
        n_estimators=300,
        learning_rate=0.05,
        random_state=100,
        max_features=5
    )
)

# Fit to the churn training set created in the "Splitting the data" section
gbc.fit(x_Train, y_Train)

# Predict on test set
pred_y = gbc.predict(x_Test)

# Accuracy
acc = accuracy_score(y_Test, pred_y)
print("Gradient Boosting Classifier accuracy is : {:.2f}".format(acc))


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Separate features (X) and target variable (y).
# Drop the target from the features to avoid leakage, and one-hot encode any remaining text columns.
X = pd.get_dummies(df.drop(columns='Churn'))
y = df['Churn']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest model
clf = RandomForestClassifier(n_estimators=100, random_state=42)  # Adjust n_estimators as needed

# Train the model on the training data
clf.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Accuracy: {accuracy:.4f}")

### MODEL

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans


In [None]:
# Build Model
import pandas as pd
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

from sklearn.compose import ColumnTransformer
# Alias scikit-learn's encoder so it does not shadow the category_encoders OneHotEncoder used earlier
from sklearn.preprocessing import StandardScaler, OneHotEncoder as SkOneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

# Use the prepared dataframe, whose snake_case column names match the lists below
X = prepared_df.copy()

# Define the column transformer to handle numerical and categorical columns separately
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['income','total_purchase','num_of_purchases']),  # numerical columns
        ('cat', SkOneHotEncoder(), ['gender', 'location', 'marital_status', 'education', 'subscription_plan'])  # categorical columns
    ])

# Build the pipeline with preprocessing and KMeans model
model = Pipeline([
    ('preprocessor', preprocessor),
    ('kmeans', KMeans(n_clusters=5, random_state=42))
])

# Fit model and assign labels
X['Clusters'] = model.fit_predict(X)



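## Feature engineering

In [None]:
# Load the raw CSV file into a pandas dataframe
import pandas as pd
df = pd.read_csv('./data/train.csv')

In [None]:
# Handle missing values: column mean for numeric features, mode for categorical ones.
# Filling with 0 would make NumOfPurchases zero for some rows and break the ratio features below.
df = df.fillna(df.mean(numeric_only=True)).fillna(df.mode().iloc[0])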

In [None]:
# Convert categorical variables into numerical ones using one-hot encoding
df = pd.get_dummies(df, columns=['Gender', 'Location', 'MaritalStatus', 'Education', 'SubscriptionPlan'])

In [None]:
# Create new features from existing ones to capture relevant relationships
df['PurchasePerVisit'] = df['TotalPurchase'] / df['NumOfPurchases']
df['IncomePerPurchase'] = df['Income'] / df['TotalPurchase']

In [None]:
# Bin the continuous Age variable into discrete ranges to capture non-linear relationships
# (ages above 60 fall outside the last bin and become NaN)
bins = [0, 30, 40, 50, 60]
labels = ['Young', 'Middle-aged', 'Senior', 'Elderly']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels)

In [None]:
# Combine multiple features into a single interaction feature
df['FinancialStatus'] = df['Income'] * df['TotalPurchase']

In [None]:
# Remove the irrelevant feature to improve model performance
df = df.drop(['CustomerID'], axis=1)

In [None]:
# Average purchase per customer (same calculation as PurchasePerVisit above)
df['AveragePurchase'] = df['TotalPurchase'] / df['NumOfPurchases']
print(df)


## Evaluation comparison

In [None]:
metrics = {
        'Accuracy': accuracy_scores,
        'Precision': precisions,
        'F1-Score': f1_scores, 
        'Recall': recalls,
        'MAE': mae_scores
    }

pd.DataFrame(
    data=metrics, 
    index=['Linear Regression','Logistic Regression', 'Decision Tree']
).sort_values(
    by='Accuracy', 
    ascending=False
)
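
A single accuracy figure hides which kind of mistake a model makes, and for churn a missed churner usually matters more than a false alarm. The following is a minimal sketch of breaking predictions down with scikit-learn's `confusion_matrix`; the label lists are made up, not results from our models.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = churn)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"True negatives: {tn}, false positives: {fp}")
print(f"False negatives (missed churners): {fn}, true positives: {tp}")
```

In this toy case the accuracy is 70%, yet two of the three churners are missed.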