![Anz](https://s3.amazonaws.com/pro.brandkit.io/accounts/anz/asset_files/201181/large_thumb_preview.png?updatedAt=2016-11-17T02:33:44 "Anz")

> **Project Title:** Customer Salary Prediction<br>
> **Project Owner:** David Adarkwah<br>
> **Email:** davidwyse48@gmail.com<br>
> **Github Profile:** [Github](github.com/Adark-Amal)<br>
> **LinkedIn Profile:** [LinkedIn](https://www.linkedin.com/in/d-adark/)

## Table of Contents <a id='mu'></a>

* [Business Problem Understanding](#bpu)
    * [Problem Statement](#ps)
    * [Hypothesis](#hs)
    * [Project Goal](#pg)
    * [Information Needed](#in)
    * [Methodology](#my)
* [Data Preparation](#dp)
    * [Data Quality Assessment](#dqa)
    * [Data Cleaning and Preprocessing](#dcp)
* [Exploratory Data Analysis](#eda) 
* [Statistical Analysis](#sa)
    * [Removing Outliers](#ro)
    * [Normality Test](#nt)
    * [Homogeneity of Variances](#hov)
    * [Box-Cox Transformation](#bct)
    * [Hypothesis Test](#ht)
    * [Effect Size](#es)
    * [Confidence Interval](#ci)
* [Modeling](#dm)
    * [Data Understanding](#du)
        * [Descriptive Statistics](#ds)
        * [Data Visualization](#dv)
    * [Feature Engineering and Selection](#fes)
    * [Splitting Dataset](#sd)
    * [Algorithm Evaluation](#ae)
    * [Hyperparameter Tuning](#pt)
    * [Finalizing Model](#fm)
    * [Model Understanding](#mdu)
    * [Save Model](#sm)
* [Conclusion and Recommendation](#cr)
* [References](#r)

## 1. Business Problem Understanding<a id='bpu'></a>

[Move Up](#mu)

<p style="text-align:justify;">The first step in approaching a data science problem is problem understanding. This step is very important since it allows us to know the kind of decisions we want to make, the information or data that will be needed to inform those decisions and finally, the kind of analysis that will be used to arrive at those decisions. In a nutshell, developing a mental model of the problem allows us to properly structure potentially relevant information needed to solve the problem.</p>

### 1.1 Problem Statement <a id='ps'></a>

ANZ has a synthesised transaction dataset containing 3 months’ worth of transactions for 100 hypothetical customers. It contains purchases, recurring transactions, and salary transactions. Based on this dataset, ANZ will want to understand the behaviours of their customers and how transactions are undertaken by each hypothetical customer and finally, be able to predict the annual salary of their present and potential customers.

### 1.2 Hypothesis <a id='hs'></a>

It is possible to predict the annual salary for each customer using a predictive model. The hypothesis to be considered is that the annual salary for each customer can be estimated based on a couple of factors such as age and purchasing behaviour of the customer.

### 1.3 Project Goal <a id='pg'></a>

In this project, we seek to achieve two main goals:

* Segment dataset and draw unique insights, including visualization of the transaction volume and assessing the effect of any outliers.
* Explore correlations between customer attributes and build a regression model and a decision-tree prediction model based on our findings.

### 1.4 Information Needed <a id='in'></a>

In order to test the hypothesis of whether annual salary can be estimated using the age and purchasing behaviour of the customers, we would need to acquire the data needed to test the hypothesis and perform Exploratory Data Analysis. This will help us determine other factors that might help us predict annual salary of customers, consequently allowing us to make plausible decisions.

We would need the following data to be able to perform EDA and build our model.
1. **`Customer data`** - which should include characteristics of each customer, for example, age, education, transaction mode of customer etc.
2. **`Salary data`** - which would indicate current salary of customers.
3. **`Historical transaction data`** – which should indicate every transaction the customer has performed.

### 1.5 Methodology <a id='my'></a>

<p style="text-align:justify;">The methodology used for this project depends largely on the goals we set out to achieve. The methodology framework below provides a comprehensive guide to the approach that will help us achieve them.</p>
<br>
<p style="text-align:center;font-weight:bold;font-size:20px"> Methodology Framework</p>
<br>
<img src='https://artofdatablog.files.wordpress.com/2017/10/methodology-map.jpg' style="float:center;width:700px;">

Once we have the data, we would need to engineer features based on the data that we obtain, and build a model suitable for continuous numeric predictions (e.g., Linear Regression, Decision Tree, Random Forest, Gradient Boosted Machines to name a few), picking the most appropriate model based on the tradeoff between the complexity, the understanding, and the error margin.

## 2. Data Preparation <a id='dp'></a>

[Move Up](#mu)

An understanding of the data, coupled with an understanding of the problem, will help us clean and prepare our data for analysis. It is rare to acquire ready-to-use data for any analysis without some level of preparation. To prepare our data, we normally assess its quality and then cleanse, format, blend and sample it, since we may encounter various issues with the columns. These issues may include:

* **`Missing values:`** meaning column values are incomplete
* **`Incorrect data:`** meaning values that are not expected for the column
* **`Inconsistent values:`** meaning some values may fall outside the expected range
* **`Duplicate values:`** meaning the same record appears more than once
* **`Inconsistent data type:`** meaning values entered in a column do not match the type the column is meant to hold

To properly prepare our data for analysis, we will perform two important tasks:

* Part I: Data Quality Assessment
* Part II: Data Cleaning and Preprocessing 

### 2.1 Data Quality Assessment <a id='dqa'></a>

<p style="text-align:justify;">The first task under the data preparation step is an initial assessment of the quality of the data, which will allow us to clean it properly. We will use this section to write any code necessary for inspecting the dataset. Once completed, we will leave our report in the Data Quality Report Document.

At the end of our inspection, we will provide a summary of all of our findings.</p>

In [None]:
# import libraries needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from analyticViz.viz import *
import seaborn as sns
import datetime
from wordcloud import WordCloud, STOPWORDS
import scipy.stats as stats
from scipy.special import inv_boxcox
import random
import json
import warnings
import pickle

warnings.filterwarnings('ignore')
sns.set(color_codes=True)
pd.set_option('display.max_columns',100)
%matplotlib inline

In [None]:
# load data using pandas

data = pd.read_csv('../data/anz.csv')

In [None]:
# inspect the shape of the dataframe

data.shape

> We can see from the above results that we have `12043` observations and `23` columns. The data is rich enough to help us perform our analysis as well as build the predictive models. However, we will have to assess the quality of the data and make the necessary cleaning before setting out to achieve our goals.

In this step we will assess the quality of the data and make recommendations for cleaning it.

In [None]:
# data information

data.info()

In [None]:
# summary statistics

data.describe()

In [None]:
# check duplicates

data.duplicated(keep=False).unique()

We will proceed by checking whether the data contains any missing values. The output of `info()` already hints at this, but we want to confirm the extent explicitly.

In [None]:
# function to determine the percentage of missing values in our data

def missing_values(data):
    """Function that checks for null values and computes the percentage of null values
    Args:
        data: dataframe - data whose missing value is to be determined
    Return:
        missing_output: dataframe - dataframe of total null values with corresponding percentages
    """
    total = data.isnull().sum().sort_values(ascending=False) 
    percentage = round((total / data.shape[0]) * 100, 2)
    
    missing_output = pd.concat([total, percentage], axis=1, keys=['Total','Percentage'])
    
    return missing_output

In [None]:
# call the missing values function on data

miss_values = missing_values(data)
miss_values

In [None]:
# visualize the extent of the missing values

plt.figure(figsize=(9,9))
sns.heatmap(data.isnull(), cbar=False)
plt.show()

#### Data Quality Summary:

This data presents a lot of opportunity for data cleaning, because most of the features have issues that have to be resolved. We also have to determine which columns will not be needed, for various reasons, and drop them.

Based on the outputs displayed above, here is a summary of the data quality issues that have to be fixed during the data cleaning process:<br>

* Because of the large number of null values in some columns, we will drop every column whose percentage of missing values is greater than 60%, rather than dropping rows, in order to keep a large portion of our data. Columns to drop include:
    * merchant_code
    * bpay_biller_code


* Also, some columns will not be needed for either the analysis we seek to perform or the model building. These columns are:
    * account (customer account numbers won't be needed for the task ahead)
    * currency (because it is all in AUD based on the summary statistics results - unique value = 1)
    * country (because all customers are from Australia - unique value = 1)


* We also need to properly format the data types for some of the columns which include:
    * long_lat (split the values and format to float data type)
    * date (format to datetime object)
    * extraction (format to datetime objects. Similar to the date column and we might possibly drop it)
    * merchant_long_lat (split and format to float data type)


* Finally, based on the analysis we want to perform we might engineer new features.

### 2.2 Data Cleaning and Preprocessing<a id='dcp'></a>

<p style="text-align:justify">The preprocessing step (usually an iterative process) is carried out to clean the data based on data quality issues identified. During the data quality assessment, we identified various data quality issues including missing values, incorrect data, inconsistent values, etc. 

In this task we will perform all the initial data cleaning and preprocessing needed to produce data that will be suitable for our analysis.</p>

#### Handling Missing Values

In this step, we will drop columns with a high percentage of missing values. Columns with a lower percentage of missing values will either be dropped or imputed during model building.

In [None]:
# We are dropping based on percentage of missing values (greater than 60%).

data = data.drop(columns=['merchant_code', 'bpay_biller_code'])
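
The 60% rule can also be applied programmatically instead of naming the columns by hand. The following is a minimal sketch on a small made-up frame; `toy`, `threshold` and `drop_sparse_columns` are illustrative names and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, threshold=0.60):
    """Return a copy of frame without columns whose share of nulls exceeds threshold."""
    null_share = frame.isnull().mean()                   # fraction of nulls per column
    to_drop = null_share[null_share > threshold].index   # columns above the cut-off
    return frame.drop(columns=to_drop)


# hypothetical toy data: 'sparse' is 80% null, 'partial' is 20% null
toy = pd.DataFrame({
    'amount': [10.0, 20.0, 30.0, 40.0, 50.0],
    'sparse': [np.nan, np.nan, np.nan, np.nan, 1.0],
    'partial': [1.0, np.nan, 3.0, 4.0, 5.0],
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Returning a new frame rather than dropping in place keeps the raw data available for later checks.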

#### Dropping Unneeded Columns

We highlighted during the data quality assessment that there are some columns that won't be needed for both our analysis and model building. Therefore, we will have to drop those columns entirely.

In [None]:
# columns have single values

data.nunique()

In [None]:
# We are dropping irrelevant columns.

data = data.drop(columns=['account', 'currency', 'country'])

#### Date Formatting

The date column for this data will be formatted to datetime. Note that the **`extraction`** column contains both a date and a time and would also need to be formatted to datetime. It records the time of day at which each transaction was performed, but its date matches the **`date`** column. Since we are only interested in the day the transaction was made and not the specific time, we will drop the extraction column and base our analysis on the date column.

In [None]:
# We are dropping extraction column.

data = data.drop(columns=['extraction'])

In [None]:
# transform the dates to datetime objects for easy access of dates

data['date'] = pd.to_datetime(data['date'], format='%m/%d/%Y')

#### Splitting Coordinates Column

In this step we will split the values of both the `long_lat` column and the `merchant_long_lat` column and assign them to separate longitude and latitude columns.

In [None]:
# splitting and formatting long_lat column
# note: splitting on '-' discards the minus sign, so latitude is stored as a positive magnitude

data['longitude'] = [float(value.split('-')[0].strip()) for value in data['long_lat']]
data['latitude'] = [float(value.split('-')[1].strip()) for value in data['long_lat']]

In [None]:
# splitting and formatting merchant_long_lat column

data['merchant_longitude'] = [float(value.split('-')[0].strip()) if pd.notna(value) else np.nan
                              for value in data['merchant_long_lat']]

data['merchant_latitude'] = [float(value.split('-')[1].strip()) if pd.notna(value) else np.nan
                             for value in data['merchant_long_lat']]
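
Because the cells above split on `'-'`, the latitude keeps only the text after the minus sign. If signed coordinates are needed, for example for plotting on a map, the sketch below shows one way to keep the sign. It assumes each value is a whitespace-separated "longitude latitude" pair, which should be checked against the raw data; the sample values and the helper name `split_coordinates` are hypothetical.

```python
import numpy as np
import pandas as pd


def split_coordinates(series):
    """Split 'lon lat' strings into float columns, keeping signs and missing values."""
    parts = series.str.split(expand=True)   # split on whitespace; NaN rows stay NaN
    return pd.DataFrame({
        'longitude': pd.to_numeric(parts[0]),
        'latitude': pd.to_numeric(parts[1]),
    })


# hypothetical values in the assumed "longitude latitude" format
toy = pd.Series(['153.41 -27.95', np.nan, '151.23 -33.94'])
coords = split_coordinates(toy)
print(coords)
```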

After creating these new columns, we will remove the `long_lat` and `merchant_long_lat` columns.

In [None]:
# dropping previous columns

data = data.drop(columns=['long_lat', 'merchant_long_lat'])

In [None]:
# have a glimpse of the data

data.head(2)

#### Data Type Formatting

In this step we will format the columns to the correct data types which will make our analysis easier.

In [None]:
# inspect the data types

data.info()

In [None]:
# convert all objects to category dtypes

data[data.select_dtypes(['object']).columns] = data.select_dtypes(['object']).apply(lambda x: x.astype('category'))

After the data cleaning and preprocessing step comes the exploratory data analysis task. In this step we will explore our data to discover hidden trends and generate new insights which will lead us to better understand the customers.

## 3. Exploratory Data Analysis<a id='eda'></a>

[Move Up](#mu)

One of the goals for this project, as mentioned earlier, is to segment the dataset and draw unique insights, including visualizing the transaction volume and assessing the effect of any outliers. Based on this goal, we will perform a set of analyses to obtain insights that can help us arrive at plausible conclusions.

To achieve the first goal, we will look at the general distributions of our features and try to answer the questions listed below:


* Are males performing more transactions as compared to females?
* What is the average spending by the customers?
* Are most of the transactions authorized?
* Between males and females, who spends more?
* In which suburb do most of the transactions take place?
* How does spending vary with state?
* How did the average amount spent by customers change over time (days, weeks)?

<b>NB: Questions that can be answered are not limited to the ones stated above.</b>

#### Feature Exploration

In [None]:
scatter(data, 'age', 'amount', 'Transaction Amount Vs Age', 'Age', 'Amount', color='gender', render='webgl', f_col='status')

> For authorized transactions, we can see that most of the amounts transacted are relatively low, and these are mostly spread among customers aged between `20` and `50` years. Also, the highest authorized transaction amount was completed by a female customer, despite the fact that more males perform high-amount transactions. On the flip side, posted transaction amounts are spread roughly evenly among males and females but are still dominated by young to adult customers.

In [None]:
scatter(data, 'age', 'balance', 'Account Balance Vs Age', 'Age', 'Balance', color='gender', render='webgl')

> Customers with high account balances are mostly aged between 35 and 50 years. Most customers have account balances ranging between `1` and `25000 AUD`.

In [None]:
box(data, 'gender', 'balance', 'Account Balance Per Gender', 'Gender', 'Balance')

> On average, male customers tend to hold more money in their accounts than females, that is `7967 AUD` compared to `4883 AUD`. However, this is not conclusive, as the difference might be due to factors such as the outliers clearly visible in the box plot, or other extraneous factors. We will therefore need statistical analysis to draw a confident conclusion about the difference.

> Looking at the graph, we can also see that `75%` of female customers have at most approximately `8990 AUD` available in their accounts whereas `75%` of male customers have at most approximately `15658 AUD` in their accounts.

> That said, we can see there are a lot of outliers in the dataset which we will need to handle later during analysis.

In [None]:
box(data, 'gender', 'amount', 'Transaction Amount Per Gender', 'Gender', 'Amount')

> Looking at the transaction amount per gender, we can see that the distributions look much the same, although males transact approximately `2 AUD` more on average than females. Again, we can see some outliers in the distribution, which we will look at.

In [None]:
hist(data, 'age', 'Age Distribution of Customers', 'box', color='gender')

> The age distribution of customers is slightly skewed to the right due to the few outliers in the dataset. Ages range from `18 - 78 years` for males and `18 - 64 years` for females. The average ages are close, at approximately `28` for males and `27` for females.

In [None]:
hist(data, 'amount', 'Amount Distribution of Customers', 'box', color='gender')

> The distribution is right skewed with only a few customers performing huge amount transactions. In fact `75%` of the customers transacted `55 AUD` or less. The maximum amount transacted by males is approximately `9000 AUD` and approximately `7000 AUD` for females. 

#### Data Analysis

#### Question 1: Are males transacting larger amounts than females?

In [None]:
sun(data, ['gender', 'movement', 'txn_description'], 'amount', 'Transaction Amount Per Gender')

> The total amount transacted by males stands at `1,292,961 AUD` and that of females at `970,322 AUD`. There is a large difference of about `300,000 AUD`, which again might be due to there being more male customers than female customers. In addition, a large part of the total amount came from `credit` transactions, which were mainly `salary` payments.

#### Question 2: What is the average spending by the customers?

For this question we will use the `movement` and `amount` columns. Note that we are only talking about customers' average spending, which means we are looking at money moving out of their accounts. Therefore, we will select only the transactions that resulted in a debit and then compute the average amount for those transactions.

In [None]:
# segmenting data for only debit transactions

debit_data = data[data['movement'] == 'debit']

In [None]:
# compute average spend

deb_dat = debit_data[['customer_id', 'amount', 'first_name', 'gender']]
dat = deb_dat.groupby(['customer_id', 'first_name', 'gender']).mean().sort_values(by='amount', ascending=False)
dat.reset_index(inplace=True)

In [None]:
Hbar(dat.head(20), 'amount', 'customer_id', 'amount', 'Top 20 Spenders', 'AVG Spend', 'Customer', color='gender')

> The top spenders are mostly males, with only 5 females appearing in the list. The top male spender spends about `190 AUD` on average, whereas the top female spender spends about `160 AUD` on average.

#### Question 3: Are most of these transactions authorized?

We will use the `status` column to answer this question.

In [None]:
tran_stat = data[['customer_id', 'status', 'gender']]
status = tran_stat.groupby(['status', 'gender']).count()
status.reset_index(inplace=True)
status.rename(columns={'customer_id':'count'}, inplace=True)

In [None]:
Vbar(status, 'status', 'count', 'count', 'Transactions Authorized per Gender', 'Status', 'Count','gender', 'stack')

>`Authorized` transactions are transactions that have been approved but for which the account is yet to be debited, whereas `posted` transactions have already been processed and the accounts debited. In relation to our analysis, we can see that most transactions have been performed but are yet to be debited. Completed (posted) transactions number approximately `5000`.

#### Question 4: Which suburb do most of the transactions take place?

We now want to look at debit transactions performed in each suburb. This will give us a clue as to the suburbs in which most customers spend their money. We will use the `merchant_suburb` column from the debit data.

In [None]:
text = " ".join(str(suburb) for suburb in debit_data['merchant_suburb'].dropna())

stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)

In [None]:
plt.figure(figsize = (9, 5), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Suburb with Highest Number of Debit Transactions\n", size=15)
plt.tight_layout(pad = 0)
plt.show()

> It is not surprising that `Sydney` and `Melbourne` stand out as the suburbs where customers prefer to spend their money. An online [blog post](https://www.wsfm.com.au/entertainment/these-sydney-suburbs-are-spending-the-most-when-it-comes-to-online-shopping) highlighted the spending habits in these two suburbs. We can also see other suburbs like `Brisbane City`, `Southport`, `Adelaide`, etc.

#### Question 5: How does spending vary with state?

We will visualize the spending amount for each state. To do this we will use the `merchant_state` and `amount` columns of our data.

In [None]:
# reassigning the names of the states

new_state = {'QLD': 'Queensland', 'NSW': 'New South Wales', 'VIC': 'Victoria', 'WA': 'Western Australia', 
             'SA': 'South Australia', 'NT': 'Northern Territory', 'TAS': 'Tasmania', 'ACT': 'Australian Capital Territory'}


map_data = data[data['movement'] == 'debit']
map_data = map_data.replace({'merchant_state': new_state})

In [None]:
plot_map = map_data.loc[:, ['merchant_state', 'amount']]
now = plot_map.groupby('merchant_state').sum()
now.reset_index(inplace=True)

In [None]:
# load json file containing properties for each state
with open('../australian-states.min.geojson', 'r') as geojson_file:
    state = json.load(geojson_file)


# creating a mapping between id and states
state_id_map = {}
for feature in state['features']:
    state_id_map[feature['properties']['STATE_NAME']] = feature['id']
    
    
# assigning unique id to states in data 
now['id'] = now['merchant_state'].apply(lambda x: state_id_map[x])
now['id'] = now['id'].astype('Int64')

In [None]:
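# plotting the map of debit transactions for each state
# px is plotly.express; imported explicitly in case the star import above does not expose it
import plotly.express as px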

fig = px.choropleth_mapbox(
    now,
    locations="id",
    geojson=state,
    color="amount",
    hover_name="merchant_state",
    hover_data=["amount"],
    title="Spending Per State",
    mapbox_style="carto-positron",
    center={"lat": -26, "lon": 133},
    color_continuous_scale=px.colors.diverging.BrBG,
    color_continuous_midpoint=0,
    zoom=3,
    opacity=0.7,
)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

fig.show()

> We can see from the visualization above that `New South Wales` has the highest amount spent, with a total of about `102,000 AUD`, and is closely followed by `Victoria` with approximately `88,000 AUD`. `Queensland`, `Western Australia`, `South Australia`, `Northern Territory` and `Tasmania` follow in that order.

#### Question 6: How did the amount spent by customers change over time (days, weeks)?

Now we want to analyse the spending habits of customers over time. In this task we will look at spending habits per gender over time. We will need the date and amount columns to answer this question.

In [None]:
dat = debit_data.loc[:, ['date', 'amount', 'gender']]
amount_per_day = pd.DataFrame({'median' : dat.set_index('date').groupby('gender'). \
                               resample('D')["amount"].median()}).reset_index()
amount_per_week = pd.DataFrame({'median' : dat.set_index('date').groupby('gender'). \
                               resample('W')["amount"].median()}).reset_index()

In [None]:
line(amount_per_day, 'date', 'median', 'Time', 'Amount Spent', 'Daily Average Spend Per Gender', color='gender')

In [None]:
line(amount_per_week, 'date', 'median', 'Time', 'Amount Spent', 'Weekly Average Spend Per Gender', color='gender')

>There is no defined trend in the daily average spend in the visualization above. However, we can see that the highest average spend in a day was on `Sept 17, 2018`, at `35.5 AUD`, and the lowest was recorded on `Oct 30, 2018`, at `20 AUD`. We would expect spending to rise during the festive period, so it is a bit surprising that the average spend has not started to peak; instead we see an average spend of only about `25 AUD` as at `24 November, 2018`. Also, there is no record for `August 16, 2018`, which will need further investigation. Note that the distribution of the amount is skewed, as we saw earlier. Therefore, to summarize the typical amount spent properly, we opted for the median rather than the mean.
>
>For the weekly average spend, we can see a trend. The average spend decreases in the last week of every month. It could be that most customers have spent a large portion of their salary by then and are managing what is left, hence the decrease in the final week.
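
To illustrate why the median was preferred for skewed amounts, here is a small sketch on made-up transactions. It groups with `pd.Grouper` as a stand-in for the `resample` call used above, and the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# hypothetical debit transactions: one unusually large male purchase on 1 August
toy = pd.DataFrame({
    'date': pd.to_datetime(['2018-08-01', '2018-08-01', '2018-08-01',
                            '2018-08-01', '2018-08-02']),
    'gender': ['M', 'M', 'M', 'F', 'M'],
    'amount': [10.0, 20.0, 1000.0, 5.0, 30.0],
})

# daily median and mean per gender, side by side
daily = (
    toy.groupby(['gender', pd.Grouper(key='date', freq='D')])['amount']
       .agg(['median', 'mean'])
       .reset_index()
)
print(daily)
```

For males on 1 August the single large purchase pulls the mean far above the median, while the median stays at the middle transaction.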

## 4. Statistical Analysis<a id='sa'></a>

[Move Up](#mu)

For this task we will look further into the question about the spending habits of customers based on their gender. We found that male customers performed more debit transactions than their female counterparts. We will calculate the average amount spent by each gender and then use statistical analysis to conclude which gender spends more.

In [None]:
spend = debit_data[['amount', 'gender']].copy()  # copy so later in-place edits do not touch debit_data
spend_habit = spend.groupby(['gender']).mean()
spend_habit.reset_index(inplace=True)
spend_habit.rename(columns={'amount':'average_spend'}, inplace=True)

In [None]:
spend_habit

>From the above dataframe, we can see that the average spend for males is approximately `5 AUD` more than that of females. From this result it is tempting to conclude that males spend more than females. However, the result could be due to there being more males than females, or vice versa but with a few high amounts spent. We therefore need to establish that males spend more than females and that the difference is not due to chance. To do this we will perform hypothesis testing.
>
>In order to select a test, we will look at the distribution of the `amount` column. If the distribution is normal, we will use `Welch's t-Test`, since there is an unequal number of males and females. If the distribution is not normal, we will transform the data and then apply Welch's t-Test.

We will go ahead and look at the sample distribution.

In [None]:
hist(spend, 'amount', 'Distribution of Customer Spend', 'box', 'gender')

>From the distribution above we can see that our data is skewed to the right, which may be due to the many outliers in our data (rare spending values). These outliers may affect the overall spending value of the customers and might lead to biased conclusions. In order to curb the impact of these outliers on our results, we will remove them and then compute the averages again.

### 4.1 Removing Outliers<a id='ro'></a>

Let's go ahead and detect the outliers in our data.

In [None]:
# function to detect outliers

def remove_outlier(data, column):
    """
    Function that removes outliers from the dataframe

    Args:
        data : pandas dataframe
            Contains the data where the outliers are to be found
        column : str
            Usually a string with the name of the column
    
    Returns:
        None: prints number of outliers and then removes all the outliers
    """
    
    # calculate interquartile range
    q25, q75 = np.percentile(data[column], 25), np.percentile(data[column], 75)
    iqr = q75 - q25
    print('Percentiles: 25th = %.3f, 75th = %.3f, IQR = %.3f' % (q25, q75, iqr))
    
    # calculate the outlier cutoff
    cut_off = iqr * 1.5
    lower, upper = q25 - cut_off, q75 + cut_off
    
    # identify outliers
    indx = np.where((data[column] < lower) | (data[column] > upper))
    print('Identified outliers: %d' % len(indx[0]))
    
    # remove outliers
    data.drop(data.iloc[indx].index, inplace=True)
    print('Non-outlier observations: %d' % len(data))

In [None]:
remove_outlier(spend, 'amount')

> From the results, we were able to detect and remove `1182` outliers in total using the interquartile range approach. We will now go ahead and look at the average spend and the resulting distribution.
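
As a worked example of the interquartile-range rule used above, the sketch below computes the fences on six made-up amounts and returns a filtered copy instead of modifying the frame in place. The helper `iqr_bounds` and the frame `toy` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences q25 - k*IQR and q75 + k*IQR."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr


# hypothetical amounts with one obvious outlier
toy = pd.DataFrame({'amount': [10.0, 12.0, 14.0, 16.0, 18.0, 200.0]})

lower, upper = iqr_bounds(toy['amount'])
kept = toy[toy['amount'].between(lower, upper)]   # rows inside the fences
print(lower, upper, len(kept))
```

Here the quartiles are 12.5 and 17.5, so the IQR is 5 and the fences sit at 5 and 25; only the value 200 falls outside them.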

Visualize after removing outliers.

In [None]:
hist(spend, 'amount', 'Distribution of Customer Spend', 'box', 'gender')

> From the graph above, we can see that the distribution for each gender is still skewed to the right. We will perform the normality test and perform all the needed transforms to make our distribution normal.

### 4.2 Normality Test<a id='nt'></a>

The next step after removing outliers is to confirm the normality of the distribution to ensure we select the right statistical technique. We will quantify whether the distribution is normal or not by using the `Shapiro-Wilk test`.

**Null hypothesis:** <br>
<br>
$H_0:$ `Distribution is normal`
<br>
<br>

**Alternative hypothesis:** <br>
<br>
$H_1:$ `Distribution is not normal`

<br>
<br>

**`Interpretation of Normality Test`**

- p-value $\leq$ alpha: significant result, reject null hypothesis, not Gaussian (H1).
- p-value $>$ alpha: not significant result, fail to reject null hypothesis, Gaussian (H0).

In [None]:
# normality test

stat, p = stats.shapiro(spend['amount'])
print('Statistics={:.3f}, p={:.3f}'.format(stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

>The distribution is not normal, which is expected given the skewed distribution we saw earlier. In order to perform a parametric hypothesis test we will need to transform the distribution towards a normal distribution, and in this case we will use the `Box-Cox` transform. First, however, we will test for homoscedasticity.
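
One caveat worth keeping in mind: the SciPy documentation notes that for samples larger than 5000 the Shapiro-Wilk p-value may not be accurate. As a cross-check, the sketch below applies D'Agostino and Pearson's test (`stats.normaltest`) to a synthetic right-skewed sample. The sample is generated here for illustration and is not the notebook's data.

```python
import numpy as np
from scipy import stats

# synthetic right-skewed "amounts" (an assumption made for illustration)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=30.0, size=6000)

# D'Agostino-Pearson test combines skewness and kurtosis into one statistic
stat, p = stats.normaltest(skewed)

alpha = 0.05
looks_gaussian = p > alpha
print('Statistics={:.3f}, p={:.3g}, Gaussian={}'.format(stat, p, looks_gaussian))
```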

### 4.3 Homogeneity of Variances<a id='hov'></a>

The next assumption to test before performing hypothesis testing is homoscedasticity, or homogeneity of variances. That is, we determine whether the variances are equal between the groups. We can use `Levene’s`, `Bartlett’s`, or the `Brown-Forsythe test`, depending on the distribution the sample was drawn from. If the sample was drawn from a normal distribution then `Bartlett's test` can be used; otherwise use either `Levene's` or `Brown-Forsythe`.

In [None]:
stat, p = stats.levene(spend[spend['gender']=='M']['amount'], spend[spend['gender']=='F']['amount'])
print('Statistics={:.3f}, p={:.3f}'.format(stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Equal Variances (fail to reject H0)')
else:
    print('Unequal Variances (reject H0)')

> We can see that the variances of the male and female spend samples are unequal. Now we will go ahead and transform the data.
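
In SciPy, the choice between Levene's test and the Brown-Forsythe variant is made through the `center` argument of `stats.levene`: `center='mean'` gives the original Levene test, while the default `center='median'` corresponds to Brown-Forsythe. The sketch below compares the two on synthetic skewed groups, which are made up for illustration.

```python
import numpy as np
from scipy import stats

# two synthetic right-skewed groups with clearly different spread (illustrative only)
rng = np.random.default_rng(1)
group_a = rng.exponential(scale=20.0, size=500)
group_b = rng.exponential(scale=60.0, size=400)

bf_stat, bf_p = stats.levene(group_a, group_b, center='median')   # Brown-Forsythe
lev_stat, lev_p = stats.levene(group_a, group_b, center='mean')   # original Levene
default_stat, default_p = stats.levene(group_a, group_b)          # same as center='median'

print('Brown-Forsythe: stat={:.3f}, p={:.3g}'.format(bf_stat, bf_p))
print('Levene (mean):  stat={:.3f}, p={:.3g}'.format(lev_stat, lev_p))
```

The median-centred version is generally the more robust choice for skewed data such as transaction amounts.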

### 4.4 Box-Cox Transformation<a id='bct'></a>

If the normality test shows that our data is not normally distributed, we can apply a suitable transformation. One example is the `Box-Cox Transformation`.

The Box-Cox method is a data transform method that is able to perform a range of power transforms, including the log and the square root. It can be thought of as a power tool to iron out power-based change in your data sample. The resulting data sample may be more linear and will better represent the underlying non-power distribution, including Gaussian.

In [None]:
spend['amount'], fit_lambda = stats.boxcox(spend['amount'])

Now we will go ahead and visualize our data to check the impact of the transformation.

In [None]:
hist(spend, 'amount', 'Distribution of Customer Spend', 'box', 'gender')

> Now we can see that our distribution is roughly normal, which satisfies the normality condition for performing Welch's t-test.

### 4.5 Hypothesis Test <a id='ht'></a>

In statistics, a `hypothesis test` is used to assess and understand the plausibility, or likelihood of some hypothesis.

#### 4.5.1 t-Test

`Welch's t-test`, or unequal variances t-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means. It is more reliable when the two samples have unequal variances and/or unequal sample sizes. In our case the number of male counts is greater than female counts which makes this test the best choice for testing our hypothesis.

<br>

**`Welch's t-Test Setup`**

**Null hypothesis:** <br>
<br>
$H_0:$ $μ_1$ = $μ_2$
<br>
<br>

**Alternative hypothesis:** <br>
<br>
$H_1:$ $μ_1$ $\neq$ $μ_2$

<br>
<br>

**`Interpretation of Welch's T-Test`**

- p-value $\leq$ alpha: `Different means (reject H0)`
- p-value $>$ alpha: `Same means (fail to reject H0)`

Now we will go ahead and perform the test. The only difference between `Welch's t-test` and `Student's t-test` is how they treat the variances. The latter assumes the variances to be the same (homogeneity of variance) whereas the former doesn't. Therefore, we will use the scipy function `ttest_ind` and set `equal_var` to False.
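
To show what `equal_var=False` changes, the following sketch computes the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with SciPy's result. The samples `a` and `b` are synthetic and purely illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical samples with unequal sizes and unequal variances
rng = np.random.default_rng(42)
a = rng.normal(loc=10.0, scale=2.0, size=120)
b = rng.normal(loc=9.0, scale=4.0, size=80)

# Welch's statistic uses each sample's own variance instead of a pooled one
var_a, var_b = a.var(ddof=1), b.var(ddof=1)
se = np.sqrt(var_a / a.size + var_b / b.size)
t_manual = (a.mean() - b.mean()) / se

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch = (var_a / a.size + var_b / b.size) ** 2 / (
    (var_a / a.size) ** 2 / (a.size - 1) + (var_b / b.size) ** 2 / (b.size - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print('manual: t={:.4f}, p={:.4g}'.format(t_manual, p_manual))
print('scipy:  t={:.4f}, p={:.4g}'.format(t_scipy, p_scipy))
```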

In [None]:
stat, p = stats.ttest_ind(spend[spend['gender']=='M']['amount'], spend[spend['gender']=='F']['amount'], equal_var=False)
print('Statistics={:.3f}, p={:.3f}'.format(stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Same means (fail to reject H0)')
else:
    print('Different means (reject H0)')

> Based on the outcome of the hypothesis test, we can conclude that there is a significant difference between the average spend for male and female customers. This supports our initial assumption that, in this data set, males spend more than females. To use this result to estimate the average spend in the wider population, we will go ahead and compute the confidence interval for the average spend of both male and female customers.

### 4.6 Effect Size<a id='es'></a>

`Cohen’s D` , or standardized mean difference, is one of the most common ways to measure `effect size`. An `effect size` is how large an effect is. For example, medication A has a larger effect than medication B. While a p-value can tell you if there is an effect, it won’t tell you how large that effect is.

Cohen’s D specifically measures the effect size of the difference between two means.
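
In symbols, using the same quantities as the `cohend` function defined in this section (sample sizes $n_1, n_2$, sample variances $s_1^2, s_2^2$ and sample means $\bar{x}_1, \bar{x}_2$), the statistic can be written as:

$$
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}, \qquad d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}
$$

Cohen's commonly quoted rough guide reads $|d| \approx 0.2$ as a small effect, $0.5$ as medium and $0.8$ as large. These thresholds are a convention rather than part of the calculation.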

In [None]:
def cohend(d1, d2):
    # calculate the size of samples
    n1, n2 = len(d1), len(d2)
    
    # calculate the variance of the samples
    s1, s2 = np.var(d1, ddof=1), np.var(d2, ddof=1)
    
    # calculate the pooled standard deviation
    s = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    
    # calculate the means of the samples
    u1, u2 = np.mean(d1), np.mean(d2)
    
    # calculate the effect size
    return (u1 - u2) / s

In [None]:
d = cohend(spend[spend['gender']=='M']['amount'], spend[spend['gender']=='F']['amount'])
print('Cohens d: %.3f' % d)

> From the hypothesis test that we performed, we concluded that there is a significant difference between male and female spend. To quantify this, the `Cohen's d` value indicates that the two sample means differ by approximately 1 standard deviation. Despite the fact that the effect is small, the difference is still significant.

### 4.7 Confidence Interval<a id='ci'></a>

Now that we can confidently say that there is a significant difference in average spend, how does this reflect in the entire population? How can we estimate the average spend for either group of customers? This is where we need a confidence interval, which gives us a range in which the population average spend is likely to fall.

We will go ahead and compute the confidence interval for both male and female average spend. Note that the `amount` values are currently the transformed values. To get the average spend in AUD, we first have to apply the `inverse box-cox` transformation to our data.
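
As a minimal sketch of what the interval calculation does, the snippet below builds a 95% t-interval for a mean by hand and checks it against `stats.t.interval`. The `sample` array is synthetic and stands in for one group's spend values.

```python
import numpy as np
from scipy import stats

# Hypothetical sample standing in for one group's spend amounts
rng = np.random.default_rng(1)
sample = rng.normal(loc=30.0, scale=8.0, size=400)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean (uses ddof=1)

# Critical t value for a two-sided 95% interval
t_crit = stats.t.ppf(0.975, df=sample.size - 1)
manual_ci = (mean - t_crit * sem, mean + t_crit * sem)

# Same interval from SciPy; the confidence level is passed positionally
scipy_ci = stats.t.interval(0.95, df=sample.size - 1, loc=mean, scale=sem)
print(manual_ci, scipy_ci)
```

Passing the confidence level positionally avoids depending on the keyword name, which changed from `alpha` to `confidence` in newer SciPy releases.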

In [None]:
spend['amount'] = inv_boxcox(spend['amount'], fit_lambda)

In [None]:
x1 = spend[spend['gender']=='M']['amount']
x2 = spend[spend['gender']=='F']['amount']

In [None]:
# computing the confidence interval
print("""-------------------------------------------------------------
Confidence Interval - Average Spend Range for Male Customers
-------------------------------------------------------------\n""")
# the confidence level is passed positionally because newer SciPy versions renamed the `alpha` keyword
print('Confidence Interval: {}\n\n'.format(stats.t.interval(0.95, df=len(x1)-1, loc=np.mean(x1), scale=stats.sem(x1))))


print("""-------------------------------------------------------------
Confidence Interval - Average Spend Range for Female Customers
-------------------------------------------------------------\n""")
print('Confidence Interval: {}'.format(stats.t.interval(0.95, df=len(x2)-1, loc=np.mean(x2), scale=stats.sem(x2))))

>From the confidence interval, if we repeated this sampling over and over again, about 95 percent of the resulting intervals would contain the true average spend. For this sample, the interval for male customers runs from `28.2 AUD` to `29.2 AUD`, and the interval for female customers runs from `26.4 AUD` to `27.4 AUD`.

## 5. Modeling<a id='dm'></a>

[Move Up](#mu)

To remind ourselves, we stated earlier that one of the objectives is to build a model that is able to predict the annual salary of customers. Based on the methodology framework we saw at the beginning, we know that our problem is a continuous numeric prediction problem. We also highlighted that we will have to build multiple models and then select the best one based on certain evaluation metrics.

Therefore, to complete this task we will go through the various machine learning steps, which include:

* Data Understanding
* Feature Engineering
* Splitting Dataset
* Algorithm Evaluation
* Parameter Tuning
* Final Model
* Model Understanding

### 5.1 Data Understanding<a id='du'></a>

In this section, we will probe the data to understand how the features in the data relate. This will inform us on which new variables to engineer.

In [None]:
# make a copy of the dataset

ml_data = data.copy()

In [None]:
# drop unneeded columns

ml_data.drop(columns=['merchant_id', 'first_name', 'transaction_id'], axis=1, inplace=True)

In [None]:
ml_data.head()

#### 5.1.1 Descriptive Statistics<a id='ds'></a>

In [None]:
# check data types and counts of our data

ml_data.info()

In [None]:
# attain summary statistics - add include='all' to see that for categorical data

ml_data.describe()

In [None]:
# dimension of the data

ml_data.shape

In [None]:
# checking the correlation coefficient

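# 'display.precision' is the full option name; numeric_only (pandas >= 1.5) skips non-numeric columns
pd.set_option('display.precision', 5)
ml_data.corr(numeric_only=True)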

> We can see that the numerical columns have different ranges and standard deviations, which will clearly hamper the performance of our ML model. To resolve this issue we will have to `standardize` the dataset so that all numeric values share the same `mean and std`. Also, we observe that some of the columns have `missing values` and some are also `strongly correlated`.

#### 5.1.2 Data Visualization<a id='dv'></a>

In [None]:
fig = px.scatter_matrix(ml_data, 
                        dimensions=['balance', 'age', 'amount','card_present_flag', 'movement'],
                        color='gender',
                        color_discrete_sequence=px.colors.qualitative.D3)
fig.update_traces(diagonal_visible=False)
fig.show()

> We can see there are a few columns with binary values and the scatter plots show good relationship between some of the columns.  

### 5.2 Feature Engineering and Selection<a id='fes'></a>

The better we prepare our data for the machine learning model, the better our predictions will be. In this task, we will properly prepare our data by transforming columns, dropping irrelevant columns, handling missing and categorical values and finally merging where needed.

We will first determine the best way to compute annual salary for the customers. We must understand that the transaction dataset contains different types of transactions completed by each customer. Therefore, in order to compute the salary for each customer, we will have to filter our dataset to contain only salary transactions. To do this we will use the `txn_description` column.

In [None]:
# customer salary

salary = ml_data[ml_data['txn_description'] == 'PAY/SALARY']
salary.head()

In [None]:
salary.nunique()

>Looking at the unique values, we can clearly see that these rows only pertain to salary transactions. Because it's a salary transaction, we expect the status to be only `posted`. Also, the movement should be `credit` since customers' accounts are credited with their salaries. There are no merchant details because this is not a spending activity and, as expected, the number of unique customer IDs is `100`.

Because the accounts of the customers were credited with their salaries, we will assume that each customer was credited with the same amount of money each time, with the payment interval within the month differing from customer to customer.

In [None]:
ml_data[(ml_data['customer_id'] == 'CUS-1462656821') & (ml_data['txn_description'] == 'PAY/SALARY')].head(10)

> From the dataframes above, we can observe that different customers receive salaries at different time intervals. Some receive a salary every week and others every 2 weeks. We will therefore go ahead and assume that each customer receives the same salary amount at every payment.

Now the next step will be to calculate the annual salary for each customer and also engineer new features based on the `amount`, `balance`, and `date` columns. 

In [None]:
# extract all salary transactions

salary_trans = ml_data[ml_data['txn_description'] == 'PAY/SALARY']
salary_freq = salary_trans[['customer_id', 'date', 'amount', 'balance']]

In [None]:
# get month names since year remains constant

# take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
month_data = salary_freq[['customer_id', 'date', 'amount']].copy()
month_data['month'] = month_data['date'].dt.month_name()

We will now compute the number of times customers received salary in a particular month. Based on the inspection above, we noticed that customers received same salary at different frequencies for each month. Some had salaries every 2 weeks and others received it every week.  

In [None]:
# determine the frequency of salary payments in a month

# count payments per customer, amount and month; naming the count column explicitly
# avoids relying on the default name, which differs between pandas versions
payment_freq = (month_data.groupby(['customer_id', 'amount', 'month'])
                          .size()
                          .reset_index(name='monthly_freq'))


# calculate salary for each month based on frequency

payment_freq['monthly_sal'] = payment_freq['amount'] * payment_freq['monthly_freq']

We were able to compute the `monthly salary` for each customer based on their respective frequencies. Now we will go ahead and compute their `average monthly salary` and `average monthly frequency` over the unique number of months in which they received their salaries. After that, we will compute their annual salary by multiplying the average monthly salary by 12 months.
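
The arithmetic described above can be sketched on a toy example. The two customers, their amounts and their payment counts below are made up for illustration: customer `A` is paid 500 weekly (4 payments in one month, 5 in the next) and customer `B` is paid 1500 fortnightly.

```python
import pandas as pd

# Hypothetical salary transactions: one row per payment
toy = pd.DataFrame({
    'customer_id': ['A'] * 9 + ['B'] * 4,
    'month': ['August'] * 4 + ['September'] * 5 + ['August'] * 2 + ['September'] * 2,
    'amount': [500.0] * 9 + [1500.0] * 4,
})

# Number of payments per customer, amount and month
freq = (toy.groupby(['customer_id', 'amount', 'month'])
           .size()
           .reset_index(name='monthly_freq'))

# Monthly salary = payment amount x number of payments in that month
freq['monthly_sal'] = freq['amount'] * freq['monthly_freq']

# Annual salary = average monthly salary x 12
annual = freq.groupby('customer_id')['monthly_sal'].mean() * 12
print(annual)
```

For customer `A` the monthly salaries are 2000 and 2500, so the average is 2250 and the annual estimate is 27000. For customer `B` every month is 3000, giving 36000.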

In [None]:
# calculate average monthly salary based on frequency

monthly_salary = payment_freq.groupby('customer_id')[['monthly_freq', 'monthly_sal', 'amount']].mean().reset_index()
monthly_salary.rename(columns={'monthly_freq':'avg_monthly_freq', 'monthly_sal':'avg_monthly_sal', 
                               'amount':'avg_trans_amt'}, inplace=True)

Now we will engineer new features based on the columns we have and finally merge all the dataframes.

In [None]:
# determine number of months salary was received 

number_of_months = payment_freq.groupby('customer_id')['month'].nunique().reset_index()
monthly_sum = payment_freq.groupby('customer_id')[['monthly_freq', 'monthly_sal']].sum().reset_index()
cust_bal = salary_freq.groupby('customer_id')['balance'].mean().reset_index()

In [None]:
# merge dataframes 

monthly_data = pd.merge(monthly_salary, number_of_months, on='customer_id')
bal_data = pd.merge(monthly_data, cust_bal, on='customer_id')
final_output = pd.merge(bal_data, monthly_sum, on='customer_id')

The final step is to calculate the annual salary using the average monthly salary.

In [None]:
final_output['annual_salary'] = final_output['avg_monthly_sal'] * 12
final_output.rename(columns={'balance':'avg_balance'}, inplace=True)

In [None]:
final_output.head()

Another feature we can engineer captures the spending habits of customers. We already confirmed that males spend more than females based on this dataset. Therefore, having the spending ratio can also help us in predicting customer salaries, based on the assumption that `high spending customers earn high salary`. The spending ratio tells us how much of their annual income customers spend. We will use the debit data we extracted earlier.

In [None]:
# total amount spent by each customer, scaled by 4 to annualize
# (the factor of 4 assumes the data covers roughly three months)

spend = debit_data.pivot_table(index='customer_id', values='amount', aggfunc='sum')
spend['annual_spend'] = round(spend['amount'] * 4, 2)
spend.reset_index(inplace=True)

In [None]:
# merge the created dataframes

final_output = pd.merge(final_output, spend.drop(columns=['amount']), on='customer_id')
final_output['spending_ratio'] = round(final_output['annual_spend'] / final_output['annual_salary'], 2)

In [None]:
# obtain summary statistics

final_output.describe()

> We can see that the highest customer spending ratio is `85%` and the minimum is `6%`. Also, the median spending ratio is `33%`, which means `50%` of customers save or invest approximately `70%` of their income or more. Also, the average number of times customers receive salary in a month ranges from a minimum of `1` to a maximum of `5`. The highest average monthly salary is `11781 AUD` whereas the minimum is `2385 AUD`. What we can infer from these statistics is that most customers spend less than `50%` of their income, so investment solutions or products can be sold to customers, ensuring that the bank makes a profit on the income saved.

Next task is to append the `age` and `gender` of each customer to the final output before moving on to building our predictive models.

In [None]:
# getting age and gender columns

# chaining drop_duplicates returns a new dataframe and avoids modifying a slice in place
other_info = ml_data[['customer_id', 'age', 'gender']].drop_duplicates()

In [None]:
# function to bin age values

def age_bins(data):
    """Function that groups age values
    
    Args:
        data: dataframe - data that contains age column
    
    Returns:
        groups: str - various groupings (young, adult, old)
    """
    if data['age'] <= 20:
        return 'young'
    elif data['age'] > 20 and data['age'] <= 40:
        return 'adult'
    else:
        return 'old'

In [None]:
# apply function and merge output data

other_info['age_bin'] = other_info.apply(age_bins, axis=1)
final_output = pd.merge(final_output, other_info, on='customer_id')
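
As a possible alternative to the row-wise `apply`, the same age groups can be produced in one vectorized call with `pd.cut`. This is only a sketch: the `ages` series is made up, and the bin edges are chosen to mirror the thresholds in `age_bins`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including values that sit exactly on the bin edges
ages = pd.Series([18, 20, 21, 40, 41, 65])

# Intervals are right-closed by default: (-inf, 20], (20, 40], (40, inf)
age_bin = pd.cut(ages,
                 bins=[-np.inf, 20, 40, np.inf],
                 labels=['young', 'adult', 'old'])
print(age_bin.tolist())
```

One behavioural difference to keep in mind: `pd.cut` leaves a missing age as missing, whereas the `age_bins` function would send it to the `else` branch and label it `old`.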

We will also go ahead and bin the spending ratio values into high, medium and low values.

In [None]:
# function to bin spending ratio values

def spend_ratio_bins(data):
    """Function that groups spending ratios
    
    Args:
        data: dataframe - data that contains spending ratio column
    
    Returns:
        groups: str - various groupings (low, medium, high)
    """
    if data['spending_ratio'] <= 0.3:
        return 'low'
    elif data['spending_ratio'] > 0.3 and data['spending_ratio'] <= 0.6:
        return 'medium'
    else:
        return 'high'

In [None]:
final_output['ratio_bins'] = final_output.apply(spend_ratio_bins, axis=1)

In [None]:
final_output.drop(columns='customer_id', inplace=True)

In [None]:
final_output.head()

The final task will be to properly format the data types of each column before proceeding to training our models.

In [None]:
final_output.info()

In [None]:
final_output['age_bin'] = final_output['age_bin'].astype('category')
final_output['ratio_bins'] = final_output['ratio_bins'].astype('category')

In [None]:
# save data for later

final_output.to_csv('ml_data.csv', index=False)

> Now our data is ready for machine learning.

### 5.3 Splitting Dataset<a id='sd'></a>

It is a good idea to use a hold-out test set. This is a sample of the data that we hold back from our analysis and modeling. We use it right at the end of our project to evaluate the performance of our final model. It is a smoke test that we can use to see if we messed up and to give us confidence in the model's performance on unseen data. We will use 80% of the dataset for modeling and hold back 20% for validation.

We will begin by importing the needed libraries for this task.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
import sklearn.metrics as metrics

Now we will go ahead and split our data.

In [None]:
train_data = final_output.copy()

In [None]:
# checking the correlation coefficient

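# 'display.precision' is the full option name; numeric_only (pandas >= 1.5) skips non-numeric columns
pd.set_option('display.precision', 2)
train_data.corr(numeric_only=True)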

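> From the output dataframe, we can see that `avg_monthly_sal` is highly correlated with both `monthly_sal` and `annual_salary`, and the same holds for `monthly_sal`. In fact, `annual_salary` was computed as `avg_monthly_sal * 12`, so keeping that column would also leak the target into the features. Therefore, to prevent multicollinearity, we will drop these two columns and work with the rest.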

In [None]:
# drop highly correlated columns

train_data = train_data.drop(columns=['avg_monthly_sal', 'monthly_sal'], axis=1)

In [None]:
# split data into train and test

X = train_data.drop('annual_salary', axis=1)
y = train_data['annual_salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### 5.4 Algorithm Evaluation<a id='ae'></a>

We don't know which algorithms will do well on this dataset. We will use 10-fold cross validation to evaluate the algorithms.

We will evaluate algorithms using the appropriate metric. Firstly, we will begin by evaluating linear algorithms.

#### 5.4.1 Linear Algorithm Evaluation<a id='lae'></a>

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import LassoLars
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.linear_model import Lars
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import ARDRegression
from sklearn.linear_model import PassiveAggressiveRegressor
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import TheilSenRegressor
from sklearn.linear_model import HuberRegressor

Now the next step is to define our pipeline. That is the series of steps we want to be performed before model is trained and saved. In this section we will perform median imputation for numeric missing values and then standardize numeric values. For the categorical values, we will simply perform one-hot encoding.

In [None]:
# we define pipelines for various columns in our dataset
# numeric columns will be standardized and missing values replaced with the median

numeric_features = list(X.drop(columns=['age_bin', 'gender', 'ratio_bins']).columns)
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])


# categorical values will be encoded
categorical_features = ['age_bin', 'gender', 'ratio_bins']
categorical_transformer = OneHotEncoder(handle_unknown="ignore")


# put everything together in a column transformer
preprocessor = ColumnTransformer(transformers=[("num", numeric_transformer, numeric_features),
                                               ("cat", categorical_transformer, categorical_features)])

Next step is to define our pipelines.

In [None]:
# multiple algorithm evaluation

pipes = []
pipes.append(('LinearRegression', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())])))
pipes.append(('Lasso', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", Lasso())])))
pipes.append(('Ridge', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", Ridge())])))
pipes.append(('LassoLars', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", LassoLars())])))
pipes.append(('ElasticNet', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", ElasticNet())])))
pipes.append(('Lars', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", Lars())])))
pipes.append(('BayesianRidge', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", BayesianRidge())])))
pipes.append(('ARDRegression', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", ARDRegression())])))
pipes.append(('PAR', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", PassiveAggressiveRegressor())])))

Now we will go ahead and perform cross validation to evaluate the performance of selected algorithms on different sections of the train data.
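
As a minimal sketch of what this evaluation does, the snippet below runs 10-fold cross validation on synthetic data. The `X_demo` and `y_demo` arrays come from `make_regression` and are illustrative assumptions, not the notebook's dataset. It also shows that scikit-learn's `neg_root_mean_squared_error` scorer returns negated RMSE values, so the sign has to be flipped to read them as errors.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical regression data standing in for the training set
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 10 folds, shuffled with a fixed seed so the result is reproducible
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn scorers follow "higher is better", so RMSE is returned negated
neg_rmse = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo,
                           cv=cv, scoring='neg_root_mean_squared_error')

# flip the sign to recover the usual, positive RMSE per fold
rmse = -neg_rmse
print('RMSE: mean={:.2f}, std={:.2f}'.format(rmse.mean(), rmse.std()))
```

With negated scores, the value closest to zero is the best one, which is why sorting the scores in descending order lists the best model first.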

In [None]:
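def cross_val(folds, scoring, pipelines=None):
    """Function for cross validation using pipelines
    Args:
        folds - int: number of cross-validation folds
        scoring - str: name of the scikit-learn scoring metric
        pipelines - list: (name, pipeline) pairs to be evaluated; defaults to the global `pipes`
    Returns:
        fig - figure: plotly table showing output results
    """
    # fall back to the current global `pipes` so existing calls keep working
    if pipelines is None:
        pipelines = pipes
    results = []
    names = []
    data_list = []
    
    for name, pipeline in pipelines: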
        cv_results = cross_val_score(pipeline, X_train, y_train, cv=folds, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        data_list.append([name, round(cv_results.mean(), 2), round(cv_results.std(),2)])
    
    # create a dataframe for the results
    table = pd.DataFrame(np.array(data_list), columns=['MODEL NAME', 'RMSE', 'STD'])
    table['RMSE'] = table['RMSE'].astype(float)
    table = table.sort_values(by=['RMSE'], axis=0,ascending=False)
    table.set_index('MODEL NAME', inplace=True)
    
    # convert dataframe to plotly table
    fig = ff.create_table(table, colorscale=[[0, '#1F77B4'], [.5, '#ffffff'],[1, '#ffffff']], index=True)
    fig.update_layout(width=400, plot_bgcolor='rgba(0,0,0,0)', height=300, margin=dict(l=0, r=10, t=10, b=10))
    for i in range(len(fig.layout.annotations)):
        fig.layout.annotations[i].font.size = 12
    
    return fig

In [None]:
fig_linear = cross_val(10, 'neg_root_mean_squared_error')
fig_linear

> From the table above, we can see that `ARDRegression`, `Ridge` and `BayesianRidge` have the lowest `RMSE` values. Also, their standard deviations are approximately the same. Therefore, we will have to select the best model based on several factors, including RMSE, complexity and stability.

#### 5.4.2 Nonlinear Algorithm Evaluation<a id='nlae'></a> 

In [None]:
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

In [None]:
pipes = []
pipes.append(('KernelRidge', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", KernelRidge())])))
pipes.append(('SVR', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", SVR())])))
pipes.append(('KNeighbors', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", KNeighborsRegressor())])))
pipes.append(('DTR', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", DecisionTreeRegressor())])))

In [None]:
fig_nlinear = cross_val(10, 'neg_root_mean_squared_error')
fig_nlinear

> For the non-linear models, `KernelRidge` performs better than all the other models and the standard deviation is similar to that of the linear models.

Finally, we will also evaluate ensemble algorithms. `Ensemble learning` is a general meta approach to machine learning that seeks better predictive performance by combining the predictions from multiple models.

You can read more about ensemble learning [here](https://machinelearningmastery.com/tour-of-ensemble-learning-algorithms/) 

#### 5.4.3 Ensemble Algorithm Evaluation

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import BaggingRegressor

In [None]:
pipes=[]
pipes.append(('RandomForest', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", RandomForestRegressor())])))
pipes.append(('ExtraTrees', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", ExtraTreesRegressor())])))
pipes.append(('AdaBR', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", AdaBoostRegressor())])))
pipes.append(('GBR', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", GradientBoostingRegressor())])))
pipes.append(('XGBRegressor', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", XGBRegressor())])))
pipes.append(('LGBMRegressor', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", LGBMRegressor())])))
pipes.append(('BaggingRegressor', Pipeline(steps=[("preprocessor", preprocessor), ("regressor", BaggingRegressor())])))

In [None]:
fig_en = cross_val(10, 'neg_root_mean_squared_error')
fig_en

> The `RMSE` outputs from the ensemble algorithms are quite high compared to the linear algorithms. To choose the final model, we have to look at the error margin, standard deviation and complexity. Looking at the algorithms, we will likely select <b>Ridge Regressor</b>. `Ridge Regressor` provides a good tradeoff between RMSE and standard deviation. It's also quite simple to understand and less complex. Therefore, the best model to select will be `Ridge Regressor`.

### 5.5 Hyperparameter Tuning<a id='pt'></a>

After determining the best algorithm to use based on the performance metric, stability and complexity, the next step is to search for the hyperparameters that yield the best result. This is called `hyperparameter tuning`.

In [None]:
# set pipelines and grid search parameters

pipe = Pipeline(steps=[("preprocessor", preprocessor), ("regressor", Ridge())])
# `normalize` was removed from Ridge in scikit-learn 1.2; the pipeline's StandardScaler already scales numeric features
tune_grid = {"regressor__alpha": np.linspace(0, 0.2, 21), "regressor__fit_intercept": [True, False]}
grid = GridSearchCV(pipe, param_grid=tune_grid, scoring='neg_mean_absolute_error', cv=12)
grid_result = grid.fit(X_train, y_train)
print("Best: {} using {}".format(grid_result.best_score_, grid_result.best_params_))

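> Now we have seen a reduction in the error after hyperparameter tuning. Note that the grid search above is scored with MAE, while the earlier cross-validation tables report RMSE, so the two values are not directly comparable. We will go ahead and build the final model and visualize its performance.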

### 5.6 Finalizing Model<a id='fm'></a>

In this section we will finalize the model by training it on the entire training dataset and make predictions for the hold-out test dataset to confirm our findings.

In [None]:
# `normalize` is omitted because it was removed from Ridge in scikit-learn 1.2 (False was its default)
model = Ridge(alpha=0.01, fit_intercept=True)
linear_pipe = Pipeline(steps=[("preprocessor", preprocessor), ("regressor", model)])
linear_pipe.fit(X_train, y_train)

Now we will go ahead and make inference on the test data and then visualize how our model performed.

In [None]:
predictions = linear_pipe.predict(X_test)

In [None]:
predictions

In [None]:
print('MAE: {}'.format(metrics.mean_absolute_error(y_test, predictions)))
print('MSE: {}'.format(metrics.mean_squared_error(y_test, predictions)))
print('RMSE: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, predictions))))
print('R2: {}'.format(metrics.r2_score(y_test, predictions)))

### 5.7 Model Understanding<a id='mdu'></a>

In this final step, we will look at how well the model performed on the test data and then understand why the model made those predictions. For this step we will use both `scikit-learn evaluation metrics` and `yellowbrick visuals` for a detailed understanding.

`Yellowbrick library` is excellent for visualizing various ML tasks including hyperparameter tuning, model evaluation, model performance etc. You can explore more of the library's functionalities [here](https://www.scikit-yb.org/en/latest/)

In [None]:
from yellowbrick.regressor import PredictionError
from yellowbrick.regressor import ResidualsPlot
from yellowbrick.regressor import CooksDistance
from yellowbrick.regressor import ManualAlphaSelection

Now we will go ahead and visualize model performance.

#### Residual and Prediction Error Plot

A [residual plot](https://www.scikit-yb.org/en/latest/api/regressor/residuals.html) shows the difference between the observed value of the target variable (y) and the predicted value (ŷ), i.e. the error of the prediction. We will go ahead and visualize how our Ridge regressor performed. The [prediction error plot](https://www.scikit-yb.org/en/latest/api/regressor/peplot.html) shows the actual targets from the dataset against the predicted values generated by our model. This allows us to see how much variance is in the model.

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(12,5))

# residual plot
visualizer = ResidualsPlot(linear_pipe, ax=axes[0])
visualizer.fit(X_train, y_train)  
visualizer.score(X_test, y_test) 
visualizer.finalize()

# predicted error plot
visualizer = PredictionError(linear_pipe, ax=axes[1])
visualizer.fit(X_train, y_train)  
visualizer.score(X_test, y_test)
visualizer.finalize()

fig.tight_layout(pad=3.0)
plt.show()  

> Here we can see that points are randomly dispersed around the horizontal axis which clearly indicates that a linear model was the right call. In addition, the Ridge Regressor is performing well and we observe a Test R squared value of `90.4%`. Also, we can see from the histogram that our error is approximately normally distributed around zero, which also generally indicates our model is fitted well as we can see from the prediction error plot.

#### Alpha Selection and Cooks Distance Plot 

Regularization is designed to penalize model complexity: the higher the alpha, the less complex the model, which decreases the error due to variance (overfitting) but increases the error due to bias (underfitting). To find the `sweet spot` between the two, we plot the cross-validated error against a range of alpha values and pick the alpha with the lowest error. This plot helps us judge whether our chosen alpha leaves the model `overfitting` or not.
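
To illustrate how alpha controls complexity, the sketch below fits Ridge on synthetic data with increasing alpha values and records the size of the coefficient vector. The data and the list of alphas are assumptions chosen only to make the shrinkage visible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Hypothetical regression problem
X_demo, y_demo = make_regression(n_samples=100, n_features=8, noise=15.0, random_state=0)

# Larger alpha means a stronger penalty on the coefficients
alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = [np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_) for a in alphas]

for a, norm in zip(alphas, coef_norms):
    print('alpha={:>8}: ||coef||={:.3f}'.format(a, norm))
```

The coefficient norm shrinks as alpha grows, so a very large alpha pushes the model towards predicting the mean (underfitting), while an alpha near zero approaches ordinary least squares.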

[Cook’s Distance](https://www.scikit-yb.org/en/latest/api/regressor/influence.html) is a measure of an observation or instances’ influence on a linear regression. Datasets with a large number of highly influential points might not be suitable for linear regression without further processing such as outlier removal or imputation. 
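
For intuition about what the measure captures, here is a small NumPy sketch that computes Cook's distance for an ordinary least squares fit directly from the residuals and leverages. The data is synthetic, with one deliberately planted outlier at the last index, and the `4 / n` cut-off is a common rule of thumb rather than a fixed standard.

```python
import numpy as np

# Hypothetical data: a clean linear relationship plus one planted outlier
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)
x = np.append(x, 8.0)    # far from the other x values (high leverage)
y = np.append(y, -30.0)  # far from the trend (large residual)

# Design matrix with an intercept column
X_design = np.column_stack([np.ones_like(x), x])
n, p = X_design.shape

# Leverage values are the diagonal of the hat matrix H = X (X'X)^-1 X'
hat = X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T
leverage = np.diag(hat)

# OLS fit, residuals and mean squared error
beta = np.linalg.lstsq(X_design, y, rcond=None)[0]
residuals = y - X_design @ beta
mse = residuals @ residuals / (n - p)

# Cook's distance combines residual size with leverage
cooks_d = residuals ** 2 / (p * mse) * leverage / (1 - leverage) ** 2

# Rule-of-thumb threshold for "highly influential" points
flagged = np.where(cooks_d > 4 / n)[0]
print('most influential index:', int(np.argmax(cooks_d)))
print('flagged indices:', flagged)
```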

Let's go ahead and plot these two graphs.

In [None]:
# apply transformation on train data

transformed_X = preprocessor.fit_transform(X_train)

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(12,5))

# alpha selection plot
alphas = np.linspace(0, 0.2, 21)
visualizer = ManualAlphaSelection(Ridge(), alphas=alphas, cv=12, scoring="neg_mean_squared_error", ax=axes[0])
visualizer.fit(transformed_X, y_train)
visualizer.finalize()

# Instantiate and fit the visualizer
visualizer = CooksDistance(ax=axes[1])
visualizer.fit(transformed_X, y_train)
visualizer.finalize()

fig.tight_layout(pad=3.0)
plt.show()  

> From the alpha error plot, we can see that the error decreases until it reaches the `sweet spot (optimal alpha value)`. As the alpha value increases beyond that point, the model is regularized too strongly and tends to underfit, and we can clearly see an increase in the error after the sweet spot. The right alpha value is `0.01`, which is what we selected in training our model, so we're confident that our model is not overfitting.

> Looking at the Cook's distance plot, we can see just one highly influential point, indicating few outliers in our dataset. Therefore, selecting the Ridge regressor was a good option.

### 5.8 Save Model<a id='sm'></a>

The final step after evaluating our model and checking its performance is to save the model so it can be used for deployment. Now we will go ahead and save our model to a pickle file.

In [None]:
# save model to a pickle file

with open('final_model', 'wb') as file:
    pickle.dump(linear_pipe, file)
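
Before deploying, it can be useful to confirm that the saved file loads back correctly. The sketch below shows a save/load round trip; the helpers `save_model` and `load_model` and the `placeholder_model` dictionary are hypothetical stand-ins for illustration, and in the notebook the object being saved would be `linear_pipe`.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a model object to a pickle file."""
    with open(path, "wb") as file:
        pickle.dump(model, file)


def load_model(path):
    """Load a model object back from a pickle file."""
    with open(path, "rb") as file:
        return pickle.load(file)


# placeholder object standing in for the fitted pipeline (assumption)
placeholder_model = {"name": "ridge", "alpha": 0.01}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "final_model"
    save_model(placeholder_model, model_path)
    restored_model = load_model(model_path)

print(restored_model == placeholder_model)
```

Pickle files should only be loaded from trusted sources, and a fitted scikit-learn pipeline is best reloaded with the same library versions that were used to save it.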

## 6. Conclusion and Recommendation<a id='cr'></a>

[Move Up](#mu)

At the beginning of this project, we set out to achieve 2 main goals based on the dataset we had available. The goals were:
* Segment the dataset and draw unique insights, including visualizing the transaction volume and assessing the effect of any outliers.
* Explore the correlation between customer attributes and build a regression and a decision-tree prediction model based on the findings.

At the end of this project, we were able to draw unique insights from the dataset by answering about 6 questions. Some of the insights generated were:
* Most of the transactions were performed by relatively young customers between the ages of `18-40yrs`.
* We also found that the average transaction amount was about `29 AUD`.
* Also, `52.2%` of the overall transactions were completed by males, compared to `48%` for females.
* Out of the percentages stated above, most of the transactions per gender were debit transactions, which suggests that most customers are spending more often than they are earning. However, amounts were greater for credit transactions.
* Again, out of the overall number of transactions, only `5000` were actually posted (in other words, completed), whereas the rest were authorized and still waiting for funds to be deducted from the account.
* Male spenders spend approximately `500 AUD` more than their female counterparts. To support that, the statistical analysis we performed showed that males were bigger spenders than females. As a result, we concluded with `95%` confidence that the average spend for males falls between `28.2 AUD` and `29.2 AUD`, and the average spend for females falls between `26.4 AUD` and `27.4 AUD`.
* We also found that `Sydney` and `Melbourne` are the top two suburbs for debit transactions. Also, New South Wales was found to be the state where spending was highest.
* We also discovered that spending decreases slightly towards the end of the month.
* Finally, further investigation is needed into why there were no transactions on `August 16, 2018`.

In addition, we evaluated various algorithms and finally selected the `ridge regressor` algorithm, which we used to build a model that explains approximately `90%` of the variation in the target (salary).

Despite the high coefficient of determination (R squared) for salary prediction, we had a very high `RMSE` value (8661), which could be a result of the data we used not being that rich (large). Therefore, to reduce the RMSE, we recommend training the model with more data and using feature selection techniques to determine which features most affect model performance.
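
A high R squared and a large RMSE can coexist because R squared is unitless while RMSE is expressed in the units of the target. The sketch below illustrates this on synthetic values; the `rmse` and `r_squared` helpers, the salary range and the noise level are all assumptions chosen for illustration and are not the project's data or results.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: share of target variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# synthetic salaries and predictions for illustration only
rng = np.random.default_rng(42)
salary_true = rng.uniform(30_000, 120_000, size=200)
salary_pred = salary_true + rng.normal(scale=8_000, size=200)

print(f"R squared: {r_squared(salary_true, salary_pred):.3f}")
print(f"RMSE:      {rmse(salary_true, salary_pred):.0f}")

# rescaling the target (e.g. to thousands) changes RMSE but not R squared
print(f"RMSE in thousands: {rmse(salary_true / 1000, salary_pred / 1000):.2f}")
```

For this reason, the RMSE is easiest to judge relative to the typical size of the salaries being predicted.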

In conclusion, we can confidently say that we have achieved our 2 main goals and have also tested our initial hypothesis.

## 7. References<a id='r'></a>

[Move Up](#mu)

* [Scikit Learn Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)
* [Machine Learning Mastery](https://machinelearningmastery.com/extra-trees-ensemble-with-python/)
* [Yellowbrick Library](https://www.scikit-yb.org/en/latest/)
* [Data Science Infinity DS Templates](https://data-science-infinity.teachable.com/)
* [Data Science Blog](https://www.reneshbedre.com/blog/anova.html)