# Coder Academy and THE ICONIC Masterclass

This notebook has the following sections...

* [Part 0: Using this notebook](#Part-0:-Using-this-notebook)
* [Part 1: Introduction](#Part-1:-Introduction)
* [Part 2: First look at our dataset](#Part-2:-First-look-at-our-dataset)
* [Part 3: Data pre-processing and feature engineering](#Part-3:-Data-pre-processing-and-feature-engineering)
* [Part 4: Clustering the dataset](#Part-4:-Clustering-the-dataset)
* [Part 5: Classifying our inferred gender](#Part-5:-Classifying-our-inferred-gender)
* [Part 6: Putting it altogether and next steps](#Part-6:-Putting-it-altogether-and-next-steps)

![](../img/iconic_coder.png)

# Part 0: Using this notebook

[Top](#Coder-Academy-and-THE-ICONIC-Masterclass) | [Previous Section](#Coder-Academy-and-THE-ICONIC-Masterclass) | [Next Section](#Part-1:-Introduction) | [Bottom](#Wrap-up)

## What is Python?

Python is an _interpreted_ programming language that was conceived in the late 1980s and first released in 1991. It's named after the British comedy group Monty Python (of _Monty Python's Flying Circus_ fame). In this class we'll be using Python to build our machine learning algorithms.

### Why learn Python?

Python has gained popularity because it has an easier syntax (rules to follow while coding) than many other programming languages. Python is also very versatile, which has led to its adoption in areas such as data science and web development.

All of the following companies actively use Python:

![Image](https://www.probytes.net/wp-content/uploads/2018/08/appl.png)

## How do I interact with this notebook?

A Jupyter Notebook is an interactive way to work with code in a web browser. The name Jupyter is a pseudo-acronym for three programming languages: Julia, Python and R. Notebooks let us combine instructions and code in one file, which is why we're using one!

We'll quickly do some practice to introduce you to using this notebook. For a list of keyboard shortcuts you can take a look at [Max Melnick's](http://maxmelnick.com/2016/04/19/python-beginner-tips-and-tricks.html) beginner tips for Jupyter Notebook.

Here's a quick run down of some of the most basic commands to use:

- A cell with a **<span style="color:blue">blue</span>** background is in **Command Mode**. This will allow you to toggle up/down cells using the arrow keys. You can press enter/return on a cell in command mode to enter edit mode

- A cell with a **<span style="color:green">green</span>** background is in **Edit Mode**. This will allow you to change the content of cells. You can press the escape key on a cell in edit mode to return to command mode

- To run the contents of a cell, you can type:
  - `cmd + enter` (`ctrl + enter` on Windows/Linux), which will run the contents of a cell and keep the cursor in place
  - `shift + enter`, which will run the contents of a cell, and move the cursor to the next cell (or create a new cell)

### Exercise

Edit the below by changing "Gretchen" to your own name by entering edit mode, and then running the cell using the directions above.

In [None]:
print("Hello, Gretchen")

We can add/delete cells using the following commands in <span style="color:blue">**Command Mode**</span>:

- `a`, adds a cell above the current cell
- `b`, adds a cell below the current cell
- `d + d`, (pressing the "d" key twice in succession) deletes a cell

### Exercise

Add/delete cells so that the cells below print the numbers 1 to 5 in order, with one number per cell. The numbers 2 and 4 are already completed for you.

In [None]:
print(2)

In [None]:
print(4)

# Part 1: Introduction

[Top](#Coder-Academy-and-THE-ICONIC-Masterclass) | [Previous Section](#Part-0:-Using-this-notebook) | [Next Section](#Part-2:-First-look-at-our-dataset) | [Bottom](#Wrap-up)

## The problem

As explained, [THE ICONIC](https://www.theiconic.com.au/) does not receive all information about a person when they create an online profile, but the more information they receive from a person, the better they can tailor their marketing towards specific individuals.

The goal of this masterclass is to develop a way to infer information about an individual by aligning buying behaviours with demographic traits.

### Exercise

Take 10 minutes to do some internet research, and try to find out how online buying behaviours differ between genders. Write down your findings - they might inform how you build your algorithms for the remainder of the masterclass!


## Data Pipeline

In this session, we'll build a **data pipeline** to infer gender based upon behavioural data. From [Wikipedia](https://en.wikipedia.org/wiki/Pipeline_(computing))...

> In computing, a **pipeline** (also known as a **data pipeline**) is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion.

Data pipelines deal with the process of collecting, modifying and analysing a dataset towards some goal. Here's a picture of a data pipeline from [this Medium blog post](https://medium.com/the-data-experience/building-a-data-pipeline-from-scratch-32b712cfb1db)...

![](https://cdn-images-1.medium.com/max/1600/1*8-NNHZhRVb5EPHK5iin92Q.png)

For the rest of this lesson we'll build up this pipeline. We will...

1. Analyse our dataset, by taking a look at the columns available
2. Process the dataset, by creating **usable features** for our algorithms and **normalising** these features
3. **Infer a gender** on our dataset using clustering
4. Analyse our inferred gender, and build a **classification algorithm** that can be used to predict our inferred gender from new data
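
To make the idea of a pipeline concrete, here is a minimal sketch in which each step is a function that takes a table and returns a table, so the output of one step is the input of the next. The toy data, the step functions `drop_missing` and `add_items_per_order`, and the `items_per_order` column are all made up for illustration; they are not part of the workshop dataset or code.

```python
import pandas as pd

def drop_missing(df):
    """Hypothetical step 1: remove rows that contain blank values."""
    return df.dropna()

def add_items_per_order(df):
    """Hypothetical step 2: create a new feature from two existing columns."""
    out = df.copy()
    out["items_per_order"] = out["items"] / out["orders"]
    return out

# Toy data with one blank value in the "orders" column
toy = pd.DataFrame({"orders": [1, 2, None, 4], "items": [2, 6, 3, 4]})

# Chain the steps: the output of each step feeds the next one
result = toy.pipe(drop_missing).pipe(add_items_per_order)
print(result)
```

`DataFrame.pipe` simply calls the given function with the DataFrame as its first argument, which keeps the steps readable in the order they run.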

# Part 2: First look at our dataset

[Top](#Coder-Academy-and-THE-ICONIC-Masterclass) | [Previous Section](#Part-1:-Introduction) | [Next Section](#Part-3:-Data-pre-processing-and-feature-engineering) | [Bottom](#Wrap-up)

Let's start to look at the data available to us. This data has been provided by **THE ICONIC**. It has been **de-identified**, meaning it has been modified in such a way that the data cannot be traced back to the actual individuals it represents.

## Pandas introduction

To analyse our dataset with Python we should first load the dataset. To do this, we will use the [Pandas](https://pandas.pydata.org/) module.

> A **module** is a set of code-files that can be loaded to add additional capabilities to our program

Pandas allows us to manipulate tabular data in Python. Let's import pandas so that we can use it for the rest of our Python session.

In [None]:
# Import pandas and give it the nickname "pd"
import pandas as pd

Now we can load our dataset into Python, and take a quick look at the data available to us.

---

Run the following code cell which will do the following...

1. Load our data into Python, specifically into a variable called `raw_data`
2. Print a brief description of each column in our dataset, including...
 * The type of data, which will be called `<class 'pandas.core.frame.DataFrame'>`
 * The type of `index`, or row names from the data: `RangeIndex: 46030 entries, 0 to 46029`
 * Number of data columns: `Data columns (total 42 columns)`
 * The column names and information


For example, the following describes a column called `afterpay_payments`. The column has `46030` filled-in (non-blank) values, and its values are whole numbers, stored as `int64`.

```
afterpay_payments                46030 non-null int64
```

In [None]:
# Load the dataset
raw_data = pd.read_csv('../data/data_iconic_workshop.csv.gz')

# Print basic information about the dataset
raw_data.info()

So...what do these columns mean? Some of them might be pretty obvious, like `orders` probably represents the number of orders for a customer. But what might `sacc_items` mean? Here's a little more information about the dataset to make life easier for you. 

| Column                   | Value   | Description                                                              | 
|--------------------------|---------|--------------------------------------------------------------------------| 
| customer_id              | string  | ID of the customer - super duper hashed                                  | 
| days_since_first_order   | integer | Days since the first order was made                                      | 
| days_since_last_order    | integer | Days since the last order was made                                       | 
| int_is_newsletter_subscriber | string  | Flag for a newsletter subscriber (1 = Yes, 0 = No)                                        | 
| orders                   | integer | Number of orders                                                         | 
| items                    | integer | Number of items                                                          | 
| cancels                  | integer | Number of cancellations - when the order is cancelled after being placed | 
| returns                  | integer | Number of returned orders                                                | 
| different_addresses      | integer | Number of times a different billing and shipping address was used        | 
| shipping_addresses       | integer | Number of different shipping addresses used                              | 
| devices                  | integer | Number of unique devices used                                            | 
| vouchers                 | integer | Number of times a voucher was applied                                    | 
| cc_payments              | integer | Binary indicating if credit card was used for payment                       | 
| paypal_payments          | integer | Binary indicating if PayPal was used for payment                              | 
| afterpay_payments        | integer | Binary indicating if AfterPay was used for payment                            | 
| apple_payments           | integer | Binary indicating if Apple Pay was used for payment                           | 
| female_items             | integer | Number of items purchased for women                                         | 
| male_items               | integer | Number of items purchased for men                                           | 
| unisex_items             | integer | Number of unisex items purchased                                         | 
| wapp_items               | integer | Number of Women Apparel items purchased                                  | 
| wftw_items               | integer | Number of Women Footwear items purchased                                 | 
| mapp_items               | integer | Number of Men Apparel items purchased                                    | 
| wacc_items               | integer | Number of Women Accessories items purchased                              | 
| macc_items               | integer | Number of Men Accessories items purchased                                | 
| mftw_items               | integer | Number of Men Footwear items purchased                                   | 
| wspt_items               | integer | Number of Women Sport items purchased                                    | 
| mspt_items               | integer | Number of Men Sport items purchased                                      | 
| curvy_items              | integer | Number of Curvy items purchased                                          | 
| sacc_items               | integer | Number of Sport Accessories items purchased                              | 
| msite_orders             | integer | Number of Mobile Site orders                                             | 
| desktop_orders           | integer | Number of Desktop orders                                                 | 
| android_orders           | integer | Number of Android app orders                                             | 
| ios_orders               | integer | Number of iOS app orders                                                 | 
| other_device_orders      | integer | Number of Other device orders                                            | 
| work_orders              | integer | Number of orders shipped to work                                         | 
| home_orders              | integer | Number of orders shipped to home                                         | 
| parcelpoint_orders       | integer | Number of orders shipped to a parcelpoint                                | 
| other_collection_orders  | integer | Number of orders shipped to other collection points                      | 
| average_discount_onoffer | float   | Average discount rate of items typically purchased                       | 
| average_discount_used    | float   | Average discount finally used on top of existing discount                | 
| revenue                  | float   | $ Dollar spent overall per person                                        |


We have already performed some data cleaning for you to save time. To give some perspective on the data cleaning process, since **a lot of a data scientist's job is to clean data** we have...

* Removed null values, or blank values, within the data
* Confirmed units within the columns are appropriate
* Changed relevant string variables to numerical values (computers really do not like text...)

You can see that our data has a _ton_ of different information about a customer. Let's just take a look at a single row of data, which represents a specific customer's buying patterns:

In [None]:
# Print out the first row of data
raw_data[0:1].transpose()

In [None]:
raw_data.describe().transpose()

## Visualising the data

Now that we have an idea about the type of data available to us, it might be helpful to start prying the dataset for useful trends.

Remember we are trying to categorise our customers as males and females. We know a little bit about what purchasing behaviours for males and females looks like, so it might be helpful to see whether these qualitative trends are evident in our dataset. If so, we might be able to utilise these features to **cluster** or **separate** the individuals in our dataset into their respective male/female groups.

Looking at raw data is tough to do...especially when we have ~46,000 rows to deal with in our dataset.

What might be more helpful is to visualise our dataset (humans like visuals over text!). We need to have questions in-hand, as there are more than 40 columns to look at, and there's no easy way to look at every column at once.

![](https://www.quantinsti.com/wp-content/uploads/2017/07/seaburn-1.png)

The following code will import the [`matplotlib`](https://matplotlib.org/) and [`seaborn`](https://seaborn.pydata.org/) libraries, the two main libraries we will use for data visualisation within this masterclass.

It will also run the 

```python
%matplotlib inline
```

command, which will allow us to render the images we create _within_ our notebook, instead of explicitly running a command each time to show the graphs/plots we generate with seaborn.

In [None]:
# Import matplotlib
import matplotlib.pyplot as plt

# Import seaborn
import seaborn as sns

# State to render images inline
%matplotlib inline

### Distributions

Something that we often like to look at is the **distribution** of each variable in our dataset. One way to visualise the distribution of our data is by using a boxplot.

<img src="../img/boxplot_revenue.png" width="800">

A boxplot shows the **spread** of our data, including:

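1. The minimum value (ignoring outliers)
2. The maximum value (ignoring outliers)
3. The median
4. The lower and upper quartiles, which form the edges of the box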

When you plot the data with a boxplot, you may observe that some of the points seem far away from the majority of values. These are called **outliers**. For example, if you received a dataset that described the height of staff at your workplace, and there were a few people that were 2.1 meters tall, they would be outliers, because people aren't normally 2.1 meters tall.

From [NIST](https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm):

> An **outlier** is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations.
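
As a small worked example of deciding what counts as "abnormal", the sketch below applies one common rule of thumb: flag any value that lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The heights are made-up numbers that echo the staff-height example above, and the 1.5 multiplier is an assumption that an analyst is free to change.

```python
import pandas as pd

# Made-up staff heights in meters, including one very tall person
heights = pd.Series([1.60, 1.65, 1.70, 1.72, 1.75, 1.80, 2.10])

# The quartiles mark the edges of the box in a boxplot
q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1

# Anything beyond these fences is treated as an outlier under this rule
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = heights[(heights < low_fence) | (heights > high_fence)]
print(outliers)
```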

### Exercise

The following code uses the [sns.boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) command to visualise variables within our dataset. It iterates through an input list of variables to show us these distributions.

Take a look at the variables available to us, choose some variables you are interested in visualising, and add them to the list below. Outliers typically appear as individual points far away from the center box within the boxplot.

Take note of which features have a lot of outliers as you visualise the data.

In [None]:
# List of features
boxplot_vars = ['male_items', 'female_items']

# Create boxplot
for v in boxplot_vars:
    plt.figure(figsize=(10, 3))
    sns.boxplot(x=raw_data[v])

### Outlier filtering

It is often a good idea to remove really extreme points if they are not representative of the rest of the dataset. Outliers can also make it _really hard_ to visualise data!

Run the following cell to define a function we can use for outlier filtering.

In [None]:
def outlier_cutoff(my_data, cols, cutoff=1.5):
    """
    Filter out outliers from a dataset within a set of columns, using a specified
    cutoff value (a multiple of the interquartile range, IQR).
    
    inputs: my_data <pd.DataFrame>: A dataset
            cols <list>: A list of columns to filter
            cutoff <float>: A cutoff value
            
    output: The filtered DataFrame
    """
    # Get original number of rows
    orig_num_rows = my_data.shape[0]
        
    # Go through columns
    for c in cols:
        # Get IQR
        percentile_25 = my_data[c].quantile(q=0.25)
        percentile_75 = my_data[c].quantile(q=0.75)

        # Calculate cutoff * IQR
        iqr = percentile_75 - percentile_25
        high_cut = percentile_75 + (iqr * cutoff)
        low_cut = percentile_25 - (iqr * cutoff)

        # Filter
        my_data = my_data.loc[(my_data[c] >= low_cut) & (my_data[c] <= high_cut), :]
    
    # Print the number of rows removed by the filter
    print('Number of rows eliminated: '  + str(orig_num_rows - my_data.shape[0]))
    
    return my_data

### Exercise

Add variables to the 'cols_to_filter' list below to filter outliers within those columns. **NOTE** the more columns we filter, and the _lower_ the cutoff, the more data we will lose, which is not good!

In [None]:
# Columns to filter
cols_to_filter = ['items', 'revenue']
cutoff = 10

# Filter data
filtered_data = outlier_cutoff(raw_data.copy(), cols_to_filter, cutoff=cutoff)

### Correlations

Two variables **correlate** when:

> Either of the variables are so related that one directly implies or is complementary to the other (from [Merriam-Webster](https://www.merriam-webster.com/dictionary/correlate))

Put simply, this means that knowing information about one variable _implies_ information about another variable.

Let's take a look at an example of a correlation. Here is a positive correlation based upon the revenue from a customer vs. the number of items they purchased:

<img src="../img/corr_items_revenue.png" width="800">


If someone told you that the number of items a customer bought was increasing, and asked you if the revenue increased too, you'd probably say yes. This is because you know that more items bought by a customer generates more revenue.

There are two main types of correlation:

> A **positive** correlation implies raising the value of one variable will also raise the value of another variable. Subsequently, lowering the value of one variable will lower the value of another variable.

> A **negative** correlation implies the opposite. Raising the value of one variable _lowers_ the value of another variable.

There is a numerical measure of correlation, called the correlation coefficient, $r$. There are three main points about the $r$ value of two variables...

* An $r$ that approaches 1 implies **positive correlation**
* An $r$ that approaches -1 implies **negative correlation**
* An $r$ that approaches 0 implies **no correlation**

Here are some pictures to show you examples of positively correlated, uncorrelated, and negatively correlated variables with corresponding $r$ values.

<br>

![](https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2012/10/pearson-2-small.png)
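
To see these three cases in numbers, here is a minimal sketch that computes the correlation coefficient with pandas. The toy table is made up so that the three cases come out exactly; real data will rarely be this clean. By default, `DataFrame.corr()` computes the Pearson correlation coefficient between every pair of columns.

```python
import pandas as pd

# Made-up data: revenue rises with items, discount falls with items,
# and devices has no linear relationship with items
toy = pd.DataFrame({
    "items":    [1, 2, 3, 4, 5],
    "revenue":  [50, 100, 150, 200, 250],
    "discount": [5, 4, 3, 2, 1],
    "devices":  [2, 1, 3, 1, 2],
})

# Correlation of every column with every other column
corr = toy.corr()
print(corr.round(2))
```

Reading along the `items` row, `revenue` shows a strong positive value, `discount` a strong negative value, and `devices` a value at zero.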


We can use a [seaborn heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) to show us the correlation of different variables. Let's first create a function that makes a heatmap for us.

The function can be called like so...

```python
make_heatmap(my_data, columns)
```

In [None]:
def make_heatmap(my_data, columns):
    """
    Make a heatmap of data given certain columns.
    
    inputs: my_data <pd.DataFrame>: The data we want to draw the heatmap of
            columns <list>: The columns to draw the heatmap of
    """
    
    # Create a figure
    plt.figure(figsize=(20, 10))

    # Draw a heatmap
    sns.heatmap(my_data[columns].corr(), annot=True, fmt=".1f")

In [None]:
# First get numerical columns
numerical_cols = [i for i in filtered_data.columns if str(filtered_data[i].dtype) != 'object']

# Make heatmap
make_heatmap(filtered_data, numerical_cols)

### Exercise

Think about the following...

* What variables correlate with each other? Did you expect these to correlate?
* Are there any correlations that _do not exist_ that you imagined would be there?
* Any new insights on how this might help decipher male vs. female buyers?

Why does looking at correlations matter?

> Typically, we want to eliminate strongly correlated variables because keeping both variables gives little _extra information_ about the underlying data.
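
Before deciding what to eliminate, it can help to list which pairs of variables are strongly correlated. The sketch below does this on a made-up table; the `threshold` of 0.9 is an arbitrary assumption for illustration, not a rule from this masterclass.

```python
import pandas as pd

# Made-up data: revenue closely tracks items, devices does not
toy = pd.DataFrame({
    "items":   [1, 2, 3, 4, 5],
    "revenue": [55, 95, 160, 190, 260],
    "devices": [2, 1, 3, 1, 2],
})

# Absolute correlations, so strong negative pairs are caught too
corr = toy.corr().abs()
threshold = 0.9  # assumed cutoff for "strongly correlated"

# Visit each pair of columns once and keep the pairs above the threshold
strong_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(strong_pairs)
```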

There are a few different ways to eliminate correlation. A simple way to do it is using something called [Principal Component Analysis (PCA)](http://setosa.io/ev/principal-component-analysis/), which creates new variables that reduce the correlation within our data. The issue is that PCA is not very **explainable**: it's not very transparent how the outputs of PCA reflect your original features.

We won't cover PCA in this masterclass, but we'll come back to correlation reduction.

# Part 3: Data pre-processing and feature engineering

[Top](#Coder-Academy-and-THE-ICONIC-Masterclass) | [Previous Section](#Part-2:-First-look-at-our-dataset) | [Next Section](#Part-4:-Clustering-the-dataset) | [Bottom](#Wrap-up)

We are now going to start using the word **feature** interchangeably with **variable**.

> A **variable** often refers to the raw-columns within our dataset. A **feature** is an input to our machine learning algorithm. A variable might _become_ a feature, or we might create a feature off of one or more variables.

For example, maybe we have data that is supposed to help us predict the price of a house. This dataset might include variables such as the number of bedrooms and bathrooms. What we might realise is that the number of bedrooms and bathrooms do not individually affect the price of the house, but the total number of rooms does.

So, what we do is we make a new variable called **total_rooms = bedrooms + bathrooms**, and use this variable in our machine learning algorithm, but _not_ the individual bedrooms and bathrooms. Thus, the **total_rooms** is a _feature_.
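
The house example can be sketched in a few lines of pandas. The table below is made up purely for illustration; only the `total_rooms = bedrooms + bathrooms` idea comes from the text above.

```python
import pandas as pd

# Made-up housing data with two raw variables and a price
houses = pd.DataFrame({
    "bedrooms":  [2, 3, 4],
    "bathrooms": [1, 2, 2],
    "price":     [450000, 620000, 800000],
})

# Create the new feature from the two raw variables
houses["total_rooms"] = houses["bedrooms"] + houses["bathrooms"]

# Keep the feature, and drop the variables it was built from
features = houses.drop(columns=["bedrooms", "bathrooms"])
print(features)
```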

### Feature creation

We can use the variables (columns) in our dataset as a starting point for features in our algorithm. We might also want to create other features in our dataset that help us get a better understanding of our customers. These features might also reduce noise from correlated variables.

For example, pretend we have the variable `female_items` that describes the number of items that were categorised as "female" bought on THE ICONIC website. If customer 1 buys 50 `female_items`, and customer 2 buys 600 `female_items`, does this tell us customer 2 tends to buy female items, or does it tell us simply that customer 2 just tends to buy more? We can see that `female_items` and `items` are clearly correlated by plotting a scatterplot of these two variables below.

In [None]:
# Make scatterplot
plt.figure(figsize=(15, 5))
sns.scatterplot(x=filtered_data['female_items'], y=filtered_data['items'])
plt.title('Items vs. Female Items')
plt.show()

Let's create one feature to help reduce this correlation. What we'll do is divide the `female_items` by the total `items`, and make a new feature called `pct_female_items`. We'll then plot the `pct_female_items` vs. `items` to show that the correlation is reduced.

We'll also drop `female_items` and `male_items`, since `pct_female_items` and `items` together imply this information.

**Thoughts...** why is it ok to drop `male_items`?

In [None]:
# Create pct_female_items variable
filtered_data['pct_female_items'] = filtered_data['female_items'] / filtered_data['items']

# Drop male_items and female_items
filtered_data.drop(['male_items', 'female_items'], inplace=True, axis=1)

# Plot pct_female_items vs. items
plt.figure(figsize=(15, 5))
sns.scatterplot(x=filtered_data['pct_female_items'], y=filtered_data['items'])
plt.title('Items vs. Pct Female Items')
plt.show()

Pretty funky shape...but definitely not correlated. We'll leave some time at the end of this lesson to create more features.

### Normalising Data

Machine learning algorithms are often susceptible to bias based upon the different scales within our dataset. For instance, run the following code, which will show you an example of two columns, `revenue` and `items`, which are on different scales.

In [None]:
# Make figure
plt.figure(figsize=(15, 8))

# Draw figure
sns.boxplot(data=filtered_data[['revenue', 'items']], orient='h')

As you can see, the revenue distribution is much more spread out than the items distribution (which makes sense, as a single item might cost $100)! Many machine learning algorithms, including the distance-based clustering we use later, give more weight to features with larger numeric ranges. A typical pre-processing step is therefore **normalisation**, which puts our features on the same scale. Here we use a form of normalisation called standardisation: each feature is rescaled to have a mean of 0 and a standard deviation of 1.
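
To show what rescaling does to the numbers, here is a minimal sketch that standardises a made-up table by hand: subtract each column's mean, then divide by its standard deviation. The toy values are assumptions for illustration. The population standard deviation (`ddof=0`) is used because that is what `StandardScaler` uses.

```python
import pandas as pd

# Made-up data on two very different scales
toy = pd.DataFrame({
    "revenue": [100.0, 250.0, 400.0, 1250.0],
    "items":   [1, 2, 4, 9],
})

# Standardise: (value - column mean) / column standard deviation
standardised = (toy - toy.mean()) / toy.std(ddof=0)

print(standardised.round(2))
print(standardised.mean().round(2))       # each column now has mean 0
print(standardised.std(ddof=0).round(2))  # and standard deviation 1
```

After this step both columns sit on the same scale, so neither dominates simply because its raw numbers are larger.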

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
def standardise_data(my_data, cols_to_norm):
    """
    Normalise the dataset.
    
    inputs: my_data <pd.DataFrame>, the data to normalise
            cols_to_norm <list>, the columns to normalise
                             
    output: The normalised dataset
    """
    # Create a normaliser
    sclr = StandardScaler()
    
    # Fit the dataset
    sclr.fit(my_data[cols_to_norm])
    norm_my_data = pd.DataFrame(sclr.transform(my_data[cols_to_norm]), columns=cols_to_norm)
    
    return norm_my_data

In [None]:
# We will use all numerical columns as features
numerical_cols = [i for i in filtered_data.columns if str(filtered_data[i].dtype) != 'object']

# Standardise data
norm_data = standardise_data(filtered_data, numerical_cols)

# Part 4: Clustering the dataset

[Top](#Coder-Academy-and-THE-ICONIC-Masterclass) | [Previous Section](#Part-3:-Data-pre-processing-and-feature-engineering) | [Next Section](#Part-5:-Classifying-our-inferred-gender) | [Bottom](#Wrap-up)

## Introduction to Clustering

Clustering allows us to separate the data into groups, or clusters, using the features we have created within our data. Here is an animation that shows an example of clustering data into two clusters, based upon two variables: the male items purchased, and the number of mobile device orders.

<table>
    <tr>
        <td style="padding:25px"><img src="../img/cluster_w_titles.png" width="350"></td>
        <td style="padding:25px"><img src="https://cdn-images-1.medium.com/max/1600/1*WkU1q0Cuha2QKU5JnkcZBw.gif" width="350"></td>
    </tr>
</table>

As you can see, there are two variables in the dataset, and each colour in the animation represents a different cluster, of which there are two in total. The animation repeats two steps:

1. An **Update Cluster Assignment** step: which re-colours each point in space as a new cluster
2. An **Update Cluster Centers** step: which moves around the small diamonds to new spaces

## Clustering Algorithm

We call this method of clustering **K-Means** clustering. This method places **K** diamonds, or centers, randomly within our dataset (in the above example K=2). It then iteratively moves these centers by:

1. Labeling the points closest to each center as belonging to "its" cluster (the **assignment step**)
2. Shifting each center to the mean position of the points currently labeled as its cluster (the **update step**)
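
The two steps above can be sketched in plain NumPy. This is a bare-bones illustration of the idea, not the implementation used by sklearn: the function name `kmeans_sketch`, the fixed number of iterations, and the toy points are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k=2, n_iter=10, seed=0):
    """Toy K-Means: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # Start with k centers placed on randomly chosen data points
    centers = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(n_iter):
        # Assignment step: label each point with its closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: move each center to the mean of its labeled points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    return labels, centers

# Two made-up, well-separated groups of points
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans_sketch(points, k=2)
print(labels)
print(centers)
```

Which group ends up labeled 0 and which 1 depends on the random starting centers, so only the grouping itself is meaningful, not the label numbers.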

Let's run K-Means on our dataset. As you might have guessed, we'll create **two clusters**, one for our inferred males, and one for females. The following function can be used to run K-Means.

First we need to import K-Means from the [sklearn library](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). We will also import the [silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html), which gives a measure on how well our clustering performs.

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [None]:
def run_kmeans(my_data, cont_features):
    """
    Run K-Means on a dataset with a given feature set.
    
    inputs: my_data <pd.DataFrame>, the dataset to create clusters from
            cont_features <list>, the list of features to run K-Means on
            
    output: the clusters
    """

    # Run kmeans
    kmeans = KMeans(n_clusters=2, random_state=42).fit(my_data[cont_features])
    pred = kmeans.predict(my_data[cont_features])
    
    # Score
    silhouette = silhouette_score(my_data[cont_features], pred)
    print("The silhouette score is: " + str(silhouette))
    
    return pred

Now we can run K-Means.

In [None]:
clusters = run_kmeans(norm_data, norm_data.columns)

One note on our silhouette score:

> The **silhouette score** is a measure of cluster performance. It tells us information about how **compact** and **far apart** our clusters are. 

If the silhouette score is...

* Close to 1.0, it means our clusters are compact and far apart.
* Close to 0, it means our clusters overlap, with many points sitting near the boundary between clusters.
* Close to -1.0, it means that many data points are closer to an opposing cluster, and should be placed into that opposing cluster rather than the cluster they are currently within.
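
To make the score less of a black box, here is a hand-rolled sketch of the standard silhouette definition. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the points in the nearest other cluster, and the point's score is `(b - a) / max(a, b)`. The function name `silhouette_sketch` and the toy points are made up for this example, and the sketch assumes at least two clusters; in the notebook itself we rely on sklearn's `silhouette_score`.

```python
import numpy as np

def silhouette_sketch(points, labels):
    """Mean silhouette over all points (assumes at least two clusters)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)

    # Distance between every pair of points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # do not compare a point with itself
        if not same.any():
            scores.append(0.0)  # a cluster of one point scores 0
            continue
        a = dists[i, same].mean()
        b = min(
            dists[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
good = silhouette_sketch(points, [0, 0, 0, 1, 1, 1])   # sensible clusters
mixed = silhouette_sketch(points, [0, 1, 0, 1, 0, 1])  # scrambled clusters
print(good, mixed)
```

The sensible labeling scores close to 1, while the scrambled labeling scores much lower.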

# Part 5: Classifying our inferred gender

[Top](#Coder-Academy-and-THE-ICONIC-Masterclass) | [Previous Section](#Part-4:-Clustering-the-dataset) | [Next Section](#Part-6:-Putting-it-altogether-and-next-steps) | [Bottom](#Wrap-up)

Each cluster we have made represents the "inferred" gender. Though we do not have the actual gender, if the differentiation of these clusters corresponds to the qualitative knowledge we know about male vs. female consumers, we can have some level of confidence that these are two differentiated groups of people. 

Even if these groups do not correspond to the actual gender of a person, our feature space can inform us about how we could tailor marketing to each of these groups. What we then need to do is create an algorithm that captures _new_ information and classifies this new information into each of our inferred gender classes.


Here's a visual of this process:

---

![](../img/ICONIC_Classification_Drawing.png)

---

We'll look at a simple algorithm to do this, called a random forest, which is very _transparent_ about how our features inform the cluster labels.

**Note:** Random forests are relatively simple algorithms to train and test: they are fast, and it is easy to peel back the layers of the algorithm and see how changes in the inputs affect the results of the prediction. After training a random forest and getting an understanding of the underlying features that infer gender, one might use something more sophisticated, such as a [neural network](https://www.digitaltrends.com/cool-tech/what-is-an-artificial-neural-network/) built with the [Tensorflow](https://www.tensorflow.org/tutorials/) library, to try to create more accurate predictions.

That being said, a good mantra to follow in data science is...**if the simplest solution works, stick with it!**


## Classification Algorithm - Random Forest

A random forest is a classification algorithm that can be used to classify new data into our two classes. From [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html):

> A **random forest** is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

To break this down a little further, random forests create trees of data by splitting the data along features. The goal is to use these split points to accurately sort our data into the classes at hand.

The following image shows an example of using a tree of data to classify whether someone survived the Titanic disaster (or not) using the gender, age, and cabin class of a passenger.

---

<img src="../img/THE_ICONIC_RF.png" alt="Drawing" style="width: 700px;"/>

---

Random forests create multiple trees from different samples of a dataset. When one uses the forest to then predict a class, that sample's prediction is calculated from each tree, and the majority prediction is used as the final class label.
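
The voting step can be sketched in a few lines. The three "trees" below are hypothetical hand-written rules (not trees learned from data), and the passenger record is invented; the point is only to show how individual predictions are combined by majority vote. Note that scikit-learn's implementation combines the trees' predicted class probabilities rather than counting hard votes, so treat this as a simplified picture.

```python
from collections import Counter


# Three hypothetical "trees", each a simple rule returning 1 (survived) or 0 (did not)
def tree_a(passenger):
    return 1 if passenger["sex"] == "female" else 0


def tree_b(passenger):
    return 1 if passenger["age"] < 10 else 0


def tree_c(passenger):
    return 1 if passenger["cabin_class"] == 1 else 0


def forest_predict(trees, sample):
    """Ask every tree for a prediction and return the most common answer."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# An invented passenger record
passenger = {"sex": "female", "age": 35, "cabin_class": 1}

trees = [tree_a, tree_b, tree_c]
votes = [tree(passenger) for tree in trees]
prediction = forest_predict(trees, passenger)

print("Votes from each tree:", votes)
print("Majority prediction: ", prediction)
```

Using an odd number of trees for a two-class problem, as here, avoids tied votes.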

The following code will [import the RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from the sklearn library. It will also import the [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) class from sklearn, which will help us estimate how our algorithm will perform on new data, and the [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function, which will score the effectiveness of these models.

We also import the [time](https://docs.python.org/3/library/time.html) module, which lets us evaluate model training time.
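
To give a feel for what "stratified" folds are, here is a simplified sketch that deals samples into folds so each fold keeps roughly the same class ratio as the full dataset. The helper `stratified_folds` is hypothetical and much simpler than sklearn's `StratifiedKFold` (it does no shuffling, for instance); it is only meant to show the idea.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    # Group sample indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's samples round-robin across the folds
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Toy imbalanced labels: eight samples in class 0, four in class 1
labels = [0] * 8 + [1] * 4
folds = stratified_folds(labels, k=4)

for fold_number, test_idx in enumerate(folds):
    # Everything outside the test fold is used for training in that round
    train_idx = [i for i in range(len(labels)) if i not in test_idx]
    n_class_1 = sum(labels[i] for i in test_idx)
    print("Fold %d: %d train, %d test (%d from class 1)"
          % (fold_number, len(train_idx), len(test_idx), n_class_1))
```

Every fold ends up with two samples from class 0 and one from class 1, which is the same 2:1 ratio as the whole dataset. Each sample is used for testing exactly once.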

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import time

The following function will allow us to train and validate the performance of a random forest classifier.

In [None]:
def create_random_forest(my_data, features, target_var, k=10, n_estimators=100, max_depth=2):
    """
    Train and test a random forest using stratified K-Fold cross validation.
    
    inputs: my_data <pd.DataFrame>, the input data
            features <list>, the list of features
            target_var <str>, the name of the column to predict
            k <int>, the number of folds to use for cross validation
            n_estimators <int>, the number of trees in the forest
            max_depth <int>, the maximum depth of each tree
    outputs: model <RandomForestClassifier>, the final classifier, trained on all the data
             results <np.ndarray>, the F1 score from each fold
             feat_importance <pd.DataFrame>, the importance of each feature in a dataframe
             
    """
    # Start time
    start_time = time.time()
    
    # Create a RandomForestClassifier
    clf = RandomForestClassifier(
        n_estimators=n_estimators, 
        max_depth=max_depth,
        random_state=42,
        class_weight="balanced_subsample",
    )
    
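    # Evaluate with StratifiedKFold cross validation
    # (shuffle=True is needed for random_state to have an effect;
    # recent sklearn versions raise an error without it)
    kfold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    results = cross_val_score(clf, my_data[features], my_data[target_var], cv=kfold, scoring='f1')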
    
    # Print result
    print("F1: Mean %.3f +/- (%.3f)" % (results.mean(), results.std()))

    # Fit with all data
    clf.fit(my_data[features], my_data[target_var])
    
    # Feature importance
    feat_importance = pd.DataFrame(
        clf.feature_importances_, index=features, columns=['Importance']
    ).sort_values(['Importance'], ascending=False)
    feat_importance['Index'] = range(feat_importance.shape[0])
    
    # Graph
    feat_importance_cut = feat_importance.loc[feat_importance['Importance'] > 0.01, :]
    plt.figure(figsize=(15, 5))
    sns.pointplot(x='Index', y='Importance', data=feat_importance_cut, linestyles='')
    plt.xlabel(xlabel='')
    for i, ind in enumerate(feat_importance_cut.index.values):
        x = feat_importance.loc[ind, 'Index']
        y = feat_importance.loc[ind, 'Importance']
        plt.text(x+0.08, y, ind, fontsize=9)
        
    # End
    end_time = time.time()
    print('Elapsed time: %.2f seconds' % (end_time - start_time))
    
    # Return the model and the feature importance
    return clf, results, feat_importance[['Importance']]

Run the following cell to train the algorithm and take a look at the outputs. We will go back to using the filtered_data we originally created, to give us an idea about how the _original_ columns factored into our cluster creation.

In [None]:
# We will use all numerical columns as features
numerical_cols = [i for i in filtered_data.columns if str(filtered_data[i].dtype) != 'object']

rf_data = filtered_data.copy()
rf_data['Clusters'] = clusters
model, results, feat_importance = create_random_forest(rf_data, numerical_cols, 'Clusters', k=5)

Random forests are a good first choice for the following reasons:

* They train **very fast**
* They do not require a lot of **preprocessing** to use
* They reduce the likelihood of **overfitting**, meaning they generalise well to new data
* They are **transparent**, meaning we can see how much each feature in our model contributes to the class distinction

The graph printed out shows all features that had an importance over 0.01 for our prediction. The _higher_ the importance, the more relevant the feature is for distinguishing between classes.

In addition, we used an "F1" score to determine how well our algorithm performed. The F1 score combines precision and recall (it is their harmonic mean) and ranges from 0.0 to 1.0; the closer it is to 1.0, the closer our algorithm is to perfect predictions. An [F1 score](https://en.wikipedia.org/wiki/F1_score) is commonly used when the classes we are predicting are _imbalanced_, meaning our data is not split evenly, with 50% in one class and 50% in the other.
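
Here is a small worked example of why F1 is preferred over plain accuracy for imbalanced classes. The labels and predictions are invented, and `f1_score_binary` is a hand-rolled helper written for this illustration (sklearn provides its own implementation).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score_binary(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions at all
    precision = tp / (tp + fp)  # of the predicted positives, how many were right?
    recall = tp / (tp + fn)     # of the actual positives, how many did we find?
    return 2 * precision * recall / (precision + recall)


# Imbalanced toy data: only 2 of 10 samples belong to class 1
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A "lazy" model that always predicts the majority class
y_lazy = [0] * 10

# A model that finds one of the two positives and raises one false alarm
y_model = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print("Lazy model:  accuracy %.2f, F1 %.2f"
      % (accuracy(y_true, y_lazy), f1_score_binary(y_true, y_lazy)))
print("Other model: accuracy %.2f, F1 %.2f"
      % (accuracy(y_true, y_model), f1_score_binary(y_true, y_model)))
```

Both models have the same accuracy (0.80), but the lazy model gets an F1 of 0.00 because it never identifies the minority class, while the other model scores 0.50.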

### Optional: Parameter Tuning and Cross-validation

There are a few other parameters we have not talked about yet that can affect how good our model is, namely **n_estimators** and **max_depth**.

> The **number of estimators**, or **n_estimators**, is the number of trees in the forest. Generally, _more trees_ give more stable predictions and reduce overfitting, but take longer to train.

> The **max_depth** sets the maximum depth of each tree. The deeper the trees, the better the fit to our current dataset, but the more susceptible we are to overfitting.

We call the process of choosing values for parameters such as n_estimators and max_depth _hyperparameter tuning_. Cross-validation is how we score each choice: the model is trained on some folds of the data and tested on the fold that was held out, so the score reflects performance on data the model has not seen.
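
One common way to tune parameters is a grid search: try every combination of candidate values and keep the one with the best cross-validated score. The sketch below shows only the search loop. The helper `grid_search` and the scoring function `toy_score` are hypothetical stand-ins (the toy score is a made-up formula, not a real model), so the numbers it prints say nothing about our dataset.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every parameter combination and return the best one."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(**params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results


def toy_score(n_estimators, max_depth):
    """Made-up stand-in for a mean cross-validated F1 score (assumption)."""
    # Pretend the score peaks at max_depth=10 and improves with more trees
    return 0.9 - abs(max_depth - 10) * 0.01 - 1.0 / n_estimators


param_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [2, 10, 50],
}

best_params, best_score, results = grid_search(param_grid, toy_score)

for score, params in results:
    print("%.3f  %s" % (score, params))
print("Best:", best_params)
```

In this notebook, the scoring function could instead train a random forest with the given parameters and return the mean of its cross-validated F1 scores. Keep in mind that the number of combinations, and so the total run time, grows quickly as more candidate values are added.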

### Exercise

The following code has two variables that can be used to change the `n_estimators` and `max_depth` used within our model. Two example values are given. Play with the parameters and check how they impact the F1 score of the model. Also look at how the total run time of the model is affected.

In [None]:
# We will use all numerical columns as features
numerical_cols = [i for i in filtered_data.columns if str(filtered_data[i].dtype) != 'object']

# Set variables
n_estimators = 100
max_depth = 50

# Run the model
model, results, feat_importance = create_random_forest(
    rf_data, numerical_cols, 'Clusters', k=10, n_estimators=n_estimators, max_depth=max_depth
)

# Part 6: Putting it altogether and next steps

[Top](#Coder-Academy-and-THE-ICONIC-Masterclass) | [Previous Section](#Part-5:-Classifying-our-inferred-gender) | [Next Section](#Wrap-up) | [Bottom](#Wrap-up)

Let's put it all together. The following function will combine _all_ the steps we have done above into a single pipeline. Here's a visual to show all of the steps in numbered order.

---

<img src="../img/ICONIC_Full_Pipeline_Drawing.png" width="700">

---

Note there are some new pieces of the puzzle we have not dealt with.

* When we ran our random forest, we saw what features had an impact towards our cluster labels. Usually data scientists have a good **qualitative understanding** of what should influence their outcome. Did the features that were important within the random forest match your understanding? If not, there are a few things we could do...
  * We could remove these features, and limit which columns we use for clustering
  * We could create more new features, which might help eliminate some of the correlations. Remember, we did this with the `pct_female_items` column we created

* To actually process our new data, we would have to somehow host our trained algorithm. Tools like [Amazon Sagemaker](https://aws.amazon.com/sagemaker/) allow for easy hosting of machine learning algorithms. We won't go into this tonight, but it might be good to check out one of these tools.

As you see, **most** of these decisions involve data cleaning, and most of a data scientist's job involves data cleaning! We need to make sure our algorithms are _understandable_, since there is uncertainty in the accuracy of our predictions.

## Setting up our pipeline

Run the following code, which creates the entire pipeline. It...

1. Transforms the data using normalisation
2. Clusters the data based upon these transformed features
3. Runs a random forest to gauge how features contribute to our classification
4. Outputs the random forest model to predict new data
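
As a reminder of why step 1 matters, here is a minimal sketch of standardisation using z-scores, assuming that is the kind of transformation our earlier normalisation step applies. The helper `standardise` and the toy values are invented for illustration.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values to have a mean of 0 and a standard deviation of 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mu) / sigma for v in values]


# Two toy features on very different scales
revenue = [20.0, 40.0, 60.0, 80.0]
orders = [1, 2, 3, 4]

z_revenue = standardise(revenue)
z_orders = standardise(orders)

print("Standardised revenue:", [round(z, 3) for z in z_revenue])
print("Standardised orders: ", [round(z, 3) for z in z_orders])
```

Before standardising, differences in `revenue` are twenty times larger than differences in `orders`, so a distance-based method such as k-means would be dominated by revenue. After standardising, both features are on the same scale and contribute equally.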


In [None]:
def gender_inference_pipeline(
    my_data,
    features,
    k_folds=5,
    rf_estimators=100,
    rf_depth=10,
):
    """
    Run entire pipeline based upon a given number of features
    
    inputs: my_data <pd.DataFrame>, the dataset
            features <list>, the list of variables to use for the data
            k_folds <int>, the number of folds to train upon
            rf_estimators <int>, number of trees in the rf
            rf_depth <int>, the depth of each tree
            
    outputs: The final model, f1_scores from training, feature importance, and inferred gender
    """
    # Print features
    print('Using the following features to infer gender: ')
    print('*'* 54)
    for i in features:
        print('* ' + i + ' ' * (50 - len(i)) + ' *')
    print('*'* 54)
    print('')

    
    # Reset index
    my_data = my_data.reset_index(drop=True)
    
    # Normalise data
    norm_data = standardise_data(my_data, features)
    
    # Cluster data
    clusters = run_kmeans(norm_data, features)
    
    # Train random forest
    rf_data = my_data.copy()
    rf_data['Clusters'] = clusters
    model, results, feat_importance = create_random_forest(
        rf_data, features, 'Clusters', k=k_folds, n_estimators=rf_estimators, max_depth=rf_depth
    )

    return model, results, feat_importance, clusters

Run the following to train the entire pipeline with filtered data.

In [None]:
# We will use all numerical columns as features
numerical_cols = [i for i in filtered_data.columns if str(filtered_data[i].dtype) != 'object']

# Run the model
end_model, f1_scores, feat_importance, clusters = gender_inference_pipeline(
    my_data=filtered_data,
    features=numerical_cols,
    k_folds=5,
    rf_estimators=100,
    rf_depth=50,
)

## Exercise

There are a **ton** of parameters we could change within our model. I've also recopied the list of columns within our data below.

| Column                   | Value   | Description                                                              | 
|--------------------------|---------|--------------------------------------------------------------------------| 
| customer_id              | string  | ID of the customer - super duper hashed                                  | 
| days_since_first_order   | integer | Days since the first order was made                                      | 
| days_since_last_order    | integer | Days since the last order was made                                       | 
| int_is_newsletter_subscriber | string  | Flag for a newsletter subscriber (1 = Yes, 0 = No)                                        | 
| orders                   | integer | Number of orders                                                         | 
| items                    | integer | Number of items                                                          | 
| cancels                  | integer | Number of cancellations - when the order is cancelled after being placed | 
| returns                  | integer | Number of returned orders                                                | 
| different_addresses      | integer | Number of times a different billing and shipping address was used        | 
| shipping_addresses       | integer | Number of different shipping addresses used                              | 
| devices                  | integer | Number of unique devices used                                            | 
| vouchers                 | integer | Number of times a voucher was applied                                    | 
| cc_payments              | integer | Binary indicating if credit card was used for payment                       | 
| paypal_payments          | integer | Binary indicating if PayPal was used for payment                              | 
| afterpay_payments        | integer | Binary indicating if AfterPay was used for payment                            | 
| apple_payments           | integer | Binary indicating if Apple Pay was used for payment                           | 
| pct_female_items             | integer | Percentage of items purchased for women                                         | 
| unisex_items             | integer | Number of unisex items purchased                                         | 
| wapp_items               | integer | Number of Women Apparel items purchased                                  | 
| wftw_items               | integer | Number of Women Footwear items purchased                                 | 
| mapp_items               | integer | Number of Men Apparel items purchased                                    | 
| wacc_items               | integer | Number of Women Accessories items purchased                              | 
| macc_items               | integer | Number of Men Accessories items purchased                                | 
| mftw_items               | integer | Number of Men Footwear items purchased                                   | 
| wspt_items               | integer | Number of Women Sport items purchased                                    | 
| mspt_items               | integer | Number of Men Sport items purchased                                      | 
| curvy_items              | integer | Number of Curvy items purchased                                          | 
| sacc_items               | integer | Number of Sport Accessories items purchased                              | 
| msite_orders             | integer | Number of Mobile Site orders                                             | 
| desktop_orders           | integer | Number of Desktop orders                                                 | 
| android_orders           | integer | Number of Android app orders                                             | 
| ios_orders               | integer | Number of iOS app orders                                                 | 
| other_device_orders      | integer | Number of Other device orders                                            | 
| work_orders              | integer | Number of orders shipped to work                                         | 
| home_orders              | integer | Number of orders shipped to home                                         | 
| parcelpoint_orders       | integer | Number of orders shipped to a parcelpoint                                | 
| other_collection_orders  | integer | Number of orders shipped to other collection points                      | 
| average_discount_onoffer | float   | Average discount rate of items typically purchased                       | 
| average_discount_used    | float   | Average discount finally used on top of existing discount                | 
| revenue                  | float   | $ Dollar spent overall per person                                        |

#### Your goal is to experiment with the following...

* Change the features used in the model by editing the `feature_list` variable
* If you want to be an adventurous coder, create more features
* Filter out more outliers in the dataset, based upon your results
* Tune other parameters (for those who know how to do parameter tuning)...
  * The depth of the random forest
  * The number of random forest estimators

First see...

1. How do the variables separate the inferred gender?
2. How does modifying the other parameters change your model's F1 score?

We will also create an [sns.pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) visualisation of the top three features in the model, based upon these clusters.

In [None]:
# Parameters
feature_list = ['desktop_orders', 'days_since_last_order', 'pct_female_items', 'android_orders']

# Affects random forest
rf_estimators = 100
rf_depth = 50
k_folds = 5

# Run the model, passing in the variables set above
end_model, f1_scores, feat_importance, clusters = gender_inference_pipeline(
    my_data=filtered_data,
    features=feature_list,
    k_folds=k_folds,
    rf_estimators=rf_estimators,
    rf_depth=rf_depth,
)

# Create pair plot
pairplot_data = filtered_data[feat_importance.index[range(min(3, feat_importance.shape[0]))]].copy()
pairplot_data['Inferred Gender'] = ['Cluster 1' if i == 1 else 'Cluster 0' for i in clusters]
sns.pairplot(
    pairplot_data, 
    hue='Inferred Gender', 
    plot_kws={'alpha': 0.6, 's': 50, 'edgecolor': 'k'}
)

### Using the model to predict an inferred gender

The function above returned an `end_model` variable containing our trained random forest. We could use this to predict an inferred gender on any data with the same features. Here's an example, using the first row of our training set:

In [None]:
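# Predict gender for the first row of the training set
# (.iloc selects by position, so this works even if the index does not start at 0)
gender = end_model.predict(filtered_data[feature_list].iloc[[0]])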

print('Inferred gender(s): ' + str(gender))

# Wrap-up

Thank you for attending our masterclass! We hope we _demystified_ a little bit of what actually occurs when you build a machine learning process. Big thank you to Kshira Saagar from THE ICONIC for his time and lending us data for the workshop.

We will be sending out a **survey over email** to get your feedback on the session!

If you would like to download your work for today, please click **File->Download As->.html**. You will not be able to run the cells, but you will be able to view the material you learned within a web browser.

## Survey

We would appreciate it if you could complete a quick feedback survey for tonight's Masterclass. You can find the survey here: http://bit.ly/ml_iconic_survey

THANK YOU!!