# Live Coding: Clustering - Searching for Customer Profiles

After finishing this notebook, here are some more ideas on how to continue working on this: https://www.kaggle.com/code/karnikakapoor/customer-segmentation-clustering/notebook


## Dataset
This dataset (again) comes from Kaggle: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis

### Problem Statement

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.

### Attributes

People

    ID: Customer's unique identifier
    Year_Birth: Customer's birth year
    Education: Customer's education level
    Marital_Status: Customer's marital status
    Income: Customer's yearly household income
    Kidhome: Number of children in customer's household
    Teenhome: Number of teenagers in customer's household
    Dt_Customer: Date of customer's enrollment with the company
    Recency: Number of days since customer's last purchase
    Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Products

    MntWines: Amount spent on wine in last 2 years
    MntFruits: Amount spent on fruits in last 2 years
    MntMeatProducts: Amount spent on meat in last 2 years
    MntFishProducts: Amount spent on fish in last 2 years
    MntSweetProducts: Amount spent on sweets in last 2 years
    MntGoldProds: Amount spent on gold in last 2 years

Promotion

    NumDealsPurchases: Number of purchases made with a discount
    AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
    AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
    AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
    AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
    AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
    Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

    NumWebPurchases: Number of purchases made through the company’s website
    NumCatalogPurchases: Number of purchases made using a catalogue
    NumStorePurchases: Number of purchases made directly in stores
    NumWebVisitsMonth: Number of visits to company’s website in the last month

#### Target

Need to perform clustering to summarize customer segments.

#### Acknowledgement

The dataset for this project is provided by Dr. Omar Romero-Hernandez. 

## Importing libraries

In [None]:
import os

import numpy as np
import pandas as pd
import seaborn as sb  # data visualization library  
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 500)

## Importing data

The data can be imported using Pandas with the command `pd.read_csv()`.
In many cases, this does not work directly. This is usually due to one of the following issues:
- `FileNotFoundError` --> Either the file name is spelled incorrectly or the path is incorrect.
- `UnicodeDecodeError` --> The file itself is not saved in the expected "encoding". There are two options: (1) Convert the file using an editor. Or (2) set the parameter `encoding=...` accordingly. A similar-looking unicode error can also come from the path: in Windows paths, a single backslash starts an escape sequence in Python strings, so write `"\\"` or `"/"` instead, or use a raw string (`r"..."`).
There are many possible encodings ([see link](https://docs.python.org/3/library/codecs.html#standard-encodings)), but the most common are "utf-8" (the standard), "cp1252" (often called "ANSI" on Windows), "iso-8859-1" (also known as "latin-1") and "ascii".
- `ParserError` --> Usually means that the "delimiter" (i.e., the separator) is specified incorrectly. It is best to open the file briefly with an editor and check, then set it accordingly with `delimiter="..."`. Typical separators are `","`, `";"`, `"\t"` (tab).
- If the file does not start with the desired column names, this can be corrected by specifying the rows to be skipped --> `skiprows=1` (1, 2, 3,... depending on the case).
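
The fixes listed above can be combined in a single call. The following is only a minimal sketch on made-up data: the in-memory bytes stand in for a real file, and the names `raw_bytes` and `demo` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical file content: one junk line, then a semicolon-separated table
# encoded as iso-8859-1 (the accented characters would break a utf-8 decode).
raw_bytes = "exported by some tool\nname;city\nRené;Köln\nAnna;Berlin\n".encode("iso-8859-1")

demo = pd.read_csv(
    io.BytesIO(raw_bytes),   # stands in for a file path
    sep=";",                 # separator used in this made-up file
    encoding="iso-8859-1",   # avoids the UnicodeDecodeError
    skiprows=1,              # skips the junk line before the column names
)
print(demo)
```

For a real file, the same keyword arguments are passed together with the path instead of the `io.BytesIO` object.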

In [None]:
filename = "datasets/marketing_campaign.csv"  # adjust to your own path

data = pd.read_csv(filename, sep="\t")  # the Kaggle file is tab-separated; adjust if your copy differs

## (1) Exercise: Initial data exploration!
Use Pandas to answer the following questions:

- How many columns are there and what are they? --> `data.columns`
- Are there any missing values? --> `.info()`
- Initial overview and: Are there any unusual entries? --> `.describe()`

In [None]:
data.head()

Use ID as index:

In [None]:
data = data.set_index("ID")

## Data pre-processing: Convert date
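This can be done in Pandas with `pd.to_datetime()`. We can either describe the existing format explicitly --> `format = '%d-%m-%Y'`

Or we can simply let Pandas "guess" the format. Older versions used `infer_datetime_format=True` for this; since pandas 2.0 that parameter is deprecated because the format is inferred by default (`dayfirst=True` helps with day-first dates).

The date is then no longer a string (even if it looks like one when displayed!), but a separate `Timestamp` type.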
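
As a small illustration of what the conversion does, here is a sketch on two made-up date strings (the values and the names `dates`, `parsed`, and `first` are only assumptions for this example, not taken from the dataset):

```python
import pandas as pd

# Made-up day-month-year strings in the same style as Dt_Customer
dates = pd.Series(["04-09-2012", "21-08-2013"])
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

first = parsed.iloc[0]
print(type(first).__name__)                # Timestamp
print(first.day, first.month, first.year)  # 4 9 2012
```

Because the format is given explicitly, `04-09-2012` is read as 4 September and not as 9 April.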

### Test it out:

- What does `data["Dt_Customer"].iloc[0].day` return?
- What does `data["Dt_Customer"].iloc[0].year` return?

In [None]:
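# Explicit format: day-month-year, e.g. "04-09-2012"
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format="%d-%m-%Y")
# Alternative: let pandas infer the format --> pd.to_datetime(data["Dt_Customer"], dayfirst=True)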

## (1b) Further data exploration
- Which features/characteristics have "outliers"?

To answer this question, it is useful to display the features as a distribution using `plot(kind="hist")` or directly with `.hist()`.

Incidentally, this can even be done for all numerical features at once with `data.hist()`. However, the size should be adjusted using `figsize=(...)`.

## (2) Data cleaning
- Remove entries with missing values by using `.dropna()`.

There are some outliers in the distributions/histograms. There are some entries in `Income` that deviate significantly from the rest. The age data also appears to be incorrect.
- All data points with `Income` >= 150000 should be removed.
- All data points with `Year_Birth` < 1925 should be removed.
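
The three cleaning steps can be sketched on a toy table as follows. The numbers are invented for illustration; only the column names and the thresholds come from the text above.

```python
import numpy as np
import pandas as pd

# Invented rows: one missing income, one extreme income, one implausible birth year
toy = pd.DataFrame({
    "Income": [52000.0, np.nan, 666666.0, 31000.0, 48000.0],
    "Year_Birth": [1975, 1980, 1977, 1900, 1990],
})

cleaned = toy.dropna()                            # remove rows with missing values
cleaned = cleaned[cleaned["Income"] < 150000]     # keep only Income below 150000
cleaned = cleaned[cleaned["Year_Birth"] >= 1925]  # keep only birth years from 1925 on
print(len(toy), "->", len(cleaned))               # 5 -> 2
```

Note that each condition is written as the rows to keep, which is the inverse of the rows to be removed.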

## (3) Explore data
- Explore the features "Marital_Status" and "Education" with `value_counts()`. This can also be done graphically with additional `.plot(kind="barh")` or `.plot(kind="pie")`.
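
A minimal sketch of `value_counts()` on a made-up `Education` column (the category values here are only examples):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Education": ["Graduation", "PhD", "Graduation", "Master", "Graduation", "PhD"]}
)

counts = toy["Education"].value_counts()                # absolute counts, most frequent first
shares = toy["Education"].value_counts(normalize=True)  # relative frequencies instead
print(counts)

# In a notebook, the counts can be plotted directly, for example:
# counts.plot(kind="barh")
```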

# Preprocessing
Now we want to convert, remove, and add a few features...
- Replace `Year_Birth` with a new feature `Age`, which is calculated using `2020 - Year_Birth`.
- Add a new feature `Spent`, which contains the sum of all amounts spent:
`'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'`
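
One possible way to build both features is sketched below on two invented rows. As an assumption of this sketch, only the six `Mnt...` amount columns are summed; `NumDealsPurchases` is left out because it counts purchases rather than money.

```python
import pandas as pd

# Two invented customers with the original (not yet renamed) column names
toy = pd.DataFrame({
    "Year_Birth": [1957, 1984],
    "MntWines": [635, 11],
    "MntFruits": [88, 1],
    "MntMeatProducts": [546, 6],
    "MntFishProducts": [172, 2],
    "MntSweetProducts": [88, 1],
    "MntGoldProds": [88, 6],
})

amount_cols = ["MntWines", "MntFruits", "MntMeatProducts",
               "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

toy["Age"] = 2020 - toy["Year_Birth"]         # age relative to the year 2020
toy["Spent"] = toy[amount_cols].sum(axis=1)   # row-wise sum over the amount columns
print(toy[["Age", "Spent"]])
```

`axis=1` makes the sum run across columns, so every customer gets one total.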

### Rename feature
We would prefer to rename some features. To do this, please execute the following:
```python
data = data.rename(
    columns={
        "MntWines": "Wines",
        "MntFruits": "Fruits",
        "MntMeatProducts": "Meat",
        "MntFishProducts": "Fish",
        "MntSweetProducts": "Sweets",
        "MntGoldProds": "Gold",
    }
)
```

### Remove features
- The following features should be removed: "Year_Birth", "Z_CostContact", "Z_Revenue"

You can remove them using `df = df.drop(list_of_cols_to_drop, axis=1)`, for example.

## Correlation Analysis
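We can view all Pearson correlations between the numerical features using `data.corr(numeric_only=True)`. Without `numeric_only=True`, recent Pandas versions raise an error because of the text and date columns.
Given the number of correlations, it makes sense to display them graphically using

```python
sb.heatmap(data.corr(numeric_only=True))
```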

### Make it look nicer:
The plot will (probably) be a little easier to read if we display the values (`annot=True`).

We can also select a different "colormap", see [matplotlib documentation](https://matplotlib.org/stable/tutorials/colors/colormaps.html).
It is also good to set min and max values, which can be done by using `vmin` and `vmax`.

### Exercise:
- Display the correlation matrix graphically.

In [None]:
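fig, ax = plt.subplots(figsize=(14, 14))

sb.heatmap(
    data.corr(numeric_only=True),  # only numerical columns can be correlated
    # annot=True,
    # linewidths=.5,
    cmap="coolwarm",  # example choice: a diverging colormap suits correlations
    vmin=-1,          # Pearson correlations range from -1 ...
    vmax=1,           # ... to 1
    fmt=".1f",
    ax=ax,
)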

# Clustering
Now we can try k-Means by using [Scikit-Learn](https://scikit-learn.org/stable/modules/clustering.html).

But before we run the clustering algorithm, we need to process the data in certain ways to make sure that the algorithm can run successfully.

In [None]:
data.select_dtypes(include='number')

## Data Scaling

For many ML/data science techniques, it is essential to scale the data appropriately. In this case, techniques that take absolute values into account would otherwise result in a strong bias toward "large" numbers, such as `Income`, etc.

We will therefore standardize the values using the `StandardScaler` from the `scikit-learn` library.
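
To show what standardization does, here is a minimal sketch on a toy table (the values and the names `toy` and `toy_scaled` are invented for this example). `fit_transform()` returns a NumPy array, so the sketch wraps the result in a DataFrame again to keep the index and column names.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented values on very different scales
toy = pd.DataFrame(
    {"Income": [20000.0, 40000.0, 60000.0, 80000.0],
     "Kidhome": [0.0, 1.0, 2.0, 1.0]},
    index=[11, 12, 13, 14],
)

scaler = StandardScaler()
scaled_array = scaler.fit_transform(toy)  # NumPy array without labels

# Wrap the array so that index and column names are preserved
toy_scaled = pd.DataFrame(scaled_array, index=toy.index, columns=toy.columns)

print(toy_scaled.mean().round(6))         # close to 0 for every column
print(toy_scaled.std(ddof=0).round(6))    # close to 1 for every column
```

After scaling, both columns have comparable ranges, so neither dominates a distance-based method such as k-means.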

In [None]:
from sklearn.preprocessing import StandardScaler

# Creating a copy of data (with only numerical values)
data_numerical = data.copy().select_dtypes(include='number')

# Creating a subset of dataframe by dropping the features on deals accepted and promotions
cols_remove = ['AcceptedCmp3', 'AcceptedCmp4',
               'AcceptedCmp5', 'AcceptedCmp1',
               'AcceptedCmp2', 'Complain', 'Response']
data_numerical = data_numerical.drop(cols_remove, axis=1)

In [None]:
# Scaling
# add code to scale your data --> use scikit-learn StandardScaler

In [None]:
data_scaled.head()

## K-means
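- We start with 4 clusters here (this is our choice; the scikit-learn default is `n_clusters=8`).

The import is done via `from sklearn.cluster import KMeans`.
Then the algorithm must be run on the data with `.fit()`:
`kmeans = KMeans(...).fit(data_scaled)`

As parameters, we should set at least `n_clusters=...`. It is also better to set `random_state=...` so that the result is reproducible.

The plots below expect the cluster labels in a column named `cluster`, which can be added with `data["cluster"] = kmeans.labels_`.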
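
The following sketch runs k-means on two artificial, well-separated point clouds rather than on the customer data. The data, the feature names `f1` and `f2`, the choice of `n_clusters=2`, and the explicit `n_init=10` are all assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two artificial point clouds around (-2, -2) and (2, 2)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))
toy_scaled = pd.DataFrame(np.vstack([blob_a, blob_b]), columns=["f1", "f2"])

# Fit k-means; random_state makes the run reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_scaled)

# One label per row, in the same order as the input rows
toy_scaled["cluster"] = kmeans.labels_
print(toy_scaled["cluster"].value_counts())
```

Because `kmeans.labels_` follows the row order of the input, the labels can be attached as a new column to any DataFrame with the same rows.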

## Inspect the results visually

In [None]:
import seaborn as sb

sb.barplot(x='cluster',y='Age', data=data)
plt.title("Age / Cluster")
plt.show()

sb.barplot(x='cluster',y='Income', data=data)
plt.title("Income / Cluster")
plt.show()

sb.barplot(x='cluster',y='Kidhome', data=data)
plt.title("Kids / Cluster")
plt.show()

sb.barplot(x='cluster',y='Teenhome', data=data)
plt.title("Teens / Cluster")
plt.show()