# <p style="background-color:#003672;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Customer Segmentation</p>

In this project, I will extract valuable information from the [Customer Personality Analysis dataset](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis), which contains a wide range of features about each customer. These features will be useful because we will use them for a form of unsupervised learning: clustering. We are going to group similar customers into clusters to help choose the right marketing solutions. The data has the following features:

## People

| Name | Description |
| :-----------: | :-----------: |
|ID| Customer's unique identifier|
|Year_Birth| Customer's birth year|
|Education|Customer's education level|
|Marital_Status| Customer's marital status|
|Income| Customer's yearly household income|
|Kidhome| Number of children in customer's household|
|Teenhome| Number of teenagers in customer's household|
|Dt_Customer| Date of customer's enrollment with the company|
|Recency| Number of days since customer's last purchase|
|Complain| 1 if the customer complained in the last 2 years, 0 otherwise|

## Products

| Name | Description |
| :-----------: | :-----------: |
|MntWines| Amount spent on wine in last 2 years|
|MntFruits| Amount spent on fruits in last 2 years|
|MntMeatProducts| Amount spent on meat in last 2 years|
|MntFishProducts| Amount spent on fish in last 2 years|
|MntSweetProducts| Amount spent on sweets in last 2 years|
|MntGoldProds| Amount spent on gold in last 2 years|

## Promotion

| Name | Description |
| :-----------: | :-----------: |
|NumDealsPurchases| Number of purchases made with a discount|
|AcceptedCmp1| 1 if customer accepted the offer in the 1st campaign, 0 otherwise|
|AcceptedCmp2| 1 if customer accepted the offer in the 2nd campaign, 0 otherwise|
|AcceptedCmp3| 1 if customer accepted the offer in the 3rd campaign, 0 otherwise|
|AcceptedCmp4| 1 if customer accepted the offer in the 4th campaign, 0 otherwise|
|AcceptedCmp5| 1 if customer accepted the offer in the 5th campaign, 0 otherwise|
|Response| 1 if customer accepted the offer in the last campaign, 0 otherwise|

## Place

| Name | Description |
| :-----------: | :-----------: |
|NumWebPurchases| Number of purchases made through the company’s website|
|NumCatalogPurchases| Number of purchases made using a catalogue|
|NumStorePurchases| Number of purchases made directly in stores|
|NumWebVisitsMonth| Number of visits to company’s website in the last month|

# <p style="background-color:#003672;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Table of contents</p>

To be filled...

<a id ="1"></a>
# <p style="background-color:#003672;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Setup</p>

In [155]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

S = "\033[1m" + '\033[96m'
E = "\033[0m"

palette = ["#003672", "#943400", "#ED8B75", "#F2DC5D", "#0E9594"]

In [156]:
data = pd.read_csv("../input/customer-personality-analysis/marketing_campaign.csv", sep="\t")
data.head()

<a id ="1"></a>
# <p style="background-color:#003672;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Data cleaning</p>

In [157]:
shape = data.shape
print(S + f"The dataframe has {shape[0]} records and {shape[1]} features" + E)

In [158]:
data.isna().sum()

In [159]:
print(S + "The number of missing values is low, so we can drop these records" + E)
data.dropna(inplace=True)
data.isna().sum()
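
Before dropping rows, it can help to put a number on "low". Here is a minimal sketch on a made-up frame (`toy` and its values are hypothetical, not taken from the dataset); the same calls should work on `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data (made-up values, for illustration only).
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, 26646.0],
    "Recency": [58, 38, 26, 26],
})

# Share of missing values per column (the mean of a boolean mask is a fraction).
missing_share = toy.isna().mean()
print(missing_share)

# Number of rows that dropna() would remove: rows with at least one missing value.
rows_with_na = int(toy.isna().any(axis=1).sum())
print(f"{rows_with_na} of {len(toy)} rows would be dropped")

cleaned = toy.dropna()
```

If the share of affected rows turned out to be large, imputing the missing values would be an alternative to dropping them.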

In [160]:
data.nunique()

In [161]:
print(S + "Z_CostContact and Z_Revenue each have a single unique value, so they carry no information and we can drop them. ID is dropped as well because it is only an identifier" + E)
data.drop(["Z_CostContact", "Z_Revenue", "ID"], inplace=True, axis=1)

<a id ="2"></a>
# <p style="background-color:#003672;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Feature engineering</p>

In [162]:
data["Age"] = 2022-data["Year_Birth"]
data[["Age","Year_Birth"]].head()

In [163]:
data["Education"].unique()

In [164]:
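# Assign the result back: inplace=True on a column selection is not reliable in recent pandas versions
data["Education"] = data["Education"].replace(["Graduation", "PhD", "Master", "2n Cycle"], "Post Graduate")
data["Education"] = data["Education"].replace("Basic", "Under Graduate")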
data["Education"].unique()

In [165]:
data["Marital_Status"].unique()

In [166]:
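# Assign the result back: inplace=True on a column selection is not reliable in recent pandas versions
data["Marital_Status"] = data["Marital_Status"].replace(["Together", "Married"], "Relationship")
data["Marital_Status"] = data["Marital_Status"].replace(["Divorced", "Widow", "Alone", "Absurd", "YOLO"], "Single")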
data["Marital_Status"].unique()

In [167]:
data["Kids"] = data["Kidhome"] + data["Teenhome"]
data["Kids"].head()

In [168]:
data.dtypes

In [169]:
data["Spent"] = data["MntWines"] + data["MntFruits"] + data["MntMeatProducts"] + data["MntFishProducts"] + data["MntSweetProducts"] + data["MntGoldProds"]
data["Spent"].head()

In [170]:
data["TotalAccepted"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"]
data["TotalAccepted"].head()

In [171]:
data["YearsSinceCustm"] = 2022 - data["Dt_Customer"].str.slice(6,10,1).astype("int16")
data["YearsSinceCustm"].head()
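
The `slice(6, 10)` above relies on the year sitting at a fixed position in the string. As a hedged alternative, the dates can be parsed explicitly. The sketch below assumes a day-month-year layout (which is what the slice positions suggest) and uses made-up dates in a hypothetical `dates` series.

```python
import pandas as pd

# Made-up dates in the dd-mm-yyyy layout assumed from the slice positions.
dates = pd.Series(["04-09-2012", "08-03-2014", "21-08-2013"])

# An explicit format avoids silently swapping day and month.
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

# Same reference year (2022) as in the notebook.
years_since = 2022 - parsed.dt.year
print(years_since.tolist())  # [10, 8, 9]
```

Parsing to a real datetime also makes it possible to compute the tenure in days or months later on, if a finer resolution is needed.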

In [172]:
# Get rid of an extreme outlier; .copy() avoids SettingWithCopyWarning on later column assignments
data = data[data["Income"] < 600000].copy()
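
The threshold above is picked by hand. For comparison only, here is a sketch of the common 1.5 × IQR rule on made-up incomes (`toy` is hypothetical). This is a different technique from the fixed cut-off used in this notebook, and on the real data it may flag more rows than the single extreme value.

```python
import pandas as pd

# Made-up incomes with one extreme value (illustration only).
toy = pd.DataFrame({"Income": [30000, 45000, 52000, 61000, 75000, 666666]})

# Interquartile range and the usual 1.5 * IQR fences.
q1 = toy["Income"].quantile(0.25)
q3 = toy["Income"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep the rows inside the fences; .copy() gives an independent frame.
kept = toy[toy["Income"].between(lower, upper)].copy()
print(f"fences: [{lower:,.0f}, {upper:,.0f}], rows kept: {len(kept)} of {len(toy)}")
```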

In [173]:
data["NmbPurch"] = data["NumWebPurchases"] + data["NumStorePurchases"] + data["NumCatalogPurchases"]
data["NmbPurch"].head()

In [174]:
data.dtypes

In [175]:
# Drop the raw columns that have been replaced by engineered features
df = data.drop(["Year_Birth","Kidhome","Teenhome","Dt_Customer",
                "MntWines","MntFruits","MntMeatProducts","MntFishProducts","MntSweetProducts","MntGoldProds",
                "NumWebPurchases","NumCatalogPurchases","NumStorePurchases",
                "AcceptedCmp3","AcceptedCmp2","AcceptedCmp1","AcceptedCmp4","AcceptedCmp5"], axis=1)
df.head()

<a id ="3"></a>
# <p style="background-color:#003672;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">EDA</p>

In [176]:
# df["Income"].quantile(0.16)
print(S+f"16% of this company's customers earn less than 30,000. The median income is {df['Income'].median():,.0f}. Let's see how much they spend depending on their income"+E)

plt.figure(figsize=(15,5))
sns.scatterplot(x="Income",y="Spent",data=df);

In [177]:
print(S+"As expected, the more a person earns, the more they spend. What about customers who have children?\n\n\n"+E)

fig, (ax1,ax2) = plt.subplots(1,2,figsize=(25,7))
fig.suptitle("- Income and expenses depending on the number of kids -",size=22,weight="bold", color=palette[4])


sns.scatterplot(x="Income",y="Spent",data=df, hue="Kids", palette=palette[:4], ax=ax1)

sns.boxplot(x="Kids",y="Spent",data=df, palette=palette[:4], ax=ax2);

From these plots, we learn two things:

* The more children a customer has, the less they tend to spend
* Income and number of kids appear to be negatively correlated: customers with fewer kids generally earn more, and vice versa
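
These two observations are read off the plots, so a quick numeric check can back them up. The sketch below uses a small made-up frame (`toy`, hypothetical values) in place of `df`; in the notebook, the same calls would run on `df[["Income", "Kids", "Spent"]]`.

```python
import pandas as pd

# Toy stand-in for df (values are made up for illustration only).
toy = pd.DataFrame({
    "Income": [80000, 72000, 65000, 40000, 35000, 30000],
    "Kids":   [0, 0, 1, 1, 2, 3],
    "Spent":  [1800, 1500, 900, 300, 120, 60],
})

# Pairwise Pearson correlations between the three columns.
corr = toy[["Income", "Kids", "Spent"]].corr()
print(corr.round(2))

# Median spending per number of kids, the same quantity the box plot shows.
median_spent = toy.groupby("Kids")["Spent"].median()
print(median_spent)
```

A negative entry for the `Income`/`Kids` pair and a decreasing median per group would be consistent with the reading of the plots. A correlation on its own does not say which variable drives the other.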

In [178]:
print(S+"We're going to do the same thing depending on the marital status\n\n\n"+E)

fig, (ax1,ax2) = plt.subplots(1,2,figsize=(25,7))
fig.suptitle("- Income and expenses depending on the marital status -",size=22,weight="bold", color=palette[4])


sns.scatterplot(x="Income",y="Spent",data=df, hue="Marital_Status", palette=[palette[0],palette[2]], ax=ax1)

sns.boxplot(x="Marital_Status",y="Spent",data=df, palette=[palette[0],palette[2]], ax=ax2);

In [179]:
print(S+"Finally, we're going to check other factors such as loyalty (how long someone has been a customer) and education: " +E)

plt.figure(figsize=(20,7))

plt.subplot(121)
sns.boxplot(data=df, y="Spent", x="YearsSinceCustm")

plt.subplot(122)
sns.scatterplot(x="Income",y="Spent",data=df, hue="Education", palette=[palette[0],palette[2]]);

* Under Graduates are a minority, and they tend to spend little, in line with their lower income
* Customers who have been with the company longer tend to spend more

<a id ="4"></a>
# <p style="background-color:#003672;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">Data preprocessing</p>

### Making everything numerical

In [180]:
df.dtypes

In [181]:
object_cols = (df.dtypes == 'object')
objects = list(object_cols[object_cols].index)
objects

In [182]:
le = LabelEncoder()
for obj in objects:
    df[obj] = df[[obj]].apply(le.fit_transform)
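
Once the columns are encoded, the original labels are gone, so it can be handy to keep the label-to-code mapping for reading the clusters later. A minimal sketch on a made-up frame (`toy` and `mappings` are hypothetical names): it fits one `LabelEncoder` per column and records its `classes_`, which are stored in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the two object columns of df (illustration only).
toy = pd.DataFrame({
    "Education": ["Post Graduate", "Under Graduate", "Post Graduate"],
    "Marital_Status": ["Single", "Relationship", "Single"],
})

mappings = {}
for col in toy.columns:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    # classes_ holds the sorted labels; a label's position is its integer code.
    mappings[col] = {label: int(code) for code, label in enumerate(le.classes_)}

print(mappings)
```

Note that the codes only reflect alphabetical order. With two categories per column this is harmless, but for columns with more categories the integers would suggest an ordering that may not exist.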

In [183]:
df.dtypes
# all values should be numerical

### Scaling

In [184]:
ss = StandardScaler()
ss.fit(df)
scaled = pd.DataFrame(ss.transform(df), columns=df.columns)
scaled.head()
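
A quick sanity check after scaling is that every column has mean 0 and standard deviation 1. Here is a sketch on made-up numbers (`toy` is hypothetical); the same check should apply to `scaled`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric frame (made-up values, for illustration only).
toy = pd.DataFrame({
    "Income": [30000.0, 50000.0, 70000.0, 90000.0],
    "Spent": [100.0, 400.0, 900.0, 1600.0],
})

scaler = StandardScaler()
scaled_toy = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# StandardScaler divides by the population standard deviation, hence ddof=0.
means = scaled_toy.mean()
stds = scaled_toy.std(ddof=0)
print(np.allclose(means, 0.0), np.allclose(stds, 1.0))
```

Passing `index=toy.index` keeps the original row labels, which is worth considering here because rows were dropped earlier and a fresh index no longer lines up with the unscaled frame.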

### Reduce memory usage

In [185]:
def reduce_memory_usage(df, verbose=True):
    # from https://www.kaggle.com/code/karnikakapoor/customer-segmentation-clustering#DATA-PREPROCESSING
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            f"Mem. usage decreased from {start_mem:.2f} Mb to {end_mem:.2f} Mb ({100 * (start_mem - end_mem) / start_mem:.1f}% reduction)"
        )
    return df

data = reduce_memory_usage(scaled, verbose=True)
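
One thing to keep in mind: scaled values are small, so they fit the `float16` range and the function will most likely downcast them to `float16`. That saves memory but costs precision. The sketch below, on a few made-up values, compares the rounding error of `float16` and `float32`.

```python
import numpy as np

# Made-up values in the typical range of standardized features.
values = np.array([0.123456789, -1.987654321, 2.5], dtype=np.float64)

as_f16 = values.astype(np.float16)
as_f32 = values.astype(np.float32)

# Largest absolute rounding error introduced by each downcast.
err16 = np.abs(values - as_f16.astype(np.float64)).max()
err32 = np.abs(values - as_f32.astype(np.float64)).max()
print(f"max abs error float16: {err16:.2e}, float32: {err32:.2e}")
print(f"bytes: float64={values.nbytes}, float32={as_f32.nbytes}, float16={as_f16.nbytes}")
```

For a dataset of a couple of thousand rows the memory gain is small, so keeping `float32` or `float64` for the clustering step is a reasonable option if the lost precision is a concern.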