### Table of Contents

# 1. Import Data

## 1.1 Import the needed libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import date
from scipy.stats import zscore

from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.cluster import KMeans

%matplotlib inline
pd.set_option('display.max_columns', None)

## 1.2 Import and integrate data

In [None]:
df_crm = pd.read_csv('crm.csv')
df_mkt = pd.read_csv('mkt.csv')
df_sales = pd.read_excel('sales.xlsx')

In [None]:
df = pd.merge(pd.merge(df_crm,df_sales,on='CustomerID',how="inner"),df_mkt,on="CustomerID",how="inner")

## 1.3 Set Index


In [None]:
df.set_index('CustomerID',inplace = True)

## 1.4 Check and remove duplicates

In [None]:
df[df.duplicated()] # checking duplicates

In [None]:
df = df[~df.duplicated()] # drop duplicate rows

# 2. Explore Data

## 2.0 Data profiling

If you do not want to install the library, do not run this section. Otherwise, the install command is `pip install ydata-profiling`. **Delete this section** at the end.

In [None]:
#from ydata_profiling import ProfileReport
#profile= ProfileReport (df, title= "DSML_Project")

In [None]:
#profile.to_file('DSML_profile.html')

## 2.1 Basic Exploration

Q: _To check the number of rows and columns_ we used the `shape` _attribute_

In [None]:
df.shape

> A: _The dataset has **7000 rows** and **26 columns**_

__*Q*__: To check the names of the dataset's features we used the `columns` _attribute_

In [None]:
df.columns

> A: The dataset has the following columns/features names: <br>
        >Index. CustomerID
        >1. 'Name' <br>
        >2. 'Birthyear'<br>
        >3. 'Education'<br>
        >4. 'Marital_Status'<br>
        >5. 'Income'<br>
        >6. 'Kid_Younger6'<br>
        >7. 'Children_6to18'<br>
        >8. 'Date_Adherence'<br>
        >9. 'Recency'<br>
        >10. 'MntMeat&Fish'<br>
        >11. 'MntEntries'<br>
        >12. 'MntVegan&Vegetarian'<br>
        >13. 'MntDrinks'<br>
        >14. 'MntDesserts'<br>
        >15. 'MntAdditionalRequests'<br>
        >16. 'NumOfferPurchases'<br>
        >17. 'NumAppPurchases'<br>
        >18. 'NumTakeAwayPurchases'<br>
        >19. 'NumStorePurchases'<br>
        >20. 'NumAppVisitsMonth'<br>
        >21. 'Complain'<br>
        >22. 'Response_Cmp1'<br>
        >23. 'Response_Cmp2'<br>
        >24. 'Response_Cmp3'<br>
        >25. 'Response_Cmp4'<br>
        >26. 'Response_Cmp5'<br>

Q: First glance at the dataset using the `head` and `tail` methods to check the first and last 3 rows.

In [None]:
df.head(3)

In [None]:
df.tail(3)

Q: To check the basic information of the dataset we've used the `info` method

In [None]:
df.info()

>A: We can see the data type of each column and the number of features per data type (`dtypes: float64 - (7), int64 - (15), object - (4)`), the memory usage of `1.4+MB`, and the non-null count per column. <br>
> From the `info` method alone we can tell that `'Education', 'Recency', 'MntDrinks'` have __14, 23, 28 null values__ respectively, which require some action.

## 2.2 Statistical Exploration

### 2.2.1 Numerical Variables

In [None]:
df.describe()

> _From the `describe` output we can get a first glance and draw some preliminary conclusions:_

>__Birthyear__ - could be used to derive an `Age` column for readability<br>
__Income__ - Min and max are far from each other and far from the mean, which could indicate outliers<br>
__Recency__ - 6977 valid values, so we should look deeper and decide how to minimize the effect of the missing values<br>
__MntMeat&Fish__ - Min and max are far apart and the standard deviation is high, which could affect future conclusions<br>
__MntEntries__ - Again a high standard deviation that we should analyze, with min and max far apart, similar to MntMeat&Fish<br>
__MntVegan&Vegetarian__ - Similar to the previous two Mnt columns<br>
__MntDrinks, MntDesserts__ - Seem to be very similar to each other<br>
__MntAdditionalRequests__ - The standard deviation seems high and the max value is very far from the mean<br>
__NumOfferPurchases, NumTakeAwayPurchases, NumAppVisitsMonth__  - Have a max value too distant from the mean; it could be genuine, but we need to take it into account<br>
__NumAppPurchases, NumStorePurchases__ - Do not seem to have strange summary statistics<br>
__Kid_Younger6, Children_6to18__ - 75% of clients have at least one child
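
The `Age` idea mentioned for `Birthyear` can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's final transformation: the helper name `add_age` and the `reference_year` parameter are assumptions introduced here.

```python
import pandas as pd
from datetime import date


def add_age(frame: pd.DataFrame, year_col: str = "Birthyear", reference_year=None) -> pd.DataFrame:
    """Return a copy of `frame` with an `Age` column derived from a birth-year column."""
    if reference_year is None:
        # Assumption: age is measured against the current calendar year
        reference_year = date.today().year
    out = frame.copy()
    out["Age"] = reference_year - out[year_col]
    return out


# Toy data (hypothetical values, not taken from the dataset)
demo = pd.DataFrame({"Birthyear": [1980, 1995, 2001]})
demo_age = add_age(demo, reference_year=2023)
print(demo_age)
```

Fixing `reference_year` keeps the result reproducible; leaving it as `None` would make `Age` drift as the notebook is re-run in later years.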

**Q**: Skewness of each variable 

In [None]:
df.skew(numeric_only=True)  # restrict to numeric columns; object columns cannot be skewed

Concerning the variables' skewness, we can conclude the following:
- `Moderate skewness (absolute value between 0.5 and 1.0)`: Birthyear, Income, Kid_Younger6, Children_6to18, Recency, NumAppPurchases, NumStorePurchases, NumAppVisitsMonth
- `High skewness (absolute value above 1.0)`: MntMeat&Fish, MntEntries, MntVegan&Vegetarian, MntDrinks, MntDesserts, MntAdditionalRequests, NumOfferPurchases, NumTakeAwayPurchases, Complain, Response_Cmp1, Response_Cmp2, Response_Cmp3, Response_Cmp4, Response_Cmp5
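
The thresholds above can be applied programmatically rather than by reading the output by eye. The sketch below is one possible way to do it, using toy columns; the helper name `skew_bucket` and the "low" label for values under 0.5 are assumptions introduced here.

```python
import pandas as pd


def skew_bucket(value: float) -> str:
    """Classify a skewness value using the 0.5 / 1.0 absolute thresholds."""
    magnitude = abs(value)
    if magnitude > 1.0:
        return "high"
    if magnitude >= 0.5:
        return "moderate"
    return "low"


# Toy data (hypothetical): one symmetric column, one with a long right tail
demo = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_tailed": [1, 1, 1, 2, 50],
})
buckets = demo.skew(numeric_only=True).apply(skew_bucket)
print(buckets)
```

On the real data, `df.skew(numeric_only=True).apply(skew_bucket)` would give one label per numeric feature.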

In [None]:
df.kurt(numeric_only=True)  # restrict to numeric columns

Features with kurtosis higher than 3 (pandas' `kurt` reports excess kurtosis, where a normal distribution scores 0) could indicate the presence of outliers, so the following features deserve special consideration:
>MntEntries, MntVegan&Vegetarian, MntDrinks, MntDesserts, NumOfferPurchases, NumAppVisitsMonth

Note: for binary variables such as `Complain` and `Response_Cmp1`, we do not treat high kurtosis as a sign of outliers

### 2.2.2 Categorical Variables

In [None]:
df.describe(include = object)

> We can conclude that `Education` has **14 missing** values

#### Level/Possible values of Categorical Features

### `Name` prefix unique values and count

In [None]:
df['Name'].str.partition(" ")[0].value_counts()

With the prefix we can generate a `Gender` feature to further explore the dataset. We will deal with that in the data transformation chapter

#### **`Gender`** feature creation

In [None]:
df["Gender"] = df['Name'].str.partition(" ")[0]
df = df.replace({"Gender":{"Mr.": 1,"Miss": 0,"Mrs.": 0}})

### `Education` unique values and count

In [None]:
df["Education"].value_counts()

We have some issues that will need transformation:<br>
- Graduation, Master, HighSchool are written in different ways<br>
- Do `Basic` and `HighSchool` need to be separate levels?
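
One way to handle labels that differ only in letter case is to lower-case them first and then map them to a canonical spelling. The sketch below is an illustration on made-up values; the `canonical` dictionary is an assumption built from the five education levels used elsewhere in this notebook.

```python
import pandas as pd

# Assumed canonical spellings, keyed by their lower-case form
canonical = {
    "basic": "Basic",
    "highschool": "HighSchool",
    "graduation": "Graduation",
    "master": "Master",
    "phd": "PhD",
}

# Toy data (hypothetical), including a missing value
demo = pd.Series(["master", "Master", "PHD", "highschool", None], name="Education")

# Strip whitespace, lower-case, then map; missing values stay missing
normalized = demo.str.strip().str.lower().map(canonical)
print(normalized)
```

A label that is not a key of `canonical` becomes `NaN` under `map`, so it is worth comparing `value_counts()` before and after to check that nothing was silently dropped.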

#### Education standardization

In [None]:
df = df.replace({"Education":{"master":"Master", "graduation":"Graduation", "phd":"PhD","highschool":"HighSchool"}})

### `Marital_Status` unique values and count

In [None]:
df["Marital_Status"].value_counts()

As with the previous feature, we have some issues that need transformation:<br>
- Married, Together, Single, Divorced and Widow are written in both lower case and capitalized
- We could also consider Married and Together to be similar and merge them into the same level<br>

#### Marital_Status standardization

In [None]:
df = df.replace({"Marital_Status":{"married":"Married", "together":"Married", "single":"Single","widow":"Widow","divorced":"Divorced","Together":"Married"}})
df["Marital_Status"].value_counts()

### `Date_Adherence` unique values and count

In [None]:
df["Date_Adherence"].value_counts()

`Date_Adherence` is a date and will need transformation to a date format for further exploration
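
A minimal sketch of such a conversion is shown below. The date strings and the `%Y-%m-%d` format are assumptions for illustration only (the actual format in the file should be checked first), and `Adherence_Year` is a hypothetical derived column.

```python
import pandas as pd

# Toy data (hypothetical): two valid dates and one malformed entry
demo = pd.DataFrame({"Date_Adherence": ["2021-03-15", "2022-11-02", "not a date"]})

# errors="coerce" turns unparseable entries into NaT instead of raising
demo["Date_Adherence"] = pd.to_datetime(
    demo["Date_Adherence"], format="%Y-%m-%d", errors="coerce"
)

# Once the column is a datetime, the .dt accessor exposes its components
demo["Adherence_Year"] = demo["Date_Adherence"].dt.year
print(demo)
```

Counting the `NaT` values after the conversion is a quick way to see how many entries did not match the assumed format.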

## 2.3 Visual Exploration

### 2.3.1 Numerical Variables

In [None]:
df_corr = df.corr(method = 'spearman', numeric_only = True)  # skip object columns such as Name
figure = plt.figure(figsize=(16,10))
sns.heatmap(df_corr, annot=True, fmt = '.1g', mask = np.triu(df_corr))

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (10,6))
sns.histplot(ax = ax1, data = df, x = 'Birthyear', color="c", bins= 5)
sns.histplot(ax = ax2, data = df, x = 'Birthyear', color="c")

- At a glance, with 5 bins, `Birthyear` seems to follow a roughly normal distribution; zooming in, however, we can see some sharp drops that we should explore further

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize=(10,6), sharex = True, sharey= True)
sns.countplot(ax = ax1, data = df, x = 'Kid_Younger6', color="y", alpha = 0.8)
sns.countplot(ax = ax2, data = df, x = 'Children_6to18', color="m", alpha = 0.5)
sns.countplot(ax = ax3, x = df["Kid_Younger6"]+df["Children_6to18"])
plt.xlabel("Total number of kids")

- Here we can see that the majority of clients have one child, either one `Kid_Younger6` or one `Children_6to18`

In [None]:
sns.histplot(data = df['Recency'], color="k", alpha=0.3)

In [None]:
sns.histplot(data = df['Income'], color="g", alpha=0.3)

- `Income` suggests a normal distribution, with some possible outliers around 220k monetary units
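
Suspected outliers like these can be flagged with a z-score rule. The sketch below is a minimal illustration on made-up numbers: the helper name `zscore_mask` is an assumption, and the threshold of 3 is a common convention rather than a property of this dataset.

```python
import pandas as pd


def zscore_mask(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return True for rows whose absolute z-score exceeds `threshold`."""
    # ddof=0 (population standard deviation) matches the default of scipy.stats.zscore
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold


# Toy data (hypothetical): 19 typical incomes and one extreme value
toy_income = pd.Series([50_000] * 19 + [1_000_000], name="Income")
mask = zscore_mask(toy_income)
filtered = toy_income[~mask]
print(int(mask.sum()), "row(s) flagged;", len(filtered), "row(s) kept")
```

Because the mean and standard deviation are themselves pulled by extreme values, the rule is most reliable on roughly symmetric variables; on heavily skewed ones it can behave unevenly.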

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (20, 16))
ax1.boxplot(df['MntVegan&Vegetarian'])
ax2.boxplot(df['Income'])

In [None]:
df.drop(df[abs(zscore(df['MntVegan&Vegetarian'])) > 3].index, inplace=True)
df.drop(df[abs(zscore(df['Income'])) > 3].index, inplace=True)

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (20, 16))
ax1.boxplot(df['MntVegan&Vegetarian'])
ax2.boxplot(df['Income'])

#### Income vs Gender

In [None]:
sns.stripplot(data = df, x = "Income", y = "Gender")

In [None]:
fig, ((ax1, ax2),(ax3,ax4),(ax5,ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.histplot(ax = ax1, data = df, x = 'MntMeat&Fish', color="g")
sns.histplot(ax = ax2, data = df, x = 'MntVegan&Vegetarian', color="b")
sns.histplot(ax = ax3, data = df, x = 'MntEntries', color="r")
sns.histplot(ax = ax4, data = df, x = 'MntDrinks', color="y")
sns.histplot(ax = ax5, data = df, x = 'MntDesserts', color="w")
sns.histplot(ax = ax6, data = df, x = 'MntAdditionalRequests', color="m")

- `MntMeat&Fish` and `MntVegan&Vegetarian` are the categories where customers spend the most
- `MntVegan&Vegetarian` seems to have **outliers** above 15000 monetary units
- All the `Mnt*` variables are heavily concentrated at the lower end of the monetary axis

#### Total Monetary Spend feature creation and display

In [None]:
df["MntTotal"] = df['MntMeat&Fish'] + df['MntEntries'] + df['MntVegan&Vegetarian'] + df['MntDrinks'] + df['MntDesserts'] + df['MntAdditionalRequests']
df["MntTotal"]
# Note: MntAdditionalRequests (previously missing) is now included in the sum

In [None]:
sns.histplot(data = df, x = 'MntTotal', color="g")

In [None]:
def scatterplot_list(data: pd.DataFrame, x: list, y: list, hue: str, marker_size: int = 50, figsize: tuple = (10, 6), rug: bool = False, ax=None):
    """Draw one scatter plot per (x, y) pair, coloured by `hue`, on a two-column grid."""
    num_subplots = len(y) * len(x)
    if ax is None:
        if num_subplots == 1:
            fig, axs = plt.subplots(1, 1, figsize=figsize)
            axs = [axs]
        else:
            num_rows = (num_subplots + 1) // 2
            num_cols = 2
            fig, axs = plt.subplots(num_rows, num_cols, figsize=(figsize[0]*num_cols, figsize[1]*num_rows))
            axs = axs.ravel()
    else:
        # Accept either a single Axes or an array of Axes
        axs = np.atleast_1d(ax).ravel()

    for i, x_var in enumerate(x):
        for j, y_var in enumerate(y):
            idx = j * len(x) + i
            sns.scatterplot(data=data, x=x_var, y=y_var, hue=hue, s=marker_size, ax=axs[idx], palette='Dark2')
            axs[idx].set_xlabel(x_var)
            axs[idx].set_ylabel(y_var)
            axs[idx].legend(title=hue, loc='upper right')
            if rug:
                sns.rugplot(data=data, x=x_var, y=y_var, hue=hue, ax=axs[idx], alpha=0.5)

    # Hide any unused axes (odd number of plots on a two-column grid)
    for unused_ax in axs[num_subplots:]:
        unused_ax.set_visible(False)

    plt.tight_layout()
    plt.show()

#### Scatter plot Monetary vs Income vs Gender

In [None]:
scatterplot_list(data= df, x= ['Income'], y= ['MntTotal'], hue= 'Gender')

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, x = "Income", y = "MntMeat&Fish", hue = "Gender")
sns.scatterplot(ax = ax2, data = df, x = "Income", y = "MntVegan&Vegetarian", hue = "Gender")
sns.scatterplot(ax = ax3, data = df, x = "Income", y = "MntEntries", hue = "Gender")
sns.scatterplot(ax = ax4, data = df, x = "Income", y = "MntDrinks", hue = "Gender")
sns.scatterplot(ax = ax5, data = df, x = "Income", y = "MntDesserts", hue = "Gender")
sns.scatterplot(ax = ax6, data = df, x = "Income", y = "MntAdditionalRequests", hue = "Gender")

#### Scatter plot: Monetary vs Income vs "Total of Kids"

In [None]:
df["Total_Kids"] = df["Kid_Younger6"] + df["Children_6to18"]

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, x = "Income", y = "MntMeat&Fish", hue = "Total_Kids")
sns.scatterplot(ax = ax2, data = df, x = "Income", y = "MntVegan&Vegetarian", hue = "Total_Kids")
sns.scatterplot(ax = ax3, data = df, x = "Income", y = "MntEntries", hue = "Total_Kids")
sns.scatterplot(ax = ax4, data = df, x = "Income", y = "MntDrinks", hue = "Total_Kids")
sns.scatterplot(ax = ax5, data = df, x = "Income", y = "MntDesserts", hue = "Total_Kids")
sns.scatterplot(ax = ax6, data = df, x = "Income", y = "MntAdditionalRequests", hue = "Total_Kids")

- Here we conclude that customers with fewer kids generally spend more money across all subcategories
- In the `MntVegan&Vegetarian` subcategory the highest spenders are clearly the ones with no children

#### Scatter plot: Monetary vs Income vs "Kid_Younger6"

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, x = "Income", y = "MntMeat&Fish", hue = "Kid_Younger6")
sns.scatterplot(ax = ax2, data = df, x = "Income", y = "MntVegan&Vegetarian", hue = "Kid_Younger6")
sns.scatterplot(ax = ax3, data = df, x = "Income", y = "MntEntries", hue = "Kid_Younger6")
sns.scatterplot(ax = ax4, data = df, x = "Income", y = "MntDrinks", hue = "Kid_Younger6")
sns.scatterplot(ax = ax5, data = df, x = "Income", y = "MntDesserts", hue = "Kid_Younger6")
sns.scatterplot(ax = ax6, data = df, x = "Income", y = "MntAdditionalRequests", hue = "Kid_Younger6")

- Here we can conclude that the subgroup with kids under 6 spends little money on both `Meat&Fish` and `Vegan&Vegetarian`, which could be related to the fact that younger kids have their own food

#### Scatter plot: Monetary vs Income vs "Children_6to18"

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, x = "Income", y = "MntMeat&Fish", hue = "Children_6to18")
sns.scatterplot(ax = ax2, data = df, x = "Income", y = "MntVegan&Vegetarian", hue = "Children_6to18")
sns.scatterplot(ax = ax3, data = df, x = "Income", y = "MntEntries", hue = "Children_6to18")
sns.scatterplot(ax = ax4, data = df, x = "Income", y = "MntDrinks", hue = "Children_6to18")
sns.scatterplot(ax = ax5, data = df, x = "Income", y = "MntDesserts", hue = "Children_6to18")
sns.scatterplot(ax = ax6, data = df, x = "Income", y = "MntAdditionalRequests", hue = "Children_6to18")

#### Kids boolean variable creation

In [None]:
df["has_Kids"] = df["Total_Kids"].apply(lambda x: 0 if x == 0 else 1)
df["has_Kids"]

#### Income vs Total Monetary vs Kids (Y/N)

In [None]:
sns.scatterplot(data = df, x = "Income", y = "MntTotal", hue = "has_Kids")

- From this scatterplot we can see that, in general, customers without kids (`has_Kids` = 0) spend more money than those with kids

Now let's try to understand the behavior within the monetary subcategories

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, x = "Income", y = "MntMeat&Fish", hue = "has_Kids")
sns.scatterplot(ax = ax2, data = df, x = "Income", y = "MntVegan&Vegetarian", hue = "has_Kids")
sns.scatterplot(ax = ax3, data = df, x = "Income", y = "MntEntries", hue = "has_Kids")
sns.scatterplot(ax = ax4, data = df, x = "Income", y = "MntDrinks", hue = "has_Kids")
sns.scatterplot(ax = ax5, data = df, x = "Income", y = "MntDesserts", hue = "has_Kids")
sns.scatterplot(ax = ax6, data = df, x = "Income", y = "MntAdditionalRequests", hue = "has_Kids")

#### Monetary vs Income vs Education

In [None]:
sns.scatterplot(data = df, hue = "Education", x = "MntTotal", y = "Income")
sns.rugplot(data = df, hue = "Education", x = "MntTotal", y = "Income")

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, hue = "Education", x = "MntMeat&Fish", y = "Income")
sns.rugplot(ax = ax1, data = df, hue = "Education", x = "MntMeat&Fish", y = "Income")
sns.scatterplot(ax = ax2, data = df, hue = "Education", x = "MntVegan&Vegetarian", y = "Income")
sns.rugplot(ax = ax2, data = df, hue = "Education", x = "MntVegan&Vegetarian", y = "Income")
sns.scatterplot(ax = ax3, data = df, hue = "Education", x = "MntEntries", y = "Income")
sns.rugplot(ax = ax3, data = df, hue = "Education", x = "MntEntries", y = "Income")
sns.scatterplot(ax = ax4, data = df, hue = "Education", x = "MntDrinks", y = "Income")
sns.rugplot(ax = ax4, data = df, hue = "Education", x = "MntDrinks", y = "Income")
sns.scatterplot(ax= ax5, data = df, hue = "Education", x = "MntDesserts", y = "Income")
sns.rugplot(ax = ax5, data = df, hue = "Education", x = "MntDesserts", y = "Income")
sns.scatterplot(ax = ax6, data = df, hue = "Education", x = "MntAdditionalRequests", y = "Income")
sns.rugplot(ax = ax6, data = df, hue = "Education", x = "MntAdditionalRequests", y = "Income")

#### Monetary vs Education

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.stripplot(ax = ax1, data = df, y = "Education",  order = ["Basic", "HighSchool", "Graduation","Master","PhD"],  x = "MntMeat&Fish")
sns.stripplot(ax = ax2, data = df, y = "Education",  order = ["Basic", "HighSchool", "Graduation","Master","PhD"],  x = "MntVegan&Vegetarian")
sns.stripplot(ax = ax3, data = df, y = "Education",  order = ["Basic", "HighSchool", "Graduation","Master","PhD"],  x = "MntEntries")
sns.stripplot(ax = ax4, data = df, y = "Education",  order = ["Basic", "HighSchool", "Graduation","Master","PhD"],  x = "MntDrinks")
sns.stripplot(ax = ax5, data = df, y = "Education",  order = ["Basic", "HighSchool", "Graduation","Master","PhD"],  x = "MntDesserts")
sns.stripplot(ax = ax6, data = df, y = "Education",  order = ["Basic", "HighSchool", "Graduation","Master","PhD"],  x = "MntAdditionalRequests")

#### Marital_Status vs Monetary vs has_Kids

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, y = "Marital_Status", x = "MntMeat&Fish", hue = "has_Kids")
sns.scatterplot(ax = ax2, data = df, y = "Marital_Status", x = "MntVegan&Vegetarian", hue = "has_Kids")
sns.scatterplot(ax = ax3, data = df, y = "Marital_Status", x = "MntEntries", hue = "has_Kids")
sns.scatterplot(ax = ax4, data = df, y = "Marital_Status", x = "MntDrinks", hue = "has_Kids")
sns.scatterplot(ax = ax5, data = df, y = "Marital_Status", x = "MntDesserts", hue = "has_Kids")
sns.scatterplot(ax = ax6, data = df, y = "Marital_Status", x = "MntAdditionalRequests", hue = "has_Kids")

> No valuable information, besides the observation that every `Marital_Status` level includes customers with kids

#### Monetary vs Number of Purchases

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5,ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, y = "NumOfferPurchases", x = "MntTotal", hue = "has_Kids")
sns.scatterplot(ax = ax2, data = df, y = "NumAppPurchases", x = "MntTotal", hue = "has_Kids")
sns.scatterplot(ax = ax3, data = df, y = "NumTakeAwayPurchases", x = "MntTotal", hue = "has_Kids")
sns.scatterplot(ax = ax4, data = df, y = "NumStorePurchases", x = "MntTotal", hue = "has_Kids")
sns.scatterplot(ax = ax5, data = df, y = "NumAppVisitsMonth", x = "MntTotal", hue = "has_Kids")
sns.scatterplot(ax = ax6, data = df, y = "Complain", x = "MntTotal", hue = "has_Kids")

> **Info** `MntTotal`
- `NumOfferPurchases`: customers who use **offers** the most have kids, and most of the money is spent without offers
- `NumAppPurchases`: customers who used **food apps** fewer than 4 times spend much less than the rest, and customers without kids tend to spend more
- `NumTakeAwayPurchases`: customers who ordered **take-away** fewer than 4 times spend much less than the rest, and customers without kids spend more than those with kids
- `NumStorePurchases`: again, customers who went to the **store** fewer than 4 times spend much less than the rest, and customers without kids tend to spend more
- `NumAppVisitsMonth`: customers with kids visit the restaurant on **food delivery apps** more often (monthly) than those without; nevertheless, customers without kids tend to spend more (it may be worth examining the _**value per visit**_ to find ways to boost app usage among customers without kids)
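
The _value per visit_ idea raised in the last bullet could be explored along these lines. This is a rough sketch on made-up rows; the column name `ValuePerVisit` is an assumption, and dividing total spend by monthly app visits is only one possible definition of the metric.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
demo = pd.DataFrame({
    "MntTotal": [300, 120, 900, 0],
    "NumAppVisitsMonth": [6, 4, 3, 0],
    "has_Kids": [1, 1, 0, 0],
})

# Treat zero visits as missing so the division does not produce inf
visits = demo["NumAppVisitsMonth"].replace(0, np.nan)
demo["ValuePerVisit"] = demo["MntTotal"] / visits

# Average value per visit for customers with and without kids
summary = demo.groupby("has_Kids")["ValuePerVisit"].mean()
print(summary)
```

Note that `MntTotal` and `NumAppVisitsMonth` may cover different time windows in the real data, so the ratio is best read as a relative indicator between groups rather than an amount per visit.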

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, y = "NumOfferPurchases", x = "MntMeat&Fish", hue = "has_Kids")
sns.scatterplot(ax = ax2, data = df, y = "NumAppPurchases", x = "MntMeat&Fish", hue = "has_Kids")
sns.scatterplot(ax = ax3, data = df, y = "NumTakeAwayPurchases", x = "MntMeat&Fish", hue = "has_Kids")
sns.scatterplot(ax = ax4, data = df, y = "NumStorePurchases", x = "MntMeat&Fish", hue = "has_Kids")
sns.scatterplot(ax = ax5, data = df, y = "NumAppVisitsMonth", x = "MntMeat&Fish", hue = "has_Kids")
sns.scatterplot(ax = ax6, data = df, y = "Complain", x = "MntMeat&Fish", hue = "has_Kids")

> **Info** `MntMeat&Fish`
- `NumOfferPurchases`: customers who use **offers** the most have kids, and most of the money is spent without offers
- `NumAppPurchases`: customers who used **food apps** fewer than 4 times spend much less than the rest, and customers without kids tend to spend more
- `NumTakeAwayPurchases`: customers who ordered **take-away** fewer than 4 times spend much less than the rest, and customers without kids order take-away more often
- `NumStorePurchases`: again, customers who went to the **store** fewer than 4 times spend much less than the rest
- `NumAppVisitsMonth`: customers with kids visit the restaurant on **food delivery apps** more often (monthly) than those without; nevertheless, customers without kids tend to spend more (it may be worth examining the _**value per visit**_ to find ways to boost app usage among customers without kids)

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, y = "NumOfferPurchases", x = "MntVegan&Vegetarian", hue = "has_Kids")
sns.scatterplot(ax = ax2, data = df, y = "NumAppPurchases", x = "MntVegan&Vegetarian", hue = "has_Kids")
sns.scatterplot(ax = ax3, data = df, y = "NumTakeAwayPurchases", x = "MntVegan&Vegetarian", hue = "has_Kids")
sns.scatterplot(ax = ax4, data = df, y = "NumStorePurchases", x = "MntVegan&Vegetarian", hue = "has_Kids")
sns.scatterplot(ax = ax5, data = df, y = "NumAppVisitsMonth", x = "MntVegan&Vegetarian", hue = "has_Kids")
sns.scatterplot(ax = ax6, data = df, y = "Complain", x = "MntVegan&Vegetarian", hue = "has_Kids")

> **Info** `MntVegan&Vegetarian`
- `NumOfferPurchases`: customers who use **offers** the most have kids, and most of the money is spent without offers
- `NumAppPurchases`: customers who used **food apps** fewer than 4 times spend much less than the rest, and customers without kids tend to spend more
- `NumTakeAwayPurchases`: customers who ordered **take-away** fewer than 4 times spend much less than the rest, and customers without kids spend more than those with kids
- `NumStorePurchases`: again, customers who went to the **store** fewer than 4 times spend much less than the rest, and customers without kids tend to spend more
- `NumAppVisitsMonth`: customers with kids visit the restaurant on **food delivery apps** more often (monthly) than those without; nevertheless, customers without kids tend to spend more (it may be worth examining the _**value per visit**_ to find ways to boost app usage among customers without kids)

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, y = "NumOfferPurchases", x = "MntEntries", hue = "has_Kids")
sns.scatterplot(ax = ax2, data = df, y = "NumAppPurchases", x = "MntEntries", hue = "has_Kids")
sns.scatterplot(ax = ax3, data = df, y = "NumTakeAwayPurchases", x = "MntEntries", hue = "has_Kids")
sns.scatterplot(ax = ax4, data = df, y = "NumStorePurchases", x = "MntEntries", hue = "has_Kids")
sns.scatterplot(ax = ax5, data = df, y = "NumAppVisitsMonth", x = "MntEntries", hue = "has_Kids")
sns.scatterplot(ax = ax6, data = df, y = "Complain", x = "MntEntries", hue = "has_Kids")

> **Info** `MntEntries`
- `NumOfferPurchases`: customers who use **offers** the most have kids, and most of the money is spent without offers
- `NumAppPurchases`: customers who used **food apps** fewer than 4 times spend much less than the rest
- `NumTakeAwayPurchases`: customers who ordered **take-away** fewer than 3 times spend much less than the rest, and customers without kids spend more than those with kids
- `NumStorePurchases`: customers who went to the **store** fewer than 4 times spend much less than the rest
- `NumAppVisitsMonth`: customers with kids visit the restaurant on **food delivery apps** more often (monthly) than those without; nevertheless, customers without kids tend to spend more (it may be worth examining the _**value per visit**_ to find ways to boost app usage among customers without kids)

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, y = "NumOfferPurchases", x = "MntDrinks", hue = "has_Kids")
sns.scatterplot(ax = ax2, data = df, y = "NumAppPurchases", x = "MntDrinks", hue = "has_Kids")
sns.scatterplot(ax = ax3, data = df, y = "NumTakeAwayPurchases", x = "MntDrinks", hue = "has_Kids")
sns.scatterplot(ax = ax4, data = df, y = "NumStorePurchases", x = "MntDrinks", hue = "has_Kids")
sns.scatterplot(ax = ax5, data = df, y = "NumAppVisitsMonth", x = "MntDrinks", hue = "has_Kids")
sns.scatterplot(ax = ax6, data = df, y = "Complain", x = "MntDrinks", hue = "has_Kids")

**Info** `MntDrinks`
- `NumOfferPurchases`: customers who use offers the most have kids, and most of the money is spent without offers
- `NumAppPurchases`: customers who used food apps fewer than 4 times spend much less than the rest
- `NumTakeAwayPurchases`: customers who ordered take-away fewer than 3 times spend much less than the rest, and customers without kids spend more and order more often than those with kids
- `NumStorePurchases`: customers who went to the store fewer than 4 times spend much less than the rest
- `NumAppVisitsMonth`: customers with kids visit the restaurant on food delivery apps more often (monthly) than those without; nevertheless, customers without kids tend to spend more (it may be worth examining the value per visit to find ways to boost app usage among customers without kids)

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, y = "NumOfferPurchases", x = "MntDesserts", hue = "has_Kids")
sns.scatterplot(ax = ax2, data = df, y = "NumAppPurchases", x = "MntDesserts", hue = "has_Kids")
sns.scatterplot(ax = ax3, data = df, y = "NumTakeAwayPurchases", x = "MntDesserts", hue = "has_Kids")
sns.scatterplot(ax = ax4, data = df, y = "NumStorePurchases", x = "MntDesserts", hue = "has_Kids")
sns.scatterplot(ax = ax5, data = df, y = "NumAppVisitsMonth", x = "MntDesserts", hue = "has_Kids")
sns.scatterplot(ax = ax6, data = df, y = "Complain", x = "MntDesserts", hue = "has_Kids")

**Info** `MntDesserts`
- `NumOfferPurchases`: customers who use offers the most have kids, and most of the money is spent without offers
- `NumAppPurchases`: customers who used food apps fewer than 4 times spend much less than the rest
- `NumTakeAwayPurchases`: customers who ordered take-away fewer than 3 times spend much less than the rest, and customers without kids spend more and order more often than those with kids
- `NumStorePurchases`: customers who went to the store fewer than 4 times spend much less than the rest
- `NumAppVisitsMonth`: customers with kids visit the restaurant on food delivery apps more often (monthly) than those without; nevertheless, customers without kids tend to spend more (it may be worth examining the value per visit to find ways to boost app usage among customers without kids)

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, y = "NumOfferPurchases", x = "MntAdditionalRequests", hue = "has_Kids")
sns.scatterplot(ax = ax2, data = df, y = "NumAppPurchases", x = "MntAdditionalRequests", hue = "has_Kids")
sns.scatterplot(ax = ax3, data = df, y = "NumTakeAwayPurchases", x = "MntAdditionalRequests", hue = "has_Kids")
sns.scatterplot(ax = ax4, data = df, y = "NumStorePurchases", x = "MntAdditionalRequests", hue = "has_Kids")
sns.scatterplot(ax = ax5, data = df, y = "NumAppVisitsMonth", x = "MntAdditionalRequests", hue = "has_Kids")
sns.scatterplot(ax = ax6, data = df, y = "Complain", x = "MntAdditionalRequests", hue = "has_Kids")

**Info** `MntAdditionalRequests`
- `NumOfferPurchases`: customers who use offers the most have kids, and most of the money is spent without offers
- `NumAppPurchases`: customers who used food apps fewer than 4 times spend much less than the rest, and customers with kids buy through apps more often
- `NumTakeAwayPurchases`: customers who ordered take-away fewer than 3 times spend much less than the rest, and customers without kids spend more and order more often than those with kids
- `NumStorePurchases`: customers who went to the store fewer than 4 times spend much less than the rest
- `NumAppVisitsMonth`: customers with kids visit the restaurant on food delivery apps more often (monthly) than those without (it may be worth examining the value per visit to find ways to boost app usage among customers without kids)

### Number of Purchases visualization

In [None]:
fig, ((ax1, ax2),(ax3,ax4),(ax5,ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.histplot(ax = ax1, data = df, x = 'NumOfferPurchases', color="g")
sns.histplot(ax = ax2, data = df, x = 'NumAppPurchases', color="b")
sns.histplot(ax = ax3, data = df, x = 'NumTakeAwayPurchases', color="r")
sns.histplot(ax = ax4, data = df, x = 'NumStorePurchases', color="y")
sns.histplot(ax = ax5, data = df, x = 'NumAppVisitsMonth', color="k")
sns.countplot(ax = ax6, data = df, x = 'Complain', color="c")

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.scatterplot(ax = ax1, data = df, y = "NumOfferPurchases", x = "Income", hue = "Gender")
sns.scatterplot(ax = ax2, data = df, y = "NumAppPurchases", x = "Income", hue = "Gender")
sns.scatterplot(ax = ax3, data = df, y = "NumTakeAwayPurchases", x = "Income", hue = "Gender")
sns.scatterplot(ax = ax4, data = df, y = "NumStorePurchases", x = "Income", hue = "Gender")
sns.scatterplot(ax = ax5, data = df, y = "NumAppVisitsMonth", x = "Income", hue = "Gender")
sns.scatterplot(ax = ax6, data = df, y = "Complain", x = "Income", hue = "Gender")

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3,2, figsize = (20,16))
sns.stripplot(ax = ax1, data = df, x = "NumOfferPurchases", y = "Marital_Status", hue = "Gender")
sns.stripplot(ax = ax2, data = df, x = "NumAppPurchases", y = "Marital_Status", hue = "Gender")
sns.stripplot(ax = ax3, data = df, x = "NumTakeAwayPurchases", y = "Marital_Status", hue = "Gender")
sns.stripplot(ax = ax4, data = df, x = "NumStorePurchases", y = "Marital_Status", hue = "Gender")
sns.stripplot(ax = ax5, data = df, x = "NumAppVisitsMonth", y = "Marital_Status", hue = "Gender")
sns.stripplot(ax = ax6, data = df, x = "Complain", y = "Marital_Status", hue = "Gender")

#### Campaign responses

In [None]:
df_long = pd.melt(df[['Response_Cmp1', 'Response_Cmp2','Response_Cmp3', 'Response_Cmp4', 'Response_Cmp5']])

In [None]:
sns.countplot(data = df_long, y = 'variable', hue='value')
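
As a numeric complement to the count plot, the share of customers who accepted each campaign can be computed from the same long-format table. This is a minimal sketch on toy data, assuming the `Response_Cmp*` columns hold 0/1 flags; `acceptance_rate` is a name introduced here for illustration.

```python
import pandas as pd

# Toy 0/1 response flags (illustrative values only)
toy = pd.DataFrame({
    "Response_Cmp1": [1, 0, 0, 0],
    "Response_Cmp2": [1, 1, 0, 0],
})

# melt stacks the columns into 'variable' (column name) and 'value' (flag)
toy_long = pd.melt(toy)

# Share of customers who accepted each campaign
acceptance_rate = toy_long.groupby("variable")["value"].mean()
print(acceptance_rate)
```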

In [None]:
df.head()

In [None]:
# sns.stripplot(data = df, y = "has_Kids", x = "MntTotal", hue = "Response_Cmp1")

### 2.3.2 Categorical Variables

In [None]:
df.describe(include = "object")

In [None]:
sns.countplot(data = df, x = 'Education', order = ["Basic","HighSchool", "Graduation","Master","PhD"])

**Question** Should we merge Basic and HighSchool into a single undergraduate level?

In [None]:
sns.countplot(data = df, x = 'Marital_Status')

**Question** Should we merge Together and Married into one level?
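
If we decide to merge levels, a mapping with `replace` is probably enough. The sketch below runs on toy data; the level names `Single` and `Divorced`, the label `Undergraduate` and both mapping dictionaries are assumptions for illustration, and whether these groupings make sense is still an open question.

```python
import pandas as pd

toy = pd.DataFrame({
    "Education": ["Basic", "HighSchool", "Graduation", "PhD"],
    "Marital_Status": ["Married", "Together", "Single", "Divorced"],
})

# Hypothetical groupings; whether they are appropriate is a modelling decision
education_map = {"Basic": "Undergraduate", "HighSchool": "Undergraduate"}
marital_map = {"Together": "Married"}

merged = toy.copy()
merged["Education"] = merged["Education"].replace(education_map)
merged["Marital_Status"] = merged["Marital_Status"].replace(marital_map)

# Compare level counts after merging
print(merged["Education"].value_counts())
print(merged["Marital_Status"].value_counts())
```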

In [None]:
sns.countplot(data = df, x = 'Gender')

In [None]:
sns.stripplot(data = df, y = "Education",x = "Income")

In [None]:
sns.stripplot(data = df, y = "Gender",x = "Income")

In [None]:
sns.stripplot(data = df, y = "Gender",x = "MntTotal")

## 2.4. In-Depth Exploration

# 3. Preprocess Data

## 3.1. Data Cleaning

### 3.1.1. Outliers

- boxplot?
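
A boxplot flags points beyond 1.5 times the interquartile range by default, so the same rule can be applied numerically to list the candidate outliers. This is only a sketch on toy values; `iqr_outlier_mask` is a hypothetical helper and the threshold `k=1.5` is the usual convention, not a decision we have made.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy income values with one obviously extreme observation
toy_income = pd.Series([30, 32, 35, 36, 38, 40, 200], name="Income")
mask = iqr_outlier_mask(toy_income)
print(toy_income[mask])
```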

### 3.1.2. Missing Values

In [None]:
df["Response_Cmp1"].isna().sum()

In [None]:
df["Response_Cmp2"].isna().sum()

In [None]:
df["Response_Cmp3"].isna().sum()

In [None]:
df["Response_Cmp4"].isna().sum()


In [None]:
df.info()

In [None]:
df.isna().sum()

- **`Education`**, **`Recency`**, **`MntDrinks`** and **`MntTotal`** (because it depends on `MntDrinks`) have missing values
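
To see at a glance which columns have gaps and how large they are, the counts can be combined with percentages. The sketch below uses toy data; the table name `missing` and its column names are introduced here for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Education": ["PhD", None, "Master", "Basic"],
    "Recency": [10.0, 20.0, np.nan, np.nan],
    "Income": [1000, 2000, 3000, 4000],
})

missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})
# Keep only the columns that actually have gaps, worst first
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```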

#### Filling the missing values

Fill `Education` with the mode

In [None]:
df["Education"] = df["Education"].fillna(df["Education"].mode()[0]) # assign back instead of chained inplace fill

Fill `Recency` with the median value

In [None]:
df["Recency"] = df["Recency"].fillna(df["Recency"].median()) # median, matching the description

In [None]:
df.drop(columns = "MntTotal", inplace = True)

In [None]:
df_mnt = df[[ 'MntMeat&Fish', 'MntEntries', 'MntVegan&Vegetarian', 'MntDrinks',
       'MntDesserts', 'MntAdditionalRequests']]

imputer = KNNImputer(n_neighbors=3)
array_impute = imputer.fit_transform(df_mnt)
df_mnt = pd.DataFrame(array_impute, columns = df_mnt.columns)

In [None]:
df["MntDrinks"] = df_mnt["MntDrinks"].values
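
The `.values` in the assignment is needed because the imputed DataFrame is rebuilt with a fresh 0..n-1 index, which would not align with `CustomerID`. An alternative, sketched below on toy data, is to pass the original index when rebuilding, so that a plain index-aligned assignment works. The toy values and index labels are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with a non-default index, standing in for CustomerID
toy = pd.DataFrame(
    {"MntEntries": [1.0, 2.0, 3.0, 8.0], "MntDrinks": [10.0, np.nan, 30.0, 80.0]},
    index=[10, 11, 12, 13],
)

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,  # keep the original row labels
)

# Index-aligned assignment now works without .values
toy["MntDrinks"] = imputed["MntDrinks"]
print(toy)
```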

In [None]:
df["MntTotal"] = df['MntMeat&Fish'] + df['MntEntries'] + df['MntVegan&Vegetarian'] + df['MntDrinks'] + df['MntDesserts']

In [None]:
df.isna().sum()

## 3.2. Data Transformation

### 3.2.1. Create new Variables

### Utils

#### Creating Age variable from the Birthyear

In [None]:
df['Age'] = df.Birthyear.apply(lambda x: date.today().year-x)

#### Creating card adherence age variable from the Date adherence

In [None]:
from datetime import datetime
df = df.replace({"Date_Adherence":{"2/29/2022": datetime.strptime("2022-03-01", '%Y-%m-%d')}}) #2022 is not a leap year, therefore 29/02/2022 is not a possible day

In [None]:
df['daysAsCardClient'] = df['Date_Adherence'].apply(lambda x: (date.today() - x.date()).days)
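
A vectorized variant of the two steps above is sketched below. It assumes the dates arrive as month/day/year strings, which may not match how `Date_Adherence` is actually stored, and it uses a fixed, hypothetical reference date instead of `date.today()` so that the result does not change from one run to the next.

```python
import pandas as pd

raw = pd.Series(["1/15/2020", "2/29/2022"], name="Date_Adherence")

# errors="coerce" turns impossible dates such as 2/29/2022 into NaT
dates = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Same substitution as in the notebook: the invalid day becomes 1 March 2022
dates = dates.fillna(pd.Timestamp("2022-03-01"))

# Fixed, hypothetical reference date so the result is reproducible
reference = pd.Timestamp("2023-01-01")
days_as_card_client = (reference - dates).dt.days
print(days_as_card_client.tolist())
```

Note that coercing would turn every unparseable date into NaT, so filling them all with 1 March 2022 is only safe when the single known bad value is the only one.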

In [None]:
edu_encode = pd.get_dummies(df.Education, drop_first= True)
df = pd.concat([df, edu_encode], axis = 1)
df.drop('Education', axis = 1, inplace = True)

In [None]:
marital_encode = pd.get_dummies(df.Marital_Status, drop_first= True)
df = pd.concat([df, marital_encode], axis = 1)
df.drop('Marital_Status', axis= 1, inplace = True)
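
The two encoding cells can also be written as a single `get_dummies` call, which drops the original columns automatically and lets us prefix the new ones so it stays clear which variable each dummy came from. A minimal sketch on toy data; the prefixes `Edu` and `Marital` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Education": ["Basic", "Graduation", "PhD"],
    "Marital_Status": ["Married", "Single", "Together"],
})

encoded = pd.get_dummies(
    toy,
    columns=["Education", "Marital_Status"],
    prefix=["Edu", "Marital"],  # hypothetical prefixes
    drop_first=True,            # drop the first level of each variable
)
print(encoded.columns.tolist())
```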

In [None]:
df['Mnt_pday_card']= df.MntTotal/df.daysAsCardClient

In [None]:
# Flag customers whose total spending is above the average (mean computed once)
mnt_mean = df["MntTotal"].mean()
df["Abv_Avg_Mnt"] = (df["MntTotal"] > mnt_mean).astype(int)
df["Abv_Avg_Mnt"]

In [None]:
df

## Data Review

View the dataframe in its final state
Drop: Id, name, birthyear, date_adherence, total_kids, mntTotal

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df_train= df.copy()
df_train.drop(['Name', 'Birthyear', 'Date_Adherence'], axis = 1, inplace = True)

In [None]:
df_train

## Data scaling
min max: income, recency, mnt..., purchases ..., age, daysasClient, mnt per ...

In [None]:
scaler = MinMaxScaler()
df_train = pd.DataFrame(scaler.fit_transform(df_train), columns = df_train.columns, index = df_train.index) # keep column names and CustomerID index
#df_train.describe()
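
For reference, with its default range of 0 to 1 `MinMaxScaler` rescales each column as x' = (x - min) / (max - min). The sketch below reproduces that formula directly in pandas on toy data, which also shows that column names and index survive when the result stays a DataFrame.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Income": [1000.0, 2000.0, 3000.0], "Age": [20.0, 30.0, 60.0]},
    index=[101, 102, 103],
)

# x' = (x - min) / (max - min), applied column by column
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```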

## PCA


In [None]:
from sklearn.decomposition import PCA
from sklearn import preprocessing
df_train2 = df_train.copy()
scaled_df_train2 = preprocessing.scale(df_train2)
pca = PCA(n_components=8)
pca.fit(scaled_df_train2)
pca_data = pca.transform(scaled_df_train2)
per_var = np.round(pca.explained_variance_ratio_*100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var) +1)]
plt.figure(figsize=(10,10)) # create the figure before plotting, otherwise an empty one is shown
plt.plot(labels, per_var,'ro-')
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()
pca_df = pd.DataFrame(pca_data, columns=labels)
pca_df


In [None]:
pca.explained_variance_ratio_.cumsum() # 16 PCs explain 81% of the variance

The PCA implementation above seemed odd to me. I am leaving another one here, and we can review it when we meet.

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df_train)
var= pca.explained_variance_ratio_
var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

plt.title("PCA Variance against Number of Components")
plt.ylabel("Variance %")
plt.xlabel("Number of components")
l = plt.axhline(80, color="red")

plt.plot(var1)
plt.grid()
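
The point where the cumulative curve crosses the 80% line can also be read off programmatically instead of by eye. The sketch below uses illustrative ratios in place of `pca.explained_variance_ratio_`; the variable names are introduced here for illustration.

```python
import numpy as np

# Illustrative ratios, standing in for pca.explained_variance_ratio_
explained_variance_ratio = np.array([0.5, 0.25, 0.15, 0.1])

cumulative = np.cumsum(explained_variance_ratio)
threshold = 0.80

# argmax returns the first position where the condition is True
n_components = int(np.argmax(cumulative >= threshold)) + 1
print(n_components)
```

As far as we know, scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then picks the number of components needed to reach that share of variance, which would be worth checking against the result here.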

In [None]:
pca = PCA(n_components=10)
pca_train=pca.fit_transform(df_train)
pca_train

~17 variables explain 94% of the variance

## Loading Scores for each PC

In [None]:
loading_scores = pd.Series(pca.components_[1], index=df_train2.columns) # loadings of the second component (index 1)
sorted_loading_scores= loading_scores.abs().sort_values(ascending=False)
top_8 = sorted_loading_scores[0:8].index.values # keep only the 8 largest absolute loadings
print(loading_scores[top_8])

In [None]:
plt.figure(figsize=(10,8))
plt.scatter(pca_data[:,0],pca_data[:,1],c=df_train['Abv_Avg_Mnt'])


## Correlation between the PCs and the original variables

In [None]:
df_comp = pd.DataFrame(pca.components_,columns=df_train2.columns)
plt.figure(figsize=(10,10))
sns.heatmap(df_comp,cmap='plasma')

## Model Train

In [None]:
kmeans= KMeans(n_clusters = 6, max_iter =100, random_state= 1)
kmeans.fit(pca_train)
kmeans.labels_

In [None]:
kmeans2= KMeans(n_clusters = 6, max_iter = 100, random_state = 1)
kmeans2.fit(df_train)
kmeans2.labels_
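
Because cluster numbers are arbitrary, the two fits (on the PCA scores and on the scaled features) cannot be compared label by label. A contingency table shows how the two partitions overlap. The sketch below uses illustrative label arrays in place of the fitted models' `labels_`.

```python
import numpy as np
import pandas as pd

# Illustrative labels, standing in for kmeans.labels_ and kmeans2.labels_
labels_pca = np.array([0, 0, 1, 1, 2, 2])
labels_raw = np.array([1, 1, 0, 0, 2, 0])

# Rows: clusters from the PCA-based fit; columns: clusters from the full-feature fit
agreement = pd.crosstab(
    pd.Series(labels_pca, name="pca_cluster"),
    pd.Series(labels_raw, name="raw_cluster"),
)
print(agreement)
```

If the two solutions agree, each row has most of its counts in a single column.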

### 3.2.2. Misclassifications

### 3.2.3. Incoherencies

> 3.2.3.1 Clients that spent money but never had a registered Purchase

In [None]:
# Strict > 0: the customer actually spent money; == 0: no purchase on any channel
df[(df[['MntMeat&Fish', 'MntEntries',
       'MntVegan&Vegetarian', 'MntDrinks', 'MntDesserts',
       'MntAdditionalRequests']].sum(axis = 1) > 0) & (df[['NumOfferPurchases', 'NumAppPurchases',
       'NumTakeAwayPurchases', 'NumStorePurchases']].sum(axis = 1) == 0)].shape
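
The same check is easier to read when the two row sums are named first. The sketch below runs on toy data with a reduced set of columns and uses a strict inequality on spending, so that customers who spent nothing are not counted as incoherent; `spent`, `purchases` and `incoherent` are names introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "MntDrinks": [50.0, 0.0, 20.0],
    "MntDesserts": [10.0, 0.0, 5.0],
    "NumAppPurchases": [0, 0, 2],
    "NumStorePurchases": [0, 0, 1],
})

spent = toy[["MntDrinks", "MntDesserts"]].sum(axis=1)
purchases = toy[["NumAppPurchases", "NumStorePurchases"]].sum(axis=1)

# Incoherent: some money spent, yet no purchase recorded on any channel
incoherent = toy[(spent > 0) & (purchases == 0)]
print(incoherent.shape[0])
```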

### 3.2.4. Binning

### 3.2.5. Reclassification

### 3.2.6. Power Transform

## 3.3. Data Reduction

### 3.3.1. Multicollinearity - Check correlation

### 3.3.2. Unary Variables

### 3.3.3. Variables with a high percentage of missing values

## 3.2. Back to Data Transformation

### 3.2.7. Apply ordinal encoding and create Dummy variables

### 3.2.8. Scaling