# <center>Data Mining Project</center>

<center>
Master in Data Science and Advanced Analytics <br>
NOVA Information Management School
</center>

** **
## <center>*ABCDEats Inc*</center>

<center>
Group 19 <br>
Jan-Louis Schneider, 20240506  <br>
Marta Boavida, 20240519  <br>
Matilde Miguel, 20240549  <br>
Sofia Gomes, 20240848  <br>
</center>

** **


## <span style="color:salmon"> Project Description </span> 

In this project, you will act as consultants for ABCDEats Inc. (ABCDE), a fictional food delivery service partnering with a range of restaurants to offer diverse meal options. Your task is to analyse customer data collected over three months from three cities to help ABCDE develop a data-driven strategy tailored to various customer segments. The description of the data is provided under the Dataset Description section of this document. <br>

We recommend segmenting customers using multiple perspectives. Examples of segmentation perspectives include value-based segmentation, which groups customers by their economic value; preference or behaviour-based segmentation which focuses on purchasing habits; and demographic segmentation which categorises customers by attributes like age, gender, and income to understand different interaction patterns. <br>

Ultimately, the company seeks a final segmentation that integrates these perspectives to enable them to develop a comprehensive marketing strategy.

## <span style="color:salmon"> Table of Contents  </span>   

1. [Import Data & Libraries](#1-import-data--libraries)  <br> <br> 
2. [Explore Data Analysis](#2-explore-data-analysis)  
    2.1 [Explore dataset](#21-explore-dataset)  
    2.2 [Correct datatypes](#22-correct-data-types)   
    2.3 [Duplicates](#22-duplicates)   
    2.4 [Missing values](#23-missing-values)   
    2.5 [Numerical variables](#24-numerical-variables)   
    2.6 [Categorical variables](#25-categorical-variables)  <br> <br> 
3. [New Features](#3-new-features)   
    3.1 [Customer Lifetime](#31-customer-lifetime)    
    3.2 [Most frequent order day of the week](#32-most-frequent-order-day-of-the-week)  
    3.3 [Most frequent part of the day](#33-most-frequent-part-of-the-day)  
    3.4 [Total monetary units spend](#34-total-monetary-units-spend)    
    3.5 [Average monetary units per product](#35-average-monetary-units-per-product)   
    3.6 [Average monetary units per order](#36-average-monetary-units-per-order)   
    3.7 [Average order size](#37-average-order-size)     
    3.8 [Culinary profile](#38-culinary-profile)       
    3.9 [Loyalty to chain restaurants](#39-loyalty-to-chain-restaurants)     
    3.10 [Loyalty to vendors](#310-loyalty-to-venders)    <br> <br>  
4. [Visualizations and relationships between features](#4-visualizations-and-relationships-between-features)   
    4.1 [Correlation of all numerical features](#41-correlation-of-all-numerical-features)   
    4.2 [Visualization of total Orders placed per Hour](#42-visualization-of-total-orders-placed-per-hour)   
    4.3 [Visualization of total Orders placed per week day](#43-visualization-of-total-orders-placed-per-week-day)     
    4.4 [Percentage of each payment_method for each age_group](#44-percentage-of-each-payment_method-for-each-age_group)    
    4.5 [Proportions of each last_promo for each payment_method](#45-proportions-of-each-last_promo-for-each-payment_method)    
    4.6 [Means of each payment_method in vendor_count](#46-means-of-each-payment_method-in-vendor_count)   
    4.7 [Means of each payment_method in lifetime_days](#47-means-of-each-payment_method-in-lifetime_days)   
    4.8 [Means of each payment_method in total_expenses](#48-means-of-each-payment_method-in-total_expenses)   
    4.9 [Means of each payment_method in avg_per_product](#49-means-of-each-payment_method-in-avg_per_product)   
    4.10 [Means of each payment_method in avg_per_order](#410-means-of-each-payment_method-in-avg_per_order)   
    4.11 [Means of each payment_method in avg_order_size](#411-means-of-each-payment_method-in-avg_order_size)    
    4.12 [Means of each payment_method in culinary_variety](#412-means-of-each-payment_method-in-culinary_variety)    
    4.13 [Means of each payment_method in chain_preference](#413-means-of-each-payment_method-in-chain_preference)    
    4.14 [Total expenses per age group](#414-total-expenses-per-age-group)    
    4.15 [Culinary variety per age group](#415-culinary-variety-per-age-group)   
    4.16 [Means of each last_promo in lifetime_days](#416-means-of-each-last_promo-in-lifetime_days)   
    4.17 [Means of each last_promo in total_expenses](#417-means-of-each-last_promo-in-total_expenses)   
    4.18 [Means of each last_promo in avg_per_product](#418-means-of-each-last_promo-in-avg_per_product)     
    4.19 [Means of each last_promo in avg_per_order](#419-means-of-each-last_promo-in-avg_per_order)     
    4.20 [Means of each last_promo in chain_preferences](#420-means-of-each-last_promo-in-chain_preferences)     
    4.21 [Means of each last_promo in loyalty_to_venders](#421-means-of-each-last_promo-in-loyalty_to_venders)   
    4.22 [Relations between the customer age and some types of cuisine](#422-relations-between-the-costumer-age-and-some-types-of-cuisines)    
    4.23 [Means of each total_expenses in chain_preferences](#423-means-of-each-total_expenses-in-chain_preferences)   
    4.24 [Proportions of each last_promo value for the two groups of people](#424-proportions-of-each-last_promo-value-for-the-two-groups-of-people)     
    4.25 [Cuisines and total_expenses](#425-cuisines-and-total_expenses)     
    4.26 [Cuisines and loyalty_to_venders](#426-cuisines-and-loyalty_to_venders)     
    4.27 [Customer_regions](#427-customer_regions)        


## <span style="color:salmon"> 1. Import Data & Libraries </span> 

In [1]:
import pandas as pd 
import numpy as np
import scipy

import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
import scipy.stats as stats
import warnings

from math import ceil
from sklearn.impute import KNNImputer

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [2]:
df = pd.read_csv("../dataset/DM2425_ABCDEats_DATASET.csv")

## <span style="color:salmon"> 2. Explore Data Analysis </span> 
Conduct an in-depth exploration of the dataset. Summarise key statistics for the data, and discuss their possible implications.

#### <span style="color:salmon"> 2.1 Explore Dataset </span>  
To better understand the dataset, we use the head() function to display the first ten rows.

In [3]:
df.head(10)

Unnamed: 0,customer_id,customer_region,customer_age,vendor_count,product_count,is_chain,first_order,last_order,last_promo,payment_method,CUI_American,CUI_Asian,CUI_Beverages,CUI_Cafe,CUI_Chicken Dishes,CUI_Chinese,CUI_Desserts,CUI_Healthy,CUI_Indian,CUI_Italian,CUI_Japanese,CUI_Noodle Dishes,CUI_OTHER,CUI_Street Food / Snacks,CUI_Thai,DOW_0,DOW_1,DOW_2,DOW_3,DOW_4,DOW_5,DOW_6,HR_0,HR_1,HR_2,HR_3,HR_4,HR_5,HR_6,HR_7,HR_8,HR_9,HR_10,HR_11,HR_12,HR_13,HR_14,HR_15,HR_16,HR_17,HR_18,HR_19,HR_20,HR_21,HR_22,HR_23
0,1b8f824d5e,2360,18.0,2,5,1,0.0,1,DELIVERY,DIGI,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.88,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,1,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0
1,5d272b9dcb,8670,17.0,2,2,2,0.0,1,DISCOUNT,DIGI,12.82,6.39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,1,0.0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
2,f6d1b2ba63,4660,38.0,1,2,2,0.0,1,DISCOUNT,CASH,9.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,1,0.0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
3,180c632ed8,4660,,2,3,1,0.0,2,DELIVERY,DIGI,0.0,13.7,0.0,0.0,0.0,0.0,0.0,0.0,17.86,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,1,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0
4,4eb37a6705,4660,20.0,2,5,0,0.0,2,-,DIGI,14.57,40.87,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,1,0.0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,6aef2b6726,8670,40.0,2,2,0,0.0,2,FREEBIE,DIGI,0.0,24.92,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,1,0.0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,8475ee66ef,2440,24.0,2,2,2,0.0,2,-,CARD,5.88,0.0,1.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,1,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0
7,f2f53bcc67,8670,27.0,2,3,2,0.0,2,DISCOUNT,DIGI,11.71,0.0,24.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,1,0.0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
8,5b650c89cc,2360,20.0,3,4,2,0.0,3,DISCOUNT,DIGI,2.75,0.0,0.0,0.0,0.0,0.0,0.0,4.39,0.0,0.0,0.0,0.0,7.3,0.0,0.0,0,0,1,0,0,0,2,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1
9,84775a7237,8670,20.0,2,3,0,0.0,3,DELIVERY,CARD,0.0,32.48,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,0,0,1,0.0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Next, we list all the column names of the dataset.

In [4]:
df.columns.values

array(['customer_id', 'customer_region', 'customer_age', 'vendor_count',
       'product_count', 'is_chain', 'first_order', 'last_order',
       'last_promo', 'payment_method', 'CUI_American', 'CUI_Asian',
       'CUI_Beverages', 'CUI_Cafe', 'CUI_Chicken Dishes', 'CUI_Chinese',
       'CUI_Desserts', 'CUI_Healthy', 'CUI_Indian', 'CUI_Italian',
       'CUI_Japanese', 'CUI_Noodle Dishes', 'CUI_OTHER',
       'CUI_Street Food / Snacks', 'CUI_Thai', 'DOW_0', 'DOW_1', 'DOW_2',
       'DOW_3', 'DOW_4', 'DOW_5', 'DOW_6', 'HR_0', 'HR_1', 'HR_2', 'HR_3',
       'HR_4', 'HR_5', 'HR_6', 'HR_7', 'HR_8', 'HR_9', 'HR_10', 'HR_11',
       'HR_12', 'HR_13', 'HR_14', 'HR_15', 'HR_16', 'HR_17', 'HR_18',
       'HR_19', 'HR_20', 'HR_21', 'HR_22', 'HR_23'], dtype=object)

The function info() allows us to inspect the index dtype, the column dtypes, the non-null counts and the memory usage.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31888 entries, 0 to 31887
Data columns (total 56 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               31888 non-null  object 
 1   customer_region           31888 non-null  object 
 2   customer_age              31161 non-null  float64
 3   vendor_count              31888 non-null  int64  
 4   product_count             31888 non-null  int64  
 5   is_chain                  31888 non-null  int64  
 6   first_order               31782 non-null  float64
 7   last_order                31888 non-null  int64  
 8   last_promo                31888 non-null  object 
 9   payment_method            31888 non-null  object 
 10  CUI_American              31888 non-null  float64
 11  CUI_Asian                 31888 non-null  float64
 12  CUI_Beverages             31888 non-null  float64
 13  CUI_Cafe                  31888 non-null  float64
 14  CUI_Ch
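
The non-null counts above imply missing values in `customer_age` (31888 - 31161 = 727) and `first_order` (31888 - 31782 = 106). A minimal sketch of how such a summary could be produced directly is shown below; the `toy` frame and its values are illustrative assumptions, not the project data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the project data (values are illustrative only).
toy = pd.DataFrame(
    {
        "customer_age": [18.0, 17.0, np.nan, 20.0],
        "first_order": [0.0, np.nan, 2.0, 3.0],
        "last_order": [1, 1, 2, 3],
    }
)

# Count and share of missing values per column, keeping only affected columns.
missing = toy.isna().sum()
summary = pd.DataFrame(
    {"n_missing": missing, "pct_missing": (missing / len(toy) * 100).round(2)}
)
summary = summary[summary["n_missing"] > 0]
print(summary)
```

Applied to the real data, the same pattern would use `df` in place of `toy`.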

#### <span style="color:salmon"> 2.1.1 Setting the index </span> 

In [6]:
df = df.set_index("customer_id")

In [7]:
df.head()

customer_id,customer_region,customer_age,vendor_count,product_count,is_chain,first_order,last_order,last_promo,payment_method,CUI_American,CUI_Asian,CUI_Beverages,CUI_Cafe,CUI_Chicken Dishes,CUI_Chinese,CUI_Desserts,CUI_Healthy,CUI_Indian,CUI_Italian,CUI_Japanese,CUI_Noodle Dishes,CUI_OTHER,CUI_Street Food / Snacks,CUI_Thai,DOW_0,DOW_1,DOW_2,DOW_3,DOW_4,DOW_5,DOW_6,HR_0,HR_1,HR_2,HR_3,HR_4,HR_5,HR_6,HR_7,HR_8,HR_9,HR_10,HR_11,HR_12,HR_13,HR_14,HR_15,HR_16,HR_17,HR_18,HR_19,HR_20,HR_21,HR_22,HR_23
1b8f824d5e,2360,18.0,2,5,1,0.0,1,DELIVERY,DIGI,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.88,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,1,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0
5d272b9dcb,8670,17.0,2,2,2,0.0,1,DISCOUNT,DIGI,12.82,6.39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,1,0.0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
f6d1b2ba63,4660,38.0,1,2,2,0.0,1,DISCOUNT,CASH,9.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,1,0.0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
180c632ed8,4660,,2,3,1,0.0,2,DELIVERY,DIGI,0.0,13.7,0.0,0.0,0.0,0.0,0.0,0.0,17.86,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,1,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0
4eb37a6705,4660,20.0,2,5,0,0.0,2,-,DIGI,14.57,40.87,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,1,0.0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### <span style="color:salmon"> Descriptive statistics </span>

In [8]:
# Summary statistics for all columns, numerical and categorical (include="all")
df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
customer_region,31888.0,9.0,8670,9761.0,,,,,,,
customer_age,31161.0,,,,27.506499,7.160898,15.0,23.0,26.0,31.0,80.0
vendor_count,31888.0,,,,3.102609,2.771587,0.0,1.0,2.0,4.0,41.0
product_count,31888.0,,,,5.668245,6.957287,0.0,2.0,3.0,7.0,269.0
is_chain,31888.0,,,,2.818866,3.977529,0.0,1.0,2.0,3.0,83.0
first_order,31782.0,,,,28.478604,24.109086,0.0,7.0,22.0,45.0,90.0
last_order,31888.0,,,,63.675521,23.226123,0.0,49.0,70.0,83.0,90.0
last_promo,31888.0,4.0,-,16748.0,,,,,,,
payment_method,31888.0,3.0,CARD,20161.0,,,,,,,
CUI_American,31888.0,,,,4.880438,11.654018,0.0,0.0,0.0,5.66,280.21


#### <span style="color:salmon"> 2.4 Numerical variables </span>
Numerical variables represent measurable quantities and can be analyzed mathematically.

**After analysing these variables, we can conclude that:** <br>  
  1. customer_age:
     + the average customer age is 27.5 years
     + the youngest customer is 15 and the oldest is 80 (a potential outlier)
     + 75% of customers are 31 or younger
     + most customers are young, with only a few older individuals in the dataset <br><br>  
  2. vendor_count:
     + some entries have 0 vendors (needs further exploration)
     + the average vendor count is 3.1
     + 75% of customers ordered from 4 or fewer unique vendors
     + the maximum vendor count is 41 (a potential outlier)
     + most customers ordered from few vendors, but some customers have a much higher count <br><br>  

  3. product_count:
     + some entries have 0 products (no products purchased)
     + the maximum product count is 269 (outlier)
     + most customers have a low product count (median = 3) <br><br>  

  4. is_chain:
     + customers placed relatively few orders at chain restaurants (mean = 2.8, median = 2)
     + the maximum number of orders at chain restaurants is 83
     + most customers have a low number of chain orders, but a few have a very high number (potential outliers) <br><br>  
  
  5. first_order:
     + on average, customers placed their first order around day 28 of the observation period
     + std = 24.1 indicates a wide spread: the time customers take to place their first order varies considerably
     + min = 0 (customers who placed their first order on the first day of the observation period)
     + max = 90 (consistent with the three-month collection window, so probably not an outlier)
     + 75% of customers placed their first order within 45 days of the start of the dataset <br><br>  

  6. last_order:
     + std = 23.2 indicates a wide spread in last-order values (the timing of customers' most recent order varies considerably)
     + min = 0 (may indicate that the customer's only order was placed on the first day of the dataset, or that no order was placed yet; compare with the minimum of first_order) <br><br>  

  7. cuisine types (CUI_*):
     + the mean may not be a good measure for comparison because of extreme values: high spenders are a small fraction of the dataset, so the mean stays low even though the maxima are very large
     + 75% of customers spend nothing in most cuisine types, and only a small amount in Asian and American cuisines
     + the maximum values are very high, especially for American (280.21) and Asian (896.71) cuisines, which indicates potential outliers who order frequently or spend heavily <br><br>  

  8. DOW_0-DOW_6:
     + 75% of customers placed at most 1 order on any given day of the week
     + the maximum values indicate the presence of outliers and point to the peak ordering days <br><br>  

  9. hours of the day (HR_0-HR_23):
      + HR_0: no activity at midnight
      + HR_1-HR_23: 75% of customers placed no orders in each of these hours; the maximum values may be outliers or unusual behaviour
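
Several of the points above flag possible outliers without a rule for deciding. One common convention (an assumption here, not necessarily the rule used later in this project) is the 1.5 × IQR fence. The sketch below applies it to a toy frame with illustrative values; `iqr_outlier_mask` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy data (illustrative values only) with one extreme value per column.
toy = pd.DataFrame(
    {
        "vendor_count": [1, 1, 2, 2, 3, 4, 4, 41],
        "customer_age": [23, 24, 26, 26, 27, 31, 31, 80],
    }
)

# Number of flagged observations per column.
outlier_counts = {col: int(iqr_outlier_mask(toy[col]).sum()) for col in toy.columns}
print(outlier_counts)
```

On heavily right-skewed variables such as the cuisine spending columns, this rule tends to flag many observations, so the flagged rows are candidates for inspection rather than automatic removal.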

#### <span style="color:salmon"> 2.5 Categorical variables </span>
Categorical variables represent characteristics or qualities that group data into distinct categories or labels.

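**After analysing these variables, we can conclude:** 
  1. customer_region:
     + Customers come from 8 different regions; a ninth value, "-", appears where no region is recorded
     + Most customers are located in region 8670 (9,761 customers)
     + Missing values are encoded as "-" <br><br>  

  2. last_promo:
     + There are 3 promotion categories (DELIVERY, DISCOUNT, FREEBIE) plus the placeholder "-"
     + The most frequent value is "-" (16,748 of 31,888 customers), i.e. no promotion recorded; among the actual promotion categories, delivery promotions are the most used
     + Missing values are encoded as "-" <br><br>  

  3. payment_method:
     + There are 3 different payment methods used by customers (CARD, DIGI, CASH)
     + Card is the most common payment method (20,161 of 31,888 customers) <br><br>  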

Apart from the "-" placeholder in customer_region and last_promo, we found no unexpected values in the categorical variables.
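
Because "-" appears as a value in `customer_region` and `last_promo`, it is easy to overlook as missing data. The sketch below counts the placeholder and recodes it on a toy frame; the values, and the `NO_PROMO` label in particular, are illustrative assumptions rather than choices made in this project.

```python
import pandas as pd

# Toy frame (illustrative values only) using the same "-" placeholder.
toy = pd.DataFrame(
    {
        "customer_region": ["2360", "8670", "-", "4660"],
        "last_promo": ["DELIVERY", "-", "-", "DISCOUNT"],
        "payment_method": ["DIGI", "CARD", "CASH", "CARD"],
    }
)

# How often the placeholder occurs in each column.
placeholder_counts = (toy == "-").sum()

cleaned = toy.copy()
# Region: treat "-" as a true missing value (becomes NaN).
cleaned["customer_region"] = toy["customer_region"].mask(toy["customer_region"] == "-")
# Promo: "-" may simply mean that no promotion was used, so give it an
# explicit label instead (NO_PROMO is a hypothetical name).
cleaned["last_promo"] = toy["last_promo"].replace("-", "NO_PROMO")

print(placeholder_counts)
print(cleaned)
```

Whether "-" in `last_promo` means "missing" or "no promotion" is a modelling decision; the two columns are handled differently here only to show both options.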

#### <span style="color:salmon"> 2.2 Duplicates </span> 
To ensure the data is correct, we need to check whether the dataset contains duplicate rows.

In [9]:
# Count fully duplicated rows (rows identical in every column).
dup = df.duplicated().value_counts()
dup

False    31828
True        60
Name: count, dtype: int64

To see the percentage of duplicate rows in the dataset:

In [10]:
# Percentage of Duplicates in the dataset.
percentage_duplicates = ((df.duplicated().mean()) * 100).round(3)
print(f"Percentage of Duplicates in the dataset is: {percentage_duplicates}%")

Percentage of Duplicates in the dataset is: 0.188%


Then, we removed the duplicates, using the function drop_duplicates().

In [11]:
# Remove duplicates.
df.drop_duplicates(inplace=True)
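
Two checks that look similar answer different questions: `Series.duplicated()` on a single column flags repeated values of that column, while `DataFrame.duplicated()` flags rows that are identical in every column. Note also that `DataFrame.duplicated()` ignores the index, so with `customer_id` as the index, two rows with different IDs but identical features count as duplicates. A minimal sketch on a toy frame (illustrative values only):

```python
import pandas as pd

# Toy frame indexed by a hypothetical customer ID.
toy = pd.DataFrame(
    {"is_chain": [1, 1, 2, 2], "product_count": [5, 3, 2, 2]},
    index=["a", "b", "c", "d"],
)

# Repeated values within one column: every value after its first occurrence.
column_level = int(toy["is_chain"].duplicated().sum())

# Fully duplicated rows: all columns equal (the index is not compared).
row_level = int(toy.duplicated().sum())

# Keep the first occurrence of each distinct row.
deduped = toy.drop_duplicates()

print(column_level, row_level, len(deduped))
```

Only the row-level check matches what `drop_duplicates()` removes.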

#### <span style="color:salmon"> 2.1.1 Unique values </span> 

In [12]:
for column in df.columns:
    unique_values = df[column].unique()
    unique_values_num = df[column].nunique()
    print(f"Unique values in '{column}':")
    print(unique_values)
    print(f"Number of unique values in '{column}':")
    print(unique_values_num)
    
    print()

Unique values in 'customer_region':
['2360' '8670' '4660' '2440' '-' '4140' '2490' '8370' '8550']
Number of unique values in 'customer_region':
9

Unique values in 'customer_age':
[18. 17. 38. nan 20. 40. 24. 27. 33. 26. 21. 51. 35. 22. 31. 15. 47. 19.
 28. 32. 25. 37. 42. 48. 16. 34. 29. 23. 30. 39. 46. 41. 49. 53. 36. 43.
 44. 45. 57. 58. 68. 56. 61. 60. 54. 59. 50. 55. 52. 65. 75. 66. 79. 80.
 63. 77. 62. 69. 72. 64. 76. 74. 67. 70. 78. 71. 73.]
Number of unique values in 'customer_age':
66

Unique values in 'vendor_count':
[ 2  1  3  4  5  7  6 11  9  8 12 14 20 13 10 16 24 18 17 15 25 30 19 21
 23 22 28 32 41 40 26  0 27 34 31 35 29]
Number of unique values in 'vendor_count':
37

Unique values in 'product_count':
[  5   2   3   4   6  10   8  17   7  26   9  32  15  13  16  28  19  12
  18  11  25  14  23  29  21  31  95  24  64  27  37  49  22  41  51  34
  39  20  30  47  40  53  38  54  35  33  56  65  45  63  36  79  48  70
  44  84  46  66  57  58  83 110 117 269  75  55  42 

In [13]:
df['customer_age'].value_counts().sort_index(ascending=False)

customer_age
80.0       3
79.0       2
78.0       1
77.0       6
76.0       2
75.0       2
74.0       3
73.0       1
72.0       3
71.0       1
70.0       3
69.0       2
68.0       8
67.0       4
66.0       5
65.0       6
64.0       2
63.0       7
62.0       8
61.0       7
60.0       9
59.0      12
58.0      26
57.0      17
56.0      19
55.0      28
54.0      34
53.0      53
52.0      47
51.0      58
50.0      65
49.0      59
48.0      89
47.0      88
46.0     123
45.0     123
44.0     157
43.0     219
42.0     212
41.0     258
40.0     315
39.0     343
38.0     399
37.0     425
36.0     520
35.0     618
34.0     740
33.0     860
32.0     968
31.0    1090
30.0    1288
29.0    1501
28.0    1626
27.0    1916
26.0    2056
25.0    2258
24.0    2299
23.0    2356
22.0    2313
21.0    1993
20.0    1438
19.0    1056
18.0     586
17.0     269
16.0      76
15.0      20
Name: count, dtype: int64

In [14]:
df['CUI_Asian'].value_counts().sort_index(ascending=False)

CUI_Asian
896.71        1
537.49        1
419.05        1
395.70        1
394.31        1
385.74        1
380.40        1
375.19        1
360.48        1
337.89        1
316.16        1
310.89        1
299.37        1
297.76        1
287.09        1
286.72        1
284.07        1
283.84        1
282.15        1
273.75        1
270.85        1
262.75        1
251.64        1
244.59        1
243.99        1
241.49        1
240.90        1
240.31        1
238.10        1
234.12        1
233.41        1
233.14        1
227.54        1
227.32        1
225.01        1
223.58        1
220.38        1
218.56        1
217.81        1
215.22        1
214.63        1
214.24        1
211.32        1
206.24        1
201.64        1
201.24        1
197.98        2
197.40        1
196.94        1
196.75        1
196.61        1
196.32        1
196.00        1
195.81        1
195.42        1
193.23        1
192.28        1
192.21        1
191.55        1
190.10        1
189.27        1
187.68        

In [15]:
df['CUI_Beverages'].value_counts().sort_index(ascending=False)

CUI_Beverages
229.22        1
192.43        1
176.38        1
173.32        1
173.16        1
171.14        1
170.12        1
168.91        1
158.44        1
156.23        1
147.70        1
138.57        1
137.06        1
136.31        1
131.63        1
126.43        1
125.51        1
124.88        1
121.20        1
118.11        1
116.18        1
113.48        1
112.07        1
110.77        1
110.11        1
108.23        1
106.20        1
104.28        1
102.51        1
102.22        1
100.16        1
98.95         1
98.51         1
98.36         1
96.92         1
95.45         1
93.42         1
92.61         1
92.46         1
91.88         1
91.37         1
90.90         1
90.25         1
89.76         1
89.18         1
87.22         1
86.05         1
85.23         1
85.13         1
84.95         1
83.87         1
83.44         1
81.47         1
81.07         1
80.10         1
80.04         1
79.94         1
79.25         1
79.18         1
78.83         1
78.01         1
75.66     

In [16]:
df['CUI_Cafe'].value_counts().sort_index(ascending=False)

CUI_Cafe
326.10        1
266.60        1
213.26        1
197.09        1
196.60        1
191.46        1
185.73        1
182.52        1
177.75        1
176.77        1
155.11        1
141.06        1
122.17        1
120.01        1
118.72        1
115.31        1
111.77        1
111.42        1
108.64        1
106.32        1
105.78        1
101.44        1
100.96        1
100.63        1
100.17        1
99.52         1
97.22         1
92.20         1
91.68         1
87.46         1
86.84         1
85.06         1
84.74         1
83.25         1
82.74         1
82.18         1
81.22         1
79.19         1
78.80         1
78.24         1
76.70         1
76.57         1
75.07         1
75.03         1
74.81         1
74.04         1
72.95         1
72.91         1
69.72         1
69.67         1
68.93         1
68.90         1
68.81         1
68.64         1
66.53         1
66.24         1
65.65         1
65.01         1
64.70         1
64.48         1
64.42         1
63.18         1

In [17]:
df['CUI_Chicken Dishes'].value_counts().sort_index(ascending=False)

CUI_Chicken Dishes
219.66        1
126.89        1
107.73        1
104.25        1
99.29         1
96.30         1
94.06         1
79.07         1
68.50         1
65.21         1
64.16         1
56.44         1
55.10         1
54.45         1
53.06         1
49.97         1
48.69         1
48.34         1
46.74         1
44.32         1
42.64         1
42.45         1
41.85         1
41.15         1
39.38         1
39.36         1
38.04         1
38.01         1
37.76         1
37.59         1
37.51         1
37.42         1
37.41         1
37.38         1
37.36         1
36.93         1
36.43         1
35.95         1
35.33         1
35.09         1
35.04         1
34.50         1
33.86         1
33.67         1
33.60         1
33.06         1
32.77         1
32.31         1
32.20         1
31.71         1
31.68         1
31.60         1
31.31         1
31.21         1
31.15         1
30.66         1
30.59         1
30.04         1
29.99         1
29.84         1
29.62         1
29.40

In [18]:
df['CUI_Chinese'].value_counts().sort_index(ascending=False)

CUI_Chinese
739.73        1
435.64        1
362.87        1
195.30        1
165.13        1
160.97        1
159.08        1
152.73        1
132.76        1
130.79        1
129.29        1
125.72        1
120.05        1
113.58        1
111.10        1
106.82        1
103.06        1
102.09        1
94.99         1
90.86         1
90.81         1
88.33         1
87.09         1
85.48         1
85.09         1
84.75         1
83.54         1
83.39         1
83.26         1
80.91         1
78.92         1
77.16         1
76.60         1
76.15         1
75.68         1
75.54         1
75.20         1
74.73         1
74.28         1
74.08         1
73.90         1
73.07         1
72.01         1
71.21         1
70.77         1
70.13         1
70.10         1
69.97         1
68.31         1
67.81         1
66.89         1
66.70         1
66.58         2
65.79         1
65.31         1
64.73         1
64.20         1
64.02         1
63.20         1
62.49         1
62.09         1
61.02       

In [19]:
df['CUI_Desserts'].value_counts().sort_index(ascending=False)

CUI_Desserts
230.07        1
206.62        1
197.55        1
185.53        1
138.81        1
118.80        1
117.09        1
111.19        1
104.99        1
96.47         1
91.45         1
86.13         1
85.42         1
80.43         1
73.45         2
73.41         1
71.95         1
69.97         1
69.85         1
69.49         1
69.28         1
67.92         1
67.47         1
67.30         1
67.20         1
66.66         1
64.93         1
62.94         1
61.96         1
61.89         1
61.27         1
60.94         1
59.05         1
57.25         1
56.27         1
55.81         1
55.47         1
54.83         1
54.63         1
54.01         1
53.93         1
53.53         1
51.46         1
50.41         1
50.25         1
50.11         1
49.65         1
49.53         1
49.50         1
49.09         1
49.07         1
49.04         1
48.64         1
48.39         1
48.19         1
47.51         1
47.28         1
46.10         1
45.64         1
44.74         1
44.38         1
43.96      

In [20]:
df['CUI_Healthy'].value_counts().sort_index(ascending=False)

CUI_Healthy
255.81        1
238.02        1
209.81        1
154.24        1
137.33        1
131.42        1
128.18        1
122.22        1
118.45        1
115.10        1
112.80        1
112.44        1
111.55        1
108.44        1
103.08        1
100.88        1
99.96         1
99.84         1
98.70         1
97.53         1
96.21         1
93.50         1
93.14         1
92.28         1
90.41         1
88.05         1
87.73         1
79.80         1
78.54         1
77.77         1
76.85         1
75.88         1
73.70         1
73.67         1
73.24         1
73.04         1
69.79         1
68.28         1
67.69         1
67.31         1
66.98         1
66.33         1
66.09         1
65.55         1
65.28         1
63.43         1
62.94         1
62.82         1
62.60         1
61.88         1
61.57         1
61.55         1
61.52         1
60.28         1
59.33         1
59.13         1
58.67         1
58.45         1
58.29         1
58.20         1
57.20         1
56.86       

In [21]:
df['CUI_Indian'].value_counts().sort_index(ascending=False)

CUI_Indian
309.07        1
233.01        1
196.79        1
162.38        1
161.15        1
156.04        1
155.92        1
140.51        1
133.88        1
131.09        1
126.47        1
121.12        1
119.62        1
116.52        1
113.91        1
113.81        1
113.53        1
108.34        1
107.59        1
107.45        1
106.08        1
103.19        1
102.25        1
98.49         1
97.38         1
95.84         1
94.35         1
94.28         1
92.97         1
92.91         1
91.55         1
91.08         1
88.97         1
88.83         1
87.82         1
87.69         1
85.21         2
85.11         1
83.22         1
81.91         1
81.89         1
81.60         1
81.21         1
78.54         1
77.77         1
77.53         1
77.36         1
77.04         1
76.23         1
76.02         1
76.01         1
75.61         1
75.50         1
74.58         1
74.42         1
74.25         1
71.37         1
71.26         1
68.98         1
68.36         1
67.63         1
67.28        

In [22]:
df['CUI_Italian'].value_counts().sort_index(ascending=False)

CUI_Italian
468.33        1
345.61        1
276.22        1
258.24        1
230.99        1
224.06        1
197.87        1
186.49        1
177.65        1
167.12        1
160.69        1
160.57        1
157.87        1
150.88        1
149.10        1
148.69        1
148.55        1
147.74        1
147.63        1
145.83        1
143.64        1
142.17        1
141.40        1
140.94        1
140.90        1
139.23        1
138.33        1
137.66        1
135.75        2
133.29        1
131.70        1
130.15        1
128.99        1
128.69        1
126.49        1
125.40        1
121.99        1
121.83        1
121.48        1
121.00        1
120.60        1
120.26        1
119.90        1
119.23        1
118.68        1
113.80        1
113.13        1
113.11        1
110.17        1
109.05        1
108.98        1
108.62        1
107.24        1
105.80        1
105.44        1
102.61        1
102.30        1
101.42        1
99.35         1
99.20         1
98.81         1
97.95       

In [23]:
df['CUI_Japanese'].value_counts().sort_index(ascending=False)

CUI_Japanese
706.14        1
240.16        1
211.78        1
205.21        1
195.06        1
182.49        1
181.55        1
174.15        1
165.76        1
158.69        1
128.24        1
128.05        1
126.50        1
124.09        1
123.55        1
121.55        1
121.00        1
118.41        1
113.10        1
112.98        1
111.93        1
110.13        1
109.79        1
108.03        1
107.97        1
107.83        1
107.62        1
104.39        1
101.66        1
101.13        1
100.20        1
99.76         1
99.57         1
99.43         1
98.94         1
98.27         1
97.77         1
97.56         1
96.14         1
95.21         1
94.30         1
93.45         1
93.06         1
92.24         1
90.65         1
90.02         1
89.89         1
89.32         1
88.85         1
87.85         1
87.53         1
87.50         1
87.28         1
86.99         1
86.94         1
86.79         1
86.72         1
86.18         1
85.60         1
85.10         1
84.95         1
84.05      

In [24]:
df['CUI_Noodle Dishes'].value_counts().sort_index(ascending=False)

CUI_Noodle Dishes
275.11        1
197.84        1
153.87        1
122.47        1
119.90        1
115.25        1
100.08        1
96.18         1
95.00         1
82.07         1
80.29         1
78.90         1
78.79         1
76.62         1
69.97         1
68.30         1
67.89         1
67.78         1
65.68         1
63.02         1
60.21         1
60.10         1
58.33         1
57.23         1
57.10         1
56.87         1
55.50         1
55.40         1
54.24         1
53.60         2
52.87         1
52.60         1
52.12         1
51.25         1
51.10         1
51.06         1
50.94         1
50.83         1
49.78         1
49.46         1
49.39         1
49.30         1
49.26         1
49.23         1
48.71         1
47.84         1
47.66         1
47.65         1
47.10         1
46.73         1
46.09         1
46.00         1
45.34         1
45.18         1
44.85         1
43.92         1
43.50         1
43.43         1
43.39         1
43.20         1
43.18         1
43.08 

In [25]:
df['CUI_OTHER'].value_counts().sort_index(ascending=False)

CUI_OTHER
366.08        1
243.18        1
233.77        1
227.90        1
202.54        1
191.27        1
171.23        1
167.04        1
165.65        1
158.66        1
157.76        1
154.07        1
148.20        1
144.03        1
143.51        1
139.42        1
139.16        1
138.08        1
138.03        1
132.66        1
132.42        1
130.21        1
129.08        1
129.00        1
121.43        1
120.81        1
119.64        1
117.80        1
115.99        1
114.91        1
114.69        1
114.60        1
112.39        1
111.52        1
110.69        1
109.31        1
108.13        1
104.91        1
103.63        1
103.42        1
102.50        1
102.14        1
101.94        1
101.41        1
99.95         1
99.00         1
98.69         1
98.55         1
98.20         1
98.13         1
93.72         1
91.94         1
91.47         1
91.16         1
91.08         1
90.32         2
89.16         1
87.89         1
87.34         1
87.29         1
85.80         1
85.66         

In [26]:
df['CUI_Street Food / Snacks'].value_counts().sort_index(ascending=False)

CUI_Street Food / Snacks
454.45        1
382.39        1
318.04        1
309.04        1
308.48        1
291.61        1
285.47        1
263.84        1
253.66        1
246.72        1
246.00        1
232.69        1
230.22        1
229.32        1
221.20        1
220.58        1
202.73        1
202.21        1
200.26        1
197.47        1
195.51        1
194.99        1
193.53        1
188.93        1
187.23        1
186.35        1
186.20        1
181.06        1
180.31        1
177.58        1
176.46        1
176.18        1
172.71        1
166.95        1
165.20        1
164.99        1
164.11        1
163.23        1
162.53        1
161.70        1
161.15        1
159.52        1
155.75        1
155.74        1
154.32        1
153.92        1
153.29        1
152.26        1
150.35        1
149.77        1
148.17        1
144.25        1
144.10        1
143.81        1
143.68        1
143.50        1
142.80        1
142.05        1
141.88        1
141.50        1
140.92        1

In [27]:
df['CUI_Thai'].value_counts().sort_index(ascending=False)

CUI_Thai
136.38        1
129.56        1
120.61        1
113.09        1
109.64        1
99.15         1
97.50         1
90.63         1
90.57         1
86.02         1
83.04         1
82.42         1
81.32         1
81.29         1
80.33         1
76.93         1
73.39         1
73.18         1
72.52         1
63.75         1
63.17         1
62.87         1
60.79         1
60.65         1
60.45         1
60.18         1
59.20         1
58.74         1
58.37         1
56.52         1
55.43         1
55.04         1
54.77         1
54.32         1
52.67         1
52.04         1
51.08         1
50.62         1
50.23         1
49.71         1
48.97         1
48.73         1
48.60         1
48.58         1
48.22         1
48.04         1
48.00         1
47.15         1
46.53         1
46.35         1
46.32         1
46.08         1
46.03         1
46.00         1
45.80         1
45.61         1
45.49         1
45.46         1
45.43         1
45.08         1
44.76         1
44.70         1

#### All listed variables - unique values summary 

1. **customer_region**: there is one strange value, '-' (9 unique values)
2. **customer_age**: no strange values (66 unique values) 
3. **vendor_count**: some customers have vendor_count = 0 (37 unique values)
4. **product_count**: some customers have product_count = 0 (93 unique values)
5. **is_chain** (number of orders made in chain restaurants): no strange values (60 unique values)
6. **first_order**: some customers have first_order = 0 (91 unique values)
7. **last_order**: some customers have last_order = 0 (91 unique values)
8. **last_promo**: there is one strange value, '-' (4 unique values)
9. **payment_method**: no strange values (3 unique values)

10. **CUI_American**: no strange values (3327 unique values) 
11. **CUI_Asian**: no strange values; CUI_Asian = 0 in 19983 rows (4650 unique values)
12. **CUI_Beverages**: no strange values; CUI_Beverages = 0 in 26453 rows (2194 unique values)
13. **CUI_Cafe**: no strange values; CUI_Cafe = 0 in 30522 rows (1064 unique values)
14. **CUI_Chicken Dishes**: no strange values; CUI_Chicken Dishes = 0 in 28640 rows (1372 unique values)
15. **CUI_Chinese**: no strange values; CUI_Chinese = 0 in 28366 rows (1842 unique values)
16. **CUI_Desserts**: no strange values; CUI_Desserts = 0 in 29872 rows (1148 unique values) 
17. **CUI_Healthy**: no strange values; CUI_Healthy = 0 in 29719 rows (1285 unique values)
18. **CUI_Indian**: no strange values; CUI_Indian = 0 in 28440 rows (2002 unique values)
19. **CUI_Italian**: no strange values; CUI_Italian = 0 in 25440 rows (2907 unique values)
20. **CUI_Japanese**: no strange values; CUI_Japanese = 0 in 25587 rows (2649 unique values)
21. **CUI_Noodle Dishes**: no strange values; CUI_Noodle Dishes = 0 in 29662 rows (1276 unique values)
22. **CUI_OTHER**: no strange values; CUI_OTHER = 0 in 24847 rows (2756 unique values) 
23. **CUI_Street Food / Snacks**: no strange values; CUI_Street Food / Snacks = 0 in 27639 rows (2554 unique values)
24. **CUI_Thai**: no strange values; CUI_Thai = 0 in 29510 rows (1373 unique values)

25. **DOW_0**: no strange values (14 unique values)
26. **DOW_1**: no strange values (17 unique values)
27. **DOW_2**: no strange values (16 unique values)
28. **DOW_3**: no strange values (17 unique values)
29. **DOW_4**: no strange values (17 unique values)
30. **DOW_5**: no strange values (17 unique values)
31. **DOW_6**: no strange values (17 unique values)

32. **HR_0**: unique value = 0 (1 unique values) 
33. **HR_1**: no strange values (11 unique values)
34. **HR_2**: no strange values (12 unique values)
35. **HR_3**: no strange values (12 unique values)
36. **HR_4**: no strange values (13 unique values)
37. **HR_5**: no strange values (7 unique values)
38. **HR_6**: no strange values (9 unique values)
39. **HR_7**: no strange values (13 unique values)
40. **HR_8**: no strange values (18 unique values)
41. **HR_9**: no strange values (17 unique values)
42. **HR_10**: no strange values (20 unique values)
43. **HR_11**: no strange values (20 unique values)
44. **HR_12**: no strange values (19 unique values)
45. **HR_13**: no strange values (14 unique values)
46. **HR_14**: no strange values (13 unique values)
47. **HR_15**: no strange values (15 unique values)
48. **HR_16**: no strange values (15 unique values)
49. **HR_17**: no strange values (19 unique values)
50. **HR_18**: no strange values (20 unique values)
51. **HR_19**: no strange values (19 unique values)
52. **HR_20**: no strange values (16 unique values)
53. **HR_21**: no strange values (10 unique values)
54. **HR_22**: no strange values (10 unique values)
55. **HR_23**: no strange values (10 unique values)


#### <span style="color:salmon"> 2.2 Incorrect Data Types </span> 
The `dtypes` attribute shows the data type of each column.

In [28]:
df.dtypes

customer_region              object
customer_age                float64
vendor_count                  int64
product_count                 int64
is_chain                      int64
first_order                 float64
last_order                    int64
last_promo                   object
payment_method               object
CUI_American                float64
CUI_Asian                   float64
CUI_Beverages               float64
CUI_Cafe                    float64
CUI_Chicken Dishes          float64
CUI_Chinese                 float64
CUI_Desserts                float64
CUI_Healthy                 float64
CUI_Indian                  float64
CUI_Italian                 float64
CUI_Japanese                float64
CUI_Noodle Dishes           float64
CUI_OTHER                   float64
CUI_Street Food / Snacks    float64
CUI_Thai                    float64
DOW_0                         int64
DOW_1                         int64
DOW_2                         int64
DOW_3                       

After inspecting the data types, we see that some columns have an incorrect type: customer_age, first_order and HR_0 are stored as floats although they hold integer values.

In [29]:
# convert "customer_age" from float to a nullable integer
# (plain "int64" raises IntCastingNaNError because the column contains missing values)
df["customer_age"] = df["customer_age"].astype("Int64")

In [None]:
# convert "first_order" from float to a nullable integer (the column contains missing values)
df["first_order"] = df["first_order"].astype("Int64")

In [None]:
# convert "HR_0" from float to int
df["HR_0"] = df["HR_0"].astype("Int64")
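As a minimal sketch of why the nullable dtype matters here, the toy Series below (hypothetical values, not taken from the dataset) contains one missing value. Casting it to NumPy's `int64` fails, while pandas' nullable `Int64` keeps the gap as `<NA>`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): an age column with one missing value
ages = pd.Series([25.0, np.nan, 31.0])

# NumPy's int64 cannot represent NaN, so this cast raises an error
try:
    ages.astype("int64")
    int64_cast_failed = False
except (ValueError, TypeError):
    int64_cast_failed = True

# pandas' nullable integer dtype stores the gap as <NA> instead
ages_nullable = ages.astype("Int64")

print("int64 cast failed:", int64_cast_failed)
print(ages_nullable)
```

Once the missing values have been imputed, the column can be cast to plain `int64` if a NumPy dtype is preferred.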

#### <span style="color:salmon"> 2.3 Missing Values </span> 
Some missing values are encoded as "" or "-", so we replace them with NaN.

In [None]:
# replace the placeholder strings with NaN so that pandas treats them as missing values
df.replace(["", "-"], np.nan, inplace=True)

To see which columns have NaN values and how many there are, we use `isna().sum()`.

In [None]:
# Check for missing values
missing_values = df.isna().sum().sort_values(ascending=False)
missing_values

The variables with missing values are customer_region, customer_age, first_order, last_promo and HR_0, with last_promo having a very high percentage (52.5%).

In [None]:
# Percentage of missing values in each variable:
missing_percentage = ((df.isnull().sum() / len(df)) * 100).sort_values(ascending=False)
missing_percentage = missing_percentage[missing_percentage > 0]

print(f"Percentage of missing values:\n {missing_percentage}")
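The two steps above (placeholder replacement, then counting) can be sketched on a small toy frame. The values below are hypothetical and only reuse the dataset's column names. The sketch puts the count and the percentage side by side, using the fact that the mean of a boolean mask is the share of `True` values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) using the same placeholders as the dataset
toy = pd.DataFrame({
    "customer_region": ["2360", "-", "4660", "-"],
    "last_promo": ["DELIVERY", "-", "-", "-"],
    "vendor_count": [2, 1, 3, 1],
})

# Turn the placeholder strings into real missing values
toy = toy.replace(["", "-"], np.nan)

# Count and percentage of missing values per column, side by side
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})

# Keep only the columns that actually have missing values
missing_summary = missing_summary[missing_summary["n_missing"] > 0]
print(missing_summary.sort_values("pct_missing", ascending=False))
```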


## Product_count vs vendor_count inconsistencies:

Another inconsistency concerns product_count and vendor_count: some customers have product_count = 0 (no products purchased) but vendor_count >= 1 (at least one vendor ordered from).

In [None]:
# Customers with product_count = 0 (some of them have vendor_count = 1)
product_count_zero = df.loc[df["product_count"]==0, ["vendor_count", "product_count", "first_order", "last_order"]]
print(f"Total number of customers with product_count equal to 0: {len(product_count_zero)} customers")

In [None]:
# Customers with vendor_count = 0
vendor_count_zero = df.loc[df["vendor_count"]==0, ["vendor_count", "product_count", "first_order", "last_order"]]
print(f"Total number of customers with vendor_count equal to 0: {len(vendor_count_zero)} customers")

In [None]:
product_zero_vendor_one = df.loc[(df["product_count"]==0) & (df["vendor_count"]>=1), ["vendor_count", "product_count", "first_order", "last_order"]]
product_zero_vendor_one_percentage = len(product_zero_vendor_one) / len(df) * 100
print(f"Number of customers with product_count = 0 and vendor_count >= 1: {len(product_zero_vendor_one)} customers")
print(f"Percentage of customers with product_count = 0 and vendor_count >= 1: {round(product_zero_vendor_one_percentage, 4)} %")
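The three counts above can also be read from a single cross-tabulation. The sketch below uses a toy frame with hypothetical values. The cell for "vendor_count == 0 is False" and "product_count == 0 is True" holds the inconsistent rows.

```python
import pandas as pd

# Toy data (hypothetical values): row 1 is inconsistent (a vendor but no products)
toy = pd.DataFrame({
    "vendor_count": [0, 1, 2, 1, 0],
    "product_count": [0, 0, 5, 3, 0],
})

# Cross-tabulate the two "is zero" flags
summary = pd.crosstab(
    toy["vendor_count"].eq(0).rename("vendor_count == 0"),
    toy["product_count"].eq(0).rename("product_count == 0"),
)
print(summary)

# Same number as the boolean filter used in the cell above
inconsistent = int(((toy["product_count"] == 0) & (toy["vendor_count"] >= 1)).sum())
print("Inconsistent rows:", inconsistent)
```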

### <span style="color:salmon"> 4. Visualizing features - univariate </span> 
### <span style="color:yellow"> MARTA - START </span> 

In [None]:
non_metric_features = ["customer_region", "last_promo", "payment_method"]
metric_features = df.columns.drop(non_metric_features)
metric_features

### METRIC FEATURES - NUMERICAL

In [None]:
sns.set()

# Set up the figure and axes
rows, cols = 12, 5  
fig, axes = plt.subplots(rows, cols, figsize=(25, 30))  

# Plot each feature
for ax, feat in zip(axes.flatten(), metric_features):
    ax.hist(df[feat], bins=20, color='skyblue', edgecolor='black')  
    ax.set_title(feat, fontsize=10, y=-0.2)  
    
# Hide any unused subplots if the number of features is less than rows * cols
for ax in axes.flatten()[len(metric_features):]:
    ax.set_visible(False)

# Set a global title and adjust layout
plt.suptitle("Numeric Variables' Histograms", fontsize=16, y=1.02)  
plt.tight_layout()
plt.show()

Next, we plot the variables excluding zero values, because the large number of zeros distorts the scale of the histograms. These plots are used to define the outlier thresholds.

In [None]:
CUI_variables = ["CUI_American", "CUI_Asian", "CUI_Beverages", "CUI_Cafe", 
                 "CUI_Chicken Dishes", "CUI_Chinese", "CUI_Desserts", 
                 "CUI_Healthy", "CUI_Indian", "CUI_Italian", "CUI_Japanese", 
                 "CUI_Noodle Dishes", "CUI_OTHER", "CUI_Street Food / Snacks", 
                 "CUI_Thai", "DOW_0" ,"DOW_1", "DOW_2", "DOW_3", "DOW_4", "DOW_5", "DOW_6",
                 "HR_1", "HR_2", "HR_3", "HR_4", "HR_5", "HR_6", "HR_7", "HR_8", "HR_9", 
                 "HR_10", "HR_11", "HR_12", "HR_13", "HR_14", "HR_15", "HR_16", "HR_17", "HR_18", 
                 "HR_19", "HR_20", "HR_21", "HR_22", "HR_23"]

sns.set()

# Calculate rows and columns
n_features = len(CUI_variables)
cols = 5
rows = -(-n_features // cols)  # ceiling division: enough rows to hold every feature

# Set up the figure and axes
fig, axes = plt.subplots(rows, cols, figsize=(25, rows * 5)) 

# Plot each CUI variable
for ax, feat in zip(axes.flatten(), CUI_variables):
    data_no_zero = df[df[feat] != 0][feat] # Exclude zero values
    ax.hist(data_no_zero, bins=20, color='skyblue', edgecolor='black')  
    ax.set_title(feat, fontsize=10, y=-0.2)  

# Hide  unused subplots if the number of CUI variables is less than rows * cols
for ax in axes.flatten()[len(CUI_variables):]:
    ax.set_visible(False)

# Set a global title and adjust layout
plt.suptitle("CUI Variables' Histograms (Excluding Zero)", fontsize=16, y=1.02)  
plt.tight_layout()
plt.show()

### NON METRIC FEATURES - Categorical

In [None]:
for column in non_metric_features:
    
    categories = df[column].value_counts()

    categories_sorted = categories.sort_values(ascending=True)

    data_filtered = df[df[column].isin(categories_sorted.index)]
    
   
    plt.figure(figsize=(10, 5))
    sns.countplot(data=data_filtered, 
                  x=column, 
                  order=categories_sorted.index,  
                  palette='tab20b')
    
  
    plt.xlabel(column, fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.title(f'Categories in {column}')
    
    
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

## Treat missing values & Inconsistent values:

In [None]:
# Check percentage of missing values:
missing_percentage = ((df.isnull().sum() / len(df)) * 100).sort_values(ascending=False)
missing_percentage = missing_percentage[missing_percentage > 0]

print(f"Percentage of missing values:\n {missing_percentage}")

In [None]:
# Impute missing values in numerical features using the median:
median_variables = ["customer_age", "HR_0"]
for column in median_variables:
    median_value = df[column].median()
    df[column] = df[column].fillna(median_value)

In [None]:
# Check percentage of missing values:
missing_percentage = ((df.isnull().sum() / len(df)) * 100).sort_values(ascending=False)
missing_percentage = missing_percentage[missing_percentage > 0]

print(f"Percentage of missing values:\n {missing_percentage}")

In [None]:
# Impute missing values in categorical features using the mode:
mode_variables = ["last_promo", "customer_region"]
for column in mode_variables:
    mode_value = df[column].mode()[0]
    df[column] = df[column].fillna(mode_value)

In [None]:
# Check percentage of missing values:
missing_percentage = ((df.isnull().sum() / len(df)) * 100).sort_values(ascending=False)
missing_percentage = missing_percentage[missing_percentage > 0]

print(f"Percentage of missing values:\n {missing_percentage}")

In [None]:
missing_percentage_first_order = df.loc[df["first_order"].isna()]

In [None]:
# Impute missing values in first_order using KNN:

features_for_imputation = ["last_order", "product_count", "vendor_count", "DOW_0", "DOW_1", "DOW_2", "DOW_3", "DOW_4", "DOW_5", "DOW_6",  
                            "HR_0", "HR_1", "HR_2", "HR_3", "HR_4", "HR_5", "HR_6", "HR_7", "HR_8", "HR_9", "HR_10", "HR_11", "HR_12", "HR_13", 
                            "HR_14", "HR_15", "HR_16", "HR_17", "HR_18", "HR_19", "HR_20", "HR_21", "HR_22", "HR_23",
                            "CUI_American", "CUI_Asian", "CUI_Beverages", "CUI_Cafe", "CUI_Chicken Dishes", "CUI_Chinese", "CUI_Desserts", 
                            "CUI_Healthy", "CUI_Indian", "CUI_Italian", "CUI_Japanese", "CUI_Noodle Dishes", "CUI_OTHER", "CUI_Street Food / Snacks", "CUI_Thai"]

# first_order must be part of the matrix (as column 0), otherwise the imputer cannot fill it
knn_columns = ["first_order"] + features_for_imputation
missing_mask = df["first_order"].isna()

knn_imputer = KNNImputer(n_neighbors=5, weights="uniform")

# Fit on all rows, so that the neighbours are customers with a known first_order
imputed_values = knn_imputer.fit_transform(df[knn_columns].astype("float64"))

# Column 0 is first_order; round because first_order is an integer number of days
df.loc[missing_mask, "first_order"] = np.round(imputed_values[missing_mask.to_numpy(), 0])

print("Missing values in first_order have been imputed using KNN.")
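`KNNImputer` can only fill a column that is part of the matrix it receives, and it needs rows where that column is known to act as donors. A minimal sketch on toy data (hypothetical values; `n_neighbors=2` is chosen only to keep the example small) shows the pattern:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical): row 2 has no first_order, but its last_order
# is close to that of rows 0 and 1
toy = pd.DataFrame({
    "first_order": [0.0, 2.0, np.nan, 40.0, 42.0],
    "last_order": [10.0, 12.0, 11.0, 80.0, 85.0],
})

# The column to impute goes first, so it ends up as column 0 of the output
columns = ["first_order", "last_order"]
missing_mask = toy["first_order"].isna()

imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed = imputer.fit_transform(toy[columns])

# Write back only the rows that were missing
toy.loc[missing_mask, "first_order"] = imputed[missing_mask.to_numpy(), 0]
print(toy)
```

The missing value is replaced by the mean `first_order` of the two rows whose `last_order` is closest. Because `KNNImputer` works on raw distances, features with larger scales dominate the neighbour search. Scaling the features first is a common choice, though it is not part of the original notebook.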

In [None]:
# Check if missing values have been imputed for first_order:
df.loc[missing_percentage_first_order.index]

##### NOTE: HR_0 has only one unique value (0), so it is not informative. Consider dropping this variable.

### Treat strange values:

#### Product_count vs vendor_count:

In [None]:
df.loc[(df["product_count"]==0) & (df["vendor_count"]>=1)]

In [None]:
# Replace product_count = 0 with NaN for these rows
df.loc[(df["product_count"] == 0) & (df["vendor_count"] >= 1), "product_count"] = np.nan

In [None]:
product_vendor_count_missing = df.loc[(df["product_count"].isna()) & (df["vendor_count"] >= 1)]
product_vendor_count_missing

In [None]:
# Impute missing values in product_count using KNN:
features_for_imputation = metric_features.drop(["product_count", "customer_age", "is_chain", "CUI_American", "CUI_Asian", "CUI_Beverages", "CUI_Cafe", 
                                                 "CUI_Chicken Dishes", "CUI_Chinese", "CUI_Desserts", "CUI_Healthy", "CUI_Indian", "CUI_Italian", "CUI_Japanese", 
                                                 "CUI_Noodle Dishes", "CUI_OTHER", "CUI_Street Food / Snacks", "CUI_Thai"])

# product_count must be part of the matrix (as column 0), otherwise the imputer cannot fill it
knn_columns = ["product_count"] + list(features_for_imputation)
missing_mask = df["product_count"].isna()

knn_imputer = KNNImputer(n_neighbors=5, weights="uniform")

# Fit on all rows, so that the neighbours are customers with a known product_count
imputed_values = knn_imputer.fit_transform(df[knn_columns].astype("float64"))

# Column 0 is product_count; round because it is a count
df.loc[missing_mask, "product_count"] = np.round(imputed_values[missing_mask.to_numpy(), 0])

print("Missing values in product_count have been imputed using KNN.")

In [None]:
# Check if missing values have been imputed for product_count:
df.loc[product_vendor_count_missing.index]

## OUTLIERS:

### Outlier visualization:

In [None]:
sns.set()

# Set up the figure and axes
rows, cols = 12, 5  
fig, axes = plt.subplots(rows, cols, figsize=(25, 30))  

# Plot each feature as a box plot
for ax, feat in zip(axes.flatten(), metric_features):
    sns.boxplot(data=df, x=feat, ax=ax, color='skyblue')  
    ax.set_title(feat, fontsize=10, y=-0.2)  #

# Hide unused subplots:
for ax in axes.flatten()[len(metric_features):]:
    ax.set_visible(False)

# Set a global title and adjust layout
plt.suptitle("Numeric Variables' Box Plots", fontsize=16, y=1.02)  
plt.tight_layout()
plt.show()

### Outlier Removal:

1. AUTOMATIC METHOD:

In [None]:
# Compute the inter-quartile range
q1 = df[metric_features].quantile(0.25)
q3 = df[metric_features].quantile(0.75)
iqr = q3 - q1

# Compute the limits:
lower_lim = q1 - (1.5 * iqr)
upper_lim = q3 + (1.5 * iqr)

for feature in metric_features:
    print(f"{feature:<25}  Lower Limit: {lower_lim[feature]:>10}      Upper Limit: {upper_lim[feature]:>10}")
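For features that are zero for most customers, the IQR rule can degenerate: if both quartiles are 0, the fences collapse to 0 and every non-zero value is flagged. The sketch below illustrates this on a toy Series (hypothetical spending values) and compares it with fences computed on the non-zero values only, in the spirit of the zero-excluded histograms above.

```python
import pandas as pd

# Toy data (hypothetical): 8 customers who never ordered this cuisine, 2 who did
spend = pd.Series([0.0] * 8 + [12.5, 30.0])

# IQR fence on all values: both quartiles are 0, so the upper fence is 0
q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
upper_all = q3 + 1.5 * (q3 - q1)
n_flagged_all = int((spend > upper_all).sum())

# IQR fence on non-zero values only
non_zero = spend[spend != 0]
q1_nz, q3_nz = non_zero.quantile(0.25), non_zero.quantile(0.75)
upper_nz = q3_nz + 1.5 * (q3_nz - q1_nz)
n_flagged_nz = int((non_zero > upper_nz).sum())

print(f"All values:      upper fence = {upper_all}, flagged = {n_flagged_all}")
print(f"Non-zero values: upper fence = {upper_nz}, flagged = {n_flagged_nz}")
```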

Observations in which all features are outliers:

In [None]:
def identify_outliers(dataframe, metric_features, lower_lim, upper_lim):
    outliers = {}
    obvious_outliers = []

    for metric in metric_features:
        if metric not in dataframe.columns:
            continue
        
        if metric not in lower_lim or metric not in upper_lim:
            continue
        
        outliers[metric] = []
        llim = lower_lim[metric]
        ulim = upper_lim[metric]
        
        for i, value in enumerate(dataframe[metric]):
            if pd.isna(value):
                continue
            
            if value < llim or value > ulim:
                outliers[metric].append(value)
        
        print(f"Total outliers in {metric}: {len(outliers[metric])}")

    # Check for observations that are outliers in all features (Obvious Outliers)
    for index, row in dataframe.iterrows():
        is_global_outlier = True
        for metric in metric_features:
            if metric not in dataframe.columns or metric not in lower_lim or metric not in upper_lim:
                is_global_outlier = False
                break
            
            value = row[metric]
            if pd.isna(value):
                is_global_outlier = False
                break
            
            llim = lower_lim[metric]
            ulim = upper_lim[metric]
            
            if llim <= value <= ulim:
                is_global_outlier = False
                break
        
        if is_global_outlier:
            obvious_outliers.append(index)
    print("-----------------------------")
    print(f"Total global outliers: {len(obvious_outliers)}")
    return outliers, obvious_outliers
    
    
outliers, obvious_outliers = identify_outliers(df, metric_features, lower_lim, upper_lim)
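The same checks can be written without explicit Python loops, which is usually much faster on a dataset of this size. The following is a sketch on toy data (hypothetical values), not a drop-in replacement for `identify_outliers`: it builds one boolean frame and derives the per-feature counts, the rows that are outliers in every feature, and the rows that are outliers in at least one feature from it.

```python
import pandas as pd

# Toy data (hypothetical): row 4 is extreme in both columns, row 0 only in "b"
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [-50, 11, 12, 13, 500],
})

# IQR fences per column
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
lower_lim, upper_lim = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean frame: True where a value lies outside its column's fences
# (comparisons with NaN are False, so missing values are never flagged)
is_outlier = toy.lt(lower_lim, axis=1) | toy.gt(upper_lim, axis=1)

outliers_per_feature = is_outlier.sum()
rows_all_outliers = toy.index[is_outlier.all(axis=1)].tolist()
rows_any_outlier = toy.index[is_outlier.any(axis=1)].tolist()

print(outliers_per_feature)
print("Outlier in every feature:", rows_all_outliers)
print("Outlier in at least one feature:", rows_any_outlier)
```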

Conclusion: There is no observation in which all features are outliers. This is expected, since 'HR_0', 'last_order' and 'first_order' have no outliers at all.

Next, we check whether any observation is an outlier in every feature except these three.

In [None]:
metric_features_test = metric_features.drop(['HR_0', 'last_order', 'first_order'])
outliers, obvious_outliers = identify_outliers(df, metric_features_test, lower_lim, upper_lim)

Conclusion: Even without these three features, there is no observation that is an outlier in all remaining features.

Observations in which at least one feature is an outlier:

In [None]:
filters_iqr = []                                            
for metric in metric_features:
    llim = lower_lim[metric]
    ulim = upper_lim[metric]
    filters_iqr.append(df[metric].between(llim, ulim, inclusive='neither'))

filters_iqr_all = pd.concat(filters_iqr, axis=1).all(axis=1)

In [None]:
filters_iqr_all

In [None]:
# Number of observations with at least one feature considered an outlier
# (filters_iqr_all is True for rows that are inside the limits in every feature)
n_outliers = len(df[~filters_iqr_all])
percentage_outliers = n_outliers / len(df) * 100
percentage_data_kept = round(100 - percentage_outliers, 5)
print(f"Percentage of observations with at least one feature considered an outlier: {percentage_outliers}%")
print(f"Percentage of data kept after removing outliers: {percentage_data_kept}%")

Conclusion: Every observation is an outlier in at least one feature, so the IQR rule cannot be used to remove outliers here.

2. MANUAL METHOD:

In [None]:
filters_manual1 = (
                (df["customer_age"] <= 70) #??
                &
                (df["vendor_count"] <= 30) #??
                &
                (df["product_count"] <= 100)                            
                &
                (df["is_chain"] <= 55)                                  
                &
                (df["CUI_American"] <= 150)
                &
                (df["CUI_Asian"] <= 450)                            
                &
                (df["CUI_Beverages"] <= 150)  
                &
                (df["CUI_Cafe"] <= 140)
                &
                (df["CUI_Chicken Dishes"] <= 70)                            
                &
                (df["CUI_Chinese"] <= 200)                                  
                &
                (df["CUI_Desserts"] <= 80)
                &
                (df["CUI_Healthy"] <= 150)                            
                &
                (df["CUI_Indian"] <= 150)
                &
                (df["CUI_Italian"] <= 200)
                &
                (df["CUI_Japanese"] <= 150)
                &
                (df["CUI_Noodle Dishes"] <= 90)
                &
                (df["CUI_OTHER"] <= 180)
                &
                (df["CUI_Street Food / Snacks"] <= 210)
                &
                (df["CUI_Thai"] <= 70)
                &
                (df["DOW_0"] <= 12)
                &
                (df["DOW_1"] <= 14)
                &
                (df["DOW_2"] <= 10) #(??)
                &
                (df["DOW_3"] <= 12)
                &
                (df["DOW_4"] <=12) #(??)
                &
                (df["DOW_5"] <= 12) #(??)
                &
                (df["DOW_6"] <= 15)
                &
                (df["HR_1"] <= 10)                                  
                &
                (df["HR_2"] < 8)
                &
                (df["HR_3"] <= 8) #(??)                            
                &
                (df["HR_4"] <= 8)                       
                &
                (df["HR_5"] <= 5)                                  
                &
                (df["HR_6"] <= 6)
                &
                (df["HR_7"] <= 10)                            
                &
                (df["HR_8"] <= 19)  
                &
                (df["HR_9"] <= 12)
                &
                (df["HR_10"] < 15)                            
                &
                (df["HR_11"] <= 15)                                  
                &
                (df["HR_12"] <= 15)
                &
                (df["HR_13"] <= 10)                            
                &
                (df["HR_14"] <= 10) 
                &
                (df["HR_15"] <= 12)  
                &
                (df["HR_16"] <= 15)
                &
                (df["HR_17"] <= 16)
                &
                (df["HR_18"] <= 15)                                  
                &
                (df["HR_19"] <= 15)
                &
                (df["HR_20"] <= 15)                            
                &
                (df["HR_21"] <= 7)
                &
                (df["HR_22"] <= 8)
                &
                (df["HR_23"] <= 7)    
)                     

df_out_man1 = df[filters_manual1]

In [None]:
print('Percentage of data kept after removing outliers:', 100*(np.round(df_out_man1.shape[0] / df.shape[0], decimals=5)))
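A long chain of `&` conditions is hard to compare across filter versions. One alternative, sketched here on toy data (hypothetical values), is to keep the upper bounds in a dictionary and apply them in one vectorised comparison. The two bounds shown are taken from the first manual filter above. The sketch assumes every bound is inclusive (`<=`), whereas the filters in this notebook mix `<` and `<=`.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "customer_age": [23, 45, 81, 30],
    "vendor_count": [2, 40, 3, 5],
})

# Inclusive upper bound per feature
max_allowed = {"customer_age": 70, "vendor_count": 30}
bounds = pd.Series(max_allowed)

# Each column is compared with its own bound; a row is kept only if all pass
keep = (toy[bounds.index] <= bounds).all(axis=1)
toy_filtered = toy[keep]

pct_kept = 100 * len(toy_filtered) / len(toy)
print(f"Percentage of data kept after removing outliers: {pct_kept}%")
```

With this layout, a second or third set of thresholds is just another dictionary, and the differences between versions are easy to see.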

In [None]:
filters_manual2 = (
                (df["customer_age"] <= 70) 
                &
                (df["vendor_count"] <= 35) 
                &
                (df["product_count"] <= 80)                          
                &
                (df["is_chain"] <= 40)                                 
                &
                (df["CUI_American"] <= 100)
                &
                (df["CUI_Asian"] <= 300) #200                           
                &
                (df["CUI_Beverages"] <= 100)  
                &
                (df["CUI_Cafe"] <= 120)  
                &
                (df["CUI_Chicken Dishes"] <= 55) #50                           
                &
                (df["CUI_Chinese"] <= 150)                                
                &
                (df["CUI_Desserts"] <= 70) #100
                &
                (df["CUI_Healthy"] <= 120)                           
                &
                (df["CUI_Indian"] <= 120)  
                &
                (df["CUI_Italian"] <= 150)
                &
                (df["CUI_Japanese"] <= 120) 
                &
                (df["CUI_Noodle Dishes"] <= 70)
                &
                (df["CUI_OTHER"] <= 120)
                &
                (df["CUI_Street Food / Snacks"] <= 200)
                &
                (df["CUI_Thai"] <= 60)
                &
                (df["DOW_0"] <= 12) #8
                &
                (df["DOW_1"] <= 12) #8
                &
                (df["DOW_2"] <= 12) #8
                &
                (df["DOW_3"] <= 12) #8
                &
                (df["DOW_4"] <=12) #8
                &
                (df["DOW_5"] <= 13) #8
                &
                (df["DOW_6"] <= 13) #10
                &
                (df["HR_1"] <= 8)                                 
                &
                (df["HR_2"] < 8) 
                &
                (df["HR_3"] <= 8)                             
                &
                (df["HR_4"] < 10)                       
                &
                (df["HR_5"] < 5)                                  
                &
                (df["HR_6"] <= 5)
                &
                (df["HR_7"] <= 5)                            
                &
                (df["HR_8"] <= 15)  #10
                &
                (df["HR_9"] <= 13) #10
                &
                (df["HR_10"] < 15) #10                         
                &
                (df["HR_11"] <= 15)  #10                                
                &
                (df["HR_12"] < 10) #10
                &
                (df["HR_13"] < 8)  #6                          
                &
                (df["HR_14"] < 8) 
                &
                (df["HR_15"] < 10)  
                &
                (df["HR_16"] <= 15) #10
                &
                (df["HR_17"] < 15) #10
                &
                (df["HR_18"] < 12) #10                                 
                &
                (df["HR_19"] < 15) 
                &
                (df["HR_20"] < 10)                            
                &
                (df["HR_21"] < 6)
                &
                (df["HR_22"] < 8)
                &
                (df["HR_23"] < 6)    
)                     

df_out_man2 = df[filters_manual2]

In [None]:
print('Percentage of data kept after removing outliers:', 100*(np.round(df_out_man2.shape[0] / df.shape[0], decimals=5)))

In [None]:
filters_manual3 = (
                (df["customer_age"] <= 70) 
                &
                (df["vendor_count"] <= 35) 
                &
                (df["product_count"] <= 80)                          
                &
                (df["is_chain"] <= 40)                                 
                &
                (df["CUI_American"] <= 100)
                &
                (df["CUI_Asian"] <= 200) #200                           
                &
                (df["CUI_Beverages"] <= 100)  
                &
                (df["CUI_Cafe"] <= 120)  
                &
                (df["CUI_Chicken Dishes"] <= 50) #50                           
                &
                (df["CUI_Chinese"] <= 150)                                
                &
                (df["CUI_Desserts"] <= 100) #100
                &
                (df["CUI_Healthy"] <= 120)                           
                &
                (df["CUI_Indian"] <= 120)  
                &
                (df["CUI_Italian"] <= 150)
                &
                (df["CUI_Japanese"] <= 120) 
                &
                (df["CUI_Noodle Dishes"] <= 70)
                &
                (df["CUI_OTHER"] <= 120)
                &
                (df["CUI_Street Food / Snacks"] <= 200)
                &
                (df["CUI_Thai"] <= 60)
                &
                (df["DOW_0"] <= 8) #8
                &
                (df["DOW_1"] <= 8) #8
                &
                (df["DOW_2"] <= 8) #8
                &
                (df["DOW_3"] <= 8) #8
                &
                (df["DOW_4"] <=8) #8
                &
                (df["DOW_5"] <=8) #8
                &
                (df["DOW_6"] <= 10) #10
                &
                (df["HR_1"] <= 8)                                 
                &
                (df["HR_2"] < 8) 
                &
                (df["HR_3"] <= 8)                             
                &
                (df["HR_4"] < 10)                       
                &
                (df["HR_5"] < 5)                                  
                &
                (df["HR_6"] <= 5)
                &
                (df["HR_7"] <= 5)                            
                &
                (df["HR_8"] <= 10)  #10
                &
                (df["HR_9"] <= 10) #10
                &
                (df["HR_10"] < 10) #10                         
                &
                (df["HR_11"] <= 10)  #10                                
                &
                (df["HR_12"] < 10) #10
                &
                (df["HR_13"] < 6)  #6                          
                &
                (df["HR_14"] < 8) 
                &
                (df["HR_15"] < 10)  
                &
                (df["HR_16"] <= 10) #10
                &
                (df["HR_17"] < 10) #10
                &
                (df["HR_18"] < 10) #10                                 
                &
                (df["HR_19"] < 15) 
                &
                (df["HR_20"] < 10)                            
                &
                (df["HR_21"] < 6)
                &
                (df["HR_22"] < 8)
                &
                (df["HR_23"] < 6)    
)                     

df_out_man3 = df[filters_manual3]

In [None]:
# Percentage of observations with at least one feature considered an outlier
percentage_data_kept_manual = 100*(np.round(df_out_man3.shape[0] / df.shape[0], decimals=5))
percentage_outliers_manual = round(100 - percentage_data_kept_manual, 5)
print(f"Percentage of observations with at least one feature considered an outlier: {percentage_outliers_manual}%")
print(f"Percentage of data kept after removing outliers: {percentage_data_kept_manual}%")

#### Remove outliers combining automatic and manual approaches:

In [31]:
df = df[(filters_iqr_all | filters_manual3)]


### <span style="color:yellow"> MARTA - END </span> 

## <span style="color:salmon"> 3. New Features  </span> 
Creating new features can significantly enhance our analysis by providing additional insights and improving the performance of models.

#### <span style="color:salmon"> 3.1 Customer Lifetime  </span>
Interval of customer activity: the number of days between the customer's first and last order.

In [None]:
df['lifetime_days'] = df['last_order'] - df['first_order']
df['lifetime_days'].dtype

#### <span style="color:salmon"> 3.2 Most frequent order day of the week  </span>
Indicates the days of the week on which the customer placed the most orders.

In [None]:
dows = ['DOW_1', 'DOW_2', 'DOW_3', 'DOW_4', 'DOW_5', 'DOW_6', 'DOW_0'] # ordered from Monday to Sunday rather than Sunday to Saturday
def frequent_days(customer):
    max_value = customer[dows].max() # Day with the most orders
    result = []
    for col in dows: # Checks if there is more than one day with max_value
        if customer[col] == max_value:
            result.append(col)
    return result

df['preferred_order_days'] = df.apply(frequent_days, axis=1)
df['preferred_order_days'].dtype # obj 
all(isinstance(i, list) for i in df['preferred_order_days']) # confirm that all values are lists

In [None]:
df["preferred_order_days"].head()
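
The tie handling above can also be expressed without a row-wise `apply`. The following is a minimal sketch on hypothetical toy data (the frame `toy` and its values are invented for illustration); it marks every day whose count equals the row maximum and collects the matching column names.

```python
import pandas as pd

# Column order follows the notebook: Monday (DOW_1) to Sunday (DOW_0)
dows = ['DOW_1', 'DOW_2', 'DOW_3', 'DOW_4', 'DOW_5', 'DOW_6', 'DOW_0']

# Hypothetical toy data: three customers, order counts per day of the week
toy = pd.DataFrame(
    [[2, 0, 2, 0, 0, 1, 0],   # tie between DOW_1 and DOW_3
     [0, 0, 0, 0, 0, 0, 3],   # single preferred day
     [1, 1, 1, 1, 1, 1, 1]],  # all days tied
    columns=dows,
)

# True where a day's count equals that customer's maximum
is_max = toy[dows].eq(toy[dows].max(axis=1), axis=0)

# Collect the names of the matching days for each customer
preferred = [[day for day, hit in zip(dows, row) if hit] for row in is_max.to_numpy()]
print(preferred)
```

On large data frames this comparison is usually faster than calling a Python function once per row, although the final list building is still a Python loop.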

#### <span style="color:salmon"> 3.3 Most frequent part of the day  </span>
6h-12h --> Morning (Breakfast)  
12h-18h --> Afternoon (Lunch)  
18h-00h --> Evening (Dinner)  
00h-6h --> Night

In [None]:
def part_of_the_day(hour):
    if 6 <= hour < 12:
        return '06h-12h'
    elif 12 <= hour < 18:
        return '12h-18h'
    elif 18 <= hour < 24:
        return '18h-00h'
    else:  # 0 <= hour < 6
        return '00h-06h'

def frequent_hours(customer):
    part_counts = {
        '06h-12h': 0,
        '12h-18h': 0,
        '18h-00h': 0,
        '00h-06h': 0}
    for hour in range(24):
        num_orders = customer[f'HR_{hour}']
        if pd.isna(num_orders): # Ignore NaN
            continue
        part_of_day = part_of_the_day(hour)
        part_counts[part_of_day] += num_orders

    # Part of the day with the highest number of orders
    max_value = 0
    result = []
    for part, count in part_counts.items():
        if count > max_value:
            max_value = count  
            result = [part] 
        elif count == max_value:
            result.append(part) 
    return result
    
df['preferred_part_of_day'] = df.apply(frequent_hours, axis=1)
df['preferred_part_of_day'].dtype # obj 
all(isinstance(i, list) for i in df['preferred_part_of_day']) # confirm that all values are lists

In [None]:
df["preferred_part_of_day"].head()
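
As an illustration of the aggregation step, the hour columns can be summed into the four parts of the day with a column-to-group mapping. This is a sketch on hypothetical toy data, reusing the same hour boundaries as `part_of_the_day`; the frame `toy` and the names `hour_to_part` and `part_totals` are assumptions made for the example.

```python
import pandas as pd

hours = [f'HR_{h}' for h in range(24)]

def part_of_the_day(hour):
    # Same boundaries as in the notebook
    if 6 <= hour < 12:
        return '06h-12h'
    elif 12 <= hour < 18:
        return '12h-18h'
    elif 18 <= hour < 24:
        return '18h-00h'
    return '00h-06h'

# Hypothetical toy data: two customers with zero orders everywhere by default
toy = pd.DataFrame(0, index=[0, 1], columns=hours)
toy.loc[0, ['HR_8', 'HR_9']] = [2, 1]   # 3 morning orders
toy.loc[0, 'HR_19'] = 3                 # 3 evening orders (tie with morning)
toy.loc[1, 'HR_13'] = 4                 # 4 afternoon orders

# Map each hour column to its part of the day and sum within each part
hour_to_part = {col: part_of_the_day(int(col.split('_')[1])) for col in hours}
part_totals = toy.T.groupby(hour_to_part).sum().T

# Keep every part whose total equals the customer's maximum
is_max = part_totals.eq(part_totals.max(axis=1), axis=0)
preferred = [[p for p, hit in zip(part_totals.columns, row) if hit]
             for row in is_max.to_numpy()]
print(preferred)
```

Note that a customer with no orders at all ties across all four parts, which matches the behaviour of `frequent_hours` when every count is zero.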

#### <span style="color:salmon"> 3.4 Total monetary units spent </span>
Sum of the monetary units spent across all cuisine types.

In [None]:
cuisine = df.filter(like='CUI_').columns.tolist() # Types of cuisine
df['total_expenses'] = df[cuisine].sum(axis=1)
df['total_expenses'].dtype

#### <span style="color:salmon"> 3.5 Average monetary units per product </span>
Average monetary units spent per product.

In [None]:
df['avg_per_product'] = pd.to_numeric(df['total_expenses'] / df['product_count'].replace(0, pd.NA), errors='coerce')
df['avg_per_product'].dtype

#### <span style="color:salmon"> 3.6 Average monetary units per order </span>
Average monetary units spent per order.

In [None]:
df['avg_per_order'] = pd.to_numeric(df['total_expenses'] / df[dows].sum(axis=1).replace(0, pd.NA), errors='coerce')
df['avg_per_order'].dtype

#### <span style="color:salmon"> 3.7 Average order size </span>
Average number of products per order, which helps identify customers who place larger orders.

In [None]:
df['avg_order_size'] = pd.to_numeric(df['product_count'] / df[dows].sum(axis=1).replace(0, pd.NA), errors='coerce')
df['avg_order_size'].dtype

#### <span style="color:salmon"> 3.8 Culinary profile </span>
Proportion of the available cuisine types that the customer ordered from. A higher value indicates a more diverse culinary profile.

In [None]:
total_cuisine = len(cuisine)

df['culinary_variety'] = round((df[cuisine].gt(0).sum(axis=1) / total_cuisine), 5)
df['culinary_variety'].dtype

#### <span style="color:salmon"> 3.9 Loyalty to chain restaurants </span>
Proportion of the customer's orders placed at chain restaurants. A higher value indicates that the customer orders mostly from chains; a lower value indicates a preference for non-chain restaurants.

In [None]:
df['chain_preference'] = pd.to_numeric(df['is_chain'] / df[dows].sum(axis=1).replace(0, pd.NA), errors='coerce')
df['chain_preference'].dtype

#### <span style="color:salmon"> 3.10 Loyalty to vendors </span>
Number of unique vendors divided by the number of orders. A higher value indicates that the customer prefers to try different restaurants; a lower value indicates that the customer tends to be loyal to specific restaurants.

In [None]:
df['loyalty_to_venders'] = pd.to_numeric(df['vendor_count'] / df[dows].sum(axis=1).replace(0, pd.NA), errors='coerce')
df['loyalty_to_venders'].dtype

To see all the new features that we added:

In [None]:
df.head(10)
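
The ratio features above all guard against division by zero by replacing a zero denominator with a missing value. A minimal sketch of that pattern as a reusable helper is shown below; `safe_ratio` and the toy frame are hypothetical names introduced for illustration, and `np.nan` is used in place of `pd.NA` so that the result stays a plain float column.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Divide two Series, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

# Hypothetical toy data: the second and third customers have no orders
toy = pd.DataFrame({'total_expenses': [30.0, 12.0, 0.0], 'orders': [3, 0, 0]})
toy['avg_per_order'] = safe_ratio(toy['total_expenses'], toy['orders'])
print(toy)
```

Because the result is already `float64`, no additional `pd.to_numeric(..., errors='coerce')` step is needed in this variant.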

### <span style="color:yellow"> MARTA - START </span> 

#### New metric and non metric features:

In [None]:
new_metric_features = ['lifetime_days', 'total_expenses', 'avg_per_product', 'avg_per_order', 'avg_order_size', 'culinary_variety', 'chain_preference', 'loyalty_to_venders']
new_non_metric_features = ['preferred_order_days', 'preferred_part_of_day']
new_features = new_metric_features + new_non_metric_features

### Descriptives:

In [None]:
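# The list-valued columns are cast to str first, because lists are unhashable and cannot be counted by describe()
df[new_features].astype({col: str for col in new_non_metric_features}).describe(include="all").T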

### Missing values:

In [None]:
missing_rows = df[new_features].isna().any(axis=1)

In [None]:
# Percentage of missing values in each variable:
missing_percentage = ((df[new_features].isnull().sum() / len(df)) * 100).sort_values(ascending=False)
missing_percentage = missing_percentage[missing_percentage > 0]

print(f"Percentage of missing values:\n {missing_percentage}")

### VISUALIZING NEW METRIC FEATURES (NUMERICAL):

In [None]:
sns.set()

# Set up the figure and axes
rows, cols = 12, 5 
fig, axes = plt.subplots(rows, cols, figsize=(25, 30))  

# Plot each feature
for ax, feat in zip(axes.flatten(), new_metric_features):
    ax.hist(df[feat], bins=20, color='skyblue', edgecolor='black')  
    ax.set_title(feat, fontsize=10, y=-0.2)  

# Hide unused subplots:
for ax in axes.flatten()[len(new_metric_features):]:
    ax.set_visible(False)

# Set a global title and adjust layout
plt.suptitle("Numeric Variables' Histograms", fontsize=16, y=1.02)  
plt.tight_layout()
plt.show()

### 1. VISUALIZING NEW NON-METRIC FEATURES (CATEGORICAL):

In [None]:
for column in new_non_metric_features:
    # These features hold lists, which are unhashable, so join each list into a single label
    labels = df[column].apply(lambda v: ', '.join(v) if isinstance(v, list) else v)

    # Keep the 20 most frequent categories, sorted by ascending count
    top_categories = labels.value_counts().head(20)
    top_categories_sorted = top_categories.sort_values(ascending=True)
    data_filtered = labels[labels.isin(top_categories_sorted.index)].to_frame(name=column)

    plt.figure(figsize=(10, 5))
    sns.countplot(data=data_filtered, 
                  x=column, 
                  order=top_categories_sorted.index,  
                  palette='tab20b')

    plt.xlabel(column, fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.title(f'Top 20 Categories in {column}')

    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

### Treat missing values in new features:

In [None]:
# Percentage of missing values in each variable:
missing_percentage = ((df[new_features].isnull().sum() / len(df)) * 100).sort_values(ascending=False)
missing_percentage = missing_percentage[missing_percentage > 0]

print(f"Percentage of missing values:\n {missing_percentage}")

In [None]:
# Fill numerical missing values with median:
median_variables = ['avg_per_product', 'avg_per_order', 'avg_order_size', 'chain_preference', 'loyalty_to_venders']
for column in median_variables:
    median_value = df[column].median()
    df[column] = df[column].fillna(median_value)

In [None]:
# Percentage of missing values in each variable:
missing_percentage = ((df[new_features].isnull().sum() / len(df)) * 100).sort_values(ascending=False)
missing_percentage = missing_percentage[missing_percentage > 0]

print(f"Percentage of missing values:\n {missing_percentage}")

In [None]:
# Store the index of rows with missing values in new_features
missing_rows_index = df[missing_rows].index

# Filter the DataFrame using the index
df_missing = df.loc[missing_rows_index]
df_missing

### Outliers - New features:

In [None]:
sns.set()


selected_features = new_metric_features

# Set up the figure and axes
rows, cols = 3, 3  #
fig, axes = plt.subplots(rows, cols, figsize=(15, 10)) 

# Flatten axes for iteration
axes = axes.flatten()

# Plot each feature as a box plot
for i, (ax, feat) in enumerate(zip(axes, selected_features)):
    sns.boxplot(data=df, x=feat, ax=ax, color='skyblue')  
    ax.set_title(feat, fontsize=10)  
    
# Hide any unused subplots:
for ax in axes[len(selected_features):]:
    ax.set_visible(False)

# Set a global title and adjust layout
plt.suptitle("Selected Numeric Variables' Box Plots", fontsize=16, y=1.02)  
plt.tight_layout()
plt.show()

### Outlier Removal:

In [None]:
# Compute the interquartile range
q1 = df[new_metric_features].quantile(0.25)
q3 = df[new_metric_features].quantile(0.75)
iqr = q3 - q1

# Compute the limits:
lower_lim = q1 - (1.5 * iqr)
upper_lim = q3 + (1.5 * iqr)

for feature in new_metric_features:
    print(f"{feature:<25}  Lower Limit: {lower_lim[feature].round(5):>10}      Upper Limit: {upper_lim[feature].round(5):>10}")
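
To make the limits concrete, here is a small worked example of the same IQR rule on a hypothetical toy series (the values are invented for illustration). An observation is kept only if it lies strictly between the two limits.

```python
import pandas as pd

# Hypothetical toy feature with one extreme value
values = pd.Series([1, 2, 3, 4, 100], name='toy_feature')

# Interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Limits at 1.5 times the IQR below Q1 and above Q3
lower_lim = q1 - 1.5 * iqr
upper_lim = q3 + 1.5 * iqr

# True for observations strictly inside the limits
inside = values.between(lower_lim, upper_lim, inclusive='neither')
print(lower_lim, upper_lim)
print(values[~inside].tolist())
```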

Observations in which all features are outliers:

In [None]:
outliers, obvious_outliers = identify_outliers(df, new_metric_features, lower_lim, upper_lim)

Conclusion: There is no observation in which all new features are outliers. There are no outliers in 'lifetime_days' and 'chain_preference'.

Check whether any observation is an outlier in every remaining feature, excluding these two.

In [None]:
new_metric_features_test = ['total_expenses', 'avg_per_product', 'avg_per_order', 'avg_order_size', 'culinary_variety', 'loyalty_to_venders']
outliers, obvious_outliers = identify_outliers(df, new_metric_features_test, lower_lim, upper_lim)

Conclusion: There is no observation with outliers in all new features.

Observations in which at least one new feature is an outlier:

In [None]:
new_filters_iqr = []                                            
for metric in new_metric_features:
    llim = lower_lim[metric]
    ulim = upper_lim[metric]
    new_filters_iqr.append(df[metric].between(llim, ulim, inclusive='neither'))

new_filters_iqr_all = pd.concat(new_filters_iqr, axis=1).all(axis=1)

In [None]:
new_filters_iqr_all

In [None]:
# Percentage of observations with at least one feature considered an outlier
new_features_percentage_data_kept = len(df[new_filters_iqr_all])/len(df)*100
new_features_percentage_outliers = round(100 - new_features_percentage_data_kept, 5)
print(f"Percentage of observations with at least one feature considered an outlier: {new_features_percentage_outliers}%")
print(f"Percentage of data kept after removing outliers: {new_features_percentage_data_kept}%")

2. MANUAL METHOD:

In [32]:
filters_manual_new_features = (
                (df["total_expenses"] <= 350) #
                &
                (df["avg_per_product"] <= 22) #??
                &
                (df["avg_per_order"] <= 70)  #50                          
                &
                (df["avg_order_size"] <= 4)                                  
                &
                (df["culinary_variety"] <= 0.7)
                &
                (df["loyalty_to_venders"] >= 0.1)
)                     

df_out_man_new_features = df[filters_manual_new_features]





In [None]:
# Percentage of observations with at least one feature considered an outlier
new_features_percentage_data_kept_manual = 100*(np.round(df_out_man_new_features.shape[0] / df.shape[0], decimals=5))
new_features_percentage_outliers_manual = round(100 - new_features_percentage_data_kept_manual, 5)
print(f"Percentage of observations with at least one feature considered an outlier: {new_features_percentage_outliers_manual}%")
print(f"Percentage of data kept after removing outliers: {new_features_percentage_data_kept_manual}%")

Remove outliers combining automatic and manual methods:

In [None]:
df = df[(new_filters_iqr_all | filters_manual_new_features)]
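
Combining the two boolean masks with `|` keeps a row when it passes either filter, so a row is dropped only if both methods flag it. The sketch below contrasts this with `&` on hypothetical toy data; the thresholds are invented for illustration and are not the ones used above.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'total_expenses': [20, 400, 90, 500]})

iqr_ok = toy['total_expenses'].between(0, 100)   # hypothetical IQR limits
manual_ok = toy['total_expenses'] <= 450         # hypothetical manual threshold

kept_either = toy[iqr_ok | manual_ok]   # dropped only if BOTH methods flag the row
kept_both = toy[iqr_ok & manual_ok]     # dropped if EITHER method flags the row

print(len(kept_either), len(kept_both))
```

The `|` combination is the more conservative choice in terms of data loss, because the manual thresholds can rescue rows that the IQR rule alone would remove.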

## Visualize all features (after preprocessing):

In [None]:
all_metric_features = [
    'customer_age', 'vendor_count', 'product_count', 'is_chain', 'first_order', 
    'last_order', 'CUI_American', 'CUI_Asian', 'CUI_Beverages', 'CUI_Cafe', 
    'CUI_Chicken Dishes', 'CUI_Chinese', 'CUI_Desserts', 'CUI_Healthy', 
    'CUI_Indian', 'CUI_Italian', 'CUI_Japanese', 'CUI_Noodle Dishes', 
    'CUI_OTHER', 'CUI_Street Food / Snacks', 'CUI_Thai', 'DOW_0', 'DOW_1', 
    'DOW_2', 'DOW_3', 'DOW_4', 'DOW_5', 'DOW_6', 'HR_0', 'HR_1', 'HR_2', 
    'HR_3', 'HR_4', 'HR_5', 'HR_6', 'HR_7', 'HR_8', 'HR_9', 'HR_10', 'HR_11', 
    'HR_12', 'HR_13', 'HR_14', 'HR_15', 'HR_16', 'HR_17', 'HR_18', 'HR_19', 
    'HR_20', 'HR_21', 'HR_22', 'HR_23', 'lifetime_days', 'total_expenses', 
    'avg_per_product', 'avg_per_order', 'avg_order_size', 'culinary_variety', 
    'chain_preference', 'loyalty_to_venders'
]

all_non_metric_features = [
    'customer_region', 'last_promo', 'payment_method', 
    'preferred_order_days', 'preferred_part_of_day'
]

len(all_metric_features)

### Numeric features:

In [None]:
sns.set()

# Set up the figure and axes
rows, cols = 12, 5  
fig, axes = plt.subplots(rows, cols, figsize=(25, 30)) 

# Plot each feature
for ax, feat in zip(axes.flatten(), all_metric_features):
    ax.hist(df[feat], bins=20, color='skyblue', edgecolor='black')  
    ax.set_title(feat, fontsize=10, y=-0.2)  

# Hide unused subplots:
for ax in axes.flatten()[len(all_metric_features):]:
    ax.set_visible(False)

# Set a global title and adjust layout 
plt.suptitle("Numeric Variables' Histograms", fontsize=16, y=1.02)  
plt.tight_layout() 
plt.show()

### Categorical features:

In [None]:
for column in all_non_metric_features:
    # Some features hold lists, which are unhashable, so join each list into a single label
    labels = df[column].apply(lambda v: ', '.join(v) if isinstance(v, list) else v)

    # Keep the 20 most frequent categories, sorted by ascending count
    top_categories = labels.value_counts().head(20)
    top_categories_sorted = top_categories.sort_values(ascending=True)
    data_filtered = labels[labels.isin(top_categories_sorted.index)].to_frame(name=column)

    plt.figure(figsize=(10, 5))
    sns.countplot(data=data_filtered, 
                  x=column, 
                  order=top_categories_sorted.index,  
                  palette='tab20b')

    plt.xlabel(column, fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.title(f'Top 20 Categories in {column}')

    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

### <span style="color:yellow"> MARTA - END </span> 

## <span style="color:salmon"> 4. Visualizations and relationships between features </span> 
In order to explore the relationships among the features, we looked for trends and patterns.

#### <span style="color:salmon"> 4.1 Correlation of all numerical features  </span> 
We keep only correlations greater than 0.7 and smaller than 1.0, which excludes the correlation of each feature with itself:

In [None]:
# correlations of all numerical features
corr_matrix = df.select_dtypes(include=['int64', 'float64']).corr(method='pearson')

# transforming them into a list
corr_pairs = corr_matrix.unstack()   

# filtering for correlations only > 0.7 and < 1.0 and dropping duplicates
corr_pairs[(corr_pairs > 0.7) & (corr_pairs < 1.0)].drop_duplicates() 

**Conclusions:** 
+ vendor_count and product_count are highly correlated, probably because more orders tend to involve more distinct vendors.
+ vendor_count and culinary_variety are also highly correlated, which is logical because ordering from more vendors usually means ordering from more cuisines as well.
+ product_count is highly correlated with total_expenses, which is logical because more products ordered lead to higher expenses.
+ The correlations with is_chain are probably incidental, since is_chain and vendor_count/product_count take similar values (mostly between 1 and 3).
+ avg_per_product correlates with avg_per_order, which is logical.
+ avg_per_order correlates slightly with avg_order_size, which is also logical because more expensive orders usually contain more products.
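
One caveat of filtering with `drop_duplicates()` is that it removes entries by value, so two different feature pairs that happen to share exactly the same coefficient would collapse into one. A possible alternative, sketched here on hypothetical random data, is to keep only the upper triangle of the correlation matrix so that each pair appears once and the diagonal is excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' is a noisy copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'a': x,
    'b': x + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr(method='pearson')

# Boolean mask that is True strictly above the diagonal
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep each unordered pair once, then filter by the threshold
pairs = corr.where(upper_mask).stack().dropna()
high = pairs[pairs > 0.7]
print(high)
```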

#### <span style="color:salmon"> 4.2 Visualization of total Orders placed per Hour </span> 
The total number of orders placed in each hour:

In [None]:
# creating a list with all names of the HR_x columns
HRs = [f'HR_{i}' for i in range(24)] 

# plotting all the hours and its total orders placed
df[HRs].sum().plot(kind='bar', figsize=(9, 5), width=0.95, color='darkblue') 

plt.xlabel('Hour')
plt.ylabel('Orders placed')
plt.title('total Orders placed per Hour')
plt.show()

**Conclusions:** 
+ Most orders are placed between 10h and 12h and between 16h and 18h.

#### <span style="color:salmon"> 4.3 Visualization of total Orders placed per week day </span> 
The total number of orders placed on each day of the week:

In [None]:
# plotting the data
df[dows].sum().plot(kind='bar', figsize=(8, 4), width=0.98, color='darkblue')   

plt.title('Total Orders placed per week day')
plt.grid(axis='y', linestyle='-', linewidth=0, color='gray')
plt.xticks(ticks= [0,1,2,3,4,5,6], labels=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], rotation=0)
plt.ylabel('Orders Placed')
plt.grid(axis='y')
plt.show()

**Conclusions:** 
+ Most orders are placed on Thursday and Saturday, and the fewest on Sunday.

#### <span style="color:salmon"> 4.4 Percentage of each payment_method for each age_group </span>
First, we create a column that groups customer_age into age bands:

In [None]:
df['customer_age'].sort_values().unique()
# classifying age into groups: 15-17, 18-22, etc.
binsAge = [15, 18, 23, 29, 36, 50, 81]      
labelsAge = ['15-17', '18-22', '23-28', '29-35', '36-49', '50+']
df['customer_age_group'] = pd.cut(df['customer_age'], bins=binsAge, labels=labelsAge, right=False)
warnings.filterwarnings('ignore')
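
The effect of `right=False` is that each bin includes its left edge and excludes its right edge. The sketch below applies the same bins and labels to a few hypothetical ages to show where the boundary values land.

```python
import pandas as pd

bins = [15, 18, 23, 29, 36, 50, 81]
labels = ['15-17', '18-22', '23-28', '29-35', '36-49', '50+']

# Hypothetical ages chosen to sit on the bin edges
ages = pd.Series([15, 17, 18, 28, 29, 50, 80, 81])

# right=False gives left-closed bins: [15, 18), [18, 23), ..., [50, 81)
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(pd.DataFrame({'age': ages, 'group': groups}))
```

An age of 81 or more would fall outside the last bin and receive a missing group, so the upper edge needs to stay above the maximum age present in the data.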

Then, we created three plots: 
+ Percentage of use of DIGI by age
+ Percentage of use of CASH by age
+ Percentage of use of CARD by age

In [None]:
#Plot for Payment method used based on AGE
warnings.filterwarnings('ignore')
Age_CASH = df[df['payment_method'] == 'CASH'].groupby('customer_age_group')['payment_method'].count().div(df.groupby('customer_age_group')['payment_method'].count())
Age_DIGI = df[df['payment_method'] == 'DIGI'].groupby('customer_age_group')['payment_method'].count().div(df.groupby('customer_age_group')['payment_method'].count())
Age_CARD = df[df['payment_method'] == 'CARD'].groupby('customer_age_group')['payment_method'].count().div(df.groupby('customer_age_group')['payment_method'].count())
fig, iplot = plt.subplots(1, 3, figsize=(15, 3))

Age_DIGI.plot(kind='bar', color='darkblue', width=0.95, ax=iplot[0])
iplot[0].set_ylabel('Percentage of use of Paymentmethod')
iplot[0].set_xlabel('AgeGroup')
iplot[0].set_xticks(range(len(Age_CASH.index)))
iplot[0].set_xticklabels(Age_CASH.index, rotation=0)
iplot[0].set_title('Percentage of use of DIGI by age')

Age_CASH.plot(kind='bar', color='blue', width=0.95, ax=iplot[1])
iplot[1].set_ylabel('Percentage of use of Paymentmethod')
iplot[1].set_xlabel('AgeGroup')
iplot[1].set_xticks(range(len(Age_CASH.index)))
iplot[1].set_xticklabels(Age_CASH.index, rotation=0) 
iplot[1].set_title('Percentage of use of CASH by age')

Age_CARD.plot(kind='bar', color='lightblue', width=0.95, ax=iplot[2])
iplot[2].set_ylabel('Percentage of use of Paymentmethod')
iplot[2].set_xlabel('AgeGroup')
iplot[2].set_xticks(range(len(Age_CASH.index)))
iplot[2].set_xticklabels(Age_CASH.index, rotation=0) 
iplot[2].set_title('Percentage of use of CARD by age')

**Conclusions:** 
+ DIGI usage is balanced across age groups, while CASH is used mostly by customers aged 50 and above, who in turn use CARD less.
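
The three shares can also be computed in a single table. This is a minimal sketch using `pd.crosstab` with row normalisation on hypothetical toy data; each row then sums to 1, so the columns are directly comparable across age groups.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'customer_age_group': ['18-22', '18-22', '18-22', '50+', '50+'],
    'payment_method': ['CARD', 'CARD', 'CASH', 'CASH', 'DIGI'],
})

# Share of each payment method within each age group
shares = pd.crosstab(toy['customer_age_group'], toy['payment_method'], normalize='index')
print(shares)
```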

#### <span style="color:salmon"> 4.5 Proportions of each last_promo for each payment_method </span>  
To see how the last promo is distributed within each payment method, we plot the proportions:

In [None]:
value_counts = df.groupby(['payment_method', 'last_promo']).size().unstack().fillna(0)

# calculating proportions of each last_promo value for each payment_method
normalized_counts = value_counts.div(value_counts.sum(axis=1), axis=0)

colors = {'CARD': 'darkblue', 'CASH': 'blue', 'DIGI': 'lightblue'}
# plotting the proportions; pandas applies this color list to the columns (the last_promo values), not to the payment methods
normalized_counts.plot(kind='bar', figsize=(6, 4), color=[colors[col] for col in normalized_counts.index]) 

plt.title('Proportions of each last_promo for each payment_method')
plt.xlabel('Payment Method')
plt.ylabel('Proportion of Last Promo')
plt.xticks(rotation=0)
plt.legend(title='Last Promo', bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.show()

**Conclusions:** 
+ DELIVERY always has the highest proportion, but for CARD the distribution of last_promo is more balanced.
+ For CASH, the proportion of FREEBIE is lower than for the other payment methods.

#### <span style="color:salmon"> 4.6 Means of each payment_method in vendor_count </span> 
Mean number of unique vendors that customers ordered from, per payment method:

In [None]:
# calculating average value of vendor_count for each payment method
df.groupby(['payment_method'])['vendor_count'].mean()

**Conclusions:** 
+ CASH < DIGI < CARD
+ CARD is highest and CASH lowest, perhaps because customers who like to try new restaurants are also more open to modern or alternative payment methods such as CARD, while more conservative customers who still use CASH stay with restaurants they know and like.

#### <span style="color:salmon"> 4.7 Means of each payment_method in lifetime_days </span>
Mean lifetime_days (the difference between last_order and first_order) per payment method:

In [None]:
# calculating average value of lifetime_days for each payment method
df.groupby(['payment_method'])['lifetime_days'].mean()

**Conclusions:** 
+ CASH < DIGI < CARD
+ CARD is highest and CASH lowest, perhaps because customers who plan to order frequently register their CARD on the website/app, while first-time users only use CASH.

#### <span style="color:salmon"> 4.8 Means of each payment_method in total_expenses </span>
Mean of each payment of the sum of total expenses:

In [None]:
# calculating average value of total_expenses for each payment method
df.groupby(['payment_method'])['total_expenses'].mean() 

**Conclusions:** 
+ CASH < DIGI < CARD
+ total_expenses is highest with CARD, while DIGI and CASH are close. A possible explanation is that paying with CASH gives a better feeling for how much has been spent, whereas with CARD it is easier to overspend without noticing.

#### <span style="color:salmon"> 4.9 Means of each payment_method in avg_per_product </span> 
Mean avg_per_product (average monetary units per product) per payment method:

In [None]:
# calculating average value of avg_per_product for each payment method
df.groupby(['payment_method'])['avg_per_product'].mean()

**Conclusions:** 
+ CARD < CASH < DIGI
+ The differences in avg_per_product between CARD, CASH and DIGI are small, with DIGI having the highest average.

#### <span style="color:salmon"> 4.10 Means of each payment_method in avg_per_order </span>  
Mean avg_per_order (average monetary units per order) per payment method:

In [None]:
# calculating average value of avg_per_order for each payment method
df.groupby(['payment_method'])['avg_per_order'].mean()

**Conclusions:** 
+ CARD < CASH < DIGI
+ The average monetary units per order is highest when paying by DIGI.

#### <span style="color:salmon"> 4.11 Means of each payment_method in avg_order_size </span> 
Mean avg_order_size (average number of products per order) per payment method:

In [None]:
# calculating average value of avg_order_size for each payment method
df.groupby(['payment_method'])['avg_order_size'].mean()

**Conclusions:** 
+ CASH < CARD < DIGI
+ Customers who paid with DIGI placed the largest orders on average, but the difference is minimal.

#### <span style="color:salmon"> 4.12 Means of each payment_method in culinary_variety </span>
Mean culinary_variety (proportion of cuisines ordered) per payment method:

In [None]:
# calculating average value of culinary_variety for each payment method
df.groupby(['payment_method'])['culinary_variety'].mean() 

**Conclusions:** 
+ CASH < DIGI < CARD
+ CARD is highest and CASH lowest. As with vendor_count, customers paying with CASH appear more conservative and try fewer new things, including "new" payment methods.

#### <span style="color:salmon"> 4.13 Means of each payment_method in chain_preference </span>
Mean chain_preference (proportion of orders from restaurant chains) per payment method:

In [None]:
# calculating average value of chain_preference for each payment method
df.groupby(['payment_method'])['chain_preference'].mean()

**Conclusions:** 
+ CARD < DIGI < CASH
+ chain_preference is highest for customers paying with CASH, but the difference is minimal.
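
The per-feature comparisons in sections 4.6 to 4.13 repeat the same `groupby(...).mean()` call. As a compact alternative, all means can be collected in one table, sketched here on hypothetical toy data with only two of the features.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'payment_method': ['CARD', 'CARD', 'CASH', 'DIGI'],
    'vendor_count': [4, 2, 1, 3],
    'total_expenses': [60.0, 40.0, 20.0, 30.0],
})

# One row per payment method, one column per feature
features = ['vendor_count', 'total_expenses']
summary = toy.groupby('payment_method')[features].mean()
print(summary)
```

With the real data, the `features` list could hold all the metric features of interest, which makes the orderings (for example CASH < DIGI < CARD) easier to compare side by side.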

#### <span style="color:salmon"> 4.14 Total expenses per age group </span> 
Mean of the total expenses in each age group:

In [None]:
# calculating average value of total_expenses for each customer_age_group
df.groupby(['customer_age_group'])['total_expenses'].mean().plot(marker='o', color='darkblue', figsize=(5,3)) 
plt.ylabel('Total expenses')
plt.title('Total expenses per age group')
plt.show()

**Conclusions:** 
+ Spending is lower in the youngest age groups and levels off from 23-28 onwards, probably because young people have less money.

#### <span style="color:salmon"> 4.15 Culinary variety per age group </span>
Mean of the culinary variety in each age group:

In [None]:
# calculating average value of culinary_variety for each customer_age_group
df.groupby(['customer_age_group'])['culinary_variety'].mean().plot(marker='o', color='darkblue', figsize=(5,3)) 
plt.ylabel('culinary variety')
plt.title('culinary variety per age group')
plt.show()

**Conclusions:** 
+ There are no big differences, but there is a peak at 23-28, perhaps because people start to live on their own and try more different things.

#### <span style="color:salmon"> 4.16 Means of each last_promo in lifetime_days </span> 
Mean of each last_promo in lifetime_days:

In [None]:
# calculating average value of lifetime_days  for each last_promo
df.groupby(['last_promo'])['lifetime_days'].mean() 

**Conclusions:** 
+ DELIVERY < DISCOUNT < FREEBIE
+ FREEBIE highest, DELIVERY lowest

#### <span style="color:salmon"> 4.17 Means of each last_promo in total_expenses </span>
Average value of total_expenses for each last_promo:

In [None]:
df.groupby(['last_promo'])['total_expenses'].mean() 

**Conclusions:** 
+ DELIVERY < DISCOUNT < FREEBIE
+ FREEBIE is highest and DELIVERY lowest, though DELIVERY is close to DISCOUNT.
+ Customers whose last promo was FREEBIE have the highest total expenses.

#### <span style="color:salmon"> 4.18 Means of each last_promo in avg_per_product </span> 
Average value of avg_per_product for each last_promo:

In [None]:
df.groupby(['last_promo'])['avg_per_product'].mean()

**Conclusions:** 
+ DISCOUNT < FREEBIE < DELIVERY
+ avg_per_product is highest for DELIVERY, but the difference is minimal.

#### <span style="color:salmon"> 4.19 Means of each last_promo in avg_per_order </span>  
Average value of avg_per_order for each last_promo:

In [None]:
df.groupby(['last_promo'])['avg_per_order'].mean() 

**Conclusions:** 
+ DISCOUNT < FREEBIE < DELIVERY
+ avg_per_order is highest for DELIVERY, but the difference is minimal.

#### <span style="color:salmon"> 4.20 Means of each last_promo in chain_preferences </span>
Average value of chain_preference for each last_promo:

In [None]:
df.groupby(['last_promo'])['chain_preference'].mean() 

**Conclusions:** 
+ FREEBIE < DELIVERY < DISCOUNT
+ DISCOUNT is highest: customers whose last promo was DISCOUNT tend to order more from chains.

#### <span style="color:salmon"> 4.21 Means of each last_promo in loyalty_to_venders </span>
Average value of loyalty_to_venders for each last_promo:

In [None]:
df.groupby(['last_promo'])['loyalty_to_venders'].mean() 

**Conclusions:** 
+ DELIVERY < DISCOUNT < FREEBIE
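+ FREEBIE is highest. Since loyalty_to_venders is the number of unique vendors per order, a higher value means orders are spread across more vendors, so customers whose last promo was FREEBIE appear less loyal to specific vendors.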

#### <span style="color:salmon"> 4.22 Relations between the customer age and some types of cuisines </span> 
We plot CUI_Asian, CUI_Desserts and CUI_Healthy per age group, because they are the most relevant.

In [None]:
# Plotting the amount of money spent on cuisines based on age
warnings.filterwarnings('ignore')
fig, aplots = plt.subplots(1, 3, figsize=(15, 4))
aplots = aplots.flatten()
CUIcolumns = df[['CUI_Asian', 'CUI_Desserts', 'CUI_Healthy']].columns # choosing only the columns to plot

for i, col in enumerate(CUIcolumns):
    df.groupby('customer_age_group')[col].mean().plot(ax=aplots[i], marker='o', color='midnightblue', label=col)   
    aplots[i].set_title(f'average amount of units spent in {col} per age')
    aplots[i].set_xlabel('customer_age_group')
    aplots[i].set_ylabel('average amount of units spent')
    aplots[i].legend()
    
plt.tight_layout()
plt.show()

**Conclusions:** 
+ *Asian*: spending clearly increases with age, perhaps because this cuisine is potentially more expensive.
+ *Desserts*: spending clearly decreases with age, potentially because older people tend to pay more attention to their health.
+ *Healthy*: spending peaks at 23-28, perhaps because at this age more people start to feel the effects of unhealthy food, which are less obvious at younger ages, so this age group may focus more on their health.

#### <span style="color:salmon"> 4.23 Means of each total_expenses in chain_preferences </span> 
Customers are split by whether their total_expenses exceeds 45; the 'True' group contains the highest spenders. We compare the mean chain_preference of the two groups:

In [None]:
df.groupby(df['total_expenses'] > 45)['chain_preference'].mean()

**Conclusions:** 
+ Big spenders place a smaller share of their orders at chain restaurants.

#### <span style="color:salmon"> 4.24 Proportions of each last_promo value for the two spending groups </span> 
Proportions of each last_promo value among customers with total_expenses above 45 and among the rest:

In [None]:
df.groupby(df['total_expenses'] > 45)['last_promo'].value_counts(normalize=True)

**Conclusions:** 
+ Big spenders use FREEBIE a lot more, while low spenders use DELIVERY the most
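
The same proportions can also be laid out as a table with one row per spender group, which makes the comparison between groups easier to scan. This is a sketch under assumptions: the toy data and the label names `big spender` and `other` are hypothetical, and `pd.crosstab` with `normalize='index'` is used here as an alternative to the `value_counts` call above, not as the notebook's method.

```python
import pandas as pd

# Toy data with hypothetical values; column names follow the notebook
toy = pd.DataFrame({
    'total_expenses': [10.0, 20.0, 60.0, 80.0, 90.0, 30.0],
    'last_promo': ['DELIVERY', 'DELIVERY', 'FREEBIE', 'FREEBIE', 'DISCOUNT', 'DISCOUNT'],
})

# Readable group labels (assumed names) instead of True/False
spender_group = (
    (toy['total_expenses'] > 45)
    .map({True: 'big spender', False: 'other'})
    .rename('spender_group')
)

# Row-normalised cross-tabulation: each row sums to 1
promo_shares = pd.crosstab(spender_group, toy['last_promo'], normalize='index')
print(promo_shares)
```

Each row then shows how one group's last promotions are distributed, so the FREEBIE and DELIVERY shares of the two groups sit in the same columns.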

#### <span style="color:salmon"> 4.25 Cuisines and total_expenses  </span>
We also check in which cuisines the big-spending customers show a higher increase in spending than in others, to see where the big spenders spend the most in comparison to the 'normal' spenders:

In [None]:
percent_differences = {}

for column in cuisine:
    mean_above_45 = df[df['total_expenses'] > 45][column].mean()
    mean_below_45 = df[df['total_expenses'] <= 45][column].mean()
    
    percent_difference = ((mean_above_45 - mean_below_45) / mean_below_45) * 100
    percent_differences[column] = percent_difference
    
pd.DataFrame(list(percent_differences.items()), columns=['Column', 'Percent Difference']).sort_values(by='Percent Difference', ascending=False)

**Conclusions:** 
+ Compared to the other customers, the biggest spenders show their highest increase in spending in StreetFood/Snacks, Cafe, Asian, Healthy and Desserts, while the lowest increase is in Chicken Dishes, Noodles Dishes and Indian.
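
The percent difference computed in the loop above is `(mean_group - mean_rest) / mean_rest * 100`. A minimal vectorised sketch of the same calculation follows; the helper name `percent_difference_by_group` and the toy data are hypothetical, chosen only for illustration.

```python
import pandas as pd

def percent_difference_by_group(data, columns, mask):
    """Percent difference of column means between rows where mask is True and the rest.

    Hypothetical helper: computes (mean_in - mean_out) / mean_out * 100 per column.
    """
    mean_in = data.loc[mask, columns].mean()
    mean_out = data.loc[~mask, columns].mean()
    return ((mean_in - mean_out) / mean_out * 100).sort_values(ascending=False)

# Toy data with hypothetical values; column names follow the notebook's pattern
toy = pd.DataFrame({
    'total_expenses': [10.0, 20.0, 60.0, 80.0],
    'CUI_Asian': [2.0, 2.0, 6.0, 6.0],
    'CUI_Desserts': [4.0, 4.0, 5.0, 5.0],
})

result = percent_difference_by_group(
    toy, ['CUI_Asian', 'CUI_Desserts'], toy['total_expenses'] > 45
)
print(result)
```

Because the mask is passed in as an argument, the same helper could be reused for other splits. Note that a cuisine whose mean is zero in the comparison group would produce an infinite or undefined percent difference.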

#### <span style="color:salmon"> 4.26 Cuisines and loyalty_to_venders  </span> 
The loyalty_to_venders feature captures whether a customer tends to place orders with a few vendors rather than spreading them across many vendors. <br>

In this step, customers with a loyalty_to_venders value below 0.15 (the threshold used in the code below) are treated as the most loyal customers.

In [None]:
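percent_differences2 = {}

# Split customers at the loyalty_to_venders threshold of 0.15 and compare mean spending per cuisine
for column in cuisine:
    mean_below_threshold = df[df['loyalty_to_venders'] < 0.15][column].mean()
    mean_at_or_above_threshold = df[df['loyalty_to_venders'] >= 0.15][column].mean()

    # Relative difference (in %) of the group below the threshold versus the rest
    percent_difference2 = ((mean_below_threshold - mean_at_or_above_threshold) / mean_at_or_above_threshold) * 100
    percent_differences2[column] = percent_difference2

pd.DataFrame(list(percent_differences2.items()), columns=['Column', 'Percent Difference']).sort_values(by='Percent Difference', ascending=False)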

**Conclusions:**  
+ Customers are most loyal to their restaurants for Italian cuisine, followed by Chinese and Cafe. Loyalty is lowest for Desserts and Snacks, where customers tend to choose any place without too much thought.

#### <span style="color:salmon"> 4.27 Customer_regions  </span>
Mean total_expenses per customer_region:

In [None]:
# Bar chart of the mean total_expenses in each customer region
df.groupby(['customer_region'])['total_expenses'].mean().plot(color='darkblue', figsize=(5,3), kind='bar', width=0.95) 
plt.ylabel('mean total_expenses')
plt.xticks(rotation=0)  # keep the region labels horizontal
plt.title('Mean total_expenses per customer_region')
plt.show()

**Conclusions:** 
+ Customers in region 8550 spend more on average than customers in the other regions.

### DATA CLEANING AND PREPROCESSING:
- Treat missing values (DONE)
- Outliers (visualization + treatment) (DONE)
- Data transformation (scaling and encoding)
- Feature selection 