# Customer Segmentation of Arvato-Bertelsmann customers

The Arvato-Group is one of total 8 business units in the Bertelsmann Group which is a worldwide operating service company head-quarted in Germany.<br>
The main operating field of Avarto are logistics- and supply chain services and solutions, financial services as well as the operation of IT Systems. Concerning the general figures to get a grasp of the company, the company employs a staff around 77.342 persons (2020) and generates a sales volume of 5.56 Mrd. EUR per a (2024).

The present project can be localized in the financial services branch of Arvato (Arvato Financial Solutions).<br><br>
<span style="color: green;">**One client of Arvato Financial Solutions, a Mail-Order Company selling organic products, wants to be advised concerning a more efficient way to acquire new clients.<br>
In essence, the company wants their acquisition marketing campaings instead of reaching out to everyone (costly), target more precisely those persons which show the highest probability to turn into new customers.**</span>
<br><br>
<span style="text-decoration: underline;">The project spans two main tasks:</span>
1) Customer Segmentation: An Analysis of the existing customer database dataset is carried out and on this basis a general recommandation of which people in Germany are most likely to be new customers of the company is generated. <br><br>
2) Modelling Campaign-Responses: Using the results of 1) to build a machine learning model that predicts whether or not an individual will respond to the respective campaign.

This notebook focuses on the first main task.




## II. Methodology

* General description how we'll proceed
* Short description of the datasets at hand
* Exploratory Analysis of the two datasets
* Short plan what needs to be done to clean the dataset for further use
* PCA of bigger ds
* PCA application on customer ds
* Clustering 

<img src = '../data/img/procedure_segmentation.PNG'/>

# Loading the data and importing the libraries

In [44]:
#Import relevant libraries
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 999)
pd.set_option('display.max_colwidth', None)

import os
import re

import matplotlib.pyplot as plt
import seaborn as sns

import seaborn as sns

from scipy.stats import skew, kurtosis

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, roc_auc_score

from scipy import stats
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [124]:
root_path = os.path.dirname(os.getcwd())

mailout_train = pd.read_csv(rf'{root_path}\data\modeling\Udacity_MAILOUT_052018_TRAIN.csv', sep=';', low_memory=False)
mailout_test = pd.read_csv(rf'{root_path}\data\modeling\Udacity_MAILOUT_052018_TEST.csv', sep=';', low_memory=False)

feature_summary = pd.read_excel(rf'{root_path}\data\description\DIAS Attributes - Values 2017.xlsx')

In [125]:
#original data not to be touched
mailout_train_original = mailout_train.copy()
mailout_test_original = mailout_test.copy()

# Checking the data

In [126]:
mailout_train.info(verbose = True, memory_usage=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42962 entries, 0 to 42961
Data columns (total 367 columns):
 #    Column                       Non-Null Count  Dtype  
---   ------                       --------------  -----  
 0    LNR                          42962 non-null  int64  
 1    AGER_TYP                     42962 non-null  int64  
 2    AKT_DAT_KL                   35993 non-null  float64
 3    ALTER_HH                     35993 non-null  float64
 4    ALTER_KIND1                  1988 non-null   float64
 5    ALTER_KIND2                  756 non-null    float64
 6    ALTER_KIND3                  174 non-null    float64
 7    ALTER_KIND4                  41 non-null     float64
 8    ALTERSKATEGORIE_FEIN         34807 non-null  float64
 9    ANZ_HAUSHALTE_AKTIV          35185 non-null  float64
 10   ANZ_HH_TITEL                 34716 non-null  float64
 11   ANZ_KINDER                   35993 non-null  float64
 12   ANZ_PERSONEN                 35993 non-null  float64
 13  

In [127]:
mailout_test.info(verbose = True, memory_usage=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42833 entries, 0 to 42832
Data columns (total 366 columns):
 #    Column                       Non-Null Count  Dtype  
---   ------                       --------------  -----  
 0    LNR                          42833 non-null  int64  
 1    AGER_TYP                     42833 non-null  int64  
 2    AKT_DAT_KL                   35944 non-null  float64
 3    ALTER_HH                     35944 non-null  float64
 4    ALTER_KIND1                  2013 non-null   float64
 5    ALTER_KIND2                  762 non-null    float64
 6    ALTER_KIND3                  201 non-null    float64
 7    ALTER_KIND4                  39 non-null     float64
 8    ALTERSKATEGORIE_FEIN         34715 non-null  float64
 9    ANZ_HAUSHALTE_AKTIV          35206 non-null  float64
 10   ANZ_HH_TITEL                 34687 non-null  float64
 11   ANZ_KINDER                   35944 non-null  float64
 12   ANZ_PERSONEN                 35944 non-null  float64
 13  

In [128]:
np.setdiff1d(mailout_train.columns, mailout_test.columns)

array(['RESPONSE'], dtype=object)

Same column count as in the dataset of the segmentation section.

In [129]:
mailout_train["RESPONSE"].value_counts()

RESPONSE
0    42430
1      532
Name: count, dtype: int64

Strongly unbalanced training set which needs to be adressed in the further procedure.

# Data Preparation & Training

In [130]:
from sklearn.base import BaseEstimator, TransformerMixin
from imblearn.pipeline import Pipeline

#from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
from imblearn.under_sampling import RandomUnderSampler

from sklearn.model_selection import cross_validate, GridSearchCV

from xgboost import XGBClassifier

import pandas as pd

In [131]:
X_train = mailout_train.drop("RESPONSE", axis = 1)
y_train = mailout_train["RESPONSE"]

X_test = mailout_test.copy()

In [132]:
class Replace_unknown_values(BaseEstimator, TransformerMixin):
    def __init__(self):      

        self.feature_summary = pd.read_excel(rf'{root_path}\data\description\DIAS Attributes - Values 2017.xlsx')
        self.mapping = {}
        self.ls = []

    def fit(self, X, y = None):
        self.feat_unknown = self.feature_summary[self.feature_summary["Meaning"] == "unknown"]
        for col in self.feat_unknown["Attribute"].unique():
            self.ls = self.feat_unknown[self.feat_unknown["Attribute"] == col]["Value"].values.tolist()

        if isinstance(self.ls[0], str):
            self.mapping[col] = [int(element) for element in self.ls[0].split(",")]
        else:
            self.mapping[col] = self.ls

        return self
    
    def transform(self, X, y = None):

        for col in X.columns:
            if col in list(self.mapping.keys()):
                X[col] = np.where(X[col].isin(self.mapping[col]), np.nan, X[col])  

        X = X.replace(["XX", "X"], np.nan)

        return X
    
class Drop_specific_cols(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y = None):
        return self

    def transform(self, X, y = None):
        
        print(X.shape)

        X = X.drop(self.columns, axis = 1)
        print(f"The following columns have been dropped: {self.columns}")

        return X

class Handle_missing_columns(BaseEstimator, TransformerMixin):
    def __init__(self, missing_value_params):
        self.missing_value_params = missing_value_params

        self.col_threshold = self.missing_value_params["col_threshold"]
        self.rows_to_drop = None
        self.cols_to_drop = None

    def fit(self, X, y = None):
    
        self.cols_to_drop = X.isnull().mean()[X.isnull().mean()>self.col_threshold].index
        return self

    def transform(self, X, y = None):

        X = X.drop(self.cols_to_drop, axis = 1)
        return X
    
class Handle_high_correlation(BaseEstimator, TransformerMixin):
    def __init__(self, corr_threshold, untouchable_feats):
        self.corr_threshold = corr_threshold
        self.untouchable_feats = untouchable_feats
        self.corr_list = {}
        self.selected_cols = []

    def fit(self, X, y = None):
        
        processed_cols = [] #already processed or dropped before
        processed_cols.extend(self.untouchable_feats)

        self.selected_cols = [] #selected features
        self.selected_cols.extend(self.untouchable_feats)

        corr_list_tmp = {} #dictionary for each selected col with the highly correlated (and dropped features)
        nan_matrix = X.isnull().sum().sort_values().reset_index()
        corr_df = X.select_dtypes(exclude=["object"]).corr()

        #looping through the columns of correlation_df
        for col in corr_df.columns:

            if col not in processed_cols:
                
                processed_cols.append(col)
                corr_ls = []
                nan_rank_col = nan_matrix[nan_matrix["index"] == col].index.item()
                selected = True

                for counterpart in corr_df[col].index.tolist():

                    if (counterpart not in self.selected_cols) & (counterpart != col) & (counterpart not in processed_cols):
                        corr_ = corr_df[col].loc[counterpart].item()

                        if abs(corr_) > self.corr_threshold:
                            
                            if nan_rank_col < nan_matrix[nan_matrix["index"] == counterpart].index.item():
                                corr_ls.append([counterpart, np.round(corr_,2)])
                                processed_cols.append(counterpart)
                            else:
                                selected = False
                                break
                
                if selected == True:
                    self.selected_cols.append(col)
                    corr_list_tmp[col] = corr_ls
        
        self.selected_cols
        for col in list(corr_list_tmp.keys()):

            if len(corr_list_tmp[col])!=0:
                self.corr_list[col] = corr_list_tmp[col]           

        return self

    def transform(self, X, y = None):

        X = X[self.selected_cols]
                    
        return X

class Process_youth_years(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.feature_summary = pd.read_excel(rf'{root_path}\data\description\DIAS Attributes - Values 2017.xlsx')

    def fit(self, X, y = None):
        return self

    def transform(self, X, y = None):
        
        def extract_movement(txt):
            """Extract the movement from the Attribute PRAEGENDE_JUGENDJAHRE	

            Args:
                txt (str): text value of the attribute PRAEGENDE_JUGENDJAHRE	

            Returns:
                str: if extraction works out: movement type, else: input value
            """

            try:

                start_ = re.search(r"\(", txt).start()
                end_ = re.search(r"\)", txt).start() + 1

                return_txt = txt[start_: end_]
                txt_ls = return_txt.split(",")

                return txt_ls[0][1:]
            except:
                
                return txt

        def get_mapping(feature):
            """Function to provide a feature mapping from feature values to meaning of these values

            Args:
                feature (str): Feature to provide a mapping for

            Raises:
                LookupError: if a feature is not found in the feature_summary dataframe. This error is raised to inform the user of the non-existence

            Returns:
                dict: Dictionary with the mapping feature_value:meaning 
            """

            if feature in self.feature_summary["Attribute"].unique().tolist():
                mapping_ = self.feature_summary[self.feature_summary["Attribute"] == feature][["Value", "Meaning"]].set_index("Value").to_dict()["Meaning"]
            else:
                raise LookupError("Can't find the provided feature in data.")

            return mapping_
        
        X_ = X["PRAEGENDE_JUGENDJAHRE"].copy().to_frame()
        X_["praegende_jugendjahre_cat"] = X_["PRAEGENDE_JUGENDJAHRE"].map(get_mapping("PRAEGENDE_JUGENDJAHRE"))

        X_["youth_years"] = X_["praegende_jugendjahre_cat"].apply(lambda x: str(x)[:2])
        X_["movement_type"] = X_["praegende_jugendjahre_cat"].str[:].apply(lambda x: extract_movement(x))

        X.drop("PRAEGENDE_JUGENDJAHRE", axis = 1, inplace = True)
        X = pd.concat([X, X_], axis = 1)

        X.drop("praegende_jugendjahre_cat", axis = 1, inplace = True)
        X.drop("PRAEGENDE_JUGENDJAHRE", axis = 1, inplace = True)

        return X

class Impute_missing_data(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.imputer = None
        self.df_dtypes = None

    def fit(self, X, y = None):
        self.imputer = SimpleImputer(strategy="most_frequent")
        self.df_dtypes = {feat: str(X.dtypes.loc[feat]) for feat in X.dtypes.index}
        self.imputer.fit(X)

        return self

    def transform(self, X, y = None):   
        
        X = pd.DataFrame(self.imputer.transform(X), columns = X.columns)
        for col in list(self.df_dtypes.keys()):
            X[col] = X[col].astype(self.df_dtypes[col])

        return X

class Encode_features(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.feats_to_encode = None
        self.encoder = None
        self.o_w_mapping = None

    def fit(self, X, y = None):
        
        self.feats_to_encode = X.select_dtypes(include = ["object"]).columns.tolist()
        X_cat = X[self.feats_to_encode].copy()
        

        #encode cat
        self.encoder = OneHotEncoder(handle_unknown='ignore',sparse_output=True, drop = "if_binary")
        self.encoder.fit(X_cat)

        return self

    def transform(self, X, y = None):   
        
        X_cat = X[self.feats_to_encode].copy()
        X_num = X.drop(self.feats_to_encode, axis = 1)

        X_cat_encoded = self.encoder.transform(X_cat).todense()

        categorical_columns = [f'{col}_{cat}' for i, col in enumerate(X_cat.columns) for cat in self.encoder.categories_[i]]

        print("X_cat_encoded shape: ",X_cat_encoded.shape)
        print("cat cols list: ", categorical_columns)

        X_cat_encoded = pd.DataFrame(X_cat_encoded, columns = self.encoder.get_feature_names_out().tolist())    

        X = pd.concat([X_num, X_cat_encoded], axis = 1).reset_index()

        return X
    
class Scaling_features(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.scaler = None

    def fit(self, X, y = None):
        
        self.scaler = StandardScaler()
        self.scaler.fit(X)        

        return self

    def transform(self, X, y = None):   

        X = pd.DataFrame(self.scaler.transform(X), columns = X.columns)
        
        print("check scaling")

        return X
#####################################

In [133]:
missing_value_params = {
        "col_threshold": 0.5
}
non_informative_feats = ["LNR", "EINGEFUEGT_AM", "EINGEZOGENAM_HH_JAHR", "NATIONALITAET_KZ"]

ml_pipeline = Pipeline(steps = [
    ('replace_unknown_values', Replace_unknown_values()),
    ('drop_specific_columns', Drop_specific_cols(columns = non_informative_feats)),
    ('handle_missing_values', Handle_missing_columns(missing_value_params=missing_value_params)),
    ('handle_high_correlation', Handle_high_correlation(corr_threshold=0.85, untouchable_feats=["PRAEGENDE_JUGENDJAHRE"])),
    ('process_youth_years', Process_youth_years()),
    ('impute_missing_data', Impute_missing_data()),
    ('encode_features', Encode_features()),
    ('scale_features', Scaling_features()),
    ("undersampling", RandomUnderSampler(sampling_strategy="majority", random_state=42)),
    ("model", LogisticRegression(class_weight = "balanced"))
])

In [134]:
scoring = {
    'precision': make_scorer(precision_score, average='macro'),
    'recall': make_scorer(recall_score, average='macro'),
    'f1': make_scorer(f1_score, average='macro'),
    'roc_auc': make_scorer(roc_auc_score, average='macro')
}

In [136]:
models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=300, max_depth=10),
    "LogisticRegression": LogisticRegression(),
    "AdaBoost": AdaBoostClassifier(n_estimators=200),
    "XGboost": XGBClassifier(n_estimators = 200)
}

cv_scores = {

}


for model in list(models.keys()):

    ml_pipeline.set_params(model = models[model])

    print(f"Crossvalidating model {model}...")

    current_scoring = cross_validate(
        estimator=ml_pipeline,
        X = X_train,
        y = y_train,
        cv=5,
        n_jobs=-1,
        verbose=0,
        scoring=scoring
    )

    cv_scores[model] = current_scoring

    for k in list(cv_scores[model].keys()):
        current_values = cv_scores[model][k]
        print(f"{k}: {current_values}")


    print("-"*50)


Crossvalidating model RandomForestClassifier...
fit_time: [15.36750507 15.24575663 15.34846902 15.38250327 14.82106924]
score_time: [0.73506451 0.7044363  0.75309896 0.74937892 0.66956091]
test_precision: [0.50819522 0.50688946 0.50964931 0.50739025 0.50956139]
test_recall: [0.63546997 0.63511699 0.6920077  0.6488545  0.69325504]
test_f1: [0.44025011 0.3930171  0.39405813 0.38294458 0.38324891]
test_roc_auc: [0.63546997 0.63511699 0.6920077  0.6488545  0.69325504]
--------------------------------------------------
Crossvalidating model LogisticRegression...
fit_time: [14.55321288 14.59996152 14.31131506 14.38354135 14.53388095]
score_time: [0.67647743 0.66126633 0.70100045 0.68780136 0.66683197]
test_precision: [0.50381152 0.50270463 0.50348154 0.50489967 0.50344481]
test_recall: [0.57411548 0.55480329 0.57121608 0.59737348 0.5704112 ]
test_f1: [0.39258923 0.36019831 0.36058636 0.38787959 0.33451667]
test_roc_auc: [0.57411548 0.55480329 0.57121608 0.59737348 0.5704112 ]
---------------

In [137]:
#drop rows with too many missing values
mailout_train_dropped = mailout_train[mailout_train.isnull().sum(axis = 1)<100].copy()

In [138]:
X_train_, y_train_ = mailout_train_dropped.drop("RESPONSE", axis = 1), mailout_train_dropped["RESPONSE"].copy()

In [139]:
models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=300, max_depth=10),
    "LogisticRegression": LogisticRegression(class_weight="balanced"),
    "GradientBoostClassifier": AdaBoostClassifier(n_estimators=200),
    "XGboost": XGBClassifier(n_estimators = 200)
}

cv_scores = {

}


for model in list(models.keys()):

    ml_pipeline.set_params(model = models[model])

    print(f"Crossvalidating model {model}...")

    current_scoring = cross_validate(
        estimator=ml_pipeline,
        X = X_train_,
        y = y_train_,
        cv=5,
        n_jobs=-1,
        verbose=0,
        scoring=scoring
    )

    cv_scores[model] = current_scoring

    for k in list(cv_scores[model].keys()):
        current_values = cv_scores[model][k]
        print(f"{k}: {current_values}")


    print("-"*50)

Crossvalidating model RandomForestClassifier...
fit_time: [12.21594191 12.09699225 12.20138383 12.15191746 12.17460394]
score_time: [0.68695879 0.63891292 0.68450022 0.6636343  0.68641591]
test_precision: [0.51183488 0.51102193 0.51390946 0.51164987 0.51082724]
test_recall: [0.71689604 0.70315963 0.75362095 0.71259546 0.69878673]
test_f1: [0.42525635 0.42245379 0.42979932 0.42608867 0.42322174]
test_roc_auc: [0.71689604 0.70315963 0.75362095 0.71259546 0.69878673]
--------------------------------------------------
Crossvalidating model LogisticRegression...
fit_time: [11.97018051 11.78930187 11.96217203 11.92515683 11.8410182 ]
score_time: [0.57735276 0.56418419 0.56866527 0.57264137 0.55666447]
test_precision: [0.50490042 0.50483878 0.50397032 0.50441676 0.50529669]
test_recall: [0.59709011 0.59661707 0.57754007 0.58887803 0.60702059]
test_f1: [0.38555123 0.38042792 0.39167281 0.37353234 0.37050295]
test_roc_auc: [0.59709011 0.59661707 0.57754007 0.58887803 0.60702059]
---------------

In [140]:
for model in list(cv_scores.keys()):

    models_mean = np.mean(cv_scores[model]['test_roc_auc'])
    print(f"Model {model}: {np.round(models_mean,2)}")

Model RandomForestClassifier: 0.72
Model LogisticRegression: 0.59
Model GradientBoostClassifier: 0.68
Model XGboost: 0.73


In [155]:
ml_pipeline.set_params(model = models["XGboost"])

hyperparams = {
    'model__learning_rate': [0.01, 0.1],
    'model__max_depth': [5, 7],
    'model__subsample': [0.6, 0.8],
    'model__colsample_bytree': [0.8, 1.0],
    'model__n_estimators': [200, 300],
    'model__tree_method': ['gpu_hist']
}

grid_search = GridSearchCV(estimator=ml_pipeline, param_grid=hyperparams, scoring = scoring, cv = 5, verbose = 1, return_train_score= True, refit = "roc_auc", n_jobs=-1)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
(42962, 366)
The following columns have been dropped: ['LNR', 'EINGEFUEGT_AM', 'EINGEZOGENAM_HH_JAHR', 'NATIONALITAET_KZ']
X_cat_encoded shape:  (42962, 8)
cat cols list:  ['youth_years_40', 'youth_years_50', 'youth_years_60', 'youth_years_70', 'youth_years_80', 'youth_years_90', 'youth_years_na', 'movement_type_Avantgarde', 'movement_type_Mainstream']
check scaling


In [161]:
opt_xgboost_params = grid_search.best_params_
opt_xgboost_params

{'model__colsample_bytree': 1.0,
 'model__learning_rate': 0.01,
 'model__max_depth': 5,
 'model__n_estimators': 200,
 'model__subsample': 0.6,
 'model__tree_method': 'gpu_hist'}

In [163]:
ml_pipeline.set_params(**opt_xgboost_params)

current_scoring = cross_validate(
        estimator=ml_pipeline,
        X = X_train_,
        y = y_train_,
        cv=5,
        n_jobs=-1,
        verbose=0,
        scoring=scoring
        )

In [164]:
current_scoring

{'fit_time': array([15.9989543 , 16.09333944, 16.02906513, 16.02860451, 16.06034088]),
 'score_time': array([0.75433135, 0.76253819, 0.78845859, 0.78892112, 0.76118398]),
 'test_precision': array([0.51524656, 0.5130992 , 0.51807752, 0.51500526, 0.51457722]),
 'test_recall': array([0.76393053, 0.72514719, 0.80535672, 0.7617607 , 0.76041973]),
 'test_f1': array([0.44360739, 0.44158068, 0.45273986, 0.4416748 , 0.43582911]),
 'test_roc_auc': array([0.76393053, 0.72514719, 0.80535672, 0.7617607 , 0.76041973])}

In [165]:
np.mean(current_scoring["test_roc_auc"])

np.float64(0.7633229747053278)

## III. Results

## IV. Discussion

* imputation
* further analysis of the columns
* Outlier

https://de.wikipedia.org/wiki/Arvato