# Customer Segmentation of Arvato-Bertelsmann customers

## Summary 


## I. Introduction

The Arvato Group is one of eight business units of the Bertelsmann Group, a service company headquartered in Germany that operates worldwide.<br>
Arvato's main fields of operation are logistics and supply chain services and solutions, financial services, and the operation of IT systems. To give a sense of scale, the company employs around 77,342 people (2020) and generates annual sales of 5.56 billion EUR (2024).

This project is located in Arvato's financial services branch (Arvato Financial Solutions).<br><br>
<span style="color: green;">**One client of Arvato Financial Solutions, a mail-order company selling organic products, wants advice on a more efficient way to acquire new clients.<br>
In essence, instead of reaching out to everyone (which is costly), the company wants its acquisition marketing campaigns to target more precisely those people who are most likely to become new customers.**</span>
<br><br>
<span style="text-decoration: underline;">The project spans two main tasks:</span>
1) Customer Segmentation: The existing customer dataset is analyzed and, on this basis, a general recommendation is derived as to which people in Germany are most likely to become new customers of the company. <br><br>
2) Modelling Campaign-Responses: The results of 1) are used to build a machine learning model that predicts whether or not an individual will respond to the respective campaign.

This notebook focuses on the first main task.




## II. Methodology

* General description of how we'll proceed
* Short description of the datasets at hand
* Exploratory analysis of the two datasets
* Short plan of what needs to be done to clean the datasets for further use
* PCA on the larger dataset (general population)
* Application of the fitted PCA to the customer dataset
* Clustering
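
The last three steps of this outline could look roughly as follows. This is a minimal sketch, not the notebook's actual implementation: it assumes scikit-learn, median imputation, standard scaling, 3 principal components and 4 clusters, all of which are illustrative choices. It runs on synthetic stand-in data (`population_sample`, `customer_sample`) rather than the Arvato files.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real data (hypothetical, NOT the Arvato files).
rng = np.random.default_rng(0)
cols = [f"feat_{i}" for i in range(6)]
population_sample = pd.DataFrame(rng.normal(size=(1000, 6)), columns=cols)
# Customers are drawn from a shifted distribution so they concentrate in some clusters.
customer_sample = pd.DataFrame(rng.normal(loc=1.0, size=(300, 6)), columns=cols)
# Add a few missing values so the imputation step has something to do.
population_sample.iloc[::50, 0] = np.nan

# Fit imputation, scaling, PCA and clustering on the general population only.
n_clusters = 4  # assumed value for illustration
segmenter = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=3, random_state=0),
    KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
)
population_clusters = segmenter.fit_predict(population_sample)

# Apply the already fitted pipeline unchanged to the customers.
customer_clusters = segmenter.predict(customer_sample)

# Compare the share of each cluster among customers and in the population.
share = pd.DataFrame({
    "population": pd.Series(population_clusters).value_counts(normalize=True),
    "customers": pd.Series(customer_clusters).value_counts(normalize=True),
}).reindex(list(range(n_clusters))).fillna(0.0)
share["ratio"] = share["customers"] / share["population"]
print(share.sort_values("ratio", ascending=False))
```

Clusters with a ratio well above 1 are over-represented among customers. Under the assumptions of this sketch, people in the general population who fall into those clusters would be the more promising campaign targets.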

#### General description of the methodology

#### Import relevant libraries and load the data

In [2]:
#Import relevant libraries
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 999)
pd.set_option('display.max_colwidth', None)

import os

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [9]:
#load the relevant data
root_path = os.path.dirname(os.getcwd())
#build the paths with os.path.join so they work on any operating system
data_dir = os.path.join(root_path, 'data')

azdias = pd.read_csv(os.path.join(data_dir, 'Udacity_AZDIAS_052018.csv'), sep=';', low_memory=False)
customers = pd.read_csv(os.path.join(data_dir, 'Udacity_CUSTOMERS_052018.csv'), sep=';', low_memory=False)
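
Since the full files are large, it can help to read only the first rows while exploring. The following is a minimal sketch, assuming the same semicolon separator as above. `load_sample` is a hypothetical helper, and the in-memory CSV merely stands in for the real files.

```python
import io

import pandas as pd


def load_sample(source, nrows=1000):
    """Read only the first `nrows` records of a semicolon-separated file."""
    return pd.read_csv(source, sep=";", nrows=nrows, low_memory=False)


# In-memory stand-in for one of the data files (made-up values).
demo_csv = io.StringIO("LNR;AGER_TYP;CAMEO_DEU_2015\n1;-1;8A\n2;2;4C\n3;1;XX\n")
sample = load_sample(demo_csv, nrows=2)
print(sample.shape)  # (2, 3)
```

With the real data, `source` would be the path to one of the CSV files instead of the `StringIO` object.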

#### Description of the datasets at hand

There are four data files associated with this project:

1) `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
2) `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
3) `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
4) `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).
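
One way to check the stated dimensions once the files are loaded is to tabulate rows and columns per DataFrame. The helper `summarize_frames` below is a hypothetical name introduced for illustration, and the toy frames only stand in for the real data.

```python
import pandas as pd


def summarize_frames(frames):
    """Return one row per dataset with its row count, column count and memory use."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "dataset": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "memory_mb": df.memory_usage(deep=True).sum() / 1024**2,
        })
    return pd.DataFrame(rows).set_index("dataset")


# Toy frames standing in for the real datasets (made-up values).
toy_population = pd.DataFrame({"LNR": [1, 2, 3], "AGER_TYP": [-1, 2, 1]})
toy_customers = toy_population.head(2).assign(ONLINE_PURCHASE=[0, 1])

summary = summarize_frames({"azdias": toy_population, "customers": toy_customers})
print(summary[["rows", "columns"]])
```

With the real data, the call would be `summarize_frames({"azdias": azdias, "customers": customers})`.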

The first two files are relevant for the customer segmentation (first main task), whereas the latter two datasets are relevant for the second main task (Modelling Campaign-Responses).<br>

The azdias dataset and the customers dataset differ only slightly in their attributes (columns):

In [14]:
np.setdiff1d(
    np.array(customers.columns),
    np.array(azdias.columns),
    assume_unique=True
)

array(['PRODUCT_GROUP', 'CUSTOMER_GROUP', 'ONLINE_PURCHASE'], dtype=object)

Hence, because the two datasets share almost all of their columns, their column descriptions can be combined.<br>
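
To work with the shared attributes only, the columns can be split into those present in both datasets and those specific to the customers. This is a small sketch under assumptions: `split_columns` is a hypothetical helper, and the empty toy frames only mimic a few of the real column names.

```python
import pandas as pd


def split_columns(customers_df, population_df):
    """Split the customer columns into shared and customer-only ones, keeping their order."""
    population_cols = set(population_df.columns)
    shared = [c for c in customers_df.columns if c in population_cols]
    customer_only = [c for c in customers_df.columns if c not in population_cols]
    return shared, customer_only


# Toy frames with a handful of column names (no data needed for this check).
toy_population = pd.DataFrame(columns=["LNR", "AGER_TYP"])
toy_customers = pd.DataFrame(
    columns=["LNR", "AGER_TYP", "PRODUCT_GROUP", "CUSTOMER_GROUP", "ONLINE_PURCHASE"]
)

shared_cols, customer_only_cols = split_columns(toy_customers, toy_population)
print(shared_cols)
print(customer_only_cols)
```

Selecting `customers[shared_cols]` would then give a frame with the same attributes as `azdias`, which is a precondition for applying transformations fitted on the general population.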

In general, datasets 1) and 2) contain one record per individual person, described by a large number of attributes.

Arvato provided two Excel files containing column descriptions for the data relevant to the project. <br><br>

The columns can be distinguished by "information level" into:
* <span style="text-decoration: underline;"> **> Person:**</span> data describing the individual at hand, e.g. age, sex, financial typology, nationality etc.<br><br>
* <span style="text-decoration: underline;"> **> Household:**</span> data describing the circumstances within the household the individual lives in, e.g. number of persons in the household, academic titles in the household, children in the household, transaction activity in the household etc.<br><br>
* <span style="text-decoration: underline;"> **> Building:**</span> data describing the building in which the household is located, e.g. number of households in the building, type of building, neighbourhood-area indicator etc.<br><br>
* <span style="text-decoration: underline;"> **> Microcell (RR4_ID):**</span> data describing the individuals according to the CAMEO consumer classification system. <br><br>
* <span style="text-decoration: underline;"> **> Microcell (RR3_ID):**</span> data describing

In [13]:
azdias.columns.tolist()

['LNR',
 'AGER_TYP',
 'AKT_DAT_KL',
 'ALTER_HH',
 'ALTER_KIND1',
 'ALTER_KIND2',
 'ALTER_KIND3',
 'ALTER_KIND4',
 'ALTERSKATEGORIE_FEIN',
 'ANZ_HAUSHALTE_AKTIV',
 'ANZ_HH_TITEL',
 'ANZ_KINDER',
 'ANZ_PERSONEN',
 'ANZ_STATISTISCHE_HAUSHALTE',
 'ANZ_TITEL',
 'ARBEIT',
 'BALLRAUM',
 'CAMEO_DEU_2015',
 'CAMEO_DEUG_2015',
 'CAMEO_INTL_2015',
 'CJT_GESAMTTYP',
 'CJT_KATALOGNUTZER',
 'CJT_TYP_1',
 'CJT_TYP_2',
 'CJT_TYP_3',
 'CJT_TYP_4',
 'CJT_TYP_5',
 'CJT_TYP_6',
 'D19_BANKEN_ANZ_12',
 'D19_BANKEN_ANZ_24',
 'D19_BANKEN_DATUM',
 'D19_BANKEN_DIREKT',
 'D19_BANKEN_GROSS',
 'D19_BANKEN_LOKAL',
 'D19_BANKEN_OFFLINE_DATUM',
 'D19_BANKEN_ONLINE_DATUM',
 'D19_BANKEN_ONLINE_QUOTE_12',
 'D19_BANKEN_REST',
 'D19_BEKLEIDUNG_GEH',
 'D19_BEKLEIDUNG_REST',
 'D19_BILDUNG',
 'D19_BIO_OEKO',
 'D19_BUCH_CD',
 'D19_DIGIT_SERV',
 'D19_DROGERIEARTIKEL',
 'D19_ENERGIE',
 'D19_FREIZEIT',
 'D19_GARTEN',
 'D19_GESAMT_ANZ_12',
 'D19_GESAMT_ANZ_24',
 'D19_GESAMT_DATUM',
 'D19_GESAMT_OFFLINE_DATUM',
 'D19_GESAMT_ONLINE

#### Exploratory Data Analysis

#### Data Preparation

#### Principal Component Analysis

#### Clustering

## III. Results

## IV. Discussion

https://de.wikipedia.org/wiki/Arvato