# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned to a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to become customers of the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

In [1]:
import os
import sys

src_dir = os.path.join(os.getcwd(), '..', 'src')
sys.path.append(src_dir)
data_dir = os.path.join(os.getcwd(), '..', 'data')

# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# magic word for producing visualizations in notebook
%matplotlib inline

In [None]:
pd.set_option('display.max_columns', None)



In [125]:
def import_data(name,path):
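    '''
    Read the raw CSV file `name` from `path`/00_raw_data, strip the quote
    characters from the string columns and recode the 'X'/'XX' placeholders
    in the CAMEO columns.
    '''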
    
    spec_dtypes = {'CAMEO_DEU_2015': 'object','CAMEO_DEUG_2015': 'object','CAMEO_INTL_2015':'object'} 
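    # quoting=3 is csv.QUOTE_NONE; on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
    data = pd.read_csv(os.path.join(path, '00_raw_data', name), sep=';', dtype=spec_dtypes, on_bad_lines='skip', quoting=3)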
    
    # with quoting=3 the surrounding double quotes stay in the string values and have to be removed
    object_cols = data.select_dtypes(include=['object']).columns.to_list()
    for col in object_cols:
        data[col] = data[col].str.strip('"')
    
    # recode the 'X'/'XX' placeholders: -1 in the numeric CAMEO columns, NaN in CAMEO_DEU_2015
    for col in ['CAMEO_DEUG_2015','CAMEO_INTL_2015']:
        data[col]= pd.to_numeric(np.where(data[col].isin(['X','XX']),-1, data[col]))
    data['CAMEO_DEU_2015'] = np.where(data['CAMEO_DEU_2015'].isin(['X','XX']),np.nan, data['CAMEO_DEU_2015'])
    
    return data
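
The cleaning steps in `import_data` can be checked on a few rows before loading the full files. The following is a minimal sketch, not part of the original notebook: the three sample rows and the name `demo` are made up for illustration, and the file layout (semicolon-separated, string values wrapped in double quotes) is assumed from the outputs shown in this notebook.

```python
import csv
import io

import numpy as np
import pandas as pd

# Hypothetical sample rows in the assumed layout of the raw files.
sample = io.StringIO(
    'LNR;CAMEO_DEU_2015;CAMEO_DEUG_2015;CAMEO_INTL_2015\n'
    '1;"8A";"8";"51"\n'
    '2;"XX";"X";"XX"\n'
    '3;"4C";"4";"24"\n'
)

cameo_cols = ['CAMEO_DEU_2015', 'CAMEO_DEUG_2015', 'CAMEO_INTL_2015']
spec_dtypes = {col: 'object' for col in cameo_cols}

# QUOTE_NONE (the constant behind quoting=3) keeps the quote characters in the values.
demo = pd.read_csv(sample, sep=';', dtype=spec_dtypes, quoting=csv.QUOTE_NONE)

# Remove the surrounding double quotes from the string columns.
for col in cameo_cols:
    demo[col] = demo[col].str.strip('"')

# Recode the placeholders: -1 in the numeric columns, NaN in the categorical one.
for col in ['CAMEO_DEUG_2015', 'CAMEO_INTL_2015']:
    is_placeholder = demo[col].isin(['X', 'XX'])
    demo[col] = pd.to_numeric(np.where(is_placeholder, -1, demo[col]))

is_placeholder = demo['CAMEO_DEU_2015'].isin(['X', 'XX'])
demo['CAMEO_DEU_2015'] = np.where(is_placeholder, np.nan, demo['CAMEO_DEU_2015'])

print(demo)
```

Coding the placeholder as -1 keeps the two numeric columns as integers, but -1 then has to be treated as a missing-value code in every later step.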




In [128]:
customers = import_data('Udacity_CUSTOMERS_052018.csv',data_dir)
azdiad = import_data('Udacity_AZDIAS_052018.csv',data_dir)


In [129]:
customers.select_dtypes(include=['object'])

Unnamed: 0,CAMEO_DEU_2015,D19_LETZTER_KAUF_BRANCHE,EINGEFUEGT_AM,OST_WEST_KZ,PRODUCT_GROUP,CUSTOMER_GROUP
0,1A,D19_UNBEKANNT,1992-02-12 00:00:00,W,COSMETIC_AND_FOOD,MULTI_BUYER
1,,D19_BANKEN_GROSS,,,FOOD,SINGLE_BUYER
2,5D,D19_UNBEKANNT,1992-02-10 00:00:00,W,COSMETIC_AND_FOOD,MULTI_BUYER
3,4C,D19_NAHRUNGSERGAENZUNG,1992-02-10 00:00:00,W,COSMETIC,MULTI_BUYER
4,7B,D19_SCHUHE,1992-02-12 00:00:00,W,FOOD,MULTI_BUYER
...,...,...,...,...,...,...
191647,1C,D19_BANKEN_REST,1992-02-10 00:00:00,W,COSMETIC_AND_FOOD,MULTI_BUYER
191648,5B,D19_UNBEKANNT,1997-03-06 00:00:00,W,COSMETIC,SINGLE_BUYER
191649,4D,D19_TECHNIK,1992-02-10 00:00:00,W,COSMETIC_AND_FOOD,MULTI_BUYER
191650,4C,D19_BANKEN_REST,1992-02-10 00:00:00,W,FOOD,SINGLE_BUYER


In [122]:
azdiad['D19_LETZTER_KAUF_BRANCHE'].str.strip('"')



0                       NA
1                       NA
2            D19_UNBEKANNT
3            D19_UNBEKANNT
4               D19_SCHUHE
                ...       
638576    D19_TELKO_MOBILE
638577       D19_UNBEKANNT
638578    D19_VERSAND_REST
638579                  NA
638580                   D
Name: D19_LETZTER_KAUF_BRANCHE, Length: 638581, dtype: object

In [124]:
azdiad.select_dtypes(include=['object'])

Unnamed: 0,CAMEO_DEU_2015,CAMEO_DEUG_2015,CAMEO_INTL_2015,D19_LETZTER_KAUF_BRANCHE,EINGEFUEGT_AM,OST_WEST_KZ
0,,,,,,
1,8A,8,51,,1992-02-10 00:00:00,W
2,4C,4,24,D19_UNBEKANNT,1992-02-12 00:00:00,W
3,2A,2,12,D19_UNBEKANNT,1997-04-21 00:00:00,W
4,6B,6,43,D19_SCHUHE,1992-02-12 00:00:00,W
...,...,...,...,...,...,...
638576,9C,9,51,D19_TELKO_MOBILE,1992-02-12 00:00:00,W
638577,6A,6,31,D19_UNBEKANNT,1992-02-10 00:00:00,W
638578,8A,8,51,D19_VERSAND_REST,1992-02-12 00:00:00,W
638579,1B,1,14,,1992-02-12 00:00:00,W


In [123]:
object_cols = azdiad.select_dtypes(include=['object']).columns.to_list()
for col in object_cols:
    #azdiad[col] = azdiad[col].str.replace('""',"")
    azdiad[col] = azdiad[col].str.strip('"')

#azdiad.groupby('CAMEO_DEU_2015').size()
#azdiad.groupby('CAMEO_DEUG_2015').size()
#azdiad.groupby('CAMEO_INTL_2015').size()

#azdiad['EINGEFUEGT_AM'].str.replace('""',"")


In [25]:
#engine='python' 
#azdias = pd.read_csv(os.path.join(data_dir,'00_raw_data/Udacity_AZDIAS_052018.csv'),sep=';',quoting=3, error_bad_lines=False)
spec_dtypes = {'CAMEO_DEU_2015': 'object','CAMEO_DEUG_2015': 'object','CAMEO_INTL_2015':'object'}
#customers = pd.read_csv(os.path.join(data_dir,'00_raw_data/Udacity_CUSTOMERS_052018.csv'), sep=';',dtype = spec_dtypes
 



In [63]:
#customers2.select_dtypes(include=['object']).head()
customers.select_dtypes(include=['object']).head()




Unnamed: 0,CAMEO_DEU_2015,D19_LETZTER_KAUF_BRANCHE,EINGEFUEGT_AM,OST_WEST_KZ,PRODUCT_GROUP,CUSTOMER_GROUP
0,1A,D19_UNBEKANNT,1992-02-12 00:00:00,W,COSMETIC_AND_FOOD,MULTI_BUYER
1,,D19_BANKEN_GROSS,,,FOOD,SINGLE_BUYER
2,5D,D19_UNBEKANNT,1992-02-10 00:00:00,W,COSMETIC_AND_FOOD,MULTI_BUYER
3,4C,D19_NAHRUNGSERGAENZUNG,1992-02-10 00:00:00,W,COSMETIC,MULTI_BUYER
4,7B,D19_SCHUHE,1992-02-12 00:00:00,W,FOOD,MULTI_BUYER


In [41]:
customers2.select_dtypes(include=['object']).head()

## Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.
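
One possible way to make this comparison concrete is to cluster the general population and then check how the customers are distributed over the same clusters. The sketch below is only an illustration under several assumptions: it uses scikit-learn (not imported anywhere in this notebook), random synthetic data in place of the cleaned and fully numeric feature matrices, and arbitrary choices for the number of components and clusters. The names `population` and `customer_sample` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the cleaned feature matrices (assumption).
population = rng.normal(size=(1000, 6))
customer_sample = rng.normal(loc=0.8, size=(300, 6))

# Fit every step on the general population only, then apply the same
# transformations to the customers so that both share one feature space.
scaler = StandardScaler().fit(population)
pca = PCA(n_components=3, random_state=0).fit(scaler.transform(population))
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)

pop_labels = kmeans.fit_predict(pca.transform(scaler.transform(population)))
cust_labels = kmeans.predict(pca.transform(scaler.transform(customer_sample)))

# Share of each group that falls into each cluster.
pop_share = np.bincount(pop_labels, minlength=4) / len(pop_labels)
cust_share = np.bincount(cust_labels, minlength=4) / len(cust_labels)

# A ratio above 1 marks a cluster that is over-represented among customers.
ratio = cust_share / pop_share
print(np.round(ratio, 2))
```

Clusters with a ratio well above 1 describe segments of the population that resemble the existing customer base; clusters with a ratio well below 1 describe segments that are less likely to contain customers.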