# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

In [None]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn import preprocessing
from collections import defaultdict
from kmodes.kmodes import KModes
import re
from sklearn.manifold import TSNE
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# magic word for producing visualizations in notebook
%matplotlib inline

## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the `.csv` data files in this project that they're semicolon (`;`) delimited, so an additional argument in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

In [None]:
# load in the data, specifying datatypes for columns 18 and 19, which have mixtures of datatypes, to speed up import
azdias = pd.read_csv('data/Udacity_AZDIAS_052018.csv', sep=';', dtype={18: object, 19: object})
customers = pd.read_csv('data/Udacity_CUSTOMERS_052018.csv', sep=';', dtype={18: object, 19: object})

# Create backup copies to avoid having to reimport
azdias_copy = azdias.copy()
customers_copy = customers.copy()

In [None]:
# Overview of customer data
customers.head()

In [None]:
# Overview of azdias data
azdias.head()

### Sort data types

In [None]:
# Find names of problematic 18th and 19th columns
customers.columns[18], customers.columns[19]
# According to the schema, these columns are the New German CAMEO Typology established together with Call Credit in late 2015

In [None]:
# Also according to the schema, there should be an additional associated CAMEO column. Find it so it can be investigated too
cameo_cols = [col for col in customers.columns if 'CAMEO' in col]
print(cameo_cols)
# 3rd Cameo column is CAMEO_DEU_2015

In [None]:
# First problematic column contains numbers and NaN values
customers['CAMEO_DEUG_2015'].head()

In [None]:
# Second problematic column contains numbers and NaN values
customers['CAMEO_INTL_2015'].head()

In [None]:
# The associated 3rd column is a mixture of number-letter combos and NaN values
customers['CAMEO_DEU_2015'].head()

In [None]:
# Change columns 18 and 19 to type float
customers['CAMEO_DEUG_2015'] = pd.to_numeric(customers['CAMEO_DEUG_2015'], errors='coerce')
customers['CAMEO_INTL_2015'] = pd.to_numeric(customers['CAMEO_INTL_2015'], errors='coerce')
azdias['CAMEO_DEUG_2015'] = pd.to_numeric(azdias['CAMEO_DEUG_2015'], errors='coerce')
azdias['CAMEO_INTL_2015'] = pd.to_numeric(azdias['CAMEO_INTL_2015'], errors='coerce')
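The effect of `errors='coerce'` in the cell above can be seen on a small made-up series. The values below are illustrative assumptions, not taken from the dataset: any entry that cannot be parsed as a number becomes `NaN`, and the result is a float column.

```python
import pandas as pd

# Hypothetical mixed-type column: numbers stored as strings plus a placeholder code
raw = pd.Series(['1', '2', 'X', None, '5'])

# Unparseable entries ('X') and missing entries (None) both become NaN
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 2
```

Because coercion silently discards anything non-numeric, it may be worth inspecting the distinct raw values first (for example with `value_counts(dropna=False)`) before converting.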

In [None]:
# Extract the letter from CAMEO_DEU_2015 into a new column and remove the original,
# whose leading digit duplicates CAMEO_DEUG_2015
customers['CAMEO_DEU_2015_let'] = customers['CAMEO_DEU_2015'].str[1]
customers = customers.drop(columns='CAMEO_DEU_2015')
azdias['CAMEO_DEU_2015_let'] = azdias['CAMEO_DEU_2015'].str[1]
azdias = azdias.drop(columns='CAMEO_DEU_2015')

In [None]:
# could show they are the same

In [None]:
# could show missing data is -1

In [None]:
# Based on schema provided, the majority of missing data is indicated by -1
# Ahead of imputation, transform these -1 values to NaNs
customers = customers.replace({-1:np.nan})
azdias = azdias.replace({-1:np.nan})
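One way to check how much of the data is encoded as -1 is to count the code per column before replacing it. The sketch below uses a made-up three-column frame (the column names and values are assumptions for illustration) and then computes the percentage of missing values per column in the same way as the following cells.

```python
import numpy as np
import pandas as pd

# Toy frame; -1 stands in for "unknown", as described in the schema
toy = pd.DataFrame({'a': [1, -1, 3, -1], 'b': [2, 2, -1, 4], 'c': [5, 6, 7, 8]})

# Count the -1 codes in each column before replacement
minus_one_counts = (toy == -1).sum()

# Replace the code with NaN, then compute the percentage missing per column
toy_nan = toy.replace({-1: np.nan})
missing_pct = toy_nan.isnull().mean() * 100

print(minus_one_counts.to_dict())  # {'a': 2, 'b': 1, 'c': 0}
print(missing_pct.to_dict())       # {'a': 50.0, 'b': 25.0, 'c': 0.0}
```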

In [None]:
# Find and visualise columns with most missing data in customers dataset
round(customers.isnull().sum(axis = 0)/customers.shape[0]*100,2).sort_values(ascending = False).head(20).plot(kind = 'bar', figsize=(20,10))
plt.title(" % of missing values per column in customers dataset", fontdict={'fontsize': 16})
plt.ylabel('% of missing values', fontdict={'fontsize': 12})
plt.xlabel('columns in dataset', fontdict={'fontsize': 12})

In [None]:
# Find columns with most missing data in azdias dataset
round(azdias.isnull().sum(axis = 0)/azdias.shape[0]*100,2).sort_values(ascending = False).head(20).plot(kind = 'bar', figsize=(20,10))
plt.title(" % of missing values per column in azdias dataset", fontdict={'fontsize': 16})
plt.ylabel('% of missing values', fontdict={'fontsize': 12})
plt.xlabel('columns in dataset', fontdict={'fontsize': 12})

In [None]:
# Remove columns with a high proportion of missing data
empty_cols = customers.columns[customers.isnull().sum(axis = 0)/customers.shape[0]*100 > 50]
customers = customers.drop(columns = empty_cols, axis=1)

empty_cols = azdias.columns[azdias.isnull().sum(axis = 0)/azdias.shape[0]*100 > 50]
azdias = azdias.drop(columns = empty_cols, axis=1)

In [None]:
# Remove rows with a high proportion of missing data
empty_rows = customers[customers.isnull().sum(axis = 1)/customers.shape[1]*100 > 80].index
customers = customers.drop(empty_rows, axis=0)

empty_rows = azdias[azdias.isnull().sum(axis = 1)/azdias.shape[1]*100 > 80].index
azdias = azdias.drop(empty_rows, axis=0)

In [None]:
# Create function which fills missing values with mode - using mode as data is categorical 
# e.g. ages are grouped into ranges rather than a field holding the age of an individual 
fill_mode = lambda col: col.fillna(col.mode()[0])
# Apply to all columns in customers dataset
customers = customers.apply(fill_mode, axis=0)
# Apply to all columns in azdias dataset
azdias = azdias.apply(fill_mode, axis=0)
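As a small illustration of the mode imputation above, here is the same `fill_mode` lambda applied to a made-up frame (the column names and values are assumptions). Each missing entry is replaced by the most frequent value of its own column, for numeric and string columns alike.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one string column, each with a missing entry
toy = pd.DataFrame({
    'age_band': [2, 2, np.nan, 3],
    'region': ['W', np.nan, 'O', 'W'],
})

# Same imputation rule as in the notebook: fill with the column's mode
fill_mode = lambda col: col.fillna(col.mode()[0])
filled = toy.apply(fill_mode, axis=0)

print(filled['age_band'].tolist())  # [2.0, 2.0, 2.0, 3.0]
print(filled['region'].tolist())    # ['W', 'W', 'O', 'W']
```

Note that `col.mode()[0]` fails on a column that is entirely missing, so this approach relies on mostly empty columns having been dropped beforehand.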

In [None]:
# One hot encode categorical data
object_columns = customers.columns[customers.dtypes == object]
customers_clean = pd.get_dummies(data=customers, columns=object_columns)   

object_columns = azdias.columns[azdias.dtypes == object]
azdias_clean = pd.get_dummies(data=azdias, columns=object_columns)

In [None]:
customers.head()

In [None]:
# Check number of missing values in customers dataset
round(customers_clean.isnull().sum(axis = 0)/customers_clean.shape[0]*100,2).sort_values(ascending = False).head(10)

In [None]:
# Check number of missing values in azdias dataset
round(azdias_clean.isnull().sum(axis = 0)/azdias_clean.shape[0]*100,2).sort_values(ascending = False).head(10)

In [None]:
# Put cleaning steps in function
def clean_data(df):
    '''
     INPUT:
     df - dataframe of demographics data (customers, general population or mailout)
     OUTPUT:
     df_clean - cleaned dataframe ready for segmentation or modelling

     Sorts columns with incorrect datatypes
     Extracts letter field from CAMEO_DEU_2015
     Removes columns and rows with a high proportion of missing data
     Imputes missing categorical data with the mode
     One hot encodes the remaining object columns
     '''

    # Work on a copy so the dataframe passed in is left untouched
    df = df.copy()
    # Sort columns with incorrect datatypes
    for col in ['CAMEO_DEUG_2015', 'CAMEO_INTL_2015']:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    # Extract letter field from CAMEO_DEU_2015 and remove original, whose leading digit duplicates CAMEO_DEUG_2015
    df['CAMEO_DEU_2015_let'] = df['CAMEO_DEU_2015'].str[1]
    df = df.drop(columns='CAMEO_DEU_2015')
    # Change -1 values to NaNs
    df = df.replace({-1: np.nan})
    # Remove EINGEFUEGT_AM which doesn't seem very helpful
    df = df.drop(columns=['EINGEFUEGT_AM'])
    # Remove columns with a high proportion of missing data
    empty_cols = df.columns[df.isnull().sum(axis=0) / df.shape[0] * 100 > 50]
    df = df.drop(columns=empty_cols)
    # Remove rows with a high proportion of missing data
    empty_rows = df[df.isnull().sum(axis=1) / df.shape[1] * 100 > 80].index
    df = df.drop(empty_rows, axis=0)
    # Impute missing data with the mode of each column
    fill_mode = lambda col: col.fillna(col.mode()[0])
    df = df.apply(fill_mode, axis=0)
    # One hot encode categorical data
    object_columns = df.columns[df.dtypes == object]
    df_clean = pd.get_dummies(data=df, columns=object_columns)

    return df_clean

In [None]:
# Check function is working, starting again from copies of the backups
customers = customers_copy.copy()
azdias = azdias_copy.copy()
customers_clean = clean_data(customers)
azdias_clean = clean_data(azdias)

In [None]:
round(customers_clean.isnull().sum(axis = 0)/customers_clean.shape[0]*100,2).sort_values(ascending = False).head(10)

In [None]:
round(azdias_clean.isnull().sum(axis = 0)/azdias_clean.shape[0]*100,2).sort_values(ascending = False).head(10)
# There are no longer any missing values in the datasets suggesting the function is working

In [None]:
# Keep only the columns which are present in both dataframes ahead of segmentation
common_cols = [col for col in customers_clean.columns if col in azdias_clean.columns]
customers_clean = customers_clean[common_cols]
azdias_clean = azdias_clean[common_cols]

In [None]:
customers_clean.shape

In [None]:
azdias_clean.shape
# Both datasets have the same number of columns and so are ready for segmentation steps
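The need for this alignment step can be seen on two made-up frames (names and values are assumptions for illustration). One-hot encoding creates a column for each category that actually occurs, so two datasets with different category values end up with different columns.

```python
import pandas as pd

# Two toy frames whose categorical column holds different values
a = pd.get_dummies(pd.DataFrame({'x': [1, 2], 'cat': ['A', 'B']}), columns=['cat'])
b = pd.get_dummies(pd.DataFrame({'x': [3, 4], 'cat': ['A', 'C']}), columns=['cat'])

# Keep only the columns present in both frames, in a shared order
common = [c for c in a.columns if c in b.columns]
a_aligned, b_aligned = a[common], b[common]

print(list(a.columns))  # ['x', 'cat_A', 'cat_B']
print(list(b.columns))  # ['x', 'cat_A', 'cat_C']
print(common)           # ['x', 'cat_A']
```

Selecting a shared, ordered column list in both frames matters later on, because a scaler or PCA fitted on one dataset expects the other to have the same features in the same order.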

In [None]:
##Old function
# def prepare_seg_data(customers_df, genpop_df):
#     '''
#      INPUT:
#      customers_df - dataframe of customer data
#      genpop_df - dataframe of data for the general population
#      OUTPUT:
#      customers_df_clean - dataframe of customer data ready for segmentation
#      genpop_df_clean - dataframe of data for the general population ready for segmentation
    
#      Removes columns which aren't present in both dataframes
#      Sorts columns with incorrect datatypes
#      Extract letter field from CAMEO_DEU_2015
#      Removes columns with high proportion of missing data
#      Impute missing categorical data
#      '''
     
#     for df in (customers_df, genpop_df):
#         # Sort columns with incorrect datatypes
#         df['CAMEO_DEUG_2015'] = pd.to_numeric(df['CAMEO_DEUG_2015'], errors='coerce')
#         df['CAMEO_INTL_2015'] = pd.to_numeric(df['CAMEO_INTL_2015'], errors='coerce')
#         #for col in (['CAMEO_DEUG_2015', 'CAMEO_INTL_2015']):
#         #    df[col] = pd.to_numeric(df['CAMEO_DEUG_2015'], errors='coerce')
#         # Extract letter field from CAMEO_DEU_2015 and remove original, which is a duplicate of CAMEO_DEUG_2015
#         df['CAMEO_DEU_2015_let'] = df['CAMEO_DEU_2015'].str[1]
#         df.drop(columns = 'CAMEO_DEU_2015', axis=1)
#         # Change -1 values to Nans
#         df = df.replace({-1:np.nan})
#         # Remove columns with a high proportion of missing data
#         empty_cols = df.columns[df.isnull().sum(axis = 0)/df.shape[0]*100 > 50]
#         df = df.drop(columns = empty_cols, axis=1)
#         # Remove rows with a high proportion of missing data
#         empty_rows = df[df.isnull().sum(axis = 1)/df.shape[1]*100 > 80].index
#         df = df.drop(empty_rows, axis=0)
#         ## Sort categorical data
#         # One hot encode categorical data
#         object_columns = df.columns[df.dtypes == object]
#         df = pd.get_dummies(data=df, columns=object_columns)
#         # Impute missing data with mode
#         # Create function which fills missing values with mode
#         fill_mode = lambda col: col.fillna(col.mode()[0])
#         # Apply to all columns
#         df = df.apply(fill_mode, axis=0)
        
        
#     # Removes columns which aren't present in both dataframes
#     uncommon_cols = set(customers_df.columns).symmetric_difference(set(genpop_df.columns))
#     customers_df = customers_df.drop(columns = uncommon_cols, axis=1)
    
#     customers_df_clean = customers_df
#     genpop_df_clean = genpop_df
    
#     return customers_df_clean, genpop_df_clean

## Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

### Standardise data

In [None]:
# Look at distribution of a subset of columns based on code on https://towardsdatascience.com/a-guide-to-pandas-and-matplotlib-for-data-exploration-56fad95f951c
customers_subset = customers_clean.iloc[: ,1:10]
sns.pairplot(customers_subset)

In [None]:
# Even when looking at the first 10 columns, many are not normally distributed and their scales differ
# K-means and PCA are sensitive to feature scale, therefore apply a scaler to standardize the data

In [None]:
# Declare scaler and fit it to the customers dataset
scaler = StandardScaler()
scaler.fit(customers_clean)
customers_scaled_features = scaler.transform(customers_clean)
customers_scaled = pd.DataFrame(customers_scaled_features, columns=customers_clean.columns)
customers_scaled.head()

# Apply the same scaler to the azdias dataset
azdias_scaled_features = scaler.transform(azdias_clean)
azdias_scaled = pd.DataFrame(azdias_scaled_features, columns=azdias_clean.columns)
azdias_scaled.head()
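The fit-on-one, transform-the-other pattern used above can be checked on two tiny arrays (made-up numbers, not project data). The second array is expressed in units of the first array's mean and standard deviation, so its own mean is generally not zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 'reference' plays the role of the dataset the scaler is fitted on
reference = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
other = np.array([[3.0, 30.0], [7.0, 70.0]])

scaler = StandardScaler().fit(reference)
reference_scaled = scaler.transform(reference)
other_scaled = scaler.transform(other)

# The fitted data has mean 0 per column; the other data is measured relative to it
print(reference_scaled.mean(axis=0))
print(other_scaled)
```

Here the first row of `other` equals the reference mean and therefore maps to zero, while the second row lies about 2.45 reference standard deviations above it.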

### Perform PCA

In [None]:
# There are a lot of columns in the dataset and it is likely that a lot will correlate
# Create visualisation of correlations between columns based on code on https://towardsdatascience.com/a-guide-to-pandas-and-matplotlib-for-data-exploration-56fad95f951c
corr = customers_subset.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap=sns.diverging_palette(220, 10, as_cmap=True))

In [None]:
# Looking at the visualisation above, ANZ_STATISTISCHE_HAUSHALTE correlates strongly with ANZ_HAUSHALTE_AKTIV
# and somewhat with ANZ_HH_TITEL
# Therefore apply principal component analysis to reduce dimensionality of the datasets

In [None]:
# Determine the number of components to use using link: https://towardsdatascience.com/an-approach-to-choosing-the-number-of-components-in-a-principal-component-analysis-pca-3b9f3d6e73fe
# Fit the PCA algorithm to the scaled customers data
pca = PCA().fit(customers_scaled)
# Plot the cumulative sum of the explained variance ratio
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance (fraction)')
plt.title('Customers Dataset Explained Variance')
plt.show()
# Plot suggests 250 components explain about 90% of the variance
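Instead of reading the number of components off the plot, the threshold can also be computed. The sketch below uses synthetic data (an assumption standing in for the scaled customers data) and finds the smallest number of components whose cumulative explained variance reaches 90%.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 90%
n_components = int(np.argmax(cumulative >= 0.90)) + 1
print(n_components, cumulative[n_components - 1])
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.90)`, which selects the number of components needed to explain that share of the variance.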

In [None]:
# Fit PCA on the customers dataset and transform it
pca = PCA(n_components=250)
customers_components = pca.fit_transform(customers_scaled)

# Apply the fitted PCA to the azdias dataset
azdias_components = pca.transform(azdias_scaled)

### Apply Kmeans clustering algorithm

In [None]:
#Determine number of clusters using elbow method - https://towardsdatascience.com/customer-segmentation-using-k-means-clustering-d33964f238c3
sse = []
for k in range(1,10):
    kmeans = KMeans(n_clusters=k, init="k-means++")
    kmeans.fit(customers_components)
    sse.append(kmeans.inertia_)
    
plt.figure(figsize=(12,6))    
plt.plot(range(1,10), sse)
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
# 6 clusters looks optimal
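Reading an elbow from the SSE curve is a judgment call. As a complementary check, and not something the analysis above uses, the silhouette score can be compared across values of k. The sketch below runs on synthetic blobs from `make_blobs`, which are an assumption standing in for the PCA components.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On data the size of the customers dataset the silhouette score is expensive to compute, so its `sample_size` argument may be needed to keep the runtime manageable.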

In [None]:
kmeans = KMeans(n_clusters=6)
customers_scaled['clusters'] = kmeans.fit_predict(customers_components)
azdias_scaled['clusters'] = kmeans.predict(azdias_components)

### Explore segments

In [None]:
# Visualise clustering results to see if there was a non-Euclidean shape to the data that k-means failed to pick up on
tsne = TSNE(n_components=2, random_state=1986)
twodim_arr = tsne.fit_transform(customers_scaled.iloc[:,:-1])

# Use one colour per cluster
groups = sorted(customers_scaled['clusters'].unique())
color = iter(plt.cm.rainbow(np.linspace(0, 1, len(groups))))

for group in groups:
    c = next(color)
    mask = (customers_scaled['clusters'] == group).to_numpy()
    plt.scatter(twodim_arr[mask, 0],
                twodim_arr[mask, 1],
                color=c,
                label=group)
plt.legend(title='cluster')

In [None]:
# Create boxplots showing how the clusters differ for a handful of features in the dataset
# Only the first four feature columns are plotted here; change the selection to inspect others
kpis = customers_scaled.columns[:4].tolist()
n_clusters = 6

fig, axes = plt.subplots(len(kpis), 1)
fig.subplots_adjust(hspace=0.5)
fig.set_figheight(16)
fig.set_figwidth(12)

for i, kpi in enumerate(kpis):
    axes[i].set_title(kpi)
    data = list()
    for j in range(n_clusters):
        data.append(customers_scaled[customers_scaled['clusters'] == j][kpi])
    axes[i].boxplot(data, labels=['cluster_%d' % j for j in range(n_clusters)])

In [None]:
# Calculate distribution of customer and genpop individuals across clusters
customer_perc = customers_scaled['clusters'].value_counts()/customers_scaled['clusters'].shape[0]*100
gen_pop_perc = azdias_scaled['clusters'].value_counts()/azdias_scaled['clusters'].shape[0]*100
distributions = {'Customers': customer_perc, 'Genpop': gen_pop_perc}
dists = pd.DataFrame(data=distributions)

# Visualise distributions
dists.plot(kind = 'bar', figsize=(20,10))
plt.ylabel("Percentage of population", fontdict={'fontsize': 12})
plt.xlabel("Cluster number", fontdict={'fontsize': 12})
plt.title("Distribution of customer and general population data across clusters", fontdict={'fontsize': 16})

Customers are more likely to be in clusters 0 and 3 and less likely to be in cluster 5 than the general population.
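A compact way to express this comparison is the ratio of each cluster's share among customers to its share in the general population. The sketch below uses made-up cluster labels (an assumption, not the project's results): a ratio above 1 marks a cluster that is over-represented among customers.

```python
import pandas as pd

# Hypothetical cluster labels for six customers and eight general-population rows
customer_clusters = pd.Series([0, 0, 0, 3, 3, 5])
genpop_clusters = pd.Series([0, 0, 3, 5, 5, 5, 5, 3])

customer_share = customer_clusters.value_counts(normalize=True)
genpop_share = genpop_clusters.value_counts(normalize=True)

# Ratio > 1: over-represented among customers; ratio < 1: under-represented
ratio = (customer_share / genpop_share).sort_index()
print(ratio.round(2).to_dict())  # {0: 2.0, 3: 1.33, 5: 0.33}
```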

In [None]:
# Find average score for clusters 0 and 3
cust_avg = customers_scaled[customers_scaled['clusters'].isin([0, 3])].mean()
genpop_avg = azdias_scaled[azdias_scaled['clusters'].isin([0, 3])].mean()

In [None]:
# Calculate difference between customer score and gen pop score
customer_scores = pd.concat([cust_avg, genpop_avg], axis=1).rename(columns={0: "customers", 1: "genpop"})
customer_scores['diff'] = abs(customer_scores['customers'] - customer_scores['genpop'])
customer_scores['diff'].sort_values(ascending = False).head(20).plot(kind = 'bar', figsize=(20,10))
plt.ylabel("Difference in average score", fontdict={'fontsize': 12})
plt.xlabel("Features", fontdict={'fontsize': 12})
plt.title("Chart to demonstrate the features which differ most between Customer and General Population", fontdict={'fontsize': 16})

From the above analysis, the 20 features that differ most between customers and the general population are listed below with their meanings. Several fields do not appear in the data descriptions, so their meanings are missing.

- 'LNR'
- 'AKT_DAT_KL'
- 'VK_ZG11'
- 'VK_DISTANZ'
- 'D19_KONSUMTYP' - consumption type
- 'CJT_TYP_3' - Customer Journey Typology - advertising-interested store shopper
- 'CJT_TYP_5' - Customer Journey Typology - advertising and cross-channel enthusiast
- 'CJT_TYP_6' - Customer Journey Typology - advertising enthusiast with restricted cross-channel behaviour
- 'ALTERSKATEGORIE_FEIN' - age classification - higher score = older
- 'CJT_TYP_4' - Customer Journey Typology - advertising-interested online shopper
- 'EINGEZOGENAM_HH_JAHR' - potentially related to year of birth
- 'VK_DHT4A'
- 'PRAEGENDE_JUGENDJAHRE' - dominating movement in the person's youth (avantgarde or mainstream) - higher score = later
- 'WOHNDAUER_2008' - length of residence - higher score = later
- 'FINANZ_MINIMALIST' - financial typology: low financial interest - higher score = low interest(?)
- 'CJT_KATALOGNUTZER'
- 'SEMIO_TRADV' - affinity indicating in what way the person is traditional minded - higher score = low affinity
- 'RT_SCHNAEPPCHEN'
- 'FINANZ_VORSORGER' - financial typology: be prepared - higher score = low
- 'HH_EINKOMMEN_SCORE' - estimated household net income - higher score = low
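Another way to interpret clusters, offered here as a sketch and not as the method used above, is to map the k-means cluster centres back from PCA space into the original feature units. The data, feature names and parameters below are synthetic assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for cleaned data: two groups that differ in 'f0' and 'f1'
features = ['f0', 'f1', 'f2']
group_a = rng.normal(loc=[0, 0, 5], scale=0.5, size=(100, 3))
group_b = rng.normal(loc=[10, 4, 5], scale=0.5, size=(100, 3))
df = pd.DataFrame(np.vstack([group_a, group_b]), columns=features)

# Same pipeline shape as the notebook: scale, reduce with PCA, cluster
scaler = StandardScaler()
scaled = scaler.fit_transform(df)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(components)

# Undo PCA, then undo scaling, to express the centres in original units
centres = scaler.inverse_transform(pca.inverse_transform(kmeans.cluster_centers_))
centres_df = pd.DataFrame(centres, columns=features)
print(centres_df.round(1))
```

Because PCA with fewer components than features discards some variance, the recovered centres are approximations, but they are usually close enough to show which features separate the clusters.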

## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [None]:
mailout_train = pd.read_csv('data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')

In [None]:
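# Clean the training data first: the models below cannot handle NaNs or string columns
mailout_clean = clean_data(mailout_train)
Y = mailout_clean['RESPONSE']
X = mailout_clean.drop(columns=['RESPONSE'])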

In [None]:
# Compare various models to find the one which gives highest accuracy using https://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/

In [None]:
# Prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

In [None]:
# Evaluate each model in turn
seed = 10
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

In [None]:
# Boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
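With very few positive responses, accuracy can look high for a model that almost never predicts a response. The sketch below contrasts accuracy with ROC AUC under stratified folds. The data comes from `make_classification`, and the 5% positive rate and model settings are assumptions for illustration, not properties of the MAILOUT data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, heavily imbalanced data (about 5% positives)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Stratified folds keep the class ratio roughly constant in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

accuracy = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
auc = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')

print("accuracy: %.3f" % accuracy.mean())
print("ROC AUC:  %.3f" % auc.mean())
```

ROC AUC measures how well the model ranks positives above negatives, so it is not inflated by the majority class in the way accuracy is.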

## Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link [here](http://www.kaggle.com/t/21e6d45d4c574c7fa2d868f0e8c83140), you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this might not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy does not seem to be an appropriate performance evaluation method. Instead, the competition will be using AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: only that the higher values try to capture as many of the actual customers as possible, early in the ROC curve sweep.
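A minimal sketch of building such a file is shown below. The training and test frames, the feature names and the choice of logistic regression are all hypothetical stand-ins; only the `LNR` and `RESPONSE` column names come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ['f0', 'f1', 'f2']  # hypothetical feature names

# Hypothetical stand-ins for the cleaned TRAIN and TEST partitions
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=features)
train['RESPONSE'] = (train['f0'] + rng.normal(scale=0.5, size=200) > 1).astype(int)
test = pd.DataFrame(rng.normal(size=(50, 3)), columns=features)
test.insert(0, 'LNR', np.arange(1000, 1050))

model = LogisticRegression().fit(train[features], train['RESPONSE'])

# Score each individual with the predicted probability of the positive class;
# AUC depends only on how these scores rank the individuals
submission = pd.DataFrame({
    'LNR': test['LNR'],
    'RESPONSE': model.predict_proba(test[features])[:, 1],
})

# Without a path, to_csv returns the CSV text; pass a filename to write to disk
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])  # LNR,RESPONSE
```

Submitting hard 0/1 class predictions instead of scores would discard the ranking information that AUC rewards.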

In [None]:
mailout_test = pd.read_csv('data/Udacity_MAILOUT_052018_TEST.csv', sep=';')