In [1]:
# dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# IPython magic: render matplotlib figures inline in the notebook
%matplotlib inline

# Objective

Create a customer segmentation report based on census information and sales data from a mail-order company, using demographic information to determine how customers differ from the general population. Then use this analysis to predict which members of the general population are more likely to become customers of the mail-order company, <b>based on an unsupervised machine learning model</b>.

Based on this report, the company will be able to define a marketing strategy to reach more consumers.

# 1. Metadata 

Let's talk about the data files available for this project:

- Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail order company; 191 652 persons (rows) x 369 features (columns).
- Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
- Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar or different from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, <b>"RESPONSE"</b>, which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.
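One way to confirm which columns are specific to a file is to compare the column sets of two frames. The sketch below is only an illustration: `azdias_toy`, `customers_toy`, `FEATURE_A`, `FEATURE_B` and the helper `extra_columns` are hypothetical names with made-up values, standing in for the real data.

```python
import pandas as pd

# Toy stand-ins for the real frames; FEATURE_A / FEATURE_B are made-up column names.
azdias_toy = pd.DataFrame({"FEATURE_A": [1, 2], "FEATURE_B": [3, 4]})
customers_toy = pd.DataFrame({
    "FEATURE_A": [1],
    "FEATURE_B": [2],
    "CUSTOMER_GROUP": ["x"],
    "ONLINE_PURCHASE": [0],
    "PRODUCT_GROUP": ["y"],
})


def extra_columns(df, reference):
    """Return the columns of df that are absent from reference, in df's order."""
    return [col for col in df.columns if col not in reference.columns]


extra = extra_columns(customers_toy, azdias_toy)
print(extra)
```

Applied to the real frames once they are loaded, the same helper should list the three customer-only columns described above, assuming the files match their description.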

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. One of them is a top-level list of attributes and descriptions, organized by informational category. The other is a detailed mapping of data values for each feature in alphabetical order.

For every row, the DIAS attribute information includes the following mindset typologies:

- social minded
- familiar minded
- religious
- material minded
- dreamily
- sensual minded
- eventful orientated
- cultural minded
- rational minded
- critical minded
- dominant minded
- fightfull attitude
- traditional minded

In a cell below, we've provided some initial code to load the first two datasets. Note that all of the .csv data files in this project are semicolon (;) delimited, so an additional argument has been included in the read_csv() call to read the data properly. Also, given the size of the datasets, they may take some time to load completely.
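To see why the delimiter argument matters, here is a minimal sketch on a two-row in-memory sample. `COL_A`, `COL_B` and the values are made up and only mimic the semicolon-delimited layout of the project files.

```python
import io

import pandas as pd

# Two-row sample in a semicolon-delimited layout; COL_A / COL_B are made-up names.
sample = "COL_A;COL_B\n1;2\n3;4\n"

# Default separator is a comma, so each line is read as one fused column.
default_sep = pd.read_csv(io.StringIO(sample))

# With sep=';' the two columns are split correctly.
semicolon = pd.read_csv(io.StringIO(sample), sep=";")

print(default_sep.shape)
print(semicolon.shape)
```

Comparing the two shapes is a quick sanity check after loading: a frame with a single column usually means the wrong separator was used.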

# The approach used to understand the datasets can be split into the following steps:
## Extract, Transform and Load
<b>Extract</b>: Gathering the information from every dataset.

<b>Transform</b>: Data cleaning, summarization, selection, joining, filtering and aggregating.

<b>Load</b>: Store the data in a relational or non-relational database, locally or in AWS.
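The three stages can be sketched end to end on toy data. This is only an illustration under assumptions: the functions `extract`, `transform` and `load`, the table name `demo`, the sample values, and the choice of an in-memory SQLite database as the local relational store are all hypothetical, not the project's actual pipeline.

```python
import io
import sqlite3

import pandas as pd


def extract(source):
    """Read a semicolon-delimited CSV (path or file-like object) into a DataFrame."""
    return pd.read_csv(source, sep=";")


def transform(df):
    """Example cleaning step: drop exact duplicate rows and rows that are entirely empty."""
    return df.drop_duplicates().dropna(how="all").reset_index(drop=True)


def load(df, conn, table):
    """Write the frame to a SQL table, replacing the table if it already exists."""
    df.to_sql(table, conn, if_exists="replace", index=False)


# Made-up sample: one duplicated row and one row with a missing value.
raw = io.StringIO("COL_A;COL_B\n1;2\n1;2\n3;\n")

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the database
clean = transform(extract(raw))
load(clean, conn, "demo")

# Read the table back to confirm the round trip.
stored = pd.read_sql("SELECT * FROM demo", conn)
conn.close()
print(stored)
```

Keeping each stage in its own function makes it easier to swap the storage target later (for example, a hosted database) without touching the cleaning logic.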

1. Exploratory Data Analysis (EDA)
2. Unsupervised Machine Learning Model: Clustering analysis
3. Customer segmentation report
4. Supervised Machine Learning Model: 

# 1. Extract

In [None]:
# load in the data
azdias = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_AZDIAS_052018.csv', sep=';')
customers = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_CUSTOMERS_052018.csv', sep=';')