## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.
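
As a minimal sketch of how the "TRAIN" partition can be used, the snippet below separates the "RESPONSE" column from the features and reports the share of positive responses. The `toy_df` frame and its values are made up for illustration, and treating `LNR` as a row identifier that should not be a feature is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the TRAIN partition (values are made up).
toy_df = pd.DataFrame({
    "LNR": [101, 102, 103, 104],
    "ALTER_HH": [8.0, 13.0, 9.0, 6.0],
    "RESPONSE": [0, 0, 1, 0],
})

# Target: whether the person became a customer after the campaign.
y = toy_df["RESPONSE"]

# Features: everything except the target and the (assumed) identifier column.
X = toy_df.drop(columns=["RESPONSE", "LNR"])

# Share of positive responses, useful for spotting class imbalance early.
positive_rate = y.mean()
print(f"Positive rate: {positive_rate:.2%}")
```

Checking the positive rate before training helps decide which evaluation metric is meaningful, since plain accuracy can be misleading when one class is rare.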

This notebook is divided into several blocks:

+ Data load
+ Data cleaning
+ Feature engineering
+ Data visualization
+ Model training
+ Model selection
+ Model tuning
+ Save results


In [7]:
#Import general libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import io

#Import SageMaker libraries
import boto3
import sagemaker

from sagemaker.content_types import CONTENT_TYPE_CSV

### 2.1 Data load

In [2]:
# boto3 client to get S3 data
s3_client = boto3.client('s3')
bucket_name='sagemaker-eu-west-1-848439228145'

In [4]:
import re 

# get a list of objects in the bucket
obj_list=s3_client.list_objects(Bucket=bucket_name)

def filter_csv(string):
    # Escape the dot so that only a literal ".csv" matches
    return re.search(r'\.csv', string)


files=[]
for contents in obj_list['Contents']:
    files.append(contents['Key'])
    
filtered_list = list(filter(filter_csv, files))
    
    
# print the CSV objects in the S3 bucket
print(filtered_list)

['Capstone/Udacity_AZDIAS_052018.csv', 'Capstone/Udacity_CUSTOMERS_052018.csv', 'Capstone/Udacity_MAILOUT_052018_TEST.csv', 'Capstone/Udacity_MAILOUT_052018_TRAIN.csv', 'arvato/azdias.csv', 'arvato/customers.csv', 'arvato/transform/pca/transform/test/azdias.csv.out', 'arvato/transform/pca/transform/test/customers.csv.out', 'test/customers.csv.out']
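
The cells that follow pick files by position (`filtered_list[3]`, `filtered_list[2]`), which depends on the order of the bucket listing. A hedged alternative is to select each key by its name. `find_key` is a hypothetical helper, not part of the original notebook.

```python
def find_key(keys, suffix):
    """Return the single key in `keys` that ends with `suffix`."""
    matches = [key for key in keys if key.endswith(suffix)]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one key ending in {suffix!r}, found {len(matches)}"
        )
    return matches[0]


# A few keys copied from the listing above.
keys = [
    "Capstone/Udacity_MAILOUT_052018_TEST.csv",
    "Capstone/Udacity_MAILOUT_052018_TRAIN.csv",
    "arvato/customers.csv",
]

train_key = find_key(keys, "MAILOUT_052018_TRAIN.csv")
test_key = find_key(keys, "MAILOUT_052018_TEST.csv")
print(train_key)
print(test_key)
```

Raising an error when zero or several keys match makes a changed bucket layout fail loudly instead of silently loading the wrong file.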


In [5]:
def load_dataframe_from_s3(s3_client, bucket, name):
    data_object = s3_client.get_object(Bucket=bucket, Key=name)
    data_body = data_object["Body"].read()
    data_stream = io.BytesIO(data_body)
    
    return pd.read_csv(data_stream, header=0, delimiter=",") 

In [30]:
mailout_train_df = None
mailout_train_df = load_dataframe_from_s3(s3_client, bucket_name, filtered_list[3])
mailout_train_df.head()



Unnamed: 0.1,Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,RESPONSE,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,0,1763,2,1.0,8.0,,,,,8.0,...,5.0,2.0,1.0,6.0,9.0,3.0,3,0,2,4
1,1,1771,1,4.0,13.0,,,,,13.0,...,1.0,2.0,1.0,4.0,9.0,7.0,1,0,2,3
2,2,1776,1,1.0,9.0,,,,,7.0,...,6.0,4.0,2.0,,9.0,2.0,3,0,1,4
3,3,1460,2,1.0,6.0,,,,,6.0,...,8.0,11.0,11.0,6.0,9.0,1.0,3,0,2,4
4,4,1783,2,1.0,9.0,,,,,9.0,...,2.0,2.0,1.0,6.0,9.0,3.0,3,0,1,3
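
The `Unnamed: 0` and `Unnamed: 0.1` columns look like row indices that were written into the CSV by earlier saves (an assumption, since the source does not say where they come from). If so, they carry no demographic information and could be removed before modelling. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame mimicking the leftover index columns seen above.
toy_df = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "LNR": [101, 102],
    "AGER_TYP": [2, 1],
})

# Boolean mask marking columns whose name starts with "Unnamed".
is_leftover_index = toy_df.columns.str.startswith("Unnamed")

# Keep only the remaining columns.
clean_df = toy_df.loc[:, ~is_leftover_index]
print(list(clean_df.columns))
```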


In [31]:
mailout_test_df = None
mailout_test_df = load_dataframe_from_s3(s3_client, bucket_name, filtered_list[2])
mailout_test_df.head()



Unnamed: 0.1,Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,...,VHN,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,0,1754,2,1.0,7.0,,,,,6.0,...,4.0,5.0,6.0,3.0,6.0,9.0,3.0,3,1,4
1,1,1770,-1,1.0,0.0,,,,,0.0,...,1.0,5.0,2.0,1.0,6.0,9.0,5.0,3,1,4
2,2,1465,2,9.0,16.0,,,,,11.0,...,3.0,9.0,6.0,3.0,2.0,9.0,4.0,3,2,4
3,3,1470,-1,7.0,0.0,,,,,0.0,...,2.0,6.0,6.0,3.0,,9.0,2.0,3,2,4
4,4,1478,1,1.0,21.0,,,,,13.0,...,1.0,2.0,4.0,3.0,3.0,9.0,7.0,4,2,4
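
Since the "TEST" partition withholds "RESPONSE", it can be worth confirming that this is the only column in which the two partitions differ. The sketch below shows one way to compare column sets. The two empty frames are hypothetical stand-ins for `mailout_train_df` and `mailout_test_df`.

```python
import pandas as pd

# Hypothetical stand-ins that only carry column names.
toy_train = pd.DataFrame(columns=["LNR", "AGER_TYP", "RESPONSE"])
toy_test = pd.DataFrame(columns=["LNR", "AGER_TYP"])

# Columns present in one partition but not in the other.
only_in_train = toy_train.columns.difference(toy_test.columns)
only_in_test = toy_test.columns.difference(toy_train.columns)

print("Only in train:", list(only_in_train))
print("Only in test:", list(only_in_test))
```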


### 2.1 Data cleaning

Find columns with more than 25% missing values.

In [22]:
rows = mailout_train_df.shape[0]
missing_mailout_train = mailout_train_df.isnull().sum().sort_values(ascending = False).divide(other = (rows/100))

display(missing_mailout_train.loc[missing_mailout_train > 25])

ALTER_KIND4     99.904567
ALTER_KIND3     99.594991
ALTER_KIND2     98.240305
ALTER_KIND1     95.372655
KK_KUNDENTYP    58.926493
EXTSEL992       37.121177
dtype: float64

In [15]:
rows = mailout_test_df.shape[0]
missing_mailout_test = mailout_test_df.isnull().sum().sort_values(ascending = False).divide(other = (rows/100))

display(missing_mailout_test.loc[missing_mailout_test > 25])

ALTER_KIND4     99.908949
ALTER_KIND3     99.530736
ALTER_KIND2     98.220998
ALTER_KIND1     95.300353
KK_KUNDENTYP    58.447926
EXTSEL992       36.908458
dtype: float64
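
The same percentages can be computed without a separate row count, because the mean of a boolean mask is the fraction of `True` values. The helper name `missing_percentages` and the toy data below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data: one mostly empty column, one partly empty, one complete.
toy_df = pd.DataFrame({
    "ALTER_KIND1": [np.nan, np.nan, np.nan, 10.0],
    "EXTSEL992": [np.nan, 3.0, 5.0, 7.0],
    "LNR": [1, 2, 3, 4],
})


def missing_percentages(df):
    """Percentage of missing values per column, highest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)


pct = missing_percentages(toy_df)
print(pct[pct > 25])
```

Because the function reads everything from the frame it receives, it gives the same result whichever dataframe was inspected last.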

#### 2.1.1 Drop columns with more than 60% missing values

In [25]:
# Find the columns whose share of missing values exceeds the threshold (in percent)
# and drop them from the dataframe in place
def dropMissingColumns(df, threshold = 20):
    # Use the row count of the dataframe passed in, not the global `rows`
    missing_percentages = df.isnull().sum().divide(other = (df.shape[0]/100))
    drop_columns = missing_percentages[missing_percentages > threshold]
    df.drop(columns = list(drop_columns.index), inplace = True)

In [32]:
dropMissingColumns(mailout_train_df, threshold = 60)
mailout_train_df.shape

(42962, 364)

In [33]:
dropMissingColumns(mailout_test_df, threshold = 60)
mailout_test_df.shape

(42833, 363)
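
The two shapes differ by one column, which is consistent with "RESPONSE" being absent from the "TEST" partition. To guarantee that both partitions keep the same feature columns even if their missing-value rates diverge, one option is to derive the drop list from "TRAIN" only and reuse it for "TEST". This is a sketch under that assumption, with a hypothetical helper `columns_above_threshold` and made-up data.

```python
import numpy as np
import pandas as pd


def columns_above_threshold(df, threshold=60):
    """Names of columns whose missing-value percentage exceeds `threshold`."""
    pct = df.isnull().mean() * 100
    return list(pct[pct > threshold].index)


toy_train = pd.DataFrame({
    "ALTER_KIND4": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "ALTER_HH": [8.0, np.nan, 9.0, 6.0],           # 25% missing
    "RESPONSE": [0, 0, 1, 0],
})
toy_test = pd.DataFrame({
    "ALTER_KIND4": [np.nan, 2.0, 3.0, 4.0],        # only 25% missing here
    "ALTER_HH": [7.0, 0.0, 16.0, np.nan],
})

# Decide on TRAIN, apply the same decision to TEST.
to_drop = columns_above_threshold(toy_train, threshold=60)
train_reduced = toy_train.drop(columns=to_drop)
test_reduced = toy_test.drop(columns=to_drop)

print(to_drop, train_reduced.shape, test_reduced.shape)
```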

#### 2.1.2 Impute missing values 

In [41]:
import sys
!{sys.executable} -m pip install scikit-learn --user --upgrade pip

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/5e/d8/312e03adf4c78663e17d802fe2440072376fee46cada1404f1727ed77a32/scikit_learn-0.22.2.post1-cp36-cp36m-manylinux1_x86_64.whl (7.1MB)
Collecting pip
  Downloading https://files.pythonhosted.org/packages/54/2e/df11ea7e23e7e761d484ed3740285a34e38548cf2bad2bed3dd5768ec8b9/pip-20.1-py2.py3-none-any.whl (1.5MB)
Requirement not upgraded as not directly required: scipy>=0.17.0 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages (from scikit-learn) (1.1.0)
Requirement not upgraded as not directly required: numpy>=1.11.0 in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages (from scikit-learn) (1.15.4)
Collecting joblib>=0.11 (from scikit-learn)
  Downloading https://files.pythonhosted.org/packages/28/5c/c

In [46]:
#from sklearn.experimental import enable_iterative_imputer
!{sys.executable} -m pip install -U scikit-learn 
import sklearn
# If this prints an older version than the one pip reports, the kernel probably needs a restart
print(sklearn.__version__)



Requirement already up-to-date: scikit-learn in /home/ec2-user/.local/lib/python3.6/site-packages (0.22.2.post1)
0.20.3
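
The commented-out import suggests that an iterative imputer was being considered, which the imported scikit-learn version (0.20.3) may not provide. As a simpler stand-in that needs only pandas, the sketch below fills missing values with per-column medians learned from the training data. This is an illustrative assumption, not the method the notebook settles on, and the toy frames are made up.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns with gaps.
toy_train = pd.DataFrame({
    "ALTER_HH": [8.0, np.nan, 9.0, 6.0],
    "WOHNLAGE": [3.0, 7.0, np.nan, 1.0],
})
toy_test = pd.DataFrame({
    "ALTER_HH": [np.nan, 0.0],
    "WOHNLAGE": [5.0, np.nan],
})

# Learn the fill values from TRAIN only, so that no information
# from TEST leaks into the preprocessing step.
medians = toy_train.median()

# Apply the same fill values to both partitions.
train_imputed = toy_train.fillna(medians)
test_imputed = toy_test.fillna(medians)

print(medians)
print(test_imputed)
```

Median imputation ignores relationships between columns, so it is best seen as a baseline to compare against more elaborate imputers.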
