# DAT275x Principles of Machine Learning: Python Edition FINAL CHALLENGE

Let's begin by loading a few essential packages.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.options.display.max_columns = None

The necessary CSV files have already been loaded into this directory. We'll begin by loading the file containing all the features we'd like and displaying the first few rows.

In [3]:
customers = pd.read_csv('AdvWorksCusts.csv')
display(customers.shape)
display(customers.head())

(16519, 23)

Unnamed: 0,CustomerID,Title,FirstName,MiddleName,LastName,Suffix,AddressLine1,AddressLine2,City,StateProvinceName,CountryRegionName,PostalCode,PhoneNumber,BirthDate,Education,Occupation,Gender,MaritalStatus,HomeOwnerFlag,NumberCarsOwned,NumberChildrenAtHome,TotalChildren,YearlyIncome
0,11000,,Jon,V,Yang,,3761 N. 14th St,,Rockhampton,Queensland,Australia,4700,1 (11) 500 555-0162,1966-04-08,Bachelors,Professional,M,M,1,0,0,2,137947
1,11001,,Eugene,L,Huang,,2243 W St.,,Seaford,Victoria,Australia,3198,1 (11) 500 555-0110,1965-05-14,Bachelors,Professional,M,S,0,1,3,3,101141
2,11002,,Ruben,,Torres,,5844 Linden Land,,Hobart,Tasmania,Australia,7001,1 (11) 500 555-0184,1965-08-12,Bachelors,Professional,M,M,1,1,3,3,91945
3,11003,,Christy,,Zhu,,1825 Village Pl.,,North Ryde,New South Wales,Australia,2113,1 (11) 500 555-0162,1968-02-15,Bachelors,Professional,F,S,0,1,0,0,86688
4,11004,,Elizabeth,,Johnson,,7553 Harness Circle,,Wollongong,New South Wales,Australia,2500,1 (11) 500 555-0131,1968-08-08,Bachelors,Professional,F,S,1,4,5,5,92771


The assignment notes mentioned that there are some duplicate rows that need to be removed and that only the most recent row should be kept. "Recent" was not defined, but I'll assume that a higher index is a more recent entry, perhaps reflecting an update to a customer's previously entered information. We'll need to test this, however.

First of all, let's see how many duplicate customer IDs there are. This is the only column we can be guaranteed is a unique identifier for a customer.

In [4]:
customers_dupes = customers[customers['CustomerID'].duplicated(keep=False)]
customers_dupes.shape

(230, 23)

That's rather a lot of duplicates. Let's put them next to each other and display a few.

In [5]:
customers_dupes.sort_values(by='CustomerID').head(10)

Unnamed: 0,CustomerID,Title,FirstName,MiddleName,LastName,Suffix,AddressLine1,AddressLine2,City,StateProvinceName,CountryRegionName,PostalCode,PhoneNumber,BirthDate,Education,Occupation,Gender,MaritalStatus,HomeOwnerFlag,NumberCarsOwned,NumberChildrenAtHome,TotalChildren,YearlyIncome
5708,11041,,Amanda,M,Carter,,5826 Escobar,,Glendale,California,United States,91203,295-555-0145,1977-10-16,Partial College,Skilled Manual,F,M,1,2,0,0,78170
34,11041,,Amanda,M,Carter,,5826 Escobar,,Glendale,California,United States,91203,295-555-0145,1977-10-16,Partial College,Skilled Manual,F,M,1,2,0,0,78170
126,11143,,Jonathan,M,Henderson,,165 East Lane Road,,Lakewood,California,United States,90712,149-555-0113,1977-02-04,High School,Skilled Manual,M,M,1,2,0,0,43666
7907,11143,,Jonathan,M,Henderson,,165 East Lane Road,,Lakewood,California,United States,90712,149-555-0113,1977-02-04,High School,Skilled Manual,M,M,1,2,0,0,43666
12779,11172,,Gabrielle,J,Adams,,5621 Arcadia Pl.,,Lynnwood,Washington,United States,98036,403-555-0152,1967-11-21,Bachelors,Management,F,M,1,2,0,1,97616
151,11172,,Gabrielle,J,Adams,,5621 Arcadia Pl.,,Lynnwood,Washington,United States,98036,403-555-0152,1967-11-21,Bachelors,Management,F,M,1,2,0,1,97616
187,11210,,Edward,J,Wood,,1039 Adelaide St.,,West Covina,California,United States,91791,229-555-0114,1948-06-08,Partial College,Professional,M,M,1,2,1,4,87565
4376,11210,,Edward,J,Wood,,1039 Adelaide St.,,West Covina,California,United States,91791,229-555-0114,1948-06-08,Partial College,Professional,M,M,1,2,1,4,87565
13577,11218,,Olivia,,Brown,,3964 Stony Hill Circle,,Tacoma,Washington,United States,98403,414-555-0147,1950-09-11,Partial College,Professional,F,S,0,2,1,2,76220
194,11218,,Olivia,,Brown,,3964 Stony Hill Circle,,Tacoma,Washington,United States,98403,414-555-0147,1950-09-11,Partial College,Professional,F,S,0,2,1,2,76220


OK, but we still don't really know if there are any differences among the rows that share a customer ID. Looking at the number of unique values in each column might shed some light on things.

In [6]:
customers_dupes.nunique()

AttributeError: 'DataFrame' object has no attribute 'nunique'

There are 115 unique customer IDs among these rows, but 132 unique values for `AddressLine1` and 105 unique values for `PhoneNumber`. Because these counts do not match the number of IDs, it seems likely that some customers updated their information at some point. Some customers might have updated their yearly income as well, but identical incomes can occur naturally across different customers, so the count for that column is not safe evidence either way.
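The unique-value check can also be run in a way that does not depend on `DataFrame.nunique`, which the traceback above suggests is missing from the pandas release in use. The sketch below uses made-up rows (the `toy` frame and its values are purely illustrative) and additionally asks which columns actually vary within a single customer ID, which is a more direct test of the "updated information" idea.

```python
import pandas as pd

# Hypothetical rows: customer 1 appears twice with a changed address
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'AddressLine1': ['1 A St', '9 B Ave', '5 C Rd', '5 C Rd', '7 D Ln'],
    'PhoneNumber': ['555-0100', '555-0100', '555-0101', '555-0101', '555-0102'],
})

# Per-column unique counts via Series.nunique applied column by column
unique_counts = toy.apply(lambda col: col.nunique())

# Distinct values per customer in every other column
per_customer = toy.groupby('CustomerID').agg(lambda s: s.nunique())

# A column varies if at least one customer has more than one value in it
varying = per_customer.gt(1).any()
print(unique_counts)
print(varying)
```

Columns flagged in `varying` are the ones where duplicate rows genuinely disagree, so they are the ones worth inspecting by hand.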

With some confidence we can drop rows with duplicate customer IDs, keeping the row with the higher index.

In [None]:
customers.drop_duplicates(subset='CustomerID', keep='last', inplace=True)
customers.shape
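As a small illustration of what `keep='last'` does, here is a sketch on invented rows (the index labels and cities are hypothetical). It sorts by index first so that "last" really means "highest index", which is the assumption made above about recency.

```python
import pandas as pd

# Hypothetical duplicate: customer 11041 appears at index 34 and again at 5708
toy = pd.DataFrame(
    {'CustomerID': [11041, 11143, 11041],
     'City': ['Glendale', 'Lakewood', 'Pasadena']},
    index=[34, 126, 5708],
)

# Sort by index so the last occurrence of each ID is the highest-indexed row
deduped = toy.sort_index().drop_duplicates(subset='CustomerID', keep='last')
print(deduped)
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison.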

In [None]:
customers.describe()

The dataframe we have now is still incomplete. It lacks two important columns: one that tracks how much each customer spends per month at the store, and one that records whether or not the customer has purchased a bicycle. These features are located in two other CSV files.

In [None]:
customers_avg_spend = pd.read_csv('AW_AveMonthSpend.csv')
customers_bike_buyer = pd.read_csv('AW_BikeBuyer.csv')

In [None]:
print(customers_avg_spend.shape)
print(customers_bike_buyer.shape)

These two new dataframes also seem to contain duplicate customer IDs, so we'll run the same de-duplication on them as well.

In [None]:
for df in (customers_avg_spend, customers_bike_buyer):
    df.drop_duplicates(subset='CustomerID', keep='last', inplace=True)

print(customers_avg_spend.shape)
print(customers_bike_buyer.shape)

Now, let's show some summary stats for these two features.

In [None]:
display(customers_avg_spend.describe())
display(customers_bike_buyer.describe())

Just to make further typing a little easier, we can attach the two new features to the primary dataframe.

In [None]:
customer_data = customers.merge(customers_avg_spend, on='CustomerID').merge(customers_bike_buyer, on='CustomerID')
customer_data.head()
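Because the merge relies on each `CustomerID` appearing exactly once in every frame, it can be worth asking pandas to verify that. The sketch below assumes a pandas version whose `merge` accepts the `validate` argument (the release the notebook started with may be too old), and the frames are made-up stand-ins for the real ones.

```python
import pandas as pd

left = pd.DataFrame({'CustomerID': [1, 2, 3],
                     'YearlyIncome': [50000, 60000, 70000]})
right = pd.DataFrame({'CustomerID': [1, 2, 3],
                      'AveMonthSpend': [40, 55, 70]})

# Succeeds: every key is unique on both sides
merged = left.merge(right, on='CustomerID', how='inner', validate='one_to_one')

# Re-introduce a duplicate key to show the check failing
dup_right = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    left.merge(dup_right, on='CustomerID', validate='one_to_one')
    raised = False
except ValueError:  # pandas raises MergeError, a ValueError subclass
    raised = True
print(merged.shape, raised)
```

Comparing the merged row count with the row count of `customers` is a simpler alternative when `validate` is unavailable.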

Another time-saving adjustment will be to make another feature that calculates a customer's age in years as it was on January 1st, 1998.

In [None]:
customer_data['BirthDate'] = pd.to_datetime(customer_data['BirthDate'])
customer_data['Age'] = customer_data['BirthDate'].apply(lambda x: int((pd.to_datetime('1998-01-01') - x).days / 365))
customer_data['Age'].describe()
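Dividing the day count by 365 ignores leap days, so it can overstate the age by one year for birthdays that fall shortly after the reference date. A calendar-based alternative is sketched below; the helper name `age_on` and the sample dates are illustrative only.

```python
import pandas as pd

def age_on(birth_dates, reference):
    """Whole years elapsed between each birth date and the reference date."""
    reference = pd.Timestamp(reference)
    years = reference.year - birth_dates.dt.year
    # Subtract a year when the birthday has not yet occurred in the reference year
    not_yet = (birth_dates.dt.month > reference.month) | (
        (birth_dates.dt.month == reference.month)
        & (birth_dates.dt.day > reference.day)
    )
    return years - not_yet.astype(int)

births = pd.to_datetime(pd.Series(['1966-04-08', '1998-01-01', '1966-01-05']))
ages = age_on(births, '1998-01-01')

# The day-count approach used above, for comparison
approx = births.apply(lambda x: int((pd.Timestamp('1998-01-01') - x).days / 365))
print(ages.tolist(), approx.tolist())
```

For the third date the two methods disagree: the person turns 32 four days after the reference date, but the accumulated leap days push the day-count estimate to 32 already.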

And now for some plotting. Let's start with seeing the distribution of bike buyers and non-buyers.

In [None]:
sns.countplot(data=customer_data, x='BikeBuyer')

There are far fewer people who bought a bike than people who did not. This discrepancy is going to be important later: the boosted decision tree that we are going to use is sensitive to class imbalance.
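One common way to compensate for imbalance, not necessarily the approach taken later in this notebook, is to weight each row inversely to the frequency of its class and pass the result as `sample_weight` when fitting. The sketch below uses an invented label vector with an 80/20 split.

```python
import numpy as np

# Hypothetical labels: 8 non-buyers (0) and 2 buyers (1)
y = np.array([0] * 8 + [1] * 2)

counts = np.bincount(y)                           # rows per class
class_weights = len(y) / (len(counts) * counts)   # "balanced" weighting
sample_weight = class_weights[y]                  # one weight per row

print(class_weights)
print(sample_weight[y == 0].sum(), sample_weight[y == 1].sum())
```

With these weights both classes contribute the same total weight, so the minority class is no longer drowned out by sheer row count.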

And because the assignment has asked us to, let's take a look at median yearly income grouped by occupation.

In [None]:
customer_data.groupby(by='Occupation')['YearlyIncome'].median()

In [None]:
sns.scatterplot(data=customer_data, x='Age', y='AveMonthSpend', hue='Gender', alpha=0.5)

In [None]:
sns.barplot(data=customer_data, x='MaritalStatus', y='AveMonthSpend')

In [None]:
customer_data.groupby(by='MaritalStatus')['YearlyIncome'].median()

In [None]:
customer_data.groupby(by='NumberCarsOwned')['YearlyIncome'].median()

In [None]:
customer_data.groupby(by='Gender')['YearlyIncome'].median()

In [None]:
customer_data.groupby(by='NumberChildrenAtHome')['YearlyIncome'].median()

In [None]:
customer_data.groupby(by='BikeBuyer').median()

In [None]:
customer_data[customer_data['BikeBuyer']==1]['MaritalStatus'].mode()

## Classification

Our first machine learning task is to predict which customers are likely to purchase a bicycle. To that end, we will be using the `BikeBuyer` feature as our label. The machine learning model we will employ is the boosted decision tree, one of the most effective ensemble methods.
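To make the idea concrete before touching the real data, here is a minimal sketch of fitting a boosted tree ensemble with scikit-learn's `AdaBoostClassifier` on synthetic data. The dataset from `make_classification`, the 70/30 class weights, and the parameter values are all assumptions chosen for illustration, not settings taken from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
import sklearn.metrics as sklm

# Synthetic, mildly imbalanced binary problem standing in for BikeBuyer
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Default weak learner is a depth-1 decision tree (a "stump")
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)
accuracy = sklm.accuracy_score(y_te, pred)
print(accuracy)
print(sklm.confusion_matrix(y_te, pred))
```

On imbalanced labels the confusion matrix is usually more informative than accuracy alone, since a model that always predicts the majority class can still score well on accuracy.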



In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import sklearn.metrics as sklm

The first step is to split the data into training and testing divisions.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    customer_data.drop(['BikeBuyer'], axis=1),
    customer_data['BikeBuyer'],
    test_size=0.3)
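Given the class imbalance noted earlier, a plain random split can leave the two partitions with different buyer ratios, and without a seed the split changes on every run. The sketch below shows the `stratify` and `random_state` arguments of `train_test_split` on made-up arrays; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20% positive class
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,        # keep the class ratio equal in both partitions
    random_state=42,   # make the split reproducible
)
print(y_tr.mean(), y_te.mean())
```

Both partitions keep the 20% positive rate, which makes evaluation metrics on the test set easier to interpret.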

Let's create a feature processing function. It will include the dupe-dropping procedure from above along with steps to drop useless features, normalize numerical features, and one-hot encode categorical features.

In [None]:
customer_data[['Title', 'PostalCode']]

In [None]:
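def process(df):
    # With inplace=True, drop_duplicates returns None, so keep the returned copy instead
    df = df.drop_duplicates(subset='CustomerID', keep='last')
    # TODO: select the feature columns to keep (left unspecified in this draft)
    return df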

In [None]:
customer_data.info()

In [None]:
def process(df):
    df.drop(['CustomerID', 'Title',
             'FirstName', 'MiddleName',
             'LastName', 'Suffix',
             'AddressLine1', 'AddressLine2',
             'StateProvinceName', 'CountryRegionName',
             'PostalCode', 'PhoneNumber',
             'BirthDate'], axis=1, inplace=True)
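The remaining steps named earlier, normalizing numeric features and one-hot encoding categorical ones, could look roughly like the sketch below. The frames, the choice of `pd.get_dummies` plus `StandardScaler`, and the `categorical`/`numeric` column lists are assumptions for illustration; the key point is that the encoding layout and the scaling statistics come from the training rows only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test rows with a few of the notebook's column names
train = pd.DataFrame({'Occupation': ['Professional', 'Manual', 'Professional'],
                      'Gender': ['M', 'F', 'F'],
                      'YearlyIncome': [100000.0, 40000.0, 70000.0],
                      'Age': [32.0, 21.0, 45.0]})
test = pd.DataFrame({'Occupation': ['Manual'], 'Gender': ['M'],
                     'YearlyIncome': [70000.0], 'Age': [30.0]})

categorical = ['Occupation', 'Gender']
numeric = ['YearlyIncome', 'Age']

# One-hot encode, then align the test columns to the training layout
train_enc = pd.get_dummies(train, columns=categorical)
test_enc = pd.get_dummies(test, columns=categorical).reindex(
    columns=train_enc.columns, fill_value=0)

# Fit the scaler on training data only, then apply it to both partitions
scaler = StandardScaler().fit(train_enc[numeric])
train_enc[numeric] = scaler.transform(train_enc[numeric])
test_enc[numeric] = scaler.transform(test_enc[numeric])
print(test_enc)
```

Reindexing the test frame guards against categories that appear in only one partition, which would otherwise give the two frames different columns.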

In [None]:
!conda update -y numpy pandas scikit-learn seaborn

Fetching package metadata ...............
Solving package specifications: .

Package plan for installation in environment /home/nbuser/anaconda3_420:

The following NEW packages will be INSTALLED:

    mkl_fft:      1.0.6-py35h7dd41cf_0 
    mkl_random:   1.0.1-py35h4414c95_1 
    readline:     7.0-ha6073c6_4       

The following packages will be UPDATED:

    conda:        4.3.31-py35_0         --> 4.5.11-py35_0        
    conda-env:    2.6.0-h36134e3_1      --> 2.6.0-1              
    libgcc-ng:    7.2.0-h7cc24e2_2      --> 8.2.0-hdf63c60_1     
    libstdcxx-ng: 7.2.0-h7a57d05_2      --> 8.2.0-hdf63c60_1     
    numpy:        1.11.3-py35h1b885b7_9 --> 1.15.2-py35h1d66e8a_0
    numpy-base:   1.11.3-py35h3dfced4_9 --> 1.15.2-py35h81de0dd_0
    pandas:       0.19.2-np111py35_1    --> 0.23.4-py35h04863e7_0
    pycosat:      0.6.1-py35_1          --> 0.6.3-py35h14c3975_0 
    scikit-learn: 0.19.1-py35hbf1f462_0 --> 0.20.0-py35h4989274_1
    seaborn:      0.8-py35_0            --> 0.