# Predict Conversions from Quotes

**Goal**: Build a model to predict whether a quote will convert to a purchase.

**Data Sources**
* MySQL records of quote details and outcome
* PayPal records of flooring sample purchases
* Mailchimp subscriber list

In [10]:
import bd_mailchimp
import bd_paypal
import bd_mysql

import pandas as pd
import numpy as np

## Load data

In [2]:
# Load mailchimp data
csv_path = '/Users/lindsay/Documents/Data Science/BrazilianDirect/csv/mailchimp/members_export_21_march_2016.csv'
mail, first_email = bd_mailchimp.process_mailchimp(csv_path)

In [6]:
# Load paypal data
csv_dir = '/Users/lindsay/Documents/Data Science/BrazilianDirect/csv/paypal/'
paypal, first_sample = bd_paypal.process_paypal(csv_dir)

In [11]:
# Load MySQL data
config_path = '/Users/lindsay/Documents/Data Science/BrazilianDirect/cfg/mysql.cfg'
con = bd_mysql.connect_bd_mysql(config_path)
df = bd_mysql.download_quote_data(con)
df = bd_mysql.pre_process_mysql(df)

## Join data

In [14]:
# Left-join Mailchimp data onto the MySQL quotes so that every quote is kept
df_all = pd.merge(df, mail, how='left', on='email')

# Left-join PayPal sample purchases the same way
df_all = pd.merge(df_all, paypal, how='left', on='email')

In [15]:
# Quotes with no match in Mailchimp or PayPal come out of the left joins as NaN; replace with 0
df_all['mail_chimp'] = df_all['mail_chimp'].fillna(value=0)
df_all['samples'] = df_all['samples'].fillna(value=0)
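
The join and fill steps above can be checked on toy data. The sketch below is illustrative only: the `quotes` and `subscribers` frames and their values are hypothetical stand-ins, and the `validate` and `indicator` arguments are additions that the notebook itself does not use. It assumes the right-hand table holds at most one row per email.

```python
import pandas as pd

# Toy stand-ins for the quote table and the Mailchimp table (hypothetical values).
quotes = pd.DataFrame({
    'quote_id': [1, 2, 3],
    'email': ['a@example.com', 'b@example.com', 'a@example.com'],
})
subscribers = pd.DataFrame({
    'email': ['a@example.com'],
    'mail_chimp': [1],
})

# validate='many_to_one' raises a MergeError if an email repeats on the right,
# which would otherwise silently duplicate quote rows.
# indicator=True adds a '_merge' column showing which rows found a match.
merged = pd.merge(quotes, subscribers, how='left', on='email',
                  validate='many_to_one', indicator=True)

# A left join with unique right-hand keys keeps every quote exactly once.
assert len(merged) == len(quotes)

# Unmatched quotes carry NaN after the join; fill with 0 as in the cell above.
merged['mail_chimp'] = merged['mail_chimp'].fillna(0).astype(int)
```

If the real `mail` or `paypal` frames can contain repeated emails, the `validate` check would fail, which is a prompt to aggregate them to one row per email before joining.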

## Filter out quotes created before the Mailchimp and sample records begin

In [16]:
# pd.datetime was removed in pandas 1.0; build the cutoff date from a Timestamp instead
earliest_date = pd.Timestamp(max(first_email, first_sample)).date()
df_all = df_all.loc[df_all['date_created'] >= earliest_date, :]
df_all.shape

(20821, 26)
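
As a rough illustration of the cutoff logic, the sketch below uses hypothetical dates and a toy `quotes` frame. It assumes `date_created` is a datetime column and that the two first-seen values are timestamps; the notebook's helper modules may return different types.

```python
import pandas as pd

# Hypothetical first-seen dates for each source.
first_email = pd.Timestamp('2015-03-01')
first_sample = pd.Timestamp('2015-06-15')

# The later of the two dates is the first day on which both sources have data.
earliest_date = max(first_email, first_sample)

quotes = pd.DataFrame({
    'quote_id': [1, 2, 3],
    'date_created': pd.to_datetime(['2015-01-10', '2015-06-15', '2016-02-01']),
})

# The boolean mask keeps quotes created on or after the cutoff.
kept = quotes.loc[quotes['date_created'] >= earliest_date]
```

Taking the maximum rather than the minimum means no retained quote predates either source, so a 0 in `mail_chimp` or `samples` is not just an artifact of missing history.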

## Drop columns that won't be used for predictions

* `quote_id`: unique identifier
* `email`: nearly unique identifier (not many repeat customers)
* `date_created`: too fine-grained to use
* `days_until_needed`: transformed into a binned variable
* `ship_state`: transformed into a grouped variable for regions
* `install_subfloor`: too many missing values
* `sq_ft`: transformed into a binned variable
* `milling`: perfectly correlated with `finish` (unfinished = square edge, prefinished = micro bevel)
* `year`: interested only in monthly seasonality
* `state_division`: will use the regional divisions, which have fewer categories
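
One way to confirm a redundancy like the `milling`/`finish` pairing noted in the list is to count the distinct `milling` values per `finish`. This is a minimal sketch on hypothetical rows, not the check used in the notebook.

```python
import pandas as pd

# Toy data following the pairing described in the list (hypothetical rows).
quotes = pd.DataFrame({
    'finish': ['unfinished', 'prefinished', 'unfinished', 'prefinished'],
    'milling': ['square edge', 'micro bevel', 'square edge', 'micro bevel'],
})

# Count how many distinct milling values occur for each finish.
milling_per_finish = quotes.groupby('finish')['milling'].nunique()

# If every finish maps to exactly one milling value, milling adds no
# information beyond finish and can be dropped.
is_redundant = bool((milling_per_finish == 1).all())
```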

In [18]:
drop_cols = ['quote_id',
             'email',
             'date_created',
             'days_until_needed',
             'ship_state',
             'install_subfloor',
             'sq_ft',
             'milling',
             'year',
             'state_division']
df_all = df_all.drop(columns=drop_cols)
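
Several dropped columns are described above as having been transformed into binned or grouped variables. The sketch below shows how such a transformation might look before the raw columns are dropped. The bin edges, labels, region lookup, and the `sq_ft_bin` and `region` column names are all hypothetical; the notebook does not show the real ones.

```python
import pandas as pd

quotes = pd.DataFrame({
    'sq_ft': [150, 800, 2500],
    'ship_state': ['CA', 'NY', 'TX'],
})

# Hypothetical bin edges and labels; intervals are right-inclusive by default.
edges = [0, 500, 1000, float('inf')]
labels = ['small', 'medium', 'large']
quotes['sq_ft_bin'] = pd.cut(quotes['sq_ft'], bins=edges, labels=labels)

# Hypothetical, partial state-to-region lookup.
region_of = {'CA': 'West', 'NY': 'Northeast', 'TX': 'South'}
quotes['region'] = quotes['ship_state'].map(region_of)

# With the derived columns in place, the raw columns can be dropped.
features = quotes.drop(columns=['sq_ft', 'ship_state'])
```

Binning and grouping cut the number of categories the model has to learn, which matters most for a column like `ship_state` that has many sparsely populated levels.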