# Business Understanding and Set-up
Purpose: Ask relevant questions and define objectives for the problem that needs to be tackled
## Key Question
Kickstarter has tasked us with building a model that predicts viable funding targets for projects, so that it can give project creators a good estimate (particularly since Kickstarter itself can influence certain parameters, such as staff picks).
* What would be a reasonable target **after the project creator enters key parameters**?
* What would be a reasonable adapted target **after Kickstarter specifies its parameters (e.g. spotlight, staff pick)**?


## Glossary
* **TARGET: SUCCESS** - New column with 1 for success, 0 for fail (alternative for target: converted_pledged_amount)
* **backers_count** - The number of supporters that actually invested in the project
* **blurb** - 
* **category** - Main category the project falls in (e.g. "food", "music")
* **converted_pledged_amount** - Pledged amount of USD realised at the deadline, converted from "pledged" via "static_usd_rate", rounded
* **country** - Country of origin of the project
* **created_at** - 
* **creator** - 
* **currency** - Currency of the project (e.g. USD, GBP)
* **currency_symbol** - 
* **currency_trailing_code** - 
* **current_currency** - 
* **deadline** - Deadline of the project (can be used to analyze timeframes)
* **disable_communication** - 
* **file** - 
* **friends** - 
* **fx_rate** - 
* **goal** - Amount of USD the project asked for initially
* **id** - Project ID
* **is_backing** - 
* **is_starrable** - 
* **is_starred** - 
* **launched_at** - Launch date of the project (can be used to analyze timeframes)
* **location** - 
* **name** - Name of the project
* **permissions** - 
* **photo** - 
* **pledged** - 
* **profile** - 
* **slug** - 
* **source_url** - 
* **spotlight** - 
* **staff_pick** - 
* **state** - Was the project successful at the end of the day? state is a categorical variable divided into the levels successful, failed, live, cancelled, undefined and suspended. For the sake of clarity, we will only look at whether a project was successful or failed (hence, we will remove all projects that are not classified as one of the two). Projects that failed or were successful make up around 88% of all projects.
* **state_changed_at** - 
* **static_usd_rate** - Rate of conversion from "currency" to USD
* **urls** - 
* **usd_pledged** - Pledged amount of USD realised at the deadline, converted from "pledged" via "static_usd_rate"
* **usd_type** - ??? (NaN where "current currency" is not USD)
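
The **state** entry above describes keeping only successful and failed projects and deriving a 1/0 target from them. A minimal sketch of that filter follows, assuming a tiny hypothetical frame (`toy`, `keep_states` and `toy_finished` are illustrative names, not part of the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "state": ["successful", "failed", "live", "canceled", "successful"],
})

keep_states = ["successful", "failed"]

# Share of rows that would survive the filter
share_kept = toy["state"].isin(keep_states).mean()

# Keep only finished projects and derive the 1/0 target
toy_finished = toy[toy["state"].isin(keep_states)].copy()
toy_finished["success"] = (toy_finished["state"] == "successful").astype(int)

print(f"kept {share_kept:.0%} of rows")
print(toy_finished)
```

Running the same `isin(...).mean()` check on the real data is one way to re-verify the share of roughly 88% quoted above.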

## Variable Description
* **data_raw** - Originally imported dataset
* **data** - Main working dataset containing cleaned and refined data
* **data_insp** - Copy of data before 3 Data Cleaning used during 3.1 Inspection
* **data_clean** - Copy of data after 3 Data Cleaning used during 4 Data Exploration
* **data_results** - Main working dataset incl. predicted values and residuals from selected regression model
* **data_results_1..** - Interim results of the regression models
* ...

## Outcome/Recommendations
- As per the key question, a) five high-grade and b) five sub-par investment opportunities have been identified
- Optimal months to buy houses and year-on-year price movements still to be analyzed

## To-Do-List
- Are the CSV files [monthly snapshot data](https://aito.ai/example-gallery/predict-and-explain-a-kickstarter-campaign-success/)? I.e., are the multiple ID entries just projects that span multiple months? If so, shouldn't there be more of them?
- Investigate TARGET: SUCCESS via "state": 
  - Should we drop all rows whose state is neither successful nor failed?
  - Is there a connection between "canceled", "live", "suspended" and the occurrence of multiple IDs?
- Features:
  - Column with avg backing per backer (converted_pledged_amount / backers_count)
  - Calculate the #th project that it is for the specific person? (might be more successful if already submitted previous projects)
- Is "conv. pledged amount" > "goal" the definitive criterion for "state"=successful?
- What is usd_type domestic vs international vs NaN?
- "File" stores the original csv file. Are there any issues with entries from particular files?
- Anything useful in "friends", "is_backing", "is_starred", "permissions"?
- Do we have duplicate projects? (yes, we have same IDs frequently. Maybe only keep the latest entry?)
- Create timeseries/timeline
- Gradient Descent?
- Lasso?
- Helpful links for solving: [Using ML to predict Kickstarter success](https://towardsdatascience.com/using-machine-learning-to-predict-kickstarter-success-e371ab56a743); 
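
Two of the to-do items above (keeping only the latest entry per project ID, and numbering each creator's projects) could be sketched as follows. This is only a sketch on hypothetical toy data: it assumes that `state_changed_at` identifies the most recent snapshot and that `creator_id` has already been extracted; `creator_project_no` is an illustrative column name.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the glossary
toy = pd.DataFrame({
    "id": [10, 10, 11, 12, 13],
    "creator_id": ["a", "a", "a", "b", "a"],
    "state": ["live", "successful", "failed", "successful", "successful"],
    "state_changed_at": pd.to_datetime(
        ["2019-01-01", "2019-02-01", "2018-05-01", "2017-03-01", "2019-06-01"]),
    "launched_at": pd.to_datetime(
        ["2018-12-01", "2018-12-01", "2018-04-01", "2017-02-01", "2019-05-01"]),
})

# Keep only the most recent snapshot of each project id
latest = (toy.sort_values("state_changed_at")
             .drop_duplicates(subset="id", keep="last"))

# Number each creator's projects in launch order (1 = first project)
latest = latest.sort_values("launched_at")
latest["creator_project_no"] = latest.groupby("creator_id").cumcount() + 1

print(latest[["id", "creator_id", "state", "creator_project_no"]])
```

Whether the latest row is really the right one to keep still depends on the open question of why IDs repeat in the first place.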

## Import libraries (overarching)

In [1]:
# Import overarching libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.gridspec import GridSpec
import scipy as sc
from scipy.stats import kstest
from scipy.stats import zscore
import seaborn as sns
import math

import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import r2_score

import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff

sns.set(style="white")
%matplotlib inline

#Subplots
from plotly.subplots import make_subplots
from numpy import median

## Dashboard

In [2]:
random_state = 100
test_size = 0.3

# Data Mining
Purpose: Gather and scrape the data necessary for the project

In [3]:
# Import libraries
from numpy import loadtxt
import os, glob

In [4]:
# Import multiple csv files and merge into one dataframe
all_files = glob.glob(os.path.join("data", "*.csv"))
all_df = []
for f in all_files:
    df = pd.read_csv(f, sep=',')
    df['file'] = os.path.basename(f)  # File name without directory (works with any path separator)
    all_df.append(df)
data_raw = pd.concat(all_df, ignore_index=True, sort=True)

In [31]:
# Assign data_raw to data
data = data_raw.copy()

# Data Cleaning
Purpose: Fix the inconsistencies within the data and handle the missing values

In [32]:
# Import libraries
from datetime import datetime
import ast

## Inspection
Purpose: Getting a good sense of the data

In [33]:
# Clean dataset for inspection
data_insp = data.copy()

In [34]:
# Display shape of "data"
data_insp.shape

(209222, 38)

In [35]:
# Display the first 6 rows of "data_insp"
pd.set_option('display.max_columns', 50)
data_insp.head(6)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,file,friends,fx_rate,goal,id,is_backing,is_starrable,is_starred,launched_at,location,name,permissions,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,315,Babalus Shoes,"{""id"":266,""name"":""Footwear"",""slug"":""fashion/fo...",28645,US,1541459205,"{""id"":2094277840,""name"":""Lucy Conroy"",""slug"":""...",USD,$,True,USD,1552539775,False,Kickstarter040.csv,,1.0,28000.0,2108505034,,False,,1548223375,"{""id"":2462429,""name"":""Novato"",""slug"":""novato-c...",Babalus Children's Shoes,,"{""key"":""assets/023/667/205/a565fde5382d6b53276...",28645.0,"{""id"":3508024,""project_id"":3508024,""state"":""in...",babalus-childrens-shoes,https://www.kickstarter.com/discover/categorie...,False,False,live,1548223375,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",28645.0,international
1,47,A colorful Dia de los Muertos themed oracle de...,"{""id"":273,""name"":""Playing Cards"",""slug"":""games...",1950,US,1501684093,"{""id"":723886115,""name"":""Lisa Vollrath"",""slug"":...",USD,$,True,USD,1504976459,False,Kickstarter040.csv,,1.0,1000.0,928751314,,False,,1502384459,"{""id"":2400549,""name"":""Euless"",""slug"":""euless-t...",The Ofrenda Oracle Deck,,"{""key"":""assets/017/766/989/dd9f18c773a8546d996...",1950.0,"{""id"":3094785,""project_id"":3094785,""state"":""ac...",the-ofrenda-oracle-deck,https://www.kickstarter.com/discover/categorie...,True,False,successful,1504976459,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1950.0,domestic
2,271,"Electra's long awaited, eclectic Debut Pop/Roc...","{""id"":43,""name"":""Rock"",""slug"":""music/rock"",""po...",22404,US,1348987533,"{""id"":323849677,""name"":""Electra"",""is_registere...",USD,$,True,USD,1371013395,False,Kickstarter040.csv,,1.0,15000.0,928014092,,False,,1368421395,"{""id"":2423474,""name"":""Hollywood"",""slug"":""holly...","Record Electra's Debut Album (Pop, Rock, Class...",,"{""key"":""assets/011/433/681/489fd66f7861fefd8c8...",22404.0,"{""id"":359847,""project_id"":359847,""state"":""inac...",record-electras-debut-album-pop-rock-classical,https://www.kickstarter.com/discover/categorie...,True,False,successful,1371013395,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",22404.0,international
3,3,The Mist of Tribunal is a turn-based card game...,"{""id"":273,""name"":""Playing Cards"",""slug"":""games...",165,GB,1483780271,"{""id"":196281496,""name"":""Artur Ordijanc (delete...",GBP,£,False,USD,1489425776,False,Kickstarter040.csv,,1.308394,10000.0,596091328,,False,,1484245376,"{""id"":475457,""name"":""Kaunas"",""slug"":""kaunas-ka...",The Mist of Tribunal - A Card Game,,"{""key"":""assets/015/091/198/216fbf1bdc3739e7971...",136.0,"{""id"":2825329,""project_id"":2825329,""state"":""in...",the-mist-of-tribunal-a-card-game,https://www.kickstarter.com/discover/categorie...,False,False,failed,1489425776,1.216066,"{""web"":{""project"":""https://www.kickstarter.com...",165.384934,domestic
4,3,"Livng with a brain impairment, what its like t...","{""id"":48,""name"":""Nonfiction"",""slug"":""publishin...",2820,US,1354817071,"{""id"":1178460181,""name"":""Dawn Johnston"",""is_re...",USD,$,True,USD,1357763527,False,Kickstarter040.csv,,1.0,2800.0,998516049,,False,,1355171527,"{""id"":2507703,""name"":""Traverse City"",""slug"":""t...",Help change the face of Brain Impairment,,"{""key"":""assets/011/457/844/37ba63d35fefaba76e9...",2820.0,"{""id"":417385,""project_id"":417385,""state"":""inac...",help-change-the-face-of-brain-impairment,https://www.kickstarter.com/discover/categorie...,True,False,successful,1357763527,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",2820.0,domestic
5,35,Annapolis Chamber Players is a non-for profit ...,"{""id"":36,""name"":""Classical Music"",""slug"":""musi...",3725,US,1414172150,"{""id"":682189804,""name"":""Annapolis Chamber Play...",USD,$,True,USD,1430533546,False,Kickstarter040.csv,,1.0,3500.0,1224600291,,False,,1427941546,"{""id"":2354877,""name"":""Annapolis"",""slug"":""annap...",Annapolis Chamber Music Project,,"{""key"":""assets/011/921/106/c9ad5416f0b588b37b7...",3725.0,"{""id"":1465941,""project_id"":1465941,""state"":""in...",annapolis-chamber-music-project,https://www.kickstarter.com/discover/categorie...,True,False,successful,1430533546,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",3725.0,domestic


In [36]:
# Display columns of "data"
data_insp.columns

Index(['backers_count', 'blurb', 'category', 'converted_pledged_amount',
       'country', 'created_at', 'creator', 'currency', 'currency_symbol',
       'currency_trailing_code', 'current_currency', 'deadline',
       'disable_communication', 'file', 'friends', 'fx_rate', 'goal', 'id',
       'is_backing', 'is_starrable', 'is_starred', 'launched_at', 'location',
       'name', 'permissions', 'photo', 'pledged', 'profile', 'slug',
       'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at',
       'static_usd_rate', 'urls', 'usd_pledged', 'usd_type'],
      dtype='object')

In [37]:
# Compare with random single csv file
data0 = pd.read_csv("data/Kickstarter015.csv")
data.shape  # Shape of the merged "data"; use data0.shape to inspect the single file

(209222, 38)

In [38]:
# Describe data (summary)
data_insp.describe().round(2)

Unnamed: 0,backers_count,converted_pledged_amount,created_at,deadline,fx_rate,goal,id,launched_at,pledged,state_changed_at,static_usd_rate,usd_pledged
count,209222.0,209222.0,209222.0,209222.0,209222.0,209222.0,209222.0,209222.0,209222.0,209222.0,209222.0,209222.0
mean,145.42,12892.9,1456089000.0,1463033000.0,0.99,49176.04,1073222000.0,1460206000.0,18814.03,1462838000.0,1.01,12892.13
std,885.97,88894.14,63397110.0,63056180.0,0.21,1179427.0,619805100.0,63090290.0,322959.62,62904210.0,0.23,88901.24
min,0.0,0.0,1240366000.0,1241334000.0,0.01,0.01,8624.0,1240603000.0,0.0,1241334000.0,0.01,0.0
25%,4.0,106.0,1413317000.0,1420607000.0,1.0,1500.0,535105400.0,1417639000.0,110.0,1420485000.0,1.0,106.0
50%,27.0,1537.0,1457895000.0,1464754000.0,1.0,5000.0,1074579000.0,1461924000.0,1556.0,1464709000.0,1.0,1537.36
75%,89.0,6548.0,1511595000.0,1519437000.0,1.0,15000.0,1609369000.0,1516694000.0,6887.2,1519366000.0,1.0,6550.0
max,105857.0,8596474.0,1552527000.0,1557721000.0,1.88,100000000.0,2147476000.0,1552537000.0,81030744.0,1552537000.0,1.72,8596474.58


In [39]:
# List datatypes (data.info())
data_insp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 38 columns):
backers_count               209222 non-null int64
blurb                       209214 non-null object
category                    209222 non-null object
converted_pledged_amount    209222 non-null int64
country                     209222 non-null object
created_at                  209222 non-null int64
creator                     209222 non-null object
currency                    209222 non-null object
currency_symbol             209222 non-null object
currency_trailing_code      209222 non-null bool
current_currency            209222 non-null object
deadline                    209222 non-null int64
disable_communication       209222 non-null bool
file                        209222 non-null object
friends                     300 non-null object
fx_rate                     209222 non-null float64
goal                        209222 non-null float64
id                          209

In [40]:
# List unique entries per column
data_insp.nunique()

backers_count                 3246
blurb                       180700
category                       169
converted_pledged_amount     31387
country                         22
created_at                  182158
creator                     208562
currency                        14
currency_symbol                  6
currency_trailing_code           2
current_currency                 5
deadline                    170854
disable_communication            2
file                            56
friends                          1
fx_rate                         67
goal                          5110
id                          182264
is_backing                       1
is_starrable                     2
is_starred                       2
launched_at                 182109
location                     15235
name                        181680
permissions                      1
photo                       182263
pledged                      44387
profile                     182265
slug                

In [41]:
# List correlation values
data_insp.corr(numeric_only=True)  # numeric_only is needed on pandas >= 2.0 (parameter available from 1.5)

Unnamed: 0,backers_count,converted_pledged_amount,created_at,currency_trailing_code,deadline,disable_communication,fx_rate,goal,id,is_starrable,launched_at,pledged,spotlight,staff_pick,state_changed_at,static_usd_rate,usd_pledged
backers_count,1.0,0.805201,0.025763,0.010566,0.031456,-0.007194,0.003313,0.010019,-0.002635,-0.006344,0.031367,0.251022,0.123834,0.157586,0.031798,-0.000877,0.804528
converted_pledged_amount,0.805201,1.0,0.031417,0.011129,0.036793,-0.005632,0.001674,0.010144,-0.001675,-0.006009,0.036437,0.311144,0.108999,0.143488,0.037109,-0.002628,0.999861
created_at,0.025763,0.031417,1.0,-0.163048,0.983713,-0.00559,-0.067095,0.003976,-0.003722,0.2594,0.983917,0.027735,-0.038476,-0.044145,0.983657,-0.103533,0.031305
currency_trailing_code,0.010566,0.011129,-0.163048,1.0,-0.159145,-0.003072,-0.53071,-0.001589,0.003703,-0.029686,-0.159141,-0.014496,0.026252,0.008887,-0.159209,-0.588899,0.010939
deadline,0.031456,0.036793,0.983713,-0.159145,1.0,-0.007501,-0.0664,0.004798,-0.002373,0.265997,0.999869,0.029352,-0.036473,-0.035992,0.999926,-0.104718,0.036689
disable_communication,-0.007194,-0.005632,-0.00559,-0.003072,-0.007501,1.0,-0.006571,0.008566,0.003209,-0.010032,-0.007664,-0.0024,-0.061833,-0.021032,-0.009238,-0.003848,-0.005637
fx_rate,0.003313,0.001674,-0.067095,-0.53071,-0.0664,-0.006571,1.0,-0.034576,-0.002264,-0.00958,-0.065782,-0.08668,0.019898,-0.000702,-0.066377,0.962465,0.001465
goal,0.010019,0.010144,0.003976,-0.001589,0.004798,0.008566,-0.034576,1.0,0.001609,-0.000269,0.004305,0.127876,-0.033962,-0.004262,0.004632,-0.031928,0.01024
id,-0.002635,-0.001675,-0.003722,0.003703,-0.002373,0.003209,-0.002264,0.001609,1.0,0.001471,-0.002432,-0.003055,-0.000518,0.003184,-0.002404,-0.002388,-0.001697
is_starrable,-0.006344,-0.006009,0.2594,-0.029686,0.265997,-0.010032,-0.00958,-0.000269,0.001471,1.0,0.264429,-0.002045,-0.207692,-0.019257,0.25753,-0.035726,-0.006544


In [42]:
# List missing values

def count_missing(data):
    """Print the number of missing values per column, largest first."""
    null_cols = data.columns[data.isnull().any(axis=0)]
    X_null = data[null_cols].isnull().sum()
    X_null = X_null.sort_values(ascending=False)
    print(X_null)
    
count_missing(data_insp)

permissions    208922
is_starred     208922
is_backing     208922
friends        208922
usd_type          480
location          226
blurb               8
dtype: int64


## Observations
- High correlation:
  - usd_pledged and converted_pledged_amount (drop one)
  - backers_count and converted_pledged_amount (create avg_backing, then drop backers_count)
- Only 182,264 unique IDs, so projects can be listed multiple times (depending on what? changing state?)
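
The highly correlated pairs noted above were read off the correlation table by eye. A possible way to list them programmatically is sketched below on hypothetical toy data; the `threshold` of 0.8 is an assumed cutoff, not a value taken from the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is almost a rescaled copy of "a", "c" is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

threshold = 0.8  # assumed cutoff for "high" correlation
corr = toy.corr().abs()

# Keep the strict upper triangle so that each pair appears only once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]

print(high_pairs)
```

Applied to `data_insp`, the same steps should surface pairs such as usd_pledged and converted_pledged_amount without scanning the full matrix.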

## Data Handling
Purpose: Construct a clean dataset

In [43]:
# Initial drop of unnecessary columns
data.drop(["blurb", 'currency_symbol', 'file', 'friends', 'fx_rate', 'is_backing', 'is_starred', 'location', 'name', 'permissions', 'photo', 'pledged', 'slug', "urls", 'usd_pledged'], axis=1, inplace=True)

In [44]:
# Extract category_sub from "category"
data["category_sub"] = [data.category[i].split('"')[5] for i in range(len(data.category))]
data["category_sub"] = data["category_sub"].str.replace("%20", "_")

In [45]:
# Extract category from "source_url"
data["category"] = [data.source_url[i].split("/")[5] for i in range(len(data.source_url))]
data["category"] = data["category"].str.replace("%20", "_")

In [46]:
# Extract id of "creator"
data["creator_id"] = [data.creator[i].split('"')[2][1:-1] for i in range(len(data.creator))]

In [47]:
# Extract project_id of "profile"
data["project_id"] = [data.profile[i].split('"')[4][1:-1] for i in range(len(data.profile))]

In [48]:
# Extract profile_state of "profile"
data["profile_state"] = [data.profile[i].split('"')[7] for i in range(len(data.profile))]

In [49]:
# Extract profile_state_changed_at of "profile"
data["profile_state_changed_at"] = [data.profile[i].split('"')[10][1:-1] for i in range(len(data.profile))]
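
The extraction cells above rely on fixed positions after splitting on `"`, which depends on the key order inside each string. As an alternative sketch, and assuming the cells hold valid JSON (which the sample rows suggest but which has not been verified for every row), the standard `json` module can look fields up by name. The sample strings and the helper `json_field` are hypothetical.

```python
import json
import pandas as pd

# Hypothetical sample strings shaped like the "creator" and "profile" columns
toy = pd.DataFrame({
    "creator": ['{"id":2094277840,"name":"Creator One"}',
                '{"name":"Jane \\"JD\\" Doe","id":723886115}'],
    "profile": ['{"id":3508024,"project_id":3508024,"state":"inactive","state_changed_at":1541459205}',
                '{"id":3094785,"project_id":3094785,"state":"active","state_changed_at":1501684093}'],
})

def json_field(series, key):
    """Parse each JSON string in the series and return the value stored under key."""
    return series.apply(lambda s: json.loads(s).get(key))

toy["creator_id"] = json_field(toy["creator"], "id")
toy["project_id"] = json_field(toy["profile"], "project_id")
toy["profile_state"] = json_field(toy["profile"], "state")
toy["profile_state_changed_at"] = pd.to_datetime(
    json_field(toy["profile"], "state_changed_at"), unit="s")

print(toy[["creator_id", "project_id", "profile_state", "profile_state_changed_at"]])
```

Unlike the split-based cells, this yields integer IDs rather than strings and tolerates a different key order or escaped quotes inside a name, at the cost of parsing every row.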

In [50]:
# Display content of newly created columns
#data.creator_id.value_counts()
#data.project_id.value_counts()
#data.profile_state.value_counts()
#data.profile_state_changed_at.value_counts()
#data[data.creator_id == "1704592942"]

In [51]:
# Extract city from "location" column (OPTIONAL, might not be needed)

In [52]:
# Drop rows with non-numerical values (only if few!)
#data = data[pd.to_numeric(data['column_with_nonnum'], errors='coerce').notnull()]
#data['column_with_nonnum'] = data['column_with_nonnum'].astype('int')
#data.dtypes

In [53]:
# Transform created_at from timestamp to datetime
data.created_at = pd.to_datetime(data.created_at, unit='s')

In [54]:
# Transform deadline from timestamp to datetime
data.deadline = pd.to_datetime(data.deadline, unit='s')

In [55]:
# Transform launched_at from timestamp to datetime
data.launched_at = pd.to_datetime(data.launched_at, unit='s')

In [56]:
# Transform state_changed_at from timestamp to datetime
data.state_changed_at = pd.to_datetime(data.state_changed_at, unit='s')

In [57]:
# Transform profile_state_changed_at from timestamp to datetime
data.profile_state_changed_at = pd.to_datetime(pd.to_numeric(data.profile_state_changed_at), unit='s')  # Column holds digit strings, so cast to numbers first
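
The glossary notes that deadline and launched_at can be used to analyze timeframes. A small sketch of what that could look like once the columns are datetimes is given below; the timestamps are taken from the first two rows of the inspection output, and `duration_days` and `launch_month` are illustrative feature names.

```python
import pandas as pd

# Unix timestamps from the first two rows shown during inspection
toy = pd.DataFrame({
    "launched_at": [1548223375, 1502384459],
    "deadline": [1552539775, 1504976459],
})
for col in ["launched_at", "deadline"]:
    toy[col] = pd.to_datetime(toy[col], unit="s")

# Campaign length in days and calendar month of launch
toy["duration_days"] = (toy["deadline"] - toy["launched_at"]).dt.total_seconds() / 86400
toy["launch_month"] = toy["launched_at"].dt.month

print(toy)
```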

In [58]:
# Convert currency_trailing_code to 1/0
data["currency_trailing_code"] = data["currency_trailing_code"].astype("int64")  # Assign back instead of in-place replace on an attribute

In [59]:
# Convert disable_communication to 1/0
data["disable_communication"] = data["disable_communication"].astype("int64")

In [60]:
# Convert is_starrable to 1/0
data["is_starrable"] = data["is_starrable"].astype("int64")

In [61]:
# Convert spotlight to 1/0
data["spotlight"] = data["spotlight"].astype("int64")

In [62]:
# Convert staff_pick to 1/0
data["staff_pick"] = data["staff_pick"].astype("int64")

In [63]:
# Convert profile_state to 1/0
data["profile_state"] = data["profile_state"].replace({"active": 1, "inactive": 0})

In [64]:
# Convert NaN in usd_type to "not_USD"
data["usd_type"] = data["usd_type"].fillna("not_USD")  # Missing values are real NaN, not the string "NaN"

In [65]:
# Creating list for categorical predictors/features (used in "Scaling with Preprocessing Pipeline") 
#cat_features = list(data.columns[data.dtypes==object])
#cat_features

In [66]:
# Creating list for numerical predictors/features (removing target column, used in "Scaling with Preprocessing Pipeline")
#num_features = list(data.columns[data.dtypes!=object])
#num_features.remove('TARGET')
#num_features

In [68]:
# Apply further cleaning
#code

In [69]:
# Post-cleaning drop of unnecessary columns
data.drop(["creator", "profile"], axis=1, inplace=True)

In [70]:
data.shape

(209222, 26)

In [71]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 26 columns):
backers_count               209222 non-null int64
category                    209222 non-null object
converted_pledged_amount    209222 non-null int64
country                     209222 non-null object
created_at                  209222 non-null datetime64[ns]
currency                    209222 non-null object
currency_trailing_code      209222 non-null int64
current_currency            209222 non-null object
deadline                    209222 non-null datetime64[ns]
disable_communication       209222 non-null int64
goal                        209222 non-null float64
id                          209222 non-null int64
is_starrable                209222 non-null int64
launched_at                 209222 non-null datetime64[ns]
source_url                  209222 non-null object
spotlight                   209222 non-null int64
staff_pick                  209222 non-null int64
state

# Data Exploration
Purpose: Form hypotheses about your defined problem by visually analyzing the data

In [None]:
# Clean data set for exploration
data_clean = data.copy()

In [None]:
# Drop features for exploration
data_clean = data_clean.drop(["column1", "column2", "column3"], axis=1)  # No inplace=True here: it returns None

In [None]:
# Separate continuous vs. categorical variables
data_cat_col = ['condition', 'date', 'date_month', 'date_year', 'grade','zipcode', 'view']
data_cont_col = [el for el in data_clean.columns if el not in data_cat_col]
data_cont = data_clean[data_cont_col]
data_cat = data_clean[data_cat_col]

In [None]:
# Look at data skew
data_clean.skew()

In [None]:
# Plot correlation heatmap for continuous variables
#Generate a mask for the upper triangle
mask = np.triu(np.ones_like(data_cont.corr(), dtype=bool))  # np.bool was removed from NumPy; use the builtin bool

#Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

#Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

#Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(data_cont.corr(), mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, fmt=".1g");

In [None]:
# Plot catplot (factorplot was renamed to catplot in seaborn 0.9)
sns.catplot(x="sex", col='education_level', data=data_clean, hue='income', kind="count", col_wrap=4);

In [None]:
# Plot jointplot (1 feature)
sns.jointplot(x='feature', y='target', data=data_clean, kind='hex');

In [None]:
# Plot lmplot (3 features)
sns.lmplot(y='target', x='feature_1', data=data_clean, hue='feature_2',
           col='feature_3 (plotted as separate graphs)', palette='Set1', scatter_kws={'alpha':0.3})

In [None]:
# Plot all variables as pairplot
sns.pairplot(data_clean);

In [None]:
# Plot selection of variables as pairplot
sns.pairplot(data_clean, kind="reg", vars=["price", "bedrooms", "bathrooms", "sqft_lot"], 
             plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}});

In [None]:
# Plot FacetGrid
g = sns.FacetGrid(data_clean,
                  col='view',
                  row='bedrooms',
                  hue='waterfront',
                  palette='Set2')
g = (g.map(plt.scatter, 'sqft_living', 'price').add_legend())

In [None]:
# Plot continuous variables
plt.hist(data_clean.price, bins = 25);
plt.hist(np.log(data_clean.price), bins = 25);
plt.tight_layout()

In [None]:
# Plot skewed features (OPTIMIZE - HOW?)

In [None]:
# Plot categorical variables
sns.stripplot(x=data_clean.condition.values, y = data_clean.price.values, 
              jitter=0.1, alpha=0.5);

sns.stripplot(x=data_clean.grade.values, y = data_clean.price.values,
              jitter=0.1, alpha=0.5);

sns.stripplot(x=data_clean.zipcode.values, y = data_clean.price.values,
              jitter=0.1, alpha=0.5);

sns.pointplot(x = data_clean.zipcode.values, y = data_clean.price.values,
              order = data_clean.groupby("zipcode")["price"].mean().sort_values().index);

plt.tight_layout()

# Feature Engineering
Purpose: Select important features and construct more meaningful ones using the raw data that you have

To-Do's:
- Start with brainstorming session to determine which features could be useful

In [None]:
# Import libraries
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler, PolynomialFeatures  # Imputer was removed; SimpleImputer is imported below
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline, make_pipeline # Same, but with the latter it is not necessary to name estimator and transformer
from sklearn.compose import ColumnTransformer

## Design new features

In [None]:
# Create 1/0 target feature for success
data["success"] = (data["state"] == "successful").astype(int)  # Comparison (==), not assignment (=)

In [None]:
# Create avg backing feature (converted_pledged_amount / backers_count)
data["backing_avg"] = data["converted_pledged_amount"] / data["backers_count"].replace(0, np.nan)  # NaN instead of inf for projects without backers

In [None]:
# Create log feature from continuous variable
data["new_column"] = [math.log(el) for el in data["old_column"]]

In [None]:
# Fill column to represent the max from two columns
data["column1"] = data[["column1","column2"]].max(axis=1)

In [None]:
# Drop features
data = data.drop(["converted_pledged_amount", "state", "column3"], axis=1)  # No inplace=True here: it returns None

## Preprocessing with manual code (alt. to pipeline)

### Imputation of Missing Data

In [None]:
# Overwrite original data with generic values (mean/median)
imp = SimpleImputer(strategy='median')  # Replaces the removed sklearn.preprocessing.Imputer
data = imp.fit_transform(data)

In [None]:
# Overwrite original data with generic values, after train/test split (mean/median)
X_train = train.fillna(train.mean())
X_test = test.fillna(train.mean())

# Features for feature importances
features = list(train.columns)

### Scale numerical features (manual, alt. to pipeline)

In [None]:
# Transform highly skewed features (log-transform)
skewed = ["skewed_feature1", "skewed_feature2"]
data[skewed] = data[skewed].apply(lambda x: np.log(x + 1))

In [None]:
# Normalize numerical features (MinMaxScaler: rescales the data set such that all feature values are in the range [0, 1])
scaler = MinMaxScaler()
numerical = ["numerical_feature1 (e.g. age)", "numerical_feature2"]
data[numerical] = scaler.fit_transform(data[numerical])

In [None]:
# Normalize numerical features (StandardScaler: removes the mean and scales the data to unit variance)
scaler = StandardScaler()
numerical = ["numerical_feature1 (e.g. age)", "numerical_feature2"]
data[numerical] = scaler.fit_transform(data[numerical])  # fit() alone returns the scaler, not the scaled data

In [None]:
# Normalize features (z-Score, not used in final model)
#Create new df
data_norm = data.copy(deep=True)
#drop columns
data_norm.drop(['yr_built', 'yr_renovated', 'yr_since_built'],
               axis=1,
               inplace=True)
#z normalize variables
data_norm.loc[:, [
    'date', 'bedrooms', 'bathrooms', 'waterfront', 'price_log', 'sqft_lot',
    'sqft_above', 'sqft_basement', 'floors', 'lat', 'long', 'sqft_living15',
    'sqft_lot15', 'yr_since_last_renovated'
]] = data.loc[:, [
    'date', 'bedrooms', 'bathrooms', 'waterfront', 'price_log', 'sqft_lot',
    'sqft_above', 'sqft_basement', 'floors', 'lat', 'long', 'sqft_living15',
    'sqft_lot15', 'yr_since_last_renovated'
]].apply(zscore)
#add intercept column
data_norm['intercept'] = 1

### Create dummies for categorical features (manual, alt. to pipeline)

In [None]:
# One-hot encode categorical (non-numeric) features (OPTIMIZE - FIND GOOD METHOD)

# Version 1
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)


# Version 2
features = pd.get_dummies(features_raw)

# TODO: Encode the 'income_raw' data to numerical values
income = income_raw.apply(lambda x: 1 if x == '>50K' else 0)

# Print the number of features after one-hot encoding
encoded = list(features.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))

# Uncomment the following line to see the encoded feature names
# print(encoded)
# Left commented out due to output size


# Version 3
cat = data.filter(["condition", 
                   "grade", 
                   "zipcode", "view"], axis=1).astype("category")
data_dum = pd.DataFrame()
data_dum_i = pd.DataFrame()
for i in cat:
    data_dum_i = pd.get_dummies(cat[i], prefix=i, drop_first=True)
    data_dum = pd.concat([data_dum, data_dum_i], axis=1)
data_without_dum = data
data = pd.concat([data, data_dum], axis=1)
data.reset_index(drop = True, inplace = True)

# data_dum = pd.get_dummies(data, columns=["condition", "grade", "zipcode"], prefix=["condition", "grade","zipcode"])
# data_wo_dum = data
# data_new = pd.concat([data, data_dum], axis=1)

## Train/test split

### Train/test split

In [None]:
# Define predictors and target variable
X = data.drop(["target", "optional_column1"], axis=1)
y = data["target"]

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,
                                                    random_state=random_state,
                                                    shuffle=False) # If labels are imbalanced, use shuffle=True with stratify=y instead, since stratify requires shuffling (e.g. most wines are 5 or 6; check with value_counts()!)
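
For an imbalanced target, a stratified split keeps the class proportions the same in both parts. The sketch below illustrates this on hypothetical toy data (`X_toy` and `y_toy` are placeholders for the real predictors and target), using the same test size and random state as the dashboard above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 rows of class 0, 20 rows of class 1
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 80 + [1] * 20)

# stratify needs shuffling, so shuffle is left at its default of True
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test :", y_te.mean())
```

Both shares come out at 0.2, matching the full data set, which is not guaranteed with an unshuffled or unstratified split.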

### Dummy classifier (to establish baseline)

In [None]:
# Dummy classifier (requires train/test split)
dum_clf = DummyClassifier(strategy= 'most_frequent').fit(X_train,y_train)
y_pred_dum_clf = dum_clf.predict(X_test)

#Distribution of y test
print('y actual : \n' +  str(y_test.value_counts()))

#Distribution of y predicted
print('y predicted : \n' + str(pd.Series(y_pred_dum_clf).value_counts()))

## Preprocessing with Pipeline
Building a pipeline always follows the same syntax. In our case we create one pipeline for our numerical features and one for our categorical features. In the end both are combined into one pipeline called "preprocessor". 

### Pipeline for imputing and scaling numerical and categorical features

In [None]:
# Pipeline using Pipeline
# Pipeline for numerical features
num_pipeline = Pipeline([
    ('imputer_num', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

# Pipeline for categorical features 
cat_pipeline = Pipeline([
    ('imputer_cat', SimpleImputer(strategy='constant', fill_value='missing')),
    ('1hot', OneHotEncoder(handle_unknown='ignore'))
])

# Complete pipeline
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

In [None]:
# Pipeline using make_pipeline (To-Do)
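
One possible way to fill in the `make_pipeline` variant is sketched below. It mirrors the `Pipeline`/`ColumnTransformer` cell above but lets scikit-learn name the steps automatically. The tiny DataFrame and the column lists `num_cols` and `cat_cols` are made-up stand-ins for `num_features` and `cat_features`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one missing value per column
demo = pd.DataFrame({
    "goal": [1000.0, np.nan, 5000.0, 250.0],
    "category": pd.Series(["food", "music", np.nan, "food"], dtype=object),
})
num_cols = ["goal"]
cat_cols = ["category"]

# make_pipeline derives step names from the class names (e.g. 'simpleimputer')
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

# make_column_transformer takes (transformer, columns) pairs
preprocessor_demo = make_column_transformer((num_pipe, num_cols), (cat_pipe, cat_cols))
X_demo_prepared = preprocessor_demo.fit_transform(demo)
print(X_demo_prepared.shape)
```

The automatic step names matter later for grid search, because hyperparameters are addressed as `<step name>__<parameter>`.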

### Pipeline for imputing, scaling and fitting to model

In [None]:
# Build pipeline
model = make_pipeline(SimpleImputer(strategy='mean'),  # Imputer was removed from scikit-learn; SimpleImputer replaces it
                      PolynomialFeatures(degree=2),
                      LinearRegression())

In [None]:
# Fit model and test (is the predict code correct?)
model.fit(X_train, y_train)  # X with missing values, from above
print(y_train)
print(model.predict(X_test))

# Predictive Modeling
Purpose: Train machine learning models (supervised learning), evaluate their performance and use them to make predictions

In [None]:
# Import libraries
from time import time
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold, cross_val_predict, cross_val_score, cross_validate
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import make_scorer, fbeta_score, accuracy_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.base import clone
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
import statsmodels.api as sm

## Feature Selection

### Feature importance with Decision Trees

In [None]:
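# Display top features
# Fit a decision tree first; its feature_importances_ provide the ranking
tree_clf = DecisionTreeClassifier(random_state=random_state).fit(X_train, y_train)
fi = pd.DataFrame({'feature': list(X_train.columns),
                   'importance': tree_clf.feature_importances_}).\
                    sort_values('importance', ascending = False)
fi.head()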

### Feature importance with AdaBoost

In [None]:
# Train a supervised learning model that has 'feature_importances_'
model = AdaBoostClassifier().fit(X_train,y_train)

# TODO: Extract the feature importances
importances = model.feature_importances_

In [None]:
# Reduce the feature space
X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]
X_test_reduced = X_test[X_test.columns.values[(np.argsort(importances)[::-1])[:5]]]

# Train on the "best" model found from grid search earlier
clf = (clone(best_clf)).fit(X_train_reduced, y_train)

# Make new predictions
reduced_predictions = clf.predict(X_test_reduced)

# Report scores from the final model using both versions of data
print("Final Model trained on full data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
print("\nFinal Model trained on reduced data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, reduced_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, reduced_predictions, beta = 0.5)))

## Regression (R) and Classification (C) Models

### Linear regression (R)

In [None]:
# Linear regression with statsmodels sm.OLS
# Keep the constant in a separate frame so X_train still matches X_test
X_train_sm = sm.add_constant(X_train)
lr = sm.OLS(y_train, X_train_sm).fit()

In [None]:
# Linear regression with sklearn LinearRegression()
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
#plt.scatter(y_test, y_pred)
#plt.xlabel('Y Test')
#plt.ylabel('Predicted Y')
#print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
#print('MSE:', metrics.mean_squared_error(y_test, y_pred))
#print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
#print(r2_score(y_test, y_pred))

### Logistic regression (C)

In [None]:
# Logistic regression (manual)
logr = LogisticRegression().fit(X_train,y_train)
y_pred = logr.predict(X_test)

In [None]:
# Logistic regression (using pipeline)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('logreg', LogisticRegression(max_iter=1000))
])

In [None]:
# Logistic regression (using pipeline, cross-validated class predictions; pass method='predict_proba' for probabilities)
y_train_predicted = cross_val_predict(pipeline, X_train, y_train, cv=5)

In [None]:
# Logistic regression (using pipeline, printing results)
print('Cross validation scores:')
print('-------------------------')
print("Accuracy: {:.2f}".format(accuracy_score(y_train, y_train_predicted)))
print("Recall: {:.2f}".format(recall_score(y_train, y_train_predicted)))
print("Precision: {:.2f}".format(precision_score(y_train, y_train_predicted)))

### Lasso (where?)

In [None]:
#1.2 Lasso with X, y values from #1.1 (0.51) - not used 
#from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
#import matplotlib

#lasso = LassoCV()
#lasso.fit(X, y)
#print("Best alpha using built-in LassoCV: %f" % lasso.alpha_)
#print("Best score using built-in LassoCV: %f" %lasso.score(X,y))
#coef = pd.Series(lasso.coef_, index = X.columns)

#print("Lasso picked " 
#      + str(sum(coef != 0)) 
#      + " variables and eliminated the other " 
#      +  str(sum(coef == 0)) + " variables")

#imp_coef = coef.sort_values()
#matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
#imp_coef.plot(kind = "barh")
#plt.title("Feature importance using Lasso Model")

### Decision Trees (R and C)
- Real-world application: Decision trees and, more generally, CART (Classification and Regression Trees) are often used in financial analysis, for example to predict which stocks to buy based on past performance.
- Strengths:
  - Able to handle categorical and numerical data.
  - Doesn't require much data pre-processing and can handle data that hasn't been normalized (scikit-learn's implementation still expects categorical features to be numerically encoded).
  - Simple to understand and interpret.
- Weaknesses:
  - Complex decision trees do not generalize well to new data and tend to overfit.
  - Unstable, as small variations in the data can result in a different decision tree. Hence they are usually used in an ensemble (like Random Forests) to build robustness.
  - Can create biased trees if some classes dominate.
- Candidacy: Since a decision tree can handle both numerical and categorical data, it's a good candidate for our case (although the pre-processing steps might already cancel out that advantage). It's also easy to interpret, so we can see what happens under the hood when interpreting the results.
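
The overfitting weakness listed above can be made visible by comparing training and test accuracy for different tree depths. This is a minimal sketch on synthetic data from `make_classification`; the depths and sample sizes are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy classification data (not the Kickstarter set)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

scores = {}
for depth in [2, 5, None]:  # None lets the tree grow until all leaves are pure
    clf_demo = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (clf_demo.score(X_tr, y_tr), clf_demo.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, test={scores[depth][1]:.2f}")
```

A large gap between training and test accuracy for the unrestricted tree is the typical sign of overfitting; limiting `max_depth` or `min_samples_leaf` usually narrows it.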

In [None]:
# Fit DecisionTreeClassifier
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)

In [None]:
# Get predictions
y_train_pred = dt_clf.predict(X_train)
y_train_proba = dt_clf.predict_proba(X_train)

y_pred = dt_clf.predict(X_test)
y_pred_proba = dt_clf.predict_proba(X_test)

### Random Forest Classifier (C)

In [None]:
# Create model
rf_clf = RandomForestClassifier(n_estimators=100,
                              random_state=random_state,
                              max_depth=5,
                              max_features="sqrt",
                              n_jobs=-1, verbose=1)

In [None]:
# Fit on training data
rf_clf.fit(X_train, y_train)

In [None]:
# Training predictions
y_train_pred = rf_clf.predict(X_train)
y_train_prob = rf_clf.predict_proba(X_train)[:,1]

In [None]:
# Testing predictions (to determine performance)
y_pred = rf_clf.predict(X_test)
y_prob = rf_clf.predict_proba(X_test)[:,1]

### Support Vector Machines (C)
- Real-world application: SVMs are used for image classification and image segmentation, for example face detection in an image.
- Strengths:
  - Effective in high-dimensional spaces, i.e. when there are a lot of features.
  - Kernel functions can be used to adapt to different cases, and can be completely customized if needed. Thus SVMs are versatile.
- Weaknesses:
  - Training scales poorly to large datasets.
  - Doesn't directly provide probability estimates.
- Candidacy: SVMs were chosen because of their effectiveness given high dimensionality. After incorporating dummy variables, we have more than 100 features in our dataset, which SVMs should handle well. Also, our dataset is not large enough for training time to be a deterrent.
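
Regarding the missing probability estimates: a plain `SVC` only exposes signed distances to the decision boundary via `decision_function`. One common workaround, sketched here on synthetic data, is to wrap the classifier in `CalibratedClassifierCV`, which fits a calibration model on top of those scores (`SVC` also offers a `probability=True` option that performs a similar calibration internally):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Plain SVC: signed distance to the separating boundary, not a probability
svc_plain = SVC(kernel="rbf").fit(X_demo, y_demo)
margins = svc_plain.decision_function(X_demo[:5])

# Calibrated SVC: cross-validated sigmoid calibration yields probabilities
svc_calibrated = CalibratedClassifierCV(SVC(kernel="rbf"), cv=3).fit(X_demo, y_demo)
proba = svc_calibrated.predict_proba(X_demo[:5])

print(margins)
print(proba)
```

Calibration costs extra fits (one per fold), which adds to the training time on larger datasets.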

In [None]:
# Pipeline: PCA followed by kernel SVC (optimal C is found with cross-validation)
from sklearn.decomposition import PCA
# RandomizedPCA was removed from scikit-learn; PCA with svd_solver='randomized' replaces it
pca = PCA(n_components=150, svd_solver='randomized', whiten=True, random_state=random_state)
svc_pre = SVC(kernel='rbf', class_weight='balanced')
svc_clf = make_pipeline(pca, svc_pre)

In [None]:
# Linear SVC
svc_clf = SVC(kernel="linear", C=1E10) # low C = soft margin, including many values; high C = hard fit
svc_clf.fit(X_train, y_train)

In [None]:
# Kernel SVC (radial basis function)
svc_clf = SVC(kernel='rbf', C=1E6)
svc_clf.fit(X_train, y_train)

In [None]:
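# Tune the PCA + SVC pipeline with GridSearchCV (find optimal C and gamma)
# svc_clf was overwritten by the plain SVCs above, so the pipeline is rebuilt here;
# the 'svc__' prefix addresses the SVC step inside the pipeline
svc_pipe = make_pipeline(pca, SVC(kernel='rbf', class_weight='balanced'))
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]} # Others: kernel, degree (only for poly)
grid = GridSearchCV(svc_pipe, param_grid)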

%time grid.fit(X_train, y_train)
print(grid.best_params_)

In [None]:
# Try out different kernels
kernels = ['linear', 'rbf', 'poly']
for kernel in kernels:
    svc = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, svc.score(X_test, y_test))

In [None]:
# Predict labels for test data
svc_clf_est = grid.best_estimator_
y_pred = svc_clf_est.predict(X_test)

## Ensemble Methods

### Max Voting

In [None]:
# Compare two models
model1 = LogisticRegression(random_state=random_state)
model2 = DecisionTreeClassifier(random_state=random_state)

v_clf = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
v_clf.fit(X_train,y_train)
v_clf.score(X_test,y_test)

### Averaging

In [None]:
# Averages three models
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

model1.fit(X_train,y_train)
model2.fit(X_train,y_train)
model3.fit(X_train,y_train)

pred1=model1.predict_proba(X_test)
pred2=model2.predict_proba(X_test)
pred3=model3.predict_proba(X_test)

# Average the predicted class probabilities, then pick the most likely class
y_proba = (pred1+pred2+pred3)/3
y_pred = model1.classes_[np.argmax(y_proba, axis=1)]

### Weighted Average

In [None]:
# Averages three weighted models
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(X_train,y_train)
model2.fit(X_train,y_train)
model3.fit(X_train,y_train)

pred1=model1.predict_proba(X_test)
pred2=model2.predict_proba(X_test)
pred3=model3.predict_proba(X_test)

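# Weighted average of the predicted class probabilities (weights sum to 1)
y_proba = pred1*0.3 + pred2*0.3 + pred3*0.4
y_pred = model1.classes_[np.argmax(y_proba, axis=1)]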
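
The weighted average above can also be delegated to scikit-learn's `VotingClassifier` with `voting='soft'` and a `weights` list. The sketch below uses synthetic data and the same 0.3/0.3/0.4 weights, and checks that the result matches the manual computation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

weights = [0.3, 0.3, 0.4]
soft_clf = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier()),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",       # average probabilities instead of counting votes
    weights=weights,
).fit(X_tr, y_tr)

avg_proba = soft_clf.predict_proba(X_te)
y_pred_demo = soft_clf.predict(X_te)

# Same weighted average computed by hand from the fitted sub-models
manual_proba = sum(w * est.predict_proba(X_te)
                   for w, est in zip(weights, soft_clf.estimators_))
print(np.allclose(avg_proba, manual_proba))
```

Setting all weights equal (or leaving `weights=None`) gives the plain average from the previous section.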

### Stacking

In [None]:
# Define function for stacking
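def Stacking(model,train,y,test,n_fold):
    """Return fold-averaged predictions for `test` and out-of-fold predictions for `train`."""
    # random_state only takes effect (and is only allowed) together with shuffle=True
    folds=StratifiedKFold(n_splits=n_fold,shuffle=True,random_state=1)
    test_pred=np.zeros(test.shape[0],dtype=float)
    train_pred=np.empty(train.shape[0],dtype=float)
    for train_indices,val_indices in folds.split(train,y.values):
        x_train,x_val=train.iloc[train_indices],train.iloc[val_indices]
        y_train=y.iloc[train_indices]

        model.fit(X=x_train,y=y_train)
        # Out-of-fold predictions are written back in the original row order
        train_pred[val_indices]=model.predict(x_val)
        # Test predictions are averaged over all folds
        test_pred+=model.predict(test)/n_fold
    return test_pred.reshape(-1,1),train_pred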

In [None]:
# Call function
Stacking(rf_clf, X_train, y_train, X_test, 3)
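
As an alternative to the hand-written function, scikit-learn (0.22 and later) ships a `StackingClassifier` that generates the out-of-fold predictions internally and trains a final estimator on them. This is a minimal sketch on synthetic data; the choice of base models and of `LogisticRegression` as final estimator is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

stack_clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),  # meta-model trained on base-model predictions
    cv=3,                                  # folds used for the out-of-fold predictions
)
stack_clf.fit(X_tr, y_tr)
stack_pred = stack_clf.predict(X_te)
stack_acc = stack_clf.score(X_te, y_te)
print("Stacked test accuracy:", stack_acc)
```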

### Gradient Boost
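
A minimal sketch of gradient boosting with scikit-learn's `GradientBoostingClassifier` on synthetic data. All hyperparameter values below are illustrative assumptions, not tuned settings for the Kickstarter data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Each of the 100 shallow trees is fitted to the errors of the current ensemble
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                    max_depth=3, random_state=0)
gb_clf.fit(X_tr, y_tr)

gb_pred = gb_clf.predict(X_te)
gb_proba = gb_clf.predict_proba(X_te)[:, 1]
print("Test accuracy:", gb_clf.score(X_te, y_te))
```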

### AdaBoost
- Real-world application: Ensemble methods are used extensively in Kaggle competitions, often in image detection. A real-world example of AdaBoost is object detection in images, e.g. identifying players during a game of basketball.
- Strengths:
  - Ensemble methods, including AdaBoost, are more robust than single estimators and generalize better.
  - Simple models can be combined to build a complex model, which is computationally fast.
- Weaknesses:
  - If we have a biased underlying classifier, it will lead to a biased boosted model.
- Candidacy: Ensemble methods are considered to be high-quality classifiers, and AdaBoost is one of the most popular boosting algorithms. We also have a class imbalance in our dataset, which boosting might be robust to.

In [None]:
# Find good application guideline (reference: notebooks/hh-2020-ds1-Ensemble-Methods-2/2_Adaboost_Codealong.ipynb)
clf_ada = AdaBoostClassifier(random_state=random_state)
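
To illustrate how AdaBoost combines simple models, the sketch below compares a single decision stump with an AdaBoost ensemble, whose default base learner is a depth-1 decision tree. The data is synthetic and the number of estimators is an arbitrary assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# A single stump: one split on one feature
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
stump_acc = stump.score(X_te, y_te)

# AdaBoost: up to 100 stumps, each focusing on previously misclassified samples
ada_demo = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
ada_acc = ada_demo.score(X_te, y_te)

print(f"Single stump accuracy: {stump_acc:.2f}")
print(f"AdaBoost accuracy:     {ada_acc:.2f}")
```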

### XGBoost

In [None]:
# Determine best parameters with GridSearch
cv_params = {'max_depth': [1,2,3,4,5,6], 'min_child_weight': [1,2,3,4]}    # parameters to be tried in the grid search
fix_params = {'learning_rate': 0.1, 'n_estimators': 100, 'objective': 'binary:logistic'}   #other parameters, fixed for the moment 
csv = GridSearchCV(XGBClassifier(**fix_params), cv_params, scoring = 'f1', cv = 5)

In [None]:
# Fit model and print best parameters
csv.fit(X_train, y_train)
csv.best_params_

In [None]:
# Create model with best parameters
xgb_clf = XGBClassifier(learning_rate=0.1, max_depth=4, min_child_weight=3)
xgb_clf.fit(X_train, y_train)

In [None]:
# Predict values
y_pred = xgb_clf.predict(X_test)
predictions = [round(value) for value in y_pred]

## Model Optimization

### Random Forest Optimization (Random Search)

In [None]:
# Fit optimized RandomForestClassifier

# Hyperparameter grid
param_grid = {
    'n_estimators': np.linspace(10, 200).astype(int),
    'max_depth': [None] + list(np.linspace(3, 20).astype(int)),
    'max_features': ['sqrt', None] + list(np.arange(0.5, 1, 0.1)),  # 'auto' (equal to 'sqrt' here) was removed in scikit-learn 1.3
    'max_leaf_nodes': [None] + list(np.linspace(10, 50, 500).astype(int)),
    'min_samples_split': [2, 5, 10],
    'bootstrap': [True, False]
}

# Estimator for use in random search
estimator = RandomForestClassifier(random_state=random_state)

# Create the random search model
rs_rf_clf = RandomizedSearchCV(estimator, param_grid, n_jobs = -1, 
                        scoring = 'roc_auc', cv = 3, 
                        n_iter = 10, verbose = 1, random_state=random_state)

# Fit 
rs_rf_clf.fit(X_train, y_train)

In [None]:
# Display best parameters
rs_rf_clf.best_params_

In [None]:
# Use best model for predictions
best_model = rs_rf_clf.best_estimator_

y_train_pred = best_model.predict(X_train)
y_train_proba = best_model.predict_proba(X_train)[:, 1]

y_test_pred = best_model.predict(X_test)
y_test_proba = best_model.predict_proba(X_test)[:, 1]

### Training and Predicting Pipeline

In [None]:
# Define function for train_predict
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test): 
    '''
    inputs:
       - learner: the learning algorithm/classifier to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: target (labels) training set
       - X_test: features testing set
       - y_test: target (labels) testing set
    '''
    
    results = {}
    
    # TODO: Fit the learner to the training data using slicing with 'sample_size'
    start = time() # Get start time
    learner = learner.fit(X_train[:sample_size],y_train[:sample_size])
    end = time() # Get end time
    
    # TODO: Calculate the training time
    results['train_time'] = end - start
        
    # TODO: Get the predictions on the test set,
    #       then get predictions on the first 300 training samples
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time
    
    # TODO: Calculate the total prediction time
    results['pred_time'] = end - start
            
    # TODO: Compute accuracy on the first 300 training samples
    results['acc_train'] = accuracy_score(y_train[:300],predictions_train)
        
    # TODO: Compute accuracy on test set
    results['acc_test'] = accuracy_score(y_test,predictions_test)
    
    # TODO: Compute F-score on the first 300 training samples (beta must be passed as a keyword)
    results['f_train'] = fbeta_score(y_train[:300],predictions_train,beta=0.5)
        
    # TODO: Compute F-score on the test set
    results['f_test'] = fbeta_score(y_test,predictions_test,beta=0.5)
       
    # Success
    print ("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
        
    # Return the results
    return results

In [None]:
# Set classifiers (if applicable)
clf_A = DecisionTreeClassifier(random_state=101)
clf_B = SVC(random_state = 101)
clf_C = AdaBoostClassifier(random_state = 101)

In [None]:
# Set sample sizes
samples_1 = int(round(len(X_train) / 100))
samples_10 = int(round(len(X_train) / 10))
samples_100 = len(X_train)

In [None]:
# Collect results for various sample sizes
results = {}
for clf in [clf_A, clf_B, clf_C]: # Define which classifiers shall be used
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = \
        train_predict(clf, samples, X_train, y_train, X_test, y_test)

## Model Tuning

### Grid Search

- Grid search evaluates every combination in a user-defined hyperparameter grid (typically with cross-validation) and keeps the combination with the best score.
- It can be applied to any model that exposes hyperparameters, at the cost of one fit per combination and fold.

In [None]:
# Initialize the classifier
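clf = AdaBoostClassifier(estimator=DecisionTreeClassifier()) # use base_estimator= on scikit-learn < 1.2

In [None]:
# Create the parameters list you wish to tune (depending on the model!)
# Keys prefixed with 'estimator__' are passed on to the wrapped DecisionTreeClassifier
parameters = {'n_estimators':[50, 120],
              'learning_rate':[0.1, 0.5, 1.],
              'estimator__min_samples_split' : np.arange(2, 8, 2),
              'estimator__max_depth' : np.arange(1, 4, 1),
              'estimator__min_samples_leaf' : np.arange(1, 4, 1) # min_child_weight is an XGBoost parameter; min_samples_leaf is a roughly comparable tree setting
             }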

In [None]:
# Fit grid and print results
# TODO: Make an fbeta_score scoring object
scorer = make_scorer(fbeta_score,beta=0.5) 

# TODO: Perform grid search on the classifier using 'scorer' as the scoring method
grid_obj = GridSearchCV(clf,parameters,scoring=scorer)

# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_fit = grid_obj.fit(X_train,y_train) 
print(grid_fit.best_score_)
print(grid_fit.best_params_)

# Get the estimator
best_clf = grid_fit.best_estimator_ 

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test) 

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
print(best_clf)

### Grid Search (Pipeline)
In order to optimize our model we will use grid search. First we define the parameter space we want to search for the best parameter combination. Then we set up the grid search via GridSearchCV. The last step is to call the fit method with our training data as input.

In [None]:
# Defining parameter space for grid-search. Since we want to access the classifier step in our pipeline 
# we have to add 'logreg__' in front of the corresponding hyperparameters. 
# The default lbfgs solver does not support the l1 penalty, so liblinear is set explicitly.
param_logreg = {'logreg__penalty':('l1','l2'),
                'logreg__solver': ['liblinear'],
                'logreg__C': [0.01, 0.1, 1, 10, 100]
               }

grid_logreg = GridSearchCV(pipeline, param_grid=param_logreg, cv=3, scoring='accuracy', 
                           verbose=5, n_jobs=-1) # scoring can also be "precision", "recall", ...

In [None]:
# Fit model
grid_logreg.fit(X_train, y_train)

In [None]:
# Predict values based on new parameters
y_pred = grid_logreg.predict(X_test)

In [None]:
# Show best parameters
print('Best score:\n{:.2f}'.format(grid_logreg.best_score_))
print("Best parameters:\n{}".format(grid_logreg.best_params_))

In [None]:
# Save best model as best_model
best_model = grid_logreg.best_estimator_['logreg']

## Final Evaluation
Finally we have a good model. Let's see if it also passes the final evaluation on the test data. To do so, we have to prepare the test set in the same way as the training data. Thanks to our pipeline, that's done in a blink. :)

### Based on Train/Test Predict/Proba

In [None]:
# Define function for evaluation
def evaluate_model(predictions, probs, train_predictions, train_probs):
    """Compare machine learning model to baseline performance.
    Computes statistics and shows ROC curve.
    Uses y_test and y_train from the train/test split as the true labels."""
    
    baseline = {}
    
    # Baseline: predict the positive class (1) for every test sample
    baseline['recall'] = recall_score(y_test, [1 for _ in range(len(y_test))])
    baseline['precision'] = precision_score(y_test, [1 for _ in range(len(y_test))])
    baseline['roc'] = 0.5
    
    results = {}
    
    results['recall'] = recall_score(y_test, predictions)
    results['precision'] = precision_score(y_test, predictions)
    results['roc'] = roc_auc_score(y_test, probs)
    
    train_results = {}
    train_results['recall'] = recall_score(y_train, train_predictions)
    train_results['precision'] = precision_score(y_train, train_predictions)
    train_results['roc'] = roc_auc_score(y_train, train_probs)
    
    for metric in ['recall', 'precision', 'roc']:
        print(f'{metric.capitalize()} Baseline: {round(baseline[metric], 2)} Test: {round(results[metric], 2)} Train: {round(train_results[metric], 2)}')
    
    # Calculate false positive rates and true positive rates
    base_fpr, base_tpr, _ = roc_curve(y_test, [1 for _ in range(len(y_test))])
    model_fpr, model_tpr, _ = roc_curve(y_test, probs)

    plt.figure(figsize = (8, 6))
    plt.rcParams['font.size'] = 16
    
    # Plot both curves
    plt.plot(base_fpr, base_tpr, 'b', label = 'baseline')
    plt.plot(model_fpr, model_tpr, 'r', label = 'model')
    plt.legend();
    plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate'); plt.title('ROC Curves');


In [None]:
# Call function with the predictions and positive-class probabilities of the tuned random forest
evaluate_model(y_test_pred, y_test_proba, y_train_pred, y_train_proba)

### Based on Preprocessor Pipeline

In [None]:
# Preparing the test set 
preprocessor.fit(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

In [None]:
# Calculating the accuracy, recall and precision for the test set with the optimized model
y_test_predicted = best_model.predict(X_test_preprocessed)

print("Accuracy: {:.2f}".format(accuracy_score(y_test, y_test_predicted)))
print("Recall: {:.2f}".format(recall_score(y_test, y_test_predicted)))
print("Precision: {:.2f}".format(precision_score(y_test, y_test_predicted)))
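
The two-step evaluation above (transform, then predict) can also be done in a single call, because the fitted pipeline returned by `best_estimator_` applies the preprocessing itself. The sketch below uses a tiny made-up DataFrame and hypothetical column names to show that both routes give the same predictions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data for illustration only
demo = pd.DataFrame({
    "goal": [100, 5000, 300, 20000, 1500, 800, 12000, 250, 40000, 600, 900, 7000],
    "category": ["food", "music", "food", "tech", "music", "food",
                 "tech", "music", "tech", "food", "music", "tech"],
    "success": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
})
X_demo, y_demo = demo[["goal", "category"]], demo["success"]
X_tr, X_te = X_demo.iloc[:9], X_demo.iloc[9:]
y_tr, y_te = y_demo.iloc[:9], y_demo.iloc[9:]

pipe_demo = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["goal"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ])),
    ("logreg", LogisticRegression(max_iter=1000)),
])
grid_demo = GridSearchCV(pipe_demo, {"logreg__C": [0.1, 1, 10]}, cv=3).fit(X_tr, y_tr)

# Route 1: the whole fitted pipeline handles the raw test data
best_pipe = grid_demo.best_estimator_
pred_pipeline = best_pipe.predict(X_te)

# Route 2: transform with the fitted preprocessor, then predict with the classifier step
pred_manual = best_pipe["logreg"].predict(best_pipe["preprocessor"].transform(X_te))

print(np.array_equal(pred_pipeline, pred_manual))
```

Keeping the whole pipeline as the final model avoids refitting the preprocessor separately and rules out a mismatch between training and test preparation.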

# Data Visualization
Purpose: Communicate the findings with key stakeholders using plots and interactive visualizations

In [None]:
# Import libraries
import itertools  # used by plot_confusion_matrix below
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve, confusion_matrix, f1_score, classification_report

## Print results

In [None]:
# Print metrics
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

## Plot Visualizations

### Confusion Matrix

In [None]:
# Define function to plot confusion matrix
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Compute confusion matrix (target labels: 0 = failed, 1 = successful)
cnf_matrix = confusion_matrix(y_test, y_pred, labels=[0,1])
np.set_printoptions(precision=2)

print(classification_report(y_test, y_pred))

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['failed','successful'],normalize= False,  title='Confusion matrix')

### Other results

In [None]:
# mean price on date
data_results.groupby("date")["price"].mean().plot(kind="line", x="date", y="price");

## Map Visualizations

In [None]:
# Scatter Map for deviation from predicted value (all listings)
fig = px.scatter_mapbox(data_results,
                        lat="lat",
                        lon="long",
                        hover_name="price",
                        hover_data=["sqft_above", 'floors'],
                        size='price',
                        color='res_rel',
                        color_continuous_scale=[[0, "rgb(255, 203, 100)"],[0.5, "rgb(116, 193, 185)"],[1, "rgb(116, 193, 185)"]],
                        color_discrete_sequence=["fuchsia"],
                        zoom=7,
                        height=400)
fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
#fig.write_html("html_fig/map_for_four_1.html")   #saved as html in html_fig
fig.show()

In [None]:
# Density Map for deviation from predicted value (top 5 high grade listings)
fig = px.density_mapbox(data_res_high,
                        lat='lat',
                        lon='long',
                        z='res_rel',
                        radius=5,
                        color_continuous_scale="electric" ,
                        center=dict(lat=47.53, lon=-122.23),
                        zoom=7,
                        mapbox_style="open-street-map")
#fig.write_html("figures/map_price.html")
#fig.show()

# Findings and Recommendations
Purpose: Summarize the key outcomes and findings of this project

- xyz
- xyz

# Future Work
Purpose: Validate and extend findings of this project

- Inclusion of further variables (e.g. ...)
- Inclusion of further publicly available data (e.g. [Kaggle Competition](https://www.kaggle.com/kemical/kickstarter-projects))

# References and Further Information

Sometimes you might want to transform your features in a very specific way that is not implemented in scikit-learn yet. In those cases you can create your very own custom transformers. To work seamlessly with everything scikit-learn provides, you need to create a class and implement the methods .fit() and .transform(); .fit_transform() comes for free when you inherit from TransformerMixin.
Two useful base classes on which you can build your own transformer can be imported with the following command:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
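
A minimal sketch of such a custom transformer, built on the two base classes. The class name `LogColumnTransformer` and the `goal` column are hypothetical examples, not part of this project's code:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LogColumnTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example: add a log1p-transformed copy of one numeric column."""

    def __init__(self, column="goal"):
        # Store constructor arguments unchanged so get_params()/clone() work
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn for this transformation; return self by convention
        return self

    def transform(self, X):
        X = X.copy()  # do not modify the caller's DataFrame
        X[self.column + "_log"] = np.log1p(X[self.column])
        return X


demo = pd.DataFrame({"goal": [0.0, 99.0, 999.0]})

# fit_transform is inherited from TransformerMixin
transformed = LogColumnTransformer(column="goal").fit_transform(demo)
print(transformed)
```

Because the class follows the estimator interface, it can be used as a step inside a `Pipeline` like any built-in transformer.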

If you want to learn more about building your own transformers or pipelines in general, I recommend having a look at the following books:

- Introduction to Machine Learning with Python by Müller and Guido (2017), Chapter 6
- Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Geron (2019), Chapter 2

## Useful Links/Blogs
- [Data Preprocessing Concepts (Theory)](https://towardsdatascience.com/data-preprocessing-concepts-fa946d11c825)
- [Data Preprocessing in Practice](https://towardsdatascience.com/data-preprocessing-in-python-b52b652e37d5)
- [Pipeline in ML (SVM) with Scikit-learn: A Simple Example](https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976)
- [Pipeline in ML (Decision Trees)](https://towardsdatascience.com/understanding-decision-tree-classification-with-scikit-learn-2ddf272731bd)
- [Feature Engineering](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html)
- [Hyperparameters and Model Validation](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html)
- [Grid Search for Model Tuning](https://towardsdatascience.com/grid-search-for-model-tuning-3319b259367e)