## Baseline Model

In this notebook we develop a baseline model to predict whether a Kickstarter project will succeed. The baseline gives us a reference point for measuring how much later feature engineering and feature selection improve model performance.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('Data/Kickstarter project.csv')
data.head().T

Unnamed: 0,0,1,2,3,4
ID,1000002330,1000003930,1000004038,1000007540,1000011046
name,The Songs of Adelaide & Abullah,Greeting From Earth: ZGAC Arts Capsule For ET,Where is Hank?,ToshiCapital Rekordz Needs Help to Complete Album,Community Film Project: The Art of Neighborhoo...
category,Poetry,Narrative Film,Narrative Film,Music,Film & Video
main_category,Publishing,Film & Video,Film & Video,Music,Film & Video
currency,GBP,USD,USD,USD,USD
deadline,2015-10-09,2017-11-01,2013-02-26,2012-04-16,2015-08-29
goal,1000,30000,45000,5000,19500
launched,2015-08-11 12:12:28,2017-09-02 04:43:57,2013-01-12 00:20:50,2012-03-17 03:24:11,2015-07-04 08:35:03
pledged,0,2421,220,1,1283
state,failed,failed,failed,failed,canceled


The goal is to predict whether a Kickstarter project will succeed. The outcome comes from the state column. As predictors we can use features such as category, currency, funding goal, country, and launch time.




#### Preparing target column

In [3]:
data.state.unique()

array(['failed', 'canceled', 'successful', 'live', 'undefined',
       'suspended'], dtype=object)

In [4]:
data.groupby('state')['ID'].count()

state
canceled       38779
failed        197719
live            2799
successful    133956
suspended       1846
undefined       3562
Name: ID, dtype: int64

In [5]:
# 1 - Dropping projects that are "live"
data = data.query('state != "live"')

# 2 - Counting "successful" states as outcome = 1
data = data.assign(outcome=(data['state'] == 'successful').astype(int))

data.head(10)

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real,outcome
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95,0
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0,0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0,0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0,0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0,0
5,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01,50000.0,2016-02-26 13:38:27,52375.0,successful,224,US,52375.0,52375.0,50000.0,1
6,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,USD,2014-12-21,1000.0,2014-12-01 18:30:44,1205.0,successful,16,US,1205.0,1205.0,1000.0,1
7,1000030581,Chaser Strips. Our Strips make Shots their B*tch!,Drinks,Food,USD,2016-03-17,25000.0,2016-02-01 20:05:12,453.0,failed,40,US,453.0,453.0,25000.0,0
8,1000034518,SPIN - Premium Retractable In-Ear Headphones w...,Product Design,Design,USD,2014-05-29,125000.0,2014-04-24 18:14:43,8233.0,canceled,58,US,8233.0,8233.0,125000.0,0
9,100004195,STUDIO IN THE SKY - A Documentary Feature Film...,Documentary,Film & Video,USD,2014-08-10,65000.0,2014-07-11 21:55:48,6240.57,canceled,43,US,6240.57,6240.57,65000.0,0


Every new feature, processing step, encoding, and feature selection choice should improve on the baseline model. Before training it, we need to do a small amount of feature engineering.


#### Converting timestamps
The launched column contains a date and time stored as text. We parse it with pd.to_datetime and then extract the hour, day, month, and year through the .dt accessor on the resulting timestamp column.

In [6]:
data['launched'] = pd.to_datetime(data['launched'], errors='coerce')

data = data.assign(hour = data.launched.dt.hour,
                   day = data.launched.dt.day,
                   month = data.launched.dt.month,
                   year = data.launched.dt.year) 

data.head().T

Unnamed: 0,0,1,2,3,4
ID,1000002330,1000003930,1000004038,1000007540,1000011046
name,The Songs of Adelaide & Abullah,Greeting From Earth: ZGAC Arts Capsule For ET,Where is Hank?,ToshiCapital Rekordz Needs Help to Complete Album,Community Film Project: The Art of Neighborhoo...
category,Poetry,Narrative Film,Narrative Film,Music,Film & Video
main_category,Publishing,Film & Video,Film & Video,Music,Film & Video
currency,GBP,USD,USD,USD,USD
deadline,2015-10-09,2017-11-01,2013-02-26,2012-04-16,2015-08-29
goal,1000,30000,45000,5000,19500
launched,2015-08-11 12:12:28,2017-09-02 04:43:57,2013-01-12 00:20:50,2012-03-17 03:24:11,2015-07-04 08:35:03
pledged,0,2421,220,1,1283
state,failed,failed,failed,failed,canceled
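
The conversion above passes errors='coerce', which is worth seeing on a small example. The sketch below uses made-up timestamps (the malformed third entry is an assumption for illustration, not a row from the dataset) to show that unparseable values become NaT and that the derived parts are then missing for those rows.

```python
import pandas as pd

# Toy timestamps; the last entry is deliberately malformed.
raw = pd.Series(["2015-08-11 12:12:28", "2017-09-02 04:43:57", "not a date"])

# errors='coerce' replaces unparseable strings with NaT instead of raising.
launched = pd.to_datetime(raw, errors="coerce")

# The .dt accessor exposes the individual components of each timestamp.
parts = pd.DataFrame({
    "hour": launched.dt.hour,
    "day": launched.dt.day,
    "month": launched.dt.month,
    "year": launched.dt.year,
})

print(parts)
print("Unparseable rows:", launched.isna().sum())
```

Counting NaT values after parsing is a cheap check that the coercion did not silently discard many rows.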


#### Prepping categorical variables
We need to convert categorical variables into integers, so our model can use the data.

In [7]:
from sklearn.preprocessing import LabelEncoder
categorical_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = data[categorical_features].apply(encoder.fit_transform)
encoded.head()


In [8]:
#collecting features for the model
final_data = data[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)
final_data.head()

Unnamed: 0,goal,hour,day,month,year,outcome,category,currency,country
0,1000.0,12,11,8,2015,0,108,5,9
1,30000.0,4,2,9,2017,0,93,13,22
2,45000.0,0,12,1,2013,0,93,13,22
3,5000.0,3,17,3,2012,0,90,13,22
4,19500.0,8,4,7,2015,0,55,13,22
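
The encoding cell reuses a single LabelEncoder across columns, so only the mapping of the last column stays available afterwards. If the mappings are needed later (for example to encode new rows consistently), one possible variant keeps one fitted encoder per column. The sketch below uses a toy frame; the names toy and encoders are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for two of the categorical columns.
toy = pd.DataFrame({
    "currency": ["GBP", "USD", "USD", "EUR"],
    "country": ["GB", "US", "US", "DE"],
})

# One fitted encoder per column, so each mapping can be inspected or inverted.
encoders = {col: LabelEncoder().fit(toy[col]) for col in toy.columns}

toy_encoded = pd.DataFrame(
    {col: encoders[col].transform(toy[col]) for col in toy.columns},
    index=toy.index,
)

print(toy_encoded)
print(list(encoders["currency"].classes_))
print(list(encoders["currency"].inverse_transform([0, 1, 2])))
```

LabelEncoder assigns integers in sorted order of the labels, so the codes carry no meaning beyond identifying the level.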


When should we use one-hot encoding?

For a column with a large number of distinct values (for example, 40,000), one-hot encoding creates an extremely sparse matrix with one column per value. That many columns makes training very slow, so in general we avoid one-hot encoding features with many levels. LightGBM is tree-based and works with label-encoded features, so we don't need to one-hot encode the categorical features here.
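
To make the size difference concrete, here is a minimal sketch on a synthetic column. The 1,000 made-up levels are an assumption chosen to keep the example small; the same effect scales with the number of levels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical high-cardinality feature: 1,000 rows, each with its own level.
levels = pd.Series([f"cat_{i}" for i in range(1000)], name="category")

one_hot = pd.get_dummies(levels)               # one column per distinct level
label = LabelEncoder().fit_transform(levels)   # a single integer column

print("One-hot shape:", one_hot.shape)
print("Label-encoded shape:", label.shape)
```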

#### Creating training, validation, and test splits
We'll use 10% of the data as a validation set, 10% for testing, and the other 80% for training.

In [9]:
valid_fraction = 0.1
valid_size = int(len(final_data) * valid_fraction)

train = final_data[:-2 * valid_size]
valid = final_data[-2 * valid_size:-valid_size]
test = final_data[-valid_size:]

In general we want to be careful that each data set has the same proportion of target classes. We will print out the fraction of successful outcomes for each of our datasets.

In [10]:
for each in [train, valid, test]:
    print(f"Outcome fraction = {each.outcome.mean():.4f}")

Outcome fraction = 0.3570
Outcome fraction = 0.3539
Outcome fraction = 0.3542


Each set has around 35% successful outcomes, likely because the data was already well randomized. A good way to enforce this automatically is sklearn.model_selection.StratifiedShuffleSplit.
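
As a rough sketch of the stratified approach, the example below applies StratifiedShuffleSplit to synthetic data with about 35% positives. The arrays X and y and the 20% holdout size are assumptions for illustration, not the notebook's data. Note that this splitter shuffles the rows, so it ignores launch order.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic features and a target with 35% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 350 + [0] * 650)

# One shuffled split that preserves the class proportions in both parts.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, holdout_idx = next(splitter.split(X, y))

print(f"Train outcome fraction   = {y[train_idx].mean():.4f}")
print(f"Holdout outcome fraction = {y[holdout_idx].mean():.4f}")
```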

This is time series data. Are there any special considerations when creating train/test splits for time series? If so, what are they and why do they matter?

Since our model is meant to predict events in the future, we must also validate it on events that occur after those in the training set. If data from different time periods is mixed between the training and test sets, future information leaks into the model and the validation results overestimate performance on new data.
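
One way to respect time order is to sort by launch time before slicing, so that the validation and test sets contain only later launches than the training set. The sketch below shows the idea on a toy frame; the frame, the 20% holdout fraction, and the toy_ names are assumptions, and this is not the split used earlier in this notebook.

```python
import pandas as pd

# Toy frame with unordered launch dates.
toy = pd.DataFrame({
    "launched": pd.to_datetime([
        "2015-08-11", "2017-09-02", "2013-01-12", "2012-03-17", "2015-07-04",
        "2016-02-26", "2014-12-01", "2016-02-01", "2014-04-24", "2014-07-11",
    ]),
    "outcome": [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
})

# Sort oldest first so the later slices only contain later launches.
ordered = toy.sort_values("launched").reset_index(drop=True)

holdout_size = int(len(ordered) * 0.2)
toy_train = ordered[:-2 * holdout_size]
toy_valid = ordered[-2 * holdout_size:-holdout_size]
toy_test = ordered[-holdout_size:]

print("Train ends:", toy_train["launched"].max())
print("Valid spans:", toy_valid["launched"].min(), "to", toy_valid["launched"].max())
print("Test starts:", toy_test["launched"].min())
```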

#### Train a LightGBM model

In [11]:
import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
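# LightGBM 4.0 removed the early_stopping_rounds and verbose_eval arguments
# of lgb.train; the equivalent callbacks are used instead.
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False),
                           lgb.log_evaluation(period=0)])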

#### Making predictions & evaluating the model

In [12]:
from sklearn import metrics
ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['outcome'], ypred)

print(f"Test AUC score: {score}")

Test AUC score: 0.747615303004287
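
To help interpret this score, recall that AUC measures how often the model ranks a positive example above a negative one: 0.5 corresponds to no ranking ability and 1.0 to a perfect ranking. The toy labels and scores below are made up purely to illustrate that scale.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1])

# A constant score cannot rank positives above negatives.
constant = np.full(4, 0.35)

# These scores order three of the four positive/negative pairs correctly.
ranked = np.array([0.1, 0.4, 0.35, 0.8])

auc_constant = metrics.roc_auc_score(y_true, constant)
auc_ranked = metrics.roc_auc_score(y_true, ranked)

print(f"Constant predictor AUC: {auc_constant}")
print(f"Ranked predictor AUC:   {auc_ranked}")
```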
