# Kickstarter Tabletop Model
This notebook attemps to classify the success of a Kickstater Tabletop Game

Data was obtained from https://webrobots.io/kickstarter-datasets/

## Dependencies

In [76]:
import time
import pandas as pd
import json
import datetime
import numpy
import random

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers.embeddings import Embedding
from sklearn.model_selection import GridSearchCV, train_test_split
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import SGD
from keras.constraints import maxnorm
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from keras.utils import to_categorical
import seaborn as sns

random.seed(42)

## Data Pre-Processing for the Model Input
The KickstarterTabletopFilter Notebook has the information used to scrape the files in the webrobots source. The Notebook Scrapes for all tabletop games in the files and obtaines desired values for the input of the models

In [77]:
tp_df = pd.read_csv("data/KS_data.csv").dropna()
tp_df["KS_lenght_seconds"] = (tp_df["date_end_epoch"] - tp_df["date_launched_epoch"])/(60*60*24)
tp_df.head()

Unnamed: 0.1,Unnamed: 0,ks_state,game_name,blurb,game_id,money_pledged,backers_count,location_id,money_goal,month_created,...,hour_launched,month_end,day_end,hour_end,month_obtained,day_obtained,date_created_epoch,date_launched_epoch,date_end_epoch,KS_lenght_seconds
0,27219,successful,"Indie Nerd Board Game, Needs Player Character ...",pictured here is the character we would like t...,1839304671,1741.02,28,2459115.0,1500,5,...,2,6,15,1,12,13,1242364202,1242369560,1245042600,30.937963
2,27393,successful,Lines of Fire: A fantasy battle board game. (M...,"Lines of Fire: a fast-paced battle game, kicks...",694655492,535.0,26,2383144.0,350,10,...,7,9,30,23,12,13,1256252218,1283342111,1285905540,29.669317
6,28668,successful,Inevitable: dystopian tabletop gaming,Violence! Betrayal! Laughs! Evil supercompu...,1889864699,9435.0,101,2367105.0,3000,3,...,14,6,1,12,12,13,1267636201,1267729972,1275408060,88.866759
7,27649,successful,"Happy Birthday, Robot!","A storytelling party game for clever kids, gam...",866507610,3030.0,110,2464592.0,1050,3,...,14,6,1,0,12,13,1269174182,1269195764,1275368340,71.441852
8,27312,successful,Maschine Zeit - A Roleplaying Game,You don't truly know someone until they're ble...,928107375,2630.0,45,2471217.0,650,4,...,12,5,1,11,12,13,1270394676,1270399059,1272727440,26.948854


In [78]:
#These variables are used to match the information to test results below
game_list = tp_df["game_name"].tail(len(tp_df) - round(len(tp_df)*4/5))
month_obtained = tp_df["month_obtained"].tail(len(tp_df) - round(len(tp_df)*4/5))
day_obtained = tp_df["day_obtained"].tail(len(tp_df) - round(len(tp_df)*4/5))

In [79]:
# There are a lot of variables are that are not directly placed into the model,
#but modified or combined as part of model input.
tp_df = tp_df.drop(['Unnamed: 0'], axis=1)
tp_df = tp_df.drop(['game_id'], axis=1)
tp_df = tp_df.drop(['location_id'], axis=1)
tp_df = tp_df.drop(['game_name'], axis=1)
tp_df = tp_df.drop(['date_created_epoch'], axis=1)
tp_df = tp_df.drop(['date_launched_epoch'], axis=1)
tp_df = tp_df.drop(['date_end_epoch'], axis=1)
tp_df = tp_df.drop(['day_obtained'], axis=1)
tp_df = tp_df.drop(['backers_count'], axis=1)
tp_df.head()

Unnamed: 0,ks_state,blurb,money_pledged,money_goal,month_created,day_created,hour_created,month_launched,day_launched,hour_launched,month_end,day_end,hour_end,month_obtained,KS_lenght_seconds
0,successful,pictured here is the character we would like t...,1741.02,1500,5,15,1,5,15,2,6,15,1,12,30.937963
2,successful,"Lines of Fire: a fast-paced battle game, kicks...",535.0,350,10,22,18,9,1,7,9,30,23,12,29.669317
6,successful,Violence! Betrayal! Laughs! Evil supercompu...,9435.0,3000,3,3,12,3,4,14,6,1,12,12,88.866759
7,successful,"A storytelling party game for clever kids, gam...",3030.0,1050,3,21,8,3,21,14,6,1,0,12,71.441852
8,successful,You don't truly know someone until they're ble...,2630.0,650,4,4,11,4,4,12,5,1,11,12,26.948854


## Encoding the different Classes that the Model will Output
These are the ranges of success that the model will try to predict, there are models out there at predict Kickstarter Success, but not many that deal with levels of success.

In [80]:
tp_df["ks_state"][tp_df["ks_state"] == "successful"] = 1
tp_df["ks_state"][tp_df["ks_state"] != 1] = 0
ks_money_list = []
tp_df["ks_money_made"] = [ "<100%" if p < g else\
                         "100%-125%" if p <= g*1.25 else\
                         "125%-150%" if p <= g*1.5 else\
                         "150%-200%" if p <= g*2 else\
                         "200%-500%" if p <= g*5 else \
                          "500%-1000%" if p <= g*10 \
                          else ">1000%"
                          for p,g in zip(tp_df["money_pledged"], tp_df["money_goal"])]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


## Stats Model - Time and Money Inputs
There is not much numerical data included in the kickstarter files, but there are a lot of useful dates that can be broken down to train the models.

In [81]:
X = tp_df.drop(["ks_state"], axis=1)\
        .drop(["money_pledged"], axis=1)\
        .drop(["ks_money_made"], axis = 1)\
        .drop(["blurb"], axis = 1)
X.head()

Unnamed: 0,money_goal,month_created,day_created,hour_created,month_launched,day_launched,hour_launched,month_end,day_end,hour_end,month_obtained,KS_lenght_seconds
0,1500,5,15,1,5,15,2,6,15,1,12,30.937963
2,350,10,22,18,9,1,7,9,30,23,12,29.669317
6,3000,3,3,12,3,4,14,6,1,12,12,88.866759
7,1050,3,21,8,3,21,14,6,1,0,12,71.441852
8,650,4,4,11,4,4,12,5,1,11,12,26.948854


## Text Model - Blurb Text Input
Kickstarter data includes a Sentence (Blurb) that describes something about the product in question, which can be used to train a model for detecting success based on that description.

### One-Hot Encoding Blurbs

In [82]:
vocab_size = 1000
print(tp_df["blurb"][0])
tp_df["blurb"] = [one_hot(d, vocab_size,filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                                  lower=True, split=' ') for d in tp_df["blurb"]]
print(tp_df["blurb"][0])
field_length = tp_df["blurb"].astype(str).map(len)

max_length = 30
X_blurbs_train = pad_sequences(list(tp_df["blurb"].head(round(len(X)*4/5))),\
                              maxlen=max_length, padding='pre')
X_blurbs_test = pad_sequences(list(tp_df["blurb"].tail(len(y) - round(len(y)*4/5))),\
                              maxlen=max_length, padding='pre')

pictured here is the character we would like to sculpt as a high quality game miniature - the Forsaker, a man that has abandoned his sanity to face..
[420, 689, 48, 764, 172, 394, 81, 453, 287, 565, 528, 192, 882, 446, 100, 326, 764, 931, 192, 170, 399, 968, 270, 677, 891, 287, 852]


### Model(s) Targets

In [83]:
y = tp_df[["ks_money_made"]]
y.head()

Unnamed: 0,ks_money_made
0,100%-125%
2,150%-200%
6,200%-500%
7,200%-500%
8,200%-500%


## Generating Test and Train Inputs/Targets

### Model Input Scaling

In [108]:
#Training and Test Inputs
X_train = X.head(round(len(X)*4/5))
X_test = X.tail(len(X) - round(len(X)*4/5))

#Input Scaler                            
X_scaler = MinMaxScaler().fit(X_train)
#Input Train and Test Scalers
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)  

print(X_train_scaled[0])                              
print (X_blurbs_train[0])

[2.99800060e-04 3.63636364e-01 4.66666667e-01 4.34782609e-02
 3.63636364e-01 4.66666667e-01 8.69565217e-02 4.54545455e-01
 4.66666667e-01 4.34782609e-02 1.00000000e+00 3.36140150e-01]
[  0   0   0 420 689  48 764 172 394  81 453 287 565 528 192 882 446 100
 326 764 931 192 170 399 968 270 677 891 287 852]


### Model Target Scaling

In [109]:
#Training and Test Targets
y_train = y.head(round(len(y)*4/5))
y_test = y.tail(len(y) - round(len(y)*4/5))

#Target Training Encoder
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
encoded_y_train = label_encoder.transform(y_train)
print(encoded_y_train[0])
y_train_scaled = to_categorical(encoded_y_train)
print(y_train_scaled.shape)

#Target Testing Encoder
label_encoder.fit(y_test)
encoded_y_test = label_encoder.transform(y_test)
print(encoded_y_test[0])
y_test_scaled = to_categorical(encoded_y_test)
print(y_test_scaled.shape)

0
(6741, 7)
5
(1685, 7)


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


## Blurb Model 

In [85]:
#Simple Sequential model using keras to input the encoding arrays
model_blurb = Sequential()
model_blurb.add(Embedding(vocab_size, 8, input_length=max_length))
model_blurb.add(Flatten())
model_blurb.add(Dense(100, activation='relu'))
model_blurb.add(Dense(80, activation='relu'))
model_blurb.add(Dense(y_train_scaled.shape[1], activation='softmax'))

print(model_blurb.summary())
## The model can be compiled and fitted by itself by using the code below.
model_blurb.compile(loss='categorical_crossentropy',
                    optimizer="adam",
                    metrics=['mean_absolute_error', 'mean_squared_error'])
model_blurb.fit(X_blurbs_train, y_train_scaled, epochs=20, verbose=1)

0
(6741, 7)
5
(1685, 7)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 30, 8)             8000      
_________________________________________________________________
flatten_3 (Flatten)          (None, 240)               0         
_________________________________________________________________
dense_58 (Dense)             (None, 100)               24100     
_________________________________________________________________
dense_59 (Dense)             (None, 80)                8080      
_________________________________________________________________
dense_60 (Dense)             (None, 7)                 567       
Total params: 40,747
Trainable params: 40,747
Non-trainable params: 0
_________________________________________________________________
None


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


### Blurb Model Results

In [118]:
results = pd.DataFrame()
# make a prediction
ynew = model_blurb.predict(X_blurbs_test)
# show the inputs and predicted outputs
encoded_predictions = model_blurb.predict_classes(X_blurbs_test)
prediction_labels = label_encoder.inverse_transform(encoded_predictions)
prediction_labels
results["game_name"] = list(game_list)
results["prediction"] = prediction_labels
results["actual"] = y_test.reset_index().drop("index",axis=1)
acc = len(results[results["prediction"] == results["actual"]])/len(results)
print (f"Model accuracy: {acc*100}%")
results.head(10)

Model accuracy: 26.290801186943618%


Unnamed: 0,game_name,prediction,actual
0,Sounds from the Void - Cthulhu horror music co...,200%-500%,<100%
1,Sounds from the Void - Cthulhu horror music co...,200%-500%,125%-150%
2,PitchCar Expansion 7: The Loop,200%-500%,500%-1000%
3,Cube Attack,<100%,<100%
4,Cargo Express,<100%,125%-150%
5,Cargo Express,<100%,<100%
6,Krosmaster Blast: A 2-4 player tactical skirmi...,200%-500%,100%-125%
7,Krosmaster Blast: A 2-4 player tactical skirmi...,200%-500%,200%-500%
8,Promo Paradise,<100%,>1000%
9,Promo Paradise,<100%,>1000%


## Stats Model

In [89]:
#Gridsearch prefers to have the model in a function
def stats_model(activation = "relu",
                dropout_rate = 0.2,
                init_mode = 'uniform',
                weight_constraint = 0,
                optimizer = 'adam',
               ):
    #General sequential model with some parameters as inputs that can be optimized by using gridsearch    
    model_stats = Sequential()
    model_stats.add(Dense(100, input_dim=X_test_scaled.shape[1],\
                          activation=activation,\
                          kernel_initializer =init_mode))
    model_stats.add(Dropout(dropout_rate))
    model_stats.add(Dense(50, activation=activation))
    model_stats.add(Dropout(dropout_rate))
    model_stats.add(Dense(20, activation=activation))
    model_stats.add(Dense(y_train_scaled.shape[1],
                          kernel_initializer =init_mode,
                          activation='softmax'))

    model_stats.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer,metrics=['accuracy','mean_absolute_error', 'mean_squared_error'])
    return model_stats

### Stats Model Optimization using Gridsearch

In [90]:
#Parameters to Optimize: there are many more to choose from.
epochs = [500,1000]
batch_size = [20]
dropout_rate = [0.2]
optimizer = [ 'SGD','Adam']

param_grid = dict(dropout_rate= dropout_rate,
                  epochs=epochs,
                  batch_size=batch_size,
                  optimizer = optimizer)

model_stats = KerasClassifier(build_fn=stats_model, class_weight="balanced", verbose=2)

grid = GridSearchCV(estimator=model_stats,param_grid=param_grid,n_jobs=-1)

## These 2 lines will create activate the gridsearch and print the best score and parameters
# grid_result = grid.fit(X_train_scaled,y_train)
# print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")

### Stats Model in Action
The cell below runs model which can be even more optimized with more GridSearch parameter changes.

In [93]:
activation = "relu"
dropout_rate = 0.2
init_mode = 'uniform'
weight_constraint = 0
optimizer = 'adam'            
        
model_stats = Sequential()
model_stats.add(Dense(100, input_dim=X_test_scaled.shape[1],\
                      activation=activation,\
                      kernel_initializer =init_mode))
model_stats.add(Dropout(dropout_rate))
model_stats.add(Dense(50, activation=activation))
model_stats.add(Dropout(dropout_rate))
model_stats.add(Dense(20, activation=activation))
model_stats.add(Dense(y_train_scaled.shape[1],
                      kernel_initializer =init_mode,
                      activation='softmax'))

model_stats.compile(loss='categorical_crossentropy', optimizer=optimizer,metrics=['accuracy','mean_absolute_error', 'mean_squared_error'])

model_stats.fit(X_train_scaled,y_train_scaled, epochs=20, verbose=2,batch_size=20)

model_stats.save("KSmodelStats.h5")

In [94]:
## This will load the model so it is not run every time
model_stats = load_model("KSmodelStats.h5")

## Stats Model Results

In [120]:
results = pd.DataFrame()
# make a prediction
ynew = model_stats.predict(X_test)
# show the inputs and predicted outputs
encoded_predictions = model_stats.predict_classes(X_test_scaled)
prediction_labels = label_encoder.inverse_transform(encoded_predictions)
prediction_labels
results["game_name"] = list(game_list)
results["prediction"] = prediction_labels
results["actual"] = y_test.reset_index().drop("index", axis = 1)
acc = len(results[results["prediction"] == results["actual"]])/len(results)
print (f"Model accuracy: {acc*100}%")
results.head(10)

Model accuracy: 30.267062314540063%


Unnamed: 0,game_name,prediction,actual
0,Sounds from the Void - Cthulhu horror music co...,<100%,<100%
1,Sounds from the Void - Cthulhu horror music co...,100%-125%,125%-150%
2,PitchCar Expansion 7: The Loop,100%-125%,500%-1000%
3,Cube Attack,<100%,<100%
4,Cargo Express,100%-125%,125%-150%
5,Cargo Express,<100%,<100%
6,Krosmaster Blast: A 2-4 player tactical skirmi...,<100%,100%-125%
7,Krosmaster Blast: A 2-4 player tactical skirmi...,100%-125%,200%-500%
8,Promo Paradise,100%-125%,>1000%
9,Promo Paradise,<100%,>1000%


# Why not both? (Merged Model)
To merged both text and stats inputs, I thought of Merging the 2 models into one sing keras

In [102]:
mergedOutput = Add()([model_blurb.output, model_stats.output])

out = Dense(128, activation='relu')(mergedOutput)
out = Dropout(0.2)(out)
out = Dense(32, activation='relu')(out)
out = Dropout(0.2)(out)
out = Dense(y_train_scaled.shape[1], activation='softmax')(out)

mixed_model = Model(
    [model_blurb.input, model_stats.input], #model with two input tensors
    out                         #and one output tensor
) 

mixed_model.compile(loss='categorical_crossentropy', optimizer="adam",metrics=['accuracy','mean_absolute_error', 'mean_squared_error'])


mixed_model.fit([X_blurbs_train,X_train_scaled],y_train_scaled, epochs=20, verbose=2, batch_size = 20)

Epoch 1/20
 - 2s - loss: 1.8142 - mean_absolute_error: 0.2334 - mean_squared_error: 0.1161
Epoch 2/20
 - 0s - loss: 1.7213 - mean_absolute_error: 0.2238 - mean_squared_error: 0.1119
Epoch 3/20
 - 1s - loss: 1.6564 - mean_absolute_error: 0.2160 - mean_squared_error: 0.1078
Epoch 4/20
 - 1s - loss: 1.5708 - mean_absolute_error: 0.2061 - mean_squared_error: 0.1028
Epoch 5/20
 - 1s - loss: 1.5153 - mean_absolute_error: 0.2000 - mean_squared_error: 0.0995
Epoch 6/20
 - 1s - loss: 1.4618 - mean_absolute_error: 0.1947 - mean_squared_error: 0.0967
Epoch 7/20
 - 1s - loss: 1.4296 - mean_absolute_error: 0.1920 - mean_squared_error: 0.0954
Epoch 8/20
 - 1s - loss: 1.3941 - mean_absolute_error: 0.1892 - mean_squared_error: 0.0940
Epoch 9/20
 - 1s - loss: 1.3375 - mean_absolute_error: 0.1844 - mean_squared_error: 0.0915
Epoch 10/20
 - 1s - loss: 1.3028 - mean_absolute_error: 0.1814 - mean_squared_error: 0.0902
Epoch 11/20
 - 1s - loss: 1.2469 - mean_absolute_error: 0.1770 - mean_squared_error: 0.08

<keras.callbacks.History at 0x167cefbb2b0>

## Merged Model Results


In [122]:

results = pd.DataFrame()
# make a prediction
ynew = mixed_model.predict([X_blurbs_test,X_test])
# show the inputs and predicted outputs
encoded_predictions = mixed_model.predict([X_blurbs_test,X_test_scaled])
results["game_name"] = list(game_list)
results["prediction"] = [np.argmax(x) for x in encoded_predictions]
print(len(prediction_labels))
results["actual"] = y_test.reset_index().drop("index", axis = 1)
#The type of output of this merged models requires for it to be encoded back
results["prediction"] = [ "<100%" if p == 0 else\
                         "100%-125%" if p == 1 else\
                         "125%-150%" if p == 2 else\
                         "150%-200%" if p == 3 else\
                         "200%-500%" if p == 4 else \
                          "500%-1000%" if p == 5 \
                          else ">1000%"
                          for p in results["prediction"]]
results.head()

acc = len(results[results["prediction"] == results["actual"]])/len(results)
results.head(10)

X=money_goal           500.0
month_created         11.0
day_created            7.0
hour_created           9.0
month_launched        11.0
day_launched          13.0
hour_launched          8.0
month_end             11.0
day_end               27.0
hour_end               8.0
month_obtained        11.0
KS_lenght_seconds     14.0
Name: 6746, dtype: float64, Predicted=[0.17747924 0.16661307 0.20076221 0.19039293 0.04205324 0.19591649
 0.02678278]
1685


Unnamed: 0,game_name,prediction,actual
0,Sounds from the Void - Cthulhu horror music co...,150%-200%,<100%
1,Sounds from the Void - Cthulhu horror music co...,150%-200%,125%-150%
2,PitchCar Expansion 7: The Loop,150%-200%,500%-1000%
3,Cube Attack,500%-1000%,<100%
4,Cargo Express,500%-1000%,125%-150%
5,Cargo Express,500%-1000%,<100%
6,Krosmaster Blast: A 2-4 player tactical skirmi...,150%-200%,100%-125%
7,Krosmaster Blast: A 2-4 player tactical skirmi...,150%-200%,200%-500%
8,Promo Paradise,125%-150%,>1000%
9,Promo Paradise,500%-1000%,>1000%


In [106]:
model_stats.save("KSmodelMixed.h5")