# Kickstarter Projects
#### More than 300,000 Kickstarter projects
## About Dataset
### Context
I'm a crowdfunding enthusiast and I've been watching Kickstarter since its early days. Right now I just collect data, and the only app I've made is a Twitter bot that tweets whenever a project reaches a milestone: @bloomwatcher. I have a lot of other ideas, but sadly not enough time to develop them… But I hope you can!
### Content
You'll find the most useful data for project analysis. Columns are self-explanatory except:
- usd pledged: the pledged column converted to US dollars (conversion done by Kickstarter).
- usd_pledged_real: the pledged column converted to US dollars (conversion from the Fixer.io API).
- usd_goal_real: the goal column converted to US dollars (conversion from the Fixer.io API).
### Acknowledgements
Data are collected from the Kickstarter platform.

The USD conversions (usd_pledged_real and usd_goal_real columns) were generated with the "convert ks pledges to usd" script by tonyplaysguitar.
### Inspiration
I hope to see great projects, and why not a model to predict if a project will be successful before it is released? :)

# First look at the data

In [191]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('data/ks-projects-201801.csv', index_col=0)

In [192]:
data.head()

ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


In [193]:
data.describe()

statistic,goal,pledged,backers,usd pledged,usd_pledged_real,usd_goal_real
count,378661.0,378661.0,378661.0,374864.0,378661.0,378661.0
mean,49080.79,9682.979,105.617476,7036.729,9058.924,45454.4
std,1183391.0,95636.01,907.185035,78639.75,90973.34,1152950.0
min,0.01,0.0,0.0,0.0,0.0,0.01
25%,2000.0,30.0,2.0,16.98,31.0,2000.0
50%,5200.0,620.0,12.0,394.72,624.33,5500.0
75%,16000.0,4076.0,56.0,3034.09,4050.0,15500.0
max,100000000.0,20338990.0,219382.0,20338990.0,20338990.0,166361400.0


In [194]:
data.describe(include='object')

statistic,name,category,main_category,currency,deadline,launched,state,country
count,378657,378661,378661,378661,378661,378661,378661,378661
unique,375764,159,15,14,3164,378089,6,23
top,New EP/Music Development,Product Design,Film & Video,USD,2014-08-08,1970-01-01 01:00:00,failed,US
freq,41,22314,63585,295365,705,7,197719,292627


In [195]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 378661 entries, 1000002330 to 999988282
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   name              378657 non-null  object 
 1   category          378661 non-null  object 
 2   main_category     378661 non-null  object 
 3   currency          378661 non-null  object 
 4   deadline          378661 non-null  object 
 5   goal              378661 non-null  float64
 6   launched          378661 non-null  object 
 7   pledged           378661 non-null  float64
 8   state             378661 non-null  object 
 9   backers           378661 non-null  int64  
 10  country           378661 non-null  object 
 11  usd pledged       374864 non-null  float64
 12  usd_pledged_real  378661 non-null  float64
 13  usd_goal_real     378661 non-null  float64
dtypes: float64(5), int64(1), object(8)
memory usage: 43.3+ MB


In [196]:
data.isna().sum()

name                   4
category               0
main_category          0
currency               0
deadline               0
goal                   0
launched               0
pledged                0
state                  0
backers                0
country                0
usd pledged         3797
usd_pledged_real       0
usd_goal_real          0
dtype: int64

We have some missing values (4 in name and 3797 in usd pledged), which we will handle in the next section.
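
Raw counts are easier to judge as a share of all rows: 3797 missing values out of 378661 is about 1% of the `usd pledged` column. A minimal sketch of computing that share per column, using a made-up toy frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    'name': ['A', None, 'C', 'D'],
    'usd pledged': [10.0, np.nan, np.nan, 5.0],
    'goal': [100.0, 200.0, 300.0, 400.0],
})

# isna().mean() gives the fraction of missing entries in each column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real frame the same expression would be `data.isna().mean()`.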

# Data preparation and cleaning

Let's convert the deadline and launched columns to datetime

In [197]:
data['deadline'] = pd.to_datetime(data['deadline'])
data['launched'] = pd.to_datetime(data['launched'])
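
The `describe(include='object')` output above lists `1970-01-01 01:00:00` as the most frequent `launched` value (7 rows), which looks like a placeholder epoch timestamp rather than a real launch date. A hedged sketch of how such rows could be flagged after the conversion; the toy frame and the cutoff year 2009 (the year Kickstarter launched) are assumptions for illustration:

```python
import pandas as pd

# Toy frame: one placeholder timestamp and two real ones from data.head()
toy = pd.DataFrame({
    'launched': ['1970-01-01 01:00:00', '2015-08-11 12:12:28', '2017-09-02 04:43:57'],
})
toy['launched'] = pd.to_datetime(toy['launched'])

# Rows dated before the platform existed are almost certainly placeholders
suspicious = toy['launched'].dt.year < 2009
print(toy[suspicious])
print(int(suspicious.sum()), 'suspicious row(s)')
```

Whether to drop or repair such rows is a separate decision; the notebook keeps them as they are.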

Next, I will handle the missing values and drop features that have no explanatory value

In [198]:
data.columns

Index(['name', 'category', 'main_category', 'currency', 'deadline', 'goal',
       'launched', 'pledged', 'state', 'backers', 'country', 'usd pledged',
       'usd_pledged_real', 'usd_goal_real'],
      dtype='object')

Well, the name feature clearly doesn't have any explanatory power, so let's drop it

In [199]:
data.drop(['name'], inplace=True, axis=1)

Now let's impute the missing values in the usd pledged column with the column mean

In [200]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data['usd pledged'] = data['usd pledged'].fillna(data['usd pledged'].mean())
data['usd pledged'].isna().sum()

0
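
`describe()` shows that `usd pledged` is heavily right-skewed (mean about 7,037 versus a median of about 395), so the mean fills the gaps with a value larger than most observed ones. A small sketch comparing the two fill values on made-up numbers; median imputation is shown only as an alternative, not as what the notebook does:

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed sample with two gaps
s = pd.Series([0.0, 10.0, 20.0, 30.0, 10000.0, np.nan, np.nan])

mean_filled = s.fillna(s.mean())      # same strategy as the cell above
median_filled = s.fillna(s.median())  # alternative that is robust to outliers

print('fill value (mean):  ', s.mean())
print('fill value (median):', s.median())
```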

Let's also look at the state column, which happens to be our target

In [201]:
data['state'].unique()

array(['failed', 'canceled', 'successful', 'live', 'undefined',
       'suspended'], dtype=object)

So successful and live can be treated as success, and all the other states as some kind of failure

In [202]:
# Collapse the six states into two classes
data['state'] = data['state'].replace({
    'canceled': 'failed',
    'live': 'successful',
    'undefined': 'failed',
    'suspended': 'failed',
})
data['state'].unique()

array(['failed', 'successful'], dtype=object)

Now let's split the dataset into y (the target) and X (the features)

In [203]:
y = data.pop('state')
X = data.copy()

In [204]:
y.head()

ID
1000002330    failed
1000003930    failed
1000004038    failed
1000007540    failed
1000011046    failed
Name: state, dtype: object

In [205]:
X.head()

ID,category,main_category,currency,deadline,goal,launched,pledged,backers,country,usd pledged,usd_pledged_real,usd_goal_real
1000002330,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,0,GB,0.0,0.0,1533.95
1000003930,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,15,US,100.0,2421.0,30000.0
1000004038,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,3,US,220.0,220.0,45000.0
1000007540,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,1,US,1.0,1.0,5000.0
1000011046,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,14,US,1283.0,1283.0,19500.0


# Feature engineering, encoding and scaling

Let's turn the datetime columns into numeric features the model can use: year, month, and campaign length in days

In [206]:
X['deadline_year'] = X['deadline'].dt.year.astype(np.float64)
X['deadline_month'] = X['deadline'].dt.month.astype(np.float64)

X['launched_year'] = X['launched'].dt.year.astype(np.float64)
X['launched_month'] = X['launched'].dt.month.astype(np.float64)

X['project_time_length'] = (X['deadline'] - X['launched']).dt.days.astype(np.float64)
X.drop(['deadline', 'launched'], inplace=True, axis=1)
X.head(10)

ID,category,main_category,currency,goal,pledged,backers,country,usd pledged,usd_pledged_real,usd_goal_real,deadline_year,deadline_month,launched_year,launched_month,project_time_length
1000002330,Poetry,Publishing,GBP,1000.0,0.0,0,GB,0.0,0.0,1533.95,2015.0,10.0,2015.0,8.0,58.0
1000003930,Narrative Film,Film & Video,USD,30000.0,2421.0,15,US,100.0,2421.0,30000.0,2017.0,11.0,2017.0,9.0,59.0
1000004038,Narrative Film,Film & Video,USD,45000.0,220.0,3,US,220.0,220.0,45000.0,2013.0,2.0,2013.0,1.0,44.0
1000007540,Music,Music,USD,5000.0,1.0,1,US,1.0,1.0,5000.0,2012.0,4.0,2012.0,3.0,29.0
1000011046,Film & Video,Film & Video,USD,19500.0,1283.0,14,US,1283.0,1283.0,19500.0,2015.0,8.0,2015.0,7.0,55.0
1000014025,Restaurants,Food,USD,50000.0,52375.0,224,US,52375.0,52375.0,50000.0,2016.0,4.0,2016.0,2.0,34.0
1000023410,Food,Food,USD,1000.0,1205.0,16,US,1205.0,1205.0,1000.0,2014.0,12.0,2014.0,12.0,19.0
1000030581,Drinks,Food,USD,25000.0,453.0,40,US,453.0,453.0,25000.0,2016.0,3.0,2016.0,2.0,44.0
1000034518,Product Design,Design,USD,125000.0,8233.0,58,US,8233.0,8233.0,125000.0,2014.0,5.0,2014.0,4.0,34.0
100004195,Documentary,Film & Video,USD,65000.0,6240.57,43,US,6240.57,6240.57,65000.0,2014.0,8.0,2014.0,7.0,29.0
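
The `project_time_length` values can be checked by hand. The sketch below recomputes the first two rows shown in `data.head()`; it illustrates that `.dt.days` keeps only whole days and drops the remaining hours, minutes, and seconds:

```python
import pandas as pd

# First two rows of the dataset, as shown in data.head()
toy = pd.DataFrame({
    'deadline': pd.to_datetime(['2015-10-09', '2017-11-01']),
    'launched': pd.to_datetime(['2015-08-11 12:12:28', '2017-09-02 04:43:57']),
})

delta = toy['deadline'] - toy['launched']
print(delta)          # full timedeltas, including hours/minutes/seconds
print(delta.dt.days)  # whole days only; the fractional day is dropped
```

The results, 58 and 59 days, match the first two rows of the table above.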


To encode the target I am going to use sklearn's OrdinalEncoder

In [207]:
from sklearn.preprocessing import OrdinalEncoder 
target_encoder = OrdinalEncoder()
# The encoder expects a 2-D input; ravel() flattens the result back to 1-D
y = pd.Series(target_encoder.fit_transform(y.values.reshape(-1, 1)).ravel(), index=y.index)
y.head(10)

ID
1000002330    0.0
1000003930    0.0
1000004038    0.0
1000007540    0.0
1000011046    0.0
1000014025    1.0
1000023410    1.0
1000030581    0.0
1000034518    0.0
100004195     0.0
dtype: float64
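
Since the target has only two values, the same 0/1 encoding can also be obtained with a plain comparison. A minimal sketch on a toy Series (an equivalent shortcut, not the notebook's approach; the index values are made up):

```python
import pandas as pd

toy_y = pd.Series(['failed', 'successful', 'failed'], index=[101, 102, 103])

# True/False -> 1.0/0.0; 'failed' sorts before 'successful', so this matches
# the 0.0/1.0 codes an ordinal encoder assigns to these two labels
encoded = (toy_y == 'successful').astype('float64')
print(encoded)
```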

For the other categorical features I am going to use a one-hot encoder

In [208]:
from sklearn.preprocessing import OneHotEncoder

cat_cols = [x for x in X.columns if X[x].dtype == 'object']
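# Dense output; on scikit-learn < 1.2 this argument is called sparse=False
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded = pd.DataFrame(encoder.fit_transform(X.loc[:, cat_cols]), index=X.index)
X.drop(cat_cols, inplace=True, axis=1)
X = pd.concat([X, encoded], axis=1)
# Recent scikit-learn versions reject frames that mix string and integer column names
X.columns = X.columns.astype(str)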
X.head()


ID,goal,pledged,backers,usd pledged,usd_pledged_real,usd_goal_real,deadline_year,deadline_month,launched_year,launched_month,...,201,202,203,204,205,206,207,208,209,210
1000002330,1000.0,0.0,0,0.0,0.0,1533.95,2015.0,10.0,2015.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000003930,30000.0,2421.0,15,100.0,2421.0,30000.0,2017.0,11.0,2017.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1000004038,45000.0,220.0,3,220.0,220.0,45000.0,2013.0,2.0,2013.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1000007540,5000.0,1.0,1,1.0,1.0,5000.0,2012.0,4.0,2012.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1000011046,19500.0,1283.0,14,1283.0,1283.0,19500.0,2015.0,8.0,2015.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


For scaling I am going to use StandardScaler

In [209]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), index=X.index, columns=X.columns)
X.head()

ID,goal,pledged,backers,usd pledged,usd_pledged_real,usd_goal_real,deadline_year,deadline_month,launched_year,launched_month,...,201,202,203,204,205,206,207,208,209,210
1000002330,-0.04063,-0.101248,-0.116423,-0.089933,-0.099578,-0.038094,0.352985,0.981071,0.389063,0.461979,...,-0.010278,-0.012797,-0.068179,-0.100643,-0.087361,-0.043281,-0.061936,-0.068276,-0.038312,-1.84426
1000003930,-0.016124,-0.075934,-0.099889,-0.088655,-0.072966,-0.013404,1.392245,1.280244,1.423568,0.762268,...,-0.010278,-0.012797,-0.068179,-0.100643,-0.087361,-0.043281,-0.061936,-0.068276,-0.038312,0.542223
1000004038,-0.003448,-0.098948,-0.113117,-0.087121,-0.09716,-0.000394,-0.686274,-1.412315,-0.645443,-1.640042,...,-0.010278,-0.012797,-0.068179,-0.100643,-0.087361,-0.043281,-0.061936,-0.068276,-0.038312,0.542223
1000007540,-0.03725,-0.101238,-0.115321,-0.08992,-0.099567,-0.035088,-1.205904,-0.813969,-1.162696,-1.039464,...,-0.010278,-0.012797,-0.068179,-0.100643,-0.087361,-0.043281,-0.061936,-0.068276,-0.038312,0.542223
1000011046,-0.024997,-0.087833,-0.100991,-0.073535,-0.085475,-0.022511,0.352985,0.382724,0.389063,0.16169,...,-0.010278,-0.012797,-0.068179,-0.100643,-0.087361,-0.043281,-0.061936,-0.068276,-0.038312,0.542223


# Modeling

Let's split the data into training and validation sets

In [210]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X,y, shuffle=True, random_state=1)
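
Because the scaler was fitted on all rows before this split, the validation rows contributed to the means and standard deviations. A common alternative is to split first and fit the scaler on the training part only. A sketch of that order on random toy data (array shapes and variable names are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

# Default test_size is 0.25, the same default the notebook relies on
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, shuffle=True, random_state=1)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from training rows only
X_va_scaled = scaler.transform(X_va)      # validation rows reuse those statistics

print(X_tr_scaled.mean(axis=0).round(6))  # close to 0 by construction
print(X_va_scaled.mean(axis=0).round(3))  # close to, but not exactly, 0
```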

And now let's train the model!

In [212]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(256, activation='relu', input_shape=[X.shape[1]]),
    layers.Dense(512, activation='relu'),  
    layers.Dense(256, activation='relu'),  
    layers.Dense(1, activation='sigmoid'),
])
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['binary_accuracy'],
)
early_stopping = keras.callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=1000,
    callbacks=[early_stopping],
    verbose=1,  # print progress for each epoch
)
history_df = pd.DataFrame(history.history)
# Start the plot at epoch 5
history_df.loc[5:, ['loss', 'val_loss']].plot()
history_df.loc[5:, ['binary_accuracy', 'val_binary_accuracy']].plot()
print(
    "Best Validation Loss: {:0.4f}\nBest Validation Accuracy: {:0.4f}".format(
        history_df['val_loss'].min(),
        history_df['val_binary_accuracy'].max(),
    )
)

Train on 283995 samples, validate on 94666 samples
Epoch 1/1000
...
Epoch 59/1000
Best Validation Loss: 0.1308
Best Validation Accuracy: 0.9527
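
The log stops at epoch 59 of 1000 because of the `EarlyStopping` callback: training ends once the validation loss has not improved by at least `min_delta` for `patience` consecutive epochs, and `restore_best_weights=True` rolls the model back to the best epoch. The plain-Python function below is a simplified illustration of that rule; `stopping_epoch` is a hypothetical helper, not part of Keras, and the loss values are made up:

```python
def stopping_epoch(val_losses, patience=10, min_delta=0.001):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of validation losses.

    Simplified model of patience-based early stopping, not Keras's actual code.
    """
    best = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # counts as an improvement
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:      # no improvement for `patience` epochs
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up loss curve: improves for 5 epochs, then plateaus
losses = [0.50, 0.40, 0.30, 0.20, 0.15] + [0.15] * 20
print(stopping_epoch(losses))
```

With `patience=10`, the toy run stops ten epochs after its last improvement, which is the same pattern as the real run ending well before epoch 1000.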


In [213]:
model.evaluate(X_valid, y_valid)



[0.1308456769762158, 0.9518729]
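
`model.evaluate` returns the loss followed by the compiled metrics, here the validation loss and the binary accuracy. Binary accuracy thresholds the sigmoid output at 0.5 by default. A small NumPy sketch of that computation on made-up probabilities (in the notebook these would come from `model.predict(X_valid)`):

```python
import numpy as np

y_true = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
y_prob = np.array([0.10, 0.80, 0.40, 0.30, 0.95])  # made-up sigmoid outputs

y_pred = (y_prob > 0.5).astype(float)  # threshold at 0.5
accuracy = (y_pred == y_true).mean()   # share of correct predictions
print(accuracy)
```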