## Multi-Layer Perceptron and Random Forest Comparison

- Compare a multi-layer perceptron (MLP) classifier with a random forest (RF) classifier.
- Predict whether a Kickstarter project will be successfully funded.
- Dataset from https://www.kaggle.com/kemical/kickstarter-projects/data

### A. Data Cleaning

In [1]:
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None

In [2]:
# Open file.
df = pd.read_csv('ks-projects-201801.csv')
df.head()

| | ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000.0 | 2015-08-11 12:12:28 | 0.0 | failed | 0 | GB | 0.0 | 0.0 | 1533.95 |
| 1 | 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000.0 | 2017-09-02 04:43:57 | 2421.0 | failed | 15 | US | 100.0 | 2421.0 | 30000.0 |
| 2 | 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000.0 | 2013-01-12 00:20:50 | 220.0 | failed | 3 | US | 220.0 | 220.0 | 45000.0 |
| 3 | 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000.0 | 2012-03-17 03:24:11 | 1.0 | failed | 1 | US | 1.0 | 1.0 | 5000.0 |
| 4 | 1000011046 | Community Film Project: The Art of Neighborhoo... | Film & Video | Film & Video | USD | 2015-08-29 | 19500.0 | 2015-07-04 08:35:03 | 1283.0 | canceled | 14 | US | 1283.0 | 1283.0 | 19500.0 |


In [3]:
# Create features.
# Pledged amount as a fraction of the goal (1.0 means the goal was met exactly).
df.loc[:,'goal_reached'] = df['pledged'] / df['goal']

# Replace zero backers with 1 to avoid division by zero.
df.loc[df['backers'] == 0, 'backers'] = 1

# Pledged amount per backer.
df.loc[:,'pledge_per_backer'] = df['pledged'] / df['backers']

# Create the subset.
df_sub = df[['category', 'main_category', 'goal_reached', 'pledge_per_backer', 'state', 'country']]

# Check shape and data types.
print(df_sub.shape, '\n')
print(df_sub.dtypes)

(378661, 6) 

category              object
main_category         object
goal_reached         float64
pledge_per_backer    float64
state                 object
country               object
dtype: object


In [4]:
# Check for nulls.
df_sub.isnull().sum()

category             0
main_category        0
goal_reached         0
pledge_per_backer    0
state                0
country              0
dtype: int64

In [5]:
# Binarize the two numeric features.
# goal_reached_cat is 1 when more than half of the goal was pledged.
df_sub['goal_reached_cat'] = np.where(df_sub['goal_reached'] > 0.5, 1, 0)
# pledge_per_backer_cat is 1 when the pledge per backer exceeds the dataset mean.
df_sub['pledge_per_backer_cat'] = np.where(df_sub['pledge_per_backer'] > df_sub['pledge_per_backer'].mean(), 1, 0)

# Create the final subset of features used to predict whether a Kickstarter project will be successfully funded.
df_final = df_sub.loc[:, ['category', 'main_category', 'goal_reached_cat', 'pledge_per_backer_cat', 'state', 'country']]

# Features: everything except the target, with categorical columns one-hot encoded.
X = df_final.drop(['state'], axis=1)
X = pd.get_dummies(X)
# Target: the project state.
y = df_final['state']

# Split into train and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

### B. Multi-Layer Perceptron

In [6]:
# Start the timer. time.clock() was removed in Python 3.8, so use time.perf_counter().
import time
start_time = time.perf_counter()

# Fit the model.
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(100, 4), alpha=0.05)
mlp.fit(X_train, y_train)

# Accuracy on the held-out test set.
print('Accuracy: ', mlp.score(X_test, y_test))

# 5-fold cross-validation scores on the full dataset.
from sklearn.model_selection import cross_val_score
print('CV Scores: ', cross_val_score(mlp, X, y, cv=5))

# Elapsed time for fitting, scoring, and cross-validation combined.
print('Runtime: %s seconds' % (time.perf_counter() - start_time))

Accuracy:  0.867639679927
CV Scores:  [ 0.86703638  0.86670276  0.86679343  0.86692547  0.866392  ]
Runtime: 356.5996350144538 seconds
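
The cell above reports a single runtime that covers fitting, scoring, and cross-validation together. If the steps need to be timed separately, one option is a small context manager. This is a minimal sketch: the helper name `timed` and the `timings` dictionary are hypothetical, and the timed bodies are stand-ins for the real `fit` and `score` calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

with timed('fit', timings):
    # Stand-in workload; in the notebook this would be mlp.fit(X_train, y_train).
    sum(i * i for i in range(100_000))

with timed('score', timings):
    # Stand-in workload; in the notebook this would be mlp.score(X_test, y_test).
    sum(range(1_000))

for label, seconds in timings.items():
    print(f'{label}: {seconds:.4f} seconds')
```

Timing each step on its own shows whether the cost is dominated by the single fit or by the five additional fits that cross-validation performs.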


### C. Random Forest

In [7]:
# Start the timer. time.clock() was removed in Python 3.8, so use time.perf_counter().
start_time = time.perf_counter()

# Fit the model with default hyperparameters.
from sklearn import ensemble
rfc = ensemble.RandomForestClassifier()
rfc.fit(X_train, y_train)

# Accuracy on the held-out test set.
print('Accuracy: ', rfc.score(X_test, y_test))

# 5-fold cross-validation scores on the full dataset.
print('CV Scores: ', cross_val_score(rfc, X, y, cv=5))

# Elapsed time for fitting, scoring, and cross-validation combined.
print('Runtime: %s seconds' % (time.perf_counter() - start_time))

Accuracy:  0.865289307124
CV Scores:  [ 0.86462006  0.86445803  0.86469392  0.86524851  0.86454331]
Runtime: 66.59204713626877 seconds


MLP is about as simple to set up as RF. Its test accuracy is marginally higher than RF's (about 0.867 versus 0.865), but it takes more than five times as long to run (about 357 seconds versus 67 seconds). Increasing the hidden layer sizes does little to improve accuracy and further extends the execution time.
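
The comparison can be checked directly from the printed outputs. The sketch below uses only the figures reported in the cells above; the variable names are illustrative and nothing is re-trained.

```python
from statistics import mean

# Figures copied from the notebook outputs above.
mlp_cv = [0.86703638, 0.86670276, 0.86679343, 0.86692547, 0.866392]
rf_cv = [0.86462006, 0.86445803, 0.86469392, 0.86524851, 0.86454331]
mlp_runtime = 356.5996350144538  # seconds
rf_runtime = 66.59204713626877   # seconds

# Mean cross-validation accuracy for each model.
mlp_mean = mean(mlp_cv)
rf_mean = mean(rf_cv)

# How many times longer the MLP cell took than the RF cell.
runtime_ratio = mlp_runtime / rf_runtime

print(f'MLP mean CV accuracy: {mlp_mean:.4f}')
print(f'RF mean CV accuracy:  {rf_mean:.4f}')
print(f'Accuracy gap:         {mlp_mean - rf_mean:.4f}')
print(f'Runtime ratio:        {runtime_ratio:.2f}')
```

The gap in mean cross-validation accuracy is roughly 0.2 percentage points, while the runtime ratio is a little over five, which is the trade-off the summary describes.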