In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Kickstarter is a crowdfunding site where a project is only funded if its goal is met. The objective of this project was to use machine learning and data science to predict whether a Kickstarter project will succeed or fail. This gives an insight into what makes a Kickstarter campaign successful and which features have the biggest effect on its success rate.

The dataset came from webrobots.io, a data mining site. It is published as monthly scrapes, each made up of multiple CSV files. Because of the scraping schedule, there is a large amount of overlap between months, which creates duplicate entries. The raw data had very few missing values, but many of its columns are irrelevant to the project.
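
Because consecutive scrapes overlap, the same project can show up in more than one file. The following is only a minimal sketch of how such files could be combined and deduplicated. It assumes each project carries a unique `id` column; that column name and the toy rows are assumptions for illustration, not taken from the notebook.

```python
import pandas as pd

# Two toy "monthly" frames standing in for the scraped CSV files.
# The `id` column and all values are hypothetical.
january = pd.DataFrame({"id": [1, 2, 3], "state": ["failed", "successful", "failed"]})
february = pd.DataFrame({"id": [2, 3, 4], "state": ["successful", "failed", "successful"]})

# Stack the monthly files, then keep one row per project id
# (keep="last" prefers the most recent scrape).
combined = pd.concat([january, february], ignore_index=True)
deduplicated = combined.drop_duplicates(subset="id", keep="last").reset_index(drop=True)

print(len(combined), "rows before,", len(deduplicated), "rows after")
```

Keeping the last occurrence is a design choice: a later scrape is more likely to hold the final state of a campaign.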

In [None]:
df = pd.read_csv('../input/final-kickstarter-data/final-dataset.csv')
print(df)

Data Preparation
The data needed a lot of preparation. Of the original 27 columns, 9 remain. The start and end dates, which are stored as Unix timestamps, are converted into the number of days a campaign runs for. The name and blurb are each converted to a word count. The goal is converted to a common currency (US dollars, stored as usd_goal) using the exchange rate.
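
The conversions described above could look roughly like the sketch below. The raw column names (`launched_at`, `deadline`, `goal`, `static_usd_rate`) and the toy rows are assumptions made for illustration; only the output column names match the prepared dataset used in this notebook.

```python
import pandas as pd

# Toy rows; the raw column names are assumptions about the scraped data.
raw = pd.DataFrame({
    "name": ["Solar powered backpack", "Indie board game"],
    "blurb": ["Charge your phone while you hike", "A co-op game for four players"],
    "launched_at": [1_600_000_000, 1_600_000_000],   # Unix seconds
    "deadline": [1_600_000_000 + 30 * 86_400, 1_600_000_000 + 45 * 86_400],
    "goal": [5000.0, 2000.0],                        # local currency
    "static_usd_rate": [1.0, 1.25],                  # exchange rate to USD
})

prepared = pd.DataFrame({
    # Campaign length in whole days from the two Unix timestamps.
    "duration_days": (raw["deadline"] - raw["launched_at"]) // 86_400,
    # Word counts: split on whitespace and count the pieces.
    "name_word_count": raw["name"].str.split().str.len(),
    "blurb_word_count": raw["blurb"].str.split().str.len(),
    # Goal expressed in a common currency.
    "usd_goal": raw["goal"] * raw["static_usd_rate"],
})
print(prepared)
```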


In [None]:
df.columns

In [None]:
# Categorical columns need to be encoded as numbers because the model won't work with strings
from sklearn.preprocessing import LabelEncoder
encodeCategories = ['category', 'city', 'country', 'state']
print(df[encodeCategories].head())

print(df)

LGBM had issues with certain datatypes and categorical data, so I needed to encode the values into something the algorithm can read. There were multiple ways to encode this data, but I opted to use LabelEncoder from sklearn as it fit my needs best.

In [None]:
encoder = LabelEncoder()

# fit_transform is re-run for each column, so every column gets its own integer codes
encoded = df[encodeCategories].apply(encoder.fit_transform)
print(encoded.head())
print(df)
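
When a single LabelEncoder is refit on every column, only the mapping for the last column is still available afterwards. As a hedged alternative sketch, one encoder per column can be kept in a dictionary so the integer codes can be translated back to the original strings. The `encoders` dictionary and the toy data are hypothetical names and values, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the values are illustrative only.
toy = pd.DataFrame({
    "country": ["US", "GB", "US"],
    "state": ["successful", "failed", "failed"],
})

# One fitted encoder per column, kept for later lookups.
encoders = {col: LabelEncoder().fit(toy[col]) for col in toy.columns}
toy_encoded = pd.DataFrame({col: encoders[col].transform(toy[col]) for col in toy.columns})

print(toy_encoded)
# LabelEncoder assigns codes in sorted order of the original values.
print(list(encoders["state"].classes_))
print(list(encoders["state"].inverse_transform([0, 1])))
```

Because codes follow sorted order, a `state` column holding only "failed" and "successful" would map to 0 and 1 respectively.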

The numeric columns of the dataframe are then joined with the encoded values. Once this was done, it solved the issues I was having with invalid datatypes.

In [None]:
df = df[['blurb_word_count','duration_days', 'name_word_count', 'staff_pick', 'usd_goal']].join(encoded)
df.columns


The train_test_split function is imported from sklearn.model_selection. This allows me to easily split my data into two sets: training data and test data.

In [None]:
from sklearn.model_selection import train_test_split
X = df[['blurb_word_count', 'category', 'city', 'country',
       'duration_days', 'name_word_count', 'staff_pick', 'usd_goal']]
y = df['state']

The data is split with a test size of 20 percent. After trying different split sizes, I found that a 20 percent test size produced the best results for the model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
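
To make the effect of these arguments concrete, here is a small sketch on toy arrays (the data is made up). It shows the 80/20 row counts and that a fixed `random_state` reproduces the same split on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples with one feature and an alternating binary label.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100) % 2

# Same arguments as the notebook: 20 percent held out, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te2))
```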

In [None]:
import lightgbm as lgb
print(y_train)
print(y_test)
print(X_train)
print(X_test)

Below is where I set up the LGBM classifier. Some parameter tuning was tried to maximise the results (the commented-out configuration is kept for reference), but I saw very little improvement in results or performance, so I kept the default parameters. Instead, I tested removing certain columns to see how that would affect the results. Removing columns reduced the accuracy of the algorithm, so I used all the columns in the dataset. This makes sense because the cleaning and preprocessing had already kept only the columns most likely to have an effect on the final result.

In [None]:

# Tuned configuration that was tried and then left disabled:
# clf = lgb.LGBMClassifier(
#     n_estimators=400,
#     learning_rate=0.03,
#     num_leaves=30,
#     colsample_bytree=.8,
#     subsample=.9,
#     max_depth=7,
#     reg_alpha=.1,
#     reg_lambda=.1,
#     min_split_gain=.01,
#     min_child_weight=2,
#     silent=-1,
#     verbose=-1,
# )
# clf.fit(X_train, y_train)
# clf.fit(
#     X_train, y_train,
#     eval_set=[(X_train, y_train), (X_test, y_test)],
#     eval_metric='auc', verbose=100, early_stopping_rounds=30,
# )

clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train) 
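
The column-removal experiment mentioned above can be sketched as a loop that retrains the model with one column left out at a time and compares test accuracy. This is only an illustration under stated assumptions: `drop_one_column_scores` is a hypothetical helper, the data is synthetic, and a decision tree stands in for the classifier so the sketch runs without LightGBM. In this notebook, `lambda: lgb.LGBMClassifier()` could be passed instead.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def drop_one_column_scores(make_model, X_train, X_test, y_train, y_test):
    """Return test accuracy with all columns and with each column left out."""
    scores = {}
    baseline = make_model().fit(X_train, y_train)
    scores["all columns"] = accuracy_score(y_test, baseline.predict(X_test))
    for col in X_train.columns:
        model = make_model().fit(X_train.drop(columns=col), y_train)
        preds = model.predict(X_test.drop(columns=col))
        scores["without " + col] = accuracy_score(y_test, preds)
    return scores


# Synthetic stand-in data: the label depends only on `signal`.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"signal": rng.normal(size=400), "noise": rng.normal(size=400)})
y_demo = (X_demo["signal"] > 0).astype(int)
split = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A decision tree is used here only as a stand-in model.
scores = drop_one_column_scores(lambda: DecisionTreeClassifier(random_state=0), *split)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A column whose removal lowers accuracy is carrying useful information; one whose removal changes nothing is a candidate to drop.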


In [None]:
# Predict the labels for the test set
y_pred = clf.predict(X_test)

Below we can see that the model reaches an accuracy of around 81 percent, and that the scores on the testing data and the training data are very similar. In my opinion this is a good accuracy for predicting success.

In [None]:
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
print('LightGBM Model accuracy score: {0:0.4f}'.format(accuracy))

In [None]:
y_pred_train = clf.predict(X_train)
print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))

Below I use sklearn to produce a classification report, which shows how well the model classified each class. It shows that the model is more successful at predicting failures than successes, but it still has a good F1 score in my opinion.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
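
The precision, recall and F1 columns in the report follow directly from per-class counts. A small worked example, using made-up counts rather than the notebook's actual results:

```python
# Hypothetical counts for one class, chosen only for illustration.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)   # of the rows predicted as this class, how many were right
recall = tp / (tp + fn)      # of the rows truly in this class, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

F1 sits closer to the lower of the two values, so a class with high precision but weak recall still ends up with a modest F1.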

Below I use the feature importance chart to show that category is the most important attribute when predicting success with the algorithm, while staff pick ranks lower. This could be because it is quite uncommon for a project to be a staff pick, even though projects that are staff picks have around a 90 percent chance of being successful.

In [None]:
lgb.plot_importance(clf)
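
A claim such as "staff picks are rare but usually succeed" can be checked with a simple group-by. The sketch below uses made-up rows and assumes `state` is encoded so that 1 means successful and 0 means failed; on the real data the same call could be run on the `staff_pick` and `state` columns.

```python
import pandas as pd

# Toy data: staff_pick is a boolean flag, state is the encoded outcome
# (assumed 1 = successful, 0 = failed). Values are made up for illustration.
toy = pd.DataFrame({
    "staff_pick": [True, True, False, False, False, False, False, False],
    "state":      [1,    1,    1,     0,     0,     1,     0,     0],
})

# The mean of a 0/1 column is the success rate; the count shows how common each group is.
summary = toy.groupby("staff_pick")["state"].agg(success_rate="mean", projects="count")
print(summary)
```

A feature that applies to only a small share of rows gives the model few opportunities to split on it, which is one possible reason for a low position in the chart even when the feature is strongly tied to the outcome.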

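Below I have displayed a confusion matrix. sklearn puts the true classes in rows and the predicted classes in columns, in sorted label order, so the output treats the encoded class 1 as the positive class and class 0 as the negative class. Assuming LabelEncoder mapped failed to 0 and successful to 1 (it assigns codes in sorted order), the algorithm gets far more correct predictions for failed projects (cell [0, 0]) than for successful ones (cell [1, 1]), which matches the classification report. False positives and false negatives are close in value to each other and low in comparison to the correct predictions. The model is only slightly worse on false positives than on false negatives.

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n\n', cm)
# Rows are true classes and columns are predicted classes, in sorted label order.
# Class 1 is treated as the positive class and class 0 as the negative class.
print('\nTrue Negatives(TN) = ', cm[0,0])
print('\nTrue Positives(TP) = ', cm[1,1])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])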
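
For a binary problem, sklearn's `confusion_matrix` places true classes in rows and predicted classes in columns, both in sorted label order. A short sketch on toy labels (made up for illustration, with 1 taken as the positive class) shows which cell is which:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive class, 0 = negative class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0]

cm_toy = confusion_matrix(y_true, y_hat)
# Flattening row by row gives the order TN, FP, FN, TP for a 2x2 matrix.
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print("TN", tn, "FP", fp, "FN", fn, "TP", tp)
```

The `ravel()` unpacking only works when there are exactly two classes; with more classes the matrix is larger and the cells have to be read by row and column.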