# Capstone Project

## Goal

The goal of this project is to use predictive analytics on historical Kickstarter data to determine what makes a project more likely to succeed. The historical data record which projects were successful and which were not.

Kickstarter provides a creator's handbook for funding at the following link.  
https://www.kickstarter.com/help/handbook/funding

The original objective of this analysis was to determine what makes a board game successful on Kickstarter, and then to design a board game based on those findings. However, an important first phase was to see whether I could predict if a project would succeed at all, so that is the focus here.

## Question: What is the probability of a successful Kickstarter project given certain criteria?

### Import Libraries

**Note:** All libraries and modules are imported here, and the list was extended as the project progressed, so that the whole notebook can be run from the top.

In [1]:
import os
import glob
import pandas as pd
# os.chdir("./datasets/kickstarter_data/") # uncomment to run initially
import string

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import numpy as np
import re
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

### Gather Data

Data were found using the following link and downloaded onto my local drive.  
https://webrobots.io/kickstarter-datasets/

### Combine Data

Data were combined using the following code. To prevent errors as I continued to work through this document, I commented out this cell after I initially combined the data.

In [2]:
## uncomment to run initially
## credit: https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/
# extension = 'csv'
# all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# #combine all files in the list
# combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
# #export to csv
# combined_csv.to_csv( "combined.csv", index=False, encoding='utf-8-sig')
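
The combining step can be tried on throwaway files first. The sketch below is only an illustration under assumptions: the temporary folder, file names, and columns are made up and are not the real Kickstarter exports.

```python
import glob
import os
import tempfile

import pandas as pd

# Build two tiny CSV files in a temporary folder so the sketch is self-contained.
# The file names and columns here are made up for illustration.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"id": [1, 2], "state": ["failed", "successful"]}).to_csv(
    os.path.join(tmp_dir, "Kickstarter001.csv"), index=False)
pd.DataFrame({"id": [3], "state": ["live"]}).to_csv(
    os.path.join(tmp_dir, "Kickstarter002.csv"), index=False)

# Sort the matches so the row order is reproducible across runs.
all_filenames = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating each file's own row numbers.
combined = pd.concat((pd.read_csv(f) for f in all_filenames), ignore_index=True)
print(combined)
```

One design point worth noting: if the combined file is written into the same folder that `glob` scans, a second run will pick up `combined.csv` as an input and duplicate every row, which is a good reason to run the step only once or to write the output elsewhere.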

### Read in Data

In [3]:
df = pd.read_csv('./datasets/kickstarter_data/combined.csv')

### Exploratory Data Analysis (EDA)

In [4]:
df.shape

(217433, 38)

In [5]:
df.head(2)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,1,we are going Production herbal teabag of plan...,"{""id"":313,""name"":""Small Batch"",""slug"":""food/sm...",19,AU,Australia,1441269202,"{""id"":1555219532,""name"":""ehsan"",""is_registered...",AUD,$,...,production-herbal-teabag-of-plants-native-to-iran,https://www.kickstarter.com/discover/categorie...,False,False,failed,1444141184,0.691164,"{""web"":{""project"":""https://www.kickstarter.com...",18.661436,domestic
1,637,Two agents battle each other in another dimens...,"{""id"":34,""name"":""Tabletop Games"",""slug"":""games...",16233,US,the United States,1576048498,"{""id"":99575233,""name"":""David Gerrard"",""is_regi...",USD,$,...,slip-strike-0,https://www.kickstarter.com/discover/categorie...,True,False,successful,1583987400,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",16233.0,domestic


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217433 entries, 0 to 217432
Data columns (total 38 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   backers_count             217433 non-null  int64  
 1   blurb                     217425 non-null  object 
 2   category                  217433 non-null  object 
 3   converted_pledged_amount  217433 non-null  int64  
 4   country                   217433 non-null  object 
 5   country_displayable_name  217433 non-null  object 
 6   created_at                217433 non-null  int64  
 7   creator                   217433 non-null  object 
 8   currency                  217433 non-null  object 
 9   currency_symbol           217433 non-null  object 
 10  currency_trailing_code    217433 non-null  bool   
 11  current_currency          217433 non-null  object 
 12  deadline                  217433 non-null  int64  
 13  disable_communication     217433 non-null  b

In [7]:
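# .copy() makes an independent frame, so later column assignments do not trigger SettingWithCopyWarning
category_state = df[['category', 'state']].copy()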

In [8]:
category_state.head()

Unnamed: 0,category,state
0,"{""id"":313,""name"":""Small Batch"",""slug"":""food/sm...",failed
1,"{""id"":34,""name"":""Tabletop Games"",""slug"":""games...",successful
2,"{""id"":262,""name"":""Accessories"",""slug"":""fashion...",successful
3,"{""id"":313,""name"":""Small Batch"",""slug"":""food/sm...",failed
4,"{""id"":28,""name"":""Product Design"",""slug"":""desig...",successful


In [9]:
type(category_state.category[0])

str

In [10]:
# Turn each JSON-like category string into a flat "key,value,key,value,..." string
category_state.category = category_state.category.str.replace(':', ',')

# Characters to strip; the comma (the delimiter) and the slash (used in slugs) are kept
punctuation = "!\"#$%&'()*+-.:;<=>?@[\\]^_`{|}~"

def remove_punctuation(s):
    """Return s without any character listed in punctuation."""
    return "".join(letter for letter in s if letter not in punctuation)

# Split each cleaned record string into a list of alternating keys and values
new_category = []
for line in category_state.category:
    line = remove_punctuation(line)
    new_category.append(line.split(','))

category_state.category = new_category

# Pair each even-indexed key with the value that follows it.
# The last four list elements are not used as keys.
all_categories = {}
for j, line in enumerate(category_state.category):
    categories = {}
    for i, ele in enumerate(line[:-4]):
        if i % 2 == 0:
            categories[ele] = line[i+1]
    all_categories[j] = categories

# One row per project, one column per key
category = pd.DataFrame(all_categories).T
category.head()


Unnamed: 0,id,name,slug,position,parentid,parentname,color,urls
0,313,Small Batch,food/small batch,10,10,Food,16725570,web
1,34,Tabletop Games,games/tabletop games,6,12,Games,51627,web
2,262,Accessories,fashion/accessories,1,9,Fashion,16752598,web
3,313,Small Batch,food/small batch,10,10,Food,16725570,web
4,28,Product Design,design/product design,5,7,Design,2577151,web
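
Because the `category` strings look like JSON, a parser may be a sturdier alternative to the manual string handling above. This is a minimal sketch, not the method used in this notebook: the two records are hypothetical, and the key names `parent_id` and `parent_name` are assumptions, since the real strings are truncated in the preview.

```python
import json

import pandas as pd

# Hypothetical records shaped like the `category` column; the exact keys and
# values are assumptions, since the real strings are truncated in the preview.
raw = pd.Series([
    '{"id": 313, "name": "Small Batch", "slug": "food/small batch", '
    '"position": 10, "parent_id": 10, "parent_name": "Food"}',
    '{"id": 34, "name": "Tabletop Games", "slug": "games/tabletop games", '
    '"position": 6, "parent_id": 12, "parent_name": "Games"}',
])

# json.loads turns each string into a dict; json_normalize spreads the
# dict keys into columns, one row per record.
parsed = pd.json_normalize(raw.apply(json.loads).tolist())
print(parsed[["name", "parent_name"]])
```

Parsing keeps numeric fields as numbers and leaves punctuation inside names and slugs untouched, which the character-stripping approach cannot do.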


In [11]:
category_state.head()

Unnamed: 0,category,state
0,"[id, 313, name, Small Batch, slug, food/small ...",failed
1,"[id, 34, name, Tabletop Games, slug, games/tab...",successful
2,"[id, 262, name, Accessories, slug, fashion/acc...",successful
3,"[id, 313, name, Small Batch, slug, food/small ...",failed
4,"[id, 28, name, Product Design, slug, design/pr...",successful


In [12]:
category_state.drop(columns=['category'], inplace=True)

category_state.head()


Unnamed: 0,state
0,failed
1,successful
2,successful
3,failed
4,successful


In [13]:
category_state = category_state.merge(category, how='outer', left_index=True, right_index=True)
category_state.head()

Unnamed: 0,state,id,name,slug,position,parentid,parentname,color,urls
0,failed,313,Small Batch,food/small batch,10,10,Food,16725570,web
1,successful,34,Tabletop Games,games/tabletop games,6,12,Games,51627,web
2,successful,262,Accessories,fashion/accessories,1,9,Fashion,16752598,web
3,failed,313,Small Batch,food/small batch,10,10,Food,16725570,web
4,successful,28,Product Design,design/product design,5,7,Design,2577151,web


### Rename Parentname Data to Category

In [14]:
category_state.rename(columns={'parentname': 'category'}, inplace=True)

### Create New Dataframe for Category_State

In [15]:
category_state = category_state[['state', 'category']]

category_state.head()

Unnamed: 0,state,category
0,failed,Food
1,successful,Games
2,successful,Fashion
3,failed,Food
4,successful,Design


In [16]:
category_state.state.value_counts()

successful    127093
failed         76260
canceled        9029
live            5051
Name: state, dtype: int64

In [17]:
category_state = category_state.loc[(category_state.state == 'successful') | (category_state.state == 'failed')]

In [18]:
category_state.state.value_counts()

successful    127093
failed         76260
Name: state, dtype: int64
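
The same filter can be written with `isin`, which scales better when more than two states need to be kept. The sketch below uses a toy frame with invented rows, so only the four state labels come from the value counts above.

```python
import pandas as pd

# Toy frame with the four states seen in the value counts above.
toy = pd.DataFrame({
    "state": ["successful", "failed", "canceled", "live", "successful"],
    "category": ["Games", "Food", "Art", "Music", "Design"],
})

# isin keeps rows whose state is in the list; .copy() makes the result an
# independent frame, so later column assignments do not raise
# SettingWithCopyWarning.
finished = toy.loc[toy["state"].isin(["successful", "failed"])].copy()
print(finished["state"].value_counts())
```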

### Dummify 'State' and 'Category' Data

In [19]:
category_state.head()

Unnamed: 0,state,category
0,failed,Food
1,successful,Games
2,successful,Fashion
3,failed,Food
4,successful,Design


In [20]:
category_state = pd.get_dummies(category_state, drop_first=True)
category_state.head()

Unnamed: 0,state_successful,category_Comics,category_Crafts,category_Dance,category_Design,category_Fashion,category_Film Video,category_Food,category_Games,category_Journalism,category_Music,category_Photography,category_Publishing,category_Technology,category_Theater
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
4,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
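
With `drop_first=True`, the first level of each column (in sorted order) gets no dummy column and becomes the reference level. In the table above that is `state_failed` for the target and, presumably, Art for the categories, since no `category_Art` column appears. A small sketch on invented rows shows the behavior:

```python
import pandas as pd

# Toy data; the rows are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "state": ["failed", "successful", "successful", "failed"],
    "category": ["Food", "Art", "Games", "Art"],
})

dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())
# "state_failed" and "category_Art" are absent: each is the alphabetically
# first level of its column, so it becomes the all-zeros reference level.
```

A row with zeros in every category column therefore belongs to the reference category, and each model coefficient is read relative to it.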


### Logistic Regression

In [24]:
X = category_state.drop(['state_successful'], axis = 'columns')
y = category_state.state_successful

In [25]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the data: fit the scaler on the training set only, then apply it to the test set.
# Relabeling scaled data as "Z" is common.
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

# A very large C makes the regularization negligible
logreg = LogisticRegression(C=1e9, solver='lbfgs')
logreg.fit(Z_train, y_train)

# Predict the labels of the test set
y_pred = logreg.predict(Z_test)

# Print the confusion matrix, then the train and test accuracy
print(confusion_matrix(y_test, y_pred))
print(logreg.score(Z_train, y_train))
print(logreg.score(Z_test, y_test))

[[ 6919 12233]
 [ 4534 27153]]
0.6708367756402691
0.6701941422923346
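
To put the accuracy in context, the confusion matrix printed above can be broken down into a majority-class baseline, precision, and recall. The sketch below only redoes arithmetic on the four printed counts; the variable names are my own.

```python
import numpy as np

# Confusion matrix printed above: rows are actual classes, columns are
# predicted classes, in the order [failed, successful].
cm = np.array([[6919, 12233],
               [4534, 27153]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
baseline = (fn + tp) / cm.sum()   # accuracy of always predicting "successful"
precision = tp / (tp + fp)        # share of predicted successes that succeeded
recall = tp / (tp + fn)           # share of actual successes that were caught

print(f"accuracy  {accuracy:.3f}")
print(f"baseline  {baseline:.3f}")
print(f"precision {precision:.3f}")
print(f"recall    {recall:.3f}")
```

The accuracy recovered here matches the test score of about 0.670, against a baseline of about 0.623 for always guessing "successful", so category alone adds roughly five percentage points.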


In [27]:
# Print each feature name alongside its fitted coefficient
coef = logreg.coef_
for p, c in zip(X.columns, coef[0]):
    print(p + '\t' + str(c))

category_Comics	0.20677307351967253
category_Crafts	-0.20047463197669232
category_Dance	0.024151663726879387
category_Design	-0.01298428115724556
category_Fashion	-0.020049580464913234
category_Film  Video	-0.1411163381653347
category_Food	-0.37863957072876214
category_Games	-0.003284845515249729
category_Journalism	-0.26245641853370816
category_Music	-0.036685354314825544
category_Photography	-0.15148881852756724
category_Publishing	0.028816384896704426
category_Technology	-0.3374180235622511
category_Theater	0.00942953098700657
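
Sorting the coefficients makes the ranking easier to read. The values in this sketch are copied from the output above and rounded to four decimals; the column label for Film & Video is simplified. Two caveats apply when reading them: each coefficient is relative to the dropped reference category, and because the dummies were standardized, each one is per standard deviation of its column rather than per switch from 0 to 1.

```python
import pandas as pd

# Coefficients copied from the output above (rounded to four decimals).
coefs = pd.Series({
    "category_Comics": 0.2068,
    "category_Crafts": -0.2005,
    "category_Dance": 0.0242,
    "category_Design": -0.0130,
    "category_Fashion": -0.0200,
    "category_Film Video": -0.1411,
    "category_Food": -0.3786,
    "category_Games": -0.0033,
    "category_Journalism": -0.2625,
    "category_Music": -0.0367,
    "category_Photography": -0.1515,
    "category_Publishing": 0.0288,
    "category_Technology": -0.3374,
    "category_Theater": 0.0094,
})

# Largest positive coefficient first, most negative last.
ranked = coefs.sort_values(ascending=False)
print(ranked)
```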


In [29]:
X

Unnamed: 0,category_Comics,category_Crafts,category_Dance,category_Design,category_Fashion,category_Film Video,category_Food,category_Games,category_Journalism,category_Music,category_Photography,category_Publishing,category_Technology,category_Theater
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
217428,0,0,0,0,0,0,0,0,0,0,0,0,0,0
217429,0,0,0,0,0,0,0,0,0,0,1,0,0,0
217430,0,0,0,0,0,0,0,0,0,0,0,0,1,0
217431,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [43]:
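# Caution: logreg was fit on scaled features (Z_train), so X should go through sc.transform first.
# [1] selects the second row (index 1, a Games project); the two values are [P(failed), P(successful)].
logreg.predict_proba(X)[1]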

array([0.36672255, 0.63327745])
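
One way to make sure new rows are scaled the same way as the training data is to wrap the scaler and the model in a single pipeline. The sketch below is an assumption-laden illustration rather than this notebook's model: the data are synthetic, and the category names and success rates are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three invented categories with invented success rates.
rng = np.random.default_rng(42)
cat = rng.choice(["Art", "Food", "Games"], size=600)
p_success = pd.Series(cat).map({"Art": 0.5, "Food": 0.3, "Games": 0.7})
y_toy = (rng.random(600) < p_success).astype(int)

# "Art" is dropped and becomes the reference category.
X_toy = pd.get_dummies(pd.Series(cat), drop_first=True).astype(int)

# The pipeline applies the fitted scaler before every predict call, so new
# rows are transformed the same way as the training data.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_toy, y_toy)

# One new "Games" project, with the same columns as the training frame.
new_project = pd.DataFrame({"Food": [0], "Games": [1]})
proba = model.predict_proba(new_project)

# Columns follow model.classes_, here [0, 1]; column 1 is P(successful).
print(model.classes_, proba[0, 1])
```

Indexing the result as `proba[0, 1]` reads "first row, probability of class 1", which avoids confusing a row index with a class index.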

In [42]:
X.category_Comics.value_counts()

0    194912
1      8441
Name: category_Comics, dtype: int64