# Capstone Project

## Goal

The goal of this project is to use predictive analytics on historical data to identify what makes a Kickstarter project more likely to succeed. The historical data records which projects were successful and which were not.

https://www.kickstarter.com/help/handbook/funding

Kickstarter provides a creator's handbook for funding (linked above). The original objective of this analysis was to determine what leads to successful board games, and then to design a board game based on those findings to see whether it would succeed. However, an important first phase was to see whether I could predict if a project would be successful at all, so that is what I did here.

## Question: What is the probability of a successful Kickstarter project given certain criteria?

### Import Libraries

**Note:** All relevant libraries and modules were collected in this cell as the project progressed, so that the entire notebook can be run from top to bottom.

In [1]:
import glob
import os
import re
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# os.chdir("./datasets/kickstarter_data/") # uncomment to run initially

plt.style.use('fivethirtyeight')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

### Maximize Display Rows and Columns

In [2]:
# Show up to 9,999 rows and 300 columns when displaying DataFrames
pd.set_option('display.max_rows', 9999)
pd.set_option('display.max_columns', 300)
# pd.set_option('display.width', 9999)  # uncomment to widen the display as well

### Gather Data

The data were downloaded from the following link to my local drive.  
https://webrobots.io/kickstarter-datasets/

### Combine Data

Data were combined using the following code. To prevent errors as I continued to work through this document, I commented out this cell after I initially combined the data.

In [3]:
## uncomment to run initially
## credit: https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/
# extension = 'csv'
# all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# #combine all files in the list
# combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
# #export to csv
# combined_csv.to_csv( "combined.csv", index=False, encoding='utf-8-sig')
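
A possible reason the cell above has to be commented out after the first run is that `combined.csv` is written into the same folder that `glob` scans, so a second run would read it back in. A minimal sketch of one way around that, using a hypothetical `combine_csvs` helper and throwaway files rather than the real dataset:

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_csvs(folder, output_name="combined.csv"):
    """Concatenate every CSV in `folder`, skipping a previously written output file."""
    folder = Path(folder)
    parts = sorted(p for p in folder.glob("*.csv") if p.name != output_name)
    combined = pd.concat([pd.read_csv(p) for p in parts], ignore_index=True)
    combined.to_csv(folder / output_name, index=False, encoding="utf-8-sig")
    return combined


# Demo on two tiny throwaway files (not the Kickstarter data)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"id": [1, 2]}).to_csv(Path(tmp) / "a.csv", index=False)
    pd.DataFrame({"id": [3]}).to_csv(Path(tmp) / "b.csv", index=False)
    first = combine_csvs(tmp)
    second = combine_csvs(tmp)  # re-running does not pick up combined.csv
    print(len(first), len(second))
```

Because the output file is excluded by name, the helper can stay uncommented without duplicating rows on later runs.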

### Read in Data

In [4]:
df = pd.read_csv('./datasets/kickstarter_data/combined.csv')

### Exploratory Data Analysis (EDA)

In [5]:
df.shape

(217433, 38)

In [6]:
df.head(2)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,friends,fx_rate,goal,id,is_backing,is_starrable,is_starred,launched_at,location,name,permissions,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,1,we are going Production herbal teabag of plan...,"{""id"":313,""name"":""Small Batch"",""slug"":""food/sm...",19,AU,Australia,1441269202,"{""id"":1555219532,""name"":""ehsan"",""is_registered...",AUD,$,True,USD,1444141184,False,,0.643694,14000.0,18648848,,False,,1441549184,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,,"{""key"":""assets/012/241/749/145d362f576a69a5338...",27.0,"{""id"":2100811,""project_id"":2100811,""state"":""in...",production-herbal-teabag-of-plants-native-to-iran,https://www.kickstarter.com/discover/categorie...,False,False,failed,1444141184,0.691164,"{""web"":{""project"":""https://www.kickstarter.com...",18.661436,domestic
1,637,Two agents battle each other in another dimens...,"{""id"":34,""name"":""Tabletop Games"",""slug"":""games...",16233,US,the United States,1576048498,"{""id"":99575233,""name"":""David Gerrard"",""is_regi...",USD,$,True,USD,1583987400,False,,1.0,6000.0,1576306701,,False,,1581353979,"{""id"":2490383,""name"":""Seattle"",""slug"":""seattle...",Slip Strike,,"{""key"":""assets/027/753/183/1b44d6f57a405f04bb3...",16233.0,"{""id"":3869441,""project_id"":3869441,""state"":""in...",slip-strike-0,https://www.kickstarter.com/discover/categorie...,True,False,successful,1583987400,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",16233.0,domestic


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217433 entries, 0 to 217432
Data columns (total 38 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   backers_count             217433 non-null  int64  
 1   blurb                     217425 non-null  object 
 2   category                  217433 non-null  object 
 3   converted_pledged_amount  217433 non-null  int64  
 4   country                   217433 non-null  object 
 5   country_displayable_name  217433 non-null  object 
 6   created_at                217433 non-null  int64  
 7   creator                   217433 non-null  object 
 8   currency                  217433 non-null  object 
 9   currency_symbol           217433 non-null  object 
 10  currency_trailing_code    217433 non-null  bool   
 11  current_currency          217433 non-null  object 
 12  deadline                  217433 non-null  int64  
 13  disable_communication     217433 non-null  b

In [8]:
type(df.category[0])

str

In [9]:
df.category = df.category.str.replace(':', ',')

punctuation = "!\"#$%&'()*+-.:;<=>?@[\\]^_`{|}~"

def remove_punctuation(s):
    s_sans_punct = ""
    for letter in s:
        if letter not in punctuation:
            s_sans_punct += letter
    return s_sans_punct

# Split each cleaned record into a flat list: [key, value, key, value, ...]
new_category = []
for line in df.category:
    line = remove_punctuation(line)
    new_category.append(line.split(','))
    
df.category = new_category

# Pair each key (even position) with the value that follows it.
# The trailing four tokens, which come from the nested `urls` entry, are not used as keys.
all_categories = {}
for j, line in enumerate(df.category):
    categories = {}
    for i, ele in enumerate(line[:-4]):
        if i % 2 == 0:
            categories[ele] = line[i+1]
    all_categories[j] = categories

category = pd.DataFrame(all_categories).T
category.head(2)

Unnamed: 0,id,name,slug,position,parentid,parentname,color,urls
0,313,Small Batch,food/small batch,10,10,Food,16725570,web
1,34,Tabletop Games,games/tabletop games,6,12,Games,51627,web
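
Since each `category` value appears to be a JSON string (the earlier `head` output shows `{"id":313,"name":"Small Batch",...}`), the standard-library `json` module offers a shorter alternative to manual splitting. A minimal sketch on two made-up records shaped like that column; the key names `parent_id` and `parent_name` are assumptions inferred from the `parentid` and `parentname` columns above, which have had their underscores stripped:

```python
import json

import pandas as pd

# Two hypothetical records shaped like the raw `category` strings
raw = pd.Series([
    '{"id":313,"name":"Small Batch","slug":"food/small batch","position":10,'
    '"parent_id":10,"parent_name":"Food"}',
    '{"id":34,"name":"Tabletop Games","slug":"games/tabletop games","position":6,'
    '"parent_id":12,"parent_name":"Games"}',
])

# Parse each string into a dict, then expand the dicts into columns
category_parsed = pd.DataFrame(raw.map(json.loads).tolist(), index=raw.index)
print(category_parsed[["name", "parent_name"]])
```

Parsing this way keeps numeric fields as numbers and leaves punctuation inside names untouched, at the cost of different column names from the ones used in the rest of this notebook.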


In [10]:
df.head(2)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,friends,fx_rate,goal,id,is_backing,is_starrable,is_starred,launched_at,location,name,permissions,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,1,we are going Production herbal teabag of plan...,"[id, 313, name, Small Batch, slug, food/small ...",19,AU,Australia,1441269202,"{""id"":1555219532,""name"":""ehsan"",""is_registered...",AUD,$,True,USD,1444141184,False,,0.643694,14000.0,18648848,,False,,1441549184,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,,"{""key"":""assets/012/241/749/145d362f576a69a5338...",27.0,"{""id"":2100811,""project_id"":2100811,""state"":""in...",production-herbal-teabag-of-plants-native-to-iran,https://www.kickstarter.com/discover/categorie...,False,False,failed,1444141184,0.691164,"{""web"":{""project"":""https://www.kickstarter.com...",18.661436,domestic
1,637,Two agents battle each other in another dimens...,"[id, 34, name, Tabletop Games, slug, games/tab...",16233,US,the United States,1576048498,"{""id"":99575233,""name"":""David Gerrard"",""is_regi...",USD,$,True,USD,1583987400,False,,1.0,6000.0,1576306701,,False,,1581353979,"{""id"":2490383,""name"":""Seattle"",""slug"":""seattle...",Slip Strike,,"{""key"":""assets/027/753/183/1b44d6f57a405f04bb3...",16233.0,"{""id"":3869441,""project_id"":3869441,""state"":""in...",slip-strike-0,https://www.kickstarter.com/discover/categorie...,True,False,successful,1583987400,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",16233.0,domestic


#### Drop Features

In [11]:
df.columns

Index(['backers_count', 'blurb', 'category', 'converted_pledged_amount',
       'country', 'country_displayable_name', 'created_at', 'creator',
       'currency', 'currency_symbol', 'currency_trailing_code',
       'current_currency', 'deadline', 'disable_communication', 'friends',
       'fx_rate', 'goal', 'id', 'is_backing', 'is_starrable', 'is_starred',
       'launched_at', 'location', 'name', 'permissions', 'photo', 'pledged',
       'profile', 'slug', 'source_url', 'spotlight', 'staff_pick', 'state',
       'state_changed_at', 'static_usd_rate', 'urls', 'usd_pledged',
       'usd_type'],
      dtype='object')

In [12]:
df.drop([
    'category',
], axis=1, inplace=True)

In [13]:
df.head(2)

Unnamed: 0,backers_count,blurb,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,friends,fx_rate,goal,id,is_backing,is_starrable,is_starred,launched_at,location,name,permissions,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,1,we are going Production herbal teabag of plan...,19,AU,Australia,1441269202,"{""id"":1555219532,""name"":""ehsan"",""is_registered...",AUD,$,True,USD,1444141184,False,,0.643694,14000.0,18648848,,False,,1441549184,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,,"{""key"":""assets/012/241/749/145d362f576a69a5338...",27.0,"{""id"":2100811,""project_id"":2100811,""state"":""in...",production-herbal-teabag-of-plants-native-to-iran,https://www.kickstarter.com/discover/categorie...,False,False,failed,1444141184,0.691164,"{""web"":{""project"":""https://www.kickstarter.com...",18.661436,domestic
1,637,Two agents battle each other in another dimens...,16233,US,the United States,1576048498,"{""id"":99575233,""name"":""David Gerrard"",""is_regi...",USD,$,True,USD,1583987400,False,,1.0,6000.0,1576306701,,False,,1581353979,"{""id"":2490383,""name"":""Seattle"",""slug"":""seattle...",Slip Strike,,"{""key"":""assets/027/753/183/1b44d6f57a405f04bb3...",16233.0,"{""id"":3869441,""project_id"":3869441,""state"":""in...",slip-strike-0,https://www.kickstarter.com/discover/categorie...,True,False,successful,1583987400,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",16233.0,domestic


In [14]:
df.converted_pledged_amount.value_counts()

0          17622
1           6873
2           1705
10          1258
25          1089
           ...  
13113          1
279351         1
318244         1
183014         1
1001047        1
Name: converted_pledged_amount, Length: 32766, dtype: int64

In [15]:
df.drop([
    'country'
], axis=1, inplace=True)

In [16]:
df.country_displayable_name.value_counts()

the United States     149871
the United Kingdom     25046
Canada                 10240
Australia               5194
Germany                 3945
France                  3140
Mexico                  3054
Italy                   2742
Spain                   2467
the Netherlands         1923
Sweden                  1598
Hong Kong               1540
Denmark                  997
New Zealand              965
Singapore                887
Switzerland              753
Ireland                  709
Belgium                  647
Japan                    579
Austria                  548
Norway                   515
Luxembourg                73
Name: country_displayable_name, dtype: int64

In [17]:
df.creator.value_counts()
df.drop([
    'creator'
], axis=1, inplace=True)

In [18]:
df.currency.value_counts()

USD    149871
GBP     25046
EUR     16194
CAD     10240
AUD      5194
MXN      3054
SEK      1598
HKD      1540
DKK       997
NZD       965
SGD       887
CHF       753
JPY       579
NOK       515
Name: currency, dtype: int64

In [19]:
df.drop([
    'currency_symbol'
], axis=1, inplace=True)

In [20]:
df.currency_trailing_code.value_counts()

True     174861
False     42572
Name: currency_trailing_code, dtype: int64

In [21]:
df.head(2)

Unnamed: 0,backers_count,blurb,converted_pledged_amount,country_displayable_name,created_at,currency,currency_trailing_code,current_currency,deadline,disable_communication,friends,fx_rate,goal,id,is_backing,is_starrable,is_starred,launched_at,location,name,permissions,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,1,we are going Production herbal teabag of plan...,19,Australia,1441269202,AUD,True,USD,1444141184,False,,0.643694,14000.0,18648848,,False,,1441549184,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,,"{""key"":""assets/012/241/749/145d362f576a69a5338...",27.0,"{""id"":2100811,""project_id"":2100811,""state"":""in...",production-herbal-teabag-of-plants-native-to-iran,https://www.kickstarter.com/discover/categorie...,False,False,failed,1444141184,0.691164,"{""web"":{""project"":""https://www.kickstarter.com...",18.661436,domestic
1,637,Two agents battle each other in another dimens...,16233,the United States,1576048498,USD,True,USD,1583987400,False,,1.0,6000.0,1576306701,,False,,1581353979,"{""id"":2490383,""name"":""Seattle"",""slug"":""seattle...",Slip Strike,,"{""key"":""assets/027/753/183/1b44d6f57a405f04bb3...",16233.0,"{""id"":3869441,""project_id"":3869441,""state"":""in...",slip-strike-0,https://www.kickstarter.com/discover/categorie...,True,False,successful,1583987400,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",16233.0,domestic


In [22]:
df.current_currency.value_counts()
# keep only the rows reported in USD, which are the vast majority
df = df.loc[df.current_currency == 'USD'].copy()
df.current_currency.value_counts()
# drop this column since it now holds a single value
df.drop([
    'current_currency'
], axis=1, inplace=True)

In [23]:
# only one value
df.disable_communication.value_counts()
df.drop([
    'disable_communication'
], axis=1, inplace=True)
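
Several cells in this section drop a column after `value_counts()` shows that it carries a single value or none at all. A minimal sketch of how such candidates could be listed in one pass, using a hypothetical `constant_columns` helper on a toy frame:

```python
import pandas as pd


def constant_columns(frame):
    """Return the names of columns with at most one distinct non-null value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=True) <= 1]


# Toy frame: one constant column, one empty column, one informative column
toy = pd.DataFrame({
    "disable_communication": [False, False, False],
    "friends": [None, None, None],
    "goal": [14000.0, 6000.0, 500.0],
})
print(constant_columns(toy))
```

The returned list could then be passed to a single `df.drop(columns=...)` call instead of one cell per column.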

In [24]:
df.drop([
    'friends'
], axis=1, inplace=True)

In [25]:
df.fx_rate.value_counts()

1.000000    149731
1.221140     18617
1.080912     12362
0.709285      7743
1.226759      6407
0.643694      3952
1.085077      3813
0.711371      2489
0.041296      2167
0.101724      1246
0.647046      1238
0.129025      1052
0.041245       887
0.144964       752
0.598910       732
0.703586       679
1.027844       592
0.129018       486
0.009354       440
0.098205       403
0.102376       350
0.145478       244
0.601356       232
0.705470       205
1.031539       160
0.009327       139
0.098548       111
Name: fx_rate, dtype: int64

In [26]:
df.goal.value_counts()

5.000000e+03    15466
1.000000e+04    13671
1.000000e+03    10266
2.000000e+03     8877
3.000000e+03     8740
5.000000e+02     8551
1.500000e+04     7165
2.000000e+04     6638
2.500000e+03     6459
1.500000e+03     6131
2.500000e+04     5004
5.000000e+04     4808
4.000000e+03     4630
6.000000e+03     4115
3.000000e+04     3986
3.500000e+03     3789
8.000000e+03     3451
3.000000e+02     2964
7.000000e+03     2643
1.200000e+04     2633
7.500000e+03     2458
1.000000e+05     2405
6.000000e+02     2261
2.000000e+02     2188
1.000000e+02     2153
2.500000e+02     2058
1.200000e+03     1959
4.000000e+02     1871
8.000000e+02     1830
3.500000e+04     1717
4.000000e+04     1676
4.500000e+03     1647
5.500000e+03     1454
7.500000e+02     1298
6.500000e+03     1291
7.000000e+02     1171
6.000000e+04     1110
9.000000e+03     1099
3.500000e+02     1081
7.500000e+04      913
1.800000e+04      885
1.500000e+02      879
1.500000e+05      861
8.500000e+03      766
1.800000e+03      704
1.250000e+

In [27]:
df.head(2)

Unnamed: 0,backers_count,blurb,converted_pledged_amount,country_displayable_name,created_at,currency,currency_trailing_code,deadline,fx_rate,goal,id,is_backing,is_starrable,is_starred,launched_at,location,name,permissions,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,1,we are going Production herbal teabag of plan...,19,Australia,1441269202,AUD,True,1444141184,0.643694,14000.0,18648848,,False,,1441549184,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,,"{""key"":""assets/012/241/749/145d362f576a69a5338...",27.0,"{""id"":2100811,""project_id"":2100811,""state"":""in...",production-herbal-teabag-of-plants-native-to-iran,https://www.kickstarter.com/discover/categorie...,False,False,failed,1444141184,0.691164,"{""web"":{""project"":""https://www.kickstarter.com...",18.661436,domestic
1,637,Two agents battle each other in another dimens...,16233,the United States,1576048498,USD,True,1583987400,1.0,6000.0,1576306701,,False,,1581353979,"{""id"":2490383,""name"":""Seattle"",""slug"":""seattle...",Slip Strike,,"{""key"":""assets/027/753/183/1b44d6f57a405f04bb3...",16233.0,"{""id"":3869441,""project_id"":3869441,""state"":""in...",slip-strike-0,https://www.kickstarter.com/discover/categorie...,True,False,successful,1583987400,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",16233.0,domestic


In [28]:
df.drop([
    'is_backing'
], axis=1, inplace=True)

In [29]:
df.is_starrable.value_counts()
# no unique values
df.drop([
    'is_starrable'
], axis=1, inplace=True)

In [30]:
df.is_starred.value_counts()
# no unique values
df.drop([
    'is_starred'
], axis=1, inplace=True)

In [31]:
df.location.value_counts()

{"id":2442047,"name":"Los Angeles","slug":"los-angeles-ca","short_name":"Los Angeles, CA","displayable_name":"Los Angeles, CA","localized_name":"Los Angeles","country":"US","state":"CA","type":"Town","is_root":false,"expanded_country":"United States","urls":{"web":{"discover":"https://www.kickstarter.com/discover/places/los-angeles-ca","location":"https://www.kickstarter.com/locations/los-angeles-ca"},"api":{"nearby_projects":"https://api.kickstarter.com/v1/discover?signature=1589491226.79c52b464f25291240c04aef284035d65d945da0&woe_id=2442047"}}}                                                9722
{"id":44418,"name":"London","slug":"london-gb","short_name":"London, UK","displayable_name":"London, UK","localized_name":"London","country":"GB","state":"England","type":"Town","is_root":false,"expanded_country":"United Kingdom","urls":{"web":{"discover":"https://www.kickstarter.com/discover/places/london-gb","location":"https://www.kickstarter.com/locations/london-gb"},"api":{"nearby_project

In [32]:
df.permissions.value_counts()
# no unique values
df.drop([
    'permissions'
], axis=1, inplace=True)

In [33]:
df.drop([
    'photo'
], axis=1, inplace=True)

In [34]:
df.pledged.value_counts()

0.00         16545
1.00          6818
2.00          1744
10.00         1722
25.00         1195
             ...  
167759.00        1
17588.00         1
41308.11         1
10486.50         1
5511.77          1
Name: pledged, Length: 48021, dtype: int64

In [35]:
df.profile.value_counts()
df.drop([
    'profile'
], axis=1, inplace=True)

In [36]:
df.drop([
    'source_url'
], axis=1, inplace=True)

In [37]:
df.head(2)

Unnamed: 0,backers_count,blurb,converted_pledged_amount,country_displayable_name,created_at,currency,currency_trailing_code,deadline,fx_rate,goal,id,launched_at,location,name,pledged,slug,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,1,we are going Production herbal teabag of plan...,19,Australia,1441269202,AUD,True,1444141184,0.643694,14000.0,18648848,1441549184,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,27.0,production-herbal-teabag-of-plants-native-to-iran,False,False,failed,1444141184,0.691164,"{""web"":{""project"":""https://www.kickstarter.com...",18.661436,domestic
1,637,Two agents battle each other in another dimens...,16233,the United States,1576048498,USD,True,1583987400,1.0,6000.0,1576306701,1581353979,"{""id"":2490383,""name"":""Seattle"",""slug"":""seattle...",Slip Strike,16233.0,slip-strike-0,True,False,successful,1583987400,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",16233.0,domestic


In [38]:
df.slug.value_counts()

infinite-academy-a-super-new-way-of-learning          3
accidentally-on-purpose-0                             2
mayday-heyday-parfait                                 2
team-apex-naturalites-zero-issues                     2
ritual-0                                              2
                                                     ..
future-loves-past-needs-to-share-this-album-with-y    1
launching-mommys-toolbox                              1
m-stick-one-source-multi-use-smart-led-light          1
customizable-and-transparent-charity-app              1
christinas-cupcake-trailer                            1
Name: slug, Length: 189836, dtype: int64

In [39]:
df.spotlight.value_counts()

True     126980
False     90249
Name: spotlight, dtype: int64

In [40]:
df.staff_pick.value_counts()

False    188549
True      28680
Name: staff_pick, dtype: int64

In [41]:
df.state.value_counts()

successful    126980
failed         76260
canceled        9029
live            4960
Name: state, dtype: int64

In [42]:
df.state_changed_at.value_counts()

1572580740    31
1583038740    30
1559361542    28
1572591540    23
1561953540    21
              ..
1460066954     1
1382199943     1
1423358598     1
1574390015     1
1461977088     1
Name: state_changed_at, Length: 179423, dtype: int64

In [43]:
df.static_usd_rate.value_counts()

1.000000    149732
1.086105        54
1.109449        54
1.228667        51
1.215900        51
             ...  
0.049003         1
0.748048         1
1.032681         1
0.793573         1
1.313698         1
Name: static_usd_rate, Length: 13527, dtype: int64

In [44]:
df.drop([
    'urls'
], axis=1, inplace=True)

In [45]:
df.usd_pledged.value_counts()

0.000000         16545
1.000000          4702
2.000000          1164
10.000000         1122
25.000000          986
                 ...  
161149.000000        1
152.713749           1
0.766440             1
6172.610000          1
2380.436829          1
Name: usd_pledged, Length: 86070, dtype: int64

In [46]:
df.usd_type.value_counts()
df.drop([
    'usd_type'
], axis=1, inplace=True)

#### Merge DataFrames

In [47]:
df = df.merge(category, how='outer', left_index=True, right_index=True)
df.head(2)

Unnamed: 0,backers_count,blurb,converted_pledged_amount,country_displayable_name,created_at,currency,currency_trailing_code,deadline,fx_rate,goal,id_x,launched_at,location,name_x,pledged,slug_x,spotlight,staff_pick,state,state_changed_at,static_usd_rate,usd_pledged,id_y,name_y,slug_y,position,parentid,parentname,color,urls
0,1.0,we are going Production herbal teabag of plan...,19.0,Australia,1441269000.0,AUD,True,1444141000.0,0.643694,14000.0,18648850.0,1441549000.0,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,27.0,production-herbal-teabag-of-plants-native-to-iran,False,False,failed,1444141000.0,0.691164,18.661436,313,Small Batch,food/small batch,10,10,Food,16725570,web
1,637.0,Two agents battle each other in another dimens...,16233.0,the United States,1576048000.0,USD,True,1583987000.0,1.0,6000.0,1576307000.0,1581354000.0,"{""id"":2490383,""name"":""Seattle"",""slug"":""seattle...",Slip Strike,16233.0,slip-strike-0,True,False,successful,1583987000.0,1.0,16233.0,34,Tabletop Games,games/tabletop games,6,12,Games,51627,web
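
The `head` output above shows integer columns such as `backers_count` as floats after the merge. One plausible cause, stated here as an assumption, is that `category` still has an entry for every original row while `df` has since dropped some, so the outer join adds those index labels back with `NaN` in the `df` columns. A minimal sketch on toy frames comparing `how='outer'` with `how='left'`:

```python
import pandas as pd

# Toy stand-ins: `left` has already lost row 1, `right` still has all three rows
left = pd.DataFrame({"backers_count": [1, 637]}, index=[0, 2])
right = pd.DataFrame({"parentname": ["Food", "Games", "Art"]}, index=[0, 1, 2])

outer = left.merge(right, how="outer", left_index=True, right_index=True)
kept = left.merge(right, how="left", left_index=True, right_index=True)

print(outer)  # row 1 reappears with NaN, so backers_count becomes float
print(kept)   # only the rows still present in `left`
```

With a left join the filtered rows stay out of the frame, so no later step is needed to remove them again.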


In [48]:
df.drop([
    'slug_x',
    'id_y',
    'slug_y',
    'color',
    'urls'
], axis=1, inplace=True)

In [49]:
df.head(1)

Unnamed: 0,backers_count,blurb,converted_pledged_amount,country_displayable_name,created_at,currency,currency_trailing_code,deadline,fx_rate,goal,id_x,launched_at,location,name_x,pledged,spotlight,staff_pick,state,state_changed_at,static_usd_rate,usd_pledged,name_y,position,parentid,parentname
0,1.0,we are going Production herbal teabag of plan...,19.0,Australia,1441269000.0,AUD,True,1444141000.0,0.643694,14000.0,18648848.0,1441549000.0,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,27.0,False,False,failed,1444141000.0,0.691164,18.661436,Small Batch,10,10,Food


#### Rename Parentname Data to Category

In [50]:
df.rename(columns = {'parentname':'category'}, inplace = True)
df.rename(columns = {'name_y':'sub_category'}, inplace = True) 

#### Select Only Successful and Failed Projects

In [51]:
df = df.loc[df.state.isin(['successful', 'failed'])].copy()

In [52]:
df.state.value_counts()

successful    126980
failed         76260
Name: state, dtype: int64

#### Dummify Categorical Data

In [53]:
df.head(2)

Unnamed: 0,backers_count,blurb,converted_pledged_amount,country_displayable_name,created_at,currency,currency_trailing_code,deadline,fx_rate,goal,id_x,launched_at,location,name_x,pledged,spotlight,staff_pick,state,state_changed_at,static_usd_rate,usd_pledged,sub_category,position,parentid,category
0,1.0,we are going Production herbal teabag of plan...,19.0,Australia,1441269000.0,AUD,True,1444141000.0,0.643694,14000.0,18648850.0,1441549000.0,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,27.0,False,False,failed,1444141000.0,0.691164,18.661436,Small Batch,10,10,Food
1,637.0,Two agents battle each other in another dimens...,16233.0,the United States,1576048000.0,USD,True,1583987000.0,1.0,6000.0,1576307000.0,1581354000.0,"{""id"":2490383,""name"":""Seattle"",""slug"":""seattle...",Slip Strike,16233.0,True,False,successful,1583987000.0,1.0,16233.0,Tabletop Games,6,12,Games


In [54]:
currency_trailing_code = pd.get_dummies(df['currency_trailing_code'], drop_first=True)
currency_trailing_code.head()
df.drop([
    'currency_trailing_code'
], axis=1, inplace=True)

In [55]:
country = pd.get_dummies(df['country_displayable_name'], drop_first=True)
country.head()
df.drop([
    'country_displayable_name'
], axis=1, inplace=True)

In [56]:
currency = pd.get_dummies(df['currency'], drop_first=True)
currency.head()
df.drop([
    'currency'
], axis=1, inplace=True)

In [57]:
spotlight = pd.get_dummies(df['spotlight'], drop_first=True)
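# The dummy column label is the boolean True, not the string 'True', so this rename
# matches nothing and the column keeps its original label.
spotlight.rename(columns = {'True':'spotlight_true'}, inplace = True)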
spotlight.head()
df.drop([
    'spotlight'
], axis=1, inplace=True)

In [58]:
staff_pick = pd.get_dummies(df['staff_pick'], drop_first=True)
# As with spotlight, the label is the boolean True, so this rename matches nothing.
staff_pick.rename(columns = {'True':'staff_pick_true'}, inplace = True)
staff_pick.head()
df.drop([
    'staff_pick'
], axis=1, inplace=True)

In [59]:
state = pd.get_dummies(df['state'], drop_first=True)
state.head()
df.drop([
    'state'
], axis=1, inplace=True)

In [60]:
sub_category = pd.get_dummies(df['sub_category'], drop_first=True)
sub_category.head()
df.drop([
    'sub_category'
], axis=1, inplace=True)

In [61]:
category = pd.get_dummies(df['category'], drop_first=True)
category.head()
df.drop([
    'category'
], axis=1, inplace=True)
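
Because `get_dummies` labels each new column with the category value itself, a boolean feature yields a column named `True`, and `category` and `sub_category` can produce identical labels (for example `Food`), which leads to `_x`/`_y` suffixes once the pieces are merged. A minimal sketch, on a toy frame, of passing the column list so that every dummy carries its source name as a prefix:

```python
import pandas as pd

toy = pd.DataFrame({
    "staff_pick": [False, True, False],
    "category": ["Food", "Games", "Food"],
    "sub_category": ["Small Batch", "Tabletop Games", "Food"],
    "goal": [14000.0, 6000.0, 500.0],
})

# `columns=` encodes only the listed features and drops the originals;
# the default prefix is the source column name, so labels cannot collide
encoded = pd.get_dummies(
    toy,
    columns=["staff_pick", "category", "sub_category"],
    drop_first=True,
)
print(encoded.columns.tolist())
```

This also removes the need for separate drop and merge steps, since the encoded columns are returned alongside the untouched ones.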

#### Merge Dummy Data with Original DataFrame

In [62]:
df.shape

(203240, 17)

In [63]:
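# Note: `country` is merged twice here, so its dummy columns appear with `_x` and `_y`
# suffixes afterwards; the `currency_trailing_code` dummies are not merged.
df = df.merge(country, left_index=True, right_index=True)
df = df.merge(country, left_index=True, right_index=True)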
df = df.merge(currency, left_index=True, right_index=True)
df = df.merge(spotlight, left_index=True, right_index=True)
df = df.merge(staff_pick, left_index=True, right_index=True)
df = df.merge(state, left_index=True, right_index=True)
df = df.merge(sub_category, left_index=True, right_index=True)
df = df.merge(category, left_index=True, right_index=True)

In [64]:
df.shape

(203240, 249)

In [65]:
df.head(1)

Unnamed: 0,backers_count,blurb,converted_pledged_amount,created_at,deadline,fx_rate,goal,id_x,launched_at,location,name_x,pledged,state_changed_at,static_usd_rate,usd_pledged,position,parentid,Austria_x,Belgium_x,Canada_x,Denmark_x,France_x,Germany_x,Hong Kong_x,Ireland_x,Italy_x,Japan_x,Luxembourg_x,Mexico_x,New Zealand_x,Norway_x,Singapore_x,Spain_x,Sweden_x,Switzerland_x,the Netherlands_x,the United Kingdom_x,the United States_x,Austria_y,Belgium_y,Canada_y,Denmark_y,France_y,Germany_y,Hong Kong_y,Ireland_y,Italy_y,Japan_y,Luxembourg_y,Mexico_y,New Zealand_y,Norway_y,Singapore_y,Spain_y,Sweden_y,Switzerland_y,the Netherlands_y,the United Kingdom_y,the United States_y,CAD,CHF,DKK,EUR,GBP,HKD,JPY,MXN,NOK,NZD,SEK,SGD,USD,True_x,True_y,successful,Academic,Accessories,Action,Animals,Animation,Anthologies,Apparel,Apps,Architecture,Art,Art Books,Audio,Bacon,Blues,Calendars,Camera Equipment,Candles,Ceramics,Childrens Books,Childrenswear,Chiptune,Civic Design,Classical Music,Comedy,Comic Books,Comics_x,Community Gardens,Conceptual Art,Cookbooks,Country Folk,Couture,Crafts_x,Crochet,DIY,DIY Electronics,Dance_x,Design_x,Digital Art,Documentary,Drama,Drinks,Electronic Music,Embroidery,Events,Experimental,Fabrication Tools,Faith,Family,Fantasy,Farmers Markets,Farms,Fashion_x,Festivals,Fiction,Film Video_x,Fine Art,Flight,Food_x,Food Trucks,Footwear,Gadgets,Games_x,Gaming Hardware,Glass,Graphic Design,Graphic Novels,Hardware,HipHop,Horror,Illustration,Immersive,Indie Rock,Installations,Interactive Design,Jazz,Jewelry,Journalism_x,Kids,Knitting,Latin,Letterpress,Literary Journals,Literary Spaces,Live Games,Makerspaces,Metal,Mixed Media,Mobile Games,Movie Theaters,Music_x,Music Videos,Musical,Narrative Film,Nature,Nonfiction,Painting,People,Performance Art,Performances,Periodicals,Pet Fashion,Photo,Photobooks,Photography_x,Places,Playing Cards,Plays,Poetry,Pop,Pottery,Print,Printing,Product Design,Public Art,Publishing_x,Punk,Puzzles,Quilts,RB,Radio 
Podcasts,Readytowear,Residencies,Restaurants,Robots,Rock,Romance,Science Fiction,Sculpture,Shorts,Small Batch,Social Practice,Software,Sound,Space Exploration,Spaces,Stationery,Tabletop Games,Taxidermy,Technology_x,Television,Textiles,Theater_x,Thrillers,Toys,Translations,Typography,Vegan,Video,Video Art,Video Games,Wearables,Weaving,Web,Webcomics,Webseries,Woodworking,Workshops,World Music,Young Adult,Zines,Comics_y,Crafts_y,Dance_y,Design_y,Fashion_y,Film Video_y,Food_y,Games_y,Journalism_y,Music_y,Photography_y,Publishing_y,Technology_y,Theater_y
0,1.0,we are going Production herbal teabag of plan...,19.0,1441269000.0,1444141000.0,0.643694,14000.0,18648848.0,1441549000.0,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,27.0,1444141000.0,0.691164,18.661436,10,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [66]:
missing_values= df.isnull().sum()
missing_values.sort_values(ascending=False)
df.dropna(inplace=True)

In [67]:
df.shape

(195117, 249)

### Logistic Regression
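
For reference, the standard textbook form of the logistic regression model (not something specific to this notebook) expresses the probability of success given a vector of standardized features $z$ as:

$$
P(y = 1 \mid z) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top z)}}
$$

In scikit-learn, $\beta$ corresponds to the fitted `coef_` attribute and $\beta_0$ to `intercept_`. Because the features are standardized before fitting, each coefficient describes the change in log-odds per one standard deviation of that feature.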

In [68]:
X = df.drop([
    'successful',
    'blurb',
    'location',
    'name_x'
], axis = 'columns')
y = df.successful

In [69]:
# Train-test split (default 75/25) with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale our data.
# Relabeling scaled data as "Z" is common
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

# A very large C makes the default L2 penalty negligible (effectively no regularization)
logreg = LogisticRegression(C=1e9, solver='lbfgs')
logreg.fit(Z_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(Z_test)

# Print the confusion matrix, then the accuracy on the training and test sets
print(confusion_matrix(y_test, y_pred))
print(logreg.score(Z_train, y_train))
print(logreg.score(Z_test, y_test))

[[18988     0]
 [    0 29792]]
1.0
1.0
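
A perfect confusion matrix on 48,780 held-out projects usually points to target leakage. Columns such as `pledged`, `usd_pledged`, and `backers_count` appear to hold final campaign totals, and together with `goal` they nearly determine `successful`. The following is a minimal sketch on synthetic data, not the Kickstarter table: the generated `log_goal`, `duration_days`, and `log_pledged` columns and the helper `fit_and_score` are assumptions for illustration. It compares test accuracy with and without the post-outcome column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 4000

# Synthetic stand-in for the project table (illustrative values only).
toy = pd.DataFrame({
    "log_goal": rng.normal(8.0, 1.0, n),        # known before launch
    "duration_days": rng.integers(10, 60, n),   # known before launch
})
# Pledges depend only weakly on the pre-launch features, plus noise.
toy["log_pledged"] = 8.0 - 0.3 * (toy["log_goal"] - 8.0) + rng.normal(0.0, 1.5, n)
# A project succeeds when pledges reach the goal.
toy["successful"] = (toy["log_pledged"] >= toy["log_goal"]).astype(int)


def fit_and_score(features):
    """Return test accuracy of a scaled logistic regression on `features`."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[features], toy["successful"], random_state=42
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


acc_with_leak = fit_and_score(["log_goal", "duration_days", "log_pledged"])
acc_pre_launch = fit_and_score(["log_goal", "duration_days"])
print(f"with pledged column: {acc_with_leak:.3f}")
print(f"pre-launch features: {acc_pre_launch:.3f}")
```

If the same pattern holds for the real data, a model meant to predict success before launch would need to exclude every column that is only known after the campaign ends.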


In [70]:
# Print each feature name alongside its fitted coefficient
coef = logreg.coef_
for p, c in zip(X.columns, coef[0]):
    print(p + '\t' + str(c))

backers_count	0.5047370761967033
converted_pledged_amount	0.32048192236878786
created_at	0.003494640317477045
deadline	-0.03156314292016197
fx_rate	0.049388600722279095
goal	-0.39274821045853475
id_x	-0.0023816468769559066
launched_at	0.01305616241822038
pledged	0.09752649425638116
state_changed_at	-0.03156236941226721
static_usd_rate	-0.09520030250268355
usd_pledged	0.32118786073111566
position	-0.026624744529659957
parentid	-0.04460441304204819
Austria_x	-0.007070875845435892
Belgium_x	-0.008664126176677233
Canada_x	0.013969999215982373
Denmark_x	-0.034449625582898526
France_x	-0.0068684307002554465
Germany_x	-0.013137325680755211
Hong Kong_x	0.01424835865046498
Ireland_x	-0.006336523860264695
Italy_x	-0.028140918235792227
Japan_x	0.007263535776783879
Luxembourg_x	0.018197822381790005
Mexico_x	-0.012367076141207894
New Zealand_x	0.006517648636003901
Norway_x	-0.011896746311428105
Singapore_x	0.004492791138975874
Spain_x	-0.015305955303021947
Sweden_x	-0.025105945120475134
Switzerland
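
Because the model was fit on standardized features, the coefficient magnitudes are roughly comparable across columns, and ranking them by absolute value makes the dominant predictors easier to spot than a 245-line printout. A minimal sketch, assuming a small synthetic frame in place of `X` (the column names `strong`, `weak`, and `noise` and the data are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Illustrative stand-in for X: three numeric columns with made-up names.
X_demo = pd.DataFrame(
    rng.normal(size=(500, 3)), columns=["strong", "weak", "noise"]
)
# The label depends strongly on "strong", weakly on "weak", not on "noise".
logit = 3.0 * X_demo["strong"] + 0.5 * X_demo["weak"]
y_demo = (logit + rng.logistic(size=500) > 0).astype(int)

Z_demo = StandardScaler().fit_transform(X_demo)
model = LogisticRegression(solver="lbfgs").fit(Z_demo, y_demo)

# Pair each coefficient with its column name, then rank by magnitude.
coef_series = pd.Series(model.coef_[0], index=X_demo.columns)
ranked = coef_series.reindex(coef_series.abs().sort_values(ascending=False).index)
print(ranked)
```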

### Feature Selection

In [78]:
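# Make predictions
# Caution: logreg was fit on standardized features (Z_train), but X here is
# unscaled, which is the likely cause of the chance-level AUC printed below.
# Scoring sc.transform(X_test) against y_test would match the training setup.
predictions = logreg.predict_proba(X)
predictions_target = predictions[:, 1]

# Calculate the AUC value
auc = roc_auc_score(y, predictions_target)
print(round(auc, 2))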

0.5
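
An AUC of exactly 0.5 next to perfect test accuracy suggests that the inputs used for scoring differ from those used for training. One way to rule this out is to bundle the scaler and the classifier in a scikit-learn pipeline, so the same transformation is applied at fit and predict time. The sketch below uses synthetic data rather than the project table: the column names and scales are assumptions chosen to mimic timestamp-sized features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 2000
# Illustrative features on very different scales (made-up names and values).
demo = pd.DataFrame({
    "timestamp_like": rng.normal(1.44e9, 1e6, n),
    "small_feature": rng.normal(0.0, 1.0, n),
})
signal = (demo["timestamp_like"] - 1.44e9) / 1e6 + demo["small_feature"]
target = (signal + rng.normal(0.0, 0.5, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(demo, target, random_state=42)

# The pipeline scales inside fit() and predict_proba(), so raw frames go in.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))
pipe.fit(X_tr, y_tr)
auc_pipeline = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])

# Reproducing the mismatch: scale for training, then score on raw values.
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(solver="lbfgs").fit(scaler.transform(X_tr), y_tr)
auc_mismatch = roc_auc_score(y_te, clf.predict_proba(X_te.to_numpy())[:, 1])

print(f"pipeline AUC: {auc_pipeline:.3f}")
print(f"mismatched AUC: {auc_mismatch:.3f}")
```

In the mismatched case the raw timestamp-sized values saturate the sigmoid, every row receives the same probability, and the ranking that AUC measures is lost.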


In [79]:
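def auc(variables, target, df):
    """Fit a logistic regression on `variables` and return its AUC for `target`."""
    X = df[variables]
    y = np.ravel(df[target])  # accepts a column name or a one-element list

    # Fit on the selected columns only. Fitting on the full X_train and then
    # predicting on a subset raises "ValueError: X has 1 features per sample;
    # expecting 245".
    logreg = LogisticRegression(C=1e9, solver='lbfgs')
    logreg.fit(X, y)

    # roc_auc_score needs the probability of the positive class only
    predictions = logreg.predict_proba(X)[:, 1]
    return roc_auc_score(y, predictions)

In [81]:
# Store the result under a new name so the auc function is not overwritten
auc_theater = auc(['Theater_y'], 'successful', df)
print(round(auc_theater, 2))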

In [74]:
# def next_best(current_variables, candidate_variables, target, df):
#     best_auc = -1 # will 1 work instead of -1?
#     best_variable = None
    
#     for v in candidate_variables:
#         auc_v = auc(current_variables + [v], target, df)
        
#         if auc_v >= best_auc:
#             best_auc = auc_v
#             best_variable = v
#     return best_variable

In [75]:
# current_variables = ['variables']
# candidate_variables = ['next_variables']
# next_variable = next_best(current_variables, candidate_variables, 'successful', df)
# print(next_variable)

In [76]:
# candidate_variables = [
#     'backers_count', 
#     'converted_pledged_amount', 
#     'created_at',
#     'deadline',
#     'fx_rate',
#     'goal',
#     'id_x'
# ]
# current_variables = []
# target = 'successful'  # column name, since auc() looks it up with df[target]

# max_number_variables = 5
# number_iterations = min(max_number_variables, len(candidate_variables))
# for i in range(0, number_iterations):
#     next_variable = next_best(current_variables, candidate_variables, target, df)
#     current_variables = current_variables + [next_variable]
#     candidate_variables.remove(next_variable)
    
# print(current_variables)
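
The commented-out cells above outline forward stepwise selection: starting from an empty set, repeatedly add the candidate variable that gives the highest AUC. Below is a minimal runnable sketch of that procedure on synthetic data. The helpers `auc_for` and `next_best_variable` are hypothetical names that mirror `auc` and `next_best`, the columns are made up, and scoring on a held-out split is an added assumption rather than something the original cells do.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 3000
# Made-up columns: two informative features and two pure-noise features.
data = pd.DataFrame(
    rng.normal(size=(n, 4)),
    columns=["informative_a", "informative_b", "noise_a", "noise_b"],
)
score = 2.0 * data["informative_a"] + 1.0 * data["informative_b"]
data["target"] = (score + rng.logistic(size=n) > 0).astype(int)

train, test = train_test_split(data, random_state=42)


def auc_for(variables, target):
    """Fit on the training rows and return the AUC on the held-out rows."""
    model = LogisticRegression(solver="lbfgs")
    model.fit(train[variables], train[target])
    probabilities = model.predict_proba(test[variables])[:, 1]
    return roc_auc_score(test[target], probabilities)


def next_best_variable(current, candidates, target):
    """Return the candidate whose addition yields the highest AUC."""
    return max(candidates, key=lambda v: auc_for(current + [v], target))


selected = []
remaining = ["informative_a", "informative_b", "noise_a", "noise_b"]
for _ in range(2):  # cap on the number of selected variables
    best = next_best_variable(selected, remaining, "target")
    selected.append(best)
    remaining.remove(best)

print(selected)
```

Evaluating each candidate on held-out rows keeps the selection from rewarding variables that only fit noise in the training data.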

In [77]:
# Calculate the AUC of a model that uses "max_gift", "mean_gift" and "min_gift" as predictors
auc_current = auc(['max_gift', 'mean_gift', 'min_gift'], ["target"], basetable)
print(round(auc_current,4))

# Calculate which variable among "age" and "gender_F" should be added to the variables "max_gift", "mean_gift" and "min_gift"
next_variable = next_best(['max_gift', 'mean_gift', 'min_gift'], ['age', 'gender_F'], ["target"],basetable)
print(next_variable)

# Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "age" as predictors
auc_current_age = auc(['max_gift', 'mean_gift', 'min_gift', 'age'], ["target"], basetable)
print(round(auc_current_age,4))

# Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "gender_F" as predictors
auc_current_gender_F = auc(['max_gift', 'mean_gift', 'min_gift','gender_F'], ["target"], basetable)
print(round(auc_current_gender_F,4))

NameError: name 'auc' is not defined

In [None]:
# Create a dataframe new_data from current_data that has only the relevant predictors 
new_data = current_data[['age', 'gender_F', 'time_since_last_gift']]

# Make a prediction for each observation in new_data and assign it to predictions
predictions = logreg.predict_proba(new_data)

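# Sort the predictions: predict_proba returns a NumPy array, so wrap the
# positive-class column in a DataFrame before sorting
predictions_sorted = pd.DataFrame(
    predictions[:, 1], index=new_data.index, columns=['probability']
).sort_values('probability')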

# Print the row of predictions_sorted that has the donor that is most likely to donate
print(predictions_sorted.tail(1))