# Capstone

## Executive Summary

Kickstarter is a crowdfunding platform that helps inspired creators receive the funding they need to bring their ideas to life. What makes the platform unique is that it is that it attracts people in all financial classes to the platform to invest in ideas. This provides a unique opportunity for future creators because he shows real market interest in products and ideas that have yet to be created. Future creators can use this knowledge to inspire their own ideas by using Kickstarter data to find ideas that they support and create products based on those ideas. However, the purpose of this research is to take creators past simply picking topics by interest and, instead, helping them pick by more objective features.

In this research, my goal was to see whether or not I could predict success given the data of a product on Kickstarter. Using that data, I created a logistic regression model to predict whether or not a project would be successful. The baseline, or percentage of successes of total observations was 65%. Seperating out over 200 features and running 4 models, three models covering a category and the fourth covering all categories, I was able to determine that there were features that greatly influenced success: the number of backers and the financial goal. After filtering out as many variables I could, my predictions after tuning the model was roughly 85% for all four models.

For creator's using this model, they will find that for every backer that they gain, their liklihood of success increases dramatically. However, they will also find that the larger their financial goal, the lower their chances of succeeding are as well.

## Problem Statement

As somone interested in creating my entrepreneurs in the world, I thought it would be a good idea to see if I could take several years of Kickstarter data and see if I can predict whether a project would be successful or not and then use that information to make recommendations for those interested in creating something for themselves. But first, I needed to answer the following:

>  Can a model be created that predicts the success of a project on Kickstarter, greater than the baseline rate?

## Gather the Data

Data were found using the following link and downloaded onto my local drive.  
https://webrobots.io/kickstarter-datasets/

## Import Libraries

In [1]:
import pandas as pd
import glob
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

## Combine Data

In [2]:
# # uncomment to run initially
# # credit: https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/
# extension = 'csv'
# all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# #combine all files in the list
# combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
# #export to csv
# combined_csv.to_csv( "combined.csv", index=False, encoding='utf-8-sig')

## Read Data

In [3]:
# read in the data
df = pd.read_csv('./datasets/kickstarter_data/combined.csv')

## Clean Data

In [4]:
# identify what the columns are and what the values look like
df.head(2)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,1,we are going Production herbal teabag of plan...,"{""id"":313,""name"":""Small Batch"",""slug"":""food/sm...",19,AU,Australia,1441269202,"{""id"":1555219532,""name"":""ehsan"",""is_registered...",AUD,$,...,production-herbal-teabag-of-plants-native-to-iran,https://www.kickstarter.com/discover/categorie...,False,False,failed,1444141184,0.691164,"{""web"":{""project"":""https://www.kickstarter.com...",18.661436,domestic
1,637,Two agents battle each other in another dimens...,"{""id"":34,""name"":""Tabletop Games"",""slug"":""games...",16233,US,the United States,1576048498,"{""id"":99575233,""name"":""David Gerrard"",""is_regi...",USD,$,...,slip-strike-0,https://www.kickstarter.com/discover/categorie...,True,False,successful,1583987400,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",16233.0,domestic


In [5]:
# get basic info on the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217433 entries, 0 to 217432
Data columns (total 38 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   backers_count             217433 non-null  int64  
 1   blurb                     217425 non-null  object 
 2   category                  217433 non-null  object 
 3   converted_pledged_amount  217433 non-null  int64  
 4   country                   217433 non-null  object 
 5   country_displayable_name  217433 non-null  object 
 6   created_at                217433 non-null  int64  
 7   creator                   217433 non-null  object 
 8   currency                  217433 non-null  object 
 9   currency_symbol           217433 non-null  object 
 10  currency_trailing_code    217433 non-null  bool   
 11  current_currency          217433 non-null  object 
 12  deadline                  217433 non-null  int64  
 13  disable_communication     217433 non-null  b

In [6]:
# see how many rows, columns there are
df.shape

(217433, 38)

In [7]:
# get the percentage of missing values
missing_values= df.isnull().sum()
# sort from most missing to least
(missing_values/len(df)).sort_values(ascending=False)

is_backing                  0.999669
permissions                 0.999669
friends                     0.999669
is_starred                  0.999669
location                    0.000989
usd_type                    0.000938
blurb                       0.000037
staff_pick                  0.000000
spotlight                   0.000000
category                    0.000000
converted_pledged_amount    0.000000
country                     0.000000
country_displayable_name    0.000000
created_at                  0.000000
creator                     0.000000
currency                    0.000000
currency_symbol             0.000000
currency_trailing_code      0.000000
current_currency            0.000000
deadline                    0.000000
disable_communication       0.000000
urls                        0.000000
fx_rate                     0.000000
goal                        0.000000
id                          0.000000
usd_pledged                 0.000000
is_starrable                0.000000
s

### Drop Missing Values

In [8]:
# drop the columns with a majority of the data missing
df.drop([
    'is_backing',
    'permissions',
    'friends',
    'is_starred'
], axis=1, inplace=True)

In [9]:
# confirm drop completed
# get the percentage of missing values
missing_values= df.isnull().sum()
# sort from most missing to least
(missing_values/len(df)).sort_values(ascending=False)

location                    0.000989
usd_type                    0.000938
blurb                       0.000037
currency                    0.000000
disable_communication       0.000000
deadline                    0.000000
current_currency            0.000000
currency_trailing_code      0.000000
currency_symbol             0.000000
creator                     0.000000
goal                        0.000000
created_at                  0.000000
country_displayable_name    0.000000
country                     0.000000
converted_pledged_amount    0.000000
category                    0.000000
fx_rate                     0.000000
id                          0.000000
usd_pledged                 0.000000
is_starrable                0.000000
launched_at                 0.000000
name                        0.000000
photo                       0.000000
pledged                     0.000000
profile                     0.000000
slug                        0.000000
source_url                  0.000000
s

In [10]:
# drop remaining rows with missing values
df.dropna(inplace=True)

In [11]:
df.describe()

Unnamed: 0,backers_count,converted_pledged_amount,created_at,deadline,fx_rate,goal,id,launched_at,pledged,state_changed_at,static_usd_rate,usd_pledged
count,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0
mean,153.397791,13918.43,1475155000.0,1482196000.0,0.971724,50935.42,1073520000.0,1479353000.0,25307.48,1482044000.0,1.001748,13929.65
std,956.261442,111670.7,72962000.0,72681080.0,0.213597,1226416.0,619465700.0,72684190.0,915798.9,72577960.0,0.239873,111681.6
min,0.0,0.0,1240366000.0,1242468000.0,0.009327,0.01,18520.0,1240920000.0,0.0,1242468000.0,0.008771,0.0
25%,4.0,125.0,1422486000.0,1428764000.0,1.0,1500.0,536864300.0,1425915000.0,130.0,1428638000.0,1.0,125.0
50%,29.0,1630.0,1476549000.0,1483467000.0,1.0,5000.0,1073560000.0,1480564000.0,1675.0,1483394000.0,1.0,1631.0
75%,93.0,6818.0,1540804000.0,1549072000.0,1.0,15000.0,1610402000.0,1546214000.0,7341.0,1549039000.0,1.0,6831.308
max,105857.0,12969610.0,1589423000.0,1594600000.0,1.226759,100000000.0,2147476000.0,1589431000.0,235320500.0,1589432000.0,1.716408,12969610.0


In [12]:
df.location.value_counts()

{"id":2442047,"name":"Los Angeles","slug":"los-angeles-ca","short_name":"Los Angeles, CA","displayable_name":"Los Angeles, CA","localized_name":"Los Angeles","country":"US","state":"CA","type":"Town","is_root":false,"expanded_country":"United States","urls":{"web":{"discover":"https://www.kickstarter.com/discover/places/los-angeles-ca","location":"https://www.kickstarter.com/locations/los-angeles-ca"},"api":{"nearby_projects":"https://api.kickstarter.com/v1/discover?signature=1589491226.79c52b464f25291240c04aef284035d65d945da0&woe_id=2442047"}}}                        9721
{"id":44418,"name":"London","slug":"london-gb","short_name":"London, UK","displayable_name":"London, UK","localized_name":"London","country":"GB","state":"England","type":"Town","is_root":false,"expanded_country":"United Kingdom","urls":{"web":{"discover":"https://www.kickstarter.com/discover/places/london-gb","location":"https://www.kickstarter.com/locations/london-gb"},"api":{"nearby_projects":"https://api.kickstar

In [13]:
df.usd_type.value_counts()

domestic         216946
international        60
Name: usd_type, dtype: int64

In [14]:
df.blurb.value_counts()

ALL-NEW SEXY BADGIRL characters from comic book INDIE legend Everette Hartsoe. 100% artwork in book                                        35
A beautiful natural Fine art nude book exemplifying the female form presented by female producer Nina Vain.                                28
Hard Enamel Pins                                                                                                                           22
The Decentralized Dance Party was founded on the belief that Partying is an art form that has the power to change the world.               17
Award Winning Footwear Designs | Crafted Using Italian Leathers with Bold and Comfortable Features | London Navy Men's Luxury Footwear     15
                                                                                                                                           ..
Pre-order Soulajar's new album "Between Here and There". Plus exclusive experiences & more!                                                 1
Locate

In [15]:
df.converted_pledged_amount.value_counts()

0         17603
1          6871
2          1703
10         1258
25         1087
          ...  
9040          1
13138         1
25432         1
20642         1
175395        1
Name: converted_pledged_amount, Length: 32730, dtype: int64

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217006 entries, 0 to 217432
Data columns (total 34 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   backers_count             217006 non-null  int64  
 1   blurb                     217006 non-null  object 
 2   category                  217006 non-null  object 
 3   converted_pledged_amount  217006 non-null  int64  
 4   country                   217006 non-null  object 
 5   country_displayable_name  217006 non-null  object 
 6   created_at                217006 non-null  int64  
 7   creator                   217006 non-null  object 
 8   currency                  217006 non-null  object 
 9   currency_symbol           217006 non-null  object 
 10  currency_trailing_code    217006 non-null  bool   
 11  current_currency          217006 non-null  object 
 12  deadline                  217006 non-null  int64  
 13  disable_communication     217006 non-null  b

In [17]:
df.country.value_counts()

US    149510
GB     25023
CA     10232
AU      5190
DE      3940
FR      3138
MX      3054
IT      2740
ES      2462
NL      1920
SE      1596
HK      1538
DK       996
NZ       964
SG       884
CH       752
IE       709
BE       645
JP       579
AT       548
NO       514
LU        72
Name: country, dtype: int64

In [18]:
df.country_displayable_name.value_counts()

the United States     149510
the United Kingdom     25023
Canada                 10232
Australia               5190
Germany                 3940
France                  3138
Mexico                  3054
Italy                   2740
Spain                   2462
the Netherlands         1920
Sweden                  1596
Hong Kong               1538
Denmark                  996
New Zealand              964
Singapore                884
Switzerland              752
Ireland                  709
Belgium                  645
Japan                    579
Austria                  548
Norway                   514
Luxembourg                72
Name: country_displayable_name, dtype: int64

In [19]:
df.created_at.value_counts()

1554821069    4
1551365530    4
1572624598    4
1551737370    4
1544463752    4
             ..
1428144038    1
1444652965    1
1471131556    1
1450278958    1
1438941717    1
Name: created_at, Length: 189511, dtype: int64

In [20]:
df.creator.value_counts()

{"id":2053011023,"name":"Benjamin Hennessey","slug":"combatmedallions","is_registered":null,"chosen_currency":null,"is_superbacker":null,"avatar":{"thumb":"https://ksr-ugc.imgix.net/assets/008/647/822/59acad1fb0a00a22cd0c5df2db43343f_original.jpg?ixlib=rb-2.1.0&w=40&h=40&fit=crop&v=1461536749&auto=format&frame=1&q=92&s=681661727d91252651719bdc7202b454","small":"https://ksr-ugc.imgix.net/assets/008/647/822/59acad1fb0a00a22cd0c5df2db43343f_original.jpg?ixlib=rb-2.1.0&w=160&h=160&fit=crop&v=1461536749&auto=format&frame=1&q=92&s=119a85455aafb64c83a17e481c02a595","medium":"https://ksr-ugc.imgix.net/assets/008/647/822/59acad1fb0a00a22cd0c5df2db43343f_original.jpg?ixlib=rb-2.1.0&w=160&h=160&fit=crop&v=1461536749&auto=format&frame=1&q=92&s=119a85455aafb64c83a17e481c02a595"},"urls":{"web":{"user":"https://www.kickstarter.com/profile/combatmedallions"},"api":{"user":"https://api.kickstarter.com/v1/users/2053011023?signature=1589516310.c7ebe463c5a4b9915638287eb55c3dbe464dffc5"}}}    12
{"id":1712

In [21]:
df.currency.value_counts()

USD    149510
GBP     25023
EUR     16174
CAD     10232
AUD      5190
MXN      3054
SEK      1596
HKD      1538
DKK       996
NZD       964
SGD       884
CHF       752
JPY       579
NOK       514
Name: currency, dtype: int64

In [22]:
df.currency_symbol.value_counts()

$      171372
£       25023
€       16174
kr       3106
Fr        752
¥         579
Name: currency_symbol, dtype: int64

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217006 entries, 0 to 217432
Data columns (total 34 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   backers_count             217006 non-null  int64  
 1   blurb                     217006 non-null  object 
 2   category                  217006 non-null  object 
 3   converted_pledged_amount  217006 non-null  int64  
 4   country                   217006 non-null  object 
 5   country_displayable_name  217006 non-null  object 
 6   created_at                217006 non-null  int64  
 7   creator                   217006 non-null  object 
 8   currency                  217006 non-null  object 
 9   currency_symbol           217006 non-null  object 
 10  currency_trailing_code    217006 non-null  bool   
 11  current_currency          217006 non-null  object 
 12  deadline                  217006 non-null  int64  
 13  disable_communication     217006 non-null  b

In [24]:
df.currency_trailing_code.value_counts()

True     174478
False     42528
Name: currency_trailing_code, dtype: int64

In [25]:
df.current_currency.value_counts()

USD    217006
Name: current_currency, dtype: int64

In [26]:
df.deadline.value_counts()

1572580740    32
1583038740    31
1559361540    28
1572591540    23
1459483140    22
              ..
1517174784     1
1525008383     1
1428590384     1
1495914493     1
1441947660     1
Name: deadline, Length: 178289, dtype: int64

In [27]:
df.disable_communication.value_counts()

False    217006
Name: disable_communication, dtype: int64

In [28]:
df.fx_rate.value_counts()

1.000000    149510
1.221140     18616
1.080912     12361
0.709285      7743
1.226759      6407
0.643694      3952
1.085077      3813
0.711371      2489
0.041296      2167
0.101724      1246
0.647046      1238
0.129025      1052
0.041245       887
0.144964       752
0.598910       732
0.703586       679
1.027844       592
0.129018       486
0.009354       440
0.098205       403
0.102376       350
0.145478       244
0.601356       232
0.705470       205
1.031539       160
0.009327       139
0.098548       111
Name: fx_rate, dtype: int64

In [29]:
df.goal.value_counts()

5000.0     15452
10000.0    13659
1000.0     10254
2000.0      8858
3000.0      8728
           ...  
61.0           1
14495.0        1
83160.0        1
20782.0        1
10130.0        1
Name: goal, Length: 5519, dtype: int64

In [30]:
df.is_starrable.value_counts()

False    212099
True       4907
Name: is_starrable, dtype: int64

In [31]:
df.launched_at.value_counts()

1497283150    4
1582642801    4
1588359602    4
1581440419    4
1574168401    4
             ..
1566166760    1
1404155623    1
1427212857    1
1466541797    1
1446479823    1
Name: launched_at, Length: 189453, dtype: int64

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217006 entries, 0 to 217432
Data columns (total 34 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   backers_count             217006 non-null  int64  
 1   blurb                     217006 non-null  object 
 2   category                  217006 non-null  object 
 3   converted_pledged_amount  217006 non-null  int64  
 4   country                   217006 non-null  object 
 5   country_displayable_name  217006 non-null  object 
 6   created_at                217006 non-null  int64  
 7   creator                   217006 non-null  object 
 8   currency                  217006 non-null  object 
 9   currency_symbol           217006 non-null  object 
 10  currency_trailing_code    217006 non-null  bool   
 11  current_currency          217006 non-null  object 
 12  deadline                  217006 non-null  int64  
 13  disable_communication     217006 non-null  b

In [33]:
df.name.value_counts()

Home                                                          8
Debut Album                                                   8
A Midsummer Night's Dream                                     7
Reflections                                                   6
The Other Side                                                6
                                                             ..
Bees and Honey, Hives and Mead! VilleBilly Bees is Buzzin'    1
#Scanners                                                     1
Deep Red Returns in Full Color!!                              1
Miami XL: A New AlterLatina Comedy Web Series                 1
Pimp My Carroza Bogota                                        1
Name: name, Length: 188992, dtype: int64

In [34]:
df.photo.value_counts()

{"key":null,"full":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=560&h=315&fit=crop&v=&auto=format&frame=1&q=92&s=ef9622ff4223deef49fa8ad823aea9e2","ed":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=352&h=198&fit=crop&v=&auto=format&frame=1&q=92&s=54a9c4d0b0b9a4dd8b9bc750f5cbab0a","med":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=272&h=153&fit=crop&v=&auto=format&frame=1&q=92&s=9190ef46fcf7ec4c0715bae1a204c47d","little":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=208&h=117&fit=crop&v=&auto=format&frame=1&q=92&s=cc0886f218b6ba9280e60cfccf1c839c","small":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=160&h=90&fit=crop&v=&auto=format&frame=1&q=92&s=23bb8e82cb40d860a59b192531038aed","thumb":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=48&h=27&fit=crop&v=&auto=format&frame=1&q=92&

In [35]:
df.pledged.value_counts()

0.00        16530
1.00         6818
2.00         1743
10.00        1722
25.00        1193
            ...  
15306.00        1
12285.49        1
47523.00        1
31545.00        1
5805.55         1
Name: pledged, Length: 47968, dtype: int64

In [36]:
df.profile.value_counts()

{"id":3992603,"project_id":3992603,"state":"inactive","state_changed_at":1589216270,"name":null,"blurb":null,"background_color":null,"text_color":null,"link_background_color":null,"link_text_color":null,"link_text":null,"link_url":null,"show_feature_image":false,"background_image_opacity":0.8,"should_show_feature_image_section":true,"feature_image_attributes":{"image_urls":{"default":"https://ksr-ugc.imgix.net/assets/029/043/271/0d91691252f9105a7b81930ca5177374_original.png?ixlib=rb-2.1.0&crop=faces&w=1552&h=873&fit=crop&v=1589216331&auto=format&frame=1&q=92&s=cb6ae59dcc183da6ae1a3b5b59ec5cce","baseball_card":"https://ksr-ugc.imgix.net/assets/029/043/271/0d91691252f9105a7b81930ca5177374_original.png?ixlib=rb-2.1.0&crop=faces&w=560&h=315&fit=crop&v=1589216331&auto=format&frame=1&q=92&s=e2b4031b6cac9bfcb063b680b129214a"}}}                                                                                                                                                                        

In [37]:
df.slug.value_counts()

infinite-academy-a-super-new-way-of-learning                   3
cooper-lightwood-a-bright-green-idea                           2
miss-jodi-music                                                2
the-order-of-santa-claus-become-an-official-helper-to-santa    2
bee-health-guru-a-smartphone-app-for-beekeepers                2
                                                              ..
dustless-soul-creations                                        1
rokpak-worlds-first-solar-battery-pack-drybox-all              1
ghost-train-movie                                              1
missloutoyous-daydream-deliveries-subscription-box             1
frugalosophy-a-financial-philosophy                            1
Name: slug, Length: 189615, dtype: int64

In [38]:
df.spotlight.value_counts()

True     126821
False     90185
Name: spotlight, dtype: int64

In [39]:
df.staff_pick.value_counts()

False    188376
True      28630
Name: staff_pick, dtype: int64

In [40]:
df.state.value_counts()

successful    126821
failed         76210
canceled        9015
live            4960
Name: state, dtype: int64

In [41]:
df.state_changed_at.value_counts()

1572580740    31
1583038740    30
1559361542    28
1572591540    23
1561953540    21
              ..
1460066954     1
1382199943     1
1423358598     1
1574390015     1
1461977088     1
Name: state_changed_at, Length: 179202, dtype: int64

In [42]:
df.static_usd_rate.value_counts()

1.000000    149511
1.086105        54
1.109449        54
1.228667        51
1.215900        51
             ...  
0.049003         1
0.748048         1
1.032681         1
0.793573         1
1.313698         1
Name: static_usd_rate, Length: 13527, dtype: int64

In [43]:
df.usd_pledged.value_counts()

0.000000        16530
1.000000         4702
2.000000         1163
10.000000        1122
25.000000         984
                ...  
11965.722057        1
62.008843           1
11164.676665        1
136.825139          1
10726.000000        1
Name: usd_pledged, Length: 86015, dtype: int64

In [44]:
df.usd_type.value_counts()

domestic         216946
international        60
Name: usd_type, dtype: int64

### Drop Unnecessary Features

In [45]:
df.drop([
    'location', # removed because values won't aid in determining success
    'usd_type', # removed because value is the same
    'blurb', # removed because it is text data
    'converted_pledged_amount', # removed because I don't know what this is
    'country', # removed because another variable has same information
    'created_at', # removed because date won't convert
    'creator', # removed because values won't aid in determining success
    'currency', # removed because another variable has same information
    'currency_symbol', # remove because another variable has similar information
    'currency_trailing_code', # removed because I don't know what this is
    'current_currency', # removed because of no unique values
    'deadline', # removed because data won't convert
    'disable_communication', # removed because no unique values
    'fx_rate', # removed because I don't know what this is
    'id', # feature provides no value
    'is_starrable', # removed because I don't know what this is
    'launched_at', # removed because date won't convert
    'name', # removed because it is text data
    'photo', # removed because values won't aid in determining success
    'profile', # removed because values won't aid in determining success
    'slug', # removed because it is text data
    'source_url', # removed because values won't aid in determining success
    'state_changed_at', # removed because date won't convert
    'static_usd_rate', # dropped because I don't know what this is
    'urls', # dropped because this won't aid in determining success
    'usd_type' # removed because minimal unique values
], axis=1, inplace=True)

### Isolate Successful and Failed Projects Only

In [46]:
df = df.loc[(df.state == 'successful') | (df.state == 'failed')]

In [47]:
df.describe()

Unnamed: 0,backers_count,goal,pledged,usd_pledged
count,203031.0,203031.0,203031.0,203031.0
mean,158.598667,46692.54,26100.2,14396.86
std,975.347676,1152487.0,944984.3,114219.3
min,0.0,0.01,0.0,0.0
25%,5.0,1500.0,180.0,171.9462
50%,32.0,5000.0,1941.0,1862.23
75%,98.0,14000.0,7815.0,7267.065
max,105857.0,100000000.0,235320500.0,12969610.0


In [48]:
success = df.loc[(df.state == 'successful')]
success.describe()

Unnamed: 0,backers_count,goal,pledged,usd_pledged
count,126821.0,126821.0,126821.0,126821.0
mean,246.705309,14159.69,40694.05,22406.34
std,1225.173521,257018.6,1195085.0,143861.7
min,1.0,0.01,1.0,0.9139121
25%,32.0,1000.0,1727.0,1718.0
50%,70.0,3500.0,5053.0,4839.05
75%,164.0,10000.0,14161.06,12715.0
max,105857.0,68000000.0,235320500.0,12969610.0


In [49]:
failed = df.loc[(df.state == 'failed')]
failed.describe()

Unnamed: 0,backers_count,goal,pledged,usd_pledged
count,76210.0,76210.0,76210.0,76210.0
mean,11.980475,100830.4,1814.596,1068.296136
std,45.406468,1850388.0,37090.71,5578.608086
min,0.0,1.0,0.0,0.0
25%,1.0,2500.0,1.0,1.132288
50%,3.0,7500.0,59.0,56.0
75%,9.0,25000.0,475.0,435.0
max,4435.0,100000000.0,6598984.0,607628.38


## Create Category and Sub Category Columns

In [50]:
df.category = df.category.str.replace(':', ',')

punctuation = "!\"#$%&'()*+-.:;<=>?@[\\]^_`{|}~"

def remove_punctuation(s):
    s_sans_punct = ""
    for letter in s:
        if letter not in punctuation:
            s_sans_punct += letter
    return s_sans_punct

# splits record strings up into lists
new_category = []
for line in df.category:
    line = remove_punctuation(line)
    new_category.append(line.split(','))
    
df.category = new_category

for line in df.category:
    for element in line:
        clean_data = remove_punctuation(element)

all_categories = {}
for j, line in enumerate(df.category):
    categories = {}
    for i, ele in enumerate(line[:-4]):
        if i % 2 == 0:
            categories[ele] = line[i+1]
    all_categories[j] = categories

category = pd.DataFrame(all_categories).T
category.head(2)

Unnamed: 0,id,name,slug,position,parentid,parentname,color,urls
0,313,Small Batch,food/small batch,10,10,Food,16725570,web
1,34,Tabletop Games,games/tabletop games,6,12,Games,51627,web


## Drop Unnecessary Features in Category DataFrame

In [51]:
category.drop([
    'id', # not a helpful feature
    'slug', # feature provides no value
    'parentid', # not a helpful feature
    'color', # not a helpful feature
    'urls' # not a helpful feature
], axis=1, inplace=True)

## Rename Features in Category DataFrame

In [52]:
# rename features in the category dataframe
category.rename(columns = {'name':'sub_category'}, inplace = True)
category.rename(columns = {'parentname':'category'}, inplace = True)

## Rename Features in Original DataFrame

In [53]:
# rename features in the category dataframe
df.rename(columns = {'country_displayable_name':'country'}, inplace = True)

## Concat DataFrames

In [54]:
df.head(3)

Unnamed: 0,backers_count,category,country,goal,pledged,spotlight,staff_pick,state,usd_pledged
0,1,"[id, 313, name, Small Batch, slug, food/small ...",Australia,14000.0,27.0,False,False,failed,18.661436
1,637,"[id, 34, name, Tabletop Games, slug, games/tab...",the United States,6000.0,16233.0,True,False,successful,16233.0
2,50,"[id, 262, name, Accessories, slug, fashion/acc...",Canada,450.0,1294.29,True,False,successful,987.413673


In [55]:
category.head(3)

Unnamed: 0,sub_category,position,category
0,Small Batch,10,Food
1,Tabletop Games,6,Games
2,Accessories,1,Fashion


In [56]:
df.drop([
    'category' # dropped to prevent problems in concat
], axis=1, inplace=True)

In [57]:
df = pd.concat([df, category], axis=1)
df.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,category
0,1.0,Australia,14000.0,27.0,False,False,failed,18.66144,Small Batch,10,Food
1,637.0,the United States,6000.0,16233.0,True,False,successful,16233.0,Tabletop Games,6,Games
2,50.0,Canada,450.0,1294.29,True,False,successful,987.4137,Accessories,1,Fashion
3,8.0,the United States,28000.0,361.0,False,False,failed,361.0,Small Batch,10,Food
4,6452.0,the United States,15000.0,1385803.0,True,False,successful,1385803.0,Product Design,5,Design


## Explore Data

In [58]:
# get the percentage of missing values
missing_values= df.isnull().sum()
# sort from most missing to least
(missing_values/len(df)).sort_values(ascending=False)

category         0.098650
position         0.062091
sub_category     0.062091
usd_pledged      0.062091
state            0.062091
staff_pick       0.062091
spotlight        0.062091
pledged          0.062091
goal             0.062091
country          0.062091
backers_count    0.062091
dtype: float64

In [59]:
df.shape

(216472, 11)

In [60]:
df.dropna(inplace=True)

In [61]:
df.shape

(182222, 11)

In [62]:
df.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,category
0,1.0,Australia,14000.0,27.0,False,False,failed,18.66144,Small Batch,10,Food
1,637.0,the United States,6000.0,16233.0,True,False,successful,16233.0,Tabletop Games,6,Games
2,50.0,Canada,450.0,1294.29,True,False,successful,987.4137,Accessories,1,Fashion
3,8.0,the United States,28000.0,361.0,False,False,failed,361.0,Small Batch,10,Food
4,6452.0,the United States,15000.0,1385803.0,True,False,successful,1385803.0,Product Design,5,Design


## Create Category DataFrames

In [63]:
music = df.loc[(df.category == 'Music')]
film_video = df.loc[(df.category == 'Film  Video')]
publishing = df.loc[(df.category == 'Publishing')]
art = df.loc[(df.category == 'Art')]
technology = df.loc[(df.category == 'Technology')]
food = df.loc[(df.category == 'Food')]
fashion = df.loc[(df.category == 'Fashion')]
games = df.loc[(df.category == 'Games')]
comics = df.loc[(df.category == 'Comics')]
design = df.loc[(df.category == 'Design')]
photography = df.loc[(df.category == 'Photography')]
theater = df.loc[(df.category == 'Theater')]
crafts = df.loc[(df.category == 'Crafts')]
journalism = df.loc[(df.category == 'Journalism')]
dance = df.loc[(df.category == 'Dance')]

## Use Get Dummies for Each New DataFrame and Original

### Music

In [64]:
music.head(2)

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,category
27,0.0,the United States,60000.0,0.0,False,False,failed,0.0,Electronic Music,6,Music
39,37.0,the Netherlands,200.0,315.0,True,False,successful,363.93129,Electronic Music,6,Music


In [65]:
music.columns

Index(['backers_count', 'country', 'goal', 'pledged', 'spotlight',
       'staff_pick', 'state', 'usd_pledged', 'sub_category', 'position',
       'category'],
      dtype='object')

In [66]:
spotlight = pd.get_dummies(music.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(music.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(music.state, drop_first=True)
sub_category = pd.get_dummies(music.sub_category, drop_first=True)
category = pd.get_dummies(music.category, drop_first=True)

# concat dummy variables into dataframe
music = pd.concat([music, spotlight, staff_pick, state, sub_category, category], axis=1)

music.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,...,Indie Rock,Jazz,Kids,Latin,Metal,Pop,Punk,RB,Rock,World Music
27,0.0,the United States,60000.0,0.0,False,False,failed,0.0,Electronic Music,6,...,0,0,0,0,0,0,0,0,0,0
39,37.0,the Netherlands,200.0,315.0,True,False,successful,363.93129,Electronic Music,6,...,0,0,0,0,0,0,0,0,0,0
41,38.0,Germany,5000.0,4713.0,False,False,failed,5362.342106,Blues,1,...,0,0,0,0,0,0,0,0,0,0
42,2.0,the United States,2017.0,11.0,False,False,failed,11.0,Electronic Music,6,...,0,0,0,0,0,0,0,0,0,0
51,155.0,the United States,11000.0,16287.0,True,True,successful,16287.0,Electronic Music,6,...,0,0,0,0,0,0,0,0,0,0


In [99]:
music.describe()

Unnamed: 0,backers_count,goal,pledged,usd_pledged,dummy_spot_True,dummy_pick_True,successful,Chiptune,Classical Music,Comedy,...,Indie Rock,Jazz,Kids,Latin,Metal,Pop,Punk,RB,Rock,World Music
count,24376.0,24376.0,24376.0,24376.0,24376.0,24376.0,24376.0,24376.0,24376.0,24376.0,...,24376.0,24376.0,24376.0,24376.0,24376.0,24376.0,24376.0,24376.0,24376.0,24376.0
mean,159.333771,42468.36,29904.84,14412.39,0.621103,0.141779,0.621103,0.002051,0.095299,0.003405,...,0.097145,0.080858,0.013251,0.006933,0.034337,0.095586,0.015056,0.021332,0.098991,0.089925
std,859.362161,1020616.0,983147.4,101537.9,0.485122,0.34883,0.485122,0.045245,0.293633,0.058254,...,0.296161,0.272623,0.114349,0.082977,0.182097,0.294028,0.121777,0.144493,0.298656,0.286079
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.0,1500.0,187.0,178.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,32.0,5000.0,1965.0,1900.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,100.0,14762.5,8000.257,7386.887,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,64867.0,100000000.0,146910200.0,7850867.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Film & Video

In [67]:
spotlight = pd.get_dummies(film_video.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(film_video.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(film_video.state, drop_first=True)
sub_category = pd.get_dummies(film_video.sub_category, drop_first=True)
category = pd.get_dummies(film_video.category, drop_first=True)

# concat dummy variables into dataframe
film_video = pd.concat([film_video, spotlight, staff_pick, state, sub_category, category], axis=1)

film_video.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,...,Horror,Movie Theaters,Music Videos,Narrative Film,Romance,Science Fiction,Shorts,Television,Thrillers,Webseries
12,0.0,the United States,2350.0,0.0,False,False,failed,0.0,Family,7,...,0,0,0,0,0,0,0,0,0,0
47,2852.0,the United States,75000.0,108346.83,True,False,successful,108346.83,Festivals,9,...,0,0,0,0,0,0,0,0,0,0
135,2.0,the United States,5000.0,160.0,False,False,failed,160.0,Festivals,9,...,0,0,0,0,0,0,0,0,0,0
149,49.0,Australia,3000.0,11766.32,True,False,successful,8153.435086,Festivals,9,...,0,0,0,0,0,0,0,0,0,0
150,4.0,the United States,250.0,272.0,True,False,successful,272.0,Family,7,...,0,0,0,0,0,0,0,0,0,0


In [100]:
film_video.describe()

Unnamed: 0,backers_count,goal,pledged,usd_pledged,dummy_spot_True,dummy_pick_True,successful,Animation,Comedy,Documentary,...,Horror,Movie Theaters,Music Videos,Narrative Film,Romance,Science Fiction,Shorts,Television,Thrillers,Webseries
count,25105.0,25105.0,25105.0,25105.0,25105.0,25105.0,25105.0,25105.0,25105.0,25105.0,...,25105.0,25105.0,25105.0,25105.0,25105.0,25105.0,25105.0,25105.0,25105.0,25105.0
mean,166.314599,39390.02,20873.19,14616.15,0.622665,0.136108,0.622665,0.092691,0.095678,0.104521,...,0.059669,0.013185,0.029277,0.091496,0.008644,0.035571,0.106234,0.041386,0.034535,0.091456
std,996.829867,846099.7,189423.4,95949.46,0.48473,0.34291,0.48473,0.290004,0.294155,0.305941,...,0.236878,0.114067,0.168585,0.288319,0.092571,0.185221,0.308143,0.199186,0.182602,0.288262
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.0,1500.0,184.0,172.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31.0,5000.0,1884.0,1810.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,98.0,15000.0,7895.0,7321.488,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,58730.0,100000000.0,18574950.0,5764229.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Publishing

In [68]:
spotlight = pd.get_dummies(publishing.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(publishing.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(publishing.state, drop_first=True)
sub_category = pd.get_dummies(publishing.sub_category, drop_first=True)
category = pd.get_dummies(publishing.category, drop_first=True)

# concat dummy variables into dataframe
publishing = pd.concat([publishing, spotlight, staff_pick, state, sub_category, category], axis=1)

publishing.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,...,Letterpress,Literary Journals,Literary Spaces,Nonfiction,Periodicals,Poetry,Radio Podcasts,Translations,Young Adult,Zines
13,18.0,Canada,500.0,656.29,True,False,successful,493.559488,Comedy,6,...,0,0,0,0,0,0,0,0,0,0
30,0.0,the United Kingdom,2200.0,0.0,False,False,failed,0.0,Anthologies,2,...,0,0,0,0,0,0,0,0,0,0
40,23.0,the United States,260.0,514.0,True,False,successful,514.0,Anthologies,2,...,0,0,0,0,0,0,0,0,0,0
58,5.0,Ireland,4000.0,155.0,False,True,failed,165.153645,Periodicals,10,...,0,0,0,0,1,0,0,0,0,0
68,14.0,Australia,4000.0,657.0,False,False,failed,519.408649,Anthologies,2,...,0,0,0,0,0,0,0,0,0,0


In [101]:
publishing.describe()

Unnamed: 0,backers_count,goal,pledged,usd_pledged,dummy_spot_True,dummy_pick_True,successful,Anthologies,Art Books,Calendars,...,Letterpress,Literary Journals,Literary Spaces,Nonfiction,Periodicals,Poetry,Radio Podcasts,Translations,Young Adult,Zines
count,18287.0,18287.0,18287.0,18287.0,18287.0,18287.0,18287.0,18287.0,18287.0,18287.0,...,18287.0,18287.0,18287.0,18287.0,18287.0,18287.0,18287.0,18287.0,18287.0,18287.0
mean,159.870181,43759.07,20381.08,14326.72,0.625526,0.134795,0.625526,0.035326,0.133811,0.024717,...,0.004265,0.018319,0.007273,0.132881,0.067425,0.083119,0.059441,0.009077,0.050036,0.031717
std,1167.830122,1169780.0,242687.7,147591.1,0.484,0.341514,0.484,0.184607,0.340458,0.155266,...,0.065172,0.134106,0.084973,0.339456,0.250763,0.27607,0.236455,0.094845,0.218024,0.175249
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.0,1400.0,168.0,161.0092,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,32.0,5000.0,1930.0,1870.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,100.0,12999.0,7820.615,7285.979,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,88887.0,100000000.0,18574950.0,12143440.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Art

In [69]:
spotlight = pd.get_dummies(art.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(art.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(art.state, drop_first=True)
sub_category = pd.get_dummies(art.sub_category, drop_first=True)
category = pd.get_dummies(art.category, drop_first=True)

# concat dummy variables into dataframe
art = pd.concat([art, spotlight, staff_pick, state, sub_category, category], axis=1)

art.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,...,Illustration,Installations,Mixed Media,Painting,Performance Art,Public Art,Sculpture,Social Practice,Textiles,Video Art
6,16.0,the United Kingdom,540.0,561.0,True,False,successful,932.333524,Public Art,9,...,0,0,0,0,0,1,0,0,0,0
81,34.0,Mexico,35000.0,35750.79,True,True,successful,1887.521589,Public Art,9,...,0,0,0,0,0,1,0,0,0,0
82,278.0,the United States,26000.0,30121.0,True,False,successful,30121.0,Public Art,9,...,0,0,0,0,0,1,0,0,0,0
83,20.0,the United States,500.0,565.0,True,False,successful,565.0,Public Art,9,...,0,0,0,0,0,1,0,0,0,0
92,15.0,the United States,3000.0,271.0,False,False,failed,271.0,Public Art,9,...,0,0,0,0,0,1,0,0,0,0


### Technology

In [70]:
spotlight = pd.get_dummies(technology.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(technology.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(technology.state, drop_first=True)
sub_category = pd.get_dummies(technology.sub_category, drop_first=True)
category = pd.get_dummies(technology.category, drop_first=True)

# concat dummy variables into dataframe
technology = pd.concat([technology, spotlight, staff_pick, state, sub_category, category], axis=1)

technology.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,...,Flight,Gadgets,Hardware,Makerspaces,Robots,Software,Sound,Space Exploration,Wearables,Web
9,63.0,the United States,45000.0,46000.01,True,False,successful,46000.01,Software,11,...,0,0,0,0,0,1,0,0,0,0
16,194.0,the United States,30000.0,34094.0,True,True,successful,34094.0,Apps,2,...,0,0,0,0,0,0,0,0,0,0
19,129.0,the United States,1500.0,3840.0,True,False,successful,3840.0,Apps,2,...,0,0,0,0,0,0,0,0,0,0
33,43.0,Sweden,200000.0,14000.0,False,False,failed,1696.06248,Apps,2,...,0,0,0,0,0,0,0,0,0,0
38,235.0,the United States,10000.0,20442.0,True,False,successful,20442.0,Apps,2,...,0,0,0,0,0,0,0,0,0,0


### Food

In [71]:
spotlight = pd.get_dummies(food.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(food.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(food.state, drop_first=True)
sub_category = pd.get_dummies(food.sub_category, drop_first=True)
category = pd.get_dummies(food.category, drop_first=True)

# concat dummy variables into dataframe
food = pd.concat([food, spotlight, staff_pick, state, sub_category, category], axis=1)

food.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,...,Cookbooks,Drinks,Events,Farmers Markets,Farms,Food Trucks,Restaurants,Small Batch,Spaces,Vegan
0,1.0,Australia,14000.0,27.0,False,False,failed,18.661436,Small Batch,10,...,0,0,0,0,0,0,0,1,0,0
3,8.0,the United States,28000.0,361.0,False,False,failed,361.0,Small Batch,10,...,0,0,0,0,0,0,0,1,0,0
7,93.0,the United States,5000.0,5951.0,True,False,successful,5951.0,Small Batch,10,...,0,0,0,0,0,0,0,1,0,0
8,40.0,the United States,2000.0,2117.0,True,False,successful,2117.0,Farms,7,...,0,0,0,0,1,0,0,0,0,0
11,148.0,the United States,13500.0,16000.0,True,False,successful,16000.0,Restaurants,9,...,0,0,0,0,0,0,1,0,0,0


### Fashion

In [72]:
spotlight = pd.get_dummies(fashion.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(fashion.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(fashion.state, drop_first=True)
sub_category = pd.get_dummies(fashion.sub_category, drop_first=True)
category = pd.get_dummies(fashion.category, drop_first=True)

# concat dummy variables into dataframe
fashion = pd.concat([fashion, spotlight, staff_pick, state, sub_category, category], axis=1)

fashion.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,...,dummy_spot_True,dummy_pick_True,successful,Apparel,Childrenswear,Couture,Footwear,Jewelry,Pet Fashion,Readytowear
2,50.0,Canada,450.0,1294.29,True,False,successful,987.413673,Accessories,1,...,1,0,1,0,0,0,0,0,0,0
20,4.0,Switzerland,2000.0,42.0,False,False,failed,44.801577,Accessories,1,...,0,0,0,0,0,0,0,0,0,0
36,488.0,the United Kingdom,10000.0,15225.77,True,True,successful,24733.179442,Accessories,1,...,1,1,1,0,0,0,0,0,0,0
37,320.0,the United States,10.0,11081.0,True,False,successful,11081.0,Accessories,1,...,1,0,1,0,0,0,0,0,0,0
43,7.0,the United States,200.0,66.0,False,False,failed,66.0,Jewelry,6,...,0,0,0,0,0,0,0,1,0,0


### Games

In [73]:
spotlight = pd.get_dummies(games.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(games.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(games.state, drop_first=True)
sub_category = pd.get_dummies(games.sub_category, drop_first=True)
category = pd.get_dummies(games.category, drop_first=True)

# concat dummy variables into dataframe
games = pd.concat([games, spotlight, staff_pick, state, sub_category, category], axis=1)

games.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,category,dummy_spot_True,dummy_pick_True,successful,Live Games,Mobile Games,Playing Cards,Puzzles,Tabletop Games,Video Games
1,637.0,the United States,6000.0,16233.0,True,False,successful,16233.0,Tabletop Games,6,Games,1,0,1,0,0,0,0,1,0
5,4731.0,Italy,40000.0,217144.39,True,True,successful,247905.496405,Tabletop Games,6,Games,1,1,1,0,0,0,0,1,0
21,17.0,the United Kingdom,250.0,288.0,True,False,successful,356.095855,Tabletop Games,6,Games,1,0,1,0,0,0,0,1,0
28,198.0,the United Kingdom,50000.0,50132.0,True,False,successful,67527.394422,Tabletop Games,6,Games,1,0,1,0,0,0,0,1,0
35,88.0,the United States,2800.0,3554.0,True,False,successful,3554.0,Tabletop Games,6,Games,1,0,1,0,0,0,0,1,0


### Comics

In [74]:
spotlight = pd.get_dummies(comics.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(comics.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(comics.state, drop_first=True)
sub_category = pd.get_dummies(comics.sub_category, drop_first=True)
category = pd.get_dummies(comics.category, drop_first=True)

# concat dummy variables into dataframe
comics = pd.concat([comics, spotlight, staff_pick, state, sub_category, category], axis=1)

comics.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,category,dummy_spot_True,dummy_pick_True,successful,Comic Books,Events,Graphic Novels,Webcomics
362,16.0,Italy,1250.0,1276.0,True,False,successful,1409.463513,Graphic Novels,4,Comics,1,0,1,0,0,1,0
375,49.0,the United States,1701.0,6009.56,True,False,successful,6009.56,Graphic Novels,4,Comics,1,0,1,0,0,1,0
379,38.0,the United States,1100.0,1420.0,True,False,successful,1420.0,Webcomics,5,Comics,1,0,1,0,0,0,1
385,58.0,the United States,10000.0,10011.0,True,False,successful,10011.0,Webcomics,5,Comics,1,0,1,0,0,0,1
387,25.0,the United States,300.0,572.0,True,False,successful,572.0,Graphic Novels,4,Comics,1,0,1,0,0,1,0


### Design

In [75]:
spotlight = pd.get_dummies(design.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(design.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(design.state, drop_first=True)
sub_category = pd.get_dummies(design.sub_category, drop_first=True)
category = pd.get_dummies(design.category, drop_first=True)

# concat dummy variables into dataframe
design = pd.concat([design, spotlight, staff_pick, state, sub_category, category], axis=1)

design.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,category,dummy_spot_True,dummy_pick_True,successful,Civic Design,Graphic Design,Interactive Design,Product Design,Toys,Typography
4,6452.0,the United States,15000.0,1385803.0,True,False,successful,1385803.0,Product Design,5,Design,1,0,1,0,0,0,1,0,0
10,36.0,Canada,7000.0,7982.29,True,False,successful,6022.638,Product Design,5,Design,1,0,1,0,0,0,1,0,0
15,63.0,Switzerland,55000.0,55721.0,True,False,successful,54636.47,Product Design,5,Design,1,0,1,0,0,0,1,0,0
46,18.0,the United Kingdom,900.0,1015.0,True,False,successful,1291.955,Product Design,5,Design,1,0,1,0,0,0,1,0,0
50,145.0,the United States,5000.0,6332.0,True,False,successful,6332.0,Product Design,5,Design,1,0,1,0,0,0,1,0,0


### Photography

In [76]:
spotlight = pd.get_dummies(photography.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(photography.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(photography.state, drop_first=True)
sub_category = pd.get_dummies(photography.sub_category, drop_first=True)
category = pd.get_dummies(photography.category, drop_first=True)

# concat dummy variables into dataframe
photography = pd.concat([photography, spotlight, staff_pick, state, sub_category, category], axis=1)

photography.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,category,dummy_spot_True,dummy_pick_True,successful,Fine Art,Nature,People,Photobooks,Places
393,27.0,the United States,1500.0,1827.0,True,False,successful,1827.0,Places,6,Photography,1,0,1,0,0,0,0,1
472,56.0,the United States,4350.0,4778.0,True,False,successful,4778.0,Places,6,Photography,1,0,1,0,0,0,0,1
483,354.0,the United Kingdom,16000.0,19001.0,True,False,successful,30343.243939,Places,6,Photography,1,0,1,0,0,0,0,1
497,16.0,the United Kingdom,4000.0,4020.0,True,False,successful,5104.820236,Places,6,Photography,1,0,1,0,0,0,0,1
513,54.0,the United States,4000.0,4366.0,True,True,successful,4366.0,Places,6,Photography,1,1,1,0,0,0,0,1


### Theater

In [77]:
spotlight = pd.get_dummies(theater.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(theater.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(theater.state, drop_first=True)
sub_category = pd.get_dummies(theater.sub_category, drop_first=True)
category = pd.get_dummies(theater.category, drop_first=True)

# concat dummy variables into dataframe
theater = pd.concat([theater, spotlight, staff_pick, state, sub_category, category], axis=1)

theater.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,category,dummy_spot_True,dummy_pick_True,successful,Experimental,Festivals,Immersive,Musical,Plays,Spaces
24,0.0,the United States,12000.0,0.0,False,False,failed,0.0,Spaces,7,Theater,0,0,0,0,0,0,0,0,1
134,184.0,the United States,9000.0,13216.0,True,True,successful,13216.0,Spaces,7,Theater,1,1,1,0,0,0,0,0,1
240,2.0,Mexico,13000.0,38.8,False,False,failed,1.770517,Spaces,7,Theater,0,0,0,0,0,0,0,0,1
299,536.0,the United States,31500.0,33595.02,True,False,successful,33595.02,Spaces,7,Theater,1,0,1,0,0,0,0,0,1
336,96.0,Switzerland,9700.0,76421.0,True,False,successful,76747.86943,Spaces,7,Theater,1,0,1,0,0,0,0,0,1


### Crafts

In [78]:
spotlight = pd.get_dummies(crafts.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(crafts.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(crafts.state, drop_first=True)
sub_category = pd.get_dummies(crafts.sub_category, drop_first=True)
category = pd.get_dummies(crafts.category, drop_first=True)

# concat dummy variables into dataframe
crafts = pd.concat([crafts, spotlight, staff_pick, state, sub_category, category], axis=1)

crafts.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,...,Embroidery,Glass,Knitting,Pottery,Printing,Quilts,Stationery,Taxidermy,Weaving,Woodworking
60,0.0,Italy,3000.0,0.0,False,False,failed,0.0,Glass,5,...,0,1,0,0,0,0,0,0,0,0
66,30.0,the United States,10000.0,4320.0,False,False,failed,4320.0,Pottery,8,...,0,0,0,1,0,0,0,0,0,0
148,4.0,the United Kingdom,500.0,138.0,False,False,failed,208.945281,Glass,5,...,0,1,0,0,0,0,0,0,0,0
224,8.0,the United Kingdom,4100.0,151.0,False,False,failed,229.306598,Pottery,8,...,0,0,0,1,0,0,0,0,0,0
309,1.0,the United States,7500.0,25.0,False,False,failed,25.0,Glass,5,...,0,1,0,0,0,0,0,0,0,0


### Journalism

In [79]:
spotlight = pd.get_dummies(journalism.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(journalism.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(journalism.state, drop_first=True)
sub_category = pd.get_dummies(journalism.sub_category, drop_first=True)
category = pd.get_dummies(journalism.category, drop_first=True)

# concat dummy variables into dataframe
journalism = pd.concat([journalism, spotlight, staff_pick, state, sub_category, category], axis=1)

journalism.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,category,dummy_spot_True,dummy_pick_True,successful,Photo,Print,Video,Web
14,6.0,the United States,6000.0,280.0,False,False,failed,280.0,Web,5,Journalism,0,0,0,0,0,0,1
22,271.0,the United States,1500.0,11779.0,True,False,successful,11779.0,Web,5,Journalism,1,0,1,0,0,0,1
115,117.0,the United States,2500.0,10889.0,True,False,successful,10889.0,Web,5,Journalism,1,0,1,0,0,0,1
208,192.0,the United States,27500.0,27568.0,True,False,successful,27568.0,Web,5,Journalism,1,0,1,0,0,0,1
303,86.0,the United Kingdom,300.0,442.0,True,False,successful,577.877492,Web,5,Journalism,1,0,1,0,0,0,1


### Dance

In [80]:
spotlight = pd.get_dummies(dance.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(dance.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(dance.state, drop_first=True)
sub_category = pd.get_dummies(dance.sub_category, drop_first=True)
category = pd.get_dummies(dance.category, drop_first=True)

# concat dummy variables into dataframe
dance = pd.concat([dance, spotlight, staff_pick, state, sub_category, category], axis=1)

dance.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,category,dummy_spot_True,dummy_pick_True,successful,Residencies,Spaces,Workshops
376,157.0,the United States,20000.0,21444.0,True,False,successful,21444.0,Spaces,3,Dance,1,0,1,0,1,0
390,1.0,Mexico,12000.0,18.8,False,False,failed,0.967988,Spaces,3,Dance,0,0,0,0,1,0
594,3.0,the United States,2000.0,20.0,False,False,failed,20.0,Spaces,3,Dance,0,0,0,0,1,0
630,3.0,the United States,3000.0,36.0,False,False,failed,36.0,Spaces,3,Dance,0,0,0,0,1,0
716,95.0,the United States,2000.0,2242.0,True,False,successful,2242.0,Spaces,3,Dance,1,0,1,0,1,0


### All Categories

In [81]:
spotlight = pd.get_dummies(df.spotlight, prefix='dummy_spot', drop_first=True)
staff_pick = pd.get_dummies(df.staff_pick, prefix='dummy_pick',drop_first=True)
state = pd.get_dummies(df.state, drop_first=True)
# sub_category = pd.get_dummies(df.sub_category, drop_first=True) ## left out to reduce features
category = pd.get_dummies(df.category, drop_first=True)

# concat dummy variables into dataframe
df = pd.concat([df, spotlight, staff_pick, state, category], axis=1)

df.head()

Unnamed: 0,backers_count,country,goal,pledged,spotlight,staff_pick,state,usd_pledged,sub_category,position,...,Fashion,Film Video,Food,Games,Journalism,Music,Photography,Publishing,Technology,Theater
0,1.0,Australia,14000.0,27.0,False,False,failed,18.66144,Small Batch,10,...,0,0,1,0,0,0,0,0,0,0
1,637.0,the United States,6000.0,16233.0,True,False,successful,16233.0,Tabletop Games,6,...,0,0,0,1,0,0,0,0,0,0
2,50.0,Canada,450.0,1294.29,True,False,successful,987.4137,Accessories,1,...,1,0,0,0,0,0,0,0,0,0
3,8.0,the United States,28000.0,361.0,False,False,failed,361.0,Small Batch,10,...,0,0,1,0,0,0,0,0,0,0
4,6452.0,the United States,15000.0,1385803.0,True,False,successful,1385803.0,Product Design,5,...,0,0,0,0,0,0,0,0,0,0


In [82]:
df.columns

Index(['backers_count', 'country', 'goal', 'pledged', 'spotlight',
       'staff_pick', 'state', 'usd_pledged', 'sub_category', 'position',
       'category', 'dummy_spot_True', 'dummy_pick_True', 'successful',
       'Comics', 'Crafts', 'Dance', 'Design', 'Fashion', 'Film  Video', 'Food',
       'Games', 'Journalism', 'Music', 'Photography', 'Publishing',
       'Technology', 'Theater'],
      dtype='object')

## Modeling - Music

In [83]:
X = music.drop([
    'successful',
    'country',
    'state',
    'sub_category',
    'category',
    'spotlight',
    'dummy_spot_True', # removed because of high coefficient, perfect predictor of success
    'pledged', # removed because of high coefficient
    'usd_pledged' # removed because of high coefficient
#     'backers_count', # removed because of high coefficient
#     'goal' # removed because of high coefficient
], axis=1)
y = music.successful

In [84]:
# Train-test-split
X_train, X_test, y_train, y_test = train_test_split (X, y, random_state = 42)

# Scale our data.
# Relabeling scaled data as "Z" is common
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)


logreg = LogisticRegression(solver='lbfgs')
logreg.fit(Z_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(Z_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(f'Baseline prediction:{music.successful.mean()}')
print(f'Training prediction:{logreg.score(Z_train, y_train)}')
print(f'Testing prediction:{logreg.score(Z_test, y_test)}')

[[2087  251]
 [ 495 3261]]
Baseline prediction:0.6211027239908107
Training prediction:0.881468110709988
Testing prediction:0.8775845093534624


In [85]:
# Assign the coefficients to a list coef
coef = logreg.coef_
for p,c in zip(X,list(coef[0])):
    print(p + '\t' + str(c))

backers_count	26.593462605459152
goal	-11.276638511383103
staff_pick	0.09075568241221073
position	0.01545414229372816
dummy_pick_True	0.09075568241221073
Chiptune	0.012216705724593449
Classical Music	0.0303404519845471
Comedy	-0.025814272834022205
Country  Folk	0.03206674434352974
Electronic Music	0.016684966502394108
Faith	-0.032497603246841165
HipHop	0.012734701357322062
Indie Rock	-0.011652841851569768
Jazz	0.006894476624623391
Kids	-0.0033027894374458144
Latin	-0.011756639392557748
Metal	0.04290551959901577
Pop	0.03295898894783355
Punk	0.0022580257096238157
RB	-0.003088553151866978
Rock	-0.05264341613864027
World Music	0.01636880590876651


In [86]:
# Assign the coefficients to a list coef
coef = logreg.coef_
odds = np.exp(coef)
for p,c in zip(X,list(odds[0])):
    print(p + '\t' + str(c))
    
    # less than 1, decreases the liklihood of success
    # over 1, increases the liklihood of success
    # close to 1, is a wash

backers_count	354318692171.0814
goal	1.2665376955619859e-05
staff_pick	1.0950014444633553
position	1.0155741750882357
dummy_pick_True	1.0950014444633553
Chiptune	1.0122916344906452
Classical Music	1.0308054139700968
Comedy	0.9745160669093355
Country  Folk	1.0325864223080774
Electronic Music	1.0168249379453307
Faith	0.9680247699449509
HipHop	1.0128161329685919
Indie Rock	0.9884147895557733
Jazz	1.0069182982429827
Kids	0.9967026587718417
Latin	0.9883121998555127
Metal	1.0438392678389377
Pop	1.0335081531153798
Punk	1.0022605769695854
RB	0.9969162115218407
Rock	0.9487182497905843
World Music	1.0165035087836294


## Modeling - Film & Video

In [87]:
X = film_video.drop([
    'successful',
    'country',
    'state',
    'sub_category',
    'category',
    'spotlight',
    'dummy_spot_True', # removed because of high coefficient, perfect predictor of success
    'pledged', # removed because of high coefficient
    'usd_pledged' # removed because of high coefficient
#     'backers_count', # removed because of high coefficient
#     'goal' # removed because of high coefficient
], axis=1)
y = film_video.successful

In [88]:
# Train-test-split
X_train, X_test, y_train, y_test = train_test_split (X, y, random_state = 42)

# Scale our data.
# Relabeling scaled data as "Z" is common
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)


logreg = LogisticRegression(solver='lbfgs')
logreg.fit(Z_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(Z_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(f'Baseline prediction:{music.successful.mean()}')
print(f'Training prediction:{logreg.score(Z_train, y_train)}')
print(f'Testing prediction:{logreg.score(Z_test, y_test)}')

[[2087  227]
 [ 627 3336]]
Baseline prediction:0.6211027239908107
Training prediction:0.8642978542596134
Testing prediction:0.86394774573841


In [89]:
# Assign the coefficients to a list coef
coef = logreg.coef_
for p,c in zip(X,list(coef[0])):
    print(p + '\t' + str(c))

backers_count	30.839118375804517
goal	-12.205336528592778
staff_pick	0.10634736050533854
position	0.0038022219710183275
dummy_pick_True	0.10634736050533854
Animation	-0.0025993004010065714
Comedy	-0.03691291072496805
Documentary	-0.0020728973541812197
Drama	0.005896879963080703
Experimental	0.012092398914032227
Family	0.013863794559809526
Fantasy	0.0037035638467448435
Festivals	0.020033894885597744
Horror	-0.012145358617062679
Movie Theaters	0.013582943674794377
Music Videos	-0.010936424041262558
Narrative Film	-0.01622651035101575
Romance	-0.008050356988229474
Science Fiction	-0.008741203211323744
Shorts	-0.0048140860704951206
Television	-0.019049201139398477
Thrillers	0.006894832067401433
Webseries	0.03486269717759216


In [90]:
# Assign the coefficients to a list coef
coef = logreg.coef_
odds = np.exp(coef)
for p,c in zip(X,list(odds[0])):
    print(p + '\t' + str(c))
    
    # less than 1, decreases the liklihood of success
    # over 1, increases the liklihood of success
    # close to 1, is a wash

backers_count	24731982889904.945
goal	5.003681939689805e-06
staff_pick	1.112208146599004
position	1.0038094595870772
dummy_pick_True	1.112208146599004
Animation	0.9974040748552124
Comedy	0.9637600648596737
Documentary	0.9979292496138016
Drama	1.0059143007857299
Experimental	1.0121658075669377
Family	1.0139603426178068
Fantasy	1.0037104305137579
Festivals	1.0202359202205051
Horror	0.9879280985622311
Movie Theaters	1.0136756109436553
Music Videos	0.989123161229726
Narrative Film	0.9839044302749493
Romance	0.9919819603553911
Science Fiction	0.9912968900310719
Shorts	0.9951974830694614
Television	0.9811310882878059
Thrillers	1.0069186561448835
Webseries	1.035477525052002


## Modeling - Games

In [91]:
X = games.drop([
    'successful',
    'country',
    'state',
    'sub_category',
    'category',
    'spotlight',
    'dummy_spot_True', # removed because of high coefficient, perfect predictor of success
    'pledged', # removed because of high coefficient
    'usd_pledged' # removed because of high coefficient
#     'backers_count', # removed because of high coefficient
#     'goal' # removed because of high coefficient
], axis=1)
y = games.successful

In [92]:
# Train-test-split
X_train, X_test, y_train, y_test = train_test_split (X, y, random_state = 42)

# Scale our data.
# Relabeling scaled data as "Z" is common
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)


logreg = LogisticRegression(solver='lbfgs')
logreg.fit(Z_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(Z_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(f'Baseline prediction:{music.successful.mean()}')
print(f'Training prediction:{logreg.score(Z_train, y_train)}')
print(f'Testing prediction:{logreg.score(Z_test, y_test)}')

[[1020  124]
 [ 278 1575]]
Baseline prediction:0.6211027239908107
Training prediction:0.8698409166759372
Testing prediction:0.8658658658658659


In [93]:
# Assign the coefficients to a list coef
coef = logreg.coef_
for p,c in zip(X,list(coef[0])):
    print(p + '\t' + str(c))

backers_count	24.09775477029855
goal	-2.92144490998336
staff_pick	0.11072190333343032
position	-0.028115258168231638
dummy_pick_True	0.11072190333343032
Live Games	0.033852232793935656
Mobile Games	-0.08641400052312956
Playing Cards	-0.03904879858071515
Puzzles	0.01834051040139442
Tabletop Games	0.01167361224259096
Video Games	0.0067149042290679485


In [94]:
# Assign the coefficients to a list coef
coef = logreg.coef_
odds = np.exp(coef)
for p,c in zip(X,list(odds[0])):
    print(p + '\t' + str(c))
    
    # less than 1, decreases the liklihood of success
    # over 1, increases the liklihood of success
    # close to 1, is a wash

backers_count	29209352040.36911
goal	0.053855814250458316
staff_pick	1.1170842062600224
position	0.9722762975578515
dummy_pick_True	1.1170842062600224
Live Games	1.0344317403459418
Mobile Games	0.9172144253537436
Playing Cards	0.9617037782218492
Puzzles	1.0185097305069106
Tabletop Games	1.0117420147630227
Video Games	1.0067374997457341


## Modeling - All Categories

In [95]:
X = df.drop([
    'successful',
    'country',
    'state',
    'sub_category',
    'category',
    'spotlight',
    'dummy_spot_True', # removed because of high coefficient, perfect predictor of success
    'pledged', # removed because of high coefficient
    'usd_pledged' # removed because of high coefficient
#     'backers_count', # removed because of high coefficient
#     'goal' # removed because of high coefficient
], axis=1)
y = df.successful

In [96]:
# Train-test-split
X_train, X_test, y_train, y_test = train_test_split (X, y, random_state = 42)

# Scale our data.
# Relabeling scaled data as "Z" is common
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)


logreg = LogisticRegression(solver='lbfgs')
logreg.fit(Z_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(Z_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(f'Baseline prediction:{music.successful.mean()}')
print(f'Training prediction:{logreg.score(Z_train, y_train)}')
print(f'Testing prediction:{logreg.score(Z_test, y_test)}')

[[15280  1939]
 [ 4106 24231]]
Baseline prediction:0.6211027239908107
Training prediction:0.8672822794257533
Testing prediction:0.8673061726227062


In [97]:
# Assign the coefficients to a list coef
coef = logreg.coef_
for p,c in zip(X,list(coef[0])):
    print(p + '\t' + str(c))

backers_count	45.77964955241944
goal	-10.981385475233017
staff_pick	0.03189546661516596
position	-0.0050619513445431825
dummy_pick_True	0.03189546661516596
Comics	-0.00034661423687312303
Crafts	-0.013157340446118278
Dance	0.00889112179181509
Design	-0.002773458044333541
Fashion	0.017145002431723635
Film  Video	0.0009644958038125136
Food	0.0066441644893248395
Games	-0.006140259070995544
Journalism	0.0054945152537975335
Music	-0.014061560020678805
Photography	-0.001756033387068303
Publishing	0.002988567433323946
Technology	-0.003301956348857502
Theater	-0.00237660972764201


In [98]:
# Assign the coefficients to a list coef
coef = logreg.coef_
odds = np.exp(coef)
for p,c in zip(X,list(odds[0])):
    print(p + '\t' + str(c))
    
    # less than 1, decreases the liklihood of success
    # over 1, increases the liklihood of success
    # close to 1, is a wash

backers_count	7.618144114696544e+19
goal	1.7015506625396568e-05
staff_pick	1.0324095783964693
position	0.9949508387411332
dummy_pick_True	1.0324095783964693
Comics	0.9996534458269016
Crafts	0.9869288389796218
Dance	1.0089307652195842
Design	0.997230384437287
Fashion	1.01729282156401
Film  Video	1.0009649610794638
Food	1.006666285915866
Games	0.9938785537947494
Journalism	1.0055096377870447
Music	0.9860368419460829
Photography	0.9982455075374563
Publishing	1.0029930376530507
Technology	0.9967034891137986
Theater	0.9976262121732951


## Conclusion
To conclude, everyone has the capacity to create something unique. What may be lacking is inspiration. We can use Kickstarter for that. There are a lot of successful projects on Kickstarter and we know that people are willing to bet their money on an idea. So, we use those ideas that we like and build on them. Then, with the right goals and the people to support us, we can achieve what we want to achieve.

### Recommendation
First, since category and sub-category didn’t really influence whether a project was successful or not, creators should choose topics that interests them. If they don’t love the topic, it will be much harder for them to create in that space.

Once you have the category is decided on, then they can focus on backers. If they choose the category food and the subcategory accessory (for example), they can see what the average number of backers is and make that their goal to achieve.

Finally, once they have a good idea of how many backers they should aim for, they also need to consider their financial goal. I’d recommend doing the same thing as my recommendation for the number of backers: try to keep the financial goal within the ball park of other successful projects in that category. However, there is a balance that needs to be considered: if the project requires ‘x’ number of dollars, and the goal number of backers is ‘y’, the larger the goal, the more impact it will have on your investment per backer. So just keep that in mind.