# Capstone Project

## Goal

The goal of this project is to use predictive analytics on historical data to determine what makes a Kickstarter project more likely to succeed. The historical data tells us which projects were successful and which were not.

https://www.kickstarter.com/help/handbook/funding

Kickstarter provides a creator's handbook on funding (linked above). The original objective of this analysis was to determine what leads to successful board games, and then to design a board game based on those findings to see whether it would succeed. However, an important first phase was to see whether I could predict if a project would be successful at all, so that is what I did here.

## Import Libraries

In [1]:
import os
import glob
import re
import string

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, LogisticRegression, LassoCV, RidgeCV
from sklearn.metrics import confusion_matrix, r2_score
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer, StandardScaler

# os.chdir("./datasets/kickstarter_data/") # uncomment to run initially

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Gather Data

Data came from:
https://webrobots.io/kickstarter-datasets/

## Combine Data

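The cell below should be run only once. It combines every CSV file in the data directory into a single `combined.csv`; because that file is written to the same directory, rerunning the cell would read `combined.csv` back in and duplicate rows.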

In [2]:
## uncomment to run initially
## credit: https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/
# extension = 'csv'
# all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# #combine all files in the list
# combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
# #export to csv
# combined_csv.to_csv( "combined.csv", index=False, encoding='utf-8-sig')
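
The combining step can also be wrapped in a small function that is safe to rerun. This is a minimal sketch, assuming all CSV files share the same columns; `combine_csvs` and its parameters are hypothetical names, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def combine_csvs(directory, output_name="combined.csv"):
    """Concatenate every CSV in `directory` (except the output file) into one DataFrame."""
    directory = Path(directory)
    # skip a previously written output file so rerunning does not duplicate rows
    paths = sorted(p for p in directory.glob("*.csv") if p.name != output_name)
    combined = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    combined.to_csv(directory / output_name, index=False, encoding="utf-8-sig")
    return combined
```

Calling `combine_csvs("./datasets/kickstarter_data/")` would then write `combined.csv` next to the source files and return the combined frame.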

## Read in Data

In [3]:
df = pd.read_csv('./datasets/kickstarter_data/combined.csv')

## Exploratory Data Analysis (EDA)

In [4]:
pd.set_option('display.max_rows', 9999)
pd.set_option('display.max_columns', 9999)
pd.set_option('display.width', 9999)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217433 entries, 0 to 217432
Data columns (total 38 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   backers_count             217433 non-null  int64  
 1   blurb                     217425 non-null  object 
 2   category                  217433 non-null  object 
 3   converted_pledged_amount  217433 non-null  int64  
 4   country                   217433 non-null  object 
 5   country_displayable_name  217433 non-null  object 
 6   created_at                217433 non-null  int64  
 7   creator                   217433 non-null  object 
 8   currency                  217433 non-null  object 
 9   currency_symbol           217433 non-null  object 
 10  currency_trailing_code    217433 non-null  bool   
 11  current_currency          217433 non-null  object 
 12  deadline                  217433 non-null  int64  
 13  disable_communication     217433 non-null  b

In [6]:
df.describe()

Unnamed: 0,backers_count,converted_pledged_amount,created_at,deadline,fx_rate,goal,id,launched_at,pledged,state_changed_at,static_usd_rate,usd_pledged
count,217433.0,217433.0,217433.0,217433.0,217433.0,217433.0,217433.0,217433.0,217433.0,217433.0,217433.0,217433.0
mean,153.312377,13914.86,1475045000.0,1482085000.0,0.972468,50864.0,1073505000.0,1479240000.0,25285.57,1481932000.0,1.001734,13919.28
std,955.46558,111587.3,73251890.0,72977420.0,0.224465,1225217.0,619408500.0,72985110.0,914915.4,72873330.0,0.239715,111583.7
min,0.0,0.0,1240366000.0,1242468000.0,0.009327,0.01,18520.0,1240674000.0,0.0,1242468000.0,0.008771,0.0
25%,4.0,125.0,1422421000.0,1428688000.0,1.0,1500.0,536953800.0,1425783000.0,130.0,1428555000.0,1.0,125.0
50%,29.0,1632.0,1476545000.0,1483462000.0,1.0,5000.0,1073543000.0,1480562000.0,1677.0,1483387000.0,1.0,1633.22
75%,93.0,6820.0,1540860000.0,1549209000.0,1.0,15000.0,1610309000.0,1546381000.0,7340.0,1549132000.0,1.0,6833.0
max,105857.0,12969610.0,1589423000.0,1594600000.0,9.464383,100000000.0,2147476000.0,1589431000.0,235320500.0,1589432000.0,1.716408,12969610.0


### Missing Data

In [7]:
# count missing values per column, largest first
missing_values = df.isnull().sum()
missing_values.sort_values(ascending=False)

is_backing                  217361
permissions                 217361
friends                     217361
is_starred                  217361
location                       215
usd_type                       204
blurb                            8
staff_pick                       0
spotlight                        0
category                         0
converted_pledged_amount         0
country                          0
country_displayable_name         0
created_at                       0
creator                          0
currency                         0
currency_symbol                  0
currency_trailing_code           0
current_currency                 0
deadline                         0
disable_communication            0
urls                             0
fx_rate                          0
goal                             0
id                               0
usd_pledged                      0
is_starrable                     0
static_usd_rate                  0
launched_at         
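
The counts above are easier to judge as fractions: 217361 missing out of 217433 rows is about 99.97% of a column. The sketch below reports both on toy data; `missing_report` and the 0.5 threshold are assumptions for illustration, not choices made in the notebook.

```python
import pandas as pd


def missing_report(frame):
    """Return missing-value counts and fractions per column, largest first."""
    report = pd.DataFrame({
        "n_missing": frame.isnull().sum(),
        "frac_missing": frame.isnull().mean(),
    })
    return report.sort_values("n_missing", ascending=False)


# toy data standing in for the Kickstarter frame
toy = pd.DataFrame({
    "friends": [None, None, None, "x"],
    "blurb": ["a", None, "c", "d"],
    "goal": [100.0, 250.0, 5000.0, 10.0],
})
report = missing_report(toy)
# columns missing in more than half of the rows are candidates for dropping
mostly_missing = report.index[report["frac_missing"] > 0.5].tolist()
```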

### Resolve Missing Values

In [8]:
# drop these features due to having a significant number of missing values
df.drop([
    'friends',
    'is_backing',
    'is_starred',
    'permissions'
], axis=1, inplace=True)

In [9]:
# eliminate remaining missing values
df.dropna(inplace=True)

In [10]:
# verify missing values were resolved
missing_values = df.isnull().sum()
missing_values.sort_values(ascending=False)

usd_type                    0
currency                    0
fx_rate                     0
disable_communication       0
deadline                    0
current_currency            0
currency_trailing_code      0
currency_symbol             0
creator                     0
usd_pledged                 0
created_at                  0
country_displayable_name    0
country                     0
converted_pledged_amount    0
category                    0
blurb                       0
goal                        0
id                          0
is_starrable                0
launched_at                 0
location                    0
name                        0
photo                       0
pledged                     0
profile                     0
slug                        0
source_url                  0
spotlight                   0
staff_pick                  0
state                       0
state_changed_at            0
static_usd_rate             0
urls                        0
backers_co

### Re-Explore Data

In [11]:
df.head()

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,fx_rate,goal,id,is_starrable,launched_at,location,name,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,1,we are going Production herbal teabag of plan...,"{""id"":313,""name"":""Small Batch"",""slug"":""food/sm...",19,AU,Australia,1441269202,"{""id"":1555219532,""name"":""ehsan"",""is_registered...",AUD,$,True,USD,1444141184,False,0.643694,14000.0,18648848,False,1441549184,"{""id"":1098081,""name"":""Perth"",""slug"":""perth-wa-...",Production herbal teabag of plants native to Iran,"{""key"":""assets/012/241/749/145d362f576a69a5338...",27.0,"{""id"":2100811,""project_id"":2100811,""state"":""in...",production-herbal-teabag-of-plants-native-to-iran,https://www.kickstarter.com/discover/categorie...,False,False,failed,1444141184,0.691164,"{""web"":{""project"":""https://www.kickstarter.com...",18.66144,domestic
1,637,Two agents battle each other in another dimens...,"{""id"":34,""name"":""Tabletop Games"",""slug"":""games...",16233,US,the United States,1576048498,"{""id"":99575233,""name"":""David Gerrard"",""is_regi...",USD,$,True,USD,1583987400,False,1.0,6000.0,1576306701,False,1581353979,"{""id"":2490383,""name"":""Seattle"",""slug"":""seattle...",Slip Strike,"{""key"":""assets/027/753/183/1b44d6f57a405f04bb3...",16233.0,"{""id"":3869441,""project_id"":3869441,""state"":""in...",slip-strike-0,https://www.kickstarter.com/discover/categorie...,True,False,successful,1583987400,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",16233.0,domestic
2,50,A collection of Hard Enamel pins inspired by T...,"{""id"":262,""name"":""Accessories"",""slug"":""fashion...",983,CA,Canada,1560821709,"{""id"":1855173855,""name"":""Caitlin Peters"",""slug...",CAD,$,True,USD,1564165822,False,0.709285,450.0,1778685627,False,1562005822,"{""id"":4118,""name"":""Toronto"",""slug"":""toronto-on...",Tattoo Shop Flash,"{""key"":""assets/025/697/130/b8583345b2d665acfed...",1294.29,"{""id"":3755821,""project_id"":3755821,""state"":""in...",tattoo-shop-flash,https://www.kickstarter.com/discover/categorie...,True,False,successful,1564165825,0.7629,"{""web"":{""project"":""https://www.kickstarter.com...",987.4137,domestic
3,8,"Low carb, no sugar sauces and marinades using ...","{""id"":313,""name"":""Small Batch"",""slug"":""food/sm...",361,US,the United States,1563139848,"{""id"":1148188586,""name"":""Ian"",""slug"":""penningt...",USD,$,True,USD,1569530542,False,1.0,28000.0,962045189,False,1564346542,"{""id"":2521691,""name"":""Winchester"",""slug"":""winc...",Pennington's - Keto Sauces and Marinades,"{""key"":""assets/025/806/308/d30cf95898d7dfd33a9...",361.0,"{""id"":3772788,""project_id"":3772788,""state"":""in...",penningtons-keto-sauces-and-marinades,https://www.kickstarter.com/discover/categorie...,False,False,failed,1569530544,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",361.0,domestic
4,6452,The everyday bag fused with Parisian chic and ...,"{""id"":28,""name"":""Product Design"",""slug"":""desig...",1385803,US,the United States,1561364892,"{""id"":1085606247,""name"":""Laflore"",""slug"":""bobo...",USD,$,True,USD,1568408340,False,1.0,15000.0,630821552,False,1564502174,"{""id"":615702,""name"":""Paris"",""slug"":""paris-fr"",...",bobobark - Designed for Women. Made for Life.,"{""key"":""assets/026/466/907/06ce3a51dfc44baf851...",1385803.0,"{""id"":3759849,""project_id"":3759849,""state"":""ac...",bobobark-designed-for-women-made-for-life,https://www.kickstarter.com/discover/categorie...,True,False,successful,1568408340,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1385803.0,domestic


In [12]:
df.columns

Index(['backers_count', 'blurb', 'category', 'converted_pledged_amount', 'country', 'country_displayable_name', 'created_at', 'creator', 'currency', 'currency_symbol', 'currency_trailing_code', 'current_currency', 'deadline', 'disable_communication', 'fx_rate', 'goal', 'id', 'is_starrable', 'launched_at', 'location', 'name', 'photo', 'pledged', 'profile', 'slug', 'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at', 'static_usd_rate', 'urls', 'usd_pledged', 'usd_type'], dtype='object')

In [13]:
df.describe()

Unnamed: 0,backers_count,converted_pledged_amount,created_at,deadline,fx_rate,goal,id,launched_at,pledged,state_changed_at,static_usd_rate,usd_pledged
count,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0,217006.0
mean,153.397791,13918.43,1475155000.0,1482196000.0,0.971724,50935.42,1073520000.0,1479353000.0,25307.48,1482044000.0,1.001748,13929.65
std,956.261442,111670.7,72962000.0,72681080.0,0.213597,1226416.0,619465700.0,72684190.0,915798.9,72577960.0,0.239873,111681.6
min,0.0,0.0,1240366000.0,1242468000.0,0.009327,0.01,18520.0,1240920000.0,0.0,1242468000.0,0.008771,0.0
25%,4.0,125.0,1422486000.0,1428764000.0,1.0,1500.0,536864300.0,1425915000.0,130.0,1428638000.0,1.0,125.0
50%,29.0,1630.0,1476549000.0,1483467000.0,1.0,5000.0,1073560000.0,1480564000.0,1675.0,1483394000.0,1.0,1631.0
75%,93.0,6818.0,1540804000.0,1549072000.0,1.0,15000.0,1610402000.0,1546214000.0,7341.0,1549039000.0,1.0,6831.308
max,105857.0,12969610.0,1589423000.0,1594600000.0,1.226759,100000000.0,2147476000.0,1589431000.0,235320500.0,1589432000.0,1.716408,12969610.0


### Observations

At the time of this writing, no data dictionary can be found, so I have to make some assumptions about what some of these features mean based on research into the terms. Features that I cannot explain will likely be removed unless they provide substantial meaning.

After the mostly empty columns and the remaining missing values were removed, 34 columns remained. Descriptions are my reading of the values shown above; dashes mark features I have not yet been able to describe.

|#|Feature|Data Type|Description|
|--------|--------|--------|-------|
|1|Backers count|integer|number of backers supporting the project|
|2|Blurb|text|text that describes the project|
|3|Category|object|a string of text that includes the category ID, the category 'name', 'slug' which includes the broader category and the name, position number, parent id, parent name (the broader category), color number, and the url|
|4|Converted pledged amount|integer|--------------|
|5|Country|nominal|country code of the project (e.g. AU, US, CA)|
|6|Country displayable name|nominal|full country name (e.g. Australia)|
|7|Created at|timestamp|when the project was created, stored as an integer that appears to be a Unix timestamp in seconds|
|8|Creator|object|a string of text describing the creator, starting with the creator ID and 'name'|
|9|Currency|nominal|currency code of the project (e.g. AUD, USD, CAD)|
|10|Currency symbol|nominal|the symbol for the type of currency|
|11|Currency trailing code|boolean|----------|
|12|Current currency|nominal|USD in every row shown|
|13|Deadline|timestamp|funding deadline, stored in the same integer format as Created at|
|14|Disable communication|boolean|----------------------|
|15|FX rate|float|-----------|
|16|Goal|float|funding goal of the project|
|17|ID|integer|numeric identifier of the project|
|18|Is starrable|boolean|----------|
|19|Launched at|timestamp|when the project launched, stored in the same integer format as Created at|
|20|Location|object|a string of text that includes the location ID, name, slug, country, state, and urls|
|21|Name|text|name of the project|
|22|Photo|object|a string of text with the photo key and image urls at several sizes|
|23|Pledged|float|amount pledged, apparently in the project's own currency|
|24|Profile|object|a string of text with the profile ID, project ID, state, and display settings|
|25|Slug|text|lowercase, hyphenated version of the project name|
|26|Source url|text|Kickstarter discover url for the project's category|
|27|Spotlight|boolean|----------|
|28|Staff pick|boolean|----------|
|29|State|nominal|outcome of the project: successful, failed, canceled, or live|
|30|State changed at|timestamp|when the state last changed, stored in the same integer format as Created at|
|31|Static usd rate|float|exchange rate to USD; in the rows shown, pledged times static usd rate matches USD pledged|
|32|Urls|object|a string of text with web urls for the project|
|33|USD pledged|float|amount pledged in US dollars|
|34|USD type|nominal|----------|

### Break Up the Strings and Add Them As Columns

#### Break Up Category

In [14]:
df.category = df.category.str.replace(':', ',')

punctuation = "!\"#$%&'()*+-.:;<=>?@[\\]^_`{|}~"

def remove_punctuation(s):
    s_sans_punct = ""
    for letter in s:
        if letter not in punctuation:
            s_sans_punct += letter
    return s_sans_punct

# split each cleaned record string into a list of alternating keys and values
new_category = []
for line in df.category:
    line = remove_punctuation(line)
    new_category.append(line.split(','))

df.category = new_category

all_categories = {}
for j, line in enumerate(df.category):
    categories = {}
    for i, ele in enumerate(line[:-4]):
        if i % 2 == 0:
            categories[ele] = line[i+1]
    all_categories[j] = categories

category = pd.DataFrame(all_categories).T
category.head()

Unnamed: 0,id,name,slug,position,parentid,parentname,color,urls
0,313,Small Batch,food/small batch,10,10,Food,16725570,web
1,34,Tabletop Games,games/tabletop games,6,12,Games,51627,web
2,262,Accessories,fashion/accessories,1,9,Fashion,16752598,web
3,313,Small Batch,food/small batch,10,10,Food,16725570,web
4,28,Product Design,design/product design,5,7,Design,2577151,web
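
Because the category strings look like JSON objects, they could also be parsed with the standard `json` module instead of stripping punctuation by hand. This is a sketch under assumptions: the sample strings are abbreviated stand-ins written in the shape of the data, and the `parent_id` / `parent_name` key names are inferred from the `parentid` / `parentname` columns above.

```python
import json

import pandas as pd

# two records shaped like the category strings in df.head() (abbreviated, key names assumed)
raw = pd.Series([
    '{"id":313,"name":"Small Batch","slug":"food/small batch","position":10,'
    '"parent_id":10,"parent_name":"Food"}',
    '{"id":34,"name":"Tabletop Games","slug":"games/tabletop games","position":6,'
    '"parent_id":12,"parent_name":"Games"}',
])

# parse each string into a dict and keep the original index for a later merge
parsed = pd.DataFrame([json.loads(s) for s in raw], index=raw.index)
```

Parsing this way keeps values such as the ID as integers and leaves punctuation inside names untouched.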


#### Drop Category Columns That Won't Help

In [15]:
category.drop([
    'id',
    'slug',
    'color',
    'urls'
], axis=1, inplace=True)

In [16]:
missing_values= category.isnull().sum()
missing_values/len(category)

name          0.00000
position      0.00000
parentid      0.03851
parentname    0.03851
dtype: float64

In [17]:
# eliminate remaining missing values
category.dropna(inplace=True)

In [18]:
missing_values= category.isnull().sum()
missing_values/len(category)

name          0.0
position      0.0
parentid      0.0
parentname    0.0
dtype: float64

#### Merge Category Dataframe into Original Dataframe

In [19]:
df = df.merge(category, how='outer', left_index=True, right_index=True)
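
Two side effects of this merge are worth seeing on a small example: an outer join brings back rows whose category fields were dropped (as missing values), and the shared `name` column is split into `name_x` and `name_y`. The toy frames below are made up for illustration; the `inner` variant with custom suffixes is one possible alternative, not what the notebook does.

```python
import pandas as pd

projects = pd.DataFrame({"name": ["Tea", "Slip Strike", "Pins"]})
# category rows survive only for indexes 0 and 2, as after category.dropna()
categories = pd.DataFrame(
    {"name": ["Small Batch", "Accessories"], "parentname": ["Food", "Fashion"]},
    index=[0, 2],
)

# outer join: keeps all three projects, with NaN where the category row is gone
outer = projects.merge(categories, how="outer", left_index=True, right_index=True)

# inner join: keeps only matched rows; suffixes give the clashing column a clearer name
inner = projects.merge(
    categories, how="inner", left_index=True, right_index=True,
    suffixes=("", "_category"),
)
```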

### One Hot Encode

In [20]:
# what does this tell me?
df.state.value_counts()

successful    126821
failed         76210
canceled        9015
live            4960
Name: state, dtype: int64

In [21]:
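# encode the target as 1 for successful projects and 0 for every other state;
# get_dummies with drop_first=True would return three columns for the four states,
# which cannot be assigned to a single column
df.state = (df.state == 'successful').astype(int)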
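
The four outcome labels have to be reduced to one binary column before a classifier can use them. The sketch below, on toy data, shows why `get_dummies` alone does not do that and one possible way to build the target. Leaving out live and canceled projects is an assumption made here for illustration, not a decision taken in the notebook.

```python
import pandas as pd

state = pd.Series(["successful", "failed", "canceled", "live", "successful"])

# four categories minus the dropped first one leaves three indicator columns
dummies = pd.get_dummies(state, drop_first=True)

# one possible binary target: 1 for successful, 0 for failed,
# leaving out projects that are still live or were canceled
finished = state[state.isin(["successful", "failed"])]
target = (finished == "successful").astype(int)
```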

In [22]:
# what does this tell me?
df.staff_pick.value_counts()

False    188376
True      28630
Name: staff_pick, dtype: int64

In [23]:
df.staff_pick = pd.get_dummies(df.staff_pick, columns=['dummy'], drop_first=True)

In [24]:
# what does this tell me?
df.spotlight.value_counts()

True     126821
False     90185
Name: spotlight, dtype: int64

In [25]:
df.spotlight = pd.get_dummies(df.spotlight, columns=['dummy'], drop_first=True)

In [26]:
# what does this tell me?
# severe imbalance
df.is_starrable.value_counts()

False    212099
True       4907
Name: is_starrable, dtype: int64

In [27]:
df.is_starrable = pd.get_dummies(df.is_starrable, columns=['dummy'], drop_first=True)

### Figure out what to do with these

In [28]:
# what does this tell me?
df.usd_pledged.value_counts()

0.000000        16530
1.000000         4702
2.000000         1163
10.000000        1122
25.000000         984
                ...  
11965.722057        1
62.008843           1
11164.676665        1
136.825139          1
10726.000000        1
Name: usd_pledged, Length: 86015, dtype: int64

In [29]:
# what does this tell me?
df.static_usd_rate.value_counts()

1.000000    149511
1.086105        54
1.109449        54
1.228667        51
1.215900        51
             ...  
0.049003         1
0.748048         1
1.032681         1
0.793573         1
1.313698         1
Name: static_usd_rate, Length: 13527, dtype: int64
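
One way to work out what `static_usd_rate` means is to check whether it links `pledged` to `usd_pledged`. The sketch below uses three rows copied from the `df.head()` output; the tolerance is an assumption chosen to absorb the rounding in the displayed values.

```python
import numpy as np
import pandas as pd

# three rows copied from the df.head() output above
sample = pd.DataFrame({
    "pledged": [27.0, 16233.0, 1294.29],
    "static_usd_rate": [0.691164, 1.0, 0.7629],
    "usd_pledged": [18.66144, 16233.0, 987.4137],
})
recomputed = sample["pledged"] * sample["static_usd_rate"]
# the displayed rates are rounded, so compare with a loose relative tolerance
matches = np.isclose(recomputed, sample["usd_pledged"], rtol=1e-3)
```

For these rows the product agrees with `usd_pledged`, which suggests the rate converts the project's own currency to US dollars; the same check could be run on the full frame.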

In [30]:
# what does this tell me?
df.state_changed_at.value_counts()

1.572581e+09    31
1.583039e+09    30
1.559362e+09    28
1.572592e+09    23
1.561954e+09    21
                ..
1.441225e+09     1
1.441223e+09     1
1.441219e+09     1
1.441217e+09     1
1.353211e+09     1
Name: state_changed_at, Length: 179202, dtype: int64

In [31]:
# what does this tell me?
# df.slug.value_counts()

In [32]:
# what does this tell me?
# a massive string that appears to be largely useless
df.profile.value_counts()

{"id":3992603,"project_id":3992603,"state":"inactive","state_changed_at":1589216270,"name":null,"blurb":null,"background_color":null,"text_color":null,"link_background_color":null,"link_text_color":null,"link_text":null,"link_url":null,"show_feature_image":false,"background_image_opacity":0.8,"should_show_feature_image_section":true,"feature_image_attributes":{"image_urls":{"default":"https://ksr-ugc.imgix.net/assets/029/043/271/0d91691252f9105a7b81930ca5177374_original.png?ixlib=rb-2.1.0&crop=faces&w=1552&h=873&fit=crop&v=1589216331&auto=format&frame=1&q=92&s=cb6ae59dcc183da6ae1a3b5b59ec5cce","baseball_card":"https://ksr-ugc.imgix.net/assets/029/043/271/0d91691252f9105a7b81930ca5177374_original.png?ixlib=rb-2.1.0&crop=faces&w=560&h=315&fit=crop&v=1589216331&auto=format&frame=1&q=92&s=e2b4031b6cac9bfcb063b680b129214a"}}}                                                                                                                                                                        

In [33]:
# what does this tell me?
df.pledged.value_counts()

0.00        16530
1.00         6818
2.00         1743
10.00        1722
25.00        1193
            ...  
15306.00        1
12285.49        1
47523.00        1
31545.00        1
5805.55         1
Name: pledged, Length: 47968, dtype: int64

In [34]:
# what does this tell me?
# massive string with what appears to be largely useless information
df.photo.value_counts()

{"key":null,"full":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=560&h=315&fit=crop&v=&auto=format&frame=1&q=92&s=ef9622ff4223deef49fa8ad823aea9e2","ed":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=352&h=198&fit=crop&v=&auto=format&frame=1&q=92&s=54a9c4d0b0b9a4dd8b9bc750f5cbab0a","med":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=272&h=153&fit=crop&v=&auto=format&frame=1&q=92&s=9190ef46fcf7ec4c0715bae1a204c47d","little":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=208&h=117&fit=crop&v=&auto=format&frame=1&q=92&s=cc0886f218b6ba9280e60cfccf1c839c","small":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=160&h=90&fit=crop&v=&auto=format&frame=1&q=92&s=23bb8e82cb40d860a59b192531038aed","thumb":"https://ksr-ugc.imgix.net/missing_project_photo.png?ixlib=rb-2.1.0&crop=faces&w=48&h=27&fit=crop&v=&auto=format&frame=1&q=92&

In [35]:
# # what does this tell me?
# df.name.value_counts()

In [36]:
# what does this tell me?
df.location.value_counts()

{"id":2442047,"name":"Los Angeles","slug":"los-angeles-ca","short_name":"Los Angeles, CA","displayable_name":"Los Angeles, CA","localized_name":"Los Angeles","country":"US","state":"CA","type":"Town","is_root":false,"expanded_country":"United States","urls":{"web":{"discover":"https://www.kickstarter.com/discover/places/los-angeles-ca","location":"https://www.kickstarter.com/locations/los-angeles-ca"},"api":{"nearby_projects":"https://api.kickstarter.com/v1/discover?signature=1589491226.79c52b464f25291240c04aef284035d65d945da0&woe_id=2442047"}}}                          9721
{"id":44418,"name":"London","slug":"london-gb","short_name":"London, UK","displayable_name":"London, UK","localized_name":"London","country":"GB","state":"England","type":"Town","is_root":false,"expanded_country":"United Kingdom","urls":{"web":{"discover":"https://www.kickstarter.com/discover/places/london-gb","location":"https://www.kickstarter.com/locations/london-gb"},"api":{"nearby_projects":"https://api.kickst

In [37]:
# what does this tell me?
df.launched_at.value_counts()

1.582643e+09    4
1.580482e+09    4
1.580231e+09    4
1.542035e+09    4
1.575378e+09    4
               ..
1.437352e+09    1
1.437349e+09    1
1.487755e+09    1
1.431369e+09    1
1.445629e+09    1
Name: launched_at, Length: 189453, dtype: int64

In [38]:
# what does this tell me?
df.fx_rate.value_counts()

1.000000    149510
1.221140     18616
1.080912     12361
0.709285      7743
1.226759      6407
0.643694      3952
1.085077      3813
0.711371      2489
0.041296      2167
0.101724      1246
0.647046      1238
0.129025      1052
0.041245       887
0.144964       752
0.598910       732
0.703586       679
1.027844       592
0.129018       486
0.009354       440
0.098205       403
0.102376       350
0.145478       244
0.601356       232
0.705470       205
1.031539       160
0.009327       139
0.098548       111
Name: fx_rate, dtype: int64

In [39]:
# what does this tell me?
df.goal.value_counts()

5.000000e+03    15452
1.000000e+04    13659
1.000000e+03    10254
2.000000e+03     8858
3.000000e+03     8728
5.000000e+02     8542
1.500000e+04     7158
2.000000e+04     6637
2.500000e+03     6451
1.500000e+03     6125
2.500000e+04     5000
5.000000e+04     4807
4.000000e+03     4620
6.000000e+03     4112
3.000000e+04     3986
3.500000e+03     3783
8.000000e+03     3448
3.000000e+02     2957
7.000000e+03     2639
1.200000e+04     2630
7.500000e+03     2454
1.000000e+05     2404
6.000000e+02     2259
2.000000e+02     2186
1.000000e+02     2149
2.500000e+02     2058
1.200000e+03     1957
4.000000e+02     1870
8.000000e+02     1826
3.500000e+04     1717
4.000000e+04     1675
4.500000e+03     1643
5.500000e+03     1453
7.500000e+02     1298
6.500000e+03     1289
7.000000e+02     1171
6.000000e+04     1109
9.000000e+03     1098
3.500000e+02     1079
7.500000e+04      913
1.800000e+04      885
1.500000e+02      879
1.500000e+05      861
8.500000e+03      763
1.800000e+03      704
1.250000e+

In [40]:
# time difference between created and deadline?
df.deadline.value_counts()

1.572581e+09    32
1.583039e+09    31
1.559362e+09    28
1.572592e+09    23
1.459483e+09    22
                ..
1.442094e+09     1
1.442092e+09     1
1.442089e+09     1
1.442088e+09     1
1.428631e+09     1
Name: deadline, Length: 178289, dtype: int64
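
The question in the cell above can be answered by treating the integer columns as Unix timestamps in seconds. A minimal sketch, using `launched_at` and `deadline` values copied from the first two rows of `df.head()`; the `campaign_days` column name is hypothetical.

```python
import pandas as pd

# launched_at and deadline values copied from the first two rows of df.head()
toy = pd.DataFrame({
    "launched_at": [1441549184, 1581353979],
    "deadline": [1444141184, 1583987400],
})
launched = pd.to_datetime(toy["launched_at"], unit="s")
deadline = pd.to_datetime(toy["deadline"], unit="s")
# campaign length in days, from launch to funding deadline
toy["campaign_days"] = (deadline - launched).dt.total_seconds() / 86400
```

The same pattern works with `created_at` in place of `launched_at` to measure the time from project creation to deadline.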

### Remove these columns because they won't tell us anything useful

In [41]:
df.columns

Index(['backers_count', 'blurb', 'category', 'converted_pledged_amount', 'country', 'country_displayable_name', 'created_at', 'creator', 'currency', 'currency_symbol', 'currency_trailing_code', 'current_currency', 'deadline', 'disable_communication', 'fx_rate', 'goal', 'id', 'is_starrable', 'launched_at', 'location', 'name_x', 'photo', 'pledged', 'profile', 'slug', 'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at', 'static_usd_rate', 'urls', 'usd_pledged', 'usd_type', 'name_y', 'position', 'parentid', 'parentname'], dtype='object')

In [42]:
df.drop([
    'profile'
], axis=1, inplace=True)

In [43]:
# what does this tell me?
df.usd_type.value_counts()
df.drop([
    'usd_type'
], axis=1, inplace=True)

In [44]:
# what does this tell me?
df.urls.value_counts()
df.drop([
    'urls'
], axis=1, inplace=True)

In [45]:
# what does this tell me?
df.source_url.value_counts()
df.drop([
    'source_url'
], axis=1, inplace=True)

In [46]:
# drop id because it is an identifier for the project, not a predictive feature
df.id.value_counts()
df.drop([
    'id'
], axis=1, inplace=True)

In [47]:
# drop current_currency because all observations share the same result
df.current_currency.value_counts()
df.drop([
    'current_currency'
], axis=1, inplace=True)

In [48]:
# drop disable_communication because all observations share the same result
df.disable_communication.value_counts()
df.drop([
    'disable_communication'
], axis=1, inplace=True)
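
Columns where every observation shares the same value can also be found programmatically rather than one at a time. A minimal sketch on toy data; `constant_columns` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def constant_columns(frame):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]


toy = pd.DataFrame({
    "current_currency": ["USD", "USD", "USD"],
    "disable_communication": [False, False, False],
    "goal": [14000.0, 6000.0, 450.0],
})
to_drop = constant_columns(toy)
trimmed = toy.drop(columns=to_drop)
```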

## Logistic Regression

In [49]:
# Feature Selection

In [50]:
# X and y

In [51]:
# # Train-test-split
# X_train, X_test, y_train, y_test = train_test_split (X, y, random_state = 42)

# # Scale our data.
# # Relabeling scaled data as "Z" is common
# sc = StandardScaler()
# Z_train = sc.fit_transform(X_train)
# Z_test = sc.transform(X_test)


# logreg = LogisticRegression(C=1e9, solver='lbfgs')
# logreg.fit(Z_train, y_train)

# # Predict the labels of the test set: y_pred
# y_pred = logreg.predict(Z_test)

# # Compute and print the confusion matrix and the train and test accuracy
# print(confusion_matrix(y_test, y_pred))
# print(logreg.score(Z_train, y_train))
# print(logreg.score(Z_test, y_test))
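
Since `X`, `y`, and `features` have not been chosen yet, the commented cell cannot run as is. The sketch below runs the same steps end to end on synthetic data so the flow can be checked; the data, the two feature names, and the label rule are all made up for illustration and say nothing about the Kickstarter results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data: two made-up features and a binary label
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fit the scaler on the training split only, then reuse it on the test split
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

# a very large C makes the regularization negligible
logreg = LogisticRegression(C=1e9, solver="lbfgs")
logreg.fit(Z_train, y_train)

y_pred = logreg.predict(Z_test)
cm = confusion_matrix(y_test, y_pred)
train_acc = logreg.score(Z_train, y_train)
test_acc = logreg.score(Z_test, y_test)

# pair each (hypothetical) feature name with its coefficient
features = ["feature_a", "feature_b"]
coef_by_feature = dict(zip(features, logreg.coef_[0]))
```

Comparing `train_acc` with `test_acc` gives a quick check for overfitting, and `coef_by_feature` shows the direction of each feature's effect on the scaled inputs.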

In [52]:
# Assign the coefficients to a list coef
coef = logreg.coef_
for p,c in zip(features,list(coef[0])):
    print(p + '\t' + str(c))

NameError: name 'logreg' is not defined

In [None]:
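# predict_proba expects a 2-D array with one row per observation
# (this assumes the model was fit on exactly two features)
logreg.predict_proba([[25, 1]])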
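
`predict_proba` takes a single 2-D array of shape `(n_samples, n_features)`, so even one new observation has to be wrapped as a one-row array. A minimal sketch on a made-up two-feature model; the training values are arbitrary and chosen only so the example runs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy training data with two made-up features
X = np.array([[1.0, 0.0], [2.0, 0.0], [20.0, 1.0], [30.0, 1.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# one new observation must still be 2-D: shape (1, n_features)
new_row = np.array([[25, 1]])
proba = model.predict_proba(new_row)
```

Each row of `proba` holds one probability per class, in the order given by `model.classes_`. If the model was fit on scaled data, the new row would need to pass through the same fitted scaler first.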

In [None]:
new_data = 