# Kickstarter project
## Our project for this semester is to predict whether a fundraising campaign on Kickstarter will succeed.

This type of prediction can be useful in several scenarios: for an entrepreneur trying to evaluate their chances, for Kickstarter itself, which may want to promote promising campaigns, or for an investor considering backing a company.

There are a few datasets available on Kaggle, such as [this one](https://www.kaggle.com/codename007/funding-successful-projects) and [this one](https://www.kaggle.com/kemical/kickstarter-projects). These datasets cover a more limited time span and are less rich in data. The dataset that we used in our project is offered [here](https://webrobots.io/kickstarter-datasets/). It is very large and somewhat messy, so our first steps will be devoted to getting to know this dataset and cleaning it up so that we can use it easily.

The data is spread across some 57 very large CSV files. Our first step is to unify them all into a single dataframe and explore the columns:

In [1]:
import pandas as pd
import dataCleaning as dc
import json
import matplotlib.pyplot as plt

In [2]:
df = dc.make_dataframe()  # Files are assumed to be in the rawData subdirectory; a pickle cache is kept in the cwd.
# Print the first few rows
df.head()

read dataframe from cache rick.pickle


Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,1,Family Cafe and Brewery that serves our incred...,"{""id"":312,""name"":""Restaurants"",""slug"":""food/re...",10,US,the United States,1442537392,"{""id"":404037385,""name"":""tina vo"",""is_registere...",USD,$,...,your-cafe,https://www.kickstarter.com/discover/categorie...,False,False,failed,1448085184,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",10.0,international
1,9,Patented tilting table technology makes this g...,"{""id"":271,""name"":""Live Games"",""slug"":""games/li...",1179,US,the United States,1450797872,"{""id"":480973030,""name"":""Lightwerks, LLC"",""slug...",USD,$,...,tabletop-football-best-new-tailgating-game-for...,https://www.kickstarter.com/discover/categorie...,False,False,canceled,1474634950,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1179.0,international
2,11,"magnetic window cleaner,you can stand outside ...","{""id"":337,""name"":""Gadgets"",""slug"":""technology/...",311,US,the United States,1485037783,"{""id"":612501588,""name"":""mark woods"",""is_regist...",USD,$,...,easy-clean-car-window-cleaner,https://www.kickstarter.com/discover/categorie...,False,False,failed,1488659761,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",311.0,domestic
3,12,A spiral life planner created BY a chronic pai...,"{""id"":325,""name"":""Calendars"",""slug"":""publishin...",488,US,the United States,1535305714,"{""id"":520931083,""name"":""McKenna"",""slug"":""warri...",USD,$,...,warrior-life-planner,https://www.kickstarter.com/discover/categorie...,False,False,failed,1540962000,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",488.0,international
4,39,"A full-cast, feature-length audio drama produc...","{""id"":239,""name"":""Radio & Podcasts"",""slug"":""pu...",3809,GB,the United Kingdom,1547286149,"{""id"":129272378,""name"":""Graham Richards"",""is_r...",GBP,£,...,hawk-the-slayer-part-one-of-an-audio-trilogy,https://www.kickstarter.com/discover/categorie...,False,False,live,1572001457,1.291421,"{""web"":{""project"":""https://www.kickstarter.com...",3829.062376,domestic
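
The `dc.make_dataframe()` helper lives in our `dataCleaning` module and is not listed in this notebook. As a rough illustration of the unify-and-cache step, here is a minimal sketch of what such a helper could look like. The function name `make_dataframe_sketch`, its parameters and the default cache file name are assumptions made for this example, not the actual implementation.

```python
from pathlib import Path

import pandas as pd


def make_dataframe_sketch(raw_dir="rawData", cache_path="cache.pickle"):
    """Hypothetical stand-in for dc.make_dataframe().

    Concatenates every CSV file found in raw_dir into one dataframe and
    caches the result as a pickle so that later runs can skip the parsing.
    """
    cache = Path(cache_path)
    if cache.exists():
        # A cached copy exists: load it instead of re-reading the CSV files.
        print(f"read dataframe from cache {cache}")
        return pd.read_pickle(cache)

    csv_files = sorted(Path(raw_dir).glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"no CSV files found in {raw_dir}")

    # Read each file separately, then stack them with a fresh 0..n-1 index.
    frames = [pd.read_csv(path) for path in csv_files]
    df = pd.concat(frames, ignore_index=True)
    df.to_pickle(cache)
    return df
```

Caching matters here because parsing dozens of large CSV files on every notebook restart is slow, while loading a single pickle is comparatively quick.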


Great! Let's get a few details about this data: what are the features, and how many records are there?

In [3]:
cols = list(df.columns.values)
print(cols)
num_recs = len(df.index)
print()
print(f'There are originally {num_recs:,} records in the data')

['backers_count', 'blurb', 'category', 'converted_pledged_amount', 'country', 'country_displayable_name', 'created_at', 'creator', 'currency', 'currency_symbol', 'currency_trailing_code', 'current_currency', 'deadline', 'disable_communication', 'friends', 'fx_rate', 'goal', 'id', 'is_backing', 'is_starrable', 'is_starred', 'launched_at', 'location', 'name', 'permissions', 'photo', 'pledged', 'profile', 'slug', 'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at', 'static_usd_rate', 'urls', 'usd_pledged', 'usd_type']

There are originally 211,424 records in the data


We can already see redundant attributes that we are sure we will not need (they are used for display purposes). Let's start by dropping these.


In [4]:
redundant = ['country_displayable_name', 'currency_symbol', 'currency_trailing_code', 'current_currency',
             'state_changed_at', 'source_url','disable_communication', 'profile']
df = df.drop(columns=redundant)
print('sanity check, print new columns:')
cols = list(df.columns.values)
print(cols)


sanity check, print new columns:
['backers_count', 'blurb', 'category', 'converted_pledged_amount', 'country', 'created_at', 'creator', 'currency', 'deadline', 'disable_communication', 'friends', 'fx_rate', 'goal', 'id', 'is_backing', 'is_starrable', 'is_starred', 'launched_at', 'location', 'name', 'permissions', 'photo', 'pledged', 'profile', 'slug', 'source_url', 'spotlight', 'staff_pick', 'state', 'static_usd_rate', 'urls', 'usd_pledged', 'usd_type']
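
Printing the column list works as a sanity check, but it is easy to overlook a column that is still there. A stricter variant is to assert that none of the dropped names remain. The sketch below uses a small made-up frame and a hypothetical helper, `drop_and_verify`, purely for illustration.

```python
import pandas as pd


def drop_and_verify(df, columns):
    """Drop the given columns and fail loudly if any of them survive."""
    # DataFrame.drop raises a KeyError by default if a listed column is missing.
    result = df.drop(columns=columns)
    leftover = sorted(set(columns) & set(result.columns))
    assert not leftover, f"columns still present: {leftover}"
    return result


# Toy data standing in for the real dataframe (values are invented).
toy = pd.DataFrame({
    "goal": [1000.0],
    "currency": ["USD"],
    "currency_symbol": ["$"],
    "source_url": ["https://example.com"],
})
toy = drop_and_verify(toy, ["currency_symbol", "source_url"])
print(list(toy.columns))
```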


Taking a first peek at the data via Excel hints that there are still many empty columns:
![peek](img/firstPeek.png)

Let's see which columns contain mostly null values:


In [5]:
nes = df.isna().sum()
print(nes)

backers_count                    0
blurb                            8
category                         0
converted_pledged_amount         0
country                          0
created_at                       0
creator                          0
currency                         0
deadline                         0
disable_communication            0
friends                     210980
fx_rate                          0
goal                             0
id                               0
is_backing                  210980
is_starrable                     0
is_starred                  210980
launched_at                      0
location                       217
name                             0
permissions                 210980
photo                            0
pledged                          0
profile                          0
slug                             0
source_url                       0
spotlight                        0
staff_pick                       0
state               

It looks like we can drop 'friends', 'is_backing', 'is_starred' and 'permissions' as well: each of them is null in 210,980 of the 211,424 records.

In [6]:
empty = ['friends','is_backing','is_starred','permissions']
df = df.drop(columns=empty)
cols = list(df.columns.values)
print(cols)

['backers_count', 'blurb', 'category', 'converted_pledged_amount', 'country', 'created_at', 'creator', 'currency', 'deadline', 'disable_communication', 'fx_rate', 'goal', 'id', 'is_starrable', 'launched_at', 'location', 'name', 'photo', 'pledged', 'profile', 'slug', 'source_url', 'spotlight', 'staff_pick', 'state', 'static_usd_rate', 'urls', 'usd_pledged', 'usd_type']
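
Instead of reading the null counts by eye, the mostly-empty columns can also be picked out by the share of missing values. This is a minimal sketch on a made-up frame. The helper name `mostly_null_columns` and the 0.9 threshold are assumptions chosen for the example.

```python
import pandas as pd


def mostly_null_columns(df, threshold=0.9):
    """Return the names of columns whose share of null values exceeds threshold."""
    null_share = df.isna().mean()  # fraction of nulls per column, between 0 and 1
    return null_share[null_share > threshold].index.tolist()


# Toy data standing in for the real dataframe (values are invented).
toy = pd.DataFrame({
    "backers_count": [1, 9, 11, 12],
    "blurb": ["a", None, "c", "d"],
    "friends": [None, None, None, None],
    "permissions": [None, None, None, None],
})
empty = mostly_null_columns(toy)
print(empty)
toy = toy.drop(columns=empty)
```

With a threshold of 0.9, a column such as `blurb` that has only a handful of missing values is kept, while columns that are almost entirely null are selected for dropping.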


From looking at the data, we can also see that the time fields are given in Unix time. It will be useful later on if we can break each date into a day, month, year trio. We'll run the conversion and replace each column with the corresponding three fields.

In [7]:
timefields = ['created_at','deadline','launched_at']
dc.convert_time(df,timefields)
print('sanity check')
df.head()

sanity check


Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,deadline,disable_communication,...,profile,slug,source_url,spotlight,staff_pick,state,static_usd_rate,urls,usd_pledged,usd_type
0,1,Family Cafe and Brewery that serves our incred...,"{""id"":312,""name"":""Restaurants"",""slug"":""food/re...",10,US,2015-09-18 00:49:52,"{""id"":404037385,""name"":""tina vo"",""is_registere...",USD,2015-11-21 05:53:03,False,...,"{""id"":2128735,""project_id"":2128735,""state"":""in...",your-cafe,https://www.kickstarter.com/discover/categorie...,False,False,failed,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",10.0,international
1,9,Patented tilting table technology makes this g...,"{""id"":271,""name"":""Live Games"",""slug"":""games/li...",1179,US,2015-12-22 15:24:32,"{""id"":480973030,""name"":""Lightwerks, LLC"",""slug...",USD,2016-10-08 10:15:12,False,...,"{""id"":2288438,""project_id"":2288438,""state"":""in...",tabletop-football-best-new-tailgating-game-for...,https://www.kickstarter.com/discover/categorie...,False,False,canceled,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1179.0,international
2,11,"magnetic window cleaner,you can stand outside ...","{""id"":337,""name"":""Gadgets"",""slug"":""technology/...",311,US,2017-01-21 22:29:43,"{""id"":612501588,""name"":""mark woods"",""is_regist...",USD,2017-03-04 20:36:00,False,...,"{""id"":2846002,""project_id"":2846002,""state"":""in...",easy-clean-car-window-cleaner,https://www.kickstarter.com/discover/categorie...,False,False,failed,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",311.0,domestic
3,12,A spiral life planner created BY a chronic pai...,"{""id"":325,""name"":""Calendars"",""slug"":""publishin...",488,US,2018-08-26 17:48:34,"{""id"":520931083,""name"":""McKenna"",""slug"":""warri...",USD,2018-10-31 05:00:00,False,...,"{""id"":3456735,""project_id"":3456735,""state"":""in...",warrior-life-planner,https://www.kickstarter.com/discover/categorie...,False,False,failed,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",488.0,international
4,39,"A full-cast, feature-length audio drama produc...","{""id"":239,""name"":""Radio & Podcasts"",""slug"":""pu...",3809,GB,2019-01-12 09:42:29,"{""id"":129272378,""name"":""Graham Richards"",""is_r...",GBP,2019-12-04 12:04:16,False,...,"{""id"":3552123,""project_id"":3552123,""state"":""in...",hawk-the-slayer-part-one-of-an-audio-trilogy,https://www.kickstarter.com/discover/categorie...,False,False,live,1.291421,"{""web"":{""project"":""https://www.kickstarter.com...",3829.062376,domestic
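
The `dc.convert_time` helper is part of our `dataCleaning` module and is not listed here. As an illustration of the idea, the sketch below converts Unix timestamps (seconds since the epoch) to datetimes with `pd.to_datetime` and derives day, month and year columns from them. The function name `convert_time_sketch` and the `_day`, `_month`, `_year` column suffixes are assumptions made for this example.

```python
import pandas as pd


def convert_time_sketch(df, fields):
    """Hypothetical stand-in for dc.convert_time(); modifies df in place."""
    for field in fields:
        # Interpret the integers as seconds since 1970-01-01 (UTC).
        df[field] = pd.to_datetime(df[field], unit="s")
        df[f"{field}_day"] = df[field].dt.day
        df[f"{field}_month"] = df[field].dt.month
        df[f"{field}_year"] = df[field].dt.year


# The two created_at values of the first rows shown earlier.
toy = pd.DataFrame({"created_at": [1442537392, 1450797872]})
convert_time_sketch(toy, ["created_at"])
print(toy)
```

This version keeps the converted datetime column next to the three derived fields. Dropping the original column afterwards would leave only the day, month, year trio.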


Another inconvenience in this dataset is that some of the fields are given in JSON form, specifically the category a