In [1]:
from src.data_cleaning import *

%reload_ext autoreload
%autoreload 2
# import_functs()

With `import_ks_data()`, we pull all of the raw Kickstarter data from the `data` folder and combine it into a single DataFrame. To avoid repeating this process, the function automatically pickles the DataFrame as `raw_ks_data.p`.

**Important!** Before running this, download the data [here](https://drive.google.com/file/d/1R95VR0kpbABkCy8f_CO5pJPikeaudo2H/view?usp=sharing), save it in `data` as `Kickstarter_CSVs.zip`, and unzip the file.
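
As a rough sketch of what a helper like `import_ks_data()` might do (the real implementation lives in `src/data_cleaning.py` and may well differ), the snippet below reads every CSV in a folder, concatenates them, and pickles the result. The function name `combine_csvs`, the glob pattern, and the demo file names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_csvs(folder, pickle_path):
    """Read every CSV in `folder`, stack them, and pickle the result.

    Hypothetical stand-in for import_ks_data(); the real helper may differ.
    """
    csv_paths = sorted(Path(folder).glob("*.csv"))
    frames = [pd.read_csv(path) for path in csv_paths]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_pickle(pickle_path)
    return combined


# Demo on two tiny throwaway CSVs, so nothing in `data` is touched.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"id": [1, 2], "state": ["failed", "successful"]}).to_csv(
        Path(tmp) / "demo_part_0.csv", index=False
    )
    pd.DataFrame({"id": [3], "state": ["successful"]}).to_csv(
        Path(tmp) / "demo_part_1.csv", index=False
    )
    demo_df = combine_csvs(tmp, Path(tmp) / "raw_demo.p")
    reloaded = pd.read_pickle(Path(tmp) / "raw_demo.p")
```

Reloading the pickle with `pd.read_pickle` is what makes the later, faster path possible: the CSVs only need to be parsed once.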

In [2]:
## Uncomment to run
# df = import_ks_data()

If this is not your first time running the notebook, run `load_raw_ks_data()` instead.

In [3]:
df = load_raw_ks_data()

In [4]:
df.head(3)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,friends,fx_rate,goal,id,is_backing,is_starrable,is_starred,last_update_published_at,launched_at,location,name,permissions,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,unread_messages_count,unseen_activity_count,urls,usd_pledged,usd_type
0,1,This is a project I created to find out why 10...,"{""id"":360,""name"":""Video"",""slug"":""journalism/vi...",20,US,the United States,1494022111,"{""id"":220745515,""name"":""Stephanie Balfrey"",""sl...",USD,$,True,USD,1499212951,False,,1.0,1000.0,104592348,,False,,,1494028951,"{""id"":2503863,""name"":""Tampa"",""slug"":""tampa-fl""...",Breast Cancer Mission Impossible,,"{""key"":""assets/016/550/794/0f20795a1f7d1219d64...",20.0,"{""id"":2989151,""project_id"":2989151,""state"":""in...",breast-cancer-mission-impossible,https://www.kickstarter.com/discover/categorie...,False,False,failed,1499212951,1.0,,,"{""web"":{""project"":""https://www.kickstarter.com...",20.0,domestic
1,82,Seek & Behold is a full length album paired wi...,"{""id"":318,""name"":""Faith"",""slug"":""music/faith"",...",12580,US,the United States,1477503356,"{""id"":1889961770,""name"":""Debrianna Grace Cabit...",USD,$,True,USD,1482825631,False,,1.0,10000.0,1852641962,,False,,,1478505631,"{""id"":2385447,""name"":""Costa Mesa"",""slug"":""cost...",Debrianna Grace Cabitac: Seek & Behold,,"{""key"":""assets/014/355/790/607842fe5163267e666...",12580.0,"{""id"":2734869,""project_id"":2734869,""state"":""in...",debrianna-grace-cabitac-seek-and-behold,https://www.kickstarter.com/discover/categorie...,True,False,successful,1482825631,1.0,,,"{""web"":{""project"":""https://www.kickstarter.com...",12580.0,domestic
2,30,After a lifetime of talking myself out of shar...,"{""id"":318,""name"":""Faith"",""slug"":""music/faith"",...",2491,US,the United States,1426640212,"{""id"":1600855781,""name"":""Liz Roberson"",""is_reg...",USD,$,True,USD,1429238219,False,,1.0,2000.0,64426037,,False,,,1426646219,"{""id"":2430835,""name"":""Katy"",""slug"":""katy-tx"",""...",Liz Roberson Debut Album!,,"{""key"":""assets/012/072/616/3942c9eeecd39a1cd87...",2491.0,"{""id"":1776308,""project_id"":1776308,""state"":""ac...",liz-roberson-debut-album,https://www.kickstarter.com/discover/categorie...,True,False,successful,1429238223,1.0,,,"{""web"":{""project"":""https://www.kickstarter.com...",2491.0,domestic


Some immediately noticeable things include the `category` column, which contains dictionaries stored as JSON strings. With a closer look...

In [5]:
df.category.iloc[0]

'{"id":360,"name":"Video","slug":"journalism/video","position":4,"parent_id":13,"parent_name":"Journalism","color":1228010,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/journalism/video"}}}'

We can see that it contains important information, such as the category name and, if the project is within a subcategory, the parent category. However, most of the information isn't useful, so using `expand_category` we expand that column and extract just the `name` and `parent_name` fields, renaming them to `category_name` and `category_parent_name`.
Since we just want the main category, not the subcategory, all of the blanks in `category_parent_name` are replaced with the corresponding values from `category_name`.
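
The idea can be sketched on a toy DataFrame as follows. This is not the actual `expand_category` code, just one plausible way to do it with `json.loads` and `fillna`; the two rows are made up, shaped like the string shown above.

```python
import json

import pandas as pd

# Two made-up rows shaped like the `category` strings shown above:
# one subcategory (has parent_name) and one top-level category (no parent_name).
demo = pd.DataFrame(
    {
        "category": [
            '{"id":360,"name":"Video","parent_name":"Journalism"}',
            '{"id":14,"name":"Music"}',
        ]
    }
)

# Parse each JSON string into a dict, then pull out the two fields of interest.
parsed = demo["category"].apply(json.loads)
demo["category_name"] = parsed.apply(lambda d: d.get("name"))
demo["category_parent_name"] = parsed.apply(lambda d: d.get("parent_name"))

# Top-level categories have no parent, so fall back to their own name.
demo["category_parent_name"] = demo["category_parent_name"].fillna(
    demo["category_name"]
)

# Keep only the main category column.
demo = demo.drop(columns=["category", "category_name"])
```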

In [6]:
df = expand_category(df)
df.head(3)

Unnamed: 0,backers_count,blurb,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,friends,fx_rate,goal,id,is_backing,is_starrable,is_starred,last_update_published_at,launched_at,location,name,permissions,photo,pledged,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,unread_messages_count,unseen_activity_count,urls,usd_pledged,usd_type,category_parent_name
0,1,This is a project I created to find out why 10...,20,US,the United States,1494022111,"{""id"":220745515,""name"":""Stephanie Balfrey"",""sl...",USD,$,True,USD,1499212951,False,,1.0,1000.0,104592348,,False,,,1494028951,"{""id"":2503863,""name"":""Tampa"",""slug"":""tampa-fl""...",Breast Cancer Mission Impossible,,"{""key"":""assets/016/550/794/0f20795a1f7d1219d64...",20.0,"{""id"":2989151,""project_id"":2989151,""state"":""in...",breast-cancer-mission-impossible,https://www.kickstarter.com/discover/categorie...,False,False,failed,1499212951,1.0,,,"{""web"":{""project"":""https://www.kickstarter.com...",20.0,domestic,Journalism
1,82,Seek & Behold is a full length album paired wi...,12580,US,the United States,1477503356,"{""id"":1889961770,""name"":""Debrianna Grace Cabit...",USD,$,True,USD,1482825631,False,,1.0,10000.0,1852641962,,False,,,1478505631,"{""id"":2385447,""name"":""Costa Mesa"",""slug"":""cost...",Debrianna Grace Cabitac: Seek & Behold,,"{""key"":""assets/014/355/790/607842fe5163267e666...",12580.0,"{""id"":2734869,""project_id"":2734869,""state"":""in...",debrianna-grace-cabitac-seek-and-behold,https://www.kickstarter.com/discover/categorie...,True,False,successful,1482825631,1.0,,,"{""web"":{""project"":""https://www.kickstarter.com...",12580.0,domestic,Music
2,30,After a lifetime of talking myself out of shar...,2491,US,the United States,1426640212,"{""id"":1600855781,""name"":""Liz Roberson"",""is_reg...",USD,$,True,USD,1429238219,False,,1.0,2000.0,64426037,,False,,,1426646219,"{""id"":2430835,""name"":""Katy"",""slug"":""katy-tx"",""...",Liz Roberson Debut Album!,,"{""key"":""assets/012/072/616/3942c9eeecd39a1cd87...",2491.0,"{""id"":1776308,""project_id"":1776308,""state"":""ac...",liz-roberson-debut-album,https://www.kickstarter.com/discover/categorie...,True,False,successful,1429238223,1.0,,,"{""web"":{""project"":""https://www.kickstarter.com...",2491.0,domestic,Music


Next, we're going to begin correcting data types.

All dates are stored as Unix timestamps, so the following four columns will be converted to `datetime64[s]`:
- `created_at`, `deadline`, `launched_at`, `state_changed_at`

On top of that, we'll be correcting the datatypes for a few more columns! 
- `country`, `currency`, `currency_symbol`, and `category_parent_name` will be changed to the pandas datatype `category`

- `state` will be encoded as 1 for `successful` and 0 for `failed`

We'll also be dropping unwanted columns and duplicate rows!
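
A minimal sketch of these conversions on toy data, using standard pandas calls. It is an illustration of the steps described above, not the code inside `correct_dtypes` or `drop_rows`, and the three demo rows are made up apart from the timestamps.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "launched_at": [1494028951, 1478505631, 1478505631],
        "country": ["US", "US", "US"],
        "state": ["failed", "successful", "successful"],
    }
)

# Unix timestamps count seconds since 1970-01-01 UTC.
demo["launched_at"] = pd.to_datetime(demo["launched_at"], unit="s")

# Low-cardinality string columns are cheaper to store as `category`.
demo["country"] = demo["country"].astype("category")

# Encode the target: 1 for successful, 0 for failed.
demo["state"] = demo["state"].map({"successful": 1, "failed": 0})

# Remove exact duplicate rows (the last two rows are identical here).
demo = demo.drop_duplicates().reset_index(drop=True)
```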

In [7]:
df = correct_dtypes(df)
df = drop_rows(df)
df = drop_cols(df)

Next, we'll be extracting the start and end months for each project, as well as the project length!
On top of that, we will also be dropping the original date columns.
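
The following sketch shows one way such features could be derived with the pandas `.dt` accessor. It assumes that the start is `launched_at`, the end is `deadline`, and the length is measured in whole days; `extract_dates` itself may define these differently.

```python
import pandas as pd

# launched_at / deadline values taken from the first row of the raw data above.
demo = pd.DataFrame(
    {
        "launched_at": pd.to_datetime([1494028951], unit="s"),
        "deadline": pd.to_datetime([1499212951], unit="s"),
    }
)

demo["start_month"] = demo["launched_at"].dt.month
demo["end_month"] = demo["deadline"].dt.month

# Whole days between launch and deadline (days are an assumed unit).
demo["project_length"] = (demo["deadline"] - demo["launched_at"]).dt.days

# The original date columns are no longer needed.
demo = demo.drop(columns=["launched_at", "deadline"])
```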

In [8]:
df = extract_dates(df)

With that done, we'll split the data into two DataFrames and pickle them as `KS_data.pkl` and `KS_blurb_data.pkl`. The blurb text is removed from the main DataFrame and replaced with the length of each blurb (`blurb_len`).
   
   - `KS_data.pkl` will be used for visualizations
   - `KS_blurb_data.pkl` sets aside the blurb text for future NLP use
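
The split itself can be sketched like this. The sketch assumes `blurb_len` is a character count and that `id` is kept in both frames so they can be joined again later; neither detail is confirmed by the notebook, and the rows are made up.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "id": [101, 102],
        "blurb": ["A short pitch.", "A slightly longer project pitch."],
        "goal": [1000.0, 2500.0],
    }
)

# Character count of each blurb (an assumed definition of blurb_len).
demo["blurb_len"] = demo["blurb"].str.len()

# Text goes to one frame (for later NLP), everything else to the other.
blurb_demo = demo[["id", "blurb"]].copy()
main_demo = demo.drop(columns=["blurb"])
```

Each frame could then be written out with `DataFrame.to_pickle`.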

In [9]:
blurb_df, df = split_and_pickle_df(df)

In [10]:
df.head()

Unnamed: 0,country,country_displayable_name,currency,currency_symbol,goal,id,staff_pick,state,category_parent_name,start_month,end_month,project_length,blurb_len
122550,US,the United States,USD,$,3600.0,1134843983,False,0,Games,5,6,26,76
64587,US,the United States,USD,$,1500.0,1094489060,False,1,Games,5,6,22,70
91387,GB,the United Kingdom,GBP,£,25.0,2051827639,False,1,Games,6,6,5,133
76726,US,the United States,USD,$,37500.0,1188113847,True,1,Comics,5,6,28,132
161875,US,the United States,USD,$,500.0,299455523,False,1,Games,5,6,21,41


Next, we'll be taking the `currency_symbol` and `country_displayable_name` columns and storing them in dictionaries, using `currency` and `country` as keys. This is for visualization use, and it also lets us remove the redundant columns.

Note that `state` is already numeric at this point, as the output above shows: 1 for successful projects and 0 for failed.
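
A small sketch of building such lookup dictionaries, assuming a plain `dict(zip(...))` approach. The variable names are illustrative and are not the ones returned by `create_dicts`.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "country": ["US", "GB", "US"],
        "country_displayable_name": [
            "the United States",
            "the United Kingdom",
            "the United States",
        ],
        "currency": ["USD", "GBP", "USD"],
        "currency_symbol": ["$", "£", "$"],
    }
)

# One entry per code: country code -> display name, currency code -> symbol.
country_names = dict(zip(demo["country"], demo["country_displayable_name"]))
currency_symbols = dict(zip(demo["currency"], demo["currency_symbol"]))

# With the lookups in hand, the descriptive columns are redundant.
demo = demo.drop(columns=["country_displayable_name", "currency_symbol"])
```

A plot label can then be recovered with, for example, `country_names["GB"]`.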

In [11]:
dict_list, df = create_dicts(df)

After that, it's on to splitting the data into train and test sets and label encoding the string columns so that they can be fed into different machine learning algorithms. We're using scikit-learn's `LabelEncoder` and storing the fitted encoders in `LE_dict` in case we need to decode the values later.
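
As an illustration of the encoding step, here is a minimal sketch with one `LabelEncoder` per column kept in a dictionary. Fitting on the training split only is an assumption; `split_and_encode` may fit on the full data instead. The toy frames and the `encoders` name are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame(
    {
        "country": ["US", "GB", "US"],
        "category_parent_name": ["Games", "Comics", "Games"],
    }
)
test = pd.DataFrame({"country": ["GB"], "category_parent_name": ["Games"]})

encoders = {}
for col in ["country", "category_parent_name"]:
    le = LabelEncoder()
    # Fit on the training split, then reuse the same mapping for the test split.
    train[col] = le.fit_transform(train[col])
    test[col] = le.transform(test[col])
    encoders[col] = le

# The stored encoder turns integer codes back into the original labels.
decoded = encoders["country"].inverse_transform(train["country"])
```

One caveat of this design: `LabelEncoder.transform` raises an error for labels it did not see while fitting, so a category that appears only in the test split would need separate handling.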

In [12]:
enc_train_df, enc_test_df, LE_dict = split_and_encode(df)

In [13]:
enc_train_df.head()

id,country,currency,goal,staff_pick,state,category_parent_name,start_month,end_month,project_length,blurb_len
864202524,21,13,6200.0,True,1,2,6,7,30,100
1329733276,8,4,20000.0,False,1,13,9,9,22,46
1123973612,21,13,40000.0,True,1,8,4,5,30,132
1226762918,21,13,15000.0,False,0,0,2,3,29,134
132864990,1,0,25000.0,True,1,6,4,5,26,129


And that's it! Both the train and test data are pickled, along with the label encoders and the lookup dictionaries, so everything is ready to be reused elsewhere.

In [14]:
import pickle

enc_train_df.to_pickle('data/KS_train_data.pkl')
enc_test_df.to_pickle('data/KS_test_data.pkl')

# Use a context manager so the file handle is closed after writing
with open('data/encoders.pkl', 'wb') as f:
    pickle.dump([dict_list, LE_dict], f)
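
To reuse the saved objects elsewhere, the list can be unpickled and unpacked in the same order it was written. The sketch below demonstrates the round trip with small stand-in objects and a temporary file rather than the real `data/encoders.pkl`.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-ins for dict_list and LE_dict; the real objects come from the cells above.
dict_list_demo = [{"US": "the United States"}, {"USD": "$"}]
le_dict_demo = {"country": ["GB", "US"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "encoders.pkl"
    with open(path, "wb") as f:
        pickle.dump([dict_list_demo, le_dict_demo], f)

    # Unpack in the same order the list was written.
    with open(path, "rb") as f:
        loaded_dicts, loaded_le = pickle.load(f)
```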
