# Initial data cleaning

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
import json
import time

In [2]:
# Loading the dataset and dropping the first column, because it only contains row numbers
df = pd.read_csv('data/Kickstarter_merged.csv')
df = df.drop(columns='Unnamed: 0')

### Cleaning step 1a: dropping columns with mostly missing values

In [3]:
# Checking for missing values
#df.info()

In [4]:
# Computing percentage of non-null values
((df.shape[0]-df.friends.isnull().sum())/df.shape[0])*100

0.14338836260049134

Drop the columns **friends**, **is_backing**, **is_starred** and **permissions**, because these contain only 0.14% non-null values.
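The check above is computed for **friends** only. A minimal sketch of how the same percentage could be obtained for every column at once is shown below; the toy DataFrame and the 50% threshold are assumptions made for illustration and are not taken from the Kickstarter data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kickstarter data; the values are made up for illustration.
toy = pd.DataFrame({
    "friends": [np.nan, np.nan, np.nan, "[]"],
    "is_backing": [np.nan, np.nan, np.nan, False],
    "name": ["a", "b", "c", "d"],
})

# notna() yields booleans; their column-wise mean is the fraction of non-null values.
non_null_pct = toy.notna().mean() * 100

# Columns below a chosen threshold are candidates for dropping.
threshold = 50.0  # hypothetical cut-off, not taken from the notebook
sparse_cols = non_null_pct[non_null_pct < threshold].index.tolist()

print(non_null_pct)
print(sparse_cols)
```

On the real data, `df.notna().mean() * 100` would list all columns, which makes it easy to confirm that the four dropped columns share the same low fill rate.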

In [5]:
# Deleting columns friends, is_backing, is_starred and permissions
df = df.drop(columns=['friends','is_backing','is_starred','permissions'])

In [6]:
# Looking at the values in the first 10 rows of each column
#df.iloc[0:10, 0:11]
#df.iloc[0:10, 11:21]
#df.iloc[0:10, 21:33]

### Cleaning step 1b: dropping useless columns
Drop the columns **currency_symbol** (we already have the currency code), **profile** (contains a lot of missing data), and **photo**, **source_url** and **urls** (these only contain URLs).

In [7]:
# Deleting columns currency_symbol, photo, profile, source_url and urls
df = df.drop(columns=['currency_symbol','photo','profile','source_url','urls'])

### Cleaning step 2
Get relevant values from JSON dictionaries in columns **category**, **creator** and **location**.
#### Category

In [8]:
# Looking for keys of relevant values
df.category[0]

'{"id":266,"name":"Footwear","slug":"fashion/footwear","position":5,"parent_id":9,"color":16752598,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/fashion/footwear"}}}'

"name", "id" and "parent_id" are the only relevant keys in category.

In [9]:
# Getting the first value of name out of JSON formatted dictionary
ca = json.loads(df.category[0])
print(ca.get("name"))
print(ca.get("id"))
print(ca.get("parent_id"))

Footwear
266
9


In [10]:
# Extract the category name, id and parent id into new columns
df["category_name"] = ""
df["category_id"] = ""
df["category_parent_id"] = ""
for i in range(len(df.category)):
    try:
        dict_cat = json.loads(df.category[i])
        # df.at writes into the DataFrame itself and avoids chained assignment
        df.at[i, "category_name"] = dict_cat.get("name")
        df.at[i, "category_id"] = dict_cat.get("id")
        df.at[i, "category_parent_id"] = dict_cat.get("parent_id")
    except (TypeError, ValueError, AttributeError):
        # Missing or malformed JSON: mark all three values as missing
        df.at[i, "category_name"] = np.nan
        df.at[i, "category_id"] = np.nan
        df.at[i, "category_parent_id"] = np.nan

# Drop the original column 'category'.
df = df.drop(columns='category')


In [21]:
df.category_id.nunique()

169
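The row-by-row loop above can also be written as a reusable helper. The sketch below is one possible way to do this and is not the notebook's own method; the function name `extract_json_keys`, its parameters and the toy rows are hypothetical. It assumes each cell holds either a JSON object string or a missing value.

```python
import json

import numpy as np
import pandas as pd


def extract_json_keys(series, keys, prefix):
    """Parse each JSON string in `series` and return one column per key."""
    def parse(value):
        try:
            record = json.loads(value)
            return {key: record.get(key, np.nan) for key in keys}
        except (TypeError, ValueError, AttributeError):
            # Missing, malformed or non-object entries yield NaN for every key.
            return {key: np.nan for key in keys}

    parsed = pd.DataFrame([parse(v) for v in series], index=series.index)
    return parsed.add_prefix(prefix)


# Toy rows (made up): a full record, one without parent_id, and a missing value.
toy = pd.DataFrame({"category": [
    '{"id":266,"name":"Footwear","parent_id":9}',
    '{"id":14,"name":"Music"}',
    np.nan,
]})

cols = extract_json_keys(toy["category"], ["name", "id", "parent_id"], "category_")
toy = toy.drop(columns="category").join(cols)
print(toy)
```

The same helper could be applied to **creator** and **location** by changing the key list and prefix, which would avoid three nearly identical loops.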

#### Creator

In [11]:
# Looking for keys of relevant values
df.creator[0]

'{"id":2094277840,"name":"Lucy Conroy","slug":"babalus","is_registered":null,"chosen_currency":null,"avatar":{"thumb":"https://ksr-ugc.imgix.net/assets/023/784/556/6ed11b25c853ec1aef7f4360d0eb59ef_original.jpg?ixlib=rb-1.1.0&w=40&h=40&fit=crop&v=1548222691&auto=format&frame=1&q=92&s=b64463d8ae6195f7aeb62393e2ca2dde","small":"https://ksr-ugc.imgix.net/assets/023/784/556/6ed11b25c853ec1aef7f4360d0eb59ef_original.jpg?ixlib=rb-1.1.0&w=160&h=160&fit=crop&v=1548222691&auto=format&frame=1&q=92&s=00bc518b23a932bd76fb6e21f4eb6834","medium":"https://ksr-ugc.imgix.net/assets/023/784/556/6ed11b25c853ec1aef7f4360d0eb59ef_original.jpg?ixlib=rb-1.1.0&w=160&h=160&fit=crop&v=1548222691&auto=format&frame=1&q=92&s=00bc518b23a932bd76fb6e21f4eb6834"},"urls":{"web":{"user":"https://www.kickstarter.com/profile/babalus"},"api":{"user":"https://api.kickstarter.com/v1/users/2094277840?signature=1552621545.c7a32fed985a78dec253fe61c1acb7a99edbc0af"}}}'

"name" is the only relevant key in creator.

In [12]:
# Getting the first value of name out of the JSON formatted dictionary
cr = json.loads(df.creator[0])
print(cr.get("name"))

Lucy Conroy


In [13]:
# Extracting the creator name into a new column
df["creator_name"] = ""
for j in range(len(df.creator)):
    try:
        dict_cre = json.loads(df.creator[j])
        # df.at writes into the DataFrame itself and avoids chained assignment
        df.at[j, "creator_name"] = dict_cre.get("name")
    except (TypeError, ValueError, AttributeError):
        # Use a real missing value instead of the string 'NaN'
        df.at[j, "creator_name"] = np.nan

# Drop the original column 'creator'.
df = df.drop(columns='creator')


#### Location

In [14]:
# Looking for keys of relevant values.
df.location[0]

'{"id":2462429,"name":"Novato","slug":"novato-ca","short_name":"Novato, CA","displayable_name":"Novato, CA","localized_name":"Novato","country":"US","state":"CA","type":"Town","is_root":false,"urls":{"web":{"discover":"https://www.kickstarter.com/discover/places/novato-ca","location":"https://www.kickstarter.com/locations/novato-ca"},"api":{"nearby_projects":"https://api.kickstarter.com/v1/discover?signature=1552595066.49b64db66a5124f5831752d055cd09aff20cc652&woe_id=2462429"}}}'

"name" and "state" are the only relevant keys in location.

In [15]:
# Getting the first values of name and state out of the JSON formatted dictionary
lo = json.loads(df.location[0])
print(lo.get("name"))
print(lo.get("state"))

Novato
CA


In [16]:
# Extracting the location name and state into new columns
df["location_name"] = ""
df["location_state"] = ""
for k in range(len(df.location)):
    try:
        dict_loc = json.loads(df.location[k])
        # df.at writes into the DataFrame itself and avoids chained assignment
        df.at[k, "location_name"] = dict_loc.get("name")
        df.at[k, "location_state"] = dict_loc.get("state")
    except (TypeError, ValueError, AttributeError):
        # Use real missing values instead of the string 'NaN'
        df.at[k, "location_name"] = np.nan
        df.at[k, "location_state"] = np.nan

# Drop the original column 'location'.
df = df.drop(columns='location')


### Cleaning step 3
Convert the UNIX time stamps in columns **created_at**, **deadline**, **launched_at** and **state_changed_at** to readable date strings.

In [17]:
# Computing readable time strings from the UNIX time stamps and storing them in new columns
# (time.ctime uses the local time zone of the machine running the notebook)
df["created_at_rd"] = ""
df["deadline_rd"] = ""
df["launched_at_rd"] = ""
df["state_changed_at_rd"] = ""

for row in range(len(df.created_at)):
    try:
        # df.at writes into the DataFrame itself and avoids chained assignment
        df.at[row, "created_at_rd"] = time.ctime(df.created_at[row])
        df.at[row, "deadline_rd"] = time.ctime(df.deadline[row])
        df.at[row, "launched_at_rd"] = time.ctime(df.launched_at[row])
        df.at[row, "state_changed_at_rd"] = time.ctime(df.state_changed_at[row])
    except (TypeError, ValueError, OverflowError, OSError):
        # If any time stamp cannot be converted, mark all four as missing
        df.at[row, "created_at_rd"] = np.nan
        df.at[row, "deadline_rd"] = np.nan
        df.at[row, "launched_at_rd"] = np.nan
        df.at[row, "state_changed_at_rd"] = np.nan
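As an alternative to formatting each value with `time.ctime`, the conversion could be sketched with `pd.to_datetime`, which works on whole columns and returns real datetime values rather than strings. This is a different technique from the one used in the notebook; the toy values and the `_dt` column suffix are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); the second deadline is deliberately missing.
toy = pd.DataFrame({
    "created_at": [1541462805, 1501691293],
    "deadline": [1552543375, np.nan],
})

for col in ["created_at", "deadline"]:
    # unit="s" interprets the numbers as seconds since the UNIX epoch;
    # utc=True makes the result timezone-aware instead of machine-dependent.
    toy[col + "_dt"] = pd.to_datetime(toy[col], unit="s", utc=True)

print(toy[["created_at_dt", "deadline_dt"]])
```

Missing time stamps become `NaT`, and the resulting columns support date arithmetic, for example subtracting a launch date from a deadline to get a campaign duration.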

### Check the new columns

In [19]:
df.iloc[0:10, 25:35]

| | category_name | category_id | category_parent_id | creator_name | location_name | location_state | created_at_rd | deadline_rd |
|---|---|---|---|---|---|---|---|---|
| 0 | Footwear | 266 | 9.0 | Lucy Conroy | Novato | CA | Tue Nov 6 00:06:45 2018 | Thu Mar 14 06:02:55 2019 |
| 1 | Playing Cards | 273 | 12.0 | Lisa Vollrath | Euless | TX | Wed Aug 2 16:28:13 2017 | Sat Sep 9 19:00:59 2017 |
| 2 | Rock | 43 | 14.0 | Electra | Hollywood | CA | Sun Sep 30 08:45:33 2012 | Wed Jun 12 07:03:15 2013 |
| 3 | Playing Cards | 273 | 12.0 | Artur Ordijanc (deleted) | Kaunas | Kaunas County | Sat Jan 7 10:11:11 2017 | Mon Mar 13 18:22:56 2017 |
| 4 | Nonfiction | 48 | 18.0 | Dawn Johnston | Traverse City | MI | Thu Dec 6 19:04:31 2012 | Wed Jan 9 21:32:07 2013 |
| 5 | Classical Music | 36 | 14.0 | Annapolis Chamber Players | Annapolis | MD | Fri Oct 24 19:35:50 2014 | Sat May 2 04:25:46 2015 |
| 6 | Classical Music | 36 | 14.0 | The Tekalli Duo | New Haven | CT | Sun Sep 1 03:12:35 2013 | Sat Oct 12 03:12:00 2013 |
| 7 | Music | 14 | | funktoast | Kaysville | UT | Tue Jan 8 17:38:03 2019 | Wed Feb 13 15:15:05 2019 |
| 8 | Immersive | 283 | 17.0 | Overflow Theatre Company | Northampton | England | Thu Mar 24 12:20:44 2016 | Wed May 11 01:00:00 2016 |
| 9 | Accessories | 262 | 9.0 | Lauren Ackerley | Wolverhampton | England | Wed Jan 9 23:05:06 2019 | Wed Feb 20 14:00:01 2019 |


In [20]:
# Save the initially cleaned dataset
df.to_csv('data/Kickstarter_init_cleaned2.csv')
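By default, `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. It may be that the `Unnamed: 0` column dropped at the start of this notebook has the same origin, but that is an assumption. The sketch below uses a made-up toy DataFrame and in-memory buffers to show the difference `index=False` makes.

```python
import io

import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"name": ["a", "b"], "goal": [100, 250]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False omits it, so nothing needs to be dropped after reloading.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to switch the save step to `index=False` depends on later notebooks: if they already drop `Unnamed: 0` after loading, they would need to be adjusted as well.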