live.csv contains projects that were still live when the data was collected. It is still usable because its percentage.funded column shows how far each project has been funded, where 100 means the goal has been fully met.

most_backed.csv contains the 4000 most-backed projects ever on Kickstarter, so their descriptions can be treated as successful.

df_text_eng.csv contains only project descriptions, each labeled successful or failed.

Data found here:

https://www.kaggle.com/socathie/kickstarter-project-statistics?select=live.csv

https://www.kaggle.com/oscarvilla/kickstarter-nlp

In [None]:
import pandas as pd
import numpy as np

try:
  from pandas_profiling import ProfileReport
except ImportError:
  # Install the missing packages (Colab shell commands), then retry the import
  !pip install pandas-profiling==2.*
  !pip install category_encoders==2.*
  from pandas_profiling import ProfileReport
import os

Mount Google Drive to read the data files. I will share the zip file with you so you can put it in your own Drive or upload it directly.

In [None]:
from google.colab import drive
drive.mount("/content/drive") 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!unzip drive/My\ Drive/Lambda/kickstarter/Data.zip

Archive:  drive/My Drive/Lambda/kickstarter/Data.zip
replace Data/df_text_eng.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: Data/df_text_eng.csv    
replace Data/live.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: Data/live.csv           
replace Data/most_backed.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: Data/most_backed.csv    


In [None]:
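# Get the three dataframes holding kickstarter data.
# os.listdir returns files in arbitrary order, so load them by name to keep
# the indices fixed: dfs[0] = df_text_eng, dfs[1] = live, dfs[2] = most_backed
dfs = []
for file in ['df_text_eng.csv', 'live.csv', 'most_backed.csv']:
  dfs.append(pd.read_csv(os.path.join('Data', file), index_col=0))
  print(dfs[-1].head())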

                                               blurb       state
1  Using their own character, users go on educati...      failed
2  MicroFly is a quadcopter packed with WiFi, 6 s...  successful
3  A small indie press, run as a collective for a...      failed
4  Zylor is a new baby cosplayer! Back this kicks...      failed
5  Hatoful Boyfriend meet Skeletons! A comedy Dat...      failed
   amt.pledged  ...                                                url
0    8782571.0  ...                /projects/elanlee/exploding-kittens
1    6465690.0  ...   /projects/antsylabs/fidget-cube-a-vinyl-desk-toy
2    5408916.0  ...  /projects/readingrainbow/bring-reading-rainbow...
3    5702153.0  ...  /projects/559914737/the-veronica-mars-movie-pr...
4    3336371.0  ...         /projects/doublefine/double-fine-adventure

[5 rows x 12 columns]
   amt.pledged  ...                                                url
0      15823.0  ...  /projects/1608905146/catalysts-explorers-and-s...
1       6859.0  ...

df_text_eng.csv requires no cleaning. Because it has only two columns, the blurb and the target, it caps what the combined dataset can contain: the other files have extra columns, but we have no values for those columns here, so they cannot be carried over.

In [None]:
dfs[0].head() 

Unnamed: 0,blurb,state
1,"Using their own character, users go on educati...",failed
2,"MicroFly is a quadcopter packed with WiFi, 6 s...",successful
3,"A small indie press, run as a collective for a...",failed
4,Zylor is a new baby cosplayer! Back this kicks...,failed
5,Hatoful Boyfriend meet Skeletons! A comedy Dat...,failed
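The two-column limit noted above can be checked by intersecting the column sets of the frames. The sketch below uses made-up one-row stand-ins for the three files (the names `text_df`, `live_df`, and `backed_df` and their values are hypothetical; only the column names come from this notebook).

```python
import pandas as pd

# Toy stand-ins for the three files. Column names mirror the ones used in
# this notebook, but the rows are invented for illustration.
text_df = pd.DataFrame({"blurb": ["a"], "state": ["failed"]})
live_df = pd.DataFrame({"blurb": ["b"], "percentage.funded": [120.0]})
backed_df = pd.DataFrame({"blurb": ["c"], "amt.pledged": [50.0], "goal": [10.0]})

frames = [text_df, live_df, backed_df]

# Columns present in every frame; anything else cannot survive a clean merge
shared = set.intersection(*(set(frame.columns) for frame in frames))
print(sorted(shared))  # ['blurb']
```

Only the blurb is shared before any cleaning, which is why the label has to be derived for the other two frames.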


live.csv requires cleaning: it contains the blurb column but no success/failure label. We infer one by treating any project whose 'percentage.funded' value is at least 100 as successful and every other project as failed.

In [None]:
dfs[1].head()

Unnamed: 0,amt.pledged,blurb,by,category,currency,goal,location,num.backers,num.backers.tier,pledge.tier,title,url
0,8782571.0,\nThis is a card game for people who are into ...,Elan Lee,Tabletop Games,usd,10000.0,"Los Angeles, CA",219382,"[15505, 202934, 200, 5]","[20.0, 35.0, 100.0, 500.0]",Exploding Kittens,/projects/elanlee/exploding-kittens
1,6465690.0,"\nAn unusually addicting, high-quality desk to...",Matthew and Mark McLachlan,Product Design,usd,15000.0,"Denver, CO",154926,"[788, 250, 43073, 21796, 41727, 21627, 12215, ...","[1.0, 14.0, 19.0, 19.0, 35.0, 35.0, 79.0, 79.0...",Fidget Cube: A Vinyl Desk Toy,/projects/antsylabs/fidget-cube-a-vinyl-desk-toy
2,5408916.0,\nBring Reading Rainbow’s library of interacti...,LeVar Burton & Reading Rainbow,Web,usd,1000000.0,"Los Angeles, CA",105857,"[19639, 14343, 9136, 2259, 5666, 24512, 4957, ...","[5.0, 10.0, 25.0, 30.0, 35.0, 50.0, 75.0, 100....","Bring Reading Rainbow Back for Every Child, Ev...",/projects/readingrainbow/bring-reading-rainbow...
3,5702153.0,\nUPDATED: This is it. We're making a Veronica...,Rob Thomas,Narrative Film,usd,2000000.0,"San Diego, CA",91585,"[5938, 8423, 11509, 22997, 23227, 1865, 7260, ...","[1.0, 10.0, 25.0, 35.0, 50.0, 75.0, 100.0, 150...",The Veronica Mars Movie Project,/projects/559914737/the-veronica-mars-movie-pr...
4,3336371.0,"\nAn adventure game from Tim Schafer, Double F...",Double Fine and 2 Player Productions,Video Games,usd,400000.0,"San Francisco, CA",87142,"[47946, 24636, 1090, 11530, 900, 148, 100, 10, 4]","[15.0, 30.0, 60.0, 100.0, 250.0, 500.0, 1000.0...",Double Fine Adventure,/projects/doublefine/double-fine-adventure


In [None]:
# Keep only the columns we need; .copy() avoids SettingWithCopyWarning later
dfs[1] = dfs[1][['blurb', 'percentage.funded']].copy()
dfs[1].head(2)

KeyError: ignored

In [None]:
dfs[1]['state'] = np.where(dfs[1]['percentage.funded']>=100, 'successful', 'failed')
dfs[1].drop(columns = ['percentage.funded'], inplace=True)
dfs[1].head()

This dataframe has the goal amount along with the amount pledged. We could infer the state column by marking any row whose amount pledged is greater than or equal to its goal as successful and all others as failed. However, this dataframe holds the 4000 most-backed Kickstarter projects, all of which exceeded their goal by a large amount, so every row is labeled successful.
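The claim that every most-backed project met its goal can be checked rather than assumed. A minimal sketch, using made-up rows shaped like most_backed.csv (the `toy` frame and its values are hypothetical, and the third row is deliberately under its goal to show what a failed check looks like):

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same column names as most_backed.csv
toy = pd.DataFrame({
    "blurb": ["A card game", "A desk toy", "An underfunded idea"],
    "amt.pledged": [8782571.0, 6465690.0, 900.0],
    "goal": [10000.0, 15000.0, 1000.0],
})

# True where the pledged amount reached the goal
met_goal = toy["amt.pledged"] >= toy["goal"]

# If this prints False, labeling every row 'successful' would be wrong
print(met_goal.all())

# Row-by-row labeling, which stays correct either way
toy["state"] = np.where(met_goal, "successful", "failed")
print(toy[["blurb", "state"]])
```

If `met_goal.all()` is true on the real data, the blanket label and the row-by-row label give the same result.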


In [None]:
dfs[2].head()

In [None]:
# Keep only the columns we need; .copy() avoids SettingWithCopyWarning later
dfs[2] = dfs[2][['blurb', 'amt.pledged', 'goal']].copy()
dfs[2].head(2)

In [None]:
# This df only contains successes, so label every row successful
dfs[2]['state'] = 'successful'
dfs[2].drop(columns=['amt.pledged', 'goal'], inplace=True)
dfs[2].head()

In [None]:
df = pd.concat(dfs, ignore_index=True)
df.head(20)

In [None]:
df.info()

In [None]:
df.describe()

The combined dataframe contains some null and duplicate values, so we drop every duplicated row and every row that has a null value.

In [None]:
# Inspect rows whose blurb is shorter than 10 characters
df[df.blurb.str.len() < 10]

NameError: ignored

In [None]:
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)
df.to_csv('kickstarter.csv')
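Before dropping rows it can help to count how many will be lost, so the cleanup is not silent. A minimal sketch on a made-up frame (the `toy` data is hypothetical; the two-column layout matches the combined dataframe in this notebook):

```python
import pandas as pd

# Hypothetical combined frame: one exact duplicate and one null blurb
toy = pd.DataFrame({
    "blurb": ["A card game", "A card game", None, "A desk toy"],
    "state": ["successful", "successful", "failed", "failed"],
})

# Rows that have at least one null value
n_null = int(toy.isna().any(axis=1).sum())

# Rows that repeat an earlier row exactly
n_dup = int(toy.duplicated().sum())

print(f"null rows: {n_null}, duplicate rows: {n_dup}")

# Non-inplace version of the same cleanup, with a fresh 0..n-1 index
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)
print(len(toy), "->", len(cleaned))
```

Resetting the index here is a design choice: it keeps the saved CSV's index contiguous after rows are removed.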

In [None]:
new = pd.read_csv('kickstarter.csv', index_col=0)
new.head()

Unnamed: 0,blurb,state
0,"Using their own character, users go on educati...",failed
1,"MicroFly is a quadcopter packed with WiFi, 6 s...",successful
2,"A small indie press, run as a collective for a...",failed
3,Zylor is a new baby cosplayer! Back this kicks...,failed
4,Hatoful Boyfriend meet Skeletons! A comedy Dat...,failed
