## Kickstarter EDA
This study pulls together an assortment of Kickstarter files and runs an exploratory data analysis (EDA) on how different factors affect success. <br> 
Depending on the data file, success is defined by one or more of the following metrics: 
1. Success/Failure: Whether a fundraising goal was reached or not. 
2. Backers:  Number of backers. 
3. Magnitude: degree to which goals were passed. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
import os
import re
import collections  # returns frequencies 

# For Text processing
import string             # use punctuation method to remove punctuation in lines later 
from wordcloud import WordCloud, STOPWORDS  
from bs4 import BeautifulSoup as bfs  # used to read in website data
import nltk 
# Lemmatizer reduces words into their root form: wolves -> wolf, jumping -> jump, etc. 
from nltk.stem import WordNetLemmatizer as WNL    
from nltk.corpus import wordnet 
from nltk import pos_tag
from nltk.corpus import stopwords

In [2]:
# display more than one record in cell 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Dark mode display 
from jupyterthemes import jtplot
jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False)

### Data Files
In this study, we pull data files that contain success variables (response variables), with different feature variables (Factors). The data files are taken from: 
1. [Kickstarter 2010-7](https://www.kaggle.com/carlolepelaars/exploration-of-kickstarter-data-2010-2017/data): data17 
2. [Kickstarter: 2010-5](https://www.kaggle.com/dilipajm/kickstarter-project-funding-prediction/data)  data15
3. [Short data set with 4k records](https://www.kaggle.com/socathie/kickstarter-project-statistics) data_sht

In [3]:
# Import Kickstarter files 
data17 = pd.read_csv('ks-projects-201612.csv', encoding = "ISO-8859-1")
data15 = pd.read_csv('ks_train.csv')
data_sht = pd.read_csv('most_backed.csv')

  interactivity=interactivity, compiler=compiler, result=result)


### Preprocessing
Format all numbers within 3 decimal places 

In [5]:
# first format to 3 decimals 
pd.options.display.float_format = '{:.2f}'.format

### Preprocessing 2010-7 series (data17)
This is the largest of the three files (328k records). <br> 
1. Remove all leading and trailling spaces from **column titles**
2. Remove rows where column conversion failed: this shows up as "unnamed" columns 13-16. 
3. Format **date** and **numeric** columns from object to datetime or numeric format 
4. Add fields for response variable: <br> 
   a. Pledged vs. Goal: Measures magnitude of success <br> 
   b. Average Pledge  <br> 
5. Check **Duplicates**

In [7]:
data17.dtypes
xcols = [x.strip() for x in data17.columns]
data17.columns = xcols
data17.columns

ID                  int64
name               object
category           object
main_category      object
currency           object
deadline           object
goal               object
launched           object
pledged            object
state              object
backers            object
country            object
usd pledged        object
Unnamed: 13        object
Unnamed: 14        object
Unnamed: 15        object
Unnamed: 16       float64
dtype: object

Index(['ID', 'name', 'category', 'main_category', 'currency', 'deadline',
       'goal', 'launched', 'pledged', 'state', 'backers', 'country',
       'usd pledged', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16'],
      dtype='object')

**Processing 'Unnamed" columns** <br> 
Most records have 'Unnamed' columns with are empty. We examine the non-null records and find 625/320+k records have unnamed columns that are not-null.<br> 
We remove these records 

In [15]:
print("sample of records with null Unnamed columns ")
data17[data17['Unnamed: 13'].isnull()].iloc[1:5,13:17]
print("No. of records w/ to be removed:", len(data17[data17['Unnamed: 13'].notnull()]))
data17 = data17.drop(data17.columns[13:17],axis=1)

sample of records with null Unnamed columns 


Unnamed: 0,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
1,,,,
2,,,,
3,,,,
4,,,,


No. of records w/ to be removed: 625


**Remove records with null response variables.** <br> 
Find that removing null goal values also removes all null backers. 

In [31]:
nan_goal_id = data17.ID[data17.goal.isnull()]
data17 = data17.drop(data17[data17.ID.isin(nan_goal_id)].index)
print('No. of Null Backers')
len(data17[data17.backers.isnull()])
print("No. of Null goals ")
len(data17[data17.goal.isnull()])


No. of Null Backers


0

No. of Null goals 


0

**Format dates and numeric fields**

In [32]:
# Numeric: convert goal, pledged, backers, and usd pledged to numbers 
data17['goal'] = pd.to_numeric(data17.goal, errors='coerce')
data17['pledged'] = pd.to_numeric(data17.pledged, errors='coerce')
data17['usd pledged'] = pd.to_numeric(data17['usd pledged'], errors='coerce')
data17['backers'] = pd.to_numeric(data17.backers, errors='coerce')
data17[['goal','pledged','usd pledged','backers']].dtypes

# Format dates
data17.launched = pd.to_datetime(data17.launched)
data17.deadline = pd.to_datetime(data17.deadline)
data17.dtypes

goal           float64
pledged        float64
usd pledged    float64
backers        float64
dtype: object

ID                        int64
name                     object
category                 object
main_category            object
currency                 object
deadline         datetime64[ns]
goal                    float64
launched         datetime64[ns]
pledged                 float64
state                    object
backers                 float64
country                  object
usd pledged             float64
dtype: object

**Process Response Variables**
1. Average Pledge 
2. Pledged vs. Goal

In [48]:
# Average Pledge 
data17['avg_pledge'] = np.where(data17['backers']==0,0,data17['usd pledged'] / data17['backers'] )
print("Avg Pledge Stats:")
stats.describe(data17['avg_pledge'])

# Pledged v. Goal 
data17['pledgedvgoal'] = np.where(data17["usd pledged"].isnull(), 0, data17['usd pledged'] / data17['goal'])
print("Pledged v. Goal Stats:")
stats.describe(data17['pledgedvgoal'])

Avg Pledge Stats:


DescribeResult(nobs=323118, minmax=(0.0, 10000.0), mean=59.78752558892141, variance=17451.190525467344, skewness=25.40103612965577, kurtosis=1312.1669675800613)

Pledged v. Goal Stats:


DescribeResult(nobs=323118, minmax=(0.0, 55266.57), mean=2.6102743624017553, variance=30920.894458854862, skewness=190.1314333700369, kurtosis=46355.55784777164)

In [52]:
# Check for duplicates='ID'
dup = data17[data17.duplicated(subset='ID',keep=False)]
len(dup)

0

### Preprocessing Kickstarter 2010-15 Series
1. Format Dates 
2. Check for duplicates

In [50]:
data15.deadline = pd.to_datetime(data15.deadline, origin='unix', unit='s')
data15.state_changed_at = pd.to_datetime(data15.state_changed_at, origin='unix', unit='s')
data15.created_at = pd.to_datetime(data15.created_at, origin='unix', unit='s')
data15.launched_at = pd.to_datetime(data15.launched_at, origin='unix', unit='s')
data15[['deadline','state_changed_at','created_at','launched_at']].dtypes
data15[['deadline','state_changed_at','created_at','launched_at']].head()

deadline            datetime64[ns]
state_changed_at    datetime64[ns]
created_at          datetime64[ns]
launched_at         datetime64[ns]
dtype: object

Unnamed: 0,deadline,state_changed_at,created_at,launched_at
0,2009-05-03 06:59:59,2009-05-03 07:00:17,2009-04-24 19:15:07,2009-04-24 19:52:03
1,2009-05-15 23:10:00,2009-05-16 00:00:18,2009-04-28 23:10:24,2009-04-29 03:26:32
2,2009-05-22 21:26:00,2009-05-22 21:30:18,2009-05-12 21:26:53,2009-05-12 21:39:58
3,2009-05-29 00:09:00,2009-05-29 00:15:21,2009-04-29 00:09:55,2009-04-29 00:58:50
4,2009-05-31 11:38:00,2009-05-31 11:45:17,2009-05-01 11:38:34,2009-05-01 12:22:21


In [51]:
# check for duplicates 
dup = data15[data15.duplicated(subset='project_id',keep=False) ]
len(dup)  # no duplicates 

0

### Preprocessing of 4k-record set
While small (4k records), this is the only dataset is **Pledge tiers and backers**, both of which are stored in lists. We use the below functions to create a flattened version of the dataframe. 

In [11]:
# Function: break lists into seperate lines 
    # Function flat_len_list returns length of each list: 
    # Sample list: [1.0, 14.0, 19.0, 19.0, 35.0, 35.0, 79.0, 79.0, 129.0, 129.0, 849.0, 849.0]
        # starts with brackets, seperated by rows  
def flat_len_list(df, list_column): 
    len_list = []
    for fld_row in range(df.shape[0]):
        rec = df[list_column][fld_row][1:-1]   # remove open/close brackets from list 
        rec_split = rec.split(',') 
        len_list.append(len(rec_split))       # append list 'lengths'
    return len_list 
 
    # Function flatten_list_df creates expanded list with each tier receiving a record.  
def flatten_list_df(df, list_column): 
    t_list = []
    for fld_row in range(df.shape[0]):
        rec = df[list_column][fld_row][1:-1]   # remove open/close brackets from list 
        rec_split = rec.split(',') 
        for x in rec_split:
            t_list.append(x)
    return t_list 

flat_pledge = flatten_list_df(df2, "pledge.tier")
flat_backers = flatten_list_df(df2, "num.backers.tier")

In [12]:
# Function to flatten og dataset and add flat fields 
def expand_df_flatten(df, list_column, new_column):
    lens_of_lists = flat_len_list(df, list_column) 
    origin_rows = range(df.shape[0])   # range object for no. of rows: 4k rows 
    """ create array marking each digit with its row. 
    For instance first list (row zero) has 26 digits, 2nd list 75 digits. 3rd list has 457 digits 
    destination rows will have 26 0's, 75 1's, 457 2's.... 
    This will be used for number of row copies. 
    """
    destination_rows = np.repeat(origin_rows, lens_of_lists)
        # create DF excluding list column 
    non_list_cols = (
      [idx for idx, col in enumerate(df.columns)
       if col != list_column]
    )
    expanded_df = df.iloc[destination_rows, non_list_cols].copy()
    expanded_df[new_column] = flatten_list_df(df, list_column) 
    return expanded_df

### Text Processing
Import packages and load functionns. 

In [5]:
wlem = WNL()  # Inititate Lemmatizer

In [6]:
def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [7]:
# Function for word presence: takes a string of words, seperates them, and checks for presence of word 
def str_presence(input_string, search_string):
    token = input_string
    token = token.lower()
    token = token.translate(token.maketrans('','',string.punctuation))
    token = [tk.strip() for tk in token.split("-")] 
    token = [tk for tk in token if len(tk)>2]   # remove article words
    # pos_tags = pos_tag(token)
    # token = [wlem.lemmatize(t[0],get_wordnet_pos(t[1])) for t in pos_tags]
    return int(any(x in token for x in search_string))