# Exploring Kickstarter Project Data
## by Michael Mosin

## Preliminary Wrangling

This document explores a dataset comprising various attributes of 3786 Kickstarter projects.

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [None]:
# Import dataset
# Dataset downloaded from CSV link under "2019-05-16" on site: https://webrobots.io/kickstarter-datasets/
df = pd.read_csv('Kickstarter.csv')

In [None]:
# Adding ability to view all dataframe columns
# as per https://stackoverflow.com/questions/49188960/how-to-show-all-of-columns-name-on-pandas-dataframe/49189503
pd.set_option('display.max_columns', None)
df.head()

In [None]:
df.info()

In [None]:
df.duplicated().value_counts()

In [None]:
# Make copy of main dataframe so as to keep original data intact.
df_copy = df.copy()
df_copy.shape

## Tracking Data Quality and Tidiness Issues:

### Quality:

- Variables "friends", "is_backing", "is_starred", and "permissions" only have one entry and should be dropped
- Per https://help.kickstarter.com/hc/en-us/articles/115005135834-What-is-Spotlight-:
    - Variable 'spotlight' should be dropped, since it only applies to successful projects after the fact, and is thus not a predictor of anything.
- Variables "created_at", "deadline", "launched_at", and "state_changed_at" are set in unix time instead of readable datetime
- Variables with financial values such as "converted_pledged_amount", "goal", "pledged", and "usd_pledged" are set to different decimal places, and should be rounded to at most two decimal places

- Only two entries are missing data for "location" (not a big deal, given that we have "country" data)
- Only eleven entries are missing data for "usd_type" (this variable is not important to the investigation)
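
One way to spot near-empty columns like the ones listed above is to count the non-null entries per column. This is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "friends": [np.nan, np.nan, "[]", np.nan],
    "country": ["US", "GB", "US", None],
})

# Number of non-null entries in each column
non_null = toy.notna().sum()

# Columns with at most one non-null entry are candidates for dropping
sparse_cols = non_null[non_null <= 1].index.tolist()
print(sparse_cols)
```

On the real dataframe the same two lines could be run on `df` directly; the threshold of one entry is a choice, not a rule.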

### Tidiness:

- Data entries in the columns "category", "creator", "location", "photo", "profile", and "urls" contain multiple pieces of information. If separated, they could be their own dataframes or made into additional columns in the main dataframe.
    - The "category" variable can yield category and sub-category info for the projects
    - The "location" variable can yield data regarding the project's state name, city name, and city type
    - The "creator","photo", "profile", "urls" variables have no data that is relevant to this project and should be dropped


## Addressing Data Quality and Tidiness Issues

### Quality: 

#### Remove (essentially) empty columns and 'spotlight'

In [None]:
df_copy = df_copy.drop(columns = ["friends", "is_backing", "is_starred", "permissions", "spotlight"])
df_copy.shape

#### Fix time categories

In [None]:
# Converting unix time to readable date-time
# as per https://stackoverflow.com/questions/19231871/convert-unix-time-to-readable-date-in-pandas-dataframe
date_cols = ["created_at", "deadline", "launched_at", "state_changed_at"]
for i in date_cols:
    df_copy[i] = pd.to_datetime(df_copy[i],unit='s')

df_copy[date_cols].head()

In [None]:
df_copy[date_cols].describe()

#### Fix financial categories

In [None]:
# Round financial values to at most two decimal places
money_cols = ["converted_pledged_amount", "goal", "pledged", "usd_pledged"]
# Round only the financial columns, leaving other numeric columns untouched
df_copy[money_cols] = df_copy[money_cols].round(2)

df_copy[money_cols].head()

### Tidiness: 

#### Feature Engineering - address tidiness issue of "category" variable by creating variables holding extracted values for main category and sub-category:

In [None]:
# View full string entries for "category" variable to gauge the complexity of category strings:
df_copy['category'][15]

In [None]:
df_copy['category'][798]

In [None]:
# Extract product categories and sub-categories from strings in "category" variable into their own columns in dataframe
# (Used regular expression)

import re  

df_copy['main_cat'] = ''
df_copy['sub_cat'] = ''

# Raw string avoids invalid-escape warnings; .at avoids chained assignment
cat_pattern = re.compile(r"(([- &'\\]|\w+)+)")

for i in np.arange(df_copy.shape[0]):
    match = cat_pattern.findall(df_copy['category'][i])
    df_copy.at[i, 'main_cat'] = match[5][0].title()
    df_copy.at[i, 'sub_cat'] = match[3][0].replace('\\','')

In [None]:
df_copy[['name','main_cat', 'sub_cat', 'category']].head() 

In [None]:
df_copy.main_cat.value_counts()

In [None]:
df_copy.sub_cat.value_counts()
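
If the "category" strings are valid JSON (an assumption), parsing them with the standard `json` module may be less brittle than picking regex matches by position. The sample string and its field names below are hypothetical and only modeled on what such an entry might look like:

```python
import json

# Hypothetical example of a "category" entry; the real field names may differ
sample = '{"id": 43, "name": "Rock", "slug": "music/rock", "position": 17}'

def split_category(raw):
    """Return (main category, sub-category) from a JSON category string."""
    info = json.loads(raw)
    # Assumes the slug has the form "main/sub"; a slug without "/" is used whole
    main = info["slug"].split("/")[0]
    return main.title(), info["name"]

main_cat, sub_cat = split_category(sample)
print(main_cat, sub_cat)
```

With pandas, a helper like this could be applied row-wise via `df_copy['category'].apply(split_category)`.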

#### Feature Engineering - address tidiness issue of "location" variable by creating variables holding extracted values for state, city, and type of city:

In [None]:
df_copy[df_copy.location.isnull()]

In [None]:
# View full string entry for "location" variable of third row entry to gauge the complexity of location strings:
df_copy['location'][2]

In [None]:
# Extract each project's state (within its country), city, and city type from strings in "location" variable
# into their own columns in dataframe
# (Used regular expression)

df_copy['location_state'] = ''
df_copy['location_city'] = ''
df_copy['location_type'] = ''

# Raw string avoids invalid-escape warnings; .at avoids chained assignment
loc_pattern = re.compile(r'((?:[^"]\w+)+)')

for i in np.arange(df_copy.shape[0]):
    if pd.notna(df_copy.location[i]):
        match = loc_pattern.findall(df_copy['location'][i])
        df_copy.at[i, 'location_state'] = match[17]
        df_copy.at[i, 'location_city'] = match[3]
        df_copy.at[i, 'location_type'] = match[19]
    else:
        df_copy.at[i, 'location_state'] = 'NaN'
        df_copy.at[i, 'location_city'] = 'NaN'
        df_copy.at[i, 'location_type'] = 'NaN'
    

In [None]:
df_copy[['name', 'country', 'location_state', 'location_city', 'location_type', 'location']][1930:1933]
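
As with "category", the "location" strings might be parseable as JSON (an assumption). The sketch below uses a hypothetical sample entry with assumed field names, and returns `None` for missing locations so that they remain recognizable as missing values rather than becoming the string 'NaN':

```python
import json

# Hypothetical "location" entry; the field names are assumptions
sample = '{"name": "Seattle", "state": "WA", "type": "Town", "country": "US"}'

def split_location(raw):
    """Return (state, city, type), or three None values for missing input."""
    if not isinstance(raw, str):  # a missing value arrives as a float NaN, not a string
        return (None, None, None)
    info = json.loads(raw)
    # .get returns None when a field is absent instead of raising KeyError
    return (info.get("state"), info.get("name"), info.get("type"))

print(split_location(sample))
print(split_location(float("nan")))
```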

### What is the structure of your dataset?

There are 3786 Kickstarter projects in this dataset, with a total of 37 features, some of which are untidy, and some of which are not of interest for my exploration. I have engineered a few categorical features (related to project categories and location) which may prove useful for exploration.


### What is/are the main feature(s) of interest in your dataset?

I am interested in finding out which project qualities correlate with different types of project outcomes (or, the final "state" of the project). 


### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I believe the following features could illuminate patterns in project outcomes:
- Number of backers
- Length of time project was open
- Total money pledged (relative to funding goal)
- Project category
- Project location (country, city type)
- Whether project was a "staff pick"

In [None]:
# Save wrangled dataframe to new CSV file to make future manipulating easier
df_copy.to_csv('data_wrangled.csv', index=False)

## Streamlining Wrangled Dataset 

### Removing extra variables, and engineering other potentially relevant features

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [None]:
# Import wrangled dataset
df = pd.read_csv('data_wrangled.csv')
pd.set_option('display.max_columns', None)
df.head()

In [None]:
# Remove unnecessary variables
df.drop(columns = ['category',
                   'converted_pledged_amount',
                   'creator',
                   'currency_symbol',
                   'currency_trailing_code',
                   'current_currency',
                   'disable_communication',
                   'fx_rate',
                   'location',
                   'photo',
                   'profile',
                   'slug',
                   'source_url',
                   'static_usd_rate',
                   'urls',
                   'usd_type'],
       inplace = True)
df.head()

#### Engineering features related to the time variables:

- created_at
- launched_at
- deadline
- state_changed_at

Reference:

http://www.datasciencemadesimple.com/difference-two-timestamps-seconds-minutes-hours-pandas-python-2/
https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html

In [None]:
# Convert time variables back to 'datetime64' type (they are read from the CSV as strings):
date_cols = ["created_at", "deadline", "launched_at", "state_changed_at"]
for i in date_cols:
    df[i] = pd.to_datetime(df[i])

df[date_cols].dtypes

In [None]:
import datetime
from dateutil.relativedelta import relativedelta
from datetime import date

In [None]:
# Calculate number of days it took to launch project: 'days_to_launch'
# (days between project creation and project launch: 'launched_at' - 'created_at')

df['days_to_launch'] = df['launched_at'] - df['created_at']
df['days_to_launch']=df['days_to_launch']/np.timedelta64(1,'D')
df.days_to_launch.head()

In [None]:
# Calculate number of days given for project to succeed: 'days_to_succeed'
# (days between project launch and project deadline: 'deadline' - 'launched_at')

df['days_to_succeed'] = df['deadline'] - df['launched_at']
df['days_to_succeed']=df['days_to_succeed']/np.timedelta64(1,'D')
df.days_to_succeed.head()

In [None]:
# Calculate number of days project was active (until it reached its final 'state'): 'days_active'
# (days between project launch and final state change: 'state_changed_at' - 'launched_at')

df['days_active'] = df['state_changed_at'] - df['launched_at']
df['days_active']=df['days_active']/np.timedelta64(1,'D')
df.days_active.head()

In [None]:
# Confirm whether final 'state' occurred before or after 'deadline': 'ended_early'
# Faster code instead of for loops as per reference:
# https://stackoverflow.com/questions/27041724/using-conditional-to-generate-new-column-in-pandas-dataframe

df['ended_early'] = np.where(df.days_active < df.days_to_succeed, True, False)

In [None]:
df[['days_to_succeed','days_active','ended_early']].head()
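
To make the time features concrete, here is a small sketch on two made-up campaigns (the timestamps are invented). It uses `.dt.total_seconds() / 86400` as an alternative way of converting timedeltas to days:

```python
import pandas as pd

# Made-up timestamps for two hypothetical campaigns
toy = pd.DataFrame({
    "launched_at": pd.to_datetime(["2019-01-01 00:00", "2019-02-01 12:00"]),
    "deadline": pd.to_datetime(["2019-01-31 00:00", "2019-03-03 12:00"]),
    "state_changed_at": pd.to_datetime(["2019-01-31 00:00", "2019-02-11 12:00"]),
})

# Subtracting datetimes gives timedeltas; total_seconds() / 86400 converts them to days
toy["days_to_succeed"] = (toy["deadline"] - toy["launched_at"]).dt.total_seconds() / 86400
toy["days_active"] = (toy["state_changed_at"] - toy["launched_at"]).dt.total_seconds() / 86400

# A campaign ended early if its final state came before the deadline
toy["ended_early"] = toy["days_active"] < toy["days_to_succeed"]
print(toy[["days_to_succeed", "days_active", "ended_early"]])
```

The first toy campaign runs its full 30 days, while the second reaches its final state after 10 of its 30 days and is flagged as ended early.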

#### Engineering Feature: proportion of project funding relative to goal

In [None]:
# Finding ratio of confirmed funds relative to funding goals: 'funded_prop'
# ('pledged' / 'goal')

df['funded_prop'] = df['pledged'] / df['goal']
df.funded_prop.head()

#### Save cleaner dataframe to new CSV file to make future manipulating easier

In [None]:
df.to_csv('data_cleaner.csv', index=False)

## Univariate Exploration


In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [None]:
# Import wrangled dataset
df = pd.read_csv('data_cleaner.csv')
pd.set_option('display.max_columns', None)
df.head()

In [None]:
# Set color for charts:
base_color = sb.color_palette()[0]

### What's the distribution of project campaign 'states'?

In [None]:
sb.countplot(data = df, x = 'state', color = base_color);

Looks like more projects succeeded than not.

### What's the proportion of campaigns that ended early?
Plot 'ended_early' counts with proportion percentages over bars.

In [None]:
# create the plot
ax = sb.countplot(data = df, x = 'ended_early', color = base_color)

# add annotations
# Reference: https://stackoverflow.com/questions/31749448/how-to-add-percentages-on-top-of-bars-in-seaborn
n_points = df.shape[0]
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 25,
            '{:0.1f}%'.format(100*height/n_points),
            ha = 'center')

plt.show()

#### Interjected Bivariate Exploration:

A follow-up bivariate exploration is called for to see whether the different campaign 'states' have a split within them regarding if they ended early.

In [None]:
# Reference:
# https://stackoverflow.com/questions/33271098/python-get-a-frequency-count-based-on-two-columns-variables-in-pandas-datafra
df.groupby(['state','ended_early']).size()

In [None]:
sb.countplot(data = df, x = 'state', hue = 'ended_early', palette = 'Blues');

Since each project 'state' only has projects that EITHER ended early or didn't, there is no distribution of endings to be illustrated within each 'state'. We have learned and confirmed that, not surprisingly, all projects which were 'canceled' or 'suspended' ended early, and all project campaigns that 'failed' or were 'successful' in reaching their funding goal were open until their deadline.
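
The "either/or" pattern can also be checked programmatically rather than by eye. A minimal sketch on made-up rows (not the real data) uses a crosstab and tests that each state has only one non-zero cell:

```python
import pandas as pd

# Made-up rows standing in for the real data
toy = pd.DataFrame({
    "state": ["successful", "failed", "canceled", "canceled", "suspended", "successful"],
    "ended_early": [False, False, True, True, True, False],
})

# Rows: state, columns: ended_early, cells: number of projects
counts = pd.crosstab(toy["state"], toy["ended_early"])

# A state is "unmixed" if only one of its ended_early cells is non-zero
unmixed = (counts > 0).sum(axis=1) == 1
print(counts)
print(unmixed.all())
```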

Since I care about whether projects are successful or not, and 'live' project campaigns are still in progress (they have not yet succeeded, failed, or been canceled or suspended), I will continue working with a dataset which excludes project rows that have the state of 'live'.

### What's the proportion of campaigns that ended early - given a removal of 'live' projects?

Plot 'ended_early' counts with proportion percentages over bars WHILE excluding 'live' projects - since they are still in progress and are skewing the chart.


In [None]:
# Isolate sub-dataset which excludes projects that are 'live'
# (.copy() so that columns can be added later without a SettingWithCopyWarning):
df_notlive = df[df['state']!='live'].copy()
df_notlive.shape

In [None]:
# create the plot
ax = sb.countplot(data = df_notlive, x = 'ended_early', color = base_color)

# add annotations
n_points = df_notlive.shape[0]
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 25,
            '{:0.1f}%'.format(100*height/n_points),
            ha = 'center')

plt.show()

This represents the proportion of projects which actually ended early: those which we know to have been cancelled or suspended.

### Check distribution of 'is_starrable' relative to project 'state':

In [None]:
df.groupby(['state','is_starrable']).size()

Since 'is_starrable' is only true for projects that are 'live', and I am excluding the 'live' projects, we can also remove the 'is_starrable' feature.

In [None]:
df_notlive = df_notlive.drop(columns = 'is_starrable')
df_notlive.info()

### What's the distribution of the number of backers per project?

In [None]:
binsize = 50
bins = np.arange(0, df_notlive['backers_count'].max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = df_notlive, x = 'backers_count', bins = bins)
plt.xlabel('Number of Backers')
plt.ylabel('Number of Projects')
plt.show();

Looks like there is a VERY wide distribution of backers, skewed by many projects having zero or few backers, and a few outlier projects with thousands of backers.

#### Let's zoom in a little:

In [None]:
binsize = 5
bins = np.arange(5, 500+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = df_notlive, x = 'backers_count', bins = bins)
plt.xlabel('Number of Backers')
plt.ylabel('Number of Projects')
plt.show();

#### Since there's a long tail in the distribution, let's put it on a log scale instead (thus exclusive of campaigns with no backers):

In [None]:
log_binsize = 0.1
bins = 10 ** np.arange(0, np.log10(df_notlive['backers_count'].max())+log_binsize, log_binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = df_notlive, x = 'backers_count', bins = bins)
plt.xscale('log')
plt.xticks([1, 3, 10, 30, 100, 300, 1e3, 3e3, 1e4, 3e4], [1, 3, 10, 30, 100, 300, '1k', '3k', '10k', '30k'])
plt.xlabel('Number of Backers')
plt.ylabel('Number of Projects')
plt.show();

There still appears to be a long right tail, but it is more constrained under the log transformation. The log scale reveals a bimodal distribution, with one peak at about 1 backer (campaigns with no backers cannot appear on a log axis) and another at around 40 to 60 backers.
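
The log-spaced bins start at 10**0 = 1, which is why campaigns with zero backers drop out of the log-scale histogram. A small sketch with invented backer counts (and a coarser bin size than the plot, chosen only for brevity) shows how many values fall outside the bins:

```python
import numpy as np

# Made-up backer counts, including two zeros
backers = np.array([0, 0, 1, 1, 2, 8, 40, 55, 60, 300, 4000])

# Bin edges equally spaced in log10 units, starting at 10**0 = 1
log_binsize = 0.5
edges = 10 ** np.arange(0, np.log10(backers.max()) + log_binsize, log_binsize)

counts, _ = np.histogram(backers, bins=edges)

# Values below the first edge (here, the zeros) fall outside every bin
n_dropped = len(backers) - counts.sum()
print(n_dropped)
```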

#### Let's see how many campaigns had fewer than 10 backers:

In [None]:
# Plot countplot that illustrates how many projects have fewer than 10 backers:

plt.figure(figsize=[8, 5])
ax = sb.countplot(data = df_notlive, x = 'backers_count', color = base_color)
plt.xlabel('Number of Backers')
plt.ylabel('Number of Projects')
plt.xlim(-0.5,9.5)
plt.ylim(0,350)

# add annotations
for p in ax.patches[0:10]:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:.0f}'.format(height),
            ha = 'center')

plt.show()

Nearly 1 in 11 projects had 0 backers. Over one quarter of projects had 5 or fewer backers.
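
Shares like these can be computed directly instead of being read off the bar annotations. A minimal sketch with twelve invented backer counts (not the real data):

```python
import pandas as pd

# Made-up backer counts for twelve hypothetical projects
backers = pd.Series([0, 2, 5, 7, 12, 30, 45, 60, 80, 150, 400, 2500])

# The mean of a boolean Series is the fraction of True values
share_zero = (backers == 0).mean()
share_five_or_fewer = (backers <= 5).mean()
print(round(share_zero, 3), round(share_five_or_fewer, 3))
```

On the real data the same expressions would be applied to `df_notlive['backers_count']`.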

A follow-up bivariate exploration would be a comparison of the number of backers relative to successful and unsuccessful projects.

### What's the distribution of location type?

In [None]:
sb.countplot(data = df_notlive, y = 'location_type', color = base_color);

Looks like I extracted that information incorrectly. However, even if I had extracted it correctly, it would be almost useless given how little the location type varies, with the vast majority of projects being in towns. At least we've learned that much.

### What's the distribution of campaign countries?

In [None]:
sb.countplot(data = df_notlive, y = 'country', color = base_color, order = df_notlive.country.value_counts().index);

The vast majority of project campaigns are based in the US, followed by other English-speaking countries. There is some visible representation from Northern Europe, Mexico, Western Europe, and Hong Kong.

### What's the distribution of proportion of goal funded?

In [None]:
df_notlive.funded_prop.describe()

In [None]:
sb.boxplot(data = df_notlive, x = 'funded_prop', color = base_color);

In [None]:
over_funded = df_notlive[df_notlive.funded_prop > 
                         df_notlive.funded_prop.quantile(.99)][['goal',
                                                                'pledged',
                                                                'funded_prop']].sort_values('funded_prop')
print(over_funded.shape)
print(over_funded)

Looks like some campaigns had quite the funding success: those past the 99th percentile received more than 18.7 times their goal. In some cases the funding proportion was inflated by a very low campaign goal of a few hundred dollars, and sometimes of ONLY $1.

#### Let's zoom in on the boxplot:

In [None]:
sb.boxplot(data = df_notlive, x = 'funded_prop', color = base_color);
plt.xlim(-.5,5);

#### To see the distribution a bit better, we can employ a histogram:

In [None]:
binsize = 0.1
bins = np.arange(0, 5+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = df_notlive, x = 'funded_prop', bins = bins)
plt.xlabel('Final Funding as Proportion of Goal')
plt.ylabel('Number of Projects')
plt.show()

The distribution of the proportion of the funding goal met is strongly right-skewed and bimodal, with peaks at 0 (or nearly 0, for campaigns which received little or nothing) and at 1 (campaigns that met their goal or surpassed it by a little).

#### What does the area between 0 and 1 look like, when log transformed? This will be exclusive of campaigns that had funding proportions of 0 or 1:

In [None]:
log_binsize = 0.1
bins = 10 ** np.arange(-3, np.log10(0.99)+log_binsize, log_binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = df_notlive, x = 'funded_prop', bins = bins)
plt.xscale('log')
plt.xticks([0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1], [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1])
plt.xlabel('Final Funding as Proportion of Goal')
plt.ylabel('Number of Projects')
plt.show();

There appears to be a good quantity of campaigns that didn't meet their goal, but still had some funding.

#### What does the distribution between the median and 75th percentile look like?

In [None]:
binsize = 0.005
bins = np.arange(df_notlive.funded_prop.median(),
                 df_notlive.funded_prop.quantile(.75)+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = df_notlive, x = 'funded_prop', bins = bins)
plt.xlabel('Final Funding as Proportion of Goal')
plt.ylabel('Number of Projects')
plt.show();

## Bivariate Exploration

### Exploring Sample-Based Correlations of Continuous Quantitative Variables:

In [None]:
# plot matrix: sample 500 projects so that plots are clearer and render faster
# (DataFrame.sample draws existing rows; df_notlive's index has gaps after filtering)
df_samp = df_notlive.sample(n = 500, replace = False)

g = sb.PairGrid(data = df_samp, vars = ['backers_count',
                                        'goal',
                                        'funded_prop', 
                                        'days_to_launch',
                                        'days_to_succeed',])
g = g.map_diag(plt.hist, bins = 50);
g.map_offdiag(plt.scatter);

Nothing too apparent stands out. Most distributions are just reflective of the independent variable's distribution (e.g. backers_count vs days_to_launch resembles the distribution of days_to_launch).

### Check distribution of success of projects split by 'main_cat'

In [None]:
df_notlive['successful'] = np.where(df_notlive.state == 'successful', True, False)

In [None]:
df_notlive[['name','state','successful']].head()

In [None]:
plt.figure(figsize=[15, 5])
ax = sb.countplot(data = df_notlive, x = 'main_cat', hue = 'successful', 
                  order= df_notlive['main_cat'].value_counts().index);
loc, labels = plt.xticks()
ax.set_xticklabels(labels, rotation=45);

n_points = df_notlive['main_cat'].value_counts()
n_cats = len(n_points)
# Bars are stored one hue level at a time, so the category totals repeat every n_cats patches
for i, p in enumerate(ax.patches):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2., height + 5,
            '{:.2f}'.format(height/n_points.iloc[i % n_cats]),
            ha = 'center')
plt.show();

Looks like some categories are very common and successful for Kickstarter campaigns: Music, Technology, Publishing, and Theater. However, even some less common categories have relatively similar success rates, if not better: Comics, Fashion, Film & Video, Art, Photography. A few that are less successful are Food, Crafts, Dance, Games, Design, Journalism.
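
The annotations on the chart divide each bar's height by its category total, so the value over each 'successful = True' bar is that category's success rate. The same rates can be tabulated directly, sketched here on a few invented rows:

```python
import pandas as pd

# Made-up rows standing in for the real data
toy = pd.DataFrame({
    "main_cat": ["Music", "Music", "Music", "Food", "Food", "Comics"],
    "successful": [True, True, False, False, False, True],
})

# The mean of a boolean column within each group is that group's success rate
rates = toy.groupby("main_cat")["successful"].mean().sort_values(ascending=False)
print(rates)
```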

### Check distribution of success of projects among those with the fewest backers:

In [None]:
# Plot countplot that illustrates how many projects have fewer than 10 backers,
# split by whether or not they were successful:

plt.figure(figsize=[8, 5])
ax = sb.countplot(data = df_notlive, x = 'backers_count', 
                  hue = 'successful')
plt.xlabel('Number of Backers')
plt.ylabel('Number of Projects')
plt.xlim(-0.5,9.5)
plt.ylim(0,350)

# add annotations

for p in ax.patches[0:10]:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:.0f}'.format(height),
            ha = 'center')
    
mid_patch = int(len(ax.patches)/2)
mid_patch10 = int(len(ax.patches)/2+10)
for p in ax.patches[mid_patch:mid_patch10]:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:.0f}'.format(height),
            ha = 'center')

plt.show();

Looks like starting at 9 backers, campaigns may have more than a 50% chance of succeeding.

### Check distribution of state of projects relative to them being a Staff Pick:

In [None]:
ax = sb.countplot(data = df_notlive, x = 'state', hue = 'staff_pick', 
                  order = df_notlive['state'].value_counts().index)

# add annotations
n_points = df_notlive['state'].value_counts()
n_states = len(n_points)
# Bars are stored one hue level at a time, so the state totals repeat every n_states patches
for i, p in enumerate(ax.patches[0:7]):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 5,
            '{:.2f}'.format(height/n_points.iloc[i % n_states]),
            ha = 'center')

plt.show();

In [None]:
# Reference: https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
table=pd.crosstab(df_notlive.state,df_notlive.staff_pick)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of State vs Staff Pick')
plt.xlabel('State of Campaign')
plt.ylabel('Proportion of Campaigns');

Almost a fifth of successful campaigns were Staff Picks. However, being a Staff Pick does not guarantee success, as can be seen by the 4% of failed campaigns and 6% of canceled campaigns being Staff Picks.

In [None]:
# Reference: https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
table=pd.crosstab(df_notlive.staff_pick,df_notlive.state)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Staff Pick vs State of Campaign')
plt.xlabel('Staff Pick')
plt.ylabel('Proportion of Campaigns');

Based on this last bar chart, it appears that campaigns that were selected as Staff Pick were more likely to succeed. 
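
The two stacked bar charts answer different questions: the share of Staff Picks within each state, versus the share of each state within Staff Picks. A small sketch on invented rows shows how `normalize="index"` in `pd.crosstab` produces each table without the manual division:

```python
import pandas as pd

# Made-up rows: five successful and five failed campaigns
toy = pd.DataFrame({
    "state": ["successful"] * 5 + ["failed"] * 5,
    "staff_pick": [True, True, False, False, False,
                   False, False, False, False, True],
})

# Share of Staff Picks within each state (each row sums to 1)
picks_within_state = pd.crosstab(toy["state"], toy["staff_pick"], normalize="index")

# Share of each state within Staff Picks and non-picks (each row sums to 1)
state_within_picks = pd.crosstab(toy["staff_pick"], toy["state"], normalize="index")
print(picks_within_state)
print(state_within_picks)
```

In this toy data 40% of successful campaigns are Staff Picks, while two thirds of Staff Picks are successful; the two numbers need not agree.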

### Compare differences in mean values for various variables among Main Project Categories:

In [None]:
var = ['successful','funded_prop','backers_count','staff_pick','main_cat']
df_notlive[var].groupby('main_cat').mean().sort_values('funded_prop')

### Check distribution of Main Project Categories relative to their proportion of being funded:

In [None]:
plt.figure(figsize=[15, 10])
ax = sb.boxplot(x='main_cat', y='funded_prop', data=df_notlive, color=base_color, order = df_notlive['main_cat'].value_counts().index)
loc, labels = plt.xticks()
ax.set_xticklabels(labels, rotation=45);
plt.ylim(-0.25,7);

Comparing each boxplot's median to the "1" mark on the y-axis shows whether at least half of the projects in that Main Category met their goal. Since meeting the funding goal (a funding proportion of at least 1) defines a successful Kickstarter campaign, this tells us whether the majority of campaigns in each category succeeded or failed.
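
The link between the median and the majority can be illustrated numerically: a median of at least 1 means at least half of a category's campaigns reached their goal. A minimal sketch with invented funding proportions:

```python
import pandas as pd

# Made-up funding proportions for two hypothetical categories
toy = pd.DataFrame({
    "main_cat": ["Music"] * 5 + ["Food"] * 5,
    "funded_prop": [0.1, 1.0, 1.2, 1.5, 3.0,
                    0.0, 0.05, 0.3, 1.1, 2.0],
})

# Median funding proportion and share of campaigns at or above their goal
summary = toy.groupby("main_cat")["funded_prop"].agg(
    median="median",
    share_funded=lambda s: (s >= 1).mean(),
)
print(summary)
```

Here the toy "Music" category has a median above 1 and most campaigns funded, while "Food" has a median below 1 and fewer than half funded.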

### Compare differences in median values for various variables among Main Project Categories:

In [None]:
df_notlive['avg_donation'] = df_notlive['usd_pledged'] / df_notlive['backers_count']
df_notlive['avg_donation'] = df_notlive['avg_donation'].fillna(0)

var = ['successful','funded_prop','avg_donation','backers_count','staff_pick','main_cat']
df_notlive[var].groupby('main_cat').median().sort_values('avg_donation')

Here we confirmed that Crafts, Food, Dance, Games, and even Design are categories that have difficulty getting more than half of their projects funded. For Crafts, Food, and Dance, it appears that a low backer count is a contributing factor.

We can explore even further within a Main Category - to see the boxplot distributions of their Subcategories. For instance, let's take Technology:

### Check distribution of Technology Project Subcategories relative to their proportion of being funded:

In [None]:
df_sub_tech = df_notlive[df_notlive['main_cat']=='Technology'].copy()

plt.figure(figsize=[15, 15])
ax = sb.boxplot(x='sub_cat', y='funded_prop', data=df_sub_tech, color=base_color, order = df_sub_tech['sub_cat'].value_counts().index)
plt.yticks([1,2,3,4,5,6,7,8,9,10,11,12,13], [1,2,3,4,5,6,7,8,9,10,11,12,13])
loc, labels = plt.xticks()
ax.set_xticklabels(labels, rotation=45);
plt.ylim(-0.25,13);

Here we can see that not all technology subcategories are equally successful. Those subcategories which appear to acquire much more funding than requested (and have more than 75% of projects meeting goals) include Gadgets, DIY Electronics, Technology (a bit vague, no?), Sound, and 3D Printing. Apps, Web, Makerspaces, and Space Exploration also have the majority (if not all) of their projects being successful, however not by the same margins. The subcategories which are more likely to not meet their goals are Wearables, Software, Fabrication Tools, and Flight.

In [None]:
var = ['successful','funded_prop','backers_count','staff_pick','sub_cat']
df_sub_tech[var].groupby('sub_cat').mean().sort_values('funded_prop')

## Multivariate Exploration

### Explore spread of projects by their Final State, relative to Proportion of Goal Funded and Number of Backers:

In [None]:
sb.pairplot(x_vars=['funded_prop'], y_vars=['backers_count'], data=df_notlive, hue = 'state', height = 6);
plt.xlim([0,50]);
plt.ylim([0,12000]);

#### Let's zoom in:

In [None]:
sb.pairplot(x_vars=['funded_prop'], y_vars=['backers_count'], data=df_notlive, hue = 'state', height = 6);
plt.xlim([0,4]);
plt.ylim([0,1800]);

This displays what is fairly expected: campaigns that met or exceeded their goal (proportion funded >= 1) are successful. Those that were not able to do so are not.

### Explore spread of Technology projects by their Success, relative to Average Donation and Number of Backers:

In [None]:
g = sb.pairplot(x_vars=['avg_donation'], y_vars=['backers_count'], 
                data=df_notlive[df_notlive['main_cat']=='Technology'], 
                hue = 'successful', height = 6);
plt.xlim([0,500]);
plt.ylim([0,1200]);
plt.show();

### Explore spread of Game projects by their Success, relative to Average Donation and Number of Backers:

In [None]:

g = sb.pairplot(x_vars=['avg_donation'], y_vars=['backers_count'], 
                data=df_notlive[df_notlive['main_cat']=='Games'], 
                hue = 'successful', height = 6);
plt.xlim([0,200]);
plt.ylim([0,600]);
plt.show();

## Conduct Logistic Regression:

#### Assessing effect on success of Staff Pick, number of backers, and average donation.

In [None]:
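# dtype = int keeps the dummy columns numeric, which the Logit model expects
df_notlive[['successful_f','successful_t']] = pd.get_dummies(df_notlive['successful'], dtype = int)
df_notlive[['staff_pick_f','staff_pick_t']] = pd.get_dummies(df_notlive['staff_pick'], dtype = int)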

import statsmodels.api as sm
df_notlive['intercept'] = 1
logit_mod = sm.Logit(df_notlive['successful_t'], df_notlive[['intercept', 'backers_count', 'avg_donation','staff_pick_t']])
results = logit_mod.fit()
results.summary()

In [None]:
np.exp(0.0360)

In [None]:
np.exp(0.0021)

For each additional backer, the odds of success are multiplied by about 1.037, holding all else constant.

For each additional dollar in the average donation, the odds of success are multiplied by about 1.0021, holding all else constant.

Staff Pick is not a statistically significant predictor of project success.
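
As a short worked example of the arithmetic: exponentiating a logistic regression coefficient gives the multiplicative change in the odds for a one-unit increase, and these changes compound multiplicatively. The two coefficient values are the ones exponentiated in the cells above; the ten-backer case is an added illustration:

```python
import math

# Coefficients as exponentiated in the cells above
coef_backers = 0.0360
coef_avg_donation = 0.0021

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
or_backers = math.exp(coef_backers)
or_donation = math.exp(coef_avg_donation)

# Effects compound multiplicatively: ten extra backers multiply the odds by exp(10 * b)
or_ten_backers = math.exp(10 * coef_backers)

print(round(or_backers, 4), round(or_donation, 4), round(or_ten_backers, 3))
```

Note that these are changes in odds, not in probability; the change in probability depends on the starting point.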

In [None]:
df_notlive.to_csv('final_dataset.csv', index=False)