# KICKSTARTER CAMPAIGN SUCCESS PREDICTION

* Date Published: 2019/10/18
* Data Source: https://webrobots.io/kickstarter-datasets/

## Introduction 
Kickstarter is a US-based global crowdfunding platform focused on bringing funding to creative projects. Since the platform’s launch in 2009, the site has hosted over 159,000 successfully funded projects with over 15 million unique backers. Kickstarter uses an “all-or-nothing” funding system. This means that funds are only disbursed for projects that meet the original funding goal set by the creator.

### Project Objective
Kickstarter earns a 5% commission on projects that are successfully funded. Currently, fewer than 40% of projects on the platform succeed. The objective is to predict which projects are likely to succeed so that these projects can be highlighted on the site through either 'staff picks' or 'featured product' lists.

### Proposed Solutions
* Predict successful campaigns and promote those with the lowest predicted probability of being successful.
* Contact creators from those campaigns that are just below the “success” margin and give them insights that could help them succeed.
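
The second idea can be sketched as a simple filter on predicted probabilities. This is a minimal illustration, not the project's implementation: the function name `near_miss_campaigns`, the 0.5 decision threshold, and the 0.10 margin are all assumptions chosen for the example, and the probabilities are made-up values.

```python
import pandas as pd


def near_miss_campaigns(proba, threshold=0.5, margin=0.10):
    """Return campaigns whose predicted success probability falls just below the threshold.

    proba: pandas Series of predicted probabilities indexed by campaign id.
    threshold and margin are illustrative assumptions, not values from the project.
    """
    lower = threshold - margin
    mask = (proba >= lower) & (proba < threshold)
    # Sort so the campaigns closest to the threshold come first.
    return proba[mask].sort_values(ascending=False)


# Made-up probabilities for five hypothetical campaigns.
example_proba = pd.Series([0.92, 0.47, 0.41, 0.12, 0.55], index=[101, 102, 103, 104, 105])
near_miss = near_miss_campaigns(example_proba)
print(near_miss)
```

In practice the margin would likely be tuned to how many creators the team can realistically contact.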

In [8]:
import functools
import glob
import io
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd 
import seaborn as sns
import sys

src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)

pd.set_option('display.max_columns', None)

from d02_processing.intermediate_cleaning import kickstarter_deduped_to_intermediate

## EXPLORATORY DATA ANALYSIS

**Here put the function used to deduplicate the original dataset. Also tell them what datasets you took originally from the website**

In [2]:
kick_deduped = pd.read_csv('../../data/02_intermediate/kick_deduped.csv.zip')

In [9]:
kick_inter = kickstarter_deduped_to_intermediate(kick_deduped)

In [11]:
kick_inter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 332899 entries, 0 to 332898
Data columns (total 42 columns):
backers_count               332899 non-null int64
blurb                       332889 non-null object
converted_pledged_amount    195183 non-null float64
country                     332899 non-null object
created_at                  332899 non-null datetime64[ns]
currency                    332899 non-null object
currency_symbol             332899 non-null object
currency_trailing_code      332899 non-null bool
current_currency            195183 non-null object
deadline                    332899 non-null datetime64[ns]
disable_communication       332899 non-null bool
friends                     1629 non-null object
fx_rate                     185035 non-null float64
goal                        332899 non-null float64
id                          332899 non-null int64
is_backing                  1629 non-null object
is_starrable                206127 non-null object
is_starred   

Our current dataset has 42 columns. We first need to narrow down the columns we want to work with. Since we are trying to predict whether a Kickstarter campaign will succeed or fail, we must not use any features that contain "future information" (e.g., number of backers or amount pledged), because these features could act as proxies for our target variable.
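
To see why such columns are proxies for the target, recall the all-or-nothing rule: a campaign succeeds when the amount pledged reaches the goal. The toy example below uses made-up numbers and hypothetical column names (`pledged`, `state`) to show that a single post-campaign column can reproduce the outcome, which is what a model would learn if that column were left in.

```python
import pandas as pd

# Made-up toy data; the column names `pledged` and `state` are assumptions for illustration.
toy = pd.DataFrame({
    "goal": [1000.0, 5000.0, 250.0, 20000.0],
    "pledged": [1500.0, 1200.0, 250.0, 300.0],
    "state": ["successful", "failed", "successful", "failed"],
})

# Under all-or-nothing funding, the outcome follows directly from pledged vs. goal.
rule_prediction = (toy["pledged"] >= toy["goal"]).map({True: "successful", False: "failed"})

# Fraction of campaigns where the leaked rule matches the label.
leak_accuracy = (rule_prediction == toy["state"]).mean()
print(leak_accuracy)
```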

After taking a look through our data dictionary (located in the reference folder in the repository), we have identified 25 columns that we need to drop before building our model.

**Columns to Drop:**

1. **backers_count**: This is the number of people who backed the project. This column contains "future information" and could act as a proxy for our target variable.
1. **blurb**: This is a short description of the project. We created a new column (blurb count) and will drop this feature.
1. **currency_symbol**: This feature is redundant with the currency feature.
1. **currency_trailing_code**: This feature is redundant with the currency feature.
1. **converted_pledged_amount**: This feature contains the amount of money that has been pledged to the campaign. This feature contains "future information" and could be used as a proxy for the target variable.
1. **current_currency**: This column is redundant with the currency column.
1. **friends**: This column is ~99% empty.
1. **id**: Unique identifier for the campaign. Will need to be dropped before training the model.
1. **name**: Unique identifier for the campaign. Will need to be dropped before training the model.
1. **is_backing**: This column is ~99% empty.
1. **is_starrable**: This column contains "future information" regarding how successful Kickstarter believes the campaign will be.
1. **permissions**: This column is ~99% empty.
1. **slug**: This column is redundant with name.
1. **source_url**: This is not needed for model building.
1. **spotlight**: This column contains "future information" regarding how successful Kickstarter believes the campaign will be.
1. **staff_pick**: This column contains "future information" regarding how successful Kickstarter believes the campaign will be.
1. **unread_message_count**: This column is empty.
1. **unseen_activity_count**: This column is empty.
1. **URL**: This is not needed for model building.
1. **usd_pledged**: Redundant with the currency column.
1. **country**: Redundant with the currency column (does not actually reflect where the campaign is).
1. **creator_name**: Unnecessary information.
1. **creator_slug**: Unnecessary information.
1. **disable_communication**: False for all campaigns that have ended.
1. **last_update_published_at**: Column is empty.
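
The drop step could look roughly like the sketch below. The helper names `missing_fraction` and `drop_unused_columns` are hypothetical, the column list is abbreviated to names that appear in the `info()` output above, and `errors="ignore"` is used so the call does not fail if a column is already absent. The small frame is made-up data standing in for `kick_inter`.

```python
import pandas as pd

# Abbreviated list; the full set of 25 columns is described above.
COLUMNS_TO_DROP = [
    "backers_count", "blurb", "converted_pledged_amount", "currency_symbol",
    "currency_trailing_code", "current_currency", "friends", "id", "is_backing",
    "is_starrable", "disable_communication",
]


def missing_fraction(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def drop_unused_columns(df, columns=COLUMNS_TO_DROP):
    """Return a copy of df without the listed columns (hypothetical helper)."""
    return df.drop(columns=columns, errors="ignore")


# Made-up stand-in for kick_inter.
demo = pd.DataFrame({
    "backers_count": [10, 0, 3],
    "friends": [None, None, None],
    "goal": [1000.0, 500.0, 250.0],
    "currency": ["USD", "GBP", "USD"],
})
print(missing_fraction(demo))
demo_model = drop_unused_columns(demo)
print(list(demo_model.columns))
```

Checking the missing fraction first gives a quick way to confirm the "~99% empty" claims before the columns are removed.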

Let's take a closer look at our remaining 12 columns:

1. **created at**: (datetime)
1. **currency**: (categorical)
1. **deadline**: (datetime)
1. **fx rate (exchange rate)**: (quantitative)
1. **goal**: (quantitative)
1. **launched at**: (datetime)
1. **sub category**: (categorical)
1. **overall category**: (categorical)
1. **city**: (categorical)
1. **country loc**: (categorical)
1. **state loc**: (categorical)