## Business Questions
* What is the most important feature (if any) in the success of a Kickstarter project?  
* What category of project would most likely succeed (Art, Photography, Tech etc.)?  
* Are there any discernible differences between successful and failed projects?   

* Is there a best month? Or best time to launch a project to increase chances of success?  
* Is there a higher chance of success if I have many small backers or few big ones?
* Do I raise more if I have many small backers or a few big ones? 
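
The launch-month question could be approached by grouping projects by the month of `launched` and averaging a binary `success` flag. The sketch below assumes a `success` column like the one engineered later in this notebook; the `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the Kickstarter data; the values are invented for illustration.
toy = pd.DataFrame({
    "launched": pd.to_datetime([
        "2016-01-05", "2016-01-20", "2016-03-02",
        "2016-03-15", "2016-03-30", "2016-07-11",
    ]),
    "success": [1, 0, 1, 1, 0, 0],
})

# Share of successful projects per launch month (1 = January, ..., 12 = December).
launch_month = toy["launched"].dt.month
success_rate_by_month = toy.groupby(launch_month)["success"].mean()
print(success_rate_by_month)
```

On the real data, months with few launches would give noisy rates, so it is probably worth looking at the project count per month alongside the rate.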

## Data set description & understanding
* General sample size?
* Percentage of Kickstarter projects that succeed? 


* `usd pledged`: has missing values. It is Kickstarter's own figure for the pledged amount in USD, but the same information is already contained in `usd_pledged_real`, so this column is not needed.
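
The questions above (sample size, share of successful projects, missing values in `usd pledged`) can each be answered with a one-liner. This is a minimal sketch on a made-up `toy` frame that only borrows the dataset's column names; the numbers are not real results.

```python
import pandas as pd

# Made-up rows using the dataset's column names.
toy = pd.DataFrame({
    "usd_goal_real":    [1000.0, 5000.0, 250.0, 20000.0],
    "usd_pledged_real": [1500.0, 1200.0, 250.0, 300.0],
    "usd pledged":      [1500.0, None, 250.0, 300.0],
})

# General sample size: number of projects (rows).
n_projects = len(toy)

# Percentage of projects that reached their goal.
success = toy["usd_pledged_real"] >= toy["usd_goal_real"]
pct_success = success.mean() * 100

# Number of missing values in the redundant 'usd pledged' column.
n_missing = toy["usd pledged"].isnull().sum()

print(n_projects, pct_success, n_missing)
```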

## Preprocessing the data

### Transform data

In [None]:
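#Transform deadline to datetime
#df is assumed to be the Kickstarter dataset already loaded into a DataFrame
import pandas as pd

df['deadline'] = pd.to_datetime(df['deadline'])
df['deadline'].sort_values().head()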

In [None]:
#Alternatively
data_kick['deadline'] = pd.to_datetime(data_kick['deadline'],
format='%Y-%m-%d %H:%M:%S')

In [None]:
#convert to common datetime format
df.launched = pd.to_datetime(df.launched)

In [None]:
#Alternatively
data_kick['launched'] = pd.to_datetime(data_kick['launched'],
format='%Y-%m-%d %H:%M:%S')

In [None]:
#Remove spaces from column names (e.g. 'usd pledged' becomes 'usdpledged')
df.columns = [s.replace(' ','') for s in df.columns]

### Explore data

In [None]:
#look at null values
df.isnull().sum()

In [None]:
#Check out counts of the project sub-categories
df.category.value_counts()

In [None]:
#Check out counts of the main project categories
df.main_category.value_counts()

In [None]:
#main_category will help us keep dimensionality low after one hot encoding
print(len(df.category.unique()))
print(len(df.main_category.unique())) 

In [None]:
#Check if ID column is unique
df.shape[0]==len(df.ID.unique())

In [None]:
#investigate data per country
df.country.value_counts()
#Recode the malformed country value 'N,0"' as 'NO'
df.country = df.country.replace(to_replace='N,0"', value='NO')

### Feature engineering

In [None]:
#Engineer a 'success' variable from information already in the dataset.
#A project counts as successful if it raised at least as much as its stated goal.
#This is also how Kickstarter defines success.
df['success'] = (df.usd_goal_real <= df.usd_pledged_real).astype(int)
df.success.describe()

In [None]:
#Keep only successful and failed projects.
#This makes later comparisons clearer
data_kick = data_kick.loc[data_kick['state'].isin(
            ['successful', 'failed'])]

In [None]:
#Engineer a duration variable
#A duration feature shows whether the timeline of a project influences its chances of success.
#The dataset contains the date and time a project was launched and its fundraising deadline,
#so the planned length of the campaign in whole hours follows directly.
#total_seconds() is used because astype('timedelta64[h]') is not supported by recent pandas versions.
df['duration'] = (df.deadline - df.launched).dt.total_seconds() // 3600
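
As a quick sanity check of the duration calculation, here is a small worked example. The `toy` frame and its dates are invented for illustration; only the column names `launched` and `deadline` come from the dataset.

```python
import pandas as pd

# Two invented campaigns: one lasting 29.5 days, one lasting 30 days.
toy = pd.DataFrame({
    "launched": pd.to_datetime(["2017-01-01 12:00:00", "2017-02-01 00:00:00"]),
    "deadline": pd.to_datetime(["2017-01-31 00:00:00", "2017-03-03 00:00:00"]),
})

# Subtracting two datetime columns gives a timedelta column;
# total_seconds() // 3600 converts it to whole hours.
toy["duration_hours"] = (toy["deadline"] - toy["launched"]).dt.total_seconds() // 3600
print(toy["duration_hours"].tolist())
```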

In [None]:
#One hot encode categorical variables, dropping unneeded variables
#The categorical variables main_category and country are one hot encoded.
#This means that each value of each feature is represented as its own binary column.
#For example, a project that originated in the U.S. gets a value of 1 in country_US and 0 in all other country columns.
#This quantifies non-numerical values so that models can work with them.
#Note: the column is 'usdpledged' here because spaces were removed from the column names above.
df_encoded = pd.get_dummies(df.drop(labels=['name', 'launched', 'deadline',
                                            'category', 'currency', 'usdpledged', 'pledged',
                                            'ID', 'goal'], axis=1), 
                            columns=['main_category', 'country'])
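
To make the effect of one hot encoding concrete, the following sketch applies `pd.get_dummies` to three invented rows. The `toy` frame is hypothetical; the column names follow the dataset.

```python
import pandas as pd

# Three invented projects with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "main_category": ["Art", "Technology", "Art"],
    "country": ["US", "GB", "US"],
    "backers": [10, 250, 3],
})

# Each distinct value becomes its own binary column named <column>_<value>;
# columns that are not listed (here 'backers') are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["main_category", "country"])
print(encoded.columns.tolist())
```

Encoding `main_category` rather than `category` keeps the number of new columns small, which is the dimensionality argument made in the exploration step.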



In [None]:
#Add a variable that shows the average pledge for each project
#The average amount contributed per backer gives some insight into whether to encourage many small contributions or a few larger ones.
#Adding 1 to backers avoids division by zero for projects without backers.
df_encoded['average_backing'] = (df_encoded['usd_pledged_real']/(df_encoded['backers']+1))
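
A small worked example of the average-backing feature, using invented numbers: two projects raise the same amount, one from many small backers and one from a few large ones, and a third project has no backers at all.

```python
import pandas as pd

# Invented projects: many small backers, few big backers, no backers.
toy = pd.DataFrame({
    "usd_pledged_real": [1000.0, 1000.0, 0.0],
    "backers": [99, 4, 0],
})

# Same formula as above; the +1 keeps the zero-backer row from dividing by zero.
toy["average_backing"] = toy["usd_pledged_real"] / (toy["backers"] + 1)
print(toy["average_backing"].tolist())
```

Because of the +1, the value slightly understates the true mean pledge, most noticeably for projects with very few backers.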

## Data analysis

In [None]:
#Helper for plotting a distribution in percentage points
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

def percent_plot(data, title):
    '''
    INPUT: data- data of which to graph distribution
            title- graph title
    OUTPUT: Distribution of Data by Percentage Points
    '''
    ax = data.plot(kind='bar')
    plt.title(title)
    ax.yaxis.set_major_formatter(PercentFormatter())
    plt.show()

In [None]:
#Kickstarter Projects by Success
percent_plot((df_encoded.success.value_counts()/df.shape[0]*100),
             "Kickstarter Projects by Success")

In [None]:
#Kickstarter projects by country
percent_plot((df.country.value_counts()/df.shape[0]*100), "Kickstarter Projects by Country")

In [None]:
#Successful projects by country
percent_plot((df[df.usd_pledged_real>=df.usd_goal_real].country.value_counts()/
              df[df.usd_pledged_real>=df.usd_goal_real].shape[0]*100), 
             "Successful Kickstarter Projects by Country")

In [None]:
#Kickstarter projects by category
percent_plot((df.main_category.value_counts()/df.shape[0]*100), 
             "Kickstarter Projects by Category")

In [None]:
#categories of successful projects
percent_plot((df[df.usd_pledged_real>=df.usd_goal_real].main_category.value_counts()/
              df[df.usd_pledged_real>=df.usd_goal_real].shape[0]*100), 
             "Successful Kickstarter Projects by Category")

In [None]:
#Relationship between a project’s goal and the actual amount pledged.
#A project is labelled as successful if amount pledged ≥ goal
#and as failed if amount pledged < goal.
import seaborn as sns

#define colors (darkgreen for successful projects and darkred for failed ones)
#a dict ties each color to its state regardless of the order in the data
colors = {'successful': 'darkgreen', 'failed': 'darkred'}
#create a plot using seaborn, adjust data to millions
#x and y are passed as keywords, which recent seaborn versions require
ax = sns.scatterplot(x=data_kick.usd_pledged_real/1e6, 
                     y=data_kick.usd_goal_real/1e6, hue=data_kick.state, palette=colors)
#add blue line to better visualize the border between failed and successful projects
sns.lineplot(x=[0,50], y=[0,50], color='darkblue', ax=ax)
#set the axes from -1 to their maximum (-1 looks better than 0 actually)
ax.set(ylim=(-1,None), xlim=(-1,None))
#set labels and title
ax.set(xlabel='Amount Pledged in Millions', ylabel='Goal in Millions', title= 'Goal vs. Pledged')

In [None]:
#Duration of successful projects
plt.hist(df_encoded[df_encoded.success==1].duration, bins=20)
plt.title('Successful Project Duration')
plt.xlabel('# of hours')
plt.ylabel('# of projects');

In [None]:
#Summary stats on the duration of successful projects
df_encoded[df_encoded.success==1].duration.describe()

In [None]:
#Summary stats on the duration of unsuccessful projects
df_encoded[df_encoded.success==0].duration.describe()

In [None]:
#Summary stats on the amount of money raised by successful projects
df_encoded[df_encoded.success==1].usd_pledged_real.describe()

In [None]:
#Summary stats on the average pledge of successful projects
df_encoded[df_encoded.success==1].average_backing.describe()

In [None]:
#Correlation matrix
def corr_plot(features, fig_size):
    '''
    INPUT: features- which columns of df_encoded to calculate correlation
            fig_size- size of the correlation heatmap for ease of reading
    OUTPUT: Seaborn Heatmap of Correlations
    '''
    corr=df_encoded[features].corr()
    fig, ax = plt.subplots(figsize=fig_size)
    sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, 
            ax=ax, linewidths=0.01);

In [None]:
#correlation between success, duration, goal and pledged money
#('cancelled' is left out because no column of that name is created in this notebook)
corr_plot(['success','duration', 'usd_goal_real', 
                        'usd_pledged_real', 'backers', 'average_backing'], (5,5))

In [None]:
#correlation between success and country of origin for project
corr_plot(['success', 'country_AT', 'country_AU', 'country_BE',
       'country_CA', 'country_CH', 'country_DE', 'country_DK', 'country_ES',
       'country_FR', 'country_GB', 'country_HK', 'country_IE', 'country_IT',
       'country_JP', 'country_LU', 'country_MX', 'country_NL', 'country_NO',
       'country_NZ', 'country_SE', 'country_SG', 'country_US'], (10,10))

In [None]:
#correlation between sphere of project and success
corr_plot(['success','main_category_Art', 'main_category_Comics',
       'main_category_Crafts', 'main_category_Dance', 'main_category_Design',
       'main_category_Fashion', 'main_category_Film & Video',
       'main_category_Food', 'main_category_Games', 'main_category_Journalism',
       'main_category_Music', 'main_category_Photography',
       'main_category_Publishing', 'main_category_Technology',
       'main_category_Theater'], (10,10))

## ML Classification Model

* Random forest classifier as base? 
* Grid search to optimize random forest classifier?
* Answer to the question: which features were most important in model?
* https://github.com/mkucz95/kickstarter_data/blob/master/kickstarter_data.ipynb
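
The plan above (random forest baseline, grid search, feature importances) could look roughly like the following sketch. It runs on synthetic data: the column names mirror `df_encoded`, but the values, the label rule, and the parameter grid are assumptions made only so the example runs on its own. On the real data, `usd_pledged_real`, `backers` and `average_backing` would probably need to be excluded, since `success` is derived directly from the pledged amount.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic features that only borrow column names from df_encoded.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "duration": rng.uniform(24, 1440, n),
    "usd_goal_real": rng.lognormal(8, 1, n),
    "main_category_Art": rng.integers(0, 2, n),
})
# Synthetic label (an assumption for illustration): low-goal projects succeed.
y = (X["usd_goal_real"] < X["usd_goal_real"].median()).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Small, arbitrary grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Which features mattered most in the best model?
best_model = search.best_estimator_
importances = pd.Series(best_model.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
test_accuracy = best_model.score(X_test, y_test)

print(search.best_params_)
print(importances)
print(test_accuracy)
```

Impurity-based importances from a random forest sum to 1 and tend to favor continuous features, so they are best read as a rough ranking rather than an exact measure.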