# Kickstarter project
## Our project for this semester is to predict whether a fundraising campaign on Kickstarter will succeed.

This type of prediction can be useful in several scenarios: for an entrepreneur evaluating their chances, for Kickstarter itself when choosing which promising campaigns to promote, or for an investor considering backing a company.

There are a few datasets available on Kaggle, such as [this one](https://www.kaggle.com/codename007/funding-successful-projects) and [this one](https://www.kaggle.com/kemical/kickstarter-projects). These datasets cover a shorter time span and contain less detail. The dataset that we used in our project is offered [here](https://webrobots.io/kickstarter-datasets/). It is very large and somewhat messy, so our first steps are devoted to getting to know this dataset and cleaning it up so we can use it easily.

The data is scraped over different periods; the last scrape is from Nov 2019 and contains 57 very large CSV files. Our first step is to unify all of it (scrapes from 2015 onwards, each containing about 100,000 records, with a lot of overlap) into a single dataframe and explore the columns.
Due to size limitations, we added an extra step here and removed duplicates and live projects (which are about 10% of the data, but are useless for prediction). Otherwise, the resulting dataframe might be too big to fit into memory.
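
The unification step can be sketched roughly as follows. This is a minimal illustration and not the actual `data_loader` code: the helper name `unify_scrapes` and the `id` column are assumptions, and we assume the scrapes are passed from oldest to newest.

```python
import pandas as pd

def unify_scrapes(frames, id_col="id", state_col="state"):
    """Concatenate scrape dataframes, drop live projects and keep one row per project.

    `id_col` is an assumed name for the project identifier column.
    """
    merged = pd.concat(frames, ignore_index=True)
    # Live projects have no final outcome yet, so they cannot be used as labels.
    merged = merged[merged[state_col] != "live"]
    # Later scrapes come later in `frames`, so keep the most recent copy of each project.
    merged = merged.drop_duplicates(subset=id_col, keep="last")
    return merged.reset_index(drop=True)

# Toy example with two overlapping "scrapes".
older = pd.DataFrame({"id": [1, 2], "state": ["live", "failed"]})
newer = pd.DataFrame({"id": [1, 3], "state": ["successful", "live"]})
unified = unify_scrapes([older, newer])
print(unified)
```

Applying the same filtering after each file is read, rather than once at the end, would keep the intermediate dataframe small.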


In [None]:
import kickstarter
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # some seaborn plots emit warnings; known issue
%matplotlib inline
%load_ext autoreload
%autoreload 2
pd.options.display.max_columns = None
pd.set_option('display.max_rows', None)

This step automatically downloads the cleaned dataset as a pickle and extracts it. It is also possible to build the dataset yourself by passing the argument cache=None, but this is a lengthy process that might take a few hours (downloading about 50 generations of the dataset, each about 1GB, and uniting them). Once this pickle is on your computer, it will be loaded automatically from its location.

### Note that this step requires internet connectivity and will download up to 1.5GB of data to your computer.

# Phase 1: Loading the data

In [None]:
from kickstarter import data_loader as dl

In [None]:
df = dl.make_dataframe(path=r'rawData')  # files are assumed to be in the rawData subdirectory; the pickle is cached in the cwd
df.head()

Great! Let's get a few details about this data: What are the features, how many records exist:

In [None]:
cols = list(df.columns.values)
print(pd.Series(cols))
num_recs = len(df.index)
print('There are originally {} records in data'.format(num_recs))

Taking a first peek at the data via Excel hints that there are many empty columns:
![peek](img/firstPeek.png)

Let's see what columns contain mostly null values:


# Phase 2: Cleaning the data

## Removing Na

In [None]:
nes = df.isna().sum()
nes.sort_values(ascending=False, inplace=False)
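
To make "mostly null" concrete, the share of missing values per column can be compared against a cutoff. The sketch below is only an illustration on a toy dataframe; the helper name `mostly_null_columns` and the 0.5 cutoff are assumptions, not values used in this project.

```python
import numpy as np
import pandas as pd

def mostly_null_columns(frame, min_share=0.5):
    """Return the share of missing values for columns that are at least `min_share` empty."""
    share = frame.isna().mean()  # fraction of NaN values per column
    return share[share >= min_share].sort_values(ascending=False)

toy = pd.DataFrame({
    "friends": [np.nan, np.nan, np.nan],
    "goal": [100, np.nan, 300],
    "name": ["a", "b", "c"],
})
mostly_null = mostly_null_columns(toy)
print(mostly_null)
```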

We're not missing anything too important so far (some of these fields sound important, but they are either unused or interchangeable with other fields that we keep). Off with their heads!

In [None]:
empty = {'friends','is_backing','is_starred','is_starrable','permissions', 'source_url'
         ,'country_displayable_name','converted_pledged_amount','current_currency',
         'usd_type','fx_rate', 'has_more','last_update_published_at','projects','search_url',
         'seed','staff_pick','total_hits','unread_messages_count','unseen_activity_count'}
letgo = [name for name in empty if name in df.columns]  # checks the current columns, so the cell is safe to rerun
df.drop(columns=letgo,inplace=True)
nes = df.isna().sum()
nes.sort_values(ascending=False, inplace=False)

## Removing redundant columns
Having dropped the mostly empty columns (such as 'friends', 'is_backing', 'is_starred' and 'permissions'), we can already see redundant attributes which we are sure we will not need:
- Data that is used for display purposes, such as 'currency_symbol' and 'currency_trailing_code'.
- Data that is biased, such as the backers count (this is part of what we predict), or 'disable_communication', which is an option for failed projects.
- Data that will not be used by our model: 'profile', 'urls', 'usd_type' and 'location' (we keep 'location' for now, as we use it later to fix the country column).

Let's start with dropping these.

In [None]:
redundant = {'backers_count','currency_symbol', 'currency_trailing_code','disable_communication',
             'profile','urls','spotlight','usd_pledged'}
letgo = [name for name in redundant if name in df.columns]
df.drop(columns=letgo, inplace=True)
cols = list(df.columns.values)
print(pd.Series(cols))

# Phase 3: Baseline Model

### <span style="color:red">TODO: insert baseline model here</span>

# Phase 4: Feature Extraction

In [None]:
from kickstarter import feature_extraction as fe

## Converting dates
Looking at the data, we can also see that the time fields are given in Unix time. It will be useful later if we break each date into a day, month, year trio. We'll run the conversion and replace each column with the corresponding 3 fields.

In [None]:
timefields = ['created_at','deadline','launched_at','state_changed_at']
fe.convert_time(df,timefields)
df[timefields].head()
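
For readers without the `feature_extraction` module at hand, the conversion of a Unix timestamp into a day, month, year trio could look roughly like this. It is a sketch and not the code of `fe.convert_time`: the helper name `add_date_parts` and the `_day`, `_month`, `_year` suffixes are assumptions, and the timestamps are assumed to be in seconds.

```python
import pandas as pd

def add_date_parts(frame, fields):
    """Add <field>_day, <field>_month and <field>_year columns from Unix timestamps in seconds."""
    for field in fields:
        stamps = pd.to_datetime(frame[field], unit="s")
        frame[field + "_day"] = stamps.dt.day
        frame[field + "_month"] = stamps.dt.month
        frame[field + "_year"] = stamps.dt.year
    return frame

sample = pd.DataFrame({"launched_at": [1572566400]})  # 2019-11-01 00:00:00 UTC
add_date_parts(sample, ["launched_at"])
print(sample)
```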

Another inconvenience in this dataset is that some of the fields are given in JSON form, specifically the 'category' and 'creator' attributes. We'll parse just the interesting parts out of these fields and remove all the bloat text.

In [None]:
fe.extract_catagories(df)  # extracts the project category data

One last thing that remains is to convert the goal amount, which is given in the project's local currency, to USD.
Once this is done we no longer need the static usd column (it is dropped by the function). We will also parse the project photo URL for future use.

In [None]:
fe.convert_goal(df)
fe.get_image_url(df)
df.head()

Cool. Looks like our data is relatively balanced, and projects in our dataset are almost equally likely to fail or succeed. Now let's take a look at the creator column. This is a JSON field that is (as usual with this dataset) filled with illegal JSON strings; we'll fix it and extract the creator id.

In [None]:
fe.extract_creator_id(df) #replaces the creator json with creator id int, un
df[["creator_id", "creator"]].head()
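
One way to recover an id from strings that are not valid JSON is to try a proper parse first and fall back to a regular expression. The sketch below illustrates that idea and is not the implementation of `fe.extract_creator_id`: the helper name `parse_creator_id` and the `"id"` key are assumptions, while the -1 sentinel follows the convention used in this notebook.

```python
import json
import re

# Matches an integer "id" field anywhere in the raw string.
ID_PATTERN = re.compile(r'"id"\s*:\s*(\d+)')

def parse_creator_id(raw):
    """Return the creator id from a JSON string, or -1 if it cannot be recovered."""
    if not isinstance(raw, str):
        return -1
    try:
        return int(json.loads(raw)["id"])
    except (ValueError, KeyError, TypeError):
        # Fall back to a regex for strings that are not valid JSON,
        # for example because of unescaped quotes inside the creator's name.
        match = ID_PATTERN.search(raw)
        return int(match.group(1)) if match else -1

valid_id = parse_creator_id('{"id": 42, "name": "Ann"}')
broken_id = parse_creator_id('{"id":7,"name":"Joe "The Maker" Smith"}')
missing_id = parse_creator_id('garbage')
print(valid_id, broken_id, missing_id)
```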

Values that we could not fix got a creator id of -1. Let's see if there are many of these, and if not we'll just drop them.

In [None]:
baddies = df.loc[df['creator_id'] == -1]
print('Number of bad projects dropped due to irregular creator field: {}'.format(len(baddies)))
df.drop(baddies.index, inplace=True)

Let's put this field to some more good use and extract the user's profile picture and whether they are a registered Kickstarter user.

In [None]:
fe.extract_creator_fields(df)
print(df['creator_status'].value_counts())
print(df['super_creator'].value_counts())

Sadly, the information about the users is missing. We'll drop these two new columns.

In [None]:
cols = list(df.columns.values)
bad = {'creator_status','super_creator'}
letgo = [name for name in bad if name in df.columns]
df.drop(columns=letgo,inplace=True)

In [None]:
from kickstarter import visio

Now, let's take a look at how our data is distributed globally.
Projects by origin country:

In [None]:
visio.plot_success_by_country(df)

Exploring our dataset, we could see that this column is actually corrupted: projects from different locations are labeled as American or British projects. We'll use another field, 'location', to fix this data. The location information is given in JSON form, and we will parse the relevant fields. Where location is NaN, we'll stay with the original country column. Let's fix this and see how many projects actually come from each country:

In [None]:
fe.extract_country(df)
counts = df['country'].value_counts()
print(counts)

While being more accurate, we also added a lot of noise to our dataset. We will define a threshold: any country with fewer projects than the threshold will be relabeled as 'Global'. This will keep our data from being too sparse, and will also save us from statistical errors or biases.

In [None]:
# Minimum number of samples for a country to appear in the dataset
thresh = 450
fe.unify_countries(df, counts, thresh)
counts = df['country'].value_counts()
print(counts)
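
The thresholding itself is simple to express in pandas. The following is a sketch of the idea on a toy dataframe, not the code of `fe.unify_countries`; the helper name `group_rare_countries` is an assumption.

```python
import pandas as pd

def group_rare_countries(frame, thresh, column="country", other="Global"):
    """Relabel every country with fewer than `thresh` projects as `other`."""
    counts = frame[column].value_counts()
    rare = counts[counts < thresh].index
    # Keep the original label unless the country is rare.
    frame[column] = frame[column].where(~frame[column].isin(rare), other)
    return frame

toy = pd.DataFrame({"country": ["US", "US", "US", "GB", "GB", "IS"]})
group_rare_countries(toy, thresh=2)
print(toy["country"].value_counts())
```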

Let's see how this more accurate and compact global partition breaks down to success vs. failure rates.

In [None]:
visio.plot_success_by_country(df)

It's hard to assess the rest of the world as it is overshadowed by the US. Let's check out all countries but the US:

In [None]:
visio.plot_success_by_country(df.loc[df['country'] != 'US'])

As the US is the major origin country, let's use the data available to us in JSON form to extract the origin state within the US. This data is noisy: some projects in the US contain NaN in the state field, or straight-out garbage such as Arabic words. This is a small minority of US projects, so we will just mark all of these as 'US unknown'.

In [None]:
fe.get_us_state(df)
counts = df['country'].value_counts()
print(counts)

The final breakdown by country:

In [None]:
visio.plot_success_by_country(df)

We can see that while most states and countries are distributed close to the overall distribution, some locations stand out: projects from California, New York or Great Britain (GB) have a larger chance to succeed, while projects from Florida, Texas and several other southern states tend to fail. We have extracted all we could from the location column, and can now drop it and move forward.

In [None]:
df.drop(columns=['location'],inplace=True)

Let's see how category affects success rates:

In [None]:
visio.plot_success_by_category(df)

It seems that product category has an impact on the campaign result. Our dataset allows us to view this in even finer granularity, by sub-categories:

In [None]:
visio.plot_success_by_sub_category(df)

Another thing to factor in is seasonality. Let's see if success changes depending on the project's start month. To be able to look at this data over several years, we'll add specific month and year columns for launched_at and deadline. We will also add a field with the delta in days between launch and deadline.

In [None]:
fe.extract_month_and_year(df, ['launched_at','deadline'])
fe.add_destination_delta_in_days(df)
visio.plot_success_by_launched_month(df)
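
The two quantities behind these plots, campaign duration and success rate per launch month, can be sketched as below. This is an illustration under assumptions and not the `fe` or `visio` code: the helper names are hypothetical, and `launched_at` and `deadline` are assumed to still hold Unix timestamps in seconds.

```python
import pandas as pd

def add_campaign_duration(frame, start="launched_at", end="deadline"):
    """Add the campaign length in whole days, assuming Unix timestamps in seconds."""
    launched = pd.to_datetime(frame[start], unit="s")
    deadline = pd.to_datetime(frame[end], unit="s")
    frame["destination_delta_in_days"] = (deadline - launched).dt.days
    return frame

def success_rate_by_month(frame, start="launched_at"):
    """Return the share of successful projects for each launch month (1 to 12)."""
    month = pd.to_datetime(frame[start], unit="s").dt.month
    return (frame["state"] == "successful").groupby(month).mean()

toy = pd.DataFrame({
    "launched_at": [1572566400, 1572566400],  # both launched on 2019-11-01
    "deadline": [1575158400, 1577750400],     # 30 and 60 days later
    "state": ["successful", "failed"],
})
add_campaign_duration(toy)
rates = success_rate_by_month(toy)
print(toy["destination_delta_in_days"].tolist(), rates.to_dict())
```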

Overall, looking at the whole period of given data:

In [None]:
visio.plot_success_over_time(df)

Let's see how the duration of the campaign affects the probability of success.

In [None]:
visio.plot_success_by_destination_delta_in_days(df)

In [None]:
inner = df.loc[df['goal']<30000]
sns.distplot(inner['goal']).set(xlim=(0, None))
print('number of records out of range:',len(df.loc[df['goal']>=30000]))

In [None]:
# Goals between 30,000 and 200,000, selected with a single combined mask
inner = df.loc[(df['goal']>30000) & (df['goal']<200000)]
sns.distplot(inner['goal']).set(xlim=(0, None))
#print('number of records out of range:',len(df.loc[df['goal']<50000]))

In [None]:
inner = df.loc[df['goal']<80000]
sns.distplot(inner['goal']).set(xlim=(0, None))
print('number of records out of range:',len(df.loc[df['goal']>=80000]))

In [None]:
cent = df.loc[df['goal']<30000]
cent.plot.scatter(x='goal',y='pledged')

Next we want to gather statistics about each creator's previous projects.
### But first!
To prevent leakage, let's split the dataframe into train and test sets.

In [None]:
from kickstarter.data import Data
data = Data(df)  # container for the train/test split

Something that might be interesting to learn is how well this creator's past projects did. The function called below extracts, per creator, the total number of their past projects, the number of successful ones and the number of unsuccessful ones (failed projects, canceled ones etc.; this field will be dealt with next).

In [None]:
from kickstarter.transformers import CreatorTransformer

data.apply_transformer(CreatorTransformer())
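
To give a feel for what such history features look like, here is a sketch that counts, for every project, only the same creator's projects launched before it. It is not the `CreatorTransformer` implementation, which we do not reproduce here: the helper name and the `past_*` column names are assumptions, and ordering by `launched_at` is one possible way to keep the counts free of future information.

```python
import pandas as pd

def add_creator_history(frame):
    """Add counts of each creator's earlier projects, based on launch order."""
    frame = frame.sort_values("launched_at").copy()
    success = (frame["state"] == "successful").astype(int)
    # cumcount() numbers each creator's projects 0, 1, 2, ... in launch order,
    # which is exactly the number of earlier projects.
    frame["past_projects"] = frame.groupby("creator_id").cumcount()
    # Running total of successes including the current row, minus the current row itself.
    frame["past_successful"] = success.groupby(frame["creator_id"]).cumsum() - success
    # Everything that was not successful (failed, canceled, etc.) counts as unsuccessful.
    frame["past_unsuccessful"] = frame["past_projects"] - frame["past_successful"]
    return frame

toy = pd.DataFrame({
    "creator_id": [1, 2, 1, 1],
    "launched_at": [3, 1, 1, 2],
    "state": ["successful", "failed", "successful", "failed"],
})
history = add_creator_history(toy)
print(history)
```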

In [None]:
from kickstarter.visio import plot_sccess_by_creator_history
plot_sccess_by_creator_history(data.train_df)

We finished extracting all the data contained in the creator column. We can now drop it and move on. 

In [None]:
data.train_df.drop(columns=['creator'],inplace=True)
data.test_df.drop(columns=['creator'],inplace=True)

As this is basically what we are asking, let's see how many projects of each status are in our dataset.

In [None]:
visio.plot_distriubtion_by_state_slice(data.train_df)

Since live projects can't be used, we'll clear them out and also unite suspended and canceled projects, labeling them as failed.
We waited until now to apply this step, as we wanted to count canceled or suspended projects as part of our creators' history. This reduction gives us:

In [None]:
fe.fix_state(data)  # deletes live projects and unites the failed states
print(data.train_df["state"].value_counts())
visio.plot_distriubtion_by_state_slice(data.train_df)
num_recs = len(data.train_df)
print('After processing there are {} records in train data'.format(num_recs))

Now, let's try running a few naive models and see what we are dealing with here.

In [None]:
import knn_model as knn
import logistic_regression_model as logistic
import random_forest_model as forest
import gradient_boosting_model as gradient_boosting

In [None]:
logReg_pr = logistic.run_model(data)
models = {'Logistic regression' : logReg_pr}

Let's try a few other models: KNN, Random forest and gradient boosting.

In [None]:
knn_pr = knn.run_model(data)
models['KNN'] = knn_pr

In [None]:
forest_pr = forest.run_model(data)
models['Random forest'] = forest_pr

In [None]:
boost_pr = gradient_boosting.run_model(data)
models['Gradient boost'] = boost_pr

In [None]:
visio.plot_precision(models)

Cool! So up until now we used standard techniques. Now we will try to leverage the most interesting data we have in our set: the free text fields (the project's name, and the 'blurb', which is a short description of the project), and the projects' pictures.

In [None]:
import os
import pickle
os.makedirs("pickled_data", exist_ok=True)  # make sure the target directory exists
with open("pickled_data/before_nima.pickle", "wb") as pickle_file:
    pickle.dump(data, pickle_file)

# Phase 5: Project photos
The first thing we need in order to gain insights from the images is access to them. We took a step in that direction when parsing the image URLs in the dataset. The more challenging part was to actually obtain them. We chose to download them (as opposed to accessing them directly online or some other 'lazy' approach), as we predicted we would want to try a few different models on them and this would save us time in the long run. For this purpose (and to actually run the models), we used a dedicated Azure cloud VM to download the 314K pictures, weighing about 30GB. This let us run uninterrupted and with a faster connection. The whole download took about 2 days (with the very naive and unparallelized code below).

In [None]:
from kickstarter import nima
import inspect
lines = inspect.getsource(nima.download_photos)
print(lines)

Now that we had the photos, we needed to find out what we could do with them (actually we did the research before opening a dedicated VM and downloading, but this narrates better).

Doing some research, we found NIMA, a paper by Google's AI team that suggests leveraging convolutional neural networks to predict how aesthetically pleasing a photograph is.

https://arxiv.org/pdf/1709.05424.pdf

This seemed like a novel feature and we decided to find an implementation of the model online, as no model was actually released by Google. We tried a few private repos on GitHub, which did not seem promising (running them on a small sub-sample gave results that did not sit well with our judgment of the photos).

Finally, we found a project by Idealo (a German e-commerce site, sort of like 'zap.co.il') which implements NIMA and was already successfully used to rate hotels by their online pictures.

Leveraging the model on our dataset required some tweaking and learning, especially in the data loading phase, where the original input for the model was different from ours, and so were the picture formats. This was also quite challenging, as running the model was only possible using a Docker container, which we needed to learn how to handle.

Running the model on all 314K pictures with our GPU-equipped VM took several hours and yielded two JSON arrays with the results. We can now add them to the dataset. As this is a lengthy process (due to the unfriendly output of the model), the cell below automatically downloads the clean dataset as a pickle and loads it. To add the NIMA scores manually instead, uncomment the cell after it.

In [None]:
data = dl.get_pickles('with_NIMA.pickle')

In [None]:
# Uncomment if you want to add the NIMA scores manually
# from kickstarter.transformers import NimaTransformer
# 
# data.apply_transformer(NimaTransformer())
# 
# with open("pickled_data/with_NIMA.pickle", "wb") as pickle_file:
#     pickle.dump(data, pickle_file)

In [None]:
nes = data.df.isna().sum()
print(nes)

In [None]:
data.train_df.dropna(subset=['nima_score','nima_tech'], inplace=True)
data.test_df.dropna(subset=['nima_score','nima_tech'], inplace=True)
len(data.df)

Let's get a sense of what this model returned. Below we'll display 9 random high-scoring images and 9 random low-scoring ones. This function retrieves the photos online, so it requires internet access.

In [None]:
visio.display_imgs(data.df)

Let's compare the distribution of the technical ratings and the aesthetic ones.

In [None]:
sns.distplot(data.df[['nima_score']], hist=False, rug=False, axlabel = 'Image score', label = 'aesthetic score')
sns.distplot(data.df[['nima_tech']], hist=False, rug=False, label = 'Technical score').set_title('Image score distribution')

In [None]:
winners = data.df.loc[data.df['state'] == 'successful']
losers = data.df.loc[data.df['state'] == 'failed']
sns.distplot(losers[['nima_score']], hist=False, rug=False, axlabel = 'Image score', label = 'failed projects')
sns.distplot(winners[['nima_score']], hist=False, rug=False, label = 'successful projects').set_title('Image score distribution')

In [None]:
sns.distplot(losers[['nima_tech']], hist=False, rug=False, axlabel = 'Image score', label = 'failed projects')
sns.distplot(winners[['nima_tech']], hist=False, rug=False, label = 'successful projects').set_title('Image technical score distribution')

As the aesthetic model seems to hold the most potential for differentiating between the distributions of the failed and successful projects, we will focus on it. Let's extract the distributions' parameters:

In [None]:
total_mean = data.df.nima_score.mean()
print('nima score total mean is {}'.format(total_mean))
total_std = data.df.nima_score.std()
print('nima score total std is {}'.format(total_std))
winner_mean = winners.nima_score.mean()
print('winners nima score mean is {}'.format(winner_mean))
winner_std = winners.nima_score.std()
print('winners nima score std is {}'.format(winner_std))
loser_mean = losers.nima_score.mean()
print('losers nima score mean is {}'.format(loser_mean))
loser_std = losers.nima_score.std()
print('losers nima score std is {}'.format(loser_std))

In [None]:
# compare general distribution to normal distribution with same mean and std
sns.distplot(data.df[['nima_score']], hist=False, rug=False, axlabel = 'Image score', label = 'total aesthetic score')
norm = np.random.normal(total_mean,total_std,300000)
sns.distplot(norm, hist=False, rug=False, axlabel = 'Image score', label = 'normal distribution')

In [None]:
# compare successful distribution to normal distribution with same mean and std
sns.distplot(winners[['nima_score']], hist=False, rug=False, label = 'successful projects').set_title('Image score distribution')
norm = np.random.normal(winner_mean,winner_std,300000)
sns.distplot(norm, hist=False, rug=False, axlabel = 'Image score', label = 'normal distribution \n with succ. params')

In [None]:
# compare failed distribution to normal distribution with same mean and std
sns.distplot(losers[['nima_score']], hist=False, rug=False, axlabel = 'Image score', label = 'failed projects')
norm = np.random.normal(loser_mean,loser_std,300000)
sns.distplot(norm, hist=False, rug=False, axlabel = 'Image score', label = 'normal distribution \n with failed params')

This is not a rigorous normality test, but we can see that these distributions are practically normal, as expected from the specification of the model.
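
A slightly more quantitative check than eyeballing the curves is to look at skewness and excess kurtosis, which are both 0 for a normal distribution. The sketch below uses synthetic data; the helper name `skew_and_excess_kurtosis` is an assumption, and we did not run it on the NIMA scores, so no result is claimed here.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Return sample skewness and excess kurtosis; both are close to 0 for normal data."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]             # ignore missing scores
    z = (x - x.mean()) / x.std()    # standardize
    return float(np.mean(z ** 3)), float(np.mean(z ** 4) - 3.0)

# Synthetic stand-in for a score column.
rng = np.random.default_rng(0)
skew, kurt = skew_and_excess_kurtosis(rng.normal(5.0, 0.5, 100_000))
print(skew, kurt)
```

On the real data one could call it as `skew_and_excess_kurtosis(data.df['nima_score'])` and compare winners with losers.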

In [None]:
logReg_pr = logistic.run_model(data, nima = True)
models['Logistic regression with nima'] = logReg_pr

In [None]:
knn_pr = knn.run_model(data, 40, nima = True)
models['KNN with NIMA'] = knn_pr

In [None]:
forest_pr = forest.run_model(data, nima = True)
models['Random forest with nima'] = forest_pr

In [None]:
boost_pr = gradient_boosting.run_model(data, nima = True)
models['Gradient boost with nima'] = boost_pr

In [None]:
visio.plot_precision(models)

# Phase 6: Project text attributes
Looking at NLP attributes of projects

In [None]:
from kickstarter import nlp

In [None]:
nes = data.df.isna().sum()
print(nes)

In [None]:
data.train_df.dropna(subset=['blurb','name'], inplace=True)
data.test_df.dropna(subset=['blurb','name'], inplace=True)
len(data.df)

In [None]:
from kickstarter.transformers import SemanticTransformer
data.apply_transformer(SemanticTransformer())

In [None]:
data.df[["blurb", "blurb_pos", "blurb_neg", "blurb_compound"]].head()

In [None]:
sns.distplot(data.df[['blurb_pos']], hist=False, rug=False, axlabel = 'dist', label = 'Positivity score')
sns.distplot(data.df[['blurb_compound']], hist=False, rug=False, axlabel = 'dist', label = 'Compound score')
sns.distplot(data.df[['blurb_neg']], hist=False, rug=False, label = 'Negativity score').set_title('Blurb sentiment score distribution')

In [None]:
# Fields used by all four models in this phase
text_fields = ['launched_at_month', 'launched_at_year', 'category', 'parent_category', 'destination_delta_in_days', 'goal', 'nima_score','blurb_pos','blurb_neg', 'blurb_compound']
logReg_pr = logistic.run_model(data, text_fields)
models = {'Logistic regression' : logReg_pr}

In [None]:
# Pass `data` (not the raw `df`), since only it holds the NIMA and sentiment columns
knn_pr = knn.run_model(data, 20, input_fields = text_fields)
models['KNN'] = knn_pr

In [None]:
forest_pr = forest.run_model(data, 950, text_fields)
models['Random forest'] = forest_pr

In [None]:
boost_pr = gradient_boosting.run_model(data, 1000, text_fields)
models['Gradient boost'] = boost_pr

# Phase 7: Adding one hot encoding
So far, even though we have shown (by plots) the importance of "country", we haven't trained on this categorical data.
Moreover, we treated "category" and "parent_category" as ordinal data, instead of treating it as one hot encoding, and by doing so, we may have created unwanted proximity between values that are not necessarily similar.
Hence, we will try to train on those as one hot encodings.

In [None]:
from kickstarter.transformers import OneHotTransformer
data.apply_transformer(OneHotTransformer())
data.df.head()
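
As a small illustration of the encoding, the sketch below one hot encodes a toy 'country' column with `pd.get_dummies` and aligns the test columns to the training ones. It is not the `OneHotTransformer` code; the helper name `one_hot` and the handling of unseen categories are assumptions.

```python
import pandas as pd

def one_hot(train, test, column):
    """One hot encode `column` in both frames, using the training columns as reference."""
    train_enc = pd.get_dummies(train, columns=[column], prefix=column)
    test_enc = pd.get_dummies(test, columns=[column], prefix=column)
    # Give the test set exactly the training columns: categories unseen in training
    # are dropped, and categories missing from the test set are filled with zeros.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc

train = pd.DataFrame({"goal": [1000, 500, 2000], "country": ["US", "GB", "Global"]})
test = pd.DataFrame({"goal": [700, 900], "country": ["US", "FR"]})
train_enc, test_enc = one_hot(train, test, "country")
print(test_enc)
```

Each category becomes its own 0/1 column (here with the `country_` prefix), so no artificial ordering or distance is implied between categories.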

# Phase 8: Bag of words

In [None]:
from kickstarter.transformers import BagOfWords
data.apply_transformer(BagOfWords())

# Phase 9: TFIDF

In [None]:
from kickstarter.transformers import TfidfTransformer
data.apply_transformer(TfidfTransformer())

In [None]:
data.df.columns.values

Now, given our final representation for our data...

In [None]:
input_fields = ['launched_at_month', 'launched_at_year', 'destination_delta_in_days', 'goal', 'nima_score','blurb_pos','blurb_neg', 'blurb_compound',"bag_of_words"]
# Add the one hot and TF-IDF columns, which share a prefix per source column
for prefix in ('category_name_', 'parent_category_name_', 'country_', 'tfidf_'):
    input_fields.extend([col for col in data.df.columns if col.startswith(prefix)])
input_fields.append("nima_tech")
len(input_fields)

In [None]:
#TODO: WIP delete this
data.input_fields = input_fields
assert set(data.train_x.columns) == set(input_fields)
# data.train_y = data.le.transform(data.train_df["state"])
# data.test_y = data.le.transform(data.test_df["state"])

In [None]:
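#TODO: WIP delete this
from lightgbm import LGBMClassifier

from sklearn.preprocessing import StandardScaler

# Note: the scaler is fitted but never applied to the features below
scaler = StandardScaler()
scaler.fit(data.train_x)

X_train = data.train_x
X_test = data.test_x
y_train = data.train_y
y_test = data.test_y

# Named lgbm so that it does not shadow the random forest module imported as `forest`
lgbm = LGBMClassifier()
lgbm.fit(X_train, y_train)
pred = lgbm.predict(X_test)  # class labels, so they can be compared with y_test

# Share of correct predictions (accuracy)
print('accuracy is: ' + str(np.mean(pred == y_test)))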

In [None]:
#TODO: WIP delete this
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import svm

C = [10**i for i in range(-2,3)]
gamma = [10**i for i in range(-7,8)]
parameters = {'kernel':['linear', 'rbf'], 'C':C, 'gamma' : gamma}

cols = list(df.columns.values)
fields = [field for field in input_fields if field in cols]
X = df[fields]
y = df['state']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(X_train, y_train)
print(clf.best_params_)
print(clf.score(X_test, y_test))