# Kickstarter project
## Our project for this semester is to predict whether a fundraising campaign on Kickstarter will succeed.

This type of prediction can be useful in several scenarios: for an entrepreneur trying to evaluate their chances, for Kickstarter itself when it wants to promote promising campaigns, or for an investor considering backing a company.

There are a few datasets available on Kaggle, such as [this one](https://www.kaggle.com/codename007/funding-successful-projects) and [this one](https://www.kaggle.com/kemical/kickstarter-projects). These datasets cover a shorter time span and contain less detailed data. The dataset that we used in our project is offered [here](https://webrobots.io/kickstarter-datasets/). It is very large and somewhat messy, so our first steps are devoted to getting to know this dataset and cleaning it up so we can use it easily.

The data is scraped over different periods; the last scrape is from Nov 2019 and contains 57 very large CSV files. Our first step is to unify all of it (scrapes from 2015 onwards, each containing about 100,000 records, with a lot of overlap) into a single DataFrame and explore the columns.
Due to size limitations, we added an extra step here and removed duplicates and live projects (which make up about 10% of the data but are useless for prediction). Otherwise, the resulting DataFrame might be too big to fit into memory.
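The merging idea can be sketched as follows. This is only an illustration under stated assumptions, not the code in `dataCleaning`: the helper name `merge_scrapes` is hypothetical, and the `id` and `state` column names are assumed. Frames are processed one at a time, oldest first, so the intermediate result never holds more than one copy of each project.

```python
import pandas as pd

def merge_scrapes(frames):
    """Merge scrape DataFrames (oldest first) into one, keeping memory bounded."""
    merged = None
    for frame in frames:
        # Live projects have no final outcome yet, so drop them right away.
        frame = frame[frame["state"] != "live"]
        merged = frame if merged is None else pd.concat([merged, frame], ignore_index=True)
        # Later scrapes come last, so keep="last" retains the newest record per project.
        merged = merged.drop_duplicates(subset="id", keep="last")
    return merged.reset_index(drop=True)

# Two tiny hypothetical scrapes with one overlapping project (id 2).
old = pd.DataFrame({"id": [1, 2], "state": ["failed", "live"]})
new = pd.DataFrame({"id": [2, 3], "state": ["successful", "live"]})
merged = merge_scrapes([old, new])
print(merged)
```

With real data, `frames` could be a generator that reads one CSV file at a time, so that only a single raw file is in memory at any moment.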


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import dataCleaning as dc
import visio
import inspect
import warnings
warnings.filterwarnings('ignore') # some seaborn plots emit warnings. Known issue.
%matplotlib inline
%load_ext autoreload
%autoreload 2
pd.options.display.max_columns = None

This step automatically downloads the cleaned dataset as a pickle and extracts it. It is also possible to build the dataset yourself by passing the argument cache=None, but this is a lengthy process that might take a few hours (downloading about 50 generations of the dataset, each about 1GB, and uniting them). Once this pickle is on your computer, it will be loaded automatically from its location.

### Note that this step requires internet connectivity and will download up to 1.5GB of data to your computer.

In [None]:
df = dc.make_dataframe(path=r'rawData') # Files are assumed to be in the rawData subdirectory. Caches the pickle in the cwd.
# print the first few rows
df.head()

Great! Let's get a few details about this data: what are the features, and how many records exist?

In [None]:
cols = list(df.columns.values)
print(cols)
num_recs = len(df.index)
print()
print('There are originally {} records in data'.format(num_recs))

Taking a first peek at the data via Excel hints that there are many empty columns:
![peek](img/firstPeek.png)

Let's see which columns contain mostly null values:


In [None]:
nes = df.isna().sum()
print(nes)
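Raw null counts are hard to compare by eye, so it can help to look at the share of missing values per column instead. The sketch below is an illustration only: the helper name `mostly_empty_columns` and the 0.9 threshold are assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd

def mostly_empty_columns(frame, threshold=0.9):
    """Return names of columns whose share of missing values exceeds threshold."""
    null_share = frame.isna().mean()  # fraction of missing values per column
    return sorted(null_share[null_share > threshold].index)

# Tiny hypothetical frame: 'friends' is entirely empty, 'blurb' is 25% empty.
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "friends": [np.nan, np.nan, np.nan, np.nan],
    "blurb": ["a", None, "c", "d"],
})
empty_cols = mostly_empty_columns(demo)
print(empty_cols)
```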

We're not missing anything too important so far (though some of these columns sound important, they are either unused or interchangeable with other fields that are kept). Off with their heads!

In [None]:
empty = {'friends','is_backing','is_starred','is_starrable','permissions','country_displayable_name','converted_pledged_amount',
         'current_currency','usd_type','fx_rate', 'has_more','last_update_published_at','projects','search_url','seed','staff_pick','total_hits','unread_messages_count','unseen_activity_count'}
letgo = [name for name in empty if name in cols]
df.drop(columns=letgo,inplace=True)
cols = list(df.columns.values)
print(cols)

We can already see redundant attributes which we are sure we will not need:
- Data that is used for display purposes, such as 'currency_symbol' and 'currency_trailing_code'.
- Data that is biased, such as 'backers_count' (this is part of the prediction), or 'disable_communication', which is an option for failed projects.
- Data that will not be used by our model: 'location', 'profile', 'urls', 'usd_type'.

Let's start with dropping these.

The columns 'friends', 'is_backing', 'is_starred' and 'permissions' were already dropped above, as they are basically empty.

In [None]:
redundant = {'backers_count','currency_symbol', 'currency_trailing_code','source_url','disable_communication',
             'profile','urls','location','spotlight','usd_pledged'}
letgo = [name for name in redundant if name in cols]
df.drop(columns=letgo, inplace=True)
cols = list(df.columns.values)
print(cols)

From looking at the data we can also see that the time fields are given in UNIX time. It will be useful later on if we break each date into a day, month and year trio. We'll run the conversion and replace each column with the corresponding three fields.

In [None]:
timefields = ['created_at','deadline','launched_at','state_changed_at']
dc.convert_time(df,timefields)
print('sanity check')
df.head()
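For readers without access to `dataCleaning`, a conversion of this kind might look like the sketch below. The helper name `split_unix_time` and the `<field>_day`, `<field>_month`, `<field>_year` column naming are assumptions made for illustration; the actual `dc.convert_time` may differ.

```python
import pandas as pd

def split_unix_time(frame, fields):
    """Replace each UNIX-seconds column with day, month and year columns."""
    for field in fields:
        stamps = pd.to_datetime(frame[field], unit="s")  # seconds since epoch, UTC
        frame[field + "_day"] = stamps.dt.day
        frame[field + "_month"] = stamps.dt.month
        frame[field + "_year"] = stamps.dt.year
        frame.drop(columns=[field], inplace=True)  # the raw timestamp is replaced
    return frame

demo = pd.DataFrame({"launched_at": [1573344000]})  # 2019-11-10 00:00:00 UTC
split_unix_time(demo, ["launched_at"])
print(demo)
```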

Cool! That looks a lot better. One more check we need to do is to look for duplicates in our dataset. If we find any duplicates (by project id), we will drop all earlier appearances of the same project. Note that this action sorts all projects by update date, so we need to take that into consideration later on. We created the dataset in a way that should not include any duplicates, but just to be sure...

In [None]:
print('There were originally {} records in data'.format(num_recs))
dc.remove_duplicates(df)
num_recs = len(df.index)
print('After processing there are {} records in data'.format(num_recs))
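The "keep only the latest appearance" rule, including the sorting side effect mentioned above, can be illustrated with a minimal sketch. The helper name `keep_latest` and the use of `state_changed_at` as the update date are assumptions; `dc.remove_duplicates` may use a different column.

```python
import pandas as pd

def keep_latest(frame, id_col="id", time_col="state_changed_at"):
    """Keep only the most recently updated row for each project id."""
    ordered = frame.sort_values(time_col)  # oldest first, so the row order changes
    return ordered.drop_duplicates(subset=id_col, keep="last").reset_index(drop=True)

# Project 7 appears twice; only its newer record should survive.
demo = pd.DataFrame({
    "id": [7, 8, 7],
    "state_changed_at": [300, 200, 100],
    "state": ["successful", "failed", "live"],
})
latest = keep_latest(demo)
print(latest)
```

Because the rows are sorted before deduplication, the result is ordered by update time rather than by the original row order.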

Another inconvenience in this dataset is that some of the fields are given in JSON form, specifically the 'category' and 'creator' attributes. We'll parse just the interesting parts out of these fields and remove all the bloat text.

In [None]:
#dc.extract_creator(df) #replaces the creator json with creator id int, un
df.drop(columns=['creator'], inplace=True) # currently not used.
dc.extract_catagories(df) # gets project category data
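As an illustration of this kind of parsing, the sketch below pulls a main category and a sub-category out of a JSON string. It assumes that each category JSON carries a `slug` of the form `main/sub`; that field layout, the helper name `split_category` and the output column names are assumptions, not taken from `dataCleaning`.

```python
import json
import pandas as pd

def split_category(raw):
    """Return the main category and sub-category from a category JSON string."""
    slug = json.loads(raw).get("slug", "")
    main, _, sub = slug.partition("/")  # "music/rock" -> ("music", "/", "rock")
    return pd.Series({"category": main, "sub_category": sub or None})

demo = pd.DataFrame({"category": [
    '{"id": 1, "name": "Rock", "slug": "music/rock"}',
    '{"id": 2, "name": "Art", "slug": "art"}',
]})
parsed = demo["category"].apply(split_category)
print(parsed)
```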

One last thing that remains is to convert the goal amount, which is given in the project's local currency (and not in USD).
Once this is done we no longer need the static USD column (it is dropped by the function). We will also parse the project photo URL for future use.

In [None]:
dc.convert_goal(df)
dc.get_image_url(df)
df.head()
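The goal conversion amounts to multiplying the local-currency goal by a fixed exchange rate. A minimal sketch, assuming the rate lives in a column named `static_usd_rate` (the column name and the helper `goal_to_usd` are assumptions for illustration):

```python
import pandas as pd

def goal_to_usd(frame, rate_col="static_usd_rate"):
    """Convert 'goal' to USD in place and drop the exchange-rate column."""
    frame["goal"] = frame["goal"] * frame[rate_col]
    frame.drop(columns=[rate_col], inplace=True)
    return frame

# A USD project (rate 1.0) and a project in a hypothetical currency worth 1.25 USD.
demo = pd.DataFrame({"goal": [1000.0, 500.0], "static_usd_rate": [1.0, 1.25]})
goal_to_usd(demo)
print(demo)
```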

Now we are ready to begin exploring our data.
As this is basically what we are asking, let's see how many projects of each status are in our dataset.

In [None]:
visio.plot_distriubtion_by_state_slice(df)

Since live projects can't be used, we'll clear them out and also unite suspended and canceled projects under the failed label, which gives us:

In [None]:
dc.fix_state(df) # deletes live projects and unites canceled and suspended with failed.
df.reset_index(drop=True, inplace=True)
print(df.state.value_counts())
visio.plot_distriubtion_by_state_slice(df)
num_recs = len(df.index)
print('After processing there are {} records in data'.format(num_recs))
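The relabeling step can be sketched as follows. This is an illustration of the idea rather than the body of `dc.fix_state`; the helper name `merge_states` is hypothetical, and unlike `dc.fix_state` it returns a new frame instead of modifying its input.

```python
import pandas as pd

def merge_states(frame):
    """Drop live projects and relabel canceled and suspended ones as failed."""
    frame = frame[frame["state"] != "live"].copy()
    frame["state"] = frame["state"].replace({"canceled": "failed", "suspended": "failed"})
    return frame.reset_index(drop=True)

demo = pd.DataFrame({"state": ["live", "canceled", "successful", "suspended", "failed"]})
cleaned = merge_states(demo)
print(cleaned["state"].value_counts())
```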

Cool. Looks like our data is balanced, and projects in our dataset are about equally likely to fail or succeed.

Now, let's take a look at how our data is distributed globally.
Projects by country of origin:

In [None]:
visio.plot_success_by_country(df)

Those Americans are always too big; let's try to give some focus to the rest of the world as well:

In [None]:
visio.plot_success_by_country(df.loc[df['country'] != 'US'])

Let's see how success is distributed by category:

In [None]:
visio.plot_success_by_category(df)

It seems that the product category has an impact on the campaign result. Our dataset allows us to view this at an even finer granularity, by sub-category:

In [None]:
visio.plot_success_by_sub_category(df)
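The numbers behind plots like these are simply success rates per group. A minimal sketch of that computation, where the helper name `success_rate_by` is hypothetical and the `state` and `category` column names are assumed:

```python
import pandas as pd

def success_rate_by(frame, column):
    """Share of successful projects per value of `column`, highest first."""
    succeeded = frame["state"].eq("successful")  # boolean outcome per project
    return succeeded.groupby(frame[column]).mean().sort_values(ascending=False)

demo = pd.DataFrame({
    "category": ["music", "music", "tech", "tech", "tech", "art"],
    "state": ["successful", "failed", "failed", "failed", "successful", "successful"],
})
rates = success_rate_by(demo, "category")
print(rates)
```

The same helper could be applied to a sub-category or country column, assuming those columns exist under the names passed in.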

Another thing to factor in is seasonality. Let's see whether success changes depending on the project's start month. To be able to look at this data over several years, we'll add specific month and year columns for launched_at and deadline. We will also add a field holding the delta in days between launch and deadline.

In [None]:
dc.extract_month_and_year(df, ['launched_at','deadline'])
dc.add_destination_delta_in_days(df)
visio.plot_success_by_launched_month(df)
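A sketch of how such features could be derived is shown below. It assumes that raw UNIX-second timestamps are available in `launched_at` and `deadline`; the helper name `add_launch_features` and the output column names are hypothetical and may not match what `dataCleaning` produces.

```python
import pandas as pd

def add_launch_features(frame):
    """Add launch month, launch year and campaign duration in days."""
    launched = pd.to_datetime(frame["launched_at"], unit="s")
    deadline = pd.to_datetime(frame["deadline"], unit="s")
    frame["launched_at_month"] = launched.dt.month
    frame["launched_at_year"] = launched.dt.year
    frame["delta_in_days"] = (deadline - launched).dt.days  # whole days only
    return frame

# A campaign launched on 2019-11-10 (UTC) that runs for exactly 30 days.
demo = pd.DataFrame({"launched_at": [1573344000],
                     "deadline": [1573344000 + 30 * 86400]})
add_launch_features(demo)
print(demo)
```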

Overall, looking at the whole period of given data:

In [None]:
visio.plot_success_over_time(df)

Let's see how the duration of the campaign affects the probability of success.

In [None]:
visio.plot_success_by_destination_delta_in_days(df)

In [None]:
# distribution of goals below 30,000
inner = df.loc[df['goal']<30000]
sns.distplot(inner['goal']).set(xlim=(0, None))
print('number of records out of range:',len(df.loc[df['goal']>=30000]))

In [None]:
# distribution of goals between 30,000 and 200,000
inner = df.loc[(df['goal']>30000) & (df['goal']<200000)]
sns.distplot(inner['goal']).set(xlim=(0, None))
#print('number of records out of range:',len(df.loc[df['goal']<50000]))

In [None]:
# distribution of goals below 80,000
inner = df.loc[df['goal']<80000]
sns.distplot(inner['goal']).set(xlim=(0, None))
print('number of records out of range:',len(df.loc[df['goal']>=80000]))

In [None]:
# goal vs. pledged amount for goals below 30,000
cent = df.loc[df['goal']<30000]
cent.plot.scatter(x='goal',y='pledged')

Now, let's try running a few naive models and see what we are dealing with here.

In [None]:
import knn_model as knn
import logistic_regression_model as logistic
import random_forest_model as forest
import gradient_boosting_model as gradient_boosting

In [None]:
logReg_pr = logistic.run_model(df)
models = {'Logistic regression' : logReg_pr}
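The model modules are not shown in this notebook, so here is a rough sketch of what a naive baseline of this kind could look like with scikit-learn. Everything in it is an assumption made for illustration: the helper name `logistic_precision`, the feature list, the 75/25 split, the use of precision as the returned metric, and the synthetic data, which is built so that small goals succeed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def logistic_precision(frame, features, seed=0):
    """Fit a logistic regression on `features` and return test-set precision."""
    X = pd.get_dummies(frame[features], dtype=float)  # one-hot encode categoricals
    y = frame["state"].eq("successful").astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return precision_score(y_test, model.predict(X_test), zero_division=0)

# Synthetic stand-in data: projects with a goal below 20,000 succeed.
rng = np.random.default_rng(0)
n = 400
goal = rng.uniform(100, 50000, n)
demo = pd.DataFrame({
    "goal": goal,
    "category": rng.choice(["music", "tech"], n),
    "state": np.where(goal < 20000, "successful", "failed"),
})
precision = logistic_precision(demo, ["goal", "category"])
print(round(precision, 3))
```

On real data the signal is far weaker than in this toy example, so the score here says nothing about what to expect from the actual models.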

Let's try a few other models: KNN, random forest and gradient boosting.

In [None]:
knn_pr = knn.run_model(df)
models['KNN'] = knn_pr

In [None]:
forest_pr = forest.run_model(df)
models['Random forest'] = forest_pr

In [None]:
boost_pr = gradient_boosting.run_model(df)
models['Gradient boost'] = boost_pr

In [None]:
visio.plot_precision(models)

Cool! Up until now we have used standard techniques. Now we will try to leverage the most interesting data we have in our set: the free-text fields (the project's name and the 'blurb', a short description of the project) and the projects' pictures.

# Project photos
The first thing we need in order to gain insights from the images is access to them. We took a step in that direction when parsing the image URLs in the dataset. The more challenging part was to actually obtain them. We chose to download them (as opposed to accessing them directly online or some other 'lazy' approach), as we predicted we would want to try a few different models on them, and this would save us time in the long run. We used a dedicated Azure cloud VM to download the 314K pictures, weighing about 30GB, as it enabled us to run uninterrupted and with a faster connection. The whole downloading process took about 2 days (with the very naive and unparallelised code below).

In [None]:
lines = inspect.getsource(dc.download_photos)
print(lines)
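Since downloading is dominated by network waiting, a thread pool is one common way to speed it up. The sketch below is not the code we ran; it is a hypothetical parallel alternative in which the names `fetch` and `download_all`, the `.jpg` naming scheme and the worker count are all assumptions. The demo at the end uses a stand-in fetcher so that it runs without network access.

```python
import tempfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def fetch(url, timeout=30):
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def download_all(urls_by_id, target_dir, fetcher=fetch, workers=16):
    """Download every image in parallel; return {project_id: succeeded}."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    def save(item):
        project_id, url = item
        try:
            (target / f"{project_id}.jpg").write_bytes(fetcher(url))
            return project_id, True
        except Exception:
            # One broken link should not stop the whole run.
            return project_id, False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(save, urls_by_id.items()))

# Demo with a fake fetcher instead of real HTTP requests.
fake_fetcher = lambda url: b"image-bytes"
with tempfile.TemporaryDirectory() as tmp:
    results = download_all({1: "http://example.invalid/1.jpg",
                            2: "http://example.invalid/2.jpg"},
                           tmp, fetcher=fake_fetcher)
    saved = sorted(p.name for p in Path(tmp).iterdir())
print(results, saved)
```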

Now that we had the photos, we needed to find out what we could do with them (actually, we did the research before opening a dedicated VM and downloading, but this narrates better).

Doing some research, we found NIMA, a paper by Google's AI team that suggests leveraging convolutional neural networks to predict how aesthetically pleasing a photograph is.

https://arxiv.org/pdf/1709.05424.pdf

This seemed like a novel feature, and we decided to find an implementation of the model online, as no model was actually released by Google. We tried a few personal repos on GitHub, which did not seem promising (running them on a small sub-sample gave results that did not sit well with our judgment of the photos).

Finally, we found a project by Idealo (a German e-commerce site, sort of like 'zap.co.il') which implements NIMA and was already successfully used to rate hotels by their online pictures.

Leveraging the model on our dataset required some tweaking and learning, especially in the data loading phase, where the original input for the model was different from ours, and so were the picture formats. This was also quite challenging, as running the model was only possible using a Docker container, which we needed to learn how to handle.

Running the model on all 314K pictures on our GPU-equipped VM took several hours and yielded two JSON arrays with the results. We can now add them to the dataset. As this is a lengthy process (due to the unfriendly output of the model), you can instead use the cell below, which automatically downloads the clean pandas DataFrame as a pickle and loads it.

In [None]:
df = dc.get_pickles('with_NIMA.pickle')

In [None]:
dc.add_nima(df, jsonFile='NIMA predictions/predictions_imgs_all.json', columnName = 'nima_score')

In [None]:
dc.add_nima(df, jsonFile='NIMA predictions/predictions_imgs_all_technical.json', columnName = 'nima_tech')
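To give an idea of what attaching the scores involves, here is a minimal sketch. It assumes that each prediction is a JSON object with an `image_id` and a `mean_score_prediction`, and that images were named after the project id; the field names, the naming scheme and the helper `attach_scores` are assumptions, not a description of `dc.add_nima`.

```python
import json
import pandas as pd

def attach_scores(frame, predictions, column):
    """Add `column` to frame by matching prediction image ids to project ids."""
    scores = {int(p["image_id"]): p["mean_score_prediction"] for p in predictions}
    frame[column] = frame["id"].map(scores)  # projects without a score get NaN
    return frame

raw = ('[{"image_id": "11", "mean_score_prediction": 5.2},'
       ' {"image_id": "13", "mean_score_prediction": 4.1}]')
demo = pd.DataFrame({"id": [11, 12, 13]})
attach_scores(demo, json.loads(raw), "nima_score")
print(demo)
```

Projects whose image could not be scored end up with a missing value, which is why a null count and a `dropna` step are useful afterwards.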

In [None]:
nes = df.isna().sum()
print(nes)

In [None]:
df.dropna(subset=['nima_score','nima_tech'], inplace=True)
len(df)

Let's get a sense of what this model returned. Below we'll display 9 random high-scoring images and 9 random low-scoring ones. This function retrieves the photos online, so it requires internet access.

In [None]:
visio.display_imgs(df)

Let's compare the distribution of the technical ratings with that of the aesthetic ones.

In [None]:
sns.distplot(df[['nima_score']], hist=False, rug=False, axlabel = 'Image score', label = 'aesthetic score')
sns.distplot(df[['nima_tech']], hist=False, rug=False, label = 'Technical score').set_title('Image score distribution')

In [None]:
winners = df.loc[df['state'] == 'successful']
losers = df.loc[df['state'] == 'failed']
sns.distplot(losers[['nima_score']], hist=False, rug=False, axlabel = 'Image score', label = 'failed projects')
sns.distplot(winners[['nima_score']], hist=False, rug=False, label = 'successful projects').set_title('Image score distribution')

In [None]:
sns.distplot(losers[['nima_tech']], hist=False, rug=False, axlabel = 'Image score', label = 'failed projects')
sns.distplot(winners[['nima_tech']], hist=False, rug=False, label = 'successful projects').set_title('Image technical score distribution')

As the aesthetic model seems to hold the most potential for differentiating between the distributions of the failed and successful projects, we will focus on it. Let's extract the distributions' parameters:

In [None]:
total_mean = df.nima_score.mean()
print('nima score total mean is {}'.format(total_mean))
total_std = df.nima_score.std()
print('nima score total std is {}'.format(total_std))
winner_mean = winners.nima_score.mean()
print('winners nima score mean is {}'.format(winner_mean))
winner_std = winners.nima_score.std()
print('winners nima score std is {}'.format(winner_std))
loser_mean = losers.nima_score.mean()
print('losers nima score mean is {}'.format(loser_mean))
loser_std = losers.nima_score.std()
print('losers nima score std is {}'.format(loser_std))

In [None]:
# compare general distribution to normal distribution with same mean and std
sns.distplot(df[['nima_score']], hist=False, rug=False, axlabel = 'Image score', label = 'total aesthetic score')
norm = np.random.normal(total_mean,total_std,300000)
sns.distplot(norm, hist=False, rug=False, axlabel = 'Image score', label = 'normal distribution')

In [None]:
# compare successful distribution to normal distribution with same mean and std
sns.distplot(winners[['nima_score']], hist=False, rug=False, label = 'successful projects').set_title('Image score distribution')
norm = np.random.normal(winner_mean,winner_std,300000)
sns.distplot(norm, hist=False, rug=False, axlabel = 'Image score', label = 'normal distribution \n with succ. params')

In [None]:
# compare failed distribution to normal distribution with same mean and std
sns.distplot(losers[['nima_score']], hist=False, rug=False, axlabel = 'Image score', label = 'failed projects')
norm = np.random.normal(loser_mean,loser_std,300000)
sns.distplot(norm, hist=False, rug=False, axlabel = 'Image score', label = 'normal distribution \n with failed params')

This is not a formal normality test, but we can see that these distributions are practically normal, as is expected by the specification of the model.
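One lightweight way to back up the visual impression is to look at skewness and excess kurtosis, both of which are close to zero for a normal distribution. The sketch below is an illustration on synthetic data, not a result from our dataset; the helper name `shape_summary` and the sample parameters are assumptions.

```python
import numpy as np
import pandas as pd

def shape_summary(scores):
    """Skewness and excess kurtosis; both are near 0 for normal data."""
    scores = pd.Series(scores, dtype=float)
    return {"skew": scores.skew(), "excess_kurtosis": scores.kurt()}

# Synthetic normal sample standing in for a column of image scores.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=5.0, scale=0.6, size=100_000)
summary = shape_summary(normal_sample)
print(summary)
```

Applied to `df['nima_score']`, values far from zero would suggest that the normal approximation is weaker than the plots imply.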

In [None]:
logReg_pr = logistic.run_model(df, nima = True)
models['Logistic regression with NIMA'] = logReg_pr

In [None]:
knn_pr = knn.run_model(df, nima = True)
models['KNN with NIMA'] = knn_pr

In [None]:
forest_pr = forest.run_model(df, nima = True)
models['Random forest with NIMA'] = forest_pr

In [None]:
boost_pr = gradient_boosting.run_model(df, nima = True)
models['Gradient boost with NIMA'] = boost_pr

In [None]:
visio.plot_precision(models)