
COGS 108 - Final Project



Names  
Oprah Winfrey  
Miley Cyrus  
Sam Smith  
J. Cole  
Group Members IDs  
A########  
A########  
A########  
A########  


# Intro:

Startups are centers for innovation as they often use cutting-edge technology to develop new products and services. These small companies also tend to grow quickly and can keep up with a rapidly changing economy. For these reasons, startups are proving to be invaluable in our globally connected economy and it is important that we understand what factors can lead to startup success.  

Unfortunately, CB Insights reports that 9 out of 10 startups fail, where 42% of those unsuccessful startups failed because there was no market need and 29% failed because they lacked sufficient funding. Clearly, many startups struggle with customer discovery and securing funds. As a potential guide for startups, we wanted to analyze what entrepreneurial ideas are most likely to attract customers and funding using a Kickstarter dataset, where Kickstarter is a funding platform for various projects. This dataset contains various fields, including sector and product descriptions, that we can analyze using supervised learning and natural language processing.


# Research Question:

## What keywords in Kickstarter descriptions best indicate high fundraising?

To measure success, we look at the total funds raised by a project. The alternative would be to check whether a project met its minimum fundraising goal, but our preliminary research showed that, within the Kickstarter dataset, the fundraising goal had very little impact on the amount of money a campaign actually raised (even projects with small goals often raised 100x their goal). We determined that using "goal met" as our metric would heavily bias the results toward unambitious projects with small fundraising goals. 

Furthermore, we have decided to break down our analysis by category, for two main reasons:
* The meaning of certain words varies heavily between categories (e.g., ‘mobile’ in gaming means a mobile game app, whereas ‘mobile’ in fashion indicates that a piece of clothing is not restricting).
* There is a large difference in average funds raised between categories, so the ‘most important keywords’ would likely just end up being indicators of which category a project belongs to. (For instance, ‘tech’ words might all rank far higher than ‘journalism’ words.)  


# Hypothesis

Among venture capitalists, new research in artificial intelligence and machine learning is extremely popular. Our hypothesis is therefore that such ideas will do very well on Kickstarter, even in categories outside of tech (for instance, AI-created music). However, it is entirely possible that Kickstarter users have very different priorities from venture capitalists.

# Background and related work

Kickstarter began in April 2009 as a platform through which anyone can post their projects and receive funding from the masses. Oftentimes, entrepreneurs use the site as a means to bring their product innovations to market. Kickstarter requires that the entrepreneur set a specific funding goal, and users can then “back” the project by pledging a dollar amount. If the total amount raised meets or exceeds the project target, then the entrepreneur gets to keep the pledged amount. Otherwise, the funds are returned to the backers.
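
The all-or-nothing rule described above can be sketched in a few lines. This is only an illustration: the function name `funds_released` is hypothetical, it is not part of any Kickstarter API, and it ignores platform and payment fees.

```python
def funds_released(pledged: float, goal: float) -> float:
    """Return the amount the creator receives under all-or-nothing funding."""
    # The creator keeps everything pledged only if the goal is met or exceeded;
    # otherwise the pledges are returned and the creator receives nothing.
    return pledged if pledged >= goal else 0.0


print(funds_released(12_000, 10_000))  # goal met
print(funds_released(4_000, 10_000))   # goal missed
```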

As mentioned in the previous section, we plan to investigate which sort of entrepreneurial ideas will likely garner crowdfunding via Kickstarter. For our purposes, we will be using a dataset scraped by Web Robots that contains information, including funding goals and amount raised, on all Kickstarter projects as of 2019. This dataset also provides fields concerning project descriptions and business categories (or sectors) that we hope our model will be able to analyze. We plan on modeling the entrepreneurial idea by combining the category of the project with language features extracted from the title and description.

It is important to note that teams in the private sector and in research have tried to model startup success before. In 2016, researchers at Northwestern University set out to predict the outcome of startups based on factors such as seed funding amount, seed funding time, and Series A funding, in the belief that these factors contribute to the success or failure of a company at every milestone (see "Predicting The Outcome of Startups: Less Failure, More Success"). To predict the success or failure of early-stage startups, the team used various supervised learning classifiers, such as Random Forest and Bayesian Networks, and reported precision ranging from 85% to 96%.

Although our project resembles the Northwestern University study, that study uses early funding to predict a startup's long-term success, whereas we use features of the startup idea itself to predict early funding. Additionally, rather than a binary classification of success vs. failure, we plan on using a regression model to predict the amount a startup will raise. Ultimately, we hope that our model’s predictions will help us draw conclusions about which ideas attract people to fund a Kickstarter project. 






# DATA CLEANING - Initial steps


The code below reads in the raw Kickstarter CSV data and then
* Removes stopwords and punctuation
* Lemmatizes all words (e.g., mice -> mouse, running -> run)
* Writes the processed data to JSON format

Lemmatization is a fairly intensive task, and with a dataset of over 200,000 projects this needed to run overnight. 

The recommended usage is to skip this cell and work off of the already lemmatized JSON outputs included in the repository.

You can find the already processed data at https://github.com/Aidankeogh/Cogs108_Repo in the folder "kickstarter_data"
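
As a lightweight illustration of the steps listed above, the following sketch filters stopwords and punctuation and writes the results to numbered JSON files. It is a simplified stand-in, not the notebook's pipeline: it skips lemmatization entirely, `TOY_STOPWORDS` is a hypothetical mini list rather than NLTK's English list, and `toy_text_features` and `write_batches` are names made up for this example.

```python
import json
import string
import tempfile
from pathlib import Path

# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"a", "an", "the", "and", "of", "for", "to"} | set(string.punctuation)


def toy_text_features(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords (no lemmatization)."""
    tokens = [tok.strip(string.punctuation) for tok in text.lower().split()]
    return [tok for tok in tokens if tok and tok not in TOY_STOPWORDS]


def write_batches(records, out_dir, batch_size):
    """Write records to numbered JSON files, including a final partial batch."""
    paths = []
    for start in range(0, len(records), batch_size):
        path = Path(out_dir) / f"data{len(paths)}.json"
        with open(path, "w") as outfile:
            json.dump(records[start:start + batch_size], outfile)
        paths.append(path)
    return paths


records = [{"text_feats": toy_text_features(t)}
           for t in ["The Art of Brewing!", "A board game for the family", "New album"]]
with tempfile.TemporaryDirectory() as tmp:
    written = write_batches(records, tmp, batch_size=2)
    print(len(written), records[0]["text_feats"])
```

Slicing by `batch_size` guarantees that the last, possibly smaller, batch is also written to disk.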

In [2]:
already_preprocessed = True
if not already_preprocessed: 
    import pandas as pd
    import json
    import glob
    import random
    import string
    import spacy
    import nltk
    from nltk.corpus import stopwords
    nlp = spacy.load("en_core_web_sm")
    stopwords = set(stopwords.words('english') + list(string.punctuation))

    # Read in all of the CSV files, concatenate them into one dataset. 
    csv_files = glob.glob("kickstarter_data/Kickstarter*")
    subsets = []
    for csv_file in csv_files:
        subsets.append(pd.read_csv(csv_file))
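    # ignore_index gives every row a unique index, so the batching below is evenly sized
    dset = pd.concat(subsets, ignore_index=True)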

    # Take in text and return an array of lemmatized, tokenized, and stopword-removed word features
    def text_features(text):
        text = text.strip().replace("\n", " ").replace("\r", " ")
        text = text.lower()
        tokens = nlp(text)
        feats = []    
        for tok in tokens: # lemmatize words that are not pronouns 
            feats.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
        feats = [feat for feat in feats if feat not in stopwords]
        return feats

    # Go through every Kickstarter project in the dataset and write it back to disk
    # in JSON format, one batch of projects per file.
    dump = 0
    projects = []
    for idx, item in dset.iterrows():
        text = str(item['name']) + " " + str(item['blurb'])
        project = {'pledged' : item['pledged'] * item['fx_rate'],
                  'goal'    : item['goal'] * item['fx_rate'],
                  'category': json.loads(item['category'])['slug'].split("/"), 
                  'text'    : text,
                  'text_feats'    : text_features(text)}
        projects.append(project)
        if idx % 1000 == 999:
            with open('kickstarter_data/data' + str(dump) + '.json', 'w') as outfile:  
                json.dump(projects, outfile)
            dump += 1
            projects = []

    # Write any projects left over after the last full batch so that none are lost.
    if projects:
        with open('kickstarter_data/data' + str(dump) + '.json', 'w') as outfile:
            json.dump(projects, outfile)

# Data cleaning - more interesting stuff

These are all the functions needed to convert the lemmatized and tokenized word features into a format usable by scikit-learn's regression models.

The intended usage is to start running from here, after downloading the JSON-formatted features from the repository. 
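
Before the full implementation, here is a minimal, self-contained sketch of the idea: count n-grams over a corpus, keep the most common ones as a vocabulary, and encode each project as a binary presence vector. The token lists are hypothetical, and unlike the notebook's functions this sketch does not add a start-of-sentence marker or append the fundraising goal.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of space-joined n-grams in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical, already-lemmatized token lists standing in for project text.
docs = [["new", "board", "game"], ["board", "game", "expansion"], ["new", "album"]]

# Count unigrams and bigrams over the whole corpus; keep the most common as vocabulary.
counts = Counter(g for doc in docs for n in (1, 2) for g in ngrams(doc, n))
vocab = [gram for gram, _ in counts.most_common(4)]
gram_to_idx = {gram: i for i, gram in enumerate(vocab)}


def binary_vector(tokens):
    """Mark each vocabulary gram as present (1) or absent (0) in the tokens."""
    vec = [0] * len(gram_to_idx)
    for n in (1, 2):
        for g in ngrams(tokens, n):
            if g in gram_to_idx:
                vec[gram_to_idx[g]] = 1
    return vec


print(vocab)
print(binary_vector(["new", "board", "game"]))
```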

In [14]:
import json
import glob
import nltk
import random
# The number of most common uni-, bi-, and trigrams to use as features. Edit this to use more or fewer of each gram type. 
most_useful = {"uni": 200, "bi": 100, "tri": 0}

In [15]:
# **read_data()**
# - **Func Desc:**<br>
#     This function reads in the entire Kickstarter dataset from the JSON files in the "kickstarter_data" directory.
# - **Return:**<br>
#     A list of n projects, where n is the total number of projects. Each project is a dictionary with 5 keys: 'category', 'text', 'pledged', 'goal', and 'text_feats'.
def read_data():
    projects = []

    # Read in data
    json_files = glob.glob("kickstarter_data/data*")

    for json_file in json_files:
        # Use a context manager so each file handle is closed after reading
        with open(json_file, 'r') as infile:
            projects += json.load(infile)

    return projects    

# **grams_by_project(*list text*)**
# - **Func Desc:**<br>
#     This function will find all the unigrams, bigrams, and trigrams in the given *text*.
# - **Return:**<br>
#     A dictionary containing all unigrams, bigrams, and trigrams, 
#     where the corresponding keys are "uni", "bi" and "tri"
def grams_by_project(text):
    grams = {}
    
    all_words = []
    all_bigrams = []
    all_trigrams = []
    
    prev_prev = ''
    prev_word = '<SOS>' # Start of sentence

    for w in text:
        # Ignore empty strings, possessive "'s" endings, and the token 'cancel'
        if w == "'s" or w == '’s' or w == '' or w == 'cancel':  
            continue

        all_words.append(w)
        all_bigrams.append(prev_word + " " + w)

        if prev_prev != '':
            all_trigrams.append(prev_prev + " " + prev_word + " " + w)

        prev_prev = prev_word
        prev_word = w
    
    grams["uni"] = all_words
    grams["bi"]  = all_bigrams
    grams["tri"] = all_trigrams
    
    return grams

# **grams_by_category(*list projects*, *string category*, **[optional]** *int n*, **[optional]** *boolean do_print*)**
# - **Func Desc:**<br>
#     This function will count the unigrams, bigrams, and trigrams of all *projects* in the given *category* ('all' selects every project). If *do_print* is set, then the *n* most common unigrams, bigrams, and trigrams will be displayed.
# - **Return:**<br>
#     A dictionary of frequency distributions for the unigrams, bigrams, and trigrams, 
#     where the corresponding keys are "uni", "bi" and "tri"
def grams_by_category(projects, category, n=15, do_print=True):
    grams = {}
    
    all_words = []
    all_bigrams = []
    all_trigrams = []
    
    for project in projects:
        
        # Change this to check out a different sub-category, 
        # 'all' will check the entire thing
        if category != 'all' and category not in project['category']: 
            continue

        proj_grams = grams_by_project(project['text_feats'])
            
        all_words += proj_grams["uni"]
        all_bigrams += proj_grams["bi"]
        all_trigrams += proj_grams["tri"]
        
    grams["uni"] = nltk.FreqDist(all_words)
    grams["bi"]  = nltk.FreqDist(all_bigrams)
    grams["tri"] = nltk.FreqDist(all_trigrams)
    
    if do_print:
        # Print the n most common grams of each type, reusing the counts computed above
        sections = (("UNIGRAMS", "uni"), ("BIGRAMS", "bi"), ("TRIGRAMS", "tri"))
        for i, (title, key) in enumerate(sections):
            if i > 0:
                print()
            print("-- " + title + " --")
            for gram, count in grams[key].most_common(n):
                print(gram, "\t", count)
    
    return grams


# **map_gram_to_idx(*dictionary grams_dict*, **[optional]** num_uni, **[optional]** num_bi, **[optional]** num_tri)**
# - **Func Desc:**<br>
#     Given a dictionary of unigram, bigram, and trigram frequency counts, this function maps each of the most common grams to a unique index. We will later use this to vectorize the most common uni-, bi-, and trigrams. Note that *num_uni* is the number of most common unigrams to keep, and similarly for *num_bi* and *num_tri*.
# - **Return:**<br>
#     A dictionary mapping each selected unigram, bigram, and trigram to a unique integer index.
def map_gram_to_idx(grams_dict, num_uni=most_useful["uni"], 
                      num_bi=most_useful["bi"], 
                      num_tri=most_useful["tri"]):
    gram_to_idx = {}
    count = 0
    
    for word, _ in grams_dict["uni"].most_common(num_uni):
        gram_to_idx[word] = count
        count += 1

    for phrase, _ in grams_dict["bi"].most_common(num_bi):
        gram_to_idx[phrase] = count
        count += 1

    for phrase, _ in grams_dict["tri"].most_common(num_tri):
        gram_to_idx[phrase] = count
        count += 1
        
    return gram_to_idx


# **vectorize(*dictionary project*, *dictionary gram_to_idx*)**
# - **Func Desc:**<br>
#     For each gram in *gram_to_idx*, this function indicates whether that gram is present in the project's 'text_feats' (1: present; 0: not present). Note that *gram_to_idx* represents a mapping of the n most common uni-, bi-, and trigrams of a particular project category. The project's fundraising goal is stored as the last element.
# - **Return:**<br>
#     A list of length len(gram_to_idx) + 1: entry i is 1 if the gram mapped to index i is present in the project's text and 0 otherwise, and the last entry is the project's goal.
def vectorize(project, gram_to_idx):
    text = project['text_feats']
    feats = [0] * (len(gram_to_idx) + 1)
    feats[-1] = project['goal']
    proj_grams = grams_by_project(text)
        
    for _, grams in proj_grams.items():
        for g in grams:
            if g in gram_to_idx:
                feats[gram_to_idx[g]] = 1
               
    return feats



# Data Analysis

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.metrics import mean_squared_error

Load in all of the data, and then find the 10 most common Kickstarter categories to analyze. 
We decided to analyze each category separately so that we could get a more fine-grained description of which kinds of projects are most appealing within each category. 
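
As a small illustration of ranking categories by frequency, the sketch below uses `collections.Counter` on hypothetical projects. Note one deliberate difference, stated as an assumption: it counts only the first element of each category slug (the main category), whereas the notebook's own cell counts every element of the slug, so a very large subcategory could in principle enter its top 10.

```python
from collections import Counter

# Hypothetical projects; each 'category' is a slug split on "/" as in the preprocessing step.
toy_projects = [
    {"category": ["games", "tabletop games"]},
    {"category": ["games", "video games"]},
    {"category": ["music", "rock"]},
    {"category": ["games", "tabletop games"]},
    {"category": ["music"]},
    {"category": ["art"]},
]

# Count only the main category so subcategories cannot enter the ranking.
main_counts = Counter(p["category"][0] for p in toy_projects)
top_2 = [name for name, _ in main_counts.most_common(2)]
print(top_2)
```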

In [17]:
def build_feats(projects, category):
    
    # Count the unigrams, bigrams, and trigrams in the category (without printing them)
    grams = grams_by_category(projects, category, do_print=False)

    # Map grams to unique index for easy vectorization
    grams_to_idx = map_gram_to_idx(grams)

    # Map unique index to gram to quickly convert vectorization to txt
    idx_to_grams = [0] * len(grams_to_idx)

    for gram, idx in grams_to_idx.items():
        idx_to_grams[idx] = gram
        
    # Build feats + labels for model training
    feats = []
    labels = []

    for project in projects:
        if project['category'][0] == category or category == 'all':
            encoding = vectorize(project, grams_to_idx)

            # Label represents amt pledged
            label = project['pledged']

            feats.append(encoding)
            labels.append(label)
            
    return idx_to_grams, feats, labels

In [18]:
def create_model(projects, category, validate=False):
            
    idx_to_grams, feats, labels = build_feats(projects, category)
            
    # 90-10 split feats and labels; 90% training data and 10% test data
    feats_train = feats[:int(len(feats) * .9)]
    feats_test  = feats[int(len(feats) * .9):]

    labels_train = labels[:int(len(labels) * .9)]
    labels_test  = labels[int(len(labels) * .9):]
    
    model = linear_model.Ridge(alpha=1000)     # Initialize model
    model.fit(feats_train, labels_train)       # Train model
    
    # If validate=True, then validate model using 10% of data
    if validate:
        predictions = model.predict(feats_test)
        
        # scikit-learn's signature is (y_true, y_pred); MSE is symmetric, so the value is unchanged
        MSE = mean_squared_error(labels_test, predictions)
        print("MSE:", MSE)
        
    word_corrs = sorted(zip(idx_to_grams, model.coef_), key=lambda t: -t[1])
        
    return model, word_corrs
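
To build intuition for what the `alpha` penalty does to a keyword coefficient, here is a sketch of ridge regression for a single binary feature, using the closed-form solution with an unpenalized intercept. To the best of our understanding this is the same objective scikit-learn's `Ridge` minimizes when it fits an intercept, but the function `ridge_single_feature` and the data are hypothetical and only meant as an illustration.

```python
def ridge_single_feature(x, y, alpha):
    """Closed-form ridge fit for one feature with an unpenalized intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    coef = sxy / (sxx + alpha)          # a larger alpha shrinks the coefficient toward 0
    intercept = y_mean - coef * x_mean
    return intercept, coef


# Hypothetical data: 1 if a keyword appears in the project text, and the dollars pledged.
has_keyword = [1, 1, 0, 0, 0, 1]
pledged = [9000.0, 11000.0, 2000.0, 3000.0, 1000.0, 10000.0]

for alpha in (0.0, 1.0, 1000.0):
    intercept, coef = ridge_single_feature(has_keyword, pledged, alpha)
    print(alpha, round(intercept, 2), round(coef, 2))
```

With `alpha=0` the coefficient equals the difference in mean pledges between projects with and without the keyword; as `alpha` grows, the coefficient shrinks toward zero, most strongly when few projects contain the keyword.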

In [19]:
projects = read_data()

In [20]:
all_categories = []

# Get list of all possible categories
for project in projects:
    for category in project['category']:
        all_categories.append(category)
        
all_categories = nltk.FreqDist(all_categories)

# Get top-10 categories
top_10_categories = [category[0] for category in all_categories.most_common(10)]

In [21]:
grams = {}
coefs = []

for category in top_10_categories:
    temp = {}
    
    LR, corrs = create_model(projects,category)
    
    temp['grams'] = [t[0] for t in corrs]
    temp['monetary_impact'] = [t[1] for t in corrs]
    
    coefs.append([category, LR.intercept_,LR.coef_[-1]])
    
    grams[category] = pd.DataFrame(temp)

In [22]:
grams_df = pd.concat(grams, axis=1, keys=top_10_categories)
coefs_df = pd.DataFrame.from_records(coefs, columns=['category', 'intercept', 'goal_v_raised'], index='category')

In [23]:
grams_df.style

| | film & video: grams | film & video: monetary_impact | music: grams | music: monetary_impact | technology: grams | technology: monetary_impact | art: grams | art: monetary_impact | publishing: grams | publishing: monetary_impact | food: grams | food: monetary_impact | games: grams | games: monetary_impact | fashion: grams | fashion: monetary_impact | design: grams | design: monetary_impact | comics: grams | comics: monetary_impact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | animate | 7317.33 | new | 1781.14 | smart | 24619.1 | museum | 1933.49 | 100 | 1946.78 | beer | 2774.52 | board game | 21790.6 | jacket | 11067.0 | versatile | 17195.1 | hardcover | 2430.22 |
| 1 | documentary | 5659.02 | new album | 1573.29 | first | 19470.4 | book | 1831.71 | art | 1838.52 | chef | 2559.67 | board | 20017.1 | world | 9210.71 | travel | 14930.8 | collection | 1889.16 |
| 2 | bring | 5115.72 | join | 1424.28 | camera | 18485.7 | bring | 1323.86 | great | 1803.69 | brewing | 2413.46 | 1 4 | 18965.8 | travel | 9086.02 | pack | 14911.6 | webcomic | 1844.02 |
| 3 | big | 5058.02 | album | 1390.37 | affordable | 16418.2 | build | 1318.98 | artist | 1781.25 | first | 2126.79 | 1 | 17121.9 | pocket | 8325.47 | system | 12333.0 | volume | 1783.11 |
| 4 | episode | 4061.5 | make | 1369.34 | world first | 16230.6 | tarot | 1317.42 | book | 1750.33 | kitchen | 1945.76 | set | 16656.5 | world | 8130.77 | backpack | 12103.9 | book | 1710.98 |
| 5 | back | 3819.63 | new studio | 1277.84 | world | 14379.5 | beauty | 1287.68 | game | 1628.49 | craft | 1871.91 | new | 14963.2 | world good | 7453.08 | smart | 11848.3 | death | 1516.59 |
| 6 | need help | 3461.85 | record | 1236.08 | power | 13917.5 | deck | 1205.16 | art book | 1593.19 | american | 1687.63 | 4 player | 13756.3 | build | 7228.73 | world | 11559.3 | print | 1443.99 |
| 7 | new | 3453.77 | experience | 1090.08 | 3d | 13280.8 | black | 1203.82 | fairy tale | 1409.25 | base | 1404.42 | survival | 13434.4 | feature | 7050.74 | two | 10723.7 | new | 1409.46 |
| 8 | ... | 3371.0 | play | 1088.86 | experience | 13145.7 | art book | 1200.49 | world | 1390.78 | home | 1340.45 | world | 11835.3 | good | 5603.87 | carry | 9753.01 | year | 1369.1 |
| 9 | movie | 3058.44 | studio album | 1087.16 | 3d printer | 13086.7 | new | 1073.07 | inspire | 1286.07 | fresh | 1305.79 | game set | 11819.1 | performance | 3202.19 | line | 8173.3 | anthology | 1162.6 |


In [13]:
coefs_df.style

| category | intercept | goal_v_raised |
|---|---|---|
| film & video | 11443.5 | 0.000258249 |
| music | 3702.47 | 0.002821 |
| technology | 32209.1 | 0.000938943 |
| art | 3774.03 | 6.41484e-05 |
| publishing | 5674.28 | 0.00116006 |
| food | 7014.88 | -4.33325e-05 |
| games | 29016.4 | 0.00547073 |
| fashion | 12740.5 | 0.00273138 |
| design | 33006.9 | -9.67402e-05 |
| comics | 5180.89 | 0.445858 |
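
To show how these numbers combine, the following sketch computes the model's prediction for one hypothetical technology project. The coefficients are copied from the tables above; the project itself (it contains the gram "smart", none of the other modeled grams, and has a $10,000 goal) is an assumption made purely for illustration.

```python
# Values copied from the tables above for the technology category.
intercept = 32209.1          # baseline prediction when no modeled gram is present
smart_coef = 24619.1         # monetary_impact of the unigram "smart"
goal_coef = 0.000938943      # goal_v_raised: change in prediction per unit of goal

# Hypothetical project: contains "smart", none of the other modeled grams,
# and sets a goal of 10,000.
goal = 10_000
predicted = intercept + smart_coef * 1 + goal_coef * goal
print(round(predicted, 2))
```

Because each gram feature is binary, a gram's coefficient is simply added to the prediction when the gram is present, while the goal contributes only about 9 of the total here, which is consistent with the observation that the goal has very little influence on the amount raised.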


# ETHICAL CONSIDERATIONS: 

Collection bias:   
It is important to remember that this is specifically a dataset of Kickstarter users, who are not necessarily representative of the overall population. Kickstarter is a crowdfunding website, and its users are exclusively people who are willing to invest in a project that may not come to fruition for years, or perhaps ever. As such, participation in Kickstarter funding is going to be limited to people with the wealth necessary to make that kind of investment. Furthermore, since Kickstarter is an online resource, demographics will again be skewed towards typical web users. Therefore, our results should not be viewed as reflecting the needs of the population overall, but rather as skewed towards the needs of people who are relatively wealthy and web-savvy.

Informed consent and PII:  
All information is scraped from Kickstarter projects that were intentionally made public with the goal of fundraising. Furthermore, our analysis uses only aggregate values for the 200 most common unigrams and 100 most common bigrams in each category, which means we only analyze keywords that were present in thousands of different campaigns. For this reason, it is difficult to imagine how any of the results could be used to personally identify individuals or their campaigns. 

Unintended use:   
One potential abuse of this information is for people to create fake Kickstarter projects for the sole purpose of raising money, by using the most popular keywords. However, there is far more to a successful Kickstarter campaign than just the description, and we believe that this analysis alone would make it difficult to trick backers. 
