# A look at crowdfunding

Kickstarter data here: https://webrobots.io/kickstarter-datasets/

Indiegogo data here: https://webrobots.io/indiegogo-dataset/

Remember the data checklist:

**What**

* Number of observations
* Definition of entities
* Missing data
* Activities of relevance for project

**Where**

* Geographical unit of analysis
* Geographical distribution

**When**

* Time coverage and trends




In [1]:
import datetime
import json
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
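def flatten_list(lol):
    '''
    Flatten a list of lists into a single list

    '''
    return [x for el in lol for x in el]


def sample_obs(data, field, sample_size, text_length):
    '''
    Prints a random sample of values from a text field for sense-checking

    '''
    # Drop missing values so that the slicing below does not fail
    rel = list(data[field].dropna())

    # random.sample raises a ValueError if asked for more items than exist
    out = random.sample(rel, min(sample_size, len(rel)))

    for s in out:
        print('====')
        print(s[:text_length])
        print('\n')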


In [3]:
with open('../data/external/crowdfunding/Kickstarter_2018-10-18T03_20_48_880Z.json','r') as infile:
    k_json = [json.loads(line) for line in infile]

In [4]:
kdata = [obs['data'] for obs in k_json]

In [None]:
kdf = pd.DataFrame(kdata)

In [None]:
kdf.head()

In [None]:
kdf.shape

There are around 205,000 projects
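
As a sanity check on the loading step, here is a minimal sketch on made-up records. It assumes, as in the cells above, that the file holds one JSON object per line with the project fields nested under a `data` key. The toy field values are invented, and `toy_file`, `records` and `toy_df` are names used only for illustration.

```python
import io
import json

import pandas as pd

# Two toy records shaped like the scrape: one JSON object per line,
# with the project fields nested under "data" (values are made up).
toy_file = io.StringIO(
    '{"data": {"id": 1, "state": "live", "blurb": "A health tracker"}}\n'
    '{"data": {"id": 2, "state": "failed", "blurb": "A card game"}}\n'
    '\n'
)

# Skip blank lines so that a trailing newline does not raise a JSONDecodeError.
records = [json.loads(line) for line in toy_file if line.strip()]

# Keep only the nested project payload, then build the table.
toy_df = pd.DataFrame([rec["data"] for rec in records])
print(toy_df.shape)
```

Skipping blank lines is a small defensive choice; the original cell works as long as every line in the file is a complete JSON object.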

#### What is an entity here?

In [None]:
kdf.loc[0]

An entity is a project. 

In [None]:
kdf['state'].value_counts()

This seems to include every project ever launched on Kickstarter.

In [None]:
kdf.columns

Some variables we will work with shortly:

* Category / blurb (to find activities related to health)
* Country / location (to analyse geography)
* Launched at / created at / deadline / state (to measure trends and geography)

### Missing values

In [None]:
# Share of missing values in each column
kdf.isna().mean().plot.bar(color='navy')

Very little data is missing.
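
To read the same information as numbers rather than from a chart, a sketch on toy data follows. The column names mirror fields used in this notebook, but the values are made up and `toy` and `missing_share` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy table with gaps in two columns (values are made up).
toy = pd.DataFrame({
    "blurb": ["a", None, "c", "d"],
    "country": ["US", "GB", "US", "CA"],
    "location": [{"short_name": "Austin, TX"}, np.nan, np.nan,
                 {"short_name": "Toronto, Canada"}],
})

# isna() marks missing cells; the column mean is the share that is missing.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the columns that need attention at the top.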

### Activities of relevance to the project.

In this case, these are projects that are about health or that mention it.

#### Check categories 

In [None]:
kdf['category'][0]

We need to extract the categories from this field

In [None]:
kdf['category_value']=[x['name'] for x in kdf['category']]

In [None]:
kdf_cats = kdf.category_value.value_counts()

len(kdf_cats)

159 categories
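
The extraction above assumes that every entry in `category` is a dict with a `name` key. A slightly more defensive sketch, on made-up records, returns a missing value instead of failing when that is not the case. `get_name` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy nested field: two dicts and one missing entry (values are made up).
toy = pd.DataFrame({
    "category": [
        {"id": 1, "name": "Tabletop Games"},
        {"id": 2, "name": "Food"},
        np.nan,
    ]
})


def get_name(value):
    """Return the 'name' entry of a dict, or NaN for anything else."""
    return value.get("name", np.nan) if isinstance(value, dict) else np.nan


toy["category_value"] = toy["category"].map(get_name)
print(toy["category_value"].tolist())
```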

In [None]:
plt.hist(kdf_cats,bins=50,color='navy')

In [None]:
kdf_cats.head(n=10)

In [None]:
# Exact match on the lower-cased category names
'health' in [x.lower() for x in kdf_cats.index]

There is no category called 'health'.
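
Note that the membership test above only finds a category whose whole name is 'health'. A substring search is a broader check. The sketch below uses a made-up list of names (`toy_names`, where 'Health & Fitness' is hypothetical) to show the difference between the two.

```python
# Hypothetical category names, for illustration only.
toy_names = ["Product Design", "Tabletop Games", "Health & Fitness", "Food"]

# Exact match: is there a category called precisely "health"?
exact = "health" in [name.lower() for name in toy_names]

# Substring match: which category names contain "health" anywhere?
matches = [name for name in toy_names if "health" in name.lower()]

print(exact, matches)
```

On the real data the equivalent would be to loop over `kdf_cats.index`; an empty list there would confirm that no category name mentions health at all.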

#### Check blurbs

In [None]:
keywords = ['health','well-being','wellbeing']

In [None]:
# New boolean field for projects mentioning healthy stuff (non-text blurbs count as False)
kdf['healthy'] = [isinstance(text, str) and any(x in text.lower() for x in keywords) for text in kdf.blurb]

In [None]:
kdf['healthy'].sum()

1,955 projects mention at least one of the keywords.
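
The keyword flag is a plain substring match, so 'health' also catches 'healthy' or 'healthcare'. The same idea can be sketched with pandas string methods, which also handle missing blurbs. The blurbs below are made up, and `pattern`, `toy_blurbs` and `toy_healthy` are illustrative names.

```python
import re

import pandas as pd

keywords = ["health", "well-being", "wellbeing"]

# Build one alternation pattern; re.escape guards against special characters.
pattern = "|".join(re.escape(k) for k in keywords)

# Toy blurbs, including a missing one (values are made up).
toy_blurbs = pd.Series([
    "A HEALTHY snack bar",
    "A card game",
    None,
    "An app for well-being",
    "Wealth tracker",
])

# case=False ignores capitalisation; na=False treats missing blurbs as no match.
toy_healthy = toy_blurbs.str.contains(pattern, case=False, regex=True, na=False)
print(toy_healthy.tolist())
```

'Wealth tracker' is not flagged because it does not contain the full string 'health'.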

In [None]:
health_categories = pd.crosstab(kdf['category_value'],kdf['healthy']).sort_values(True,ascending=False)

health_categories[:10]

Interesting mix

In [None]:
for x in health_categories.index[:15]:
    print(x)
    print('====')
    sample_obs(kdf.loc[(kdf.category_value==x) & (kdf.healthy==True)],'blurb',sample_size=2,text_length=500)
    

In [None]:
pd.crosstab(kdf['state'],kdf['healthy'])

There were 109 live health-related projects at the time of the scrape (October 2018).

### Where

In [None]:
pd.crosstab(kdf['country'],kdf['healthy']).sort_values(True,ascending=False)

Almost exclusively developed countries

What are the healthiest cities (in terms of Kickstarter projects)?

In [None]:
kdf['location'][0]

In [None]:
# Extract the location name (missing locations stay as NaN)
kdf['location_value'] = [val['short_name'] if isinstance(val, dict) else np.nan for val in kdf['location']]

In [None]:
health_cities = pd.crosstab(kdf['location_value'],kdf['healthy']).sort_values(True,ascending=False)[:30]

In [None]:
health_cities['ratio'] = health_cities.apply(lambda x: x[True]/x[False],axis=1)

In [None]:
health_cities.sort_values('ratio',ascending=False)['ratio'].plot.bar(color='navy',title='Ratio of health to non-health projects by city')
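
The ratio above divides by the number of non-health projects, so it is not well defined for a city that has none, and it can be noisy for cities with very few projects. One alternative, sketched here on made-up data, is the share of health projects among all projects in a city, with a minimum project count. The `min_projects` threshold and the toy city rows are assumptions for illustration.

```python
import pandas as pd

# Toy projects: one row per project (values are made up).
toy = pd.DataFrame({
    "location_value": ["London, UK"] * 4 + ["Austin, TX"] * 2 + ["Smallville"],
    "healthy": [True, False, False, False, True, True, True],
})

# Per city: number of health projects, total projects, and health share.
by_city = toy.groupby("location_value")["healthy"].agg(
    health="sum", total="count", share="mean"
)

# Hypothetical threshold: ignore cities with too few projects to be meaningful.
min_projects = 2
ranked = by_city[by_city["total"] >= min_projects].sort_values(
    "share", ascending=False
)
print(ranked)
```

A share is bounded between 0 and 1, which makes cities easier to compare than an unbounded ratio.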

### When

In [None]:
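# created_at is a Unix timestamp in seconds; convert it to a calendar year
# (fromtimestamp uses the local timezone of the machine running the notebook)
kdf['created_year'] = [datetime.datetime.fromtimestamp(ts).year for ts in kdf['created_at']]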
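
A vectorised alternative is `pd.to_datetime` with `unit='s'`, sketched below on three made-up timestamps. Passing `utc=True` fixes the timezone explicitly, so the year does not depend on where the notebook runs.

```python
import pandas as pd

# Toy Unix timestamps in seconds (2010-01-01, 2018-01-01 and 2018-10-18, all UTC).
toy = pd.DataFrame({"created_at": [1262304000, 1514764800, 1539832848]})

# Parse the whole column at once, then pull out the calendar year.
toy["created_year"] = pd.to_datetime(toy["created_at"], unit="s", utc=True).dt.year
print(toy["created_year"].tolist())
```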

In [None]:
# Sort by year (the index) so that the bars read chronologically
kdf['created_year'].value_counts().sort_index().plot.bar(color='navy')

In [None]:
fig,ax = plt.subplots(figsize=(10,6))

pd.crosstab(kdf['country'],kdf['created_year']).T.plot.bar(stacked=True,ax=ax,title='Projects per country')
ax.legend(bbox_to_anchor=(1.01,1))

I hadn't realised that Kickstarter was declining! (The US is the top orange band, the UK is blue and China is red.)

In [None]:
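# normalize='index' divides each year's row by its total, giving the share of that
# year's projects that mention health (normalize=1 normalises within columns instead)
pd.crosstab(kdf['created_year'],kdf['healthy'],normalize='index')[True].plot(title='Proportion of projects that mention health')

The proportion of projects that mention health appears to have declined since a peak in 2016. This could, of course, be driven by changes in the popularity of different categories.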
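
Because the direction of normalisation changes what the line means, here is a small sketch on made-up data that compares the two options in `pd.crosstab`. The toy years and flags are invented, and the `label` column is only there to give the crosstab readable column names.

```python
import pandas as pd

# Toy projects: four created in 2015, two in 2016 (values are made up).
toy = pd.DataFrame({
    "created_year": [2015] * 4 + [2016] * 2,
    "healthy": [True, False, False, False, True, False],
})

# Readable column labels for the crosstab.
label = toy["healthy"].map({True: "health", False: "other"})

# Row-wise: share of each year's projects that mention health.
by_year = pd.crosstab(toy["created_year"], label, normalize="index")

# Column-wise: how all health projects are distributed across years.
by_column = pd.crosstab(toy["created_year"], label, normalize="columns")

print(by_year["health"].tolist())    # per-year proportion
print(by_column["health"].tolist())  # distribution over years
```

In this toy example the per-year proportion doubles between 2015 and 2016, while the column-wise version is flat because each year has exactly one health project. Only the row-wise version answers the question of whether health is becoming more or less common over time.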

### Check tabletop games

In [None]:
pd.crosstab(kdf['created_year'],kdf['category_value'])['Tabletop Games'].plot.bar(color='navy')