# COGS 108 Final Project

## Group Members: 
- 

### Introduction and Background:



Which components of a Kickstarter campaign matter most for reaching its fundraising goal? How do factors such as launch date, duration, funding goal, and category compare in their influence on a campaign's success? Answering these questions can guide those who want to create their own Kickstarter and start a successful business. Previous projects in this area have studied what separates successful Kickstarters from failed ones. We take a different approach and analyze how these factors rank against each other.

Our initial hypothesis is that category will be the most important factor in a Kickstarter's success, because some categories are inherently more popular and appealing than others. Categories such as “Design and Tech” will likely be more successful due to the popularity of technology-driven products.


### Data Description:


Dataset Name: Kickstarter Datasets (Web Robots)

Link to the dataset: https://webrobots.io/kickstarter-datasets/

Number of observations: 207k

The dataset is a large collection (200k+ observations) of data about Kickstarter projects, including whether each one succeeded in reaching its goal. Other information ranges from the name and category of the project to how much it asked for and how long the fundraising period was.

The information includes: the internal Kickstarter ID, the name of the project, the category and main category of the campaign, the currency used to support it, the deadline for crowdfunding, the fundraising goal, the date launched, the amount pledged by the "crowd", and the current state of the project.


### Starting out: imports

In [2]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import json

from datetime import datetime, timedelta

### Data Cleaning and Pre-processing

Here we load the JSON file and select the columns that we want from the web scraper's data. We also apply a function to parse the category slug, which contains the main category and subcategory separated by a forward slash; we keep only the main category.

In [5]:
with open('Kickstarter_2018-12-13T03_20_05_701Z.json') as f:
    
    data = []
    
    # Iterate through each line in file which is a JSON object
    for line in f:
        
        # Load object
        obj = json.loads(line)
 
        # Choose columns
        item = []
        item.append(obj['data']['id'])
        item.append(obj['data']['name'])
        item.append(obj['data']['blurb'])
        item.append(obj['data']['goal'])
        item.append(obj['data']['pledged'])
        item.append(obj['data']['state'])
        item.append(obj['data']['country'])
        item.append(obj['data']['deadline'])
        item.append(obj['data']['created_at'])
        item.append(obj['data']['launched_at'])
        item.append(obj['data']['backers_count'])
        item.append(obj['data']['usd_pledged'])
        item.append(obj['data']['category']['slug'])
        item.append(obj['data']['category']['name'])
        data.append(item)
        
# Fix columns list if adding/removing columns
columns = ['id', 'name', 'blurb', 'goal', 'pledged', 'state', 'country', 'deadline', 'created_at', 'launched_at', 'backers_count', 'usd_pledged', 'category', 'subcategory']
df = pd.DataFrame(data, columns=columns)

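def get_main_category(slug):
    # The slug is separated by a forward slash; keep only the part before it
    return slug.split('/')[0]

# Replace the full slug with the main category
df['category'] = df['category'].apply(get_main_category)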
df.head()

Unnamed: 0,id,name,blurb,goal,pledged,state,country,deadline,created_at,launched_at,backers_count,usd_pledged,category,subcategory
0,1555581815,Big Top Without Borders,A documentary about two circuses in remote cor...,25000.0,27455.55,successful,US,1353256229,1339525842,1350660629,170,27455.55,film & video,Documentary
1,583419300,"The Story of ""Pweep"": From Egg - To Peacock",A multi-media IPad book telling the true story...,500.0,535.0,successful,US,1355949544,1351941026,1353357544,10,535.0,publishing,Children's Books
2,1745190062,DC Radio,We are college students that get drunk and the...,3500.0,0.0,failed,CA,1418916011,1415917256,1416324011,0,0.0,journalism,Audio
3,1995203117,Ali Bangerz- two New Full Lenght Albums,"its Ali bangerz,its time to stand up for other...",20000.0,0.0,failed,US,1449345000,1446664703,1446672167,0,0.0,music,World Music
4,359013399,Deja-Vu: Dissecting Memory on Camera,A young neuroscientist attempts to reconnect w...,5000.0,6705.0,successful,US,1287200340,1284003536,1284042614,62,6705.0,film & video,Documentary
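The loading step above can also be written with a small helper, which makes the slug handling easier to check in isolation. The sketch below is only an illustration: the `parse_record` helper and the two sample records are made up for this example (they mimic the `data` layout used above), and it reads from an in-memory string rather than the real file.

```python
import io
import json

import pandas as pd

# Hypothetical sample in the same line-delimited layout as the scraped file;
# the values are invented for illustration only.
SAMPLE = "\n".join([
    json.dumps({"data": {"id": 1, "name": "Project A", "state": "successful",
                         "category": {"slug": "film & video/documentary",
                                      "name": "Documentary"}}}),
    json.dumps({"data": {"id": 2, "name": "Project B", "state": "failed",
                         "category": {"slug": "music", "name": "Music"}}}),
])


def parse_record(line):
    """Turn one JSON line into a flat dict (hypothetical helper)."""
    project = json.loads(line)["data"]
    slug = project["category"]["slug"]
    return {
        "id": project["id"],
        "name": project["name"],
        "state": project["state"],
        # The part before the first '/' is the main category; a slug
        # without '/' is returned unchanged.
        "category": slug.split("/")[0],
        "subcategory": project["category"]["name"],
    }


# Each non-empty line is one JSON object, as in the real file.
records = [parse_record(line) for line in io.StringIO(SAMPLE) if line.strip()]
sample_df = pd.DataFrame(records)
print(sample_df[["id", "category", "subcategory"]])
```

Building a list of dicts lets pandas infer the column names, so there is no separate `columns` list to keep in sync when fields are added or removed.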


Next, we notice that the deadline, created_at, and launched_at values look odd. It turns out they are in Unix time, so we will transform them to be more readable.

In [8]:
deadline_str = []
duration_str = []
launch_date = []
deadline_int = []

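for i, row in df.iterrows():

    # Deadline: Unix seconds -> readable string.
    # Note: fromtimestamp() uses the machine's local time zone; we then shift by two hours.
    unix_ts = int(row['deadline'])
    dt = (datetime.fromtimestamp(unix_ts) - timedelta(hours=2)).strftime('%Y-%m-%d %H:%M:%S')
    deadline_str.append(dt)
    
    # Launch date: same conversion as the deadline
    launch_date_ts = int(row['launched_at'])
    launch_dt = (datetime.fromtimestamp(launch_date_ts) - timedelta(hours=2)).strftime('%Y-%m-%d %H:%M:%S')
    launch_date.append(launch_dt)
    
    # Duration: difference in seconds, converted to (fractional) days
    duration_unix_ts = unix_ts - launch_date_ts
    duration_days = (duration_unix_ts / (60*60*24))
    duration_str.append(duration_days)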
    
df['launched_at_str'] = launch_date
df['deadline_str'] = deadline_str
df['duration_days'] = duration_str

for dt in df['deadline_str']:
    
    # Rearrange 'YYYY-MM-DD HH:MM:SS' into an 'MMDDYYYY' string
    year, month, day = dt.split(' ')[0].split('-')
    deadline_int.append(month + day + year)

df['deadline_int'] = deadline_int
    
# Check df
df.head()

Unnamed: 0,id,name,blurb,goal,pledged,state,country,deadline,created_at,launched_at,backers_count,usd_pledged,category,subcategory,launched_at_str,deadline_str,duration_days,deadline_int
0,1555581815,Big Top Without Borders,A documentary about two circuses in remote cor...,25000.0,27455.55,successful,US,1353256229,1339525842,1350660629,170,27455.55,film & video,Documentary,2012-10-19 06:30:29,2012-11-18 06:30:29,30.041667,11182012
1,583419300,"The Story of ""Pweep"": From Egg - To Peacock",A multi-media IPad book telling the true story...,500.0,535.0,successful,US,1355949544,1351941026,1353357544,10,535.0,publishing,Children's Books,2012-11-19 10:39:04,2012-12-19 10:39:04,30.0,12192012
2,1745190062,DC Radio,We are college students that get drunk and the...,3500.0,0.0,failed,CA,1418916011,1415917256,1416324011,0,0.0,journalism,Audio,2014-11-18 05:20:11,2014-12-18 05:20:11,30.0,12182014
3,1995203117,Ali Bangerz- two New Full Lenght Albums,"its Ali bangerz,its time to stand up for other...",20000.0,0.0,failed,US,1449345000,1446664703,1446672167,0,0.0,music,World Music,2015-11-04 11:22:47,2015-12-05 09:50:00,30.935567,12052015
4,359013399,Deja-Vu: Dissecting Memory on Camera,A young neuroscientist attempts to reconnect w...,5000.0,6705.0,successful,US,1287200340,1284003536,1284042614,62,6705.0,film & video,Documentary,2010-09-09 05:30:14,2010-10-15 18:39:00,36.547755,10152010
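The same conversion can be done without an explicit loop. The sketch below is one possible vectorized version, not the code used in this notebook: it takes two rows of timestamps from the table above, and it interprets them in UTC, so the clock times (and occasionally the date) can differ from the local-time strings shown above. The column name `deadline_mmddyyyy` is made up for this example.

```python
import pandas as pd

# Two rows copied from the table above (deadline and launched_at only).
times = pd.DataFrame({
    "deadline": [1353256229, 1355949544],
    "launched_at": [1350660629, 1353357544],
})

# unit="s" interprets the integers as seconds since the Unix epoch (UTC).
deadline_dt = pd.to_datetime(times["deadline"], unit="s")
launched_dt = pd.to_datetime(times["launched_at"], unit="s")

# Subtracting two datetime columns gives a Timedelta column.
times["duration_days"] = (deadline_dt - launched_dt).dt.total_seconds() / 86400

# MMDDYYYY string, the same layout as deadline_int above, but in UTC.
times["deadline_mmddyyyy"] = deadline_dt.dt.strftime("%m%d%Y")

print(times)
```

On a dataset of this size, column-wise operations like these are typically much faster than `iterrows()`, and they avoid depending on the time zone of the machine running the notebook.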


Finally, we check for any null values in our DataFrame. An empty result means that no row has a missing value.

In [4]:
df.loc[df.isnull().any(axis=1), :]

Unnamed: 0,id,name,blurb,goal,pledged,state,country,deadline,created_at,launched_at,backers_count,usd_pledged,category,subcategory
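For a quick per-column summary, the null check can also be expressed as counts. The sketch below uses a tiny made-up frame (not the Kickstarter data) with one missing value, to show what both forms of the check return when something is missing.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one missing blurb, for illustration only.
sample = pd.DataFrame({
    "name": ["Project A", "Project B", "Project C"],
    "blurb": ["First blurb", np.nan, "Third blurb"],
    "goal": [500.0, 3500.0, 20000.0],
})

# Number of missing values in each column.
null_counts = sample.isnull().sum()
print(null_counts)

# Rows that contain at least one missing value.
rows_with_nulls = sample[sample.isnull().any(axis=1)]
print(rows_with_nulls)
```

If every count is zero, as the empty result above suggests for our data, no rows need to be dropped or imputed.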


### Data Visualization

### Data Analysis and Results

### Privacy and Ethics Considerations

For our research question, we chose to look at what makes a Kickstarter project successful. The data was collected from a website whose owners obtained it by web scraping Kickstarter. The only potential privacy issue is that Kickstarter users did not give us direct consent to analyze their data. However, because Kickstarter is a public forum, users must acknowledge, in order to use the site, that their data will not be private.

Additionally, we are not using the data to generate revenue; we are using it for academic purposes. Thus, it is unlikely that there will be privacy-related issues. Furthermore, our data does not violate the Safe Harbor laws, as no names, addresses, or other identifying information is included in the dataset.


### Conclusions and Discussion:

