# Analysis of the Kickstarter Dataset
This dataset contains 378661 projects that ran on the Kickstarter crowdfunding platform. There are two main target attributes we can look at in the data:

* The amount pledged to a project, which is the more interesting target from the point of view of Kickstarter itself, since the platform profits by commission.
* The final state of a project, because a project can either reach its pledge goal and succeed, or fail.

Our goal in this study is to analyze which attributes most influence these two targets. Obtaining a good predictive model is not a realistic aim, given that the biggest keys to success on Kickstarter are the introduction videos and the offers, and that information is not contained in our dataset. Our main goal is therefore to analyze the influence of the attributes on our targets.

## Loading the data

In [1]:
# Missing values are encoded as "?" in this file
data <- read.csv("datasets/ks-projects-201801_WithOtherActive.csv", header = TRUE, na.strings = "?")
dim(data)

## Analyzing the existing attributes

The `other_active_projects` attribute was created by the team using a Java program. It contains the number of projects that were active at the time that project was launched. We simply had a hunch it would be relevant.
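
The Java program itself is not shown here. As a rough illustration of what such a count could look like in R, the sketch below counts, for each project, how many other projects had already launched and had not yet reached their deadline. The function name `count_other_active`, the toy dates, and the exact definition of "active" (launched on or before the launch date, deadline strictly after it) are assumptions, not the team's actual implementation.

```r
# Toy data; the dates are made up for illustration.
toy <- data.frame(
  launched = as.Date(c("2017-01-01", "2017-01-10", "2017-01-20", "2017-02-15")),
  deadline = as.Date(c("2017-01-31", "2017-02-09", "2017-02-19", "2017-03-17"))
)

# Hypothetical helper: for each project, count the projects that were
# running at its launch date, then subtract 1 to exclude the project itself.
count_other_active <- function(launched, deadline) {
  vapply(seq_along(launched), function(i) {
    active <- launched <= launched[i] & deadline > launched[i]
    sum(active) - 1L
  }, integer(1))
}

toy$other_active_projects <- count_other_active(toy$launched, toy$deadline)
toy
```

This version compares every project with every other one, so it is only practical on small samples. A dataset of this size would need a sorted or interval-based approach.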

In [2]:
names(data)

## Entry removal

Removing entries for projects that were not yet finished at the time of the snapshot or that had an unusual status.

In [3]:
data <- data[data$state %in% c("successful", "failed"),]
dim(data)
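
One detail worth checking after this kind of filter: if `state` was read as a factor (the default for `read.csv` in R versions before 4.0), the removed states remain as empty levels. The sketch below uses a made-up vector, not the real column, to show the effect and how `droplevels` clears it.

```r
# Toy factor standing in for data$state; the values are illustrative.
state <- factor(c("successful", "failed", "canceled", "live", "failed"))

# Same filter as in the cell above, applied to the toy vector.
kept <- state[state %in% c("successful", "failed")]
levels(kept)   # the filtered-out states are still listed as levels

# Drop the levels that no longer occur.
kept <- droplevels(kept)
levels(kept)
```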

## Column removal
Removing columns that should not be used or are unlikely to be useful:
* `goal` because we have `usd_goal_real`, which is all in the same currency.
* `pledged` and `usd_pledged` because we have `usd_pledged_real`.
* `state` because we don't want it for the `usd_pledged_real` regression and we can recreate it with the simple condition `usd_pledged_real > usd_goal_real`.
* `backers` because that is part of the final result.

In [4]:
projects <- data[,-c(7, 9, 10, 11, 13)]
names(projects)
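
As a small illustration of the claim that `state` can be recreated from the two remaining amount columns, the sketch below applies the condition from the list above to a made-up data frame. The column `state_recreated` and the toy amounts are assumptions for the example.

```r
# Toy amounts in USD; the third project meets its goal exactly.
toy <- data.frame(
  usd_pledged_real = c(1500, 200, 5000),
  usd_goal_real    = c(1000, 3000, 5000)
)

# Condition taken from the text: pledged strictly greater than goal.
toy$state_recreated <- ifelse(toy$usd_pledged_real > toy$usd_goal_real,
                              "successful", "failed")
toy
```

With the strict `>`, a project that exactly meets its goal is labeled as failed. Whether `>=` matches the original `state` labels better is worth checking against the data before relying on the recreated column.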

### Transforming the factors into strings

In [5]:
projects$launched <- as.character(projects$launched)
projects$deadline <- as.character(projects$deadline)
projects$category <- as.character(projects$category)
projects$main_category <- as.character(projects$main_category)
projects$country <- as.character(projects$country)
projects$currency <- as.character(projects$currency)
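
The same conversion can be written more compactly by listing the columns once and applying `as.character` to each. This is a sketch on a made-up data frame. The variable `to_convert` is a name chosen for the example.

```r
# Toy data frame with two factor columns and one numeric column.
toy <- data.frame(
  category      = factor(c("Poetry", "Music")),
  country       = factor(c("GB", "US")),
  usd_goal_real = c(1533.95, 30000)
)

# Convert only the listed columns; the others are left untouched.
to_convert <- c("category", "country")
toy[to_convert] <- lapply(toy[to_convert], as.character)

sapply(toy, class)
```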

## Now let's understand the attributes, one by one

### ID

In [6]:
length(unique(projects$ID))

Most likely the ID won't be useful in any way.

### Name

In [7]:
length(unique(projects$name))

Not useful as it is, but it has the potential to yield other features.
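
To give an idea of the kind of features that could be derived from a name, here is a minimal sketch on made-up project names. The three features (length, word count, presence of an exclamation mark) are illustrative choices, not features used in this study.

```r
# Made-up project names.
name <- c("My Board Game", "New Album!", "A Short Film About Tea")

name_features <- data.frame(
  name_length     = nchar(name),                               # characters
  name_words      = lengths(strsplit(trimws(name), "\\s+")),   # whitespace-separated words
  has_exclamation = grepl("!", name, fixed = TRUE)             # contains "!"
)
name_features
```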

### Launched & Deadline

Same as with the name: not useful as they are, but other features can be derived from them.
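
For instance, a campaign duration and a launch month could be derived from the two date strings. The sketch below assumes `launched` is formatted as `YYYY-MM-DD HH:MM:SS` and `deadline` as `YYYY-MM-DD`. The toy values and the new column names are assumptions for the example.

```r
# Toy rows; formats are assumed, values are illustrative.
toy <- data.frame(
  launched = c("2017-09-02 04:43:57", "2015-07-04 08:35:03"),
  deadline = c("2017-11-01", "2015-08-29"),
  stringsAsFactors = FALSE
)

# Keep only the date part of the launch timestamp.
launch_date   <- as.Date(substr(toy$launched, 1, 10))
deadline_date <- as.Date(toy$deadline)

toy$duration_days <- as.numeric(deadline_date - launch_date)  # whole days
toy$launch_month  <- as.integer(format(launch_date, "%m"))
toy$launch_year   <- as.integer(format(launch_date, "%Y"))
toy
```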

### Category & Main Category

In [8]:
unique(projects$main_category)

The high number of different values for this nominal attribute might be a problem.

In [9]:
length(unique(projects$category))

Moreover, as we can observe here, a few categories appear under several main categories. This must be handled, otherwise we will have redundant information.

In [10]:
tmp <- unique((projects[,c("category", "main_category")]))
dim(tmp)

In [12]:
library(plyr)
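# Count how many main categories each category appears under
# (named category_counts to avoid masking base R's c())
category_counts <- count(tmp, "category")
category_counts[category_counts$freq > 1,]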

| | category | freq |
|---|---|---|
| 7 | Anthologies | 2 |
| 25 | Comedy | 4 |
| 45 | Events | 2 |
| 46 | Experimental | 2 |
| 54 | Festivals | 2 |
| 82 | Letterpress | 2 |
| 135 | Spaces | 3 |
| 152 | Web | 2 |
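
The same check can be done without `plyr`, using base R's `table`. This is a sketch on a made-up set of category pairs (the pairings are illustrative and not taken from the dataset). `pairs`, `freq`, and `shared` are names chosen for the example.

```r
# Made-up (category, main_category) pairs.
pairs <- data.frame(
  category      = c("Comedy", "Comedy", "Poetry", "Web", "Web", "Jazz"),
  main_category = c("Film & Video", "Theater", "Publishing",
                    "Journalism", "Technology", "Music"),
  stringsAsFactors = FALSE
)

# One row per distinct pair, then count how often each category occurs.
pairs <- unique(pairs)
freq <- table(pairs$category)

# Categories that appear under more than one main category.
shared <- freq[freq > 1]
shared
```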


### categoryconcat
The problems noted above were simple to deal with: we replaced the Category and Main Category attributes with a single attribute that is the concatenation of both.

In [14]:
projects$categoryconcat <- paste(projects$main_category, projects$category, sep = " - ")
projects <- projects[,-c(3,4)]
length(unique(projects$categoryconcat))
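
A quick sanity check for this step is that the number of distinct concatenated values equals the number of distinct (main category, category) pairs. The sketch below shows the idea on a made-up data frame. `n_pairs` and `n_concat` are names chosen for the example.

```r
# Toy rows where "Comedy" appears under two main categories.
toy <- data.frame(
  main_category = c("Film & Video", "Theater", "Publishing"),
  category      = c("Comedy", "Comedy", "Poetry"),
  stringsAsFactors = FALSE
)

# Same concatenation as in the cell above.
toy$categoryconcat <- paste(toy$main_category, toy$category, sep = " - ")

n_pairs  <- nrow(unique(toy[, c("main_category", "category")]))
n_concat <- length(unique(toy$categoryconcat))

# The two counts should match if the concatenation keeps every pair distinct.
n_pairs == n_concat
```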