# Python walk-through for Kickstarter data analysis & classification

## *This is a work in progress. Comments and critical feedback are always welcome.*

###### Note:
The task is to classify Kickstarter projects as failed or successful. I try several different models for this task.

###### Structure
1. Load Data and Modules
2. Initial Exploration
3. Data visualization
4. Data preparation
5. Preparing for modelling
6. Build models

English is not my native language, so I apologize for any mistakes.

If you like my kernel, please leave some feedback and **upvote** it.

# What is Kickstarter?

Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity and merchandising. The company's stated mission is to "help bring creative projects to life". Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.

People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This model traces its roots to the subscription model of arts patronage, where artists would go directly to their audiences to fund their work. [More...](https://en.wikipedia.org/wiki/Kickstarter)

LINK: [Kickstarter](https://www.kickstarter.com/)

![title](http://icopartners.com/newblog/wp-content/uploads/2014/03/kickstarter_header.png)

# **1. Load Data and Modules**

**Load Python modules**

In [None]:
import pandas as pd
import numpy as np
import string

In [None]:
# visualization

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.tools as tls
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
init_notebook_mode(connected=True)
import warnings
from collections import Counter

In [None]:
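# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed from scikit-learn)
from sklearn.model_selection import train_test_split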

In [None]:
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

import lightgbm as lgb
from lightgbm import LGBMClassifier

**Load input data.** 

In [None]:
# Load the data
df = pd.read_csv('../input/ks-projects-201801.csv',encoding ='latin1')

# 2. Initial Exploration

In [None]:
df.head(5)

In [None]:
df.describe()

In [None]:
print(df.shape)
print(df.info())

In [None]:
print(df.nunique())

# 3. Data visualization

Thanks for some of the plots go to this kernel:

https://www.kaggle.com/kabure/kickstarter-interactive-explanatory-exploration

**Looking at the distribution of the `state` column**

In [None]:
# Share of each project state, in percent
state = round(df["state"].value_counts() / len(df["state"]) * 100, 2)

print("State share in %: ")
print(state)

labels = list(state.index)
values = list(state.values)

trace1 = go.Pie(labels=labels, values=values, marker=dict(colors=['red']))

layout = go.Layout(title='Distribution of States', legend=dict(orientation="h"))

fig = go.Figure(data=[trace1], layout=layout)
iplot(fig)

This is very interesting.

52.2% - **failed** (we will use this value for classification; it will be encoded as **0**).

35.4% - **successful** (we will use this value for classification; it will be encoded as **1**).

We will not use the other values.
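To put the model scores later in the notebook in context, it helps to know what a trivial classifier would achieve. The sketch below only uses the two percentages quoted above; the later filter on `goal` shifts the class shares somewhat, so treat the result as a rough reference rather than the exact baseline of the final data set.

```python
# Class shares quoted above, in percent of all projects.
share_failed = 52.2
share_successful = 35.4

# After dropping every other state, only these two classes remain.
kept = share_failed + share_successful

# A model that always predicts "failed" is right for every failed project.
baseline_accuracy = share_failed / kept * 100

print(f"Share of projects kept: {kept:.1f}%")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1f}%")
```

Any model scoring close to this number has learned little beyond the class balance.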

###### Exploring the distribution of the logarithm of the goal and pledged values

In [None]:
df_failed = df[df["state"] == "failed"]
df_sucess = df[df["state"] == "successful"]

#First plot
trace0 = go.Histogram(
    x= np.log(df.usd_goal_real + 1).head(100000),
    histnorm='probability', showlegend=False,
    xbins=dict(
        start=-5.0,
        end=19.0,
        size=1),
    autobiny=True)

#Second plot
trace1 = go.Histogram(
    x = np.log(df.usd_pledged_real + 1).head(100000),
    histnorm='probability', showlegend=False,
    xbins=dict(
        start=-1.0,
        end=17.0,
        size=1))

# Add histogram data
x1 = np.log(df_failed['usd_goal_real']+1).head(100000)
x2 = np.log(df_sucess["usd_goal_real"]+1).head(100000)

trace3 = go.Histogram(
    x=x1,
    opacity=0.60, nbinsx=30, name='Goals Failed', histnorm='probability'
)
trace4 = go.Histogram(
    x=x2,
    opacity=0.60, nbinsx=30, name='Goals Successful', histnorm='probability'
)


data = [trace0, trace1, trace3, trace4]
layout = go.Layout(barmode='overlay')

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[ [{'colspan': 2}, None], [{}, {}]],
                          subplot_titles=('Failed and Successful Projects',
                                          'Goal','Pledged'))

#setting the figs
fig.append_trace(trace0, 2, 1)
fig.append_trace(trace1, 2, 2)
fig.append_trace(trace3, 1, 1)
fig.append_trace(trace4, 1, 1)

fig['layout'].update(title="Distributions",
                     height=500, width=900, barmode='overlay')
iplot(fig)
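The histograms above plot `np.log(x + 1)` rather than the raw amounts, because goals and pledges span several orders of magnitude. A small sketch on made-up goal values shows that NumPy's `np.log1p` computes the same transform and that a value of 0 stays finite:

```python
import numpy as np

# Hypothetical goal values in USD, spanning several orders of magnitude.
goals = np.array([0.0, 100.0, 5_000.0, 1_000_000.0])

manual = np.log(goals + 1)   # the form used in the plots above
builtin = np.log1p(goals)    # NumPy's dedicated function for log(1 + x)

print(np.round(builtin, 2))
# A goal of 0 maps to 0 instead of -inf, which plain np.log would give.
```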

In [None]:
main_cats = df["main_category"].value_counts()
main_cats_failed = df[df["state"] == "failed"]["main_category"].value_counts()
main_cats_sucess = df[df["state"] == "successful"]["main_category"].value_counts()

In [None]:
#First plot
trace0 = go.Bar(
    x=main_cats_failed.index,
    y=main_cats_failed.values,
    name="Failed Categories"
)
#Second plot
trace1 = go.Bar(
    x=main_cats_sucess.index,
    y=main_cats_sucess.values,
    name="Successful Categories"
)
#Third plot
trace2 = go.Bar(
    x=main_cats.index,
    y=main_cats.values,
    name="All Categories Distribution"
)

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Failed','Successful', "All Categories"))

#setting the figs
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title="Main Category Distribution",bargap=0.05)
iplot(fig)

We visualize the main categories. There are 15 of them.

In [None]:
df['main_category'].value_counts().plot.bar()
plt.show()

###### Visualization
We visualize currency and country. 'USD' and 'US' are the most common values.

In [None]:
df['currency'].value_counts().plot.bar()
plt.show()

df['country'].value_counts().plot.bar()
plt.show()

###### Visualization
We visualize the project state. We will use only `failed` and `successful`.

In [None]:
df['state'].value_counts().plot.bar()
plt.show()

# 4. Data preparation

In [None]:
df = df[(df['state'] == 'failed') | (df['state'] == 'successful')].copy()
print(df.shape)

###### We delete the columns that we will not use

In [None]:
# Delete the columns we will not use:
# 'ID', 'name', 'usd pledged', 'usd_pledged_real', 'backers'
# (the pledged amounts and 'backers' are only known once a campaign has run;
#  'category' is kept and one-hot encoded later)

df = df.drop(columns=['ID', 'name', 'usd pledged', 'usd_pledged_real', 'backers'])

print(df.shape)

In [None]:
# Create new column
# 'duration_days' = 'deadline' - 'launched'

df['launched'] = pd.to_datetime(df['launched'])
df['deadline'] = pd.to_datetime(df['deadline'])

# .dt.days gives the whole number of days in each timedelta
# (astype('timedelta64[D]') is no longer supported in pandas 2.x)
df['duration_days'] = (df['deadline'] - df['launched']).dt.days
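A minimal sketch of the duration computation on two made-up projects, reading off whole days with `.dt.days`. The dates are illustrative only and are not taken from the data set.

```python
import pandas as pd

# Two made-up projects; the dates are illustrative only.
toy = pd.DataFrame({
    "launched": ["2017-01-01 12:00:00", "2017-03-10 08:30:00"],
    "deadline": ["2017-01-31", "2017-04-09"],
})
toy["launched"] = pd.to_datetime(toy["launched"])
toy["deadline"] = pd.to_datetime(toy["deadline"])

delta = toy["deadline"] - toy["launched"]
toy["duration_days"] = delta.dt.days  # whole days, partial days are dropped
print(toy)
```

Both campaigns run a little under 30 days, so both end up with a duration of 29 whole days.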

In [None]:
# The raw dates and the pledged amount are no longer needed
df = df.drop(columns=['launched', 'deadline', 'pledged'])

In [None]:
df = df[(df['goal'] <= 100000) & (df['goal'] >= 1000)].copy()
df.shape

In [None]:
df.head(5)

###### Encoding column 'state'

In [None]:
# Encoding column 'state',
# failed = 0, successful = 1

df['state'] = df['state'].map({
        'failed': 0,
        'successful': 1         
})

In [None]:
print(df.shape)
df.head(5)

###### Encoding column 'category'

In [None]:
# We use one-hot encoding

df = pd.get_dummies(df, columns = ['category'])
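To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up frame. Each distinct value of the encoded column becomes its own indicator column named `<column>_<value>`, and the original column is removed.

```python
import pandas as pd

# Made-up rows; only the column names mirror the data set.
toy = pd.DataFrame({
    "goal": [1000, 5000, 2500],
    "currency": ["USD", "GBP", "USD"],
})

encoded = pd.get_dummies(toy, columns=["currency"])
print(encoded.columns.tolist())
print(encoded)
```

With many distinct values, as in `category`, this adds one column per value, which is why the printed shapes grow quickly in the following cells.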

###### Encoding column 'main_category'

In [None]:
# We use one-hot encoding

df = pd.get_dummies(df, columns = ['main_category'])

In [None]:
# Rename 'main_category_Film & Video' to 'main_category_Film'

df.rename(columns={"main_category_Film & Video": "main_category_Film"}, inplace=True)
print('DONE')

In [None]:
# Check

print(df.columns)
print(df.shape)

###### Encoding column 'currency'

In [None]:
# We use one-hot encoding

df = pd.get_dummies(df, columns = ['currency'])

In [None]:
print(df.columns)
print(df.shape)

###### Encoding column 'country'

In [None]:
# We use one-hot encoding

df = pd.get_dummies(df, columns=['country'])

In [None]:
print(df.columns)
print(df.shape)

# Using 'name' for modeling: first step

In [None]:
# Load the data again to work with the project names
name = pd.read_csv('../input/ks-projects-201801.csv',encoding ='latin1')

In [None]:
# We use only 'name' & 'state'
name = name[['name', 'state']].copy()
print(name.shape)

name = name[(name['state'] == 'failed') | (name['state'] == 'successful')].copy()
print(name.shape)

# Encoding column 'state',
# failed = 0, successful = 1
name['state'] = name['state'].map({
        'failed': 0,
        'successful': 1         
})

### Manual tokenization: cleaning the project names

A tutorial on cleaning text is [here...](https://machinelearningmastery.com/clean-text-machine-learning-python/)
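The cleaning steps in the cells below (splitting, punctuation removal, lowercasing, stop-word filtering) can be sketched for a single name in plain Python. This is only an illustration: `clean_name` and the tiny `STOP_WORDS` set are hypothetical stand-ins, the notebook itself uses NLTK's English stop-word list, and stemming is left out to keep the sketch free of extra dependencies.

```python
import string

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "of", "for", "and"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_name(raw):
    """Return the cleaned tokens of one project name."""
    tokens = raw.split()                                  # split on whitespace
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]   # strip punctuation
    tokens = [t.lower() for t in tokens]                  # lowercase
    # drop empty tokens and stop words
    return [t for t in tokens if t and t not in STOP_WORDS]

print(clean_name("The Art of Board Games: A Documentary!"))
```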

In [None]:
# column 'name' to string
name['name'] = name['name'].astype(str)

In [None]:
# split each "name"
name['name'] = name['name'].str.split()
name.head()

In [None]:
# Possible states: failed, successful, canceled, undefined, live, suspended
# Check whether the label words 'successful' or 'failed' appear as tokens in the names

i = 0
for n in name['name']:
    if 'successful' in n:
        i = i+1
    if 'failed' in n:
        i = i+1
        
print(i)

# This is fine: we do not need to remove these key words

In [None]:
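# Clean each name: we need 'name' without punctuation.
# str.translate also removes punctuation attached to a word, e.g. 'music!' -> 'music'

punct_table = str.maketrans('', '', string.punctuation)
name['name'] = name['name'].apply(lambda x: ' '.join(token.translate(punct_table) for token in x))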

In [None]:
# convert all words to lowercase

name['name'] = name['name'].str.lower()

In [None]:
# Filter out stop words
# Import stopwords with nltk (this may require nltk.download('stopwords') once).

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes the membership test fast

name['name'] = name['name'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))

In [None]:
# split each name back into a list of tokens

name['name'] = name['name'].str.split()
name.head()

In [None]:
# Stem Words
# Stemming refers to the process of reducing each word to its root or base.

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

name['name'] = name['name'].apply(lambda x: [stemmer.stem(y) for y in x])

In [None]:
name.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [None]:
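# The names are already lists of tokens, so the tokenizer returns each list unchanged
# and lowercasing is switched off. fit_transform returns a sparse count matrix.
bag_of_words = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(name['name'])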
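As a rough illustration of what a bag-of-words matrix contains, the sketch below builds a small dense version by hand with `collections.Counter`. The token lists are made up, and `CountVectorizer` stores the same kind of counts in a sparse matrix rather than nested lists.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists (one list per project name).
docs = [["board", "game", "art"], ["art", "book"], ["game", "game", "book"]]

# The vocabulary is the sorted set of all tokens; each gets one column.
vocab = sorted({tok for doc in docs for tok in doc})

# Each row counts how often every vocabulary token occurs in one name.
matrix = [[Counter(doc)[tok] for tok in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row describes one name and each column one word, so word order is discarded and only the counts remain.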

# 5. Preparing for modelling

In [None]:
print(df.shape)
df.head()

In [None]:
y = df['state']

print(y.shape)
y.head(5)

In [None]:
# Remove the target from the feature frame
df = df.drop(columns=['state'])

In [None]:
# Split dataframe into random train and test subsets

X_train, X_test, Y_train, Y_test = train_test_split(
    df,
    y, 
    test_size = 0.1,
    random_state=42
)

print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
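The split above is purely random. One possible variation, not used in this notebook, is to pass `stratify=y` so that the train and test sets keep the same class balance. A sketch on toy labels, where `X` and `y` are made-up stand-ins for the real features and target:

```python
from sklearn.model_selection import train_test_split

# Toy labels with a 60/40 class balance; X is just an index here.
X = [[i] for i in range(100)]
y = [0] * 60 + [1] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

print(len(X_tr), len(X_te))   # number of training and test rows
print(sum(y_te) / len(y_te))  # share of class 1 in the test set
```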

# 6. Build models

## Logistic Regression
Logistic regression, despite its name, is a linear model for classification rather than regression. [More...](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
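To see why this is a linear model for classification, here is a minimal sketch of the prediction step: a linear score is passed through the sigmoid function to get a probability, which is then thresholded at 0.5. The coefficients, intercept and feature values are made up for illustration and are not the fitted values from this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and intercept for two features.
coef = np.array([0.8, -0.5])
intercept = 0.1
x = np.array([1.0, 2.0])

score = coef @ x + intercept          # linear part: w . x + b
prob_successful = sigmoid(score)      # probability of class 1
predicted = int(prob_successful >= 0.5)
print(round(float(prob_successful), 3), predicted)
```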

In [None]:
# Logistic Regression
# 60.86

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

acc_log = round(logreg.score(X_test, Y_test) * 100, 2)
acc_log

In [None]:
# Pair each feature with its fitted coefficient.
# 'state' was already dropped from df, so its columns line up with logreg.coef_[0].
coeff_df = pd.DataFrame({
    'Feature': df.columns,
    'Coefficient': logreg.coef_[0]
})

coeff_df.sort_values(by='Coefficient', ascending=False)

## KNN
Classifier implementing the k-nearest neighbors vote. [More...](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
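The voting idea can be sketched in a few lines of plain Python. `knn_predict` is a hypothetical helper for illustration, not scikit-learn's implementation, and the points are made up. Note that `KNeighborsClassifier()` uses 5 neighbours by default, while the sketch uses 3.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances)[:k]                 # k smallest distances
    votes = Counter(label for _, label in nearest)  # count labels among them
    return votes.most_common(1)[0][0]

# Made-up 2-D points: class 0 near the origin, class 1 near (5, 5).
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

Because the vote depends on raw distances, features on very different scales (such as `goal` next to 0/1 indicator columns) can dominate the result.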

In [None]:
#68.57

knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)

acc_knn = round(knn.score(X_test, Y_test) * 100, 2)
acc_knn

## Linear SVC
Linear Support Vector Classification. Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples. [More...](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)

In [None]:
# Linear SVC
#62.01

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

acc_linear_svc = round(linear_svc.score(X_test, Y_test) * 100, 2)
acc_linear_svc

## Decision Tree

In [None]:
# Decision Tree
#77.86

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

acc_decision_tree = round(decision_tree.score(X_test, Y_test) * 100, 2)
acc_decision_tree

## Random Forest

In [None]:
# Random Forest
# 77.86

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

acc_random_forest = round(random_forest.score(X_test, Y_test) * 100, 2)
acc_random_forest

## AdaBoostClassifier
An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases. [More...](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)
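The reweighting step can be sketched by hand for one boosting round. This follows the SAMME update for two classes with a learning rate of 1, as I understand it; the five samples and the single misclassification are made up.

```python
import math

# Five samples with equal starting weights; True marks a misclassified sample.
weights = [0.2] * 5
misclassified = [False, False, True, False, False]

# Weighted error of the current weak learner.
err = sum(w for w, m in zip(weights, misclassified) if m)

# Learner weight for two classes (SAMME with learning rate 1).
alpha = math.log((1 - err) / err)

# Boost the weights of misclassified samples, then renormalize to sum to 1.
weights = [w * math.exp(alpha) if m else w for w, m in zip(weights, misclassified)]
total = sum(weights)
weights = [w / total for w in weights]

print([round(w, 3) for w in weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.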

In [None]:
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)

bdt.fit(X_train, Y_train)

acc_bdt = round(bdt.score(X_test, Y_test) * 100, 2)
acc_bdt

## GradientBoostingClassifier
Gradient Boosting for classification. GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.  [More...](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier)

In [None]:
clf_gb = GradientBoostingClassifier(n_estimators=100, 
                                 max_depth=1, 
                                 random_state=0)
clf_gb.fit(X_train, Y_train)

acc_clf_gb = round(clf_gb.score(X_test, Y_test) * 100, 2)
acc_clf_gb

## MLPClassifier
Class MLPClassifier implements a multi-layer perceptron (MLP) algorithm that trains using Backpropagation. [More...](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)

In [None]:
mlp = MLPClassifier(solver='lbfgs', 
                    alpha=1e-5, 
                    hidden_layer_sizes=(21, 2), 
                    random_state=1)

mlp.fit(X_train, Y_train)

acc_mlp = round(mlp.score(X_test, Y_test) * 100, 2)
acc_mlp

## BaggingClassifier

In [None]:
bagging = BaggingClassifier(
    KNeighborsClassifier(
        n_neighbors=8,
        weights='distance'
        ),
    oob_score=True,
    max_samples=0.5,
    max_features=1.0
    )
clf_bag = bagging.fit(X_train,Y_train)

acc_clf_bag = round(clf_bag.score(X_test, Y_test) * 100, 2)
acc_clf_bag

## LGBMClassifier

In [None]:
clf_lgbm = LGBMClassifier(
        n_estimators=300,
        num_leaves=15,
        colsample_bytree=.8,
        subsample=.8,
        max_depth=7,
        reg_alpha=.1,
        reg_lambda=.1,
        min_split_gain=.01
    )

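# Early stopping and log output are configured through callbacks
# (LightGBM 4.x no longer accepts early_stopping_rounds and verbose in fit)
clf_lgbm.fit(X_train, 
        Y_train,
        eval_set= [(X_train, Y_train), (X_test, Y_test)], 
        eval_metric='auc', 
        callbacks=[lgb.early_stopping(30, verbose=False), lgb.log_evaluation(0)]
       )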

acc_clf_lgbm = round(clf_lgbm.score(X_test, Y_test) * 100, 2)
acc_clf_lgbm

## Model evaluation

In [None]:
# Collect the test accuracy of every model fitted above
models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 
              'Random Forest',   
              'Linear SVC', 
              'Decision Tree', 'BaggingClassifier',
             'AdaBoostClassifier', 'GradientBoostingClassifier',
             'MLPClassifier', 'LGBMClassifier'],
    'Score': [acc_knn, acc_log, 
              acc_random_forest,   
              acc_linear_svc, acc_decision_tree,
             acc_clf_bag, acc_bdt, acc_clf_gb, 
              acc_mlp, acc_clf_lgbm]})
models.sort_values(by='Score', ascending=False)
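The table above ranks the models by accuracy only. Since the two classes are not balanced, it may also be worth looking at precision and recall for the `successful` class. The sketch below computes all three by hand on made-up labels; `evaluate` is a hypothetical helper, not part of the notebook.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels: 0 = failed, 1 = successful.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```

In this toy case the accuracy looks acceptable while only one in four successful projects is found, which accuracy alone would hide.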