### Kickstarter dataset project - Yasmine Maricar
#### Description ####

* [1. Analyzing the dataset](#Q1)
* [2. Developing a model to predict the probability of campaign success](#Q2)
* [3. What are our recommendations to anyone who wants to create a Kickstarter campaign?](#Q3)

In [None]:
import pandas as pd
import numpy as np

pd.options.display.max_rows = 4000

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

import plotly.express as px 
import plotly.subplots as tls
import plotly
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

from pandas_profiling import ProfileReport

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV, GridSearchCV, StratifiedShuffleSplit
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

SEED = 42

<a id='Q1'></a>

# Reading and cleaning up the dataframe

### I am omitting this step for the sake of clarity. It would read the original dataset, clean it up, and add relevant features where the original ones are not suitable.

# Going from the final dataframe

In [None]:
df_final = pd.read_csv('out2.zip')

In [None]:
df_final = df_final.drop(['goal', 'pledged', 'usd pledged'], axis = 1)

In [None]:
df_final.describe(include='all', datetime_is_numeric=True)

In [None]:
# Convert the date and numeric columns to the right dtypes.
df_final["deadline"] = pd.to_datetime(df_final['deadline'])
df_final["launched"] = pd.to_datetime(df_final['launched'])
df_final["ID"] = pd.to_numeric(df_final["ID"])
df_final["backers"] = pd.to_numeric(df_final["backers"])
df_final["real_usd_pledged"] = pd.to_numeric(df_final["real_usd_pledged"])
df_final["usd_goal"] = pd.to_numeric(df_final["usd_goal"])

In [None]:
df_final.dtypes

In [None]:
df_final.isnull().any()

In [None]:
df_final[df_final['country'].isnull()].head()

In [None]:
df_final[df_final['country'].isnull()].shape

Let's drop these rows: they have 0 backers, no country, and no USD pledged value, which suggests an error when the data was collected.

In [None]:
df_final = df_final[~df_final['country'].isnull()]

In [None]:
df_final = df_final.loc[~((df_final['real_usd_pledged']>=df_final['usd_goal']) & (df_final['state']=='failed'))]

In [None]:
df_final = df_final.reset_index(drop=True)

In [None]:
df_final.isnull().any()

In [None]:
df_final.shape

In [None]:
df_final.duplicated().sum()

In [None]:
counts = df_final['name'].value_counts().rename_axis('name').reset_index(name='counts')

In [None]:
duplicate_names = df_final[df_final['name'].isin(counts[counts['counts']>1].name.tolist())]

In [None]:
duplicate_names.shape

In [None]:
duplicate_names.sort_values(by=['name']).head()

I'll leave these rows as they are, but it's interesting to see that some duplicates seem genuine, others seem to be the same project revamped or relaunched, and others are another rendition of the same project (a theater play and a video, for instance).

It would be interesting to know more about the motives and mindset of people creating these projects 'again' (a renewed need for funds?), and whether some cases are reboots of past successful projects (hoaxes?).

Overall, these rows can still be used in our model, since we want to predict the success or failure of a campaign regardless.

## Distribution of goals and pledges

In [None]:
def plot_continuous_vars(data, column_name):
    plot_dims = (14, 8)
    fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=plot_dims)
    sns.distplot(data[column_name], ax=ax1)
    sns.distplot(np.log1p(data[column_name]), ax=ax2)

In [None]:
plot_continuous_vars(df_final, 'usd_goal')

In [None]:
plot_continuous_vars(df_final, 'real_usd_pledged')

We take the log to better see the distributions as we have outliers in both cases.
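
To illustrate why the log helps, here is a minimal sketch on synthetic data. The log-normal sample is an assumption standing in for `usd_goal` (it is not the real dataset), and `skewness` is a hypothetical helper. It compares a simple skewness estimate before and after `np.log1p`.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical heavy-tailed sample standing in for usd_goal (assumption)
goals = rng.lognormal(mean=8.5, sigma=1.5, size=10_000)

def skewness(values):
    """Sample skewness: the third standardized moment."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

raw_skew = skewness(goals)
log_skew = skewness(np.log1p(goals))
print(f"skewness raw: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A skewness close to zero after the transform means the long right tail no longer dominates the plot, which is what makes the second panel of each figure readable.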

In [None]:
df_failed = df_final[df_final["state"] == "failed"]
df_success = df_final[df_final["state"] == "successful"]


# Add histogram data
failed = np.log(df_failed['usd_goal']+1)
success = np.log(df_success['usd_goal']+1)

trace1 = go.Histogram(
    x=failed,
    opacity=0.60, nbinsx=30, name='Goals Failed', histnorm='probability'
)
trace2 = go.Histogram(
    x=success,
    opacity=0.60, nbinsx=30, name='Goals Successful', histnorm='probability'
)

data = [trace1, trace2]
layout = go.Layout(barmode='overlay', title=go.layout.Title(text="Distributions of usd_goal"))

fig = go.Figure(
    data=data,
    layout=layout
)

iplot(fig)

Based on the histogram above, the distribution for failed projects is shifted to the right: failed projects tend to set higher goals than successful ones.

In [None]:
import plotly.express as px
fig = px.box(df_final, x="main_category", y="usd_goal")
fig.show()

In [None]:
df_failed = df_final[df_final["state"] == "failed"]
df_success = df_final[df_final["state"] == "successful"]

plot_continuous_vars(df_failed, 'backers')
plot_continuous_vars(df_success, 'backers')

## Feature engineering

Variables for the logistic regression:
* len(name), the length of the project name
* whether the name contains a word in all upper case
* whether the name contains ! or ?
* number of words in the name
* whether the name contains non-alphanumeric characters
* duration between launch and deadline
* month of the launch date

Others:

* goal in USD
* category (one-hot encoded)
* main category (one-hot encoded)
* country (one-hot encoded)

These are used to predict the target variable `state`.
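
As a rough illustration of the name-based features listed above, the sketch below computes a few of them for a single made-up title. The helper `name_features` and the example title are hypothetical; the notebook computes the same kind of features column by column on `df_final`.

```python
import re

def name_features(name):
    """Return a few name-based features for one project title (illustrative only)."""
    words = name.split()
    return {
        "len_name": len(name),
        "name_nb_words": len(words),
        # 1 if the title contains '!' or '?'
        "name_has_symbol": int("!" in name or "?" in name),
        # 1 if at least one word with 2+ word characters is fully upper case
        "name_upper": int(any(
            w.isupper() and len(re.sub(r"\W+", "", w)) > 1 for w in words
        )),
    }

example = name_features("NEW board game: fund it now!")
print(example)
```

Each feature is a plain number or a 0/1 flag, so it can be fed to the model without further encoding.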

In [None]:
def getDelta(a,b):
    '''Get difference in days between launch and deadline'''
    return (a - b).days

# Duration of the project   
df_final['duration'] = df_final.apply(lambda x: getDelta(x['deadline'],x['launched']),axis = 1)

In [None]:
df_final['month'] = df_final['launched'].dt.month
df_final['year_month'] = df_final['launched'].map(lambda x: str(x.year) + "-" + str(x.month))

In [None]:
import re

def has_non_chars(name):
    # Flag names with characters other than letters, digits, whitespace, '?' or '!'
    for c in name:
        if not (c.isalnum() or c.isspace()) and c not in '?!':
            return 1
    return 0

def has_exclamation_interrogation(name):
    if ("!" in name or "?" in name):
        return 1
    return 0

def has_upper(name):
    for word in name.split(' '):
        if word.isupper() and len(re.sub(r'\W+', '', word))>1:
            return 1
    return 0

In [None]:
df_final['len_name'] = df_final.name.str.len()

In [None]:
df_final['name_nb_words'] = df_final.name.apply(lambda x: len(str(x).split(' ')))

In [None]:
df_final['name_non_chars'] = df_final.name.apply(has_non_chars)

In [None]:
df_final['name_has_symbol'] = df_final.name.apply(has_exclamation_interrogation)

In [None]:
df_final['name_upper'] = df_final.name.apply(has_upper)

In [None]:
df_final['cat_full'] = df_final[["main_category","category"]].agg('-'.join, axis=1)

In [None]:
df_final.head()

<a id='Q2'></a>

## I. Let's prepare the dataset to train the model

In [None]:
df_final.columns

In [None]:
ks = df_final.drop(['ID','name','deadline','launched','year_month', 'backers', 'real_usd_pledged'], axis=1).copy()

In [None]:
ks.columns

usd_goal is heavily skewed (see the distribution plots above), so let's replace it with its log-transformed version.

In [None]:
ks['usd_goal_corrected'] = np.log1p(ks['usd_goal'])

In [None]:
ks['state'] = ks.state.map(dict(successful=1, failed=0))

## 1. Generating html report with pandas profiling

In [None]:
profile = ProfileReport(ks, title="Pandas Profiling Report Kickstarter")
profile.to_file('kickstarterds.html')

## 2. Explore manually

In [None]:
# ## This heatmap is also available from pandas-profiling html file.
# corr = ks.corr()
# dims = (16, 10)
# fig, ax = plt.subplots(figsize = dims)
# sns.heatmap(corr, 
#             xticklabels=corr.columns.values,
#             yticklabels=corr.columns.values,ax = ax, cmap="Blues")

In [None]:
# We'll drop name_nb_words because it's highly correlated with len_name
ks = ks.drop(['name_nb_words'], axis=1)
# We can drop currency too as the currency is explained by the country
ks = ks.drop(['currency'], axis=1)
# We can drop category and main_category as it's encoded in cat_full
ks = ks.drop(['category','main_category'], axis=1)

In [None]:
ks.columns

In [None]:
ks.state.value_counts(normalize=True)

We can treat the dataset as reasonably balanced, given the roughly 60/40 class ratio.
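
A 60/40 ratio also sets the accuracy a model has to beat: always predicting the majority class already scores about 60%. The sketch below shows this on hypothetical labels (the list is made up to match the ratio, and which class is the majority here is an assumption).

```python
from collections import Counter

# Hypothetical labels with a 60/40 split (0 = failed, 1 = successful)
labels = [0] * 60 + [1] * 40

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
# Accuracy obtained by always predicting the majority class
baseline_accuracy = majority_count / len(labels)
print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

Any accuracy reported later is best read against this baseline rather than against 50%.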

In [None]:
ks.dtypes

In [None]:
ks.describe(include='all')

## II. Model training

In [None]:
y = ks.state
x = ks.drop(['state','usd_goal'], axis = 1)

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=SEED)

In [None]:
print('x_train.shape:', x_train.shape)
print('y_train.shape:', y_train.shape)
print('x_test.shape :', x_test.shape)
print('y_test.shape :', y_test.shape)

In [None]:
x_train.columns

In [None]:
from pprint import pprint
import mlflow

def fetch_logged_data(run_id):
    client = mlflow.tracking.MlflowClient()
    data = client.get_run(run_id).data
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = [f.path for f in client.list_artifacts(run_id, "model")]
    return data.params, data.metrics, tags, artifacts

# enable autologging
mlflow.sklearn.autolog()


In [None]:
from sklearn.dummy import DummyClassifier
# define model

model = DummyClassifier(strategy='uniform', random_state=42)
with mlflow.start_run() as run:
    model.fit(x_train, y_train)
    

# fetch logged data
params, metrics, tags, artifacts = fetch_logged_data(run.info.run_id)

pprint(params)
pprint(metrics)
pprint(tags)
pprint(artifacts)



### Preprocessing

In [None]:
x_train.columns

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder

numeric_features = ['usd_goal_corrected', 'duration', 'len_name']

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", RobustScaler()),
    ]
)

categorical_features = ['country', 'cat_full', 'month', 'name_non_chars', 'name_has_symbol', 'name_upper']
categorical_transformer = Pipeline(
    steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessing = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="passthrough",
)


In [None]:
def predict_plot(X, y, classifier, classifier_name):
#     # predict probabilities
#     lr_probs = classifier.predict_proba(X)
#     # keep probabilities for the positive outcome only
#     lr_probs = lr_probs[:, 1]
#     # predict class values
#     yhat = classifier.predict(X)
#     precision = precision_score(y, yhat)
#     lr_precision, lr_recall, _ = precision_recall_curve(y, lr_probs)
#     lr_f1, lr_auc = f1_score(y, yhat), auc(lr_recall, lr_precision)
#     # summarize scores
#     print(classifier_name+': precision=%.3f auc=%.3f' % (precision, lr_auc))
#     # plot the precision-recall curves
#     no_skill = len(y[y==1]) / len(y)
#     pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
#     pyplot.plot(lr_recall, lr_precision, marker='.', label=classifier_name)
#     # axis labels
#     pyplot.xlabel('Recall')
#     pyplot.ylabel('Precision')
#     # show the legend
#     pyplot.legend()
#     # show the plot
#     pyplot.show()

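    # Use the predicted probability of the positive class rather than hard labels,
    # so the ROC curve covers every threshold instead of a single operating point
    y_score = classifier.predict_proba(X)[:, 1]

    # Compute fpr, tpr, thresholds and roc auc
    fpr, tpr, thresholds = roc_curve(y, y_score)
    roc_auc = roc_auc_score(y, y_score)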

    # Plot ROC curve
    plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate or (1 - Specificity)')
    plt.ylabel('True Positive Rate or (Sensitivity)')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")

In [None]:
# precision-recall curve and f1 for evaluation purposes

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.metrics import auc
from matplotlib import pyplot

lr = LogisticRegression(solver='liblinear')
# model = LogisticRegression(solver='lbfgs')
model = Pipeline([('preprocessing', preprocessing),
                ('lr',lr)])
model.fit(x_train, y_train)

predict_plot(x_train, y_train, model, "Logistic Regression")

In [None]:
predict_plot(x_test, y_test, model, "Logistic Regression")

In [None]:
from lightgbm import LGBMClassifier

# #Specifying the parameter
# params={}
# params['learning_rate']=0.03
# params['boosting_type']='gbdt' #GradientBoostingDecisionTree
# params['objective']='binary' #Binary target feature
# params['metric']='binary_logloss' #metric for binary classification
# params['max_depth']=10
# #train the model 
# clf=lgb.train(params,d_train,100) #train the model on 100 epochs
# #prediction on the test set
# y_pred=clf.predict(X_test)

clf = make_pipeline(
    preprocessing,
    LGBMClassifier()
)

clf.fit(x_train, y_train)
    
predict_plot(x_train, y_train, clf, 'GBM')

In [None]:
predict_plot(x_test, y_test, clf, 'GBM')

In [None]:
# clf is already a pipeline that includes the preprocessing step
clf.get_params()

In [None]:
clf = make_pipeline(
    preprocessing,
    LGBMClassifier(learning_rate=0.7, boosting_type="gbdt", objective='binary', metric='binary_error', max_depth=-1)
)

clf.fit(x_train, y_train)
    
predict_plot(x_train, y_train, clf, 'GBM')

In [None]:
predict_plot(x_test, y_test, clf, 'GBM')

In [None]:
from sklearn.base import BaseEstimator
class ClfSwitcher(BaseEstimator):

    def __init__(
        self, 
        estimator = LogisticRegression(),
    ):
        """
        A Custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """ 

        self.estimator = estimator


    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self


    def predict(self, X, y=None):
        return self.estimator.predict(X)


    def predict_proba(self, X):
        return self.estimator.predict_proba(X)


    def score(self, X, y):
        return self.estimator.score(X, y)

In [None]:
pipeline = Pipeline([('preprocessing', preprocessing), ('clf', ClfSwitcher())])

parameters = [
    {
        'clf__estimator': [LogisticRegression()],
        'clf__estimator__solver': ["lbfgs", "liblinear"],
        "clf__estimator__penalty": ["l2"],
        "clf__estimator__C": [0.1, 0.2, 0.3, 0.5, 1.0],
        "clf__estimator__max_iter": [100, 1000, 2000],
    },
    {
        'clf__estimator': [LogisticRegression()],
        'clf__estimator__solver': ["liblinear"],
        "clf__estimator__penalty": ["l1"],
        "clf__estimator__C": [0.1, 0.2, 0.3, 0.5, 1.0],
        "clf__estimator__max_iter": [100, 1000, 2000],
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=2, n_jobs=-1, verbose=3)
gs_model = gscv.fit(x_train, y_train)

print(gs_model.best_params_, gs_model.best_score_)


### Without sklearn pipelines

In [None]:
# x = pd.get_dummies(x, columns = ['month','category','main_category','country'])
x = pd.get_dummies(x, columns = ['month','cat_full','country'])

In [None]:
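from sklearn.model_selection import train_test_split

# Split first so the scaler never sees the test rows
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=SEED)
x_train, x_test = x_train.copy(), x_test.copy()

In [None]:
from sklearn.preprocessing import RobustScaler
num_cols = ['usd_goal_corrected', 'duration', 'len_name']

# Fit on the training split only, then apply the same scaling to both splits
transformer = RobustScaler().fit(x_train[num_cols])

In [None]:
x_train[num_cols] = transformer.transform(x_train[num_cols])
x_test[num_cols] = transformer.transform(x_test[num_cols])

In [None]:
x_train.describe()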

In [None]:
print('x_train.shape:', x_train.shape)
print('y_train.shape:', y_train.shape)
print('x_test.shape :', x_test.shape)
print('y_test.shape :', y_test.shape)

In [None]:
# Creating the model:
lr = LogisticRegression(solver='liblinear') 

# Training the model with the training data:
lr.fit(x_train, y_train)

In [None]:
y_pred_lr = lr.predict(x_test)

# ROC AUC on the train and test sets, computed from predicted probabilities
print('Train data ROC/AUC :', roc_auc_score(y_true=y_train, y_score=lr.predict_proba(x_train)[:, 1]))
print('Test data ROC/AUC :', roc_auc_score(y_true=y_test, y_score=lr.predict_proba(x_test)[:, 1]))

# confusion matrix
print('\nConfusion matrix')
print(confusion_matrix(y_true=y_test, y_pred=y_pred_lr))

# classification report
print('\nClassification report')
print(classification_report(y_true=y_test, y_pred=y_pred_lr))

### Let's try cross-validation

In [None]:
%%time

from sklearn.model_selection import GridSearchCV

grid = {'C': np.logspace(-3,3,7), 'penalty': ['l1', 'l2']}

# Creating the model:
lr = LogisticRegression(solver='liblinear') 

# Creating GridSearchCV model:
lr_cv = GridSearchCV(lr, grid, cv=10, scoring='roc_auc') # 10-fold cross-validation: each parameter combination is scored by ROC AUC on every fold

# Training the model:
lr_cv.fit(x_train, y_train)

print('best parameters for logistic regression with liblinear: ', lr_cv.best_params_)
print('best score for logistic regression after grid search cv:', lr_cv.best_score_)

In [None]:
lr_tuned = LogisticRegression(solver='liblinear', C=1.0, penalty='l2')
lr_tuned.fit(x_train, y_train)

y_pred_lr = lr_tuned.predict(x_test)

# ROC AUC on the train and test sets, computed from predicted probabilities
print('Train data ROC/AUC :', roc_auc_score(y_true=y_train, y_score=lr_tuned.predict_proba(x_train)[:, 1]))
print('Test data ROC/AUC :', roc_auc_score(y_true=y_test, y_score=lr_tuned.predict_proba(x_test)[:, 1]))

# confusion matrix
print('\nConfusion matrix')
print(confusion_matrix(y_true=y_test, y_pred=y_pred_lr))

# classification report
print('\nClassification report')
print(classification_report(y_true=y_test, y_pred=y_pred_lr))

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Compute fpr, tpr, thresholds and roc auc from predicted probabilities
y_score_lr = lr_tuned.predict_proba(x_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score_lr)
roc_auc = roc_auc_score(y_test, y_score_lr)

# Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate or (1 - Specificity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")

We can then use `lr_tuned.predict_proba(x_test)[:, 1]` to get, for each project, the predicted probability of the positive class (a successful campaign).
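
The difference between probabilities and hard labels can be sketched on synthetic data. Everything below is an assumption for illustration: `make_classification` stands in for the Kickstarter features, and the numbers it prints say nothing about the real model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data standing in for the Kickstarter features (assumption)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
clf = LogisticRegression(solver="liblinear").fit(X, y)

proba_success = clf.predict_proba(X)[:, 1]  # P(label == 1) for each row
hard_labels = clf.predict(X)                # 0/1 decisions

# ROC AUC is a ranking metric, so it is normally computed from scores
auc_from_proba = roc_auc_score(y, proba_success)
auc_from_labels = roc_auc_score(y, hard_labels)
print(f"AUC from probabilities: {auc_from_proba:.3f}")
print(f"AUC from hard labels  : {auc_from_labels:.3f}")
```

Column 1 of `predict_proba` corresponds to the second entry of `clf.classes_`, which is the label 1 when the target is encoded as 0/1.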

In [None]:
# get importance
importance = lr_tuned.coef_
# summarize feature importance
sort_list = [(x.columns[i], v) for i,v in enumerate(importance[0])]

sort_list.sort(key=lambda x:x[1])

In [None]:
# plot feature importance
plt.bar([i for i in range(len(sort_list))], [i[1] for i in sort_list])
plt.show()

for i in sort_list:
    print('%s: %.5f' % (i[0],i[1]))

## Conclusion

<a id='Q3'></a>
3) Mostly from what we observed through EDA (not all of the code for that part is included here), it seems better to launch a project in the following categories:

In [None]:
print("The most promising categories to start a kickstarter in are:",", ".join(list(more_success_than_failed.keys())))

Furthermore, projects with a duration below one month seem to have better chances of success.
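
One way to check this kind of claim is to bucket campaigns by duration and compare success rates. The sketch below does so on a small hypothetical table; the rows and the 30-day cut-off are assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical campaigns: duration in days and outcome (1 = successful)
toy = pd.DataFrame({
    "duration": [10, 21, 29, 30, 45, 60, 59, 15],
    "state":    [1,  1,  0,  1,  0,  0,  1,  1],
})

# Bucket durations at the 30-day mark and compare success rates
toy["bucket"] = pd.cut(
    toy["duration"], bins=[0, 30, 92], labels=["<= 30 days", "> 30 days"]
)
success_rate = toy.groupby("bucket", observed=True)["state"].mean()
print(success_rate)
```

On the real data the same pattern would use `df_final['duration']` and the encoded `state` column.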

I think our study is incomplete because we are not studying how potential creators and `backers` interact with the project. Comments and the number of shares across the web are what make a Kickstarter project with a reasonably high goal succeed, by targeting the right people and generating contributions within the allotted timeline. Among the most successful categories, the mean usd_goal differs between failed and successful projects: failed projects tend to set higher goals. Keeping the goal close to that of previously successful projects in the same domain should therefore improve the chances of success.

The success factors of a project go far beyond what this dataset captures, as the real issue seems to be how people find these projects; Kickstarter is above all the platform that hosts the campaign and receives the funds. Still, we were able to extract some interesting insights and end up with a final model that has around 68% accuracy.