# Check Out the Data

### Performance Improvement Plan
- Use JAX
- Use Google Cloud
- Reduce to 10% 
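
One possible reading of "Reduce to 10%" is to iterate on a 10% random sample of the rows, which shortens every fit. The sketch below assumes that reading. The toy frame and the names `toy` and `sample` are made up for illustration, and the sample is drawn per `State` so the class shares stay the same.

```python
import pandas as pd

RSEED = 42

# Toy stand-in for the Kickstarter frame (hypothetical values)
toy = pd.DataFrame({
    "Goal": range(1000),
    "State": ["Failed"] * 600 + ["Successful"] * 400,
})

# Draw 10% of the rows within each State so class shares are preserved
sample = toy.groupby("State").sample(frac=0.1, random_state=RSEED)

print(len(sample))
print(sample["State"].value_counts(normalize=True))
```

Sampling per group rather than over the whole frame keeps rare target classes from disappearing in the reduced data.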

---
## Model

In [4]:
# Import of relevant packages
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, recall_score, precision_score

from sklearn.linear_model import LogisticRegression

# Set random seed 
RSEED = 42
warnings.filterwarnings("ignore")

In [26]:
# Loading the kickstarter dataset
df = pd.read_csv('data/kickstarter_projects.csv')

# View the first few rows of the dataset
df.head(10)

Unnamed: 0,ID,Name,Category,Subcategory,Country,Launched,Deadline,Goal,Pledged,Backers,State
0,1860890148,Grace Jones Does Not Give A F$#% T-Shirt (limi...,Fashion,Fashion,United States,2009-04-21 21:02:48,2009-05-31,1000,625,30,Failed
1,709707365,CRYSTAL ANTLERS UNTITLED MOVIE,Film & Video,Shorts,United States,2009-04-23 00:07:53,2009-07-20,80000,22,3,Failed
2,1703704063,drawing for dollars,Art,Illustration,United States,2009-04-24 21:52:03,2009-05-03,20,35,3,Successful
3,727286,Offline Wikipedia iPhone app,Technology,Software,United States,2009-04-25 17:36:21,2009-07-14,99,145,25,Successful
4,1622952265,Pantshirts,Fashion,Fashion,United States,2009-04-27 14:10:39,2009-05-26,1900,387,10,Failed
5,2089078683,New York Makes a Book!!,Journalism,Journalism,United States,2009-04-28 13:55:41,2009-05-16,3000,3329,110,Successful
6,830477146,Web Site for Short Horror Film,Film & Video,Shorts,United States,2009-04-29 02:04:21,2009-05-29,200,41,3,Failed
7,266044220,Help me write my second novel.,Publishing,Fiction,United States,2009-04-29 02:58:50,2009-05-29,500,563,18,Successful
8,1502297238,Produce a Play (Canceled),Theater,Theater,United States,2009-04-29 04:37:37,2009-06-01,500,0,0,Canceled
9,813230527,Sponsor Dereck Blackburn (Lostwars) Artist in ...,Music,Rock,United States,2009-04-29 05:26:32,2009-05-16,300,15,2,Failed


In [13]:
df.dtypes

ID              int64
Name           object
Category       object
Subcategory    object
Country        object
Launched       object
Deadline       object
Goal            int64
Pledged         int64
Backers         int64
State          object
dtype: object

---
## Building a Preprocessing Pipeline

In [14]:
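# Dropping unnecessary columns: 'ID' carries no signal, and 'Pledged' and
# 'Backers' are only known after the campaign, so they would leak the outcome
df.drop(['ID', 'Pledged', 'Backers'], axis=1, inplace=True)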
df.columns

Index(['Name', 'Category', 'Subcategory', 'Country', 'Launched', 'Deadline',
       'Goal', 'State'],
      dtype='object')

In [28]:
# Creating list for categorical features
# ('State' is the target and 'Name' is free text, so neither is one-hot encoded)
cat_features = ['Category', 'Subcategory', 'Country']
cat_features


['Category', 'Subcategory', 'Country']

In [30]:
# Creating list for numerical features
num_features = ['Goal']
num_features

['Goal']

In [None]:
# Creating list for date features
# (both columns are read as object dtype, so they are listed by name)
date_features = ['Launched', 'Deadline']
date_features

['Launched', 'Deadline']

### Train-Test-Split

Let's split the dataset into a training set and a test set. We train the model on the training set and use cross-validation to find the best hyperparameter combination. The test set is held back and used only once, for the final evaluation of the best model.

In [23]:
# Define predictors and target variable
X = df.drop('State', axis=1)
y = df['State']
print(f"We have {X.shape[0]} observations in our dataset and {X.shape[1]} features")
print(f"Our target vector also has {y.shape[0]} values")

We have 374853 observations in our dataset and 7 features
Our target vector also has 374853 values


In [24]:
# Split into train and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RSEED)

In [25]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

X_train shape: (299882, 7)
X_test shape: (74971, 7)
y_train shape: (299882,)
y_test shape: (74971,)


### Pipeline

In [None]:
# A Pipeline runs its steps one after another.
# Each step is a (key, value) tuple in a list:
# - key is the name of the processing step
# - value is an estimator object (the processing step itself)

In [None]:
# Pipeline for categorical features 

cat_pipeline = Pipeline([
    ('1hot', OneHotEncoder(handle_unknown='ignore'))
])

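# Keep only projects marked as 'Successful' or 'Failed' and encode the target as 0/1.
# Note: X and y were created from df before this filter, so they have to be
# redefined and split again for the filter to affect the training data.
df = df[df['State'].isin(['Successful', 'Failed'])].copy()
df['Successful'] = (df['State'] == 'Successful').astype(int)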

In [None]:
# Pipeline for numerical features

num_pipeline = Pipeline([
    ('std_scaler', StandardScaler())
])

In [None]:
# Pipeline for date features
# (StandardScaler needs numeric input such as 'Duration', not raw datetime columns)

date_pipeline = Pipeline([
    ('std_scaler', StandardScaler())
])

# Convert date columns
df['Launched'] = pd.to_datetime(df['Launched'])
df['Deadline'] = pd.to_datetime(df['Deadline'])
df['Duration'] = (df['Deadline'] - df['Launched']).dt.days

In [None]:
# Complete pipeline for numerical, categorical and date features

# 'ColumnTransformer' applies transformers (num_pipeline/ cat_pipeline)
# to specific columns of an array or DataFrame (num_features/cat_features)
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

In [None]:
# Build preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, num_features),
        # ('date', date_pipeline, date_features),
        # ('name', name_pipeline, name_features),  # enable once name_pipeline and name_features are defined
        ('cat', cat_pipeline, cat_features)
    ],
    remainder='drop'
)
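
The notebook does not define a pipeline for the `Name` column. One option, shown here only as an assumption and not as the notebook's method, is a TF-IDF bag of words over the project title. The toy rows come from the first rows of the dataset printed above; `toy` and the choice of `TfidfVectorizer` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A few rows taken from the df.head(10) output
toy = pd.DataFrame({
    "Name": ["Offline Wikipedia iPhone app", "drawing for dollars",
             "Web Site for Short Horror Film", "Help me write my second novel."],
    "Category": ["Technology", "Art", "Film & Video", "Publishing"],
    "Goal": [99, 20, 200, 500],
})

# Hypothetical text step: TF-IDF over the words of the project name.
# TfidfVectorizer expects a 1-D series of strings, so the column is
# passed as a plain string ('Name'), not as a list (['Name']).
name_pipeline = TfidfVectorizer(max_features=1000)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Goal"]),
        ("name", name_pipeline, "Name"),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"]),
    ],
    remainder="drop",
)

features = preprocessor.fit_transform(toy)
print(features.shape)
```

Capping `max_features` keeps the width of the text block bounded, which matters with several hundred thousand distinct project names.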

In [None]:
# Set up the pipeline
from sklearn.neighbors import KNeighborsClassifier  # not part of the import cell above

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', KNeighborsClassifier())
])
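
A pipeline like the one above could be scored with the metrics already imported at the top of the notebook. The sketch below is an illustration on synthetic data, not a result for the Kickstarter set; `X_toy`, `y_toy` and `toy_pipe` are made-up names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

RSEED = 42
rng = np.random.default_rng(RSEED)

# Synthetic stand-in for the training data (not the real Kickstarter set)
n = 200
X_toy = pd.DataFrame({
    "Goal": rng.integers(100, 50_000, size=n),
    "Category": rng.choice(["Art", "Music", "Technology"], size=n),
})
y_toy = rng.integers(0, 2, size=n)  # 1 = successful, 0 = failed

toy_pipe = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["Goal"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"]),
    ])),
    ("model", KNeighborsClassifier()),
])

# Out-of-fold predictions: each row is predicted by a model that never saw it
y_pred = cross_val_predict(toy_pipe, X_toy, y_toy, cv=5)

print("Accuracy :", accuracy_score(y_toy, y_pred))
print("Recall   :", recall_score(y_toy, y_pred))
print("Precision:", precision_score(y_toy, y_pred))
```

Because the preprocessing sits inside the pipeline, the scaler and encoder are refit on each training fold, so no information from the validation fold leaks into them.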

In [None]:
TODO

State: delete the values Canceled, Live, Suspended, Pledged
# df = df[df['State'] != 'Failed']
Create Duration
Create Goal / day: how much do I have to raise per day?
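
The two planned features could be derived as follows. This is a minimal sketch on two rows from the dataset preview; the column name `Goal_per_day` is an assumption and not something the notebook defines.

```python
import pandas as pd

# Two rows taken from the df.head(10) output
toy = pd.DataFrame({
    "Launched": ["2009-04-21 21:02:48", "2009-04-24 21:52:03"],
    "Deadline": ["2009-05-31", "2009-05-03"],
    "Goal": [1000, 20],
})

toy["Launched"] = pd.to_datetime(toy["Launched"])
toy["Deadline"] = pd.to_datetime(toy["Deadline"])

# Whole days between launch and deadline (partial days are truncated)
toy["Duration"] = (toy["Deadline"] - toy["Launched"]).dt.days

# Amount that has to be raised per day; clip avoids division by zero
# for campaigns that run less than one full day
toy["Goal_per_day"] = toy["Goal"] / toy["Duration"].clip(lower=1)

print(toy[["Goal", "Duration", "Goal_per_day"]])
```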

# ToDo


EDA Data Checking: 
- Data Values
- Data Type 
- Check Distribution of Values
- Missing Values 
- Null Values 
- Correlations
- Data Reduction or Sampling
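
Most items on this checklist map to a single pandas call. The sketch below runs them on a small made-up frame; `toy` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per affected column
toy = pd.DataFrame({
    "Category": ["Art", "Music", "Art", None, "Technology"],
    "Goal": [20, 300, 1900, 500, 80000],
    "Backers": [3, 2, 10, 18, np.nan],
})

print(toy.dtypes)                      # data types
print(toy.describe())                  # value ranges of numeric columns
print(toy["Category"].value_counts())  # distribution of a categorical column

missing = toy.isna().sum()             # missing / null values per column
print(missing)

corr = toy.corr(numeric_only=True)     # pairwise correlations of numeric columns
print(corr)
```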

Checkout Result: 
- Dollar or Euro in GOAL
- Time vs. date in LAUNCHED
- Canceled? in STATE (Target)
- Cancellation date missing? --> Compare backers of canceled projects
- GOAL, PLEDGED, BACKERS have outliers
- SUBCATEGORY and CATEGORY need one-hot encoding
- ID needs to be cleaned
- NAME needs some kind of semantic analysis
- PreLaunch vs. PostLaunch analysis !!
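
For the outlier finding, one common check is the 1.5 × IQR rule, and a log transform is one way to tame a long right tail. Both are assumptions about how the outliers might be handled, not steps taken in this notebook. The values are the `Goal` entries of the ten rows shown at the top.

```python
import numpy as np
import pandas as pd

# Goal values of the first ten rows of the dataset
goal = pd.Series([1000, 80000, 20, 99, 1900, 3000, 200, 500, 500, 300], name="Goal")

# Flag values above the upper fence Q3 + 1.5 * IQR
q1, q3 = goal.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr
outliers = goal[goal > upper]
print(outliers)

# log1p compresses the right tail and is defined for a goal of 0
goal_log = np.log1p(goal)
print(goal_log.round(2))
```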

Try for Improvements: 
- Models
- Hyperparameter Tuning 
- Feature Engineering (Semantic Analysis)
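
Hyperparameter tuning could use the `GridSearchCV` and `LogisticRegression` classes that the import cell already pulls in. The sketch below runs on synthetic data; the grid values and the names `X_toy`, `y_toy` and `toy_pipe` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RSEED = 42
rng = np.random.default_rng(RSEED)

# Synthetic data: the label depends on the first feature plus noise
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=RSEED)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {"model__C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(toy_pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_toy, y_toy)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

The search only ever sees the training data; the held-out test set stays untouched until the final evaluation.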