# Kickstarter
Predict which projects will be successful

### Import Data and Libraries

In [409]:
# Import of relevant packages
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression

# Set random seed 
RSEED = 42

warnings.filterwarnings("ignore")

In [410]:
# Loading the kickstarter dataset
df = pd.read_csv('data/kickstarter_projects.csv')

# View the first few rows of the dataset
df.head(10)

Unnamed: 0,ID,Name,Category,Subcategory,Country,Launched,Deadline,Goal,Pledged,Backers,State
0,1860890148,Grace Jones Does Not Give A F$#% T-Shirt (limi...,Fashion,Fashion,United States,2009-04-21 21:02:48,2009-05-31,1000,625,30,Failed
1,709707365,CRYSTAL ANTLERS UNTITLED MOVIE,Film & Video,Shorts,United States,2009-04-23 00:07:53,2009-07-20,80000,22,3,Failed
2,1703704063,drawing for dollars,Art,Illustration,United States,2009-04-24 21:52:03,2009-05-03,20,35,3,Successful
3,727286,Offline Wikipedia iPhone app,Technology,Software,United States,2009-04-25 17:36:21,2009-07-14,99,145,25,Successful
4,1622952265,Pantshirts,Fashion,Fashion,United States,2009-04-27 14:10:39,2009-05-26,1900,387,10,Failed
5,2089078683,New York Makes a Book!!,Journalism,Journalism,United States,2009-04-28 13:55:41,2009-05-16,3000,3329,110,Successful
6,830477146,Web Site for Short Horror Film,Film & Video,Shorts,United States,2009-04-29 02:04:21,2009-05-29,200,41,3,Failed
7,266044220,Help me write my second novel.,Publishing,Fiction,United States,2009-04-29 02:58:50,2009-05-29,500,563,18,Successful
8,1502297238,Produce a Play (Canceled),Theater,Theater,United States,2009-04-29 04:37:37,2009-06-01,500,0,0,Canceled
9,813230527,Sponsor Dereck Blackburn (Lostwars) Artist in ...,Music,Rock,United States,2009-04-29 05:26:32,2009-05-16,300,15,2,Failed


### Prepare the Data

In [411]:
# View the data types of the columns
df.dtypes

ID              int64
Name           object
Category       object
Subcategory    object
Country        object
Launched       object
Deadline       object
Goal            int64
Pledged         int64
Backers         int64
State          object
dtype: object

In [412]:
# Drop the identifier and the outcome-related columns (Pledged, Backers)
df.drop(['ID', 'Pledged', 'Backers'], axis=1, inplace=True)
df.columns

Index(['Name', 'Category', 'Subcategory', 'Country', 'Launched', 'Deadline',
       'Goal', 'State'],
      dtype='object')

In [413]:
# Convert Launched and Deadline columns to datetime
df['Launched'] = pd.to_datetime(df['Launched'])
df['Deadline'] = pd.to_datetime(df['Deadline'])

# Calculate duration in days
df['Duration'] = (df['Deadline'] - df['Launched']).dt.days
df[['Launched', 'Deadline', 'Duration']].head()

Unnamed: 0,Launched,Deadline,Duration
0,2009-04-21 21:02:48,2009-05-31,39
1,2009-04-23 00:07:53,2009-07-20,87
2,2009-04-24 21:52:03,2009-05-03,8
3,2009-04-25 17:36:21,2009-07-14,79
4,2009-04-27 14:10:39,2009-05-26,28


In [414]:
# Keep only projects whose state is 'Successful' or 'Failed'
df = df[df['State'].isin(['Successful', 'Failed'])].copy()

In [415]:
# Creating list for categorical features
# and removing the target variable 'State'
cat_features = list(df.columns[df.dtypes==object])
cat_features.remove('State')
cat_features

['Name', 'Category', 'Subcategory', 'Country']

In [416]:
# Creating list for numerical features
num_features = ['Goal', 'Duration']
num_features

['Goal', 'Duration']

In [417]:
# Create list for datetime features
date_features = ['Launched', 'Deadline']
date_features

['Launched', 'Deadline']

### New Features

**Text-based Features:**

- Name Sentiment Analysis: Perform sentiment analysis on the project names to extract positive, negative, or neutral sentiments. This could capture the emotional appeal of the project.
- Name Length: The length of the project name (in characters or words) is a cheap candidate feature to test.

**Outlier Handling:**

- Capping Outliers: For `Goal` (and for `Pledged` and `Backers`, if they are kept), consider capping outliers to reduce their impact on the model.
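The name-length and capping ideas above can be sketched on a small toy frame. This is only an illustration: the rows, the new column names (`name_len`, `name_words`, `goal_capped`), and the 0.75 quantile are assumptions, not choices made in this notebook.

```python
import pandas as pd

# Toy frame standing in for df; the values are illustrative only
toy = pd.DataFrame({
    "Name": ["drawing for dollars", "Pantshirts",
             "Offline Wikipedia iPhone app", "Big Film"],
    "Goal": [20, 1900, 99, 1_000_000],
})

# Name length in characters and in whitespace-separated words
toy["name_len"] = toy["Name"].str.len()
toy["name_words"] = toy["Name"].str.split().str.len()

# Cap Goal at an upper quantile (0.75 is used here only because the toy frame is tiny)
cap = toy["Goal"].quantile(0.75)
toy["goal_capped"] = toy["Goal"].clip(upper=cap)

print(toy[["name_len", "name_words", "Goal", "goal_capped"]])
```

On the real data the cap would probably be computed on the training set only and then applied to the test set, so that no information from the test set influences preprocessing.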

### Train-Test-Split

Let's split the dataset into a training set and a test set. We train the model on the training set and use cross-validation there to find the best hyperparameter combination. The test set is held back and used only once, for the final evaluation of the best model.
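A minimal sketch of that workflow with `GridSearchCV`, which is imported above but not used later in the notebook. The synthetic data, the grid over `C`, `cv=5`, and all `demo` names are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RSEED = 42

# Synthetic stand-in for the Kickstarter features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=RSEED)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=RSEED, stratify=y_demo
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: only the regularisation strength C is tuned
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training set only
search = GridSearchCV(demo_pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)

# The test set is touched once, after the search has finished
test_accuracy = search.score(X_te, y_te)
print("Test accuracy:", round(test_accuracy, 3))
```

Because `GridSearchCV` refits the best configuration on the whole training set by default, `search` can be used directly for the final test-set evaluation.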

In [418]:
# Define predictors and target variable
X = df.drop('State', axis=1)
y = df['State']
print(f"We have {X.shape[0]} observations in our dataset and {X.shape[1]} features")
print(f"Our target vector has also {y.shape[0]} values")

We have 331462 observations in our dataset and 8 features
Our target vector has also 331462 values


In [419]:
# Split into train and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RSEED)

In [420]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

X_train shape: (265169, 8)
X_test shape: (66293, 8)
y_train shape: (265169,)
y_test shape: (66293,)


### Pipeline Parts

In [421]:
# Pipeline for categorical features 
cat_pipeline = Pipeline([
    ('1hot', OneHotEncoder(handle_unknown='ignore'))
])

In [422]:
# Pipeline for datetime features
date_pipeline = Pipeline([
    ('std_scaler', StandardScaler())
])

In [423]:
# Pipeline for numerical features
num_pipeline = Pipeline([
    ('std_scaler', StandardScaler())
])

### Prepare the Pipeline

In [428]:
# Build preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, num_features),
        ('cat', cat_pipeline, cat_features),
        ('date', date_pipeline, date_features)
    ],
    remainder='drop'
)

### Pick a model

In [425]:
# Set up the pipeline and model
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LogisticRegression())  
])

### Run the Pipeline

In [429]:
# Fit the pipeline to the training data
pipe.fit(X_train, y_train)
# Make predictions on the test data
y_pred = pipe.predict(X_test)

### Evaluation

In [430]:
# Generate classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

      Failed       0.70      0.77      0.73     39509
  Successful       0.60      0.53      0.56     26784

    accuracy                           0.67     66293
   macro avg       0.65      0.65      0.65     66293
weighted avg       0.66      0.67      0.66     66293

Confusion Matrix:
[[30324  9185]
 [12719 14065]]
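As a sanity check, the figures in the report can be recomputed from the confusion matrix above. In scikit-learn's layout, rows are the true classes and columns the predicted classes, ordered here as Failed, Successful. The majority-class baseline at the end is an addition for comparison, not something the notebook computes.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[30324, 9185],
               [12719, 14065]])
labels = ["Failed", "Successful"]

precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / all predictions per class
recall = np.diag(cm) / cm.sum(axis=1)     # correct predictions / all true samples per class
accuracy = np.trace(cm) / cm.sum()

# Accuracy of always predicting the most frequent class
baseline = cm.sum(axis=1).max() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:<10} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f} majority-class baseline={baseline:.2f}")
```

The model's accuracy of 0.67 is therefore about seven percentage points above the 0.60 obtained by always predicting `Failed`.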


### Performance Improvement Plan: 
- Use Jax
- Use Google Cloud
- Reduce the data to a 10% sample

Feature engineering ideas, based on the ToDo list and EDA notes below, that could improve model quality:

**Time-based Features:**

- Launch Duration: Calculate the duration of the campaign (difference between launch date and deadline). This is already implemented as `Duration`.
- Pre-Launch vs. Post-Launch Analysis: Create features to capture the time before and after the launch date.

**Text-based Features:**

- Name Sentiment Analysis: Perform sentiment analysis on the project names to extract positive, negative, or neutral sentiments. This could capture the emotional appeal of the project.
- Name Length: The length of the project name (in characters or words) is a cheap candidate feature to test.

**Category/Subcategory Features:**

- Combine Categories: Instead of just one-hot encoding, explore creating interaction features between categories and subcategories.

**Goal-related Features:**

- Goal Currency: Check which currency `Goal` is denominated in (Dollar or Euro) and, if it varies, create a separate feature for it.
- Goal Success Ratio: Create a feature that represents the ratio of `Pledged` to `Goal`. This indicates how close a project was to succeeding.

**Backer-related Features:**

- Backer Pledge Amount: Calculate the average pledge amount per backer (`Pledged` / `Backers`).

Note that `Pledged` and `Backers` were dropped during data preparation, and their final values are only known once a campaign has ended. Features derived from them would leak the outcome into a model that is meant to predict success in advance.

**Outlier Handling:**

- Capping Outliers: For `Goal` (and for `Pledged` and `Backers`, if they are kept), consider capping outliers to reduce their impact on the model.

Implementing these steps involves:

- Converting date columns to datetime objects.
- Using string manipulation to extract information from text columns.
- Applying mathematical operations to create new features.
- Using libraries like nltk or transformers for sentiment analysis.
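Two of the ideas above, launch-time features and a combined category/subcategory feature, can be sketched as follows. The toy rows are taken from the first rows of the dataset; the new column names (`launch_month`, `launch_weekday`, `launch_hour`, `cat_subcat`) are hypothetical.

```python
import pandas as pd

# First three rows of the dataset, reduced to the relevant columns
toy = pd.DataFrame({
    "Category": ["Fashion", "Film & Video", "Art"],
    "Subcategory": ["Fashion", "Shorts", "Illustration"],
    "Launched": pd.to_datetime([
        "2009-04-21 21:02:48",
        "2009-04-23 00:07:53",
        "2009-04-24 21:52:03",
    ]),
})

# Numeric calendar features derived from the launch timestamp
toy["launch_month"] = toy["Launched"].dt.month
toy["launch_weekday"] = toy["Launched"].dt.dayofweek  # Monday = 0
toy["launch_hour"] = toy["Launched"].dt.hour

# Interaction feature: one label per category/subcategory pair
toy["cat_subcat"] = toy["Category"] + " / " + toy["Subcategory"]

print(toy[["launch_month", "launch_weekday", "launch_hour", "cat_subcat"]])
```

Features like these are plain integers or strings, so they could be routed through the existing numerical or categorical pipelines instead of scaling the raw timestamps.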


# ToDo


EDA Data Checking: 
- Data Values
- Data Type 
- Check Distribution of Values
- Missing Values 
- Null Values 
- Correlations
- Data Reduction or Sampling
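A rough sketch of how several of these checks might look in pandas. The toy frame, its values, and the 50% sampling fraction are assumptions for illustration; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for df; the values are illustrative only
toy = pd.DataFrame({
    "Category": ["Art", "Music", "Art", None],
    "Goal": [20, 300, 1000, 80000],
    "State": ["Successful", "Failed", "Failed", "Canceled"],
})

# Data types
print(toy.dtypes)

# Missing / null values per column
missing = toy.isna().sum()
print(missing)

# Distribution of values: class shares for the target, summary statistics for Goal
state_share = toy["State"].value_counts(normalize=True)
goal_summary = toy["Goal"].describe()
print(state_share)
print(goal_summary)

# Data reduction: reproducible random sample (50% here because the toy frame is tiny)
sample = toy.sample(frac=0.5, random_state=42)
print(len(sample), "rows sampled")
```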

Check Results: 
- Dollar or Euro in GOALS
- Time vs Date in LAUNCHED
- Canceled? in STATE (Target)
- Cancellation date missing? --> Compare Backers for canceled projects
- GOALS, PLEDGES, BACKERS have outliers
- SUBCATEGORY and CATEGORY need one-hot encoding
- ID needs to be cleaned
- NAME needs some kind of semantic analysis 
- PreLaunch vs. PostLaunch Analysis  !!

Try for Improvements: 
- Models
- Hyperparameter Tuning 
- Feature Engineering (Semantic Analysis)