# Kickstarter Success Prediction

This notebook builds a machine learning pipeline to predict whether a Kickstarter campaign will succeed or fail.

## 🎯 Objective:
Classify Kickstarter projects as **successful** or **failed** based on historical features.

## 📊 Dataset:
**Source**: [Kickstarter Projects on Kaggle](https://www.kaggle.com/datasets/kemical/kickstarter-projects)

The dataset contains 300,000+ records of past campaigns, including:
- Launch/deadline dates
- Goal amounts
- Country, currency, and category
- Final state (`successful`, `failed`, etc.)

We use only the `successful` and `failed` rows for binary classification.

## 📦 1. Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, auc, classification_report, confusion_matrix,
    f1_score, roc_auc_score, roc_curve,
)

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

In [None]:
import warnings
# Silence all library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

## 📂 2. Load Dataset

In [None]:
# Adjust this path to wherever the Kaggle CSV is stored locally
data = pd.read_csv(r'E:\ML_Project\kickstarter-success-prediction\data\ks-projects-201801.csv')

In [None]:
data.head()

## 🔍 3. Initial Inspection

In [None]:
# info() prints its summary directly and returns None, so no print() is needed
data.info()

In [None]:
plt.figure(figsize=(12, 6))
sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

## 📉 4. Target Distribution

In [None]:
sns.countplot(data=data[data['state'].isin(['failed', 'successful'])], x='state')
plt.title('Target Distribution')
plt.show()

## 🧹 5. Data Cleaning

In [None]:
# Drop unneeded columns
data = data.drop(['ID', 'name'], axis=1)

In [None]:
# Filter only successful and failed
data = data[data['state'].isin(['failed', 'successful'])].reset_index(drop=True)

In [None]:
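# Fill missing 'usd pledged' values with the column mean
# (this column is dropped as leakage in the next cell, so the fill does not affect the model)
data['usd pledged'] = data['usd pledged'].fillna(data['usd pledged'].mean())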

In [None]:
# Drop columns that are only known once a campaign has run (they leak the outcome)
leakage_cols = ['pledged', 'usd pledged', 'usd_pledged_real', 'backers']
data.drop(columns=[col for col in leakage_cols if col in data.columns], inplace=True)
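
The columns dropped above describe how a campaign ended, so a model that sees them is effectively reading the answer. As a rough illustration on made-up numbers (the toy frame below is hypothetical, not rows from the dataset), a single comparison of `pledged` against `goal` reproduces the label without any learning:

```python
import pandas as pd

# Hypothetical campaigns, invented for illustration only
toy = pd.DataFrame({
    "goal": [1000, 5000, 200, 10000],
    "pledged": [1500, 300, 250, 9000],
    "state": ["successful", "failed", "successful", "failed"],
})

# A rule with no model at all: a campaign is funded when pledges reach the goal
rule_pred = (toy["pledged"] >= toy["goal"]).map({True: "successful", False: "failed"})
leak_accuracy = (rule_pred == toy["state"]).mean()
print("Accuracy of the pledged >= goal rule:", leak_accuracy)
```

Keeping such columns would inflate the evaluation metrics while offering nothing at prediction time, when pledges and backers are not yet known.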

## 🛠️ 6. Feature Engineering

In [None]:
# Convert to datetime
data['deadline'] = pd.to_datetime(data['deadline'], errors='coerce')
data['launched'] = pd.to_datetime(data['launched'], errors='coerce')

In [None]:
# Drop invalid dates
data.dropna(subset=['deadline', 'launched'], inplace=True)

In [None]:
# New features
data['duration_days'] = (data['deadline'] - data['launched']).dt.days
data['launch_month'] = data['launched'].dt.month
data['launch_dow'] = data['launched'].dt.dayofweek
data['launch_weekend'] = data['launch_dow'].isin([5, 6]).astype(int)
data['launch_holiday'] = data['launch_month'].isin([11, 12]).astype(int)
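
To make the date features concrete, here is a minimal sketch on two made-up campaigns (the timestamps are assumptions for illustration). It also shows that `.dt.days` keeps only whole days, so a deadline at midnight combined with a mid-morning launch yields one day less than the calendar difference:

```python
import pandas as pd

# Two hypothetical campaigns: a Saturday launch in January and a Wednesday launch in November
toy = pd.DataFrame({
    "launched": pd.to_datetime(["2017-01-07 10:00:00", "2017-11-15 09:30:00"]),
    "deadline": pd.to_datetime(["2017-02-06", "2017-12-30"]),
})

toy["duration_days"] = (toy["deadline"] - toy["launched"]).dt.days  # whole days only
toy["launch_month"] = toy["launched"].dt.month
toy["launch_dow"] = toy["launched"].dt.dayofweek                     # Monday=0 ... Sunday=6
toy["launch_weekend"] = toy["launch_dow"].isin([5, 6]).astype(int)
toy["launch_holiday"] = toy["launch_month"].isin([11, 12]).astype(int)

print(toy[["duration_days", "launch_dow", "launch_weekend", "launch_holiday"]])
```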

In [None]:
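# Log-transform the goal (log1p compresses its wide range of values)
data['log_goal'] = np.log1p(data['goal'])
# Log-goal per campaign day; 0-day durations are treated as 1 day to avoid division by zero
data['goal_per_day'] = data['log_goal'] / data['duration_days'].replace(0, 1)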

In [None]:
# Drop the raw goal and date columns now that the derived features exist
data.drop(columns=['goal', 'deadline', 'launched'], inplace=True)

In [None]:
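# Binning and interaction
# include_lowest=True keeps 0-day durations in the first bin instead of turning them into NaN
data['duration_bins'] = pd.cut(data['duration_days'], bins=[0, 15, 30, 60, 1000], labels=False, include_lowest=True)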
data['goal_bins'] = pd.qcut(data['log_goal'], q=4, labels=False)
data['goal_weekend_interaction'] = data['goal_per_day'] * data['launch_weekend']
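
How `pd.cut` treats the bin edges matters for the duration bins: intervals are closed on the right by default, so a duration of exactly 0 days falls into a bin only when `include_lowest=True` is passed. A small sketch on made-up durations (the values are assumptions chosen to hit the edges):

```python
import pandas as pd

durations = pd.Series([0, 10, 15, 16, 30, 45, 90])  # hypothetical durations in days
edges = [0, 15, 30, 60, 1000]

# Default: bins are (0, 15], (15, 30], (30, 60], (60, 1000], so 0 is left unbinned (NaN)
default_bins = pd.cut(durations, bins=edges, labels=False)

# include_lowest=True closes the first interval on the left as well
inclusive_bins = pd.cut(durations, bins=edges, labels=False, include_lowest=True)

print(pd.DataFrame({"duration": durations, "default": default_bins, "inclusive": inclusive_bins}))
```

Any NaN bins that do slip through are filled later by the mean imputer in the numeric pipeline, which is worth keeping in mind when interpreting that feature.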

## 🔢 7. Prepare Data for Modeling

In [None]:
# Encode target
y = data['state'].map({'failed': 0, 'successful': 1})
X = data.drop('state', axis=1)

In [None]:
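# Identify column types
# 'number' also catches int32 columns, which some pandas versions return from .dt accessors
# and which an explicit ['float64', 'int64'] filter would silently leave out
num_cols = X.select_dtypes(include='number').columns.tolist()
cat_cols = X.select_dtypes(include=['object']).columns.tolist()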

## ⚙️ 8. Preprocessing Pipeline

In [None]:
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [None]:
categorical_pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

In [None]:
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, num_cols),
    ('cat', categorical_pipeline, cat_cols)
])
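
To see what this preprocessor produces, here is a minimal sketch on a made-up frame (the rows, the `toy_` names, and `sparse_threshold=0`, used only to force a dense array for printing, are assumptions for illustration). Numeric columns are imputed and standardized, the categorical column is one-hot encoded, and a category unseen at fit time encodes as all zeros because of `handle_unknown='ignore'`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows with one missing numeric value
toy_train = pd.DataFrame({
    "log_goal": [6.9, 8.5, np.nan, 9.2],
    "duration_days": [30, 45, 30, 60],
    "main_category": ["Music", "Games", "Music", "Film & Video"],
})
# A new row whose category was not present during fit
toy_new = pd.DataFrame({
    "log_goal": [7.5],
    "duration_days": [30],
    "main_category": ["Dance"],
})

toy_numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
toy_preprocessor = ColumnTransformer(
    [
        ("num", toy_numeric, ["log_goal", "duration_days"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["main_category"]),
    ],
    sparse_threshold=0,  # always return a dense array (for readability only)
)

train_matrix = toy_preprocessor.fit_transform(toy_train)
new_matrix = toy_preprocessor.transform(toy_new)

print(train_matrix.shape)  # 2 numeric columns + 3 one-hot columns
print(new_matrix[0, 2:])   # one-hot part of the unseen category
```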

## 🤖 9. Model Pipeline with SMOTE

In [None]:
model_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))
])
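
SMOTE balances the training set by creating synthetic minority-class rows rather than duplicating existing ones. The sketch below is a simplified NumPy illustration of the interpolation idea only (a single nearest neighbour, made-up points, and a hypothetical helper `interpolate_sample`); it is not the `imblearn` implementation, which picks among several nearest neighbours:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def interpolate_sample(points, rng):
    """Create one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                      # exclude the point itself
    j = int(np.argmin(dists))              # nearest other minority point
    lam = rng.random()                     # position along the connecting segment, in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([interpolate_sample(minority, rng) for _ in range(5)])
print(synthetic)
```

Because the sampler sits inside an `imblearn` pipeline, resampling is applied only during `fit`, so the test set is scored on its original, unresampled rows.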

## 🧪 10. Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
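
Passing `stratify=y` keeps the share of successful campaigns the same in the training and test sets. A quick sketch with synthetic labels (the 60/40 class ratio and the `toy_` names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical labels: 60 failed (0) and 40 successful (1)
toy_y = np.array([0] * 60 + [1] * 40)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=42
)

train_rate = toy_y_train.mean()
test_rate = toy_y_test.mean()
print("Share of class 1 in train:", train_rate)
print("Share of class 1 in test:", test_rate)
```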

## 🏋️‍♂️ 11. Fit Model

In [None]:
model_pipeline.fit(X_train, y_train)

## 🔎 12. Evaluate Model

In [None]:
y_pred = model_pipeline.predict(X_test)
y_proba = model_pipeline.predict_proba(X_test)[:, 1]

In [None]:
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))
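
As a sanity check on how these metrics relate, here is a small worked example on made-up labels and probabilities (all values are assumptions for illustration). Accuracy and F1 are computed from the thresholded predictions, while ROC AUC uses the probabilities directly:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

toy_true = np.array([0, 0, 1, 1, 1])
toy_proba = np.array([0.2, 0.6, 0.7, 0.9, 0.4])  # hypothetical P(successful)
toy_pred = (toy_proba >= 0.5).astype(int)          # thresholded labels: [0, 1, 1, 1, 0]

toy_cm = confusion_matrix(toy_true, toy_pred)      # rows = actual, columns = predicted
toy_acc = accuracy_score(toy_true, toy_pred)       # 3 of 5 correct
toy_f1 = f1_score(toy_true, toy_pred)              # precision = recall = 2/3
toy_auc = roc_auc_score(toy_true, toy_proba)       # 5 of 6 positive/negative pairs ranked correctly

print(toy_cm)
print(toy_acc, toy_f1, toy_auc)
```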

## 📊 13. Confusion Matrix

In [None]:
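cm = confusion_matrix(y_test, y_pred)
# Label the axes with the class names (0 = failed, 1 = successful)
class_names = ['failed', 'successful']
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)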
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

## 📈 14. ROC Curve

In [None]:
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.3f}')
plt.plot([0,1],[0,1],'--', color='gray')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.grid()
plt.show()