# **Kickstarter Success Prediction 🚀**

This Jupyter Notebook walks through building a machine learning model that predicts whether a Kickstarter campaign will succeed. It covers data loading, feature engineering, model training, and evaluation, with each step in its own cell so it can be run and inspected on its own.

---

## **1. Import Libraries**
First, we import the necessary Python libraries for data manipulation, visualization, and machine learning.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

---

## **2. Load Data**
Next, we load the Kickstarter dataset.

**Important:** Update the `file_path` variable to the location of your `kickstarter.csv` file.

In [None]:
# --- IMPORTANT: Update this file path to match your file's location ---
file_path = 'e:/ML_Project/kickstarter-success-prediction/data/kickstarter.csv'

try:
    df = pd.read_csv(file_path)
    print("✅ Dataset loaded successfully!")
except FileNotFoundError:
    print(f"❌ Error: The file '{file_path}' was not found.")
    raise  # Stop here: every later cell needs `df` to exist

---

## **3. Dataset Details & Initial Exploration**
Let's get a better understanding of our dataset. This dataset contains information on thousands of Kickstarter campaigns, each with various attributes like the project's category, funding goal, launch date, and final status.

Our goal is to use these attributes to predict the `binary_state` (successful or failed).

First, a quick preview of the data.

In [None]:
# Display the first few rows of the dataframe
df.head(3)

Now, let's look at the data types and check for missing values.

In [None]:
# Get a concise summary of the dataframe
df.info()
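
`df.info()` lists non-null counts per column, but missing values are often easier to compare as fractions. The following is a minimal sketch on a small made-up frame; the `toy` data and its column names are illustrative assumptions, not taken from the Kickstarter file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "goal": [1000.0, np.nan, 500.0, 2500.0],
    "category": ["art", "games", None, "music"],
    "country": ["US", "GB", "US", "CA"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same two calls (`isna().mean()`) should work unchanged on `df`.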

---

## **4. Feature Engineering & Data Cleaning**
In this step, we'll clean the data and create new features to improve our model's performance.

### **4.1. Engineer New Features from Dates**
We start by making a copy of the dataframe to work on.

In [None]:
# Start with a clean copy to preserve the original dataframe
df_processed = df.copy()

Now, we convert the `launched_at` column to a datetime object, which is necessary to extract time-based features.

In [None]:
# Convert 'launched_at' to a datetime object
df_processed['launched_at'] = pd.to_datetime(df_processed['launched_at'])
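
One thing worth checking before moving on: if `launched_at` holds Unix epoch seconds rather than date strings (an assumption; the storage format is not stated here), `pd.to_datetime` needs `unit='s'`, because plain integers are otherwise read as nanoseconds. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical epoch-second values; the real column may already hold date strings
epoch_seconds = pd.Series([1420070400, 1451606400])

# Without a unit, integers are interpreted as nanoseconds since 1970-01-01
as_nanoseconds = pd.to_datetime(epoch_seconds)

# With unit="s", the same integers are interpreted as seconds since 1970-01-01
as_seconds = pd.to_datetime(epoch_seconds, unit="s")

print(as_nanoseconds.dt.year.tolist())
print(as_seconds.dt.year.tolist())
```

If every converted date lands in 1970, the column is most likely epoch-based and the unit is missing.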

From the datetime object, we create new features: the day of the week the campaign was launched and a flag for whether it was a weekend.

In [None]:
# Create new time-based features
df_processed['day_of_week'] = df_processed['launched_at'].dt.dayofweek # Monday=0, Sunday=6
df_processed['is_weekend'] = df_processed['day_of_week'].isin([5, 6]).astype(int)
print("✅ Engineered new features: 'day_of_week' and 'is_weekend'.")
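
To sanity-check the weekday encoding, the same two lines can be run on a few known dates. This sketch uses a made-up `launches` frame with a Friday, a Saturday, and a Sunday:

```python
import pandas as pd

# Hypothetical launch dates: Friday, Saturday, Sunday
launches = pd.DataFrame({
    "launched_at": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07"])
})

# Same logic as the cell above: Monday=0 ... Sunday=6
launches["day_of_week"] = launches["launched_at"].dt.dayofweek
launches["is_weekend"] = launches["day_of_week"].isin([5, 6]).astype(int)

print(launches)
```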

### **4.2. Clean and Prepare Target Variable**
We filter the dataset to include only projects that were clearly 'successful' or 'failed', removing other states like 'canceled' or 'live'.

In [None]:
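# Filter for 'successful' and 'failed' projects
# .copy() makes the result an independent frame, avoiding SettingWithCopyWarning on later assignments
valid_states = ['successful', 'failed']
df_processed = df_processed[df_processed['binary_state'].isin(valid_states)].copy()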
print("✅ Filtered for valid states.")

Next, we convert our target variable, `binary_state`, into a numeric format where `1` represents a successful project and `0` represents a failed one.

In [None]:
# Convert target variable to numeric
df_processed['binary_state'] = df_processed['binary_state'].map({'successful': 1, 'failed': 0})
print("✅ Target variable converted to numeric.")
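
Before modeling, it helps to know how balanced the two classes are. A minimal sketch of the filter, mapping, and a class-share check on made-up labels (the `states` series is hypothetical):

```python
import pandas as pd

# Hypothetical state labels, not real campaign data
states = pd.Series(["successful", "failed", "canceled", "failed", "live", "failed"])

# Keep the two outcome classes and take an explicit copy before transforming
kept = states[states.isin(["successful", "failed"])].copy()
target = kept.map({"successful": 1, "failed": 0})

# Share of each class among the kept rows
class_share = target.value_counts(normalize=True)
print(class_share)
```

On the real data, `df_processed['binary_state'].value_counts(normalize=True)` gives the same view.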

---

## **5. Prepare Data for Modeling**
Here, we select our features (`X`) and target (`y`) and split them for training and testing.

### **5.1. Define Features (X) and Target (y)**

First, we define our target variable `y`.

In [None]:
# Define the target variable
y = df_processed['binary_state']

Next, we specify columns to be removed. Some are "leaky" (containing information that isn't available at the time of prediction, like `backers_count`) and others are simply unnecessary for modeling.

In [None]:
# Define leaky and unnecessary columns
leaky_columns = ['usd_pledged', 'backers_count', 'spotlight', 'state']
unnecessary_columns = [
    'Unnamed: 0', 'id', 'blurb', 'name', 'currency', 'deadline', 'launched_at',
    'goal', 'category_slug', 'location.country', 'slug', 'location_displayable_name',
    'location_typelocation_country', 'location_statelocation_displayable_name'
]

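Now, we create our feature set `X` by dropping the target, the leaky columns, and the unnecessary columns. Because `errors='ignore'` is set, any listed name that does not exist in the dataframe is skipped silently.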

In [None]:
# Define features 'X' by dropping target and specified columns
X = df_processed.drop(columns=['binary_state'] + leaky_columns + unnecessary_columns, errors='ignore')
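
Since `errors='ignore'` hides misspelled or missing column names, a quick check of the drop list can catch columns that were meant to be removed but stay in `X`. A sketch with a hypothetical frame and drop list:

```python
import pandas as pd

# Hypothetical frame and drop list for illustration only
frame = pd.DataFrame({
    "id": [1, 2],
    "usd_pledged": [10.0, 0.0],
    "category": ["art", "games"],
})
columns_to_drop = ["id", "usd_pledged", "backers_count"]

# Names that errors="ignore" would skip without any message
not_found = [col for col in columns_to_drop if col not in frame.columns]
print("Not present in the frame:", not_found)

features = frame.drop(columns=columns_to_drop, errors="ignore")
print(features.columns.tolist())
```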

Finally, we convert all remaining categorical text features into numerical format using one-hot encoding.

In [None]:
# One-hot encode all remaining categorical features
X = pd.get_dummies(X, drop_first=True)
print(f"✅ Data prepared for modeling. Total features: {X.shape[1]}")
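
To see what `drop_first=True` does, here is a small sketch on a made-up frame. Each text column becomes one indicator column per level, minus the first level, which is then represented by all indicators being zero.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
toy = pd.DataFrame({
    "category": ["art", "games", "music", "art"],
    "is_weekend": [0, 1, 0, 1],
})

# Text columns are expanded; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Columns with many distinct values can add a large number of features this way, so `X.shape[1]` is worth a look after encoding.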

### **5.2. Split Data**
We split the data into training (80%) and testing (20%) sets.

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("✅ Data split into training and testing sets.")
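
The split above is purely random. If the classes turn out to be imbalanced, passing `stratify=y` keeps the class ratio the same in both sets. A sketch on made-up labels (this is an optional variation, not what the cell above does):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 failed (0) and 20 successful (1)
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

# stratify=labels preserves the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test:", y_te.mean())
```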

---

## **6. Train and Evaluate the Model**
We will train a Random Forest Classifier. A key improvement here is using `class_weight='balanced'`, which weights each class inversely to its frequency in the training labels, so mistakes on the less common class count for more during training.

### **6.1. Train the Random Forest Model**

Initialize the model with our chosen parameters.

In [None]:
# Initialize the model with our key improvement: class_weight='balanced'
rf_model_improved = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # Addresses class imbalance
    random_state=42,
    n_jobs=-1
)
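
For intuition about what `'balanced'` means, scikit-learn computes each class weight as `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces the numbers for made-up labels with `compute_class_weight`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 80 failed (0) and 20 successful (1)
labels = np.array([0] * 80 + [1] * 20)

# Weight per class: n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=labels
)
print(dict(zip([0, 1], weights)))
```

Here the rarer class receives the larger weight (100 / (2 * 20) = 2.5 versus 100 / (2 * 80) = 0.625).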

Now, fit the model to the training data.

In [None]:
# Fit the model on the training data
print("🌳 Training the Random Forest model...")
rf_model_improved.fit(X_train, y_train)
print("✅ Model training complete.")

### **6.2. Evaluate Model Performance**
Make predictions on the unseen test data.

In [None]:
# Make predictions on the test set
y_pred_rf_imp = rf_model_improved.predict(X_test)
print("✅ Predictions made on the test set.")

Evaluate the model's accuracy.

In [None]:
# Print the accuracy score
print("\n--- Evaluation of Improved Random Forest Model ---")
print(f"Improved Model Accuracy: {accuracy_score(y_test, y_pred_rf_imp):.4f}")
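
Accuracy is easier to interpret next to a trivial baseline. As a sketch on made-up labels, a classifier that always predicts the most frequent class already reaches the majority-class share:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 80 failed (0) and 20 successful (1)
labels = np.array([0] * 80 + [1] * 20)
features = np.zeros((100, 1))  # Features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(features, labels)
baseline_accuracy = accuracy_score(labels, baseline.predict(features))
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Fitting the same baseline on `X_train`, `y_train` and scoring it on `X_test`, `y_test` gives a reference point for the Random Forest score.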

Use a classification report for a more detailed performance breakdown, including precision, recall, and F1-score for each class.

In [None]:
# Print the full classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf_imp, target_names=['Failed', 'Successful']))
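
The precision and recall figures in the report come from four raw counts, which a confusion matrix shows directly. A sketch with made-up labels and predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (0 = failed, 1 = successful)
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_hat, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

With the notebook's variables, the equivalent call would be `confusion_matrix(y_test, y_pred_rf_imp)`.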

---

## **7. Analyze Feature Importance**
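Finally, let's see which features the model relied on most. The `feature_importances_` attribute of a Random Forest is impurity-based: it measures how much each feature reduced impurity across the trees during training, so it describes the fitted model rather than a causal effect.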

Get the feature importances from the trained model.

In [None]:
# Get feature importances and names from the trained model
importances = rf_model_improved.feature_importances_
feature_names = X_train.columns

Create a DataFrame to make the data easier to work with.

In [None]:
# Create a DataFrame for visualization
feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances})

Sort the features by importance to find the most influential ones.

In [None]:
# Sort by importance and get the top 20 features
top_20_features = feature_importance_df.sort_values(by='importance', ascending=False).head(20)

Plot the top 20 features.

In [None]:
# Plot the results
plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=top_20_features)
plt.title('Top 20 Most Important Features (Improved Model)')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
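
Impurity-based importances can favor features with many distinct values, so a second view is sometimes useful. One option is permutation importance on held-out data, sketched here on synthetic data from `make_classification` (the data, model settings, and variable names are illustrative assumptions, not part of this notebook's pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the test set and record the average drop in score
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
print(result.importances_mean)
```

The same call could be applied to `rf_model_improved` with `X_test` and `y_test`, though it may be slow when there are many one-hot columns.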