## INITIAL SETUP

This notebook illustrates the main findings from our exploratory data analysis (EDA).

In [None]:
# General imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path
from typing import Optional


The datasets are loaded from the transformed data produced by the 00_data_cleaning_eda notebook (see that notebook for details and reasoning).

In [None]:
# load datasets
CLEANED_DATA_PATH = Path.cwd().resolve().parents[1] / "data" / "cleaned"

df_ks = pd.read_csv(CLEANED_DATA_PATH / 'kickstarter_cleaned.csv')
# Check the result
df_ks.head()

In [None]:
# Recreate state column from target for EDA purposes
# target=1 means successful, target=0 means failed
df_ks['state'] = df_ks['target'].map({1: 'successful', 0: 'failed'})
df_ks.head()


# EDA
## Checking for Null-Values
We decided to drop the columns that contain NaN values:
- We don't need the pledged column that contains NaNs (see the reasoning in the 00_data_cleaning notebook).
- There are no other NaNs.

In [None]:
# Show that we don't have missing values
sns.heatmap(df_ks.isna(), cbar=False)
plt.show()
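
A numeric check can complement the heatmap, since single missing cells are easy to overlook in a plot of many rows. The following is a minimal sketch on a hypothetical toy frame (the column names and values are illustrative, not taken from the cleaned dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_ks; values are made up.
toy = pd.DataFrame({
    "usd_goal_real": [1000.0, 5000.0, 250.0],
    "backers": [10, 0, 3],
    "usd_pledged": [1200.0, np.nan, 40.0],
})

# Count missing values per column and keep only the columns that have any.
missing_per_column = toy.isna().sum()
columns_with_nan = missing_per_column[missing_per_column > 0]
print(columns_with_nan)

# True only if the frame contains no missing value at all.
has_no_missing = not toy.isna().any().any()
print(has_no_missing)
```

Applied to `df_ks`, an empty `columns_with_nan` would confirm what the heatmap suggests.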

### Picking a measure 
We picked "success" vs. "failure" as the binary target to predict.
* Reasoning: We want to help future projects assess their chance of "success" or "failure".
* Cancellations are self-inflicted failures and thus fall outside our scope.
* Undefined or suspended states are not informative for our goal (yet).

That leaves us with a clear target. 

In [None]:
# Check out the distribution of the target variable
sns.countplot(x="state", data=df_ks)
plt.title("Project Outcome Distribution")
plt.show()

The resulting class distribution is somewhat imbalanced, but the target is far from being a rare event. We can therefore work with most evaluation measures in the machine learning steps later on.
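
The degree of imbalance can also be expressed as numbers rather than read off the bar chart. This is a small sketch using hypothetical labels (the 60/40 split is made up for illustration and is not the real class ratio):

```python
import pandas as pd

# Hypothetical labels standing in for df_ks["state"].
state = pd.Series(["failed"] * 6 + ["successful"] * 4, name="state")

# Share of each class; normalize=True returns fractions instead of counts.
class_share = state.value_counts(normalize=True)
print(class_share)

# Ratio of majority to minority class as a single imbalance figure.
imbalance_ratio = class_share.max() / class_share.min()
print(round(imbalance_ratio, 2))
```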

### Checking out the numerical data
We looked at the numerical features (goal, pledged amount, and number of backers) in relation to the target.

All distributions are heavily skewed, and some even resemble power laws.

We add an example exploration to illustrate:

In [None]:
# Restrict to projects with fewer than 80 backers so the shape of the distribution is visible
df1_test = df_ks.query("backers < 80")
df1_test['backers'].plot(kind='hist', bins=100)
plt.show()

# And check by state:
sns.catplot(data=df1_test, x="state", y="backers", kind="boxen")
plt.show()

Without such a cut, the distributions are barely visible:

In [None]:
fig, ax = plt.subplots(nrows=3, figsize=(16, 12))
# Histogram of the funding goal
sns.histplot(df_ks['usd_goal_real'], kde=True, ax=ax[0], color='#33658A').set(title='Goal (histogram)', xlabel='')
# Boxplot of the funding goal
sns.boxplot(x=df_ks['usd_goal_real'], ax=ax[1], color='#33658A').set(title='Goal (boxplot)')
# Scatter plot of goal vs. pledged amount
sns.scatterplot(x=df_ks.usd_goal_real, y=df_ks.usd_pledged_real, ax=ax[2], color='#33658A').set(title='Goal vs. pledged')
fig.tight_layout(pad=3)
plt.show()

Strictly speaking, however, these extreme values are not outliers: the distributions simply have long tails. We therefore decided not to "outlier clean" and left the data as-is.
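
One way to inspect a long-tailed feature without removing any rows is to look at it on a log scale. The sketch below uses synthetic log-normal values as a stand-in for a feature such as `usd_goal_real`; the distribution parameters are assumptions chosen only to produce a long tail:

```python
import numpy as np
import pandas as pd

# Synthetic long-tailed values; not real Kickstarter data.
rng = np.random.default_rng(0)
goal = pd.Series(rng.lognormal(mean=8.0, sigma=2.0, size=10_000), name="goal")

# log1p compresses the tail and is defined at zero, so no rows are dropped.
log_goal = np.log1p(goal)

# Sample skewness before and after the transform.
print(goal.skew(), log_goal.skew())
```

Plotting `log_goal` with the same histogram call would then show the bulk of the distribution and the tail in a single figure.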

### Checking out categories and success rates 

We want to learn more about categories and countries, so we look at the success rate for each.

In [None]:
# The numerical features are not bell-shaped, so we turn to success rates: share of successful projects per main category
success_rate = (
    df_ks.assign(success=df_ks["state"] == "successful")
      .groupby("main_category")["success"]
      .mean()
      .sort_values()
)

success_rate.plot(kind="barh", colormap="viridis")
plt.title("Success Rate by Main Category")
plt.show()

In [None]:
# Same for country: show only the top 15 countries by success rate (there are too many to plot all)
country_success = (
    df_ks.assign(success=df_ks["state"] == "successful")
      .groupby("country")["success"]
      .mean()
      .sort_values(ascending=False)
)

country_success.head(15).plot(kind="barh", colormap="viridis")
plt.title("Top Countries by Success Rate")
plt.show()
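
A ranking by success rate alone can be dominated by countries with very few projects. A possible safeguard, sketched here on hypothetical toy data, is to compute the project count alongside the rate and to rank only countries above a minimum count (`MIN_PROJECTS` is an illustrative threshold, not a value used in our analysis):

```python
import pandas as pd

# Hypothetical toy data; "XX" has only one project.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "US", "GB", "GB", "GB", "XX"],
    "state": ["successful", "failed", "failed", "successful",
              "successful", "failed", "failed", "successful"],
})

# Success rate and number of projects per country in one table.
country_stats = (
    toy.assign(success=toy["state"] == "successful")
       .groupby("country")["success"]
       .agg(success_rate="mean", n_projects="count")
)

# Illustrative threshold: ignore countries with fewer projects than this.
MIN_PROJECTS = 3
reliable = (
    country_stats[country_stats["n_projects"] >= MIN_PROJECTS]
    .sort_values("success_rate", ascending=False)
)
print(reliable)
```

In the toy data, "XX" has a success rate of 100% from a single project and is excluded from the ranking.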

To better understand the main categories before we bin them:
* We check the success rate and the number of projects per category.

In [None]:
# Keep only 'successful' and 'failed' states for this analysis (optional but common).
# .copy() gives an independent frame, so adding a column below does not trigger a SettingWithCopyWarning.
success_fail_df = df_ks[df_ks['state'].isin(['successful', 'failed'])].copy()

# Create a new binary column: 1 if successful, 0 otherwise
success_fail_df['is_successful'] = np.where(success_fail_df['state'] == 'successful', 1, 0)

# Calculate the mean success rate per category
category_success_rate = success_fail_df.groupby('main_category')['is_successful'].mean().sort_values(ascending=False).reset_index()

plt.figure(figsize=(12, 7))
sns.barplot(
    x='is_successful',
    y='main_category',
    data=category_success_rate,
    palette='viridis'
)
plt.title('Project Success Rate by Main Category')
plt.xlabel('Success Rate (Fraction of Projects)')
plt.ylabel('Main Category')
plt.show()

In [None]:
# 2. Main Categories Distribution
plt.figure(figsize=(14, 6))
category_counts = df_ks['main_category'].value_counts().head(15)
sns.barplot(x=category_counts.values, y=category_counts.index, palette='viridis')
plt.title('Top 15 Main Categories by Project Count', fontsize=14, fontweight='bold')
plt.xlabel('Number of Projects')
plt.ylabel('Main Category')
plt.tight_layout()
plt.show()

In [None]:
# 3. Success Rate by Main Category
category_success = df_ks.groupby('main_category')['state'].apply(
    lambda x: (x == 'successful').sum() / len(x) * 100
).sort_values(ascending=False)

plt.figure(figsize=(14, 6))
sns.barplot(x=category_success.values, y=category_success.index, palette='viridis')
plt.title('Success Rate by Main Category (%)', fontsize=14, fontweight='bold')
plt.xlabel('Success Rate (%)')
plt.ylabel('Main Category')
plt.tight_layout()
plt.show()

### Look at the time
We want to check whether patterns emerge over time.

We therefore plot the monthly number of project launches, together with the monthly success rate, over the full period covered by the dataset.

This is not yet a seasonality analysis.

In [None]:
# Parse dates; unparseable values become NaT
df_ks["launched"] = pd.to_datetime(df_ks["launched"], errors="coerce")
df_ks["deadline"] = pd.to_datetime(df_ks["deadline"], errors="coerce")

# Explicit copy so the new column is added to an independent frame
df_time = df_ks.dropna(subset=['launched']).copy()
df_time['year_month'] = df_time['launched'].dt.to_period('M')

# Number of projects launched per month
monthly_counts = df_time['year_month'].value_counts().sort_index()

# Success rate (%) per launch month
df_time_success = df_time.groupby('year_month')['state'].apply(
    lambda x: (x == 'successful').sum() / len(x) * 100
).sort_index()

fig1, ax1 = plt.subplots(figsize=(16, 6))

# Left axis: success rate
ax1.plot(df_time_success.index.astype(str), df_time_success.values,
         marker='o', linewidth=2, markersize=4, color='green')
ax1.set_ylabel('Success Rate (%)', color='green')
ax1.set_xlabel('Year-Month')
ax1.tick_params(axis='x', rotation=45)
ax1.grid(True, alpha=0.3)

# Right axis: number of launched projects
ax2 = ax1.twinx()
ax2.plot(monthly_counts.index.astype(str), monthly_counts.values,
         marker='o', linewidth=2, markersize=4, color='blue')
ax2.set_ylabel('Number of Projects', color='blue')

ax1.set_title('Success Rate and Number of Projects Launched Over Time', fontsize=14, fontweight='bold')
fig1.tight_layout()
plt.show()
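
A first step toward a seasonality analysis could be to pool the same calendar month across years. The following sketch uses hypothetical launch dates and outcomes and is meant only to show the grouping, not a result:

```python
import pandas as pd

# Hypothetical launch dates and outcomes spanning two years.
toy = pd.DataFrame({
    "launched": pd.to_datetime([
        "2015-01-10", "2015-01-20", "2015-07-05",
        "2016-01-15", "2016-07-01", "2016-07-30",
    ]),
    "state": ["successful", "failed", "failed",
              "successful", "successful", "failed"],
})

# Group by calendar month (1-12) so the same month of different years is pooled.
by_month = (
    toy.assign(success=toy["state"] == "successful",
               month=toy["launched"].dt.month)
       .groupby("month")["success"]
       .agg(success_rate="mean", n_projects="count")
)
print(by_month)
```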

We also look at the typical duration of projects.
* The distribution is highly irregular and has no clear shape.

Duration seems to be driven by a "one-month magic number":
* We see a second spike around two months and a tiny one around three months.

In [None]:
# 8. Campaign Duration Distribution
df_duration = df_ks.dropna(subset=['duration_days'])
df_duration_clean = df_duration[
    (df_duration['duration_days'] >= 0) & 
    (df_duration['duration_days'] <= 90)
]

plt.figure(figsize=(12, 6))
plt.hist(df_duration_clean['duration_days'], bins=30, color='coral', edgecolor='black')
plt.title('Distribution of Campaign Duration (Days)', fontsize=14, fontweight='bold')
plt.xlabel('Campaign Duration (Days)')
plt.ylabel('Frequency')
plt.axvline(df_duration_clean['duration_days'].median(), 
            color='red', linestyle='--', linewidth=2, label=f'Median: {df_duration_clean["duration_days"].median():.0f} days')
plt.legend()
plt.tight_layout()
plt.show()

print(f"Median campaign duration: {df_duration_clean['duration_days'].median():.0f} days")
print(f"Mean campaign duration: {df_duration_clean['duration_days'].mean():.1f} days")
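
The spikes can also be checked numerically by counting the most frequent durations. The sketch below assumes that `duration_days` is the difference between `deadline` and `launched` in whole days; the actual derivation happens in the cleaning notebook and is not shown here, and the dates are hypothetical:

```python
import pandas as pd

# Hypothetical campaigns; duration is assumed to be deadline minus launch date.
toy = pd.DataFrame({
    "launched": pd.to_datetime(["2016-01-01", "2016-02-01", "2016-03-01", "2016-03-01"]),
    "deadline": pd.to_datetime(["2016-01-31", "2016-03-02", "2016-04-30", "2016-03-31"]),
})
toy["duration_days"] = (toy["deadline"] - toy["launched"]).dt.days

# The most frequent durations show where the spikes sit.
top_durations = toy["duration_days"].value_counts().head(3)
print(top_durations)
```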

## Final Decisions 

Discussion of original columns provided

| Column | Data | Decision | Done |
|:--------:|:--------:|:--------:|:--------:|
| ID | Unique identifier | checked with other dataset | |
| Name | Project title | sentiment analysis would be great, but is not feasible | ignored |
| Category | >150 subcategories | included in main category | ignored |
| main_category | 15 categories | checked, makes a difference | use, but make even less granular |
| currency | currency used for the project | decided to work only with USD | ignored |
| goal | funding goal in the project currency | decided to work only with USD | ignored |
| pledged | money pledged in the project currency | decided to work only with USD | ignored |
| backers | number of people who pledged money | correlated and not available in the future | ignored |
| usd_pledged | How much money did the project get? | will not be available in the future | ignored |
| usd_pledged_real | apparently redundant | same as pledged | ignored |
| deadline | unclear; probably the stated end date of the campaign | used | ignored |
| state | used as target | removed all but "successful" and "failed" | keep and create a new column with numerical values |
| country | country of the project | informative, but too many values | reduced to continents |
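
The last two rows of the table can be sketched as simple mappings. The country-to-continent dictionary below is hypothetical and deliberately incomplete; the grouping actually used in the project is not shown in this notebook:

```python
import pandas as pd

# Hypothetical, incomplete mapping for illustration only.
COUNTRY_TO_CONTINENT = {
    "US": "North America",
    "CA": "North America",
    "GB": "Europe",
    "DE": "Europe",
    "AU": "Oceania",
}

toy = pd.DataFrame({
    "country": ["US", "GB", "DE", "AU", "JP"],
    "state": ["successful", "failed", "failed", "successful", "failed"],
})

# Unmapped codes become NaN with .map, so give them an explicit fallback label.
toy["continent"] = toy["country"].map(COUNTRY_TO_CONTINENT).fillna("Other")

# Numeric target: 1 for successful, 0 for failed (same convention as `target`).
toy["target"] = toy["state"].map({"successful": 1, "failed": 0})
print(toy)
```

The explicit fallback label makes unmapped country codes visible instead of silently turning them into missing values.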