## Feature Engineering Exploration
This notebook consolidates the first round of transformations that turn our original columns into more informative features.

In [None]:
#initial imports 
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Optional
import seaborn as sns
import matplotlib.pyplot as plt

import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

With data cleaning complete, we can now load the cleaned files.

In [None]:
# Load Files as DataFrames
BASE_DIR = Path.cwd().resolve().parents[1]
data_file_1 = BASE_DIR / "data" / "cleaned" / "kickstarter_cleaned_with_cancelled.csv"
data_file_2 = BASE_DIR / "data" / "cleaned" / "kickstarter_cleaned.csv"

# data_file_1 and data_file_2 are already Path objects, so they can be passed directly
df1 = pd.read_csv(data_file_1, encoding='latin-1', low_memory=False)
df2 = pd.read_csv(data_file_2, low_memory=False)

# Name the file in each message so the two log lines can be told apart
logger.info(f"Loaded {data_file_1.name}: {len(df1)} rows and {len(df1.columns)} columns")
logger.info(f"Loaded {data_file_2.name}: {len(df2)} rows and {len(df2.columns)} columns")

### Checking the numerical money columns for potential outlier cleaning

In [None]:
# Keep only projects where both pledged and goal are above 0, as a value of 0 counts as "not happened"
df2_clean = df2[(df2['usd_pledged_real'] > 0) & (df2['usd_goal_real'] > 0)].copy()
# Log how many rows remain after the filter
logger.info(f"Kept {len(df2_clean)} of {len(df2)} rows after removing zero-value projects")
# Check the distribution of the goal (describe() already includes the minimum;
# only the last expression of a cell is displayed)
df2_clean['usd_goal_real'].describe().round(2)


We bin the data by quantiles. The curves follow power-law-like distributions, so we prioritise having a comparable number of projects in each bin over having a uniform step size between bins.

Using quintiles (five quantile-based bins) ensures that.

We also chose a small number of bins to keep the later feature engineering simple.

In [None]:
#do the binning for both goals and pledged 
df2_clean.loc[:, 'usd_goal_bins'] = pd.qcut(df2_clean['usd_goal_real'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
df2_clean.loc[:, 'usd_pledged_bins'] = pd.qcut(df2_clean['usd_pledged_real'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
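
To illustrate why quantile bins were preferred over equal-width bins, here is a minimal sketch on synthetic data. The Pareto sample and the names `amounts`, `equal_width`, and `equal_count` are assumptions made for this illustration only; they are not part of the Kickstarter dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed amounts (illustrative only, not the real data)
rng = np.random.default_rng(seed=0)
amounts = pd.Series(rng.pareto(a=1.5, size=10_000) + 1.0)

labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']

# Equal-width bins: the value range is split into five steps of equal size
equal_width = pd.cut(amounts, bins=5, labels=labels)

# Quantile bins: the edges are chosen so each bin holds about 20% of the rows
equal_count = pd.qcut(amounts, q=5, labels=labels)

# Compare the number of rows per bin for both approaches
comparison = pd.DataFrame({
    'equal_width': equal_width.value_counts().sort_index(),
    'equal_count': equal_count.value_counts().sort_index(),
})
print(comparison)
```

On a sample like this, equal-width binning puts almost every row into the lowest bin, while quantile binning keeps the bins balanced.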

In [None]:
#check the columns after doing this 
df2_clean.columns

In [None]:
#check the data 
df2_clean.head(10)

### Creating further bins 

##### Categories of Categories
We want to aggregate the categories into fewer bins. After checking the curves, we manually assigned each category to a group by asking "is this the same type of thing?", which gives us a consolidated set.

In [None]:
#grouping main categories into broader groups
category_map = {
    'Art': 'Creative',
    'Comics': 'Creative',
    'Crafts': 'Creative',
    'Dance': 'Creative',
    'Design': 'Creative',
    'Fashion': 'Consumer',
    'Film & Video': 'Entertainment',
    'Games': 'Entertainment',
    'Music': 'Entertainment',
    'Photography': 'Creative',
    'Publishing': 'Creative',
    'Technology': 'Tech',
    'Food': 'Consumer',
    'Journalism': 'Other',
    'Theater': 'Entertainment'
}

df2_clean.loc[:, 'main_category_grouped'] = df2_clean['main_category'].map(category_map).fillna('Other')

In [None]:
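# Redo the datetime conversion, as it was lost when exporting to CSV
# Plain column assignment is used here because `.loc[:, col] = ...` can keep the old object dtype in recent pandas versions
df2_clean["launched"] = pd.to_datetime(df2_clean["launched"], errors="coerce")
df2_clean["deadline"] = pd.to_datetime(df2_clean["deadline"], errors="coerce")

# Confirm that both columns now have a datetime dtype
df2_clean[["launched", "deadline"]].dtypes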
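
As a side note on why the datetime type is lost: CSV stores dates as plain text. A possible alternative to converting after loading is the `parse_dates` argument of `pd.read_csv`. The sketch below uses a tiny stand-in frame with made-up values (`sample`, `plain`, and `parsed` are hypothetical names), not the real files.

```python
import io
import pandas as pd

# Tiny stand-in frame with made-up values, not the real dataset
sample = pd.DataFrame({
    'launched': pd.to_datetime(['2015-08-11 12:12:28', '2017-09-02 04:43:57']),
    'deadline': pd.to_datetime(['2015-10-09', '2017-11-01']),
})

# Writing to CSV turns the timestamps into plain text
csv_text = sample.to_csv(index=False)

# Reading back without parse_dates leaves the columns as text
plain = pd.read_csv(io.StringIO(csv_text))

# Reading back with parse_dates restores the datetime dtype
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['launched', 'deadline'])

print(plain.dtypes)
print(parsed.dtypes)
```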

##### Countries by continent

We aggregate countries by continent.

This is not the only way to do this, as countries within a continent can have vastly different properties.

However, we chose this granularity because the similarities within each continent seem sufficient.

Other bins, such as "top money countries" or "top number of projects countries", could easily change as new data comes in, so they don't seem ideal.

In [None]:
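# Repair the malformed country code 'N,0"'; treating it as 'NO' is an assumption made in this notebook
df2_clean.loc[:, 'country'] = df2_clean['country'].replace('N,0"', 'NO')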

#grouping countries by continents
continent_map = {
    'US': 'North America', 'CA': 'North America', 'MX': 'North America',
    'GB': 'Europe', 'DE': 'Europe', 'FR': 'Europe', 'IT': 'Europe',
    'ES': 'Europe', 'NL': 'Europe', 'IE': 'Europe', 'SE': 'Europe',
    'CH': 'Europe', 'AT': 'Europe', 'DK': 'Europe', 'BE': 'Europe', 'LU': 'Europe', 'NO': 'Europe',
    'AU': 'Oceania', 'NZ': 'Oceania', 
    'JP': 'Asia', 'SG': 'Asia', 'HK': 'Asia',
}

df2_clean.loc[:, 'continent'] = df2_clean['country'].map(continent_map).fillna('Other')
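
Because `.fillna('Other')` silently absorbs every code that is missing from the mapping, it can be useful to inspect those codes before filling. This is a minimal sketch with hypothetical data; `countries` and `continent_map_demo` are illustrative names, not the notebook's real objects.

```python
import pandas as pd

# Hypothetical country codes; 'BR' is deliberately absent from the mapping
countries = pd.Series(['US', 'DE', 'BR', 'JP', 'US', 'BR'])
continent_map_demo = {'US': 'North America', 'DE': 'Europe', 'JP': 'Asia'}

# Codes missing from the dict become NaN after .map()
mapped = countries.map(continent_map_demo)

# Count the codes that would end up in 'Other'
unmapped_counts = countries[mapped.isna()].value_counts()
print(unmapped_counts)
```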

##### Time categories 
We add further time-based categories, as we hypothesise that seasonality could be relevant for success.

Thus, we add both month and year to our feature set.

In [None]:
# Add deadline year and month as separate integer columns
df2_clean.loc[:, 'deadline_year'] = df2_clean['deadline'].apply(lambda x: x.year)
df2_clean.loc[:, 'deadline_month'] = df2_clean['deadline'].apply(lambda x: x.month)

In [None]:
# Same for launched: add year and month as separate integer columns
df2_clean.loc[:, 'launched_year'] = df2_clean['launched'].apply(lambda x: x.year)
df2_clean.loc[:, 'launched_month'] = df2_clean['launched'].apply(lambda x: x.month)

Continuing with time, let's make sense of "Duration".

Issue: This data is highly irregular. It is neither skewed nor normally distributed; it has an enormous spike at the one-month mark and a smaller one at the two-month mark.

In [None]:
#check whether adequate numerical bins exist
df2_clean['duration_days'].describe()

As no sensible bins emerge from this, we cut arbitrary bins to get workable categories.

In [None]:
# Define the bins as two-week slots, avoiding the spike at 29 days
bins = [15, 29, 45, 60, 75]
# Labels for the four resulting intervals
labels = ['2 weeks', '4 weeks', '6 weeks', '8 weeks']
# Add the bins to the dataframe. pd.cut uses right-closed intervals by default, e.g. (15, 29],
# so durations of 15 days or less, or of more than 75 days, become NaN
df2_clean.loc[:, 'duration_bins'] = pd.cut(df2_clean['duration_days'], bins=bins, labels=labels)
# Check the number of projects per bin
df2_clean['duration_bins'].value_counts().sort_index()
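
To make the edge behaviour of these bins concrete, here is a small sketch on hand-picked durations. The values in `durations` are made up for illustration; `demo_bins` and `demo_labels` simply repeat the edges and labels used above.

```python
import pandas as pd

# Hand-picked durations that sit on and around the bin edges (illustrative values)
durations = pd.Series([10, 15, 16, 29, 30, 45, 60, 75, 90])

demo_bins = [15, 29, 45, 60, 75]
demo_labels = ['2 weeks', '4 weeks', '6 weeks', '8 weeks']

# Default pd.cut behaviour: intervals are open on the left and closed on the right
binned = pd.cut(durations, bins=demo_bins, labels=demo_labels)

print(pd.DataFrame({'duration_days': durations, 'duration_bins': binned}))

# Values outside (15, 75] are not assigned to any bin
n_unbinned = binned.isna().sum()
print(f"Unbinned durations: {n_unbinned}")
```

If durations of exactly 15 days should be kept, `include_lowest=True` is one option; whether that is wanted here is a modelling decision.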

In [None]:
#re-check the columns again 
df2_clean.columns

##### Backers
We might not be able to use it (the number of backers is not known at launch, so it would leak future information), but we still experimented with some potential measures based on backers.

In [None]:
# backers per pledged USD, then binned into quintiles
df2_clean.loc[:, 'backers_per_pledged'] = df2_clean['backers'] / df2_clean['usd_pledged_real']
df2_clean.loc[:, 'backer_pledged_bins'] = pd.qcut(df2_clean['backers_per_pledged'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

In [None]:
#check out the df again
df2_clean.head()

In [None]:
# mean pledged amount and mean goal per main category
df2_clean.loc[:, 'pledged_per_category'] = df2_clean.groupby('main_category')['usd_pledged_real'].transform('mean')
df2_clean.loc[:, 'goal_per_category'] = df2_clean.groupby('main_category')['usd_goal_real'].transform('mean')

# goal quintiles computed within each grouped category
df2_clean.loc[:, 'category_goal_percentile'] = df2_clean.groupby('main_category_grouped')['usd_goal_real'].transform(lambda x: pd.qcut(x, q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']))
df2_clean.head()

In [None]:
# rows where the category goal percentile differs from the overall goal bin
df2_test = df2_clean.query('category_goal_percentile != usd_goal_bins')
df2_test.shape

In [None]:
def convert_season(month: Optional[int]) -> Optional[str]:
    """Convert a month number (1-12) to a Northern Hemisphere meteorological season; return None otherwise."""
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    elif month in [9, 10, 11]:
        return 'Fall'
    else:
        return None


df2_clean.loc[:, 'launch_season'] = df2_clean['launched_month'].apply(convert_season)
df2_clean.loc[:, 'deadline_season'] = df2_clean['deadline_month'].apply(convert_season)
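
As a possible alternative to `.apply(convert_season)`, the same month-to-season assignment can be expressed as a dictionary lookup with `Series.map`. This is a sketch on made-up month values; `season_by_month` is a hypothetical name. Note one difference: months that are not in the dictionary come back as NaN rather than None.

```python
import pandas as pd

# Same month-to-season assignment as convert_season, written as a lookup table
season_by_month = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}

# Made-up month values; 13 is invalid on purpose
months = pd.Series([1, 4, 7, 10, 12, 13])

# Keys missing from the dict (here 13) are mapped to NaN
seasons = months.map(season_by_month)
print(seasons)
```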

In [None]:
df2_clean.columns

In [None]:
df2_clean.head()

### Finalising 

Save the resulting dataset as a new CSV file.

In [None]:
# Paths
BASE_DIR = Path.cwd().resolve().parents[1]
FEATURED_DATA_PATH = BASE_DIR / "data" / "feature"

# Create the output directory if it does not exist
FEATURED_DATA_PATH.mkdir(parents=True, exist_ok=True)

# Save main dataset
main_path = FEATURED_DATA_PATH / 'kickstarter_featured.csv'
df2_clean.to_csv(main_path, index=False)
print(f"Saved: {main_path}")

# # Save dataset with cancelled
# cancelled_path = FEATURED_DATA_PATH / 'kickstarter_featured_with_cancelled.csv'
# df_with_cancelled.to_csv(cancelled_path, index=False)
# print(f"\n Saved: {cancelled_path}")