## Feature Engineering Exploration
This notebook consolidates the first round of transformations that turn our cleaned columns into more informative features.

In [1]:
#initial imports 
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Optional
import seaborn as sns
import matplotlib.pyplot as plt

import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

With data cleaning done, we can now load the cleaned files.

In [2]:
# Load Files as DataFrames
BASE_DIR = Path.cwd().resolve().parents[1]
data_file_1 = BASE_DIR / "data" / "cleaned" / "kickstarter_cleaned_with_cancelled.csv"
data_file_2 = BASE_DIR / "data" / "cleaned" / "kickstarter_cleaned.csv"

filepath_1 = Path(data_file_1)
filepath_2 = Path(data_file_2)

df1 = pd.read_csv(filepath_1, encoding='latin-1', low_memory=False)
df2 = pd.read_csv(filepath_2, low_memory=False)

logger.info(f"Loaded {len(df1)} rows and {len(df1.columns)} columns")
logger.info(f"Loaded {len(df2)} rows and {len(df2.columns)} columns")

INFO:__main__:Loaded 370454 rows and 10 columns
INFO:__main__:Loaded 331675 rows and 10 columns


### Checking the numerical money columns for potential outlier cleaning

In [3]:
# keep only projects where both pledged and goal are above 0;
# a 0 in either column counts as "did not happen"
df2_clean = df2[(df2['usd_pledged_real'] > 0) & (df2['usd_goal_real'] > 0)]
# re-log the sizes of the unfiltered frames for comparison
logger.info(f"Loaded {len(df1)} rows and {len(df1.columns)} columns")
logger.info(f"Loaded {len(df2)} rows and {len(df2.columns)} columns")
# check the distribution of the goal after filtering
df2_clean['usd_goal_real'].describe().round(2)


INFO:__main__:Loaded 370454 rows and 10 columns
INFO:__main__:Loaded 331675 rows and 10 columns


count    2.930190e+05
mean     3.507900e+04
std      9.333190e+05
min      1.000000e-02
25%      2.000000e+03
50%      5.000000e+03
75%      1.500000e+04
max      1.101698e+08
Name: usd_goal_real, dtype: float64

We bin the data by quantiles. Both distributions are heavily right-skewed (the curves look like power laws), so we prioritise having a comparable number of projects in each bin over a uniform step size between bins.

Quintiles (`q=5` in `pd.qcut`) ensure that.

We also keep the number of bins small to simplify feature engineering later.
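
To illustrate the trade-off described above, here is a minimal sketch on made-up goal amounts (not the project data). It compares equal-width bins from `pd.cut` with equal-count bins from `pd.qcut`; the toy values are assumptions chosen only to mimic a long right tail.

```python
import pandas as pd

# Toy, made-up goal amounts with a long right tail (not the project data)
goals = pd.Series([100, 500, 1_000, 2_000, 3_000, 5_000, 8_000, 15_000, 50_000, 1_000_000])

labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']

# Equal-width bins: almost everything lands in the first bin
equal_width = pd.cut(goals, bins=5, labels=labels)

# Quantile bins: each bin receives the same number of rows
equal_count = pd.qcut(goals, q=5, labels=labels)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With data this skewed, equal-width bins are nearly useless as categories, which is why quantile bins look like the more practical choice here.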

In [4]:
#do the binning for both goals and pledged 
df2_clean['usd_goal_bins'] = pd.qcut(df2_clean['usd_goal_real'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
df2_clean['usd_pledged_bins'] = pd.qcut(df2_clean['usd_pledged_real'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2_clean['usd_goal_bins'] = pd.qcut(df2_clean['usd_goal_real'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2_clean['usd_pledged_bins'] = pd.qcut(df2_clean['usd_pledged_real'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
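
The message above is pandas' `SettingWithCopyWarning`: `df2_clean` was created by filtering `df2`, so pandas cannot tell whether a new column is meant for the filtered result or the original frame. A common way to avoid it is to take an explicit `.copy()` when filtering. The sketch below uses a made-up frame and a hypothetical `goal_met` column purely for illustration.

```python
import pandas as pd

# Toy frame standing in for df2 (values are made up)
df = pd.DataFrame({'usd_pledged_real': [0.0, 10.0, 250.0],
                   'usd_goal_real': [100.0, 0.0, 500.0]})

# .copy() makes the filtered result an independent DataFrame,
# so later column assignments are unambiguous
df_clean = df[(df['usd_pledged_real'] > 0) & (df['usd_goal_real'] > 0)].copy()

# 'goal_met' is a hypothetical column used only for this example
df_clean['goal_met'] = df_clean['usd_pledged_real'] >= df_clean['usd_goal_real']

print(df_clean)
print(df.columns.tolist())  # the original frame is left untouched
```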


In [5]:
#check the columns after doing this 
df2_clean.columns

Index(['id', 'main_category', 'deadline', 'launched', 'backers', 'country',
       'usd_pledged_real', 'usd_goal_real', 'duration_days', 'target',
       'usd_goal_bins', 'usd_pledged_bins'],
      dtype='object')

In [6]:
#check the data 
df2_clean.head(10)

Unnamed: 0,id,main_category,deadline,launched,backers,country,usd_pledged_real,usd_goal_real,duration_days,target,usd_goal_bins,usd_pledged_bins
1,1000003930,Film & Video,2017-11-01,2017-09-02 04:43:57,15,US,2421.0,30000.0,59,0,Very High,High
2,1000004038,Film & Video,2013-02-26,2013-01-12 00:20:50,3,US,220.0,45000.0,44,0,Very High,Low
3,1000007540,Music,2012-04-16,2012-03-17 03:24:11,1,US,1.0,5000.0,29,0,Medium,Very Low
4,1000014025,Food,2016-04-01,2016-02-26 13:38:27,224,US,52375.0,50000.0,34,1,Very High,Very High
5,1000023410,Food,2014-12-21,2014-12-01 18:30:44,16,US,1205.0,1000.0,19,1,Very Low,Medium
6,1000030581,Food,2016-03-17,2016-02-01 20:05:12,40,US,453.0,25000.0,44,0,Very High,Low
8,100005484,Music,2013-04-08,2013-03-09 06:42:58,100,US,12700.0,12500.0,29,1,High,Very High
11,1000057089,Games,2017-05-03,2017-04-05 19:44:18,761,GB,121857.33,6469.73,27,1,Medium,Very High
12,1000064368,Design,2015-02-28,2015-01-29 02:10:53,11,US,664.0,2500.0,29,0,Low,Medium
13,1000064918,Comics,2014-11-08,2014-10-09 22:27:52,16,US,395.0,1500.0,29,0,Very Low,Low


### Creating further bins 

##### Categories of Categories
We want to aggregate the main categories into fewer bins. After checking the distribution curves, we manually assigned each category to a broader group by asking "is this the same type of thing?", ending up with a consolidated set.
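
Because the assignment is manual, it can help to check which categories a mapping does not cover before sending them to a catch-all group. The following is a small sketch with a shortened, hypothetical mapping (`toy_map`) and made-up category values.

```python
import pandas as pd

# Hypothetical, shortened mapping for illustration only
toy_map = {'Art': 'Creative', 'Music': 'Entertainment', 'Technology': 'Tech'}

categories = pd.Series(['Art', 'Music', 'Technology', 'Journalism', 'Art'])

# Categories missing from the mapping come back as NaN after .map()
mapped = categories.map(toy_map)
unmapped = sorted(categories[mapped.isna()].unique())
print(unmapped)

# Only after inspecting the gaps, send them to a catch-all group
grouped = mapped.fillna('Other')
print(grouped.value_counts())
```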

In [7]:
#group main categories into broader category groups
category_map = {
    'Art': 'Creative',
    'Comics': 'Creative',
    'Crafts': 'Creative',
    'Dance': 'Creative',
    'Design': 'Creative',
    'Fashion': 'Consumer',
    'Film & Video': 'Entertainment',
    'Games': 'Entertainment',
    'Music': 'Entertainment',
    'Photography': 'Creative',
    'Publishing': 'Creative',
    'Technology': 'Tech',
    'Food': 'Consumer',
    'Journalism': 'Other',
    'Theater': 'Entertainment'
}

df2_clean['main_category_grouped'] = df2_clean['main_category'].map(category_map).fillna('Other')


In [8]:
#re-doing the datetime transformation as it was lost by exporting to csv
df2_clean["launched"] = pd.to_datetime(df2_clean["launched"], errors="coerce")
df2_clean["deadline"] = pd.to_datetime(df2_clean["deadline"], errors="coerce")

type(df2_clean['deadline'].iloc[0])

pandas._libs.tslibs.timestamps.Timestamp
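
As an alternative to converting after loading, the date columns can be parsed while the CSV is read. This is a sketch on two made-up rows in the same layout as the cleaned file; it uses the `parse_dates` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the cleaned file
csv_text = (
    "id,deadline,launched\n"
    "1,2017-11-01,2017-09-02 04:43:57\n"
    "2,2013-02-26,2013-01-12 00:20:50\n"
)

# parse_dates converts the listed columns while reading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['deadline', 'launched'])
print(df.dtypes)
```

Unlike `pd.to_datetime(..., errors="coerce")`, this does not turn unparseable values into `NaT`; a column that cannot be parsed is left as strings, so the explicit conversion remains the safer option for messy data.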

##### Countries by continent

We aggregate countries by continent.

This is not the only possible grouping, as countries within a continent can differ widely.

However, we chose this granularity because countries on the same continent seem similar enough for our purposes.

Alternative bins such as "top countries by money raised" or "top countries by number of projects" could easily shift as new data comes in, so they seem less suitable.

In [9]:
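# repair the malformed country code 'N,0"' by setting it to 'NO' (Norway)
df2_clean['country'] = df2_clean['country'].replace('N,0"', 'NO')

# group countries by continent; anything not listed falls into 'Other'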
continent_map = {
    'US': 'North America', 'CA': 'North America', 'MX': 'North America',
    'GB': 'Europe', 'DE': 'Europe', 'FR': 'Europe', 'IT': 'Europe',
    'ES': 'Europe', 'NL': 'Europe', 'IE': 'Europe', 'SE': 'Europe',
    'CH': 'Europe', 'AT': 'Europe', 'DK': 'Europe', 'BE': 'Europe', 'LU': 'Europe', 'NO': 'Europe',
    'AU': 'Oceania', 'NZ': 'Oceania', 
    'JP': 'Asia', 'SG': 'Asia', 'HK': 'Asia',
}

df2_clean['continent'] = df2_clean['country'].map(continent_map).fillna('Other')


##### Time categories 
We add further time-based categories, as we hypothesise that seasonality could be relevant for success.

Thus, we add both month and year to our feature set.

In [10]:
#add deadline year and month as separate integer columns
df2_clean['deadline_year'] = df2_clean['deadline'].dt.year
df2_clean['deadline_month'] = df2_clean['deadline'].dt.month


In [11]:
#same for launched: create year and month as separate integer columns
df2_clean['launched_year'] = df2_clean['launched'].dt.year
df2_clean['launched_month'] = df2_clean['launched'].dt.month


Continuing with time, let's make sense of "Duration".

Issue: this data is highly irregular. It is neither clearly skewed nor normally distributed; it has an enormous spike at the one-month mark and a smaller one at the two-month mark.
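
One quick way to make such spikes visible is to look at the share of rows per exact duration. The sketch below uses made-up durations (a pile-up at 29 days and a smaller one at 59 days are assumptions for illustration, not the real counts).

```python
import pandas as pd

# Made-up durations with a pile-up at 29 and a smaller one at 59 days
durations = pd.Series([29] * 6 + [59] * 3 + [14, 21, 35, 44])

# Share of rows per exact duration, largest first
top_shares = durations.value_counts(normalize=True).head(3)
print(top_shares)
```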

In [12]:
#check whether the summary statistics suggest adequate numerical bins
df2_clean['duration_days'].describe()

count    293019.000000
mean         32.881956
std          12.526566
min           0.000000
25%          29.000000
50%          29.000000
75%          35.000000
max          91.000000
Name: duration_days, dtype: float64

As the summary suggests no natural cut points, we cut arbitrary bins to get workable categories.

In [13]:
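#define bin edges in roughly two-week steps; intervals are right-closed,
#so the spike at 29 days falls into the first bin (15, 29]
bins = [15, 29, 45, 60, 75]
#label each bin (each label covers the interval ending at the matching edge)
labels = ['2 weeks', '4 weeks', '6 weeks', '8 weeks']
#add the bins to the dataframe; durations of 15 days or fewer and of more
#than 75 days fall outside all bins and become NaN
df2_clean['duration_bins'] = pd.cut(df2_clean['duration_days'], bins=bins, labels=labels)
#check how the binned counts look
df2_clean['duration_bins'].value_counts().sort_index()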

duration_bins
2 weeks    168547
4 weeks     70377
6 weeks     34449
8 weeks      1189
Name: count, dtype: int64
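
The four counts above add up to 274,562, while `duration_days` has 293,019 rows, so 18,457 rows received no bin. The sketch below reproduces that behaviour of `pd.cut` on made-up durations and shows one possible alternative with open-ended outer edges; `wide_bins` and `wide_labels` are hypothetical names and values, not a recommendation taken from the analysis.

```python
import pandas as pd

# Made-up durations, including values outside the (15, 75] range
durations = pd.Series([0, 10, 15, 16, 29, 30, 60, 75, 91])

bins = [15, 29, 45, 60, 75]
labels = ['2 weeks', '4 weeks', '6 weeks', '8 weeks']

# Intervals are right-closed: (15, 29], (29, 45], (45, 60], (60, 75]
binned = pd.cut(durations, bins=bins, labels=labels)
print(binned.tolist())
print(binned.isna().sum())  # rows that fell outside every bin

# One possible alternative: open-ended outer edges so every row is binned
wide_bins = [-1, 29, 45, 60, float('inf')]
wide_labels = ['up to 29 days', '30-45 days', '46-60 days', 'over 60 days']
binned_all = pd.cut(durations, bins=wide_bins, labels=wide_labels)
print(binned_all.isna().sum())
```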

In [14]:
#re-check the columns again 
df2_clean.columns

Index(['id', 'main_category', 'deadline', 'launched', 'backers', 'country',
       'usd_pledged_real', 'usd_goal_real', 'duration_days', 'target',
       'usd_goal_bins', 'usd_pledged_bins', 'main_category_grouped',
       'continent', 'deadline_year', 'deadline_month', 'launched_year',
       'launched_month', 'duration_bins'],
      dtype='object')

##### Backers
We might not be able to use this feature, since the number of backers is only known after a campaign has run and would leak the outcome into the model, but we still explored some backer-based measures.
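
One way to keep such columns out of a later model is to drop them in a single, explicit step. The sketch below is only an illustration: the list `POST_CAMPAIGN_COLUMNS` and the helper `drop_post_campaign` are hypothetical, and which columns count as post-campaign is an assumption that would need to be confirmed for the real dataset.

```python
import pandas as pd

# Assumption: these columns are only known once a campaign has run
POST_CAMPAIGN_COLUMNS = ['backers', 'usd_pledged_real', 'usd_pledged_bins']

def drop_post_campaign(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame without columns that are unknown at launch time."""
    present = [c for c in POST_CAMPAIGN_COLUMNS if c in df.columns]
    return df.drop(columns=present)

# Made-up single-row frame for demonstration
toy = pd.DataFrame({'usd_goal_real': [5000.0], 'backers': [12],
                    'usd_pledged_real': [800.0], 'target': [0]})
features = drop_post_campaign(toy)
print(features.columns.tolist())
```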

In [15]:
# backers per pledged dollar (usd_pledged_real > 0 is guaranteed by the earlier filter)
df2_clean['backers_per_pledged'] = df2_clean['backers'] / df2_clean['usd_pledged_real']
# quintile bins of that ratio
df2_clean['backer_pledged_bins'] = pd.qcut(df2_clean['backers_per_pledged'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])


In [16]:
#check out the df again
df2_clean.head()

Unnamed: 0,id,main_category,deadline,launched,backers,country,usd_pledged_real,usd_goal_real,duration_days,target,...,usd_pledged_bins,main_category_grouped,continent,deadline_year,deadline_month,launched_year,launched_month,duration_bins,backers_per_pledged,backer_pledged_bins
1,1000003930,Film & Video,2017-11-01,2017-09-02 04:43:57,15,US,2421.0,30000.0,59,0,...,High,Entertainment,North America,2017,11,2017,9,6 weeks,0.006196,Very Low
2,1000004038,Film & Video,2013-02-26,2013-01-12 00:20:50,3,US,220.0,45000.0,44,0,...,Low,Entertainment,North America,2013,2,2013,1,4 weeks,0.013636,Low
3,1000007540,Music,2012-04-16,2012-03-17 03:24:11,1,US,1.0,5000.0,29,0,...,Very Low,Entertainment,North America,2012,4,2012,3,2 weeks,1.0,Very High
4,1000014025,Food,2016-04-01,2016-02-26 13:38:27,224,US,52375.0,50000.0,34,1,...,Very High,Consumer,North America,2016,4,2016,2,4 weeks,0.004277,Very Low
5,1000023410,Food,2014-12-21,2014-12-01 18:30:44,16,US,1205.0,1000.0,19,1,...,Medium,Consumer,North America,2014,12,2014,12,2 weeks,0.013278,Low


In [17]:
# mean pledged amount and mean goal per main category
df2_clean['pledged_per_category'] = df2_clean.groupby('main_category')['usd_pledged_real'].transform('mean')
df2_clean['goal_per_category'] = df2_clean.groupby('main_category')['usd_goal_real'].transform('mean')

# goal quintile computed within each grouped category
df2_clean['category_goal_percentile'] = df2_clean.groupby('main_category_grouped')['usd_goal_real'].transform(lambda x: pd.qcut(x, q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']))
df2_clean.head()

Unnamed: 0,id,main_category,deadline,launched,backers,country,usd_pledged_real,usd_goal_real,duration_days,target,...,deadline_year,deadline_month,launched_year,launched_month,duration_bins,backers_per_pledged,backer_pledged_bins,pledged_per_category,goal_per_category,category_goal_percentile
1,1000003930,Film & Video,2017-11-01,2017-09-02 04:43:57,15,US,2421.0,30000.0,59,0,...,2017,11,2017,9,6 weeks,0.006196,Very Low,7676.247109,58616.915835,Very High
2,1000004038,Film & Video,2013-02-26,2013-01-12 00:20:50,3,US,220.0,45000.0,44,0,...,2013,2,2013,1,4 weeks,0.013636,Low,7676.247109,58616.915835,Very High
3,1000007540,Music,2012-04-16,2012-03-17 03:24:11,1,US,1.0,5000.0,29,0,...,2012,4,2012,3,2 weeks,1.0,Very High,4697.431965,11558.623284,Medium
4,1000014025,Food,2016-04-01,2016-02-26 13:38:27,224,US,52375.0,50000.0,34,1,...,2016,4,2016,2,4 weeks,0.004277,Very Low,6505.672844,30502.224195,Very High
5,1000023410,Food,2014-12-21,2014-12-01 18:30:44,16,US,1205.0,1000.0,19,1,...,2014,12,2014,12,2 weeks,0.013278,Low,6505.672844,30502.224195,Very Low


In [18]:
# rows where the within-group goal quintile differs from the global goal bin
df2_test = df2_clean.query('category_goal_percentile != usd_goal_bins')
df2_test.shape

(58521, 24)
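
The two binnings disagree because the cut points are computed on different populations: once over all rows, once inside each category group. Here is a minimal sketch on made-up data with two groups and two bins; the column names `group`, `goal`, `global_bin` and `group_bin` are hypothetical.

```python
import pandas as pd

# Made-up goals: the 'Tech' group asks for far more than 'Creative'
toy = pd.DataFrame({
    'group': ['Creative'] * 4 + ['Tech'] * 4,
    'goal': [100, 200, 300, 400, 10_000, 20_000, 30_000, 40_000],
})

labels = ['Low', 'High']

# Global bins: cut points come from all rows together
toy['global_bin'] = pd.qcut(toy['goal'], q=2, labels=labels).astype(str)

# Per-group bins: cut points are computed inside each group
toy['group_bin'] = (
    toy.groupby('group')['goal']
       .transform(lambda s: pd.qcut(s, q=2, labels=labels).astype(str))
)

print(toy)
print((toy['global_bin'] != toy['group_bin']).sum())
```

In this toy case every "Creative" goal is globally low and every "Tech" goal globally high, while the per-group bins split each group in half, so half of the rows change label.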

In [19]:
def convert_season(month: Optional[int]) -> Optional[str]:
    """Convert month to season."""
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    elif month in [9, 10, 11]:
        return 'Fall'
    else:
        return None


df2_clean['launch_season'] = df2_clean['launched_month'].apply(convert_season)
df2_clean['deadline_season'] = df2_clean['deadline_month'].apply(convert_season)


In [20]:
df2_clean.columns

Index(['id', 'main_category', 'deadline', 'launched', 'backers', 'country',
       'usd_pledged_real', 'usd_goal_real', 'duration_days', 'target',
       'usd_goal_bins', 'usd_pledged_bins', 'main_category_grouped',
       'continent', 'deadline_year', 'deadline_month', 'launched_year',
       'launched_month', 'duration_bins', 'backers_per_pledged',
       'backer_pledged_bins', 'pledged_per_category', 'goal_per_category',
       'category_goal_percentile', 'launch_season', 'deadline_season'],
      dtype='object')

In [21]:
df2_clean.head()

Unnamed: 0,id,main_category,deadline,launched,backers,country,usd_pledged_real,usd_goal_real,duration_days,target,...,launched_year,launched_month,duration_bins,backers_per_pledged,backer_pledged_bins,pledged_per_category,goal_per_category,category_goal_percentile,launch_season,deadline_season
1,1000003930,Film & Video,2017-11-01,2017-09-02 04:43:57,15,US,2421.0,30000.0,59,0,...,2017,9,6 weeks,0.006196,Very Low,7676.247109,58616.915835,Very High,Fall,Fall
2,1000004038,Film & Video,2013-02-26,2013-01-12 00:20:50,3,US,220.0,45000.0,44,0,...,2013,1,4 weeks,0.013636,Low,7676.247109,58616.915835,Very High,Winter,Winter
3,1000007540,Music,2012-04-16,2012-03-17 03:24:11,1,US,1.0,5000.0,29,0,...,2012,3,2 weeks,1.0,Very High,4697.431965,11558.623284,Medium,Spring,Spring
4,1000014025,Food,2016-04-01,2016-02-26 13:38:27,224,US,52375.0,50000.0,34,1,...,2016,2,4 weeks,0.004277,Very Low,6505.672844,30502.224195,Very High,Winter,Spring
5,1000023410,Food,2014-12-21,2014-12-01 18:30:44,16,US,1205.0,1000.0,19,1,...,2014,12,2 weeks,0.013278,Low,6505.672844,30502.224195,Very Low,Winter,Winter
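
The season columns shown above could also be derived with a dictionary lookup instead of a row-wise function. This is a sketch under the same Northern Hemisphere assumption; `SEASON_BY_MONTH` is a hypothetical name, and the months are the launch months of the five rows displayed above.

```python
import pandas as pd

# Month number -> meteorological season (Northern Hemisphere assumption)
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}

months = pd.Series([9, 1, 3, 2, 12])

# .map() looks each month up in the dictionary; unknown values become NaN
seasons = months.map(SEASON_BY_MONTH)
print(seasons.tolist())
```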


### Finalising 

Save the resulting dataset as a new CSV file.

In [22]:
# Output directory for the feature-engineered data
BASE_DIR = Path.cwd().resolve().parents[1]
FEATURED_DATA_PATH = BASE_DIR / "data" / "feature"

# Create output directory if not exists
FEATURED_DATA_PATH.mkdir(parents=True, exist_ok=True)

# Save main dataset
main_path = FEATURED_DATA_PATH / 'kickstarter_featured.csv'
df2_clean.to_csv(main_path, index=False)
print(f" Saved: {main_path}")

# # Save dataset with cancelled
# cancelled_path = FEATURED_DATA_PATH / 'kickstarter_featured_with_cancelled.csv'
# df_with_cancelled.to_csv(cancelled_path, index=False)
# print(f"\n Saved: {cancelled_path}")

 Saved: D:\Programming\ai_ds_bootcamp\ds-ml-project_kickstarters\data\feature\kickstarter_featured.csv
