## Data Encoding, Data Normalization and Data Scaling

The purpose of this practice class is to get familiar with data preprocessing techniques such as categorical feature encoding, data normalization and data scaling.

The data were collected from the Kickstarter platform (https://www.kickstarter.com/).

In [1]:
# Setting the environment

# basic modules
import numpy as np
import pandas as pd

# label encoding
from sklearn import preprocessing

# Box-Cox Transformation
from scipy import stats

# min_max scaling
from mlxtend.preprocessing import minmax_scaling

# visualization
import seaborn as sns
import matplotlib.pyplot as plt

----

In [2]:
# Read your data

data = pd.read_csv("ks-projects-2018.csv", parse_dates=['deadline', 'launched'])

In [4]:
# From what you learned before - take a first look at your data

data.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1415612904,LEGACY OF COOL,Fashion,Fashion,USD,2012-12-12,210000.0,2012-11-12 17:48:40,76628.0,failed,117,US,76628.0,76628.0,210000.0
1,1857414265,UBU - University By U,Web,Technology,GBP,2015-01-22,9000.0,2015-01-02 00:51:02,1.0,failed,1,GB,1.56,1.5,13486.18
2,1450979503,Small Town Restaurants,Nonfiction,Publishing,USD,2013-05-27,30000.0,2013-04-27 16:28:19,1505.0,failed,8,US,1505.0,1505.0,30000.0
3,1125626743,This Is Why We Do (Canceled),Hip-Hop,Music,USD,2016-02-20,3000.0,2016-01-21 06:54:15,1.0,canceled,1,US,1.0,1.0,3000.0
4,1933562448,Be Part of Alex Berger's Debut Record!,Music,Music,USD,2010-01-03,3172.91,2009-12-09 20:49:16,5493.0,successful,81,US,5493.0,5493.0,3172.91


In [5]:
# Let's see how big our dataset is

data.shape

(170397, 15)

In [6]:
# Let's have some useful information about our dataset

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170397 entries, 0 to 170396
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   ID                170397 non-null  int64         
 1   name              170395 non-null  object        
 2   category          170397 non-null  object        
 3   main_category     170397 non-null  object        
 4   currency          170397 non-null  object        
 5   deadline          170397 non-null  datetime64[ns]
 6   goal              170397 non-null  float64       
 7   launched          170397 non-null  datetime64[ns]
 8   pledged           170397 non-null  float64       
 9   state             170397 non-null  object        
 10  backers           170397 non-null  int64         
 11  country           170397 non-null  object        
 12  usd pledged       168659 non-null  float64       
 13  usd_pledged_real  170397 non-null  float64       
 14  usd_

In [7]:
# Handle the missing values in "usd pledged" column

# your code goes here
missing_values = data['usd pledged'].isnull().sum()
print(missing_values)


1738


In [8]:
# Drop the rows where "usd pledged" is missing (the ID column is kept)

data.dropna(subset=['usd pledged'], inplace=True)

-----

### Categorical features encoding

Apply label encoding from sklearn to the "category", "main_category" and "country" columns.

Save the encoded columns as new columns of the dataset, named "cat_enc", "maincat_enc" and "country_enc" respectively.

Follow this quick guide to use label encoder from sklearn package
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [None]:
# quick example

# creating a model instance
label_enc = preprocessing.LabelEncoder()

# random data
data_tmp = ['apple', 'banana', 'mango', 'orange', 'lemon']

# fitting (training) the model
label_enc.fit(data_tmp)

# outputting our encoded data
label_enc.transform(data_tmp)

# here you see that each category value was mapped (encoded) into a numerical value

array([0, 1, 3, 4, 2])

![image.png](attachment:image.png)

In [10]:
# Label encoding of "category" column

# your code goes here
label = preprocessing.LabelEncoder()
data['category_encoded'] = label.fit_transform(data['category'])



In [12]:
# Label encoding of "main_category" column

# your code goes here
label = preprocessing.LabelEncoder()
data['main_category_encoded'] = label.fit_transform(data['main_category'])



In [13]:
# Label encoding of "country" column

# your code goes here
label = preprocessing.LabelEncoder()
data['country_encoded'] = label.fit_transform(data['country'])
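
Because the same `label` variable is re-created for every column, the fitted mapping of the earlier columns is lost. A minimal sketch on a toy frame (the `toy` data and the `encoders` dict are hypothetical, not part of the exercise) that keeps one encoder per column, so the codes can be mapped back to the original labels:

```python
import pandas as pd
from sklearn import preprocessing

# Toy data standing in for the Kickstarter columns (hypothetical values)
toy = pd.DataFrame({
    "main_category": ["Music", "Fashion", "Music", "Technology"],
    "country": ["US", "GB", "US", "DE"],
})

# Keep one fitted encoder per column so the mapping is not lost
encoders = {}
for col in ["main_category", "country"]:
    enc = preprocessing.LabelEncoder()
    toy[col + "_encoded"] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the original labels; position i is the label encoded as i
mapping = dict(enumerate(encoders["country"].classes_))
print(mapping)

# inverse_transform recovers the original labels from the integer codes
restored = encoders["country"].inverse_transform(toy["country_encoded"])
print(list(restored))
```

Keeping the fitted encoders around also makes it possible to apply the same mapping to new data with `transform` instead of refitting.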



Apply OneHot encoding to the "state" and "main_category" columns.

OneHot encoding might be a bit challenging, but if you follow this guideline, everything should work:
<br/>
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

Add your OneHot Encoded columns to your initial dataset

In [None]:
# quick example

# creating a model instance
onehot_enc = preprocessing.OneHotEncoder()

# random data
data_tmp = [['apple'],
            [ 'kiwi'], 
            ['apple'],
            ['lemon'],
            ['orange']]

# fitting (training) the model
onehot_enc.fit(data_tmp)

# outputting names of columns and our encoded data 
print(onehot_enc.get_feature_names_out())
print(onehot_enc.transform(data_tmp).toarray())

# here you see that for each categorical value a new column was created and named
# as "x0_apple", ..., so instead of having one column with 4 values now we have 
# 4 columns with 2 possible values - 1 or 0

['x0_apple' 'x0_kiwi' 'x0_lemon' 'x0_orange']
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]


![image-2.png](attachment:image-2.png)

In [18]:
# OneHot encoding of "state" column

# your code goes here
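onehot = preprocessing.OneHotEncoder()
state_onehot = onehot.fit_transform(data[['state']])
state_label = onehot.get_feature_names_out(['state'])
# reuse data.index: rows were dropped earlier, so the default 0..n-1 index would not line up
state_data = pd.DataFrame(state_onehot.toarray(), columns=state_label, index=data.index)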




In [19]:
# OneHot encoding of "main_category" column

# your code goes here
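onehot = preprocessing.OneHotEncoder()
maincat_onehot = onehot.fit_transform(data[['main_category']])
maincat_label = onehot.get_feature_names_out(['main_category'])
# separate names so the "state" results are not overwritten; reuse data.index for alignment
maincat_data = pd.DataFrame(maincat_onehot.toarray(), columns=maincat_label, index=data.index)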




What are the pros and cons for applying Label encoding and OneHot encoding for "main_category" column?
Write your answer below.

Label Encoding:

Pros:

- Simple and easy to implement
- Keeps one column per feature, so it does not increase the dimensionality of the dataset
- Useful when the categories have an inherent order or ranking, provided the assigned codes follow that order

Cons:

- Imposes an arbitrary numeric order on the categories (alphabetical for LabelEncoder), which a model may treat as meaningful
- Problematic for "main_category", whose values have no inherent order or ranking

OneHot Encoding:

Pros:

- Represents every category without implying any order between them
- Useful when the categories do not have an inherent order or ranking
- Works well when the categorical variable has a small number of categories, as "main_category" does

Cons:

- Increases the dimensionality of the dataset (one new column per category)
- Can lead to computational and memory issues when the categorical variable has a large number of categories
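
The ordering issue can be seen on a toy example. A small sketch, assuming three made-up category values: integer codes put the categories on a number line, so some pairs look closer than others, while one-hot rows are all equally far apart.

```python
import numpy as np
import pandas as pd

cats = pd.Series(["Art", "Music", "Technology"])

# Label-style integer codes, assigned in alphabetical order
codes = cats.astype("category").cat.codes.to_numpy()

# One-hot rows, one column per category
onehot = pd.get_dummies(cats).to_numpy(dtype=float)

# Distance between "Art" and each of the other two categories under each encoding
label_dist = [abs(int(codes[0]) - int(codes[i])) for i in (1, 2)]
onehot_dist = [float(np.linalg.norm(onehot[0] - onehot[i])) for i in (1, 2)]

print(label_dist)   # "Art" looks twice as far from "Technology" as from "Music"
print(onehot_dist)  # both distances are the same
```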

#### Hash encoding

Now that you are more familiar with these two encoding methods, you will apply hash encoding to the "country" column.

Follow this link and install category_encoders package <br/>https://contrib.scikit-learn.org/category_encoders/index.html

In [21]:
# Once the package is installed, run this code
import category_encoders as cat_e

Follow this link https://contrib.scikit-learn.org/category_encoders/hashing.html#
to use HashingEncoder on your data
<br/>
Or check out this quick tutorial https://towardsdatascience.com/4-categorical-encoding-concepts-to-know-for-data-scientists-e144851c6383
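
To get a feel for what hash encoding does before using the package, here is a minimal standard-library sketch of the idea. It is an illustration under assumptions (md5 as the hash function, hypothetical helpers `hash_bucket` and `hash_encode`), not the actual `category_encoders` implementation: each category string is hashed and mapped to one of `n_components` buckets, so the output width stays fixed no matter how many distinct countries there are, at the cost of possible collisions.

```python
import hashlib

def hash_bucket(value, n_components=5):
    """Map a category string to a bucket index in [0, n_components)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components=5):
    """Return one row per value with a single 1 in the hashed bucket."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows

countries = ["US", "GB", "US", "DE", "CA"]
encoded = hash_encode(countries)
for country, row in zip(countries, encoded):
    print(country, row)
```

The same country always lands in the same bucket, but two different countries may share a bucket, which is why the encoding cannot be reversed.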

In [25]:
# HashEncoding of your "country" column

# your code goes here
encoder = cat_e.HashingEncoder(cols=['country'], n_components=5)
hash_res = encoder.fit_transform(data[['country']])

# put the hashed columns next to the original data and show a few rows
pd.concat([hash_res, data], axis=1).sample(5)


Unnamed: 0,col_0,col_1,col_2,col_3,col_4,ID,name,category,main_category,currency,...,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real,category_encoded,main_category_encoded,country_encoded
36244,0,0,0,1,0,1807232986,World of Aerithn,Tabletop Games,Games,CAD,...,240.0,failed,5,CA,223.59,220.18,917.43,136,8,3
77803,0,0,1,0,0,277308533,Restore Ownership of Little River Band,Rock,Music,USD,...,20.0,failed,1,US,20.0,20.0,1000000.0,125,10,21
60629,0,0,1,0,0,575038066,Although I Deserve To Die Movie,Film & Video,Film & Video,USD,...,1.0,failed,1,US,1.0,1.0,2275000.0,55,6,21
142490,0,0,0,1,0,1612421814,Das Hemd das nicht aus der Hose rutscht,Apparel,Fashion,EUR,...,0.0,failed,0,DE,0.0,0.0,110059.43,7,5,5
124248,0,0,1,0,0,678525505,Guest Wedding Photo Apps,Apps,Technology,USD,...,0.0,failed,0,US,0.0,0.0,5000.0,8,13,21


----

### Data Normalization and Data Scaling

We will use the "usd_goal_real" column to apply min-max scaling.
<br/>
First, sample 15-20% of your initial dataset, since applying the
transformation to the entire dataset may take longer.

In [29]:
# Sample 20% of your dataset

# your code goes here
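data_sampled = data.sample(frac=0.2)

# min-max scale "usd_goal_real" to [0, 1] and store the result as a new column
goal_scaled = minmax_scaling(data_sampled[['usd_goal_real']], columns=['usd_goal_real'])
data_sampled['usd_goal_real_scaled'] = goal_scaled['usd_goal_real']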
print(data_sampled.head())


               ID                                               name  \
157537  550409822                               Art of Man - Upgrade   
143860  653763938                                     Going for Kona   
81483   614983065                             Terry The Traumasaurus   
122532   50119110  VoiceLots- The Social Network for Social Justi...   
63782   198786764                      Customize ME! The Documentary   

                category main_category currency   deadline     goal  \
157537         Art Books    Publishing      USD 2012-01-19   5000.0   
143860           Fiction    Publishing      USD 2014-07-10    750.0   
81483   Children's Books    Publishing      CAD 2015-03-25  40000.0   
122532              Apps    Technology      USD 2015-04-29  15000.0   
63782        Documentary  Film & Video      USD 2016-05-11   5000.0   

                  launched  pledged       state  backers country  usd pledged  \
157537 2011-12-20 23:51:09     80.0      failed        2   

Min-Max data scaling

In [30]:
scaled_data = minmax_scaling(np.array(data_sampled.usd_goal_real), columns=[0])
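
For reference, min-max scaling applies the formula x' = (x - min) / (max - min). A minimal NumPy sketch, assuming a small made-up array of goal values (the helper name `min_max` is hypothetical); for a single column it is expected to give the same result as `minmax_scaling` with its default range:

```python
import numpy as np

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values identical: return zeros instead of dividing by zero
        return np.zeros_like(values)
    return (values - values.min()) / span

goals = np.array([1000.0, 5000.0, 30000.0, 210000.0])
scaled = min_max(goals)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and the relative spacing of the values is preserved.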

In [None]:
# plot both together to compare
# if plotting takes too much time try to decrease the sample size to 5%
fig, ax = plt.subplots(1, 2, figsize=(15, 3))

sns.histplot(np.array(data_sampled.usd_goal_real), ax=ax[0], kde=True, legend=False)
ax[0].set_title("Original Data")

sns.histplot(scaled_data, ax=ax[1], kde=True, legend=False)
ax[1].set_title("Scaled data")

plt.show()

We will use the "pledged" column to apply Box-Cox normalization. Box-Cox requires strictly positive values, so only rows with pledged > 0 are used.

Box-Cox Normalization

In [None]:
pledges = np.array(data_sampled.query("pledged > 0").pledged)
normalized_data = stats.boxcox(pledges)[0]

In [None]:
# plot both together to compare

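fig, ax = plt.subplots(1, 2, figsize=(15, 3))

# histplot replaces the deprecated distplot and matches the min-max plots above
sns.histplot(pledges, ax=ax[0], kde=True, legend=False)
ax[0].set_title("Original Data")
sns.histplot(normalized_data, ax=ax[1], kde=True, legend=False)
ax[1].set_title("Normalized data")
plt.show()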

What are the differences in scaling with Min-max and normalizing with Box-Cox?

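Min-max scaling linearly rescales a numerical feature to a fixed range (by default [0, 1]); it changes the range of the values but not the shape of their distribution. Box-Cox normalization is a power transformation that changes the shape of the distribution, bringing skewed, strictly positive data closer to a normal distribution. Min-max scaling is useful when features need to share a common range, while Box-Cox normalization is useful when the data is strongly skewed and the method you apply works better with roughly normal data.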
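
The difference can be checked numerically. A minimal sketch, assuming synthetic log-normal data as a stand-in for a skewed column such as the pledges: skewness is unchanged by min-max scaling and drops close to zero after Box-Cox.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Right-skewed, strictly positive toy data (hypothetical stand-in for pledges)
x = rng.lognormal(mean=3.0, sigma=1.0, size=5000)

# Min-max: linear rescaling to [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Box-Cox: power transform with lambda estimated by maximum likelihood
x_boxcox, fitted_lambda = stats.boxcox(x)

print("skew original:", stats.skew(x))
print("skew min-max :", stats.skew(x_minmax))
print("skew Box-Cox :", stats.skew(x_boxcox))
```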

BONUS POINTS

Great, now go to the lecture notes slide 11
1. Implement Z-score normalization (the first formula), name your function ZScore_norm()
2. Implement Logistic normalization (the third formula), name your function Log_norm()

DO NOT USE READY TO USE PACKAGES

In [None]:
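# Z-score normalization written by hand (no ready-made packages)
# Assumption: the first formula on the slide is the standard z-score, (value - mean) / std

def ZScore_norm(value_, mean_, std_):
    return (value_ - mean_) / std_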

In [None]:
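# Logistic normalization written by hand (only math.exp from the standard library)
# Assumption: the third formula on the slide is the standard logistic function 1 / (1 + e^(-x))
import math

def Log_norm(value_):
    return 1 / (1 + math.exp(-value_))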

For your further work you can use the implementation from sklearn
<br/>
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
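
A minimal sketch of `StandardScaler` on made-up values, compared against the z-score computed by hand. Note that `StandardScaler` divides by the population standard deviation (ddof=0), which is also NumPy's default for `std`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature, four samples (hypothetical values)
values = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# Same result by hand: subtract the mean, divide by the standard deviation
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(manual.ravel())
```

After scaling, the column has mean 0 and standard deviation 1.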