# Feature Engineering #2

In [None]:
import pandas as pd
import numpy as np

In [None]:
titanic = pd.read_csv('data/train.csv')
titanic.head()

## Data Binning
Data binning is a data pre-processing technique used to reduce the effects of minor observation errors. The original data values which fall into a given small interval, a bin, are replaced by a value representative of that interval.
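
As a minimal sketch of how `pd.cut` assigns values to bins, the example below uses a few hypothetical ages placed on and around the bin edges (the `ages` series is an assumption for illustration, not part of the Titanic data). By default each bin excludes its left edge and includes its right edge, and values outside every bin become `NaN`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([0, 12, 13, 80, 81])

bins = [0, 12, 18, 35, 50, 80]
labels = ['child', 'teenage', 'young adult', 'mid-age adult', 'old']

# Default behaviour: intervals are (0, 12], (12, 18], ..., (50, 80].
# 0 and 81 fall outside every interval and become NaN.
binned = pd.cut(ages, bins=bins, labels=labels)
print(binned)

# include_lowest=True makes the first interval contain its left edge too,
# so an age of exactly 0 is labelled 'child'.
binned_lowest = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(binned_lowest)
```

Checking the minimum and maximum of the column before choosing the edges helps avoid silently producing `NaN` labels.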

In [None]:
titanic.nunique()

In [None]:
titanic.Age.max()

In [None]:
age = titanic['Age']
df = pd.DataFrame(age)

cut_labels = ['child', 'teenage', 'young adult', 'mid-age adult', 'old']
cut_bins = [0, 12, 18, 35, 50, 80]

df['Age binning'] = pd.cut(df['Age'], bins=cut_bins, labels=cut_labels)
df

## One Hot Encoding

One hot encoding transforms categorical features to a format that works better with classification and regression algorithms.

<img src="image/one_hot_encoding.png"  width="400" />

This works well with most machine learning algorithms. Some implementations of tree-based algorithms, such as random forests, can handle categorical values natively, in which case one hot encoding is not necessary. The process of one hot encoding may seem tedious, but most modern machine learning libraries can take care of it.
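
As one example of a library handling the encoding, here is a minimal sketch using scikit-learn's `OneHotEncoder`. The `toy` frame is a hypothetical stand-in for the `Pclass` column, not the real Titanic data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical passenger classes standing in for titanic['Pclass'].
toy = pd.DataFrame({'Pclass': [3, 1, 2, 3]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for display.
encoded = encoder.fit_transform(toy[['Pclass']]).toarray()

# The learned categories give the column order of the encoded matrix.
print(encoder.categories_)
print(encoded)
```

Unlike `pd.get_dummies`, a fitted encoder remembers its categories, so the same column layout can be reapplied to new data with `encoder.transform`.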


In [None]:
titanic.nunique()

In [None]:
cls = titanic['Pclass']
# create dataframe
df = pd.DataFrame(cls)

one_hot = pd.get_dummies(df['Pclass'], prefix='Pclass').astype(int)
df = df.join(one_hot)
df

## Transformer

These transformers map data from various distributions to a distribution that is closer to normal.

1. Log transformer
2. Box-Cox transformer
3. Yeo-Johnson transformer
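
To make the effect of the log transformer concrete, the sketch below applies `np.log10` to hypothetical counts that span several orders of magnitude (the `counts` array is an assumption for illustration). Each tenfold increase becomes a single unit step, which is how the transform compresses a long right tail.

```python
import numpy as np

# Hypothetical counts spanning several orders of magnitude.
counts = np.array([1, 10, 100, 1000, 10000])

# Each factor of 10 in the input becomes a step of 1 in the output.
log_counts = np.log10(counts)
print(log_counts)  # [0. 1. 2. 3. 4.]
```

The logarithm is only finite for strictly positive inputs, so zero counts need to be filtered out or shifted before applying it.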

In [None]:
news = pd.read_csv('data/OnlineNewsPopularity.csv')
news = news[news[' n_tokens_content']>0]
news.head()

In [None]:
news[' n_tokens_content'].describe(), news[' n_tokens_content'].median()

In [None]:
np.log10(0)

In [None]:
!pip install matplotlib

In [None]:
import matplotlib.pyplot as plt

news['log_n_tokens_content'] = np.log10(news[' n_tokens_content'])

fig, (ax1, ax2) = plt.subplots(2,1,figsize=(4, 5))
print(fig)
ax1.set_xlabel('Number of Words/Tokens in Article', fontsize=14)
ax2.set_xlabel('Log of Number of Words/Tokens', fontsize=14)

news[' n_tokens_content'].hist(ax=ax1, bins=20)
news['log_n_tokens_content'].hist(ax=ax2, bins=20)

plt.show()

In [None]:
news['log_n_tokens_content'].median(), news['log_n_tokens_content'].mean(), news['log_n_tokens_content'].mode()

#### Box-Cox transformer


<img src="image/boxcox.png" />
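
The following is a small sketch of the standard one-parameter Box-Cox definition, checked against `scipy.stats.boxcox` with a fixed `lmbda`. It assumes the formula pictured above is that standard definition; the input values are hypothetical.

```python
import numpy as np
from scipy.stats import boxcox

# Hypothetical, strictly positive values (Box-Cox requires x > 0).
x = np.array([1.0, 10.0, 100.0, 1000.0])

# lambda = 0 is defined as the natural logarithm.
bc_zero = boxcox(x, lmbda=0)
print(np.allclose(bc_zero, np.log(x)))

# lambda != 0 uses (x**lambda - 1) / lambda.
lam = 0.5
bc_half = boxcox(x, lmbda=lam)
manual_half = (x**lam - 1) / lam
print(np.allclose(bc_half, manual_half))
```

When `lmbda` is left as `None`, SciPy instead estimates the value that maximizes the log-likelihood and returns it alongside the transformed data.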

In [None]:
from scipy.stats import boxcox

y = news[' n_tokens_content']
y, fitted_lambda = boxcox(y, lmbda=None)

print("lambda :", fitted_lambda)

news['boxcox_n_tokens_content'] = y

# plot
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(10, 3))

ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax2.set_xlabel('Box-cox of Number of Words', fontsize=14)

news[' n_tokens_content'].hist(ax=ax1, bins=20)
news['boxcox_n_tokens_content'].hist(ax=ax2, bins=20)

plt.show()

In [None]:
news['boxcox_n_tokens_content'].mean(), news['boxcox_n_tokens_content'].median(), news['boxcox_n_tokens_content'].mode()

#### Yeo-Johnson transformer
Source: https://www.stat.umn.edu/arc/yjpower.pdf
<img src="image/yj.png" />
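
Unlike Box-Cox, Yeo-Johnson accepts zero and negative inputs. The sketch below checks two easy cases of the standard definition against `scipy.stats.yeojohnson` with a fixed `lmbda`; the input values are hypothetical.

```python
import numpy as np
from scipy.stats import yeojohnson

# Hypothetical values including negatives and zero.
x = np.array([-2.0, -0.5, 0.0, 1.0, 10.0])

# lambda = 1 leaves the data unchanged.
yj_one = yeojohnson(x, lmbda=1)
print(np.allclose(yj_one, x))

# lambda = 0: log(x + 1) for x >= 0, and -((1 - x)**2 - 1) / 2 for x < 0.
yj_zero = yeojohnson(x, lmbda=0)
expected = np.where(x >= 0, np.log1p(np.abs(x)), -((1 - x) ** 2 - 1) / 2)
print(np.allclose(yj_zero, expected))
```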

In [None]:
from scipy.stats import yeojohnson

y = news[' n_tokens_content']
y, lmbda = yeojohnson(y)
news['yeojohnson'] = y

fig, (ax1, ax2) = plt.subplots(1, 2,figsize=(12, 3))
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax2.set_xlabel('yeo-johnson of Number of Words', fontsize=14)
news[' n_tokens_content'].hist(ax=ax1, bins=20)
news['yeojohnson'].hist(ax=ax2, bins=20)
plt.show()

In [None]:
lmbda

In [None]:
news.head()

In [None]:
news['yeojohnson'].mean(), news['yeojohnson'].median(), news['yeojohnson'].mode()

## Scaling & Normalization
Numeric features, such as counts, may increase without bound. Models that are
smooth functions of the input, such as linear regression, logistic regression, or
anything that involves a matrix, are affected by the scale of the input. Tree-based
models, on the other hand, couldn’t care less. If your model is sensitive to the
scale of input features, feature scaling could help.


1. Min-max
2. Standardization
3. L2 normalization

#### Min-max
Min-max scaling squeezes all feature values into the range [0, 1].

<img src="image/min-max.png" />

Illustration of min-max scaling

<img src="image/min-max2.png" width='400'/>
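
A minimal NumPy sketch of min-max scaling on hypothetical values: subtract the minimum, then divide by the range.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 4.0, 6.0, 10.0])

# (x - min) / (max - min) maps the smallest value to 0 and the largest to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)
```

The formula is undefined when all values are equal, since the range is then zero.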


#### Standardization

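It subtracts off the mean of the feature (over all data points) and divides by the
standard deviation (the square root of the variance). The resulting feature has a mean of 0 and a variance of 1, hence it can also be called **variance scaling**.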

<img src="image/stand.png" />

<img src="image/s2.png" width='400'/>
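
A minimal NumPy sketch of standardization on hypothetical values, dividing by the standard deviation so that the result has zero mean and unit variance. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 4.0, 6.0, 8.0])

# x.std() defaults to the population standard deviation (ddof=0).
x_standardized = (x - x.mean()) / x.std()

print(x_standardized)
print(x_standardized.mean(), x_standardized.var())
```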



#### L-2 Normalization
This technique normalizes (divides) the original feature value by what’s known
as the ℓ2 norm, also known as the Euclidean norm.
<img src="image/l2.png" width='400'/>
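
A minimal NumPy sketch of L2 normalization of a single feature column, using hypothetical values: after dividing by the Euclidean norm, the column has norm 1.

```python
import numpy as np

# Hypothetical feature column with two data points.
x = np.array([3.0, 4.0])

# Euclidean norm: square root of the sum of squares (here sqrt(9 + 16) = 5).
l2_norm = np.sqrt(np.sum(x ** 2))
x_l2 = x / l2_norm

print(l2_norm)
print(x_l2)
```

Because the norm grows with the number of data points, the normalized values of a long column can be very small.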




In [None]:
import pandas as pd
import sklearn.preprocessing as preproc

# Look at the original data - the number of words in an article
print('values: ',news[' n_tokens_content'].values)
# Min-max scaling

news['minmax'] = preproc.minmax_scale(news[[' n_tokens_content']])
print("\nmin-max : ",news['minmax'].values)

# Standardization - note that by definition, some outputs will be negative
news['standardized'] = preproc.StandardScaler().fit_transform(news[[' n_tokens_content']])
print('\nstandardized : ',news['standardized'].values)

# L2-normalization
news['l2_normalized'] = preproc.normalize(news[[' n_tokens_content']], axis=0)
print('\nl2 norm : ',news['l2_normalized'].values)



In [None]:
news.head()

In [None]:
news.minmax.max()

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4,1,figsize=(6,12))
fig.tight_layout()

news[' n_tokens_content'].hist(ax=ax1, bins=100)
ax1.tick_params(labelsize=14)
ax1.set_xlabel('Article word count', fontsize=14)
ax1.set_ylabel('Number of articles', fontsize=14)

news['minmax'].hist(ax=ax2, bins=100)
ax2.tick_params(labelsize=14)
ax2.set_xlabel('Min-max scaled word count')
ax2.set_ylabel('Number of articles', fontsize=14)

news['standardized'].hist(ax=ax3, bins=100)
ax3.tick_params(labelsize=14)
ax3.set_xlabel('Standardized word count')
ax3.set_ylabel('Number of articles', fontsize=14)

news['l2_normalized'].hist(ax=ax4, bins=100)
ax4.tick_params(labelsize=14)
ax4.set_xlabel('L2-normalized word count')
ax4.set_ylabel('Number of articles', fontsize=14)

plt.show()

# Feature Engineering With Real-Life Dataset - self study
Dataset : https://www.kaggle.com/datasets/patrickgendotti/udacity-course-catalog

1. Download and Read data using pandas
2. Handle missing values in 'Level', 'Duration', 'Review Count' and 'rating' columns
3. Apply one-hot encoding for 'Level' column
4. Apply data binning for 'Rating' column
5. Apply the Box-Cox transformer to the 'Review Count' column and plot the distribution (before and after the transformation)
