# Feature Engineering 2 - Advanced Techniques

## Normalization and Standardization

`MinMaxScaler` and `StandardScaler` are data scaling techniques used to transform numerical features to a specific scale. 

Here's an overview of each technique along with examples in Python using scikit-learn:


In [2]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np


**MinMaxScaler:**
- `MinMaxScaler` linearly rescales each feature to a fixed range, by default between 0 and 1. The smallest value maps to 0, the largest to 1, and the relative spacing between data points is preserved.


In [3]:
# Example data
data = np.array([1, 2, 3, 4, 5, 7, 8, 9]).reshape(-1, 1)

# Apply MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

scaled_data

array([[0.   ],
       [0.125],
       [0.25 ],
       [0.375],
       [0.5  ],
       [0.75 ],
       [0.875],
       [1.   ]])
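
The output above can be reproduced by hand. The following is a minimal sketch, assuming the usual min-max formula `(x - min) / (max - min)` applied per column; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([1, 2, 3, 4, 5, 7, 8, 9], dtype=float).reshape(-1, 1)

# Min-max formula applied column-wise: (x - min) / (max - min)
col_min = x.min(axis=0)
col_max = x.max(axis=0)
manual_minmax = (x - col_min) / (col_max - col_min)

# Compare the hand-computed values with scikit-learn's result
sklearn_minmax = MinMaxScaler().fit_transform(x)
print(np.allclose(manual_minmax, sklearn_minmax))
print(manual_minmax.ravel())
```

For example, the value 7 maps to `(7 - 1) / (9 - 1) = 0.75`, which matches the sixth row of the output.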



**StandardScaler:**
- StandardScaler standardizes the data to have a mean of 0 and a standard deviation of 1. It centers the data around the mean and adjusts the spread based on the standard deviation.


In [4]:

# Example data
data = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)

# Apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)

[[-1.41421356]
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]
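
As a cross-check, the same numbers can be derived manually. This sketch assumes the z-score formula `(x - mean) / std` with the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)

# z-score: subtract the column mean, divide by the column standard deviation
mean = x.mean(axis=0)
std = x.std(axis=0)  # NumPy's default ddof=0 gives the population std
manual_z = (x - mean) / std

# Compare the hand-computed values with scikit-learn's result
sklearn_z = StandardScaler().fit_transform(x)
print(np.allclose(manual_z, sklearn_z))
print(manual_z.ravel())
```

Here the mean is 3 and the standard deviation is `sqrt(2)`, so the value 1 maps to `(1 - 3) / sqrt(2)`, roughly -1.414.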


![sc](https://media.geeksforgeeks.org/wp-content/uploads/20200519001052/2020-05-18-21.png)


Here is a table that summarizes the key differences between MinMaxScaler and StandardScaler:

| Feature | MinMaxScaler | StandardScaler |
|---|---|---|
| Range | Scales the data to a fixed range, typically between 0 and 1 | Scales the data to have a mean of 0 and a standard deviation of 1 |
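| Distribution | Linear rescaling; the shape of the distribution is unchanged | Linear rescaling as well; most useful when the feature is roughly bell-shaped |
| Outliers | Very sensitive: a single extreme value sets the min or max and compresses all other values | Also affected, because outliers shift the mean and inflate the standard deviation, but values are not squeezed into a fixed range |
| Use cases | Good when a bounded range is needed, for example inputs to neural networks | Good for roughly normally distributed features and for models such as linear or logistic regression |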


**Robust Scaling:**

Robust scaling centers each feature by subtracting its median and scales it by its interquartile range (IQR). Because the median and the IQR are barely affected by extreme values, outliers do not unduly influence the scaling, which makes this method particularly useful for datasets that contain outliers. In scikit-learn it is available as `RobustScaler`.
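
To illustrate the effect of an outlier, the sketch below applies all three scalers to a small made-up array in which the value 100 is far from the rest. The data and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Made-up data: five ordinary values and one outlier
values = np.array([1, 2, 3, 4, 5, 100], dtype=float).reshape(-1, 1)

# Scale the same column with each technique and keep the results
results = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(values).ravel()
    print(name, np.round(results[name], 3))

# Robust scaling by hand: (x - median) / (Q3 - Q1)
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual_robust = ((values - median) / (q3 - q1)).ravel()
print(np.allclose(manual_robust, results["RobustScaler"]))
```

With `MinMaxScaler`, the five ordinary values are compressed into a narrow band near 0, while `RobustScaler` keeps them spread out and leaves only the outlier far from the rest.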

## Feature Hashing

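Feature hashing (also called the "hashing trick") applies a hash function to each category or token and maps it to one of a fixed number of columns. Because the number of columns is fixed, several categories can land in the same column (a collision), which effectively combines more than one category of a categorical variable into a single category.

Feature hashing is an important technique for handling sparse and high-dimensional features in machine learning.

- It is fast, simple, memory-efficient, and well-suited to online learning scenarios.
- It converts tokens into integers that serve as column indices.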
- It operates on the exact strings that you provide as input and does not perform any linguistic analysis or preprocessing.
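
The idea behind these points can be sketched in a few lines of plain Python. This is not how scikit-learn implements it (`FeatureHasher` uses a signed 32-bit MurmurHash3); MD5 is used here only as a convenient deterministic stand-in, and `hash_token` and `hash_vector` are hypothetical helper names.

```python
import hashlib


def hash_token(token: str, n_features: int) -> int:
    """Map a token to a column index in the range [0, n_features)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def hash_vector(tokens, n_features=8):
    """Count tokens into a fixed-length vector using the hashed index."""
    vector = [0] * n_features
    for token in tokens:
        vector[hash_token(token, n_features)] += 1
    return vector


tokens = ["Sports", "Racing", "Sports"]
vector = hash_vector(tokens)
print(vector)

# The exact string matters: no lowercasing or other preprocessing is applied,
# so "Sports" and "sports" are hashed as two unrelated tokens.
print(hash_token("Sports", 8), hash_token("sports", 8))
```

The vector length stays at `n_features` no matter how many distinct tokens appear, which is why no vocabulary needs to be stored in memory.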

**Example 1**: Combining movies into categories for Netflix recommendations.

**Example 2**: Representing a text as a vector:
_Mark has a fun hobby. He goes fishing every weekend. Fishing is fun and relaxing._
 
Bag of words: assume an array that holds every word in a dictionary. The vector has one entry per dictionary word, and each entry counts how often that word occurs in the text:
 
(0,1,0,0,....,2,0,1)
- 1 is the number of occurrences of "Mark"
- 2 is the number of occurrences of "fishing"
- 0 is the number of occurrences of "bike", which does not appear in the text
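
The counts above can be reproduced with scikit-learn's `CountVectorizer`. In this sketch the four-word vocabulary is an assumption standing in for a full dictionary, and the text is lowercased by the vectorizer's default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "Mark has a fun hobby. He goes fishing every weekend. Fishing is fun and relaxing."

# A tiny fixed vocabulary standing in for "all words in a dictionary"
vocabulary = ["mark", "fishing", "bike", "fun"]

# With a fixed vocabulary, each column corresponds to one listed word
vectorizer = CountVectorizer(vocabulary=vocabulary)
counts = vectorizer.fit_transform([text]).toarray()[0]

word_counts = dict(zip(vocabulary, counts.tolist()))
print(word_counts)  # {'mark': 1, 'fishing': 2, 'bike': 0, 'fun': 2}
```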

### **Business Logic**

Consider a dataset column that holds zip codes. With 182 zip codes in New York state, it is impractical to treat each zip code as a separate category.
To tackle this, we can merge the zip codes according to their localities.
This reduces the number of categories and produces a meaningful aggregation of the zip codes.
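
A merge like this can be expressed as a dictionary lookup in pandas. The zip codes and the locality labels in this sketch are placeholders, not a verified lookup table; in practice the mapping would come from domain knowledge or a reference dataset.

```python
import pandas as pd

# Made-up sample data; zip codes are kept as strings to preserve leading zeros
df = pd.DataFrame({"zip_code": ["10001", "10002", "11201", "11215", "10451", "12345"]})

# Hypothetical mapping from zip code to a coarser locality label
zip_to_locality = {
    "10001": "Area A",
    "10002": "Area A",
    "11201": "Area B",
    "11215": "Area B",
    "10451": "Area C",
}

# Replace each zip code by its locality; unmapped codes fall into "Other"
df["locality"] = df["zip_code"].map(zip_to_locality).fillna("Other")

print(df)
print(df["locality"].nunique(), "categories instead of", df["zip_code"].nunique())
```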

### **Frequency**

- Business logic is not always available. In such cases, combine levels using their frequency of occurrence.
- To do so, first look at the frequency distribution of the levels, then combine all levels whose frequency is below a threshold, for example 5% of the total observations (the threshold can be adjusted to the distribution).
- This is an effective method for dealing with rare levels.
- Levels can also be combined by their response rate: levels with similar response rates are simply merged into the same group.
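
The frequency-based approach can be sketched with pandas as follows. The level names, the counts, and the 5% threshold are assumptions chosen for illustration; grouping by response rate is not covered here.

```python
import pandas as pd

# Made-up categorical column with 100 observations
levels = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 14 + ["D"] * 3 + ["E"] * 2 + ["F"] * 1)

# Relative frequency of each level
freq = levels.value_counts(normalize=True)

# Levels that occur in fewer than 5% of the observations
threshold = 0.05
rare_levels = freq[freq < threshold].index

# Keep frequent levels as they are and replace rare ones with "Other"
combined = levels.where(~levels.isin(rare_levels), "Other")

print(combined.value_counts())
```

Here `D`, `E`, and `F` each fall below the threshold, so they are merged into a single `Other` level with six observations.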

Exercise

In [11]:
from sklearn.feature_extraction import FeatureHasher

In [12]:
# Load the video game sales dataset and preview the first rows
import pandas as pd
game_df = pd.read_csv("datasets/vgsales.csv", encoding="utf-8")
game_df.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [13]:
# List the column names of the dataframe
game_df.columns

Index(['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'],
      dtype='object')

**Observations from the above output:**
>The **`game_df`** dataframe contains the columns Rank, Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, and Global_Sales.

In [14]:
game_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]

Unnamed: 0,Name,Platform,Year,Genre,Publisher
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo
5,Tetris,GB,1989.0,Puzzle,Nintendo
6,New Super Mario Bros.,DS,2006.0,Platform,Nintendo


In [19]:
# Get the unique values of Genre and count them
u_genres = game_df["Genre"].unique()
print("Total game genres:", len(u_genres))
print(u_genres)

Total game genres: 12
['Sports' 'Platform' 'Racing' 'Role-Playing' 'Puzzle' 'Misc' 'Shooter'
 'Simulation' 'Action' 'Fighting' 'Adventure' 'Strategy']


In [26]:
fh = FeatureHasher(n_features=12, input_type='string')


In [29]:
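# Each sample must be an iterable of strings. Passing the Series directly raises
# "ValueError: Samples can not be a single string. The input must be an iterable
# over iterables of strings.", so wrap every genre in a one-element list.
hashed_features = fh.fit_transform([[genre] for genre in game_df["Genre"]])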

In [None]:
hashed_features = hashed_features.toarray()
new_game_df = pd.concat([game_df[['Name', 'Genre']], pd.DataFrame(hashed_features)], axis=1)

new_game_df.head()

In [25]:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Sample data
data = pd.DataFrame({'category': ['A', 'B', 'C', 'A', 'B', 'A']}) 

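# Create the hasher. The default input_type='dict' expects mappings and raises
# "AttributeError: 'str' object has no attribute 'items'" on raw strings,
# so request string input explicitly.
hasher = FeatureHasher(n_features=10, input_type='string')

# Hash the categorical column; each sample must be an iterable of strings,
# so wrap every value in a one-element list
hashed_features = hasher.transform([[value] for value in data['category']])

# Convert to DataFrame
hashed_df = pd.DataFrame(hashed_features.toarray())

print(hashed_df)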

In [30]:
# Import the pandas and sklearn libraries
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Create a dataframe with some dummy data
df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green', 'blue'],
    'shape': ['circle', 'square', 'triangle', 'circle', 'square', 'triangle']
})

# Create a FeatureHasher object with 10 features and input type as string
h = FeatureHasher(n_features=10, input_type='string')

# Transform the dataframe into a sparse matrix of hashed features
f = h.transform(df.values)

# Convert the sparse matrix into a dense array
f = f.toarray()

# Print the array
print(f)


[[ 1.  0.  0.  0.  0. -1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1. -1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0. -1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1. -1.  0.  0.  0.  0.]]

**Observations from the above output:**
>Each row holds the hashed `color` and `shape` tokens of one sample. Rows 2 and 5 (`green`, `square`) are all zeros because both tokens were hashed to the same column with opposite signs and cancelled each other out. This is a side effect of the default `alternate_sign=True` combined with a small `n_features`; a larger `n_features` makes such collisions less likely.
