# Exploratory Data Analysis and Feature Engineering

## Feature Engineering
- The process of creating new features or modifying existing features in a dataset in order to improve the performance of machine learning models.

### Importance of Feature Engineering:
1. Improves model performance
2. Reduces overfitting
3. Enables simpler models
4. Handles data quality issues

### Steps in Feature Engineering
1. Understand the data - Distribution, Relationships, Domain specific characteristics of the features of the data.
2. Data Cleaning - Handle Missing Values, Handle Outliers, Handle Errors in the data.
3. Feature Creation - Generate New Features from the existing data.
4. Feature Transformation - Modify the existing features: Normalization, Scaling, Encoding, etc.
5. Feature Selection - Select the most relevant features to use in your model: Correlation Analysis, Feature Importance, Dimensionality Reduction
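6. Interaction Features - Combine two or more existing features (for example, a product or a ratio) to capture their joint effect.
7. Domain Specific Features - Use knowledge of the problem domain to build features that the raw columns do not express directly.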
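
A few of the steps above can be sketched on a small, made-up table. The column names mirror the dataset used later in this notebook, but the values, the median imputation, and the `spend_ratio` and `age_x_income` features are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age.
toy = pd.DataFrame({
    "age": [25.0, 40.0, np.nan, 31.0],
    "income": [30000.0, 52000.0, 48000.0, 75000.0],
    "purchase_amount": [120.0, 80.0, 200.0, 50.0],
})

# Step 2 (Data Cleaning): fill the missing age with the column median.
toy["age"] = toy["age"].fillna(toy["age"].median())

# Step 3 (Feature Creation): share of income spent on the purchase.
toy["spend_ratio"] = toy["purchase_amount"] / toy["income"]

# Step 6 (Interaction Features): product of two existing features.
toy["age_x_income"] = toy["age"] * toy["income"]

print(toy)
```

Median imputation is only one option for missing values; dropping rows or using a model-based imputer may suit other datasets better.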

## Techniques used in Feature Engineering

### A. Normalization and Scaling

1. **Scaling**:
    - Transforming the data so that it fits within a specific range, such as 0-100 or 0-1.
    - Example: one US dollar is worth about 100 yen. If prices are not scaled, distance-based methods such as SVM or KNN treat a difference of 1 yen as just as important as a difference of 1 US dollar.
2. **Normalization**:
    - Changing the observations so that they can be described by a normal distribution.
    - Unlike scaling, this changes the shape of the data's distribution, not just its range.
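
The difference between the two ideas can be checked numerically. The sketch below uses a synthetic right-skewed sample and a hand-written `skewness` helper (both assumptions made for illustration). Min-max scaling changes the range but leaves the skewness untouched, while a log transform, used here as one simple shape-changing transform, reduces it. A log transform does not guarantee a normal distribution; it only moves a right-skewed sample closer to one.

```python
import numpy as np

# Synthetic right-skewed sample (hypothetical data, fixed seed).
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)

# Scaling: min-max rescaling to [0, 1]. Only the range changes.
x_scaled = (x - x.min()) / (x.max() - x.min())

# Shape-changing transform: log(1 + x) compresses the long right tail.
x_log = np.log1p(x)

def skewness(values):
    """Sample skewness: third central moment divided by std cubed."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / values.std() ** 3

print("original:", skewness(x))
print("scaled:  ", skewness(x_scaled))
print("log1p:   ", skewness(x_log))
```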

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('sample_dataset.csv')
data.head()

Unnamed: 0,id,name,age,income,purchase_date,product_category,purchase_amount,is_returned
0,e3e70682-c209-4cac-a29f-6fbed82c07cd,Thomas Gonzalez,42.0,92701.95,2021-04-14,Clothing,261.01,1
1,f728b4fa-4248-4e3a-8a5d-2f346baa9455,Christopher Barrett,66.0,67632.79,2021-06-14,Sports,318.51,1
2,eb1167b3-67a9-4378-bc65-c1e582e2e662,Kathy Willis,44.0,22836.08,2021-03-31,Electronics,370.91,0
3,f7c1bd87-4da5-4709-9471-3d60c8a70639,Shannon Bullock,20.0,83378.92,2023-11-04,Clothing,77.07,1
4,e443df78-9558-467f-9ba9-1faf7a024204,Steven Johnson,34.0,44448.31,2020-01-12,Home & Kitchen,256.4,0


In [3]:
data = data.drop(columns=['id', 'product_category', 'name', 'purchase_date'])
data.head()

Unnamed: 0,age,income,purchase_amount,is_returned
0,42.0,92701.95,261.01,1
1,66.0,67632.79,318.51,1
2,44.0,22836.08,370.91,0
3,20.0,83378.92,77.07,1
4,34.0,44448.31,256.4,0


In [4]:
! pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable


In [5]:
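# Standardization: rescale each column to mean 0 and standard deviation 1
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
# Missing values (NaN) are ignored during fitting and stay NaN in the output
data_scaled = scaler.fit_transform(data)
data_scaled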

array([[-0.11497117,  0.22079014, -0.09879538,  0.99203175],
       [ 1.42082185, -0.02730732,  0.01253117,  0.99203175],
       [ 0.01301158, -0.47063888,  0.11398355, -1.00803226],
       ...,
       [-0.49891943, -0.27423629, -0.23726625,  0.99203175],
       [ 1.42082185,  0.0185204 , -0.07178659, -1.00803226],
       [        nan, -0.02411104,         nan, -1.00803226]])

In [6]:
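# Using the MinMaxScaler: rescales each column to the range [0, 1] by default
# Note: despite the variable name, this is scaling, not normalization
minmax_scaler = MinMaxScaler()
data_normalized = minmax_scaler.fit_transform(data)
data_normalized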

array([[0.46153846, 0.07407981, 0.05124714, 1.        ],
       [0.92307692, 0.04849344, 0.06276043, 1.        ],
       [0.5       , 0.0027725 , 0.07325253, 0.        ],
       ...,
       [0.34615385, 0.02302757, 0.03692661, 1.        ],
       [0.92307692, 0.05321966, 0.05404036, 0.        ],
       [       nan, 0.04882307,        nan, 0.        ]])

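1. **StandardScaler**: standardizes each feature by subtracting its mean and dividing by its standard deviation, so every column ends up with mean = 0 and standard deviation = 1 (unit variance). It does not change the shape of the distribution, so the result is normally distributed only if the original feature already was.
2. **MinMaxScaler**: rescales each feature to the range [0, 1] by default, whether or not the dataset contains negative values. A different range such as [-1, 1] is used only when it is requested through the `feature_range` parameter.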
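
The behaviour of both scalers can be verified on a tiny example. The array below is made up for illustration; it contains a negative value and a missing value to show that `MinMaxScaler` still maps to [0, 1] by default and that both scalers leave `NaN` entries in place.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with a negative value and a missing value.
X = np.array([[-10.0], [0.0], [10.0], [20.0], [np.nan]])

# Mean 0 and standard deviation 1, computed over the non-missing values.
standardized = StandardScaler().fit_transform(X)

# Default range is [0, 1], even though X contains a negative value.
minmax_default = MinMaxScaler().fit_transform(X)

# The range [-1, 1] has to be requested explicitly.
minmax_symmetric = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(standardized.ravel())
print(minmax_default.ravel())
print(minmax_symmetric.ravel())
```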

### B. Data Encoding
- Converting categorical variables into numerical data.

In [7]:
df = pd.read_csv('sample_dataset.csv')
df.head()

Unnamed: 0,id,name,age,income,purchase_date,product_category,purchase_amount,is_returned
0,e3e70682-c209-4cac-a29f-6fbed82c07cd,Thomas Gonzalez,42.0,92701.95,2021-04-14,Clothing,261.01,1
1,f728b4fa-4248-4e3a-8a5d-2f346baa9455,Christopher Barrett,66.0,67632.79,2021-06-14,Sports,318.51,1
2,eb1167b3-67a9-4378-bc65-c1e582e2e662,Kathy Willis,44.0,22836.08,2021-03-31,Electronics,370.91,0
3,f7c1bd87-4da5-4709-9471-3d60c8a70639,Shannon Bullock,20.0,83378.92,2023-11-04,Clothing,77.07,1
4,e443df78-9558-467f-9ba9-1faf7a024204,Steven Johnson,34.0,44448.31,2020-01-12,Home & Kitchen,256.4,0


In [8]:
# For Encoding Categorical Data
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [9]:
df = df.drop(columns=['id', 'name'])
df.head()

Unnamed: 0,age,income,purchase_date,product_category,purchase_amount,is_returned
0,42.0,92701.95,2021-04-14,Clothing,261.01,1
1,66.0,67632.79,2021-06-14,Sports,318.51,1
2,44.0,22836.08,2021-03-31,Electronics,370.91,0
3,20.0,83378.92,2023-11-04,Clothing,77.07,1
4,34.0,44448.31,2020-01-12,Home & Kitchen,256.4,0


In [10]:
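# Caution: fitting on the whole DataFrame one-hot encodes every column,
# including the numeric and date columns, which is why the result below
# has 2603 columns. To encode only the categorical column, pass
# df[['product_category']] instead of df.
encoder = OneHotEncoder()
categorical_data_encoded = encoder.fit_transform(df)
categorical_data_encoded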

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6000 stored elements and shape (1000, 2603)>

In [11]:
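# Caution: passing the sparse matrix straight to pd.DataFrame does not expand it;
# each row ends up as a single sparse object in one column. Converting with
# categorical_data_encoded.toarray() first gives one column per encoded value.
encoded_data = pd.DataFrame(categorical_data_encoded)
encoded_data.head()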

Unnamed: 0,0
0,<Compressed Sparse Row sparse matrix of dtype ...
1,<Compressed Sparse Row sparse matrix of dtype ...
2,<Compressed Sparse Row sparse matrix of dtype ...
3,<Compressed Sparse Row sparse matrix of dtype ...
4,<Compressed Sparse Row sparse matrix of dtype ...


### C. Date Time Features

In [12]:
df.head()

Unnamed: 0,age,income,purchase_date,product_category,purchase_amount,is_returned
0,42.0,92701.95,2021-04-14,Clothing,261.01,1
1,66.0,67632.79,2021-06-14,Sports,318.51,1
2,44.0,22836.08,2021-03-31,Electronics,370.91,0
3,20.0,83378.92,2023-11-04,Clothing,77.07,1
4,34.0,44448.31,2020-01-12,Home & Kitchen,256.4,0


In [13]:
df['purchase_date'] = pd.to_datetime(df['purchase_date'])

In [14]:
df['Year'] = df['purchase_date'].dt.year
df['Month'] = df['purchase_date'].dt.month
df['Day'] = df['purchase_date'].dt.day
df.head()

Unnamed: 0,age,income,purchase_date,product_category,purchase_amount,is_returned,Year,Month,Day
0,42.0,92701.95,2021-04-14,Clothing,261.01,1,2021,4,14
1,66.0,67632.79,2021-06-14,Sports,318.51,1,2021,6,14
2,44.0,22836.08,2021-03-31,Electronics,370.91,0,2021,3,31
3,20.0,83378.92,2023-11-04,Clothing,77.07,1,2023,11,4
4,34.0,44448.31,2020-01-12,Home & Kitchen,256.4,0,2020,1,12
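
Beyond year, month, and day, the same `.dt` accessor exposes other calendar attributes that are often useful as features. The sketch below is a small illustration on three dates taken from the table above; the `DayOfWeek`, `IsWeekend`, and `Quarter` column names are hypothetical.

```python
import pandas as pd

# Three purchase dates copied from the sample rows above.
toy = pd.DataFrame({"purchase_date": ["2021-04-14", "2021-06-14", "2023-11-04"]})
toy["purchase_date"] = pd.to_datetime(toy["purchase_date"])

# Day of the week as an integer, where Monday = 0 and Sunday = 6.
toy["DayOfWeek"] = toy["purchase_date"].dt.dayofweek

# Boolean flag for Saturday (5) and Sunday (6).
toy["IsWeekend"] = toy["DayOfWeek"] >= 5

# Calendar quarter, from 1 to 4.
toy["Quarter"] = toy["purchase_date"].dt.quarter

print(toy)
```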
