
**Feature Engineering**

Feature engineering helps build a robust dataset and can improve a model's effectiveness. Typically, it modifies or extends the existing features in a dataset with additional insights or data.

In [7]:
import pandas as pd

# Creating a mixed dataset of strings, floats, and date strings
df = pd.DataFrame(
    [
        ["cat", 1.0, "3-2021"],
        ["cat", 0.5, "1-2021"],
        ["dog", 0.2, "5-2021"],
        ["bird", 3.3, "3-2021"],
        ["dog", 5.7, "1-2021"],
        ["dog", 0.0, "2-2021"],
        ["cat", 1.9, "4-2021"],
        ["bird", 2.4, "4-2021"],
        ["bird", 2.4, "5-2021"]
    ],
    columns=["animal", "value", "date"]
)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   animal  9 non-null      object 
 1   value   9 non-null      float64
 2   date    9 non-null      object 
dtypes: float64(1), object(2)
memory usage: 344.0+ bytes


**Changing Data Types**

The astype() method converts a column to the specified data type. Here the string column "animal" is converted to the category type.

In [8]:
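# Assign the converted column directly; in recent pandas versions,
# df.loc[:, "animal"] = ... writes into the existing object column and keeps its dtype
df["animal"] = df["animal"].astype("category")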
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   animal  9 non-null      category
 1   value   9 non-null      float64 
 2   date    9 non-null      object  
dtypes: category(1), float64(1), object(1)
memory usage: 385.0+ bytes
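
To illustrate what the category type stores, here is a minimal sketch. It assumes a standalone series named `animals` (a hypothetical name, not part of the notebook) that mirrors the "animal" column. Each distinct label is kept once, and every row holds a small integer code that points to its label.

```python
import pandas as pd

# Hypothetical standalone series mirroring the "animal" column above
animals = pd.Series(
    ["cat", "cat", "dog", "bird", "dog", "dog", "cat", "bird", "bird"]
)
as_category = animals.astype("category")

# The distinct labels are stored once, sorted alphabetically by default
categories = list(as_category.cat.categories)

# Each row stores an integer code that indexes into the label list
codes = as_category.cat.codes.tolist()

print(categories)  # ['bird', 'cat', 'dog']
print(codes)       # [1, 1, 2, 0, 2, 2, 1, 0, 0]
```

On a column with many rows and few distinct labels, this representation usually takes less memory than an object column of strings. On a nine-row frame like this one, the fixed overhead can outweigh the savings.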


**Normalizing Data**

Normalization rescales numerical data onto a common scale.

StandardScaler standardizes a feature by subtracting its mean and dividing by its standard deviation, so the transformed values have zero mean and unit variance.

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df[["value"]])
scaler.transform(df[["value"]])

array([[-0.54744332],
       [-0.84071653],
       [-1.01668045],
       [ 0.80161343],
       [ 2.20932483],
       [-1.13398974],
       [-0.01955155],
       [ 0.27372166],
       [ 0.27372166]])
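
As a check on what the scaler computes, the sketch below repeats the standardization by hand with NumPy and compares the result with StandardScaler. The array `values` is an assumed standalone copy of the "value" column. The manual formula uses the population standard deviation (`ddof=0`), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed standalone copy of the "value" column, shaped (n_samples, 1)
values = np.array([[1.0], [0.5], [0.2], [3.3], [5.7], [0.0], [1.9], [2.4], [2.4]])

# fit_transform combines the fit and transform steps shown above
scaled = StandardScaler().fit_transform(values)

# Same computation by hand: subtract the column mean, then divide by the
# population standard deviation (np.std uses ddof=0 by default)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

When the data is split into training and test sets, the scaler is usually fitted on the training data only and then applied to both sets, so that statistics from the test data do not leak into the model.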

**Parsing Data Types**

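The pandas to_datetime() function parses date strings and converts them to datetime objects.

The strings here contain only a month and a year, so each parsed date falls on the first day of the month.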

In [10]:
# An explicit month-year format avoids relying on format inference
pd.to_datetime(df["date"], format="%m-%Y")

0   2021-03-01
1   2021-01-01
2   2021-05-01
3   2021-03-01
4   2021-01-01
5   2021-02-01
6   2021-04-01
7   2021-04-01
8   2021-05-01
Name: date, dtype: datetime64[ns]
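
Once the strings are parsed, the `.dt` accessor can derive numeric features from them. The following is a minimal sketch, assuming a short standalone series `dates` and a hypothetical output frame `features`; neither name appears in the notebook.

```python
import pandas as pd

# Assumed standalone sample in the same month-year format as the "date" column
dates = pd.Series(["3-2021", "1-2021", "5-2021"])

# Parse with an explicit format so the month is never mistaken for a day
parsed = pd.to_datetime(dates, format="%m-%Y")

# Derive integer features from the parsed datetimes
features = pd.DataFrame({"month": parsed.dt.month, "year": parsed.dt.year})
print(features)
```

Numeric columns such as the month and year can be passed to models that cannot use raw datetime values.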

**One-hot Encoding**



*   Required for models that only accept numerical data
*   Pandas provides a one-hot encoding function, pd.get_dummies()
*   Converts a categorical column into one binary feature column per category

In [11]:
# The prefix parameter is prepended to each new column name;
# dtype=int gives 0/1 values (recent pandas versions default to True/False)
pd.get_dummies(df.animal, prefix="animal", dtype=int)

   animal_bird  animal_cat  animal_dog
0            0           1           0
1            0           1           0
2            0           0           1
3            1           0           0
4            0           0           1
5            0           0           1
6            0           1           0
7            1           0           0
8            1           0           0
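
get_dummies() also accepts a whole DataFrame. The sketch below, which uses an assumed three-row frame named `frame`, encodes only the column listed in `columns` and leaves the numeric column untouched. `dtype=int` is passed so the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Assumed small frame with one categorical and one numeric column
frame = pd.DataFrame(
    {"animal": ["cat", "dog", "bird"], "value": [1.0, 0.2, 3.3]}
)

# Encode only "animal"; other columns are kept and placed first
encoded = pd.get_dummies(frame, columns=["animal"], prefix="animal", dtype=int)

print(encoded.columns.tolist())
# ['value', 'animal_bird', 'animal_cat', 'animal_dog']
```

For linear models, passing `drop_first=True` drops one indicator column per encoded feature, which avoids perfectly collinear columns.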
