## Feature Engineering

This notebook covers a few important examples of feature engineering for commonly occurring data types. We'll cover:

1. Which data types commonly occur in datasets.
2. How to transform each type of data.

### Common Data Types

1. Numerical Features
2. Textual Features
3. Categorical Features

### 1. Numerical Features

Numerical features are the easiest to work with. A common technique is to derive new features from existing ones, such as speed from distance and time.

In [2]:
import numpy as np
import matplotlib.pyplot as plt

## existing features: distance and time
distance = np.array([40, 45, 50, 55, 60])
time = np.array([5, 8, 10, 12, 15])

## derive a new feature, speed, from distance and time
speed = distance / time
speed

array([8.        , 5.625     , 5.        , 4.58333333, 4.        ])
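
As a minimal sketch of how a derived feature usually joins the original columns, the arrays can be stacked into one feature matrix. The name `features` is an assumption for illustration and is not part of the notebook.

```python
import numpy as np

distance = np.array([40, 45, 50, 55, 60])
time = np.array([5, 8, 10, 12, 15])

# Derived feature: element-wise ratio of the two existing columns
speed = distance / time

# Stack the originals and the derived column side by side, one row per sample
# (`features` is a hypothetical name, not from the notebook)
features = np.column_stack([distance, time, speed])
print(features.shape)
```

Each row of `features` now holds distance, time, and speed for one sample, which is the layout most estimators expect.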

### 2. Textual Data

Text has to be converted to numbers before a model can use it. One simple approach is to count how often each word occurs in each document.

In [4]:
text = ['talk is cheap',
        'persistence is the key',
        'to find the key to happiness']

## import CountVectorizer to count word occurrences
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(text)
X

<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

In [8]:
import pandas as pd

##create a dataframe of the encoded text
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

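| | cheap | find | happiness | is | key | persistence | talk | the | to |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 2 |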
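
To make the counting step concrete, here is a rough pure-Python sketch of the same idea. It is not scikit-learn's implementation. The helper `tokenize` and the names `vocabulary` and `counts` are assumptions for illustration, and the regular expression is meant to approximate `CountVectorizer`'s default tokenization (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

text = ['talk is cheap',
        'persistence is the key',
        'to find the key to happiness']

def tokenize(doc):
    # Hypothetical helper: lowercase, then keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: sorted unique tokens across all documents
vocabulary = sorted({tok for doc in text for tok in tokenize(doc)})

# One row per document, one column per vocabulary word
counts = []
for doc in text:
    word_counts = Counter(tokenize(doc))  # missing words count as 0
    counts.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

For these three sentences the rows should match the count table above, including the 2 for `to` in the last document.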


### 3. Categorical Data

Categorical values such as day names have no meaningful numeric order, so each category is encoded as its own 0/1 column (one-hot encoding).

In [11]:
##sales data with a categorical variable - day
sales_data = [
   {'sales': 850000,  'day': 'Monday'},
   {'sales': 700000,  'day': 'Wednesday'},
   {'sales': 650000,  'day': 'Thursday'},
   {'sales': 600000,  'day': 'Wednesday'}
]

from sklearn.feature_extraction import DictVectorizer

##one-hot encode the categorical column
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(sales_data)

array([[     1,      0,      0, 850000],
       [     0,      0,      1, 700000],
       [     0,      1,      0, 650000],
       [     0,      0,      1, 600000]])

In [15]:
##access feature names
vec.get_feature_names_out()

array(['day=Monday', 'day=Thursday', 'day=Wednesday', 'sales'],
      dtype=object)
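
The encoding above can also be sketched by hand, which shows what each output column means. This is only an illustration under simple assumptions (one categorical key, `day`, and one numeric key, `sales`), not how `DictVectorizer` is implemented. The names `days`, `encoded`, and `feature_names` are hypothetical.

```python
sales_data = [
    {'sales': 850000, 'day': 'Monday'},
    {'sales': 700000, 'day': 'Wednesday'},
    {'sales': 650000, 'day': 'Thursday'},
    {'sales': 600000, 'day': 'Wednesday'},
]

# Sorted distinct categories give a stable column order
days = sorted({row['day'] for row in sales_data})

# One 0/1 indicator column per category, followed by the numeric value
encoded = [
    [int(row['day'] == d) for d in days] + [row['sales']]
    for row in sales_data
]

# Column labels in the same 'key=value' style as the output above
feature_names = [f'day={d}' for d in days] + ['sales']

print(feature_names)
for row in encoded:
    print(row)
```

Exactly one indicator is set per row, and the numeric `sales` value passes through unchanged.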