### Feature Engineering
This is a useful process, since real-world data rarely comes as tidy numerical data.\
We're going to look at the common tasks in feature engineering and how to handle each one of them.

#### 1. Categorical Features 

In [None]:
data = [
    {"price": 850000, "rooms": 4, "neighborhood": "Queen Anne"},
    {"price": 700000, "rooms": 3, "neighborhood": "Fremont"},
    {"price": 650000, "rooms": 3, "neighborhood": "Wallingford"},
    {"price": 600000, "rooms": 2, "neighborhood": "Fremont"},
]


In [None]:
# sparse=False returns a dense array; use sparse=True if the dataset is large
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)


To inspect the feature names:

In [None]:
vec.get_feature_names_out()
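
To make the encoding concrete, here is a rough plain-Python sketch of the one-hot expansion applied to the `neighborhood` column. The `one_hot` helper is hypothetical and written only for illustration; it is not how scikit-learn implements `DictVectorizer`, and it handles a single key.

```python
# Same records as the `data` list above, repeated so the sketch is self-contained.
data = [
    {"price": 850000, "rooms": 4, "neighborhood": "Queen Anne"},
    {"price": 700000, "rooms": 3, "neighborhood": "Fremont"},
    {"price": 650000, "rooms": 3, "neighborhood": "Wallingford"},
    {"price": 600000, "rooms": 2, "neighborhood": "Fremont"},
]


def one_hot(records, key):
    """Hypothetical helper: expand records[key] into one 0/1 column per category."""
    categories = sorted({rec[key] for rec in records})
    columns = [f"{key}={cat}" for cat in categories]
    rows = [[int(rec[key] == cat) for cat in categories] for rec in records]
    return columns, rows


columns, rows = one_hot(data, "neighborhood")
print(columns)
for row in rows:
    print(row)
```

Each record gets a 1 in the column for its own neighborhood and 0 elsewhere, so no artificial ordering is imposed on the categories.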

#### 2. Text Features

In [None]:
sample = ["problem of evil", "evil queen", "horizon problem"]


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vec = CountVectorizer()
X = vec.fit_transform(sample)
X


The result is a sparse matrix recording the number of occurrences of each word. Let's inspect it:

In [None]:
import pandas as pd
import numpy as np

pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

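Raw word counts can put too much weight on words that appear very frequently. TF-IDF (term frequency-inverse document frequency) addresses this by weighting each count by a measure of how often the word appears across the documents: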

In [None]:
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
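
The weights above can be approximated by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses a naive whitespace tokenizer, which is an assumption that happens to be adequate for this small sample.

```python
import math
from collections import Counter

sample = ["problem of evil", "evil queen", "horizon problem"]

# Naive whitespace tokenization (assumption; enough for this sample).
docs = [text.split() for text in sample]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

tfidf = []
for doc in docs:
    counts = Counter(doc)
    weights = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(w * w for w in weights))  # L2 norm of the row
    tfidf.append([w / norm for w in weights])

print(vocab)
for row in tfidf:
    print([round(w, 3) for w in row])
```

Words that occur in several documents ("evil", "problem") end up with lower weights than words unique to one document ("of", "queen", "horizon").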


#### 3. Derived Features

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 2, 1, 3, 7])
plt.scatter(x, y)


In [None]:
# scikit-learn expects a 2D feature matrix of shape (n_samples, n_features)
X = x[:, np.newaxis]


Let's first fit a linear regression model:

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)
yfit = model.predict(X)
plt.scatter(x, y)
plt.plot(x, yfit)

It's clear that we need a more sophisticated model to describe the relationship between x and y.\
This can be achieved by transforming the input:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3, include_bias=False)
X2 = poly.fit_transform(X)

In [None]:
model = LinearRegression().fit(X2, y)
yfit = model.predict(X2)
plt.scatter(x, y)
plt.plot(x, yfit)

We can see that fitting a linear regression on this expanded input gives a much closer fit to our data.\
<br>
The idea of improving a model not by changing the model but by transforming its inputs is fundamental to many of the more\
powerful ML models.
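
To see what the transformation produces, the sketch below builds the same kind of expanded input in plain Python. The `polynomial_columns` helper is hypothetical; for a single input column with `degree=3` and no bias term, it yields the columns x, x², and x³.

```python
x = [1, 2, 3, 4, 5]


def polynomial_columns(values, degree):
    """Hypothetical helper: one row per value, holding its powers 1..degree."""
    return [[v ** p for p in range(1, degree + 1)] for v in values]


X2_by_hand = polynomial_columns(x, degree=3)
for row in X2_by_hand:
    print(row)
```

The regression stays linear in its coefficients; only the inputs it sees have changed.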

#### 4. Imputation of Missing Data

In [None]:
from numpy import nan

X = np.array([[nan, 0, 3], [3, 7, 9], [3, 5, 2], [4, nan, 6], [8, 8, 1]])
y = np.array([14, 16, -1, 8, -5])


In [None]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy="mean")
X2 = imp.fit_transform(X)
X2


We see that in the resulting data, the two missing values have been replaced with the\
mean of the remaining values in their respective columns. This imputed data can then be fed\
directly into, for example, a LinearRegression estimator:

In [None]:
model = LinearRegression().fit(X2, y)
a = model.predict(X2)
a
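
The mean strategy used above can be reproduced by hand. This is a minimal plain-Python sketch with a hypothetical `impute_mean` helper, shown only to make the arithmetic visible; it is not scikit-learn's implementation.

```python
import math

nan = float("nan")
X = [[nan, 0, 3], [3, 7, 9], [3, 5, 2], [4, nan, 6], [8, 8, 1]]


def impute_mean(rows):
    """Hypothetical helper: replace each NaN with the mean of its column's observed values."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if not math.isnan(v)]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if math.isnan(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


X2_by_hand = impute_mean(X)
for row in X2_by_hand:
    print(row)
```

The first column's observed values are 3, 3, 4, and 8, so its missing entry becomes 4.5; the second column's are 0, 7, 5, and 8, so its missing entry becomes 5.0.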

#### 5. Feature Pipeline
Scikit-learn provides `make_pipeline` to help string together multiple processing steps:

In [None]:
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    SimpleImputer(strategy="mean"),
    PolynomialFeatures(degree=2),
    LinearRegression(),
)


We then proceed as we normally would, fitting the pipeline and predicting with it:

In [None]:
model.fit(X, y)
a = model.predict(X)
a
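
Conceptually, a pipeline fits each transformer in order and hands its output to the next step, then reuses the fitted steps on new data. The sketch below illustrates that chaining in plain Python. Every class and function in it is a hypothetical, simplified stand-in (for example, `AppendSquares` omits the bias and cross terms that `PolynomialFeatures(degree=2)` would add), and the final regression step is left out.

```python
import math

nan = float("nan")


class MeanImputer:
    """Hypothetical stand-in for SimpleImputer(strategy="mean")."""

    def fit(self, rows):
        self.means_ = []
        for col in zip(*rows):
            observed = [v for v in col if not math.isnan(v)]
            self.means_.append(sum(observed) / len(observed))
        return self

    def transform(self, rows):
        return [
            [self.means_[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows
        ]


class AppendSquares:
    """Hypothetical, simplified feature expansion: appends each value squared."""

    def fit(self, rows):
        return self  # nothing to learn

    def transform(self, rows):
        return [list(row) + [v ** 2 for v in row] for row in rows]


def fit_transform_chain(steps, rows):
    """Fit each step on the output of the previous one."""
    for step in steps:
        rows = step.fit(rows).transform(rows)
    return rows


def transform_chain(steps, rows):
    """Apply already-fitted steps to new data without refitting."""
    for step in steps:
        rows = step.transform(rows)
    return rows


X = [[nan, 0, 3], [3, 7, 9], [3, 5, 2], [4, nan, 6], [8, 8, 1]]
steps = [MeanImputer(), AppendSquares()]

train_features = fit_transform_chain(steps, X)
new_features = transform_chain(steps, [[nan, 1, 2]])
print(train_features[0])
print(new_features[0])
```

Note that the missing value in the new row is filled with the mean learned from the training data, which is the main reason to bundle preprocessing and model into one object.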
