# Feature Engineering

## Loading dataset

In [146]:
import pandas as pd

In [147]:
df = pd.read_csv("data/HousingData.csv")

In [148]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     486 non-null    float64
 1   ZN       486 non-null    float64
 2   INDUS    486 non-null    float64
 3   CHAS     486 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      486 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    486 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


# Boston Housing Dataset: Feature Descriptions

---
### Features

-   `CRIM`: Per capita crime rate by town.
-   `ZN`: Proportion of residential land zoned for lots over 25,000 sq. ft.
-   `INDUS`: Proportion of non-retail business acres per town.
-   `CHAS`: Charles River dummy variable (`1` if tract bounds river; `0` otherwise).
-   `NOX`: Nitric oxide concentration (parts per 10 million).
-   `RM`: Average number of rooms per dwelling.
-   `AGE`: Proportion of owner-occupied units built prior to 1940.
-   `DIS`: Weighted distances to five Boston employment centers.
-   `RAD`: Index of accessibility to radial highways.
-   `TAX`: Full-value property tax rate per $10,000.
-   `PTRATIO`: Pupil-teacher ratio by town.
-   `B`: `1000(Bk - 0.63)²`, where `Bk` is the proportion of people of African American descent by town.
-   `LSTAT`: Percentage of lower status of the population.

---

### Target Variable

> **`MEDV`**: Median value of owner-occupied homes in $1000s. This is typically the target variable for regression models.

In [149]:
df.isnull().sum()

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
B           0
LSTAT      20
MEDV        0
dtype: int64

## 1. Missing Values

### I strongly recommend using `sklearn.impute` instead of `reparo`: the `reparo` version raised an unexpected error, and its working principle differs

```python
from reparo import KNNImputer

columns_to_impute = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'AGE', 'LSTAT']
imputer = KNNImputer(n_neighbors=6)

df[columns_to_impute] = imputer.fit_transform(df[columns_to_impute])
```


❗ AttributeError: 'DataFrame' object has no attribute 'dtype'

In [150]:
from sklearn.impute import KNNImputer

columns_to_impute = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'AGE', 'LSTAT']
imputer = KNNImputer(n_neighbors=6)

df[columns_to_impute] = imputer.fit_transform(df[columns_to_impute])
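
To make the imputation step easier to inspect, here is a minimal sketch on a toy frame. The column names `a` and `b` and their values are made up for illustration and are not part of the housing data. It shows that `KNNImputer.fit_transform` returns a NumPy array, so wrapping the result back into a DataFrame keeps the column labels.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

toy_imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a NumPy array, so rebuild the DataFrame to keep labels
filled = pd.DataFrame(
    toy_imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(filled.isnull().sum().sum())  # counts the remaining missing values
```

Values that were already present are left untouched; only the `NaN` cells are replaced by the average of the nearest rows.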

## 2. Handle Outliers


In [None]:
from sklearn.ensemble import IsolationForest

columns_to_check = df.columns
iso_forest = IsolationForest(contamination=0.05, random_state=42)

iso_forest.fit(df[columns_to_check])

outlier_labels = iso_forest.predict(df[columns_to_check])

df_clean = df[outlier_labels == 1]


# Only a small number of rows are flagged as outliers, so they can be removed here, but removal is not always the right approach
print(f"Removed {sum(outlier_labels == -1)} outliers.")


Removed 26 outliers.
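
Since removing outliers is not always appropriate, one alternative is to keep every row and only record the flag. The sketch below assumes synthetic data (the columns `x1` and `x2` are hypothetical) and uses the same `IsolationForest` settings as above; it is an illustration, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data (hypothetical): 200 ordinary points plus 5 extreme ones
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extreme = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
toy = pd.DataFrame(np.vstack([normal, extreme]), columns=["x1", "x2"])

iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(toy)  # 1 = inlier, -1 = outlier

# Keep every row and store the decision in a new column instead of dropping
toy["is_outlier"] = labels == -1
print(toy["is_outlier"].sum(), "rows flagged out of", len(toy))
```

The flag can later be used as a filter, as a feature, or simply for inspection, which leaves the decision to drop rows reversible.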


## 3. Encode Categorical Variables

#### If categorical features exist, they should be encoded before being used in a model, because most ML algorithms expect numeric input. All 14 columns in this dataset are already numeric, so no encoding is needed here.
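
For reference, a minimal sketch of one-hot encoding with `pandas.get_dummies`. The `neighborhood` column is hypothetical and does not exist in the housing data; it only shows what the encoding step would look like if a text column were present.

```python
import pandas as pd

# Hypothetical frame: "neighborhood" is NOT a column of the housing data
toy = pd.DataFrame({
    "rooms": [5, 6, 7],
    "neighborhood": ["north", "south", "north"],
})

# Each category becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["neighborhood"], dtype=int)
print(encoded.columns.tolist())
# ['rooms', 'neighborhood_north', 'neighborhood_south']
```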

## 4. Scaling data
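
Before the real cell, a small sketch of the idea behind `StandardScaler`: the mean and standard deviation are learned from the training split only and then reused for the test split. The toy arrays below are made-up values for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy split; the values are illustrative only
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

toy_scaler = StandardScaler().fit(train)     # statistics come from train only
train_scaled = toy_scaler.transform(train)
test_scaled = toy_scaler.transform(test)      # reuses the training statistics

print(toy_scaler.mean_)      # mean learned from the training split
print(train_scaled.mean())   # approximately 0 after scaling
```

Fitting the scaler on the training split alone keeps information from the test split out of the preprocessing step.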

In [152]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df_clean.drop(columns=['MEDV'])
y = df_clean['MEDV']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
# For regressors, score() returns R² (coefficient of determination), not classification accuracy
r2 = model.score(X_test, y_test)
print("Trained Model without polynomial features")
print(f"Model Accuracy: {r2}")
print(f"Mean Squared Error: {mse}")


Trained Model without polynomial features
Model Accuracy: 0.7929928866668681
Mean Squared Error: 14.679095205055257
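
The value printed as "Model Accuracy" comes from `LinearRegression.score`, which for regressors is the coefficient of determination R² rather than a classification accuracy. A quick sketch on synthetic data (the coefficients and noise level are arbitrary assumptions) checks that it matches `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic linear data (hypothetical coefficients and noise level)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

toy_model = LinearRegression().fit(X_toy, y_toy)

r2_from_score = toy_model.score(X_toy, y_toy)
r2_from_metric = r2_score(y_toy, toy_model.predict(X_toy))

print(np.isclose(r2_from_score, r2_from_metric))  # both compute the same R²
```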


## 5. Polynomial Features: derive new features from existing ones and add them to the dataset

In [None]:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score


poly_cols = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE',
             'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Add degree-2 polynomial features to the dataset
# Note: this uses df, so the rows excluded from df_clean are still included
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly_features = poly.fit_transform(df[poly_cols])
feature_names = poly.get_feature_names_out(poly_cols)

df_poly = pd.DataFrame(X_poly_features, columns=feature_names)

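# Note: df_poly already contains the degree-1 terms, so the poly_cols columns appear twice after this concat
df_with_poly = pd.concat([df.reset_index(drop=True), df_poly], axis=1)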

X = df_with_poly.drop(columns=['MEDV'])
y = df_with_poly['MEDV']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Ridge regression is linear regression with L2 regularization; its hyperparameters are tuned below
ridge = Ridge()
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0, 100.0],
    'fit_intercept': [True, False]
}

# Hyperparameter tuning using GridSearchCV
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# For regressors, score() returns R², so this equals r2 above
model_r2 = best_model.score(X_test, y_test)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Model Accuracy (R²): {model_r2}")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")


Best Parameters: {'alpha': 1.0, 'fit_intercept': True}
Model Accuracy (R²): 0.8392578867685252
Mean Squared Error: 11.787827276449052
R^2 Score: 0.8392578867685252
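
In the cell above, the scaler is fit on the whole training split before cross-validation, so each validation fold has already influenced the scaling. One way to avoid that is to wrap the steps in a `Pipeline`, so they are refit inside every fold. The sketch below uses synthetic data and hypothetical step names (`poly`, `scale`, `ridge`); it is an illustration under those assumptions, not the notebook's actual run.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical) with one quadratic term
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy[:, 0] ** 2 + 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])

# Step parameters are addressed as "<step name>__<parameter>"
search = GridSearchCV(
    pipe,
    {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="r2",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # R² on the held-out split
```

With this layout the polynomial expansion and the scaling are recomputed from the training part of each fold, and the final `score` call applies the same fitted steps to the test split.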


## Conclusion

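#### Feature engineering improved the measured performance: after imputing missing values and adding polynomial features with Ridge regression, the model's R² score increased from 0.79 to 0.84, and the MSE dropped from 14.68 to 11.79. Note that the polynomial model was fit on `df` rather than `df_clean`, so the two runs use different rows and test splits and the comparison is indicative rather than exact.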

#### `PolynomialFeatures` lets the linear model capture non-linear relationships, and Ridge regularization helps limit the overfitting that the extra features could cause. Together they can noticeably improve regression performance.