In [None]:
# Re-load the dataset after execution state reset
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

file_path = "cleaned_selected_features.csv"
df = pd.read_csv(file_path)

# Create new ratio-based features
df['PctHousOwnOcc_PctHousLess3BR'] = df['PctHousOwnOcc'] / (df['PctHousLess3BR'] + 1e-5)  # Avoid division by zero
df['PersPerFam_householdsize'] = df['PersPerFam'] / (df['householdsize'] + 1e-5)
df['NumUnderPov_householdsize'] = df['NumUnderPov'] / (df['householdsize'] + 1e-5)
df['NumIlleg_PctWorkMomYoungKids'] = df['NumIlleg'] / (df['PctWorkMomYoungKids'] + 1e-5)

# Create interaction features
df['MedRentPctHousInc_PctHousOwnOcc'] = df['MedRentPctHousInc'] * df['PctHousOwnOcc']
df['PersPerRentOccHous_PctWorkMomYoungKids'] = df['PersPerRentOccHous'] * df['PctWorkMomYoungKids']

# Display the updated dataset with new features
df.head()

# df.to_csv("cleaned_selected_features_engineered.csv", index=False)


Unnamed: 0,PersPerFam,PctHousLess3BR,householdsize,NumIlleg,state,PersPerRentOccHous,PctHousOwnOcc,PctWorkMomYoungKids,MedRentPctHousInc,NumUnderPov,PctHousOwnOcc_PctHousLess3BR,PersPerFam_householdsize,NumUnderPov_householdsize,NumIlleg_PctWorkMomYoungKids,MedRentPctHousInc_PctHousOwnOcc,PersPerRentOccHous_PctWorkMomYoungKids
0,0.43,0.0,0.107354,0.0,65,0.26,0.24,0.46,0.32,0.009901,24000.0,4.005054,0.09222,0.0,0.0768,0.1196
1,0.42,0.5,0.335598,0.009901,65,0.42,0.41,0.71,0.39,0.009901,0.819984,1.251461,0.029502,0.013945,0.1599,0.2982
2,0.65,0.5,0.465824,0.02913,5,0.94,0.96,0.85,0.51,0.009901,1.919962,1.395346,0.021255,0.034271,0.4896,0.799
3,0.91,0.5,0.019609,0.0,95,0.89,0.87,0.4,0.51,0.0,1.739965,46.383357,0.0,0.0,0.4437,0.356
4,0.62,0.0,0.056634,0.0,13,0.39,0.3,0.3,0.59,0.009901,30000.0,10.945464,0.174795,0.0,0.177,0.117
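
The `1e-5` offset prevents a division error, but it does not give a meaningful ratio when the denominator is zero: rows 0 and 4 above have `PctHousLess3BR = 0.0` and end up with ratios of 24000.0 and 30000.0. A minimal sketch of one alternative, assuming it is acceptable to treat such ratios as missing and handle them later; the helper name `safe_ratio` is hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

# Toy values copied from the first three rows of the output above
toy = pd.DataFrame({
    "PctHousOwnOcc": [0.24, 0.41, 0.96],
    "PctHousLess3BR": [0.0, 0.5, 0.5],
})

# Epsilon approach used in the notebook: a zero denominator yields a huge value
toy["ratio_eps"] = toy["PctHousOwnOcc"] / (toy["PctHousLess3BR"] + 1e-5)

# Alternative: a zero denominator yields NaN, which can be imputed or flagged
toy["ratio_nan"] = safe_ratio(toy["PctHousOwnOcc"], toy["PctHousLess3BR"])

print(toy)
```

Whether NaN, a capped value, or the epsilon offset is the better choice depends on how the downstream model treats missing and extreme values.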


## Target Encoding with Cross-Validation for 'state' ##

### Why Use Target Encoding with Cross-Validation?

 1. Prevents Data Leakage
- Direct Target Encoding computes each category's mean from all rows, so a row's own target value leaks into its encoded feature and the model can overfit.
- Cross-Validation (CV) ensures that each data point's encoding value is computed from a training set that does not include itself, preventing data leakage.

 2. Suitable for High-Cardinality Categories
- When a categorical variable like `state` has many unique values (e.g., 50 states), One-Hot Encoding creates too many features, increasing model complexity.
- Target Encoding **reduces dimensionality** by replacing the category with a single numerical value, making it more efficient.

 3. Preserves Category-Target Relationships
- `state` may directly influence crime rates (e.g., some states have higher crime rates).
- Target Encoding **captures this relationship** by computing the mean crime rate for each state, unlike One-Hot Encoding, which treats categories independently.

 4. Works Well with Linear Models & Neural Networks
- Tree-based models (e.g., XGBoost, Random Forest) can often cope with One-Hot Encoding, but for a high-cardinality variable it produces sparse, high-dimensional inputs that add one parameter per category in linear regression and widen the input layer of deep learning models.
- Target Encoding provides a single continuous numerical feature that these models can consume directly.
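
The points above can be sketched as an out-of-fold encoder. This is a minimal illustration rather than the notebook's implementation: the function name `target_encode_oof`, the toy data, and the column name `target` are assumptions, since the notebook's target column does not appear in this excerpt.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(frame, cat_col, target_col, n_splits=5, seed=0):
    """Out-of-fold mean target encoding for one categorical column."""
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(frame):
        fit_part = frame.iloc[fit_idx]
        # Category means computed WITHOUT the rows that are being encoded
        fold_means = fit_part.groupby(cat_col)[target_col].mean()
        values = frame.iloc[enc_idx][cat_col].map(fold_means)
        # Categories unseen in the fitting folds fall back to their overall mean
        values = values.fillna(fit_part[target_col].mean())
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Synthetic stand-in data: 'state' codes and a hypothetical 'target' column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "state": rng.integers(1, 6, size=40),
    "target": rng.random(40),
})
toy["state_te"] = target_encode_oof(toy, "state", "target")
print(toy.head())
```

For rows that are not part of the training data, a common choice is to map each state to its mean over the full training set.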



## Standardization:
- **`PersPerFam` (Persons per Family)**
- **`PersPerRentOccHous` (Persons per Rented Occupied House)**
- **`PctHousOwnOcc` (Percentage of Owner-Occupied Houses)**
- **`PctWorkMomYoungKids` (Percentage of Working Mothers with Young Kids)**
- **`MedRentPctHousInc` (Median Rent as a Percentage of Household Income)**

### **Reason for Standardization (Z-score)**
- These features are approximately **normally distributed**, i.e. they have a **bell-shaped** curve.
- Standardization transforms the data to have **zero mean and unit variance**:
  
  $X_{\text{scaled}} = \frac{X - \mu}{\sigma}$
  
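- Many machine learning algorithms that rely on distances, gradients, or regularization (e.g., regularized linear and logistic regression, PCA, KNN, and SVM) perform better when features have similar scales.
- Standardization is preferred over Min-Max scaling here: both are linear rescalings that keep the shape of the distribution, but Standardization does not let a single extreme value compress the remaining values into a narrow part of the range.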


In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
cols_to_standardize = ['PersPerFam', 'PersPerRentOccHous', 'PctHousOwnOcc', 'PctWorkMomYoungKids', 'MedRentPctHousInc']
df[cols_to_standardize] = scaler.fit_transform(df[cols_to_standardize])
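
As a quick check of the formula, the sketch below compares a manual z-score with `StandardScaler` on the five `PersPerFam` values shown in the output earlier. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# PersPerFam values from the five rows displayed earlier
x = np.array([[0.43], [0.42], [0.65], [0.91], [0.62]])

# Manual z-score; np.std defaults to ddof=0 (population standard deviation)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Same transformation through scikit-learn
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```

Note that `pandas.Series.std()` defaults to `ddof=1`, so a z-score computed with pandas defaults differs slightly from the `StandardScaler` result.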


## Normalization:
- **`householdsize` (Mean People per Household)**

### **Reason for Normalization (Min-Max Scaling)**
- This feature is **right-skewed**, meaning most values are concentrated near the lower range with a few high values.
- Min-Max Scaling transforms values to a **fixed range [0,1]**:

  $X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$

- Household size has a natural minimum (1 person) and a maximum set by a few large households, so normalization keeps all values in a bounded, comparable range. Note that Min-Max Scaling is linear: it changes the range but not the skew.


In [None]:


scaler = MinMaxScaler()
cols_to_normalize = ['householdsize']
df[cols_to_normalize] = scaler.fit_transform(df[cols_to_normalize])
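
Min-Max scaling is a linear rescaling, so it maps values into [0, 1] without changing the shape of the distribution. A small sketch on synthetic right-skewed data (the generated values are an assumption for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic right-skewed household sizes with a minimum near 1
rng = np.random.default_rng(0)
raw = pd.DataFrame({"householdsize": 1 + rng.exponential(scale=1.5, size=1000)})

scaled = MinMaxScaler().fit_transform(raw[["householdsize"]]).ravel()

# The range becomes [0, 1] ...
print(scaled.min(), scaled.max())

# ... while the skewness is the same before and after scaling
print(raw["householdsize"].skew(), pd.Series(scaled).skew())
```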


## Log Transformation:
- **`NumIlleg` (Number of Children Born to Never-Married Parents)**
- **`NumUnderPov` (Number of People Under Poverty Line)**

### **Reason for Log Transformation (`log1p`)**
- Both features exhibit **highly skewed distributions**, where most values are very small, but a few extremely large values create a long tail.
- Applying a log transformation:

  $X_{\text{log}} = \log(1 + X)$

  helps to **reduce skewness** and make the distribution more normal-like.
- This is particularly important for regression-based models, as extreme values can disproportionately affect predictions.
- The `log1p` function is used instead of `log(X)` to handle zeros safely.

In [8]:

df['NumIlleg'] = np.log1p(df['NumIlleg'])
df['NumUnderPov'] = np.log1p(df['NumUnderPov'])
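
To illustrate the effect, the sketch below applies `log1p` to synthetic count-like data with a long right tail and many zeros. The data is generated for illustration only and is not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic counts: mostly small values, a few very large ones, and some zeros
rng = np.random.default_rng(0)
counts = pd.Series(np.floor(rng.lognormal(mean=1.0, sigma=1.5, size=1000)))

logged = np.log1p(counts)

# Zeros are handled without producing -inf, unlike np.log
print((counts == 0).sum(), np.isfinite(logged).all())

# Skewness before and after the transformation
print(counts.skew(), logged.skew())

# np.expm1 inverts log1p if the original scale is needed again
restored = np.expm1(logged)
```

The effect depends on the scale of the data: for values confined to [0, 1], `log1p` is close to linear and changes the shape much less.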


## Binarization:
- **`PctHousLess3BR` (Percentage of Houses with Less than 3 Bedrooms)**

### **Reason for Binarization**
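- This feature has only **three unique values: 0, 0.5, 1**.
- So few distinct values suggest that it is a **categorical or artificially bucketed** feature rather than a continuous percentage.
- The binary flag below separates 0 from the other two levels, so 0.5 and 1 are merged into a single class.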

In [9]:
df['PctHousLess3BR_binary'] = (df['PctHousLess3BR'] > 0).astype(int)
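
The sketch below shows how the three levels map to the binary flag and, as an alternative that is not used in the notebook, how `pd.get_dummies` would keep all three levels as separate indicator columns. The toy series is an assumption for illustration.

```python
import pandas as pd

# Toy series containing the three observed levels
levels = pd.Series([0.0, 0.5, 1.0, 0.5, 0.0], name="PctHousLess3BR")

# Same rule as in the notebook: 0 stays 0, while 0.5 and 1 both become 1
binary = (levels > 0).astype(int).rename("PctHousLess3BR_binary")
print(pd.concat([levels, binary], axis=1))

# Alternative: one indicator column per level, so 0.5 and 1 stay distinguishable
dummies = pd.get_dummies(levels, prefix="PctHousLess3BR")
print(dummies)
```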
