# Day 6 - Feature Engineering
### Machine Learning Roadmap - Week 2 
### Author - N Manish Kumar 
---

Feature engineering is the process of transforming raw data into features that better represent the underlying problem.

In this notebook, we apply feature engineering techniques to improve a Logistic Regression model without changing the algorithm.

## 1. Load Dataset

We use the Breast Cancer dataset from scikit-learn.

This step:
- Loads the raw dataset
- Converts it into a pandas DataFrame
- Explicitly separates features (X) and target (y)

All feature engineering will be applied only to X.

In [6]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Train-test split (consistent with previous days)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

### Scale Features

In [7]:
scaler = StandardScaler()

X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

### Training and Evaluating Baseline Model

In [8]:
# Train Baseline Model
baseline_model = LogisticRegression(C=1.0, max_iter=500)
baseline_model.fit(X_train_s, y_train)

# Evaluate Baseline Model
y_pred_base = baseline_model.predict(X_test_s)

baseline_acc = accuracy_score(y_test, y_pred_base)
baseline_f1 = f1_score(y_test, y_pred_base)

baseline_acc, baseline_f1

(0.9824561403508771, 0.9861111111111112)

### Interpretation
The baseline model provides a reference point.

Any improvement after feature engineering should be measured relative to this baseline, not in isolation.
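The test set here holds 114 samples, so a single extra error moves accuracy by about 0.9 percentage points, and small differences between variants are hard to tell from noise. One way to get a steadier reference point is cross-validation. The sketch below is an optional addition, not part of the original workflow: the 5-fold setup and the variable names `pipeline`, `cv`, and `cv_f1` are my own assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Scaling lives inside the pipeline, so each fold fits its own scaler
# on that fold's training part only (no leakage from the held-out part).
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=500),
)

# Assumed setup: 5 stratified folds with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_f1 = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")

print("F1 per fold:", cv_f1.round(3))
print(f"Mean F1: {cv_f1.mean():.3f} +/- {cv_f1.std():.3f}")
```

The same pipeline can be re-run on an engineered copy of `X` to compare variants on equal footing.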

---
## 2. Feature Engineering #1 : Interaction Features
Sometimes the target depends not on a single feature, but on how two features interact with each other.

Interaction features allow a linear model to capture relationships of the form: feature_A × feature_B

### Create Interaction Feature

In [9]:
# Create feature-engineered copies
X_train_fe = X_train.copy()
X_test_fe = X_test.copy()

# Interaction feature: mean radius × mean texture
X_train_fe["radius_texture_interaction"] = (
    X_train_fe["mean radius"] * X_train_fe["mean texture"]
)

X_test_fe["radius_texture_interaction"] = (
    X_test_fe["mean radius"] * X_test_fe["mean texture"]
)

### Scale Engineered Features

In [10]:
# Scale Engineered Features
scaler_fe_1 = StandardScaler()
X_train_fe_scaled = scaler_fe_1.fit_transform(X_train_fe)
X_test_fe_scaled = scaler_fe_1.transform(X_test_fe)

### Train Logistic Regression on Engineered Features

In [11]:
# Train logistic regression on engineered features
interaction_model = LogisticRegression(C=1.0, max_iter=500)
interaction_model.fit(X_train_fe_scaled, y_train)


### Evaluate Model

In [13]:
# Predictions
y_pred_inter = interaction_model.predict(X_test_fe_scaled)

# Metrics
interaction_acc = accuracy_score(y_test, y_pred_inter)
interaction_f1 = f1_score(y_test, y_pred_inter)

interaction_acc, interaction_f1

(0.9824561403508771, 0.9861111111111112)

### Interpretation

Compared to the baseline:
- Accuracy: unchanged
- F1 Score: unchanged

The interaction feature (mean radius × mean texture) did not help:
accuracy and F1 score are identical to the baseline, so the model makes
the same number of errors on the test set with or without it.

This suggests that the interaction did not introduce
new discriminative information beyond what the original
features already provided.

The baseline Logistic Regression model already performs strongly
on this dataset, and regularization likely suppressed the
interaction feature due to its limited additional contribution.
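Whether the model actually gave the interaction feature a small weight can be checked rather than assumed. The sketch below refits the same setup and ranks the standardized coefficients by absolute value. The names `coefs`, `ranked`, and `rank` are mine, and a small coefficient is only suggestive, since correlated features share weight among themselves.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X = X.copy()
X["radius_texture_interaction"] = X["mean radius"] * X["mean texture"]

# Same split settings as the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
model = LogisticRegression(C=1.0, max_iter=500)
model.fit(scaler.fit_transform(X_tr), y_tr)

# Features are standardized, so coefficient magnitudes are roughly comparable.
coefs = pd.Series(model.coef_[0], index=X.columns)
ranked = coefs.abs().sort_values(ascending=False)
rank = ranked.index.get_loc("radius_texture_interaction") + 1

print(f"Interaction coefficient: {coefs['radius_texture_interaction']:.4f}")
print(f"Rank by |coefficient|: {rank} of {len(ranked)}")
```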

---

## 3. Feature Engineering #2 : Log Transform
Some features have highly skewed distributions, with a long tail of large values.

A log transformation compresses large values far more
than small ones, which reduces right skew and can make
patterns easier for linear models to learn.
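Before choosing a feature to transform, it helps to measure skewness directly. The sketch below compares the sample skewness of `mean area` before and after `np.log1p` (which computes log(1 + x)). The variable names `skew_raw` and `skew_log` are my own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

area = X["mean area"]

# pandas' skew() returns the sample skewness: values near 0 indicate a
# roughly symmetric distribution, positive values a long right tail.
skew_raw = area.skew()
skew_log = np.log1p(area).skew()

print(f"Skewness of mean area:        {skew_raw:.3f}")
print(f"Skewness of log1p(mean area): {skew_log:.3f}")

# The most skewed columns are the natural candidates for a log transform.
print(X.skew().sort_values(ascending=False).head())
```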

### Create Log Feature

In [14]:
# Start from the interaction-augmented frames, so this model sees both engineered features
X_train_fe_log = X_train_fe.copy()
X_test_fe_log = X_test_fe.copy()

# Log-transform a skewed feature
X_train_fe_log["log_mean_area"] = np.log1p(X_train_fe_log["mean area"])
X_test_fe_log["log_mean_area"] = np.log1p(X_test_fe_log["mean area"])

### Scale Log-Engineered Features

In [15]:
# Scale features
scaler_fe_2 = StandardScaler()
X_train_fe_log_scaled = scaler_fe_2.fit_transform(X_train_fe_log)
X_test_fe_log_scaled = scaler_fe_2.transform(X_test_fe_log)

### Train Logistic Regression on Engineered Features

In [17]:
# Train logistic regression with log-transformed features
log_model = LogisticRegression(C=1.0, max_iter=500)
log_model.fit(X_train_fe_log_scaled, y_train)


### Evaluating Model

In [19]:
# Predictions
y_pred_log = log_model.predict(X_test_fe_log_scaled)

# Metrics
log_acc = accuracy_score(y_test, y_pred_log)
log_f1 = f1_score(y_test, y_pred_log)

log_acc, log_f1

(0.9824561403508771, 0.9861111111111112)

### Log Transformation Results

After adding a log-transformed copy of `mean area`
(on top of the interaction feature from Section 2),
the model’s accuracy and F1 score remained unchanged.

This indicates that the original features already captured
most of the useful information, and the log-transformed
version did not introduce additional discriminative power.

Since the classes in the Breast Cancer dataset are relatively
well separated and every feature is standardized before training,
the log transformation provided minimal benefit in this case.

---

## 4. Error Analysis (Andrew Ng Core Skill)

When feature engineering does not improve performance,
the correct next step is to analyze model errors.

Instead of adding random features, we inspect
misclassified examples to understand:
- Where the model fails
- What information might be missing

### Get Predictions + Errors

In [21]:
# Baseline predictions (already trained in Step 1)
y_pred_base = baseline_model.predict(X_test_s)

# Identify misclassified indices
misclassified_idx = np.where(y_pred_base != y_test.values)[0]

len(misclassified_idx)

2

### Inspect Misclassified Samples

In [22]:
# Inspect misclassified examples
misclassified_samples = X_test.iloc[misclassified_idx]
misclassified_samples.head()

index,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
541,14.47,24.99,95.81,656.4,0.08837,0.123,0.1009,0.0389,0.1872,0.06341,...,16.22,31.73,113.5,808.9,0.134,0.4202,0.404,0.1205,0.3187,0.1023
73,13.8,15.79,90.43,584.1,0.1007,0.128,0.07789,0.05069,0.1662,0.06566,...,16.57,20.86,110.3,812.4,0.1411,0.3542,0.2779,0.1383,0.2589,0.103


### Compare With Correct Predictions

In [24]:
# Correctly classified samples
correct_idx = np.where(y_pred_base == y_test.values)[0]
correct_samples = X_test.iloc[correct_idx]

correct_samples.describe()

statistic,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,112.0,112.0,112.0,112.0,112.0,112.0,112.0,112.0,112.0,112.0,...,112.0,112.0,112.0,112.0,112.0,112.0,112.0,112.0,112.0,112.0
mean,14.371223,19.441786,93.620714,681.296429,0.097176,0.105881,0.087188,0.05133,0.183451,0.062677,...,16.640545,25.787768,109.761786,928.780357,0.132437,0.251633,0.253127,0.117193,0.28689,0.083617
std,3.658738,3.853911,25.148977,382.284527,0.016427,0.050446,0.07199,0.038724,0.024763,0.006734,...,5.12837,5.850991,35.532756,636.684533,0.02423,0.147567,0.177201,0.062076,0.049757,0.016641
min,7.76,11.97,47.92,181.0,0.05263,0.03116,0.0,0.0,0.1353,0.05044,...,9.456,14.1,59.16,268.6,0.08864,0.05232,0.0,0.0,0.1999,0.05933
25%,11.8675,16.695,76.3275,433.325,0.085218,0.066275,0.037465,0.021698,0.16395,0.057985,...,13.0175,21.3175,84.9475,515.35,0.11975,0.148375,0.121175,0.073483,0.252325,0.07108
50%,13.65,19.025,87.6,571.05,0.09687,0.096715,0.06519,0.04165,0.18435,0.062225,...,14.97,25.43,97.965,688.25,0.13115,0.2143,0.20295,0.10875,0.28225,0.080245
75%,16.3275,21.8875,106.675,819.8,0.107525,0.131675,0.118525,0.074845,0.1976,0.065795,...,19.6325,30.3725,128.825,1169.25,0.1457,0.3341,0.3756,0.15755,0.307425,0.09074
max,27.42,29.33,186.9,2501.0,0.1634,0.277,0.3635,0.1878,0.2595,0.09502,...,36.04,41.85,251.2,4254.0,0.2226,0.8681,0.9387,0.2688,0.4753,0.1431


### Error Analysis Observations

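After inspecting the two misclassified samples (rows 541 and 73), I observed that:
- Their feature values overlap with the bulk of the correctly classified samples, pooled across both classes. For example, their mean radius (14.47 and 13.8) and mean area (656.4 and 584.1) sit close to the correctly classified means (14.37 and 681.3) and medians (13.65 and 571.05)
- On these size features neither sample is an outlier, which is consistent with both lying near the decision boundary
- The model struggles with borderline or ambiguous cases

With only two errors these observations are tentative, but they suggest that:
- Linear separation may be insufficient for some regions
- More informative features or non-linear models may be required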
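The idea that the errors sit near the decision boundary can be checked with predicted probabilities. The sketch below refits the baseline and compares how far each prediction is from 0.5 for wrong and right cases. The names `report` and `margin` are mine, and this is one possible check rather than a complete error analysis.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
model = LogisticRegression(C=1.0, max_iter=500)
model.fit(scaler.fit_transform(X_tr), y_tr)

X_te_s = scaler.transform(X_te)
pred = model.predict(X_te_s)
proba = model.predict_proba(X_te_s)[:, 1]  # P(class 1); 1 = benign in this dataset

wrong = pred != y_te.values
report = pd.DataFrame(
    {"true": y_te.values, "pred": pred, "p_class_1": proba},
    index=X_te.index,
)
print(report[wrong])

# Distance from 0.5: small values mean the model was unsure.
margin = np.abs(proba - 0.5)
print("Mean margin, misclassified:", margin[wrong].mean())
print("Mean margin, correct:      ", margin[~wrong].mean())
```

If the misclassified rows show probabilities close to 0.5, that supports the borderline reading; a confident wrong prediction would instead point to an unusual sample or a labeling question.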

---
## 5. Feature Engineering Takeaway

Feature engineering did not change performance: all three models
scored 0.982 accuracy and 0.986 F1 on the test set, because the
baseline model already performs strongly on this dataset.

Error analysis suggests that the two remaining errors are borderline
cases in a region where the classes overlap, rather than cases that
simple feature transformations would fix.

This suggests diminishing returns from further feature engineering
with Logistic Regression on this dataset.