# 🧪 Feature Engineering

*Feature engineering* is the process of transforming, creating, or selecting variables (features) to **improve the performance** of machine learning models.

🔹 Sometimes the original data doesn’t directly include the variables that are most helpful to the model.

🔹 Creating good features can make a bigger difference than switching models.

In this notebook, we’ll work on the Titanic dataset using techniques such as:

- Variable selection and transformation
- Encoding categorical variables
- Generating new variables (synthetic features)
- Discretization, scaling, and more


## 1. Importing libraries and loading the data

In [1]:
import pandas as pd
import seaborn as sns

df = sns.load_dataset("titanic")
df.head()

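| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |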


## 2. Basic data cleaning

In [2]:
# Check missing values
df.isnull().sum()

| Column | Missing values |
|---|---|
| survived | 0 |
| pclass | 0 |
| sex | 0 |
| age | 177 |
| sibsp | 0 |
| parch | 0 |
| fare | 0 |
| embarked | 2 |
| class | 0 |
| who | 0 |


In [3]:
# Drop columns that add little or would leak the target:
# 'deck' is mostly missing, 'embark_town' duplicates 'embarked', 'alive' duplicates 'survived'
df = df.drop(columns=['deck', 'embark_town', 'alive'])

# Impute missing values:
# For numerical variables -> fill with the mean (sum of values / number of values) or median (middle value of the sorted data)
# For categorical variables -> fill with the mode (most frequent value)
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
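
The choice between mean and median matters when a column contains extreme values. The following is a minimal sketch on made-up numbers (the `fares` and `ports` series are hypothetical, not taken from the Titanic data) showing how a single outlier shifts the mean but not the median, and how the mode is used for a categorical column.

```python
import pandas as pd

# Toy numeric column with one missing value and one extreme value (hypothetical data)
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 500.0, None])

# The mean is pulled up by the outlier; the median is not
mean_filled = fares.fillna(fares.mean())      # fills with roughly 106.6
median_filled = fares.fillna(fares.median())  # fills with 8.0

print(mean_filled.iloc[-1])
print(median_filled.iloc[-1])

# Toy categorical column: mode() returns a Series, so take its first entry
ports = pd.Series(["S", "C", "S", None, "Q"])
ports_filled = ports.fillna(ports.mode()[0])
print(ports_filled.tolist())
```

With skewed columns the median is usually the safer default, which is consistent with the notebook using it for `age`.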

## 3. Encoding categorical variables

Label Encoding (suitable for binary or ordinal variables): maps each category to an integer.

In [4]:
df['sex'] = df['sex'].map({'male': 0, 'female': 1})

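One-Hot Encoding (suitable for nominal variables): creates a new 0/1 column for each category. With `drop_first=True`, the first category is dropped to avoid multicollinearity, since its value is implied when all the other columns are 0.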

In [5]:
df = pd.get_dummies(df, columns=['embarked'], drop_first=True)
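
To see what `drop_first=True` changes, here is a small sketch on a toy frame (the `toy` DataFrame is made up for illustration). It compares the full encoding with the reduced one.

```python
import pandas as pd

# Toy column with the three embarkation codes (hypothetical rows)
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

full = pd.get_dummies(toy, columns=["embarked"])
reduced = pd.get_dummies(toy, columns=["embarked"], drop_first=True)

print(list(full.columns))     # ['embarked_C', 'embarked_Q', 'embarked_S']
print(list(reduced.columns))  # ['embarked_Q', 'embarked_S']

# In the full encoding every row sums to 1, so one column is redundant.
# In the reduced encoding, a row with both columns False means category "C".
print(full.sum(axis=1).tolist())
```

Categories are ordered alphabetically, so the dropped column here is `embarked_C`, which matches the `embarked_Q` and `embarked_S` columns in the notebook output.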

In [6]:
# Check the result
df.head()

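| | survived | pclass | sex | age | sibsp | parch | fare | class | who | adult_male | alone | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 | Third | man | True | False | False | True |
| 1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | First | woman | False | False | False | False |
| 2 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | Third | woman | False | True | False | True |
| 3 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | First | woman | False | False | False | True |
| 4 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.05 | Third | man | True | True | False | True |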


## 4. Creating new variables

Example: passenger title (Mr, Mrs, Miss…)



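```
# NOTE: the seaborn "titanic" dataset loaded above has no `name` column, so this
# cell only runs on a version of the data that includes passenger names in a
# column called `name`. The guard below skips it otherwise.
if 'name' in df.columns:
    # Extract the word that precedes a period, e.g. "Mr" in "Surname, Mr. Given"
    df['title'] = df['name'].str.extract(r'([A-Za-z]+)\.', expand=False)

    # Group rare titles
    df['title'] = df['title'].replace(['Lady', 'Countess', 'Capt', 'Col',
                                       'Don', 'Dr', 'Major', 'Rev', 'Sir',
                                       'Jonkheer', 'Dona'], 'Rare')

    # Merge equivalent titles
    df['title'] = df['title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

    # Encode as integers (any title not listed here becomes NaN)
    df['title'] = df['title'].map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4})
```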
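
Since the dataset loaded in this notebook does not include passenger names, the title logic can be tried on a few sample strings instead. This is a sketch on made-up names in the "Surname, Title. Given names" format; the `names` series is purely illustrative.

```python
import pandas as pd

# Made-up sample strings in the "Surname, Title. Given names" format
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Miss. Jane",
    "Smith, Mlle. Anne",
    "Jones, Col. James",
])

# Capture the first run of letters that is immediately followed by a period
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Merge an equivalent title and group a rare one
titles = titles.replace({"Mlle": "Miss"}).replace(["Col"], "Rare")

# Integer codes, same mapping as in the cell above
codes = titles.map({"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3, "Rare": 4})

print(titles.tolist())  # ['Mr', 'Miss', 'Miss', 'Rare']
print(codes.tolist())   # [0, 1, 1, 4]
```

The integer codes carry no natural order, so for linear models one-hot encoding the title may be a better fit than this mapping.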

A. Is a child?

In [7]:
df['is_child'] = (df['age'] < 12).astype(int)

B. Family size

In [8]:
df['family_size'] = df['sibsp'] + df['parch'] + 1

C. Is alone?

In [9]:
df['is_alone'] = (df['family_size'] == 1).astype(int)

## 5. Scaling numerical variables

Scaling is important for models sensitive to feature scale, such as SVM or neural networks.

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # Standardize each column to mean 0 and variance 1
df[['age_scaled', 'fare_scaled']] = scaler.fit_transform(df[['age', 'fare']])
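
When the data is later split into training and test sets, the scaling parameters should be learned from the training rows only, otherwise information from the test set leaks into the features. The sketch below spells out the standardization formula by hand on toy values (the `toy` frame and the 4/2 split are assumptions for illustration) to show the "fit on train, apply to both" pattern.

```python
import pandas as pd

# Toy data (hypothetical values); the first 4 rows play the role of the training split
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0, 60.0, 10.0]})
train, test = toy.iloc[:4], toy.iloc[4:]

# "fit": learn mean and standard deviation from the training rows only
mu = train["age"].mean()
sigma = train["age"].std(ddof=0)  # population std, the convention StandardScaler uses

# "transform": apply the same parameters to both splits
train_scaled = (train["age"] - mu) / sigma
test_scaled = (test["age"] - mu) / sigma

print(train_scaled.tolist())
print(test_scaled.tolist())  # test values need not have mean 0 or variance 1
```

With scikit-learn the same pattern is `scaler.fit(...)` on the training columns followed by `scaler.transform(...)` on each split.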

### Other useful transformations

Discretization (binning): grouping a continuous variable into intervals can help models capture non-linear relationships and makes the feature easier to interpret.

In [11]:
# Group age into bins
df['age_bin'] = pd.cut(df['age'], bins=[0, 12, 18, 35, 60, 100], labels=['child', 'teen', 'young_adult', 'adult', 'senior'])
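
By default `pd.cut` builds intervals that are open on the left and closed on the right, so the bins above are (0, 12], (12, 18], and so on. The sketch below, on a handful of made-up ages, shows where boundary values land and how that differs slightly from the `is_child` rule (`age < 12`).

```python
import pandas as pd

# Made-up ages chosen to sit on or near the bin edges
ages = pd.Series([0.42, 11.0, 12.0, 12.5, 35.0, 80.0])
bins = [0, 12, 18, 35, 60, 100]
labels = ["child", "teen", "young_adult", "adult", "senior"]

# Default: right-closed intervals, so age 12 falls in "child"
age_bin = pd.cut(ages, bins=bins, labels=labels)
is_child = (ages < 12).astype(int)

print(age_bin.tolist())   # ['child', 'child', 'child', 'teen', 'young_adult', 'senior']
print(is_child.tolist())  # [1, 1, 0, 0, 0, 0]

# right=False gives left-closed intervals [0, 12), [12, 18), ...
age_bin_left = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_bin_left.tolist())
```

A 12-year-old is a `child` in `age_bin` but not in `is_child`. Passing `right=False` is one way to make the two definitions agree.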

## 6. Final dataset

In [12]:
# Show transformed variables
df_model = df[['pclass', 'sex', 'age_scaled', 'fare_scaled', 'is_child', 'family_size', 'is_alone', 'embarked_Q', 'embarked_S', 'survived']]
df_model.head()

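| | pclass | sex | age_scaled | fare_scaled | is_child | family_size | is_alone | embarked_Q | embarked_S | survived |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | -0.565736 | -0.502445 | 0 | 2 | 0 | False | True | 0 |
| 1 | 1 | 1 | 0.663861 | 0.786845 | 0 | 2 | 0 | False | False | 1 |
| 2 | 3 | 1 | -0.258337 | -0.488854 | 0 | 1 | 1 | False | True | 1 |
| 3 | 1 | 1 | 0.433312 | 0.42073 | 0 | 2 | 0 | False | True | 1 |
| 4 | 3 | 0 | 0.433312 | -0.486337 | 0 | 1 | 1 | False | True | 0 |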


- Creating variables like `family_size` or `is_alone` can capture useful information that isn't directly visible.
- Encoding is needed whenever a model expects numeric inputs, and scaling matters for models that are sensitive to feature scale (such as SVMs or neural networks).
- It's always a good idea to visualize the new variables to understand whether they add value.
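
A quick way to check whether a new variable adds value is to compare the target rate across its values. The following sketch uses a made-up mini-dataset with the same column names as the notebook (the rows are hypothetical, not real passengers).

```python
import pandas as pd

# Hypothetical mini-dataset with the same column names as the notebook
toy = pd.DataFrame({
    "sibsp":    [1, 0, 0, 3, 0, 1],
    "parch":    [0, 0, 0, 2, 0, 1],
    "survived": [1, 0, 0, 0, 1, 1],
})
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1
toy["is_alone"] = (toy["family_size"] == 1).astype(int)

# The mean of a 0/1 target within each group is the survival rate of that group
rate_by_alone = toy.groupby("is_alone")["survived"].mean()
print(rate_by_alone)
```

On the real data the same idea is `df.groupby('is_alone')['survived'].mean()`, or a plot such as `sns.barplot(data=df, x='family_size', y='survived')`. A clear difference between groups suggests the feature carries signal.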