# Feature Engineering and Feature Selection Exercises
This notebook contains hands-on exercises for feature engineering and feature selection using Python.
Datasets used:
- Titanic (from seaborn)
- Iris (from sklearn)
- Breast Cancer (from sklearn)


## Part 1 – Feature Engineering
### Exercise 1: Create new features from existing ones
**Dataset:** Titanic dataset (`seaborn.load_dataset('titanic')`)
 

In [167]:
import seaborn as sns
import pandas as pd
pd.options.mode.copy_on_write = True

titanic = sns.load_dataset('titanic')

###  1.1 The Titanic dataset

A passenger list from the Titanic's maiden (and only) voyage. The seaborn version contains a sample of 891 passengers rather than everyone on board.
 
**Features:**
   * survived - (0 = no, 1 = yes)
   * pclass - (1 = First class, 2 = Second class, 3 = Third class)
   * sex - ("male", "female")
   * age - Age in years
   * sibsp - Number of siblings/spouses on board
   * parch - Number of parents/children on board
   * fare - Ticket price

In [168]:
# 0. print the first few rows of the dataset 
print(titanic.head(5))

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


In [169]:
# 1. Create family_size = sibsp + parch + 1 (the +1 counts the passenger)
titanic["family_size"] = titanic["sibsp"] + titanic["parch"] + 1

# 2. Create is_alone = 1 if family_size == 1 else 0
titanic["is_alone"] = (titanic["family_size"] == 1).astype(int)

# 3. Convert the 'sex' column into numerical format (0=male, 1=female)
#    (Hint: use map, for example: df['col'] = df['col'].map({'a': 1, 'b': 2}))
titanic["sex_numerical"] = titanic["sex"].map({"male": 0, "female": 1})


print(titanic.head(5))

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  family_size  is_alone  \
0    man        True  NaN  Southampton    no  False            2         0   
1  woman       False    C    Cherbourg   yes  False            2         0   
2  woman       False  NaN  Southampton   yes   True            1         1   
3  woman       False    C  Southampton   yes  False            2         0   
4    man        True  NaN  Southampton    no   True            1         1   

   sex_numerical  
0              0  
1              1  
2              1  
3              1  
4              0  

### Exercise 2: Handling missing values


#### 2.1. Identify columns with missing values


In [170]:
with_missing = titanic[titanic.columns[titanic.isna().any()]]

with_missing

,age,embarked,deck,embark_town
0,22.0,S,,Southampton
1,38.0,C,C,Cherbourg
2,26.0,S,,Southampton
3,35.0,S,C,Southampton
4,35.0,S,,Southampton
...,...,...,...,...
886,27.0,S,,Southampton
887,19.0,S,B,Southampton
888,,S,,Southampton
889,26.0,C,C,Cherbourg



#### 2.2. Impute missing numerical values with the median

In [171]:
# Impute missing values:
#   1. missing numerical <- median (hint: use select_dtypes(include=['number']).columns
#      to select the numerical columns, then use fillna)

numbers_with_missing = with_missing.select_dtypes(include="number")
numbers_filled = numbers_with_missing.fillna(numbers_with_missing.median())

numbers_filled

,age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0
...,...
886,27.0
887,19.0
888,28.0
889,26.0



#### 2.3. Impute missing categorical values with the mode

In [172]:

#   2. missing categorical <- mode (hint: use select_dtypes(include=['object', 'category']).columns to select the categorical columns, then use fillna)
categorical_with_missing = with_missing.select_dtypes(include=['object', 'category'])
{
    col: categorical_with_missing[col].unique() for col in categorical_with_missing.columns
}

{'embarked': array(['S', 'C', 'Q', nan], dtype=object),
 'deck': [NaN, 'C', 'E', 'G', 'D', 'A', 'B', 'F']
 Categories (7, object): ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
 'embark_town': array(['Southampton', 'Cherbourg', 'Queenstown', nan], dtype=object)}

In [173]:
# Instead of the mode, mark gaps explicitly: add the category "MISSING" to all category columns
for column in categorical_with_missing.select_dtypes(include="category"):
    try:
        categorical_with_missing[column] = categorical_with_missing[column].cat.add_categories("MISSING")
    except ValueError:
        continue

categorical_with_missing = categorical_with_missing.fillna("MISSING")

titanic[categorical_with_missing.columns] = categorical_with_missing

categorical_with_missing


,embarked,deck,embark_town
0,S,MISSING,Southampton
1,C,C,Cherbourg
2,S,MISSING,Southampton
3,S,C,Southampton
4,S,MISSING,Southampton
...,...,...,...
886,S,MISSING,Southampton
887,S,B,Southampton
888,S,MISSING,Southampton
889,C,C,Cherbourg
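
The cell above labels gaps with an explicit "MISSING" category. The mode-based imputation that the heading of 2.3 names could be sketched as follows. This is a minimal sketch on a hypothetical toy frame (`toy`) standing in for the Titanic columns, and `fill_with_mode` is an illustrative helper, not part of the exercise code.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic columns that contain gaps
toy = pd.DataFrame({
    "embarked": ["S", "C", None, "S", "Q"],
    "deck": pd.Categorical(["C", None, "C", "B", None]),
})


def fill_with_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which each column's missing values are replaced by its mode."""
    filled = df.copy()
    for col in filled.columns:
        modes = filled[col].mode(dropna=True)
        if not modes.empty:  # an all-missing column has no mode, so leave it untouched
            filled[col] = filled[col].fillna(modes.iloc[0])
    return filled


toy_filled = fill_with_mode(toy)
print(toy_filled)
```

Mode imputation keeps the set of categories unchanged, whereas the "MISSING" label preserves the information that a value was absent. Which one is preferable depends on whether missingness itself is likely to be informative.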


In [177]:
# Write the median-imputed numerical columns back into the main DataFrame
titanic[numbers_filled.columns] = numbers_filled

display(titanic)

#   3. verify that there are no missing values left in the imputed columns
#      (an Index has no .extend(); .union() combines the two column indexes)
imputed_columns = categorical_with_missing.columns.union(numbers_with_missing.columns)
assert not titanic[imputed_columns].isna().any().any()


,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,family_size,is_alone,sex_numerical
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,MISSING,Southampton,no,False,2,0,0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,2,0,1
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,MISSING,Southampton,yes,True,1,1,1
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,2,0,1
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,MISSING,Southampton,no,True,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,MISSING,Southampton,no,True,1,1,0
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,1,1,1
888,0,3,female,28.0,1,2,23.4500,S,Third,woman,False,MISSING,Southampton,no,False,4,0,1
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True,1,1,0


### Exercise 3: Encoding categorical variables
**Dataset:** Iris dataset (`sklearn.datasets.load_iris`)

In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

iris = load_iris(as_frame=True) # load data as a DataFrame, where (data, target) will be pandas DataFrames or Series
iris_df = iris.data.copy()  # features only, without inbuilt 'target' column.
iris_df['species'] = pd.Index(iris.target_names)[iris.target] # add 'species' column to iris_df using target names

display(iris_df.head())  # Display the first few rows of the iris dataset
display("Unique species names: ", iris_df['species'].unique())  # Display unique species names

# --- Encode species names using One-Hot Encoding and Label Encoding ---
    # 1. label encoding (hint: use LabelEncoder, and fit_transform)

    # 2. (optional): want to see the mapping? uncomment the following code (remember to replace 'le' by your label encoder variable name)
# id_to_name = {i: c for i, c in enumerate(le.classes_)}
# display(id_to_name)

    # 3. Onehot encoding 
        #  (Hint: use OneHotEncoder(sparse_output=False, handle_unknown='ignore'), 
        #  then fit_transform, 
        #  after that use get_feature_names_out(['species']).tolist() to get the one-hot encoded column names, 
        #  and then add the one-hot encoded columns to iris_df)


# (optional): display the first few rows of the iris dataset. see any difference? 

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


'Unique species names: '

array(['setosa', 'versicolor', 'virginica'], dtype=object)

## Part 2 – Feature Selection
### Exercise 4: Correlation-based selection
**Dataset:** Continue with the processed Titanic dataset

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Compute correlation matrix for numerical variables
    # 1. select numerical columns (HINT: you have done this before in Part 1)
    
    # 2. compute correlation matrix (Hint: use corr(numeric_only=True) to compute the correlation matrix for numerical variables)
    
    # 3. (optional) Plot matrix 
        # (HINT: uncomment the following lines to plot the correlation matrix.
        #  Remember to import seaborn and matplotlib.pyplot and replace 'corr' by your correlation matrix variable name)
# plt.figure(figsize=(9,7))
# sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, square=True,
#             linewidths=0.5, cbar_kws={"label": "Pearson r"})
# plt.title("Correlation matrix Titanic Dataset", fontsize=16)
# plt.show()

    # 4. Drop variables with high correlation (>0.85)
thr = 0.85 
keep_fea = []   # features we keep
drop_fea = []   # features to drop

        # (HINT: use a for loop to iterate over the columns of the correlation matrix,
        #   for example: for c in corr.columns: if any(abs(corr.loc[c, k]) > thr for k in keep_fea): ??_fea.append(c) else: ??_fea.append(c)
        #   abs() makes strong negative correlations count as well)


#   5. (optional) if you want to create a reduced dataset without the dropped features, uncomment the following line:
# titanic_reduced = titanic.drop(columns=drop_fea)
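
The greedy filter described in step 4 could look like the sketch below. To stay runnable without downloading the Titanic data, it uses a hypothetical synthetic frame (`demo`) with two deliberately redundant columns. The same loop applies unchanged to the correlation matrix of the Titanic DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: "a_scaled" nearly duplicates "a", "neg_b" mirrors "b"
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
demo = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 0.01 * rng.normal(size=n),
    "b": b,
    "neg_b": -b,
})

corr = demo.corr(numeric_only=True)

thr = 0.85
keep_fea = []  # features we keep
drop_fea = []  # features to drop

for c in corr.columns:
    # Drop c if it is strongly correlated (in absolute value) with a feature already kept
    if any(abs(corr.loc[c, k]) > thr for k in keep_fea):
        drop_fea.append(c)
    else:
        keep_fea.append(c)

demo_reduced = demo.drop(columns=drop_fea)
print("keep:", keep_fea)
print("drop:", drop_fea)
```

The result depends on column order, because the first feature of each correlated group is the one that survives. Reordering the columns (for example by domain relevance) before the loop changes which ones are kept.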

### Exercise 5: Univariate feature selection
**Dataset:** Breast Cancer dataset (`sklearn.datasets.load_breast_cancer`)

In [None]:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

cancer = load_breast_cancer(as_frame=True)
X = cancer.data
y = cancer.target

    # 1. Train/test split (HINT: use train_test_split from sklearn.model_selection)

    # 2. Select top 5 features 
        # (hint: using SelectKBest(score_func=f_classif, k=5) , and then fit it to the training data. 
        #  After that, use get_support() to get a boolean mask of selected features, and then use it to get the feature names from X.columns)
    
    # 3. make training and test datasets based on the selected features: 

    # 4. (optional) Training a Logistic Regression model with selected features
        # HINT: Since we have not covered the algorithms yet, you can use the code provided below:
# clf_5 = LogisticRegression(max_iter=100,  # iteration cap; 100 is the scikit-learn default, raise it if you see a convergence warning
#                          solver="liblinear"  # use liblinear for small datasets
#                          )
# clf_5.fit(X_train_sel_5, y_train)  # remember to replace X_train_sel_5 with your training dataset based on selected features
# y_pred = clf_5.predict(X_test_sel_5)  # remember to replace X_test_sel_5 with your test dataset based on selected features
# print("Accuracy:", accuracy_score(y_test, y_pred))
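
A possible end-to-end solution for steps 1 to 4 is sketched below. The split parameters (`test_size=0.2`, `random_state=42`, `stratify=y`) and `max_iter=1000` are assumptions chosen for reproducibility and to avoid convergence warnings. The exercise does not prescribe them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer(as_frame=True)
X, y = cancer.data, cancer.target

# 1. Train/test split (parameter values are assumptions, see above)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Score every feature with the ANOVA F-test and keep the 5 best.
#    The selector is fitted on the training data only, so the test set stays unseen.
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(X_train, y_train)
selected_5 = X.columns[selector.get_support()].tolist()

# 3. Restrict both splits to the selected columns
X_train_sel_5 = X_train[selected_5]
X_test_sel_5 = X_test[selected_5]

# 4. Train and evaluate a logistic regression on the reduced feature set
clf_5 = LogisticRegression(max_iter=1000, solver="liblinear")
clf_5.fit(X_train_sel_5, y_train)
acc_5 = accuracy_score(y_test, clf_5.predict(X_test_sel_5))

print("Selected features:", selected_5)
print("Accuracy:", acc_5)
```

`f_classif` scores each feature in isolation, so two selected features can still be highly correlated with each other. Combining it with the correlation filter from Exercise 4 is one way to address that.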



### Exercise 6: Recursive Feature Elimination (RFE)

In [None]:
from sklearn.feature_selection import RFE

    # 1. Use RFE with logistic regression to select the top 7 features
        # (Hint: use RFE(estimator=clf_5, n_features_to_select=7, step=1);
        # if you skipped the optional step in Exercise 5, pass a new LogisticRegression instead of clf_5.
        # Then fit it to the training data,
        # and use X.columns[rfe.support_].tolist() to get the selected features)
    
    # 2. Make training and test datasets based on the selected 7 features


    # 3. Train a new model on the selected features and compare accuracy with and without RFE. What is the accuracy difference?
        # (HINT: uncomment the following lines to train a new model and compare accuracy with the previous model)
# clf_7 = LogisticRegression(max_iter=100,
#                            solver="liblinear")
# clf_7.fit(X_train_sel_7, y_train)
# y_pred_7 = clf_7.predict(X_test_sel_7)
# print("Accuracy:", accuracy_score(y_test, y_pred_7))
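
The RFE steps could be sketched as below. The block reloads the data so that it runs on its own, and it compares the 7-feature model against a model trained on all 30 features. As in the previous sketch, the split parameters and `max_iter=1000` are assumptions, and the names `clf_all` and `acc_all` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer(as_frame=True)
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 1. RFE repeatedly fits the estimator and removes the weakest feature
#    (smallest absolute coefficient) until 7 features remain
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000, solver="liblinear"),
    n_features_to_select=7,
    step=1,
)
rfe.fit(X_train, y_train)
selected_7 = X.columns[rfe.support_].tolist()

# 2. Restrict both splits to the selected columns
X_train_sel_7 = X_train[selected_7]
X_test_sel_7 = X_test[selected_7]

# 3. Compare a model on the 7 RFE features with a model on all features
clf_7 = LogisticRegression(max_iter=1000, solver="liblinear")
clf_7.fit(X_train_sel_7, y_train)
acc_7 = accuracy_score(y_test, clf_7.predict(X_test_sel_7))

clf_all = LogisticRegression(max_iter=1000, solver="liblinear")
clf_all.fit(X_train, y_train)
acc_all = accuracy_score(y_test, clf_all.predict(X_test))

print("RFE features:", selected_7)
print("Accuracy (7 RFE features):", acc_7)
print("Accuracy (all features):  ", acc_all)
```

Because RFE ranks features by coefficient size, the ranking is sensitive to feature scale. Standardizing the inputs first (for example with `StandardScaler`) is a common refinement that this sketch leaves out.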
