# Feature Engineering

Some fundamentals of exploratory data analysis and feature engineering, using the Titanic dataset.


In [1]:
# %pip install --quiet --upgrade pip 
# %pip install numpy --quiet
# %pip install PyArrow --quiet
# %pip install Pandas --quiet
# %pip install scikit-learn --quiet

In [2]:
import pandas as pd

In [3]:
# Load the data
titanic_train = pd.read_csv("Data/titanic_train.csv")
titanic_test = pd.read_csv("Data/titanic_test.csv")


## Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process where analysts explore datasets to summarise their main characteristics, often visually, before applying more formal modeling techniques. The primary goal is to understand the data's structure, spot patterns, detect anomalies, check assumptions, and test hypotheses. EDA helps to make the data analysis process more efficient and guides subsequent modeling steps.

In [4]:
# Combine train and test data. For exploratory data analysis we want to find correlations across the entire dataset,
# so we can avoid overfitting to the train dataset.
titanic_data = pd.concat([titanic_train, titanic_test], sort=True)
titanic_data = titanic_data.drop(["Survived"], axis=1)
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          1046 non-null   float64
 1   Cabin        295 non-null    object 
 2   Embarked     1307 non-null   object 
 3   Fare         1308 non-null   float64
 4   Name         1309 non-null   object 
 5   Parch        1309 non-null   int64  
 6   PassengerId  1309 non-null   int64  
 7   Pclass       1309 non-null   int64  
 8   Sex          1309 non-null   object 
 9   SibSp        1309 non-null   int64  
 10  Ticket       1309 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 122.7+ KB


The combined dataset has a total of 1,309 rows.

- 1046 rows have non-null Age, implying 263 missing values.
- Cabin has 1,014 missing values.
- Embarked has 2 missing values.
- Fare has 1 missing value.
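
Each of these figures is the total row count minus the column's non-null count. A minimal sketch of that arithmetic, using the numbers printed by `info()` above (the variable names are illustrative only):

```python
import pandas as pd

# Non-null counts copied from the info() output above.
total_rows = 1309
non_null = pd.Series({"Age": 1046, "Cabin": 295, "Embarked": 1307, "Fare": 1308})

# Missing values per column = total rows - non-null rows.
missing = total_rows - non_null
print(missing)
```

On the real DataFrame, `len(titanic_data) - titanic_data.count()` should give the same result, since `count()` returns the number of non-null values per column.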

In [5]:
titanic_data.nunique()

Age              98
Cabin           186
Embarked          3
Fare            281
Name           1307
Parch             8
PassengerId    1309
Pclass            3
Sex               2
SibSp             7
Ticket          929
dtype: int64

In [6]:
titanic_data.describe(include="all")

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Ticket
count,1046.0,295,1307,1308.0,1309,1309.0,1309.0,1309.0,1309,1309.0,1309
unique,,186,3,,1307,,,,2,,929
top,,C23 C25 C27,S,,"Connolly, Miss. Kate",,,,male,,CA. 2343
freq,,6,914,,2,,,,843,,11
mean,29.881138,,,33.295479,,0.385027,655.0,2.294882,,0.498854,
std,14.413493,,,51.758668,,0.86556,378.020061,0.837836,,1.041658,
min,0.17,,,0.0,,0.0,1.0,1.0,,0.0,
25%,21.0,,,7.8958,,0.0,328.0,2.0,,0.0,
50%,28.0,,,14.4542,,0.0,655.0,3.0,,0.0,
75%,39.0,,,31.275,,0.0,982.0,3.0,,1.0,


In [7]:
min_age = titanic_data["Age"].min()
titanic_data[titanic_data["Age"] == min_age]

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Ticket
354,0.17,,S,20.575,"Dean, Miss. Elizabeth Gladys ""Millvina""",2,1246,3,female,1,C.A. 2315


In [8]:
titanic_data["Sex"].value_counts()

Sex
male      843
female    466
Name: count, dtype: int64

- Sex has 2 values: male and female.
- All rows have a value for this column.
- There are nearly twice as many males as females.
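
To put a number on that comparison, the counts can be turned into shares and a ratio. This is a small sketch that reuses the counts printed above rather than reading the data; the variable names are illustrative only:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
sex_counts = pd.Series({"male": 843, "female": 466})

# Share of each value; on the real column, value_counts(normalize=True) does the same.
sex_share = sex_counts / sex_counts.sum()

# Ratio of males to females.
male_to_female = sex_counts["male"] / sex_counts["female"]

print(sex_share.round(3))
print(round(male_to_female, 2))
```

The ratio comes out at roughly 1.8 males per female, with males making up about 64% of passengers.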

In [9]:
titanic_data["Embarked"].value_counts()

Embarked
S    914
C    270
Q    123
Name: count, dtype: int64

Embarked is the port of embarkation. It is a categorical feature with 3 unique values (C, Q or S):
1. C = Cherbourg
2. Q = Queenstown
3. S = Southampton

## Missing Values

Handling missing values is an important part of Exploratory Data Analysis because missing data can impact the quality and accuracy of the analysis. There are several approaches to deal with missing values depending on the context and the nature of the data.

In pandas, `isna().sum()` gives the number of missing values in each column.

`info()` can also be used to count the non-null rows.

You may also see `isnull()` being used rather than `isna()`. `isnull()` is an alias of `isna()` and both do the same thing.
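
As a quick illustration, the sketch below counts missing values with both methods and checks that the results match. It uses a small made-up DataFrame (`toy` and its columns are hypothetical, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Titanic data.
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0], "cabin": [None, None, "C85"]})

# Count missing values per column with both spellings.
missing_isna = toy.isna().sum()
missing_isnull = toy.isnull().sum()

print(missing_isna)
print(missing_isna.equals(missing_isnull))
```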



In [10]:
titanic_data.isna().sum()

Age             263
Cabin          1014
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Ticket            0
dtype: int64

Determining whether missing data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR) is critical for selecting appropriate strategies for handling it. These categories describe the mechanism behind the missing data, and understanding them helps inform how you should treat the missing values.

**Missing Completely at Random (MCAR)**
The probability of a data point being missing is entirely independent of both observed and unobserved data. In other words, the missingness is purely random and doesn't depend on any variables in the dataset.
*Example:* A random system failure causes some sensor readings to be missing, irrespective of any observed or unobserved features.

**Missing at Random (MAR)**
The probability of missingness is related to observed data but not to the value of the missing data itself. In other words, missing values may depend on some other measured variables, but not on the variable that is missing.
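*Example:* In a survey, respondents in certain job positions or with certain education levels (both recorded) are less likely to disclose their income. Within each of those groups, whether income is missing does not depend on the income value itself, so the missingness can be explained by the observed variables.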

**Missing Not at Random (MNAR)**
The probability of missingness is related to the value of the missing data itself. In this case, the fact that the data is missing is informative and reveals something about the missing data.
*Example:* In a job satisfaction survey, employees who are unhappy with their job are more likely to leave the satisfaction question blank. The missingness is directly related to the value that would have been recorded (their dissatisfaction), making it MNAR.
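
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: it uses synthetic data rather than the Titanic data, and the `education` and `income` columns, the probabilities, and the thresholds are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: income depends on years of education plus noise.
education = rng.integers(8, 21, size=n)
income = 2_000 * education + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"education": education, "income": income})

# MCAR: every row has the same 20% chance of losing its income value.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of missing income depends only on the observed education column.
mar_mask = rng.random(n) < np.where(df["education"] >= 16, 0.4, 0.1)

# MNAR: the chance of missing income depends on the income value itself.
mnar_mask = rng.random(n) < np.where(df["income"] > df["income"].median(), 0.4, 0.1)

print(f"True mean income: {df['income'].mean():,.0f}")
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = df.loc[~mask, "income"].mean()
    print(f"{name}: missing rate {mask.mean():.2f}, observed mean income {observed_mean:,.0f}")
```

Under MCAR the mean of the observed incomes should stay close to the true mean. Under MAR and MNAR it drifts downwards in this setup. The difference is that in the MAR case the drift can be accounted for using the observed `education` column, whereas in the MNAR case the information needed to correct it is exactly what is missing.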

In [11]:
titanic_data.shape[0]

1309