# Kaggle - Titanic - Machine Learning from Disaster

A task as part of my data science class:

Requirements:

1. Add at least 2 new features to the dataset (explain your reasoning below)
2. Use KNN (and only KNN) to predict survival
3. Explain your process below and choice of K
4. Make a submission to the competition and provide a link to your submission below.
5. Show your code below

Let's start by reading in our data.

In [5]:
import pandas as pd

train_Data_DF = pd.read_csv("./Data/train.csv")
test_Data_DF = pd.read_csv("./Data/test.csv")

Let's engineer three new features:

- Title from Name: 

    Extract titles (Mr, Mrs, Miss, Master, etc.) from the passenger names. The title can indicate social status, gender, and marital status, which might correlate with survival chances. For example, women (Mrs, Miss) and nobility (titles indicating a higher social rank) might have had higher priority for lifeboats.

- Family Size: 

    Combine SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard) into a new feature that represents the size of the passenger's family group on board, counting the passenger. This could affect survival, as those with families might have prioritized keeping their family together or ensuring their family's safety over their own.

- IsAlone: 

    The intuition behind creating an IsAlone feature is that the survival chances might differ between passengers who were traveling alone and those who were with family. Being alone or with family could impact a passenger's mobility, decision-making, and access to resources during the evacuation.



In [6]:
# Extract titles from the Name column
# A raw string (r'...') keeps the backslash in \. from being read as a Python escape sequence
train_Data_DF['Title'] = train_Data_DF['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test_Data_DF['Title'] = test_Data_DF['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)


- train_Data_DF['Name'] accesses the Name column in the training DataFrame.

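- .str.extract(' ([A-Za-z]+)\.', expand=False) applies the regular expression to each name and extracts the title. With a single capture group, expand=False returns a Series rather than a one-column DataFrame.
    - (space): The match must begin with a space. Names follow the pattern "Surname, Title. Given names", so the title always comes after the space that follows the comma, never at the start of the surname.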

    - ([A-Za-z]+): This part captures one or more (+) alphabetical characters (A-Za-z). This is where the title will be matched, as titles are made up of letters only.
    
    - \.: This looks for a literal period (.). Titles in the dataset are followed by a period (e.g., "Mr."), making this a reliable way to end the capture.

- The extracted title is then assigned to a new column in the DataFrame called 'Title'.
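To see the pattern at work outside of pandas, here is a minimal sketch using Python's built-in re module. The helper name extract_title and the sample strings are illustrative assumptions written in the dataset's "Surname, Title. Given names" format; re.search returns the first match, which mirrors what .str.extract does with a single capture group.

```python
import re

# Same pattern as in the cell above, written as a raw string
TITLE_PATTERN = re.compile(r' ([A-Za-z]+)\.')

def extract_title(name):
    """Return the first word that follows a space and ends with a period, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

# Illustrative names in the "Surname, Title. Given names" format
sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]

titles = [extract_title(name) for name in sample_names]
print(titles)  # ['Mr', 'Miss', 'Countess']
```

In the third name, "the" is skipped because it is followed by a space rather than a period, so the first full match is "Countess".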

In [7]:
for df in [train_Data_DF, test_Data_DF]:
    df['Title'] = df['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],
        'Rare',
    )

    df['Title'] = df['Title'].replace('Mlle', 'Miss')
    df['Title'] = df['Title'].replace('Ms', 'Miss')
    df['Title'] = df['Title'].replace('Mme', 'Mrs')


This cell does two things:

- It first replaces rare titles (like 'Lady', 'Countess', 'Capt', 'Col', etc.) with 'Rare'. This groups titles of nobility and uncommon professional titles into a single 'Rare' category. Each of these titles appears only a few times, so giving each its own one-hot column would add sparse, noisy dimensions to the distance calculation KNN relies on.

- It maps alternative forms to their common equivalents, converting 'Mlle' (Mademoiselle) and 'Ms' to 'Miss', and 'Mme' (Madame) to 'Mrs', to ensure consistency in the dataset.
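The same grouping can also be expressed as a single lookup table. The following is a sketch on a made-up Series of titles, not the notebook's actual code; the names rare_titles, title_map, and grouped are assumptions introduced for illustration.

```python
import pandas as pd

# Made-up titles for illustration only
titles = pd.Series(['Mr', 'Mlle', 'Ms', 'Mme', 'Dr', 'Countess', 'Master'])

rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
               'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']

# One dictionary holds both steps: rare titles -> 'Rare', variants -> common form
title_map = {title: 'Rare' for title in rare_titles}
title_map.update({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

# Titles that are not keys in the dictionary (Mr, Master) are left unchanged
grouped = titles.replace(title_map)
print(grouped.tolist())
# ['Mr', 'Miss', 'Miss', 'Mrs', 'Rare', 'Rare', 'Master']
```

Keeping the mapping in one dictionary makes it easy to apply exactly the same rules to both the training and test DataFrames.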

In [8]:
for df in [train_Data_DF, test_Data_DF]:
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

FamilySize is a combination of SibSp and Parch plus 1 (for the passenger themselves). This feature can be useful to understand if having family members on board affects a passenger's survival rate.
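One way to check whether the feature carries signal is to compare the mean of Survived across family sizes. The sketch below uses a tiny made-up DataFrame (the values are not from the Titanic data) just to show the mechanics; on the real data the same groupby could be run on train_Data_DF.

```python
import pandas as pd

# Made-up rows for illustration; not real passenger records
toy = pd.DataFrame({
    'SibSp':    [1, 0, 0, 3],
    'Parch':    [0, 0, 2, 1],
    'Survived': [0, 1, 1, 0],
})

# Same formula as the cell above: relatives aboard plus the passenger
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# Mean of a 0/1 column is the survival rate within each group
survival_by_size = toy.groupby('FamilySize')['Survived'].mean()
print(survival_by_size)
```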

In [9]:
for df in [train_Data_DF, test_Data_DF]:
    df['IsAlone'] = 0 # Initially, assume no passengers are alone
    df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1 # If FamilySize is 1, the passenger is alone


- df['IsAlone'] = 0 initializes a new column IsAlone for every passenger in the DataFrame, setting it to 0 by default, indicating that passengers are not alone.

- df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1 selects the passengers whose FamilySize equals 1, meaning they have no family members aboard, and sets their IsAlone value to 1 to mark them as traveling alone.
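The two assignments can also be collapsed into one vectorized expression. This is an alternative sketch on a made-up DataFrame, with IsAlone_alt as a hypothetical column name used only to compare the two approaches side by side.

```python
import pandas as pd

# Made-up family sizes for illustration
toy = pd.DataFrame({'FamilySize': [1, 2, 1, 5]})

# Two-step version, as in the cell above
toy['IsAlone'] = 0
toy.loc[toy['FamilySize'] == 1, 'IsAlone'] = 1

# One-step version: the comparison yields True/False, astype(int) turns it into 1/0
toy['IsAlone_alt'] = (toy['FamilySize'] == 1).astype(int)

print(toy)
```

Both columns hold the same values, so the choice between the two is a matter of readability.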


### Applying One-Hot Encoding to the Titanic Dataset

One-hot encoding is a common method to convert categorical data into a numerical format. It creates new columns for each category of the variable, with a 1 indicating the presence of the category and 0 indicating its absence for each row. This is particularly useful for non-ordinal categorical variables where no inherent order exists between the categories (e.g., Embarked).
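As a small illustration of the idea, the sketch below one-hot encodes a made-up Embarked column. The dtype=int argument is an assumption added here so the output shows 1 and 0; recent pandas versions return True/False by default.

```python
import pandas as pd

# Made-up embarkation ports for illustration
ports = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One new column per category; dtype=int gives 1/0 instead of True/False
encoded = pd.get_dummies(ports, columns=['Embarked'], prefix='Embarked', dtype=int)
print(encoded)
```

Each row has exactly one 1, in the column that matches its original category.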

The Sex column is easy to convert into numeric format because it has only two categories in this dataset (male and female).

In [10]:
train_Data_DF = pd.get_dummies(train_Data_DF, columns=['Sex'], drop_first=True)
test_Data_DF = pd.get_dummies(test_Data_DF, columns=['Sex'], drop_first=True)

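drop_first=True drops the first dummy column to avoid redundancy. For a binary category like Sex, one column is enough: get_dummies orders the categories alphabetically, so Sex_female is dropped and the remaining Sex_male column marks male passengers with 1 (True) and female passengers with 0 (False).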
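A quick sketch on made-up rows shows which column survives drop_first=True. As before, dtype=int is an assumption added for readable 1/0 output.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# Categories are sorted (female, male), and the first one is dropped
encoded = pd.get_dummies(toy, columns=['Sex'], drop_first=True, dtype=int)

print(encoded.columns.tolist())      # ['Sex_male']
print(encoded['Sex_male'].tolist())  # [1, 0, 0]
```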

In [11]:
train_Data_DF = pd.concat([train_Data_DF, pd.get_dummies(train_Data_DF['Title'], prefix='Title')], axis=1)
test_Data_DF = pd.concat([test_Data_DF, pd.get_dummies(test_Data_DF['Title'], prefix='Title')], axis=1)

- pd.concat([...], axis=1): The pd.concat() function is used to concatenate pandas objects along a particular axis. Here's what the parameters mean:

    - The first parameter is a list of DataFrames to concatenate. In this case, we concatenate the original DataFrame (train_Data_DF or test_Data_DF) with the new DataFrame of dummy variables created from the Title column.

    - axis=1 tells pandas to concatenate columns, not rows. When concatenating DataFrames, axis=0 would stack the DataFrames on top of each other, increasing the number of rows. axis=1 places the new columns from the second DataFrame (the one-hot encoded titles) alongside the existing columns of the first DataFrame.
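The difference between the two axes is easiest to see on a tiny example. The sketch below uses a made-up Title column; the variable names are illustrative.

```python
import pandas as pd

# Made-up titles for illustration
toy = pd.DataFrame({'Title': ['Mr', 'Miss', 'Mr']})
dummies = pd.get_dummies(toy['Title'], prefix='Title', dtype=int)

# axis=1: dummy columns are placed next to the original column, row count unchanged
side_by_side = pd.concat([toy, dummies], axis=1)
print(side_by_side.shape)             # (3, 3)
print(side_by_side.columns.tolist())  # ['Title', 'Title_Miss', 'Title_Mr']

# axis=0: rows are stacked, and cells with no matching column become NaN
stacked = pd.concat([toy, dummies], axis=0)
print(stacked.shape)                  # (6, 3)
```

Note that with this approach the original text column Title is still present after the concatenation, so it would need to be dropped or excluded before the features are passed to KNN, which only works with numeric inputs.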


In [12]:
train_Data_DF = pd.get_dummies(train_Data_DF, columns=['Embarked'], prefix='Embarked')
test_Data_DF = pd.get_dummies(test_Data_DF, columns=['Embarked'], prefix='Embarked')
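Because the training and test sets are encoded separately, a category that appears in only one of them would produce different columns in each. One possible safeguard, sketched here on made-up data and not part of the original notebook, is to reindex the encoded test features to the training feature columns.

```python
import pandas as pd

# Made-up data: the test set happens to contain no 'Q' passengers
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})
test = pd.DataFrame({'Embarked': ['S', 'C']})

train_enc = pd.get_dummies(train, columns=['Embarked'], prefix='Embarked', dtype=int)
test_enc = pd.get_dummies(test, columns=['Embarked'], prefix='Embarked', dtype=int)

# Give the test set the same columns, in the same order, as the training set;
# columns missing from the test set are created and filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```

On the real data this would be applied to the feature columns only, since the training set also contains the Survived target that the test set lacks.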
