# 2. Feature engineering
Feature engineering involves creating or modifying features to improve model performance. In the context of the Titanic dataset, this involves several steps:

### Handling missing values
- Age: The 'Age' column has missing values. You can impute them with the mean, the median, or a more sophisticated method such as K-Nearest Neighbors imputation.
- Embarked: This categorical feature may also have missing values. You can impute them with the most frequent value (mode) of the column.
- Cabin: This column has a large number of missing values. It is usually best either dropped or reduced to a new feature representing the deck.
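
The three options above can be sketched on a tiny made-up sample. The rows below are illustrative, not taken from the dataset. Using a per-class median for 'Age' and the label "U" for an unknown deck are assumptions chosen for this example, one possibility among several.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse the Titanic column names (not real passengers).
sample = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [38.0, np.nan, 22.0, 26.0, np.nan],
    "Embarked": ["C", "S", "S", None, "S"],
    "Cabin": ["C85", None, None, "E46", None],
})

# Age: fill each gap with the median age of the passenger's class (assumed strategy).
class_median_age = sample.groupby("Pclass")["Age"].transform("median")
sample["Age"] = sample["Age"].fillna(class_median_age)

# Embarked: fill gaps with the most frequent port.
sample["Embarked"] = sample["Embarked"].fillna(sample["Embarked"].mode()[0])

# Cabin: keep only the deck letter; "U" is a hypothetical label for "unknown".
sample["Deck"] = sample["Cabin"].str[0].fillna("U")
sample = sample.drop(columns=["Cabin"])

print(sample)
```

Keeping the deck letter preserves the information that a cabin was recorded at all, which a plain drop of 'Cabin' would discard.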



In [None]:
import pandas as pd

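def clean_titanic_data(df):
    """Impute missing values, drop unused columns, and encode 'Sex' as 0/1."""
    df = df.copy()  # work on a copy so the caller's DataFrame is not mutated
    # Age: fill gaps with the column mean
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    # Embarked: fill gaps with the most frequent port, as described above
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    df = df.drop(columns=["PassengerId", "Cabin", "Ticket"])
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    return df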

data = pd.read_csv("Titanic-Dataset.csv")
data = clean_titanic_data(data)
data

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,"Braund, Mr. Owen Harris",0,22.000000,1,0,7.2500,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.000000,1,0,71.2833,C
2,1,3,"Heikkinen, Miss. Laina",1,26.000000,0,0,7.9250,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.000000,1,0,53.1000,S
4,0,3,"Allen, Mr. William Henry",0,35.000000,0,0,8.0500,S
...,...,...,...,...,...,...,...,...,...
886,0,2,"Montvila, Rev. Juozas",0,27.000000,0,0,13.0000,S
887,1,1,"Graham, Miss. Margaret Edith",1,19.000000,0,0,30.0000,S
888,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,29.699118,1,2,23.4500,S
889,1,1,"Behr, Mr. Karl Howell",0,26.000000,0,0,30.0000,C


### Creating new features
- Family Size: Combine 'SibSp' (number of siblings/spouses aboard) and 'Parch' (number of parents/children aboard) to create a new feature called 'FamilySize'. This can provide more insight into the impact of family on survival chances.
- IsAlone: Create a binary feature indicating whether a passenger was traveling alone or not, based on 'FamilySize'.
- Age Group: Categorize the 'Age' feature into groups (e.g., child, young adult, adult, elderly). This can capture non-linear relationships with survival and might be more robust to outliers.


In [None]:
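# Family size = siblings/spouses + parents/children + the passenger themselves
data["FamilySize"] = data["SibSp"] + data["Parch"] + 1
data["IsAlone"] = (data["FamilySize"] == 1).astype(int)
# Age bins are right-inclusive: (0, 9], (9, 17], (17, 39], (39, 59], (59, 120]
age_bins = [0, 9, 17, 39, 59, 120]
labels = ["child", "Teen", "y_adult", "m_adult", "senior"]
data["AgeGroup"] = pd.cut(data["Age"], bins=age_bins, labels=labels)
# One-hot encode the port; drop_first drops 'C', which becomes the baseline
data = pd.get_dummies(data, columns=["Embarked"], drop_first=True)
data.head()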

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,FamilySize,IsAlone,AgeGroup,Embarked_Q,Embarked_S
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,7.25,2,0,y_adult,False,True
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,71.2833,2,0,y_adult,False,False
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,7.925,1,1,y_adult,False,True
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,53.1,2,0,y_adult,False,True
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,8.05,1,1,y_adult,False,True


- Title: Extract titles from the 'Name' column (e.g., Mr., Mrs., Miss, Master). Titles can reveal social status and, consequently, have an impact on survival rates.

In [None]:
data["Title"] = data["Name"].apply(lambda x: x.split(",")[1].split(".")[0].strip())
unique_title = data["Title"].unique()

def simplify_title(title):
    royalty = ['Don', 'Sir', 'Lady', 'the Countess', 'Jonkheer', 'Prince']
    military = ['Major', 'Col', 'Capt']
    professional = ['Dr', 'Rev']
    if title in ['Mme']:
        return 'Mrs'
    elif title in ['Mlle', 'Ms']:
        return 'Miss'
    elif title in royalty:
        return 'Royalty'
    elif title in military:
        return 'Military'
    elif title in professional:
        return 'Professional'
    else:
        return title  # Keep common titles like Mr, Mrs, Miss, Master

# Collapse rare titles, then one-hot encode; drop_first drops 'Master' as the baseline
data["Title_Group"] = data["Title"].apply(simplify_title)
data = pd.get_dummies(data, columns=["Title_Group"], drop_first=True)
data = data.drop(columns=["Name", "Title"])
data.head(5)




Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,FamilySize,IsAlone,AgeGroup,Embarked_Q,Embarked_S,Title_Group_Military,Title_Group_Miss,Title_Group_Mr,Title_Group_Mrs,Title_Group_Professional,Title_Group_Royalty
0,0,3,0,22.0,1,0,7.25,2,0,y_adult,False,True,False,False,True,False,False,False
1,1,1,1,38.0,1,0,71.2833,2,0,y_adult,False,False,False,False,False,True,False,False
2,1,3,1,26.0,0,0,7.925,1,1,y_adult,False,True,False,True,False,False,False,False
3,1,1,1,35.0,1,0,53.1,2,0,y_adult,False,True,False,False,False,True,False,False
4,0,3,0,35.0,0,0,8.05,1,1,y_adult,False,True,False,False,True,False,False,False


In [None]:
df = data.copy()

df = pd.get_dummies(df, columns=["AgeGroup"], drop_first = True)
df.head(4)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,FamilySize,IsAlone,Embarked_Q,...,Title_Group_Military,Title_Group_Miss,Title_Group_Mr,Title_Group_Mrs,Title_Group_Professional,Title_Group_Royalty,AgeGroup_Teen,AgeGroup_y_adult,AgeGroup_m_adult,AgeGroup_senior
0,0,3,0,22.0,1,0,7.25,2,0,False,...,False,False,True,False,False,False,False,True,False,False
1,1,1,1,38.0,1,0,71.2833,2,0,False,...,False,False,False,True,False,False,False,True,False,False
2,1,3,1,26.0,0,0,7.925,1,1,False,...,False,True,False,False,False,False,False,True,False,False
3,1,1,1,35.0,1,0,53.1,2,0,False,...,False,False,False,True,False,False,False,True,False,False


### Transforming Features
- Encoding categorical features: Most machine learning algorithms require numerical input, so categorical features like 'Sex', 'Embarked', and the newly created 'Title' need to be converted to numerical representations. 'Pclass' is already stored as a number but can also be treated as a category. One-hot encoding is a common method for this, creating a new binary column for each category.
- Feature scaling: Numerical features like 'Age' and 'Fare' have different scales, which can bias distance-based and gradient-based models. Standardization rescales a feature to zero mean and unit variance, while Min-Max scaling maps it to a fixed range (e.g., 0 to 1).
- Logarithmic transformation: If a numerical feature like 'Fare' is skewed, a logarithmic transformation can make the distribution more symmetric and reduce the impact of outliers. Use log(1 + x) when the feature can be zero.
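
The scaling and logarithmic steps above can be sketched as follows. The fare values are a small illustrative sample, and the column names 'Fare_log', 'Fare_std', and 'Fare_minmax' are made up for this example. The formulas are written out with pandas and NumPy so that each step stays visible.

```python
import numpy as np
import pandas as pd

# Illustrative fare values, including a zero and a large outlier.
fares = pd.DataFrame({"Fare": [0.0, 7.25, 8.05, 53.1, 512.3292]})

# log1p computes log(1 + x), so a fare of zero is handled safely.
fares["Fare_log"] = np.log1p(fares["Fare"])

# Standardization: subtract the mean and divide by the standard deviation.
log_fare = fares["Fare_log"]
fares["Fare_std"] = (log_fare - log_fare.mean()) / log_fare.std(ddof=0)

# Min-Max scaling: map the smallest value to 0 and the largest to 1.
fares["Fare_minmax"] = (log_fare - log_fare.min()) / (log_fare.max() - log_fare.min())

print(fares.round(3))
```

When a train/test split is used, the mean, standard deviation, minimum, and maximum should be computed on the training portion only and then reused for the test portion, so that no information leaks from the test data.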

### Feature selection
- Dropping irrelevant features: Features like 'PassengerId', 'Name', 'Ticket', and potentially 'Cabin' might not be directly useful for predicting survival and can be dropped.
- Correlation analysis: Use correlation analysis to identify features that are strongly correlated with the target variable ('Survived') and consider removing features that have low correlation or are redundant.
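
A minimal sketch of the correlation analysis, run on a made-up toy table rather than the real data. The 0.8 cutoff for calling two features redundant is an assumption for illustration, not a fixed rule.

```python
import pandas as pd

# Made-up toy rows that reuse the Titanic column names.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex":      [0, 1, 1, 1, 0, 0, 0, 0],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3],
    "Parch":    [0, 0, 0, 0, 0, 0, 2, 1],
})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

corr = toy.corr()

# Rank features by the absolute strength of their correlation with the target.
target_corr = corr["Survived"].drop("Survived").abs().sort_values(ascending=False)
print(target_corr)

# Flag feature pairs that are strongly correlated with each other (assumed cutoff).
REDUNDANCY_CUTOFF = 0.8
features = corr.drop(index="Survived", columns="Survived")
redundant = [
    (a, b)
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > REDUNDANCY_CUTOFF
]
print(redundant)
```

Pearson correlation only measures linear relationships, so a low value does not by itself prove that a feature is useless.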


### Conclusion


By thoroughly applying EDA, you gain a deep understanding of the Titanic dataset and uncover relationships between features and survival. This understanding then guides the feature engineering process, where you create and refine features to improve the performance of your machine learning model. Both steps are essential for building accurate models that predict Titanic passenger survival.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
