In [89]:
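# Import needed libraries and load cleaned data for feature engineering
import sys
# Make the notebooks folder importable so auto_imports can be found
# (sys.path.append returns None, so its result is not assigned)
sys.path.append(r'e:\Data science\Titanic dataset\notebooks')

from auto_imports import *  # supplies pd and np, which are used below

# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'E:\Data science\Titanic dataset\data\Processed data\Data Analysis\cleaned data.csv')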

### Feature engineering

This notebook creates new, more useful features from the Titanic dataset. By transforming columns like Cabin, Name, and Ticket, we help the model find patterns that can improve survival predictions. Each step below explains what was done and why it matters.

In [90]:
# Show first rows of dataframe to understand the structure
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,ind
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,C23 C25 C27,S,train
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,train
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,C23 C25 C27,S,train
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,train
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,C23 C25 C27,S,train


#### Cabin Feature Engineering

We split the Cabin column into two new features:
- **Cabin letter**: Shows which part of the ship the passenger's cabin was in (for example, 'B' or 'C'). This can reveal if location affected survival.
- **Cabin cell number**: Counts how many separate rooms or cells a passenger had. More rooms might mean higher social status.

This helps the model understand if where someone stayed on the ship influenced their chances.
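
The split described above can be sketched on a few toy cabin strings. This is a minimal, vectorized illustration and not the notebook's own cell: the sample values and the use of `str.findall` with `dict.fromkeys` are assumptions chosen for the example.

```python
import pandas as pd

# Toy cabin strings (illustrative values, not read from the dataset file)
cabins = pd.Series(["C23 C25 C27", "C85", "F G73", "T"])

# Cell count: drop the letters, then count the whitespace-separated numbers
cells_count = (
    cabins.str.replace(r"[A-Za-z]", "", regex=True)
    .str.split()
    .str.len()
)

# Cabin letter: distinct letters in order of first appearance, joined by '-'
cabin_letter = cabins.str.findall(r"[A-Za-z]").apply(
    lambda letters: "-".join(dict.fromkeys(letters))
)

print(cells_count.tolist())   # [3, 1, 1, 0]
print(cabin_letter.tolist())  # ['C', 'C', 'F-G', 'T']
```

A cabin with a letter but no number, such as 'T', gets a cell count of 0, which is why 0 can show up among the counts.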

In [91]:
# Show Cabin column values for inspection
df['Cabin']

0       C23 C25 C27
1               C85
2       C23 C25 C27
3              C123
4       C23 C25 C27
           ...     
1304    C23 C25 C27
1305           C105
1306    C23 C25 C27
1307    C23 C25 C27
1308    C23 C25 C27
Name: Cabin, Length: 1309, dtype: object

In [92]:
# Collapse repeated cabin letters.
# After the digits are removed a value can look like 'B B B B'; this function keeps
# each letter once ('B') and joins distinct letters with '-' ('F G' -> 'F-G').
def remove_duplicate_letters(text):
    seen = set()
    result = ''
    for char in text:
        if char not in seen:
            seen.add(char)
            result += char
    return result.strip().replace(' ', '-')

# Split the Cabin column into a cell count and a cabin letter.
# cells_count: how many numbered cabins (cells) are listed for each passenger.
cell_number_count = df['Cabin'].replace(r'[a-zA-Z]', '', regex=True).str.strip().str.split().str.len()
print(cell_number_count.head())
df['cells_count'] = cell_number_count
# Cabin_letter: the letter(s) left once the digits are removed, with duplicates collapsed.
df['Cabin_letter'] = df['Cabin'].replace(r'[0-9]', '', regex=True).apply(remove_duplicate_letters)

0    3
1    1
2    3
3    1
4    3
Name: Cabin, dtype: int64


In [93]:
# Show unique cabin letters to check extraction
# This helps us see all the different cabin areas on the ship.
df['Cabin_letter'].unique() 

array(['C', 'E', 'G', 'D', 'A', 'B', 'F', 'F-G', 'F-E', 'T'], dtype=object)

In [94]:
# Show unique cell numbers to check extraction
# This shows the variety in how many rooms passengers had.
df['cells_count'].unique()

array([3, 1, 2, 0, 4])

In [95]:
# Count missing cell numbers
# We check for missing values to make sure our new feature is complete.
df['cells_count'].isna().sum()

np.int64(0)

In [96]:
# Drop Cabin column after splitting
# The original Cabin column is no longer needed after creating the new features.
df.drop('Cabin' , axis =1 , inplace=True)

In [97]:
# Show first rows after cabin processing
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,ind,cells_count,Cabin_letter
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,train,3,C
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,train,1,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,train,3,C
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,train,1,C
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,train,3,C


#### Name Feature Engineering

Two features can be drawn from the Name column:
- **Family name**: The last name, which can help identify family groups.
- **Title**: Words like Mr., Mrs., Miss, which give clues about age, gender, and social status.

These features help the model understand relationships and social roles, which can affect survival. Only Title is added to the dataframe in this notebook. The family name is not stored as a column.

In [98]:
# Show first name value for inspection
# This lets us see how names are structured before extracting features.
df['Name'][0]

'Braund, Mr. Owen Harris'

- The family name is the text before the comma, and the title is the word that follows it, ending in a period.
- Names with brackets need a check, since the bracketed part may hold a maiden name or nickname.

Looking at these special cases first helps keep the extraction accurate.
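
As a rough sketch of the name parsing described above, the snippet below pulls the family name, the title, and a bracket flag from two names shown earlier in this notebook. The variable names (`family_name`, `title`, `has_brackets`) are hypothetical and are not columns created by the notebook.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Family name: everything before the first comma
family_name = names.str.split(",").str[0].str.strip()

# Title: the text between the comma and the next period
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Flag names that carry a bracketed part (maiden name or nickname)
has_brackets = names.str.contains(r"\(", regex=True)

print(family_name.tolist())   # ['Braund', 'Futrelle']
print(title.tolist())         # ['Mr', 'Mrs']
print(has_brackets.tolist())  # [False, True]
```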

#### Title Extraction

We extract titles (like Mr., Mrs., Miss) from the Name column. Titles can reveal age, gender, and social status, all of which may influence survival.

In [99]:
# Show value counts of titles extracted from Name
# This helps us see which titles are common and which are rare.
titles = df['Name'].str.split(r'[,.]').str[1].str.strip()
titles.value_counts()

Name
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Major             2
Mlle              2
Ms                2
Mme               1
Don               1
Sir               1
Lady              1
Capt              1
the Countess      1
Jonkheer          1
Dona              1
Name: count, dtype: int64

In [100]:
# Many titles are rare, so combine them into a single category called 'Others'
# Grouping rare titles keeps the model from being distracted by uncommon categories.
title_counts = titles.value_counts()
others = title_counts[title_counts < 61]
titles = titles.replace(others.index , 'Others') # replace every title that occurs fewer than 61 times

titles.value_counts()

Name
Mr        757
Miss      260
Mrs       197
Master     61
Others     34
Name: count, dtype: int64
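
The same grouping idea can be written with a named threshold instead of a hard-coded count. The sketch below uses toy data, and `min_count` is a hypothetical parameter picked for this example, not a value from the notebook.

```python
import pandas as pd

# Toy titles (illustrative only)
titles = pd.Series(["Mr"] * 5 + ["Miss"] * 3 + ["Dr", "Rev"])

min_count = 3  # hypothetical threshold for this toy data
counts = titles.value_counts()
rare = counts[counts < min_count].index

# Keep frequent titles as they are and replace the rest with one label
grouped = titles.where(~titles.isin(rare), "Others")

print(grouped.value_counts().to_dict())  # {'Mr': 5, 'Miss': 3, 'Others': 2}
```

Naming the threshold makes it easier to see how the category count changes when the cut-off moves.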

In [101]:
# Extract Title from Name
# We add the cleaned title as a new feature.
df['Title'] = titles

#### Extracting Family Size and Is Alone

We create two new features:
- **FamilySize**: Number of relatives on board, computed as SibSp + Parch (siblings and spouses plus parents and children). The passenger is not counted.
- **Is Alone**: Whether the passenger was traveling alone (1) or with family (0).

Traveling with family could affect survival chances, so these features are important.

In [102]:
# Family size = parents/children + siblings/spouses; Is Alone is derived from it
familysize = df['Parch'] + df['SibSp']

df['FamilySize'] = familysize
df['Is Alone'] = (df['FamilySize'] == 0).astype(int) # a family size of 0 means the passenger is alone

In [103]:
# Drop Name column after extracting features
# The original Name column is no longer needed.
df.drop('Name' , axis = 1 , inplace=True)

In [104]:
# Show first rows after name processing
# This lets us check our new features.
df.head() 

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,ind,cells_count,Cabin_letter,Title,FamilySize,Is Alone
0,1,0.0,3,male,22.0,1,0,A/5 21171,7.25,S,train,3,C,Mr,1,0
1,2,1.0,1,female,38.0,1,0,PC 17599,71.2833,C,train,1,C,Mrs,1,0
2,3,1.0,3,female,26.0,0,0,STON/O2. 3101282,7.925,S,train,3,C,Miss,0,1
3,4,1.0,1,female,35.0,1,0,113803,53.1,S,train,1,C,Mrs,1,0
4,5,0.0,3,male,35.0,0,0,373450,8.05,S,train,3,C,Mr,0,1


#### Ticket Feature Engineering

We extract several new features from the Ticket column:
- **Is Special Agent/route**: Whether the ticket has a special code, which might indicate a group or agency booking.
- **TicketBatch**: A rough guess at whether the ticket was issued early or late, based on the length of the ticket number (fewer than 5 characters counts as early).
- **SharedTicket**: Whether the ticket was shared by more than one passenger, which could mean group travel.

These features help the model understand group dynamics and booking patterns.

In [105]:
# Split Ticket column for further processing
df['Ticket'].str.split()

0                [A/5, 21171]
1                 [PC, 17599]
2         [STON/O2., 3101282]
3                    [113803]
4                    [373450]
                ...          
1304             [A.5., 3236]
1305              [PC, 17758]
1306    [SOTON/O.Q., 3101262]
1307                 [359309]
1308                   [2668]
Name: Ticket, Length: 1309, dtype: object

In [106]:
# Extract special agent/route from Ticket
# This finds special codes in the ticket, which might be important for survival.
clean_agents = df['Ticket'].str.extract(r'^([A-Za-z/.]+\d*)', expand=False).fillna('Non-special').replace('[.]' , '' , regex=True)

In [107]:
# Show unique special agent/route values
# This checks the variety of special codes.
clean_agents.unique()

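# The extraction works, though some codes differ only in case or punctuation
# (for example 'SC/Paris' vs 'SC/PARIS', 'A/5' vs 'A5')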

array(['A/5', 'PC', 'STON/O2', 'Non-special', 'PP', 'CA', 'SC/Paris',
       'SC/A4', 'A/4', 'SP', 'SOC', 'SO/C', 'W/C', 'SOTON/OQ', 'WEP',
       'STON/O', 'A4', 'C', 'SC/PARIS', 'SOP', 'A5', 'Fa', 'LINE', 'FCC',
       'SW/PP', 'SCO/W', 'P/PP', 'SC', 'SC/AH', 'A/S', 'WE/P', 'SO/PP',
       'FC', 'SOTON/O2', 'CA/SOTON', 'SC/A3', 'STON/OQ', 'AQ/4', 'A',
       'LP', 'AQ/3'], dtype=object)

In [108]:
# Add is special agent/route column
# We add a new column to indicate if a ticket is special.
df['Is Special Agent/route'] = (clean_agents != 'Non-special').astype(int)
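
To make the prefix extraction easier to follow, here is a small sketch that applies the same regular expression to a handful of ticket strings that appear in this notebook's outputs. The variable names `prefix` and `is_special` are chosen for the example.

```python
import pandas as pd

tickets = pd.Series([
    "A/5 21171",
    "PC 17599",
    "STON/O2. 3101282",
    "113803",
    "S.O./P.P. 3",
])

# Leading letters, slashes and dots, optionally followed by digits;
# purely numeric tickets do not match and are labeled 'Non-special'
prefix = (
    tickets.str.extract(r"^([A-Za-z/.]+\d*)", expand=False)
    .fillna("Non-special")
    .str.replace(".", "", regex=False)
)

is_special = (prefix != "Non-special").astype(int)

print(prefix.tolist())      # ['A/5', 'PC', 'STON/O2', 'Non-special', 'SO/PP']
print(is_special.tolist())  # [1, 1, 1, 0, 1]
```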

Create a TicketBatch column to indicate early or late ticket batches based on ticket digit length. We use the length of the ticket number to guess if it was issued early or late.

In [109]:
# Extract ticket digits for batch classification
ticket_digits = df['Ticket'].str.split().str[-1]
ticket_digits.str.len().unique()

array([5, 7, 6, 4, 3, 1])

In [110]:
# Show ticket digits with length 1 for correction
ticket_digits[ticket_digits.str.len() == 1]

772     3
841     3
1077    2
1193    2
Name: Ticket, dtype: object

In [111]:
# Look up the full ticket strings for those rows
df.loc[ticket_digits.str.len() == 1, 'Ticket'].values

array(['S.O./P.P. 3', 'S.O./P.P. 3', 'S.O./P.P. 2', 'S.O./P.P. 2'],
      dtype=object)

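These single-character values are the real numeric parts of the tickets ('S.O./P.P. 3' and 'S.O./P.P. 2'), so the extraction is working as intended.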

In [112]:
# Show unique ticket digits
ticket_digits.unique()

array(['21171', '17599', '3101282', '113803', '373450', '330877', '17463',
       '349909', '347742', '237736', '9549', '113783', '2151', '347082',
       '350406', '248706', '382652', '244373', '345763', '2649', '239865',
       '248698', '330923', '113788', '347077', '2631', '19950', '330959',
       '349216', '17601', '17569', '335677', '24579', '17604', '113789',
       '2677', '2152', '345764', '2651', '7546', '11668', '349253',
       '2123', '330958', '23567', '370371', '14311', '2662', '349237',
       '3101295', '39886', '17572', '2926', '113509', '19947', '31026',
       '2697', '34651', '2144', '2669', '113572', '36973', '347088',
       '17605', '2661', '29395', '3464', '3101281', '315151', '33111',
       '14879', '2680', '1601', '348123', '349208', '374746', '248738',
       '364516', '345767', '345779', '330932', '113059', '14885',
       '3101278', '6608', '392086', '343275', '343276', '347466', '5734',
       '2315', '364500', '374910', '17754', '17759', '231919', '244

In [113]:
# Create TicketBatch column: 1 for early, 0 for late
df['TicketBatch'] = np.where(ticket_digits.str.len() < 5 , 1,0)

1: Early (shorter ticket numbers)
0: Late (longer ticket numbers)
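
The length rule can be checked on a few sample tickets. This is a sketch under stated assumptions: the bare 'LINE' ticket is included as an assumed edge case with no numeric part, and the `has_number` check is a hypothetical addition that the notebook does not perform.

```python
import numpy as np
import pandas as pd

tickets = pd.Series(["A/5 21171", "2668", "STON/O2. 3101282", "S.O./P.P. 3", "LINE"])

# Last whitespace-separated token is treated as the ticket number
number_part = tickets.str.split().str[-1]

# 1 = early (fewer than 5 characters), 0 = late
ticket_batch = np.where(number_part.str.len() < 5, 1, 0)

# Hypothetical guard: does the token really consist of digits?
has_number = number_part.str.isdigit()

print(ticket_batch.tolist())  # [0, 1, 0, 1, 1]
print(has_number.tolist())    # [True, True, True, True, False]
```

A token without digits still gets a batch value from its length, so the flag is best read as a heuristic.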

Create a SharedTicket column (1 or 0) to indicate if a ticket is shared by more than one passenger. This helps us find passengers who traveled together on the same ticket.

In [114]:
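# Create SharedTicket column: 1 if the ticket already appeared in an earlier row, else 0
# Note: duplicated() defaults to keep='first', so the first holder of a shared ticket gets 0;
# duplicated(keep=False) would flag every holder of a shared ticket.
df['SharedTicket'] = np.where(df['Ticket'].duplicated() , 1 , 0)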
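
The effect of the `keep` argument of `duplicated` is easiest to see on a toy Series. The sketch below compares the default with `keep=False` and also shows a group-size alternative. The names `later_only`, `all_holders`, and `group_size` are hypothetical and not part of the notebook.

```python
import pandas as pd

tickets = pd.Series(["113803", "PC 17599", "113803", "2668"])

# Default keep='first': only the second and later occurrences are flagged
later_only = tickets.duplicated().astype(int)

# keep=False: every passenger holding a shared ticket is flagged
all_holders = tickets.duplicated(keep=False).astype(int)

# Alternative: number of passengers on each ticket
group_size = tickets.map(tickets.value_counts())

print(later_only.tolist())   # [0, 0, 1, 0]
print(all_holders.tolist())  # [1, 0, 1, 0]
print(group_size.tolist())   # [2, 1, 2, 1]
```

If the goal is to mark everyone who traveled on a shared ticket, `keep=False` or the group size matches that intent more closely than the default.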

In [115]:
# Drop Ticket column after feature extraction
# The original Ticket column is no longer needed.
df.drop('Ticket' , axis = 1 , inplace=True)

In [116]:
# Show random sample of dataframe to inspect new features
df.sample(10)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,ind,cells_count,Cabin_letter,Title,FamilySize,Is Alone,Is Special Agent/route,TicketBatch,SharedTicket
846,847,0.0,3,male,29.881138,8,2,69.55,S,train,3,C,Mr,10,0,1,1,1
523,524,1.0,1,female,44.0,0,1,57.9792,C,train,1,B,Mrs,1,0,0,0,1
276,277,0.0,3,female,45.0,0,0,7.75,S,train,3,C,Miss,0,1,0,0,0
404,405,0.0,3,female,20.0,0,0,8.6625,S,train,3,C,Miss,0,1,0,0,0
571,572,1.0,1,female,53.0,2,0,51.4792,S,train,1,C,Mrs,2,0,0,0,0
284,285,0.0,1,male,29.881138,0,0,26.0,S,train,1,A,Mr,0,1,0,0,0
19,20,1.0,3,female,29.881138,0,0,7.225,C,train,3,C,Mrs,0,1,0,1,0
1262,1263,,1,female,31.0,0,0,134.5,C,test,2,E,Miss,0,1,0,0,1
558,559,1.0,1,female,39.0,1,1,79.65,S,train,1,E,Mrs,2,0,0,0,1
394,395,1.0,3,female,24.0,0,2,16.7,S,train,1,G,Mrs,2,0,1,1,1


In [117]:
# Show dataframe info to check new columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   PassengerId             1309 non-null   int64  
 1   Survived                891 non-null    float64
 2   Pclass                  1309 non-null   int64  
 3   Sex                     1309 non-null   object 
 4   Age                     1309 non-null   float64
 5   SibSp                   1309 non-null   int64  
 6   Parch                   1309 non-null   int64  
 7   Fare                    1309 non-null   float64
 8   Embarked                1307 non-null   object 
 9   ind                     1309 non-null   object 
 10  cells_count             1309 non-null   int64  
 11  Cabin_letter            1309 non-null   object 
 12  Title                   1309 non-null   object 
 13  FamilySize              1309 non-null   int64  
 14  Is Alone                1309 non-null   

In [118]:
# Check unique values in Embarked column
df['Embarked'].unique() 

array(['S', 'C', 'Q', nan], dtype=object)

In [119]:
# Find the rows where Embarked is missing (they are filled with the mode below)
df[df['Embarked'].isna()]['Embarked'].index

Index([61, 829], dtype='int64')

In [120]:
df['Embarked'].mode() # most frequent embarkation port

0    S
Name: Embarked, dtype: object

In [121]:
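# Fill missing values with the mode ('S'); assigning the result back avoids relying on
# an inplace call on a column selection, which newer pandas versions warn about
df['Embarked'] = df['Embarked'].fillna('S')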

In [122]:
df['Embarked'].isna().sum()

np.int64(0)
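
The fill can also be written so the mode is computed rather than typed in by hand. This is a minimal sketch on toy data, and the variable names are chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy embarkation values with two gaps (illustrative only)
embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan])

# mode() ignores missing values by default; [0] takes the first mode
most_common = embarked.mode()[0]

# fillna returns a new Series, so the result is assigned to a name
filled = embarked.fillna(most_common)

print(most_common)               # S
print(int(filled.isna().sum()))  # 0
```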

In [123]:
# Save the processed dataframe with new features
# Raw string so the backslashes in the Windows path are not read as escape sequences
df.to_csv(r'E:\Data science\Titanic dataset\data\Processed data\Data Analysis\processed_data.csv' , index=False)

#### Summary

- We created new features from Cabin, Name, and Ticket columns to give the model more useful information.
- These features help the model understand social status, family groups, and travel patterns, which can all affect survival.
- The processed data, with the two missing Embarked values filled with the mode ('S'), is now ready for further analysis or modeling.