**Data Wrangling with the Titanic Dataset using Seaborn**



**Introduction:**

The Titanic dataset is one of the most popular datasets used for learning data analysis and machine learning. It contains information about the passengers aboard the Titanic, including whether they survived or not. In this project, we will perform data wrangling (cleaning, transforming, and organizing data) on the Titanic dataset using Python's seaborn and pandas libraries. By the end of this project, you will have a clean and structured dataset ready for analysis or modeling.

**Step 1: Import Libraries**

We start by importing the necessary libraries.

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np

**Notion:**


*   seaborn is used for visualization and also provides example datasets such as Titanic.
*   pandas is used for data manipulation and analysis.
*   numpy is used for numerical operations.



**Step 2: Load the Dataset**

Load the Titanic dataset from Seaborn.

In [2]:
# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Display the first 5 rows
print(titanic.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


**Notion:**

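* The load_dataset() function fetches the Titanic example dataset from seaborn's online data repository, so the first call needs an internet connection, and returns it as a pandas DataFrame.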

* head() displays the first 5 rows to get a quick overview of the data.

**Step 3: Explore the Dataset**

Before wrangling, let's understand the dataset.

In [3]:
# Check dataset info (info() prints its summary itself and returns None,
# so it does not need to be wrapped in print())
titanic.info()

# Check for missing values
print(titanic.isnull().sum())

# Basic statistics
print(titanic.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
survived         0
pclass           0
sex              0
age            177
...

**Notion:**

* info() provides a summary of the dataset, including column names, data types, and non-null counts.

* isnull().sum() helps identify missing values in each column.

* describe() gives statistical details like mean, min, max, etc., for numerical columns.
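The raw counts from isnull().sum() are often easier to compare as percentages of the total row count. Here is a minimal sketch on a small hand-made frame; the toy values and the name `missing_pct` are illustrative assumptions, not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() yields a boolean frame; the mean of a boolean column is the
# fraction of missing entries, so multiplying by 100 gives a percentage.
missing_pct = toy.isnull().mean() * 100

# Put the columns with the most missing data first.
missing_pct = missing_pct.sort_values(ascending=False)
print(missing_pct)
```

The same two lines should work unchanged on the real `titanic` frame, where they would put `deck` at the top of the list.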

**Step 4: Handle Missing Values**

Missing data can affect analysis, so we need to handle it.

In [4]:
# Fill missing 'age' with the median value.
# Assign the result back rather than calling fillna(..., inplace=True) on a
# column selection, which pandas warns will stop working in pandas 3.0.
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Fill missing 'embarked' with the most frequent value
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

# Drop the 'deck' column as it has too many missing values
titanic.drop(columns=['deck'], inplace=True)

# Check if missing values are handled
print(titanic.isnull().sum())

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64


**Notion:**

* For numerical columns like age, we fill missing values with the median.

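* For categorical columns like embarked, we use the mode (most frequent value). Note that embark_town, which holds the same information as full town names, was not filled and still shows 2 missing values in the output above.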

* Columns with too many missing values (e.g., deck) are dropped entirely.
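The three rules above can also be written as one pass over a frame. The sketch below uses a small hand-made frame. The 50% cutoff `max_missing` is a hypothetical threshold chosen for illustration, not a value the tutorial prescribes. The fill values go to `DataFrame.fillna` as a dict, with the result assigned back.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "embarked": ["S", "C", None, "S"],
    "deck": [np.nan, "C", np.nan, np.nan],
})

# Drop every column whose share of missing values exceeds the cutoff.
max_missing = 0.5  # hypothetical threshold, adjust to taste
sparse_mask = toy.isnull().mean() > max_missing
too_sparse = sparse_mask[sparse_mask].index
toy = toy.drop(columns=too_sparse)

# Median for the numeric column, mode for the categorical one.
fill_values = {
    "age": toy["age"].median(),
    "embarked": toy["embarked"].mode()[0],
}

# fillna with a dict fills each named column with its own value;
# assigning the result back avoids inplace operations on a column selection.
toy = toy.fillna(fill_values)
print(toy)
```

Computing the fill values before calling `fillna` keeps the imputation rules in one visible place, which makes them easy to review or change later.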

**Step 5: Feature Engineering**

Create new features or modify existing ones for better analysis.

In [5]:
# Create a new column 'family_size' by combining 'sibsp' and 'parch'
titanic['family_size'] = titanic['sibsp'] + titanic['parch']

# Convert 'sex' to numerical values (0 for male, 1 for female)
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})

# Display the updated dataset
print(titanic.head())

   survived  pclass  sex   age  sibsp  parch     fare embarked  class    who  \
0         0       3    0  22.0      1      0   7.2500        S  Third    man   
1         1       1    1  38.0      1      0  71.2833        C  First  woman   
2         1       3    1  26.0      0      0   7.9250        S  Third  woman   
3         1       1    1  35.0      1      0  53.1000        S  First  woman   
4         0       3    0  35.0      0      0   8.0500        S  Third    man   

   adult_male  embark_town alive  alone  family_size  
0        True  Southampton    no  False            1  
1       False    Cherbourg   yes  False            1  
2       False  Southampton   yes   True            0  
3       False  Southampton   yes  False            1  
4        True  Southampton    no   True            0  


**Notion:**

* family_size adds siblings/spouses (sibsp) and parents/children (parch) to give the number of family members travelling with the passenger. The passenger is not counted, so a value of 0 means travelling alone.

* Converting sex to numerical values makes it easier to use in machine learning models.
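One caveat with map() is that any label missing from the dictionary becomes NaN without raising an error. The sketch below, on a hand-made frame with a deliberately misspelled label, shows one way to check for that before overwriting the column; the names `sex_codes`, `encoded`, and `unmapped` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the Titanic frame; "Male" is a deliberate inconsistency.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "Male"],
    "sibsp": [1, 1, 0, 0],
    "parch": [0, 2, 0, 0],
})

# Same feature as in the step above: relatives travelling with the passenger.
toy["family_size"] = toy["sibsp"] + toy["parch"]

# Encode into a separate Series first so the original labels stay inspectable.
sex_codes = {"male": 0, "female": 1}
encoded = toy["sex"].map(sex_codes)

# Labels that are absent from the dictionary come back as NaN.
unmapped = toy.loc[encoded.isnull(), "sex"].unique().tolist()
print(unmapped)
```

If `unmapped` is empty, the encoded Series can safely replace the original column. Otherwise the listed labels need cleaning first, for example by lowercasing.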

**Step 6: Remove Unnecessary Columns**

Drop columns that are not useful for analysis.

In [6]:
# Drop columns like 'alive', 'alone', and 'class' as they are redundant or less useful
titanic.drop(columns=['alive', 'alone', 'class'], inplace=True)

# Display the final dataset
print(titanic.head())

   survived  pclass  sex   age  sibsp  parch     fare embarked    who  \
0         0       3    0  22.0      1      0   7.2500        S    man   
1         1       1    1  38.0      1      0  71.2833        C  woman   
2         1       3    1  26.0      0      0   7.9250        S  woman   
3         1       1    1  35.0      1      0  53.1000        S  woman   
4         0       3    0  35.0      0      0   8.0500        S    man   

   adult_male  embark_town  family_size  
0        True  Southampton            1  
1       False    Cherbourg            1  
2       False  Southampton            0  
3       False  Southampton            1  
4        True  Southampton            0  


**Notion:**

* Columns like alive and class are redundant because survived and pclass already provide the same information.

* alone can be derived from family_size, so it’s not needed.
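Before dropping a column as redundant, it can be worth confirming that it really carries no extra information. A minimal sketch on a hand-made frame follows; the flag names `alive_is_redundant` and `alone_is_derivable` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "alive": ["no", "yes", "yes", "no"],
    "alone": [False, False, True, True],
    "family_size": [1, 1, 0, 0],
})

# 'alive' is redundant if it agrees with 'survived' on every row.
alive_is_redundant = bool(((toy["alive"] == "yes") == (toy["survived"] == 1)).all())

# 'alone' is derivable if it is True exactly when family_size is 0.
alone_is_derivable = bool((toy["alone"] == (toy["family_size"] == 0)).all())

# Only drop the columns once both checks pass.
if alive_is_redundant and alone_is_derivable:
    toy = toy.drop(columns=["alive", "alone"])

print(toy.columns.tolist())
```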

**Step 7: Save the Cleaned Dataset**

Save the cleaned dataset for future use.

In [7]:
# Save the cleaned dataset to a CSV file
titanic.to_csv('cleaned_titanic.csv', index=False)

**Notion:**

Saving the cleaned dataset ensures you don’t have to repeat the wrangling process every time.
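One thing to keep in mind is that CSV stores values but not pandas dtypes. The sketch below round-trips a hand-made frame through an in-memory buffer instead of a file. The `embarked` column is converted to a categorical here purely to illustrate the effect, which is an assumption and not something the steps above do.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned frame; the values are made up for illustration.
toy = pd.DataFrame({
    "survived": [0, 1],
    "embarked": pd.Categorical(["S", "C"]),
    "adult_male": [True, False],
})

# Write to an in-memory text buffer so the sketch leaves no file behind.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back in.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The values survive, but the categorical dtype does not.
print(toy.dtypes)
print(reloaded.dtypes)
```

If dtypes matter downstream, they can be restored after loading with `astype`, or passed to `read_csv` through its `dtype` parameter.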

**Conclusion:**

In this project, we performed data wrangling on the Titanic dataset using seaborn and pandas. We handled missing values, created new features, removed unnecessary columns, and saved the cleaned dataset. The dataset is now ready for further analysis, although text columns such as embarked and who would still need encoding before most machine learning models can use them. Data wrangling is a crucial step in any data science project, as it ensures the data is clean, structured, and meaningful.

**Final Thoughts:**

* Data wrangling can be time-consuming but is essential for accurate analysis.

* Always explore the dataset thoroughly before making any changes.

* Feature engineering can significantly improve the quality of your analysis or models.

* The same steps (explore, handle missing values, engineer features, drop redundant columns, save) carry over to most tabular datasets and prepare them for deeper insights.