# Task -02
## Data Cleaning and Exploratory Data Analysis

###  Data Cleaning

### Load the Dataset

####  Import necessary libraries

In [153]:
#type:ignore 
import pandas as pd                    # for data manipulation and analysis
import numpy as np                     # for numerical operations and array manipulation
import seaborn as sns                  # for statistical data visualization
import matplotlib.pyplot as plt        # for creating visualizations

#### Load the Titanic dataset

In [154]:
# Load the Titanic dataset
Titanic = pd.read_csv("Titanic _Dataset.csv")

#### Understanding the dataset

In [155]:
#type:ignore
 
# Check the basic info of the dataset
Titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


#### Code Description
The `Titanic.info()` method prints a summary of the dataset, including:
- The number of entries
- Column names
- Data types
- Non-null values
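
The same pieces of information can also be read individually. The sketch below uses a hypothetical three-row `sample` DataFrame (not the real Titanic file) to show which attribute corresponds to each item in the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; the real notebook uses the full Titanic CSV
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
})

n_rows, n_cols = sample.shape      # number of entries and number of columns
non_null = sample.count()          # non-null values per column
dtypes = sample.dtypes             # data type of each column

print(n_rows, n_cols)
print(non_null)
print(dtypes)
```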

#### Titanic Dataset Column Descriptions
---
- **PassengerId**: A unique identifier for each passenger.
- **Survived**: Indicates whether the passenger survived (1) or did not survive (0).
- **Pclass**: The passenger's ticket class (1 = first class, 2 = second class, 3 = third class).
- **Name**: The name of the passenger.
- **Sex**: The gender of the passenger (male or female).
- **Age**: The age of the passenger in years.
- **SibSp**: The number of siblings or spouses the passenger had on board.
- **Parch**: The number of parents or children the passenger had on board.
- **Ticket**: The ticket number of the passenger.
- **Fare**: The fare paid by the passenger for the ticket, in British pounds. Fare tracks travel class: higher fares usually indicate first-class accommodation, which makes it a candidate predictor of survival.
- **Cabin**: The cabin number assigned to the passenger, indicating their location on the ship. Cabin location may have affected access to lifeboats during the evacuation, and therefore survival.
- **Embarked**: The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
---

#### Preview of the Titanic Dataset

In [156]:
Titanic.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


#### Code Description
---
- The `Titanic.head()` method displays the first five rows of the dataset, giving a quick overview of the data structure and the initial values in each column.
---

#####  Explanation of the Steps
---
- Checking for duplicates is done first to ensure the dataset's integrity before any further processing, as duplicates can skew analysis results.
- Next, verifying data types ensures that each column is correctly formatted for the intended operations, facilitating accurate calculations and analyses.
- Finally, filling missing values addresses gaps in the data to maintain completeness and reliability in the dataset, ensuring more robust analysis outcomes.
---


#### Check for Duplicates

In [157]:
# Check for duplicates
duplicates = Titanic.duplicated().sum()
print(f"Number of duplicate entries: {duplicates}")

Number of duplicate entries: 0


##### Code Description
---
- `Titanic.duplicated().sum()` flags every row that is identical to an earlier row in all columns and then counts the flagged rows, giving the total number of duplicates.
---
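
As a small illustration of how the flagging works, the sketch below builds a hypothetical three-row `toy` DataFrame in which the third row repeats the first. The Titanic dataset itself has no duplicates, so `drop_duplicates()` is shown only to indicate what the follow-up step could look like.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly
toy = pd.DataFrame({
    "Name": ["Braund", "Allen", "Braund"],
    "Ticket": ["A/5 21171", "373450", "A/5 21171"],
})

# duplicated() flags a row only if an earlier row matches it in every column
flags = toy.duplicated()
n_duplicates = int(flags.sum())

# drop_duplicates() keeps the first occurrence and removes the flagged rows
deduplicated = toy.drop_duplicates()

print(flags.tolist())
print(n_duplicates)
print(len(deduplicated))
```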

In [158]:
# Check data types
print("Data types")
Titanic.dtypes

Data types


PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

#####  Transforming to convenient data types

In [159]:
# Converting to categorical types
cols = ['Survived', 'Pclass', 'Sex', 'Cabin', 'Embarked']
for col in cols:
    Titanic[col] = Titanic[col].astype('category')

##### Code Description
---
This code snippet converts the specified columns in the Titanic DataFrame to the categorical data type. A categorical column stores each distinct value once and refers to it by an integer code, which saves memory and can speed up operations such as grouping when a column has few distinct values. The columns being converted are:
- **Survived**: Indicates if a passenger survived (1) or did not survive (0).
- **Pclass**: Represents the passenger's ticket class (1, 2, or 3).
- **Sex**: The gender of the passenger (male or female).
- **Cabin**: The cabin number assigned to the passenger.
- **Embarked**: The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
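
The memory effect can be checked directly. The sketch below uses a hypothetical column of 10,000 repeated labels rather than the Titanic data; the exact byte counts depend on the pandas version, so only the comparison is meaningful.

```python
import pandas as pd

# Hypothetical column with two distinct values repeated many times
sex_object = pd.Series(["male", "female"] * 5000, dtype=object)
sex_category = sex_object.astype("category")

# deep=True includes the memory used by the Python string objects
bytes_object = sex_object.memory_usage(deep=True)
bytes_category = sex_category.memory_usage(deep=True)

print(bytes_object, bytes_category)
print(list(sex_category.cat.categories))
```

The saving shrinks as the number of distinct values approaches the number of rows, so it matters less for a column such as **Cabin** than for **Sex**.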


In [160]:
# Changing PassengerId to string
Titanic['PassengerId'] = Titanic['PassengerId'].astype(str)

##### Code Description
---
- This code changes the data type of the **PassengerId** column to string (object), so it is treated as an identifier rather than a number and meaningless arithmetic on it is avoided. The column was read as `int64`, so the values contain no leading zeros to preserve.
---
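
A short sketch of what the conversion does to integer identifiers, using a hypothetical `ids` Series. Zero-padding is shown as a separate, optional step; the width of 3 is an arbitrary choice for the example and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical identifiers, stored as integers like PassengerId after read_csv
ids = pd.Series([1, 2, 10], name="PassengerId")

# astype(str) converts each integer to its plain decimal text
ids_as_text = ids.astype(str)

# Zero-padding has to be requested explicitly (width 3 is an arbitrary choice)
ids_padded = ids_as_text.str.zfill(3)

print(ids_as_text.tolist())
print(ids_padded.tolist())
```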

In [161]:
# Changing SibSp to integer
Titanic['SibSp'] = Titanic['SibSp'].astype(int)

##### Code Description
---
- This code casts the **SibSp** column to an integer type. The column was already `int64` with no missing values, so the values are unchanged; on this platform `astype(int)` resolves to `int32`, which is why the next `dtypes` output shows `int32`.
---
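
If a platform-independent result is preferred, the integer width can be named explicitly. This is a sketch on a hypothetical `sibsp` Series, not a change to the notebook's own cell.

```python
import pandas as pd

# Hypothetical sibling/spouse counts
sibsp = pd.Series([1, 0, 3], dtype="int64", name="SibSp")

# Naming the width gives the same dtype on every platform
sibsp_64 = sibsp.astype("int64")

# A narrower type is enough for small counts and uses less memory
sibsp_8 = sibsp.astype("int8")

print(sibsp_64.dtype, sibsp_8.dtype)
```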


In [162]:
Titanic.dtypes

PassengerId      object
Survived       category
Pclass         category
Name             object
Sex            category
Age             float64
SibSp             int32
Parch             int64
Ticket           object
Fare            float64
Cabin          category
Embarked       category
dtype: object

#### Check for Missing Values

In [163]:
Titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

##### Code description
---

**`Titanic`**: The dataset containing information about the Titanic passengers.

**`.isnull()`**: This function checks for missing values, returning `True` for missing (null) entries and `False` for present values.

**`.sum()`**: This function counts the number of `True` values (missing entries) in each column, effectively tallying the missing values.

--- 
##### Output Analysis

The columns **PassengerId**, **Survived**, **Pclass**, **Name**, **Sex**, **SibSp**, **Parch**, **Ticket**, and **Fare** have no missing values, indicating complete data.

In contrast, the **Age** (177 missing), **Cabin** (687 missing), and **Embarked** (2 missing) columns contain missing values, which need to be addressed before analysis.
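
Counts are easier to compare as a share of the rows. Out of 891 passengers, the counts above correspond to roughly 19.9% for **Age**, 77.1% for **Cabin**, and 0.2% for **Embarked**. The sketch below shows one way to build such a table, using a hypothetical four-row `toy` DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

missing_count = toy.isnull().sum()          # number of missing values per column
missing_share = toy.isnull().mean() * 100   # the same, as a percentage of rows

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
summary = summary.sort_values("percent", ascending=False)
print(summary)
```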





##### Handling  Missing Values

In [164]:
# Fill missing values in the Age column with the mean
mean_age = Titanic['Age'].mean()
# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated
Titanic['Age'] = Titanic['Age'].fillna(mean_age)


##### Code description
---
- The code calculates the mean age of passengers with `mean()` on the **Age** column, then uses `fillna()` to replace the missing values in **Age** with that mean.

- The mean is used because **Age** is numeric and the mean gives a central value; filling with it leaves the column mean unchanged. The trade-off is that all 177 imputed passengers receive the same age, which reduces the spread of the distribution. The median is a common alternative when the distribution is skewed.

---
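
The effect of mean imputation on the mean and on the spread can be seen on a small example. The sketch below uses a hypothetical `toy` DataFrame and the assignment form `df[col] = df[col].fillna(value)`, which is the form the pandas warning recommends.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0, np.nan, 60.0]})

mean_age = toy["Age"].mean()      # missing values are skipped: (20 + 40 + 60) / 3
std_before = toy["Age"].std()

# Assign the filled column back to the DataFrame
toy["Age"] = toy["Age"].fillna(mean_age)
std_after = toy["Age"].std()

print(mean_age, toy["Age"].mean())   # the mean is unchanged by the fill
print(std_before, std_after)         # the standard deviation shrinks
```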

In [165]:
# Fill missing values in the Embarked column with the mode
mode_embarked = Titanic['Embarked'].mode()[0]  # Get the first mode if there are multiple
# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated
Titanic['Embarked'] = Titanic['Embarked'].fillna(mode_embarked)


##### Code description
---
- The code computes the mode of the **Embarked** column with `mode()`, which returns the most frequent embarkation point, and then uses `fillna()` to fill the missing values in **Embarked** with it.

- The mode is used because **Embarked** is categorical, so a mean is not defined. Only 2 of 891 values are missing, so filling them with the most common port has a negligible effect on the column's distribution.
---
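
A minimal sketch of mode imputation on a categorical column, using a hypothetical `embarked` Series. Because the fill value is already one of the column's categories, `fillna()` works without further preparation.

```python
import numpy as np
import pandas as pd

# Hypothetical ports of embarkation with two gaps, stored as a categorical
embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan], dtype="category")

mode_embarked = embarked.mode()[0]        # most frequent category
filled = embarked.fillna(mode_embarked)   # the value is an existing category

print(mode_embarked)
print(filled.tolist())
```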


##### Considerations for filling missing values in the `Cabin` column
- Missing values in the `Cabin` column can be filled in a few ways, but because most of them are missing (687 of 891), it is important to choose a method that maintains the integrity of the analysis.

---
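
Two common options that avoid inventing cabin numbers are sketched below on a hypothetical `toy` DataFrame: recording whether a cabin was known at all, and labelling the gaps explicitly. The column name `HasCabin` and the label `"Unknown"` are illustrative assumptions and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin column with gaps, stored as a categorical
toy = pd.DataFrame({
    "Cabin": pd.Series(["C85", np.nan, "C123", np.nan], dtype="category"),
})

# Option 1: keep the information that a cabin was recorded at all
toy["HasCabin"] = toy["Cabin"].notnull()

# Option 2: label the gaps explicitly; a categorical column must have the
# new label registered as a category before fillna() can use it
toy["Cabin"] = toy["Cabin"].cat.add_categories("Unknown").fillna("Unknown")

print(toy)
```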

##### Imputation Based on Other Features

In [166]:
# Fill missing values in the Cabin column based on Pclass
for pclass in Titanic['Pclass'].unique():
    most_common_cabin = Titanic[Titanic['Pclass'] == pclass]['Cabin'].mode()[0]
    Titanic.loc[(Titanic['Pclass'] == pclass) & (Titanic['Cabin'].isnull()), 'Cabin'] = most_common_cabin

##### Code Explanation:
- **Loop Through Unique Pclass Values**: The `for` loop iterates over each unique value in the **Pclass** column. This allows us to analyze the data for each class of passengers separately.
  
- **Calculate Most Common Cabin**: 
  - `most_common_cabin = Titanic[Titanic['Pclass'] == pclass]['Cabin'].mode()[0]`:
    - This line filters the Titanic DataFrame to include only the rows where **Pclass** equals the current `pclass` in the loop.
    - It then computes the mode (most frequently occurring value) of the **Cabin** column for that specific **Pclass**.
    - `[0]` retrieves the first mode in case there are multiple modes.

- **Fill Missing Values**:
  - `Titanic.loc[(Titanic['Pclass'] == pclass) & (Titanic['Cabin'].isnull()), 'Cabin'] = most_common_cabin`:
    - This line uses the `.loc` accessor to target rows in the DataFrame where the **Pclass** matches the current `pclass` and the **Cabin** value is missing (is null).
    - It assigns the previously calculated `most_common_cabin` to those missing **Cabin** entries.
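
The loop can be traced on a small example. The sketch below applies the same logic to a hypothetical six-row `toy` DataFrame with plain (non-categorical) columns and also records which rows were originally missing, so the number of imputed values stays visible afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two classes, three missing cabins
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Cabin": ["C85", "C85", np.nan, "G6", np.nan, np.nan],
})

# Remember which rows were missing before anything is filled
was_missing = toy["Cabin"].isnull()

for pclass in toy["Pclass"].unique():
    in_class = toy["Pclass"] == pclass
    # mode() ignores missing values; [0] picks the first mode on ties
    most_common_cabin = toy.loc[in_class, "Cabin"].mode()[0]
    toy.loc[in_class & was_missing, "Cabin"] = most_common_cabin

print(toy["Cabin"].tolist())
print(int(was_missing.sum()))
```

In the toy data, two of the three third-class passengers end up with the cabin of the single passenger whose cabin was known.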

##### Summary
This approach fills each missing cabin with the most common cabin in the passenger's class, so the **Cabin** column no longer contains nulls. Because 687 of 891 values are imputed this way, most passengers in a class end up sharing a single cabin label. The filled column is therefore best read as a rough class-level placeholder rather than an actual cabin assignment.

In [167]:
Titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    object  
 1   Survived     891 non-null    category
 2   Pclass       891 non-null    category
 3   Name         891 non-null    object  
 4   Sex          891 non-null    category
 5   Age          891 non-null    float64 
 6   SibSp        891 non-null    int32   
 7   Parch        891 non-null    int64   
 8   Ticket       891 non-null    object  
 9   Fare         891 non-null    float64 
 10  Cabin        891 non-null    category
 11  Embarked     891 non-null    category
dtypes: category(5), float64(2), int32(1), int64(1), object(3)
memory usage: 56.3+ KB
