
## List all built-in Seaborn datasets

In [None]:
import seaborn as sns

# List all built-in Seaborn datasets
sns.get_dataset_names()

# EDA Pipeline on Titanic Dataset
## Install Libraries

In [None]:
#!pip uninstall -y numpy
#!pip install --upgrade numpy
#!pip install --upgrade --force-reinstall missingno shap umap-learn dowhy pandas-profiling
# import numpy as np
# print(np.__version__)  # Check if NumPy is properly installed
#!pip install missingno shap umap-learn dowhy pandas-profiling

# Import Libraries

In [3]:
# Import core libraries for data handling and analysis
import pandas as pd  # For handling datasets in tabular format
import numpy as np  # For numerical computations

# Import libraries for visualization
import matplotlib.pyplot as plt  # Basic plotting library
import seaborn as sns  # Advanced statistical visualizations
import missingno as msno  # For visualizing missing values

# Import preprocessing utilities
from sklearn.preprocessing import StandardScaler, OneHotEncoder  # For scaling and encoding data
from sklearn.impute import SimpleImputer  # For handling missing values

# Import dimensionality reduction techniques
from sklearn.decomposition import PCA  # Principal Component Analysis (PCA)

# Import machine learning models and outlier detection
from sklearn.ensemble import IsolationForest, RandomForestClassifier  # Outlier detection & feature importance

# Import SHAP for explainability and feature importance analysis
import shap

# Import UMAP for non-linear dimensionality reduction
import umap

# Import DoWhy for causal inference analysis
# from dowhy import CausalModel
import warnings

# Suppress all warnings (including FutureWarning and pandas' SettingWithCopyWarning)
warnings.simplefilter(action='ignore')



# 1) Load & Inspect Titanic Dataset
### a) Load Dataset

In [None]:
import pandas as pd
import seaborn as sns

# Load Titanic dataset
df = sns.load_dataset("titanic")
df

### b) Head

In [None]:
df.head()

### c) Tail

In [None]:
df.tail()

### d) Information

In [None]:
df.info()

### e) Columns

In [None]:
df.columns

### f) Shape

In [None]:
df.shape

### g) Data types

In [None]:
df.dtypes

In [None]:
type(df)

### h) Verify if df is a DataFrame

In [None]:
isinstance(df, pd.DataFrame)

### i) Duplicates

In [None]:
df.duplicated()

In [None]:
df.duplicated().sum()

### j) Uniqueness

In [None]:
df.nunique()

### k) Binary

In [None]:
(df.isin([0, 1]).all()).sum()

In [None]:
print(df.columns[df.isin([0, 1]).all()])

# 2) DATA CLEANING

### a) Drop Duplicates

In [None]:
df = df.drop_duplicates()
df

### b) Missing values

In [None]:
missing = df.isnull().sum()
missing

In [None]:
# Count missing 'age' values grouped by 'pclass'
missing_by_class = df[df['age'].isnull()].groupby('pclass').size()

# Share of all missing 'age' values that falls in each 'pclass' (the shares sum to 100%)
missing_percentage_by_class = (missing_by_class / df['age'].isnull().sum()) * 100
print(missing_percentage_by_class)


In [None]:
import plotly.express as px

# Create an exploded pie chart with annotations
fig = px.pie(
    names=missing_percentage_by_class.index,
    values=missing_percentage_by_class,
    title="Missing 'Age' Values by Pclass",
    hole=0.3  # Create a donut chart
)

# Explode the slices to highlight each class
fig.update_traces(pull=[0.1, 0.1, 0.1], textinfo='percent+label')

# Show the plot
fig.show()


### Imputation Methods and Their Appropriate Usage

| **Imputation Method** | **Appropriate When** | **Explanation**                                                                                                                                         |
|-----------------------|----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Mean**              | - Data is **normally distributed**. <br> - Data has **few outliers**. <br> - Missing data is **small in proportion** (less than 10-20%). | The **mean** is useful for **symmetric data** with **few outliers**. It is sensitive to outliers, so it should not be used when data is skewed.         |
| **Median**            | - Data is **skewed**. <br> - Data contains **outliers**. <br> - Data is **continuous** (e.g., `age`, `fare`). <br> - Missing data is **moderate to high** in proportion. | The **median** is robust and works well for **skewed data** or when data contains **outliers**. It does not get affected by extreme values.               |
| **Mode**              | - Data is **categorical** (e.g., `sex`, `pclass`, `embarked`). <br> - Missing data is **small in proportion** (less than 10-20%). <br> - Missing data is **random**. | The **mode** is ideal for **categorical data**, as it imputes the most frequent category. It is also used when the missing data is random and small.       |

### **When to Use Each Method**:
| **Method**  | **Typical Use Case**                                                |
|-------------|---------------------------------------------------------------------|
| **Mean**    | - Imputing `age`, `income`, `height` (if normally distributed).     |
| **Median**  | - Imputing `age`, `fare`, `height` (if skewed or with outliers).    |
| **Mode**    | - Imputing `sex`, `embarked`, `class`, `who` (categorical columns). |
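
The three strategies in the tables above can be sketched on a small made-up frame. The column names and values below are hypothetical (they are not taken from the Titanic data) and only illustrate how each statistic reacts to the shape of a column.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the Titanic dataset)
toy = pd.DataFrame({
    "height": [160.0, 165.0, np.nan, 170.0, 175.0],  # roughly symmetric
    "fare": [7.0, 8.0, 9.0, np.nan, 500.0],          # skewed by one extreme value
    "port": ["S", "S", "C", None, "S"],              # categorical
})

imputed = toy.copy()
imputed["height"] = toy["height"].fillna(toy["height"].mean())  # mean for symmetric data
imputed["fare"] = toy["fare"].fillna(toy["fare"].median())      # median for skewed data
imputed["port"] = toy["port"].fillna(toy["port"].mode()[0])     # mode for categorical data

print(imputed)
print("fare mean vs. median:", toy["fare"].mean(), toy["fare"].median())
```

In this toy column the single fare of 500 pulls the mean far above the typical value, while the median stays close to the bulk of the data, which is why the median is preferred for skewed columns.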


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot histogram for 'age'
sns.histplot(df['age'], kde=True)
plt.title('Distribution of Age')
plt.show()


In [None]:
# Calculate skewness for 'age' (or any continuous column)
skewness = df['age'].skew()
print(f"Skewness of 'age': {skewness}")

### Skewness Analysis for `age` Column

A skewness of **0.3674** for the `age` column indicates that the data is **slightly right-skewed** (positive skew): the distribution has a longer tail on the right, but the asymmetry is not extreme.

#### Interpretation of Skewness:
- **Skewness of 0** means the data is **perfectly symmetric**.
- **Skewness between -0.5 and 0.5** is commonly treated as **approximately symmetric**.
- **Skewness between 0.5 and 1** (or between -1 and -0.5) indicates a **moderate** right (or left) skew.
- **Skewness > 1** indicates a pronounced **right-skewed distribution**.
- **Skewness < -1** indicates a pronounced **left-skewed distribution**.

Given that the skewness of `age` is **0.3674**, it is a **mild positive skew**: most passengers are young adults, with a longer tail of older passengers on the right.
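
As a quick check of how the skewness statistic behaves, the sketch below computes it for two small hypothetical samples with `pandas.Series.skew()`, the same method used on `age` above. The numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical samples, chosen only to illustrate the sign of the statistic
symmetric = pd.Series([1, 2, 3, 4, 5])      # mirror image around 3
right_skewed = pd.Series([1, 2, 3, 4, 20])  # one large value stretches the right tail

skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()

print(f"symmetric sample:    {skew_symmetric:.3f}")
print(f"right-skewed sample: {skew_right:.3f}")
```

The symmetric sample gives a skewness of zero, while the single large value in the second sample pushes the statistic well above 1.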

### Action for Missing Data Imputation:
Since the skewness is mild, the mean and the median would give similar results here. We use the **median** because it is robust to the older ages in the right tail, whereas the **mean** is pulled toward them.

#### Suggested Approach for Imputation:
Impute missing `age` values with the **median** of `age` grouped by **pclass** (passenger class), as **age** can vary by class. This will ensure that imputation considers the differences between passenger classes (e.g., first-class passengers may have a different age distribution than third-class passengers).


In [None]:
# Impute 'age' with the median value by 'pclass'
df['age'] = df.groupby('pclass')['age'].transform(lambda x: x.fillna(x.median()))

# Check if there are any remaining missing values
print(df['age'].isnull().sum())


### For the `embarked` Variable:

- **Missing Data**: About **0.2%** of the values in the `embarked` column are missing.
- **Imputation Strategy**: With so few missing values, the choice of imputation has little effect on the distribution, so we impute them with the **mode** (most frequent value), the usual choice for categorical data.
- **Action**: Use the **mode** to fill missing values in the `embarked` column.


In [None]:
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

print(df['embarked'].isnull().sum())

### For the `embark_town` Variable:

- **Missing Data**: About **0.2%** of the values in the `embark_town` column are missing.
- **Imputation Strategy**: Since so few values are missing and this is a categorical column with only a few categories, we can impute the missing values using the **mode** (most frequent value).
- **Action**: Impute the missing values in the `embark_town` column with the **mode**.


In [None]:
# Impute 'embark_town' with the mode
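# (assign the result back instead of calling fillna(..., inplace=True) on a column selection)
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])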

# Check if there are any remaining missing values
print(df['embark_town'].isnull().sum())

### For the `deck` Variable:

- **Missing Data**: The `deck` column has **77.2%** missing values.
- **Imputation Strategy**: Given the high percentage, imputing is not ideal. We will **drop** the column instead.
- **Action**: Drop the `deck` column from the dataset.


In [None]:
df.drop('deck', axis=1, inplace=True)
df


In [None]:
df.isnull().sum()

### Categorical Variables
##### a) Inconsistencies

In [None]:
print(df['sex'].unique())

In [None]:
# Standardize 'sex' values (convert to lowercase and strip any whitespace)
df['sex'] = df['sex'].str.lower().str.strip()
df['sex']

In [None]:
# Check for inconsistent values in 'embarked' column
print(df['embarked'].unique())

In [None]:
# Standardize 'embarked' values (convert to uppercase and strip any whitespace)
df['embarked'] = df['embarked'].str.upper().str.strip()
df['embarked']

In [None]:
# Ensure valid categories for 'embarked' (C, Q, S)
valid_embarked = ['C', 'Q', 'S']
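# .copy() makes df an independent DataFrame, so later column assignments do not act on a view
df = df[df['embarked'].isin(valid_embarked)].copy()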
df

In [None]:
# Check unique values for 'who' column
print(df['who'].unique())


In [None]:

# Check unique values for 'embark_town' column
print(df['embark_town'].unique())


In [None]:
# Check unique values for 'alive' column (Yes/No)
print(df['alive'].unique())


In [None]:
# Standardize 'alive' values to lowercase and strip spaces
df['alive'] = df['alive'].str.lower().str.strip()
df['alive']

In [None]:
# Check if 'fare' contains any negative values
print(df[df['fare'] < 0])

In [27]:
# Handle negative 'fare' values (e.g., set them to NaN)
df.loc[df['fare'] < 0, 'fare'] = None


In [None]:
# Check if 'age' contains any unrealistic values (greater than 100 or negative)
print(df[(df['age'] < 0) | (df['age'] > 100)])


In [29]:
# Handle unrealistic 'age' values (e.g., set them to NaN)
df.loc[(df['age'] < 0) | (df['age'] > 100), 'age'] = None


### Encoding

#### a) Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
label_encoder = LabelEncoder()

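# Label encode 'pclass' and 'sex'
# LabelEncoder assigns codes in sorted order: pclass 1, 2, 3 -> 0, 1, 2; 'female' -> 0, 'male' -> 1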
df['pclass'] = label_encoder.fit_transform(df['pclass'])
df['sex'] = label_encoder.fit_transform(df['sex'])

# Check the encoding
print(df[['pclass', 'sex']].head(20))


#### b) Mapping

In [None]:
# Mapping the 'embarked' column (C, Q, S)
df['embarked'] = df['embarked'].map({'C': 0, 'Q': 1, 'S': 2})
df['embarked']

In [None]:
# Mapping the 'embark_town' column (Southampton, Cherbourg, Queenstown)
df['embark_town'] = df['embark_town'].map({'Southampton': 0, 'Cherbourg': 1, 'Queenstown': 2})
df['embark_town']

In [None]:
df

In [None]:
# Check unique values in the 'class' column
unique_class = df['class'].unique()
print("Unique values in 'class' column:", unique_class)


In [None]:
# Map 'class' values to numerical values
class_mapping = {'First': 1, 'Second': 2, 'Third': 3}
df['class'] = df['class'].map(class_mapping)

# Check the updated 'class' column
print(df['class'].head())


In [None]:
# Map 'who' values to numerical values
who_mapping = {'man': 1, 'woman': 2, 'child': 3}
df['who'] = df['who'].map(who_mapping)

# Check the updated 'who' column
print(df['who'].head())


In [None]:
# Map 'adult_male' values to numerical values (True -> 1, False -> 0)
df['adult_male'] = df['adult_male'].map({True: 1, False: 0})

# Check the updated 'adult_male' column
print(df['adult_male'].head())


In [None]:
# Map 'alive' values to numerical values ('yes' -> 1, 'no' -> 0)
df['alive'] = df['alive'].map({'yes': 1, 'no': 0})

# Check the updated 'alive' column
print(df['alive'].head())


In [None]:
# Map 'alone' values to numerical values (True -> 1, False -> 0)
df['alone'] = df['alone'].map({True: 1, False: 0})

# Check the updated 'alone' column
print(df['alone'].head())


In [None]:
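# Check for any text (str) values in the DataFrame, column by column
# (column-wise Series.map is used because DataFrame.applymap is deprecated in recent pandas)
text_rows = df.apply(lambda col: col.astype(object).map(lambda x: isinstance(x, str)))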

# Find rows with text values
rows_with_text = df[text_rows.any(axis=1)]

# Print rows with text values
print("Rows with text values:")
print(rows_with_text)



In [None]:
df.info()

# OUTLIERS

### Z-Score:

- Works best when the data is normally distributed.
- Measures how far a data point is from the mean in terms of standard deviations.
- It's sensitive to extreme outliers and is less effective if the data has a skewed distribution.

### IQR:

- More robust, as it focuses on the spread of the middle 50% of the data (between the 25th and 75th percentiles).
- It is less influenced by extreme values or skewed data.
- It works well even when the data is not normally distributed.

### When to Use Each:

- **Z-score**: Best used for **normally distributed** data or when you have a reasonable expectation that the data should follow a normal distribution.
- **IQR**: Works better when the data is **skewed** or has a **non-normal distribution**. It's generally a safer choice in many cases.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Get all numerical columns
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Adjust grid size based on number of numerical columns
num_columns = len(numerical_columns)
rows = (num_columns // 3) + (num_columns % 3 > 0)  # Determine number of rows needed
cols = 3  # Keep 3 columns for the subplots

# 1. Boxplot to visualize outliers for all numerical columns
plt.figure(figsize=(15, 5 * rows))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(rows, cols, i)
    sns.boxplot(x=df[column])  # pass the column explicitly as x
    plt.title(f'Boxplot of {column}')
plt.tight_layout()
plt.show()

In [None]:
# 2. Z-score method to detect outliers
z_scores = np.abs(stats.zscore(df[numerical_columns]))
# Flag a row if ANY numerical column has |z| > 3 (.all would require every column to be extreme)
outliers_z = (z_scores > 3).any(axis=1)
print(f"Number of outliers detected using Z-score: {np.sum(outliers_z)}")

###  IQR Method

1. **Calculate the Quartiles**:

   - **First Quartile ($Q1$)**: The 25th percentile of the data, which means that 25% of the data points lie below $Q1$.
   
   $$
   Q1 = \text{25th percentile of the data}
   $$

   - **Third Quartile ($Q3$)**: The 75th percentile of the data, meaning 75% of the data points lie below $Q3$.
   
   $$
   Q3 = \text{75th percentile of the data}
   $$

   - **Interquartile Range ($IQR$)**: The difference between the third and first quartile. It represents the middle 50% of the data.
   
   $$
   IQR = Q3 - Q1
   $$

2. **Define the Outlier Thresholds**:

   - **Lower Bound**: Any data point below this threshold is considered an outlier.
   
   $$
   \text{Lower Bound} = Q1 - 1.5 \times IQR
   $$

   - **Upper Bound**: Any data point above this threshold is considered an outlier.
   
   $$
   \text{Upper Bound} = Q3 + 1.5 \times IQR
   $$

3. **Outlier Detection**:

   A data point $x_i$ is considered an outlier if it falls outside the range defined by the lower and upper bounds:

   $$
   x_i < Q1 - 1.5 \times IQR \quad \text{or} \quad x_i > Q3 + 1.5 \times IQR
   $$

### Summary of Formulae:

- **$Q1$** = 25th percentile
- **$Q3$** = 75th percentile
- **$IQR$** = $Q3 - Q1$
- **Lower Bound** = $Q1 - 1.5 \times IQR$
- **Upper Bound** = $Q3 + 1.5 \times IQR$

### Outlier Condition:

- $x_i$ is an outlier if:

$$
x_i < Q1 - 1.5 \times IQR \quad \text{or} \quad x_i > Q3 + 1.5 \times IQR
$$


In [None]:
# 3. Detect outliers using IQR method
Q1 = df[numerical_columns].quantile(0.25)  # First quartile (25th percentile)
Q3 = df[numerical_columns].quantile(0.75)  # Third quartile (75th percentile)
IQR = Q3 - Q1  # Interquartile range

# Identify outliers (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR)
outliers_iqr = ((df[numerical_columns] < (Q1 - 1.5 * IQR)) | (df[numerical_columns] > (Q3 + 1.5 * IQR)))

# Count of rows with outliers in any of the numerical columns
outliers_iqr_count = outliers_iqr.any(axis=1).sum()

print(f"Number of outliers detected using IQR: {outliers_iqr_count}")


| **Variable**  | **Outliers Detected**          | **Action**                           | **Reason**                                                                 |
|---------------|---------------------------------|--------------------------------------|---------------------------------------------------------------------------|
| **Age**       | Extreme values (e.g., >100)     | Cap to 99 or remove                  | Uncommon and may be erroneous; capping prevents distortion in model.       |
| **Parch**     | High values (e.g., >6)          | Cap or remove                        | Rare, likely data entry errors; capping keeps model from being skewed.    |
| **SibSp**     | High values (e.g., >6 or 7)     | Cap or remove                        | Rare occurrences; extreme values may be data errors.                       |
| **Fare**      | High values (e.g., >500)        | Cap (95th percentile) or log transform| VIP/first-class passengers; capping/log reduces extreme influence on model. |


In [None]:
# Get max and min values for each variable of interest
max_min_values = {
    'Variable': ['Age', 'Parch', 'SibSp', 'Fare'],
    'Max Value': [df['age'].max(), df['parch'].max(), df['sibsp'].max(), df['fare'].max()],
    'Min Value': [df['age'].min(), df['parch'].min(), df['sibsp'].min(), df['fare'].min()]
}

# Create DataFrame
max_min_df = pd.DataFrame(max_min_values)

# Show the table
max_min_df


#### Comments:
- **Age**: The maximum value is 80, and the minimum value is 0.42. This range is reasonable since it covers a variety of ages from infants to elderly passengers.
- **Parch**: The maximum value is 6, and the minimum value is 0. This is also reasonable, indicating that most passengers traveled with no or very few parents/children, with some passengers possibly traveling with a larger family.
- **SibSp**: The maximum value is 8, and the minimum value is 0. This is still reasonable, as it represents the number of siblings/spouses aboard, and the values are within expected limits for most passengers.
- **Fare**: The maximum value is 512.33, and the minimum value is 0. While a few extreme fares might suggest VIP passengers, values up to 512 are plausible in a first-class or luxury context. A cap might be applied to limit extreme values if necessary for modeling purposes.

In summary, none of these variables has extreme outliers that would need aggressive treatment, but further action might be needed for **Fare** if there are specific business logic concerns or if its values prove to be heavily skewed.


In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Apply log transformation to the 'fare' column
df['fare_log'] = np.log1p(df['fare'])  # log1p to handle 0 values

# Create subplots to compare original and log-transformed distributions
plt.figure(figsize=(14, 6))

# Original Fare distribution
plt.subplot(1, 2, 1)
sns.histplot(df['fare'], kde=True, color='blue')
plt.title('Original Fare Distribution')

# Log-transformed Fare distribution
plt.subplot(1, 2, 2)
sns.histplot(df['fare_log'], kde=True, color='green')
plt.title('Log Transformed Fare Distribution')

# Show the plots
plt.tight_layout()
plt.show()


In [None]:
from scipy import stats
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Box-Cox requires strictly positive values; 'fare' contains zeros, so shift every value by 1
df['fare_boxcox'] = df['fare'] + 1

# Apply Box-Cox transformation
df['fare_boxcox'], lambda_value = stats.boxcox(df['fare_boxcox'])

# Create subplots to compare original and Box-Cox transformed distributions
plt.figure(figsize=(14, 6))

# Original Fare distribution
plt.subplot(1, 2, 1)
sns.histplot(df['fare'], kde=True, color='blue')
plt.title('Original Fare Distribution')

# Box-Cox transformed Fare distribution
plt.subplot(1, 2, 2)
sns.histplot(df['fare_boxcox'], kde=True, color='red')
plt.title('Box-Cox Transformed Fare Distribution')

# Show the plots
plt.tight_layout()
plt.show()

# Output the lambda value
print(f"Lambda value used for Box-Cox transformation: {lambda_value}")


In [None]:
df

In [None]:
# Check for imbalance in categorical columns
df.select_dtypes(include=['object', 'category']).apply(lambda x: x.value_counts(normalize=True))


### Imbalance Analysis of the 'class' Variable

The `class` variable shows the following proportions for each class:

| Class | Proportion |
|-------|------------|
| 3     | 51.66%     |
| 1     | 27.30%     |
| 2     | 21.05%     |

#### Interpretation:
- **Imbalance**: The dataset is imbalanced, with more passengers from the third class (`class 3`) than from the other two classes.
- **Action**: Imbalance matters most when the variable is the **prediction target**. If `class` is used as a target, techniques like **oversampling**, **undersampling**, or **weighted loss functions** can help mitigate the impact. If it is only an input feature, this level of imbalance usually needs no special treatment.
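
One way to build weights for a weighted loss is scikit-learn's `compute_class_weight`. The sketch below assumes `class` is the prediction target and uses a hypothetical label vector that only roughly reproduces the proportions in the table above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 52 third-class, 27 first-class, 21 second-class passengers
labels = np.array([3] * 52 + [1] * 27 + [2] * 21)

classes = np.unique(labels)
# "balanced" computes n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes.tolist(), weights.tolist()))

print(class_weights)
```

The rarest class receives the largest weight. Many scikit-learn classifiers accept such a dictionary through their `class_weight` parameter, or `class_weight="balanced"` directly.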


In [None]:
df.info()

## Data Cleaning Completed

In this phase, we have successfully completed the data cleaning process. The dataset now contains 784 entries with no missing values. All columns are properly cleaned and transformed:

- Categorical variables were encoded numerically.
- Missing data was imputed using appropriate strategies.
- Outliers were handled, including the transformation of the `fare` column using log and Box-Cox transformations to reduce the impact of extreme values.

We now have a clean dataset that is ready for further analysis and model building.

# DESCRIPTIVE STATISTICS

In [None]:
df.describe()

### Statistical Insights from Descriptive Statistics:

1. **Survived**:  
   - The **mean** value of `0.412` indicates that approximately 41.2% of the passengers in the cleaned data survived. For a binary variable the standard deviation is determined by the mean ($\sqrt{p(1-p)}$), so the value of `0.493` simply reflects a split that is not far from even.
   - The **max** of `1` and **min** of `0` confirm that this is a binary variable (survived or not).

2. **Pclass**:  
   - The mean of `1.244` and **standard deviation** of `0.855` suggests the majority of passengers were from second and third classes, with a smaller proportion in the first class. The **min** and **max** of `0` and `2` suggest that `pclass` is coded from `0` to `2`, but typically, it is considered as `1`, `2`, or `3`.
   - The data points toward a **larger proportion** of passengers being from the lower classes, which is consistent with historical demographics.

3. **Sex**:  
   - With a mean of `0.626`, more than half of the passengers were male (coded as `1` for male and `0` for female), so males outnumber females by roughly 63% to 37%. The **standard deviation** of `0.484` follows directly from this proportion.
   - The min of `0` and max of `1` confirms the binary nature of this categorical feature.

4. **Age**:  
   - The **mean** age of `29.59` years with a **standard deviation** of `13.90` shows a fairly young demographic. The **min** of `0.42` (an infant) and the **max** of `80` are plausible ages, consistent with the range check in the outlier section, so no capping is required.
   - The **median (50%)** is `28`, which supports the idea that the data is somewhat centered around young adults.

5. **SibSp (Siblings/Spouses Aboard)** and **Parch (Parents/Children Aboard)**:  
   - Both features have a **mean of approximately 0.5**, suggesting that most passengers traveled alone or with only one family member. The **max** values of `8` (for SibSp) and `6` (for Parch) indicate that a few passengers traveled with large families. These values are rare but plausible, so capping them is optional and mainly a modeling decision.
   - The **standard deviations** of `0.986` and `0.837` confirm the variability, with many passengers traveling alone but a small proportion traveling with large families.

6. **Fare**:  
   - The mean fare of `34.71` is far below the **max** fare of `512.33`. The large spread (**std = 52.16**) suggests there are a few high-fare passengers, likely those in first class. These extreme fares might have a disproportionate impact on model performance, so **log or Box-Cox transformations** could be beneficial to reduce the skew.

7. **Embarked**:  
   - The **mean** of `1.529` and **standard deviation** of `0.803` should be read with care, because `embarked` is a nominal variable encoded as `C` = 0, `Q` = 1, `S` = 2. A mean this close to `2` reflects that most passengers embarked at Southampton (`S`), not a balanced distribution across the three ports. `value_counts()` is the more informative summary here.

8. **Who**:  
   - The feature **Who** is encoded with three values (`man` = 1, `woman` = 2, `child` = 3). The reported mean of `1.529` lies between the codes for `man` and `woman`, which is consistent with most passengers being adults. As with `embarked`, the mean of a nominal code is only a rough guide.

9. **Alive** and **Alone**:  
   - Both features are binary variables, with **means** around `0.41` for **Alive** (indicating a survival rate of 41%) and `0.57` for **Alone**, showing that more passengers were traveling alone.
  
10. **Fare_log and Fare_boxcox**:  
    - These transformed versions of the `Fare` variable show a more balanced distribution with lower values (suggested by the **mean** and **standard deviation**), indicating that the transformations have reduced the extreme influence of the high-fare passengers.

### Key Takeaways:
- **Skewness in Fare**: There's a strong right-skew in the `Fare` column, with a few passengers paying significantly higher fares. Transforming this variable could help stabilize variance and reduce the impact of outliers.
- **Age Distribution**: The age distribution is centered around young adults and ranges from 0.42 to 80 years. These extremes are plausible ages (infants and elderly passengers) rather than data errors.
- **Passenger Class**: Most passengers belong to the second and third classes, with fewer in the first class. This aligns with the socioeconomic profile of the Titanic passengers.
- **Family Size**: Most passengers traveled alone or with just one family member, but some extreme cases with large family sizes may need to be addressed in further analysis.

These insights should guide data preprocessing, feature engineering, and further analysis.


# Heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix
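# numeric_only=True (pandas >= 1.5) restricts the matrix to numeric columns and skips the categorical 'class' column
correlation_matrix = df.corr(numeric_only=True)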

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", cbar=True, linewidths=0.5)
plt.title('Correlation Heatmap of Features')
plt.show()


In [None]:
correlation_matrix

# Statistical Interpretation and Insights from the Correlation Matrix

## 1. **Survival Rate and Other Variables:**
- **Survived vs. Pclass:** A moderate negative correlation of **-0.3327** between `survived` and `pclass` indicates that passengers in lower classes (higher `pclass` values) were less likely to survive.
- **Survived vs. Sex:** A strong negative correlation of **-0.5158** suggests that females were more likely to survive than males, which aligns with historical data showing women and children had higher survival rates.
- **Survived vs. Age:** The correlation of **-0.0681** is very low, indicating that age had little impact on survival.
- **Survived vs. Fare:** A moderate positive correlation of **0.2468** suggests that passengers who paid higher fares were slightly more likely to survive, potentially due to better access to lifeboats or cabins.
- **Survived vs. Embarked (Embarkation Location):** The correlation of **-0.1549** suggests a weak negative relationship with survival. This might imply that the place of embarkation had a small influence, but the effect is not strong.

## 2. **Class and Other Variables:**
- **Pclass vs. Sex:** A small positive correlation (**0.1153**) suggests that the class might have some relationship with the gender of passengers, but it is not a strong relationship.
- **Pclass vs. Fare:** There is a strong negative correlation of **-0.5492**, meaning that passengers in lower classes tended to pay lower fares.
- **Pclass vs. Embarked:** A weak positive correlation (**0.1746**) indicates a minor relationship between the class and embarkation point, though it's not a decisive factor.

## 3. **Fare and Class:**
- **Fare vs. Pclass:** A strong negative correlation of **-0.5492** suggests that higher fares were associated with higher class passengers (first and second class), as expected.
- **Fare vs. Sex:** A negative correlation (**-0.1695**) shows that males paid slightly lower fares than females, though the relationship is weak.
- **Fare vs. Fare Log / Fare Boxcox:** `fare` correlates strongly with **fare_log (0.7956)** and **fare_boxcox (0.7505)**. Both transformations are monotonic, so they preserve the ordering of fares. The Pearson correlation is below 1 only because the transformations are non-linear.

## 4. **Who and Survival:**
- **Who vs. Survived:** A moderate positive correlation (**0.4403**) suggests that being a woman (or child) increased the chances of survival, as historically women and children were prioritized in lifeboat access.
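- **Who vs. Adult Male:** The correlation is very strong (magnitude **0.8990**), which is expected because `adult_male` is 1 exactly when `who` is `man`. Note that under the encoding used here (`man` = 1, `woman` = 2, `child` = 3) the sign of this correlation should be negative.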

## 5. **Age and Class/Family Relations:**
- **Age vs. Sibsp (siblings/spouses aboard):** A moderate negative correlation (**-0.2857**) indicates that older passengers had fewer siblings or spouses aboard. This might suggest that younger passengers tended to travel with family.
- **Age vs. Parch (parents/children aboard):** A similar negative correlation (**-0.1870**) suggests older passengers might have fewer children or parents aboard.

## 6. **Alone Variable:**
- **Alone vs. Sibsp and Parch:** The variable `alone` (indicating whether a passenger was traveling alone) is negatively correlated with both `sibsp (-0.6095)` and `parch (-0.5711)`, suggesting that those with more family members aboard were less likely to be alone.
- **Alone vs. Survived:** A negative correlation (**-0.1766**) indicates that passengers traveling alone had a lower chance of survival. This makes sense since passengers with family members might have had more support in emergencies.

## 7. **Transformations (Fare_log and Fare_boxcox):**
- **Fare_log vs. Fare_boxcox:** The near-perfect correlation of **0.9961** indicates that these two transformations of the `fare` feature carry almost the same information, which is expected because both compress the right tail of a skewed variable in a similar way.
- **Fare_log vs. Survived:** A moderate positive correlation (**0.3131**) indicates that passengers with higher transformed fares also had a better chance of survival, supporting the idea that those with higher fares were more likely to have had better access to lifeboats or other advantages.
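
To see why a monotonic transformation stays highly, but not perfectly, correlated with the original variable, the sketch below compares the Pearson correlation with the rank (Spearman) correlation. The fares are a small hypothetical sample, and the rank correlation is computed as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed fares
fare = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 52.0, 80.0, 263.0, 512.33])
fare_log = np.log1p(fare)

pearson = fare.corr(fare_log)                 # linear association
spearman = fare.rank().corr(fare_log.rank())  # rank association (Spearman)

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

The ordering of the fares is unchanged by `log1p`, so the rank correlation is 1, while the Pearson correlation drops below 1 because the relationship is no longer linear.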

## **Summary of Insights:**
1. **Gender and Class are significant predictors of survival:** The strong negative correlation between `sex` and `survived` suggests women were more likely to survive. The negative correlation between `pclass` and `survived` suggests passengers in lower classes had lower survival chances.
2. **Fare has a moderate positive relationship with survival:** Passengers who paid higher fares were somewhat more likely to survive, potentially due to better access to lifeboats or cabins.
3. **Age has a minor impact on survival:** Age shows almost no linear relationship with survival, although there is a slight tendency for younger passengers to travel with family.
4. **Traveling Alone vs. Survival:** Those traveling alone were less likely to survive, which might reflect the tendency for passengers with family to have better access to safety resources.
5. **Log and Box-Cox Transformed Fare:** The transformed fares remain highly correlated with the original fare and with each other (0.9961), so only one of them should be kept for modeling. `fare_log` also shows a stronger correlation with survival (0.3131) than the raw fare (0.2468).

These insights can be tested further in predictive modeling, especially for survival prediction based on these variables.


# Feature Selection

Feature Selection is a crucial step to improve your model’s performance by eliminating irrelevant or redundant features. Feature selection can help to:

- **Reduce overfitting**
- **Improve model accuracy**
- **Reduce training time**
- **Simplify the model**

## 1. Univariate Feature Selection

We can start by selecting the most relevant features using statistical tests.

For this, we’ll use **SelectKBest** with a statistical test like the **Chi-Squared test** for categorical features and **ANOVA F-test** for continuous features.
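
A minimal sketch of this approach is shown below. It uses small synthetic data rather than the notebook's `df`, and all column names (`cat_informative`, `num_noise`, and so on) are hypothetical. Note that scikit-learn's `chi2` requires non-negative feature values, so categorical features should be encoded as counts or non-negative codes first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, f_classif

rng = np.random.default_rng(42)
n = 500
y_demo = rng.integers(0, 2, size=n)  # hypothetical binary target

# Hypothetical categorical features, encoded as non-negative integers for chi2
X_cat = pd.DataFrame({
    'cat_informative': (y_demo + (rng.random(n) < 0.2)) % 2,  # agrees with y about 80% of the time
    'cat_noise': rng.integers(0, 3, size=n),                  # unrelated to y
})

# Hypothetical continuous features for the ANOVA F-test
X_num = pd.DataFrame({
    'num_informative': y_demo * 2.0 + rng.normal(size=n),  # mean shifts with y
    'num_noise': rng.normal(size=n),                       # unrelated to y
})

# Keep the single best feature from each group
chi2_selector = SelectKBest(score_func=chi2, k=1).fit(X_cat, y_demo)
f_selector = SelectKBest(score_func=f_classif, k=1).fit(X_num, y_demo)

chi2_selected = list(X_cat.columns[chi2_selector.get_support()])
f_selected = list(X_num.columns[f_selector.get_support()])

print("Chi-squared scores:", dict(zip(X_cat.columns, chi2_selector.scores_.round(2))))
print("F-test scores:", dict(zip(X_num.columns, f_selector.scores_.round(2))))
print("Selected:", chi2_selected + f_selected)
```

In practice, `k` is a tuning choice; the value of 1 here only keeps the demonstration small.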

## 2. Correlation-based Feature Selection

We can also remove highly correlated features to reduce multicollinearity, which can destabilize coefficient estimates and make feature importances harder to interpret. This is done by calculating the correlation matrix and identifying pairs of features whose absolute correlation exceeds a threshold (e.g., 0.85). One feature from each such pair is then removed.
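
One possible way to implement this filter is sketched below on synthetic data. The helper `drop_highly_correlated` and the columns `a`, `a_copy`, and `b` are hypothetical names introduced for illustration; which feature of a correlated pair to drop is a judgment call, and this sketch simply drops the later column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'a_copy' is almost a rescaled duplicate of 'a'
rng = np.random.default_rng(0)
n = 300
base = rng.normal(size=n)
demo = pd.DataFrame({
    'a': base,
    'a_copy': base * 2 + rng.normal(scale=0.1, size=n),
    'b': rng.normal(size=n),
})

def drop_highly_correlated(frame, threshold=0.85):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

reduced, dropped = drop_highly_correlated(demo, threshold=0.85)
print("Dropped:", dropped)
print("Remaining:", list(reduced.columns))
```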

## 3. Recursive Feature Elimination (RFE)

RFE repeatedly fits a model and removes the least important features until the desired number remains. It can be used with a model such as Logistic Regression, where the least important features in each round are identified from the fitted model's coefficients (or feature importances, for tree-based models).
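
The procedure can be sketched with scikit-learn's `RFE` as follows. The data comes from `make_classification`, not from the notebook's `df`, and the choice of three features to keep is an arbitrary assumption for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 8 features, of which 3 are informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3,
    n_redundant=0, random_state=42,
)

# Scale first so that coefficient magnitudes are comparable across features
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Remove one feature per iteration until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X_demo_scaled, y_demo)

print("Selected mask:", rfe.support_)
print("Ranking (1 = selected):", rfe.ranking_)
```

Features ranked 1 are the ones kept; higher ranks were eliminated earlier. If the number of features to keep is not known in advance, `RFECV` selects it by cross-validation.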

---


In [None]:
df.info()

In [None]:
df

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns

# Features and target variable
X = df.drop(columns=['survived'])  # Features
y = df['survived']  # Target

# Step 1: Train Random Forest Classifier to get feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Step 2: Get feature importances
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
})

# Step 3: Sort the features by importance (descending order)
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)

# Step 4: Express the importance as a percentage (importances already sum to 1)
feature_importance['Importance (%)'] = feature_importance['Importance'] * 100

# Step 5: Visualize the feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance (%)', y='Feature', data=feature_importance, palette='Blues_d')

# Annotate with percentage values
for index, value in enumerate(feature_importance['Importance (%)']):
    plt.text(value + 1, index, f'{value:.2f}%', va='center', fontweight='bold')

plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance (%)')
plt.ylabel('Features')
plt.show()

# Print the feature importance in descending order
print("Feature Importance in Descending Order (as percentage):")
print(feature_importance[['Feature', 'Importance (%)']])


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve
from sklearn.preprocessing import StandardScaler

# Data preparation
X = df.drop(columns='survived')  # Features
y = df['survived']  # Target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_scaled, y_train)
y_pred = rf.predict(X_test_scaled)

# Get classification report
report = classification_report(y_test, y_pred, output_dict=True)

# Convert the report to a DataFrame for better readability
report_df = pd.DataFrame(report).transpose()
report_df

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
# Plot Confusion Matrix Heatmap (rows = actual class, columns = predicted class)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# ROC Curve
y_pred_prob = rf.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, color='blue', label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

# **Assignment**

## **Assignment 1: Feature Importance & Multicollinearity**
Using the cleaned data under `df`, you are expected to:

1. Implement a feature importance method different from the ones already used in this notebook.
2. Check for multicollinearity and covariance in the dataset.

---

## **Assignment 2: Data Distributions & Visualization**
1. Work with at least five different types of data distributions.
2. Visualize them using synthetic data.
