# Basic Introduction and Summary of Assignment

In this assignment, we will be working with the Titanic dataset. The goal is to load, preprocess, and analyze the data to gain insights into the factors that influenced the survival of passengers. We will perform various data preprocessing steps such as handling missing values, encoding categorical variables, and feature scaling.

## Loading and Preprocessing of the Titanic Dataset

We will start by loading the Titanic dataset and performing necessary preprocessing steps to prepare the data for analysis. This includes:

1. Handling missing values.
2. Encoding categorical variables.
3. Feature scaling.
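As a point of reference for the third step, feature scaling can be sketched as z-score standardization. The snippet below is a minimal illustration on a made-up frame, not the assignment's own solution; the `toy` and `scaled` names and all values are assumptions.

```python
import pandas as pd

# Hypothetical toy data standing in for two numeric Titanic columns.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Z-score standardization: subtract each column's mean and divide by its
# standard deviation (ddof=0, the population standard deviation).
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
```

After scaling, each column has mean 0 and unit standard deviation, so `age` and `fare` contribute on a comparable scale to distance-based or gradient-based methods.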


# Titanic Dataset Description

1. **survived**: Survival (0 = No; 1 = Yes).
2. **pclass**: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd).
3. **name**: Name.
4. **sex**: Sex.
5. **age**: Age in years.
6. **sibsp**: Number of Siblings/Spouses Aboard.
7. **parch**: Number of Parents/Children Aboard.
8. **ticket**: Ticket Number.
9. **fare**: Passenger Fare.
10. **cabin**: Cabin.
11. **embarked**: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).
12. **boat**: Lifeboat (if survived).
13. **body**: Body number (if did not survive and the body was recovered).
14. **home.dest**: Home/Destination.


# Data Exploration

## Task 1: Data Loading and Initial Exploration

**Lecture material:** Lecture 3, slides 4–8, 10, and 11.

- **Load the dataset into a Pandas DataFrame.**
- **Perform basic exploratory data analysis (EDA) to comprehend the structure and characteristics of the data.**

**Note:** Your analysis should include appropriate exploratory statistics and visualizations.
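One kind of exploratory statistic that fits this task is the survival rate within groups. The sketch below uses the dataset's column names on made-up rows, so the numbers it prints say nothing about the real data; `toy` and the result names are assumptions.

```python
import pandas as pd

# Hypothetical toy rows using the dataset's column names; values are made up.
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 3, 3, 3],
    "sex":      ["female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Share of each outcome (0 = did not survive, 1 = survived).
outcome_share = toy["survived"].value_counts(normalize=True)

# The mean of a 0/1 column is the survival rate within each group.
rate_by_sex = toy.groupby("sex")["survived"].mean()
rate_by_class = toy.groupby("pclass")["survived"].mean()

print(outcome_share, rate_by_sex, rate_by_class, sep="\n\n")
```

The same `groupby(...).mean()` pattern can be applied to `titanic_df` once it is loaded, and the resulting Series can be passed straight to a bar plot.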

In [104]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats

from scipy.stats import pearsonr

In [105]:
# Load the dataset
file_path = 'titanic3.xls'
titanic_df = pd.read_excel(file_path)

# Display the first few rows of the dataset
print(titanic_df.head())

# Display the dataset information
print(titanic_df.info())

# Display the summary statistics of the dataset
print(titanic_df.describe())

   pclass  survived                                             name     sex  \
0       1         1                    Allen, Miss. Elisabeth Walton  female   
1       1         1                   Allison, Master. Hudson Trevor    male   
2       1         0                     Allison, Miss. Helen Loraine  female   
3       1         0             Allison, Mr. Hudson Joshua Creighton    male   
4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   

       age  sibsp  parch  ticket      fare    cabin embarked boat   body  \
0  29.0000      0      0   24160  211.3375       B5        S    2    NaN   
1   0.9167      1      2  113781  151.5500  C22 C26        S   11    NaN   
2   2.0000      1      2  113781  151.5500  C22 C26        S  NaN    NaN   
3  30.0000      1      2  113781  151.5500  C22 C26        S  NaN  135.0   
4  25.0000      1      2  113781  151.5500  C22 C26        S  NaN    NaN   

                         home.dest  
0                     St 

## Task 2: Managing Missing Values

**Lecture material:** Lecture 3, slides 22–24.

- **Identify the columns containing missing values.**
- **Develop a strategy to address them.**

To manage missing values, we will:

1. Identify columns with missing values.
2. Decide on a strategy to handle these missing values, such as:
    - Removing rows or columns with a high percentage of missing values.
    - Imputing missing values using statistical methods (mean, median, mode).
    - Using more advanced imputation techniques if necessary.
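The strategy above can be sketched on a small made-up frame: measure the share of missing values per column, drop columns above a cutoff, and impute the rest. The 50% cutoff and the `toy`, `THRESHOLD`, and `sparse_columns` names are assumptions for illustration, not values prescribed by the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; values are made up.
toy = pd.DataFrame({
    "age":   [29.0, np.nan, 2.0, 30.0],
    "fare":  [211.3, 151.6, np.nan, 151.6],
    "cabin": ["B5", np.nan, np.nan, np.nan],
})

# Fraction of missing entries per column.
missing_share = toy.isnull().mean()

# Assumed cutoff (not from the assignment): drop columns more than half empty.
THRESHOLD = 0.5
sparse_columns = missing_share[missing_share > THRESHOLD].index.tolist()
reduced = toy.drop(columns=sparse_columns)

# Impute the remaining numeric gaps with each column's median.
imputed = reduced.fillna(reduced.median(numeric_only=True))

print(missing_share)
print(imputed)
```

The median is used here because it is less sensitive to extreme values than the mean; for a categorical column the mode would play the same role.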

In [106]:
# Count missing values for each column
missing_values_count = titanic_df.isnull().sum()
print("\nMissing Values Count for Each Column:")
print(missing_values_count)

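# Drop every row that has at least one missing value in any column.
# Every row in this dataset has at least one gap, so the result below is empty;
# a blanket dropna() is therefore not a usable strategy here.
filtered_df = titanic_df.dropna()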

# Display the first few rows of the filtered dataset
print("\nFiltered Dataset (No Missing Values):")
print(filtered_df.head())

# Display the dataset information
print("\nFiltered Dataset Info:")
print(filtered_df.info())

# Display the summary statistics of the filtered dataset
print("\nFiltered Dataset Summary Statistics:")
print(filtered_df.describe())


Missing Values Count for Each Column:
pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

Filtered Dataset (No Missing Values):
Empty DataFrame
Columns: [pclass, survived, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked, boat, body, home.dest]
Index: []

Filtered Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     0 non-null      int64  
 1   survived   0 non-null      int64  
 2   name       0 non-null      object 
 3   sex        0 non-null      object 
 4   age        0 non-null      float64
 5   sibsp      0 non-null      int64  
 6   parch      0 non-null      int64  
 7   ticket     0 non-null      object 
 8   fa
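The empty result above follows from how `dropna()` works: by default it removes any row that has a missing value in any column. A minimal sketch on made-up rows shows the effect, and how the `subset` parameter restricts the check to chosen columns. The `toy` frame and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows: a 'boat' entry and a 'body' entry never appear together here.
toy = pd.DataFrame({
    "survived": [1, 0, 0],
    "age":      [29.0, 30.0, np.nan],
    "boat":     ["2", np.nan, np.nan],
    "body":     [np.nan, 135.0, np.nan],
})

# Default behaviour: a row is dropped if any column is missing.
all_columns = toy.dropna()

# Restricting the check to 'age' keeps rows that are complete in that column.
age_only = toy.dropna(subset=["age"])

print(len(all_columns), len(age_only))
```

Columns that are missing by construction for part of the passengers are better dropped or imputed individually before any row-wise filtering.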

In [107]:
# Drop the 'name' column
titanic_df.drop(columns=['name'], inplace=True)

In [108]:
# Remove the 'boat' column, as this is dependent on survival status
titanic_df.drop(columns=['boat'], inplace=True)

In [109]:
# Remove the 'body' column, as this is dependent on survival status
titanic_df.drop(columns=['body'], inplace=True)
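Because `boat` and `body` are recorded as a consequence of the outcome, keeping them would leak the target into the features. The sketch below illustrates the idea on made-up rows, where the mere presence of a `boat` entry reproduces `survived`; in the real data the match need not be perfect. All names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; values are made up.
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0],
    "boat":     ["2", "11", np.nan, np.nan, np.nan],
})

# Indicator: is any lifeboat recorded for the passenger?
has_boat = toy["boat"].notnull().astype(int)

# Share of rows where the indicator alone matches the target.
agreement = (has_boat == toy["survived"]).mean()

print(agreement)
```

A feature that predicts the target this well without any modelling is a sign that it encodes the outcome itself rather than a cause of it.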

In [110]:
# Calculate the median age
median_age = titanic_df['age'].median()

# Substitute the median age for missing age values.
# Assign the result back instead of calling fillna(..., inplace=True) on the
# column selection, which pandas warns will stop working in pandas 3.0.
titanic_df['age'] = titanic_df['age'].fillna(median_age)


In [111]:
# Calculate the average fare of each passenger's class (one value per row)
average_fare_per_class = titanic_df.groupby('pclass')['fare'].transform('mean')

# Fill missing fare values with the average fare of their respective class
titanic_df['fare'] = titanic_df['fare'].fillna(average_fare_per_class)
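Class-wise imputation can be checked on a toy frame: a row-wise lookup and a vectorized `groupby(...).transform('mean')` fill should give the same values. This is a minimal sketch on made-up numbers; `toy`, `row_wise`, and `vectorized` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; values are made up.
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "fare":   [100.0, np.nan, 8.0, 10.0, np.nan],
})

# Mean fare per class; NaN entries are skipped by default.
class_mean = toy.groupby("pclass")["fare"].mean()

# Row-wise variant: look up the class mean only where the fare is missing.
row_wise = toy.apply(
    lambda row: class_mean.loc[int(row["pclass"])] if pd.isnull(row["fare"]) else row["fare"],
    axis=1,
)

# Vectorized variant: broadcast each class mean to its rows, then fill the gaps.
vectorized = toy["fare"].fillna(toy.groupby("pclass")["fare"].transform("mean"))

print(row_wise.tolist())
print(vectorized.tolist())
```

Both variants fill each gap with the mean of the passenger's own class rather than the global mean; the vectorized form avoids a Python-level loop over rows.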

In [112]:
# Perform one-hot encoding for the 'sex' column
titanic_df = pd.get_dummies(titanic_df, columns=['sex'], drop_first=True)
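The effect of `drop_first=True` is easiest to see side by side. The sketch below encodes a made-up `sex` column with and without it; the `toy`, `full`, and `reduced` names are assumptions.

```python
import pandas as pd

# Hypothetical toy rows; values are made up.
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "age": [29.0, 30.0, 2.0],
})

# One indicator column per category.
full = pd.get_dummies(toy, columns=["sex"])

# drop_first=True removes the first category's column; it is implied by the rest.
reduced = pd.get_dummies(toy, columns=["sex"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

For a two-valued column, the single remaining indicator (`sex_male`) carries all the information, and dropping the redundant column avoids perfectly collinear features.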

In [113]:
# Fill missing 'embarked' values (assumed: with the mode, since the column is
# categorical), then one-hot encode it into embarked_C, embarked_Q, embarked_S
titanic_df['embarked'] = titanic_df['embarked'].fillna(titanic_df['embarked'].mode()[0])
titanic_df = pd.get_dummies(titanic_df, columns=['embarked'])


In [114]:
# Display the updated dataset information
print(titanic_df.info(verbose=True))

# Show columns with missing values
columns_with_missing_values = titanic_df.columns[titanic_df.isnull().any()]
print("\nColumns with Missing Values:")
print(columns_with_missing_values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   pclass      1309 non-null   int64  
 1   survived    1309 non-null   int64  
 2   age         1309 non-null   float64
 3   sibsp       1309 non-null   int64  
 4   parch       1309 non-null   int64  
 5   ticket      1309 non-null   object 
 6   fare        1309 non-null   float64
 7   cabin       295 non-null    object 
 8   home.dest   745 non-null    object 
 9   sex_male    1309 non-null   bool   
 10  embarked_C  1309 non-null   bool   
 11  embarked_Q  1309 non-null   bool   
 12  embarked_S  1309 non-null   bool   
dtypes: bool(4), float64(2), int64(4), object(3)
memory usage: 97.3+ KB
None

Columns with Missing Values:
Index(['cabin', 'home.dest'], dtype='object')

In [115]:
# Drop the 'home.dest' column
titanic_df.drop(columns=['home.dest'], inplace=True)
