**Titanic Dataset**

Import Data and Required Packages

In [1]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

Import the CSV Data as Pandas DataFrame

In [2]:
df = pd.read_csv('Titanic-Dataset.csv')

Show Top 5 Records

In [3]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Show Last 5 Records

In [4]:
df.tail()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


Understand Structure & Data Types

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Statistical Summary of Numerical Columns

In [6]:
df.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Check Missing Values

In [7]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
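
The raw counts are easier to compare as a share of all rows. The following is a minimal sketch, not part of the original notebook: `missing_report` is a hypothetical helper, and `demo` is a tiny made-up frame used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages, largest count first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually contain gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical stand-in for df, so the sketch is self-contained.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})
print(missing_report(demo))
```

Applied to `df`, the counts shown above work out to roughly 19.9% missing for Age, 77.1% for Cabin, and 0.2% for Embarked.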

Check Duplicates

In [8]:
df.duplicated().sum()

np.int64(0)

Number of Unique Values

In [9]:
df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

Define numerical & categorical columns and Identify binary and ordinal features

In [11]:
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

binary_features = [
    feature for feature in df.columns 
    if df[feature].nunique() == 2
]

ordinal_features = [
    'Pclass'  # ticket class has a natural order: 1st, 2nd, 3rd
]

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))
print(f"\nWe have {len(ordinal_features)} Ordinal Features:\n", ordinal_features)
print(f"\nWe have {len(binary_features)} Binary Features:\n", binary_features)

We have 7 numerical features : ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

We have 5 categorical features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

We have 1 Ordinal Features:
 ['Pclass']

We have 2 Binary Features:
 ['Survived', 'Sex']
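
Comparing each dtype to 'O' works for this file, but it may not hold if string columns are stored with a dedicated string or category dtype. As a hedged alternative, the sketch below uses `DataFrame.select_dtypes`; `split_feature_types` and the `demo` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def split_feature_types(frame: pd.DataFrame) -> dict:
    """Group column names by rough feature type."""
    numeric = frame.select_dtypes(include="number").columns.tolist()
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    # A column with exactly two distinct non-null values counts as binary.
    binary = [col for col in frame.columns if frame[col].nunique() == 2]
    return {"numeric": numeric, "categorical": categorical, "binary": binary}


# Hypothetical stand-in for df, so the sketch is self-contained.
demo = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.93, 13.0],
})
groups = split_feature_types(demo)
print(groups)
```

Ordinal features still have to be listed by hand, because the ordering is domain knowledge that the dtype does not carry.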


Target Variable

In [12]:
'Survived'

'Survived'

Input Features

In [13]:
input_features = [col for col in df.columns if col != 'Survived']
input_features

['PassengerId',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

Size of Dataset

In [14]:
df.shape

(891, 12)

Suitability for Machine Learning 

In [19]:
'''
The Titanic dataset is well suited to supervised machine learning, specifically binary classification. The target variable, Survived, takes the values 
0 and 1, so models such as Logistic Regression, Decision Trees, Random Forests, and Support Vector Machines all apply.
The dataset mixes numerical (Age, Fare, SibSp, Parch), categorical (Embarked), ordinal (Pclass), and binary (Sex) features, which is typical of 
real-world machine learning problems. Passenger class, sex, age, fare, and family relationships (SibSp, Parch) are plausible predictors of survival.
With 891 records and 12 columns (11 input features plus the target), the dataset is large enough to train and evaluate classical machine learning 
models while remaining manageable for exploratory data analysis and preprocessing. It is too small for most deep learning approaches, but it is 
appropriate for educational, internship-level, and introductory machine learning projects.
Once missing values are handled, categorical variables are encoded, and features are scaled where the chosen model requires it, the dataset is ready 
for model development.'''
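
The preprocessing steps named above (imputation, encoding, scaling) followed by a classifier can be sketched as a single scikit-learn pipeline. This is one possible arrangement under stated assumptions, not the notebook's own method: the column selection, the median and most-frequent strategies, and the synthetic `demo` rows are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature selection; Name, Ticket, Cabin and PassengerId are left out.
numeric_cols = ["Age", "Fare", "SibSp", "Parch", "Pclass"]
categorical_cols = ["Sex", "Embarked"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression(max_iter=1000)),
])

# Synthetic rows shaped like the Titanic columns, only to make the sketch runnable.
demo = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 35.0, np.nan, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.93, 53.1, 8.05, 8.46, 51.86, 21.08],
    "SibSp": [1, 1, 0, 1, 0, 0, 0, 3],
    "Parch": [0, 0, 0, 0, 0, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "S", "Q", "S", np.nan],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0],
})
features = demo[numeric_cols + categorical_cols]
model.fit(features, demo["Survived"])
predictions = model.predict(features)
print(predictions)
```

Keeping the imputers and encoder inside the pipeline means they are fitted only on training data when the pipeline is used with a train/test split or cross-validation.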


Data Quality Observations


In [17]:
'''
The Titanic dataset has moderate data quality issues that require preprocessing before modeling:

>> Missing Values:
Age is missing for 177 of 891 rows (about 20%); these can be imputed with the mean, the median, or a predictive method.
Cabin is missing for 687 rows (about 77%), so it is a candidate for removal or for transformation into a simpler feature, such as a has-cabin indicator.
Embarked is missing for 2 rows, which can be filled with the mode.

>> Class Imbalance:
The target variable Survived is moderately imbalanced: about 38% of passengers survived and about 62% did not.
Accuracy alone can therefore be misleading; report precision, recall, and F1-score as well, and consider resampling or class weights if the 
minority class is predicted poorly.

>> Categorical Encoding Required:
Sex and Embarked are categorical and must be encoded before model training. Ticket (681 unique values) and Cabin (147 unique values) are 
high-cardinality and need feature engineering or removal rather than direct encoding.

>> Identifier Features:
PassengerId is a row identifier with no predictive value and can be dropped. Name is unique per passenger, so it is not usable as-is and is 
usually dropped, although parts of it (such as the title) can be extracted as features.

Overall, the missing values and class imbalance are typical of real-world datasets and can be handled with standard preprocessing techniques.

'''
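
The observations above can be turned into a first cleaning pass. The sketch below is a hedged illustration rather than the notebook's own code: `basic_clean` and the derived `HasCabin` column are hypothetical names, median and mode imputation are only two of the options mentioned, and `demo` is a made-up frame so the snippet runs without the CSV.

```python
import numpy as np
import pandas as pd


def basic_clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply simple imputations and drop the identifier and raw Cabin columns."""
    cleaned = frame.copy()
    # The median is less sensitive to extreme ages than the mean.
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
    # Mode imputation for the few missing embarkation ports.
    cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
    # Hypothetical indicator: keep "cabin recorded or not" instead of the sparse value.
    cleaned["HasCabin"] = cleaned["Cabin"].notna().astype(int)
    return cleaned.drop(columns=["PassengerId", "Cabin"])


# Made-up rows standing in for df.
demo = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, 30.0],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", None, "S"],
})
cleaned = basic_clean(demo)

# Share of each class in the target, to quantify the imbalance.
class_share = cleaned["Survived"].value_counts(normalize=True)
print(cleaned)
print(class_share)
```

When a train/test split is used, the median and mode should be computed on the training portion only, so that no information leaks from the test rows.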
