# Z-Score for Outlier Detection in the Titanic Dataset

The Z-score method detects outliers based on the number of standard deviations a data point lies from the mean. If a value is more than 3 standard deviations away (i.e., |Z| > 3), it is considered an outlier.

Step-by-Step Approach

1. Load the Titanic dataset.
2. Select a numerical column (e.g., "Fare", "Age").
3. Compute the Z-score for each value.
4. Identify outliers where |Z| > 3.

In [1]:
import seaborn as sns
import pandas as pd
from scipy.stats import zscore

In [5]:
df = pd.read_csv("Titanic_dataset.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [9]:
# Select a numerical column (here, "Age")
df = df[["Age"]].dropna()  # Drop rows with missing values

In [10]:
# Compute Z-score
df['Age_zscore'] = zscore(df['Age'])

In [11]:
# Identify outliers (Z-score > 3 or < -3)
outliers = df[df['Age_zscore'].abs() > 3]

In [13]:
# Print result
print(f"Number of outliers using Z-score: {len(outliers)}")
print(outliers)

Number of outliers using Z-score: 1
     Age  Age_zscore
96  76.0    3.229374
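
The same steps can be wrapped in a small helper so they can be reused for another numerical column such as "Fare". This is only a sketch: the function name `zscore_outliers`, its `threshold` parameter, and the sample values below are hypothetical and are not part of the notebook or the dataset.

```python
import pandas as pd
from scipy.stats import zscore


def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.DataFrame:
    """Return the values of `series` whose absolute Z-score exceeds `threshold`."""
    values = series.dropna()  # Z-scores are undefined for missing values
    z = pd.Series(
        zscore(values.to_numpy()),
        index=values.index,  # keep the original row labels
        name=f"{series.name}_zscore",
    )
    result = pd.concat([values, z], axis=1)
    return result[result[z.name].abs() > threshold]


# Made-up values for illustration only (not taken from the Titanic data):
# sixty ordinary values, one extreme value, and one missing value.
demo = pd.Series([28.0, 30.0, 32.0] * 20 + [90.0, None], name="Fare")
outliers_demo = zscore_outliers(demo)
print(outliers_demo)
```

With the notebook's data, the equivalent call would be `zscore_outliers(df["Age"])`, assuming `df` still contains the "Age" column.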


# Explanation

* Step 1 : Load the Titanic dataset.
* Step 2 : Select the "Age" column, dropping any missing values.
* Step 3 : Compute the Z-score using the formula $Z = \dfrac{x - \mu}{\sigma}$, where:
    * $x$ = data point
    * $\mu$ = mean of the column
    * $\sigma$ = standard deviation of the column
* Step 4 : Select the values where |Z| > 3 (i.e., extreme values).
* Step 5 : Print the number of outliers and their values.
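
As a check on the formula in Step 3, the Z-score can also be computed by hand and compared with `scipy.stats.zscore`. The sketch below uses a few made-up ages (an assumption for illustration, not the notebook's data). One detail worth noting: `zscore` uses the population standard deviation (`ddof=0`) by default, while `Series.std()` in pandas defaults to `ddof=1`, so `ddof=0` has to be passed explicitly for the two results to match.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical ages, for illustration only
ages = pd.Series([22.0, 27.0, 34.5, 47.0, 62.0], name="Age")

mu = ages.mean()          # mean of the column
sigma = ages.std(ddof=0)  # population standard deviation, as scipy's zscore uses by default

manual_z = (ages - mu) / sigma  # Z = (x - mu) / sigma
scipy_z = zscore(ages)

# Both approaches should agree up to floating-point error
print(np.allclose(manual_z, scipy_z))
```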

# Interpretation 

* If an age has a Z-score greater than 3, it is much higher than the average age.
* If an age has a Z-score less than -3, it is much lower than the average age.
* This method is useful for normally distributed data, but it may not work well for skewed distributions.
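
One way to see the limitation with skewed data is to measure the skewness before relying on the |Z| > 3 rule. The sketch below uses made-up, strongly right-skewed values (not from the Titanic data). The single largest value inflates both the mean and the standard deviation, so the second-largest value receives a small Z-score even though it is far from the bulk of the data.

```python
import numpy as np
from scipy.stats import skew, zscore

# Hypothetical right-skewed values, for illustration only
values = np.array([7.0, 7.5, 8.0, 8.5, 9.0, 10.0, 12.0, 13.0,
                   15.0, 26.0, 30.0, 80.0, 512.0])

skewness = skew(values)  # > 0 indicates a long right tail
z = zscore(values)       # population standard deviation (ddof=0)
n_flagged = int((np.abs(z) > 3).sum())

print(f"Skewness: {skewness:.2f}")
print(f"Values flagged with |Z| > 3: {n_flagged}")
print(f"Z-score of 80.0: {z[values == 80.0][0]:.2f}")
```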