# Outlier Analysis

Our earlier analysis suggested that four features may contain outliers: Age, SibSp, Parch and Fare. We check each one individually here; later, we will revisit them in a separate Jupyter notebook using specialized libraries.

## Outliers for Age

From the earlier histogram, we suspected that ages around 70 could be outliers. Our analysis also indicated that ages greater than 64.81 may be considered outliers. Let's check those values:

In [None]:
ds_work[ds_work["Age"] > 64.81].sort_values(by = "Age")

The only relevant information we get is that almost all of these passengers died and that they were all male. There doesn't seem to be anything wrong with these values.
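Rather than reading the filtered table by eye, the same observation can be summarized numerically. The sketch below uses a small made-up DataFrame, not `ds_work`, and assumes the usual Titanic column names `Sex` and `Survived`, which this section does not name explicitly.

```python
import pandas as pd

# Made-up rows for illustration only; these are not taken from ds_work.
toy = pd.DataFrame({
    "Age": [66.0, 70.0, 71.0, 80.0, 30.0],
    "Sex": ["male", "male", "male", "male", "female"],  # assumed column name
    "Survived": [0, 0, 0, 1, 1],                         # assumed column name
})

# Same filter as in the cell above
elderly = toy[toy["Age"] > 64.81]

# Share of flagged passengers who survived, and the sex breakdown
survival_rate = elderly["Survived"].mean()
sex_counts = elderly["Sex"].value_counts()

print(survival_rate)
print(sex_counts)
```

Running the same two summaries on `ds_work` would confirm or refine the "almost all died, all male" reading.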

Let's check a boxplot for Age for better understanding.

In [None]:
sns.boxplot(data = ds_work, x = "Age")

Let's calculate the 25th percentile (q1) and the 75th percentile (q3), then use them to compute the interquartile range (IQR) and see where most Age values lie:

In [None]:
# First quartile (25th percentile) of Age; np.percentile returns a 1-element array here
q1_age = np.percentile(ds_work["Age"], [25])
q1_age[0]

In [None]:
# Third quartile (75th percentile) of Age
q3_age = np.percentile(ds_work["Age"], [75])
q3_age[0]

In [None]:
# Interquartile range: spread of the middle 50% of Age values
iqr_age = q3_age - q1_age
iqr_age[0]

Even though most Age values lie between 6 and 35 years, there is nothing wrong with the values above 64.81. Ages greater than 64.81 are perfectly plausible (even 80 years). In conclusion, it is not necessary to delete them.
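The cutoffs quoted in this section look consistent with Tukey's upper fence, q3 + 1.5 * IQR (for example, with q1 = 0 and q3 = 1 the fence is 2.50). That is an assumption on our part, since the earlier analysis is not shown here. Under that assumption, a minimal sketch of the rule follows; `iqr_fences` is a hypothetical helper and the ages are made up rather than taken from `ds_work`.

```python
import numpy as np
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (q1, q3, lower_fence, upper_fence) using Tukey's rule.

    Hypothetical helper for illustration; NaN values are ignored.
    """
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    return q1, q3, q1 - k * iqr, q3 + k * iqr

# Made-up ages for illustration only
toy_ages = pd.Series([2, 18, 22, 25, 28, 30, 35, 40, 71, 80], dtype=float)

q1, q3, low, high = iqr_fences(toy_ages)

# Values outside the fences are the candidates to inspect, not to delete blindly
outlier_ages = toy_ages[(toy_ages < low) | (toy_ages > high)]
print(q1, q3, low, high)
print(outlier_ages.tolist())
```

The fence only marks values as unusual relative to the bulk of the data; whether they are errors is still a judgment call, as in the discussion above.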

## Outliers for SibSp

Let's first check a histogram for SibSp:

In [None]:
sns.histplot(data = ds_work, x = "SibSp", bins = 30)

Values greater than or equal to 2 seem less common. Our analysis also indicated that SibSp values greater than 2.50 may be considered outliers. Let's check those values:

In [None]:
ds_work[ds_work["SibSp"] > 2.50].sort_values(by = "SibSp")

Rows that share the same SibSp value also share the same last name and the same values for Embarked, Fare, Parch, Ticket and Cabin, so they appear to represent siblings travelling together. This is consistent with what SibSp measures: the number of siblings/spouses aboard.
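The family reading above can be checked by grouping the flagged rows on surname and ticket. The sketch below is illustrative only: the rows are made up, and it assumes a `Name` column formatted as "Surname, Title. Given names", which this section does not show.

```python
import pandas as pd

# Made-up passengers for illustration only; not taken from ds_work.
toy = pd.DataFrame({
    "Name": [
        "Doe, Master. Adam",
        "Doe, Miss. Beth",
        "Doe, Miss. Cara",
        "Doe, Mr. Dan",
        "Roe, Mr. Evan",
    ],
    "SibSp": [3, 3, 3, 3, 0],
    "Ticket": ["T100", "T100", "T100", "T100", "T200"],
})

# Surname is the text before the first comma (assumed Name format)
toy["Surname"] = toy["Name"].str.split(",").str[0].str.strip()

# Same filter as in the cell above, then count members per candidate family
high_sibsp = toy[toy["SibSp"] > 2.50]
family_sizes = (
    high_sibsp.groupby(["Surname", "Ticket", "SibSp"])
    .size()
    .reset_index(name="members")
)
print(family_sizes)
```

A group whose member count is close to SibSp + 1 is what we would expect from a set of siblings travelling on one ticket.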

Let's check a boxplot for SibSp for better understanding.

In [None]:
sns.boxplot(data = ds_work, x = "SibSp")

Since SibSp takes only a few values, we can easily conclude that the 25th percentile (q1) is 0, the 75th percentile (q3) is 1, and the IQR is 1. So, even though 0 or 1 siblings/spouses is most common, it makes sense for some passengers to have more than 2 (even 8 siblings/spouses). In conclusion, it is not necessary to delete them.

## Outliers for Parch

Let's first check a histogram for Parch:

In [None]:
sns.histplot(data = ds_work, x = "Parch", bins = 30)

Our analysis indicated that Parch values greater than 0 may be considered outliers, but in the histogram only values greater than or equal to 3 seem rare. Let's check those values:

In [None]:
ds_work[ds_work["Parch"] >= 3].sort_values(by = "Parch")

This doesn't reveal any notable relationship.

Let's check a boxplot for Parch for better understanding.

In [None]:
sns.boxplot(data = ds_work, x = "Parch")

This doesn't give us valuable information either. Even though Parch values greater than 0 are less common, just like with SibSp, it is perfectly possible to have more than 0 parents/children aboard. In conclusion, it is not necessary to delete these values.
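A likely reason every Parch value above 0 gets flagged: if the cutoff comes from the upper fence q3 + 1.5 * IQR (an assumption, since the earlier analysis is not shown here), then a feature where at least 75% of the rows are 0 has q1 = q3 = 0, so the fence collapses to 0. The made-up values below illustrate that degenerate case.

```python
import numpy as np
import pandas as pd

# Made-up Parch-like values for illustration only: mostly zeros
toy_parch = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])

q1, q3 = np.percentile(toy_parch, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # assumed rule; collapses to 0 when q1 == q3 == 0

# Every non-zero value sits above the fence, however ordinary it is
flagged_share = (toy_parch > upper_fence).mean()
print(q1, q3, upper_fence, flagged_share)
```

In such cases the rule says more about the shape of the distribution than about data quality, which supports keeping the rows.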

## Outliers for Fare

From the earlier histogram, we suspected that some Fare values may be outliers. Our analysis also indicated that Fare values greater than 65.63 may be considered outliers. Let's check those values:

In [None]:
ds_work[ds_work["Fare"] > 65.63].sort_values(by = "Fare")

People who paid a higher Fare usually travelled in 1st class, which makes sense, but the other features don't tell us much more.
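The link between high fares and 1st class can be quantified with a class breakdown of the flagged rows. This is a sketch on made-up data; it assumes the ticket class is stored in a column named `Pclass`, which this section does not name.

```python
import pandas as pd

# Made-up fares and classes for illustration only; not taken from ds_work.
toy = pd.DataFrame({
    "Fare": [7.0, 8.0, 13.0, 26.0, 70.0, 80.0, 250.0, 69.0],
    "Pclass": [3, 3, 2, 2, 1, 1, 1, 3],  # assumed column name
})

# Keep only fares above the cutoff discussed above
high_fare = toy[toy["Fare"] > 65.63]

# Share of each class among the high-fare passengers
class_share = high_fare["Pclass"].value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data, a share close to 1 for class 1 would back up the observation above.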

Let's check a boxplot for Fare for better understanding.

In [None]:
sns.boxplot(data = ds_work, x = "Fare")

Let's calculate the 25th percentile (q1) and the 75th percentile (q3), then use them to compute the interquartile range (IQR) and see where most Fare values lie:

In [None]:
# First quartile (25th percentile) of Fare; np.percentile returns a 1-element array here
q1_fare = np.percentile(ds_work["Fare"], [25])
q1_fare[0]

In [None]:
# Third quartile (75th percentile) of Fare
q3_fare = np.percentile(ds_work["Fare"], [75])
q3_fare[0]

In [None]:
# Interquartile range: spread of the middle 50% of Fare values
iqr_fare = q3_fare - q1_fare
iqr_fare[0]

Even though most Fare values lie between 7.9104 and 31, there is nothing wrong with the values above 65.63. Higher fares are perfectly plausible. In conclusion, it is not necessary to delete them.