# Data Understanding
## Collecting Initial Data

### Initial data collection Report

This project uses a single existing data source provided with the course assignment: titanic1.csv. The initial data shows good consistency. There are two missing values in Embarked, at passengers 62 and 830. They will be handled during the data preparation phase using a documented strategy; the initial plan is to impute the most frequent category. The dataset also contains several columns named zero that hold only zeros. These will also be removed during data preparation.
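As a minimal sketch of the strategy described above, the snippet below fills missing Embarked values with the most frequent category and drops constant zero columns. It runs on a small made-up DataFrame rather than titanic1.csv, and the column names `zero` and `zero.1` are assumptions about how the duplicated columns appear after loading.

```python
import pandas as pd

# Toy stand-in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Embarked": [2.0, 0.0, None, 2.0, 1.0],
    "Fare": [7.25, 71.28, 8.05, 53.10, 8.46],
    "zero": [0, 0, 0, 0, 0],
    "zero.1": [0, 0, 0, 0, 0],  # assumed name of a duplicated "zero" column
})

# Impute missing Embarked values with the most frequent category (the mode)
most_frequent = toy["Embarked"].mode().iloc[0]
toy["Embarked"] = toy["Embarked"].fillna(most_frequent)

# Drop columns whose name starts with "zero" and that contain only zeros
zero_cols = [c for c in toy.columns if c.startswith("zero") and (toy[c] == 0).all()]
cleaned = toy.drop(columns=zero_cols)

print(cleaned)
```

Checking that a column really is all zeros before dropping it guards against removing a column that only shares the name.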

## 2.2 Describe Data
### Data Description Report

The dataset is loaded directly into the **Jupyter Notebook** from a local file using Python. The dataset contains **1,309 rows** (passenger records) and **9 important columns** representing passenger attributes:

* **Passengerid** — *Integer (ID)*; unique passenger identifier
* **Age** — *Numeric (float)*; passenger age in years.
* **Fare** — *Numeric (float)*; ticket price.
* **Sex** — *Categorical (binary)*; passenger sex (encoding described in preprocessing).
* **SibSp** — *Numeric (integer)*; number of siblings/spouses aboard.
* **Parch** — *Numeric (integer)*; number of parents/children aboard.
* **Pclass** — *Categorical (integer)*; passenger class 1, 2, 3
* **Embarked** — *Categorical (integer)*; port of embarkation, encoded as 0, 1, 2.
* **Survived** — *Target (binary)*; survival outcome (1 = survived, 0 = not survived).
* **zero*** — *Constant (all zeros)*; no information content

## Variable Influence

* At this stage, attributes can be prioritized based on historical context and domain knowledge. **Sex and Age** are expected to be highly influential due to the evacuation policy often summarized as "women and children first."
* Additionally, **Pclass and Fare** are expected to be relevant, as passenger class reflects socio-economic status, which likely affected access to lifeboats and crew assistance.
* Variables related to family structure (**SibSp, Parch**) may also be influential, since travelling with close family, especially children, could affect evacuation priority and decision-making during the disaster.
* Identifier and constant variables like **Passengerid** and **zero*** columns are not considered relevant for prediction.

At this stage, relevant attributes can be preliminarily prioritized based on knowledge of the Titanic disaster history. **Sex, Pclass, Age, and Fare** are expected to be the most influential, as historical evidence suggests strong survival differences across gender, socio-economic class, and age groups. Variables related to family structure (**SibSp, Parch**) are considered moderately relevant, while identifiers like **Passengerid** are not relevant for prediction.

## Planned Descriptive Statistics

### **Age**
* Calculate mean, median, standard deviation, minimum, and maximum age
* Compare average age between survivors and non-survivors
* Check the distribution of age and presence of missing values

### **Fare**
* Calculate mean, median, standard deviation, minimum, and maximum fare
* Compare average fare between survivors and non-survivors
* Inspect fare distribution to identify skewness and outliers

### **Sex**
* Calculate frequency counts for each category
* Compute survival rates by sex

### **Pclass**
* Calculate frequency counts for each passenger class
* Compute survival rates per class
* Compare survival patterns across classes

### **SibSp**
* Calculate distribution of values (counts)
* Compare average SibSp values between survivors and non-survivors

### **Parch**
* Calculate distribution of values (counts)
* Compare average Parch values between survivors and non-survivors

### **Embarked**
* Calculate frequency counts per embarkation port
* Compute survival rates per port

This prioritization is tentative and will be validated during exploratory data analysis and modeling. As this is an academic case study, no external business analysts are involved; analytical decisions are made based on data analysis and documented reasoning.

## 2.3 Explore Data

The first step is to load the data.

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


# Find project root (folder that contains "data")
def get_project_root():
    p = Path.cwd()
    while not (p / "data").exists() and p != p.parent:
        p = p.parent
    return p

PROJECT_ROOT = get_project_root()
RAW_DATA_DIR = PROJECT_ROOT / "data" / "raw"

filename="titanic1.csv"
input_path = RAW_DATA_DIR / filename
print("Reading from:", input_path)  # optional but useful
df = pd.read_csv(input_path)

display(df)




The Titanic dataset was read from the *raw data* directory and displayed to verify that it was successfully loaded.

### Initial data

The target we need to predict is whether a passenger survived or died, so the first table shows the number of records per outcome. Note that the target column is named 2urvived in this file.

In [None]:
df.groupby(df['2urvived'])['Age'].count()

### Planned Descriptive Statistics

**Age**  
  - Calculate mean, median, standard deviation, minimum, and maximum age  
  - Compare average age between survivors and non-survivors 
  - Create age groups and check the survival rate
  - Histogram showing distribution of age


#### Calculate mean, median, standard deviation, minimum, and maximum age  

In [None]:
# Calculate mean, median, standard deviation, minimum, and maximum age for observation
data = pd.DataFrame(df['Age'])
print(data.mode())
df['Age'].agg(['mean', 'median', 'std', 'min', 'max'])

To summarize the Age variable, basic descriptive statistics were calculated. 
* The average passenger age is 29.50 years
* The median age is 28 years
* The standard deviation is 12.91, indicating that passenger ages vary widely around the average
* The youngest recorded passenger is 0.17 years and the oldest is 80 years

#### Compare average age between survivors and non-survivors

In [None]:
# Compare average age between survivors and non-survivors to see whether survivors were generally younger or older
df.groupby('2urvived')['Age'].agg(['mean','count'])

The average age of survivors (28.29) is slightly lower than that of non-survivors (29.93). The difference of only 1.64 years suggests a weak association. However, averages can hide differences at the extremes, such as children versus the elderly.

#### Create age groups and check the survival rate

In [None]:
labels = ["0-12", "13-18", "19-30", "31-45", "46-60", "61-80"]
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 12, 18, 30, 45, 60, 80], labels=labels, include_lowest=True)
counts = pd.crosstab(df["AgeGroup"], df["2urvived"])
display(df)

# Column 0 = did not survive, column 1 = survived
not_survived = counts[0].to_numpy()
survived = counts[1].to_numpy()

x = np.arange(len(labels))

plt.bar(x, not_survived, width=.45, label="Not Survived")
plt.bar(x + .45, survived, width=.45, label="Survived")
plt.xticks(x + 0.225, labels=labels, rotation=45)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Survival vs Death")
plt.legend()
plt.show()



To further explore the relationship between age and survival, passengers were grouped into meaningful age intervals.

The chart suggests that children survived at a comparatively high rate, which is consistent with the "children first" rule. Adults between 19 and 30 stand out in the other direction, with a very high number of deaths. There is also a notable gap between survivors and non-survivors among elderly passengers.
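Because the bar chart shows counts, the survival rate per age group is easier to read from a table. The sketch below shows one way to compute it with the same bins. It runs on a small made-up DataFrame, so the numbers are illustrative and not results from titanic1.csv.

```python
import pandas as pd

# Toy data (illustrative only); column names follow the notebook
toy = pd.DataFrame({
    "Age": [4, 10, 16, 22, 25, 29, 40, 52, 70],
    "2urvived": [1, 1, 0, 0, 1, 0, 1, 0, 0],
})

bins = [0, 12, 18, 30, 45, 60, 80]
labels = ["0-12", "13-18", "19-30", "31-45", "46-60", "61-80"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels, include_lowest=True)

# The mean of a 0/1 outcome is the survival rate; count gives the group size
age_rate = toy.groupby("AgeGroup", observed=False)["2urvived"].agg(["mean", "count"])
print(age_rate)
```

Showing the group size next to the rate makes it clear when a rate is based on only a handful of passengers.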

#### Histogram showing distribution of age

In [None]:
plt.hist(df['Age'], bins=20, edgecolor='black')
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

The histogram shows how age is distributed.

**Fare**  
  - Calculate mean, median, standard deviation, minimum, and maximum fare  
  - Compare average fare between survivors and non-survivors  
  - Relationship between fare and survival (counts and rates)
  - Histogram showing distribution of fare

#### Calculate mean, median, standard deviation, minimum, and maximum fare  

In [None]:
# Calculate mean, median, standard deviation, minimum, and maximum fare 
mode = df['Fare'].mode()
print(mode)
df['Fare'].agg(['mean', 'median', 'std', 'min', 'max'])



* The price of the ticket has a wide range from 0 to 512.33.
* The average price of the ticket was 33.28
* The median was 14.45, indicating that most tickets were cheap while a few expensive tickets pulled the average up
* The standard deviation is 51.74, confirming high variability and the presence of extreme prices

Fare is therefore considered a relevant variable for survival analysis, acting as a proxy for passenger class and socio-economic status rather than a direct cause of survival.
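The gap between mean and median described above is a typical sign of a right-skewed distribution. The sketch below checks this in two ways, by comparing mean and median and by computing the sample skewness with `Series.skew()`. The fare values are a small made-up sample, not the real column.

```python
import pandas as pd

# Illustrative fares: many cheap tickets and one very expensive one
fares = pd.Series([7.25, 7.90, 8.05, 13.00, 14.45, 26.00, 31.00, 80.00, 512.33], name="Fare")

mean_fare = fares.mean()
median_fare = fares.median()
skewness = fares.skew()  # positive values indicate a long right tail

# A mean above the median together with positive skewness suggests right skew
is_right_skewed = bool(mean_fare > median_fare and skewness > 0)

print(f"mean={mean_fare:.2f}, median={median_fare:.2f}, skew={skewness:.2f}")
print("Right-skewed:", is_right_skewed)
```

On the real data the same three calls can be applied to `df["Fare"]` directly.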


#### Compare average fare between survivors and non-survivors  

In [None]:
df.groupby('2urvived')['Fare'].agg(['count', 'mean'])

The average ticket price of passengers who survived is nearly double that of passengers who died.

#### Relationship between fare and survival (counts and rates)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

bins = [0, 1, 5, 10, 25, 50, 100, 600]
labels_fare = ["0-1","1-5","5-10","10-25","25-50","50-100","100-600"]

fare_bins = pd.cut(df["Fare"], bins=bins, include_lowest=True)

fare_survival = pd.crosstab(fare_bins, df["2urvived"])  # columns: 0 = died, 1 = survived
died = fare_survival[0].to_numpy()
surv = fare_survival[1].to_numpy()

x = np.arange(len(labels_fare))
w = 0.45

plt.bar(x,     died, width=w, label="Died (0)")
plt.bar(x + w, surv, width=w, label="Survived (1)")

plt.xticks(x + w/2, labels= labels_fare, rotation=45)
plt.ylabel("Number of passengers")
plt.title("Passenger Counts by Fare Range")
plt.legend()
plt.show()

Visualizations of fare show the same pattern. Most passengers paid low fares, while higher fares are associated with higher survival rate.

In [None]:
rate = pd.crosstab(fare_bins, df["2urvived"], normalize="index")

died_rate = rate[0].to_numpy()
surv_rate = rate[1].to_numpy()

labels = rate.index.astype(str)
x = np.arange(len(labels))
w = 0.45

# Grouped bars: death rate next to survival rate
plt.bar(x,died_rate, width=w, label="Died (0)")
plt.bar(x + w, surv_rate, width=w, label="Survived (1)")

plt.xticks(x + w/2, labels=labels_fare, rotation=45)
plt.xlabel("Fare range")
plt.ylabel("Survival rate")
plt.title("Survival Rate by Fare Range")
plt.legend()
plt.tight_layout()
plt.show()

The rate bar chart shows that the passengers who paid lower fares had a substantially lower probability of survival, while survival rates increase steadily with higher fare ranges. This indicates a strong association between fare and survival probability. However, fare likely reflects passenger class and access to evacuation resources rather than being a direct causal factor.
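The association described above can also be summarised in a single number. The sketch below computes the Pearson correlation between fare and the binary outcome (for a 0/1 variable this is the point-biserial correlation) and a rank-based variant that is less sensitive to extreme fares. The data is made up for illustration, and a correlation still says nothing about causation.

```python
import pandas as pd

# Toy data (illustrative only); column names follow the notebook
toy = pd.DataFrame({
    "Fare": [7.0, 8.0, 10.0, 15.0, 30.0, 60.0, 100.0, 250.0],
    "2urvived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Pearson correlation with a binary variable (point-biserial correlation)
pearson_r = toy["Fare"].corr(toy["2urvived"])

# Rank the fares first so that a few very expensive tickets do not dominate
rank_r = toy["Fare"].rank().corr(toy["2urvived"])

print(f"Pearson r:    {pearson_r:.2f}")
print(f"Rank-based r: {rank_r:.2f}")
```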

#### Histogram showing distribution of fare

In [None]:
import matplotlib.pyplot as plt
plt.hist(df['Fare'], bins=40, edgecolor='black')
plt.title('Fare')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()


**Sex**
  - Calculate frequency counts for each category and show via pie chart
  - Compute survival rates by sex
  - Relationship between sex, age, and survival (counts and rates)

#### Calculate frequency counts for each sex category 

In [None]:
sex_counts = df.groupby('Sex')['Sex'].size()
display(sex_counts.rename(index={0: 'Male', 1: 'Female'}).to_frame(name='Total Passengers'))
plt.pie(sex_counts, autopct='%1.1f%%', startangle=90, labels=['male', 'female'], colors=['blue', 'red'], wedgeprops={'edgecolor':'black'})

There are 843 male passengers and 466 female passengers.

####  Compute survival rates by sex

In [None]:
sex_outcome = pd.crosstab(df['Sex'], df['2urvived'])
dead = sex_outcome[0].to_numpy()
live = sex_outcome[1].to_numpy()
positions = np.arange(2)
w = 0.45  # bar width
plt.bar(positions, dead, width=w, label="Died")
plt.bar(positions + w, live, width=w, label="Survived")
plt.xticks(positions + w/2, labels=["Male", "Female"])
plt.legend()
sex_outcome

Among males, 734 did not survive and 109 survived, while 233 women did not survive and 233 survived. This shows a strong association between sex and survival: relative to their group size, females have a much higher survival count than males.
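Turning the counts quoted above into rates makes the difference explicit. The short calculation below uses only the four numbers reported in the text.

```python
# Counts taken from the crosstab reported in the text
male_died, male_survived = 734, 109
female_died, female_survived = 233, 233

male_rate = male_survived / (male_died + male_survived)
female_rate = female_survived / (female_died + female_survived)

print(f"Male survival rate:   {male_rate:.1%}")    # 12.9%
print(f"Female survival rate: {female_rate:.1%}")  # 50.0%
```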

#### Relationship between sex, age, and survival (counts and rates)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

age_bins = pd.cut(df["Age"], bins=[0,12,18,30,45,60,80], include_lowest=True)

tab = pd.crosstab([age_bins, df["Sex"]], df["2urvived"])  # columns are 0 and 1

male= tab.xs(0, level=1)   # Sex==0
female = tab.xs(1, level=1)   # Sex==1

print(male)
labels = ["0-12", "13-18", "19-30", "31-45", "46-60", "61-80"]
x = np.arange(6)
w = 0.4

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

# Male
axes[0].bar(x - w/2, male[0], width=w, label="Died (0)")
axes[0].bar(x + w/2, male[1], width=w, label="Survived (1)")
axes[0].set_title("Male")
axes[0].set_xticks(x)
axes[0].set_xticklabels(labels =labels, rotation=45)
axes[0].set_ylabel("Count")
axes[0].legend(title="Outcome")

# Female
axes[1].bar(x - w/2, female[0], width=w, label="Died (0)")
axes[1].bar(x + w/2, female[1], width=w, label="Survived (1)")
axes[1].set_title("Female")
axes[1].set_xticks(x)
axes[1].set_xticklabels(labels, rotation=45)
axes[1].legend(title="Outcome")

plt.suptitle("Survival Counts by Age Group and Sex")
plt.tight_layout()
plt.show()

The left chart shows survival outcomes for male passengers across age groups. In every age group, the number of male non-survivors is substantially higher than the number of survivors, with the largest concentration of deaths among young adult males between 18 and 30. Survival among males remains low across all adult age groups and decreases more at older ages.

In contrast to males, females show much higher survival counts in most age groups, especially between ages 18 and 45, where the number of survivors is comparable to or higher than the number of deaths. This pattern indicates a strong survival advantage for female passengers across age groups.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

age_bins = pd.cut(df["Age"], bins=[0,12,18,30,45,60,80], include_lowest=True)

#normalize="index" turns it into rate
tab_rate = pd.crosstab([age_bins, df["Sex"]], df["2urvived"], normalize="index")
male_rate = tab_rate.xs(0, level=1)
female_rate = tab_rate.xs(1, level=1)

labels = ["0-12","13-18","19-30","31-45","46-60","61-80"]
x = np.arange(6)
w = 0.4

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

# Male
axes[0].bar(x - w/2, male_rate[0], width=w, label="Died (0)")
axes[0].bar(x + w/2, male_rate[1], width=w, label="Survived (1)")
axes[0].set_title("Male")
axes[0].set_xticks(x)
axes[0].set_xticklabels(labels, rotation=45)
axes[0].set_ylabel("Rate")
axes[0].set_ylim(0, 1)
axes[0].legend(title="Outcome")

# Female
axes[1].bar(x - w/2, female_rate[0], width=w, label="Died (0)")
axes[1].bar(x + w/2, female_rate[1], width=w, label="Survived (1)")
axes[1].set_title("Female")
axes[1].set_xticks(x)
axes[1].set_xticklabels(labels, rotation=45)
axes[1].set_ylim(0, 1)
axes[1].legend(title="Outcome")

plt.suptitle("Survival Rates by Age Group and Sex")
plt.tight_layout()
plt.show()

**Pclass**
  - Calculate frequency counts for each passenger class and show via pie chart
  - Compute survival rates per class
  - Compare survival rate across gender and class
  - Compare survival rate across age groups and class

#### Calculate frequency counts for each passenger class 

In [None]:
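pclass_counts = df.groupby("Pclass").size()
class_names = {1: "First Class", 2: "Second Class", 3: "Third Class"}
display(pclass_counts.rename(class_names).to_frame("count"))
# Label the slices so the classes can be told apart
plt.pie(pclass_counts, labels=pclass_counts.rename(class_names).index, autopct='%1.1f%%', startangle=90)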

Third class contains the most passengers, 709 in total. First class follows with 323 passengers, a little less than half of third class. Second class contains 277 passengers.


#### Compute survival rates per class  

In [None]:
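survival_rate = df.groupby('Pclass')['2urvived'].mean()
display(survival_rate)
# The rates do not sum to a whole, so a bar chart is used instead of a pie chart
plt.bar(survival_rate.index.astype(str), survival_rate)
plt.xlabel("Passenger class")
plt.ylabel("Survival rate")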

This is consistent with the expectation that 1st class passengers had better access to the lifeboats and therefore a higher survival rate. Second class ranks second in survival, which supports the pattern that the higher the class, the higher the chance of survival.

#### Compare survival rate across gender and class

In [None]:
rates = pd.crosstab([df["Sex"], df["Pclass"]], df["2urvived"], normalize="index")
rates.columns = ['Death Rate','Survival Rate']
display(rates)

labels = [
    "Class 1 male",
    "Class 2 male",
    "Class 3 male",
    "Class 1 female",
    "Class 2 female",
    "Class 3 female",
]

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

# left pie: survival
axes[0].pie(rates['Survival Rate'], labels=labels, autopct='%1.1f%%', startangle=90)
axes[0].set_title("Survival Rate by Class and Sex")

# right pie: death
axes[1].pie(rates['Death Rate'], labels=labels, autopct='%1.1f%%', startangle=90)
axes[1].set_title("Death Rate by Class and Sex")

# main title
plt.suptitle("Survival vs Death Distribution (Class & Sex)", y=1.05)

# layout
plt.tight_layout()
plt.show()

Survival outcomes differ across the combined groups of passenger class and sex. The highest survival shares are among female passengers in first and second class, while the lowest survival is among male passengers in third class. Death rates show that the largest portion is concentrated among lower-class males. This indicates that survival was not evenly distributed but closely associated with both class level and gender. Higher class likely provided better access and priority, while sex-based differences further separated outcomes within each class.

#### Compare survival rate across age groups and class

In [None]:
age_class_pattern = pd.crosstab([df["AgeGroup"], df["Pclass"]],df['2urvived'], normalize="index")
age_class_pattern.columns = ['Death Rate','Survival Rate']
age_class_pattern

This table shows that the highest survival rates occur among children and teenagers across most classes. Survival decreases with increasing age, especially in lower passenger classes. Within nearly every age group, first class has the highest survival rate compared to second and third class, indicating a consistent class advantage. Some class–age combinations show a survival rate of 0, which likely results from very small group sizes rather than a true absence of survival, so those cases should be interpreted cautiously.
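One way to make the small-group caveat visible is to report the group size next to each rate and flag groups below a chosen threshold. The sketch below does this on made-up data. The threshold `MIN_GROUP_SIZE = 5` is an arbitrary assumption for illustration, not a value used in this project.

```python
import pandas as pd

# Toy data (illustrative only); column names follow the notebook
toy = pd.DataFrame({
    "AgeGroup": ["0-12", "0-12", "0-12", "13-18", "13-18",
                 "19-30", "19-30", "19-30", "19-30", "19-30", "19-30"],
    "Pclass":   [1, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3],
    "2urvived": [0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
})

# Survival rate and group size for every age group / class combination
summary = toy.groupby(["AgeGroup", "Pclass"])["2urvived"].agg(
    survival_rate="mean", n="count"
)

MIN_GROUP_SIZE = 5  # hypothetical cut-off for "too small to interpret"
summary["small_group"] = summary["n"] < MIN_GROUP_SIZE
print(summary)
```

A rate of 0 in a flagged row then reads as "too few passengers to judge" rather than as evidence that nobody in that group could survive.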

Girls under 12 years old in 1st class:

In [None]:
target_group = df[(df['Sex'] == 1) & (df['Pclass'] == 1) & (df['Age'] < 12)]
target_group[['Age', 'Sex', 'Pclass', '2urvived']]


There is only one girl under 12 in 1st class, and she did not survive, which is why the survival rate for this group is 0.

Males in 2nd class between 12 and 18:

In [None]:
# Select ages with 12 <= Age < 18 (note: the pd.cut bins above are right-closed, i.e. (12, 18])
target_group = df[(df['Sex'] == 0) & (df['Pclass'] == 2) & (df['Age'] >= 12) & (df['Age'] < 18)]

target_group[['Age', 'Sex', 'Pclass', '2urvived']]

There were 4 boys in 2nd class between 12 and 18, and they all died, which explains the survival rate of 0.

**SibSp**  
  - Count passengers per SibSp value
  - Compute survival rate for each SibSp value
  - Compute survival rate for each SibSp value for genders


#### Count passengers per SibSp value

In [None]:
sibsp_counts = df.groupby(df['sibsp']).size()
sibsp_counts.to_frame('Count')

In [None]:
import matplotlib.pyplot as plt
import numpy as np

sibsp_values = sibsp_counts.to_numpy()
positions = np.arange(len(sibsp_values))
# Use the actual SibSp values as labels (they are not consecutive)
tick_labels = sibsp_counts.index.astype(str)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].bar(positions, sibsp_values)
axes[0].set_xlabel("Number of siblings/spouses aboard (SibSp)")
axes[0].set_ylabel("Number of passengers")
axes[0].set_title("Distribution of SibSp")
axes[0].set_xticks(positions)
axes[0].set_xticklabels(tick_labels)

axes[1].pie(sibsp_values,
            labels=tick_labels,
            autopct='%1.1f%%',
            startangle=140,
            wedgeprops={'edgecolor': 'black'}
            )
plt.show()

Most passengers traveled with no siblings or spouse, and a smaller group traveled with one. Higher SibSp values are rare.

#### Compute survival rate for each SibSp value  

In [None]:
sibsp_survival = pd.crosstab(df["sibsp"], df["2urvived"])

x = sibsp_survival.index
died = sibsp_survival[0]
survived = sibsp_survival[1]

plt.bar(x, died,  label='Died')
plt.bar(x, survived, bottom=died,  label='Survived')

plt.ylabel('Number of Passengers')
plt.title('Total Passenger Counts by SibSp')
plt.legend()
plt.show()

Survival shows a non-linear relationship with SibSp. Passengers travelling with a small family group show the highest survival rates, while survival drops sharply for those with more than four siblings/spouses aboard.
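The stacked chart above shows counts, so the rates themselves are not directly visible. A minimal sketch of how the survival rate per SibSp value could be computed is given below, using `pd.crosstab` with `normalize="index"` on made-up data.

```python
import pandas as pd

# Toy data (illustrative only); column names follow the notebook
toy = pd.DataFrame({
    "sibsp":    [0, 0, 0, 0, 1, 1, 1, 2, 5, 5],
    "2urvived": [0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
})

# normalize="index" turns each row into shares that sum to 1;
# column 1 is therefore the survival rate for that SibSp value
sibsp_rate = pd.crosstab(toy["sibsp"], toy["2urvived"], normalize="index")[1]
print(sibsp_rate)
```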

#### Compute survival rate for each SibSp value for genders

In [None]:
rates = df.groupby(["sibsp", "Sex"])["2urvived"].mean().unstack()
print(rates)
x = np.arange(len(rates.index))
width = 0.35

plt.bar(x - width/2, rates.iloc[:, 0], width, label=rates.columns[0], edgecolor="black")
plt.bar(x + width/2, rates.iloc[:, 1], width, label=rates.columns[1], edgecolor="black")

plt.xticks(x, rates.index)
plt.title("Survival Probability by Gender and SibSp", fontsize=14)
plt.xlabel("Number of Siblings/Spouses (SibSp)")
plt.ylabel("Survival Rate (0.0 - 1.0)")
plt.legend()
plt.show()

Women have markedly higher survival rates than men across the SibSp values. Survival is highest for women with zero, one, or two siblings/spouses aboard; for the remaining values it is around 50% or 0. Male survival stays very low for all values.

**Parch**  
  - Count passengers per Parch value
  - Compare average Parch values between survivors and non-survivors  
  - SibSp and Parch are jointly considered to assess whether traveling with family influenced survival outcomes

#### Count passengers per Parch value

In [None]:
# Frequency counts for Parch
parch_counts = df.groupby(['Parch']).size()
print(parch_counts)

plt.bar(parch_counts.index, parch_counts)
plt.xlabel("Number of parents/children aboard (Parch)")
plt.ylabel("Number of passengers")
plt.title("Distribution of Parch")
plt.xticks(rotation=0)
plt.show()

Parch is the number of parents or children who travelled with the passenger on the Titanic. The output above lists the number of passengers for each Parch value, followed by a bar chart of the same distribution.

#### Compare average Parch values between survivors and non-survivors  

In [None]:
import matplotlib.pyplot as plt
import numpy as np

parch_rates = pd.crosstab(df['Parch'], df['2urvived'], normalize='index')
display(parch_rates)
dead = parch_rates[0].to_numpy()
survived = parch_rates[1].to_numpy()
x = np.arange(len(parch_rates))
width = 0.4

plt.bar(x - width/2, survived, width, label='Survived', edgecolor='black')
plt.bar(x + width/2, dead, width, label='Died', edgecolor='black')

plt.xlabel("Number of parents/children aboard (Parch)")
plt.ylabel("Rate")
plt.title("Survival vs Death Rate by Parch")

plt.legend()
plt.show()

Parch alone does not allow a clear conclusion about why the survival rates differ between family sizes as shown in the chart. It is worth noting that passengers with a small family aboard the Titanic had a higher chance of survival than those with none or with a very large family.

**Embarked**  
  - Calculate frequency counts per embarkation port  
  - Compute survival rates per port  

#### Calculate frequency counts per embarkation port 

In [None]:
embarked_counts = df.groupby(['Embarked']).size()
embarked_counts.to_frame('Count')

Embarked indicates where the passenger boarded the ship. There are three ports, encoded as 0, 1, and 2. Most passengers boarded at port 2 (914 passengers).

#### Compute survival rates per port 

In [None]:
import matplotlib.pyplot as plt

# Outcome counts per port (rows: Embarked, columns: 0 = died, 1 = survived)
embarked_survival = df.groupby(['Embarked','2urvived']).size().unstack()

# One pie per port, showing the split between deaths and survivors
fig, axes = plt.subplots(1, 3, figsize=(10, 5))
for ax, port in zip(axes, [0, 1, 2]):
    ax.pie(embarked_survival.loc[port], labels=['Dead','Live'], autopct='%1.1f%%', startangle=90)
    ax.set_title(f"Port {port}")

There is no clear reason why passengers from port 0 survived the most. This will have to be studied in more detail, adding more fields to the comparison to find the underlying link between boarding port and survival.

## Data finding conclusion

During the data exploration we reached several conclusions about the data:
* Children had higher survival regardless of background
* Passengers who paid more for their ticket or travelled in a higher class had a much higher probability of survival
* Women had much higher survival rates than men
* Third-class males had the lowest chance of survival

## 2.4 Verifying Data Quality

### Age

In [None]:
print(
df["Age"].min(), 
df["Age"].max(),
df["Age"].isna().sum())

print("Unique values:", df["Age"].unique().size)
df["Age"].describe()

### Fare

In [None]:
print("Missing Fare:", df["Fare"].isna().sum())
print("Unique values:", df["Fare"].unique().size)
print("Min Fare:", df["Fare"].min())
print("Max Fare:", df["Fare"].max())
print(df["Fare"].describe())


### Sex

In [None]:
print("Missing Sex:", df["Sex"].isna().sum())
print("Unique values:", sorted(df["Sex"].unique()))
print("Unique values:", df["Sex"].unique())
print(df["Sex"].size)   
print(df["Sex"].describe())

### Siblings and Spouses

In [None]:
print("Missing sibsp:", df["sibsp"].isna().sum())
print("Unique values:", sorted(df["sibsp"].unique()))
print(df["sibsp"].describe())

### Parents and Children

In [None]:
print("Missing parch:", df["Parch"].isna().sum())
print("Unique values:", sorted(df["Parch"].unique()))
print(df["Parch"].describe())


### Pclass

In [None]:
print("Missing Pclass:", df["Pclass"].isna().sum())
print("Unique values:", sorted(df["Pclass"].unique()))
print(df["Pclass"].describe())


### Embarked

In [None]:

print("Missing Embarked:", df["Embarked"].isna().sum())

print("\nUnique values:", df["Embarked"].unique())
print(df["Embarked"].value_counts())
print(df["Embarked"].describe())

### Conclusion of Verifying Data Quality

* Missing data: Two values are missing, both in the Embarked column, at rows 61 and 829. They will be filled with the most frequent category, which is 2.
* Data errors: Most of the data appears to be correct, so this is not a major concern.
* Measurement errors: No measurement errors were identified.
* Column naming: The target column 2urvived will be renamed to Survived.


In general the data is consistent. The main concern for data preparation is the redundant zero columns, which will have to be removed.
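The per-column checks in this section repeat the same steps, so they could be gathered in one helper. The sketch below is one possible version. The function name `quality_summary` is hypothetical, and the example runs on a small made-up DataFrame rather than the real file.

```python
import pandas as pd

def quality_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing count, number of unique values, min and max per column."""
    return pd.DataFrame({
        "missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
        "min": frame.min(numeric_only=True),
        "max": frame.max(numeric_only=True),
    })

# Toy data (illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 0.17, 80.0],
    "Embarked": [2.0, None, 0.0, 2.0],
    "zero": [0, 0, 0, 0],
})

report = quality_summary(toy)

# Columns with at most one distinct value carry no information
constant_cols = report.index[report["n_unique"] <= 1].tolist()

print(report)
print("Constant columns:", constant_cols)
```

On the real data the call would be `quality_summary(df)`, which lists the missing Embarked values and the constant zero columns in a single table.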