AIM #1: Loading the dataset and printing basic information 
1. Import the Titanic dataset using pandas
2. Create a Dataframe from the dataset
3. Print the first 10 rows of the dataset
4. Print the last 20 rows of the dataset
5. Print dataset's information
6. Describe the dataset
7. Make sure the output of each function is displayed as a single table and does not wrap onto multiple lines

In [None]:
import pandas as pd

# 1. Import the Titanic dataset
df = pd.read_csv('titanic.csv')

# 2. Create a DataFrame from the dataset
# Already done with the read_csv function

# 3. Print the first 10 rows of the dataset
first_10_rows = df.head(10)

# 4. Print the last 20 rows of the dataset
last_20_rows = df.tail(20)

# 5. Print dataset's information
# df.info() prints its summary directly and returns None,
# so it is called in the display block below rather than stored.

# 6. Describe the dataset
dataset_description = df.describe()

# 7. Display all the information without wrapping wide tables
# expand_frame_repr=False keeps each row on one line instead of splitting columns into blocks
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.expand_frame_repr', False):
    print("First 10 Rows:", first_10_rows, sep="\n")
    print("\nLast 20 Rows:", last_20_rows, sep="\n")
    print("\nDataset Information:")
    df.info()
    print("\nDataset Description:", dataset_description, sep="\n")
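Task 7 depends on how pandas wraps wide frames when they are printed. The sketch below uses a made-up wide DataFrame (the `feature_XX` column names are hypothetical and not part of the Titanic data) and assumes an 80-character display width, to show how the `display.expand_frame_repr` option changes the printed layout.

```python
import pandas as pd

# Hypothetical wide frame: 20 columns, 3 rows (illustrative data only)
wide = pd.DataFrame({f"feature_{i:02d}": range(3) for i in range(20)})

# With wrapping enabled, columns that do not fit in the width spill into extra blocks
with pd.option_context("display.max_columns", None,
                       "display.width", 80,
                       "display.expand_frame_repr", True):
    wrapped = repr(wide)

# With wrapping disabled, each row stays on one physical line
with pd.option_context("display.max_columns", None,
                       "display.expand_frame_repr", False):
    single = repr(wide)

print(len(wrapped.splitlines()), "lines when wrapped")
print(len(single.splitlines()), "lines when not wrapped")
```

In the unwrapped case the printed text has one header line plus one line per row, which is the "single table" layout the task asks for.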

AIM #2: Finding issues (empty, NAs, incorrect value, incorrect format, outliers, etc.) 
1. Find out how many missing values there are in the dataset
2. For the 'Age' column, find the best way to handle the missing values
    2.1. Use an appropriate plot to study the nature of the 'Age' column
    2.2. Figure out what is the best way to calculate the central tendency of the 'Age' column based on the above plot
    2.3. Using the most suitable central tendency measure, fill the missing values in the age column
3. Decide what is the best way to handle the missing values in the 'Cabin' column
4. Similarly, decide what is the best way to handle the missing values in the 'Embarked' column
5. Handle the incorrect data under the 'Survived' column using an appropriate measure
6. Handle the incorrectly formatted data under the 'Fare' column
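For task 6, it may help to see what `errors='coerce'` does to badly formatted values. The sketch below uses made-up fare strings (an assumption; the actual formatting problems in `titanic.csv` may differ) and compares a direct conversion with one that first strips non-numeric characters.

```python
import pandas as pd

# Made-up examples of badly formatted fares (assumption: the real column may differ)
raw_fare = pd.Series(["7.25", "$71.28", "53.1 ", "unknown", None])

# Direct conversion turns every unparseable entry into NaN, including "$71.28"
naive = pd.to_numeric(raw_fare, errors="coerce")

# Remove everything except digits and the decimal point, then convert.
# This assumes fares are non-negative and use "." as the decimal separator.
stripped = raw_fare.str.replace(r"[^0-9.]", "", regex=True)
cleaned = pd.to_numeric(stripped, errors="coerce")

print(pd.DataFrame({"raw": raw_fare, "naive": naive, "cleaned": cleaned}))
```

Stripping first recovers values such as `$71.28` that a direct conversion would discard, so fewer entries need to be imputed afterwards.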


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('titanic.csv')

# 1. Find out how many missing values there are in the dataset
missing_values = df.isnull().sum()

# 2. Handle missing values in the 'Age' column
# 2.1. Use an appropriate plot to study the nature of the 'Age' column
sns.histplot(df['Age'].dropna(), bins=30, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# 2.2. Choose the central tendency measure
# The median is robust to skew and outliers; the mean suits a roughly symmetric distribution.
# The median is used here as the safer default for a skewed histogram.
age_median = df['Age'].median()

# 2.3. Fill the missing values in the 'Age' column
df['Age'] = df['Age'].fillna(age_median)

# 3. Handle missing values in the 'Cabin' column
# Often, 'Cabin' can be filled with 'Unknown' or dropped if not critical
df['Cabin'] = df['Cabin'].fillna('Unknown')

# 4. Handle missing values in the 'Embarked' column
# Fill with the most common value (mode)
embarked_mode = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(embarked_mode)

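# 5. Handle incorrect data in the 'Survived' column
# Ensure all values are 0 or 1: coerce to numeric so that 1, 1.0 and '1' all count as 1;
# every other entry, including unparseable ones, becomes 0
df['Survived'] = (pd.to_numeric(df['Survived'], errors='coerce') == 1).astype(int)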

# 6. Handle incorrectly formatted data in the 'Fare' column
# Convert to numeric and handle errors
df['Fare'] = pd.to_numeric(df['Fare'], errors='coerce')

# Handle any remaining missing values in 'Fare'
fare_median = df['Fare'].median()
df['Fare'] = df['Fare'].fillna(fare_median)

# Display the missing-value counts (taken before cleaning) and the cleaned data
print("Missing Values (before cleaning):", missing_values, sep="\n")
print("\nCleaned DataFrame:", df.head(), sep="\n")
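Step 2.2 asks for a choice between mean and median based on the shape of the distribution. As a complement to reading the histogram by eye, the sketch below computes the sample skewness on made-up ages (illustrative values only) and applies an assumed rule of thumb: if the absolute skewness exceeds 0.5, use the median. The threshold is a convention chosen for this example, not a fixed standard.

```python
import pandas as pd

# Made-up ages with one very old passenger and one missing value (illustrative only)
ages = pd.Series([22, 24, 25, 26, 28, 30, 71, None])

skewness = ages.skew()  # sample skewness; NaN values are skipped

# Assumed rule of thumb: treat |skewness| > 0.5 as "skewed" and prefer the median
SKEW_THRESHOLD = 0.5
if abs(skewness) > SKEW_THRESHOLD:
    measure, fill_value = "median", ages.median()
else:
    measure, fill_value = "mean", ages.mean()

# Fill the missing entry with the chosen measure
filled = ages.fillna(fill_value)
print(f"skewness={skewness:.2f}, using {measure}={fill_value}")
```

Here the single large value pulls the mean well above most ages, while the median stays at a typical value, which is why the median is the safer fill for skewed data.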

AIM #3: Grouping 
1. Find out the average fare grouped by Pclass
    1.1. Plot the above using a suitable plot
2. Find out the average fare grouped by Sex
    2.1. Plot the above using a suitable plot

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('titanic.csv')

# Convert 'Fare' to numeric, handling errors
df['Fare'] = pd.to_numeric(df['Fare'], errors='coerce')

# Fill any remaining NaN values in 'Fare' with the median
fare_median = df['Fare'].median()
df['Fare'] = df['Fare'].fillna(fare_median)

# 1. Find out the average fare grouped by Pclass
avg_fare_by_pclass = df.groupby('Pclass')['Fare'].mean()

# 1.1. Plot the average fare grouped by Pclass
plt.figure(figsize=(8, 5))
sns.barplot(x=avg_fare_by_pclass.index, y=avg_fare_by_pclass.values)
plt.title('Average Fare by Pclass')
plt.xlabel('Pclass')
plt.ylabel('Average Fare')
plt.show()

# 2. Find out the average fare grouped by Sex
avg_fare_by_sex = df.groupby('Sex')['Fare'].mean()

# 2.1. Plot the average fare grouped by Sex
plt.figure(figsize=(8, 5))
sns.barplot(x=avg_fare_by_sex.index, y=avg_fare_by_sex.values)
plt.title('Average Fare by Sex')
plt.xlabel('Sex')
plt.ylabel('Average Fare')
plt.show()
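Because a group mean can be dominated by a few very high fares, it can be useful to look at the median and the group size next to it. The sketch below uses made-up fares (not taken from `titanic.csv`) and pandas named aggregation; the output column names `mean_fare`, `median_fare` and `passengers` are chosen for this example.

```python
import pandas as pd

# Made-up fares (illustrative only, not taken from titanic.csv)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Named aggregation: each keyword becomes an output column
summary = toy.groupby("Pclass")["Fare"].agg(
    mean_fare="mean",
    median_fare="median",
    passengers="count",
)
print(summary)
```

The same call should work with `'Sex'` as the grouping key, and the resulting table can be passed to a bar plot in the same way as the averages above.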

AIM #4: Dataset visualization using pandas

1. Plot the distribution of 'Age' using a suitable plot
2. Plot the distribution of 'Fare' using a suitable plot
3. Plot the distribution of 'Pclass' using a suitable plot
4. Plot the distribution of 'Survived' using a suitable plot
5. Plot the distribution of 'Embarked' using a suitable plot
6. Plot the distribution of 'Fare' grouped by 'Survived'
7. Plot the distribution of 'Fare' grouped by 'Pclass'
8. Plot the distribution of 'Age' grouped by 'Survived'
9. Plot the distribution of 'Age' grouped by 'Pclass'
10. Combine the 'SibSp' and 'Parch' and plot its distribution grouped by 'Survived'
11. Combine the 'SibSp' and 'Parch' and plot its distribution grouped by 'Pclass'
12. Plot 'Age' against 'Fare' to see if there's any relationship
13. Are there any other possibilities to show relationships?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('titanic.csv')

# Ensure necessary columns are numeric
df['Fare'] = pd.to_numeric(df['Fare'], errors='coerce')
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# Fill missing 'Age' and 'Fare' with median
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# 1. Plot the distribution of 'Age'
plt.figure(figsize=(8, 5))
sns.histplot(df['Age'], bins=30, kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# 2. Plot the distribution of 'Fare'
plt.figure(figsize=(8, 5))
sns.histplot(df['Fare'], bins=30, kde=True)
plt.title('Distribution of Fare')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()

# 3. Plot the distribution of 'Pclass'
plt.figure(figsize=(8, 5))
sns.countplot(x='Pclass', data=df)
plt.title('Distribution of Pclass')
plt.xlabel('Pclass')
plt.ylabel('Count')
plt.show()

# 4. Plot the distribution of 'Survived'
plt.figure(figsize=(8, 5))
sns.countplot(x='Survived', data=df)
plt.title('Distribution of Survived')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()

# 5. Plot the distribution of 'Embarked'
plt.figure(figsize=(8, 5))
sns.countplot(x='Embarked', data=df)
plt.title('Distribution of Embarked')
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.show()

# 6. Plot the distribution of 'Fare' grouped by 'Survived'
plt.figure(figsize=(8, 5))
sns.boxplot(x='Survived', y='Fare', data=df)
plt.title('Fare Distribution by Survived')
plt.xlabel('Survived')
plt.ylabel('Fare')
plt.show()

# 7. Plot the distribution of 'Fare' grouped by 'Pclass'
plt.figure(figsize=(8, 5))
sns.boxplot(x='Pclass', y='Fare', data=df)
plt.title('Fare Distribution by Pclass')
plt.xlabel('Pclass')
plt.ylabel('Fare')
plt.show()

# 8. Plot the distribution of 'Age' grouped by 'Survived'
plt.figure(figsize=(8, 5))
sns.boxplot(x='Survived', y='Age', data=df)
plt.title('Age Distribution by Survived')
plt.xlabel('Survived')
plt.ylabel('Age')
plt.show()

# 9. Plot the distribution of 'Age' grouped by 'Pclass'
plt.figure(figsize=(8, 5))
sns.boxplot(x='Pclass', y='Age', data=df)
plt.title('Age Distribution by Pclass')
plt.xlabel('Pclass')
plt.ylabel('Age')
plt.show()

# 10. Combine 'SibSp' and 'Parch' and plot distribution grouped by 'Survived'
df['FamilySize'] = df['SibSp'] + df['Parch']
plt.figure(figsize=(8, 5))
sns.boxplot(x='Survived', y='FamilySize', data=df)
plt.title('Family Size Distribution by Survived')
plt.xlabel('Survived')
plt.ylabel('Family Size')
plt.show()

# 11. Combine 'SibSp' and 'Parch' and plot distribution grouped by 'Pclass'
plt.figure(figsize=(8, 5))
sns.boxplot(x='Pclass', y='FamilySize', data=df)
plt.title('Family Size Distribution by Pclass')
plt.xlabel('Pclass')
plt.ylabel('Family Size')
plt.show()

# 12. Plot a distribution between 'Age' and 'Fare' to see relationships
plt.figure(figsize=(8, 5))
sns.scatterplot(x='Age', y='Fare', data=df)
plt.title('Scatter Plot of Age vs Fare')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()

# 13. Other possibilities for relationships
# Example: Pairplot for numerical features
sns.pairplot(df[['Age', 'Fare', 'Pclass', 'Survived']])
plt.show()
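For question 13, relationships between categorical columns can also be shown as a table rather than a plot. The sketch below is one possibility, using made-up passengers (illustrative values only): `pd.crosstab` with `aggfunc="mean"` gives the survival rate for each combination of 'Pclass' and 'Sex'.

```python
import pandas as pd

# Made-up passengers (illustrative only, not taken from titanic.csv)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the share of survivors in each cell
survival_rate = pd.crosstab(
    index=toy["Pclass"],
    columns=toy["Sex"],
    values=toy["Survived"],
    aggfunc="mean",
)
print(survival_rate)
```

A table like this could also be passed to `sns.heatmap` to turn it into a plot.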

AIM #5: Correlation

1. Generate a correlation matrix for the entire dataset
2. Find correlation between 'Age' and 'Fare'
3. What other possible correlations can be found in the dataset?

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('titanic.csv')

# Convert relevant columns to numeric, coercing errors
df['Fare'] = pd.to_numeric(df['Fare'], errors='coerce')
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Survived'] = pd.to_numeric(df['Survived'], errors='coerce')
df['Pclass'] = pd.to_numeric(df['Pclass'], errors='coerce')
df['SibSp'] = pd.to_numeric(df['SibSp'], errors='coerce')
df['Parch'] = pd.to_numeric(df['Parch'], errors='coerce')

# Fill missing values with median
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# 1. Generate a correlation matrix for numeric columns
numeric_columns = df.select_dtypes(include='number').columns
correlation_matrix = df[numeric_columns].corr()

# Plot the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

# 2. Find correlation between 'Age' and 'Fare'
age_fare_correlation = correlation_matrix.loc['Age', 'Fare']
print(f"Correlation between Age and Fare: {age_fare_correlation:.2f}")

# 3. Other possible correlations
# Example: Correlation between 'Pclass' and 'Survived'
pclass_survived_correlation = correlation_matrix.loc['Pclass', 'Survived']
print(f"Correlation between Pclass and Survived: {pclass_survived_correlation:.2f}")

# Example: Correlation between 'SibSp' and 'Parch'
sibsp_parch_correlation = correlation_matrix.loc['SibSp', 'Parch']
print(f"Correlation between SibSp and Parch: {sibsp_parch_correlation:.2f}")
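`DataFrame.corr()` computes the Pearson coefficient by default, which measures linear association only. For skewed columns such as 'Fare', a rank-based coefficient may be worth comparing. The sketch below uses made-up values (illustrative only) in which 'Fare' always rises with 'Age' but not linearly.

```python
import pandas as pd

# Made-up data: Fare increases with Age, but not in a straight line
toy = pd.DataFrame({
    "Age":  [10, 20, 30, 40, 50],
    "Fare": [1, 4, 9, 16, 100],
})

# Pearson measures linear association; Spearman works on ranks
pearson = toy.corr(method="pearson").loc["Age", "Fare"]
spearman = toy.corr(method="spearman").loc["Age", "Fare"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Because the ordering of the two columns is identical, the Spearman coefficient is 1, while the Pearson coefficient is lower because the single large fare breaks the linear pattern.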