AIM #1: Loading the dataset and printing basic information 
1. Import the Titanic dataset using pandas
2. Create a Dataframe from the dataset
3. Print the first 10 rows of the dataset
4. Print the last 20 rows of the dataset
5. Print dataset's information
6. Describe the dataset
7. Make sure the information returned by the different functions is displayed as a single table and does not wrap across multiple lines

In [None]:
import pandas as pd

# 1. Import the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"  # URL for the Titanic dataset
data = pd.read_csv(url)

# 2. Create a DataFrame from the dataset
df = pd.DataFrame(data)

# 3. Print the first 10 rows of the dataset
first_10_rows = df.head(10)

# 4. Print the last 20 rows of the dataset
last_20_rows = df.tail(20)

# 5. Print dataset's information
# df.info() prints its summary directly and returns None, so it is called in the display section below

# 6. Describe the dataset
description = df.describe()

# 7. Show each table as a single block instead of wrapping wide tables across lines
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

# Concatenate the first 10 and last 20 rows into one DataFrame for display purposes
combined = pd.concat([first_10_rows, last_20_rows], keys=['First 10 Rows', 'Last 20 Rows'])

# Display the results
print(combined)
print("\nDataset Information:\n")
df.info()
print("\nDataset Description:\n")
print(description)
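
Note on step 5: `DataFrame.info()` writes its summary straight to standard output and returns `None`, so assigning its result to a variable stores nothing. A minimal sketch of capturing the summary as text instead, assuming a small made-up frame (`sample_df` is hypothetical and only stands in for the Titanic data):

```python
import io

import pandas as pd

# Hypothetical stand-in for the Titanic DataFrame
sample_df = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Fare": [7.25, 71.28, 8.05],
})

# info() returns None; pass a text buffer to collect what it would print
buffer = io.StringIO()
result = sample_df.info(buf=buffer)
info_text = buffer.getvalue()

print(info_text)
```

The captured string can then be printed, logged, or written to a file alongside the other outputs.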


AIM #2: Finding issues (empty, NAs, incorrect value, incorrect format, outliers, etc.) 
1. Find out how many missing values there are in the dataset
2. For the 'Age' column, find the best way to handle the missing values
    2.1. Use an appropriate plot to study the nature of the 'Age' column
    2.2. Figure out what is the best way to calculate the central tendency of the 'Age' column based on the above plot
    2.3. Using the most suitable central tendency measure, fill the missing values in the age column
3. Decide what is the best way to handle the missing values in the 'Cabin' column
4. Similarly, decide what is the best way to handle the missing values in the 'Embarked' column
5. Handle the incorrect data under the 'Survived' column using an appropriate measure
6. Handle the incorrectly formatted data under the 'Fare' column
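
For step 2.2, one possible way to back the visual judgement with a number is to check the sample skewness before choosing between mean and median. The sketch below is only an illustration: the helper `pick_central_tendency` and the 0.5 threshold are assumptions (a common rule of thumb), not part of the exercise.

```python
import pandas as pd


def pick_central_tendency(values: pd.Series, skew_threshold: float = 0.5):
    """Return ('mean' or 'median', value) based on the sample skewness."""
    clean = values.dropna()
    # A clearly skewed column is better summarised by the median
    if abs(clean.skew()) > skew_threshold:
        return "median", clean.median()
    return "mean", clean.mean()


# Hypothetical samples: one symmetric, one with a large outlier and a missing value
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 50, None])

print(pick_central_tendency(symmetric))
print(pick_central_tendency(right_skewed))
```

Applied to the real data, the call would be `pick_central_tendency(df['Age'])`; the histogram remains the primary evidence, and the skewness value only supports it.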


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# 1. Find missing values
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

# 2. Handle missing values in the 'Age' column
# 2.1. Plot to study the 'Age' column
plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], bins=30, kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# 2.2. Central tendency calculation
mean_age = df['Age'].mean()
median_age = df['Age'].median()

# Choose median as it's less affected by outliers
central_tendency_age = median_age

# 2.3. Fill missing values in 'Age'
df['Age'] = df['Age'].fillna(central_tendency_age)

# 3. Handle missing values in the 'Cabin' column
# Since 'Cabin' has many missing values, we can drop the column or fill with 'Unknown'
df['Cabin'] = df['Cabin'].fillna('Unknown')

# 4. Handle missing values in the 'Embarked' column
# Fill missing values with the mode (most common port)
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

# 5. Handle incorrect data under the 'Survived' column
# Inspect the unique values first; only 0 and 1 are valid
print("Unique values in 'Survived':", df['Survived'].unique())
df['Survived'] = df['Survived'].replace({-1: 0})  # Example: map a stray -1 to 0

# 6. Handle incorrectly formatted data under the 'Fare' column
# Convert 'Fare' to numeric, forcing errors to NaN
df['Fare'] = pd.to_numeric(df['Fare'], errors='coerce')

# Fill missing 'Fare' values with the median
median_fare = df['Fare'].median()
df['Fare'] = df['Fare'].fillna(median_fare)

# Output summary of missing values after handling
print("\nMissing values after handling:\n", df.isnull().sum())
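
Note on step 6: `pd.to_numeric(..., errors='coerce')` turns every value it cannot parse into NaN, so fares stored as text with currency symbols or thousands separators would all be replaced by the median. A minimal sketch of stripping such characters first, assuming a hypothetical `raw_fares` column (the formats shown are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical badly formatted fares; dtype=object keeps them as plain Python strings
raw_fares = pd.Series(["$7.25", " 71.28 ", "1,200.50", "unknown", None], dtype=object)

# Remove dollar signs, commas, and whitespace before converting
cleaned = raw_fares.str.replace(r"[$,\s]", "", regex=True)

# Anything still unparseable (here "unknown" and the missing entry) becomes NaN
fares = pd.to_numeric(cleaned, errors="coerce")

# Fill the remaining gaps with the median of the valid fares
fares = fares.fillna(fares.median())

print(fares)
```

Cleaning before coercion keeps the recoverable values and leaves only the truly unusable entries to be imputed.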


AIM #3: Grouping 
1. Find out the average fare grouped by Pclass
    1.1. Plot the above using a suitable plot
2. Find out the average fare grouped by Sex
    2.1. Plot the above using a suitable plot

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Ensure necessary handling has been done as per previous tasks
# (e.g., missing values)

# 1. Average fare grouped by Pclass
average_fare_pclass = df.groupby('Pclass')['Fare'].mean().reset_index()

# 1.1. Plot the average fare by Pclass
plt.figure(figsize=(8, 5))
sns.barplot(x='Pclass', y='Fare', data=average_fare_pclass, palette='viridis')
plt.title('Average Fare by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Average Fare')
plt.xticks([0, 1, 2], ['1st Class', '2nd Class', '3rd Class'])
plt.show()

# 2. Average fare grouped by Sex
average_fare_sex = df.groupby('Sex')['Fare'].mean().reset_index()

# 2.1. Plot the average fare by Sex
plt.figure(figsize=(8, 5))
sns.barplot(x='Sex', y='Fare', data=average_fare_sex, palette='pastel')
plt.title('Average Fare by Sex')
plt.xlabel('Sex')
plt.ylabel('Average Fare')
plt.show()
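
Because 'Fare' is strongly skewed, the mean alone can be misleading. As a rough sketch, several statistics can be computed per group in one call with `agg`; the `sample` frame below is hypothetical and only stands in for the Titanic data:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic DataFrame
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 120.0, 20.0, 8.0, 7.0, 9.0],
})

# Mean, median, and group size for each passenger class
fare_summary = (
    sample.groupby("Pclass")["Fare"]
    .agg(["mean", "median", "count"])
    .reset_index()
)

print(fare_summary)
```

The same pattern works for 'Sex' by swapping the grouping column.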


AIM #4: Dataset visualization using pandas

1. Plot the distribution of 'Age' using a suitable plot
2. Plot the distribution of 'Fare' using a suitable plot
3. Plot the distribution of 'Pclass' using a suitable plot
4. Plot the distribution of 'Survived' using a suitable plot
5. Plot the distribution of 'Embarked' using a suitable plot
6. Plot the distribution of 'Fare' grouped by 'Survived'
7. Plot the distribution of 'Fare' grouped by 'Pclass'
8. Plot the distribution of 'Age' grouped by 'Survived'
9. Plot the distribution of 'Age' grouped by 'Pclass'
10. Combine 'SibSp' and 'Parch' and plot the distribution of the result grouped by 'Survived'
11. Combine 'SibSp' and 'Parch' and plot the distribution of the result grouped by 'Pclass'
12. Plot a distribution between 'Age' and 'Fare' to see if there's any relationship
13. Are there any other possibilities to show relationships?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Ensure necessary handling has been done as per previous tasks
# (e.g., missing values)

# 1. Plot the distribution of 'Age'
plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], bins=30, kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# 2. Plot the distribution of 'Fare'
plt.figure(figsize=(10, 6))
sns.histplot(df['Fare'], bins=30, kde=True)
plt.title('Distribution of Fare')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()

# 3. Plot the distribution of 'Pclass'
plt.figure(figsize=(10, 6))
sns.countplot(x='Pclass', data=df, palette='viridis')
plt.title('Distribution of Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.show()

# 4. Plot the distribution of 'Survived'
plt.figure(figsize=(10, 6))
sns.countplot(x='Survived', data=df, palette='pastel')
plt.title('Distribution of Survival')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

# 5. Plot the distribution of 'Embarked'
plt.figure(figsize=(10, 6))
sns.countplot(x='Embarked', data=df, palette='Set2')
plt.title('Distribution of Embarked')
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.show()

# 6. Plot the distribution of 'Fare' grouped by 'Survived'
plt.figure(figsize=(10, 6))
sns.boxplot(x='Survived', y='Fare', data=df, palette='pastel')
plt.title('Fare Distribution by Survival')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Fare')
plt.show()

# 7. Plot the distribution of 'Fare' grouped by 'Pclass'
plt.figure(figsize=(10, 6))
sns.boxplot(x='Pclass', y='Fare', data=df, palette='viridis')
plt.title('Fare Distribution by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Fare')
plt.show()

# 8. Plot the distribution of 'Age' grouped by 'Survived'
plt.figure(figsize=(10, 6))
sns.boxplot(x='Survived', y='Age', data=df, palette='pastel')
plt.title('Age Distribution by Survival')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Age')
plt.show()

# 9. Plot the distribution of 'Age' grouped by 'Pclass'
plt.figure(figsize=(10, 6))
sns.boxplot(x='Pclass', y='Age', data=df, palette='viridis')
plt.title('Age Distribution by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Age')
plt.show()

# 10. Combine 'SibSp' and 'Parch' and plot its distribution grouped by 'Survived'
df['FamilySize'] = df['SibSp'] + df['Parch']
plt.figure(figsize=(10, 6))
sns.countplot(x='FamilySize', hue='Survived', data=df, palette='pastel')
plt.title('Family Size Distribution by Survival')
plt.xlabel('Family Size')
plt.ylabel('Count')
plt.show()

# 11. Combine 'SibSp' and 'Parch' and plot its distribution grouped by 'Pclass'
plt.figure(figsize=(10, 6))
sns.countplot(x='FamilySize', hue='Pclass', data=df, palette='viridis')
plt.title('Family Size Distribution by Passenger Class')
plt.xlabel('Family Size')
plt.ylabel('Count')
plt.show()

# 12. Plot a distribution between 'Age' and 'Fare'
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Fare', data=df, alpha=0.6)
plt.title('Relationship Between Age and Fare')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()

# 13. Other possibilities to show relationships
# Pairplot for multiple relationships
sns.pairplot(df[['Age', 'Fare', 'Pclass', 'Survived']], hue='Survived', palette='pastel')
plt.suptitle('Pairplot of Age, Fare, Pclass, and Survival', y=1.02)  # title for the whole grid, not just the last subplot
plt.show()
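
For step 13, a relationship between two categorical columns can also be shown without a plot. One option, sketched here on a hypothetical `sample` frame, is a row-normalised cross-tabulation, which gives the share of each outcome within each class:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic DataFrame
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# normalize="index" makes every row sum to 1, so each cell is a within-class proportion
survival_by_class = pd.crosstab(sample["Pclass"], sample["Survived"], normalize="index")

print(survival_by_class)
```

The resulting table could also be passed to `sns.heatmap` if a visual form is preferred.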


AIM #5: Correlation

1. Generate a correlation matrix for the entire dataset
2. Find correlation between 'Age' and 'Fare'
3. What other possible correlations can be found in the dataset?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Ensure necessary handling has been done as per previous tasks
# (e.g., missing values)

# Convert categorical variables to numeric for correlation analysis
df_encoded = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

# 1. Generate a correlation matrix for the entire dataset
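# Keep only numeric and boolean columns; text columns such as 'Name' and 'Ticket' cannot be correlated
correlation_matrix = df_encoded.select_dtypes(include=['number', 'bool']).corr()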

# Display the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True, cbar=True)
plt.title('Correlation Matrix')
plt.show()

# 2. Find correlation between 'Age' and 'Fare'
age_fare_correlation = df['Age'].corr(df['Fare'])
print(f"Correlation between 'Age' and 'Fare': {age_fare_correlation:.2f}")

# 3. Explore other possible correlations in the dataset
# You can look for correlations with 'Survived', 'Pclass', 'SibSp', and 'Parch'
correlation_with_survived = correlation_matrix['Survived']
print("\nCorrelation of other features with 'Survived':\n", correlation_with_survived.sort_values(ascending=False))
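
The default correlation in pandas is Pearson, which measures linear association only. For skewed columns such as 'Fare', a rank-based (Spearman) correlation may be worth comparing. The sketch below computes it as the Pearson correlation of the ranks, on a hypothetical `sample` frame where 'Fare' grows with the cube of 'Age':

```python
import pandas as pd

# Hypothetical data: Fare rises monotonically but non-linearly with Age
sample = pd.DataFrame({
    "Age": [10.0, 20.0, 30.0, 40.0, 50.0, None],
    "Fare": [1.0, 8.0, 27.0, 64.0, 125.0, 30.0],
})

# Use only rows where both values are present
pairs = sample[["Age", "Fare"]].dropna()

# Pearson: linear association
pearson = pairs["Age"].corr(pairs["Fare"])

# Spearman: Pearson correlation of the ranks, captures any monotonic association
spearman = pairs["Age"].rank().corr(pairs["Fare"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

A large gap between the two values suggests a monotonic but non-linear relationship, or the influence of outliers.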
