AIM #1: Loading the dataset and printing basic information 
1. Import the Titanic dataset using pandas
2. Create a Dataframe from the dataset
3. Print the first 10 rows of the dataset
4. Print the last 20 rows of the dataset
5. Print dataset's information
6. Describe the dataset
7. Make sure all the information returned by the different functions is displayed in a single table and not split across multiple lines

In [None]:
import pandas as pd

# 7. Show all columns and keep wide tables on one line, so every result
#    below prints as a single table. Set this before anything is printed.
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

# 1. Load the Titanic dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)

# 2. Create a DataFrame
# (pd.read_csv already returns a DataFrame)

# 3. Print the first 10 rows of the dataset
print(df.head(10))

# 4. Print the last 20 rows of the dataset
print(df.tail(20))

# 5. Print dataset's information
# df.info() prints its summary directly and returns None, so there is nothing to assign
df.info()

# 6. Describe the dataset
print(df.describe())


AIM #2: Finding issues (empty, NAs, incorrect value, incorrect format, outliers, etc.) 
1. Find out how many missing values there are in the dataset
2. For the 'Age' column, find the best way to handle the missing values
    2.1. Use an appropriate plot to study the nature of the 'Age' column
    2.2. Figure out what is the best way to calculate the central tendency of the 'Age' column based on the above plot
    2.3. Using the most suitable central tendency measure, fill the missing values in the age column
3. Decide what is the best way to handle the missing values in the 'Cabin' column
4. Similarly, decide what is the best way to handle the missing values in the 'Embarked' column
5. Handle the incorrect data in the 'Survived' column using an appropriate measure
6. Handle the incorrectly formatted data under the 'Fare' column
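
The steps above can be sketched as follows. This is a minimal sketch, not a definitive solution: it runs on a small hypothetical sample rather than the real file, because the dirty 'Survived' and 'Fare' values the tasks refer to are assumed here (survival codes other than 0 or 1, a currency symbol in the fare). The median fill, the `HasCabin` flag, and dropping invalid 'Survived' rows are illustrative choices; inspect the actual data before adopting them.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the Titanic DataFrame; the dirty
# 'Survived' and 'Fare' values are assumptions made for illustration.
df = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 80.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan, 'E46', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', np.nan, 'Q', 'S', 'S', 'S'],
    'Survived': [0, 1, 1, 2, 0, 0, -1, 1],
    'Fare': ['7.25', '$71.28', '7.92', '53.10', '8.46', '51.86', '21.07', '30.00'],
})

# 1. Count the missing values in each column
missing_before = df.isna().sum()
print(missing_before)

# 2.1 A histogram, e.g. df['Age'].plot(kind='hist'), shows the shape of 'Age'.
# 2.2 Skewness is a numeric hint: a clearly skewed column favors the median.
print('Age skewness:', df['Age'].skew())

# 2.3 Fill missing ages with the median (assumed choice; the mean is
#     reasonable if the histogram looks symmetric)
age_median = df['Age'].median()
df['Age'] = df['Age'].fillna(age_median)

# 3. 'Cabin' is mostly missing, so keep only a flag for whether it was recorded
df['HasCabin'] = df['Cabin'].notna()
df = df.drop(columns='Cabin')

# 4. 'Embarked' is categorical, so fill the gaps with the most frequent value
embarked_mode = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(embarked_mode)

# 5. 'Survived' should only hold 0 or 1; drop rows with any other value
#    (assumed policy, correcting them by hand is an alternative)
df = df[df['Survived'].isin([0, 1])].copy()

# 6. Strip everything except digits and the decimal point, then convert to float
df['Fare'] = pd.to_numeric(
    df['Fare'].astype(str).str.replace(r'[^0-9.]', '', regex=True),
    errors='coerce',
)
print(df)
```

To apply this to the real data, skip the sample construction and run the same steps on the `df` loaded in AIM #1, checking each column first to see which of these problems it actually has.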


AIM #3: Grouping 
1. Find out the average fare grouped by Pclass
    1.1. Plot the above using a suitable plot
2. Find out the average fare grouped by Sex
    2.1. Plot the above using a suitable plot

In [None]:
import matplotlib.pyplot as plt

# 1. Average fare grouped by Pclass
average_fare_by_class = df.groupby('Pclass')['Fare'].mean()
print(average_fare_by_class)

# 1.1 Plot the average fare by Pclass
average_fare_by_class.plot(kind='bar')
plt.title('Average Fare by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Average Fare')
plt.show()

# 2. Average fare grouped by Sex
average_fare_by_sex = df.groupby('Sex')['Fare'].mean()
print(average_fare_by_sex)

# 2.1 Plot the average fare by Sex
average_fare_by_sex.plot(kind='bar', color=['blue', 'pink'])
plt.title('Average Fare by Sex')
plt.xlabel('Sex')
plt.ylabel('Average Fare')
plt.show()


AIM #4: Dataset visualization using pandas

1. Plot the distribution of 'Age' using a suitable plot
2. Plot the distribution of 'Fare' using a suitable plot
3. Plot the distribution of 'Pclass' using a suitable plot
4. Plot the distribution of 'Survived' using a suitable plot
5. Plot the distribution of 'Embarked' using a suitable plot
6. Plot the distribution of 'Fare' grouped by 'Survived'
7. Plot the distribution of 'Fare' grouped by 'Pclass'
8. Plot the distribution of 'Age' grouped by 'Survived'
9. Plot the distribution of 'Age' grouped by 'Pclass'
10. Combine the 'SibSp' and 'Parch' and plot its distribution grouped by 'Survived'
11. Combine the 'SibSp' and 'Parch' and plot its distribution grouped by 'Pclass'
12. Plot 'Age' against 'Fare' to see if there's any relationship
13. Are there any other possibilities to show relationships?

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Distribution of 'Age'
sns.histplot(df['Age'], kde=True)
plt.title('Age Distribution')
plt.show()

# 2. Distribution of 'Fare'
sns.histplot(df['Fare'], kde=True)
plt.title('Fare Distribution')
plt.show()

# 3. Distribution of 'Pclass'
sns.countplot(x='Pclass', data=df)
plt.title('Distribution of Passenger Class')
plt.show()

# 4. Distribution of 'Survived'
sns.countplot(x='Survived', data=df)
plt.title('Survival Distribution')
plt.show()

# 5. Distribution of 'Embarked'
sns.countplot(x='Embarked', data=df)
plt.title('Embarked Distribution')
plt.show()

# 6. Distribution of 'Fare' grouped by 'Survived'
sns.boxplot(x='Survived', y='Fare', data=df)
plt.title('Fare Distribution by Survival')
plt.show()

# 7. Distribution of 'Fare' grouped by 'Pclass'
sns.boxplot(x='Pclass', y='Fare', data=df)
plt.title('Fare Distribution by Passenger Class')
plt.show()

# 8. Distribution of 'Age' grouped by 'Survived'
sns.boxplot(x='Survived', y='Age', data=df)
plt.title('Age Distribution by Survival')
plt.show()

# 9. Distribution of 'Age' grouped by 'Pclass'
sns.boxplot(x='Pclass', y='Age', data=df)
plt.title('Age Distribution by Passenger Class')
plt.show()

# 10. Combine 'SibSp' and 'Parch' and plot its distribution grouped by 'Survived'
df['FamilySize'] = df['SibSp'] + df['Parch']
sns.countplot(x='FamilySize', hue='Survived', data=df)
plt.title('Family Size Distribution by Survival')
plt.show()

# 11. Combine 'SibSp' and 'Parch' and plot its distribution grouped by 'Pclass'
sns.countplot(x='FamilySize', hue='Pclass', data=df)
plt.title('Family Size Distribution by Passenger Class')
plt.show()

# 12. Distribution between 'Age' and 'Fare'
sns.scatterplot(x='Age', y='Fare', data=df)
plt.title('Age vs Fare')
plt.show()

# 13. Other ways to show relationships: a pair plot draws every pairwise
#     scatter plot at once, colored by survival (rows with missing values dropped)
sns.pairplot(df[['Age', 'Fare', 'Pclass', 'Survived']].dropna(), hue='Survived')
plt.show()


AIM #5: Correlation

1. Generate a correlation matrix for the entire dataset
2. Find correlation between 'Age' and 'Fare'
3. What other possible correlations can be found in the dataset?

In [None]:
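# 1. Generate a correlation matrix
# numeric_only=True skips text columns such as 'Name' and 'Sex', which
# otherwise raise an error in pandas 2.0 and later
correlation_matrix = df.corr(numeric_only=True)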
print(correlation_matrix)

# 2. Find correlation between 'Age' and 'Fare'
age_fare_corr = df['Age'].corr(df['Fare'])
print(f'Correlation between Age and Fare: {age_fare_corr}')

# 3. Other possible correlations
# You can explore other correlations using correlation_matrix
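
One way to look for other correlations is to encode a categorical column as a number and rank every numeric column by how strongly it correlates with 'Survived'. The sketch below assumes a small made-up sample in place of the real DataFrame, and `IsFemale` is a hypothetical helper column introduced only for this illustration; the values it prints say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the Titanic DataFrame
df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0, 0, 0, 1],
    'Pclass': [3, 1, 3, 1, 3, 3, 1, 2],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male', 'male', 'female'],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 30.07],
})

# Encode the categorical 'Sex' column so it can take part in the correlation
encoded = df.assign(IsFemale=(df['Sex'] == 'female').astype(int))

# Correlation of every numeric column with 'Survived', strongest first
survived_corr = (
    encoded.corr(numeric_only=True)['Survived']
    .drop('Survived')
    .sort_values(key=abs, ascending=False)
)
print(survived_corr)
```

Sorting by absolute value keeps strong negative correlations, such as the one typically expected between 'Pclass' and 'Fare', from being pushed to the bottom of the list.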
