***
**Downloading CSV and initializing imports**
===

In [None]:
# #@title Download CSV
# #@markdown Supported URL types:
# #@markdown * Google Drive zip
# #@markdown * GitHub raw file

# csv_file_link = "https://github.com/codebasics/py/blob/master/ML/9_decision_tree/Exercise/titanic.csv" #@param {type:"string"}
# is_csv_from_github_comment = False #@param {type:"boolean"}

# import gdown
# import requests

# csv_name = ''

# if 'drive.google.com' in csv_file_link:
#   url = csv_file_link
#   url = url.replace('/file/d/', '/uc?id=').replace('/view?usp=sharing', '')
#   gdown.download(url, quiet=False)

# elif 'github.com' in csv_file_link and is_csv_from_github_comment is False:
#   url = csv_file_link.replace('github.com', 'raw.githubusercontent.com').replace('/blob', '')
#   response = requests.get(url)
#   content = response.content
#   output = url.split('/')[-1]
#   with open(output, 'wb') as f:
#       f.write(content)
#       print('[', output, ']', 'downloaded successfully')
#       csv_name = output

# elif 'github.com' in csv_file_link and is_csv_from_github_comment is True:
#   url = csv_file_link
#   response = requests.get(url)
#   content = response.content
#   output = url.split('/')[-1]
#   with open(output, 'wb') as f:
#       f.write(content)
#       print('[', output, ']', 'downloaded successfully')
#       csv_name = output

In [None]:
# df = pd.read_csv(csv_name)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn import tree

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.impute import KNNImputer

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix, roc_curve, roc_auc_score
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Alternative dataset with more rows (1309 instead of 891)
df = pd.read_csv('titanic3.csv')

***
**Data Visualization**
===

In [None]:
df.describe(include = 'all')

**Note:**

Keep in mind that only 38.2% of the roughly 1309 passengers recorded in this dataset survived the Titanic disaster.
The chart below shows how many passengers survived and how many died.

In [None]:
deceased_percentage = (df['survived'] == 0).sum() / len(df) * 100
survival_percentage = (df['survived'] == 1).sum() / len(df) * 100
print(f"Deceased in percentage: {deceased_percentage:.1f}%")
print(f"Survived in percentage: {survival_percentage:.1f}%")

# Count the number of survivors and non-survivors
survival_counts = df['survived'].value_counts()

# Create a bar plot
plt.figure(figsize=(6, 4))
plt.bar(survival_counts.index, survival_counts.values, color=['tab:blue', 'tab:orange'])
plt.xlabel('Survival Status')
plt.ylabel('Count')
plt.title('Overall Survival')
plt.xticks([0, 1], ['Deceased', 'Survived'])
plt.show()



**Explanation: Survival by Passenger Class**

The bar chart illustrates the survival status of passengers categorized by their socio-economic class (1st, 2nd, and 3rd) aboard the Titanic. Here are the key observations:

1. **Class Prioritization:**
   - Passengers in 1st class had the highest survival count (orange bars), which is consistent with their being prioritized during the evacuation.
   - 2nd class passengers had a lower survival count but still fared better than 3rd class passengers.

2. **Survival Rates:**
   - The survival rate (percentage of survivors among total passengers) was highest for 1st class passengers.
   - 2nd class passengers had a moderate survival rate.
   - 3rd class passengers faced the lowest survival rate.

3. **Mortality vs. Survivability:**
   - Among 3rd class passengers:
     - Over 500 passengers did not survive (blue bar).
     - Only around 180 passengers survived (orange bar).
     - The mortality rate (deaths relative to the total) was significantly higher than in the other classes.

4. **Conclusion:**
   - Socio-economic class played a crucial role in survival chances.
   - While 1st class passengers had the best odds, 3rd class passengers faced the greatest risk of not surviving.
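
The survival rates described above can be computed directly instead of being read off the bar heights. The sketch below is only an illustration: `sample` is a small hypothetical DataFrame (not the Titanic data) that reuses the `pclass` and `survived` column names, and `pd.crosstab` with `normalize='index'` turns each row of counts into proportions.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic DataFrame.
sample = pd.DataFrame({
    'pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'survived': [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Rows: passenger class; columns: survival status (0 = deceased, 1 = survived).
# normalize='index' divides each row by its total, so every row sums to 1.
class_rates = pd.crosstab(sample['pclass'], sample['survived'], normalize='index')
print(class_rates.round(2))
```

Replacing `sample` with `df` should give the per-class survival rates for the notebook's data.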

In [None]:
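# Create the bar plot on an explicit Axes so figsize applies to it
# (pandas .plot() would otherwise open a second figure and leave this one empty)
fig, ax = plt.subplots(figsize=(8, 6))
df.groupby('pclass')['survived'].value_counts().unstack().plot(kind='bar', ax=ax)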
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.title('Survival Status by Passenger Class')
plt.legend(title='Survived', labels=['Deceased(0)', 'Survived(1)'])
plt.show()

**Explanation: Survival by Passenger Sex**

The bar chart provides compelling evidence regarding the survival patterns based on passenger sex. Here's the breakdown:

1. **Historical Context:**
   - The evacuation of the Titanic is commonly associated with the principle of "WOMEN AND CHILDREN FIRST," under which women and children were prioritized for lifeboats.

2. **Observations from the Chart:**
   - The tallest orange bar (for females) indicates a higher number of survivors.
   - Conversely, the blue bar (for males) is significantly taller, indicating a larger number of deceased passengers.

3. **Survival Rates:**
   - Women had a higher survival rate compared to men.
   - Among female passengers, survivors outnumbered the deceased.

4. **Conclusion:**
   - The chart supports the historical practice of prioritizing women (and children) during emergencies.
   - While the mortality rate for men was higher, women had a better chance of surviving.

In [None]:
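# Create the bar plot on an explicit Axes so figsize applies to it
# (pandas .plot() would otherwise open a second figure and leave this one empty)
fig, ax = plt.subplots(figsize=(8, 6))
df.groupby('sex')['survived'].value_counts().unstack().plot(kind='bar', ax=ax)
plt.xlabel('Sex')
plt.ylabel('Count')
plt.title('Survival Status by Passenger Sex')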
plt.legend(title='Survived', labels=['Deceased(0)', 'Survived(1)'])
plt.show()

In [None]:
# Work on a copy so the columns added for plotting do not modify the original DataFrame
df_visuals = df.copy()

**Explanation: Survival by Passenger Age**

The bar chart provides insights into survival patterns based on passenger age. Here's an expanded analysis:

1. **Historical Context:**
   - During maritime disasters, the principle of "WOMEN AND CHILDREN FIRST" prioritized their evacuation to lifeboats.

2. **Observations from the Chart:**
   - For the '0-10' age group (children), the orange bar is taller than the blue bar, indicating more child survivors than child fatalities.
   - Interestingly, the blue bar for the '20-30' age group is the highest, indicating the largest number of deceased passengers in that range.

3. **Survival Rates:**
   - Children had a higher survival rate compared to other age groups.

4. **Age Range 20-30:**
   - Most passengers fell within the 20-30 age range.
   - The high death count in this group partly reflects its size, so counts alone can overstate how much worse it fared than other age groups.

5. **Conclusion:**
   - The chart supports the prioritization of women and children.
   - While the mortality rate for other age groups was higher, children had a better chance of surviving.
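
Comparing age groups by rate rather than by count depends on how ages are binned. The sketch below is a minimal illustration on hypothetical ages and outcomes (not the Titanic data), assuming left-closed bins as produced by `pd.cut(..., right=False)`: each bin includes its lower edge and excludes its upper edge, so an age of exactly 10 lands in '10-20' and an age equal to the last edge (80) gets no group.

```python
import pandas as pd

# Hypothetical ages and outcomes, chosen to hit the bin edges.
sample = pd.DataFrame({
    'age':      [2, 8, 10, 25, 27, 29, 80],
    'survived': [1, 1, 0, 0, 1, 0, 1],
})

age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
age_labels = ['0-10', '10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80']

# right=False -> bins are [0, 10), [10, 20), ..., [70, 80)
sample['age_group'] = pd.cut(sample['age'], bins=age_bins, labels=age_labels, right=False)

# Mean of a 0/1 column is the survival rate; rows without a group are skipped.
rates = sample.groupby('age_group', observed=True)['survived'].mean()
print(sample[['age', 'age_group']])
print(rates)
```

If the oldest passengers should be kept, one option is to extend the last bin edge slightly beyond the maximum age.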

In [None]:
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
age_labels = ['0-10', '10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80']

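# right=False makes the bins left-closed: [0, 10), [10, 20), ..., [70, 80),
# so an age of exactly 80 falls outside the last bin and gets no group
df_visuals['Age_Group'] = pd.cut(df_visuals['age'], bins=age_bins, labels=age_labels, right=False)

grouped_df = df_visuals.groupby(['Age_Group', 'survived'], observed=True).size().unstack(fill_value=0)

# Plot on an explicit Axes so figsize applies (pandas .plot() would otherwise open a second figure)
fig, ax = plt.subplots(figsize=(8, 6))
grouped_df.plot(kind='bar', width=0.7, ax=ax)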
plt.xlabel('Passenger Age')
plt.ylabel('Count')
plt.title('Survival Status by Passenger Age')
plt.legend(title='Survived', labels=['Deceased(0)', 'Survived(1)'])
plt.show()


**Explanation: Survival by Passenger Sibling / Spouse**

1. **Bar Chart Interpretation:**
   - Because the bins are left-closed, the '0-1' category contains passengers with no siblings/spouse aboard, and the '1-2' category contains passengers with exactly one.
   - The tallest orange bar corresponds to the '0-1' category, indicating the highest number of survivors.
   - However, the '0-1' category also has the highest number of deceased passengers (represented by the blue bar).
   - In the '1-2' category the two bars have similar heights, suggesting a comparable number of survivors and deceased passengers.

2. **Survival Rate vs. Absolute Numbers:**
   - While more passengers survived in the '0-1' category in absolute numbers, the proportion of survivors was higher in the '1-2' category.
   - To assess survival rates, consider the percentage of survivors within each category:
     - For '0-1': Survival rate ≈ 34%
     - For '1-2': Survival rate ≈ 53%

3. **Conclusion:**
   - Passengers in the '1-2' category had a higher likelihood of survival than those in the '0-1' category.
   - The chart is also consistent with the observation that traveling with more than 2 siblings/spouse significantly decreased survival chances.
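
The difference between absolute numbers and rates can be made explicit by putting both in one table. The following is a sketch under stated assumptions: `sample` is hypothetical toy data (not the Titanic data) built so that the larger group has more survivors but a lower survival rate, and the output column names `survivors`, `passengers` and `rate` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data: 8 passengers with sibsp == 0 and 4 with sibsp == 1.
sample = pd.DataFrame({
    'sibsp':    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    'survived': [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
})

summary = sample.groupby('sibsp')['survived'].agg(
    survivors='sum',     # absolute number of survivors
    passengers='count',  # group size
    rate='mean',         # survivors / passengers
)
print(summary)
```

Here the `sibsp == 0` group has more survivors in absolute terms, yet the `sibsp == 1` group has the higher rate, which is the same pattern the text describes.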

In [None]:
# sibsp_bins = [0, 1, 2, 3, 4, 5, 6, 7, 8]
# sibsp_labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-6', '6-7', '7-8']

# df_visuals['sibsp_Group'] = pd.cut(df_visuals['sibsp'], bins=sibsp_bins, labels=sibsp_labels, right=False)

# grouped_df = df_visuals.groupby(['sibsp_Group', 'survived'], observed=True).size().unstack(fill_value=0)

# plt.figure(figsize=(8, 6))
# grouped_df.plot(kind='bar', width=0.7)
# plt.xlabel('Sibling / Spouse')
# plt.ylabel('Count')
# plt.title('Survival Status by # of Passenger Sibling / Spouse')
# plt.legend(title='Survived', labels=['Deceased(0)', 'Survived(1)'])
# plt.show()


# Combine sibsp and parch (siblings, spouses, parents, children) into a single variable called "family"

df_visuals['family'] = df_visuals['sibsp'] + df_visuals['parch']


In [None]:
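fare_bins = [0.0, 7.90, 14.45, 31.28, 120, 513]
fare_labels = ['V. Low', 'Low', 'Mid', 'High', 'V. High']

# include_lowest=True keeps fares of exactly 0.0 in the first bin instead of dropping them
df_visuals['fare_Group'] = pd.cut(df_visuals['fare'], bins=fare_bins, labels=fare_labels, include_lowest=True)

grouped_df = df_visuals.groupby(['fare_Group', 'survived'], observed=True).size().unstack(fill_value=0)

# Plot on an explicit Axes so figsize applies (pandas .plot() would otherwise open a second figure)
fig, ax = plt.subplots(figsize=(8, 6))
grouped_df.plot(kind='bar', width=0.7, ax=ax)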
plt.xlabel('Passenger Fare')
plt.ylabel('Count')
plt.title('Survival Status by Passenger Fare')
plt.legend(title='Survived', labels=['Deceased(0)', 'Survived(1)'])
plt.show()


***
**Data Preprocessing**
===

In [None]:
sns.heatmap(df.isnull(), cbar = False).set_title("Missing values heatmap")

The dataset contains many variables. The ones we will likely use are the following:

TARGET VARIABLE:

SURVIVED

FEATURE VARIABLES:

PCLASS, SEX, AGE, SIBSP(?), FARE (to be changed)

In [None]:
df.nunique()

In [None]:
# df.describe(include = 'all')

The dataset has many missing values. We'll clean them in the DATA CLEANING part.

Target:
- delete the following: [name, ticket, cabin, boat, body, home.dest]
- use mean method for the following: [pclass, survived, sex, sibsp, parch, fare]
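
The plan above could look roughly like the sketch below. It is not the notebook's actual cleaning step: `sample` is a hypothetical miniature with made-up values and only a few of the listed columns, and because a mean is undefined for the text column `sex`, the sketch assumes the most frequent value (mode) is used there instead.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; all values are made up.
sample = pd.DataFrame({
    'pclass': [1, 3, 2, np.nan],
    'sex':    ['female', 'male', 'male', np.nan],
    'fare':   [80.0, 7.25, np.nan, 13.0],
    'name':   ['A', 'B', 'C', 'D'],
    'cabin':  ['C85', np.nan, np.nan, np.nan],
})

# Step 1: drop columns that will not be used as features.
cleaned = sample.drop(columns=['name', 'cabin'])

# Step 2: fill missing numeric values with each column's mean.
numeric_cols = cleaned.select_dtypes(include='number').columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())

# Step 3 (assumption): fill the text column with its most frequent value.
cleaned['sex'] = cleaned['sex'].fillna(cleaned['sex'].mode()[0])

print(cleaned)
print(cleaned.isnull().sum())
```

Whether mean imputation is appropriate for a target column such as `survived` is a separate modeling decision that this sketch does not address.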

In [None]:
print(df_visuals.isnull().sum())

In [None]:
null_rows = df_visuals[df_visuals['fare'].isnull()]
null_rows