## Titanic Dataset Overview:

* **Description:** This dataset contains information about passengers aboard the **RMS Titanic** during its **fateful voyage in 1912.**
* **Features (Columns):**
    * **Passenger Information:**
        * `PassengerId`: **Unique identifier for each passenger**
        * `Pclass`: **Passenger class (1st, 2nd, 3rd)**
        * `Name`: **Passenger name**
        * `Sex`: **Passenger sex**
        * `Age`: **Passenger age (in years)**
        * `SibSp`: **Number of siblings/spouses aboard**
        * `Parch`: **Number of parents/children aboard**
        * `Ticket`: **Ticket number**
        * `Fare`: **Ticket fare**
        * `Cabin`: **Cabin number**
        * `Embarked`: **Port of embarkation** (C = Cherbourg, Q = Queenstown, S = Southampton)
    * **Survival Information:**
        * `Survived`: **Whether the passenger survived** (0 = No, 1 = Yes)
* **Data Types:** Mix of categorical (e.g., Sex, Embarked) and numerical (e.g., Age, Pclass) data.

* **Size:**  
    * Around **891 rows (passengers)**
    * Relatively small for modern data science tasks.

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

In [None]:
data = pd.read_csv('/kaggle/input/titanic-dataset/Titanic-Dataset.csv')
data.head()

In [None]:
data.describe()

In [None]:
#Checking data types
data.dtypes

Here is a breakdown of the **categorical data** and **numerical data**:


**Categorical Data:**

* `Name`
* `Sex`
* `Ticket`
* `Cabin`
* `Embarked`

**Numerical Data:**

* `PassengerId` (While technically an ID, it likely won't be used for mathematical operations)
* `Survived` (Can be treated as numerical for calculations like mean survival rate)
* `Pclass`
* `Age` (Floating-point data type suggests it can include decimals)
* `SibSp`
* `Parch`
* `Fare`
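
The same breakdown can be derived from the column dtypes rather than read off by hand. Below is a minimal sketch using `select_dtypes`; the `sample` frame and its rows are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Tiny stand-in frame with made-up rows (an assumption, not the real dataset)
sample = pd.DataFrame({
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Pclass": [3, 1],
    "Fare": [7.25, 71.28],
})

# Columns stored as numbers vs. everything else
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Note that a dtype-based split treats `Pclass` and `Survived` as numerical even though they behave like categories, which matches the lists above.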

### Checking Missing Values

In [None]:
data.isnull().sum()

**Handling null**

In [None]:
#Making a copy of the dataset
df = data.copy()

# Dropping rows with null values in Age or Embarked
df.dropna(subset=["Age", "Embarked"], inplace=True)

df.isnull().sum()
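
It can help to see how many rows this step removes and which columns still hold nulls afterwards. The sketch below uses a hypothetical four-row `toy` frame (not the real data) to show that `dropna(subset=...)` only looks at the listed columns, so a column such as `Cabin` can still contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing Age, one missing Embarked, two missing Cabin
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Embarked": ["S", "C", None, "Q"],
    "Cabin": [None, "C85", None, "E46"],
})

# Drop rows where Age or Embarked is missing; Cabin is not checked
cleaned = toy.dropna(subset=["Age", "Embarked"])
rows_dropped = len(toy) - len(cleaned)

print(f"Dropped {rows_dropped} of {len(toy)} rows")
print(cleaned.isnull().sum())
```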

In [None]:
# Mapping Sex to integers (female = 0, male = 1)
sex_mapping = {'female': 0, 'male': 1}
df['Sex'] = df['Sex'].map(sex_mapping)
print(df['Sex'].head())

In [None]:
# Mapping Embarked to integers (C = 0, Q = 1, S = 2)
embarked_mapping = {'C': 0, 'Q': 1, 'S': 2}

df['Embarked'] = df['Embarked'].map(embarked_mapping)
print("Embarked column after encoding:")
print(df['Embarked'].head())
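
One thing to watch with `Series.map` and a dictionary is that any value missing from the dictionary silently becomes `NaN`. A minimal check, assuming a made-up `ports` series with one deliberately misspelled entry:

```python
import pandas as pd

# Made-up values; the last one is deliberately lowercase to trigger a miss
ports = pd.Series(["S", "C", "Q", "s"])
port_codes = {"C": 0, "Q": 1, "S": 2}

encoded = ports.map(port_codes)

# Values that were not found in the dictionary come back as NaN
unmapped = ports[encoded.isna()]

print(encoded.tolist())
print("Unmapped values:", unmapped.tolist())
```

Running a check like this right after encoding would catch unexpected categories before they reach the model.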

## Feature Importance

## OLS Regression
A **linear regression model** expresses the relationship between a **dependent variable (y)** and at least one **independent variable (x)** as:

![image.png](attachment:aca98f27-4c25-4619-8d9a-fb80095a4b9c.png)

In the **OLS method,** we choose the values of ![image.png](attachment:2c873a88-ae02-4fe5-8404-ce67615ee29f.png)  and ![image.png](attachment:d7512ae0-ac38-43ec-b475-c79290fcd928.png) such that the sum of squared differences between the calculated and observed values of y is minimised.

Formula for OLS:

![image.png](attachment:83f83949-c593-47d7-92df-58e119122eb4.png)

Where, 

![image.png](attachment:4b4ae0c6-d5b7-46cb-acbe-bc386e1a8ad9.png) = *predicted value for the ith observation* 

![image.png](attachment:8d1359ed-7fc8-4d4c-94e2-86ea4fcc1925.png) = *actual value for the ith observation* 

![image.png](attachment:874d80df-cc3c-4675-8277-1a762ed9f2f9.png) = *error/residual for the ith observation* 

n = *total number of observations*
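
For readers viewing this outside the notebook, where the formula images may not render, the standard textbook form is written out below. This assumes the images show the usual single-predictor model; the symbol names (\beta_0 for the intercept, \beta_1 for the slope, \varepsilon for the error term) are an assumption and may differ from the images.

```latex
y = \beta_0 + \beta_1 x + \varepsilon

\hat{y}_i = \beta_0 + \beta_1 x_i, \qquad \varepsilon_i = y_i - \hat{y}_i

S(\beta_0, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^{2} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}
```

OLS picks the \beta_0 and \beta_1 that make S as small as possible.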


In [None]:
X = df[['PassengerId','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']]
Y = df['Survived']

In [None]:
# Import libraries (assuming you have pandas, scikit-learn, and statsmodels installed)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf  # For OLS model

# Split data into training and test sets (90% train, 10% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=42)


# Fit a linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, Y_train)

# Fit an OLS model (Ordinary Least Squares) with formula syntax
# The features are columns of df and the target variable is 'Survived'
# Note: this fits on the full cleaned df, not only on the training split
ols_model = smf.ols("Survived ~ " + " + ".join(list(X)), data=df).fit()
print(ols_model.summary())

**Overall Performance:**

* **R-squared value (0.401):** indicates that the **model explains about 40%** of the variance in the survival data. There **might be other factors** not included in the model that **influence survival.**
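
To make the R-squared figure concrete, it can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up `y_actual` and `y_predicted` arrays, not the model's real output.

```python
import numpy as np

# Made-up observed outcomes and model predictions (illustrative only)
y_actual = np.array([0, 1, 1, 0, 1], dtype=float)
y_predicted = np.array([0.2, 0.7, 0.9, 0.4, 0.6])

# Residual sum of squares: what the model fails to explain
ss_res = np.sum((y_actual - y_predicted) ** 2)

# Total sum of squares: variation around the mean of the observed values
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

An R-squared of 0.401 therefore means the residual sum of squares is about 60% of the total variation in `Survived`.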

**Important Features:**

* **`Pclass` (Passenger Class):** This has a **large negative coefficient (-0.1898)**. 
    * Because `Pclass` is coded 1, 2, 3, each step toward a lower class (for example, from 1st to 2nd) is associated with a decrease of about 0.19 in predicted survival, so passengers in higher classes were more likely to survive. 
    * The **p-value (close to 0)** indicates a **statistically significant relationship.**
    
* **`Sex` (Male/Female):** With the encoding used here (female = 0, male = 1), the **negative coefficient (-0.4850)** is the largest in magnitude among the significant features. 
    * Being male is associated with a decrease of about 0.49 in predicted survival, a **significant effect (p-value close to 0)**. 
    * This suggests **women were more likely** to **survive than men.**
    
* **`Age`:** 
    * Age has a **negative coefficient (-0.0065)** and a **significant p-value (close to 0).** 
    * This indicates a **very slight decrease** in **survival probability with increasing age.**

**Less Important Features (considering p-values):**

* **`SibSp` (Number of Siblings/Spouses):** The **negative coefficient (-0.0506)** suggests a **slight decrease in survival** with more siblings/spouses aboard. 
    * The **p-value (0.004)** is below the usual 0.05 threshold but larger than for the features above, so the evidence for this effect is somewhat weaker.


**Overall:**

This analysis suggests that passenger class, sex, and possibly the number of siblings/spouses on board are the most important factors influencing survival rates. Age might also play a minor role. The port of embarkation, fare, and passenger ID seem to have little to no influence based on this model. It's important to consider potential multicollinearity and explore other factors that might improve the model's performance.
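
A quick first look at multicollinearity is the pairwise correlation matrix of the features. The sketch below runs on synthetic data in which `Fare` is deliberately constructed to depend on `Pclass`; it illustrates the check itself and makes no claim about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features (assumption): Fare is built from Pclass plus noise,
# while Age is generated independently of both
pclass = rng.integers(1, 4, size=n)
fare = 100 - 30 * pclass + rng.normal(0, 5, size=n)
age = rng.normal(30, 10, size=n)
toy = pd.DataFrame({"Pclass": pclass, "Fare": fare, "Age": age})

# Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))
```

Pairs with correlations close to +1 or -1 are candidates for dropping or combining, since their individual coefficients become hard to interpret.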

## Decision Tree

In [None]:
# Number of distinct passenger classes
print(df.Pclass.nunique())

# Use only Sex and Pclass as predictors
features = ['Sex', 'Pclass']
model = DecisionTreeClassifier()

In [None]:
model.fit(df[features],df.Survived)

In [None]:
plt.figure(figsize=(10, 10))  # Set the figure size
plot_tree(model, feature_names=features, filled=True)  
plt.show()

**Decision Rules and Predictions**

1. **Split by Sex:**
    * If **Sex** is greater than the threshold (`> 0.5`, i.e. male passengers, encoded as 1):
        * Passengers are predicted to **not survive (value = 0)**
    * Otherwise (**Sex** is at or below the threshold, i.e. female passengers, encoded as 0):
        * The decision tree further analyzes **Pclass** to make the final prediction.

2. **Refine Prediction for Females (Sex <= 0.5):**
    * If **Pclass** is less than or equal to another threshold (likely representing 1st or 2nd class passengers, e.g., `<= 2.5`):
        * Passengers are predicted to **survive (value = 1)**
    * Otherwise (**Pclass** is greater than the threshold, likely representing 3rd class passengers):
        * Passengers are predicted to **not survive (value = 0)**
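
The rules above can be written as a plain function, which makes the thresholds easy to check by hand. This is a sketch of the described rules, not an export of the fitted tree; `predict_survived` is a hypothetical helper, and it assumes the encoding from the mapping cell (female = 0, male = 1).

```python
def predict_survived(sex: int, pclass: int) -> int:
    """Return 1 (survived) or 0 (did not survive) for one passenger.

    Assumes sex is encoded as female = 0, male = 1 and pclass is 1, 2, or 3.
    """
    if sex > 0.5:        # male passengers
        return 0
    if pclass <= 2.5:    # female passengers in 1st or 2nd class
        return 1
    return 0             # female passengers in 3rd class


# Try a few (sex, pclass) combinations
for sex, pclass in [(0, 1), (0, 3), (1, 1)]:
    print(sex, pclass, predict_survived(sex, pclass))
```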

**Overall**
* This simplified decision tree prioritizes **Sex as the most important factor for predicting survival**. It suggests females **(`Sex` <= 0.5)** have a significantly higher chance of survival than males **(`Sex` > 0.5)**. 
* Within the **female group, `Pclass`** is used for further refinement. 
* Female passengers in 1st or 2nd class **(`Pclass` <= 2.5)** have a higher predicted survival rate compared to those in 3rd class **(`Pclass` > 2.5)**.


## Conclusion

* **Factors associated with higher predicted survival probability:**
    * **`Sex`:** Being female (encoded as 0) is associated with a higher predicted survival probability.
    * **`Pclass`:** Passengers in **higher classes** (`lower Pclass values`) had a **higher predicted chance of survival.**
    * **`Age`:** There might be a **slight decrease** in predicted survival probability with **increasing age**, but the effect seems weak.
    * **`SibSp`** and **`Embarked`:** Having more siblings/spouses aboard (`SibSp`) is associated with a slightly lower predicted survival probability, while the port of embarkation (`Embarked`) shows little to no effect in this model.
* **`Fare`:** The price of the ticket (fare) has a **weak or negligible relationship** with the predicted survival probability in this model.

**Important Caveats:**

* **OLS regression** assumes a linear relationship between **features** and the **target variable**. This might not perfectly capture the true relationships in all cases.
* The absolute values of coefficients don't directly represent feature importance. Consider using permutation importance for a more balanced comparison of features, especially if there's multicollinearity.
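
As a rough illustration of the permutation importance idea mentioned above, the sketch below uses scikit-learn's `permutation_importance` on synthetic data. The `signal` and `noise` features are made up: one fully determines the label and the other is unrelated, so this shows only the mechanics, not results for the Titanic model.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 500

# Synthetic data (assumption): "signal" determines the label, "noise" does not
signal = rng.integers(0, 2, size=n)
noise = rng.normal(size=n)
X_toy = np.column_stack([signal, noise])
y_toy = signal

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Shuffle each column in turn and measure how much the accuracy drops
result = permutation_importance(clf, X_toy, y_toy, n_repeats=10, random_state=0)

for name, mean in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: {mean:.3f}")
```

In practice the importances would usually be computed on a held-out test set rather than on the training data used here.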

## References:
1. https://www.geeksforgeeks.org/ordinary-least-squares-ols-using-statsmodels/
2. https://www.kaggle.com/code/piyushchaudhari007/titanic-dataset-decision-tree
3. https://www.kaggle.com/code/mohamedzaghloula/titanic-classification-survived-or-not#RandomForest-With-GridSearchCV


## License


Copyright 2023 Vismay Devjee

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.