# Titanic Classification - Using Logistic Regression
## Titanic Dataset : Kaggle (https://www.kaggle.com/datasets/yasserh/titanic-dataset?resource=download)

## Step 1 : Data Preprocessing

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Reading The Dataset

In [2]:
titanic_df=pd.read_csv(r'E:\Machine Learning Projects\Titanic Classification\Titanic-Dataset.csv')

### Filling Missing Values and Cleaning the Dataset

In [3]:
titanic_df.isnull().sum()

# Fill missing values for 'Age' with the median age
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())

# Fill missing values for 'Embarked' with the mode (most common value)
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])

# Drop the 'Cabin' column as it has too many missing values
titanic_df.drop(columns=['Cabin'], inplace=True)

# Verify that there are no more missing values
titanic_df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
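
The cleaning cell above drops 'Cabin' because it has too many missing values, but does not show how that was judged. A minimal sketch of one way to check, using a small hypothetical frame (`toy_df`) in place of `titanic_df` and an assumed 50% cut-off that is not taken from the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df (values are illustrative only)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# Fraction of missing values per column
missing_fraction = toy_df.isnull().mean()

# Assumed threshold: drop columns with more than half of their values missing
threshold = 0.5
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
cleaned = toy_df.drop(columns=to_drop)

# Fill the remaining gaps the same way as the notebook: median for 'Age', mode for 'Embarked'
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
```

Assigning the result of `fillna` back to the column avoids relying on in-place modification of a column selection, which newer pandas versions warn about.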

## Step 2 : Feature Encoding and Selection
### Label Encoding for the 'Sex' and 'Embarked' Columns and Selecting Relevant Columns (Features)

In [4]:
from sklearn.preprocessing import LabelEncoder

# Encode 'Sex' column
le_sex = LabelEncoder()
titanic_df['Sex'] = le_sex.fit_transform(titanic_df['Sex'])

# Encode 'Embarked' column
le_embarked = LabelEncoder()
titanic_df['Embarked'] = le_embarked.fit_transform(titanic_df['Embarked'])

# Select relevant features for the model
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = titanic_df[features]
y = titanic_df['Survived']

# Display the first few rows of the processed data
X.head(), y.head()

(   Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
 0       3    1  22.0      1      0   7.2500         2
 1       1    0  38.0      1      0  71.2833         0
 2       3    0  26.0      0      0   7.9250         2
 3       1    0  35.0      1      0  53.1000         2
 4       3    1  35.0      0      0   8.0500         2,
 0    0
 1    1
 2    1
 3    1
 4    0
 Name: Survived, dtype: int64)
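
The integers in the output above come from `LabelEncoder`, which numbers the classes in sorted order. The sketch below, on a hypothetical three-row frame (`toy`), shows how to read the mapping back from `classes_`. It also shows `pd.get_dummies` as an alternative that the notebook does not use: one-hot encoding avoids implying an order among the three embarkation ports.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the category values match those in the Titanic data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# LabelEncoder assigns integer codes to the sorted unique labels
le = LabelEncoder()
codes = le.fit_transform(toy["Sex"])
sex_mapping = {label: code for code, label in enumerate(le.classes_)}

# Alternative (not used in the notebook): one indicator column per port
embarked_dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")
```

Here `sex_mapping` is `{'female': 0, 'male': 1}`, which agrees with the first row of the output above (a male passenger encoded as 1).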

## Step 3 : Splitting Dataset into Training and Testing Data

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
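
The split above is purely random. A possible variant, not used in this notebook, is to pass `stratify` so that both subsets keep the same class ratio as the full data. A minimal sketch on hypothetical arrays (`X_toy`, `y_toy`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 15 of class 0 and 5 of class 1
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

# stratify=y_toy keeps the 3:1 class ratio in both the training and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
```

With `test_size=0.2`, 4 of the 20 samples go to the test set, and stratification places exactly one class-1 sample among them.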

## Step 4 : Building A Logistic Regression Model and Training it on Training Dataset

In [6]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
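
The model above is fitted on unscaled features, so columns such as 'Fare' and 'Pclass' enter on very different numeric ranges. One option, shown here only as a sketch and not as the notebook's method, is to standardize the features inside a pipeline before the logistic regression. The data below is synthetic and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two features on very different scales
rng = np.random.default_rng(0)
small_feature = rng.normal(size=200)
large_feature = rng.normal(size=200) * 100.0
X_demo = np.column_stack([small_feature, large_feature])
y_demo = (small_feature + large_feature / 100.0 > 0).astype(int)

# StandardScaler rescales each column to zero mean and unit variance
# before the logistic regression sees it
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scaled_model.fit(X_demo, y_demo)

# Coefficients are now on a comparable scale across features
coefficients = scaled_model.named_steps["logisticregression"].coef_
train_accuracy = scaled_model.score(X_demo, y_demo)
```

Because the scaler lives inside the pipeline, calling `predict` on new data applies the same scaling automatically.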

## Step 5 : Evaluating Our Model by Creating a Confusion Matrix and Determining Accuracy

In [7]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

accuracy, conf_matrix, class_report

(0.7988826815642458,
 array([[92, 18],
        [18, 51]], dtype=int64),
 '              precision    recall  f1-score   support\n\n           0       0.84      0.84      0.84       110\n           1       0.74      0.74      0.74        69\n\n    accuracy                           0.80       179\n   macro avg       0.79      0.79      0.79       179\nweighted avg       0.80      0.80      0.80       179\n')

#### Accuracy : about 80% (0.7989)
#### Correct Predictions : 92 (true negatives) and 51 (true positives)
#### Incorrect Predictions : 18 (false positives) and 18 (false negatives)
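
The summary figures can be recomputed directly from the confusion matrix reported above. This sketch assumes scikit-learn's layout, where rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`:

```python
import numpy as np

# Confusion matrix copied from the evaluation output above
conf = np.array([[92, 18],
                 [18, 51]])

# Row-major unpacking follows the [[TN, FP], [FN, TP]] layout
tn, fp, fn, tp = conf.ravel()

accuracy = (tn + tp) / conf.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)          # of predicted survivors, share that survived
recall = tp / (tp + fn)             # of actual survivors, share that was found
```

This gives 143 correct predictions out of 179, matching the reported accuracy of 0.7989, and precision and recall of about 0.74 for the survivor class, as in the classification report.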