## Concepts Breakdown
### 1. Data Cleaning

- Raw data often has missing values, duplicates, or inconsistent formats. For Titanic:

    - Age: Many passengers don’t have age recorded → fill with median or group median.
    - Cabin: Mostly missing → can drop or create "HasCabin" feature.
    - Embarked: A few missing → fill with most common port ("S").
    - Ticket/Name: Not useful directly for prediction → can engineer features or drop.
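
The cleaning options listed above could be sketched as follows. This is a minimal illustration on a made-up five-row frame that only borrows the Titanic column names; the `toy` DataFrame and its values are assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample using Titanic-style column names (values are made up).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male", "male"],
    "Age":      [38.0, np.nan, 22.0, np.nan, 30.0],
    "Cabin":    ["C85", None, None, None, "E46"],
    "Embarked": ["C", "S", None, "S", "S"],
})

# Age: fill each gap with the median age of passengers sharing Pclass and Sex.
toy["Age"] = toy["Age"].fillna(
    toy.groupby(["Pclass", "Sex"])["Age"].transform("median")
)
# Fall back to the overall median for any group with no recorded ages.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Cabin: keep only whether a cabin was recorded, then drop the sparse column.
toy["HasCabin"] = toy["Cabin"].notna().astype(int)
toy = toy.drop(columns="Cabin")

# Embarked: fill gaps with the most frequent port.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```

Assigning the result back (`toy["Age"] = ...`) rather than calling `fillna(..., inplace=True)` on a selected column avoids the chained-assignment pitfall in recent pandas versions.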

### 2. Feature Engineering

- Creating new features or transforming existing ones to help the model:

    - Sex → binary (0 = male, 1 = female)
    - Pclass → ordinal ticket class (1, 2, 3), kept as an integer
    - FamilySize = SibSp + Parch + 1 (siblings/spouses + parents/children + the passenger; family size may relate to survival)
    - IsAlone = 1 if FamilySize == 1 else 0
    - Title Extraction from Name (Mr, Miss, Mrs, Master, etc.)
    - Embarked → one-hot encoding (C, Q, S)

### 3. Logistic Regression

- A classification algorithm that estimates the probability of an event (survived = 1, did not survive = 0).

    - Uses the logistic (sigmoid) function:
    - $P(y=1) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots)}}$
    - Outputs probability between 0 and 1.
    - If probability > 0.5 → predict 1 (Survived).
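
To make the formula concrete, here is a small sketch that evaluates it by hand for one row. The intercept and coefficients are hypothetical values chosen for illustration, not fitted ones, and `predict_proba` is a helper name introduced here (it is not scikit-learn's method).

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(intercept, coefs, x):
    """P(y=1) for one row: sigmoid of the linear score b0 + b1*x1 + b2*x2 + ..."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return sigmoid(z)

# Hypothetical parameters for two features (not fitted on any data).
intercept = -1.0
coefs = [2.0, -0.5]

p = predict_proba(intercept, coefs, [1, 1])  # linear score z = -1 + 2 - 0.5 = 0.5
label = int(p > 0.5)                         # apply the 0.5 decision threshold
print(f"P(y=1) = {p:.3f}, predicted class = {label}")
```

A linear score of exactly 0 gives a probability of 0.5, so the 0.5 threshold is the same as asking whether the linear score is positive.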

### 4. Model Training

- Split data into train and test.
- Fit logistic regression on train set.
- Use coefficients to interpret impact of features.

### 5. Evaluation

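- Accuracy: share of predictions that are correct.
- Confusion Matrix: counts of TP, FP, TN, FN.
- Precision/Recall/F1: more informative than accuracy when classes are imbalanced.
- ROC Curve & AUC: how well the predicted probabilities rank survivors above non-survivors across all thresholds.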
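
As a rough illustration of how these metrics are defined, the sketch below computes them in plain Python. The counts and scores are hypothetical, and `classification_metrics` and `pairwise_auc` are helper names introduced here; in practice the `sklearn.metrics` functions would be used instead.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This O(n^2) form is for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical confusion-matrix counts.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
print(metrics)

# Hypothetical labels and predicted probabilities.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print("AUC:", auc)
```

The pairwise form shows why AUC measures ranking: it depends only on the order of the scores, not on any single threshold.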

In [1]:
# Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [5]:

# Step 2: Load Data
df = pd.read_csv("data/titanic.csv")
print(df.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


In [6]:
# Step 3: Handle Missing Values
df['Age'] = df['Age'].fillna(df['Age'].median())     # fill missing ages with the median
df = df.drop(columns='Cabin')                        # mostly missing, so drop the column
df['Fare'] = df['Fare'].fillna(df['Fare'].median())  # fill the single missing fare


In [7]:
print(df.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64


In [8]:
# Step 4: Feature Engineering
# Extract Title from Name
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)  # raw string avoids the invalid-escape warning
df['Title'] = df['Title'].replace(['Mlle','Ms'],'Miss').replace(['Mme'],'Mrs')
title_counts = df['Title'].value_counts()
rare_titles = title_counts[title_counts < 10].index  # titles seen fewer than 10 times
df['Title'] = df['Title'].replace(rare_titles, 'Rare')


In [9]:

# Create Family Features
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

In [10]:
# Convert Categorical → Numeric
df['Sex'] = df['Sex'].map({'male':0, 'female':1})
df = pd.get_dummies(df, columns=['Embarked','Title'], drop_first=True)

In [11]:
# Step 5: Select Features
X = df[['Pclass','Sex','Age','Fare','FamilySize','IsAlone','Embarked_Q','Embarked_S',
        'Title_Miss','Title_Mr','Title_Mrs','Title_Rare']]
y = df['Survived']

In [12]:
# Step 6: Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:

# Step 7: Train Logistic Regression
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

In [15]:
# Step 8: Evaluate
y_pred = model.predict(X_test)

In [16]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 1.0

Confusion Matrix:
 [[50  0]
 [ 0 34]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        34

    accuracy                           1.00        84
   macro avg       1.00      1.00      1.00        84
weighted avg       1.00      1.00      1.00        84



In [17]:
# Step 9: Interpret Coefficients
coeff_df = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_[0]})
print(coeff_df.sort_values(by='Coefficient', ascending=False))

       Feature  Coefficient
1          Sex     3.813359
8   Title_Miss     2.048226
10   Title_Mrs     1.765133
6   Embarked_Q     0.261099
2          Age     0.015221
3         Fare     0.000900
7   Embarked_S    -0.036208
4   FamilySize    -0.049525
0       Pclass    -0.090366
5      IsAlone    -0.136403
11  Title_Rare    -0.399163
9     Title_Mr    -2.139755


- A negative coefficient does not mean the model is bad.
- It means that, with the other features held fixed, a higher value of that feature lowers the predicted probability of survival.
- Whether a model is "good" or "bad" is judged by evaluation metrics (accuracy, F1, ROC AUC), not by the sign of its coefficients.
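
One common way to read the sign and size of a logistic regression coefficient is to exponentiate it into an odds ratio. The sketch below does this for a few coefficients copied from the printed table above; the choice of which rows to include is arbitrary.

```python
import math

# Coefficients copied from the printed coefficient table (a subset).
coefficients = {
    "Sex": 3.813359,
    "Title_Mr": -2.139755,
    "IsAlone": -0.136403,
    "Fare": 0.000900,
}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in the feature, with the other features held fixed.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name:<10} odds ratio = {ratio:.3f} ({direction} the odds of survival)")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient. Because the features here are not on a common scale, the magnitudes are not directly comparable across features.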