In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


## Feature Importance with Logistic Regression

### What I Did

- Reused the processed Titanic dataset from Day 25.
- Fit a Logistic Regression model on selected features.
- Extracted feature coefficients as a measure of feature importance.
- Sorted and displayed them to see strongest predictors.

In [5]:
df = pd.read_csv('datasets/cleaned_titanic.csv')

In [6]:
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

cleaned_df = df[['Age', 'Pclass', 'Fare', 'Embarked_Q', 'Sex', 'Embarked_S', 'SibSp', 'Survived']].dropna()


In [7]:
features = cleaned_df[['Age', 'Pclass', 'Fare', 'Embarked_Q', 'Sex', 'Embarked_S', 'SibSp']]
target = cleaned_df['Survived']
X = features
X.shape
y = target

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape,X_test.shape

((571, 7), (143, 7))

In [12]:
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Extract coefficients
coef = model.coef_[0]
coeff_df = pd.DataFrame({
    'Feature': features.columns,
    'Coefficient': coef,
    'abscoeff': abs(coef)
})

coeff_df_sorted = coeff_df.sort_values(by='abscoeff', ascending=False)
coeff_df_sorted

|   | Feature    | Coefficient | abscoeff |
|---|------------|-------------|----------|
| 4 | Sex        | -2.592794   | 2.592794 |
| 1 | Pclass     | -1.190557   | 1.190557 |
| 3 | Embarked_Q | -0.371773   | 0.371773 |
| 5 | Embarked_S | -0.37092    | 0.37092  |
| 6 | SibSp      | -0.351259   | 0.351259 |
| 0 | Age        | -0.047655   | 0.047655 |
| 2 | Fare       | 0.001107    | 0.001107 |


## What the Output Shows

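- The table is sorted by absolute coefficient, so features at the top have the largest effect on the predicted log-odds of survival, in either direction.
- Features with negative coefficients reduce survival probability as their value increases; positive coefficients raise it. Here every feature except `Fare` has a negative coefficient.
- Coefficients closer to 0 indicate a weaker effect per unit of the feature. `Age` and `Fare` are unscaled and span much wider ranges than the binary columns, so their small coefficients do not by themselves mean these features matter little.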
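
To make the coefficients easier to read, they can be converted to odds ratios. The sketch below is not part of the original notebook: it copies the values from the table above into a plain dictionary (the names `coefficients` and `odds_ratios` are introduced here for illustration) and exponentiates each one.

```python
import math

# Coefficients copied from the sorted table above (log-odds per unit of each feature)
coefficients = {
    "Sex": -2.592794,
    "Pclass": -1.190557,
    "Embarked_Q": -0.371773,
    "Embarked_S": -0.37092,
    "SibSp": -0.351259,
    "Age": -0.047655,
    "Fare": 0.001107,
}

# exp(coefficient) is the factor by which the odds of survival are multiplied
# for a one-unit increase in the feature, holding the other features fixed
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<12} odds ratio = {ratio:.3f}")
```

An odds ratio below 1 lowers the odds of survival and one above 1 raises them. For `Sex` (male=1) the ratio is about 0.075, and for each step up in `Pclass` number it is about 0.30.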

## Insights

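- `Sex` and `Pclass` have by far the largest coefficients in this model, so gender and passenger class are its most influential predictors. Being male and travelling in a lower class (higher `Pclass` number) both lower the predicted odds of survival.
- The sign of each Logistic Regression coefficient gives the direction of a feature's effect on predicted survival, and its magnitude gives the strength, measured in log-odds per unit of the feature.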


## Evaluation

**Why can negative coefficients be just as important as positive coefficients in logistic regression?**
Negative coefficients are as important as positive ones because the magnitude of a coefficient, not its sign, determines how strongly a feature moves the prediction; the sign only gives the direction. A negative coefficient marks a feature that decreases the odds, and therefore the probability, of the positive class. For example, since `Sex` is encoded as male=1, its negative coefficient means being male is associated with lower odds of survival. Note: coefficient magnitudes are only directly comparable when features are scaled to the same units.
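
One way to act on the scaling note is to standardize the features before fitting, so that each coefficient is expressed per standard deviation. The following is a minimal sketch, not the notebook's method: the helper `standardized_coefficients` is hypothetical, and the demo uses synthetic stand-in data rather than the Titanic set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def standardized_coefficients(X, y, feature_names):
    """Fit scaler + logistic regression; return {feature: coefficient per std. dev.}."""
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipeline.fit(X, y)
    coef = pipeline.named_steps["logisticregression"].coef_[0]
    return dict(zip(feature_names, coef))


# Synthetic stand-in data (NOT the Titanic set): two equally informative
# features recorded on very different scales
rng = np.random.default_rng(0)
small = rng.normal(0, 1, 500)    # unit-scale feature
large = rng.normal(0, 100, 500)  # same strength of signal, units 100x larger
logits = 1.0 * small + 0.01 * large
y_demo = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)
X_demo = np.column_stack([small, large])

scaled_coefs = standardized_coefficients(X_demo, y_demo, ["small", "large"])
print(scaled_coefs)
```

On raw inputs the `large` feature would get a coefficient near 0.01 and look unimportant; after standardization the two coefficients should come out at a similar magnitude. In the notebook, the same helper could presumably be called with `X_train`, `y_train`, and `features.columns`.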

**What do they represent about the relationship between the feature and the prediction?**
They represent an inverse relationship between the feature and the predicted probability of the positive class.
Specifically, a negative coefficient means that as the feature value increases, the model predicts lower odds of the target outcome, holding other features constant. Each one-unit increase multiplies the odds by exp(coefficient), which is less than 1 when the coefficient is negative.
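
The "holding other features constant" part can be illustrated numerically. In this sketch the `Sex` coefficient is taken from the table above, while the baseline log-odds values are hypothetical (the notebook does not show the model's intercept); `male_vs_female` is a helper name introduced here.

```python
import math

SEX_COEF = -2.592794  # coefficient for Sex (male=1) from the table above


def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def male_vs_female(baseline_log_odds):
    """Return (p_female, p_male, odds_ratio) for two passengers who differ only in Sex.

    baseline_log_odds is the female passenger's log-odds of survival,
    i.e. the intercept plus the contribution of all other features.
    """
    p_female = sigmoid(baseline_log_odds)
    p_male = sigmoid(baseline_log_odds + SEX_COEF)
    odds_ratio = (p_male / (1 - p_male)) / (p_female / (1 - p_female))
    return p_female, p_male, odds_ratio


# Hypothetical baselines: the probability gap varies, the odds ratio does not
for baseline in (-1.0, 0.0, 2.0):
    p_f, p_m, ratio = male_vs_female(baseline)
    print(f"baseline={baseline:+.1f}  p_female={p_f:.3f}  p_male={p_m:.3f}  odds ratio={ratio:.3f}")
```

Whatever the baseline, the odds ratio stays at exp(-2.592794), while the drop in probability depends on where the passenger starts.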