## Logistic Regression
Despite the name, Logistic Regression is used for classification, not regression.

It predicts the probability that a data point belongs to a certain class (like 0 or 1, Yes or No, Spam or Not Spam).

It uses the logistic (sigmoid) function to map any real number to a value between 0 and 1, which is then interpreted as a probability.
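As a quick illustration of that squeezing behaviour, here is a minimal sketch of the sigmoid, sigma(z) = 1 / (1 + e^(-z)). The helper name `sigmoid` and the sample inputs are chosen only for this example and do not come from the notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, zero maps to 0.5, large positive inputs approach 1
for z in [-5.0, 0.0, 5.0]:
    print(z, sigmoid(z))
```

In logistic regression the input `z` is the linear combination of the features and their coefficients, so the output can be read as the predicted probability of class 1.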

### 1. Using statsmodels

In [3]:
import statsmodels.api as sm
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from scipy import stats

In [4]:
# Load data
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Select features
features = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Parch']
# .copy() makes df an independent frame, so later column assignments are unambiguous
df = df[features + ['Survived']].copy()

In [9]:
df.head(2)

   Pclass  Sex   Age     Fare  SibSp  Parch  Survived
0       3    1  22.0   7.2500      1      0         0
1       1    0  38.0  71.2833      1      0         1


In [5]:
# Handle missing values: fill missing ages with the median age.
# Assign the result back instead of calling fillna(..., inplace=True) on df['Age']:
# the chained inplace call raises a FutureWarning and will stop working in pandas 3.0,
# because df['Age'] behaves as a copy there.
df['Age'] = df['Age'].fillna(df['Age'].median())

# Encode 'Sex' as integers (LabelEncoder sorts the labels, so female=0 and male=1)
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
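The two preprocessing steps can be tried in isolation on a small made-up frame. The sketch below assumes toy values (they are not Titanic rows) and uses hypothetical names `toy` and `encoder`. It fills missing ages by assigning the result back to the column, and prints the label order so the meaning of the encoded `Sex` values is explicit.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 38.0, 30.0],
})

# Fill the missing age with the median of the observed ages,
# assigning the result back to the column
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# LabelEncoder sorts the labels, so 'female' becomes 0 and 'male' becomes 1
encoder = LabelEncoder()
toy["Sex"] = encoder.fit_transform(toy["Sex"])

print(toy)
print(list(encoder.classes_))
```

Knowing that `male` is encoded as 1 matters later, when the sign of the `Sex` coefficient is interpreted.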


In [10]:
# Define X and y
X = df[features]
y = df['Survived']

# Add constant for intercept
X = sm.add_constant(X)

In [11]:
# Fit logistic regression model
# statsmodels expects the target (endog) first and the features (exog) second
model = sm.Logit(y, X)
results = model.fit()

# Print summary
print(results.summary())

Optimization terminated successfully.
         Current function value: 0.442861
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  891
Model:                          Logit   Df Residuals:                      884
Method:                           MLE   Df Model:                            6
Date:                Thu, 24 Apr 2025   Pseudo R-squ.:                  0.3350
Time:                        19:11:17   Log-Likelihood:                -394.59
converged:                       True   LL-Null:                       -593.33
Covariance Type:            nonrobust   LLR p-value:                 9.750e-83
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9410      0.532      9.295      0.000       3.899       5.983
Pclass        -1.0874      0.

In [12]:
print(X.dtypes)

const     float64
Pclass      int64
Sex         int64
Age       float64
Fare      float64
SibSp       int64
Parch       int64
dtype: object


In [13]:

#Coefficients in log-odds
print("Coefficients in log-odds:")
print(results.params)

Coefficients in log-odds:
const     4.941046
Pclass   -1.087436
Sex      -2.760875
Age      -0.039398
Fare      0.002846
SibSp    -0.348785
Parch    -0.106709
dtype: float64
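To see how these log-odds coefficients turn into a probability, the following sketch applies them by hand to the first row shown by `df.head(2)` (Pclass 3, Sex 1, Age 22, Fare 7.25, SibSp 1, Parch 0). The coefficients are copied from the printed output; the dictionary names `coef` and `passenger` are introduced only for this example.

```python
import numpy as np

# Coefficients copied from the printed results.params output (log-odds scale)
coef = {
    "const": 4.941046, "Pclass": -1.087436, "Sex": -2.760875,
    "Age": -0.039398, "Fare": 0.002846, "SibSp": -0.348785, "Parch": -0.106709,
}

# Feature values of the first row of df.head(2); the constant term is always 1
passenger = {
    "const": 1.0, "Pclass": 3, "Sex": 1,
    "Age": 22.0, "Fare": 7.25, "SibSp": 1, "Parch": 0,
}

# Linear predictor: sum of coefficient * feature value
log_odds = sum(coef[name] * passenger[name] for name in coef)

# The sigmoid converts log-odds into a probability of survival
probability = 1.0 / (1.0 + np.exp(-log_odds))

print("log-odds:", log_odds)
print("predicted survival probability:", probability)
```

The negative log-odds give a probability well below 0.5, which is consistent with this passenger's recorded outcome (`Survived` = 0).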


In [30]:

#Coefficients in odds ratio
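One common way to fill this cell is to exponentiate the log-odds coefficients, since an odds ratio is e raised to the coefficient. In the notebook that would presumably be `np.exp(results.params)`, which assumes `numpy` is imported as `np` (the notebook does not import it). The self-contained sketch below uses the coefficient values printed above instead of the fitted `results` object.

```python
import numpy as np
import pandas as pd

# Log-odds coefficients copied from the printed results.params output
params = pd.Series({
    "const": 4.941046, "Pclass": -1.087436, "Sex": -2.760875,
    "Age": -0.039398, "Fare": 0.002846, "SibSp": -0.348785, "Parch": -0.106709,
})

# Odds ratio = exp(coefficient)
odds_ratios = np.exp(params)

print("Coefficients in odds ratio:")
print(odds_ratios)
```

An odds ratio below 1 means the odds of survival fall as the feature increases, holding the other features fixed, and a value above 1 means they rise. Because `Sex` is encoded with male = 1, its odds ratio compares male passengers to female passengers.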

### 2. Using scikit-learn

In [None]:
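# Step 1: Load data
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Step 2: Select features and target (same columns as in section 1)
df = df[features + ['Survived']].copy()


In [None]:
# Step 3: Handle missing values
df['Age'] = df['Age'].fillna(df['Age'].median())

# Step 4: Encode categorical variables
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])

# Define X and y (no constant column needed: scikit-learn fits the intercept itself)
X = df[features]
y = df['Survived']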

In [26]:
# Step 5: Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
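To make the effect of `test_size=0.2` and `random_state=42` concrete, here is a small sketch on stand-in data with the same number of rows as the Titanic frame (891). The arrays `X_demo` and `y_demo` are placeholders invented for this example, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 891 rows, matching the number of observations in the dataset
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.arange(891) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# Using the same random_state reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

The resulting sizes, 712 training rows and 179 test rows, match the `support` totals in the classification reports further down.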

In [27]:
# Step 6: Train model
model2 = LogisticRegression(max_iter=1000)
model2.fit(X_train, y_train)



In [30]:
# Step 7: Make predictions on the training data
y_pred = model2.predict(X_train)
y_pred

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0,
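The 0/1 labels returned by `predict` come from the predicted probabilities: for a binary model, class 1 is chosen when its probability exceeds 0.5. The sketch below checks this on synthetic data from `make_classification`; the data and the names `clf`, `proba`, and `manual` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# predict_proba returns one column per class: P(class 0), P(class 1)
proba = clf.predict_proba(X_demo)

# predict returns the hard 0/1 labels
labels = clf.predict(X_demo)

# Thresholding P(class 1) at 0.5 reproduces the labels
manual = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(labels, manual))
```

Working with `predict_proba` directly is useful when a threshold other than 0.5 is wanted, for example to trade precision against recall.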

In [31]:
# Step 8: Evaluate model on the training data
print("Accuracy on train data:", accuracy_score(y_train, y_pred))
print("Classification report:\n", classification_report(y_train, y_pred))

Accuracy on train data: 0.8019662921348315
Classification report:
               precision    recall  f1-score   support

           0       0.82      0.88      0.85       444
           1       0.77      0.68      0.72       268

    accuracy                           0.80       712
   macro avg       0.79      0.78      0.78       712
weighted avg       0.80      0.80      0.80       712



In [32]:
# Evaluate on testing data
y_pred_test = model2.predict(X_test)
print("Accuracy on test data:", accuracy_score(y_test, y_pred_test))
print("Classification report:\n", classification_report(y_test, y_pred_test))

Accuracy on test data: 0.8100558659217877
Classification report:
               precision    recall  f1-score   support

           0       0.81      0.88      0.84       105
           1       0.80      0.72      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179
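The numbers in the report are linked: overall accuracy is the support-weighted average of the per-class recall. The sketch below redoes that arithmetic with the rounded test-set values printed above, so the result is only approximate. The dictionary names are chosen for this example.

```python
# Recall and support per class, copied from the test-set classification report
recall = {0: 0.88, 1: 0.72}
support = {0: 105, 1: 74}

# Recall * support gives the (approximate) number of correctly classified rows per class
correct = sum(recall[c] * support[c] for c in recall)
total = sum(support.values())

accuracy = correct / total
print(round(accuracy, 2))
```

The result agrees with the reported accuracy of 0.81 up to rounding, which is a quick sanity check when reading a classification report.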



In [None]:
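# accuracy on train data: 0.80
# accuracy on test data: 0.81
# The two accuracies are close, which suggests the model is not overfitting.
# Recall for class 1 (survived) is lower (0.68 on train, 0.72 on test),
# so the model misses more survivors than non-survivors.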
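Whether an accuracy of about 0.80 is good depends on the class balance, so it helps to compare it with a majority-class baseline, meaning a model that always predicts the most frequent class. The sketch below computes that baseline from the `support` counts in the two reports. The helper `majority_baseline` is a hypothetical name introduced for this example.

```python
# Class counts (support) copied from the train and test classification reports
train_support = {0: 444, 1: 268}
test_support = {0: 105, 1: 74}

def majority_baseline(support):
    """Accuracy of always predicting the most frequent class."""
    return max(support.values()) / sum(support.values())

baseline_train = majority_baseline(train_support)
baseline_test = majority_baseline(test_support)

print("baseline train accuracy:", round(baseline_train, 2))
print("baseline test accuracy:", round(baseline_test, 2))
```

Both baselines are clearly below the model's accuracies of 0.80 and 0.81, so the model is learning something beyond the class frequencies.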