Hi there! This is my first time working with Kaggle data sets and, in a way, with data science in general. Recently, I completed a [course by Hastie and Tibshirani](https://www.edx.org/learn/python/stanford-university-statistical-learning-with-python), and I plan to try out most of the techniques covered there on this and the following data sets.

What to try:
- [x] Data processing
- [x] Basic data analysis (descriptive, correlation matrices)
- [ ] Basic data visualisation
- [ ] Regression (SLR, MLR)
- [x] Assessing model accuracy on train and test split 
- [ ] Classification
    - [ ] Logistic regression
    - [ ] Linear discriminant
    - [ ] K nearest neighbours
- [ ] Resampling
    - [ ] Cross-validation
    - [ ] Bootstrap
- [ ] Best subset selection
- [ ] Shrinkage
    - [ ] Ridge 
    - [ ] Lasso
    - [ ] PCR
- [ ] _maybe_ Smoothing Splines, GAMs
- [ ] Trees
    - [ ] Decision trees
    - [ ] Random forests
    - [ ] Boosting
- [ ] Support Vector Machines
- [ ] *Definitely not here* Deep Learning

It would be interesting to explore the differences between scikit-learn and statsmodels for regressions.

Some notes:
- The data seems to be well separated (at least with respect to gender), so SVM is likely to perform better than logistic regression
- Hence, I'll fit a logistic regression on all seemingly relevant parameters, then an SVM on gender only, then an SVM on all reasonable parameters, while figuring out how not to overfit

In [1]:
import numpy as np
import pandas as pd
# from matplotlib.pyplot import subplots  # the module is `pyplot`; `pyplots` does not exist, which is why the import failed
# import seaborn
from sklearn.model_selection import train_test_split, cross_val_score

# Dataset Overview and Notes
For detailed information, visit the Kaggle competition page: [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)

## Plan (Inspired by an Introductory Video)
1. **Exploratory Data Analysis (EDA)** - Understand the data, find patterns and outliers.
2. **Train and Tune Model** - Develop a model to predict survival, and optimize its parameters.

## Variable Definitions
Below is a summary of the variables included in the Titanic dataset, with details on their meaning and encoding.

| Variable | Definition                                 | Key                                            |
|----------|:------------------------------------------:|------------------------------------------------|
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

### Detailed Variable Insights

- `pclass:` Serves as a proxy for socio-economic status (SES)
  - **1st = Upper**
  - **2nd = Middle**
  - **3rd = Lower**

- `age:` Age is fractional if less than 1. If the age is estimated, it is denoted in the form of xx.5.

- `sibsp:` This variable defines family relations as follows:
  - **Sibling** = brother, sister, stepbrother, stepsister
  - **Spouse** = husband, wife (mistresses and fiancés were ignored)

- `parch:` This variable further defines family relations:
  - **Parent** = mother, father
  - **Child** = daughter, son, stepdaughter, stepson
  - Note: Some children traveled only with a nanny, hence `parch=0` for them.
  
### Some notes

Since this is a classification problem, visualisation is unlikely to be that helpful

# Basic Data Exploration

In [2]:
train = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")
gender_pred = pd.read_csv("./gender_submission.csv")
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
print("Shape:", train.shape)
print("Columns:", train.dtypes)
train.describe()

Shape: (891, 12)
Columns: PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [4]:
train.describe(include="object")

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


In [5]:
train["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [6]:
train["Cabin"].value_counts(dropna = False)

NaN            687
C23 C25 C27      4
G6               4
B96 B98          4
C22 C26          3
              ... 
E34              1
C7               1
C54              1
E36              1
C148             1
Name: Cabin, Length: 148, dtype: int64

In [7]:
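# Restrict to numeric columns first: recent pandas versions raise on string columns
train.select_dtypes("number").corr()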

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0


# Basic sampling

In [8]:
y = train["Survived"]
X = train.drop(["Survived", "PassengerId"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
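
A quick check on what this split produces: `train_test_split` is called without `test_size`, so scikit-learn falls back to a 25% test set. The sketch below redoes the arithmetic in plain Python. The helper name `split_sizes` is made up for illustration, and the rounding rule (test size rounded up, the rest goes to training) is my understanding of scikit-learn's behaviour rather than something shown in this notebook.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set size is rounded up and the training set
    gets the remaining rows.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# First split of the 891 training rows
n_train, n_test = split_sizes(891)
print(n_train, n_test)  # 668 223

# Splitting the 668 training rows once more with the same defaults
n_train_inner, n_test_inner = split_sizes(n_train)
print(n_train_inner, n_test_inner)  # 501 167
```

The 668 and 223 match the `support` totals in the classification reports further down, and 501 is the `No. Observations` reported by the later regressions that are fit on `X_train` and split it a second time.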

# Basic Models

### Replicating gender submission
Just to practice saving and making my first submission

In [9]:
# Training and validating
X_train.shape

def share_survived(field, value, df = train, X = None, y = None):
    # Use (X, y) when a feature frame is passed, otherwise fall back to df
    if not isinstance(X, pd.DataFrame):
        mask = df[field] == value
        survived = df.loc[mask, "Survived"].sum()
    else:
        mask = X[field] == value
        survived = y.loc[mask].sum()
    total = mask.sum()  # number of rows with this value
    return survived / total

def evaluate(y_predicted, y):
    print((y_predicted == y).sum(), len(y))
    return (y_predicted == y).sum() / len(y)

def expected_by_feature(X, y, col):
    return {feature: share_survived(col, feature, X = X, y = y) 
                  for feature in list(X[col].unique())}

expected_by_sex = expected_by_feature(X_train, y_train, "Sex")

def predict_sex(X_train, y_train, X_test):
    expected = expected_by_feature(X_train, y_train, "Sex")
    y_predict = X_test["Sex"].apply(lambda sex: 1 if expected[sex] >= 0.5 else 0)
    y_predict.name = 'y_predict'
    return y_predict
    
    

# print("% male survived:", share_survived("Sex", "male"))
# print("% female survived:", share_survived("Sex", "female"))

print(expected_by_sex)

y_train_predict = predict_sex(X_train, y_train, X_train)
y_test_predict = predict_sex(X_train, y_train, X_test)
y_submit_predict = predict_sex(X_train, y_train, test)

print(X_test.assign(y_predict = y_test_predict))

print("Train % correct:", evaluate(y_train_predict, y_train))
print("Test % correct:", evaluate(y_test_predict, y_test))
# print(evaluate(y_train_predict, y_train))


{'male': 0.1793103448275862, 'female': 0.7253218884120172}
     Pclass                                               Name     Sex   Age  \
862       1  Swift, Mrs. Frederick Joel (Margaret Welles Ba...  female  48.0   
223       3                               Nenkoff, Mr. Christo    male   NaN   
84        2                                Ilett, Miss. Bertha  female  17.0   
680       3                                Peters, Miss. Katie  female   NaN   
535       2                             Hart, Miss. Eva Miriam  female   7.0   
..      ...                                                ...     ...   ...   
506       2      Quick, Mrs. Frederick Charles (Jane Richards)  female  33.0   
467       1                         Smart, Mr. John Montgomery    male  56.0   
740       1                        Hawksford, Mr. Walter James    male   NaN   
354       3                                  Yousif, Mr. Wazli    male   NaN   
449       1                     Peuchen, Major. Arthur Godfre

Submitting

In [10]:
gender_pred

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived
0,0,892,0
1,1,893,1
2,2,894,0
3,3,895,0
4,4,896,1
...,...,...,...
413,413,1305,0
414,414,1306,1
415,415,1307,0
416,416,1308,0


In [84]:
def submission(y_predict):
    gender_submission = pd.DataFrame({
    "PassengerId": test['PassengerId'], 
    "Survived": y_predict})
    # gender_submission.columns == gender_pred.columns
    print(gender_submission.head())
    # Forward slash avoids the invalid "\g" escape and works on every OS
    gender_submission.to_csv("./gender_submit.csv", index = False)

In [11]:
gender_submission = pd.DataFrame({
    "PassengerId": test['PassengerId'], 
    "Survived": y_submit_predict})
# gender_submission.columns == gender_pred.columns
print(gender_submission.head())
# Forward slash avoids the invalid "\g" escape and works on every OS
gender_submission.to_csv("./gender_submit.csv", index = False)

   PassengerId  Survived
0          892         0
1          893         1
2          894         0
3          895         0
4          896         1


Yay, my first submission! It scored 0.765
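
For reference, the share-of-survivors-by-sex lookup can also be written with a `groupby`. This is only a sketch on a made-up six-row frame (the `toy` data and the names `rates` and `predicted` are illustrative, not part of the Titanic files). It applies the same rule as `predict_sex`: predict survival when the group's survival rate is at least 0.5.

```python
import pandas as pd

# Made-up data, only to show the mechanics
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Survival rate per group, analogous to expected_by_feature(X, y, "Sex")
rates = toy.groupby("Sex")["Survived"].mean()

# Same decision rule as predict_sex: 1 if the group's rate is >= 0.5
predicted = toy["Sex"].map(lambda sex: int(rates[sex] >= 0.5))

# Share of correct predictions, like evaluate()
accuracy = (predicted == toy["Survived"]).mean()
print(rates.to_dict())
print(accuracy)
```

On this toy frame one man and one woman are misclassified, so the rule gets 4 out of 6 right.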

# Logistic regression

1. Figure out how to implement it
2. Select quantitative parameters intuitively
3. Include qualitative parameters (dummies)

At first, will try with only age, sibsp, parch

I ran into the problem that some entries are NaN, so the obvious fix is mean imputation. A more advanced method would be to use matrix completion.

One more idea is to include a dummy variable and an interaction. I like it more than mean imputation, to be honest, and will implement this: create a dummy variable `NA` for a missing age and fit the regression `y = a + b*NA + c*(1-NA)*age`. I'll have to figure out how to do dummies, though.

In addition, I regress on pclass, sex, sibsp, parch, embarked.
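
Written out, the dummy-plus-interaction idea amounts to the following model, where `NA` is 1 when the age is missing and 0 otherwise. The symbols `a`, `b`, `c` follow the note above. Putting it on the log-odds scale is my assumption, since the model being fit is a logistic regression.

```latex
\log\frac{p_i}{1-p_i}
  = a + b\,NA_i + c\,(1-NA_i)\,\mathrm{age}_i
  =
  \begin{cases}
    a + b, & \text{age missing } (NA_i = 1),\\
    a + c\,\mathrm{age}_i, & \text{age known } (NA_i = 0).
  \end{cases}
```

So passengers with a missing age get their own intercept shift `b`, and the age slope `c` only applies where an age was actually recorded.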

Would be cool to then do a data interpretation and figure out what's more and less important


In [12]:
# for i in features: print(X_test[i].value_counts(dropna=False))
# as we can see, the problem is only with age
# print("aaa", X.iloc[888].Age == np.)
def process_X(X):
    X_new = X.copy()
    
    X_new["AgeIsNan"] = X_new["Age"].isna().astype(int)
    X_new["AgeNanProd"] = X_new["Age"]
    X_new.loc[X_new["AgeIsNan"] == 1, "AgeNanProd"] = 0
    
    return X_new

X_processed = process_X(X)
X_processed

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeIsNan,AgeNanProd
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0,22.0
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,38.0
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,0,26.0
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,0,35.0
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...
886,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,0,27.0
887,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,0,19.0
888,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,1,0.0
889,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,0,26.0


In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

y = train["Survived"]
X = train.drop(["Survived", "PassengerId"], axis=1)

X_processed = process_X(X)

X_train, X_test, y_train, y_test = train_test_split(X_processed, y, random_state=1)

print(X_train.columns)
features = ["AgeIsNan", "AgeNanProd"]

model_train = LogisticRegression()
model_train.fit(X_train[features], y_train)

y_pred = model_train.predict(X_test[features])
print(y_pred)

print(classification_report(y_test, y_pred))

Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked', 'AgeIsNan', 'AgeNanProd'],
      dtype='object')
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0]
              precision    recall  f1-score   support

           0       0.57      1.00      0.73       128
           1       0.00      0.00      0.00        95

    accuracy                           0.57       223
   macro avg       0.29      0.50      0.36       223
weighted avg       0.33      0.57      0.42       223



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Well, not very successful! But it seems this is just because, on average, a person of any age would die, hahaha.

The fit above was done with sklearn! Let's do the same with statsmodels.

In [14]:
import statsmodels.api as sm

y = train["Survived"]
X = train.drop(["Survived"], axis=1)

X_processed = process_X(X)
X_processed = sm.add_constant(X_processed)

X_train, X_test, y_train, y_test = train_test_split(X_processed, y, random_state=1)
# print(X_processed)

features = ["const", "AgeIsNan", "AgeNanProd"]

logit_model = sm.Logit(y_train, X_train[features])
result = logit_model.fit()

print(result.summary())

Optimization terminated successfully.
         Current function value: 0.650863
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  668
Model:                          Logit   Df Residuals:                      665
Method:                           MLE   Df Model:                            2
Date:                Mon, 25 Mar 2024   Pseudo R-squ.:                 0.01209
Time:                        11:17:06   Log-Likelihood:                -434.78
converged:                       True   LL-Null:                       -440.10
Covariance Type:            nonrobust   LLR p-value:                  0.004888
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0556      0.201     -0.277      0.782      -0.449       0.338
AgeIsNan      -0.8979      0.

Indeed, the effect of age is not sufficient to convince the model that an individual won't die.
Let's now add the good stuff.
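
Before moving on, the fitted coefficients can be pushed through the logistic function by hand to see this in numbers. The sketch below uses only the two coefficients that are fully visible in the summary above (`const` and `AgeIsNan`). The helper name `sigmoid` is mine.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

const = -0.0556       # copied from the summary above
age_is_nan = -0.8979  # copied from the summary above

# Passenger with a missing age: AgeIsNan = 1, AgeNanProd = 0
p_missing_age = sigmoid(const + age_is_nan)

# Intercept alone: the starting point for passengers with a known age
p_intercept = sigmoid(const)

print(round(p_missing_age, 3))
print(round(p_intercept, 3))
```

This gives roughly 0.28 and 0.49, both below 0.5. Unless the age coefficient (truncated in the output above) is positive, no passenger crosses the threshold, which is consistent with the all-zero predictions.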

### Multiple parameters!

In [76]:
def logit_reg(X, y, features, process_func):  # process_func: function mapping a DataFrame to a processed DataFrame
#     print(X.head())
    X_processed = process_func(X[features])
    X_processed = sm.add_constant(X_processed)
    
#     print(X_processed.head())
    
    X_train, X_test, y_train, y_test = train_test_split(X_processed, y, random_state=1)

    logit_model = sm.Logit(y_train, X_train)
    result = logit_model.fit()
    
    y_train_predict = result.predict(X_train)
    y_test_predict = result.predict(X_test)
#     print(y_train_predict)
#     print(y_train)
    
    print(result.summary())
    
    print("Train errors:")
    print(classification_report(y_train, list(map(round, y_train_predict))))
    print("Test errors:")
    print(classification_report(y_test, list(map(round, y_test_predict))))
    
    
    return result, y_test_predict

def process_age(X):
    X_new = X.copy()
    
    X_new["AgeIsNan"] = X_new["Age"].isna().astype(int)
    X_new["AgeNanProd"] = X_new["Age"]
    X_new.loc[X_new["AgeIsNan"] == 1, "AgeNanProd"] = 0
    X_new = X_new.drop('Age', axis = 1)
    
#     print("age", X_new.head())
    return X_new


def process_gender(X):
    X_new = X.copy()
    X_new = pd.get_dummies(X_new, columns = ["Sex"], drop_first = True) #drop_first to avoid multicollinearity
#     print(X_new.columns)
#     X_new = X_new.drop('Sex', axis = 1)
#     print("sex", X_new.tail())
    return X_new
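
One small caveat about `list(map(round, ...))` in `logit_reg`: Python's built-in `round` rounds halves to the nearest even integer, so a predicted probability of exactly 0.5 becomes 0, whereas `predict_sex` above uses `>= 0.5` and would give 1. The sketch below shows the difference on made-up probabilities, with an explicit threshold helper (`to_label` is a hypothetical name, not something used elsewhere in this notebook).

```python
def to_label(prob, threshold=0.5):
    """Classify as 1 when the probability reaches the threshold."""
    return int(prob >= threshold)

probs = [0.11, 0.25, 0.5, 0.73]  # made-up probabilities

rounded = [round(p) for p in probs]
thresholded = [to_label(p) for p in probs]

print(rounded)      # [0, 0, 0, 1]
print(thresholded)  # [0, 0, 1, 1]
```

A fitted probability of exactly 0.5 is rare in practice, so the two approaches will almost always agree, but an explicit threshold makes the rule easier to read and to change.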



In [77]:
regression_gender, y_predict_gender_age = logit_reg(X,y, ["Age", "Sex"], lambda X: process_gender(process_age(X)))

Optimization terminated successfully.
         Current function value: 0.503749
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  668
Model:                          Logit   Df Residuals:                      664
Method:                           MLE   Df Model:                            3
Date:                Mon, 25 Mar 2024   Pseudo R-squ.:                  0.2354
Time:                        12:19:35   Log-Likelihood:                -336.50
converged:                       True   LL-Null:                       -440.10
Covariance Type:            nonrobust   LLR p-value:                 1.182e-44
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.4544      0.271      5.370      0.000       0.924       1.985
AgeIsNan      -1.0224      0.

## Comparing results
Our new logit model gave us an accuracy (the metric we care about in the competition) of 0.78 on test data. Let's compare it to the initial model of just guessing that women live and men die.

In [31]:
y_train_predict = predict_sex(X_train, y_train, X_train)
y_test_predict = predict_sex(X_train, y_train, X_test)


print("Train errors:")
print(classification_report(y_train, list(map(round, y_train_predict))))
print("Test errors:")
print(classification_report(y_test, list(map(round, y_test_predict))))

Train errors:
              precision    recall  f1-score   support

           0       0.82      0.85      0.83       421
           1       0.73      0.68      0.70       247

    accuracy                           0.79       668
   macro avg       0.77      0.77      0.77       668
weighted avg       0.79      0.79      0.79       668

Test errors:
              precision    recall  f1-score   support

           0       0.78      0.87      0.82       128
           1       0.79      0.67      0.73        95

    accuracy                           0.78       223
   macro avg       0.79      0.77      0.77       223
weighted avg       0.79      0.78      0.78       223
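
As a sanity check on the numbers in these reports, overall accuracy can be recovered from the per-class recall and support: each class contributes recall × support correct predictions. The sketch below plugs in the rounded values printed above, so the result is only good to about two decimals. `accuracy_from_report` is a name I made up.

```python
def accuracy_from_report(rows):
    """rows: list of (recall, support) pairs, one per class."""
    correct = sum(recall * support for recall, support in rows)
    total = sum(support for _, support in rows)
    return correct / total

# (recall, support) for classes 0 and 1, copied from the reports above
train_acc = accuracy_from_report([(0.85, 421), (0.68, 247)])
test_acc = accuracy_from_report([(0.87, 128), (0.67, 95)])

print(round(train_acc, 2), round(test_acc, 2))  # 0.79 0.78
```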



We don't get **any** improvement!!!
That's likely because female is still a very strong predictor. Let's see for how many records predictions of our two models differ

In [48]:
# print(X_test.shap
X_new = X_test.copy()
X_new['logit_predict'] = y_predict_gender_age
print(X_test[y_predict_gender_age.apply(round) != y_test_predict])
X_new.loc[X_new['Sex'] == 'male'].sort_values(by=['logit_predict'])

Empty DataFrame
Columns: [const, PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked, AgeIsNan, AgeNanProd]
Index: []


Unnamed: 0,const,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeIsNan,AgeNanProd,logit_predict
223,1.0,224,3,"Nenkoff, Mr. Christo",male,,0,0,349234,7.8958,,S,1,0.0,0.110394
790,1.0,791,3,"Keane, Mr. Andrew ""Andy""",male,,0,0,12460,7.7500,,Q,1,0.0,0.110394
107,1.0,108,3,"Moss, Mr. Albert Johan",male,,0,0,312991,7.7750,,S,1,0.0,0.110394
828,1.0,829,3,"McCormack, Mr. Thomas Joseph",male,,0,0,367228,7.7500,,Q,1,0.0,0.110394
201,1.0,202,3,"Sage, Mr. Frederick",male,,8,2,CA. 2343,69.5500,,S,1,0.0,0.110394
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50,1.0,51,3,"Panula, Master. Juha Niilo",male,7.0,4,1,3101295,39.6875,,S,0,7.0,0.241555
850,1.0,851,3,"Andersson, Master. Sigvard Harald Elias",male,4.0,4,2,347082,31.2750,,S,0,4.0,0.247880
16,1.0,17,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.1250,,Q,0,2.0,0.252158
827,1.0,828,2,"Mallet, Master. Andre",male,1.0,0,2,S.C./PARIS 2079,37.0042,,C,0,1.0,0.254315


None! So we may try to employ more parameters, hoping the predictions will then differ.
As we can see, the highest probability that a male would survive is 0.25 based on our data, far below the 0.5 needed for the prediction to round up to 1.

We will now include the following parameters:
- Age
- Gender
- pclass
- parch
- sibsp
- fare
- embarked

I'm actually a bit hesitant about parch and sibsp, so I would love to do hypothesis testing on their joint significance.

We'll need to check whether any values are missing in each column.

In [64]:
for i in train.columns:
    print(i, train[i].isna().astype(int).sum())
print()
print("Test data")
for i in test.columns:
    print(i, test[i].isna().astype(int).sum())
    
print(train["SibSp"].value_counts())
print(train["Parch"].value_counts())

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2

Test data
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64


As we can see, there are only 2 NAs in Embarked, so we could just drop them.
But there is also 1 NA in Fare in the test data, and this may prove to be an issue. For Embarked, we may instead treat a missing value as 0 in all of its dummy variables, and that is what `pd.get_dummies` does by default! Fare is numerical, though, so the dummy trick does not apply to it directly.
Also note that SibSp and Parch are in fact numerical!
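
Here is a small sketch of how `pd.get_dummies` treats missing values, on a made-up four-row frame (the `toy` data is illustrative only). Filling the missing fare with the mean is just one option I picked for the example, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing port and one missing fare
toy = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q"],
    "Fare": [7.25, 71.28, 80.0, np.nan],
})

# Default behaviour: a missing category gets 0 in every dummy column.
# With drop_first=True the baseline "C" is dropped, so the NaN row
# looks exactly like a "C" row.
dummies = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)
print(dummies)

# dummy_na=True adds a separate indicator column for missing values
with_na = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dummy_na=True)
print(with_na)

# Numeric columns are left untouched, so the missing Fare is still NaN
# and needs its own handling (assumption: mean imputation)
fare_filled = toy["Fare"].fillna(toy["Fare"].mean())
print(fare_filled)
```

The thing to watch is that with `drop_first=True` the two passengers with a missing port are silently lumped in with Cherbourg, the dropped baseline category.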

In [61]:
print(train.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


In [80]:
def process_dummies(X, columns):
    X_new = X.copy()
    X_new = pd.get_dummies(X_new, columns = columns, drop_first = True) #I fell into the dummy variable trap!
    return X_new

def process_3(X):
    return process_dummies(process_age(X), columns = ['Pclass', 'Sex', 'Embarked'])

features = ["Age", 'Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked']

y = train["Survived"]
X = train.drop(["Survived", "PassengerId"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

logit_reg(X_train, y_train, features, process_3)
print()

Optimization terminated successfully.
         Current function value: 0.407599
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  501
Model:                          Logit   Df Residuals:                      490
Method:                           MLE   Df Model:                           10
Date:                Mon, 25 Mar 2024   Pseudo R-squ.:                  0.3769
Time:                        12:23:20   Log-Likelihood:                -204.21
converged:                       True   LL-Null:                       -327.73
Covariance Type:            nonrobust   LLR p-value:                 2.270e-47
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.5772      0.691      6.626      0.000       3.223       5.931
SibSp         -0.3923      0.

Analysis: Fare is not statistically significant, while the Embarked dummies seem to be jointly significant.
Parch seems insignificant, but I feel there may be multicollinearity with SibSp.
Let's remove Parch and Fare.

In [83]:
features = ["Age", 'Pclass', 'Sex', 'SibSp', 'Embarked']

reg, y_logit_predict = logit_reg(X_train, y_train, features, process_3)
print()

Optimization terminated successfully.
         Current function value: 0.407949
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  501
Model:                          Logit   Df Residuals:                      492
Method:                           MLE   Df Model:                            8
Date:                Mon, 25 Mar 2024   Pseudo R-squ.:                  0.3764
Time:                        12:24:46   Log-Likelihood:                -204.38
converged:                       True   LL-Null:                       -327.73
Covariance Type:            nonrobust   LLR p-value:                 8.652e-49
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.7701      0.600      7.953      0.000       3.595       5.946
SibSp         -0.3743      0.
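
The two summaries above also allow a rough check of whether Parch and Fare matter jointly. The sketch below runs a likelihood-ratio test using the log-likelihoods and `Df Model` values printed in those summaries. The choice of test is mine (the notebook does not run one), and it relies on both models being fit on the same 501 rows, with the smaller model nested in the larger one.

```python
import math

# Log-likelihoods and model degrees of freedom copied from the two summaries
ll_full, df_full = -204.21, 10       # with Parch and Fare
ll_reduced, df_reduced = -204.38, 8  # without Parch and Fare

# Likelihood-ratio statistic: twice the gain in log-likelihood
lr_stat = 2 * (ll_full - ll_reduced)
df_diff = df_full - df_reduced

# For exactly 2 degrees of freedom the chi-square survival function has
# the closed form exp(-x / 2), so no SciPy is needed for this case.
assert df_diff == 2
p_value = math.exp(-lr_stat / 2)

print(round(lr_stat, 2), round(p_value, 2))  # 0.34 0.84
```

A p-value of about 0.84 gives no evidence that Parch and Fare jointly add anything once the other variables are in the model, which supports dropping them.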

Well... we've got a tiny improvement! Let's submit to Kaggle and see how we're doing.

In [95]:
test_processed = process_3(test[features])
test_processed = sm.add_constant(test_processed)
# print(test_processed)
# reg.predict(test_processed)
submission(reg.predict(test_processed).apply(round))

   PassengerId  Survived
0          892         0
1          893         0
2          894         0
3          895         0
4          896         1
