## Run logistic regression on iris dataset

In [3]:
import numpy as np

In [1]:
from sklearn import datasets


iris = datasets.load_iris()
list(iris.keys())

['data',
 'target',
 'frame',
 'target_names',
 'DESCR',
 'feature_names',
 'filename',
 'data_module']

In [8]:
iris["feature_names"]

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [9]:
iris["target_names"]

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

#### Train Test Split

In [11]:
from sklearn.model_selection import train_test_split


X = iris["data"]
y = (iris["target"] == 0).astype(int)  # 1 if flower is of the setosa variety
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.1, 
                                                    random_state=42)

#### Train and evaluate

In [16]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

LogisticRegression()

In [21]:
y_pred = log_reg.predict(X_test)
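
For a binary problem, `predict` amounts to thresholding the estimated probability of the positive class at 0.5. The sketch below rebuilds that step by hand, assuming the same iris split as above; the names `scores`, `manual_proba` and `manual_pred` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same setup as the cells above: setosa vs. the rest, 10% held out.
iris = datasets.load_iris()
X = iris["data"]
y = (iris["target"] == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=42)
log_reg = LogisticRegression().fit(X_train, y_train)

# Linear score w.x + b for each test sample.
scores = log_reg.decision_function(X_test)
# The sigmoid maps the score to an estimated probability of class 1.
manual_proba = 1.0 / (1.0 + np.exp(-scores))
# A probability above 0.5 is predicted as class 1.
manual_pred = (manual_proba > 0.5).astype(int)

print(np.allclose(manual_proba, log_reg.predict_proba(X_test)[:, 1]))
print(np.array_equal(manual_pred, log_reg.predict(X_test)))
```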

In [29]:
from sklearn.metrics import recall_score, precision_score


print(f"Recall: {recall_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")


Recall: 1.0
Precision: 1.0


In [27]:
y_pred


array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1])

In [28]:
y_test

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1])

Note: the predictions are perfect on all 15 test samples. This is likely because iris is a small toy dataset in which setosa is linearly separable from the other two species.
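
Both scores can be recomputed by hand from the `y_test` and `y_pred` arrays printed above. This is a minimal pure-Python sketch; the counter names (`true_pos`, `false_pos`, `false_neg`) are introduced here for illustration.

```python
# Arrays copied from the y_test and y_pred outputs above.
y_test = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1]

pairs = list(zip(y_test, y_pred))
true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)   # setosa, predicted setosa
false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)  # not setosa, predicted setosa
false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)  # setosa, missed

# Recall: share of actual positives that were found.
recall = true_pos / (true_pos + false_neg)
# Precision: share of predicted positives that were correct.
precision = true_pos / (true_pos + false_pos)

print(f"Recall: {recall}")        # Recall: 1.0
print(f"Precision: {precision}")  # Precision: 1.0
```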

## Run logistic regression on Titanic dataset

#### Load data

In [31]:
import pandas as pd

You can download the titanic dataset [here](https://www.kaggle.com/competitions/titanic/data?select=train.csv).

It is assumed that you created a `data` directory at the root of this repository and unzipped the contents of the download into that directory.

In [87]:
train_set = pd.read_csv("../data/train.csv", index_col="PassengerId")

In [88]:
train_set.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Columns descriptions

- **Survived**: Survival (0 = No; 1 = Yes)
- **Pclass**: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- **Name**: Name
- **Sex**: Sex
- **Age**: Age
- **SibSp**: Number of Siblings/Spouses Aboard
- **Parch**: Number of Parents/Children Aboard
- **Ticket**: Ticket Number
- **Fare**: Passenger Fare
- **Cabin**: Cabin
- **Embarked**: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- **boat**: Lifeboat (if survived); not present in the `train.csv` loaded above
- **body**: Body number (if did not survive and body was recovered); not present in the `train.csv` loaded above


#### Process categorical feature

In [89]:
# 1 if the passenger is male, 0 otherwise
train_set['Sex'] = (train_set['Sex'] == 'male').astype(int)  

#### Discard unnecessary columns

In [90]:
train_set.drop(columns=['Name', 'Ticket', 'Cabin', 'Embarked'], 
               inplace=True)

#### Analyze NaNs

In [91]:
train_set.shape

(891, 7)

In [92]:
train_set.isna().describe()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
count,891,891,891,891,891,891,891
unique,1,1,1,2,1,1,1
top,False,False,False,False,False,False,False
freq,891,891,891,714,891,891,891
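
The summary above reports how often the top value `False` occurs per column, so the number of missing ages has to be derived as 891 − 714 = 177. A more direct count is available with `isna().sum()`. The sketch below uses a small hypothetical frame (`toy`), not the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0, None],
    "Fare": [7.25, 8.05, 53.1, 13.0],
})

# isna() marks missing cells as True; summing counts them per column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

On the notebook's data, the equivalent call would be `train_set.isna().sum()`.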


The `Age` feature contains 177 NaNs (891 − 714), about 20% of the rows.
Let's try synthesizing their values by using the mean age for each combination of sex and passenger class.

In [93]:
mean_ages_per_group = train_set.groupby(['Pclass', 'Sex'])['Age'].mean().reset_index()
mean_ages_per_group

,Pclass,Sex,Age
0,1,0,34.611765
1,1,1,41.281386
2,2,0,28.722973
3,2,1,30.740707
4,3,0,21.75
5,3,1,26.507589


In [95]:
unknown_age_rows = train_set[train_set.Age.isna()].reset_index()
unknown_age_rows.head()

,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,6,0,3,1,,0,0,8.4583
1,18,1,2,1,,0,0,13.0
2,20,1,3,0,,0,0,7.225
3,27,0,3,1,,0,0,7.225
4,29,1,3,0,,0,0,7.8792


In [109]:
imputated_ages = mean_ages_per_group.merge(unknown_age_rows, how='inner', on=['Sex', 'Pclass'])
# keep the group mean (Age_x) and drop the original all-NaN column (Age_y)
imputated_ages = (
    imputated_ages
    .drop(columns='Age_y')
    .rename(columns={'Age_x': 'Age'})
    .set_index('PassengerId')
)
print(imputated_ages.shape)
imputated_ages.head()

(177, 7)


PassengerId,Pclass,Sex,Age,Survived,SibSp,Parch,Fare
32,1,0,34.611765,1,1,0,146.5208
167,1,0,34.611765,1,0,1,55.0
257,1,0,34.611765,1,0,0,79.2
307,1,0,34.611765,1,0,0,110.8833
335,1,0,34.611765,1,1,0,133.65
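
The merge-based imputation above can also be written more compactly with `groupby(...).transform("mean")` followed by `fillna`. This is an alternative technique, not the one used in this notebook. The sketch runs on a small hypothetical frame (`toy`); `group_mean_age` and `toy_filled` are illustrative names.

```python
import pandas as pd

# Hypothetical toy frame for illustration only (Sex already encoded as 0/1).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Age":    [30.0, 40.0, None, 20.0, None, 24.0],
})

# For every row, the mean Age of its (Pclass, Sex) group; NaNs are skipped.
group_mean_age = toy.groupby(["Pclass", "Sex"])["Age"].transform("mean")

# Fill only the missing ages; known ages are left untouched.
toy_filled = toy.assign(Age=toy["Age"].fillna(group_mean_age))
print(toy_filled)
```

Unlike the merge, this keeps every row in its original position, so no later `concat` is needed.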


#### Train and evaluate 

In [112]:
def train_and_evaluate(train_set):
    """Fit a logistic regression on a 90/10 split of `train_set` and print recall and precision."""
    # split
    X = train_set.drop(columns='Survived')
    y = train_set['Survived']  
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size=0.1, 
                                                        random_state=42)
    # train
    log_reg = LogisticRegression(max_iter=10000)
    log_reg.fit(X_train, y_train)
    
    # evaluate
    y_pred = log_reg.predict(X_test)
    print(f"y_test: {y_test.values}")
    print(f"y_pred: {y_pred}")
    print(f"Recall: {recall_score(y_test, y_pred)}")
    print(f"Precision: {precision_score(y_test, y_pred)}")


#### Without rows that have no age info

In [113]:
train_set_without_nan_values = train_set.dropna()
train_and_evaluate(train_set_without_nan_values)

y_test: [0 1 1 1 0 1 1 1 0 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0
 1 0 0 1 0 1 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 0 0 1]
y_pred: [0 0 1 1 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0
 0 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0 1]
Recall: 0.53125
Precision: 0.6296296296296297


#### With imputed age rows

In [123]:
synthesised_age_train_set = pd.concat([train_set_without_nan_values, imputated_ages])
train_and_evaluate(synthesised_age_train_set)

y_test: [0 1 0 1 1 0 1 1 1 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0
 0 0 1 0 0 1 1 0 0 1 1 1 1 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 1 0 0 1 1 1 1 0 0
 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0]
y_pred: [0 0 0 1 1 0 1 1 1 0 0 1 0 1 1 0 1 0 0 0 1 0 0 1 1 0 1 0 1 1 0 0 1 1 0 1 1
 1 0 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0
 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0]
Recall: 0.7428571428571429
Precision: 0.7647058823529411


### Conclusion

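Recall rises from about 0.53 to 0.74 and precision from about 0.63 to 0.76 when the missing ages are imputed with the group mean instead of dropping those rows. This is probably because the training set has fewer than 1,000 rows, so discarding the ~20% without an Age value loses relevant samples. Note that the two runs are scored on different test splits (72 and 90 rows), so the gain is indicative rather than an exact like-for-like measurement.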
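
The two runs above print test arrays of different lengths (72 and 90 values), so they are scored on different rows. One way to get a like-for-like comparison is to hold out a single test set first and learn the group means from the training rows only. The sketch below assumes a frame with the same `Pclass`, `Sex`, `Age` and `Survived` columns; the function `compare_on_shared_test_set` and the synthetic `demo` data are hypothetical and exist only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split


def compare_on_shared_test_set(df, target="Survived"):
    """Score 'drop NaN rows' vs. 'group-mean imputation' on identical test rows."""
    train, test = train_test_split(df, test_size=0.1, random_state=42)
    # Keep only test rows with a real Age so both models can score them.
    test = test.dropna(subset=["Age"])

    # Group means come from the training rows only (no test information).
    group_means = train.groupby(["Pclass", "Sex"])["Age"].transform("mean")
    # Fall back to the overall training mean if a whole group lacks ages.
    train_imputed = train.assign(
        Age=train["Age"].fillna(group_means).fillna(train["Age"].mean())
    )

    scores = {}
    variants = [("dropna", train.dropna(subset=["Age"])),
                ("imputed", train_imputed)]
    for name, fit_rows in variants:
        model = LogisticRegression(max_iter=10000)
        model.fit(fit_rows.drop(columns=target), fit_rows[target])
        y_pred = model.predict(test.drop(columns=target))
        scores[name] = {
            "recall": recall_score(test[target], y_pred),
            "precision": precision_score(test[target], y_pred),
        }
    return scores


# Synthetic stand-in data (NOT the Titanic set), with ~20% of ages missing.
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.normal(30, 10, n).round(1),
})
survival_prob = np.where(demo["Sex"] == 0, 0.7, 0.2)
demo["Survived"] = (rng.random(n) < survival_prob).astype(int)
demo.loc[rng.random(n) < 0.2, "Age"] = np.nan

scores = compare_on_shared_test_set(demo)
print(scores)
```

With the notebook's data, the call would be `compare_on_shared_test_set(train_set)`, using `train_set` as it stands after the column drop and before `dropna`.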