# Practical Exercise 2: Logistic Regression

## Titanic Dataset
Prediction problem: Various attributes of the passengers on the Titanic <br>
Question: Did the passenger survive the disaster?

Since we have not yet introduced how to handle categorical features, we use only the following attributes: <br>
- Pclass, passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Survived, survival (0 = no, 1 = yes)
- Age, age
- SibSp, number of siblings/spouses aboard
- Parch, number of children/parents aboard
- Fare, passenger fare (British pounds)

## Task 1

1. Load the dataset from `titanic.csv` into a pandas DataFrame. The data is located in the `/data` folder on [GitHub](https://github.com/pabair/ml-kurs-ws22/).
2. Create a new DataFrame that contains only the following columns: "Survived", "Pclass", "Age", "Fare", "SibSp", "Parch"
3. Remove the empty fields from the DataFrame (i.e. fields with `NaN` values) by calling the `dropna()` method on the DataFrame. <br>Count how many rows the DataFrame has before and after the call.

In [112]:
# import pandas
import pandas as pd

# load the data from ./data/titanic.csv into a pandas dataframe
df_all = pd.read_csv('./data/titanic.csv')

# log shape
print('Shape of the dataframe: ', df_all.shape)

# create a new dataframe with only the columns Survived, Pclass, Age, Fare, SibSp, Parch
df = df_all[['Survived', 'Pclass', 'Age', 'Fare', 'SibSp', 'Parch']]

# log shape
print('Shape of the dataframe: ', df.shape)

# drop all rows with missing values
df = df.dropna()

# log shape
print('Shape of the dataframe after dropping all rows with missing values: ', df.shape)

Shape of the dataframe:  (891, 12)
Shape of the dataframe:  (891, 6)
Shape of the dataframe after dropping all rows with missing values:  (714, 6)
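
To see which column is responsible for the dropped rows, the missing values can be counted per column before calling `dropna()`. The following is a minimal sketch on a small made-up DataFrame; the values are illustrative and not taken from `titanic.csv`.

```python
import numpy as np
import pandas as pd

# toy data standing in for the Titanic columns (illustrative values only)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# number of NaN values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# number of rows before and after dropping incomplete rows
rows_before = len(toy)
rows_after = len(toy.dropna())
print(rows_before, rows_after)
```

Applied to the real data, the same call would be `df.isna().sum()` on the DataFrame from the cell above.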


In [113]:
# log head
df.head()

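| | Survived | Pclass | Age | Fare | SibSp | Parch |
|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 7.25 | 1 | 0 |
| 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 0 |
| 2 | 1 | 3 | 26.0 | 7.925 | 0 | 0 |
| 3 | 1 | 1 | 35.0 | 53.1 | 1 | 0 |
| 4 | 0 | 3 | 35.0 | 8.05 | 0 | 0 |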


## Task 2

1. Split the DataFrame resulting from Task 1 into training and test data using the `train_test_split` method.
2. Train a logistic regression on the training data with "Survived" as the label.
3. Use the trained model to make predictions on the test data and compute:
    - Accuracy
    - Precision
    - Recall

In [114]:
from sklearn.model_selection import train_test_split

# use train_test_split to split the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(df.drop('Survived', axis=1), df['Survived'], test_size=0.2, random_state=42)
print('Shape of the training data: ', X_train.shape)
print('Shape of the test data: ', X_test.shape)

Shape of the training data:  (571, 5)
Shape of the test data:  (143, 5)


In [115]:
# log head of the training data
print('Head of the training data: ', X_train.head())


Head of the training data:       Pclass   Age     Fare  SibSp  Parch
328       3  31.0  20.5250      1      1
73        3  26.0  14.4542      1      0
253       3  30.0  16.1000      1      0
719       3  33.0   7.7750      0      0
666       2  25.0  13.0000      0      0


In [116]:
# training the model
from sklearn.linear_model import LogisticRegression

# create a logistic regression model
model = LogisticRegression()

# fit the model to the training data
model.fit(X_train, y_train)

# predict the test data
y_pred = model.predict(X_test)

In [117]:
# evaluate the model
from sklearn.metrics import accuracy_score

# calculate the accuracy score and store it in a variable
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy score: ', accuracy)

# calculate the precision score
from sklearn.metrics import precision_score
print('Precision score: ', precision_score(y_test, y_pred))

# calculate the recall score
from sklearn.metrics import recall_score
print('Recall score: ', recall_score(y_test, y_pred))


Accuracy score:  0.6993006993006993
Precision score:  0.6666666666666666
Recall score:  0.4642857142857143
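
The three scores follow directly from the counts in the confusion matrix. The sketch below uses counts that were inferred so as to be consistent with the printed scores and the 143 test rows; they are an assumption for illustration, not an output of the notebook.

```python
# assumed confusion-matrix counts (inferred from the printed scores, not logged by the notebook)
tp = 26  # survivors predicted as survivors
fp = 13  # non-survivors predicted as survivors
fn = 30  # survivors predicted as non-survivors
tn = 74  # non-survivors predicted as non-survivors

total = tp + fp + fn + tn

# accuracy: share of all predictions that are correct
accuracy_by_hand = (tp + tn) / total

# precision: share of predicted survivors who actually survived
precision_by_hand = tp / (tp + fp)

# recall: share of actual survivors the model found
recall_by_hand = tp / (tp + fn)

print(accuracy_by_hand, precision_by_hand, recall_by_hand)
```

The low recall means the model misses more than half of the actual survivors, even though about two thirds of its positive predictions are correct.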


## Task 3

1. Now add sex as an additional feature and repeat the steps from Tasks 1 and 2. Note that sex is not a numerical value, i.e. you have to derive a numerical feature from it. Example: the feature "isFemale" has the value 1 if Sex == "female", otherwise 0. <br>
Hint: Python automatically converts booleans to numbers.
2. By how much does the accuracy improve with the new feature?
3. Now try to use the column "Embarked" as a feature in a similar way.
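
The hint about booleans can be checked in plain Python. The list of values below is made up for illustration.

```python
# comparing two values yields a bool ...
is_female = ("female" == "female")

# ... which Python treats as the integer 1 (True) or 0 (False)
print(int(is_female))
print(True + True)

# the same comparison applied to a list of example values
sexes = ["male", "female", "female", "male"]
is_female_feature = [int(s == "female") for s in sexes]
print(is_female_feature)
```

On a pandas column the same idea can be written as `(df['Sex'] == 'female').astype(int)`.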

In [118]:
# create dataframe with additional feature 'Sex'
df = df_all[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Parch']]

# log shape
print('Shape of the dataframe: ', df.shape)

# drop all rows with missing values
df = df.dropna()

# log shape
print('Shape of the dataframe after dropping all rows with missing values: ', df.shape)

df.head()

Shape of the dataframe:  (891, 7)
Shape of the dataframe after dropping all rows with missing values:  (714, 7)


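| | Survived | Pclass | Sex | Age | Fare | SibSp | Parch |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 7.25 | 1 | 0 |
| 1 | 1 | 1 | female | 38.0 | 71.2833 | 1 | 0 |
| 2 | 1 | 3 | female | 26.0 | 7.925 | 0 | 0 |
| 3 | 1 | 1 | female | 35.0 | 53.1 | 1 | 0 |
| 4 | 0 | 3 | male | 35.0 | 8.05 | 0 | 0 |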


In [119]:
# encode the feature Sex as 0 (male) and 1 (female)
df = df.assign(Sex=df['Sex'].map({'male': 0, 'female': 1}))

# log head
df.head()

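| | Survived | Pclass | Sex | Age | Fare | SibSp | Parch |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 7.25 | 1 | 0 |
| 1 | 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 0 |
| 2 | 1 | 3 | 1 | 26.0 | 7.925 | 0 | 0 |
| 3 | 1 | 1 | 1 | 35.0 | 53.1 | 1 | 0 |
| 4 | 0 | 3 | 0 | 35.0 | 8.05 | 0 | 0 |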


In [120]:
# use train_test_split to split the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(df.drop('Survived', axis=1), df['Survived'], test_size=0.2, random_state=42)
print('Shape of the training data: ', X_train.shape)
print('Shape of the test data: ', X_test.shape)

Shape of the training data:  (571, 6)
Shape of the test data:  (143, 6)


In [121]:
# log head of the training data
X_train.head()

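| | Pclass | Sex | Age | Fare | SibSp | Parch |
|---|---|---|---|---|---|---|
| 328 | 3 | 1 | 31.0 | 20.525 | 1 | 1 |
| 73 | 3 | 0 | 26.0 | 14.4542 | 1 | 0 |
| 253 | 3 | 0 | 30.0 | 16.1 | 1 | 0 |
| 719 | 3 | 0 | 33.0 | 7.775 | 0 | 0 |
| 666 | 2 | 0 | 25.0 | 13.0 | 0 | 0 |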


In [122]:
# training the model
# create a logistic regression model
model = LogisticRegression()

# fit the model to the training data
model.fit(X_train, y_train)

# predict the test data
y_pred = model.predict(X_test)

In [123]:
# evaluate the model
# calculate the accuracy score and store it in a variable
accuracyWithSex = accuracy_score(y_test, y_pred)
print('Accuracy score: ', accuracyWithSex)

# calculate the precision score
print('Precision score: ', precision_score(y_test, y_pred))

# calculate the recall score
print('Recall score: ', recall_score(y_test, y_pred))

# calculate how much the accuracy score improved
print('Accuracy score improved by: ', accuracyWithSex - accuracy)

Accuracy score:  0.7482517482517482
Precision score:  0.6923076923076923
Recall score:  0.6428571428571429
Accuracy score improved by:  0.04895104895104896


In [124]:
# create dataframe with additional features 'Sex' and 'Embarked'
df = df_all[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Parch', 'Embarked']]

# log shape
print('Shape of the dataframe: ', df.shape)

# drop all rows with missing values
df = df.dropna()

# log shape
print('Shape of the dataframe after dropping all rows with missing values: ', df.shape)

df.head()

Shape of the dataframe:  (891, 8)
Shape of the dataframe after dropping all rows with missing values:  (712, 8)


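| | Survived | Pclass | Sex | Age | Fare | SibSp | Parch | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 7.25 | 1 | 0 | S |
| 1 | 1 | 1 | female | 38.0 | 71.2833 | 1 | 0 | C |
| 2 | 1 | 3 | female | 26.0 | 7.925 | 0 | 0 | S |
| 3 | 1 | 1 | female | 35.0 | 53.1 | 1 | 0 | S |
| 4 | 0 | 3 | male | 35.0 | 8.05 | 0 | 0 | S |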


In [125]:
# encode Sex as 0 (male) / 1 (female) and Embarked as integer codes (S=0, C=1, Q=2)
# note: the integer codes impose an artificial order on the ports
df = df.assign(
    Sex=df['Sex'].map({'male': 0, 'female': 1}),
    Embarked=df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}),
)

# log head
df.head()

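| | Survived | Pclass | Sex | Age | Fare | SibSp | Parch | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 7.25 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 0 | 1 |
| 2 | 1 | 3 | 1 | 26.0 | 7.925 | 0 | 0 | 0 |
| 3 | 1 | 1 | 1 | 35.0 | 53.1 | 1 | 0 | 0 |
| 4 | 0 | 3 | 0 | 35.0 | 8.05 | 0 | 0 | 0 |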
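
Mapping the ports to 0, 1 and 2 implies an order (S < C < Q) that the ports do not have. One common alternative is one-hot encoding, sketched here with `pd.get_dummies` on a few made-up rows. This is an alternative for comparison, not the encoding used in the cells of this notebook.

```python
import pandas as pd

# toy rows standing in for the Titanic data (illustrative values only)
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.93, 8.46],
    "Embarked": ["S", "C", "S", "Q"],
})

# one 0/1 column per port instead of a single integer code
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

Each row has exactly one of the new `Embarked_*` columns set to 1, so the model can learn a separate weight per port.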


In [126]:
# clear the data
X_train = None
X_test = None
y_train = None
y_test = None

# use train_test_split to split the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(df.drop('Survived', axis=1), df['Survived'], test_size=0.2, random_state=42)
print('Shape of the training data: ', X_train.shape)
print('Shape of the test data: ', X_test.shape)

Shape of the training data:  (569, 7)
Shape of the test data:  (143, 7)


In [127]:
# training the model
# create a logistic regression model
model = LogisticRegression(max_iter=1000)

# fit the model to the training data
model.fit(X_train, y_train)

# predict the test data
y_pred = model.predict(X_test)

In [129]:
# evaluate the model
# calculate the accuracy score and store it in a variable
accuracyWithSexAndEmbarked = accuracy_score(y_test, y_pred)

# calculate the precision score
precisionWithSexAndEmbarked = precision_score(y_test, y_pred)

# calculate the recall score
recallWithSexAndEmbarked = recall_score(y_test, y_pred)


# print the accuracy score
print('Accuracy score: ', accuracyWithSexAndEmbarked)

# print the precision score
print('Precision score: ', precisionWithSexAndEmbarked)

# print the recall score
print('Recall score: ', recallWithSexAndEmbarked)

# calculate how much the accuracy score improved
# (note: this dataframe has 712 rows instead of 714, so the test split is not identical)
print('Accuracy score improved by: ', accuracyWithSexAndEmbarked - accuracyWithSex)


Accuracy score:  0.7972027972027972
Precision score:  0.8541666666666666
Recall score:  0.6507936507936508
Accuracy score improved by:  0.04895104895104896


## Bonus

Which features are the most and least important for the model's predictions?
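
One possible way to approach this question, sketched under assumptions: standardize the features so that the coefficients are on a comparable scale, then rank the features by the absolute value of their coefficients. The synthetic data and the feature names `strong` and `noise` below are made up for illustration, and coefficient magnitude is only a rough indicator of importance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data: the label depends on 'strong' only; 'noise' is unrelated
rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)
noise = rng.normal(size=n) * 100  # deliberately on a much larger scale
y = (strong + 0.1 * rng.normal(size=n) > 0).astype(int)

X = np.column_stack([strong, noise])
feature_names = ["strong", "noise"]

# standardize first so that coefficient sizes can be compared across features
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X, y)

# one coefficient per feature for a binary problem
coefficients = pipeline.named_steps["logisticregression"].coef_[0]

# rank features by absolute coefficient, largest first
ranking = sorted(zip(feature_names, coefficients), key=lambda item: abs(item[1]), reverse=True)
for name, coef in ranking:
    print(name, coef)
```

With the Titanic model, the same ranking could be built from `X_train.columns` and the coefficients of a model fitted on standardized training data.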