# Logistic Regression

### Import modules

In [24]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score

### Load the data

In [2]:
spam_df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/openintro/email.csv', index_col=0)

### Variable Explanation
| Variable | Explanation |
|--- | --- |
| spam | Indicator for whether the email was spam. |
| to_multiple | Indicator for whether the email was addressed to more than one recipient. |
| from | Whether the message was listed as from anyone (this is usually set by default for regular outgoing email). |
| cc | Number of people cc'ed. |
| sent_email | Indicator for whether the sender had been sent an email in the last 30 days. |
| time | Time at which email was sent. |
| image | The number of images attached. |
| attach | The number of attached files. |
| dollar | The number of times a dollar sign or the word “dollar” appeared in the email. |
| winner | Indicates whether “winner” appeared in the email. |
| inherit | The number of times “inherit” (or an extension, such as “inheritance”) appeared in the email. |
| viagra | The number of times “viagra” appeared in the email. |
| password | The number of times “password” appeared in the email. |
| num_char | The number of characters in the email, in thousands. |
| line_breaks | The number of line breaks in the email (does not count text wrapping). |
| format | Indicates whether the email was written using HTML (e.g. may have included bolding or active links). |
| re_subj | Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:” |
| exclaim_subj | Whether there was an exclamation point in the subject. |
| urgent_subj | Whether the word “urgent” was in the email subject. |
| exclaim_mess | The number of exclamation points in the email message. |
| number | Factor variable saying whether there was no number, a small number (under 1 million), or a big number. |


### Get an overview of the data

In [3]:
spam_df.head()

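| | spam | to_multiple | from | cc | sent_email | time | image | attach | dollar | winner | ... | viagra | password | num_char | line_breaks | format | re_subj | exclaim_subj | urgent_subj | exclaim_mess | number |
|--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 1 | 0 | 0 | 2012-01-01 01:16:41 | 0 | 0 | 0 | no | ... | 0 | 0 | 11.37 | 202 | 1 | 0 | 0 | 0 | 0 | big |
| 2 | 0 | 0 | 1 | 0 | 0 | 2012-01-01 02:03:59 | 0 | 0 | 0 | no | ... | 0 | 0 | 10.504 | 202 | 1 | 0 | 0 | 0 | 1 | small |
| 3 | 0 | 0 | 1 | 0 | 0 | 2012-01-01 11:00:32 | 0 | 0 | 4 | no | ... | 0 | 0 | 7.773 | 192 | 1 | 0 | 0 | 0 | 6 | small |
| 4 | 0 | 0 | 1 | 0 | 0 | 2012-01-01 04:09:49 | 0 | 0 | 0 | no | ... | 0 | 0 | 13.256 | 255 | 1 | 0 | 0 | 0 | 48 | small |
| 5 | 0 | 0 | 1 | 0 | 0 | 2012-01-01 05:00:01 | 0 | 0 | 0 | no | ... | 0 | 2 | 1.231 | 29 | 0 | 0 | 0 | 0 | 1 | none |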


In [4]:
spam_df.describe().round(2)

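| | spam | to_multiple | from | cc | sent_email | image | attach | dollar | inherit | viagra | password | num_char | line_breaks | format | re_subj | exclaim_subj | urgent_subj | exclaim_mess |
|--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 | 3921.0 |
| mean | 0.09 | 0.16 | 1.0 | 0.4 | 0.28 | 0.05 | 0.13 | 1.47 | 0.04 | 0.0 | 0.11 | 10.71 | 230.66 | 0.7 | 0.26 | 0.08 | 0.0 | 6.58 |
| std | 0.29 | 0.36 | 0.03 | 2.67 | 0.45 | 0.45 | 0.72 | 5.02 | 0.27 | 0.13 | 0.96 | 14.65 | 319.3 | 0.46 | 0.44 | 0.27 | 0.04 | 51.48 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.46 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.86 | 119.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14.08 | 298.0 | 1.0 | 1.0 | 0.0 | 0.0 | 4.0 |
| max | 1.0 | 1.0 | 1.0 | 68.0 | 1.0 | 20.0 | 21.0 | 64.0 | 9.0 | 8.0 | 28.0 | 190.09 | 4022.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1236.0 |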


### Prepare the data

In [5]:
# Fixed seed so that sampling and splitting are reproducible
random_seed = 5

In [6]:
# Drop the timestamp column; it is not used as a feature in this notebook
spam_df.drop(columns='time', inplace=True)

#### Convert categorical variables into dummy variables (0/1 encoding)

In [7]:
spam_df = pd.get_dummies(spam_df, columns=['winner', 'number'], drop_first=True)
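To illustrate what `drop_first=True` does, here is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are made up for illustration; only the column names `winner` and `number` come from the data set). Note that `number` has three levels, so it yields two dummy columns.

```python
import pandas as pd

# Hypothetical toy data with the same two categorical columns as spam_df.
toy_df = pd.DataFrame({
    "spam": [0, 1, 0, 1],
    "winner": ["no", "yes", "no", "no"],
    "number": ["big", "small", "none", "small"],
})

# drop_first=True drops the alphabetically first level of each column
# ("no" for winner, "big" for number); that level becomes the reference.
toy_encoded = pd.get_dummies(toy_df, columns=["winner", "number"], drop_first=True)

print(list(toy_encoded.columns))
```

Dropping one level per variable avoids perfectly collinear dummy columns: a row with `number_none` and `number_small` both equal to 0 is implicitly `big`.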


#### Downsampling
A countermeasure against the imbalanced distribution of the target labels: only about 9% of the emails are spam.

In [8]:
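# spam is coded 0/1, so the sum is the number of spam emails (the minority class)
num_spam_labels = spam_df.spam.sum()

# Draw the same number of rows from each class so that both classes are equally frequent
spam_df_downsampled = spam_df.groupby('spam').sample(n=num_spam_labels, random_state=random_seed).reset_index(drop=True)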
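The effect of this step can be checked on a small example. The sketch below uses a hypothetical imbalanced frame (`toy_df`, 8 non-spam and 2 spam rows) and assumes, as in the notebook, that the minority class is coded as 1.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 8 non-spam rows, 2 spam rows.
toy_df = pd.DataFrame({
    "spam": [0] * 8 + [1] * 2,
    "num_char": [float(i) for i in range(10)],
})

# Number of rows in the minority class (label 1).
n_minority = int(toy_df["spam"].sum())

# Sample n_minority rows per class without replacement.
toy_balanced = (
    toy_df.groupby("spam")
    .sample(n=n_minority, random_state=5)
    .reset_index(drop=True)
)

print(toy_balanced["spam"].value_counts().to_dict())
```

After sampling, both classes contain the same number of rows. The price is that most majority-class rows are discarded.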

#### Separate the target variable from the remaining data and convert the pandas DataFrames to NumPy arrays for training

In [9]:
y = spam_df_downsampled['spam'].to_numpy()
X = spam_df_downsampled.drop(columns='spam').to_numpy()

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=random_seed)
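The `stratify=y` argument keeps the class proportions roughly the same in both parts of the split. A minimal sketch on hypothetical random data (`X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 50 rows per class, 3 features.
rng = np.random.default_rng(5)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, stratify=y_toy, random_state=5
)

# Share of label 1 in each part; both should be close to 0.5.
print(y_tr.mean(), y_te.mean())
```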

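#### Scale the variables
Scaling the data can help the algorithm converge faster. The scaler is fitted on the training set only and then applied unchanged to the test set, so no information from the test set enters the training step.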

In [11]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
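What `fit_transform` and `transform` do can be seen on a tiny example. The arrays below are hypothetical; the point is that the test row is scaled with the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales.
X_train_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test_toy = np.array([[2.0, 700.0]])

scaler_toy = StandardScaler()
# fit_transform learns mean and standard deviation from the training data ...
X_train_scaled = scaler_toy.fit_transform(X_train_toy)
# ... and transform reuses those training statistics for the test data.
X_test_scaled = scaler_toy.transform(X_test_toy)

print(X_train_scaled.mean(axis=0))  # approximately 0 per column
print(X_train_scaled.std(axis=0))   # approximately 1 per column
print(X_test_scaled)
```

The scaled test row does not have mean 0 itself. It is expressed in units of the training standard deviation.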

### Model training

Initialize the `LogisticRegression` model, then call `fit()` to train it on the training data. If the data has not been scaled beforehand, the solver may need more iterations to converge, so it can make sense to increase the `max_iter` parameter (default: 100).
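One possible way to approach this step is sketched below on synthetic data from `make_classification`, which is an assumption made so the example runs on its own. The names `X_demo`, `y_demo` and `model_demo` are hypothetical; in the notebook the same calls would be applied to `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spam features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=5)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# max_iter=100 is the default; it is written out here only for clarity.
model_demo = LogisticRegression(max_iter=100)
model_demo.fit(X_demo_scaled, y_demo)

print(model_demo.coef_.shape)  # one coefficient per feature
print(model_demo.n_iter_)      # iterations the solver actually used
```

If `n_iter_` reaches `max_iter`, scikit-learn emits a convergence warning; that is the situation in which raising `max_iter` or scaling the data helps.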

### Apply the model

Once the model has been trained, you can use `predict()` to predict the target variable for the test set. Store the predictions in `y_pred`.
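As a sketch of what `predict()` returns, the example below again uses synthetic data (an assumption, with hypothetical names) and compares the hard labels from `predict()` with the class probabilities from `predict_proba()`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption); the first 150 rows are used for training.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=5)
model_demo = LogisticRegression().fit(X_demo[:150], y_demo[:150])

# predict returns one hard class label (0 or 1) per row.
y_pred_demo = model_demo.predict(X_demo[150:])
# predict_proba returns one probability per class and row.
proba_demo = model_demo.predict_proba(X_demo[150:])

print(y_pred_demo.shape, proba_demo.shape)
print(np.unique(y_pred_demo))
```

In the notebook, the analogous call on the scaled test set would be stored in `y_pred`, which the validation cell expects.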

### Validation

As a last step, check how well your model predicts, using accuracy (the share of correct predictions among all predictions).

In [16]:
test_accuracy = accuracy_score(y_test, y_pred)
# Simple benchmark: always predict spam (label 1)
simple_benchmark_accuracy = accuracy_score(y_test, np.ones_like(y_test))
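Accuracy and the all-ones benchmark can be verified by hand on a small example. The label arrays below are hypothetical. Because the classes were balanced by downsampling, a benchmark that always predicts 1 should land near 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions.
y_true_toy = np.array([0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 1, 1, 0, 0])

# Accuracy is the fraction of positions where prediction and truth agree.
acc_manual = np.mean(y_true_toy == y_pred_toy)
acc_sklearn = accuracy_score(y_true_toy, y_pred_toy)

# Benchmark: predict label 1 for every row.
benchmark_toy = accuracy_score(y_true_toy, np.ones_like(y_true_toy))

print(acc_manual, acc_sklearn, benchmark_toy)
```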


In [18]:
results = pd.DataFrame([test_accuracy, simple_benchmark_accuracy], columns=['Accuracy'], index=['Test', 'Benchmark']).round(3)
results

| | Accuracy |
|--- | --- |
| Test | 0.786 |
| Benchmark | 0.502 |
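`roc_auc_score` is imported at the top of the notebook but not used in the cells shown. As a possible complement to accuracy, the sketch below computes the ROC AUC from hypothetical labels and predicted scores. With a fitted model, the scores would typically be the second column of `predict_proba()`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for class 1.
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

# ROC AUC is the share of (positive, negative) pairs in which the
# positive example receives the higher score: here 3 of 4 pairs.
auc_toy = roc_auc_score(y_true_toy, scores_toy)

print(auc_toy)
```

Unlike accuracy, this metric does not depend on a fixed decision threshold.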
