# Logistic regression
The goal of logistic regression is to classify data using a trained model. Logistic regression is a simple algorithm that does not require much computational power (in contrast to techniques like support vector machines), while often performing as well as (or even better than) more complex models. Besides assigning a class, it also calculates class probabilities, allowing us to see how confident the model is in a decision.
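
As a minimal sketch of the difference between a class assignment and a class probability, the snippet below fits a model on synthetic data. The data from `make_classification` and all `demo_` names are assumptions for illustration only, not part of the exercise datasets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (an assumption, not the exercise data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_model = LogisticRegression(C=1.0).fit(X_demo, y_demo)

# predict() returns the assigned class for each sample
demo_classes = demo_model.predict(X_demo[:5])
# predict_proba() returns one probability per class; each row sums to 1
demo_proba = demo_model.predict_proba(X_demo[:5])

print(demo_classes)
print(demo_proba.round(3))
```

The column order of `predict_proba` follows `demo_model.classes_`, so a row close to `[0.5, 0.5]` marks a sample the model is unsure about.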

The goals of this exercise are:
* Correctly training and tuning a logistic regression classifier
* Performing classification via logistic regression
* Performing feature engineering
* Interpreting the different performance metrics like accuracy, recall, precision, F1-score and ROC
* Recognizing under- and overfitting and being able to adapt the model accordingly


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt                        # To create plots
import numpy as np                                     # To perform calculations quickly
import pandas as pd                                    # To load in and manipulate data
from sklearn.linear_model import LogisticRegression    # Linear model
from sklearn.model_selection import train_test_split   # Split up the data in a train and test set
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import roc_curve, auc
import seaborn as sns                                  # To create statistical plots


## Breast cancer dataset

https://www.kaggle.com/pranaykankariya/breastcancer-dataset

Target column: diagnosis (malignant or benign)

In [None]:
# Read in the data

# Take a look at the first rows


In [None]:
# Summarize the dataset


In [None]:
# Create a countplot to determine if the classes are balanced
# Use sns.countplot(data=df_cancer,x='column_of_interest')


In [None]:
# Remove uninformative columns (look at the summary and first rows to determine which columns you should remove)


In [None]:
# Create a histogram plot of the different features
# This can be done using the pandas hist function.
# => my_df.hist()
# To increase the size of the figure, you can pass a figsize tuple; (20,15) works well. You can also hide the grid
# with the option "grid=False"
# => my_df.hist(figsize=(20,15),grid=False)

# To hide the plot information, you can assign the result to a variable or add a ";" at the end (other options exist)
# => plot = my_df.hist(figsize=(20,15),grid=False)         or       my_df.hist(figsize=(20,15),grid=False);



In [None]:
# Split data into features and targets (or X, y depending on your preference)



In [None]:
# Split data into a training and test set


In [None]:
# Scale the data using a scaler. You can base your choice on the histograms above, or just try a few out
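
A minimal sketch of the scaling step, assuming `StandardScaler` as the scaler (other scalers such as `MinMaxScaler` follow the same pattern). The random toy data and the `_toy`, `_tr`, `_te` names are hypothetical. The point it illustrates is that the scaler is fitted on the training set only and then reused on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (an assumption for illustration, not the exercise data)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean and std on the training data only
X_te_scaled = scaler.transform(X_te)       # apply the training statistics to the test data

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

Fitting the scaler on the full dataset would leak information from the test set into training, which is why `fit_transform` is only called on the training part here.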


In [None]:
# Initialize a logistic regression model and fit it to the training data. Start with a C-value of 1


### Model evaluation
In the next section we will evaluate our model using different metrics.

In [None]:
# Check if you have over- or underfitting of your model by comparing the score of the training and test set


# Predict values for the test set

# Look at the confusion matrix, what do the different values mean in this case?
# Hint: if you don't know the syntax/meaning of a specific function, you can always look this up
# in jupyter notebook by executing "?function_name"


# Show the accuracy, recall, precision and f1-score for the test set
# Note: sometimes you need to supply a positive label (if not working with 0 and 1).
# Supply this with "pos_label='label'"; in this case, the malignant samples are the positives
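
The following sketch shows how the confusion matrix and `pos_label` behave with string labels. The hand-written label lists and the choice of `'M'` and `'B'` as label values are assumptions for illustration; check the actual values in the diagnosis column.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-made true and predicted labels (hypothetical, not model output)
y_true_demo = ['M', 'B', 'B', 'M', 'B', 'M', 'B', 'B']
y_pred_demo = ['M', 'B', 'M', 'B', 'B', 'M', 'B', 'B']

# Rows are true classes, columns are predicted classes, ordered as in `labels`
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=['B', 'M'])
tn, fp, fn, tp = cm_demo.ravel()   # valid for the two-class case with 'M' as positive

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
rec_demo = recall_score(y_true_demo, y_pred_demo, pos_label='M')       # tp / (tp + fn)
prec_demo = precision_score(y_true_demo, y_pred_demo, pos_label='M')   # tp / (tp + fp)
f1_demo = f1_score(y_true_demo, y_pred_demo, pos_label='M')

print(cm_demo)
print(acc_demo, rec_demo, prec_demo, f1_demo)
```

With `'M'` as the positive class, a false negative is a malignant sample predicted as benign, which is usually the more costly error in this setting.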


In [None]:
# Make a ROC curve and also show the AUROC
# Calculate the fpr and tpr for all thresholds of the classification
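
A minimal sketch of the ROC computation on four hand-made scores. The arrays are assumptions for illustration; in the exercise the scores would typically come from the model's predicted probability for the positive class.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels and predicted scores for the positive class
y_true_roc = np.array([0, 0, 1, 1])
y_score_roc = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true_roc, y_score_roc)

# Area under the ROC curve, computed with the trapezoidal rule
roc_auc = auc(fpr, tpr)
print(roc_auc)
```

Plotting `fpr` against `tpr` (for example with `plt.plot(fpr, tpr)`) gives the curve itself. When the labels are strings rather than 0 and 1, `roc_curve` also accepts a `pos_label` argument.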



Try some different C-values for the model, e.g. 0.0001 and 1000.

What do you see in the metrics? What does this mean?
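
As a hedged starting point for this comparison, the sketch below loops over a few C-values on synthetic data and records the train score, the test score and the size of the learned coefficients. In scikit-learn, `C` is the inverse of the regularization strength, so a small C means strong regularization. The synthetic data and the `_c` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption, not the exercise data)
X_c, y_c = make_classification(n_samples=300, n_features=10, random_state=0)
X_c_train, X_c_test, y_c_train, y_c_test = train_test_split(
    X_c, y_c, test_size=0.2, random_state=0)

results = {}
for C in (0.0001, 1, 1000):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_c_train, y_c_train)
    results[C] = {
        "coef_norm": float(np.linalg.norm(clf.coef_)),    # shrinks as C gets smaller
        "train_score": clf.score(X_c_train, y_c_train),
        "test_score": clf.score(X_c_test, y_c_test),
    }

for C, row in results.items():
    print(C, row)
```

A large gap between train and test score hints at overfitting, while low scores on both hint at underfitting. How pronounced either effect is depends on the data.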


## Wine dataset
https://archive.ics.uci.edu/ml/datasets/Wine

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The attributes are:
1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline

The target column is "cultivar".

In [None]:
# Read in the data

# Take a look at the first rows


In [None]:
# Show a summary of the data


In [None]:
# Split into features and targets (X and y)


In [None]:
# Split into a train and test set, keep about 20% of the data for testing


In [None]:
# Scale/normalize the data


In [None]:
# Create a model

# Check if you have over- or underfitting of your model by comparing the score of the training and test set


# Predict values for the test set


# Look at the confusion matrix, what do the different values mean in this case?

# Show the accuracy, recall, precision and f1-score for the test set
# Note, since we have multiple classes, we have to provide an average parameter to recall, precision and f1-score
# Using average=None will result in the scores for all classes. To know which classes correspond to which values,
# you can take a look at model.classes_
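
A small sketch of the `average` parameter for a three-class problem, using hand-made labels (an assumption for illustration, not model output). With `average=None` the scores are returned per class in sorted label order, which matches `model.classes_` for a fitted scikit-learn classifier.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true and predicted cultivars
y_true_mc = [1, 1, 2, 2, 3, 3]
y_pred_mc = [1, 2, 2, 2, 3, 1]

# One score per class, in sorted label order: 1, 2, 3
per_class_recall = recall_score(y_true_mc, y_pred_mc, average=None)
per_class_precision = precision_score(y_true_mc, y_pred_mc, average=None)

# 'macro' is the unweighted mean of the per-class scores
macro_recall = recall_score(y_true_mc, y_pred_mc, average='macro')

print(per_class_recall, per_class_precision, macro_recall)
```

Other options such as `average='weighted'` weight each class by its number of true samples, which matters when the classes are unbalanced.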
