# Logistic regression

Here, we will explore the logistic regression model that we covered in class. In good practice, though, machine learning development should not start with the learning algorithm. It should begin with understanding the problem and then exploring the data. Today, we'll start with the data; in the real world, the problem that needs to be solved will be the starting point.


> You can listen to an introduction to the data-centric AI movement here: https://datacentricai.org/blog/opening-remarks/. The movement puts exploring and 'working on' the data (rather than only crafting a machine learning architecture) at the centre of developing a model for a given problem.



## Data

-- This week, we will be using real data instead of generating it randomly like we did last week. Specifically, we will use the Wine dataset. Read more about the dataset here: https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-dataset.


-- Here are some prompts to help you find out more about the dataset:

*   What type of labels does it have (real continuous or categorical)**?**
*   What is the feature dimensionality, i.e. the number of features**?**
*   Can you find out how it was collected**?**
*   How was it labelled**?**
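
As a starting point for the prompts above, the following is a minimal sketch of how the loaded dataset object could be inspected. The variable names are illustrative, and the attributes used (`data`, `target`, `feature_names`, `target_names`, `DESCR`) are those of the object returned by `load_wine`.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Feature matrix: one row per sample, one column per feature.
n_samples, n_features = wine.data.shape
print("samples:", n_samples, "features:", n_features)
print("feature names:", wine.feature_names)

# Labels: the dtype and the distinct values hint at whether they are categorical.
classes, counts = np.unique(wine.target, return_counts=True)
print("label dtype:", wine.target.dtype)
print("samples per class:", {str(n): int(c) for n, c in zip(wine.target_names, counts)})

# The bundled description includes notes on where the data comes from.
print(wine.DESCR[:500])
```

The printed description is only truncated to keep the output short; reading it in full is a quick way to look into how the data was collected and labelled.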














## Load data


-- You need to load the data before you can get started. Here is the documentation on how to load the wine data with the *scikit-learn* library:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html


-- After you load the data, you can explore the relationships between each individual feature and the labels, e.g. using scatter plots or boxplots.
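
One way to look at a single feature against the labels is a boxplot with one box per class. The sketch below is only an illustration: the choice of the first feature (`feature_index = 0`) is arbitrary, and the variable names are not part of the exercise.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

feature_index = 0  # arbitrary choice for illustration; try other columns too
n_classes = len(wine.target_names)

# Group the chosen feature's values by class label.
values_per_class = [X[y == c, feature_index] for c in range(n_classes)]

fig, ax = plt.subplots()
ax.boxplot(values_per_class)  # boxes are drawn at x positions 1..n_classes
ax.set_xticks(range(1, n_classes + 1))
ax.set_xticklabels(wine.target_names)
ax.set_xlabel("class")
ax.set_ylabel(wine.feature_names[feature_index])
plt.show()
```

Features whose boxes barely overlap across classes are likely to be more useful to a classifier than features whose boxes overlap heavily.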


-- Next, you can split the data into training and test sets. You can use this split function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split. Let's do a 50:50 split first, i.e. take half of the dataset and put it in a variable that represents your training set, and keep the other half as your test set. Do a random split. What is the distribution of the labels in the training set**?** You can plot the distribution (Hint: you could use a histogram or bar chart for categorical labels).

You can use the code cell below. You will need to complete the code.

In [None]:
# load wine dataset and get the features and the labels
import numpy as np
import sklearn.model_selection
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
wine_data = load_wine()
wine_data_feature_vector = wine_data.data
wine_data_label = wine_data.target

# randomly split 50/50 into training and test sets
# (one call splits the features and the labels consistently; calling it twice
# would produce two unrelated random splits)
(training_set_feature_vector, test_set_feature_vector,
 training_set_label, test_set_label) = sklearn.model_selection.train_test_split(
    wine_data_feature_vector, wine_data_label, train_size=0.5)

# show the distribution of the labels in the training set
class_ids, class_counts = np.unique(training_set_label, return_counts=True)
plt.bar(class_ids, class_counts)
plt.xticks(class_ids, wine_data.target_names[class_ids])
plt.xlabel("class")
plt.ylabel("number of training samples")
plt.show()

## Modelling

-- Now that you have training and test sets, you can start modelling. Let's start with a logistic regression model. Here is the documentation on how to build a logistic regression model using the *scikit-learn* library: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Have a look at the parameters of the function for building the model. Which of them can you relate to the parameters that we considered in class**?**



-- Build the model using the training set (you can start with the default parameter settings), then evaluate it using the test set. Just for today, you can use accuracy as the performance metric. You can use the accuracy function in the *scikit-learn* library (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html). Compute the overall accuracy. Then, check the accuracy per class. What do you find**?** If you found something unexpected, why do you think that is**?** What happens if you change the 'penalty' parameter, i.e. the regularization parameter, to 'None'**?**

You can use the code cell below. You will need to complete the code.


In [None]:
# build the model
logistic_regr_model = 

# evaluate the model
test_prediction =

overall_accuracy = 

accuracy_per_class = 
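
For reference, here is one possible way the fit-and-evaluate step could look; it is a sketch, not the only valid solution. It reloads and splits the data so that it runs on its own. Two settings are assumptions added for this sketch: `random_state=0` (only to make the run repeatable) and `max_iter=10000` (raised from the default to give the solver room to converge). "Accuracy per class" is interpreted here as the accuracy over the test samples whose true label is that class.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
# random_state is an assumption, fixed only so the sketch is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0)

# max_iter is an assumption: larger than the default so the solver can converge
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

overall_accuracy = accuracy_score(y_test, y_pred)

# accuracy restricted to the test samples of each true class
accuracy_per_class = {
    int(c): accuracy_score(y_test[y_test == c], y_pred[y_test == c])
    for c in np.unique(y_test)
}
print("overall:", overall_accuracy)
print("per class:", accuracy_per_class)
```

Because the split is random, removing `random_state` will give different numbers on each run, which is worth keeping in mind when comparing models.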



## Modelling with training and test sets swapped


-- Repeat the experiment above but with the training and test sets swapped. Before you run the experiment, check the distribution of the labels in the new training set. Do you find any difference in the performance of the two models**?** If there is any, why do you think there might be a difference**?** And what does that then imply**?**

You can use the code cell below. You will need to complete the code.

In [None]:

# swap training and test sets
new_training_feature_vector =
new_training_label =
new_test_feature_vector =
new_test_label =

# show the distribution of the labels in the training set

# build the new model
logistic_regr_model_swapped = 

# evaluate the new model
test_prediction_swapped =

overall_accuracy_swapped = 

accuracy_per_class_swapped = 
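
The swap experiment can be sketched as follows, assuming the same illustrative settings as a default logistic regression with `max_iter=10000` and a fixed `random_state=0`. The helper `fit_and_score` is a hypothetical name introduced for this sketch.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
# two halves of the data; random_state is an assumption for repeatability
X_a, X_b, y_a, y_b = train_test_split(X, y, train_size=0.5, random_state=0)


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit on one half and return the accuracy on the other half."""
    model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# label counts per class in each half (index = class id)
labels_in_a = np.bincount(y_a, minlength=3)
labels_in_b = np.bincount(y_b, minlength=3)
print("labels in half A:", labels_in_a)
print("labels in half B:", labels_in_b)

accuracy_original = fit_and_score(X_a, y_a, X_b, y_b)  # train on A, test on B
accuracy_swapped = fit_and_score(X_b, y_b, X_a, y_a)   # train on B, test on A
print("original:", accuracy_original, "swapped:", accuracy_swapped)
```

Comparing the two label-count arrays alongside the two accuracies is one way to reason about any difference between the models.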


## Comparing the performance of the logistic regression with other algorithms

-- You can build additional models using the kNN and Naive Bayes classifiers. You can find the *scikit-learn* documentation for them at https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html and https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB. How does their performance compare with that of the logistic regression**?**
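
A side-by-side comparison might look like the sketch below. Note one deliberate substitution: the Naive Bayes documentation linked above is for `CategoricalNB`, which is designed for categorical features, whereas the Wine features are continuous measurements. This sketch therefore swaps in `GaussianNB`, which is an assumption and not necessarily the classifier intended by the exercise. The choice of `k=5`, `max_iter=10000`, and `random_state=0` are likewise illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0)

# GaussianNB is swapped in for CategoricalNB because the features are continuous
models = {
    "logistic regression": LogisticRegression(max_iter=10000),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Gaussian Naive Bayes": GaussianNB(),
}

# fit every model on the same training half and score it on the same test half
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, accuracies[name])
```

Using the same split for all three models keeps the comparison fair, since differences in accuracy then come from the models and not from the data they happened to see.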

-- You can try different values of k for the kNN, recording the accuracy in each case. You can create a plot of accuracy (on the y axis) against k (on the x axis). What do you notice**?**
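
The sweep over k could be sketched like this. The range of k values (1 to 30) and the fixed `random_state=0` are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0)

k_values = list(range(1, 31))  # illustrative range, not prescribed by the exercise
knn_accuracies = []
for k in k_values:
    # fit a kNN classifier with k neighbours and record its test accuracy
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    knn_accuracies.append(accuracy_score(y_test, knn.predict(X_test)))

plt.plot(k_values, knn_accuracies, marker="o")
plt.xlabel("k")
plt.ylabel("test accuracy")
plt.show()
```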

-- Rather than the 50:50 split you used above, you can try an 80:20 split for the logistic regression. What do you find, compared to the 50:50 split**?** And why, if you find any difference**?**
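
To compare the two split ratios, a sketch along the following lines could be used. As before, `max_iter=10000` and `random_state=0` are assumptions made for this illustration, and the dictionary names are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

split_accuracy = {}  # train_size -> test accuracy
split_sizes = {}     # train_size -> (number of training samples, number of test samples)
for train_size in (0.5, 0.8):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, random_state=0)
    model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
    split_accuracy[train_size] = accuracy_score(y_test, model.predict(X_test))
    split_sizes[train_size] = (len(y_train), len(y_test))
    print(train_size, split_sizes[train_size], split_accuracy[train_size])
```

With an 80:20 split the test set is much smaller, so a single run gives a noisier accuracy estimate; repeating the experiment with several random splits may give a clearer picture.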

