---
title: "Topics in Econometrics and Data Science: Tutorial 8"

---

#### General Note

You will very likely find the solution to these exercises online. We, however, strongly encourage you to work on these exercises without doing so. Understanding someone else’s solution is very different from coming up with your own. Use the lecture notes and try to solve the exercises independently.

## Exercise 1: Logistic Regression

Load the `classification_1` dataset. The data consist of two features and a classification variable.

### **A)** 
How many different classes are contained in the dataset? (Hint: `help(pd.unique)`)

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv('../../data/classification_1.csv', sep=';', na_values=".")
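
As a hedged illustration of the hint, the sketch below applies `pd.unique` to a small made-up column. The frame `toy` and its values are assumptions for demonstration only and are not taken from `classification_1`.

```python
import pandas as pd

# Hypothetical toy frame: the column name `class` mirrors the exercise,
# but the values are invented for illustration.
toy = pd.DataFrame({"class": [0, 1, 1, 0, 1]})

# pd.unique returns the distinct values in order of first appearance.
labels = pd.unique(toy["class"])
n_classes = len(labels)

print(labels, n_classes)
```

The same call on the `class` column of `data` would answer the question for the real dataset.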

### **B)**
Use logistic regression to separate the two classes and measure the error rate on the training set.
Follow these steps:

1. *Preprocessing*: Select the X columns `length` and `height` and save them as a NumPy array using the attribute `values`. Select the Y column `class`, save it as a NumPy array, and `flatten` it into a one-dimensional array.

2. *Logistic Regression*: Use [`linear_model.LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to create a logistic regression classifier, setting the parameters `penalty` and `solver` explicitly.

3. *Train model*: Learn the coefficients that best separate the classes using [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit).

4. *Predict class*: Use the trained model to make predictions on X with [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) and return predicted class labels for each input sample.

5. *Error rate*: Calculate the misclassification rate of the model, i.e. the share of training samples whose predicted label differs from the true label.

In [None]:
from sklearn.linear_model import LogisticRegression
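
The five steps above can be sketched end to end as follows. This is a minimal sketch on synthetic data, not the solution for `classification_1`: the frame `toy`, the random seed, and the parameter values `penalty="l2"` and `solver="lbfgs"` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the exercise data (all values are made up):
# two Gaussian blobs, one per class.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "length": np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)]),
    "height": np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)]),
    "class": np.repeat([0, 1], 50),
})

# 1. Preprocessing: features as a 2D array, labels as a flat 1D array.
X = toy[["length", "height"]].values
y = toy[["class"]].values.flatten()

# 2. Classifier; the parameter values are assumptions, not prescribed ones.
clf = LogisticRegression(penalty="l2", solver="lbfgs")

# 3. Train the model.
clf.fit(X, y)

# 4. Predict class labels for the training samples.
y_pred = clf.predict(X)

# 5. Misclassification rate: share of predictions that differ from y.
error_rate = np.mean(y_pred != y)
print(f"training error rate: {error_rate:.3f}")
```

Because the error is measured on the same samples used for fitting, it describes the training fit only and says nothing about unseen data.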

### **C)** 
Visualize the data with [`matplotlib`](https://matplotlib.org/). 

In [44]:
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches

# Plot configuration
plot_symbol_size = 50
cmap_bold  = ListedColormap(['#FF0000', '#0000FF'])

1. Create a basic scatter plot using just the `length` and `height` features.

2. Use `class` to set the colors of the scatter plot. Pass the `ListedColormap` with red and blue (`cmap_bold`) to distinguish between the two classes.

3. Add a legend by creating red and blue patches with [`mpatches.Patch`](https://matplotlib.org/stable/api/_as_gen/matplotlib.patches.Patch.html) that explain which color corresponds to which class (Y=0 or Y=1).
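
One possible way to combine the three plotting steps is sketched below on made-up points. The arrays `X` and `y`, the black marker edges, and the `Agg` backend line (only needed to run the sketch outside a notebook) are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit inside a notebook
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

# Made-up data standing in for `length`, `height` and `class`.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] > 0).astype(int)

plot_symbol_size = 50
cmap_bold = ListedColormap(["#FF0000", "#0000FF"])

fig, ax = plt.subplots()

# Steps 1 and 2: scatter plot of the two features, colored by class label.
sc = ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                s=plot_symbol_size, edgecolors="black")

# Step 3: one patch per class serves as a legend entry.
red_patch = mpatches.Patch(color="#FF0000", label="Y=0")
blue_patch = mpatches.Patch(color="#0000FF", label="Y=1")
ax.legend(handles=[red_patch, blue_patch])

ax.set_xlabel("length")
ax.set_ylabel("height")
```

Patches are used for the legend because a single `scatter` call with `c=y` produces one artist for both classes, so it cannot supply one legend entry per class by label alone.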

### **D)**
Can you explain the results from the logistic regression? Color the regions of the plot according to their predictions.

1. Create two color maps: one bold (for actual points) and one light (for decision boundary background).

In [2]:
import matplotlib.cm as cm
from matplotlib.colors import ListedColormap, BoundaryNorm
import matplotlib.patches as mpatches

2. Calculate the mesh grid for the decision boundary by finding min/max values of the features and creating a fine mesh grid with step size 0.01 using [`np.meshgrid`](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html).\
**Hint**: `xx, yy = np.meshgrid(np.arange(x_min, x_max, stepsize), np.arange(y_min, y_max, stepsize))`

3. Use your trained model to predict classes for each point in the mesh grid using `.predict()` on your model from **B)**.\
**Hint**: Use `np.c_[xx.ravel(), yy.ravel()]` to pass the grid points to `.predict()` in the right shape, where `xx` and `yy` are the 2D arrays created by `np.meshgrid`. Reshape the predictions back to the shape of `xx` before plotting.

4. Plot the decision boundary visualization by using [`pcolormesh`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.pcolormesh.html#matplotlib.pyplot.pcolormesh) to create the colored background.

5. Add the actual data points on top using a scatter plot.

6. Set the plot limits to match the mesh grid, add legend, labels, and create a title that includes the model parameters.
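
The six steps above might look roughly like this. The sketch uses synthetic blobs and a default `LogisticRegression`; the light colors, the margin of 1 around the data, and the title format are assumptions, not requirements of the exercise.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit inside a notebook
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data and a fitted model (stand-ins for part B).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)
clf = LogisticRegression().fit(X, y)

# 1. Bold colors for the points, light colors for the background.
cmap_bold = ListedColormap(["#FF0000", "#0000FF"])
cmap_light = ListedColormap(["#FFAAAA", "#AAAAFF"])

# 2. Mesh grid over the feature range (margin of 1 is an assumption).
stepsize = 0.01
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, stepsize),
                     np.arange(y_min, y_max, stepsize))

# 3. Predict a class for every grid point, then restore the grid shape.
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# 4. Colored background showing the predicted regions.
fig, ax = plt.subplots()
ax.pcolormesh(xx, yy, Z, cmap=cmap_light, shading="auto")

# 5. Actual data points on top.
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=50, edgecolors="black")

# 6. Limits, legend, labels and a title naming some model parameters.
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.legend(handles=[mpatches.Patch(color="#FF0000", label="Y=0"),
                   mpatches.Patch(color="#0000FF", label="Y=1")])
ax.set_xlabel("length")
ax.set_ylabel("height")
ax.set_title(f"Logistic regression (C={clf.C}, solver={clf.solver})")
```

Points whose color differs from the background behind them are exactly the training samples the model misclassifies.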

### **E)** 
Which method could be better suited for this task? Try to improve on the error rate.

In [None]:
from sklearn import svm
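
The exercise leaves the choice of method open; the import above hints at a support vector machine. As a hedged sketch, the code below compares the training error of logistic regression with that of `svm.SVC` on synthetic data that is deliberately not linearly separable. The data, the RBF kernel, and the values `C=10.0` and `gamma="scale"` are assumptions for illustration; whether they suit `classification_1` has to be checked on the real data.

```python
import numpy as np
from sklearn import svm
from sklearn.linear_model import LogisticRegression

# Synthetic data: class 1 lies inside a circle, class 0 outside,
# so no straight line separates the classes well.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (np.hypot(X[:, 0], X[:, 1]) < 0.6).astype(int)

# Linear baseline from part B.
linear_clf = LogisticRegression().fit(X, y)

# SVM with an RBF kernel; kernel and hyperparameters are illustrative choices.
svm_clf = svm.SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, y)

# Training misclassification rates of both models.
err_linear = np.mean(linear_clf.predict(X) != y)
err_svm = np.mean(svm_clf.predict(X) != y)
print(f"logistic regression: {err_linear:.3f}, SVM (RBF): {err_svm:.3f}")
```

Both classifiers expose `.fit()` and `.predict()`, so the fitted `svm_clf` can be passed to the decision-region plotting code in place of the logistic regression model.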

Plot your classifier again. You only need to replace the model used for predictions; the rest of the plotting code from **D)** stays the same.

In [None]:
import matplotlib.cm as cm
from matplotlib.colors import ListedColormap, BoundaryNorm
import matplotlib.patches as mpatches