# Week 7: Regression

In [None]:
# Loading the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
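# Note: plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions
# use ConfusionMatrixDisplay.from_estimator from sklearn.metrics instead.
from sklearn.metrics import confusion_matrix, plot_confusion_matrix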

# Great content about what to be careful about in linear regression modeling
# Update scikit-learn with:
# pip install -U scikit-learn

## Day 3: Logistic Regression
The final step in our adventure with regression models is the **logistic regression**.

Unlike the linear and nonlinear regression models, which (at least in theory) can return any real number as a predicted value, the logistic regression is set up so that it returns values in the interval $(0, 1)$, regardless of the size of the inputs. This makes the logistic model a good candidate for **classification problems**, where the outcome is a categorical variable.

Today we use the logistic regression to model probability in *binary classification problems* in which the target variable has only two classes. We choose one class that we target, and label it by $1$; the remaining data is labeled by $0$.

The logistic regression uses the **logistic function**, also known as the **sigmoid function**, to build the model (the **logit function** is its inverse). This function is given by:
\begin{equation} f(x) = \frac{1}{1 + e^{-x}} \quad \text{or} \quad f(x) = \frac{e^x}{e^x + 1}\end{equation}
Here is the graph of the function:

In [None]:
f = lambda x: 1/(1+np.exp(-x))
xs = np.linspace(-10, 10, 1000)
plt.figure()
plt.plot(xs, f(xs))
plt.show()
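
The properties that make this function useful for modeling probabilities can also be checked numerically. The following is a minimal sketch; the helper names `sigmoid` and `sigmoid_alt` are ours and are used only for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_alt(x):
    """Equivalent form e^x / (e^x + 1)."""
    return np.exp(x) / (np.exp(x) + 1.0)

xs = np.linspace(-10, 10, 1000)
values = sigmoid(xs)

# Both formulas describe the same function
forms_agree = bool(np.allclose(values, sigmoid_alt(xs)))
# Outputs stay strictly between 0 and 1 on this grid
stays_in_unit_interval = bool(np.all((values > 0) & (values < 1)))
# The curve crosses 0.5 at x = 0
midpoint = sigmoid(0.0)
# Symmetry: f(-x) = 1 - f(x)
is_symmetric = bool(np.allclose(sigmoid(-xs), 1.0 - values))

print(forms_agree, stays_in_unit_interval, midpoint, is_symmetric)
```

The symmetry $f(-x) = 1 - f(x)$ is the reason why swapping the labels $0$ and $1$ only flips the signs of the model coefficients.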

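To construct the logistic regression model, we must first encode the target variable: one category as 1, the other as 0 (the accuracy of the model is the same regardless of which category gets which label). The coefficients of the model are then obtained by an optimization process similar in spirit to the *least squares* process we described in the case of linear regression, except that here we maximize the *likelihood* of the observed labels (equivalently, minimize the *log loss*) instead of minimizing the sum of squared errors.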
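
To see that the choice of encoding does not matter, we can fit the same data twice with the labels swapped. This is a sketch on synthetic data (the data and all variable names are assumptions made for illustration, not part of the course files), using scikit-learn's `LogisticRegression` with its default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic one-feature data (an assumption, for illustration only)
x = rng.normal(loc=0.0, scale=2.0, size=200)
y = (x + rng.normal(scale=1.0, size=200) > 0).astype(int)
X = x.reshape(-1, 1)  # scikit-learn expects a 2D feature array

# Fit once with the original labels and once with 0 and 1 swapped
model_a = LogisticRegression().fit(X, y)
model_b = LogisticRegression().fit(X, 1 - y)

acc_a = model_a.score(X, y)
acc_b = model_b.score(X, 1 - y)

print('Accuracy with original labels:', acc_a)
print('Accuracy with swapped labels: ', acc_b)
print('Coefficients:', model_a.coef_[0], model_b.coef_[0])
```

The two models should reach essentially the same accuracy, and their coefficients should differ only in sign (up to the tolerance of the numerical solver).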

Once we obtain the model, we can perform the classification based on the probability $p$ that the model gives us:
* If $p \geqslant 0.5$, then classify the input as category $1$
* If $p < 0.5$, then classify the input as category $0$
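
The rule above can be sketched in code as follows. The data is synthetic and the variable names are our own; the point is only that thresholding the predicted probability of class $1$ at $0.5$ reproduces what the `predict` method of `LogisticRegression` returns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic one-feature data (an assumption, for illustration only)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Column 1 of predict_proba holds the probability of class 1
p = model.predict_proba(X)[:, 1]

# Apply the classification rule by hand
manual = (p >= 0.5).astype(int)

# Compare with the built-in prediction
builtin = model.predict(X)
print('Same classification:', np.array_equal(manual, builtin))
```

If a different cutoff is ever needed (for example, when one type of error is more costly than the other), the same `predict_proba` output can be compared against another threshold.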

Let's illustrate this on some examples.

### Example 1
The data in `premium_membership.csv` contain info about the hourly wage of people and whether they have paid for premium membership for some service.
* Build a logistic regression model that can be used to classify the data
* Plot the data and the model to informally establish if the model is useful
* Make predictions and compare them to the actual data

In [None]:
# Load the data


# Plot the data


In [None]:
# Build the model


# Plotting data and model on the same graph


In [None]:
# Making predictions


# Comparing y and y_pred


# Stating the score


### Example 2
The file `train_gender.csv` contains data about the height, weight and gender of a group of people. The goal is to come up with a model that can predict the gender of a person based on the other variables. The models you build should be tested on the data given in `test_gender.csv`.
* Transform the target variable into a form that can be used by the logistic model
* Build a model for predicting the gender using height
* Build a second model for predicting the gender using weight
* Finally, build a model for predicting the gender using both height and weight
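
Before working with the actual files, here is a sketch of the whole workflow on synthetic data. The column names `Height`, `Weight` and `Gender`, the category names, and the helper `make_people` are all hypothetical and may not match the CSV files, so they would need to be adjusted to the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(2)

def make_people(n):
    """Synthetic stand-in for the CSV files; column names are hypothetical."""
    gender = rng.choice(['Female', 'Male'], size=n)
    is_male = gender == 'Male'
    height = np.where(is_male, rng.normal(178, 7, n), rng.normal(165, 7, n))
    weight = np.where(is_male, rng.normal(82, 10, n), rng.normal(65, 10, n))
    return pd.DataFrame({'Height': height, 'Weight': weight, 'Gender': gender})

train, test = make_people(300), make_people(100)

# Fit the encoder on the training labels only, then reuse it on the test labels
encoder = OrdinalEncoder()
y_train = encoder.fit_transform(train[['Gender']]).ravel().astype(int)
y_test = encoder.transform(test[['Gender']]).ravel().astype(int)

# One model per feature set, each scored on the test data
scores = {}
for features in (['Height'], ['Weight'], ['Height', 'Weight']):
    model = LogisticRegression(max_iter=1000).fit(train[features], y_train)
    scores[' + '.join(features)] = model.score(test[features], y_test)

print('Categories (position = code):', encoder.categories_)
print('Test accuracy per model:', scores)
```

`OrdinalEncoder` sorts the categories, so the position of each category in `encoder.categories_` is the code it receives. Fitting the encoder on the training data and only transforming the test data guarantees that both sets use the same codes.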

In [None]:
# Load the data
df_train = pd.read_csv('train_gender.csv')
df_test = pd.read_csv('test_gender.csv')

# Encoding the target variable


# Display the data

#### Model based on height

In [None]:
# Preparing the data for train and test


# Building the model based on height


# Plotting data and model on the same graph



# Making predictions and evaluating the model
print('Evaluating the model on the TRAIN data (not very telling):')
plot_confusion_matrix()
print('Accuracy: ', )

print('Evaluating the model on the TEST data (appropriate):')
plot_confusion_matrix()
print('Accuracy: ', )
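
The evaluation pattern used in these cells can be sketched on synthetic data as follows. The data and the helper `make_data` are assumptions made for illustration. The sketch uses `confusion_matrix` and `accuracy_score`, which return numbers rather than a plot.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(3)

def make_data(n):
    """Synthetic one-feature data standing in for height vs. encoded gender."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=165 + 13 * y, scale=7.0)
    return x.reshape(-1, 1), y

X_train, y_train = make_data(300)
X_test, y_test = make_data(100)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

results = {}
for name, X, y in [('TRAIN', X_train, y_train), ('TEST', X_test, y_test)]:
    y_pred = model.predict(X)
    # Rows are the actual classes, columns are the predicted classes
    cm = confusion_matrix(y, y_pred)
    acc = accuracy_score(y, y_pred)
    results[name] = (cm, acc)
    print(f'Evaluating the model on the {name} data:')
    print(cm)
    print('Accuracy: ', acc)
```

The accuracy is the sum of the diagonal of the confusion matrix divided by the number of samples. In scikit-learn 1.0 and later, a plotted version of the matrix can be obtained with `ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)`.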

#### Model based on weight

In [None]:
# Preparing the data for train and test


# Building the model based on weight


# Plotting data and model on the same graph



# Making predictions and evaluating the model
print('Evaluating the model on the TRAIN data (not very telling):')
plot_confusion_matrix()
print('Accuracy: ', )

print('Evaluating the model on the TEST data (appropriate):')
plot_confusion_matrix()
print('Accuracy: ', )

#### Model based on height and weight

In [None]:
# Preparing the data for train and test


# Building the model based on height and weight


# Plotting data and model on the same graph



# Making predictions and evaluating the model
print('Evaluating the model on the TRAIN data (not very telling):')
plot_confusion_matrix()
print('Accuracy: ', )

print('Evaluating the model on the TEST data (appropriate):')
plot_confusion_matrix()
print('Accuracy: ', )

#### Plotting the model based on height and weight

In [None]:
# Full code will be shared :)
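
Until then, here is one possible approach, which is a sketch under our own assumptions and not necessarily the code that will be shared. With two features the model classifies by the sign of $w_1 \cdot \text{height} + w_2 \cdot \text{weight} + b$, so the boundary between the two classes is the straight line on which this expression equals $0$, that is, where $p = 0.5$. The data below is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Synthetic height/weight data with a 0/1 target (an assumption for illustration)
n = 300
y = rng.integers(0, 2, size=n)
height = rng.normal(loc=165 + 13 * y, scale=7.0)
weight = rng.normal(loc=65 + 17 * y, scale=10.0)
X = np.column_stack([height, weight])

model = LogisticRegression(max_iter=1000).fit(X, y)
(w_height, w_weight), b = model.coef_[0], model.intercept_[0]

# Solve w_height * h + w_weight * w + b = 0 for the weight w
h_line = np.linspace(height.min(), height.max(), 50)
weight_line = -(b + w_height * h_line) / w_weight

# Points on the boundary should get probability 0.5
p_on_line = model.predict_proba(np.column_stack([h_line, weight_line]))[:, 1]
print('Probabilities on the boundary:', p_on_line.min(), p_on_line.max())
```

To draw the picture, the points can be shown with `plt.scatter(height, weight, c=y)` and the boundary added with `plt.plot(h_line, weight_line)`.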