[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AMLA-UBC/100-Exploring-the-World-of-Modern-Machine-Learning/blob/main/Linear_vs_Logistic_Regression_Tutorial.ipynb)

# Visualize the Difference

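This tutorial shows how to build a linear regression model and a logistic regression model on two datasets: the **California Housing Dataset** (a regression task with a continuous target) and the **Breast Cancer Wisconsin Dataset** (a binary classification task). Each dataset is split into features and labels, and then into training and testing sets. The models are built with TensorFlow 2, compiled, fitted, and then evaluated on the test set. Note that each model includes two hidden ReLU layers, so strictly speaking these are small neural networks with a linear or sigmoid output layer rather than classical linear and logistic regression. By the end of this tutorial, we will understand the advantages and disadvantages of linear and logistic regression models.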

# Install and Import Required Libraries

In [None]:
!pip install -q tensorflow pandas numpy matplotlib scikit-learn

import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing, load_breast_cancer

# Load the California Housing Dataset

The California Housing Dataset is used to predict median house values in California.

In [None]:
# Load the dataset
california_dataset = fetch_california_housing()

# Create a dataframe from the dataset
df = pd.DataFrame(california_dataset.data, columns=california_dataset.feature_names)

# Add the target column
df['MedHouseVal'] = california_dataset.target

# Split the dataset into features and labels
X = df.drop('MedHouseVal', axis=1).values
y = df['MedHouseVal'].values

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Linear Regression Model - Housing Price

After training each model, we evaluate it on the test set and print a test metric, which measures how well the model performs on data it has not seen.

In [None]:
# Create the linear regression model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])
# Compile and fit the model
# Accuracy is not meaningful for a continuous target, so track mean absolute error (MAE) instead
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
history = model.fit(X_train, y_train, epochs=50)

# Plot the loss over time
plt.plot(history.history['loss'])
plt.title('Linear Regression Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

# Evaluate the model (the target is in units of $100,000)
test_loss, test_mae = model.evaluate(X_test, y_test)
print('Test MSE:', test_loss)
print('Test MAE:', test_mae)
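
Accuracy counts exact matches between predicted and true labels, so it says little about a continuous target such as a house value. The following minimal sketch contrasts it with mean absolute error. The numbers are made up for illustration and are not taken from the dataset, and the helper names `exact_match_accuracy` and `mean_absolute_error` are hypothetical.

```python
# Hypothetical targets and predictions (illustrative values only)
y_true = [1.5, 2.0, 3.2, 0.9, 4.1]
y_pred = [1.7, 1.8, 3.0, 1.2, 3.6]


def exact_match_accuracy(targets, predictions):
    """Fraction of predictions that equal the target exactly."""
    matches = sum(1 for t, p in zip(targets, predictions) if t == p)
    return matches / len(targets)


def mean_absolute_error(targets, predictions):
    """Average absolute difference between prediction and target."""
    total = sum(abs(t - p) for t, p in zip(targets, predictions))
    return total / len(targets)


accuracy = exact_match_accuracy(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# Every prediction is close to its target, yet none matches exactly,
# so accuracy reports total failure while MAE reports a small error.
print('Accuracy:', accuracy)
print('MAE:', mae)
```

A regression model can therefore score near-zero accuracy while still predicting well, which is why an error metric such as MAE is the more informative choice for house values.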

# Logistic Regression Model - Housing Price



In [None]:
# Create the logistic regression model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile and fit the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=50)

# Plot the loss over time
plt.plot(history.history['loss'])
plt.title('Logistic Regression Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)

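The loss of this model keeps decreasing and can end at a large negative value, which would not happen for a correctly specified logistic regression. Binary cross-entropy, $-[y \log p + (1 - y) \log(1 - p)]$, assumes labels $y$ in $[0, 1]$, but the median house values are continuous and often greater than 1. For such targets the factor $(1 - y)$ is negative, so the optimizer can keep lowering the loss by pushing the sigmoid output $p$ toward 1, typically limited only by numerical clipping of the predictions. The decreasing loss therefore reflects a mismatch between the loss function and the target, not a good fit.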
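
To make the sign of the loss concrete, here is a minimal sketch of binary cross-entropy in plain Python. The function name `binary_cross_entropy` and the clipping constant `eps=1e-7` are assumptions chosen for illustration, not the exact Keras implementation.

```python
import math


def binary_cross_entropy(y, p, eps=1e-7):
    """Binary cross-entropy for one sample; p is clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


# A valid binary label gives a non-negative loss.
valid_loss = binary_cross_entropy(1.0, 0.9)

# A continuous target above 1 (for example a house value of 3.0)
# gives a loss that keeps falling as the prediction approaches 1.
probabilities = [0.5, 0.9, 0.99, 0.999]
invalid_losses = [binary_cross_entropy(3.0, p) for p in probabilities]

print('Loss for y=1.0:', valid_loss)
for p, loss in zip(probabilities, invalid_losses):
    print('Loss for y=3.0, p=%s: %.3f' % (p, loss))
```

With a target of 3.0 the loss turns negative once the prediction is close enough to 1, which matches the behavior seen in the loss plot above.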

# Load the Breast Cancer Wisconsin Dataset

The Breast Cancer Wisconsin Dataset is used to predict whether a tumor is malignant or benign. In scikit-learn's encoding, label 0 is malignant and label 1 is benign.

In [None]:
# Load the dataset
cancer_dataset = load_breast_cancer()

# Create a dataframe from the dataset
df = pd.DataFrame(cancer_dataset.data, columns=cancer_dataset.feature_names)

# Add the target column
df['target'] = cancer_dataset.target

# Split the dataset into features and labels
X = df.drop('target', axis=1).values
y = df['target'].values

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Linear Regression Model - Breast Cancer

In [None]:
# Create the linear regression model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])
# Compile and fit the model
model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=50)

# Plot the loss over time
plt.plot(history.history['loss'])
plt.title('Linear Regression Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)
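
With a single linear output unit, accuracy is only defined once the raw output has been turned into a class label. The sketch below assumes a 0.5 threshold, which appears to be how Keras' `'accuracy'` metric treats a single-output model. The raw outputs and labels are made up for illustration.

```python
# Hypothetical raw outputs of a linear (no activation) output layer
raw_outputs = [-0.3, 0.2, 0.55, 0.97, 1.4]
labels = [0, 0, 1, 1, 1]

# Turn each raw output into a class label with a 0.5 threshold (assumed)
predicted = [1 if out > 0.5 else 0 for out in raw_outputs]

# Accuracy is the fraction of thresholded predictions that match the labels
accuracy = sum(int(p == t) for p, t in zip(predicted, labels)) / len(labels)

# A linear output is unbounded, so some values cannot be read as probabilities
out_of_range = [out for out in raw_outputs if out < 0.0 or out > 1.0]

print('Predicted labels:', predicted)
print('Accuracy:', accuracy)
print('Outputs outside [0, 1]:', out_of_range)
```

The thresholded labels can be perfectly accurate even though some raw outputs fall outside [0, 1], which is the main reason a sigmoid output is preferred when probabilities are needed.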

# Logistic Regression Model - Breast Cancer

In [None]:
# Create the logistic regression model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile and fit the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=50)

# Plot the loss over time
plt.plot(history.history['loss'])
plt.title('Logistic Regression Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)

# Main Takeaways

Advantages of Linear Regression: 
- simpler and easier to understand 
- can be used to predict continuous values 
- can be used to identify relationships between variables 

Disadvantages of Linear Regression: 
- sensitive to outliers 
- only applicable to linear relationships 

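Advantages of Logistic Regression:
- less affected by extreme target values, because its output is bounded
- can be used to predict discrete class labels
- passes a linear combination of the features through a sigmoid, so its output is a probability between 0 and 1 (the decision boundary itself is still linear)

Disadvantages of Logistic Regression:
- harder to interpret, since the coefficients act on the log-odds rather than on the prediction directly
- not suitable for continuous targets such as house prices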
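
The difference between the two output layers can be sketched in a few lines of plain Python. The weights, bias, and feature values below are hypothetical and chosen only to show that a linear output is unbounded while a sigmoid output always lies between 0 and 1.

```python
import math


def linear_score(weights, bias, features):
    """Output of a linear layer: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def sigmoid(z):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters and samples (illustrative values only)
weights = [0.8, -1.5]
bias = 0.1
samples = [[5.0, 0.2], [0.1, 3.0], [1.0, 0.6]]

scores = [linear_score(weights, bias, s) for s in samples]
probabilities = [sigmoid(z) for z in scores]

for s, z, p in zip(samples, scores, probabilities):
    print('features=%s linear=%.2f sigmoid=%.3f' % (s, z, p))
```

The linear scores are the natural output for a continuous target, while the sigmoid values are the natural output for a binary label.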

# Disadvantages of Both Models

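The California Housing dataset has a continuous target and features that span a wide range of values. Accuracy counts exact label matches, so it stays close to zero for a continuous target no matter how good the predictions are; a regression metric such as mean absolute error is more informative. Logistic regression is also a poor match for this dataset, because a sigmoid output and binary cross-entropy assume labels of 0 or 1. Beyond the choice of metric and loss, simple models like these may still struggle to predict the target values accurately.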