# Supervised Learning Section

1. [Introduction](#introduction)
2. [Data Setup](#data-setup)
3. [Principal Component Analysis (PCA) Review](#principal-component-analysis-pca-review)
    - [Significant Features](#significant-features)
4. [Supervised Learning Model Implementation](#supervised-learning-model-implementation)
    - [Data Splitting](#data-splitting)
    - [Lasso Regression](#lasso-regression)
        - [Lasso Training](#lasso-training)
        - [Lasso Evaluation](#lasso-evaluation)
    - [Linear Regression](#linear-regression)
        - [Linear Regression Training](#linear-regression-training)
        - [Linear Regression Evaluation](#linear-regression-evaluation)
    - [Decision Tree](#decision-tree)
        - [Tree Training](#tree-training)
        - [Tree Evaluation](#tree-evaluation)
    - [Support Vector Machine](#support-vector-machine)
        - [SVM Training](#svm-training)
        - [SVM Accuracy](#svm-accuracy)
5. [Comparison of Model Results](#comparison-of-model-results)
    - [R squared analysis](#r-squared-analysis)
    - [MSE analysis](#mse-analysis)
6. [Policy Recommendation](#policy-recommendation)
    - [Interpretation of Findings](#interpretation-of-findings)
    - [Our Policy Recommendations](#our-policy-recommendations)
7. [Conclusion](#conclusion)


## Introduction
Now that we have performed PCA and clustering to determine key features in the dataset, we would like to support these findings with supervised learning. Our goal in this section is to train easily interpretable supervised learning models to predict digital equity statistics in areas of Michigan. We aim to provide insight into which factors are the most significant in determining digital equity. The relative importance of each feature can be gleaned from the weight it receives during training. With these key features in mind, we will recommend policy decisions that use this insight to better allocate public funds. Since we are using both categorical and quantitative variables to predict our quantitative equity metric, we compare the efficacy of lasso regression against ordinary linear regression, a decision tree, and a support vector machine.

## Data setup
Load the combined results (upload speed, download speed, and related measures) from the `results` directory into a DataFrame named `total_df`.

In [4]:
!pip install scikit-learn
!pip install pandas
import sklearn
import pandas as pd
import numpy as np

from sklearn import svm
from sklearn import metrics

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score





In [5]:
total_df = pd.read_csv('../../DATA/results/total.csv')
total_df.drop(['Unnamed: 0'], axis=1, inplace=True)
total_df


In [None]:
X = total_df.drop(['TWNRNGSEC', 'cluster_labels'], axis=1)
y = total_df['cluster_labels']

## Principal Component Analysis (PCA) Review
The features below emerged as the most significant in our PCA. In the sections that follow, we check whether supervised learning yields the same results.

### Significant features

* bslcount
* usufcount
* percent_usuf
* avg_d_mbps
* avg_u_mbps
* avg_lat_ms
* cluster_labels
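
As a reminder of how PCA points at significant features, the sketch below reads the component loadings for a small synthetic table. The data are hypothetical stand-ins that only borrow three column names from the list above; they are not the project's results, and the two-component choice is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: download and upload speed share a common signal,
# latency is independent noise
rng = np.random.default_rng(7)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "avg_d_mbps": base + rng.normal(scale=0.1, size=300),
    "avg_u_mbps": base + rng.normal(scale=0.1, size=300),
    "avg_lat_ms": rng.normal(size=300),
})

# Standardize first so no feature dominates purely because of its units
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(demo))

# Rows are features, columns are components; large absolute loadings
# mark the features that drive a component
loadings = pd.DataFrame(pca.components_.T, index=demo.columns, columns=["PC1", "PC2"])
print(loadings)
```

In this toy setup the two speed columns should carry most of the first component, which is the kind of pattern we look for when naming significant features.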

## Supervised Learning Model Implementation
### Data splitting

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
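
Because the target here is a cluster label, some labels may be rare, and a purely random split can leave them out of the test set. The sketch below shows one possible variant using the `stratify` argument of `train_test_split`. The arrays are hypothetical stand-ins, not our dataset, and whether stratification is needed depends on how balanced the real labels are.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, 4 imbalanced labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.repeat([0, 1, 2, 3], [50, 30, 15, 5])

# stratify=y_demo keeps each label's share roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Per-label counts in the training and test splits
print(np.bincount(y_tr), np.bincount(y_te))
```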

## Lasso Regression

### Lasso training 

In [None]:
from sklearn import linear_model

# Lasso regression model

lasso = linear_model.Lasso(alpha=0.1)

# Train the model using the training sets
lasso.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = lasso.predict(X_test)
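
The lasso penalty acts on coefficient size, so features measured on very different scales are penalized unevenly, and `alpha=0.1` above is a fixed choice. As a hedged alternative, the sketch below standardizes the features and lets `LassoCV` choose `alpha` by cross-validation. The data, the feature scales, and the five-fold setting are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: four features on very different scales,
# only the first two actually drive the target
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = 2.0 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + rng.normal(scale=0.1, size=n)

# Scale first, then pick alpha by 5-fold cross-validation
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X_demo, y_demo)

lasso_cv = model.named_steps["lassocv"]
print("chosen alpha:", lasso_cv.alpha_)
print("coefficients (standardized units):", lasso_cv.coef_)
```

With standardized inputs, the coefficient magnitudes become comparable across features, which is what we rely on when reading them as importance.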



### Lasso evaluation

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Coefficients learned by the lasso model; zeros mark features it dropped
print('Coefficients: \n', lasso.coef_)

# Mean Squared Error
mse = metrics.mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Root Mean Squared Error
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")

# Mean Absolute Error
mae = metrics.mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

# R² Score
r2 = metrics.r2_score(y_test, y_pred)
print(f"R² Score: {r2}")
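
A raw coefficient array is hard to read without feature names. The sketch below pairs lasso coefficients with column names and ranks them by absolute size. The DataFrame and target are synthetic and hypothetical; only the column names are borrowed from our feature list.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Hypothetical data: the target depends on download speed and latency only
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "avg_d_mbps": rng.normal(size=150),
    "avg_u_mbps": rng.normal(size=150),
    "avg_lat_ms": rng.normal(size=150),
})
target = 3.0 * demo["avg_d_mbps"] - 1.0 * demo["avg_lat_ms"] + rng.normal(scale=0.1, size=150)

demo_lasso = Lasso(alpha=0.1).fit(demo, target)

# Label each coefficient with its feature, then order by absolute magnitude
coef_table = pd.Series(demo_lasso.coef_, index=demo.columns)
ranked = coef_table.reindex(coef_table.abs().sort_values(ascending=False).index)
print(ranked)
```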

## Linear Regression

### Linear Regression Training

In [None]:
from sklearn.linear_model import LinearRegression

# Create linear regression object
regr = LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)



### Linear Regression Evaluation

In [None]:
print('Coefficients: \n', regr.coef_)

# Mean Squared Error
mse = metrics.mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Root Mean Squared Error
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")

# Mean Absolute Error
mae = metrics.mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

# R² Score
r2 = metrics.r2_score(y_test, y_pred)
print(f"R² Score: {r2}")

## Decision Tree

### Tree training

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor(random_state=42)

tree_model.fit(X_train, y_train)

y_pred = tree_model.predict(X_test)
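
An unconstrained tree can grow until it memorizes the training set, which makes its rules hard to interpret. The sketch below caps the depth and reads `feature_importances_` as a rough importance ranking. The data are hypothetical, and `max_depth=3` is an assumed setting, not a tuned one.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: the target is decided by a download-speed threshold
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "avg_d_mbps": rng.uniform(0, 100, size=200),
    "avg_lat_ms": rng.uniform(5, 80, size=200),
    "bslcount": rng.integers(1, 50, size=200),
})
target = (demo["avg_d_mbps"] > 25).astype(float)

# A shallow tree keeps the rule set small enough to read
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(demo, target)

# Importances sum to 1; larger values mean the feature reduced more error
importances = pd.Series(shallow_tree.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))
```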

### Tree evaluation

In [None]:
from sklearn.tree import export_text

# Print the learned decision rules in a readable, indented form
tree_rules = export_text(tree_model, feature_names=list(X.columns))
print(tree_rules)

# Mean Squared Error
mse = metrics.mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Root Mean Squared Error
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")

# Mean Absolute Error
mae = metrics.mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

# R² Score
r2 = metrics.r2_score(y_test, y_pred)
print(f"R² Score: {r2}")
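# classification_report expects discrete class labels, so round the regressor's output first
print(classification_report(y_test, np.rint(y_pred).astype(int)))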

## Support Vector Machine

### SVM Training

In [None]:
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [None]:
# Create an SVM classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

# Train the model using the training sets
clf.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = clf.predict(X_test)
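
SVMs are sensitive to feature scale, and a linear kernel on unscaled data can also train slowly. The sketch below wraps the classifier in a pipeline with `StandardScaler`. The two features and the label rule are hypothetical, chosen only to show the mechanics.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: two features on very different scales;
# the label depends on the small-scale one
rng = np.random.default_rng(3)
speed = rng.uniform(0, 100, size=200)
latency = rng.uniform(0, 1, size=200)
X_demo = np.column_stack([speed, latency])
y_demo = (latency > 0.5).astype(int)

# The scaler is fit on the training rows only, then reused at prediction time
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
svm_pipeline.fit(X_demo[:150], y_demo[:150])

held_out_accuracy = svm_pipeline.score(X_demo[150:], y_demo[150:])
print("held-out accuracy:", held_out_accuracy)
```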

### SVM Accuracy

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))

# Mean Squared Error
mse = metrics.mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Root Mean Squared Error
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")

# Mean Absolute Error
mae = metrics.mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

# R² Score
r2 = metrics.r2_score(y_test, y_pred)
print(f"R² Score: {r2}")

# Model Evaluation: precision, recall, f1-score, support
print(classification_report(y_test, y_pred))

## Comparison of Model Results

We now compare the evaluation statistics from the previous section and check whether the key features (those with the greatest weights) match our unsupervised learning results.

### R squared analysis

From this we can see that the SVM outperforms the tree regressor. Its R² value is higher, so we can deduce that the SVM model is a better fit for our data.

### MSE Analysis

The MSE is lower in our SVM model so we can deduce that overall, the SVM model is a better fit.
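
Comparisons like the two above are easier to check when every model's metrics sit in one table. The sketch below fits several regressors on the same split and collects MSE and R² side by side. The data are synthetic and the model list is illustrative; the numbers it prints say nothing about our actual results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data with a linear signal plus noise
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.2, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "lasso": Lasso(alpha=0.1),
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
}

# Fit each model on the same training split and score it on the same test split
rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({"model": name, "mse": mean_squared_error(y_te, pred), "r2": r2_score(y_te, pred)})

comparison = pd.DataFrame(rows).set_index("model").sort_values("mse")
print(comparison)
```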

## Policy Recommendation
Using the weights of the features of the SVM model, we can deduce which features are more significant in predicting digital equity.

In [None]:
weights = clf.coef_
print(weights)
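
With more than two classes, a linear `SVC` stores one row of weights per pair of classes, so the printed array needs feature names and some aggregation before it can be read as importance. The sketch below averages absolute weights per feature. The data are hypothetical, and averaging is one assumed way to summarize the rows, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: a three-class label driven by download speed only
rng = np.random.default_rng(5)
demo = pd.DataFrame({
    "avg_d_mbps": rng.normal(size=300),
    "avg_u_mbps": rng.normal(size=300),
    "avg_lat_ms": rng.normal(size=300),
})
labels = np.digitize(demo["avg_d_mbps"].to_numpy(), bins=[-0.5, 0.5])

# Scale so that weights are comparable across features
X_scaled = StandardScaler().fit_transform(demo)
demo_clf = SVC(kernel="linear").fit(X_scaled, labels)

# coef_ has one row per class pair (3 classes give 3 rows) and one column per feature
weight_table = pd.DataFrame(demo_clf.coef_, columns=demo.columns)
mean_abs_weight = weight_table.abs().mean().sort_values(ascending=False)
print(weight_table.shape)
print(mean_abs_weight)
```

Note that `coef_` is only available when the kernel is linear, as it is in our SVM cell.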

### Interpretation of Findings

We find that download speed correlates strongly with all of the other metrics, so we can deduce that download speed is the main indicator of digital equity, a conclusion that our unsupervised learning models also support.
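
A correlation claim of this kind can be checked directly with `DataFrame.corr`. The sketch below computes each metric's correlation with download speed on synthetic data. The relationships are built in by assumption, so the output illustrates the method only, not our findings.

```python
import numpy as np
import pandas as pd

# Hypothetical data: upload speed rises and latency falls with download speed
rng = np.random.default_rng(6)
download = rng.uniform(5, 200, size=250)
demo = pd.DataFrame({
    "avg_d_mbps": download,
    "avg_u_mbps": 0.2 * download + rng.normal(scale=3, size=250),
    "avg_lat_ms": 80 - 0.3 * download + rng.normal(scale=5, size=250),
})

# Pearson correlation of every other column with download speed
corr_with_download = demo.corr()["avg_d_mbps"].drop("avg_d_mbps")
print(corr_with_download)
```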

### Our Policy Recommendations

Thus, we recommend that the FCC use download speed as the primary indicator of a lack of high-speed internet access, since all the other features we examined correlate highly with download speed.