In [1]:
#!pip install scikit-learn
#!pip install pandas
import sklearn
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn import svm


from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score


# Supervised Learning Section

1. [Introduction](#introduction)
2. [Data Setup](#data-setup)
3. [Principal Component Analysis (PCA) Review](#principal-component-analysis-pca-review)
    - [Significant Features](#significant-features)
4. [Supervised Learning Model Implementation](#supervised-learning-model-implementation)
    - [Data Splitting](#data-splitting)
    - [Lasso Regression](#lasso-regression)
        - [Lasso Training](#lasso-training)
        - [Lasso Evaluation](#lasso-evaluation)
    - [Linear Regression](#linear-regression)
        - [Linear Regression Training](#linear-regression-training)
        - [Linear Regression Evaluation](#linear-regression-evaluation)
    - [Decision Tree](#decision-tree)
        - [Tree Training](#tree-training)
        - [Tree Evaluation](#tree-evaluation)
    - [Support Vector Machine](#support-vector-machine)
        - [SVM Training](#svm-training)
        - [SVM Accuracy](#svm-accuracy)
5. [Comparison of Model Results](#comparison-of-model-results)
    - [R squared analysis](#r-squared-analysis)
    - [MSE analysis](#mse-analysis)
6. [Policy Recommendation](#policy-recommendation)
    - [Interpretation of Findings](#interpretation-of-findings)
    - [Our Policy Recommendations](#our-policy-recommendations)
7. [Conclusion](#conclusion)


## Introduction
Now that we have performed PCA and clustering to determine key features in the dataset, we would like to support these findings with supervised learning. Our goal in this section is to train easily interpretable supervised learning models to predict digital equity statistics in areas of Michigan. We aim to provide insight into which factors are the most significant in determining digital equity. The relative importance of each feature can be gleaned from the weight it receives when these models are trained. With these key features in mind, we will recommend policy decisions that could use this insight to better allocate public funds. Since we are using both categorical and quantitative variables to predict our equity metric, we plan to compare the efficacy of a lasso regression approach with that of ordinary linear regression, a decision tree, and a support vector machine (SVM).

## Data setup
Load the upload and download speed results from `DATA/results/total.csv` and save them as the DataFrame `total_df`.

In [2]:
total_df = pd.read_csv('../../DATA/results/total.csv')
total_df.drop(['Unnamed: 0'], axis=1, inplace=True)
total_df.set_index('TWNRNGSEC', inplace=True)
total_df.columns

Index(['bslcount', 'usufcount', 'percent_usuf', 'avg_d_mbps', 'avg_u_mbps',
       'avg_lat_ms', 'cluster_labels'],
      dtype='object')

In [3]:
X = total_df.drop(['cluster_labels'], axis=1)
y = total_df['cluster_labels']

## Principal Component Analysis (PCA) Review
Recall which groups our PCA highlighted and which features emerged as the most significant. We will see whether our supervised learning yields the same results.

### Significant features

* bslcount
* usufcount
* percent_usuf
* avg_d_mbps
* avg_u_mbps
* avg_lat_ms
* cluster_labels

## Supervised Learning Model Implementation
### Data splitting

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
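
The split above is purely random. Because the cluster labels are imbalanced (the test-set supports reported later are 50, 62, and 13), one option worth considering is a stratified split, which keeps each cluster's share equal in the training and test sets. The sketch below uses synthetic stand-in data, not the project data; `X_demo` and `y_demo` are hypothetical names introduced only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and cluster labels (not the project data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))
y_demo = np.repeat([0, 1, 2], [80, 100, 20])  # deliberately imbalanced classes

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train class shares:", train_share.round(2))
print("test class shares: ", test_share.round(2))
```

Applied to this notebook, the only change would be passing `stratify=y` to the existing `train_test_split` call.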

## Lasso Regression

### Lasso training 

In [5]:


# Train Lasso regression model

lasso = linear_model.Lasso(alpha=0.1)

# Train the model using the training sets
lasso.fit(X_train, y_train)

# Make predictions using the testing set
lasso_y_pred = lasso.predict(X_test)



### Lasso evaluation

In [6]:

# lasso_mse = mean_squared_error(y_test, lasso_y_pred)
# lasso_rmse = mean_squared_error(y_test, lasso_y_pred, squared=False)

print('Lasso Coefficients: \n', lasso.coef_)

# Mean Squared Error
lasso_mse = round(metrics.mean_squared_error(y_test, lasso_y_pred),3)
print(f"Lasso Mean Squared Error (MSE): {lasso_mse}")

# Root Mean Squared Error
lasso_rmse = round(np.sqrt(lasso_mse),3)
print(f"Lasso Root Mean Squared Error (RMSE): {lasso_rmse}")

# Mean Absolute Error
lasso_mae = round(metrics.mean_absolute_error(y_test, lasso_y_pred),3)
print(f"Lasso Mean Absolute Error (MAE): {lasso_mae}")

# R² Score
lasso_r2 = round(metrics.r2_score(y_test, lasso_y_pred),3)
print(f"Lasso R² Score: {lasso_r2}")

Lasso Coefficients: 
 [ 0.00044387 -0.00017024 -0.          0.00542804  0.0249065   0.00297853]
Lasso Mean Squared Error (MSE): 0.135
Lasso Root Mean Squared Error (RMSE): 0.367
Lasso Mean Absolute Error (MAE): 0.294
Lasso R² Score: 0.676
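
The `-0.` entry in the coefficient vector shows the lasso penalty removing `percent_usuf` from the model entirely. How many features survive depends on `alpha`. The following is a minimal sketch of that effect on synthetic data generated with `make_regression`, not on our dataset; the list of `alpha` values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 6 features, only 3 of which actually influence the target
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=5.0, random_state=42
)

# Count how many coefficients remain non-zero at each penalty strength
nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha:>6}: {nonzero_counts[alpha]} non-zero coefficients")
```

A stronger penalty generally zeroes more coefficients, which is why the features lasso keeps are often read as the more informative ones.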


## Linear Regression

### Linear Regression Training

In [7]:


# Create linear regression object
linear_regr = LinearRegression()

# Train the model using the training sets
linear_regr.fit(X_train, y_train)

# Make predictions using the testing set
linear_regr_y_pred = linear_regr.predict(X_test)



### Linear Regression Evaluation

In [8]:
print('Coefficients: \n', linear_regr.coef_)

# Mean Squared Error
linear_regr_mse = round(metrics.mean_squared_error(y_test, linear_regr_y_pred),3)
print(f"Linear Regression Mean Squared Error (MSE): {linear_regr_mse}")

# Root Mean Squared Error
linear_regr_rmse = round(np.sqrt(linear_regr_mse),3)
print(f"Linear Regression Root Mean Squared Error (RMSE): {linear_regr_rmse}")

# Mean Absolute Error
linear_regr_mae = round(metrics.mean_absolute_error(y_test, linear_regr_y_pred),3)
print(f"Linear Regression Mean Absolute Error (MAE): {linear_regr_mae}")

# R² Score
linear_regr_r2 = round(metrics.r2_score(y_test, linear_regr_y_pred),3)
print(f"Linear Regression R² Score: {linear_regr_r2}")

Coefficients: 
 [ 0.00045089 -0.00023275 -0.02205739  0.00519972  0.02892578  0.00299575]
Linear Regression Mean Squared Error (MSE): 0.134
Linear Regression Root Mean Squared Error (RMSE): 0.366
Linear Regression Mean Absolute Error (MAE): 0.292
Linear Regression R² Score: 0.678


## Decision Tree

### Tree training

In [9]:
tree_model = DecisionTreeRegressor(random_state=42)

tree_model.fit(X_train, y_train)

tree_model_y_pred = tree_model.predict(X_test)

### Tree evaluation

In [10]:
from sklearn.tree import export_text
# mse = mean_squared_error(y_test, y_pred)
# rmse = mean_squared_error(y_test, y_pred, squared=False)
tree_rules = export_text(tree_model, feature_names=list(X.columns))
print(tree_rules)
# Mean Squared Error
tree_model_mse = round(metrics.mean_squared_error(y_test, tree_model_y_pred),3)
print(f"Decision Tree Mean Squared Error (MSE): {tree_model_mse}")

# Root Mean Squared Error
tree_model_rmse = round(np.sqrt(tree_model_mse),3)
print(f"Decision Tree Root Mean Squared Error (RMSE): {tree_model_rmse}")

# Mean Absolute Error
tree_model_mae = round(metrics.mean_absolute_error(y_test, tree_model_y_pred),3)
print(f"Decision Tree Mean Absolute Error (MAE): {tree_model_mae}")

# R² Score
tree_model_r2 = round(metrics.r2_score(y_test, tree_model_y_pred),3)
print(f"Decision Tree R² Score: {tree_model_r2}")
print(classification_report(y_test, tree_model_y_pred))

|--- avg_lat_ms <= 355.00
|   |--- avg_u_mbps <= 7.44
|   |   |--- avg_d_mbps <= 74.82
|   |   |   |--- value: [0.00]
|   |   |--- avg_d_mbps >  74.82
|   |   |   |--- value: [1.00]
|   |--- avg_u_mbps >  7.44
|   |   |--- avg_d_mbps <= 54.40
|   |   |   |--- avg_u_mbps <= 12.52
|   |   |   |   |--- avg_lat_ms <= 32.20
|   |   |   |   |   |--- value: [1.00]
|   |   |   |   |--- avg_lat_ms >  32.20
|   |   |   |   |   |--- avg_u_mbps <= 9.84
|   |   |   |   |   |   |--- value: [0.00]
|   |   |   |   |   |--- avg_u_mbps >  9.84
|   |   |   |   |   |   |--- bslcount <= 22.50
|   |   |   |   |   |   |   |--- value: [0.00]
|   |   |   |   |   |   |--- bslcount >  22.50
|   |   |   |   |   |   |   |--- value: [1.00]
|   |   |   |--- avg_u_mbps >  12.52
|   |   |   |   |--- value: [1.00]
|   |   |--- avg_d_mbps >  54.40
|   |   |   |--- avg_lat_ms <= 308.17
|   |   |   |   |--- avg_u_mbps <= 8.07
|   |   |   |   |   |--- avg_u_mbps <= 7.96
|   |   |   |   |   |   |--- value: [1.00]
|   |   | 
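
The printed rules show which features the tree splits on, but a long rule listing is hard to summarize. A fitted scikit-learn tree also exposes `feature_importances_`, which gives each feature's share of the total impurity reduction. The sketch below is illustrative only: the data are synthetic, and the rule that the label depends solely on download speed is a hypothetical assumption made so the result is easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data that reuses three of our column names
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "avg_d_mbps": rng.uniform(0, 200, 500),
    "avg_u_mbps": rng.uniform(0, 50, 500),
    "avg_lat_ms": rng.uniform(5, 400, 500),
})

# Hypothetical rule: the label depends only on download speed
labels = (demo["avg_d_mbps"] > 75).astype(int)

tree = DecisionTreeRegressor(random_state=42).fit(demo, labels)

# Importances are normalized so that they sum to 1
importances = pd.Series(tree.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))
```

On the real model, the equivalent call would be `pd.Series(tree_model.feature_importances_, index=X.columns)`.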

## Support Vector Machine

### SVM Training

In [11]:
# Create a SVM classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

# Train the model using the training sets
clf.fit(X_train, y_train)

# Predict the response for test dataset
svm_y_pred = clf.predict(X_test)

### SVM Accuracy

In [12]:
print("SVM Accuracy:", accuracy_score(y_test, svm_y_pred))

# Mean Squared Error
svm_mse = round(metrics.mean_squared_error(y_test, svm_y_pred),3)
print(f"SVM Mean Squared Error (MSE): {svm_mse}")

# Root Mean Squared Error
svm_rmse = round(np.sqrt(svm_mse),3)
print(f"SVM Root Mean Squared Error (RMSE): {svm_rmse}")

# Mean Absolute Error
svm_mae = round(metrics.mean_absolute_error(y_test, svm_y_pred),3)
print(f"SVM Mean Absolute Error (MAE): {svm_mae}")

# R² Score
svm_r2 = round(metrics.r2_score(y_test, svm_y_pred),3)
print(f"SVM R² Score: {svm_r2}")

# Model Evaluation: precision, recall, f1-score, support
print(classification_report(y_test, svm_y_pred))

SVM Accuracy: 1.0
SVM Mean Squared Error (MSE): 0.0
SVM Root Mean Squared Error (RMSE): 0.0
SVM Mean Absolute Error (MAE): 0.0
SVM R² Score: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        62
           2       1.00      1.00      1.00        13

    accuracy                           1.00       125
   macro avg       1.00      1.00      1.00       125
weighted avg       1.00      1.00      1.00       125
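
A perfect score on a single train/test split is worth confirming on more than one partition of the data. One common way to do this is stratified k-fold cross-validation. The sketch below assumes synthetic, well-separated blobs in place of our features; the choice of five folds and the names `X_demo` and `y_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic three-class data standing in for the six features and cluster labels
X_demo, y_demo = make_blobs(n_samples=300, centers=3, n_features=6, random_state=42)

# Five stratified folds: each fold keeps the class proportions of the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(SVC(kernel="linear"), X_demo, y_demo, cv=cv, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(float(scores.mean()), 3))
```

If the cluster labels were themselves derived from these same features, near-perfect separability would not be surprising, and consistent fold scores would support that reading.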



## Comparison of Model Results

Compare the statistics we drew from the last section to see if the key features (those with the greatest weights) matches up with our unsupervised learning results

### R squared analysis

In [13]:
# Lasso Regression R² Score
print(f"Lasso R² Score: {lasso_r2}")

# Linear Regression R² Score
print(f"Linear Regression R² Score: {linear_regr_r2}")

# Decision Tree R² Score
print(f"Decision Tree R² Score: {tree_model_r2}")

# SVM R² Score
print("SVM R² Score: **{:.3f}**".format(svm_r2))


Lasso R² Score: 0.676
Linear Regression R² Score: 0.678
Decision Tree R² Score: 0.904
SVM R² Score: **1.000**


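From this we can see that the SVM outperforms the decision tree regressor, which in turn outperforms both linear models. The SVM's R² value is the highest (1.000, versus 0.904 for the tree and about 0.68 for the two regressions), so we can deduce that the SVM model is the best fit for our data.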
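
For reference, R² compares a model's squared error with that of always predicting the mean: R² = 1 − SS_res / SS_tot. The short sketch below computes it by hand on made-up numbers (not our results) and checks the value against scikit-learn's `r2_score`; it also shows how MSE is derived from the same residual sum.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up labels and predictions, for illustration only
y_true = np.array([0, 1, 1, 2, 0, 1])
y_pred = np.array([0.2, 0.9, 1.3, 1.6, 0.1, 0.8])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2_manual = 1 - ss_res / ss_tot
mse_manual = ss_res / len(y_true)

print("manual R²:", round(float(r2_manual), 3), "sklearn R²:", round(r2_score(y_true, y_pred), 3))
print("manual MSE:", round(float(mse_manual), 3), "sklearn MSE:", round(mean_squared_error(y_true, y_pred), 3))
```

Because both metrics come from the same residuals, a model that ranks best on R² for a given test set also ranks best on MSE.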

### MSE Analysis

In [14]:
# Lasso Mean Squared Error
print(f"Lasso Mean Squared Error (MSE): {lasso_mse}")

# Linear Regression Mean Squared Error
print(f"Linear Regression Mean Squared Error (MSE): {linear_regr_mse}")

# Decision Tree Mean Squared Error
print(f"Decision Tree Mean Squared Error (MSE): {tree_model_mse}")

# SVM Mean Squared Error
print("SVM Mean Squared Error (MSE): **{:.3f}**".format(svm_mse))


Lasso Mean Squared Error (MSE): 0.135
Linear Regression Mean Squared Error (MSE): 0.134
Decision Tree Mean Squared Error (MSE): 0.04
SVM Mean Squared Error (MSE): **0.000**


The **MSE is lowest in our SVM model** (0.000, compared with 0.04 for the decision tree and about 0.13 for the two regressions), so we can deduce that, overall, the SVM model is the best fit.

## Policy Recommendation
Using the weights of the features of the SVM model, we can deduce which features are more significant in predicting digital equity.

In [16]:
feature_names = clf.feature_names_in_

cluster_names = ['Good Network', 'Excellent Network', 'Poor or no Network']

for index, value in enumerate(feature_names):
    print("Feature at Index", index, "Name:", value)
print()

# For each row of coef_, print the features with the highest and lowest absolute weight
for i in range(clf.coef_.shape[0]):
    weights = clf.coef_[i]
    index_of_highest_weight = np.argmax(np.abs(weights))
    index_of_lowest_weight = np.argmin(np.abs(weights))
    print("Cluster:", i, cluster_names[i])
    print("Feature with highest absolute weight:", feature_names[index_of_highest_weight])
    print("Highest absolute weight:", weights[index_of_highest_weight])
    print("Feature with lowest absolute weight:", feature_names[index_of_lowest_weight])
    print("Lowest absolute weight:", weights[index_of_lowest_weight])
    print()

Feature at Index 0 Name: bslcount
Feature at Index 1 Name: usufcount
Feature at Index 2 Name: percent_usuf
Feature at Index 3 Name: avg_d_mbps
Feature at Index 4 Name: avg_u_mbps
Feature at Index 5 Name: avg_lat_ms

Cluster: 0 Good Network
Feature with highest absolute weight: avg_u_mbps
Highest absolute weight: -1.768716954472417
Feature with lowest absolute weight: usufcount
Lowest absolute weight: 0.0011145075586185271

Cluster: 1 Excellent Network
Feature with highest absolute weight: avg_lat_ms
Highest absolute weight: -0.09679614294400896
Feature with lowest absolute weight: percent_usuf
Lowest absolute weight: 0.0039606674403634065

Cluster: 2 Poor or no Network
Feature with highest absolute weight: avg_d_mbps
Highest absolute weight: 0.09540398255401014
Feature with lowest absolute weight: percent_usuf
Lowest absolute weight: 0.0002972824275481404



### Interpretation of Findings

We find that download speed correlates highly with all of the other metrics, so we can deduce that download speed is the main indicator of digital equity, a conclusion that our unsupervised learning models also support.
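
A correlation claim of this kind can be checked directly with `DataFrame.corr()`. The sketch below is a minimal illustration on synthetic data in which upload speed and latency are constructed to track download speed; the coefficients and noise levels are assumptions chosen for the example, not measurements from our dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: upload speed rises and latency falls with download speed
rng = np.random.default_rng(1)
d = rng.uniform(5, 200, 400)
demo = pd.DataFrame({
    "avg_d_mbps": d,
    "avg_u_mbps": 0.2 * d + rng.normal(0, 3, 400),
    "avg_lat_ms": 300 - d + rng.normal(0, 20, 400),
})

# Pearson correlation of every other column with download speed
corr_with_download = demo.corr()["avg_d_mbps"].drop("avg_d_mbps")
print(corr_with_download.sort_values(key=np.abs, ascending=False))
```

On the project data, the same check would be `X.corr()["avg_d_mbps"]`.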

Reading the `coef_` attribute of the SVM model gives us the coefficient of each feature in the linear equation that defines the decision boundary between the classes. The higher the absolute value of the weight, the more important the feature is for the classification. For example, in this case, the feature with the highest weight for cluster 2 is **avg_d_mbps**, which means that the average download speed is the most influential factor in determining whether the cluster label for a section should be 2. The feature with the lowest weight for cluster 2 is **percent_usuf**, which means that the percentage of unserved and/or unfunded households in a section is the least relevant factor for cluster 2.
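
Two caveats apply when reading these weights. First, scikit-learn's `SVC` handles multiclass problems one-vs-one, so each row of `coef_` belongs to a pair of classes (ordered 0 vs 1, 0 vs 2, 1 vs 2, and so on) rather than to a single class. With three classes there happen to be three pairs, which makes this easy to miss. Second, weight magnitudes are only comparable across features measured on the same scale. The sketch below illustrates both points on synthetic data with four classes, an assumption chosen so that the number of rows visibly differs from the number of classes.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-class data with six features (not the project data)
X_demo, y_demo = make_blobs(n_samples=400, centers=4, n_features=6, random_state=42)

# Standardize first so that weight magnitudes are comparable across features
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_demo, y_demo)
svc = model.named_steps["svc"]

# One row of coef_ per class pair: 4 classes give 4 * 3 / 2 = 6 rows
pairs = list(combinations(svc.classes_, 2))
print("coef_ shape:", svc.coef_.shape)

for (a, b), row in zip(pairs, svc.coef_):
    top = int(np.argmax(np.abs(row)))
    print(f"class {a} vs class {b}: largest |weight| at feature index {top}")
```

Under this reading, the three rows printed for our model would describe the boundaries between clusters 0 and 1, 0 and 2, and 1 and 2, respectively.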



### Our Policy Recommendations

Thus, we recommend that the FCC use low download speed as the primary indicator of a lack of high-speed internet access, since all the other features we looked at correlate highly with download speed.