# Supervised Learning Section

1. [Introduction](#introduction)
2. [Data Setup](#data-setup)
3. [Principal Component Analysis (PCA) Review](#principal-component-analysis-pca-review)
    - [Significant Features](#significant-features)
4. [Supervised Learning Model Implementation](#supervised-learning-model-implementation)
    - [Data Splitting](#data-splitting)
    - [Lasso Regression](#lasso-regression)
        - [Lasso Training](#lasso-training)
        - [Lasso Evaluation](#lasso-evaluation)
    - [Decision Tree](#decision-tree)
        - [Tree Training](#tree-training)
        - [Tree Evaluation](#tree-evaluation)
    - [Support Vector Machine](#support-vector-machine)
        - [SVM Training](#svm-training)
        - [SVM Accuracy](#svm-accuracy)
5. [Comparison of Model Results](#comparison-of-model-results)
    - [R squared analysis](#r-squared-analysis)
    - [MSE analysis](#mse-analysis)
6. [Policy Recommendation](#policy-recommendation)
    - [Interpretation of Findings](#interpretation-of-findings)
    - [Our Policy Recommendations](#our-policy-recommendations)
7. [Conclusion](#conclusion)


## Introduction
Now that we have used PCA and clustering to identify key features in the dataset, we want to support those findings with supervised learning. Our goal in this section is to train easily interpretable supervised learning models that predict digital equity statistics for areas of Michigan, and to learn which factors matter most in determining digital equity. The relative importance of each feature can be read from the weight the model assigns to it during training. With these key features in mind, we will recommend policy decisions that use this insight to better allocate public funds. We use the quantitative features in the results table to predict the cluster label assigned in the previous section, and we compare the efficacy of lasso regression, a decision tree, and a support vector machine.

## Data setup
Load the upload and download speed results (and related statistics) from `DATA/results/total.csv` and save them as `total_df`.

In [39]:
import sklearn
import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score


In [40]:
total_df = pd.read_csv('../../DATA/results/total.csv')
total_df.drop(['Unnamed: 0'], axis=1, inplace=True)
total_df

Unnamed: 0,TWNRNGSEC,bslcount,usufcount,percent_usuf,avg_d_mbps,avg_u_mbps,avg_lat_ms,cluster_labels
0,47N01E07,1342,870,0.648286,120.343040,12.107076,27.006601,1
1,47N01E06,847,431,0.508855,116.595788,11.714901,26.358491,1
2,47N01E08,584,431,0.738014,121.552481,12.842681,26.214286,1
3,47N01E05,581,371,0.638554,112.081817,25.232495,23.163366,1
4,45N01W29,543,325,0.598527,133.562899,13.284333,26.753623,1
...,...,...,...,...,...,...,...,...
617,42N11W21,1,1,1.000000,17.775000,2.792000,26.000000,0
618,44N02E01,1,1,1.000000,9.796000,7.664000,62.000000,0
619,42N05E10,1,1,1.000000,48.723857,13.215571,50.571429,1
620,42N05E12,1,1,1.000000,13.754000,1.387000,475.500000,2
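
The `Unnamed: 0` column dropped above is typically what pandas produces when a CSV was saved together with its row index. A minimal sketch, assuming `total.csv` was written that way (the inline CSV below is a stand-in built from the first rows of the table, not the real file): passing `index_col=0` reads that column back as the index, so no separate drop is needed.

```python
import io
import pandas as pd

# Stand-in for total.csv, using the first three rows shown above.
# Assumption: the real file was saved with its index, giving an unnamed first column.
csv_text = """,TWNRNGSEC,bslcount,usufcount,percent_usuf,avg_d_mbps,avg_u_mbps,avg_lat_ms,cluster_labels
0,47N01E07,1342,870,0.648286,120.343040,12.107076,27.006601,1
1,47N01E06,847,431,0.508855,116.595788,11.714901,26.358491,1
2,47N01E08,584,431,0.738014,121.552481,12.842681,26.214286,1
"""

# index_col=0 treats the unnamed first column as the row index
demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(demo_df.columns.tolist())
```

In the notebook this would amount to `pd.read_csv('../../DATA/results/total.csv', index_col=0)`, if the assumption about the file holds.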


In [41]:
X = total_df.drop(['TWNRNGSEC', 'cluster_labels'], axis=1)
y = total_df['cluster_labels']

## Principal Component Analysis (PCA) Review
Summarize which groups our PCA highlighted and which features emerged as the most significant. We will then check whether supervised learning yields the same results.

### Significant features

* bslcount
* usufcount
* percent_usuf
* avg_d_mbps
* avg_u_mbps
* avg_lat_ms
* cluster_labels

## Supervised Learning Model Implementation
### Data splitting

In [42]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
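
The clusters are not equally sized (the SVM report further down lists only 13 test rows for cluster 2 out of 125), so a plain random split can shift class proportions between train and test. A hedged sketch of a stratified split follows; `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`, and whether stratification matters for this dataset is an open question rather than a finding.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with three imbalanced classes, loosely like cluster_labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 40 + [1] * 50 + [2] * 10)

# stratify=y_demo keeps each class's share the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(np.bincount(y_tr), np.bincount(y_te))
```

Applied to the notebook, the only change would be adding `stratify=y` to the existing `train_test_split` call.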

## Lasso Regression

### Lasso training 

In [76]:
from sklearn import linear_model

# Lasso regression model

lasso = linear_model.Lasso(alpha=0.1)

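# Fitting model to data
# Note: this fits on the full dataset and predicts on the same rows,
# so the metrics below are in-sample (X_train / X_test are not used here)

lasso.fit(X, y)
y_pred = lasso.predict(X)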
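
Because we want to read feature importance from the lasso weights, the scale of each column matters: the L1 penalty acts on raw coefficients, and a feature measured in large units gets a small coefficient regardless of how much it contributes. The sketch below illustrates this on hypothetical stand-in data (not our dataset) by comparing a raw fit with a `StandardScaler` plus `Lasso` pipeline. It is an illustration of the effect, not a claim about what our features would show.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: two features that contribute equally to the target
# but are measured on very different scales.
rng = np.random.default_rng(0)
small_scale = rng.normal(0.0, 1.0, size=500)
large_scale = rng.normal(0.0, 1000.0, size=500)
X_demo = np.column_stack([small_scale, large_scale])
y_demo = small_scale + 0.001 * large_scale + rng.normal(0.0, 0.1, size=500)

# Raw features: coefficient sizes mostly reflect the units of each column
raw_coefs = Lasso(alpha=0.1).fit(X_demo, y_demo).coef_

# Standardized features: coefficients are on a comparable scale
scaled_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X_demo, y_demo)
scaled_coefs = scaled_model.named_steps["lasso"].coef_

print("raw:   ", raw_coefs)
print("scaled:", scaled_coefs)
```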

### Lasso evaluation

In [77]:
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y, y_pred)
rmse = mse ** 0.5  # equivalent to squared=False, which newer scikit-learn releases no longer accept
print("Coefficients:", lasso.coef_)
r2 = r2_score(y, y_pred)
print("r2 Score:", r2)
print("Mean Squared Error:", mse)
print("Root Mean Square Error:", rmse)

Coefficients: [ 0.000528   -0.00019349 -0.          0.0054479   0.02527999  0.00302462]
r2 Score: 0.6512320615803624
Mean Squared Error: 0.1538500724828347
Root Mean Square Error: 0.39223726554578503
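
The coefficient array above is easier to read once each value is paired with its feature name. A small sketch, assuming the column order of `X` matches the table shown earlier (`total_df` without `TWNRNGSEC` and `cluster_labels`); the numbers are copied from the printed output. Since the features were not standardized before fitting, the ranking reflects units as well as importance and should be read with care.

```python
# Feature order follows X.columns: total_df without TWNRNGSEC and cluster_labels
feature_names = ["bslcount", "usufcount", "percent_usuf",
                 "avg_d_mbps", "avg_u_mbps", "avg_lat_ms"]
# Values copied from the printed lasso.coef_ above
coefficients = [0.000528, -0.00019349, -0.0, 0.0054479, 0.02527999, 0.00302462]

# Rank by absolute size; features shrunk exactly to zero were dropped by the L1 penalty
ranked = sorted(zip(feature_names, coefficients), key=lambda pair: abs(pair[1]), reverse=True)
dropped = [name for name, coef in ranked if coef == 0]

for name, coef in ranked:
    print(f"{name:>12}  {coef: .6f}")
print("dropped:", dropped)
```

In the notebook itself, the same pairing can be built with `zip(X.columns, lasso.coef_)` instead of the hard-coded lists.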


## Decision Tree

### Tree training

In [78]:
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor(random_state=42)

tree_model.fit(X_train, y_train)

y_pred = tree_model.predict(X_test)
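
Since `cluster_labels` is a categorical label, a classification tree may be a more natural fit than `DecisionTreeRegressor`. The sketch below swaps in `DecisionTreeClassifier` (a different estimator from the one used above) on hypothetical stand-in data, and reads the tree's impurity-based `feature_importances_` as a counterpart to the lasso weights.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: three classes determined only by the middle column
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 1] > 0).astype(int) + (X_demo[:, 1] > 1).astype(int)

clf_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# feature_importances_ sums to 1; larger values mean the feature
# accounted for more of the impurity reduction across the tree's splits
importances = clf_tree.feature_importances_
print(importances)
```

With our data, the equivalent call would be fit on `X_train`, `y_train`, and the importances could be paired with `X.columns` for reporting.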

### Tree evaluation

In [79]:
from sklearn.metrics import classification_report
from sklearn.tree import export_text
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # equivalent to squared=False, which newer scikit-learn releases no longer accept
tree_rules = export_text(tree_model, feature_names=list(X.columns))
print(tree_rules)
r2 = r2_score(y_test, y_pred)
print("r2 Score:", r2)
print("Mean Squared Error:", mse)
print("Root Mean Square Error:", rmse)
#print("Accuracy:", accuracy_score(y_test, y_pred))
# classification_report needs discrete labels; this works only while the regressor's leaf values are whole numbers
print(classification_report(y_test, y_pred))

|--- avg_lat_ms <= 355.00
|   |--- avg_u_mbps <= 7.44
|   |   |--- avg_d_mbps <= 74.82
|   |   |   |--- value: [0.00]
|   |   |--- avg_d_mbps >  74.82
|   |   |   |--- value: [1.00]
|   |--- avg_u_mbps >  7.44
|   |   |--- avg_d_mbps <= 54.40
|   |   |   |--- avg_u_mbps <= 12.52
|   |   |   |   |--- avg_lat_ms <= 32.20
|   |   |   |   |   |--- value: [1.00]
|   |   |   |   |--- avg_lat_ms >  32.20
|   |   |   |   |   |--- avg_u_mbps <= 9.84
|   |   |   |   |   |   |--- value: [0.00]
|   |   |   |   |   |--- avg_u_mbps >  9.84
|   |   |   |   |   |   |--- bslcount <= 22.50
|   |   |   |   |   |   |   |--- value: [0.00]
|   |   |   |   |   |   |--- bslcount >  22.50
|   |   |   |   |   |   |   |--- value: [1.00]
|   |   |   |--- avg_u_mbps >  12.52
|   |   |   |   |--- value: [1.00]
|   |   |--- avg_d_mbps >  54.40
|   |   |   |--- avg_lat_ms <= 308.17
|   |   |   |   |--- avg_u_mbps <= 8.07
|   |   |   |   |   |--- avg_u_mbps <= 7.96
|   |   |   |   |   |   |--- value: [1.00]
|   |   | 

## Support Vector Machine

### SVM Training

In [80]:
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [81]:
# Create a SVM classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

# Train the model using the training sets
clf.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = clf.predict(X_test)

### SVM Accuracy

In [82]:
print("Accuracy:", accuracy_score(y_test, y_pred))
# Model Evaluation: precision, recall, f1-score, support
print(classification_report(y_test, y_pred))

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        62
           2       1.00      1.00      1.00        13

    accuracy                           1.00       125
   macro avg       1.00      1.00      1.00       125
weighted avg       1.00      1.00      1.00       125
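
For readers less familiar with the report columns, the sketch below computes precision, recall, and support per class by hand. The helper name `per_class_precision_recall` and the label lists are made up for illustration; they are not the notebook's predictions.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall, support)} computed from two label lists."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)  # times the model said `label`
        actual = sum(1 for t in y_true if t == label)     # support: true rows with `label`
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        report[label] = (precision, recall, actual)
    return report

# Made-up labels for illustration only
y_true_demo = [0, 0, 1, 1, 1, 2]
y_pred_demo = [0, 1, 1, 1, 2, 2]
print(per_class_precision_recall(y_true_demo, y_pred_demo))
```

Precision answers "of the rows predicted as this cluster, how many were right", while recall answers "of the rows truly in this cluster, how many were found".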



## Comparison of Model Results

Compare the statistics from the previous sections to see whether the key features (those with the greatest weights) match our unsupervised learning results.
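
As a reference for the two metrics compared below, here is a minimal sketch of how MSE and R squared are computed by hand. The helper names `mse_by_hand` and `r2_by_hand` and the demo values are illustrative only; the last line checks that the lasso RMSE reported earlier is the square root of its MSE.

```python
import math

def mse_by_hand(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot: variance explained relative to always predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values for illustration only
y_true_demo = [0, 1, 2, 1]
y_pred_demo = [0.5, 1.0, 1.5, 1.0]
print(mse_by_hand(y_true_demo, y_pred_demo), r2_by_hand(y_true_demo, y_pred_demo))

# RMSE is the square root of MSE; this reproduces the lasso RMSE reported earlier
lasso_rmse_check = math.sqrt(0.1538500724828347)
print(lasso_rmse_check)
```

A lower MSE and a higher R squared both indicate a closer fit, but only R squared is unit-free, which makes it easier to compare across targets.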

### R squared analysis

### MSE Analysis

## Policy Recommendation
Using the key features we found, make an argument for how spending should be allocated to take these results into account.

### Interpretation of Findings

### Our Policy Recommendations

## Conclusion
Write a few sentences summarizing the findings, with contact info and a link to our repo.