<a href="https://colab.research.google.com/github/Lambda-School-Labs/bridges-to-prosperity-ds-e/blob/main/notebooks/katie_semi_supervized_models_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook addresses Problem 2, as described in the Contextual Summary tab of `B2P Dataset_2020.10.xlsx`:

## Problem 2: Predicting which sites will be technically rejected in future engineering reviews

> Any sites with a "Yes" in the column AQ (`Senior Engineering Review Conducted`) have undergone a full technical review, and of those, the Stage (column L) can be considered to be correct. (`Bridge Opportunity: Stage`)

> Any sites without a "Yes" in Column AQ (`Senior Engineering Review Conducted`) have not undergone a full technical review, and the Stage is based on the assessor's initial estimate as to whether the site was technically feasible or not. 

> We want to know if we can use the sites that have been reviewed to understand which of the sites that haven't yet been reviewed are likely to be rejected by the senior engineering team. 

> Any of the data can be used, but our guess is that Estimated Span, Height Differential Between Banks, Created By, and Flag for Rejection are likely to be the most reliable predictors.


### Load the data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
url = 'https://github.com/Lambda-School-Labs/bridges-to-prosperity-ds-d/blob/main/Data/B2P%20Dataset_2020.10.xlsx?raw=true'
df = pd.read_excel(url, sheet_name='Data')

### Define the target

In [2]:
# Any sites with a "Yes" in the column "Senior Engineering Review Conducted"
# have undergone a full technical review, and of those, the 
# "Bridge Opportunity: Stage" column can be considered to be correct.
positive = (
    (df['Senior Engineering Review Conducted']=='Yes') & 
    (df['Bridge Opportunity: Stage'].isin(['Complete', 'Prospecting', 'Confirmed']))
)

negative = (
    (df['Senior Engineering Review Conducted']=='Yes') & 
    (df['Bridge Opportunity: Stage'].isin(['Rejected', 'Cancelled']))
)

# Any sites without a "Yes" in column "Senior Engineering Review Conducted"
# have not undergone a full technical review ...
# So these sites are unknown and unlabeled
unknown = df['Senior Engineering Review Conducted'].isna()

# Create a new column named "Good Site." This is the target to predict.
# Assign a 1 for the positive class and 0 for the negative class.
df.loc[positive, 'Good Site'] = 1
df.loc[negative, 'Good Site'] = 0

# Assign -1 for unknown/unlabeled observations.
# Scikit-learn's documentation for "Semi-Supervised Learning" says, 
# "It is important to assign an identifier to unlabeled points ...
# The identifier that this implementation uses is the integer value -1."
# We'll explain this soon!
df.loc[unknown, 'Good Site'] = -1

df['Good Site'].value_counts(dropna=False)

-1.0    1383
 1.0      65
 0.0      24
Name: Good Site, dtype: int64
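
The -1 convention quoted in the comments above can be seen on a toy example. This is a minimal sketch, not the notebook's model: `X_toy` and `y_toy` are made-up values, and `n_neighbors=3` is an assumption chosen so that each three-point cluster is fully connected.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Hypothetical toy data: two well-separated clusters on a line.
X_toy = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# One labeled point per cluster; -1 marks the unlabeled points.
y_toy = np.array([0, -1, -1, 1, -1, -1])

model = LabelSpreading(kernel='knn', n_neighbors=3, alpha=0.8)
model.fit(X_toy, y_toy)

# transduction_ holds the label inferred for every training row,
# including the rows that were passed in as -1.
inferred = model.transduction_
print(inferred)
```

The unlabeled rows take part in fitting, but -1 is never one of the predicted classes: `model.classes_` contains only the real labels.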

### Create the test dataframe

In [3]:
no_eng = df[df['Senior Engineering Review Conducted'].isna()]
yes_eng = df[df['Senior Engineering Review Conducted'] == 'Yes']

X_train = yes_eng
y_train = yes_eng['Good Site']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, 
                                                    test_size=0.15, random_state=42, stratify=y_train)

### Change Good Site for the test set to -1.

In [5]:
# Work on an explicit copy so the assignment does not trigger pandas'
# SettingWithCopyWarning. y_test still holds the true labels for scoring.
X_test = X_test.copy()
X_test['Good Site'] = -1

X_test['Good Site']

758    -1
1105   -1
27     -1
16     -1
273    -1
282    -1
569    -1
44     -1
42     -1
25     -1
414    -1
22     -1
285    -1
28     -1
Name: Good Site, dtype: int64
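
The hold-out idea used here (split the reviewed sites, keep the true test labels aside, then mask them as -1 before recombining) can be sketched on toy data. The frame `toy` and its `site_id` column are hypothetical and only illustrate the pattern.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labeled sites: 6 positive, 4 negative.
toy = pd.DataFrame({
    'site_id': range(10),
    'Good Site': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

train, test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy['Good Site']
)

# Keep the true labels for scoring later, then hide them from the model.
y_test_toy = test['Good Site'].copy()
test = test.copy()
test['Good Site'] = -1

combined = pd.concat([train, test])
print(combined['Good Site'].value_counts())
```

After this step the model sees the test rows only as unlabeled points, while `y_test_toy` still carries the answers needed to evaluate its guesses.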

### Concat them all back together

In [6]:
df = pd.concat([no_eng, X_train, X_test])

### Look at the target's distribution

In [7]:
df['Good Site'].value_counts(normalize=True)

-1.0    0.949049
 1.0    0.037364
 0.0    0.013587
Name: Good Site, dtype: float64

In [8]:
print("% of data labeled ", 0.037364 + 0.013587)

% of data labeled  0.050951
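
The labeled fraction can also be computed directly instead of adding the printed proportions by hand. A minimal sketch on a hypothetical label column that follows the same convention as `Good Site` (1, 0, and -1 for unlabeled):

```python
import pandas as pd

# Hypothetical labels: 1 = positive, 0 = negative, -1 = unlabeled.
toy_labels = pd.Series([1, 0, -1, -1, -1, -1, -1, -1, 1, -1])

# Fraction of rows that carry a real label (anything other than -1).
labeled_fraction = (toy_labels != -1).mean()
print("% of data labeled ", labeled_fraction)
```

Applied to the notebook's data, the same expression would be `(df['Good Site'] != -1).mean()`.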


In [9]:
df = df.reset_index()

### Notes on features chosen:


Not using because a majority of their values are null:
* Bridge Opportunity: Bridge Type 
* Bridge Opportunity: Span (m) 
* Bridge Opportunity: Comments 
* Rejection Reason
* Height differential between banks
* Bridge Opportunity: General Project Photos

Not using due to leakage:
* Senior Engineering Review Conducted
* Bridge Opportunity: Stage

Not using because they are duplicates:
* Proposed Bridge Location (GPS) (Latitude) 
* Proposed Bridge Location (GPS) (Longitude)
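
One way to find majority-null columns like those listed above is to look at the share of missing values per column. This is a sketch under assumptions: the frame `toy` and its column names are made up, and the 0.5 cutoff is simply one reading of "a majority".

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one mostly-empty and one mostly-filled column.
toy = pd.DataFrame({
    'mostly_null': [np.nan, np.nan, np.nan, 4.0],
    'mostly_filled': [1.0, 2.0, np.nan, 4.0],
})

# isna() gives a boolean frame; its column means are the null fractions.
null_fraction = toy.isna().mean()

# Columns where more than half of the values are missing.
majority_null = null_fraction[null_fraction > 0.5].index.tolist()
print(majority_null)
```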

### Wrangle the data

In [24]:
import numpy as np

def wrangle(X):

  X['Bridge classification'] = X['Bridge classification'].replace({np.nan: "unknown"})

  X['Height differential between banks'] = X['Height differential between banks'].replace({np.nan: 10})

  X['Cell service quality'] = X['Cell service quality'].replace({np.nan: "unknown"})

  X['4WD Accessibility'] = X['4WD Accessibility'].replace({np.nan: "unknown"})

  X['Bridge Type'] = X['Bridge Type'].replace({np.nan: "unknown"})
  
  X['Estimated span (m)'] = X['Estimated span (m)'].replace({np.nan: X['Estimated span (m)'].median()})

  X['Days per year river is flooded'] = X['Days per year river is flooded'].replace({np.nan: X['Days per year river is flooded'].median()})

  X['River crossing deaths in last 3 years'] = X['River crossing deaths in last 3 years'].replace({np.nan: X['River crossing deaths in last 3 years'].median()})

  X['River crossing injuries in last 3 years'] = X['River crossing injuries in last 3 years'].replace({np.nan: X['River crossing injuries in last 3 years'].median()})
  
  X['Proposed Bridge Location (GPS) (Latitude)'] = X['Proposed Bridge Location (GPS) (Latitude)'].replace({np.nan: X['Proposed Bridge Location (GPS) (Latitude)'].median()})

  X['Proposed Bridge Location (GPS) (Longitude)'] = X['Proposed Bridge Location (GPS) (Longitude)'].replace({np.nan: X['Proposed Bridge Location (GPS) (Longitude)'].median()})


  crossing = []
  for i in X['Current crossing method']:
    if type(i) == float:
      crossing.append("none")
    elif 'timber' in i.lower() or 'log' in i.lower():
      crossing.append('timber')
    elif 'boat' in i.lower():
      crossing.append('boat')
    else:
      crossing.append('other')
  X['crossing'] = crossing
  
  return X
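
The crossing-method bucketing inside `wrangle` can be tried on its own. The helper `bucket_crossing` below is hypothetical (it is not part of the notebook) and mirrors the loop's rules, with one difference: it uses `pd.isna` and `str()` so that non-string values do not raise an error.

```python
import numpy as np
import pandas as pd

def bucket_crossing(value):
    """Hypothetical helper mirroring the crossing loop in wrangle()."""
    if pd.isna(value):
        return 'none'
    text = str(value).lower()
    if 'timber' in text or 'log' in text:
        return 'timber'
    if 'boat' in text:
        return 'boat'
    return 'other'

# Made-up examples of free-text crossing descriptions.
toy = pd.Series([np.nan, 'Timber bridge', 'Log crossing', 'Small boat', 'Wading'])
buckets = toy.map(bucket_crossing)
print(buckets.tolist())
```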

### Make a semi-supervised model and use it to generate labels for unknown data points

In [26]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.semi_supervised import LabelSpreading
from sklearn.model_selection import RandomizedSearchCV


target = 'Good Site'

features = ['Bridge classification', 'crossing', 'Days per year river is flooded',
            'River crossing deaths in last 3 years', 'River crossing injuries in last 3 years',
            'Cell service quality', '4WD Accessibility', 'Bridge Type',
            'Proposed Bridge Location (GPS) (Latitude)', 'Proposed Bridge Location (GPS) (Longitude)',
            'Height differential between banks', 'Estimated span (m)']


labels = df[target]
X = wrangle(df)
X = X[features]
enc = OrdinalEncoder()


enc.fit(X)

X = enc.transform(X)

pipeline = make_pipeline(
    LabelSpreading(kernel='knn')
)
param_distributions = { 
    'labelspreading__alpha': [.8],
}


search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=20, 
    cv=8, 
    scoring='accuracy', 
    verbose=1, 
    return_train_score=True
)

search.fit(X, labels);

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
  self.label_distributions_ /= normalizer

Fitting 8 folds for each of 1 candidates, totalling 8 fits

[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.3s finished


In [27]:
pipeline = search.best_estimator_
label_spread = pipeline.named_steps['labelspreading']

In [28]:
output_labels = label_spread.transduction_
df['knn'] = output_labels
df['knn'].value_counts(normalize=True)

1.0    0.683424
0.0    0.316576
Name: knn, dtype: float64

In [29]:
test_ids = list(X_test['Bridge Opportunity: CaseSafeID'])
test = df[df['Bridge Opportunity: CaseSafeID'].isin(test_ids)]

In [30]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print('Best hyperparameters', search.best_params_)
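# Scores are computed over every row, but unlabeled rows (-1) can never be
# predicted correctly, so dividing by the labeled fraction (0.050951)
# rescales the score to approximate accuracy on the labeled rows only.
print('Cross-validation Accuracy', search.best_score_/0.050951)
print("Train accuracy", pipeline.score(X, labels)/0.050951)
print("Test Accuracy", accuracy_score(y_test, test['knn']))
print(classification_report(y_test, test['knn']))


Best hyperparameters {'labelspreading__alpha': 0.8}
Cross-validation Accuracy 0.6133343800906753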
Train accuracy 0.8266680775135189
Test Accuracy 0.5714285714285714
              precision    recall  f1-score   support

         0.0       0.25      0.25      0.25         4
         1.0       0.70      0.70      0.70        10

    accuracy                           0.57        14
   macro avg       0.47      0.47      0.47        14
weighted avg       0.57      0.57      0.57        14
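
The scores above are divided by 0.050951, the labeled fraction of the data. The sketch below, on made-up arrays, shows why that rescaling recovers accuracy on the labeled rows when a score is computed over all rows. It assumes the classifier never predicts -1, which holds for `LabelSpreading` because -1 is excluded from its classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels (-1 = unlabeled) and model predictions.
y_true = np.array([1, 0, -1, -1, 1, -1, 0, -1])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0])

labeled = y_true != -1
labeled_fraction = labeled.mean()

# Accuracy over all rows: every unlabeled row counts as a miss.
acc_all = accuracy_score(y_true, y_pred)

# Accuracy restricted to rows that actually have a label.
acc_labeled = accuracy_score(y_true[labeled], y_pred[labeled])

print(acc_all / labeled_fraction, acc_labeled)
```

Masking the labeled rows directly avoids hard-coding the fraction. It is also the safer choice inside cross-validation, where each fold's labeled fraction can differ slightly from the overall one.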



### Get predicted probabilities

In [31]:
preds = pipeline.predict_proba(X)
preds

array([[6.33007331e-01, 3.66992669e-01],
       [3.16904592e-04, 9.99683095e-01],
       [2.02071278e-01, 7.97928722e-01],
       ...,
       [1.80367341e-01, 8.19632659e-01],
       [5.99984258e-02, 9.40001574e-01],
       [3.82356941e-03, 9.96176431e-01]])
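
Each row of `predict_proba` lists one probability per class, in the order given by the fitted estimator's `classes_`. A small sketch on made-up data (`X_toy`, `y_toy`, and `n_neighbors=3` are assumptions for illustration) shows how to pick the positive-class column by label rather than by position:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Hypothetical toy data: two clusters, one labeled point in each.
X_toy = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_toy = np.array([0, -1, -1, 1, -1, -1])

model = LabelSpreading(kernel='knn', n_neighbors=3).fit(X_toy, y_toy)
proba = model.predict_proba(X_toy)

# Find the column that belongs to the positive class (label 1).
positive_col = list(model.classes_).index(1)
likelihood = proba[:, positive_col]
print(likelihood)
```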

### Make categorical labels for easier visualization.

In [32]:
df['Site Suitability'] = df['Good Site']
df['Site Suitability'] = df['Site Suitability'].replace({-1:'Unknown',
                                       0:'Unsuitable',
                                       1:'Suitable'})

df['Predicted Suitability'] = df['knn']
df['Predicted Suitability'] = df['Predicted Suitability'].replace({
                                       0:'Unsuitable',
                                       1:'Suitable'})

### Download the predictions

In [33]:
# Column 1 of predict_proba is the positive class (Good Site == 1).
df['Likelihood_of_Suitability'] = preds[:, 1]

In [None]:
df[['Bridge Opportunity: Project Code', 'Likelihood_of_Suitability']].to_csv('likelihood_predictions.csv', index=False)

In [None]:
# from google.colab import files
# files.download('likelihood_predictions.csv')

### Make a probabilities dataframe for visualizations

In [34]:
probas = pd.DataFrame(preds)
probas['Suitability'] = df['Predicted Suitability']

In [39]:
import plotly.express as px
fig = px.density_contour(probas, y=1, x=0, color='Suitability', 
                         marginal_x="histogram", marginal_y="histogram")

fig.show()

### Make latitude/longitude graphs for fun.

In [36]:
graph = df[(
    (df['Proposed Bridge Location (GPS) (Latitude)'] < -1) & 
    (df['Proposed Bridge Location (GPS) (Longitude)'] > 20)
)]

# Longitude on x and latitude on y so the scatter is oriented like a map.
fig = px.scatter(graph, x="Proposed Bridge Location (GPS) (Longitude)", y="Proposed Bridge Location (GPS) (Latitude)", color='Predicted Suitability')
fig.show()

fig = px.scatter(graph, x="Proposed Bridge Location (GPS) (Longitude)", y="Proposed Bridge Location (GPS) (Latitude)", color='Site Suitability')
fig.show()

In [37]:
fig = px.scatter_mapbox(graph, lat="Proposed Bridge Location (GPS) (Latitude)", lon="Proposed Bridge Location (GPS) (Longitude)", hover_name="Predicted Suitability", hover_data=["Predicted Suitability"],
                        color="Predicted Suitability", zoom=3, height=300)
fig.update_layout(
    mapbox_style="white-bg",
    mapbox_layers=[
        {
            "below": 'traces',
            "sourcetype": "raster",
            "sourceattribution": "United States Geological Survey",
            "source": [
                "https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
            ]
        }
      ])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [38]:
fig = px.scatter_mapbox(graph, lat="Proposed Bridge Location (GPS) (Latitude)", lon="Proposed Bridge Location (GPS) (Longitude)", hover_name="Site Suitability", hover_data=["Predicted Suitability"],
                        color="Site Suitability", zoom=3, height=300)
fig.update_layout(
    mapbox_style="white-bg",
    mapbox_layers=[
        {
            "below": 'traces',
            "sourcetype": "raster",
            "sourceattribution": "United States Geological Survey",
            "source": [
                "https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
            ]
        }
      ])
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()