# Identifying Potential for Solar Investment from Machine Learning

Contributors: Buddy Bernhard, Allison Lee

### Table of Contents

1. <a href=#summary>Summary</a>
2. <a href=#data>Data Source</a>
3. <a href=#analysis>Analytical Approach</a>
4. <a href=#findings>Model Findings</a>
5. <a href=#next>Next Steps</a>

<a id=summary></a>
### Summary

Can we predict where there's potential for solar investment in the contiguous U.S.? <br>

The DeepSolar project is a deep learning framework that analyzes satellite imagery to identify the GPS locations and sizes of solar photovoltaic (PV) panels. DeepSolar was developed within Stanford's [Magic Lab](https://magiclab.stanford.edu/) and [Sustainable Systems Lab](https://magiclab.stanford.edu/). Our project uses that data to pinpoint areas where a) solar infrastructure is underdeveloped even though conditions appear favorable, and b) there are opportunity zones for investment.

<a id=data></a>
### Data Source

The DeepSolar dataset consists of 72,537 observations, each of which represents one census tract in the contiguous U.S. The dataset has 151 features, including racial makeup, electricity prices for different sectors (residential, industrial, commercial, etc.), education levels, and environmental attributes such as relative humidity, wind speed, and daily solar radiation.

* [DeepSolar database (census tract level)](http://web.stanford.edu/group/deepsolar/deepsolar_tract.csv)
* [DeepSolar database metadata](http://web.stanford.edu/group/deepsolar/deepsolar_tract_meta.csv)

<a id=analysis></a>
### Analytical Approach

After removing redundant features and missing values, our dataset consisted of roughly 55,000 tracts and 131 features. We then created a binary target column named 'has_tiles', which represents the outcome we are trying to predict with our model. If the DeepSolar dataset indicated that the tract had any solar tiles (no matter the number), this column was coded as 1. If not, it was coded as 0.
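The target coding described above can be sketched in a few lines of pandas. This is a minimal illustration, not our exact notebook code: the source column name `tile_count` and the helper `add_target` are assumptions made for the example.

```python
import pandas as pd


def add_target(df: pd.DataFrame, count_col: str = "tile_count") -> pd.DataFrame:
    """Return a copy of df with a binary 'has_tiles' column.

    count_col is an assumed name for the column holding the number of
    detected solar tiles in each tract.
    """
    out = df.copy()
    # 1 if the tract has at least one tile, 0 otherwise.
    out["has_tiles"] = (out[count_col] > 0).astype(int)
    return out


# Tiny made-up example: two tracts without tiles, two with.
demo = add_target(pd.DataFrame({"tile_count": [0, 3, 12, 0]}))
print(demo["has_tiles"].tolist())
```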

One challenge we faced was that roughly 4 out of every 5 tracts had solar panels. With this imbalance, a model can score well simply by predicting that a tract has panels most of the time, because tracts with panels dominate the dataset. To combat this, we tried three different sampling methods to 'balance' the two categories: oversampling the tracts that don't have panels, undersampling the tracts that do have panels, and generating synthetic data via a method called SMOTE.
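The first two sampling methods can be illustrated with plain pandas. The sketch below is an assumption-laden example rather than our actual pipeline: `rebalance` is a hypothetical helper, and the toy data is made up. SMOTE is not shown here; the imbalanced-learn package provides it as `imblearn.over_sampling.SMOTE` with a `fit_resample(X, y)` method.

```python
import pandas as pd


def rebalance(df, target="has_tiles", method="over", seed=0):
    """Balance the two classes by random over- or undersampling.

    Hypothetical helper for illustration only.
    """
    counts = df[target].value_counts()
    minority = df[df[target] == counts.idxmin()]
    majority = df[df[target] == counts.idxmax()]
    if method == "over":
        # Draw minority rows with replacement until they match the majority.
        minority = minority.sample(n=len(majority), replace=True, random_state=seed)
    elif method == "under":
        # Keep only as many majority rows as there are minority rows.
        majority = majority.sample(n=len(minority), replace=False, random_state=seed)
    else:
        raise ValueError("method must be 'over' or 'under'")
    # Shuffle so the classes are not grouped together.
    combined = pd.concat([majority, minority])
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data with the same 4:1 imbalance described above.
toy = pd.DataFrame({"x": range(10), "has_tiles": [1] * 8 + [0] * 2})
over = rebalance(toy, method="over")
under = rebalance(toy, method="under")
print(over["has_tiles"].value_counts().to_dict())
print(under["has_tiles"].value_counts().to_dict())
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class proportions.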

We then tested and compared the performance of four different classifier models with different combinations of sampling methods and hyperparameter tuning. The models that we tested were: Decision Trees, Random Forests, K-Nearest Neighbors, and Logistic Regression.
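A comparison loop over these four model families might look like the following scikit-learn sketch. It runs on synthetic data generated with `make_classification` (about 80% positive class) rather than the DeepSolar tracts, uses default hyperparameters apart from `max_iter`, and is meant only to show the shape of the experiment, not to reproduce our results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the tract data (assumption): ~80% of rows are class 1.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.2, 0.8], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = balanced_accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```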

<a id=findings></a>
### Model Findings

The table below summarizes the metrics of each model combination on our test dataset.

<img src='images/model_comparison.jpg'>

We optimized our models for accuracy score when we used a sampling method, and for balanced accuracy score when we did not.
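To see why plain accuracy is a poor guide on unbalanced data, consider a toy example with the same 4:1 class ratio as our dataset. Balanced accuracy is the mean of the per-class recalls, so a model that always predicts "has panels" is not rewarded for ignoring the minority class. The functions below are a from-scratch sketch for illustration; the labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall computed separately for each class."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Eight tracts with panels, two without; the "model" always predicts 1.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Once the classes have been balanced by sampling, the two metrics coincide for roughly equal class sizes, which is why plain accuracy is a reasonable choice in that setting.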

*TODO: insert AUC-ROC curve.*

<a id=next></a>
### Next Steps