# **KWH Model Report**
### ***Goal**: Build a model that predicts household electricity consumption (KWH)*

Showcase the following:

1. Data Engineering
2. Solution
3. Problem methodology
4. Code
5. Analysis write-up

## **Approach**:
1. Explore and understand the dataset
2. Data Engineering: Cleaning / Feature Engineering / Pipeline
3. Baseline model (Linear Regression)
4. Evaluate, adjust pipeline & test various models

# **Findings:**



## 1. **EDA**

This model uses the [2009 Residential Energy Consumption Survey (RECS) microdata](https://www.eia.gov/consumption/residential/data/2009/index.php?view=microdata), collected by the U.S. Energy Information Administration (EIA).

Data tables:

| File Name                      	| Shape        	| Description                                                   	|
|--------------------------------	|--------------	|---------------------------------------------------------------	|
| [recs2009_public.csv](https://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv)            	| (12083, 940) 	| Sample represents 113.6 million U.S. households in 2009       	|
| [public_layout.csv](https://www.eia.gov/consumption/residential/data/2009/csv/public_layout.csv)              	| (940, 5)     	| Descriptive labels and formats for each data variable         	|
| [recs2009_public_repweights.csv](https://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public_repweights.csv) 	| (12083, 246) 	| Replicate weights for each of the 12,083 RECS household cases 	|

Given the timeline of this mini project, I focused on the first two tables. As a next step, I would test whether data from the replicate weights table improves the model.



### **Initial Data Comments**:
[public table](https://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv) 
![public table](../output/content/data_ex1.jpg)


1. With 940 columns, the data has to be handled programmatically. Reviewing each feature by hand would be a large time investment; in a business setting it would be best to work with a subject matter expert (SME) to speed up understanding. I therefore took a programmatic approach to understanding the data and what is relevant to the desired model.

2. There is data leakage around the target variable, which needed investigation.

3. Many columns are multicollinear.

4. Data types should be confirmed for modeling (categorical, continuous, ordinal, nominal, etc.).

### **Target Variable:** 'KWH'

Exploring the data showed that several other columns are subsets of the target variable.

<img src="../output/content/target_dropvars.png" width="400">

**Assumption**: We would not have access to this data when making a prediction, because it depends on the target variable KWH. Any column following this logic was dropped as part of data processing.
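A minimal sketch of how such a drop could be done, assuming the dependent columns can be recognised by a shared `KWH` name prefix. That prefix rule, the helper name `drop_target_dependent`, and the toy column names are illustrative assumptions; the actual list of dropped columns is the one shown in the figure above, and any dependent column that does not share the prefix would have to be listed explicitly.

```python
import pandas as pd

TARGET = "KWH"

def drop_target_dependent(df: pd.DataFrame, target: str = TARGET) -> pd.DataFrame:
    """Drop columns whose name starts with the target name, keeping the target itself."""
    # Assumption: leaky columns share the target's name as a prefix.
    leaky = [c for c in df.columns if c.startswith(target) and c != target]
    return df.drop(columns=leaky)

# Toy frame; the column names are stand-ins, not a verified list from the survey.
toy = pd.DataFrame({
    "KWH": [1000, 2000],
    "KWHSPH": [300, 700],
    "KWHCOL": [200, 400],
    "TOTSQFT": [1500, 2400],
})

clean = drop_target_dependent(toy)
print(list(clean.columns))
```

The function returns a new frame and leaves the input untouched, so the dropped columns remain available for inspection.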

### **-2 Values:** There are a lot of them
There were a lot of occurrences of -2 in many columns. The example below shows column names and their count of -2 values. Given that there are 12,083 rows in this dataset, many columns contain -2 as most of their values.

<img src="../output/content/neg_list.png" width="200">


<br>


Most of these columns hold data that is only collected for some observations. For example, the "AGEHHMEMCAT14" feature indicates whether there is a 14th household member. Most households do not have 14 members.

**Assumption**: When -2 is present in large quantities for a feature, it represents a *NULL* value. For this model's pipeline, I removed columns where most of the data was *NULL*.
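The sketch below shows one way to apply this assumption: recode -2 as missing, then drop columns whose missing share exceeds a cutoff. The 40% cutoff is taken from the pipeline list in this report; the helper name `drop_mostly_null` and the toy data are hypothetical. It treats every -2 as missing, which would be wrong for any column where -2 is a legitimate measurement.

```python
import numpy as np
import pandas as pd

def drop_mostly_null(df: pd.DataFrame, sentinel: int = -2,
                     max_null_frac: float = 0.40) -> pd.DataFrame:
    """Treat `sentinel` as missing and drop columns with too large a missing share."""
    masked = df.replace(sentinel, np.nan)
    null_frac = masked.isna().mean()              # share of missing values per column
    keep = null_frac[null_frac <= max_null_frac].index
    return masked[keep]

toy = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],       # no sentinel values, kept
    "B": [-2, -2, -2, 1, 0],    # 60% sentinel, dropped
    "C": [-2, 7, 8, 9, 10],     # 20% sentinel, kept with one NaN
})

result = drop_mostly_null(toy)
print(list(result.columns))
```

Columns that survive keep their remaining -2 values as `NaN`, so an imputation step would still be needed before fitting most models.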

### **Low Unique Value Count:**
While trying to understand the data, I investigated columns that were potentially ordinal or nominal. I discovered that many columns are flags, as also seen in the "-2" example above.

<img src="../output/content/neg_example1.png" width="500">
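One way to surface such columns programmatically is to count distinct values per column and list the ones below a small cutoff. This is a rough sketch only: the cutoff of 5 and the helper name `low_cardinality_columns` are assumptions, and a low count only hints at a flag or categorical column; the layout file still has to confirm what each code means.

```python
import pandas as pd

def low_cardinality_columns(df: pd.DataFrame, max_unique: int = 5) -> pd.Series:
    """Return the distinct-value count of every column with at most `max_unique` values."""
    counts = df.nunique(dropna=True)
    return counts[counts <= max_unique].sort_values()

# Toy data with made-up column names.
toy = pd.DataFrame({
    "FLAG": [0, 1, 0, 1, -2, 0],                     # 3 distinct values
    "LEVEL": [1, 2, 3, 1, 2, 3],                     # 3 distinct values
    "SQFT": [900, 1200, 1500, 1800, 2100, 2400],     # 6 distinct values
})

low = low_cardinality_columns(toy)
print(low)
```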


## 2. **Pipeline**

In this version of the project, the pipeline performs the following steps:

1. Basic cleaning
2. Encode categorical
3. Drop ID columns
4. Drop target variable dependent columns
5. Remove columns with multicollinearity
6. Remove columns with more than 40% null values.
7. Scaling
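The report does not say how multicollinear columns are identified, so the following is only a sketch of one common approach: drop one column from each pair whose absolute Pearson correlation exceeds a threshold, then standardise what remains. The 0.9 threshold, the helper name `drop_collinear`, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def drop_collinear(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose |correlation| exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x1_copy": x1 * 2.0 + rng.normal(scale=0.01, size=200),  # near-duplicate of x1
    "x2": rng.normal(size=200),                               # independent feature
})

reduced = drop_collinear(toy)
scaled = pd.DataFrame(StandardScaler().fit_transform(reduced), columns=reduced.columns)
print(list(reduced.columns))
```

In a real run the scaler would be fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.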

Still experimenting with:

- Correcting skew
- PCA
- Outliers
- Binning the target for a classification model
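Two of these experiments can be sketched on a synthetic right-skewed series standing in for KWH. The log transform and the quartile bins below are illustrative choices, not the settings used in this project, and the bin labels are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic, right-skewed stand-in for the KWH target (not survey data).
kwh = pd.Series(rng.lognormal(mean=9.0, sigma=0.6, size=1000), name="KWH")

# Skew correction: log1p compresses the long right tail.
kwh_log = np.log1p(kwh)

# Binning: quartiles turn the regression target into four classes.
kwh_bin = pd.qcut(kwh, q=4, labels=["low", "mid_low", "mid_high", "high"])

print("skew before:", round(kwh.skew(), 2), "after:", round(kwh_log.skew(), 2))
print(kwh_bin.value_counts().sort_index())
```

If the model is trained on the log-transformed target, predictions need `np.expm1` applied to return to the original units.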



## 3. **Baseline Model:** Linear Regression
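As a rough illustration of what this baseline involves, the sketch below fits a scaled linear regression on synthetic data. The features, coefficients, split ratio, and metrics are assumptions made for the example and are not results from the RECS data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features and target; stand-ins for the processed survey columns.
X = rng.normal(size=(500, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.5])
y = X @ true_coef + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

pred = model.predict(scaler.transform(X_test))
r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
print(f"R^2: {r2:.3f}  MAE: {mae:.3f}")
```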

# Housekeeping (Imports / Functions)

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


# 1. The Data

In [2]:
dpath = '../data/'
# Survey responses, variable layout, and replicate weights
df1 = pd.read_csv(dpath + 'recs2009_public.csv')
df_lay = pd.read_csv(dpath + 'public_layout.csv')
df_w = pd.read_csv(dpath + 'recs2009_public_repweights.csv')

