# Futures Price Predictor Version 0.1

### This project idea was inspired by a scene in the movie "The Social Network" in which Zuckerberg's character explains to a friend how another friend made $300,000 by finding the correlation between heating oil futures and meteorology. We will attempt a similar model to predict the heating oil or natural gas futures spot price from weather data and other predictors.

<br>
<br>

<img src="Soial_Network.jpg" width="500" height="240" align="center"/> 
<h3><i><center>The Social Network</center></i></h3>

# Sections:
## 1. [Hypothesis](#Hypothesis)
## 2. [Data Needed](#Data-Needed)
## 3. [Initial Model Selections](#Initial-Model-Selection)
## 4. [Data Wranglin'](#Data_Wranglin)
<br>

<a id='Hypothesis'></a>
## 1. Hypothesis

#### The hypothesis we are testing is whether the target variable, futures prices, can be predicted from a set of predictor variables. The initial set, to be adjusted based on historical correlation, will consist of:

 - Weather Patterns (in the contiguous United States, or other countries (exports) given additional findings)
 - Other futures spot prices
 - GDP measures of various countries
 - Weather sentiment (web scraper for positive/negative weather reports in future)
 - Farmer's Almanac predictors
 <br>
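
#### As a rough illustration of adjusting the predictor list based on historical correlation, the sketch below screens candidate predictors by their correlation with the target. The data is synthetic, and the column names (`avg_temp`, `other_future`, `price`) and the `threshold` value are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for real predictors (all values are made up).
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=200, freq="D")
avg_temp = rng.normal(50, 15, size=200)
other_future = rng.normal(100, 5, size=200)
# Toy target: price falls as temperature rises, plus noise.
price = 3.0 - 0.02 * avg_temp + rng.normal(0, 0.1, size=200)

df = pd.DataFrame(
    {"avg_temp": avg_temp, "other_future": other_future, "price": price},
    index=dates,
)

# Pearson correlation of each candidate predictor with the target.
correlations = df.corr()["price"].drop("price")

# Keep predictors whose absolute correlation clears a hypothetical threshold.
threshold = 0.3
selected = correlations[correlations.abs() >= threshold].index.tolist()

print(correlations)
print(selected)
```

#### A linear correlation screen like this can miss nonlinear relationships, so it is probably best treated as a first pass rather than a final filter.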

<a id='Data-Needed'></a>
## 2. Data Needed

### Weather Patterns

#### We will look at obtaining a weather pattern predictor that is already pre-built. The most important data subset will be the predicted average daily temperatures for any part of the country. These predictions will serve as a predictor variable in our 2nd stage model, not as predictor variables for training the initial regression model.

#### Additionally, we will need to obtain historical weather data for the U.S. for the predictor variable in our initial model.

### Futures Spot Prices

####  We will need to obtain historical futures spot prices to train our model on. Hopefully we will be able to obtain the same date ranges as weather.

### GDP Measures

#### One hypothesis is that the higher the GDP output of the world's countries, the higher the prices of heating oil or natural gas will be, since more people will be able to pay to heat their homes. As wealth increases across the globe, demand for home amenities increases, which should result in higher prices for a finite good like oil or natural gas (we will test this assumption).

#### We will need to obtain historical GDP output of countries in the world in the same date range as our spot prices and weather data.
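
#### GDP is reported far less often than daily spot prices, so the series will need to be aligned on a common date index. Below is a minimal sketch of one way to do that, forward-filling the last known GDP value onto each price date. All numbers and the column names (`spot_price`, `gdp_trillions`) are made up for illustration.

```python
import pandas as pd

# Toy daily spot prices (values are made up).
prices = pd.Series(
    [2.10, 2.15, 2.05, 2.20],
    index=pd.to_datetime(["2019-12-30", "2019-12-31", "2020-01-02", "2020-01-03"]),
    name="spot_price",
)

# Toy annual GDP figures, dated at the start of each year (values are made up).
gdp = pd.Series(
    [21.4, 21.1],
    index=pd.to_datetime(["2019-01-01", "2020-01-01"]),
    name="gdp_trillions",
)

# Reindex GDP onto the price dates, carrying the last known value forward.
gdp_daily = gdp.reindex(prices.index, method="ffill")

# Combine into one frame and drop any dates that lack either series.
aligned = pd.concat([prices, gdp_daily], axis=1).dropna()
print(aligned)
```

#### In practice GDP figures are published some time after the period they describe, so the dates used for forward-filling would likely need to be the release dates to avoid leaking future information into training.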

### Weather Sentiment

#### I believe that since heating oil and natural gas are bought in bulk by the average homeowner, severe weather predictions for winter months in the United States will result in increased demand, and thus higher spot prices. 

#### We will need to create a web scraper for weather sentiment analysis. This will be difficult to fine-tune given the number of weather articles and the keywords needed to identify the news as positive or negative, as well as to identify whether an article is speaking to weather in the coming days or the coming months.
#### This will be done in multiple steps:

 1. Is the weather report positive or negative?
 2. Is the weather report for near-term or long-term timeframe?
 3. Is the weather report for a particular geographical location in the US that will indeed experience colder weather?
 4. Additional fine-tuning and predictors to be explored.
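
#### The first three steps above could start out as a simple keyword pass before anything more sophisticated is tried. The sketch below is only that: the keyword sets and the `classify_report` function are hypothetical, and it only detects which regions are mentioned, not whether they will actually experience colder weather.

```python
import re

# Hypothetical keyword lists; a real scraper would need a much richer vocabulary.
SEVERE_TERMS = {"blizzard", "freeze", "frigid", "cold snap", "polar vortex", "snowstorm"}
MILD_TERMS = {"mild", "warm", "above-average temperatures", "thaw"}
LONG_TERM_TERMS = {"this winter", "seasonal outlook", "coming months", "next month"}
US_REGIONS = {"northeast", "midwest", "new england", "great lakes"}


def classify_report(text):
    """Tag a weather report with sentiment, time horizon, and mentioned regions."""
    lowered = text.lower()

    def count_hits(terms):
        # Count whole-phrase matches so that "warm" does not match "warming".
        return sum(len(re.findall(rf"\b{re.escape(t)}\b", lowered)) for t in terms)

    severe = count_hits(SEVERE_TERMS)
    mild = count_hits(MILD_TERMS)

    # Step 1: severe weather counts as "negative" news, mild weather as "positive".
    if severe > mild:
        sentiment = "negative"
    elif mild > severe:
        sentiment = "positive"
    else:
        sentiment = "neutral"

    # Step 2: any long-range phrase marks the report as long-term.
    horizon = "long-term" if count_hits(LONG_TERM_TERMS) > 0 else "near-term"

    # Step 3: list the regions the report mentions.
    regions = sorted(r for r in US_REGIONS if count_hits({r}) > 0)

    return {"sentiment": sentiment, "horizon": horizon, "regions": regions}


report = (
    "Seasonal outlook: a polar vortex could bring a frigid cold snap "
    "to the Northeast and Midwest."
)
result = classify_report(report)
print(result)
```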
 

### Farmers Almanac Data

#### We will look at incorporating Farmers Almanac predictors as an additional data set.
#### The accuracy of the historical Farmers Almanac data will be assessed to see if it can supply meaningful data points to the 2nd stage model that will predict spot prices.
#### The future predictions in the most recent Farmers Almanac will be needed in order to incorporate as a predictor variable in our 2nd stage model.
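
#### One simple way to assess historical accuracy is to compare the almanac's seasonal call against what was later observed, and then against a naive baseline. The records below are invented purely to show the calculation; the column names are assumptions.

```python
import pandas as pd

# Made-up records: the almanac's seasonal call vs. what was later observed.
history = pd.DataFrame(
    {
        "winter": [2015, 2016, 2017, 2018, 2019],
        "almanac_call": ["colder", "colder", "warmer", "colder", "warmer"],
        "observed": ["colder", "warmer", "warmer", "colder", "colder"],
    }
)

# Share of winters where the almanac's call matched the observation.
hit_rate = (history["almanac_call"] == history["observed"]).mean()

# Naive baseline: always predict the most common observed outcome.
baseline = history["observed"].value_counts(normalize=True).max()

print(f"hit rate: {hit_rate:.2f}, baseline: {baseline:.2f}")
```

#### If the hit rate does not clearly beat the baseline on real data, the almanac predictors are unlikely to add meaningful signal to the 2nd stage model.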

<a id='Initial-Model-Selection'></a>
## 3. Initial Model Selections

#### Since we have various models feeding our final target variable, we will list them below:

 - 1st Stage Model: Spot price predictor. We will explore a regression model, either a single random forest regressor or a combination of several regressors whose predictions are averaged. We will begin with a random forest regressor for three reasons: it is easy to implement with scikit-learn; its slow computation time is not a concern, since futures prices are not as volatile as something like virtual currency; and its estimates generally resist overfitting, because a random forest reduces variance by combining the outputs of many decision trees. Additionally, a random forest regressor does not require us to normalize our data, which saves some preprocessing work. However, I may normalize regardless if we decide to incorporate other regressors such as SGD; doing so should not affect the RF model.
 
 -  Weather News Sentiment Scraper: We will need to build a web scraping model that includes the following considerations:
    - 1. Is the weather report positive or negative?
    - 2. Is the weather report for near-term or long-term timeframe?
    - 3. Is the weather report for a particular geographical location in the US that will indeed experience colder weather?
    - 4. Additional fine-tuning and predictors to be explored.
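
#### For the 1st stage model described above, a minimal random forest sketch with scikit-learn might look like the following. The features and target are synthetic, the hyperparameters are arbitrary starting points, and the time-ordered train/test split is an assumption on my part about how price data should be evaluated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic features standing in for weather and other-futures predictors.
rng = np.random.default_rng(42)
n = 500
avg_temp = rng.uniform(0, 90, size=n)
other_future = rng.normal(60, 10, size=n)
X = np.column_stack([avg_temp, other_future])

# Toy spot price: lower when it is warm, higher with the other future, plus noise.
y = 3.0 - 0.02 * avg_temp + 0.01 * other_future + rng.normal(0, 0.05, size=n)

# Time-ordered split: train on the first 80% of rows, test on the last 20%.
split = int(n * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# No feature scaling is applied; tree-based models do not require it.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
print(f"MAE on held-out rows: {mae:.3f}")
```

#### Splitting by time rather than at random keeps future rows out of the training set, which matters once real, date-ordered prices replace the synthetic data.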

<a id='Data_Wranglin'></a>
## 4. Data Wranglin'

#### We will gather the necessary data based on the order of the Data Needed section.

### <u>Weather Data:</u>

#### Setting up NOAA connection:

In [1]:
# This data will be used to get future temps on a regular basis to update and append to historical weather data

# NOAA API token to access data

token = 'qQBTJhUaZoWTrFQeNkkFSWHvhhEjUVQn'
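
A minimal sketch of how the token might be used, assuming the NOAA Climate Data Online (CDO) Web Services v2 `data` endpoint, which expects the token in a `token` request header. The `NOAA_TOKEN` environment variable name and the `build_noaa_request` helper are hypothetical, and the endpoint and parameter names should be checked against NOAA's documentation before relying on them. The request is only built here, not sent.

```python
import os
import urllib.parse
import urllib.request

# Assumed CDO v2 endpoint for observation data.
BASE_URL = "https://www.ncei.noaa.gov/cdo-web/api/v2/data"


def build_noaa_request(api_token, station_id, start_date, end_date):
    """Build (but do not send) a request for daily TMAX at one GHCN-Daily station."""
    params = {
        "datasetid": "GHCND",
        "stationid": f"GHCND:{station_id}",
        "datatypeid": "TMAX",
        "startdate": start_date,
        "enddate": end_date,
        "limit": 1000,
    }
    url = f"{BASE_URL}?{urllib.parse.urlencode(params)}"
    # The token travels in a request header rather than in the URL.
    return urllib.request.Request(url, headers={"token": api_token})


# Reading the token from the environment keeps it out of the notebook itself.
noaa_token = os.environ.get("NOAA_TOKEN", "")
request = build_noaa_request(noaa_token, "USC00011084", "2021-01-01", "2021-01-31")
print(request.full_url)

# To actually fetch data, the request could be passed to urllib.request.urlopen
# and the JSON body decoded; that step is omitted here.
```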


Import the historical weather data from a local copy. This avoids the file size limits of the NOAA API, and the data only needs to be downloaded once, with new data appended to the file as it is released.

Given that I will be training the model once for the price prediction, I am downloading the large (3 GB) weather data set directly from NOAA: the Global Historical Climatology Network - Daily (GHCN-Daily), Version 3.




In [2]:
import itertools
import requests
import pandas as pd
import json
import numpy as np
from datetime import datetime

# Metadata formatting
# code adjusted from ned cr on gitlab
metadata_col_specs = [
    (0,  11),
    (12, 20),
    (21, 30),
    (31, 37),
    (38, 40),
    (41, 71),
    (72, 75),
    (76, 79),
    (80, 85)
]


metadata_names = [
    "ID",
    "LATITUDE",
    "LONGITUDE",
    "ELEVATION",
    "STATE",
    "NAME",
    "GSN FLAG",
    "HCN/CRN FLAG",
    "WMO ID"]


metadata_dtype = {
    "ID": str,
    "STATE": str,
    "NAME": str,
    "GSN FLAG": str,
    "HCN/CRN FLAG": str,
    "WMO ID": str
    }

# read the station metadata and return it in a df
path = "/home/rb/Documents/Python Projects/Futures_Price_Predict_0.1/Weather_Data/ghcnd-stations.txt"

def read_station_metadata(filename=path):
    
    df = pd.read_fwf(filename, colspecs = metadata_col_specs, header = None, names = metadata_names,
                     index_col = 'ID', dtype= metadata_dtype)
    
    return df


In [3]:
import itertools
import requests
import pandas as pd
import json
import numpy as np
from datetime import datetime


# # Data formatting

# data_header_names = [
#     "ID",
#     "YEAR",
#     "MONTH",
#     "ELEMENT"]

# data_header_col_specs = [
#     (0,  11),
#     (11, 15),
#     (15, 17),
#     (17, 21)]

# data_header_dtypes = {
#     "ID": str,
#     "YEAR": int,
#     "MONTH": int,
#     "ELEMENT": str}

# data_col_names = [[
#     "VALUE" + str(i + 1),
#     "MFLAG" + str(i + 1),
#     "QFLAG" + str(i + 1),
#     "SFLAG" + str(i + 1)]
#     for i in range(31)]

# # Join sub-lists
# data_col_names = list(itertools.chain.from_iterable(data_col_names))


# data_replacement_col_names = [[
#     ("VALUE", i + 1),
#     ("MFLAG", i + 1),
#     ("QFLAG", i + 1),
#     ("SFLAG", i + 1)]
#     for i in range(31)]

# # Join sub-lists
# data_replacement_col_names = list(itertools.chain.from_iterable(data_replacement_col_names))



# data_replacement_col_names = pd.MultiIndex.from_tuples(
#     data_replacement_col_names,
#     names=['VAR_TYPE', 'DAY'])


# data_col_specs = [[
#     (21 + i * 8, 26 + i * 8),
#     (26 + i * 8, 27 + i * 8),
#     (27 + i * 8, 28 + i * 8),
#     (28 + i * 8, 29 + i * 8)]
#     for i in range(31)]


# data_col_specs = list(itertools.chain.from_iterable(data_col_specs))


# data_col_dtypes = [{
#     "VALUE" + str(i + 1): int,
#     "MFLAG" + str(i + 1): str,
#     "QFLAG" + str(i + 1): str,
#     "SFLAG" + str(i + 1): str}
#     for i in range(31)]


# data_header_dtypes.update({dtypecol: dtype for data in data_col_dtypes for dtypecol, dtype in data.items()})

# # Use chunking to read the large text file into a dataframe in sections after filtering for the data points I want, then export to csv. Will look into using a line-by-line iterator when training the model as well, since this way still results in large RAM use.

# path_dly = '/home/rb/Documents/Python Projects/Futures_Price_Predict_0.1/Weather_Data/noaa_ghcn/all_dly_files'

# def read_dly_file(filename=path_dly):
    
#     df = pd.read_fwf(
#                 filename,
#                 colspecs = data_header_col_specs + data_col_specs,
#                 names = data_header_names + data_col_names,
#                 index_col = 'ID',
#                 dtype = data_header_dtypes,
#                 chunksize = 100000,
#                 iterator = True
#                 )
    
#     elements_needed = ['PRCP','SNOW','SNWD','TMAX','TMIN']
    
#     df_result = pd.concat([chunk[ (chunk['YEAR']>1970) 
#                                  & (chunk['ELEMENT'].isin(elements_needed))] for chunk in df])
    
#     return df_result
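
The column specs in the cell above follow the GHCN-Daily `.dly` fixed-width layout: an 11-character station ID, a 4-digit year, a 2-digit month, a 4-character element code, and then 31 repeated groups of a 5-character value plus three 1-character flags. As a sketch of what that layout means for a single record, here is a plain-Python parser for one line. The `parse_dly_line` helper and the sample line are made up for illustration; `-9999` is the missing-value sentinel in this format, and the flags are ignored here.

```python
from datetime import date

MISSING = -9999  # GHCN-Daily sentinel for a missing observation


def parse_dly_line(line):
    """Return (station_id, element, date, value) tuples for one .dly record."""
    station_id = line[0:11]
    year = int(line[11:15])
    month = int(line[15:17])
    element = line[17:21]

    records = []
    for day in range(1, 32):
        # Each day occupies 8 characters: 5 for the value, 3 for the flags.
        start = 21 + (day - 1) * 8
        value = int(line[start:start + 5])
        if value == MISSING:
            continue
        try:
            obs_date = date(year, month, day)
        except ValueError:
            # Skip day slots that do not exist in this month (e.g. Feb 30).
            continue
        records.append((station_id, element, obs_date, value))
    return records


# Build a synthetic line: two observed days followed by 29 missing days.
values = [128, 133] + [MISSING] * 29
sample_line = "USC00011084" + "1971" + "01" + "TMAX" + "".join(
    f"{v:>5}   " for v in values
)

records = parse_dly_line(sample_line)
print(records)
```

Parsing line by line like this trades speed for a small memory footprint, which may help with the RAM concern noted in the cell above.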

In [None]:
# df_all_dly = read_dly_file(filename=path_dly)

In [4]:
#df_all_dly.to_csv('/home/rb/Documents/Python Projects/Futures_Price_Predict_0.1/Weather_Data/noaa_ghcn/dly_1970_present.csv')

In [1]:
# # read in filtered local csv


# import itertools
# import requests
# import pandas as pd
# import json
# import numpy as np
# from datetime import datetime

# path_1970_present = '/home/rb/Documents/Python Projects/Futures_Price_Predict_0.1/Weather_Data/noaa_ghcn/dly_1970_present.csv'

# # use this data for loading in data types for csv, also using the list of col names for fuzzy match to exclude columns that are not needed

# data_header_names = [
#     "ID",
#     "YEAR",
#     "MONTH",
#     "ELEMENT"]

# data_header_dtypes = {
#     "ID": str,
#     "YEAR": int,
#     "MONTH": int,
#     "ELEMENT": str}

# data_col_names = [[
#     "VALUE" + str(i + 1),
#     "MFLAG" + str(i + 1),
#     "QFLAG" + str(i + 1),
#     "SFLAG" + str(i + 1)]
#     for i in range(31)]

# data_col_names = list(itertools.chain.from_iterable(data_col_names))


# data_col_dtypes = [{
#     "VALUE" + str(i + 1): int,
#     "MFLAG" + str(i + 1): str,
#     "QFLAG" + str(i + 1): str,
#     "SFLAG" + str(i + 1): str}
#     for i in range(31)]

# data_header_dtypes.update({dtypecol: dtype for data in data_col_dtypes for dtypecol, dtype in data.items()})



# cols_needed = data_header_names + [col for col in data_col_names if col.startswith('VALUE')]

# weather_data_1970_present = pd.read_csv(path_1970_present, 
#                                         dtype = data_header_dtypes,
#                                         usecols = cols_needed)

# wd = weather_data_1970_present

In [2]:
# pd.set_option('display.max_columns', None)
# wd

In [3]:
# wd_melt = wd.melt(id_vars=['ID','YEAR','MONTH','ELEMENT'], var_name = "Value_Dates", value_name = 'Observed_Data')

In [4]:
# #ridiculous but sending file to csv to free up RAM from first csv read in

# wd_melt.to_csv('/home/rb/Documents/Python Projects/Futures_Price_Predict_0.1/Weather_Data/noaa_ghcn/weather_melt_master.csv')

In [1]:
# #importing dask dataframe to hopefully iterate to_datetime with 93 million rows

# import dask.dataframe as dd


# dtype={'MONTH': 'int64',
#        'Observed_Data': 'float64',
#        'YEAR': 'int64',
#        'ID': 'string',
#        'ELEMENT': 'string',
#        'Value_Dates': 'string'}

# path_melt = '/home/rb/Documents/Python Projects/Futures_Price_Predict_0.1/Weather_Data/noaa_ghcn/weather_melt_master.csv'

# weather_melt = dd.read_csv(path_melt, dtype=dtype)

# weather_melt['Date']=dd.to_datetime((weather_melt.MONTH.astype(str) 
#                                      + weather_melt.Value_Dates.str[5:] 
#                                      + weather_melt.YEAR.astype(str)),
#                                     format = '%m%d%Y',
#                                     errors = 'coerce')

# weather_melt.to_csv('/home/rb/Documents/Python Projects/Futures_Price_Predict_0.1/Weather_Data/noaa_ghcn/weather_melt_1.csv', single_file=True)
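
A small pandas sketch of the wide-to-long reshape and date construction performed in the cells above, on toy data. One thing worth checking in the real pipeline: when month and day digits are concatenated without zero-padding, different dates can produce the same string (for example, month `1` with day `11` and month `11` with day `1` both give `111`). The sketch below pads both parts to two digits and puts the year first; this is a suggested variation, not the original code, and the values are made up.

```python
import pandas as pd

# Toy wide-format rows: one row per station/month/element, one column per day.
wide = pd.DataFrame(
    {
        "ID": ["USC00011084", "USC00011084"],
        "YEAR": [1971, 1971],
        "MONTH": [1, 11],
        "ELEMENT": ["TMAX", "TMAX"],
        "VALUE1": [128, 200],
        "VALUE11": [150, 180],
        "VALUE31": [139, -9999],  # November has no 31st, so that slot is missing
    }
)

# Reshape so that each row holds a single daily observation.
long_df = wide.melt(
    id_vars=["ID", "YEAR", "MONTH", "ELEMENT"],
    var_name="Value_Dates",
    value_name="Observed_Data",
)

# Zero-pad day and month so every date string is exactly 8 characters.
day = long_df["Value_Dates"].str[5:].str.zfill(2)
month = long_df["MONTH"].astype(str).str.zfill(2)
date_str = long_df["YEAR"].astype(str) + month + day

# Impossible dates such as November 31 become NaT instead of raising.
long_df["Date"] = pd.to_datetime(date_str, format="%Y%m%d", errors="coerce")
print(long_df[["MONTH", "Value_Dates", "Observed_Data", "Date"]])
```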

In [6]:
import itertools
import requests
import pandas as pd
import numpy as np
from datetime import datetime
import dask.dataframe as dd

dtype={'Observed_Data': 'float64',
       'ID': 'string',
       'ELEMENT': 'string'}

weather_data = '/home/rb/Documents/Python Projects/Futures_Price_Predict_0.1/Weather_Data/noaa_ghcn/weather_melt_1.csv'

weather_melt = dd.read_csv(weather_data, dtype=dtype)
weather_melt = weather_melt.drop('Unnamed: 0', axis=1)
weather_melt.head()

|   | ID | ELEMENT | Observed_Data | Date |
|---|---|---|---|---|
| 0 | USC00011084 | TMAX | 128.0 | 1971-01-01 |
| 1 | USC00011084 | TMIN | 11.0 | 1971-01-01 |
| 2 | USC00011084 | PRCP | 0.0 | 1971-01-01 |
| 3 | USC00011084 | SNOW | 0.0 | 1971-01-01 |
| 4 | USC00011084 | SNWD | 0.0 | 1971-01-01 |


#### Now joining the station metadata to the weather data based on the station ID
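
As a sketch of what this join could look like, the toy example below uses a left merge on the station `ID`, so that weather rows without matching metadata are kept rather than dropped. The station metadata values are placeholders, not real station records, and the frames are plain pandas rather than Dask.

```python
import pandas as pd

# Toy long-format weather rows (values are made up).
weather = pd.DataFrame(
    {
        "ID": ["USC00011084", "USC00011084", "XX000000001"],
        "ELEMENT": ["TMAX", "TMIN", "TMAX"],
        "Observed_Data": [128.0, 11.0, 95.0],
        "Date": pd.to_datetime(["1971-01-01"] * 3),
    }
)

# Placeholder station metadata, indexed by ID like read_station_metadata's output.
stations = pd.DataFrame(
    {
        "LATITUDE": [31.0],
        "LONGITUDE": [-87.0],
        "STATE": ["AL"],
        "NAME": ["EXAMPLE STATION"],
    },
    index=pd.Index(["USC00011084"], name="ID"),
)

# Left join keeps every weather row; unmatched stations get NaN metadata.
merged = weather.merge(stations.reset_index(), on="ID", how="left")
print(merged)
```

Counting the rows with missing metadata after the merge is a quick way to spot station IDs that are absent from the metadata file.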

In [None]:

path = "/home/rb/Documents/Python Projects/Futures_Price_Predict_0.1/Weather_Data/ghcnd-stations.txt"

def read_station_metadata(filename=path)