## **ARIEL Engine**
This program uses [linear regression](http://www.incertitudes.fr/book.pdf) in Python to forecast stock prices from public datasets. It displays an accuracy score (the model's R² multiplied by 100) and plots the result with Matplotlib. NumPy gives us large multi-dimensional arrays along with a large collection of high-level mathematical functions to operate on them. pickle serializes and de-serializes a Python object structure; it is one way to convert a Python object into a byte stream. We import the pandas library for data manipulation and analysis. scikit-learn helps us create our regression model, and finally we use Matplotlib to plot and visualize our data.


In [None]:
import numpy as np
import pickle as pkl
import pandas as pd
import quandl, math, datetime
from sklearn import preprocessing
from sklearn import model_selection
# from sklearn.svm.libsvm import cross_validation
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as data_plot
from matplotlib import style


Here we load a data frame from Quandl, a platform for economic and alternative data. An API key activates the API, and the program asks the user to type in a stock code to forecast. We then write the data to a file to save it and print it, which shows the user all of the available stock data, from the earliest record to the most recent.

In [None]:
# using style
style.use('ggplot')

quandl.api_config.ApiConfig.api_key = 'h-FrEDCiZnh5cKM3Cuva'
# fetch the price history for the stock code the user enters

stockCode = input("Please enter the Stock Code for predictions: ")
data = quandl.get('WIKI/' + stockCode)
# write the data to a file; the with block closes it automatically
with open("ARIELDataWrite.txt", "w") as writeFile:
    writeFile.write(str(data))
print(data)
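
The cell above hard-codes the API key. A minimal sketch of one alternative, assuming the key is kept in an environment variable instead; the helper name `load_api_key` and the variable name `QUANDL_API_KEY` are hypothetical and are not part of the Quandl library:

```python
import os


def load_api_key(env=os.environ, name="QUANDL_API_KEY"):
    """Return the API key stored under `name`, or fail with a clear message.

    Both `name` and this helper are illustrative assumptions.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before fetching data")
    return key
```

The returned value could then be assigned to `quandl.api_config.ApiConfig.api_key` in place of the literal string.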

Here we select the columns we need so they are labeled properly. We then create a new variable, the high-low percent, which is ([Adj. High] - [Adj. Close]) / [Adj. Close] * 100.
Our other variable describes the percentage change, which is ([Adj. Close] - [Adj. Open]) / [Adj. Open] * 100. Finally we keep only the columns used for forecasting.

In [None]:
# print(data.head())
# print(data.keys())
# condenses the information
# assert "Forecasting" in data
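# .copy() gives an independent frame, so adding columns does not raise SettingWithCopyWarning
data = data[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']].copy()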
# creating new variable which is the high low percent
data['HL_PCT'] = (data['Adj. High'] - data['Adj. Close']) / data['Adj. Close'] * 100.0
# creating new variable which is the Percentage Change
data['PCT_Change'] = (data['Adj. Close'] - data['Adj. Open']) / data['Adj. Open'] * 100.0

data = data[['Adj. Close', 'HL_PCT', 'PCT_Change', 'Adj. Volume']].copy()
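
To make the two formulas concrete, here is a small sketch on a two-row sample. The prices are made up for illustration and do not come from Quandl:

```python
import pandas as pd

# Hypothetical two-day sample; the values are invented for this example.
sample = pd.DataFrame({
    'Adj. Open':  [100.0, 50.0],
    'Adj. High':  [110.0, 52.0],
    'Adj. Close': [105.0, 40.0],
})

# High-low percent: how far the high sits above the close, relative to the close.
sample['HL_PCT'] = (sample['Adj. High'] - sample['Adj. Close']) / sample['Adj. Close'] * 100.0
# Percentage change: movement from open to close, relative to the open.
sample['PCT_Change'] = (sample['Adj. Close'] - sample['Adj. Open']) / sample['Adj. Open'] * 100.0

print(sample[['HL_PCT', 'PCT_Change']].round(2))
```

On the first row the close is 5 above the open of 100, so the percentage change is about 5; on the second row the close is 10 below the open of 50, so it is about -20.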


Here we prepare our dataset for forecasting. data.fillna fills the missing values with a placeholder (-99999) so that no holes remain in the data frame. Our forecastOut variable uses math.ceil, which returns the smallest integer greater than or equal to x, to round 2% of the number of rows up to a whole number. The shift function takes a scalar parameter called periods, which is the number of rows to shift along the desired axis; shifting by -forecastOut gives each row a Label equal to the adjusted close forecastOut rows later.

In [None]:
# forecasting our dataset
forecastData = 'Adj. Close'
data.fillna(-99999, inplace=True)
print(data)
# comment if necessary
print(len(data))

forecastOut = int(math.ceil(0.02*len(data)))

data['Label'] = data[forecastData].shift(-forecastOut)
# print(data)
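
The effect of the negative shift is easier to see on a toy frame. The sketch below assumes 100 rows whose closing price simply counts from 1 to 100; the names `toy` and `forecast_out` are illustrative only:

```python
import math

import pandas as pd

# Toy data: 100 rows, closing price equal to the row number (1.0 .. 100.0).
toy = pd.DataFrame({'Adj. Close': [float(p) for p in range(1, 101)]})

# 2% of the rows, rounded up to a whole number of rows.
forecast_out = int(math.ceil(0.02 * len(toy)))

# Each row's label is the close `forecast_out` rows later.
toy['Label'] = toy['Adj. Close'].shift(-forecast_out)

print(toy.head(3))
print(toy.tail(3))
```

The last `forecast_out` rows have no future value to look up, so their Label is NaN. These are the rows the notebook later forecasts.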

We use data.drop to drop specified labels from rows or columns, either by giving the label names with the corresponding axis or by naming the index or columns directly. Here it removes the Label column so that x holds only the features. preprocessing.scale then standardizes x along each column. The last forecastOut rows, which have no label, are set aside as xLately for forecasting, and the remaining rows are used for training and testing.

In [None]:
# columns= replaces the positional axis argument, which pandas 2.0 no longer accepts
x = np.array(data.drop(columns=['Label']))
x = preprocessing.scale(x)
xLately = x[-forecastOut: ]
x = x[:-forecastOut]
data.dropna(inplace = True)
y = np.array(data['Label'])
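
As a rough illustration of the scaling and slicing steps, the sketch below standardizes a small made-up feature matrix and splits off the most recent rows. The array values and the names `recent` and `history` are assumptions for this example:

```python
import numpy as np
from sklearn import preprocessing

# Made-up feature matrix: 5 rows, 2 columns on very different scales.
features = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 600.0],
    [4.0, 800.0],
    [5.0, 1000.0],
])

# Standardize each column to zero mean and unit variance.
scaled = preprocessing.scale(features)

n_forecast = 2
recent = scaled[-n_forecast:]   # last rows, kept aside for forecasting
history = scaled[:-n_forecast]  # everything before them, used for fitting

print(scaled.mean(axis=0), scaled.std(axis=0))
print(history.shape, recent.shape)
```

Note that scaling happens before the split, as in the cell above, so the forecast rows share the same mean and variance as the training rows.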


Here we split x and y into training and test sets using train_test_split from model_selection. We then fit a linear regression, a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. The relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data itself. It focuses on the conditional probability distribution of the response given the values of the predictors.

In [None]:
trainX, testX, trainY, testY = model_selection.train_test_split(x, y, test_size=0.2)

frameControl = LinearRegression(n_jobs=-1)

frameControl.fit(trainX, trainY)
with open('linearregression.pickle', 'wb') as a:
    pkl.dump(frameControl, a)

# reload the saved model; the with block closes the file handle
with open('linearregression.pickle', 'rb') as inPickle:
    frameControl = pkl.load(inPickle)

accuracy = frameControl.score(testX, testY) * 100.0

setForecast = frameControl.predict(xLately)

data['Forecasting the Data'] = np.nan
last_date = data.iloc[-1].name
last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day
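
The split, fit, pickle and score steps can be tried in isolation on synthetic data. This is only a sketch: the data is randomly generated with known coefficients, the seed and noise level are arbitrary assumptions, and the model is pickled in memory rather than to a file:

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y is a known linear function of X plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for testing, as in the cell above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Round-trip the fitted model through pickle (bytes in memory here).
restored = pickle.loads(pickle.dumps(model))

# score() returns R^2, the coefficient of determination.
r2 = restored.score(X_test, y_test)
print(round(r2 * 100.0, 2))
```

Because score returns R² rather than a classification accuracy, the value multiplied by 100 is not strictly bounded below by 0; a poorly fitting model can score negative.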


The timestamp identifies each entry of the DatetimeIndex that our graph uses. The loop here appends one row per forecast value, each dated one day after the previous one, and leaves every column except the forecast empty. This lets us show the past data followed by the forecast.

In [None]:
for i in setForecast:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    data.loc[next_date] = [np.nan for _ in range(len(data.columns) - 1)] + [i]
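
A minimal sketch of the same row-appending idea on a three-row frame. It steps the dates with pd.Timedelta instead of Unix timestamps, which is a swapped-in technique and not what the cell above does; the prices, dates and predictions are made up:

```python
import numpy as np
import pandas as pd

# Made-up history with an empty forecast column as the last column.
frame = pd.DataFrame(
    {'Adj. Close': [10.0, 11.0, 12.0], 'Forecast': [np.nan] * 3},
    index=pd.to_datetime(['2018-03-23', '2018-03-26', '2018-03-27']),
)
predictions = [12.5, 13.0]  # hypothetical model output

next_date = frame.index[-1] + pd.Timedelta(days=1)
for value in predictions:
    # NaN for every column except the last, which receives the prediction.
    frame.loc[next_date] = [np.nan] * (len(frame.columns) - 1) + [value]
    next_date += pd.Timedelta(days=1)

print(frame)
```

Adding a Timedelta keeps the arithmetic on the index itself, so the result does not depend on the machine's local time zone. Like the original loop, it steps one calendar day at a time and does not skip weekends.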

Last but not least, we plot our data with the Matplotlib library for visualization and print the prediction accuracy.

In [None]:
# plot the data for visualization
data['Adj. Close'].plot()
data['Forecasting the Data'].plot()
data_plot.legend(loc=4)
data_plot.title('Stock Prediction for ' + stockCode)
data_plot.xlabel('Date')
data_plot.ylabel('Price')
data_plot.show()
print(accuracy)