# Canadian Credit Union Yelp and Asset Growth Project

Here we explore the correlation between Yelp! reviews and asset growth using various recurrent neural networks. From our experimentation with the data and features, we found that _ yielded the best accuracy when predicting future asset growth. This prediction was generated using Yelp! reviews and sentiment analysis.


## Initial Set-Up

Here we import the libraries needed for the project. We also read in the data collected from various Canadian credit unions and their corresponding Yelp! reviews.

In [7]:
# Let's first start by importing the libraries and data we'll need
# Libraries needed include numpy, keras, pickle, matplotlib

import numpy as np
import keras
from matplotlib import pyplot as plt
import pickle

# Import data (the with-block closes the file once it has been read)

with open("dfFinal.p", "rb") as f:
    data = pickle.load(f)

## Data Preparation 

Here we will be doing the data cleaning, parsing and preparation. This will involve parsing out symbols and punctuation that may interfere with the model's ability to analyze the given data. 
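
The notebook does not show the cleaning step itself. A minimal sketch of what it might look like is given below, assuming each review is available as a plain Python string. The function name `clean_review` and the sample sentence are hypothetical and not part of the project data.

```python
import re

def clean_review(text):
    """Lowercase a review, strip punctuation and symbols, and collapse whitespace."""
    text = text.lower()
    # Replace every character that is not a-z, 0-9, or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces and trim the ends
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical review text, for illustration only
sample = "Great service!!! Rates are $$$ though... :("
print(clean_review(sample))  # great service rates are though
```

Note that this pattern also drops accented letters, which may matter for French-language reviews; the character class would need widening in that case.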

## Data Generation

To ensure the data fits our model, we define a function that puts the data into a model-friendly shape. It is called prior to model compilation and training with a defined lookback period, step size, batch size and delay.

In [None]:
# Define a generator function that reshapes the data into batches for our recurrent neural network
# Inputs: the normalized data array, the lookback and delay, the min and max indices,
# a shuffle flag (draw rows at random each time, or in order?), the batch size and the step between sampled rows

def generator(data, lookback, delay , min_index, max_index, shuffle = False, batch_size = 128, step = 6):
    if max_index is None:
        max_index = len(data)  - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index, size = batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows), ))
        
        for j , row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
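
To make the roles of `lookback` and `delay` concrete, the sketch below builds the same kind of windows on a tiny made-up array and prints the resulting shapes. The helper `make_windows` and the toy values are assumptions for illustration. It follows the generator's indexing (the window ends just before `row`, and the target sits `delay` rows after `row`), but it returns everything at once instead of yielding batches.

```python
import numpy as np

def make_windows(data, lookback, delay, step=1, target_col=1):
    """Build (sample, target) pairs using the same indexing as the generator."""
    samples, targets = [], []
    # A row needs `lookback` rows behind it and `delay` rows ahead of it
    for row in range(lookback, len(data) - delay):
        samples.append(data[row - lookback:row:step])
        targets.append(data[row + delay, target_col])
    return np.array(samples), np.array(targets)

# Toy series: 10 yearly rows with 2 features each (hypothetical values)
toy = np.arange(20, dtype=float).reshape(10, 2)
X, y = make_windows(toy, lookback=3, delay=1)
print(X.shape, y.shape)  # (6, 3, 2) (6,)
```

Each sample is a block of `lookback` consecutive rows, and each target is a single value taken from column 1, which is the column the generator above also uses as its target.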

## Data Analytics 

Here we plot some of our data to look for obvious patterns. It is best to do this before model creation so that we can make sure those patterns are accounted for at that stage. Here we simply look at asset growth versus time.

In [None]:
# Store our Vancity asset data in a numpy array for convenience

vassets = float_data[:, 1]
time = data[:, 1]

# Plot the asset series (not the whole data array) against time
plt.plot(time, vassets, label = 'Asset Growth ($ CAD)')
plt.title('Asset Growth By Year')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Asset Growth ($ CAD)')
plt.show()

As is evident from the plot above, we should expect a year-over-year increase in assets. This gives us a diagnostic to check that the model's output is reasonable. Additionally, we should see yearly growth in excess of a few hundred million dollars, picking up to a few billion dollars around the year 2000.
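
The diagnostic described above can be sketched as a small check on any asset series, whether historical or predicted. The function names and the example figures below are hypothetical and are not taken from the project data.

```python
import numpy as np

def yearly_growth(assets):
    """Return the year-over-year change in total assets."""
    return np.diff(np.asarray(assets, dtype=float))

def is_plausible(assets):
    """True when assets never shrink from one year to the next."""
    return bool(np.all(yearly_growth(assets) >= 0))

# Hypothetical asset totals in $ CAD, for illustration only
example_assets = [1.0e9, 1.3e9, 1.7e9, 2.4e9]
print(yearly_growth(example_assets))
print(is_plausible(example_assets))
```

Running the same check on the model's predictions gives a quick signal when the output contradicts the trend seen in the plot.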

In [None]:
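# Rebuilding the generator, hardcoded for our purpose: the step is fixed at 1 and rows are always drawn in order (no shuffling)
# This definition replaces the more general one from the Data Generation section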

def generator(data, lookback, delay , min_index, max_index, batch_size = 128):
    step = 1
    
    if max_index is None:
        max_index = len(data)  - delay - 1
    i = min_index + lookback
    
    while 1:
        if i + batch_size >= max_index:
            i = min_index + lookback
        rows = np.arange(i, min(i + batch_size, max_index))
        i += len(rows)
        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows), ))
        
        for j , row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets

## Model Creation

Finally, after all the hard work of normalizing our data and taking a quick look at it, we get to the fun part: model creation. We will take Yelp! reviews and the provided asset growth data as inputs and predict future asset growth as our only output. This will be accomplished using an LSTM recurrent neural network model with _ layers and _ neurons. These were selected as they yielded the highest accuracy in our experimentation.
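
The normalization mentioned above is not shown in the cells of this notebook. One common approach, given here only as a sketch and not necessarily the one used to build `float_data`, is to standardize each column with the mean and standard deviation of the training rows alone. The function name `standardize`, the `n_train` argument, and the sample values are assumptions.

```python
import numpy as np

def standardize(values, n_train):
    """Scale each column using statistics from the first n_train rows only."""
    values = np.asarray(values, dtype=float)
    mean = values[:n_train].mean(axis=0)
    std = values[:n_train].std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero
    std = np.where(std == 0, 1.0, std)
    return (values - mean) / std, mean, std

# Hypothetical rows of (year, assets), for illustration only
raw = np.array([[2000, 1.0], [2001, 2.0], [2002, 3.0], [2003, 10.0]])
scaled, mean, std = standardize(raw, n_train=3)
print(scaled[:3].mean(axis=0))  # approximately zero for every column
```

Using only the training rows for the statistics keeps information from the validation and test periods out of the inputs the model trains on.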

In [1]:
# Need to import some libraries from keras to create our model
# This will involve the use of keras sequential neural network models, layers and rmsprop optimizers

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop


# Now need to define how far back into the data set we wish to go, how often we wish to sample,
# and how far into the future we wish to predict

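# Will use all of our data to make future predictions, as such, will use length of time series column
# since each of the credit unions gave variable historical data (i.e. did not share common report start date)
# Caution: the generator only draws rows at index lookback or later, so a lookback equal to the
# full series length leaves it no rows to sample; a shorter lookback may be needed

lookback = len(time)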

# Since we have limited annual data, will sample every year, therefore will use a step of 1

step = 1

# Will now define a batch size for our model so we can call the data generator function

batch_size = 5

# Will start by predicting asset growth one year into the future, and will extend outwards from there

delay  = 1

# Get our data using the data generator function
# Note: the hardcoded generator defined above takes no shuffle or step arguments, so they are not passed here

# Will start with the training data set, seems like an obvious place to start
# (number) is a placeholder for the last index of the training split

train_gen = generator(float_data, lookback=lookback, delay=delay, min_index=0, max_index=(number), batch_size=batch_size)

# Will then create a validation data set using the exact same parameters, but shifting our indices up

val_gen = generator(float_data, lookback=lookback, delay=delay, min_index=200001, max_index=300000, batch_size=batch_size)

# And lastly will generate a testing data set

test_gen = generator(float_data, lookback=lookback, delay=delay, min_index=300001, max_index=None, batch_size=batch_size)

# Set how many steps we need to get the entire validation data set (each step yields one batch)
val_steps = (300000 - 200001 - lookback) // batch_size

# Set how many steps we'll need to get the entire testing data set
test_steps = (len(float_data) - 300001 - lookback) // batch_size

# The rest is very similar to the creation of the sequential neural network we made during the Warm-Up project
# Quick refresher though: need to define our model with a number of layers, an optimizer function, a loss function, and an activation function

# Define our model as a sequential one
model = Sequential()

# Add some layers to our model
# The LSTM returns only its final output, so the model produces one prediction per sample, matching the targets
model.add(layers.LSTM(32, input_shape=(lookback // step, float_data.shape[-1])))
model.add(layers.Dense(32, activation = 'relu'))
model.add(layers.Dense(1))

# Now compile our model with optimizer and loss functions, no metric for this one though
model.compile(optimizer = RMSprop(), loss = 'mae')
history =  model.fit_generator(train_gen,
                               steps_per_epoch = 50,
                               epochs = 1,
                               validation_data = val_gen,
                               validation_steps = val_steps)

# Before we go any further, some important notes to make here. Will do that below in the "Model Notes" block


## Model Notes

1. Could have used other activation functions, 'relu' is a pretty popular one, but could use the likes of 'selu' and 'sigmoid'.

2. The number of layers we add is completely arbitrary and is usually driven by experimenting with the model to see what works the best for the project.

3. The number of epochs is another great place to play around. This is primarily due to wanting to avoid overfitting, which can happen by having too many training epochs. As such, should play around and see how many epochs yields the best result for the model.

4. The optimizer function is another area to play around as RMSprop may not always be the best choice for the project at hand.

5. The loss function selected here was another judgement call. Mean absolute error makes sense because we have actual numbers to compare against our model's predictions, so it tells us directly how far off those predictions are. Other regression losses, such as mean squared error or root mean squared error, could serve the same purpose. Binary cross entropy, by contrast, is meant for two-class classification and would not fit this problem.
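
To illustrate the difference between the two regression losses named in note 5, the sketch below computes both on a small made-up set of predictions. The function names and numbers are for illustration only.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Three perfect predictions and one that is off by 4 (hypothetical values)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]
print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # 2.0
```

The single large miss doubles the RMSE relative to the MAE, which is why the choice between them depends on how much occasional large errors should count.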

## Model Evaluation

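Now that we have trained and validated our model, we need to test it. This will be done using model.evaluate_generator() and the test data we set aside for it earlier. Since the model was compiled with a loss function and no metrics, the result is a single number: the mean absolute error on the test data.

In [None]:
# test_steps is the number of batches to draw from test_gen; the generator never stops on its own, so a step count is required

test_loss = model.evaluate_generator(test_gen, steps = test_steps)

# Print out the loss (mean absolute error) of our model on the testing data

print("Test MAE:", test_loss)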

## Model Output

We will now plot the training and validation losses to see whether we're overfitting, then have the model make a prediction to see how it is performing overall.

In [None]:
# Will grab our losses by going into the training history and defining appropriate variables to make plotting easier

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

# Now plot our training and validation losses

plt.figure()
plt.plot(epochs, loss, 'b', label = 'Training Loss')
plt.plot(epochs, val_loss, 'r', label = 'Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Model Training and Validation Losses By Epoch')
plt.legend()
plt.show()

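# Will now have our model make a prediction
# X will be our input that we give to our model to make its prediction
# It must have shape (samples, lookback // step, number of features) to match the model's input layer

prediction = model.predict(X)
print(prediction)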
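
One simple way to read the loss curves for overfitting is to find the epoch where the validation loss bottoms out; training past that point tends to fit the training data at the expense of new data. The sketch below assumes a list shaped like `history.history['val_loss']`. The function name `best_epoch` and the loss values are hypothetical.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    lowest = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return lowest + 1

# Hypothetical validation losses over five epochs, for illustration only
hypothetical_val_loss = [0.90, 0.61, 0.48, 0.52, 0.60]
print(best_epoch(hypothetical_val_loss))  # 3
```

With a single training epoch, as in the cell above, there is only one point to inspect, so the check becomes useful once the epoch count is raised.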

## Conclusion

As is evident from the plot and historical output above, our model is able to achieve accuracy in the range of _%-_%. Furthermore, it is evident that in using _ epochs, the model is able to avoid overfitting to the training data. As such, it can be concluded that our model is sufficiently effective in predicting future asset growth for Canadian Credit Unions using Yelp! reviews, and yearly asset growth. Lastly, it can be concluded that Vancity Credit Uni