# Canadian Credit Union Yelp and Asset Growth Project

Here we explore the correlation between Yelp reviews and asset growth using various recurrent neural networks. From our experimentation with the data and features, we found that _ yielded the most accurate predictions of future asset growth. These predictions were generated from Yelp reviews and sentiment analysis.


## Initial Set-Up

Here we import the libraries needed for the project. We also read in the data collected from various Canadian credit unions and their corresponding Yelp reviews.

In [2]:
# Let's first start by importing the libraries and data we'll need
# Libraries needed include numpy, keras, csv, matplotlib

import numpy as np
import keras
import csv
from matplotlib import pyplot as plt

# Need to first import the data into a numpy array so we can do some work with it

fname = 'data'
with open(fname) as f:
    data = f.read()

# Now need to separate the column headers from the rest of the data

lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]
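
The plotting cell further down indexes a numeric array called `float_data`, which is never built from `lines`. A minimal sketch of one way it might be built is shown here. The sample text, its column names and its values are all made up for illustration, and the sketch assumes every column in the real file is numeric.

```python
import numpy as np

# Hypothetical sample standing in for the real file: a header row, then one
# row per year. Column names and values are invented for illustration only.
sample = "year,assets\n1998,1.5\n1999,1.75\n2000,2.0\n"

# Drop blank lines (a trailing newline otherwise leaves an empty last entry).
rows = [line for line in sample.split('\n') if line.strip()]
sample_header = rows[0].split(',')

# Convert every remaining row to floats; assumes all columns are numeric.
float_data = np.array([[float(v) for v in row.split(',')] for row in rows[1:]])

print(sample_header)
print(float_data.shape)
```

If some columns hold text (a credit union name, say), they would need to be skipped or encoded before the conversion to floats.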

## Data Preparation 

Here we will be doing the data cleaning, parsing and preparation. This will involve parsing out symbols and punctuation that may interfere with the model's ability to analyze the given data. 
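
As a rough sketch of the kind of cleaning described above, the hypothetical helper `clean_review` below strips symbols and punctuation with a regular expression. The choice to lowercase and to keep digits is an assumption, not something the project specifies.

```python
import re

def clean_review(text):
    """Lowercase a review and strip everything except letters, digits and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # symbols and punctuation -> space
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_review("Great service!!!  Friendly staff :) 10/10"))
# -> great service friendly staff 10 10
```

Note that this simple rule also splits contractions ("don't" becomes "don t"), which may or may not matter for the sentiment model.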

## Data Vectorization

Below we vectorize the Yelp reviews for the various credit unions. Please refer to the Google Document for the list of URLs used to collect the Yelp reviews. The data is vectorized with 1-grams by default; the commented-out alternative in the cell below adds 2-grams.

Here we are importing a useful function that converts our data set of sentences into a large matrix of 0's and 1's.

In [None]:
# Data vectorization code can be found below 

from sklearn.feature_extraction.text import CountVectorizer
# Each row of the matrix represents a sentence and each column a distinct word
# from the data set. A 1 marks a word that is present in the sentence, a 0 one
# that is absent.

vectorizer = CountVectorizer(binary=True, lowercase=False)
# binary=False would count how often each word occurs in the sentence instead
# of just marking its presence.
# For some reason this wasn't working unless lowercase=False; possibly a problem
# with the cleaned data.
# The next line could be used instead to make the program more sophisticated,
# but larger and slower: besides a column for every word, there would also be a
# column for every pair of adjacent words in the data set. The 2 can be changed
# to any integer, but the matrix grows much larger as it increases.

#vectorizer = CountVectorizer(binary=True, lowercase=False, ngram_range=(1, 2))
# Finally, we call fit_transform on our cleaned data set 'review'.

vector = vectorizer.fit_transform(review)

# To use this with keras we first convert the sparse matrix to a numpy array.

data = vector.toarray()
print(type(data))
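
To make the layout of the 0/1 matrix concrete, here is a hand-rolled illustration on three made-up sentences. It mimics what a binary bag-of-words does but is not scikit-learn's implementation, and it splits on whitespace only.

```python
sentences = ["good service", "bad service", "good rates good staff"]  # made-up examples

# Vocabulary: every distinct word, sorted so the column order is stable.
vocab = sorted({word for s in sentences for word in s.split()})

# One row per sentence, one column per vocabulary word; 1 marks presence,
# so the repeated "good" in the last sentence still yields a single 1.
matrix = [[1 if word in s.split() else 0 for word in vocab] for s in sentences]

print(vocab)
for row in matrix:
    print(row)
# ['bad', 'good', 'rates', 'service', 'staff']
# [0, 1, 0, 1, 0]
# [1, 0, 0, 1, 0]
# [0, 1, 1, 0, 1]
```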

Then we split the data into training, validation, and testing sets at a 60:20:20 ratio. We are careful to keep an equal number of positive and negative sentiments in each set. Without this balance, our model would learn to guess, say, positive more often, simply because more reviews were positive!


In [None]:
# Split into train, validation, and test sets
# These slices assume 1000 reviews ordered so that one sentiment class fills
# the first half and the other class fills the second half

x_train = np.concatenate([data[:300], data[-300:]])
y_train = np.concatenate([sentiment[:300], sentiment[-300:]])
x_val = np.concatenate([data[300:400], data[600:700]])
y_val = np.concatenate([sentiment[300:400], sentiment[600:700]])
x_test = data[400:600]
y_test = sentiment[400:600]
print(x_train.shape)
print(x_val.shape)
print(x_test.shape)

## Sentiment Prediction 

We will now import the model we developed during our warm-up project to predict the sentiment of the credit union reviews. This can be done now that we have cleaned and vectorized our data. Since keras has a built-in prediction feature, we won't complicate the process by re-inventing the wheel. Instead, we simply pass our new data to keras's built-in predict method.

In [None]:
# Need to load our saved model into the program; this can be done with keras.models.load_model("Model Name.h5")

from keras.models import load_model

model = load_model("Model Name.h5")

# Now make our prediction using the new data set to see what the sentiment of our credit union reviews is like
# predict() takes only the inputs; labels are not passed in

sentiment = model.predict(x_train)

# Now print out our findings from the sentiment analysis conducted above, see what the sentiment of 
# the credit union reviews looks like

print(sentiment)


## Data Analytics 

Here we plot some of our data to look for any obvious patterns. It is best to do this before going into model creation so that we can make sure those patterns are accounted for at that stage. Here we simply look at asset data versus time.

In [None]:
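# Store our Vancity asset data into numpy arrays for convenience
# Assumption: float_data is the numeric array parsed from the file read in above,
# with the year in column 0 and Vancity assets in column 1

vassets  = float_data[:,1]
time  = float_data[:,0]
plt.plot(time, vassets, label = 'Asset Growth ($ CAD)')
plt.title('Asset Growth By Year')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Asset Growth ($ CAD)')
plt.show()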

As is evident from the plot above, we should expect a year-over-year increase in assets. This gives us a diagnostic for checking that the model's output is reasonable. Additionally, we should see yearly growth in excess of a few hundred million dollars, picking up to a few billion dollars around the year 2000.
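
The year-over-year diagnostic can be sketched in a couple of lines. The asset totals below are made up for illustration and are not Vancity's figures.

```python
import numpy as np

assets = np.array([5.0e9, 5.4e9, 6.1e9, 7.3e9])  # made-up asset totals in CAD
growth = np.diff(assets)                          # year-over-year change

print(growth)
print("always increasing:", bool(np.all(growth > 0)))
```

The same check could be applied to the model's predictions: a predicted series whose differences turn negative would be a sign that something is off.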

## Model Creation

Finally, after all the hard work of normalizing our data and taking a quick look at it, we get to the fun part: model creation. We will take the Yelp reviews and the provided asset growth data and predict future asset growth as our only output. This will be accomplished using an LSTM recurrent neural network model with _ layers and _ neurons. These were selected because they yielded the highest accuracy in our experimentation.

In [1]:
# Need to import some libraries from keras to create our model
# This will involve the use of keras sequential neural network models, layers and rmsprop optimizers

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop


# Now need to define how far back into the data set we wish to go, how often we wish to sample,
# and how far into the future we wish to predict 

# Will use all of our data to make future predictions, as such, will use length of time series column 
# since each of the credit unions gave variable historical data (i.e. did not share common report start date)

lookback  = len(time)

# Since we have limited annual data, will sample every year, therefore will use a step of 1

step = 1

# Will start by predicting asset growth one year into the future, and will extend outwards from there

delay  = 1

# The rest is very similar to the creation of the sequential neural network we made during the Warm-Up project
# Quick refresher though: need to define our model with a number of layers, an activation function, an optimizer function, and a loss function

# Define our model as a sequential one
model = Sequential()

# Add some layers to our model
# The LSTM layer expects 3D input of shape (samples, timesteps, features), so input_shape is (timesteps, features)
# return_sequences is left at its default (False) since we want a single prediction per sample
model.add(layers.LSTM(32, input_shape=x_train.shape[1:]))
model.add(layers.Dense(32, activation = 'relu'))
model.add(layers.Dense(1))

# Now compile our model with optimizer and loss functions, no metric for this one though
model.compile(optimizer = RMSprop(), loss = 'mae')

# x_train and y_train are arrays rather than a generator, so we use fit() instead of fit_generator()
history =  model.fit(x_train,
                     y_train,
                     epochs = 1,
                     validation_data = (x_val, y_val))

# Before we go any further, some important notes to make here. Will do that below in the "Model Notes" block
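
The roles of `lookback`, `step` and `delay` from the cell above can be illustrated with a small windowing sketch. The helper `make_windows` and the toy series are hypothetical and only show one common way to turn a yearly series into LSTM-ready samples; this is not necessarily how the project prepares its inputs.

```python
import numpy as np

def make_windows(series, lookback, step, delay):
    """Build (samples, targets) from a 1-D series.

    Each sample holds `lookback` past observations taken every `step` points;
    its target is the value `delay` points after the window's last observation.
    """
    samples, targets = [], []
    for end in range(lookback, len(series) - delay + 1):
        samples.append(series[end - lookback:end:step])
        targets.append(series[end + delay - 1])
    # Add a trailing feature axis: (samples, timesteps, features).
    return np.array(samples)[..., np.newaxis], np.array(targets)

toy_assets = np.array([1.0, 1.2, 1.5, 1.9, 2.4, 3.0])  # made-up yearly values
x_win, y_win = make_windows(toy_assets, lookback=3, step=1, delay=1)
print(x_win.shape, y_win)
```

With `lookback` equal to the full length of the series, as set above, no window leaves room for a target and the sketch returns zero samples, so in practice `lookback` would need to be shorter than the series.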


## Model Notes

1. Could have used other activation functions, 'relu' is a pretty popular one, but could use the likes of 'selu' and 'sigmoid'.

2. The number of layers we add is completely arbitrary and is usually driven by experimenting with the model to see what works the best for the project.

3. The number of epochs is another great place to play around. Too many training epochs can lead to overfitting, so we should experiment to see how many epochs yield the best result for the model.

4. The optimizer function is another area to play around as RMSprop may not always be the best choice for the project at hand.

5. The loss function selected here was another judgement call. Mean absolute error was used since we have actual numbers to compare with our model's predictions, so it makes sense to measure how far off those predictions are and work to reduce the error. Root mean square error could be used for the same purpose. Binary cross entropy, by contrast, is meant for two-class classification and would not suit a numeric target like this one.
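
To illustrate the difference between the two error measures mentioned in note 5, here is a small worked example on made-up numbers. It computes both by hand rather than through keras.

```python
import math

actual = [100.0, 200.0, 300.0]      # made-up target values
predicted = [110.0, 190.0, 330.0]   # made-up model outputs

errors = [p - a for p, a in zip(predicted, actual)]
mae = sum(abs(e) for e in errors) / len(errors)             # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean square error

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
# MAE:  16.67
# RMSE: 19.15
```

Because squaring weights large errors more heavily, the single miss of 30 pulls the RMSE above the MAE. Which behaviour is preferable depends on how costly large misses are for the project.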

## Model Output

We will now plot our model's training and validation losses to see whether we're overfitting and how the model is performing overall, then have the model make a prediction.

In [None]:
# Will grab our losses by going into the training history and defining appropriate variables to make plotting easier

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

# Now plot our training and validation losses

plt.figure()
plt.plot(epochs, loss, 'b', label = 'Training Loss')
plt.plot(epochs, val_loss, 'r', label = 'Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Model Training and Validation Losses By Epoch')
plt.legend()
plt.show()

# Will now have our model make a prediction
# X will be our input that we give to our model to make its prediction

prediction = model.predict(X)
print(prediction)

## Model Evaluation

As is evident from the plot and historical output above, our model is able to achieve accuracy in the range of _%-_%. Furthermore, it is evident that in using _ epochs, the model is able to avoid overfitting to the training data. As such, it can be concluded that our model is sufficiently effective in predicting future asset growth for Canadian Credit Unions using Yelp! reviews, and yearly asset growth. Lastly, it can be concluded that Vancity Credit Uni