<a href="https://colab.research.google.com/github/Quant-Projects/Algo-Trading-Examples/blob/master/Untitled8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [0]:
train = pd.read_csv("train.csv")

# Exploratory Phase

Now that we have imported our training data, let's start with the exploratory phase.  We are first going to view the data, look at its summary statistics, and then find and replace any missing values.

In [70]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Just looking at our data, there are a few columns that we don't need.  For starters, we will want to get rid of the IDs.  We will also want to get rid of the cabin, as well as the ticket number.  And just for this tutorial, we are also going to get rid of the names of the passengers (if you want a challenge, you can use feature engineering to create new features from the dropped columns).

In [0]:
train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
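For readers who want to try the feature engineering challenge, here is a minimal sketch of one possible feature: the honorific in each passenger's name. The `Title` column and the regular expression are assumptions made for illustration, not part of this guide's pipeline, and on the real data the extraction would have to run before `Name` is dropped.

```python
import pandas as pd

# Toy rows copied from the train.head() output above.
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
    ]
})

# Hypothetical feature: the text between the comma and the first period.
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
print(toy["Title"].tolist())
```

A title like this could then be turned into dummy variables in the same way as the other text columns.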

Now that we have only the data that we need, we should look at the data types of the remaining columns.

If the dtype of a column is object (that is, it holds strings), then we will want to convert it to something machine-learning readable.  For this example, we are going to convert such columns to dummy variables using

```
pd.get_dummies(df)
```



In [0]:
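#pd.get_dummies() returns one indicator column per category, so we keep a single column for each feature.
train['Sex'] = pd.get_dummies(train['Sex'], dtype='uint8')['female']  #1 = female, 0 = male
train['Embarked'] = pd.get_dummies(train['Embarked'], dtype='uint8')['C']  #1 = embarked at C, 0 = anything else (including missing)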
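It helps to see what `pd.get_dummies()` actually returns before storing its result. The sketch below uses a made-up three-row frame (the values are assumptions for illustration): the function produces one indicator column per category, and a missing value becomes a row of all zeros rather than a `NaN`.

```python
import pandas as pd

# Made-up rows; the last passenger has no recorded port.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
})

# One indicator column per category, ordered alphabetically.
sex_dummies = pd.get_dummies(toy["Sex"], dtype="uint8")
embarked_dummies = pd.get_dummies(toy["Embarked"], dtype="uint8")

# Picking one column gives a single 0/1 feature.
is_female = sex_dummies["female"]

print(sex_dummies)
print(embarked_dummies)
```

Because the missing port silently turns into zeros, a later `isnull()` check will not report it.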

In [73]:
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,0,22.0,1,0,7.25,0
1,1,1,1,38.0,1,0,71.2833,1
2,1,3,1,26.0,0,0,7.925,0
3,1,1,1,35.0,1,0,53.1,0
4,0,3,0,35.0,0,0,8.05,0


OK, now that everything appears to be ML readable, we can get some stats using ```df.describe()``` and ```df.info()```.



In [74]:
train.describe()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.352413,29.699118,0.523008,0.381594,32.204208,0.188552
std,0.486592,0.836071,0.47799,14.526497,1.102743,0.806057,49.693429,0.391372
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,20.125,0.0,0.0,7.9104,0.0
50%,0.0,3.0,0.0,28.0,0.0,0.0,14.4542,0.0
75%,1.0,3.0,1.0,38.0,1.0,0.0,31.0,0.0
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292,1.0


In [75]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null uint8
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    891 non-null uint8
dtypes: float64(2), int64(4), uint8(2)
memory usage: 43.6 KB


Looking at the stats, we can learn a few useful things about the data.  For one, since the mean of the Survived column is 0.38, only about 38% of the passengers survived.  Another is that since the mean age was a little over 29 years, the people on board were fairly young.  Lastly, since the mean of the Sex column is about 0.35 (and 1 encodes female), about 65% of the passengers were male.

You can get much more from just these simple reports, but we are going to move on.  If you want, you can try to come up with some more observations from the stats yourself.

Now that we have learned something about the variables, we need to see what is missing.  To do this, we are going to use ```df.isnull().any()```.

In [76]:
train.isnull().any()

Survived    False
Pclass      False
Sex         False
Age          True
SibSp       False
Parch       False
Fare        False
Embarked    False
dtype: bool
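`any()` only says whether a column has missing values, not how many. As a small sketch on a made-up frame (the values are assumptions for illustration), `isnull().sum()` gives the count per column and `isnull().mean()` gives the share.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing ages out of four rows.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of rows that are missing

print(missing_count)
print(missing_share)
```

Knowing the share matters for the next step, since it decides whether dropping rows is affordable.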

Looking at the output, we can say that the only column with any missing values is the age column.  

Now before we just start randomly filling and/or removing missing values, we need to do some detective work.  For this part, we are going to try to work out why the data is missing.

For the most part there are only two reasons that data could go missing: someone didn't record it, or it doesn't exist.  Since the case where data simply wasn't recorded is self-explanatory, we are just going to give an example of data that doesn't exist.

Imagine that I am interviewing people randomly and asking them their favorite ice cream flavor.  If I ask some people who are allergic to ice cream, then they would of course not have a favorite flavor.  Therefore, there will be missing data for them.  Now, the interviewer could decide to leave those people out, but in many cases, they decide to leave them in, provided that not too much of the data is left empty.

Now in the case where data doesn't exist, it may be best to just replace it with the most common value, the mean, the median, etc. (provided that not too much of the data is missing).  However, if the data just wasn't recorded, we can usually just remove those rows.

Now in our example, it is very rare that someone's age doesn't exist, so the ages were most likely just not recorded.  Even so, it isn't a good idea to simply remove those rows here: only 714 of the 891 ages are present, so 177 rows (about 20% of the data) would be lost.  We are instead going to replace the missing ages with the mean.

In [0]:
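#assign the result back instead of filling in place on a column selection, which newer pandas versions warn about
train['Age'] = train['Age'].fillna(train['Age'].mean())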
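To see the trade-off between dropping and filling, here is a minimal sketch on a made-up frame (the values are assumptions for illustration). Dropping loses rows, while mean imputation keeps every row and the column mean but shrinks the spread of the column.

```python
import numpy as np
import pandas as pd

# Made-up frame: two of the five ages are missing.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, np.nan, 30.0],
    "Fare": [7.0, 8.0, 9.0, 10.0, 11.0],
})

# Option 1: drop the rows with a missing age.
dropped = toy.dropna(subset=["Age"])

# Option 2: fill the missing ages with the mean of the observed ages.
filled = toy.assign(Age=toy["Age"].fillna(toy["Age"].mean()))

print(len(dropped), len(filled))
print(toy["Age"].std(), filled["Age"].std())
```

The smaller standard deviation after filling is the price of mean imputation: the filled values all sit exactly at the center of the distribution.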

In [78]:
#now let's check to see if it worked!
train.isnull().any()

Survived    False
Pclass      False
Sex         False
Age         False
SibSp       False
Parch       False
Fare        False
Embarked    False
dtype: bool

In [93]:
train.drop(['Survived'], axis=1).shape

(891, 7)

Now if you want to go above and beyond, you can examine the distributions of the data, correlations, and so on.  But for this guide, we are going to move straight on to the model design phase.

# Model Design Phase

Now looking back at our previous guides, we used simple supervised learning models.  However, the highest we ever got on the test set was about 80% accuracy (this was even with Random Forest).  So, for this guide, we are going to attempt to beat this score using an ANN (artificial neural network) built with TensorFlow.

For our model, we are going to use a simple network with 2 hidden layers using RELU (since RELU usually converges faster).  Now before we just start coding, we are going to lay out the details of our ANN architecture.

To start, we need to know how many nodes to have in our input layer.  This is very easy, since the number of nodes in our input layer is simply just the number of inputs that our data has.  We can get this using ```df.shape```.  

Second, we are going to find the next easiest thing, the number of nodes in our output layer.  This is simply the number of unique values in our target column.  For this example, there are only two unique values, either survived or died.  So, we are going to use 2 nodes.

Next, we are going to find the number of nodes in our hidden layers.  For this, there isn't really a fixed number to set.  When deciding, there are a few rules of thumb.  One, the more nodes per layer, the longer training will take, and the higher the chance of overfitting.  Two, the fewer the nodes, the more likely you are to underfit.  Three, your first hidden layer should have more nodes than your second; otherwise training may take longer, and you may also run into vanishing or exploding gradients.

So, for this example, we are going to use 300 nodes in the first hidden layer and 100 nodes in the second.

In [79]:
#get the input shape
train.shape

(891, 8)

In [0]:
#create vars. to hold layer sizes
n_inputs = 7  #8 columns in train minus the Survived target column
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 2
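To get a feel for how the layer sizes drive model size, here is a rough sketch that counts trainable parameters. It assumes every layer is fully connected with one bias per node and counts only the 7, 300, 100, 2 chain of sizes; `count_parameters` is a helper made up for this illustration.

```python
# Layer sizes: inputs, first hidden layer, second hidden layer, outputs.
layer_sizes = [7, 300, 100, 2]


def count_parameters(sizes):
    """Count weights and biases for a chain of fully connected layers."""
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out  # one weight per input/output pair
        total += fan_out           # one bias per output node
    return total


n_params = count_parameters(layer_sizes)
print(n_params)
```

Most of the parameters sit between the two hidden layers, which is why widening those layers quickly increases training time and the risk of overfitting on only 891 samples.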

In [0]:
#now let's create placeholders for our X and y inputs.
#we specify None for the first dimension, since we want to be able to input any number of samples.
X = tf.placeholder(tf.float32, shape=(None, n_inputs))

#(None,) is a 1-D vector of labels; note that (None) without the comma is just None, i.e. an unknown shape.
y = tf.placeholder(tf.int64, shape=(None,))

Now that we have created placeholders to hold our input variables, we can go on and create the actual node and layer part of our model.  For this we are going to use ```tf.layers.dense()```.  This function creates a fully connected layer, in which every node is connected to every output of the previous layer.

We are first going to create an input layer, then we are going to create the first hidden layer and specify the outputs from the input layer as its input.  Then, we are going to create the second hidden layer, and specify the outputs from the first hidden layer as its input.  Lastly, we are going to create the output layer, and specify the outputs from the second hidden layer as its input.  As mentioned before, we are going to set the activation function of the hidden layers to RELU, since it usually converges faster than the sigmoid function.

You can read up on how exactly an ANN works [here](https://www.digitaltrends.com/cool-tech/what-is-an-artificial-neural-network/).

In [0]:
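with tf.name_scope("nn"):  #create a name scope to hold all of the ANN nodes and layers.
  #note that this "input layer" is itself a dense layer with n_inputs nodes applied to X.
  input_layer = tf.layers.dense(X, n_inputs, activation=tf.nn.relu)
  hidden_layer1 = tf.layers.dense(input_layer, n_hidden1, activation=tf.nn.relu)
  hidden_layer2 = tf.layers.dense(hidden_layer1, n_hidden2, activation=tf.nn.relu)
  #no activation on the output layer: the loss applies softmax to these raw scores (logits).
  output_layer = tf.layers.dense(hidden_layer2, n_outputs)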
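To make the layer stacking concrete, here is a minimal NumPy sketch of a forward pass through dense layers of the same sizes. The `dense` and `relu` helpers, the random weights, and the made-up input batch are assumptions for illustration; they mimic the arithmetic of a fully connected layer, not TensorFlow's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    """Elementwise max(z, 0)."""
    return np.maximum(z, 0.0)


def dense(x, n_out, activation=None):
    """Fully connected layer with freshly drawn random weights (illustration only)."""
    n_in = x.shape[1]
    w = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))
    b = np.zeros(n_out)
    z = x @ w + b  # every output node sees every input
    return activation(z) if activation is not None else z


x = rng.normal(size=(5, 7))  # 5 made-up passengers, 7 features each
h1 = dense(x, 300, relu)     # first hidden layer
h2 = dense(h1, 100, relu)    # second hidden layer
logits = dense(h2, 2)        # raw class scores, no activation

print(h1.shape, h2.shape, logits.shape)
```

Each layer only changes the second dimension, so the batch size (the `None` in the placeholder shape) passes through untouched.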

Now that we have created our ANN layers, we need to specify a loss function.  If you read the link above, you will have learned that an ANN uses backpropagation to work out how to change its parameters so that the loss function decreases.  This process is known as training the model.  So, in order to actually train the model, we are going to need to create some sort of loss function.

In our case, since we are doing a classification problem, the metric we ultimately care about is accuracy.  However, there is a catch: our model actually outputs a score for each class (like softmax models) instead of just the class, and accuracy gives no gradient to train on, so we will use cross entropy as our loss.  Since our labels are integer classes (0 or 1) rather than one-hot vectors, we can use ```tf.nn.sparse_softmax_cross_entropy_with_logits()``` to find the cross entropy of our model (its one-hot counterpart, ```tf.nn.softmax_cross_entropy_with_logits_v2()```, expects labels with the same shape as the logits; the older ```tf.nn.softmax_cross_entropy_with_logits()``` is soon to be deprecated as of this post).  Since this function finds the cross entropy for each prediction our model makes, we will be left with a big list of losses.  Since training needs a single number to minimize, we will just take the mean of these losses.  For this we will use ```tf.reduce_mean()```.

In [0]:
with tf.name_scope("loss"):  #create name scope to hold loss operations
  #y holds integer class labels, so we use the sparse variant (the non-sparse one expects one-hot labels).
  cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=output_layer)
  loss = tf.reduce_mean(cross_entropy)
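To make the cross-entropy computation concrete, here is a minimal NumPy sketch with made-up logits and labels (all values and helper names are assumptions for illustration). It computes the per-example loss from integer labels and from one-hot labels and shows that the two forms agree, which is why the label format has to match the function that is called.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_cross_entropy(logits, labels):
    """Per-example loss from integer class labels of shape (n,)."""
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels])


def one_hot_cross_entropy(logits, one_hot):
    """Per-example loss from one-hot labels of shape (n, n_classes)."""
    return -(one_hot * np.log(softmax(logits))).sum(axis=1)


logits = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels = np.array([0, 1, 0])   # integer classes, shape (3,)
one_hot = np.eye(2)[labels]    # same labels, shape (3, 2)

per_example = sparse_cross_entropy(logits, labels)
loss = per_example.mean()      # single number to minimize
print(per_example, loss)
```

The third example has equal scores for both classes, so its loss is log(2), the loss of a coin flip.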

And now that we have created a loss function, we will need a training step to actually try to minimize it.  Thankfully, TensorFlow comes with a ton of these operations built in.  For this example, we are just going to use a simple gradient descent operation (use ```tf.train.GradientDescentOptimizer()```).  However, if you want to change things up, you can use other training operations (you can read up on them [here](https://www.tensorflow.org/api_docs/python/tf/train)).

After we create an optimizer object, we then use ```object.minimize(loss)``` to create the operation that performs one training step each time it is run.

In [0]:
with tf.name_scope("train"):  #create a name scope to hold the training operations
  minimizer = tf.train.GradientDescentOptimizer(0.001)  #the parameter here is the learning rate.
  train_op = minimizer.minimize(loss)
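As a rough picture of what each training step does, here is plain gradient descent on a one-parameter toy loss. The loss `(w - 3)^2`, the starting point, and the `gradient_descent` helper are assumptions for illustration; the real optimizer applies the same update rule to every weight in the network.

```python
def gradient_descent(grad, w0, learning_rate, n_steps):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w


# Toy loss: loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = gradient_descent(
    grad=lambda w: 2.0 * (w - 3.0),
    w0=0.0,
    learning_rate=0.1,
    n_steps=100,
)
print(w_final)
```

The learning rate controls the step size: too small and training crawls, too large and the updates can overshoot the minimum.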

Looks like we are almost done!  Now that we have created a trainable model, we are going to want to create a way to evaluate its performance.  First, we are going to find which predictions are correct using ```tf.nn.in_top_k()```, which checks whether the true class is among the k highest-scoring classes (here k is 1).  Then we are going to find the accuracy by taking the mean of the correct variable.

In [0]:
with tf.name_scope("eval"):
  correct = tf.nn.in_top_k(output_layer, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
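The same evaluation can be sketched in NumPy on made-up scores (the logits and labels below are assumptions for illustration). With k equal to 1 and no ties, checking the top class amounts to comparing the highest-scoring class with the label.

```python
import numpy as np

# Made-up class scores for four passengers and their true classes.
logits = np.array([[2.0, 0.5], [0.1, 1.2], [0.3, 0.9], [1.5, 1.0]])
labels = np.array([0, 1, 0, 0])

predicted = logits.argmax(axis=1)             # highest-scoring class per row
correct = predicted == labels                 # True where the prediction is right
accuracy = correct.astype(np.float32).mean()  # fraction of correct predictions

print(predicted, accuracy)
```

Casting the booleans to floats before averaging is the same trick the cell above uses with `tf.cast()`.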

# Training Phase

At last, we are ready to actually train our model.  For this, we first need to use ```tf.global_variables_initializer()``` to create an operation that initializes all of the variables that TensorFlow created (placeholders are not initialized; they are fed at run time).  Next, we will need to execute everything in a session using ```tf.Session()```.  A simple ```with``` statement is ideal.  Lastly, we will train our model by running the train_op operation in a loop.  Since we feed the whole dataset on every step, the number of times that we loop through our training operation is the number of epochs.

It is also worth noting that if we had a large number of features and/or samples in our dataset, we would want to use batches instead of the whole dataset.  However, since we do not have a lot of either, we can just feed the data directly to TensorFlow.

When we call our training operation, we will also need to feed it values to fill the placeholders.  Remember that we need to feed data to our placeholders, or else we will cause an error as we will basically be performing arithmetic on null variables.
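For larger datasets, batching could look something like the sketch below. The `iterate_minibatches` helper and the made-up arrays are assumptions for illustration; each yielded pair would be what gets fed to the placeholders on one training step.

```python
import numpy as np


def iterate_minibatches(features, targets, batch_size, rng):
    """Yield shuffled (features, targets) batches that cover the data once."""
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], targets[idx]


rng = np.random.default_rng(0)
features = np.arange(20, dtype=np.float32).reshape(10, 2)  # 10 made-up samples
targets = np.arange(10)

batches = list(iterate_minibatches(features, targets, batch_size=4, rng=rng))
batch_sizes = [len(xb) for xb, _ in batches]
print(batch_sizes)
```

One full pass over all of the batches is one epoch, and the last batch is simply smaller when the batch size does not divide the number of samples.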

In [91]:
init = tf.global_variables_initializer()

with tf.Session() as sess:
  init.run()
  
  data_x = train.drop(['Survived'], axis=1).values
  data_y = train['Survived'].values
  
  for _ in range(1000):
    sess.run(train_op, feed_dict={X:data_x, y:data_y})

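If this cell raises an ```InvalidArgumentError```, the most likely cause is a mismatch between the labels and what the loss expects: ```data_y``` holds integer class labels of shape (891,), which ```tf.nn.sparse_softmax_cross_entropy_with_logits()``` accepts directly, whereas ```tf.nn.softmax_cross_entropy_with_logits_v2()``` expects one-hot labels with the same shape as the logits, (891, 2).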