In [20]:
#Imports needed
import numpy as np 
import tensorflow as tf
import csv
import pandas as pd

# Data Contents

Inspecting the csv file shows that it is ordered by species and split into 5 columns. From left to right, these columns are:

* Sepal Length
* Sepal Width
* Petal Length
* Petal Width
* Species (setosa, versicolor, virginica)
    
It is important to decide how to split your data according to your needs. In this repo we aim to feed data to a model for training and later test it for accuracy. On that basis, the first four columns (sepal length, sepal width, petal length and petal width) are our inputs, and species is our output.

Essentially:
1. x = Sepal Length, Sepal Width, Petal Length, Petal Width
2. y = Species (setosa, versicolor, virginica)
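As a minimal sketch of this split, the snippet below slices a few sample rows into inputs and outputs. The `rows` list is illustrative only and stands in for the contents of the csv file; the column order is assumed to match the list above.

```python
import numpy as np

# Illustrative sample rows, in the same column order as the csv file
rows = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["7.0", "3.2", "4.7", "1.4", "Iris-versicolor"],
    ["6.3", "3.3", "6.0", "2.5", "Iris-virginica"],
]

data = np.array(rows)

# x: the four measurements, converted from text to floats
x = data[:, :4].astype(np.float64)

# y: the species name in the last column
y = data[:, 4]

print(x.shape)  # (3, 4)
print(y)
```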

In [17]:
#Data file found at https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv
#Edited in Notepad++ to delete the first (header) row
#Reads the first 150 rows of the csv file into a list, closing the file afterwards
with open('C:\\users\\Damian Curran\\Desktop\\IRIS2.csv', newline='') as f:
    iriss = list(csv.reader(f))[0:150]

# Know your training and testing data sets

We are working with a single data set which serves as both our training and test sets. During our inspection above we saw that the data set is ordered, which is not what we want for training.

We must first shuffle the data set randomly and then separate the data into inputs and outputs. Note that separating the data first and then shuffling each part independently is not a good idea, as the measurements would no longer line up with their original flower type.

Luckily, numpy comes with a shuffle function, happy days.
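If the inputs and outputs have already been separated, one alternative (not the approach used in this notebook) is to apply a single shared permutation to both arrays so the rows stay aligned. The sketch below uses made-up toy data in which row i carries label i, so the alignment is easy to check; the names `features`, `labels` and the seed are assumptions for illustration.

```python
import numpy as np

# Toy data: feature row i is [2*i, 2*i + 1] and its label is i
features = np.arange(12, dtype=np.float64).reshape(6, 2)
labels = np.arange(6)

# Seed chosen only so the example is reproducible
rng = np.random.default_rng(seed=0)

# One random ordering of the row indices, applied to both arrays
order = rng.permutation(len(features))
shuffled_features = features[order]
shuffled_labels = labels[order]

# Every shuffled row still matches its original label
print(np.all(shuffled_features[:, 0] == 2 * shuffled_labels))  # True
```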

In [18]:
#Shuffles the rows in place
#https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.shuffle.html
np.random.shuffle(iriss)

#Splits off the first four columns: first 100 rows for training, remaining 50 for testing
irisTrainData = np.array(iriss[:100])[:,:4].astype(np.float64)
irisTestData = np.array(iriss[100:150])[:,:4].astype(np.float64)

#Splits off the last column (species), stored as byte strings
irisTrainType = np.array(iriss[:100])[:,4:].astype('S15')
irisTestType = np.array(iriss[100:150])[:,4:].astype('S15')

# Making our model

Now to create our model. Note that no data is being fed in here; this only defines the model for later use.

Let's define some terms here:
* Placeholder: a slot where data is fed in at run time; training does not change it
* Variable: a value TensorFlow knows it can change during training
* Matmul: TensorFlow matrix multiplication function (tf.matmul)
* Argmax: TensorFlow function that returns the index of the largest value along an axis
* tf.nn.softmax_cross_entropy_with_logits: TensorFlow function which calculates cross-entropy after applying softmax to the logits
* cross_entropy: measures how far the predicted probabilities are from the true labels; it is used as the cost of a softmax layer
* softmax: normalizes data; it "squishes" each row of values into positive numbers whose sum = 1
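To make softmax, argmax and cross-entropy concrete, here is a small NumPy sketch of the same calculations. It is only an illustration of the maths, not TensorFlow's implementation, and the logits and labels are made-up values.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row maximum avoids overflow and does not change the result
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, one_hot_labels):
    # Negative log of the probability given to the true class, one value per row
    return -np.sum(one_hot_labels * np.log(probs), axis=1)

logits = np.array([[1.0, 3.0, 2.0]])  # illustrative unscaled scores
labels = np.array([[0.0, 1.0, 0.0]])  # one-hot label: true class is index 1

probs = softmax(logits)
print(probs)                         # three positive values that sum to 1
print(probs.argmax(axis=1))          # index of the largest probability
print(cross_entropy(probs, labels))  # small when the true class gets high probability
```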

In [19]:
#this accepts any number of rows of size 4; the size is four because we feed it:
#Sepal Length, Sepal Width, Petal Length, Petal Width.
x = tf.placeholder(tf.float32, [None, 4])

#this will hold the outputs as one-hot vectors (setosa, versicolor, virginica)
y = tf.placeholder(tf.float32, [None, 3])

#this represents the index of species types, we'll see more of this later in the code
y_true_cls = tf.placeholder(tf.int64, [None])

#These are adjusted by TensorFlow during training to move the predictions towards the correct output
weights = tf.Variable(tf.zeros([4, 3]))
biases = tf.Variable(tf.zeros([3]))

#stores the unscaled matrix multiplication into "logits"
logits = tf.matmul(x, weights) + biases

#saves normalized data into y_pred
y_pred = tf.nn.softmax(logits)
#gets the index of the largest value, e.g. for [1,3,2] the index is 1
y_pred_cls = tf.argmax(y_pred, axis=1)

#softmax and cross_entropy can be computed separately, but
#this combined function is more numerically stable and takes fewer lines of code
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits,
                                                        labels=y)

#computes the mean of elements across a tensor
cost = tf.reduce_mean(cross_entropy)

#training needs an optimizer to minimize the cost, GradientDescent is a common choice
#https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.05).minimize(cost)

#tf.equal returns bools(true, false) and stores in correct_prediction
correct_prediction = tf.equal(y_pred_cls, y_true_cls)

#we then cast correct_prediction to floats, giving 0 for false and 1 for true
#we then use the reduce_mean function to find the average, i.e. the fraction of correct predictions
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
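The accuracy calculation at the end of the cell above can be checked by hand. The NumPy sketch below mirrors the equal, cast and mean steps on made-up class indices; the arrays are hypothetical and only there to show the arithmetic.

```python
import numpy as np

# Hypothetical predicted and true class indices for five flowers
y_pred_cls = np.array([0, 2, 1, 1, 0])
y_true_cls = np.array([0, 2, 2, 1, 1])

# Element-wise comparison gives booleans: [True, True, False, True, False]
correct_prediction = np.equal(y_pred_cls, y_true_cls)

# Casting to float turns True/False into 1.0/0.0; the mean is the fraction correct
accuracy = correct_prediction.astype(np.float32).mean()

print(accuracy)  # 3 correct out of 5
```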

# Configuring data

Here we convert our labels from categorical data to numerical data using one-hot encoding.

We do this because it would be improper to say setosa = 1 and versicolor = 2; that would imply versicolor is greater than setosa, which is not meaningful.

There are many ways to do this; for this example I will be using the pandas library.

In [22]:
#This converts each species to a one-hot vector, e.g. setosa becomes [1,0,0]
trainTypeHot = pd.get_dummies(irisTrainType.ravel())
testTypeHot = pd.get_dummies(irisTestType.ravel())

testTypeHot

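| | b'Iris-setosa' | b'Iris-versicolor' | b'Iris-virginica' |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 1 | 0 |
| 6 | 1 | 0 | 0 |
| 7 | 0 | 0 | 1 |
| 8 | 0 | 1 | 0 |
| 9 | 0 | 1 | 0 |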
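
The y_true_cls placeholder defined earlier expects integer class indices rather than one-hot rows. One possible way to move between the two forms is sketched below in NumPy; this is an assumption about how the indices could be derived, not necessarily what the rest of the notebook does. Using a fixed `classes` list (a name introduced here for illustration) keeps the column order identical for the training and test sets, even if a species happens to be missing from one split.

```python
import numpy as np

# Fixed column order for the one-hot vectors
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Hypothetical sample of species labels
labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor"]

# Map each label to its class index, then pick the matching row of the identity matrix
indices = np.array([classes.index(name) for name in labels])
one_hot = np.eye(len(classes), dtype=np.float32)[indices]

# argmax along each row recovers the class index from the one-hot vector
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)  # [0 2 1]
```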
