## MACHINE LEARNING ALGORITHMS PART 2.

### Classification

I will be using information and examples based on the **TensorFlow 2.0** course provided by FreeCodeCamp and on the TensorFlow documentation ([premade Estimators tutorial](https://www.tensorflow.org/tutorials/estimator/premade)).

First, let's import the requirements 

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals
import numpy as np
import pandas as pd
import tensorflow as tf

print(tf.version)

<module 'tensorflow_core._api.v2.version' from '/Users/christinadelta/environments/tensorflowTut_env/lib/python3.7/site-packages/tensorflow_core/_api/v2/version/__init__.py'>


### What is classification?

Classification means differentiating between data points and separating them into classes. 
Rather than predicting a numeric value (y labels) as in Regression, in Classification we predict classes.
More precisely, for each data point/entry we predict the probability that it belongs to each of the possible classes.
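
To make the idea of "a probability for every class" concrete, here is a minimal sketch in plain Python. It assumes the model produces one raw score per class and converts the scores to probabilities with a softmax. The scores are made-up numbers for illustration, not output from the model trained later.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    # subtract the largest score first for numerical stability
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical raw scores for Setosa, Versicolor, Virginica
scores = [0.5, 2.0, 1.0]
probabilities = softmax(scores)

# the predicted class is the one with the highest probability
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

The three probabilities always add up to 1, and the class with the largest score also gets the largest probability.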

Here, we'll see how to solve a classification problem in TensorFlow using **Estimators**.  

### Loading the Data

The dataset that I will use here comes from the TensorFlow website. 

The dataset separates Iris flowers into three different species (which will be our labels/y values):
* Setosa 
* Versicolor
* Virginica

It also contains four measurements for each flower (which will be the features/x values):
* Sepal length
* Sepal width
* Petal length
* Petal width

Based on the information above, let's define a few helpful constants:

In [2]:
FEATURE_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'] # CSV column headers: four features plus the label column
SPECIES_COLUMN_NAMES = ['Setosa', 'Versicolor', 'Virginica'] # label names, indexed by their numeric id

In [3]:
# Download the Iris datasets using Keras. We need to download two separate datasets:
# one for training and one for testing
train_path = tf.keras.utils.get_file(
    "iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
    "iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")


In [5]:
# Read the datasets using Pandas
train_df = pd.read_csv(train_path, names = FEATURE_COLUMN_NAMES, header = 0)
test_df = pd.read_csv(test_path, names = FEATURE_COLUMN_NAMES, header = 0)

In [9]:
# inspect a few things in the datasets
# train_df.head(10)
train_df.shape
test_df.shape
#test_df.head(20)
train_df.head(20)

   SepalLength  SepalWidth  PetalLength  PetalWidth  Species
0          6.4         2.8          5.6         2.2        2
1          5.0         2.3          3.3         1.0        1
2          4.9         2.5          4.5         1.7        2
3          4.9         3.1          1.5         0.1        0
4          5.7         3.8          1.7         0.3        0
5          4.4         3.2          1.3         0.2        0
6          5.4         3.4          1.5         0.4        0
7          6.9         3.1          5.1         2.3        2
8          6.7         3.1          4.4         1.4        1
9          5.1         3.7          1.5         0.4        0


The train_df has 120 rows (one value per feature/label for each flower) and the test_df has only 30. It's a pretty small dataset.
When running ```train_df.head()``` we notice that the **species** are defined numerically (0, 1, 2). These numerical representations of Species are:
* Setosa = 0
* Versicolor = 1
* Virginica = 2
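
Because the numeric labels are just positions in an ordered list of names, converting between the two is a simple list lookup. A small sketch, using the first five labels from the training rows shown above (the `SPECIES` list here is a local copy for illustration):

```python
# class names ordered so that the list index matches the numeric label
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

# the first five Species values from the training data
numeric_labels = [2, 1, 2, 0, 0]

# look up each numeric label to get its species name
species_names = [SPECIES[label] for label in numeric_labels]
print(species_names)
```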

The remaining values (our features) are measured in cm. 

Now let's extract the labels from the datasets using the ```dataframe.pop()``` method.

In [10]:
train_y = train_df.pop('Species')
test_y = test_df.pop('Species')

# check that the Species column is gone
print(train_df)
print(train_y)

     SepalLength  SepalWidth  PetalLength  PetalWidth
0            6.4         2.8          5.6         2.2
1            5.0         2.3          3.3         1.0
2            4.9         2.5          4.5         1.7
3            4.9         3.1          1.5         0.1
4            5.7         3.8          1.7         0.3
..           ...         ...          ...         ...
115          5.5         2.6          4.4         1.2
116          5.7         3.0          4.2         1.2
117          4.4         2.9          1.4         0.2
118          4.8         3.0          1.4         0.1
119          5.5         2.4          3.7         1.0

[120 rows x 4 columns]
0      2
1      1
2      2
3      0
4      0
      ..
115    1
116    1
117    0
118    0
119    1
Name: Species, Length: 120, dtype: int64


## Programming with Estimators 

Once the dataframe is set up, we define our model using a TensorFlow Estimator. An **Estimator** is any class derived from ```tf.estimator.Estimator```. TensorFlow provides a collection of pre-made Estimators in ```tf.estimator```, such as ```LinearRegressor```, that implement common ML algorithms. We can write our own Estimators, but TensorFlow suggests that beginners use the pre-made ones. To use a pre-made Estimator we need to perform a few tasks:
* Create an input function
* Define the model's feature columns
* Instantiate an Estimator, specifying the feature columns and other parameters
* Call one or more methods on the Estimator object, passing the appropriate input function as the source of the data

For more information about pre-made Estimators look at the TensorFlow Documentation [link](https://www.tensorflow.org/tutorials/estimator/premade).

### Create an Input function

The **input function** creates a ```tf.data.Dataset``` object whose outputs are two-element tuples. These tuples contain:

* **Features** in the form of dictionaries (with a key and a value):
    * Each key is the name of the feature
    * Each value is an array with that feature's values
    
* **Label** which is an array with the values of the labels

Here is the format of such an input function:

In [None]:
def input_evaluation_func():
    features = {'SepalLength': np.array([6.4, 5.0]),
               'SepalWidth': np.array([2.8, 2.3]),
               'PetalLength': np.array([5.6, 3.8]),
               'PetalWidth': np.array([2.2, 1.0])}
    labels = np.array([2,1])
    # return the two-element tuple: (features dictionary, labels array)
    return features, labels

The input function may generate the features and labels any way we like. The above function will not be used; it was created just to give an idea of the dataset object with its two-element tuples.

Now, let's create an input function the way it was done in Regression:

In [11]:
def input_fn(features, labels, training = True, batch_size = 256):
    # convert the inputs to a Dataset object 
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    
    # shuffle and repeat if you are training the model
    if training:
        dataset = dataset.shuffle(1000).repeat()
        
    return dataset.batch(batch_size)
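
To get a feel for what shuffling, repeating and batching do to the data, here is a plain-Python imitation. It is only a rough sketch of the behaviour, not how `tf.data` is implemented, and the `batches` helper and its `seed` parameter are hypothetical names made up for this illustration.

```python
import itertools
import random

def batches(examples, batch_size, training=True, seed=0):
    """Yield lists of examples, loosely imitating shuffle().repeat().batch()."""
    examples = list(examples)
    if training:
        rng = random.Random(seed)

        def stream():
            # reshuffle and start over forever, like shuffle().repeat()
            while True:
                rng.shuffle(examples)
                yield from examples

        source = stream()
    else:
        # a single ordered pass, as used for evaluation
        source = iter(examples)

    while True:
        batch = list(itertools.islice(source, batch_size))
        if not batch:
            return
        yield batch

# evaluation: one pass over 5 examples, so the last batch is smaller
eval_batches = list(batches(range(5), batch_size=2, training=False))
print(eval_batches)

# training: the stream never ends, so we take only the first 3 batches
train_batches = list(itertools.islice(batches(range(5), batch_size=2), 3))
print(train_batches)
```

In training mode every batch is full because the data keeps repeating, which is why training has to be stopped by a number of steps rather than by running out of data.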

### Defining the feature columns

A **feature column** is an object that describes how the model should use the input data from the features dictionary (the dataset object). So when we build an Estimator model, we pass it a list of feature columns, one for each of the features that we want our model to use. To do that we use the ```tf.feature_column``` module in a similar way to what was done in regression. Only here we don't have to convert any categorical values to numeric, so it's much easier. 

For this dataset I'll make a list of feature columns that tells the model to represent each of the features as a 32-bit floating-point value:

In [12]:
# Feature columns tell the model how to use an input:
my_feature_columns = []
for key in train_df.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key = key))
    
print(my_feature_columns)

[NumericColumn(key='SepalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='SepalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='PetalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='PetalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]


Now that we have the description of how we want the model to represent our features we can build the model.

### Building the Model

With this dataset we have a classic Classification task and two main choices of pre-made Estimator:
* ```.DNNClassifier``` for deep models that perform multi-class classification
* ```.LinearClassifier``` based on linear models. A linear classifier works similarly to linear regression, except that it does classification rather than regression, so we get the probability of each label rather than a numeric value. 

For this classification task ```tf.estimator.DNNClassifier``` seems like an appropriate option, because the relationship between the features and the species may not be linear, in which case the ```.LinearClassifier``` would not capture it. 

Here is how we can use this model (Estimator):

In [13]:
# build a DNN with 2 hidden layers of 30 and 10 nodes respectively
# and three classes for the model to choose from
classifier = tf.estimator.DNNClassifier(
    feature_columns = my_feature_columns, hidden_units = [30, 10], n_classes = 3)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/04/l_r6_2qd69907ztc_b_d2qz00000gn/T/tmp7gmac15_', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


Let's break down the code. ```tf.estimator``` is a module that contains a number of pre-made models (Estimators), and the DNNClassifier is one of them. 
For this model what we need to pass is:
* our feature_columns list (created earlier)
* hidden_units, a list with the number of nodes in each hidden layer
* n_classes, the number of classes

#### Hidden Layers 

Choosing the hidden layers amounts to defining the architecture of the neural network. We have an input layer, hidden layers and an output layer. Neural Networks will be discussed later on. 
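
To get a rough sense of the size of this architecture, we can count its trainable parameters. This is a sketch under an assumption: that every layer is a plain fully connected layer with one bias per node. The internals of the pre-made Estimator are not shown in this notebook, so treat the number as an estimate. `count_parameters` is a helper made up for this example.

```python
def count_parameters(layer_sizes):
    """Count weights plus biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # one weight per input/output pair, plus one bias per output node
        total += n_in * n_out + n_out
    return total

# 4 input features -> 30 hidden nodes -> 10 hidden nodes -> 3 output classes
layer_sizes = [4, 30, 10, 3]
n_parameters = count_parameters(layer_sizes)
print(n_parameters)
```

Under that assumption the layers contribute 150, 310 and 33 parameters, so the network is small.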

#### Classes

We defined the classes at the beginning. The label in this dataset takes 3 values (setosa, versicolor, virginica). These are our classes.

## Training, evaluating, predicting 

Now that our model is ready we can call methods to train it, evaluate it and make predictions. 

### Train the model

We can train our model by calling the ```.train``` method:

In [14]:
# train the model 
classifier.train(
    input_fn = lambda : input_fn(train_df, train_y, training=True),
    steps = 5000)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /var/folders/04/l_r6_2qd69907ztc_b_d2qz00000gn/T/tmp7gmac15_/model

<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifierV2 at 0x1120d4710>

#### Breaking down what we did above:

Let's start with the **input function**. In Regression we wrapped the input function inside an outer function that returned it. Here we don't need that function within a function: to pass the input function to the model we use something called a **lambda**.

##### what is lambda?
A lambda is an anonymous function that can be defined in one line. Writing lambda creates a function that, when it is called, evaluates the expression that comes after the colon. Thus, here we have a chain of functions (lambda and input_fn): the Estimator calls the lambda, and the lambda calls the ```input_fn()``` that we created earlier with our arguments. The reason we are using a lambda here is that we didn't create an exterior function to return the internal input_fn as we did in Regression. 

For more info about the **lambda function** check this [link](https://www.programiz.com/python-programming/anonymous-function). 
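
Here is a small, self-contained sketch of the same pattern. `fake_input_fn` is a stand-in invented for this example (it only echoes its arguments) so that the snippet runs without TensorFlow. `functools.partial` from the standard library is shown as an alternative way to get the same kind of zero-argument callable.

```python
from functools import partial

def fake_input_fn(features, labels, training=True):
    # stand-in for the real input function: just report what it received
    return (features, labels, training)

# a lambda with no parameters "freezes" the arguments until it is called
delayed = lambda: fake_input_fn('train_df', 'train_y', training=True)

# functools.partial builds an equivalent zero-argument callable
delayed_partial = partial(fake_input_fn, 'train_df', 'train_y', training=True)

# nothing has run so far; the calls happen here
result = delayed()
print(result)
print(delayed_partial() == result)
```

The point is that we hand over a function to be called later, not the result of calling it.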

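Now the ```input_fn()``` needs a few arguments: the features (train_df), the labels (train_y), and a flag specifying that this is training and not evaluation. The number of steps is passed to ```.train``` itself.
**Steps** are related to epochs but not the same: a step processes one batch of data, while an epoch is one full pass over the training set.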
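
As a back-of-the-envelope check, we can relate the number of steps used above to passes over the training data. The sketch assumes that each step consumes one full batch, which should hold here because the training dataset repeats indefinitely.

```python
num_examples = 120   # rows in the training set
batch_size = 256     # default batch_size of input_fn
steps = 5000         # value passed to classifier.train

# every step consumes one batch of examples
examples_seen = steps * batch_size

# one epoch is one full pass over the training set
approx_epochs = examples_seen / num_examples
print(examples_seen, round(approx_epochs))
```

So with these settings the model sees each of the 120 training flowers many thousands of times.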

#### output of training

What is the loss that we get in the output? Let's say for now that the smaller the loss, the better!

### Evaluate the model

Now that the model has been trained let's get some statistics and look at the performance. The code below evaluates the accuracy of the trained model on the test data. For evaluation we can use the ```.evaluate``` method:

In [15]:
eval_results = classifier.evaluate(
    input_fn = lambda: input_fn(test_df, test_y, training = False))

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_results))

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-05-21T15:35:07Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/04/l_r6_2qd69907ztc_b_d2qz00000gn/T/tmp7gmac15_/model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.44761s
INFO:tensorflow:Finished evaluation at 2020-05-21-15:35:08
INFO:tensorflow:Saving dict for global step 5000: accuracy = 0.93333334, average_loss = 0.37486255, global_step = 5000, loss = 0.37486255
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 5000: /var/folders/04/l

So we got an accuracy of 93% which is really good, but let's not forget the test dataset is very small (30 rows) and this will change a bit every time we train the model. 
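
Accuracy is simply the fraction of test examples that were classified correctly. On 30 test rows, the logged value of 0.93333334 corresponds to 28 correct predictions. The sketch below reproduces that number with made-up labels. Which flowers the model actually got wrong is not shown in the log, so the two misclassified positions are purely illustrative.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# made-up labels: 30 test flowers, 2 of them misclassified
true_labels = [0] * 10 + [1] * 10 + [2] * 10
predicted_labels = list(true_labels)
predicted_labels[12] = 2   # a Versicolor predicted as Virginica
predicted_labels[25] = 1   # a Virginica predicted as Versicolor

test_accuracy = accuracy(true_labels, predicted_labels)
print('{:0.3f}'.format(test_accuracy))
```

With so few rows, a single extra mistake moves the accuracy by more than three percentage points.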

### Make predictions using the trained model

So we have trained a model that generates good evaluation results. Let's make predictions.
Here we have a small, silly game in which the model predicts the class of a flower based on the user's input:

The code below is a combination of the TensorFlow 2.0 course provided by FreeCodeCamp and the TensorFlow documentation.

In [None]:
# first make an input function for features 
def input_fn(features, batch_size=256):
    # Convert the inputs to a Dataset without labels.
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)

# define the features
features = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
predict = {}

print("Please type numeric values as prompted.") # only one value each time
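for feature in features:
    # keep asking until the input parses as a number (e.g. 5.1)
    while True:
        val = input(feature + ": ")
        try:
            predict[feature] = [float(val)]
            break
        except ValueError:
            print("Not a number, please try again.")

# call the input function once all four features have been collected
predictions = classifier.predict(input_fn=lambda: input_fn(predict))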

for pred_dict in predictions:
    
    print(pred_dict)
    class_id = pred_dict['class_ids'][0]
    probability = pred_dict['probabilities'][class_id]
    
    print('Prediction is "{}" ({:.1f}%)'.format(
        SPECIES_COLUMN_NAMES[class_id], 100 * probability))
    
    
# here is some example input (x values) and expected classes that we can try:
'''
expected = ['Setosa', 'Versicolor', 'Virginica']
predict_x = {
    'SepalLength': [5.1, 5.9, 6.9],
    'SepalWidth': [3.3, 3.0, 3.1],
    'PetalLength': [1.7, 4.2, 5.4],
    'PetalWidth': [0.5, 1.5, 2.1],
}
'''

Let's see what we did above.

First we created an input function, only this one takes just the features and not the features and labels. The reason we don't pass the labels (y values) is that we'll use it for prediction, so we want the model to give us the answer. 

In the first **for** loop we ask for each feature in turn, wait in the **while** loop for a valid response, and once we get it we add it to the **predict** dictionary.

Now, what we get when we call ```classifier.predict``` is a generator that yields one dictionary per input example. So then we loop through it to get some specific information.
Let's see what the pred_dict **for** loop does in more detail:

So *pred_dict* is a dictionary and we are interested in its class_ids and probabilities entries.
**class_ids** is an array that contains one of three values (0, 1 or 2), depending on the prediction that the model has made (in other words: *which flower did the model predict based on the values that we passed?*)

The **probabilities** key has three values (because we have 3 classes), one probability for each of the classes ('Setosa', 'Versicolor', 'Virginica') given the input that we passed. We don't print all three probabilities, only the one for the predicted class_id, which is the highest of the three.
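
The lookup can be tried without a trained model by using a hand-written prediction dictionary. The values below are hypothetical. A real dictionary returned by the Estimator holds NumPy arrays and additional keys, but the two keys used here are accessed in the same way.

```python
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

# hypothetical prediction for a single flower
pred_dict = {
    'class_ids': [1],
    'probabilities': [0.02, 0.91, 0.07],
}

class_id = pred_dict['class_ids'][0]
probability = pred_dict['probabilities'][class_id]

# the predicted class is the position of the largest probability
best_index = max(range(len(SPECIES)), key=lambda i: pred_dict['probabilities'][i])

message = 'Prediction is "{}" ({:.1f}%)'.format(SPECIES[class_id], 100 * probability)
print(message)
print(best_index == class_id)
```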