Machine Learning with TensorFlow and Python
===

These are JulianNF's notes from following [freeCodeCamp's online Machine Learning with Python certification](https://www.freecodecamp.org/learn/machine-learning-with-python), supplemented by [Google's TensorFlow documentation](https://www.tensorflow.org/guide/tensor).

Feel free to benefit from them if you're studying on your own.

---

# Classification

Where regression is used to predict a numeric value, classification is used to assign each data point to one of a fixed set of labelled classes.

In this example, we'll use the Iris flower dataset to create an ML model for classifying flowers. It contains data for 3 different species of flowers:
- Setosa
- Versicolor
- Virginica

For each flower, it records four measurements:
- sepal length
- sepal width
- petal length
- petal width

**📖👀 TODO - Research how classification algorithms work**

### Aside - Keras.io
Keras is a high-level deep learning API designed with human readability in mind, and one of the most widely used deep learning frameworks. Within TensorFlow, it is available as the `tf.keras` module.

It aims to make it easy to implement and test ML ideas by, among other things, wrapping much of TensorFlow's functionality in simpler, more readable abstractions.

## Preparing our data

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals
import pandas as pd
import tensorflow as tf

In [2]:
COLUMN_NAMES = [
    'sepalLength',
    'sepalWidth',
    'petalLength',
    'petalWidth',
    'species'
]
SPECIES = [
    'Setosa',
    'Versicolor',
    'Virginica'
]

training_file = tf.keras.utils.get_file(
    "iris_training.csv",
    "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv"
)
testing_file = tf.keras.utils.get_file(
    "iris_testing.csv",
    "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv"
)
 
training_df = pd.read_csv(training_file, names=COLUMN_NAMES, header=0)
testing_df = pd.read_csv(testing_file, names=COLUMN_NAMES, header=0)

print(training_df.head())

training_species = training_df.pop('species')
testing_species = testing_df.pop('species')

print(training_df.shape)
print(training_df.head())


   sepalLength  sepalWidth  petalLength  petalWidth  species
0          6.4         2.8          5.6         2.2        2
1          5.0         2.3          3.3         1.0        1
2          4.9         2.5          4.5         1.7        2
3          4.9         3.1          1.5         0.1        0
4          5.7         3.8          1.7         0.3        0
(120, 4)
   sepalLength  sepalWidth  petalLength  petalWidth
0          6.4         2.8          5.6         2.2
1          5.0         2.3          3.3         1.0
2          4.9         2.5          4.5         1.7
3          4.9         3.1          1.5         0.1
4          5.7         3.8          1.7         0.3
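
The `species` column printed above holds integers rather than names. Judging by the `SPECIES` list defined in the same cell, and by how that list is indexed when we make predictions later in these notes, each integer is a position in that list. A small sketch of decoding the first five labels (the `labels` list is copied by hand from the output above):

```python
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

# First five values of the 'species' column, copied from the printed output.
labels = [2, 1, 2, 0, 0]

# Each integer label is used as an index into SPECIES.
names = [SPECIES[label] for label in labels]
print(names)  # ['Virginica', 'Versicolor', 'Virginica', 'Setosa', 'Setosa']
```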


## Feature columns

In [3]:
feature_columns = []

# NB: all the columns in our dataset are numeric, so we don't need to handle categorical and numeric columns separately; we can simply loop over all the column names:
for feature in training_df.keys():
	feature_columns.append(
		tf.feature_column.numeric_column(key=feature)
	)
feature_columns

[NumericColumn(key='sepalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='sepalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='petalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='petalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

## Input function
This time around, we'll write a basic input function (rather than an input-function-generating function like we did for the Titanic dataset) and later on, pass it to our model/TensorFlow using a lambda function.

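Lambda functions are anonymous, single-expression functions in Python. Here, a lambda lets us hand TensorFlow a function that takes no arguments, while still calling our input function with the arguments we choose.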
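
As a rough illustration of that pattern, here is a sketch with plain Python stand-ins. Both `make_batches` and `run` are hypothetical names invented for this example; `run` plays the role of an API that insists on calling its `input_fn` with no arguments.

```python
def make_batches(values, batch_size=2):
    """Hypothetical stand-in for an input function that needs arguments."""
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]


def run(input_fn):
    """Hypothetical stand-in for an API that calls input_fn() with no arguments."""
    return input_fn()


# The lambda takes no arguments itself, but "remembers" the ones we want to pass on.
result = run(lambda: make_batches([1, 2, 3, 4, 5], batch_size=2))
print(result)  # [[1, 2], [3, 4], [5]]
```

Passing `make_batches([1, 2, 3, 4, 5])` directly would call it immediately and hand over its result rather than a function, which is why the lambda wrapper is needed.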

In [4]:
def input_fn(features_df, labels_df, training=True, batch_size=256):
	dataset = tf.data.Dataset.from_tensor_slices(
		(
			dict(features_df),
			labels_df
		)
	)
	if training:
		dataset = dataset.shuffle(1000).repeat()
	return dataset.batch(batch_size)
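
To get a feel for what `shuffle`, `repeat` and `batch` do together, here is a loose pure-Python analogy. It is only a sketch and not how `tf.data` is implemented: the `batches` helper and its `seed` parameter are made up for this example, and it shuffles the whole list on each pass instead of using a shuffle buffer.

```python
import itertools
import random


def batches(rows, batch_size, training=True, seed=0):
    """Yield lists of rows, loosely mimicking shuffle().repeat().batch()."""
    rows = list(rows)
    if training:
        rng = random.Random(seed)

        def stream():
            while True:              # repeat(): never run out of data
                epoch = rows[:]
                rng.shuffle(epoch)   # shuffle(): a new order on each pass
                yield from epoch

        source = stream()
    else:
        source = iter(rows)          # evaluation: a single ordered pass

    while True:
        batch = list(itertools.islice(source, batch_size))
        if not batch:
            return
        yield batch


rows = list(range(10))
eval_batches = list(batches(rows, batch_size=4, training=False))
train_batches = list(itertools.islice(batches(rows, batch_size=4), 5))

print(eval_batches)        # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(len(train_batches))  # 5
```

In evaluation mode the last batch can be short because the data runs out; in training mode the stream never ends, so every batch is full and we have to decide ourselves when to stop.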

## Building and Training our model
We first need to choose what type of model we want to apply. For classification tasks, TensorFlow ships a handful of pre-made estimators that we can pick from, such as:
- `DNNClassifier` (deep neural network classifier)
- `LinearClassifier`
- ...

We're going to use `DNNClassifier` because the relationship between the measurements and the species may not be linear.

In [5]:
dnn_classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
	# The shape of our neural net --> 30 nodes in first layer, 10 in the second layer
    hidden_units=[30, 10],
	# for the three flower classes that we know are in our input data
    n_classes=3
)

dnn_classifier.train(
	# train() calls input_fn with no arguments, so we wrap our input function in an anonymous function (a 'lambda function' in Python) rather than embedding it within an "input function maker", as we did in our Titanic example:
	input_fn=lambda: input_fn(training_df, training_species, training=True),
	# Train for 5000 steps; each step feeds one batch through the model and updates its weights once:
	steps=5000
)



INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\Julian\\AppData\\Local\\Temp\\tmpcajj3k2s', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Instructions for updating:
Use Variable.read_

<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifierV2 at 0x19ab2d55ee0>
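
To make `steps` concrete: a step is one batch passed through the model, followed by one weight update. Because the training dataset repeats indefinitely, every batch is full, so the totals can be worked out with plain arithmetic. This is only a back-of-the-envelope sketch using numbers already shown in these notes:

```python
train_rows = 120   # from training_df.shape printed earlier
batch_size = 256   # default value in input_fn
steps = 5000       # passed to train()

examples_seen = steps * batch_size
passes_over_data = examples_seen / train_rows

print(examples_seen)            # 1280000
print(round(passes_over_data))  # 10667
```

So with these settings the model sees each of the 120 training flowers roughly ten thousand times.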

## Evaluating our model

In [6]:
evaluation = dnn_classifier.evaluate(
	input_fn=lambda: input_fn(testing_df, testing_species, training=False)
)

print(evaluation)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2023-02-02T10:01:19
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\Julian\AppData\Local\Temp\tmpcajj3k2s\model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 2.44823s
INFO:tensorflow:Finished evaluation at 2023-02-02-10:01:22
INFO:tensorflow:Saving dict for global step 5000: accuracy = 0.8333333, average_loss = 0.6464407, global_step = 5000, loss = 0.6464407
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 5000: C:\Users\Julian\AppData\Local\Temp\tmpcajj3k2s\model.ckpt-5000
{'accuracy': 0.8333333, 'average_loss': 0.6464407, 'loss': 0.6464407, 'global_step': 5000}
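
The `accuracy` value reported above is the fraction of test flowers whose predicted class matches the true label. A minimal sketch of that calculation, using made-up labels rather than the real test set (the `accuracy` helper here is hypothetical and is not TensorFlow's own code):

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels for six flowers (not taken from the real test set).
actual = [0, 1, 2, 2, 1, 0]
predicted = [0, 1, 2, 1, 1, 0]

print(accuracy(predicted, actual))  # 5 of 6 correct
```

An accuracy of 0.8333333 therefore means the model got about five out of every six test flowers right.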


## Using our model
Now that we've created a model, we can use it to predict individual cases.

For this, we'll:
1. create an input function without labels (after all, we want the model to tell us what it thinks the label is),
2. ask the user to input their values (presumably from field observations in this case),
3. use our trained model to predict which species the flower is.
	- Note that by looking at each prediction (in our case only one; the `pred_dict` value in our code below), we can see how probable the model thinks each of the possible labels we trained it with (i.e. Setosa, Versicolor, Virginica) is for our input.

In [9]:
def prediction_input_fn(features, batch_size=256):
	return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)


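def get_user_input():
    user_input = {}
    print('\nPlease input values for the following flower features:\n')
    for feature in training_df.keys():
        # Keep asking until the input parses as a number (e.g. "5.1"):
        while True:
            value = input(feature + ': ')
            try:
                number = float(value)
                break
            except ValueError:
                print('Please enter a number.')
        # Each feature is stored as a list with one entry per flower,
        # because from_tensor_slices() slices along the first axis:
        user_input[feature] = [number]
    return user_input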


def predict_species():
	user_input = get_user_input()
	
	predictions = dnn_classifier.predict(
        input_fn=lambda: prediction_input_fn(user_input)
    )
	
	for pred_dict in predictions:
		print(pred_dict)
		class_id = pred_dict['class_ids'][0]
		probability = pred_dict['probabilities'][class_id]
		
		print('Prediction is "{}" ({:.1f}%)'.format(
			SPECIES[class_id], 100*probability
			)
		)

predict_species()



Please input values for the following flower features:

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\Julian\AppData\Local\Temp\tmpcajj3k2s\model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
{'logits': array([ 1.3364276, -3.3479393,  0.6834881], dtype=float32), 'probabilities': array([0.6537007 , 0.00603927, 0.34026003], dtype=float32), 'class_ids': array([0], dtype=int64), 'classes': array([b'0'], dtype=object), 'all_class_ids': array([0, 1, 2]), 'all_classes': array([b'0', b'1', b'2'], dtype=object)}
Prediction is "Setosa" (65.4%)
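
The `probabilities` in the dictionary above appear to be the softmax of the `logits`, which is the usual final step for a multi-class classifier; the numbers agree to the printed precision. A small sketch that reproduces the prediction by hand (the `softmax` helper is written here for illustration and is not part of the code above):

```python
import math

SPECIES = ['Setosa', 'Versicolor', 'Virginica']
logits = [1.3364276, -3.3479393, 0.6834881]  # copied from the output above


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    # Subtracting the largest score avoids overflow and does not change the result.
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax(logits)

# The predicted class is the index of the highest probability.
class_id = max(range(len(probabilities)), key=probabilities.__getitem__)

print('Prediction is "{}" ({:.1f}%)'.format(SPECIES[class_id], 100 * probabilities[class_id]))
# Prediction is "Setosa" (65.4%)
```

Note that the runner-up, Virginica, still gets about 34%, so the model is far from certain about this particular flower.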
