Machine Learning with TensorFlow and Python
===

These are JulianNF's notes from following [freeCodeCamp's online Machine Learning with Python certification](https://www.freecodecamp.org/learn/machine-learning-with-python), supplemented by [Google's TensorFlow documentation](https://www.tensorflow.org/guide/tensor).

Feel free to benefit from them if you're studying on your own.

---

In [9]:
# Required in notebook:
%pip install -q scikit-learn
# Colab-only magic; outside Colab it raises the UsageError shown below:
%tensorflow_version 2.x

Note: you may need to restart the kernel to use updated packages.


UsageError: Line magic function `%tensorflow_version` not found.


In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals
import pandas as pd # library for data manipulation
# import numpy as np # library for handling arrays better
# import matplotlib.pyplot as plt # library for graphing

import tensorflow as tf

## (Data prepared in previous module)

In [3]:
training_dataframe = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
testing_dataframe = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')

training_survived = training_dataframe.pop('survived')
testing_survived = testing_dataframe.pop('survived')

CATEGORICAL_COLUMNS = [
	'sex',
	'n_siblings_spouses',
	'parch',
	'class',
	'deck',
	'embark_town',
	'alone'
]

NUMERIC_COLUMNS = [
	'age',
	'fare'
]

feature_columns = []

for feature_name in CATEGORICAL_COLUMNS:
	vocabulary = training_dataframe[feature_name].unique() # get all unique possible values (aka categories) in the given column
	feature_columns.append(
		tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary)
	)
# print(feature_columns)

for feature_name in NUMERIC_COLUMNS:
	feature_columns.append(
		tf.feature_column.numeric_column(feature_name, dtype=tf.float32)
	)
print(feature_columns)


[VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='n_siblings_spouses', vocabulary_list=(1, 0, 3, 4, 2, 5, 8), dtype=tf.int64, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='parch', vocabulary_list=(0, 1, 2, 5, 3, 4), dtype=tf.int64, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='class', vocabulary_list=('Third', 'First', 'Second'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='deck', vocabulary_list=('unknown', 'C', 'G', 'A', 'B', 'D', 'F', 'E'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Southampton', 'Cherbourg', 'Queenstown', 'unknown'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='alone', vocabulary_list=('n', 'y'), dtype=tf.string, def
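
To make the output above a bit less opaque: as far as I understand it, a vocabulary-list categorical column maps each raw value to its position in the vocabulary, and falls back to `default_value` (here `-1`) for values it has never seen. The sketch below mimics that lookup in plain Python. `lookup_index` is a hypothetical helper of mine, not a TensorFlow function, and `'Steerage'` is a made-up value used only to show the fallback.

```python
def lookup_index(value, vocabulary, default_value=-1):
    """Return the position of `value` in `vocabulary`, or `default_value` if absent."""
    try:
        return list(vocabulary).index(value)
    except ValueError:
        return default_value


# Vocabulary copied from the 'class' column in the printed output above.
class_vocabulary = ('Third', 'First', 'Second')

print(lookup_index('First', class_vocabulary))     # 1
print(lookup_index('Steerage', class_vocabulary))  # -1 (not in the vocabulary)
```

The order of the vocabulary comes from `.unique()`, so the indices depend on the order in which values first appear in the training data.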

## Batches
Datasets are typically too large to fit into RAM all at once, so most training processes cannot handle all of the data in one go. We therefore feed our data to our model in smaller batches.

In this course, we're going to load the data in batches of 32 data points. Note that feeding data to our model 1 data point at a time would actually be slower than doing it in "bite-sized chunks".
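
As a rough illustration of what "splitting into batches" means, here is a minimal plain-Python sketch. It is not how `tf.data` does it internally; `batches` is a hypothetical helper, and the batch size of 4 is chosen only to keep the printed output short.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


data_points = list(range(10))  # stand-in for 10 rows of data
chunks = list(batches(data_points, batch_size=4))
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note that the last batch is smaller when the number of data points is not a multiple of the batch size.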

## Epochs
As we feed more and more batches of data into our model, it will improve. However, we need to feed the data multiple times, because one pass is rarely enough for the model to learn from it. With shuffling on, each pass also groups the data points into different batches.

❓TBC - sounds like the number of epochs is related to all the permutations required so that all the data points have been processed alongside all the other datapoints.

Each epoch is therefore equal to one complete pass over our data. The number of epochs we feed into our model is equal to the number of times that our model sees the complete dataset.

❓TBC - An epoch is therefore "one round of training"?
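
Batches and epochs together determine how many training steps the model takes. Here is a small worked sketch, assuming the training CSV has 627 rows (a figure I recall for this dataset rather than something shown in these notes; `len(training_dataframe)` would confirm it). `training_steps` is a hypothetical helper.

```python
import math


def training_steps(num_rows, batch_size, num_epochs):
    """Total number of batches seen when every epoch ends with a partial batch."""
    batches_per_epoch = math.ceil(num_rows / batch_size)
    return batches_per_epoch * num_epochs


# 627 rows is an assumption; batch size 32 and 10 epochs are the defaults used in these notes.
print(training_steps(num_rows=627, batch_size=32, num_epochs=10))  # 200
```

Under that assumption the result lines up with the checkpoint name `model.ckpt-200` that appears in the prediction log further down.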

## Input Function
Before we start training our model, we need to convert our Pandas dataframe into a TensorFlow dataset (`tf.data.Dataset`). To do this, we need to create an input function, whose job it will be to handle the conversion.

Here's an example from the TensorFlow documentation, which we'll be using:

In [18]:
def make_input_function(data_dataframe, label_dataframe, num_epochs=10, shuffle=True, batch_size=32):
	def input_function():
		# Create a labeled TF dataset from our dataframe:
		# ⚠️ Beware!! Those extra parentheses inside of the .from_tensor_slices() method matter enormously!!!
		# They bundle the features dict and the labels into ONE tuple argument: (features, labels).
		dataset = tf.data.Dataset.from_tensor_slices(
			(
				dict(data_dataframe),
				label_dataframe
			)
		)
		if shuffle:
			dataset = dataset.shuffle(1000)
		# Split dataset into batches and repeat for as many epochs as requested:
		dataset = dataset.batch(batch_size).repeat(num_epochs)
		return dataset
	return input_function
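
To see why the `(features, labels)` tuple matters, here is a plain-Python analogy of what slicing that tuple produces: one `(row, label)` pair per data point. This is only a mental model, not TensorFlow's implementation; `slice_rows` is a hypothetical helper and the three rows are made-up values.

```python
# Made-up stand-ins for dict(data_dataframe) and label_dataframe.
features = {
    'sex': ['male', 'female', 'female'],
    'age': [22.0, 38.0, 26.0],
}
labels = [0, 1, 1]


def slice_rows(features, labels):
    """Yield one (row_dict, label) pair per data point, like slicing along the first axis."""
    for i, label in enumerate(labels):
        row = {name: column[i] for name, column in features.items()}
        yield row, label


for row, label in slice_rows(features, labels):
    print(row, label)
```

Each element pairs a dict of feature values with its label, which is the shape the estimator expects from an input function.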

In [19]:
# NB: Our input function generator has default values for number of epochs, shuffling, and batch size:
training_input_function = make_input_function(training_dataframe, training_survived)

# NB: For our testing function, we only need one epoch and no shuffling:
testing_input_function = make_input_function(testing_dataframe, testing_survived, num_epochs=1, shuffle=False)

## Training our model

Woo! 🥳 We're about to train our first model!

TensorFlow comes with a few core learning algorithms that are grouped into modules. In our case, we'll be using a linear classifier from the estimator module:

In [23]:
# Create a linear estimator model:
linear_estimator = tf.estimator.LinearClassifier(feature_columns=feature_columns)

# Train our model, using our input function:
linear_estimator.train(training_input_function)

# Test/evaluate how good our model is with the data from our testing set, passed into the model by our testing input function:
result = linear_estimator.evaluate(testing_input_function)

print('\n------\nResult:', result)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\Julian\\AppData\\Local\\Temp\\tmpyyfc06w3', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensor
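
For intuition about what a linear classifier does with the feature values, here is a minimal sketch of the underlying idea for two classes: a weighted sum of the inputs squashed into a probability with a sigmoid (logistic regression). The weights and bias below are hypothetical numbers I made up, not values learned by `linear_estimator`, and the real estimator also handles the categorical columns.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(feature_values, weights, bias):
    """Weighted sum of the features plus a bias, turned into a probability."""
    z = bias + sum(w * x for w, x in zip(weights, feature_values))
    return sigmoid(z)


# Hypothetical weights for the two numeric columns ('age', 'fare') and a hypothetical bias.
weights = [-0.03, 0.01]
bias = 0.2

probability = predict_probability([8.0, 21.075], weights, bias)
print('%.2f' % probability)
```

Training amounts to adjusting the weights and bias so that these probabilities match the known labels as closely as possible.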

In [48]:
prediction = list(linear_estimator.predict(testing_input_function))
print('\nThis person:\n', testing_dataframe.loc[6])
print('had a %.2f/1 probability of surviving' % prediction[6]['probabilities'][1])


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\Julian\AppData\Local\Temp\tmpyyfc06w3\model.ckpt-200
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

This person:
 sex                        female
age                           8.0
n_siblings_spouses              3
parch                           1
fare                       21.075
class                       Third
deck                      unknown
embark_town           Southampton
alone                           n
Name: 6, dtype: object
had a 0.58/1 probability of surviving
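
Each prediction carries a `'probabilities'` entry where index 1 is the probability of surviving, as used above. The sketch below turns such probabilities into a yes/no class and scores them against known labels. `predicted_class` and `accuracy` are hypothetical helpers, the 0.5 threshold is my assumption, and apart from the 0.58 shown above (with 0.42 inferred as its complement) the numbers are made up.

```python
def predicted_class(probabilities, threshold=0.5):
    """Return 1 (survived) if the survival probability reaches the threshold, else 0."""
    return 1 if probabilities[1] >= threshold else 0


def accuracy(predictions, actual_labels):
    """Fraction of predictions whose class matches the actual label."""
    hits = sum(
        predicted_class(p['probabilities']) == label
        for p, label in zip(predictions, actual_labels)
    )
    return hits / len(actual_labels)


# First entry mirrors the 0.58 above; the others are made-up examples.
example_predictions = [
    {'probabilities': [0.42, 0.58]},
    {'probabilities': [0.90, 0.10]},
    {'probabilities': [0.30, 0.70]},
    {'probabilities': [0.60, 0.40]},
]
example_labels = [1, 0, 0, 0]  # made-up labels

print(predicted_class(example_predictions[0]['probabilities']))  # 1
print(accuracy(example_predictions, example_labels))             # 0.75
```

With the real data, `testing_survived` would play the role of `example_labels`.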
