# Machine Learning with TensorFlow and Python

These are JulianNF's notes from following [freeCodeCamp's online Machine Learning with Python certification](https://www.freecodecamp.org/learn/machine-learning-with-python), supplemented by [Google's TensorFlow documentation](https://www.tensorflow.org/guide/tensor).

Feel free to benefit from them if you're studying on your own.

---

In [2]:
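# Required in notebook:
%pip install -q scikit-learn  # the `sklearn` name on PyPI is deprecated; the package is `scikit-learn`
%tensorflow_version 2.x  # Colab-only magic; it fails in a regular Jupyter kernel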

Note: you may need to restart the kernel to use updated packages.


UsageError: Line magic function `%tensorflow_version` not found.


In [11]:
from __future__ import absolute_import, division, print_function, unicode_literals
import numpy as np # library for handling arrays better
import pandas as pd # library for data manipulation
import matplotlib.pyplot as plt # library for graphing
from IPython.display import clear_output 
import urllib

import tensorflow as tf

## Data from previous module

In [15]:
training_dataframe = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
testing_dataframe = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')

training_survived = training_dataframe.pop('survived')
testing_survived = testing_dataframe.pop('survived')

categorical_columns = [
	'sex',
	'n_siblings_spouses',
	'parch',
	'class',
	'deck',
	'embark_town',
	'alone'
]

numeric_columns = [
	'age',
	'fare'
]

feature_columns = []

for feature_name in categorical_columns:
	vocabulary = training_dataframe[feature_name].unique() # get all unique possible values (aka categories) in the given column
	feature_columns.append(
		tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary)
	)
# print(feature_columns)

for feature_name in numeric_columns:
	feature_columns.append(
		tf.feature_column.numeric_column(feature_name, dtype=tf.float32)
	)
# print(feature_columns)
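
To make the vocabulary idea concrete, here is a minimal plain-Python sketch of what a vocabulary list is used for: each category is replaced by its position in the list. The helper names `build_vocabulary` and `encode` are made up for illustration, the sample values are hypothetical (not rows from the CSV), and `-1` is assumed here as the stand-in for a value that is not in the vocabulary. This is not how TensorFlow implements it internally.

```python
def build_vocabulary(values):
    """Return the unique values in order of first appearance."""
    return list(dict.fromkeys(values))

def encode(values, vocabulary, default_value=-1):
    """Replace each category with its index in `vocabulary`.

    Values missing from the vocabulary get `default_value` (an assumption
    made for this sketch).
    """
    index_of = {category: i for i, category in enumerate(vocabulary)}
    return [index_of.get(value, default_value) for value in values]

# Hypothetical sample values for a 'sex'-like column
sample = ['male', 'female', 'female', 'male']
vocabulary = build_vocabulary(sample)
print(vocabulary)                       # ['male', 'female']
print(encode(sample, vocabulary))       # [0, 1, 1, 0]
print(encode(['unknown'], vocabulary))  # [-1]
```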


## The Training Process
Typically, the dataset is quite large, and most training processes cannot hold all of the data in RAM at once. We therefore feed our data to our model in smaller batches.

In this model, we're going to load the data in batches of 32 data points. Note that feeding data to our model 1 data point at a time would actually be slower than doing it in "bite-size chunks".
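
As a rough illustration of batching, here is a plain-Python sketch that slices a list into chunks of at most 32 items. The helper `iterate_in_batches` and the 100 placeholder rows are assumptions made for this example; they are not part of TensorFlow or the Titanic data.

```python
def iterate_in_batches(rows, batch_size=32):
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Hypothetical dataset of 100 data points (placeholder integers, not real data)
rows = list(range(100))
batch_sizes = [len(batch) for batch in iterate_in_batches(rows)]
print(batch_sizes)  # [32, 32, 32, 4]
```

The last batch is smaller because 100 is not a multiple of 32.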

### Epochs
As we feed more and more batches of data into our model, it will improve. However, we need to feed the data multiple times so that the model can process all data points along with all the other data points. 

❓TBC - Sounds like the number of epochs is related to all the permutations required so that all the data points have been processed alongside all the other data points.

Each epoch is therefore equal to one complete stream of our data. The number of epochs is the number of times that our model sees the complete dataset.

❓TBC - An epoch is therefore "one round of training"?
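
On the question above: an epoch is one full pass over the training data, so the number of batches the model sees is the batches per epoch times the number of epochs. The sketch below works through that arithmetic. The function name `training_steps`, the 100-row dataset, and the default of 10 epochs are assumptions for illustration, and it assumes each epoch is batched separately, so the last batch of an epoch may be smaller.

```python
import math

def training_steps(num_rows, batch_size=32, num_epochs=10):
    """Total number of batches seen when every epoch is one full pass."""
    batches_per_epoch = math.ceil(num_rows / batch_size)
    return batches_per_epoch * num_epochs

# Hypothetical dataset of 100 rows
print(training_steps(100))                # 4 batches per epoch * 10 epochs = 40
print(training_steps(100, num_epochs=1))  # 4
```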

### Input Function
We need to be able to convert our Pandas dataframe to a TensorFlow dataset (`tf.data.Dataset`). To do this, we need to create an input function.

Here's an example from the TensorFlow documentation:

In [5]:
def make_input_function(data_dataframe, label_dataframe, num_epochs=10, shuffle=True, batch_size=32):
	def input_function():
		# create a labeled TF dataset from our dataframe:
		# NB: features and labels must be passed together as one tuple:
		dataset = tf.data.Dataset.from_tensor_slices(
			(dict(data_dataframe), label_dataframe)
		)
		if shuffle:
			dataset = dataset.shuffle(1000)
		# split dataset into batches and repeat for as many epochs as requested:
		dataset = dataset.batch(batch_size).repeat(num_epochs)
		return dataset
	return input_function

In [16]:
# NB: Our input function generator has default values for number of epochs, shuffling, and batch size:
training_input_function = make_input_function(training_dataframe, training_survived)

# NB: For our testing function, we only need one epoch and no shuffling:
testing_input_function = make_input_function(testing_dataframe, testing_survived, num_epochs=1, shuffle=False)

In [17]:
# Create a linear estimator model:
linear_estimator = tf.estimator.LinearClassifier(feature_columns=feature_columns)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\Julian\\AppData\\Local\\Temp\\tmpu5etxabp', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
