## MACHINE LEARNING ALGORITHMS PART 1.

### Linear Regression 

I will be using information and examples based on the **TensorFlow 2.0** course provided by freeCodeCamp and on the TensorFlow documentation ([doc](https://www.tensorflow.org/tutorials/quickstart/beginner)).

First, let's import the requirements 

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
import tensorflow.compat.v2.feature_column as fc
import tensorflow as tf

print(tf.version.VERSION) # tf.version is a module; VERSION holds the version string

In [None]:
%matplotlib inline

## LOAD THE DATA

The dataset I will be focusing on here is the Titanic dataset. It has tons of information about each passenger on the ship. Our first step is always to understand the data and explore it. So, let's do that!
What we are going to do is **predict the chance of survival** using the information provided by the Titanic dataset.

In [None]:
# load the dataset that will be used here
dftrain = pd.read_csv('./train.csv')
dfeval = pd.read_csv('./eval.csv')
y_train = dftrain.pop('survived') # pop/extract a specific column. This column is our y variable.
y_eval = dfeval.pop('survived') 

The ```pd.read_csv()``` function loads the dataset as a *pandas dataframe*. A dataframe is kind of like a table.
We can have a look at the table using simple commands such as ```dfname.head()``` where **dfname** is the name that you have given to your dataframe during loading. If the parentheses are left empty then by default it will show the first 5 rows. In this dataset the y (dependent variable) is the **survived** column, which was extracted and stored in a new variable (y_train, y_eval). The rest are the independent variables x.
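
As a small illustration of ```pop()``` and ```head()```, here is a sketch on a tiny made-up table. The rows and the names `toy_df` and `toy_y` are invented for the example and are not taken from the Titanic files:

```python
import pandas as pd

# A tiny, made-up table standing in for train.csv (values are illustrative only)
toy_df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0],
})

# pop() removes the column from the dataframe and returns it as a Series
toy_y = toy_df.pop("survived")

print(list(toy_df.columns))  # the label column is gone from the features
print(toy_y.tolist())        # the labels now live in their own Series
print(toy_df.head(2))        # head(n) shows the first n rows (5 if n is omitted)
```

Note that ```pop()``` changes the dataframe in place, so running the same cell twice raises a `KeyError` because the column is already gone.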

In [None]:
# visualize dataframe
dftrain.head(10)
#y_train.head(10)

In [None]:
# using the loc indexer to locate a specific row by its label. Let's locate the 1st row in the dftrain dataframe and 
# in the y_train Series
print(dftrain.loc[0], y_train.loc[0])

The last line of the output above, *Name: 0, dtype: object 0*, ends with the output of ```y_train.loc[0]```: the final 0 means that this 1st person (remember I located the 1st row) did not survive.

In [None]:
# to look at a specific column:
print(dftrain["age"])

In [None]:
# take a look at the descriptives
dftrain.describe()

The ```dfname.describe()``` method gives us some of the descriptive statistics of the dataframe's numerical columns, such as the mean, standard deviation, etc. 


We can also look at the dataframe's shape. NumPy arrays are not the only ones with shape!

In [None]:
# dftrain.shape
dfeval.shape

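The output is just 2 values: the number of rows and the number of columns/features. The training dataframe (```dftrain.shape```) has 627 rows and 9 columns, while the evaluation dataframe (```dfeval.shape```, shown above) has 264 rows and the same 9 columns. 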

Let's take a look at the survival column. 

In [None]:
y_train.head(10) # in the output 0=did not survive and 1=survived

Next let's make a few graphs of the data:


In [None]:
dftrain.age.hist(bins=20)

In [None]:
dftrain.sex.value_counts().plot(kind='barh')

In [None]:
dftrain['class'].value_counts().plot(kind='barh')

In [None]:
pd.concat([dftrain, y_train], axis=1).groupby('sex').survived.mean().plot(kind='barh').set_xlabel('% survived')

After a quick visualization we see that:
* most passengers were between 20 and 30 years old
* most passengers were male
* most passengers were in the 3rd class
* females had a much higher survival rate than males

We used this visualization to get some idea of what kind of data we have, how we are going to further analyze the dataset, and what kind of results we should expect.

### Training and Testing Data

We have loaded two different datasets. This is because when we train models we need two different sets of data: one for **training** and one for **testing**. 
In a few words, we feed our model with the **training** data, so that it can learn and develop. Normally, a training dataset is larger than the testing dataset.

We use the **testing** dataset to evaluate the model and see how it performs. These two datasets should not be the same. **WHY?**

The whole point is for our model to be able to make predictions on new data, data that it hasn't seen before. If we evaluate the model with data that it has already seen, the measured performance will be too optimistic. What if the model simply memorized the training data? 
This is why the testing and training datasets need to be different.
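
To see why, here is a deliberately extreme sketch: a "model" that only memorizes its training rows. Everything in it (the rows, `memorizing_model`, the fallback guess of 0) is made up for illustration:

```python
# Made-up (age, sex) -> survived examples, purely for illustration
train_examples = {(22, "male"): 0, (38, "female"): 1, (26, "female"): 1, (35, "male"): 0}
test_examples = {(54, "male"): 0, (27, "female"): 1, (14, "female"): 1, (2, "male"): 0}

def memorizing_model(features):
    # Return the stored label if this exact row was seen in training, else always guess 0
    return train_examples.get(features, 0)

def accuracy(model, examples):
    # Fraction of examples where the model's answer matches the true label
    correct = sum(model(x) == y for x, y in examples.items())
    return correct / len(examples)

train_acc = accuracy(memorizing_model, train_examples)
test_acc = accuracy(memorizing_model, test_examples)
print(train_acc, test_acc)
```

Scored on its own training rows the memorizer looks perfect, but on rows it has never seen it is no better than always guessing the same answer. Only the second number tells us how the model would behave on new passengers.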

## Preparing our dataset

### Feature columns

In the beginning I imported something called **fc**, which stands for **feature columns** (the code below reaches the same module through ```tf.feature_column```). 
In our dataset we have two types of information: numerical and categorical. In the next block of code I will need to loop through these types of information to create feature columns. Feature columns are just what we feed our linear estimator with to make predictions.

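Numerical information is found in the columns that hold continuous numbers:
* age
* fare

Categorical information is the information that can be put in categories and/or is not numerical:
* sex
* number of siblings & spouses (n_siblings_spouses)
* number of parents & children (parch)
* class
* deck
* embark_town
* alone

Note that n_siblings_spouses and parch contain small whole numbers, but here they are treated as categories.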

Before we continue with the training, we must first convert the categorical data into numerical. The way to do this is by encoding or replacing each category with an integer (for example, male = 1, female = 0).
We are lucky that TensorFlow can handle that for us:

In [None]:
# define your categorical and numerical columns:
categorical_columns = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck', 'embark_town', 'alone']
numerical_columns = ['age', 'fare']

# convert the categorical into numerical
feature_columns = [] # create a blank list to store the feature columns
for fname in categorical_columns:
    vocabulary = dftrain[fname].unique() # get the unique values from a given categorical column 
    # create a feature column from these unique values and store it in the list
    feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(fname, vocabulary))

# it's easier to create a feature column with (already) numerical values: we just give the feature name 
# (name of a given column) and the datatype (e.g. float32)
for fname in numerical_columns:
    feature_columns.append(tf.feature_column.numeric_column(fname, dtype=tf.float32))
    
print(feature_columns)

With the block of code above, I just created a list of the features that are used in the dataset. 
The calls inside the ```append()``` method create objects that the model will then use to map unique string values (e.g. male and female) to integers (e.g. 0 and 1). For more information on feature columns see [link](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list?version=stable)
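
The string-to-integer mapping can be imitated in plain Python. This is only a sketch of the idea, not TensorFlow's implementation, and `sex_column`, `lookup`, `encoded` and `one_hot` are names invented for the example:

```python
# Hypothetical column values, standing in for dftrain['sex']
sex_column = ["male", "female", "female", "male"]

# Build the vocabulary: the unique values, in order of first appearance
vocabulary = list(dict.fromkeys(sex_column))

# Map each string to its position in the vocabulary
lookup = {value: index for index, value in enumerate(vocabulary)}
encoded = [lookup[value] for value in sex_column]

# One-hot version: a 1 at the category's index, 0 elsewhere
one_hot = [[1 if i == idx else 0 for i in range(len(vocabulary))] for idx in encoded]

print(vocabulary)
print(encoded)
print(one_hot)
```

The one-hot form is shown because a linear model usually ends up with one weight per category. That way the integer codes are only labels and do not imply that one category is "bigger" than another.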

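Take a look at the unique values stored in ```vocabulary``` first, and then at the unique values of specific categorical columns.

In [None]:
# after the loops above, fname holds the last numerical column name ('fare'),
# while vocabulary still holds the unique values of the last categorical column ('alone')
print(vocabulary)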

In [None]:
print(dftrain['sex'].unique()) # print explicitly: a notebook cell only displays its last expression
print(dftrain['embark_town'].unique())

In [None]:
for fname in numerical_columns:
    print(fname)

## How is our model trained?

The way we train a model is by feeding it with information (with data). For this model, data will not be fed all at once but in batches of 32 entries. We'll feed our model several times depending on the number of **epochs**. 

This dataset is small, so we wouldn't normally need to train the model in batches, but when someone has tons of data that cannot fit in memory all at once, batches become necessary. 
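
To make the numbers concrete, here is a small sketch of how many batches this comes to. It assumes the 627 training rows, a batch size of 32 and 10 epochs, and that the data is batched first and then repeated. The variable names are made up for the example:

```python
import math

num_rows = 627      # rows in the training dataframe
batch_size = 32     # entries fed to the model at a time
num_epochs = 10     # passes over the whole dataset (assumed value)

# The last batch of each pass is smaller when the rows don't divide evenly
batches_per_epoch = math.ceil(num_rows / batch_size)
last_batch_size = num_rows - (batches_per_epoch - 1) * batch_size
total_batches = batches_per_epoch * num_epochs

print(batches_per_epoch, last_batch_size, total_batches)
```

Under these assumptions each pass consists of 19 full batches plus one final batch of 19 rows, so the model is updated 200 times in total.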

### What is an epoch?

An epoch is just one pass over the entire dataset. The number of epochs is the number of times the model will see the entire dataset. We repeat our dataset many times (a lot of epochs), in a different order each time, because we want the model to look at the (same) data many times but from *different views* so that it starts picking up patterns. The idea is that the more our model looks at the data, the better its predictions become (up to a point, as explained next).

#### the problem of overfitting 
Sometimes the model sees the same datapoints so many times that it just memorizes them instead of learning patterns from the data we feed it. In such a case, when we give our model some new data (some testing data), its predictions are much less accurate than on the training data. This is a common problem across machine learning, not only in classification. 

In order to prevent that from happening we can start with a small number of epochs and increase them later on (if needed).

### Input function
Given that our model will see the entire dataset many times and in small batches, we first need to create an **input function**. This function just defines how our dataset will be split into batches of 32 entries and repeated for each epoch.

The TensorFlow model we are using requires that the data we give it comes as a ```tf.data.Dataset``` object. So, the input function that we'll create will convert the pandas dataframe into that object. The input function comes directly from the TensorFlow documentation, see [link](https://www.tensorflow.org/tutorials/estimator/linear)

In [None]:
def make_input_func(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
    def input_function(): # inner function, this will be returned
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df)) # create a tf.data.Dataset object with the data and label
        if shuffle:
            ds = ds.shuffle(1000) # randomize the order of the data
        ds = ds.batch(batch_size).repeat(num_epochs) # split the dataset into batches of 32 and repeat this process n times (depending on epochs)
        # return the batched dataset
        return ds
    # return a function object to use
    return input_function

# now call the exterior function to get the input functions that the model will call to obtain its dataset objects
train_input_func = make_input_func(dftrain, y_train)
eval_input_func = make_input_func(dfeval, y_eval, num_epochs=1, shuffle=False) 

Let's break down the code above. We created two functions (a function within a function). The exterior function ```make_input_func()``` needs a few parameters:

* data_df = the training data (our x variables), or features
* label_df = the data that we want our model to predict (the y variable)
* num_epochs = the number of times the model will see our dataset
* shuffle = true or false. Are we going to randomize the order of the data before passing it to the model?
* batch_size = the number of entries in each batch that is fed to the model.

In the inner function ```input_function()``` we create a dataset object (ds) by passing a dictionary representation of our x values together with the y values.
Now, the inner function returns the dataset object that we created and the exterior function returns the inner function for us. So, what the exterior function does is make the **input function**. When we need an input function, we call the exterior ```make_input_func()```, because this one returns the actual input function to us.
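
The function-within-a-function pattern can be tried without TensorFlow. Below is a plain-Python imitation under stated assumptions: `make_batch_func` and `batch_function` are hypothetical names, lists stand in for the dataframe and the dataset object, and the shuffling is simplified compared to what ```tf.data``` does:

```python
import random

def make_batch_func(rows, labels, num_epochs=10, shuffle=True, batch_size=32):
    def batch_function():  # inner function: remembers the arguments of the outer call
        batches = []
        for _ in range(num_epochs):
            order = list(range(len(rows)))
            if shuffle:
                random.shuffle(order)  # a new order on every pass
            for start in range(0, len(order), batch_size):
                chunk = order[start:start + batch_size]
                batches.append(([rows[i] for i in chunk], [labels[i] for i in chunk]))
        return batches
    return batch_function  # return the function itself, not its result

# Made-up data: 5 rows, batches of 2, one pass, no shuffling
toy_func = make_batch_func(["a", "b", "c", "d", "e"], [0, 1, 0, 1, 1],
                           num_epochs=1, shuffle=False, batch_size=2)
toy_batches = toy_func()  # only now are the batches actually built
print(toy_batches)
```

The point of returning a function instead of the data is that whoever receives it (here, the estimator) can decide when to build the dataset, while the settings chosen in the outer call stay attached to it.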

### Calling the input function 

We call the same function to create our dataset object for both training and testing. The difference is that for testing we don't need to shuffle our data, since the order has no effect on the evaluation, so shuffle is set to **False**, and we just want the model to see the data once (num_epochs = 1) because it has already been trained.  

## Let's now create the Model

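Here, I will use a linear estimator. Note that, despite the section title, the **LinearClassifier** used below is a linear model for classification (it fits a logistic regression), which is what we need to predict a yes/no outcome such as survival.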

It's pretty easy to create a model. I'll be using the **LinearClassifier** object from the **estimator** module of TensorFlow by passing the *feature_columns* list created before.

*More information about estimators can be found in ML_algorithms Part2*.
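
As a rough picture of what a linear classifier computes for a two-class problem, the sketch below takes a weighted sum of the inputs and passes it through a sigmoid to get a probability. The function name, the two features and all the weights are hypothetical; in practice training is what finds the weights:

```python
import math

def predict_survival_probability(features, weights, bias):
    # Weighted sum of the inputs (the "linear" part)
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    # The sigmoid squashes the sum into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical weights for two features: [is_female, age]
weights = [2.0, -0.02]
bias = -0.5

p_female_30 = predict_survival_probability([1, 30], weights, bias)
p_male_30 = predict_survival_probability([0, 30], weights, bias)
print(round(p_female_30, 3), round(p_male_30, 3))
```

With these invented weights the female passenger gets the higher probability, simply because the weight on `is_female` is positive.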

In [None]:
linear_est = tf.estimator.LinearClassifier(feature_columns = feature_columns)

### Train the Model

Training the model is as easy as passing the input function created earlier:

In [None]:
linear_est.train(train_input_func) # and that's it!
results = linear_est.evaluate(eval_input_func) # evaluate the model by testing it with some other data (testing data)

clear_output() # clear the output created while we train the model
print(results['accuracy']) # print the accuracy level of the model

So essentially what I've done above is:

* train the model by passing it the input function ```train_input_func```, which produces the **object dataset** for training,
* evaluate the model by passing it ```eval_input_func```, created with the same ```make_input_func()``` function, which produces the **object dataset** for evaluation/testing,
* print the accuracy of the model

Now, let's take a look at the **results** we got:

In [None]:
print(results)

So **results** is just a dictionary object with a bunch of key-value pairs of statistical information related to the performance of the model. This information may not tell us much at first, but if we want to access a specific piece of information we just ask for it as above: ```print(results['specific_key'])```, where specific_key is the piece of information we want to access (e.g. accuracy). 

### What is this Accuracy key?

When I trained and evaluated the model, the next thing I did was to check the accuracy of the model. But what exactly is this accuracy? 
It is simply the *comparison* between the predictions the trained model made on the evaluation data and the actual y values of that dataset (the survival column that we extracted in the beginning): the fraction of passengers for whom the predicted outcome matches the real one.  
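
That comparison can be written out by hand. The sketch below uses made-up probabilities and labels, and assumes a probability of at least 0.5 counts as a prediction of "survived":

```python
# Made-up predicted survival probabilities and the matching true labels
predicted_probs = [0.91, 0.32, 0.67, 0.08, 0.45]
true_labels = [1, 0, 0, 0, 1]

# Turn each probability into a class: 1 if at least 0.5, else 0 (assumed threshold)
predicted_classes = [1 if p >= 0.5 else 0 for p in predicted_probs]

# Accuracy = number of correct predictions / number of predictions
num_correct = sum(p == t for p, t in zip(predicted_classes, true_labels))
accuracy = num_correct / len(true_labels)
print(predicted_classes, accuracy)
```

Here 3 of the 5 predictions match the true labels, so the accuracy is 0.6.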

Now, notice that if we run the model again the accuracy will probably change. This is because the training data is shuffled, so the model sees the batches in a different order on every run and ends up with slightly different weights. Also, if we change the number of epochs (e.g. 20 epochs instead of 10) the accuracy will probably change again. 

## Use the Model to make predictions

Up until now, what I've done was to create the feature column list (which describes our input data), make the input function that converts the pandas dataframes (our data) into an object dataset, create the linear model, train it and lastly test it. 
Now it's time to actually use the model to make predictions. 
So, with this silly Titanic dataset (provided in the TensorFlow documentation) what we want is to predict the probability of survival (of a passenger) based on some information such as age, sex, deck, etc. We can use the ```.predict()``` method to estimate this probability: 

In [None]:
pred_dicts = list(linear_est.predict(eval_input_func)) # predict() yields one dictionary per datapoint; collect them in a list to loop through later
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts]) # keep only the probability of survival (class 1) to plot it

probs.plot(kind = 'hist', bins = 20, title = 'predicted probabilities')

In [None]:
print(pred_dicts)

**pred_dicts** is a list of dictionaries that hold the information for every single prediction that the model has made. The dfeval dataframe has 264 rows, so pred_dicts also has 264 dictionaries (one prediction for every datapoint). If we want to look at a specific dictionary in this list we just need to provide its index: ```print(pred_dicts[263])```


In [None]:
# get the prediction for a specific datapoint/subject etc.
print(pred_dicts[1])

In [None]:
# within each prediction dictionary what we are interested in is the 'probabilities' key.
print(pred_dicts[1]['probabilities'])

Given that we have 2 possible outcomes (someone either survived (1) or did not survive (0)) we have 2 probabilities. The first probability is for **0 = did not survive** and the second for **1 = survived**. 
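
A short sketch of how the two numbers relate, using a hypothetical prediction dictionary reduced to the one key we care about. The real dictionaries contain more keys and hold arrays rather than plain lists, and the values here are invented:

```python
# A hypothetical prediction dictionary, reduced to the one key used here
example_pred = {"probabilities": [0.68, 0.32]}

p_died, p_survived = example_pred["probabilities"]

# The two probabilities cover both possible outcomes, so they add up to 1
total = p_died + p_survived

# The predicted class is the index of the larger probability
predicted_class = max(range(2), key=lambda i: example_pred["probabilities"][i])
print(total, predicted_class)
```

Because the two values always add up to 1, looking at index 1 alone is enough: a survival probability below 0.5 means the model leans towards "did not survive".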

In [None]:
# look at the survived probability only:
print(pred_dicts[1]['probabilities'][1])

Now let's compare all the information of a specific person/datapoint with their probability of survival:

In [None]:
print(dfeval.loc[1]) # second person/datapoint, x values
print(y_eval.loc[1]) # actual y value. 
print(pred_dicts[1]['probabilities'][1]) # second dictionary/prediction

The info above shows that this person was a 54-year-old male who did not survive, and that the survival probability predicted by our model was 0.32 (pretty low).