# Core Learning Algorithms

## Four Basic Machine Learning Algorithms

- Linear Regression
- Classification
- Clustering
- Hidden Markov Models

## Linear Regression
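- used to predict numeric values
- if the data points are roughly linearly related, we can fit a line of best fit (y = mx + b) and use it to predict future values
- the line of best fit minimizes the total squared vertical distance between the points and the line; it does not simply put the same number of points on either side
- can be high dimensional: with several input features the line becomes a plane or hyperplane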
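
To make "line of best fit" concrete, here is a minimal sketch of an ordinary least-squares fit for a single feature, using the closed-form formulas for slope and intercept. The function name `fit_line` and the sample points are made up for illustration and are not part of the notebook.

```python
def fit_line(xs, ys):
    """Return (m, b) for the line y = m*x + b that minimizes squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    # the fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b

# Hypothetical points that lie roughly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

m, b = fit_line(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")
print(f"prediction for x = 6: {m * 6 + b:.2f}")
```

With more than one feature the same idea applies, but the slope becomes a vector of weights and the fit is usually found numerically rather than with these two formulas.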

### Example Dataset
- Titanic dataset
- has information about each passenger on the ship
- we will predict whether or not each passenger survived (a yes/no label, so the linear model is used here as a classifier)

In [None]:
from IPython.display import clear_output
from six.moves import urllib
import tensorflow.compat.v2.feature_column as fc

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import requests
import io

In [None]:
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')   # data for training
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')     # data for testing

# label is the "survived" column - remove it from train/eval and store in y (labels)
y_train = dftrain.pop('survived')   
y_eval = dfeval.pop('survived')

In [None]:
# Categorical data: non-numeric values drawn from a fixed set of categories (like "male" / "female")
# This data has to be encoded into a numerical format before the model can use it

CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck', 'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []


for feature_name in CATEGORICAL_COLUMNS:
    vocab = dftrain[feature_name].unique()  # array of all unique values in the given column (like ["male", "female"])
    # categorical_column_with_vocabulary_list -> creates a feature column from the feature name and its possible categories
    feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocab))

for feature_name in NUMERIC_COLUMNS:
    # numeric_column -> feature column with just numbers
    feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

print(feature_columns)
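
Conceptually, a vocabulary-based categorical column maps each category to an integer index, which a linear model can then treat as a one-hot vector. The plain-Python sketch below only illustrates that idea. `build_vocab` and `one_hot` are hypothetical helpers, not TensorFlow functions, and TensorFlow's internal handling may differ in detail.

```python
def build_vocab(values):
    """Map each distinct value to an index, in order of first appearance."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab

def one_hot(value, vocab):
    """Return a list with a 1 at the value's index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab[value]] = 1
    return vec

sex_column = ["male", "female", "female", "male"]  # made-up sample values
vocab = build_vocab(sex_column)
encoded = [one_hot(value, vocab) for value in sex_column]

print(vocab)    # {'male': 0, 'female': 1}
print(encoded)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```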

### Training Process
- load the data to the model in batches of 32 entries (large datasets may not fit in memory all at once)
- **Epochs**: an epoch is one full pass over the entire dataset
    - the number of epochs defines how many times the model will look at the data
    - the model has to see the data enough times to learn how to predict
    - too many epochs can cause overfitting: the model becomes accurate only for this specific dataset
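
The relationship between batches and epochs can be sketched without TensorFlow. The generator below is a hypothetical stand-in (`iterate_batches` is not a TensorFlow function) that walks over a list in fixed-size batches, once per epoch; the last batch of each epoch is smaller when the data does not divide evenly.

```python
def iterate_batches(rows, batch_size, num_epochs):
    """Yield (epoch, batch) pairs, covering every row once per epoch."""
    for epoch in range(num_epochs):
        for start in range(0, len(rows), batch_size):
            yield epoch, rows[start:start + batch_size]

rows = list(range(100))  # stand-in for 100 training examples
batches = list(iterate_batches(rows, batch_size=32, num_epochs=2))

for epoch, batch in batches:
    print(f"epoch {epoch}: batch of {len(batch)} rows")
```

Here 100 rows with a batch size of 32 give four batches per epoch (32, 32, 32 and 4 rows), so two epochs produce eight batches in total.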

**Input Function**: defines how our dataset will be converted into batches at each epoch 

In [None]:
# Define an input function that turns a pandas DataFrame into a tf.data.Dataset object the model can read
# This is from the TensorFlow website

def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
    def input_function():  # inner function, this will be returned
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))  # create tf.data.Dataset object with data and its label
        if shuffle:
            ds = ds.shuffle(1000)  # randomize order of data
        ds = ds.batch(batch_size).repeat(num_epochs)  # split dataset into batches of batch_size and repeat for the number of epochs
        return ds  # return the batched, repeated dataset
    return input_function  # return a function object for use

# make_input_fn returns a function object; the estimator calls it later to get the dataset it will read from
train_input_fn = make_input_fn(dftrain, y_train)
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)

In [None]:
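# Create the linear model (LinearClassifier fits a linear model for classification, i.e. logistic regression for two classes)
# Note: tf.estimator is deprecated and was removed in recent TensorFlow releases, so this cell likely needs TensorFlow 2.15 or earlier
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)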

# Train the model
linear_est.train(train_input_fn)

# Get model metrics and stats by testing on eval data
result = linear_est.evaluate(eval_input_fn)
print(f"Accuracy: {result['accuracy']}")
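
For a two-class problem, a linear classifier is essentially logistic regression: it computes a weighted sum of the features and passes it through a sigmoid to get a probability between 0 and 1. The sketch below shows that calculation by hand. The weights, bias and passenger values are invented for illustration and are not taken from the trained model.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Linear score (bias + weighted sum of features) turned into a probability."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Invented weights for two numeric features: [age, fare]
weights = [-0.03, 0.01]
bias = 0.5
passenger = [22.0, 7.25]  # hypothetical passenger: age 22, fare 7.25

p = predict_probability(passenger, weights, bias)
print(f"Predicted probability of survival: {p:.3f}")
```

Training amounts to adjusting the weights and bias so that these probabilities match the labels as closely as possible.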

### Using the Model

In [None]:
result = list(linear_est.predict(eval_input_fn))    # predict returns a generator - convert to list

In [None]:
# Print individual entries

entry = 0

print(dfeval.loc[entry])    # features of the selected entry (entry = 0 is the first row)
print(f"Survival Prediction: {result[entry]['probabilities'][1]}")    # predicted chance of survival for this entry in the eval dataset
print(f"Actual Survival: {y_eval.loc[entry]}")    # actual survival label from the dataset
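
To compare many predictions with the true labels at once, the survival probabilities can be turned into 0/1 predictions with a threshold and then scored. This is a minimal sketch under assumptions: the 0.5 threshold is a common default rather than something the notebook specifies, `to_labels` and `accuracy` are hypothetical helpers, and the probabilities and labels are invented.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert survival probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

survival_probs = [0.08, 0.91, 0.47, 0.66]  # invented probabilities
actual = [0, 1, 1, 1]                      # invented true labels

predicted = to_labels(survival_probs)
print(predicted)                    # [0, 1, 0, 1]
print(accuracy(predicted, actual))  # 0.75
```

In the notebook, the equivalent list of probabilities could be built from the prediction results with `[pred['probabilities'][1] for pred in result]`.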
