<a href="https://colab.research.google.com/github/SusheelThapa/ML-From-Scratch/blob/tensorflow/tensorflow/tensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Fundamentals

## Introduction to TensorFlow

### Installing TensorFlow

Use the command below to install **tensorflow** on your local machine:

```bash
pip install tensorflow
```

### Importing TensorFlow

In [None]:
import tensorflow as tf
tf.__version__  # version string of the installed TensorFlow

### What is a tensor?

A tensor is a generalization of vectors and matrices to potentially higher dimensions.

Internally, TensorFlow represents tensors as n-dimensional arrays of base data types.

Each tensor has a data type and a shape.

**Data types** include float32, int32, string, and others.

**Shape** represents the dimensions of the data.
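
As a small illustration of these two properties, the sketch below builds a 2x3 tensor and reads its data type and shape. The variable name `matrix` is only illustrative.

```python
import tensorflow as tf

# A 2x3 matrix of Python floats; TensorFlow infers float32 for them.
matrix = tf.constant([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0]])

print(matrix.dtype)  # data type shared by every element
print(matrix.shape)  # number of elements along each dimension
```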

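### Creating tensors

Below are some examples of creating tensors:

In [None]:
# The data type must be passed as the dtype keyword; the second positional argument of tf.Variable is not the dtype
string = tf.Variable("This is a string", dtype=tf.string)
number = tf.Variable(324, dtype=tf.int16)
floating = tf.Variable(3.567, dtype=tf.float64)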

### Rank/Degree of Tensors

Rank, also called degree, is the number of dimensions in a tensor.

The tensors created in the code block above are *tensors of rank zero*, also known as scalars.

Now, let's create a tensor of higher rank:

In [None]:
rank1_tensor = tf.Variable(["Something", "Nothing"], dtype=tf.string)

To find the rank of a tensor, we can call the `tf.rank()` function:

In [None]:
tf.rank(rank1_tensor)
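
Going one level further, a list of lists gives a tensor of rank 2. The following is a minimal sketch; the name `rank2_tensor` and its contents are made up for illustration.

```python
import tensorflow as tf

# Three rows of two strings each: two levels of nesting, so two dimensions.
rank2_tensor = tf.constant([["a", "b"],
                            ["c", "d"],
                            ["e", "f"]])

rank = tf.rank(rank2_tensor)
print(rank.numpy())        # 2
print(rank2_tensor.shape)  # 3 elements in the first dimension, 2 in the second
```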

### Shape of Tensors

The shape of a tensor is the number of elements that exist in each dimension.

*TensorFlow will try to determine the shape of a tensor, but sometimes it may be unknown.*

To get the shape of a tensor, we can read its **shape** attribute:

In [None]:
rank1_tensor.shape

### Changing the shape

The number of elements in a tensor is the product of the sizes of all its dimensions.

As a result, many different shapes hold the same number of elements, which makes it convenient to be able to change the shape of a tensor.

Example of changing the shape of a tensor:

In [None]:
tensor1 = tf.ones([1, 2, 3])  # tf.ones creates a tensor of the given shape with every element set to 1

tensor2 = tf.reshape(tensor1, [3, 2, 1])  # reshape the existing tensor to shape [3, 2, 1]

tensor3 = tf.reshape(tensor2, [3, -1])  # -1 tells TensorFlow to calculate the size of that dimension

# The original tensor and the reshaped tensors contain the same number of elements

Now, let's have a look at the shapes of the tensors we have created:

In [None]:
print(tensor1.shape)
print(tensor2.shape)
print(tensor3.shape)
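
The arithmetic behind `-1` can be sketched in plain Python. The helper `resolve_shape` below is a hypothetical illustration of the rule (the element count must stay the same), not TensorFlow's own implementation.

```python
from math import prod

def resolve_shape(num_elements, shape):
    """Replace a single -1 in `shape` with the size that preserves the element count."""
    if shape.count(-1) > 1:
        raise ValueError("at most one dimension can be -1")
    known = prod(d for d in shape if d != -1)  # product of the explicit sizes
    if known == 0 or num_elements % known != 0:
        raise ValueError(f"cannot reshape {num_elements} elements into {shape}")
    resolved = [num_elements // known if d == -1 else d for d in shape]
    if prod(resolved) != num_elements:
        raise ValueError(f"cannot reshape {num_elements} elements into {shape}")
    return resolved

total = prod([1, 2, 3])                  # 1 * 2 * 3 = 6 elements
print(resolve_shape(total, [3, 2, 1]))   # [3, 2, 1]
print(resolve_shape(total, [3, -1]))     # [3, 2]
```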

### Types of tensors

Commonly used tensor types are as follows:
- Variable
- Constant
- Placeholder
- SparseTensor
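
A minimal sketch of three of these types is shown below, with made-up values. Placeholder is left out because it belongs to TensorFlow 1.x graph mode (it remains available as `tf.compat.v1.placeholder`).

```python
import tensorflow as tf

# Variable: a mutable tensor whose value can be updated in place.
counter = tf.Variable([1, 2, 3])
counter.assign([4, 5, 6])

# Constant: an immutable tensor; it offers no assign method.
fixed = tf.constant([1, 2, 3])

# SparseTensor: stores only the non-zero values and their positions.
sparse = tf.sparse.SparseTensor(indices=[[0, 0], [1, 2]],
                                values=[10, 20],
                                dense_shape=[2, 3])
dense = tf.sparse.to_dense(sparse)  # fill the missing positions with zeros

print(counter.numpy())
print(fixed.numpy())
print(dense.numpy())
```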

## Core Learning Algorithms

We will be studying 4 fundamental machine learning algorithms.

- Linear Regression
- Classification
- Clustering
- Hidden Markov Models


### Linear Regression

Linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). (***Wikipedia***)
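
To make the idea concrete, here is a small sketch that fits a straight line to five made-up points with NumPy's least-squares `polyfit`. The data is hypothetical and this is not the estimator-based approach used later in this notebook.

```python
import numpy as np

# Hypothetical observations that roughly follow y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Fit a degree-1 polynomial: returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict the response for a new input.
prediction = slope * 5.0 + intercept

print(slope, intercept, prediction)
```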

#### Setup and Imports

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np  # Optimized array operations
import pandas as pd  # Data analysis tools
import matplotlib.pyplot as plt  # Visualization tools

from IPython.display import clear_output  # Clears the output of a cell
from six.moves import urllib

import tensorflow.compat.v2.feature_column as fc  # Required later for linear regression

import tensorflow as tf

#### Data

The dataset we will focus on here is the Titanic dataset. It has a lot of information about each passenger on the ship.

**Below, we will load a dataset and learn how we can explore it using some built-in tools**

In [None]:
# Load the datasets

# Training dataset
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
y_train = dftrain.pop('survived')

# Testing dataset
dftest = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')
y_test = dftest.pop('survived')

The `pd.read_csv()` function returns a new pandas *dataframe*. A dataframe is like a table, and we can look at its table representation directly.

We pop the "survived" column from each dataset and store it in a new variable because this column tells us whether the passenger survived or not. It is the label that our model should predict.

To look at the data, we will use the `head()` method from pandas.

In [None]:
dftrain.head()

If we need a statistical description of the data, we can use the `describe()` method.

In [None]:
dftrain.describe()

To get the data type of each column, the number of columns, and their names, we can use the `info()` method of pandas.

In [None]:
dftrain.info()

Let's have a look at the shape of the dataframe

In [None]:
dftrain.shape

Now, let's visualize the data we have got.

In [None]:
dftrain.age.hist(bins=20)

In [None]:
dftrain.sex.value_counts().plot(kind='barh')

In [None]:
dftrain['class'].value_counts().plot(kind='barh')

In [None]:
pd.concat([dftrain,y_train],axis=1).groupby('sex').survived.mean().plot(kind='barh').set_xlabel('% survived')

After analyzing this information, we should notice the following:
- The majority of passengers are in their 20s or 30s
- The majority of passengers are male
- The majority of passengers are in "Third" class
- Females have a much higher chance of survival

##### Training vs Testing Data

**Training data** is what we feed to the model so that it can develop and learn. It is usually much larger than the testing data.

**Testing data** is what we use to evaluate the model and see how well it is performing. It is important to evaluate the model on a separate set of data that it has not been trained on.


##### Feature Columns

In the dataset, we have two types of information
- **Categorical**

    It is anything that isn't numerical.

    *For example, the sex column does not use numbers; it uses the words "male" and "female".*

- **Numeric**

    These are the data with numeric values.

Before continuing, we need to change all our categorical data into numeric data.
To do this, TensorFlow has some tools to help us.

In [None]:
CATEGORICAL_COLUMNS = ['sex','n_siblings_spouses','parch','class','deck','embark_town','alone']
NUMERIC_COLUMNS = ['age','fare']

feature_columns =[]

for feature in CATEGORICAL_COLUMNS:
    vocabulary = dftrain[feature].unique()
    feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature,vocabulary))


for feature in NUMERIC_COLUMNS:
    feature_columns.append(tf.feature_column.numeric_column(feature,dtype=tf.float32))
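
Conceptually, a vocabulary list lets each category be looked up as an integer ID. The pandas-only sketch below illustrates that mapping on a few made-up rows; the names `index_of` and `encoded` are hypothetical, and the feature column performs its own lookup internally.

```python
import pandas as pd

# Hypothetical rows standing in for the categorical "sex" column.
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# Distinct values, in order of first appearance.
vocabulary = df["sex"].unique()

# Give every vocabulary entry an integer ID and encode the column with it.
index_of = {value: i for i, value in enumerate(vocabulary)}
encoded = df["sex"].map(index_of)

print(list(vocabulary))  # ['male', 'female']
print(encoded.tolist())  # [0, 1, 1, 0]
```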


#### Training the model

Training describes how the model learns from the data, and in particular how the data is fed to the model.

To train the model, we will feed it data in batches of 32 entries. In other words, we feed small batches of entries to the model, repeating the dataset for the specified number of **epochs**.

An **epoch** is one pass over the entire dataset. The number of epochs we define is the number of times the model will see the entire dataset.

*Example: if we have 10 epochs, the model will see the same dataset 10 times.*

To feed the data to the model in batches, we need an ***input function***, whose task is to convert the dataset into batches for each epoch.
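
The relationship between batch size, epochs, and the number of batches the model sees can be worked out with a little arithmetic. The dataset size of 100 below is a made-up figure chosen for illustration.

```python
import math

num_examples = 100  # hypothetical dataset size
batch_size = 32
num_epochs = 10

# Every epoch covers the whole dataset; the last batch may be smaller.
batches_per_epoch = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (batches_per_epoch - 1) * batch_size
total_batches = batches_per_epoch * num_epochs

print(batches_per_epoch)  # 4
print(last_batch_size)    # 4
print(total_batches)      # 40
```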

##### Input function

The TensorFlow model we are going to use requires that the data we pass to it comes in as a `tf.data.Dataset` object.

This means that we must create an *input function* that can convert our current pandas dataframe into that object.

The *input function* shown below is copied directly from the TensorFlow documentation.

In [None]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
  def input_function():
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df)) # create a tf.data.Dataset object with the data and its labels
    if shuffle:
      ds = ds.shuffle(1000) # randomize the order of the data
    ds = ds.batch(batch_size).repeat(num_epochs) # split the dataset into batches of batch_size and repeat for the number of epochs
    return ds # return the batched dataset
  return input_function # return a function object for use

train_input_fn = make_input_fn(dftrain, y_train)
eval_input_fn = make_input_fn(dftest, y_test, num_epochs=1, shuffle=False) # evaluate on the testing dataset
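
To see what such a dataset yields, the sketch below applies the same `from_tensor_slices`, `batch`, and `repeat` calls to a tiny made-up dataframe and records the size of each batch. The five rows and the batch size of 2 are assumptions chosen to keep the output short.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical miniature dataset with two numeric features.
features = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0, 28.0],
                         "fare": [7.25, 71.28, 7.92, 53.10, 8.05]})
labels = pd.Series([0, 1, 1, 1, 0])

# Pair each row of features with its label, then batch and repeat.
ds = tf.data.Dataset.from_tensor_slices((dict(features), labels))
ds = ds.batch(2).repeat(2)  # batches of 2 rows, 2 epochs

# Each element is a (feature dict, label batch) pair.
batch_sizes = [int(label_batch.shape[0]) for _, label_batch in ds]
print(batch_sizes)  # [2, 2, 1, 2, 2, 1]
```

Five rows with a batch size of 2 give three batches per epoch, the last one holding a single row, and the pattern repeats once per epoch.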

##### Creating the model

We will use a linear estimator to apply the linear regression approach. Because the label here is a class (survived or not), the estimator is a `LinearClassifier`.

In [None]:
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns) # We are creating a linear estimator by passing the feature columns we created earlier.

##### Training the model

Training the model is as easy as passing the input functions that we created earlier.

In [None]:
# Training the model
linear_est.train(train_input_fn) # just passing the input function

#### Testing our model
Testing works the same way as training, but here we pass the input function for the testing dataset.

In [None]:
result = linear_est.evaluate(eval_input_fn)

print("The accuracy of our model is ",result['accuracy'])

#### Predicting using our model

In [None]:
result = list(linear_est.predict(eval_input_fn))

print("Passenger's chance of survival is ", result[100]['probabilities'][1])
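
For two classes, a linear classifier is commonly a logistic model: it computes a weighted sum of the features (the logit) and passes it through a sigmoid to obtain a probability. The sketch below illustrates that idea in plain Python. The weights, bias, and passenger values are hypothetical, and this is not the estimator's actual code or its learned parameters.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for two numeric features.
weights = {"age": -0.02, "fare": 0.01}
bias = 0.1
passenger = {"age": 30.0, "fare": 50.0}

# Weighted sum of the features plus the bias.
logit = bias + sum(weights[name] * passenger[name] for name in weights)

p_survived = sigmoid(logit)
probabilities = [1.0 - p_survived, p_survived]  # [class 0, class 1]

print(probabilities[1])
```

Index 1 corresponds to the "survived" class, which is why the prediction cell above reads `['probabilities'][1]`.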

### Classification

#### Importing the necessary packages

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import pandas as pd

#### Datasets

This dataset separates flowers into 3 different classes of species:
- Setosa
- Versicolor
- Virginica

The information about each flower is the following:
- sepal length
- sepal width
- petal length
- petal width


#### Loading the datasets

Next, we will be loading the datasets

In [None]:
# Defining some constants that will help later on
CSV_COLUMN_NAMES = ["SepalLength","SepalWidth","PetalLength","PetalWidth","Species"]
SPECIES = ["Setosa","Versicolor","Virginica"]

# Loading the datasets, we are using keras to grab our datasets and read them into pandas dataframe
train_path = tf.keras.utils.get_file(
    "iris_training.csv","https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv"
)
test_path = tf.keras.utils.get_file(
    "iris_test.csv","https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv"
)

train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)

Let's look at our datasets.

In [None]:
train.head()

Now, we can pop the "Species" column, as it is the label to classify.

In [None]:
y_train = train.pop('Species')
y_test = test.pop('Species')

Let's look into shape of our datasets

In [None]:
train.shape

So, we have 120 entries with 4 features each.

#### Input function

In [None]:
def input_fn(features, labels, training=True, batch_size=256):
    # Convert the inputs to a dataset
    dataset = tf.data.Dataset.from_tensor_slices((dict(features),labels))

    # Shuffle and repeat if you are in training mode
    if training:
        dataset= dataset.shuffle(1000).repeat()
    
    return dataset.batch(batch_size)

#### Feature columns

In [None]:
my_feature_columns = []

for key in train.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))

#### Building the model

For classification tasks, there is a variety of estimators/models that we can pick from.

Some options are listed below:

- `DNNClassifier` (Deep Neural Network)
- `LinearClassifier`

We can choose either model, but the DNN is usually the safer choice here because we may not be able to find a linear correspondence in our data.

In [None]:
# Building a DNN with 2 hidden layers of 30 and 10 nodes respectively
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[30, 10],  # Defining the two hidden layers
    n_classes=3  # The model must choose between 3 classes
)

#### Training the model

In [None]:
classifier.train(
    input_fn = lambda: input_fn(train,y_train, training=True),
    steps=5000
)

#### Testing the model

In [None]:
classifier.evaluate(
    input_fn=lambda:input_fn(test, y_test, training = False)
    )

#### Making predictions

In [None]:
def input_fn_pred(features, batch_size=256):
    # Convert the inputs to a Dataset without labels.
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)

features = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
predict = {}

print("Please type numeric values as prompted.")
for feature in features:
  while True:  # keep asking until the input parses as a number
    val = input(feature + ": ")
    try:
      predict[feature] = [float(val)]  # accepts decimals such as 5.1
      break
    except ValueError:
      print("Please enter a number.")

predictions = classifier.predict(input_fn=lambda: input_fn_pred(predict))
for pred_dict in predictions:
    class_id = pred_dict['class_ids'][0]
    probability = pred_dict['probabilities'][class_id]

    print('Prediction is "{}" ({:.1f}%)'.format(
        SPECIES[class_id], 100 * probability))
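
The predicted class is the one with the highest probability. The sketch below shows that selection on a made-up probability vector; the numbers are hypothetical and do not come from the trained classifier.

```python
import numpy as np

SPECIES = ["Setosa", "Versicolor", "Virginica"]

# Hypothetical class probabilities for one flower, in SPECIES order.
probabilities = np.array([0.05, 0.85, 0.10])

# The class ID is the position of the largest probability.
class_id = int(np.argmax(probabilities))

message = 'Prediction is "{}" ({:.1f}%)'.format(
    SPECIES[class_id], 100 * probabilities[class_id])
print(message)  # Prediction is "Versicolor" (85.0%)
```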


### Clustering

Clustering is a machine learning technique that involves the grouping of data points. In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features.

#### Algorithm for K-means clustering

1. Randomly pick K points to place K centroids.
2. Assign all of the data points to the centroids by distance. The closest centroid to a point is the one it is assigned to.
3. Average all of the points belonging to each centroid to find the middle of those clusters (center of mass). Move the corresponding centroids to those positions.
4. Reassign every point once again to the closest centroid.
5. Repeat steps 3-4 until no point changes which centroid it belongs to.
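
The steps above can be sketched with NumPy as follows. This is a minimal illustration under a few assumptions: distances are Euclidean, the initial centroids are drawn from the data points themselves, and the function name `k_means`, its parameters, and the sample points are all hypothetical.

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (an n x d float array) into k groups."""
    rng = np.random.default_rng(seed)
    # Pick K distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignments = None
    for _ in range(max_iters):
        # Assign every point to its closest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        # Stop once no point changes the centroid it belongs to.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignments

# Hypothetical 2-D data forming two loose groups.
points = np.array([[1.0, 1.1], [1.5, 0.8], [0.9, 1.6],
                   [8.0, 8.2], [8.4, 7.7], [7.8, 8.9]])
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

The result depends on the random initial centroids, so a different `seed` may label the clusters in a different order.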