# Introduction to Machine Learning

# What is it and Why do I care?

![Google trends](img/gootrend.jpg)

![Best Jobs](img/mljobs.jpg)

# Learning Resources

* Machine Learning by Stanford : https://www.coursera.org/learn/machine-learning
* Amazon Machine Learning University : https://aws.training/machinelearning
* Google Machine Learning Course : https://developers.google.com/machine-learning/crash-course
* Fast.ai courses : https://www.fast.ai/
* Kaggle Courses : https://www.kaggle.com/learn/overview
* Book : Data Science from Scratch
* Book : Introduction to Machine Learning with Python

# 1. Data Landscape

![image.png](attachment:image.png)

# 2. AI vs ML vs DL vs DS


![ML Meme](img/mlmeme.jpeg)



--------------


![image.png](attachment:image.png)

# 3. Common Machine Learning terminology



## Scalar

## Vector

## Matrix

## CPU

## GPU

## Distributed

## Batch

## Realtime

## Algorithm

## Model

## Line

## Plane

## Hyperplane

## Linear

## Non Linear

## Binary

## Non-Binary

## Multi Variate

## Numerical Variables

## Categorical Variables

## Nominal Variables

## Ordinal Variables

## Features

## Label

## Feature Engineering

## Feature Selection

## Predict/Inference

## Train

## Test

## Validation

## Scoring

## Overfitting

## Underfitting

## Supervised

## Unsupervised

## Semi Supervised

## Reinforcement Learning

## Neural Networks

## Deep Learning

## Regression

## Classification

## Clustering

## Transformation

## Scaling

## Optimization

## Accuracy

## Cost function


# 4. Machine Learning Categories

![image.png](attachment:image.png)

![title](img/MLcat.jpg)



![image.png](attachment:image.png)

## Supervised Learning


* Definition : Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.

* The example Input-Output pair is called the training data

* Input variables or data are the identifiers or attributes of the example that is being considered

* Output or a label is the attribute that is correctly associated with the Input

* Example of Sales data:

* Input variables can be month (January), day of the week (Sunday), weather (20) and date (15)

* Output variable can be : total sales($100,000)

* Let's call the input variables/vector : x

* Let's call the output : y

* Simply put, the process of finding a function f such that y = f(x) is called supervised learning

* It is called learning because the algorithm iteratively corrects or optimizes the function f until it achieves an acceptable level of accuracy

* Supervised Machine Learning can be further classified into two types : Classification and Regression
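The idea of learning a function f from input-output pairs can be sketched as follows. This is only an illustration: the sales figures are made up, the feature encoding (month, day of week, temperature, day of month) is an assumption, and a linear function fitted by least squares is just one possible choice of f.

```python
import numpy as np

# Hypothetical training examples (values are made up for illustration).
# Each row is one day: [month, day_of_week, temperature, day_of_month].
X = np.array([
    [1, 7, 20, 15],
    [1, 1, 18, 16],
    [2, 3, 25, 10],
    [2, 6, 30, 20],
    [3, 5, 28, 5],
    [3, 7, 22, 12],
], dtype=float)

# Label for each example: total sales in dollars (also made up).
y = np.array([100_000, 80_000, 95_000, 120_000, 110_000, 105_000], dtype=float)

# One simple choice of f: a linear function of the inputs plus an intercept,
# fitted by least squares on the training pairs.
X_with_bias = np.column_stack([np.ones(len(X)), X])
weights, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)


def f(x):
    """Predict sales for one input vector x using the learned weights."""
    return weights[0] + np.dot(weights[1:], x)


# Predictions for all training examples, and for a single new input.
predictions = X_with_bias @ weights
print(f(np.array([2.0, 7.0, 24.0, 14.0])))
```

Here `X` plays the role of x, `y` holds the labels, and `f` is the learned mapping.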

![Classification VS Regression](img/clareg.jpeg)


## Regression

* Regression is the technique of predicting a continuous numerical value for the output variable Y

* Examples : 

* Predicting weather

* Predicting value of a house

* Forecasting sales for the next week



## Classification

* Classification is the technique of predicting labels or discrete values for the output variable Y

* Example : 

* Predicting gender of a baby

* Predicting winner of NFL



# Machine Learning Algorithms
## Linear Regression

* Linear Regression is one of the most well known algorithms in M.L
* It is both a statistical and an M.L algorithm
* The basic assumption of Linear Regression Model is that the input X and Output Y have a linear relationship and that a Linear equation can be written such that Y can be calculated from a value of X

![Linear Vs NonLinear](img/linear-nonlinear-relationships.png)
* When there is a single input variable X, it is called Simple Regression

* A Simple linear equation


In [None]:
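# Simple linear equation: y = B0 + B1 * x
# y is the value to be predicted (example: value of a house)
# x is the input variable (example: year it was built)
# B0 and B1 are called parameters
# predict is an illustrative helper name
def predict(x, B0, B1):
    return B0 + B1 * x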

![Linear Regression Explained](img/LinearRegExpl.png)


* y : the value of the dependent variable
* β₀ : the y-intercept, the point where the line intersects the Y-axis
* β₁ : the slope or gradient of the line (the coefficient)
* x : the value of the independent variable
* u : the residual or noise caused by unexplained factors (the full model is y = β₀ + β₁x + u)

* In case of Linear Regression the model basically learns the best values of the parameters β₀ and β₁

* Once the model learns the parameters it can calculate the value of y based on a value of x

## How do models learn the parameters

* There are multiple ways of learning the parameters

* Examples : minimising a cost function, Ordinary Least Squares
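For simple linear regression, Ordinary Least Squares has a well-known closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal sketch, assuming NumPy and a hypothetical helper name `fit_simple_ols`:

```python
import numpy as np


def fit_simple_ols(x, y):
    """Return (b0, b1) minimising the sum of squared residuals of y = b0 + b1*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x.
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Made-up points that lie exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)
```

Because the sample points lie exactly on a line, the recovered parameters match the intercept 1 and slope 2 used to generate them.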

## Cost Function

* Cost Functions are used to determine or estimate the difference between predicted vs actual value

* Example Cost Function : Mean Squared Error

* Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

![MSE](img/MSE.png)

* The cost function (also referred to as loss or error) can be estimated by iteratively running the model and comparing its predictions against the “ground truth”, that is, the known values of y.

* The objective of a ML model, therefore, is to find parameters, weights or a structure that minimises the cost function.
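The MSE described above can be computed in a few lines. This is a sketch assuming NumPy; the function name `mse` and the sample values are chosen only for illustration.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


# Made-up ground truth and predictions.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# The squared errors are 1, 0 and 4, so the mean is 5/3.
cost = mse(y_true, y_pred)
print(cost)
```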


## Gradient Descent

* Gradient Descent is an algorithm used to minimize the cost function

![Gradient Descent](img/GD.gif)

* As the model iterates, it gradually converges towards a minimum where further tweaks to the parameters produce little or zero changes in the loss
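The iterative update can be sketched for simple linear regression as follows. This is an illustrative version, not an optimized one: the learning rate, the iteration count and the function name `gradient_descent` are assumptions chosen so that the small made-up dataset converges.

```python
import numpy as np


def gradient_descent(x, y, learning_rate=0.05, n_iterations=2000):
    """Fit y = b0 + b1*x by gradient descent on the mean squared error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b0, b1 = 0.0, 0.0  # start from an arbitrary guess
    n = len(x)
    for _ in range(n_iterations):
        error = (b0 + b1 * x) - y
        # Partial derivatives of the MSE with respect to b0 and b1.
        grad_b0 = (2.0 / n) * np.sum(error)
        grad_b1 = (2.0 / n) * np.sum(error * x)
        # Step against the gradient to reduce the cost.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


# Made-up points that lie exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
b0, b1 = gradient_descent(x, y)
print(b0, b1)
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence slow, so in practice it is usually tuned.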


## Putting it all together

![How does it work](img/LRgif.gif)

* Once the model is trained, it learns the values of B0 and B1
* Now these can be used to predict the value of y for any arbitrary value of x





## Implementing the algorithm in python

* Linear regression can be implemented in Python with the help of libraries like NumPy
* We will have to write our own code to calculate things like : MSE, gradient descent, covariance
* We will have to write the code to train the model iteratively on the dataset and calculate the accuracy etc.
* Or like everything else in Python we can re-use an existing framework/module/library



## Machine Learning Frameworks

* Examples : Scikit Learn, SparkML, TensorFlow, Caffe, PyTorch etc.
* These frameworks have optimized implementations of almost all common M.L algorithms


## Scikit Learn

* One of the most popular frameworks for Machine Learning
* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license


## Train & Test Split

* The data used for building models is split between train and test sets
* Typically 80% of the data is used for training and 20% is used for testing
* For massive data sets, it's common to use 95% of data for training and 5% for testing
* Once the model is trained on the train set, it is used to predict the outputs of the test set
* The model's performance is determined based on how the predicted values compare to the actual values
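A shuffled 80/20 split can be sketched with NumPy alone. The helper name `split_train_test` and its parameters are hypothetical; the scikit-learn example later in this document uses the library's own `train_test_split` instead.

```python
import numpy as np


def split_train_test(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Ten made-up examples with two features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = split_train_test(X, y)

# 8 rows go to training and 2 rows to testing.
print(X_train.shape, X_test.shape)
```

Shuffling before splitting matters when the rows are ordered, for example by date or by label; fixing the seed keeps the split reproducible.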


## Model Performance Evaluation

**Root Mean Squared Error**

* RMSE is the standard deviation of the prediction errors (residuals)
* RMSE tells you how concentrated the data is around the line of best fit


**R2**

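* R-squared is between 0 and 100% for a linear regression evaluated on its own training data; on held-out data it can be negative when the model predicts worse than simply using the mean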
* 0% represents a model that does not explain any of the variation in the response variable around its mean
* 100% represents a model that explains all of the variation in the response variable around its mean
* Used as Model score

* The higher the value of R2 the better (some conditions apply)
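Both metrics can be computed directly from their definitions. The sketch below assumes NumPy, uses made-up values, and the function names `rmse` and `r_squared` are chosen for illustration; scikit-learn's `mean_squared_error` and `r2_score` are used in the example later in this document.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up actual values and two sets of predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
good_pred = np.array([1.1, 1.9, 3.2, 3.8])  # close to the actual values
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])   # worse than predicting the mean

print(rmse(y_true, good_pred), r_squared(y_true, good_pred))
print(rmse(y_true, bad_pred), r_squared(y_true, bad_pred))
```

The close predictions score about 0.98, while the reversed predictions score -3, which shows that with this definition R2 drops below 0 when a model does worse than always predicting the mean.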


## Model Fit

## Underfit, Balanced and Overfit

![Model Fit](img/fit.png)
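One way to see underfitting and overfitting is to fit polynomials of increasing degree to the same noisy data and compare the error on the training set with the error on unseen data. The sketch below is only an illustration: the synthetic sine data, the noise level and the chosen degrees are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Synthetic data for illustration: a sine curve plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y


def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)


x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

train_errors, test_errors = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree on the training data.
    coefficients = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = mse(y_train, np.polyval(coefficients, x_train))
    test_errors[degree] = mse(y_test, np.polyval(coefficients, x_test))
    print(degree, train_errors[degree], test_errors[degree])
```

The training error can only go down as the degree grows, because each higher-degree polynomial includes the lower-degree ones. A low degree that leaves both errors high suggests underfitting; a high degree whose training error keeps falling while the test error rises suggests overfitting.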

# Linear Regression With Scikit Learn Example

* In this example we will predict the progression of diabetes after one year based on given features 
* We will use the built-in diabetes dataset of scikit learn
* Source data : https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt
* We will do some basic analysis, split the data into train and test sets, fit the model, get predictions for the test set and evaluate the model


In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split


#Dataset
#https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt

    
# Load the diabetes dataset
diabetes = datasets.load_diabetes()


print('Data Set Keys :', diabetes.keys())

print('Feature Names : ', diabetes.feature_names)

print('Description :', diabetes.DESCR)


In [None]:
#Create Pandas Dataframe from the diabetes data
diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
print(diabetes_df.head())

#Add the Target or Output variable and name it progression
diabetes_df['progression'] = diabetes.target
print(diabetes_df.head())


**Generate scatter plot to determine correlation between features and the progression(output)**

In [None]:
#Generate scatter plot to determine correlation between features and the progression(output)
plt.figure(figsize=(20, 5))
features = ['bmi', 'age', 'bp']
target = diabetes_df['progression']

for i, col in enumerate(features):
    plt.subplot(1, len(features) , i+1)
    x = diabetes_df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('progression')
    
plt.show()


* **BMI and BP seem to have a linear relationship whereas age does not**
* **For the sake of this example, let's select BMI as our input feature for simple linear regression**

In [None]:

#Create a dataframe for the input variable(X) from bmi
X = pd.DataFrame(diabetes_df['bmi'].values.reshape(-1, 1))

#Y is the output variable
Y = diabetes_df['progression']

#Split the data into Test and Train sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

**Create Linear Regression Object, fit the model and generate predictions on the test data set**

In [None]:
# Create linear regression object
lr_model = linear_model.LinearRegression()

# Train the model using the training sets
lr_model.fit(X_train, Y_train)

# Make predictions using the testing set
Y_pred = lr_model.predict(X_test)

**Generate the Model parameters and performance metrics**

In [None]:
# The coefficients
print('Coefficients: \n', lr_model.coef_)


#The Intercept
print('Intercept', lr_model.intercept_)
# The root mean squared error
print("Root Mean squared error: %.2f"
      % np.sqrt(mean_squared_error(Y_test, Y_pred)))
# R2 score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(Y_test, Y_pred))


In [None]:
# Plot outputs and visualize the best fit line
plt.scatter(X_test, Y_test,  color='black')
plt.plot(X_test, Y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

In [None]:
#Create a DataFrame to compare actual vs predicted values
compare_df = Y_test.to_frame(name='Actual Value').reset_index()

#Y_pred is a NumPy array in the same row order as Y_test
compare_df['Predicted Value'] = Y_pred

print(compare_df)