# Machine Learning Algorithms

1. Naive Bayes Algorithm
2. Decision Trees
3. Random Forest
4. Linear Regression
5. Logistic Regression
6. Support Vector Machines
7. kNN - k Nearest Neighbours
8. K-Means

### 1. Naive Bayes
Based on Bayes' theorem, it assumes that the features are independent of one another given the class. In simple terms, the presence of a particular feature in a class is unrelated to the presence of any other feature.

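***Pros***
 - Easy to implement and fast to train
 - Works well on multi-class problems
 - When the independence assumption holds, it can perform better than other classifiers
 
***Cons***
 - Handles categorical variables more naturally than numerical ones; numerical variables need a distributional assumption (for example, Gaussian)
 - Known as a bad probability estimator: the predicted class is often right, but the predicted probabilities are unreliable
 - Relies on the assumption of independent variables, which rarely holds in real-life data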
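
The independence assumption means the score of each class factorizes into a prior times one likelihood term per feature. Below is a minimal pure-Python sketch of a Gaussian variant, assuming numeric features. The helper names (`fit_gaussian_nb`, `predict_gaussian_nb`) and the toy data are hypothetical and are not part of the notebook or of scikit-learn.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            # Independence assumption: each feature contributes its own term.
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data (made up for illustration): two features, two classes.
toy_rows = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.1), (5.0, 6.0), (5.2, 5.9), (4.8, 6.2)]
toy_labels = ["a", "a", "a", "b", "b", "b"]
toy_model = fit_gaussian_nb(toy_rows, toy_labels)
```

Calling `predict_gaussian_nb(toy_model, (1.1, 2.0))` picks the class whose per-feature Gaussians best explain the point. Working in log space avoids multiplying many small probabilities together.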

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

iris = pd.read_csv('Iris.csv', index_col = 0)
iris.head()

| Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [5]:
target = iris.Species
X_input = iris.drop('Species', axis = 1)
X_input.head()

| Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 |


In [6]:
target.head()

Id
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
5    Iris-setosa
Name: Species, dtype: object

In [11]:
# train test split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_input, target, test_size=0.20, random_state = 100)
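
Conceptually, the split shuffles the rows with a fixed seed and holds out a fraction of them for testing. The following is a rough pure-Python sketch of that idea. `simple_train_test_split` is a hypothetical helper, not scikit-learn's implementation, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import random

def simple_train_test_split(rows, labels, test_size=0.2, seed=100):
    """Shuffle indices with a fixed seed, then hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    return rows_train, rows_test, labels_train, labels_test

# Made-up data: 150 rows, with each label derived from its row so alignment can be checked.
demo_rows = [[i, i * 2] for i in range(150)]
demo_labels = [i % 3 for i in range(150)]
tr_X, te_X, tr_y, te_y = simple_train_test_split(demo_rows, demo_labels)
```

Reusing the same index lists for rows and labels is what keeps each feature row paired with its own label after shuffling.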

In [12]:
# Import Naive Bayes Algorithm
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)
prediction = nb.predict(X_test)

In [13]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, prediction)
print(score)

0.9666666666666667
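
Accuracy is simply the fraction of test rows whose predicted label matches the true label. If the file holds the standard 150 Iris rows, the 20% test set has 30 rows, and a score of 0.9667 corresponds to 29 correct predictions out of 30. A small sketch of the calculation, using a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels: 30 test rows with a single mistake.
demo_true = ["Iris-setosa"] * 10 + ["Iris-versicolor"] * 10 + ["Iris-virginica"] * 10
demo_pred = list(demo_true)
demo_pred[15] = "Iris-virginica"
demo_score = accuracy(demo_true, demo_pred)
```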


### 2. Decision Trees
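A supervised learning algorithm used mostly for classification problems, although it also handles regression. It predicts by following a sequence of feature-based decision rules from the root of the tree down to a leaf.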

***Advantages***
 - Works for both categorical & numerical variables
 - Simple to understand and interpret
 - Requires little data preparation
 - Able to handle multiple output problems

***Disadvantages***
 - Prone to overfitting
 - Can be unstable: small changes in the data may produce a very different tree
 - Struggles to learn concepts that trees do not express easily, such as XOR or parity

In [None]:
from sklearn.tree import DecisionTreeClassifier
# Fit a tree on the training split, then predict species for the test split
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
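
A tree is grown by repeatedly choosing the split that makes the child nodes as pure as possible. The sketch below scores candidate thresholds on one feature with Gini impurity, which is the default criterion of `DecisionTreeClassifier`. The helper names and the petal-length values are made up for illustration, and a real tree would search every feature and recurse on the resulting children.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return (threshold, weighted impurity)."""
    best = (None, gini(labels))
    unique = sorted(set(values))
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Made-up petal lengths: short petals belong to one class, long petals to another.
petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]
split_threshold, split_impurity = best_threshold(petal_lengths, species)
```

On this toy data the best split falls in the gap between the two groups and leaves both children pure.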

### 3. Random Forest
An ensemble learning method that builds many decision trees, each trained on a random sample of the data, and combines their predictions.
Used for classification and regression.

***Advantages***
 - Works for both categorical & numerical variables
 - Able to handle multiple output problems
 - Less prone to overfitting than a single decision tree
 - Flexible and usually highly accurate

***Disadvantages***
 - Harder and more time consuming to construct
 - Computationally expensive
 - Less interpretable than a single decision tree

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Fit an ensemble of trees on the training split, then predict on the test split
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
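
The two ingredients of the ensemble idea are bootstrap sampling (each tree sees a random sample drawn with replacement) and aggregation of the per-tree predictions. The sketch below shows both in plain Python. The helper names and the per-tree outputs are hypothetical, and this illustrates the general idea rather than scikit-learn's internals.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples with replacement (one bag per tree)."""
    picks = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in picks], [labels[i] for i in picks]

def majority_vote(predictions):
    """Combine per-tree predictions for one sample into a single label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
source_rows = [[1], [2], [3], [4]]
bag_rows, bag_labels = bootstrap_sample(source_rows, ["a", "a", "b", "b"], rng)

# Hypothetical outputs of three trees for the same test sample.
forest_prediction = majority_vote(["setosa", "versicolor", "setosa"])
```

Because each tree is trained on a different bag, their individual errors tend to differ, which is why combining them reduces overfitting compared with a single tree.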

### 4. Linear Regression
Used to estimate continuous real values such as cost, price, or sales from numeric feature variables.
Establishes the mathematical relationship between the features and the target variable by fitting the best line, which for a single feature is represented by the equation

***y = ax + b***

y - target variable

a - slope

x - feature variable

b - intercept
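
For a single feature, the slope `a` and intercept `b` of the best-fit line have a closed-form least-squares solution. The sketch below computes them in plain Python. `fit_line` is a hypothetical helper, and the points are made up so that they lie exactly on a known line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (a, b) in y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up points lying exactly on y = 2x + 1
a, b = fit_line(xs, ys)
```

With several features the same principle applies, but the slope becomes a vector of coefficients, one per feature.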

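In [None]:
from sklearn.linear_model import LinearRegression
# Regression needs a numeric target and Species is a string label, so as an
# illustration predict PetalWidthCm from the three other measurements.
X_reg_train, X_reg_test = X_train.drop('PetalWidthCm', axis = 1), X_test.drop('PetalWidthCm', axis = 1)
y_reg_train, y_reg_test = X_train['PetalWidthCm'], X_test['PetalWidthCm']
reg = LinearRegression()
reg.fit(X_reg_train, y_reg_train)
reg_prediction = reg.predict(X_reg_test)

In [None]:
from sklearn.metrics import r2_score
# Compare the regression predictions with the held-out numeric target
score = r2_score(y_reg_test, reg_prediction)
print(score)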
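
The R² score compares the model's squared errors with those of a baseline that always predicts the mean of the target. A value of 1 means a perfect fit and 0 means no better than the mean. A minimal sketch with a hypothetical `r_squared` helper and made-up numbers:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true is constant; R^2 is undefined")
    return 1 - ss_res / ss_tot

observed = [3.0, 5.0, 7.0, 9.0]                # made-up targets
perfect = r_squared(observed, observed)        # predictions equal the targets
baseline = r_squared(observed, [6.0] * 4)      # always predict the mean
```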

### 5. Logistic Regression
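Used to estimate discrete values (typically binary, i.e. 0 & 1 or True & False) in classification problems. It passes a linear combination of the features through the logistic (sigmoid) function to obtain a class probability.


***Pros***

 - A robust baseline: simple, fast to train, and outputs class probabilities
 - Works best for binary classification
 
***Cons***

 - In its basic form it separates only two classes; multiclass problems need a multinomial or one-vs-rest extension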

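In [None]:
from sklearn.linear_model import LogisticRegression
# max_iter is raised from the default of 100 to give the solver more room to converge
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)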
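
For the binary case, the model turns a linear score into a probability with the logistic (sigmoid) function and then thresholds it. The sketch below shows that prediction step only. The weights and bias are hypothetical values chosen for illustration, not coefficients learned from the Iris data.

```python
import math

def sigmoid(z):
    """Map any real score into the range 0..1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_binary(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted class) for one sample."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    return probability, int(probability >= threshold)

# Hypothetical weights and bias, chosen only for illustration.
prob, label = predict_binary([1.5, -0.5], -1.0, [2.0, 1.0])
```

Training consists of finding the weights and bias that make these probabilities match the observed labels as closely as possible, which is what `fit` does.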

### 6. Support Vector Machines
A robust classification algorithm that performs well on both binary and multiclass problems.
It separates the classes with a maximum-margin hyperplane in n-dimensional feature space, and kernel functions let it draw non-linear boundaries.


***Pros***

 - Very robust and efficient classifier
 - Works well on complex datasets
 - Very effective in high dimensional spaces
 - Versatile: different kernel functions let it fit data with many shapes and distributions
 
***Cons***

 - Computationally very expensive on large datasets
 - Complex, hard, and time consuming to construct & tune

In [None]:
from sklearn import svm
# SVC with default settings; fit on the training split, predict on the test split
clf = svm.SVC()
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
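
In the linear case, a support vector machine classifies a point by the sign of `w.x + b` and is trained to keep points beyond a margin of 1 on the correct side, which the hinge loss expresses. The sketch below uses a hypothetical hyperplane and made-up points. Note that `svm.SVC()` uses an RBF kernel by default, which this linear sketch does not cover.

```python
def decision_function(weights, bias, features):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def hinge_loss(weights, bias, features, label):
    """Hinge loss for a label in {-1, +1}: zero once the point clears the margin."""
    return max(0.0, 1.0 - label * decision_function(weights, bias, features))

# Hypothetical hyperplane in 2-D: x1 + x2 - 3 = 0.
w, b = [1.0, 1.0], -3.0
inside_margin = hinge_loss(w, b, [2.0, 1.5], +1)   # score 0.5, correct side but inside the margin
outside_margin = hinge_loss(w, b, [4.0, 2.0], +1)  # score 3.0, beyond the margin
wrong_side = hinge_loss(w, b, [1.0, 1.0], +1)      # score -1.0, misclassified
```

Only points that are inside the margin or misclassified contribute to the loss, which is why the fitted boundary depends on a subset of the training points (the support vectors).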

### 7. k-NN (k Nearest Neighbours)
Can be used for both classification and regression, though it is mostly used in classification problems.
A new point is assigned the majority label (or, for regression, the average value) of its k nearest neighbors in the training data.


***Pros***

 - Easy to interpret
 - High predictive power
 
***Cons***

 - Can be computationally expensive
 - Relies on good preprocessing, such as feature scaling and removal of outliers, because predictions depend directly on distances

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# n_neighbors is the integer k; 5 is scikit-learn's default
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
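
The whole algorithm fits in a few lines: measure the distance from the query point to every training row, keep the k closest, and take a majority vote. A minimal pure-Python sketch, assuming Euclidean distance and Python 3.8+ for `math.dist`; `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2-D points forming two well-separated groups.
knn_rows = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
knn_labels = ["a", "a", "a", "b", "b", "b"]
near_first_group = knn_predict(knn_rows, knn_labels, (1.1, 1.0))
near_second_group = knn_predict(knn_rows, knn_labels, (5.0, 5.1))
```

There is no training step beyond storing the data, so all of the cost is paid at prediction time. That is the source of the computational expense on large datasets.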

### 8. K-Means
An unsupervised algorithm that partitions the data into k clusters by assigning each point to its nearest centroid and iteratively updating the centroids.


In [None]:
from sklearn.cluster import KMeans
# Unsupervised: fit() uses only the features, never the species labels
cls = KMeans(n_clusters = 3)
cls.fit(X_train)
cluster_ids = cls.predict(X_test)
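
The classic procedure behind K-Means (Lloyd's algorithm) alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Below is a minimal pure-Python sketch with fixed starting centroids so the result is deterministic. The `kmeans` helper and the points are hypothetical, and scikit-learn's `KMeans` adds smarter initialization and other refinements.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments

# Made-up 2-D points forming two groups, with fixed starting centroids.
demo_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, cluster_ids = kmeans(demo_points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The cluster indices are arbitrary identifiers, so they do not correspond to species names unless they are matched up afterwards.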