# Notebook #5: Intro to Machine Learning

## Table of Contents:
###  1) What is Machine Learning?
###  2) Basic ideas of Machine Learning in Python
###  3) Logistic Regression
###  4) Random Forest
###  5) Support Vector Machines
###  6) Neural Networks
###  7) Example with GeoChem Data

###  1) What is Machine Learning?

#### Machine learning is a subset of AI that uses data and algorithms to imitate the way humans learn, gradually improving its accuracy on increasingly complex problems.

#### Machine learning is an important component of the growing field of data science. The models we make with ML are trained to make classifications or predictions, and to uncover key insights in data mining projects. These insights subsequently drive decision making within applications and businesses, ideally impacting key growth metrics.

###  2) Basic ideas of Machine Learning in Python

#### There are four main kinds of machine learning: supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning.

#### Supervised learning examples:
#### -Classification
#### -Regression

#### Unsupervised learning examples:
#### -Clustering
#### -Association
#### -Dimensionality Reduction
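
#### The difference between the two families can be sketched in a few lines. This is only an illustration: the choice of `KNeighborsClassifier` and `KMeans` is an assumption made for the example, not something the rest of this notebook depends on.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the model is given both the features X and the known labels y.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)
supervised_pred = clf.predict(X[:5])

# Unsupervised: the model is given only X and must find structure on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)

print("Predicted classes:", supervised_pred)
print("Cluster ids (numbering is arbitrary):", cluster_ids[:5])
```

#### Note that the cluster ids carry no meaning by themselves; they only say which samples were grouped together.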

#### Some of the Python tools we'll use are

#### - NumPy
#### - SciPy
#### - Scikit-learn
#### - TensorFlow
#### - And many more...

###  3) Logistic Regression

#### Logistic regression is a classification algorithm that predicts the probability that a sample belongs to a class. In its basic form this is binary classification; multinomial and ordinal variants also exist for problems with more than two classes.
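
#### As a rough sketch of the idea, a binary logistic regression computes a weighted sum of the features and passes it through the sigmoid function to get a probability. The weights `w`, bias `b`, and sample `x` below are made-up values for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature model (illustrative only).
w = np.array([1.5, -0.5])
b = -1.0

x = np.array([2.0, 1.0])   # one sample with two features
score = w @ x + b          # linear part: 3.0 - 0.5 - 1.0 = 1.5
prob = sigmoid(score)      # probability of the positive class
label = int(prob >= 0.5)   # threshold at 0.5 to turn the probability into a class

print(f"score={score:.2f}, probability={prob:.3f}, label={label}")
```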

In [None]:
## Let's see how to do this in Python using an example dataset

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split

digits = datasets.load_digits() # Load the example dataset
X = digits.data
y = digits.target

plt.figure(figsize=(6, 6)) # Plot the first two features of digits 0 and 1 for an initial look
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
plt.legend()

In [None]:
## Here we split our data into training and test sets. train_test_split does this randomly,
## according to the test_size you assign (0.4 holds out 40% of the samples for testing).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
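
#### A quick way to sanity-check a split is to look at the resulting shapes. The sketch below also passes `stratify=y`, which is an optional addition (not used in the cell above) that keeps the class proportions similar in both sets.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, y = digits.data, digits.target

# stratify=y is an assumption added for this sketch; the original cell omits it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1, stratify=y
)

print("Train:", X_train.shape, "Test:", X_test.shape)
print("Test fraction:", len(X_test) / len(X))
```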

In [None]:
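## Next, we create a model object
## max_iter is raised from the default of 100 so the solver has room to converge on this dataset
digreg = linear_model.LogisticRegression(max_iter=10000)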

In [None]:
## Then, we fit the model object to our training data
digreg.fit(X_train, y_train)

In [None]:
## Now we use the held-out test features to see how well the trained model predicts
y_pred = digreg.predict(X_test)

In [None]:
## Report the share of test samples the model classified correctly, as a percentage
print("Accuracy of Logistic Regression model is:",
      metrics.accuracy_score(y_test, y_pred) * 100)
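
#### Since logistic regression predicts probabilities, it can help to look at them directly rather than only at the final class. The following is a minimal, self-contained sketch; the variable names and the `max_iter` value are choices made for this example.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.4, random_state=1
)

model = linear_model.LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1.
proba = model.predict_proba(X_test[:3])
most_likely = model.classes_[proba.argmax(axis=1)]

print(np.round(proba, 3))
print("Most likely class per sample:", most_likely)
print("predict() gives:             ", model.predict(X_test[:3]))
```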

###  4) Random Forest

#### Random forest is a supervised learning algorithm used for both classification and regression. A forest is made up of decision trees, and adding more trees generally makes the forest more robust. The algorithm trains each decision tree on a random sample of the data, collects the prediction from every tree, and then combines them: by voting for classification, or by averaging for regression.
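
#### The voting idea can be sketched by asking each fitted tree for its own prediction and taking the majority by hand. This is an illustration only: scikit-learn's `RandomForestClassifier` actually averages the trees' predicted probabilities rather than counting hard votes, so the two results usually agree but are not guaranteed to match on every sample.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Each fitted tree predicts an index into forest.classes_ (returned as floats).
votes = np.array([tree.predict(X_test) for tree in forest.estimators_]).astype(int)

# Majority vote per test sample: count the votes in each column.
n_classes = len(forest.classes_)
majority_idx = np.array([np.bincount(col, minlength=n_classes).argmax() for col in votes.T])
majority_vote = forest.classes_[majority_idx]

agreement = np.mean(majority_vote == forest.predict(X_test))
print("Votes array shape (trees, samples):", votes.shape)
print("Agreement with forest.predict:", agreement)
```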

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

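iris = datasets.load_iris()
X = iris.data[:, :2]         # Keep only the first two features (sepal length and width)
y = (iris.target != 0) * 1   # Binary target: 0 for the first species, 1 for the other two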

In [None]:
## Split up our data into the test and train split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

In [None]:
## Next, we create a model object
classifier = RandomForestClassifier(n_estimators=50)

In [None]:
## Then, we fit the model object to our training data
classifier.fit(X_train, y_train)

In [None]:
## Now we use the held-out test features to see how well the trained model predicts
y_pred = classifier.predict(X_test)

In [None]:
## Summarize the predictions: confusion matrix, per-class report, and overall accuracy
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

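#### With this second example, we can see that the pipeline for using premade models like these is very similar: create the model object, fit it to the training data, predict on the test data, and score the result.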
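
#### Because scikit-learn models share the same `fit` / `predict` interface, the steps can be written once and reused. A minimal sketch, assuming the iris dataset and the two models from this notebook (the dictionary and variable names are only for illustration):

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)       # same call for every model
    y_pred = model.predict(X_test)    # same call for every model
    scores[name] = accuracy_score(y_test, y_pred)
    print(f"{name}: accuracy = {scores[name]:.3f}")
```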

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
sns.set()

###  5) Support Vector Machines

In [None]:
import pandas as pd
import numpy as np
from sklearn import svm, datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

In [None]:
# Create the max and min values for the boundaries on the plot
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

# Step size of the mesh: split the x-range into 100 steps
h = (x_max - x_min) / 100

xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) 

X_plot = np.c_[xx.ravel(), yy.ravel()]

C = 1.0 # Value of regularization parameter

In [None]:
svc_classifier = svm.SVC(kernel='linear', C=C)

In [None]:
svc_classifier.fit(X, y)

In [None]:
Z = svc_classifier.predict(X_plot)

In [None]:
Z = Z.reshape(xx.shape)

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with linear kernel')
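
#### The regularization parameter `C` controls how strongly the classifier is penalized for misclassified or margin-violating points. One rough way to see its effect is to count the support vectors for a few values of `C`; a smaller `C` typically gives a wider margin and therefore more support vectors. The values of `C` below are arbitrary picks for this sketch.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

support_counts = {}
for C in (0.01, 1.0, 100.0):
    clf = svm.SVC(kernel='linear', C=C)
    clf.fit(X, y)
    # n_support_ holds the number of support vectors for each class.
    support_counts[C] = int(clf.n_support_.sum())
    print(f"C={C}: {support_counts[C]} support vectors out of {len(X)} samples")
```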

###  6) Neural Networks

### 7) Example with GeoChem Data