## Chapter 2: Machine Learning

Some of the scripts presented in this notebook use several Python libraries that have been pre-installed for you. If you had to install these libraries yourself, you would issue the following commands:

```python
! pip install --user pandas
! pip install --user numpy
! pip install --user scikit-learn
```

# Training and Testing Dataframe

**Note: This notebook is intended to provide a demonstration of common machine-learning operations. As such, you are not expected to understand the entirety of the code in each sample.**

Supervised learning uses labeled data, in which each record is paired with a known result, to train an algorithm to identify patterns. A solution that uses a supervised machine-learning approach, such as data classification, will perform the following steps:

1. Specify a training dataset from which the model can learn to match patterns.
2. Specify a testing dataset that the model can use to test its accuracy.
3. Use the model with new data to classify data or predict results.

Take the following Python script, TrainTest.py, for example, which performs the first two steps of the supervised machine-learning approach. The script reads the breast.data.csv file, which contains attributes a machine-learning program can use to classify breast-cancer tumors as malignant or benign, into a dataframe object. The script assigns 30% of the dataset to the testing dataset, and the remainder (70%) goes to the training dataset. The script finishes by displaying the number of total rows, training rows, and testing rows:

In [None]:
#####################################
# Chapter 2 (Python) / Deliverable 1
#####################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

names = ['Sample', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'class']

df = pd.read_csv('breast.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 1:10])  # the nine attribute columns (skips the 'Sample' ID column)
y = np.array(df['class'])

# split the data into train and test sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

print("Total rows: ", len(df))
print("Training rows: ", len(X_train))
print("Testing rows: ", len(X_test))
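
To make the 70/30 split less of a black box, the following is a minimal plain-Python sketch of the idea: shuffle the rows, then cut the shuffled list in two. The `split_rows` helper, its `seed` parameter, and the 100 stand-in rows are assumptions made for illustration; scikit-learn's `train_test_split` has its own implementation and options.

```python
import random

def split_rows(rows, test_size=0.30, seed=0):
    """Shuffle a copy of rows, then cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)  # number of rows held out for testing
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 dataset rows (hypothetical data)
train, test = split_rows(rows)

print("Total rows:", len(rows))
print("Training rows:", len(train))
print("Testing rows:", len(test))
```

Every row lands in exactly one of the two lists, so the model is never tested on a row it was trained on.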

# Classifying E-mail as Spam

To test a model, a supervised machine-learning program uses the test attributes to determine a result, which it then compares to the known result. As you continue your exploration of machine-learning solutions, you will find that developers will normally divide a given dataset into training and testing sets such that the training set contains 70-80% of the data, and the testing set the remainder. Assume, for example, your goal is to identify incoming email as valid or spam.

The following Python script, ClassifySpam.py, uses the Spam.csv dataset to classify e-mails as valid or spam. The script uses the read_csv function to read the dataset file into a dataframe and then assigns the independent variables to the X array and the dependent values to the y array. Next, the script splits the data into training and test datasets and uses the K-nearest-neighbors algorithm to classify the test data. After classifying the data, the script displays its accuracy and confusion matrix:

In [None]:
#####################################
# Chapter 2 (Python) / Deliverable 2
#####################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

df = pd.read_csv('Spam.csv', header=None) # read dataset file into a dataframe
X = np.array(df.iloc[:, 0:40]) 	# independent variables (attributes that have high correlation with an email's validity)
y = np.array(df[57]) # dependent variable (valid or spam)

# split the data into the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print ('\nModel accuracy score: ', accuracy_score(y_test, pred))
    
print('\nConfusion Matrix\n', confusion_matrix(y_test, pred))
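
The accuracy score and confusion matrix both come from comparing each predicted label with the known label. The sketch below works through that comparison by hand on ten made-up labels; the `confusion_counts` helper and the label values are assumptions for illustration, not part of ClassifySpam.py.

```python
def confusion_counts(actual, predicted, labels):
    """Return a matrix where row = actual label and column = predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical labels: 0 = valid e-mail, 1 = spam
actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

matrix = confusion_counts(actual, predicted, labels=[0, 1])
correct = sum(matrix[i][i] for i in range(len(matrix)))  # the diagonal holds correct predictions
accuracy = correct / len(actual)

print(matrix)
print(accuracy)
```

The off-diagonal cells separate the two kinds of mistakes: valid e-mail flagged as spam, and spam that slipped through as valid.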


# Clustering Stock Data

Data clustering is the process of assigning data items to related groups. In contrast to the supervised machine learning demonstrated in the classification exercise above, clustering uses unsupervised learning: the data has no known labels, so there are no labeled training and testing datasets from which the algorithm can learn. Consider the Dow Jones Stocks dataset, which contains stock prices (open, high, low, and close) and trading volume for many stocks.

The following Python script, ClusterStocks.py, loads the DowJones.csv file and then uses K-means clustering to group the stock records into three related clusters. However, the stock-price dataset contains many attributes, which makes it poorly suited to a two-dimensional scatter plot. The script, therefore, uses scikit-learn's PCA class to transform the values into a two-dimensional array, the values of which are representative of the original data and can be clustered and graphed. Lastly, within the cluster chart, the script marks the center of each cluster (called the centroid) with a black X:

In [None]:
#####################################
# Chapter 2 (Python) / Deliverable 3
#####################################

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
from pandas import DataFrame
from sklearn.decomposition import PCA

df = pd.read_csv('DowJones.csv')

df = df.drop('stock', axis=1)
df = df.drop('date', axis=1)
df = df.dropna()

# decompose dataset into 2 representative components, then project on a 2D array.
pca = PCA(n_components=2).fit(df)
data_2d = pca.transform(df)

kmeans = KMeans(n_clusters=3).fit(data_2d)
centroids = kmeans.cluster_centers_

# color each point by the cluster that K-means assigned to it
colors = ['green', 'yellow', 'blue']
plt.scatter(data_2d[:, 0], data_2d[:, 1], c=[colors[label] for label in kmeans.labels_])

plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X') #  mark center of each cluster
plt.title("Clusters")
plt.show()
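
For readers curious about what K-means does with the points, here is a minimal plain-Python sketch of its two repeating steps: assign each point to the nearest centroid, then move each centroid to the mean of its points. The `kmeans` function, the six sample points, and the hand-picked starting centroids are assumptions for illustration; scikit-learn's `KMeans` selects its own starting centroids and is considerably more robust.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2D points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(labels)
print(centroids)
```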

# Scaling Dataset Values

Depending on your dataset, there may be times when different attributes have different underlying scales. For example, a quality attribute may be based on the values 1 to 5, whereas a satisfaction attribute, the values 1 to 10. To improve the results of your machine-learning and data-mining operations, you should align the attribute scales. 

The following Python script, Scale.py, uses scikit-learn's StandardScaler class to do just that. Behind the scenes, StandardScaler scales each column so that its values have a mean of 0 and a standard deviation of 1. The script scales the values of a dataframe, showing the values before and after scaling:

In [None]:
from pandas import DataFrame
from sklearn.preprocessing import StandardScaler

Data = {
    'x': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'y': [10, 20, 30, 40, 50, 60, 70, 80, 90]
}
  
df = DataFrame(Data,columns=['x','y'])
  
print('Original Dataset')
print(df)

sc = StandardScaler()  
scaled = sc.fit_transform(df)  

print('\nScaled Dataset')
print(scaled)
print('\nMean of each column:', scaled.mean(axis=0))
print('Standard deviation of each column:', scaled.std(axis=0))
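
The arithmetic behind this kind of scaling is short enough to write by hand: subtract the column mean from each value, then divide by the column's standard deviation. The sketch below is a plain-Python illustration that reuses the x and y values from Scale.py; the `standardize` helper is an assumed name, and it uses the population standard deviation (dividing by n).

```python
import math

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [10, 20, 30, 40, 50, 60, 70, 80, 90]

x_scaled = standardize(x)
y_scaled = standardize(y)

print([round(v, 3) for v in x_scaled])
print([round(v, 3) for v in y_scaled])
```

Although y is ten times larger than x, the two scaled columns come out the same, which is the point of putting attributes on a common scale.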

# Understanding Dimensionality Reduction

As the size of a dataset increases, so too does the time required to process the data, as well as the amount of RAM required to hold it. Depending on the dataset with which you are working, there will often be times when one or more of the independent variables do not influence the dependent variable. In such cases, you can delete the corresponding columns, a process that data analysts refer to as dimensionality reduction.

A simple way to gauge the relationship between two variables is to calculate the correlation value between them. Variables with a correlation value near 1 have a strong positive correlation: as the value of one variable increases or decreases, so too does the value of the second. Variables with a correlation value near -1 have a strong negative correlation, meaning that as the value of one variable increases, the value of the second decreases proportionally. Variables with a correlation value near 0 have little or no linear relationship, so a change in one variable's value tells you little about the second.
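
As a rough illustration of those three cases, the sketch below computes the Pearson correlation coefficient by hand for three small, made-up pairs of lists. The `pearson` helper and the sample values are assumptions for illustration; NumPy's `corrcoef` function returns the same coefficient inside a matrix.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]

rising = pearson(x, [2, 4, 6, 8, 10])    # second list rises in step with x
falling = pearson(x, [10, 8, 6, 4, 2])   # second list falls as x rises
flat = pearson(x, [3, 1, 4, 1, 3])       # second list shows no linear trend

print(rising, falling, flat)
```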

The following Python script, BreastCancerCorrelations.py, examines the correlation between the independent variables and the dependent variable class, which specifies whether a tumor is malignant or benign:

In [None]:
import pandas as pd
import numpy as np

names = ['Sample', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'class']

data = pd.read_csv('breast.data.csv', names=names)

for i in range(1, 10):  # columns 1-9 are the attributes; column 0 is the sample ID
  print('Correlation ', names[i], 'and class', np.corrcoef(data[names[i]], data['class'])[0,1])

# Understanding Principal Component Analysis

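To perform dataset-dimensionality reduction, analysts will often use a technique called Principal Component Analysis (PCA) to determine (and select) the key independent variables. PCA is an unsupervised machine-learning algorithm. The following Python script, PCA.py, illustrates the use of PCA. In this case, the script does not reduce the dataset, but rather displays the explained-variance ratio, which is the share of the total variance that each component captures.

The script prints each ratio beside the column name in the same position. Strictly speaking, each ratio describes a principal component, which is a weighted combination of all the scaled columns, rather than a single original column. The analysis below treats the first three entries, printed beside clump thickness, uniformity of cell size, and uniformity of cell shape, as the primary variables: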

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

names = ['Sample', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'class']

df = pd.read_csv('breast.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 1:9])
y = np.array(df['class'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) 

sc = StandardScaler()  
X_train = sc.fit_transform(X_train)  
X_test = sc.transform(X_test)  

from sklearn.decomposition import PCA
pca = PCA()

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)  
variance = pca.explained_variance_ratio_ 

for i in range(0, 8):
  print(names[i+1], variance[i])
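
To show where an explained-variance ratio comes from, here is a small plain-Python sketch for the two-column case, where the eigenvalues of the covariance matrix have a closed form. The `explained_variance_ratio_2d` helper and the ten sample values are assumptions for illustration; scikit-learn's `PCA` handles any number of columns with its own solver.

```python
import math

def explained_variance_ratio_2d(xs, ys):
    """Eigenvalues of the 2x2 covariance matrix, expressed as shares of total variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    c = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]

# Hypothetical measurements for two strongly related attributes
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

ratios = explained_variance_ratio_2d(xs, ys)
print(ratios)
```

Because the two columns move together, the first component captures most of the variance, which is why a single component could stand in for both columns with little loss.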

Now we can compare tests using all attributes in the dataset to tests using only the primary variables PCA identified.

The following Python script, BreastCancerPCA.py, uses the K-nearest-neighbors classification method to determine if a breast-cancer tumor is malignant or benign. The script first uses all the independent variables to determine a result and then uses the first three variables identified by PCA. By reducing the number of independent variables to three, the model’s accuracy changes only slightly. In this case, the dataset is small, but if the dataset were very large, you could save considerable processing time by performing such a reduction:

In [None]:
#####################################
# Chapter 2 (Python) / Deliverable 4
#####################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

names = ['Sample', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'class']

df = pd.read_csv('breast.data.csv', header=None, names=names) 
X = np.array(df.iloc[:, 1:10]) # selects all nine attributes (skips the 'Sample' ID column)
y = np.array(df['class'])

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print ('Model accuracy score: ', accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
    
X = np.array(df.iloc[:, 1:4]) # selects the 3 attributes highlighted above (columns 1, 2, and 3)
y = np.array(df['class'])

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print ('Model accuracy score: ', accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))

# Linear Discriminant Analysis (LDA)

To reduce the dimensionality of datasets, analysts often perform Principal Component Analysis, an unsupervised machine-learning algorithm. However, as with most machine-learning tasks, there are many algorithms (approaches) for performing dimensionality reduction. Linear Discriminant Analysis (LDA) is a second approach, one that uses supervised machine learning: it takes the class labels into account when it chooses the reduced dimensions. The following Python script, LDA.py, illustrates the use of LDA. Again, the script will use the breast-cancer dataset:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

names = ['Sample', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'class']

df = pd.read_csv('breast.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 1:9])
y = np.array(df['class'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) 

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()  
X_train = sc.fit_transform(X_train)  
X_test = sc.transform(X_test)  

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 
lda = LinearDiscriminantAnalysis(n_components=None)

X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)  

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print("Accuracy", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
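
The core idea behind LDA for two classes can be sketched in plain Python: find the direction along which the class means are far apart relative to the spread within each class, then project every point onto it. The helper names and the eight sample points below are assumptions for illustration; scikit-learn's `LinearDiscriminantAnalysis` uses its own solver and supports more than two classes.

```python
def mean_vector(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter_matrix(points, mean):
    """2x2 within-class scatter: summed outer products of deviations from the class mean."""
    sxx = sum((p[0] - mean[0]) ** 2 for p in points)
    syy = sum((p[1] - mean[1]) ** 2 for p in points)
    sxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points)
    return [[sxx, sxy], [sxy, syy]]

def fisher_direction(class_a, class_b):
    """Direction w = Sw^-1 (mean_b - mean_a) for two classes in 2D."""
    mean_a, mean_b = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter_matrix(class_a, mean_a), scatter_matrix(class_b, mean_b)
    sw = [[sa[0][0] + sb[0][0], sa[0][1] + sb[0][1]],
          [sa[1][0] + sb[1][0], sa[1][1] + sb[1][1]]]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [mean_b[0] - mean_a[0], mean_b[1] - mean_a[1]]
    # Multiply the inverse of the 2x2 matrix sw by diff
    return [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
            (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]

def project(point, w):
    """Reduce a 2D point to a single value along direction w."""
    return w[0] * point[0] + w[1] * point[1]

# Hypothetical two-attribute measurements for each class
benign = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
malignant = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 7.0)]

w = fisher_direction(benign, malignant)
print([round(project(p, w), 3) for p in benign])
print([round(project(p, w), 3) for p in malignant])
```

Unlike PCA, this calculation needs to know which class each point belongs to, which is what makes LDA a supervised technique.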

# Mapping Categorical Values

As you perform machine-learning operations, you will find that many algorithms require that the dataset values are numeric. Unfortunately, many datasets contain text-based categorical data. Consider, for example, the Census dataset, which contains attributes a machine-learning algorithm can use to predict whether an individual will make less or more than $50,000 a year.

One way to change the text values to numeric is to edit the dataset using Excel. You can also use a programming language such as Python or R to make the changes. At first glance, you might be tempted to perform simple changes, such as converting the text values ‘male’ and ‘female’ to the values 1 and 0 and values such as ‘black’, ‘white’, and ‘other’ to values such as 3, 2, and 1. Although such substitutions accomplish the goal of producing numeric values, they introduce ordinal values that imply a numeric order: that male is greater than female, and that black has a more significant value than white, which in turn has a more significant value than other.

To avoid such ordered implications, developers use a technique called one-hot encoding, which assigns binary vector values rather than ordinal numbers. In the case of male and female, you might use the following vectors:

    male    [1 0]
    female  [0 1]

Likewise, for black, white, and other, you would use:

    black   [1 0 0]
    white   [0 0 1]
    other   [0 1 0]
    
The following Python script, OneHotEncoding.py, illustrates the process to encode categorical data:

In [None]:
from sklearn import preprocessing

data = ['black', 'white', 'other']
print(data)
lb = preprocessing.LabelBinarizer()
encodedvalues = lb.fit_transform(data)
print(encodedvalues)
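
For comparison, the same encoding can be sketched in plain Python, which makes the mapping from category to vector position explicit. The `one_hot` helper is an assumed name for illustration; like `LabelBinarizer`, it places the categories in sorted order, which is why 'other' takes the middle position.

```python
def one_hot(values):
    """Map each distinct value to a binary vector that contains a single 1."""
    categories = sorted(set(values))  # fixed, alphabetical column order
    position = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[position[value]] = 1
        vectors.append(vector)
    return categories, vectors

categories, vectors = one_hot(['black', 'white', 'other', 'white'])

print(categories)
for vector in vectors:
    print(vector)
```

One difference worth knowing: when a column holds exactly two categories, `LabelBinarizer` returns a single column of 0s and 1s rather than two columns, whereas this sketch always produces one column per category.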

The following Python script, OneHotEncodeCensus.py, illustrates the use of one-hot encoding to update a dataframe that contains the census data. The script starts by loading the adult.csv dataset into a dataframe and then uses the head function to display the first five rows. The script then encodes the race values, creating the corresponding binary vectors, and appends those vectors as individual columns within the dataset. The script repeats this processing for the gender and country columns. The script then deletes the original (categorical) columns and displays the dataset’s contents. Before you perform your machine-learning and data-mining operations, you would repeat the steps for the remaining categorical fields:

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

df = pd.read_csv('adult.csv')

print(df.head())

lb = preprocessing.LabelBinarizer()
encodedvalues = lb.fit_transform(df.iloc[:,8:9])
print('Race')
print(encodedvalues)

dfOneHot = pd.DataFrame(encodedvalues, columns = ["A"+str(int(i)) for i in range(encodedvalues.shape[1])])
df = pd.concat([df, dfOneHot], axis=1)

encodedvalues = lb.fit_transform(df.iloc[:,9:10])
print('Gender')
print(encodedvalues)

dfOneHot = pd.DataFrame(encodedvalues, columns = ["B"+str(int(i)) for i in range(encodedvalues.shape[1])])
df = pd.concat([df, dfOneHot], axis=1)

encodedvalues = lb.fit_transform(df.iloc[:,13:14])
print('Country')
print(encodedvalues)

dfOneHot = pd.DataFrame(encodedvalues, columns = ["C"+str(int(i)) for i in range(encodedvalues.shape[1])])
df = pd.concat([df, dfOneHot], axis=1)

# delete the categorical columns just replaced (race, gender, country);
# drop returns a new dataframe, so the result must be assigned back to df
df = df.drop(df.columns[[8, 9, 13]], axis=1)

print(df.head())