## Chapter 1: Data Mining and Analytics

Some of the scripts presented in this notebook use several Python libraries, which have been pre-installed for you. If you had to install these libraries yourself, you would issue the following commands:

```python
! pip install --user pandas
! pip install --user matplotlib
! pip install --user scikit-learn
! pip install --user apyori
```

# Simple Visualization Example

**Note:** This notebook is intended to demonstrate common data-mining operations. You are not expected to understand all of the code in each sample.

Data analysts often use the Python and R programming languages to perform data-mining and machine-learning operations. Data mining is the process of identifying patterns that exist within data, and one of the first steps data analysts take to aid pattern identification is to represent the data visually. Transforming data points into charts and graphs makes the data easier to parse and its patterns easier to recognize.

Take the following Python script, TitanicCharts.py, for example. This script opens the Titanic dataset, which contains information about the passengers who lived and died on the Titanic, and uses this data to create three different pie charts showing the passenger assignments by class, survivors by class, and deaths by class:

In [None]:
#####################################
# Chapter 1 (Python) / Deliverable 1
#####################################

import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('titanic.csv')

# select each column as a one-dimensional array so each element is a single value
Class = data['Pclass'].values
Survived = data['Survived'].values

FirstClass = 0
SecondClass = 0 
ThirdClass = 0
FirstClassSurvived = 0
FirstClassDied = 0
SecondClassSurvived = 0 
SecondClassDied = 0
ThirdClassSurvived = 0 
ThirdClassDied = 0

# tally passengers, survivors, and deaths for each class
for i in range(0, len(Class)):
    if Class[i] == 1:
        FirstClass += 1
        if Survived[i] == 1:
            FirstClassSurvived += 1
        else:
            FirstClassDied += 1

    elif Class[i] == 2:
        SecondClass += 1
        if Survived[i] == 1:
            SecondClassSurvived += 1
        else:
            SecondClassDied += 1

    elif Class[i] == 3:
        ThirdClass += 1
        if Survived[i] == 1:
            ThirdClassSurvived += 1
        else:
            ThirdClassDied += 1

# Data to plot
labels = '1st Class', '2nd Class', '3rd Class'
sizes = [FirstClass, SecondClass, ThirdClass]
colors = ['gold', 'yellowgreen', 'lightcoral']
explode = (0.1, 0, 0)  # explode 1st slice
plt.title("Passenger Class Assignment")
# Plot chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)

plt.show()

sizes = [FirstClassSurvived, SecondClassSurvived, ThirdClassSurvived]
colors = ['gold', 'yellowgreen', 'lightcoral']
explode = (0.1, 0, 0)  # explode 1st slice
plt.title("Surviving Passengers by Class Assignment")
# Plot chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)

plt.show()

sizes = [FirstClassDied, SecondClassDied, ThirdClassDied]
colors = ['gold', 'yellowgreen', 'lightcoral']
explode = (0.1, 0, 0)  # explode 1st slice
plt.title("Passenger Deaths by Class Assignment")
# Plot chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)

plt.show()
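
The nine counters in the script above can also be kept in two tally tables. The following is a minimal sketch using only the standard library; the `rows` list is hypothetical sample data standing in for the `Pclass` and `Survived` columns, not values taken from titanic.csv:

```python
from collections import Counter

# Hypothetical (Pclass, Survived) pairs standing in for rows of titanic.csv.
rows = [(1, 1), (1, 0), (2, 1), (3, 0), (3, 0), (3, 1), (1, 1)]

# Count passengers per class, and per (class, outcome) combination.
per_class = Counter(pclass for pclass, _ in rows)
per_class_outcome = Counter(rows)

for pclass in sorted(per_class):
    survived = per_class_outcome[(pclass, 1)]
    died = per_class_outcome[(pclass, 0)]
    print(f"Class {pclass}: total={per_class[pclass]}, survived={survived}, died={died}")
```

The three lists passed to `plt.pie` could then be read from these tables instead of from separate variables.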

# Predicting Titanic Deaths 

Data science, often used interchangeably with data mining, is the use of statistics, programming, scientific methods, and machine learning to extract knowledge from a dataset. Using patterns gleaned from data visualizations, data analysts can then train and test algorithms to produce useful models, some of which may be leveraged to predict future outcomes with a high degree of accuracy.

The following Python script, Titanic.py, opens the TitanicFields dataset, which contains data about many of the passengers, such as age, gender, and class of travel. The dataset omits several columns that are not used in the prediction, excludes records with missing data, and converts the text strings male and female to the values 1 and 0. The script uses random forest classification to predict, based on age, gender, and class of travel, whether a passenger would have lived or died:

In [None]:
#####################################
# Chapter 1 (Python) / Deliverable 2
#####################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

names = ['Pclass','Sex','Age','Survived']

df = pd.read_csv('TitanicFields.csv', header=None, names=names)

# features: class of travel, gender, and age (the first three columns)
X = np.array(df.iloc[:, 0:3])

# target: 1 if the passenger survived, 0 otherwise
y = np.array(df['Survived'])

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

model = RandomForestClassifier()

model.fit(X_train, y_train)
pred = model.predict(X_test)
print('Accuracy Score: ', accuracy_score(y_test, pred))
print('\nConfusion Matrix\n', confusion_matrix(y_test, pred))
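
The preparation steps described before the script (keeping only the needed columns, excluding records with missing data, and encoding male and female as 1 and 0) could look roughly like the sketch below. It uses only the standard library, and the sample rows and the `Name` column are hypothetical; the layout of the original raw file is an assumption:

```python
import csv
import io

# Hypothetical raw rows; the real file and its column names may differ.
raw = io.StringIO(
    "Name,Pclass,Sex,Age,Survived\n"
    "A. Smith,1,female,29,1\n"
    "B. Jones,3,male,,0\n"
    "C. Brown,2,male,40,0\n"
)

KEEP = ["Pclass", "Sex", "Age", "Survived"]   # columns used for the prediction
SEX_CODES = {"male": 1, "female": 0}          # encoding described in the text

def prepare(reader):
    """Keep the needed columns, skip incomplete records, encode Sex."""
    cleaned = []
    for row in reader:
        if any(row[col] == "" for col in KEEP):
            continue  # exclude records with missing data
        cleaned.append([
            int(row["Pclass"]),
            SEX_CODES[row["Sex"]],
            float(row["Age"]),
            int(row["Survived"]),
        ])
    return cleaned

records = prepare(csv.DictReader(raw))
print(records)
```

The second sample row has no age, so it is dropped, which mirrors the deletion of incomplete records.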

# Clustering

Data clustering is the process of grouping related dataset items into one or more clusters (groups of related items based on shared features). Clustering uses an unsupervised machine-learning algorithm to generate such groups, which means the algorithm works without a labeled training dataset.

Consider the Iris dataset, for example, a well-known data-mining and machine-learning dataset used to introduce the clustering and classification processes. The Iris dataset contains sepal and petal lengths and widths for three varieties of Iris flowers:

- Iris-setosa
- Iris-virginica
- Iris-versicolor

The dataset has 50 records for each variety. Using clustering, you can collect the data into groups for further analysis. The following Python script, IrisCluster.py, uses the common k-Means clustering algorithm to identify related groups within the dataset:

In [None]:
#####################################
# Chapter 1 (Python) / Deliverable 3
#####################################

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

df = pd.read_csv('iris.data.csv', header=None, names=names)

df = df.drop('class', axis=1)
pca = PCA(n_components=2).fit(df)
data_2d = pca.transform(df)

kmeans = KMeans(n_clusters=3).fit(data_2d)
centroids = kmeans.cluster_centers_

# plot each point in the color of its assigned cluster
for i in range(0, data_2d.shape[0]):
  if kmeans.labels_[i] == 0:
    plt.scatter(data_2d[i,0], data_2d[i,1], c='green') 
  elif kmeans.labels_[i] == 1:
    plt.scatter(data_2d[i,0], data_2d[i,1], c='yellow') 
  elif kmeans.labels_[i] == 2:
    plt.scatter(data_2d[i,0], data_2d[i,1], c='blue')
   

plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X')
plt.title("Clusters")
plt.show()
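
To show what k-means does internally, here is a minimal sketch of the basic algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The function `kmeans`, the sample points, and the starting centroids are all hypothetical, and scikit-learn's `KMeans` adds refinements (such as smarter initialization) that this sketch leaves out:

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid update."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

No labels are supplied anywhere; the groups emerge only from the distances between points, which is what makes the method unsupervised.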

# Data Classification

Data classification uses supervised machine learning, meaning the classification algorithm learns the common attributes of each category from a labeled training dataset. There are many different data-classification algorithms, which differ by memory use and CPU performance.

Consider, for example, the Census dataset from the University of California, Irvine (UCI), which contains census data on individuals, such as age, gender, race, and marital status. The following program uses a subset of the dataset, reduced to 500 records and fewer columns. The subset also converts text fields to numeric values: male and female become 1 and 0, and marital status becomes 0 (not married), 1 (married), 2 (never married), 3 (divorced), or 4 (widowed). Individuals living in the United States are represented as 1 and those living outside the United States as 0. Incomes lower than 50,000 are represented as 0 and incomes above 50,000 as 1.

The following Python program, PredictIncome.py, uses the census dataset to classify an individual as likely to fall within one of two income levels:

- Earns less than 50,000
- Earns more than 50,000


In [None]:
#####################################
# Chapter 1 (Python) / Deliverable 4
#####################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

names = ['Married', 'Race', 'Gender', 'Age', 'Country', 'Income']

df = pd.read_csv('census.csv', header=1, names=names)
X = np.array(df.iloc[:, 0:4]) # select the (range of) attributes to base the classification on.

y = np.array(df['Income'])

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print('\nModel accuracy score: ', accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
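
The idea behind the k-nearest-neighbors classifier used above can be sketched in a few lines: find the k training rows closest to a new record and let them vote on its class. The function `knn_predict` and the sample rows are hypothetical and use only two made-up attributes rather than the columns of census.csv:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Sort training rows by Euclidean distance to the query.
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    # Take the labels of the k closest rows and return the most common one.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical rows: (married, age); label 1 = income above 50,000, 0 = below.
train_X = [(1, 45), (1, 52), (0, 23), (0, 31), (1, 38), (0, 27)]
train_y = [1, 1, 0, 0, 1, 0]

print(knn_predict(train_X, train_y, query=(1, 48)))
print(knn_predict(train_X, train_y, query=(0, 25)))
```

Because the vote depends on raw distances, an attribute with a wide numeric range (age, in this sketch) dominates attributes coded as 0 or 1, so scaling the attributes first is a common precaution.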

# Predicting Patient Healthcare Costs

For decades, businesses have used data analytics to explain their performance during the previous quarter or year. Analysts refer to this as descriptive analytics: the analysts use data to describe what happened in the past. Predictive analytics, in contrast, uses data to predict what will happen in the future.

Consider the insurance.csv dataset, which contains over 1,300 records regarding insurance customers and their insurance charges. The following Python script, PredictInsuranceCosts.py, uses the dataset to create a predictive model that analysts can then use to predict a customer’s healthcare costs:

In [None]:
#####################################
# Chapter 1 (Python) / Deliverable 5
#####################################

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor

data = pd.read_csv('insurance.csv')

X = data[['age', 'sex', 'bmi', 'children', 'smoker', 'region']].values # select the values to base the prediction on
y = data['charges']


model = RandomForestRegressor(n_estimators=100)

model.fit(X, y)

# relative importance of each input column to the model
print(model.feature_importances_)

# predict charges for the same records the model was trained on
predictions = model.predict(X)
for index in range(len(predictions)):
    print('Actual: ', y[index], 'Predicted: ', predictions[index])
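
The script prints every actual and predicted value, which is hard to judge by eye across more than a thousand records. One common summary is the mean absolute error, sketched below with hypothetical charges that are not taken from insurance.csv (scikit-learn offers an equivalent function in `sklearn.metrics`):

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in the same units as the charges."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical charges, not taken from insurance.csv.
actual = [1200.0, 8500.0, 23000.0, 4100.0]
predicted = [1500.0, 8000.0, 21000.0, 4300.0]

print(mean_absolute_error(actual, predicted))
```

An error measured on the same records used for training tends to look better than the error on new customers, so holding out a test set, as the earlier classification scripts do with `train_test_split`, gives a more realistic estimate.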


# Using Association with E-Commerce Data

Data association is the process of identifying relationships between variables in which the presence or absence of a first variable (called the antecedent) influences a second variable (called the consequent). One of the best-known data-association problems is market-basket analysis, which examines the items in a shopper’s basket to identify associations between them. Using market-basket analysis, for example, analysts found that shoppers who purchased diapers were highly likely to also purchase beer. With such product insights in hand, a store might advertise a sale on diapers while also increasing the price of beer. Alternatively, the store might place beer and diapers far away from one another, so that customers needing both must walk past many other items.

Assume, for example, that a popular e-commerce site wants to know how sales of other products are driven by its key items: books, videos, and music. The dataset for the e-commerce sales contains transactions, each of which lists the items purchased. The following Python script, EcommerceAssociation.py, uses the apriori algorithm to determine the associations between products:

In [None]:
#####################################
# Chapter 1 (Python) / Deliverable 6
#####################################

import pandas as pd
from apyori import apriori

data = pd.read_csv('ecommerce.csv', header=None)

records = []
for i in range(0, len(data)):  
    records.append([str(data.values[i,j]) for j in range(0, len(data.columns))])
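# keep rules with support of at least 0.15 and lift of at least 1.1
rules = apriori(records, min_length=2, min_lift=1.1, min_support=0.15)
results = list(rules)

# skip results that include the 'nan' placeholder produced by empty CSV cells
for item in results:
    if 'nan' not in str(item):
        print()
        print(item)
        print()
        print("-----------------------")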
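
The `min_support` and `min_lift` thresholds in the script refer to standard association measures. The sketch below computes support, confidence, and lift by hand for one antecedent and consequent; the transactions are hypothetical and are not drawn from ecommerce.csv:

```python
# Hypothetical transactions; each set lists the items in one purchase.
transactions = [
    {"book", "video"},
    {"book", "music"},
    {"book", "video", "music"},
    {"video"},
    {"book", "video"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the share that also has the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent) / support(consequent)

a, c = {"book"}, {"video"}
print(support(a | c), confidence(a, c), lift(a, c))
```

A lift above 1 means the two items appear together more often than they would if purchases were independent, which is why the script discards rules whose lift falls below its 1.1 threshold.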
