# Demo 2: Naive Bayes and DataStax Analytics
------
<img src="images/drinkWine.jpeg" width="300" height="500">


#### Dataset: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

## What are we trying to learn from this dataset? 

# QUESTION:  Can Naive Bayes be used to classify a wine’s rating score by its attributes?

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
import pandas
import cassandra
import pyspark
import re
import os
import random
from random import randint, randrange
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
%store -r astraUsername astraPassword astraSecureConnect astraKeyspace

#### Helper function to have nicer formatting of Spark DataFrames

In [3]:
# Helper for pretty formatting of Spark DataFrames
def showDF(df, limitRows=5, truncate=True):
    # Cap column width at 50 characters unless truncation is turned off
    pandas.set_option('display.max_colwidth', 50 if truncate else None)
    pandas.set_option('display.max_rows', limitRows)
    # Only the first limitRows rows are pulled back to the driver
    display(df.limit(limitRows).toPandas())
    pandas.reset_option('display.max_rows')

<img src="images/dselogo.png" width="400" height="200">

## Creating Tables and Loading Tables

### Connect to Cassandra

In [4]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cloud_config = {
    'secure_connect_bundle': '/tmp/'+astraSecureConnect
}
auth_provider = PlainTextAuthProvider(username=astraUsername, password=astraPassword)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

### Set keyspace 

In [5]:
session.set_keyspace(astraKeyspace)

### Create a table called `wines`. Our PRIMARY KEY will be a unique ID (wineid) that we generate for each row. The table will hold two datasets, "white" and "red".

In [6]:
query = "CREATE TABLE IF NOT EXISTS wines \
                                   (wineid int, fixedAcidity float, volatileAcidity float, citricAcid float, sugar float, \
                                   chlorides float, freeSulfur float, totalSulfur float, density float, ph float, \
                                   sulphates float, alcohol float, quality float, \
                                   PRIMARY KEY (wineid))"
session.execute(query)

<cassandra.cluster.ResultSet at 0x7f2ac0200358>

### What do these 12 columns represent?

* **Fixed acidity**
* **Volatile acidity**
* **Citric Acid**
* **Residual Sugar** 
* **Chlorides**
* **Free sulfur dioxide**     
* **Total sulfur dioxide**
* **Density** 
* **pH**
* **Sulphates**
* **Alcohol**
* **Quality**
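
For reference, the attributes listed above can be lined up with the column names used in the `wines` table. This is only a sketch: the left-hand names are the attribute names as they appear in the UCI dataset's header row, and the dict name `UCI_TO_COLUMN` is hypothetical, introduced here for illustration.

```python
# Hypothetical lookup table (not part of the original notebook):
# UCI attribute name -> column name in the `wines` table.
UCI_TO_COLUMN = {
    "fixed acidity": "fixedAcidity",
    "volatile acidity": "volatileAcidity",
    "citric acid": "citricAcid",
    "residual sugar": "sugar",
    "chlorides": "chlorides",
    "free sulfur dioxide": "freeSulfur",
    "total sulfur dioxide": "totalSulfur",
    "density": "density",
    "pH": "ph",
    "sulphates": "sulphates",
    "alcohol": "alcohol",
    "quality": "quality",
}

# 11 physicochemical inputs plus the quality score
print(len(UCI_TO_COLUMN))
```

Cassandra stores unquoted identifiers in lowercase, so `fixedAcidity` is read back as `fixedacidity` when the table is loaded into Spark.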

### Load 2 Wine Datasets -- White and Red
<img src="images/whiteAndRed.jpeg" width="300" height="300">

In [7]:
#download file to local (working on better way)
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('andygoade-dev')

#download file for red wines
blob = storage.Blob('notebooks/jupyter/data/winequality-red.csv', bucket)
blob.download_to_filename('/tmp/winequality-red.csv')

#download file for white wines
blob = storage.Blob('notebooks/jupyter/data/winequality-white.csv', bucket)
blob.download_to_filename('/tmp/winequality-white.csv')

### Load Wine datasets from CSV files (winequality-red.csv, winequality-white.csv)
* No cleanup was required! How nice :)

#### Insert all the Wine Data into the DSE table `wines`

In [8]:
query = "INSERT INTO wines (wineid, fixedAcidity, volatileAcidity, citricAcid, sugar, \
                               chlorides, freeSulfur, totalSulfur, density, ph, \
                               sulphates, alcohol, quality)"
query = query + " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"

# wineid is a running counter shared by both files, so every row gets a unique key
wineid = 1
# Red wines are inserted first, then white wines
for fileName in ['/tmp/winequality-red.csv', '/tmp/winequality-white.csv']:
    # The with-block closes each file once it has been read
    with open(fileName, 'r') as input_file:
        for line in input_file:
            row = line.split(';')
            # The 12 semicolon-separated fields map, in order, to the 12 non-key columns
            values = [float(field) for field in row[:12]]
            session.execute(query, (wineid, *values))
            wineid = wineid + 1
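
The loading step relies on every line holding 12 numeric, semicolon-separated fields. A minimal sketch of a more defensive parser is shown below; it assumes that a copy of the files might still carry a header row or blank lines. The function name `parse_wine_line` and the sample line are illustrative and not part of the notebook.

```python
def parse_wine_line(line):
    """Return the 12 numeric fields of one ';'-separated line,
    or None if the line is a header, blank, or too short."""
    fields = line.strip().split(';')
    if len(fields) < 12:
        return None
    try:
        return [float(field) for field in fields[:12]]
    except ValueError:
        # Non-numeric text, e.g. a header row of quoted column names
        return None

# Illustrative sample line in the same format as the CSV files
sample = "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
print(parse_wine_line(sample))
print(parse_wine_line('"fixed acidity";"volatile acidity"'))
```

Rows for which the helper returns `None` can simply be skipped before calling `session.execute`.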
    

## Machine Learning with Apache Cassandra & Apache Spark
<img src="images/sparklogo.png" width="150" height="200">

#### Create a Spark session that is connected to the database. From there, load the `wines` table into a Spark DataFrame and count its rows.

In [9]:
spark = SparkSession \
    .builder \
    .appName('demo') \
    .master("local") \
    .config( \
        "spark.cassandra.connection.config.cloud.path", \
        "file:/tmp/"+astraSecureConnect) \
    .config("spark.cassandra.auth.username", astraUsername) \
    .config("spark.cassandra.auth.password", astraPassword) \
    .getOrCreate()

In [10]:
wineDF = spark.read.format("org.apache.spark.sql.cassandra").options(table="wines", keyspace=astraKeyspace).load()

print ("Table Wine Row Count: ")
print (wineDF.count())

Table Wine Row Count: 
6497


In [11]:
showDF(wineDF)

| | wineid | alcohol | chlorides | citricacid | density | fixedacidity | freesulfur | ph | quality | sugar | sulphates | totalsulfur | volatileacidity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4237 | 8.7 | 0.044 | 0.73 | 1.00013 | 8.7 | 27.0 | 2.96 | 5.0 | 14.35 | 0.88 | 191.0 | 0.31 |
| 1 | 5986 | 11.0 | 0.043 | 0.31 | 0.9924 | 6.0 | 54.0 | 3.28 | 6.0 | 5.0 | 0.52 | 170.0 | 0.27 |
| 2 | 3365 | 8.8 | 0.056 | 0.23 | 0.9967 | 6.9 | 56.0 | 3.17 | 5.0 | 8.6 | 0.44 | 215.0 | 0.29 |
| 3 | 5883 | 11.0 | 0.049 | 0.26 | 0.9928 | 6.0 | 22.0 | 3.15 | 6.0 | 6.8 | 0.42 | 93.0 | 0.2 |
| 4 | 1406 | 11.3 | 0.062 | 0.3 | 0.9952 | 7.7 | 18.0 | 3.28 | 7.0 | 2.0 | 0.9 | 34.0 | 0.28 |


#### Let's keep only the wines rated 6.0 or higher (quality > 5) and create a new DataFrame from them

In [12]:
wine6DF = wineDF.filter("quality > 5")
showDF(wine6DF)

| | wineid | alcohol | chlorides | citricacid | density | fixedacidity | freesulfur | ph | quality | sugar | sulphates | totalsulfur | volatileacidity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5986 | 11.0 | 0.043 | 0.31 | 0.9924 | 6.0 | 54.0 | 3.28 | 6.0 | 5.0 | 0.52 | 170.0 | 0.27 |
| 1 | 5883 | 11.0 | 0.049 | 0.26 | 0.9928 | 6.0 | 22.0 | 3.15 | 6.0 | 6.8 | 0.42 | 93.0 | 0.2 |
| 2 | 1406 | 11.3 | 0.062 | 0.3 | 0.9952 | 7.7 | 18.0 | 3.28 | 7.0 | 2.0 | 0.9 | 34.0 | 0.28 |
| 3 | 3093 | 12.3 | 0.033 | 0.49 | 0.9936 | 8.0 | 39.0 | 3.13 | 8.0 | 9.0 | 0.38 | 180.0 | 0.34 |
| 4 | 3134 | 9.0 | 0.044 | 0.74 | 0.9996 | 7.1 | 44.0 | 3.38 | 6.0 | 15.6 | 0.67 | 176.0 | 0.18 |


#### Create a feature vector from all of the wine's attributes, and index the quality rating as the label

In [13]:
assembler = VectorAssembler(
    inputCols=['alcohol', 'chlorides', 'citricacid', 'density', 'fixedacidity', 'ph', 'freesulfur', 'sugar', 'sulphates', 'totalsulfur', 'volatileacidity'],
    outputCol='features')

trainingData = assembler.transform(wine6DF)

labelIndexer = StringIndexer(inputCol="quality", outputCol="label", handleInvalid='keep')
trainingData1 = labelIndexer.fit(trainingData).transform(trainingData)

showDF(trainingData1)
print(trainingData1.count())

| | wineid | alcohol | chlorides | citricacid | density | fixedacidity | freesulfur | ph | quality | sugar | sulphates | totalsulfur | volatileacidity | features | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5986 | 11.0 | 0.043 | 0.31 | 0.9924 | 6.0 | 54.0 | 3.28 | 6.0 | 5.0 | 0.52 | 170.0 | 0.27 | [11.0, 0.0430000014603138, 0.3100000023841858,... | 0.0 |
| 1 | 5883 | 11.0 | 0.049 | 0.26 | 0.9928 | 6.0 | 22.0 | 3.15 | 6.0 | 6.8 | 0.42 | 93.0 | 0.2 | [11.0, 0.04899999871850014, 0.2599999904632568... | 0.0 |
| 2 | 1406 | 11.3 | 0.062 | 0.3 | 0.9952 | 7.7 | 18.0 | 3.28 | 7.0 | 2.0 | 0.9 | 34.0 | 0.28 | [11.300000190734863, 0.06199999898672104, 0.30... | 1.0 |
| 3 | 3093 | 12.3 | 0.033 | 0.49 | 0.9936 | 8.0 | 39.0 | 3.13 | 8.0 | 9.0 | 0.38 | 180.0 | 0.34 | [12.300000190734863, 0.032999999821186066, 0.4... | 2.0 |
| 4 | 3134 | 9.0 | 0.044 | 0.74 | 0.9996 | 7.1 | 44.0 | 3.38 | 6.0 | 15.6 | 0.67 | 176.0 | 0.18 | [9.0, 0.04399999976158142, 0.7400000095367432,... | 0.0 |


4113
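
The `label` column comes from `StringIndexer`, which by default numbers the distinct values in order of decreasing frequency, so the most common rating gets index 0.0. That is consistent with the output above, where quality 6.0 maps to label 0.0, 7.0 to 1.0, and 8.0 to 2.0. The plain-Python sketch below imitates that ordering on a made-up sample. It is not Spark's implementation, and the function name and the alphabetical tie-break are assumptions of the sketch.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct value (as a string) to 0.0, 1.0, ...
    in order of decreasing frequency."""
    counts = Counter(str(v) for v in values)
    # Most frequent first; ties are broken alphabetically (assumption of this sketch)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Made-up sample of quality ratings, not the real table
qualities = [6.0, 6.0, 7.0, 8.0, 6.0, 7.0]
print(frequency_desc_index(qualities))
# {'6.0': 0.0, '7.0': 1.0, '8.0': 2.0}
```

Because the indices follow frequency rather than the numeric rating, a predicted label has to be mapped back to its quality value before it is interpreted.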


We need to split our dataset into a training set and a test set. We will split it 80/20.

In [14]:
# Split the data into train and test
splits = trainingData1.randomSplit([0.8, 0.2], 1234)
train = splits[0]
test = splits[1]

print ("Train Dataframe Row Count: ")
print (train.count())
print ("Test Dataframe Row Count: ")
print (test.count())

Train Dataframe Row Count: 
3361
Test Dataframe Row Count: 
752
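
The counts above (3361 and 752) are closer to 82/18 than to an exact 80/20. That is expected: `randomSplit` assigns rows by random sampling according to the weights rather than cutting the DataFrame at a fixed position. The sketch below shows the idea in plain Python. It is an analogue under that assumption, not Spark's code, and the function name `random_split` is hypothetical.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability
    proportional to that split's weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point shortfall
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(range(4113), [0.8, 0.2], seed=1234)
# Sizes are close to 80/20 but rarely exact
print(len(train_rows), len(test_rows))
```

Fixing the seed, as the notebook does with `1234`, makes the split repeatable for the same input data.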


### Now it's time to use NaiveBayes. We will train the model, then use that model with our test data to get our predictions.
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#naive-bayes
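
To make the `smoothing=1.0` and `modelType="multinomial"` settings concrete, here is a minimal sketch of multinomial Naive Bayes with additive (Laplace) smoothing in plain Python. It follows the textbook formulation on a tiny made-up dataset. It is not Spark's implementation, and the function names are hypothetical.

```python
import math

def train_multinomial_nb(features, labels, smoothing=1.0):
    """Estimate log priors and log feature weights per class."""
    classes = sorted(set(labels))
    n_features = len(features[0])
    log_prior, log_theta = {}, {}
    for c in classes:
        rows = [x for x, y in zip(features, labels) if y == c]
        # Smoothed class prior
        log_prior[c] = math.log((len(rows) + smoothing) /
                                (len(labels) + smoothing * len(classes)))
        # Per-feature totals within the class, smoothed so no weight is zero
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(totals) + smoothing * n_features
        log_theta[c] = [math.log((t + smoothing) / denom) for t in totals]
    return log_prior, log_theta

def predict(model, x):
    """Return the class with the highest log score for feature vector x."""
    log_prior, log_theta = model
    scores = {c: log_prior[c] + sum(xj * tj for xj, tj in zip(x, log_theta[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy data: class 0 is heavy in the first feature, class 1 in the second
X = [[3, 0], [4, 1], [0, 5], [1, 4]]
y = [0, 0, 1, 1]
model = train_multinomial_nb(X, y)
print(predict(model, [5, 0]), predict(model, [0, 5]))  # 0 1
```

The multinomial model expects non-negative feature values, which the wine attributes satisfy. The per-class log scores computed in `predict` are the same kind of quantity that appears in the `rawPrediction` column.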

In [15]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(train)

predictions = model.transform(test)
#predictions.show()
print (predictions.count())
showDF(predictions)

752


| | wineid | alcohol | chlorides | citricacid | density | fixedacidity | freesulfur | ph | quality | sugar | sulphates | totalsulfur | volatileacidity | features | label | rawPrediction | probability | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 10.0 | 0.065 | 0.0 | 0.9946 | 7.3 | 15.0 | 3.39 | 7.0 | 1.2 | 0.47 | 21.0 | 0.65 | [10.0, 0.06499999761581421, 0.0, 0.99459999799... | 1.0 | [-116.5426657159957, -116.18955542422036, -118... | [0.3941855595710744, 0.5611184665684894, 0.034... | 1.0 |
| 1 | 9 | 9.5 | 0.073 | 0.02 | 0.9968 | 7.8 | 9.0 | 3.36 | 7.0 | 2.0 | 0.57 | 18.0 | 0.58 | [9.5, 0.0729999989271164, 0.019999999552965164... | 1.0 | [-108.1078663589339, -107.92666842054157, -110... | [0.44007753472114086, 0.5274999998447291, 0.02... | 1.0 |
| 2 | 21 | 9.4 | 0.077 | 0.48 | 0.9968 | 8.9 | 29.0 | 3.39 | 6.0 | 1.8 | 0.53 | 60.0 | 0.22 | [9.399999618530273, 0.07699999958276749, 0.479... | 0.0 | [-163.32412184144326, -163.32124479526436, -16... | [0.4729277480820231, 0.4742903422357937, 0.043... | 1.0 |
| 3 | 117 | 10.0 | 0.077 | 0.28 | 0.9978 | 8.3 | 11.0 | 3.39 | 6.0 | 1.9 | 0.61 | 40.0 | 0.54 | [10.0, 0.07699999958276749, 0.2800000011920929... | 0.0 | [-125.19513795757629, -125.1952323668751, -128... | [0.4846300556253142, 0.48458430420129395, 0.02... | 0.0 |
| 4 | 238 | 9.2 | 0.097 | 0.0 | 0.99675 | 7.2 | 15.0 | 3.37 | 6.0 | 1.9 | 0.58 | 39.0 | 0.645 | [9.199999809265137, 0.09700000286102295, 0.0, ... | 0.0 | [-124.74132514617129, -124.81822968851249, -12... | [0.4994239604985621, 0.4624557205070503, 0.033... | 0.0 |


In [16]:
showDF(predictions.select("quality", "label", "prediction", "probability"))

| | quality | label | prediction | probability |
|---|---|---|---|---|
| 0 | 7.0 | 1.0 | 1.0 | [0.3941855595710744, 0.5611184665684894, 0.034... |
| 1 | 7.0 | 1.0 | 1.0 | [0.44007753472114086, 0.5274999998447291, 0.02... |
| 2 | 6.0 | 0.0 | 1.0 | [0.4729277480820231, 0.4742903422357937, 0.043... |
| 3 | 6.0 | 0.0 | 0.0 | [0.4846300556253142, 0.48458430420129395, 0.02... |
| 4 | 6.0 | 0.0 | 0.0 | [0.4994239604985621, 0.4624557205070503, 0.033... |


### We can now use the MulticlassClassificationEvaluator to evaluate the accuracy of our predictions.

In [19]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

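AttributeError: 'NoneType' object has no attribute '_jvm'

**Note:** The execution counts suggest this cell (In [19]) was run after `spark.stop()` (In [18]). With no active Spark session the evaluator cannot reach the JVM, which is the most likely cause of this error. Run this cell before stopping Spark to get the accuracy.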
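
The `accuracy` metric is the fraction of rows whose `prediction` equals `label`. As a small illustration, the sketch below computes it by hand for the five prediction rows displayed earlier. The function name `accuracy` is hypothetical, and the result describes those five rows only, not the full test set.

```python
# (label, prediction) pairs copied from the five displayed prediction rows
shown = [(1.0, 1.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (0.0, 0.0)]

def accuracy(pairs):
    """Fraction of rows whose predicted label equals the true label."""
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)

print(accuracy(shown))  # 0.8 for these five rows only
```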

In [None]:
session.execute("""drop table wines""")

In [18]:
spark.stop()