# Assignment 2

In [1]:
# tell jupyter where pyspark is
import findspark
findspark.init()

In [2]:
# import ALS and Linear Regression models
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.regression import LinearRegression
from pyspark.sql import Row
from pyspark.sql import SparkSession

## Recommendation
### Data Preparation
The dataset I found contains 515,000 customer reviews and scores for 1,493 luxury hotels across Europe. Its columns are:

(1)Hotel_Address: Address of hotel.

(2)Review_Date: Date when reviewer posted the corresponding review.

(3)Average_Score: Average Score of the hotel, calculated based on the latest comment in the last year.

(4)Hotel_Name: Name of Hotel.

(5)Reviewer_Nationality: Nationality of Reviewer.

(6)Negative_Review: Negative review the reviewer gave to the hotel. If the reviewer left no negative review, the value is 'No Negative'.

(7)Review_Total_Negative_Word_Counts: Total number of words in the negative review.

(8)Positive_Review: Positive review the reviewer gave to the hotel. If the reviewer left no positive review, the value is 'No Positive'.

(9)Review_Total_Positive_Word_Counts: Total number of words in the positive review.

(10)Reviewer_Score: Score the reviewer has given to the hotel, based on his/her experience.

(11)Total_Number_of_Reviews_Reviewer_Has_Given: Number of reviews the reviewer has given in the past.

(12)Total_Number_of_Reviews: Total number of valid reviews the hotel has.

(13)Tags: Tags reviewer gave the hotel.

(14)days_since_review: Duration between the review date and scrape date.

(15)Additional_Number_of_Scoring: Some guests only scored the service without writing a review. This number indicates how many such valid scores without a review the hotel has.

(16)lat: Latitude of the hotel.

(17)lng: Longitude of the hotel.

We want to build a recommendation algorithm that suggests the most appropriate hotel for a customer, given his/her nationality and the number of days he/she will stay. The hotel name is therefore used to generate the product ID, while the user ID is generated by combining the nationality with the length of stay.
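As a minimal sketch of how a nationality and a length of stay can be folded into a single integer user ID, the helpers below pack a 1-based nationality index and a night count into one number and unpack it again. The function names and the `MAX_NIGHTS` constant are illustrative assumptions, not part of the notebook; the sketch assumes stays of 1 to 31 nights.

```python
MAX_NIGHTS = 31  # assumption: every stay lasts between 1 and 31 nights


def encode_reviewer_id(nationality_index, nights):
    """Pack a 1-based nationality index and a night count into one ID."""
    if not 1 <= nights <= MAX_NIGHTS:
        raise ValueError("nights must be between 1 and MAX_NIGHTS")
    return (nationality_index - 1) * MAX_NIGHTS + nights


def decode_reviewer_id(reviewer_id):
    """Recover (nationality_index, nights) from an encoded ID."""
    nationality_offset, nights_offset = divmod(reviewer_id - 1, MAX_NIGHTS)
    return nationality_offset + 1, nights_offset + 1


# Second nationality, one-night stay
example_id = encode_reviewer_id(2, 1)
print(example_id, decode_reviewer_id(example_id))
```

Because each nationality owns a block of `MAX_NIGHTS` consecutive IDs, two different (nationality, nights) pairs can never collide as long as the night count stays in range.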

In [3]:
# Build a SparkSession
# SparkSession provides a single point of entry to interact with underlying Spark functionality

spark = SparkSession\
    .builder\
    .appName("ALSExample")\
    .getOrCreate()

In [4]:
# Read the file
import csv
import numpy as np
with open('/Users/jbian/Downloads/hotel_reviews.csv', 'r') as f:
    hotels = list(csv.reader(f, delimiter=','))

In [5]:
# Clean the data

stay_days = []
i_list = []
for i in range(1,len(hotels)):
    if 'Stayed' in hotels[i][-4]:
        a = hotels[i][-4][hotels[i][-4].find("Stayed")+6:hotels[i][-4].find(" night")]
        if a != '':
            stay_days.append(int(a))
            i_list.append(i)
print(max(stay_days))
print(len(stay_days))
print(len(i_list))

31
515534
515534
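The slicing in the cleaning cell pulls the number of nights out of the Tags field. A regular expression is one alternative way to express the same extraction; the sketch below assumes tags contain a phrase such as "Stayed 6 nights", and the sample string is an illustrative value rather than a row copied from the dataset.

```python
import re

# Matches both "Stayed 1 night" and "Stayed 6 nights"
_NIGHTS_PATTERN = re.compile(r"Stayed (\d+) night")


def parse_stay_nights(tags):
    """Return the number of nights found in a Tags string, or None."""
    match = _NIGHTS_PATTERN.search(tags)
    return int(match.group(1)) if match else None


# Illustrative tag string (assumed format)
sample_tags = "[' Leisure trip ', ' Couple ', ' Stayed 6 nights ']"
print(parse_stay_nights(sample_tags))
```

Returning `None` for rows without a stay length makes it easy to filter them out, which mirrors the `if a != ''` check in the cell above.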


In [6]:
hotel = list( hotels[i] for i in i_list)
print(len(hotel))

515534


In [7]:
# Transform the data

# Map each hotel name and reviewer nationality to a 1-based integer ID.
# Dictionaries give constant-time lookups instead of rescanning a list per row.
name_ids = {}
nat_ids = {}
for i in range(1, len(hotel)):
    # setdefault assigns the next free ID the first time a value is seen
    hotel_id = name_ids.setdefault(hotel[i][4], len(name_ids) + 1)
    nat_id = nat_ids.setdefault(hotel[i][5], len(nat_ids) + 1)
    hotel[i][4] = hotel_id
    # Combine nationality and nights stayed (at most 31) into one reviewer ID
    hotel[i][5] = (nat_id - 1) * 31 + stay_days[i]
    hotel[i][3] = float(hotel[i][3])

In [8]:
# Extract valid data

import pandas as pd
hotel_df = pd.DataFrame.from_records(hotel[1:len(hotel)])
hotel_df = hotel_df.loc[:,[3,4,5]]

In [9]:
# Read the data from pdDataFrame into spark DataFrame

from pyspark.sql.types import *
mySchema = StructType([ StructField("Average_Score", FloatType(), True)\
                       ,StructField("Hotel_ID", IntegerType(), True)\
                       ,StructField("Reviewer_ID", IntegerType(), True)])
df = spark.createDataFrame(hotel_df,schema=mySchema)

# Split data to training part and test part

(training, test) = df.randomSplit([0.7, 0.3])

### ALS Recommendation Algorithm

In [10]:
# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics

als = ALS(maxIter=5, regParam=0.02, userCol="Reviewer_ID", itemCol="Hotel_ID", 
          ratingCol="Average_Score", coldStartStrategy="drop")
model = als.fit(training)

In [11]:
# Make predictions using the model we just built; 
# Evaluate the model by computing the RMSE on the test data

predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="Average_Score",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 0.011005717584661975


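The root-mean-square error is about 0.011, which is small relative to the ratings themselves (the predicted ratings shown in the following outputs are roughly 7.6 to 9.8), so the model's predictions track the test ratings closely. Keep in mind that the rating column is Average_Score, a hotel-level value shared by every review of the same hotel, which makes it comparatively easy to predict.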
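For reference, RMSE is the square root of the mean squared difference between labels and predictions. The sketch below computes it by hand on a few made-up values; the helper name and the numbers are illustrative and unrelated to the hotel data.

```python
import math


def rmse(labels, predictions):
    """Root-mean-square error between two equally long sequences."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings and predictions
labels = [8.0, 9.0, 7.5]
predictions = [8.5, 9.0, 7.0]
print(rmse(labels, predictions))
```

Since RMSE is expressed in the same units as the rating, it is easiest to judge against the range of the rating column.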

In [12]:
# Generate top 5 hotel recommendations for each user

userRecs = model.recommendForAllUsers(5)
userRecs.show()

# Generate top 5 user recommendations for each hotel

hotelRecs = model.recommendForAllItems(5)
hotelRecs.show()

+-----------+--------------------+
|Reviewer_ID|     recommendations|
+-----------+--------------------+
|       4900|[[157, 9.785504],...|
|        471|[[157, 9.791729],...|
|       1591|[[157, 9.782719],...|
|       1342|[[157, 9.789544],...|
|       2122|[[157, 9.780739],...|
|       2142|[[157, 9.777184],...|
|       1645|[[157, 9.790073],...|
|       1088|[[157, 9.786528],...|
|       1959|[[157, 9.7895975]...|
|        540|[[157, 9.763406],...|
|       1460|[[157, 9.791541],...|
|       1990|[[157, 9.790315],...|
|       2580|[[157, 9.802184],...|
|       4190|[[157, 9.780701],...|
|        392|[[157, 9.791929],...|
|       1522|[[157, 9.779703],...|
|        623|[[157, 9.788346],...|
|       5614|[[157, 9.799236],...|
|       1025|[[157, 9.785852],...|
|       2235|[[157, 9.789667],...|
+-----------+--------------------+
only showing top 20 rows

+--------+--------------------+
|Hotel_ID|     recommendations|
+--------+--------------------+
|     471|[[5058, 8.131296]...|
|    1

In [13]:
# Generate top 5 hotel recommendations for a specified user

user = df.select(als.getUserCol()).distinct().limit(1)
userSubsetRecs = model.recommendForUserSubset(user, 5)
userSubsetRecs.show()

# Generate top 5 user recommendations for a specified hotel

hotel_subset = df.select(als.getItemCol()).distinct().limit(1)
hotelSubsetRecs = model.recommendForItemSubset(hotel_subset, 5)
hotelSubsetRecs.show()

+-----------+--------------------+
|Reviewer_ID|     recommendations|
+-----------+--------------------+
|       1088|[[157, 9.786528],...|
+-----------+--------------------+

+--------+--------------------+
|Hotel_ID|     recommendations|
+--------+--------------------+
|     148|[[5058, 7.631981]...|
+--------+--------------------+



Data source: https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe

## Regression
### Data Preparation

With millions of apps available today, this dataset is useful for identifying top-trending apps in the iOS App Store. It contains details of more than 7,000 Apple iOS mobile applications, extracted from the iTunes Search API on the Apple Inc. website.

The columns of the dataset are:

"id" : App ID

"track_name": App Name

"size_bytes": Size (in Bytes)

"currency": Currency Type

"price": Price amount

"rating_count_tot": User Rating counts (for all version)

"rating_count_ver": User Rating counts (for current version)

"user_rating" : Average User Rating value (for all version)

"user_rating_ver": Average User Rating value (for current version)

"ver" : Latest version code

"cont_rating": Content Rating

"prime_genre": Primary Genre

"sup_devices.num": Number of supporting devices

"ipadSc_urls.num": Number of screenshots showed for display

"lang.num": Number of supported languages

"vpp_lic": Vpp Device Based Licensing Enabled

Data collection date (from API): July 2017.

Dimensions of the dataset: 7197 rows and 16 columns.

In [14]:
spark = SparkSession\
    .builder\
    .appName("Regression")\
    .getOrCreate()

In [15]:
# Use Spark's built-in CSV reader; header and inferSchema keep the column names and types
as_df = spark.read.csv('/Users/jbian/Downloads/AppleStore.csv',
                       header=True, inferSchema=True)
as_df.take(1)

[Row(_c0=1, id=281656475, track_name='PAC-MAN Premium', size_bytes=100788224, currency='USD', price=3.99, rating_count_tot=21292, rating_count_ver=26, user_rating=4.0, user_rating_ver=4.5, ver='6.3.5', cont_rating='4+', prime_genre='Games', sup_devices.num=38, ipadSc_urls.num=5, lang.num=10, vpp_lic=1)]

In [16]:
as_df.cache()
as_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- track_name: string (nullable = true)
 |-- size_bytes: long (nullable = true)
 |-- currency: string (nullable = true)
 |-- price: double (nullable = true)
 |-- rating_count_tot: integer (nullable = true)
 |-- rating_count_ver: integer (nullable = true)
 |-- user_rating: double (nullable = true)
 |-- user_rating_ver: double (nullable = true)
 |-- ver: string (nullable = true)
 |-- cont_rating: string (nullable = true)
 |-- prime_genre: string (nullable = true)
 |-- sup_devices.num: integer (nullable = true)
 |-- ipadSc_urls.num: integer (nullable = true)
 |-- lang.num: integer (nullable = true)
 |-- vpp_lic: integer (nullable = true)



In [17]:
as_df.describe().toPandas().transpose()

| summary | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| _c0 | 7197 | 4759.069612338474 | 3093.6252131502906 | 1 | 11097 |
| id | 7197 | 8.631309974515771E8 | 2.7123675589291865E8 | 281656475 | 1188375727 |
| track_name | 7197 | 1824.0 | 316.7838379715733 | ! OH Fantastic Free Kick + Kick Wall Challenge | ｗｗｗ |
| size_bytes | 7197 | 1.99134453825066E8 | 3.592069135387029E8 | 589824 | 4025969664 |
| currency | 7197 | | | USD | USD |
| price | 7197 | 1.7262178685562626 | 5.833005786951921 | 0.0 | 299.99 |
| rating_count_tot | 7197 | 12892.907183548701 | 75739.40867472602 | 0 | 2974676 |
| rating_count_ver | 7197 | 460.3739057940809 | 3920.4551833619757 | 0 | 177050 |
| user_rating | 7197 | 3.526955675976101 | 1.517947593629884 | 0.0 | 5.0 |


### Linear Regression

In [18]:
from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler(inputCols = ['price','rating_count_tot', 'rating_count_ver',
                                              'size_bytes', 'user_rating_ver'],
                                  outputCol = 'features')
vas_df = vectorAssembler.transform(as_df)
vas_df = vas_df.select(['features', 'user_rating'])
vas_df.show(3)

+--------------------+-----------+
|            features|user_rating|
+--------------------+-----------+
|[3.99,21292.0,26....|        4.0|
|[0.0,161065.0,26....|        4.0|
|[0.0,188583.0,282...|        3.5|
+--------------------+-----------+
only showing top 3 rows



In [19]:
splits = vas_df.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]

In [20]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol = 'features', 
                      labelCol='user_rating', maxIter=10, regParam=0.3, 
                      elasticNetParam=0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [0.0,0.0,0.0,0.0,0.49798538217109894]
Intercept: 1.906138245036309
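The zero coefficients are a typical effect of the L1 part of the elastic-net penalty. As a reminder, the objective that Spark's `LinearRegression` documents for elastic-net regularization can be written as follows, with $\lambda$ = `regParam` = 0.3 and $\alpha$ = `elasticNetParam` = 0.8 in this cell; the symbols $w$, $b$, $x_i$, $y_i$ and $n$ are notation introduced here for illustration.

```latex
\min_{w,\,b}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(w^{\top}x_i + b - y_i\bigr)^{2}
\;+\;
\lambda\Bigl(\alpha\,\lVert w\rVert_{1}
\;+\;\frac{1-\alpha}{2}\,\lVert w\rVert_{2}^{2}\Bigr)
```

With these settings the weight on the L1 term is $\lambda\alpha = 0.24$ and the weight on the squared L2 term is $\lambda(1-\alpha)/2 = 0.03$, so the penalty is dominated by the sparsity-inducing L1 part.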


In [21]:
lr_predictions = lr_model.transform(test_df)
lr_predictions.select("prediction","user_rating","features").show(5)
from pyspark.ml.evaluation import RegressionEvaluator
lr_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="user_rating",metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

+-----------------+-----------+--------------------+
|       prediction|user_rating|            features|
+-----------------+-----------+--------------------+
|1.906138245036309|        0.0|(5,[0,3],[0.99,12...|
|1.906138245036309|        0.0|(5,[0,3],[0.99,31...|
|1.906138245036309|        0.0|(5,[0,3],[0.99,96...|
|1.906138245036309|        0.0|(5,[0,3],[0.99,2....|
|1.906138245036309|        0.0|(5,[0,3],[0.99,2....|
+-----------------+-----------+--------------------+
only showing top 5 rows

R Squared (R2) on test data = 0.563619
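The identical predictions in the table above follow directly from the fitted model: a linear prediction is the intercept plus the dot product of coefficients and features, and only the last coefficient is non-zero. The sketch below redoes that arithmetic with the coefficients and intercept printed earlier; the helper name and the first feature vector are illustrative, while the second vector uses the PAC-MAN Premium row shown in the `take(1)` output.

```python
# Values copied from the printed model output
COEFFICIENTS = [0.0, 0.0, 0.0, 0.0, 0.49798538217109894]
INTERCEPT = 1.906138245036309


def predict(features):
    """Linear prediction: intercept plus the coefficient-feature dot product."""
    return INTERCEPT + sum(w * x for w, x in zip(COEFFICIENTS, features))


# Feature order: price, rating_count_tot, rating_count_ver,
# size_bytes, user_rating_ver
no_version_rating = [0.99, 0.0, 0.0, 12000000.0, 0.0]  # illustrative values
pac_man = [3.99, 21292.0, 26.0, 100788224.0, 4.5]

print(predict(no_version_rating))  # equals the intercept
print(predict(pac_man))
```

Any app whose `user_rating_ver` is 0 therefore receives the intercept as its prediction, regardless of price, size, or rating counts.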


In [22]:
test_result = lr_model.evaluate(test_df)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_result.rootMeanSquaredError)

Root Mean Squared Error (RMSE) on test data = 1.00571


The root mean squared error is 1.00571 and R square is 0.563619. The closer R square is to 1, the better the regression line fits the data, so this linear fit is mediocre. The coefficients show why: the elastic-net penalty shrank every weight except the one for user_rating_ver to zero.
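To make the R square statement concrete, the sketch below computes it as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name and the toy numbers are illustrative and unrelated to the app data.

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [1.0, 2.0, 3.0, 4.0]

# A perfect model versus a model that always predicts the mean label
print(r_squared(labels, labels))
print(r_squared(labels, [2.5, 2.5, 2.5, 2.5]))
```

A model that always predicts the mean label scores 0, so R square measures how much better the model is than that trivial baseline.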

### Decision Tree Regression

In [23]:
from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor(featuresCol ='features', labelCol = 'user_rating')
dt_model = dt.fit(train_df)
dt_predictions = dt_model.transform(test_df)
dt_evaluator = RegressionEvaluator(
    labelCol="user_rating", predictionCol="prediction", metricName="rmse")
rmse = dt_evaluator.evaluate(dt_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 0.477572


The RMSE on the test data dropped from 1.00571 for linear regression to 0.477572, so the decision tree fits the data better.

### Gradient-boosted Tree Regression

In [24]:
from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(featuresCol = 'features', labelCol = 'user_rating', maxIter=10)
gbt_model = gbt.fit(train_df)
gbt_predictions = gbt_model.transform(test_df)
gbt_predictions.select('prediction', 'user_rating', 'features').show(5)

+--------------------+-----------+--------------------+
|          prediction|user_rating|            features|
+--------------------+-----------+--------------------+
|-0.00729217521636...|        0.0|(5,[0,3],[0.99,12...|
|-0.00729217521636...|        0.0|(5,[0,3],[0.99,31...|
|-0.00729217521636...|        0.0|(5,[0,3],[0.99,96...|
|-0.00729217521636...|        0.0|(5,[0,3],[0.99,2....|
|-0.00729217521636...|        0.0|(5,[0,3],[0.99,2....|
+--------------------+-----------+--------------------+
only showing top 5 rows



In [25]:
gbt_evaluator = RegressionEvaluator(
    labelCol="user_rating", predictionCol="prediction", metricName="rmse")
rmse = gbt_evaluator.evaluate(gbt_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 0.456392


The RMSE on the test data dropped again, from 0.477572 to 0.456392, so gradient boosting improves on the single decision tree, though only slightly.
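One way to quantify "smaller" is the relative reduction in RMSE between models. The sketch below uses the three RMSE values printed in the outputs above; the dictionary keys and the helper name are illustrative.

```python
# RMSE values copied from the printed outputs above
rmse_by_model = {
    "linear_regression": 1.00571,
    "decision_tree": 0.477572,
    "gradient_boosted_trees": 0.456392,
}


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


tree_vs_linear = relative_reduction(
    rmse_by_model["linear_regression"], rmse_by_model["decision_tree"])
gbt_vs_tree = relative_reduction(
    rmse_by_model["decision_tree"], rmse_by_model["gradient_boosted_trees"])

print("tree vs linear: %.1f%%" % (100 * tree_vs_linear))
print("GBT vs tree:    %.1f%%" % (100 * gbt_vs_tree))
```

Because each model was evaluated on one random split, small differences such as the one between the two tree models could change from run to run.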

Data Source: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

## Classification
### Data Preparation

This dataset is from the Pima Indians Diabetes Database. It consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. The columns are as follows:


Pregnancies: Number of times pregnant;

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test;

BloodPressure: Diastolic blood pressure (mm Hg);

SkinThickness: Triceps skin fold thickness (mm);

Insulin: 2-Hour serum insulin (mu U/ml);

BMI: Body mass index (weight in kg/(height in m)^2);

DiabetesPedigreeFunction: Diabetes pedigree function;

Age: Age (years);

Outcome: Class variable (0 or 1).

The dataset has 768 rows and 9 columns.

First, we read the CSV file into a DataFrame and then save it in LIBSVM format.

In [26]:
spark = SparkSession\
    .builder\
    .appName("Classification")\
    .getOrCreate()

In [27]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint


# Use Spark's built-in CSV reader; header and inferSchema keep the column names and types
db_df = spark.read.csv('/Users/jbian/Downloads/diabetes.csv',
                       header=True, inferSchema=True)

# Convert the DataFrame to an RDD of Row objects
c = db_df.rdd
print(c.take(3))

# Map each Row to a LabeledPoint: Outcome (index 8) is the label,
# the eight predictor columns (indices 0-7) are the features
d = c.map(lambda line: LabeledPoint(line[8], [line[0], line[1], line[2], line[3],
                                              line[4], line[5], line[6], line[7]]))
print(d.take(3))

# save the csv as libsvm format
# MLUtils.saveAsLibSVMFile(d, "/Users/jbian/Desktop/CU/6893/HW/diabetes_libsvm")

[Row(Pregnancies=6, Glucose=148, BloodPressure=72, SkinThickness=35, Insulin=0, BMI=33.6, DiabetesPedigreeFunction=0.627, Age=50, Outcome=1), Row(Pregnancies=1, Glucose=85, BloodPressure=66, SkinThickness=29, Insulin=0, BMI=26.6, DiabetesPedigreeFunction=0.351, Age=31, Outcome=0), Row(Pregnancies=8, Glucose=183, BloodPressure=64, SkinThickness=0, Insulin=0, BMI=23.3, DiabetesPedigreeFunction=0.672, Age=32, Outcome=1)]
[LabeledPoint(1.0, [6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0]), LabeledPoint(0.0, [1.0,85.0,66.0,29.0,0.0,26.6,0.351,31.0]), LabeledPoint(1.0, [8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0])]
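To show what the LIBSVM text format looks like, the sketch below turns a label and a dense feature list into a single `label index:value ...` line with 1-based indices. Omitting zero-valued features is the usual sparse convention of the format; this helper is a hand-written illustration and is not guaranteed to match the output of `MLUtils.saveAsLibSVMFile` byte for byte.

```python
def to_libsvm_line(label, features):
    """Format one sample as 'label index:value ...' with 1-based indices."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0.0:  # sparse convention: zero entries are left out
            parts.append("%d:%s" % (index, value))
    return " ".join(parts)


# First LabeledPoint from the output above
line = to_libsvm_line(1.0, [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0])
print(line)
```

Reading such a file back with the `libsvm` data source yields a `label` column and a sparse `features` vector, which is why the features appear in sparse form in the later outputs.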


### Decision tree classifier

In [28]:
# Load the data stored in LIBSVM format as a DataFrame.
data = spark.read.format("libsvm").load("/Users/jbian/Desktop/CU/6893/HW/diabetes_libsvm/part-00000")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)

+----------+------------+--------------------+
|prediction|indexedLabel|            features|
+----------+------------+--------------------+
|       0.0|         0.0|(8,[0,1,2,3,4,5,6...|
|       0.0|         0.0|(8,[0,1,2,3,4,5,6...|
|       0.0|         0.0|(8,[0,1,2,3,4,5,6...|
|       0.0|         0.0|(8,[0,1,2,3,4,5,6...|
|       0.0|         0.0|(8,[0,1,2,3,4,5,6...|
+----------+------------+--------------------+
only showing top 5 rows

Test Error = 0.299107 
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4174a557897a2a976e44) of depth 5 with 45 nodes


The test error of the decision tree classifier is 0.299107, meaning roughly 30% of the test samples are misclassified, which is too high for the model to be considered reliable.
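The test error reported here is simply one minus the accuracy, where accuracy is the share of test samples whose predicted label equals the true label. A minimal sketch with made-up labels (the helper name is illustrative):

```python
def error_rate(labels, predictions):
    """Fraction of samples whose prediction differs from the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    accuracy = correct / len(labels)
    return 1.0 - accuracy


# Made-up labels: three of four predictions are correct
true_labels = [0, 0, 1, 1]
predicted_labels = [0, 1, 1, 1]
print(error_rate(true_labels, predicted_labels))
```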

### Gradient-boosted tree classifier

In [29]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Train a GBT model.
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=10)

# Chain indexers and GBT in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

gbtModel = model.stages[2]
print(gbtModel)  # summary only

+----------+------------+--------------------+
|prediction|indexedLabel|            features|
+----------+------------+--------------------+
|       0.0|         0.0|(8,[0,1,2,3,4,5,6...|
|       0.0|         0.0|(8,[0,1,2,3,4,5,6...|
|       0.0|         0.0|(8,[0,1,2,3,4,5,6...|
|       0.0|         0.0|(8,[0,1,2,3,4,5,6...|
|       0.0|         0.0|(8,[0,1,2,3,4,5,6...|
+----------+------------+--------------------+
only showing top 5 rows

Test Error = 0.290179
GBTClassificationModel (uid=GBTClassifier_47efb8c3129493ff6113) with 10 trees


At 0.290179, the test error of the gradient-boosted tree classifier is slightly lower than the 0.299107 obtained with the decision tree classifier, but I think it is still not good enough.

### Naive Bayes

In [30]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1, modelType="multinomial")

# train the model
model = nb.fit(train)

# select example rows to display.
predictions = model.transform(test)
predictions.show()

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(8,[0,1,2,3,4,5,6...|[-432.63478504583...|[0.20615767142937...|       1.0|
|  0.0|(8,[0,1,2,3,4,5,6...|[-550.68630466658...|[0.99999973388797...|       0.0|
|  0.0|(8,[0,1,2,3,4,5,6...|[-496.25653849313...|[0.99764576046580...|       0.0|
|  0.0|(8,[0,1,2,3,4,5,6...|[-610.97441414007...|[0.99999257855435...|       0.0|
|  0.0|(8,[0,1,2,3,4,5,6...|[-490.75921658592...|[0.07912384048554...|       1.0|
|  0.0|(8,[0,1,2,3,4,5,6...|[-626.91374900256...|[0.99993988758394...|       0.0|
|  0.0|(8,[0,1,2,3,4,5,6...|[-486.79337663904...|[0.99999930098523...|       0.0|
|  0.0|(8,[0,1,2,3,4,5,6...|[-552.92547395772...|[0.99205824357932...|       0.0|
|  0.0|(8,[0,1,2,3,4,5,6...|[-556.19080373204...|[0.75907721204694...|       0.0|
|  0.0|(8,[0,1,2

The naive Bayes classifier reaches an accuracy of only around 0.6, so it does not fit the data well.
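A useful yardstick for an accuracy figure is the majority-class baseline, i.e. the accuracy of always predicting the most frequent label. The sketch below assumes the commonly reported class counts for this dataset (500 negative and 268 positive samples out of 768); those counts are an assumption here and are not computed in this notebook.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Assumed class counts: 500 samples with Outcome 0, 268 with Outcome 1
labels = [0] * 500 + [1] * 268
baseline = majority_baseline_accuracy(labels)
print("majority-class baseline accuracy = %.3f" % baseline)
```

Under that assumption, a classifier needs to exceed roughly 0.65 accuracy on the full dataset before it does better than ignoring the features altogether.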

Data source: https://www.kaggle.com/uciml/pima-indians-diabetes-database