## Clear Statement of the problem to solve.

Using the review text in the description column of the wine dataset, we want to see whether we can predict a wine's country of origin. Concretely, we predict whether a wine is US-made or non-US-made from the reviews written by wine critics.

## Statement/Comment on the data collection (in terms of what is needed) and scope. 

Using PySpark and its ML library features (udf, StopWordsRemover, Tokenizer, CountVectorizer, HashingTF and IDF), we focus on getting to know the dataset and its contents, tokenising the reviews, cleaning the data by removing stop words, and converting categorical/string data into numerical form.

A classification algorithm needs numerical input, so both the label and the review text have to be converted from strings to numbers. We use count vectorisation and hashing TF-IDF to do this for the text.

## Preliminary decision and justification about the ML algorithm we will use. 

The algorithm we are going to use is a classification algorithm. We have a dataset of 150935 rows and 2 columns of labeled data; the ShortWineReviews.csv file loaded in this notebook contains 4000 rows.

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. 
It is the process of predicting the class of given data points. Classes are sometimes called targets, labels or categories.

In supervised learning, we learn from a historical dataset where the answer is already known in order to predict future data points.

Classification is applied to categorical targets. Here we do binary classification: there are two classes in the data, US and Non-US.

Since we want to identify which origin a particular wine belongs to, and the origin is categorized as US/Non-US, a classification algorithm is a sensible choice for this dataset (which has more than 100K rows).



## Statement/comments on data preparation and clean up. Steps to get data ready for the ML algorithm of our choice.

In data preparation, we load the dataset from the local drive. We check whether there are any empty fields and either remove them or fill them with appropriate values.

We only use the rows whose origin is 'US' or 'Non-US'. The prediction is based entirely on the reviews in the description column of the dataset.

We then turn each review into features: remove punctuation, use the tokenizer to break sentences into tokens, use the stop-words remover to drop words that carry no signal for the prediction, and vectorise the tokens before fitting a logistic regression model. We split the dataset 80/20, train the model on the training set and test it on the test set.

Using evaluation metrics such as the confusion matrix, precision, recall and F1, we compare the training and test results and conclude whether the model is working well.




## Perform data prep and clean-up. 

In [74]:
#create spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('nlp_wine').getOrCreate()

In [75]:
#read dataset
#change the path to match where the CSV is stored on your machine
wine_df = spark.read.csv('C:/Users/Owner/Documents/Machine Learning/ShortWineReviews.csv', inferSchema=True, header=True,sep=',')

In [76]:
#know the columns and their datatypes
wine_df.printSchema()

root
 |-- Origin: string (nullable = true)
 |-- Description: string (nullable = true)



In [77]:
#count the number of rows
wine_df.count()

4000

In [78]:
#check total number of rows and columns
print(wine_df.count(),len(wine_df.columns))

4000 2


In [79]:
from pyspark.sql.functions import rand 

In [80]:
#show a few random rows from the dataset to see what they look like, using the rand function
wine_df.orderBy(rand()).show(2,False)

+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Origin|Description                                                                                                                                                                                                                                                                                 |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Non-US|A ripe and rich wine, this is full of jammy red fruits. The structure is well integrated into the fruits, givi

In [81]:
#check if we have any null values in the dataset
for col in wine_df.columns:
    print(col,'\t', 'with null values: ', wine_df.filter(wine_df[col].isNull()).count())

Origin 	 with null values:  0
Description 	 with null values:  0


In [82]:
#Keep only the rows whose Origin is 'US' or 'Non-US', in case the column contains any other values
wine_df = wine_df.filter(((wine_df.Origin =='US') | (wine_df.Origin =='Non-US')))

In [83]:
#check how many rows remain after filtering
wine_df.count()

4000

In [84]:
#Using groupby function, check the counts of categories US/Non-US
wine_df.groupBy('Origin').count().show()

+------+-----+
|Origin|count|
+------+-----+
|Non-US| 2000|
|    US| 2000|
+------+-----+



The data is balanced: 2000 reviews for each class.

### <b> Comment on required data transformation needed to get the data ready for input to the ML algorithm of your choice. </b>

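The Origin column is a string, so we map it to a numeric label with a udf: 0.0 for US and 1.0 for Non-US. Because the udf is declared with StringType, the new Label column is still a string at this point; we cast it to float further down.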

In [85]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

In [86]:
#map Origin to a label with a conditional expression: 0.0 for 'US', 1.0 otherwise
Origin_udf = udf(lambda Origin: 0.0 if Origin == 'US' else 1.0, StringType())

In [87]:
wine_df.withColumn('Label', Origin_udf(wine_df.Origin)).show(2)

+------+--------------------+-----+
|Origin|         Description|Label|
+------+--------------------+-----+
|    US|This tremendous 1...|  0.0|
|    US|Mac Watson honors...|  0.0|
+------+--------------------+-----+
only showing top 2 rows



In [88]:
wine_df = wine_df.withColumn('Label', Origin_udf(wine_df.Origin))

In [89]:
wine_df.printSchema()

root
 |-- Origin: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Label: string (nullable = true)



In [90]:
wine_df.show(3, False)

+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|Origin|Description                                                                                                                                                                                                                                                                                                                                                                                       |Label|
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [91]:
#Make a new data frame from wine_df by selecting only the required columns
wine_df_ready = wine_df.select('Label', 'Description')
wine_df_ready.show(5, False)

+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Label|Description                                                                                                                                                                                                                                                                                                                                                                                       |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [92]:
wine_df_ready.printSchema()

root
 |-- Label: string (nullable = true)
 |-- Description: string (nullable = true)



In [93]:
#As we can see, Label is still a string type, so we cast it to float
wine_df_ready = wine_df_ready.withColumn("Label", wine_df_ready.Label.cast('float'))

In [94]:
wine_df_ready.printSchema()

root
 |-- Label: float (nullable = true)
 |-- Description: string (nullable = true)



In [95]:
wine_df_ready.groupby('Label').count().show()

+-----+-----+
|Label|count|
+-----+-----+
|  1.0| 2000|
|  0.0| 2000|
+-----+-----+



#### 7. Required data transformations
*The Label column has now been converted from string to float.*

In [96]:
wine_df_ready.orderBy(rand()).show(5,False)

+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Label|Description                                                                                                                                                                                                                                                                               |
+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |Fresh-squeezed apple juice, grapefruit spritzer and nearly sweet cherry aromas sit atop minerality on the nose of this b

The data is not clean: the reviews contain a lot of punctuation that is unnecessary for the model. If we tokenize the text as it is, the punctuation stays attached to the words and ends up inside the tokens. The reviews also differ in length.
We have to remove the punctuation first.

## <b> Conduct Exploratory data analysis (EDA) </b>

<U>*Again looking at data to make sure we haven't lost any during transformation*</U>

In [97]:
wine_df_ready.groupBy('Label').count().show()

+-----+-----+
|Label|count|
+-----+-----+
|  1.0| 2000|
|  0.0| 2000|
+-----+-----+



We can observe that the dataset is still balanced: an equal number of reviews for labels 1 and 0.

In [98]:
# Add length to the dataframe
from pyspark.sql.functions import length

In [99]:
wine_df_ready = wine_df_ready.withColumn('length',length(wine_df_ready['Description']))

In [100]:
wine_df_ready.orderBy(rand()).show(10,False)

+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|Label|Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                  

In [101]:
wine_df_ready.groupBy('Label').agg({'Length':'mean'}).show()

+-----+-----------+
|Label|avg(Length)|
+-----+-----------+
|  1.0|    252.964|
|  0.0|    254.411|
+-----+-----------+



### Comment on required data transformation needed to get the data ready for input to the ML algorithm of your choice.
StopWordsRemover drops common stop words from the token list, but it does not remove punctuation. Tokenizer breaks the sentences in the description column into single-word tokens. These are the transformations applied to the dataset before we run the algorithm.

agg is applied to the length column and computes the average length per label.

To handle punctuation, we define a function that strips punctuation characters from every review with a regular expression.

From the aggregate, we can see that reviews with Label 0.0 (US) are on average about 1.4 characters longer than reviews with Label 1.0 (Non-US): 254.411 versus 252.964. Review length is therefore almost identical across the two classes.

## <U> Data Cleaning</U> *EDA*

### <U>Remove Punctuations</U>

In [102]:
from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    return trim(lower(regexp_replace(column, r'([^\s\w_]|_)+', '')))

In [103]:
wine_df_ready = wine_df_ready.withColumn('Description_nopunct', removePunctuation(col('Description')))
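
To see what the pattern in `removePunctuation` does, here is a minimal sketch that applies the same regular expression with Python's `re` module to a plain string instead of a Spark column. The helper name `remove_punctuation_py` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as in removePunctuation: a run of characters that are
# neither whitespace nor word characters, or underscores.
PUNCT_PATTERN = re.compile(r'([^\s\w_]|_)+')

def remove_punctuation_py(text):
    """Strip punctuation, lowercase, and trim (illustrative helper)."""
    return PUNCT_PATTERN.sub('', text).lower().strip()

sample = "This re-named vineyard, from 18-year-old vines!"
cleaned = remove_punctuation_py(sample)
print(cleaned)
```

Hyphenated words are glued together (`18-year-old` becomes `18yearold`), which matches the tokens shown in the notebook output. One difference to keep in mind: Spark's `regexp_replace` uses Java regular expressions, where `\w` is ASCII-only by default, while Python's `\w` is Unicode-aware, so accented letters may be treated differently.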

### Required data transformations - tokenizer and stopwords remover
Tokenizer splits each sentence into words and converts all letters to lower case. StopWordsRemover then drops common words from the token list.

### <U>Tokenizer</U>

In [104]:
#Using description_nopunct column as input to tokenizer as it has punctuations removed
from pyspark.ml.feature import Tokenizer
tokenization = Tokenizer(inputCol='Description_nopunct',outputCol='tokens')

In [105]:
tokenized_df = tokenization.transform(wine_df_ready)

In [106]:
tokenized_df.show(5, False)

+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [107]:
#Some very common words such as 'this', 'the', 'to' etc. are known as stop words.
#They carry little information and add computation overhead, so it is a good idea to drop them,
#hence we use StopWordsRemover
from pyspark.ml.feature import StopWordsRemover
stopword_removal = StopWordsRemover(inputCol='tokens',outputCol='refined_tokens')

In [108]:
refined_text_df = stopword_removal.transform(tokenized_df)
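
As a rough picture of what these two transformers do to a single review, here is a small pure-Python sketch. The function names and the tiny stop-word set are hypothetical; Spark's default English stop-word list is much longer, and its Tokenizer works on a whole column rather than one string.

```python
# Tiny illustrative stop-word set; not Spark's actual default list.
STOP_WORDS = {"this", "the", "a", "in", "from", "to", "and", "of"}

def tokenize(text):
    """Lowercase and split on whitespace, roughly what Tokenizer does."""
    return text.lower().split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("this spent 20 months in oak")
refined = remove_stop_words(tokens)
print(tokens)
print(refined)
```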

In [109]:
refined_text_df.show()

+-----+--------------------+------+--------------------+--------------------+--------------------+
|Label|         Description|length| Description_nopunct|              tokens|      refined_tokens|
+-----+--------------------+------+--------------------+--------------------+--------------------+
|  0.0|This tremendous 1...|   355|this tremendous 1...|[this, tremendous...|[tremendous, 100,...|
|  0.0|Mac Watson honors...|   280|mac watson honors...|[mac, watson, hon...|[mac, watson, hon...|
|  0.0|This spent 20 mon...|   386|this spent 20 mon...|[this, spent, 20,...|[spent, 20, month...|
|  0.0|This re-named vin...|   298|this renamed vine...|[this, renamed, v...|[renamed, vineyar...|
|  0.0|The producer sour...|   307|the producer sour...|[the, producer, s...|[producer, source...|
|  0.0|From 18-year-old ...|   260|from 18yearold vi...|[from, 18yearold,...|[18yearold, vines...|
|  0.0|A standout even i...|   289|a standout even i...|[a, standout, eve...|[standout, even, ...|
|  0.0|Wit

In [110]:
# size of the dataset
print(refined_text_df.count(),len(refined_text_df.columns))

4000 6


In [111]:
# Schema of the dataset with 6 columns; length, Description_nopunct, tokens and refined_tokens were added during data preparation
refined_text_df.printSchema()

root
 |-- Label: float (nullable = true)
 |-- Description: string (nullable = true)
 |-- length: integer (nullable = true)
 |-- Description_nopunct: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- refined_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [112]:
# data distribution
# validate the number of wine reviews for each Origin.
refined_text_df.groupBy('Label').count().show()

+-----+-----+
|Label|count|
+-----+-----+
|  1.0| 2000|
|  0.0| 2000|
+-----+-----+



The US and Non-US wine review counts are the same (2000 each), so we are dealing with a balanced dataset: both Origin (Label) classes have the same number of reviews.

### Comments on transformations required
Since we are now dealing with tokens instead of entire review, it would make more sense to capture a number of tokens in each review rather than using the length of the review. We create another column (token_count) that gives the number of tokens in each row.

In [113]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
#explicit imports only: a wildcard import from pyspark.sql.functions would shadow Python built-ins such as sum, min and max

In [114]:
len_udf = udf(lambda s: len(s), IntegerType())
refined_text_df = refined_text_df.withColumn("token_count", len_udf(col('refined_tokens')))

In [115]:
refined_text_df.orderBy(rand()).show(10)

+-----+--------------------+------+--------------------+--------------------+--------------------+-----------+
|Label|         Description|length| Description_nopunct|              tokens|      refined_tokens|token_count|
+-----+--------------------+------+--------------------+--------------------+--------------------+-----------+
|  0.0|Intense aromatics...|   302|intense aromatics...|[intense, aromati...|[intense, aromati...|         32|
|  1.0|Aged in wood, thi...|   349|aged in wood this...|[aged, in, wood, ...|[aged, wood, wine...|         33|
|  0.0|Named after a tra...|   293|named after a tra...|[named, after, a,...|[named, trail, gr...|         30|
|  1.0|Lifted honeysuckl...|   279|lifted honeysuckl...|[lifted, honeysuc...|[lifted, honeysuc...|         27|
|  0.0|This has a dark r...|   364|this has a dark r...|[this, has, a, da...|[dark, redblack, ...|         35|
|  1.0|Blackberry, cassi...|   265|blackberry cassis...|[blackberry, cass...|[blackberry, cass...|         26|
|

In [116]:
refined_text_df.printSchema()

root
 |-- Label: float (nullable = true)
 |-- Description: string (nullable = true)
 |-- length: integer (nullable = true)
 |-- Description_nopunct: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- refined_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- token_count: integer (nullable = true)



In [117]:
refined_text_df.groupby('Label').count().show()

+-----+-----+
|Label|count|
+-----+-----+
|  1.0| 2000|
|  0.0| 2000|
+-----+-----+



*Feature Vectorization using Count Vectorizer to convert text into numerical features*

In [118]:
from pyspark.ml.feature import CountVectorizer
count_vec = CountVectorizer(inputCol='refined_tokens',outputCol='features')

In [119]:
cv_text_df = count_vec.fit(refined_text_df).transform(refined_text_df)

In [120]:
cv_text_df.select(['refined_tokens','token_count','features','Label']).show(10)

+--------------------+-----------+--------------------+-----+
|      refined_tokens|token_count|            features|Label|
+--------------------+-----------+--------------------+-----+
|[tremendous, 100,...|         36|(8239,[0,2,4,5,8,...|  0.0|
|[mac, watson, hon...|         30|(8239,[0,1,25,29,...|  0.0|
|[spent, 20, month...|         43|(8239,[1,3,4,5,6,...|  0.0|
|[renamed, vineyar...|         28|(8239,[0,6,7,22,5...|  0.0|
|[producer, source...|         30|(8239,[1,4,9,30,3...|  0.0|
|[18yearold, vines...|         27|(8239,[0,1,7,10,3...|  0.0|
|[standout, even, ...|         29|(8239,[1,4,11,27,...|  0.0|
|[sophisticated, m...|         40|(8239,[6,7,8,10,1...|  0.0|
|[first, made, 200...|         27|(8239,[1,7,19,92,...|  0.0|
|[blockbuster, pow...|         26|(8239,[0,2,16,20,...|  0.0|
+--------------------+-----------+--------------------+-----+
only showing top 10 rows
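
Each `features` value above is a sparse vector: the first number (8239) is the vocabulary size, followed by the indices of the terms present in the review and their counts. The sketch below shows the idea in plain Python on a toy corpus. The function names are hypothetical, and the alphabetical tie-breaking is an assumption made here so the result is deterministic.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Index terms by corpus frequency, most frequent first (ties alphabetical)."""
    counts = Counter(term for doc in docs for term in doc)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {term: idx for idx, term in enumerate(ordered)}

def to_sparse_counts(doc, vocab):
    """Return (vocab_size, indices, counts), mirroring the sparse vector display."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    indices = sorted(counts)
    return len(vocab), indices, [float(counts[i]) for i in indices]

docs = [
    ["wine", "cherry", "oak", "wine"],
    ["wine", "cherry", "tannins"],
]
vocab = fit_vocabulary(docs)
vector = to_sparse_counts(docs[0], vocab)
print(vocab)
print(vector)
```

Because frequent terms get the lowest indices, small index values such as 0, 1 and 2 show up in almost every row of the output.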



### Comment on required data transformation needed to get the data ready for input to the ML algorithm of our choice.  

We now vectorise the features with TF-IDF to convert the text into numerical features, because the model works on numerical data rather than strings. For this we use HashingTF followed by IDF, which transform each token list into a vector.

## The data transformation. Use TF-IDF method of transforming token to their respective numeric values. 

In [121]:
from pyspark.ml.feature import HashingTF,IDF

In [122]:
hashing_vec = HashingTF(inputCol='refined_tokens',outputCol='tf_features')

In [123]:
hashing_df = hashing_vec.transform(refined_text_df)

In [124]:
hashing_df.select(['refined_tokens','token_count','tf_features','label']).show(4,False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|refined_tokens                                                                                                                                                             

Here, each tf_features value is a sparse vector with three parts.

262144 - This is the size of each tf_features vector, that is, the number of hash buckets available (the HashingTF default).

[40266,43897,56749,57341,57508,61157,64289,88398,90748,93969,98087,99270,104786,131881,140784,145378,
146009,148957,191337,203214,209402,223329,225667,229305,235618,238819,238835,244...] is the list of bucket indices that the terms of the review were hashed to by HashingTF. These are positions in the vector, not scores.
[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
1.0,1.0,1.0,1.0,1.0,1.0,1.0,..] is the term frequency at each of those indices, that is, the number of times that term appeared in the document.
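
The hashing trick behind these vectors can be sketched in a few lines of plain Python: each token is hashed to a bucket index in the range 0 to 262143, and the vector stores how many tokens landed in each bucket. This is only an illustration. It uses `zlib.crc32` as a stand-in hash, whereas Spark uses MurmurHash3, so the bucket numbers will not match the notebook output. The function name `hashing_tf` is hypothetical.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the vector size shown in the output

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket via a hash and count the hits per bucket."""
    # crc32 is an assumption for illustration; Spark uses MurmurHash3.
    buckets = Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in tokens)
    indices = sorted(buckets)
    return num_features, indices, [float(buckets[i]) for i in indices]

size, indices, values = hashing_tf(["cherry", "oak", "cherry", "tannins"])
print(size, indices, values)
```

Unlike CountVectorizer, no vocabulary has to be fitted, at the cost of possible collisions when two different terms hash to the same bucket.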

In [125]:
tf_idf_vec = IDF(inputCol = 'tf_features', outputCol = 'tf_idf_features')

In [126]:
tf_idf_df = tf_idf_vec.fit(hashing_df).transform(hashing_df)

In [127]:
tf_idf_df.select(['tf_idf_features']).show(4, False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Here, 262144 is again the vector size (the default number of features).

[15519,23762,25381,29945,32292,42080,43870,69299,78216,78329,79846,81046,84547,85829,88244,89402,90757,98627,107621,130598,130631,131881,14101....  ] is the list of indices of the terms present in the review. The values that follow the indices in each vector are the TF-IDF scores.
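
To make the TF-IDF weighting concrete, here is a small sketch on a toy corpus. It uses the smoothed formula log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term, which is the formula Spark's IDF documents. For readability the sketch keys the scores by term rather than by hashed index, and the function names are hypothetical.

```python
import math

def idf_weights(docs):
    """Smoothed IDF per term: log((m + 1) / (df + 1)), m = number of documents."""
    m = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

def tf_idf(doc, idf):
    """Multiply each term's raw count in the document by its IDF weight."""
    tf = {}
    for term in doc:
        tf[term] = tf.get(term, 0) + 1
    return {term: count * idf[term] for term, count in tf.items()}

docs = [
    ["wine", "cherry", "oak"],
    ["wine", "cherry"],
    ["wine", "tannins"],
]
idf = idf_weights(docs)
scores = tf_idf(docs[0], idf)
print(scores)
```

A term that appears in every document, like `wine` here, gets a weight of zero, so it contributes nothing to telling the classes apart.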


In [128]:
tf_idf_df.printSchema()

root
 |-- Label: float (nullable = true)
 |-- Description: string (nullable = true)
 |-- length: integer (nullable = true)
 |-- Description_nopunct: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- refined_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- token_count: integer (nullable = true)
 |-- tf_features: vector (nullable = true)
 |-- tf_idf_features: vector (nullable = true)



## Application of the ML algorithm of our choice. Use an 80/20 split for creating training and test dataset. 

In [129]:
from pyspark.ml.feature import VectorAssembler

In [130]:
#select data for building model
model_text_df = tf_idf_df.select(['tf_idf_features','token_count','Label'])

In [131]:
#Once we've feature vector, we can use VectorAssembler to create input features for the machine learning model.
df_assembler = VectorAssembler(inputCols = ['tf_idf_features'],outputCol = 'tf_idf_features_vec')
model_text_df = df_assembler.transform(model_text_df)

In [132]:
model_text_df.printSchema()

root
 |-- tf_idf_features: vector (nullable = true)
 |-- token_count: integer (nullable = true)
 |-- Label: float (nullable = true)
 |-- tf_idf_features_vec: vector (nullable = true)



We can see that the assembled column, tf_idf_features_vec, is a vector datatype, which is what the model expects as input.

We can use any of the classification models on this data, but we
proceed with training the Logistic Regression Model.

In [133]:
from pyspark.ml.classification import LogisticRegression

In [134]:
#split the data 
training_df,test_df = model_text_df.randomSplit([0.80,0.20])

To validate the presence of enough records for both classes in the train
and test dataset, we can apply the groupBy function on the Label column.

### Exploratory Data Analysis

In [135]:
training_df.groupBy('Label').count().show()

+-----+-----+
|Label|count|
+-----+-----+
|  1.0| 1568|
|  0.0| 1594|
+-----+-----+



In [136]:
test_df.groupBy('Label').count().show()

+-----+-----+
|Label|count|
+-----+-----+
|  1.0|  432|
|  0.0|  406|
+-----+-----+



Both classes are well represented in each split: 1568 and 1594 records in the training set, 432 and 406 in the test set.
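
The split sizes above (3162 and 838 rows) are not exactly 3200 and 800 because the split is random rather than a cut at a fixed position. The sketch below imitates that behaviour in plain Python by giving every row an independent random draw. It is an illustration of the idea under that assumption, not Spark's implementation, and `random_split` and the seed value are hypothetical.

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(4000))  # stand-in for the 4000 reviews
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

In PySpark, `randomSplit` also accepts a `seed` argument, which makes the split, and therefore the metrics that follow, reproducible between runs.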

## Train the model using the training dataset. 

In [137]:
# build and train the logistic regression model using tf_idf_features_vec as the features column and Label as the label column.
log_reg = LogisticRegression(featuresCol='tf_idf_features_vec',labelCol='Label').fit(training_df)

In [138]:
training_summary = log_reg.summary
print("Area Under ROC:" + str(training_summary.areaUnderROC))
print("Weighted Accuracy:" + str(training_summary.accuracy))
print("Weighted Recall:" + str(training_summary.weightedRecall))
print("Weighted Precision:" + str(training_summary.weightedPrecision))
print("Weighted F1 Measure:" + str(training_summary.weightedFMeasure()))

Area Under ROC:0.9999799951348168
Weighted Accuracy:1.0
Weighted Recall:1.0
Weighted Precision:1.0
Weighted F1 Measure:1.0


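AUC should ideally be close to 1. Here it is almost 1, and accuracy, recall, precision and F1 are all exactly 1.0.
Perfect scores on the training data mean the model fits the training set completely; they do not tell us how well it generalises, so the real check is the test set.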


In [139]:
training_results = log_reg.evaluate(training_df).predictions

In [140]:
training_results.show()

+--------------------+-----------+-----+--------------------+--------------------+--------------------+----------+
|     tf_idf_features|token_count|Label| tf_idf_features_vec|       rawPrediction|         probability|prediction|
+--------------------+-----------+-----+--------------------+--------------------+--------------------+----------+
|(262144,[14,571,1...|         39|  0.0|(262144,[14,571,1...|[51.9015445526534...|[1.0,2.8803522226...|       0.0|
|(262144,[14,571,5...|         25|  1.0|(262144,[14,571,5...|[-21.684342961767...|[3.82480684433688...|       1.0|
|(262144,[14,571,6...|         50|  1.0|(262144,[14,571,6...|[-24.665158997547...|[1.94114309808708...|       1.0|
|(262144,[14,1322,...|         33|  1.0|(262144,[14,1322,...|[-19.851282255683...|[2.39165017831841...|       1.0|
|(262144,[14,5460,...|         26|  1.0|(262144,[14,5460,...|[-22.259552681921...|[2.15178611367380...|       1.0|
|(262144,[14,6053,...|         29|  0.0|(262144,[14,6053,...|[26.7935903911899..

Label and prediction have the same value in every row shown: on the training data the model predicts the known label.
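
The `rawPrediction` and `probability` columns are linked by the logistic (sigmoid) function. The sketch below applies it to the first `rawPrediction` entry of the first row in the table above, using only the digits visible in the truncated output. The numbers it produces are consistent with the `probability` column shown there. The function and variable names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

raw_class0 = 51.9015445526534  # visible digits of the first rawPrediction entry
p_class0 = sigmoid(raw_class0)   # probability of label 0.0 (US)
p_class1 = sigmoid(-raw_class0)  # probability of label 1.0 (Non-US)
prediction = 0.0 if p_class0 >= 0.5 else 1.0
print(p_class0, p_class1, prediction)
```

Raw scores this far from zero mean the model is extremely confident on the training rows, which fits the perfect training metrics reported earlier.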

## Test/Evaluate your trained model with test dataset. 

In [141]:
results = log_reg.evaluate(test_df).predictions

In [142]:
results.select('Label','prediction','probability').show(10)

+-----+----------+--------------------+
|Label|prediction|         probability|
+-----+----------+--------------------+
|  0.0|       0.0|[0.99999926990056...|
|  0.0|       0.0|[0.99982194774050...|
|  0.0|       0.0|[0.99977667235556...|
|  0.0|       0.0|[0.99999999848050...|
|  1.0|       1.0|[1.34690587129721...|
|  0.0|       0.0|[0.99999999999997...|
|  1.0|       1.0|[1.76636788061176...|
|  0.0|       0.0|[0.99999994150066...|
|  0.0|       0.0|[0.71312859829675...|
|  1.0|       1.0|[1.01037913321151...|
+-----+----------+--------------------+
only showing top 10 rows



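Label and prediction are the same in these ten rows, so the model classifies them correctly. Ten rows are only a sample, though.

The true measure is the set of evaluation metrics computed over the whole test set, starting with the confusion matrix.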

In [143]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [144]:
#confusion matrix
true_postives = results[(results.Label == 1) & (results.prediction == 1)].count()
true_negatives = results[(results.Label == 0) & (results.prediction == 0)].count()
false_positives = results[(results.Label == 0) & (results.prediction == 1)].count()
false_negatives = results[(results.Label == 1) & (results.prediction == 0)].count()

In [145]:
print(true_postives, true_negatives)
print(false_positives, false_negatives)

391 360
46 41


The performance of the model seems reasonably good: it is able to
differentiate between US and Non-US wine reviews in most cases.

The majority of predictions are true positives and true negatives (391 + 360 = 751 of 838);
the remaining 87 are false positives and false negatives.

This implies the reviews are being differentiated well.


In [146]:
recall = float(true_postives)/(true_postives + false_negatives)
print(recall)

0.9050925925925926


In [147]:
precision = float(true_postives) / (true_postives + false_positives)
print(precision)

0.8947368421052632


In [148]:
accuracy = float((true_postives+true_negatives) /(results.count()))
print(accuracy)

0.89618138424821


In [149]:
F1_score = 2*((precision*recall)/(precision+recall))
F1_score

0.8998849252013807

In a classification task, a precision score of 1.0 means that every item labeled as belonging to a particular class does indeed belong to it, whereas a recall of 1.0 means that every item that truly belongs to that class was labeled as such.
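
These definitions can be checked against the confusion-matrix counts printed earlier (391 true positives, 360 true negatives, 46 false positives, 41 false negatives). The helper below is a minimal sketch in plain Python; the function name is hypothetical, and the positive class is label 1.0 (Non-US), as in the notebook's filters.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# Counts printed by the notebook for the test set.
metrics = classification_metrics(tp=391, tn=360, fp=46, fn=41)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```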

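Here we got good precision and recall values, which indicates the model is predicting well. The positive class is label 1.0 (Non-US). Precision is about 89%, which means that about 89% of the reviews predicted as Non-US really are Non-US. Recall is about 90.5%, which means the model finds about 90.5% of the Non-US reviews. Accuracy on the test set is about 90%, compared with 100% on the training set, so the model fits the training data more closely than unseen data. Overall, we can say the model is able to predict a wine's origin from its review description reasonably well.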