## [Example notebook taken from the Spark documentation](https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes)

### Importing all the necessary libraries

In [28]:
import findspark
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

### Initializing the Spark session

In [29]:
findspark.init()
spark = SparkSession.builder.getOrCreate()

### Loading the dataset

_[Dataset taken from the Spark repository](https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt)_

In [30]:
data = spark.read.format("libsvm").load("/home/jovyan/work/sample_libsvm_data.txt")

### Splitting the dataset into train and test sets

The dataset is randomly split into two new sets. The *train* set receives about 60% of the original data and is used to build the classification model. The *test* set receives the remaining 40% or so and is used to evaluate the model. After the split, the number of instances in each set is printed.

In [31]:
# Randomly split the data 60/40; the fixed seed makes the split reproducible
train, test = data.randomSplit([0.6, 0.4], seed=1234)

print(f"Amount of instances in the train dataset: {train.count()}")
print(f"Amount of instances in the test dataset: {test.count()}")

Amount of instances in the train dataset: 58
Amount of instances in the test dataset: 42
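
The counts above (58 and 42) only approximate the requested 60/40 weights because each row is assigned to a set at random. The following is a minimal pure-Python sketch of that idea, not Spark's actual implementation; the helper `random_split` and the toy row list are hypothetical and exist only for illustration.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one bucket with probability proportional to its weight.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    total = sum(weights)
    # Cumulative upper bounds of each bucket on the [0, 1) interval.
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last bucket also catches any floating-point rounding leftovers.
            if u < upper or i == len(bounds) - 1:
                buckets[i].append(row)
                break
    return buckets

# Toy data: 100 integer "rows" standing in for the 100 instances of the dataset.
train_rows, test_rows = random_split(list(range(100)), [0.6, 0.4], seed=1234)
print(len(train_rows), len(test_rows))  # sizes sum to 100, usually not exactly 60 and 40
```

Because every row is drawn independently, the same seed gives the same split, while the bucket sizes fluctuate around the requested proportions.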


### Creating and training a multiclass classifier with the Naive Bayes algorithm

The classifier predicts which category a given instance belongs to, based on the instance's features. **Keep in mind that this algorithm *assumes* the features are independent of each other given the class. Additionally, the multinomial variant used here requires non-negative feature values**.

In [32]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(train)
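
To make the role of `smoothing=1.0` and the non-negativity requirement more concrete, here is a simplified pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It is an illustration under stated assumptions, not Spark's implementation: the functions `fit_multinomial_nb` and `raw_scores` and the toy data are hypothetical, and the class priors are left unsmoothed for brevity, which may differ from what Spark does.

```python
import math
from collections import defaultdict

def fit_multinomial_nb(samples, num_features, smoothing=1.0):
    """Fit a toy multinomial Naive Bayes model.

    samples: list of (label, feature_values) pairs with non-negative values.
    Returns {label: (log_prior, [log_likelihood per feature])}.
    """
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * num_features)
    for label, features in samples:
        if any(v < 0 for v in features):
            raise ValueError("multinomial Naive Bayes needs non-negative features")
        class_counts[label] += 1
        for j, v in enumerate(features):
            feature_sums[label][j] += v

    n = len(samples)
    model = {}
    for label, count in class_counts.items():
        log_prior = math.log(count / n)  # unsmoothed prior (simplification)
        # Smoothing adds a pseudo-count to every feature so that no
        # likelihood is zero, even for features never seen in this class.
        total = sum(feature_sums[label]) + smoothing * num_features
        log_likelihood = [math.log((s + smoothing) / total) for s in feature_sums[label]]
        model[label] = (log_prior, log_likelihood)
    return model

def raw_scores(model, features):
    """Log-scale score per class: log prior + sum of value * log likelihood."""
    return {
        label: log_prior + sum(v * ll for v, ll in zip(features, log_likelihood))
        for label, (log_prior, log_likelihood) in model.items()
    }

# Hypothetical toy data: two classes, three count-like features.
toy = [(0.0, [3, 0, 1]), (0.0, [2, 1, 0]), (1.0, [0, 4, 1]), (1.0, [1, 3, 2])]
toy_model = fit_multinomial_nb(toy, num_features=3, smoothing=1.0)
scores = raw_scores(toy_model, [2, 0, 1])
predicted = max(scores, key=scores.get)
print(predicted)  # 0.0
```

The per-feature scores are simply summed, which is where the independence assumption enters: each feature contributes to the class score without regard to the others.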

### Running the classification model on the test dataset

Using what it learned during training, the model predicts which category each instance in the test dataset belongs to.

In [33]:
predictions = model.transform(test)

### Presenting predictions

The *label* column holds the actual class of each instance, and the *features* column holds its feature vector. The *rawPrediction* column holds a vector of *raw* prediction values, one per class. The *probability* column holds a vector with the probability of belonging to each class, and the *prediction* column holds the predicted class.

In [34]:
predictions.show()

+-----+--------------------+--------------------+-----------+----------+
|label|            features|       rawPrediction|probability|prediction|
+-----+--------------------+--------------------+-----------+----------+
|  0.0|(692,[95,96,97,12...|[-172664.79564650...|  [1.0,0.0]|       0.0|
|  0.0|(692,[98,99,100,1...|[-176279.15054306...|  [1.0,0.0]|       0.0|
|  0.0|(692,[122,123,124...|[-189600.55409526...|  [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|[-274673.88337431...|  [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|[-183393.03869049...|  [1.0,0.0]|       0.0|
|  0.0|(692,[125,126,127...|[-256992.48807619...|  [1.0,0.0]|       0.0|
|  0.0|(692,[126,127,128...|[-210411.53649773...|  [1.0,0.0]|       0.0|
|  0.0|(692,[127,128,129...|[-170627.63616681...|  [1.0,0.0]|       0.0|
|  0.0|(692,[127,128,129...|[-212157.96750469...|  [1.0,0.0]|       0.0|
|  0.0|(692,[127,128,129...|[-183253.80108550...|  [1.0,0.0]|       0.0|
|  0.0|(692,[128,129,130...|[-246528.93739632...|  
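
The table shows very large negative *rawPrediction* values next to probabilities of exactly `[1.0,0.0]`. One way to see how this can happen is to treat the raw values as log-scale scores and normalize them, as in the sketch below. This is an assumption-laden illustration rather than Spark's code: the function `raw_to_probability` is hypothetical, and the second score in the example is made up, since the table truncates the real one.

```python
import math

def raw_to_probability(raw):
    """Normalize log-scale scores into probabilities (log-sum-exp style)."""
    m = max(raw)
    # Subtracting the maximum keeps exp() from underflowing to 0 for every class.
    shifted = [math.exp(r - m) for r in raw]
    total = sum(shifted)
    return [s / total for s in shifted]

# Hypothetical scores: only the first value resembles the table; the second is invented.
print(raw_to_probability([-172664.8, -173500.0]))  # [1.0, 0.0]
print(raw_to_probability([-10.0, -10.0]))          # [0.5, 0.5]
```

When the scores differ by hundreds of log units, the smaller term underflows to zero after normalization, so the winning class gets a probability that prints as exactly 1.0.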

### Evaluation

A *MulticlassClassificationEvaluator* from MLlib is used to evaluate the model by computing its accuracy, i.e. the fraction of test instances whose predicted class matches the actual label.

In [35]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Model's accuracy is around {round(accuracy * 100, 2)}%")

Model's accuracy is around 100.0%
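
For reference, accuracy is simply the share of instances whose prediction equals the label. A minimal pure-Python sketch follows; the function `accuracy_score` and the toy lists are hypothetical and are not part of MLlib.

```python
def accuracy_score(labels, predictions):
    """Fraction of positions where the prediction matches the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical toy values: three of four predictions are correct.
toy_labels = [0.0, 0.0, 1.0, 1.0]
toy_predictions = [0.0, 1.0, 1.0, 1.0]
print(accuracy_score(toy_labels, toy_predictions))  # 0.75
```

An accuracy of 100% on a small, easily separable sample like this one says little about how the model would behave on harder data.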


### Closing the Spark session

In [36]:
spark.stop()