
# Lab 5 for Big Data programming.
# Apache Spark Machine Learning using Dataframes in Google Colab




# 1.	Set up an Apache Spark instance in Google Colab

In [1]:
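# Run once per Colab runtime: install Java 8, download and unpack Spark 3.0.2, install findspark.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz
!tar xf spark-3.0.2-bin-hadoop2.7.tgz
!pip install -q findspark

# Run once: point findspark at the unpacked Spark distribution so that pyspark can be imported.
import os
import findspark

os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"
findspark.init()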


# 2.	Create a Spark session

In [20]:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()
spark


# 3.	Download the Iris dataset and another dataset of your choosing



In [21]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O sample_data/iris.data

--2023-03-26 11:45:23--  https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4551 (4.4K) [application/x-httpd-php]
Saving to: ‘sample_data/iris.data’


2023-03-26 11:45:23 (89.4 MB/s) - ‘sample_data/iris.data’ saved [4551/4551]



# 4.	Import the Iris dataset into a dataframe and insert a screenshot of the df.show() command output:

In [22]:
# iris.data has no header row, so the schema is inferred and the column names are assigned with toDF().
df = spark.read.csv('sample_data/iris.data', header=False, sep=",", inferSchema=True)\
.toDF("SepalLength","SepalWidth","PetalLength","PetalWidth","Class")

Insert a screenshot of the df.show() command output:

In [23]:
df.show()

+-----------+----------+-----------+----------+-----------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|      Class|
+-----------+----------+-----------+----------+-----------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
|        5.0|       3.6|        1.4|       0.2|Iris-setosa|
|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
|        5.0|       3.4|        1.5|       0.2|Iris-setosa|
|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
|        5.4|       3.7|        1.5|       0.2|Iris-setosa|
|        4.8|       3.4|        1.6|       0.2|Iris-setosa|
|        4.8|       3.0|        1.4|       0.1|Iris-setosa|
|        4.3|       3.0|        1.1|    

# 5.	Spark ML estimators expect a single features column, so the four measurement columns must be assembled into one vector column:

In [24]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler


In [25]:
vector_assembler = VectorAssembler(
    inputCols=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"],
    outputCol="features")
df_temp = vector_assembler.transform(df)
df_temp.show(3)

+-----------+----------+-----------+----------+-----------+-----------------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|      Class|         features|
+-----------+----------+-----------+----------+-----------+-----------------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|[4.7,3.2,1.3,0.2]|
+-----------+----------+-----------+----------+-----------+-----------------+
only showing top 3 rows
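
As a rough illustration of what the assembling step does, the sketch below packs the four measurement values of each row into one list. It is plain Python rather than Spark code, and the `assemble` helper and the `rows` list are hypothetical names introduced only for this example; Spark's `VectorAssembler` works on DataFrame columns and produces a vector type instead of a Python list.

```python
# Plain-Python illustration only; not how Spark implements VectorAssembler.
rows = [
    {"SepalLength": 5.1, "SepalWidth": 3.5, "PetalLength": 1.4,
     "PetalWidth": 0.2, "Class": "Iris-setosa"},
    {"SepalLength": 4.9, "SepalWidth": 3.0, "PetalLength": 1.4,
     "PetalWidth": 0.2, "Class": "Iris-setosa"},
]
input_cols = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]


def assemble(row, input_cols, output_col="features"):
    """Return a copy of row with the input columns packed into one list."""
    assembled = dict(row)
    assembled[output_col] = [float(row[col]) for col in input_cols]
    return assembled


assembled_rows = [assemble(row, input_cols) for row in rows]
for row in assembled_rows:
    print(row["Class"], row["features"])
```

The original columns are still present after assembling, which is why the next cell drops them.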



Drop the original feature columns and just display Class & features - add screenshot here:

In [26]:
df_temp=df_temp.drop("SepalLength","SepalWidth","PetalLength","PetalWidth")
df_temp.show(3)


+-----------+-----------------+
|      Class|         features|
+-----------+-----------------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|
|Iris-setosa|[4.9,3.0,1.4,0.2]|
|Iris-setosa|[4.7,3.2,1.3,0.2]|
+-----------+-----------------+
only showing top 3 rows



# 6.	The final data preparation step is to index the Class column, replacing its text values with numeric ones. Run the following command and display the Class, features and ClassIndex columns of your output:

In [9]:
from pyspark.ml.feature import StringIndexer
l_indexer=StringIndexer(inputCol="Class", outputCol="ClassIndex")
df = l_indexer.fit(df).transform(df)

In [10]:
df.show(10)

+-----------+----------+-----------+----------+-----------+----------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|      Class|ClassIndex|
+-----------+----------+-----------+----------+-----------+----------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|       0.0|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|       0.0|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|       0.0|
|        4.6|       3.1|        1.5|       0.2|Iris-setosa|       0.0|
|        5.0|       3.6|        1.4|       0.2|Iris-setosa|       0.0|
|        5.4|       3.9|        1.7|       0.4|Iris-setosa|       0.0|
|        4.6|       3.4|        1.4|       0.3|Iris-setosa|       0.0|
|        5.0|       3.4|        1.5|       0.2|Iris-setosa|       0.0|
|        4.4|       2.9|        1.4|       0.2|Iris-setosa|       0.0|
|        4.9|       3.1|        1.5|       0.1|Iris-setosa|       0.0|
+-----------+----------+-----------+----------+-----------+----------+
only showing top 10 rows

The output above has no features column because the indexer was fitted on df. Fitting it on df_temp instead keeps the assembled features column:

In [29]:
df = l_indexer.fit(df_temp).transform(df_temp)

In [30]:
df.show()

+-----------+-----------------+----------+
|      Class|         features|ClassIndex|
+-----------+-----------------+----------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|       0.0|
|Iris-setosa|[4.9,3.0,1.4,0.2]|       0.0|
|Iris-setosa|[4.7,3.2,1.3,0.2]|       0.0|
|Iris-setosa|[4.6,3.1,1.5,0.2]|       0.0|
|Iris-setosa|[5.0,3.6,1.4,0.2]|       0.0|
|Iris-setosa|[5.4,3.9,1.7,0.4]|       0.0|
|Iris-setosa|[4.6,3.4,1.4,0.3]|       0.0|
|Iris-setosa|[5.0,3.4,1.5,0.2]|       0.0|
|Iris-setosa|[4.4,2.9,1.4,0.2]|       0.0|
|Iris-setosa|[4.9,3.1,1.5,0.1]|       0.0|
|Iris-setosa|[5.4,3.7,1.5,0.2]|       0.0|
|Iris-setosa|[4.8,3.4,1.6,0.2]|       0.0|
|Iris-setosa|[4.8,3.0,1.4,0.1]|       0.0|
|Iris-setosa|[4.3,3.0,1.1,0.1]|       0.0|
|Iris-setosa|[5.8,4.0,1.2,0.2]|       0.0|
|Iris-setosa|[5.7,4.4,1.5,0.4]|       0.0|
|Iris-setosa|[5.4,3.9,1.3,0.4]|       0.0|
|Iris-setosa|[5.1,3.5,1.4,0.3]|       0.0|
|Iris-setosa|[5.7,3.8,1.7,0.3]|       0.0|
|Iris-setosa|[5.1,3.8,1.5,0.3]|       0.0|
+----------
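
To see why Iris-setosa receives index 0.0, here is a plain-Python sketch of the ordering that `StringIndexer` is understood to use by default: most frequent label first, with ties broken alphabetically (the tie-breaking rule is an assumption based on Spark 3.x behaviour). The `fit_string_index` helper is a hypothetical name for this example, not part of PySpark.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to a float index: most frequent first, ties alphabetical."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


# The Iris dataset has 50 rows per class, so every count is tied.
labels = (["Iris-setosa"] * 50
          + ["Iris-versicolor"] * 50
          + ["Iris-virginica"] * 50)
mapping = fit_string_index(labels)
print(mapping)
```

With equal class counts the alphabetical tie-break decides the order, so under this assumption Iris-versicolor maps to 1.0 and Iris-virginica to 2.0.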

# 7.	Split your data into training and test datasets:

In [36]:
(trainingData,testData) = df.randomSplit([0.7,0.3])
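
Because no seed is passed, each run of the cell above produces a different split, so the accuracy figures later in this notebook change from run to run. `randomSplit` accepts an optional `seed` argument, for example `df.randomSplit([0.7,0.3], seed=42)`, where 42 is an arbitrary choice. The plain-Python sketch below illustrates the idea of a seeded, weight-based split; the `random_split` helper is hypothetical and only approximates how Spark assigns rows.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by drawing a uniform random number."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            # The last split also catches floating-point rounding at the top end.
            if draw < upper or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits


rows = list(range(150))  # stand-in for the 150 Iris rows
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

The split sizes are only approximately 70/30 because each row is assigned independently, which is also why the test set in a given run need not contain exactly 45 rows.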

# 8.	Decision Tree Classifier 
## Specify the DecisionTreeClassifier and train the model on your training dataset:


In [37]:
from pyspark.ml.classification import DecisionTreeClassifier 
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


In [38]:
trainingData

DataFrame[Class: string, features: vector, ClassIndex: double]

In [39]:
dt = DecisionTreeClassifier(labelCol="ClassIndex",featuresCol="features")

In [40]:
model = dt.fit(trainingData)

# 9.	Test your model with your test dataset: 

In [41]:
predictions = model.transform(testData)

In [42]:
predictions.select("prediction","ClassIndex").show(5)

+----------+----------+
|prediction|ClassIndex|
+----------+----------+
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
+----------+----------+
only showing top 5 rows



Insert a screenshot here of the first 15 rows of data:

In [43]:
predictions.select("prediction","ClassIndex").show(15)

+----------+----------+
|prediction|ClassIndex|
+----------+----------+
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       1.0|       1.0|
|       1.0|       1.0|
|       1.0|       1.0|
+----------+----------+
only showing top 15 rows



# 10.	Run an evaluator function to show the accuracy of your model:

In [46]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="ClassIndex", predictionCol="prediction",
    metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Test Set accuracy = " + str(accuracy))

Test Error = 0.128205
Test Set accuracy = 0.8717948717948718
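
Accuracy is the fraction of test rows whose prediction equals the label, and the test error is one minus that fraction. The figures above are consistent with 34 correct predictions out of 39 test rows; the notebook does not show those counts, so they are an inference used here only for illustration. The `accuracy_of` helper and the `pairs` list are hypothetical.

```python
def accuracy_of(pairs):
    """Fraction of (prediction, label) pairs in which the two values agree."""
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)


# Assumed counts: 34 matching pairs and 5 mismatches (39 test rows in total).
pairs = [(0.0, 0.0)] * 34 + [(1.0, 2.0)] * 5
acc = accuracy_of(pairs)
print("Test Error = %g" % (1.0 - acc))
print("Test Set accuracy = " + str(acc))
```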


# 11.	Random Forest Classifier

## Specify the RandomForestClassifier, train the model on your training dataset, predict using your test dataset, and run an evaluator to test accuracy:


In [45]:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol="ClassIndex",
                            featuresCol="features", numTrees=10)
model = rf.fit(trainingData)
predictions = model.transform(testData)
predictions.select("prediction", "ClassIndex").show(5)

+----------+----------+
|prediction|ClassIndex|
+----------+----------+
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
+----------+----------+
only showing top 5 rows



In [47]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="ClassIndex",
    predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Test Set accuracy = " + str(accuracy))

Test Error = 0.128205
Test Set accuracy = 0.8717948717948718


# 12.	Naive Bayes Classifier
## Specify the NaiveBayes classifier, train the model on your training dataset, predict using your test dataset, and run an evaluator to test accuracy:


In [49]:
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes(labelCol="ClassIndex",
                featuresCol="features", smoothing=1.0,
                modelType="multinomial")
model = nb.fit(trainingData)


Training data and Test data were mis-labelled

In [52]:
predictions=model.transform(testData)
predictions.select("Class","ClassIndex",
"probability","prediction").show(5)

+-----------+----------+--------------------+----------+
|      Class|ClassIndex|         probability|prediction|
+-----------+----------+--------------------+----------+
|Iris-setosa|       0.0|[0.68897051141422...|       0.0|
|Iris-setosa|       0.0|[0.70846187545311...|       0.0|
|Iris-setosa|       0.0|[0.71866737524836...|       0.0|
|Iris-setosa|       0.0|[0.71866737524836...|       0.0|
|Iris-setosa|       0.0|[0.66660245059756...|       0.0|
+-----------+----------+--------------------+----------+
only showing top 5 rows



In [53]:
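# Note: this cell repeats the random forest from step 11 and overwrites the
# Naive Bayes model and predictions created above.
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol="ClassIndex",
                            featuresCol="features", numTrees=10)
model = rf.fit(trainingData)
predictions = model.transform(testData)
predictions.select("prediction", "ClassIndex").show(5)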

+----------+----------+
|prediction|ClassIndex|
+----------+----------+
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
+----------+----------+
only showing top 5 rows



In [54]:
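# predictions holds the random forest output at this point, so the accuracy
# printed below is for the random forest, not for the Naive Bayes model.
evaluator = MulticlassClassificationEvaluator(
    labelCol="ClassIndex",
    predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Test Set accuracy = " + str(accuracy))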

Test Error = 0.128205
Test Set accuracy = 0.8717948717948718
