<a href="https://colab.research.google.com/github/MickDobbsKildavin2/firstrepo/blob/main/Lab5-my-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 5 for Big Data Programming
# Apache Spark Machine Learning using DataFrames in Google Colab




# 1.	Set up an Apache Spark instance in Google Colab

In [7]:
# Run once per Colab runtime: install Java 8, download and unpack Spark 3.0.2, install findspark.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz
!tar xf spark-3.0.2-bin-hadoop2.7.tgz
!pip install -q findspark

# Run once: point SPARK_HOME at the unpacked Spark directory so findspark can locate it.
import os
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"
import findspark
findspark.init()
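
A quick sanity check can catch a failed download or extraction early. The sketch below is not part of the lab: `spark_home_looks_valid` is a hypothetical helper, and it only assumes that a usable Spark directory contains a `bin` subdirectory.

```python
import os

def spark_home_looks_valid(env=None):
    """Return True if SPARK_HOME names a directory that contains bin/ (hypothetical helper)."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return False
    return os.path.isdir(os.path.join(spark_home, "bin"))

# In Colab, run this after the setup cell; False suggests the download or extraction failed.
print(spark_home_looks_valid())
```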


# 2.	Create a Spark session

In [8]:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()
spark


# 3.	Download the Adult dataset



In [30]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" -O sample_data/adult.data

--2023-03-26 12:33:00--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3974305 (3.8M) [application/x-httpd-php]
Saving to: ‘sample_data/adult.data’


2023-03-26 12:33:01 (7.03 MB/s) - ‘sample_data/adult.data’ saved [3974305/3974305]



# 4.	Import the Adult dataset into a dataframe and insert a screenshot of the df.show() command output:

In [90]:
# Read the headerless CSV, infer column types, and name the 15 columns.
# Note: as the output below shows, "Class" holds the country values and "salary" the income bracket.
df = spark.read.csv('sample_data/adult.data', header=False, sep=",", inferSchema=True)\
.toDF("age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","Class","salary")

Insert a screenshot of the df.show() command output:

In [91]:
df.show()

+---+-----------------+--------+-------------+-------------+--------------------+------------------+--------------+-------------------+-------+------------+------------+--------------+--------------+------+
|age|        workclass|  fnlwgt|    education|education-num|      marital-status|        occupation|  relationship|               race|    sex|capital-gain|capital-loss|hours-per-week|         Class|salary|
+---+-----------------+--------+-------------+-------------+--------------------+------------------+--------------+-------------------+-------+------------+------------+--------------+--------------+------+
| 39|        State-gov| 77516.0|    Bachelors|         13.0|       Never-married|      Adm-clerical| Not-in-family|              White|   Male|      2174.0|         0.0|          40.0| United-States| <=50K|
| 50| Self-emp-not-inc| 83311.0|    Bachelors|         13.0|  Married-civ-spouse|   Exec-managerial|       Husband|              White|   Male|         0.0|         0.0|   
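
The output above shows country names under Class and income brackets under salary, and each string value carries a leading space. The pure-Python sketch below (no Spark needed) replays the name-to-field mapping on a single record. `parse_record` and `SAMPLE_LINE` are illustrative names, and the raw line is an assumption reconstructed from the first displayed row.

```python
import csv
import io

# Column names exactly as passed to toDF() in the cell above.
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week", "Class", "salary"]

# Assumed raw record, rebuilt from the first df.show() row (fields separated by ", ").
SAMPLE_LINE = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
               "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")

def parse_record(line):
    """Split one CSV line on commas and pair each field with its column name."""
    fields = next(csv.reader(io.StringIO(line)))
    return dict(zip(COLUMNS, fields))

record = parse_record(SAMPLE_LINE)
print(repr(record["Class"]))   # 14th field of the record
print(repr(record["salary"]))  # 15th field of the record
```

If the income bracket is the intended prediction target, the label column in the later steps would be salary rather than Class. Spark's CSV reader also offers an `ignoreLeadingWhiteSpace` option that may be used to trim the leading spaces.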

# 5.	Spark ML estimators expect a single features column, so we need to vectorise the multiple numeric columns into one:

In [92]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler


In [93]:
# Combine the six numeric columns into a single vector column named "features".
vector_assembler=VectorAssembler(
    inputCols=["age","fnlwgt","education-num","capital-gain","capital-loss","hours-per-week"],
    outputCol="features")
df_temp=vector_assembler.transform(df)
df_temp.show(3)

+---+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-----+------------+------------+--------------+--------------+------+--------------------+
|age|        workclass|  fnlwgt| education|education-num|     marital-status|        occupation|  relationship|  race|  sex|capital-gain|capital-loss|hours-per-week|         Class|salary|            features|
+---+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-----+------------+------------+--------------+--------------+------+--------------------+
| 39|        State-gov| 77516.0| Bachelors|         13.0|      Never-married|      Adm-clerical| Not-in-family| White| Male|      2174.0|         0.0|          40.0| United-States| <=50K|[39.0,77516.0,13....|
| 50| Self-emp-not-inc| 83311.0| Bachelors|         13.0| Married-civ-spouse|   Exec-managerial|       Husband| White| Male|         0.0|         0.0|          13.0
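
Conceptually, the assembler concatenates the selected numeric values, in the order given by `inputCols`, into one vector per row. The following pure-Python sketch illustrates that idea on the first displayed row; `assemble` is a hypothetical stand-in and not the Spark implementation.

```python
# Same column order as inputCols in the VectorAssembler cell above.
INPUT_COLS = ["age", "fnlwgt", "education-num",
              "capital-gain", "capital-loss", "hours-per-week"]

def assemble(row, input_cols=INPUT_COLS):
    """Concatenate the selected numeric values, in order, into one list of floats."""
    return [float(row[name]) for name in input_cols]

# Values taken from the first row of the df.show() output above.
first_row = {"age": 39, "workclass": " State-gov", "fnlwgt": 77516.0,
             "education-num": 13.0, "capital-gain": 2174.0,
             "capital-loss": 0.0, "hours-per-week": 40.0}

print(assemble(first_row))  # [39.0, 77516.0, 13.0, 2174.0, 0.0, 40.0]
```

Non-numeric columns such as workclass are ignored here because they are not listed in `INPUT_COLS`.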

Drop the original columns and display only Class and features (add screenshot here):

In [94]:
df_temp=df_temp.drop("age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","salary")
df_temp.show(3)


+--------------+--------------------+
|         Class|            features|
+--------------+--------------------+
| United-States|[39.0,77516.0,13....|
| United-States|[50.0,83311.0,13....|
| United-States|[38.0,215646.0,9....|
+--------------+--------------------+
only showing top 3 rows



# 6.	The final data preparation step is to index the Class column so that it holds numeric rather than text values. Run the following command and display the Class, features and ClassIndex columns:

In [95]:
from pyspark.ml.feature import StringIndexer
l_indexer=StringIndexer(inputCol="Class", outputCol="ClassIndex")
# Fit the indexer on the reduced dataframe (Class + features) and add the ClassIndex column.
df = l_indexer.fit(df_temp).transform(df_temp)
df.show(10)

+--------------+--------------------+----------+
|         Class|            features|ClassIndex|
+--------------+--------------------+----------+
| United-States|[39.0,77516.0,13....|       0.0|
| United-States|[50.0,83311.0,13....|       0.0|
| United-States|[38.0,215646.0,9....|       0.0|
| United-States|[53.0,234721.0,7....|       0.0|
|          Cuba|[28.0,338409.0,13...|       9.0|
| United-States|[37.0,284582.0,14...|       0.0|
|       Jamaica|[49.0,160187.0,5....|      11.0|
| United-States|[52.0,209642.0,9....|       0.0|
| United-States|[31.0,45781.0,14....|       0.0|
| United-States|[42.0,159449.0,13...|       0.0|
+--------------+--------------------+----------+
only showing top 10 rows
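
By default, StringIndexer orders labels by descending frequency, so the most common label receives index 0.0 (United-States in the output above). A minimal pure-Python sketch of that ordering follows; `frequency_index` is a hypothetical name, the label counts are made up, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Descending count; ties broken alphabetically (assumption for this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up counts, chosen only to illustrate the ordering.
labels = [" United-States"] * 5 + [" Cuba"] * 2 + [" Jamaica"]
mapping = frequency_index(labels)
print(mapping)
```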



# 7.	Split your data into training and test datasets:

In [96]:
(trainingData,testData) = df.randomSplit([0.7,0.3])
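
Without a seed, each run produces a different split, so the accuracy figures will vary between runs. `randomSplit` accepts an optional `seed` argument for this purpose. The pure-Python sketch below illustrates the idea of a seeded per-row random split; `random_split` is a hypothetical helper and only approximates what Spark does.

```python
import random

def random_split(rows, train_fraction=0.7, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

# The same seed gives the same split; the sizes are only approximately 70/30.
print(train_a == train_b, len(train_a), len(test_a))
```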

# 8.	Decision Tree Classifier 
## Specify the DecisionTreeClassifier and train the model on your training dataset:


In [98]:
from pyspark.ml.classification import DecisionTreeClassifier 
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
trainingData

DataFrame[Class: string, features: vector, ClassIndex: double]

In [99]:
dt = DecisionTreeClassifier(labelCol="ClassIndex",featuresCol="features")
model = dt.fit(trainingData)

# 9.	Test your model with your test dataset: 

In [100]:
predictions = model.transform(testData)

In [101]:
predictions.select("prediction","ClassIndex").show(5)

+----------+----------+
|prediction|ClassIndex|
+----------+----------+
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
+----------+----------+
only showing top 5 rows



In [102]:
predictions.select("prediction","ClassIndex").show(15)

+----------+----------+
|prediction|ClassIndex|
+----------+----------+
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
+----------+----------+
only showing top 15 rows



# 10.	Run an evaluator function to show the accuracy of your model:

In [103]:
evaluator= MulticlassClassificationEvaluator(\
labelCol="ClassIndex", predictionCol="prediction",\
metricName="accuracy")
accuracy=evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Test Set accuracy = " +str(accuracy))

Test Error = 0.0965369
Test Set accuracy = 0.9034630707937481
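
Accuracy is the fraction of rows where the prediction equals the label. Because every prediction displayed above is 0.0, it may be worth comparing the figure with a baseline that always predicts the most frequent class. The sketch below uses made-up labels, and `accuracy` and `majority_baseline` are hypothetical helpers.

```python
from collections import Counter

def accuracy(predictions, labels):
    """Fraction of rows whose prediction matches the label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Made-up example: 9 of 10 rows belong to class 0.0.
labels = [0.0] * 9 + [2.0]
predictions = [0.0] * 10  # a model that always predicts class 0.0

print(accuracy(predictions, labels), majority_baseline(labels))
```

When the two numbers are close, the model has learned little beyond the class distribution.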


# 11.	Random Forest Classifier

## Specify the RandomForestClassifier, train the model on your training dataset, predict using your test dataset, and run an evaluator to test accuracy:


In [104]:
from pyspark.ml.classification import RandomForestClassifier 
rf=RandomForestClassifier(labelCol="ClassIndex",\
featuresCol="features",numTrees=10)
model=rf.fit(trainingData)
predictions=model.transform(testData)
predictions.select("prediction","ClassIndex").show(5)

+----------+----------+
|prediction|ClassIndex|
+----------+----------+
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
|       0.0|       2.0|
+----------+----------+
only showing top 5 rows



In [105]:
evaluator= \
MulticlassClassificationEvaluator(labelCol="ClassIndex",\
predictionCol="prediction",metricName="accuracy")
accuracy=evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Test Set accuracy = " +str(accuracy))

Test Error = 0.0962305
Test Set accuracy = 0.9037695372356727
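
A random forest combines the outputs of several decision trees (numTrees=10 above). As a simplified illustration of the ensemble idea, and not Spark's exact implementation, the sketch below averages hypothetical per-tree class-probability vectors and picks the class with the highest mean. `forest_predict` and the probability values are assumptions.

```python
def forest_predict(tree_probabilities):
    """Average per-tree class probabilities and return (predicted class, mean vector)."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    mean = [sum(tree[c] for tree in tree_probabilities) / n_trees
            for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: mean[c])
    return float(best), mean

# Hypothetical outputs of three trees over three classes.
trees = [[0.7, 0.2, 0.1],
         [0.6, 0.1, 0.3],
         [0.2, 0.3, 0.5]]

prediction, mean = forest_predict(trees)
print(prediction, mean)
```

The third tree favours class 2, but it is outvoted by the other two once the probabilities are averaged.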


# 12.	Naive Bayes Classifier
## Specify the NaiveBayes classifier, train the model on your training dataset, predict using your test dataset, and run an evaluator to test accuracy:


In [106]:
from pyspark.ml.classification import NaiveBayes 
nb=NaiveBayes(labelCol="ClassIndex",\
featuresCol="features",smoothing=1.0,\
modelType="multinomial")
model=nb.fit(trainingData)


In [107]:
predictions=model.transform(testData)
predictions.select("Class","ClassIndex",
"probability","prediction").show(5)

+-----+----------+--------------------+----------+
|Class|ClassIndex|         probability|prediction|
+-----+----------+--------------------+----------+
|    ?|       2.0|[4.434599E-317,2....|      31.0|
|    ?|       2.0|[0.0,6.7779769517...|      31.0|
|    ?|       2.0|[0.0,2.0968078097...|      31.0|
|    ?|       2.0|[0.0,6.7521230130...|      31.0|
|    ?|       2.0|[0.0,6.6394469474...|      31.0|
+-----+----------+--------------------+----------+
only showing top 5 rows



