
# NLP Using PySpark

## Objective:
- The objective of this project is to create a <b>spam filter using a NaiveBayes classifier</b>.
- The required result is an <b>f1_score > 0.9</b>.
- We'll use the SMS Spam Collection dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
- Data is also provided for you in the assignment (you do not have to download it).

## To perform this task, follow these guiding steps:

### Create a spark session and import the required libraries

In [None]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=a091095d55b44acc181bf23cd09dede1a13136c406c809c722bb43eeae1def3a
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

In [None]:
spark = SparkSession.builder.appName('final_project').getOrCreate()

### Read the readme file to learn more about the data

### Read the data into a DataFrame

In [None]:
df = spark.read.csv('SMSSpamCollection.csv', sep='\t', header=False, inferSchema=True)

In [None]:
df.show()

+----+--------------------+
| _c0|                 _c1|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
|spam|FreeMsg Hey there...|
| ham|Even my brother i...|
| ham|As per your reque...|
|spam|WINNER!! As a val...|
|spam|Had your mobile 1...|
| ham|I'm gonna be home...|
|spam|SIX chances to wi...|
|spam|URGENT! You have ...|
| ham|I've been searchi...|
| ham|I HAVE A DATE ON ...|
|spam|XXXMobileMovieClu...|
| ham|Oh k...i'm watchi...|
| ham|Eh u remember how...|
| ham|Fine if thats th...|
|spam|England v Macedon...|
+----+--------------------+
only showing top 20 rows



### Print the schema

In [None]:
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)



### Rename the first column to 'class' and second column to 'text'

In [None]:
df = df.withColumnRenamed("_c0","class").withColumnRenamed("_c1","text")

In [None]:
df.printSchema()

root
 |-- class: string (nullable = true)
 |-- text: string (nullable = true)



### Show the first 10 rows from the dataframe
- Show once with truncate=True and once with truncate=False

In [None]:
df.show(10, truncate= True)

+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
| spam|FreeMsg Hey there...|
|  ham|Even my brother i...|
|  ham|As per your reque...|
| spam|WINNER!! As a val...|
| spam|Had your mobile 1...|
+-----+--------------------+
only showing top 10 rows



In [None]:
df.show(10, truncate= False)

+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|class|text                                                                                                                                                            |
+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ham  |Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                                 |
|ham  |Ok lar... Joking wif u oni...                                                                                                                                   |
|spam |Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075o

## Clean and Prepare the Data

### Create a new feature column containing the length of the text column

In [None]:
from pyspark.sql.functions import col, length

new_df = df.withColumn("length", length(col("text")))

### Show the new dataframe

In [None]:
new_df.show(truncate=False)

+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|class|text                                                                                                                                                                                                |length|
+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|ham  |Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                                                                     |111   |
|ham  |Ok lar... Joking wif u oni...                                                                                                                    

### Get the average text length for each class (give the average length column an alias)

In [None]:
new_df.createOrReplaceTempView("dataset")

In [None]:
# Aggregate on new_df: the original df has no 'length' column
new_df.groupBy("class").agg(avg(new_df["length"]).alias('Avg. Length')).show()

+-----+-----------------+
|class|      Avg. Length|
+-----+-----------------+
|  ham|71.45431945307645|
| spam|138.6706827309237|
+-----+-----------------+



In [None]:
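# Equivalent Spark SQL query against the "dataset" temp view
# (an alias containing spaces or dots is quoted with backticks in Spark SQL):
# df1 = spark.sql('select class, avg(length) as `Avg. Length` from dataset group by class')
# df1.show()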

## Feature Transformations

### In this part you transform your raw text into a TF-IDF model:
- For more information about TF-IDF, see the following link <b>(not needed for the test)</b>:
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

### Perform the following steps to obtain TF-IDF:
1. Import the required transformers/estimators for the subsequent steps.
2. Create a <b>Tokenizer</b> from the text column.
3. Create a <b>StopWordsRemover</b> to remove the <b>stop words</b> from the column obtained from the <b>Tokenizer</b>.
4. Create a <b>CountVectorizer</b> after removing the <b>stop words</b>.
5. Create the <b>TF-IDF</b> from the <b>CountVectorizer</b>.
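To make the steps above concrete, here is a minimal plain-Python sketch of the same computation on three toy messages. It is an illustration, not Spark's implementation: the `docs` list and the `STOP_WORDS` set are made up for this example, tokenization is a lowercase whitespace split (which is how Spark's `Tokenizer` behaves), and the IDF uses the smoothed formula given in the Spark ML documentation, log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t.

```python
import math
from collections import Counter

# Hypothetical toy data and stop-word list, for illustration only.
docs = ["Free entry to win a prize", "Are you free to talk", "Win a free prize now"]
STOP_WORDS = {"a", "to", "are", "you"}

# Steps 2-3: lowercase whitespace tokenization, then stop-word removal.
tokens = [[w for w in d.lower().split() if w not in STOP_WORDS] for d in docs]

# Step 4: term counts per document (CountVectorizer stores these as sparse vectors).
counts = [Counter(t) for t in tokens]

# Step 5: smoothed inverse document frequency, then TF-IDF = count * idf.
m = len(docs)
doc_freq = Counter(term for t in tokens for term in set(t))
idf = {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}
tf_idf = [{term: c * idf[term] for term, c in cnt.items()} for cnt in counts]

print(tf_idf[0])
```

A term that appears in every message, such as `free` here, gets an IDF of 0 and so contributes nothing to the weighted vector, while a term that appears in only one message gets the largest weight.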

In [None]:
from pyspark.ml.feature import StringIndexer, VectorAssembler, Tokenizer,StopWordsRemover, CountVectorizer, IDF
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from sklearn.metrics import confusion_matrix

- Convert the <b>class column</b> to a numeric index using <b>StringIndexer</b>.
- Create the feature column from the <b>TF-IDF</b> and <b>length</b> columns.

In [None]:
# 1- string indexer for the class name
stringIndexer = StringIndexer(inputCol='class',outputCol='label', handleInvalid='skip')
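By default `StringIndexer` assigns indices in order of descending label frequency, so the most common class gets 0.0. The plain-Python sketch below mimics that ordering on a made-up label list (the `labels` values are illustrative, not taken from the dataset); it is not the Spark implementation.

```python
from collections import Counter

# Made-up labels for illustration; ham is the more frequent class.
labels = ["ham", "ham", "spam", "ham", "spam", "ham"]

# Sort by descending frequency, breaking ties alphabetically.
ordered = sorted(Counter(labels).items(), key=lambda kv: (-kv[1], kv[0]))
label_to_index = {label: float(i) for i, (label, _) in enumerate(ordered)}

print(label_to_index)  # {'ham': 0.0, 'spam': 1.0}
```

Ham is also the majority class in the SMS data, so in this notebook a `label` of 0.0 stands for ham and 1.0 for spam.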

In [None]:
# 2- text stages: Tokenizer -> StopWordsRemover -> CountVectorizer -> IDF
tokenizer = Tokenizer(inputCol="text", outputCol="token")
stopRemover = StopWordsRemover(inputCol="token",outputCol="stop_token")
countVectorizer = CountVectorizer(inputCol='stop_token',outputCol='count_vec')
idf = IDF(inputCol="count_vec", outputCol="tf_idf")

## The Model
- Create a <b>NaiveBayes</b> classifier with the default parameters.

In [None]:
vecAssembler  = VectorAssembler(inputCols = ['tf_idf','length'], outputCol='features')

In [None]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
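For intuition about what `smoothing=1.0` does, here is a textbook multinomial Naive Bayes with additive (Laplace) smoothing in plain Python. It is a sketch under simplifying assumptions, not Spark's code: the training data and the `train_nb` and `predict` helpers are invented for this example, and the class prior is left unsmoothed.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, alpha=1.0):
    """samples: list of (label, tokens). Returns log priors and log likelihoods."""
    vocab = {w for _, toks in samples for w in toks}
    class_docs = Counter(label for label, _ in samples)
    word_counts = defaultdict(Counter)
    for label, toks in samples:
        word_counts[label].update(toks)
    log_prior = {c: math.log(n / len(samples)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # Additive smoothing: every vocabulary word gets alpha pseudo-counts.
        log_like[c] = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return log_prior, log_like

def predict(tokens, log_prior, log_like):
    """Return the class with the highest posterior log score."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Invented toy training set.
data = [("spam", ["win", "prize", "free"]), ("spam", ["free", "entry"]),
        ("ham", ["see", "you", "later"]), ("ham", ["call", "you", "later"])]
log_prior, log_like = train_nb(data)
print(predict(["free", "prize"], log_prior, log_like))
```

Without the pseudo-counts, a word that never occurred in a class during training would drive that class's probability to zero for any message containing it.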

## Pipeline
### Create a pipeline containing all the steps, from the Tokenizer to the NaiveBayes classifier.

In [None]:
stg = [stringIndexer,tokenizer, stopRemover, countVectorizer, idf, vecAssembler, nb]
pipeline = Pipeline(stages=stg)

### Split your data into train and test sets with ratios 0.7 and 0.3, respectively.

In [None]:
train, test = new_df.randomSplit([0.7, 0.3], seed = 42)
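`randomSplit` samples rows at random, so the 0.7/0.3 weights are target proportions and the resulting set sizes are only approximately 70% and 30%. The idea can be sketched in plain Python by drawing one uniform number per row; the `split_rows` helper is hypothetical, and this is not how Spark implements the split internally.

```python
import random

def split_rows(rows, train_weight=0.7, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_part, test_part = [], []
    for row in rows:
        (train_part if rng.random() < train_weight else test_part).append(row)
    return train_part, test_part

train_rows, test_rows = split_rows(list(range(1000)))
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two sets, and fixing the seed makes the draw in this sketch repeatable.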

### Fit your Pipeline model to the training data

In [None]:
model = pipeline.fit(train)

### Perform predictions on the test DataFrame

In [None]:
pred = model.transform(test)

### Print the schema of the prediction dataframe

In [None]:
pred.printSchema()

root
 |-- class: string (nullable = true)
 |-- text: string (nullable = true)
 |-- length: integer (nullable = true)
 |-- label: double (nullable = false)
 |-- token: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- stop_token: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- count_vec: vector (nullable = true)
 |-- tf_idf: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



## Model Evaluation
- Use <b>MulticlassClassificationEvaluator</b> to calculate the <b>f1_score</b>.

In [None]:
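# The evaluator's default metric is F1 (metricName="f1"), so the value is an F1 score, not accuracy
f1_eval = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
f1 = f1_eval.evaluate(pred)
print(f"f1_score : {f1}")

f1_score : 0.9727502290227267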


In [None]:
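# Collect both columns in one pass so labels and predictions stay aligned row by row
rows = pred.select("label", "prediction").collect()
y_orig = [row["label"] for row in rows]
y_pred = [row["prediction"] for row in rows]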

cm = confusion_matrix(y_orig, y_pred)
print(f"Confusion Matrix: \n {cm}")

Confusion Matrix: 
 [[1352   29]
 [  15  197]]
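With its default settings, `MulticlassClassificationEvaluator` reports the weighted F1 score, and the value printed by the evaluator cell can be cross-checked from the confusion matrix above. The sketch below recomputes it in plain Python, assuming scikit-learn's layout (rows are true labels, columns are predictions) and that index 0 is ham and index 1 is spam; the variable names are chosen for this example.

```python
# Confusion matrix from the output above: rows = true label, columns = prediction.
cm_values = [[1352, 29],
             [15, 197]]

n_classes = len(cm_values)
total = sum(sum(row) for row in cm_values)
weighted_f1 = 0.0
for k in range(n_classes):
    support = sum(cm_values[k])                               # rows whose true class is k
    tp = cm_values[k][k]
    fn = support - tp                                         # true class k, predicted otherwise
    fp = sum(cm_values[r][k] for r in range(n_classes)) - tp  # predicted k, true class differs
    f1 = 2 * tp / (2 * tp + fp + fn)
    weighted_f1 += f1 * support / total                       # weight each class by its support

accuracy = sum(cm_values[k][k] for k in range(n_classes)) / total
print(f"weighted F1: {weighted_f1:.6f}, accuracy: {accuracy:.6f}")
```

The weighted F1 comes out at about 0.97275, which agrees with the evaluator's value, while plain accuracy for the same matrix is about 0.97238. The per-class F1 for spam is lower (about 0.90) than for ham, which the weighted average partly hides because spam is the minority class.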


# GOOD LUCK
<b><font color='GREEN'>AI-PRO Spark Team ITI</font></b>
