# NLP Using PySpark

## Objective:
- The objective of this project is to create a <b>spam filter using a NaiveBayes classifier</b>.
- The model is required to reach an <b>f1_score > 0.9</b>.
- We'll use the SMS Spam Collection dataset from the UCI Machine Learning Repository:<br>
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

### Create a spark session and import the required libraries

In [1]:
import findspark 
findspark.init()
from pyspark.sql import SparkSession
import pyspark.sql.functions as pyfunc
spark = SparkSession.builder.getOrCreate()

### Read the data into a DataFrame

In [3]:
df = spark.read.format('csv').option('sep','\t').load('SMSSpamCollection')

In [4]:
df.show(10)

+----+--------------------+
| _c0|                 _c1|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
|spam|FreeMsg Hey there...|
| ham|Even my brother i...|
| ham|As per your reque...|
|spam|WINNER!! As a val...|
|spam|Had your mobile 1...|
+----+--------------------+
only showing top 10 rows



### Print the schema

In [5]:
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)



### Rename the first column to 'class' and second column to 'text'

In [6]:
df2 = df.withColumnRenamed("_c0","class").withColumnRenamed("_c1","text")

In [7]:
df2.printSchema()

root
 |-- class: string (nullable = true)
 |-- text: string (nullable = true)



### Show the first 10 rows from the dataframe

In [8]:
df2.show(10)

+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
| spam|FreeMsg Hey there...|
|  ham|Even my brother i...|
|  ham|As per your reque...|
| spam|WINNER!! As a val...|
| spam|Had your mobile 1...|
+-----+--------------------+
only showing top 10 rows



## Clean and Prepare the Data

### Create a new feature column containing the length of the text column

In [9]:
df3 = df2.withColumn("length", pyfunc.length(df2.text))

### Show the new dataframe

In [10]:
df3.show()

+-----+--------------------+------+
|class|                text|length|
+-----+--------------------+------+
|  ham|Go until jurong p...|   111|
|  ham|Ok lar... Joking ...|    29|
| spam|Free entry in 2 a...|   155|
|  ham|U dun say so earl...|    49|
|  ham|Nah I don't think...|    61|
| spam|FreeMsg Hey there...|   147|
|  ham|Even my brother i...|    77|
|  ham|As per your reque...|   160|
| spam|WINNER!! As a val...|   157|
| spam|Had your mobile 1...|   154|
|  ham|I'm gonna be home...|   109|
| spam|SIX chances to wi...|   136|
| spam|URGENT! You have ...|   155|
|  ham|I've been searchi...|   196|
|  ham|I HAVE A DATE ON ...|    35|
| spam|XXXMobileMovieClu...|   149|
|  ham|Oh k...i'm watchi...|    26|
|  ham|Eh u remember how...|    81|
|  ham|Fine if thats th...|    56|
| spam|England v Macedon...|   155|
+-----+--------------------+------+
only showing top 20 rows



### Get the average text length for each class (give an alias to the average length column)

In [11]:
df3.groupBy('class').agg(pyfunc.avg('length').alias('avg_length')).show()

+-----+-----------------+
|class|       avg_length|
+-----+-----------------+
|  ham|71.45431945307645|
| spam|138.6706827309237|
+-----+-----------------+



## Feature Transformations

### In this part, you transform the raw text into a TF-IDF representation:
- For more information about TF-IDF check the following link: <br>
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

### Perform the following steps to obtain TF-IDF:
1. Import the required transformers/estimators for the subsequent steps.
2. Create a <b>Tokenizer</b> from the text column.
3. Create a <b>StopWordsRemover</b> to remove the <b>stop words</b> from the column obtained from the <b>Tokenizer</b>.
4. Create a <b>CountVectorizer</b> after removing the <b>stop words</b>.
5. Create the <b>TF-IDF</b> from the <b>CountVectorizer</b>.
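
To make the steps above concrete, here is a minimal pure-Python sketch of the same tokenize, remove stop words, count, and weight sequence on a toy corpus. The corpus, the tiny stop-word set, and the helper name `tf_idf` are hypothetical and only for illustration. The IDF uses the smoothed form log((N + 1) / (df + 1)) that the Spark MLlib documentation gives for its `IDF` estimator.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "Free entry to win a prize",
    "Are you free tonight",
    "Win a free prize now",
]
STOP_WORDS = {"a", "to", "you", "are"}


def tf_idf(docs, stop_words):
    """Return one {term: weight} dict per document."""
    # Steps 2-3: lowercase, split on whitespace, drop stop words.
    tokenized = [
        [tok for tok in doc.lower().split() if tok not in stop_words]
        for doc in docs
    ]
    # Step 4: raw term counts per document (what a count vectorizer produces).
    counts = [Counter(tokens) for tokens in tokenized]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    # Step 5: smoothed IDF, log((N + 1) / (df + 1)).
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


weights = tf_idf(DOCS, STOP_WORDS)
for doc, w in zip(DOCS, weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

Because "free" occurs in every toy document, its weight comes out as zero, which is the effect IDF is meant to have on uninformative terms.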

In [12]:
from pyspark.ml.feature import Tokenizer,StopWordsRemover,CountVectorizer,StringIndexer,VectorAssembler

In [13]:
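# Split each message into lowercase, whitespace-delimited tokens.
tokenizer = Tokenizer(inputCol='text', outputCol='tokenized_text')
stopwordsremover = StopWordsRemover(inputCol='tokenized_text', outputCol='stop_removed')
# Note: CountVectorizer yields raw term counts; no IDF weighting is applied, despite the column name.
countvectorizer = CountVectorizer(inputCol='stop_removed', outputCol='TF_IDF')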

- Convert the <b>class column</b> to a numeric index using <b>StringIndexer</b>.
- Create the feature column from the <b>TF-IDF</b> and <b>length</b> columns using <b>VectorAssembler</b>.

In [14]:
stringIndexer = StringIndexer(inputCol='class',outputCol='class_index')

In [15]:
vectorassembler = VectorAssembler(inputCols=['TF_IDF','length'],outputCol='features')
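
By default, `StringIndexer` assigns indices by descending label frequency, so in this dataset the more common `ham` should map to 0.0 and `spam` to 1.0. The following is a small pure-Python sketch of that ordering rule. `index_labels` is a hypothetical helper, not part of PySpark, and the alphabetical tie-break is an assumption made here for determinism.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    freq = Counter(labels)
    # Sort by descending count; break ties alphabetically (assumed here).
    ordered = sorted(freq, key=lambda lab: (-freq[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


# Hypothetical label column with more ham than spam.
mapping = index_labels(["ham", "ham", "spam", "ham", "spam"])
print(mapping)
```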

## The Model
- Create a <b>NaiveBayes</b> classifier with the default parameters.

In [16]:
from pyspark.ml.classification import NaiveBayes
naivebayes = NaiveBayes(featuresCol='features',labelCol='class_index')
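
With default parameters, PySpark's `NaiveBayes` is a multinomial model with additive smoothing of 1.0. The sketch below illustrates that idea in plain Python on made-up token lists. `train_nb` and `predict_nb` are hypothetical helpers and not the MLlib implementation; in particular, this sketch smooths only the term likelihoods and leaves the class priors unsmoothed.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, smoothing=1.0):
    """Fit log priors and smoothed log likelihoods from tokenized docs."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, lab in zip(docs, labels):
        term_counts[lab].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Additive smoothing: add `smoothing` to every term count.
        total = sum(term_counts[c].values()) + smoothing * len(vocab)
        log_like[c] = {
            t: math.log((term_counts[c][t] + smoothing) / total) for t in vocab
        }
    return log_prior, log_like


def predict_nb(model, doc):
    """Return the class with the highest log score; unseen tokens are skipped."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy training data.
train_docs = [["free", "prize", "win"], ["win", "cash", "free"],
              ["see", "you", "tonight"], ["call", "you", "later"]]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["free", "cash"]))
print(predict_nb(model, ["see", "you", "later"]))
```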

## Pipeline
### Create a pipeline containing all the steps, from the Tokenizer to the NaiveBayes classifier.

In [17]:
from pyspark.ml import Pipeline

In [18]:
pipeline = Pipeline(stages=[tokenizer, stopwordsremover, countvectorizer, stringIndexer,
                            vectorassembler, naivebayes])

### Split your data into train and test sets with ratios 0.7 and 0.3 respectively.

In [19]:
train, test = df3.randomSplit([0.7,0.3],seed=42)
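
Note that `randomSplit` treats the ratios as sampling weights, so the resulting sizes are only approximately 70% and 30%. A rough pure-Python sketch of the idea, assuming each row is assigned by an independent random draw; `random_split` is a hypothetical helper and does not reproduce Spark's exact sampling:

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    total = sum(weights)
    # Cumulative upper bounds of each split on [0, 1), e.g. [0.7, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point remainder.
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train_rows, test_rows = random_split(list(range(1000)), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Fixing the seed, as the cell above does, makes the split reproducible across runs.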

### Fit your Pipeline model to the training data

In [20]:
model = pipeline.fit(train)

### Perform predictions on the test DataFrame

In [21]:
predDF = model.transform(test)

### Print the schema of the prediction dataframe

In [22]:
predDF.printSchema()

root
 |-- class: string (nullable = true)
 |-- text: string (nullable = true)
 |-- length: integer (nullable = true)
 |-- tokenized_text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- stop_removed: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- TF_IDF: vector (nullable = true)
 |-- class_index: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



## Model Evaluation
- Use <b>MulticlassClassificationEvaluator</b> to calculate the <b>f1_score</b>.

In [25]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator 
metric = MulticlassClassificationEvaluator(predictionCol='prediction',
                                           labelCol='class_index',
                                           metricName='f1')
f1_score = metric.evaluate(predDF)
print(f'f1_score is: {f1_score}')

f1_score is: 0.9723827056612554
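
With `metricName='f1'`, `MulticlassClassificationEvaluator` reports a weighted F1: the per-class F1 scores are averaged, with each class weighted by its share of the true labels. A sketch of that computation in plain Python with made-up labels; `weighted_f1` is a hypothetical helper, not a PySpark function:

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its true-label share."""
    support = Counter(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += f1 * n_true / len(y_true)
    return score


# Hypothetical labels: 0.0 = ham, 1.0 = spam.
y_true = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
y_pred = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(round(weighted_f1(y_true, y_pred), 4))
```

Since ham dominates the dataset, the weighted score is driven mostly by the ham class, so it is worth checking the spam-class precision and recall separately as well.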
