<a href="https://colab.research.google.com/github/Bishop1303/ML_PySpark/blob/dev/ML_SpamFilter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Mount the Google Drive that holds the data:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Install Java, Spark and the Python helper packages:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar -xvf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark

# Set the environment variables Spark needs
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"

# SparkSession
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark


# From unstructured to structured text data

The objective is to build a spam filter using machine learning. The dataset *sms.csv* contains SMS messages already classified as spam (1) or not spam (0).

```
+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  1|Sorry, I'll call ...|    0|
|  2|Dont worry. I gue...|    0|
|  3|Call FREEPHONE 08...|    1|
        ...
```
Before the text data can be used for ML, a *few* steps are needed:

* remove punctuation and numbers
* tokenize (split into individual words)
* remove stop words (common words that carry little information about which category a text belongs to)
* apply the hashing trick
* convert to TF-IDF representation.
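
The first three steps can be tried on a single string before running them in Spark. The sketch below is plain Python, not the notebook's pipeline: the function name `preprocess` and the tiny `STOP_WORDS` set are assumptions made for illustration (Spark's default stop-word list is much longer). It reuses the punctuation character class from the Spark code.

```python
import re

# Illustrative stop-word list (an assumption; Spark's default list is much longer).
STOP_WORDS = {"you", "at", "the", "a", "to"}

def preprocess(text):
    """Clean one message and return its remaining terms."""
    # Replace the listed punctuation characters with spaces.
    text = re.sub(r"[_():;,.!?\-]", " ", text)
    # Replace digits with spaces.
    text = re.sub(r"[0-9]", " ", text)
    # Collapse runs of spaces into a single space.
    text = re.sub(r" +", " ", text)
    # Lowercase and split on whitespace, similar to Spark's Tokenizer.
    words = text.lower().split()
    # Drop stop words.
    return [w for w in words if w not in STOP_WORDS]

print(preprocess("Sorry, I'll call you later (at 5)!"))
# → ['sorry', "i'll", 'call', 'later']
```

The apostrophe is not in the punctuation class, so `i'll` survives as a single token.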

In [None]:
from pyspark.sql.functions import regexp_replace
from pyspark.ml.feature import Tokenizer

# Reading data
sms = spark.read.csv("/content/drive/My Drive/sms.csv", inferSchema=True, header=False, sep=';').toDF("id","text","label")

# Check DF
#sms.show()
#sms.printSchema()

# Remove punctuation
wrangled = sms.withColumn('text', regexp_replace(sms.text, '[_():;,.!?\\-]', ' '))

# Remove numbers
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, '[0-9]', ' '))

# Merge multiple spaces 
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))

# Split the text into words
wrangled = Tokenizer(inputCol='text', outputCol='words').transform(wrangled)

wrangled.show(4, truncate=False)

# Stop words and hashing

Remove stop words, apply the **hashing trick** and convert the results into a **TF-IDF Matrix**.

The **hashing trick** provides a fast and space-efficient way to map a very large (possibly infinite) set of items (in this case, all words contained in the SMS messages) onto a smaller, finite number of values.  

NB: The *numFeatures* argument of *HashingTF* sets the number of hash buckets, i.e. the length of the resulting feature vector. Even when *numFeatures* is as large as the vocabulary, different words can still hash to the same index (a collision). Colliding words share one feature, so their counts are merged and some information is lost.
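
The bucket and collision behaviour described above can be seen in a minimal pure-Python sketch. The helper names `bucket` and `hashing_tf` are hypothetical, and `hashlib.md5` is used only because it is deterministic across runs. Spark's *HashingTF* uses a different hash function (MurmurHash3), so the indices it produces will not match these.

```python
import hashlib

def bucket(term, num_features):
    """Map a term to an index in [0, num_features)."""
    # md5 is an assumption made for reproducibility; Spark uses MurmurHash3.
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

def hashing_tf(terms, num_features):
    """Return a dense term-frequency vector of length num_features."""
    vector = [0] * num_features
    for term in terms:
        vector[bucket(term, num_features)] += 1
    return vector

terms = ["call", "freephone", "call", "now"]
vec = hashing_tf(terms, num_features=8)
print(vec)

# Nine distinct words cannot fit into eight buckets without a collision.
distinct = ["w%d" % i for i in range(9)]
used = {bucket(t, 8) for t in distinct}
print(len(distinct), "words ->", len(used), "buckets")
```

The vector length never depends on the vocabulary size, which is what makes the trick space-efficient. The price is that colliding words become indistinguishable.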

The **TF-IDF Matrix** reflects how important a word is to each document. It takes into account both:


*   The frequency of the word within each document.

*   The frequency of the word across all of the documents in the collection.
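
How the two frequencies combine can be shown on a toy corpus. The sketch below is plain Python with a hypothetical helper `tf_idf`. It uses the smoothed form log((N + 1) / (df + 1)) that the Spark MLlib documentation gives for IDF, and it works on raw terms rather than hashed indices, so it is an approximation of what the notebook computes.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: log((N + 1) / (df + 1)).
    idf = {t: math.log((n_docs + 1) / (d + 1)) for t, d in df.items()}
    # Weight = term frequency within the document * idf of the term.
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

docs = [["call", "later"], ["call", "free", "free"], ["call", "now"]]
weights = tf_idf(docs)
print(weights[1])
```

Here `call` appears in every document, so its weight is 0. `free` appears twice in one document and nowhere else, so it gets the largest weight.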

In [None]:
from pyspark.ml.feature import StopWordsRemover, HashingTF, IDF

# Remove stop words.
wrangled = StopWordsRemover(inputCol='words', outputCol='terms')\
      .transform(wrangled)

# Apply the hashing trick
wrangled = HashingTF(inputCol="terms", outputCol="hash", numFeatures=1024)\
      .transform(wrangled)

# Convert hashed symbols to TF-IDF
tf_idf = IDF(inputCol='hash', outputCol='features')\
      .fit(wrangled).transform(wrangled)
      
tf_idf.select('terms', 'features').show(4, truncate=False)

# Training the spam classifier

The SMS data have now been prepared for building a classifier.  

Next steps are:
1.   Split the **TF-IDF** data into *training* and *testing sets*.
2.   Use the training data to fit a Logistic Regression model.
3.   Evaluate the performance of that model on the testing data.

NB: Regularization (*regParam*, default 0.0) adds a penalty term to the loss function that grows with the size of the coefficients. The penalty keeps the coefficients from taking extreme values, which makes the fitted function less erratic and reduces overfitting.
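
The effect of the penalty can be illustrated numerically. The sketch below is a simplified, hypothetical `penalized_log_loss` in plain Python. It assumes a pure L2 penalty of the form (regParam / 2) * sum(w²), which corresponds to leaving *elasticNetParam* at its default of 0.0. Spark's own implementation also standardizes the features by default, so this is not its exact objective.

```python
import math

def penalized_log_loss(weights, bias, rows, labels, reg_param):
    """Mean log loss plus an L2 penalty of (reg_param / 2) * sum(w^2)."""
    total = 0.0
    for x, y in zip(rows, labels):
        # Linear score followed by the logistic (sigmoid) function.
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    penalty = 0.5 * reg_param * sum(w * w for w in weights)
    return total / len(rows) + penalty

# Toy data (an assumption): two one-hot rows, one per class.
rows = [[1.0, 0.0], [0.0, 1.0]]
labels = [1, 0]

for w in ([1.0, -1.0], [10.0, -10.0]):
    plain = penalized_log_loss(w, 0.0, rows, labels, reg_param=0.0)
    penalized = penalized_log_loss(w, 0.0, rows, labels, reg_param=0.2)
    print(w, round(plain, 4), round(penalized, 4))
```

Without the penalty the extreme weights give the lower loss. With `reg_param=0.2` the moderate weights win.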

In [None]:
from pyspark.ml.classification import LogisticRegression

# Keep only the label and feature columns of the prepared SMS data
sms_ml_ready = tf_idf.select('label', 'features')

# Split the data into training and testing sets
sms_train, sms_test = sms_ml_ready.randomSplit([0.8,0.2], seed=13)

# Fit a Logistic Regression model to the training data
logistic = LogisticRegression(regParam=0.2).fit(sms_train)

# Make predictions on the testing data
prediction = logistic.transform(sms_test)

# Confusion matrix, comparing predictions to known labels
prediction.groupBy('label', 'prediction').count().show()
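
The four counts in the confusion matrix can be reduced to the usual summary metrics. The sketch below is plain Python with a hypothetical helper `classification_metrics`. The counts passed to it are invented for illustration and are not results from this notebook.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall for the positive (spam) class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision: share of predicted spam that really is spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real spam that was caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts, not output of the model above.
accuracy, precision, recall = classification_metrics(tn=90, fp=2, fn=3, tp=5)
print(accuracy, precision, recall)
```

In the notebook, each count could be read off the `groupBy` output above or obtained with a filter such as `prediction.filter('label = 1 AND prediction = 1').count()` for the true positives.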

# One-Hot Encoding


