Email Spam Detection - 5th Semester Big Data Project

Team Number: BD2_359_307_433

This is the final project repository for our Big Data course, in which we implemented Machine Learning with Spark Streaming. We used the email spam detection dataset and trained three models: Logistic Regression, SGD Classifier and MLP Classifier.

Dataset Description:

The dataset given to us was already cleaned and ready for pre-processing. It has the following properties:

Each record consists of 3 features - the subject, the email content and the label
Each email is one of 2 classes, spam or ham
30k examples in the training set and 3k in the test set

Link to the original dataset: https://www.kaggle.com/wanderfj/enron-spam (Enron Email Spam Detection Dataset)

Libraries Used:

PySpark, Pickle, Numpy

Steps:

Create a SparkContext, StreamingContext and SQLContext to stream the data in real time and convert it into a form that Spark SQL can query.
The raw streamed records cannot be used by Spark directly, so parse them as JSON, which Spark can read. Then convert the result to a DataFrame and populate the dataset.
Pre-processing: RegexTokenizer (breaks each sentence into words), StopWordsRemover (removes stopwords, i.e. common words that add little meaning), Word2Vec (converts words to vectors), StringIndexer (converts labels to indices, acting as a label encoder).
Put all of these stages into a pipeline, then perform fit and transform.
This yields (vector, category) tuples; use them to train the various built-in models and determine the accuracy score.
Run it on localhost and use Pickle to store the trained models on disk.
Repeat the same steps with the test dataset.
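
To make the pre-processing and persistence steps above more concrete, here is a minimal plain-Python sketch of what the tokenising, stopword removal and label indexing stages do, followed by a Pickle round trip. It is an illustration only, not the PySpark code used in this project: the helper names (tokenize, remove_stopwords, fit_label_index), the tiny stopword list, the sample records and the choice to join subject and content into one text are all assumptions. The Word2Vec step is left out because it needs a trained embedding model.

```python
import pickle
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much larger).
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "for", "you", "your"}


def tokenize(text):
    """Lowercase the text and split it into word tokens with a regex."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that carry little signal for classification."""
    return [t for t in tokens if t not in STOPWORDS]


def fit_label_index(labels):
    """Map each label to an integer index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: idx for idx, lab in enumerate(ordered)}


# Made-up records in the (subject, content, label) shape described above.
records = [
    ("Meeting", "The meeting is moved to Monday", "ham"),
    ("Win now", "Claim your free prize and win cash", "spam"),
    ("Report", "Please review the attached report", "ham"),
]

label_index = fit_label_index(label for _, _, label in records)

# Assumption: subject and content are joined into one text before tokenising.
prepared = [
    (remove_stopwords(tokenize(subject + " " + content)), label_index[label])
    for subject, content, label in records
]

# Persist the fitted label mapping with pickle and load it back.
blob = pickle.dumps(label_index)
restored = pickle.loads(blob)
```

In the actual pipeline the token lists would still have to be turned into vectors before training; the sketch stops at (tokens, label index) pairs.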

Acknowledgements

I'd like to thank Prof. Animesh Giri and the TAs - Aditeya Baral, Ansh Sarkar and Vishesh P - for their guidance throughout the project. I'd also like to thank my teammates - Samriddhi Vishwakarma and Sohan Beela - for their contributions to the project.

