A natural language processing and machine learning project that predicts spam messages and explains how it does so

BigBangData/SMS_SpamDetect

SMS Spam Detect

The Problem

Spam detection is an old and continuing problem. I get spam texts every day and wonder how they got through my spam filter. I flag them every day, and since I have a Google phone, that flagging should be training what I believe is a state-of-the-art, bleeding-edge spam detector. Shouldn't the filter catch what is clearly a spam SMS to me?

In this project I tackled this old problem using a small corpus (download the SMS Spam Collection from the UCI Machine Learning Repository) and classical ML algorithms, aiming at explainability. I achieved 99% accuracy during model evaluation (see more evaluation metrics and tests in this notebook), yet since the training data is small I expect this model to generalize poorly, despite all the tests.

So I deployed the model in an app to see how it does in the wild, with unseen data, and to fully understand the challenge.

homepage

The App

Hosted on Heroku, the app consists of a simple homepage (above) with a form that accepts a text input, and a results page (below) in which I offer a detailed look at everything that goes on behind the scenes to transform this text into a prediction of whether it is spam or not.

results

top of results page

The app is meant to demystify machine learning (or "AI" as it's commonly referred to), since it is often just a series of transformations, however complex and probabilistic, of inputs into outputs: a text becomes a 1 or a 0.
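To make the "text becomes a 1 or a 0" idea concrete, here is a minimal sketch of one classical approach, a multinomial Naive Bayes classifier with add-one smoothing, written with the Python standard library only. This README does not say which algorithm the deployed model uses, so the choice of Naive Bayes, the function names (`tokenize`, `train`, `predict`), and the four toy messages are all assumptions made for illustration, not the project's actual code or data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(messages, labels):
    """Count word frequencies per class (1 = spam, 0 = not spam)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(text, model):
    """Return 1 (spam) or 0 (not spam) by comparing class log-probabilities."""
    word_counts, class_counts, vocab = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_messages)
        # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0


# Toy, made-up training data (hypothetical, not the SMS Spam Collection).
messages = [
    "WIN a FREE prize now, text CLAIM to 12345",
    "Free entry: claim your cash prize today",
    "Are we still on for lunch tomorrow?",
    "Can you call me when you get home?",
]
labels = [1, 1, 0, 0]

model = train(messages, labels)
print(predict("Claim your free prize", model))  # prints 1
print(predict("See you at lunch", model))       # prints 0
```

Every step here is inspectable: the tokens, the per-class counts, and the two log-scores being compared. That is the kind of transparency the results page aims for, although the real pipeline may differ in its preprocessing, features, and model.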

Machines are not intelligent. As one of the founders of the field, Michael I. Jordan, expertly comments in this Lex Fridman Podcast: the "I" in "AI" is a misnomer. We have yet to fully comprehend how humans think, let alone understand whether machines think at all and, if so, how that might differ from how humans think.

Business Applications

This app employs both Natural Language Processing (NLP) and Supervised Machine Learning, which are widely applicable to businesses in a variety of ways. The proportion of unstructured text data on the internet keeps growing relative to structured data such as tabular data. Text data is often found sitting untapped in databases, as front-facing apps continuously capture open text fields with user comments.

Insights can be extracted from text using NLP and various analytic methods, whether using machine learning or using simpler designs and iterating through solutions. This project's framework for processing text and for classification can be extended and adapted to any other classification tasks involving textual data.

Acknowledgements

This journey into the fields of NLP and ML took months of learning and of developing my own understanding of the inner workings of various models, many of which I never ended up deploying. I am indebted to numerous tutorials and blogs I've read and watched along the way. Below is a list ordered from most to least influential: