In this project I built a model that classifies Email/SMS messages as Spam or Ham from the message text, using standard classifiers.
Extract the text and the target class from the dataset. Extract input features from the text using a TF-IDF vectorizer. Split the skewed (class-imbalanced) data into shuffled train and test sets using the stratified shuffle split from the sklearn library. Use standard classifiers to classify the messages as spam or ham.
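The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the messages are made-up stand-ins for the real dataset, and the choice of `MultinomialNB`, the `test_size`, and the `random_state` are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import MultinomialNB

# Made-up messages standing in for the real dataset (skewed: 8 ham, 4 spam).
texts = np.array([
    "Are we still meeting for lunch today",
    "Call me when you get home",
    "I will be late, start without me",
    "Thanks for the help yesterday",
    "Can you send me the notes from class",
    "Happy birthday, see you tonight",
    "The train is delayed again",
    "Let me know if you need anything",
    "WINNER! Claim your free prize now",
    "Free entry to win cash, text now",
    "Urgent! Your account has won a free prize",
    "Claim your free cash reward now, call today",
])
labels = np.array(["ham"] * 8 + ["spam"] * 4)

# Stratified shuffle split keeps the ham/spam ratio in both subsets.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(texts, labels))

# Fit TF-IDF on the training messages only, then reuse it for the test set.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts[train_idx])
X_test = vectorizer.transform(texts[test_idx])

# One possible standard classifier; others can be swapped in the same way.
clf = MultinomialNB()
clf.fit(X_train, labels[train_idx])
predictions = clf.predict(X_test)
```

Fitting the vectorizer on the training split only avoids leaking test-set vocabulary into the features.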
- Python
- scikit-learn/sklearn
- Pandas
- NumPy
- NLTK
- Matplotlib
- Jupyter/Spyder/Pycharm
You can download the raw dataset from here. The files contain one message per line. Each line consists of two columns:
- Class (v1) - contains the label (ham or spam)
- Message (v2) - contains the raw text.
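As a sketch of how the two columns might be loaded with Pandas, the example below reads a small in-memory CSV. The rows are invented for illustration, and the comma-separated layout, the renamed columns `label` and `text`, and the numeric `target` column are assumptions; the real file may need a different separator or encoding.

```python
import io

import pandas as pd

# Invented sample rows in the v1/v2 layout described above.
raw = io.StringIO(
    "v1,v2\n"
    "ham,See you at lunch\n"
    'spam,"WINNER! Claim your free prize now"\n'
    "ham,Running late\n"
)

# Keep only the label and message columns and give them readable names.
df = pd.read_csv(raw, usecols=["v1", "v2"])
df = df.rename(columns={"v1": "label", "v2": "text"})

# Encode the target class numerically: ham -> 0, spam -> 1.
df["target"] = df["label"].map({"ham": 0, "spam": 1})
```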
The classifiers were compared on their overall Precision and Accuracy.
Since Naive Bayes (NB) has the best Accuracy and Precision, it is the selected model.
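A comparison of this kind could be computed along the following lines. The labels and predictions are hypothetical placeholders, not the project's results, and `model_a` and `model_b` are illustrative names; spam is treated as the positive class (1).

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical ground truth and predictions (1 = spam, 0 = ham).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = {
    "model_a": [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
    "model_b": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
}

# Accuracy: share of all messages classified correctly.
# Precision: share of messages flagged as spam that really are spam.
scores = {
    name: (accuracy_score(y_true, y_pred), precision_score(y_true, y_pred))
    for name, y_pred in predictions.items()
}

# Pick the model with the highest (precision, accuracy) pair.
best = max(scores, key=lambda name: (scores[name][1], scores[name][0]))

for name, (acc, prec) in scores.items():
    print(f"{name}: accuracy={acc:.2f}, precision={prec:.2f}")
print("selected:", best)
```

Precision matters for spam filtering because a false positive sends a legitimate message to the spam folder.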