Task : Classify tweets into a Traffic (1) class or a Non-traffic (0) class

Text classification is the automated process of assigning text to
predefined categories.

Let's put this into practice with supervised ML models that classify text.


The data set used here can be downloaded from:

https://data.mendeley.com/datasets/c3xvj5snvv/1




In [1]:
# STEP 1 : Import desired libraries
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score
from sklearn import metrics



In [2]:
import sys
print(sys.executable)

C:\ProgramData\Anaconda3\python.exe


In [3]:
# STEP 2 : Read dataset
# Read data into dataframes and define column names
train_df = pd.read_csv('1_TrainingSet_2Class.csv', names=['label', 'user_id', 'text'])
test_df = pd.read_csv('1_TestSet_2Class.csv', names=['label', 'user_id', 'text'])

STEP 3 : Prepare Train and Test Data sets

The corpus is provided as two data sets: a training set and a test set.

The training set is used to fit the model, and predictions are then made on the test set.

Learn more about training and test data here - https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
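
The data set used here is already split into two files, so no splitting is needed in this notebook. For the case where only a single labelled file is available, the following is a minimal sketch of a stratified split using scikit-learn's train_test_split. The toy texts and labels are made up for illustration, and the 80/20 ratio is an assumption chosen to roughly match the sizes of the two provided files.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1 = traffic, 0 = non-traffic (not taken from the real data set)
texts = [
    "traffic jam on the highway", "accident near the bridge",
    "road closed downtown", "slow traffic on main street",
    "lane blocked by a truck", "sunny day at the beach",
    "great coffee this morning", "watching a movie tonight",
    "my cat sleeps all day", "new album sounds great",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# stratify=labels keeps the class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
print(sorted(y_test))
```

Stratifying matters most when the classes are imbalanced; here both classes are the same size, so it simply guarantees that each part contains both labels.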

In [4]:
# Define training and testing data (text and labels)
train_x = train_df['text']
train_y = train_df['label']

test_x = test_df['text']
test_y = test_df['label']

In [5]:
print("Length of train data")
print(len(train_x), len(train_y))

print("Length of test data")
print(len(test_x), len(test_y))

Length of train data
40879 40879
Length of test data
10221 10221


In [6]:
import seaborn as sns
import matplotlib.pyplot as plt
# Plot the class balance of the training and test sets
sns.countplot(x='label', data=train_df)
plt.show()
sns.countplot(x='label', data=test_df)
plt.show()

<Figure size 640x480 with 1 Axes>

<Figure size 640x480 with 1 Axes>

In [7]:
# Requires the third-party wordcloud package (e.g. pip install wordcloud)
# plt was already imported in the previous cell
from wordcloud import WordCloud, STOPWORDS

tweets = train_df[train_df['label'] == 1].sample(n=2000)['text'].values
wc = WordCloud(background_color="black", max_words=2000, stopwords=STOPWORDS)
wc.generate(" ".join(tweets))

plt.figure(figsize=(10,5))
plt.axis("off")
plt.title("Frequent words in traffic related tweets", fontsize=20)
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17), alpha=0.98)
plt.show()

ModuleNotFoundError: No module named 'wordcloud'

STEP 4 : Word Vectorization


Word vectorization is the general process of turning a collection of text documents into numerical feature vectors. There are many methods for converting text data into
vectors that a model can understand; one of the most widely used is TF-IDF. This is an acronym that stands for "Term Frequency - Inverse Document Frequency", the two components of the score assigned to each word.
Term Frequency: This summarizes how often a given word appears within a document.
Inverse Document Frequency: This scales down words that appear in many documents.

TF-IDF scores are word frequency scores that highlight the more informative words, i.e. words that are frequent in one document but rare across the other documents.
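
To make the two components concrete, here is a small worked sketch in plain Python. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document vector, which is the variant scikit-learn documents as the TfidfVectorizer default; other libraries and textbooks use slightly different formulas. The three toy tweets are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the real data set)
docs = [
    "heavy traffic on the highway",
    "traffic jam on the bridge",
    "lovely weather on the beach",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency (assumed variant, see lead-in)
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(tokens):
    # Term frequency (raw count) times IDF, then scaled to unit L2 length
    counts = Counter(tokens)
    raw = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

In the first tweet, "heavy" occurs in one document, "traffic" in two, and "on" in all three, so their weights decrease in that order even though each word appears exactly once in the tweet.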

Learn more about TF-IDF here - https://www.youtube.com/watch?v=4vT4fzjkGCQ

Learn more about TF-IDF vectorizer here - https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

In [6]:
#  Word Vectorization
# Fit the vectorizer on the training text only, then transform train_x and test_x into Train_X_Tfidf and Test_X_Tfidf
# Each result is a sparse matrix with one row per tweet; every stored entry pairs a vocabulary index (column) with its TF-IDF weight.
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(train_x)
Train_X_Tfidf = Tfidf_vect.transform(train_x)
Test_X_Tfidf = Tfidf_vect.transform(test_x)

In [7]:
# Inspect the vocabulary learned from the corpus (maps each term to its column index)
print(Tfidf_vect.vocabulary_)



STEP 5 : Use the ML Algorithms to Predict the outcome




In [8]:
# fit the training dataset on the Naive Bayes classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,train_y)
# predict the labels on the test dataset
ypred_NB = Naive.predict(Test_X_Tfidf)


STEP 6 : Check model performance

Learn about different classification metrics (precision, recall, f1-score, accuracy) here - https://medium.com/@MohammedS/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b

In [9]:
from sklearn.metrics import classification_report
# Use accuracy_score(y_true, y_pred) to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(test_y, ypred_NB)*100)
print("Naive Bayes Performance -> \n",classification_report(test_y, ypred_NB))

Naive Bayes Accuracy Score ->  97.28010957831916
Naive Bayes Performance -> 
               precision    recall  f1-score   support

           0       0.97      0.97      0.97      5110
           1       0.97      0.97      0.97      5111

    accuracy                           0.97     10221
   macro avg       0.97      0.97      0.97     10221
weighted avg       0.97      0.97      0.97     10221



In [10]:
#confusion matrix for  naive bayes
from sklearn.metrics import confusion_matrix

print(confusion_matrix(test_y, ypred_NB)) 

[[4973  137]
 [ 141 4970]]
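
The figures in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below copies the Naive Bayes matrix and relies on scikit-learn's convention that rows are the true classes and columns are the predicted classes; treating class 1 (traffic) as the positive class is a choice made here for illustration.

```python
# Naive Bayes confusion matrix as printed above
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = [[4973, 137],
      [141, 4970]]

tn, fp = cm[0]  # true non-traffic: predicted 0 / predicted 1
fn, tp = cm[1]  # true traffic:     predicted 0 / predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the tweets predicted as traffic, how many were traffic
recall = tp / (tp + fn)      # of the real traffic tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1-score  = {f1:.4f}")
```

The accuracy agrees with the score reported above, and precision, recall and f1-score for class 1 all round to 0.97, as in the report.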


As a next step you can try the following:

Play around with the data pre-processing steps and see how it affects the accuracy.

Try other word vectorization techniques such as Count Vectorizer and Word2Vec.

Try parameter tuning with the help of GridSearchCV on these algorithms.

Try other classification algorithms such as linear classifiers, boosting models and even neural networks.
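
As a starting point for the parameter tuning suggestion, here is a minimal sketch that wraps the vectorizer and a classifier in a scikit-learn Pipeline and searches a small grid with GridSearchCV. The toy tweets, the step names "tfidf" and "clf", and the grid values are assumptions for illustration only; on the real data you would pass train_x and train_y instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = traffic, 0 = non-traffic
texts = [
    "heavy traffic jam on the highway", "accident blocks two lanes downtown",
    "road closed due to traffic accident", "slow traffic near the bridge",
    "highway congestion after accident", "traffic jam on main road",
    "lovely sunny day at the beach", "watching a movie with friends tonight",
    "great coffee at the new cafe", "my cat sleeps all day",
    "new album sounds great", "dinner with friends at the cafe",
]
labels = [1] * 6 + [0] * 6

# Putting the vectorizer inside the pipeline means it is refitted on each
# training fold, so no information from the held-out fold leaks into it.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Parameter names follow the "<step name>__<parameter>" convention
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Swapping TfidfVectorizer for CountVectorizer, or MultinomialNB for another classifier, only requires changing the corresponding pipeline step and its grid entries.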

In [11]:
# Classifier - Algorithm - SVM
from sklearn.svm import SVC
svclassifier=SVC(kernel='linear')
svclassifier.fit(Train_X_Tfidf,train_y)
# predict the labels on the test dataset
predictions_SVM = svclassifier.predict(Test_X_Tfidf)

In [12]:
from sklearn.metrics import classification_report
# Use accuracy_score(y_true, y_pred) to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(test_y, predictions_SVM)*100)
print("SVM Performance -> \n",classification_report(test_y,predictions_SVM ))


SVM Accuracy Score ->  98.25848742784463
SVM Performance -> 
               precision    recall  f1-score   support

           0       0.98      0.98      0.98      5110
           1       0.98      0.98      0.98      5111

    accuracy                           0.98     10221
   macro avg       0.98      0.98      0.98     10221
weighted avg       0.98      0.98      0.98     10221



In [13]:
#confusion matrix for  SVC
from sklearn.metrics import confusion_matrix

print(confusion_matrix(test_y, predictions_SVM)) 

[[5021   89]
 [  89 5022]]


In [14]:
#logistic regression classifier
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

logreg = LogisticRegression()
logreg.fit(Train_X_Tfidf, train_y)

# predict the labels on the test dataset
y_pred=logreg.predict(Test_X_Tfidf)


In [15]:
from sklearn.metrics import classification_report
print("LogisticRegression Accuracy Score -> ",accuracy_score(test_y, y_pred)*100)
print("LogisticRegression Performance -> \n",classification_report(test_y,y_pred))


LogisticRegression Accuracy Score ->  98.14108208590157
LogisticRegression Performance -> 
               precision    recall  f1-score   support

           0       0.98      0.99      0.98      5110
           1       0.99      0.98      0.98      5111

    accuracy                           0.98     10221
   macro avg       0.98      0.98      0.98     10221
weighted avg       0.98      0.98      0.98     10221



In [16]:
#confusion matrix for  Logistic regression
from sklearn.metrics import confusion_matrix

# y_pred holds the logistic regression predictions from In [14]
print(confusion_matrix(test_y, y_pred))


In [17]:
#KNN classifier
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(Train_X_Tfidf, train_y)

# predict the labels on the test dataset
y_predic = classifier.predict(Test_X_Tfidf)
print("KNN Accuracy Score -> ",metrics.accuracy_score(test_y, y_predic)*100)
print("KNN Performance -> \n",classification_report(test_y,y_predic))


KNN Accuracy Score ->  96.46805596321299
KNN Performance -> 
               precision    recall  f1-score   support

           0       0.95      0.98      0.97      5110
           1       0.98      0.95      0.96      5111

    accuracy                           0.96     10221
   macro avg       0.97      0.96      0.96     10221
weighted avg       0.97      0.96      0.96     10221



In [32]:
#confusion matrix for  KNN classifier
from sklearn.metrics import confusion_matrix

# y_predic holds the KNN predictions from In [17]
print(confusion_matrix(test_y, y_predic))
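
Since every classifier in this notebook is fitted and scored in the same way, the steps can also be wrapped in one loop so that each model's predictions are always paired with its own accuracy and confusion matrix. This is a minimal sketch: evaluate_models is a hypothetical helper name, and the toy tweets stand in for the real Train_X_Tfidf and Test_X_Tfidf matrices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

def evaluate_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and score it on its own predictions (hypothetical helper)."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_hat = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_hat),
            "confusion_matrix": confusion_matrix(y_test, y_hat, labels=[0, 1]),
        }
    return results

# Hypothetical toy data: 1 = traffic, 0 = non-traffic
train_texts = [
    "heavy traffic jam on the highway", "accident blocks two lanes",
    "road closed due to accident", "slow traffic near the bridge",
    "lovely sunny day at the beach", "watching a movie tonight",
    "great coffee at the new cafe", "my cat sleeps all day",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = [
    "traffic accident on the bridge", "highway closed this morning",
    "coffee and a movie tonight", "sunny day with my cat",
]
test_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

results = evaluate_models(models, X_train, train_labels, X_test, test_labels)
for name, result in results.items():
    print(name, "accuracy ->", result["accuracy"] * 100)
    print(result["confusion_matrix"])
```

Keeping the predictions local to the loop avoids reusing a variable from an earlier cell when printing a later model's confusion matrix.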
