# Predicting Yelp Review Class

## Load Data

In [1]:
import pandas as pd

yelp = pd.read_csv('yelp_review.csv')

yelp.head()

Unnamed: 0,class,review
0,2,"Contrary to other reviews, I have zero complai..."
1,1,Last summer I had an appointment to get new ti...
2,2,"Friendly staff, same starbucks fair you get an..."
3,1,The food is good. Unfortunately the service is...
4,2,Even when we didn't have a car Filene's Baseme...


In [2]:
corpus = yelp['review']
corpus.head()

0    Contrary to other reviews, I have zero complai...
1    Last summer I had an appointment to get new ti...
2    Friendly staff, same starbucks fair you get an...
3    The food is good. Unfortunately the service is...
4    Even when we didn't have a car Filene's Baseme...
Name: review, dtype: object

## (a) Prediction with each of the four sentiment scores

In [3]:
# import the NLTK VADER sentiment analyzer
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon; the download is skipped if it is already up to date
nltk.download('vader_lexicon', quiet=True)

# initialize an analyzer
sia = SentimentIntensityAnalyzer()

senti_pos = []
senti_neg = []
senti_neu = []
senti_comp = []

# iterate through each review in the corpus
for review in corpus:
    # polarity_scores returns a dict with keys 'neg', 'neu', 'pos', 'compound'
    ss = sia.polarity_scores(review)

    # collect each sentiment score in its own list
    senti_pos.append(ss['pos'])
    senti_neg.append(ss['neg'])
    senti_neu.append(ss['neu'])
    senti_comp.append(ss['compound'])

In [4]:
# add the score lists to the DataFrame as columns using assign(column_name=data)
yelp = yelp.assign(pos = senti_pos, neg = senti_neg, neu = senti_neu, compound = senti_comp)

In [5]:
X = yelp[['pos', 'neg', 'neu', 'compound']]

In [6]:
# select target
y=yelp[['class']]

y.head()

Unnamed: 0,class
0,2
1,1
2,2
3,1
4,2


In [7]:
y = y.values.ravel()

In [8]:
# load the required library
from sklearn.model_selection import train_test_split

# split data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=200)

In [9]:
# import the library
from sklearn.ensemble import RandomForestClassifier

# initialize the algorithm
rfc_pos = RandomForestClassifier(random_state=200)
rfc_neg = RandomForestClassifier(random_state=200)
rfc_neu = RandomForestClassifier(random_state=200)
rfc_compound = RandomForestClassifier(random_state=200)

# Generate a new model using training data only
rfc_pos.fit(X_train[['pos']],y_train)
rfc_neg.fit(X_train[['neg']],y_train)
rfc_neu.fit(X_train[['neu']],y_train)
rfc_compound.fit(X_train[['compound']],y_train)

In [10]:
# load the required libraries
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  

# make a prediction for the input data
y_pred_pos = rfc_pos.predict(X_test[['pos']])
y_pred_neg = rfc_neg.predict(X_test[['neg']])
y_pred_neu = rfc_neu.predict(X_test[['neu']])
y_pred_compound = rfc_compound.predict(X_test[['compound']])

In [11]:
print(accuracy_score(y_test, y_pred_pos))
print(classification_report(y_test, y_pred_pos)) 

0.79
              precision    recall  f1-score   support

           1       0.79      0.84      0.81       110
           2       0.79      0.73      0.76        90

    accuracy                           0.79       200
   macro avg       0.79      0.78      0.79       200
weighted avg       0.79      0.79      0.79       200



In [12]:
print(accuracy_score(y_test, y_pred_neg))
print(classification_report(y_test, y_pred_neg)) 

0.66
              precision    recall  f1-score   support

           1       0.72      0.62      0.67       110
           2       0.60      0.71      0.65        90

    accuracy                           0.66       200
   macro avg       0.66      0.66      0.66       200
weighted avg       0.67      0.66      0.66       200



In [13]:
print(accuracy_score(y_test, y_pred_neu))
print(classification_report(y_test, y_pred_neu)) 

0.635
              precision    recall  f1-score   support

           1       0.66      0.70      0.68       110
           2       0.60      0.56      0.58        90

    accuracy                           0.64       200
   macro avg       0.63      0.63      0.63       200
weighted avg       0.63      0.64      0.63       200



In [14]:
print(accuracy_score(y_test, y_pred_compound))
print(classification_report(y_test, y_pred_compound)) 

0.71
              precision    recall  f1-score   support

           1       0.75      0.71      0.73       110
           2       0.67      0.71      0.69        90

    accuracy                           0.71       200
   macro avg       0.71      0.71      0.71       200
weighted avg       0.71      0.71      0.71       200



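#### Discussion: Among the four single-score models, the one trained on the positive score is the most accurate (0.79), followed by compound (0.71), negative (0.66), and neutral (0.635). The positive sentiment score is therefore the most informative single feature in this prediction.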
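
To put the four accuracies in context, they can be compared with a majority-class baseline. The sketch below is a small illustration rather than part of the original analysis: it takes the support counts (110 and 90) and the accuracies printed in the reports above, and the variable names are chosen here for illustration only.

```python
# Support counts per class, copied from the classification reports above.
support = {1: 110, 2: 90}

# Accuracy of always predicting the most frequent class in the test set.
baseline = max(support.values()) / sum(support.values())

# Accuracies printed by the four single-feature models above.
reported = {"pos": 0.79, "neg": 0.66, "neu": 0.635, "compound": 0.71}

# Rank the features from most to least accurate.
ranked = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)

print(f"majority-class baseline = {baseline:.3f}")
for name, acc in ranked:
    print(f"{name:<9} accuracy = {acc:.3f}  gain over baseline = {acc - baseline:+.3f}")
```

All four models beat the baseline, but the margin for the neutral and negative scores is much smaller than for the positive score.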

## (b) Generate normalized TF-IDF for the whole corpus with cleaning

## Data Cleaning for DTM Generation (with stemming)

In [15]:
import nltk
# Download the required NLTK resources (stop word list and Punkt tokenizer models)
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jay\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jay\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [16]:
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

corpus_cleaned = []

# Build the stop word set and the stemmer once, outside the loop
stop_words = set(stopwords.words("english"))
ps = PorterStemmer()

for text in corpus:
    # Separate text into individual words
    tokens = word_tokenize(text)

    # Remove punctuation and numbers
    tokens = [word for word in tokens if word.isalpha()]

    # Lowercase the tokens
    tokens = [word.lower() for word in tokens]

    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]

    # Stem the tokens
    tokens = [ps.stem(w) for w in tokens]

    text_cleaned = " ".join(tokens)

    corpus_cleaned.append(text_cleaned)

In [17]:
# generate L2-normalized TF-IDF DTM

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(norm='l2')
X_tfidf = vectorizer.fit_transform(corpus_cleaned)

#print(vectorizer.get_feature_names())
#print(X_tfidf.toarray())
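
To make the effect of `norm='l2'` concrete, here is a minimal pure-Python sketch of L2-normalized TF-IDF on three toy documents. It assumes whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which is the scikit-learn default (`smooth_idf=True`). The function name `tfidf_l2` and the toy documents are hypothetical. The vectorizer's own tokenizer differs (for example, it drops single-character tokens), so this shows the weighting scheme and is not a drop-in replacement.

```python
import math
from collections import Counter


def tfidf_l2(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights.

    Assumes documents are already cleaned and whitespace-separated, and uses
    the smoothed IDF: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm if norm else 0.0 for w in raw])
    return vocab, rows


# Hypothetical stemmed reviews, in the style of corpus_cleaned.
toy_docs = ["food good servic slow", "friendli staff good coffe", "food food great"]
vocab, rows = tfidf_l2(toy_docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Each row has unit length after normalization, so review length does not inflate the weights. Within a document, a term that appears in fewer documents (such as `slow`) gets a larger weight than an equally frequent but more common term (such as `good`).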

In [18]:
# convert DTM to a DataFrame
X_tfidf = pd.DataFrame(X_tfidf.toarray())
#X_tfidf.columns=vectorizer.get_feature_names() # sklearn.__version__ <= 0.24.x
X_tfidf.columns=vectorizer.get_feature_names_out() # sklearn.__version__ >= 1.0.x

#X_tfidf.head()

## Use TF-IDF DTM to train a Random Forest classifier

In [19]:
# load the required library
from sklearn.model_selection import train_test_split

# split data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.20, random_state=200)

In [20]:
# import the library
from sklearn.ensemble import RandomForestClassifier

# initialize the algorithm
rfc_tfidf=RandomForestClassifier(random_state=200)

# Generate a new model using training data only
rfc_tfidf.fit(X_train,y_train)

In [21]:
# load the required libraries
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  

# make a prediction for the input data
y_pred_tfidf = rfc_tfidf.predict(X_test)

In [22]:
print(accuracy_score(y_test, y_pred_tfidf))
print(classification_report(y_test, y_pred_tfidf)) 

0.74
              precision    recall  f1-score   support

           1       0.76      0.77      0.77       110
           2       0.72      0.70      0.71        90

    accuracy                           0.74       200
   macro avg       0.74      0.74      0.74       200
weighted avg       0.74      0.74      0.74       200



## (c) Use both sentiment and normalized TF-IDF

In [23]:
X_full = X_tfidf.assign(pos = senti_pos)

In [24]:
X_full.head()

Unnamed: 0,aaa,aamco,abbrevi,abc,abid,abil,abl,abomin,abra,abruptli,...,yuppi,zach,zapata,zero,zimbrick,zinho,zip,zoo,zucchini,pos
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.153874,0.0,0.0,0.0,0.0,0.0,0.089
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.297
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.096
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.171


In [25]:
# load the required library
from sklearn.model_selection import train_test_split

# split data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.20, random_state=200)

In [26]:
# import the library
from sklearn.ensemble import RandomForestClassifier

# initialize the algorithm
rfc_full = RandomForestClassifier(random_state=200)

# Generate a new model using training data only
rfc_full.fit(X_train,y_train)

In [27]:
# load the required libraries
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  

# make a prediction for the input data
y_pred_full = rfc_full.predict(X_test)

In [28]:
print(accuracy_score(y_test, y_pred_full))
print(classification_report(y_test, y_pred_full)) 

0.84
              precision    recall  f1-score   support

           1       0.83      0.89      0.86       110
           2       0.85      0.78      0.81        90

    accuracy                           0.84       200
   macro avg       0.84      0.83      0.84       200
weighted avg       0.84      0.84      0.84       200



## Summary

|Metric             |Model a    |Model b    |Model c    |
|:-                 |:-         |:-         |:-         |
|Accuracy           |0.79       |0.74       |**0.84**   |
|F1 (class 1)       |0.81       |0.77       |**0.86**   |

The three models compared here are Model a (positive sentiment score only), Model b (normalized TF-IDF), and Model c (TF-IDF plus the positive sentiment score). Model a reaches 0.79 accuracy and a class-1 F1 of 0.81 from a single feature, which shows that the positive sentiment score carries much of the signal for review classification. Model b scores lower (0.74 accuracy, 0.77 F1), possibly because the TF-IDF matrix contributes a large number of sparse, noisy term features. Model c performs best (0.84 accuracy, 0.86 F1), which suggests that the sentiment score and the term weights carry complementary information: on this test set of 200 reviews, combining emotional tone with textual content classifies reviews more accurately than either does alone.
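
The F1 values in the summary table correspond to the class-1 rows of the classification reports. As a quick consistency check, the sketch below recomputes F1 as the harmonic mean of the printed class-1 precision and recall. The helper name `f1_from` is hypothetical. Because the printed precision and recall are already rounded to two decimals, the recomputed value can differ from the reported one in the last digit.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class-1 precision and recall as printed in the three classification reports.
printed = {
    "Model a": (0.79, 0.84),
    "Model b": (0.76, 0.77),
    "Model c": (0.83, 0.89),
}

# Class-1 F1 as printed in the same reports.
reported_f1 = {"Model a": 0.81, "Model b": 0.77, "Model c": 0.86}

recomputed = {name: f1_from(p, r) for name, (p, r) in printed.items()}

for name, value in recomputed.items():
    print(f"{name}: recomputed F1 = {value:.3f}, reported F1 = {reported_f1[name]:.2f}")
```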