<p align="center">
<b>Instituto Tecnológico y de Estudios Superiores de Monterrey</b><br>
<b>Análisis de métodos de razonamiento e incertidumbre (Gpo 102)</b><br>
PBL Activity 1<br><br>
Professor: Hugo Eduardo Ramírez Jaime<br><br>
Prepared by:<br>
Diego Colín Reyes A01666354<br>
Aldo Reséndiz Cravioto A01625395<br>
Daniel Alejandro López Martínez A01770442<br>
Eduardo Ramírez Almanza A01660118
</p>


In [2]:
import string
from collections import Counter, defaultdict

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

## Problem Statement

We have a message $m = (w_1, w_2, \ldots, w_n)$,  
where $w_1, w_2, \ldots, w_n$ are the unique words contained in the message.  
We need to find:

$$P(\text{spam} \mid w_1 \cap w_2 \cap \ldots \cap w_n) = \frac{P(w_1 \cap w_2 \cap \ldots \cap w_n \mid \text{spam}) \, P(\text{spam})}{P(w_1 \cap w_2 \cap \ldots \cap w_n)}$$


If we assume that the occurrence of each word is **independent** of all other words, we can simplify the above expression to:

$$\frac{P(w_1 \mid \text{spam}) \, P(w_2 \mid \text{spam}) \cdots P(w_n \mid \text{spam}) \, P(\text{spam})}{P(w_1) \, P(w_2) \cdots P(w_n)}$$


To classify a message, we determine which is greater:


$$P(\text{spam} \mid w_1 \cap w_2 \cap \ldots \cap w_n) \quad \text{versus} \quad P(\neg\,\text{spam} \mid w_1 \cap w_2 \cap \ldots \cap w_n)$$
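
As a small illustration of this comparison, the sketch below scores a two-word message under both classes. All probabilities are made-up values, not estimates from the dataset, and the names `prior`, `likelihood` and `numerator` are hypothetical. Both posteriors are divided by the same quantity, so comparing the numerators is enough to pick the class.

```python
from math import prod

# Hypothetical priors and per-word likelihoods (illustrative values only).
prior = {"spam": 0.13, "ham": 0.87}
likelihood = {
    "spam": {"free": 0.020, "call": 0.030},
    "ham": {"free": 0.002, "call": 0.006},
}

def numerator(label, words):
    """P(w1|label) * ... * P(wn|label) * P(label), the naive Bayes numerator."""
    return prod(likelihood[label][w] for w in words) * prior[label]

message = ["free", "call"]
scores = {label: numerator(label, message) for label in prior}

# Dividing both scores by the same positive number (here, their sum)
# rescales them without changing which one is larger.
evidence = sum(scores.values())
posterior = {label: s / evidence for label, s in scores.items()}
predicted = max(scores, key=scores.get)
print(predicted, posterior)
```

With these toy numbers the spam numerator is larger, so the message would be labelled spam.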



## Data Preparation

The raw file contains three columns we do not need, ‘Unnamed: 2’, ‘Unnamed: 3’ and ‘Unnamed: 4’, so we drop them. We then rename column ‘v1’ to ‘label’ and ‘v2’ to ‘message’. In the ‘label’ column, ‘ham’ (legitimate messages) is replaced by 0 and ‘spam’ (unsolicited messages) by 1.

In [3]:
data = pd.read_csv('spam.csv', encoding="latin1")

data

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [4]:
# Keep only the label and message columns, then give them descriptive names
data = data[['v1', 'v2']]

data = data.rename(columns={'v1': 'label', 'v2': 'message'})

# Encode labels as integers: ham -> 0, spam -> 1.
# The mapped column is assigned back instead of calling
# replace(..., inplace=True) on a column selection.
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

data


Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


We split the data into a training set and a test set: the model is fitted on the training set and then evaluated on the test set. For this project, 75% of the messages go to the training set and the remaining 25% to the test set. The split is random (with a fixed seed for reproducibility) and stratified by label, so both sets keep the same proportion of spam and ham.



In [None]:
#Making data lowercase
data['message'] = data['message'].str.lower()

#Dividing data in train and test
train, test = train_test_split(data, test_size=0.25, random_state=65, shuffle=True, stratify=data['label'])
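
To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of a stratified split on toy labels. It does not reproduce how `train_test_split` is implemented; the helper `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=65):
    """Return (train_idx, test_idx), sampling test_size of each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 80 ham (0) and 20 spam (1), i.e. 20% spam overall.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

spam_share_train = sum(labels[i] for i in train_idx) / len(train_idx)
spam_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(spam_share_train, spam_share_test)
```

Because each class is sampled separately, both subsets keep the 20% spam share of the toy data.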

## Modelling

#### Train & Test

In [None]:
# Stem each message and remove stop words
# (requires the NLTK tokenizer and stop-word resources to be downloaded)
stemmer = PorterStemmer()

sw = set(stopwords.words('english'))

def stem_text(msg):
    # Missing messages yield no tokens, so callers can safely call len() or iterate
    if pd.isna(msg):
        return []
    # Remove punctuation
    msg = ''.join([char for char in msg if char not in string.punctuation])

    tokens = word_tokenize(msg)
    # Filter out tokens that are just spaces or empty
    tokens = [token for token in tokens if token.strip()]

    # Drop stop words, then reduce each remaining word to its stem
    stemmed_tokens = [stemmer.stem(word) for word in tokens if word not in sw]

    return stemmed_tokens
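
The preprocessing steps can be illustrated without NLTK. The sketch below is a simplified stand-in, not the `stem_text` function above: it strips punctuation with `str.translate`, splits on whitespace instead of calling `word_tokenize`, uses a tiny hypothetical stop-word set, and skips stemming entirely.

```python
import string

# Hypothetical, deliberately tiny stop-word set (NLTK's English list is much longer).
TOY_STOPWORDS = {"a", "to", "in", "the", "is"}

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def simple_tokens(msg):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = msg.lower().translate(STRIP_PUNCT)
    return [tok for tok in cleaned.split() if tok not in TOY_STOPWORDS]

tokens = simple_tokens("Free entry in a weekly comp to WIN the cup!!!")
print(tokens)
```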

In [None]:

# Count most common words among spam messages only
spam_messages = data.loc[data['label'] == 1, 'message']

spam_word_list = []
for msg in spam_messages:
    stemmed = stem_text(msg)
    if stemmed:
        spam_word_list.extend(stemmed)

spam_word_frequencies = Counter(spam_word_list)
print("Top 25 spam words:")
print(spam_word_frequencies.most_common(25))


Top 25 spam words:
[('call', 366), ('free', 216), ('2', 173), ('txt', 163), ('u', 147), ('ur', 144), ('text', 138), ('mobil', 135), ('4', 119), ('claim', 115), ('stop', 113), ('repli', 109), ('prize', 94), ('get', 87), ('tone', 73), ('servic', 72), ('send', 69), ('new', 69), ('nokia', 68), ('award', 66), ('urgent', 63), ('week', 62), ('cash', 62), ('win', 61), ('contact', 61)]


In [None]:
# Dictionaries holding the occurrences of each word in spam, in ham, and overall

spam_ocurr = defaultdict(int)
ham_ocurr = defaultdict(int)
all_ocurr = defaultdict(int)

# Running totals of word tokens
total_words = 0
spam_words = 0
ham_words = 0

# Fill the dictionaries with the frequency of each word in its class
for i, row in train.iterrows():
    words = stem_text(row['message'])
    if row['label'] == 0:
        # Not spam
        ham_words += len(words)
        for w in words:
            ham_ocurr[w] += 1
            all_ocurr[w] += 1
    else:
        # Spam
        spam_words += len(words)
        for w in words:
            spam_ocurr[w] += 1
            all_ocurr[w] += 1

total_words = spam_words + ham_words

In [None]:
print(f'Total words: {total_words}')
print(f'Spam words: {spam_words}')
print(f'Ham words: {ham_words}')

Total words: 39218
Spam words: 9382
Ham words: 29836
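
The same per-class tallies can also be kept with `collections.Counter`, which the notebook already imports. The sketch below runs on a hypothetical list of already-tokenized messages (`toy_train`) rather than on the real training set, so its totals are unrelated to the numbers printed above.

```python
from collections import Counter

# Hypothetical (label, tokens) pairs standing in for the stemmed training messages.
toy_train = [
    (1, ["free", "prize", "call"]),
    (0, ["call", "later"]),
    (1, ["free", "call"]),
]

counts = {0: Counter(), 1: Counter()}  # 0 = ham, 1 = spam
for label, tokens in toy_train:
    counts[label].update(tokens)

# Adding two Counters sums their per-word counts.
all_counts = counts[0] + counts[1]
spam_total = sum(counts[1].values())
ham_total = sum(counts[0].values())
print(spam_total, ham_total, all_counts["call"])
```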


## Probability 

### Bayes Rule and Naive Bayes

#### Bayes Rule  
Bayes' rule is a formula in probability that lets us update the probability of an event based on new evidence.  
It relates the conditional probability of $A$ given $B$ to the conditional probability of $B$ given $A$:  


$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
 

It is widely used in statistics, machine learning, and decision-making under uncertainty.  
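
A small numeric illustration of the rule, using made-up probabilities (none of these values come from the dataset): let A be "the message is spam" and B be "the message contains the word free".

```python
# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
p_spam = 0.2              # P(A): prior probability of spam
p_free_given_spam = 0.5   # P(B|A)
p_free_given_ham = 0.05   # P(B|not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 4))
```

Under these assumed values, seeing the word raises the probability of spam from the 0.2 prior to about 0.71.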

---

#### Naive Bayes  
Naive Bayes is a classification method based on Bayes rule.  
It makes the simplifying assumption that all features are **conditionally independent** given the class.  

Despite this “naive” assumption, it often works surprisingly well in practice, especially for **text classification** tasks like spam filtering or sentiment analysis.

We now apply Naive Bayes in the next cell.

In [None]:
#Calculating P(w|spam), P(w|~spam) and P(w) as relative frequencies (no smoothing is applied)

def prob_w_spam(word):
    return spam_ocurr[word]/spam_words
    
def prob_w_ham(word):
    return ham_ocurr[word]/ham_words

def prob_w(word):
    return all_ocurr[word]/total_words
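
The functions above return zero for a word that never occurred in a class. One common remedy is additive (Laplace) smoothing. The sketch below is not part of the notebook's pipeline; the helper name `smoothed_prob`, the parameter `alpha` and the toy counts are assumptions made for illustration.

```python
from collections import Counter

def smoothed_prob(word, class_counts, vocab_size, alpha=1.0):
    """(count(word) + alpha) / (total words in class + alpha * vocabulary size)."""
    total = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total + alpha * vocab_size)

# Toy counts for illustration only.
spam_counts = Counter({"free": 3, "call": 1})
vocab = {"free", "call", "later"}

p_free = smoothed_prob("free", spam_counts, len(vocab))
p_later = smoothed_prob("later", spam_counts, len(vocab))  # unseen in spam
print(p_free, p_later)
```

With smoothing, an unseen word gets a small positive probability instead of zero, and the smoothed probabilities over the vocabulary still sum to one.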

## Results

In [39]:
def calculate_metrics(true_pos, true_neg, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    Fscore = 2 * precision * recall / (precision + recall)
    accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)

    print("Precision: ", precision)
    print("Recall: ", recall)
    print("F-score: ", Fscore)
    print("Accuracy: ", accuracy)
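
As a worked example of these formulas, the sketch below computes the four metrics from a hypothetical confusion matrix and returns 0.0 when a denominator is empty instead of raising an error. The function name `metrics_from_counts` and the counts are illustrative assumptions.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Return precision, recall, F-score and accuracy; 0.0 when a denominator is empty."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f_score = ratio(2 * precision * recall, precision + recall)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}

# Hypothetical counts: 8 spam caught, 2 missed, 1 false alarm, 89 ham passed.
m = metrics_from_counts(tp=8, tn=89, fp=1, fn=2)
print(m)
```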

In [40]:
# Evaluate the classifier on a labelled dataset (training or test split)


def validate_performance(dataset):
    p_spam = dataset.loc[dataset['label'] == 1].shape[0]/ dataset.shape[0]
    p_ham = dataset.loc[dataset['label'] == 0].shape[0]/ dataset.shape[0]

    true_pos = 0
    true_neg = 0
    false_pos = 0
    false_neg = 0

    for _, row in dataset.iterrows():
        phrase = row['message']
        phrase_stemmed = stem_text(phrase)
        prob_spam_p = p_spam
        prob_ham_p = p_ham
        for w in phrase_stemmed:
            # Only update a class score if the word appeared in that class during training
            if spam_ocurr[w] != 0:
                prob_spam_p *= prob_w_spam(w) / prob_w(w)

            if ham_ocurr[w] != 0:
                prob_ham_p *= prob_w_ham(w) / prob_w(w)

            # NOTE: the prediction and tally below sit inside the word loop, so the
            # counts grow once per word (using the words seen so far), not once per
            # message. Dedent this block one level to score each message exactly once.
            pred_label = 0
            if prob_spam_p > prob_ham_p:  # Pick whichever score is greater
                pred_label = 1

            if pred_label == 1 and row['label'] == 1:
                true_pos += 1
            if pred_label == 0 and row['label'] == 0:
                true_neg += 1
            if pred_label == 1 and row['label'] == 0:
                false_pos += 1
            if pred_label == 0 and row['label'] == 1:
                false_neg += 1

    calculate_metrics(true_pos, true_neg, false_pos, false_neg)
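
Multiplying many small probabilities can underflow for long messages, and a per-message evaluation needs exactly one prediction per message. The sketch below shows one way to do both, by summing log-probabilities and tallying once per message. It is an alternative outline, not the `validate_performance` function above; the toy likelihoods, the `UNSEEN` floor and the helper names are assumptions.

```python
import math
from collections import Counter

# Hypothetical per-class word likelihoods and priors (illustrative values).
likelihood = {
    1: {"free": 0.05, "call": 0.04, "later": 0.001},  # spam
    0: {"free": 0.002, "call": 0.01, "later": 0.02},  # ham
}
prior = {1: 0.13, 0: 0.87}
UNSEEN = 1e-6  # assumed floor for words missing from a class

def predict(tokens):
    """Return the label with the larger log prior + sum of log likelihoods."""
    scores = {}
    for label in prior:
        scores[label] = math.log(prior[label]) + sum(
            math.log(likelihood[label].get(tok, UNSEEN)) for tok in tokens
        )
    return max(scores, key=scores.get)

# One prediction, and one tally entry, per message.
toy_test = [(1, ["free", "call"]), (0, ["call", "later"]), (0, ["later"])]
tally = Counter()
for true_label, tokens in toy_test:
    tally[(true_label, predict(tokens))] += 1
print(tally)
```

The keys of `tally` are `(true label, predicted label)` pairs, so the four confusion-matrix cells can be read off directly.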

In [41]:
validate_performance(train)

Precision:  0.9531122745782432
Recall:  0.8731613728416115
F-score:  0.9113867719864273
Accuracy:  0.9593808965271049


In [42]:
validate_performance(test)

Precision:  0.9192433612222627
Recall:  0.8063178047223994
F-score:  0.8590855005949346
Accuracy:  0.9372730024213075


## Interpretation
The model performs well on both the training and the test data. Because the counts are tallied inside the word loop of `validate_performance`, each metric is computed over word-level predictions rather than over whole messages.

Precision is high (0.95 train, 0.92 test), meaning that most of what the classifier labels as spam really is spam.

Recall is lower (0.87 train, 0.81 test), indicating that some spam is missed, although most of it is still caught.

F-score (0.91 train, 0.86 test) reflects a reasonable balance between precision and recall.

Accuracy remains high (0.96 train, 0.94 test). The drop from training to test is small for accuracy (about 2 points) and larger for recall (about 7 points), which suggests the model generalizes reasonably well with only mild overfitting.

## Bibliography
Uzo. (2020, October 9). Building a Spam Classifier in Python From Scratch. Medium. Retrieved August 17, 2025, from https://medium.com/@uzoeze/building-spam-classifier-nlp-in-python-from-scratch-a103ffdea411

thecodinguru. (n.d.). SpamControl-NLP: A Spam Classifier NLP project labeling spam messages or authentic messages (MIT License) [Repository]. GitHub. Retrieved August 18, 2025, from https://github.com/thecodinguru/SpamControl-NLP
