# iPhone Tweet Sentiment Classifier

This notebook demonstrates a Naive Bayes classifier implementation to analyze sentiment and relevance in tweets about iPhones.

Authors: [Enricco Gemha](https://github.com/G3mha), [Marcelo Barranco](https://github.com/Maraba23), [Rafael Leventhal](https://github.com/rafaelcl292)

Date: 2021-09-27

---

## Loading Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import re
import nltk

# Download required NLTK packages
nltk.download('rslp');
nltk.download('stopwords');

In [2]:
print('Working directory:')
print(os.getcwd())

Loading the dataset containing tweets classified according to the criteria described in the README.md:

In [3]:
filename = 'iphone.xlsx'

In [4]:
train = pd.read_excel(filename)
train.head()

In [5]:
train.RELEVÂNCIA.value_counts(normalize=True)

In [6]:
test = pd.read_excel(filename, sheet_name='Teste')
test.head()

---

## Automated Sentiment Classifier


### Classification Categories

Product: iPhone

- **VERY IRRELEVANT (0)**: Off-topic tweets, unrelated to iPhone, or tweets with minimal content (e.g., just a hashtag)
- **IRRELEVANT (1)**: Sales advertisements (e.g., "Buy now at Magalu")
- **NEUTRAL (2)**: Jokes about iPhone (e.g., "iPhone is like a mini Corsa lol")
- **RELEVANT (3)**: Indirect comments related to iPhone (e.g., "My science teacher spent 30 minutes just talking about his new iPhone")
- **VERY RELEVANT (4)**: Direct comments about iPhone - opinions, questions, or purchase intent (e.g., "iPhone 13 will have to wait a bit longer to reach my hands")

---

### Building a Naive-Bayes Classifier

Training the classifier using only the messages from the Training spreadsheet.

### Defining the Tweet Cleaning Function

The function below is responsible for removing:
- Links (HTTP, HTTPS, and FTP)
- Username tags (@)
- Punctuation marks (! - _ . : ? ; [] \ /)

Additionally, the function performs the following steps, in this order:
- TOKENIZING with `TweetTokenizer` (the NLTK documentation describes a tokenizer as one "that divides a string into substrings by splitting on the specified string.")
- STEMMING with `RSLPStemmer` (described in the NLTK documentation as: "A processing interface for removing morphological affixes from words.")

In [7]:
def clean_data(text):
    # Remove URLs
    http_re = r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
    text = re.sub(http_re, '', text)

    # Remove usernames
    text = re.sub(r'@[^\s]*', '', text)

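    # Remove punctuation marks (the hyphen is escaped; an unescaped '!-_'
    # is a range that would also strip digits and uppercase letters)
    text = re.sub(r'[!\-_.:?;\[\]\\/]', '', text)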

    # Tokenize
    tokenizer = nltk.tokenize.casual.TweetTokenizer()
    text = tokenizer.tokenize(text)

    # Stemming
    stemmer = nltk.stem.RSLPStemmer()
    text = list(map(stemmer.stem, text))

    return text
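
A note on the punctuation pattern: inside a regular-expression character class, an unescaped hyphen between two characters denotes a range, so `!-_` matches every character from `!` to `_` in ASCII order, which includes digits and uppercase letters. The sketch below uses a made-up sample tweet (not from the dataset) to contrast a class with an unescaped hyphen against one where the hyphen is escaped.

```python
import re

# Unescaped hyphen: '!-_' is read as the range '!' (0x21) to '_' (0x5F),
# which covers digits and uppercase ASCII letters as well as punctuation.
range_class = r'[!-_.:?;[\]/]'

# Escaped hyphen: only the listed punctuation characters are matched.
literal_class = r'[!\-_.:?;\[\]\\/]'

sample = 'Meu iPhone 13 chegou! :-)'  # hypothetical tweet

stripped_by_range = re.sub(range_class, '', sample)
stripped_by_literal = re.sub(literal_class, '', sample)

print(repr(stripped_by_range))    # digits and capitals are gone
print(repr(stripped_by_literal))  # only the listed punctuation is gone
```

With the escaped form, model numbers such as "13" survive cleaning, which may matter for tweets about specific iPhone versions.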


In [8]:
# Data preparation
train.loc[:, 'Treinamento'] = train.Treinamento.apply(clean_data)
train_relevance = train.RELEVÂNCIA.map({
    0: 'very irrelevant',
    1: 'irrelevant',
    2: 'neutral',
    3: 'relevant',
    4: 'very relevant'
})

categories = ['very irrelevant', 'irrelevant',
              'neutral', 'relevant', 'very relevant']

train.loc[:, 'RELEVÂNCIA'] = pd.Categorical(
    train_relevance, categories=categories, ordered=True
)
train.head()

In [9]:
train.RELEVÂNCIA.describe()

## Building the Naive Bayes Classifier

The process consists of training the classifier using ONLY the tweets in the training spreadsheet (train), with the following steps:

1. Create the list of relevance categories
2. Collect every word across all tweets and build the set of distinct words (the vocabulary)
3. Create a dictionary with the words of each category
4. Create a dictionary with the frequency count of each word in each category
5. Compute the prior probability of each category

In [10]:
# Building the classifier
categories = ['very irrelevant', 'irrelevant',
              'neutral', 'relevant', 'very relevant']

# List of words/emojis present in the entire database
total_words = sum(train.Treinamento, [])

# Set of distinct words/emojis in the entire database (the vocabulary)
unique_words = set(total_words)

# Dictionary of words by category
words_by_category = {
    category: sum(train[train.RELEVÂNCIA == category].Treinamento, [])
    for category in categories
}

# Word occurrence count by category
word_occurrence_by_category =  {
    category: {
        word: words_by_category[category].count(word)
        for word in unique_words
    }
    for category in categories
}

# Prior probability of each category, estimated from its share of all words
prob_by_category = {
    category: len(words_by_category[category]) / len(total_words)
    for category in categories
}
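
The counting steps above can also be sketched with `collections.Counter` and `itertools.chain`, which avoid rescanning the token list once per vocabulary word. This is only an illustrative alternative on made-up data; the `toy` frame, its column names, and the variable names are assumptions, not part of the notebook.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical, already-cleaned tweets with their labels
toy = pd.DataFrame({
    'tokens': [['iphon', 'bom'], ['compr', 'iphon'], ['iphon']],
    'label': ['relevant', 'irrelevant', 'relevant'],
})

# Flatten the per-tweet token lists in a single linear pass
all_tokens = list(chain.from_iterable(toy['tokens']))
vocabulary = set(all_tokens)

# One Counter per category; missing words simply count as 0
counts_by_label = {
    label: Counter(chain.from_iterable(group['tokens']))
    for label, group in toy.groupby('label')
}

print(counts_by_label['relevant']['iphon'])
print(counts_by_label['irrelevant']['bom'])
```

Because a `Counter` returns 0 for keys it has never seen, there is no need to pre-fill every vocabulary word for every category.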

## Probability Calculation

In this section, we define the probability of a phrase belonging to a specific category.
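
In symbols, and using notation introduced here only for illustration ($N_c$ for the number of words in category $c$, $N$ for the number of words in the whole training set, $V$ for the vocabulary), the score computed for a category $c$ and a cleaned phrase $w_1, \dots, w_n$ can be written as:

$$
\text{score}(c) \;=\; \frac{N_c}{N} \, \prod_{i=1}^{n} \frac{\operatorname{count}(w_i, c) + 1}{N_c + |V|}
$$

The first factor is the prior of the category and each factor in the product is a Laplace-smoothed word likelihood, with $\operatorname{count}(w_i, c) = 0$ for words never seen in $c$. The score is proportional to the probability of the category given the phrase under the naive independence assumption; the normalizing constant is the same for every category, so it can be omitted when only the best category is needed.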

In [11]:
# Function that calculates the probability of a phrase belonging to a given category
def prob_phrase(category, phrase):
    '''
    Calculates the probability of a phrase being in a category
    '''
    # Clean the phrase if provided as a string
    # (isinstance is needed here: `phrase is str` is never true for a string value)
    if isinstance(phrase, str):
        phrase = clean_data(phrase)
    # Probability calculation
    return prob_by_category[category] * np.array(list(
        # Probability of each word with Laplace smoothing
        (((word_occurrence_by_category[category][word] + 1)
        if word in unique_words else 1) /
         (len(words_by_category[category]) + len(unique_words)))
        for word in phrase
    )).prod()  # Product of each word's probability
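
Multiplying many small probabilities can underflow to zero for long phrases. A common workaround, sketched here on a tiny hand-made set of counts (all `toy_*` names, `log_prob_phrase`, and `classify_log` are hypothetical), is to sum logarithms instead; the ranking of categories is unchanged because the logarithm is monotonic.

```python
import math

# Hypothetical stemmed-token counts per category (not taken from the dataset)
toy_counts = {
    'relevant': {'iphon': 2, 'bom': 1, 'quer': 1},
    'irrelevant': {'compr': 1, 'agor': 1, 'promoc': 1},
}
toy_vocabulary = {word for counts in toy_counts.values() for word in counts}
toy_tokens_in = {cat: sum(counts.values()) for cat, counts in toy_counts.items()}
toy_total_tokens = sum(toy_tokens_in.values())


def log_prob_phrase(category, tokens):
    """Log of the prior times the Laplace-smoothed word likelihoods."""
    denominator = toy_tokens_in[category] + len(toy_vocabulary)
    # Word-share prior, mirroring how the notebook estimates category priors
    log_p = math.log(toy_tokens_in[category] / toy_total_tokens)
    for word in tokens:
        count = toy_counts[category].get(word, 0)  # unseen words count as 0
        log_p += math.log((count + 1) / denominator)
    return log_p


def classify_log(tokens):
    """Category with the highest log-score."""
    return max(toy_counts, key=lambda cat: log_prob_phrase(cat, tokens))


long_phrase = ['iphon'] * 2000
plain_product = math.prod([3 / 10] * 2000)             # underflows
log_score = log_prob_phrase('relevant', long_phrase)   # stays finite
print(plain_product, log_score)
```

Tweets are short, so underflow is unlikely in this dataset; the log form mainly matters if the same approach is reused on longer texts.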

## Building the Classifier Function

Finally, we structure the classifier function that returns the category with the highest probability of containing a given phrase, using the probability calculated in the previous cell.

In [12]:
def classifier(phrase):
    '''
    Returns the category with the highest probability of containing the phrase
    '''
    return max(
        categories, key=lambda category: prob_phrase(category, phrase)
    )

---

### Verifying Classifier Performance

Now we test our classifier with the test dataset.


In [13]:
# Checking the dataframe we'll work with
test.head()

In [14]:
# Preparing the test dataframe
test.loc[:, 'Teste'] = test.Teste.apply(clean_data)
test_relevance = test.RELEVÂNCIA.map(
    {0: 'very irrelevant',
     1: 'irrelevant',
     2: 'neutral',
     3: 'relevant',
     4: 'very relevant'}
)

# Creating the test categories
test_categories = ['very irrelevant', 'irrelevant',
              'neutral', 'relevant', 'very relevant']

# Unifying information provided by manual classification
test.loc[:, 'RELEVÂNCIA'] = pd.Categorical(
    test_relevance, categories=test_categories, ordered=True
)
test.head()

In [15]:
# Applying the classifier to the test table
test.loc[:, 'Classifier'] = pd.Categorical(
    test.Teste.apply(classifier), categories=categories, ordered=True
)
test.head()

In [16]:
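# Performance of classification on the training set itself:
# compare the classifier's predictions (not the token lists) with the labels
predicted_train = train.Treinamento.apply(classifier)
sum(predicted_train == train.RELEVÂNCIA.astype(str)) / train.shape[0]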

In [17]:
# Showing the classifier performance
sum(test.RELEVÂNCIA == test.Classifier) / test.shape[0]

In [18]:
# Creating a crosstab containing the accuracy rate of the test by relevance
test['Correct'] = test.Classifier == test.RELEVÂNCIA
pd.crosstab(test.Correct, test.RELEVÂNCIA, normalize='columns')

In [19]:
# Creating a crosstab to evaluate the correspondence of relevance
pd.crosstab(test.Classifier, test.RELEVÂNCIA)
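
A crosstab of predictions against manual labels can also be turned into per-category precision and recall. The sketch below uses made-up labels purely for illustration (they are not the notebook's results) and assumes every category appears at least once among both the predictions and the manual labels, so that the table is square.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, for illustration only
actual = pd.Series(
    ['relevant', 'relevant', 'neutral', 'irrelevant', 'relevant', 'neutral'])
predicted = pd.Series(
    ['relevant', 'neutral', 'neutral', 'irrelevant', 'relevant', 'relevant'])

# Rows: predicted category, columns: manually assigned category
confusion = pd.crosstab(predicted, actual)

# Correct predictions sit on the diagonal of the square table
hits = pd.Series(np.diag(confusion.to_numpy()), index=confusion.index)

precision = hits / confusion.sum(axis=1)  # correct / everything predicted as c
recall = hits / confusion.sum(axis=0)     # correct / everything labeled as c

print(pd.DataFrame({'precision': precision, 'recall': recall}))
```

Overall accuracy can hide a category that is almost never predicted correctly; per-category recall makes that visible.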

### Understanding the Naive-Bayes Classifier Concept

The classifier cannot be used to generate additional training tweets, because the result would be biased: it would not produce samples with new words, only reuse the pool of words already present in the original dataset.

---

### Classifier Quality through New Splits between Training and Test Data


### New Iterations
We randomized the split between Training and Test tweets and measured the accuracy over 100 repetitions. **Results follow below.**

In [20]:
# Combining both dataframes, Training and Test
# (pd.concat is used because Series.append was removed in pandas 2.0)
df_complete = pd.DataFrame({
    'Tweets': pd.concat([train.Treinamento, test.Teste], ignore_index=True),
    'Relevance': pd.concat([train.RELEVÂNCIA, test.RELEVÂNCIA], ignore_index=True),
})
df_complete.head()

In [21]:
# Recreating the classification categories
# Simplifying to 3 categories by merging very irrelevant with irrelevant and very relevant with relevant

new_relevance = []
for r in df_complete.Relevance:
    if r == "very irrelevant":
        r = "irrelevant"
    if r == "very relevant":
        r = "relevant"
    new_relevance.append(r)
df_complete["Relevance"] = new_relevance
df_complete.head()

In [22]:
# Building the list of categories
categories = ['irrelevant','neutral', 'relevant']

# List for classifier performance
performances = []

# Running performance test 100 times
for _ in range(100):
    # Randomizing tweets
    df_complete = df_complete.sample(frac=1).reset_index(drop=True)
    train = df_complete.iloc[:750]
    test = df_complete.iloc[750:1000]

    # Total words/emojis in the entire database
    total_words = sum(train.Tweets, [])
    # Unique words/emojis in the entire database
    unique_words = set(total_words)

    # Words by category
    words_by_category = {
        category: sum(train[train.Relevance == category].Tweets, [])
        for category in categories
    }

    # Word occurrence count by category
    word_occurrence_by_category =  {
        category: {
            word: words_by_category[category].count(word)
            for word in unique_words
        }
        for category in categories
    }

    # Probability by category
    prob_by_category = {
        category: len(words_by_category[category]) / len(total_words)
        for category in categories
    }


    def prob_phrase(category, phrase):
        '''
        Calculates the probability of a phrase being in a category
        '''
        # Clean the phrase if provided as a string
        if isinstance(phrase, str):
            phrase = clean_data(phrase)
        
        # Probability calculation
        return prob_by_category[category] * np.array(list(
            # Probability of each word with Laplace smoothing
            (((word_occurrence_by_category[category][word] + 1)
            if word in unique_words else 1) /
            (len(words_by_category[category]) + len(unique_words)))
            for word in phrase
        )).prod()  # Product of each word's probability


    def classifier(phrase):
        '''
        Returns the category with the highest probability of containing the phrase
        '''
        return max(
            categories, key=lambda category: prob_phrase(category, phrase)
        )


    correct = sum(test.Tweets.apply(classifier) == test.Relevance)
    performance = correct / test.shape[0]
    performances.append(performance)
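
The shuffle-and-slice step in the loop above can be sketched as a small helper that takes a seed, so that a given split can be reproduced later. `random_split`, its parameters, and the `toy` frame are hypothetical names introduced for this example; the 75% training fraction is just the ratio implied by a 750/250 split.

```python
import pandas as pd


def random_split(df, train_fraction, seed):
    """Shuffle df reproducibly and split it by fraction (hypothetical helper)."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    cut = int(len(shuffled) * train_fraction)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]


# Made-up stand-in for the combined dataframe
toy = pd.DataFrame({
    'Tweets': [[str(i)] for i in range(20)],
    'Relevance': ['relevant', 'neutral'] * 10,
})

train_part, test_part = random_split(toy, train_fraction=0.75, seed=42)
print(len(train_part), len(test_part))
```

Using the iteration index as the seed would keep the 100 splits different from one another while still making each one repeatable.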

In [23]:
# Plotting histogram for the 100 tests
plt.hist(performances, edgecolor='white', density=True)
plt.title('Histogram of performance obtained from 100 different classifier iterations')
plt.ylabel('Density')
plt.xlabel('Performance')
plt.show()

In [24]:
# Getting some important statistics from the 100 tests to support the histogram above
pd.Series(performances).describe()

---

# CONCLUSION

## Best Approach for Classifier Construction
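Based on the histogram above and the performance metrics obtained throughout this notebook, we can infer that randomizing the split and repeating the test many times gives a more reliable picture of the Naive Bayes classifier's performance: it reduces the bias of relying on a single Training/Test split and widens the range of data used for testing. Note that these runs use three merged categories instead of the original five, so their accuracy is not directly comparable to the single-split result.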


## Applications of the Classifier Beyond This Project
Based on what we've presented in this notebook, the applications of the Naive Bayes classifier are extensive. For example, a company could use it to evaluate the reception of a product or advertising campaign. It could also be applied in social media engagement algorithms, as mentioned in Fabio Akita's blog post (see references). Other possible applications include healthcare, such as interpreting disease tests (calculating false positive and false negative probabilities), behavioral science, studies of conditions like Alzheimer's and personality disorders, and the well-known spam filtering in email inboxes.

## Real Improvements for the Naive Bayes Classifier
Throughout the project, we observed that the classifier has some limitations: ironic tweets and tweets containing double negation can be classified incorrectly. To improve accuracy, the set of classified tweets should be expanded, both the manually labeled data and the training set, to minimize the errors and inconsistencies inherent in the tweet separation process. Finally, it is worth implementing a Monte Carlo method: generate a random number between 0 and 1 and, if it is less than the probability of a given tweet being relevant, classify the tweet as relevant; otherwise, classify it as not relevant. This method becomes considerably more effective as the database grows.
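
The Monte Carlo idea described above can be sketched as follows. `monte_carlo_label` and the `'not relevant'` label are hypothetical names, and the probability 0.8 is an arbitrary value chosen for the demonstration rather than an output of the classifier.

```python
import random


def monte_carlo_label(prob_relevant, rng):
    """Label a tweet 'relevant' with probability prob_relevant."""
    # rng.random() is uniform on [0, 1), so the comparison below is
    # true with probability prob_relevant
    return 'relevant' if rng.random() < prob_relevant else 'not relevant'


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [monte_carlo_label(0.8, rng) for _ in range(10_000)]
share_relevant = draws.count('relevant') / len(draws)
print(share_relevant)
```

Over many draws the share of "relevant" labels approaches the given probability, which is why the approach is better suited to large collections of tweets than to individual decisions.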

---

# References

- [Naive Bayes and Text Classification](https://arxiv.org/pdf/1410.5329.pdf) (**more comprehensive**)
- [A practical explanation of a Naive Bayes Classifier](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/) (**simpler explanation**)
- [Blog post on Bayes' theorem applications by Fabio Akita](https://www.akitaonrails.com/2020/09/30/akitando-84-entendendo-o-dilema-social-e-como-voce-e-manipulado)
- [3Blue1Brown: Bayes Theorem](https://www.youtube.com/watch?v=HZGCoVF3YvM)
- [Veritasium: The Bayesian Trap](https://www.youtube.com/watch?v=R13BD8qKeTg)
