# COMP 3610 - PROJECT CEEJMO - DEMONSTRATION

| UWI ID   | NAME    |
| -------- | ------- |
| `816031687` |  `RAUL ALI`   |
| `816030501` |  `JOSIAH JOEL`   |
| `816030814` |  `DAYANAND MOONOO`   |
| `816031173` |  `ZACHARY RAMPERSAD`   |

NOTES:
- This is a multi-class classification problem involving Natural Language Processing.
- In this notebook, we demonstrate the classifier models we have built:
    - take a random instance from our data,
    - process the input and vectorize it using the trained TF-IDF vectorizer,
    - generate the probability of the instance belonging to each class using the trained classification model.
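
The flow in the notes relies on reusing a vectorizer that was already fitted on the training text, so new input is only transformed, never refitted. The sketch below is a minimal pure-Python illustration of that idea, not the project's pipeline: `fit_idf` and `transform` are hypothetical helper names, tokenization is a plain whitespace split, and the smoothed IDF formula is assumed to match the default that scikit-learn documents for its TF-IDF vectorizer.

```python
import math
from collections import Counter

def fit_idf(corpus):
    """Learn a vocabulary and smoothed IDF weights from training texts."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.split()))
    # Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def transform(text, idf):
    """Vectorize new text with the already-fitted vocabulary (no refitting)."""
    # Words that were never seen during fitting are dropped
    counts = Counter(tok for tok in text.split() if tok in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    # L2-normalize so the vector has unit length
    return {term: w / norm for term, w in weights.items()} if norm else {}

train_texts = ["i am so happy today", "i am so angry today", "i love this"]  # toy data
idf = fit_idf(train_texts)
vector = transform("i am happy and excited", idf)
print(vector)
```

Words such as "and" and "excited" do not appear in the toy training texts, so they contribute nothing to the vector; rarer training words such as "happy" receive a larger weight than "i", which occurs in every document.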

In [1]:
# Dataframe / Visualization Imports
import pandas as pd
import numpy as np
import random

# Text Cleaning Imports
import re
#!pip install emoji
import emoji
from bs4 import BeautifulSoup
import string
from string import punctuation
# from itertools import chain

# Model Imports
from joblib import load

In [2]:
# Seed value for reproducibility (note: it is not passed to `random` below, so the sampled instance varies between runs)
seed_value = 42

# Import data into a data frame
df=pd.read_csv("text.csv", index_col=0)
# df = pd.read_csv("https://drive.google.com/uc?export=download&id=1mQryd71hRYMLl3vzyCS9deedDQak_1LG", index_col=0)

# Print the last 5 rows to get a general idea of the dataset
df.tail()

| Unnamed: 0 | text | label |
| ---------- | ---- | ----- |
| 416804 | i feel like telling these horny devils to find... | 2 |
| 416805 | i began to realize that when i was feeling agi... | 3 |
| 416806 | i feel very curious be why previous early dawn... | 5 |
| 416807 | i feel that becuase of the tyranical nature of... | 3 |
| 416808 | i think that after i had spent some time inves... | 5 |


For this demonstration we use our trained `Random Forest Classifier`.

In [24]:
# Load the saved models
loaded_model = load('randf_tfidf.joblib')
loaded_vectr = load('tfidf_for_randf.joblib')

## DATA PRE-PROCESSING / CLEANING

We perform the following steps to refine our corpus of text:
- Removing links
- Removing punctuation
- Removing HTML tags
- Replacing emojis with their word representation
- Converting text to lowercase
- Removing additional white spaces
- Removing a few very common words (`feeling`, `feel`, `like`, `im`)
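
As a rough illustration of the steps listed above, the following sketch uses only the Python standard library. It is a simplification under stated assumptions, not the notebook's function: `clean_text` is a hypothetical name, HTML tags are stripped with a simple regular expression instead of BeautifulSoup, and the emoji step is left out because it needs the third-party `emoji` package. Tags are handled before punctuation, since stripping `<` and `>` first would leave nothing for a tag pattern to match.

```python
import re
from string import punctuation

def clean_text(text):
    text = re.sub(r'<[^>]+>', ' ', text)   # drop HTML tags (simplified pattern)
    text = re.sub(r'http\S+', '', text)    # drop links
    text = text.translate(str.maketrans('', '', punctuation))  # drop punctuation
    text = text.lower()                    # str.lower() returns a new string
    return ' '.join(text.split())          # collapse repeated white space

sample = "<p>I FEEL   great!!! See https://example.com/page</p>"  # made-up input
cleaned = clean_text(sample)
print(cleaned)  # i feel great see
```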

In [4]:
# Function to remove a specified word from a text
def remove_word_from_string(word, text):
    # The parameter is named `text` so it does not shadow the imported `string` module

    # Construct a regular expression pattern that matches the word only as a whole word
    pattern = r'\b{}\b'.format(re.escape(word))

    # Use re.sub() to replace the matched word with an empty string
    return re.sub(pattern, '', text)
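
To see what the `\b` word-boundary anchors achieve, here is a small self-contained check. It defines a local copy of the helper under the hypothetical name `remove_word` and runs it on made-up input.

```python
import re

def remove_word(word, text):
    # Local copy of the helper above, for illustration only
    return re.sub(r'\b{}\b'.format(re.escape(word)), '', text)

result = remove_word('feel', 'i feel that feelings matter')
print(repr(result))  # the standalone word is removed, 'feelings' is untouched
```

The removal leaves a double space behind, which whitespace-based tokenization generally ignores; `' '.join(result.split())` collapses it if a clean string is needed.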

In [5]:
# Function to preprocess text and vectorize it with the trained TF-IDF vectorizer
def preprocess_input(text, loaded_vectr):

    text = BeautifulSoup(text, "html.parser").get_text() # Remove HTML tags (done before punctuation is stripped, while the tags are still intact)
    text = emoji.demojize(text, delimiters=("", "")) # Replace emojis with word representation
    text = re.sub(r'http\S+', '', text) # Remove links
    text = text.translate(str.maketrans('', '', punctuation)) # Remove punctuation
    text = text.lower() # Convert all text to lowercase (str.lower() returns a new string)

    words_to_remove = ['feeling', 'feel', 'like', 'im'] # Remove common words
    for word in words_to_remove:
        text = remove_word_from_string(word, text)

    text = ' '.join(text.split())   # Remove additional white spaces, including gaps left by removed words

    # Vectorize with the already-fitted vectorizer (transform only, no refitting)
    text_trans = loaded_vectr.transform([text])

    return text_trans

In [6]:
# Function to generate the probabilities 
def generate_probs(text, loaded_model):
    # Obtain the probabilities for each class
    pred_probs = loaded_model.predict_proba(text)

    labels = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

    # Create a dictionary to store the probabilities with labels
    probs_dict = {}

    # Iterate over the probabilities array and labels simultaneously
    for label, prob in zip(labels, pred_probs[0]):
        probs_dict[label] = round(prob*100,5)

    print('\n',probs_dict)

    return probs_dict
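
A quick way to sanity-check the label mapping without the trained model is to run the same logic against a stand-in object. In the sketch below, `StubModel` and `probs_as_percent` are hypothetical and the probabilities are made up. It also assumes, as the function above does, that the columns returned by `predict_proba` follow the order of the `labels` list; for a scikit-learn classifier the column order is given by its `classes_` attribute.

```python
labels = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

class StubModel:
    """Hypothetical stand-in for the trained classifier."""
    def predict_proba(self, X):
        # One row per input, one column per class (made-up values)
        return [[0.05, 0.60, 0.20, 0.05, 0.05, 0.05] for _ in X]

def probs_as_percent(features, model):
    # Pair each label with the probability of the first (only) input row
    row = model.predict_proba(features)[0]
    return {label: round(p * 100, 5) for label, p in zip(labels, row)}

probs = probs_as_percent(['placeholder'], StubModel())
print(probs)
```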

## CLASSIFICATION

Take a random text sample from the DataFrame and classify it.

In [28]:
# Take a random instance from df
# (random.randint includes both endpoints, so randrange is used to keep the index below len(df))
r_num = random.randrange(len(df))
# r_num = 209283
X_inst = df.at[r_num, 'text']
y_inst = df.at[r_num, 'label']

print(f'({r_num}). Class {y_inst} : {X_inst}')

X_inst = preprocess_input(X_inst,loaded_vectr)

probs = generate_probs(X_inst,loaded_model)

(175300). Class 3 : i know i am a sick fuck but i am not going to lie about how i am feeling because it is my truth and sometimes my truth is rude and mean and not full of love

 {'sadness': 5.39561, 'joy': 6.30013, 'love': 6.01527, 'anger': 75.62315, 'fear': 4.8754, 'surprise': 1.79044}
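
The printed dictionary can be reduced to a single predicted label by taking the entry with the highest probability. This sketch copies the output above and assumes that the integer classes in the dataset index into the same `labels` order used by `generate_probs`, which is consistent with this example, where class 3 lines up with `anger`.

```python
labels = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
probs = {'sadness': 5.39561, 'joy': 6.30013, 'love': 6.01527,
         'anger': 75.62315, 'fear': 4.8754, 'surprise': 1.79044}
true_class = 3  # label of the sampled instance shown above

predicted = max(probs, key=probs.get)      # label with the highest probability
predicted_class = labels.index(predicted)  # map the label back to its integer class
print(predicted, predicted_class == true_class)  # anger True
```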
