## Exploring NLPs and LLMs
Hello! This is a demo of the types of tasks you might run into, and of some of the tools you can use for them, covering both traditional NLP methods and LLMs! This is a taster: there are many more types of tasks, and even more ways in which you can complete them!
This will focus on open source models and datasets that we can easily access from HuggingFace (HF)!

### **Sentiment analysis**
You can use traditional NLP methods to gauge the sentiment present in a text. Classic tools such as VADER and TextBlob are lightweight and fast, making them a godsend for big datasets, but they struggle with longer sentences and nuance, such as sarcasm. On the other hand, transformer models such as the one used here take more time, but can offer better classification and might be beneficial for text that is not as straightforward. However, remember that models get things wrong too: before running this on a large dataset, manually evaluate the results on a small one.

In [None]:
#Pip install the packages we need
#Because of course they don't come on vanilla colab
!pip install vaderSentiment
!pip install transformers
!pip install textstat
!pip install textblob

In [None]:
#Load in the packages
#Some stat and data handling packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

#A whole host of NLP and HF packages
import re  #This helps with regular expressions. Unique symbols? Arcane knowledge? Someone who keeps having newlines in their text? Use re to clean the text!
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer #This is the VADER sentiment analyser - easy to use, 10/10 elite employee
from transformers import pipeline #Very important for using Hugging Face models.
import textstat #Do you need more information about the word count? Readability scores (in English)? This has got you fam.
from textblob import TextBlob #This can also give you the sentiment, but has an additional component! You can look at the subjectivity of the text!
import nltk #This is the all-in-one traditional NLP toolkit. Want to do anything? It's got you. That said, I don't like using it lol. It's here to show you some of the stuff you can use
from nltk.sentiment import SentimentIntensityAnalyzer as NltkSentimentIntensityAnalyzer #nltk's own copy of VADER, renamed so it doesn't overwrite the vaderSentiment import above
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Download resources for the nltk packages
# We're probably not going to use them, but I'm always paranoid about leaving them out when I analyse text
nltk.download('vader_lexicon')
nltk.download('punkt_tab')

In [None]:
#Creating a random dataset - you can change this up to see how the different scores you get look!
#This is a list. You list things out in a list
reviews = ['Man, the food here was nasty. Would not recommend.',
             'I absolutely loved everything on the menu!',
             'It was fine, I guess.',
             'Bad food, poor service.',
             'Mediocre at best, unappetizing at its worst.',
             'That was some really amazing pasta!',
             'The tiramisu was delicious. Good service, recommended the restaurant to my friends!',
             'Lovely ambience, terrible food.']

In [None]:
#As I am allergic to lists, I am converting it into a dataframe. All hail pandas
sent_text = pd.DataFrame(reviews, columns=['Reviews'])

In [None]:
#Check
sent_text.head(10)

In [None]:
#Okay, we have our food reviews for this totally existing Italian place
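#Now I want you to add a column with the 'ground truth'
#Or rather the vibe the review is giving: replace each 'vibe' placeholder below with
#'positive', 'negative' or 'neutral' (lowercase, so they match the labels the tools produce later)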
sent_text['Ground Truth'] = ['vibe1', 'vibe2', 'vibe3', 'vibe4',
                             'vibe5', 'vibe6', 'vibe7', 'vibe8']

In [None]:
#The data needs cleaning and normalisation
#Cleaning the text data to remove the weird spaces and stuff
#Having it as a function lets you come back and add more bizarre conditions depending on what your data looks like
def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove underscores
    text = text.replace('_', '')
    # Remove special characters but keep letters, digits and common punctuation
    # (this also strips emojis and accented letters)
    text = re.sub(r"[^a-z0-9\s.,!?;:'\"()\[\]{}\-]", '', text)
    # Collapse runs of whitespace (including newlines and carriage returns) into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

In [None]:
#Apply the function
sent_text['Reviews'] = sent_text['Reviews'].apply(clean_text)
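
A quick way to see what the cleaning actually does is to run it on one messy string. The sketch below repeats the `clean_text` steps so that it runs on its own; the sample review is made up for this check.

```python
import re

def clean_text(text):
    # Same steps as the notebook's clean_text, repeated so this cell runs alone
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = text.replace('_', '')
    text = re.sub(r"[^a-z0-9\s.,!?;:'\"()\[\]{}\-]", '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical messy review: extra spaces, a URL, an underscore, a slash, an emoji, a newline
messy = 'Check   www.example.com NOW!! The_pasta was 10/10 \U0001F60B\n'
cleaned = clean_text(messy)
print(cleaned)  # → check now!! thepasta was 1010
```

Notice that removing the underscore and the slash glues things together (`thepasta`, `1010`) and that the emoji disappears. If that matters for your data, adjust the function before applying it.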

#### The Darth VADER edit
Well, it's just VADER. This is one of the (very many) lightweight tools that analyse informal, short-form content. It provides a continuous compound score from -1 to 1, which can essentially be grouped into 3 ranges: positive, negative and neutral.

In [None]:
# VADER analysis!
#Initial vader set up
vader = SentimentIntensityAnalyzer()
#Then we find out what VADER scores them - We get a score of the sentence from -1 to 1
sent_text['vader_compound'] = sent_text['Reviews'].apply(lambda x: vader.polarity_scores(x)['compound'])
#But reading the values directly is annoying - so we group them into 3: positive, neutral and negative, using a cut-off of 0.5 either side of zero
sent_text['vader_sentiment'] = sent_text['vader_compound'].apply(lambda x: 'positive' if x >= 0.5 else ('negative' if x <= -0.5 else 'neutral'))
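
The grouping step can also be written as a small named function, which makes it easier to play with the cut-off. This is only a sketch: `label_from_compound` and the example scores are made up for illustration. The notebook uses 0.5, while the VADER README suggests 0.05 as a typical threshold, so the same score can land in a different group depending on which one you pick.

```python
def label_from_compound(score, threshold=0.5):
    """Group a VADER compound score (-1 to 1) into positive, negative or neutral."""
    if score >= threshold:
        return 'positive'
    if score <= -threshold:
        return 'negative'
    return 'neutral'

# Hypothetical compound scores, not real VADER output
example_scores = [0.82, 0.30, 0.0, -0.30, -0.82]

# The notebook's cut-off of 0.5
strict = [label_from_compound(s) for s in example_scores]
# The 0.05 cut-off suggested in the VADER README
loose = [label_from_compound(s, threshold=0.05) for s in example_scores]

print(strict)  # → ['positive', 'neutral', 'neutral', 'neutral', 'negative']
print(loose)   # → ['positive', 'positive', 'neutral', 'negative', 'negative']
```

In the notebook you could use it as `sent_text['vader_compound'].apply(label_from_compound)`.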

In [None]:
#Then we look at it:
sent_text.head()

In [None]:
#Okay, time for the fun bit, more visualisations!
#Let's see how many there are in each category
f,ax = plt.subplots(figsize=(8,6))

#Count the reviews in each VADER sentiment category
sns.countplot(data = sent_text, ax = ax, x = 'vader_sentiment', palette = 'viridis')

# Labeling
ax.set_title('Sentiment of the food reviews with VADER')
ax.set_ylabel('Number of reviews')
ax.set_xlabel('Categories')

#### The Sesame Street BERT edit

Okay, we've tried that with VADER! Now let's see how it looks with an LLM! For this section we will be using NLPTown's bert-base-multilingual-uncased-sentiment model, which we will be getting from HuggingFace! This model has been fine-tuned on reviews and gives each text a rating from one to five stars!

In [None]:
#Setting up the LLM for this portion!
#First, we load the pre-trained BERT pipeline
bert_pipeline = pipeline('sentiment-analysis', model='nlptown/bert-base-multilingual-uncased-sentiment')

# Map BERT labels to positive, negative and neutral!
star_to_sentiment = {'1 star': 'negative', '2 stars': 'negative', '3 stars': 'neutral',
                     '4 stars': 'positive', '5 stars': 'positive'}

In [None]:
#We can then create a function to do the classifying for us here!
#Create a BERT classifier function
def classify_bert(text):
    try:
        # truncation=True cuts the input at the model's 512-token limit; [0] takes the top prediction
        result = bert_pipeline(text, truncation=True, max_length=512)[0]
        return star_to_sentiment[result['label']] #This converts the stars to our labels
    except Exception:
        return 'error' #This catches any problems before everything crashes and burns and takes out weeks of work.

In [None]:
#Classify the reviews
sent_text['bert_sentiment'] = sent_text['Reviews'].apply(classify_bert)
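
Each pipeline result is a dict with a `label` and a `score`, and `classify_bert` only keeps the label. If you also want to know how sure the model was, something like the sketch below could help. `summarise_prediction`, the `min_score` cut-off and the example results are all made up for illustration; no model is loaded here.

```python
star_to_sentiment = {'1 star': 'negative', '2 stars': 'negative', '3 stars': 'neutral',
                     '4 stars': 'positive', '5 stars': 'positive'}

def summarise_prediction(result, min_score=0.5):
    """Turn one pipeline result dict into (sentiment, score, is_confident)."""
    sentiment = star_to_sentiment[result['label']]
    score = result['score']
    return sentiment, score, score >= min_score

# Hypothetical results, shaped like the dicts the pipeline returns
fake_results = [{'label': '5 stars', 'score': 0.91},
                {'label': '3 stars', 'score': 0.38}]

summaries = [summarise_prediction(r) for r in fake_results]
print(summaries)  # → [('positive', 0.91, True), ('neutral', 0.38, False)]
```

Low-confidence rows are good candidates for the manual check on a small sample.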

In [None]:
#Okay, time for the fun bit, more visualisations!
#Let's see how many there are in each category
f,ax = plt.subplots(figsize=(8,6))

#Count the reviews in each BERT sentiment category
sns.countplot(data = sent_text, ax = ax, x = 'bert_sentiment', palette = 'viridis')

# Labeling
ax.set_title('Sentiment of the food reviews for BERT')
ax.set_ylabel('Number of reviews')
ax.set_xlabel('Categories')

In [None]:
#Okay, let's see them all
sent_text.head(10)

As we can see, some of the phrases have been classified differently. BERT picked up the negative sentiment in 'Mediocre at best, unappetizing at its worst.', but misclassified 'That was some really amazing pasta!' and 'Lovely ambience, terrible food.' That said, this is the first time I've run this, and results can shift with model and library versions, so yours might look different. Who knows.
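
Rather than eyeballing the table, you can pull out just the rows where the two tools disagree. A minimal sketch, with a made-up helper `find_disagreements` and hypothetical labels (not real model output):

```python
def find_disagreements(texts, labels_a, labels_b):
    """Return (text, label_a, label_b) for every row where the two tools differ."""
    return [(t, a, b) for t, a, b in zip(texts, labels_a, labels_b) if a != b]

# Hypothetical labels for illustration only
texts = ['it was fine, i guess.',
         'bad food, poor service.',
         'lovely ambience, terrible food.']
vader_labels = ['neutral', 'negative', 'neutral']
bert_labels = ['neutral', 'negative', 'negative']

disagreements = find_disagreements(texts, vader_labels, bert_labels)
print(disagreements)  # → [('lovely ambience, terrible food.', 'neutral', 'negative')]
```

With the notebook's dataframe, the pandas equivalent is `sent_text[sent_text['vader_sentiment'] != sent_text['bert_sentiment']]`.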

We can also compare the tools' performance by checking each one against the ground truth you added to the reviews before we ran everything!
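
One simple way to do that comparison is accuracy: the share of reviews where a tool's label matches the ground truth. The sketch below assumes you have replaced the 'vibe' placeholders with lowercase 'positive', 'negative' or 'neutral'. The `accuracy` helper and all the labels shown are hypothetical.

```python
def accuracy(predicted, truth):
    """Fraction of rows where the predicted label matches the ground truth."""
    pairs = list(zip(predicted, truth))
    if not pairs:
        return 0.0
    return sum(p == t for p, t in pairs) / len(pairs)

# Hypothetical labels for illustration only
truth = ['negative', 'positive', 'neutral', 'negative']
tool_a_pred = ['negative', 'positive', 'neutral', 'neutral']
tool_b_pred = ['negative', 'positive', 'negative', 'positive']

tool_a_acc = accuracy(tool_a_pred, truth)
tool_b_acc = accuracy(tool_b_pred, truth)
print(tool_a_acc, tool_b_acc)  # → 0.75 0.5
```

In the notebook this would be `accuracy(sent_text['vader_sentiment'], sent_text['Ground Truth'])`, and the same with `'bert_sentiment'`. With only eight reviews, treat the numbers as a sanity check rather than a verdict.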


So remember: when you're using these techniques, always check the results on a sample of your work! And pick the right technique for the right problem!

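Also a potential advantage of LLMs like BERT: they can process things like emojis. Keep in mind, though, that the `clean_text` function above strips emojis, so you will need to loosen it if you want the model to see them.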

Now, you can change up the reviews! Add more reviews, change the words and add five emojis. See how the sentiment changes, and add your own spin!