
# What is Sentiment Analysis?
Sentiment analysis is the process of analyzing text to determine its emotional tone. In this dataset, each tweet is labeled positive, neutral, or negative.

Credit: https://www.datacamp.com/tutorial/text-analytics-beginners-nltk    

Dataset: https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset

# Preprocessing

Before we pass our data (the tweets) to the sentiment analyzer, we need to clean the text.

## **Tokenization:**
Tokenization is the first step in preprocessing our text. It breaks each tweet into smaller units called tokens, which can be words, subwords, or characters. Here we tokenize by word so that we can analyze the individual words of each tweet. The tokens are stored in a Python list.
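To make the idea concrete, here is a minimal sketch of word-level tokenization using only the standard library. The function name `simple_tokenize` and the regular expression are assumptions chosen for illustration. NLTK's `word_tokenize`, used later in this notebook, applies more elaborate rules, so its tokens will not always match these.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("My boss is bullying me...")
print(tokens)
# ['my', 'boss', 'is', 'bullying', 'me', '.', '.', '.']
```

Each tweet becomes a list of strings, which is the form the later filtering steps operate on.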

## **Removing Stop Words:**
The tweets include words that carry little to no sentiment, such as "and", "the", "of", "it", "from", and "to". These are called stop words. Removing them reduces noise that could make our analysis less accurate.
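Stop-word removal can be sketched as a simple membership filter. The `STOP_WORDS` set below is a small hand-picked stand-in, not NLTK's actual English list, and `remove_stop_words` is a hypothetical helper name.

```python
# Hypothetical, hand-picked stop-word set for illustration only
STOP_WORDS = {"and", "the", "of", "it", "from", "to", "is", "a"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]

filtered = remove_stop_words(
    ["the", "service", "from", "the", "airline", "is", "terrible"]
)
print(filtered)
# ['service', 'airline', 'terrible']
```

Using a set keeps each lookup fast. One caveat worth keeping in mind: NLTK's English stop-word list also contains negations such as "not", so removing stop words can change the polarity of a phrase like "not good".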

## **Stemming and Lemmatization:**
Stemming reduces words to a base form by stripping suffixes, which is fast but can produce forms that are not real words. Lemmatization also reduces words to a base form, but it takes each word's part of speech into account, so the result is a valid dictionary word. Lemmatization takes longer to run but preserves more of each word's meaning for the analysis. We will use the base word to decide whether it is positive or negative; the code in this notebook uses lemmatization.
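The difference between the two approaches can be illustrated with a toy comparison. Both functions below are deliberately naive assumptions: `naive_stem` strips a few fixed suffixes, and `naive_lemmatize` looks words up in a tiny hypothetical table. Real stemmers and NLTK's `WordNetLemmatizer` are far more thorough.

```python
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, with no knowledge of the word's meaning."""
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary
LEMMAS = {"studies": "study", "bullying": "bully", "better": "good"}

def naive_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMAS.get(word, word)

for word in ["studies", "bullying", "better"]:
    print(word, naive_stem(word), naive_lemmatize(word))
# studies stud study
# bullying bully bully
# better better good
```

The stem "stud" shows how suffix stripping can lose the original meaning, while mapping "better" to "good" requires dictionary knowledge that a suffix rule cannot provide.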

# Step 1: Load the Dataset and Libraries


In [None]:
# import libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import seaborn as sns

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

In [None]:
df = pd.read_csv('/content/Tweets.csv')
df['text'] = df['text'].astype(str)
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


# Step 2: Preprocess text

In [None]:
# Binary ground-truth label: 1 if the tweet is labeled positive, 0 if neutral or negative
df['positive'] = (df['sentiment'] == 'positive').astype(int)
df.head()

Unnamed: 0,textID,text,selected_text,sentiment,positive
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,0
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,0
2,088c60f138,my boss is bullying me...,bullying me,negative,0
3,9642c003ef,what interview! leave me alone,leave me alone,negative,0
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,0


In [None]:
# Build the stop-word set and the lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into a string
    return ' '.join(lemmatized_tokens)

# Apply the function to every tweet in the DataFrame
df['text'] = df['text'].apply(preprocess)
df

Unnamed: 0,textID,text,selected_text,sentiment,positive
0,cb774db0d1,"` responded , going","I`d have responded, if I were going",neutral,0
1,549e992a42,sooo sad miss san diego ! ! !,Sooo SAD,negative,0
2,088c60f138,bos bullying ...,bullying me,negative,0
3,9642c003ef,interview ! leave alone,leave me alone,negative,0
4,358bd9e861,"son * * * * , ` put release already bought","Sons of ****,",negative,0
...,...,...,...,...,...
27476,4eac33d1c0,wish could come see u denver husband lost job ...,d lost,negative,0
27477,4f4c4fc327,"` wondered rake . client made clear .net , ` f...",", don`t force",negative,0
27478,f67aae2310,yay good . enjoy break - probably need hectic ...,Yay good for both of you.,positive,1
27479,ed167662a5,worth * * * * .,But it was worth it ****.,positive,1


# Step 3: Predict and evaluate the model

In [None]:
analyzer = SentimentIntensityAnalyzer()

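def get_sentiment(text):
    # polarity_scores returns 'neg', 'neu', 'pos', and 'compound' scores
    scores = analyzer.polarity_scores(text)
    # Label the text positive (1) if it has any positive score, otherwise 0
    return 1 if scores['pos'] > 0 else 0

# Overwrite the original string labels with the predicted 0/1 labels
df['sentiment'] = df['text'].apply(get_sentiment)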
df

Unnamed: 0,textID,text,selected_text,sentiment,positive
0,cb774db0d1,"` responded , going","I`d have responded, if I were going",0,0
1,549e992a42,sooo sad miss san diego ! ! !,Sooo SAD,0,0
2,088c60f138,bos bullying ...,bullying me,0,0
3,9642c003ef,interview ! leave alone,leave me alone,0,0
4,358bd9e861,"son * * * * , ` put release already bought","Sons of ****,",0,0
...,...,...,...,...,...
27476,4eac33d1c0,wish could come see u denver husband lost job ...,d lost,1,0
27477,4f4c4fc327,"` wondered rake . client made clear .net , ` f...",", don`t force",1,0
27478,f67aae2310,yay good . enjoy break - probably need hectic ...,Yay good for both of you.,1,1
27479,ed167662a5,worth * * * * .,But it was worth it ****.,1,1
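The rule above marks a tweet as positive whenever VADER finds any positive content at all, which may be why rows such as 27476 and 27477 are predicted 1 even though their true label is 0. As a sketch of one possible alternative, the snippet below compares that rule with a cutoff on the `compound` score. The function names, the 0.05 threshold, and the example scores are assumptions for illustration; the scores were not produced by running VADER.

```python
def label_by_pos(scores):
    """Rule used in this notebook: any positive proportion counts as positive."""
    return 1 if scores["pos"] > 0 else 0

def label_by_compound(scores, threshold=0.05):
    """Alternative rule: require the overall compound score to reach a cutoff."""
    return 1 if scores["compound"] >= threshold else 0

# Hypothetical scores for a mostly negative tweet that contains one positive word
mixed = {"neg": 0.45, "neu": 0.40, "pos": 0.15, "compound": -0.52}
print(label_by_pos(mixed), label_by_compound(mixed))
# 1 0
```

Whether the compound-based rule would score better on this dataset is untested here; it would need to be checked against the same confusion matrix and classification report.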


In [None]:
phrase = input("Enter your phrase to be analyzed: ")

Enter your phrase to be analyzed: I love college


In [None]:
# 1 = positive, 0 = not positive (neutral or negative)
processed = preprocess(phrase)
print(get_sentiment(processed))

1


In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(df['positive'], df['sentiment']))

[[10546  8353]
 [  796  7786]]


In [None]:
from sklearn.metrics import classification_report
print(classification_report(df['positive'], df['sentiment']))

              precision    recall  f1-score   support

           0       0.93      0.56      0.70     18899
           1       0.48      0.91      0.63      8582

    accuracy                           0.67     27481
   macro avg       0.71      0.73      0.66     27481
weighted avg       0.79      0.67      0.68     27481
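The figures in the report can be recomputed by hand from the confusion matrix printed above, which helps show what each metric measures. This is a small sketch for class 1 (positive); the variable names are chosen for illustration. In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels.

```python
# Values copied from the confusion matrix: [[TN, FP], [FN, TP]]
tn, fp = 10546, 8353
fn, tp = 796, 7786

# Precision: of the tweets predicted positive, how many really are positive
precision_pos = tp / (tp + fp)
# Recall: of the truly positive tweets, how many were found
recall_pos = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
# Accuracy: share of all tweets labeled correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_pos:.2f} recall={recall_pos:.2f} "
      f"f1={f1_pos:.2f} accuracy={accuracy:.2f}")
# precision=0.48 recall=0.91 f1=0.63 accuracy=0.67
```

The high recall and low precision for class 1 reflect the 8353 false positives: the model finds most positive tweets but also labels many neutral or negative tweets as positive.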

