<a href="https://colab.research.google.com/github/RachelNderitu/RachelNderitu/blob/main/Rach_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Sentiment Analysis**
Sentiment analysis, a subfield of Natural Language Processing (NLP), classifies text by sentiment, typically as positive, negative, or neutral. Its goal is to identify the mood, emotion, or opinion that a text expresses, which is why it is also called opinion mining.

**Sentiment Analysis Use Cases**
Sentiment analysis gives organizations insights that support data-driven decisions. Some common use cases:

Social Media Monitoring for Brand Management: Brands can use sentiment analysis to gauge public perception. For example, a company can collect all tweets that mention or tag it and run sentiment analysis on them to see how the public views it.
Product/Service Analysis: Organizations can run sentiment analysis on customer reviews to see how well a product or service is performing in the market and plan accordingly.
Stock Price Prediction: Investors want to know whether a company’s stock will rise or fall. One signal is the sentiment of news headlines that mention the company: positive headlines are often read as a sign that the price may rise, and negative headlines as a sign that it may fall.

**Ways to Perform Sentiment Analysis in Python**
Python is one of the most widely used tools for data science, and it offers several ways to perform sentiment analysis. The most popular are listed here:

Using TextBlob
Using VADER
Using Bag of Words Vectorization-based Models
Using LSTM-based Models
Using Transformer-based Models

**Step 1 - Import libraries and load dataset**

In [None]:
# import libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# download NLTK data (first time only)
# 'all' fetches every NLTK package; this notebook only uses the tokenizer,
# stop word, WordNet, and VADER lexicon resources
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

True

In [None]:
# load dataset
df = pd.read_csv('/content/HateSpeech_Kenya.csv')
df.head()

Unnamed: 0,hate_speech,offensive_language,neither,Class,Tweet
0,0,0,3,0,['The political elite are in desperation. Ordi...
1,0,0,3,0,"[""Am just curious the only people who are call..."
2,0,0,3,0,['USERNAME_3 the area politicians are the one ...
3,0,0,3,0,['War expected in Nakuru if something is not d...
4,0,0,3,0,['USERNAME_4 tells kikuyus activists that they...


**Step 2 - Preprocess text**
We create a function, preprocess_text, that tokenizes each tweet with NLTK’s word_tokenize, removes English stop words using NLTK’s stopwords corpus, and lemmatizes the remaining tokens with WordNetLemmatizer. The tokens are then joined back into a single string.

In [None]:
# build the stop word set and the lemmatizer once, not once per token or tweet
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# create preprocess_text function
def preprocess_text(text):
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into a string
    return ' '.join(lemmatized_tokens)

# apply the function to the Tweet column
df['Tweet'] = df['Tweet'].apply(preprocess_text)
df.head()

Unnamed: 0,hate_speech,offensive_language,neither,Class,Tweet
0,0,0,3,0,[ 'the political elite desperation . ordinary ...
1,0,0,3,0,[ `` curious people calling old mad kikuyus ka...
2,0,0,3,0,[ 'username_3 area politician one blame coz r ...
3,0,0,3,0,[ 'war expected nakuru something done . luo gi...
4,0,0,3,0,[ 'username_4 tell kikuyus activist targeted ....
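
The preprocessed tweets above still begin with `[ '` or ``[ `` ``, which suggests that each raw Tweet value is the string form of a Python list. The sketch below shows one way to unwrap such values before tokenizing. It assumes each value is a complete literal such as `"['some text']"`; the full values are not visible in the truncated output, and `unwrap_tweet` is a hypothetical helper, not part of the original notebook.

```python
import ast


def unwrap_tweet(raw):
    """Return the tweet text from a value that may be a stringified list.

    Assumption: values look like "['some text']". Anything that does not
    parse as a Python list or tuple is returned unchanged.
    """
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # not a Python literal, keep the original text
    if isinstance(parsed, (list, tuple)):
        # Join the elements in case the list holds more than one string
        return " ".join(str(item) for item in parsed)
    return raw


# Made-up sample values that imitate the shape seen in df.head()
samples = [
    "['War expected in Nakuru']",
    '["Am just curious"]',
    "plain text without brackets",
]
for sample in samples:
    print(unwrap_tweet(sample))
```

If the assumption holds, calling `df['Tweet'].apply(unwrap_tweet)` before `preprocess_text` would keep the bracket and quote characters out of the tokens.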


**Step 3 - NLTK Sentiment Analyzer**
First, we’ll initialize a SentimentIntensityAnalyzer object from the nltk.sentiment.vader module.

Next, we’ll define a function called get_sentiment that takes a text string as its input. The function calls the analyzer’s polarity_scores method, which returns a dictionary with neg, neu, and pos scores (each between 0 and 1) and a compound score (between -1 and +1).

The function returns 1 if the pos score is greater than 0, and 0 otherwise. Any text in which VADER finds some positive content is therefore labeled positive, and every other text is labeled negative, even when it is neutral.

Finally, we’ll apply the get_sentiment function to the Tweet column of the df DataFrame using the apply method. This creates a new column called sentiment, which stores the label for each tweet. We’ll then display the updated DataFrame.

The dataset also has a Class column, so we can compare the sentiment labels against it by building a confusion matrix. Note that Class takes three values (0, 1, and 2) and appears to record hate speech annotations rather than sentiment, while sentiment is only ever 0 or 1, so the comparison is a rough one.

In [None]:
# initialize NLTK sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# create get_sentiment function
def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    # 1 if VADER finds any positive content, else 0
    return 1 if scores['pos'] > 0 else 0

# apply get_sentiment function
df['sentiment'] = df['Tweet'].apply(get_sentiment)
df

Unnamed: 0,hate_speech,offensive_language,neither,Class,Tweet,sentiment
0,0,0,3,0,[ 'the political elite desperation . ordinary ...,0
1,0,0,3,0,[ `` curious people calling old mad kikuyus ka...,1
2,0,0,3,0,[ 'username_3 area politician one blame coz r ...,0
3,0,0,3,0,[ 'war expected nakuru something done . luo gi...,0
4,0,0,3,0,[ 'username_4 tell kikuyus activist targeted ....,1
...,...,...,...,...,...,...
48071,0,0,2,0,[ 'this exactly moses kuria & ilk . say negati...,0
48072,0,0,2,0,[ 'this exactly kenyan going thank god time ri...,1
48073,0,0,2,0,[ `` exactly wrong country . kikuyus ca n't st...,0
48074,1,0,2,0,[ `` exactly thing . well difference kilifi 'r...,1
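
The rule used above marks a tweet as positive whenever its pos score is above 0, even if the text is mostly negative. A common alternative is to threshold the compound score instead. The sketch below works on plain dictionaries shaped like the output of polarity_scores, so it runs without NLTK. The 0.05 threshold is a widely used convention rather than a value from this notebook, and the sample scores are hand-written for illustration.

```python
def label_from_scores(scores, threshold=0.05):
    """Label a VADER-style score dictionary using its compound score.

    The threshold of 0.05 is a commonly used convention, not a value taken
    from this notebook.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def original_rule(scores):
    """The notebook's rule: 1 whenever the pos score is above 0."""
    return 1 if scores["pos"] > 0 else 0


# Hand-written dictionaries shaped like polarity_scores output.
# The numbers are illustrative, not produced by VADER.
mostly_negative = {"neg": 0.55, "neu": 0.35, "pos": 0.10, "compound": -0.70}
no_sentiment = {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

print(original_rule(mostly_negative), label_from_scores(mostly_negative))
print(original_rule(no_sentiment), label_from_scores(no_sentiment))
```

With these inputs the original rule labels the mostly negative example as 1, while the compound-based rule labels it negative and keeps a separate neutral label for text with no detected sentiment.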


In [None]:
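from sklearn.metrics import confusion_matrix

# largest of the hate_speech and offensive_language values in each row
# (stored for reference; the confusion matrix below does not use it)
df['combined_class'] = df[['hate_speech', 'offensive_language']].max(axis=1)

# rows are the true Class values, columns are the predicted sentiment values
print(confusion_matrix(df['Class'], df['sentiment']))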

[[18843 17509     0]
 [ 4632  3911     0]
 [ 1859  1322     0]]
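
The third column of the matrix is all zeros because sentiment is never 2, while Class has three values. One way to make the two columns comparable is to collapse Class to a binary label before tallying. The sketch below does this in plain Python on made-up toy data. The mapping is an assumption, not the notebook's method: it treats Class 0 as the non-negative label 1 and every other Class as 0.

```python
from collections import Counter


def binary_confusion(true_classes, predicted_sentiment):
    """Tally a 2x2 confusion matrix after collapsing Class to binary.

    Assumption (not confirmed by the source): Class 0 is treated as the
    non-negative label 1, and any other Class is mapped to 0.
    """
    counts = Counter()
    for cls, pred in zip(true_classes, predicted_sentiment):
        expected = 1 if cls == 0 else 0
        counts[(expected, pred)] += 1
    # Rows are the expected label (0, 1); columns are the predicted label (0, 1)
    return [[counts[(row, col)] for col in (0, 1)] for row in (0, 1)]


# Toy values for illustration only, not taken from the dataset
toy_classes = [0, 0, 1, 2, 0, 1]
toy_sentiment = [1, 0, 0, 1, 1, 0]

toy_matrix = binary_confusion(toy_classes, toy_sentiment)
print(toy_matrix)  # [[2, 1], [1, 2]]
```

Whether an abusive tweet should count as negative sentiment is a modeling decision, so any mapping like this one should be stated alongside the results.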


We can also check the classification report:

The classification_report function from the sklearn.metrics module summarizes precision, recall, and F1 score for each label, along with the number of samples (support) per label.
Here the Class column supplies the true labels and the sentiment column supplies the predicted labels.
Because sentiment is never 2, the precision, recall, and F1 score for class 2 are all 0.
The report gives a compact summary of how well the predicted labels match the true ones, which can be used to evaluate and improve the approach.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(df['Class'], df['sentiment']))

              precision    recall  f1-score   support

           0       0.74      0.52      0.61     36352
           1       0.17      0.46      0.25      8543
           2       0.00      0.00      0.00      3181

    accuracy                           0.47     48076
   macro avg       0.31      0.33      0.29     48076
weighted avg       0.59      0.47      0.51     48076
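
The numbers in the report can be reproduced directly from the confusion matrix printed earlier, which makes it easier to see where they come from. The sketch below is plain Python with a hypothetical helper, `per_class_metrics`. It sets precision to 0 when a label is never predicted, which matches the zeros shown for class 2.

```python
# Confusion matrix printed earlier: rows are true Class values,
# columns are predicted sentiment values.
matrix = [
    [18843, 17509, 0],
    [4632, 3911, 0],
    [1859, 1322, 0],
]


def per_class_metrics(cm):
    """Return {label: (precision, recall, f1, support)} for a square matrix."""
    size = len(cm)
    column_totals = [sum(cm[row][col] for row in range(size)) for col in range(size)]
    results = {}
    for label in range(size):
        true_positives = cm[label][label]
        support = sum(cm[label])
        predicted = column_totals[label]
        # Guard against division by zero for labels that are never predicted
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        results[label] = (precision, recall, f1, support)
    return results


metrics = per_class_metrics(matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / sum(sum(row) for row in matrix)

for label, (precision, recall, f1, support) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
print(f"accuracy={accuracy:.2f}")
```

For example, the precision for class 0 is 18843 / (18843 + 4632 + 1859), which rounds to 0.74, and the accuracy is (18843 + 3911 + 0) / 48076, which rounds to 0.47.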



