<a href="https://colab.research.google.com/github/Sankalpa0011/Sentiment-Analysis-with-NLP/blob/main/Sentiment_Analysis_with_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  **Sentiment Analysis using NLP**

## Step 1 - Load Data Set

In [33]:
!pip install nlp



In [34]:
import pandas as pd
import nlp

### Read the dataset using pandas

In [32]:
Tweet_csv = pd.read_csv("/content/Tweets.csv")
Tweet_csv.shape
Tweet_csv.head()

(14640, 15)

### Keep only related data and remove others

In [35]:
Tweet_csv = Tweet_csv[["airline_sentiment","text"]]
Tweet_csv.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials t...
2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,negative,@VirginAmerica it's really aggressive to blast...
4,negative,@VirginAmerica and it's a really big bad thing...


## Step 2 – Preprocess Text

### Preprocesses the text by obtaining lower case, removing URLs, removing stop words, and generating stem. Add the code snippet to your notebook

In [36]:
import nltk
import string
import re

from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

In [37]:
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [38]:
ps = PorterStemmer()

In [39]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r'http.?://[^\s]?','',text)
    text = nltk.word_tokenize(text)
    y = []

    for i in text:
        if i not in stopwords.words("english"):
            y.append(ps.stem(i))

    return " ".join(y)

### Replace text column to text_cleaned

In [40]:
Tweet_csv['text_cleaned'] = Tweet_csv['text'].apply(clean_text)
Tweet_csv.head()

Unnamed: 0,airline_sentiment,text,text_cleaned
0,neutral,@VirginAmerica What @dhepburn said.,@ virginamerica @ dhepburn said .
1,positive,@VirginAmerica plus you've added commercials t...,@ virginamerica plu 've ad commerci experi ......
2,neutral,@VirginAmerica I didn't today... Must mean I n...,@ virginamerica n't today ... must mean need t...
3,negative,@VirginAmerica it's really aggressive to blast...,@ virginamerica 's realli aggress blast obnoxi...
4,negative,@VirginAmerica and it's a really big bad thing...,@ virginamerica 's realli big bad thing


## Step 3 – Feature Extraction

### Import necessary libraries

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

### Create a TF-IDF Vectorizer with max_features set to 3000

In [42]:
tfidf_vectorizer = TfidfVectorizer(max_features=3000)

### Use the TF-IDF Vectorizer to transform the "text_cleaned" column into TF-IDF vector representations

In [43]:
X = tfidf_vectorizer.fit_transform(Tweet_csv['text_cleaned']).toarray()

### Convert the "airline_sentiment" column into an array and assign it to the variable "Y"

In [44]:
Y = Tweet_csv['airline_sentiment'].values

## Step 4 – Train Model

### Import the necessary libraries

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [46]:
Tweet_csv.shape

(14640, 3)

### Split the dataset into training and testing sets using train_test_split

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

### Train a Multinomial Naïve Bayes classifier using the training dataset

In [48]:
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

### Predict the sentiment of tweets in the test dataset using the trained Naïve Bayes classifier

In [49]:
nb_y_pred = nb_classifier.predict(X_test)

### Calculate the accuracy of the Naïve Bayes classifier

In [50]:
nb_accuracy = accuracy_score(y_test, nb_y_pred)
print("Multinomial Naive Bayes Classifier Accuracy: ", nb_accuracy)

Multinomial Naive Bayes Classifier Accuracy:  0.7226775956284153


### Train a Random Forest classifier using the training dataset

In [24]:
rf_classifier = RandomForestClassifier(random_state=2)
rf_classifier.fit(X_train, y_train)

### Predict the sentiment of tweets in the test dataset using the trained Random Forest classifier

In [25]:
rf_y_pred = rf_classifier.predict(X_test)

### Calculate the accuracy of the Random Forest classifier

In [26]:
rf_accuracy = accuracy_score(y_test, rf_y_pred)
print("Random Forest Classifier Accuracy: ", rf_accuracy)

Random Forest Classifier Accuracy:  0.7496584699453552


## **Here's the complete code snippet incorporating all these steps:**

In [27]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

nb_y_pred = nb_classifier.predict(X_test)

nb_accuracy = accuracy_score(y_test, nb_y_pred)
print("Multinomial Naïve Bayes Classifier Accuracy:", nb_accuracy)

rf_classifier = RandomForestClassifier(random_state=2)
rf_classifier.fit(X_train, y_train)

rf_y_pred = rf_classifier.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_y_pred)
print("Random Forest Classifier Accuracy:", rf_accuracy)


Multinomial Naïve Bayes Classifier Accuracy: 0.7226775956284153
Random Forest Classifier Accuracy: 0.7496584699453552


###  Shape of input variable “X”

In [28]:
X.shape

(14640, 3000)

### Group the DataFrame by the "airline_sentiment" column and describe the groups

In [29]:
sentiment_counts = Tweet_csv.groupby("airline_sentiment").describe()
print(sentiment_counts)


                   text         \
                  count unique   
airline_sentiment                
negative           9178   9087   
neutral            3099   3067   
positive           2363   2298   

                                                                           \
                                                                 top freq   
airline_sentiment                                                           
negative           @AmericanAir that's 16+ extra hours of travel ...    2   
neutral                                           @SouthwestAir sent    5   
positive                                            @JetBlue thanks!    5   

                  text_cleaned         \
                         count unique   
airline_sentiment                       
negative                  9178   9084   
neutral                   3099   3055   
positive                  2363   2264   

                                                                           
         

### Instances/rows with “negative” sentiment class in the dataset

In [30]:
Tweet_csv.groupby("airline_sentiment").size()["negative"]


9178

### Unique tweets in the dataset with the sentiment of “neutral”

In [31]:
neutral_tweets = Tweet_csv[Tweet_csv['airline_sentiment'] == 'neutral']['text']
num_unique_neutral_tweets = neutral_tweets.nunique()
print("Number of unique tweets with neutral sentiment:", num_unique_neutral_tweets)


Number of unique tweets with neutral sentiment: 3067
