Essential libraries for sentiment analysis

In [1]:
import pandas as pd      
import numpy as np       
import re                
import nltk              
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.linear_model import LogisticRegression    
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

Importing the data (here, I'm using the Sentiment140 dataset of 1.6 million labeled tweets)

In [2]:
df = pd.read_csv("C:\\Users\\Rakes\\Downloads\\archive (3)\\training.1600000.processed.noemoticon.csv", encoding="latin-1")

In [3]:
df.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


Adding column names for better readability: the CSV has no header row, so pandas used the first tweet as the header above, and the resulting names require a genius brain to understand

In [4]:
col_names = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv("C:\\Users\\Rakes\\Downloads\\archive (3)\\training.1600000.processed.noemoticon.csv", encoding="latin-1", names = col_names)

In [5]:
df.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
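A minimal sketch of why passing names matters for a headerless CSV. The two rows below are made up and only mimic the six-column layout; they are not taken from the dataset.

```python
import io
import pandas as pd

# Two made-up rows in the same six-column layout; there is no header line.
raw = (
    '0,1,Mon Apr 06,NO_QUERY,user_a,"first tweet, with a comma"\n'
    '4,2,Tue Apr 07,NO_QUERY,user_b,second tweet\n'
)
col_names = ["target", "id", "date", "flag", "user", "text"]

# Without names=, pandas promotes the first data row to the header.
lost_row = pd.read_csv(io.StringIO(raw))
# With names=, every line is treated as data.
kept_rows = pd.read_csv(io.StringIO(raw), names=col_names)

print(len(lost_row), len(kept_rows))
```

On the real file the same thing happens at scale: the first read silently drops one tweet into the header.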


Counting the missing values... because values that took a vacation without telling us will give us a lot of headaches during the training phase!

In [6]:
df.isnull().sum()

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

Swapping sentiment labels to make positivity more universally relatable: the dataset marks positive tweets with 4 and negative ones with 0, and who needs a "4" when you can have a "1" for all the good vibes!

In [7]:
df.replace({"target" : {4:1}}, inplace=True)
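A quick way to confirm the relabeling worked is to count the labels afterwards. This is a sketch on toy labels (not the real data), using a column-level replace as an alternative to the in-place call above.

```python
import pandas as pd

# Toy labels standing in for the "target" column.
labels = pd.DataFrame({"target": [0, 4, 4, 0, 4]})

# Map 4 -> 1 on the column and assign the result back.
labels["target"] = labels["target"].replace({4: 1})

counts = labels["target"].value_counts().to_dict()
print(counts)
```

After the swap, only 0 and 1 should remain as label values.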

Streamlining text processing: Removing non-alphabetic characters, converting to lowercase, tokenizing, stemming, and eliminating stopwords for optimal analysis.

In [None]:
# Download the stopword list and build the helpers once, not on every call
nltk.download("stopwords")
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))   # set lookup is much faster than rebuilding the list per word

def stemming(text):
    # Remove non-alphabetic characters and convert to lowercase
    text_stemmed = re.sub('[^a-zA-Z]', ' ', text)
    text_stemmed = text_stemmed.lower()

    # Tokenize the text
    text_stemmed = text_stemmed.split()

    # Stem each word and remove stopwords
    text_stemmed = [stemmer.stem(word) for word in text_stemmed if word not in stop_words]

    # Join the stemmed words back into a string
    text_stemmed = " ".join(text_stemmed)

    return text_stemmed
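To see what the cleaning does to a single tweet, here is a small self-contained sketch. The tiny stopword set is my own stand-in so the example runs without downloading the NLTK corpus; the real pipeline uses the full English list.

```python
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Hypothetical mini stopword set, only for this illustration.
mini_stop_words = {"is", "that", "he", "can", "t", "his", "by"}

def clean(text):
    # Keep letters only, lowercase, split on whitespace.
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    # Drop stopwords, stem what is left, and rejoin.
    return " ".join(stemmer.stem(w) for w in words if w not in mini_stop_words)

cleaned = clean("is upset that he can't update his Facebook")
print(cleaned)
```

Note how the apostrophe in "can't" turns into a space, leaving the stray token "t", which is why it sits in the stopword set here.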

Enhancing text data: Applying stemming to "text" column and storing the results in "stemmed_text" for improved analysis.

In [None]:
df["stemmed_text"] = df["text"].apply(stemming)

Time-saving heroics: Storing dataset post-stemming to skip the hour-long staring contest with progress bars!

In [None]:
df.to_csv("stemmed_dataframe.csv", index=False)   # df now holds the new "stemmed_text" column

In [8]:
new_df = pd.read_csv("C:\\Users\\Rakes\\Downloads\\stemmed_dataframe.csv", encoding = "latin-1")

In [9]:
new_df

Unnamed: 0,target,id,date,flag,user,text,stemmed_text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see
...,...,...,...,...,...,...,...
1599995,1,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...,woke school best feel ever
1599996,1,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...,thewdb com cool hear old walt interview http b...
1599997,1,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...,readi mojo makeov ask detail
1599998,1,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...,happi th birthday boo alll time tupac amaru sh...


In [11]:
new_df["stemmed_text"].isnull().sum()   # Checking if any rows have null values

495

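Null-nullifying magic: tweets with nothing left after stemming were most likely saved as empty strings, which pandas reads back from the CSV as missing values. Kicking out those empty rows like a boss to keep the model's mojo intact!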

In [12]:
df = new_df.dropna(subset=['stemmed_text'])  # Drop rows with null values in 'stemmed_text' column

In [14]:
df["stemmed_text"].isnull().sum()   # Checking again if any rows have null values

0
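A small sketch of how an empty stemmed_text can come back from a CSV as a missing value. The two-row frame is a toy stand-in, and an in-memory buffer replaces the file on disk.

```python
import io
import pandas as pd

# Second row mimics a tweet that had nothing left after cleaning.
frame = pd.DataFrame({
    "target": [1, 0],
    "stemmed_text": ["woke school best feel ever", ""],
})

# Round trip through CSV using an in-memory buffer.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

n_missing = int(reloaded["stemmed_text"].isnull().sum())
cleaned = reloaded.dropna(subset=["stemmed_text"])
print(n_missing, len(cleaned))
```

The original frame has no nulls at all; they only appear after the reload, which matches what the notebook shows.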

Extracting feature and target arrays: Assigning stemmed text values to X and sentiment labels to y for analysis.

In [15]:
X = df["stemmed_text"].values
y = df["target"].values

Splitting dataset: Dividing feature and target arrays into training (80%) and testing (20%) sets; stratify = y keeps the class proportions the same in both sets.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 42)
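What stratify does is easier to see on deliberately imbalanced toy labels. This sketch uses made-up data with a 70/30 split between classes and checks that both subsets keep that ratio.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

texts = [f"tweet {i}" for i in range(100)]
labels = [0] * 70 + [1] * 30   # toy labels, deliberately imbalanced

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Both subsets keep the 70/30 class ratio.
print(Counter(y_tr), Counter(y_te))
```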

Text vectorization: Transforming text data into TF-IDF vectors for training set using TfidfVectorizer.

In [17]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
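The difference between fit_transform and transform matters once unseen text comes in. A minimal sketch on three toy documents (the test sentence is made up): the vocabulary is learned from the training documents only, and new text is mapped onto that same vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["woke school best feel ever", "whole bodi feel itchi like fire"]
test_docs = ["feel best today"]   # "today" never appears in the training documents

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)   # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)         # reuses them, no refitting

print(train_matrix.shape, test_matrix.shape)
print("today" in vec.vocabulary_)
```

Words the vectorizer has never seen are simply ignored, so the test matrix has the same number of columns as the training matrix.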

Model selection: Using the Logistic Regression algorithm with max_iter raised from the default 100 to 1000, so the solver has enough iterations to converge on a dataset this large.

In [19]:
model = LogisticRegression(max_iter = 1000)
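max_iter is only an upper bound; the fitted model records how many iterations the solver really used in n_iter_. A sketch on synthetic data from make_classification (not the tweets):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# n_iter_ holds the number of solver iterations that were actually run.
iterations_used = int(clf.n_iter_[0])
print(iterations_used)
```

If this number equals max_iter, the solver stopped at the cap and scikit-learn emits a ConvergenceWarning.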

Model training: Fitting the logistic regression model to the training data for sentiment analysis. 

In [20]:
model.fit(X_train, y_train)

Prediction on training set: Using the trained logistic regression model to predict sentiment labels for the training data. 

In [21]:
y_train_predict = model.predict(X_train)

Model evaluation: Calculating the accuracy score by comparing predicted and actual sentiment labels for the training data. Keep in mind this is training accuracy; the held-out test set has not been scored yet.

In [23]:
accuracy_score(y_train, y_train_predict)

0.8024521648885123
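Scoring the held-out split follows the same pattern, with one difference: the test text goes through transform, not fit_transform. The sketch below uses a tiny made-up corpus so it runs on its own; in this notebook the equivalent would be transforming X_test with vectorizer and comparing model.predict on it against y_test.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up, already-stemmed toy corpus: 1 = positive, 0 = negative.
docs = ["love great happi", "hate bad sad", "great fun love", "sad terribl hate",
        "happi fun great", "bad terribl sad", "love happi fun", "hate terribl bad"] * 5
labels = [1, 0, 1, 0, 1, 0, 1, 0] * 5

docs_train, docs_test, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=42
)

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(docs_train), y_tr)   # fit on training text only

train_acc = accuracy_score(y_tr, clf.predict(vec.transform(docs_train)))
test_acc = accuracy_score(y_te, clf.predict(vec.transform(docs_test)))   # transform, never refit
print(train_acc, test_acc)
```

A large gap between the two numbers would point to overfitting.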

In [35]:
def predict(review):
    """
    Predicts the sentiment of a given review.

    Args:
        review (str): The review text to analyze.

    Returns:
        str: A message indicating whether the review is positive or negative.
    """
    review_vec = vectorizer.transform([review])
    if model.predict(review_vec)[0] == 1:
        return "review is positive...! :-)"
    return "review is negative...! :-("
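One thing to watch: predict hands the raw review to the vectorizer, while the vocabulary was learned from stemmed text, so words that change under stemming will not be found. A sketch of the effect, with a simplified clean helper of my own (stopword removal left out for brevity):

```python
import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def clean(text):
    # Letters only, lowercase, stem each word (hypothetical simplified helper).
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stemmer.stem(w) for w in words)

# Vocabulary learned from stemmed text, as in the notebook.
vec = TfidfVectorizer()
vec.fit([clean("The movie was fantastic"), clean("The product was disappointing")])

raw_hits = vec.transform(["fantastic"]).nnz             # raw word misses the vocabulary
stemmed_hits = vec.transform([clean("fantastic")]).nnz  # stemmed form is found
print(raw_hits, stemmed_hits)
```

Running reviews through the same cleaning step before vectorizer.transform would keep prediction consistent with training. Words like "love" and "hate" are unchanged by stemming, which is likely why short examples still work without it.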

Looks like my model's got a sixth sense for emotions! It's reading reviews like a mood-reading psychic and gets all four examples below right.
Here "I love you my dear cute madam" is "positive", but her acceptance is always "negative" :-(

In [39]:
predict("""
The movie was absolutely fantastic! The storyline was engaging, the acting was superb, and the visuals were 
stunning. I couldn't recommend it enough!""")

'review is positive...! :-)'

In [40]:
predict("""I was extremely disappointed with the product. The quality was poor, 
it arrived late, and the customer service was unresponsive. 
I would not recommend it to anyone.""")

'review is negative...! :-('

In [41]:
predict("I love you my dear cute madam")

'review is positive...! :-)'

In [42]:
predict("I hate you")

'review is negative...! :-('