<a href="https://colab.research.google.com/github/Bibek0130/Sentiment-analysis/blob/master/Sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Sentiment Analysis
Sentiment analysis, also called opinion mining, determines the sentiment expressed in a piece of text. The sentiment can be positive, negative, or neutral.

The flow for sentiment analysis is:

Datasets --> Cleaning and preprocessing --> Choosing an algorithm --> Constructing the model pipeline --> Evaluation --> Predictions


###Data
The data used for this task is the Amazon reviews dataset, which consists of reviews from Amazon customers and was downloaded from Xiang Zhang’s Google Drive directory [1]. The dataset spans 18 years and includes ~35 million reviews up to March 2013. Each review includes product and user information, a rating, and a plaintext review. For more information, see the paper "Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text" [2].

The Amazon reviews dataset is constructed by treating review scores 1 and 2 as negative and scores 4 and 5 as positive. Samples with a score of 3 are ignored. In the dataset, class 1 is negative and class 2 is positive. Each class has 1,800,000 training samples and 200,000 testing samples.
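
The labelling rule described above can be sketched in a few lines. This is only an illustration: the function name `rating_to_class` is hypothetical and is not part of the dataset's own tooling.

```python
from typing import Optional

def rating_to_class(stars: int) -> Optional[int]:
    """Map a 1-5 star rating to the polarity classes described above."""
    if stars in (1, 2):
        return 1  # class 1 = negative
    if stars in (4, 5):
        return 2  # class 2 = positive
    return None  # 3-star reviews are ignored

ratings = [1, 2, 3, 4, 5]
labels = [rating_to_class(r) for r in ratings]
print(labels)
```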


In [None]:
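# Download the IMDB 50K movie reviews dataset directly from Kaggle using opendatasets.
# Note: this IMDB dataset, not the Amazon reviews data described above, is what the rest of the notebook uses.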
!pip install opendatasets
import opendatasets as od

od.download("https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
Downloading imdb-dataset-of-50k-movie-reviews.zip to ./imdb-dataset-of-50k-movie-reviews


100%|██████████| 25.7M/25.7M [00:01<00:00, 22.0MB/s]





In [None]:
import pandas as pd
dataset = "/content/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv"
df = pd.read_csv(dataset)
df.head()


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
df.tail()


Unnamed: 0,review,sentiment
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
49999,No one expects the Star Trek movies to be high...,negative


##Data Preprocessing

To prepare the data for training, we apply the following preprocessing steps:

1. Data cleaning:
    1. Remove unwanted characters
    2. Handle missing values
    3. Remove duplicates
2. Encode labels:
    convert the categorical labels into numbers, mapping positive to 1 and negative to 0.
3. Split the dataset:
    split the dataset into training and test sets (80% / 20%).
4. Tokenization:
    convert words into numerical indices using a tokenizer.
    A tokenizer is used because the model is sequential and needs the word order preserved. For non-sequential models such as Logistic Regression, Naive Bayes, or SVM, vectorization is used instead to convert the text into numerical features.
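
The difference between tokenizing into index sequences and vectorizing into word counts can be illustrated with a toy example. This is a minimal sketch using only the standard library; the helper names `to_sequence` and `to_count_vector` and the three sample documents are made up for illustration.

```python
from collections import Counter

docs = ["good movie", "bad movie", "good good plot"]

# Shared vocabulary; index 0 is reserved for padding/unknown words.
vocab = sorted({word for doc in docs for word in doc.split()})
word_to_index = {word: i + 1 for i, word in enumerate(vocab)}

def to_sequence(doc):
    """Tokenization for sequence models: one index per word, order preserved."""
    return [word_to_index.get(w, 0) for w in doc.split()]

def to_count_vector(doc):
    """Vectorization for non-sequential models: fixed-length word counts, order lost."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

sequences = [to_sequence(d) for d in docs]
vectors = [to_count_vector(d) for d in docs]
print(sequences)  # variable-length lists of indices
print(vectors)    # one column per vocabulary word
```

Sequences vary in length and keep word order, which is what an LSTM consumes after padding. Count vectors always have one entry per vocabulary word, which suits models such as Logistic Regression.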







#Data cleaning

1. Remove unwanted characters such as special characters, HTML tags, and non-alphanumeric characters.

In [None]:
import re
def clean_text(text):
  #re.sub(pattern, replacement, string)
  text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
  text = re.sub(r'[^\w\s]', '', text)  # Remove special characters
  text = text.lower()  # Convert to lowercase
  text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
  return text

df['review'] = df['review'].apply(clean_text)
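
To see what the cleaning steps do to a single string, the sketch below reruns the same regular expressions on a made-up sample. The `tag_replacement` parameter is a hypothetical addition, not part of the cell above; it shows that replacing HTML tags with an empty string can fuse neighbouring words when no space surrounds the tag, while replacing them with a space avoids this.

```python
import re

def clean_text(text, tag_replacement=""):
    text = re.sub(r"<.*?>", tag_replacement, text)  # strip HTML tags
    text = re.sub(r"[^\w\s]", "", text)             # drop punctuation
    text = text.lower()                             # lowercase
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

sample = "A wonderful production.<br /><br />The filming is great!"

fused = clean_text(sample)                         # same steps as the cell above
spaced = clean_text(sample, tag_replacement=" ")   # hypothetical variant
print(fused)
print(spaced)
```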

2. Handle missing values

In [None]:
df = df.dropna()

3. Remove duplicate values

In [None]:
df = df.drop_duplicates()

In [None]:
df.tail(10)

Unnamed: 0,review,sentiment
49989,i got this one a few weeks ago and love it its...,positive
49990,lame lame lame a 90minute cringefest thats 89 ...,negative
49992,john garfield plays a marine who is blinded by...,positive
49993,robert colomb has two fulltime jobs hes known ...,negative
49994,this is your typical junk comedythere are almo...,negative
49995,i thought this movie did a down right good job...,positive
49996,bad plot bad dialogue bad acting idiotic direc...,negative
49997,i am a catholic taught in parochial elementary...,negative
49998,im going to have to disagree with the previous...,negative
49999,no one expects the star trek movies to be high...,negative


# Encode labels
positive -> 1
negative -> 0

In [None]:

df['sentiment'].unique()

array(['positive', 'negative'], dtype=object)

In [None]:
print(df['sentiment'].isna())
# Verify that there are no null values in the sentiment column
print(df['sentiment'].isna().sum())

0        False
1        False
2        False
3        False
4        False
         ...  
49995    False
49996    False
49997    False
49998    False
49999    False
Name: sentiment, Length: 49580, dtype: bool
0


In [None]:
from sklearn import preprocessing
Le = preprocessing.LabelEncoder()
df['sentiment'] = Le.fit_transform(df['sentiment'])

In [None]:
df['sentiment'].unique()

array([1, 0])
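
`LabelEncoder` assigns integer codes to the sorted unique classes, which is why `negative` becomes 0 and `positive` becomes 1. The following standard-library sketch mimics that behaviour on a made-up label list; it is an illustration of the idea, not scikit-learn's implementation.

```python
labels = ["positive", "negative", "negative", "positive"]

# Sort the unique classes, then number them in that order.
classes = sorted(set(labels))
class_to_code = {c: i for i, c in enumerate(classes)}
encoded = [class_to_code[label] for label in labels]

print(class_to_code)
print(encoded)
```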

In [None]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,1
1,a wonderful little production the filming tech...,1
2,i thought this was a wonderful way to spend ti...,1
3,basically theres a family where a little boy j...,0
4,petter matteis love in the time of money is a ...,1


#Tokenization
Convert the words into numerical indices using a tokenizer.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from tensorflow.keras.preprocessing.sequence import pad_sequences
nltk.download('punkt_tab') # for word tokenization

#tokenize the dataset
df['tokenized_review']= df['review'].apply(word_tokenize)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
# Create a vocabulary from the tokenized data
from collections import Counter

# Flatten all tokens into a single list and count occurrences
all_tokens = [token for review in df['tokenized_review'] for token in review]
vocab = Counter(all_tokens)

# Create a mapping of word to index
vocab_size = 30000  # Define a max vocabulary size
word_to_index = {word: idx + 1 for idx, (word, _) in enumerate(vocab.most_common(vocab_size))}

# Map each tokenized review to its numerical sequence
def tokens_to_sequence(tokens):
    return [word_to_index.get(token, 0) for token in tokens]  # Use 0 for unknown words (OOV)

df['sequence'] = df['tokenized_review'].apply(tokens_to_sequence)
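
A toy run of the same vocabulary logic makes the index assignment concrete. The three tiny "reviews" and the cap of 2 words are made up for illustration; with this cap, the rarest word falls outside the vocabulary and maps to 0.

```python
from collections import Counter

tokenized_reviews = [["good", "movie"], ["good", "movie", "plot"], ["good"]]
vocab = Counter(token for review in tokenized_reviews for token in review)

vocab_size = 2  # tiny cap so that the rarest word is left out
word_to_index = {word: idx + 1 for idx, (word, _) in enumerate(vocab.most_common(vocab_size))}

def tokens_to_sequence(tokens):
    # Words outside the vocabulary map to 0
    return [word_to_index.get(token, 0) for token in tokens]

sequence = tokens_to_sequence(["good", "plot", "movie"])
print(word_to_index)
print(sequence)
```

Note that index 0 is also the default padding value of `pad_sequences`, so with this scheme the model cannot distinguish an unknown word from padding.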


In [None]:
# Set a maximum sequence length
max_length = 200

# Pad the sequences
X = pad_sequences(df['sequence'], maxlen=max_length, padding='post', truncating='post')

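# The labels were already encoded to 0/1 by LabelEncoder above, so use them directly
# (mapping the strings 'positive'/'negative' again would produce NaN for every row)
y = df['sentiment'].values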
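
The effect of `padding='post'` and `truncating='post'` can be reproduced for a single sequence in plain Python. The helper `pad_post` is a hypothetical stand-in written for illustration; it is not the Keras implementation.

```python
def pad_post(sequence, maxlen, value=0):
    """Keep the first maxlen items, then fill the end with `value` up to maxlen."""
    trimmed = sequence[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([4, 7, 1], maxlen=5)
long = pad_post([4, 7, 1, 9, 3, 8], maxlen=5)
print(short)
print(long)
```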



# Split the datasets into training and testing
training -> 80%
testing -> 20%

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size = 0.2, random_state = 42)
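
Conceptually, an 80/20 split shuffles the rows with a fixed seed and sets one share aside. The sketch below shows the idea with the standard library only; `simple_split` is a hypothetical helper and will not reproduce the exact rows that `train_test_split` selects.

```python
import random

def simple_split(items, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and take the first test_size share as the test set."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

reviews = [f"review {i}" for i in range(100)]
train, test = simple_split(reviews)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is the role `random_state = 42` plays in the cell above.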

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout, Conv1D

# Model parameters
vocab_size = 30000  # Same as used during tokenization
embedding_dim = 100
max_length = 200  # Same as used during padding

# Build the model
model = Sequential([
    Embedding(input_dim=vocab_size + 1, output_dim=embedding_dim, input_length=max_length),
    Bidirectional(LSTM(128, return_sequences=True)),  # BiLSTM with 128 units
    Bidirectional(LSTM(64, return_sequences=False)),  # BiLSTM with 64 units, modified to return_sequences=False
    # Conv1D(100, 5, activation='relu'), # Conv1D is removed as it expects a 3D input
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')  # Sigmoid for binary classification
])

model.summary()
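
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand. This sketch assumes the default Keras settings (LSTM layers with a single bias vector per gate, and `Bidirectional` concatenating the two directions); the helper names are made up for illustration.

```python
def lstm_params(input_dim, units):
    """One LSTM layer: 4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size, embedding_dim = 30000, 100

embedding = (vocab_size + 1) * embedding_dim
bilstm_1 = 2 * lstm_params(embedding_dim, 128)  # forward + backward direction
bilstm_2 = 2 * lstm_params(2 * 128, 64)         # input is the concatenated 256-dim output
dense_1 = dense_params(2 * 64, 64)
dense_2 = dense_params(64, 1)

total = embedding + bilstm_1 + bilstm_2 + dense_1 + dense_2
print(embedding, bilstm_1, bilstm_2, dense_1, dense_2, total)
```

Under these assumptions the embedding table accounts for roughly 3.0 million of about 3.4 million parameters, so the vocabulary size dominates the model size.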



In [None]:
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
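
For intuition about the `binary_crossentropy` loss, here is a minimal standard-library version of the formula, averaged over samples. It is a sketch for illustration; the clipping constant `eps` is an assumption and may differ from what Keras uses internally.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

Confident wrong predictions are penalised far more heavily than confident correct ones are rewarded, which is why the loss can rise even when accuracy barely changes.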


In [None]:
!pip install tensorflow

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 1. Create a Tokenizer object
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')

# 2. Fit the tokenizer on the training data only, so no test-set vocabulary leaks into it
tokenizer.fit_on_texts(X_train)

# 3. Convert text to sequences of numerical indices
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

# 4. Pad sequences to ensure uniform length
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_length, padding='post', truncating='post')
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_length, padding='post', truncating='post')

# 5. Now you can use X_train_padded and X_test_padded in your model.fit()
history = model.fit(X_train_padded, y_train,
                    epochs=5,
                    batch_size=32,
                    validation_data=(X_test_padded, y_test),
                    verbose=1)

Epoch 1/5
1240/1240 ━━━━━━━━━━━━━━━━━━━━ 53s 40ms/step - accuracy: 0.6748 - loss: 0.5862 - val_accuracy: 0.7982 - val_loss: 0.4357
Epoch 2/5
1240/1240 ━━━━━━━━━━━━━━━━━━━━ 81s 39ms/step - accuracy: 0.8416 - loss: 0.3746 - val_accuracy: 0.8671 - val_loss: 0.3303
Epoch 3/5
1240/1240 ━━━━━━━━━━━━━━━━━━━━ 81s 39ms/step - accuracy: 0.9158 - loss: 0.2241 - val_accuracy: 0.8739 - val_loss: 0.3094
Epoch 4/5
1240/1240 ━━━━━━━━━━━━━━━━━━━━ 83s 39ms/step - accuracy: 0.9528 - loss: 0.1366 - val_accuracy: 0.8730 - val_loss: 0.3531
Epoch 5/5
1240/1240 ━━━━━━━━━━━━━━━━━━━━ 81s 38ms/step - accuracy: 0.9757 - loss: 0.0778 - val_accuracy: 0.8687 - val_loss: 0.4494
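
The `1240/1240` step count in the log can be checked with simple arithmetic. The sketch below assumes the 49,580 rows reported earlier (the `Length` of the `sentiment` column after cleaning), a 20% test share, and the batch size of 32 passed to `model.fit`.

```python
import math

n_rows = 49580            # rows left after dropna() and drop_duplicates()
batch_size = 32

n_test = n_rows // 5      # 20% held out for testing
n_train = n_rows - n_test
steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is partial
print(n_train, n_test, steps_per_epoch)
```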


In [None]:
import tensorflow as tf  # Import TensorFlow

# Preprocess X_test in the same way as X_train:
# 1. Convert text to sequences of numerical indices
X_test_sequences = tokenizer.texts_to_sequences(X_test)

# 2. Pad sequences to ensure uniform length
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_length, padding='post', truncating='post')

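# y_test already holds the integer labels (0/1) produced by LabelEncoder, so it is used as-is.
# Mapping it with `1 if label == 'positive' else 0` would turn every label into 0, because an
# integer never equals the string 'positive'. The output recorded below came from such a mapping,
# which is why it shows near-chance accuracy despite the 0.8687 validation accuracy on the same split.
y_test_numerical = y_test.values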

# Convert X_test_padded and y_test_numerical to TensorFlow tensors
X_test_padded = tf.convert_to_tensor(X_test_padded)  # Convert to tf.Tensor
y_test_numerical = tf.convert_to_tensor(y_test_numerical)  # Convert to tf.Tensor

# Finally, evaluate the model using the preprocessed data:
test_loss, test_accuracy = model.evaluate(X_test_padded, y_test_numerical, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

Test Loss: 2.9689
Test Accuracy: 0.4652
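
A test accuracy this far below the validation accuracy logged on the same split points to a labelling problem rather than a modelling one. The toy example below, with made-up labels and predictions, shows how comparing already-encoded integer labels with the string `'positive'` turns every label into 0 and distorts the accuracy.

```python
y_test = [1, 0, 1, 1, 0]   # labels already encoded as integers
y_pred = [1, 0, 1, 0, 0]   # hypothetical model predictions

# An integer never equals the string 'positive', so every label becomes 0.
remapped = [1 if label == 'positive' else 0 for label in y_test]

def accuracy(y_true, y_hat):
    return sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

true_acc = accuracy(y_test, y_pred)
distorted_acc = accuracy(remapped, y_pred)
print(remapped)
print(true_acc, distorted_acc)
```

With all labels set to 0, the reported accuracy simply equals the share of reviews the model predicts as negative, regardless of how good the model is.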
