<a href="https://colab.research.google.com/github/Khouloud-Kessentini/Sentiment-Analysis-Project-NLP-/blob/main/src.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reviews Rating Prediction Project**

The main steps include:

1.   Data pre-processing

  *   Data loading and columns selection
  *   Convert reviews to lowercase and remove emojis, numbers and punctuation
  *   Remove stop words ("and", "in", "at", etc.)

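2.   One-Hot encoding of reviews (conversion of text reviews to numeric vectors, one column per vocabulary word; `CountVectorizer` stores word counts unless `binary=True` is set)
3.   Splitting data into training and test sets (hold-out split, 70% - 30%)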
4.   Neural network implementation
  * Architecture:
      *  Input layer size  = number of variables (columns) in the One-Hot matrix
      *  Hidden layer 1 size = 64
      *  Hidden layer 2 size = 128
      *  Output layer size = 2 (number of classes: 0 for positive, 1 for negative)
5.   Adding dropout to prevent overfitting
6.   Testing model accuracy
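
As a rough check on the architecture in step 4, the number of trainable parameters follows directly from the layer sizes: each dense layer has inputs × outputs weights plus one bias per output, and dropout layers add none. The sketch below assumes a hypothetical input width of 1,000 columns; the real width depends on the vocabulary found in the data, and the helper name `dense_param_count` is made up for illustration.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total


# Hypothetical input width (assumption, not taken from the notebook)
vocab_size = 1000
print(dense_param_count([vocab_size, 64, 128, 2]))
```

Almost all parameters sit in the first layer, so the vocabulary size largely determines how big the model is.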


In [88]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('words')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import words
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


# **Data preprocessing**

In [89]:
df = pd.read_csv("/content/full-corpus.csv")

In [90]:
df

Unnamed: 0,Topic,Sentiment,TweetId,TweetDate,TweetText
0,apple,positive,126415614616154112,Tue Oct 18 21:53:25 +0000 2011,Now all @Apple has to do is get swype on the i...
1,apple,positive,126404574230740992,Tue Oct 18 21:09:33 +0000 2011,@Apple will be adding more carrier support to ...
2,apple,positive,126402758403305474,Tue Oct 18 21:02:20 +0000 2011,Hilarious @youtube video - guy does a duet wit...
3,apple,positive,126397179614068736,Tue Oct 18 20:40:10 +0000 2011,@RIM you made it too easy for me to switch to ...
4,apple,positive,126395626979196928,Tue Oct 18 20:34:00 +0000 2011,I just realized that the reason I got into twi...
...,...,...,...,...,...
5108,twitter,irrelevant,126855687060987904,Thu Oct 20 03:02:07 +0000 2011,me re copè con #twitter
5109,twitter,irrelevant,126855171702661120,Thu Oct 20 03:00:04 +0000 2011,Buenas noches genteeee :) #twitter los quieroo...
5110,twitter,irrelevant,126854999442587648,Thu Oct 20 02:59:23 +0000 2011,#twitter tiene la mala costumbre de ponerce bn...
5111,twitter,irrelevant,126854818101858304,Thu Oct 20 02:58:40 +0000 2011,Oi @flaviasansi. Muito bem vinda ao meu #Twitt...


In [91]:
# keep the first 480 rows, which only contain "positive" and "negative" tweets
sentiment = df[['Sentiment']].iloc[:480].copy()
# map labels to integers with .loc (avoids chained assignment): positive -> 0, negative -> 1
sentiment.loc[sentiment['Sentiment'] == "positive", 'Sentiment'] = 0
sentiment.loc[sentiment['Sentiment'] == "negative", 'Sentiment'] = 1
print(sentiment['Sentiment'].unique())

[0 1]


In [92]:
reviews = df[['TweetText']].copy()
reviews = reviews[:480]
reviews.columns = ['reviews']

In [93]:
reviews.head() # data before cleaning

Unnamed: 0,reviews
0,Now all @Apple has to do is get swype on the i...
1,@Apple will be adding more carrier support to ...
2,Hilarious @youtube video - guy does a duet wit...
3,@RIM you made it too easy for me to switch to ...
4,I just realized that the reason I got into twi...


In [94]:
sentiment

Unnamed: 0,Sentiment
0,0
1,0
2,0
3,0
4,0
...,...
475,1
476,1
477,1
478,1


In [95]:
stop_words = set(stopwords.words("english"))
english_words = set(words.words())

# preprocessing (1)
def clean_review(text):
  text = text.lower() # convert to lowercase
  text = re.sub(r'[^\w\s]', '', text) # remove punctuation and emojis
  text = re.sub(r'\d+', '', text) # remove numbers
  text = re.sub(r'[^a-zA-Z\s]', '', text) # remove any remaining non-letter characters (underscores, accented letters)
  text = re.sub(r'(.)\1{2,}', r'\1', text) # collapse runs of 3+ repeated characters ("sooo" -> "so")
  tokens = word_tokenize(text)
  tokens = [word for word in tokens if word in english_words and len(word) > 2] # remove non english words and very short tokens
  filtered_tokens = [word for word in tokens if word not in stop_words] # remove stop words
  return ' '.join(filtered_tokens)

# apply to the whole column instead of assigning row by row (avoids chained assignment)
reviews['reviews'] = reviews['reviews'].apply(clean_review)
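
To see what the regular expressions in the cell above do, here is a small standalone sketch that applies the same substitutions to one made-up tweet. The sample text and the tiny stop-word set are illustrative assumptions. The notebook itself uses NLTK's tokenizer, stop-word list and English word list, which are replaced here by a plain `split()` and a length filter so the snippet runs without downloads.

```python
import re

# Tiny illustrative stop-word set (assumption; the notebook uses NLTK's list)
SAMPLE_STOP_WORDS = {"the", "is", "on", "my", "and"}


def clean_text(text):
    text = text.lower()                        # lowercase
    text = re.sub(r'[^\w\s]', '', text)        # drop punctuation and symbols
    text = re.sub(r'\d+', '', text)            # drop digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)    # keep ASCII letters and whitespace only
    text = re.sub(r'(.)\1{2,}', r'\1', text)   # collapse runs of 3+ identical characters
    # keep tokens longer than 2 characters that are not stop words
    tokens = [t for t in text.split() if len(t) > 2 and t not in SAMPLE_STOP_WORDS]
    return ' '.join(tokens)


print(clean_text("The new @Apple iPhone 4S is sooooo GOOD!!! :)"))
```

Note how aggressive the pipeline is: the mention marker, the model number and the stretched "sooooo" all disappear, leaving only a handful of content words per tweet.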

In [96]:
reviews

Unnamed: 0,reviews
0,apple get crack
1,apple carrier support
2,hilarious video guy duet apple pretty much lov...
3,rim made easy switch apple see
4,reason got twitter thanks apple
...,...
475,went little last night come apple get together...
476,ford apple instead make sync system new focus ...
477,fine restore backup help apple
478,really apple done cant click music get


In [97]:
reviews[:500]

Unnamed: 0,reviews
0,apple get crack
1,apple carrier support
2,hilarious video guy duet apple pretty much lov...
3,rim made easy switch apple see
4,reason got twitter thanks apple
...,...
475,went little last night come apple get together...
476,ford apple instead make sync system new focus ...
477,fine restore backup help apple
478,really apple done cant click music get


# **One-Hot Encoding**

In [98]:
# initialize vectorizer
vectorizer = CountVectorizer()

In [99]:
# create one-hot matrix (CountVectorizer stores per-word counts; pass binary=True for strictly 0/1 values)
one_hot_matrix = vectorizer.fit_transform(reviews['reviews'])
one_hot_matrix_df = pd.DataFrame(one_hot_matrix.toarray(), columns=vectorizer.get_feature_names_out())
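
For intuition, the matrix built above can be reproduced by hand on a toy corpus. The sketch below is a simplified, dependency-free illustration (whitespace tokenization, alphabetically sorted vocabulary), not scikit-learn's implementation; the function name `encode` and the sample reviews are made up. It also shows the difference between word counts, which `CountVectorizer()` produces by default, and strictly boolean vectors.

```python
def encode(corpus, binary=False):
    """Return (vocabulary, matrix) for a list of already-cleaned strings."""
    vocabulary = sorted({word for doc in corpus for word in doc.split()})
    index = {word: col for col, word in enumerate(vocabulary)}
    matrix = []
    for doc in corpus:
        row = [0] * len(vocabulary)
        for word in doc.split():
            row[index[word]] += 1              # count every occurrence
        if binary:
            row = [min(count, 1) for count in row]  # presence/absence only
        matrix.append(row)
    return vocabulary, matrix


toy_reviews = ["apple carrier support", "apple apple great"]
vocab, counts = encode(toy_reviews)
_, flags = encode(toy_reviews, binary=True)
print(vocab)
print(counts)
print(flags)
```

The two matrices differ only where a word repeats within one review, here the doubled "apple" in the second toy review.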

# **Splitting data into training and test data**

In [100]:
sentiment["Sentiment"]

Unnamed: 0,Sentiment
0,0
1,0
2,0
3,0
4,0
...,...
475,1
476,1
477,1
478,1
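
The split itself is performed with `train_test_split` in the model cell further down. As a quick sanity check on the sizes involved, the following standalone sketch shuffles row indices and holds out 30% of 480 rows, the fraction and row count used in this notebook. The helper `holdout_indices` is hypothetical and only mimics the idea; it will not reproduce scikit-learn's exact shuffling.

```python
import math
import random


def holdout_indices(n_samples, test_fraction, seed=42):
    """Shuffle 0..n_samples-1 and split into (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_indices(480, 0.3)
print(len(train_idx), len(test_idx))   # sizes of the two subsets
print(len(train_idx) // 2)             # batches per epoch with batch_size=2
```

With 336 training rows and a batch size of 2, each epoch runs 168 steps, which matches the `168/168` counter in the training log.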


# **Neural Network (Multi-Layer Perceptron)**
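
The network below ends in a two-unit softmax layer trained with categorical cross-entropy. The sketch here illustrates both by hand for a single sample with a one-hot label, using plain Python; the logits are made-up numbers, and Keras's own implementation differs in details such as clipping and batching.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_label, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs) if y > 0)


# An undecided two-class model: equal scores for both classes
probs = softmax([0.0, 0.0])
loss = categorical_cross_entropy([1, 0], probs)
print(probs, round(loss, 4))
```

An undecided two-class model scores ln 2 ≈ 0.693, which is a useful baseline when reading the loss values in the training log.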

In [101]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam

# `reviews` holds the cleaned tweet text (column 'reviews');
# `sentiment` holds the matching labels (0 = positive, 1 = negative)

# Vectorize the reviews text
vectorizer = CountVectorizer()
one_hot_matrix = vectorizer.fit_transform(reviews['reviews'])
one_hot_matrix_df = pd.DataFrame(one_hot_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Encode the labels (sentiment: 0 = positive, 1 = negative)
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(sentiment['Sentiment'])
categorical_labels = to_categorical(encoded_labels, num_classes=2)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(one_hot_matrix_df, categorical_labels, test_size=0.3, random_state=42)

# Build the neural network model
model = Sequential()

# Input Layer + Hidden Layer 1
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
model.add(Dropout(0.1))  # Dropout to reduce overfitting

# Hidden Layer 2
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))

# Output Layer
model.add(Dense(2, activation='softmax'))  # 2 output classes: positive (0) and negative (1)

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=50, batch_size=2, validation_data=(X_test, y_test))

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Accuracy: {accuracy:.2f}')


Epoch 1/50
168/168 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.6837 - loss: 0.6409 - val_accuracy: 0.6111 - val_loss: 0.6244
Epoch 2/50
168/168 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7066 - loss: 0.4493 - val_accuracy: 0.7639 - val_loss: 0.4946
Epoch 3/50
168/168 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9930 - loss: 0.1139 - val_accuracy: 0.7917 - val_loss: 0.6113
Epoch 4/50
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9857 - loss: 0.0402 - val_accuracy: 0.7986 - val_loss: 0.5489
Epoch 5/50
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9977 - loss: 0.0187 - val_accuracy: 0.7847 - val_loss: 0.5469
Epoch 6/50
168/168 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9990 - loss: 0.0101 - val_accuracy: 0.7986 - val_loss: 0.6569
Epoch 7/50
[1m168/168[0m [32m━━━━━━━