# Sentiment Analysis in Python using Machine Learning

#### Group Member Names: Maria Namitha Nelson & Sandar Aung




### INTRODUCTION:
<p align="justify">
Sentiment analysis, also known as opinion mining, is a process focused on deciphering the underlying emotions expressed in a piece of text. It aims to determine the writer's intent and sentiment, whether positive, negative, or neutral, in their writing. This task is essential in understanding the subjective nuances of textual information.</p>

<p align="justify">
To achieve accurate sentiment analysis, various natural language processing (NLP) techniques and text analysis tools are employed. These tools help identify, extract, and quantify subjective information, enabling easier classification and analysis of the data. By breaking down complex text into manageable insights, sentiment analysis becomes a powerful tool for interpreting and responding to human emotions in written communication.</p>


#### AIM :

<p align="justify">
The aim of this project is to develop a robust sentiment analysis system that can accurately classify the sentiment expressed in a piece of text. By leveraging natural language processing (NLP) techniques and advanced machine learning models, specifically Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, the goal is to identify, extract, and quantify subjective information from textual data. The system will be tested on the IMDb movie review dataset, whose reviews are labeled as positive or negative. Through this project, we aim to demonstrate the application of sentiment analysis in understanding public opinions, improving customer interaction, and enhancing recommendation systems.</p>


<p align="justify">
Developed in Python using PyTorch, the project will involve efficient data preprocessing, model training, and evaluation. The outcome will be a robust sentiment analysis model with practical applications in customer feedback analysis, public opinion monitoring, and recommendation systems.</p>

#### Github Repo:

https://github.com/rakshitha123/WeeklyForecasting/blob/master/generic_model_trainer.py


#### DESCRIPTION OF PAPER:

<p align="justify">
This project aims to develop a sentiment analysis system that classifies the sentiment in textual data, focusing on movie reviews from the IMDb dataset. By leveraging natural language processing (NLP) techniques and Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units, the system will handle sequential data to determine whether a review is positive or negative.</p>



### Methodology

<p align="justify">
The project uses the IMDb movie review dataset. The text is preprocessed through tokenization, stop-word removal, and vectorization, and a logarithmic transformation is applied to normalize data distributions and improve model performance.</p>

1. **Base Models:**
   - <p align="justify">
   Implement base models including Support Vector Machines (SVM), Logistic Regression, and simple RNNs to establish foundational performance metrics.</p>

2. **Model Pools:**
   - <p align="justify">
  Create a pool of advanced models, including LSTM networks, Bidirectional LSTMs, and Convolutional Neural Networks (CNNs) to capture complex patterns in sentiment data.</p>

3. **Meta-Learning Architectures:**
   - <p align="justify">
   Employ meta-learning techniques to combine predictions from the model pool, optimizing the final sentiment classification. Techniques such as stacking and ensemble learning will be used to improve overall accuracy and robustness.</p>
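
The stacking idea in step 3 can be sketched as follows. This is a hedged illustration only: it swaps in two classical base models (Naive Bayes and a linear SVM) for the neural networks listed above, the eight training sentences are invented, and `cv=2` is chosen purely because the toy set is tiny.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data: 1 = positive, 0 = negative.
texts = [
    "a wonderful, moving film",
    "great acting and a great story",
    "loved every minute of it",
    "brilliant and funny",
    "a dull, boring film",
    "terrible acting and a weak story",
    "hated every minute of it",
    "awful and tedious",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Base models produce out-of-fold predictions; the final estimator
# (the meta-learner) is trained on those predictions.
stack = StackingClassifier(
    estimators=[("nb", MultinomialNB()), ("svm", LinearSVC())],
    final_estimator=LogisticRegression(),
    cv=2,  # small only because the toy dataset has eight samples
)

model = make_pipeline(TfidfVectorizer(), stack)
model.fit(texts, labels)

preds = model.predict(["a great and funny story", "a boring and awful story"])
print(preds)
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for data the base models have already seen, is what keeps stacking from simply memorizing the base models' training fit.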

#### PROBLEM STATEMENT :

<p align="justify">
The goal of this project is to develop an accurate sentiment analysis system capable of classifying textual data, specifically movie reviews, as positive or negative. Traditional methods often struggle with the complexity of natural language, requiring advanced models that can handle sequential data and context. The challenge lies in effectively preprocessing the data, selecting appropriate models, and applying meta-learning techniques to achieve high classification accuracy and robustness in sentiment analysis.</p>


#### CONTEXT OF THE PROBLEM:

<p align="justify">
In today's digital age, vast amounts of textual data are generated through reviews, social media, and customer feedback. Understanding the sentiment behind this data is crucial for businesses to gauge public opinion, improve customer service, and enhance product offerings. However, the complexity of natural language and the need to accurately interpret context and emotion pose significant challenges. Advanced sentiment analysis systems are essential to overcome these challenges, enabling more accurate and actionable insights from textual data.</p>


#### SOLUTION:

<p align="justify">
The solution involves developing a sentiment analysis system using advanced natural language processing (NLP) techniques and machine learning models. By employing Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units, the system will effectively capture and analyze the context in textual data. Additionally, meta-learning architectures will be used to combine the strengths of multiple models, resulting in a robust and accurate classification of sentiments in movie reviews. This approach will provide a reliable tool for understanding and leveraging public sentiment in various applications.</p>


# Background


<p align="justify">
Sentiment analysis is a key tool in understanding and interpreting the emotions expressed in text, such as customer reviews and social media posts. Traditional models often struggle with capturing the complexity of human language, especially in handling context and sequential information.
</p>

<p align="justify">
**Dataset/Input**: The project uses the IMDb movie review dataset, consisting of 50,000 reviews labeled as positive or negative. This dataset provides a balanced and challenging environment for training and testing sentiment analysis models.</p>

<p align="justify">
**Weakness**: Despite its effectiveness, sentiment analysis can be limited by its reliance on labeled data and the inherent difficulty in interpreting nuanced sentiments. Models may also struggle with handling sarcasm, context shifts, and variations in language, leading to potential misclassification. Additionally, the use of fixed datasets like IMDb may not fully generalize to other domains without significant tuning and adaptation.</p>
*********************************************************************************************************************








# Implement paper code :
*********************************************************************************************************************




In [None]:
!pip install scikit-learn nltk



In [None]:
import pandas as pd
import nltk
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [None]:
nltk.download('movie_reviews')
nltk.download('punkt')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Load the movie reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle documents to ensure randomness
import random
random.shuffle(documents)

# Split into training and test sets
train_size = int(0.8 * len(documents))
train_docs = documents[:train_size]
test_docs = documents[train_size:]

# Function to join words and convert to string
def join_words(doc):
    return ' '.join(doc)

# Prepare training and test data
train_texts = [join_words(doc[0]) for doc in train_docs]
train_labels = [doc[1] for doc in train_docs]
test_texts = [join_words(doc[0]) for doc in test_docs]
test_labels = [doc[1] for doc in test_docs]


In [None]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

In [None]:
model = MultinomialNB()
model.fit(X_train, train_labels)

In [None]:
# Predict the labels for the test set
predictions = model.predict(X_test)

# Print evaluation metrics
print("Accuracy:", metrics.accuracy_score(test_labels, predictions))
print("Confusion Matrix:\n", metrics.confusion_matrix(test_labels, predictions))
print("Classification Report:\n", metrics.classification_report(test_labels, predictions))

Accuracy: 0.825
Confusion Matrix:
 [[171  29]
 [ 41 159]]
Classification Report:
               precision    recall  f1-score   support

         neg       0.81      0.85      0.83       200
         pos       0.85      0.80      0.82       200

    accuracy                           0.82       400
   macro avg       0.83      0.82      0.82       400
weighted avg       0.83      0.82      0.82       400
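
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them by hand from the numbers printed above, using scikit-learn's convention that rows are true labels and columns are predicted labels; the helper name `precision_recall_f1` is introduced here for illustration.

```python
# Confusion matrix copied from the output above.
# Rows: true label (neg, pos). Columns: predicted label (neg, pos).
cm = [[171, 29],
      [41, 159]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions / all predictions


def precision_recall_f1(cm, k):
    """Return precision, recall, and F1 for class index k."""
    tp = cm[k][k]
    predicted_k = cm[0][k] + cm[1][k]  # column sum: everything predicted as k
    actual_k = sum(cm[k])              # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k, name in enumerate(["neg", "pos"]):
    p, r, f1 = precision_recall_f1(cm, k)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f}")
```

For example, accuracy is (171 + 159) / 400 = 0.825, and precision for `neg` is 171 / (171 + 41), which rounds to the 0.81 shown in the report.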



In [None]:
def predict_sentiment(text):
    text_vector = vectorizer.transform([text])
    prediction = model.predict(text_vector)
    return prediction[0]

# Example usage
new_text = "I loved the movie. It was fantastic!"
print("Predicted Sentiment:", predict_sentiment(new_text))


Predicted Sentiment: pos


*********************************************************************************************************************
### Contribution Code :

In [None]:
def load_data(file_path, column_names):
    df = pd.read_csv(file_path, delimiter='\t', names=column_names)
    return df

In [None]:
# Load datasets
yelp_df = load_data('data/yelp_labelled.txt', ['text', 'sentiment'])
amazon_df = load_data('data/amazon_cells_labelled.txt', ['text', 'sentiment'])
combined_df = pd.concat([yelp_df, amazon_df], ignore_index=True)

combined_df.head()

| |text|sentiment|
|------|------|------|
|0|Wow... Loved this place.|1|
|1|Crust is not good.|0|
|2|Not tasty and the texture was just nasty.|0|
|3|Stopped by during the late May bank holiday of...|1|
|4|The selection on the menu was great and so wer...|1|


In [None]:
# Separate features and labels
texts = combined_df['text']
labels = combined_df['sentiment']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

In [None]:
# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the training data
X_train_vectorized = vectorizer.fit_transform(X_train)

# Transform the test data
X_test_vectorized = vectorizer.transform(X_test)


In [None]:
# Initialize and train the model
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)

In [None]:
# Predict the labels for the test set
y_pred = model.predict(X_test_vectorized)

# Print evaluation metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))
print("Classification Report:\n", metrics.classification_report(y_test, y_pred))


Accuracy: 0.7975
Confusion Matrix:
 [[162  40]
 [ 41 157]]
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.80      0.80       202
           1       0.80      0.79      0.79       198

    accuracy                           0.80       400
   macro avg       0.80      0.80      0.80       400
weighted avg       0.80      0.80      0.80       400



In [None]:
def predict_sentiment(text):
    text_vectorized = vectorizer.transform([text])
    prediction = model.predict(text_vectorized)
    return "Positive" if prediction[0] == 1 else "Negative"

# Example usage
new_text = "I loved the product! It exceeded my expectations."
print("Predicted Sentiment:", predict_sentiment(new_text))

Predicted Sentiment: Positive


### Results :
*******************************************************************************************************************************

In [None]:
predict_sentiment('the service is terrible')

'Negative'

In [None]:
predict_sentiment('you are awesome!')

'Positive'




#### Observations :

<p align="justify">
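The model correctly labeled "the service is terrible" as negative and "you are awesome!" as positive. Together with a test accuracy of about 80% on the combined Yelp and Amazon reviews, these spot checks suggest that the bag-of-words Naive Bayes classifier handles short, direct statements of sentiment. Two examples do not by themselves show that the model understands context, so it should be validated on a larger and more varied sample before it is relied on in real-world applications.</p>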
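
Spot checks like these become more informative when the predicted probability is reported alongside the label. The following is a minimal sketch under stated assumptions: the four training sentences are invented stand-ins for the Yelp and Amazon data, and `sentiment_with_confidence` is a hypothetical helper, not part of the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy training set: 1 = positive, 0 = negative.
train_texts = [
    "loved it, awesome product",
    "great service and great food",
    "terrible service",
    "awful food, not good",
]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)


def sentiment_with_confidence(text):
    """Return the predicted label and the model's probability for it."""
    probs = model.predict_proba(vectorizer.transform([text]))[0]
    best = probs.argmax()  # column order follows model.classes_
    label = "Positive" if model.classes_[best] == 1 else "Negative"
    return label, float(probs[best])


# The last probe contains a negation, a case bag-of-words models often miss.
for probe in ["you are awesome!", "the service is terrible", "not awesome at all"]:
    print(probe, "->", sentiment_with_confidence(probe))
```

A probability close to 0.5 signals that the label rests on very little evidence, which is easy to overlook when only the label is printed.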


### Conclusion and Future Direction :

**Conclusion**: The sentiment analysis model successfully identifies and classifies sentiments in text, reaching about 82% accuracy on the NLTK movie review corpus and about 80% on the combined Yelp and Amazon reviews. It handles straightforward positive and negative expressions well, which indicates that it is a usable baseline for real-world applications.

**Future Directions**: Future efforts could focus on improving the model's ability to handle complex sentiments, such as sarcasm or mixed emotions, and extending its applicability to other domains and languages through additional training and fine-tuning.


#### Learnings :

Through this project, you can learn key concepts in Natural Language Processing (NLP), including text preprocessing and sentiment analysis. You'll gain hands-on experience with machine learning models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, learning how to implement and fine-tune them for handling sequential data. Additionally, you'll understand the importance of data preprocessing and how it impacts model performance. The project also teaches you how to evaluate models using metrics such as accuracy and F1-score, and how to explore advanced techniques like meta-learning for improved classification. Overall, this project provides a solid foundation in applying NLP and machine learning to real-world sentiment analysis tasks.
*******************************************************************************************************************************
<br/>
<div style='text-align: justify;'>

#### Results Discussion :


The sentiment analysis model performed well on straightforward inputs, correctly identifying the positive sentiment in the phrase "you are awesome!" and reaching about 80% accuracy on the held-out Yelp and Amazon reviews. However, further improvements are needed to handle more complex cases like sarcasm or mixed emotions. The results suggest that the model is robust for basic sentiment analysis but could benefit from additional training and fine-tuning for broader applications.

<br/>
<div style='text-align: justify;'>

#### Limitations :

The model may struggle with detecting nuanced sentiments like sarcasm or mixed emotions, and its performance might not generalize well to texts outside the training domain. Additionally, it relies heavily on labeled data, which can limit its adaptability to diverse contexts without further training.
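
One concrete reason a bag-of-words model generalizes poorly outside its training domain is that the vectorizer silently ignores words it has never seen. The sketch below illustrates this with two invented training sentences; `vocabulary_coverage` is a hypothetical helper introduced here, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Vocabulary learned from two invented restaurant-style sentences.
vectorizer = CountVectorizer()
vectorizer.fit(["the food was great", "the service was terrible"])


def vocabulary_coverage(text):
    """Fraction of the text's tokens that appear in the fitted vocabulary."""
    tokens = vectorizer.build_analyzer()(text)
    if not tokens:
        return 0.0
    known = [t for t in tokens if t in vectorizer.vocabulary_]
    return len(known) / len(tokens)


in_domain = "the food was terrible"
out_of_domain = "battery drains quickly"

print(vocabulary_coverage(in_domain))
print(vocabulary_coverage(out_of_domain))

# Unseen words are dropped, so this vector has no non-zero entries and a
# classifier would have nothing but its class priors to go on.
print(vectorizer.transform([out_of_domain]).nnz)
```

Checking coverage before trusting a prediction is a cheap way to flag inputs that fall outside the training domain.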

<br/>
<div style='text-align: justify;'>

#### Future Extension :

Future extensions could include enhancing the model's ability to detect complex sentiments, such as sarcasm or mixed emotions, by incorporating more diverse training data. Expanding the model to handle multiple languages and different text domains could also increase its applicability. Additionally, integrating transfer learning techniques and exploring more advanced architectures like transformers could further improve its performance and adaptability.


# References:

- https://data-flair.training/blogs/python-sentiment-analysis/
- https://docs.python.org/3/library/index.html
- https://aws.amazon.com/what-is/sentiment-analysis/#:~:text=Sentiment%20analysis%20is%20the%20process,social%20media%20comments%2C%20and%20reviews.
