# **Sentiment Analysis on Movie Reviews**

### **Description**:

Create a sentiment analysis tool that classifies movie reviews as positive or negative. The tool cleans the raw review text, converts it to TF-IDF features, and trains a Multinomial Naive Bayes classifier with scikit-learn. This project will help you understand natural language processing (NLP) and basic machine learning concepts.

**Step 1**: Install Required Libraries

pip install pandas scikit-learn nltk
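
The package names passed to pip are not always the names used in `import` statements: `scikit-learn` is imported as `sklearn`. A minimal sketch for checking that the three libraries are importable before continuing (the helper name `missing_modules` is illustrative, not part of any library):

```python
import importlib.util

# pip package name -> import name.
# Note: the pip package "scikit-learn" is imported as "sklearn".
REQUIRED = {"pandas": "pandas", "scikit-learn": "sklearn", "nltk": "nltk"}

def missing_modules(required=REQUIRED):
    """Return the pip package names whose import name cannot be found."""
    return [pip_name for pip_name, import_name in required.items()
            if importlib.util.find_spec(import_name) is None]

if __name__ == "__main__":
    missing = missing_modules()
    print("Missing packages:", missing if missing else "none")
```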

**Step 2**: Data Collection

**Step 2.1**: Load and Process the Text Files

In [1]:
import os
import pandas as pd

In [2]:
def load_data(pos_dir, neg_dir):
    reviews = []
    sentiments = []

    # Load positive reviews
    for filename in os.listdir(pos_dir):
        with open(os.path.join(pos_dir, filename), 'r', encoding='utf-8') as file:
            reviews.append(file.read())
            sentiments.append(1)  # Positive sentiment

    # Load negative reviews
    for filename in os.listdir(neg_dir):
        with open(os.path.join(neg_dir, filename), 'r', encoding='utf-8') as file:
            reviews.append(file.read())
            sentiments.append(0)  # Negative sentiment

    return pd.DataFrame({'review': reviews, 'sentiment': sentiments})
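
`os.listdir` returns entries in arbitrary order and includes every entry in the directory, so the row order of the DataFrame can differ between machines. A hedged variant, assuming the review files end in `.txt` (an assumption about the dataset layout; adjust the pattern if yours differs), reads the files in sorted order. `read_reviews` is a hypothetical helper name, and it returns plain lists rather than a DataFrame to keep the sketch dependency-free:

```python
from pathlib import Path

def read_reviews(directory, label, pattern="*.txt"):
    """Read every file matching `pattern` in `directory`, in sorted order.

    Returns two parallel lists: the review texts and one copy of `label`
    per review.
    """
    texts, labels = [], []
    for path in sorted(Path(directory).glob(pattern)):
        texts.append(path.read_text(encoding="utf-8"))
        labels.append(label)
    return texts, labels
```

The two lists for each class can be concatenated and passed to `pd.DataFrame({'review': texts, 'sentiment': labels})`, as in `load_data`.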

In [3]:
# Specify the paths to the positive and negative review directories
pos_dir = 'train/pos'  # Update with your path
neg_dir = 'train/neg'  # Update with your path

In [4]:
# Load the dataset

data = load_data(pos_dir, neg_dir)

In [5]:
# Display the first few rows of the dataset

data.head()

| | review | sentiment |
|---|---|---|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | 1 |
| 1 | Homelessness (or Houselessness as George Carli... | 1 |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | 1 |
| 3 | This is easily the most underrated film inn th... | 1 |
| 4 | This is not the typical Mel Brooks film. It wa... | 1 |


**Step 3**: Data Preprocessing, Model Training, and Evaluation

In [6]:
# os and pandas were already imported in Step 2
import string

import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

In [7]:
# Download NLTK resources

nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\91843\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [11]:
# Data Cleaning

def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    return text
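
Because `clean_text` deletes punctuation without inserting a space, any markup in the raw text is fused into the neighbouring words. Reviews in the IMDb dataset often contain `<br />` line-break tags, which this step turns into stray `br` fragments. The sketch below reproduces `clean_text` and adds a hypothetical variant, `clean_text_html`, that replaces those tags with a space first; the regular expression is an assumption about how the tags appear in your copy of the data.

```python
import re
import string

def clean_text(text):
    """Lowercase the text and delete all ASCII punctuation (as above)."""
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation))

def clean_text_html(text):
    """Replace <br> / <br /> tags with a space, then apply clean_text."""
    text = re.sub(r'<br\s*/?>', ' ', text, flags=re.IGNORECASE)
    return clean_text(text)

sample = "Great movie!<br />Loved it."
print(clean_text(sample))       # great moviebr loved it
print(clean_text_html(sample))  # great movie loved it
```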

In [12]:
data['cleaned_review'] = data['review'].apply(clean_text)

In [13]:
# Remove stop words

stop_words = set(stopwords.words('english'))
data['cleaned_review'] = data['cleaned_review'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
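
NLTK's English stop-word list includes negations such as `not` and `no`, so filtering can change the meaning of a phrase: "not good" is reduced to "good". The sketch below uses a tiny hand-written stop-word set (a stand-in for `stopwords.words('english')`, chosen only so the example runs without downloads) to show the effect and one possible workaround, keeping a few negation words. Whether this improves accuracy on this dataset is untested here.

```python
# Stand-in for NLTK's list; the real list is much longer.
DEMO_STOP_WORDS = {"the", "was", "is", "a", "not", "no", "it"}
# Hypothetical set of words to keep even though they are stop words.
NEGATIONS = {"not", "no"}

def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token found in `stop_words`."""
    return ' '.join(word for word in text.split() if word not in stop_words)

review = "the acting was not good"
print(remove_stop_words(review, DEMO_STOP_WORDS))              # acting good
print(remove_stop_words(review, DEMO_STOP_WORDS - NEGATIONS))  # acting not good
```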

In [14]:
# Split the dataset

X = data['cleaned_review']
y = data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
# Feature Extraction

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
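
To make the feature-extraction step concrete, here is a dependency-free sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. It splits on whitespace, which is simpler than the vectorizer's own tokenizer (whose default pattern drops single-character tokens), so treat it as an illustration rather than a reimplementation. `tfidf_rows` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

vocab, rows = tfidf_rows(["great film", "terrible film"])
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, the shared word `film` receives a lower weight than `great`, which appears in only one of the two documents.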

In [16]:
# Model Training

model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
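
`MultinomialNB` scores each class by adding a log prior to a sum of per-feature log likelihoods, with additive (Laplace) smoothing controlled by `alpha` (1.0 by default). A minimal pure-Python sketch of that rule on word counts follows, assuming whitespace tokenisation and illustrative names (`train_nb`, `predict_nb`). It is not scikit-learn's implementation, and it uses raw counts rather than TF-IDF weights.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocabulary = {tok for doc in documents for tok in doc.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(documents)),
            "log_lik": {w: math.log((word_counts[label][w] + alpha) / denom)
                        for w in vocabulary},
        }
    return model

def predict_nb(model, document):
    """Return the class with the highest log posterior; unseen words are skipped."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_lik"][w] for w in document.split()
            if w in params["log_lik"])
    return max(model, key=score)

docs = ["great fun film", "wonderful great acting",
        "boring dull film", "dull terrible plot"]
toy_model = train_nb(docs, [1, 1, 0, 0])
print(predict_nb(toy_model, "great acting"))  # 1
print(predict_nb(toy_model, "dull plot"))     # 0
```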

In [17]:
# Model Evaluation

y_pred = model.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.88      0.87      2485
           1       0.88      0.85      0.86      2515

    accuracy                           0.86      5000
   macro avg       0.86      0.86      0.86      5000
weighted avg       0.86      0.86      0.86      5000

[[2193  292]
 [ 388 2127]]
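
Every number in the classification report can be recomputed from the confusion matrix, in which rows are true labels and columns are predicted labels (scikit-learn's convention). A short sketch using the matrix printed above, shown for the positive class:

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label.
tn, fp = 2193, 292   # true label 0: true negatives, false positives
fn, tp = 388, 2127   # true label 1: false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of reviews predicted positive, share truly positive
recall_pos = tp / (tp + fn)      # of truly positive reviews, share found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy:  {accuracy:.3f}")       # 0.864
print(f"precision: {precision_pos:.3f}")  # 0.879
print(f"recall:    {recall_pos:.3f}")     # 0.846
print(f"f1:        {f1_pos:.3f}")         # 0.862
```

The model labels 388 positive reviews as negative and 292 negative reviews as positive, so it misses positive reviews somewhat more often than it raises false positives.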


**Step 4**: User Interface

In [18]:
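def predict_sentiment(review):
    """Return the predicted label for a raw review: 1 = positive, 0 = negative."""
    # Apply the same preprocessing that was used on the training data
    cleaned_review = clean_text(review)
    cleaned_review = ' '.join(word for word in cleaned_review.split() if word not in stop_words)
    # Reuse the fitted vectorizer; transform expects a list of documents
    review_tfidf = vectorizer.transform([cleaned_review])
    prediction = model.predict(review_tfidf)
    return prediction[0]

# User input loop: prints 1 for a positive review and 0 for a negative one
while True:
    user_review = input("Enter a movie review (or type 'exit' to quit): ")
    if user_review.strip().lower() == 'exit':
        break
    sentiment = predict_sentiment(user_review)
    print(f'Sentiment: {sentiment}')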

Sentiment: 0
Sentiment: 0
Sentiment: 0
Sentiment: 0
Sentiment: 1
Sentiment: 0
Sentiment: 0
Sentiment: 0
Sentiment: 0
Sentiment: 0
Sentiment: 1
Sentiment: 1
Sentiment: 0
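
The loop prints the raw class label, so `0` means negative and `1` means positive (the encoding assigned in `load_data`). A small sketch of a helper that maps the label to a readable string; `describe_sentiment` is a hypothetical name, and `int()` is applied because the model returns a NumPy integer:

```python
LABEL_NAMES = {0: "Negative", 1: "Positive"}

def describe_sentiment(label):
    """Map a 0/1 class label (Python or NumPy integer) to a readable name."""
    return LABEL_NAMES[int(label)]

print(describe_sentiment(1))  # Positive
print(describe_sentiment(0))  # Negative
```

In the loop, `print(f'Sentiment: {describe_sentiment(sentiment)}')` would then print the name instead of the number.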


# **Conclusion**

In this project, we built a sentiment analysis tool that classifies movie reviews as positive or negative using the IMDb dataset. We loaded the raw review files, lowercased the text, removed punctuation and stop words, extracted TF-IDF features, and trained a Multinomial Naive Bayes classifier. On the 5,000-review held-out test set, the model reached 86% accuracy, with precision and recall between 0.85 and 0.88 for both classes. A simple command-line loop provides interactive predictions. Future enhancements could include more advanced models, hyperparameter tuning, and deployment as a web application.