
# NLP Beginner Project: Movie Sentiment Analysis

## Objective
The goal of this project is to build a machine learning model that can automatically classify movie reviews as **positive or negative**.

This project helps beginners understand how Natural Language Processing (NLP) works in real-world applications such as:
- Product review analysis
- Customer feedback
- Social media sentiment
- Brand monitoring

We will go through the full NLP pipeline step by step.

## Step 1: Import Required Libraries

In this step, we import all the libraries needed for this project.

- **NLTK** → for text processing and the movie review dataset
- **NumPy & Pandas** → for handling data
- **random** → to shuffle the dataset

These libraries help us clean, process, and analyze text data.

In [None]:
import nltk
import numpy as np
import pandas as pd
import random

## Step 2: Download Dataset


In [None]:
nltk.download('movie_reviews')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Ajmal\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ajmal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ajmal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ajmal\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Step 3: Load and Prepare the Dataset

Here we load all movie reviews and their labels (positive or negative).

Each review consists of:
- Text (words)
- Category (sentiment)

We also shuffle the dataset to avoid bias during training.

In [None]:
import pandas as pd
from nltk.corpus import movie_reviews

# Create empty list
data = []

# Loop through dataset
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        review = movie_reviews.raw(fileid)  # full review text
        data.append([review, category])

# Convert to DataFrame
df = pd.DataFrame(data, columns=["Review", "Sentiment"])

df.head()

| | Review | Sentiment |
|---|---|---|
| 0 | plot : two teen couples go to a church party ,... | neg |
| 1 | the happy bastard's quick movie review \ndamn ... | neg |
| 2 | it is movies like these that make a jaded movi... | neg |
| 3 | " quest for camelot " is warner bros . ' firs... | neg |
| 4 | synopsis : a mentally unstable man undergoing ... | neg |


In [None]:
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
 for category in movie_reviews.categories()
 for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
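
`random.shuffle` reorders the list in place using the global random state, so the order changes from run to run. A minimal sketch of a reproducible shuffle follows. The seed value 42, the helper name `shuffled_copy`, and the small `toy_documents` list are assumptions made for illustration and are not part of the original notebook.

```python
import random

# Hypothetical stand-in for the (words, label) pairs built above.
toy_documents = [(["good", "film"], "pos"), (["bad", "plot"], "neg"),
                 (["great", "cast"], "pos"), (["dull", "story"], "neg")]

def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global state is untouched
    result = list(items)
    rng.shuffle(result)
    return result

first = shuffled_copy(toy_documents)
second = shuffled_copy(toy_documents)
print(first == second)  # True: both calls used the same seed
```

Fixing the seed makes later results, such as the accuracy score, easier to reproduce.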

## Step 4: Text Preprocessing

Text preprocessing is one of the most important steps in NLP.

Real-world text data contains:
- Uppercase and lowercase words
- Unwanted characters
- Common words that do not add meaning

In this step, we:
1. Convert words to lowercase
2. Remove stopwords
3. Remove special characters and numbers
4. Apply lemmatization to get root words

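Example:
movies → movie

Note: `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed. With the code below, "playing" therefore stays "playing"; calling `lemmatize("playing", pos="v")` would return "play".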
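
The first three steps (lowercasing, stopword removal, and dropping non-alphabetic tokens) can be sketched without any downloads. The stopword set `TOY_STOP_WORDS` and the function name `toy_preprocess` are made up for this illustration; the notebook itself uses NLTK's full English list. Lemmatization is left out here because it needs the WordNet data.

```python
# Tiny hand-written stopword set, used here only for illustration.
TOY_STOP_WORDS = {"the", "was", "and", "a", "of"}

def toy_preprocess(words):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    return [w.lower() for w in words
            if w.isalpha() and w.lower() not in TOY_STOP_WORDS]

tokens = ["The", "acting", "was", "GREAT", ",", "10", "out", "of", "10", "!"]
print(toy_preprocess(tokens))  # ['acting', 'great', 'out']
```

Punctuation and numbers disappear because `str.isalpha()` returns `False` for them.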

In [None]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(words):
    # Keep alphabetic tokens only, lowercase them, and drop stopwords
    filtered = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
    # Reduce each word to its lemma (treated as a noun by default)
    return [lemmatizer.lemmatize(w) for w in filtered]

In [None]:
cleaned_docs = [(preprocess(doc), label) for doc, label in documents]

## Step 5: Feature Extraction (Convert Text into Numbers)

Machine learning models cannot understand text directly.

So we convert text into numerical format using **Bag of Words**.

Bag of Words:
- Builds a vocabulary from the words in the corpus
- Counts how many times each vocabulary word appears in a review
- Converts each review into a numeric vector of those counts

We also limit the vocabulary to the 3000 most frequent words (`max_features=3000`) to reduce complexity.
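
To make the idea concrete, here is a simplified pure-Python sketch of Bag of Words on three toy sentences. It is not how `CountVectorizer` is implemented (that class also handles tokenization and sparse output); the helper names `build_vocabulary` and `to_count_vector` and the toy texts are assumptions for illustration.

```python
from collections import Counter

def build_vocabulary(texts, max_features):
    """Keep the max_features most frequent words across all texts."""
    totals = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:max_features])

def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in one text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

toy_texts = ["good movie good cast", "bad movie", "good plot bad ending"]
vocab = build_vocabulary(toy_texts, max_features=3)
vectors = [to_count_vector(t, vocab) for t in toy_texts]
print(vocab)    # ['bad', 'good', 'movie']
print(vectors)  # [[0, 2, 1], [1, 0, 1], [1, 1, 0]]
```

Words outside the vocabulary ("cast", "plot", "ending") are simply ignored, which is what limiting the number of features does.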

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
texts = [' '.join(doc) for doc, label in cleaned_docs]
labels = [label for doc, label in cleaned_docs]
vectorizer = CountVectorizer(max_features=3000)
X = vectorizer.fit_transform(texts)
y = labels

## Step 6: Train Machine Learning Model

Now we train a classification model using Logistic Regression.

Why Logistic Regression?
- Simple and effective for text classification
- Works well for sentiment analysis
- Fast and beginner-friendly

We split the dataset into:
- Training data (80%)
- Testing data (20%)
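
The 80/20 split can be sketched in plain Python as shown here; with the 2,000 reviews in the corpus this gives 1,600 for training and 400 for testing. The function `toy_train_test_split` is a hypothetical stand-in written for illustration, not scikit-learn's `train_test_split`, and the toy samples are made up.

```python
import random

def toy_train_test_split(samples, labels, test_size=0.2, seed=42):
    """Shuffle the indices, then hold out a test_size fraction for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [samples[i] for i in train_idx]
    X_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

toy_samples = list(range(10))
toy_labels = ["pos" if i % 2 == 0 else "neg" for i in toy_samples]
X_tr, X_te, y_tr, y_te = toy_train_test_split(toy_samples, toy_labels)
print(len(X_tr), len(X_te))  # 8 2
```

The test portion is never shown to the model during training, so the score measured on it estimates performance on unseen reviews.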

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |


## Step 7: Model Evaluation

We evaluate the performance of our model using:

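- Accuracy → the fraction of test predictions that are correct
- Classification metrics → precision, recall, and F1-score (not computed in the cell below; `classification_report` from `sklearn.metrics` prints them)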

This helps us understand how well the model is performing.
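
As a rough illustration of what these metrics measure, the sketch below computes them by hand for a small set of made-up labels. The function name `binary_metrics` and the toy label lists are assumptions for illustration; in practice scikit-learn's metric functions would be used instead.

```python
def binary_metrics(y_true, y_pred, positive="pos"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
toy_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
toy_pred = ["pos", "pos", "neg", "neg", "neg", "pos", "neg", "pos"]
scores = binary_metrics(toy_true, toy_pred)
print(scores)
```

Precision asks how many predicted positives were right, while recall asks how many actual positives were found; F1 is their harmonic mean.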

In [None]:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))

Accuracy: 0.815


## Step 8: Predict Sentiment of New Review

Now we test the model on a new movie review.

Steps:
1. Tokenize the input text
2. Apply preprocessing
3. Convert text into features
4. Predict sentiment



In [None]:
def predict_sentiment(text):
    tokens = nltk.word_tokenize(text)   # split the raw text into tokens
    cleaned = preprocess(tokens)        # same cleaning as the training data
    features = vectorizer.transform([' '.join(cleaned)])  # reuse the fitted vocabulary
    return model.predict(features)[0]

print(predict_sentiment('This movie is not amazing'))

neg
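
One caveat about this example: NLTK's English stopword list contains "not", so `preprocess` removes it before the vectorizer sees the text, and the prediction cannot take the negation into account. The sketch below shows the effect with a small hand-written stopword set. `TOY_STOP_WORDS`, `NEGATIONS`, and `remove_stopwords` are hypothetical names used only for illustration.

```python
# Hand-written subset of common English stopwords, for illustration only.
TOY_STOP_WORDS = {"this", "is", "not", "the", "a"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, keep_negations=False):
    """Drop stopwords, optionally keeping negation words."""
    stop = TOY_STOP_WORDS - NEGATIONS if keep_negations else TOY_STOP_WORDS
    return [t.lower() for t in tokens if t.isalpha() and t.lower() not in stop]

tokens = "This movie is not amazing".split()
print(remove_stopwords(tokens))                       # ['movie', 'amazing']
print(remove_stopwords(tokens, keep_negations=True))  # ['movie', 'not', 'amazing']
```

Keeping negation words is one possible extension. Since a plain Bag of Words counts each word on its own, it would likely need to be combined with word pairs, for example through the `ngram_range` parameter of `CountVectorizer`, to connect "not" with "amazing".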


#### This is how NLP models are used in real-world applications.

In this project, we learned:
- Text preprocessing
- Feature extraction
- Sentiment classification
- Model evaluation

This project is a strong foundation for advanced NLP topics like:
- Spam detection
- Fake news detection
- Chatbots
- Deep learning (RNN, LSTM, Transformers)

You can extend this project by:
- Using real product review data
- Building a web app using Streamlit
- Deploying the model using FastAPI