# Step 1: Setup and Data Loading
We use NLTK's movie_reviews corpus, a pre-packaged dataset of 2,000 labeled movie reviews that loads quickly.
This code prepares the movie reviews dataset for sentiment analysis. It does the following:

Imports libraries for data handling, machine learning, and natural language processing.

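Downloads the NLTK resources: the movie_reviews corpus, plus the stopwords list and the punkt tokenizer. The code below only reads movie_reviews, because TfidfVectorizer handles stop words and tokenization itself in Step 2.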

Loads all movie reviews, pairing each review with its sentiment label (positive or negative).

Creates a pandas DataFrame with two columns: the review text and its sentiment.

Encodes sentiment numerically: 1 for positive, 0 for negative, which is suitable for machine learning.

Prints a summary showing total reviews, counts of positive and negative reviews, and sample data.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
import nltk
from nltk.corpus import movie_reviews


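print("Downloading NLTK resources...")
# By default nltk.download() signals failure through its return value (False)
# rather than by raising, so check the results instead of using try/except.
resources = ['movie_reviews', 'stopwords', 'punkt']
if all(nltk.download(name, quiet=True) for name in resources):
    print("NLTK resources downloaded successfully.")
else:
    print("Error downloading NLTK resources: one or more downloads failed.")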

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

data = {'review': [' '.join(words) for words, category in documents],
        'sentiment': [category for words, category in documents]}
df = pd.DataFrame(data)

df['sentiment_encoded'] = df['sentiment'].apply(lambda x: 1 if x == 'pos' else 0)

print("--- Data Loading Complete ---")
print(f"Total reviews loaded: {len(df)}")
print(f"Positive reviews: {df['sentiment'].value_counts()['pos']}")
print(f"Negative reviews: {df['sentiment'].value_counts()['neg']}")
print("\nSample Data:")
print(df.head())

Downloading NLTK resources...
NLTK resources downloaded successfully.
--- Data Loading Complete ---
Total reviews loaded: 2000
Positive reviews: 1000
Negative reviews: 1000

Sample Data:
                                              review sentiment  \
0  plot : two teen couples go to a church party ,...       neg   
1  the happy bastard ' s quick movie review damn ...       neg   
2  it is movies like these that make a jaded movi...       neg   
3  " quest for camelot " is warner bros . ' first...       neg   
4  synopsis : a mentally unstable man undergoing ...       neg   

   sentiment_encoded  
0                  0  
1                  0  
2                  0  
3                  0  
4                  0  
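
The encoding step can also be written with an explicit mapping. The sketch below is a minimal illustration on a made-up three-row frame (toy_df and label_map are hypothetical names, not part of the notebook). The lambda above turns every label other than 'pos' into 0, whereas Series.map leaves an unexpected label as NaN, which makes a bad label easy to spot.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real reviews DataFrame.
toy_df = pd.DataFrame({"sentiment": ["pos", "neg", "unknown"]})

# Explicit mapping: labels missing from the dict become NaN instead of 0.
label_map = {"neg": 0, "pos": 1}
toy_df["sentiment_encoded"] = toy_df["sentiment"].map(label_map)

# Count rows whose label was not recognized.
n_unmapped = int(toy_df["sentiment_encoded"].isna().sum())
print(toy_df)
print(f"Unrecognized labels: {n_unmapped}")
```

For the movie_reviews corpus, which only contains 'pos' and 'neg', both approaches should give the same column.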


# Step 2: Text Preprocessing and Vectorization (Feature Engineering)
We will use the TF-IDF (Term Frequency-Inverse Document Frequency) method to convert text into numerical features.

Separate features and labels:

- X contains the review text.

- y contains the numerical sentiment labels (1 = positive, 0 = negative).

Split the data into training (80%) and testing (20%) sets.

- stratify=y ensures both sets have the same proportion of positive and negative reviews.

Convert text into numerical features using TF-IDF:

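- TF-IDF gives a word/phrase a higher weight the more often it appears in a review and the fewer reviews in the dataset contain it.

- max_features=5000 keeps only the 5000 words/phrases that occur most frequently across the training reviews.

- stop_words='english' drops common words like “the” or “is”, using scikit-learn's built-in English stop word list.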

- ngram_range=(1, 2) uses single words (1-grams) and pairs of words (2-grams) as features.

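Fit the vectorizer on the training data and transform it (X_train_vec), then only transform the testing data (X_test_vec).

- The vocabulary and IDF weights are learned from the training set alone, so no information from the test set leaks into the features.

- The model will learn patterns from X_train_vec.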

Print summary:

- Shapes of training and testing data (number of reviews × number of features)

- Total number of features (words/phrases) extracted

In [3]:
X = df['review']
y = df['sentiment_encoded']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1, 2))

X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print("\nVectorization Complete.")
print(f"Training data shape: {X_train_vec.shape}")
print(f"Testing data shape: {X_test_vec.shape}")
print(f"Number of features (words/phrases): {len(vectorizer.get_feature_names_out())}")


Vectorization Complete.
Training data shape: (1600, 5000)
Testing data shape: (400, 5000)
Number of features (words/phrases): 5000
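
The effect of stratify=y can be checked directly. The sketch below uses made-up balanced labels rather than the notebook's data (toy_X and toy_y are hypothetical placeholders) and compares the share of positive labels in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels: 50 negative (0) and 50 positive (1).
toy_y = np.array([0] * 50 + [1] * 50)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, the positive share is the same in both splits.
print(f"Train size: {len(toy_y_train)}, positive share: {toy_y_train.mean():.2f}")
print(f"Test size:  {len(toy_y_test)}, positive share: {toy_y_test.mean():.2f}")
```

On the notebook's own data, the equivalent check would be y_train.value_counts(normalize=True) and y_test.value_counts(normalize=True).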


# Step 3: Model Training (Logistic Regression)
Logistic Regression trains quickly on sparse TF-IDF features and is a strong baseline for binary text classification.

## Create the model:

- LogisticRegression is a simple and effective algorithm for binary classification (positive vs negative).

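- max_iter=1000 raises the iteration limit from scikit-learn's default of 100, giving the solver room to converge.

- random_state=42 is set for reproducibility. It only matters for solvers that shuffle the data (sag, saga, liblinear); with the default lbfgs solver it does not change the result.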

## Train the model:

- model.fit(X_train_vec, y_train) tells the model to learn patterns from the training data vectors (X_train_vec) and their corresponding labels (y_train).

## Print messages:

- Shows when training starts and when it finishes.

In [4]:
model = LogisticRegression(max_iter=1000, random_state=42)
print("\nStarting Model Training...")
model.fit(X_train_vec, y_train)
print("Model Training Complete.")


Starting Model Training...
Model Training Complete.


# Step 4: Evaluation and Results
Now we evaluate the model's performance on unseen data.

## Make predictions:

- y_pred = model.predict(X_test_vec) uses the trained model to predict the sentiment of the testing reviews.

## Calculate accuracy:

- accuracy_score(y_test, y_pred) computes the fraction of correct predictions (a value between 0 and 1, so 0.83 means 83%).

- Accuracy tells us how often the model gets it right.

## Generate detailed metrics:

- classification_report shows precision, recall, and F1-score for both classes (Negative = 0, Positive = 1).

These metrics show how well the model identifies positive and negative reviews separately.

## Print results:

- Displays overall accuracy.

- Shows a full classification report with key performance metrics.

In [5]:
y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=['Negative (0)', 'Positive (1)'])

print("\n--- Model Evaluation Results ---")
print(f"Overall Accuracy: {accuracy:.4f}")
print("\nClassification Report (Key Metrics):")
print(report)


--- Model Evaluation Results ---
Overall Accuracy: 0.8300

Classification Report (Key Metrics):
              precision    recall  f1-score   support

Negative (0)       0.85      0.81      0.83       200
Positive (1)       0.81      0.85      0.83       200

    accuracy                           0.83       400
   macro avg       0.83      0.83      0.83       400
weighted avg       0.83      0.83      0.83       400
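
The metrics in the report follow from four counts: true positives, false positives, false negatives and true negatives. The sketch below works them out by hand for made-up labels and predictions (toy_true and toy_pred are hypothetical and are not the notebook's results), then prints scikit-learn's values next to them for comparison.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels and predictions (not the notebook's results).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 1]

# Count the four outcomes, treating 1 (positive) as the class of interest.
pairs = list(zip(toy_true, toy_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

toy_precision = tp / (tp + fp)  # of the predicted positives, how many are right
toy_recall = tp / (tp + fn)     # of the actual positives, how many were found
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)
toy_accuracy = (tp + tn) / len(toy_true)

print(f"precision: {toy_precision:.3f} vs {precision_score(toy_true, toy_pred):.3f}")
print(f"recall:    {toy_recall:.3f} vs {recall_score(toy_true, toy_pred):.3f}")
print(f"f1-score:  {toy_f1:.3f} vs {f1_score(toy_true, toy_pred):.3f}")
print(f"accuracy:  {toy_accuracy:.3f} vs {accuracy_score(toy_true, toy_pred):.3f}")
```

The report's "Negative (0)" row applies the same formulas with 0 treated as the class of interest.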



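# Step 5: Predicting Sentiment for New Reviews
Finally, we wrap the vectorizer and the trained model in a helper function and try it on new reviews.

## Define the prediction function predict_sentiment: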

- Takes a review (text), the vectorizer (TF-IDF), and the trained model.

- Converts the review into a numerical vector using vectorizer.transform.

- Uses the model to predict the sentiment: positive (1) or negative (0).

- Converts the numerical prediction into a label ("Positive" or "Negative").

- Reports confidence as the higher of the two class probabilities from model.predict_proba, so for this binary model it is never below 50%.

- Prints the review, predicted sentiment, and confidence percentage in a clear format.

## Test cases:

- Positive review → the function should predict “Positive” with confidence well above 50%.

- Negative review → the function should predict “Negative” with confidence well above 50%.

- Mixed review → the function predicts whichever sentiment the model finds stronger, typically with confidence close to 50%.

In [6]:
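def predict_sentiment(review, vectorizer, model):
    """Print the predicted sentiment and confidence for a single review."""
    # The vectorizer expects an iterable of documents, so wrap the review in a list.
    review_vec = vectorizer.transform([review])

    # Predicted class: 1 = positive, 0 = negative.
    prediction = model.predict(review_vec)[0]
    sentiment_label = "Positive" if prediction == 1 else "Negative"

    # Confidence is the probability of the most likely class, as a percentage.
    probability = model.predict_proba(review_vec)[0]
    confidence = max(probability) * 100

    print("-----------------------------------------------------")
    print(f"Review: '{review}'")
    print(f"Predicted Sentiment: **{sentiment_label}**")
    print(f"Confidence: {confidence:.2f}%")
    print("-----------------------------------------------------")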


# --- Test Cases ---
# Case 1: Positive Review
positive_review = "This film was absolutely brilliant, the acting was superb and the story was gripping. A true masterpiece!"
predict_sentiment(positive_review, vectorizer, model)

# Case 2: Negative Review
negative_review = "The movie was a complete waste of time. Terrible script, slow pace, and the ending made no sense at all."
predict_sentiment(negative_review, vectorizer, model)

# Case 3: Mixed Review
mixed_review = "The special effects were great, but the plot dragged on forever and the main actor was unconvincing."
predict_sentiment(mixed_review, vectorizer, model)

-----------------------------------------------------
Review: 'This film was absolutely brilliant, the acting was superb and the story was gripping. A true masterpiece!'
Predicted Sentiment: **Positive**
Confidence: 73.63%
-----------------------------------------------------
-----------------------------------------------------
Review: 'The movie was a complete waste of time. Terrible script, slow pace, and the ending made no sense at all.'
Predicted Sentiment: **Negative**
Confidence: 74.68%
-----------------------------------------------------
-----------------------------------------------------
Review: 'The special effects were great, but the plot dragged on forever and the main actor was unconvincing.'
Predicted Sentiment: **Negative**
Confidence: 52.65%
-----------------------------------------------------
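
The confidence values above come from predict_proba. For a binary LogisticRegression this probability is the sigmoid of the model's linear score, which the following minimal sketch checks on a made-up four-review corpus (toy_reviews, toy_labels, toy_vectorizer and toy_model are hypothetical names, separate from the notebook's model).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical four-review corpus; 1 = positive, 0 = negative.
toy_reviews = ["great film", "great acting", "awful film", "awful plot"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(toy_reviews), toy_labels)

# A mixed review: one word seen in positive reviews, one seen in negative ones.
new_vec = toy_vectorizer.transform(["great plot"])

# decision_function returns the raw linear score for class 1.
score = toy_model.decision_function(new_vec)[0]
p_positive = 1.0 / (1.0 + np.exp(-score))  # sigmoid of the score

proba = toy_model.predict_proba(new_vec)[0]  # [P(negative), P(positive)]
toy_confidence = max(proba) * 100

print(f"sigmoid(score): {p_positive:.4f}")
print(f"predict_proba:  {proba[1]:.4f}")
print(f"confidence:     {toy_confidence:.2f}%")
```

Because the two probabilities sum to 1, the larger one is always at least 0.5. This is why a review with conflicting signals, like Case 3 above, ends up near 50% rather than near 0%.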
