<a href="https://colab.research.google.com/github/Surya2004-janardhan/colab/blob/main/sentimentanalysis_imdb_review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step 1: Import Necessary Modules
Import the libraries used throughout the pipeline: Pandas for data handling, re and NLTK for text preprocessing, and Scikit-learn for feature extraction, model training, and evaluation. The cell also downloads the NLTK tokenizer models (punkt, punkt_tab) and the English stop word list that later steps depend on.

In [3]:
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Step 2: Load Dataset with Robust Settings
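Load the dataset as UTF-8 with on_bad_lines="skip", which drops rows that cannot be parsed instead of raising an error. If parsing still fails, the function strips NUL characters and empty lines from the file line by line, writes a cleaned copy, and reloads it. Skipped rows are dropped silently, so check the reported dataset size after loading.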

In [4]:
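def load_dataset(file_path):
    """Load the CSV, falling back to a line-by-line cleanup if parsing fails."""
    try:
        # Rows that cannot be parsed are skipped rather than aborting the load.
        df = pd.read_csv(file_path, encoding='utf-8', engine='python', on_bad_lines="skip")
    except pd.errors.ParserError:
        print("Parser Error detected. Cleaning the dataset...")
        cleaned_data = []
        # Undecodable bytes are replaced instead of raising a decoding error.
        with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
            for line in f:
                # Drop NUL characters and surrounding whitespace; skip empty lines.
                cleaned_line = line.replace('\x00', '').strip()
                if cleaned_line:
                    cleaned_data.append(cleaned_line)
        cleaned_file_path = 'cleaned_' + file_path.split('/')[-1]
        with open(cleaned_file_path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(cleaned_data))
        # Reload with the same tolerant settings used for the first attempt.
        df = pd.read_csv(cleaned_file_path, encoding='utf-8', engine='python', on_bad_lines="skip")
    return df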
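
The fallback cleaning step can be tried in isolation. The sketch below is an illustration only: the raw string is made-up sample content, and the standard-library csv module stands in for pd.read_csv so the example runs without any files.

```python
import csv
import io

# Hypothetical raw content: a NUL character inside one row and a blank line between rows.
raw = 'review,sentiment\n"Great \x00film",positive\n\n"Dull plot",negative\n'

cleaned_lines = []
for line in io.StringIO(raw):
    # Same cleanup as the fallback branch: remove NULs, trim, skip empty lines.
    cleaned = line.replace('\x00', '').strip()
    if cleaned:
        cleaned_lines.append(cleaned)

# Parse the cleaned text; csv.DictReader is an assumption standing in for pandas here.
rows = list(csv.DictReader(io.StringIO('\n'.join(cleaned_lines))))
print(rows)
```

Because the cleanup works on physical lines, a quoted review that spans several lines would also have its internal blank lines removed.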

Step 3: Data Cleaning and Preprocessing
Preprocess each review by converting it to lowercase, removing every character other than a-z and whitespace, tokenizing, and dropping English stop words. Non-string values become empty strings, which the main cell filters out later.

In [5]:
# Build the stop word set once instead of on every call.
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    if not isinstance(text, str):
        return ''
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # keep only lowercase letters and whitespace
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)
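
The reviews contain HTML line breaks such as <br />. The character filter above removes only the angle brackets and slash, so the token br survives preprocessing. One possible variation, sketched below, strips tags before filtering. The function name, the tag pattern, and the tiny stop word set are assumptions made so the example runs without NLTK data; a whitespace split stands in for word_tokenize.

```python
import re

# Hypothetical, deliberately tiny stop word set; the notebook uses NLTK's English list.
DEMO_STOP_WORDS = {"a", "the", "is", "was", "this", "of"}

# Matches anything that looks like an HTML tag, e.g. <br />.
TAG_RE = re.compile(r"<[^>]+>")

def preprocess_text_no_html(text):
    if not isinstance(text, str):
        return ""
    text = TAG_RE.sub(" ", text)           # replace tags with a space before filtering
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)   # keep only lowercase letters and whitespace
    tokens = text.split()                  # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)

sample = "A wonderful little production. <br /><br />The filming is great!"
print(preprocess_text_no_html(sample))
```

Replacing tags with a space rather than an empty string keeps the words on either side of a tag from being merged into one token.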

Step 4: Feature Extraction
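Convert the cleaned text into numerical features using TF-IDF over unigrams and bigrams. Bigrams retain some word-order context that single words lose, while the min_df and max_df thresholds discard very rare and near-universal terms, and max_features caps the vocabulary size.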

In [6]:
def extract_features(corpus):
    vectorizer = TfidfVectorizer(
        max_features=10000,  # Keep only the 10,000 most frequent terms
        ngram_range=(1, 2),  # Use unigrams and bigrams
        min_df=2,            # Ignore terms appearing in fewer than 2 documents
        max_df=0.9           # Ignore terms appearing in more than 90% of documents
    )
    X = vectorizer.fit_transform(corpus)
    return X, vectorizer
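
To see what ngram_range, min_df, and max_df do to the vocabulary, the following standard-library sketch applies the same filtering rules to three made-up documents. It is an illustration of the thresholds only: it does not compute TF-IDF weights, the helper name is hypothetical, and a whitespace split stands in for the vectorizer's own tokenizer.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

# Hypothetical mini-corpus.
docs = ["good movie", "not good movie", "good plot"]

# Document frequency: in how many documents each term occurs at least once.
doc_terms = [set(ngrams(d.split())) for d in docs]
doc_freq = Counter(term for terms in doc_terms for term in terms)

n_docs = len(docs)
min_df, max_df = 2, 0.9
# Keep terms found in at least min_df documents and in at most max_df of all documents.
vocab = sorted(t for t, c in doc_freq.items() if c >= min_df and c / n_docs <= max_df)
print(vocab)
```

Here the word good appears in every document and is removed by max_df, while terms seen in only one document are removed by min_df, leaving one unigram and one bigram.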

Step 5: Model Training with Hyperparameter Tuning
Train a Logistic Regression model with hyperparameter tuning using Grid Search. This step optimizes the model's performance by finding the best parameters for regularization strength and class weighting.

In [7]:
def train_model(X_train, y_train):
    param_grid = {
        'C': [0.01, 0.1, 1, 10],
        'solver': ['liblinear'],
        'class_weight': [None, 'balanced']
    }
    model = LogisticRegression(max_iter=1000)
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='f1', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Best F1-Score (CV): {grid_search.best_score_:.2f}")
    return grid_search.best_estimator_
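
The size of the search follows directly from the grid. The sketch below enumerates the candidates with itertools and counts the model fits for cv=5; the variable names are illustrative, and the extra fit reflects GridSearchCV's default of refitting the best candidate on the full training set.

```python
from itertools import product

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear'],
    'class_weight': [None, 'balanced'],
}
cv_folds = 5

# Cartesian product of all parameter values, one dict per candidate.
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

cv_fits = len(candidates) * cv_folds
total_fits = cv_fits + 1  # one more fit: the best candidate is refit on all training data

print(len(candidates), cv_fits, total_fits)
```

Adding another value to any parameter multiplies the candidate count, which is worth keeping in mind before widening the grid.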


Step 6: Evaluation
Evaluate the trained model on the test set using metrics like accuracy, precision, recall, and F1-score. A detailed classification report provides insights into the model's performance across classes.

In [8]:
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1-Score: {f1:.2f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
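
The four headline metrics all derive from the counts of true and false positives and negatives. As a minimal sketch, the helper below computes them by hand for the positive class (label 1). The function name and the ten example labels are made up for illustration and are not taken from the notebook's data.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 4 true positives, 4 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The per-class rows of the classification report apply the same formulas twice, once treating each label as the positive class.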

Step 7: Main Execution
Combine all steps into a single pipeline. Load the dataset, preprocess the data, extract features, split the data into training and testing sets, train the model, and evaluate its performance.

In [11]:
if __name__ == "__main__":
    file_path = '/content/IMDB Dataset.csv'
    df = load_dataset(file_path)
    print(f"Dataset Size: {len(df)}")
    print("First 5 Rows:")
    print(df.head())

    df = df.dropna(subset=['review', 'sentiment'])
    df = df[df['sentiment'].isin(['positive', 'negative'])]
    df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
    df['clean_review'] = df['review'].apply(preprocess_text)
    df = df[df['clean_review'] != '']

    X, vectorizer = extract_features(df['clean_review'])
    y = df['sentiment']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    model = train_model(X_train, y_train)
    evaluate_model(model, X_test, y_test)

Dataset Size: 7886
First 5 Rows:
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
Best Parameters: {'C': 10, 'class_weight': None, 'solver': 'liblinear'}
Best F1-Score (CV): 0.88
Accuracy: 0.88
Precision: 0.87
Recall: 0.89
F1-Score: 0.88

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.87      0.88       787
           1       0.87      0.89      0.88       791

    accuracy                           0.88      1578
   macro avg       0.88      0.88      0.88      1578
weighted avg       0.88      0.88      0.88      1578
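
The main cell fits the TF-IDF vectorizer on every review before splitting, so the test reviews contribute to the vocabulary and IDF weights. One possible alternative, sketched here, splits the raw text first and wraps the vectorizer and classifier in a Scikit-learn Pipeline so that both are fit on training data only, including inside each cross-validation fold. The toy corpus, the reduced grid, and cv=3 are assumptions chosen to keep the example small; they are not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for df['clean_review'] and df['sentiment'].
texts = ["great film loved", "terrible plot hated", "wonderful acting great",
         "awful boring terrible", "loved wonderful story", "hated awful acting"] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

# Split the raw text first so test reviews never influence the vocabulary or IDF weights.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Step-prefixed names route each grid entry to the matching pipeline step.
param_grid = {"clf__C": [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

predictions = search.predict(X_test)
print(search.best_params_)
```

With this layout the fitted search object accepts raw strings directly, so new reviews need only the text preprocessing step before prediction.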



Overall Summary
This pipeline classifies IMDB reviews as positive or negative by loading the dataset with tolerant parsing settings, preprocessing the text to remove noise, extracting TF-IDF features over unigrams and bigrams, tuning a Logistic Regression model with Grid Search, and evaluating it on a held-out test set. On the 1,578 test reviews, the tuned model reaches 0.88 accuracy and a 0.88 F1-score. Because each step is a separate function, the pipeline can be adapted to other binary text classification problems.



Challenges Encountered

Malformed Dataset: Quoting issues and encoding errors required a tolerant loader that skips unparseable rows and falls back to line-by-line cleaning.

Class Imbalance: The grid included class_weight='balanced' as a safeguard. The search selected class_weight=None, which is consistent with the nearly even test split (787 negative, 791 positive).

Preprocessing Complexity: Tokenizing and removing stop words without losing too much context was handled with NLTK tools, with bigram features retaining some word order.

Computational Cost: Processing time grows with dataset size. It was kept manageable by capping TF-IDF at 10,000 features and using a linear model.

Hyperparameter Tuning: Selecting the regularization strength C and the class weighting relied on 5-fold cross-validated Grid Search scored by F1.