**Objective: Performing Count Vectorization on the IMDB Dataset**

# Load all Libraries

In [163]:
import numpy as np 
import pandas as pd
import re 
import os
import random 

# Set a Chunk size and read the dataset

In [164]:
chunk_size = 1000
chunks = pd.read_csv("E:\\NLP\\IMDB Dataset.csv\\IMDB Dataset.csv", sep = ",",chunksize = chunk_size)

df = next(chunks)
print(df.columns)

Index(['review', 'sentiment'], dtype='object')


The code reads the large CSV file ("IMDB Dataset.csv") in chunks of 1000 rows using `pandas.read_csv()` with the `chunksize` parameter, which returns an iterator of DataFrames rather than a single DataFrame. `df = next(chunks)` retrieves only the first chunk, and the column names are then printed. Reading in chunks avoids loading the entire file into memory. Note that the rest of this notebook works on this first 1000-row chunk only.
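As a minimal sketch of how chunked reading behaves, the snippet below iterates over every chunk of a tiny in-memory CSV. The CSV text, the chunk size of 2, and the name `sizes` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical 5-row CSV held in memory so the example runs without any file
csv_text = "review,sentiment\n" + "".join(
    f"review {i},positive\n" for i in range(5)
)

# With chunksize set, read_csv returns an iterator of DataFrames
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# Each chunk is an ordinary DataFrame with at most `chunksize` rows
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [2, 2, 1]
```

Calling `next(chunks)` once, as the notebook does, would return only the first of these pieces.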


In [None]:
df['words'] = "default value"
df['sentences'] = "default value"
for i in range(df.shape[0]):
    df.at[i,"words"] = []      # fresh empty list for this row
    df.at[i,"sentences"] = []  # fresh empty list for this row

This code adds two new columns, `words` and `sentences`, to the DataFrame `df`, initializing them with the value "default value". It then iterates over each row in the DataFrame, setting the values in the `words` and `sentences` columns to empty lists (`[]`). This process prepares the columns for further processing, likely involving tokenization or sentence segmentation.
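The same initialization can be sketched without an explicit loop. This is an alternative under the assumption that each row simply needs its own independent empty list; `df_demo` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the notebook's df
df_demo = pd.DataFrame({"review": ["good film", "bad film"]})

# A list comprehension creates a separate list object for every row,
# so filling one row's list later cannot affect another row
df_demo["words"] = [[] for _ in range(len(df_demo))]
df_demo["sentences"] = [[] for _ in range(len(df_demo))]

print(df_demo["words"].tolist())  # [[], []]
```

Writing `[[]] * len(df_demo)` instead would reuse one shared list for all rows, which is why the comprehension is used here.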

# Import Sentence and Word Tokenization

In [166]:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

The code imports two tokenization functions from the NLTK (Natural Language Toolkit) library:

* `sent_tokenize`: splits a text into individual sentences.
* `word_tokenize`: splits a text into individual words (tokens), handling punctuation and other language-specific details.

# Lowercase the text

In [167]:
df['review'] = df['review'].str.lower()

In [168]:
df

Unnamed: 0,review,sentiment,words,sentences
0,one of the other reviewers has mentioned that ...,positive,[],[]
1,a wonderful little production. <br /><br />the...,positive,[],[]
2,i thought this was a wonderful way to spend ti...,positive,[],[]
3,basically there's a family where a little boy ...,negative,[],[]
4,"petter mattei's ""love in the time of money"" is...",positive,[],[]
...,...,...,...,...
995,nothing is sacred. just ask ernie fosselius. t...,positive,[],[]
996,i hated it. i hate self-aware pretentious inan...,negative,[],[]
997,i usually try to be professional and construct...,negative,[],[]
998,if you like me is going to see this in a film ...,negative,[],[]


# Removal of HTML Tags

In [169]:
import re 
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)

The provided code defines a function remove_html_tags(text) that removes HTML tags from a given text:

* Imports re module: The re module is used for regular expression operations in Python.
* Defines the pattern: A regular expression pattern '<.*?>' is compiled to match HTML tags (anything between < and >).
* Removes HTML tags: The pattern.sub(r'', text) replaces all matched HTML tags with an empty string (''), effectively removing them from the input text.
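A short usage sketch of the function described above, with the definition repeated so the snippet runs on its own. The sample strings are illustrative assumptions modelled on the `<br />` tags visible in the reviews.

```python
import re


def remove_html_tags(text):
    # Non-greedy match: from a '<' to the nearest following '>'
    pattern = re.compile('<.*?>')
    return pattern.sub('', text)


sample = "a wonderful little production. <br /><br />the filming technique"
cleaned = remove_html_tags(sample)
print(cleaned)  # a wonderful little production. the filming technique

# Caveat: any '<' ... '>' span is removed, even when it is not a real tag
comparison = remove_html_tags("score < 5 but > 3")
print(repr(comparison))  # 'score  3'
```

The second call shows that the pattern is a heuristic rather than an HTML parser, which is usually acceptable for review text but worth keeping in mind.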

In [170]:
df['review'] = df['review'].apply(remove_html_tags)

In [171]:
df

Unnamed: 0,review,sentiment,words,sentences
0,one of the other reviewers has mentioned that ...,positive,[],[]
1,a wonderful little production. the filming tec...,positive,[],[]
2,i thought this was a wonderful way to spend ti...,positive,[],[]
3,basically there's a family where a little boy ...,negative,[],[]
4,"petter mattei's ""love in the time of money"" is...",positive,[],[]
...,...,...,...,...
995,nothing is sacred. just ask ernie fosselius. t...,positive,[],[]
996,i hated it. i hate self-aware pretentious inan...,negative,[],[]
997,i usually try to be professional and construct...,negative,[],[]
998,if you like me is going to see this in a film ...,negative,[],[]


# Removal of URLs

In [172]:
def remove_urls(text):
    # \S+ matches the non-whitespace remainder of the URL; www\. matches a literal "www."
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)

In [173]:
df['review'] = df['review'].apply(remove_urls)

# Removal of Punctuation

In [174]:
import string,time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

The `remove_urls(text)` function defined in the previous section uses a regular expression to remove URLs from the input text. The pattern `r'https?://\S+|www\.\S+'` matches URLs starting with `http://`, `https://`, or `www.` and runs up to the next whitespace character. Each match is replaced with an empty string (`''`), which removes it from the text.
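To make the URL pattern quoted above concrete, here is a small self-contained sketch. The sample sentence and the module-level name `URL_PATTERN` are assumptions for illustration.

```python
import re

# Pattern as quoted in the explanation: http(s) URLs or bare www. addresses
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def remove_urls(text):
    # Replace every matched URL with an empty string
    return URL_PATTERN.sub('', text)


sample = "see https://www.imdb.com/title/tt0111161/ or www.example.com for details"
cleaned = remove_urls(sample)
print(repr(cleaned))  # 'see  or  for details'
```

The surrounding spaces are left in place, so doubled spaces can remain after a URL is removed.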

In [175]:
exclude = string.punctuation

In [176]:
import string

def remove_punc1(text):
    exclude = string.punctuation
    return text.translate(str.maketrans('', '', exclude))

df['review'] = df['review'].apply(remove_punc1)


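The code defines a function `remove_punc1(text)` that removes all punctuation from the input text using Python's `string.punctuation`. It uses `str.maketrans()` to create a translation table that maps each punctuation character to `None`, and `text.translate()` applies that table. The function is then applied to the `review` column of the DataFrame `df`, removing punctuation from each review. Because apostrophes are punctuation too, contractions are merged into single tokens (for example, "you're" becomes "youre"), which is why such tokens appear among the feature names later on.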
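A brief sketch of what the translation table does to a sample string. The function is repeated so the snippet is self-contained, and the sample sentence is an illustrative assumption.

```python
import string


def remove_punc1(text):
    # Third argument to maketrans lists the characters to delete
    exclude = string.punctuation
    return text.translate(str.maketrans('', '', exclude))


sample = "it's a great movie!!! 10/10, really."
cleaned = remove_punc1(sample)
print(cleaned)  # its a great movie 1010 really
```

The output illustrates two side effects: the apostrophe in "it's" disappears, and "10/10" collapses into "1010".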

# Removal of Emojis

In [177]:
import re

def remove_emoji(text):
    # Define a pattern to match emoji using Unicode ranges
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # Emoticons
        "\U0001F300-\U0001F5FF"  # Symbols & pictographs
        "\U0001F680-\U0001F6FF"  # Transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # Flags (iOS)
        "\U00002702-\U000027B0"  # Additional symbols
        "\U000024C2-\U0001F251"  # Enclosed characters (very broad range: also covers CJK and other non-emoji text)
        "]+", 
        flags=re.UNICODE
    )
    # Substitute matched emoji with an empty string
    return emoji_pattern.sub('', text)


This code defines a function `remove_emoji(text)` that removes emojis from a given text:

1. **Define Emoji Pattern**: A regular expression (`emoji_pattern`) is compiled to match emojis using specific Unicode ranges that correspond to various types of emojis (emotions, symbols, flags, etc.).
2. **Substitute Emojis**: The `emoji_pattern.sub('', text)` replaces any matched emoji characters with an empty string, effectively removing them.
3. **Return Cleaned Text**: The function returns the text with all emojis removed, leaving only the non-emoji characters.
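The steps above can be tried on a short string as follows. The function is repeated with the same Unicode ranges so the snippet runs on its own; the sample inputs and the name `EMOJI_PATTERN` are assumptions for illustration.

```python
import re

# Same Unicode ranges as in the function above
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+",
    flags=re.UNICODE,
)


def remove_emoji(text):
    # Substitute every run of matched characters with an empty string
    return EMOJI_PATTERN.sub("", text)


# U+1F600 (grinning face) and U+1F44D (thumbs up) are both removed
cleaned = remove_emoji("loved it \U0001F600\U0001F44D")
print(repr(cleaned))  # 'loved it '

# Caveat: the last range is wide enough to include CJK characters
cjk_result = remove_emoji("\u6620\u753b")
print(repr(cjk_result))  # ''
```

For English-only reviews the wide final range is harmless, but it would strip ordinary text from multilingual data.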

In [178]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Tokenization and Lemmatization of sentences

In [179]:
for i in range(df.shape[0]):
    l1 = sent_tokenize(str(df.loc[i,"review"]))
    df.at[i,"sentences"] = l1

In [180]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

This resource is used for part-of-speech (POS) tagging, specifically the averaged perceptron-based POS tagger for English. It assigns grammatical categories like nouns, verbs, adjectives, etc., to words in a given text.

In [181]:
from pywsd.utils import lemmatize_sentence

The code imports the `lemmatize_sentence` function from the `pywsd.utils` module. This function lemmatizes a given sentence, reducing words to their base or dictionary form (e.g., "running" becomes "run"), which standardizes word forms for later analysis. It is imported here, but the cells below use a custom NLTK-based lemmatizer instead.

In [182]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [183]:
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Function to lemmatize sentences
def lemmatize_with_nltk(sentence):
    tokens = word_tokenize(sentence)
    return [lemmatizer.lemmatize(word) for word in tokens]

# Apply the custom lemmatizer
for k in range(df.shape[0]):
    df.at[k, "words"] = []
    for sentence in df.loc[k, "sentences"]:
        lemmatized_words = lemmatize_with_nltk(sentence)
        df.at[k, "words"].extend(lemmatized_words)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


The code imports the NLTK functions needed for tokenization and lemmatization, including `word_tokenize` and `WordNetLemmatizer`, and downloads the required NLTK resources (`punkt`, `wordnet`, and the POS tagger). The `lemmatize_with_nltk()` function tokenizes a sentence and lemmatizes each token. Because `lemmatize()` is called without a part-of-speech argument, every token is treated as a noun, so plural nouns are reduced ("movies" becomes "movie") while verb forms such as "running" are left unchanged. For each row in the DataFrame, the loop resets the "words" column to an empty list, processes each entry of the "sentences" column, and extends the list with the lemmatized tokens.

In [184]:
df['words_sentences'] = "default"

In [185]:
# Join each row's lemmatized tokens back into one space-separated string
# (str.join also handles an empty token list, which functools.reduce would not)
for k in range(df.shape[0]):
    df.at[k, "words_sentences"] = " ".join(df.at[k, "words"])

# Implementing CountVectorizer

In [186]:
from sklearn.feature_extraction.text import CountVectorizer

In [187]:
df1=df

no_features = 1000
tf_vectorizer = CountVectorizer( max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(df1.words_sentences)

In [188]:
df_x = pd.DataFrame(tf.toarray(),columns = tf_vectorizer.get_feature_names_out())

This code uses `CountVectorizer` to convert the text into a word-count matrix. `df1 = df` binds a second name to the same DataFrame (it does not create a copy; `df.copy()` would), and `no_features` is set to 1000 to cap the vocabulary size. The vectorizer removes common English stop words and keeps at most the 1000 most frequent terms across the corpus. `fit_transform()` learns the vocabulary from the `words_sentences` column and returns a sparse document-term matrix. Finally, the matrix is converted to a dense DataFrame `df_x`, with one column per extracted term.

In [189]:
df_x

Unnamed: 0,10,100,12,20,30,60,70,80,90,ability,...,yeah,year,yes,york,youd,youll,young,youre,youve,zombie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
998,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [190]:
df_y = df['sentiment']

In [191]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [192]:
df_y_1 = pd.DataFrame(df_y)
df_y_enc = df_y_1.apply(le.fit_transform)

In [193]:
df_y_enc

Unnamed: 0,sentiment
0,1
1,1
2,1
3,0
4,1
...,...
995,1
996,0
997,0
998,0
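In the encoded output above, negative reviews appear as 0 and positive reviews as 1. The sketch below shows why, using a hypothetical list of labels: `LabelEncoder` sorts the distinct class names and numbers them in that order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels mirroring the first rows of the sentiment column
labels = ["positive", "positive", "negative", "positive"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels).tolist()
classes = le_demo.classes_.tolist()

print(classes)  # ['negative', 'positive']  -> index 0 and index 1
print(encoded)  # [1, 1, 0, 1]

# inverse_transform maps the integers back to the original labels
decoded = le_demo.inverse_transform([0, 1]).tolist()
print(decoded)  # ['negative', 'positive']
```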


In [194]:
df_x.columns

Index(['10', '100', '12', '20', '30', '60', '70', '80', '90', 'ability',
       ...
       'yeah', 'year', 'yes', 'york', 'youd', 'youll', 'young', 'youre',
       'youve', 'zombie'],
      dtype='object', length=1000)

In [195]:
df_y_enc.columns

Index(['sentiment'], dtype='object')

# Implementing Metrics

In [196]:
from sklearn.model_selection import train_test_split

# Extract features (X) and target (y)
X = df_x  # Features: the count-vector matrix built above (classifiers cannot be fit on raw review strings)
y = df["sentiment"]  # Target: access the sentiment column directly by its name

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print results
print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Testing target shape:", y_test.shape)


Training features shape: (800, 1000)
Testing features shape: (200, 1000)
Training target shape: (800,)
Testing target shape: (200,)


In [197]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred)

print(f"Random Forest Accuracy: {accuracy_rf * 100:.2f}%")


Random Forest Accuracy: 80.50%


This code trains and evaluates a Random Forest classifier.

The Random Forest model is initialized with 500 estimators and a fixed random state. It is trained using the `fit()` method on the training data (`X_train` and `y_train`). The model then predicts the labels on the test set (`X_test`), and the predictions are compared to the true labels (`y_test`) using `accuracy_score`. Finally, the accuracy of the model is printed as a percentage.

In [198]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Train the Naive Bayes model
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Compute predictions using the Naive Bayes model
y_pred_nb = nb.predict(X_test)

# Compute accuracy for the Naive Bayes model
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f"Naive Bayes Accuracy: {accuracy_nb * 100:.2f}%")


Naive Bayes Accuracy: 81.50%


This code trains and evaluates a Naive Bayes classifier for text classification.

The `MultinomialNB` model is initialized and trained using the `fit()` method on the training data (`X_train` and `y_train`). Predictions are made on the test set (`X_test`) using the trained model, and the accuracy of the model is calculated by comparing the predictions (`y_pred_nb`) to the true labels (`y_test`) using `accuracy_score`. Finally, the accuracy is printed as a percentage.
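As a compact illustration of the fit/predict flow described above, the sketch below chains `CountVectorizer` and `MultinomialNB` in a scikit-learn pipeline on a handful of invented reviews. The toy documents and the pipeline arrangement are assumptions for illustration, not the notebook's own setup; one benefit of the pipeline is that the vocabulary is learned from the training documents only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training reviews and labels
train_docs = [
    "great wonderful film",
    "wonderful acting great story",
    "terrible boring film",
    "boring plot terrible acting",
]
train_labels = ["positive", "positive", "negative", "negative"]

# The pipeline vectorizes the text and then fits the classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Unseen reviews are vectorized with the vocabulary learned during fit
predictions = model.predict(
    ["great wonderful story", "terrible boring plot"]
).tolist()
print(predictions)  # ['positive', 'negative']
```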

In [199]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Train the Gradient Boosting Classifier
gb = GradientBoostingClassifier(n_estimators=500, random_state=42)
gb.fit(X_train, y_train)

# Make predictions using the trained model
y_pred_gb = gb.predict(X_test)

# Compute accuracy for the Gradient Boosting model
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: {accuracy_gb * 100:.2f}%")


Gradient Boosting Accuracy: 79.50%


This code trains and evaluates a Gradient Boosting classifier for text classification.

The `GradientBoostingClassifier` is initialized with 500 estimators and a fixed random state, then trained using the `fit()` method on the training data (`X_train` and `y_train`). Predictions are made on the test set (`X_test`) using the trained model, and the accuracy of the model is calculated by comparing the predictions (`y_pred_gb`) with the true labels (`y_test`) using `accuracy_score`. Finally, the accuracy is printed as a percentage.

Accuracy by number of rows used:

* By taking 100 rows
    * Random Forest Accuracy: 45.00%
    * Naive Bayes Accuracy: 30.00%
    * Gradient Boosting Accuracy: 55.00%

* By taking 500 rows
    * Random Forest Accuracy: 69.00%
    * Naive Bayes Accuracy: 73.00%
    * Gradient Boosting Accuracy: 69.00%

* By taking 1000 rows
    * Random Forest Accuracy: 80.50%
    * Naive Bayes Accuracy: 81.50%
    * Gradient Boosting Accuracy: 79.50%
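A comparison like the one above can be automated instead of re-running the notebook with different row counts. The following is only a sketch under stated assumptions: the helper `accuracy_for_first_n`, the synthetic count matrix `X_demo`, and the labels `y_demo` are all hypothetical and do not reproduce the figures listed above.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def accuracy_for_first_n(X, y, n, model):
    """Train and score `model` using only the first n rows of X and y."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:n], y[:n], test_size=0.2, random_state=42
    )
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


# Synthetic non-negative counts standing in for the count-vector features
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(300, 20))
# Synthetic binary target derived from two of the columns
y_demo = (X_demo[:, 0] > X_demo[:, 1]).astype(int)

# One accuracy value per row count, each from a freshly created model
results = {
    n: accuracy_for_first_n(X_demo, y_demo, n, MultinomialNB())
    for n in (100, 200, 300)
}
for n, acc in results.items():
    print(f"{n} rows: {acc * 100:.2f}%")
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by the count-vector features and sentiment labels, and the same helper could be called once per classifier.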