**Objective: Performing Count Vectorization on the AWS Dataset**

# Load all Libraries

In [2]:
import pandas as pd
import numpy as np 
import re
import os 
import random

# Set a Chunk size and read the dataset

In [3]:
import pandas as pd

chunk_size = 15000
chunks = pd.read_csv("E:\\NLP\\aws_review_sofware_dataset (1).csv", sep=',', chunksize=chunk_size)

# Get the first chunk and access its columns
df = next(chunks)
print(df.columns)


Index(['Unnamed: 0', 'overall', 'verified', 'reviewTime', 'reviewerID', 'asin',
       'style', 'reviewerName', 'reviewText', 'summary', 'unixReviewTime',
       'vote', 'image'],
      dtype='object')


The code reads a large CSV file (`aws_review_sofware_dataset (1).csv`) in chunks of 15,000 rows at a time using `pandas.read_csv()` with the `chunksize` parameter. It retrieves the first chunk (`df = next(chunks)`) and prints the DataFrame's column names. Reading in chunks makes it possible to process a large dataset without loading the entire file into memory. Note that every later cell works on this first chunk only.
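To process the whole file rather than just the first chunk, the iterator returned by `read_csv` can be consumed in a loop. The following is a minimal sketch, assuming a tiny in-memory CSV (hypothetical data standing in for the review file) and a running aggregate over the `overall` column:

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the review file.
csv_text = (
    "overall,reviewText\n"
    "5,Great tool\n"
    "1,Crashes often\n"
    "4,Works fine\n"
    "3,Okay\n"
    "2,Too slow\n"
)

total_rows = 0
rating_sum = 0

# chunksize=2 yields DataFrames of at most 2 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    rating_sum += chunk["overall"].sum()

mean_rating = rating_sum / total_rows
print(total_rows, mean_rating)
```

Only one chunk is held in memory at a time, so the same pattern should scale to files that do not fit in RAM.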


In [4]:
df.columns

Index(['Unnamed: 0', 'overall', 'verified', 'reviewTime', 'reviewerID', 'asin',
       'style', 'reviewerName', 'reviewText', 'summary', 'unixReviewTime',
       'vote', 'image'],
      dtype='object')

In [5]:
df["words"]="default value"
df["sentences"]="default value"


for i in range(df.shape[0]):
    df.at[i,"words"]= list("")
    df.at[i,"sentences"] = list("")


This code adds two new columns, `words` and `sentences`, to the DataFrame `df`, initializing them with the value "default value". It then iterates over each row in the DataFrame, setting the values in the `words` and `sentences` columns to empty lists (`[]`). This process prepares the columns for further processing, likely involving tokenization or sentence segmentation.
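The same initialization can be written without a row-by-row loop. This is a sketch on a small hypothetical DataFrame (`demo`), assuming one fresh list per row is wanted:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the review chunk.
demo = pd.DataFrame({"reviewText": ["Good.", "Bad. Very bad."]})

# Build a separate empty list for every row; reusing a single [] object
# would make all rows share (and mutate) the same list.
demo["words"] = [[] for _ in range(len(demo))]
demo["sentences"] = [[] for _ in range(len(demo))]

# Mutating one row's list leaves the other rows untouched.
demo.at[0, "words"].append("good")
print(demo)
```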

# Import Sentence Tokenization

In [7]:
from nltk.tokenize import sent_tokenize

The code imports a tokenization function from the NLTK (Natural Language Toolkit) library:

`sent_tokenize`: This function splits a text into individual sentences.

In [8]:

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
for i in range(df.shape[0]):
    l1= sent_tokenize(str(df.loc[i,"reviewText"]))
    df.at[i,"sentences"]=l1

In [10]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

This resource is used for part-of-speech (POS) tagging, specifically the averaged perceptron-based POS tagger for English. It assigns grammatical categories like nouns, verbs, adjectives, etc., to words in a given text.

# Implementing Lemmatization

In [12]:
from pywsd.utils import lemmatize_sentence


The code imports the `lemmatize_sentence` function from the `pywsd.utils` module. This function lemmatizes a given sentence, reducing words to their base or root form (e.g., "running" becomes "run"). It is typically used in natural language processing tasks to standardize word forms for better analysis. The cells below do not call it; they lemmatize with NLTK's `WordNetLemmatizer` instead.

In [13]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Function to lemmatize sentences
def lemmatize_with_nltk(sentence):
    tokens = word_tokenize(sentence)
    return [lemmatizer.lemmatize(word) for word in tokens]

# Apply the custom lemmatizer
for k in range(df.shape[0]):
    df.at[k, "words"] = []
    for sentence in df.loc[k, "sentences"]:
        lemmatized_words = lemmatize_with_nltk(sentence)
        df.at[k, "words"].extend(lemmatized_words)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


This code lemmatizes sentences in a DataFrame using NLTK. It first downloads the `averaged_perceptron_tagger`, `punkt`, and `wordnet` resources. A `WordNetLemmatizer` is initialized, and a function `lemmatize_with_nltk()` is defined to tokenize a sentence and lemmatize each token. For each row in the DataFrame, it resets the "words" column to an empty list, then tokenizes and lemmatizes each sentence from the "sentences" column and extends the list with the resulting tokens. Because `lemmatizer.lemmatize(word)` is called without a part-of-speech argument, every token is treated as a noun, so verb forms such as "running" are left unchanged and the POS tagger resource is not actually used here.

In [None]:
df["words_sentences"] = "default"

In [None]:
import functools
for k in range(df.shape[0]):
    df.loc[k,"words_sentences"]=functools.reduce(lambda a,b:( str(a)+str(" ")+str(b)),df.loc[k,"words"])

This code uses `functools.reduce()` to join words in the "words" column into a single sentence for each row in the DataFrame.

1. **`functools.reduce()`**: This function is used to apply a lambda function cumulatively to the items in an iterable (in this case, `df.loc[k, "words"]`), resulting in a single value (the joined sentence).
2. The lambda function concatenates the words by adding a space (`" "`) between them: `lambda a, b: (str(a) + str(" ") + str(b))`.
3. The `for` loop iterates through each row of the DataFrame (`df.shape[0]` gives the number of rows), and for each row, the lambda function is applied to the "words" list, combining the individual words into a full sentence, which is then stored in the "words_sentences" column.
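For comparison, `str.join` produces the same string as the `reduce` call and also handles an empty word list, which `reduce` without an initial value does not. A minimal sketch, using a hypothetical word list:

```python
import functools

# Hypothetical lemmatized tokens for one review.
words = ["the", "software", "work", "well"]

# The approach used in the notebook cell.
joined_reduce = functools.reduce(lambda a, b: str(a) + " " + str(b), words)

# Equivalent and more idiomatic.
joined_join = " ".join(str(w) for w in words)

# reduce() with no initial value raises TypeError on an empty list,
# whereas join simply returns an empty string.
empty_joined = " ".join([])

print(joined_reduce)
print(joined_join)
```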

# Applying CountVectorizer

In [None]:
from sklearn.feature_extraction.text import  CountVectorizer

In [None]:
df1=df

no_features = 500
tf_vectorizer = CountVectorizer( max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(df1.words_sentences)

In [None]:
df_x = pd.DataFrame(tf.toarray(), columns=tf_vectorizer.get_feature_names_out())

This code uses `CountVectorizer` to convert text data into a word frequency matrix. It binds `df1` to the same DataFrame as `df` (an alias, not a copy) and sets `no_features` to 500, limiting the vocabulary to the 500 most frequent terms. The vectorizer is initialized to remove common English stop words and extract up to 500 features. The `fit_transform()` method processes the `words_sentences` column into a sparse matrix. Finally, the matrix is converted to a dense DataFrame `df_x`, with one column per extracted feature.

In [None]:
df_y=df["verified"]

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
df_y_1=pd.DataFrame(df_y)

In [None]:
df_y_enc=df_y_1.apply(le.fit_transform)

In [None]:
df_y_enc.columns

Index(['verified'], dtype='object')

In [None]:
df_y_enc.head(5)

   verified
0         1
1         1
2         1
3         1
4         0
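`LabelEncoder` can also be applied to the label column directly, which yields a 1-D array rather than a one-column DataFrame. A minimal sketch, assuming a hypothetical boolean `verified` Series:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for df["verified"].
verified = pd.Series([True, True, False, True], name="verified")

encoder = LabelEncoder()
y = encoder.fit_transform(verified)  # 1-D NumPy array of integer codes

print(encoder.classes_)
print(y)
```

A 1-D label array is the shape scikit-learn estimators expect for `y`, so passing it to `fit()` should avoid column-vector warnings.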


# Applying Metrics

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(df_x,df_y_enc)

# Accuracy
accuracy_rf = rf.score(df_x,df_y_enc)
print(f"Random Forest Accuracy: {accuracy_rf * 100:.2f}%")


  return fit_method(estimator, *args, **kwargs)


Random Forest Accuracy: 99.87%


This code trains and evaluates a Random Forest classifier.

The Random Forest model is initialized with 1000 estimators and a fixed random state (`random_state=42`). It is trained with `fit()` on the full feature matrix `df_x` and the encoded labels `df_y_enc`, and `rf.score()` then computes accuracy on those same rows. The reported 99.87% is therefore training accuracy, not a measure of performance on unseen reviews; `accuracy_score` is imported but not used. The warning fragment in the output most likely appears because `df_y_enc` is a one-column DataFrame where scikit-learn expects a 1-D label array.
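Because the cell above scores the model on the rows it was trained on, a held-out split gives a more realistic estimate. The following sketch uses synthetic data from `make_classification` (an assumption, not the review features) and a smaller forest to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (df_x, labels); not the review data.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# Hold out 25% of the rows, preserving the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_train, y_train)

train_acc = rf_demo.score(X_train, y_train)  # accuracy on seen rows
test_acc = rf_demo.score(X_test, y_test)     # accuracy on unseen rows
print(f"train: {train_acc:.2%}  test: {test_acc:.2%}")
```

Comparing the two numbers shows how much of the training accuracy is memorization rather than generalization.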

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(df_x,df_y_enc)

# Compute accuracy
accuracy_nb = nb.score(df_x,df_y_enc)
print(f"Naive Bayes Accuracy: {accuracy_nb * 100:.2f}%")

Naive Bayes Accuracy: 66.53%


  y = column_or_1d(y, warn=True)


This code trains and evaluates a Naive Bayes classifier for text classification.

The `MultinomialNB` model is initialized with default parameters and trained with `fit()` on the full feature matrix `df_x` and the encoded labels `df_y_enc`. `nb.score()` then computes accuracy on those same rows, so the reported 66.53% is training accuracy; no separate test set or `accuracy_score` call is involved. The `column_or_1d` warning in the output is raised because `df_y_enc` is a one-column DataFrame rather than a 1-D label array.
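One way to evaluate Naive Bayes on unseen text is to wrap the vectorizer and classifier in a pipeline, so the vocabulary is learned from the training rows only. This is a sketch on a handful of hypothetical reviews, and the texts and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical reviews and binary labels.
texts = [
    "installer crashed twice", "license key expired",
    "crashed during install", "refund requested immediately",
    "works great", "great value software",
    "easy install works", "excellent support team",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training texts only.
nb_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
nb_pipeline.fit(train_texts, train_labels)

predictions = nb_pipeline.predict(test_texts)
held_out_acc = nb_pipeline.score(test_texts, test_labels)
print(f"held-out accuracy: {held_out_acc:.2%}")
```

With so few samples the accuracy value itself is not meaningful; the point is the structure of the evaluation.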

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
GBC=GradientBoostingClassifier(n_estimators=1000)

In [None]:
gb_c=GBC.fit(df_x,df_y_enc)

  y = column_or_1d(y, warn=True)


In [None]:
gbc_score=GBC.score(df_x,df_y_enc)
print(f"gbc_score: {gbc_score* 100:.2f}%")

gbc_score: 85.17%


This code trains and evaluates a Gradient Boosting classifier for text classification.

The `GradientBoostingClassifier` is initialized with 1000 estimators (no `random_state` is set), then trained with `fit()` on the full feature matrix `df_x` and the encoded labels `df_y_enc`. `GBC.score()` computes accuracy on those same rows, so the reported 85.17% is training accuracy; no test set, separate prediction step, or `accuracy_score` call is involved.
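Cross-validation is another option for estimating out-of-sample accuracy without committing to a single split. A minimal sketch, assuming synthetic data from `make_classification` and a reduced number of estimators for speed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the count features and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

gbc_demo = GradientBoostingClassifier(n_estimators=50, random_state=0)

# 5-fold CV: each fold is scored by a model trained on the other four.
scores = cross_val_score(gbc_demo, X, y, cv=5, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```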