**Objective: Applying the TF-IDF Vectorizer to the AWS Dataset**

# Import Libraries

In [82]:
import numpy as np
import pandas as pd
import os 
import re 
import random

# Select the chunk size and read the columns

In [83]:
chunk_size = 5000
chunks = pd.read_csv("E:\\NLP\\aws_review_sofware_dataset (1).csv", sep=',', chunksize=chunk_size)

# Get the first chunk and access its columns
df = next(chunks)
print(df.columns)


Index(['Unnamed: 0', 'overall', 'verified', 'reviewTime', 'reviewerID', 'asin',
       'style', 'reviewerName', 'reviewText', 'summary', 'unixReviewTime',
       'vote', 'image'],
      dtype='object')


* Chunk size matters in NLP: a large dataset takes a long time to process, so the file is read in chunks of a fixed number of rows. Here only the first chunk of 5000 rows is used.
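As a minimal sketch of how chunked reading behaves, the snippet below reads a small in-memory CSV in chunks. The CSV text and its column values are made up for illustration and stand in for the review file; only `pd.read_csv(..., chunksize=...)` mirrors the cell above.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the review file (12 data rows).
csv_text = "reviewText,verified\n" + "\n".join(
    f"review {i},{i % 2 == 0}" for i in range(12)
)

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading every row at once.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=5)

# Each chunk is an ordinary DataFrame holding at most 5 rows.
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)  # [5, 5, 2]
```

Calling `next(chunks)` once, as the notebook does, would return only the first of these DataFrames.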

# Form two separate columns

In [84]:
df['words'] = "default value"
df['sentences'] = "default value"

for i in range(df.shape[0]):
    df.at[i,"words"] = list("")
    df.at[i,"sentences"] = list("")

* After reading the CSV file, create two separate columns, words and sentences, which hold an empty list in every row.
* list("") converts an empty string into a list of its characters; since there are none, each cell is initialized to an empty list [].
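A short illustration of what `list("")` evaluates to, written as a standalone sketch (the variable names are chosen for this example only):

```python
# list() over a string yields one element per character,
# so an empty string produces an empty list.
empty = list("")
chars = list("abc")

print(empty)  # []
print(chars)  # ['a', 'b', 'c']

# Every call builds a new list object, so two cells initialized
# this way do not share state.
a, b = list(""), list("")
a.append("token")
print(a)  # ['token']
print(b)  # []
```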

In [85]:
df

Unnamed: 0.1,Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image,words,sentences
0,0,4.0,True,"03 11, 2014",A240ORQ2LF9LUI,0077613252,{'Format:': ' Loose Leaf'},Michelle W,The materials arrived early and were in excell...,Material Great,1394496000,,,[],[]
1,1,4.0,True,"02 23, 2014",A1YCCU0YRLS0FE,0077613252,{'Format:': ' Loose Leaf'},Rosalind White Ames,I am really enjoying this book with the worksh...,Health,1393113600,,,[],[]
2,2,1.0,True,"02 17, 2014",A1BJHRQDYVAY2J,0077613252,{'Format:': ' Loose Leaf'},Allan R. Baker,"IF YOU ARE TAKING THIS CLASS DON""T WASTE YOUR ...",ARE YOU KIDING ME?,1392595200,7.0,,[],[]
3,3,3.0,True,"02 17, 2014",APRDVZ6QBIQXT,0077613252,{'Format:': ' Loose Leaf'},Lucy,This book was missing pages!!! Important pages...,missing pages!!,1392595200,3.0,,[],[]
4,4,5.0,False,"10 14, 2013",A2JZTTBSLS1QXV,0077775473,,Albert V.,I have used LearnSmart and can officially say ...,Best study product out there!,1381708800,,,[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4995,2.0,False,"09 19, 2005",AJ8CO6GDEE501,B00004Z0C8,,Walter K. White,Compared to Magellan Mapsend the detail provid...,Garmin MapSource CD ROM,1127088000,4.0,,[],[]
4996,4996,3.0,True,"08 23, 2005",AHIEI9NL58TBP,B00004Z0C8,,PrintGuy,"First, before you buy this product, you should...",A Good Companion for Paper Topo Maps,1124755200,35.0,,[],[]
4997,4997,2.0,True,"08 22, 2005",A1YA43VVH64NSO,B00004Z0C8,,George Wilson,I bought this product to be able to upload TOP...,Mediocre Mapping,1124668800,10.0,,[],[]
4998,4998,4.0,False,"07 2, 2005",A3UO195ZCOA59U,B00004Z0C8,,Aaron D. Chacon,I break up my review into several phases:\n\na...,Good but there are some nagging issues,1120262400,26.0,,[],[]


In [86]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

* Punkt is an NLTK tokenizer model that is required for tokenization tasks such as splitting text into sentences or words.

# Import word and sentence tokenizers

In [87]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

In [88]:
for i in range(len(df)):
    l1= sent_tokenize(df.loc[i,"reviewText"])
    df.at[i,"sentences"]=l1

* This loop iterates over the text in reviewText, applies sentence tokenization to each review, and stores the resulting list of sentences in the new sentences column.

In [89]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

 Averaged_perceptron_tagger is a machine learning-based model for part-of-speech (POS) tagging, which assigns grammatical categories (e.g., noun, verb, adjective) to words in a sentence.

In [90]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

 WordNet is a large lexical database of the English language. It groups words into sets of synonyms called synsets and provides definitions, examples, and relationships between these words.

In [91]:
from pywsd.utils import lemmatize_sentence

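 lemmatize_sentence comes from the PyWSD library, which stands for Python Word Sense Disambiguation. This function lemmatizes all words in a sentence, reducing each word to its base or root form while considering the context of the word (e.g., part of speech). It is imported here, but the lemmatization below uses NLTK's WordNetLemmatizer instead.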

# Perform Lemmatization

In [92]:
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Function to lemmatize sentences
def lemmatize_with_nltk(sentence):
    tokens = word_tokenize(sentence)
    return [lemmatizer.lemmatize(word) for word in tokens]

# Apply the custom lemmatizer
for k in range(df.shape[0]):
    df.at[k, "words"] = []
    for sentence in df.loc[k, "sentences"]:
        lemmatized_words = lemmatize_with_nltk(sentence)
        df.at[k, "words"].extend(lemmatized_words)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gkris\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


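1. WordNetLemmatizer() initializes the lemmatizer, which is used to reduce words to their base form. Called without a part-of-speech argument, as it is here, it treats every word as a noun, so "books" becomes "book", while "running" becomes "run" only when pos="v" is passed.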

2. lemmatize_with_nltk(sentence): This function takes a sentence as input, tokenizes it into words, and then applies the lemmatizer to each token.
   word_tokenize(sentence): Splits the sentence into individual words.
   lemmatizer.lemmatize(word): Applies the lemmatizer to each token to get the base form of the word.

3. df.shape[0] gives the number of rows in the DataFrame.

4. df.at[k, "words"] = [] initializes an empty list in the "words" column for each row k.
   
   For each row, it accesses the "sentences" column (which is expected to contain a list of sentences):

5. for sentence in df.loc[k, "sentences"]: iterates through the list of sentences in that row.
6. lemmatize_with_nltk(sentence) calls the previously defined function to lemmatize each sentence.
7. df.at[k, "words"].extend(lemmatized_words): After lemmatizing the sentence, it adds the lemmatized words to the "words" column
  (using .extend() to append the list of lemmatized words).


In [93]:
df["words_sentences"] = "default"

In [None]:
import functools

# Iterate through the DataFrame
for k in range(df.shape[0]):
    words = df.loc[k, "words"]
    # Check if words is empty and handle accordingly
    if words:
        # Join the words into a single string
        df.loc[k, "words_sentences"] = functools.reduce(lambda a, b: str(a) + " " + str(b), words)
    else:
        # If the list is empty, set the column to an empty string or some default value
        df.loc[k, "words_sentences"] = ""

* functools is a standard library module in Python that provides higher-order functions for working with other functions. It contains several utilities that can be used to enhance, manipulate, or simplify the behavior of functions and callable objects.
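To make the role of `functools.reduce` in the cell above concrete, here is a small sketch on a made-up word list. It also shows `" ".join(...)`, a common alternative that gives the same string and handles an empty list without a separate branch.

```python
import functools

# Hypothetical lemmatized tokens for one review.
words = ["the", "material", "arrived", "early"]

# reduce folds the list pairwise from the left:
# (("the" + " " + "material") + " " + "arrived") + " " + "early"
joined_with_reduce = functools.reduce(lambda a, b: str(a) + " " + str(b), words)

# str.join builds the same string in one call.
joined_with_join = " ".join(str(w) for w in words)

print(joined_with_reduce)  # the material arrived early
print(joined_with_reduce == joined_with_join)  # True

# join on an empty list returns an empty string, whereas reduce without
# an initial value raises TypeError on an empty sequence.
empty_joined = " ".join([])
print(repr(empty_joined))  # ''
```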

# Importing TF-IDF Vectorizer

* The Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer is a more advanced version of CountVectorizer: it highlights words that are important in a document while downweighting common words that appear in many documents.
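The weighting can be worked through by hand. The sketch below assumes scikit-learn's default settings for TfidfVectorizer (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row); the three documents are made up, and stop-word removal and scikit-learn's own tokenization are left out for brevity.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; whitespace splitting stands in for real tokenization.
docs = [
    "good book good price",
    "book arrived late",
    "good software",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: in how many documents each term appears.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency (assumed default formula).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one document over vocab."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]

# In the third document, "software" (1 document) outweighs "good" (2 documents).
third = dict(zip(vocab, rows[2]))
print(round(third["software"], 3), round(third["good"], 3))
```

Both terms occur once in the third document, so the difference in weight comes entirely from the idf factor: the rarer term scores higher.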

In [95]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [96]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df.words_sentences)

In [97]:
dense = pd.DataFrame(tfidf_matrix.todense(), columns=tfidf_vectorizer.get_feature_names_out())

# Forming Dependent and Independent features

In [98]:
df_y=df["verified"]

* The "verified" column holds boolean values (True/False) and is the target column, so label encoding is applied to convert it to 0 and 1 for the ML algorithms.

In [99]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [100]:
df_y_1 = pd.DataFrame(df_y)

In [101]:
df_y_enc = df_y_1.apply(le.fit_transform)

In [102]:
df_y_enc.head(10)

Unnamed: 0,verified
0,1
1,1
2,1
3,1
4,0
5,1
6,0
7,0
8,1
9,1


# Performing Classification

In [103]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=500, random_state=42)
# ravel() flattens the single-column target to the 1-D shape fit() expects
rf.fit(dense, df_y_enc.values.ravel())

# Accuracy on the same data the model was trained on
accuracy_rf = rf.score(dense, df_y_enc.values.ravel())
print(f"Random Forest Accuracy: {accuracy_rf * 100:.2f}%")


Random Forest Accuracy: 99.96%


* RandomForestClassifier: This is the Random Forest algorithm from scikit-learn. Random Forest is an ensemble learning method that builds multiple decision trees during training and outputs the class that is the majority vote of the individual trees. It’s used for classification tasks.
* accuracy_score: This function computes the accuracy of a model, i.e., the proportion of correct predictions out of all predictions. However, it's not actually used in the code directly in this case, because rf.score() is being used to calculate accuracy.
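The score printed above is computed on the same rows the model was fitted on, so it measures how well the forest fits its training data. A common complement is to hold out part of the data and score on that. The sketch below shows the idea on synthetic data from `make_classification`, which stands in for `dense` and `df_y_enc`; the split size, tree count, and variable names are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels (not the review data).
X, y = make_classification(n_samples=400, n_features=20, random_state=42)

# Hold out 25% of the rows; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_train, y_train)

# Score separately on the rows seen during training and on the held-out rows.
train_acc = rf_demo.score(X_train, y_train)
test_acc = rf_demo.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

With text features, the vectorizer would normally be fitted on the training rows only and then applied to the held-out rows with `transform`.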

In [104]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# ravel() flattens the single-column target to the 1-D shape fit() expects
nb.fit(dense, df_y_enc.values.ravel())

# Compute accuracy on the training data
accuracy_nb = nb.score(dense, df_y_enc.values.ravel())
print(f"Naive Bayes Accuracy: {accuracy_nb * 100:.2f}%")


Naive Bayes Accuracy: 85.54%


* MultinomialNB: This is the Multinomial Naive Bayes classifier from scikit-learn, which is commonly used for classification tasks, especially for text classification (such as spam detection or sentiment analysis). It is well-suited for data where features represent counts or frequencies (e.g., word counts in a document).
* dense: This is the feature data, here the dense TF-IDF matrix built above, with one row per review and one column per vocabulary term.
* df_y_enc: This is the target labels, which are the classes or categories that the model is trying to predict.
* The Naive Bayes classifier computes the conditional probabilities for each feature (given each class) and then applies Bayes' theorem to predict the class labels for new data.
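The calculation described in the last bullet can be sketched in plain Python. This is a simplified illustration on made-up toy sentences with word counts and Laplace smoothing (alpha = 1), not scikit-learn's implementation; the labels 1 and 0 and the helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (text, label).
train = [
    ("great product works well", 1),
    ("works great love it", 1),
    ("terrible waste of money", 0),
    ("waste of time terrible support", 0),
]

vocab = sorted({w for text, _ in train for w in text.split()})
class_docs = Counter(label for _, label in train)

# Per-class word counts, used to estimate P(word | class).
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())

def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(class | text): log prior plus summed log likelihoods."""
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for w in text.split():
        if w not in vocab:
            continue  # words never seen in training are ignored
        # Laplace-smoothed estimate of P(word | class)
        p = (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
        score += math.log(p)
    return score

def predict(text):
    """Pick the class with the highest posterior score."""
    return max(class_docs, key=lambda label: log_posterior(text, label))

pred_a = predict("great support works well")
pred_b = predict("terrible product waste")
print(pred_a, pred_b)  # 1 0
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on long documents.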

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
GBC = GradientBoostingClassifier(n_estimators=1000)
# ravel() flattens the single-column target to the 1-D shape fit() expects
gb_c = GBC.fit(dense, df_y_enc.values.ravel())

* GradientBoostingClassifier: This is a class from scikit-learn that implements Gradient Boosting, an ensemble machine learning technique. Gradient Boosting builds a series of weak learners (typically decision trees), where each new tree corrects the errors made by the previous trees. This iterative process creates a powerful model capable of making accurate predictions.

In [None]:
gbc_score=GBC.score(dense, df_y_enc)
print(f"gbc_score: {gbc_score* 100:.2f}%")

gbc_score: 98.32%


* for 1000 rows: Random Forest Accuracy: 96.40%; Naive Bayes Accuracy: 86.40%; gbc_score: 96.40%

* for 2500 rows: Random Forest Accuracy: 99.92%; Naive Bayes Accuracy: 78.24%; gbc_score: 98.32%

* for 5000 rows: Random Forest Accuracy: 99.96%; Naive Bayes Accuracy: 85.54%