# **Checkpoint 1: Data Preprocessing for Fake Review Detection**
### Manthan Parmar 202201416

## **Data Cleaning**

- **Handling missing values** by filling or removing incomplete data entries.
- **Removing duplicates** and irrelevant data (e.g., spam or unrelated reviews).
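
As a minimal sketch of both steps on a hypothetical toy DataFrame (the values are made up for illustration and are not taken from `fakeReviewData.csv`), one option is to drop rows with no review text, fill missing ratings with the median, and then drop exact duplicates:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the review dataset.
toy = pd.DataFrame({
    "rating": [5.0, None, 4.0, 5.0],
    "text_": ["great pillow", "works fine", None, "great pillow"],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Remove rows that have no review text at all.
cleaned = toy.dropna(subset=["text_"])

# Fill the remaining missing ratings with the median rating.
cleaned = cleaned.fillna({"rating": cleaned["rating"].median()})

# Drop rows that are exact copies of an earlier row.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```

Filling with the median is only one possible choice; whether to fill or drop depends on how much data is missing and in which column.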


In [2]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('fakeReviewData.csv')

df.head()


Unnamed: 0,category,rating,label,text_
0,Home_and_Kitchen_5,5.0,CG,"Love this! Well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. I..."
2,Home_and_Kitchen_5,5.0,CG,This pillow saved my back. I love the look and...
3,Home_and_Kitchen_5,1.0,CG,"Missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5.0,CG,Very nice set. Good quality. We have had the s...


In [3]:
#Check for missing values across the whole DataFrame.
missing_count = df.isna().sum().sum()
print(missing_count)
#There is no missing data in the dataset provided.

0


In [4]:
#Check for fully duplicated rows.
duplicate_count = df.duplicated().sum()
print(duplicate_count)

12


In [5]:
#keep=False marks every copy of a duplicated row, not only the repeats.
duplicate_rows = df[df.duplicated(keep = False)]
duplicate_rows

Unnamed: 0,category,rating,label,text_
6018,Sports_and_Outdoors_5,5.0,CG,"This is a really good starter kit, with lots o..."
6025,Sports_and_Outdoors_5,5.0,CG,"This is a really good starter kit, with lots o..."
6706,Sports_and_Outdoors_5,5.0,CG,"Great, no complaints. Comfortable, phone fits ..."
6708,Sports_and_Outdoors_5,5.0,CG,"Great, no complaints. Comfortable, phone fits ..."
12289,Movies_and_TV_5,5.0,CG,One of the best movies of the year. Not for e...
12548,Movies_and_TV_5,5.0,CG,One of the best movies of the year. Not for e...
19638,Pet_Supplies_5,5.0,CG,My dog loves these and it has kept her occupie...
19802,Pet_Supplies_5,5.0,CG,My dog loves these and it has kept her occupie...
19803,Pet_Supplies_5,5.0,CG,My dog loves it and it has kept her occupied f...
20242,Pet_Supplies_5,5.0,CG,My dog loves it and it has kept her occupied f...


In [6]:
#We use a different data frame without the duplicate rows for further processing.
df2 = df.drop_duplicates()
df2.head()

Unnamed: 0,category,rating,label,text_
0,Home_and_Kitchen_5,5.0,CG,"Love this! Well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. I..."
2,Home_and_Kitchen_5,5.0,CG,This pillow saved my back. I love the look and...
3,Home_and_Kitchen_5,1.0,CG,"Missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5.0,CG,Very nice set. Good quality. We have had the s...


In [7]:
duplicate_count_2 = df2.duplicated().sum()
print(duplicate_count_2)
#Now, there are no duplicate rows present in the data.

0


## **Text Normalization**

- **Converting all text to lowercase** to maintain uniformity.
- **Removing punctuation, special characters, and numbers** where not required.
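
One detail worth checking in this step: if punctuation is deleted outright, two words separated only by a period (for example `it.very`) are fused into a single token. The sketch below is one possible variant, not necessarily the approach taken in the cells that follow. The hypothetical helper `normalize_text` replaces every non-letter character with a space and then collapses repeated whitespace.

```python
import re

def normalize_text(text):
    """Lowercase, turn non-letter characters into spaces, collapse whitespace."""
    text = text.lower()
    # Replace anything that is not a letter or whitespace with a space,
    # so words on both sides of a punctuation mark stay separate.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no argument discards runs of whitespace.
    return " ".join(text.split())

example = normalize_text("Love it.Very sturdy!! 10/10")
print(example)
```

The trade-off is that contractions are split as well: "i've" becomes "i ve" instead of "ive".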

In [8]:
#Convert all text to lowercase
df3 = df2.copy()

df3['text_'] = df3['text_'].str.lower()
df3.head()

Unnamed: 0,category,rating,label,text_
0,Home_and_Kitchen_5,5.0,CG,"love this! well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. i..."
2,Home_and_Kitchen_5,5.0,CG,this pillow saved my back. i love the look and...
3,Home_and_Kitchen_5,1.0,CG,"missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5.0,CG,very nice set. good quality. we have had the s...


In [9]:
#Remove punctuation, special characters and numbers
#Using regex for simplicity
df4 = df3.copy()

df4['text_'] = df4['text_'].str.replace(r'[^a-zA-Z\s]', '', regex=True)
df4.head()


Unnamed: 0,category,rating,label,text_
0,Home_and_Kitchen_5,5.0,CG,love this well made sturdy and very comfortab...
1,Home_and_Kitchen_5,5.0,CG,love it a great upgrade from the original ive...
2,Home_and_Kitchen_5,5.0,CG,this pillow saved my back i love the look and ...
3,Home_and_Kitchen_5,1.0,CG,missing information on how to use it but it is...
4,Home_and_Kitchen_5,5.0,CG,very nice set good quality we have had the set...



## **Tokenization**

- **Breaking down sentences into individual words or tokens** for easier analysis.
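
A small standard-library sketch of whitespace tokenization on a made-up string. It also shows why `split()` with no argument is the safer call: it treats any run of whitespace as one separator, whereas `split(" ")` produces empty tokens when two spaces occur in a row and does not split on tabs.

```python
# Made-up example text with a double space and a tab.
text = "very nice  set\tgood quality"

# No argument: split on any run of whitespace.
tokens = text.split()

# Explicit single space: keeps an empty string and the tab.
naive_tokens = text.split(" ")

print(tokens)
print(naive_tokens)
```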

In [10]:
#Break down sentences to individual words, or tokens.
df5 = df4.copy()

df5['tokens'] = df5['text_'].apply(lambda x: x.split())
df5['tokens'].head()


0    [love, this, well, made, sturdy, and, very, co...
1    [love, it, a, great, upgrade, from, the, origi...
2    [this, pillow, saved, my, back, i, love, the, ...
3    [missing, information, on, how, to, use, it, b...
4    [very, nice, set, good, quality, we, have, had...
Name: tokens, dtype: object

In [11]:
df5.head()

Unnamed: 0,category,rating,label,text_,tokens
0,Home_and_Kitchen_5,5.0,CG,love this well made sturdy and very comfortab...,"[love, this, well, made, sturdy, and, very, co..."
1,Home_and_Kitchen_5,5.0,CG,love it a great upgrade from the original ive...,"[love, it, a, great, upgrade, from, the, origi..."
2,Home_and_Kitchen_5,5.0,CG,this pillow saved my back i love the look and ...,"[this, pillow, saved, my, back, i, love, the, ..."
3,Home_and_Kitchen_5,1.0,CG,missing information on how to use it but it is...,"[missing, information, on, how, to, use, it, b..."
4,Home_and_Kitchen_5,5.0,CG,very nice set good quality we have had the set...,"[very, nice, set, good, quality, we, have, had..."


## **Stopword Removal**

- **Eliminating common words** (e.g., "and," "the") that do not add significant meaning.
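
A minimal sketch of stopword filtering with a `set`, which makes each membership check constant time. The short word list and the `keep` parameter are assumptions for illustration. Common stop word lists contain negations such as "not" and "no", so it may be worth checking whether dropping them loses information that matters for the task.

```python
# A deliberately short, hypothetical stop word set for illustration.
STOP_WORDS = {"i", "it", "is", "the", "and", "this", "very", "not", "no"}
NEGATIONS = {"not", "no", "nor"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, keep=()):
    """Drop tokens found in stop_words unless they are listed in keep."""
    keep = set(keep)
    return [tok for tok in tokens if tok not in stop_words or tok in keep]

tokens = ["this", "is", "not", "very", "good"]
filtered = remove_stop_words(tokens)
filtered_keep_negations = remove_stop_words(tokens, keep=NEGATIONS)

print(filtered)
print(filtered_keep_negations)
```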

In [12]:
# Commonly used English stop words, stored as a list.
stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

df6 = df5.copy()

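#Build a set once so that each membership check is constant time.
stop_word_set = set(stop_words)
df6['tokens'] = df6['tokens'].apply(lambda x: [word for word in x if word.lower() not in stop_word_set])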

df6.head()


Unnamed: 0,category,rating,label,text_,tokens
0,Home_and_Kitchen_5,5.0,CG,love this well made sturdy and very comfortab...,"[love, well, made, sturdy, comfortable, love, ..."
1,Home_and_Kitchen_5,5.0,CG,love it a great upgrade from the original ive...,"[love, great, upgrade, original, ive, mine, co..."
2,Home_and_Kitchen_5,5.0,CG,this pillow saved my back i love the look and ...,"[pillow, saved, back, love, look, feel, pillow]"
3,Home_and_Kitchen_5,1.0,CG,missing information on how to use it but it is...,"[missing, information, use, great, product, pr..."
4,Home_and_Kitchen_5,5.0,CG,very nice set good quality we have had the set...,"[nice, set, good, quality, set, two, months]"


## **Stemming/Lemmatization**

- **Reducing words to their root or base form** (e.g., "running" → "run").



- **Stemming** reduces a word to its root form by stripping prefixes and suffixes. The result is not always a valid word.

- eg: Plain suffix stripping gives "running" -> "runn" and "happily" -> "happi".

- **Lemmatization** reduces a word to its base or dictionary form (its lemma), taking the meaning of the word into account. It produces a valid dictionary word for the words it recognises, but it is slower and more complex than stemming.

- In this particular process, I am going ahead with Lemmatization.
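
The difference between the two approaches can be sketched with toy code. Both helpers below are hypothetical and deliberately simplistic: `toy_stem` strips a few suffixes with no linguistic knowledge, and `toy_lemmatize` looks words up in a tiny hand-written table. Real stemmers and the WordNet lemmatizer use far richer rules and data.

```python
# Suffixes tried in order; a toy rule set, not a real stemming algorithm.
SUFFIXES = ("ing", "ly", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written lookup table standing in for a dictionary of lemmas.
LEMMA_TABLE = {"running": "run", "made": "make", "studies": "study"}

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for word in ["running", "happily", "made"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stems "runn" and "happi" are not dictionary words, while the lookup returns "run" and "make". Words missing from the table, such as "happily", come back unchanged.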


In [13]:
#We use the WordNet lemmatizer from NLTK (Natural Language Toolkit).

#Download the WordNet data through NLTK.
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

#Create lemmatizer object
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\manth\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [14]:
df7 = df6.copy()

df7['tokens'] = df7['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
df7.head()

Unnamed: 0,category,rating,label,text_,tokens
0,Home_and_Kitchen_5,5.0,CG,love this well made sturdy and very comfortab...,"[love, well, made, sturdy, comfortable, love, ..."
1,Home_and_Kitchen_5,5.0,CG,love it a great upgrade from the original ive...,"[love, great, upgrade, original, ive, mine, co..."
2,Home_and_Kitchen_5,5.0,CG,this pillow saved my back i love the look and ...,"[pillow, saved, back, love, look, feel, pillow]"
3,Home_and_Kitchen_5,1.0,CG,missing information on how to use it but it is...,"[missing, information, use, great, product, pr..."
4,Home_and_Kitchen_5,5.0,CG,very nice set good quality we have had the set...,"[nice, set, good, quality, set, two, month]"


In [15]:
#Some words above are not reduced, e.g. "missing" stays "missing" instead of becoming "miss".
#This is because lemmatize() treats every token as a noun by default.
#To get accurate lemmas, we pass POS (part-of-speech) tags to the lemmatizer.

from nltk.corpus import wordnet
#A corpus is a large collection of text used for NLP.
#WordNet is one such corpus; it groups words into sets of synonyms called synsets.

nltk.download('averaged_perceptron_tagger_eng')
#The averaged perceptron tagger is a pretrained POS tagger in NLTK.

#Map a word's POS tag to the format that WordNetLemmatizer accepts.
def get_wordnet_pos(word):
    #Map POS tag to first character lemmatize() accepts
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J":wordnet.ADJ,"N":wordnet.NOUN,"V":wordnet.VERB,"R":wordnet.ADV}
    return tag_dict.get(tag,wordnet.NOUN)

#How the function works: for the input "running", nltk.pos_tag returns [('running','VBG')], where VBG marks a verb in present participle form.
#The indexing [0][1][0] takes the tuple out of the list, then the POS tag out of the tuple, then the first letter of the tag ("V"). upper() is applied for uniformity.
#If that letter is not a key in tag_dict, the function falls back to noun.
#Note that each word is tagged on its own here, so the tagger sees no sentence context.

lemmatizer = WordNetLemmatizer()

df8 = df7.copy()

df8['tokens'] = df8['tokens'].apply(lambda x : [lemmatizer.lemmatize(word,get_wordnet_pos(word)) for word in x])

df8.head()

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\manth\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Unnamed: 0,category,rating,label,text_,tokens
0,Home_and_Kitchen_5,5.0,CG,love this well made sturdy and very comfortab...,"[love, well, make, sturdy, comfortable, love, ..."
1,Home_and_Kitchen_5,5.0,CG,love it a great upgrade from the original ive...,"[love, great, upgrade, original, ive, mine, co..."
2,Home_and_Kitchen_5,5.0,CG,this pillow saved my back i love the look and ...,"[pillow, save, back, love, look, feel, pillow]"
3,Home_and_Kitchen_5,1.0,CG,missing information on how to use it but it is...,"[miss, information, use, great, product, price]"
4,Home_and_Kitchen_5,5.0,CG,very nice set good quality we have had the set...,"[nice, set, good, quality, set, two, month]"


## **Vectorization**

- **Converting text data into numerical formats** (e.g., TF-IDF or word embeddings) suitable for machine learning algorithms.

- **Bag of Words (BoW)** converts text into a vector of word counts.
- Eg: We have 
    - Sentences
        - "I love machine learning" 
        - "Machine learning is fun"
        - "I love fun"
    - Vocabulary
        - ["I","love","machine","learning","is","fun"]
    - Vectors
        - [1,1,1,1,0,0]
        - [0,0,1,1,1,1]
        - [1,1,0,0,0,1]
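
The example above can be reproduced with the standard library. This is a minimal sketch, assuming lowercased whitespace tokens and a vocabulary ordered by first appearance:

```python
from collections import Counter

sentences = [
    "I love machine learning",
    "Machine learning is fun",
    "I love fun",
]

# Lowercase and split each sentence into tokens.
tokenized = [sentence.lower().split() for sentence in sentences]

# Build the vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Count each vocabulary word in each sentence (a missing word counts as 0).
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```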

- **TF-IDF (Term Frequency - Inverse Document Frequency)** measures how important a word is to a document within a collection of documents.
- Working:
    - **TF**: Frequency of word in document
    - **IDF**: How rare it is across all documents.
    - The formula is:
  $$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log \left( \frac{N}{\text{DF}(t)} \right) $$
    - **TF(t, d)** is term frequency
  $$ \text{TF}(t, d) = \frac{\text{count of } t \text{ in document } d}{\text{total number of terms in document } d} $$
    - **DF(t)** is the document frequency, representing the number of documents containing the term **t**.
    - **N** is the total number of documents.
- A rare word has a high IDF and a common word has a low IDF, so words that appear in most documents receive a lower weight.

- Eg:

* **Documents**:
   1. **[i love machine learning]**
   2. **[machine learning is fun]**
   3. **[i love fun]**

* **Step 1: TF (Term Frequency)**:
   - For Document 1: **[i love machine learning]**:
     - TF("i") = 1/4 = 0.25
     - TF("love") = 1/4 = 0.25
     - TF("machine") = 1/4 = 0.25
     - TF("learning") = 1/4 = 0.25
   - For Document 2: **[machine learning is fun]**:
     - TF("machine") = 1/4 = 0.25
     - TF("learning") = 1/4 = 0.25
     - TF("is") = 1/4 = 0.25
     - TF("fun") = 1/4 = 0.25
   - For Document 3: **[i love fun]**:
     - TF("i") = 1/3 = 0.33
     - TF("love") = 1/3 = 0.33
     - TF("fun") = 1/3 = 0.33

* **Step 2: IDF (Inverse Document Frequency)**, using log base 10:
   - **Total documents (N) = 3**
   - IDF("i") = log(3/2) = 0.1761
   - IDF("love") = log(3/2) = 0.1761
   - IDF("machine") = log(3/2) = 0.1761
   - IDF("learning") = log(3/2) = 0.1761
   - IDF("is") = log(3/1) = 0.4771
   - IDF("fun") = log(3/2) = 0.1761

* **Step 3: TF-IDF (Term Frequency - Inverse Document Frequency)**:
   - For Document 1:
     - TF-IDF("i") = 0.25 * 0.1761 = 0.0440
     - TF-IDF("love") = 0.25 * 0.1761 = 0.0440
     - TF-IDF("machine") = 0.25 * 0.1761 = 0.0440
     - TF-IDF("learning") = 0.25 * 0.1761 = 0.0440
   - For Document 2:
     - TF-IDF("machine") = 0.25 * 0.1761 = 0.0440
     - TF-IDF("learning") = 0.25 * 0.1761 = 0.0440
     - TF-IDF("is") = 0.25 * 0.4771 = 0.1193
     - TF-IDF("fun") = 0.25 * 0.1761 = 0.0440
   - For Document 3:
     - TF-IDF("i") = (1/3) * 0.1761 = 0.0587
     - TF-IDF("love") = (1/3) * 0.1761 = 0.0587
     - TF-IDF("fun") = (1/3) * 0.1761 = 0.0587
- Vectors (vocabulary order: ["i","love","machine","learning","is","fun"])
    - [0.0440, 0.0440, 0.0440, 0.0440, 0.0, 0.0]
    - [0.0, 0.0, 0.0440, 0.0440, 0.1193, 0.0440]
    - [0.0587, 0.0587, 0.0, 0.0, 0.0, 0.0587]
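
The three steps above can be reproduced with the standard library. This is a minimal sketch that follows the formula exactly as written, with log base 10 and no smoothing or normalization. Library implementations can give different numbers: scikit-learn's `TfidfVectorizer`, for instance, by default uses a smoothed natural-log IDF and L2-normalizes each row.

```python
import math

documents = [
    ["i", "love", "machine", "learning"],
    ["machine", "learning", "is", "fun"],
    ["i", "love", "fun"],
]
vocabulary = ["i", "love", "machine", "learning", "is", "fun"]
n_docs = len(documents)

# DF(t): number of documents that contain the term.
doc_freq = {term: sum(term in doc for doc in documents) for term in vocabulary}

# IDF(t) = log10(N / DF(t)).
idf = {term: math.log10(n_docs / doc_freq[term]) for term in vocabulary}

def tfidf_vector(doc):
    """TF-IDF weights for one document, in vocabulary order, rounded to 4 places."""
    return [round(doc.count(term) / len(doc) * idf[term], 4) for term in vocabulary]

vectors = [tfidf_vector(doc) for doc in documents]
for vector in vectors:
    print(vector)
```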

- **Word Embeddings (Word2Vec, GloVe)** represent words as dense vectors in a continuous space.
- A trained (often pre-trained) model maps words to high-dimensional vectors. Similar words lie closer together in the vector space.
- Eg: vector("king") - vector("man") + vector("woman") ≈ vector("queen").
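
The analogy can be illustrated with hand-made vectors. The 3-dimensional values below are hypothetical and chosen so that the arithmetic works out exactly; vectors learned by Word2Vec or GloVe have many more dimensions and only satisfy such relations approximately.

```python
import math

# Hypothetical vectors with dimensions (royalty, male, female).
vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.8, 0.9, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Look for the closest word, excluding the three words used in the query.
candidates = [word for word in vectors if word not in {"king", "man", "woman"}]
best = max(candidates, key=lambda word: cosine(vectors[word], target))

print(best)
```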


In [16]:
#TF-IDF was the initial choice for this step, since it is simpler and computationally lighter than Word2Vec.
#The tokens are joined back into one string per review, which is the input format TfidfVectorizer expects.

df9 = df8.copy()

df9['text_'] = df9['tokens'].apply(lambda x: ' '.join(x))
df9.head()


Unnamed: 0,category,rating,label,text_,tokens
0,Home_and_Kitchen_5,5.0,CG,love well make sturdy comfortable love itvery ...,"[love, well, make, sturdy, comfortable, love, ..."
1,Home_and_Kitchen_5,5.0,CG,love great upgrade original ive mine couple year,"[love, great, upgrade, original, ive, mine, co..."
2,Home_and_Kitchen_5,5.0,CG,pillow save back love look feel pillow,"[pillow, save, back, love, look, feel, pillow]"
3,Home_and_Kitchen_5,1.0,CG,miss information use great product price,"[miss, information, use, great, product, price]"
4,Home_and_Kitchen_5,5.0,CG,nice set good quality set two month,"[nice, set, good, quality, set, two, month]"


In [17]:
from gensim.models import Word2Vec
import numpy as np

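#Train a Word2Vec model on the tokenised reviews.
#The TF-IDF code is kept commented out at the end of this cell.
model = Word2Vec(
    sentences = df9['tokens'], #Tokenised data as input
    vector_size = 100, # Dimension of word vector
    window = 5, # Context window size for words
    min_count = 1 # Keep every word, including those that occur only once
)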

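#Represent a review as the average of its word vectors (a zero vector if none of its words are in the vocabulary).
def get_average_word2vec(tokens_list,model,vector_size):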
    v = np.zeros(vector_size)
    count = 0
    for word in tokens_list:
        if word in model.wv: # Check if word is in model's vocabulary
            v +=model.wv[word]
            count+=1
    if count>0:
        v/=count # Calculate average
    return v

df9['word2vec'] = df9['tokens'].apply(lambda x: get_average_word2vec(x,model,model.vector_size))

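#Expand each review vector into one column per dimension (0 to 99).
df10 = pd.DataFrame(df9['word2vec'].tolist(), index = df9.index)

#Attach the original rating, category and label columns.
df11 = df10.join(df9[['rating','category','label']])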

df11.head()

# #TfidfVectorizer used to convert text to matrix
# from sklearn.feature_extraction.text import TfidfVectorizer

# #Initialise object
# tfidf_vectorizer = TfidfVectorizer()

# tfidf_matrix = tfidf_vectorizer.fit_transform(df9['text_'])
# #Fit -> Learns vocabulary, Transform -> Convert text to TF-IDF Matrix

# #We use a sparse matrix since it stores data where most values are 0 without using too much memory.
# #Converting it to a dense NumPy array exceeds the available memory: this dataset needs more than 12 GiB. The error obtained is "MemoryError: Unable to allocate 12.0 GiB for an array with shape (40420, 39836) and data type float64".


# # get_feature_names_out() -> get list of word names corresponding to columns of TF IDF matrix
# df10 = pd.DataFrame.sparse.from_spmatrix(tfidf_matrix,columns=tfidf_vectorizer.get_feature_names_out())
# df10.shape


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,93,94,95,96,97,98,99,rating,category,label
0,-0.016201,-0.958625,-1.034785,0.47943,-0.575406,-0.68528,0.59689,0.728719,-1.069964,0.092817,...,0.279581,0.951936,-0.171093,0.888181,-1.025417,0.452089,0.464295,5.0,Home_and_Kitchen_5,CG
1,0.063518,-0.075566,-0.395096,0.501121,-0.865776,-0.603348,-0.331116,0.885015,-0.579975,-0.10253,...,-0.139194,0.206701,0.184619,-0.055229,-0.547976,-0.335931,1.016516,5.0,Home_and_Kitchen_5,CG
2,0.249049,-0.503783,-0.492904,0.262155,0.324256,-0.496594,0.64346,0.824033,-1.163048,-0.374925,...,0.780556,0.737468,-0.101524,-0.088016,-1.093231,0.277394,0.007453,5.0,Home_and_Kitchen_5,CG
3,-0.03414,0.517855,-0.499791,1.418628,0.044079,-0.726697,-0.093427,0.983345,-0.813905,-0.667083,...,0.459059,1.23373,0.413352,-0.025329,-0.926069,-0.243112,-0.425895,1.0,Home_and_Kitchen_5,CG
4,-0.505681,-0.273524,-0.877795,0.597551,-0.688114,0.050559,-0.141156,1.116429,-0.757271,-0.214179,...,0.507116,0.564768,0.196237,-0.112224,-1.03966,-0.470514,0.636248,5.0,Home_and_Kitchen_5,CG


In [18]:
df_final = df11.copy()
# df9 = df9.reset_index(drop = True)
# df_final = df_final.reset_index(drop = True)

# df_final['category'] = df9['category']
# df_final['rating'] = df9['rating']
# df_final['label'] = df9['label']
# df_final['text_'] = df9['text_']
# df_final['tokens'] = df9['tokens']

# print(df_final.head())

In [19]:
df_final.to_csv('Processed_data.csv', index=False)
