# Task : Amazon Reviews Analysis

## Installing Libraries

In [1]:
!pip install pandas nltk whoosh scikit-learn

^C





[notice] A new release of pip available: 22.3.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


## Imports

In [2]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


## NLTK stopwords and wordnet (lexical database of English)

In [3]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\usmim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\usmim\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Using Pandas for reading data and displaying the last five rows in the dataframe(i.e. table)

Reading the dataset using pandas and displaying the table(i.e dataframe)

In [4]:
data = pd.read_csv('amazon_reviews.csv')
data.tail()

Unnamed: 0,userName,content
45594,Mary Mora,Amazon Smile donates. Make sure you get all se...
45595,Marie Elliott,After having problems with the app and having ...
45596,Dan Preston,"Used to be great. Got greedy, they ruined the ..."
45597,Jhosh,New search bar location sucks. At least give m...
45598,Christopher Read,for me personally I use Amazon prime due to be...


In [5]:
data.head()

Unnamed: 0,userName,content
0,David Lane (littleBIG),it's very functional
1,Emery Hayward,Trash app login function now long works.
2,Sohaila Elzayady,One of my favorite apps 🥰
3,mo jo,Lost my former purchases and my wish list this...
4,Simba Davies,Brilliant


Displaying information about the dataframe

In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45599 entries, 0 to 45598
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   userName  45599 non-null  object
 1   content   45599 non-null  object
dtypes: object(2)
memory usage: 712.6+ KB


## Some example pandas operations

In [6]:
data['userName']

0        David Lane (littleBIG)
1                 Emery Hayward
2              Sohaila Elzayady
3                         mo jo
4                  Simba Davies
                  ...          
45594                 Mary Mora
45595             Marie Elliott
45596               Dan Preston
45597                     Jhosh
45598          Christopher Read
Name: userName, Length: 45599, dtype: object

In [29]:
# viewing a column in the dataset
data['content']

0                                     it's very functional
1                 Trash app login function now long works.
2                                One of my favorite apps 🥰
3        Lost my former purchases and my wish list this...
4                                                Brilliant
                               ...                        
45594    Amazon Smile donates. Make sure you get all se...
45595    After having problems with the app and having ...
45596    Used to be great. Got greedy, they ruined the ...
45597    New search bar location sucks. At least give m...
45598    for me personally I use Amazon prime due to be...
Name: content, Length: 45599, dtype: object

In [7]:
# viewing a row in the dataset
data.iloc[3]

userName                                                mo jo
content     Lost my former purchases and my wish list this...
Name: 3, dtype: object

In [8]:
# viewing a cell in the dataset
data['content'][0]

"it's very functional"

## Data Pre-processing

Convert to lowercase and removing special characters and keeping text only

In [10]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)
    return text

Removing stop words

In [9]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [11]:
def stopwords_removal(text):
    stop_words = set(stopwords.words('english'))
    sentence = ' '.join([word for word in text.split() if word not in stop_words])
    return sentence

Performing lemmatization

In [12]:
def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    sentence = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    return sentence

Applying on our review data

In [13]:
data['content'] = data['content'].apply(clean_text)
data['content'] = data['content'].apply(lemmatize)
data['content'] = data['content'].apply(stopwords_removal)

print(data['content'])

0                                               functional
1                       trash app login function long work
2                                        one favorite apps
3        lost former purchase wish list wa disappointin...
4                                                brilliant
                               ...                        
45594    amazon smile donates make sure get set learn w...
45595                   problem app reinstall working fine
45596    used great got greedy ruined music app longer ...
45597    new search bar location suck least give option...
45598    personally use amazon prime due disabled deliv...
Name: content, Length: 45599, dtype: object


In [37]:
# iterating over the rows of a specific column in the dataset
for d in data['content']:
    print(d)
    break

functional


In [14]:
# iterating over the rows of the dataset
for index, row in data.iterrows():
    print(row['userName'])
    print(row['content'])
    break

David Lane (littleBIG)
functional


## Term-Document Incidence Matrix

In [15]:
def term_document_incidence_matrix(data):
    # initialize an empty set to store unique words
    words = set()
    
    # iterate over the content column in the dataframe and update the set with unique words
    for content in data:
        words.update(content.split())
    
    # convert the set to a list and sort it
    words = list(words)
    words.sort()
    
    # define an empty list to store the matrix
    matrix = []
    
    # create a row for each content in the dataframe 
    for content in data:
        
        # store the row as a list of zeros with the length of the unique words
        row = [0] * len(words)
        
        # for each word in the content, update the row with 1 where the word is found
        for word in content.split():
            row[words.index(word)] = 1
        
        matrix.append(row)
    
    # return the matrix as a dataframe with the words as columns
    return pd.DataFrame(matrix, columns=words).T


In [16]:
matrix = term_document_incidence_matrix(data['content'][:2000])
matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999
aamazon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aarives,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ability,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abismal,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
able,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zero,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zfold,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zipper,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
matrix.shape

(5087, 2000)

In [18]:
matrix.loc['good']

0       0
1       0
2       0
3       0
4       0
       ..
1995    0
1996    0
1997    0
1998    0
1999    0
Name: good, Length: 2000, dtype: int64

In [19]:
matrix.loc['good'].sum()

168

In [20]:
filtered_matrix = matrix.loc['good']
filtered_matrix = filtered_matrix[(filtered_matrix > 0)]
filtered_matrix

7       1
16      1
26      1
31      1
37      1
       ..
1828    1
1838    1
1900    1
1925    1
1935    1
Name: good, Length: 168, dtype: int64

## Boolean Search

In [22]:
def boolean_search(data, matrix, search_terms, operator='AND'):
    try:
        # Ensure that the search terms are in lowercase
        search_terms = [term.lower() for term in search_terms]
                
        # Filter the matrix to include only the search terms
        filtered_matrix = matrix.loc[search_terms]
    
        # Find all reviews where all terms appear (boolean AND operation)
        if operator == 'AND':
            valid_indices = filtered_matrix.columns[(filtered_matrix > 0).all()]
            
        # Find all reviews where any term appears (boolean OR operation)
        elif operator == 'OR':
            valid_indices = filtered_matrix.columns[(filtered_matrix > 0).any()]
            
        else:
            raise ValueError("Operator must be 'AND' or 'OR'")

        # Select and return the relevant data using the valid indices
        return data.loc[valid_indices, ['userName', 'content']]

    except KeyError as e:

        # Handle the case where one or more search terms are not in the matrix
        print(f"Warning: {str(e).strip('[]')} not found")
        return pd.DataFrame()  # Return an empty DataFrame if any term is not found

In [23]:
result = boolean_search(data, matrix, ['good', 'product','delivery'], operator='AND')
print(result)

         userName                                            content
694  Charlie Cook  happy amazon shopping experience cost wide ran...


## Inverted Index using Dictionaries

In [24]:
# an example dictionary
d = {}

d['a'] = [1]

print(d['a'])
print(d.get('a'))

d['b'] = [4]
print(d)


d['a'].append(5)
print(d)

[1]
[1]
{'a': [1], 'b': [4]}
{'a': [1, 5], 'b': [4]}


In [25]:
def create_inverted_index(data):
    inverted_index = {}
    
    # iterating over the rows of the dataframe
    # here i is the index of the row and row is the content of the row
    for i, row in data.iterrows():
        
        # tokenize the content into words
        words = row['content'].split()
        
        # iterate over the words and update the inverted index
        for word in words:
            
            # find the word in the inverted index
            if word in inverted_index:    
                # add the id of the document to the word
                inverted_index[word].add(i)
            else:
                # else create a new entry for the word
                inverted_index[word] = {i}
    # return the inverted index dictionary
    return inverted_index

In [26]:
def inverted_index_search(data, inverted_index, search_terms):
    
    try:
        
        # if a term is in the search terms and in the inverted index then get the set of indices
        sets_of_indices = [set(inverted_index[term]) for term in search_terms if term in inverted_index]
        # print(sets_of_indices)
        
        # check if there are no sets of indices then return an empty dataframe
        if not sets_of_indices:
            return pd.DataFrame() 

        #  find the intersection of the sets of indices of the search terms
        #  the documents that are present in all the sets are the relevant documents
        # this is the boolean AND operation
        valid_indices = set.intersection(*sets_of_indices)
    

        # Convert the set of valid indices to a list
        valid_indices_list = list(valid_indices)

        # return the dataframe with the rows of the valid indices of our search terms
        return data.loc[valid_indices_list, ['userName', 'content']]
   
    except KeyError as e:
        # Handle the case where one or more search terms are not in the matrix
        print(f"Warning: {str(e).strip('[]')} not found")
        return pd.DataFrame()

In [30]:
# call the function to create inverted index
inverted_index = create_inverted_index(data[:10000])
# print(inverted_index)


# search using inverted index
result = inverted_index_search(data, inverted_index, ['de'])
print(result)
print(f'\n Returned {len(result)} documents')

                                userName  \
8774                 Ladislav Jurdik ml.   
1993                Susanna Monika Faltz   
3212                        2023 Warzone   
6163                            Matej S.   
2391                 Michail Savvoulidis   
3673                       A Google user   
9313                             Mandy O   
9701                           J Gilbert   
619                        Cortes Cortes   
7595                           natasa vl   
3951  Dimitrios Troumpetakis - Voutsinos   
1843                         Leonardo MT   
9462                       Shirley Drury   
7416                    Theodore Tekkers   
187                      David Gutierrez   
8510                   Deb & Mark Draper   

                                                content  
8774  worst amazon de experience ever lost trust ord...  
1993  fix app language setting completely missing ap...  
3212  need update gps tracking item location like di...  
6163  app need impr

## Latency Calculation

In [107]:
# average latency calculation for boolean search for 3 queries
import time

latency = []


start_time = time.time()
boolean_search(data, matrix, ['good', 'product','delivery'], operator='AND')
print("--- %s seconds ---" % (time.time() - start_time))
latency.append(time.time() - start_time)

start_time = time.time()
boolean_search(data, matrix, ['good', 'product','delivery'], operator='AND')
print("--- %s seconds ---" % (time.time() - start_time))
latency.append(time.time() - start_time)

start_time = time.time()
boolean_search(data, matrix, ['good', 'product','delivery'], operator='AND')
print("--- %s seconds ---" % (time.time() - start_time))
latency.append(time.time() - start_time)

# average latency of the times
latency = sum(latency)/len(latency)
print("\nthe average latency: ",latency)

--- 0.008294820785522461 seconds ---
--- 0.00860595703125 seconds ---
--- 0.006006002426147461 seconds ---

the average latency 0.007964372634887695


## Simple Whoosh Implementation

In [31]:
# simple search using whoosh
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import StemmingAnalyzer
from whoosh.qparser import QueryParser
from whoosh.index import create_in
import os

In [32]:
# define a schema 
schema = Schema(
    userName=TEXT(stored=True),
    content=TEXT(analyzer=StemmingAnalyzer(), stored=True)
)

In [33]:
# make a directory to store the index
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

In [34]:
# define the index schema and create an index
ix = create_in("indexdir", schema)

In [35]:
def write_to_index(data):
    # create a writer object
    writer = ix.writer()
    # write the data to the index
    for i, row in data.iterrows():
        writer.add_document(userName=row['userName'], content=row['content'])
    # commit the writer
    writer.commit()

# call the function to write to the index
write_to_index(data[:10000])

In [36]:
#  search using whoosh
def whoosh_search(searcher, query):
    
    # create a query parser for the content field in the index schema
    query_parser = QueryParser("content", ix.schema)
    
    # parse the query string
    query = query_parser.parse(query)
    
    # search the index
    results = searcher.search(query)
    # return the results
    return results

In [37]:
# create a searcher object not function
searcher = ix.searcher()

# whoosh search for the query using the searcher object
result = whoosh_search(searcher, 'good delivery product')
print(result)

<Top 10 Results for And([Term('content', 'good'), Term('content', 'deliveri'), Term('content', 'product')]) runtime=0.044635600002948195>


In [38]:
for res in result:
    print(res)

<Hit {'content': 'good product good price fast delivery', 'userName': 'John Williams'}>
<Hit {'content': 'amazon shopping ha extreme product ordered excellent delivery good', 'userName': 'melanie hastings'}>
<Hit {'content': 'good product time delivery could bit done timely manner', 'userName': 'Randall Liestman'}>
<Hit {'content': 'review improved many fraudulent lot wrong product always start star review go delivery good bad blame goin ups want return delivery container hope remember customer service good absolute worst part shopping amazon promotes product search product throw unrelated product get interest purchase go ebay trouble finding product', 'userName': 'Dave Orozco'}>
<Hit {'content': 'always good product wise chosen delivery people really suck seems cannot read basic straight forward delivery instruction mean basic', 'userName': 'William Marshall'}>
<Hit {'content': 'happy amazon shopping experience cost wide range availability product good alternative also shown searching