This project builds a small opinion-based search engine using Amazon reviews. The goal is to retrieve reviews that mention a product aspect together with a positive or negative opinion. I implemented a Boolean baseline, a Boolean + lexicon polarity filter (M1), and an SBERT + lexicon semantic method (M2) to compare retrieval performance.
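As a rough illustration of the Boolean + lexicon idea behind M1, the sketch below keeps a review only if it contains every aspect term and at least one word of the requested polarity. It is a minimal sketch and not the code in m1.py: the function names, the tiny stand-in lexicons, and the sample reviews are all assumptions made for the example.

```python
import re

# Tiny stand-in lexicons (assumed for the example); the project uses the full word lists.
POSITIVE = {"great", "clear", "excellent"}
NEGATIVE = {"poor", "weak", "terrible"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def matches(review, aspect_terms, polarity):
    """Return True if the review contains every aspect term (Boolean AND)
    and at least one lexicon word of the requested polarity."""
    tokens = set(tokenize(review))
    has_aspect = all(term in tokens for term in aspect_terms)
    lexicon = POSITIVE if polarity == "positive" else NEGATIVE
    return has_aspect and bool(tokens & lexicon)


# Made-up sample reviews, not taken from the dataset.
reviews = [
    "The audio quality is great on this headset.",
    "Poor audio quality, lots of static.",
    "Fast shipping and great packaging.",
]
hits = [r for r in reviews if matches(r, ["audio", "quality"], "positive")]
print(hits)
```

Only the first sample review passes both checks: the second mentions the aspect but has no positive word, and the third has a positive word but no aspect match.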
- Evaluation work: A completed precision table with every retrieved review manually checked and marked as relevant or not.
- Code files:
  - Baseline.py
  - m1.py
  - m2.py
- README.md: Full documentation explaining setup, methods, files, and execution steps.
- Outputs folder: Contains results from Baseline (tests 1–3), Method 1 (test4), and Method 2 (test4).
- Data and lexicon files:
  - positive-words.txt
  - negative-words.txt
  - reviews_segment.pkl (Main dataset)
  - data.pkl (SBERT embeddings for Method 2)
You need Python installed to run this project.
- The project works with Python 3.9 or higher, with one exception: Method 2 (M2) requires Python 3.10.
- Do NOT use Python 3.11 for M2 because the pickle file will not load.
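The version requirement for M2 can be checked before running anything. The helper below is a hypothetical convenience and not part of the project scripts; it only encodes the Python 3.10 rule stated above.

```python
import sys


def is_m2_compatible(version=None):
    """Return True only for Python 3.10.x, the version this README requires for M2."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) == (3, 10)


if not is_m2_compatible():
    print("Warning: M2 requires Python 3.10; the pickle file may fail to load.")
```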
Download Python here: https://www.python.org
Install the required packages by running:
- pip install pandas
- pip install numpy
- pip install scikit-learn
- pip install nltk
- pip install sentence-transformers
- pip install torch
Make sure the following files are inside the same folder as the Python scripts (Baseline.py, m1.py, m2.py).
If any of these are missing or placed in the wrong folder, the code will NOT run CORRECTLY.
- positive-words.txt
- negative-words.txt
These come from the Hu & Liu (2004) opinion lexicon and are used to detect positive or negative words when filtering reviews by polarity.
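A minimal sketch of reading one of these word lists into a Python set is shown below. It assumes the usual distribution format of the lexicon (one word per line, header lines starting with `;`, Latin-1 encoding); `load_lexicon` is a hypothetical helper, not a function from the project scripts. The demo writes a small temporary file so it runs without the real lists.

```python
import os
import tempfile


def load_lexicon(path):
    """Read one opinion word per line, skipping blank lines and ';' comment lines."""
    words = set()
    # latin-1 is an assumption about how the word lists are encoded.
    with open(path, encoding="latin-1") as handle:
        for line in handle:
            word = line.strip()
            if word and not word.startswith(";"):
                words.add(word.lower())
    return words


# Demo on a small temporary file instead of the real positive-words.txt.
sample = "; header comment\n\ngood\nGreat\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="latin-1"
) as tmp:
    tmp.write(sample)
words = load_lexicon(tmp.name)
os.remove(tmp.name)
print(sorted(words))  # ['good', 'great']
```

With the real files in place, the call would be `load_lexicon("positive-words.txt")` and `load_lexicon("negative-words.txt")`.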
reviews_segment.pkl is the main Amazon review dataset for the project.
It contains:
- review text
- review titles
- star ratings
- product/user metadata
Baseline, M1, and M2 load this file using:
df = pd.read_pickle("reviews_segment.pkl")
data.pkl:
- This file contains precomputed BERT sentence embeddings for every sentence in the review corpus.
- It is required by Method 2 (M2) for semantic similarity search.
- Credit: data.pkl was generated and provided by TA Navid Ayoobi
M2 uses this file to compare the query embedding with all stored sentence embeddings.
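The comparison step can be sketched as a cosine-similarity ranking. This is a minimal sketch under stated assumptions, not the code in m2.py: the internal layout of data.pkl is not documented here, so the example uses a small made-up matrix in place of the stored embeddings, and `top_k_similar` is a hypothetical helper name.

```python
import numpy as np


def top_k_similar(query_vec, sentence_vecs, k=3):
    """Return (indices, scores) of the k stored vectors most similar to the query,
    ranked by cosine similarity."""
    query = query_vec / np.linalg.norm(query_vec)
    matrix = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    scores = matrix @ query            # cosine similarity of each row with the query
    order = np.argsort(-scores)[:k]    # highest scores first
    return order, scores[order]


# Made-up 2-D vectors standing in for real sentence embeddings.
sentence_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vec = np.array([1.0, 0.1])
idx, sims = top_k_similar(query_vec, sentence_vecs, k=2)
print(idx, sims)
```

In the real method the query vector would come from the same sentence-embedding model that produced data.pkl, since similarities are only meaningful between vectors from the same model.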
NOTE: data.pkl is not included in the submitted zip file because it is too large.
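Because a missing file stops the scripts from running correctly, a quick pre-flight check can help. The snippet below is a hypothetical helper, not part of the project; it only checks for the four data and lexicon files named in this README, in the folder it is given.

```python
from pathlib import Path

# File names taken from this README; data.pkl is only needed for Method 2.
REQUIRED_FILES = [
    "positive-words.txt",
    "negative-words.txt",
    "reviews_segment.pkl",
    "data.pkl",
]


def missing_files(folder="."):
    """Return the required file names that are not present in `folder`."""
    base = Path(folder)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found.")
```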
All scripts must be run from inside the Codes folder so the output paths resolve correctly. Open a terminal, move into the Codes directory, and run each script:
python3 Baseline.py
python3 m1.py
python3 m2.py
All retrieved results for this project are saved into the Outputs directory.
Each method (Baseline, M1, M2) writes its own output files for the five required queries:
- audio_quality
- wifi_signal
- mouse_button
- gps_map
- image_quality
If you made it this far, then you have successfully set up the project and run all three methods: Baseline, Method 1, and Method 2. All outputs should now be saved in the Outputs folder for each of the five required queries.