Being able to compare products online revolutionized the experience of shoppers around the world. However, due to the ever-increasing scale of online marketplaces such as eBay, it has become impossible for consumers and businesses alike to evaluate features for every product in the desired search space.

In this project, we find product feature summaries based only on natural text reviews. We use review data from the excellent Amazon product reviews dataset provided by Julian McAuley. The work was done for the seminar Mining Massive Datasets at Hasso Plattner Institute in Potsdam, Germany.

The pipeline, described below, is implemented as Apache Spark jobs in Java and Scala. It makes use of Spark's MLlib and GraphX toolkits as well as Twitter's DIMSUM algorithm. It can be run locally and on Spark clusters.


We employ a three-step pipeline: Feature Extraction, Feature Clustering, and Modifier Weighting, each of which is explained below.


A Feature in our sense is a word or word group that stands for a product feature, such as display, battery, or color temperature. A Modifier is a word or word group that describes a feature, such as better, large, or the coolest.

Feature Extraction


We use filters to narrow down reviews to products of specific categories, brands, and price ranges. There are four filters currently implemented:

| Filter Class | Input | Output |
|---|---|---|
| SampleFilter | List of Products, Sample Fraction (0..1), Seed | Sampled list of products |
| BrandFilter | List of Products, Brand Name | Only products of the given brand |
| CategoryFilter | List of Products, Category Name | Only products of the given (sub-)category |
| PriceFilter | List of Products, Minimum Price, Maximum Price | Only products of the given price range |

In the subsequent steps, only reviews with matching product IDs will be evaluated.
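To illustrate how such filters compose, here is a minimal plain-Java sketch. The `Product` class, its field names, and the helper methods `byBrand`, `byPrice`, and `ids` are hypothetical stand-ins for illustration only; the project's own filter classes run as Spark jobs and may look quite different.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterSketch {
    // Hypothetical product type; the field names are assumptions.
    static final class Product {
        final String id;
        final String brand;
        final double price;

        Product(String id, String brand, double price) {
            this.id = id;
            this.brand = brand;
            this.price = price;
        }
    }

    // Keeps only products of the given brand (case-insensitive match).
    static List<Product> byBrand(List<Product> products, String brand) {
        return products.stream()
                .filter(p -> p.brand.equalsIgnoreCase(brand))
                .collect(Collectors.toList());
    }

    // Keeps only products with minPrice <= price <= maxPrice.
    static List<Product> byPrice(List<Product> products, double minPrice, double maxPrice) {
        return products.stream()
                .filter(p -> p.price >= minPrice && p.price <= maxPrice)
                .collect(Collectors.toList());
    }

    // Collects the ids against which reviews would be matched later on.
    static List<String> ids(List<Product> products) {
        return products.stream().map(p -> p.id).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Product> products = Arrays.asList(
                new Product("p1", "Acme", 19.99),
                new Product("p2", "Acme", 249.00),
                new Product("p3", "Other", 25.00));
        List<Product> kept = byPrice(byBrand(products, "Acme"), 10.0, 100.0);
        System.out.println(ids(kept)); // prints [p1]
    }
}
```

Because each filter takes a list and returns a list, filters can be chained in any order; only the ids of the surviving products are needed to select reviews.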

Building NGrams from Templates

We POS-tag the remaining reviews using the Stanford CoreNLP library. Over each review, we slide a window of fixed length to find ngrams that represent the features of the product and their modifiers. For this, we implemented templates: finite state machines that only accept an ngram if it matches the template. Knowing the structure of the template allows us to make assumptions about the position of the feature(s) and modifier(s) within the ngram. One of the templates we use is "[comparative adjective/adverb] + [noun]", with the former being the modifier and the latter being the feature.
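The template idea can be sketched in plain Java as a small state machine over POS tags. This is an illustrative sketch, not the project's implementation: the `Token` class and the method names are hypothetical, the tokens are assumed to be tagged already, and the tags `JJR`, `RBR`, `NN`, and `NNS` are taken from the Penn Treebank tag set that CoreNLP uses for English.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TemplateSketch {
    // One POS-tagged token (hypothetical helper type).
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    // Template "[comparative adjective/adverb] + [noun]" as a state machine:
    // state 0 expects the modifier, state 1 expects the feature, state 2 accepts.
    static boolean accepts(List<Token> window) {
        int state = 0;
        for (Token t : window) {
            if (state == 0 && (t.tag.equals("JJR") || t.tag.equals("RBR"))) {
                state = 1;
            } else if (state == 1 && (t.tag.equals("NN") || t.tag.equals("NNS"))) {
                state = 2;
            } else {
                return false; // no transition for this tag in the current state
            }
        }
        return state == 2;
    }

    // Slides a window of length 2 over the review and returns "modifier feature" strings.
    static List<String> extract(List<Token> review) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i + 2 <= review.size(); i++) {
            List<Token> window = review.subList(i, i + 2);
            if (accepts(window)) {
                matches.add(window.get(0).word + " " + window.get(1).word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Token> review = Arrays.asList(
                new Token("a", "DT"), new Token("better", "JJR"),
                new Token("display", "NN"), new Token("than", "IN"),
                new Token("before", "RB"));
        System.out.println(extract(review)); // prints [better display]
    }
}
```

Since the template fixes the order of tags, the first token of an accepted window can be treated as the modifier and the second as the feature without further analysis.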

Feature Clustering

Baseline (Exact Match)

This method clusters ngrams by their feature. If two ngrams contain the same feature (i.e. the same string), they are grouped together, so that we can have multiple modifiers for a feature.
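As a minimal sketch of this baseline, the grouping can be expressed as a map from feature string to its modifiers. The class name and the representation of an ngram as a `{modifier, feature}` string pair are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ExactMatchSketch {
    // Each ngram is given as {modifier, feature}; the result maps feature -> modifiers.
    static Map<String, List<String>> cluster(List<String[]> ngrams) {
        Map<String, List<String>> clusters = new TreeMap<>();
        for (String[] ngram : ngrams) {
            clusters.computeIfAbsent(ngram[1], k -> new ArrayList<>()).add(ngram[0]);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String[]> ngrams = Arrays.asList(
                new String[] {"better", "display"},
                new String[] {"larger", "display"},
                new String[] {"longer", "battery"});
        // prints {battery=[longer], display=[better, larger]}
        System.out.println(cluster(ngrams));
    }
}
```

Note that exact matching keeps "display" and "screen" in separate clusters, which is the limitation the similarity-based approaches address.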

Similarity-based Aggregation


DIMSUM (Dimension Independent Matrix Square using MapReduce) is an algorithm proposed by Twitter that uses sampling to cut the number of comparisons between matrix columns. We use the Spark implementation of this algorithm because we hoped it would be faster than our aggregation method. We compute the similarity between the vectors of our Features, as given by the previously computed Word2Vec model. We build a CoordinateMatrix out of these vectors, transpose it by swapping the row and column indices of each cell, and then convert it into a RowMatrix. After computing the similarities of the columns, we build a graph (each node is a Feature, each edge a similarity above a given threshold) in order to find all connected components. Each connected component then contains words with similar vectors, which we interpret as words describing the same feature. In contrast to the original use case of DIMSUM, our matrix is dense and has many columns but few rows. As a result, the advantages of DIMSUM do not come into play; however, the quality of its results is comparable to that of our aggregation method.
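The thresholding and connected-components step can be illustrated without Spark. The sketch below is not the project's code: it swaps DIMSUM's sampled column similarities for exact brute-force cosine similarity and GraphX's connected components for a small union-find, and the vectors and the threshold of 0.9 are made-up toy values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SimilarityClusterSketch {
    // Exact cosine similarity of two equally long vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Union-find root lookup with path halving.
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    // Connects words whose similarity reaches the threshold and returns the components.
    static List<List<String>> cluster(String[] words, double[][] vectors, double threshold) {
        int n = words.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (cosine(vectors[i], vectors[j]) >= threshold) {
                    parent[find(parent, i)] = find(parent, j); // add an edge i -- j
                }
            }
        }
        Map<Integer, List<String>> components = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            components.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(words[i]);
        }
        return new ArrayList<>(components.values());
    }

    public static void main(String[] args) {
        String[] words = {"display", "screen", "battery"};
        double[][] vectors = {{1.0, 0.1}, {0.9, 0.2}, {0.0, 1.0}};
        System.out.println(cluster(words, vectors, 0.9)); // prints [[display, screen], [battery]]
    }
}
```

The brute-force loop compares every pair of words, which is exactly the cost that DIMSUM's sampling is meant to reduce on large, sparse matrices.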

Modifier Weighting

After deduplicating the features, we build a linear model for each feature over the reviews, where the rating of the review is the dependent variable and the Modifiers are the independent variables. For this task we use Spark's MLlib. For each Modifier, we get a coefficient that indicates how much this Modifier influences the rating of the review. Finally, we sort the Modifiers over all models in order to see which combinations of Feature and Modifier are the most positive or most negative.
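To make the idea of per-Modifier coefficients concrete, here is a small stand-alone sketch for a single feature. It is an assumption-laden illustration rather than the project's MLlib code: it fits an ordinary least-squares model with plain batch gradient descent, encodes each Modifier as a 0/1 indicator, and uses invented toy ratings; the class and method names are hypothetical.

```java
public class ModifierWeightSketch {
    // Fits rating ~ intercept + sum_j w[j] * x[i][j] by batch gradient descent.
    // x[i][j] is 1.0 if modifier j occurs with the feature in review i, else 0.0.
    // Returns the modifier weights followed by the intercept in the last slot.
    static double[] fit(double[][] x, double[] ratings, int iterations, double learningRate) {
        int n = x.length;
        int m = x[0].length;
        double[] w = new double[m + 1]; // w[m] is the intercept
        for (int it = 0; it < iterations; it++) {
            double[] grad = new double[m + 1];
            for (int i = 0; i < n; i++) {
                double predicted = w[m];
                for (int j = 0; j < m; j++) {
                    predicted += w[j] * x[i][j];
                }
                double error = predicted - ratings[i];
                for (int j = 0; j < m; j++) {
                    grad[j] += error * x[i][j];
                }
                grad[m] += error;
            }
            for (int j = 0; j <= m; j++) {
                w[j] -= learningRate * grad[j] / n;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        String[] modifiers = {"better", "worse"};
        // Toy reviews mentioning the feature "display"; the last one uses neither modifier.
        double[][] x = {{1, 0}, {1, 0}, {0, 1}, {0, 1}, {0, 0}};
        double[] ratings = {5, 4, 1, 2, 3};
        double[] w = fit(x, ratings, 5000, 0.1);
        for (int j = 0; j < modifiers.length; j++) {
            System.out.println(modifiers[j] + " display: " + w[j]);
        }
    }
}
```

On this toy data the coefficient for "better" comes out positive and the one for "worse" negative, so sorting the coefficients ranks "better display" above "worse display".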



The following experiments were done using a cluster with these specifications:

  • Master: Dell PowerEdge R310 (4x2.66 GHz, 8 GB RAM)
  • 10 Slave Nodes: Dell OptiPlex 780 (2x2.6 GHz, 8 GB RAM)
  • Shared HDFS


[Figure: Runtime chart]

Relative Speedup

[Figure: Relative speedup chart]

Percentage Improvement

[Figure: Percentage improvement chart]


We use product reviews as provided in the Amazon product reviews dataset. The program only accepts reviews and metadata in JSON format, read from the local file system or the Hadoop Distributed File System (HDFS).

Running the Code


  1. mvn clean install
  2. java -jar target/mmds-[version].jar [parameters, see below]

On a Spark Cluster

  1. mvn clean install
  2. mvn package
  3. spark-submit --deploy-mode client --jars $(echo target/lib/*.jar | tr ' ' ',') target/mmds-[version].jar [parameters, see below]

Use spark-submit parameters like --executor-memory 4G --driver-memory 8G --total-executor-cores 20 to scale.


The program accepts the following parameters:

.../mmds-[version].jar [ReviewFilePath] [NumberOfPartitions] [ClusteringAlgorithm] [MetadataFilePath]

| Parameter | Expected Value | Example |
|---|---|---|
| ReviewFilePath | A valid file path on the local or Hadoop file system. The file must contain review JSON objects. | resources/samples/musical_instruments_top100.json |
| NumberOfPartitions | Integer. Number of partitions used to parallelize Spark jobs. | 4 |
| ClusteringAlgorithm | String. One of [TreeAggregate, DIMSUM, ExactMatch]. Default is ExactMatch. | DIMSUM |
| MetadataFilePath | A valid file path on the local or Hadoop file system. The file must contain metadata JSON objects. | resources/samples/musical_instruments_metadata_top100.json |
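How the ClusteringAlgorithm parameter might be interpreted can be sketched as follows. This is a hypothetical illustration, not the program's actual argument handling: the enum, the method name, the case-insensitive matching, and the fallback to ExactMatch for missing or unrecognized values are all assumptions; only the three algorithm names and the ExactMatch default come from the table above.

```java
import java.util.Locale;

public class ArgsSketch {
    // The three documented choices for the ClusteringAlgorithm parameter.
    enum Clustering { TREE_AGGREGATE, DIMSUM, EXACT_MATCH }

    // Reads the third positional parameter; assumed to fall back to ExactMatch.
    static Clustering parseClustering(String[] args) {
        if (args.length < 3) {
            return Clustering.EXACT_MATCH;
        }
        switch (args[2].toLowerCase(Locale.ROOT)) {
            case "treeaggregate":
                return Clustering.TREE_AGGREGATE;
            case "dimsum":
                return Clustering.DIMSUM;
            default:
                return Clustering.EXACT_MATCH;
        }
    }

    public static void main(String[] args) {
        String[] example = {
            "resources/samples/musical_instruments_top100.json", "4", "DIMSUM",
            "resources/samples/musical_instruments_metadata_top100.json"};
        System.out.println(parseClustering(example)); // prints DIMSUM
    }
}
```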

Maximilian Grundke, Axel Kroschk, Jaspar Mang, 2016.


Extract the features of a product from corresponding amazon reviews.