Domain Specific Text Data Analysis and Processing
To run the programs and follow the relative links mentioned here, your current directory must be the SourceCode folder. Change to it with the following command in the command prompt:
$ cd SourceCode
- Python 3.6
The review dataset file is omitted to save space in the project directory. Please place your CellPhoneReview.json dataset file in the dataset/ folder.
Project setup takes the following two steps.
The following command installs the required Python packages.
$ pip install -r requirements.txt
The following NLTK models need to be downloaded to run some of the features provided by NLTK, such as POS tagging. Run the following code in a Python interpreter.
>>> import nltk
>>> nltk.download('vader_lexicon')
This section describes the commands needed to run the code that solves the problems listed in the project assignment. Please ensure that CellPhoneReview.json has been placed in the dataset/ folder before using any of the commands.
Execute the following command in the command prompt to start Dataset Analysis. A trace sample is available here.
$ python main.py analysis
The following is the actual sample trace and graph output you would see when running the above command.
Top 10 products with most reviews:
B005SUHPO6 836
B0042FV2SI 690
B008OHNZI0 657
B009RXU59C 634
B000S5Q9CA 627
B008DJIIG8 510
B0090YGJ4I 448
B009A5204K 434
B00BT7RAPG 431
B0015RB39O 424
Name: asin, dtype: int64
Sample Sentence Lengths
,asin,overall,reviewText,reviewTime,reviewerID,summary,unixReviewTime,sentencelength
43527,B004Z9W4B0,4,"the product is nice, cute, but not as i expected ...i think there are nicer stuff out there that you can choose form","07 27, 2012",AQIGD6RKS9OJX,"Cute but not as expected",1343347200,1
96469,B008CZO7OU,5,"This was a gift for the father-in-law & he's not always easy to please. As always Seidio does a great job. The case is slim, buttons press easily & it looks great as well as protects.","02 6, 2013",A1S48ANGFTOCM9,"Great case.",1360108800,3
55198,B005LFXBJG,4,"I have these plugged in all over my house and never had a problem with them. I like to get OEM chargers because they seem to last longer than knockoffs.","06 20, 2013",A2L9DBKBXLGHOL,"Never have enough",1371686400,2
Top 20 most frequent words
Top 20 most frequent words before stemming:
[('phone', 174958), ('case', 146050), ('one', 86759), ('like', 71853), ('great', 66874), ('would', 66809), ('use', 61676), ('screen', 60145), ('good', 58601), ('battery', 57976), ('well', 51149), ('iphone', 47691), ('get', 46380), ('charge', 44897), ('charger', 38646), ('product', 38223), ('really', 38055), ('also', 37145), ('time', 36661), ('works', 32743)]
Top 20 most frequent words after stemming:
[('phone', 192494), ('case', 165228), ('use', 118087), ('one', 92736), ('charg', 92539), ('like', 79743), ('work', 76325), ('great', 66912), ('would', 66814), ('batteri', 66237), ('screen', 61812), ('get', 61243), ('good', 58823), ('look', 52093), ('well', 51161), ('iphon', 50521), ('fit', 50296), ('time', 48381), ('charger', 45281), ('protect', 44734)]
POS Tagging
POS Tagging for 5 random sentences:
it looked good but did not fit properly the volume would change without touching the buttons and the vibrate switch was partially covered buy a different bumper they are all much better => [('it', 'PRP'), ('looked', 'VBD'), ('good', 'JJ'), ('but', 'CC'), ('did', 'VBD'), ('not', 'RB'), ('fit', 'VB'), ('properly', 'RB'), ('the', 'DT'), ('volume', 'NN'), ('would', 'MD'), ('change', 'VB'), ('without', 'IN'), ('touching', 'VBG'), ('the', 'DT'), ('buttons', 'NNS'), ('and', 'CC'), ('the', 'DT'), ('vibrate', 'NN'), ('switch', 'NN'), ('was', 'VBD'), ('partially', 'RB'), ('covered', 'VBN'), ('buy', 'VB'), ('a', 'DT'), ('different', 'JJ'), ('bumper', 'NN'), ('they', 'PRP'), ('are', 'VBP'), ('all', 'DT'), ('much', 'RB'), ('better', 'JJR')]
In addition, the program plots the frequency statistics for sentence segmentation, tokenized words without stemming, and tokenized words with stemming. This graph will be available as a .png image in the folder here once the program finishes running.
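The "Top 10 products with most reviews" figures in the trace above are per-product review counts; the `Name: asin, dtype: int64` footer suggests the project computes them with pandas. As a rough sketch of the same idea using only the standard library, and assuming the dataset stores one JSON object per line with an `asin` field (the function name `top_products` and the sample records are hypothetical and not taken from main.py):

```python
import json
from collections import Counter


def top_products(lines, n=10):
    """Return the n most-reviewed product IDs (asin) with their review counts."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)  # assumption: one JSON object per line
        counts[review["asin"]] += 1
    return counts.most_common(n)


# Tiny in-memory sample standing in for dataset/CellPhoneReview.json.
sample = [
    '{"asin": "B005SUHPO6", "overall": 5, "reviewText": "Great case."}',
    '{"asin": "B0042FV2SI", "overall": 4, "reviewText": "Works well."}',
    '{"asin": "B005SUHPO6", "overall": 3, "reviewText": "It is fine."}',
]
print(top_products(sample, n=2))
```

Because the function accepts any iterable of lines, an open file handle on the dataset could be passed in place of `sample`, provided the file really is in the one-object-per-line layout assumed here.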
Execute the following command in the command prompt to start generating the top 20 most frequent noun phrases.
$ python main.py nounphrase
A trace sample is available here. In our experiment, we ran the program using two different types of taggers. The one used in the final version is the RegExp Parser, which is the default mode. If you wish to run it with the ConsecutiveNPChunk tagger, simply uncomment the line which sets the mode to "chunktagger".
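To give a feel for what regular-expression chunking over POS tags does, here is a minimal standard-library sketch. It does not use NLTK and is not the project's code: the chunk pattern (optional determiner, any adjectives, one or more nouns) and the names `NP_PATTERN` and `noun_phrases` are assumptions chosen for illustration, and the tagged tokens are taken from the POS tagging trace shown earlier.

```python
import re
from collections import Counter

# Hypothetical chunk pattern over Penn Treebank tags:
# optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ[RS]?>)*(?:<NN[A-Z]*>)+")


def noun_phrases(tagged):
    """Yield noun phrases from a list of (word, tag) pairs."""
    # Encode the tag sequence as a string such as "<DT><NN><MD>".
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    for match in NP_PATTERN.finditer(tag_string):
        # The number of "<" before the match is the index of its first token.
        start = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        yield " ".join(word for word, _ in tagged[start:start + length])


tagged = [
    ("the", "DT"), ("volume", "NN"), ("would", "MD"), ("change", "VB"),
    ("without", "IN"), ("touching", "VBG"), ("the", "DT"), ("buttons", "NNS"),
    ("and", "CC"), ("the", "DT"), ("vibrate", "NN"), ("switch", "NN"),
    ("was", "VBD"),
]
phrases = list(noun_phrases(tagged))
print(phrases)
print(Counter(phrases).most_common(20))
```

Counting the yielded phrases over every review with a `Counter`, as in the last line, is one plausible way to arrive at a "top 20 most frequent noun phrases" list.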
Execute the following command in the command prompt to start generating the top 20 positive and negative words.
$ python main.py sentiment
A trace sample and graph image of the result can be found in the folder here. Please note that the actual console output will differ somewhat from trace.txt, as the file does not include progress messages such as 1000 of 190,000 done.
The following image shows the actual sample trace and graph output you would see when running the sentiment word detection algorithm. The sample trace shows the top 20 positive and negative words along with the calculation of their adjusted sentiment values.