## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

### Read reviews data

In [4]:
# Load the Samsung.txt dataset (one review per line)
# Alternative path: "../data/Samsung.txt"
with open("../Dataset/Samsung.txt", "r", encoding="utf-8") as con:
    samsung_reviews = con.read()

In [5]:
len(samsung_reviews.split("\n"))

46355

### The dataset is a text file with one review per line

In [6]:
samsung_reviews.split("\n")[0:4]

["I feel so LUCKY to have found this used (phone to us & not used hard at all), phone on line from someone who upgraded and sold this one. My Son liked his old one that finally fell apart after 2.5+ years and didn't want an upgrade!! Thank you Seller, we really appreciate it & your honesty re: said used phone.I recommend this seller very highly & would but from them again!!",
 'nice phone, nice up grade from my pantach revue. Very clean set up and easy set up. never had an android phone but they are fantastic to say the least. perfect size for surfing and social media. great phone samsung',
 'Very pleased',
 'It works good but it goes slow sometimes but its a very good phone I love it']
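Because every cell above calls `samsung_reviews.split("\n")` again, it may help to split once and keep the result. The sketch below uses a hypothetical helper name, `split_reviews`, and a short sample string made from abbreviated reviews shown above. It also drops blank lines: if the file ends with a newline, `split("\n")` returns a trailing empty string, so the line count can be one higher than the number of reviews. Whether that applies to this file has not been checked here.

```python
def split_reviews(raw_text):
    """Split raw file contents into one review per line, skipping blank lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]

# Sample input: abbreviated reviews from the output above, with a blank line
# and a trailing newline added for illustration.
raw = "Very pleased\n\nIt works good but it goes slow sometimes\n"
reviews = split_reviews(raw)
print(len(reviews))
```

In the notebook this would be called as `split_reviews(samsung_reviews)`.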

### Will our hypothesis hold on real-world data? `Product features---POS_NOUN`

In [11]:
review1=samsung_reviews.split("\n")[0]
review1=nlp(review1)

### Let's run the spaCy parse on part of one review in our dataset

In [12]:
for tok in review1[0:10]:
  print(tok.text,"--",tok.lemma_, "--", tok.pos_)

I -- I -- PRON
feel -- feel -- VERB
so -- so -- ADV
LUCKY -- LUCKY -- NOUN
to -- to -- PART
have -- have -- AUX
found -- find -- VERB
this -- this -- DET
used -- use -- VERB
( -- ( -- PUNCT


#### Real-world data is usually messy: observe the words `found` and `used`

In [13]:
pos = []
lemma = []
text = []
for tok in review1:
  pos.append(tok.pos_)
  lemma.append(tok.lemma_)
  text.append(tok.text)

In [14]:
nlp_table = pd.DataFrame({'text':text, 'lemma':lemma, 'pos':pos})
nlp_table.head()

| | text | lemma | pos |
|---|---|---|---|
| 0 | I | I | PRON |
| 1 | feel | feel | VERB |
| 2 | so | so | ADV |
| 3 | LUCKY | LUCKY | NOUN |
| 4 | to | to | PART |


In [15]:
## Get most frequent lemma forms of nouns
nlp_table[nlp_table['pos'] == 'NOUN']['lemma'].value_counts()


| lemma | count |
|---|---|
| phone | 3 |
| one | 2 |
| LUCKY | 1 |
| line | 1 |
| year | 1 |
| upgrade | 1 |
| honesty | 1 |
| re | 1 |
| seller | 1 |


#### It seems possible that if we extract all the nouns from the reviews and look at the top 5 most frequent lemmatised noun forms, we will be able to identify `What people are talking about?`
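The idea can be sketched without pandas, using `collections.Counter` as a stand-in for `value_counts()`. The helper name `top_noun_lemmas` is hypothetical; it takes `(lemma, pos)` pairs such as those built from `tok.lemma_` and `tok.pos_`. The sample below is the first ten tokens of the first review, copied from the parse output earlier in this notebook.

```python
from collections import Counter

def top_noun_lemmas(tokens, n=5):
    """Return the n most frequent lower-cased lemmas among NOUN tokens.

    `tokens` is an iterable of (lemma, pos) pairs.
    """
    counts = Counter(lemma.lower() for lemma, pos in tokens if pos == "NOUN")
    return counts.most_common(n)

# (lemma, pos) pairs for the first ten tokens of the first review.
sample = [
    ("I", "PRON"), ("feel", "VERB"), ("so", "ADV"), ("LUCKY", "NOUN"),
    ("to", "PART"), ("have", "AUX"), ("find", "VERB"), ("this", "DET"),
    ("use", "VERB"), ("(", "PUNCT"),
]
print(top_noun_lemmas(sample))  # [('lucky', 1)]
```

Lower-casing the lemma before counting merges variants such as `LUCKY` and `lucky`, which matters once many reviews are combined.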

### Let's repeat this experiment on a larger set of reviews

In [18]:
nouns = []
for review in samsung_reviews.split("\n")[0:100]:
  doc = nlp(review)
  for tok in doc:
    if tok.pos_ == 'NOUN':
      nouns.append(tok.lemma_.lower())

### Let's add a way to track progress and elapsed time

In [22]:
from tqdm import tqdm
nouns = []
for review in tqdm(samsung_reviews.split("\n")[0:1000]):
  doc = nlp(review)
  for tok in doc:
    if tok.pos_ == 'NOUN':
      nouns.append(tok.lemma_.lower())
pd.Series(nouns).value_counts().head(5)

100%|██████████| 1000/1000 [00:15<00:00, 65.83it/s]


| lemma | count |
|---|---|
| phone | 1216 |
| time | 90 |
| battery | 90 |
| screen | 87 |
| price | 87 |


In [21]:
len(samsung_reviews.split("\n"))

46355

### Did you notice anything? How long do you think it will take to process all the reviews?

In [23]:
# 1000 reviews took about 15 s; scale up to all 46355 reviews (result in seconds)
(46355//1000)*15

690

In [24]:
# Convert seconds to whole minutes
690//60

11
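The integer division above drops the last 355 reviews and the leftover seconds, so the figure is slightly low. A minimal sketch of the same estimate with ordinary division follows. It assumes processing time scales linearly with the number of reviews, and the helper name `estimate_seconds` is hypothetical.

```python
def estimate_seconds(total_items, sample_items, sample_seconds):
    """Linearly extrapolate processing time from a timed sample."""
    return total_items * sample_seconds / sample_items

total_reviews = 46355  # len(samsung_reviews.split("\n"))
est = estimate_seconds(total_reviews, sample_items=1000, sample_seconds=15)
minutes, seconds = divmod(round(est), 60)
print(f"about {minutes} min {seconds} s")  # about 11 min 35 s
```

Either way the answer is in the region of 11 to 12 minutes for a single pass over the dataset.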

## Summary
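- The POS-tag-based rule (nouns as candidate product features) seems to work well: the most frequent nouns in the first 1000 reviews were phone, time, battery, screen and price
- At about 15 seconds per 1000 reviews, processing all 46355 reviews would take roughly 11 minutes, so we need a way to reduce the processing time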
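One possible direction, sketched here under assumptions rather than measured: spaCy's `nlp.pipe` processes texts in batches, and its `disable` argument skips pipeline components that are not needed. The sketch assumes only POS tags and lemmas are required, so the `parser` and `ner` components are skipped. The function name `count_nouns` and the `batch_size` value are hypothetical, and no speed-up has been verified on this dataset.

```python
from collections import Counter

def count_nouns(nlp, reviews, batch_size=256):
    """Count lower-cased noun lemmas, streaming reviews through nlp.pipe."""
    counts = Counter()
    # Assumption: only POS tags and lemmas are needed, so the dependency
    # parser and named-entity recogniser are skipped for this pass.
    docs = nlp.pipe(reviews, batch_size=batch_size, disable=["parser", "ner"])
    for doc in docs:
        counts.update(tok.lemma_.lower() for tok in doc if tok.pos_ == "NOUN")
    return counts

# Usage with the objects defined earlier in this notebook:
# counts = count_nouns(nlp, samsung_reviews.split("\n")[0:1000])
# counts.most_common(5)
```

Timing this on the same 1000 reviews with `tqdm` would show whether it actually helps.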