# Social Media Analytics
## Webscraping Project
### Introduction to Text Mining
### Named Entity Recognition

Felix Funes 20220306 | Paula Catalan 20221048 | Efstathia Styliagkatzi 20220078 | Alisson Tapia 20221156 | S M Abrar Hossain Asif 20220223

Some of the most important reasons for applying Entity Recognition to our data are:

Information Extraction: Entity Recognition helps extract important information from unstructured text by identifying and classifying entities such as names, locations, organizations, dates, and more. This can be valuable for generating insights from large amounts of text data.

Text Understanding: Entity Recognition enhances text understanding by identifying and labeling entities within the text. This can help in better comprehending the context, relationships, and key elements mentioned in the text.

Document Organization and Summarization: Entity Recognition can aid in organizing and summarizing documents by identifying and categorizing entities. It can help in creating structured databases, indexing documents, and generating summaries that capture the most important entities and their relationships.

Information Retrieval and Search: Entity Recognition enables more accurate and relevant information retrieval by recognizing entities within a text and allowing users to search by specific entities or entity types.

Data Cleaning and Standardization: Entity Recognition assists in data cleaning and standardization by identifying and normalizing entity mentions.




In [1]:
# Import packages
import csv
import pandas as pd
import numpy as np
import nltk 
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from rake_nltk import Rake
from sklearn.feature_extraction.text import CountVectorizer
from bs4 import BeautifulSoup
import spacy
from spacy import displacy
from collections import Counter
!pip install plotly
from plotly.offline import iplot
import plotly.graph_objs as go
import plotly.express as px

[nltk_data] Downloading package stopwords to C:\Users\Paula
[nltk_data]     Muñoz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Paula
[nltk_data]     Muñoz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Defaulting to user installation because normal site-packages is not writeable


In [2]:
# Load dataset
# Intended column dtypes (kept for reference; not passed to read_excel below)
dtypes = {'device':'category','user':'category','rating':'int64','ownership_length':'category'}
ds = pd.read_excel("ExtractedReviewsDataCollection_bestbuy.xlsx", sheet_name="Sheet1", index_col = 0)

In [3]:
# Preview the first rows of the dataset
print(ds.head())

                                         device          user  rating  \
0  Apple - iPhone 14 128GB - Midnight (Verizon)          BigG       5   
1  Apple - iPhone 14 128GB - Midnight (Verizon)       Jp44087       5   
2  Apple - iPhone 14 128GB - Midnight (Verizon)  GamerDadLife       5   
3  Apple - iPhone 14 128GB - Midnight (Verizon)       LevanaP       5   
4  Apple - iPhone 14 128GB - Midnight (Verizon)     Anonymous       5   

                                                text       date  \
0  Apple makes the best cellphone on the market h... 2023-02-03   
1  Ease of use, good battery life, 128gb fits me ... 2023-02-03   
2  Love it works great and the red color is the m... 2022-12-24   
3  Been a long time iPhone user. This is a awesom... 2023-04-14   
4  My wife dropped her phone right AFTER the Appl... 2023-04-15   

   ownership_length  
0  less than 1 week  
1           3 weeks  
2           2 weeks  
3            1 week  
4           3 weeks  


In [4]:
# Defining Key
ds.reset_index(inplace=True)
ds.rename(columns={'index': 'key'}, inplace=True)
print(ds)

     key                                        device          user  rating  \
0      0  Apple - iPhone 14 128GB - Midnight (Verizon)          BigG       5   
1      1  Apple - iPhone 14 128GB - Midnight (Verizon)       Jp44087       5   
2      2  Apple - iPhone 14 128GB - Midnight (Verizon)  GamerDadLife       5   
3      3  Apple - iPhone 14 128GB - Midnight (Verizon)       LevanaP       5   
4      4  Apple - iPhone 14 128GB - Midnight (Verizon)     Anonymous       5   
..   ...                                           ...           ...     ...   
369  369   Apple - iPhone 14 128GB - Purple (T-Mobile)         Heart       3   
370  370   Apple - iPhone 14 128GB - Purple (T-Mobile)      CharlesK       5   
371  371   Apple - iPhone 14 128GB - Purple (T-Mobile)     Darklight       5   
372  372   Apple - iPhone 14 128GB - Purple (T-Mobile)    user482290       1   
373  373   Apple - iPhone 14 128GB - Purple (T-Mobile)    user849170       1   

                                       

## Functions

In [5]:
# Text preprocessing
def textPreProcess(rawText, removeHTML=True, charsToRemove = r'\?|\.|\!|\;|\"|\,|\(|\)|\&|\:|\-', removeNumbers=True, removeLineBreaks=False, specialCharsToRemove = r'[^\x00-\xfd]', convertToLower=True, removeConsecutiveSpaces=True):
    # Return non-string values (e.g. NaN) unchanged
    if not isinstance(rawText, str):
        return rawText
    procText = rawText
        
    # Remove HTML
    if removeHTML:
        procText = BeautifulSoup(procText,'html.parser').get_text()

    # Remove punctuation and other special characters
    if len(charsToRemove)>0:
        procText = re.sub(charsToRemove,' ',procText)

    # Remove numbers
    if removeNumbers:
        procText = re.sub(r'\d+',' ',procText)

    # Remove line breaks
    if removeLineBreaks:
        procText = procText.replace('\n',' ').replace('\r', '')

    # Remove special characters
    if len(specialCharsToRemove)>0:
        procText = re.sub(specialCharsToRemove,' ',procText)

    # Normalize to lower case
    if convertToLower:
        procText = procText.lower() 

    # Replace multiple consecutive spaces with just one space
    if removeConsecutiveSpaces:
        procText = re.sub(' +', ' ', procText)

    return procText

## Analysis

In [6]:
# Create a dataframe with only the preprocessed review text (punctuation and numbers are kept for entity recognition)
processedReviews = pd.DataFrame(data=ds.text.apply(textPreProcess,charsToRemove ='', removeLineBreaks=False, removeNumbers=False).values, index=ds.index, columns=['PreProcessedText'])


The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
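
The warning above comes from Beautiful Soup and tends to appear when a review is plain text with no tags in it. A minimal sketch of one way to avoid it, assuming it is acceptable to skip HTML parsing for such reviews, is to check for something tag-like first. The helper name `looks_like_markup` and its regular expression are illustrative assumptions and are not part of the original notebook; the second sample string is made up for the demonstration.

```python
import re

# Hypothetical helper (not in the original notebook): a rough test for HTML-like tags.
_TAG_RE = re.compile(r"<[A-Za-z/!][^>]*>")

def looks_like_markup(text):
    """Return True if the value is a string containing something that resembles an HTML tag."""
    return isinstance(text, str) and bool(_TAG_RE.search(text))

samples = [
    "iphone 14 was easy to setup. it runs smoothly and has great battery life.",
    "great phone<br/>battery lasts all day",  # made-up example containing a tag
]
for s in samples:
    print(looks_like_markup(s), "|", s)
```

Inside `textPreProcess`, the HTML step could then be guarded with `if removeHTML and looks_like_markup(procText):`, so that Beautiful Soup is only called when there is markup to strip.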



In [7]:
# Remove rows with empty text
processedReviews.PreProcessedText = processedReviews.PreProcessedText.str.strip()
processedReviews = processedReviews[processedReviews.PreProcessedText != '']

In [8]:
# Load Spacy English model
nlp = spacy.load("en_core_web_sm")

In [10]:
# Check entities in review 
print(processedReviews.at[5, 'PreProcessedText'])
doc = nlp(processedReviews.at[5, 'PreProcessedText'])
print([(X.text, X.label_) for X in doc.ents])

the perfect iphone! this thing is amazing for anyone in the family! kids or grandma! or anyone else! super powerful, you can run a business off of it!
[]


In [12]:
# Check entities in review 
print(processedReviews.at[305, 'PreProcessedText'])
doc = nlp(processedReviews.at[305, 'PreProcessedText'])
print([(X.text, X.label_) for X in doc.ents])

not much different than the iphone11. great service from the staff at the best buy phone booths.
[('iphone11', 'GPE')]


In [14]:
# Check entities in review 
print(processedReviews.at[350, 'PreProcessedText'])
doc = nlp(processedReviews.at[350, 'PreProcessedText'])
print([(X.text, X.label_) for X in doc.ents])

iphone 14 was easy to setup. it runs smoothly and has great battery life.
[('14', 'CARDINAL')]
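
In the two outputs above, the small model tags `iphone11` as a GPE and reduces `iphone 14` to the CARDINAL `14`, so the product itself is never recognized. One possible workaround, sketched here under the assumption that product mentions follow the pattern "iphone" plus a one- or two-digit model number, is a small rule-based pass with a regular expression. The function name `find_iphone_mentions`, the `PRODUCT` tag, and the canonical naming are illustrative choices, not part of the original analysis.

```python
import re

# Assumption: a mention is "iphone", an optional space, then a 1-2 digit model number.
IPHONE_RE = re.compile(r"\biphone\s?(\d{1,2})\b", flags=re.IGNORECASE)

def find_iphone_mentions(text):
    """Return (matched_text, 'PRODUCT', canonical_name) tuples for iPhone model mentions."""
    return [
        (m.group(0), "PRODUCT", f"iPhone {int(m.group(1))}")
        for m in IPHONE_RE.finditer(text)
    ]

# The two preprocessed reviews shown above (reviews 305 and 350)
reviews = [
    "not much different than the iphone11. great service from the staff at the best buy phone booths.",
    "iphone 14 was easy to setup. it runs smoothly and has great battery life.",
]
for r in reviews:
    print(find_iphone_mentions(r))
```

The canonical name also normalizes the two spellings (`iphone11`, `iphone 14`) to one format. spaCy also provides a rule-based `entity_ruler` pipeline component that might serve the same purpose inside the pipeline; it is not shown here.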


In [15]:
# Count the entity labels in the last review processed
labels = [x.label_ for x in doc.ents]
Counter(labels)

Counter({'CARDINAL': 1})

In [16]:
# Show the 3 most common entity texts
entity_texts = [x.text for x in doc.ents]
Counter(entity_texts).most_common(3)

[('14', 1)]
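
The two counts above only cover the single review held in `doc`. A rough sketch of how the same counting could be extended to several reviews is given below. It works on plain lists of `(text, label)` pairs so that it runs without a loaded model; the helper names `count_labels` and `count_entity_texts` are assumptions, and the input is limited to the three per-review results printed earlier in this notebook.

```python
from collections import Counter

def count_labels(entities_per_review):
    """Count entity labels over a list of per-review [(text, label), ...] lists."""
    return Counter(label for ents in entities_per_review for _, label in ents)

def count_entity_texts(entities_per_review):
    """Count entity texts over a list of per-review [(text, label), ...] lists."""
    return Counter(text for ents in entities_per_review for text, _ in ents)

# Entities printed earlier for reviews 5, 305 and 350
entities_per_review = [
    [],
    [("iphone11", "GPE")],
    [("14", "CARDINAL")],
]

label_counts = count_labels(entities_per_review)
text_counts = count_entity_texts(entities_per_review)
print(label_counts)
print(text_counts.most_common(3))
```

With the spaCy model loaded, the input list could be built as `[[(e.text, e.label_) for e in nlp(t).ents] for t in processedReviews['PreProcessedText']]`.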

In [17]:
# Entities visualization
displacy.render(doc, jupyter=True, style='ent')

In [18]:
# For example, if our objective were to understand what customers say about numbers (such as the model number in "iphone 14"),
# we could look for reviews that contain at least one CARDINAL entity
counter = 0   # stop after a few matches to keep the demonstration fast
annReviews = []
for r in processedReviews['PreProcessedText']:
    doc = nlp(r)
    for i in doc.ents:
        if i.label_ == 'CARDINAL':
            annReviews.append(r)
            counter = counter + 1
            break
    if counter >= 3:    # Stop after the first three reviews have been found
        break

annReviews

['ease of use, good battery life, 128gb fits me just fine',
 'long overdue upgrade from iphone 7 to 14. totally worth it, especially when adding a verizon plan upgrade (that actually reduced our monthly cost) and the 0% financing and best buy incentives.',
 'i waited a month before entering this review-very pleased with the iphone 14 and just as happy with the service. they had all of the data transferred within 15 minutes and i was up and running>']
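
The loop above could also be wrapped in a reusable function. The sketch below is one way to do that, assuming the entity extractor is passed in as a callable so the same code works for any label. The names `reviews_with_label` and `toy_entities` are hypothetical, and `toy_entities` is only a stand-in that tags standalone numbers as CARDINAL; it does not reproduce the spaCy model's behavior.

```python
import re
from itertools import islice

def reviews_with_label(reviews, get_entities, label, limit=3):
    """Return the first `limit` reviews with at least one entity tagged `label`.

    `get_entities` is any callable mapping a text to a list of (text, label) pairs.
    """
    matches = (
        r for r in reviews
        if any(lbl == label for _, lbl in get_entities(r))
    )
    return list(islice(matches, limit))

def toy_entities(text):
    """Stand-in extractor for demonstration only: standalone numbers become CARDINAL."""
    return [(m.group(0), "CARDINAL") for m in re.finditer(r"\b\d+\b", text)]

# Preprocessed reviews 5, 305 and 350 as printed earlier in this notebook
sample = [
    "the perfect iphone! this thing is amazing for anyone in the family! kids or grandma! or anyone else! super powerful, you can run a business off of it!",
    "not much different than the iphone11. great service from the staff at the best buy phone booths.",
    "iphone 14 was easy to setup. it runs smoothly and has great battery life.",
]
result = reviews_with_label(sample, toy_entities, "CARDINAL", limit=3)
print(result)
```

With the model loaded earlier, `get_entities` could instead be `lambda t: [(e.text, e.label_) for e in nlp(t).ents]`, and `label` could be any entity type of interest.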