# Social Media Analytics
## Data Collection

## Webscraping Project - Best Buy



In this notebook, our main objective is to collect data about the iPhone 14 from the website https://www.bestbuy.com/.
We selected 8 different iPhone 14 product pages as sources. Web scraping uses tools to extract data from a website automatically; the extracted data is then saved in a structured spreadsheet format.

The web scraping process involves accessing the HTML source code of a web page, extracting the relevant data from its HTML elements, and storing that data in a structured format for further analysis. In this project we extract the review text, user, rating, review date, and length of ownership.
It is important to take legal and ethical considerations into account when collecting data. Use web scraping tools and techniques responsibly, and comply with the terms of use of the websites you scrape.
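
One common courtesy check before scraping is to read a site's `robots.txt`. This step is not part of the original notebook, and it does not replace reading the terms of use. The sketch below uses the standard library's `urllib.robotparser`; the rules and the `example.com` URLs are made-up placeholders so the example runs offline, and the helper names `build_parser` and `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example runs offline.
# A real check would use set_url(".../robots.txt") followed by read().
SAMPLE_RULES = """\
User-agent: *
Disallow: /checkout/
Allow: /site/
""".splitlines()


def build_parser(rules):
    """Create a parser from an iterable of robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_RULES)
print(is_allowed(parser, "https://example.com/site/reviews/some-product"))
print(is_allowed(parser, "https://example.com/checkout/cart"))
```

A disallowed result would be a signal to skip that page rather than scrape it.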

### Step 1: Load packages and do the initializations

In [5]:
# Load libraries
import numpy as np
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
import time
import re
from datetime import datetime, date, timedelta
import requests
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\madel\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [6]:
# Allow unverified SSL (Secure Sockets Layer) certificates to be accepted
ssl._create_default_https_context = ssl._create_unverified_context
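
The assignment above disables certificate verification for every HTTPS request made through `urllib` in the session. A narrower alternative, sketched here as an assumption rather than as the notebook's approach, is to build a dedicated unverified context and pass it only to the calls that need it. The helper name `make_unverified_context` is hypothetical.

```python
import ssl


def make_unverified_context():
    """Return an SSLContext that skips certificate and hostname checks."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


ctx = make_unverified_context()
# Hypothetical usage: urlopen("https://example.com", context=ctx)
print(ctx.verify_mode == ssl.CERT_NONE)
```

Contexts created elsewhere with `ssl.create_default_context()` keep verifying certificates, so the relaxed setting stays limited to the requests that receive `ctx`.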

In [7]:
# Get Firefox options (configurations)
options = Options()

In [8]:
# Load the list of the pages to read the content
reviews_to_scrape = pd.read_excel("iphone-reviews-to-scrape.xlsx", sheet_name="Sheet1", index_col="ID", engine='openpyxl')


In [9]:
# Create an empty dataframe for the results
iphone_reviews = pd.DataFrame({'device': pd.Series([], dtype='string'),
                             'user': pd.Series([], dtype='string'),
                             'rating': pd.Series([], dtype='float'),
                             'text': pd.Series([], dtype='string')
                             })
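
The empty frame above is later grown one review at a time with `pd.concat`. A possible alternative, shown only as a sketch, is to collect each review as a dictionary in a plain list and build the frame once at the end. The column list follows the fields the notebook extracts; the helper `rows_to_frame` and the sample rows are made-up placeholders, not scraped data.

```python
import pandas as pd

COLUMNS = ["device", "user", "rating", "text", "date", "ownership_length"]


def rows_to_frame(rows):
    """Build the reviews frame in one step from a list of dicts."""
    return pd.DataFrame(rows, columns=COLUMNS)


# Placeholder rows for illustration only
rows = [
    {"device": "device-A", "user": "user1", "rating": 5.0,
     "text": "sample text", "date": "2023-01-01", "ownership_length": "1 week"},
    {"device": "device-A", "user": "user2", "rating": 4.0,
     "text": "another sample", "date": "2023-01-02", "ownership_length": "Unknown"},
]

reviews_df = rows_to_frame(rows)
print(reviews_df.shape)
```

Appending to a list avoids copying the whole frame on every review, which matters more as the number of pages grows.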

### Step 2: Functions to use in the Main Loop

In [10]:
# Open page and read HTML
def openPageReadHTML(url):
    # Create a Firefox profile with location services disabled
    firefox_profile = FirefoxProfile()
    firefox_profile.set_preference('geo.enabled', False)
    firefox_profile.set_preference('geo.provider.network.url', '')
    
    # Set the Firefox profile to be used by Selenium
    options = webdriver.FirefoxOptions()
    options.profile = firefox_profile
    
    # Launch Firefox with the custom profile
    browser = webdriver.Firefox(options=options)
    browser.get(url)
    time.sleep(1) # Wait one second
    

    # If there is a privacy pop-up, click the OK button
    privacy_button = browser.find_elements(By.CLASS_NAME,"us-link")
    if len(privacy_button)>0:
        browser.execute_script("arguments[0].click()", privacy_button[0])
        time.sleep(0.5) # Wait half a second

    
    # Read the content and close the browser
    html_source = browser.page_source
    browser.quit()

    # Transform the HTML into a BeautifulSoup object (explicit parser avoids a warning)
    soupObj = BeautifulSoup(html_source, "html.parser")

    return soupObj


In [11]:
# Process each page
def processPage(soupObj, ID, extractedDF):   

    # Read reviews
    reviews = soupObj.find_all("li", class_="review-item")

    # Loop through each review
    for i in range(0,len(reviews)):

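        # Get Rating (the hidden label is assumed to read like "Rated 5 out of 5 stars",
        # so the character at index 6 is the score)
        rating = reviews[i].select_one("p[class*=visually-hidden]")
        reviewRating = None  # Default, so a missing label never reuses the previous review's rating
        if rating:
            reviewRating = rating.text.strip()[6]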

        # Get User
        user = reviews[i].select_one("div[class*=ugc-author]")
        if user:
            user = user.text.strip()

        # Get Review Text
        reviewText = reviews[i].select_one("div[class=ugc-review-body]")
        if reviewText:
            reviewText = reviewText.text.strip()

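        # Get date (read from the title attribute of the <time> tag, if present)
        date = reviews[i].select_one("div[class*=posted-date-ownership]")
        time_tag = date.find('time') if date else None
        if time_tag and time_tag.has_attr('title'):
            date = pd.to_datetime(time_tag['title']).date()
        else:
            date = None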

        # Get length of ownership
        ownership = reviews[i].select_one("div.posted-date-ownership")
        if ownership:
            ownership_text = ownership.get_text()
            match = re.search(r'Owned for\s*(.+?)\s*(when reviewed)?\.', ownership_text)
            if match:
                ownership_length = match.group(1)
            else:
                ownership_length = "Unknown"
        else:
            ownership_length = "Not specified"
            
        # Extract language of webpage
        #language = soup.html.get('language')

        # Get sentiment of the review
        # Create a SentimentIntensityAnalyzer object
        #sid = SentimentIntensityAnalyzer()
        # Calculate the sentiment scores for the review
        #scores = sid.polarity_scores(reviewText)

        # Determine the overall sentiment based on the compound score
        #if scores['compound'] > 0.05:
            #sentiment = 'Positive'
        #elif scores['compound'] < -0.05:
            #sentiment = 'Negative'
        #else:
            #sentiment = 'Neutral'
            

        # Update extracted reviews dataframe
        tDF = pd.DataFrame({'device': [ID],
                             'user': [user],
                             'rating': [reviewRating],
                             'text': [reviewText],
                             'date': [date],
                             'ownership_length': [ownership_length],                            
                              })
        extractedDF = pd.concat([extractedDF,tDF],ignore_index=True)
        
    # Return the resulting dataframe
    return extractedDF
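
The rating and ownership parsing inside `processPage` can be checked in isolation on plain strings. The sketch below is an assumption-laden variant: `parse_rating` and `parse_ownership` are hypothetical helper names, the rating label format ("Rated 5 out of 5 stars") is assumed from the index-based extraction above, and the ownership pattern is copied from the notebook's regular expression.

```python
import re


def parse_rating(label):
    """Extract the score from a label such as 'Rated 5 out of 5 stars'."""
    match = re.search(r"Rated\s+(\d+(?:\.\d+)?)\s+out of", label or "")
    return float(match.group(1)) if match else None


def parse_ownership(text):
    """Apply the notebook's 'Owned for ...' pattern; 'Unknown' if it does not match."""
    match = re.search(r"Owned for\s*(.+?)\s*(when reviewed)?\.", text or "")
    return match.group(1) if match else "Unknown"


print(parse_rating("Rated 5 out of 5 stars"))
print(parse_ownership("Owned for 3 weeks when reviewed."))
print(parse_ownership("Posted 2 months ago."))
```

Matching the number with a pattern, rather than taking a fixed character position, keeps working if the label gains or loses leading text.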


### Step 3: Main loop

In [12]:
# TEST 16/4
# Loop for all pages
for index, row in reviews_to_scrape.iterrows():

    # Present feedback on which page is being processed
    print("Processing ", index)

    # Reset counter per page
    reviewsExtracted = 0
    page_num = 0

    # Loop over the pre-defined number of review pages for this product
    while page_num < row['PAGINA']:
        # Increment page number for this iteration
        page_num += 1
        
        urlToUse = row['URL']
        # Page 1 uses the base URL; later pages need an explicit page parameter
        if page_num > 1:
            if "page=" in urlToUse:
                urlToUse = urlToUse.split("page=")[0] + f"page={page_num}"
            else:
                urlToUse = f"{urlToUse}&page={page_num}"

        # Open and read the web page content
        print("Url => ", urlToUse)
        soup = openPageReadHTML(urlToUse)
        
        # Process web page
        reviews_ant = len(iphone_reviews)
        iphone_reviews = processPage(soup, index, iphone_reviews)

        # Update counter
        reviewsExtracted = len(iphone_reviews) - reviews_ant

        # Present feedback on the number of extracted reviews
        print("Extracted ",reviewsExtracted,"/", page_num)     
      

        


Processing  Apple - iPhone 14 128GB - Midnight (Verizon)
Url =>  https://www.bestbuy.com/site/reviews/apple-iphone-14-128gb-midnight-verizon/6505109?variant=A&skuId=6505109
Extracted  20 / 1
Url =>  https://www.bestbuy.com/site/reviews/apple-iphone-14-128gb-midnight-verizon/6505109?variant=A&skuId=6505109&page=2
Extracted  20 / 2
Url =>  https://www.bestbuy.com/site/reviews/apple-iphone-14-128gb-midnight-verizon/6505109?variant=A&skuId=6505109&page=3
Extracted  20 / 3
Url =>  https://www.bestbuy.com/site/reviews/apple-iphone-14-128gb-midnight-verizon/6505109?variant=A&skuId=6505109&page=4
Extracted  20 / 4
Url =>  https://www.bestbuy.com/site/reviews/apple-iphone-14-128gb-midnight-verizon/6505109?variant=A&skuId=6505109&page=5
Extracted  20 / 5
Url =>  https://www.bestbuy.com/site/reviews/apple-iphone-14-128gb-midnight-verizon/6505109?variant=A&skuId=6505109&page=6
Extracted  13 / 6
Processing  Apple - iPhone 14 128GB - Midnight (AT&T)
Url =>  https://www.bestbuy.com/site/reviews/apple
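
The page URLs in the log above are produced by string concatenation in the main loop. A possible alternative, sketched with the standard library's `urllib.parse`, sets the `page` query parameter explicitly so an existing `page=` value is replaced rather than duplicated. The helper name `with_page` is hypothetical; the base URL is the first one printed in the log.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page_num):
    """Return url with its 'page' query parameter set to page_num (omitted for page 1)."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous 'page' value
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "page"]
    if page_num > 1:
        query.append(("page", str(page_num)))
    return urlunsplit(parts._replace(query=urlencode(query)))


base = ("https://www.bestbuy.com/site/reviews/"
        "apple-iphone-14-128gb-midnight-verizon/6505109?variant=A&skuId=6505109")
print(with_page(base, 1))
print(with_page(base, 2))
```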

### Step 4: The Final Excel File

In [13]:
iphone_reviews

|  | device | user | rating | text | date | ownership_length |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Apple - iPhone 14 128GB - Midnight (Verizon) | BigG | 5 | Apple makes the best cellphone on the market h... | 2023-02-03 | less than 1 week |
| 1 | Apple - iPhone 14 128GB - Midnight (Verizon) | Jp44087 | 5 | Ease of use, good battery life, 128gb fits me ... | 2023-02-03 | 3 weeks |
| 2 | Apple - iPhone 14 128GB - Midnight (Verizon) | GamerDadLife | 5 | Love it works great and the red color is the m... | 2022-12-24 | 2 weeks |
| 3 | Apple - iPhone 14 128GB - Midnight (Verizon) | LevanaP | 5 | Been a long time iPhone user. This is a awesom... | 2023-04-14 | 1 week |
| 4 | Apple - iPhone 14 128GB - Midnight (Verizon) | Anonymous | 5 | My wife dropped her phone right AFTER the Appl... | 2023-04-15 | 3 weeks |
| ... | ... | ... | ... | ... | ... | ... |
| 369 | Apple - iPhone 14 128GB - Purple (T-Mobile) | Heart | 3 | Value for the $$$. Security a headache. It is ... | 2023-02-24 | 1 week |
| 370 | Apple - iPhone 14 128GB - Purple (T-Mobile) | CharlesK | 5 | My mom got this and she loves this phone the n... | 2023-01-08 | Unknown |
| 371 | Apple - iPhone 14 128GB - Purple (T-Mobile) | Darklight | 5 | I loved it because the camra looks great abd d... | 2022-09-19 | Unknown |
| 372 | Apple - iPhone 14 128GB - Purple (T-Mobile) | user482290 | 1 | I went into the store with my wife and child t... | 2023-02-05 | Unknown |


In [14]:
# Save the extracted reviews data frame to an Excel file
iphone_reviews.to_excel("ExtractedReviewsDataCollection_bestbuy.xlsx")
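
The sentiment step that is commented out in `processPage` could also be applied after collection, once the reviews are saved. The sketch below only covers the labelling rule, using the same compound-score thresholds (0.05 and -0.05) as the commented-out code; the function name `label_sentiment` is hypothetical, and the scores passed in here are illustrative numbers, not results from the scraped reviews.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to 'Positive', 'Negative', or 'Neutral'."""
    if compound > threshold:
        return "Positive"
    if compound < -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical usage with the analyzer imported in Step 1:
#   sid = SentimentIntensityAnalyzer()
#   label = label_sentiment(sid.polarity_scores(review_text)["compound"])
for score in (0.8, -0.4, 0.0):
    print(score, label_sentiment(score))
```

Keeping the rule in its own function makes it possible to test the thresholds without loading the lexicon or a browser.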