# Exploration of Google Local Data (2021)

The main objective of this notebook is to explore the `Google Local Data (2021)` dataset and extract the key statistics needed for subsequent **Exploratory Data Analysis (EDA)**.

**Description**

This dataset contains review information from Google Maps (ratings, text, images, etc.), business metadata (address, geographical info, descriptions, category information, price, open hours, and miscellaneous info), and links (related businesses) for the United States, collected up to September 2021.

## 1. Import Packages

In [1]:
import json
import gzip
import os
import requests
from typing import Iterator, Dict, Any

import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import display, HTML

## 2. Scrape and Parse the Dataset Page


This notebook scrapes the `Google Local Data (2021)` download page at the URL below. The page contains three tables of interest:

1. A summary table with the total number of reviews, users, and businesses in the dataset.
2. A complete review data table, organized by state, with links to each state's reviews and business metadata along with their counts.
3. A subset review data table, also organized by state, with links to each state's 10-core reviews and ratings-only files.

The remaining sections locate these tables, display them, and collect their download links.

**Retrieve and parse the page contents using `requests` and `BeautifulSoup4`:**

In [2]:
url = "https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(response.content, "html.parser")

**Locate all three tables of interest:**

In [3]:
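# True for a <table> whose td cell at `index` contains `text`.
# The length check skips tables with too few cells instead of raising.
def has_cell_text(tag, index, text):
    cells = tag.find_all('td') if tag.name == 'table' else []
    return len(cells) > index and text in cells[index].get_text()

# soup.find returns the first matching table in document order.
summary_table = soup.find(lambda tag: has_cell_text(tag, 0, 'Reviews:'))
complete_data_table = soup.find(lambda tag: has_cell_text(tag, 1, 'reviews'))
subset_data_table = soup.find(lambda tag: has_cell_text(tag, 1, '10-core'))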

## 3. Display Tables of Interest


### A. Summary Table

The `Google Local Data (2021)` dataset contains:
- **666,324,103** reviews
- **113,643,107** users 
- **4,963,111** businesses

In [4]:
display(HTML(summary_table.prettify()))

| Statistic | Count |
|---|---|
| Reviews: | 666324103 |
| Users: | 113643107 |
| Businesses: | 4963111 |


### B. Complete Review Data Table

In [5]:
display(HTML(complete_data_table.prettify()))

| State | Reviews | Metadata |
|---|---|---|
| Alabama | reviews (8,967,499 reviews) | metadata (74,967 businesses) |
| Alaska | reviews (1,051,246 reviews) | metadata (12,774 businesses) |
| Arizona | reviews (18,375,050 reviews) | metadata (108,579 businesses) |
| Arkansas | reviews (5,106,056 reviews) | metadata (47,246 businesses) |
| California | reviews (70,529,977 reviews) | metadata (515,961 businesses) |
| Colorado | reviews (15,681,222 reviews) | metadata (106,829 businesses) |
| Connecticut | reviews (5,181,800 reviews) | metadata (49,200 businesses) |
| Delaware | reviews (1,885,948 reviews) | metadata (14,706 businesses) |
| District of Columbia | reviews (1,894,317 reviews) | metadata (11,060 businesses) |
| Florida | reviews (61,803,524 reviews) | metadata (378,020 businesses) |


### C. Subset Review Data Table

In [6]:
display(HTML(subset_data_table.prettify()))

| State | 10-core | Ratings only |
|---|---|---|
| Alabama | 10-core (5,146,330 reviews) | ratings only (8,967,499 ratings) |
| Alaska | 10-core (521,515 reviews) | ratings only (1,051,246 ratings) |
| Arizona | 10-core (10,764,435 reviews) | ratings only (18,375,050 ratings) |
| Arkansas | 10-core (2,855,468 reviews) | ratings only (5,106,056 ratings) |
| California | 10-core (44,476,890 reviews) | ratings only (70,529,977 ratings) |
| Colorado | 10-core (8,738,271 reviews) | ratings only (15,681,222 ratings) |
| Connecticut | 10-core (2,680,107 reviews) | ratings only (5,181,800 ratings) |
| Delaware | 10-core (905,537 reviews) | ratings only (1,885,948 ratings) |
| District of Columbia | 10-core (564,783 reviews) | ratings only (1,894,317 ratings) |
| Florida | 10-core (35,457,319 reviews) | ratings only (61,803,524 ratings) |
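
The cell text in both tables embeds the per-state counts (for example `reviews  (8,967,499 reviews)`), so the statistics can be pulled out with a small regular expression. The following is a minimal sketch using only the standard library; `parse_count` and `core_share` are hypothetical helper names that do not appear in the original notebook, and the sample strings are copied from the Alabama rows above.

```python
import re

# Matches the first parenthesised number, e.g. "(8,967,499 reviews)".
COUNT_PATTERN = re.compile(r"\(([\d,]+)\s+\w+\)")


def parse_count(cell_text: str) -> int:
    """Return the integer count embedded in a table cell's text."""
    match = COUNT_PATTERN.search(cell_text)
    if match is None:
        raise ValueError(f"No count found in: {cell_text!r}")
    return int(match.group(1).replace(",", ""))


def core_share(core_cell: str, all_cell: str) -> float:
    """Fraction of a state's reviews that are kept in the 10-core subset."""
    return parse_count(core_cell) / parse_count(all_cell)


alabama_all = parse_count("reviews  (8,967,499 reviews)")
alabama_share = core_share(
    "10-core  (5,146,330 reviews)", "reviews  (8,967,499 reviews)"
)
print(alabama_all, round(alabama_share, 3))
```

Applied to every row, the same helper would give a per-state view of how much data the 10-core filter removes.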


## 4. Extract Download Links

In this section, we retrieve the `href` links from the anchor tags in the Complete and Subset tables and store them in separate lists, one per file type.

**Complete Review Data Table:**

In [7]:
# Collect the href of every cell that contains a link, then split by file type.
complete_table_cells = complete_data_table.find_all('td')
href_links = [
    cell.find('a')['href'] for cell in complete_table_cells if cell.find('a')
]
complete_review_links = [link for link in href_links if "review" in link]
complete_meta_links = [link for link in href_links if "meta" in link]

**Subset Review Data Table:**

In [8]:
# Same approach for the subset table: 10-core review files and ratings-only files.
subset_table_cells = subset_data_table.find_all('td')
href_links = [
    cell.find('a')['href'] for cell in subset_table_cells if cell.find('a')
]
subset_review_links = [link for link in href_links if "review" in link]
subset_rating_links = [link for link in href_links if "rating" in link]
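
Substring checks such as `"review" in link` test the whole URL, so they would also match if the word ever appeared in the host or directory part. As a hedged alternative, the sketch below classifies links by filename only. `classify_links` is a hypothetical helper, and the filename patterns (`review-`, `meta-`, `rating-`, and the `_10.json.gz` suffix for 10-core files) are assumptions based on the Alabama filenames in this notebook, not a documented naming scheme.

```python
import posixpath
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlparse


def classify_links(links: List[str]) -> Dict[str, List[str]]:
    """Group download URLs by file type, judged from the filename only."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for link in links:
        filename = posixpath.basename(urlparse(link).path)
        if filename.startswith("meta-"):
            groups["meta"].append(link)
        elif filename.startswith("rating-"):
            groups["rating"].append(link)
        elif filename.startswith("review-") and filename.endswith("_10.json.gz"):
            groups["review_10core"].append(link)  # assumed 10-core suffix
        elif filename.startswith("review-"):
            groups["review"].append(link)
        else:
            groups["other"].append(link)
    return dict(groups)


base = "https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/"
sample_links = [
    base + "review-Alabama.json.gz",
    base + "meta-Alabama.json.gz",
    base + "review-Alabama_10.json.gz",
    base + "rating-Alabama.csv.gz",
]
grouped = classify_links(sample_links)
print({kind: len(urls) for kind, urls in grouped.items()})
```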

**Print sample of download links:**

In [9]:
complete_review_links

['https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-Alabama.json.gz',
 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-Alaska.json.gz',
 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-Arizona.json.gz',
 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-Arkansas.json.gz',
 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-California.json.gz',
 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-Colorado.json.gz',
 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-Connecticut.json.gz',
 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-Delaware.json.gz',
 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-District_of_Columbia.json.gz',
 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-Florida.json.gz',
 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal/review-Georgi

## 5. Download and Explore Data

In this section, we download and inspect the first file of each link type (the Alabama files) to identify the data our project needs.

**Extract initial URLs:**

In [3]:
initial_links = [
    complete_review_links[0], 
    complete_meta_links[0], 
    subset_review_links[0], 
    subset_rating_links[0]
]

**Download the initial files:**

In [11]:
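def download_file(url: str, datapath: str) -> None:
    """Download `url` into the `datapath` directory, keeping its filename."""
    filename = url.split("/")[-1]
    filepath = os.path.join(datapath, filename)
    os.makedirs(datapath, exist_ok=True)  # create the target directory if needed
    # Stream in chunks so large archives are never held fully in memory.
    with requests.get(url, stream=True, timeout=60) as res:
        if res.status_code != 200:
            print(f"Failed to download the file from: {url}")
            return
        with open(filepath, "wb") as file:
            for chunk in res.iter_content(chunk_size=1 << 20):
                file.write(chunk)
    print(f"Downloaded and saved to: {filepath}")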

In [12]:
for url in initial_links:
    download_file(url, "../data/01_raw")

Downloaded and saved to: ../data/01_raw/review-Alabama.json.gz
Downloaded and saved to: ../data/01_raw/meta-Alabama.json.gz
Downloaded and saved to: ../data/01_raw/review-Alabama_10.json.gz
Downloaded and saved to: ../data/01_raw/rating-Alabama.csv.gz
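
The four downloaded files use two different extensions (`.json.gz` and `.csv.gz`), so it can be worth checking the format of each file before choosing a reader. The sketch below looks only at the first line of a gzipped text file. `sniff_format` is a hypothetical helper, and the demonstration files are tiny synthetic stand-ins with made-up fields, not the real data.

```python
import gzip
import json
import os
import tempfile


def sniff_format(path: str) -> str:
    """Guess whether a gzipped text file is JSON Lines, CSV, or unknown."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        first_line = handle.readline().strip()
    try:
        record = json.loads(first_line)
    except json.JSONDecodeError:
        record = None
    # JSON Lines files are expected to hold one JSON object per line.
    if isinstance(record, dict):
        return "jsonl"
    return "csv" if "," in first_line else "unknown"


# Demonstration on two synthetic files (contents are made up).
with tempfile.TemporaryDirectory() as tmpdir:
    jsonl_path = os.path.join(tmpdir, "sample.json.gz")
    csv_path = os.path.join(tmpdir, "sample.csv.gz")
    with gzip.open(jsonl_path, "wt", encoding="utf-8") as handle:
        handle.write('{"example_field": 1}\n')
    with gzip.open(csv_path, "wt", encoding="utf-8") as handle:
        handle.write("col_a,col_b\n1,2\n")
    detected = {
        "sample.json.gz": sniff_format(jsonl_path),
        "sample.csv.gz": sniff_format(csv_path),
    }
print(detected)
```

A check like this is cheap because it decompresses only the start of the archive.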


**Function provided to read the data:**

In [2]:
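def parse(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per line of a gzipped JSON Lines file."""
    # The context manager closes the file once iteration finishes.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.loads(line)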
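
The state files hold millions of records (Alabama alone has 8,967,499 reviews), so reading only the first few records is often enough to learn the schema before committing to a full load. The sketch below wraps a `parse`-style generator with `itertools.islice`. `read_head` is a hypothetical helper, and the sample file is synthetic with a made-up `example_id` field.

```python
import gzip
import itertools
import json
import os
import tempfile
from typing import Any, Dict, Iterator, List


def parse(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield json.loads(line)


def read_head(path: str, n: int = 5) -> List[Dict[str, Any]]:
    """Return at most the first `n` records without reading the whole file."""
    records = parse(path)
    try:
        return list(itertools.islice(records, n))
    finally:
        records.close()  # closes the underlying gzip file promptly


# Demonstration on a small synthetic file (contents are made up).
with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = os.path.join(tmpdir, "sample.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        for i in range(100):
            handle.write(json.dumps({"example_id": i}) + "\n")
    first_records = read_head(sample_path, n=3)
print(first_records)
```

The resulting list can be passed to `pd.DataFrame` to preview column names and types at a fraction of the cost of loading a full state file.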

**Read downloaded data into DataFrame:**

In [20]:
filepaths = [
    os.path.join("../data/01_raw/", link.split("/")[-1]) 
    for link in initial_links
]
# The first three files are gzipped JSON Lines; the fourth (ratings only)
# is a gzipped CSV, which `json.loads` cannot decode, so read it with pandas.
review_dataframe, meta_dataframe, subset_review_dataframe = [
    pd.DataFrame(parse(filepath)) for filepath in filepaths[:3]
]
rating_dataframe = pd.read_csv(filepaths[3])

In [22]:
meta_dataframe = pd.DataFrame(parse(filepaths[1]))

In [None]:
subset_review_dataframe = pd.DataFrame(parse(filepaths[2]))

In [None]:
rating_dataframe = pd.read_csv(filepaths[3])  # gzipped CSV, not JSON Lines