# Data Collection
### Introduction
* We collected data on 85 cafes in the KL and Selangor area using the SERP API playground, which can be found here: https://serpapi.com/playground?engine=google_maps
* The cafe data was collected via 5 queries in the SERP API playground, and the results of each query were saved in a text file.
* This produced 5 text files (Page 1.txt to Page 5.txt) containing JSON-formatted data about 85 cafes in the KL and Selangor area.
* However, these files do not contain review data for the cafes. Instead, they contain surface-level data such as each cafe's name, opening hours, contact details and more.

### Therefore, we complete the data collection process in three steps
1. Extract the Title and Place_ID of each cafe.
2. Use Place_ID and SERP API to query for user review data.
3. Save the review data in a JSON file.

## 1. Extract the Title and Place_ID of each cafe.

In [1]:
import json
import pandas as pd


# Extract the title and place_id of every cafe listed in one results page
def extract_locations(file_path):
    # Open every page with UTF-8 encoding, which also reads plain-ASCII pages correctly
    with open(file_path, encoding="utf-8") as file:
        data = json.load(file)
    local_results = data.get("local_results", [])
    return [(location.get("title"), location.get("place_id")) for location in local_results]

# List to store all locations
all_locations = []

# Iterate through each file, run extract_locations, and append the results to all_locations
for i in range(1, 6):
    file_path = f"Page {i}.txt"
    all_locations.extend(extract_locations(file_path))

# Create DataFrame
df = pd.DataFrame(all_locations, columns=["Name", "Place_ID"])

# Store DataFrame to place_id.csv
df.to_csv("place_id.csv", index=False)

# Display DataFrame containing Cafe Name and Place_ID
df

Unnamed: 0,Name,Place_ID
0,6Yi Cafe,ChIJ41hkTA5LzDERdBV9pntuGl8
1,Strangers at 47,ChIJMes2_lpJzDERwaN-YKyU_gs
2,Loop Cafe,ChIJFV3n3UU1zDERlncUFuHKuv8
3,VCR,ChIJPYhalddJzDERgRWIgOgOTvA
4,Jam and Kaya Café,ChIJ-_0A5sRLzDERowJMalBYuYs
...,...,...
95,Dplace Cafe @ 悦食坊 - Bandar Sunway,ChIJgRGH9DBNzDERCAKgrZsJPxs
96,BaoBao Cafe,ChIJgXnQs_hLzDERhJ__J1AHdz0
97,Keopi & Sul. - Cafe & Bistro,ChIJGZV-yH1JzDERIPWaDmZdi_Q
98,NOTA | Cafe · Restaurant,ChIJ50fhkqSzzTERqgqGpUsPnzk


Note: There are 100 rows, but some of the rows are duplicates, resulting in only 85 unique rows.
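
The duplicate rows mentioned in the note could be dropped before querying, so that each cafe is requested only once. The following is a minimal sketch, assuming duplicates share the same `Place_ID`; the sample rows are hypothetical placeholders, not the collected data.

```python
import pandas as pd

# Hypothetical sample with one repeated Place_ID (placeholder values)
sample = pd.DataFrame(
    [("Cafe A", "id-1"), ("Cafe B", "id-2"), ("Cafe A", "id-1")],
    columns=["Name", "Place_ID"],
)

# Keep the first occurrence of each Place_ID and renumber the rows
unique_cafes = sample.drop_duplicates(subset="Place_ID").reset_index(drop=True)

print(len(sample), "rows ->", len(unique_cafes), "unique cafes")
```

Applied to the real DataFrame, the same call would be `df.drop_duplicates(subset="Place_ID")`.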

## 2. Use Place_ID and SERP API to query for user review data.

In [2]:
import os
from dotenv import load_dotenv
import serpapi

# Load environment variables from the .env file
load_dotenv()

# Read the API key from the environment. The key comes from the SERP API account.
api_key = os.getenv('SERPAPI_KEY')
client = serpapi.Client(api_key=api_key)

# Load the DataFrame containing place names and place IDs
df = pd.read_csv("place_id.csv")  # This is unnecessary if you have not restarted the kernel after step 1.

# List to store all results
all_results = []

# Iterate through each row in the DataFrame and query for more cafe data. Store the resulting cafe data in all_results list
for index, row in df.iterrows():
    place_id = row["Place_ID"]
    result = client.search({
        'engine':'google_maps',
        'type':'search',
        'place_id': place_id
    })
    all_results.append(result)
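
Each iteration of the loop above is a network request, so a single failure would stop the whole run, and repeated place IDs are queried more than once. One possible variation is sketched below. The helper `collect_results` and the stub `fake_search` are hypothetical names introduced for illustration; the search function is passed in as a parameter so the loop can be tried without an API key. In the notebook, `search` would presumably be `client.search`.

```python
def collect_results(place_ids, search):
    """Query each distinct place_id once; return (results, failed_ids)."""
    results, failed = {}, []
    # dict.fromkeys removes repeated IDs while preserving their order
    for place_id in dict.fromkeys(place_ids):
        try:
            # Same query parameters as the loop in the notebook
            results[place_id] = search({
                "engine": "google_maps",
                "type": "search",
                "place_id": place_id,
            })
        except Exception:
            # Broad catch for the sketch only: record the ID and carry on
            failed.append(place_id)
    return results, failed


def fake_search(params):
    """Stand-in for the real client; fails for one hypothetical ID."""
    if params["place_id"] == "bad-id":
        raise RuntimeError("simulated request failure")
    return {"place_results": {"user_reviews": {}}}


results, failed = collect_results(["id-1", "bad-id", "id-1"], fake_search)
print(list(results), failed)
```

Failed IDs are returned rather than raised, so they can be retried later without repeating the successful requests.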

In [3]:
# Map each cafe name to its set of user review data
user_reviews = {}
for name, result in zip(df["Name"], all_results):
    user_reviews[name] = result["place_results"]["user_reviews"]
    
user_reviews

{'6Yi Cafe': {'most_relevant': [{'username': 'San San Lee',
    'rating': 4,
    'contributor_id': '100807926156842396383',
    'description': 'Nice quite friendly cafe. Food is like a homely meal. Interesting the drinks come with 2 small coconut cookies. Overall meals are ok, will be better if the homemade luncheon meat egg rice has more sources. Has minimum spend per person. 3 residents cats, only see 2, clean & no odour.',
    'link': 'https://www.google.com/maps/reviews/data=!4m8!14m7!1m6!2m5!1sChdDSUhNMG9nS0VJQ0FnSUMxdnJINmt3RRAB!2m1!1s0x0:0x5f1a6e7ba67d1574!3m1!1s2@1:CIHM0ogKEICAgIC1vrH6kwE%7CCgwIrbXBrAYQwPa9lAM%7C?hl=en-US',
    'images': [{'thumbnail': 'https://lh5.googleusercontent.com/p/AF1QipParrXjPOTuED8fucyIbW5LYwrOfB6cafPmVztd=w150-h150-k-no-p'},
     {'thumbnail': 'https://lh5.googleusercontent.com/p/AF1QipPIXjK6jyQjfX11Z9X-6kv3LUNd6Rh6Ou11d20R=w150-h150-k-no-p'},
     {'thumbnail': 'https://lh5.googleusercontent.com/p/AF1QipP4TRGDOmzb90GnDLBxPFDqdobOFWSUsUbJApRz=w150-h1
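
Indexing with `["place_results"]["user_reviews"]` raises a `KeyError` if a response lacks either key. A more defensive version is sketched below, assuming that an empty dictionary is an acceptable placeholder for a cafe without reviews. The helper `extract_user_reviews` and the sample responses are hypothetical.

```python
def extract_user_reviews(names, results):
    """Map each cafe name to its user_reviews dict, or {} when it is missing."""
    reviews = {}
    for name, result in zip(names, results):
        place = result.get("place_results") or {}
        reviews[name] = place.get("user_reviews") or {}
    return reviews


# Hypothetical responses: one complete, one without place_results
names = ["Cafe A", "Cafe B"]
responses = [
    {"place_results": {"user_reviews": {"most_relevant": [{"rating": 4}]}}},
    {"error": "no results"},
]

reviews = extract_user_reviews(names, responses)
print(reviews["Cafe B"])
```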

## 3. Save the review data in a JSON file.

In [4]:
# Save results to a JSON file
with open("results.json", "w") as json_file:
    json.dump(user_reviews, json_file)

print("Results saved to results.json")

Results saved to results.json
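
Because some cafe names contain non-ASCII characters, it may be worth writing the file with an explicit encoding and checking that it loads back unchanged. This is a sketch only: the sample dictionary is a hypothetical placeholder, and the file is written to a temporary directory rather than to `results.json`.

```python
import json
import os
import tempfile

# Hypothetical placeholder data with a non-ASCII cafe name
sample_reviews = {"Example Café": {"most_relevant": [{"rating": 4}]}}

path = os.path.join(tempfile.mkdtemp(), "results_sample.json")

# ensure_ascii=False keeps characters such as "é" readable in the file
with open(path, "w", encoding="utf-8") as json_file:
    json.dump(sample_reviews, json_file, ensure_ascii=False, indent=2)

# Read the file back and confirm nothing changed in the round trip
with open(path, encoding="utf-8") as json_file:
    reloaded = json.load(json_file)

print(reloaded == sample_reviews)
```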
