The main part of the data collection has been completed (for Instagram posts)

What we have currently:

1. IG_posts.json -> a collection of high-quality Instagram posts scraped using the keywords "NYC foodie", "NYC restaurants", and "New York Restaurants"
    - Many of these restaurants match by name those we collected from Yelp; this was checked manually
    - We cannot guarantee the follower count of the influencers (Instaloader was breaking), but many of them met the > 10,000 threshold on our initial attempt, before Instaloader was rate limited

2. Merged_restaurants_final.csv -> contains a curated list of restaurants with their Yelp ratings and reviews

3. We also collected some reviews from users on Google Places/Maps
    - This is why we needed "restaurant_names.txt"
    - filtered_reviews.json keeps only reviews whose restaurant name matched an existing name from the Yelp data (same as for Instagram), and only when the name returned by the Google Maps scrape was unambiguous

Next step: perform sentiment analysis on all captions in IG_posts.json

Guidelines:
1. If the same reviewer has posted about the same restaurant multiple times, only count the post with the most positive sentiment.

2. Clear up minor inconsistencies in restaurant names (e.g. "11 Parishes" vs "11 Parishes Restaurant")
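
Guideline 1 can be sketched as a small reduction over already-scored posts. This is only an illustration: the `influencerUsername` and `restaurantName` fields come from our data, but the `score` field, the sample records, and the helper name `keep_most_positive` are assumptions made up for the example.

```python
def keep_most_positive(posts):
    """Keep one post per (reviewer, restaurant) pair: the highest-scoring one."""
    best = {}
    for post in posts:
        key = (post["influencerUsername"], post["restaurantName"])
        # Replace the stored post only if this one scores strictly higher.
        if key not in best or post["score"] > best[key]["score"]:
            best[key] = post
    return list(best.values())


# Hypothetical sample: reviewer "a" posted twice about restaurant "X".
sample = [
    {"influencerUsername": "a", "restaurantName": "X", "score": 3.1},
    {"influencerUsername": "a", "restaurantName": "X", "score": 4.2},
    {"influencerUsername": "b", "restaurantName": "X", "score": 2.0},
]
kept = keep_most_positive(sample)
print(len(kept))
```

Ties keep the earlier post, since the comparison is strict.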

In [17]:

import json
import os
import difflib # for string matching
import unicodedata # for Unicode normalization

In [3]:
# update ratingAvailable field to be default false, if nothing was specified manually

with open("../outputs/IG_posts.json", "r", encoding="utf-8") as f:
    posts = json.load(f)
    
for post in posts:
    if 'ratingAvailable' not in post:
        post['ratingAvailable'] = False

with open("../outputs/IG_posts.json", "w", encoding="utf-8") as f:
    json.dump(posts, f, ensure_ascii=False, indent=4)
    
print("Updated posts saved.")


Updated posts saved.


In [15]:
# Clean up capitalization and punctuation

def custom_title(s):
    # .title() is failing to capitalize some words, so we'll do it manually
    words = s.split()
    new_words = []
    for word in words:
        if word:
            # Capitalize only the first letter and lower-case the rest.
            new_words.append(word[0].upper() + word[1:].lower())
        else:
            new_words.append(word)
    return ' '.join(new_words)

with open("../outputs/IG_posts.json", "r", encoding="utf-8") as f:
    posts = json.load(f)

for post in posts:
    if 'restaurantName' in post:
        original_name = post['restaurantName']
        # Step 1: Remove extra whitespace.
        cleaned = original_name.strip()
        # Step 2: Normalize punctuation.
        # Replace common curly apostrophes and quotation marks with standard ones.
        cleaned = cleaned.replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')
        # Step 3: Use Unicode normalization to further convert any fancy punctuation to their ASCII counterparts.
        cleaned = unicodedata.normalize('NFKD', cleaned)
        # Encode to ASCII (ignoring characters that can't be encoded) and decode back to string.
        cleaned = cleaned.encode('ascii', errors='ignore').decode('ascii')

        # Step 4: Use my custom defined function to capitalize each word.
        cleaned = custom_title(cleaned)

        post['restaurantName'] = cleaned
        print(f"Updated restaurantName: '{original_name}' -> '{cleaned}'")
            
with open("../outputs/IG_posts.json", "w", encoding="utf-8") as f:
    json.dump(posts, f, ensure_ascii=False, indent=4)

Updated restaurantName: 'Mei Lai Wah' -> 'Mei Lai Wah'
Updated restaurantName: 'Color World' -> 'Color World'
Updated restaurantName: 'Chirp Peruvian Kitchen' -> 'Chirp Peruvian Kitchen'
Updated restaurantName: 'Curbside Mexican Grill' -> 'Curbside Mexican Grill'
Updated restaurantName: 'Pizza Loves Sauce' -> 'Pizza Loves Sauce'
Updated restaurantName: 'Pecking House' -> 'Pecking House'
Updated restaurantName: 'Alnour' -> 'Alnour'
Updated restaurantName: 'Fevy's Chicken' -> 'Fevy's Chicken'
Updated restaurantName: 'Margon Restaurant' -> 'Margon Restaurant'
Updated restaurantName: 'Ibn Hamido Seafood' -> 'Ibn Hamido Seafood'
Updated restaurantName: 'Bangklyn East Harlem' -> 'Bangklyn East Harlem'
Updated restaurantName: 'Kanoyama' -> 'Kanoyama'
Updated restaurantName: 'Osteria Brooklyn' -> 'Osteria Brooklyn'
Updated restaurantName: 'Momokawa' -> 'Momokawa'
Updated restaurantName: 'Atlantic Grill' -> 'Atlantic Grill'
Updated restaurantName: 'Thai Market' -> 'Thai Market'
Updated restaura
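
To make the effect of the cleaning steps concrete, here is a small self-contained sketch of the same pipeline applied to two made-up names. The helper name `normalize_name` and both inputs are hypothetical and not taken from IG_posts.json. It also shows a side effect worth keeping in mind: characters with no ASCII decomposition are dropped, not transliterated.

```python
import unicodedata


def normalize_name(name):
    """Strip, straighten quotes, fold to ASCII via NFKD, then capitalize each word."""
    cleaned = name.strip()
    # Curly quotes written as escapes: U+2019, U+2018, U+201C, U+201D.
    cleaned = cleaned.replace("\u2019", "'").replace("\u2018", "'")
    cleaned = cleaned.replace("\u201c", '"').replace("\u201d", '"')
    # NFKD splits "é" into "e" plus a combining accent; the ASCII step drops the accent.
    cleaned = unicodedata.normalize("NFKD", cleaned)
    cleaned = cleaned.encode("ascii", errors="ignore").decode("ascii")
    return " ".join(w[0].upper() + w[1:].lower() for w in cleaned.split())


print(normalize_name("  caf\u00e9  LOLA\u2019s "))  # Cafe Lola's
print(normalize_name("S\u00f8strene"))              # Sstrene ("ø" has no decomposition)
```

If names with such characters matter, a transliteration table would have to be applied before the ASCII step.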

In [16]:
# Detect minor spelling inconsistencies in restaurant names
# Example: "restaurantName": "11 Parishes" vs "11 Parishes Restaurant"

# Uses difflib.SequenceMatcher with a high similarity threshold (0.9).
# After testing several thresholds, all the true positives appear to score above 0.9.

with open("../outputs/IG_posts.json", "r", encoding="utf-8") as f:
    posts = json.load(f)
    
# collect unique restaurant names
restaurant_names = set()

for post in posts:
    name = post.get("restaurantName")
    if name:
        restaurant_names.add(name.strip())
        
restaurant_names = list(restaurant_names)
n = len(restaurant_names)

print(f"Currently, there are {n} unique restaurant names.")

# compare all pairs using the built-in SequenceMatcher

similar = False

for i in range(n):
    for j in range(i + 1, n):
        name1 = restaurant_names[i]
        name2 = restaurant_names[j]
        
        similarity = difflib.SequenceMatcher(None, name1.lower(), name2.lower()).ratio()
        
        if similarity > 0.9:
            print(f"Similarity: {similarity:.2f} - {name1} vs {name2}")
            similar = True
            
if not similar:
    print("No similar restaurant names found.")

Currently, there are 2117 unique restaurant names.
No similar restaurant names found.
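
One caveat about the threshold: `SequenceMatcher.ratio()` is 2*M/T (matched characters over total characters), so a pair that differs by a whole trailing word scores well below 0.9. The sketch below checks this on the "11 Parishes" example and shows one possible workaround: stripping a generic trailing word before comparing. The `GENERIC_SUFFIXES` list and the `strip_suffix` helper are assumptions for illustration, not something the cell above does.

```python
import difflib

# Hypothetical list of generic trailing words to ignore when comparing names.
GENERIC_SUFFIXES = ("restaurant",)


def ratio(a, b):
    """Case-insensitive SequenceMatcher similarity, as in the cell above."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def strip_suffix(name):
    """Drop generic trailing words, but never reduce a name to nothing."""
    words = name.split()
    while len(words) > 1 and words[-1].lower() in GENERIC_SUFFIXES:
        words.pop()
    return " ".join(words)


raw = ratio("11 Parishes", "11 Parishes Restaurant")
stripped = ratio(strip_suffix("11 Parishes"), strip_suffix("11 Parishes Restaurant"))
print(f"raw={raw:.2f} stripped={stripped:.2f}")  # raw=0.67 stripped=1.00
```

So "No similar restaurant names found" at 0.9 rules out near-identical spellings, but not pairs where one name carries an extra word.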


Perform sentiment analysis: the critical part of this notebook and of the project

We use a custom wrapper class that exposes Gemini as an LLM that LangChain can work with

In [101]:
# imports for langchain

from dotenv import load_dotenv
import os
import re
from langchain.llms.base import LLM  # base class
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
import concurrent.futures

# use Google Gen AI Library (as before for restaurant name extraction)
import google.generativeai as gemini

load_dotenv()

gemini.configure(api_key=os.getenv("GOOGLE_API_KEY"))

model = gemini.GenerativeModel('gemini-2.0-flash')

In [102]:
# LLM wrapper with Gemini

class GeminiWrapper(LLM):
    @property
    def _llm_type(self) -> str:
        return "gemini"
    
    def _call(self, prompt: str, stop=None, run_manager=None, **kwargs) -> str:
        # call gemini with the prompt (stop sequences are accepted but not forwarded)
        response = model.generate_content(prompt)
        
        # per API usage: result field is .text
        return response.text.strip()

In [103]:
# instantiate one instance of the LLM

llm = GeminiWrapper()

# define a prompt to score each caption
# use metrics of: taste, presentation, creativity
# on a scale from 0 to 5 (the range stated in the prompt text)
# overall score = 0.4 * taste + 0.3 * presentation + 0.3 * creativity

prompt_template = """
You are an AI that evaluates restaurant captions purely on numerical scales.
For the following caption, no matter what its content is, please compute numerical scores (floating point numbers between 0 and 5) for:
- taste_score: How appealing and appetizing the food description is.
- presentation_score: How well the caption conveys the restaurant's presentation or brand.
- creativity_score: How creative and original the caption is.

Then, compute an overall_score using the following weighted average:
    overall_score = 0.4 * taste_score + 0.3 * presentation_score + 0.3 * creativity_score

Always output a JSON object with the following keys:
- "taste_score": a numerical score,
- "presentation_score": a numerical score,
- "creativity_score": a numerical score,
- "overall_score": a numerical score (must be between 0 and 5).

Return only a single line valid JSON with these keys and nothing else.
Caption:
{caption}
"""

prompt = PromptTemplate(input_variables=["caption"], template=prompt_template)
chain = LLMChain(llm=llm, prompt=prompt)
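
Because the prompt asks the model to do the weighted-average arithmetic itself, it may be worth recomputing `overall_score` locally and comparing. The sketch below uses the same weights as the prompt; the helper names, the tolerance value, and the example scores are assumptions for illustration.

```python
# Weights copied from the prompt: 0.4 taste, 0.3 presentation, 0.3 creativity.
WEIGHTS = {"taste_score": 0.4, "presentation_score": 0.3, "creativity_score": 0.3}


def recompute_overall(scores):
    """Recompute the weighted average from the three component scores."""
    return sum(weight * float(scores[key]) for key, weight in WEIGHTS.items())


def overall_is_consistent(scores, tol=0.05):
    """True if the model's overall_score is within `tol` of the recomputed value."""
    return abs(recompute_overall(scores) - float(scores["overall_score"])) <= tol


# Hypothetical model output: 0.4*4.0 + 0.3*3.0 + 0.3*2.0 = 3.1
example = {
    "taste_score": 4.0,
    "presentation_score": 3.0,
    "creativity_score": 2.0,
    "overall_score": 3.1,
}
print(overall_is_consistent(example))
```

Overwriting the model's value with the recomputed one would be another option; checking first at least tells us how often the model's arithmetic drifts.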

In [104]:
# analyze each caption
# use the LLM Chain, return a dictionary with sentiment scores if possible

def analyze_caption(caption, timeout=45): # to be called by main driver
    # run the LLM chain in a worker thread so we can stop waiting after `timeout` seconds
    # chose an arbitrary max time of 45 seconds per caption
    # no `with` block here: leaving it calls shutdown(wait=True), which would block
    # until the stuck call returns and defeat the timeout
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(chain.run, caption=caption)
    try:
        result = future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        print(f"Timeout processing caption: {caption[:50]}...")
        return None
    finally:
        # do not wait for a hung worker (the thread itself cannot be killed)
        executor.shutdown(wait=False)

    # Remove Markdown code fences if present.
    result = result.strip()
    if result.startswith("```"):
        lines = result.splitlines()
        if lines[0].strip().startswith("```"):
            lines = lines[1:]
        if lines and lines[-1].strip().startswith("```"):
            lines = lines[:-1]
        result = "\n".join(lines)
    
    # Attempt to parse JSON.
    try:
        return json.loads(result)
    except json.JSONDecodeError:
        # Apply regex to fix keys without quotes.
        fixed_result = re.sub(r'([{,]\s*)([A-Za-z0-9_]+)\s*:', r'\1"\2":', result)
        try:
            return json.loads(fixed_result)
        except json.JSONDecodeError as e2:
            print(f"Error processing caption:\n{caption}\nError: {e2}\nLLM result: {result}")
            return None
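
An alternative to stripping code fences line by line is to scan the reply for the first decodable JSON object, which also copes with stray text before or after it. This is a sketch using only the standard library; the helper name `extract_json_object` and the sample reply are made up for illustration.

```python
import json


def extract_json_object(text):
    """Return the first JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        # Try the next opening brace, if any.
        start = text.find("{", start + 1)
    return None


# Hypothetical model reply wrapped in a Markdown fence.
reply = '```json\n{"taste_score": 4.0, "overall_score": 3.5}\n```'
print(extract_json_object(reply))
```

It does not repair unquoted keys, so the regex fallback above would still be needed for that case.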


In [105]:
# process each caption individually

with open("../outputs/IG_posts.json", "r", encoding="utf-8") as f:
    posts = json.load(f)
    

grouped_posts = {} # keyed as follows (influencerUsername, restaurantName)

for post in posts:
    caption = post.get("caption", "")
    sentiment_data = analyze_caption(caption, timeout=45)
    if sentiment_data is None:
        print("Failed to analyze caption for post:")
        print(json.dumps(post, indent=2))
        continue
    
    post["sentiment_analysis"] = sentiment_data
    
    key = (post.get("influencerUsername"), post.get("restaurantName"))
    current_score = sentiment_data.get("overall_score", -1)
    
    # For duplicate reviews by the same influencer for the same restaurant,
    # keep only the post with the highest overall_score.
    if key in grouped_posts:
        existing_entry = grouped_posts[key]
        existing_score = existing_entry["sentiment_analysis"].get("overall_score", -1)
        if current_score > existing_score:
            grouped_posts[key] = post
    else:
        grouped_posts[key] = post

    # save immediately after every post, so progress survives an interruption
    final_posts = list(grouped_posts.values()) # posts kept so far, one per (influencer, restaurant)

    with open("../outputs/IG_posts_with_sentiment.json", "w", encoding="utf-8") as f:
        json.dump(final_posts, f, ensure_ascii=False, indent=4)
    print(f"\nProcessed {len(final_posts)} posts to the output file ../outputs/IG_posts_with_sentiment.json")


# everything dumped, confirm with print statement
print("All posts processed and saved to ../outputs/IG_posts_with_sentiment.json")


Processed 1 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 2 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 3 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 4 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 5 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 6 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 7 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 8 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 9 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 10 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 11 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 12 posts to the output file ../outputs/IG_posts_with_sentiment.json

Processed 13 posts to the output file ../outputs/IG_posts_wi