# Task F

Choose any 10 beers in your data. Now choose any one of them, and find the most similar beer (among the remaining 9). Explain your method and logic.

In [25]:
import pandas as pd
reviews = pd.read_csv('reviews_final.csv')
reviews.iloc[:5]

Unnamed: 0,beer,brewery,style,style_id,average_user_rating,username,user_rating,delta_from_average,look,smell,taste,feel,overall,date,review_text,brewery_id,beer_id,page_start
0,Caffè Americano,Cigar City Brewing,American Imperial Stout,157,4.46,MadMadMike,4.53,0.07,4.25,4.25,4.75,4.5,4.5,"Jul 29, 2025","In bottle, on tap, at the brewery - anywhere t...",17981,98020,0
1,Caffè Americano,Cigar City Brewing,American Imperial Stout,157,4.46,Rug,4.06,-0.4,4.0,4.25,4.0,4.0,4.0,"Jul 01, 2022",Unknown vintage\n\nSome more BIF heat from the...,17981,98020,0
2,Caffè Americano,Cigar City Brewing,American Imperial Stout,157,4.46,BFCarr,4.43,-0.03,4.25,4.25,4.5,4.5,4.5,"Apr 02, 2021",Pours dark brown with a thin tan head. Aroma c...,17981,98020,0
3,Caffè Americano,Cigar City Brewing,American Imperial Stout,157,4.46,Dfeinman1,4.23,-0.23,4.0,4.75,4.0,4.0,4.25,"Mar 02, 2021",Such a tasty beer. Perfect mouthfeel and carbo...,17981,98020,0
4,Caffè Americano,Cigar City Brewing,American Imperial Stout,157,4.46,Radome,4.54,0.08,4.75,4.5,4.5,4.75,4.5,"Jan 02, 2021",Poured from a bomber bottle into a Duvel glass...,17981,98020,0


In [26]:
!pip install gensim



Based on the results from Tasks B and C, we use a bag-of-words (BoW) representation with TF-IDF weighting and cosine similarity in Task F to find the beer most similar to a chosen target.

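To rank the other nine beers, we follow the approach from Task B, except that the query is built from the target beer's own reviews rather than from three user-defined attributes. Each beer is represented by the mean TF-IDF vector of its reviews, and the target's reviews are concatenated into a single query document that is compared against each of those vectors.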
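To make the mechanics concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. The three beer names and review snippets are made up for illustration, and the helper names (`tfidf_vectors`, `cosine`) are hypothetical. The weighting uses the smoothed IDF formula that scikit-learn documents as its default, but the sketch skips stop-word removal and the other options used in the real cell, so it should be read as an illustration rather than the pipeline itself.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hypothetical snippets, invented for illustration only.
toy_reviews = {
    "Target IPA": "juicy citrus hops with pine and grapefruit",
    "Hazy Pale": "citrus and grapefruit hops soft and juicy",
    "Coffee Stout": "roasted coffee chocolate and vanilla",
}
names = list(toy_reviews)
vecs = dict(zip(names, tfidf_vectors(list(toy_reviews.values()))))

target = "Target IPA"
scores = {n: cosine(vecs[target], vecs[n]) for n in names if n != target}
best_match = max(scores, key=scores.get)
print(best_match, scores)
```

The candidate that shares the most descriptors with the target scores highest, while the stout overlaps only on a filler word and scores close to zero. Removing stop words, as the real cell does, would drop that residual overlap.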

While word embeddings capture broader semantic relationships, they can blur distinctions between attributes that tend to be mentioned together, which leads to less precise matches. BoW rewards exact keyword overlap, so it favors beers whose reviews share the target beer's flavor descriptors. We also weight each cosine score by the beer's mean review sentiment (the VADER compound score), so that among comparably similar beers, the more positively described ones rank higher.
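The sentiment weighting can be written out explicitly. The notation here is ours, not from the notebook: $c_b$ is the cosine similarity between beer $b$ and the target, $\bar{s}_b \in [-1, 1]$ is the mean VADER compound score of that beer's reviews, and $w$ is the `SENTIMENT_WEIGHT` knob. The code in the next cell then computes:

$$
\text{score}_b = c_b \left[ (1 - w) + w \cdot \frac{\bar{s}_b + 1}{2} \right]
$$

With $w = 0.5$, the multiplier in brackets ranges from 0.5 (uniformly negative reviews) to 1.0 (uniformly positive reviews). Sentiment can therefore reorder beers with similar cosine scores, but it cannot move a beer above one whose cosine similarity is more than twice as high.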

In [29]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# --- knobs ---
SENTIMENT_WEIGHT = 0.5       # 0 = rank by cosine only; 1 = scale cosine fully by sentiment
TOP_UNIGRAMS_PER_BEER = 5    # not used in this cell

# -------------------------------------------------
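# 0) Pick 10 beers at random (by name) and keep only their reviews
# Pass random_state=<int> to sample() to make the selection reproducible.
beer_names = reviews['beer'].drop_duplicates().sample(10).tolist()
subset = reviews[reviews['beer'].isin(beer_names)].copy().reset_index(drop=True)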

# -------------------------------------------------
# 1) Normalize text
RAW_COL = "review_text"  # change if your raw text column has a different name

def normalize_text(s):
    s = str(s).lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

subset["review_norm"] = subset[RAW_COL].map(normalize_text)

# -------------------------------------------------
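# 2) Compute sentiment (VADER compound score per review, in [-1, 1])
# Note: review_norm is lowercased with punctuation removed, so VADER's
# capitalization and punctuation cues are not available here.
if "_sentiment" not in subset.columns:
    try:
        sia = SentimentIntensityAnalyzer()
    except LookupError:
        # The VADER lexicon is not installed yet: download it, then retry.
        nltk.download("vader_lexicon", quiet=True)
        sia = SentimentIntensityAnalyzer()
    subset["_sentiment"] = subset["review_norm"].map(lambda t: sia.polarity_scores(str(t))["compound"])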

# -------------------------------------------------
# 3) TF-IDF vectorization with stopwords and limits
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    token_pattern=r"(?u)\b[a-zA-Z]{3,}\b",  # only words ≥3 letters
    max_df=0.8,    # ignore terms in >80% of docs
    min_df=1,      # keep terms in ≥1 review (small sample)
    max_features=10000
)
X_tfidf = vectorizer.fit_transform(subset["review_norm"].astype(str))

# -------------------------------------------------
# 4) Collapse to beer-level centroids and normalize
grp = subset.groupby("beer", sort=False).indices
beer_keys = list(grp.keys())
beer_to_row = {b: i for i, b in enumerate(beer_keys)}
beer_mat = np.vstack([X_tfidf[idxs].mean(axis=0).A1 for idxs in grp.values()])
beer_mat = normalize(beer_mat, axis=1)  # normalize to unit length

# -------------------------------------------------
# 5) Choose target beer and build query vector from its reviews
target_beer = beer_keys[0]  # change index to pick another
target_idxs = grp[target_beer]
query_doc = " ".join(subset.loc[target_idxs, "review_norm"].astype(str))
qvec = vectorizer.transform([query_doc])
qvec = normalize(qvec)  # normalize query vector

# -------------------------------------------------
# 6) Cosine similarity
cosine_scores = cosine_similarity(beer_mat, qvec).ravel()

# -------------------------------------------------
# 7) Sentiment weighting
beer_sent = subset.groupby("beer")["_sentiment"].mean().reindex(beer_keys).values
sent_pos = (beer_sent + 1.0) / 2.0  # [-1,1] → [0,1]
final_score = cosine_scores * ((1 - SENTIMENT_WEIGHT) + SENTIMENT_WEIGHT * sent_pos)

# -------------------------------------------------
# 8) Assemble results table
meta = subset.groupby("beer")[["brewery","style","average_user_rating"]].first().reset_index()
results = pd.DataFrame({
    "beer": beer_keys,
    "cosine": cosine_scores,
    "sentiment_mean": beer_sent,
    "final_score": final_score
}).merge(meta, on="beer", how="left").sort_values("final_score", ascending=False).reset_index(drop=True)

# -------------------------------------------------
# 9) Exclude the target beer
results = results[results["beer"] != target_beer].reset_index(drop=True)

# -------------------------------------------------
# 10) Show most similar beer
top_match = results.iloc[0]
print(f"10 beers: {beer_keys}")
print(f"Target beer: {target_beer}")
print(f"Most similar beer: {top_match['beer']} (score={top_match['final_score']:.4f})")

# Optional: display full table
display(results)


10 beers: ['4th Anniversary', 'Barrel Aged Bomb!', 'Sunday Brunch - Bourbon Barrel-Aged', 'West Ashley', 'Congress Street IPA', 'Pliny For President', 'Birth Of Tragedy', 'Heady Topper', 'Green', 'Fort Point Pale Ale - Double Dry Hopped']
Target beer: 4th Anniversary
Most similar beer: Green (score=0.5990)


Unnamed: 0,beer,cosine,sentiment_mean,final_score,brewery,style,average_user_rating
0,Green,0.662655,0.6155,0.598957,Tree House Brewing Company,Sweet / Milk Stout,4.57
1,Fort Point Pale Ale - Double Dry Hopped,0.630639,0.655246,0.576285,Trillium Brewing Company,American Pale Ale,4.57
2,Congress Street IPA,0.63726,0.558533,0.566928,Trillium Brewing Company,American Pale Ale,4.49
3,Heady Topper,0.575482,0.689169,0.530762,The Alchemist,New England IPA,4.7
4,Pliny For President,0.545794,0.800924,0.51863,Russian River Brewing Company,Imperial IPA,4.5
5,West Ashley,0.318987,0.749337,0.298997,Sante Adairius Rustic Ales,Saison,4.56
6,Birth Of Tragedy,0.31268,0.666323,0.286596,Hill Farmstead Brewery,Saison,4.46
7,Barrel Aged Bomb!,0.315692,0.568274,0.281619,Prairie Artisan Ales,American Imperial Stout,4.48
8,Sunday Brunch - Bourbon Barrel-Aged,0.275019,0.806153,0.261691,Kane Brewing Company,Imperial Porter,4.56
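
As a check on the table above, the short sketch below recomputes `final_score` for the top three rows from their `cosine` and `sentiment_mean` values, using the same formula as step 7 of the cell. The helper name `weighted_score` is ours, and the numbers are copied by hand from the table, so small rounding differences are expected.

```python
SENTIMENT_WEIGHT = 0.5  # same value as the knob in the cell above

# (beer, cosine, sentiment_mean) copied from the first three rows of the table.
rows = [
    ("Green", 0.662655, 0.6155),
    ("Fort Point Pale Ale - Double Dry Hopped", 0.630639, 0.655246),
    ("Congress Street IPA", 0.63726, 0.558533),
]

def weighted_score(cos, sentiment, weight=SENTIMENT_WEIGHT):
    """Scale cosine similarity by sentiment mapped from [-1, 1] to [0, 1]."""
    sent_pos = (sentiment + 1.0) / 2.0
    return cos * ((1 - weight) + weight * sent_pos)

# Rank once by the sentiment-weighted score and once by cosine alone.
by_final = sorted(rows, key=lambda r: weighted_score(r[1], r[2]), reverse=True)
by_cosine = sorted(rows, key=lambda r: r[1], reverse=True)

for name, cos, sent in by_final:
    print(f"{name}: cosine={cos:.6f} final={weighted_score(cos, sent):.6f}")
```

Green ranks first under both orderings, so the sentiment weighting does not change the answer for this target. It does swap second and third place: Congress Street IPA has the higher cosine similarity, but Fort Point Pale Ale - Double Dry Hopped has the higher mean sentiment and ends up ahead.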
