Choose any 10 beers in your data. Now choose any one of them, and find the most similar beer (among the remaining 9). Explain your method and logic.

In [1]:
import pandas as pd
df = pd.read_csv('reviews_final.csv')
df.iloc[:5]

Unnamed: 0,beer,brewery,style,style_id,average_user_rating,username,user_rating,delta_from_average,look,smell,taste,feel,overall,date,review_text,brewery_id,beer_id,page_start
0,Caffè Americano,Cigar City Brewing,American Imperial Stout,157,4.46,MadMadMike,4.53,0.07,4.25,4.25,4.75,4.5,4.5,"Jul 29, 2025","In bottle, on tap, at the brewery - anywhere t...",17981,98020,0
1,Caffè Americano,Cigar City Brewing,American Imperial Stout,157,4.46,Rug,4.06,-0.4,4.0,4.25,4.0,4.0,4.0,"Jul 01, 2022",Unknown vintage\n\nSome more BIF heat from the...,17981,98020,0
2,Caffè Americano,Cigar City Brewing,American Imperial Stout,157,4.46,BFCarr,4.43,-0.03,4.25,4.25,4.5,4.5,4.5,"Apr 02, 2021",Pours dark brown with a thin tan head. Aroma c...,17981,98020,0
3,Caffè Americano,Cigar City Brewing,American Imperial Stout,157,4.46,Dfeinman1,4.23,-0.23,4.0,4.75,4.0,4.0,4.25,"Mar 02, 2021",Such a tasty beer. Perfect mouthfeel and carbo...,17981,98020,0
4,Caffè Americano,Cigar City Brewing,American Imperial Stout,157,4.46,Radome,4.54,0.08,4.75,4.5,4.5,4.75,4.5,"Jan 02, 2021",Poured from a bomber bottle into a Duvel glass...,17981,98020,0


In [2]:
df[['beer', 'style']].value_counts()

beer,style,count
§ucaba,Quadrupel (Quad),100
Pliny The Elder,Imperial IPA,100
Pseudo Sue,American Pale Ale,100
Dinner,Imperial IPA,100
KBS - Maple Mackinac Fudge,Oatmeal Stout,100
...,...,...
Monster Tones,American Imperial Stout,13
I Let My Tape Rock,Berliner Weisse,12
10 Year Barleywine,American Imperial Stout,11
Label Us Notorious,Berliner Weisse,8


In [3]:
%pip install gensim



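Method: concatenate all reviews of each beer into a single document, train a Doc2Vec model on those 10 documents to get one embedding vector per beer, and compute the cosine similarity between every pair of vectors. The beer whose vector has the highest cosine similarity to the target beer's vector (excluding the target itself) is taken as the most similar.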
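
To make the similarity step concrete, here is a minimal sketch of cosine similarity and of picking the nearest neighbour while excluding the target itself. It uses only the standard library, and the names `cosine`, `most_similar`, and `toy_vectors` (with its three made-up 3-dimensional vectors) are illustrative assumptions, not part of the notebook or of any library. The real embeddings come from Doc2Vec and have 100 dimensions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, target):
    """Return (name, score) of the vector closest to `target`, skipping `target` itself."""
    scores = {
        name: cosine(vectors[target], vec)
        for name, vec in vectors.items()
        if name != target
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy embeddings, for illustration only.
toy_vectors = {
    "beer_a": [1.0, 0.0, 1.0],
    "beer_b": [0.9, 0.1, 1.1],  # points in nearly the same direction as beer_a
    "beer_c": [0.0, 1.0, 0.0],  # orthogonal to beer_a
}

best_name, best_score = most_similar(toy_vectors, "beer_a")
print(best_name, round(best_score, 3))
```

Cosine similarity depends only on the direction of the vectors, not their length, so a beer with many long reviews is not favoured simply because its document is larger.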

In [4]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics.pairwise import cosine_similarity

# ---- 2. Pick 10 unique beers ----
beer_ids = df['beer_id'].drop_duplicates().sample(10, random_state=42).tolist()
subset = df[df['beer_id'].isin(beer_ids)]

# ---- 3. Prepare data: one document per beer ----
beer_texts = (
    subset.groupby(['beer_id', 'beer'])['review_text']
    .apply(lambda x: " ".join(x.astype(str)))
    .reset_index()
)

# ---- 4. Tokenize and tag ----
documents = [
    TaggedDocument(words=text.lower().split(), tags=[str(i)])
    for i, text in enumerate(beer_texts['review_text'])
]

# ---- 5. Train a Doc2Vec model ----
model = Doc2Vec(vector_size=100, window=5, min_count=2, workers=4, epochs=20)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

# ---- 6. Extract beer embeddings ----
embeddings = [model.dv[str(i)] for i in range(len(documents))]

# ---- 7. Compute similarity matrix ----
sim_matrix = cosine_similarity(embeddings)

# ---- 8. Pick one beer ----
target_idx = 0
target_beer = beer_texts.loc[target_idx, 'beer']
similarities = sim_matrix[target_idx].copy()  # copy so sim_matrix itself is not modified
similarities[target_idx] = -1  # ignore self-similarity

most_similar_idx = similarities.argmax()

# ---- 9. Build result DataFrame ----
result_df = pd.DataFrame({
    "beer": beer_texts['beer'],
    "similarity": similarities
})
result_df = (
    result_df.drop(index=target_idx)  # drop the target by row, so a same-named beer is not removed
    .sort_values("similarity", ascending=False)
)

print("10 beers:", beer_texts['beer'].tolist())
print("Target beer:", target_beer)
print("Most similar beer:", beer_texts.loc[most_similar_idx, 'beer'])
print("Similarity score:", similarities[most_similar_idx])
print("\nSimilarity of all beers relative to target:")
print(result_df.reset_index(drop=True).round(3))


10 beers: ['Trappist Westvleteren 12 (XII)', 'Saint Lamvinus', 'Abner', 'Fundamental Observation', 'Headroom', 'Leaner', 'Speedway Stout - Vietnamese Coffee - Rye Whiskey Barrel-Aged', 'Adios Ghost', 'The Adjunct Trail - Bourbon Barrel-Aged', 'Double Barrel V.S.O.J.']
Target beer: Trappist Westvleteren 12 (XII)
Most similar beer: Double Barrel V.S.O.J.
Similarity score: 0.70316595

Similarity of all beers relative to target:
                                                beer  similarity
0                             Double Barrel V.S.O.J.       0.703
1            The Adjunct Trail - Bourbon Barrel-Aged       0.461
2                                        Adios Ghost       0.454
3  Speedway Stout - Vietnamese Coffee - Rye Whisk...       0.342
4                                             Leaner       0.286
5                                              Abner       0.258
6                            Fundamental Observation       0.257
7                                           Headroo