# Data Analysis
---
This notebook explores the coffee bags in the cleaned dataset and recommends similar ones using cosine similarity.

In [97]:
import pandas as pd
df = pd.read_csv('../data/cleaned_data.csv')

## Example data row

In [98]:
from IPython.display import display, Markdown
import random

# find by position (a randomly chosen row)
example_1 = random.randrange(len(df))
markdown_table = df.iloc[[example_1]].to_markdown()
display(Markdown(markdown_table))

# find by title
example_2 = 'Tropical Summer Colombia La Sierra'
markdown_table = df.loc[df['title'] == example_2].to_markdown()
display(Markdown(markdown_table))

|     | title          |   rating |   acidity_structure |   aftertaste |   aroma |   body |   flavor |   with_milk | agtron   | blind_assessment                                                                                                                                                                                                    | bottom_line                                            | coffee_origin   | est_price        | notes                                                                                                                                                                                                                                                                                                                                                                                                              | review_date   | roast_level   | roaster     | roaster_location   | url                                                 |
|----:|:---------------|---------:|--------------------:|-------------:|--------:|-------:|---------:|------------:|:---------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------|:----------------|:-----------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------|:--------------|:------------|:-------------------|:----------------------------------------------------|
| 395 | Caribbean Blue |       92 |                   8 |            8 |       9 |      8 |        9 |         nan | 58/78    | Crisply sweet, nut-toned. Almond, cocoa powder, orange zest, magnolia, date in aroma and cup. Sweet in structure with round acidity; lightly satiny mouthfeel. Nut-toned finish with hints of orange zest and date. | A classic island cup: nutty-sweet with gentle acidity. | Dondon, Haiti   | $15.99/12 ounces | Produced entirely of the Blue Mountain and Typica varieties of Arabica and processed by the washed method (fruit skin and flesh are removed from the beans before they are dried). Certified organically grown. Cafe Kreyol is a roaster in Manassas, Virginia with the mission of creating sustainable employment by way of specialty coffee. Visit www.cafekreyol.com or call 571-719-7018 for more information. | February 2022 | Medium-Light  | Cafe Kreyol | Manassas, Virginia | https://www.coffeereview.com/review/caribbean-blue/ |

|    | title                              |   rating |   acidity_structure |   aftertaste |   aroma |   body |   flavor |   with_milk | agtron   | blind_assessment                                                                                                                                                                                                                                  | bottom_line                                                                                          | coffee_origin                         | est_price       | notes                                                                                                                                                                                                                                                                                                                                                                                                                                | review_date   | roast_level   | roaster              | roaster_location       | url                                                                     |
|---:|:-----------------------------------|---------:|--------------------:|-------------:|--------:|-------:|---------:|------------:|:---------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------|:--------------------------------------|:----------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------|:--------------|:---------------------|:-----------------------|:------------------------------------------------------------------------|
|  4 | Tropical Summer Colombia La Sierra |       93 |                   9 |            8 |       9 |      8 |        9 |         nan | 60/77    | Fruit-driven, crisply chocolaty. Goji berry, dried plum, baking chocolate, amber, narcissus in aroma and cup. Crisply sweet structure with balanced acidity; lightly satiny mouthfeel. Fruit-toned finish supported by notes of baking chocolate. | An experimentally processed Colombia, sweetly fruit-forward with ballast from crisp chocolate notes. | La Sierra, Cauca Department, Colombia | $18.99/8 ounces | Produced by smallholding farmers from trees of the Castillo, Caturra, Pajarito, Tabi and Bourbon varieties of Arabica, and processed by the traditional washed method using species of lactic acid-producing yeast and bacteria during the fermentation step. Merge is a specialty coffee roaster in Harrisonburg, Virginia dedicated to ethical sourcing of high-quality coffees. Visit www.mergecoffeeco.com for more information. | November 2022 | Medium-Light  | Merge Coffee Company | Harrisonburg, Virginia | https://www.coffeereview.com/review/tropical-summer-colombia-la-sierra/ |

## Finding new coffees using cosine similarity

In [99]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

def recommend_coffees(title, k=5, method='numeric'):
    if method not in ['numeric', 'text']:
        raise ValueError("Method should be either 'numeric' or 'text'")
    
    # numeric method
    if method == 'numeric':
        # select numeric columns
        numeric_columns = ['rating', 'acidity_structure', 'aftertaste', 'aroma', 'body', 'flavor']
        
        # standardize the data
        scaler = StandardScaler()
        numeric_data_scaled = scaler.fit_transform(df[numeric_columns])
        
        # calculate cosine similarity using the numeric data
        similarity_matrix = cosine_similarity(numeric_data_scaled)
        
    # text method
    elif method == 'text':
        # select the text column
        text_data = df['blind_assessment']
        
        # vectorize the text data
        vectorizer = TfidfVectorizer()
        text_data_vectorized = vectorizer.fit_transform(text_data)
        
        # calculate cosine similarity
        similarity_matrix = cosine_similarity(text_data_vectorized)
    
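    # find the positional index of the input coffee (first match), so it lines up
    # with the rows of similarity_matrix and with df.iloc below
    matches = (df['title'] == title).to_numpy().nonzero()[0]
    if len(matches) == 0:
        raise ValueError(f"No coffee titled '{title}' found in the dataset")
    coffee_index = matches[0]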
    
    # get the similarity scores for the input coffee
    similarity_scores = list(enumerate(similarity_matrix[coffee_index]))
    
    # sort by similarity scores
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    
    # get the indices of the k most similar coffees, excluding the original coffee
    similar_coffees_indices = [i[0] for i in similarity_scores if i[0] != coffee_index][:k]
    
    # return the DataFrame of the top k similar coffee bags
    similar_coffees_df = df.iloc[similar_coffees_indices]
    
    print(f"Top {k} similar coffees to '{title}' using {method} similarity:")
    return similar_coffees_df
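
The numeric method standardizes each score column before comparing rows. The sketch below is a standard-library illustration of why that step matters, not the notebook's own code: `cosine` and `standardize` are hypothetical helpers written for this example, and the third row is made up. Because the scaling is fit on three rows here rather than on the full dataset, the numbers will not match what `recommend_coffees` computes.

```python
import math
from statistics import mean, pstdev

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def standardize(rows):
    """Column-wise z-scores using the population standard deviation.
    Zero-variance columns are divided by 1 instead of 0."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# columns: rating, acidity_structure, aftertaste, aroma, body, flavor
scores = [
    [93, 9, 8, 9, 8, 9],  # Tropical Summer Colombia La Sierra (example row above)
    [92, 8, 8, 9, 8, 9],  # Caribbean Blue (example row above)
    [88, 7, 7, 8, 9, 8],  # hypothetical coffee, made up for this sketch
]

raw_similarity = cosine(scores[0], scores[2])
scaled = standardize(scores)
scaled_similarity = cosine(scaled[0], scaled[2])

print(f"raw:          {raw_similarity:.3f}")
print(f"standardized: {scaled_similarity:.3f}")
```

On the raw scores, the large `rating` value dominates every vector, so the similarity between the first and third rows is close to 1 even though their profiles differ. After standardization each column contributes on the same scale and the same pair comes out negative, which is the kind of contrast the ranking needs.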

## Example usage for numeric

In [100]:
recommended_coffees = recommend_coffees('Tropical Summer Colombia La Sierra', k=5, method='numeric')
# number the recommendations starting from 1
for i, (_, row) in enumerate(recommended_coffees.iterrows(), start=1):
    print(f"{i}. {row['title']} by {row['roaster']}")

Top 5 similar coffees to 'Tropical Summer Colombia La Sierra' using numeric similarity:
1. Bolivia Manantial Gesha by Red Rooster Coffee Roaster
2. Kenya AA by AKA Coffee Roasters
3. Guatemala Los Santos by Temple Coffee
4. Honeyed Floral by 1980 CAFE
5. Philippines Sitio Belis Garnica by Mostra Coffee


## Example usage for text

In [101]:
recommended_coffees = recommend_coffees('Tropical Summer Colombia La Sierra', k=5, method='text')
# number the recommendations starting from 1
for i, (_, row) in enumerate(recommended_coffees.iterrows(), start=1):
    print(f"{i}. {row['title']} by {row['roaster']}")

Top 5 similar coffees to 'Tropical Summer Colombia La Sierra' using text similarity:
1. Ethiopia Guji Hambela Wate Natural G1 by Samlin Coffee
2. Haraaz Red Yemen by JBC Coffee Roasters
3. Zambia Mafinga Hills Natural by Jackrabbit Java
4. Ethiopia Gedeb Chelchele Slow Dry Washed by Kafe Coffee Roastery
5. Haraaz Red by JBC Coffee Roasters
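
To get a feel for why the text method surfaces these coffees, the sketch below compares three `blind_assessment` strings quoted elsewhere in this notebook. It is a simplification under stated assumptions: it uses raw word counts rather than TF-IDF weights, and `bag_of_words` and `cosine` are hypothetical helpers written for this example. `TfidfVectorizer` additionally down-weights words that appear in many reviews, so the actual scores will differ.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count each alphabetic token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

tropical = bag_of_words(
    "Fruit-driven, crisply chocolaty. Goji berry, dried plum, baking chocolate, "
    "amber, narcissus in aroma and cup. Crisply sweet structure with balanced "
    "acidity; lightly satiny mouthfeel. Fruit-toned finish supported by notes "
    "of baking chocolate."
)
guji = bag_of_words(
    "High-toned, crisply sweet-tart. Dried plum, baking chocolate, rhododendron, "
    "cedar, almond in aroma and cup. Sweetly tart with brisk acidity; velvety "
    "mouthfeel. The fruit-toned finish is supported by notes of cedar and "
    "baking chocolate."
)
caribbean = bag_of_words(
    "Crisply sweet, nut-toned. Almond, cocoa powder, orange zest, magnolia, date "
    "in aroma and cup. Sweet in structure with round acidity; lightly satiny "
    "mouthfeel. Nut-toned finish with hints of orange zest and date."
)

sim_guji = cosine(tropical, guji)
sim_caribbean = cosine(tropical, caribbean)
print(f"Tropical Summer vs Ethiopia Guji:  {sim_guji:.3f}")
print(f"Tropical Summer vs Caribbean Blue: {sim_caribbean:.3f}")
```

The Ethiopia Guji review shares descriptors such as "dried plum" and "baking chocolate" with the query coffee, so it scores higher than Caribbean Blue, whose overlap is mostly in the boilerplate phrasing common to all reviews.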


## Requesting additional data

In [102]:
coffee = recommended_coffees.iloc[0]
print(f"Here is some information about '{coffee['title']}' by {coffee['roaster']}:")
for key, value in coffee.items():
    print(f"{key}: {value}")

Here is some information about 'Ethiopia Guji Hambela Wate Natural G1' by Samlin Coffee:
title: Ethiopia Guji Hambela Wate Natural G1
rating: 92.0
acidity_structure: 8.0
aftertaste: 8.0
aroma: 9.0
body: 8.0
flavor: 9.0
with_milk: nan
agtron: 57/75
blind_assessment: High-toned, crisply sweet-tart. Dried plum, baking chocolate, rhododendron, cedar, almond in aroma and cup. Sweetly tart with brisk acidity; velvety mouthfeel. The fruit-toned finish is supported by notes of cedar and baking chocolate.
bottom_line: A delicately fruit-toned natural-process Ethiopia with crisp chocolate and rich aromatic wood notes throughout the profile.
coffee_origin: Guji Zone, Oromia Region, Southern Ethiopia
est_price: NT $550/250 grams
notes: Produced in the distinguished Guji growing region, nestled next to Ethiopia's better-known Yirgacheffe and Sidamo regions, largely from distinctive indigenous landrace varieties of Arabica long grown in the region. Processed by the natural method (beans are dried in