I'm a huge coffee nerd, so I went looking for a dataset of coffee reviews that I could drill into. The one I picked is the Coffee Reviews Dataset on Kaggle, which can be found here:
https://www.kaggle.com/datasets/schmoyote/coffee-reviews-dataset

Each row covers one coffee and carries three separate description fields (desc_1, desc_2, and desc_3).

I'm first going to prepare my environment and tee up some basic spaCy and NLTK data sources.

In [8]:
import pandas as pd
import spacy
import nltk
from collections import Counter

In [9]:
#Load the spaCy model
nlp = spacy.load('en_core_web_sm')
nlp.max_length = 2000000  #Increase the maximum length limit 

In [10]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Aman\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Aman\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

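NLTK is the Natural Language Toolkit, a fantastic Python library for working with text. We're going to use its word_tokenize function, which splits a text into individual word and punctuation tokens. Under the hood it first splits the text into sentences with the Punkt tokenizer, which uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences. That model has to be trained on a large collection of plain text in the target language before it can be used, and the pretrained version is what the punkt download above provides.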

In addition, we're going to remove the stop words. A stop word is a very frequently used word (like "the," "a," "an," or "in") that adds little meaning to the kind of frequency analysis we're going to do.
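
To make the tokenize-then-filter idea concrete, here is a minimal sketch that uses only the standard library. It is a stand-in, not what the notebook runs: the regex tokenizer, the tiny TOY_STOP_WORDS set, and the review sentence are all made up for illustration, and NLTK's English stop word list is much longer.

```python
import re

# Hypothetical mini stop list for illustration; NLTK's English list is far longer.
TOY_STOP_WORDS = {"the", "a", "an", "in", "with", "and", "of"}

def simple_tokens(text):
    """Lowercase the text and pull out runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def drop_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stop_words]

# Made-up review sentence, not taken from the dataset.
review = "A plush, smooth roast-touched Kona with a hint of cocoa in the finish."
tokens = simple_tokens(review)
content_tokens = drop_stop_words(tokens)

print(len(tokens), "tokens before filtering")
print(content_tokens)
```

Note that this regex splits "roast-touched" into two tokens, so its output will not match NLTK's tokenizer exactly.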

From here, we're going to load the CSV, extract the three reviews per coffee, and reshape them into a single one-column dataframe that we can save afterwards.


In [12]:
#Load the CSV file
file_path = 'coffee_analysis.csv'
df = pd.read_csv(file_path)

#Extract reviews
desc_df = df[['desc_1', 'desc_2', 'desc_3']]

#Transform the reviews from one row with three reviews to three rows with 1 review (melt it)
melted_desc_df = desc_df.melt(var_name='description_type', value_name='description').drop(columns='description_type')
melted_desc_df

index,description
0,"Evaluated as espresso. Sweet-toned, deeply ric..."
1,"Evaluated as espresso. Sweetly tart, floral-to..."
2,"Crisply sweet, cocoa-toned. Lemon blossom, roa..."
3,"Delicate, sweetly spice-toned. Pink peppercorn..."
4,"Deeply sweet, subtly pungent. Honey, pear, tan..."
...,...
6280,"A quietly confident, sweetly nut-toned Guatema..."
6281,"A deeply floral, richly chocolaty Guatemala cu..."
6282,"A bright, balanced, juicy Guatemala cup driven..."
6283,"Balanced, bright, invigoratingly crisp, with t..."
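
The reshaping step above is easier to follow on a toy frame. This is a small sketch, assuming a made-up two-row dataframe with the same three column names; the review strings are invented. It also relies on melt's default name for the label column, "variable", instead of passing var_name.

```python
import pandas as pd

# Made-up two-row frame that mirrors the three description columns.
toy = pd.DataFrame({
    "desc_1": ["Bright, juicy.", "Deeply sweet."],
    "desc_2": ["Grown in Kenya.", None],
    "desc_3": ["A lively cup.", "A quiet cup."],
})

long_form = (
    toy.melt(value_name="description")  # wide to long: columns "variable" and "description"
       .drop(columns="variable")        # the source column label is not needed
       .dropna()                        # drop rows where a description was missing
       .reset_index(drop=True)
)
descriptions = long_form["description"].tolist()
print(descriptions)
```

All of desc_1 comes first, then desc_2, then desc_3, so the three reviews of one coffee are no longer adjacent after the melt.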


Now let's write the descriptions out to a corpus file so that, ideally, we never have to repeat this step.

In [13]:
#Clean the data: drop rows with a missing description
melted_desc_df = melted_desc_df.dropna()
#Turn it into a list, so that we can write it out. 
corpus = melted_desc_df['description'].tolist()

#save the corpus
corpus_file_path = 'descriptions_corpus.txt'
with open(corpus_file_path, 'w', encoding='utf-8') as file:
    for description in corpus:
        file.write(description + '\n')
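
Since the point of the corpus file is to skip the CSV step next time, here is a hedged sketch of the round trip. The helper names save_corpus and load_corpus are hypothetical, and the sample strings are made up. One assumption worth flagging: a one-description-per-line file only reloads cleanly if no description contains a newline, so this version collapses whitespace runs before writing.

```python
import os
import tempfile

def save_corpus(descriptions, path):
    """Write one description per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for d in descriptions:
            # Collapse runs of whitespace, including newlines, so each
            # description stays on a single line.
            f.write(" ".join(d.split()) + "\n")

def load_corpus(path):
    """Read the file back into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]

# Made-up sample; the second entry has an embedded newline on purpose.
sample = ["Crisply sweet, cocoa-toned.", "Delicate,\nsweetly spice-toned."]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "descriptions_corpus.txt")
    save_corpus(sample, path)
    reloaded = load_corpus(path)

print(reloaded)
```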
        


In [14]:
total_descriptions = len(corpus)
lengths = [len(description) for description in corpus]
average_length = sum(lengths) / total_descriptions
longest_description = max(corpus, key=len)
shortest_description = min(corpus, key=len)

summary_statistics = {
    'Total Descriptions': total_descriptions,
    'Average Length': average_length,
    'Longest Description': longest_description,
    'Length of Longest Description': len(longest_description),
    'Shortest Description': shortest_description,
    'Length of Shortest Description': len(shortest_description)
}
summary_statistics

{'Total Descriptions': 6283,
 'Average Length': 297.3931243036766,
 'Longest Description': 'This exceptional coffee was selected as the No. 24 coffee on\xa0Coffee Review’s\xa0list of the Top 30 Coffees of 2018.\xa0 This coffee tied for the highest rating in a tasting of natural-processed single-origin espressos for Coffee Review’s August 2018 tasting report. With its generally elongated beans and distinctive floral and crisp, often chocolaty cup, the Gesha variety of Arabica continues to distinguish itself as one of the world’s rarest and most unique coffees. Although the Gesha originated in Ethiopia, it was “discovered” by the coffee world in 2004 growing in Boquete, Panama, and Panama continues to dominate the expanding world of Gesha. This particular version, however, is the outcome of efforts to commercialize Gesha in the region from which it originally came. It was grown in western Ethiopia by farmers Adam and Rachel Overton and their indigenous Meanit culture collaborators from s

In [27]:
#Start with just the first 20 reviews to keep the runtime down, then expand to the whole set later
all_text = " ".join(corpus[:20])

In [28]:
#all_text (built in the previous cell) holds the first 20 descriptions joined into one string.
#To process the whole corpus instead, uncomment the next line:
#all_text = " ".join(corpus)

#Tokenize
tokens = word_tokenize(all_text.lower())

#Remove stop words; isalnum() also drops punctuation and hyphenated tokens such as "sweet-toned"
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

In [29]:
#Lexical diversity: unique tokens divided by total tokens (type-token ratio)
lexical_diversity = len(set(filtered_tokens)) / len(filtered_tokens)

#Calculate 200 most common words
word_freq = Counter(filtered_tokens)
most_common_words = word_freq.most_common(200)

#Use spaCy for NER
doc = nlp(all_text)
entities = [(entity.text, entity.label_) for entity in doc.ents]
entity_counter = Counter(entities)

#Collect the interesting statistics in one place
summary_statistics = {
    'Total Descriptions': total_descriptions,
    'Average Length': average_length,
    'Longest Description': longest_description,
    'Length of Longest Description': len(longest_description),
    'Shortest Description': shortest_description,
    'Length of Shortest Description': len(shortest_description),
    'Lexical Diversity': lexical_diversity,
    'Most Common Words': most_common_words,
    'Named Entities': entity_counter.most_common(10) #I wanted to see if there were any fun named entities
}

summary_statistics_df = pd.DataFrame.from_dict(summary_statistics, orient='index', columns=['Value'])
summary_statistics_df

Statistic,Value
Total Descriptions,6283
Average Length,297.393
Longest Description,This exceptional coffee was selected as the No...
Length of Longest Description,1380
Shortest Description,"A plush, smooth roast-touched Kona."
Length of Shortest Description,35
Lexical Diversity,0.335484
Most Common Words,"[(finish, 21), (aroma, 20), (cup, 20), (mouthf..."
Named Entities,"[((three, CARDINAL), 4), ((Sweet, PERSON), 3),..."


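NER doesn't seem to be working too well here: the top hits are things like the number "three" and "Sweet" mislabelled as a PERSON. It's still rather interesting that "finish" and "aroma" are such common words, at least in the 20-review sample. Keep in mind that the lexical diversity, word counts, and entities come from that sample, while the length statistics cover the whole corpus.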
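
The lexical diversity number in the table is a type-token ratio: unique tokens divided by total tokens. Here is a small worked sketch on made-up token lists; the function name mirrors the variable used above but is otherwise my own. It also hints at why a score from 20 reviews probably should not be compared directly with one from the full corpus, since the ratio tends to fall as a sample grows and words repeat.

```python
def lexical_diversity(tokens):
    """Type-token ratio: unique tokens divided by total tokens."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty sample
    return len(set(tokens)) / len(tokens)

# Made-up token lists for illustration.
short_sample = ["sweet", "lemon", "finish", "sweet"]
longer_sample = short_sample * 3 + ["cocoa", "aroma"]

short_score = lexical_diversity(short_sample)    # 3 unique / 4 total
longer_score = lexical_diversity(longer_sample)  # 5 unique / 14 total

print(round(short_score, 3), round(longer_score, 3))
```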

In [30]:
sorted_word_freq = word_freq.most_common()
total_words = sum(word_freq.values())
cumulative_count = 0
unique_word_count = 0
words_that_represent_half = []
for word, freq in sorted_word_freq:
    cumulative_count += freq
    unique_word_count += 1
    words_that_represent_half.append(word)
    if cumulative_count >= total_words / 2:
        break

# Display the result
unique_word_count, words_that_represent_half

(34,
 ['finish',
  'aroma',
  'cup',
  'mouthfeel',
  'structure',
  'chocolate',
  'long',
  'sweet',
  'short',
  'sweetly',
  'notes',
  'lemon',
  'delicate',
  'dark',
  'zest',
  'roasted',
  'cacao',
  'nib',
  'peppercorn',
  'floral',
  'rich',
  'resonant',
  'tart',
  'dried',
  'gently',
  'crisply',
  'blossom',
  'crisp',
  'balanced',
  'plush',
  'syrupy',
  'apricot',
  'juicy',
  'hints'])
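
So in this sample, 34 distinct words account for half of all the filtered tokens. The same loop can be wrapped in a function so the threshold is adjustable. This is a sketch with a hypothetical name, words_covering, and a made-up Counter; it follows the same logic as the cell above, just with the fraction as a parameter.

```python
from collections import Counter

def words_covering(word_freq, fraction=0.5):
    """Return the most frequent words that together reach `fraction` of all tokens."""
    total = sum(word_freq.values())
    covered = 0
    selected = []
    for word, freq in word_freq.most_common():
        covered += freq
        selected.append(word)
        if covered >= total * fraction:
            break
    return selected

# Made-up counts for illustration: 12 tokens in total.
toy_freq = Counter({"finish": 5, "aroma": 3, "cup": 2, "lemon": 1, "zest": 1})

half = words_covering(toy_freq)             # finish (5) + aroma (3) = 8 >= 6
most = words_covering(toy_freq, 0.9)        # needs 11 of 12 tokens
print(half)
print(most)
```

With the real data, calling it as words_covering(word_freq) should reproduce the list above, and raising the fraction shows how quickly the long tail of rare words kicks in.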