# Case Study - ByteMatch

- Background: In the heart of the digital age, the "Daily Byte" stands as one of
the most renowned online news platforms. With a vast array of articles
spanning various topics, readers are spoilt for choice. However, with such an
abundance of information, many readers feel overwhelmed, often missing out
on articles that might resonate with their interests.

- Challenge: The editorial team, led by the visionary editor-in-chief, Ms.
Penelope Wordsworth, has a challenge for the students. The team has noticed a
trend: readers who enjoy an article on a given topic are likely to enjoy other
articles that are semantically similar. With this observation, Ms. Wordsworth
poses a question: "Can we guide our readers to articles they'll love, based on
what they've previously enjoyed?"

- Objective: You are tasked with creating ByteMatch, a recommendation
system for the Daily Byte.

In [1]:
import spacy
nlp = spacy.load('en_core_web_md')

The sample articles used in this demonstration are taken from CNN's 'Good News Generator 2023':
https://edition.cnn.com/interactive/2023/12/specials/good-news-generator-dg-cec/

In [2]:
news_articles = {
    "Ticklish Rats" : "Rats are ticklish, study says! Researchers played with and tickled rats in a study, identifying a part of the brain that's active both when the rodents are playing and when they're tickled.",
    "Clever Elephants" : "An elephant learned how to peel a banana, which is more impressive than you think. An elephant's trunk is a remarkable organ: a fusion of nose and upper lip, capable of movement via a dense network of muscles. It's strong enough to lift a log, and sensitive enough to perform delicate tasks like picking up a single tortilla chip without breaking it.",
    "ChatGPT Crochet" : "ChatGPT created crochet patterns, and the results were adorably baffling. ChatGPT, a publicly available language-learning AI, was not designed to create things like crochet or knitting patterns, but what happens when you ask it to do just that? We crocheted some ChatGPT-generated patterns to find out.",
    "Sologamy" : "These cool ladies married the loves of their lives - themselves. The practice, called sologamy, sometimes involves lavish ceremonies complete with vows, cakes and bridesmaids. Critics call it narcissistic, but those who do it say it's a healthy expression of self-love.",
    "Imaging Improvements" : "New technology could improve medical imaging on dark skin. Traditional medical imaging - used to diagnose, monitor or treat certain medical conditions - has long struggled to get clear pictures of patients with dark skin, according to experts.",
    "Clothing Waste" : "France says it will pay for some wardrobe repairs to cut down on clothing waste. France is to introduce a scheme that will subsidize repairs to clothing and shoes in order to cut waste and planet-heating pollution from the textile industry.",
    "Viral Novel" : "A daughter helped her father's decade-old thriller novel go viral on TikTok. Lloyd Devereux Richards first published “Stone Maidens” in 2012 and didn't garner much excitement. His daughter recently decided to post it on TikTok, helping it rocket to the top of Amazon's Best Seller list.",
    "Dress Restoration" : "This woman restores gorgeous, historic wedding dresses. Karen Tierney, a California-based textiles expert, restores historical wedding dresses. Earlier this year, she put out a call to her clients who gathered for a special fundraiser, to show off more than 150 years of history, craftsmanship and love.",
    "Penguin Surgery" : "A trio of elderly penguins got cataract surgery in a scientific first. Three elderly king penguins have been fitted with custom-made eye lenses during surgery to remove cataracts in what is believed to be a world first procedure to improve their sight, according to a Singapore zoo."
}

### Preprocessing function:

In [18]:
def preprocess(text):
    """This function preprocesses a string by removing punctuation 
    and stop words, and lemmatizing the remaining words.
    
    Parameters
    ----------
    text : str
        The text to be preprocessed.

    Returns
    -------
    str
        The preprocessed text."""
    doc = nlp(text)
    return ' '.join([token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct])

### Recommendation function:

In [19]:
def recommendation(text, articles):
    """This function takes the text of the last article read
    and a dictionary containing article titles and text, and
    returns the title of the most similar article as a recommendation
    to read next.

    Parameters
    ----------
    text : str
        The text of the last article read.
    articles : dict[str, str]
        A dictionary in the form {title : text}

    Returns
    -------
    str
        The title of the most similar article from the dictionary."""
    

    # Convert the last-read article to a Doc object for comparison:
    model_text = nlp(text)

    # Build a dictionary of article titles and similarity scores:
    comparison = {}
    for title, article_text in articles.items():
        score = nlp(article_text).similarity(model_text)

        # Exclude the article being compared with itself (similarity of 1.0):
        if score != 1.0:
            comparison[title] = score

    # Find the article title with the maximum similarity score:
    top_recommendation = max(comparison, key=comparison.get)
    #print(comparison)

    return top_recommendation
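
For pipelines that ship word vectors, such as en_core_web_md, spaCy's `Doc.similarity` by default returns the cosine similarity between the two document vectors, and a document vector defaults to the average of its token vectors. The sketch below reproduces that calculation in plain Python. The three words and their 3-dimensional vectors are invented for illustration and are not values from the model. It also shows how a shared stop word can pull two otherwise unrelated texts closer together.

```python
import math

# Toy 3-d "word vectors"; the values are invented for illustration only.
VECTORS = {
    "the": [1.0, 1.0, 1.0],
    "rat": [1.0, 0.0, 0.0],
    "dress": [0.0, 1.0, 0.0],
}


def doc_vector(tokens):
    """Average the token vectors component-wise."""
    vecs = [VECTORS[token] for token in tokens]
    return [sum(component) / len(vecs) for component in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


with_stop = cosine(doc_vector(["the", "rat"]), doc_vector(["the", "dress"]))
without_stop = cosine(doc_vector(["rat"]), doc_vector(["dress"]))
print(f"with stop word: {with_stop:.3f}")
print(f"without stop word: {without_stop:.3f}")
```

In this toy case the shared stop word raises the score from 0 to about 0.83, which is one plausible reason the processed and unprocessed texts can rank articles differently.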

#### Preprocessing the original articles:

In [20]:
processed_articles = {}
for key, value in news_articles.items():
    processed_articles[key] = preprocess(value)

#### Testing out the recommendations:

In [21]:
# The text of the last article the user read:
users_article = news_articles.get("Sologamy")
users_article_processed = preprocess(users_article)

In [22]:
recommendation(users_article_processed, processed_articles)

'Dress Restoration'

In [23]:
recommendation(users_article, news_articles)

'Dress Restoration'
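
Returning only the best title hides how close the alternatives were. As a sketch, the `comparison` dictionary built inside `recommendation` could be sorted to expose the top few candidates. `top_matches` is a hypothetical helper, and the scores below are placeholders rather than real model output.

```python
def top_matches(comparison, n=3):
    """Return the n highest-scoring (title, score) pairs, best first."""
    ranked = sorted(comparison.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Placeholder scores for illustration only (not produced by spaCy).
example_scores = {"Article A": 0.71, "Article B": 0.84, "Article C": 0.65}
best_two = top_matches(example_scores, n=2)
print(best_two)
```

Seeing the runner-up alongside the winner makes it easier to judge whether a recommendation won by a clear margin or only narrowly.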

#### Evaluating all recommendations with and without preprocessing the article text:

In [24]:
for key in news_articles.keys():
    users_article = news_articles.get(key)
    users_article_processed = preprocess(users_article)
    print(f"Original article: {key}")
    print(f"Unprocessed: {recommendation(users_article, news_articles)}")
    print(f"Processed: {recommendation(users_article_processed, processed_articles)}\n")

Original article: Ticklish Rats
Unprocessed: Sologamy
Processed: Penguin Surgery

Original article: Clever Elephants
Unprocessed: Penguin Surgery
Processed: Penguin Surgery

Original article: ChatGPT Crochet
Unprocessed: Clothing Waste
Processed: Clever Elephants

Original article: Sologamy
Unprocessed: Dress Restoration
Processed: Dress Restoration

Original article: Imaging Improvements
Unprocessed: Clothing Waste
Processed: Penguin Surgery

Original article: Clothing Waste
Unprocessed: Penguin Surgery
Processed: Penguin Surgery

Original article: Viral Novel
Unprocessed: Dress Restoration
Processed: Dress Restoration

Original article: Dress Restoration
Unprocessed: Sologamy
Processed: Viral Novel

Original article: Penguin Surgery
Unprocessed: Clever Elephants
Processed: Imaging Improvements
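
The printed results can be tallied rather than compared by eye. The sketch below copies the (unprocessed, processed) pairs from the output above into a dictionary and counts where the two approaches agree. The `results` name is introduced here for illustration.

```python
# (unprocessed, processed) recommendations copied from the printed output.
results = {
    "Ticklish Rats": ("Sologamy", "Penguin Surgery"),
    "Clever Elephants": ("Penguin Surgery", "Penguin Surgery"),
    "ChatGPT Crochet": ("Clothing Waste", "Clever Elephants"),
    "Sologamy": ("Dress Restoration", "Dress Restoration"),
    "Imaging Improvements": ("Clothing Waste", "Penguin Surgery"),
    "Clothing Waste": ("Penguin Surgery", "Penguin Surgery"),
    "Viral Novel": ("Dress Restoration", "Dress Restoration"),
    "Dress Restoration": ("Sologamy", "Viral Novel"),
    "Penguin Surgery": ("Clever Elephants", "Imaging Improvements"),
}

# Titles where both approaches recommend the same article, and where they differ.
agree = [title for title, (raw, processed) in results.items() if raw == processed]
differ = [title for title in results if title not in agree]

print(f"{len(agree)} of {len(results)} agree: {agree}")
print(f"Differ: {differ}")
```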



Four of the nine articles produce the same recommendation regardless of preprocessing. For the remaining five:

- Ticklish Rats - Sologamy is the more surprising result, as it is a human interest story about love and marriage. Penguin Surgery seems the better match, as it also involves animals and science.
More obvious match: processed

- ChatGPT Crochet - It is not entirely clear how either recommendation relates to the original article. Clothing Waste at least shares a textile theme with crochet, so it seems the more logical suggestion.
More obvious match: unprocessed

- Imaging Improvements - Penguin Surgery makes more sense, as it is also an article relating to medicine.
More obvious match: processed

- Dress Restoration - The Sologamy article is about weddings and Dress Restoration is about wedding dresses, so Sologamy seems the better match.
More obvious match: unprocessed

- Penguin Surgery - Both matches have merit: one shares the animal theme, the other the medical theme.
More obvious match: draw

Although the unprocessed and processed texts give different results, neither is clearly better on this sample data. One way to try to improve the recommendations would be to use the larger en_core_web_lg model.
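
One way to compare models, sketched here under assumptions, is to pass the loaded pipeline into the function instead of relying on the global `nlp`. `recommend_with` is a hypothetical variant of `recommendation`. So that the block runs without downloading a model, it is exercised with a toy stand-in (`OverlapDoc`) whose `similarity` is plain word overlap, which is not how spaCy scores documents.

```python
def recommend_with(pipeline, text, articles):
    """Like recommendation(), but with the pipeline passed in explicitly."""
    query = pipeline(text)
    scores = {
        title: pipeline(body).similarity(query)
        for title, body in articles.items()
        if body != text  # skip the article being read (assumes exact text match)
    }
    return max(scores, key=scores.get)


class OverlapDoc:
    """Toy stand-in for a spaCy Doc: similarity is Jaccard word overlap."""

    def __init__(self, text):
        self.words = set(text.lower().split())

    def similarity(self, other):
        union = self.words | other.words
        return len(self.words & other.words) / len(union) if union else 0.0


# Made-up mini corpus, for illustration only.
toy_articles = {
    "A": "penguins had eye surgery",
    "B": "elephants peel bananas",
    "C": "doctors improve eye surgery imaging",
}
pick = recommend_with(OverlapDoc, toy_articles["A"], toy_articles)
print(pick)
```

With spaCy installed and the larger model downloaded, the corresponding call would be `recommend_with(spacy.load('en_core_web_lg'), users_article, news_articles)`, which could then be compared with the en_core_web_md results.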
