# Better Tags with the Co:here API

Team: Artificially Intelligent

## Book recommendations are awful!

### Can't we do better?
- We aim to use the Co:here API to create better tags for short story recommendations.
- We hand-crafted detailed tags for a set of short stories from the website [Reedsy](https://blog.reedsy.com/short-stories/).
- These tags, along with part of each story's text, serve as examples for [Co:here](https://cohere.ai/) to generate tags for more stories.
- A Streamlit app lets a user enter the URL of a Reedsy story they enjoyed in order to find similar stories.
- The app scrapes the body text from that URL and uses Co:here to generate tags for the story.
- The cosine similarity between these generated tags and the tags in our corpus is calculated.
- The URLs of the three most similar stories are then returned to the user!



In [4]:
# install required packages
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.






Set the API key here:

In [5]:
import os
os.environ['COHERE_API_KEY'] = ''
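
Writing the key directly into the cell makes it easy to commit by accident. As a minimal sketch of one alternative, the helper below reads `COHERE_API_KEY` from the environment and only prompts for it when it is missing. The name `load_api_key` and its parameters are hypothetical and not part of the original notebook.

```python
import getpass
import os


def load_api_key(env=None, ask=getpass.getpass):
    """Return the Co:here key from the environment, prompting only if it is unset."""
    env = os.environ if env is None else env
    key = env.get('COHERE_API_KEY', '').strip()
    if not key:
        # getpass hides the typed key so it does not appear in the notebook output
        key = ask('Co:here API key: ').strip()
        env['COHERE_API_KEY'] = key
    return key
```

Calling `load_api_key()` once before creating the client would leave the rest of the notebook unchanged, since later cells read the key with `os.getenv('COHERE_API_KEY')`.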

Set up the API client and a driver for web scraping:

In [6]:
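import math
import os
from collections import Counter  # used for cosine similarity

import cohere
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Co:here client, authenticated with the key set above
co = cohere.Client(os.getenv('COHERE_API_KEY'))
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))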



Current google-chrome version is 105.0.5195
Get LATEST driver version for 105.0.5195
Driver [C:\Users\bensn\.wdm\drivers\chromedriver\win32\105.0.5195.52\chromedriver.exe] found in cache
  


In [7]:
from scrape_functions import * # importing url scraping functions

In [8]:
# list of reedsy stories in our corpus
url_list = ['https://blog.reedsy.com/short-story/ndit0a/',
            'https://blog.reedsy.com/short-story/n5toc8/',
            'https://blog.reedsy.com/short-story/eoex6z/',
            'https://blog.reedsy.com/short-story/vi8uge/',
            'https://blog.reedsy.com/short-story/7yrery/',
            'https://blog.reedsy.com/short-story/j48lmh/',
            'https://blog.reedsy.com/short-story/1toaoe/',
            'https://blog.reedsy.com/short-story/7yrery/',
            'https://blog.reedsy.com/short-story/c2ar53/',
            'https://blog.reedsy.com/short-story/iur9u1/',
            'https://blog.reedsy.com/short-story/0wjsaw/',
            'https://blog.reedsy.com/short-story/rcuiva/',
            'https://blog.reedsy.com/short-story/zf6eb9/',
            'https://blog.reedsy.com/short-story/evwarr/',
            'https://blog.reedsy.com/short-story/omxn2o/',
            'https://blog.reedsy.com/short-story/gm80ij/',
            'https://blog.reedsy.com/short-story/dkhusv/',
            'https://blog.reedsy.com/short-story/idca55/',
            'https://blog.reedsy.com/short-story/ua19kk/',
            'https://blog.reedsy.com/short-story/vdgjfo/',
            'https://blog.reedsy.com/short-story/5antpt/',
            'https://blog.reedsy.com/short-story/4gdaax/',
            'https://blog.reedsy.com/short-story/nkkeif/',
            'https://blog.reedsy.com/short-story/nc54kv/']
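
The list above contains the `7yrery` URL twice, which is why "Wish You Were Here" appears twice in the dataframe outputs in this notebook. If the duplicate is not wanted, one possible way to drop it while keeping the original order is sketched below. The helper name `dedupe_preserving_order` is hypothetical, and the three-item `sample` list is only an illustration.

```python
def dedupe_preserving_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each in place."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


sample = [
    'https://blog.reedsy.com/short-story/7yrery/',
    'https://blog.reedsy.com/short-story/j48lmh/',
    'https://blog.reedsy.com/short-story/7yrery/',
]
unique_urls = dedupe_preserving_order(sample)
```

Applying it as `url_list = dedupe_preserving_order(url_list)` before scraping would change the row indices, so the outputs shown in this notebook, which were produced with the duplicate in place, would no longer line up exactly.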

Scraping the URLs and saving the data to a dataframe:

In [9]:
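import pandas as pd  # imported explicitly rather than relying on the star import above

# scrape each story and stack the results into one dataframe
# (assumes scrape_URL returns a one-row DataFrame, as its later use on new_story suggests)
df = pd.concat([scrape_URL(url, driver) for url in url_list], ignore_index=True)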

Here is the dataframe with all of the stories; none of them have tags yet:

In [10]:
df

Unnamed: 0,Article title,Body,URL,Tags
0,Mothering,"Carmen invades my mind a lot, even years after...",https://blog.reedsy.com/short-story/ndit0a/,
1,10 Ways to Explain Your Husband's Deat...,Content warning: Themes of death***1. Tell the...,https://blog.reedsy.com/short-story/n5toc8/,
2,"And the Radio Said, “There’s Another S...","CW: Gun violence, deathThe air hung heavy with...",https://blog.reedsy.com/short-story/eoex6z/,
3,Long Distance Phone Call,You can talk to the dead on the pay phone in f...,https://blog.reedsy.com/short-story/vi8uge/,
4,Wish You Were Here,Bonjour from the city of love! I don't have mu...,https://blog.reedsy.com/short-story/7yrery/,
5,Old Soul,Can you hear me?I remember being reborn a thou...,https://blog.reedsy.com/short-story/j48lmh/,
6,Your Cat™️ Customer Service,"Hi, I'm Timmy McHill and drugs and AA started ...",https://blog.reedsy.com/short-story/1toaoe/,
7,Wish You Were Here,Bonjour from the city of love! I don't have mu...,https://blog.reedsy.com/short-story/7yrery/,
8,We/I,Something happened in the lab the other night....,https://blog.reedsy.com/short-story/c2ar53/,
9,Someone Has Died,TW: This story contains several depictions and...,https://blog.reedsy.com/short-story/iur9u1/,


Below are five manually annotated sets of tags, which correspond to the first five stories in the dataset:

In [11]:
tags = ["Single mother, reminiscent, cancer, death, grief, judgment",
        "Death of parent, death of spouse, contemplative, religious undertones, grief",
        "Gun violence, death of parent, wartime, cynical", 
        "cancer, death of parent, loss, parent bonding, paranormal",
        "travel, menwritingwomen, delusion, forlorn, ex-boyfriend, twist"]

We will add these to the dataframe:

In [12]:
for i, tag in enumerate(tags):
    df.at[i, 'Tags'] = tag

Here are the newly assigned tags:

In [13]:
df.head()

Unnamed: 0,Article title,Body,URL,Tags
0,Mothering,"Carmen invades my mind a lot, even years after...",https://blog.reedsy.com/short-story/ndit0a/,"Single mother, reminiscent, cancer, death, gri..."
1,10 Ways to Explain Your Husband's Deat...,Content warning: Themes of death***1. Tell the...,https://blog.reedsy.com/short-story/n5toc8/,"Death of parent, death of spouse, contemplativ..."
2,"And the Radio Said, “There’s Another S...","CW: Gun violence, deathThe air hung heavy with...",https://blog.reedsy.com/short-story/eoex6z/,"Gun violence, death of parent, wartime, cynical"
3,Long Distance Phone Call,You can talk to the dead on the pay phone in f...,https://blog.reedsy.com/short-story/vi8uge/,"cancer, death of parent, loss, parent bonding,..."
4,Wish You Were Here,Bonjour from the city of love! I don't have mu...,https://blog.reedsy.com/short-story/7yrery/,"travel, menwritingwomen, delusion, forlorn, ex..."


Let's define a function that predicts the tags for a story from the tags we already have. It sends the Co:here API a few-shot prompt containing three examples, each made of the first 1500 characters of a hand-tagged story followed by the tags we wrote, and then the first 1500 characters of the story to be tagged.

In [14]:
def predict_tags(index):
    number_chars = 1500  # characters of each story included in the prompt
    co = cohere.Client(os.getenv('COHERE_API_KEY'))
    # three hand-tagged examples, then the story to tag, all using the same "Tags:" label
    prompt = (
        f"Passage: {df['Body'].values[0][:number_chars]}\n\nTags: {df['Tags'][0]}\n--\n"
        f"Passage: {df['Body'].values[1][:number_chars]}\n\nTags: {df['Tags'][1]}\n--\n"
        f"Passage: {df['Body'].values[2][:number_chars]}\n\nTags: {df['Tags'][2]}\n--\n"
        f"Passage: {df['Body'].values[index][:number_chars]}\n\nTags:"
    )
    response = co.generate(
        model='large',
        prompt=prompt,
        max_tokens=50,
        temperature=0.8,
        k=0,
        p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop_sequences=["--"],
        return_likelihoods='NONE')
    return response.generations[0].text
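
The prompt in `predict_tags` follows a few-shot pattern: labelled passage and tag blocks separated by `--`, ending with an open `Tags:` label for the model to complete. As a rough sketch of that pattern, the helper below assembles such a prompt from any number of examples. The name `build_prompt` is hypothetical, and the two short demo passages are invented for illustration rather than taken from the corpus.

```python
def build_prompt(examples, passage, number_chars=1500):
    """Build a few-shot prompt from (body, tags) pairs plus the passage to tag."""
    blocks = [
        f"Passage: {body[:number_chars]}\n\nTags: {tags}"
        for body, tags in examples
    ]
    # the final block ends with an open "Tags:" label for the model to complete
    blocks.append(f"Passage: {passage[:number_chars]}\n\nTags:")
    # "--" separates the blocks and matches the stop sequence sent to the API
    return "\n--\n".join(blocks)


demo = build_prompt(
    [("A mother remembers her daughter.", "grief, reminiscent")],
    "A soldier writes home.",
)
```

Building the prompt in one place keeps the label identical in every block and makes it straightforward to vary how many examples are sent.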

Here we predict the tags for all of the stories:

In [15]:
for i in range(len(tags), df.shape[0]):
    predicted_tag = predict_tags(i)
    df.at[i, 'Tags'] = predicted_tag

We can look at the predicted tags here:

In [16]:
df

Unnamed: 0,Article title,Body,URL,Tags
0,Mothering,"Carmen invades my mind a lot, even years after...",https://blog.reedsy.com/short-story/ndit0a/,"Single mother, reminiscent, cancer, death, gri..."
1,10 Ways to Explain Your Husband's Deat...,Content warning: Themes of death***1. Tell the...,https://blog.reedsy.com/short-story/n5toc8/,"Death of parent, death of spouse, contemplativ..."
2,"And the Radio Said, “There’s Another S...","CW: Gun violence, deathThe air hung heavy with...",https://blog.reedsy.com/short-story/eoex6z/,"Gun violence, death of parent, wartime, cynical"
3,Long Distance Phone Call,You can talk to the dead on the pay phone in f...,https://blog.reedsy.com/short-story/vi8uge/,"cancer, death of parent, loss, parent bonding,..."
4,Wish You Were Here,Bonjour from the city of love! I don't have mu...,https://blog.reedsy.com/short-story/7yrery/,"travel, menwritingwomen, delusion, forlorn, ex..."
5,Old Soul,Can you hear me?I remember being reborn a thou...,https://blog.reedsy.com/short-story/j48lmh/,"Rebirth, death, war, redemption, karma, suffer..."
6,Your Cat™️ Customer Service,"Hi, I'm Timmy McHill and drugs and AA started ...",https://blog.reedsy.com/short-story/1toaoe/,"Feline, AA, Drug/Alcohol, Fraud, Alcoholic, Al..."
7,Wish You Were Here,Bonjour from the city of love! I don't have mu...,https://blog.reedsy.com/short-story/7yrery/,"love, breakup, breakup letter, toxic relations..."
8,We/I,Something happened in the lab the other night....,https://blog.reedsy.com/short-story/c2ar53/,"Parallel universe, personal narrative, scienc..."
9,Someone Has Died,TW: This story contains several depictions and...,https://blog.reedsy.com/short-story/iur9u1/,"Death, death of a child, death of a parent, wa..."


We will save the dataframe to a CSV file for use in the Streamlit app:

In [17]:
df.to_csv('reedsy.csv', index=False, encoding='utf-8')

Here we define a standard cosine similarity function that we will use to estimate how close a given list of tags is to another:

In [18]:
def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)
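
Cosine similarity divides the dot product of two count vectors by the product of their magnitudes, so identical counts score 1 and counts with nothing in common score 0. The function above raises `ZeroDivisionError` if either Counter is empty, for example when a story ends up with an empty tag string. A small worked example with a guard for that case might look like the sketch below. The name `safe_cosine_similarity` is hypothetical, and the two tag Counters are made up for illustration.

```python
import math
from collections import Counter


def safe_cosine_similarity(c1, c2):
    """Cosine similarity of two Counters; returns 0.0 if either one is empty."""
    # only keys present in both Counters contribute to the dot product
    dotprod = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    mag1 = math.sqrt(sum(v * v for v in c1.values()))
    mag2 = math.sqrt(sum(v * v for v in c2.values()))
    if mag1 == 0 or mag2 == 0:
        return 0.0  # avoid dividing by zero for an empty tag set
    return dotprod / (mag1 * mag2)


a = Counter({'death': 1, 'grief': 1})
b = Counter({'death': 1, 'wartime': 1})
score = safe_cosine_similarity(a, b)  # one shared tag out of two each, so about 0.5
```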

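We define a function to calculate the cosine similarity between the new story's tags and each story in our tagged corpus, scaled to a percentage. As written, `Counter` is applied to the tag string directly, so the comparison is over character counts rather than whole tags, and the loop starts at index 1, so the first story in the corpus is not scored: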

In [19]:
def calc_sims(new_story_df, df):
    counterA = Counter(new_story_df['Tags'][0])
    
    for i in range(1, df.shape[0]):
        counterB = Counter(df['Tags'][i])
        df.at[i, 'cosine_similarity'] = counter_cosine_similarity(counterA, counterB) * 100
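
Passing a string to `Counter` counts its individual characters, so `Counter(df['Tags'][i])` compares character frequencies rather than tags. One possible variant, sketched here as an assumption rather than the method used for the results in this notebook, is to split each tag string on commas and count whole tags. The helper name `tags_to_counter` is hypothetical.

```python
from collections import Counter


def tags_to_counter(tag_string):
    """Count whole tags in a comma-separated string, ignoring case and spacing."""
    tags = (part.strip().lower() for part in tag_string.split(','))
    return Counter(tag for tag in tags if tag)


by_char = Counter('death, grief')         # keys are single characters such as 'd' and 'e'
by_tag = tags_to_counter('death, grief')  # keys are the tags 'death' and 'grief'
```

The resulting Counters can be passed to `counter_cosine_similarity` unchanged. Because exact tag matches are much rarer than shared letters, scores computed this way would likely be lower and more spread out than the ones reported in this notebook.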

Let's grab a story from Reedsy and see how it compares to our corpus:

In [20]:
new_story = scrape_URL('https://blog.reedsy.com/short-story/lqa7qt/', driver)

Defining a variant of predict_tags, predict_tags_forURL, so that tags can be predicted for a newly scraped story:

In [38]:
def predict_tags_forURL(index, df, scraped_url):
    # index is unused here; it is kept so existing calls keep working
    number_chars = 1500  # characters of each story included in the prompt
    co = cohere.Client(os.getenv('COHERE_API_KEY'))
    # three hand-tagged examples from the corpus, then the newly scraped story
    prompt = (
        f"Passage: {df['Body'].values[0][:number_chars]}\n\nTags: {df['Tags'][0]}\n--\n"
        f"Passage: {df['Body'].values[1][:number_chars]}\n\nTags: {df['Tags'][1]}\n--\n"
        f"Passage: {df['Body'].values[2][:number_chars]}\n\nTags: {df['Tags'][2]}\n--\n"
        f"Passage: {scraped_url['Body'].values[0][:number_chars]}\n\nTags:"
    )
    response = co.generate(
        model='large',
        prompt=prompt,
        max_tokens=20,
        temperature=0.8,
        k=0,
        p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop_sequences=["--"],
        return_likelihoods='NONE')
    return response.generations[0].text

Predicting the tags:

In [39]:
new_story.at[0, 'Tags'] = predict_tags_forURL(0, df, new_story)

In [40]:
new_story

Unnamed: 0,Article title,Body,URL,Tags
0,Liability in Lingotto,When the police forcibly entered 43-year-old P...,https://blog.reedsy.com/short-story/lqa7qt/,"Wartime, investigative journalism, work, death..."


Calculating the similarities between the new story and the rest of the corpus:

In [41]:
calc_sims(new_story, df)

In [42]:
df

Unnamed: 0,Article title,Body,URL,Tags,cosine_similarity
0,Mothering,"Carmen invades my mind a lot, even years after...",https://blog.reedsy.com/short-story/ndit0a/,"Single mother, reminiscent, cancer, death, gri...",
1,10 Ways to Explain Your Husband's Deat...,Content warning: Themes of death***1. Tell the...,https://blog.reedsy.com/short-story/n5toc8/,"Death of parent, death of spouse, contemplativ...",92.715515
2,"And the Radio Said, “There’s Another S...","CW: Gun violence, deathThe air hung heavy with...",https://blog.reedsy.com/short-story/eoex6z/,"Gun violence, death of parent, wartime, cynical",89.674838
3,Long Distance Phone Call,You can talk to the dead on the pay phone in f...,https://blog.reedsy.com/short-story/vi8uge/,"cancer, death of parent, loss, parent bonding,...",84.084207
4,Wish You Were Here,Bonjour from the city of love! I don't have mu...,https://blog.reedsy.com/short-story/7yrery/,"travel, menwritingwomen, delusion, forlorn, ex...",89.628923
5,Old Soul,Can you hear me?I remember being reborn a thou...,https://blog.reedsy.com/short-story/j48lmh/,"Rebirth, death, war, redemption, karma, suffer...",83.205822
6,Your Cat™️ Customer Service,"Hi, I'm Timmy McHill and drugs and AA started ...",https://blog.reedsy.com/short-story/1toaoe/,"Feline, AA, Drug/Alcohol, Fraud, Alcoholic, Al...",91.047989
7,Wish You Were Here,Bonjour from the city of love! I don't have mu...,https://blog.reedsy.com/short-story/7yrery/,"love, breakup, breakup letter, toxic relations...",89.405228
8,We/I,Something happened in the lab the other night....,https://blog.reedsy.com/short-story/c2ar53/,"Parallel universe, personal narrative, scienc...",89.257573
9,Someone Has Died,TW: This story contains several depictions and...,https://blog.reedsy.com/short-story/iur9u1/,"Death, death of a child, death of a parent, wa...",79.131979


We can find the top three most similar stories to this one:

In [43]:
list_of_highest = df['cosine_similarity'].nlargest(3).index

And the associated URLs:

In [44]:
list_of_urls = df['URL'][list_of_highest].values

We can also get a list of the similarity scores themselves:

In [45]:
list_of_similar = df['cosine_similarity'][list_of_highest].values

This will form a list of recommendations to the user:

In [46]:
list_of_recs = []
for url, similarity in zip(list_of_urls, list_of_similar):
    list_of_recs.append(f'{url} : {similarity}')

In [47]:
list_of_recs

['https://blog.reedsy.com/short-story/n5toc8/ : 92.71551465458188',
 'https://blog.reedsy.com/short-story/1toaoe/ : 91.0479893363779',
 'https://blog.reedsy.com/short-story/gm80ij/ : 90.53918186901424']

In [48]:
df['Tags'][list_of_highest].values

array(['Death of parent, death of spouse, contemplative, religious undertones, grief',
       'Feline, AA, Drug/Alcohol, Fraud, Alcoholic, Alcoholism, Death, family, foster home, addiction, bereavement, alcoholism, recovery, 12 steps, 12th step, Twelve Steps, Twelve-',
       'reconstruction, death, grief, feminist, chess, religion, miscarriage\n--'],
      dtype=object)

In [49]:
new_story['Tags'][0]

'Wartime, investigative journalism, work, death, lifestyle, murder, depression, mental health,'
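
The generated tag strings shown above carry some residue from generation: one ends with the stop sequence (`\n--`), another with a dangling comma, and one repeats "Alcoholism" in different cases. A minimal cleanup sketch, assuming tags are comma-separated and that case differences do not matter, is given below. The helper name `clean_tags` is hypothetical.

```python
def clean_tags(raw, stop_sequence="--"):
    """Normalise a generated tag string into a list of unique, lower-cased tags."""
    # keep only the text before the stop sequence, if the model emitted it
    text = raw.split(stop_sequence)[0]
    tags = []
    for part in text.split(","):
        tag = part.strip().lower()
        # skip empty pieces (e.g. after a trailing comma) and repeats
        if tag and tag not in tags:
            tags.append(tag)
    return tags
```

This does not repair tags that were cut off by `max_tokens`, such as the trailing `Twelve-` above; those would need a higher token limit or a separate rule for dropping the last, possibly incomplete, tag.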

We can see that the top three recommendations share themes with the story we submitted: the recommended tags are dark and frequently involve death. This is an encouraging sign that the approach is working, but further investigation is needed to see whether the model can be improved.