<br>Dora Li
<br>CS315 
<br>Mar 12, 2024

## Results of applying cosine similarity between post transcriptions and NYT articles gathered from the NYT API


#### Detailed Steps:
1. Write a Python function that takes a date, for example, "2024-02-12", and returns the list of articles for that day (extracting it from the month’s archive).
2. Write some code that explores whether the fields "abstract" and "snippet" are always the same or often differ. Which one has more information?
3. Write a function that, given one article (in its nested structure), creates a flat dictionary with keys that are relevant for analysis: either the abstract or snippet (see point 2); lead paragraph; headline; keywords concatenated via semicolon; pub_date; document_type; section_name; and type_of_material.
4. Write another function that calls the function from point 3 on every article to create a list of article dictionaries, converts this list into a dataframe, and stores it as a CSV file with the date-month in the title (this is important for point 5 below).
5. Once you have done all of these in the notebook, create a Python script that can be called with a date (from a TikTok video). First, the script checks whether a CSV with cleaned articles is in our folder. If not, it first calls the API function to get the articles and then the function that converts them into a CSV. Then, it loads the CSV into a dataframe and uses filtering to get the articles for the desired date. These articles will be used for the Semantic Similarity portion of the TikTok Project.
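
Step 2 (comparing "abstract" and "snippet") has no code cell in this notebook, so here is a minimal sketch of how the comparison could be done. The function name `compare_abstract_snippet` and the sample records are hypothetical and made up for illustration; real articles would come from the archive response.

```python
from collections import Counter

def compare_abstract_snippet(articles):
    """Tally how the 'abstract' and 'snippet' fields relate across articles."""
    counts = Counter()
    for article in articles:
        # missing or null fields are treated as empty strings
        abstract = article.get("abstract") or ""
        snippet = article.get("snippet") or ""
        if abstract == snippet:
            counts["identical"] += 1
        elif len(abstract) > len(snippet):
            counts["abstract_longer"] += 1
        elif len(snippet) > len(abstract):
            counts["snippet_longer"] += 1
        else:
            counts["different_same_length"] += 1
    return counts

# made-up records for illustration only, not real NYT data
sample = [
    {"abstract": "A short summary.", "snippet": "A short summary."},
    {"abstract": "A longer summary with more detail.", "snippet": "A longer summary"},
    {"abstract": "", "snippet": "Only the snippet is filled in."},
]
print(compare_abstract_snippet(sample))
```

The tally shows at a glance whether the two fields usually agree and, when they differ, which one tends to carry more text.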


#### Import Related Libraries + Setup

In [None]:
import requests
import pandas as pd
from collections import Counter
from datetime import datetime
import os
from sklearn.cluster import KMeans
import numpy as np
from numpy.linalg import norm
import matplotlib.pyplot as plt
import plotly.express as px
import tensorflow as tf
import tensorflow_hub as hub

In [None]:
pd.set_option('display.max_colwidth', None)

#### Function Definitions:
1. Generate CSV from NYT API


In [None]:
def get_articles(date):
    """
    Take a year-month string, for example "2024-02", and return the list of all articles
    for that month (extracted from the month’s archive).
    """
    result = []
    # extract year and month from the date string
    dt = datetime.strptime(date, '%Y-%m')
    
    # constant 
    myAPIKey = "iDALMAL9VFMiwzWionTqK3Ve4tFDUDAQ"
    
    # access NYT API
    URL = f"https://api.nytimes.com/svc/archive/v1/{dt.year}/{dt.month}.json?api-key={myAPIKey}"
    data = requests.get(URL)
    articles = data.json()
    
    # the archive holds the whole month, so keep every article; day filtering happens later
    result.extend(articles['response']['docs'])
            
    return result
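
Since callers usually start from a full date such as "2024-02-12", a small helper that accepts either form may be convenient. The sketch below only builds the request URL and makes no network call; the function name `build_archive_url` and the environment variable `NYT_API_KEY` are assumptions introduced for this example.

```python
import os
from datetime import datetime

def build_archive_url(date, api_key=None):
    """Build the monthly archive URL for a 'YYYY-MM' or 'YYYY-MM-DD' string."""
    # only the first 7 characters (year and month) matter for the archive endpoint
    dt = datetime.strptime(date[:7], "%Y-%m")
    # NYT_API_KEY is a hypothetical environment variable name chosen for this sketch
    if api_key is None:
        api_key = os.environ.get("NYT_API_KEY", "")
    return f"https://api.nytimes.com/svc/archive/v1/{dt.year}/{dt.month}.json?api-key={api_key}"

print(build_archive_url("2024-02-12", api_key="DEMO_KEY"))
```

Reading the key from the environment keeps it out of the notebook, which helps if the notebook is shared.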
    


In [None]:
def get_article_info(article):
    """
    Given one article (in its nested structure), create a flat dictionary with the keys that
    are relevant for analysis: the longer of abstract and snippet (see point 2); lead
    paragraph; headline; keywords concatenated via semicolon; pub_date; document_type;
    section_name; and type_of_material.
    """
    result = {}
    # either the abstract or snippet (see point 2)
    if len(article['abstract']) >= len(article['snippet']):
        result['abstract/snippet']= article['abstract']
    else:
        result['abstract/snippet']= article['snippet']
    result['lead_paragraph'] = article['lead_paragraph']
    result['headline'] = article['headline']['main']
    
    # keywords concatenated via semicolon
    result['keywords'] = ";".join(keyword['value'] for keyword in article['keywords'])
    
    # others
    result['pub_date'] = article['pub_date'][:10]
    result['document_type'] = article['document_type']
    result['section_name'] = article['section_name']
    result['type_of_material'] = article['type_of_material']
    
    return result

In [None]:
def get_articles_df(date):
    """
    Call get_article_info on every article of the month to build a list of article
    dictionaries, convert the list into a dataframe, and store it as a CSV file named
    after the year-month (this is important for point 5 below).
    """
    # get all NYT articles of that month
    articles = get_articles(date)
    # flatten every article, then build the dataframe in one step
    df = pd.DataFrame([get_article_info(article) for article in articles])
    # index=False avoids an extra "Unnamed: 0" column when the CSV is read back
    df.to_csv(f"{date}.csv", index=False)
    return df

2. Cosine Similarity for the words

In [None]:
def cosineSimilarity(vec1, vec2):
    """Calculate the cosine similarity between two vectors."""
    # flatten both inputs so a (1, n) embedding and an (n,) embedding are treated alike
    V1 = np.array(vec1).ravel()
    V2 = np.array(vec2).ravel()
    cosine = np.dot(V1, V2)/(norm(V1)*norm(V2))
    return float(cosine)
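
As a quick sanity check of the formula cos(u, v) = u·v / (‖u‖ ‖v‖), the self-contained sketch below evaluates it on three hand-picked 2-D vectors. The helper name `cosine` is local to this example; vectors pointing the same way should score 1, orthogonal vectors 0, and opposite vectors -1.

```python
import numpy as np
from numpy.linalg import norm

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    return float(np.dot(u, v) / (norm(u) * norm(v)))

a = [1.0, 0.0]
print(cosine(a, [2.0, 0.0]))   # same direction, different length
print(cosine(a, [0.0, 3.0]))   # orthogonal
print(cosine(a, [-1.0, 0.0]))  # opposite direction
```

Note that the score ignores vector length, which is why it suits comparing embeddings of texts of different sizes.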

In [None]:
def generate_similarity(date, meta_data_df):
    """
    For a given date (from a TikTok video), load the cleaned NYT articles for that month from
    the CSV in our folder (creating the CSV through the API functions first if it does not
    exist), filter them to the desired date, and compare them with the TikTok posts of the
    same day using cosine similarity of sentence embeddings.
    
    inputs:
    1. date is the date we are looking at, for example "2024-02-12"
    2. meta_data_df is the subset of the TikTok meta data for the day specified by date
    
    returns the TikTok dataframe with an added 'result' column holding a tuple
    (maximum cosine similarity, headline of the best-matching NYT article)
    """
    # check whether a CSV with cleaned articles for this month is in the working folder
    date_month = date[:7] # get the month specific csv
    if not os.path.exists(f"{date_month}.csv"):
        # not cached yet: query the API and write the CSV
        df_nyt = get_articles_df(date_month)
    else:
        # load the cached NYT articles
        df_nyt = pd.read_csv(f"{date_month}.csv")

    # get NYT and TikTok Meta Data ready
    df_nyt = df_nyt[df_nyt["pub_date"]== date]
    df_nyt = df_nyt.fillna("") # replace nan in the dataframe with empty string
    df_tiktok = meta_data_df
    df_tiktok = df_tiktok.fillna("")
    # print("data done")

    # the embedding function `embed` is loaded once at notebook level, not on every call:
    # embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    # generate embeddings for the NYT content
    abstract_list = embed(df_nyt['abstract/snippet'].tolist())
    lead_paragraph_list = embed(df_nyt['lead_paragraph'].tolist())
    headline_list = embed(df_nyt['headline'].tolist())
    headline_val_list = df_nyt['headline'].tolist()
    # keywords_list = embed(df_nyt['keywords'].tolist())
    # print("embedding done")
    
    def find_max_cosine(row):
        """
        helper function to find maximum cosine similarity index and which article 
        results in the index 
        """
        tiktok = embed([row['suggested_words']])
        # tiktok = embed([row['video_description']])
        max_cosine = 0
        article_index = 0
        num = len(abstract_list) # number of NYT articles to compare to
        # iterate through the articles
        for i in range(num):
            # find the maximum cosineSimilarity between abstract, lead paragraph, and headline
            curr_max_cosine = max(cosineSimilarity(tiktok, abstract_list[i]),
                                  cosineSimilarity(tiktok, lead_paragraph_list[i]),
                                  cosineSimilarity(tiktok, headline_list[i]))
            if curr_max_cosine > max_cosine:
                max_cosine = curr_max_cosine
                article_index = i
        headline = headline_val_list[article_index]
        return (max_cosine, headline)
    
    df_tiktok['result'] = df_tiktok.apply(find_max_cosine, axis=1)
    
    return df_tiktok
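
The inner loop in `find_max_cosine` compares one TikTok embedding against each article in turn. As an alternative sketch (not the method used above), the same search over a single field can be done with one matrix product after normalising the vectors. The function name `best_match` and the toy 3-dimensional vectors are made up for illustration; real embeddings would come from `embed`.

```python
import numpy as np

def best_match(query_vec, article_matrix):
    """Return (index, score) of the row of article_matrix most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float).ravel()
    A = np.asarray(article_matrix, dtype=float)
    # normalise so that a plain dot product equals the cosine similarity
    q = q / np.linalg.norm(q)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    scores = A @ q  # one cosine score per article
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])

# toy "embeddings", made up for illustration
articles = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0]])
query = [0.9, 0.1, 0.0]
print(best_match(query, articles))
```

To mirror the three-field comparison above, one could call this once per field (abstract, lead paragraph, headline) and keep the highest score.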


In [None]:
def get_date(row):
    return row['video_timestamp'][:10]

Try to generate cosine similarity for the entire dataframe

In [None]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [None]:
meta_data = pd.read_csv("results_cleaned.csv")
meta_data["format_date"] = meta_data.apply(get_date, axis=1)
dates = meta_data["format_date"].unique()

In [None]:
result = pd.DataFrame()
for date in dates:
    df_tiktok = meta_data[meta_data['format_date']==date]  
    df = generate_similarity(date, df_tiktok)
    result = pd.concat([result, df])
    print(date)
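
Each entry of the `result` column is a `(max_cosine, headline)` tuple, which is awkward to sort or filter on. The sketch below shows one way to split such a column into two ordinary columns. The toy dataframe and the column names `max_cosine` and `nyt_headline` are assumptions for illustration, not values from the actual run.

```python
import pandas as pd

# toy frame mimicking the shape of `result`; the values are made up
toy = pd.DataFrame({
    "suggested_words": ["election news", "storm damage"],
    "result": [(0.42, "Headline A"), (0.31, "Headline B")],
})

# split each (score, headline) tuple into two named columns
toy[["max_cosine", "nyt_headline"]] = pd.DataFrame(toy["result"].tolist(), index=toy.index)
toy = toy.drop(columns="result")
print(toy.sort_values("max_cosine", ascending=False))
```

With the score in its own numeric column, the posts most similar to a news article can be listed with a single sort.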
    