<br>Dora Li
<br>CS315 
<br>April 1st, 2024

## Results of applying cosine similarity between post transcriptions and NYT articles gathered from the NYT API


#### Detailed Steps:
1. Write a Python function that takes a date, for example, "2024-02-12", and returns the list of articles for that day (extracting it from the month’s archive).
2. Write some code that explores whether the fields "abstract" and "snippet" are always the same or often differ. Which one has more information?
3. Write a function that, given one article (in its nested structure), creates a flat dictionary with the keys relevant for analysis: either the abstract or the snippet (see point 2), lead paragraph, headline, keywords joined by semicolons, pub_date, document_type, section_name, and type_of_material.
4. Write another function that calls the function from point 3 on every article to create a list of article dictionaries, converts this list into a dataframe, and stores it as a CSV file with the year-month in the file name (this is important for point 5 below).
5. Once all of this works in the notebook, create a Python script that can be called with a date (from a TikTok video). First, the script checks whether a CSV with cleaned articles is already in our folder. If not, it first calls the API function to get the articles and then the function that converts them into a CSV. Then it loads the CSV into a dataframe and filters it to get the articles for the desired date. These articles will be used for the Semantic Similarity portion of the TikTok Project.
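
A minimal sketch of the comparison asked for in step 2, assuming each article is a dictionary with string fields `abstract` and `snippet`, as used elsewhere in this notebook. The helper name `compare_abstract_snippet` and the sample records are hypothetical and only serve as an illustration; real counts have to come from running it on the list of articles returned by the API.

```python
from collections import Counter

def compare_abstract_snippet(articles):
    """Count how often 'abstract' and 'snippet' match, and which is longer when they differ."""
    counts = Counter()
    for article in articles:
        # treat a missing or None field as an empty string
        abstract = article.get("abstract") or ""
        snippet = article.get("snippet") or ""
        if abstract == snippet:
            counts["same"] += 1
        elif len(abstract) > len(snippet):
            counts["abstract_longer"] += 1
        elif len(snippet) > len(abstract):
            counts["snippet_longer"] += 1
        else:
            counts["different_same_length"] += 1
    return counts

# made-up sample records, for illustration only
sample_articles = [
    {"abstract": "A storm hit the coast.", "snippet": "A storm hit the coast."},
    {"abstract": "Lawmakers passed the bill after a long debate.", "snippet": "Lawmakers passed the bill"},
    {"abstract": "", "snippet": "Short teaser text."},
]
summary = compare_abstract_snippet(sample_articles)
print(dict(summary))
```

If the two fields turn out to differ often, keeping the longer one per article is one simple way to retain the most text.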


#### Import Related Libraries + Setup

In [1]:
import requests
import pandas as pd
from collections import Counter
from datetime import datetime
import os
from sklearn.cluster import KMeans
import numpy as np
from numpy.linalg import norm
import matplotlib.pyplot as plt
import plotly.express as px
import tensorflow as tf
import tensorflow_hub as hub

In [2]:
pd.set_option('display.max_colwidth', None)

#### Function Definitions:
1. Generate CSV from NYT API


In [3]:
def get_articles(date):
    """
    Takes a year-month string, for example "2024-02", and returns the list of all articles
    for that month from the NYT Archive API. Filtering down to a single day happens later,
    in generate_similarity.
    """
    # extract year and month from the date string
    dt = datetime.strptime(date, '%Y-%m')

    # API key, read from an environment variable so that it is not hard-coded in the notebook
    # (the variable name NYT_API_KEY is an assumption; use whatever name the key is stored under)
    myAPIKey = os.environ["NYT_API_KEY"]

    # access the NYT Archive API for that month
    URL = f"https://api.nytimes.com/svc/archive/v1/{dt.year}/{dt.month}.json?api-key={myAPIKey}"
    articles = requests.get(URL).json()

    # if the first response lacks the expected structure (for example, an error payload),
    # request the month once more before giving up
    if 'response' not in articles:
        articles = requests.get(URL).json()

    # keep every article of the month
    return list(articles['response']['docs'])
    


In [4]:
def get_article_info(article):
    """
    Given one article (in its nested structure), creates a flat dictionary with the keys
    relevant for analysis: the longer of abstract and snippet (see point 2), lead paragraph,
    headline, keywords joined by semicolons, pub_date, document_type, section_name, and
    type_of_material.
    """
    result = {}
    # either the abstract or snippet (see point 2)
    if len(article['abstract']) >= len(article['snippet']):
        result['abstract/snippet']= article['abstract']
    else:
        result['abstract/snippet']= article['snippet']
    result['lead_paragraph'] = article['lead_paragraph']
    result['headline'] = article['headline']['main']
    
    # keywords joined by semicolons
    result['keywords'] = ";".join(keyword['value'] for keyword in article['keywords'])
    
    # others
    result['pub_date'] = article['pub_date'][:10]
    result['document_type'] = article['document_type']
    result['section_name'] = article['section_name']
    result['type_of_material'] = article['type_of_material']
    
    return result

In [5]:
def get_articles_df(date):
    """
    Calls get_article_info on every article of the month to create a list of article
    dictionaries, converts this list into a dataframe, and stores it as a CSV file named
    after the year-month (this is important for point 5 below).
    """
    # get all NYT articles of that month
    articles = get_articles(date)
    # flatten every article and build the dataframe in one step
    df = pd.DataFrame([get_article_info(article) for article in articles])
    # the index carries no information, so it is left out of the CSV
    df.to_csv(f"{date}.csv", index=False)
    return df

2. Cosine Similarity for the words

In [6]:
def cosineSimilarity(vec1, vec2):
    """Calculate the cosine similarity between two vectors."""
    V1 = np.array(vec1)
    V2 = np.array(vec2)
    cosine = np.dot(V1, V2) / (norm(V1) * norm(V2))  # dot product divided by the product of the norms
    return cosine
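
As a quick check of the formula, cosine similarity is $\cos\theta = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$, so vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1. The sketch below restates the function on small hand-made vectors; the name `cosine_similarity` and the example vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two 1-D vectors."""
    v1 = np.asarray(vec1, dtype=float)
    v2 = np.asarray(vec2, dtype=float)
    return float(np.dot(v1, v2) / (norm(v1) * norm(v2)))

# same direction (one vector is a multiple of the other)
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
# perpendicular vectors
orthogonal = cosine_similarity([1, 0], [0, 1])
# opposite directions
opposite = cosine_similarity([1, 0], [-1, 0])

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, a long text and a short text can still be rated as similar as long as their embeddings point the same way.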

In [7]:
def generate_similarity(date, meta_data_df):
    """
    For a date taken from a TikTok video, finds the most similar NYT article for each video
    posted on that date. First checks whether a CSV with the cleaned articles for that month
    is in our folder. If not, fetches the articles from the API and stores them as a CSV
    via get_articles_df. Then filters the articles to the desired date and compares them
    with the TikTok metadata.

    inputs:
    1. date is the date we are looking at, for example "2024-02-12"
    2. meta_data_df is the subset of the TikTok metadata for the day specified by date
    """
    # check if a CSV w/ cleaned articles for this month is in the folder
    date_month = date[:7] # the CSVs are stored per month
    csv_path = f"{date_month}.csv"
    if not os.path.exists(csv_path):
        df_nyt = get_articles_df(date_month)
    else:
        df_nyt = pd.read_csv(csv_path)

    # get NYT and TikTok Meta Data ready
    df_nyt = df_nyt[df_nyt["pub_date"]== date]
    df_nyt = df_nyt.fillna("") # replace nan in the dataframe with empty string
    df_tiktok = meta_data_df
    df_tiktok = df_tiktok.fillna("")
    # print("data done")

    # embed is the Universal Sentence Encoder, loaded once at notebook level with
    # hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    # generate embeddings for the NYT content
    abstract_list = embed(df_nyt['abstract/snippet'].tolist())
    lead_paragraph_list = embed(df_nyt['lead_paragraph'].tolist())
    headline_list = embed(df_nyt['headline'].tolist())
    headline_val_list = df_nyt['headline'].tolist()
    # keywords_list = embed(df_nyt['keywords'].tolist())
    # print("embedding done")
    
    def find_max_cosine(row):
        """
        helper function to find maximum cosine similarity index and which article 
        results in the index 
        """
        tiktok = embed([row['suggested_words']])
        # tiktok = embed([row['video_description']])
        # tiktok = embed([row['hashtags']])
        # tiktok = embed([row['Seperated_Hashtags'][1:-1].replace("'","")])
        max_cosine = 0
        article_index = 0
        num = len(abstract_list) # number of NYT articles to compare to
        # iterate through the articles
        for i in range(num):
            # find the maximum cosineSimilarity among abstract, lead paragraph, and headline
            curr_max_cosine = max(cosineSimilarity(tiktok, abstract_list[i]),
                                  cosineSimilarity(tiktok, lead_paragraph_list[i]),
                                  cosineSimilarity(tiktok, headline_list[i]))
            if curr_max_cosine > max_cosine:
                max_cosine = curr_max_cosine
                article_index = i
        headline = headline_val_list[article_index]
        return (max_cosine, headline)
    
    df_tiktok['result'] = df_tiktok.apply(find_max_cosine, axis=1)
    
    return df_tiktok
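
The inner loop of `find_max_cosine` compares one TikTok embedding with every article embedding one at a time. A possible vectorized variant is sketched below, assuming the embeddings have first been converted to NumPy arrays (for TensorFlow tensors, `.numpy()` does this). The function name `best_match` and the random vectors standing in for sentence embeddings are hypothetical.

```python
import numpy as np

def best_match(query_vec, field_matrices):
    """
    query_vec: 1-D array of length d (one TikTok embedding).
    field_matrices: list of (n_articles, d) arrays, one per NYT text field
                    (for example abstract, lead paragraph, headline).
    Returns (highest cosine similarity, index of the best-matching article).
    """
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)  # unit-length query
    sims_per_field = []
    for mat in field_matrices:
        m = np.asarray(mat, dtype=float)
        m = m / np.linalg.norm(m, axis=1, keepdims=True)  # unit-length rows
        sims_per_field.append(m @ q)  # cosine of every article with the query
    # for each article, keep the best score across the text fields
    best_per_article = np.max(np.stack(sims_per_field), axis=0)
    idx = int(np.argmax(best_per_article))
    return float(best_per_article[idx]), idx

# random stand-ins for embeddings: 5 articles, 8 dimensions, 3 text fields
rng = np.random.default_rng(0)
abstracts = rng.normal(size=(5, 8))
leads = rng.normal(size=(5, 8))
headlines = rng.normal(size=(5, 8))
query = 2.0 * headlines[3]  # same direction as the headline of article 3

score, article_index = best_match(query, [abstracts, leads, headlines])
print(score, article_index)
```

One difference from the loop version: the loop starts from `max_cosine = 0`, so when every similarity is zero or negative it reports 0 and the first article, whereas this sketch reports the actual maximum.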


In [8]:
def get_date(row):
    return row['video_timestamp'][:10]

#### Generate cosine similarity for the entire dataframe

In [9]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [10]:
# meta_data = pd.read_csv("results_cleaned.csv") # our grp's result
# meta_data = pd.read_csv("results_26301.csv")
# meta_data = pd.read_csv("results_33534.csv")
# meta_data = pd.read_csv("results_38129.csv")
meta_data = pd.read_csv("metadata.csv")
meta_data["format_date"] = meta_data.apply(get_date, axis=1)
dates = meta_data["format_date"].unique()

In [11]:
len(dates)

957

In [12]:
result = pd.DataFrame()
for i, date in enumerate(dates):
    print(date)
    df_tiktok = meta_data[meta_data['format_date']==date]  
    try:
        df = generate_similarity(date, df_tiktok)
        result = pd.concat([result, df])
        if i%100 == 0:
            result.to_csv("cosine_result_all_users.csv")
            print(i)
    except ValueError:
        print(date)

2024-02-29
0
2024-01-30
2023-12-11
2023-11-27
2023-12-24
2024-03-02
2024-02-15
2023-12-23
2024-01-28
2024-01-19
2024-03-03
2024-03-01
2024-01-26
2023-12-21
2024-02-25
2024-02-28
2024-02-19
2024-02-26
2024-01-12
2024-02-23
2024-02-09
2024-02-22
2024-02-10
2024-01-23
2024-01-24
2024-02-12
2024-02-21
2024-01-31
2024-02-27
2024-02-01
2023-12-15
2024-02-24
2024-01-16
2023-06-26
2022-12-29
2023-10-30
2023-06-16
2023-11-10
2023-12-31
2023-07-25
2023-05-05
2023-05-13
2023-10-12
2023-11-26
2023-10-18
2023-07-22
2023-05-25
2023-08-24
2023-01-25
2023-08-17
2023-03-30
2022-01-03
2023-12-13
2023-09-25
2023-04-14
2023-04-12
2023-08-16
2020-01-13
2022-06-09
2024-02-13
2024-01-02
2024-02-16
2024-02-17
2024-01-29
2024-02-06
2023-12-10
2024-01-25
2024-01-20
2024-01-14
2022-09-27
2021-07-28
2022-04-19
2023-11-02
2022-06-29
2022-10-21
2022-10-17
2022-09-26
2022-08-28
2022-09-01
2022-07-26
2022-07-21
2022-05-18
2021-12-27
2021-11-22
2021-11-18
2021-11-17
2021-11-14
2021-11-11
2021-09-22
2021-09-17
2021-10-26
2021-09-24
2021-09-16
2021-09-13
2021-08-25
2021-08-23
2021-09-01
2021-09-03
2021-02-23
2021-02-24
2022-04-30
100
2024-02-18
2022-05-04
2022-04-15
2024-02-20
2023-12-20
2024-02-14
2023-07-18
2023-07-16
2022-10-28
2023-01-19
2023-01-20
2023-01-22
2024-01-09
2023-12-16
2024-02-11
2023-10-08
2022-12-27
2024-01-21
2024-02-02
2023-12-05
2024-01-05
2024-01-01
2023-05-06
2023-05-10
2023-05-18
2023-05-09
2023-04-24
2023-06-02
2023-05-07
2023-05-11
2023-05-08
2023-08-26
2023-08-09
2023-08-10
2023-10-09
2023-07-23
2023-01-01
2022-11-22
2022-09-19
2022-09-20
2022-09-17
2022-09-22
2023-09-05
2023-09-17
2023-01-30
2024-01-08
2023-03-27
2023-01-21
2023-07-10
2022-12-05
2022-12-22
2022-12-19
2023-03-28
2023-03-31
2023-09-09
2022-11-05
1969-12-31
2023-12-26
2023-10-01
2021-01-26
2021-12-22
2023-08-15
2024-01-17
2024-02-05
2024-02-07
2023-07-13
2023-08-02
2023-08-13
2023-08-23
2023-08-27
2023-09-01
2023-09-11
2023-09-20
2023-09-27
2023-10-04
2023-10-05
2023-10-13
2023-10-14
2023-11-06
2023-11-13
2023-11-14
2023-11-30
2023-12-17
2024-01-07
2024-01-22
2023-02-13
2023-02-03
2022-06-26
2023-10-27
2023-10-02
2022-01-24
2021-10-27
2023-11-16
2023-05-31
2021-09-19
2024-01-15
2024-02-04
2024-01-11
2023-07-14
2024-01-06
200
2023-12-22
2023-12-07
2023-12-30
2021-11-12
2021-11-13
2022-07-04
2024-01-10
2022-04-27
2023-12-18
2023-09-07
2023-06-22
2023-08-21
2023-11-28
2023-08-04
2023-11-07
2023-10-17
2023-10-19
2023-10-21
2023-10-23
2023-10-24
2023-10-25
2023-10-29
2023-11-11
2023-12-19
2024-02-08
2024-01-04
2023-12-28
2023-12-14
2024-01-13
2023-12-27
2023-12-08
2023-11-22
2023-11-24
2023-07-30
2023-04-17
2023-05-02
2020-03-18
2023-06-07
2023-07-29
2023-08-28
2023-09-02
2022-07-13
2024-02-03
2023-07-09
2023-12-06
2023-07-24
2024-01-27
2023-11-12
2024-01-03
2023-01-12
2022-10-26
2023-06-06
2023-06-28
2023-02-26
2023-01-03
2023-01-02
2022-12-11
2022-12-07
2022-11-06
2022-09-29
2022-03-20
2022-03-05
2022-02-20
2021-12-17
2023-11-21
2023-01-06
2023-08-06
2023-05-21
2022-11-07
2023-04-01
2022-11-28
2022-04-03
2023-04-07
2023-04-08
2022-12-18
2023-04-09
2023-11-29
2023-06-19
2023-10-06
2023-06-14
2023-01-11
2023-04-04
2023-01-29
2022-07-03
2023-03-15
2022-12-31
2021-10-14
2021-10-19
2022-06-11
2022-01-07
2023-10-11
2023-11-09
2023-01-23
2023-09-29
2023-09-26
2023-11-25
2023-08-14
2023-11-19
2022-10-27
2023-02-16
300
2023-04-13
2023-03-06
2023-05-17
2023-08-18
2023-06-29
2023-02-22
2023-02-10
2023-01-28
2023-02-17
2023-08-01
2022-06-23
2023-07-21
2022-12-15
2022-11-29
2022-03-27
2022-02-27
2021-12-07
2023-05-15
2023-05-14
2023-05-12
2023-09-03
2023-08-30
2023-08-22
2023-08-20
2023-07-27
2023-12-01
2022-09-03
2022-08-27
2022-07-02
2023-07-08
2023-07-06
2023-07-19
2022-06-16
2022-06-14
2023-06-18
2023-08-07
2023-12-29
2023-02-19
2023-05-04
2022-05-10
2023-03-17
2021-11-24
2022-09-12
2021-11-25
2023-03-26
2022-01-12
2023-04-03
2021-11-23
2023-05-20
2023-06-15
2021-12-23
2023-04-05
2023-05-19
2023-04-10
2023-09-22
2023-05-01
2023-03-01
2023-04-11
2022-05-23
2022-06-27
2019-07-09
2023-10-26
2022-05-11
2023-04-22
2023-12-12
2023-09-13
2023-11-20
2023-11-18
2023-06-17
2023-06-11
2021-08-24
2023-10-31
2022-12-28
2023-03-08
2023-09-21
2022-04-21
2023-01-13
2023-04-23
2023-09-19
2023-05-22
2022-03-24
2022-05-20
2023-11-23
2022-11-03
2023-03-21
2022-11-12
2023-09-10
2022-01-26
2023-02-08
2023-06-09
2021-01-14
2021-01-13
2022-04-01
2024-01-18
2023-12-04
2022-05-27
2021-04-20
2023-01-31
2021-08-27
2023-03-20
400
2023-06-04
2023-09-12
2023-09-06
2023-06-24
2023-08-29
2023-12-02
2023-07-07
2021-05-04
2023-12-03
2021-05-07
2023-12-25
2023-04-15
2023-06-08
2023-03-10
2023-10-03
2020-09-15
2023-02-27
2023-02-07
2022-01-22
2023-04-18
2023-11-08
2023-12-09
2023-08-12
2022-02-16
2023-03-24
2022-01-31
2021-12-28
2023-05-24
2023-05-30
2022-10-20
2023-06-12
2020-06-18
2021-08-31
2023-04-16
2022-08-03
2023-01-14
2023-06-13
2022-06-30
2021-04-02
2022-01-14
2020-04-28
2022-08-09
2021-04-07
2021-10-05
2023-08-25
2021-02-20
2022-06-01
2020-06-14
2023-06-21
2022-04-08
2023-05-28
2023-06-05
2022-10-22
2023-09-24
2020-10-10
2023-05-27
2023-07-31
2023-09-16
2023-11-01
2023-07-26
2020-02-19
2020-10-19
2020-11-10
2022-06-03
2020-10-08
2022-04-23
2020-05-15
2022-11-16
2022-07-22
2022-08-07
2022-10-07
2021-02-26
2023-07-05
2023-07-04
2023-11-15
2021-07-12
2020-11-16
2023-11-17
2023-06-27
2023-07-02
2023-07-28
2022-11-20
2023-06-25
2023-10-15
2022-07-28
2021-12-30
2022-01-27
2023-02-09
2022-08-18
2022-02-12
2022-11-19
2021-04-22
2023-09-30
2020-05-20
2023-05-16
2023-09-14
2023-01-17
2023-10-07
2022-12-09
2020-07-18
500
2023-03-02
2022-12-06
2021-03-21
2020-09-05
2023-10-20
2023-04-06
2023-04-21
2022-10-11
2022-05-07
2022-09-14
2019-11-04
2023-10-16
2023-08-31
2023-11-05
2021-05-25
2021-02-17
2020-05-09
2023-03-07
2022-06-22
2022-09-24
2023-04-27
2021-02-14
2023-07-17
2023-03-14
2023-04-30
2023-09-18
2023-09-15
2022-02-21
2022-04-11
2022-06-05
2022-08-06
2022-09-05
2023-11-04
2022-03-01
2022-01-09
2022-02-13
2022-12-12
2023-09-04
2020-12-08
2021-10-20
2023-02-01
2023-06-01
2023-03-29
2023-05-03
2023-08-08
2023-08-19
2023-07-01
2023-07-03
2023-09-23
2023-05-26
2023-10-22
2022-05-17
2022-07-15
2022-06-19
2022-08-05
2023-09-08
2023-08-05
2022-04-18
2023-10-28
2020-01-16
2023-04-29
2023-07-12
2023-03-09
2022-08-26
2023-06-23
2021-12-21
2022-04-16
2023-11-03
2021-07-18
2023-09-28
2020-08-25
2021-03-05
2022-03-07
2022-01-10
2023-01-09
2023-01-18
2023-01-04
2021-10-03
2021-01-18
2023-02-20
2022-11-24
2023-01-24
2023-01-16
2022-11-23
2023-03-13
2022-10-12
2022-10-03
2023-06-30
2023-01-10
2021-04-12
2021-01-31
2023-06-03
2021-02-15
2022-09-06
2022-09-04
2023-01-05
2021-06-18
2023-01-15
2023-08-11
2021-05-23
600
2022-12-21
2022-12-16
2023-06-10
2022-12-04
2021-12-24
2022-11-21
2022-10-31
2023-07-11
2022-06-08
2022-04-13
2023-10-10
2023-02-11
2021-11-19
2023-02-02
2023-02-04
2022-01-13
2023-02-25
2022-10-23
2022-10-16
2023-03-04
2023-01-27
2021-05-29
2022-09-13
2022-05-19
2022-06-04
2020-05-10
2023-04-25
2022-06-02
2021-03-27
2022-01-19
2023-02-06
2023-03-16
2021-07-13
2021-07-27
2021-08-17
2021-07-21
2021-07-05
2021-07-22
2021-07-10
2021-07-04
2021-06-12
2023-05-23
2023-07-20
2023-02-21
2023-04-19
2022-12-08
2023-07-15
2023-01-07
2021-12-18
2022-11-25
2020-10-26
2020-10-28
2021-05-08
2020-05-29
2022-07-17
2022-03-08
2021-09-18
2022-08-12
2022-05-16
2022-03-17
2021-10-21
2020-11-27
2022-03-25
2022-05-28
2022-08-20
2022-09-10
2022-12-23
2022-07-25
2023-02-12
2022-08-19
2022-11-04
2022-05-09
2021-07-02
2022-09-16
2022-12-02
2020-02-03
2021-03-29
2022-03-26
2022-05-06
2021-11-27
2022-08-16
2022-07-01
2022-07-30
2022-09-18
2022-10-10
2022-10-01
2023-02-18
2022-12-30
2023-04-28
2022-12-25
2021-09-12
2023-04-26
2023-08-03
2023-06-20
2022-11-11
2023-03-12
2023-02-28
2023-02-23
2022-08-31
2021-03-01
700
2021-03-02
2021-03-03
2021-01-29
2021-11-15
2023-01-08
2022-05-21
2022-10-13
2023-04-20
2022-08-14
2022-03-10
2021-11-26
2020-12-23
2022-08-01
2021-08-10
2022-11-08
2021-12-05
2023-03-03
2022-08-11
2022-03-19
2022-12-03
2022-08-24
2022-08-25
2020-11-23
2021-10-04
2021-07-29
2020-10-20
2021-11-02
2022-09-21
2021-11-08
2021-12-26
2022-08-15
2022-01-16
2022-08-23
2023-03-11
2023-03-22
2022-06-20
2022-02-17
2021-01-07
2022-07-10
2023-03-19
2022-08-21
2021-07-19
2020-05-28
2020-11-25
2021-09-05
2021-09-09
2021-07-01
2022-11-17
2023-03-05
2022-11-27
2021-09-04
2021-03-26
2021-10-10
2022-09-28
2021-10-11
2023-03-23
2021-10-30
2020-02-07
2022-03-21
2020-09-29
2022-08-02
2021-01-08
2021-05-03
2023-03-25
2023-02-14
2022-03-11
2021-08-30
2022-03-14
2021-10-24
2022-12-26
2021-05-02
2022-07-19
2021-04-25
2022-10-30
2022-02-03
2023-03-18
2022-12-13
2022-01-17
2020-06-24
2022-05-03
2022-11-01
2022-10-14
2022-09-15
2021-11-05
2021-05-11
2021-04-10
2021-08-12
2022-08-22
2021-07-11
2022-01-28
2022-02-02
2022-03-13
2021-09-21
2022-02-19
2022-06-10
2022-08-13
2022-07-11
2022-07-08
2022-08-17
2022-09-07
800
2020-12-24


  df_tiktok = df_tiktok.fillna("")


2021-04-24


  df_tiktok = df_tiktok.fillna("")


2020-09-14


  df_tiktok = df_tiktok.fillna("")


2022-03-29


  df_tiktok = df_tiktok.fillna("")


2022-01-25


  df_tiktok = df_tiktok.fillna("")


2023-05-29
2021-04-26


  df_tiktok = df_tiktok.fillna("")


2021-06-07


  df_tiktok = df_tiktok.fillna("")


2020-08-22


  df_tiktok = df_tiktok.fillna("")


2020-07-26


  df_tiktok = df_tiktok.fillna("")


2022-06-18


  df_tiktok = df_tiktok.fillna("")


2022-07-09


  df_tiktok = df_tiktok.fillna("")


2022-02-09


  df_tiktok = df_tiktok.fillna("")


2022-02-01


  df_tiktok = df_tiktok.fillna("")


2022-02-24


  df_tiktok = df_tiktok.fillna("")


2022-07-18
2021-05-31


  df_tiktok = df_tiktok.fillna("")


2022-10-08


  df_tiktok = df_tiktok.fillna("")


2021-01-22


  df_tiktok = df_tiktok.fillna("")


2022-11-13


  df_tiktok = df_tiktok.fillna("")


2022-12-01
2022-12-20


  df_tiktok = df_tiktok.fillna("")


2021-12-08


  df_tiktok = df_tiktok.fillna("")


2022-10-25


  df_tiktok = df_tiktok.fillna("")


2022-01-04
2022-11-09


  df_tiktok = df_tiktok.fillna("")


2022-08-04


  df_tiktok = df_tiktok.fillna("")


2022-01-29


  df_tiktok = df_tiktok.fillna("")


2022-11-10
2021-12-12


  df_tiktok = df_tiktok.fillna("")


2022-11-18


  df_tiktok = df_tiktok.fillna("")


2022-09-30


  df_tiktok = df_tiktok.fillna("")


2022-07-20


  df_tiktok = df_tiktok.fillna("")


2020-10-27


  df_tiktok = df_tiktok.fillna("")


2022-10-04


  df_tiktok = df_tiktok.fillna("")


2022-02-25
2022-09-08


  df_tiktok = df_tiktok.fillna("")


2021-11-16


  df_tiktok = df_tiktok.fillna("")


2022-03-22
2022-10-19


  df_tiktok = df_tiktok.fillna("")


2021-02-03


  df_tiktok = df_tiktok.fillna("")


2022-11-26
2022-11-02


  df_tiktok = df_tiktok.fillna("")


2022-10-18


  df_tiktok = df_tiktok.fillna("")


2022-11-14


  df_tiktok = df_tiktok.fillna("")


2022-11-30


  df_tiktok = df_tiktok.fillna("")


2022-11-15
2021-02-08


  df_tiktok = df_tiktok.fillna("")


2022-03-16


  df_tiktok = df_tiktok.fillna("")


2021-06-16


  df_tiktok = df_tiktok.fillna("")


2022-12-10


  df_tiktok = df_tiktok.fillna("")


2021-06-06


  df_tiktok = df_tiktok.fillna("")


2022-02-05


  df_tiktok = df_tiktok.fillna("")


2022-07-24


  df_tiktok = df_tiktok.fillna("")


2020-10-09


  df_tiktok = df_tiktok.fillna("")


2020-02-25


  df_tiktok = df_tiktok.fillna("")


2020-12-19
2021-08-22


  df_tiktok = df_tiktok.fillna("")


2022-05-02


  df_tiktok = df_tiktok.fillna("")


2021-01-27


  df_tiktok = df_tiktok.fillna("")


2022-02-15


  df_tiktok = df_tiktok.fillna("")


2022-01-20


  df_tiktok = df_tiktok.fillna("")


2022-07-06


  df_tiktok = df_tiktok.fillna("")


2022-09-11


  df_tiktok = df_tiktok.fillna("")


2022-02-08


  df_tiktok = df_tiktok.fillna("")


2022-06-17


  df_tiktok = df_tiktok.fillna("")


2022-10-09
2021-09-10


  df_tiktok = df_tiktok.fillna("")


2021-06-23


  df_tiktok = df_tiktok.fillna("")


2022-08-10


  df_tiktok = df_tiktok.fillna("")


2021-12-19


  df_tiktok = df_tiktok.fillna("")


2021-11-07


  df_tiktok = df_tiktok.fillna("")


2022-05-24


  df_tiktok = df_tiktok.fillna("")


2023-04-02


  df_tiktok = df_tiktok.fillna("")


2022-01-30


  df_tiktok = df_tiktok.fillna("")


2020-10-31


  df_tiktok = df_tiktok.fillna("")


2022-09-23


  df_tiktok = df_tiktok.fillna("")


2022-10-02
2023-02-15
2023-02-24
2022-12-14


  df_tiktok = df_tiktok.fillna("")


2021-11-10


  df_tiktok = df_tiktok.fillna("")


2022-07-23
2022-12-17


  df_tiktok = df_tiktok.fillna("")


2021-12-20


  df_tiktok = df_tiktok.fillna("")


2020-09-30


  df_tiktok = df_tiktok.fillna("")


2022-02-22


  df_tiktok = df_tiktok.fillna("")


2021-07-23


  df_tiktok = df_tiktok.fillna("")


2021-06-22


  df_tiktok = df_tiktok.fillna("")


2021-06-29


  df_tiktok = df_tiktok.fillna("")


2022-10-05


  df_tiktok = df_tiktok.fillna("")


2021-04-16


  df_tiktok = df_tiktok.fillna("")


2022-04-09


  df_tiktok = df_tiktok.fillna("")


2022-01-06
2022-05-13
2021-02-28
2021-07-07
2022-01-02
2022-03-23
2022-10-06
900
2021-08-08
2021-01-01
2022-10-24
2022-06-12
2020-07-20
2020-03-22
2021-01-20
2020-12-15
2020-01-20
2021-06-20
2022-06-25
2022-06-28
2019-03-30
2018-07-07
 The Bache
 The Bache
2022-07-14
2020-07-06
M!
M!
2022-06-07
2022-05-05
2020-12-04
2022-12-24
2021-08-04
2021-10-01
2022-02-26
2024-03-07
2024-03-05
2024-03-04
2024-03-06
2021-03-06
2020-04-03
2020-09-20
2022-10-29
2021-08-09
2022-09-25
2020-06-03
2020-08-03
2021-03-17
2021-03-25
2020-09-13
2021-10-25
2021-05-10
2021-04-03
2021-09-11
2020-02-28
2022-09-02
2020-11-20
2022-08-30
2021-10-09
2022-03-30
2022-03-02
2021-08-13
2021-12-09
2022-05-22
2020-07-01
video_time
video_time


In [None]:
d1 = pd.read_csv("cosine_result_all_users_2.csv")
d2 = pd.read_csv("cosine_result_all_users.csv")
dfinal = pd.concat([d1, d2])  # combine the two result files loaded above
dfinal.to_csv("cosine_result_all_users_final.csv")

#### Analysis of Data

Clean the data: pull the maximum similarity score and the matching NYT headline out of the `result` tuple.

In [13]:
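def get_max(row):
    """Return the maximum cosine similarity as a plain Python float."""
    return row['result'][0].item()

def get_headline(row):
    """Return the NYT headline paired with the maximum similarity."""
    return row['result'][1]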

In [14]:
result['max_similarity'] = result.apply(get_max,axis = 1)
result['NYT_headline'] = result.apply(get_headline,axis = 1)
result.to_csv('cosine_result_all_users.csv')

In [15]:
df = pd.read_csv('cosine_result_all_users.csv')
df.head()


Unnamed: 0.1,Unnamed: 0,video_id,video_timestamp,video_duration,video_locationcreated,suggested_words,video_diggcount,video_sharecount,video_commentcount,video_playcount,...,author_heartcount,author_videocount,author_diggcount,author_verified,hashtags,data_user,format_date,result,max_similarity,NYT_headline
0,0,7.34e+18,2024-02-29T15:12:05,58.0,US,"kendalljenner, Kylie Jenner, kylie and timothee, timothee chalamet, Rise And Shine, kylie jenner kids, Kylie Jenner And Timaothée Chalamet Seen Together, kylie and travis, kylie jenner rise and shine, kylie jenner timothee",6000000.0,9642.0,10800.0,40200000.0,...,,,,True,,user_0001,2024-02-29,"(array([0.23529173], dtype=float32), 'Word of the Day: liminal')",0.235292,Word of the Day: liminal
1,1,7.34e+18,2024-02-29T15:12:05,58.0,US,"Kendalljenner, kylie jenner, kylie and timothee, timothee chalamet, Rise And Shine, kylie jenner kids, Kylie Jenner And Timaothée Chalamet Seen Together, kylie and travis, kylie jenner rise and shine, kylie jenner timothee",6000000.0,9642.0,10800.0,40200000.0,...,,,,True,,user_0001,2024-02-29,"(array([0.23529173], dtype=float32), 'Word of the Day: liminal')",0.235292,Word of the Day: liminal
2,2,7.34e+18,2024-02-29T20:32:30,36.0,US,"froyo, froyo yolo",65200.0,1650.0,746.0,253500.0,...,,,,False,,user_0001,2024-02-29,"(array([0.21686819], dtype=float32), 'Leap Day')",0.216868,Leap Day
3,36,7.34e+18,2024-02-29T14:21:20,0.0,CA,"Cake Boss, cousin anthony, cake boss dropping cake, cousin anthony cake boss, cousin anthony from cake boss dropped cake, cousin anthony cake boss now, cake boss anthony, cake boss funny moments, cake boss cakes, cake from cake boss",710600.0,7097.0,2445.0,4400000.0,...,,,,False,"cakeboss, tlc, childhood, tv, tvshow, iconic, memories, nostalgia, trend",user_0001,2024-02-29,"(array([0.25542623], dtype=float32), 'Richard Lewis and ‘The (Blank) From Hell’')",0.255426,Richard Lewis and ‘The (Blank) From Hell’
4,44,7.34e+18,2024-02-29T20:04:35,14.0,US,"isabelle mccalla, grant gustin broadway, Grant Gustin, broadway, Performing On Broadway, Broadway Musical Theatre, Broadway Theatre, broadway tiktok, Broadway Musicals, Broadway Shows",43700.0,322.0,109.0,454700.0,...,,,,False,"TechTok, Broadway, WaterForElephants",user_0001,2024-02-29,"(array([0.48280385], dtype=float32), 'Cast Album Roundup: ‘Sweeney Todd,’ ‘Parade,’ ‘Camelot’ and More')",0.482804,"Cast Album Roundup: ‘Sweeney Todd,’ ‘Parade,’ ‘Camelot’ and More"
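
Once the frame has been written to CSV and read back, the `result` column holds text such as `"(array([0.23529173], dtype=float32), ...)"` rather than a tuple, as the table above shows, so `get_max` only works on the in-memory frame. A minimal sketch, assuming the stored text keeps the `array([...])` form seen above, of recovering the score with a regular expression. `parse_score` is a hypothetical helper and is not part of the notebook.

```python
import re

# Hypothetical helper: matches the first number inside "array([ ... ])".
_SCORE = re.compile(r"array\(\[\s*([-+0-9.eE]+)")

def parse_score(text):
    """Return the score embedded in the text form of the result tuple."""
    match = _SCORE.search(text)
    # NaN signals that the text did not have the expected shape.
    return float(match.group(1)) if match else float("nan")

sample = "(array([0.23529173], dtype=float32), 'Word of the Day: liminal')"
score = parse_score(sample)
print(score)
```

In this notebook the `max_similarity` column is already saved alongside `result`, so the parsing step is only needed when that column is missing.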


In [16]:
# drop those with NaN values for suggested_words
df = df[df['suggested_words'].notna()]

General descriptive statistics of the maximum similarity scores:

In [17]:
df['max_similarity'].describe()

count    46749.000000
mean         0.274732
std          0.077270
min          0.109096
25%          0.223708
50%          0.263259
75%          0.308530
max          0.779948
Name: max_similarity, dtype: float64

In [19]:
df.to_csv("cosine_result_all_users.csv")

#### Closer look at the rows with a similarity score of 1

1. Calculate cosine similarity

In [None]:
df = result[result['max_similarity']==float(1)]
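
Similarity scores computed in `float32` can land a hair below or above 1 even for identical vectors, in which case an exact `== 1` filter would miss them. Whether that happens in this dataset is not known from the notebook. The following is only a sketch on toy values of a tolerance-based filter using `np.isclose`, with the tolerance `1e-6` chosen arbitrarily.

```python
import numpy as np
import pandas as pd

# Toy scores: exactly 1, the largest float32 below 1, and an ordinary value.
toy = pd.DataFrame(
    {"max_similarity": np.array([1.0, 0.99999994, 0.5], dtype=np.float32)}
)

exact = toy[toy["max_similarity"] == 1.0]                        # strict equality
close = toy[np.isclose(toy["max_similarity"], 1.0, atol=1e-6)]   # within tolerance

print(len(exact), len(close))
```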

In [None]:
def cosineSimilarity(vec1, vec2):
    """Calculate the cosine similarity between two vectors."""
    V1 = np.array(vec1)
    V2 = np.array(vec2)
    # V2 is transposed so that two (1, d) embedding rows produce a (1, 1) result
    cosine = np.dot(V1, V2.T) / (norm(V1) * norm(V2))
    return cosine

In [None]:
v1 = embed([''])
v2 = embed(['Fall Fashion Has the Spirit of Adventure'])
cosineSimilarity(v1, v2)
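
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The function above calls `norm` on the whole array, which matches the vector length only when each input holds a single row. As a sketch, assuming embeddings arrive as 2-D arrays with one row per text, a row-wise version can compare one post against several headlines at once. `cosine_rows` is a hypothetical name, and the vectors below are toy values rather than real embeddings.

```python
import numpy as np

def cosine_rows(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    # One length per row, kept as a column so it broadcasts over the result.
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    return (a @ b.T) / (a_norm * b_norm.T)

# One "post" vector against three "headline" vectors.
sims = cosine_rows([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(sims)
```

The result has one row per post and one column per headline, so the best match for each post is `sims.argmax(axis=1)`.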

2. Drop the rows with a similarity score of 1 and rerun `generate_similarity` for them

In [None]:
# original results
orig_result = pd.read_csv('cosine_result_69117.csv')
result = orig_result.drop(orig_result[orig_result['max_similarity']==1].index)

In [None]:
# check that the correct rows were dropped
orig_result = pd.read_csv('cosine_result_69117.csv')
index = orig_result[orig_result['max_similarity']==1].index
dropped = orig_result.loc[index]  # index holds row labels, so select with .loc

In [None]:
dropped['max_similarity'].describe()

In [None]:
# rerun generate_similarity for the dates of the dropped rows
# (method 1: abandoned, because it also regenerates rows for other videos on those dates, which makes merging hard)
dates = dropped["format_date"].unique()
new = pd.DataFrame()
for i, date in enumerate(dates):
    print(date)
    df_tiktok = meta_data[meta_data['format_date']==date]
    df = generate_similarity(date, df_tiktok)
    new = pd.concat([new, df])  # accumulate each date's rows in `new`
    if i%100 == 0:
        new.to_csv("cosine_result_all_users_one.csv")
        print(i)
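
Growing a DataFrame with `pd.concat` inside a loop copies all earlier rows on every iteration. A common alternative, sketched here under assumptions, is to collect the per-date frames in a list and concatenate once, with an occasional checkpoint. `fake_similarity` is a hypothetical stand-in for `generate_similarity`, and the dates are toy values.

```python
import pandas as pd

def fake_similarity(date):
    # Hypothetical stand-in for generate_similarity(date, df_tiktok).
    return pd.DataFrame({"format_date": [date], "max_similarity": [0.5]})

dates = ["2022-01-01", "2022-01-02", "2022-01-03"]

parts = []          # one DataFrame per date
checkpoint = None   # last intermediate snapshot
for i, date in enumerate(dates):
    parts.append(fake_similarity(date))
    if i % 2 == 0:
        # In the notebook this is where the snapshot would be written to CSV.
        checkpoint = pd.concat(parts, ignore_index=True)

# A final concat after the loop so the last dates are not lost.
combined = pd.concat(parts, ignore_index=True)
print(len(combined))
```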

In [None]:
new['max_similarity'].describe()

In [None]:
#new['max_similarity'] = float(new.apply(get_max,axis = 1))
#new['NYT_headline'] = new.apply(get_headline,axis = 1)

In [None]:
new = pd.read_csv("cosine_result_69117_one.csv")
result = orig_result.drop(orig_result[orig_result['max_similarity']==1].index)

In [None]:
#new.set_index("video_timestamp", inplace = True)
#result.set_index("video_timestamp", inplace = True)

In [None]:
idx = new.groupby('video_description')['max_similarity'].idxmax().dropna() 
new = new.loc[idx]

In [None]:
idx = result.groupby('video_description')['max_similarity'].idxmax().dropna() 
result = result.loc[idx]
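
The two cells above keep, for each `video_description`, only the row with the highest `max_similarity`. A small sketch on toy data of what `groupby(...).idxmax()` followed by `.loc` returns:

```python
import pandas as pd

toy = pd.DataFrame({
    "video_description": ["a", "a", "b"],
    "max_similarity": [0.2, 0.6, 0.4],
})

# idxmax gives, per group, the row label of the largest score.
idx = toy.groupby("video_description")["max_similarity"].idxmax()
best = toy.loc[idx]
print(best)
```

Row 0 is discarded because row 1 has a higher score for the same description.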

In [None]:
result.update(new)
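
`DataFrame.update` matches rows by index label, so the two frames need to share a meaningful index for the recomputed scores to land on the right rows. A minimal sketch, assuming a unique identifier column is available (called `key` here, a hypothetical name), of indexing both frames by that column before updating:

```python
import pandas as pd

# Toy frames: `old` has two placeholder scores of 1.0, `redo` has recomputed values.
old = pd.DataFrame({"key": ["v1", "v2", "v3"], "max_similarity": [1.0, 0.3, 1.0]})
redo = pd.DataFrame({"key": ["v3", "v1"], "max_similarity": [0.41, 0.27]})

old_by_key = old.set_index("key")
old_by_key.update(redo.set_index("key"))  # overwrite rows whose key matches, in place
merged = old_by_key.reset_index()
print(merged)
```

Rows of `old` without a match in `redo` keep their original values.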

In [None]:
result['max_similarity'].describe()

In [None]:
result.to_csv("cosine_result_all_users_new.csv")

In [None]:
result = pd.read_csv("cosine_result_69117_new.csv")

In [None]:
fig = px.histogram(result, x="max_similarity")
fig.show()

In [None]:
result[result['max_similarity']>0.65]

#### look at results when hashtags are separated

In [None]:
meta_data = pd.read_csv("data_seperated_hashtags.csv")
meta_data["format_date"] = meta_data.apply(get_date, axis=1)
meta_data.dropna(subset=['hashtags',"Seperated_Hashtags"], inplace=True)
dates = meta_data["format_date"].unique()

In [None]:
len(dates)

In [None]:
# raw hashtags
result = pd.DataFrame()
for i, date in enumerate(dates):
    print(date)
    df_tiktok = meta_data[meta_data['format_date']==date]  
    df = generate_similarity(date, df_tiktok)
    result = pd.concat([result, df])
result.to_csv("cosine_result_hashtags.csv")

In [None]:
result['max_similarity'] = result.apply(get_max,axis = 1)
result['NYT_headline'] = result.apply(get_headline,axis = 1)
result.to_csv('cosine_result_hashtags.csv')

In [None]:
result = pd.read_csv("cosine_result_hashtags.csv")
result['max_similarity'].describe()

In [None]:
fig = px.histogram(result, x="max_similarity")
fig.show()

In [None]:
# separated hashtags
result = pd.DataFrame()
for i, date in enumerate(dates):
    print(date)
    df_tiktok = meta_data[meta_data['format_date']==date]  
    df = generate_similarity(date, df_tiktok)
    result = pd.concat([result, df])

result.to_csv("cosine_result_separate_hashtags.csv")


In [None]:
result['max_similarity'] = result.apply(get_max,axis = 1)
result['NYT_headline'] = result.apply(get_headline,axis = 1)
result.to_csv('cosine_result_separate_hashtags.csv')

In [None]:
result = pd.read_csv("cosine_result_separate_hashtags.csv")
fig = px.histogram(result, x="max_similarity")
fig.show()

In [None]:
result["max_similarity"].describe()

### generate visualizations for all users

In [20]:
df = pd.read_csv("cosine_result_all_users.csv")
df.head()


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,video_id,video_timestamp,video_duration,video_locationcreated,suggested_words,video_diggcount,video_sharecount,video_commentcount,...,author_heartcount,author_videocount,author_diggcount,author_verified,hashtags,data_user,format_date,result,max_similarity,NYT_headline
0,0,0,7.34e+18,2024-02-29T15:12:05,58.0,US,"kendalljenner, Kylie Jenner, kylie and timothee, timothee chalamet, Rise And Shine, kylie jenner kids, Kylie Jenner And Timaothée Chalamet Seen Together, kylie and travis, kylie jenner rise and shine, kylie jenner timothee",6000000.0,9642.0,10800.0,...,,,,True,,user_0001,2024-02-29,"(array([0.23529173], dtype=float32), 'Word of the Day: liminal')",0.235292,Word of the Day: liminal
1,1,1,7.34e+18,2024-02-29T15:12:05,58.0,US,"Kendalljenner, kylie jenner, kylie and timothee, timothee chalamet, Rise And Shine, kylie jenner kids, Kylie Jenner And Timaothée Chalamet Seen Together, kylie and travis, kylie jenner rise and shine, kylie jenner timothee",6000000.0,9642.0,10800.0,...,,,,True,,user_0001,2024-02-29,"(array([0.23529173], dtype=float32), 'Word of the Day: liminal')",0.235292,Word of the Day: liminal
2,2,2,7.34e+18,2024-02-29T20:32:30,36.0,US,"froyo, froyo yolo",65200.0,1650.0,746.0,...,,,,False,,user_0001,2024-02-29,"(array([0.21686819], dtype=float32), 'Leap Day')",0.216868,Leap Day
3,3,36,7.34e+18,2024-02-29T14:21:20,0.0,CA,"Cake Boss, cousin anthony, cake boss dropping cake, cousin anthony cake boss, cousin anthony from cake boss dropped cake, cousin anthony cake boss now, cake boss anthony, cake boss funny moments, cake boss cakes, cake from cake boss",710600.0,7097.0,2445.0,...,,,,False,"cakeboss, tlc, childhood, tv, tvshow, iconic, memories, nostalgia, trend",user_0001,2024-02-29,"(array([0.25542623], dtype=float32), 'Richard Lewis and ‘The (Blank) From Hell’')",0.255426,Richard Lewis and ‘The (Blank) From Hell’
4,4,44,7.34e+18,2024-02-29T20:04:35,14.0,US,"isabelle mccalla, grant gustin broadway, Grant Gustin, broadway, Performing On Broadway, Broadway Musical Theatre, Broadway Theatre, broadway tiktok, Broadway Musicals, Broadway Shows",43700.0,322.0,109.0,...,,,,False,"TechTok, Broadway, WaterForElephants",user_0001,2024-02-29,"(array([0.48280385], dtype=float32), 'Cast Album Roundup: ‘Sweeney Todd,’ ‘Parade,’ ‘Camelot’ and More')",0.482804,"Cast Album Roundup: ‘Sweeney Todd,’ ‘Parade,’ ‘Camelot’ and More"


In [23]:
fig = px.histogram(df, x="max_similarity",color="data_user", barmode="group")
fig.show()

In [36]:
dff = df[df["max_similarity"]>0.70]
dff = dff[['suggested_words','max_similarity', "NYT_headline","data_user"]]

In [37]:
dff.drop_duplicates()

Unnamed: 0,suggested_words,max_similarity,NYT_headline,data_user
11505,"Yemen, us attacking yemen, Air Strike Firework, iran, uk and yemen, us attacking yemen aftermath",0.715962,U.S. and Allies Hit Yemen With Airstrikes,user_0002
13228,"ian ousley, where your homies at, kiawentiio, avatar, avatar last airbender, the avatar, the last airbender, air bending avatar, avatar the last airbender interview, air bending",0.701292,"‘Avatar: The Last Airbender’: Been There, Saved That",user_0005
13482,"ian ousley, where your homies at, kiawentiio, avatar, avatar last airbender, the avatar, the last airbender, air bending avatar, avatar the last airbender interview, air bending",0.701292,"‘Avatar: The Last Airbender’: Been There, Saved That",user_0007
14141,"oscar nominees 2024, oscars 2024, greta gerwig, Oscar Awards, 2024 oscar nominations, oscar nominations, margot robbie, America Ferrera, the oscars 2024, margot robbie oscars",0.725701,Oscar Nominees 2024: See the Full List,user_0002
24760,"Robert Downey Jr, critic choice award 2024, cillian murphy critics choice awards, critics choice awards, critics choice awards 2024 tom holland, robert downey jr oppenheimer, iron man, Oppenheimer, Robert Downey Jr. American Actor, critics choice best actor",0.701861,Critics Choice Awards 2024: The Complete Winners List,user_0002
24880,"robert downey jr, critic choice award 2024, cillian murphy critics choice awards, critics choice awards, critics choice awards 2024 tom holland, robert downey jr oppenheimer, iron man, Oppenheimer, Robert Downey Jr. American Actor, critics choice best actor",0.701861,Critics Choice Awards 2024: The Complete Winners List,user_0006
25042,"robert downey jr, critic choice award 2024, cillian murphy critics choice awards, critics choice awards, critics choice awards 2024 tom holland, robert downey jr oppenheimer, iron man, Oppenheimer, Robert Downey Jr. American Actor, critics choice best actor",0.701861,Critics Choice Awards 2024: The Complete Winners List,user_0007
26748,"Malik Brookins, my valentine, best valentine couple, happy valentines day",0.727654,What Californians Love About the Golden State,user_0002
26780,"happy valentines day, Happy Valentine's Day Video, in honor of valentine's day, valentines day dinner",0.779948,What Californians Love About the Golden State,user_0002
26904,"valentine's day gift, happy valentine's day, victor you actually did this",0.762767,What Californians Love About the Golden State,user_0006
