## 1) Collect of data from json file 

In [1]:
import pandas as pd 

In [2]:
# Collect json dataset in a dataframe
path_json = 'yt_dataset_transcript.json'

df = pd.read_json(path_json)


In [3]:
df.columns

Index(['channelId', 'channelTitle', 'description', 'latestVideos'], dtype='object')

In [4]:
df.head()

Unnamed: 0,channelId,channelTitle,description,latestVideos
0,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...","[{'videoId': 'Cdf4xy6uFL0', 'title': 'Killua l..."
1,UCpWInGp4Af1t21WDZWKN99w,How To Beat Anime,,"[{'videoId': 'KJdjwvGgfl0', 'title': 'How to b..."
2,UCaq9q1REW3zYv2Jn1OQzThg,Anime Luna,,"[{'videoId': 'M0VXP3mfr9I', 'title': 'Take a r..."
3,UCyRbTEG3_QpuZK7w8D_OYZQ,PikaTV Anime,"Pikagamer here (previously known as PIkaTV) ,w...","[{'videoId': 'wHETMzeXhaY', 'title': 'yokai wa..."
4,UCeU-B6baUdz3S-W9B4ycAlQ,AniMe 🌎,Hello Internet Its your Favorite (or not so fa...,"[{'videoId': 'tF7UCP33tTc', 'title': '🔴LIVE🔴Pl..."


In [5]:
# List to store duplicated rows
rows_to_concat = []

# Iterating through each row of the DataFrame
for _, row in df.iterrows():
    # Retrieve information not related to "latestVideos"
    non_video_info = row[['channelId', 'channelTitle', 'description']]

    # Iterating through each video in "latestVideos"
    for video in row['latestVideos']:
        # Creating a new row by combining non-video information with video information
        new_row = pd.Series({
            'channel_id': non_video_info['channelId'],
            'channel_title': non_video_info['channelTitle'],
            'channel_Description': non_video_info['description'],
            'video_id': video.get('videoId', None),
            'video_title': video.get('title', None),
            'video_description': video.get('description', None),
            'video_publishedAt': video.get('publishedAt', None),
            'video_transcript': video.get('transcript', None),
        })
        # Adding the new row to the list
        rows_to_concat.append(new_row)

# Creating a new DataFrame by concatenating the duplicated rows
new_df = pd.DataFrame(rows_to_concat)


In [6]:
df_2 = new_df.copy()

In [7]:
df_2.head()

Unnamed: 0,channel_id,channel_title,channel_Description,video_id,video_title,video_description,video_publishedAt,video_transcript
0,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...",Cdf4xy6uFL0,Killua lets you do his hair ll Killua x listener,THIS WAS A REQUESTED BY: Cyria\n\nThank you so...,2023-11-15T02:16:16Z,L hey what are you doing I can see that but ye...
1,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...",TkQ8lUCAXAs,He just want to talk to them 🥲,,2023-10-28T05:21:41Z,you think the wind is ever trying to tell us s...
2,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...",mNNRne4M4J8,Doppelgänger: Part 1￼,#animeaudios #killuaxlistener #doppelgangers \...,2023-10-25T03:54:51Z,there's no way we're going to fix this tonight...
3,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...",kcIDDMaxWSw,Doppelgängers ll MOIVE TRAILER ll,ARE YOU READY TO BE SPOOKED??😱,2023-10-10T01:11:34Z,
4,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...",Cb7RHJcdjws,Talking to Sleepy Killua ll Killua x Listener,,2023-10-09T22:01:57Z,what do you want no what I want to sleep just ...


## 2) Preprocessing 

Dans cette démarche, nous opterons pour la concaténation de plusieurs colonnes afin de créer une colonne de documents consolidés, permettant ainsi à l'algorithme de mieux appréhender le contenu de chaque chaîne. Les colonnes prises en compte sont les suivantes : **"channel_title",    "channel_Description",  "video_title",  "video_description",  "video_transcript".**

In [8]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import re
import emoji
from tqdm import tqdm

In [9]:
# Creation of a 
def preprocess_text(text):
    """
    Description: 
        Preprocessing function for our columns and including all the necessary steps
    Args: 
        -text: the text of each row in a column 
    Return: 
        -processed_text: the text preprocessed 
    """
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    
    # Remove emojis
    text = emoji.demojize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in text.split() if word not in stop_words]
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Join the tokens to form a sentence
    processed_text = ' '.join(tokens)
    return processed_text



In [10]:
# Combine columns to form documents
df_2['combined_text'] = df_2['channel_title'] + ' ' + df_2['channel_Description'] + ' ' + df_2['video_title'] + ' ' + df_2['video_description'] + ' ' + df_2['video_transcript']

# Fill NaN values with an empty string
df_2['combined_text'] = df_2['combined_text'].fillna('')

# Apply pretreatment to the combined column with tqdm for progress tracking
tqdm.pandas(desc="Processing text")
df_2['processed_text'] = df_2['combined_text'].progress_apply(preprocess_text)

Processing text: 100%|██████████| 5233/5233 [00:20<00:00, 254.29it/s]


In [11]:
df_2.head()

Unnamed: 0,channel_id,channel_title,channel_Description,video_id,video_title,video_description,video_publishedAt,video_transcript,combined_text,processed_text
0,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...",Cdf4xy6uFL0,Killua lets you do his hair ll Killua x listener,THIS WAS A REQUESTED BY: Cyria\n\nThank you so...,2023-11-15T02:16:16Z,L hey what are you doing I can see that but ye...,"Anime Audios HI My LUVZ, It's Anime Audios her...",anime audio hi luvz anime audio amazing audio ...
1,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...",TkQ8lUCAXAs,He just want to talk to them 🥲,,2023-10-28T05:21:41Z,you think the wind is ever trying to tell us s...,"Anime Audios HI My LUVZ, It's Anime Audios her...",anime audio hi luvz anime audio amazing audio ...
2,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...",mNNRne4M4J8,Doppelgänger: Part 1￼,#animeaudios #killuaxlistener #doppelgangers \...,2023-10-25T03:54:51Z,there's no way we're going to fix this tonight...,"Anime Audios HI My LUVZ, It's Anime Audios her...",anime audio hi luvz anime audio amazing audio ...
3,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...",kcIDDMaxWSw,Doppelgängers ll MOIVE TRAILER ll,ARE YOU READY TO BE SPOOKED??😱,2023-10-10T01:11:34Z,,,
4,UC1P99hQE2TJ342mUrWqFN8w,Anime Audios,"HI My LUVZ, It's Anime Audios here with amazin...",Cb7RHJcdjws,Talking to Sleepy Killua ll Killua x Listener,,2023-10-09T22:01:57Z,what do you want no what I want to sleep just ...,"Anime Audios HI My LUVZ, It's Anime Audios her...",anime audio hi luvz anime audio amazing audio ...


## 3) Cosinus similarity approach 

This method measures the similarity between two vectors in a vector space, often used with vector representations of documents. It evaluates the angle between vectors, providing a measure of proximity. Values ​​closer to 1 indicate higher similarity.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [13]:
def recommend_channels_cos_sim(query, channel_data, top_n=5):
    """
    Description: 
        Function for computing cosine similarity between query and data and return names of the most corresponding channels
    Args: 
        - query: the user's query 
        - channel_data: the preprocessed DataFrame
    Return: 
        - top_channels_df: DataFrame containing the most corresponding channels and videos to the user's query
    """
    # Preprocess the query
    query = preprocess_text(query)

    # Use the preprocessed text directly from the dataframe
    all_texts = [query] + channel_data['processed_text'].tolist()

    # Create a TF-IDF matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(all_texts)

    # Calculate cosine similarity between the query and channel descriptions
    similarities = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1:])[0]

    # Get indices of the most similar channels
    top_indices = similarities.argsort()[-top_n:][::-1]

    # Get corresponding channels and videos
    top_channels_df = channel_data.iloc[top_indices][['channel_title', 'video_title', 'video_description', 'video_publishedAt']]

    return top_channels_df


## 4) BM25 approach

Best Matching 25 (BM25) is a ranking function widely used for information retrieval tasks, particularly in search engines. Unlike traditional term frequency-inverse document frequency (TF-IDF) methods, BM25 considers document length normalization and term saturation. It calculates a relevance score for each document based on the frequency of query terms, adjusting for document length and term saturation. BM25 assigns higher weights to rare terms and penalizes document lengths. This scoring mechanism allows BM25 to effectively handle varying document lengths and provide improved retrieval results, making it a popular choice for search engines and recommendation systems where balancing term importance and document structure is crucial.

In [14]:
from rank_bm25 import BM25Okapi

In [15]:
def recommend_channels_bm25(query, channel_data, top_n=5):
    """
    Description: 
        Function for recommending YouTube channels and videos based on BM25 similarity scores between the user's query and preprocessed channel data.
    Args: 
        - query: the user's query 
        - channel_data: the preprocessed DataFrame containing channel information
    Return: 
        - top_channels_df: DataFrame containing the most relevant channels and videos to the user's query
    """
    # Query preprocessing
    query = preprocess_text(query)

    # Use the preprocessed text directly from the dataframe
    all_texts = [query] + channel_data['processed_text'].tolist()
    
    # Create a corpus for BM25
    corpus = [query] + all_texts

    # Create a BM25 model
    bm25 = BM25Okapi(corpus)

    # Calculate BM25 scores between the query and channel descriptions
    scores = bm25.get_scores(query)

    # Retrieve indices of the most similar channels
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]

    # Get corresponding channels and videos
    top_channels_df = channel_data.iloc[top_indices][['channel_title', 'video_title', 'video_description', 'video_publishedAt']]

    return top_channels_df


## 5) LSI model approach 

**Latent Semantic Indexing (LSI)** is a dimensionality reduction technique applied to text data. It utilizes Singular Value Decomposition (SVD) to identify and capture latent semantic structures in a term-document matrix, reducing the dimensionality while preserving semantic meaning. By transforming documents and terms into a lower-dimensional space, LSI enables efficient computation of semantic similarity between them. This approach enhances the understanding of content relationships and is commonly used for tasks such as information retrieval and document clustering in natural language processing.

In [16]:
from gensim import corpora, models, similarities

In [17]:
def recommend_channels_lsi(query, channel_data, top_n=5):
    """
    Description:
        Function for computing LSI similarity between query and data and return names of the most corresponding channels
    Args:
        - query: the user's query
        - channel_data: the preprocessed DataFrame
        - top_n: the number of top channels to recommend
    Return:
        - top_channels_df: DataFrame containing the most corresponding channels and videos to the user's query
    """
    # Preprocess the query
    query = preprocess_text(query)

    # Create a dictionary and a corpus for LSI
    dictionary = corpora.Dictionary(channel_data['processed_text'].apply(str.split))
    corpus = [dictionary.doc2bow(text) for text in channel_data['processed_text'].apply(str.split)]

    # Create an LSI model
    lsi_model = models.LsiModel(corpus, id2word=dictionary, num_topics=200)

    # Transform the query into LSI space
    query_bow = dictionary.doc2bow(query.split())
    query_lsi = lsi_model[query_bow]

    # Transform the channel descriptions into LSI space
    channel_lsi = [lsi_model[doc] for doc in corpus]

    # Calculate similarity between the query and channel descriptions in LSI space
    index = similarities.MatrixSimilarity(channel_lsi)
    similarities_lsi = index[query_lsi]

    # Get indices of the most similar channels
    top_indices = sorted(range(len(similarities_lsi)), key=lambda i: similarities_lsi[i], reverse=True)[:top_n]

    # Get corresponding channels and videos
    top_channels_df = channel_data.iloc[top_indices][['channel_title', 'video_title', 'video_description', 'video_publishedAt']]

    return top_channels_df


## 6) Execution and comparaison of results 

In [19]:
# user_query is the input from the user
user_query = input("Enter your query: ")

tqdm.pandas(desc="Processing demand ...")

# Using the recommend_channels functions to get channel recommendations
recommended_channels_cos_sim_df = recommend_channels_cos_sim(user_query, df_2)
recommended_channels_bm25_df = recommend_channels_bm25(user_query, df_2)
recommend_channels_lsi_df = recommend_channels_lsi(user_query, df_2)


Enter your query:  attack of titans


### a) Results for each method 

In [20]:
# Using similarity cosinus 
recommended_channels_cos_sim_df.head()

Unnamed: 0,channel_title,video_title,video_description,video_publishedAt
117,Let's Watch Some Anime,Attack on Titan Entire Series Recap | GET READ...,YA'LL WE ARE APPROACHING THE END OF ATTACK ON ...,2023-11-01T17:00:48Z
4757,Anime Esteem,Eren Jaeger | Journey of Eren Jaeger | Attack ...,"Leave a like and subscribe to our channel, If ...",2021-05-15T12:19:50Z
731,The Anime Chamber,Is this the Best Attack on Titan Statue?? Open...,Hi Everyone! This is by far my favorite statue...,2023-11-23T23:00:28Z
1599,Tamil Anime Gaming,Attack on Titan will come back! | Attack on Ti...,breakin down the end credits scene of attack o...,2023-12-07T14:27:54Z
1678,Foxen Anime,INSANE! My Attack on Titan PART 4 TRAILER REAC...,Attack on Titan Final Season THE FINAL CHAPTER...,2023-07-02T23:30:14Z


In [21]:
# Using BM25 method 
recommended_channels_bm25_df.head()

Unnamed: 0,channel_title,video_title,video_description,video_publishedAt
1201,kumaru - anime on piano,Ashitaka and San アシタカとサン Princess Mononoke ものの...,Hello everyone! I recorded one of my newfound ...,2023-09-16T16:00:39Z
2553,Anime Sensei,"He is the Most Powerful Exorcist in the Clan, ...",The Most Powerful Exorcist is Forced to Live i...,2023-11-30T21:02:59Z
44,AniMe 🌎,"🔴LIVE🔴NOT Chilling, Playing Dragon Ball Xenove...",I Hope You Guys Enjoy The Stream ;)\n\n(Beats ...,2023-10-01T06:28:55Z
731,The Anime Chamber,Is this the Best Attack on Titan Statue?? Open...,Hi Everyone! This is by far my favorite statue...,2023-11-23T23:00:28Z
723,Funkey Anime,Jobless Guy is Reincarnated into Another World...,Bullied Guy Gets Reincarnated Into Another Wor...,2023-12-15T04:00:07Z


In [22]:
# Using LSI model 
recommend_channels_lsi_df

Unnamed: 0,channel_title,video_title,video_description,video_publishedAt
117,Let's Watch Some Anime,Attack on Titan Entire Series Recap | GET READ...,YA'LL WE ARE APPROACHING THE END OF ATTACK ON ...,2023-11-01T17:00:48Z
1237,Anime Culture Corner,The Broken Mind of Eren Yeager.,Attack on Titan: THE FINAL CHAPTERS Special 1 ...,2023-03-05T23:00:29Z
3567,FictionRanker | Anime & Waifu Comparison Channel,Who Was The Boy And The Dog In The AOT Ending?,,2023-11-28T16:46:09Z
1599,Tamil Anime Gaming,Attack on Titan will come back! | Attack on Ti...,breakin down the end credits scene of attack o...,2023-12-07T14:27:54Z
4757,Anime Esteem,Eren Jaeger | Journey of Eren Jaeger | Attack ...,"Leave a like and subscribe to our channel, If ...",2021-05-15T12:19:50Z


### b) Comparaisons and conclusion 

We note that our results obtained, the methods where we have approximately similar results are **cosine similarity** and the use of the **LSI model**.      
However, cosine similarity calculation is particularly suitable for video search, because it measures the similarity between representation vectors of videos based on their description or other textual characteristics, it takes less time to run than other methods and gives more relevant results.
The **LSI models** and **the BM25 method** can also be adapted, depending on the complexity of our application and the specifics of our dataset. However, cosine similarity remains a popular and effective approach for video search due to its simplicity and performance.