# Trending YouTube Video Statistics
This repository contains a comprehensive analysis of trending YouTube video statistics using Python. The analysis includes data cleaning, exploratory data analysis (EDA), and visualization of key metrics such as views, likes, comments, and more.

# Importing libraries
We import the necessary Python libraries for data manipulation and text analysis:
- `pandas` for loading and handling tabular data.
- `TfidfVectorizer` from `sklearn.feature_extraction.text` for converting text data into TF-IDF vectors.
- `cosine_similarity` from `sklearn.metrics.pairwise` for calculating similarity between text vectors.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Loading the data
The dataset `MXvideos.csv` is loaded using pandas. This file contains trending YouTube video statistics for Mexico, including the video titles.

In [2]:
ruta_csv = 'MXvideos.csv'
df = pd.read_csv(ruta_csv, encoding='latin-1')
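
Reading with `encoding='latin-1'` never raises a decoding error, because every byte maps to some character. If the file was actually saved as UTF-8, though, accented letters and dashes come out garbled. As a minimal sketch (the helper name `decode_with_fallback` and the sample bytes are illustrative and not part of the original notebook), one could try UTF-8 first and fall back to Latin-1 only when that fails:

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")


# Hypothetical sample: a title containing an en dash, stored as UTF-8 bytes.
sample = "Imagine Dragons \u2013 Next To Me".encode("utf-8")

text, used = decode_with_fallback(sample)
garbled = sample.decode("latin-1")  # what a forced Latin-1 read produces

print(used)             # utf-8
print(text == garbled)  # False
```

The encoding that succeeds on a sample of the file could then be passed to `pd.read_csv(ruta_csv, encoding=...)`.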

# Selecting the "title" Column
We extract the `title` column from the dataset, which contains the text of each video title. Missing values are filled with empty strings and all entries are converted to string type to ensure consistency.

In [3]:
titles = df['title'].fillna('').astype(str)

# Vectorization with TF-IDF
The TF-IDF (Term Frequency-Inverse Document Frequency) method is applied to the video titles. This technique transforms the text data into numerical vectors that reflect the importance of each word in the context of all titles, while ignoring common Spanish stopwords.

In [4]:
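# Install NLTK inside the notebook environment (Jupyter magic).
%pip install nltk

import nltk
from nltk.corpus import stopwords

# Download the stopword lists; repeated calls only check that they are up to date.
nltk.download('stopwords')

# Drop common Spanish words before computing TF-IDF weights.
spanish_stopwords = stopwords.words('spanish')
vectorizer = TfidfVectorizer(stop_words=spanish_stopwords)
tfidf_matrix = vectorizer.fit_transform(titles)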
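
To make the weighting concrete, here is a small pure-Python sketch of the formula that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The tokenizer below (lowercase, words of two or more characters), the function names, and the sample titles are simplifying assumptions, so the numbers will not necessarily match scikit-learn on real data.

```python
import math
import re
from collections import Counter


def tokenize(text, stop_words=frozenset()):
    # Simplified tokenizer: lowercase, keep words of 2+ characters.
    return [w for w in re.findall(r"\w\w+", text.lower()) if w not in stop_words]


def tfidf_rows(docs, stop_words=frozenset()):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(d, stop_words) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: tf * idf[t] for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# Hypothetical titles, for illustration only.
toy_titles = [
    "Imagine Dragons - Thunder",
    "Imagine Dragons - Next To Me",
    "La receta de la abuela",
]
rows = tfidf_rows(toy_titles, stop_words={"la", "de"})
for title, row in zip(toy_titles, rows):
    print(title, {t: round(w, 3) for t, w in row.items()})
```

In this toy corpus, `thunder` receives a higher weight than `imagine` in the first title because it appears in only one document, while the stopwords `la` and `de` are dropped entirely.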


# Choosing a Reference Title
A reference title is selected for comparison. In this example, we search for a title containing "Imagine Dragons"; if not found, the first title in the dataset is used. This title will be compared to all others to find the most similar ones.


In [5]:
titulo_referencia = next((t for t in titles if "Imagine Dragons" in t), titles[0])
print("Título de referencia:", titulo_referencia)

Título de referencia: Imagine Dragons, Khalid - Thunder / Young Dumb & Broke (Medley/Audio)
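
The `next(generator, default)` idiom used in this cell returns the first match and falls back to the default when the generator is exhausted. Here is a small sketch of the same pattern, wrapped in a hypothetical helper `first_match` that also ignores case (an addition the original cell does not make):

```python
def first_match(items, needle, default=None):
    """Return the first item containing `needle` (case-insensitive), else `default`."""
    needle = needle.lower()
    return next((item for item in items if needle in item.lower()), default)


# Hypothetical titles, for illustration only.
toy_titles = ["Khalid - Saved", "IMAGINE DRAGONS - Thunder", "we broke up"]

print(first_match(toy_titles, "imagine dragons", toy_titles[0]))
print(first_match(toy_titles, "coldplay", toy_titles[0]))
```

The first call finds the upper-case title; the second finds nothing and returns the fallback, the first title in the list.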


# Calculating Cosine Similarity
Cosine similarity is computed between the TF-IDF vector of the reference title and the vectors of all titles in the dataset, including the reference itself. The metric measures the angle between two vectors; because TF-IDF weights are non-negative, the values here range from 0 (no shared terms) to 1 (same direction, as for identical titles).


In [6]:
query_vec = vectorizer.transform([titulo_referencia])
similitudes = cosine_similarity(query_vec, tfidf_matrix).flatten()
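
For reference, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below computes it for sparse vectors stored as `{term: weight}` dictionaries; the function name and the sample weights are illustrative assumptions rather than part of the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical term weights, for illustration only.
a = {"imagine": 1.0, "dragons": 1.0, "thunder": 2.0}
b = {"imagine": 1.0, "dragons": 1.0}
c = {"receta": 3.0}

print(round(cosine(a, a), 3))  # 1.0
print(round(cosine(a, b), 3))  # 0.577
print(round(cosine(a, c), 3))  # 0.0
```

Because `TfidfVectorizer` L2-normalizes each row by default, the cosine similarity of two TF-IDF rows reduces to their dot product.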

# Obtaining the Top 10 Most Similar Titles
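The indices of the 10 highest-scoring titles are selected after skipping the top-ranked entry, which is normally the reference itself. Because the same title can appear more than once in the dataset, exact copies of the reference title, and repeated titles in general, can still show up among the results. These indices represent the most recommended titles based on textual similarity.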

In [7]:
indices_similares = similitudes.argsort()[::-1][1:11]

# Printing the Top 10 Recommended Titles
The 10 most similar video titles are printed, along with their similarity scores. This provides a recommendation list based on the content of the titles.

In [8]:
print("Títulos más recomendados:")
for idx in indices_similares:
    print(f"- {titles[idx]} (Similitud: {similitudes[idx]:.3f})")

Títulos más recomendados:
- Imagine Dragons, Khalid - Thunder / Young Dumb & Broke (Medley/Audio) (Similitud: 1.000)
- Imagine Dragons, Khalid - Thunder / Young Dumb & Broke (Similitud: 0.912)
- Imagine Dragons, Khalid - Thunder / Young Dumb & Broke (Similitud: 0.912)
- Imagine Dragons - Next To Me (Audio) (Similitud: 0.432)
- Imagine Dragons - Next To Me (Audio) (Similitud: 0.432)
- Imagine Dragons - Next To Me (Similitud: 0.375)
- Imagine Dragons â Next To Me (Lyrics / Lyric VIdeo) (Similitud: 0.313)
- we broke up (Similitud: 0.231)
- Khalid - Saved (Official Video) (Similitud: 0.197)
- Timbiriche - Medley Vaselina (En Vivo) (Similitud: 0.185)
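
The list above repeats several titles and still includes the reference title itself, because slicing off only the first ranked index removes just one copy. One possible refinement, sketched here with a hypothetical helper `top_unique` and made-up scores, is to skip any title equal to the reference and keep only the first occurrence of each remaining title:

```python
def top_unique(titles, scores, reference, k=10):
    """Return up to k (title, score) pairs, best first, without repeats or the reference."""
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    seen = {reference}
    result = []
    for title, score in ranked:
        if title in seen:
            continue  # skip the reference and titles already collected
        seen.add(title)
        result.append((title, score))
        if len(result) == k:
            break
    return result


# Hypothetical titles and scores, for illustration only.
toy_titles = ["A (Audio)", "A", "A", "B", "A (Audio)", "C"]
toy_scores = [1.0, 0.9, 0.9, 0.4, 1.0, 0.2]

for title, score in top_unique(toy_titles, toy_scores, reference="A (Audio)", k=3):
    print(f"- {title} (score: {score:.3f})")
```

With the notebook's variables, this should be callable as `top_unique(titles, similitudes, titulo_referencia)`.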
