# 📝 Assignment 5

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BevRice/CMI_Course/blob/main/docs/source/notebooks/Assignment5_Week3.ipynb)

## EDA Exercises in Python

⏳ Estimated Duration: 2 Hours  
🎯 Due: Friday, 28 March at 11:59pm

📌 **Assignment Overview**
In this assignment, you will conduct an Exploratory Data Analysis (EDA) on the YouTube Trending Videos in the US dataset found [here](https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset?select=US_youtube_trending_data.csv).

The dataset contains engagement metrics such as views, likes, dislikes, and comment counts. The full Kaggle dataset covers multiple countries; this assignment uses the US file.

Your goal is to analyze trends, detect anomalies, and gain insights into what makes a video trend.

**Hint:** Open Lesson 7 side by side with this assignment in Google Colab and run the appropriate code from the lesson.

*Questions in italics are mental notes and are not graded in this assignment*

### Load Data

In [1]:
# Import the pandas library
# Enter code here

**Question 1: What file type is this data?**  
US_youtube_trending_data.csv

Double click and enter answer here

In [2]:
# Load dataset
youtube_df = pd.read_csv("https://raw.githubusercontent.com/BevRice/CMI_Course/refs/heads/main/docs/source/data/US_youtube_trending_data_sample.csv")

NameError: name 'pd' is not defined

### Understand Your Data

*The first step of EDA is understanding the dataset structure.  What functions or methods should be run to do this?*

In [3]:
# Display basic information about the dataset
# Enter code here

In [4]:
# Display the first 5 rows of data
# Enter code here

In [5]:
# Display 5 sample rows by running -> youtube_df.sample(5)
# Enter code here

In [6]:
# Display basic statistics of the numerical columns
# Enter code here

*Look at the mean, median, and standard deviation.  Are there any surprising outliers or trends?*
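To illustrate why comparing the mean and median matters, here is a minimal sketch on made-up numbers (the values below are hypothetical and are not taken from the YouTube dataset). A single viral video can pull the mean far above the median, which is a common sign of a skewed distribution.

```python
import pandas as pd

# Hypothetical view counts: six ordinary videos and one viral outlier
views = pd.Series(
    [1_000, 1_200, 1_500, 1_800, 2_000, 2_500, 5_000_000],
    name="view_count",
)

# describe() reports count, mean, std, min, quartiles (50% is the median), and max
summary = views.describe()

mean_views = views.mean()
median_views = views.median()

print(summary)
print(f"mean={mean_views:.0f}, median={median_views:.0f}")
```

When the mean is many times larger than the median, as it is here, a few extreme values are probably driving it, so the median is usually the better description of a "typical" video.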

**Question 2: What are some key observations so far?**

Double click and enter answer here

### Dealing with Duplicates

In real-world datasets, duplicate entries can introduce bias in analysis. To check for exact duplicate rows, use the duplicated() method.
Run the following code to count the number of duplicate rows in the dataset:

In [7]:
# Check for completely identical rows
duplicate_rows = youtube_df[youtube_df.duplicated()]
print(f"Total exact duplicate rows: {duplicate_rows.shape[0]}")

NameError: name 'youtube_df' is not defined
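As a small illustration of what `duplicated()` counts, the sketch below uses a made-up four-row table (the column values are hypothetical). Only rows that match an earlier row in every column are flagged; a video that appears again with a different date or like count is not an exact duplicate.

```python
import pandas as pd

# Hypothetical rows: the second row is an exact copy of the first,
# the third is the same video on a different day with different likes
toy = pd.DataFrame({
    "video_id": ["a1", "a1", "a1", "b2"],
    "trending_date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-01"],
    "likes": [10, 10, 25, 7],
})

# True only for rows that are identical to an earlier row in every column
exact_dupes = toy.duplicated()
n_exact = int(exact_dupes.sum())

# drop_duplicates() removes those rows and keeps the first copy
deduped = toy.drop_duplicates()

print(f"Exact duplicate rows: {n_exact}")
print(deduped)
```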

In [8]:
# Drop identical rows by running -> youtube_df = youtube_df.drop_duplicates()
# Enter code here

In [9]:
# Verify the new shape
print(f"New dataset size after removing duplicates: {youtube_df.shape}")

NameError: name 'youtube_df' is not defined

After checking for exact duplicate rows, the next step is to determine if certain videos appear multiple times in the dataset.

📌 **Why does this matter?**

- Some videos may trend on multiple days, meaning they are not exact duplicates but still appear more than once.
- Understanding how often videos trend can provide insights into content virality and platform engagement trends.

**Task**: Identifying Videos That Appeared Multiple Times

Now, let's check how many times each video ID appears in the dataset. This will help us find videos that repeatedly trended over time.

💡 Run the following code to count occurrences of each video_id:

In [10]:
# Count occurrences of each video_id
duplicate_videos = youtube_df["video_id"].value_counts()

# Display videos that trended multiple times
multiple_trending_videos = duplicate_videos[duplicate_videos > 1]
print(f"Total videos that appeared more than once: {len(multiple_trending_videos)}")
multiple_trending_videos.head(10)  # Show top repeated videos

NameError: name 'youtube_df' is not defined

**Question 3: Why might there be duplicate video entries in the dataset?**

Double click and enter answer here

Depending on our analysis goals, we may choose to:
1. Keep only the first entry of each video to analyze the time it takes for a video to trend after publishing and to examine its initial engagement metrics.
2. Use the latest entry of each video to assess the most up-to-date engagement statistics and understand how a video performed over time.
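The difference between the two options can be sketched on a tiny made-up table (the values below are hypothetical). After sorting by date, `keep="first"` retains each video's earliest trending row and `keep="last"` retains its most recent one.

```python
import pandas as pd

# Hypothetical data: each video trends on two different days
toy = pd.DataFrame({
    "video_id": ["a1", "b2", "a1", "b2"],
    "trending_date": ["2024-01-03", "2024-01-02", "2024-01-01", "2024-01-04"],
    "view_count": [900, 50, 100, 80],
})

# Sort chronologically so "first" means earliest and "last" means latest
ordered = toy.sort_values("trending_date")

# Option 1: first appearance of each video (initial engagement)
first_seen = ordered.drop_duplicates(subset="video_id", keep="first")

# Option 2: latest appearance of each video (most up-to-date engagement)
last_seen = ordered.drop_duplicates(subset="video_id", keep="last")

print(first_seen)
print(last_seen)
```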

In [11]:
# We will proceed with Option 1: Keeping only the first entry of each video
# This allows us to analyze the time between publishing and trending, along with initial engagement metrics.

# Sort by trending date and keep only the first instance of each video
youtube_df = youtube_df.sort_values("trending_date").drop_duplicates(subset="video_id", keep="first")

NameError: name 'youtube_df' is not defined

### Handle Missing Values

In [12]:
# Count the number of null values in each column by running -> youtube_df.isnull().sum()
# Enter code here

**Question 4: What are some options for handling the missing data?**

Double click and enter answer here

In [13]:
# Fill the missing values
# Enter code here

### Standardize Values

In [14]:
# Convert trending_date to datetime format, removing 'Z' and parsing correctly
youtube_df["trending_date"] = pd.to_datetime(youtube_df["trending_date"].str.replace("Z", ""), format="%Y-%m-%dT%H:%M:%S")

NameError: name 'pd' is not defined
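To see what this conversion does, here is a minimal sketch on two made-up timestamp strings in the same `...T...Z` layout. The second approach is an alternative that is not used in this notebook: it lets pandas read the trailing `Z` as UTC and then drops the timezone, which should give the same result.

```python
import pandas as pd

# Hypothetical timestamp strings in the same layout as the dataset
raw = pd.Series(["2024-03-01T00:00:00Z", "2024-03-02T13:30:00Z"])

# Approach used in this notebook: strip the trailing 'Z', then parse with an explicit format
naive = pd.to_datetime(raw.str.replace("Z", ""), format="%Y-%m-%dT%H:%M:%S")

# Alternative: parse 'Z' as UTC, then remove the timezone information
via_utc = pd.to_datetime(raw, utc=True).dt.tz_localize(None)

print(naive)
print(naive.dt.hour.tolist())
```

Once a column has a datetime type, the `.dt` accessor gives access to parts such as the year, month, day name, and hour.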

In [15]:
# Convert publishedAt to datetime format, removing 'Z' and parsing correctly
# Enter code here

### Explore Distributions

In [16]:
# Import matplotlib and seaborn by running this code
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
# Plot histograms for numerical columns
# Enter code here

*Does the data follow normal distributions?  Are there extreme outliers?  If so, what could explain these?*

**Question 5: What are some key observations from the histograms?**

Double click and enter answer here

In [18]:
# Print list of column names by running -> youtube_df.columns
# Enter code here

### Explore Engagement Metrics

In [19]:
# Plot boxplots of engagement metrics (view_count, likes, what else???)
# Enter code here

**Question 6: What are some key observations from the boxplots?**

Double click and enter answer here

In [20]:
# Find the 10 videos with the most likes by running this code
top_videos = youtube_df.sort_values(by="likes", ascending=False).head(10)
top_videos

NameError: name 'youtube_df' is not defined

In [21]:
# Display specific columns of top 10 videos by likes by running this code
top_videos[["title", "channelTitle", "likes", "view_count", "comment_count"]]

NameError: name 'top_videos' is not defined

In [22]:
# Adjust the following code to find videos with the highest comments
# top_videos = youtube_df.sort_values(by="likes", ascending=False).head(10)
# top_videos
# Enter code here

In [23]:
# Adjust the code above to find videos with the highest views
# Enter code here

### Feature Engineering

In [24]:
# Create new columns for year, month, day of the week, and hour of trending_date
# Enter code here

In [25]:
# Verify your new columns are a part of the dataset
youtube_df.columns

NameError: name 'youtube_df' is not defined

In [26]:
# Make sure trending_date is in datetime format ('Z' is read as UTC, then the timezone is dropped)
# Unlike .str.replace, this is safe to re-run if the column was already converted above
youtube_df["trending_date"] = pd.to_datetime(youtube_df["trending_date"], utc=True).dt.tz_localize(None)

# Make sure publishedAt is in datetime format in the same way
youtube_df["publishedAt"] = pd.to_datetime(youtube_df["publishedAt"], utc=True).dt.tz_localize(None)

NameError: name 'pd' is not defined

Now that we have converted the trending_date and publishedAt columns into proper datetime format, we can use them to gain deeper insights into how long it takes for a video to trend after being published.

📌 Why is this important?
- Not all videos immediately trend after being uploaded.
- Some videos go viral quickly, while others take days or weeks to gain traction.
- Understanding the time-to-trend can help us analyze patterns in content virality and the impact of the YouTube algorithm.

**Next Step: Calculating Time to Trend**

We will create a new column, time_to_trend, which calculates the difference between when a video was published and when it first appeared in the trending list.

💡 Run the following code to compute this:

In [27]:
# Create a new column for time between publish and trending
youtube_df['time_to_trend'] = youtube_df['trending_date'] - youtube_df['publishedAt']

NameError: name 'youtube_df' is not defined

Now that we've calculated time_to_trend, which represents the difference between when a video was published and when it trended, we need to make this value more interpretable.

📌 Why Convert to Days and Hours?
- The raw time difference is currently stored as a Timedelta object, which is useful for calculations but not intuitive for quick analysis.
- Converting this into days and hours allows us to:
  - Compare how long different videos take to trend.
  - Analyze trends at a daily or hourly level.
  - Identify patterns, such as whether certain categories or video types tend to trend faster.

**Next Step: Extracting Days and Hours from Time Difference**

We will now convert time_to_trend into total days and hours using the total_seconds() function, which allows us to break down the difference into meaningful time units.

💡 Run the following code:

In [28]:
# Calculate total time difference in days, including partial days
youtube_df["days_to_trend"] = youtube_df["time_to_trend"].dt.total_seconds() / 86400  # Convert seconds to days

# Calculate total time difference in hours
youtube_df["hours_to_trend"] = youtube_df["time_to_trend"].dt.total_seconds() / 3600  # Convert seconds to hours

# Display results
youtube_df[["video_id", "time_to_trend", "days_to_trend", "hours_to_trend"]].head()

NameError: name 'youtube_df' is not defined
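As a worked example of the arithmetic, the sketch below applies the same steps to two made-up rows (the timestamps are hypothetical). A gap of 36 hours should come out as 1.5 days, and a gap of 6 hours as 0.25 days.

```python
import pandas as pd

# Hypothetical publish and trending timestamps
toy = pd.DataFrame({
    "publishedAt": pd.to_datetime(["2024-03-01 12:00:00", "2024-03-05 18:00:00"]),
    "trending_date": pd.to_datetime(["2024-03-03 00:00:00", "2024-03-06 00:00:00"]),
})

# Subtracting two datetime columns yields a Timedelta column
toy["time_to_trend"] = toy["trending_date"] - toy["publishedAt"]

# total_seconds() turns each Timedelta into a plain number of seconds
seconds = toy["time_to_trend"].dt.total_seconds()
toy["days_to_trend"] = seconds / 86400   # 86,400 seconds in a day
toy["hours_to_trend"] = seconds / 3600   # 3,600 seconds in an hour

print(toy[["time_to_trend", "days_to_trend", "hours_to_trend"]])
```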

**Question 8: What is the average time-to-trend in days and hours?**

Double click and enter answer here

---

Now that we’ve calculated days and hours to trend, we can move beyond raw numbers and use visualizations to uncover trends and patterns in the data.

📌 Why Use Visualizations?
- Tables and raw numbers only tell part of the story—graphs help reveal patterns at a glance.
- By plotting the distribution and trends of days_to_trend, we can answer key questions about how videos gain popularity.

**Next Steps: Visualizing Time-to-Trend and Trending Patterns**

We’ll now create basic visualizations to explore:  
✅ When videos tend to trend (time of day, day of week, seasonality)  
✅ The distribution of time-to-trend and whether there are outliers  
✅ If there are patterns in how long it takes for videos to trend

**Basic Visualizations**

Now, let’s try to answer the following questions using visualizations:

1️⃣ Which days see the most trending videos?  
Hint: Try extracting the day of the week from trending_date and plot a bar chart.
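One possible way to follow this hint is sketched below on four made-up dates (hypothetical values, not the assignment data). It counts rows per weekday and puts the days in calendar order; calling `.plot(kind="bar")` on the result would draw the bar chart.

```python
import pandas as pd

# Hypothetical trending dates (2024-03-04 is a Monday)
trending = pd.Series(pd.to_datetime(
    ["2024-03-04", "2024-03-04", "2024-03-05", "2024-03-09"]
))

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count rows per weekday name, then reorder Monday..Sunday (missing days become 0)
day_counts = (
    trending.dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)

print(day_counts)
# day_counts.plot(kind="bar") would draw the chart in a notebook with matplotlib installed
```

The same pattern should work with `.dt.hour` or `.dt.month` in place of `.dt.day_name()` for the next two questions.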

In [29]:
# Enter code here

2️⃣ Which time of day sees the most trending videos?  
Hint: Extract hour of the day and create a histogram to show when videos trend most often.

In [30]:
# Enter code here

3️⃣ Do more videos typically trend over the summer months?  
Hint: Analyze seasonality by plotting trends across months.

In [31]:
# Enter code here

4️⃣ Is there a significant number of outliers in trending videos?  
Hint: A box plot of days_to_trend will help us identify extreme values.
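A box plot typically marks as outliers the points that fall more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is the usual default in matplotlib and seaborn. The sketch below applies that rule by hand to made-up values (hypothetical, not from the dataset) to show which points would be drawn as outliers.

```python
import pandas as pd

# Hypothetical days_to_trend values with one very slow video
days = pd.Series([0.5, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 30.0], name="days_to_trend")

# Quartiles and interquartile range
q1 = days.quantile(0.25)
q3 = days.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are the ones a default box plot shows as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = days[(days < lower) | (days > upper)]

print(f"Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(outliers)
```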

In [32]:
# Enter code here

5️⃣ How long does it typically take for a video to trend after being published?  
Hint: Create a histogram of days_to_trend to visualize the distribution.

In [34]:
# Enter code here

### Filtering Dataframes

In real-world data analysis, filtering is an essential technique that allows us to focus on specific subsets of data that are most relevant to our investigation. Rather than analyzing the entire dataset at once—which can be overwhelming and filled with irrelevant information—we can narrow our focus to extract meaningful insights.

📌 Why Filter a DataFrame?  
- To analyze specific trends (e.g., identifying disinformation-related content).
- To remove irrelevant data that may skew our analysis.
- To explore targeted questions, such as which types of videos include certain keywords in their tags.

Filtering on the **tags** Column

In this case, we are particularly interested in the disinformation tag in trending YouTube videos. Since YouTube creators add tags to describe their videos, we can use this column to identify videos that explicitly mention "disinformation."

By filtering the dataset based on whether the tags contain the word "disinformation," we can:  
✅ Identify how many trending videos discuss disinformation.  
✅ Determine which video categories or creators frequently use this term.  
✅ Compare engagement metrics (views, likes, comments) between disinformation-related videos and other trending content.

Now, let’s apply this filtering technique to extract all videos that include **"disinformation"** in their tags.

In [35]:
# The code below filters rows based on whether the string "disinformation" is in the "tags" column
disinfo = youtube_df[youtube_df['tags'].str.contains('disinformation', case=False)]
disinfo

NameError: name 'youtube_df' is not defined

*How many videos match this filter?  Do they have higher or lower engagement (views, likes, comments) compared to other videos?*
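One way to approach this comparison is sketched below on a made-up table. The pipe-separated tag strings and all values are assumptions for illustration. The `na=False` argument is an optional addition that treats rows with missing tags as non-matches, so the mask contains only True and False.

```python
import pandas as pd

# Hypothetical videos; the third row has no tags at all
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tags": ["news|Disinformation|politics", "music|pop", None, "disinformation|media"],
    "view_count": [1000, 5000, 300, 3000],
})

# Case-insensitive substring match; missing tags count as "no match"
mask = toy["tags"].str.contains("disinformation", case=False, na=False)

matching = toy[mask]    # rows whose tags mention the keyword
others = toy[~mask]     # every other row

print(f"Matching videos: {len(matching)}")
print(f"Mean views (matching): {matching['view_count'].mean():.0f}")
print(f"Mean views (others):   {others['view_count'].mean():.0f}")
```

With only a handful of matching rows, a difference in means is weak evidence, so it helps to report the counts alongside the averages.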

In [36]:
# Modify the code above to filter rows based on any tag of interest
# Save the dataframe as tag_1
# Enter code here

### Aggregating Data

Now that we’ve learned how to filter the dataset to focus on specific topics, the next step is to aggregate the data to uncover broader patterns.

📌 Why Aggregate Data?  
While filtering allows us to zoom in on specific videos, aggregation helps us summarize trends across multiple entries.

**Aggregation allows us to answer questions like:**  
✅ Which channels post the most videos on a given topic?  
✅ Are certain content creators or networks consistently producing trending content with specific tags?  
✅ How does the frequency of a topic vary across different creators?

📌 Why Group by channelId?
- By grouping the dataset by channelId, we can analyze how many unique videos each channel has posted with a given tag.
- This helps identify which channels contribute the most content related to a specific topic, such as political content or misinformation.
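The difference between counting rows and counting unique videos can be sketched on a small made-up table (the IDs below are hypothetical). If a video appears on several trending days, `count()` counts every row, while `nunique()` counts each `video_id` once per channel.

```python
import pandas as pd

# Hypothetical rows: video v1 from channel c1 appears twice
toy = pd.DataFrame({
    "channelId": ["c1", "c1", "c1", "c2", "c3", "c3"],
    "video_id":  ["v1", "v1", "v2", "v3", "v4", "v5"],
})

# Number of rows per channel (repeat appearances are counted again)
rows_per_channel = toy.groupby("channelId")["video_id"].count()

# Number of distinct videos per channel, largest first
videos_per_channel = (
    toy.groupby("channelId")["video_id"]
    .nunique()
    .sort_values(ascending=False)
)

print(rows_per_channel)
print(videos_per_channel)
```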

**Next Step: Aggregating Video Counts for Specific Tags**

To demonstrate this, we’ll filter videos that contain the tag "Trump", and then identify which channels post the most videos with this tag.

💡 Run the following code:

In [37]:
tag_1 = youtube_df[youtube_df['tags'].str.contains('trump', case=False)]
tag_1

NameError: name 'youtube_df' is not defined

In [38]:
# Identify whether specific channels post more videos with this tag than others
tag_1.groupby('channelId')['video_id'].nunique().sort_values().tail(20)

NameError: name 'tag_1' is not defined

In [39]:
# Try the above code grouping by channelTitle
# Enter code here

*Which channels frequently post videos with this tag?  Are the most active channels news-based, political, or entertainment-focused?*

In [40]:
# Try other data aggregations and visualizations
# Enter code here

In [41]:
# Enter code here

In [42]:
# Enter code here

In [43]:
# Enter code here

---
Once complete, please submit by saving your ipynb file to the [Assignment 5 Student Submissions Folder](https://github.com/BevRice/CMI_Course/tree/main/Student_Submissions/Assignment_5)

From Google Colab
1. Click File
2. Click Download
3. Select Download .ipynb
4. Upload onto GitHub Discussions by 11:59pm, Fri 28 Mar

---

Grading Criteria (15 Points Total)

- ✅ Code Completeness (5 pts) → Attempts all coding exercises
- ✅ Code Accuracy (2 pts) → No errors during execution
- ✅ Interpretation of Results (8 pts) → Thoughtful answers to questions 1-8

[Provide Anonymous Feedback on this Assignment Here](https://forms.gle/4ZRmNr5rmGCAR1Re6)