# Data Cleaning

## The Importance of Cleaning YouTube Video Data

When working with YouTube video data, proper cleaning is essential because:

 **Raw data contains inconsistencies**:
   - Video titles may include special characters or emojis
   - Descriptions often have promotional links or boilerplate text
   - Metadata fields (like view counts or durations) might be formatted inconsistently

 **Missing or incomplete information**:
   - Some videos may lack thumbnails or proper category tags
   - Engagement metrics (likes/dislikes) could be unavailable for certain videos

 **Impact on analysis**:
   - Unclean data can distort viewership trends and performance comparisons
   - Machine learning models trained on raw data may produce unreliable predictions
   - Business decisions based on uncleaned data could be misguided

By systematically, standardizing text formats, handling missing values appropriately, removing duplicate entries and validating numerical metrics we can ensure the data accurately represents channel performance and viewer behavior. This cleaning process forms the foundation for all subsequent analysis, from basic reporting to advanced predictive modeling.

---
## Imports

### Core Data Handling

**pandas**: The primary library for working with structured data. It enables creating organized tables (DataFrames) from YouTube API responses, filtering specific videos based on criteria, and handling missing or incomplete video information efficiently.

**numpy**: Provides fundamental numerical computing capabilities. It powers the mathematical operations needed to calculate video statistics, analyze trends in view counts, and process arrays of numerical data like engagement metrics.

### YouTube-Specific Processing

**isodate**: Specialized library for working with YouTube's duration format. It converts the platform's unique time format (like PT15M33S for 15 minutes, 33 seconds) into standard Python time objects, enabling analysis of video length patterns.

**ast**: Safely converts string representations of Python objects into actual usable objects. This is particularly useful when YouTube API returns complex data structures (like tags or metadata) as text strings that need to be transformed into workable formats.

### Text Analysis

**TextBlob**: Natural language processing toolkit that enables understanding of video titles, descriptions, and comments. It provides sentiment analysis to gauge emotional tone, identifies key phrases, and performs grammatical analysis - crucial for understanding viewer engagement and content themes.

In [2]:
import pandas as pd
import numpy as np
import isodate
import ast
from isodate import parse_duration
from textblob import TextBlob

These libraries work together to transform raw YouTube API data into actionable insights. The process begins with structured storage of API responses, converts specialized YouTube formats into standard Python objects, analyzes textual components, and finally enables comprehensive statistical analysis of video performance metrics.

---
## Organizing YouTube Channel Data

###  Loading Raw Data Files
- **Video Metadata**: 
  - Loaded from `redPillAnalytics.csv` containing core video information
  - Stored in `video_df` DataFrame with columns like `video_id` and `channelTitle`

In [3]:
video_df = pd.read_csv("dataFolder/raw/redPillAnalytics.csv")

- **Comment Data Consolidation**:
  - Split across two batches (`commentsBatchOne_df.csv` and `commentsBatchTwo_df.csv`)
  - Initially loaded into separate DataFrames `df1` and `df2`
  - **Combining Batches**:
  - Uses `pd.concat()` to merge both comment files
  - `ignore_index=True` ensures continuous row numbering
  - Creates unified `comments_df` containing all comments

In [4]:
df1 = pd.read_csv("dataFolder/raw/commentsBatchOne_df.csv")
df2 = pd.read_csv("dataFolder/raw/commentsBatchTwo_df.csv")
comments_df = pd.concat([df1, df2], ignore_index=True)

###  Data Filtering and Merging
- **Video Data Selection**:
  - Extracts only essential columns (`video_id` and `channelTitle`)
  - Creates lean `df_filtered` DataFrame

In [5]:
df_filtered = video_df[['video_id', 'channelTitle']]

- **Comment Enrichment**:
  - Merges comment data with video metadata using `video_id` as key
  - `how='inner'` ensures only comments with matching videos are kept
  - Final `comments_df` contains both comment text and associated channel info

In [6]:
comments_df = df_filtered.merge(comments_df, on='video_id', how='inner').copy()
comments_df

Unnamed: 0,video_id,channelTitle,comment,published_at
0,F5eSaabAAmk,Benjamin Seda,big boobs lmao,2025-03-10T00:36:43Z
1,F5eSaabAAmk,Benjamin Seda,"This will work for a specific type of woman, o...",2025-03-09T23:01:56Z
2,F5eSaabAAmk,Benjamin Seda,Can you do a video on what to do if you enco...,2025-03-09T07:13:54Z
3,F5eSaabAAmk,Benjamin Seda,God of the Dates 🤍,2025-03-08T14:57:50Z
4,F5eSaabAAmk,Benjamin Seda,"About cold approaches, it's just not true. I d...",2025-03-07T19:06:08Z
...,...,...,...,...
668718,Jyjqw_HwXVg,The Corbett Report (Unofficial),"Make everyone you know, or are even slightly r...",2022-05-25T12:19:52Z
668719,Jyjqw_HwXVg,The Corbett Report (Unofficial),The Green Scheme is Green Death!,2022-05-25T12:15:15Z
668720,Jyjqw_HwXVg,The Corbett Report (Unofficial),"a real timeless classic, perfect choice",2022-05-23T18:30:31Z
668721,Jyjqw_HwXVg,The Corbett Report (Unofficial),This is great stuff,2022-05-23T16:13:58Z


> **Technical Notes**
>
> To ensure memory efficiency during data processing, the use of `.copy()` helps prevent the common `SettingWithCopyWarning` in pandas, making data manipulation safer and clearer. Additionally, performing an inner join automatically reduces the dataset size by including only matching records, which optimizes resource usage. The final DataFrame is structured to retain the original comment data, the corresponding video IDs, and channel titles, making it easier to group and analyze the data by channel or video for further exploration.

---
## Data Quality Verification Steps

### For Comments Data (`comments_df`)

**Initial Null Check**  
- `comments_df.isnull().any()`: Identifies which columns contain missing values  
- `comments_df.isnull().sum()`: Counts nulls per column  

In [7]:
comments_df.isnull().any()

video_id        False
channelTitle    False
comment          True
published_at    False
dtype: bool

**Data Type Inspection**  
- `comments_df.dtypes`: Verifies each column's data type  

In [8]:
comments_df.dtypes

video_id        object
channelTitle    object
comment         object
published_at    object
dtype: object

**Cleaning Process**  
- `comments_df.dropna(subset=['comment'])`: Removes rows with missing comments  
- `comments_df.isin([0, np.nan]).sum()`: Checks for zeros/NaNs across all columns  

In [9]:
comments_df.isnull().sum()
comments_df.dropna()
comments_df.isin([0, np.nan]).sum()

video_id         0
channelTitle     0
comment         51
published_at     0
dtype: int64

**Final Validation**  
- `print(comments_df.isnull().sum())`: Confirms no nulls remain in comment text  

In [10]:
comments_df = comments_df.dropna(subset=['comment'])
print(comments_df.isnull().sum())

video_id        0
channelTitle    0
comment         0
published_at    0
dtype: int64


---
### For Video Data (`video_df`)  

**Initial Assessment**  
- `video_df.isnull().any()`: Flags columns with nulls  
- `video_df.dtypes`: Checks column data types  

In [11]:
video_df.isnull().any()

video_id          False
channelTitle      False
title             False
description        True
tags               True
publishedAt       False
viewCount         False
likeCount          True
favouriteCount     True
commentCount       True
duration          False
definition        False
caption           False
dtype: bool

In [12]:
video_df.dtypes

video_id           object
channelTitle       object
title              object
description        object
tags               object
publishedAt        object
viewCount           int64
likeCount         float64
favouriteCount    float64
commentCount      float64
duration           object
definition         object
caption              bool
dtype: object

**Completeness Check**  
- `video_df.isnull().sum()`: Quantifies missing values per column  
- `video_df.isin([0, np.nan]).sum()`: Detects zeros/NaNs
- `video_df.dropna()`: Optional complete null removal

In [13]:
video_df.isnull().sum()
video_df.dropna()
video_df.isin([0, np.nan]).sum()

video_id              0
channelTitle          0
title                 0
description         394
tags               6314
publishedAt           0
viewCount             3
likeCount            23
favouriteCount    13675
commentCount        150
duration              0
definition            0
caption           13246
dtype: int64

In [14]:
video_df.isnull().sum()

video_id              0
channelTitle          0
title                 0
description         394
tags               6314
publishedAt           0
viewCount             0
likeCount            16
favouriteCount    13675
commentCount          9
duration              0
definition            0
caption               0
dtype: int64

This approach preserves the comment text, which is important for NLP applications, and systematically detects nulls across both datasets to maintain data quality. Type verification is performed to avoid analysis errors, and all operations are non-destructive, ensuring the original data remains intact.

### Video Data (`video_df`) Status Report

Before cleaning, several fields exhibited significant missing data: `favouriteCount` had 13,675 missing values, accounting for 98% of records, `tags` were absent in 6,314 cases (45%), and `caption` was missing in 13,246 records (95%). Moderate levels of nulls were found in `description` (394 missing, 3%) and `likeCount` (23 missing, 0.2%). After the initial cleaning, notable improvements included the complete resolution of nulls in `viewCount` (from 3 to 0), a substantial reduction in `commentCount` nulls (from 150 to 9), and full imputation of the `caption` field (from 13,246 to 0 nulls). However, gaps persisted in `favouriteCount`, which remains largely unavailable due to YouTube’s deprecation of the metric, and `tags`, which are still missing in 45% of videos—a common occurrence for content without tags.

---
## Cleaning Solution

The handling of missing values and data types was carefully implemented to support reliable downstream analysis. For the description field, any missing values were filled with empty strings, ensuring a consistent text column format and preventing errors during NLP processing, while also retaining all video records without row deletion. Tag information was parsed using a robust `safe_literal_eval` function, which safely converts stringified lists to Python lists and manages null or malformed data, returning empty lists in problematic cases; this approach preserves data structure integrity and enables accurate tag analysis and vectorisation. For count-based engagement metrics, missing values were imputed with zeros, matching YouTube’s API convention for private or removed content, which ensures that all records are preserved, maintains numerical consistency for aggregation, and provides computational efficiency.

In [15]:
video_df['description'] = video_df['description'].fillna("")
def safe_literal_eval(x):
    try:
        return ast.literal_eval(x) if isinstance(x, str) else []
    except (ValueError, SyntaxError):
        return []

video_df['tags'] = video_df['tags'].apply(safe_literal_eval)
video_df['likeCount'] = video_df['likeCount'].fillna(0)
video_df['favouriteCount'] = video_df['favouriteCount'].fillna(0)
video_df['commentCount'] = video_df['commentCount'].fillna(0)
print(video_df.isnull().sum())

video_id          0
channelTitle      0
title             0
description       0
tags              0
publishedAt       0
viewCount         0
likeCount         0
favouriteCount    0
commentCount      0
duration          0
definition        0
caption           0
dtype: int64


### Project-Specific Advantages

For machine learning applications, all features now have complete values with no nulls and consistent data types, allowing for proper feature extraction and ensuring that zero-filled counts do not distort statistical analysis. From a business analysis perspective, the approach maintains the full dataset size without losing records, uses empty strings and lists to clearly represent missing data, and aligns with YouTube’s platform conventions. Regarding maintenance, defensive programming strategies are used to prevent future errors, the imputation strategy is clearly documented in the code, and the handling rules can be easily modified as needed. The key strengths of this solution include systematic handling of all null cases to ensure completeness, context-aware selection of fill values for each field type, the inclusion of validation checks for verifiability, and scalability to larger datasets.

---
## Data Transformation & Feature Creation

###  Data Type Standardization

This processing step converts raw API responses into analysis-ready formats by enforcing proper data types across all fields. Engagement metrics such as view counts, likes, and comments are transformed from strings or objects into numeric values, while publication dates are standardized to timezone-naive datetime objects. Problematic values are automatically converted to a consistent missing data representation, which ensures accurate calculations for performance metrics, enables correct temporal comparisons across videos, and maintains data integrity in the presence of API inconsistencies.

In [17]:
numeric_cols = ['viewCount', 'likeCount', 'favouriteCount', 'commentCount']
video_df[numeric_cols] = video_df[numeric_cols].apply(pd.to_numeric, errors = 'coerce', axis = 1)

###  Derived Metadata and Temporal Feature Extraction

This feature engineering process extracts quantifiable characteristics from video metadata to enable content-focused analysis and unlock time-based patterns in video performance through multiple temporal representations. The following features were created:

- **Tag Analysis**: Count of tags per video as a measure of content tagging completeness
- **Title Metrics**: Character length of video titles as a potential engagement factor
- **Duration**: Video length converted to seconds for precise duration analysis
- **Publication Day**: Categorical weekday name for weekly pattern analysis
- **Unix Timestamp**: Continuous numeric representation for time-series modeling
- **Seasonal Indicators**: Derived datetime components for trend analysis

These engineered features reveal relationships between metadata quality and viewer engagement, identify optimal content characteristics such as title length and tagging strategy, and provide measurable inputs for recommendation algorithms. Use cases include identifying best-performing publication days and times, enabling decay rate modeling of video engagement, and supporting analysis of long-term viewership trends.

In [18]:
video_df['publishedAt'] = pd.to_datetime(video_df['publishedAt']).dt.tz_localize(None)
video_df['publishDayName'] = video_df['publishedAt'].dt.strftime("%A")
video_df['publishedAt_timestamp'] = video_df['publishedAt'].astype('int64') / 10**9
video_df['tagCount'] = video_df['tags'].apply(lambda x: len(x) if isinstance(x, list) else [])
video_df['durationSecs'] = video_df['duration'].apply(lambda x: isodate.parse_duration(x).total_seconds())
video_df['durationSecs'] = video_df['duration'].apply(
    lambda x: isodate.parse_duration(x).total_seconds() if pd.notnull(x) else 0
)
video_df['titleLength'] = video_df['title'].apply(lambda x: len(x))

### Implementation Approach

The data validation process includes distribution analysis of all derived features, cross-verification of temporal conversions, and thorough null value checks after transformation. Feature documentation is maintained with clear naming conventions that reflect the source data, ensuring type consistency across all transformations and preserving raw data alongside newly engineered features. As a result, all features are properly typed for visualisation and are compatible with both statistical and machine learning workflows, making the dataset ready for correlation analysis and hypothesis testing.

In [19]:
video_df

Unnamed: 0,video_id,channelTitle,title,description,tags,publishedAt,viewCount,likeCount,favouriteCount,commentCount,duration,definition,caption,publishDayName,publishedAt_timestamp,tagCount,durationSecs,titleLength
0,F5eSaabAAmk,Benjamin Seda,How to ACTUALLY Get a Girlfriend in 2025 (Full...,👉🏼 Get 1-3+ dates per week in 30 days (coachin...,"[how to flirt with a girl, dates, how to get a...",2025-03-06 15:27:49,5034.0,254.0,0.0,27.0,PT15M4S,hd,False,Thursday,1.741275e+09,21,904.0,53
1,xJ6b8CV-pQ0,Benjamin Seda,How to Find A 10/10 Girlfriend,👫 My 3 step formula to approach & attract wome...,"[how to flirt with a girl, dates, how to get a...",2025-03-03 15:01:24,3346.0,330.0,0.0,22.0,PT59S,hd,False,Monday,1.741014e+09,13,59.0,30
2,kPhrei5S88U,Benjamin Seda,The Mistake 99% of Men Make That Keep Them Single,👫 My 3 step formula to approach & attract wome...,"[how to flirt with a girl, dates, how to get a...",2025-03-01 14:45:07,2690.0,222.0,0.0,19.0,PT36S,hd,False,Saturday,1.740840e+09,13,36.0,49
3,4ZnwTwLcAeM,Benjamin Seda,How to Always Get That 2nd Date,👫 My 3 step formula to approach & attract wome...,"[how to flirt with a girl, dates, how to get a...",2025-02-27 14:15:00,4060.0,413.0,0.0,9.0,PT46S,hd,False,Thursday,1.740666e+09,13,46.0,31
4,VW9-SBs6yIg,Benjamin Seda,The Donald Trump Method for Tinder (STEAL THIS),👫 My 3 step formula to approach & attract wome...,"[how to flirt with a girl, dates, how to get a...",2025-02-26 13:45:03,6818.0,316.0,0.0,30.0,PT32S,hd,False,Wednesday,1.740578e+09,13,32.0,47
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13670,_X4P6L622-8,The Corbett Report (Unofficial),How Do I Defend Voluntarism? - Questions For C...,I have no affiliation with James Corbett or Th...,[],2022-05-25 13:22:06,393.0,15.0,0.0,2.0,PT28M13S,hd,False,Wednesday,1.653485e+09,0,1693.0,52
13671,tkmZ4c2AOVY,The Corbett Report (Unofficial),The 5G Dragnet,I have no affiliation with James Corbett or Th...,[],2022-05-24 17:28:15,1226.0,61.0,0.0,2.0,PT25M44S,sd,False,Tuesday,1.653413e+09,0,1544.0,14
13672,mr7itEUIVew,The Corbett Report (Unofficial),False Flags: The Secret History of Al Qaeda — ...,I have no affiliation with James Corbett or Th...,[],2022-05-24 17:01:10,5531.0,179.0,0.0,16.0,PT1H16M19S,sd,False,Tuesday,1.653412e+09,0,4579.0,66
13673,ochRNyIDTE8,The Corbett Report (Unofficial),Episode 409 - False Flags: The Secret History ...,I have no affiliation with James Corbett or Th...,[],2022-05-24 17:01:06,6132.0,212.0,0.0,16.0,PT1H59M39S,sd,False,Tuesday,1.653412e+09,0,7179.0,72


---
## Channel Performance Aggregation

This process calculates core performance metrics for each YouTube channel by grouping the data by channel title. It sums up total views, likes, and comments for each channel and then computes the engagement rate, defined as (Total Likes + Total Comments) divided by Total Views. This approach provides a clear measure of audience interaction and overall channel performance.

In [20]:
channel_stats = video_df.groupby('channelTitle').agg({
    'viewCount': 'sum',
    'likeCount': 'sum',
    'commentCount': 'sum'
}).reset_index()

In [21]:
channel_stats['engagementRate'] = (channel_stats['likeCount'] + channel_stats['commentCount']) / channel_stats['viewCount']

This analysis provides valuable insights by comparing absolute performance across channels, identifying high-engagement content creators, and normalizing interaction metrics by view count.

| Channel                        | View Count   | Engagement Rate | Relevance to Red Pill                              |
|--------------------------------|--------------|-----------------|---------------------------------------------------|
| Benjamin Seda                  | 202,773,449  | 3.36%           | Likely relevant (focus on dating, masculinity, and self-improvement) |
| Better Bachelor                | 231,797,826  | 6.87%           | Highly relevant (focus on men’s rights, dating, and anti-feminism) |
| Coach Corey Wayne              | 251,196,988  | 2.10%           | Relevant (focus on dating advice and relationships) |
| FreshandFit                    | 221,150,653  | 4.77%           | Highly relevant (focus on dating, gender dynamics, and masculinity) |
| Jordan B Peterson              | 950,219,566  | 3.64%           | Partially relevant (focus on psychology, self-improvement, and gender roles) |
| The Corbett Report (Unofficial)| 1,124,728    | 8.61%           | Less relevant (focus on conspiracy theories, not directly Red Pill) |
| The Distributist               | 6,873,745    | 4.96%           | Less relevant (focus on traditionalism and economics, not directly Red Pill) |
| Alhpamales                     | 239          | 4.18%           | Likely irrelevant (very low view count, unclear relevance) |
| Paul Joseph Watson | Перевод     | 2,178          | 2.02%           | Likely irrelevant (low view count, unclear relevance) |

### Channels to Keep (Highly Relevant to Red Pill):
- **Better Bachelor**
- **FreshandFit**
- **Benjamin Seda**
- **Coach Corey Wayne**
- **Jordan B Peterson** 

### Channels to Exclude (Less Relevant or Low Engagement):
- **The Corbett Report (Unofficial)** (focuses on conspiracy theories, not Red Pill)
- **The Distributist** (focuses on traditionalism/economics, not Red Pill)
- **Alhpamales** and **Paul Joseph Watson | Перевод** (very low engagement, unclear relevance)

---
## Excluding Channels

I've removed irrelevant channels from my analysis to eliminate noise and avoid false signals. Channels that focus on unrelated topics, such as conspiracy theories or economics, or that show very low engagement, tend to distort the dataset and make it difficult to identify genuine patterns in Red Pill-related viewer behaviour. By narrowing the focus to channels that consistently discuss masculinity, dating dynamics, and gender relations—and that have statistically significant viewership—I can ensure that my findings more accurately reflect the actual engagement patterns within the Red Pill community. This careful curation helps improve the reliability of my results and maintains the thematic focus of the analysis.

In [23]:
excluded_channels = [
    "The Corbett Report (Unofficial)", 
    "The Distributist", 
    "Alhpamales", 
    "Paul Joseph Watson | Перевод"
]

video_df = video_df[~video_df["channelTitle"].isin(excluded_channels)].copy()

 **Data Processing**:
   - Uses `~.isin()` to keep only channels **not** in the exclusion list
   - Applies filtering to both:
     - `video_df` (video metadata)
     - `comments_df` (user comments)
   - `.copy()` prevents pandas warnings

>**Technical Note**:  
The `~` operator inverts the condition - we keep rows where `channelTitle` is **not** in the exclusion list.

In [19]:
comments_df = comments_df[~comments_df["channelTitle"].isin(excluded_channels)].copy()

**Outcome**:  
Clean datasets where all content relates to:
- Dating dynamics  
- Masculinity  
- Gender relations  
- Male self-improvement

---
## Video Performance Metrics & Feature Engineering

### Key Insights from Channel Duration Analysis

In [24]:
video_df.groupby('channelTitle')['durationSecs'].mean()

channelTitle
Benjamin Seda         195.985205
Better Bachelor      1379.782349
Coach Corey Wayne     659.063435
FreshandFit          3706.578228
Jordan B Peterson    2940.362724
Name: durationSecs, dtype: float64

Average video duration by channel (seconds):
- **FreshandFit**: 3,707 (long-form discussions)  
- **Jordan B Peterson**: 2,940 (in-depth lectures)  
- **Better Bachelor**: 1,380 (medium-length content)  
- **Coach Corey Wayne**: 659 (concise advice)  
- **Benjamin Seda**: 196 (short clips)

### Newly Engineered Features

 **Engagement Metrics**  
  I've added some new metrics to help analyse channel engagement more effectively. The `view_per_like` metric shows how many views it takes to get a single like, which helps measure how "costly" likes are for a video. The `commentRatio` and `likeRatio` metrics give normalised rates of engagement, showing how often viewers comment or like compared to the total views. Finally, the `popularity_score` combines comments, likes, and views into one weighted score (with comments counted as most important), making it easier to compare overall popularity between videos.

In [None]:
video_df['view_per_like'] = np.where(
    video_df['likeCount'] != 0,
    video_df['viewCount'] / video_df['likeCount'], 
    0 
)

video_df['commentRatio'] = np.where(
    video_df['viewCount'] != 0, 
    video_df['commentCount'] / video_df['viewCount'], 
    0  
)

video_df['likeRatio'] = np.where(
    video_df['viewCount'] != 0, 
    video_df['likeCount'] / video_df['viewCount'],  
    0  
)

video_df['popularity_score'] = (
    video_df['viewCount'] + 
    video_df['likeCount'] * 10 + 
    video_df['commentCount'] * 20
)

 **Content Interaction Features**  
  I've introduced new metrics to dig deeper into video engagement. The `comment_duration_interaction` metric combines how much discussion a video generates with its length, helping show which videos spark conversation relative to their duration. The `title_sentiment` metric analyses the tone of each video title, giving a polarity score from -1 (negative) to +1 (positive), which makes it easier to see if certain title tones attract more engagement.

In [None]:
video_df['comment_duration_interaction'] = video_df['commentCount'] * video_df['durationSecs']
video_df = video_df.fillna(0)
video_df['title_sentiment'] = video_df['title'].apply(lambda x: TextBlob(x).sentiment.polarity)

### Statistical Findings

In [27]:
video_df['commentCount'].mean()

623.0344854308742

On average, each video receives about 623 comments, although there is a wide range in the number of comments across different videos.

In [28]:
video_df[['viewCount', 'commentCount']].corr()

Unnamed: 0,viewCount,commentCount
viewCount,1.0,0.617668
commentCount,0.617668,1.0


The correlation between views and comments is 0.62, indicating a moderately strong relationship. This means that, in general, videos with more views tend to receive more comments. However, the relationship is not perfectly linear—after a certain point, the number of comments doesn't increase as rapidly as views do. This suggests that while higher viewership does encourage more discussion, audience interaction may start to level off for very popular videos.

> **Technical Implementation Notes**
>
> - **NaN Handling**:  
>   Missing values (NaN) are safely replaced with zeros to prevent errors and maintain consistency in calculations.
>
> - **Conditional Logic**:  
>   Conditional operations are implemented using:
>   ```python
>   np.where(condition, true_val, false_val)
>   ```
>   This allows for efficient, vectorized assignment of values based on a specified condition.

In [29]:
video_df

Unnamed: 0,video_id,channelTitle,title,description,tags,publishedAt,viewCount,likeCount,favouriteCount,commentCount,...,publishedAt_timestamp,tagCount,durationSecs,titleLength,view_per_like,comment_duration_interaction,popularity_score,commentRatio,likeRatio,title_sentiment
0,F5eSaabAAmk,Benjamin Seda,How to ACTUALLY Get a Girlfriend in 2025 (Full...,👉🏼 Get 1-3+ dates per week in 30 days (coachin...,"[how to flirt with a girl, dates, how to get a...",2025-03-06 15:27:49,5034.0,254.0,0.0,27.0,...,1.741275e+09,21,904.0,53,19.818898,24408.0,8114.0,0.005364,0.050457,0.175000
1,xJ6b8CV-pQ0,Benjamin Seda,How to Find A 10/10 Girlfriend,👫 My 3 step formula to approach & attract wome...,"[how to flirt with a girl, dates, how to get a...",2025-03-03 15:01:24,3346.0,330.0,0.0,22.0,...,1.741014e+09,13,59.0,30,10.139394,1298.0,7086.0,0.006575,0.098625,0.000000
2,kPhrei5S88U,Benjamin Seda,The Mistake 99% of Men Make That Keep Them Single,👫 My 3 step formula to approach & attract wome...,"[how to flirt with a girl, dates, how to get a...",2025-03-01 14:45:07,2690.0,222.0,0.0,19.0,...,1.740840e+09,13,36.0,49,12.117117,684.0,5290.0,0.007063,0.082528,-0.071429
3,4ZnwTwLcAeM,Benjamin Seda,How to Always Get That 2nd Date,👫 My 3 step formula to approach & attract wome...,"[how to flirt with a girl, dates, how to get a...",2025-02-27 14:15:00,4060.0,413.0,0.0,9.0,...,1.740666e+09,13,46.0,31,9.830508,414.0,8370.0,0.002217,0.101724,0.000000
4,VW9-SBs6yIg,Benjamin Seda,The Donald Trump Method for Tinder (STEAL THIS),👫 My 3 step formula to approach & attract wome...,"[how to flirt with a girl, dates, how to get a...",2025-02-26 13:45:03,6818.0,316.0,0.0,30.0,...,1.740578e+09,13,32.0,47,21.575949,960.0,10578.0,0.004400,0.046348,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12917,KO7Z0HdxIek,FreshandFit,The Most OPTIMAL Rep Range For STRENGTH,Many people struggle to determine the best rep...,"[repetition range, how to lift weights, how to...",2020-06-25 00:36:00,9633.0,601.0,0.0,21.0,...,1.593045e+09,23,258.0,39,16.028286,5418.0,16063.0,0.002180,0.062390,0.500000
12918,3LXo6A-JnV4,FreshandFit,What's better for fat loss? Low carb or high c...,Today we answer the age old question. Are high...,"[fatloss, lowcarb, weight loss, keto, evidence...",2020-06-20 14:45:11,9209.0,490.0,0.0,26.0,...,1.592664e+09,23,193.0,50,18.793878,5018.0,14629.0,0.002823,0.053209,0.220000
12919,e9Gdl-szTg4,FreshandFit,Are Fitness/Calorie Tracking Apps Accurate? Th...,Are these popular apps bringing you closer to ...,[redpill fitness hypergamy #gainz],2020-06-13 15:00:11,4787.0,242.0,0.0,10.0,...,1.592060e+09,1,271.0,66,19.780992,2710.0,7407.0,0.002089,0.050554,-0.163889
12920,RHlPDYsuBYs,FreshandFit,IS FASTING SUPERIOR? What the science says...,Does fasting build more muscle or help burn mo...,"[fasting, fitness, aesthetic]",2020-05-30 15:00:27,23952.0,1353.0,0.0,64.0,...,1.590851e+09,3,426.0,45,17.702882,27264.0,38762.0,0.002672,0.056488,0.700000


---
## Exploratory Data Snapshot

*(While not part of the data cleaning pipeline, examining top-performing videos helps contextualize our dataset.)*

In [30]:
top_10_views = video_df.nlargest(10, 'viewCount')[['title', 'viewCount', 'channelTitle']]
print(top_10_views)

                                                   title   viewCount  \
9910           JBP X @MattRifeComedy.  Today at 5pm EST.  18502980.0   
10451  Lecture: Biblical Series I: Introduction to th...  13706650.0   
1463   THIS is How A Girl Wants You to TEXT HER | How...  11807649.0   
1473   7 Ways To INSTANTLY Look MORE ATTRACTIVE | How...   8246244.0   
9984                             COVID-19 Cause of Death   7952713.0   
10053  Talking to Muslims About Christ | Mohammed Hij...   7756906.0   
9973                            What Are Women Good For?   7656196.0   
10072         Africa is Not Poor Because of Colonization   7541337.0   
10389  Documentary: A Glitch in the Matrix (David Ful...   6605578.0   
90             If a Girl is Looking at You, Approach Her   6594217.0   

            channelTitle  
9910   Jordan B Peterson  
10451  Jordan B Peterson  
1463       Benjamin Seda  
1473       Benjamin Seda  
9984   Jordan B Peterson  
10053  Jordan B Peterson  
9973   Jordan B Pe

In [31]:
top_10_likes = video_df.nlargest(10, 'likeCount')[['title', 'likeCount', 'channelTitle']]
print(top_10_likes)

                                                   title  likeCount  \
9910           JBP X @MattRifeComedy.  Today at 5pm EST.   611913.0   
10290                                        Return Home   543720.0   
9955   The Fight Against Worldwide Child Slavery & th...   318009.0   
10110                               Article: Twitter Ban   315518.0   
9973                            What Are Women Good For?   305820.0   
9984                             COVID-19 Cause of Death   299559.0   
10087    Language Is Used as a Group Protection Strategy   232440.0   
114                 If A Girl is Looking at You, Do This   231402.0   
10451  Lecture: Biblical Series I: Introduction to th...   227120.0   
10072         Africa is Not Poor Because of Colonization   220386.0   

            channelTitle  
9910   Jordan B Peterson  
10290  Jordan B Peterson  
9955   Jordan B Peterson  
10110  Jordan B Peterson  
9973   Jordan B Peterson  
9984   Jordan B Peterson  
10087  Jordan B Peterson  
11

## Top 10 Most-Viewed Videos
| Views       | Channel             | Title Excerpt                     |
|-------------|---------------------|-----------------------------------|
| 18.5M       | Jordan B Peterson   | JBP X @MattRifeComedy...          |
| 13.7M       | Jordan B Peterson   | Biblical Series Lecture...        |
| 11.8M       | Benjamin Seda       | How to Text a Girl...             |

### Top 10 Most-Liked Videos  
| Likes      | Channel             | Title Excerpt                     |
|------------|---------------------|-----------------------------------|
| 611K       | Jordan B Peterson   | JBP X @MattRifeComedy...          |
| 543K       | Jordan B Peterson   | Return Home                       |
| 318K       | Jordan B Peterson   | Child Slavery Article...          |

### Key Observations:
 **Dominant Channels**:  
   - Jordan Peterson dominates both views and likes  
   - Benjamin Seda appears only in top views (not top likes)  

 **Content Patterns**:  
   - High-performing videos combine:  
     - Celebrity collaborations  
     - Controversial topics  
     - Practical dating advice  

 **Engagement Disconnect**:  
   - Some highly-viewed videos don't make top likes list  
   - Suggests viewership ≠ agreement/enjoyment  

**Analytical Value:**

- Identifies the true leaders in audience reach and engagement
- Distinguishes between passive viewership and active support
- Lays the groundwork for deeper analysis of what makes a video or channel successful

---
## Saving Processed Data

**File Outputs**:
`cleanedDataFrame.csv` - Processed video metadata with:
   - Standardized formats
   - Engineered features
   - Null values handled

In [None]:
video_df.to_csv("dataFolder/processed/cleanedDataFrame.csv", index=False)

 `cleanedComments.csv` - Filtered comments with:
   - Irrelevant channels removed
   - Consistent text encoding
   - Associated video metadata

In [21]:
comments_df.to_csv("dataFolder/processed/cleanedComments.csv", index=False)

> Key parameters include using `index=False` to avoid exporting the pandas index as a column and choosing CSV format for its cross-platform compatibility, human readability, and ease of version control. Files are organized in the `/processed/` subdirectory, keeping them separate from raw data in line with data pipeline best practices.

---
# Conclusion: Data Cleaning and Exploration

This notebook successfully standardised formats, addressed missing values and malformed entries, and refined content by filtering irrelevant channels and preparing comments for NLP analysis. Through feature engineering, key analytical metrics such as engagement ratios and sentiment scores were created, and temporal posting patterns were extracted. Rigorous quality checks ensured no missing values remained, high view-comment correlation was verified, and leading channels like Jordan Peterson were identified. The resulting dataset provides reliable metrics, actionable insights, and ML-ready features, forming a robust foundation for meaningful exploratory analysis with a continued focus on Red Pill themes.