# Data Pre-Processing

We first clean the data, then engineer new features to enrich it.


## 1. Data Cleaning

We clean the data by:
- checking for duplicate rows
- checking for missing values
- checking the percentage of missing values of each column
- checking for any redundant (uninformative) columns
- checking the column data types
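
The checks listed above can be bundled into one small helper. The sketch below is only illustrative: `summarize_quality` is a hypothetical name that does not appear in the notebook, and the toy dataframe is made up to stand in for `videos_df`.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, missing percentage and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": df.isnull().mean() * 100,
        "n_unique": df.nunique(dropna=True),
    })

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "video_id": ["a", "b", "c", "c"],
    "likes": [10.0, None, 5.0, 5.0],
    "favourites": [float("nan")] * 4,
})

n_duplicates = int(toy.duplicated().sum())   # fully identical rows
report = summarize_quality(toy)
print(n_duplicates)
print(report)
```

A column with 100% missing values or a single unique value is a candidate for removal, which is the reasoning applied to the columns dropped later in this section.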

In [115]:
import os
import re
import ast
import pandas as pd
import numpy as np
import spacy         # Natural language processing
import isodate       # Date transformation and manipulation

from os.path import dirname, abspath
from dateutil import parser
from datetime import datetime, date

#### Read the channels and videos files into dataframes

In [116]:
# Read in the csv files from the raw data folder
project_dir = dirname(dirname(abspath("02-data-preprocessing.ipynb")))
channels_df = pd.read_csv(project_dir + "/data/raw/fitness_channels_2023_07_11.csv")
videos_df = pd.read_csv(project_dir + "/data/raw/fitness_videos_2023_07_11.csv")

In [117]:
# Rename the columns to make the naming more intuitive and consistent
channels_df = channels_df.rename(columns={
    'ChannelName': 'channel',
    'ChannelDescription': 'channel_description',
    'PublishedDate': 'published_date',
    'TotalSubscribers': 'total_subscribers',
    'TotalViews': 'total_views',
    'TotalVideos': 'total_videos',
    'playlistID': 'playlist_id'
})

videos_df = videos_df.rename(columns={
    'channelTitle': 'channel',
    'viewCount': 'total_views',
    'likeCount': 'total_likes',
    'commentCount': 'total_comments',
    'favouriteCount': 'total_favourites',
    'publishedAt': 'published_at'
})

#### Check for duplicate rows

In [118]:
# check for duplicate rows in channels and videos data
(channels_df.duplicated().any(),videos_df.duplicated().any())

(False, False)

#### Check for missing data

In [119]:
videos_df.isnull().any()

video_id            False
channel             False
title               False
description          True
tags                 True
published_at        False
total_views         False
total_likes          True
total_favourites     True
total_comments       True
duration            False
definition          False
dtype: bool

#### Check for percentage of missing values

In [120]:
videos_df.isnull().sum() / videos_df.shape[0] * 100.00

video_id              0.000000
channel               0.000000
title                 0.000000
description           4.554311
tags                 12.883585
published_at          0.000000
total_views           0.000000
total_likes           1.217730
total_favourites    100.000000
total_comments        0.243546
duration              0.000000
definition            0.000000
dtype: float64

In [121]:
# Drop rows with missing like counts - those videos are Shorts
videos_df = videos_df.dropna(subset=['total_likes'])

# Drop the total_favourites column (originally favouriteCount) since it only contains missing values
videos_df = videos_df.drop(columns='total_favourites')

# Fill missing comment counts with 0 - comments were disabled for those videos
videos_df['total_comments'] = videos_df['total_comments'].fillna(0)


#### Check for redundant columns

In [122]:
# Check the percentage of each unique value in the definition column
videos_df['definition'].value_counts() / videos_df['definition'].shape[0] * 100

hd    99.556213
sd     0.443787
Name: definition, dtype: float64

In [123]:
# Drop the definition column: over 99% of the videos are hd, so it carries almost no information
videos_df = videos_df.drop(columns='definition')

In [124]:
# Check the date values to make sure there are no errors
videos_df.published_at.sort_values()

2836    2009-10-06T04:47:37Z
2835    2009-11-04T18:05:19Z
2834    2009-11-10T05:18:29Z
2833    2009-11-18T06:49:44Z
2832    2009-11-30T08:13:39Z
                ...         
2837    2023-07-10T15:00:09Z
733     2023-07-11T12:46:31Z
1652    2023-07-11T13:00:16Z
0       2023-07-11T14:00:17Z
1241    2023-07-11T14:00:39Z
Name: published_at, Length: 4056, dtype: object

#### Check the column data types

In [125]:
videos_df.dtypes

video_id           object
channel            object
title              object
description        object
tags               object
published_at       object
total_views         int64
total_likes       float64
total_comments    float64
duration           object
dtype: object

## 2. Feature Engineering

Second, we enrich the data with new features:
* Convert the ISO 8601 duration format returned by the YouTube Data API v3
    1. Convert the duration to a timedelta
    2. Convert the duration from seconds to minutes, which is more intuitive for workout videos
    3. Check for errors or outliers (and address them if necessary)
    4. Group the durations into categories
* Derive new features from the published date column
    1. Get the published year
    2. Get the published month
    3. Get the published day of the week
    4. Get the age of the video
* Get the length of the video title
* Get the total number of tags in the tags column
* Use natural language processing (via spaCy) to derive useful features from the video title, description, and/or tags
    1. Get the workout length, since the actual workout duration and the video duration are usually not the same
    2. Get the workout type (cardio, HIIT, yoga, pilates, etc.)
    3. Get the target body part (legs, abs, arms, etc.)
    4. Get any special aspect of the workout (no jumping, standing, no equipment, etc.)
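
The workout length mentioned in the list above can be approximated from the title with a plain regular expression. This is a minimal sketch, not the spaCy approach used in this notebook: the function name `extract_workout_minutes` and the pattern are our own assumptions, and the pattern only covers minute and hour units written after a number.

```python
import re
from typing import Optional

# Hypothetical pattern: a number followed by a minute or hour unit,
# e.g. "45 MIN", "45-min", "1 HOUR"
_LENGTH_RE = re.compile(
    r"\b(\d+(?:\.\d+)?)\s*-?\s*(hours?|hrs?|minutes?|mins?)\b",
    re.IGNORECASE,
)

def extract_workout_minutes(text: str) -> Optional[float]:
    """Return the first stated workout length in minutes, or None if absent."""
    match = _LENGTH_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    # Units starting with "h" are hours; everything else in the pattern is minutes
    return value * 60 if unit.startswith("h") else value

for title in [
    "45 MIN WALKING CARDIO WORKOUT",
    "1 HOUR FULL BODY FAT BURN HOME WORKOUT",
    "Abs & Booty Workout Livestream",
]:
    print(title, "->", extract_workout_minutes(title))
```

Only the first match is returned, so a title that states several lengths would need extra handling.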

In [126]:
# Parse the ISO 8601 duration strings into timedeltas, then convert them to minutes
videos_df['duration_secs'] = pd.to_timedelta(videos_df['duration'].apply(isodate.parse_duration))
videos_df['duration_mins'] = videos_df['duration_secs'].dt.total_seconds() / 60

# Check the distribution of the video duration for presence of any outliers
videos_df['duration_mins'].describe()

count    4056.000000
mean       10.891958
std         9.208619
min         0.000000
25%         3.962500
50%        10.691667
75%        14.575000
max       130.816667
Name: duration_mins, dtype: float64
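
To make the conversion concrete, here is a minimal standard-library sketch of what parsing an ISO 8601 duration involves. It is not how `isodate` is implemented. The helper name `iso_duration_to_minutes` is hypothetical, and the pattern assumes only the day, hour, minute and second components with integer values, which covers strings such as `PT45M18S`.

```python
import re
from datetime import timedelta

# Assumption: only D, H, M and S components with integer values are supported
_ISO_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def iso_duration_to_minutes(value: str) -> float:
    """Convert a duration such as 'PT1H2M35S' to minutes."""
    match = _ISO_DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"Unsupported ISO 8601 duration: {value!r}")
    # Keep only the components that are present and pass them to timedelta
    parts = {name: int(num) for name, num in match.groupdict().items() if num is not None}
    return timedelta(**parts).total_seconds() / 60

print(iso_duration_to_minutes("PT45M18S"))
print(iso_duration_to_minutes("PT2H10M49S"))
```

The second call uses the duration of the longest video in the dataset, so its result should agree with the `max` value reported above.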

In [127]:
# Drop videos with a duration of 0 minutes - most likely livestreams that were started by accident and stopped right away
videos_df = videos_df[videos_df['duration_mins'] > 0]

In [128]:
# Inspect the very long videos (longer than 45 minutes)
# Note: 20 videos exceed 45 minutes
videos_df[videos_df['duration_mins'] > 45][['channel','title','duration','video_id']]

index,channel,title,duration,video_id
822,emi wong,45 MIN WALKING CARDIO WORKOUT | Intense Full B...,PT45M18S,fIDmwKCJmlA
930,emi wong,45 min Full Body Workout to BURN MAX CALORIES ...,PT47M58S,Wgm1Xc25imM
991,emi wong,1 HOUR FULL BODY FAT BURN HOME WORKOUT (Warm U...,PT1H2M35S,p188evCXF0k
1013,emi wong,"2 Weeks Workout Program to Lose Weight, Get Ab...",PT1H4M51S,EJKw3Mh0MyI
1094,emi wong,45-min Full Body Fat Burn HIIT at home with NO...,PT46M43S,wAIRYalt75w
1389,Chloe Ting,Abs & Booty Workout Livestream,PT1H34M6S,tahd5q-onKc
1409,Chloe Ting,10 Million Subs LIVESTREAM | Let's hangout + W...,PT2H10M49S,IdO0Ie3_2QU
1451,Chloe Ting,2000 REP Full Body & Abs Workout CHALLENGE for...,PT49M48S,004CudS_3Ew
1517,Chloe Ting,45 Min Full Body FAT BURN Workout | Get Flat A...,PT46M53S,LDvAuqTZxMw
1737,blogilates,POPFLEX Pre-Black Friday Extravaganza,PT1H12M46S,HO84VmEkfq0


In [129]:
# Define the bin edges for the duration ranges
bin_edges = [0, 5, 10, 20, 30, 45, float('inf')]

# Define the labels for each range
labels = ['0-5', '5-10', '10-20', '20-30', '30-45', '45+']

# Use pd.cut to categorize the 'duration_mins' column into the specified ranges
videos_df['duration_category'] = pd.cut(videos_df['duration_mins'], bins=bin_edges, labels=labels, right=False)
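
The `right=False` argument decides which bin a boundary value falls into. The sketch below reuses the same edges and labels on a few made-up durations to show the difference. The sample values are ours and do not come from the dataset.

```python
import pandas as pd

bin_edges = [0, 5, 10, 20, 30, 45, float("inf")]
labels = ["0-5", "5-10", "10-20", "20-30", "30-45", "45+"]

# Made-up durations that sit on or near bin boundaries
sample = pd.Series([4.99, 5.0, 44.9, 45.0, 130.8])

# right=False: bins are [0, 5), [5, 10), ..., [45, inf)
left_closed = pd.cut(sample, bins=bin_edges, labels=labels, right=False)

# right=True (the pandas default): bins are (0, 5], (5, 10], ..., (45, inf]
right_closed = pd.cut(sample, bins=bin_edges, labels=labels, right=True)

print(pd.DataFrame({"mins": sample, "left_closed": left_closed, "right_closed": right_closed}))
```

With `right=False`, a video of exactly 45 minutes lands in `45+`, which matches how the labels read.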

In [130]:
# Parse the publish timestamp and derive a date-only column
videos_df['published_datetime'] = pd.to_datetime(videos_df['published_at'].apply(parser.parse))
videos_df['published_date'] = pd.to_datetime(videos_df['published_datetime'].dt.date)

# Extract the month and year from 'published_date' to create new columns
videos_df['published_month'] = videos_df['published_date'].dt.month
videos_df['published_year'] = videos_df['published_date'].dt.year

# Get the current date as a pandas Timestamp
current_date = pd.Timestamp(date.today())

# Calculate the video age in months
videos_df['video_age'] = ((current_date.year - videos_df['published_year']) * 12 +
                         (current_date.month - videos_df['published_month']))
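
The published day of the week can be derived in the same way. The sketch below is a minimal, self-contained illustration on two timestamps taken from the output above. The variable names and the fixed reference date (the collection date in the file names) are our assumptions. A fixed date keeps the age reproducible, whereas `date.today()` changes with every run.

```python
import pandas as pd

# Two publish timestamps from the dataset, used as a stand-in for videos_df['published_at']
published = pd.to_datetime(
    pd.Series(["2009-10-06T04:47:37Z", "2023-07-11T14:00:39Z"]), utc=True
)

# Day of the week as a readable name (dt.dayofweek gives 0-6 instead)
published_day = published.dt.day_name()

# Assumption: use the data collection date as the reference point
reference = pd.Timestamp("2023-07-11", tz="UTC")
age_months = (reference.year - published.dt.year) * 12 + (reference.month - published.dt.month)

print(published_day.tolist())
print(age_months.tolist())
```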

In [131]:
# Convert the description column to string (missing descriptions become empty strings, not 'nan')
videos_df['description'] = videos_df['description'].fillna('').astype(str)

# Title character length
videos_df['title_length'] = videos_df['title'].str.len()

# Parse the stringified tag lists (missing tags stay None) and count the tags per video
videos_df['tags'] = videos_df['tags'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else None)
videos_df['total_tags'] = videos_df['tags'].apply(lambda x: 0 if x is None else len(x))
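
Since the tags are stored in the csv file as stringified Python lists, a single malformed cell would make `ast.literal_eval` raise an error. A more defensive variant could look like the sketch below. `parse_tags` is a hypothetical helper, the example tag values are made up, and unlike the cell above it returns an empty list instead of `None` for missing tags.

```python
import ast

def parse_tags(raw):
    """Return a list of tags, or an empty list when the cell is missing or malformed."""
    if not isinstance(raw, str):
        # NaN / None cells read from the csv file end up here
        return []
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

print(parse_tags("['hiit', 'abs workout']"))
print(len(parse_tags(float("nan"))))
```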

In [132]:
# Now we are going to use spaCy's entity ruler along with regex 
# to look for specific entities in the text columns.

# Load spaCy's small English model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Create and add the EntityRuler
ruler = nlp.add_pipe("entity_ruler", before="ner")

# List of entities and patterns
# Note: each dict describes ONE token, so multi-word phrases need one dict per word.
# The regexes are anchored with ^ and $ so they match whole tokens only
# (otherwise "arms?" would also match "warm", for example).
patterns = [
    # labels for body part target
    {"label": "FULL_BODY", "pattern": [{"LOWER": {"REGEX": r"^(full|total|whole)$"}}, {"LOWER": "body"}]},
    {"label": "UPPER_BODY", "pattern": [{"LOWER": "upper"}, {"LOWER": "body"}]},
    {"label": "LOWER_BODY", "pattern": [{"LOWER": "lower"}, {"LOWER": "body"}]},
    {"label": "CHEST", "pattern": [{"LOWER": "chest"}]},
    {"label": "BACK", "pattern": [{"LOWER": "back"}]},
    {"label": "ABS", "pattern": [{"LOWER": {"REGEX": r"^(core|ab|abs|plank)$"}}]},
    {"label": "ARMS", "pattern": [{"LOWER": {"REGEX": r"^arms?$"}}]},
    {"label": "LEGS", "pattern": [{"LOWER": {"REGEX": r"^(thigh|thighs|leg|legs)$"}}]},
    {"label": "GLUTES", "pattern": [{"LOWER": {"REGEX": r"^(booty|glute|glutes|butt)$"}}]},
    # labels for workout type
    {"label": "HIIT", "pattern": [{"LOWER": "hiit"}]},
    {"label": "CARDIO", "pattern": [{"LOWER": "cardio"}]},
    {"label": "DANCE", "pattern": [{"LOWER": "dance"}]},
    {"label": "TABATA", "pattern": [{"LOWER": "tabata"}]},
    {"label": "PILATES", "pattern": [{"LOWER": "pilates"}]},
    {"label": "BARRE", "pattern": [{"LOWER": "barre"}]},
    {"label": "YOGA", "pattern": [{"LOWER": "yoga"}]},
    # labels for special aspects of the workout
    {"label": "STANDING", "pattern": [{"LOWER": "standing"}]},
    {"label": "NO_EQUIPMENT", "pattern": [{"LOWER": "no"}, {"LOWER": {"REGEX": r"^(equip|equipment|equipments|weight|weights)$"}}]},
    {"label": "NO_JUMPING", "pattern": [{"LOWER": "no"}, {"LOWER": "jumping"}]},
    {"label": "LOW_IMPACT", "pattern": [{"LOWER": "low"}, {"LOWER": "impact"}]},
    {"label": "STRENGTH_TRAINING", "pattern": [{"LOWER": {"REGEX": r"^(strength|sculpt|sculpting|tone|toning|toned)$"}}]}
]

ruler.add_patterns(patterns)
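
Two properties of token patterns are easy to overlook when writing rules like the ones above. First, each dict in a pattern describes exactly one token, and a Python dict literal keeps only the last value for a repeated key. Second, to our understanding spaCy applies `REGEX` with search semantics, so an unanchored expression also matches inside longer tokens. The sketch below illustrates both points with the standard library only. It does not call spaCy, and the token list is made up.

```python
import re

# A dict literal keeps only the last value for a repeated key,
# so this token spec silently reduces to {"LOWER": "jumping"}.
collapsed = {"LOWER": "no", "LOWER": "jumping"}

# One dict per token expresses the two-word phrase "no jumping".
two_tokens = [{"LOWER": "no"}, {"LOWER": "jumping"}]

# Unanchored search matches inside longer words;
# anchoring with ^ and $ restricts the match to the whole token.
unanchored = re.compile(r"arms?")
anchored = re.compile(r"^arms?$")

tokens = ["warm", "arms", "alarm"]
loose_hits = [t for t in tokens if unanchored.search(t)]
strict_hits = [t for t in tokens if anchored.search(t)]

print(collapsed)
print(loose_hits, strict_hits)
```

For workout titles this matters in practice: a phrase such as "warm up" should not count as an arms workout, and a video that mentions jumping should not be labelled as having none.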

In [133]:

def extract_ent_label(string, label):
    """ Check whether the text contains an entity with the given label

    Arguments:
        string -- the string value of either the title or the description column
        label -- one of the labels defined in the patterns list above

    Returns:
        A logical value, True if spaCy finds an entity with that label, otherwise False
    """
    doc = nlp(string)
    return any(ent.label_ == label for ent in doc.ents)


In [134]:
# Extract the values for the "label" key and store them as a list
labels_list = [pattern['label'] for pattern in patterns]

# Add one new column per label to the existing videos dataframe
videos_df = videos_df.assign(**{label: None for label in labels_list})

# Apply the extract_ent_label function to the new columns row-wise using apply and lambda function
for label in labels_list:
    print("Getting video text information for column: " + label)
    
    # extract text information about the workout from the video title first
    videos_df[label] = videos_df.apply(lambda row: extract_ent_label(row['title'], label), axis=1)

    # if missing from the title, extract text information about the workout from the video description
    videos_df[label] = videos_df.apply(lambda row: extract_ent_label(row['description'], label)\
        if (not row[label]) else row[label], axis=1)

Getting video text information for column: FULL_BODY
Getting video text information for column: UPPER_BODY
Getting video text information for column: LOWER_BODY
Getting video text information for column: CHEST
Getting video text information for column: BACK
Getting video text information for column: ABS
Getting video text information for column: ARMS
Getting video text information for column: LEGS
Getting video text information for column: GLUTES
Getting video text information for column: HIIT
Getting video text information for column: CARDIO
Getting video text information for column: DANCE
Getting video text information for column: TABATA
Getting video text information for column: PILATES
Getting video text information for column: BARRE
Getting video text information for column: YOGA
Getting video text information for column: STANDING
Getting video text information for column: NO_EQUIPMENT
Getting video text information for column: NO_JUMPING
Getting video text information for column:

In [142]:
# Drop the columns we no longer need
videos_df = videos_df.drop(columns=['duration','published_at','duration_secs'])

In [143]:
videos_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4052 entries, 0 to 4105
Data columns (total 38 columns):
 #   Column              Non-Null Count  Dtype                  
---  ------              --------------  -----                  
 0   video_id            4052 non-null   object                 
 1   channel             4052 non-null   category               
 2   title               4052 non-null   object                 
 3   description         4052 non-null   object                 
 4   tags                3537 non-null   object                 
 5   total_views         4052 non-null   int64                  
 6   total_likes         4052 non-null   float64                
 7   total_comments      4052 non-null   float64                
 8   duration_mins       4052 non-null   float64                
 9   duration_category   4052 non-null   category               
 10  published_datetime  4052 non-null   datetime64[ns, tzutc()]
 11  published_date      4052 non-null   datetim

#### Save the processed data 

In [144]:
# Save dataframes as csv files in the processed subfolder of the data folder
channels_df.to_csv(project_dir + "/data/processed/fitness_channels_processed_2023_07_11.csv", index=False)
videos_df.to_csv(project_dir + "/data/processed/fitness_videos_processed_2023_07_11.csv", index=False)