# Data Pre-Processing

This notebook first cleans the raw data and then enriches it through feature engineering.


## 1. Data Cleaning

We clean the data by:
- checking for duplicate rows
- checking for missing values
- checking the percentage of missing values of each column
- checking for any redundant (uninformative) columns
- checking the column data types
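
The checks listed above can also be bundled into one helper so that both dataframes go through the same routine. This is only a minimal sketch: `summarize_quality` and the toy dataframe are made up for illustration and are not part of the project code.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> dict:
    """Collect the basic data-cleaning checks for one dataframe."""
    # Share of missing values per column, in percent
    missing_pct = df.isnull().mean() * 100
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_pct": missing_pct[missing_pct > 0].round(2).to_dict(),
        "empty_columns": missing_pct[missing_pct == 100].index.tolist(),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Toy data (invented for this example, not taken from the project files)
toy = pd.DataFrame({
    "title": ["a", "b", "b", "c"],
    "likeCount": [1.0, None, None, 4.0],
    "favouriteCount": [None, None, None, None],
})
report = summarize_quality(toy)
print(report)
```

Columns listed under `empty_columns` are candidates for dropping, which is what happens to `favouriteCount` later in this section.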

In [26]:
import os
import re
import ast
import pandas as pd
import numpy as np
import spacy         # Natural language processing
import isodate       # Date transformation and manipulation

from os.path import dirname, abspath
from dateutil import parser

#### Read the channels and videos files into dataframes

In [27]:
# Read in the csv files from the raw data folder
project_dir = dirname(dirname(abspath("02-data-preprocessing.ipynb")))
channels_df = pd.read_csv(project_dir + "/data/raw/fitness_channels_2023_07_11.csv")
videos_df = pd.read_csv(project_dir + "/data/raw/fitness_videos_2023_07_11.csv")

#### Check for duplicate rows

In [28]:
# check for duplicate rows in channels and videos data
(channels_df.duplicated().any(),videos_df.duplicated().any())

(False, False)

#### Check for missing data

In [29]:
channels_df.isnull().any()

ChannelName           False
ChannelDescription    False
PublishedDate         False
TotalSubscribers      False
TotalViews            False
TotalVideos           False
playlistID            False
dtype: bool

In [30]:
videos_df.isnull().any()

video_id          False
channelTitle      False
title             False
description        True
tags               True
publishedAt       False
viewCount         False
likeCount          True
favouriteCount     True
commentCount       True
duration          False
definition        False
dtype: bool

#### Check for percentage of missing values

In [31]:
# Find the percentage of missing values in the videos dataframe columns that contain them
missingval_columns = videos_df.loc[:, ['description', 'tags', 'likeCount','favouriteCount','commentCount']]
missingval_columns.isnull().sum() / missingval_columns.shape[0] * 100.00

description         4.554311
tags               12.883585
likeCount           1.217730
favouriteCount    100.000000
commentCount        0.243546
dtype: float64

In [32]:
# Drop the favouriteCount column since it only contains missing values
videos_df.drop('favouriteCount', axis=1, inplace=True)

#### Check for redundant columns

In [33]:
# Find the unique values of the definition column
videos_df['definition'].unique()

array(['hd', 'sd'], dtype=object)

In [34]:
# Check the percentage of each unique value in the definition column
videos_df['definition'].value_counts() / videos_df['definition'].shape[0] * 100

hd    99.561617
sd     0.438383
Name: definition, dtype: float64

In [35]:
# Drop the definition column: over 99.5% of the videos are 'hd', so it is uninformative
videos_df.drop('definition', axis=1, inplace=True)
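
The same check can be applied to every column at once. The sketch below flags columns whose most frequent value covers at least a given share of the rows. The helper name `near_constant_columns` and the 99% threshold are assumptions chosen for illustration.

```python
import pandas as pd


def near_constant_columns(df: pd.DataFrame, threshold: float = 0.99) -> dict:
    """Return {column: share of its most frequent value} for columns at or above threshold."""
    flagged = {}
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the dominant value
        shares = df[col].value_counts(normalize=True, dropna=False)
        if not shares.empty and shares.iloc[0] >= threshold:
            flagged[col] = float(shares.iloc[0])
    return flagged


# Toy data (invented): one dominant category and one fully varied column
toy = pd.DataFrame({
    "definition": ["hd"] * 199 + ["sd"],
    "viewCount": range(200),
})
flagged = near_constant_columns(toy)
print(flagged)
```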

In [36]:
# Check the date values to make sure there are no errors
videos_df.publishedAt.sort_values()

2836    2009-10-06T04:47:37Z
2835    2009-11-04T18:05:19Z
2834    2009-11-10T05:18:29Z
2833    2009-11-18T06:49:44Z
2832    2009-11-30T08:13:39Z
                ...         
1       2023-07-10T15:18:40Z
733     2023-07-11T12:46:31Z
1652    2023-07-11T13:00:16Z
0       2023-07-11T14:00:17Z
1241    2023-07-11T14:00:39Z
Name: publishedAt, Length: 4106, dtype: object

#### Checking the column data types

In [37]:
channels_df.dtypes

ChannelName           object
ChannelDescription    object
PublishedDate         object
TotalSubscribers       int64
TotalViews             int64
TotalVideos            int64
playlistID            object
dtype: object

In [38]:
videos_df.dtypes

video_id         object
channelTitle     object
title            object
description      object
tags             object
publishedAt      object
viewCount         int64
likeCount       float64
commentCount    float64
duration         object
dtype: object

## 2. Feature Engineering

Next, we enrich the data through feature engineering:
* Convert the ISO 8601 duration format returned by the YouTube Data API v3
    1. First convert to timedelta[s]
    2. Then convert to minutes, since this unit is more intuitive to work with for workout videos
    3. Check for any errors or outliers (and address them if necessary)
* Derive new features from the published date column
    1. Get published year
    2. Get published month
    3. Get published day of the week
    4. Get published time of day
* Get the length of the video title
* Get the total number of tags in the tags column
* Use natural language processing (via spaCy) to derive useful features from the video title, description, and/or tags
    1. Get the workout length, since the actual workout duration and the video duration are usually not the same
    2. Get the workout type (cardio, HIIT, yoga, pilates, etc.)
    3. Get the target body part (legs, abs, arms, etc.)
    4. Get any special aspect of the workout (no jumping, standing, no equipment, etc.)
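
The workout length is often stated directly in the title (for example "45 MIN" or "1 HOUR"), so a regular expression may be enough as a first pass. The following is a rough sketch under that assumption. The pattern and the helper name `workout_minutes` are illustrative and not the method used in this notebook.

```python
import re
from typing import Optional

# Matches e.g. "45 MIN", "45-min", "10 minute", "1 HOUR", "1.5 hr"
_LENGTH_RE = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*-?\s*(hours?|hrs?|minutes?|mins?)\b",
    re.IGNORECASE,
)


def workout_minutes(title: str) -> Optional[float]:
    """Return the first workout length mentioned in the title, in minutes."""
    match = _LENGTH_RE.search(title)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    # Hours are converted to minutes; minutes are returned unchanged
    return value * 60 if unit.startswith("h") else value


for title in [
    "45-min Full Body Fat Burn HIIT at home",
    "1 HOUR FULL BODY FAT BURN HOME WORKOUT",
    "Abs & Booty Workout Livestream",
]:
    print(title, "->", workout_minutes(title))
```

Titles without an explicit length return `None`, so the video duration could still serve as a fallback for those rows.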

In [39]:
# Parse the ISO 8601 duration into a timedelta, then express it in minutes
videos_df['durationSecs'] = videos_df['duration'].apply(isodate.parse_duration)
videos_df['durationMins'] = pd.to_timedelta(videos_df['durationSecs']).dt.total_seconds() / 60
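
To make the conversion concrete, here is a standard-library sketch of what parsing such a duration string involves. It only handles the day, hour, minute and second components (an assumption that fits typical video durations), whereas `isodate` covers the full standard. The function name `iso_duration_to_minutes` is hypothetical.

```python
import re

_ISO_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso_duration_to_minutes(value: str) -> float:
    """Convert a duration such as 'PT1H2M35S' to minutes."""
    match = _ISO_DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"Unsupported ISO 8601 duration: {value!r}")
    # Components that are absent from the string count as zero
    parts = {name: int(number or 0) for name, number in match.groupdict().items()}
    total_seconds = (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
    return total_seconds / 60


print(iso_duration_to_minutes("PT45M18S"))
print(iso_duration_to_minutes("PT2H10M49S"))
```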

In [40]:
# Check the distribution of the video duration for presence of any outliers
videos_df['durationMins'].describe()

count    4106.000000
mean       10.764178
std         9.224501
min         0.000000
25%         3.650000
50%        10.616667
75%        14.491667
max       130.816667
Name: durationMins, dtype: float64

In [41]:
# Check for videos that are extremely long (greater than 45 mins).
# Note: there are 20 videos of this length
videos_df[videos_df['durationMins'] > 45][['channelTitle','title','duration','video_id']]

| | channelTitle | title | duration | video_id |
|---|---|---|---|---|
| 822 | emi wong | 45 MIN WALKING CARDIO WORKOUT \| Intense Full B... | PT45M18S | fIDmwKCJmlA |
| 930 | emi wong | 45 min Full Body Workout to BURN MAX CALORIES ... | PT47M58S | Wgm1Xc25imM |
| 991 | emi wong | 1 HOUR FULL BODY FAT BURN HOME WORKOUT (Warm U... | PT1H2M35S | p188evCXF0k |
| 1013 | emi wong | 2 Weeks Workout Program to Lose Weight, Get Ab... | PT1H4M51S | EJKw3Mh0MyI |
| 1094 | emi wong | 45-min Full Body Fat Burn HIIT at home with NO... | PT46M43S | wAIRYalt75w |
| 1389 | Chloe Ting | Abs & Booty Workout Livestream | PT1H34M6S | tahd5q-onKc |
| 1409 | Chloe Ting | 10 Million Subs LIVESTREAM \| Let's hangout + W... | PT2H10M49S | IdO0Ie3_2QU |
| 1451 | Chloe Ting | 2000 REP Full Body & Abs Workout CHALLENGE for... | PT49M48S | 004CudS_3Ew |
| 1517 | Chloe Ting | 45 Min Full Body FAT BURN Workout \| Get Flat A... | PT46M53S | LDvAuqTZxMw |
| 1737 | blogilates | POPFLEX Pre-Black Friday Extravaganza | PT1H12M46S | HO84VmEkfq0 |


In [42]:
# Check for videos that have a duration of 0 mins.
# These are livestreams that appear to have been turned on accidentally
# and turned off right away, so we drop them.
# .copy() avoids a SettingWithCopyWarning when new columns are added later
videos_df = videos_df[videos_df['durationMins'] > 0].copy()

In [43]:
# Create publish datetime and year for channels data
channels_df['publishedDatetime'] =  channels_df['PublishedDate'].apply(lambda x: parser.parse(x))
channels_df['publishedYear'] = channels_df['publishedDatetime'].apply(lambda x: int(x.strftime("%Y")))

# Create publish year and month (of the year) columns
videos_df['publishedDatetime'] = videos_df['publishedAt'].apply(lambda x: parser.parse(x))
videos_df['publishedYear'] = videos_df['publishedDatetime'].apply(lambda x: int(x.strftime("%Y")))
videos_df['publishedMonth'] = videos_df['publishedDatetime'].apply(lambda x: x.strftime("%b"))

# Create publish day (of the week) column
videos_df['publishedDayName'] = videos_df['publishedDatetime'].apply(lambda x: x.strftime("%a"))
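
The feature list also mentions the published time of day, which the cell above does not derive. One possible way to add it is sketched below with the standard library. The bucket boundaries and the dictionary keys are assumptions for illustration. Note that the timestamps are in UTC, so the bucket reflects UTC time rather than the creator's local time.

```python
from datetime import datetime, timezone


def time_of_day(hour: int) -> str:
    """Bucket an hour (0-23) into a coarse part of the day (boundaries are an assumption)."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def published_features(published_at: str) -> dict:
    """Derive year, month, weekday and time of day from a UTC timestamp string."""
    dt = datetime.strptime(published_at, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        "publishedYear": dt.year,
        "publishedMonth": dt.strftime("%b"),
        "publishedDayName": dt.strftime("%a"),
        "publishedTimeOfDay": time_of_day(dt.hour),
    }


features = published_features("2023-07-11T14:00:17Z")
print(features)
```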

In [44]:
# Convert the description column to string (missing descriptions become empty strings)
videos_df['description'] = videos_df['description'].fillna('').astype(str)

# Title character length
videos_df['titleLength'] = videos_df['title'].str.len()

# Parse the stringified tag lists (missing tags become None) and count the tags
videos_df['tags'] = videos_df['tags'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else None)
videos_df['tagsCount'] = videos_df['tags'].apply(lambda x: 0 if x is None else len(x))
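
Because the tags are stored in the CSV as the string form of a Python list, parsing can fail on missing or malformed cells. A defensive variant might look like the sketch below. `parse_tags` is a hypothetical helper, and returning an empty list instead of `None` is a design choice that differs from the cell above.

```python
import ast


def parse_tags(raw) -> list:
    """Parse a stringified list of tags; return [] for missing or malformed values."""
    if not isinstance(raw, str):
        return []  # NaN or None read from the CSV
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


print(parse_tags("['hiit', 'abs workout']"))
print(len(parse_tags(float("nan"))))
```

With this helper the tag count is simply `len(parse_tags(value))`, with no special case for missing values.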


In [45]:
# Now we are going to use spaCy's entity ruler along with regex 
# to look for specific entities in the text columns.

# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

# Create and add the EntityRuler
ruler = nlp.add_pipe("entity_ruler", before="ner")

# List of entities and patterns
# Note: each dict in a pattern describes one token, so phrases need one dict per word.
# REGEX values are anchored with ^...$ so that they only match whole tokens.
patterns = [
    # labels for body part target
    {"label": "FULL_BODY", "pattern": [{"LOWER": {"REGEX": r"^(full|total|whole)$"}}, {"LOWER": "body"}]},
    {"label": "UPPER_BODY", "pattern": [{"LOWER": "upper"}, {"LOWER": "body"}]},
    {"label": "LOWER_BODY", "pattern": [{"LOWER": "lower"}, {"LOWER": "body"}]},
    {"label": "CHEST", "pattern": [{"LOWER": "chest"}]},
    {"label": "BACK", "pattern": [{"LOWER": "back"}]},
    {"label": "ABS", "pattern": [{"LOWER": {"REGEX": r"^(core|ab|abs|plank)$"}}]},
    {"label": "ARMS", "pattern": [{"LOWER": {"REGEX": r"^arms?$"}}]},
    {"label": "LEGS", "pattern": [{"LOWER": {"REGEX": r"^(thigh|thighs|leg|legs)$"}}]},
    {"label": "GLUTES", "pattern": [{"LOWER": {"REGEX": r"^(booty|glute|glutes|butt)$"}}]},
    # labels for workout type
    {"label": "HIIT", "pattern": [{"LOWER": "hiit"}]},
    {"label": "CARDIO", "pattern": [{"LOWER": "cardio"}]},
    {"label": "DANCE", "pattern": [{"LOWER": "dance"}]},
    {"label": "TABATA", "pattern": [{"LOWER": "tabata"}]},
    {"label": "PILATES", "pattern": [{"LOWER": "pilates"}]},
    {"label": "BARRE", "pattern": [{"LOWER": "barre"}]},
    {"label": "YOGA", "pattern": [{"LOWER": "yoga"}]},
    # labels for special aspects of the workout
    {"label": "STANDING", "pattern": [{"LOWER": "standing"}]},
    {"label": "NO_EQUIPMENT", "pattern": [{"LOWER": "no"}, {"LOWER": {"REGEX": r"^(equip|equipment|equipments|weight|weights)$"}}]},
    {"label": "NO_JUMPING", "pattern": [{"LOWER": "no"}, {"LOWER": "jumping"}]},
    {"label": "LOW_IMPACT", "pattern": [{"LOWER": "low"}, {"LOWER": "impact"}]},
    # label for strength / toning workouts
    {"label": "STRENGTH_TRAINING", "pattern": [{"LOWER": {"REGEX": r"^(strength|sculpt|sculpting|tone|toning|toned)$"}}]}
]

ruler.add_patterns(patterns)
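
Two details of token patterns are easy to get wrong, and both can be shown with plain Python. First, a dict literal that repeats a key keeps only the last value, so a two-word phrase needs one dict per token. Second, assuming the `REGEX` operator applies the expression with search semantics (it is not implicitly anchored), an unanchored expression also matches inside longer words. The variable names below are illustrative only.

```python
import re

# 1. Repeating a key inside one dict literal keeps only the last value,
#    so this describes the single token "jumping", not the phrase "no jumping".
single_token = {"LOWER": "no", "LOWER": "jumping"}
print(single_token)

# A two-token phrase needs one dict per token.
two_tokens = [{"LOWER": "no"}, {"LOWER": "jumping"}]

# 2. An unanchored regex also matches inside longer words.
loose = re.compile(r"arms?")
strict = re.compile(r"^arms?$")
for word in ["arms", "warm", "alarm"]:
    print(word, bool(loose.search(word)), bool(strict.search(word)))
```

With the loose expression, a title containing "warm up" would be flagged as an arm workout, which is why anchoring matters for short keywords such as "ab", "arm" or "leg".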

In [46]:

def extract_ent_label(string, label):
    """ Identify the presence of a label among the entities found by the spaCy pipeline

    Arguments:
        string -- the string value of either the title or the description column
        label -- one of the labels defined in the patterns list above

    Returns:
        True if the label is found, otherwise False
    """
    doc = nlp(string)
    return any(ent.label_ == label for ent in doc.ents)


In [47]:
# Extract the values for the "label" key and store them as a list
labels_list = [pattern['label'] for pattern in patterns]

# Create new columns in the existing videos dataframe
videos_df = videos_df.assign(**{label: None for label in labels_list})

# Apply the extract_ent_label function to the new columns row-wise using apply and lambda function
for label in labels_list:
    print("Getting video text information for column: " + label)
    
    # extract text information about the workout from the video title first
    videos_df[label] = videos_df.apply(lambda row: extract_ent_label(row['title'], label), axis=1)

    # if not found in the title, fall back to the video description
    # (extract_ent_label returns True/False, never NaN, so test the flag itself)
    videos_df[label] = videos_df.apply(lambda row: row[label] or extract_ent_label(row['description'], label), axis=1)

Getting video text information for column: FULL_BODY
Getting video text information for column: UPPER_BODY
Getting video text information for column: LOWER_BODY
Getting video text information for column: CHEST
Getting video text information for column: BACK
Getting video text information for column: ABS
Getting video text information for column: ARMS
Getting video text information for column: LEGS
Getting video text information for column: GLUTES
Getting video text information for column: HIIT
Getting video text information for column: CARDIO
Getting video text information for column: DANCE
Getting video text information for column: TABATA
Getting video text information for column: PILATES
Getting video text information for column: BARRE
Getting video text information for column: YOGA
Getting video text information for column: STANDING
Getting video text information for column: NO_EQUIPMENT
Getting video text information for column: NO_JUMPING
Getting video text information for column:
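
The loop above runs the pipeline on every text once per label. A possible alternative is to analyse each text once, collect the set of labels it contains, and derive all flags from that set. The sketch below illustrates the idea with a stand-in keyword matcher so that it runs without a spaCy model. `label_flags`, `keyword_labels` and `KEYWORDS` are hypothetical names, not part of the notebook.

```python
from typing import Callable, Dict, Iterable, Set


def label_flags(
    title: str,
    description: str,
    labels: Iterable[str],
    find_labels: Callable[[str], Set[str]],
) -> Dict[str, bool]:
    """Flag each label, looking at the title first and the description as a fallback."""
    labels = list(labels)
    in_title = find_labels(title)
    # Only analyse the description if at least one label is missing from the title
    if all(label in in_title for label in labels):
        in_description = set()
    else:
        in_description = find_labels(description)
    return {label: label in in_title or label in in_description for label in labels}


# Stand-in matcher for illustration only (simple keyword lookup, not spaCy)
KEYWORDS = {"HIIT": "hiit", "CARDIO": "cardio", "YOGA": "yoga"}


def keyword_labels(text: str) -> Set[str]:
    words = set(text.lower().split())
    return {label for label, word in KEYWORDS.items() if word in words}


flags = label_flags("20 MIN HIIT workout", "A sweaty cardio session", KEYWORDS, keyword_labels)
print(flags)
```

With the pipeline defined earlier, `find_labels` could be `lambda text: {ent.label_ for ent in nlp(text).ents}`, so that each title and description is parsed at most once.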

#### Save the processed data 

In [49]:
# Save dataframes as csv files in the processed subfolder of the data folder
channels_df.to_csv(project_dir + "/data/processed/fitness_channels_processed_2023_07_11.csv", index=False)
videos_df.to_csv(project_dir + "/data/processed/fitness_videos_processed_2023_07_11.csv", index=False)
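
The save step assumes that the `data/processed` folder already exists. A small helper based on `pathlib` can create it on demand and avoid manual string concatenation. This is a sketch, and `processed_path` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def processed_path(project_dir, filename: str) -> Path:
    """Return <project_dir>/data/processed/<filename>, creating the folder if needed."""
    out_dir = Path(project_dir) / "data" / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename


# Demonstrate with a throwaway directory instead of the real project folder
with tempfile.TemporaryDirectory() as tmp:
    target = processed_path(tmp, "fitness_videos_processed_2023_07_11.csv")
    folder_created = target.parent.is_dir()
print(target.name, folder_created)
```

In the notebook this could be used as `videos_df.to_csv(processed_path(project_dir, "fitness_videos_processed_2023_07_11.csv"), index=False)`.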