# Data Preprocessing

Dieses Notebook wendet ein Data Cleaning auf diese Daten an und unterteilt diese in länder- und jahresspezifische csv-files, die im Ordner "Data/preprocessed_data" gespeichert werden.

# Install the requirements

Laden aller Packages die für das Projekt benötigt werden

In [11]:
! pip install -r ../requirements.txt



# Imports


In [1]:
import pandas as pd
import numpy as np
import os
import tensorflow as tf
import tqdm

# Preprocessing
import nltk
from profanity_check import predict_prob



2023-07-02 21:56:51.976264: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


# Load the data

Um die Daten über die Kaggel API zu laden muss ein API Token für die Authentifizierung hinterlegt werden.

Dieser kann über diesen Link erstellt werden https://www.kaggle.com/settings/account.

Das kaggel.json file muss im Ordner ~/.kaggle/kaggle.json.

Versteckte Datein können bei Mac über  “Command” + “Shift” + “.” sichtbar gemacht werden.

Weitere Informationen unter: https://www.kaggle.com/docs/api

In [2]:
# Download the Dataset from the Kaggel API

! kaggle datasets download -d datasnaek/youtube-new

Downloading youtube-new.zip to /Users/leamayer/Programmierung/YouTube_Stats_Forecasting/src
100%|███████████████████████████████████████▊| 200M/201M [00:21<00:00, 11.1MB/s]
100%|████████████████████████████████████████| 201M/201M [00:21<00:00, 9.92MB/s]


In [3]:
# Unzi download into folder

! unzip youtube-new.zip -d extracted_files

Archive:  youtube-new.zip
  inflating: extracted_files/CA_category_id.json  
  inflating: extracted_files/CAvideos.csv  
  inflating: extracted_files/DE_category_id.json  
  inflating: extracted_files/DEvideos.csv  
  inflating: extracted_files/FR_category_id.json  
  inflating: extracted_files/FRvideos.csv  
  inflating: extracted_files/GB_category_id.json  
  inflating: extracted_files/GBvideos.csv  
  inflating: extracted_files/IN_category_id.json  
  inflating: extracted_files/INvideos.csv  
  inflating: extracted_files/JP_category_id.json  
  inflating: extracted_files/JPvideos.csv  
  inflating: extracted_files/KR_category_id.json  
  inflating: extracted_files/KRvideos.csv  
  inflating: extracted_files/MX_category_id.json  
  inflating: extracted_files/MXvideos.csv  
  inflating: extracted_files/RU_category_id.json  
  inflating: extracted_files/RUvideos.csv  
  inflating: extracted_files/US_category_id.json  
  inflating: extracted_files/USvideos.csv  


In [10]:
# Copy english speacking contries to data folder

! cp extracted_files/CAvideos.csv ../Data/original_data/
! cp extracted_files/GBvideos.csv ../Data/original_data/
! cp extracted_files/USvideos.csv ../Data/original_data/

In [11]:
# remove folder with unneccesary files and zip

! rm -r extracted_files
! rm -r youtube-new.zip

# Enrichment of the Data

## Add Missing Values or Subscription Details

Das Abonnieren eines Kanals auf YouTube kann darauf hinweisen, wie die anderen Videos dieses Youtubers abschneiden. Daher ist die Abonnentenzahl ein Maß für die Reichweite des YouTubers. [1] Je nach Reichweite des Publikums, das durch ein Video erreicht werden kann, ist die Abonnentenzahl ein wichtiges Merkmal bei der Vorhersage der potenziellen Aufrufe eines Videos. 

Im Folgenden wurde die `YouTube Data API v3` von `www.googleapis.com` verwendet, um die Abonnentenzahl des Kanals zum df hinzuzufügen. Darüber hinaus wurde versucht die API zu verwenden um Missing Values im Dataframe zu ersetzen. Durch ein tägliches Limit bei den Request konnte dieser Ansatz jedoch nicht weiter verfolgt werden.

[1] https://tuberanker.com/blog/what-are-the-benefits-of-subscribing-to-a-youtube-channel (04.06.23)

In [6]:
'''# The following youtube_authenticate function is available under this url: https://www.thepythoncode.com/article/using-youtube-api-in-python#google_vignette
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

import urllib.parse as p
import re
import os
import pickle

SCOPES = ["https://www.googleapis.com/auth/youtube.force-ssl"]

def youtube_authenticate():
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
    api_service_name = "youtube"
    api_version = "v3"
    client_secrets_file = "client_secret_883243427026-gmpvuoto8cd0gsl6djl80sesaa6b5jo0.apps.googleusercontent.com.json"
    creds = None
    # the file token.pickle stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first time
    if os.path.exists("token.pickle"):
        with open("token.pickle", "rb") as token:
            creds = pickle.load(token)
    # if there are no (valid) credentials availablle, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(client_secrets_file, SCOPES)
            creds = flow.run_local_server(port=0)
        # save the credentials for the next run
        with open("token.pickle", "wb") as token:
            pickle.dump(creds, token)

    return build(api_service_name, api_version, credentials=creds)

# authenticate to YouTube API
youtube = youtube_authenticate()'''

In [7]:
'''# See the Youtube API Documentation for more insights: https://developers.google.com/youtube/v3/docs/subscriptions/list?hl=de&apix=true

def get_channel_details(youtube, **kwargs):

    request = youtube.channels().list(
        part="statistics", # just get stats
        **kwargs
    )
    response = request.execute()

    return response'''

In [8]:
'''# get subscription count and save it to the df

from tqdm import tqdm

for i in tqdm(range (0, len(df))):

    try:

        # get current channel Name
        channel_name = df.iloc[i]["channel_title"]

        # make API call to get channel infos
        response = get_channel_details(youtube, forUsername=channel_name)
        items = response.get("items")[0]
        statistics = items["statistics"]

        # get stats infos
        subscriberCount = statistics["subscriberCount"]

        # save subscriptions to df
        df.at[i,"subscriber_count"] = int(subscriberCount)

    except:

        df.at[i,"subscriber_count"] = np.NAN'''

100%|██████████| 40881/40881 [22:59<00:00, 29.63it/s]


# Data Preprocessing

Im Folgenden werden die Daten verarbeitet, indem diese bereinigt und mit NLP-Preprocessing Methoden bearbeitet werden. Damit soll die Qualität, die Aussagekraft und die politische Korrektheit der Texte gewehrleistet werden.
Weitere modellspezifische Preprocessing-Schritte werden in den entsprechenden Notebooks durchgeführt.

## Download Thumnails and add Path to jpg to the df

Jedes YouTube-Video wird durch eine Thumbnail dargestellt, ein kleines Bild, das zusammen mit dem Titel und dem Kanal als "Cover" des Videos dient. Interessante und gut gestaltete Thumbnails ziehen Zuschauer an, während verwirrende und minderwertige Thumbnails die Zuschauer dazu bringen, woanders hinzuklicken. Ein Beweis dafür, wie wichtig ein guter Thumbnail ist: 90 % der erfolgreichsten YouTube-Videos haben Kunden-Thumbnails [2]. 

Dementsprechend kann und wurde diese Funktion verwendet, um die Thumbnails zu laden. Jedoch werden diese Daten als bereits genannten Gründen nicht für die finalen Modelle verwendet.

[2] YouTube Creator Academy. Lesson: Make click- able thumbnails. https://creatoracademy.youtube.com/page/lesson/thumbnails#yt-creators-strategies-5. (04.06.23)

In [4]:
'''# load thumbnail and save picture path in df
import urllib.request
from tqdm import tqdm
import os

def load_and_save_thumbnail(url, iterator):
    try:

        # get unique thumbnail id 
        splitted_words = url.split("/")
        thumbnail_id = splitted_words[4]

        # create filename to save thumbnail
        file_path = f"../Data/thumbnails/{thumbnail_id}.jpg"
        #df_w_jgp.at[iterator,"thumbnail_link"] = thumbnail_id

        # check if thumbnail is already downloaded
        if os.path.exists(file_path):
            pass
        else:
            # download and save thumbnail 
            urllib.request.urlretrieve(url, file_path)

    except urllib.error.URLError as e:

        df_w_jgp.at[iterator,"thumbnail_link"] = np.NAN
        # print(f"An error occurred: {e}")
        return None

for i in tqdm(range (0, len(df))):

    # get thumbnail url
    url = df.iloc[i]["thumbnail_link"]

    # safe thumnail and write path into df
    load_and_save_thumbnail(url, i)  '''

100%|██████████| 500/500 [00:08<00:00, 58.34it/s]


## Define Data Cleaning

In [12]:
def data_cleaning(df):

    # Drop Unnececary Columns

    df = df [["channel_title","tags","description", "title", "views", 'trending_date', 'video_id']]
    
    # Handling Missing Values

    if df.isnull().any().any() or df.isna().any().any():

        num = df.isnull().sum().sum() + df.isna().sum().sum() # Count Missing Values

        print(">>>",num, "Missing Values are beeing handeled")

        df = df.dropna() # Drop rows with missing values
        #df.fillna(value) # Fill missing values with a specific value

    else:
        print(">>> No Missing Values detected")
        
    # Remove Duplicate

    if df.duplicated().any():

        num = df.duplicated().sum() # Count Duplicates

        print(">>>",num, "Duplicates are beeing handeled")

        df.drop_duplicates()

    else:
        print(">>> No Duplicates detected")

    # Convert Data Types

    #print( df.dtypes ) # check data types

    # Convert 'trending_date' to year
    df["trending_date"] = df["trending_date"].str[6:].astype(int) # save trending year
    df = df.rename(columns={'trending_date': 'trending_year'})

    for column in df:

        # Check if column is of type bool
        if df[column].dtype == 'bool':
            df[column] = df[column].astype(int)

    # Check preprocessed data

    print(f">>> Updated Data Types \n {df.dtypes}")

    return df

## Remove Profanity

Damit das Modell nicht darauf trainiert wird, dass beleidigende Videoinhalte Views generieren können, wird an dieser Stelle ein Vortrainiertes Modell verwendet, um Videos mit solchen Inhalten zu identifizieren und zu entfernen. Mithilfe des Schwellenwertes kann die Stärke der Kontrolle abgepasst werden. Dieser wurde so festgelegt, dass versehentliche Aussortierungen möglichst verhindert werden können. Trotzdem gilt es, lieber Videos mit unfeindlichen Inhalten auszuschließen, als Videos mit feindlichen Inhalten zu behalten.

In [None]:
# Check text features for profanity and remove Videos with hate speech

threshold = 0.9 # Strength of the algorithm punishment (Higher -> more profanity needed to exculde Video)

def check_profanity(feature_df):

    before = feature_df.shape[0]

    for feature in feature_df:

        print(f"\n--- Checking {feature} feature for profanity ---\n")

        # for each row in dataframe column check if condition is true and save the index 
        # 0.0 -> no profanity / 1.0 -> profanity

        rows_to_drop = [index for index, f in enumerate(feature_df[feature]) if predict_prob([f]) > threshold]

        if len(rows_to_drop) >= 1:
            
            example = feature_df[feature][rows_to_drop[0]]
            print(f">>> Removed for example: {example}")

        try:
            feature_df = feature_df.drop(rows_to_drop, axis = 0)
        except:
            pass

    after = feature_df.shape[0]

    print(f">>> Removed {before-after} Videos containing possible profanity")

    return feature_df


## Apply Functions and save partial DataFrame as CSV

Hier werden die definierten Preprocessing Funktionen angewendet. Die bearbeiteten Daten werden anschließend jahres- und länderspezifisch abgespeichert. (Hinweis: Ganzen Output anzeigen lassen um Funktion der einzelnen Methoden nachzuvollziehen)

In [None]:
folder_path = "../Data/original_data"  # Path to original df
result_folder_path = "../Data/processed_data/"  # Path for processed df
iterator = 0

if os.path.exists(result_folder_path):

    # Iterate over files in the folder
    for file_name in os.listdir(folder_path):

        if file_name.endswith('.csv'):  # Process only CSV files
            
            file_path = os.path.join(folder_path, file_name)

            # Read the CSV file and create a DataFrame
            df = pd.read_csv(file_path, encoding='latin-1')

            print(f"\n--- Preprocessing {file_name} ---\n")

            df = data_cleaning(df)

            # --- Preprocess text features ---

            df[["channel_title","tags","title","description"]] = check_profanity(df[["channel_title","tags","title","description"]]) # check for profanity and remove rows accordingly

            #feature_df_preprocessed = apply_stemmer_stopwords(df[["channel_title","tags","title","description"]]) # remove stopwords and get stemm of words
            #df[["channel_title","tags","title","description"]] = feature_df_preprocessed

            # --- Group by trending year ---

            original_dataset = file_name[:-4]
            df_grouped = df.groupby('trending_year')

            # --- Save partial dataframes ---

            for group_name, group_data in df_grouped:

                # Save final df as csv
                csv_path = result_folder_path+original_dataset+"_"+str(group_name)+".csv"
                group_data.to_csv(csv_path, index=False)

    print(f"\n--- Preprocessing succesfull - Data has been saved to {csv_path} ---\n")
    



--- Preprocessing CAvideos.csv ---

>>> 2592 Missing Values are beeing handeled
>>> No Duplicates detected
>>> Updated Data Types 
 channel_title    object
tags             object
description      object
title            object
views             int64
trending_year     int64
video_id         object
dtype: object

--- Checking channel_title feature for profanity ---

>>> Removed for example: PowerfulJRE

--- Checking tags feature for profanity ---

>>> Removed for example: ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"idy"|"rhpc"|"dares"|"no truth"|"comments"|"comedy"|"funny"|"stupid"|"fail"

--- Checking title feature for profanity ---

>>> Removed for example: John Oliver video of Charlie Rose is extra creepy now

--- Checking description feature for profanity ---

>>> Removed for example: THIS VIDEO WILL MAKE YOU FORGET YOUR NAME ð®ðª\nPatreon https://www.patreon.com/cowchop\nSubscribe http://bit.ly/1RQtfNf Â \nCow Chop Merch: http://bit.ly/2dY0HrO  \nDiscuss: http://bit.ly/1qvr