# Data Preprocessing

Wikipedia is an encyclopedia that covers a large amount of diverse topics. All articles are created, corrected and updated by individuals. The goal is to correctly document as many topics as possible by collecting the knowledge of a large number of people. However, some articles stand out due to their completeness, scope and presentation, and for this they are marked with the distinction of the Excellent Article. 

As part of the Natural Language Processing lecture, a classification of Wikipedia articles is to be carried out as a sub-task of an assignment with the goal of being able to identify excellent articles. This notebook contains the code to accomplish this goal and is structured as follows:

- [1. Imports](article_classification.ipynb#1-imports)
- [2. Enrichment of the Data](article_classification.ipynb#2-check-data-availability)
	- [2.1 Add Subscription Details](#thema1)
	- [2.2 Download Thumnails and add Path to jpg to the df](#thema2)
- [3. Data Cleaning](article_classification.ipynb#3-data-processing)
	- [4.1 Define Data Cleaning](#thema1)
	- [4.2 Define String Encoding](#thema2)
	- [4.3 Apply Functions](#thema3)
- [4. Export df]()


## 1. Imports
Import the requiered libraties into the notebook.
If some libraries are not installed, you can use the `requierements.txt` and run
```
$ pip install -r requirements.txt
```
in the terminal.

In [1]:
import pandas as pd
import numpy as np
import tensorflow



2023-07-01 10:35:43.615334: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


# Enrichment of the Data

## Add Subscription Details for Channel Details

Subscribing to a channel on YouTube can make it easier for you to watch multiple videos from the same channel. Therefore the Subscription count is a measure for the YouTubers reach. [1] According to range of the audiance that can be reached through a video, the subscription count is an important feature when forecasting the potential views of a video. 

In the following the `YouTube Data API v3` from `www.googleapis.com` is going to be used to add the subscription coun of the channel to the df.

[1] https://tuberanker.com/blog/what-are-the-benefits-of-subscribing-to-a-youtube-channel (04.06.23)

In [6]:
# The following youtube_authenticate function is available under this url: https://www.thepythoncode.com/article/using-youtube-api-in-python#google_vignette
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

import urllib.parse as p
import re
import os
import pickle

SCOPES = ["https://www.googleapis.com/auth/youtube.force-ssl"]

def youtube_authenticate():
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
    api_service_name = "youtube"
    api_version = "v3"
    client_secrets_file = "client_secret_883243427026-gmpvuoto8cd0gsl6djl80sesaa6b5jo0.apps.googleusercontent.com.json"
    creds = None
    # the file token.pickle stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first time
    if os.path.exists("token.pickle"):
        with open("token.pickle", "rb") as token:
            creds = pickle.load(token)
    # if there are no (valid) credentials availablle, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(client_secrets_file, SCOPES)
            creds = flow.run_local_server(port=0)
        # save the credentials for the next run
        with open("token.pickle", "wb") as token:
            pickle.dump(creds, token)

    return build(api_service_name, api_version, credentials=creds)

# authenticate to YouTube API
youtube = youtube_authenticate()

In [7]:
# See the Youtube API Documentation for more insights: https://developers.google.com/youtube/v3/docs/subscriptions/list?hl=de&apix=true

def get_channel_details(youtube, **kwargs):

    request = youtube.channels().list(
        part="statistics", # just get stats
        **kwargs
    )
    response = request.execute()

    return response

In [8]:
# get subscription count and save it to the df

from tqdm import tqdm

for i in tqdm(range (0, len(df))):

    try:

        # get current channel Name
        channel_name = df.iloc[i]["channel_title"]

        # make API call to get channel infos
        response = get_channel_details(youtube, forUsername=channel_name)
        items = response.get("items")[0]
        statistics = items["statistics"]

        # get stats infos
        subscriberCount = statistics["subscriberCount"]

        # save subscriptions to df
        df.at[i,"subscriber_count"] = int(subscriberCount)

    except:

        df.at[i,"subscriber_count"] = np.NAN

100%|██████████| 40881/40881 [22:59<00:00, 29.63it/s]


## Download Thumnails and add Path to jpg to the df

Every YouTube video is represented by a thumbnail, a small image that, along with the title and channel, serves as the “cover” of the video. Thumbnails that are interest- ing and well-framed attract viewers, while those that are confusing and low-quality encourage viewers to click else- where. As a testament to the important of a good thumb- nail, 90% of the most successful YouTube videos have cus- tom thumbnails [2]. YouTube uploaders without the time or skills to create a custom thumbnail, however, must pick one of 3 frames automatically chosen from the video. Our mission is to improve this frame selection process and help uploaders select high quality frames that will attract viewers to their channel.

http://cs231n.stanford.edu/reports/2017/pdfs/710.pdf
[2] YouTube Creator Academy. Lesson: Make click- able thumbnails. https://creatoracademy. youtube.com/page/lesson/thumbnails# yt-creators-strategies-5.

In [4]:
# load thumbnail and save picture path in df
import urllib.request
from tqdm import tqdm
import os

df_w_jgp = df.copy()

def load_and_save_thumbnail(url, iterator):
    try:

        # get unique thumbnail id 
        splitted_words = url.split("/")
        thumbnail_id = splitted_words[4]

        # create filename to save thumbnail
        file_path = f"../Data/thumbnails/{thumbnail_id}.jpg"
        #df_w_jgp.at[iterator,"thumbnail_link"] = thumbnail_id

        # check if thumbnail is already downloaded
        if os.path.exists(file_path):
            pass
        else:
            # download and save thumbnail 
            urllib.request.urlretrieve(url, file_path)

    except urllib.error.URLError as e:

        df_w_jgp.at[iterator,"thumbnail_link"] = np.NAN
        # print(f"An error occurred: {e}")
        return None

for i in tqdm(range (0, len(df))):

    # get thumbnail url
    url = df.iloc[i]["thumbnail_link"]

    # safe thumnail and write path into df
    load_and_save_thumbnail(url, i)  

100%|██████████| 500/500 [00:08<00:00, 58.34it/s]


# Data Cleaning

## Define Data Cleaning

In [6]:
def data_cleaning(df):

    # Drop Unnececary Columns

    df = df [["channel_title","tags","description", "title", "views", 'trending_date']]
    
    # Handling Missing Values

    if df.isnull().any().any() or df.isna().any().any():

        num = df.isnull().sum().sum() + df.isna().sum().sum() # Count Missing Values

        print(">>>",num, "Missing Values are beeing handeled")

        #df.dropna() # Drop rows with missing values
        #df.fillna(value) # Fill missing values with a specific value

    else:
        print(">>> No Missing Values detected")
        
    # Remove Duplicate

    if df.duplicated().any():

        num = df.duplicated().sum() # Count Duplicates

        print(">>>",num, "Duplicates are beeing handeled")

        df.drop_duplicates()

    else:
        print(">>> No Duplicates detected")

    # Convert Data Types

    #print( df.dtypes ) # check data types

    # Convert 'trending_date' to year
    df["trending_date"] = df["trending_date"].str[6:].astype(int) # save trending year
    df = df.rename(columns={'trending_date': 'trending_year'})

    for column in df:

        # Check if column is of type bool
        if df[column].dtype == 'bool':
            df[column] = df[column].astype(int)

    # Check preprocessed data

    print(">>> Updated Data Types \n", df.dtypes)

    return df

## Apply Functions and save partial DataFrame as CSV

In [7]:
import os
import pandas as pd

folder_path = "../Data/original_data"  # Path to original df
result_folder_path = "../Data/processed_data/"  # Path for processed df
iterator = 0

# Iterate over files in the folder
for file_name in os.listdir(folder_path):

    if file_name.endswith('.csv'):  # Process only CSV files
        
        file_path = os.path.join(folder_path, file_name)

        # Read the CSV file and create a DataFrame
        df = pd.read_csv(file_path, encoding='latin-1')

        print(f"\n--- Preprocessing {file_name} ---\n")

        cleaned_df = data_cleaning(df)

        original_dataset = file_name[:-4]

        df_grouped = cleaned_df.groupby('trending_year')

        for group_name, group_data in df_grouped:

            # Save final df as csv
            csv_path = result_folder_path+original_dataset+"_"+str(group_name)+".csv"
            group_data.to_csv(csv_path, index=False)



--- Preprocessing CAvideos.csv ---

>>> 2592 Missing Values are beeing handeled
>>> No Duplicates detected
>>> Updated Data Types 
 channel_title    object
tags             object
description      object
title            object
views             int64
trending_year     int64
dtype: object


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["trending_date"] = df["trending_date"].str[6:].astype(int) # save trending year



--- Preprocessing USvideos.csv ---

>>> 1140 Missing Values are beeing handeled
>>> 49 Duplicates are beeing handeled
>>> Updated Data Types 
 channel_title    object
tags             object
description      object
title            object
views             int64
trending_year     int64
dtype: object


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["trending_date"] = df["trending_date"].str[6:].astype(int) # save trending year



--- Preprocessing GBvideos.csv ---

>>> 1224 Missing Values are beeing handeled
>>> 173 Duplicates are beeing handeled
>>> Updated Data Types 
 channel_title    object
tags             object
description      object
title            object
views             int64
trending_year     int64
dtype: object


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["trending_date"] = df["trending_date"].str[6:].astype(int) # save trending year
