# Data Preprocessing: Initial Thoughts
- Install langdetect and googletrans to detect non-English comments and translate them to English
- Import re (regex) to strip emojis and other pictographs
- Goal: convert every comment into plain English with no emojis

I considered converting emojis into sentiment labels, but an emoji's meaning does not map cleanly onto basic sentiment, so that would not be a reliable preprocessing step. For example, a crying-face emoji can mean laughing (positive) or sadness (negative). It is therefore better to remove all emojis; if a comment consists only of emojis, I will convert it to NaN and drop it.
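The rule described above can be sketched on a few toy strings before touching the real data. This is a minimal illustration, not the notebook's actual pattern: the name `strip_emojis`, the sample comments, and the reduced set of Unicode ranges are all assumptions made for the example.

```python
import re

# Simplified pattern for illustration only: emoticons plus two pictograph blocks
EMOJI = re.compile("[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F900-\U0001F9FF]+")

def strip_emojis(comment):
    """Remove emojis; return None when nothing but whitespace is left."""
    cleaned = EMOJI.sub("", comment).strip()
    return cleaned or None

# Hypothetical sample comments
comments = ["so funny 😭😭", "😭😭😭😿😓", "Que vivan los novios"]

cleaned = [strip_emojis(c) for c in comments]
kept = [c for c in cleaned if c is not None]  # emoji-only comments are dropped
```

The emoji-only comment becomes a missing value and is filtered out, while text that merely contains emojis keeps its words.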

In [2]:
pip install langdetect googletrans==4.0.0-rc1

Collecting langdetect
  Using cached langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting googletrans==4.0.0-rc1
  Using cached googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting httpx==0.13.3 (from googletrans==4.0.0-rc1)
  Using cached httpx-0.13.3-py3-none-any.whl.metadata (25 kB)
Collecting hstspreload (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Using cached hstspreload-2025.1.1-py3-none-any.whl.metadata (2.1 kB)
Collecting sniffio (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Using cached sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting chardet==3.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Using cached chardet-3.0.4-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting idna==2.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Using cached idna-2.10-py2.py3-none-any.whl.metadata (9.1 kB)
Collecting rfc3986<2,>=1.3 (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Using cached r

In [29]:
import numpy as np
import pandas as pd
from langdetect import detect, lang_detect_exception
from googletrans import Translator
import re

In [6]:
df = pd.read_csv('dataset_youtube-comments-scraper_2025-10-24_00-01-21-625.csv')
df.head()

Unnamed: 0,author,comment,commentsCount,pageUrl,replyCount,title,type,videoID,voteCount
0,@MrBeast,I noticed some comments asking how they take t...,52636,https://www.youtube.com/watch?v=4l97aNza_Zc,553,"Survive 30 Days Chained To Your Ex, Win $250,000",comment,,130000
1,@MiguelSilveira-x8f,Olk,52636,https://www.youtube.com/watch?v=4l97aNza_Zc,0,"Survive 30 Days Chained To Your Ex, Win $250,000",comment,,0
2,@SelimCopet,😭😭😭😿😓,52636,https://www.youtube.com/watch?v=4l97aNza_Zc,0,"Survive 30 Days Chained To Your Ex, Win $250,000",comment,,1
3,@ДаняКалинков,"Я еще не досмотрел видео, но если он снимет це...",52636,https://www.youtube.com/watch?v=4l97aNza_Zc,0,"Survive 30 Days Chained To Your Ex, Win $250,000",comment,,0
4,@guadalupehernandezlezama9804,Que vivan los novios,52636,https://www.youtube.com/watch?v=4l97aNza_Zc,0,"Survive 30 Days Chained To Your Ex, Win $250,000",comment,,0


## Handling emojis with regex

In [None]:
# Work on an explicit copy so the original df stays untouched
df_no_emojis = df.copy()

emojis_code = re.compile(
    "["
    # Emoticons code
    "\U0001F600-\U0001F64F"
    # Symbols and Pictographs code
    "\U0001F300-\U0001F5FF"
    # Transport and Map symbols code
    "\U0001F680-\U0001F6FF"
    # Flags (iOS)
    "\U0001F1E0-\U0001F1FF"
    # Dingbats
    "\U00002702-\U000027B0"
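    # Enclosed characters and miscellaneous symbols. Note: this wide range also
    # covers CJK ideographs, kana and Hangul, so text in those scripts is removed too
    "\U000024C2-\U0001F251"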
    "]+", flags=re.UNICODE
)

# Finding and removing emojis function
def remove_emojis(text):
    # if input is a string, check if it has above codes and replace with blank
    if isinstance(text, str):
        return emojis_code.sub(r'', text)
    return text

# Apply to the working DataFrame so the result can be checked before changing the original df
df_no_emojis['comment'] = df_no_emojis['comment'].apply(remove_emojis)

# Comments that contained only emojis are now empty strings: mark them as NaN and drop them.
# Assigning the result back avoids the pandas FutureWarning raised by calling
# replace(..., inplace=True) on a column selection.
df_no_emojis['comment'] = df_no_emojis['comment'].replace('', np.nan)
df_no_emojis.dropna(subset=['comment'], inplace=True)

df_no_emojis.head()
df_no_emojis.describe()


Unnamed: 0,commentsCount,replyCount,videoID,voteCount
count,1801.0,1801.0,0.0,1801.0
mean,52636.0,0.32482,,77.361466
std,0.0,13.031095,,3064.034237
min,52636.0,0.0,,0.0
25%,52636.0,0.0,,0.0
50%,52636.0,0.0,,0.0
75%,52636.0,0.0,,0.0
max,52636.0,553.0,,130000.0
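One caveat worth checking: the last range in the pattern, U+24C2 to U+1F251, is very wide and spans CJK ideographs (U+4E00 to U+9FFF), kana, and Hangul, so comments written in Chinese, Japanese, or Korean would be stripped before they ever reach translation. The sketch below demonstrates the effect and shows one possible narrower pattern. The names `WIDE` and `NARROW` and the exact set of ranges are assumptions for illustration; the narrower list is not exhaustive and would change the row counts reported above.

```python
import re

# The broad range from the pattern above, isolated for the demonstration
WIDE = re.compile("[\U000024C2-\U0001F251]+")

# A narrower alternative (an assumption, not a complete emoji list)
NARROW = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U00002600-\U000026FF"  # miscellaneous symbols
    "\U00002702-\U000027B0"  # dingbats
    "\U0000FE0F"             # variation selector left behind by some emojis
    "]+"
)

cjk_only = WIDE.sub("", "你好")                  # the Chinese text is removed entirely
mixed = NARROW.sub("", "你好 😭").strip()        # only the emoji is removed
```

With the broad range, `cjk_only` ends up empty, which means such a comment would later be converted to NaN and dropped; with the narrower pattern the text survives for the translation step.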


## Translate Comments to English
Great! The regex approach of matching emoji code points and replacing them with an empty string worked. We lost 199 of our 2,000 comments (1,801 remain), so we still have a good-sized set to work with. Now let's detect non-English comments and translate them to English.

In [30]:
df_translate = df_no_emojis.copy()

# Create one translator and reuse it for every comment
translator = Translator()

# Detect a comment's language and translate it to English if needed.
# Always returns a (detected_language, translated_text) pair.
def translate_to_english(text):
    # handle non-string and empty text
    if not isinstance(text, str) or not text.strip():
        return 'unknown', text

    try:
        lang = detect(text)
    except lang_detect_exception.LangDetectException:
        # langdetect could not find any language features in the text
        return 'unknown', text

    if lang == 'en':
        return 'en', text

    try:
        return lang, translator.translate(text, dest='en').text
    except Exception as e:
        # translation failed (e.g. network error): keep the original text
        print(f"Error processing: '{text}' - {e}")
        return lang, text

# Applying function
df_translate[['detected_language', 'translated_text']] = df_translate['comment'].apply(
    lambda x: pd.Series(translate_to_english(x))
)
# Check whether detection/translation worked as intended by hiding the 'unknown' rows
# check_df_translate = df_translate[~df_translate['detected_language'].str.contains('unknown', case=False, na=False)]
# check_df_translate.head()
df_translate.head()
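Language detection on very short comments (a row such as "Olk" carries little signal) can be unreliable, and langdetect's results can vary between runs unless its seed is fixed. The sketch below shows one way to guard against that, assuming langdetect's `detect_langs` (which returns candidate languages with probabilities) and `DetectorFactory.seed`. The helper names `pick_language` and `detect_confident` and the 0.90 threshold are hypothetical choices for illustration, not part of the notebook above.

```python
def pick_language(candidates, min_prob=0.90):
    """Return the most probable language code, or 'unknown' if it is below min_prob.

    candidates: list of (language_code, probability) pairs.
    """
    best = max(candidates, key=lambda pair: pair[1], default=None)
    if best is None or best[1] < min_prob:
        return 'unknown'
    return best[0]


def detect_confident(text, min_prob=0.90):
    # Imported lazily so pick_language can be used without langdetect installed
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make detection repeatable between runs
    try:
        candidates = [(c.lang, c.prob) for c in detect_langs(text)]
    except LangDetectException:
        # no usable language features in the text
        return 'unknown'
    return pick_language(candidates, min_prob)
```

Comments labelled 'unknown' this way could be left untranslated or reviewed by hand, rather than sent to the translator under a guessed language.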
