# YouTube NLP

## Ingesting Comments from the YouTube Video

In [4]:
api_key = ''

In [5]:
from googleapiclient.discovery import build

# The video ID is the string after "v=" in the video URL
video_id = "dQw4w9WgXcQ"

# Create the YouTube Data API v3 resource object
youtube = build('youtube', 'v3', developerKey=api_key)

# Retrieve the top-level comment threads for the video
# (the API accepts maxResults between 1 and 100 per request)
video_response = youtube.commentThreads().list(
  part='snippet',
  maxResults=100,
  order='relevance',
  videoId=video_id
).execute()
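
A single `commentThreads().list` call returns one page of results, and the API documents 100 as the largest page size. The following is a minimal sketch of collecting several pages, assuming the `youtube` resource and `video_id` defined above. The helper name `fetch_comment_threads` and the `max_pages` cap are illustrative and not part of the original notebook. It follows the `nextPageToken` field that the API includes in a response when more pages are available.

```python
def fetch_comment_threads(youtube, video_id, max_pages=5):
    """Collect comment-thread items across pages (hypothetical helper)."""
    items = []
    page_token = None
    for _ in range(max_pages):
        params = {
            "part": "snippet",
            "maxResults": 100,
            "order": "relevance",
            "videoId": video_id,
        }
        # Only send pageToken after the first page has been fetched
        if page_token:
            params["pageToken"] = page_token
        response = youtube.commentThreads().list(**params).execute()
        items.extend(response.get("items", []))
        # The API omits nextPageToken on the last page
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return items
```

Calling `fetch_comment_threads(youtube, video_id)` would return a list with the same item structure as `video_response['items']`. The page cap is there to keep API quota usage bounded while testing.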


## Data Preprocessing

## Data Wrangling

### Assumptions


*   Using only the top-level comment of each thread, i.e. the comment made by the thread's original author
  *   Replies within a thread may drift away from the topic of the video
  *   Limiting the response to about 10 comments while testing

*   Using 'textOriginal' instead of 'textDisplay' for easier cleanup of the comment

*   Analyzing only English comments
  *   Storing the number of non-English comments for metrics

In [6]:
# List for storing the top-level comment of each thread
replies = []

# Extract the required info from each result object
for item in video_response['items']:

  # 'textOriginal' holds the raw comment text as it was posted
  comment = item['snippet']['topLevelComment']['snippet']['textOriginal']
  replies.append(comment)

In [11]:
# Display one of the collected comments
replies[10]

"Honestly I love this song, so whenever i get rick rolled I actually thank the person for sending it! Don't hate the player, hate the game. \nI don't understand why people get s\n\n\nWe're no strangers to love\nYou know the rules and so do I\nA full commitment's what I'm thinking of\nYou wouldn't get this from any other guy\n\nI just wanna tell you how I'm feeling\nGotta make you understand\n\nNever gonna give you up\nNever gonna let you down\nNever gonna run around and desert you\nNever gonna make you cry\nNever gonna say goodbye\nNever gonna tell a lie and hurt you\n\nWe've known each other for so long\nYour heart's been aching but you're too shy to say it\nInside we both know what's been going on\nWe know the game and we're gonna play it\nAnd if you ask me how I'm feeling\nDon't tell me you're too blind to see\n\nNever gonna give you up\nNever gonna let you down\nNever gonna run around and desert you\nNever gonna make you cry\nNever gonna say goodbye\nNever gonna tell a lie and hurt

### NLP

#### Install Libraries

In [13]:
pip install Unidecode

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Installing collected packages: Unidecode
Successfully installed Unidecode-1.3.6


In [12]:
pip install langdetect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993242 sha256=2a92d45fb59479cff82894b46bb630a4b83477ec0fb565cf4196885a6abb4e89
  Stored in directory: /root/.cache/pip/wheels/13/c7/b0/79f66658626032e78fc1a83103690ef6797d551cb22e56e734
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [60]:
pip install pycountry

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycountry
  Downloading pycountry-22.3.5.tar.gz (10.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: pycountry
  Building wheel for pycountry (PEP 517) ... done
  Created wheel for pycountry: filename=pycountry-22.3.5-py2.py3-none-any.whl size=10681845 sha256=f81c66a8854e8ab224231732662df76affd5cef6f11a2542bf4e22c7d487b47f
  Stored in directory: /root/.cache/pip/wheels/e2/aa/0f/c224e473b464387170b83ca7c66947b4a7e33e8d903a679748
Successfully built pycountry
Installing collected packages: pycountry
Successfully installed pycountry-22.3.5


In [14]:
pip install clean-text

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting clean-text
  Downloading clean_text-0.6.0-py3-none-any.whl (11 kB)
Collecting emoji<2.0.0,>=1.0.0
  Downloading emoji-1.7.0.tar.gz (175 kB)
Collecting ftfy<7.0,>=6.0
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.7.0-py3-none-any.whl size=171046 sha256=51ddb6a865167e782cf25ef11928bc792ec6b97842b5dd124dfba1f230ee5183
  Stored in directory: /root/.cache/pip/wheels/5e/8c/80/c3646df8201ba6f5070297fe3779a4b70265d0bfd961c15302
Successfully built emoji
Installing collected packages: ftfy, emoji, clean-text
Successfully installed clean-text-0.6.0 emoji-1.7.0 ftfy-6.1.1


#### Text Processing

In [45]:
import re
from cleantext import clean
clean_rep = []
for comment in replies:
  # Strip surrounding whitespace and convert to lower case
  comment = comment.strip().lower()
  # Remove the text emoticons 'xd' and '<3'; the word boundaries keep 'xd' inside longer words intact
  comment = re.sub(r'\bxd\b', '', comment).replace('<3', '')
  # Remove line breaks, emojis and punctuation, and fix broken unicode
  clean_rep.append(clean(comment, no_emoji=True, no_punct=True, no_line_breaks=True, fix_unicode=True))
clean_rep

['1 billion views for never gonna give you up amazing crazy wonderful rick',
 'this isnt your first time here and you know it',
 'vienes del rewind 2022 jkajsk',
 'memes aside even after 13 years this song still has value and gives that early 200s and 1980s vibe back when the world was at its best',
 'rick astley haciendose aun mas famoso gracias al rewind hispano 2022 sencillamente maravilloso edit gracias por los likes gente espero que vuestro 2023 sea prospero y mejor que este 2022',
 'naah temardo en verdad 6025619 personas ya lo han escuchado espero que te haya gustado',
 'this music video is still going strong',
 'i grew up with this song since i was born i got rick rolled million of times and i still love it with all my heart ill never get tired of it',
 'naah temardo en verdad 2100011 personas ya lo han escuchado espero que te haya gustado',
 'even whens its 2023 thousands of people still listen to this song everydayit never gets old',
 'honestly i love this song so whenever i 

In [101]:
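from langdetect import DetectorFactory, detect
# langdetect is non-deterministic by default; fixing the seed makes the results reproducible
DetectorFactory.seed = 0
# Keying the dict by comment text also drops duplicate comments
lan_profile = {comment: detect(comment) for comment in clean_rep}
lan_profile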

{'1 billion views for never gonna give you up amazing crazy wonderful rick': 'en',
 'this isnt your first time here and you know it': 'en',
 'vienes del rewind 2022 jkajsk': 'af',
 'memes aside even after 13 years this song still has value and gives that early 200s and 1980s vibe back when the world was at its best': 'en',
 'rick astley haciendose aun mas famoso gracias al rewind hispano 2022 sencillamente maravilloso edit gracias por los likes gente espero que vuestro 2023 sea prospero y mejor que este 2022': 'es',
 'naah temardo en verdad 6025619 personas ya lo han escuchado espero que te haya gustado': 'es',
 'this music video is still going strong': 'en',
 'i grew up with this song since i was born i got rick rolled million of times and i still love it with all my heart ill never get tired of it': 'en',
 'naah temardo en verdad 2100011 personas ya lo han escuchado espero que te haya gustado': 'es',
 'even whens its 2023 thousands of people still listen to this song everydayit never
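
Language detection on very short comments is unreliable, and `langdetect` raises an exception when a string has no usable features (for example, a comment that is empty after cleaning). The following is a minimal sketch of a guard around the detector. The helper name `detect_language`, the `'unknown'` label and the three-word threshold are assumptions made for illustration, and the detector is passed in as an argument so that `langdetect.detect` can be swapped for any other callable.

```python
def detect_language(text, detector, min_words=3):
    """Return a language code, or 'unknown' when detection is not trustworthy."""
    # Very short comments carry too little signal for a statistical detector
    if len(text.split()) < min_words:
        return "unknown"
    try:
        return detector(text)
    except Exception:
        # langdetect raises LangDetectException for text without usable features
        return "unknown"

# Intended use with the names from this notebook:
# lan_profile = {c: detect_language(c, detect) for c in clean_rep}
```

The word-count threshold only filters out one-word or two-word comments. Longer comments that mix slang with another language can still be labelled incorrectly.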

In [102]:
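import pycountry
# Keep the raw code when pycountry has no alpha_2 match (langdetect can return codes such as 'zh-cn')
def language_name(code):
  language = pycountry.languages.get(alpha_2=code)
  return language.name if language else code
lan_profile = {key: language_name(value) for (key, value) in lan_profile.items()}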

In [103]:
lan_profile


{'1 billion views for never gonna give you up amazing crazy wonderful rick': 'English',
 'this isnt your first time here and you know it': 'English',
 'vienes del rewind 2022 jkajsk': 'Afrikaans',
 'memes aside even after 13 years this song still has value and gives that early 200s and 1980s vibe back when the world was at its best': 'English',
 'rick astley haciendose aun mas famoso gracias al rewind hispano 2022 sencillamente maravilloso edit gracias por los likes gente espero que vuestro 2023 sea prospero y mejor que este 2022': 'Spanish',
 'naah temardo en verdad 6025619 personas ya lo han escuchado espero que te haya gustado': 'Spanish',
 'this music video is still going strong': 'English',
 'i grew up with this song since i was born i got rick rolled million of times and i still love it with all my heart ill never get tired of it': 'English',
 'naah temardo en verdad 2100011 personas ya lo han escuchado espero que te haya gustado': 'Spanish',
 'even whens its 2023 thousands of pe

In [104]:
import pandas as pd
df_lang = pd.DataFrame.from_dict(lan_profile, orient='index')

In [105]:
df_lang.reset_index(inplace=True)
df_lang.columns=['comments','language']

In [106]:
df_lang

|     | comments | language |
|-----|----------|----------|
| 0   | 1 billion views for never gonna give you up am... | English |
| 1   | this isnt your first time here and you know it | English |
| 2   | vienes del rewind 2022 jkajsk | Afrikaans |
| 3   | memes aside even after 13 years this song stil... | English |
| 4   | rick astley haciendose aun mas famoso gracias ... | Spanish |
| ... | ... | ... |
| 92  | one hour and 20 minutes to the new year | English |
| 93  | i got rick rolled so many times that i memoriz... | English |
| 94  | epicness | Catalan |
| 95  | help i just got rickrolled for the first time ... | English |
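
The assumptions listed earlier call for analyzing only English comments while keeping a count of the rest. The following is a minimal sketch of that split using only the standard library. It operates on a dict shaped like `lan_profile` (comment text mapped to language name), the helper name `split_english` is hypothetical, and the sample entries are copied from the output above.

```python
from collections import Counter

def split_english(lan_profile):
    """Return (English comments, per-language counts of the other comments)."""
    english = [c for c, lang in lan_profile.items() if lang == "English"]
    others = Counter(lang for lang in lan_profile.values() if lang != "English")
    return english, others

# Sample entries copied from the notebook output
sample = {
    'this isnt your first time here and you know it': 'English',
    'vienes del rewind 2022 jkajsk': 'Afrikaans',
    'this music video is still going strong': 'English',
    'epicness': 'Catalan',
}
english_comments, non_english_counts = split_english(sample)
print(len(english_comments), sum(non_english_counts.values()))  # prints: 2 2
```

Keeping the per-language counts, rather than a single total, makes mislabelled comments such as the two non-English rows in this sample easier to spot.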
