# YouTube NLP

## Ingesting Comments from the YouTube Video

In [1]:
# YouTube Data API v3 key (intentionally left blank; set your own before running)
api_key = ''

In [2]:
from googleapiclient.discovery import build

# The value of the v= query parameter in the video URL
video_id = "dQw4w9WgXcQ"

# Create the YouTube Data API v3 resource object
youtube = build('youtube', 'v3', developerKey=api_key)

# Retrieve the most relevant comment threads for the video.
# maxResults accepts values from 1 to 100 per request.
video_response = youtube.commentThreads().list(
  part='snippet',
  maxResults=100,
  order='relevance',
  videoId=video_id
).execute()
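
A single `commentThreads().list` call returns only one page of results, so collecting more comments means following the `nextPageToken` field of each response. The following is a minimal sketch of that loop, not the notebook's original method; the helper name `fetch_comment_threads` and its `limit` parameter are hypothetical, and it assumes `youtube` is the resource object created by `build`.

```python
def fetch_comment_threads(youtube, video_id, limit=500, order="relevance"):
    """Collect up to `limit` comment-thread items by following nextPageToken.

    `youtube` is assumed to be a YouTube Data API v3 resource object;
    the function name and `limit` parameter are illustrative only.
    """
    items = []
    page_token = None
    while len(items) < limit:
        params = {
            "part": "snippet",
            "videoId": video_id,
            "order": order,
            # Request at most 100 threads per call, and never more than still needed.
            "maxResults": min(100, limit - len(items)),
        }
        if page_token:
            params["pageToken"] = page_token
        response = youtube.commentThreads().list(**params).execute()
        items.extend(response.get("items", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            break  # the last page has no nextPageToken
    return items
```

Calling `fetch_comment_threads(youtube, video_id, limit=10)` would keep the request small while testing. Each page costs API quota, so a modest `limit` is a sensible default.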


## Data Preprocessing

### Data Wrangling

#### Assumptions


*   Using only the top-level comment of each thread (the comment made by the thread's original author)
  *   Replies to the original author's comment could diverge from the topic of the video
  *   Limiting the response to about 10 comments for testing

*   Using 'textOriginal' instead of 'textDisplay' for easier cleanup of the comment

*   Analyzing only English comments
  *   Storing the number of non-English comments for metrics

In [30]:
# Empty list for storing one record per top-level comment
comments = []

# Extract the required fields from each comment thread
for item in video_response['items']:
  top_level = item['snippet']['topLevelComment']['snippet']

  # Raw comment text
  comment = top_level['textOriginal']
  # Display name of the comment's author
  author = top_level['authorDisplayName']
  # Publication time (ISO 8601, UTC)
  published_at = top_level['publishedAt']
  # Number of likes on the comment
  like_count = top_level['likeCount']
  # Total number of replies to the comment
  reply_count = item['snippet']['totalReplyCount']

  comments.append([author, comment, published_at, like_count, reply_count])

In [31]:
comments[0:5]

['Space Shuttle Atlantis',
 'Memes aside, even after 13 years this song still has value and gives that early 200s and 1980s vibe , back when the world was at its best.',
 '2022-12-31T17:13:19Z',
 204,
 16]

### NLP

#### Install Libraries

In [7]:
pip install Unidecode

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [8]:
pip install langdetect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [9]:
pip install pycountry

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [10]:
pip install clean-text

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#### Text Processing

In [11]:
[items[1] for items in comments]

['1 BILLION views for Never Gonna Give You Up!\xa0 Amazing, crazy, wonderful! Rick ♥️',
 'This has been the most amazing rick roll I have ever gotten. Thank you discord',
 'Naah... temardo en verdad 😎🤙\n6025619 personas ya lo han escuchado,\nespero que te haya gustado 😉',
 'Even when’s it’s 2023 thousands of people still listen to this song everyday,it never gets old❤',
 'Naah... temardo en verdad 😎🤙\n21047218 personas ya lo han escuchado,\nespero que te haya gustado 😉',
 'Memes aside, even after 13 years this song still has value and gives that early 200s and 1980s vibe , back when the world was at its best.',
 'It’s come to the point that being rickrolled is a privilege',
 "I grew up with this song since I was born. I got Rick rolled million of times, and I still love it with all my heart, I'll never get tired of it <3",
 'Greatest music video the 90’s ever saw!',
 "Honestly I love this song, so whenever i get rick rolled I actually thank the person for sending it! Don't hate the pla

In [32]:
import pandas as pd
df = pd.DataFrame(comments, columns=['Author', 'Comment', 'Time', 'Likes', 'Reply Count'])

In [37]:
df.head()

Unnamed: 0,Author,Comment,Time,Likes,Reply Count
0,Rick Astley,1 billion views for never gonna give you up am...,2021-07-28T21:00:32Z,1165821,497
1,Amelia Honey,this has been the most amazing rick roll i hav...,2022-05-13T22:16:45Z,713,39
2,Rafael Perez,naah temardo en verdad 6025619 personas ya lo ...,2022-12-31T15:25:35Z,388,11
3,Natan Kozłowski,even whens its 2023 thousands of people still ...,2022-12-30T18:57:47Z,219,19
4,XIÁN PENA PUZA,naah temardo en verdad 21047218 personas ya lo...,2023-01-01T14:10:35Z,59,9


In [38]:
from cleantext import clean

# clean_df refers to the same object as df, so df is updated in place as well
# (use df.copy() instead to keep the raw comments)
clean_df = df

# Trim surrounding whitespace, lowercase, and drop the text emoticons 'xd' and '<3'
# (a plain replace also removes 'xd' inside longer words)
clean_df['Comment'] = clean_df['Comment'].apply(lambda x: x.strip().lower().replace('xd','').replace('<3',''))

# Remove line breaks, emojis and punctuation, and normalize Unicode
clean_df['Comment'] = clean_df['Comment'].apply(lambda x: clean(x, no_emoji=True, no_punct=True, no_line_breaks=True, fix_unicode=True))
clean_df.head()

Unnamed: 0,Author,Comment,Time,Likes,Reply Count
0,Rick Astley,1 billion views for never gonna give you up am...,2021-07-28T21:00:32Z,1165821,497
1,Amelia Honey,this has been the most amazing rick roll i hav...,2022-05-13T22:16:45Z,713,39
2,Rafael Perez,naah temardo en verdad 6025619 personas ya lo ...,2022-12-31T15:25:35Z,388,11
3,Natan Kozłowski,even whens its 2023 thousands of people still ...,2022-12-30T18:57:47Z,219,19
4,XIÁN PENA PUZA,naah temardo en verdad 21047218 personas ya lo...,2023-01-01T14:10:35Z,59,9
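
Chained `str.replace` calls match substrings anywhere, so removing `'xd'` also alters words that merely contain those letters. As a hedged alternative, here is a sketch that removes the emoticons only when they stand alone, using a regular expression. The helper name `strip_emoticons` is hypothetical, and the pattern covers just the two emoticons handled in this notebook.

```python
import re

# 'xd' (with any number of trailing d's) as a whole word, or '<3' hearts.
# The lookahead keeps comparisons such as 'x<30' intact.
_EMOTICON_RE = re.compile(r"\bxd+\b|<3+(?!\d)", re.IGNORECASE)


def strip_emoticons(text):
    """Lowercase the text and remove standalone 'xd' / '<3' emoticons."""
    without_emoticons = _EMOTICON_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(without_emoticons.split()).lower()


print(strip_emoticons("Boxdrop is great XD <3"))
```

It could be applied to the comment column with `clean_df['Comment'].apply(strip_emoticons)` before calling `clean`.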


In [39]:
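from langdetect import detect, DetectorFactory
# langdetect is non-deterministic by default; a fixed seed makes results reproducible
DetectorFactory.seed = 0
# Detect the language code of each cleaned comment
clean_df['language'] = clean_df['Comment'].apply(detect)
clean_df.head()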

Unnamed: 0,Author,Comment,Time,Likes,Reply Count,language
0,Rick Astley,1 billion views for never gonna give you up am...,2021-07-28T21:00:32Z,1165821,497,en
1,Amelia Honey,this has been the most amazing rick roll i hav...,2022-05-13T22:16:45Z,713,39,en
2,Rafael Perez,naah temardo en verdad 6025619 personas ya lo ...,2022-12-31T15:25:35Z,388,11,es
3,Natan Kozłowski,even whens its 2023 thousands of people still ...,2022-12-30T18:57:47Z,219,19,en
4,XIÁN PENA PUZA,naah temardo en verdad 21047218 personas ya lo...,2023-01-01T14:10:35Z,59,9,es


In [40]:
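import pycountry
# Convert ISO 639-1 language codes to language names
def language_name(code):
  # langdetect can return region-tagged codes such as 'zh-cn'; look up the base code
  lang = pycountry.languages.get(alpha_2=code.split('-')[0])
  return lang.name if lang else code
clean_df['language'] = clean_df['language'].apply(language_name)
clean_df.head()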

Unnamed: 0,Author,Comment,Time,Likes,Reply Count,language
0,Rick Astley,1 billion views for never gonna give you up am...,2021-07-28T21:00:32Z,1165821,497,English
1,Amelia Honey,this has been the most amazing rick roll i hav...,2022-05-13T22:16:45Z,713,39,English
2,Rafael Perez,naah temardo en verdad 6025619 personas ya lo ...,2022-12-31T15:25:35Z,388,11,Spanish
3,Natan Kozłowski,even whens its 2023 thousands of people still ...,2022-12-30T18:57:47Z,219,19,English
4,XIÁN PENA PUZA,naah temardo en verdad 21047218 personas ya lo...,2023-01-01T14:10:35Z,59,9,Spanish
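
The assumptions above call for analyzing only English comments while recording how many non-English comments were set aside. A minimal sketch of that metric follows, written against a plain list of language names so it runs on its own; the function name `summarize_languages` and the shape of the returned dictionary are hypothetical.

```python
from collections import Counter


def summarize_languages(languages, keep="English"):
    """Count comments per language and report how many would be kept or dropped."""
    counts = Counter(languages)
    kept = counts.get(keep, 0)
    return {
        "kept": kept,
        "dropped": sum(counts.values()) - kept,
        "by_language": dict(counts),
    }


# Language names of the five rows shown in the table above.
sample = ["English", "English", "Spanish", "English", "Spanish"]
print(summarize_languages(sample))
```

On the DataFrame, `clean_df['language'].tolist()` could be passed to this helper, and `clean_df[clean_df['language'] == 'English']` would select the rows to analyze.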
