## YouTube Comments Scraping for Hate Speech Detection
This notebook demonstrates the process of scraping YouTube comments from videos related to the Lok Sabha elections. Comments were sourced from popular Indian news channels like Hindustan Times, India Today, and Aaj Tak, providing a diverse dataset with a broad range of opinions on election-related topics. The scraped data will later be analyzed to detect hate speech using pretrained models.

**Scraping Methodology:**
YouTube Data API v3: Using the `googleapiclient.discovery` module from the Google API Python client, we accessed the YouTube Data API v3. The API returns at most 100 comment threads per request, along with metadata such as the comment author, publication and update timestamps, and like count.
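
To make the response shape concrete, here is a minimal sketch of how one comment thread can be flattened into a row. The helper name `thread_to_row` and the sample item are illustrative assumptions; only the field names come from the code used in this notebook.

```python
def thread_to_row(item):
    """Flatten one commentThreads item into a list of five fields."""
    snippet = item["snippet"]["topLevelComment"]["snippet"]
    return [
        snippet["authorDisplayName"],
        snippet["updatedAt"],
        snippet["publishedAt"],
        snippet["likeCount"],
        snippet["textOriginal"],
    ]


# Hand-written item that mimics the nesting of an API response entry.
# It is made-up illustration data, not a real comment.
sample_item = {
    "snippet": {
        "topLevelComment": {
            "snippet": {
                "authorDisplayName": "@example_user",
                "updatedAt": "2024-06-05T07:30:00Z",
                "publishedAt": "2024-06-05T07:28:50Z",
                "likeCount": 3,
                "textOriginal": "Sample comment text",
            }
        }
    }
}

row = thread_to_row(sample_item)
print(row)
```

Keeping the flattening in one helper would avoid repeating the same five-field list in every loop.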

**API Setup:** An API key is required to access YouTube's data. The `youtube.commentThreads().list` method retrieves the top-level comment of each thread on the specified video; replies are not collected. The comments are stored in a structured format for easy analysis and further processing.
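
One possible way to keep the key out of the notebook is to read it from an environment variable. This is a sketch under assumptions: the variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are hypothetical and are not part of the original setup, which uses a placeholder string.

```python
import os


def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable (name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key


# Possible use with the client built in this notebook:
# youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=load_api_key())
```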

**Data Extraction Loop:**

- Initial Request: The first request retrieves up to 100 comments from the target video.
- Pagination: If additional comments are available, a `nextPageToken` is used to paginate through the remaining comments. This loop continues until all comments for the video have been retrieved or the API limit is reached.
- Error Handling: Basic error handling is implemented to manage cases where the `nextPageToken` is unavailable, indicating that all comments have been retrieved. Additionally, time delays can be added to avoid exceeding API request limits.
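
The loop described above can be sketched independently of the API. In this illustration, `fetch_all_items`, `fetch_page`, `delay_seconds`, and `max_pages` are hypothetical names, and a dictionary of fake pages stands in for the YouTube service so the example runs without a key.

```python
import time


def fetch_all_items(fetch_page, delay_seconds=0.0, max_pages=None):
    """Collect items from every page returned by fetch_page(page_token)."""
    items = []
    page_token = None  # The first request carries no token.
    pages = 0
    while True:
        response = fetch_page(page_token)
        items.extend(response.get("items", []))
        pages += 1
        # A missing nextPageToken means the last page has been reached.
        page_token = response.get("nextPageToken")
        if page_token is None or (max_pages is not None and pages >= max_pages):
            break
        time.sleep(delay_seconds)  # Optional pause between requests.
    return items


# Stand-in for the API: three fake pages keyed by page token.
FAKE_PAGES = {
    None: {"items": ["c1", "c2"], "nextPageToken": "p2"},
    "p2": {"items": ["c3"], "nextPageToken": "p3"},
    "p3": {"items": ["c4"]},
}


def fake_fetch(page_token):
    return FAKE_PAGES[page_token]


all_items = fetch_all_items(fake_fetch)
print(all_items)
```

With the real client, `fetch_page` would wrap a call to `youtube.commentThreads().list(...).execute()` that passes the token as `pageToken`. The `max_pages` cap is one way to stop early when quota is a concern.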

**Data Storage:** The extracted comments are stored in a Pandas DataFrame, with columns for the author, update timestamp, publication timestamp, like count, and comment text. This format allows for further processing, such as translation and text classification, as we work towards building a model to detect hate speech.
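
As a small illustration of this layout, the sketch below builds a DataFrame with the same five columns from made-up rows and parses the timestamp strings. The `edited` column is an assumption added for illustration and does not appear in the notebook.

```python
import pandas as pd

# Made-up rows in the same order the scraper appends them.
rows = [
    ["@user_a", "2024-06-05T07:30:00Z", "2024-06-05T07:28:50Z", 3, "first comment"],
    ["@user_b", "2024-06-05T06:49:34Z", "2024-06-05T06:49:34Z", 0, "second comment"],
]
columns = ["author", "updated_at", "publishedAt", "like_count", "text"]
frame = pd.DataFrame(rows, columns=columns)

# The API returns ISO 8601 strings; parse them into timezone-aware datetimes.
for col in ["updated_at", "publishedAt"]:
    frame[col] = pd.to_datetime(frame[col], utc=True)

# A comment whose update time is later than its publish time was edited.
frame["edited"] = frame["updated_at"] > frame["publishedAt"]
print(frame.dtypes)
```

Note that `to_excel` does not accept timezone-aware datetimes, so parsed columns would need their timezone removed (for example with `dt.tz_localize(None)`) before an Excel export.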

By following this approach, we were able to build a dataset of multilingual comments relevant to the Lok Sabha elections, which will be preprocessed and labeled for hate speech analysis in the subsequent steps.

In [None]:
import googleapiclient.discovery
import pandas as pd

api_service_name = "youtube"
api_version = "v3"
DEVELOPER_KEY = "Your_API_Key"

youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey=DEVELOPER_KEY)


def getcomments(video):
  request = youtube.commentThreads().list(
      part="snippet",
      videoId=video,
      maxResults=100
  )

  comments = []

  # Execute the request.
  response = request.execute()

  # Get the comments from the response.
  for item in response['items']:
      comment = item['snippet']['topLevelComment']['snippet']
      comments.append([
          comment['authorDisplayName'],
          comment['updatedAt'],
          comment['publishedAt'],
          comment['likeCount'],
          comment['textOriginal']

      ])

  while True:
    try:
      nextPageToken = response['nextPageToken']
    except KeyError:
      # No token in the response means the last page has been reached.
      break
    # Create a new request object with the next page token.
    nextRequest = youtube.commentThreads().list(part="snippet", videoId=video, maxResults=100, pageToken=nextPageToken)
    # Execute the next request.
    response = nextRequest.execute()
    # Get the comments from the next response.
    for item in response['items']:
      comment = item['snippet']['topLevelComment']['snippet']
      comments.append([
          comment['authorDisplayName'],
          comment['updatedAt'],
          comment['publishedAt'],
          comment['likeCount'],
          comment['textOriginal']
      ])

  df2 = pd.DataFrame(comments, columns=['author', 'updated_at','publishedAt', 'like_count', 'text'])
  return df2

**Scraping Comments from Two Videos:**

In [None]:
df = pd.DataFrame()
for i in ['tshlJqWO3ug','HJnXbDcaMrY']:
  df2 = getcomments(i)
  df = pd.concat([df, df2])

In [None]:
df

Unnamed: 0,author,updated_at,publishedAt,like_count,text
0,@NehaYadav-ry5rq,2024-10-06T06:29:40Z,2024-10-06T06:29:40Z,1,Dont remember the last time hindus crashed a p...
1,@MASAKLOMAKN9,2024-10-04T15:05:28Z,2024-10-04T15:05:28Z,0,Being a Muslim it is our duty to te...
2,@NirajKumar-wb5hg,2024-10-04T07:03:07Z,2024-10-04T07:03:07Z,1,Bohot badhiya
3,@ShyamSaren-u2f,2024-10-03T19:44:25Z,2024-10-03T19:44:25Z,2,All Indian muslim go Pakistan
4,@GgLl-j3x,2024-10-02T21:23:14Z,2024-10-02T21:23:14Z,0,So modi pushing for more children 🧒
...,...,...,...,...,...
4236,@Nadeem-mg2qv,2024-04-15T06:02:20Z,2024-04-15T06:02:20Z,4,Karnatak. Me. Congress 12. Se. Zyada seats jet...
4237,@Gufran85,2024-04-15T06:01:02Z,2024-04-15T06:01:02Z,42,Rahul came
4238,@faisalbaig1861,2024-04-15T06:00:28Z,2024-04-15T06:00:28Z,106,Achcha hai Modi media godi media😅😅
4239,@sanandanyadav2557,2024-04-15T05:58:59Z,2024-04-15T05:58:59Z,10,Jis news ko batana chahiye use to bta nhi rahe...


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8333 entries, 0 to 4240
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       8333 non-null   object
 1   updated_at   8333 non-null   object
 2   publishedAt  8333 non-null   object
 3   like_count   8333 non-null   int64 
 4   text         8333 non-null   object
dtypes: int64(1), object(4)
memory usage: 648.6+ KB
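
The summary above reports 8333 rows but index labels that only run from 0 to 4240. That is the expected result of `pd.concat` without `ignore_index=True`: each frame keeps its own labels, so labels repeat. A toy sketch with made-up frames:

```python
import pandas as pd

first = pd.DataFrame({"text": ["a", "b", "c"]})
second = pd.DataFrame({"text": ["d", "e"]})

# Default behaviour: the original labels of each frame are kept.
kept = pd.concat([first, second])

# ignore_index=True assigns one continuous range of labels instead.
fresh = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())
print(fresh.index.tolist())
```

Duplicate labels are harmless for the Excel export here because `index=False` drops them, but label-based lookups such as `df.loc[0]` would return one row per video.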


In [None]:
from google.colab import files
# Save DataFrame to an Excel file
df.to_excel('output.xlsx', index=False)

# Download the file
files.download('output.xlsx')


In [None]:
import googleapiclient.discovery
import pandas as pd

api_service_name = "youtube"
api_version = "v3"
DEVELOPER_KEY = "Your_API_Key"

youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey=DEVELOPER_KEY)


def getcomments(video):
  request = youtube.commentThreads().list(
      part="snippet",
      videoId=video,
      maxResults=100
  )

  comments = []

  # Execute the request.
  response = request.execute()

  # Get the comments from the response.
  for item in response['items']:
      comment = item['snippet']['topLevelComment']['snippet']
      comments.append([
          comment['authorDisplayName'],
          comment['updatedAt'],
          comment['publishedAt'],
          comment['likeCount'],
          comment['textOriginal']

      ])

  while True:
    try:
      nextPageToken = response['nextPageToken']
    except KeyError:
      # No token in the response means the last page has been reached.
      break
    # Create a new request object with the next page token.
    nextRequest = youtube.commentThreads().list(part="snippet", videoId=video, maxResults=100, pageToken=nextPageToken)
    # Execute the next request.
    response = nextRequest.execute()
    # Get the comments from the next response.
    for item in response['items']:
      comment = item['snippet']['topLevelComment']['snippet']
      comments.append([
          comment['authorDisplayName'],
          comment['updatedAt'],
          comment['publishedAt'],
          comment['likeCount'],
          comment['textOriginal']
      ])

  df3 = pd.DataFrame(comments, columns=['author', 'updated_at','publishedAt', 'like_count', 'text'])
  return df3

**Scraping Comments from Six Videos:**

In [None]:
dt = pd.DataFrame()
for i in ['M7UfD79lFxw','LkjoDUcvhdc','gEDxhiNqOVc','uEftPBmbFH8','B_KM4JF5VBg','SY-i0iYhLjE']:
  df3 = getcomments(i)
  dt = pd.concat([dt, df3])

In [None]:
dt

Unnamed: 0,author,updated_at,publishedAt,like_count,text
0,@islamkejankari8025,2024-06-07T03:15:06Z,2024-06-07T03:15:06Z,0,😂😂😂😂😂😂😂
1,@MD.chahatMD.chahat,2024-06-05T07:28:50Z,2024-06-05T07:28:50Z,0,😂😂😂😂😢😢😢😢😢❤❤❤❤🎉🎉🎉🎉😊😊😅
2,@SukhatMahto,2024-06-05T06:49:34Z,2024-06-05T06:49:34Z,0,No
3,@AtulKumarThakuriya1,2024-06-05T06:45:30Z,2024-06-05T06:45:30Z,1,1:41
4,@a.djewelleryworker3875,2024-06-05T06:39:55Z,2024-06-05T06:39:55Z,0,Ajtak ko pa modi ji ne ki gusdiyahai
...,...,...,...,...,...
266,@krisk4868,2024-06-03T09:12:09Z,2024-06-03T09:12:09Z,20,Congress and tukde tukde gang is becoming desp...
267,@pnmathematics5822,2024-06-03T09:11:28Z,2024-06-03T09:11:28Z,25,We believe your work❤
268,@mybuisness6299,2024-06-03T09:05:31Z,2024-06-03T09:05:31Z,27,He is right 👍
269,@mayank866,2024-06-03T09:05:20Z,2024-06-03T09:05:20Z,10,His Exit poll was showing 33 seats in Rajastha...


In [None]:
from google.colab import files
# Save DataFrame to an Excel file
dt.to_excel('output1.xlsx', index=False)

# Download the file
files.download('output1.xlsx')


In [None]:
import googleapiclient.discovery
import pandas as pd

api_service_name = "youtube"
api_version = "v3"
DEVELOPER_KEY = "Your_API_Key"

youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey=DEVELOPER_KEY)

request = youtube.commentThreads().list(
    part="snippet",
    videoId="uSrsNUyPe-w",
    maxResults=100
)

comments = []

# Execute the request.
response = request.execute()

# Get the comments from the response.
for item in response['items']:
    comment = item['snippet']['topLevelComment']['snippet']
    comments.append([
        comment['authorDisplayName'],
        comment['updatedAt'],
        comment['publishedAt'],
        comment['likeCount'],
        comment['textOriginal']
    ])

while True:
  try:
    nextPageToken = response['nextPageToken']
  except KeyError:
    # No token in the response means the last page has been reached.
    break
  # Create a new request object with the next page token.
  # maxResults accepts values from 1 to 100.
  nextRequest = youtube.commentThreads().list(part="snippet", videoId="uSrsNUyPe-w", maxResults=100, pageToken=nextPageToken)
  # Execute the next request.
  response = nextRequest.execute()
  # Get the comments from the next response.
  for item in response['items']:
    comment = item['snippet']['topLevelComment']['snippet']
    comments.append([
        comment['authorDisplayName'],
        comment['updatedAt'],
        comment['publishedAt'],
        comment['likeCount'],
        comment['textOriginal']
    ])

df4 = pd.DataFrame(comments, columns=['author', 'updated_at','publishedAt', 'like_count', 'text'])




In [None]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1171 entries, 0 to 1170
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       1171 non-null   object
 1   updated_at   1171 non-null   object
 2   publishedAt  1171 non-null   object
 3   like_count   1171 non-null   int64 
 4   text         1171 non-null   object
dtypes: int64(1), object(4)
memory usage: 45.9+ KB


In [None]:
from google.colab import files
# Save DataFrame to an Excel file
df4.to_excel('output2.xlsx', index=False)

# Download the file
files.download('output2.xlsx')


In [None]:
import googleapiclient.discovery
import pandas as pd

api_service_name = "youtube"
api_version = "v3"
DEVELOPER_KEY = "Your_API_Key"

youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey=DEVELOPER_KEY)

request = youtube.commentThreads().list(
    part="snippet",
    videoId="RP5LjFoZSuY",
    maxResults=100
)

comments = []

# Execute the request.
response = request.execute()

# Get the comments from the response.
for item in response['items']:
    comment = item['snippet']['topLevelComment']['snippet']
    comments.append([
        comment['authorDisplayName'],
        comment['updatedAt'],
        comment['publishedAt'],
        comment['likeCount'],
        comment['textOriginal']
    ])

while True:
  try:
    nextPageToken = response['nextPageToken']
  except KeyError:
    # No token in the response means the last page has been reached.
    break
  # Create a new request object with the next page token.
  # maxResults accepts values from 1 to 100.
  nextRequest = youtube.commentThreads().list(part="snippet", videoId="RP5LjFoZSuY", maxResults=100, pageToken=nextPageToken)
  # Execute the next request.
  response = nextRequest.execute()
  # Get the comments from the next response.
  for item in response['items']:
    comment = item['snippet']['topLevelComment']['snippet']
    comments.append([
        comment['authorDisplayName'],
        comment['updatedAt'],
        comment['publishedAt'],
        comment['likeCount'],
        comment['textOriginal']
    ])

df5 = pd.DataFrame(comments, columns=['author', 'updated_at','publishedAt', 'like_count', 'text'])




In [None]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7412 entries, 0 to 7411
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       7412 non-null   object
 1   updated_at   7412 non-null   object
 2   publishedAt  7412 non-null   object
 3   like_count   7412 non-null   int64 
 4   text         7412 non-null   object
dtypes: int64(1), object(4)
memory usage: 289.7+ KB


In [None]:
from google.colab import files
# Save DataFrame to an Excel file
df5.to_excel('output3.xlsx', index=False)

# Download the file
files.download('output3.xlsx')


In [None]:
import pandas as pd
# List of Excel files to combine
# (named excel_files so it does not shadow google.colab's `files` module)
excel_files = ['output.xlsx', 'output1.xlsx', 'output2.xlsx', 'output3.xlsx']

# Load all files and combine them row-wise using pd.concat()
data = pd.concat([pd.read_excel(path) for path in excel_files], ignore_index=True)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23988 entries, 0 to 23987
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       23986 non-null  object
 1   updated_at   23988 non-null  object
 2   publishedAt  23988 non-null  object
 3   like_count   23988 non-null  int64 
 4   text         23987 non-null  object
dtypes: int64(1), object(4)
memory usage: 937.2+ KB
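
The summary above shows a few missing values in `author` and `text` after the files are read back. Below is a minimal sketch of how such rows could be counted and dropped before classification. The toy frame is made up, and dropping rows with missing text is an assumption about the later preprocessing, not a step taken in this notebook.

```python
import pandas as pd

# Toy frame with one missing author and one missing comment text.
sample = pd.DataFrame({
    "author": ["@user_a", None, "@user_c"],
    "text": ["hello", "namaste", None],
})

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Keep only rows that have comment text; a missing author is tolerated.
cleaned = sample.dropna(subset=["text"]).reset_index(drop=True)
print(len(cleaned))
```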


In [None]:
data.head()

Unnamed: 0,author,updated_at,publishedAt,like_count,text
0,@NehaYadav-ry5rq,2024-10-06T06:29:40Z,2024-10-06T06:29:40Z,1,Dont remember the last time hindus crashed a p...
1,@MASAKLOMAKN9,2024-10-04T15:05:28Z,2024-10-04T15:05:28Z,0,Being a Muslim it is our duty to te...
2,@NirajKumar-wb5hg,2024-10-04T07:03:07Z,2024-10-04T07:03:07Z,1,Bohot badhiya
3,@ShyamSaren-u2f,2024-10-03T19:44:25Z,2024-10-03T19:44:25Z,2,All Indian muslim go Pakistan
4,@GgLl-j3x,2024-10-02T21:23:14Z,2024-10-02T21:23:14Z,0,So modi pushing for more children 🧒


In [None]:
from google.colab import files
# Save DataFrame to an Excel file
data.to_excel('youtube_comment.xlsx', index=False)

# Download the file
files.download('youtube_comment.xlsx')
