# Browse Twitch website to extract streamers data

This code is used to browse twitch website in order to provide data of streamers. As the website does not provide API to get all the users data once, we first get all the streams. Then, the streams are filtered based on a tag (e.g. Iran tag). After that, the streams' usernames are applied to access the streamers links. Finally, streamers data such as number of viewers, followers, start data and so on are extracted by scraping. 

### Install  and Import libraries

In [71]:
! pip install selenium

import pandas as pd
import requests
from datetime import date
from bs4 import BeautifulSoup
from selenium import webdriver
from lxml import html
import sys


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Getting token by sending request to Twitch OAUTH2 API

In [72]:
# Geting token by curl command
"""
! curl -X POST 'https://id.twitch.tv/oauth2/token' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-d 'client_id=<client_id>&client_secret=<client_secret>&grant_type=client_credentials'
"""

# Geting token by python
url = 'https://id.twitch.tv/oauth2/token'
client_id = ''
client_secret = ''
param = {'client_id' : client_id ,'client_secret': client_secret,'grant_type':'client_credentials'}
header = {'content-type' : 'application/x-www-form-urlencoded'}

r = requests.post(url,headers=header,params=param)
d = r.json() 
token = d['access_token']

### Geting Streams data by sending request to Twitch streams API and storing results in dataframe

In [73]:
first = '100' #Maximum number of objects to return
after = ''    #Cursor for forward pagination
language = '' #Stream language= persian:'&language=fa'   all:''
pages = 100000    #Number of pages
data = pd.DataFrame()
for page in range(pages):   #The loop for accessing next pages of the results
    url = 'https://api.twitch.tv/helix/streams?first='+first+'&after='+after+language
    header = {'Client-Id' : client_id, 'Authorization' : 'Bearer '+token}
    r = requests.get(url,headers=header)
    d = r.json()
    data = data.append(pd.DataFrame(d['data']))
    try:
        after = d['pagination']['cursor']   # check if next page of results exists
    except:
        break

data = data[data['tag_ids'].notna()]    # drop streams that do not have tag
data      # show dataframe

Unnamed: 0,id,user_id,user_login,user_name,game_id,game_name,type,title,viewer_count,started_at,language,thumbnail_url,tag_ids,is_mature
0,40198265145,71092938,xqc,xQc,498566,Slots,live,#ad 18+☢️LIVE☢️DRAMA☢️NEWS☢️BIG☢️THINGS☢️CLICK...,58171,2022-09-11T23:01:55Z,en,https://static-cdn.jtvnw.net/previews-ttv/live...,[6ea6bca4-4712-4ab9-a906-e3336a9d8039],False
1,47132323357,641972806,kaicenat,KaiCenat,772421245,NBA 2K23,live,🟩BIG GIVEAWAY STARTING NOW🟩W STREAMER HERE🟩CLI...,52284,2022-09-11T22:18:02Z,en,https://static-cdn.jtvnw.net/previews-ttv/live...,[6ea6bca4-4712-4ab9-a906-e3336a9d8039],False
2,40199078873,254489093,casimito,casimito,518203,Sports,live,ao vivo e a cores!!!!!,31735,2022-09-12T01:14:21Z,pt,https://static-cdn.jtvnw.net/previews-ttv/live...,[39ee8140-901a-4762-bfca-8260dea1310f],True
3,39818569896,87056709,pgl_dota2,PGL_Dota2,29595,Dota 2,live,TI 11 Regional Qualifiers CN - Day 10 - Royal ...,22413,2022-09-12T02:32:59Z,en,https://static-cdn.jtvnw.net/previews-ttv/live...,"[36a89a80-4fcd-4b74-b3d2-2c6fd9b30c95, 6ea6bca...",False
4,47132001597,26490481,summit1g,summit1g,509549,Pummel Party,live,chillin. !gfuel !merch - @summit1g,14552,2022-09-11T21:24:24Z,en,https://static-cdn.jtvnw.net/previews-ttv/live...,[6ea6bca4-4712-4ab9-a906-e3336a9d8039],False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10,39664956679,146791017,i_am_nota,서생원_노타,,,live,배고픈,0,2022-09-12T06:32:49Z,ko,https://static-cdn.jtvnw.net/previews-ttv/live...,[ab2975e3-b9ca-4b1a-a93e-fb61a5d5c3a4],False
11,47134881229,537083794,melaridocrew,melaridocrew,417752,Talk Shows & Podcasts,live,#mostocca - Lunedi 12 Settembre,0,2022-09-12T06:30:58Z,it,https://static-cdn.jtvnw.net/previews-ttv/live...,[5b9935eb-1e9a-4217-98ad-62bda5cff0d1],False
13,39664958775,176263440,nemosemoo,네세동,511224,Apex Legends,live,크립토로 마스터가는날,0,2022-09-12T06:33:44Z,ko,https://static-cdn.jtvnw.net/previews-ttv/live...,[ab2975e3-b9ca-4b1a-a93e-fb61a5d5c3a4],False
14,39818752936,818254433,betby_games_34,betby_games_34,494995,Injustice 2,live,Random Select,0,2022-09-12T06:31:36Z,en,https://static-cdn.jtvnw.net/previews-ttv/live...,[6ea6bca4-4712-4ab9-a906-e3336a9d8039],False


### Filtering data by tag

In [74]:
# to get the tag string visit: https://www.twitch.tv/directory/all/tags and look at each tag's url
tag = '4c71840b-49cc-4f26-bcbc-13b3550a6b2a' #Iran tag to filter Iranian streams
data_persian = pd.DataFrame()   #dataframe to store persian streams data
for i in range(data.shape[0]):
    if tag in data['tag_ids'].iloc[i]:
        data_persian = data_persian.append(data.iloc[i])

### Showing top persian streams based on the number of viewers

In [75]:
data_persian.sort_values(by=['viewer_count'],ignore_index=True)
data_persian.to_csv('data_persian.csv')
data_persian

Unnamed: 0,id,user_id,user_login,user_name,game_id,game_name,type,title,viewer_count,started_at,language,thumbnail_url,tag_ids,is_mature
15,39818674824,435357764,ebi1374,EBI1374,502732.0,Garena Free Fire,live,🔥FreeFire ba ebi zhooon🔥 [Per/Eng] !donate !in...,3.0,2022-09-12T05:12:41Z,other,https://static-cdn.jtvnw.net/previews-ttv/live...,"[4c71840b-49cc-4f26-bcbc-13b3550a6b2a, fd76c79...",0.0
48,39818742680,741289651,eghigame,EGHIGAME,512710.0,Call of Duty: Warzone,live,🔥play ba viewer in warzone🔥king is back/ [+18]...,1.0,2022-09-12T06:22:51Z,other,https://static-cdn.jtvnw.net/previews-ttv/live...,"[fd76c790-0505-4c4c-865a-d6bd139c0901, dbe2039...",0.0
8,39818724248,626710620,varpexstream,VarpexStream,,,live,🔴 سرویس مولتی استریم وارپکس (ری استریم): لایو ...,1.0,2022-09-12T06:04:48Z,other,https://static-cdn.jtvnw.net/previews-ttv/live...,"[fd76c790-0505-4c4c-865a-d6bd139c0901, 4c71840...",0.0
33,39818740808,104545605,ali1867,ali1867,26936.0,Music,live,[English/Farsi] - Trying Figure Out How to Str...,1.0,2022-09-12T06:21:03Z,other,https://static-cdn.jtvnw.net/previews-ttv/live...,"[fd76c790-0505-4c4c-865a-d6bd139c0901, 4c71840...",0.0


### Creating streamers dataframe

In [76]:
streamers = data_persian[['user_id','user_name']]
streamers.drop_duplicates
streamers = streamers.reset_index()

In [77]:
data_streamer = pd.DataFrame()    #dataframe containing iranian streamers data
for user in streamers['user_id']:
    url = 'https://api.twitch.tv/helix/users?id='+user
    header = {'Client-Id' : client_id, 'Authorization' : 'Bearer '+token}
    r = requests.get(url,headers=header)
    d = r.json()
    data_streamer = data_streamer.append(pd.DataFrame(d['data']))
data_streamer = data_streamer.reset_index()
data_streamer

Unnamed: 0,index,id,login,display_name,type,broadcaster_type,description,profile_image_url,offline_image_url,view_count,created_at
0,0,435357764,ebi1374,EBI1374,,affiliate,"Yo guys welcome to my stream, I play Freefire ...",https://static-cdn.jtvnw.net/jtv_user_pictures...,,214,2019-05-12T17:10:36Z
1,0,741289651,eghigame,EGHIGAME,,,"Hello, I am Alireza Eghbali. I am happy to sup...",https://static-cdn.jtvnw.net/jtv_user_pictures...,https://static-cdn.jtvnw.net/jtv_user_pictures...,10,2021-11-08T13:04:09Z
2,0,626710620,varpexstream,VarpexStream,,,برای استفاده از سرویس مولتی استریم وارپکس به س...,https://static-cdn.jtvnw.net/jtv_user_pictures...,,4970,2020-12-27T08:35:55Z
3,0,104545605,ali1867,ali1867,,,,https://static-cdn.jtvnw.net/user-default-pict...,,604,2015-10-17T07:09:09Z


### Adding a column to count days from when the user subscribed

In [78]:
data_streamer['days_from_create'] = ((date.today()-pd.to_datetime(data_streamer['created_at']).dt.date).astype('timedelta64[D]')+1).astype('int')

### Extracting number of followers for each user and calculating follower acquisition rate

In [79]:
data_streamer['followers'] = ''   # follower column to add number of followers to dataframe
for i, user in data_streamer.iterrows():
    url = 'https://api.twitch.tv/helix/users/follows?to_id='+user['id']
    header = {'Client-Id' : client_id, 'Authorization' : 'Bearer '+token}
    r = requests.get(url,headers=header)
    d = r.json()
    data_streamer.iloc[i,data_streamer.columns.get_loc('followers')] = d['total']
# Calculating Follower_acquisition_rate by dividing number of followers by number of days since the user subscribed
data_streamer['follower_acquisition_rate'] = data_streamer['followers'] / data_streamer['days_from_create']

In [80]:
data_streamer

Unnamed: 0,index,id,login,display_name,type,broadcaster_type,description,profile_image_url,offline_image_url,view_count,created_at,days_from_create,followers,follower_acquisition_rate
0,0,435357764,ebi1374,EBI1374,,affiliate,"Yo guys welcome to my stream, I play Freefire ...",https://static-cdn.jtvnw.net/jtv_user_pictures...,,214,2019-05-12T17:10:36Z,1220,831,0.681148
1,0,741289651,eghigame,EGHIGAME,,,"Hello, I am Alireza Eghbali. I am happy to sup...",https://static-cdn.jtvnw.net/jtv_user_pictures...,https://static-cdn.jtvnw.net/jtv_user_pictures...,10,2021-11-08T13:04:09Z,309,256,0.828479
2,0,626710620,varpexstream,VarpexStream,,,برای استفاده از سرویس مولتی استریم وارپکس به س...,https://static-cdn.jtvnw.net/jtv_user_pictures...,,4970,2020-12-27T08:35:55Z,625,145,0.232
3,0,104545605,ali1867,ali1867,,,,https://static-cdn.jtvnw.net/user-default-pict...,,604,2015-10-17T07:09:09Z,2523,72,0.028537


### Extracting total watching time of videos
In order to extract total watching time of a streamer's videos, we should first scrape the videos page. For each user, this is the link to the videos:

https://www.twitch.tv/username/videos?filter=all&sort=time

As this page is a dynamic js page, it is not possible to scrape it using popular libraries for html such as BeautifulSoup. So, we used Selenium library to scrape it and chrome driver to open the contents. Therefore, the chrome driver is installed and mounted. After that, the content of the page is scraped and filtered by the class containing the videos' data. Each video's data containing duration of video [hh:mm:ss}, number of viewers and days passed from broadcasting. For example: ['4:26:35', '89 views', '18 hours ago', ...]

To calculate total watching hours of videos, the summation of number of videos multiplied by video duration is calculated. Finally, this value is stored in dataframe in column total_videos_watching_minutes.

In [None]:
# popular BeautifulSoup is not responded. As, the site is a dynamic js page.
'''
url = 'https://www.twitch.tv/ben3f1t/videos?filter=all&sort=time'
video_page = requests.get(url)
soup = BeautifulSoup(video_page.content, 'html.parser')
soup.find_all("a")
print(soup.prettify())
'''

In [81]:
# Installing chrome driver and setting options
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',options=chrome_options)

Reading package lists... Done
Building dependency tree       
Reading state information... Done
chromium-chromedriver is already the newest version (104.0.5112.101-0ubuntu0.18.04.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.
cp: '/usr/lib/chromium-browser/chromedriver' and '/usr/bin/chromedriver' are the same file


In [103]:
data_streamer['total_videos_watching_minutes'] = ''   # total_videos_watching_minutes
for i, user in data_streamer.iterrows():
    driver.get('https://www.twitch.tv/'+user['login']+'/videos?filter=all&sort=time')
    page_source = driver.page_source
    tree = html.fromstring(page_source)
    # filtering the class containing videos' data including video duration, number of views, and days passed from broadcasting.
    videos_data = tree.xpath('//div[@class="ScMediaCardStatWrapper-sc-1ncw7wk-0 jluyAA tw-media-card-stat"]/text()')

    videos_data_df = pd.DataFrame()   # dataframe of video data
    videos_data_df['duration'] = videos_data[0::3]      # video duration
    videos_data_df['views'] = videos_data[1::3]         # number of views
    videos_data_df['days_before'] = videos_data[2::3]   # days passed from broadcasting
    videos_data_df['views'] = videos_data_df['views'].astype('str')
    videos_data_df['duration'] = videos_data_df['duration'].astype('str')
    videos_data_df['views'] = videos_data_df['views'].str.split(' ').str[0]
    videos_data_df['duration_hour'] = ''  # video duration (Hour)
    videos_data_df['duration_min']  = '' # video duration (Minute)
    videos_data_df['duration_sec']  = '' # video duration (Second)

    videos_data_df['duration_hour'] =  videos_data_df['duration'].str.split(':').str[-3]
    videos_data_df['duration_min'] =  videos_data_df['duration'].str.split(':').str[-2]
    videos_data_df['duration_sec'] =  videos_data_df['duration'].str.split(':').str[-1]

    videos_data_df = videos_data_df.fillna(0)    # convert NaN values to zeros

    total_videos_watching_minutes = ((videos_data_df['duration_hour'].astype('int')*60 + \
                                 videos_data_df['duration_min'].astype('int') + \
                                 videos_data_df['duration_sec'].astype('int')/60   ) * \
                                 videos_data_df['views'].astype('int')) . sum()
    data_streamer.iloc[i,data_streamer.columns.get_loc('total_videos_watching_minutes')] = int(total_videos_watching_minutes)
    del videos_data_df

### Streamers data containing following main data:
1. user_id
2. view_count
3. follower_acquisition_rate
4. total_videos_watching_minutes

In [104]:
data_streamer

Unnamed: 0,index,id,login,display_name,type,broadcaster_type,description,profile_image_url,offline_image_url,view_count,created_at,days_from_create,followers,follower_acquisition_rate,total_videos_watching_minutes
0,0,435357764,ebi1374,EBI1374,,affiliate,"Yo guys welcome to my stream, I play Freefire ...",https://static-cdn.jtvnw.net/jtv_user_pictures...,,214,2019-05-12T17:10:36Z,1220,831,0.681148,12577
1,0,741289651,eghigame,EGHIGAME,,,"Hello, I am Alireza Eghbali. I am happy to sup...",https://static-cdn.jtvnw.net/jtv_user_pictures...,https://static-cdn.jtvnw.net/jtv_user_pictures...,10,2021-11-08T13:04:09Z,309,256,0.828479,0
2,0,626710620,varpexstream,VarpexStream,,,برای استفاده از سرویس مولتی استریم وارپکس به س...,https://static-cdn.jtvnw.net/jtv_user_pictures...,,4970,2020-12-27T08:35:55Z,625,145,0.232,0
3,0,104545605,ali1867,ali1867,,,,https://static-cdn.jtvnw.net/user-default-pict...,,604,2015-10-17T07:09:09Z,2523,72,0.028537,0
