# Code to retrieve YouTube video context info concerning Bill Gates

### Importing all required packages
More imports than strictly needed; copied over from an older project.

In [33]:
#Text Analysis
import re
import string
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import requests
from collections import Counter
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Date Parsing
#import isodate
from dateutil.parser import parse
import datetime

# Data Visualization
import seaborn as sns
sns.set(style = 'whitegrid')
import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint
import networkx as nx

# Saving Data
import json
import csv

# API Communication
import sys
from googleapiclient.discovery import build  # 'apiclient' is the legacy alias of this package
from googleapiclient.errors import HttpError
from time import sleep

### My API-Key

Scraping YouTube videos requires an API key, since we are communicating with an API server. This requires a Google account as well as access to the API library: https://console.developers.google.com/apis/library?project=scraping-232414. There, I need the YouTube Data API v3, which I can look up in the search field. Below, I store the API key in one place, in case I need to replace it.

In [34]:
api_key = 'AIzaSyBETL5RIt9uC9fLtGcfPZLpM3A4XR1a4Fg'
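Hard-coding the key is convenient, but it means the key travels with every copy of the notebook. A minimal sketch of an alternative, assuming the key has been exported as an environment variable beforehand. The variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are my own choices for illustration, not part of the Google client library:

```python
import os


def load_api_key(env_var="YOUTUBE_API_KEY", environ=None):
    """Return the API key stored in the given environment variable.

    `env_var` and this helper are hypothetical names chosen for illustration.
    `environ` lets a caller pass a plain dict instead of os.environ.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before running the notebook")
    return key
```

With the variable set, `api_key = load_api_key()` would replace the assignment above.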

## Part 1 - Scraping and analyzing Video IDs and Information by Search Result

### Building a Basic ID Scraper

First, I build a function that returns a list of video IDs for the videos that would show up if we entered a given search term on YouTube. A single call returns at most 50 results, which means that I must also build an iterator later. Here, I am only interested in a function that gets me the first 50 YouTube IDs. The resulting list is sorted by date, and a marker (token) is returned with each page so that the iterator knows where to start the next call.

Originally, this function was designed to walk through all pages with a while loop and an exception handler. However, I ran into infinite loops with this method, so I specify the number of iterations in a separate function.

#### Modifications in May 2020: added the relevanceLanguage and videoCaption parameters to the function

In [16]:
def youtube_search(q, max_results=50,order="date", token=None, relevanceLanguage='en'):  
    youtube = build('youtube', 'v3',
        developerKey=api_key)
    #defining how the search result is to be stored
    search_response = youtube.search().list(
        q=q,
        videoCaption='closedCaption',
        type="video",
        
        #Set a marker after the 50 Video-Portion
        pageToken=token,
        
        #Order, in which the results come in
        order = order,
        
        #I just want to store the ID
        part="id",
        
        #To be adjusted in between 1-50
        maxResults=max_results,
        
        relevanceLanguage=relevanceLanguage,

        ).execute()
    
    videos = []
    # Keep only items that are actually videos
    # (adapted from https://github.com/spnichol/youtube_tutorial/blob/master/youtube_videos.py)
    for search_result in search_response.get("items", []):
        if search_result["id"]["kind"] == "youtube#video":
            videos.append(search_result)

    # Token for the next portion of 50 videos; the last page carries none,
    # in which case the marker "last_page" is returned instead
    nexttok = search_response.get("nextPageToken", "last_page")
    return (nexttok, videos)

### Function to get all search Results

Searching through all YouTube videos, not just the first 50, requires a loop of calls: one call returns at most 50 results, and almost every search term one could think of matches more than 50 videos.

Below I define a function that gets me all video IDs that show up for a given search term. I use the function above as a basis and build an iterator around it. The function also takes an explicit number of rounds, since I ran into infinite loops when I relied only on the 'last_page' marker returned by 'youtube_search'.

In [17]:
#Function with Search Term and Number of Iterations
def longsearch(term, max_rounds):
    
    #Storage for Videos
    fulldict = []
    
    #temporary storage for current result
    test = youtube_search(term)
    
    #append current result to Storage
    fulldict.append(test)
    
    #Condition of repetition - until iteration nr is reached
    while len(fulldict) < max_rounds:
        
        #First Element in previous Result is the token for the next page
        token = test[0]
        
        #Stop early once the last page has been reached
        if token == "last_page":
            break
        
        #Storing next Result temporarily
        test = youtube_search(term, token = token)
        
        # Append next result to Storage
        fulldict.append(test)
        
        #Break for the Server
        sleep(1)
        
    return fulldict
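
The bounded iteration described above can be tried out without spending API quota. The following is only a sketch under my own assumptions: `collect_pages` and `fake_fetch` are hypothetical names, and the fake pages stand in for real search responses. It shows the two stop conditions side by side, the round limit and the missing next-page token:

```python
def collect_pages(fetch_page, max_rounds):
    """Collect up to max_rounds pages from a token-paginated source.

    fetch_page(token) must return (next_token, items); next_token is None
    when no further page exists. Both names are placeholders.
    """
    pages = []
    token = None
    for _ in range(max_rounds):
        token, items = fetch_page(token)
        pages.append(items)
        if token is None:  # last page reached, stop early
            break
    return pages


# Stand-in for the API: three pages keyed by the token that requests them
FAKE_PAGES = {None: ("t1", ["a", "b"]), "t1": ("t2", ["c"]), "t2": (None, ["d"])}


def fake_fetch(token):
    return FAKE_PAGES[token]


print(collect_pages(fake_fetch, max_rounds=15))  # [['a', 'b'], ['c'], ['d']]
print(collect_pages(fake_fetch, max_rounds=2))   # [['a', 'b'], ['c']]
```

Because the loop is a `for` over a fixed range, it cannot run forever even if the source never signals a last page.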

### Executing the function
Now I can use the function for my search term 'Bill Gates', iterating 15 times. This is enough to retrieve all videos.

In [18]:
ger = longsearch('Bill Gates', 15)

### Flatten list of Results and only display IDs
There are two things I need to sort out before this dataset is usable for further requests and analysis:

1) Each result contains not only the video ID but also other information about the video, which I won't need for analysis. I want a plain dataset of video IDs only.

2) The list is still nested in portions of 50, due to the function and the iteration above. I need to flatten this list to one level.

Therefore, I have created the function 'scaledown', which takes the dataset as its only input and handles both tasks.

In [21]:
truelist_ger

['DyIH3O0NRsM',
 '43LiB6Ynmmw',
 'BX1_2j-QQuk',
 'QtNn2BE_6rE',
 'lz4dDKSizGg',
 'oXhUl4jhiGU',
 'KkPmlYZ_0_g',
 'bRD4BGec0wU',
 'O_tF0qAf2sY',
 '07HNYmIAuwE',
 'id1ryXAgygE',
 'csozbIUq3wY',
 'jIM5o-GdGKw',
 'Gnh4GBzgPHM',
 'uB-d0qCLXfA',
 'bKU8qKzaK_8',
 'tBNNPdyMshE',
 '_dgc1I-64QU',
 'jLbJayQygzw',
 'Y6OQ15nzoFc',
 'U-onOPOIV60',
 'jvLcWSYGyJ8',
 'DSvhPnUgyz8',
 'OAnLTiBIRts',
 'gmhbeb0dVFA',
 'h_oSEqDXYkI',
 'lCkIjYhAdHA',
 'NtdnddEXWlE',
 'wUtLKJUxBPA',
 'xrNzbrYuCOE',
 'PJNfJOdQ_eI',
 'v0UdgSEs6LI',
 'BWyTeZlx55w',
 'QIf4idTBz7E',
 'ETAnfJiAeq8',
 'YEbQgKIXWVc',
 'YWjkYM8slFU',
 'c7zJSRTcwrY',
 'il7RJ9BmGOg',
 'iqB8Quc8A9g',
 'sw4szXqz0Ro',
 '4R8CB0QqTA8',
 'WUF6c1uAlfw',
 'MgnPFncOE8o',
 'fpZGzeFTBbY',
 'aN_BO4Sez-Q',
 'Q2VPN8Yr1BU',
 'igx86PoU7v8',
 'I1aZzNlfq4g',
 'zGrry9a6KfM',
 'hXR2KKyQt0Q',
 'nVpWXB5IQLw',
 'CgYTeXtQbds',
 'K5SPfG2vErQ',
 'kUuGqxUzPKY',
 '8KzJ5cH5_tg',
 'CAVEkWUOaIE',
 'qN6bwZE0v-c',
 'mZ32rLrL2ag',
 'mX51utvnj9E',
 'oef5YexoLck',
 'AhuU_CG9v80',
 '8-etkt

In [19]:
#Enter previous Dataset
def scaledown(dataset):
    
    #Create new Storage
    truelist = []
    
    #Iterate through each of the Rounds
    for elem in dataset:
        trueelem = elem[1]
        
        #Iterate through each of the 50 results
        for i in trueelem:
            
            #Only aim for the Video ID
            truelist.append(i['id']['videoId'])
    return truelist


#Usage of the Function, storing the results in 'truelist'
truelist_ger = scaledown(ger)
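
For comparison, a compact variant of the same flattening step on toy data. This is a sketch, not what I ran: `flatten_ids` is a hypothetical name, the IDs are made up, and dropping duplicates is an extra step that 'scaledown' does not perform:

```python
from itertools import chain

# Toy data in the shape longsearch returns: one (next_token, items) tuple per round
rounds = [
    ("tok1", [{"id": {"kind": "youtube#video", "videoId": "aaa"}},
              {"id": {"kind": "youtube#video", "videoId": "bbb"}}]),
    ("last_page", [{"id": {"kind": "youtube#video", "videoId": "aaa"}}]),
]


def flatten_ids(dataset):
    """Return video IDs in first-seen order, without duplicates."""
    all_items = chain.from_iterable(items for _token, items in dataset)
    ids = (item["id"]["videoId"] for item in all_items)
    return list(dict.fromkeys(ids))  # dict keys keep insertion order


print(flatten_ids(rounds))  # ['aaa', 'bbb']
```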
    


### Checking the Length of the Queried Results

The output below shows how many video IDs the search for 'Bill Gates' returned (540 in this run).

In [113]:
print(len(truelist_ger))

540


### Iterator to get comments

In [33]:
# ID of the first video
truelist_ger[0]

'Yffx7LO7ecw'

In [45]:
# Number of videos in idlist
len(truelist_ger)

574

In [37]:
# Service specification for usage below
youtube = build('youtube', 'v3', developerKey=api_key)

In [77]:
# Function to get comments of a video. As service, just insert youtube from line above.

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()
 
    while results:
        for item in results['items']:
            
            comments.append(item)
 
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break
 
    return comments

In [78]:
# One exemplary comment query for a vid ID
cmm = get_video_comments(youtube, part='snippet', videoId='Yffx7LO7ecw')

In [90]:
txtlst = []
for i in cmm:
    txtlst.append(i['snippet']['topLevelComment']['snippet']['textDisplay'])

In [93]:
# Executing comments query for each item in video id list, appending it to comments
commlist = []
for item in truelist_ger:
    try:
        commlist.append(get_video_comments(youtube, part='snippet', videoId=item, textFormat='plainText'))
    except HttpError:
        # Request failed for this video: print a marker and skip it
        print('one')

one
one
...
(250 lines of 'one' in total, one per failed request)
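
The printed markers only tell me how many requests failed, not which ones. A sketch of a loop that records the failing IDs, under my own assumptions: `fetch_all`, `FetchError` and `fake_comment_fetch` are hypothetical names, the error message is invented, and in the notebook `HttpError` and 'get_video_comments' would take their place:

```python
class FetchError(Exception):
    """Placeholder for the API error type (HttpError in this notebook)."""


def fetch_all(video_ids, fetch, error_type=FetchError):
    """Call fetch(video_id) for each ID; return (results, failed_ids)."""
    results, failed = {}, []
    for vid in video_ids:
        try:
            results[vid] = fetch(vid)
        except error_type as err:
            # Remember which video failed and why, then carry on
            failed.append((vid, str(err)))
    return results, failed


def fake_comment_fetch(vid):
    # Stand-in for a real comment request: IDs starting with "x" fail
    if vid.startswith("x"):
        raise FetchError("request rejected")
    return [f"comment on {vid}"]


results, failed = fetch_all(["a1", "x2", "b3"], fake_comment_fetch)
print(sorted(results))  # ['a1', 'b3']
print(failed)           # [('x2', 'request rejected')]
```

Keying the results by video ID also keeps the link between a video and its comments, which a plain list of lists loses once some videos are skipped.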


In [96]:
# shows number of distinct videos with comments
len(commlist)

119

In [119]:
len(commlist[21])

3

In [115]:
truelist_ger[100]

'b7Ok9ZGEAcE'

In [140]:
# Collect the fields of interest from the first comment in commlist[21]
thread = commlist[21][0]
top = thread['snippet']['topLevelComment']['snippet']
info = {
    'comm_id': thread['id'],
    'video_id': top['videoId'],
    'comment_date': top['publishedAt'],
    'last_edit_date': top['updatedAt'],
    'comment_content': top['textDisplay'],
    'comment_original': top['textOriginal'],
    'like_count': top['likeCount'],
}

In [151]:
comms = []

# One flat record per top-level comment, across all videos
for video_comments in commlist:
    for thread in video_comments:
        top = thread['snippet']['topLevelComment']['snippet']
        comms.append({
            'comm_id': thread['id'],
            'video_id': top['videoId'],
            'comment_date': top['publishedAt'],
            'last_edit_date': top['updatedAt'],
            'comment_content': top['textDisplay'],
            'comment_original': top['textOriginal'],
            'like_count': top['likeCount']})

In [155]:
comment_frame = pd.DataFrame(comms)

In [156]:
comment_frame.to_csv('comments first.csv')

In [147]:
info

{'comm_id': 'Ugwgygt7gmlUQOM_ORZ4AaABAg',
 'video_id': 'jvLcWSYGyJ8',
 'comment_date': '2020-05-27T18:20:11Z',
 'last_edit_date': '2020-05-27T18:20:11Z',
 'comment_content': 'Ma non ci sono i sottotitoli.',
 'comment_original': 'Ma non ci sono i sottotitoli.',
 'like_count': 0}

In [138]:
commlist[21][0]['snippet']['topLevelComment']['snippet']['publishedAt']

'2020-05-27T18:20:11Z'
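
The timestamps come back as plain strings. A minimal sketch of turning one into a timezone-aware datetime with the standard library. `parse_published_at` is a hypothetical helper, and it assumes the exact layout shown above (UTC, trailing 'Z', no fractional seconds):

```python
from datetime import datetime, timezone


def parse_published_at(stamp):
    """Parse a timestamp such as '2020-05-27T18:20:11Z' into an aware datetime."""
    naive = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing 'Z' means UTC, so attach that timezone explicitly
    return naive.replace(tzinfo=timezone.utc)


ts = parse_published_at("2020-05-27T18:20:11Z")
print(ts.isoformat())  # 2020-05-27T18:20:11+00:00
print(ts.date())       # 2020-05-27
```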

In [139]:
commlist[21][0]['snippet']['topLevelComment']['snippet']

{'videoId': 'jvLcWSYGyJ8',
 'textDisplay': 'Ma non ci sono i sottotitoli.',
 'textOriginal': 'Ma non ci sono i sottotitoli.',
 'authorDisplayName': 'bartok bartokk',
 'authorProfileImageUrl': 'https://yt3.ggpht.com/a/AATXAJweK4JKXufx6Wg75Y1Geu58UicjfmbPcoRg0A=s48-c-k-c0xffffffff-no-rj-mo',
 'authorChannelUrl': 'http://www.youtube.com/channel/UCd3hryR4GK-_nVT8xtuUrAQ',
 'authorChannelId': {'value': 'UCd3hryR4GK-_nVT8xtuUrAQ'},
 'canRate': True,
 'viewerRating': 'none',
 'likeCount': 0,
 'publishedAt': '2020-05-27T18:20:11Z',
 'updatedAt': '2020-05-27T18:20:11Z'}

In [120]:
commlist[21]

[{'kind': 'youtube#commentThread',
  'etag': 'VkroXz8K4ws0adAkB5LgdsVMIFU',
  'id': 'Ugwgygt7gmlUQOM_ORZ4AaABAg',
  'snippet': {'videoId': 'jvLcWSYGyJ8',
   'topLevelComment': {'kind': 'youtube#comment',
    'etag': '7eViPZgb0Xk69S2JS2eHTlL-TAU',
    'id': 'Ugwgygt7gmlUQOM_ORZ4AaABAg',
    'snippet': {'videoId': 'jvLcWSYGyJ8',
     'textDisplay': 'Ma non ci sono i sottotitoli.',
     'textOriginal': 'Ma non ci sono i sottotitoli.',
     'authorDisplayName': 'bartok bartokk',
     'authorProfileImageUrl': 'https://yt3.ggpht.com/a/AATXAJweK4JKXufx6Wg75Y1Geu58UicjfmbPcoRg0A=s48-c-k-c0xffffffff-no-rj-mo',
     'authorChannelUrl': 'http://www.youtube.com/channel/UCd3hryR4GK-_nVT8xtuUrAQ',
     'authorChannelId': {'value': 'UCd3hryR4GK-_nVT8xtuUrAQ'},
     'canRate': True,
     'viewerRating': 'none',
     'likeCount': 0,
     'publishedAt': '2020-05-27T18:20:11Z',
     'updatedAt': '2020-05-27T18:20:11Z'}},
   'canReply': True,
   'totalReplyCount': 1,
   'isPublic': True}},
 {'kind': 'yout

In [49]:
# flatten list
allcoms = []
for i in commlist:
    for comm in i:
        allcoms.append(comm)

In [50]:
# shows total number of comments across videos
len(allcoms)

41039

In [59]:
# Comments of the video at index 100
commlist[100]

['LINK mit QUELLEN: https://www.mmnews.de/wirtschaft/144175-spiegel-kriegt-2-3-mio-euro-von-bill-gates',
 'So ein Quatsch, sucht euch einen Job.',
 '💪🏻👍🏻♥️🙏🏼',
 'Sie möchten Ungreifbar in Ihren Migaloo Atomangetrieben auf Ozeanien kreisen.',
 'So ein Schwachsinn',
 'Handelsgesetzbuch\n- belegt es, dass das \n\nBundesministerium der Justiz eine private Firma ist? \n\nUnd hat es sich selbst ermächtigt?\n\n§ 9a Übertragung der Führung des Unternehmensregisters; Verordnungsermächtigung\n(1) Das Bundesministerium der Justiz und für Verbraucherschutz wird ermächtigt, durch \nRechtsverordnung mit Zustimmung des Bundesrates \n\neiner juristischen Person des Privatrechts\n\ndie Aufgaben nach § 8b Abs. 1 zu übertragen.\n\nDer Beliehene erlangt die Stellung einer Justizbehörde des Bundes \n\n§ 8b Unternehmensregister\n\n(1) Das Unternehmensregister wird vorbehaltlich einer Regelung nach § 9a Abs. 1 vom Bundesministerium der Justiz und für Verbraucherschutz elektronisch geführt.',
 'da zeigt sich 

In [60]:
# Save comments to external file
with open('comments_de.csv', 'w', encoding = 'utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(commlist)

In [25]:
with open('video_ids.csv', 'w', encoding = 'utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(truelist_ger)

In [None]:
with open('video_ids.csv', 'r', encoding = 'utf-8', newline='') as csvfile:
    reader = csv.reader(csvfile)
    rows = list(reader)

### Read in the Data
Note: I accidentally saved the video IDs too finely (each character in a separate column), so I had to repair them.
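
The cause can be reproduced in memory. `csv.writer.writerows` expects a sequence of rows, and a bare string counts as a row of single characters. A small sketch with made-up IDs, writing to `io.StringIO` instead of a file:

```python
import csv
import io

video_ids = ["abc", "de"]  # made-up IDs

# A bare string is iterated character by character, so each character becomes a column
broken = io.StringIO(newline="")
csv.writer(broken).writerows(video_ids)

# Wrapping each ID in a one-element list gives one ID per row
fixed = io.StringIO(newline="")
csv.writer(fixed).writerows([vid] for vid in video_ids)

print(repr(broken.getvalue()))  # 'a,b,c\r\nd,e\r\n'
print(repr(fixed.getvalue()))   # 'abc\r\nde\r\n'
```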

In [3]:
with open('video_ids.csv', newline='') as csvfile:
    ids = csv.reader(csvfile, delimiter=' ', quotechar='|')
    

FileNotFoundError: [Errno 2] No such file or directory: 'video_ids.csv'

In [15]:
ids = pd.read_csv('video_ids.csv')

In [22]:
ids.loc[0,:]

Y      s
f      Y
f.1    T
x      0
7      A
L      2
O      6
7.1    w
e      s
c      x
w      Q
Name: 0, dtype: object

In [37]:
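# Re-join the single characters of each row into one video ID.
# read_csv used the first ID ('Yffx7LO7ecw') as the header row, so it is not part of these rows.
vid_id_lst = []
for i in range(ids.shape[0]):
    vid_id_lst.append(''.join(ids.loc[i,:]))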

In [26]:
idlst = pd.DataFrame(truelist_ger)

In [29]:
idlst.to_csv('idlst.csv',index=False)

In [35]:
idlst = pd.read_csv('idlst.csv')