# Abstract

Author : Geneviève Masioni

Email : genevieve.masioni@estudiants.urv.cat


### /!\ The full PDF report can be found in the same folder as this notebook.  /!\ 

### Background

**Sentiment analysis (or opinion mining) is a natural language processing technique usually performed on textual data to determine whether a piece of data is positive, negative, or neutral**. Traditionally, it allows businesses to monitor sentiment in customer feedback and therefore tailor their products or services to match their customers’ needs. This technique is **here applied to YouTube comments to help online content creators monitor their audience’s opinion without reading all the comments under a video**. This is particularly useful for those whose content generates thousands to millions of reactions.

###  Aims 

This project has multiple goals that work towards allowing the user to :
1. Get an overview of the audience's opinion on a piece of content (**sentiment + emotion classification**), 
2. Roughly know what is discussed in the comment section (**topic classification**), 
3. Spot intense negative emotions and take actions (report or ban toxic viewers).

Ideally, the overview also provides insights into different objective aspects of the video : video, audio, content, length and editing quality.

###  Method 

The key steps of this project are :
1. **Data collection using the YouTube API**. Creation of JSON files to design a **NoSQL database using Elasticsearch**. Each comment is a document and one JSON file is created per video (a reference to the video is stored in the document).
2. **Data preprocessing** : text cleaning and comment classification (sentiment, emotion, toxicity and topic) using TensorFlow/PyTorch.
3. **Data analysis** : analysis of the classified comments to answer 3 key questions ; 
    - What are the overall sentiment and emotions ? 
    - What are people talking about ? 
    - Are there any inappropriate comments ? Who is publishing those comments ?

4. **Data summarisation in a Kibana dashboard**.

### Results 

A dashboard summarising the sentiment, topics and emotions expressed in a video’s comment section.

### Tools  
- YouTube API : data collection (video comments).
- NoSQL with Elasticsearch : for its tokenization feature that allows us to search through text.
- Python Transformers package (based on TensorFlow and PyTorch) for text classification using different models (sentiment, emotion, toxicity and topic classification).
- Kibana : data visualisation (final dashboard).
- Jupyter notebook : source code and synthetic explanation of our process.
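
As a rough illustration of the full-text search that motivates the choice of Elasticsearch, the sketch below builds a `match` query body for the comment text. The helper name `text_match_query` and its default field path are assumptions made for this example (the path mirrors the structure of a top-level comment returned by the YouTube API); the resulting dictionary is meant to be sent to the `_search` endpoint of the index.

```python
def text_match_query(words, field="snippet.topLevelComment.snippet.textDisplay", size=10):
    """Build an Elasticsearch `match` query body (hypothetical helper for this sketch)."""
    return {
        "size": size,
        # "operator": "and" requires every analysed word to appear in the comment
        "query": {"match": {field: {"query": words, "operator": "and"}}},
    }

query = text_match_query("audio quality")
print(query)
```

Because the text field is analysed (tokenized) at indexing time, such a query matches comments containing both words regardless of their position in the text.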


### About this notebook
This notebook applies our method to a TEDx Talks video *[When money isn’t real: the \$10,000 experiment](https://www.youtube.com/watch?v=_VB39Jo8mAQ)* with **4 690 comments**.

# 0. Configuration : packages, database

In [1]:
# Install required packages
import sys
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install torch
!{sys.executable} -m pip install tensorflow
!{sys.executable} -m pip install transformers
!{sys.executable} -m pip install --upgrade google-api-python-client
!{sys.executable} -m pip install detoxify
!{sys.executable} -m pip install gensim

# utils  
import pprint
import json
import time
import re

Defaulting to user installation because normal site-packages is not writeable


## Database setup : Elasticsearch and Kibana

In [2]:
import docker

client = docker.from_env()
# pull elasticsearch
image = client.images.pull('docker.elastic.co/elasticsearch/elasticsearch:7.12.1')
print(image)
# pull kibana
image = client.images.pull('docker.elastic.co/kibana/kibana:7.12.1')
print(image)

# create a network so kibana can find elastic
client.networks.create('elastic')
container = client.containers.run(
    image='docker.elastic.co/elasticsearch/elasticsearch:7.12.1',
    auto_remove=True,
    environment={'discovery.type':'single-node'},
    name='bda-project-masioni',
    ports={9200:9200, 9300:9300},
    network='elastic',
    detach=True)
container.logs()

# Now let's start kibana
container = client.containers.run(
    image='docker.elastic.co/kibana/kibana:7.12.1',
    auto_remove=True,
    name='bda-kb',
    environment={'ELASTICSEARCH_HOSTS':'http://bda-project-masioni:9200'},
    ports={5601:5601},
    network='elastic',
    detach=True)
container.logs()
display(container.status)
container.top()

<Image: 'docker.elastic.co/elasticsearch/elasticsearch:7.12.1'>
<Image: 'docker.elastic.co/kibana/kibana:7.12.1'>


APIError: 409 Client Error for http+docker://localhost/v1.41/containers/create?name=bda-project-masioni: Conflict ("Conflict. The container name "/bda-project-masioni" is already in use by container "972a7af038b6cdb266e54d154712e352e3349a19b0c3608b0b0b99d45cf9deaf". You have to remove (or rename) that container to be able to reuse that name.")

# 1. Data gathering

**NOTE** : You can either 
- Block 1.1 : run the entire data gathering process using the YouTube API and classify (approx. 1.8 minutes)
- Block 1.2 : load the data from a JSON file in the *dataset/* folder (ready-to-use result of block 1.1).
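
The choice between the two blocks can be sketched as a small caching helper: reuse the stored JSON file when it exists, otherwise gather the data and store it. The names `load_or_fetch`, `fetch` and `fake_fetch` are hypothetical and only serve this illustration; the notebook itself keeps the two blocks separate.

```python
import json
import os
import tempfile

def load_or_fetch(video_id, fetch, folder="dataset"):
    """Return the cached comments if the JSON file exists, otherwise fetch and cache them."""
    path = os.path.join(folder, video_id + ".json")
    if os.path.exists(path):
        with open(path) as json_file:
            return json.load(json_file)
    comments = fetch(video_id)
    os.makedirs(folder, exist_ok=True)
    with open(path, "w") as json_file:
        json.dump(comments, json_file)
    return comments

# Demonstration with a stand-in for the YouTube API call
calls = []
def fake_fetch(video_id):
    calls.append(video_id)
    return {"0": {"id": "abc"}}

with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch("demo", fake_fetch, folder=tmp)   # fetches and writes the file
    second = load_or_fetch("demo", fake_fetch, folder=tmp)  # reads the file back
print(first == second, len(calls))
```

In this demonstration the stand-in fetcher is only called once; the second call is served from the JSON file.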

In [3]:
# GLOBAL VARS + UTILS - THIS BLOCK MUST BE EXECUTED 
   
# Video to analyze : TEDx Talks - "When money isn’t real: the $10,000 experiment"
VIDEO_ID = "_VB39Jo8mAQ"

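def save_data(video_id, comments):
    # the context manager closes the file even if json.dump raises
    with open("dataset/" + video_id + ".json", "w") as json_file:
        json.dump(comments, json_file)
    
def load_data(video_id):
    with open("dataset/" + video_id + ".json") as json_file:
        # JSON object keys are always strings : restore the integer keys used in this notebook
        return {int(key): value for key, value in json.load(json_file).items()}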

## 1.1 Fetch data using YouTube API

In [5]:
import os
import googleapiclient.discovery

# Get YouTube API Key from file
DEVELOPER_KEY = ""
with open('credentials/credentials.txt') as f: 
    # read() returns the key as a single string (readlines() would return a list of lines)
    DEVELOPER_KEY = f.read().strip()

# Disable OAuthlib's HTTPS verification when running locally.
# *DO NOT* leave this option enabled in production.
os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

api_service_name = "youtube"
api_version = "v3"
youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey = DEVELOPER_KEY)
    
def get_comment_replies(comments):
    comments_extended = comments
    for comment in comments:
        if(comment['snippet']['totalReplyCount'] > 0):
            comment_id = comment['id']
            request = youtube.comments().list(
                part="snippet",
                maxResults=100,
                parentId=comment_id,
                textFormat="plainText"
            )
            response = request.execute()
            comments_extended = comments_extended + response["items"]
    return comments_extended
    
def get_data_from_youtube_api(video_id):
    request = youtube.commentThreads().list(
        part="snippet",
        maxResults=100,
        textFormat="plainText",
        videoId=video_id
    )
    response = request.execute()
    comments = get_comment_replies(response["items"])
    
    if("nextPageToken" in response):
        next_page_token = response["nextPageToken"]
        while(next_page_token):
            request = youtube.commentThreads().list(
                part="snippet,replies",
                maxResults=100,
                textFormat="plainText",
                videoId=video_id,
                pageToken=next_page_token
            )
            response = request.execute()
            comments = comments + get_comment_replies(response["items"])
            if("nextPageToken" in response):
                next_page_token = response["nextPageToken"]
            else:
                next_page_token = None
    return comments

start = time.time()
comments = get_data_from_youtube_api(VIDEO_ID)
end = time.time()
print("Elapsed time : ", end - start)
print("Number of comments : ", len(comments))
print("Sneak peek : ")
pprint.pprint(comments[0])

# Transform to dictionary
indexes = list(range(len(comments)))
comments = dict(zip(indexes, comments))
# Save to JSON file
save_data(VIDEO_ID, comments)

Elapsed time :  45.20023488998413
Number of comments :  4656
Sneak peek : 
{'etag': 'VYCu_WJjXxGgqAuGYvyP-1zTjOo',
 'id': 'UgxK5RGyzy_FAG4R_1d4AaABAg',
 'kind': 'youtube#commentThread',
 'snippet': {'canReply': True,
             'isPublic': True,
             'topLevelComment': {'etag': 'ze2S-AAQlPiB2p3bdV5qM7YNTm4',
                                 'id': 'UgxK5RGyzy_FAG4R_1d4AaABAg',
                                 'kind': 'youtube#comment',
                                 'snippet': {'authorChannelId': {'value': 'UCzY2ONyq7_ge_doVLjOdBJg'},
                                             'authorChannelUrl': 'http://www.youtube.com/channel/UCzY2ONyq7_ge_doVLjOdBJg',
                                             'authorDisplayName': 'Pete '
                                                                  'Romocki',
                                             'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTY4QYWXCxqmfXtPsy7rq3ofDicZUPTf3E9nbcv8A=s48-c-k-c0x00ffffff-no-rj',
 

## 1.2 Load data from a stored JSON file

The file is a ready-to-use result of the gathering process 1.1.

In [None]:
comments = load_data(VIDEO_ID)
pprint.pprint(comments[0])

# 2. Data preprocessing

### 2.1 Text cleaning 

- Removal of special characters (e.g. "\n")
- Removal of links (e.g. self-promotion, spam)

We DO NOT apply the following :
- Removal of punctuation (e.g. "?, !, ;")
- Removal of emojis

BECAUSE that information (punctuation, emojis...) **helps classify emotions correctly**. Removing those key elements degrades the performance of the emotion classifier, as our tests showed :

![Emotion_classification.png](attachment:Emotion_classification.png)
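
To make the cleaning rules concrete, here is a minimal sketch applied to a single made-up comment. The function name `clean_text` and the sample text are assumptions for this example, and whitespace is collapsed into single spaces here, which is a simplification rather than the exact behaviour of the notebook's cleaning step. Links are removed, while punctuation and emojis are left untouched.

```python
import re

def clean_text(message):
    """Remove links and control characters, keep punctuation and emojis (illustrative)."""
    # drop any whitespace-delimited chunk that contains a http(s) link
    message = re.sub(r"\S*https?:\S*", "", message)
    # collapse newlines, tabs and repeated spaces into a single space
    message = re.sub(r"\s+", " ", message)
    return message.strip()

sample = "Great talk!!! 😍\nWatch my channel https://example.com/promo\tthanks?"
cleaned = clean_text(sample)
print(cleaned)
```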

In [6]:
def clean_comments(comments):
    for key in comments:
        comment = comments[key]
        comment_type = comment["kind"]
        message = ""
        if(comment_type == "youtube#comment"):
            message = comment['snippet']['textDisplay']
        else:
            message = comment['snippet']['topLevelComment']['snippet']['textDisplay']
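        # match the actual control characters (the previous pattern only matched a literal backslash followed by a letter)
        # and replace them by a space so that words are not glued together
        message = re.sub(r"[\n\r\t\v]", " ", message)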
        message = re.sub(r"\S*https?:\S*", "", message)
        if(comment_type == "youtube#comment"):
            comment['snippet']['textDisplay'] = message
        else:
            comment['snippet']['topLevelComment']['snippet']['textDisplay'] = message
        
start = time.time()
clean_comments(comments)
end = time.time()
print("Elapsed time : ", end - start)
print("Sneak peek : ")
pprint.pprint(comments[0])

Elapsed time :  0.08353090286254883
Sneak peek : 
{'etag': 'VYCu_WJjXxGgqAuGYvyP-1zTjOo',
 'id': 'UgxK5RGyzy_FAG4R_1d4AaABAg',
 'kind': 'youtube#commentThread',
 'snippet': {'canReply': True,
             'isPublic': True,
             'topLevelComment': {'etag': 'ze2S-AAQlPiB2p3bdV5qM7YNTm4',
                                 'id': 'UgxK5RGyzy_FAG4R_1d4AaABAg',
                                 'kind': 'youtube#comment',
                                 'snippet': {'authorChannelId': {'value': 'UCzY2ONyq7_ge_doVLjOdBJg'},
                                             'authorChannelUrl': 'http://www.youtube.com/channel/UCzY2ONyq7_ge_doVLjOdBJg',
                                             'authorDisplayName': 'Pete '
                                                                  'Romocki',
                                             'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTY4QYWXCxqmfXtPsy7rq3ofDicZUPTf3E9nbcv8A=s48-c-k-c0x00ffffff-no-rj',
                          

### 2.2 Missing values

The YouTube API doesn't return missing values (null / None) per se : the data is preprocessed to be searchable, e.g. *'viewerRating': 'none'* (a string value) for the first comment.

Elasticsearch doesn't index missing values, but since every field holds a value here, all the data is indexed and searching for missing values doesn't return an error. For that reason, we don't need to create a new index to map missing values.
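
As a hedged illustration of what searching for missing values can look like, the sketch below builds two query bodies: one that looks for documents where a field is absent (`must_not` + `exists`), and one that looks for the placeholder string instead. The helper names and the field path `snippet.topLevelComment.snippet.viewerRating` are assumptions made for this example.

```python
def missing_field_query(field):
    """Query body matching documents in which `field` has no indexed value."""
    return {"query": {"bool": {"must_not": {"exists": {"field": field}}}}}

def placeholder_value_query(field, placeholder="none"):
    """Query body matching documents where `field` holds the placeholder string."""
    return {"query": {"match": {field: placeholder}}}

rating_field = "snippet.topLevelComment.snippet.viewerRating"  # assumed field path
missing = missing_field_query(rating_field)
placeholder = placeholder_value_query(rating_field)
print(missing)
print(placeholder)
```

Both dictionaries can be posted to the `_count` or `_search` endpoint of the index; with the data described above, the first one is expected to match nothing since the field always holds a string.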

### 2.3 Add sentiment, emotion and topic attributes

The classification takes about X minutes. **You can skip step 2.3 and run 2.4 instead (load the preprocessed, ready-to-use dataset).**

- Sentiment : { positive, negative }
- Emotion : { sadness, joy, love, anger, fear, surprise }
- Toxicity : { toxicity, severeToxicity, obscene, threat, insult, identityAttack }
- Topic : {}
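
Once each comment carries a `label` for an attribute, the overall distribution can be checked offline with a few lines of Python. This is only a sketch: the helper name `label_distribution` and the sample data are made up for illustration, and the structure assumed is a dictionary of comments each holding `comment[attribute]['label']`.

```python
from collections import Counter

def label_distribution(comments, attribute):
    """Share of each label for a classified attribute such as 'sentiment' or 'emotion'."""
    counts = Counter(comment[attribute]["label"] for comment in comments.values())
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Made-up classified comments
sample = {
    0: {"sentiment": {"label": "POSITIVE", "score": "0.99"}},
    1: {"sentiment": {"label": "POSITIVE", "score": "0.87"}},
    2: {"sentiment": {"label": "NEGATIVE", "score": "0.95"}},
    3: {"sentiment": {"label": "POSITIVE", "score": "0.71"}},
}
distribution = label_distribution(sample, "sentiment")
print(distribution)
```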


In [None]:
import tensorflow
from transformers import pipeline
from detoxify import Detoxify

sentiment_classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english") 
emotion_classifier = pipeline("text-classification", model='bhadresh-savani/distilbert-base-uncased-emotion')
toxic_classifier = Detoxify('original')

def classify_comments(comments):
    for key in comments:
        comment = comments[key]
        comment_type = comment["kind"]
        message = ""
        if(comment_type == "youtube#comment"):
            message = comment['snippet']['textDisplay']
        else:
            message = comment['snippet']['topLevelComment']['snippet']['textDisplay']
        # truncate to 510 characters so that the tokenized input fits the models' 512-token limit (510 tokens + 2 special tokens)
        message = message[:510]
        sentiment = sentiment_classifier(message)[0]
        comment['sentiment'] = { }
        comment['sentiment']['label'] = sentiment['label']
        comment['sentiment']['score'] = str(sentiment['score'])

        emotion = emotion_classifier(message)[0]
        comment['emotion'] = { }
        comment['emotion']['label'] = emotion['label']
        comment['emotion']['score'] = str(emotion['score'])

        toxicity = toxic_classifier.predict(message)
        # ensure naming consistency
        toxicity['identityAttack'] = toxicity.pop("identity_attack")
        toxicity['severeToxicity'] = toxicity.pop("severe_toxicity")
        # float to string (to make them JSON serializable)
        toxicity['identityAttack'] = str(toxicity['identityAttack'])
        toxicity['severeToxicity'] = str(toxicity['severeToxicity'])
        toxicity['toxicity'] = str(toxicity.pop('toxicity'))
        toxicity['obscene'] = str(toxicity.pop('obscene'))
        toxicity['threat'] = str(toxicity.pop('threat'))
        toxicity['insult'] = str(toxicity.pop('insult'))
        comment['toxicity'] = toxicity

start = time.time()
classify_comments(comments)
end = time.time()
print("Elapsed time : ", end - start)
print("Sneak peek : ")
pprint.pprint(comments[0])

In [None]:
# Save the extended dataset in JSON file
save_data(VIDEO_ID + "_preprocessed", comments)

### 2.4 Load preprocessed data from saved JSON file (instead of 2.3)

In [None]:
comments = load_data(VIDEO_ID + "_preprocessed")
pprint.pprint(comments[0])

## 1.3 Flatten data (optional)

/!\ Executing the following block may result in a stack overflow (*RecursionError: maximum recursion depth exceeded while calling a Python object*), depending on how deeply the gathered comments are nested.

Therefore, **this step is OPTIONAL ! Its only purpose is to make queries faster while using our database instance.**
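
For reference, the same flattening can be done without recursion by walking the document with an explicit stack, which sidesteps the recursion limit altogether. This is a sketch under the assumption that documents only nest dictionaries (lists are kept as values); the function name `flatten` and the sample document are made up for the example.

```python
def flatten(document):
    """Flatten nested dictionaries into a single level with dotted keys, iteratively."""
    flat = {}
    stack = [("", document)]  # pairs of (key prefix, dictionary still to visit)
    while stack:
        prefix, node = stack.pop()
        for key, value in node.items():
            path = prefix + "." + key if prefix else key
            if isinstance(value, dict):
                stack.append((path, value))
            else:
                flat[path] = value
    return flat

nested = {
    "kind": "youtube#commentThread",
    "snippet": {"canReply": True, "topLevelComment": {"id": "abc"}},
}
flat = flatten(nested)
print(flat)
```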

In [None]:
# push the limit of the system - default : 1000
print("Current recursion limit : ", sys.getrecursionlimit())
sys.setrecursionlimit(5000)

def flatten_json(context, old_json, new_json):
    for key in old_json.keys():
        if isinstance(old_json[key], dict):
            if context:
                # recurse into the nested value (passing old_json itself never terminates)
                flatten_json(context + '.' + key, old_json[key], new_json)
            else:  # empty context
                flatten_json(key, old_json[key], new_json)
        else:
            if context:
                new_json[context + '.' + key] = old_json[key]
            else:  # empty context
                new_json[key] = old_json[key]

def flatten_file(json_data_file):
    new_jsons = []
    with open(json_data_file) as data_file:
        data = json.load(data_file)

        for item in data:
            new_json = dict()
            flatten_json('', data[item], new_json)
            new_jsons.append(new_json)

    return new_jsons

start = time.time()
flattened_comments = flatten_file("dataset/" + VIDEO_ID + '.json')
end = time.time()
comments = flattened_comments
print("Sneak peek : ")
pprint.pprint(comments[0])

## 1.4 Elasticsearch : Data loading and index creation

In [None]:
# Utils functions
import requests
headers = {'Content-Type': 'application/json'}

def search(index, query):
    response = requests.post('http://localhost:9200/'+index+'/_search?pretty', 
                        headers=headers,
                        json=query)
    return response
    
def count(index, query):
    response = requests.post('http://localhost:9200/'+index+'/_count', 
                        headers=headers,
                        json=query)
    return response

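# Create an index and load data into the database
# comments is a list after the optional flattening step, otherwise a dictionary whose values are the documents
documents = comments.values() if isinstance(comments, dict) else comments
lines = []
for idx, comment in enumerate(documents):
    metadata = {"index": {"_index": "comments", "_id": idx}}
    lines.append(metadata)
    lines.append(comment)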

payload = '\n'.join([json.dumps(line) for line in lines]) + '\n'
response = requests.put('http://localhost:9200/_bulk',
                        data=payload,
                        headers={'Content-Type': 'application/x-ndjson'})
print(response) # Expected : <Response [200]> 

# Check the index was created successfully
response = requests.get('http://localhost:9200/_cat/indices?v=true')
print(response.text)

# Dataset mapping for reference
response = requests.get('http://localhost:9200/comments')
print(json.dumps(response.json(), indent=4))

# 3. Data analysis + 4. Data summarisation


The dashboard helps the content creator answer the following questions :
- What are the overall sentiment and emotions ?
- What are people talking about ?
- Are there any inappropriate comments ? How intense are they ? Who is publishing those comments ?

**The Kibana Dashboard export can be found in the same folder as this notebook and imported into your Kibana instance.** Here is a preview :

- Overall sentiment : pie chart showing the distribution of positive, negative and neutral comments.
- Overall emotion : pie chart showing the distribution of emotions in the comments.
- Topics : bar chart counting the number of comments discussing a given topic. In the best case scenario they represent objective aspects of a piece of content a creator can work on improving.
- Haters list : list of users responsible for posting inappropriate comments.
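
The haters list can also be approximated outside Kibana. The sketch below counts, per author, the comments whose toxicity score reaches a threshold; the threshold of 0.5, the helper names and the sample data are assumptions for this example, while the field layout follows the classified comments (scores stored as strings, replies of kind `youtube#comment`).

```python
from collections import Counter

def author_of(comment):
    """Return the display name of the author of a comment thread or a reply."""
    snippet = comment["snippet"]
    if comment["kind"] != "youtube#comment":
        snippet = snippet["topLevelComment"]["snippet"]
    return snippet["authorDisplayName"]

def haters_list(comments, threshold=0.5):
    """Authors ranked by their number of comments with toxicity >= threshold."""
    counts = Counter(
        author_of(comment)
        for comment in comments.values()
        if float(comment["toxicity"]["toxicity"]) >= threshold
    )
    return counts.most_common()

# Made-up classified comments
sample = {
    0: {"kind": "youtube#commentThread",
        "snippet": {"topLevelComment": {"snippet": {"authorDisplayName": "alice"}}},
        "toxicity": {"toxicity": "0.91"}},
    1: {"kind": "youtube#comment",
        "snippet": {"authorDisplayName": "bob"},
        "toxicity": {"toxicity": "0.02"}},
    2: {"kind": "youtube#comment",
        "snippet": {"authorDisplayName": "alice"},
        "toxicity": {"toxicity": "0.77"}},
}
haters = haters_list(sample)
print(haters)
```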

Additional :
- Bag of words : a word cloud of the positive and negative comments.
- Sentiment/ emotion examples : a set of comments that best represent each sentiment or emotion.
- Average length of comments : in number of words. Is the audience very verbose ?
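
The average length metric is simple enough to sketch directly. The function name and the two sample messages are made up for illustration, and words are assumed to be separated by whitespace.

```python
def average_length_in_words(messages):
    """Average number of whitespace-separated words per message (0.0 for an empty list)."""
    if not messages:
        return 0.0
    return sum(len(message.split()) for message in messages) / len(messages)

messages = ["Great talk!", "I did not like the audio at all"]
average = average_length_in_words(messages)
print(average)
```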



In [None]:
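import gensim.downloader as api

info = api.info()  # show info about available models/datasets
model = api.load("glove-twitter-25")  # download the model and return as object ready for use
comment = "Hi Roberto and Darrel ! Awesome collab - Learnt human intentions - observing human behaviour with human patterns in mind, best thing is Human are highly predictable - Data driven decision making.  Thanking you."
# most_similar expects words from the model vocabulary, not a full sentence : keep the known tokens only
tokens = [word for word in re.findall(r"[a-z]+", comment.lower()) if word in model]
model.most_similar(positive=tokens)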


# Shut down

In [None]:
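# stop both containers started in the database setup (docker is a shell command, not a Python module)
!docker stop bda-kb bda-project-masioni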