# Table of Contents

1. [Introduction](#Introduction)
1. a. [Configuration](#Configuration)
1. b. [Import data](#Import-data)
2. [Questions](#Questions)
3. [Question #1: The producers](#Question-#1:-The-producers)
3. a [The producers: channel info](#Channel-info)
3. b [The producers: video info](#Video-info)
3. c [The producers: recommendations info](#Recommendations-info)
3. d [The producers: topics info](#Topics-info)
4. [Question #2: The users](#Question-#2:-The-users)
4. a [Dutch commenters on international channels](#Dutch-commenters-on-international-channels)
4. b [Commenters of specific Dutch channels](#Commenters-of-specific-Dutch-channels)


## Introduction

This notebook analyses information networks on YouTube and makes that analysis reproducible. I'll take you step by step through the data and analyses, trying to find angles for stories. You can use the Table of Contents to skip to the relevant parts.

### Configuration

First we do some configuration: import libraries and set the paths to the data. Throughout the notebook, Python 3.6 is used. I'll import all libraries at once.

In [None]:
import pandas as pd #basically the engine for the whole analysis. 
import matplotlib.pyplot as plt #for plotting our data.
import glob #a nice library for iterating through multiple files.
import networkx as nx #we need this to construct and export network graphs.
import seaborn as sns; sns.set() #for plotting
import comment_lib #some local modules
import csv #for reading and writing csv's when we are not using the pandas library.
import re

%matplotlib inline

In [None]:
# Set path to NL data - better to set these constants in a separate config file and import them here.

path = '/home/dim/Documents/projecten/extremisme/youtube/yt/YouTubeExtremism/DataCollection/output/NL/'

# Set path to control group data.

path_c = '/home/dim/Documents/projecten/extremisme/youtube/data/temp/bubble/right/NL/'

# Set path to international right data

path_i = '/home/dim/Documents/projecten/extremisme/youtube/data/temp/bubble/right/'

### Import data

Types of data are channels, videos, comments, recommendations and transcripts (for topics). The data are spread over multiple csv's so we have to append them first and create one dataframe for each type of data. We'll write the results to a csv file you can import later.
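The append-and-concatenate step is the same for every data type, so it could be wrapped in one small helper. The sketch below is only an illustration: `load_csvs` is a hypothetical name, not part of the notebook or of `comment_lib`, and the demo writes two tiny made-up files to a temporary directory instead of touching the real data.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csvs(pattern, parse_dates=None):
    """Read every csv matching `pattern` and return one combined dataframe."""
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError('no files match ' + pattern)
    frames = [pd.read_csv(f, header=0, parse_dates=parse_dates or False)
              for f in files]
    return pd.concat(frames, sort=True, ignore_index=True)


# Demo with made-up data in a temporary directory.
demo_dir = tempfile.mkdtemp()
part_1 = pd.DataFrame({'video_id': ['a', 'b'],
                       'video_published': ['2018-01-01', '2018-02-01']})
part_2 = pd.DataFrame({'video_id': ['c'],
                       'video_published': ['2017-05-01']})
part_1.to_csv(os.path.join(demo_dir, 'videos_nl_1.csv'), index=False)
part_2.to_csv(os.path.join(demo_dir, 'videos_nl_2.csv'), index=False)

demo_videos = load_csvs(os.path.join(demo_dir, 'videos_nl*.csv'),
                        parse_dates=['video_published'])
print(len(demo_videos))
```

Raising an error when no file matches is a deliberate choice in this sketch: a wrong path then fails immediately instead of producing an empty dataframe further down.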

In [None]:
# Import videofiles into one dataframe.
parse_dates = ['video_published']
filename = 'videos_nl*.csv'

all_files = glob.glob(path + filename)
frame = pd.DataFrame()
list_ = []
for file_ in all_files:
    df = pd.read_csv(file_,index_col=None, header=0, parse_dates=parse_dates)
    list_.append(df)
videos = pd.concat(list_, sort=True)

In [None]:
videos.to_csv(path + 'all_nl_videos.csv', index=None)
del videos

In [None]:
# Import comment files into one dataframe.

parse_dates = ['comment_time']
filename = 'comments_nl*.csv'

all_files = glob.glob(path + filename)
frame = pd.DataFrame()
list_ = []
for file_ in all_files:
    df = pd.read_csv(file_,index_col=None, header=0, parse_dates=parse_dates)
    list_.append(df)
comments = pd.concat(list_, sort=True)

In [None]:
comments.to_csv(path + 'all_nl_comments.csv', index=None)
del comments

In [None]:
# Import recommendations files into one dataframe.

parse_dates = ['publishedAt']
filename = 'recommendations*.csv'

all_files = glob.glob(path + filename)
frame = pd.DataFrame()
list_ = []
for file_ in all_files:
    df = pd.read_csv(file_,index_col=None, header=0, parse_dates=parse_dates)
    list_.append(df)
recommendations = pd.concat(list_, sort=True)

In [None]:
recommendations.to_csv(path + 'all_nl_recommendations.csv', index=None)
del recommendations

In [None]:
# Import transcripts files into one dataframe.

filename = 'transcripts*.csv'

all_files = glob.glob(path + filename)
frame = pd.DataFrame()
list_ = []
for file_ in all_files:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
transcripts = pd.concat(list_, sort=True)

In [None]:
transcripts.to_csv(path + 'all_nl_transcripts.csv', index=None)
del transcripts

### Load data from control group

We want to compare the results from the NL right information network with other networks. For instance, we want to compare the behavior of certain political parties (in the Netherlands Forum voor Democratie and the PVV) with centre and left wing parties. 

TODO: make a list of relevant control channels. At least PvdD, SP, DENK, PvdA, D66, GroenLinks, ChristenUnie, VVD, CDA. Other candidates: Zondag met Lubach, De Nieuwe Maan, ???

In [None]:
# Import channels

channels_control = pd.read_csv(path + 'channels_nl_controlgroup_politiek.csv')

# Import videos

videos_control = pd.read_csv(path + 'videos_nl_controlgroup_politiek.csv')

#import comments still TODO:

#import recommendations still TODO:

#import transcripts still TODO:

# Questions

So we're all set up. Before we dive in, what kind of questions do we want to answer? 

1. What kind of content is being watched by Dutch viewers? (The producers)
2. Who is commenting on the videos in the far right information network? How are commenters interacting? (The users)
3. How do political parties compare in terms of content, marketing strategies and reach? (Comparison and strategies)
4. How does the far right information network compare to other information networks (like far left and center)? (Whataboutism)
5. What content is harmful, hateful, or illegal, in other words, when are lines being crossed? (Morality, the Platform)


# Question #1: The producers

For this we need:
1. Statistics on videos, channels and recommendations.
2. Topics of videos (by tags or through topic modelling)

Let's start by looking at the channels.

### Channel info

Let's plot some channel data, like number of subscriptions and views over time. That will give us a sense of how certain channels are developing.

In [None]:
# Import the channel data into a dataframe.

channels = pd.read_csv(path + 'channels_nl_right.csv')

# Take a subset of the channel data.

stats = channels[['channel_title', 
                  'channel_description', 
                  'channel_subscribercount',
                  'channel_viewcount', 
                  'channel_videocount']]

stats = stats.sort_values(by='channel_subscribercount', ascending=False)
stats.set_index("channel_title",drop=True,inplace=True)

# Create matplotlib figure.

fig = plt.figure(figsize=(20,10)) 

# Create matplotlib axes.

ax = fig.add_subplot(111) 

# Create another axes that shares the same x-axis as ax.

ax2 = ax.twinx() 

# Set a width for a bar chart.

width = 0.4

# Configure the bar chart.

stats.channel_subscribercount.plot(kind='bar', color='red', ax=ax, width=width, position=1)
stats.channel_viewcount.plot(kind='bar', color='blue', ax=ax2, width=width, position=0, legend=True, grid=True)
ax.set_ylabel('subscribers')
ax2.set_ylabel('views')

plt.show()

Be careful: there are two y-axes. The left one shows subscribers, the right one shows views.

That said though, there are some takeaways and questions:
1. PVVpers shows 0 subscribers, which means the subscriber count is not shown on the channel page. They do have a lot of views: still more than Forum voor Democratie, but FvD is catching up, and the PVV channel is much older.
2. Some channels generate a lot of views, like Laurens, Rafiek de Bruin, Leukste YouTube Fragmenten, Deweycheatumhowe and the LvKrijger. Most of them are very pro FvD and pro PVV.
3. FvD has relatively many subscribers (they rank 2nd), but not that many views (relatively, they rank 4th). Did they buy subscribers? 
4. Why did Rossen remove all his videos? He was quite popular.
5. If we look at FvD more broadly and take affiliated channels into consideration, FvD is very big.
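
Takeaway 3 can be made more concrete by putting views and subscribers on the same scale. A rough sketch, using made-up numbers rather than the real channel data (the column names match the `channels` dataframe; `views_per_subscriber` is a name introduced here for illustration):

```python
import numpy as np
import pandas as pd

# Made-up numbers, only to show the calculation.
demo_stats = pd.DataFrame({
    'channel_title': ['channel A', 'channel B', 'channel C'],
    'channel_subscribercount': [1000, 0, 500],
    'channel_viewcount': [50000, 80000, 100000],
}).set_index('channel_title')

# Following takeaway 1, a subscriber count of 0 is treated as "not shown",
# so it becomes a missing value instead of a division by zero.
subscribers = demo_stats['channel_subscribercount'].replace(0, np.nan)
demo_stats['views_per_subscriber'] = demo_stats['channel_viewcount'] / subscribers

print(demo_stats.sort_values('views_per_subscriber'))
```

A channel with an unusually low ratio compared with its peers would be a candidate for a closer look. The ratio alone does not show that subscribers were bought.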

### Show channel development over time

Socialblade.com provides a range of statistics on YouTube channels, like daily views and subscription info. I've run the list of channels through [socialblade.com](https://www.socialblade.com). I want to get a sense of the growth of the far right network in recent years, maybe in a bubble flow chart. It would make a great comparison with other information networks on YouTube. We can use four axes for that:
- x = monthly_views
- y = monthly_subscriptions
- z = monthly_comments (z is size of the bubble)
- plus time

The only constraint is that the oldest data is from early 2015, so it's not that old.

I'll prepare the data for use in [gapminder](https://www.gapminder.org/tools/), an easy way to explore this kind of data.
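
The cells below pull `date,value` pairs out of one long string per channel. How that works is easier to see on a toy example. This is a sketch under assumptions: the raw string format is made up to match the regular expression used in the notebook, and it uses `Series.explode` (available from pandas 0.25) instead of the `apply(pd.Series).stack()` idiom.

```python
import pandas as pd

# Made-up input: one row per channel, all dates and values in a single string.
raw = pd.DataFrame({
    'User': ['chan_1', 'chan_2'],
    'Date_Daily_Views': ['[2018-01-01,100] [2018-01-02,150]', '[2018-01-01,7]'],
})

# Find every "yyyy-mm-dd,value" pair, then give each pair its own row.
pairs = raw.set_index('User')['Date_Daily_Views'] \
           .str.findall(r'\d{4}-\d{2}-\d+,\d+') \
           .explode() \
           .reset_index(name='pair')

# Split each pair into a date and a numeric value.
pairs[['date', 'views']] = pairs['pair'].str.split(',', n=1, expand=True)
pairs['date'] = pd.to_datetime(pairs['date'])
pairs['views'] = pairs['views'].astype(int)

# Monthly mean per channel, the level of detail used for the bubble chart.
monthly = pairs.groupby(['User', pairs['date'].dt.to_period('M')])['views'].mean()
print(monthly)
```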

In [None]:
# Import the data from socialblade

channel_history = pd.read_csv(path_i + 'other_platforms/social_blade_stats.csv')

In [None]:
# Extract all the dates and values of two columns: daily views and total subs

import re
pattern = re.compile(r'(\d{4}-\d{2}-\d+,\d+)')

# And create two new columns with lists of dates and values found

channel_history['daily_views'] = channel_history['Date_Daily_Views'].str.findall(pattern)
channel_history['daily_subs'] = channel_history['Date_Total_Subs'].str.findall(pattern)

# Stack them, so all the dates and values are linked to the channels and
# we are getting rid of the messy lists.

daily_views = channel_history.set_index('User') \
            .daily_views.apply(pd.Series) \
            .stack() \
            .reset_index(level=-1, drop=True) \
            .reset_index()

# Extract the values columns for views and subscriptions (subs)

daily_views[['date', 'views']] = daily_views[0].str.split(',', n=1, expand=True)
daily_views = daily_views[['User', 'date', 'views']]
daily_views = daily_views.rename(columns = {'User': 'channel_id'})

daily_subs = channel_history.set_index('User') \
            .daily_subs.apply(pd.Series) \
            .stack() \
            .reset_index(level=-1, drop=True) \
            .reset_index()

daily_subs[['date', 'subs']] = daily_subs[0].str.split(',', n=1, expand=True)
daily_subs = daily_subs[['User', 'date', 'subs']]
daily_subs = daily_subs.rename(columns = {'User': 'channel_id'})

# And bring it all together in a dataframe called daily_stats

daily_stats = pd.merge(daily_subs, daily_views,  how='left', left_on=['channel_id', 'date'], right_on = ['channel_id', 'date'])


In [None]:
# Now we need to add some data, first the channel data (like channel_title, etc.)

# Import the channel data

channels_int = pd.read_csv(path_i + 'channels_right.csv')

# And merge them with daily_stats

int_channel_daily_stats = pd.merge(daily_stats, channels_int, on='channel_id', how='left')

# Drop empty values

int_channel_daily_stats = int_channel_daily_stats.dropna()

In [None]:
# We need to get the average (mean) views and subs per year, month and year_month

# The date is not recognized as a date

int_channel_daily_stats['date'] = pd.to_datetime(int_channel_daily_stats['date'])

# Get year, month and year_month (yyyy-mm format)

int_channel_daily_stats['year'] = int_channel_daily_stats['date'].dt.year
int_channel_daily_stats['month'] = int_channel_daily_stats['date'].dt.month
int_channel_daily_stats['yearmonth'] = int_channel_daily_stats['date'].dt.to_period('M')

# The values of subs and views are not integers yet, which will get us into trouble later on

int_channel_daily_stats['subs'] = int_channel_daily_stats['subs'].astype('int')
int_channel_daily_stats['views'] = int_channel_daily_stats['views'].astype('int')

In [None]:
# Then it's time to get the comments and average out the comments per month
# (or should we sum them? Let's try both)

# Import the comments using an iterator (the comments file is 4.5GB)

columns = ['video_id', 
           'comment_id', 
           'comment_id2', 
           'author_display_name',
           'author_image',
           'author_channel_url',
           'author_channel_id',
           'comment_text',
           'number_of_replies',
           'comment_date'
          ]
cols_to_keep = ['video_id', 'comment_date']

comments_we_need = pd.concat([x.loc[:, cols_to_keep] for x in pd.read_csv(path_i + 'comments_right.csv', names=columns, chunksize=20000)])

In [None]:
# Add channel data to comments_we_need

videos = pd.read_csv(path_i + 'videos_right.csv', low_memory=False, index_col=None)
comments_channels_to_clean = pd.merge(comments_we_need, videos[['video_id', 'video_channel_title']], on='video_id').dropna()

# And make some room in memory

del videos
del comments_we_need

# Parse some dates.

comments_channels_to_clean['comment_date'] = pd.to_datetime(comments_channels_to_clean['comment_date'])
comments_channels_to_clean['year'] = comments_channels_to_clean['comment_date'].dt.year
comments_channels_to_clean['month'] = comments_channels_to_clean['comment_date'].dt.month
comments_channels_to_clean['yearmonth'] = comments_channels_to_clean['comment_date'].dt.to_period('M')

# And clean it up a bit.

comments_channels_to_clean = comments_channels_to_clean.rename(columns = {'video_channel_title': 'channel_title'})

In [None]:
# Prepare the data for merging - the code is still quite messy
# TODO: Clean it up a bit and make it more pythonic. Maybe write a function.

int_channel_daily_stats = int_channel_daily_stats[['channel_title', 
                                                   'subs', 'views', 
                                                   'yearmonth', 
                                                   'year', 
                                                   'month']]

comments_channels_to_clean = comments_channels_to_clean.groupby([comments_channels_to_clean.channel_title, 
                                                                 comments_channels_to_clean.yearmonth ]) \
                                                                .agg('count')

comments_channels_to_clean = comments_channels_to_clean \
                            .rename(columns = {'video_id':'comments'}) \
                            .reset_index()

comments_channels_to_clean = comments_channels_to_clean[['channel_title', 'yearmonth', 'comments']]

merged_comments = pd.merge(int_channel_daily_stats, 
                           comments_channels_to_clean, 
                           on=['channel_title', 'yearmonth'], 
                           how='left')

# channel_id is no longer available after the column selection above.
subset_for_graph = int_channel_daily_stats[['channel_title', 
                                            'yearmonth', 
                                            'subs', 
                                            'views']]

In [None]:
# And bring it all finally together.

df1 = pd.melt(merged_comments, id_vars=['channel_title', 
                                        'yearmonth', 
                                        'month', 
                                        'year'])

df2 = df1.groupby(['channel_title',
                   'yearmonth', 
                   'month', 
                   'year', 
                   'variable']) \
                    .mean() \
                    .unstack(['yearmonth'])

# Write it to csv for use in Gapminder

df2.to_csv(path + 'for_viz/forgapminder.csv')

### Video info

In [None]:
# Load videos.

videos = pd.read_csv(path + 'all_nl_videos.csv')

In [None]:
# Create a year column.

videos['video_upload_year'] = pd.DatetimeIndex(videos['video_published']).year

In [None]:
# Plot views and uploads per year.

uploads_per_year = videos.groupby(['video_upload_year']).size()
views_per_year = videos.groupby(['video_upload_year'])['video_view_count'].agg('sum')

fig = plt.figure(figsize=(10,5)) # Create matplotlib figure

width = 0.4

ax = uploads_per_year.plot(kind='bar', color='red', width=width, grid=True)
ax.set_ylabel('number of videos published')
ax.set_xlabel('year')

plt.show()

# First the uploads per year.

In [None]:
fig = plt.figure(figsize=(10,5))
width = 0.4

ax = views_per_year.plot(kind='bar', color='red', width=width, grid=True)
ax.set_ylabel('number of views')
ax.set_xlabel('year')

plt.show()

# Then the views per year.

Interesting:

1. In 2018 more videos were uploaded, but they have received significantly fewer views. It could be that older videos are still getting views. 

In [None]:
# Let's compare some channels.

channel1 = 'Forum Democratie' #fill in the channels you want to compare
channel2 = 'PVVpers'

filtered = videos.loc[(videos['video_channel_title'] == channel1) | \
                      (videos['video_channel_title'] == channel2)
                     ]

In [None]:
# First look at the number of published videos per year.

ax = filtered.groupby(['video_upload_year','video_channel_title'])['video_channel_title'] \
        .count().unstack(1).plot.bar(title="Number of uploaded videos", figsize=(10,5), grid=True)

ax.set_xlabel('year')
ax.set_ylabel('number of uploads')

plt.show()

In [None]:
# Now look at the viewcount per year.

ax = filtered.groupby(['video_upload_year', 'video_channel_title'])['video_view_count'] \
        .sum().unstack(1).plot.bar(title="Number of views per year", figsize=(10,5), grid=True, legend=True)

ax.set_xlabel('year')
ax.set_ylabel('number of views')

plt.show()

In [None]:
# And the comment count per year.

ax = filtered.groupby(['video_upload_year', 'video_channel_title'])['video_comment_count'] \
        .sum().unstack(1).plot.bar(title="Number of comments per year", figsize=(10,5), grid=True, legend=True)

ax.set_xlabel('year')
ax.set_ylabel('number of comments')

plt.show()

Some takeaways from the comparison of PVV and FvD:
1. FvD is winning on YouTube, by a large margin.
2. They are much more active in uploading content
3. That content reaches a larger audience. TODO: to be sure we need to look at the average views per video.
4. It's clear that there is much more debate, or at least more comments on FvD than on PVV.

In [None]:
# Looking at the mean of viewcount per video

ax = filtered.groupby(['video_upload_year', 'video_channel_title'])['video_view_count'] \
        .mean().unstack(1).plot.bar(title="Mean number of views per video, by upload year", figsize=(10,5), grid=True, legend=True)

ax.set_xlabel('year')
ax.set_ylabel('mean views per video')

plt.show()

I still want to take a good look at it, but it seems that the mean number of views per video is about the same. This could mean that FvD is simply uploading a lot more content that doesn't gather many views, while PVV is uploading less content, but what is uploaded is performing better. 

### Recommendations info

The channel data for the recommendations is missing, so we need to add them and merge them with the recommendations and the videos. While we are at it, let's use a simpler variable.

In [None]:
# Load recommendations.

recommendations = pd.read_csv(path + 'all_nl_recommendations.csv')

In [None]:
# Normalize field names (this will be fixed in future versions of the DataCollection library)

recommendations = recommendations.rename(columns={'channelId':'channel_id',
                                                  'description': 'target_channel_description',
                                                  'publishedAt': 'target_video_published',
                                                  'targetVideoId': 'target_video_id',
                                                  'title': 'target_video_title',
                                                  'videoId': 'source_video_id' })

video_channels = pd.merge(videos, channels, on='channel_id', how='left')

video_channels = video_channels.rename(columns={'channel_id': 'source_channel_id',
                                                'video_category_id': 'source_video_category_id',
                                                'video_channel_title': 'source_channel_title',
                                                'video_description': 'source_video_description',
                                                'video_id': 'source_video_id',
                                                'video_published': 'source_video_published',
                                                'video_tags': 'source_video_tags',
                                                'video_title': 'source_video_title',
                                                'video_view_count': 'source_video_viewcount',
                                                'channel_topic_ids': 'source_channel_topic_ids',
                                                'channel_subscribercount': 'source_channel_subscribercount'})

recs_chans = pd.read_csv(path + 'recs_chans.csv')
recs_channels_for_merge = pd.merge(recommendations, recs_chans, on='channel_id', how='left')

recs_channels_for_merge = recs_channels_for_merge.rename(columns={'channel_id': 'target_channel_id',
                                                                 'channel_title': 'target_channel_title',
                                                                 'channel_description': 'target_channel_description',
                                                                 'channel_viewcount': 'target_channel_viewcount',
                                                                 'channel_subscribercount': 'target_channel_subscribercount',
                                                                 'channel_topic_ids': 'target_channel_topic_ids'})

recs = pd.merge(recs_channels_for_merge, video_channels, on='source_video_id', how='left')

recs = recs.drop(['channel_country_x',
                  'channel_default_language_x',
                  'channel_uploads_x',
                  'channel_commentcount_x',
                  'channel_videocount_x',
                  'channel_topic_categories_x',
                  'channel_branding_keywords_x',
                  'video_comment_count',
                  'video_default_language',
                  'video_dislikes_count',
                  'video_duration',
                  'video_likes_count',
                  'video_upload_year',
                  'channel_title',
                  'channel_viewcount',
                  'channel_country_y',
                  'channel_commentcount_y',
                  'channel_uploads_y',
                  'channel_viewcount',
                  'channel_branding_keywords_y',
                  'channel_topic_categories_y',
                  'channel_videocount_y',
                  'video_topic_categories',
                  'video_topic_ids',
                  'channel_default_language_y',
                  'channel_description'
                 ], axis=1)

recs = recs.rename(columns={'source_video_title_y': 'source_video_title'})


cols = ['source_video_id',
        'source_video_title',
        'source_video_description',
        'source_video_published',
        'source_video_tags',
        'source_video_viewcount',
        'source_channel_id',
        'source_video_category_id',
        'source_channel_title',
        'source_channel_subscribercount',
        'source_channel_topic_ids',
        'target_video_id',
        'target_video_title',
        'target_channel_id',
        'target_channel_description',
        'target_video_published',
        'target_channel_title',
        'target_channel_viewcount']

recs = recs[cols]
         
         

In [None]:
# Let's look at a sample of the data.

recs.sample(5)

In [None]:
# How many videos and recommendations are in this set?
len(recs)

In [None]:
# A quick reminder of the channels.

recs.source_channel_title.unique()

In [None]:
# Pick a channel

chan = 'Erkenbrand Kanaal' #fill in a channel here

#and filter

filtered_recs = recs[recs['source_channel_title'] == chan]

In [None]:
# See the related channels of the videos and how often YouTube has assigned these related channels.

filtered_recs.target_channel_title.value_counts()

In [None]:
# Write to gexf file, for analysis in Gephi.

G = nx.from_pandas_edgelist(recs, source='source_channel_title', target='target_channel_title')
nx.write_gexf(G, path + 'nl_graphs/nl_recommendations.gexf' )
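
`from_pandas_edgelist` as used above builds an undirected graph in which repeated recommendations collapse into a single edge. If direction and frequency matter in Gephi, one option is to count the recommendations per channel pair first and store the count as an edge weight. A sketch with made-up channel names (the `weight` column is introduced here for illustration):

```python
import networkx as nx
import pandas as pd

# Made-up recommendations: one row per (source video, recommended video).
demo_recs = pd.DataFrame({
    'source_channel_title': ['A', 'A', 'A', 'B'],
    'target_channel_title': ['B', 'B', 'C', 'A'],
})

# Count how often each source channel leads to each target channel.
edges = demo_recs.groupby(['source_channel_title', 'target_channel_title']) \
                 .size() \
                 .reset_index(name='weight')

# Build a directed graph that carries the counts as edge weights.
D = nx.from_pandas_edgelist(edges,
                            source='source_channel_title',
                            target='target_channel_title',
                            edge_attr='weight',
                            create_using=nx.DiGraph())

print(D['A']['B']['weight'])
```

Depending on the networkx version, the weights may need converting to plain Python integers before `nx.write_gexf` accepts them.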

In [None]:
# Select a video from the selected channel.

vid = 'Conference interview with Millennial Woes [2018 ENGLISH]' #change this to another video title

filtered_rec_vids = filtered_recs[filtered_recs['source_video_title'] == vid]

In [None]:
# You can pick another video from this list of videos of the selected channel.

filtered_recs.source_video_title.unique()

In [None]:
# YouTube thinks that these videos are related to the selected videos.

filtered_rec_vids.target_video_title.tolist()

### Topics info

This still needs some work. The tags are malformed, and I'm not so sure about the quality of the transcripts. I would say this doesn't have a high priority, so I'll leave it for later and focus on the users first.

In [None]:
# Extract tags, first link tags to videos and clean them up a bit

vidtags = videos[['video_id', 'video_title', 'video_tags']]

vidtags = vidtags.video_tags.str.split(', | #', expand=True)\
    .merge(vidtags, left_index = True, right_index = True)\
    .drop(['video_tags'], axis=1)\
    .melt(id_vars = ['video_id'], value_name = "tags") \
    .dropna() \
    .drop(['variable'], axis=1)

vidtags['tags'] = vidtags['tags'].replace(to_replace=r"'|\[|\]|#|\"", value='', regex=True)

vidtags.tags = vidtags.tags.str.lower()

vidtags = vidtags[vidtags.tags != 'not set']


In [None]:
# Look for certain tags

vidtags = vidtags[vidtags['tags'].str.contains("rassenhaat")]
vidtags.tags.unique()


In [None]:
# Then get the video data with these tags.

vidtags = pd.merge(vidtags, videos, on='video_id', how='left')
vidtags[['video_id', 'tags', 'video_channel_title']]

See? Something is wrong here. I get a description in the tags, so this still needs some work.
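
One possible cause, offered as a guess rather than a verified diagnosis: in the tag-extraction cell only `video_id` is passed as `id_vars`, so `melt` treats every other column, including `video_title`, as a value and stacks it into `tags`. The toy example below (made-up data, and `extract_tags` is a helper written for this illustration) shows the effect and what changes when `video_title` is kept as an identifier as well.

```python
import pandas as pd

demo = pd.DataFrame({
    'video_id': ['v1', 'v2'],
    'video_title': ['First title', 'Second title'],
    'video_tags': ['politiek, nieuws', 'debat'],
})


def extract_tags(df, id_vars):
    """Split the tag string into columns and melt them into one tag per row."""
    split = df['video_tags'].str.split(', ', expand=True).add_prefix('tag_')
    return split.merge(df, left_index=True, right_index=True) \
                .drop(['video_tags'], axis=1) \
                .melt(id_vars=id_vars, value_name='tags') \
                .dropna() \
                .drop(['variable'], axis=1)


# Only video_id is an identifier here, so the titles leak into the tags.
leaky = extract_tags(demo, ['video_id'])

# With video_title as an identifier too, only real tags remain.
clean = extract_tags(demo, ['video_id', 'video_title'])

print(sorted(leaky['tags']))
print(sorted(clean['tags']))
```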

# Question #2: The users

Who is commenting on the videos in the far right information network? How are commenters interacting? (The users)

In [None]:
# Load comment data

comments = pd.read_csv(path + 'all_nl_comments.csv')

In [None]:
# How many comments do we have?

len(comments)

### Finding the hardcore commenters in the Dutch network

First I'm interested in some statistics to get to the hardcore commenters

In [None]:
comments.columns

In [None]:
#number of unique author names

comments.author_display_name.nunique()

In [None]:
#number of unique author id's

comments.author_channel_id.nunique()

So we have to be a bit careful: there are more unique IDs than names, which is to be expected, because different accounts can share the same display name.
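
To see how much the two counts diverge in practice, one could list the display names that are used by more than one channel ID. A small sketch on made-up data, using the same column names as the comment file:

```python
import pandas as pd

demo_comments = pd.DataFrame({
    'author_display_name': ['Jan', 'Jan', 'Piet', 'Jan'],
    'author_channel_id': ['id1', 'id2', 'id3', 'id1'],
})

# Number of distinct channel IDs behind each display name.
ids_per_name = demo_comments.groupby('author_display_name')['author_channel_id'].nunique()

# Names that cannot be used to identify a single account.
shared_names = ids_per_name[ids_per_name > 1]
print(shared_names)
```

For anything that counts or follows individual commenters, grouping on `author_channel_id` is the safer choice.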

Let's start with adding more information to the comment data, so we can select and filter some channels. We can do this by adding the video data to the comment data.

In [None]:
nl_comment_sphere = pd.merge(comments, videos, on='video_id', how='left')

In [None]:
# Check if the merge was successful.

len(nl_comment_sphere)
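
With a left merge the number of rows alone says little, because comments without a matching video are kept and simply get empty video columns. One way to check this, sketched on made-up data, is the `indicator` option of `pd.merge`:

```python
import pandas as pd

demo_comments = pd.DataFrame({'video_id': ['v1', 'v1', 'v2', 'v9'],
                              'comment_text': ['a', 'b', 'c', 'd']})
demo_videos = pd.DataFrame({'video_id': ['v1', 'v2'],
                            'video_channel_title': ['channel A', 'channel B']})

# indicator=True adds a _merge column that tells where each row came from.
checked = pd.merge(demo_comments, demo_videos, on='video_id',
                   how='left', indicator=True)

# Comments whose video is missing from the video file.
unmatched = checked[checked['_merge'] == 'left_only']
print(len(unmatched), 'of', len(checked), 'comments have no video data')
```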

In [None]:
# What are the available channels?

nl_comment_sphere.video_channel_title.value_counts()

Some observations:
* Erkenbrand is missing. The channel doesn't have a lot of comments, but it has some. We'll need to add Erkenbrand, because it can be important. There are also some other channels I would like to add, like Nederlands Falen, Linkse Moskee and some others.
* For the purposes of our research, I'm going to filter out a couple of channels that are run by Dutch people or from the Netherlands, but are not perceived as such, like Al Stankard and Voice of Europe (which merits its own investigation), and Matthew & Doris, which contains a lot of non-political videos.
* We should establish which channels belong to FvD and analyse them together as a separate cluster. 

So let's build some filters first. This code can also be used if we are going to investigate the Dutch commenters in the international network.

In [None]:
# What channels do you want to remove from the comment file?

to_remove = ['Voice of Europe', 'Matthew & Doris', 'Al Stankard aka HAarlem VEnison']

nl_comment_sphere = nl_comment_sphere[~nl_comment_sphere.video_channel_title.isin(to_remove)]

In [None]:
# Okay, we're set. Let's look at the prolific commenters first. 
# Who is commenting a lot in this network in general? 

topcommenters = nl_comment_sphere.author_display_name.value_counts()
topcommenters = topcommenters[0:26]

fig = plt.figure(figsize=(20,10)) # Create matplotlib figure

width = 0.4

ax = topcommenters.plot(kind='bar', color='red', width=width, grid=True)
ax.set_ylabel('number of comments')
ax.set_xlabel('name')
plt.xticks(rotation=45)

plt.show()

So there are some people who have commented more than 250 times in this dataset.

Some observations:
* There are some channels in there that seem to actively debate with their viewers. I think it's interesting to have a look at the top two, but especially Paul Nielsen for he is affiliated with Forum voor Democratie.
* groene hond sounds familiar. I wouldn't be surprised if this is the same person as 'botte hond', or 'zilte hond', a notorious social media figure.
* The names certainly don't point to real world identities. Yet.

In [None]:
# Next up: a small group seems responsible for many comments. 
# Let's plot a Lorenz curve, the basis of the Gini coefficient, to see if that's true.

commenter_groups = nl_comment_sphere.groupby('author_channel_id') #we need these groups later.

num_comments = pd.DataFrame(commenter_groups.size().sort_values(ascending = True), columns = ['count'])
num_comments['Cumulative percentage of comments'] = 100*num_comments['count'].cumsum()/max(num_comments['count'].cumsum())
num_comments['Commenter percentile'] = num_comments.reset_index().index/max(num_comments.reset_index().index)

sns.lineplot(x=num_comments['Commenter percentile'],y=num_comments['Cumulative percentage of comments'])

del num_comments

Indeed, about 75 percent of the comments are placed by 20 percent of the commenters. And about 50 percent of the comments by about 5 percent of the commenters.
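
The plot above is a Lorenz curve. The Gini coefficient summarises it in one number between 0 (everyone comments equally often) and close to 1 (one commenter writes everything). A minimal sketch of the calculation, assuming a plain array of comments-per-commenter counts; `gini` is a helper introduced here, not part of the notebook:

```python
import numpy as np


def gini(counts):
    """Gini coefficient of an array of non-negative counts."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # Standard formula for values sorted in ascending order.
    return 2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n


print(gini([5, 5, 5, 5]))   # everyone comments equally often
print(gini([0, 0, 0, 10]))  # one commenter writes everything
```

On the real data this could be called as `gini(commenter_groups.size())`.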

In [None]:
# Some are commenting a lot on their own channel (like Paul Nielsen). 
# Who is commenting all over the place?

# Sort by number of channels, so the x-axis runs from 1 channel upwards.
prolific_commenters = commenter_groups['video_channel_title'].nunique().value_counts().sort_index()

fig = plt.figure(figsize=(20,10)) # Create matplotlib figure

width = 0.4

ax = prolific_commenters.plot(kind='bar', color='red', width=width, grid=True)
ax.set_ylabel('number of commenters')
ax.set_xlabel('number of channels')

plt.show()

By far the most commenters (35,000 plus) only comment on one channel. About 5,000 comment on two channels. But we're not interested in these commenters. We want to dive into the tail of this graph, so let's start looking for commenters who comment on five or more channels.

In [None]:
prolific_commenters = prolific_commenters[4:]

fig = plt.figure(figsize=(20,10)) # Create matplotlib figure

width = 0.4

ax = prolific_commenters.plot(kind='bar', color='red', width=width, grid=True)
ax.set_ylabel('number of commenters')
ax.set_xlabel('number of channels')

plt.show()

Let's get the names of the most prolific commenters.

In [None]:
# Fill in a threshold for the number of different channels someone has commented on.

threshold = 10

prolific_commenters = nl_comment_sphere.groupby('author_channel_id') \
                    .filter(lambda x: ((x.video_channel_title.nunique() >= threshold) ))

In [None]:
# Plot the most prolific commenters with the number of comments. 

prolific_commenters_to_plot = prolific_commenters.author_display_name.value_counts()
prolific_commenters_to_plot = prolific_commenters_to_plot[0:20]

fig = plt.figure(figsize=(20,10)) # Create matplotlib figure

width = 0.4

ax = prolific_commenters_to_plot.plot(kind='bar', color='red', width=width, grid=True)
ax.set_ylabel('number of comments')
ax.set_xlabel('name')
plt.xticks(rotation=45)

plt.show()

There is a large overlap between the people who comment a lot and people who comment all over the place. The channels (like politiekincorrecttv and paul nielsen) are gone. If you want to look at the radical core of the Dutch YouTube information network, here it is. Let's explore some of them. 

### Zooming in on a couple of persons of interest in the Dutch network

In [None]:
# Let's start with Peter Chess (would his real name be Peter Schaak?)

peter = nl_comment_sphere[nl_comment_sphere['author_display_name'] == 'Peter Chess']

It would be interesting to plot the number of comments per channel in a stacked chart. The x-axis is the year of upload, each layer of the stack is a channel, and the height of a layer is the number of comments on that channel.

In [None]:
p = peter.groupby(['video_upload_year','video_channel_title']).size().unstack()


In [None]:
p.plot.area(figsize=(20,10))


This is still hard to read, because most of the commenting is after 2012, 2013. So let's start a bit later.

In [None]:
p = p[p.index > 2012] #set the date from where you want the comments.

In [None]:
p.plot.area(figsize=(20,10))

Still far from perfect, but it will do for now.

### Dutch commenters on international channels

I'm interested in exploring how these (mostly) Dutch users are represented in the larger international far-right channel network. So I'll make a list of unique IDs and run it through the larger corpus.

In [None]:
users_to_check = nl_comment_sphere.author_channel_id.unique().tolist()

In [None]:

columns = ['video_id', 
           'comment_id', 
           'comment_id2', 
           'author_display_name',
           'author_image',
           'author_channel_url',
           'author_channel_id',
           'comment_text',
           'number_of_replies',
           'comment_date'
          ]


iter_csv = pd.read_csv(path_i + 'comments_right.csv', iterator=True, chunksize=100000, names=columns)
nl_int_comment_sphere = pd.concat([chunk[chunk['author_channel_id'].isin(users_to_check)] for chunk in iter_csv])
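
The cell above reads the large comments file in chunks and keeps only the matching rows of each chunk, so the full file never has to fit in memory. A minimal sketch of the same pattern on an in-memory stand-in (the two-column CSV and the IDs are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical miniature of comments_right.csv (no header row, two columns only).
raw = io.StringIO("v1,u1\nv2,u2\nv3,u1\nv4,u3\nv5,u2\n")
wanted = {'u1', 'u3'}  # a set instead of a list of IDs

# chunksize makes read_csv return an iterator of small DataFrames.
chunks = pd.read_csv(raw, names=['video_id', 'author_channel_id'], chunksize=2)

# Filter each chunk before concatenating the survivors.
matches = pd.concat([chunk[chunk['author_channel_id'].isin(wanted)]
                     for chunk in chunks])
print(matches)
```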


In [None]:
# Merge comments with video and channel data

videos_all = pd.read_csv(path_i + 'videos_right.csv', low_memory=False)

nl_int_comment_sphere = pd.merge(nl_int_comment_sphere, videos_all, on='video_id', how='left')

In [None]:
# Add the year of each comment, so we can group by year later on.

nl_int_comment_sphere['year'] = pd.DatetimeIndex(nl_int_comment_sphere['comment_date']).year

In [None]:
# So this is where people who comment on Dutch channels are commenting in our far right network

popular_channels_for_dutch = nl_int_comment_sphere.video_channel_title.value_counts()
popular_channels_for_dutch = popular_channels_for_dutch[0:20]

fig, ax = plt.subplots(figsize=(20,10)) # Create matplotlib figure and axes

width = 0.4

popular_channels_for_dutch.plot(kind='bar', color='red', width=width, grid=True, ax=ax)
ax.set_ylabel('number of comments')
ax.set_xlabel('channels')
plt.xticks(rotation=45)

plt.show()

Some observations:
* Pat Condell is really popular with Dutch commenters.
* Rebel Media is interesting. I didn't know it was that popular.
* Millennial Woes scores pretty high as well.

Let's gather some more stats on the group.

In [None]:
commenter_groups = nl_int_comment_sphere.groupby('author_channel_id')

In [None]:
# Let's do another gini analysis (the results are probably the same)

num_comments = pd.DataFrame(commenter_groups.size().sort_values(ascending = True), columns = ['count'])
num_comments['Cumulative percentage of comments'] = 100*num_comments['count'].cumsum()/max(num_comments['count'].cumsum())
num_comments['Commenter percentile'] = num_comments.reset_index().index/max(num_comments.reset_index().index)

sns.lineplot(x=num_comments['Commenter percentile'],y=num_comments['Cumulative percentage of comments'])

del num_comments

This graph shows that the international Dutch comment sphere is a little bit more elitist than the Dutch one, which shouldn't surprise us.
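
To put a number on "elitist", one option is the Gini coefficient of the comments per commenter: 0 means everybody comments equally often, and values close to 1 mean that a few accounts write almost everything. The notebook does not compute this itself. The function below is a minimal sketch, and the example inputs are invented; in the notebook you would pass something like `commenter_groups.size()`.

```python
import numpy as np

def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly equal)."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    # Standard formula on ascending-sorted values:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n

# Hypothetical examples: four commenters with equal and with very unequal activity.
equal = gini([5, 5, 5, 5])
skewed = gini([1, 1, 1, 97])
print(equal, skewed)
```

Computing the coefficient for both the Dutch and the international comment sphere would give two numbers to compare instead of two curves.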

Take a look at the number of commenters commenting on n channels.

In [None]:
prolific_commenters = commenter_groups['video_channel_title'].nunique().value_counts()
prolific_commenters = prolific_commenters[0:20]

fig, ax = plt.subplots(figsize=(20,10)) # Create matplotlib figure and axes

width = 0.4

prolific_commenters.plot(kind='bar', color='red', width=width, grid=True, ax=ax)
ax.set_ylabel('number of commenters')
ax.set_xlabel('number of channels')

plt.show()

It seems a little bit more evenly distributed. Let's look at the tail of the graph, so the really prolific commenters.

In [None]:
prolific_commenters = commenter_groups['video_channel_title'].nunique().value_counts()
prolific_commenters = prolific_commenters[40:]

fig, ax = plt.subplots(figsize=(20,10)) # Create matplotlib figure and axes

width = 0.4

prolific_commenters.plot(kind='bar', color='red', width=width, grid=True, ax=ax)
ax.set_ylabel('number of commenters')
ax.set_xlabel('number of channels')

plt.show()

These are the people that are really, really prolific, commenting on 39 channels and more. Who are these people? 

Some observations:

1. These are not only Dutch people. No problem. That is to be expected, because people from abroad can comment on Dutch channels as well.
2. There are some interesting people, I think. For starters: carolineleiden. Might that be Caroline Dauphine from JFvD? Who knows. Identity Europa is a Dutch guy, I think, connected to ID Verzet. There are a couple of Dutch guys calling themselves Pinochet, I think, closely related to Erkenbrand and /polder/.

In [None]:
# Set the range of different channels someone has commented on (threshold2 = minimum, threshold = maximum).

threshold = 19
threshold2 = 10

prolific_commenters = nl_int_comment_sphere.groupby('author_channel_id') \
                    .filter(lambda x: ((x.video_channel_title.nunique() <= threshold) & (x.video_channel_title.nunique() >= threshold2)))

Now get all the Dutch-sounding names from the group we filtered out above: everybody who has commented on Dutch channels and on 10 or more channels in the international far-right network. Add to that everybody who has commented on 10 or more channels, and more than 150 times in total, in the Dutch network. With a lower cut-off we get a lot of data, and we want the frequent commenters anyway. 150 seems a nice cut-off point, but you can set the bar lower.
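
The second selection (a minimum number of channels plus a minimum number of comments) is not shown as code in this notebook. A minimal sketch of how it could be done, on invented data and with small thresholds so the result is easy to check (the frame `comments` and the variables `min_channels` and `min_comments` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the Dutch comment data.
comments = pd.DataFrame({
    'author_channel_id': ['a'] * 5 + ['b'] * 2 + ['c'] * 4,
    'video_channel_title': ['x', 'y', 'z', 'x', 'y',   # a: 3 channels, 5 comments
                            'x', 'y',                  # b: 2 channels, 2 comments
                            'x', 'x', 'x', 'x'],       # c: 1 channel, 4 comments
})
min_channels = 2   # the text above uses 10
min_comments = 4   # the text above uses 150

# Per commenter: number of distinct channels and total number of comments.
stats = (comments.groupby('author_channel_id')['video_channel_title']
                 .agg(['nunique', 'count']))

# Keep commenters that pass both thresholds.
selected = stats[(stats['nunique'] >= min_channels) &
                 (stats['count'] >= min_comments)].index.tolist()
print(selected)
```

Only the commenter who is both widely spread and frequent passes; the name lists below were still picked by hand from such a selection.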

In [None]:
# Some people with Dutch-sounding names
# who have commented on 10 or more channels in the international network

nl_of_interest_int = ['A Stuijt',
                    'Adrie Van Dijk',
                    'Akka Fietje',
                    'Wouter Lensvelt',
                    'Willem Sterk',
                    'Michael Groenendijk',
                    'Milo Overzicht',
                    'Mike De Jong',
                    'Mike Brink',
                    'Nellie Rutten',
                    'Paul van Dijck',
                    'Peter Jongsma',
                    'Piet Hein',
                    'Pieter van der Meer',
                    'Polder Cannabis Olie team',
                    'Politiekman',
                    'Raymond Doetjes',
                    'Willem Pasterkamp',
                    'Wimpiethe3',
                    'Willie van het Kerkhof',
                    'Vincent Vermeer',
                    'Mark Tak',
                    'Melvin Jansen',
                    'Mark Kamphuis',
                    'Tristan van Oosten',
                    'Tom dGe-lugs-pa',
                    'Tom Van de Pol',
                    'Tom Van Gool',
                    'Marcel Bruinsma',
                    'Maarten van der Poel',
                    'Maciano Van der Laan',
                    'Tiemen Weistra',
                    'TheRdamterror',
                    'TheCitroenman1',
                    'The flying dutchman',
                    'Teun de Heer',
                    'Stijn van de Ven',
                    'Sjaak v Koten',
                    'Sev Vermeer',
                    'Tanya De Beer',
                    'Tim Pietersen',
                    'Alan Holland',
                    'Bennie Leip',
                    'Bert Prins',
                    'Bestheftig',
                    'Borisje Boef',
                    'Chris Van Bekkum',
                    'Coen Bijpost',
                    'Cornelis van der Heijden',
                    'David Teunissen',
                    'David Van der Tweel',
                    'De Veelvraat',
                    'Dennis Bouma',
                    'Dennis Eijs',
                    'Donald gekkehenkie',
                    'peter van',
                    'onbekende telefoon',
                    'nick van achthoven',
                    'mikedehoogh black flag race photos',
                    'kristof verbruggen',
                    'jan holdijk',
                    'jan Yup',
                    'iwan munnikes',
                    'hans van de mortel',
                    'geroestetumor',
                    'geheimschriver',
                    'gaatje niksaan',
                    'dutchmountainsnake',
                    'dutch menneer',
                    'donder bliksem',
                    'boereriem',
                    'appie D',
                    'adam willems',
                    'Yuri Klaver',
                    'zuigdoos',
                    'yvonneforsmanatyahoo',
                    'vanhetgoor',
                    'theflyingdutchboi',
                    'r juttemeijer',
                    'rutger houtdijk',
                    'Dutch Patriot',
                    'Dutch Whitey',
                    'DutchFurnace',
                    'Esias Lubbe',
                    'Ewalds Eiland',
                    'Joey Kuijs',
                    'Faust',
                    'Hollandia777',
                    'Johan van Oldenbarnevelt',
                    'Keescanadees',
                    'Geert Kok',
                    'Haasenpad',
                    'Henk Damster',
                    'Henk van der Laak',
                    'Henri Zwols',
                    'Haat Praat',
                    'Gerard Mulder',    
                    'Grootmeester Jan',
                    'H. v. Heeswijk',
                    'B. Hagen',
                    '1234Daan4321',
                    'Daniella Thoelen', 
                    'Diederik',
                    'Linda Bostoen', 
                    'Christiaan Baron', 
                    'Matthijs van Guilder',
                    'Johannes Roose',
                    'Deon Van der Westhuizen', 
                    'Remko Jerphanion', 
                    'Roosje Keizer',
                    'Dennis Durkop',
                    'ivar olsen',
                    'Pete de pad',
                    'georgio jansen',
                    'Joel Peter',
                    'Antonie de Vry',
                    'Stijn Voorhoeve', 
                    'liefhebber179',
                    'Walter Taljaard',
                    'joe van gogh',
                    'Edo Peter', 
                    'Ad Lockhorst',
                    'kay hoorn',
                    'Erik Bottema',
                    'Deplorable Data',
                    'JESSEverything',
                    'Harry Balzak', 
                    'Bokkepruiker Records',
                    'zonnekat',
                    'Peter-john De Jong',
                    'marco mac',
                    'Joubert x',
                    'Natasja van Dijk',
                    'Voornaam Achternaam',
                    'hermanPla', 
                    'M. van der Scheer',
                    'gerald polyak',
                    'Robbie Retro',
                    'Johannes DeMoravia',
                    'Wouter Vos',
                    'AwoudeX',
                    'carolineleiden',
                    'A-dutch-Z',
                    'piet ikke',
                    'kutbleat',
                    'David of Yorkshire',
                    'Gert Tjildsen',
                    'Flying Dutchman',
                    'Visko Van Der Merwe',
                    'Blobbejaan Blob',
                    'TheBergbok',
                    'jknochel76',
                    'Olleke Bolleke'
                    ]

# And top n from nl_most prolific commenters (on 10 Dutch channels or more, with 150 comments or more)

nl_of_interest_nl = ['Nayako Sadashi',
                    'demarcation',
                    'er zaal',
                    'jhon jansen',
                    '-____-',
                    'Brummie Brink',
                    'reindeerkid ',
                    'Pagan Cloak',
                    'NDY',
                    'Karel de Kale',
                    'top top',
                    'Chris Veenendaal ',
                    'MijnheerlijkeBuitenlandse befkut ,',
                    'Kevin Zilverberg',
                    'Rick Dekker ',
                    'Adrie Van Dijk ',
                    'miep miep',
                    'pronto ',
                    'TheUnTrustable0',
                    'danny schaap',
                    'Mark Mathieu',
                    'Raysboss302',
                    'Ruud Hooreman',
                    'Willie W',
                    'Barend Borrelworst',
                    'theo breytenbach',
                    'coinmaster1000 coinmaster1000'
                    ]

In [None]:
# Just to be sure, I'm going to run them again through the international network. 
# You can skip this step if you want.

columns = ['video_id', 
           'comment_id', 
           'comment_id2', 
           'author_display_name',
           'author_image',
           'author_channel_url',
           'author_channel_id',
           'comment_text',
           'number_of_replies',
           'comment_date'
          ]


iter_csv = pd.read_csv(path_i + 'comments_right.csv', iterator=True, chunksize=100000, names=columns)
nl_commenters_of_interest = pd.concat([chunk[chunk['author_display_name'].isin(nl_of_interest_int)] for chunk in iter_csv])


In [None]:
# These people could be the focus of our investigation.

nl_commenters_of_interest = pd.merge(nl_commenters_of_interest, videos_all, on='video_id', how='left')

nl_commenters_of_interest.to_csv(path + 'nl_commenters_of_interest.csv')

In [None]:
len(nl_commenters_of_interest)

Let's explore a couple of them in detail, especially their journey on YouTube.

In [None]:
caroline = nl_commenters_of_interest[nl_commenters_of_interest['author_display_name'] == 'carolineleiden']

And let's try an area chart.

In [None]:
c = caroline.groupby(['year','video_channel_title']).size().unstack()
c.plot.area(figsize=(20,10))

In [None]:
c = c[c.index > float(2016)]

In [None]:
c.plot.area(figsize=(20,10))

It's probably better to export the data and have a look at them in [RAWGraphs](http://app.rawgraphs.io/), for instance as a bump chart.

In [None]:
c = caroline.groupby(['year','video_channel_title']).size()

c.to_csv(path + 'for_viz/caronlineleiden.csv')
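
External tools such as RAWGraphs generally expect a flat table with a header row. Depending on the pandas version, writing a grouped Series straight to CSV may leave the count column without a name. A minimal sketch of a more explicit export, on invented data (the column name `comments` is my own choice, and the `StringIO` buffer stands in for a file path):

```python
import io
import pandas as pd

# Hypothetical miniature of one commenter's comments.
toy = pd.DataFrame({
    'year': [2017, 2017, 2018],
    'video_channel_title': ['channel A', 'channel A', 'channel B'],
})

# reset_index(name=...) turns the grouped counts into a flat, named table.
long_format = (toy.groupby(['year', 'video_channel_title'])
                  .size()
                  .reset_index(name='comments'))

buffer = io.StringIO()  # stands in for a file path
long_format.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```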

In [None]:
benny = nl_int_comment_sphere[nl_int_comment_sphere['author_display_name'] == 'Bennnnny1987']

In [None]:
b = benny.groupby(['year','video_channel_title']).size()

b.to_csv(path + 'for_viz/benny.csv')

### TODO: Follow commenter journeys

Okay, let's try something more difficult: follow the commenters' journeys and put that data into a graph. I'll take the prolific commenters as a starting point (start small, because this is very memory intensive).

UP TO 2.3 THIS IS STILL A BUNCH OF EXPERIMENTS AND JUNK, SO SKIP THIS FOR NOW.

In [None]:
# let's zoom in on some channels

channel = 'Millennial Woes' #enter the channel name

comments_of_interest = int_vid_comments[int_vid_comments['video_channel_title'] == channel]

comments_of_interest

My impression is that the data are skewed because of Voice of Europe. We can collect more specific data if we want. Let's select one or more Dutch channels first.

## Question #3: Comparison

Make some comparisons with other information networks, starting with political parties.

In [None]:
# Load the data

channels_control = pd.read_csv(path + 'channels_nl_controlgroup_politiek.csv')
videos_control = pd.read_csv(path + 'videos_nl_controlgroup_politiek.csv')

In [None]:
# Get channel and video data from PVV and FvD

channel1 = 'Forum Democratie' #fill in the channels you want to compare
channel2 = 'PVVpers'

pvvfvd_vids = videos.loc[(videos['video_channel_title'] == channel1) | \
                      (videos['video_channel_title'] == channel2)
                     ]

In [None]:
pvvfvd_channels = channels.loc[(channels['channel_title'] == channel1) | \
                      (channels['channel_title'] == channel2)
                     ]

In [None]:
compare_channels = pd.concat([channels_control, pvvfvd_channels]) # DataFrame.append was removed in pandas 2.0

In [None]:
compare_vids = pd.concat([videos_control, pvvfvd_vids], sort=True) # DataFrame.append was removed in pandas 2.0

In [None]:
compare_vids['video_upload_year'] = pd.DatetimeIndex(compare_vids['video_published']).year

In [None]:
# Time to plot some stuff

ax = compare_vids.groupby(['video_upload_year','video_channel_title'])['video_channel_title'] \
        .count().unstack(1).plot.line(title="Number of uploaded videos per party", figsize=(20,10), grid=True)

ax.set_xlabel('year')
ax.set_ylabel('number of uploads')

plt.show()

# Show the number of uploads per year. I chose a line chart here, because the bar chart is really unclear.
# The data are of course discrete and not continuous. 

In [None]:
views_per_year = compare_vids.groupby(['video_upload_year'])['video_view_count'].agg('sum')

fig, ax = plt.subplots(figsize=(10,5)) # Create matplotlib figure and axes

width = 0.4

views_per_year.plot(kind='bar', color='red', width=width, grid=True, ax=ax)
ax.set_ylabel('number of views')
ax.set_xlabel('year')

plt.show()

# Show the number of views combined per year.

In [None]:
ax = compare_vids.groupby(['video_upload_year','video_channel_title'])['video_view_count'] \
        .agg('sum') \
        .unstack(1) \
        .plot \
        .line(title="Number of views per party", 
            figsize=(20,10), 
            grid=True)

ax.set_xlabel('year')
ax.set_ylabel('number of views')

plt.show()

It's a bit of a mess, but it is very clear that Forum Democratie is outperforming everybody. Maybe it's better to make some decisions on what to show.

In [None]:
compare_vids = pd.concat([videos_control, pvvfvd_vids], sort=True) # DataFrame.append was removed in pandas 2.0
compare_vids['video_upload_year'] = pd.DatetimeIndex(compare_vids['video_published']).year

channels_we_want = ['Forum Democratie', 'Partij van de Arbeid (PvdA)', 'PVVpers', 'DENK TV', 'GroenLinks']

compare_vids = compare_vids[compare_vids.video_channel_title.isin(channels_we_want)]

In [None]:
ax = compare_vids.groupby(['video_upload_year','video_channel_title'])['video_view_count'] \
        .agg('sum').unstack(1).plot.line(title="Number of views per party", figsize=(20,10), grid=True)

ax.set_xlabel('year')
ax.set_ylabel('number of views')

plt.show()