# Introduction to Python for Digital Text Analysis (Part I)

This session will provide an overview of how Python can be used to descriptively summarise a dataset made of YouTube comments.

We expect you to have basic knowledge of Python, but if you don't, this introductory notebook is a good place to start:
https://github.com/fbkarsdorp/python-intro/blob/master/notebook.ipynb

As this lesson is aimed at people with little Python experience (perhaps nothing more than one tutorial!), we are first going to refresh your memory of some Python tools.

_____________________________

Just before that, I'll tell you one of the most important things I've learned about programming:

### Most problems you'll ever face [in Python] were once someone else's problem.

So Google it. Go to Stack Overflow. Don't spend time trying to reinvent the wheel.

## Refreshing your Python memory

__Strings__ are one of Python's simplest data types. A string can be represented with single ( **'** ) or double ( **"** ) quotation marks:

In [None]:
s = "Hello world!"
print(s)

A string can be modified in many ways. Here we will focus on a few of them.

You can replace part of a string:

In [None]:
s2 = s.replace("world", "python")
print(s2)

You can concatenate strings by adding one string to another:

In [None]:
s3 = s + ' ' + s2
print(s3)

You can split strings, turning them into lists:

In [None]:
splitstring = s3.split()
print (splitstring)
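
By default, _split()_ cuts a string at whitespace, but it also accepts a separator, and _join()_ does the reverse. The short sketch below illustrates both; the string _line_ is a made-up example, not taken from the dataset.

```python
# Hypothetical example string, invented for illustration
line = "2018-05-01,EXO,workshop"

# split() with an argument cuts the string at that separator instead of at whitespace
fields = line.split(",")
print(fields)

# join() is the inverse operation: it glues a list of strings back together
rejoined = " | ".join(fields)
print(rejoined)
```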

**Lists** are sequences of objects. For instance, the object _splitstring_ above is a list of strings.

Here's how to define an empty list, and fill it with numbers, in various ways. Note that one way to fill a list is to add another list to it:

In [None]:
my_list = []

my_list.append(10)
my_list.append(20)

my_list = my_list + [ 30 ]

my_list += range(4)
# this is the same as: my_list = my_list + list(range(4))

print( my_list )
print( my_list[:2] )
print( my_list[4:] )
print( my_list[:2]+my_list[4:] )
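
The last three lines of the cell above use slices. As a small sketch of how slicing behaves, here is the same list written out by hand (the variable names are only illustrative):

```python
# The list built in the cell above, written out explicitly
numbers = [10, 20, 30, 0, 1, 2, 3]

first_two = numbers[:2]        # items at positions 0 and 1 (the end position is excluded)
from_fifth = numbers[4:]       # items from position 4 up to the end
last_one = numbers[-1]         # negative positions count from the end of the list
without_middle = numbers[:2] + numbers[4:]   # adding two slices skips positions 2 and 3

print(first_two, from_fifth, last_one, without_middle)
```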

If lists are ordered sequences of objects that you access by position, **Dictionaries** are for when you want to look things up by name rather than by position.

The elements of a dictionary are pairs of **keys** and **values**.

Here's how to create a dictionary and populate it with key-value pairs:

In [None]:
my_dict = {}

my_dict['location'] = 'Brussels'
my_dict['host'] = 'EASt'
my_dict['event'] = 'workshop'

print(my_dict)
print(list(my_dict.keys()))
print(list(my_dict.values()))
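
Two more dictionary tools are worth knowing before we start counting things later on: _items()_, which gives keys and values together, and _get()_, which returns a default value when a key is missing. A minimal sketch, using a dictionary like the one above (the key _year_ is deliberately absent):

```python
event = {'location': 'Brussels', 'host': 'EASt', 'event': 'workshop'}

# items() yields (key, value) pairs, so both can be unpacked in the loop
for key, value in event.items():
    print(key, "->", value)

# get() returns the given default instead of raising a KeyError for a missing key
year = event.get('year', 'unknown')
print(year)
```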

Once you have your strings, lists, dictionaries and whatever else, you might want to go through them.

**for loops** are one common way to iterate over structures. Here's how to iterate over a range of integers:

In [None]:
for i in range(5):
    print(i)

But you can iterate over any kind of list. For example, here's how to iterate over a list of strings:

In [None]:
names  = [ 'Amelie','Tom','Niko','Ruben','Esma' ]

for name in names:
    print ("My name is "+name)

And here's how to iterate over two lists at the same time, using **zip**:

In [None]:
names  = [ 'Amelie','Tom','Niko','Ruben','Esma' ]
births = [ 1968, 1984, 1977, 1988, 1973 ]
 
for i, j in zip(names, births):
    print( i, "was born in", j )
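
Since _zip_ produces pairs, it can also be used to build a dictionary from two lists, and _enumerate_ gives you a counter while you loop. This is only a sketch of two related idioms, reusing the same lists:

```python
names  = [ 'Amelie','Tom','Niko','Ruben','Esma' ]
births = [ 1968, 1984, 1977, 1988, 1973 ]

# dict() turns the (name, year) pairs produced by zip into key-value pairs
birth_year = dict(zip(names, births))

# enumerate() adds a running counter; start=1 makes it count from 1 instead of 0
for position, name in enumerate(names, start=1):
    print(position, name, birth_year[name])
```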

_____________

# Data analysis with *pandas*

In [None]:
import pandas as pd

Now that we have fresh Python in our heads, we can look into the YouTube comments. We will be using the **pandas** library, which is very common in data analysis tasks.

Here, **comments** will be a **DataFrame** object. You can think of this object as a spreadsheet.

We will be analysing comments from one video by the K-pop band EXO. You can repeat this analysis with any other video in the dataset.

Now let's load the dataset:

In [None]:
comments = pd.read_csv('data/kpop_videos_metadata/exo/I3dezFzsNss.txt', delimiter='\t')

Show the first rows of **comments**:

In [None]:
comments.head()

**Filter** the dataframe, to keep only the columns we are interested in:

In [None]:
comments = comments.filter(items=['CommentPublished','CommentTextDisplay','CommentAuthorName','CommentLikeCount'])

comments.head()

Rows can be filtered too, by writing a condition on a column. Conditions can be combined with **&**. The result is always another dataframe:

In [None]:
one_two_likes = comments[ (comments['CommentLikeCount'] >= 1) & (comments['CommentLikeCount'] <= 2) ]

one_two_likes.head()
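
To see what happens inside the square brackets, it helps to look at the condition on its own: it is a column of _True_/_False_ values, one per row. The sketch below uses a tiny made-up dataframe (_toy_) with the same column names as our data, so the values are invented for illustration.

```python
import pandas as pd

# Hypothetical mini-dataset with the same column names as the real comments
toy = pd.DataFrame({
    'CommentTextDisplay': ['first', 'nice song', 'love it', 'wow'],
    'CommentLikeCount': [0, 1, 2, 5],
})

# Each comparison gives one True/False per row; & keeps rows where both are True
mask = (toy['CommentLikeCount'] >= 1) & (toy['CommentLikeCount'] <= 2)
print(mask.tolist())

selected = toy[mask]     # rows where the mask is True
outside  = toy[~mask]    # ~ negates the mask, giving all the other rows
print(selected)
```

The parentheses around each comparison are needed, because **&** binds more tightly than **>=** and **<=**.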

Another way to use filters is to create Boolean columns with _True_ for the rows that pass these filters and _False_ for the rows that do not. Here we define a new column called _SomeLikes_, which is _True_ for rows with _CommentLikeCount_ >= 1 and <= 2.

In [None]:
comments['SomeLikes'] = (comments['CommentLikeCount'] >= 1) & (comments['CommentLikeCount'] <= 2)

comments.head()

Columns can be removed with **del**:

In [None]:
del comments['SomeLikes']

comments.head()

There is more than one way to access a column and its values.

In the code below, both _comments.head()['CommentTextDisplay']_ and _comments.head().CommentTextDisplay_ show the text of the first 5 comments, with their indices on the side, while _comments.head().CommentTextDisplay.values_ does not show the indices.

Note that, depending on your Python version, emoji or special characters may also be displayed differently with the third option.

In [None]:
print( comments.head()['CommentTextDisplay'] )

print ("\n---------------------------------------------------------\n")

print( comments.head().CommentTextDisplay )

print ("\n---------------------------------------------------------\n")

print( comments.head().CommentTextDisplay.values )

Here is how to print one specific row of the dataframe:

In [None]:
print (comments.loc[19])

And here is how to print one specific field from that row:

In [None]:
print (comments.loc[19].CommentTextDisplay)
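
Keep in mind that _loc_ looks rows up by their index label, not by their position. The two are the same in a freshly loaded file, but they drift apart once rows are filtered out. A small sketch with a made-up dataframe (_toy_) whose labels start at 10:

```python
import pandas as pd

# Hypothetical dataframe with index labels 10, 11, 12
toy = pd.DataFrame({'CommentLikeCount': [3, 0, 7]}, index=[10, 11, 12])

by_label    = toy.loc[11]     # the row whose index label is 11
by_position = toy.iloc[1]     # the second row, whatever its label is

# Filtering keeps the original labels, so label and position no longer line up
filtered = toy[toy.CommentLikeCount > 0]
print(filtered.loc[12].CommentLikeCount)    # label 12 ...
print(filtered.iloc[1].CommentLikeCount)    # ... is now at position 1
```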

### Exercise 1:

Pick another YouTube video, by one of the four bands, load the file and find the most liked comment. You will need the function **max()**, which returns the maximum value of a list (of numbers, normally).

_____________

# Basic statistics and data visualisation

Now it is time to visualise some of the statistics of this dataset. But first, we need to process this data a little bit.

If we look again at our dataframe **comments**, we can see that each comment has a unique date and time, indicated in **CommentPublished**:

In [None]:
comments.head()

The value in **CommentPublished** is stored as a string. We can use the dateutil library to parse that string into a datetime object, which gives us access to the year, month, day, hour, minute and second.

Note how _comments['CommentDateTime']_ is written: this combination of _apply_ and _lambda_ is another way to define a column of your dataframe as a function of other columns.

In [None]:
from datetime import datetime
from dateutil import parser

comments['CommentDateTime'] = comments.apply( lambda row: parser.parse(row.CommentPublished), axis=1 )  

print( "min CommentDateTime =", comments.CommentDateTime.min() )
print( "max CommentDateTime =", comments.CommentDateTime.max() )

Let's also add the column *hour*, indicating the time of the day when each comment was posted.

In [None]:
comments['hour'] = comments.apply( lambda row: row.CommentDateTime.hour, axis=1 )  

comments.head()
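
As an alternative to _apply_ and _lambda_, pandas can convert a whole column at once with _pd.to_datetime_, and the _.dt_ accessor then exposes fields such as the hour. This is a sketch on a made-up dataframe; the timestamp format shown (ISO 8601) is an assumption about what **CommentPublished** looks like, so check it against your own file.

```python
import pandas as pd

# Hypothetical timestamps; the ISO 8601 format is an assumption about the real data
toy = pd.DataFrame({
    'CommentPublished': ['2018-01-05T14:30:00.000Z', '2018-01-06T09:15:00.000Z']
})

# Convert the whole column of strings to datetimes in one call
toy['CommentDateTime'] = pd.to_datetime(toy['CommentPublished'])

# The .dt accessor gives access to datetime fields for every row at once
toy['hour'] = toy['CommentDateTime'].dt.hour
print(toy)
```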

Finally, let's also remove any rows containing null (_NaN_) values.

In [None]:
comments = comments.dropna()
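
Before dropping rows, it can be useful to check how many null values each column has, and to decide whether a null in any column should remove the row or only a null in certain columns. A minimal sketch on a made-up dataframe (_toy_):

```python
import pandas as pd

# Hypothetical data with one missing text and one missing like count
toy = pd.DataFrame({
    'CommentTextDisplay': ['hi', None, 'ok'],
    'CommentLikeCount': [1, 2, None],
})

# Count the null values in each column
print(toy.isna().sum())

all_clean = toy.dropna()                                 # drop rows with a null in any column
text_only = toy.dropna(subset=['CommentTextDisplay'])    # only require the text to be present
print(len(all_clean), len(text_only))
```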

_____________
### Visualising comments over time

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.set(color_codes=True)
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

In [None]:
# In seaborn 0.9 and later, factorplot is called catplot and its size argument is called height
g = sns.catplot(x="hour", data=comments, kind="count", height=6, aspect=1.5)
plt.show()
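
The numbers behind a count plot like this one can also be obtained directly with _value_counts_. The sketch below uses a short made-up series of hours rather than the real column:

```python
import pandas as pd

# Hypothetical posting hours for six comments
hours = pd.Series([14, 9, 14, 23, 9, 14], name='hour')

# value_counts() counts how often each hour occurs;
# sort_index() orders the result by hour rather than by frequency
per_hour = hours.value_counts().sort_index()
print(per_hour)
```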

### Visualising number of likes per comment

In [None]:
print( "The mean number of likes per comment is "+str(comments.CommentLikeCount.mean()) ) 

In [None]:
# distplot is deprecated in recent seaborn releases; histplot draws the same kind of histogram
g = sns.histplot( comments.CommentLikeCount, kde=False, color='g' )

g = g.set_ylabel('number of comments')

plt.yscale('log')
plt.show()

We can look in more detail at the comments with the most likes.

In [None]:
for index, row in comments[ comments.CommentLikeCount > 2000 ].iterrows():
    print( str(row.CommentLikeCount) + " likes: " + row.CommentTextDisplay + "\n" )
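
A fixed threshold such as 2000 depends on the video. If you would rather ask for the top few comments whatever their like counts are, _nlargest_ is one option. A sketch on a made-up dataframe (_toy_), with invented comments:

```python
import pandas as pd

# Hypothetical comments and like counts
toy = pd.DataFrame({
    'CommentTextDisplay': ['ok', 'best song ever', 'meh', 'on repeat'],
    'CommentLikeCount': [3, 2500, 0, 4100],
})

# nlargest returns the n rows with the highest values in the given column,
# already sorted from most to least liked
top_two = toy.nlargest(2, 'CommentLikeCount')

for index, row in top_two.iterrows():
    print(str(row.CommentLikeCount) + " likes: " + row.CommentTextDisplay)
```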

We can look at the number of unique authors:

In [None]:
all_authors = set(comments.CommentAuthorName)
print ("Number of unique authors: " + str(len(all_authors)) )

### Number of comments per author and number of likes per author:

We can also look at which authors comment more or get more likes. Here I have made two dictionaries, *comment_count* and *likes_count*, which have author names as keys, and number of comments and number of likes as values, respectively. 

In [None]:
comment_count = {}
likes_count   = {}

for index, row in comments.iterrows():
    
    author = row.CommentAuthorName
    number_of_likes = int(row.CommentLikeCount)
    
    if author in comment_count:
        comment_count[author] += 1
    else:
        comment_count[author]  = 1
        
    if author in likes_count:
        likes_count[author] += number_of_likes
    else:
        likes_count[author]  = number_of_likes

all_authors = list(comment_count.keys())

# Both dictionaries were filled in the same loop, so their keys are in the same order
data_dict = { 'Author':all_authors,
              'CommentCount':list(comment_count.values()),
              'TotalLikes':list(likes_count.values()) }

authors = pd.DataFrame(data_dict)

authors.head()
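
The loop above makes every step explicit, which is good for learning. For reference, pandas can produce the same kind of table with _groupby_. The sketch below is an alternative, not the method used in this lesson, and it runs on a made-up dataframe (_toy_) with invented author names:

```python
import pandas as pd

# Hypothetical comments by three authors
toy = pd.DataFrame({
    'CommentAuthorName': ['ana', 'ben', 'ana', 'cy', 'ben', 'ana'],
    'CommentLikeCount': [2, 0, 5, 1, 3, 0],
})

# Group the rows by author, then compute two summaries per group:
# 'size' counts the rows (comments), 'sum' adds up the likes
authors_alt = (
    toy.groupby('CommentAuthorName')
       .agg(CommentCount=('CommentLikeCount', 'size'),
            TotalLikes=('CommentLikeCount', 'sum'))
       .reset_index()
       .rename(columns={'CommentAuthorName': 'Author'})
)
print(authors_alt)
```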

You can sort the dataframe __authors__ by comment count:

In [None]:
authors.sort_values('CommentCount', ascending=False).head(10)

You can sort the dataframe __authors__ by like count:

In [None]:
authors = authors.sort_values('TotalLikes', ascending=False)
authors.head(10)

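Log-log plot of the number of likes per author, with authors ranked from most to least liked:

In [None]:
x = sorted(authors.TotalLikes.values)[::-1]

# Ranks start at 1 so that the top author is visible on the log axis
plt.loglog(range(1, len(x)+1), x, color='b')
plt.ylabel('Number of likes')
plt.xlabel('Author rank')
plt.show()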

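Log-log plot of the number of comments per author, with authors ranked from most to least active:

In [None]:
x = sorted(authors.CommentCount.values)[::-1]

# Ranks start at 1 so that the top author is visible on the log axis
plt.loglog(range(1, len(x)+1), x, color='g')
plt.ylabel('Number of comments')
plt.xlabel('Author rank')
plt.show()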

Scatter plot of comments per author vs likes per author:

In [None]:
#plt.close("all")
plt.scatter( authors.CommentCount, authors.TotalLikes, alpha=0.75 )
plt.xscale('log')
plt.yscale('log')

plt.ylabel('Number of likes')
plt.xlabel('Number of comments')

plt.ylim(0.5,10000)
plt.show()

### Exercise 2:

With the same video from the previous exercise, see if you can make the same plots we made here:
- A bar plot for the number of comments per hour
- A bar plot for the number of likes per comment
- A line plot for the number of likes per author
- A line plot for the number of comments per author

_____________

### Comment counts for each video for each of the four groups

In [40]:
# Blackpink comment counts.
df_blackpink_counts = pd.read_csv("data/kpop_comment_counts/blackpink_comment_counts.txt", sep="\t")

# BTS comment counts.
df_bts_counts = pd.read_csv("data/kpop_comment_counts/bts_comment_counts.txt", sep="\t")

# EXO comment counts.
df_exo_counts = pd.read_csv("data/kpop_comment_counts/exo_comment_counts.txt", sep="\t")

# Twice comment counts.
df_twice_counts = pd.read_csv("data/kpop_comment_counts/twice_comment_counts.txt", sep="\t")

In [None]:
df_blackpink_counts.head()

In [42]:
all_bands  = [ df_blackpink_counts, df_bts_counts, df_exo_counts, df_twice_counts ]
band_names = [ 'Black Pink', 'BTS', 'EXO', 'TWICE' ]

for df,name in zip(all_bands, band_names):
    df['BandName'] = name   # a single value is repeated for every row

In [None]:
df_blackpink_counts.head()

In [None]:
df_allbands = pd.concat(all_bands)

plt.subplots(figsize=(10,8))
g = sns.boxplot(x="BandName", y="Comments", hue="BandName", data=df_allbands)
g.set( yscale='log' )

plt.show()
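
A box plot shows the median of each group as the line inside the box. To get those medians as numbers, you can group the combined dataframe by band. The sketch below uses two small made-up dataframes with a _Comments_ column; the band names and counts are invented for illustration.

```python
import pandas as pd

# Hypothetical comment counts per video for two made-up bands
df_a = pd.DataFrame({'Comments': [10, 200, 30]})
df_b = pd.DataFrame({'Comments': [5, 50]})

# assign() returns a copy of each dataframe with the extra BandName column
frames = [df.assign(BandName=name)
          for df, name in zip([df_a, df_b], ['Band A', 'Band B'])]

# Stack the dataframes on top of each other, renumbering the rows
combined = pd.concat(frames, ignore_index=True)

# Median number of comments per video for each band
medians = combined.groupby('BandName')['Comments'].median()
print(medians)
```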