## Introduction to Python for Digital Text Analysis (Part I)

This session will provide an overview of how Python can be used to descriptively summarise a dataset.

# First thing: running code

In [78]:
a = 10
print(a)

10


Shift-Enter runs the current cell and moves to the cell below. There are two other keyboard shortcuts for running code:

Alt-Enter runs the current cell and inserts a new one below.
Ctrl-Enter runs the current cell and enters command mode.

### Strings

In [79]:
s = "Hello world!"
print( len(s) )

12


In [80]:
s2 = s.replace("world", "python")
print(s2)

s3 = s2.replace("Hello","monty")
print(s3)

Hello python!
monty python!


In [81]:
s.find('world')

6

In [82]:
# Concatenation

s4 = s + ' '+ s3

print(s4)

Hello world! monty python!


In [83]:
# Splitting

s = "The method split\nreturns a list\nof all the words\nin the string"

s.split('\n')

['The method split', 'returns a list', 'of all the words', 'in the string']

In [84]:
s.split('-')

['The method split\nreturns a list\nof all the words\nin the string']
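A few related behaviours are worth knowing when searching and splitting strings. The sketch below is only illustrative (the variable names are not from the session itself): `find` returns -1 when the substring is absent, `split()` with no argument splits on any whitespace, and `join` reverses a split.

```python
text = "The method split\nreturns a list\nof all the words\nin the string"

# find() gives the index of the first match, or -1 if there is no match.
position_found = text.find("split")
position_missing = text.find("banana")

# With no argument, split() breaks on any whitespace (spaces and newlines).
words = text.split()

# join() is the inverse operation: it glues a list of strings back together.
rejoined = " ".join(words)

print(position_found, position_missing)
print(len(words))
print(rejoined)
```

Because a missing separator simply returns the whole string as a one-item list, it is worth checking the length of the result before indexing into it.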

### Lists

A list is an ordered container of objects. The objects do not need to be of the same type.

In [85]:
values = [ '1', 2, 3.0, True, False, 'banana' ]
print (len(values))

6


In [86]:
# Indexing and slicing

print (values[2])
print (values[:3])

3.0
['1', 2, 3.0]
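Indexing and slicing have a few more forms than the two shown above. The following is a small sketch reusing the same `values` list; the other variable names are made up for illustration.

```python
values = ['1', 2, 3.0, True, False, 'banana']

# Negative indexes count from the end of the list.
last_item = values[-1]

# A slice start:stop includes start and excludes stop.
middle = values[1:4]

# A third number sets the step, here every second item.
every_other = values[::2]

# The items really are of different types.
kinds = [type(v).__name__ for v in values]

print(last_item)
print(middle)
print(every_other)
print(kinds)
```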


In [87]:
# Append

my_list = []

my_list.append(8)
my_list.append(10)
my_list.append(11)
my_list.append(12)

print(my_list)

[8, 10, 11, 12]


In [88]:
# Delete

my_list.remove(10)
print(my_list)

del my_list[0]
print(my_list)

[8, 11, 12]
[11, 12]


In [89]:
# Generating, using the function range(start,stop,step)

range1 = range(0,10,2)
range2 = range(-5,5)

print(range1)
print(range2)

range(0, 10, 2)
range(-5, 5)


In [90]:
# Functions list() and str()

print(list(range1))
print(list(range2))

print( str(list(range2)) + ' is a list' )


[0, 2, 4, 6, 8]
[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]
[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4] is a list


### Dictionaries

A flexible collection of {key: value} pairs.

Also called associative arrays or hash maps in other languages.

In [91]:
my_dict = {}

my_dict['bananas'] = 'yellow'
my_dict['how_many_bananas'] = 5
my_dict[2017] = True

my_dict

{2017: True, 'bananas': 'yellow', 'how_many_bananas': 5}
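Once a dictionary is filled, values are looked up by key. The sketch below rebuilds the same dictionary and shows three common operations; the key `'apples'` is a made-up example of a key that is not present.

```python
my_dict = {'bananas': 'yellow', 'how_many_bananas': 5, 2017: True}

# Look up a value by its key.
colour = my_dict['bananas']

# Test whether a key exists before using it.
has_apples = 'apples' in my_dict

# get() returns a fallback instead of raising KeyError for a missing key.
apple_colour = my_dict.get('apples', 'unknown')

# items() yields (key, value) pairs for looping.
for key, value in my_dict.items():
    print(key, '->', value)
```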

# Control flow

Note the indentation!

In Python we don't use brackets or semicolons to mark which parts of the code belong inside which others; we use indentation.

In [92]:
# If statements

car = 10.00
bus = 2.50
cash = 1.50

if car < cash:
    print ('I can drive')
    
if bus < cash and car > cash:
    print ("I'll take the bus")
    
if bus > cash:
    print ('Walking')

Walking


In [93]:
cash = 0.90

if car < cash:
    print ('Enjoy your car ride')
elif bus < cash:
    print ('Another one rides the bus')
elif cash < 1.00:
    print ('stay home')
else:
    print ('Walking..')

stay home


In [94]:
# for loops
for i in range(5):
    print(i)

0
1
2
3
4
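Indentation matters most when blocks are nested. The sketch below (with made-up variable names) combines a loop and an if statement; the comments say which block each line belongs to.

```python
numbers = [1, 2, 3, 4]
even_numbers = []

for n in numbers:
    # Indented once: inside the for loop.
    if n % 2 == 0:
        # Indented twice: inside the if, which is inside the loop.
        even_numbers.append(n)
    # Back to one level: runs for every n, even or odd.
    print('checked', n)

# No indentation: runs once, after the loop has finished.
print(even_numbers)
```

Moving the final `print` one level to the right would make it part of the loop, so it would run four times instead of once.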


In [95]:
names  = [ 'Amelie','Tom','Niko','Ruben','Esma' ]
for name in names:
    print ("My name is "+name)

My name is Amelie
My name is Tom
My name is Niko
My name is Ruben
My name is Esma


In [126]:
# zip

names  = [ 'Amelie','Tom','Niko','Ruben','Esma' ]
births = [ 1968, 1984, 1977, 1988, 1973 ]
 
for i, j in zip(names, births):
    print(i, "was born in", j)

Amelie was born in 1968
Tom was born in 1984
Niko was born in 1977
Ruben was born in 1988
Esma was born in 1973
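`zip` pairs items by position, which makes it handy for building a dictionary from two lists. A short sketch follows, assuming the same `names` and `births` lists; `birth_by_name` and `short_pairs` are illustrative names.

```python
names = ['Amelie', 'Tom', 'Niko', 'Ruben', 'Esma']
births = [1968, 1984, 1977, 1988, 1973]

# Turn the pairs into a dictionary: name -> year of birth.
birth_by_name = dict(zip(names, births))

# zip stops at the end of the shorter list, silently dropping the rest.
short_pairs = list(zip(names, [1968, 1984]))

print(birth_by_name['Niko'])
print(short_pairs)
```

The silent truncation is a good reason to check that both lists have the same length before zipping them.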


# Data analysis

We'll use the **pandas** library for data analysis.

In [97]:
import pandas as pd

First, we'll transform _births_ and _names_ into a dictionary.

In [133]:
dict_data = { 'Names':names, 'Births':births }

dict_data

{'Births': [1968, 1984, 1977, 1988, 1973],
 'Names': ['Amelie', 'Tom', 'Niko', 'Ruben', 'Esma']}

**df** will be a **DataFrame** object. You can think of this object as a spreadsheet.

In [134]:
df = pd.DataFrame(dict_data)
df

   Births   Names
0    1968  Amelie
1    1984     Tom
2    1977    Niko
3    1988   Ruben
4    1973    Esma
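A first step with any DataFrame is usually to check its size, its column types and a few summary figures. The sketch below rebuilds the same small table so that it runs on its own; the summary variables are illustrative names.

```python
import pandas as pd

names = ['Amelie', 'Tom', 'Niko', 'Ruben', 'Esma']
births = [1968, 1984, 1977, 1988, 1973]
df = pd.DataFrame({'Names': names, 'Births': births})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

# head() shows the first rows; dtypes shows the type of each column.
print(df.head(3))
print(df.dtypes)

# Simple descriptive statistics on a numeric column.
earliest = df['Births'].min()
latest = df['Births'].max()
mean_birth = df['Births'].mean()
print(n_rows, n_cols, earliest, latest, mean_birth)
```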


### Exporting

We export the DataFrame to a CSV file with the ***to_csv*** method. The only parameters we will use are ***index*** and ***header***. Setting these parameters to False will prevent the index and header names from being exported. Change the values of these parameters to get a better understanding of their use.

In [108]:
df.to_csv('births.csv',index=False,header=True)

In [109]:
infile = 'births.csv'
df = pd.read_csv(infile)
df

   Births   Names
0    1968  Amelie
1    1984     Tom
2    1977    Niko
3    1988   Ruben
4    1973    Esma
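To see what ***index*** and ***header*** do without writing files, `to_csv` can be called with no path, in which case it returns the CSV text as a string. The sketch below uses a two-row table and made-up variable names; it also shows that a file exported with the index comes back with an extra column named `Unnamed: 0`.

```python
import io

import pandas as pd

df = pd.DataFrame({'Names': ['Amelie', 'Tom'], 'Births': [1968, 1984]})

# With no file name, to_csv returns the CSV content as a string.
with_both = df.to_csv(index=True, header=True)
no_index = df.to_csv(index=False, header=True)
no_header = df.to_csv(index=False, header=False)

print(with_both)
print(no_index)
print(no_header)

# Reading back a CSV that includes the index gives an unnamed first column.
reloaded = pd.read_csv(io.StringIO(with_both))
print(list(reloaded.columns))
```

This is why `index=False` is the usual choice when the index is just the default row numbering.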


In [135]:
df['age_in_2017'] = 2017 - df['Births']
print (df)

   Births   Names  age_in_2017
0    1968  Amelie           49
1    1984     Tom           33
2    1977    Niko           40
3    1988   Ruben           29
4    1973    Esma           44


In [136]:
countries = ['Belgium','England','Croatia','Cuba','Turkey']
df['Countries'] = countries

df

   Births   Names  age_in_2017 Countries
0    1968  Amelie           49   Belgium
1    1984     Tom           33   England
2    1977    Niko           40   Croatia
3    1988   Ruben           29      Cuba
4    1973    Esma           44    Turkey


In [123]:
# Print one specific column
print (df['Countries'])

0    Belgium
1    England
2    Croatia
3       Cuba
4     Turkey
Name: Countries, dtype: object


In [140]:
print (df.Countries)

0    Belgium
1    England
2    Croatia
3       Cuba
4     Turkey
Name: Countries, dtype: object


In [137]:
# Select part of the dataframe

df[ df['Births'] < 1985 ]

   Births   Names  age_in_2017 Countries
0    1968  Amelie           49   Belgium
1    1984     Tom           33   England
2    1977    Niko           40   Croatia
4    1973    Esma           44    Turkey


In [138]:
# Combine two conditions with &; .str.len() gives the length of each name
df[ (df['Births'] < 1980) & ( df['Names'].str.len() > 3 ) ]

   Births   Names  age_in_2017 Countries
0    1968  Amelie           49   Belgium
2    1977    Niko           40   Croatia
4    1973    Esma           44    Turkey
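When a condition depends on the length of each name, note that Python's built-in `len` applied to a column returns the number of rows, whereas `.str.len()` returns one length per row. The sketch below rebuilds a reduced version of the table to show the difference; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Names': ['Amelie', 'Tom', 'Niko', 'Ruben', 'Esma'],
    'Births': [1968, 1984, 1977, 1988, 1973],
})

# len() of a column is a single number: how many rows it has.
rows_in_column = len(df['Names'])

# .str.len() is computed row by row, so it can be used as a filter.
name_lengths = df['Names'].str.len()

# Each condition goes in its own parentheses, joined with & (and) or | (or).
born_early_long_name = df[(df['Births'] < 1980) & (df['Names'].str.len() > 4)]

print(rows_in_column)
print(list(name_lengths))
print(born_early_long_name)
```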


In [152]:
# Copying: .copy() makes df_new an independent DataFrame, not a view of df

df_new = df[ (df['Births'] < 1980) & ( df['Names'].str.len() > 3 ) ].copy()

print (df)
print (" ")
print (df_new)

   Births   Names  age_in_2017 Countries
0    1968  Amelie           49   Belgium
1    1984     Tom           33   England
2    1977    Niko           40   Croatia
3    1988   Ruben           29      Cuba
4    1973    Esma           44    Turkey
 
   Births   Names  age_in_2017 Countries
0    1968  Amelie           49   Belgium
2    1977    Niko           40   Croatia
4    1973    Esma           44    Turkey
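The reason for copying is that plain assignment does not duplicate a DataFrame; it only gives the same object a second name. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'Names': ['Amelie', 'Tom', 'Niko'],
    'Births': [1968, 1984, 1977],
})

df_alias = df          # a second name for the same DataFrame
df_copy = df.copy()    # an independent copy of the data

# Changing the copy leaves the original untouched.
df_copy.loc[0, 'Births'] = 1900

# Changing the alias changes the original, because they are one object.
df_alias.loc[1, 'Births'] = 1901

print(df)
print(df_copy)
```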


In [150]:
# Synonyms: two equivalent ways to select a column (.values returns the underlying NumPy array)

print( df['Names'] )
print (" ")
print( df.Names )
print (" ")
print( df.Names.values )

0    Amelie
1       Tom
2      Niko
3     Ruben
4      Esma
Name: Names, dtype: object
 
0    Amelie
1       Tom
2      Niko
3     Ruben
4      Esma
Name: Names, dtype: object
 
['Amelie' 'Tom' 'Niko' 'Ruben' 'Esma']


In [148]:
# Print one specific row

print (df.loc[4])
print (" ")
print (df.loc[4].Names)

Births           1973
Names            Esma
age_in_2017        44
Countries      Turkey
Name: 4, dtype: object
 
Esma
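`loc` selects rows by index label, not by position. The two coincide in a fresh DataFrame but drift apart after filtering, where `iloc` (selection by position) becomes useful. A sketch with a reduced table and illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'Names': ['Amelie', 'Tom', 'Niko', 'Ruben', 'Esma'],
    'Births': [1968, 1984, 1977, 1988, 1973],
})

# After filtering, the remaining rows keep their original labels: 0, 2, 4.
older = df[df['Births'] < 1980]

# loc uses the label; older.loc[1] would raise KeyError (label 1 is gone).
by_label = older.loc[4]

# iloc uses the position: the second remaining row.
by_position = older.iloc[1]

# reset_index(drop=True) renumbers the rows from 0 again.
older_reset = older.reset_index(drop=True)

print(by_label['Names'])
print(by_position['Names'])
print(list(older_reset.index))
```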


# Data visualisation

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

#%matplotlib inline

In [5]:
# Metadata about all videos.
# Note: the keyword is sheet_name; the older spelling sheetname is no longer accepted by recent pandas versions.
df_videos = pd.read_excel("../data/videos.xlsx", sheet_name=0)

# Sample of comments about the BTS video 'Save Me'.
df_bts_sample = pd.read_excel("../data/videos.xlsx", sheet_name=1)

# Comments about Twice videos.
df_twice = pd.read_excel("../data/videos.xlsx", sheet_name=2)

# Blackpink comment counts.
df_blackpink_counts = pd.read_csv("../data/kpop_comment_counts/blackpink_comment_counts.txt", sep="\t")

# BTS comment counts.
df_bts_counts = pd.read_csv("../data/kpop_comment_counts/bts_comment_counts.txt", sep="\t")

# EXO comment counts.
df_exo_counts = pd.read_csv("../data/kpop_comment_counts/exo_comment_counts.txt", sep="\t")

# Twice comment counts.
df_twice_counts = pd.read_csv("../data/kpop_comment_counts/twice_comment_counts.txt", sep="\t")

The next sections should use the data in the dataframes above to give a basic description of the dataset: how many comments there are, how many unique users, at what times the comments were posted, and the number of comments over time (grouped by week or month, with means computed per group). Include basic visualisations here (plots, graphs, charts) with matplotlib/seaborn.

* The specific columns in the video comment tabs (df_bts_sample and df_twice) that we should look at are CommentPublished (needs to be parsed), CommentTextDisplay, CommentAuthorName, and CommentLikeCount. We can look in more detail at the comments with the most likes.

* We can use the value_counts method to compute the number of posts per user for the BTS and Twice samples (CommentAuthorName column) and visualise the result as a bar chart.
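The description outlined above could start along the following lines. This is only a sketch: the four rows are made up to stand in for df_bts_sample, and the timestamp format of CommentPublished is an assumption, so the parsing step may need adjusting for the real data.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows using the column names mentioned in the notes.
df_comments = pd.DataFrame({
    "CommentPublished": ["2017-05-01T10:15:00", "2017-05-03T22:40:00",
                         "2017-06-11T08:05:00", "2017-06-12T19:30:00"],
    "CommentTextDisplay": ["first", "second", "third", "fourth"],
    "CommentAuthorName": ["user_a", "user_b", "user_a", "user_c"],
    "CommentLikeCount": [12, 0, 3, 40],
})

# Parse the timestamps so that date and time parts can be extracted.
df_comments["CommentPublished"] = pd.to_datetime(df_comments["CommentPublished"])

n_comments = len(df_comments)
n_users = df_comments["CommentAuthorName"].nunique()

# Hour of day of each comment, and number of comments per month.
hours = df_comments["CommentPublished"].dt.hour
per_month = df_comments.groupby(
    df_comments["CommentPublished"].dt.to_period("M")
).size()

# Number of posts per user, most active first.
posts_per_user = df_comments["CommentAuthorName"].value_counts()

# The comments with the most likes.
most_liked = df_comments.sort_values("CommentLikeCount", ascending=False).head(2)

# Bar chart of posts per user.
ax = posts_per_user.plot(kind="bar")
ax.set_xlabel("Author")
ax.set_ylabel("Number of comments")
plt.tight_layout()

print(n_comments, n_users)
print(per_month)
print(most_liked[["CommentAuthorName", "CommentLikeCount"]])
```

With the real data, `df_comments` would be replaced by df_bts_sample or df_twice, and grouping by week instead of month only requires changing `"M"` to `"W"`.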