To make full use of this notebook, go to: File > Save a copy in Drive...

This will allow you to keep a version in your own Drive that you can work from, and is recommended. You can then close out the original tab and work on the copied version.

To find where it has been copied to, go to: File > Locate in Drive

# Thematic Concentration.

Aim: Use the method of Wang and Liu to compute the h-point for each of a selection of Trump speeches found here: https://github.com/unendin/Trump_Campaign_Corpus/tree/master/text


## Setup of workbook:

1. Set up workbook for maximum speed (GPU)
2. Import necessary modules
3. Load in the data
4. Data pre-processing
5. Part-of-Speech tagging and lemmatisation
6. Compute h-point

### 1. Set up workbook:

Google Colab generously gives you one GPU (graphics processing unit) to run computations on.
For highly parallel numerical work, a GPU is much quicker than a CPU because it can perform many more floating-point operations (read "calculations") per second (FLOPS).

To turn this feature on go to:
Edit > Notebook Settings > Change the hardware accelerator to GPU

In [0]:
import tensorflow as tf # We import TensorFlow ahead of the other modules (section 2)
                        # so that we can check the GPU is available before going further.
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name)) 

### 2. Import necessary modules

In [0]:
import string
import collections
!pip install spacy
import spacy

### 3. Load in the data

In [0]:
from google.colab import drive
drive.mount('/content/drive')   # Run this code and follow the instructions to mount your drive

In [0]:
text_path = 'drive/My Drive/Colab Notebooks/Deep Learning/OSGD Workbooks/Katy Tur.txt' # replace this with where you've saved the text file in Drive

In [0]:
def read_words(filename):
    with open(filename, "r") as f:
        sentences = []
        for line in f:
            sentences.append(line.strip()) # strip the trailing newline and surrounding whitespace
    return sentences

speech = read_words(text_path) # import single speech as a list of line-by-line text

### 4. Pre-process / clean the data

First, we remove anything not spoken by Donald Trump (audience reactions, unidentified speakers etc.).

Then we strip the speaker label ("Donald Trump:") from the start of each line he speaks, and convert everything to lower case so that the same word is always counted the same way.
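
As a rough illustration of these two steps, here is a minimal sketch on a few made-up transcript lines. The sample lines, the marker list and the helper names `is_speech` and `trump_text` are assumptions for this example only; the notebook's own functions follow in the next cells.

```python
# Hypothetical transcript lines, made up for illustration only.
sample_lines = [
    "DONALD TRUMP: Thank you very much.",
    "(APPLAUSE)",
    "",
    "KATY TUR: What is your plan?",
    "DONALD TRUMP: We are going to win.",
]

# A shortened, illustrative list of audience/noise markers.
NOISE_MARKERS = ("LAUGHTER", "APPLAUSE", "INAUDIBLE", "CROSSTALK")

def is_speech(line):
    """True for non-empty lines that contain no audience/noise marker."""
    return line != "" and not any(marker in line for marker in NOISE_MARKERS)

def trump_text(line):
    """Lower-case the line and keep only the text after a 'donald trump:' label."""
    speaker, sep, text = line.lower().partition(":")  # split on the first colon only
    if sep == "":  # no speaker label: keep the whole line
        return line.lower()
    return text.strip() if speaker.strip() == "donald trump" else ""

kept = [trump_text(line) for line in sample_lines if is_speech(line)]
print(kept)  # ['thank you very much.', '', 'we are going to win.']
```

Lines from other speakers come back as empty strings, which add nothing once the lines are joined into one text.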

In [0]:
def clean_line(line):
  if ('LAUGHTER' in line) or ('APPLAUSE' in line) or ('UNIDENTIFIED' in line) or ('INAUDIBLE' in line) or ('CROSSTALK' in line) or ('PROTESTERS' in line):
    return False
  elif line == '':
    return False
  else:
    return True

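def remove_donald_punc(line):
  line = line.replace('\u2019', "'") # normalise curly apostrophes (assumed intent: the original replace swapped ' for ' and did nothing)
  line = line.lower()
  speaker, sep, text = line.partition(':') # split on the first colon only, so later colons stay in the text

  if sep == '': # no speaker label on this line: keep it whole
    return line
  if speaker.strip() == 'donald trump': # extract only pieces of text that Donald says
    return text
  return ''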

In [0]:
cleaned1 = [x for x in speech if clean_line(x)] # Remove audience participation etc. using function above
cleaned2 = [remove_donald_punc(x) for x in cleaned1] # Remove all "Donald Trump:" intros
cleaned2[0:10] # Print the result

In [0]:
# Convert back into one long string of text:

full_text = ' '.join(cleaned2) # join the cleaned lines with a single space between them

full_text # Show result.

### 5. Part-of-Speech (POS) tagging and Lemmatisation

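Here we label each word as a noun, verb, adjective etc. (POS tagging), so that we can extract the thematic words later.

We also reduce every word to its lemma (its dictionary form), so that inflected forms of the same word are counted together for more accurate frequency counting.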
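
To see why counting lemmas changes the frequencies, here is a small sketch with a hand-written lemma table. The table `toy_lemmas` and the token list are made up for illustration; in this notebook spaCy produces the lemmas.

```python
import collections

# Hypothetical hand-written lemma table; spaCy derives this mapping for us.
toy_lemmas = {"wins": "win", "winning": "win", "won": "win", "jobs": "job"}

toy_tokens = ["win", "wins", "winning", "won", "job", "jobs"]

# Counting surface forms treats every inflection as a separate word.
surface_counts = collections.Counter(toy_tokens)

# Counting lemmas merges the inflections before counting.
lemma_counts = collections.Counter(toy_lemmas.get(t, t) for t in toy_tokens)

print(surface_counts["win"])       # 1
print(lemma_counts.most_common())  # [('win', 4), ('job', 2)]
```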

In [0]:
nlp = spacy.load("en_core_web_sm") # using the spacy Python package we imported.
doc = nlp(full_text) # convert to spacy format.

In [0]:
lem_pos = {}
tokens = []
lemmas = []
pos = []

for token in doc:
  lem_pos[token.lemma_] = token.pos_ # map each lemma to its pos in a dictionary (a lemma seen with several tags keeps the last one)
  tokens.append(token.text) # extract list of the ordered words
  lemmas.append(token.lemma_) # extract the lemma that each word maps to
  pos.append(token.pos_) # extract their pos

excluded_pos = ('SPACE', 'PUNCT', 'SYM', 'NUM') # remove punctuation etc. from being considered as part of the h-point
lemmas2 = [lemma for lemma, tag in zip(lemmas, pos) if tag not in excluded_pos]


In [0]:
print(len(tokens))
print(len(lemmas))
print(len(pos))
print(len(lemmas2))

### 6. Compute h-point

- Gather frequencies
- Rank frequencies
- Calculate h-point
- Return thematic words
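
For reference, the h-point computed by the code in this section can be written as below. This is transcribed from the notebook's code and not quoted from the paper. Here $f(r)$ is the frequency of the lemma at rank $r$, and $r_i$ and $r_j = r_i + 1$ are the neighbouring ranks with $f(r_i) > r_i$ and $f(r_j) < r_j$.

$$
h =
\begin{cases}
r, & \text{if there is a rank } r \text{ with } f(r) = r \\[6pt]
\dfrac{f(r_i)\, r_j - f(r_j)\, r_i}{r_j - r_i + f(r_i) - f(r_j)}, & \text{otherwise}
\end{cases}
$$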

In [0]:
def frequency_and_h(data): # function to return the frequency and rank of each lemma, as well as the h-point

  counter = collections.Counter(data)
  count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0])) # count the frequency of each lemma and sort by descending frequency (ties alphabetically).

  lem_freq_h = []
  cut = False # stays False if the frequency never drops to or below the rank
  found = False # use a boolean to stop the conditions applying beyond the h-point

  for i in range(len(count_pairs)): # loop over every lemma, including the last one
    lemma, freq = count_pairs[i]
    rank = i+1
    lem_freq_h.append((lemma, freq, rank)) # put the rank next to each lemma

    if not found and freq == rank: # see if we have a frequency matching its rank at any point.
      cut = rank
      found = True

    if not found and freq < rank: # if not, calculate the h-point between the two frequencies that straddle it
      prev_freq = count_pairs[i-1][1]
      cut = (prev_freq*rank - (rank-1)*freq) / (1 + prev_freq - freq) # h-point formula from the paper
      found = True

  return lem_freq_h, cut
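
As a quick sanity check on the calculation above, here is a minimal standalone sketch that works out the h-point for two made-up frequency lists. The helper name `toy_h_point` and the numbers are assumptions for illustration only.

```python
def toy_h_point(freqs):
    """h-point of a list of frequencies sorted in descending order.

    Ranks are 1-based. Returns None if the frequency never falls to or
    below its rank (the list is too short to cross the line f(r) = r).
    """
    for i, freq in enumerate(freqs):
        rank = i + 1
        if freq == rank:              # exact match: frequency equals rank
            return rank
        if freq < rank:               # crossed over: interpolate between neighbours
            prev_freq = freqs[i - 1]  # frequency at rank - 1
            return (prev_freq * rank - freq * (rank - 1)) / (1 + prev_freq - freq)
    return None

print(toy_h_point([10, 6, 3, 2, 1]))  # 3   (frequency 3 sits at rank 3)
print(toy_h_point([10, 6, 5, 2, 1]))  # 3.5 (interpolated between ranks 3 and 4)
```

In the second list the frequency at rank 3 is 5 and at rank 4 is 2, so the formula gives (5*4 - 2*3) / (1 + 5 - 2) = 14 / 4 = 3.5.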

Now just to return thematic words:

In [0]:
def return_thematic(lemmas, lem_pos):

  count_pairs, cut = frequency_and_h(lemmas) # using the function above to extract the h-point and the ranked lemmas.
  print("h-point is: {}".format(cut)) # print h-point
  print('')

  n_above = int(cut) # number of ranks that do not exceed the h-point (round() could include a rank past it)

  thematic = [] # prepare list to store thematic words for calculation of the TC value
  print('Thematic words (nouns and adjectives): ')
  for i in range(n_above):
    lemma = count_pairs[i][0]
    if lem_pos[lemma] == 'NOUN' or lem_pos[lemma] == 'ADJ': # keep only nouns and adjectives
      print(lemma)
      thematic.append(count_pairs[i])

  print('')
  multiplier = 2/(cut*(cut-1)*count_pairs[0][1]) # 2 / (h * (h-1) * frequency of the top-ranked lemma)
  calc = 0
  for ii in range(len(thematic)):
    val = (cut - thematic[ii][2])*thematic[ii][1] # (h - rank) * frequency
    calc += val
  TC = multiplier*calc
  print('Thematic concentration is: {}'.format(TC))

  print('')
  print('Words above h-point: (Word, Frequency, Rank): ')
  for i in range(n_above):
    print(count_pairs[i]) # print words above the h-point.

  print('')
  print('All nouns & adjectives by frequency/rank:')

  for i in range(len(count_pairs)):
    lemma = count_pairs[i][0]
    if lem_pos[lemma] == 'NOUN' or lem_pos[lemma] == 'ADJ': # keep only nouns and adjectives
      print(count_pairs[i])

  print('')
  print('Top 100 words by frequency:')
  for i in range(min(100, len(count_pairs))): # guard against texts with fewer than 100 distinct lemmas
    print(count_pairs[i])

  print('')

  return 'Done'



In [0]:
return_thematic(lemmas2, lem_pos)
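
To make the thematic concentration figure easier to check by hand, here is a minimal standalone sketch on a made-up ranked list. The helper name `toy_thematic_concentration`, the toy data and the strict `rank < h` cut-off are assumptions for illustration; the weighting and the scaling factor follow the formula used in `return_thematic`.

```python
def toy_thematic_concentration(ranked, h, thematic_pos=("NOUN", "ADJ")):
    """ranked: list of (lemma, frequency, pos) sorted by descending frequency.

    Sums (h - rank) * frequency over thematic lemmas ranked above the h-point,
    then scales by 2 / (h * (h - 1) * frequency of the top-ranked lemma).
    """
    top_freq = ranked[0][1]
    total = 0.0
    for rank, (lemma, freq, pos) in enumerate(ranked, start=1):
        if rank < h and pos in thematic_pos:
            total += (h - rank) * freq
    return 2 * total / (h * (h - 1) * top_freq)

# Made-up ranked list, for illustration only.
toy_ranked = [
    ("the", 10, "DET"),
    ("country", 6, "NOUN"),
    ("be", 5, "AUX"),
    ("great", 2, "ADJ"),
    ("job", 1, "NOUN"),
]

tc = toy_thematic_concentration(toy_ranked, h=3.5)
print(round(tc, 3))  # 0.206
```

Only "country" (rank 2) is a noun or adjective above the h-point of 3.5, so the sum is (3.5 - 2) * 6 = 9 and the result is 2 * 9 / (3.5 * 2.5 * 10) = 18 / 87.5.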