To make full use of this notebook, go to: File > Save a copy in Drive...

This will allow you to keep a version in your own Drive that you can work from, and is recommended. You can then close out the original tab and work on the copied version.

To find where it has been copied to, go to: File > Locate in Drive

# Thematic Concentration.

Aim: Use the method of Wang and Liu to compute the h-point for each of a selection of Trump speeches found here: https://github.com/unendin/Trump_Campaign_Corpus/tree/master/text


## Setup of workbook:

1. Set up workbook for maximum speed (GPU)
2. Import necessary modules
3. Load in the data
4. Data pre-processing
5. Part-of-Speech tagging and lemmatisation
6. Compute h-point

### 1. Set up workbook:

Google Colab generously gives you one GPU (graphics processing unit) to run computations on.
For highly parallel numerical work, a GPU is much quicker than a CPU because it can perform many more floating-point operations (FLOPs; read "calculations") per second.

To turn this feature on, go to:
Edit > Notebook Settings > Change the hardware accelerator to GPU

In [0]:
import tensorflow as tf # Imported ahead of the other modules (see section 2) because
                        # we need it to check that the GPU is available
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name)) 

Found GPU at: /device:GPU:0


### 2. Import necessary modules

In [0]:
import string
import collections
!pip install spacy
import spacy



### 3. Load in the data

In [0]:
from google.colab import drive
drive.mount('/content/drive')   # Run this code and follow the instructions to mount your drive

Mounted at /content/drive


In [0]:
text_path = 'drive/My Drive/Colab Notebooks/Deep Learning/OSGD Workbooks/Katy Tur.txt' # replace this with where you've saved the text file in Drive

In [0]:
def read_words(filename):
    # Read the file and return its lines with leading/trailing whitespace stripped
    with open(filename, "r") as f:
        sentences = []
        for line in f:
            sentences.append(line.strip())
    return sentences

speech = read_words(text_path) # import a single speech as a list of lines

### 4. Pre-process / clean the data

First, we remove anything not spoken by Donald Trump (audience input etc.).

Then we strip the "Donald Trump:" speaker label from each line he speaks and convert everything to lower case, so that counting is case-insensitive.
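
As a toy illustration of these two steps, here is a minimal sketch on a few made-up transcript lines. The "SPEAKER: text" layout, the `INTERVIEWER` label and the helper names `keep_line` and `trump_text` are assumptions made for this example only. It uses `str.partition` rather than the `split(':')` approach in the notebook's own functions, so text containing a second colon is kept whole.

```python
# Made-up transcript lines in the assumed "SPEAKER: text" layout.
sample = [
    "DONALD TRUMP: We've had tremendous crowds.",
    "(APPLAUSE)",
    "INTERVIEWER: Have you been to Iowa?",
    "",
    "I love the people there.",
]

# Annotation markers to drop (a subset, for illustration).
ANNOTATIONS = ("LAUGHTER", "APPLAUSE", "INAUDIBLE", "CROSSTALK")

def keep_line(line):
    """Return True for non-empty lines that contain no annotation marker."""
    return line != "" and not any(tag in line for tag in ANNOTATIONS)

def trump_text(line):
    """Lower-case the line and return only what Donald Trump says on it."""
    speaker, sep, rest = line.lower().partition(":")
    if sep == "":  # no speaker label: keep the whole line
        return speaker
    return rest if speaker == "donald trump" else ""

cleaned = [trump_text(line) for line in sample if keep_line(line)]
print(cleaned)
```

The interviewer's line is blanked rather than removed, which mirrors how the notebook's version leaves empty strings in the cleaned list.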

In [0]:
def clean_line(line):
  # Return False for empty lines and for lines containing a transcript annotation or an unidentified speaker
  markers = ('LAUGHTER', 'APPLAUSE', 'UNIDENTIFIED', 'INAUDIBLE', 'CROSSTALK', 'PROTESTERS')
  if line == '':
    return False
  return not any(marker in line for marker in markers)

def remove_donald_punc(line):
  line = line.replace('\u2019', "'") # normalise typographic apostrophes to straight ones
  line = line.lower()
  line = line.split(':')

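  if len(line) != 2: # no colon, or more than one: keep the text before the first colon
    line = line[0]
  else:
    if line[0] == 'donald trump': # keep only the text that Donald Trump says
      line = line[1]
    else:
      line = '' # blank out lines spoken by anyone else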

  return line

In [0]:
cleaned1 = [x for x in speech if clean_line(x)] # Remove audience participation etc. using the function above
cleaned2 = [remove_donald_punc(x) for x in cleaned1] # Strip the "Donald Trump:" speaker labels and blank out other speakers' lines
cleaned2[0:10] # Show the first ten results

[" oh, i've been to iowa many times. i've been to new hampshire many, many times. i love the people there and we've had tremendous success. we've had tremendous crowds.",
 'nobody gets as many standing ovations and, you know, i spent a lot of time out.',
 "i was in south carolina recently and we're all over.",
 "i'm going -- this weekend, i'll be with clint eastwood in california. a tremendous group of people.",
 "i'm going to arizona this weekend. i'll be all over the place.",
 " because i'm doing television with you and i am up there actually a lot. and i watch them up there walking the streets, and it didn't mean anything. i was actually getting more news coverage than anybody else by far because i'm the one that brought up the whole situation and the whole mess with immigration and what the mexican government is doing to us.",
 "so, you know, i didn't have to be and i would have been if they wanted me to. and i just decided that probably it wasn't necessary. i'm going up actually n

In [0]:
# Convert back into one long string of text:

full_text = '' 
for i in range(len(cleaned2)):
  full_text = full_text + ' ' + cleaned2[i]

full_text # Show result.

'  oh, i\'ve been to iowa many times. i\'ve been to new hampshire many, many times. i love the people there and we\'ve had tremendous success. we\'ve had tremendous crowds. nobody gets as many standing ovations and, you know, i spent a lot of time out. i was in south carolina recently and we\'re all over. i\'m going -- this weekend, i\'ll be with clint eastwood in california. a tremendous group of people. i\'m going to arizona this weekend. i\'ll be all over the place.  because i\'m doing television with you and i am up there actually a lot. and i watch them up there walking the streets, and it didn\'t mean anything. i was actually getting more news coverage than anybody else by far because i\'m the one that brought up the whole situation and the whole mess with immigration and what the mexican government is doing to us. so, you know, i didn\'t have to be and i would have been if they wanted me to. and i just decided that probably it wasn\'t necessary. i\'m going up actually next week 

### 5. Part-of-Speech (POS) tagging and Lemmatisation

Here we label each word as a noun, verb, adjective etc. (POS tagging), so that we can extract the thematic words later.

We also reduce each word to its lemma (its dictionary form, e.g. "went" becomes "go"), so that inflected forms of the same word are counted together.
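
To see why lemmatising first gives more meaningful frequencies, here is a minimal sketch that counts the same word list with and without lemmas. The `TOY_LEMMAS` table and the word list are hand-written assumptions for illustration; in the notebook itself the lemmas come from spaCy, not from a lookup table like this.

```python
from collections import Counter

# Hand-written lemma table, for illustration only (spaCy derives lemmas itself).
TOY_LEMMAS = {"went": "go", "going": "go", "goes": "go", "countries": "country"}

words = ["going", "went", "country", "countries", "goes", "people"]

# Counting surface forms: every form is a separate entry with count 1.
surface_counts = Counter(words)

# Counting lemmas: inflected forms are pooled under one entry.
lemma_counts = Counter(TOY_LEMMAS.get(w, w) for w in words)

print(surface_counts)
print(lemma_counts.most_common(2))  # [('go', 3), ('country', 2)]
```

Pooling matters for the h-point because it depends on the rank-frequency ordering, which changes when the forms of one word are split across several entries.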

In [0]:
nlp = spacy.load("en_core_web_sm") # load the small English model from the spaCy package we imported
doc = nlp(full_text) # run the spaCy pipeline over the text (tokenisation, POS tagging, lemmatisation)

In [0]:
lem_pos = {}
tokens = []
lemmas = []
pos = []

for token in doc:
  lem_pos[token.lemma_] = token.pos_ # map each lemma to its POS tag (if a lemma occurs with several tags, the last one seen is kept)
  tokens.append(token.text) # ordered list of the words as they appear
  lemmas.append(token.lemma_) # the lemma that each word maps to
  pos.append(token.pos_) # the POS tag of each word

lemmas2 = []
for i in range(len(lemmas)):
  if pos[i] not in ('SPACE', 'PUNCT', 'SYM', 'NUM'): # exclude whitespace, punctuation, symbols and numerals from the h-point calculation
    lemmas2.append(lemmas[i])


In [0]:
print(len(tokens))
print(len(lemmas))
print(len(pos))
print(len(lemmas2))

5758
5758
5758
4834


### 6. Compute h-point

- Gather frequencies
- Rank frequencies
- Calculate h-point
- Return thematic words
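
For reference, the h-point computed in this section can be written as below. Here $f(r)$ is the frequency of the lemma at rank $r$; the symbols $r_i$ and $r_j$ are notation introduced for this note and denote the two adjacent ranks ($r_j = r_i + 1$) with $f(r_i) > r_i$ and $f(r_j) < r_j$. The second case is the point where the straight line joining $(r_i, f(r_i))$ and $(r_j, f(r_j))$ crosses the line $f = r$.

```latex
h =
\begin{cases}
  r, & \text{if there is a rank } r \text{ with } f(r) = r, \\[6pt]
  \dfrac{f(r_i)\, r_j - f(r_j)\, r_i}{r_j - r_i + f(r_i) - f(r_j)}, & \text{otherwise.}
\end{cases}

% Worked example with the values from this speech:
% r_i = 26, f(r_i) = 27 and r_j = 27, f(r_j) = 26
h = \frac{27 \cdot 27 - 26 \cdot 26}{27 - 26 + 27 - 26} = \frac{53}{2} = 26.5
```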

In [0]:
def frequency_and_h(data): # function to return the frequency and rank of each lemma, as well as the h-point
    
  counter = collections.Counter(data)
  count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0])) # count each lemma and sort by descending frequency, breaking ties alphabetically

  lem_freq_h = []
  result = True # use booleans to stop the conditions applying beyond the h-point
  cut = False
  bigger = False

  for i in range(len(count_pairs)): # loop over every lemma, including the last one
    ri = count_pairs[i]
    item = (ri[0], ri[1], i+1) # put the rank next to each lemma
    lem_freq_h.append(item)

    if ri[1] == i+1 and result: # check whether a frequency equals its rank at any point
      cut = i+1
      result = False

    if ri[1] < i+1 and not bigger: # otherwise, interpolate the h-point between the two frequencies that straddle it
      ri_1 = count_pairs[i-1]
      cut = (ri_1[1]*(i+1) - (i)*ri[1]) / (1 + ri_1[1] - ri[1]) # h-point formula from the paper
      bigger = True

  return lem_freq_h, cut
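
To check the interpolation by hand, here is a minimal standalone sketch of the same calculation on short, made-up frequency lists. The function name `h_point` and the example lists are assumptions for illustration and are separate from `frequency_and_h` above; the sketch assumes the frequencies are positive and already sorted in descending order.

```python
def h_point(freqs):
    """Return the h-point of positive frequencies sorted in descending order."""
    for rank, freq in enumerate(freqs, start=1):
        if freq == rank:  # the frequency equals its rank exactly
            return float(rank)
        if freq < rank:  # first rank whose frequency has dropped below it
            prev_rank, prev_freq = rank - 1, freqs[rank - 2]
            # Intersection of the line f = r with the segment joining
            # (prev_rank, prev_freq) and (rank, freq).
            return (prev_freq * rank - freq * prev_rank) / (
                rank - prev_rank + prev_freq - freq
            )
    return None  # the frequencies never fall to or below their rank

print(h_point([10, 7, 4, 4, 2]))  # exact match at rank 4 -> 4.0
print(h_point([9, 6, 5, 3, 1]))   # crossing between ranks 3 and 4 -> 11/3
```

In the second list the frequency is 5 at rank 3 and 3 at rank 4, so the h-point is (5*4 - 3*3) / (4 - 3 + 5 - 3) = 11/3, roughly 3.67.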

Now we just need to return the thematic words:

In [0]:
def return_thematic(lemmas, lem_pos):

  count_pairs, cut = frequency_and_h(lemmas) # using the function above to extract the h-point and the ordered lemmas.
  print("h-point is: {}".format(cut)) # print h-point
  print('')

  print('Words above h-point: (Word, Frequency, Rank): ')
  for i in range(round(cut) + 1):
    print(count_pairs[i]) # print words above the h point.

  print('')
  print('Thematic words (nouns and adjectives): ')
  for i in range(round(cut) + 1):
    lemma = count_pairs[i][0] # keep only the lemmas tagged as nouns or adjectives
    if lem_pos[lemma] == 'NOUN' or lem_pos[lemma] == 'ADJ':
      print(lemma)

  print('')
  print('All nouns & adjectives by frequency/rank:')

  for i in range(len(count_pairs)):
    lemma = count_pairs[i][0] # keep only the lemmas tagged as nouns or adjectives
    if lem_pos[lemma] == 'NOUN' or lem_pos[lemma] == 'ADJ':
      print(count_pairs[i])

  print('')
  print('Top 100 words by frequency:')
  for i in range(min(100, len(count_pairs))): # avoid an IndexError on texts with fewer than 100 distinct lemmas
    print(count_pairs[i])

  print('')

  return 'Done'



In [0]:
return_thematic(lemmas2, lem_pos)

h-point is: 26.5

Words above h-point: (Word, Frequency, Rank): 
('-PRON-', 597, 1)
('be', 360, 2)
('i', 174, 3)
('the', 166, 4)
('a', 128, 5)
('to', 127, 6)
('and', 115, 7)
('have', 105, 8)
('not', 97, 9)
('do', 89, 10)
('that', 87, 11)
('of', 82, 12)
('in', 66, 13)
('go', 48, 14)
('people', 46, 15)
('about', 36, 16)
('will', 35, 17)
('because', 31, 18)
('if', 30, 19)
('take', 30, 20)
('know', 29, 21)
('for', 28, 22)
('make', 28, 23)
('what', 28, 24)
('on', 27, 25)
('very', 27, 26)
('country', 26, 27)

Thematic words (nouns and adjectives): 
people
country

All nouns & adjectives by frequency/rank:
('people', 46, 15)
('country', 26, 27)
('mexico', 26, 28)
('great', 25, 30)
('job', 20, 38)
('many', 16, 46)
('money', 16, 47)
('time', 16, 48)
('immigration', 15, 49)
('lot', 15, 50)
('china', 14, 54)
('tremendous', 14, 55)
('bad', 13, 56)
('gun', 12, 62)
('state', 12, 65)
('way', 12, 66)
('well', 12, 67)
('immigrant', 11, 70)
('hillary', 10, 73)
('illegal', 10, 74)
('number', 10, 77)
('pr

'Done'