<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/10_Frequency_Analysis_The_Current.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Frequency of words in comments for The Current**

What are the most frequent words in the comments people leave in The Current data? Are certain words more frequent in some questions than others, or are words used in a relatively equal manner across the questions? What about the messiness of the data: how much cleaning and preprocessing must be done in order to make sense of it?

These are all valid questions to ask as a way to start thinking about research questions and how they might be answered using computational linguistic approaches. In this notebook, we will create and compare frequency distributions of words in the different comment data for The Current.

First, load in the nltk resources.

In [7]:
# import the main nltk module
import nltk

# download the nltk.book resources
nltk.download('book')

# import the resources
from nltk.book import *

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     C:\Users\Ming\AppData\Roaming

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


Now, we want to load in some of the data from The Current. I will load in data for two questions: whether petrol cars should be banned, and whether freedom camping should be illegal.



In [2]:
# petrol car data
!wget "https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp001.txt"

# freedom camping data
!wget "https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp017.txt"

--2023-11-16 13:22:19--  https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp001.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 220746 (216K) [text/plain]
Saving to: 'tp001.txt'

     0K .......... .......... .......... .......... .......... 23% 1.11M 0s
    50K .......... .......... .......... .......... .......... 46% 1.69M 0s
   100K .......... .......... .......... .......... .......... 69% 3.32M 0s
   150K .......... .......... .......... .......... .......... 92% 7.83M 0s
   200K .......... .....                                      100% 13.8M=0.09s

2023-11-16 13:22:19 (2.22 MB/s) - 'tp001.txt' saved [220746/220746]

--2023-11-16 13:22:20--  https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp017.

Read each file and save its contents to a variable as a list of lines: strip the trailing newline from the end of the file, then split on newlines so that each list item is one line of the file.

In [3]:
petrol = open('tp001.txt', encoding="utf8").read().rstrip().split('\n')
camping = open('tp017.txt', encoding="utf8").read().rstrip().split('\n')
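
As a minimal sketch of what this cell does, the example below writes a tiny made-up file and reads it back in the same way. The file name `toy_comments.txt`, its contents, and the "rating, tab, comment" layout of each line are assumptions for illustration; the real files may differ in their details. A `with` block is used here so that the file is closed automatically once reading finishes.

```python
import os
import tempfile

# hypothetical toy data: two lines in an assumed "rating<TAB>comment" layout,
# with a trailing newline at the end of the file
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_comments.txt')
with open(toy_path, 'w', encoding='utf8') as f:
    f.write('5\tban petrol cars\n1\tkeep my car\n')

# read the whole file, strip the trailing newline, then split into lines
with open(toy_path, encoding='utf8') as f:
    toy_lines = f.read().rstrip().split('\n')

print(toy_lines)
```

Without the call to `.rstrip()`, the final newline would leave an empty string as the last item of the list.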

Currently each text is a list of separate lines, but we might want to represent all the comments in one container, so that each text is a single string containing all of its comments. We can do this by first splitting each line on the tab character to separate the comment from the rating, and then gluing the comments together with `.join()`. Note that we call `' '.join()` (with a space between the quotes) rather than `''.join()`, so that a space is placed between each comment.

In [4]:
# glue the comments together, note that the call to .join() has a space between the quotes
petrol_text = ' '.join([comment.split('\t')[1] for comment in petrol])

In [5]:
# look - a single string
petrol_text
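
To see why the space in `' '.join()` matters, here is a small sketch using two made-up lines. The toy data and the variable names are hypothetical, and the layout (rating first, then a tab, then the comment) is assumed from the way the code above takes index `[1]` after splitting on the tab.

```python
# hypothetical lines in the assumed "rating<TAB>comment" layout
toy_lines = ['5\tban petrol cars', '1\tkeep my car']

# split each line on the tab and keep the second field (the comment)
toy_comments = [line.split('\t')[1] for line in toy_lines]

# joining with '' runs the last word of one comment into the first word of the next
glued = ''.join(toy_comments)

# joining with ' ' keeps a space between comments
spaced = ' '.join(toy_comments)

print(glued)   # ban petrol carskeep my car
print(spaced)  # ban petrol cars keep my car
```

With the empty-string version, `cars` and `keep` merge into one "word", which would later be counted as a single token.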



Ok, now to create a frequency distribution of the words, we first need to split the text into tokens. Let's use `nltk.word_tokenize()` to tokenize the text and `FreqDist()` to count the tokens.

In [8]:
# first create the tokens
petrol_tokens = nltk.word_tokenize(petrol_text)

In [9]:
# now create the Frequency Distribution
petrol_fdist = nltk.FreqDist(petrol_tokens)
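
The two steps can be sketched on a toy sentence without downloading anything. This sketch swaps in standard-library stand-ins: `str.split()` in place of `nltk.word_tokenize()`, and `collections.Counter` in place of `nltk.FreqDist` (which is itself a subclass of `Counter`). The toy text is made up, and its punctuation is already separated by spaces to imitate what a tokenizer would produce.

```python
from collections import Counter

# hypothetical text, with punctuation pre-separated to mimic tokenizer output
toy_text = 'we need cars . cars are useful !'

# stand-in for nltk.word_tokenize(): split on whitespace
toy_tokens = toy_text.split()

# stand-in for nltk.FreqDist(): count how often each token occurs
toy_counts = Counter(toy_tokens)

print(toy_counts['cars'])         # 2
print(toy_counts.most_common(1))  # [('cars', 2)]
```

Punctuation marks are counted as tokens here too, which is why `.` and `!` can show up among the most frequent items of a real text.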

Okay, now that there is a frequency distribution of the petrol text, we can look at what the most frequent words in those comments are! Let's look at the top 20 most frequent words using `.most_common()`.

What do you think of the results?

In [10]:
petrol_fdist.most_common(20)

[('to', 1467),
 ('the', 1452),
 ('.', 1244),
 ('and', 950),
 ('we', 893),
 ('a', 690),
 ('it', 676),
 ('is', 676),
 ('be', 651),
 ('for', 611),
 ('cars', 539),
 ('will', 538),
 ('i', 530),
 ('of', 508),
 ('are', 472),
 ('our', 432),
 ('that', 428),
 ('because', 424),
 ('!', 410),
 ('petrol', 363)]

### **Your Turn**

Can you repeat this to make a frequency distribution for the camping text? You just need to repeat the above code but with the camping text instead of the petrol text! This is the output you should see if you ask for the 20 most frequent words:

```
[('.', 1591),
 ('to', 1562),
 ('the', 1419),
 ('and', 1120),
 ('it', 899),
 ('be', 863),
 ('is', 792),
 ('people', 770),
 ('camping', 674),
 ('a', 663),
 ('i', 631),
 ('should', 628),
 ('of', 573),
 ('freedom', 563),
 ('for', 482),
 ('we', 478),
 ('that', 428),
 ('nature', 401),
 ('in', 392),
 ('not', 388)]

```

In [20]:
# code cells for making fdist for camping
camp_string = ' '.join([comment.split('\t')[1] for comment in camping])
camp_token = nltk.word_tokenize(camp_string)
camp_fdist = nltk.FreqDist(camp_token)
camp_fdist.most_common(20)

[('.', 1591),
 ('to', 1562),
 ('the', 1419),
 ('and', 1120),
 ('it', 899),
 ('be', 863),
 ('is', 792),
 ('people', 770),
 ('camping', 674),
 ('a', 663),
 ('i', 631),
 ('should', 628),
 ('of', 573),
 ('freedom', 563),
 ('for', 482),
 ('we', 478),
 ('that', 428),
 ('nature', 401),
 ('in', 392),
 ('not', 388)]
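
Because the same three steps (join, tokenize, count) were run for both texts, they could also be wrapped in a small helper function. The sketch below is one possible way to do that, not the notebook's own method: the function name `top_words` and the toy data are hypothetical, and it uses `str.split()` and `collections.Counter` as stand-ins for `nltk.word_tokenize()` and `nltk.FreqDist()` so that it runs without any downloads.

```python
from collections import Counter

def top_words(lines, n=20):
    """Return the n most frequent whitespace-separated tokens in the comments."""
    # keep the second tab-separated field of each line (assumed to be the comment)
    text = ' '.join(line.split('\t')[1] for line in lines)
    # str.split() is a simple stand-in for nltk.word_tokenize()
    return Counter(text.split()).most_common(n)

# hypothetical toy data in the assumed "rating<TAB>comment" layout
toy_petrol = ['5\tban petrol cars', '4\tpetrol cars pollute']
toy_camping = ['2\tfreedom camping is fine', '1\tcamping leaves rubbish']

print(top_words(toy_petrol, 2))
print(top_words(toy_camping, 1))
```

In the notebook itself, the same idea would apply with the body of the function swapped for the `nltk` calls used above.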

### **Discuss**

- What words are repeated between the two texts?
- What words are unique to the texts?
- Does this analysis tell us anything about the ability of computational measures to identify features of different texts?
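
One way to start on the first two questions is with set operations. The sketch below types in by hand the ten most frequent tokens from each output shown above; the variable names are made up for illustration. In the notebook, the same lists could instead be built from the frequency distributions, for example with `[word for word, count in petrol_fdist.most_common(10)]`.

```python
# the ten most frequent tokens from each output above, typed in by hand
petrol_top = ['to', 'the', '.', 'and', 'we', 'a', 'it', 'is', 'be', 'for']
camping_top = ['.', 'to', 'the', 'and', 'it', 'be', 'is', 'people', 'camping', 'a']

# tokens that appear in both top-ten lists
shared = set(petrol_top) & set(camping_top)

# tokens that appear in only one of the lists
only_petrol = set(petrol_top) - set(camping_top)
only_camping = set(camping_top) - set(petrol_top)

print(sorted(shared))        # ['.', 'a', 'and', 'be', 'is', 'it', 'the', 'to']
print(sorted(only_petrol))   # ['for', 'we']
print(sorted(only_camping))  # ['camping', 'people']
```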

## **What about them stopwords?**

Most of the words that are repeated across the two texts are so-called function words: determiners such as `the` or `an`, prepositions such as `in` or `on`, and so on. These words *are* important in English for making meaning, but maybe we don't want them in this analysis?

Your challenge is to create a frequency distribution for each text that only counts words of 4 or more characters. What will this do to the results?

In [21]:
camp_l4 = [word for word in camp_token if len(word) >= 4]
camp_l4_fdist = nltk.FreqDist(camp_l4)
camp_l4_fdist.most_common(20) 

[('people', 770),
 ('camping', 674),
 ('should', 628),
 ('freedom', 563),
 ('that', 428),
 ('nature', 401),
 ('think', 361),
 ('environment', 317),
 ('because', 275),
 ('they', 266),
 ('more', 240),
 ('camp', 225),
 ('dont', 218),
 ('with', 211),
 ('will', 209),
 ('this', 208),
 ('have', 206),
 ('rubbish', 178),
 ('there', 177),
 ('campers', 160)]
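
The length filter is a rough shortcut: function words such as `that`, `they`, and `with` are four letters long, so they survive it. An alternative is to filter against an explicit list of stopwords. The sketch below uses a tiny hand-picked set and made-up tokens, purely for illustration. NLTK also ships a fuller English list via `nltk.corpus.stopwords.words('english')`, which may first need `nltk.download('stopwords')`.

```python
from collections import Counter

# a tiny hand-picked stopword set, for illustration only
toy_stopwords = {'that', 'they', 'with', 'this', 'have', 'there', 'the', 'to', 'and', 'is'}

# hypothetical tokens, as a tokenizer might return them
toy_tokens = ['people', 'think', 'that', 'they', 'have', 'to', 'camp', 'with', 'people', '.']

# length filter, as in the cell above: keeps 'that', 'they', 'have', 'with'
by_length = [w for w in toy_tokens if len(w) >= 4]

# stopword filter: drop listed words and any token that is not alphabetic
by_stoplist = [w for w in toy_tokens if w not in toy_stopwords and w.isalpha()]

print(Counter(by_length).most_common(3))
print(Counter(by_stoplist).most_common(3))
```

The stopword version keeps short content words that a length cutoff would throw away, at the cost of having to choose and maintain a list.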

In [34]:
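import nltk

# a multi-line string with no sentence-final punctuation
words = """This is thing 
this is another thing 
third thing"""

# sent_tokenize() finds no sentence boundaries here, so it returns one "sentence";
# remove the newline characters left inside it
sentences = [sentence.replace("\n", '') for sentence in nltk.sent_tokenize(words)]
sentences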

['This is thing this is another thing third thing']