<a href="https://colab.research.google.com/github/MK316/mynltkdata/blob/main/Corpus_toolkit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Corpus Toolkit: Source from Kristopher Kyle ([GitHub link](https://github.com/kristopherkyle/corpus_toolkit))

Module installation:

In [None]:
!pip install corpus-toolkit

## Note:
### 1) We need the data file brown_single.zip ([download here](https://github.com/kristopherkyle/Corpus-Methods-Intro/blob/master/Course-Materials/brown_single.zip?raw=true)).
### 2) The folder "brown_single" should be in your working directory. Below, we clone a GitHub repository that contains the data.

* Step 1. Upload the brown_single folder to your GitHub repository

* Step 2. Clone the GitHub repository as follows


In [2]:
# !git clone https://github.com/youraccountname/repositoryname.git
!git clone https://github.com/MK316/mynltkdata.git

Cloning into 'mynltkdata'...
remote: Enumerating objects: 506, done.
remote: Counting objects: 100% (190/190), done.
remote: Compressing objects: 100% (190/190), done.
remote: Total 506 (delta 4), reused 182 (delta 0), pack-reused 316
Receiving objects: 100% (506/506), 2.34 MiB | 17.11 MiB/s, done.
Resolving deltas: 100% (7/7), done.


### Note: Before changing the working directory, check in the Files panel on the left that the repository was cloned properly.

* Step 3. Directory change: %cd /content/repositoryname/


In [3]:
%cd /content/mynltkdata/

/content/mynltkdata


Check current working directory: !pwd

In [4]:
!pwd

/content/mynltkdata


In [None]:
# this is for further preprocessing when necessary
!pip install -U spacy
# python -m spacy download en_core_web_sm

## [1] Load, tokenize, and generate a frequency list
Data: brown_single (500 files)

In [None]:
from corpus_toolkit import corpus_tools as ct
brown_corp = ct.ldcorpus("brown_single") #load and read corpus
tok_corp = ct.tokenize(brown_corp) #tokenize corpus - by default this lemmatizes as well
brown_freq = ct.frequency(tok_corp) #creates a frequency dictionary

In [10]:
len(brown_freq)

35299

In [18]:
brown_freq['species']

30
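
To make the idea of a frequency dictionary concrete, here is a minimal sketch in plain Python. It is not corpus_toolkit's implementation: the function name `toy_frequency` and the two tiny "texts" are made up for illustration, and no lemmatization is applied.

```python
from collections import Counter

def toy_frequency(tokenized_texts):
    """Count tokens across an iterable of token lists (one list per text)."""
    counts = Counter()
    for tokens in tokenized_texts:
        counts.update(tokens)
    return dict(counts)

# Two tiny "texts" standing in for tokenized corpus files (made-up data)
toy_corpus = [["the", "species", "evolve"], ["the", "species", "adapt"]]
toy_freq = toy_frequency(toy_corpus)

print(len(toy_freq))        # number of distinct items (types)
print(toy_freq["species"])  # frequency of a single item
```

As with `brown_freq` above, `len()` gives the number of distinct items and indexing by a word gives its count.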

# Generate concordance lines
Concordance lines can be generated using the concord() function. By default, a random sample of 25 hits will be generated, with 10 tokens of left and right context.

In [None]:
conc_results1 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run","ran","running","runs"],nhits = 10)
for x in conc_results1:
	print(x)
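
The logic behind a concordance (keyword in context) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_concord` and its parameters are hypothetical names, the sentence is made up, and the return format is not claimed to match what ct.concord() produces.

```python
import random

def toy_concord(tokens, targets, nhits=25, left=10, right=10, seed=0):
    """Return up to nhits entries of [left context, hit, right context]."""
    targets = set(targets)
    hits = []
    for i, tok in enumerate(tokens):
        if tok in targets:
            left_ctx = tokens[max(0, i - left):i]
            right_ctx = tokens[i + 1:i + 1 + right]
            hits.append([left_ctx, tok, right_ctx])
    # Draw a random sample only when there are more hits than requested
    if len(hits) > nhits:
        hits = random.Random(seed).sample(hits, nhits)
    return hits

tokens = "she ran home and then he runs to the shop".split()
results = toy_concord(tokens, ["run", "ran", "running", "runs"], left=2, right=2)
for left_ctx, hit, right_ctx in results:
    print(" ".join(left_ctx), "[" + hit + "]", " ".join(right_ctx))
```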

# [Added] Concordance with brown_single texts combined

In [29]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download("punkt")

import pandas as pd
import numpy as np

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [53]:
import os
# find all the txt files in the dataset folder
inputs = []
# url = "/Users/mirankim/Desktop/data/brown_single/"
for file in os.listdir("brown_single"):
    if file.endswith(".txt"):
        inputs.append(os.path.join("brown_single", file))
 
 
# concatenate all txt files into a single file called merged_file.txt
with open('merged_file.txt', 'w', encoding="utf-8") as outfile:
    for fname in inputs:
        with open(fname, encoding="utf-8", errors='ignore') as infile:
            outfile.write(infile.read())

In [65]:
# Open the merged file and read its contents (the with-block closes the file automatically)
with open("merged_file.txt", "r", encoding="utf-8") as infile:
    text = infile.read()

In [55]:
from nltk.tokenize import RegexpTokenizer
retokenize = RegexpTokenizer(r"\w+")  # raw string avoids an invalid-escape warning; keeps runs of word characters
words = retokenize.tokenize(text)

Find concordance example:

In [67]:
item1 = input()  # type the search word when prompted
nltk.Text(words).concordance(item1, width=100, lines=5)  # 100-character lines, at most 5 matches shown

merge
Displaying 5 of 10 matches:
 human dignity Beyond the forest all our paths merge into a single great highway which ends in the 
edral And as the waves flow back and forth and merge with the waves from the neighboring atoms you 
o determine whether it would be appropriate to merge the responses for the purposes of the study Th
hey can amuse each other until we get ready to merge sides All dressing undressing to be more exact
mpetitive advantages to the lines that wish to merge However there is a more profound consideration


The functions ldcorpus() and tokenize() are Python generators, which means that they must be re-declared each time they are used (iterated over). A slightly messier (but more appropriate) way to achieve the results above is to nest the commands.
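
The generator behavior described above can be seen with a small stand-alone example. The generator `toy_loader` is a made-up stand-in, not part of corpus_toolkit; it only shows why a generator yields nothing on a second pass.

```python
def toy_loader(texts):
    """A generator that yields one text at a time, like a lazy corpus reader."""
    for text in texts:
        yield text

gen = toy_loader(["first text", "second text"])
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # the same generator object is now exhausted

print(first_pass)
print(second_pass)

# Calling the generator function again creates a fresh generator
third_pass = list(toy_loader(["first text", "second text"]))
print(third_pass)
```

This is why nesting the calls, so that a fresh generator is created for every analysis, is the safer pattern.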

In [None]:
#note that range can be calculated instead of frequency using the argument calc = "range"
ct.head(brown_freq, hits = 20) #print top 20 items

In [None]:
brown_freq = ct.frequency(ct.tokenize(ct.ldcorpus("brown_single")))
ct.head(brown_freq, hits = 20)

# Concordance lines with collocates

In [71]:
conc_results2 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run","ran","running","runs"],collocates = ["quick","quickly"], nhits = 10)
for x in conc_results2:
	print(x)

Processing cb_cb27.txt (1 of 472 files)
Processing ca_ca28.txt (2 of 472 files)
Processing ck_ck26.txt (3 of 472 files)
Processing cb_cb03.txt (4 of 472 files)
Processing cn_cn27.txt (5 of 472 files)
Processing cn_cn21.txt (6 of 472 files)
Processing cg_cg39.txt (7 of 472 files)
Processing cg_cg15.txt (8 of 472 files)
Processing cg_cg68.txt (9 of 472 files)
Processing ce_ce26.txt (10 of 472 files)
Processing ck_ck27.txt (11 of 472 files)
Processing cl_cl12.txt (12 of 472 files)
Processing cb_cb11.txt (13 of 472 files)
Processing cf_cf15.txt (14 of 472 files)
Processing cf_cf35.txt (15 of 472 files)
Processing ck_ck12.txt (16 of 472 files)
Processing ck_ck10.txt (17 of 472 files)
Processing cp_cp10.txt (18 of 472 files)
Processing cc_cc02.txt (19 of 472 files)
Processing cg_cg35.txt (20 of 472 files)
Processing ck_ck16.txt (21 of 472 files)
Processing ca_ca01.txt (22 of 472 files)
Processing ch_ch26.txt (23 of 472 files)
Processing cn_cn08.txt (24 of 472 files)
Processing cl_cl22.txt (2

* Search terms (and collocate search terms) can also be interpreted as regular expressions:

In [None]:
conc_results3 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run.*","ran"],collocates = ["quick.*"], nhits = 10, regex = True)
for x in conc_results3:
	print(x)
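
As a rough illustration of what regular-expression search terms do, the sketch below filters a made-up token list with Python's `re` module. It assumes that a pattern must match the whole token (`re.fullmatch`); whether corpus_toolkit anchors patterns in exactly this way is an assumption, and `matches_any` is a hypothetical helper.

```python
import re

def matches_any(token, patterns):
    """True if the whole token matches at least one of the patterns."""
    return any(re.fullmatch(p, token) for p in patterns)

tokens = ["run", "runs", "running", "ran", "rang", "quickly", "outrun"]
hits = [t for t in tokens if matches_any(t, ["run.*", "ran"])]
print(hits)
```

Under whole-token matching, "rang" and "outrun" are excluded, while every form beginning with "run" is kept.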

# Create a tagged version of your corpus
The most efficient way to conduct multiple analyses with a tagged corpus is to write a tagged version of your corpus to file and then conduct subsequent analyses with the tagged files. If this is not possible for some reason, one can always run the tagger each time an analysis is conducted.

In [None]:
# tagged_brown = ct.tag(ct.ldcorpus("brown_single"))
# ct.write_corpus("tagged_brown_single",tagged_brown) 
#the first argument is the folder where the tagged files will be written

* The function tag() is also a Python generator, so the preferred way to write a corpus is:

In [None]:
ct.write_corpus("tagged_brown_single",ct.tag(ct.ldcorpus("brown_single")))

* Now, we can reload our tagged corpus using the reload() function and generate a part of speech sensitive frequency list.

In [None]:
tagged_freq = ct.frequency(ct.reload("tagged_brown_single"))
ct.head(tagged_freq, hits = 10)

# Collocation

---


Use the collocator() function to find collocates for a particular word.

In [None]:
collocates = ct.collocator(ct.tokenize(ct.ldcorpus("brown_single")),"go",stat = "MI")
#stat options include: "MI", "T", "freq", "left", and "right"

ct.head(collocates, hits = 10)
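
For intuition, collocate extraction can be reduced to counting the words that occur within a window around a node word. The sketch below shows only this raw-frequency step on a made-up sentence. The function `toy_collocates` and its window sizes are illustrative assumptions, not the collocator() internals, and no association statistic such as MI is computed.

```python
from collections import Counter

def toy_collocates(tokens, node, left=5, right=5):
    """Count tokens that occur within a window around each occurrence of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            counts.update(window)
    return counts

tokens = "we go home then they go away and we go home again".split()
colls = toy_collocates(tokens, "go", left=1, right=1)
print(colls.most_common(3))
```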

# Keyness

---


Keyness is calculated using two frequency dictionaries (consisting of raw frequency values). Only effect sizes are reported (p values are arguably not particularly useful for keyness analyses). Keyness calculation options include "log-ratio", "%diff", and "odds-ratio".
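
To show what a log-ratio effect size expresses, here is a minimal sketch that compares relative frequencies in two toy frequency dictionaries. It is an assumption-laden illustration: `toy_log_ratio`, the smoothing value of 0.5 for unseen items, and the toy counts are all made up, and the exact normalization used by ct.keyness() may differ.

```python
import math

def toy_log_ratio(freq1, freq2, smoothing=0.5):
    """Log2 ratio of relative frequencies; positive scores are key in corpus 1."""
    n1 = sum(freq1.values())
    n2 = sum(freq2.values())
    scores = {}
    for item in set(freq1) | set(freq2):
        # Replace a zero count with a small constant to avoid division by zero
        f1 = freq1.get(item, 0) or smoothing
        f2 = freq2.get(item, 0) or smoothing
        scores[item] = math.log2((f1 / n1) / (f2 / n2))
    return scores

corp_a = {"the": 50, "whale": 40, "ship": 10}   # made-up counts
corp_b = {"the": 50, "jury": 40, "whale": 10}   # made-up counts
key_scores = toy_log_ratio(corp_a, corp_b)
for item, score in sorted(key_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(item, round(score, 2))
```

A score of 2 means the item is four times as frequent, relative to corpus size, in the first corpus as in the second; a score of 0 means no difference.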

In [None]:
import nltk.corpus
nltk.download('brown')
nltk.download('gutenberg')
from nltk.corpus import gutenberg
from nltk.corpus import brown
print(", ".join(brown.words()[:20])) #preview the first 20 tokens rather than printing the whole corpus

In [81]:
nltk.corpus.brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [89]:
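#First, generate frequency lists for each corpus
#ldcorpus() looks for a folder of text files in the working directory;
#nltk.download() saves its data under nltk_data instead, so no "brown" or
#"gutenberg" folder exists here and the calls below report "No files found"
corp1freq = ct.frequency(ct.tokenize(ct.ldcorpus("brown")))
corp2freq = ct.frequency(ct.tokenize(ct.ldcorpus("gutenberg")))

#then calculate Keyness
corp_key = ct.keyness(corp1freq,corp2freq, effect = "log-ratio")
ct.head(corp_key, hits = 10) #to display top hits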

No files found. There may be a problem with your working directory or your file search term.
No files found. There may be a problem with your working directory or your file search term.


In [87]:
!pwd

/content/mynltkdata


In [91]:
%cd /content/mynltkdata/

/content/mynltkdata


# N-grams

---


N-grams are contiguous sequences of n words. The tokenize() function can be used to create an n-gram version of a corpus by employing the ngram argument. By default, words in an n-gram are separated by two underscores ("__").

In [None]:
trigramfreq = ct.frequency(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False, ngram = 3))

In [93]:
ct.head(trigramfreq, hits = 10)

one__of__the	390
the__united__states	339
as__well__as	233
some__of__the	174
the__fact__that	164
out__of__the	159
part__of__the	142
the__end__of	140
i__do__nt	139
it__is__not	131
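
The n-gram items listed above can be reproduced in miniature with a sliding window over a token list. This is a plain-Python sketch rather than the tokenize() implementation, and `toy_ngrams` is a hypothetical helper name.

```python
def toy_ngrams(tokens, n, sep="__"):
    """Join every run of n contiguous tokens with the separator."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["one", "of", "the", "best", "players"]
trigrams = toy_ngrams(tokens, 3)
print(trigrams)
# ['one__of__the', 'of__the__best', 'the__best__players']
```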


# Dependency bigrams

---


Dependency bigrams consist of two words that are syntactically connected via a head-dependent relationship. For example, in the clause "The player kicked the ball", the main verb kicked is connected to the noun ball via a direct object relationship, wherein kicked is the head and ball is the dependent.

The function dep_bigram() generates frequency dictionaries for the dependent, the head, and the dependency bigram. In addition, range is calculated along with a complete list of sentences in which the relationship occurs.
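
The counting step can be sketched without a parser by starting from hand-annotated (dependent, head, relation) triples. Everything here is illustrative: the triples are made up, `toy_dep_bigram` is a hypothetical function, and only the dictionary keys "bi_freq", "dep_freq", and "head_freq" and the dependent_head item format are taken from this notebook. In practice a dependency parser supplies the triples.

```python
from collections import Counter

# Hand-annotated (dependent, head, relation) triples, made up for illustration
parsed = [
    ("ball", "kick", "dobj"),
    ("player", "kick", "nsubj"),
    ("ball", "throw", "dobj"),
    ("ball", "kick", "dobj"),
]

def toy_dep_bigram(triples, relation):
    """Count dependents, heads, and dependent_head pairs for one relation."""
    bi_freq, dep_freq, head_freq = Counter(), Counter(), Counter()
    for dep, head, rel in triples:
        if rel == relation:
            bi_freq[dep + "_" + head] += 1
            dep_freq[dep] += 1
            head_freq[head] += 1
    return {"bi_freq": bi_freq, "dep_freq": dep_freq, "head_freq": head_freq}

toy_bg = toy_dep_bigram(parsed, "dobj")
print(toy_bg["bi_freq"].most_common())
```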

In [94]:
bg_dict = ct.dep_bigram(ct.ldcorpus("brown_single"),"dobj")


Processing cb_cb27.txt (1 of 472 files)
Processing ca_ca28.txt (2 of 472 files)
Processing ck_ck26.txt (3 of 472 files)
Processing cb_cb03.txt (4 of 472 files)
Processing cn_cn27.txt (5 of 472 files)
Processing cn_cn21.txt (6 of 472 files)
Processing cg_cg39.txt (7 of 472 files)
Processing cg_cg15.txt (8 of 472 files)
Processing cg_cg68.txt (9 of 472 files)
Processing ce_ce26.txt (10 of 472 files)
Processing ck_ck27.txt (11 of 472 files)
Processing cl_cl12.txt (12 of 472 files)
Processing cb_cb11.txt (13 of 472 files)
Processing cf_cf15.txt (14 of 472 files)
Processing cf_cf35.txt (15 of 472 files)
Processing ck_ck12.txt (16 of 472 files)
Processing ck_ck10.txt (17 of 472 files)
Processing cp_cp10.txt (18 of 472 files)
Processing cc_cc02.txt (19 of 472 files)
Processing cg_cg35.txt (20 of 472 files)
Processing ck_ck16.txt (21 of 472 files)
Processing ca_ca01.txt (22 of 472 files)
Processing ch_ch26.txt (23 of 472 files)
Processing cn_cn08.txt (24 of 472 files)
Processing cl_cl22.txt (2

In [95]:
ct.head(bg_dict["bi_freq"], hits = 10)
#other keys include "dep_freq", "head_freq", and "range"
#also note that the key "samples" can be used to obtain a list of sample sentences
#but, this is not compatible with the ct.head() function (see ct.dep_conc() instead)

what_do	223
place_take	82
what_say	66
him_told	58
it_do	52
that_do	47
this_do	43
what_mean	43
time_have	42
effect_have	41


# Strength of association

---


Various measures of strength of association can be calculated between dependents and heads. The soa() function takes a dictionary generated by the dep_bigram() function and calculates the strength of association for each dependency bigram.

In [96]:
soa_mi = ct.soa(bg_dict,stat = "MI")


In [97]:
#other stat options include: "T", "faith_dep", "faith_head","dp_dep", and "dp_head"
ct.head(soa_mi, hits = 10)

Class_judge	10.911712979065362
cigarette_smoke	10.824250137815023
nose_scratch	10.712641631963534
suicide_commit	10.610543444344797
nose_blow	10.267856789290636
imagination_capture	10.07521171134824
calendar_adjust	9.812177305514448
English_speak	9.509614535494016
resemblance_bear	9.21127326092427
contract_award	9.156825476901894
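
For readers unfamiliar with MI, the sketch below computes the textbook pointwise mutual information: the base-2 log of the observed bigram frequency divided by the frequency expected if dependent and head were independent. The function `toy_mi` and the counts are made up, and the exact formula used by ct.soa() is not shown in this notebook, so treat this as an approximation of the idea rather than the library's method.

```python
import math

def toy_mi(bi_count, dep_count, head_count, total):
    """Pointwise MI: log2 of observed over expected co-occurrence frequency."""
    expected = dep_count * head_count / total
    return math.log2(bi_count / expected)

# Made-up counts: the pair occurs 8 times among 1024 dependency bigrams
score = toy_mi(bi_count=8, dep_count=8, head_count=16, total=1024)
print(score)  # 6.0
```

High MI values tend to go to pairs whose members rarely occur apart, which is consistent with items such as suicide_commit near the top of the list above.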


# Concordance lines for dependency bigrams

---


A number of excellent cross-platform GUI-based concordancers such as AntConc are freely available, and are likely the preferred method for most concordancing.

However, it is difficult to get concordance lines for dependency bigrams without a more advanced program. The dep_conc() function takes the samples generated by the dep_bigram() function and creates a random sample of hits (50 hits by default) formatted as an HTML file.

The following example will write an HTML file named "dobj_results.html" to your working directory.

In [98]:
ct.dep_conc(bg_dict["samples"],"dobj_results")