<a href="https://colab.research.google.com/github/MK316/applications/blob/main/Tagging_CorpusToolKit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# English Text Tagging Using the Corpus Toolkit [by Kristopher Kyle](https://github.com/kristopherkyle/corpus_toolkit)

---

Given a plain-text file, this notebook creates a part-of-speech-tagged version of the text and a frequency list of the tagged words.
Tags used: Universal POS tags (https://universaldependencies.org/u/pos/)

In [None]:
!pip install corpus-toolkit



## [1] Create a folder named 'txtdata'
The Corpus Toolkit processes the files in a specified folder, so we first create a folder and then upload the text files into it.

In [None]:
import os
# exist_ok=True avoids a FileExistsError when the folder already exists (e.g., on a rerun)
os.makedirs("txtdata", exist_ok=True)

## [2] Upload files (on Colab) to the 'txtdata' folder
Example file: DoveAndAnt.txt
https://raw.githubusercontent.com/MK316/mynltkdata/main/data/DoveAndAnt.txt

In [None]:
# Upload files to the 'txtdata' folder using the file panel on the left.

print("Have you uploaded texts to tag? Type y or n")
answer = input()

if answer == 'y':
    print("OK, proceed to the next step.")
else:
    print("Try again once the file is in the 'txtdata' folder. (This is just a check.)")


Have you uploaded texts to tag? Type y or n
y
OK, proceed to the next step.
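
The manual y/n check above relies on the user's answer. As a minimal sketch (not part of the Corpus Toolkit), the folder can also be inspected directly; the helper name `list_txt_files` is made up for illustration.

```python
from pathlib import Path


def list_txt_files(folder="txtdata"):
    """Return the sorted names of .txt files directly inside `folder`."""
    path = Path(folder)
    if not path.is_dir():
        # The folder has not been created yet.
        return []
    return sorted(p.name for p in path.glob("*.txt"))


files = list_txt_files("txtdata")
if files:
    print("Found:", ", ".join(files))
else:
    print("No .txt files in 'txtdata' yet. Upload one and run this cell again.")
```

An empty list means either that the folder is missing or that it holds no `.txt` files, so the message is the same in both cases.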


## Check the current directory and change it if needed

---
Current data location: /content/txtdata/DoveAndAnt.txt

The current directory should be '/content',
and the input directory is 'txtdata', where the text file is located.


In [None]:
!pwd
# Uncomment the next line if the output above is not /content
# %cd /content/

In [None]:
from corpus_toolkit import corpus_tools as ct
txt_corp = ct.ldcorpus("txtdata")   # load and read the text files in the 'txtdata' directory
tok_corp = ct.tokenize(txt_corp)    # tokenize the corpus; by default this also lemmatizes
word_freq = ct.frequency(tok_corp)  # create a frequency dictionary
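
To make the idea of a frequency dictionary concrete, here is a minimal sketch using only the standard library. It does not show how `ct.frequency` is implemented; the token list is made-up example data, and `toy_freq` is a name chosen for illustration.

```python
from collections import Counter

# Toy tokens standing in for one tokenized, lemmatized text (made-up data).
tokens = ["the", "dove", "see", "the", "ant", "the", "ant", "fall"]

# Count how often each token occurs; a Counter is a dict subclass.
toy_freq = Counter(tokens)

print(toy_freq["the"])          # 3
print(toy_freq.most_common(2))  # [('the', 3), ('ant', 2)]
```

Each key is an item and each value is its count, the same item-to-count shape that the later cells convert into a two-column table.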

## Write a tagged file under '/content/tagged_txt'


In [None]:
# tagged_txt (tagged data folder), txtdata (original data folder)
ct.write_corpus("tagged_txt",ct.tag(ct.ldcorpus("txtdata")))

In [None]:
tagged_freq = ct.frequency(ct.reload("tagged_txt"))
ct.head(tagged_freq, hits = 10)

In [None]:
type(tagged_freq) # dict

## Save tagged_freq as a DataFrame

In [None]:
import pandas as pd

In [None]:
# Convert the frequency dictionary into a two-column DataFrame (item, count)
data_list = list(tagged_freq.items())
df = pd.DataFrame(data_list)
print(df)

                  0    1
0          ﻿the_NUM    1
1        visual_ADJ    2
2      village_NOUN    1
3        before_ADP    4
4           the_DET  115
..              ...  ...
821    caution_NOUN    1
822  thoroughly_ADV    1
823       sound_ADJ    1
824   judgment_NOUN    1
825      enjoy_VERB    1

[826 rows x 2 columns]
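
Row 0 of the output above (`the_NUM`) appears to begin with an invisible byte order mark (U+FEFF), which may be why that occurrence of "the" was tagged NUM rather than DET. A minimal sketch, assuming the source text file is UTF-8 with a BOM: Python's `utf-8-sig` codec drops the mark when reading. The file name `bom_demo.txt` and its contents are made up for illustration, and this sketch does not show how `ct.ldcorpus` reads files.

```python
import os
import tempfile

# Write a small demo file that starts with a UTF-8 byte order mark.
demo_path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"\xef\xbb\xbfThe Dove and the Ant")

# Plain utf-8 keeps the mark as an invisible first character.
with open(demo_path, encoding="utf-8") as f:
    raw_text = f.read()

# utf-8-sig strips the mark if it is present.
with open(demo_path, encoding="utf-8-sig") as f:
    clean_text = f.read()

print(repr(raw_text[:4]))    # '\ufeffThe'
print(repr(clean_text[:3]))  # 'The'
```

Re-saving the text file without the mark before uploading it would be one way to keep the first token clean.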


## Write the tagged frequency DataFrame to a CSV file (on Colab)

In [None]:
df.to_csv(r'/content/tagged.csv', index=False)

## Read the tagged CSV file for further processing

In [None]:
df1 = pd.read_csv('tagged.csv') 

## Try R from here

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
df2<-read.csv('tagged.csv')

In [None]:
%%R
head(df2)

            X0  X1
1      the_NUM   1
2   visual_ADJ   2
3 village_NOUN   1
4   before_ADP   4
5      the_DET 115
6     age_NOUN   2


In [None]:
%%R
colnames(df2) <- c('Tagwords','Freq')
colnames(df2)

[1] "Tagwords" "Freq"    


In [None]:
%%R

# Take the first tagged item and strip the "word_" prefix, leaving only the tag
t1 <- df2$Tagwords
t2 <- t1[1]; t2
# t3 <- as.vector(strsplit(t2, '_')); t3
t3 <- gsub("[a-zA-Z]+_", "", t2); t3

[1] "NUM"


In [None]:
%%R
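df3 <- df2
head(df3)
pos <- df3$Tagwords
word <- df3$Tagwords

# POS: drop everything up to and including the underscore, keeping the tag
POS <- gsub("[[:alnum:][:punct:]]+_", "", pos)
# WORD: drop the underscore and the uppercase tag, keeping the word
WORD <- gsub("_[A-Z]+", "", word)
df4 <- cbind.data.frame(Tagged = df3$Tagwords, Words = WORD, POS = POS, Freq = df3$Freq)
df4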
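
The word/tag split done with `gsub` above can also be sketched in Python. This is a minimal illustration, not part of the notebook's pipeline: it assumes each item has the form `word_TAG` and splits at the last underscore, and the helper name `split_tagged` is made up.

```python
def split_tagged(item):
    """Split 'word_TAG' at the last underscore into (word, tag)."""
    word, sep, tag = item.rpartition("_")
    if not sep:
        # No underscore: treat the whole item as the word, with no tag.
        return item, ""
    return word, tag


for item in ["visual_ADJ", "village_NOUN", "the_DET"]:
    print(split_tagged(item))
```

Splitting at the last underscore keeps a word that itself contains an underscore intact.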

In [None]:
%%R
# Sort by Freq in decreasing order:
df5 <- df4[order(df4$Freq, decreasing = TRUE), ]; head(df5)
# Add a new index: serial numbering
indx <- seq_len(nrow(df5))

df6 <- data.frame(ID = indx, df5)

# Save the result as a CSV file
write.csv(df6, "tagwordlist.csv", row.names = FALSE)
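
For readers who prefer to stay in Python, the sort, numbering, and CSV steps can be sketched with the standard library. This is a simplified illustration, not a replacement for the R cell: it keeps only the ID, Tagged, and Freq columns, the counts are toy values taken from the output shown earlier, and `to_ranked_rows` is a made-up helper name.

```python
import csv
import io


def to_ranked_rows(freq):
    """Sort (item, count) pairs by count, highest first, and add a 1-based ID."""
    ranked = sorted(freq.items(), key=lambda pair: pair[1], reverse=True)
    return [
        {"ID": i, "Tagged": item, "Freq": count}
        for i, (item, count) in enumerate(ranked, start=1)
    ]


toy = {"visual_ADJ": 2, "the_DET": 115, "before_ADP": 4}
rows = to_ranked_rows(toy)

# Write to an in-memory buffer; use open("tagwordlist.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ID", "Tagged", "Freq"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```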