<a href="https://colab.research.google.com/github/MK316/workshop22/blob/main/class02_voca.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 **Topic 02 Vocabulary learning**

**Table of Contents:**

* Getting familiar with words (sounds) (using 📍_gTTS_)
* Words by POS (using 📍_nltk.frequency distribution, nltk.tagging_)
* Word family in context (using 📍_nltk.concordance_)
* Collocations (using 📍_nltk.collocation_)

💾 Get the sample text ready: Ch01. [Visual village](https://raw.githubusercontent.com/MK316/workshop22/main/data/RE.Ch01.txt). Copy it so that it is ready to paste below :-)

In [1]:
#@markdown 🔳 Paste the text here for analysis:
text = input()

Before the age of the smartphone, aspiring photographers had to learn how to use high-tech cameras and photographic techniques. Not everyone had cameras, and it took skill and a good eye to capture and create a great photograph. Today, with the huge range of camera apps on our smartphones, we are all amateur photographers. And pretty good ones, too: The quality of smartphone images now nearly equals that of digital cameras. The new ease of photography has given us a tremendous appetite for capturing the magical and the ordinary. We are obsessed with documenting everyday moments, whether it’s a shot of our breakfast, our cat – or our cat’s breakfast. And rather than collect pictures in scrapbooks, we share, like, and comment on them with friends and strangers around the globe. Even photojournalists are experimenting with cell phones because their near invisibility makes it easier to capture unguarded media. They can now act as their own publishers – reaching huge audiences via social me

In [9]:
#@markdown 🔳 Import packages: {nltk}, stopwords
%%capture
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download("punkt")

from nltk.corpus import stopwords
nltk.download("stopwords")

In [20]:
#@markdown 🔳 Remove stopwords

words = word_tokenize(text)
words = [w for w in words if len(w) > 1]
print('Total number of words before stopwords: %d'%len(words))

# Convert every token to lower case
words = [w.lower() for w in words]

# Build the stopword set once; a set lookup is faster than re-reading the list for every token
stop_set = set(stopwords.words('english'))
words = [w for w in words if w not in stop_set]
print('Total number of words after stopwords: %d'%len(words))

Total number of words before stopwords: 759
Total number of words after stopwords: 420
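
The drop from 759 to 420 tokens comes from two filters: single-character tokens (mostly punctuation) are removed first, then lowercased tokens found in the stopword list. Here is a minimal sketch of the same logic on one sentence from the sample text. The tiny `TOY_STOPWORDS` set and the regex tokenizer are stand-ins chosen for illustration, not NLTK's actual English list or `word_tokenize`.

```python
import re

# Hypothetical mini stopword list for illustration; NLTK's English list is much longer.
TOY_STOPWORDS = {"we", "are", "all", "the", "of", "a", "and", "to"}

def remove_stopwords(text, stop_set):
    # Rough tokenizer (assumption): runs of letters/apostrophes, or single punctuation marks.
    tokens = re.findall(r"[A-Za-z']+|[^\sA-Za-z']", text)
    # Same filters as the cell above: drop 1-character tokens, lowercase, drop stopwords.
    tokens = [t.lower() for t in tokens if len(t) > 1]
    return [t for t in tokens if t not in stop_set]

sentence = ("Today, with the huge range of camera apps on our smartphones, "
            "we are all amateur photographers.")
filtered = remove_stopwords(sentence, TOY_STOPWORDS)
print(len(filtered), filtered)
```

With the real NLTK list, words such as "with", "on", and "our" would also be removed, so the actual counts depend on which stopword list is used.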


## 🔈 [1] Reading word list:

**Using {gTTS}**

In [42]:
#@markdown 🚩 {gTTS} package installation and import
%%capture
!pip install gTTS
from gtts import gTTS
from IPython.display import Audio

In [68]:
#@markdown 🚩 Making a function { tts ( _text_to_say_) }:
def tts(mytext):
  # Requires gTTS and IPython.display.Audio (installed and imported in the cell above)
  text_to_say = mytext

  # Step 1: Choose the language
  language_to_choose = "en" #@param ["en", "fr","ko",'es']

  print("Play language accent: %s"%language_to_choose)
  language = language_to_choose

  # Step 2: Build the gTTS object
  gtts_object = gTTS(text = text_to_say,
                     lang = language,
                     slow = False)

  # Step 3: Save the audio file; gTTS writes MP3 data, so use the .mp3 extension
  gtts_object.save("mytext.mp3")

  # Output: an audio player for the saved file
  return Audio("mytext.mp3")

#@markdown 🚩 Type of words: e.g., unique words
w1 = list(set(words))

#@markdown 🚩 uniquewords.csv
import pandas as pd
df3 = pd.DataFrame()
df3['Words'] = sorted(w1)

df3.to_csv('uniquewords.csv')


print(len(w1))

text_to_say = '. '.join(w1)
text_to_say

tts(text_to_say)

290
Play language accent: en
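
Because a `set` does not preserve order, the audio may read the words in a different order from the alphabetically sorted `uniquewords.csv`. One possible way to keep the two aligned is to sort the unique words once and build the spoken string from that list. The helper name `build_word_script` is hypothetical, and its result would still need to be passed to `tts(...)`.

```python
def build_word_script(words, sep=". "):
    # Deduplicate, then sort so the spoken order matches a sorted word list.
    unique_sorted = sorted(set(words))
    # Join with ". " as in the cell above, so each word is read as its own item.
    return unique_sorted, sep.join(unique_sorted)

sample = ["camera", "apps", "camera", "photograph", "apps"]
unique_sorted, script = build_word_script(sample)
print(len(unique_sorted))  # number of unique words
print(script)              # string that could be handed to tts(...)
```

For a long word list, the sorted list could also be split into smaller batches before calling `tts`, so that learners can replay one group at a time.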


## [2] Words with POS

In [26]:
#@markdown 🔳 Install package: {corpus-toolkit}

%%capture
!pip install corpus-toolkit

The pasted text is saved as a text file under the txtdata folder, which corpus-toolkit then loads as a corpus.

In [59]:
#@markdown 🔳 Tagging > Tagged file to csv

import os
import pandas as pd
from corpus_toolkit import corpus_tools as ct

# exist_ok=True avoids the FileExistsError that os.mkdir raises when this cell is re-run
os.makedirs("txtdata", exist_ok=True)

with open("txtdata/mytext.txt",'w') as f:
  f.write(text)

brown_corp = ct.ldcorpus("txtdata") #load and read text files under 'txtdata' directory
tok_corp = ct.tokenize(brown_corp)  #tokenize corpus - by default this lemmatizes as well
brown_freq = ct.frequency(tok_corp) #creates a frequency dictionary

# tagged_txt (tagged data folder), txtdata (original data folder)
ct.write_corpus("tagged_txt",ct.tag(ct.ldcorpus("txtdata")))

tagged_freq = ct.frequency(ct.reload("tagged_txt"))
ct.head(tagged_freq, hits = 10)

# Turn the frequency dictionary into a DataFrame sorted by frequency
df = pd.DataFrame(list(tagged_freq.items()), columns=['Words', 'Freq'])
df = df.sort_values(by=['Freq'], ascending=False)
# print(df)

# Write and read back the same path
df.to_csv('/content/tagged.csv', index=False)

data = pd.read_csv('/content/tagged.csv')
data.head()
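
Once tagged frequencies are available, grouping them by part of speech gives the kind of "words by POS" list this section is after. The sketch below assumes each key looks like `word_TAG` (for example `camera_NOUN`), which should be checked against the actual `tagged.csv`. The `toy_freq` dictionary and the helper `group_by_pos` are made up for illustration.

```python
from collections import defaultdict

def group_by_pos(tagged_freq):
    # Assumption: keys are "word_TAG"; split on the last underscore only.
    groups = defaultdict(list)
    for key, freq in tagged_freq.items():
        word, sep, tag = key.rpartition("_")
        if not sep:  # no underscore found: file the key under an UNKNOWN tag
            word, tag = key, "UNKNOWN"
        groups[tag].append((word, freq))
    # Sort each POS group by descending frequency.
    return {tag: sorted(items, key=lambda x: -x[1]) for tag, items in groups.items()}

# Hypothetical frequencies, not taken from the sample text.
toy_freq = {"camera_NOUN": 5, "capture_VERB": 3, "photograph_NOUN": 7, "share_VERB": 1}
by_pos = group_by_pos(toy_freq)
print(by_pos["NOUN"])
print(by_pos["VERB"])
```

The same function could be applied to `tagged_freq` from the cell above, provided its keys follow the assumed pattern.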