# Parts-of-Speech Tagging - First Steps: Working with text files, Creating a Vocabulary and Handling Unknown Words

In this lecture notebook you will create a vocabulary from a tagged dataset and learn how to deal with words that are not present in this vocabulary when working with other text sources. Aside from this you will also learn how to:
 
- read text files
- work with defaultdict
- work with string data
 

In [169]:
import string
from collections import defaultdict
from collections import Counter  # I imported this myself, to try doing the same job with it as well


### Read Text Data

A tagged dataset taken from the Wall Street Journal is provided in the file `WSJ_02-21.pos`. 

To read this file you can use Python's context manager by using the `with` keyword and specifying the name of the file you wish to read. To actually save the contents of the file into memory you will need to use the `readlines()` method and store its return value in a variable. 

Python's context managers are convenient because you don't need to close the file explicitly; it is closed automatically when the `with` block exits:

In [171]:
# Read lines from 'WSJ_02-21.pos' file and save them into the 'lines' variable
with open("WSJ_02-21.pos", 'r') as f:
    lines = f.readlines()
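
As a small sketch of what the context manager does, the snippet below writes a made-up three-line sample (the file name `sample.pos` and its contents are assumptions for illustration, not part of the dataset) to a temporary directory, reads it back with `readlines()`, and checks that the file is already closed once the `with` block exits.

```python
import os
import tempfile

# Made-up sample in the same "word<TAB>tag" layout as the dataset
sample = "In\tIN\nan\tDT\n\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pos")  # hypothetical file name

    with open(path, "w") as f:
        f.write(sample)

    with open(path, "r") as f:
        sample_lines = f.readlines()

    # The file object still exists here, but it has already been closed
    file_is_closed = f.closed

print(sample_lines)    # ['In\tIN\n', 'an\tDT\n', '\n']
print(file_is_closed)  # True
```

Note that `readlines()` keeps the trailing newline of every line, and that an empty line in the file comes back as the string `'\n'`.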


In [172]:
lines

['In\tIN\n',
 'an\tDT\n',
 'Oct.\tNNP\n',
 '19\tCD\n',
 'review\tNN\n',
 'of\tIN\n',
 '``\t``\n',
 'The\tDT\n',
 'Misanthrope\tNN\n',
 "''\t''\n",
 'at\tIN\n',
 'Chicago\tNNP\n',
 "'s\tPOS\n",
 'Goodman\tNNP\n',
 'Theatre\tNNP\n',
 '(\t(\n',
 '``\t``\n',
 'Revitalized\tVBN\n',
 'Classics\tNNS\n',
 'Take\tVBP\n',
 'the\tDT\n',
 'Stage\tNN\n',
 'in\tIN\n',
 'Windy\tNNP\n',
 'City\tNNP\n',
 ',\t,\n',
 "''\t''\n",
 'Leisure\tNN\n',
 '&\tCC\n',
 'Arts\tNNS\n',
 ')\t)\n',
 ',\t,\n',
 'the\tDT\n',
 'role\tNN\n',
 'of\tIN\n',
 'Celimene\tNNP\n',
 ',\t,\n',
 'played\tVBN\n',
 'by\tIN\n',
 'Kim\tNNP\n',
 'Cattrall\tNNP\n',
 ',\t,\n',
 'was\tVBD\n',
 'mistakenly\tRB\n',
 'attributed\tVBN\n',
 'to\tTO\n',
 'Christina\tNNP\n',
 'Haag\tNNP\n',
 '.\t.\n',
 '\n',
 'Ms.\tNNP\n',
 'Haag\tNNP\n',
 'plays\tVBZ\n',
 'Elianti\tNNP\n',
 '.\t.\n',
 '\n',
 'Rolls-Royce\tNNP\n',
 'Motor\tNNP\n',
 'Cars\tNNPS\n',
 'Inc.\tNNP\n',
 'said\tVBD\n',
 'it\tPRP\n',
 'expects\tVBZ\n',
 'its\tPRP$\n',
 'U.S.\tNNP\n',
 's

In [173]:
type(lines)

list

In [174]:
lines[0]

# 'word \t tag \n'

'In\tIN\n'

In [85]:
len(lines)

989860

To check the contents of the dataset you can print the first 5 lines:

In [175]:
# Print columns for reference
print("\t\tWord", "\tTag\n")

# Print first five lines of the dataset
for i in range(5):
    print(f'line number {i+1}: {lines[i]}')


		Word 	Tag

line number 1: In	IN

line number 2: an	DT

line number 3: Oct.	NNP

line number 4: 19	CD

line number 5: review	NN



Each line within the dataset has a word followed by its corresponding tag. From the printed output it can be inferred that the **word** and the **tag** are separated by a tab (or some spaces) and that there is a newline at the end of each line (notice the blank line between consecutive printed lines: `print` adds its own newline after the one already stored in the line). 

If you want to understand the meaning of these tags you can take a look [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

To better understand how the information is structured in the dataset it is recommended to print an unformatted version of it:

In [176]:
# Print first line (unformatted)
lines[0]

'In\tIN\n'

Indeed there is a tab between the word and the tag and a newline at the end of each line.
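
One possible way to turn such a raw line into a clean (word, tag) pair is to strip the trailing newline and split on the tab. This is only an illustrative sketch; the helper name `parse_line` is hypothetical and is not used elsewhere in this notebook.

```python
def parse_line(line):
    """Split a raw 'word<TAB>tag<NEWLINE>' line into a (word, tag) tuple."""
    word, tag = line.rstrip("\n").split("\t")
    return word, tag

print(repr("In\tIN\n"))        # repr() makes the tab and the newline visible
print(parse_line("In\tIN\n"))  # ('In', 'IN')
```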

### Creating a vocabulary

Now that you understand how the dataset is structured, you will create a vocabulary out of it. A vocabulary is made up of every word that appeared at least 2 times in the dataset. 
For this, follow these steps:
- Get only the words from the dataset
- Use a defaultdict to count the number of times each word appears
- Filter the dict to only include words that appeared at least 2 times
- Create a list out of the filtered dict
- Sort the list

For step 1 you can use the fact that every word and tag are separated by a tab and that words always come first. Using a list comprehension, the words list can be created like this:

In [184]:
# Get the words from each line in the dataset
words = [line.split('\t')[0] for line in lines]
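
To see what this comprehension does with the empty lines that separate sentences, here is a minimal sketch on a handful of lines copied from the output above (the variable names `sample_lines` and `sample_words` are only for illustration). An empty line contains no tab, so the whole string `'\n'` ends up in the list as if it were a word.

```python
sample_lines = ['Ms.\tNNP\n', 'Haag\tNNP\n', '.\t.\n', '\n', 'Rolls-Royce\tNNP\n']

# Same comprehension as above, applied to the small sample
sample_words = [line.split('\t')[0] for line in sample_lines]

print(sample_words)  # ['Ms.', 'Haag', '.', '\n', 'Rolls-Royce']
```

This is why a `'\n'` entry shows up in the frequency counts and has to be excluded when the vocabulary is built.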


In [89]:
#splitting a string

In [182]:
a = ['1\ta\n', '2\tb\n', '3\tc\n', '4\td\n', '5\te\n']

In [183]:
for i in range(len(a)):
    print(a[i].split('\t'))

# i.e. this splits the string


['1', 'a\n']
['2', 'b\n']
['3', 'c\n']
['4', 'd\n']
['5', 'e\n']


In [92]:
for i in range(len(a)):
    print(a[i].split('\t')[0])

# i.e. this splits the string and keeps only the first piece


1
2
3
4
5


In [193]:
'abc'.startswith('a')


True

In [200]:
a = []
for i in range(len(lines)):
    if lines[i].startswith('book\t'):
        print(i)
        a.append(i)
print(len(a))

19855
21766
24915
53506
58710
59481
151575
161986
258523
268467
288924
292856
311129
311353
311861
312193
312353
312382
312873
326437
326934
327295
327872
337643
342959
348846
367585
367608
370631
371022
389011
406105
445181
490941
491313
553071
556008
605512
616724
616747
621317
720055
723427
723550
723693
724031
724165
727630
754824
774212
776973
790840
811509
814459
825055
842745
864166
864216
865825
881242
881246
893140
893565
901713
909286
955471
960917
972715
978080
988638
70


In [199]:
for i in a:
    print(lines[i].split("\t")[1])
    

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN

NN



Step 2 can be done easily by leveraging `defaultdict`. In case you aren't familiar with them, defaultdicts are a special kind of dictionary that **returns the "zero" value of a type if you try to access a key that does not exist**. Since you want the frequencies of words, you should define the defaultdict with a type of `int`. 

Now you don't need to worry about the case when the word is not present within the dictionary because getting the value for that key will simply return a zero. Isn't that cool?

In [185]:
# Define defaultdict of type 'int'
freq = defaultdict(int)

# Count frequency of occurrence for each word in the dataset
for word in words:
    freq[word] += 1
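
The behaviour described above can be checked on a toy example. The sketch below (with made-up words) also shows a detail that is easy to miss: merely reading a missing key from a `defaultdict` inserts it, whereas `.get()` does not.

```python
from collections import defaultdict

counts = defaultdict(int)

print(counts["unseen"])    # 0, the "zero" value of int
print("unseen" in counts)  # True: the lookup itself inserted the key

for word in ["the", "cat", "the"]:
    counts[word] += 1

print(dict(counts))        # {'unseen': 0, 'the': 2, 'cat': 1}

# .get() reads without inserting anything
print(counts.get("dog", 0), "dog" in counts)  # 0 False
```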


In [94]:
freq

defaultdict(int,
            {'In': 1740,
             'an': 3143,
             'Oct.': 318,
             '19': 100,
             'review': 58,
             'of': 22929,
             '``': 6967,
             'The': 6833,
             'Misanthrope': 3,
             "''": 6787,
             'at': 4362,
             'Chicago': 197,
             "'s": 9311,
             'Goodman': 7,
             'Theatre': 5,
             '(': 1153,
             'Revitalized': 1,
             'Classics': 1,
             'Take': 9,
             'the': 41107,
             'Stage': 3,
             'in': 15186,
             'Windy': 1,
             'City': 139,
             ',': 48723,
             'Leisure': 3,
             '&': 1034,
             'Arts': 8,
             ')': 1160,
             'role': 125,
             'Celimene': 4,
             'played': 53,
             'by': 4495,
             'Kim': 7,
             'Cattrall': 1,
             'was': 3903,
             'mistakenly': 6,
             'att

In [95]:
type(freq)

collections.defaultdict

In [187]:
a = dict(freq)
type(freq)

collections.defaultdict

In [188]:
type(a)

dict

In [189]:
# my way
freq2 = Counter(words)


In [99]:
freq2

Counter({',': 48723,
         'the': 41107,
         '\n': 39832,
         '.': 39020,
         'of': 22929,
         'to': 22198,
         'a': 19284,
         'and': 16115,
         'in': 15186,
         "'s": 9311,
         'that': 7992,
         'for': 7976,
         '$': 7184,
         '``': 6967,
         'is': 6938,
         'The': 6833,
         "''": 6787,
         'said': 5615,
         'on': 5145,
         '%': 4942,
         'it': 4656,
         'by': 4495,
         'from': 4459,
         'million': 4384,
         'at': 4362,
         'as': 4256,
         'with': 4245,
         'Mr.': 4159,
         'was': 3903,
         'be': 3725,
         'are': 3677,
         'its': 3584,
         'has': 3303,
         "n't": 3222,
         'an': 3143,
         'will': 3078,
         'have': 3042,
         'he': 2658,
         'or': 2500,
         'company': 2477,
         'year': 2239,
         'which': 2218,
         'would': 2184,
         'about': 2060,
         '--': 2038,
        

In [100]:
len(freq2)

44390

In [101]:
freq2['``']

6967

In [102]:
len(freq)

44390

In [103]:
freq['``']

6967

In [104]:
# So both dicts are exactly the same: the same keys and the same corresponding values!
# The Counter approach is therefore the more concise one.

dict(freq) == dict(freq2)

True
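
For comparison, here is a minimal sketch of `Counter` on a toy word list (the words are made up). It counts in one call, offers `most_common()`, and returns 0 for a missing key without inserting it.

```python
from collections import Counter

sample_words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
sample_freq = Counter(sample_words)

print(sample_freq.most_common(2))  # [('the', 3), ('cat', 2)]
print(sample_freq["dog"])          # 0
print("dog" in sample_freq)        # False: the lookup did not add the key
```

The printed representation of a `Counter` lists entries from most to least common, which is why the `freq2` output above starts with `','` and `'the'` while `freq` is shown in insertion order.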

In [190]:
a = {'a':1, 'b':2,'c':3}
b = {'a':1, 'b':2,'c':3}

a==b

True

In [191]:
a = {'a':1, 'b':2,'c':3}
b = {'a':1, 'b':2,'c':4}

a==b

False

In [192]:
a = {'a':1, 'b':2,'c':3, 'd':4}
b = {'a':1, 'b':2,'c':3}

a==b

False

Filtering the `freq` dictionary can be done using list comprehensions again (aren't they handy?). You should filter out words that appeared only once and also words that are just a newline character:

In [152]:
import time

In [201]:
# their way

s = time.time()

# Create the vocabulary by filtering the 'freq' dictionary
vocab = [k for k, v in freq.items() if (v > 1 and k != '\n')]

e = time.time()

e-s

0.007978200912475586
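
A single `time.time()` measurement of a few milliseconds tends to be noisy. One hedged alternative is the standard-library `timeit` module, which repeats the call many times. The sketch below uses a made-up frequency table (`toy_freq`) and a hypothetical helper (`build_vocab`) that wraps the same filtering comprehension; the numbers it prints will differ from machine to machine.

```python
import timeit

# Toy frequency table standing in for 'freq' (made-up values 0, 1 or 2)
toy_freq = {f"word{i}": i % 3 for i in range(10_000)}
toy_freq["\n"] = 500

def build_vocab(freq_table):
    """Keep words seen at least twice, skipping the newline pseudo-word."""
    return [k for k, v in freq_table.items() if v > 1 and k != "\n"]

# Average the run time over 100 repetitions
seconds_per_run = timeit.timeit(lambda: build_vocab(toy_freq), number=100) / 100

print(f"{seconds_per_run:.6f} s per run")
print(len(build_vocab(toy_freq)))  # 3333
```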

In [168]:
freq['\n']

39832

In [157]:
# # my way
# import pandas as pd

# series = pd.Series(freq)



In [161]:

# s = time.time()

# filtered_series = series[(series > 1) & (series.index != '\n')]

# # Get the filtered keys as a list
# #vocab = filtered_series.index.tolist()

# e = time.time()

# e-s

# # so pandas is not faster, maybe when the series is bigger, it would be better


0.006821632385253906

In [167]:
vocab

['In',
 'an',
 'Oct.',
 '19',
 'review',
 'of',
 '``',
 'The',
 'Misanthrope',
 "''",
 'at',
 'Chicago',
 "'s",
 'Goodman',
 'Theatre',
 '(',
 'Take',
 'the',
 'Stage',
 'in',
 'City',
 ',',
 'Leisure',
 '&',
 'Arts',
 ')',
 'role',
 'Celimene',
 'played',
 'by',
 'Kim',
 'was',
 'mistakenly',
 'attributed',
 'to',
 'Christina',
 'Haag',
 '.',
 'Ms.',
 'plays',
 'Rolls-Royce',
 'Motor',
 'Cars',
 'Inc.',
 'said',
 'it',
 'expects',
 'its',
 'U.S.',
 'sales',
 'remain',
 'steady',
 'about',
 '1,200',
 'cars',
 '1990',
 'luxury',
 'auto',
 'maker',
 'last',
 'year',
 'sold',
 'Howard',
 'president',
 'and',
 'chief',
 'executive',
 'officer',
 'he',
 'anticipates',
 'growth',
 'for',
 'Britain',
 'Europe',
 'Far',
 'Eastern',
 'markets',
 'INDUSTRIES',
 'increased',
 'quarterly',
 '10',
 'cents',
 'from',
 'seven',
 'a',
 'share',
 'new',
 'rate',
 'will',
 'be',
 'payable',
 'Feb.',
 '15',
 'A',
 'record',
 'date',
 'has',
 "n't",
 'been',
 'set',
 'Bell',
 'based',
 'Los',
 'Angeles',


In [150]:
type(vocab)

list

In [151]:
len(vocab)

23767

Finally, the `sort` method will take care of the final step. Notice that it changes the list directly so you don't need to reassign the `vocab` variable:

In [202]:
# Sort the vocabulary
vocab.sort()

# Print a few consecutive entries of the sorted vocabulary
for i in range(4000, 4005):
    print(vocab[i])

Early
Earnings
Earth
Earthquake
East
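
Two details of `sort` are worth a quick sketch on a toy list (the entries are picked from the vocabulary for illustration): it returns `None` because it works in place, and it orders strings by character code, so digits and upper-case words come before lower-case ones.

```python
sample_vocab = ["the", "East", "an", "Early", "19", "``"]

result = sample_vocab.sort()  # sorts in place and returns None
print(result)                 # None
print(sample_vocab)           # ['19', 'Early', 'East', '``', 'an', 'the']

# sorted() returns a new list; key=str.lower gives a case-insensitive order
print(sorted(["b", "A", "a"], key=str.lower))  # ['A', 'a', 'b']
```

This ordering explains why the entries around index 4000 of the sorted vocabulary are all capitalized.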


Now you have successfully created a vocabulary from the dataset. **Great job!** The vocabulary is quite extensive so it is not printed out but you can still do so by creating a cell and running something like `print(vocab)`. 

At this point you will usually write the vocabulary into a file for future use, but that is out of the scope of this notebook. If you are curious it is very similar to how you read the file at the beginning of this notebook.
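
For the curious, a minimal sketch of writing a vocabulary to disk and reading it back could look like the following. The one-word-per-line layout and the file name `vocab.txt` are assumptions made for this example, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile

sample_vocab = ["Early", "Earth", "East"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.txt")  # hypothetical file name

    # Write one word per line
    with open(path, "w") as f:
        f.write("\n".join(sample_vocab) + "\n")

    # Read the words back, dropping the trailing newlines
    with open(path, "r") as f:
        loaded_vocab = [line.rstrip("\n") for line in f]

print(loaded_vocab == sample_vocab)  # True
```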


In [203]:
# Save the frequency dictionary locally
import json

with open('freq.json', 'w') as f:
    json.dump(dict(freq2), f)

# The advantage of JSON is that it can be loaded straight back as a dictionary
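
The round trip can be sketched without touching the disk by using `json.dumps` and `json.loads`, the string counterparts of `json.dump` and `json.load`. The toy counts below are made up for illustration.

```python
import json
from collections import Counter

sample_freq = Counter(["the", "cat", "the"])

# Serialize to a string (json.dump does the same into a file object)
text = json.dumps(dict(sample_freq))
restored = json.loads(text)

print(restored)                          # {'the': 2, 'cat': 1}
print(type(restored).__name__)           # dict
print(Counter(restored) == sample_freq)  # True
```

The loaded object is a plain `dict`, so it has to be wrapped in `Counter(...)` or `defaultdict(int, ...)` again if that behaviour is needed.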


## Processing new text sources

### Dealing with unknown words

Now that you have a vocabulary, you will use it when processing new text sources. **A new text will have words that do not appear in the current vocabulary**. To tackle this, you can simply classify each new word as an unknown one, but you can do better by creating a function that tries to classify the type of each unknown word and assign it a corresponding `unknown token`. 

This function will do the following checks and return an appropriate token:

   - Check if the unknown word contains any character that is a digit 
       - return `--unk_digit--`
   - Check if the unknown word contains any punctuation character 
       - return `--unk_punct--`
   - Check if the unknown word contains any upper-case character 
       - return `--unk_upper--`
   - Check if the unknown word ends with a suffix that could indicate it is a noun, verb, adjective or adverb 
        - return `--unk_noun--`, `--unk_verb--`, `--unk_adj--`, `--unk_adv--` respectively

If a word fails to fall under any condition then its token will be a plain `--unk--`. The conditions will be evaluated in the same order as listed here. So if a word contains a punctuation character but does not contain digits, it will fall under the second condition. To achieve this behaviour some if/elif statements can be used along with early returns. 

This function is implemented next. Notice that the `any()` function is being heavily used. It returns `True` if at least one of the cases it evaluates is `True`.

In [108]:
def assign_unk(word):
    """
    Assign tokens to unknown words
    """
    
    # Punctuation characters
    # Try printing them out in a new cell!
    punct = set(string.punctuation)         # string.punctuation has no repeated characters; it is deliberately turned into a set anyway
    
    # Suffixes | see the following cells for explanations
    noun_suffix = ["action", "age", "ance", "cy", "dom", "ee", "ence", "er", "hood", "ion", "ism", "ist", "ity", "ling", "ment", "ness", "or", "ry", "scape", "ship", "ty"]
    verb_suffix = ["ate", "ify", "ise", "ize"]
    adj_suffix = ["able", "ese", "ful", "i", "ian", "ible", "ic", "ish", "ive", "less", "ly", "ous"]
    adv_suffix = ["ward", "wards", "wise"]

    # Loop the characters in the word, check if any is a digit
    if any(char.isdigit() for char in word):
        return "--unk_digit--"  # early return

    # Loop the characters in the word, check if any is a punctuation character
    elif any(char in punct for char in word):
        return "--unk_punct--"

    # Loop the characters in the word, check if any is an upper case character
    elif any(char.isupper() for char in word):
        return "--unk_upper--"

    # Check if word ends with any noun suffix
    elif any(word.endswith(suffix) for suffix in noun_suffix):
        return "--unk_noun--"

    # Check if word ends with any verb suffix
    elif any(word.endswith(suffix) for suffix in verb_suffix):
        return "--unk_verb--"

    # Check if word ends with any adjective suffix
    elif any(word.endswith(suffix) for suffix in adj_suffix):
        return "--unk_adj--"

    # Check if word ends with any adverb suffix
    elif any(word.endswith(suffix) for suffix in adv_suffix):
        return "--unk_adv--"
    
    # If none of the previous criteria is met, return plain unknown
    return "--unk--"
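
To see how the order of the checks plays out, the sketch below uses a condensed restatement of the function (named `assign_unk_compact` here, a hypothetical name; it is intended to apply the same checks in the same order, using the fact that `str.endswith` accepts a tuple of suffixes) and runs it on a few made-up words.

```python
import string

PUNCT = set(string.punctuation)

# (token, suffixes) pairs in the same order as the checks in assign_unk
SUFFIX_RULES = [
    ("--unk_noun--", ("action", "age", "ance", "cy", "dom", "ee", "ence", "er",
                      "hood", "ion", "ism", "ist", "ity", "ling", "ment",
                      "ness", "or", "ry", "scape", "ship", "ty")),
    ("--unk_verb--", ("ate", "ify", "ise", "ize")),
    ("--unk_adj--", ("able", "ese", "ful", "i", "ian", "ible", "ic", "ish",
                     "ive", "less", "ly", "ous")),
    ("--unk_adv--", ("ward", "wards", "wise")),
]

def assign_unk_compact(word):
    """Condensed sketch of assign_unk: first matching check wins."""
    if any(ch.isdigit() for ch in word):
        return "--unk_digit--"
    if any(ch in PUNCT for ch in word):
        return "--unk_punct--"
    if any(ch.isupper() for ch in word):
        return "--unk_upper--"
    for token, suffixes in SUFFIX_RULES:
        if word.endswith(suffixes):  # str.endswith accepts a tuple
            return token
    return "--unk--"

examples = ["covid19", "e-mail", "Tardis", "happiness",
            "scrutinize", "quickly", "clockwise", "tardigrade"]
for w in examples:
    print(f"{w:12} {assign_unk_compact(w)}")
```

Two results are worth noticing: `quickly` is tagged `--unk_adj--` because `ly` is in the adjective list, and `clockwise` is tagged `--unk_verb--` because the `ise` check runs before the `wise` check. The suffix rules are a rough heuristic, not a grammar.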


In [109]:
string.punctuation
# I guess all of the ASCII punctuation marks are present here


'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [110]:
# Here are examples of words using the provided suffixes, categorized by noun, verb, adjective, and adverb suffixes:

# ### Noun Suffixes

# - **action**: reaction, attraction
# - **age**: marriage, footage
# - **ance**: importance, assistance
# - **cy**: democracy, accuracy
# - **dom**: kingdom, freedom
# - **ee**: employee, trainee
# - **ence**: existence, confidence
# - **er**: teacher, writer
# - **hood**: childhood, neighborhood
# - **ion**: action, education
# - **ism**: realism, criticism
# - **ist**: artist, scientist
# - **ity**: activity, responsibility
# - **ling**: sibling, underling
# - **ment**: agreement, development
# - **ness**: happiness, darkness
# - **or**: actor, conductor
# - **ry**: bravery, machinery
# - **scape**: landscape, seascape
# - **ship**: friendship, leadership
# - **ty**: loyalty, certainty

# ### Verb Suffixes

# - **ate**: activate, educate
# - **ify**: simplify, notify
# - **ise**: advertise, criticise (Note: "ise" is more common in British English)
# - **ize**: organize, realize

# ### Adjective Suffixes

# - **able**: readable, comfortable
# - **ese**: Chinese, Japanese
# - **ful**: hopeful, beautiful
# - **i**: Iraqi, Pakistani
# - **ian**: Canadian, Italian
# - **ible**: visible, flexible
# - **ic**: artistic, energetic
# - **ish**: childish, foolish
# - **ive**: creative, active
# - **less**: hopeless, fearless
# - **ly**: friendly, likely
# - **ous**: joyous, dangerous

# ### Adverb Suffixes

# - **ward**: forward, backward
# - **wards**: towards, homewards
# - **wise**: otherwise, clockwise

# These examples illustrate how each suffix transforms the root word to change its meaning and grammatical function.

In [111]:
# The code snippet is using a generator expression with the `any()` function to check if any character in the word is a digit. Here's a breakdown of what each part does:

# 1. **Loop the characters in the word**: The generator expression `char.isdigit() for char in word` iterates over each character in the `word`.
# 2. **Check if any character is a digit**: The `char.isdigit()` checks if the current character is a digit (0-9).
# 3. **any() function**: The `any()` function returns `True` if at least one of the values produced by the generator expression is `True`. If all values are `False`, it returns `False`.

# ### Explanation with a Simple Example

# Let's consider the word "hello123":

# 1. **Loop the characters in the word**: The characters in "hello123" are 'h', 'e', 'l', 'l', 'o', '1', '2', '3'.
# 2. **Check if any character is a digit**: For each character, `char.isdigit()` is evaluated:
#    - 'h'.isdigit() → False
#    - 'e'.isdigit() → False
#    - 'l'.isdigit() → False
#    - 'l'.isdigit() → False
#    - 'o'.isdigit() → False
#    - '1'.isdigit() → True
#    - Since '1' is a digit, the rest of the characters are not checked.
# 3. **any() function**: Since there is at least one `True` value from `char.isdigit()`, `any()` returns `True`.

# Thus, if the word contains any digit, the condition inside the `if` statement will be `True` and the function will return `"--unk_digit--"`.

# ### Complete Example Code


def check_for_digits(word):
    # Loop the characters in the word, check if any is a digit
    if any(char.isdigit() for char in word):
        return "--unk_digit--"
    else:
        return word

# Test the function with a word containing digits
print(check_for_digits("hello123"))  # Output: --unk_digit--
# Test the function with a word without digits
print(check_for_digits("hello"))     # Output: hello


# In the first test case, "hello123" contains digits, so the function returns `"--unk_digit--"`. In the second test case, "hello" does not contain any digits, so the function returns the original word "hello".

--unk_digit--
hello


In [112]:
# error expected
1.isdigit()

SyntaxError: invalid syntax. Perhaps you forgot a comma? (3278614508.py, line 2)

In [113]:
'1'.isdigit()


True

In [114]:
b = 'Hello123'
a = (char.isdigit() for char in b)
a

<generator object <genexpr> at 0x0000029EDB0E9230>

In [115]:
b = 'Hello123'
a = any(char.isdigit() for char in b)
a

True
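
The claim that `any()` stops at the first `True` can be verified with a small sketch. The helper `is_digit` and the list `checked` are hypothetical names introduced only to record which characters actually get examined.

```python
checked = []

def is_digit(char):
    """Record each character that gets examined, then test it."""
    checked.append(char)
    return char.isdigit()

result = any(is_digit(char) for char in "hello123")

print(result)   # True
print(checked)  # ['h', 'e', 'l', 'l', 'o', '1']
```

The characters `'2'` and `'3'` never reach `is_digit`, because the generator is consumed lazily and `any()` returns as soon as it sees the first `True`.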

In [116]:
a = 'ansxyz'
a.endswith('xyz')


True

In [117]:
a = 'ansxyz'
a.endswith('qyz')


False

In [118]:
def check_number(num):
    if num > 0:
        return "Positive"
        
    print('1st')
    if num < 0:
        return "Negative"
    print('2nd')
    return "Zero"

# Test the function
print(check_number(5))   # Output: Positive

# See, none of the lines after the 'Positive' return ran


Positive


In [119]:
print(check_number(-3))  # Output: Negative


1st
Negative


In [120]:
print(check_number(0))   # Output: Zero


1st
2nd
Zero


A POS tagger will always encounter words that are not within the vocabulary that is being used. By augmenting the dataset to include these `unknown word tokens` you are helping the tagger to have a better idea of the appropriate tag for these words. 

### Getting the correct tag for a word

All that is left is to implement a function that will get the correct tag for a particular word taking special considerations for unknown words. Since the dataset provides each word and tag within the same line and a word being known depends on the vocabulary used, these two elements should be arguments to this function.

This function should check if a line is empty and if so, it should return a placeholder word and tag, `--n--` and `--s--` respectively. 

If not, it should process the line to return the correct word and tag pair, considering if a word is unknown in which scenario the function `assign_unk()` should be used.

The function is implemented next. Notice that the `split()` method can be used without specifying the delimiter, in which case it splits on any run of whitespace and discards leading and trailing whitespace.

In [137]:
def get_word_tag(line, vocab):
    # If line is empty return placeholders for word and tag
    if not line.split():    # split() method can be used without specifying the delimiter, in which case it will default to any whitespace.
        word = "--n--"
        tag = "--s--"
    else:
        # Split line to separate word and tag
        word, tag = line.split()
        # Check if word is not in vocabulary
        if word not in vocab: 
            # Handle unknown word
            word = assign_unk(word)
    return word, tag
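
One design note, offered as a suggestion rather than as part of the original function: `word not in vocab` scans the whole list when `vocab` is a list, while a `set` answers the same question with a hash lookup. When many lines are processed, converting the vocabulary once may be worthwhile. The names `vocab_list` and `vocab_set` below are made up for this sketch.

```python
vocab_list = ["In", "an", "review", "the"]  # stand-in for the sorted vocab list
vocab_set = set(vocab_list)                 # built once, reused for every lookup

for word in ["the", "tardigrade"]:
    # Both containers give the same answer; only the lookup cost differs
    print(word, word in vocab_list, word in vocab_set)
```

Since `get_word_tag` only uses `in` on its second argument, it should work unchanged when it is passed a set instead of a list.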
    

In [138]:
# explanation

def check_string(s):
    if not s:
        return "The string is empty."
    else:
        return "The string is not empty."

# Test the function
print(check_string(""))    # Output: The string is empty.


The string is empty.


In [139]:
print(check_string("hi"))  # Output: The string is not empty.


The string is not empty.


In [140]:
a = 'qwerty'
a.split()


['qwerty']

In [213]:
a = 'qwerty'
b, c = a.split()


ValueError: not enough values to unpack (expected 2, got 1)

In [142]:
lines[0]

'In\tIN\n'

In [214]:
a = 'In\tIN\n'
b, c = a.split()


In [215]:
b

'In'

In [216]:
c
# So split() also splits on \t by default

'IN'

In [217]:
a = 'In\tIN\n'
b = a.split()


In [218]:
b

['In', 'IN']
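
The difference between `split()` and `split('\t')` matters most for the empty lines, as this short sketch shows:

```python
line = "In\tIN\n"
print(line.split())       # ['In', 'IN']    any whitespace, newline dropped
print(line.split("\t"))   # ['In', 'IN\n']  only tabs, newline stays on the tag

blank = "\n"
print(blank.split())      # []      an empty list, which is falsy
print(blank.split("\t"))  # ['\n']  a one-element list, which is truthy
```

This is why `if not line.split():` is a reliable test for an empty line in `get_word_tag`.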

Now you can try this function with some examples to test that it is working as intended:

In [146]:
get_word_tag('\n', vocab)

('--n--', '--s--')

Since this line only includes a newline character it returns a placeholder word and tag.

In [147]:
get_word_tag('In\tIN\n', vocab)

('In', 'IN')

This one is a valid line and the function does a fair job at returning the correct (word, tag) pair.

In [209]:
get_word_tag('tardigrade\tNN\n', vocab)

('--unk--', 'NN')

This line includes a noun that is not present in the vocabulary. 

The `assign_unk` function fails to detect that it is a noun so it returns an `unknown token`.

In [148]:
get_word_tag('scrutinize\tVB\n', vocab)

('--unk_verb--', 'VB')

This line includes a verb that is not present in the vocabulary. 

In this case the `assign_unk` is able to detect that it is a verb so it returns an `unknown verb token`.

**Congratulations on finishing this lecture notebook!** Now you should be more familiar with working with text data and have a better understanding of how a basic POS tagger works.

**Keep it up!**

In [212]:
get_word_tag('ans\ta', vocab)

('--unk--', 'a')