
# Classes in Python: complex data types

Sometimes the data that we want to store is complex. You've seen an example in the Universal Dependencies, where each word was represented as its form, lemma, part of speech, and incoming dependency arc. As another example, you may need to store participants in an experiment, each with an anonymized ID, an age, and the Likert-scale ratings (say, on a scale from 1 to 5) that they gave for each experimental item.

For the Universal Dependencies case, we've stored each token as a Python dictionary representing an attribute-value matrix.
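As a reminder of what that looks like, here is a minimal sketch of one token stored as a dictionary. The variable name `token` is only for illustration, and the values are those of the word "Thus" discussed in this notebook.

```python
# One token as an attribute-value matrix stored in a plain dictionary.
token = {
    "id": 1,
    "form": "Thus",
    "lemma": "thus",
    "upos": "ADV",
    "head": 16,
    "deprel": "advmod",
}

# Values are looked up by key.
print(token["form"])    # Thus
print(token["deprel"])  # advmod
```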

But Python has another option for storing complex data: you can define your own data type. You define a new *class* of data, and then you can make individual pieces of data as *objects* of that class. The advantage is that you can also define your own *methods* for interacting with that data type.

Here is an example of a new class for storing Universal Dependencies AVM information. As a reminder, here is the AVM for the first word of the 10th sentence of the UD_English-GUM corpus:


$$
\left[\begin{array}{ll}
\text{id:} & 1\\
\text{form:} & 'Thus'\\
\text{lemma:} & 'thus'\\
\text{upos:} &  'ADV'\\
\text{xpos:} & 'RB'\\
\text{feats:} &  None\\
\text{head:} & 16\\
\text{deprel:}  & 'advmod'\\
\text{deps:}  & None\\
\text{misc:} & \left[\begin{array}{ll}
\text{SpaceAfter:} & 'No'
\end{array}\right]
\end{array}\right]
$$

We'll simplify it a bit for this example:


$$
\left[\begin{array}{ll}
\text{id:} & 1\\
\text{form:} & 'Thus'\\
\text{lemma:} & 'thus'\\
\text{upos:} &  'ADV'\\
\text{head:} & 16\\
\text{deprel:}  & 'advmod'\\
\end{array}\right]
$$

Here is the Universal Dependencies AVM class:

In [2]:
class UDavm:
    def __init__(self, word_id, form, lemma, upos, head, deprel):  # self is the object being initialized
        self.word_id = word_id
        self.form = form
        self.lemma = lemma
        self.upos = upos
        self.head = head
        self.deprel = deprel


This makes a new class of objects, each of which has the attributes `word_id`, `form`, `lemma`, `upos`, `head`, and `deprel`.

We have a new reserved word, `class`. It is followed by a class name (which obeys the same rules as a variable name) and a colon.
The class has one method, with the odd name `__init__`. This is the method that gets called when you make a new object of this class:

In [3]:
# This makes an object of class UDavm
avmobj = UDavm(1, "Thus", "thus", "ADV", 16, "advmod")

# we can now access the attributes in an expression
# <variable holding the object> . <attributename>
print(avmobj.form)
print(avmobj.deprel)


Thus
advmod


When we make the object `avmobj`, the method `__init__()` of the class gets called. It stores values for all the attributes -- but where? Note that the method is defined with one more argument than we pass when we call it: there is an additional `self`. `self` is the object itself. When we state `self.form = form`, we store the value of `form` in an attribute `form` that is attached to the object.
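To see that the values really are attached to each individual object, the built-in function `vars()` can help. The following is a small sketch with a hypothetical two-attribute class `Token`, which is not part of this notebook and is only used for illustration.

```python
class Token:  # hypothetical two-attribute class, for illustration only
    def __init__(self, form, upos):
        # self is the object being initialized; each assignment
        # attaches one attribute to that particular object
        self.form = form
        self.upos = upos

tok1 = Token("Thus", "ADV")
tok2 = Token("visually", "ADV")

# vars() returns the dictionary of attributes attached to one object.
print(vars(tok1))  # {'form': 'Thus', 'upos': 'ADV'}
print(vars(tok2))  # {'form': 'visually', 'upos': 'ADV'}
```

Each object has its own copy of the attributes, so changing `tok1.form` would leave `tok2.form` untouched.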

Here is another object, with different values in its AVM:

In [4]:
avmobj2 = UDavm(1, "visually", "visually", "ADV", 12, "advmod")

avmobj2.form

'visually'

Now let's define the class again, but add a method. Say we want to add functionality that looks up the syllable structure for the word:

In [5]:
import nltk
nltk.download('cmudict')

class UDavm:
    def __init__(self, word_id, form, lemma, upos, head, deprel):
        self.word_id = word_id
        self.form = form
        self.lemma = lemma
        self.upos = upos
        self.head = head
        self.deprel = deprel

        # loading the CMU dictionary
        self.cmudict = nltk.corpus.cmudict.dict()

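    def syllables(self):
        # CMU dictionary keys are lowercase, so lowercase the form first
        first_pronunciation = self.cmudict[self.form.lower()][0]
        # vowel phonemes end in a stress digit (e.g. "IH1"), so they are
        # the entries that are not purely alphabetic: one per syllable
        return len([i for i in first_pronunciation if not i.isalpha()])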




[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.


The new method again has an argument `self`. This time, the body of the method uses `self`, specifically the dictionary and the word form stored in `self`: They were initialized in `__init__()` and then persisted.
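The counting step inside `syllables()` may be easier to follow on a single pronunciation. The sketch below uses a hand-written list in the style of a CMU dictionary entry; it is a stand-in typed in for illustration, not loaded from NLTK.

```python
# Hand-written stand-in for one CMU-style pronunciation (an assumption,
# not loaded from NLTK): vowels carry a stress digit, consonants do not.
pronunciation = ["V", "IH1", "ZH", "W", "AH0", "L", "IY0"]

# "IH1".isalpha() is False because of the digit, so this keeps only vowels.
vowels = [p for p in pronunciation if not p.isalpha()]

print(vowels)       # ['IH1', 'AH0', 'IY0']
print(len(vowels))  # 3
```

One vowel phoneme per syllable is what lets the method use the number of digit-bearing entries as the syllable count.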

When we call the method, we again call it with one argument less than we defined it, in this case with zero arguments:

In [6]:
# making an AVM object
avmobj2 = UDavm(1, "visually", "visually", "ADV", 12, "advmod")

# how many syllables does this word have?
avmobj2.syllables()

3
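The "one argument less" pattern can also be made explicit: calling a method on an object is equivalent to calling it through the class and passing the object as `self`. Here is a sketch using the experiment-participant example from the introduction. The class `Participant`, its method `mean_rating()`, and the data values are all invented for illustration.

```python
class Participant:  # hypothetical class, for illustration only
    def __init__(self, participant_id, age, ratings):
        self.participant_id = participant_id
        self.age = age
        self.ratings = ratings  # Likert ratings, one per experimental item

    def mean_rating(self):
        # uses the ratings stored on the object in __init__()
        return sum(self.ratings) / len(self.ratings)

# made-up participant data
p = Participant("P01", 24, [4, 5, 3, 4])

# The usual call: Python passes p as self automatically.
print(p.mean_rating())             # 4.0

# The same call written out: p is handed over as self explicitly.
print(Participant.mean_rating(p))  # 4.0
```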

# Classes in Python packages

Many Python packages define their own data classes. For example, you have seen some classes that come with the Natural Language Toolkit.

One of them is FreqDist, a dictionary with extra bells and whistles that counts how often each word appears in a given word list. Here it is, as a reminder:


In [7]:
import nltk

text = """You are old, Father William," the young man said,
    "And your hair has become very white;
And yet you incessantly stand on your head—
    Do you think, at your age, it is right?"

"In my youth," Father William replied to his son,
    "I feared it might injure the brain;
But now that I'm perfectly sure I have none,
    Why, I do it again and again."
    """
# word_tokenize() needs the "punkt" tokenizer models
nltk.download("punkt")

fd = nltk.FreqDist(nltk.word_tokenize(text))
fd


You can see the whole source code for FreqDist here: https://www.nltk.org/_modules/nltk/probability.html#FreqDist

It starts with

```
class FreqDist(Counter):
   ...
```

That means that it is a class that is *derived from* the Python data type `Counter`. It takes over all the methods and attributes of `Counter`, and to do the counting, it basically just hands all its data to `Counter`. Leaving out a lot of comments and some code, we have:

```
class FreqDist(Counter):
    def __init__(self, samples=None):
        Counter.__init__(self, samples)
```
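The same pattern can be tried out on a small scale. The following sketch defines a hypothetical class `WordCounts` (not part of NLTK) that derives from `Counter` in the same way, and shows that the methods of `Counter` come along for free.

```python
from collections import Counter

class WordCounts(Counter):  # hypothetical class, mirrors the pattern above
    def __init__(self, samples=None):
        # hand the data over to Counter, which does the actual counting
        Counter.__init__(self, samples)

wc = WordCounts(["a", "b", "a"])

print(wc["a"])                  # 2
print(wc.most_common(1))        # [('a', 2)]  (method inherited from Counter)
print(isinstance(wc, Counter))  # True
```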

One of the methods it defines is `hapaxes`, which returns a list of all the words that appeared only once. Here is its definition:

```
class FreqDist(Counter):
    ...
    def hapaxes(self):
        return [item for item in self if self[item] == 1]
```

Here the `self` object can be accessed like a dictionary, an ability that is inherited from `Counter`. The keys in this dictionary are words, and the values are their counts. As you can see, the method returns all keys in the `self` dictionary that have a count of one.
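To try this out without NLTK, here is a sketch of a hypothetical `Counter` subclass, `TinyFreqDist`, that carries only this one method. The name and the example sentence are made up for illustration.

```python
from collections import Counter

class TinyFreqDist(Counter):  # hypothetical stand-in, not the NLTK class
    def hapaxes(self):
        # iterating over self yields the keys, just as with a dictionary
        return [item for item in self if self[item] == 1]

tiny_fd = TinyFreqDist("the cat saw the dog".split())

print(tiny_fd["the"])     # 2
print(tiny_fd.hapaxes())  # ['cat', 'saw', 'dog']
```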


## The function dir()

The function `dir()` lists the names of all the attributes and methods of an object. Here it is applied to the FreqDist object:

In [None]:
dir(fd)
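The list that `dir()` returns is long, because it also contains the special names that start with underscores. A small sketch of how one might filter it, using a plain `Counter` as a stand-in object (an assumption made so that the example runs without NLTK):

```python
from collections import Counter

counts = Counter(["a", "b", "a"])

# dir() returns a sorted list of attribute and method names as strings.
names = dir(counts)

# Keep only the names that do not start with an underscore.
public_names = [name for name in names if not name.startswith("_")]

print("most_common" in public_names)  # True
print("__init__" in names)            # True
print("__init__" in public_names)     # False
```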

# Building a class that enables corpus access in NLTK

Here is a self-made object type that gives access to a corpus, particularly one that is stored in the manner of nltk corpora.

The `__init__()` method stores the directory name.

The method `fileids()` does the same thing as the `fileids()` method for nltk corpora: It gives you the names of all the corpus pieces that you can access individually.

The method `words()` does the same thing as the `words` method for nltk corpora: Given a file ID, it gives you the corpus piece stored under that file ID as a list of words.

In [None]:
import os
import nltk

class MyCorpus:
    # initialization
    def __init__(self, directoryname):
        self.directoryname = directoryname

    # return list of file IDs:
    # all filenames in the directory of the corpus
    # that end in .txt
    def fileids(self):
        return [name for name in os.listdir(self.directoryname)
                if name.endswith(".txt")]

    # return list of words for one file ID
    def words(self, fileid):
        whole_filename = os.path.join(self.directoryname, fileid)
        # the with-block closes the file automatically
        with open(whole_filename) as f:
            contents = f.read()
        return nltk.word_tokenize(contents)
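The two methods can be tried out on a throwaway directory before pointing the class at a real corpus. The sketch below uses a hypothetical variant, `TinyCorpus`, which is an assumption for illustration: it splits on whitespace instead of calling `nltk.word_tokenize()` so that it runs without NLTK, matches on `.txt` including the dot, and sorts the file IDs so the order is predictable.

```python
import os
import tempfile

class TinyCorpus:  # hypothetical whitespace-splitting stand-in for MyCorpus
    def __init__(self, directoryname):
        self.directoryname = directoryname

    def fileids(self):
        # only filenames ending in .txt count as corpus pieces
        return sorted(name for name in os.listdir(self.directoryname)
                      if name.endswith(".txt"))

    def words(self, fileid):
        whole_filename = os.path.join(self.directoryname, fileid)
        with open(whole_filename, encoding="utf-8") as f:
            return f.read().split()

# Build a temporary directory with one corpus file and one other file.
with tempfile.TemporaryDirectory() as tmpdir:
    for name, text in [("a.txt", "one two three"), ("notes.md", "skip me")]:
        with open(os.path.join(tmpdir, name), "w", encoding="utf-8") as f:
            f.write(text)

    corpus = TinyCorpus(tmpdir)
    ids = corpus.fileids()
    first_words = corpus.words(ids[0])

print(ids)          # ['a.txt']
print(first_words)  # ['one', 'two', 'three']
```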


When we download NLTK corpora in this environment, they go to "/root/nltk_data/corpora":

In [None]:
import nltk
nltk.download("state_union")
nltk.download("punkt")

import os
os.listdir("/root/nltk_data/corpora")

In [None]:
testcorpus = MyCorpus("/root/nltk_data/corpora/state_union")

In [None]:
testcorpus.directoryname

In [None]:
testcorpus.fileids()

In [None]:
firstfile = testcorpus.fileids()[0]
testcorpus.words(firstfile)[:100]

In [None]:
firstfile