# Code deep-dive: naming, functions, classes, modules

## Naming conventions in Python code

Programmers tend to develop strong opinions on how to write code, both functionally and aesthetically. Below follow some basic guidelines on general conventions for Python programming. See [PEP 8 - Style Guide for Python Code](https://peps.python.org/pep-0008/) for more details about the conventions that you will come across.

Style mostly comes down to personal preference, and groups, whether small or large, develop their own habits and conventions. In all of my positions during my time in the industry, I have attended official meetings and discussions about code style for the specific projects that I was involved in. That is to say: it matters to people.

We will not cover everything, but naming is a good place to start developing good habits.

In Python, the following types of casing are used (to my knowledge) for specific purposes:
- `snake_case`, characterized by all-lowercase words and word-delimiting underscores.
    - variables: `my_variable`
    - functions: `my_function()`
    - modules: `my_module`, though there are mixed opinions on this; many prefer to avoid underscores in module names
- `PascalCase`, characterized by conjoined words, each starting with an uppercase letter
    - classes: `MyClass`
- `SCREAMING_SNAKE_CASE`, like snake case but with all uppercase
    - constants (variables which should not be changed): `MAX_LENGTH`, `PI`

A leading underscore signals that a name is not meant to be used outside its class or module: `_hidden_variable`, `_hidden_method`, `_HiddenClass`, `_MY_CONSTANT`
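As a minimal sketch, the conventions above might look like this when they appear together in one file. All names here (`ShoppingBasket`, `MAX_BASKET_SIZE` and so on) are made up for illustration.

```python
MAX_BASKET_SIZE = 50  # constant: SCREAMING_SNAKE_CASE
_DEFAULT_CURRENCY = "EUR"  # leading underscore: meant for internal use only


class ShoppingBasket:  # class: PascalCase
    def __init__(self) -> None:
        self.item_names: list[str] = []  # attribute: snake_case

    def add_item(self, item_name: str) -> None:  # method: snake_case
        if len(self.item_names) >= MAX_BASKET_SIZE:
            raise ValueError("The basket is full!")
        self.item_names.append(item_name)

    def _format_summary(self) -> str:  # "hidden" helper method
        return f"{len(self.item_names)} item(s), prices in {_DEFAULT_CURRENCY}"


my_basket = ShoppingBasket()  # variable: snake_case
my_basket.add_item("apple")
# Calling the hidden method works, but the underscore says "please don't".
print(my_basket._format_summary())
```

Python itself does not enforce any of this; the casing and the underscore are signals to the human reader.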


Other types of casing that you may come across, and where there is more variation:
- `camelCase`, like Pascal case, except the very first letter is lowercase
    - keys in JSON (because this is the style in JavaScript): `{"someString": "this is a string", "someNumber": 1.234}`
- `kebab-case`, like snake case but with hyphens instead
    - parts of URLs: `https://peps.python.org/pep-0008/#descriptive-naming-styles`
    - filenames: `my-output-file.txt`



Choosing good names for variables, functions etc. can really help whoever is going to read your code. Poor names obfuscate it. There are many opinions here, but some general guidelines are:

- Signal by name what the thing is for rather than an abstract name, e.g. `num_items_in_basket` vs. `n` or `input_number` vs. `x`.
- Do not shorten words more than necessary. Compare e.g. `preprocess_document(document)`, `preprocess_doc(doc)`, `prepr_doc(doc)`, `prepr_doc(d)`, `pd(d)`. That said, some words are commonly used in a shortened form, like _doc_ in the previous example, as well as _num_ for _number_, _feat_ for _features_, _func_ for _function_, _var_ for _variable_ and many more.
- Do not (over-)use internal acronyms or abbreviations, e.g. `num_items_in_basket` vs. `num_iib` or `type_to_token_ratio` vs. `ttr` (I know, I did this ...).
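To make the guidelines above concrete, here is a small sketch of the same computation written twice. The function bodies are my own illustration of a type-to-token ratio (unique tokens divided by total tokens), not code from the assignment.

```python
# Hard to read: the reader has to guess what ttr, d, n and r stand for.
def ttr(d):
    n = len(d)
    r = len(set(d)) / n
    return r


# Easier to read: the names say what each thing is for.
def type_to_token_ratio(tokens: list[str]) -> float:
    num_tokens = len(tokens)
    num_types = len(set(tokens))  # unique tokens
    return num_types / num_tokens


example_tokens = ["the", "cat", "saw", "the", "dog"]
print(type_to_token_ratio(example_tokens))  # 4 unique tokens out of 5
```

Both functions compute the same value; only the second one can be understood without reading its body.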

## Functions

Functions allow you to reuse code blocks, abstract away from specific goals in code to generalizable functionality and organize your code more neatly. When used well, they make code much more readable and less prone to errors. Both benefits boil down to organizing and encapsulating logic into more manageable units.

Less jibber-jabber, more show-don't-tell.

For assignment 1, I mentioned quite a few times how one can break down the task at hand into smaller problems. Functions can help with this.

Imagine the following task: collect POS-tag counts from all IMDB movie reviews.

In the code below I will be using a few **type hints**. See if you can make out how they work.

In [1]:
!pip install -r requirements.txt
!python -m spacy download en_core_web_sm


Imagine if I could do something like this:

```python
from collections import Counter

reviews = load_movie_reviews()

master_counter = Counter()
for review in reviews:
    counter = get_pos_tag_counts(review)
    master_counter += counter # add counts to master counter

```

Hopefully, it is quite clear what the code is trying to do. I just need to define the functions `load_movie_reviews()` and `get_pos_tag_counts()`.

First, how do I get the counts from **one** document? Something like the function below can achieve this.

In [3]:
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def get_pos_tag_counts(text: str) -> Counter[str]:
    """Returns a dictionary where keys are POS-tag type and values are absolute
    counts of that POS-tag in the given text."""

    doc = nlp(text)
    pos_tags = [token.pos_ for token in doc]
    pos_tag_counts = Counter(pos_tags)
    return pos_tag_counts


In [4]:
test_string = "This is a test string."
get_pos_tag_counts(test_string)

Counter({'NOUN': 2, 'PRON': 1, 'AUX': 1, 'DET': 1, 'PUNCT': 1})
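The signature of `get_pos_tag_counts` uses type hints: `text: str` says that the argument is expected to be a string, and `-> Counter[str]` says that the function returns a `Counter` with string keys. A small sketch with a hypothetical function, `count_characters`, shows that the hints are documentation for readers and tools rather than runtime checks:

```python
from collections import Counter


def count_characters(text: str, ignore_case: bool = True) -> Counter[str]:
    """Count how often each character occurs in the given text."""
    if ignore_case:
        text = text.lower()
    return Counter(text)


print(count_characters("Hello"))

# Python stores the hints on the function but does not check them when the
# function is called. Passing a list "works" even though the hint says str
# (ignore_case=False avoids calling .lower() on the list).
print(count_characters(["a", "b", "a"], ignore_case=False))
print(count_characters.__annotations__)
```

Editors and type checkers can use the hints to warn about calls like the second one before the code is run.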

Now I want to encapsulate the loading of the movie reviews. This allows me to hide some logic that is needed to get rid of the review labels which I do not need for this task.

In [6]:


import csv  # needed for csv.reader below

def load_movie_reviews(path: str = "/work/data/imdb/IMDB Dataset.csv") -> list[str]:
    """Returns a list of IMDB movie reviews as strings."""
    with open(path) as f:
        csv_reader = csv.reader(f)
        next(csv_reader) # skipping header line
        # list comprehension and value-unpacking: keep the text, drop the label
        return [text for text, label in csv_reader]


In [11]:
# the first ten elements of the returned list
load_movie_reviews()[:10]

["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the f

Now I can run the code below.

In [8]:
from collections import Counter

# using only a subset here for demo purposes as it takes quite some time to process 50k docs
reviews = load_movie_reviews()[:1000]

master_counter = Counter()
for review in reviews:
    counter = get_pos_tag_counts(review)
    master_counter += counter # add counts to master counter

In [9]:
master_counter

Counter({'NOUN': 43467,
         'PUNCT': 32209,
         'VERB': 27016,
         'DET': 24900,
         'ADP': 24459,
         'PRON': 23050,
         'ADJ': 20230,
         'PROPN': 16870,
         'AUX': 15789,
         'ADV': 14743,
         'CCONJ': 9040,
         'PART': 7440,
         'SCONJ': 5306,
         'NUM': 2613,
         'X': 826,
         'INTJ': 613,
         'SYM': 305,
         'SPACE': 14})
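The loop above relies on the fact that `Counter` objects can be added together. A toy sketch with made-up counts shows what `+=` does:

```python
from collections import Counter

# made-up counts standing in for two reviews
counts_review_1 = Counter({"NOUN": 3, "VERB": 1})
counts_review_2 = Counter({"NOUN": 2, "ADJ": 4})

combined_counts = Counter()
combined_counts += counts_review_1
combined_counts += counts_review_2

print(combined_counts["NOUN"])  # counts for shared keys are summed: 3 + 2
print(combined_counts.most_common(2))  # the two most frequent tags
```

Keys that occur in only one of the counters are simply carried over, so the master counter ends up with every tag seen in any review.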

Functions are also key to reusing functionality. If you ever find yourself copy-pasting code from somewhere else in your project, there is a good chance that you should instead define a function and call it in both places.
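As a sketch of what that can look like, here the same cleaning steps are first copy-pasted and then moved into one function. The function `clean_text` and the example strings are hypothetical; the `<br />` tags mirror the ones in the IMDB reviews.

```python
# Before: the same cleaning steps are copy-pasted in two places.
review_text = "  Great movie!<br /><br />Loved it.  "
cleaned_review = review_text.replace("<br />", " ").strip().lower()

title_text = "  The Matrix<br />"
cleaned_title = title_text.replace("<br />", " ").strip().lower()


# After: the steps live in one function that is called in both places.
def clean_text(text: str) -> str:
    """Replace HTML line breaks with spaces, trim whitespace and lowercase."""
    return text.replace("<br />", " ").strip().lower()


# Both comparisons print True: the function gives the same results.
print(clean_text(review_text) == cleaned_review)
print(clean_text(title_text) == cleaned_title)
```

If the cleaning rules change later, there is now only one place to update.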

## Classes

Classes are "blueprints" for objects in Python. They define which attributes the objects have and which methods (object-specific functions) they have.

They allow you to organize your code in a different manner than functions do. It is more about putting together information and functionality that belong together. It really helps with so-called "separation of concerns."

Imagine the following task: without using pandas or utility functions from scikit-learn or anywhere else, define a `LabeledDataset` class that can give different train/test splits, number of labels and their distribution, etc.

Again, I'll try a show-don't-tell approach:

In [20]:
import csv
from collections import Counter

class LabeledDataset:

    def __init__(self, csv_path: str, has_header: bool = True):
        """Input CSV file should have texts in the first column and labels in the second."""
        
        # load data and store in two lists
        self.texts = []
        self.labels = []
        with open(csv_path) as f:
            csv_reader = csv.reader(f)
            if has_header:
                next(csv_reader)
            for text, label in csv_reader:
                self.texts.append(text)
                self.labels.append(label)

    def labeled_texts(self) -> list[tuple[str, str]]:
        """Return a list of text-label tuples."""
        return list(zip(self.texts, self.labels))

    def label_distribution(self) -> dict[str, int]:
        return dict(Counter(self.labels))

    def train_test_split(self, test_proportion: float = .2) -> tuple[list[str], list[str], list[str], list[str]]:
        """Create a train/test split. The test proportion should be between 0 and 1.
        Returned as train_texts, train_labels, test_texts, test_labels."""

        # ensure that the argument makes sense, otherwise throw an error
        if not 0 < test_proportion < 1:
            raise ValueError("test_proportion should be between 0 and 1!")

        # converting to int rounds down
        test_size = int(len(self.texts) * test_proportion)

        # split the data
        train_texts = self.texts[test_size:]
        train_labels = self.labels[test_size:]
        test_texts = self.texts[:test_size]
        test_labels = self.labels[:test_size]
        return (train_texts, train_labels, test_texts, test_labels)


    

In [14]:
imdb_reviews = LabeledDataset("/work/data/imdb/IMDB Dataset.csv")

Let's have a look at this object.

In [15]:
# first, the 'texts' attribute
imdb_reviews.texts[:10]

["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the f

In [16]:
# then, the 'labels' attribute
imdb_reviews.labels[:10]

['positive',
 'positive',
 'positive',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive']

In [None]:
# texts and labels zipped and put into a list, all wrapped in one method
imdb_reviews.labeled_texts()[:10]

[("One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the 

In [18]:
imdb_reviews.label_distribution()

{'positive': 25000, 'negative': 25000}

In [21]:
train_texts, train_labels, test_texts, test_labels = imdb_reviews.train_test_split()

In [22]:
len(train_texts)

40000

In [23]:
len(test_texts)

10000
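The 40000/10000 split follows from the slicing inside `train_test_split`. The same arithmetic on a toy list (made-up values, not the real data) makes it easier to follow:

```python
toy_texts = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"]
toy_test_proportion = 0.25

# int() truncates, so 10 * 0.25 = 2.5 becomes 2
toy_test_size = int(len(toy_texts) * toy_test_proportion)

toy_test_texts = toy_texts[:toy_test_size]   # the first toy_test_size items
toy_train_texts = toy_texts[toy_test_size:]  # everything after them

print(toy_test_size, toy_test_texts, len(toy_train_texts))
```

Note that the split takes items in the order they appear in the file and does not shuffle them, so a file sorted by label would give a skewed test set.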

In [24]:
# even some input validation where the code throws an error if the input does not make sense
imdb_reviews.train_test_split(test_proportion=1.2)

ValueError: test_proportion should be between 0 and 1!

The `self` parameter takes a bit of getting used to. It is not a keyword but the conventional name for the first parameter of a method: it is the object's way of referring to itself from within its own methods.
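A minimal sketch with a hypothetical `Greeter` class shows what happens behind the scenes: Python passes the object before the dot as the first argument of the method.

```python
class Greeter:
    def __init__(self, name: str) -> None:
        self.name = name  # store the argument on this particular object

    def greet(self) -> str:
        return f"Hello, {self.name}!"


alice = Greeter("Alice")
bob = Greeter("Bob")

# The object before the dot becomes self, so the first two calls are equivalent.
print(alice.greet())
print(Greeter.greet(alice))
print(bob.greet())  # same method, different object, different self.name
```

This is why every method lists `self` first in its definition, yet you never pass it explicitly when calling the method on an object.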

## Modules
Imports bring in names from other modules. So far, these have been modules written by others, but we can also define our own. In the same folder as this notebook, I have created `samplemodule.py`, from which I can import functions, classes and variables.

Have a look at the contents!

In [35]:
from samplemodule import sample_function, SampleClass, PI
# alternatively, if the module lives inside a package:
# from samplepackage.samplemodule import sample_function, SampleClass, PI

In [27]:
# I defined the constant PI in the module (as a rough approximation of pi) and imported it here
PI

3.142857142857143

In [28]:
sample_function("Hello from the notebook!")

Hello there from the file samplemodule.py! Here is your argument as a string: Hello from the notebook!


In [31]:
sample_function("Hello again!")

Hello there from the file samplemodule.py! Here is your argument as a string: Hello again!
This was the previous argument: Hello again!


In [32]:
sample_obj = SampleClass(1234)
sample_obj.print_value()

1234


In [34]:
# hidden methods are not closed off from the outside world (unlike private methods in Java, for instance),
# but they are hidden for a reason
sample_obj._hidden_method()

Did you just call the hidden method?
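To see the mechanics without the sample file at hand, the sketch below writes a tiny module to a temporary folder and imports it. The module name `tinymodule` and its contents are hypothetical and unrelated to `samplemodule.py`; in practice you would simply save the file next to your notebook and use a normal `import` statement.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# The text of a tiny, made-up module: one constant and one function.
module_source = '''
GREETING = "Hello"

def greet(name):
    return f"{GREETING}, {name}!"
'''

with tempfile.TemporaryDirectory() as folder:
    # A module is just a .py file in a folder that Python searches.
    Path(folder, "tinymodule.py").write_text(module_source)
    sys.path.insert(0, folder)  # add the folder to the search path
    importlib.invalidate_caches()
    tinymodule = importlib.import_module("tinymodule")  # like "import tinymodule"
    sys.path.remove(folder)

print(tinymodule.GREETING)
print(tinymodule.greet("notebook"))
```

Everything defined at the top level of the file (constants, functions, classes) becomes an attribute of the imported module.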
