# Lab 3: Strings

## Overview

Welcome to your third lab! The goal for today is to familiarize you with strings (or more precisely, in Python, `str`). Manipulating textual data is a frequent operation in day-to-day programming, even more so for us in NLP.

As usual, you will have to submit two exercises. You will find them and the submission instructions at the end of this notebook.

## Exercise #4: Special Words

For each of the following problems, we describe a criterion that makes a word (or phrase!) special.

If you are using macOS or Linux, you should have a dictionary file available at `/usr/share/dict/words`, containing some 100 thousand English words, one per line. Depending on your machine, the number might differ somewhat. Therefore, we also mirrored this file [on Arche](https://arche.univ-lorraine.fr/), so you can download the dictionary from there. That version contains `102401` words.

Write the function `load_english` to load English words from this file. How many English words are there in this file? Using the Arche file, we got `72165` words after lowercasing them, removing duplicates, and keeping only entries made of ASCII letters (i.e. we exclude entries that contain apostrophes or accented letters).

If you are using a different version of the file or did a different set of pre-processing steps, your number will likely differ somewhat. This is not crucial for the next four exercises, but having the same number of words in your pre-processed dictionary will help you verify your answers with ours for the next exercises.
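
To make the pre-processing steps concrete, here is a minimal sketch that applies them to a handful of in-memory lines instead of the real file. The names `sample_lines` and `normalise` are hypothetical, and the sample words are made up for illustration; the sketch also skips empty lines, which is an assumption rather than something the exercise requires.

```python
import string

# Hypothetical sample standing in for a few lines of the dictionary file.
sample_lines = ["Apple\n", "apple\n", "café\n", "it's\n", "Zebra\n"]


def normalise(lines):
    """Lowercase, deduplicate, and keep only words made of ASCII letters."""
    kept = set()  # a set removes duplicates for us
    for line in lines:
        word = line.rstrip()  # drop the trailing newline
        # Assumption: skip empty lines; reject apostrophes and accented letters.
        if word and all(ch in string.ascii_letters for ch in word):
            kept.add(word.lower())
    return kept


print(sorted(normalise(sample_lines)))  # ['apple', 'zebra']
```

Here `Apple` and `apple` collapse into one entry, while `café` and `it's` are dropped, which is why the count after pre-processing is lower than the number of lines in the file.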

In [481]:
import string


# If you downloaded "words" from the course website,
# change me to the path to the downloaded file.
_DICTIONARY_FILE = 'words'


def check_ascii(row):
    """Return whether row contains only ASCII letters (a-z, A-Z)."""
    for char in row:
        if char not in string.ascii_letters:
            return False
    return True


def load_english():
    """Load and return a collection of english words from a file."""
    lst = []
    # open file and read line by line
    with open(_DICTIONARY_FILE) as file:
        for row in file:
            row = row.rstrip()  # strip trailing whitespace, including the newline
            if check_ascii(row):
                lst.append(row.lower())  # store the lowercased word
    return set(lst)  # a set removes duplicates and makes later lookups fast


english = load_english()
print(len(english))

72165


## Assignment Exercises

### Exercise #7: Triad Phrases

Triad words are English words for which the two smaller strings you make by extracting alternating letters both form valid words.

For example:

![Triad Phrases](http://i.imgur.com/jGEXJWi.png)
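
The "alternating letters" can be extracted with slicing and a step of 2. The sketch below only shows the split itself, without any dictionary lookup; `alternating_halves` is a hypothetical helper name, not part of the exercise.

```python
def alternating_halves(word):
    """Return the even-indexed letters and the odd-indexed letters of word."""
    return word[::2], word[1::2]


print(alternating_halves("learned"))   # ('land', 'ere')
print(alternating_halves("theorem"))   # ('term', 'hoe')
print(alternating_halves("schooled"))  # ('shoe', 'cold')
```

A word is a triad word when both halves are themselves valid English words.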

Write a function to determine whether an entire phrase passed into a function is made of triad words. You can assume that all words are made of only alphabetic characters, and are separated by whitespace. We will consider the empty string to be an invalid English word.

```python
is_triad_phrase("learned theorem") # => True
is_triad_phrase("studied theories") # => False
is_triad_phrase("wooded agrarians") # => False
is_triad_phrase("forrested farmers") # => False
is_triad_phrase("schooled oriole") # => True
is_triad_phrase("educated small bird") # => False
is_triad_phrase("a") # => False
is_triad_phrase("") # => False
```

Write a set of instructions to generate a list of all triad words. Assign this list to a variable called `all_triad_words`. How many are there? We found `1041` distinct triad words (case-insensitive).

*NB: we obtained the answers above using the dictionary on Arche and after applying the pre-processing steps described in Exercise #4. If the number of words you keep after applying `load_english()` is the same as ours, then you should be able to get the same number as us. If your number is different or you are using another dictionary, your answers may differ.*

You can uncomment the `print()` statement at the bottom once you have finished to see if your functions return the expected result.

In [482]:
# A triad phrase is one where the alternating letters of each word spell out two other words.

def is_triad_word(word, english):
    """Return whether a word is a triad word."""
    if not word:
        return False

    # Slice with step 2, starting at index 0 and at index 1, to get the two subwords.
    sub_word1 = word[0::2]
    sub_word2 = word[1::2]
    # The word is a triad word if both subwords are in the vocabulary.
    # english is a set, so each membership test is fast.
    return sub_word1 in english and sub_word2 in english


def is_triad_phrase(phrase, english):
    """Return whether a phrase is composed of only triad words."""
    is_triad = True
    for word in phrase.split(" "):
        # &= is a bitwise AND on booleans (True == 1, False == 0):
        # is_triad stays True only if every word in the phrase is a triad word.
        is_triad &= is_triad_word(word, english)
    return is_triad


print(
   is_triad_phrase("learned theorem", english), # => True
   is_triad_phrase("studied theories", english), # => False
   is_triad_phrase("wooded agrarians", english), # => False
   is_triad_phrase("forrested farmers", english), # => False
   is_triad_phrase("schooled oriole", english), # => True
   is_triad_phrase("educated small bird", english), # => False
   is_triad_phrase("a", english), # => False
   is_triad_phrase("", english), # => False
)

True False False False True False False False
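
The `&=` accumulation in `is_triad_phrase` can also be expressed with the built-in `all`. The following is a sketch of that alternative, not a replacement for the solution: `every_word_satisfies`, `toy_english` and `is_triad_word_toy` are hypothetical names, and the toy vocabulary is only large enough for the demo.

```python
def every_word_satisfies(phrase, predicate):
    """Return True if predicate holds for every space-separated word."""
    # all() stops at the first word that fails, unlike the &= loop.
    return all(predicate(word) for word in phrase.split(" "))


# Hypothetical toy vocabulary, just large enough for the demo.
toy_english = {"land", "ere", "term", "hoe"}


def is_triad_word_toy(word):
    """Triad check against the toy vocabulary (empty words are rejected)."""
    return bool(word) and word[::2] in toy_english and word[1::2] in toy_english


print(every_word_satisfies("learned theorem", is_triad_word_toy))  # True
print(every_word_satisfies("learned bird", is_triad_word_toy))     # False
print(every_word_satisfies("", is_triad_word_toy))                 # False
```

Splitting on `" "` rather than calling `split()` with no argument matters here: `"".split(" ")` yields `[""]`, so the empty phrase is still rejected.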


In [483]:
all_triad_words = [w for w in english if is_triad_word(w, english)]
print(len(all_triad_words))

1041


### Exercise #8: Surpassing Phrases

Surpassing words are words for which the gap between each adjacent pair of letters **strictly** increases. These gaps are computed without "wrapping around" from Z to A.

For example:

![Surpassing Phrases](http://i.imgur.com/XKiCnUc.png)

Write a function to determine whether an entire phrase passed into a function is made of surpassing words. You can assume that all words are made of only alphabetic characters, and are separated by whitespace. We will consider the empty string and a 1-character string to be valid surpassing phrases.

```python
is_surpassing_phrase("superb subway") # => True
is_surpassing_phrase("excellent train") # => False
is_surpassing_phrase("porky hogs") # => True
is_surpassing_phrase("plump pigs") # => False
is_surpassing_phrase("turnip fields") # => True
is_surpassing_phrase("root vegetable lands") # => False
is_surpassing_phrase("a") # => True
is_surpassing_phrase("") # => True
```

We've provided a `character_gap` function that returns the gap between two characters. To understand how it works, you should first learn about the Python functions `ord` (one-character string to integer ordinal) and `chr` (integer ordinal to one-character string). For example:

```python
ord('a') # => 97
chr(97) # => 'a'
```

So, in order to find the gap between `G` and `E`, we compute `abs(ord('G') - ord('E'))`, where `abs` returns the absolute value of its argument.
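
As an illustration, the same computation can be applied to every adjacent pair of letters in a word. This is a small sketch; `adjacent_gaps` is a hypothetical helper name and is not the provided `character_gap` function.

```python
def adjacent_gaps(word):
    """Return the absolute ordinal gaps between adjacent letters of word."""
    # zip(word, word[1:]) pairs each letter with the one that follows it.
    return [abs(ord(a) - ord(b)) for a, b in zip(word, word[1:])]


print(abs(ord('G') - ord('E')))  # 2
print(adjacent_gaps("superb"))   # [2, 5, 11, 13, 16]
print(adjacent_gaps("train"))    # [2, 17, 8, 5]
```

The gaps for `superb` strictly increase, so it is a surpassing word; the gaps for `train` drop from 17 to 8, so it is not.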

Write a set of instructions to generate a list of all surpassing words. Assign this list to a variable called `all_surpassing_words`. How many are there? We found `1150` distinct surpassing words.

*NB: we obtained the answers above using the dictionary on Arche and after applying the pre-processing steps described in Exercise #4. If the number of words you keep after applying `load_english()` is the same as ours, then you should be able to get the same number as us. If your number is different or you are using another dictionary, your answers may differ.*

You can uncomment the `print()` statement at the bottom once you have finished to see if your functions return the expected result.

In [484]:
def character_gap(ch1, ch2):
    """Return the absolute gap between two characters."""
    return abs(ord(ch1) - ord(ch2))

def is_surpassing_word(word, english):
    """Return whether a word is surpassing (english is accepted but not used)."""
    # Track the previous gap; start below any real gap so the first one always passes.
    char_gap = -1
    for i in range(1, len(word)):
        curr_char_gap = character_gap(word[i-1], word[i])
        if curr_char_gap <= char_gap:
            return False  # the gap did not strictly increase
        char_gap = curr_char_gap
    # Empty and 1-character words have no gaps, so they count as surpassing.
    return True


def is_surpassing_phrase(phrase, english):
    """Return whether a phrase is composed of only surpassing words."""
    is_surpassing = True
    for word in phrase.split(" "):  # iterate over words split on spaces
        is_surpassing &= is_surpassing_word(word, english)
    return is_surpassing


print(
   is_surpassing_phrase("superb subway", english), # => True
   is_surpassing_phrase("excellent train", english), # => False
   is_surpassing_phrase("porky hogs", english), # => True
   is_surpassing_phrase("plump pigs", english), # => False
   is_surpassing_phrase("turnip fields", english), # => True
   is_surpassing_phrase("root vegetable lands", english), # => False
   is_surpassing_phrase("a", english), # => True
   is_surpassing_phrase("", english), # => True
)

True False True False True False True True


In [485]:
# Filter the 72165-word dictionary down to the surpassing words.
all_surpassing_words = [w for w in english if is_surpassing_word(w, english)]
print(len(all_surpassing_words))

1150


### Submission instructions

Alright, you did it!

You will need to submit the last two exercises (#7 and #8) on Arche before 9:59am on Friday, 13th October (just before our next lab). Submit either a `.py` or an `.ipynb` file containing the two functions and name it `td2_firstname_lastname_grpN.py` or `td2_firstname_lastname_grpN.ipynb` accordingly, where `firstname` should be your first name, `lastname` should be your last name, and `N` in `grpN` should be your group number (e.g. Jane Doe, who is in group A1, should name her submission either `td2_jane_doe_grp1.py` or `td2_jane_doe_grp1.ipynb`, depending on whether Jane submitted a Python script or a Jupyter notebook).

To evaluate your submission, we will be looking at the following criteria:

- Does your code run? (So **run** your program at least once before submitting!)
- Does it run correctly? (So **test** your solution with a few different inputs!)
- Is your code well-commented?

## Done Early?

Have a look at the [`string` module in Python](https://docs.python.org/3/library/string.html). It contains a lot of very useful things, such as constants listing the ASCII characters. Another thing you should look into is how Python handles [the Unicode standard](https://docs.python.org/3/howto/unicode.html).
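
As a quick taste, here is a short sketch of a few things from the `string` module and of how ASCII differs from Unicode letters. The sample words are only illustrative.

```python
import string

print(string.ascii_lowercase)     # abcdefghijklmnopqrstuvwxyz
print(len(string.ascii_letters))  # 52
print(string.digits)              # 0123456789

# str.isalpha() accepts any Unicode letter, while str.isascii() does not.
print("naïve".isalpha())  # True
print("naïve".isascii())  # False
```

This difference is why an ASCII-only filter excludes accented words even though they consist entirely of letters.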

### Other Phrases

On Puzzling.StackExchange, the user [JLee](https://puzzling.stackexchange.com/users/463/jlee) has come up with a ton of interesting puzzles of this form ("I call words that follow a certain rule 'adjective' words"). If you like puzzles, optionally read through [these JLee puzzles](https://puzzling.stackexchange.com/search?q=%22I+call+it%22+title%3A%22what+is%22+is%3Aquestion+user%3A463) or [these other puzzles inspired by JLee](https://puzzling.stackexchange.com/search?tab=votes&q=%22what%20is%20a%22%20word%20is%3aquestion).

> With <3, by @sredmond

> With peanuts, monkeys and spies, by tmickus