# Lab 06 Strings and dictionaries

<br>

<center>
    
<img src="img/craiyon-dictionary.jpg" alt="Spam" width="300"/> 
    
    Craiyon.com "Strings and dictionaries" 
    
</center>    

<br>

## 1 Objectives

This lesson will cover two essential Python types:

- strings

- dictionaries


<br>

## 2 Strings

One place where the Python language really shines is in the manipulation of strings.
This section will cover some of Python's built-in string methods and formatting operations.

Such string manipulation patterns come up often in the context of data science work.

<br>

### 2.1 String syntax

You've probably seen plenty of strings before, but just to recap, strings in Python can be defined using either single or double quotation marks. The two forms are functionally equivalent.

In [1]:
# string syntax
x = 'Pluto is a planet'
y = "Pluto is a planet"
x == y

True

<br>

Double quotes are convenient if your string contains a single quote character (e.g. representing an apostrophe).

Similarly, it's easy to create a string that contains double-quotes if you wrap it in single quotes:

In [2]:
print("Pluto's a planet!")
print('My dog is named "Pluto"')

Pluto's a planet!
My dog is named "Pluto"


<br>

If we try to put a single quote character inside a single-quoted string, Python gets confused:

In [3]:
'Pluto's a planet!'

SyntaxError: invalid syntax (1561186517.py, line 1)

<br>

We can fix this by "escaping" the single quote with a backslash. 

In [3]:
# escape character
'Pluto\'s a planet!'

"Pluto's a planet!"

The table below summarizes some important uses of the backslash character.

| What you type... | What you get | example               | `print(example)`             |
|------------------|--------------|-----------------------|------------------------------|
| `\'`         | `'`            | `'What\'s up?'`         | `What's up?`                 |  
| `\"`         | `"`            | `"That's \"cool\""`     | `That's "cool"`              |  
| `\\`         | `\`            |  `"Look, a mountain: /\\"` |  `Look, a mountain: /\`  |
| `\n`         |   a new line      |   `"1\n2 3"`                       |   `1`<br/>`2 3`              |



<br>

The last sequence, `\n`, represents the *newline character*. It causes Python to start a new line.

In [4]:
# syntax looks funny
hello = "hello\nworld"
print(hello)

hello
world
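
To see the difference between what we type and what gets printed, it can help to compare `repr()` with `print()`. The short sketch below reuses the `hello` string from above; the names `as_typed` and `lines` are only illustrative.

```python
# Same string as in the cell above.
hello = "hello\nworld"

# repr() shows the string as we would type it, escape sequence included.
as_typed = repr(hello)
print(as_typed)

# print() interprets the newline character and breaks the line.
print(hello)

# splitlines() breaks the string apart at the newline character.
lines = hello.splitlines()
print(lines)
```

The first `print` shows `'hello\nworld'` on a single line, while the second shows the two words on separate lines.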


<br>

In addition, Python's triple quote syntax for strings lets us include newlines literally (i.e. by just hitting 'Enter' on our keyboard, rather than using the special '\n' sequence). We've already seen this in the docstrings we use to document our functions, but we can use them anywhere we want to define a string.

In [5]:
# niche syntax
triplequoted_hello = """hello
world"""
print(triplequoted_hello)
triplequoted_hello == hello

hello
world


True

<br>

The `print()` function automatically adds a newline character unless we specify a value for the keyword argument `end` other than the default value of `'\n'`:

In [10]:
# newline tricks

print("hello")
print("world")
print("hello", end='')
print("pluto", end='')

hello
world
hellopluto
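
The `end` argument can be any string, not just an empty one. As a small sketch (the list `names` is made up for illustration), we can print several items on one line with a separator of our choosing:

```python
names = ['Mercury', 'Venus', 'Earth']

# end=', ' replaces the default newline, so the names share one line.
for name in names:
    print(name, end=', ')

# A final plain print() call supplies the newline that ends the line.
print('...')
```

This prints `Mercury, Venus, Earth, ...` on a single line.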

### 2.2 Strings are sequences

Strings can be thought of as sequences of characters. Almost everything we've seen that we can do to a list, we can also do to a string.

In [15]:
# Indexing
planet = 'Pluto'
print(type(planet)) # NB str type
planet[0]

<class 'str'>


'P'

In [12]:
# Slicing
planet[-3:]

'uto'

In [16]:
# How long is this string? (NB the largest valid index is len(planet) - 1)
len(planet)

5

In [17]:
# Yes, we can even loop over them
[char+'!' for char in planet]

['P!', 'l!', 'u!', 't!', 'o!']

<br>

But a major way in which they differ from lists is that they are *immutable*. We can't modify them.

In [18]:
# nope
planet[0] = 'B'

# planet.append doesn't work either - try it

TypeError: 'str' object does not support item assignment
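
Since we can't change a string in place, the usual workaround is to build a *new* string. Here is a minimal sketch of two ways to do that; the names `renamed` and `swapped` are just for illustration.

```python
planet = 'Pluto'

# Build a new string from a new first letter plus a slice of the old string.
renamed = 'B' + planet[1:]

# str.replace() also returns a new string rather than modifying the original.
swapped = planet.replace('P', 'B')

print(renamed, swapped)
# The original string is untouched.
print(planet)
```

Both approaches leave `planet` unchanged, which is exactly what immutability means.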

### 2.3 String methods

Like `list`, the type `str` has lots of very useful methods. I'll show just a few examples here.

In [19]:
# ALL CAPS
claim = "Pluto is a planet in my heart!"
claim.upper()

'PLUTO IS A PLANET IN MY HEART!'

In [20]:
# all lowercase
claim.lower()

'pluto is a planet in my heart!'

In [21]:
# Searching for the first index of a substring
claim.index('plan')

11

In [16]:
claim.startswith(planet)

True

In [22]:
# false because of missing exclamation mark
claim.endswith('planet')

False
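
One thing to watch out for: `str.index()` raises an error when the substring is missing. The related method `str.find()` returns `-1` instead. A quick sketch, using `'moon'` as an arbitrary substring that doesn't appear in `claim`:

```python
claim = "Pluto is a planet in my heart!"

# find() and index() agree when the substring is present.
print(claim.find('plan'))

# find() signals "not found" with -1 ...
print(claim.find('moon'))

# ... whereas index() raises a ValueError.
try:
    claim.index('moon')
except ValueError as err:
    print('index() raised ValueError:', err)
```

Which one to use depends on whether a missing substring should be treated as an error in your program.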

## 3 Converting between strings and lists: `.split()` and `.join()`

`str.split()` turns a string into a list of smaller strings, breaking on whitespace by default. This is super useful for taking you from one big string to a list of words.

In [23]:
words = claim.split()
words

['Pluto', 'is', 'a', 'planet', 'in', 'my', 'heart!']

<br>

Occasionally you'll want to split on something other than whitespace:

In [25]:
# the immaculate way to represent dates
datestr = '1956-01-31'
year, month, day = datestr.split('-')
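
Note that the pieces returned by `split()` are still strings. If we want to do arithmetic with them, we need to convert them first. A small sketch, using new (illustrative) variable names so that the string versions above are left alone:

```python
datestr = '1956-01-31'

# Convert each piece to an int as we unpack it.
year_num, month_num, day_num = [int(part) for part in datestr.split('-')]

# Now arithmetic works as expected.
print(year_num + 1, month_num, day_num)
```

Notice that `int('01')` gives `1`, so the leading zero is lost in the conversion.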

<br>

`str.join()` takes us in the other direction, sewing a list of strings up into one long string, using the string it was called on as a separator.

In [26]:
# this date format is the work of the Devil, just an example
'/'.join([month, day, year]) 

'01/31/1956'

In [28]:
# We can even put unicode characters right in our string literals :)
' 👏 '.join([word.upper() for word in words])

'PLUTO 👏 IS 👏 A 👏 PLANET 👏 IN 👏 MY 👏 HEART!'
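
A subtle point about the default behaviour: calling `split()` with no arguments treats any run of whitespace as a single separator, whereas `split(' ')` splits on every individual space. The string `messy` below is made up to illustrate the difference.

```python
messy = "Pluto   is a\tplanet"

# No argument: runs of spaces and tabs all count as one separator.
default_split = messy.split()
print(default_split)

# Explicit ' ': every single space splits, leaving empty strings behind,
# and the tab is not treated as a separator at all.
space_split = messy.split(' ')
print(space_split)
```

For turning text into a list of words, the no-argument form is usually the one we want.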

## 4 Building strings with `.format()`

Python lets us concatenate strings with the `+` operator.

In [29]:
planet + ', we miss you.'

'Pluto, we miss you.'

<br>

If we want to throw in any non-string objects, we have to be careful to call `str()` on them first.

In [30]:
position = 9
# this will result in an error
planet + ", you'll always be the " + position + "th planet to me."

TypeError: can only concatenate str (not "int") to str

In [31]:
# this will evaluate without an error
planet + ", you'll always be the " + str(position) + "th planet to me."

"Pluto, you'll always be the 9th planet to me."

<br>

This is getting hard to read and annoying to type. `str.format()` to the rescue.

In [25]:
# "pro" way to build strings from variables
"{}, you'll always be the {}th planet to me.".format(planet, position)

"Pluto, you'll always be the 9th planet to me."

<br>

So much cleaner! We call `.format()` on a "format string", where the Python values we want to insert are represented with `{}` placeholders.

Notice how we didn't even have to call `str()` to convert `position` from an int. `format()` takes care of that for us.

If that was all that `format()` did, it would still be incredibly useful. But as it turns out, it can do a *lot* more. Here's just a taste:

In [26]:
pluto_mass = 1.303 * 10**22
earth_mass = 5.9722 * 10**24
population = 52910390
#         2 significant digits   3 decimal places, format as percent     separate with commas
"{} weighs about {:.2} kilograms ({:.3%} of Earth's mass). It is home to {:,} Plutonians.".format(
    planet, pluto_mass, pluto_mass / earth_mass, population,
)

"Pluto weighs about 1.3e+22 kilograms (0.218% of Earth's mass). It is home to 52,910,390 Plutonians."

In [32]:
# Referring to format() arguments by index, starting from 0
s = """Pluto's a {0}.
No, it's a {1}.
{0}!
{1}!""".format('planet', 'dwarf planet')
print(s)

Pluto's a planet.
No, it's a dwarf planet.
planet!
dwarf planet!
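
Two details of the format syntax are easy to trip over, so here is a brief sketch. First, a precision like `.2` on its own means significant digits, while `.2f` means digits after the decimal point. Second, placeholders can be referred to by name as well as by index. The value `3.14159` and the names `label` and `amount` are arbitrary choices for illustration.

```python
value = 3.14159

# .2 alone: two significant digits.
two_sig = "{:.2}".format(value)
print(two_sig)

# .2f: two digits after the decimal point.
two_dec = "{:.2f}".format(value)
print(two_dec)

# Named placeholders can make long format strings easier to read.
named = "{label}: {amount:,}".format(label='population', amount=52910390)
print(named)
```

This prints `3.1`, `3.14` and `population: 52,910,390`.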


<br>

We are just scratching the surface here for `str.format`. 

See here: [pyformat.info](https://pyformat.info/) and [the official docs](https://docs.python.org/3/library/string.html#formatstrings) for further reading.

## 5 Dictionaries

Dictionaries are a built-in Python data structure for mapping "keys" to "values".  This is an important structure to be aware of, even though, for now, applications may not be obvious to you.

In [33]:
numbers = {'one':1, 'two':2, 'three':3}

In this case `'one'`, `'two'`, and `'three'` are the **keys**, and 1, 2 and 3 are their corresponding values.

Values are accessed via square bracket syntax similar to indexing into lists and strings.

In [29]:
numbers['one']

1

<br>

We can use the same syntax to add another key-value pair:

In [34]:
numbers['eleven'] = 11
numbers

{'one': 1, 'two': 2, 'three': 3, 'eleven': 11}

<br>

Or to change the value associated with an existing key:

In [35]:
numbers['one'] = 'Pluto'
numbers

{'one': 'Pluto', 'two': 2, 'three': 3, 'eleven': 11}
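
What happens if we ask for a key that isn't there? Square brackets raise a `KeyError`, while the `dict.get()` method returns `None` (or a default we supply) instead. A minimal sketch, using a fresh copy of `numbers` and `'four'` as an example of a missing key:

```python
numbers = {'one': 1, 'two': 2, 'three': 3}

# Square brackets raise a KeyError for a missing key.
try:
    numbers['four']
except KeyError as err:
    print('KeyError:', err)

# get() returns None when the key is missing ...
print(numbers.get('four'))

# ... or a default value, if we pass one as the second argument.
print(numbers.get('four', 0))
```

`get()` is handy when a missing key is an expected situation rather than a bug.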

<br>

Python has *dictionary comprehensions* with a syntax similar to the list comprehensions we saw in the previous tutorial.

In [36]:
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
planet_to_initial = {i: i[0] for i in planets}
planet_to_initial

{'Mercury': 'M',
 'Venus': 'V',
 'Earth': 'E',
 'Mars': 'M',
 'Jupiter': 'J',
 'Saturn': 'S',
 'Uranus': 'U',
 'Neptune': 'N'}

<br>

The `in` operator tells us whether something is a key in the dictionary:

In [33]:
'Saturn' in planet_to_initial

True

In [34]:
'Betelgeuse' in planet_to_initial

False
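
It's worth stressing that `in` looks only at the keys, never at the values. To test for a value, we can check against `dict.values()`. A short sketch with a smaller, illustrative version of `planet_to_initial`:

```python
planet_to_initial = {'Mercury': 'M', 'Venus': 'V', 'Earth': 'E'}

# 'Venus' is a key, so this is True.
print('Venus' in planet_to_initial)

# 'V' is a value, not a key, so this is False.
print('V' in planet_to_initial)

# Checking the values explicitly gives True.
print('V' in planet_to_initial.values())
```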

<br>

A `for` loop over a dictionary will loop over its keys:

In [35]:
for k in numbers:
    print("{} = {}".format(k, numbers[k]))

one = Pluto
two = 2
three = 3
eleven = 11


We can access a collection of all the keys or all the values with `dict.keys()` and `dict.values()`, respectively.

In [37]:
# Get all the initials, sort them alphabetically, and put them in a space-separated string.
' '.join(sorted(planet_to_initial.values()))

'E J M M N S U V'

<br>

The very useful `dict.items()` method lets us iterate over the keys and values of a dictionary simultaneously. (In Python jargon, an **item** refers to a key-value pair.)

In [38]:
for planet, initial in planet_to_initial.items():
    print("{} begins with \"{}\"".format(planet.rjust(10), initial))

   Mercury begins with "M"
     Venus begins with "V"
     Earth begins with "E"
      Mars begins with "M"
   Jupiter begins with "J"
    Saturn begins with "S"
    Uranus begins with "U"
   Neptune begins with "N"
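
As one more illustration of `dict.items()`, here is a sketch that turns the mapping around, grouping planets by their initial. The name `initial_to_planets` is made up for this example, and a shortened dictionary is used to keep the output small.

```python
planet_to_initial = {'Mercury': 'M', 'Venus': 'V', 'Earth': 'E', 'Mars': 'M'}

# Build a dictionary from each initial to the list of planets that share it.
initial_to_planets = {}
for planet_name, initial in planet_to_initial.items():
    # Start an empty list the first time we see an initial.
    if initial not in initial_to_planets:
        initial_to_planets[initial] = []
    initial_to_planets[initial].append(planet_name)

print(initial_to_planets)
```

This prints `{'M': ['Mercury', 'Mars'], 'V': ['Venus'], 'E': ['Earth']}`. The values have to be lists because several planets can share one initial.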


<br>

To read a full inventory of dictionaries' methods, call `help(dict)` as in the cell below, or check out the [official online documentation](https://docs.python.org/3/library/stdtypes.html#dict).

In [39]:
help(dict)

Help on class dict in module builtins:

class dict(object)
 |  dict() -> new empty dictionary
 |  dict(mapping) -> new dictionary initialized from a mapping object's
 |      (key, value) pairs
 |  dict(iterable) -> new dictionary initialized as if via:
 |      d = {}
 |      for k, v in iterable:
 |          d[k] = v
 |  dict(**kwargs) -> new dictionary initialized with the name=value pairs
 |      in the keyword argument list.  For example:  dict(one=1, two=2)
 |  
 |  Methods defined here:
 |  
 |  __contains__(self, key, /)
 |      True if the dictionary has the specified key, else False.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __init__(self,

## 6 Exercises

For each of the five strings below, predict what `len()` would return when passed that string. Then uncomment the `len()` call and run the cell to check whether you were right.

### 6.1a

In [41]:
# what is the length?
a = ""

# len(a)

0

### 6.1b

In [None]:
# what is the length?
b = "it's ok"

# len(b)

### 6.1c

In [None]:
# what is the length?
c = 'it\'s ok'

# len(c)

### 6.1d

In [None]:
# what is the length?
d = """hey"""

# len(d)

### 6.1e

In [None]:
# what is the length?
e = '\n'

# len(e)

### 6.2

There is a saying that "Data scientists spend 80% of their time cleaning data, and 20% of their time complaining about cleaning data." Let's see if you can write a function to help clean US zip code data. Given a string, it should return whether or not that string represents a valid zip code. For our purposes, a valid zip code is any string consisting of exactly 5 digits.

HINT: `str` has a method that will be useful here. Use `help(str)` to review a list of string methods.

In [None]:
def is_valid_zip(zip_code):
    """Returns whether the input string is a valid (5 digit) zip code
    """
    pass



### 6.3

A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word. Complete the function below to help her filter her list of articles.

Your function should meet the following criteria:

- Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.” 
- She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”
- Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.

In [42]:
def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword. 
    Returns list of the index values into the original list for all documents 
    containing the keyword.

    Example:
    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
    >>> word_search(doc_list, 'casino')
    [0]
    """
    pass



### 6.4

Now the researcher wants to supply multiple keywords to search for. Complete the function below to help her.

(You're encouraged to use the `word_search` function you just wrote when implementing this function. Reusing code in this way makes your programs more robust and readable - and it saves typing!)

In [None]:
def multi_word_search(doc_list, keywords):
    """
    Takes list of documents (each document is a string) and a list of keywords.  
    Returns a dictionary where each key is a keyword, and the value is a list of indices
    (from doc_list) of the documents containing that keyword

    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car and a casino", "Casinoville"]
    >>> keywords = ['casino', 'they']
    >>> multi_word_search(doc_list, keywords)
    {'casino': [0, 1], 'they': [1]}
    """
    pass

