<a href="https://colab.research.google.com/github/Bluelord/Kaggle_Courses/blob/main/01%20Python/06%20Strings%20%26%20Dctionaries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Strings & Dictionaries

---



## Tutorial

---



### Strings

One place where the Python language really shines is in the manipulation of strings. This section will cover some of Python's built-in string methods and formatting operations. Such string manipulation patterns come up often in the context of data science work.

In [1]:
x = 'Pluto is a planet'
y = "Pluto is a planet"
x == y

True

In [2]:
print("Pluto's a planet!")
print('My dog is named "Pluto"')

Pluto's a planet!
My dog is named "Pluto"

If we try to put a single quote character inside a single-quoted string, Python gets confused:


In [3]:
'Pluto's a planet!'

SyntaxError: ignored

We can fix this by "escaping" the single quote with a backslash. 

In [4]:
'Pluto\'s a planet!'

"Pluto's a planet!"

The escape sequence `\n` represents the *newline character*. It causes Python to start a new line.

In [5]:
hello = "hello\nworld"
print(hello)

hello
world
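
Other escape sequences follow the same backslash pattern. The sketch below shows a few common ones; the variable names are illustrative and not part of the original notebook.

```python
# Each escape sequence stands for a single character in the resulting string.
single_quote = 'What\'s up?'          # \' is a literal single quote
double_quote = "That's \"cool\""      # \" is a literal double quote
backslash = "Look, a mountain: /\\"   # \\ is a literal backslash
tab_separated = "name\tmass"          # \t is a tab character

print(single_quote)
print(double_quote)
print(backslash)
print(tab_separated)

# Every escape sequence counts as one character.
print(len("\n"), len("\\"), len("\t"))
```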


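Python's triple-quote syntax lets us include newlines literally, by pressing Enter inside the string rather than writing `\n`:

In [6]: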
triplequoted_hello = """hello
world"""
print(triplequoted_hello)
triplequoted_hello == hello

hello
world


True

The `print()` function automatically adds a newline character unless we specify a value for the keyword argument `end` other than the default value of `'\n'`:

In [7]:
print("hello")
print("world")
print("hello", end='')
print("pluto", end='')

hello
world
hellopluto
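
Besides `end`, `print()` also accepts a `sep` keyword that controls what goes between multiple arguments (a single space by default). As a small sketch, the example below writes into an in-memory buffer via the `file` argument so the exact characters can be inspected; the buffer name `buf` is just an illustration.

```python
import io

buf = io.StringIO()  # in-memory text buffer standing in for the screen

print("hello", end='', file=buf)          # no newline after "hello"
print("pluto", end='!\n', file=buf)       # custom ending
print("a", "b", "c", sep='-', file=buf)   # custom separator, default end='\n'

output = buf.getvalue()
print(repr(output))
```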

#### Strings are sequences

Strings can be thought of as sequences of characters. Almost everything we've seen that we can do to a list, we can also do to a string.

In [8]:
# Indexing
planet = 'Pluto'
planet[0]

'P'

In [9]:
# Slicing
planet[-3:]

'uto'

In [10]:
# How long is this string?
len(planet)

5

In [11]:
# Yes, we can even loop over them
[char+'! ' for char in planet]

['P! ', 'l! ', 'u! ', 't! ', 'o! ']

One major way strings differ from lists is that they are immutable: we can't modify them in place.

In [12]:
planet[0] = 'B'
# planet.append doesn't work either

TypeError: 'str' object does not support item assignment
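
Because a string cannot be changed in place, the usual workaround is to build a new string. A minimal sketch of two ways to do that:

```python
planet = 'Pluto'

# Strings are immutable, so create a new string instead of editing the old one.
renamed = 'B' + planet[1:]            # slice off the first character and concatenate
replaced = planet.replace('P', 'B')   # or let str.replace build the new string

print(renamed)
print(replaced)
print(planet)  # the original string is unchanged
```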

#### String methods

Like `list`, the type `str` has lots of very useful methods. I'll show just a few examples here.

In [13]:
# ALL CAPS
claim = "Pluto is a planet!"
claim.upper()

'PLUTO IS A PLANET!'

In [14]:
# all lowercase
claim.lower()

'pluto is a planet!'

In [15]:
# Searching for the first index of a substring
claim.index('plan')

11

In [16]:
claim.startswith(planet)

True

In [17]:
claim.endswith('dwarf planet')

False
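
Note that `str.index()` raises an error when the substring is absent. The sketch below contrasts it with two alternatives that do not raise, the `in` operator and `str.find()`; the variable names are illustrative.

```python
claim = "Pluto is a planet!"

# The `in` operator tests for a substring and returns a bool.
has_planet = 'planet' in claim

# str.find returns -1 when the substring is missing...
missing_position = claim.find('dwarf')

# ...whereas str.index raises ValueError in the same situation.
try:
    claim.index('dwarf')
    index_raised = False
except ValueError:
    index_raised = True

print(has_planet, missing_position, index_raised)
```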

##### Going between strings and lists: `.split()` and `.join()`

`str.split()` turns a string into a list of smaller strings, breaking on whitespace by default. This is super useful for taking you from one big string to a list of words. `str.join()` takes us in the other direction, sewing a list of strings up into one long string, using the string it was called on as a separator.

In [20]:
words = claim.split()
words

['Pluto', 'is', 'a', 'planet!']

In [21]:
datestr = '1956-01-31'
year, month, day = datestr.split('-')

In [22]:
'/'.join([month, day, year])

'01/31/1956'

In [23]:
# Yes, we can put unicode characters right in our string literals :)
' 👏 '.join([word.upper() for word in words])

'PLUTO 👏 IS 👏 A 👏 PLANET!'
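
A few details of `split()` are easy to trip over: the no-argument form behaves differently from an explicit separator, and an optional `maxsplit` argument limits the number of splits. The following sketch uses made-up input strings to illustrate this.

```python
line = '  Pluto   is a   planet!  '

# With no argument, split() collapses runs of whitespace and ignores the ends.
words = line.split()

# With an explicit separator, every occurrence counts, so empty strings can appear.
fields = 'a,,b'.split(',')

# maxsplit=1 splits only at the first separator.
key, value = 'name=Pluto=dwarf'.split('=', 1)

# Joining the words with a single space normalizes the spacing.
normalized = ' '.join(words)

print(words)
print(fields)
print(key, value)
print(normalized)
```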

##### Building strings with `.format()`

Python lets us concatenate strings with the `+` operator. If we want to throw in any non-string objects, we have to be careful to call `str()` on them first. That quickly gets hard to read and annoying to type, which is where `str.format()` comes to the rescue.

In [24]:
planet + ', we miss you.'

'Pluto, we miss you.'

In [25]:
position = 9
planet + ", you'll always be the " + position + "th planet to me."

TypeError: ignored

In [26]:
planet + ", you'll always be the " + str(position) + "th planet to me."

"Pluto, you'll always be the 9th planet to me."

In [27]:
"{}, you'll always be the {}th planet to me.".format(planet, position)

"Pluto, you'll always be the 9th planet to me."

In [28]:
pluto_mass = 1.303 * 10**22
earth_mass = 5.9722 * 10**24
population = 52910390
#         2 significant digits   3 decimal places, format as percent     separate with commas
"{} weighs about {:.2} kilograms ({:.3%} of Earth's mass). It is home to {:,} Plutonians.".format(
    planet, pluto_mass, pluto_mass / earth_mass, population,
)

"Pluto weighs about 1.3e+22 kilograms (0.218% of Earth's mass). It is home to 52,910,390 Plutonians."

In [29]:
# Referring to format() arguments by index, starting from 0
s = """Pluto's a {0}.
No, it's a {1}.
{0}!
{1}!""".format('planet', 'dwarf planet')
print(s)

Pluto's a planet.
No, it's a dwarf planet.
planet!
dwarf planet!
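
The notebook itself only uses `str.format()`. As an alternative sketch, formatted string literals (f-strings, available from Python 3.6) accept the same format specifications but place the expressions directly inside the braces.

```python
planet = 'Pluto'
position = 9
pluto_mass = 1.303 * 10**22
earth_mass = 5.9722 * 10**24

# Expressions inside the braces are evaluated in place.
sentence = f"{planet}, you'll always be the {position}th planet to me."

# The same format specifications as str.format() go after the colon.
mass_share = f"{pluto_mass / earth_mass:.3%}"   # percent with 3 decimal places
padded = f"{planet:>10}"                        # right-align in a field of width 10

print(sentence)
print(mass_share)
print(repr(padded))
```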


### Dictionaries

Dictionaries are a built-in Python data structure for mapping keys to values. In the dictionary below, `'one'`, `'two'`, and `'three'` are the **keys**, and 1, 2 and 3 are their corresponding values.
Values are accessed via square bracket syntax, similar to indexing into lists and strings. The same syntax adds a new key-value pair or changes the value associated with an existing key.

In [30]:
numbers = {'one':1, 'two':2, 'three':3}

In [31]:
numbers['eleven'] = 11
numbers

{'eleven': 11, 'one': 1, 'three': 3, 'two': 2}

In [32]:
numbers['one'] = 'Pluto'
numbers

{'eleven': 11, 'one': 'Pluto', 'three': 3, 'two': 2}
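
Square-bracket access assumes the key exists. The sketch below shows what happens when it does not, along with `dict.get()` as a non-raising alternative and `del` for removing a key; it starts from a fresh copy of `numbers` so it can run on its own.

```python
numbers = {'one': 1, 'two': 2, 'three': 3}

# Indexing a missing key raises KeyError.
try:
    numbers['four']
    raised = False
except KeyError:
    raised = True

# dict.get returns a default (None unless specified) instead of raising.
four = numbers.get('four')
four_or_zero = numbers.get('four', 0)

# del removes a key together with its value.
del numbers['three']

print(raised, four, four_or_zero)
print(numbers)
```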

In [34]:
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
planet_to_initial = {planet: planet[0] for planet in planets}
print(planet_to_initial)

{'Mercury': 'M', 'Venus': 'V', 'Earth': 'E', 'Mars': 'M', 'Jupiter': 'J', 'Saturn': 'S', 'Uranus': 'U', 'Neptune': 'N'}


The `in` operator tells us whether something is a key in the dictionary:

In [35]:
'Saturn' in planet_to_initial

True

In [36]:
'Betelgeuse' in planet_to_initial

False
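
Since `in` looks only at keys, a value is not found unless the values are searched explicitly. A short sketch using a smaller, illustrative version of `planet_to_initial`:

```python
planet_to_initial = {'Mercury': 'M', 'Venus': 'V', 'Earth': 'E'}

key_found = 'Venus' in planet_to_initial          # 'Venus' is a key
value_as_key = 'V' in planet_to_initial           # 'V' is only a value, not a key
value_found = 'V' in planet_to_initial.values()   # search the values explicitly

print(key_found, value_as_key, value_found)
```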

A `for` loop over a dictionary will loop over its keys:

In [37]:
for k in numbers:
    print("{} = {}".format(k, numbers[k]))

one = Pluto
two = 2
three = 3
eleven = 11


We can access a collection of all the keys or all the values with `dict.keys()` and `dict.values()`, respectively.

In [38]:
# Get all the initials, sort them alphabetically, and put them in a space-separated string.
' '.join(sorted(planet_to_initial.values()))

'E J M M N S U V'

The very useful `dict.items()` method lets us iterate over the keys and values of a dictionary simultaneously. (In Python jargon, an **item** refers to a key-value pair.)

In [39]:
for planet, initial in planet_to_initial.items():
    print("{} begins with \"{}\"".format(planet.rjust(10), initial))

   Mercury begins with "M"
     Venus begins with "V"
     Earth begins with "E"
      Mars begins with "M"
   Jupiter begins with "J"
    Saturn begins with "S"
    Uranus begins with "U"
   Neptune begins with "N"
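
One common use of `dict.items()` is inverting a mapping. The sketch below groups planets by initial; because initials are not unique (Mercury and Mars share "M"), each initial maps to a list. The name `initial_to_planets` is hypothetical.

```python
planet_to_initial = {
    'Mercury': 'M', 'Venus': 'V', 'Earth': 'E', 'Mars': 'M',
    'Jupiter': 'J', 'Saturn': 'S', 'Uranus': 'U', 'Neptune': 'N',
}

# Invert the mapping: initial -> list of planets with that initial.
initial_to_planets = {}
for planet, initial in planet_to_initial.items():
    # setdefault creates the empty list the first time an initial is seen.
    initial_to_planets.setdefault(initial, []).append(planet)

print(initial_to_planets['M'])
print(len(initial_to_planets))
```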


## Exercise

In [40]:
a = ""
# An empty string has length zero
length = len(a)
print(length)

0


In [41]:
b = "it's ok"
length = len(b)
print(length)

7


In [42]:
c = 'it\'s ok'
length = len(c)
print(length)


7


In [43]:
d = """hey"""
length = len(d)
print(length)

3


In [44]:
e = '\n'
length = len(e)
print(length)

1


Write a function to help clean US zip code data. Given a string, it should return whether or not that string represents a valid zip code. For our purposes, a valid zip code is any string consisting of exactly 5 digits.

In [45]:
def is_valid_zip(zip_code):
    return len(zip_code) == 5 and zip_code.isdigit()
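
A few quick checks of the solution above, as a sketch. One caveat: `str.isdigit()` also accepts some non-ASCII digit characters such as the superscript two (`'\u00b2'`). The stricter `is_valid_zip_ascii` below is a hypothetical variant, not part of the exercise, and assumes Python 3.7+ for `str.isascii()`.

```python
def is_valid_zip(zip_code):
    return len(zip_code) == 5 and zip_code.isdigit()

def is_valid_zip_ascii(zip_code):
    # Hypothetical stricter variant: additionally require plain ASCII characters.
    return len(zip_code) == 5 and zip_code.isascii() and zip_code.isdigit()

print(is_valid_zip('12345'))    # exactly five digits
print(is_valid_zip('1234'))     # too short
print(is_valid_zip('1234x'))    # contains a letter

# Superscript two counts as a digit for isdigit(), but it is not ASCII.
print(is_valid_zip('1234\u00b2'))
print(is_valid_zip_ascii('1234\u00b2'))
```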

A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word. Complete the function below to help her filter her list of articles. Your function should meet the following criteria:

- Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.” 
- She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”
- Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.

In [46]:
def word_search(doc_list, keyword):
    loc = []
    for i, doc in enumerate(doc_list):
        keys = doc.split()
        # strip trailing periods and commas, then lowercase each word
        normalized = [key.rstrip('.,').lower() for key in keys]
        if keyword.lower() in normalized:
            loc.append(i)
    return loc
    
    
doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
word_search(doc_list, 'casino')

[0]
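
To check the solution against the three criteria listed above, here is a sketch that repeats the function and runs it on the "closed" examples from the problem statement. The list `docs` is made up for illustration.

```python
def word_search(doc_list, keyword):
    loc = []
    for i, doc in enumerate(doc_list):
        # Split on whitespace, strip trailing periods/commas, and lowercase.
        normalized = [word.rstrip('.,').lower() for word in doc.split()]
        if keyword.lower() in normalized:
            loc.append(i)
    return loc

docs = [
    "It is closed.",         # trailing period should not matter
    "Enclosed, with care",   # "enclosed" is a larger word, so no match
    "CLOSED, for now",       # case and trailing comma should not matter
    "The case was closed",   # plain match
]

matches = word_search(docs, 'closed')
print(matches)
```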

The researcher wants to supply multiple keywords to search for. Complete the function below to help her. Use the `word_search` function you just wrote when implementing this function. Reusing code in this way makes your programs more robust and readable.

In [49]:
def multi_word_search(doc_list, keywords):
    keyword_loc = {}
    for keyword in keywords:
        keyword_loc[keyword] = word_search(doc_list, keyword)
    return keyword_loc
    
    
doc_list = ["The Learn Python Challenge Casino.", "They bought a car and a casino", "Casinoville"]
keywords = ['casino', 'they']

multi_word_search(doc_list, keywords)

{'casino': [0, 1], 'they': [1]}
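
The loop that fills `keyword_loc` can also be written as a dictionary comprehension, as in the planet example from the tutorial. The sketch below is an equivalent alternative, not a replacement for the solution above; `word_search` is repeated so the block runs on its own.

```python
def word_search(doc_list, keyword):
    loc = []
    for i, doc in enumerate(doc_list):
        normalized = [word.rstrip('.,').lower() for word in doc.split()]
        if keyword.lower() in normalized:
            loc.append(i)
    return loc

def multi_word_search(doc_list, keywords):
    # One dictionary entry per keyword, reusing word_search for each.
    return {keyword: word_search(doc_list, keyword) for keyword in keywords}

doc_list = ["The Learn Python Challenge Casino.", "They bought a car and a casino", "Casinoville"]
result = multi_word_search(doc_list, ['casino', 'they'])
print(result)
```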