**STRINGS**


One place where the Python language really shines is in the manipulation of strings. This section will cover some of Python's built-in string methods and formatting operations.
Such string manipulation patterns come up often in data science work.

**String syntax**
Strings in Python can be defined using either single or double quotations. They are functionally equivalent.
Double quotes are convenient if your string contains a single quote character (e.g. representing an apostrophe).

Similarly, it's easy to create a string that contains double-quotes if you wrap it in single quotes.

If we try to put a single quote character inside a single-quoted string, Python gets confused. We can fix this by "escaping" the single quote with a backslash.
```
'Pluto\'s a planet!'
```

Some *important uses of the backslash character*:

| What you type... | What you get | example | print(example) |
|---|---|---|---|
| `\'` | `'` | `'What\'s up?'` | `What's up?` |
| `\"` | `"` | `"That's \"cool\""` | `That's "cool"` |
| `\\` | `\` | `"Look, a mountain: /\\"` | `Look, a mountain: /\` |
| `\n` | newline | `"1\n2 3"` | `1` and then, on a new line, `2 3` |

`\n` represents the newline character: it causes Python to start a new line.

In [2]:
hello = "hello\nworld"
print(hello)

hello
world


In addition, Python's *triple quote syntax* for strings lets us include newlines literally (i.e. by just hitting 'Enter' on our keyboard, rather than using the special '\n' sequence). We've already seen this in the docstrings we use to document our functions, but we can use them anywhere we want to define a string.

In [4]:
triplequoted_hello = """hello
world"""
print(triplequoted_hello)
triplequoted_hello == hello 
# hello comes from the previous code cell: a triple-quoted string with a
# literal line break is equal to one written with \n

hello
world


True

The print() function automatically adds a newline character unless we specify a value for the keyword argument end other than the default value of '\n':

In [6]:
print("hello")
print("world")
print("hello", end='')
print("pluto", end='')

hello
world
hellopluto

**Strings are sequences**

Strings can be thought of as sequences of characters. Almost everything we've seen that we can do to a list, we can also do to a string.

In [8]:
# Indexing
planet = 'Pluto'
planet[0]

'P'

In [10]:
# Slicing
planet[-3:]

'uto'

In [20]:
# How long is this string?
len(planet)    # measures the variable planet, which holds 'Pluto' --> 5
len('planet')  # measures the literal string 'planet' --> 6

6

In [30]:
# Yes, we can even loop over them
print([char+'! ' for char in planet])
print([char+'! ' for char in 'planet'])

['P! ', 'l! ', 'u! ', 't! ', 'o! ']
['p! ', 'l! ', 'a! ', 'n! ', 'e! ', 't! ']


But a major way in which they differ from lists is that they are immutable. We can't modify them.
--> planet[0] = 'B' won't work: it raises a TypeError.
planet.append doesn't work either: strings have no append method, so it raises an AttributeError.
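
As a minimal sketch of what immutability means in practice, the cell below tries both operations on the same `planet` string. Both attempts raise exceptions, so they are wrapped in try/except here to keep the cell runnable. The name `renamed` is only for illustration.

```python
planet = 'Pluto'

# Item assignment is not supported on strings.
try:
    planet[0] = 'B'
except TypeError as err:
    print('TypeError:', err)

# Strings have no append method, unlike lists.
try:
    planet.append('!')
except AttributeError as err:
    print('AttributeError:', err)

# To "change" a string, build a new one instead.
renamed = 'B' + planet[1:]
print(renamed)  # Bluto
print(planet)   # Pluto (the original is untouched)
```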

**String methods**

In [32]:
# ALL CAPS
claim = "Pluto is a planet!"
claim.upper()

'PLUTO IS A PLANET!'

In [34]:
# all lowercase
claim.lower()

'pluto is a planet!'

In [36]:
# Searching for the first index of a substring
claim.index('plan')

11

In [40]:
print(claim.startswith(planet))  # planet holds 'Pluto', so this is True
claim.startswith('planet')  # the literal 'planet' is not at the start --> False

True


False

In [42]:
# False because of the missing exclamation mark
claim.endswith('planet')

False
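
The cells above only search for substrings that are present. As a hedged sketch of the other case: `str.index` is case-sensitive and raises a ValueError when the substring is absent, so a membership test with `in` is one simple guard. The name `found` is illustrative.

```python
claim = "Pluto is a planet!"

# index() is case-sensitive: 'plan' is found, 'Plan' is not.
print(claim.index('plan'))  # 11

try:
    claim.index('Plan')
except ValueError as err:
    print('ValueError:', err)

# Checking membership first avoids the exception.
found = 'moon' in claim
print(found)  # False
```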

**Going between strings and lists: .split() and .join()**

str.split() turns a string into a list of smaller strings, *breaking on whitespace* by default. This is super useful for taking you from one big string to a list of words.

In [44]:
words = claim.split()
words

['Pluto', 'is', 'a', 'planet!']

In [54]:
# If you want to split on something other than whitespace:
datestr = '1956-01-31'
year, month, day = datestr.split('-')
print(year, month, day)

1956 01 31


str.join() takes us in the other direction, sewing a list of strings up into one long string, using the string it was called on as a separator.

In [56]:
'/'.join([month, day, year])

'01/31/1956'

In [58]:
# Yes, we can put unicode characters right in our string literals :)
' 👏 '.join([word.upper() for word in words])

'PLUTO 👏 IS 👏 A 👏 PLANET!'
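
A short sketch of two points implied above: splitting and joining with the same separator undo each other, and `str.join()` only accepts strings, so other types have to be converted first. The names `parts`, `values` and `joined` are illustrative.

```python
datestr = '1956-01-31'

# split() and join() with the same separator undo each other.
parts = datestr.split('-')
print('-'.join(parts) == datestr)  # True

# join() needs strings; a list of ints raises a TypeError.
values = [1956, 1, 31]
try:
    '/'.join(values)
except TypeError as err:
    print('TypeError:', err)

# Convert each element with str() before joining.
joined = '/'.join(str(v) for v in values)
print(joined)  # 1956/1/31
```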

**Building strings with .format()**
Python lets us concatenate strings with the + operator.


In [62]:
planet + ', we miss you.'

'Pluto, we miss you.'

If we want to throw in any non-string objects, we have to be careful to call str() on them first.

In [64]:
position = 9
planet + ", you'll always be the " + str(position) + "th planet to me."

"Pluto, you'll always be the 9th planet to me."

This is getting hard to read and annoying to type. str.format() to the rescue.

In [66]:
"{}, you'll always be the {}th planet to me.".format(planet, position)

"Pluto, you'll always be the 9th planet to me."

So much cleaner! We call .format() on a "format string", where the Python values we want to insert are represented with {} placeholders.
--> This method replaces the {} placeholders in the string with the values you pass in, in order. It's like leaving blanks to fill in later with your own words or numbers.


Notice how we didn't even have to call str() to convert position from an int. format() takes care of that for us.

If that was all that format() did, it would still be incredibly useful. But as it turns out, it can do a lot more. Here's just a taste:

In [68]:
pluto_mass = 1.303 * 10**22
earth_mass = 5.9722 * 10**24
population = 52910390
#         2 significant digits   3 decimal places, format as percent     separate with commas
"{} weighs about {:.2} kilograms ({:.3%} of Earth's mass). It is home to {:,} Plutonians.".format(
    planet, pluto_mass, pluto_mass / earth_mass, population,
)

"Pluto weighs about 1.3e+22 kilograms (0.218% of Earth's mass). It is home to 52,910,390 Plutonians."

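To see what each format spec does on its own, here is a small sketch using the built-in `format(value, spec)`, which applies the same mini-language as a `{:spec}` placeholder. The value 3.14159 is an illustrative extra, used to contrast `.2` (significant digits) with `.2f` (digits after the decimal point).

```python
pluto_mass = 1.303 * 10**22
earth_mass = 5.9722 * 10**24
population = 52910390

# format(value, spec) behaves like "{:spec}".format(value).
print(format(pluto_mass, '.2'))                # 1.3e+22 (2 significant digits)
print(format(pluto_mass / earth_mass, '.3%'))  # 0.218% (3 decimals, as a percent)
print(format(population, ','))                 # 52,910,390

# '.2' counts significant digits; '.2f' counts digits after the decimal point.
print(format(3.14159, '.2'))   # 3.1
print(format(3.14159, '.2f'))  # 3.14
```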
In [72]:
#Referring to format() arguments by index, starting from 0
s = """Pluto's a {0}.
No, it's a {1}.
{0}!
{1}!""".format('planet', 'dwarf planet')
print(s)

Pluto's a planet.
No, it's a dwarf planet.
planet!
dwarf planet!


**Dictionaries**
Dictionaries are a built-in Python data structure for mapping keys to values.


In [74]:
numbers = {'one':1, 'two':2, 'three':3}

In this case 'one', 'two', and 'three' are the keys, and 1, 2 and 3 are their corresponding values.

Values are accessed via square bracket syntax similar to indexing into lists and strings.

In [76]:
numbers['one']

1

In [78]:
#We can use the same syntax to add another key, value pair

numbers['eleven'] = 11
numbers

{'one': 1, 'two': 2, 'three': 3, 'eleven': 11}

In [80]:
#Or to change the value associated with an existing key

numbers['one'] = 'Pluto'
numbers

{'one': 'Pluto', 'two': 2, 'three': 3, 'eleven': 11}

Python has **dictionary comprehensions** with a syntax similar to the list comprehensions we saw earlier.

In [82]:
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
planet_to_initial = {planet: planet[0] for planet in planets}
planet_to_initial

{'Mercury': 'M',
 'Venus': 'V',
 'Earth': 'E',
 'Mars': 'M',
 'Jupiter': 'J',
 'Saturn': 'S',
 'Uranus': 'U',
 'Neptune': 'N'}

Structure:
{key: value for element in iterable}
key: what you want to use as the key in your dictionary
value: what you want to store under that key
for element in iterable: what you are looping over (such as a list)
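
The structure above can be sketched next to the equivalent for loop. This is only an illustration: the shortened `planets` list and the names `lengths_loop` and `lengths_comp` are made up for the example.

```python
planets = ['Mercury', 'Venus', 'Earth', 'Mars']

# Loop version: start empty and add one key/value pair per element.
lengths_loop = {}
for planet in planets:
    lengths_loop[planet] = len(planet)

# Comprehension version: the same mapping in a single expression.
lengths_comp = {planet: len(planet) for planet in planets}

print(lengths_comp)  # {'Mercury': 7, 'Venus': 5, 'Earth': 5, 'Mars': 4}
print(lengths_loop == lengths_comp)  # True
```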

In [86]:
#The in operator tells us whether something is a key in the dictionary
print('Saturn' in planet_to_initial)
print('Betelgeuse' in planet_to_initial)

True
False
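
One common use of `in` is to guard a square-bracket lookup, since looking up a missing key raises a KeyError. The sketch below assumes a shortened two-entry dictionary; `name` and `initial` are illustrative names.

```python
planet_to_initial = {'Saturn': 'S', 'Earth': 'E'}

# Square-bracket lookup of a missing key raises KeyError.
try:
    planet_to_initial['Betelgeuse']
except KeyError as err:
    print('KeyError:', err)

# Guarding the lookup with `in` avoids the exception.
name = 'Betelgeuse'
if name in planet_to_initial:
    initial = planet_to_initial[name]
else:
    initial = None
print(initial)  # None

# `in` checks keys, not values.
print('S' in planet_to_initial)           # False
print('S' in planet_to_initial.values())  # True
```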


A for loop over a dictionary will loop **over its keys**.

In [92]:
print(numbers) 
for k in numbers:
    print("{} = {}".format(k, numbers[k]))

{'one': 'Pluto', 'two': 2, 'three': 3, 'eleven': 11}
one = Pluto
two = 2
three = 3
eleven = 11


We can access a collection of all the keys or all the values with dict.keys() and dict.values(), respectively.

In [94]:
# Get all the initials, sort them alphabetically, and put them in a space-separated string.
' '.join(sorted(planet_to_initial.values()))

'E J M M N S U V'

The very useful dict.items() method lets us iterate over the keys and values of a dictionary simultaneously. (In Python jargon, an item refers to a key, value pair.)

In [96]:
for planet, initial in planet_to_initial.items():
    print("{} begins with \"{}\"".format(planet.rjust(10), initial))
    # .rjust(10) takes a string (planet) and pads it with spaces on the left
    # until it has a total length of 10 characters.

   Mercury begins with "M"
     Venus begins with "V"
     Earth begins with "E"
      Mars begins with "M"
   Jupiter begins with "J"
    Saturn begins with "S"
    Uranus begins with "U"
   Neptune begins with "N"


In [98]:
help(dict)

Help on class dict in module builtins:

class dict(object)
 |  dict() -> new empty dictionary
 |  dict(mapping) -> new dictionary initialized from a mapping object's
 |      (key, value) pairs
 |  dict(iterable) -> new dictionary initialized as if via:
 |      d = {}
 |      for k, v in iterable:
 |          d[k] = v
 |  dict(**kwargs) -> new dictionary initialized with the name=value pairs
 |      in the keyword argument list.  For example:  dict(one=1, two=2)
 |
 |  Built-in subclasses:
 |      StgDict
 |
 |  Methods defined here:
 |
 |  __contains__(self, key, /)
 |      True if the dictionary has the specified key, else False.
 |
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |
 |  __getitem__(self, key, /)
 |      Return self[key].
 |
 |  __gt__(self, value, /)
 |      Return self>value.

**EXERCISES**

**Exercise 0**
The empty string has length zero. Note that the empty string is also the only string that Python treats as False when converted to a boolean.

Keep in mind Python includes spaces (and punctuation) when counting string length.
The backslash is not part of the string, so it doesn't contribute to its length.

The fact that a string was created using triple-quote syntax doesn't make any difference in terms of its content or length. This string is exactly the same as 'hey'.

The newline character is just a single character! (Even though we represent it to Python using a combination of two characters.)
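
The exercise's own strings are not reproduced in these notes, so the cell below uses illustrative stand-ins (`a` through `e`) to check each of the points above.

```python
a = ""            # the empty string
b = "it's ok"     # spaces and punctuation count
c = 'it\'s ok'    # the backslash is not part of the string
d = """hey"""     # triple quotes do not change the content
e = '\n'          # two typed characters, one newline character

print(len(a), bool(a))     # 0 False
print(len(b))              # 7
print(len(c), b == c)      # 7 True
print(len(d), d == 'hey')  # 3 True
print(len(e))              # 1
```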




**Exercise 1**
There is a saying that "Data scientists spend 80% of their time cleaning data, and 20% of their time complaining about cleaning data." Let's see if you can write a function to help clean US zip code data. Given a string, it should return whether or not that string represents a valid zip code. For our purposes, a valid zip code is any string consisting of exactly 5 digits.
HINT: `str` has a method that will be useful here. Use `help(str)` to review a list of string methods.

In [136]:
def is_valid_zip(zip_code):
    """Returns whether the input string is a valid (5 digit) zip code
    """
    if zip_code.isdigit() and len(zip_code) == 5:
        return True
    return False

print(is_valid_zip('12345'))
print(is_valid_zip('1234x'))

True
False


In [122]:
help(str.isdigit)
# str.isdigit is a string method that answers the question:
# is this string made up only of digits?
# True → if every character is a digit and there is at least one character.
# False → if there are letters, spaces, signs or other symbols.

Help on method_descriptor:

isdigit(self, /) unbound builtins.str method
    Return True if the string is a digit string, False otherwise.

    A string is a digit string if all characters in the string are digits and there
    is at least one character in the string.



In [138]:
# More directly: return the boolean expression itself
def is_valid_zip(zip_code):
    """Returns whether the input string is a valid (5 digit) zip code
    """
    return zip_code.isdigit() and len(zip_code) == 5

print(is_valid_zip('12345'))
print(is_valid_zip('1234x'))

True
False
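
As a quick sanity check of the rule described by `help(str.isdigit)`, the sketch below runs the function on a few edge cases. The inputs in `cases` are hypothetical, chosen to probe length, spaces, signs and the empty string.

```python
def is_valid_zip(zip_code):
    """Return whether zip_code consists of exactly 5 digits."""
    return zip_code.isdigit() and len(zip_code) == 5

# Hypothetical inputs: valid, too short, too long, contains a space,
# contains a sign, and the empty string.
cases = ['12345', '1234', '123456', '12 45', '-1234', '']
results = [is_valid_zip(case) for case in cases]

for case, result in zip(cases, results):
    print(repr(case), result)
```

Only the first input is accepted. The empty string is rejected because isdigit() requires at least one character.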


**Exercise 2**
A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word. Complete the function below to help her filter her list of articles.

Your function should meet the following criteria:

Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.”
She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”.
Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.

The .strip() method removes whitespace from the beginning and end of a string (not from the middle). You can also pass it a string of characters to remove from the beginning and end instead.
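
A minimal sketch of `.strip()` and its one-sided variant `.rstrip()`. The sample strings and variable names are illustrative.

```python
padded = '  Closed.,  '

# With no argument, whitespace is removed from both ends.
no_space = padded.strip()
print(repr(no_space))  # 'Closed.,'

# With an argument, any of the listed characters are removed from both ends.
no_punct = no_space.strip('.,')
print(repr(no_punct))  # 'Closed'

# Characters in the middle are never touched.
inner = '.a.b.'.strip('.')
print(repr(inner))  # 'a.b'

# rstrip() only trims the right-hand end.
right_only = '.closed.'.rstrip('.')
print(repr(right_only))  # '.closed'
```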

.split() vs .rsplit()
Both methods split a string into parts, but:
.split() 👉 starts cutting from the left (the beginning of the string).
.rsplit() 👉 starts cutting from the right (the end of the string).
The results only differ when a maximum number of splits is given.
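
A short sketch of the difference, using an illustrative string `path`. The second argument is the maximum number of splits.

```python
path = 'a-b-c-d'

# Without a split limit the two methods return the same list.
print(path.split('-'))   # ['a', 'b', 'c', 'd']
print(path.rsplit('-'))  # ['a', 'b', 'c', 'd']

# With a limit of 1, only one cut is made: from the left or from the right.
left_cut = path.split('-', 1)
right_cut = path.rsplit('-', 1)
print(left_cut)   # ['a', 'b-c-d']
print(right_cut)  # ['a-b-c', 'd']
```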

**enumerate()**: for lists (and other sequences)
Use enumerate() when you want to iterate over a list and have access to both the index and the element.
**.items()**: for dictionaries
Use .items() when you are iterating over a dictionary and want access to each key and its value.

🧠 Quick analogy
enumerate(a_list) → gives you (index, value) - for lists
dictionary.items() → gives you (key, value) - for dictionaries
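
The analogy can be sketched side by side. The shortened list and dictionary, and the names `index_pairs` and `item_pairs`, are illustrative.

```python
planets = ['Mercury', 'Venus', 'Earth']
planet_to_initial = {'Mercury': 'M', 'Venus': 'V', 'Earth': 'E'}

# enumerate() pairs each list element with its index.
index_pairs = list(enumerate(planets))
print(index_pairs)  # [(0, 'Mercury'), (1, 'Venus'), (2, 'Earth')]

# .items() pairs each dictionary key with its value.
item_pairs = list(planet_to_initial.items())
print(item_pairs)  # [('Mercury', 'M'), ('Venus', 'V'), ('Earth', 'E')]

# Both are usually unpacked directly in a for loop.
for i, planet in enumerate(planets):
    print(i, planet)
for planet, initial in planet_to_initial.items():
    print(planet, initial)
```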

In [105]:
def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword. 
    Returns list of the index values into the original list for all documents 
    containing the keyword.

    Example:
    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
    >>> word_search(doc_list, 'casino')
    [0]
    """
    # Normalize the keyword to lowercase
    keyword = keyword.lower()

    # Split each document into a list of words (on whitespace)
    separated = [doc.split() for doc in doc_list]

    # Lowercase each word and strip periods and commas so it can be compared to the keyword
    # (no outer for loop is needed: the nested comprehension already iterates)
    cleaned = [[word.lower().strip('.,') for word in words] for words in separated]

    # Collect the index of every document whose word list contains the keyword.
    # enumerate() is used rather than cleaned.index(words): index() returns the
    # position of the first equal entry, so with two identical documents the
    # first index would be reported twice and the second never.
    return [i for i, words in enumerate(cleaned) if keyword in words]

doc_list = ["The Learn Python Challenge Casino.", "They bought a casino", "Casinoville"]
print(word_search(doc_list, 'Casino'))

[0, 1]


In [None]:
#Solution
def word_search(doc_list, keyword):
    # list to hold the indices of matching documents
    indices = []
    # Iterate through the indices (i) and elements (doc) of documents
    for i, doc in enumerate(doc_list):
        # Split the string doc into a list of words (according to whitespace)
        tokens = doc.split()
        # Make a transformed list where we 'normalize' each word to facilitate matching.
        # Periods and commas are removed from the end of each word, and it's set to all lowercase.
        normalized = [token.rstrip('.,').lower() for token in tokens]
        # Is there a match? If so, update the list of matching indices.
        if keyword.lower() in normalized:
            indices.append(i)
    return indices


doc_list = ["The Learn Python Challenge Casino.", "They bought a casino", "Casinoville"]
print(word_search(doc_list, 'Casino'))

**Exercise 3**
Now the researcher wants to supply multiple keywords to search for. Complete the function below to help her.

(You're encouraged to use the word_search function you just wrote when implementing this function. Reusing code in this way makes your programs more robust and readable - and it saves typing!)

In [115]:
def multi_word_search(doc_list, keywords):
    """
    Takes list of documents (each document is a string) and a list of keywords.  
    Returns a dictionary where each key is a keyword, and the value is a list of indices
    (from doc_list) of the documents containing that keyword

    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car and a casino", "Casinoville"]
    >>> keywords = ['casino', 'they']
    >>> multi_word_search(doc_list, keywords)
    {'casino': [0, 1], 'they': [1]}
    """
    # A separate `for keyword in keywords:` loop is not necessary:
    # the comprehension already iterates over the keywords.
    return {keyword: word_search(doc_list, keyword) for keyword in keywords}

doc_list = ["The Learn Python Challenge Casino casino.", "They bought a car and a casino", "Casinoville"]
keywords = ['casino', 'they','banana']
multi_word_search(doc_list, keywords)

{'casino': [0, 1], 'they': [1], 'banana': []}

In [123]:
#Solution:
def multi_word_search(documents, keywords):
    keyword_to_indices = {}
    for keyword in keywords:
        # Map each keyword to the result of calling word_search with it
        keyword_to_indices[keyword] = word_search(documents, keyword)
    return keyword_to_indices
    
doc_list = ["The Learn Python Challenge Casino casino.", "They bought a car and a casino", "Casinoville"]
keywords = ['casino', 'they','banana']
multi_word_search(doc_list, keywords)

{'casino': [0, 1], 'they': [1], 'banana': []}