## Strings

**Text**

Strings can be created with double quotes `"`, single quotes `'` or the `str()` builtin:

In [None]:
"this is fine"

In [None]:
'so is this (but not in JSON)'

In [None]:
str(float(64))

Files are just big lists of characters
- the line structure is a mirage

`\n` is the newline character (a UNIX convention - Windows files often use `\r\n`)

In [None]:
file = 'First line.\nSecond line.'
print(file)
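To make the "mirage" concrete, here is a small sketch (the variable names are illustrative, not from the notebook) showing that the newline is a single character like any other, and that lines only appear once we split on it:

```python
# A two-line "file" held in a single string
text = 'First line.\nSecond line.'

# The newline counts as one character among the others
n_chars = len(text)
n_newlines = text.count('\n')

# Lines only exist once we split on the newline character
lines = text.splitlines()

print(n_chars, n_newlines, lines)
```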

Multi-line strings use triple quotes:

In [None]:
print("""
First
""")

String literals next to each other are joined:

In [None]:
'Py' 'thon'

This can be useful for multi line strings:

In [None]:
print('An expert is a person who has made all the mistakes '
      'that can be made in a very narrow field – NIELS BOHR')

We can add strings together:

In [None]:
"ja" + " ja " + "ja"

And multiply them:

In [None]:
"ja" * 3

## Strings are iterable

In [None]:
for character in 'Py' 'thon':
    print(character)

We can use the builtin `len` to measure the number of characters in a string:

In [None]:
len('Python')
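Because strings are sequences, they can also be indexed and sliced. The stemming exercise later in this notebook relies on slicing, so here is a short sketch (variable names are illustrative):

```python
word = 'Python'

first = word[0]     # indexing starts at zero
last = word[-1]     # negative indices count from the end
prefix = word[:4]   # slice of the first four characters
short = 'or'[:4]    # slicing past the end does not raise an error

print(first, last, prefix, short)
```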

## `upper` and `lower`

A common operation in NLP is making everything lower case:

In [None]:
'PYTHON'.lower()

In [None]:
'python'.upper()

We can see some of the other functionality available on the `str` object using `dir`:

In [None]:
dir(str)

## String formatting

There are many ways to do this - below is what works for me:

In [None]:
'{} {} {}'.format('first', 'second', 'third')

We can control the formatting of decimal places:

In [None]:
'{:.2f} {:.1f} {:.0f}'.format(420, 420, 420)
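One of the other ways is the f-string (Python 3.6+), which accepts the same format specifications as `format`. A minimal sketch comparing the two, with illustrative variable names:

```python
name = 'third'
value = 420

# Same output from str.format and from an f-string
with_format = '{} {} {}'.format('first', 'second', name)
with_fstring = f'first second {name}'

# The format spec after the colon works the same way in both
decimals_format = '{:.2f}'.format(value)
decimals_fstring = f'{value:.2f}'

print(with_format, with_fstring, decimals_format, decimals_fstring)
```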

## String splitting

A common operation is to split strings on characters.  Let's get the current working directory:

In [None]:
import os

os.getcwd()

We can then use the `split` method to create a list of the pieces:

In [None]:
os.getcwd().split('/')

In [None]:
os.getcwd().split('/')[-2]
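The output of `os.getcwd()` depends on the machine, so the sketch below uses a made-up path (an assumption for illustration only). It shows the empty first element produced by a leading separator, and that `join` is the inverse of `split`:

```python
# Hypothetical UNIX-style path, used so the result is the same everywhere
path = '/home/user/project/notebooks'

pieces = path.split('/')     # the leading '/' produces an empty first element
parent = pieces[-2]          # name of the parent directory
rebuilt = '/'.join(pieces)   # join reverses the split

print(pieces, parent, rebuilt)
```

Note that splitting on `'/'` assumes UNIX-style separators - Windows paths normally use `\`.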

We will see more on paths in the next notebook - and more on iterables in the notebook after that.

## String stripping

A common operation is removing leading and trailing whitespace with `strip`:

In [None]:
'python is dynamically typed    '.strip(' ')

A related operation is removing characters from anywhere in the string - this can be done by replacing them with `''`:

In [None]:
'python is dynamically typed    '.replace(' ', '')
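The difference between the stripping methods is easier to see side by side. A small sketch, using an example string with whitespace on both ends:

```python
text = '   python is dynamically typed    '

both = text.strip()                # removes whitespace from both ends
left = text.lstrip()               # removes leading whitespace only
right = text.rstrip()              # removes trailing whitespace only
no_spaces = text.replace(' ', '')  # removes every space, including between words

print(repr(both), repr(left), repr(right), repr(no_spaces))
```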

## `in`

A very Pythonic pattern is to check if an object exists in an iterable using `in`.  As strings are iterable, this syntax works with strings:

In [None]:
'P' in 'Python'

In [None]:
'p' in 'Python'
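For strings, `in` also matches substrings longer than one character, and `not in` gives the opposite check. A short sketch:

```python
has_substring = 'thon' in 'Python'   # substrings work, not just single characters
wrong_case = 'PY' in 'Python'        # the check is case sensitive
is_missing = 'z' not in 'Python'     # `not in` negates the check

print(has_substring, wrong_case, is_missing)
```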

## Exercise

Write a function that checks for a character in a word regardless of case, so that both `p` and `P` are found in `Python`:

In [None]:
def check(char, word):
    return char.lower() in word.lower()

## Exercise

**Stemming** is a process of converting words to their stem (a base or root form).  It is a common operation in NLP.

For the text below (`sample`)
- create a list of the stems
- stem (in this case) being the word shortened to 4 characters
- words of 2-3 characters should be kept as 2-3
- words of 1 character should be dropped

After creating your list of stems, count them
- use a `collections.defaultdict(int)` to store the counts

(There are libraries that will do stemming for you - below we do it by hand to practice working with strings).

In [None]:
sample = 'In linguistic morphology and information retrieval, stemming is   the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.   Algorithms for stemming have been studied in computer   science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.A computer program or subroutine that stems word may be called a stemming program, stemming algorithm, or stemmer.'

sample

In [None]:
stems = []

for w in sample.split():  # no argument: split on any run of whitespace, so no empty strings
    #  clean string here
    if len(w) >= 4:
        stems.append(w[:4])
    elif len(w) == 1:
        pass
    else:
        stems.append(w)
        
from collections import defaultdict

counter = defaultdict(int)

for s in stems:
    counter[s] += 1
    
counter['as']

In [None]:
counter
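The `clean string here` step above is left open. One possible way to fill it in, sketched on a short made-up sentence rather than `sample`, is to strip punctuation from each word and lower-case it before stemming. Lower-casing and the use of `collections.Counter` in place of `defaultdict(int)` are assumptions for illustration, not part of the exercise:

```python
import string
from collections import Counter

# Hypothetical short text standing in for `sample`
text = 'The stem, the stems; a stemmer (or stemming).'

stems = []
for word in text.split():
    # Remove ASCII punctuation from both ends, then normalise case
    word = word.strip(string.punctuation).lower()
    if len(word) > 1:            # drop empty and 1 character words
        stems.append(word[:4])   # words of 2-3 characters are kept whole

counts = Counter(stems)
print(counts)
```

`string.punctuation` only covers ASCII characters, so the long dash in `sample` would need to be handled separately.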

In [10]:
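d = {}

w = ['a', 'b', 'a']

#  option 1: initialise every key by hand before incrementing
#d['a'] = 0
#d['b'] = 0

#  option 2: use a defaultdict, which creates missing keys with a value of 0
#from collections import defaultdict
#d = defaultdict(int)

#  with a plain dict and neither option enabled, += on a missing key raises a KeyError
for wo in w:
    d[wo] += 4
    
d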

KeyError: 'a'