# Basic text operations

### Easy string manipulation

To compare strings, use the ``==`` operator:

In [None]:
x = 'a string'
y = "a string"
if x == y:
    print("they are the same")


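The ``is`` operator is not a reliable substitute: it tests whether two names refer to the same object, not whether their values are equal. It may happen to succeed for identical literals, because Python can reuse the same string object, but this should not be relied upon: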

In [None]:
x = 'a string'
y = "a string"
if x is y:
    print("they are the same")
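A minimal sketch of the difference between equality and identity. The second string is built at run time with ``join()`` so that, in CPython, it is a separate object; the names ``a`` and ``b`` are illustrative only.

```python
a = "a string"
b = " ".join(["a", "string"])  # same characters, but assembled at run time

print(a == b)  # True: the characters are equal
print(a is b)  # False in CPython: two distinct objects
```

For comparing text, ``==`` is the operator to reach for; ``is`` is mainly useful for checks such as ``x is None``.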

In [None]:
fox = "tHe qUICk bROWn fOx."

To convert the entire string into upper-case or lower-case, you can use the ``upper()`` or ``lower()`` methods respectively:

In [None]:
fox.upper()

In [None]:
fox.lower()

A common formatting need is to capitalize just the first letter of each word, or perhaps the first letter of each sentence.
This can be done with the ``title()`` and ``capitalize()`` methods:

In [None]:
fox.title()

In [None]:
fox.capitalize()

The cases can be swapped using the ``swapcase()`` method:

In [None]:
fox.swapcase()

In [None]:
line = '         this is the content         '
line.strip()

To remove whitespace only from the right or only from the left, use ``rstrip()`` or ``lstrip()`` respectively:

In [None]:
line.rstrip()

In [None]:
line.lstrip()

To remove characters other than spaces, you can pass the desired character to the ``strip()`` method:

In [None]:
num = "000000000000435"
num.strip('0')
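As a short sketch of how the argument behaves: it is treated as a set of characters to remove from the ends, not as a prefix or suffix, and characters inside the string are left alone. The sample strings other than ``num`` are made up for illustration.

```python
num = "000000000000435"

print(num.lstrip("0"))                # 435: only leading zeros are removed
print("00040350".strip("0"))          # 4035: both ends stripped, the inner zero is kept
print("xyhello worldyx".strip("xy"))  # hello world: any mix of 'x' and 'y' at the ends goes
```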

In [None]:
line = 'the quick brown fox jumped over a lazy dog'
line.find('fox')

In [None]:
line.index('fox')

In [None]:
line[16:19]  # slicing from the returned index recovers 'fox'

The only difference between ``find()`` and ``index()`` is their behavior when the search string is not found; ``find()`` returns ``-1``, while ``index()`` raises a ``ValueError``:

In [None]:
line.find('bear')

In [None]:
line.index('bear')
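One way to handle a missing substring gracefully is sketched below. The helper ``locate`` is a hypothetical name introduced here for illustration; it is not part of the standard library.

```python
line = 'the quick brown fox jumped over a lazy dog'

def locate(text, word):
    """Return the position of word in text, or None if it is absent."""
    try:
        return text.index(word)
    except ValueError:
        return None

print(locate(line, 'fox'))   # 16
print(locate(line, 'bear'))  # None

# With find(), compare against -1 explicitly: -1 is truthy,
# so writing "if line.find('bear'):" would give the wrong answer.
if line.find('bear') == -1:
    print("no bear here")
```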

In [None]:
line.partition('fox')

The ``rpartition()`` method is similar, but searches from the right of the string.
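The difference between the two methods only shows when the separator occurs more than once. A small sketch, using a made-up ``pair`` string:

```python
pair = "key=value=more"

print(pair.partition("="))   # ('key', '=', 'value=more')
print(pair.rpartition("="))  # ('key=value', '=', 'more')

# When the separator is absent, partition() returns the whole string first
print(pair.partition(":"))   # ('key=value=more', '', '')
```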

The ``split()`` method is perhaps more useful; it finds *all* instances of the split-point and returns the substrings in between.
The default is to split on any whitespace, returning a list of the individual words in a string:

In [None]:
line_list = line.split()
print(line_list)

In [None]:
print(line_list[1])

A related method is ``splitlines()``, which splits on newline characters.
Let's do this with a haiku, popularly attributed to the 17th-century poet Matsuo Bashō:

In [None]:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

haiku.splitlines()

Note that if you would like to undo a ``split()``, you can use the ``join()`` method, which returns a string built from a splitpoint and an iterable:

In [None]:
'--'.join(['1', '2', '3'])

A common pattern is to use the special character ``"\n"`` (newline) to join together lines that have been previously split, and recover the input:

In [None]:
print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))
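The round trip can be checked directly, and the same split-then-join idea is a handy way to normalize whitespace. This is only a sketch; the ``messy`` string is invented for the example.

```python
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

# Splitting into lines and joining with "\n" gives back the original text
restored = "\n".join(haiku.splitlines())
print(restored == haiku)  # True

# split() with no argument collapses any run of whitespace
messy = "the   quick \t brown    fox"
print(" ".join(messy.split()))  # the quick brown fox
```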

### Formatting strings

In [None]:
pi = 3.14159
str(pi)

In [None]:
print("The value of pi is " + str(pi))

A more flexible way to do this is to use *format strings*, which are strings with special markers (noted by curly braces) into which string-formatted values will be inserted.
Here is a basic example:

In [None]:
"The value of pi is {}".format(pi)

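An even simpler option is an f-string; here the format specifier ``0.2f`` also rounds the value to two decimal places:

In [None]:
print(f"The value of pi is {pi:0.2f}")

``pi`` is a floating-point number, so it must be converted to a string before it can be joined to other text; ``str()``, ``format()`` and f-strings all perform that conversion.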
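A few more format specifiers, as a rough sketch of what the part after the colon can do. The same mini-language is shared by f-strings and ``format()``.

```python
pi = 3.14159

print(f"{pi:.3f}")          # 3.142: three decimal places
print(f"{pi:10.2f}")        # right-aligned in a field 10 characters wide
print("{:.2f}".format(pi))  # 3.14: the same specifier with format()
print(f"{pi:.2e}")          # 3.14e+00: scientific notation
```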

### Easy regex manipulation

In [None]:
import re

In [None]:
line = 'the quick brown fox jumped over a lazy dog'

Once a pattern has been compiled, its ``regex.search()`` method operates a lot like ``str.index()`` or ``str.find()``:

In [None]:
line.index('fox')

In [None]:
regex = re.compile('fox')
match = regex.search(line)
match.start()

Similarly, the ``regex.sub()`` method operates much like ``str.replace()``:

In [None]:
line.replace('fox', 'BEAR')

In [None]:
regex.sub('BEAR', line)
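Where ``regex.sub()`` goes beyond ``str.replace()`` is that one pattern can describe several alternatives at once. A minimal sketch, with ``animals`` as an illustrative name:

```python
import re

line = 'the quick brown fox jumped over a lazy dog'

# The | operator matches either alternative
animals = re.compile(r'fox|dog')
print(animals.sub('BEAR', line))  # the quick brown BEAR jumped over a lazy BEAR
```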

The following is a table of the repetition markers available for use in regular expressions:

| Character | Description | Example |
|-----------|-------------|---------|
| ``?`` | Match zero or one repetitions of preceding  | ``"ab?"`` matches ``"a"`` or ``"ab"`` |
| ``*`` | Match zero or more repetitions of preceding | ``"ab*"`` matches ``"a"``, ``"ab"``, ``"abb"``, ``"abbb"``... |
| ``+`` | Match one or more repetitions of preceding  | ``"ab+"`` matches ``"ab"``, ``"abb"``, ``"abbb"``... but not ``"a"`` |
| ``.`` | Any character except a newline | ``.*`` matches any run of characters within a line |
| ``{n}`` | Match ``n`` repetitions of preceding | ``"ab{2}"`` matches ``"abb"`` |
| ``{m,n}`` | Match between ``m`` and ``n`` repetitions of preceding | ``"ab{2,3}"`` matches ``"abb"`` or ``"abbb"`` |
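The rows of the table can be checked with ``re.fullmatch()``, which succeeds only when the whole string fits the pattern. This is a small sketch; the candidate strings are chosen here for illustration.

```python
import re

examples = {
    r"ab?":     ["a", "ab", "abb"],
    r"ab*":     ["a", "ab", "abbb"],
    r"ab+":     ["a", "ab", "abbb"],
    r"ab{2}":   ["ab", "abb", "abbb"],
    r"ab{2,3}": ["ab", "abb", "abbb", "abbbb"],
}

results = {}
for pattern, candidates in examples.items():
    # Keep only the candidates that the pattern matches in full
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, "->", results[pattern])
```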

In [None]:
bool(re.search(r'ab', "Boabab"))

In [None]:
bool(re.search(r'.*ma.*', "Ala ma kota"))

In [None]:
bool(re.search(r'.*(psa|kota).*', "Ala ma kota"))

In [None]:
bool(re.search(r'.*(psa|kota).*', "Ala ma psa"))

In [None]:
bool(re.search(r'.*(psa|kota).*', "Ala ma chomika"))

In [None]:
zdanie = "Ala ma kota."
wzor = r'.*'  # matches any sentence
zamiennik = "Ala ma psa."

In [None]:
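# count=1 stops after the first match; without it, on Python 3.7+ the pattern
# also matches the empty string at the end and the replacement appears twice
re.sub(wzor, zamiennik, zdanie, count=1)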

``.*`` is a very general pattern; usually a more precise expression is preferable, so that it matches only the text we intend.

For instance, for a sentence made up of letters, spaces and full stops, we could use something like:

In [None]:
wzor = r"[a-żA-Ż .]+"

In [None]:
zdanie = "Ala ma kota."
zamiennik = "Ala ma psa."

In [None]:
re.sub(wzor, zamiennik, zdanie)

Or we could require that it starts with a capital letter and ends with ``.``, ``?`` or ``!``:

In [None]:
wzor = r"[A-Ż][a-ż ]+[\.\!\?]"

In [None]:
re.sub(wzor, zamiennik, zdanie)

We can create groups with ``()`` and then refer to them in the replacement as ``\1``, ``\2``, and so on:

In [None]:
wzor = r'(.*) kota\.'  # space kept outside the group; the final dot is escaped
zamiennik = r"\1 psa."

In [None]:
re.sub(wzor, zamiennik, zdanie)

In [None]:
wzor = r'(.*) ma (.*)'  # spaces kept outside the groups to avoid doubled spaces
zamiennik = r"\1 posiada \2"

In [None]:
re.sub(wzor, zamiennik, zdanie)
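Groups can also be read directly from a match object, without substituting anything. A brief sketch; the pattern below is chosen for illustration and is not one of the patterns used above.

```python
import re

zdanie = "Ala ma kota."
match = re.search(r"(\w+) ma (\w+)", zdanie)

print(match.group(0))  # Ala ma kota: the whole match
print(match.group(1))  # Ala: the first group
print(match.groups())  # ('Ala', 'kota'): all groups as a tuple
```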

Groups are useful for extracting information from large files:

In [None]:
text_with_phone_numbers = """
Jan, Kowalski, 123-234-523
Maciej, Nowak, 95845321
Katarzyna Nowacka 23 423 45 23
"""

In [None]:
simple_num_regex = r"[0-9][0-9 \-]+[0-9]"

In [None]:
list_of_nums = re.findall(simple_num_regex, text_with_phone_numbers)

In [None]:
list_of_nums

In [None]:
surname_num_regex = r"([A-Z][a-z]+)[ ,]+([0-9][0-9 \-]+[0-9])"

In [None]:
tuples_with_surmanes_nums = re.findall(surname_num_regex, text_with_phone_numbers)

In [None]:
tuples_with_surmanes_nums

In [None]:
for item in tuples_with_surmanes_nums:
    print(f"{item[0]} has number {re.sub('[- ]', '', item[1])}")
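The same extraction can be collected into a dictionary for later lookups. This is a sketch under the assumption that surnames are unique; ``phone_book`` is a name introduced here.

```python
import re

text_with_phone_numbers = """
Jan, Kowalski, 123-234-523
Maciej, Nowak, 95845321
Katarzyna Nowacka 23 423 45 23
"""
surname_num_regex = r"([A-Z][a-z]+)[ ,]+([0-9][0-9 \-]+[0-9])"

# Map each surname to its number with spaces and dashes removed
phone_book = {
    surname: re.sub(r"[- ]", "", number)
    for surname, number in re.findall(surname_num_regex, text_with_phone_numbers)
}
print(phone_book)
```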

### Python API library

We can easily work with text downloaded from the web in several ways, e.g.:

* a ready-made library
* scraping
* a REST API (we will deal with these some other time)

To use a library, we first need to install it:


In [None]:
!pip install wikipedia

In [None]:
import wikipedia


Here is a [nice description](https://towardsdatascience.com/wikipedia-api-for-python-241cfae09f1c) of this library.

Let's try to find something:

In [None]:
wikipedia.set_lang("pl")
person = wikipedia.page("Józef Piłsudski")
print(person.content[:1000])


We want to find the most important years in his life:

In [None]:
year_regex = r"[0-9]{4}"

In [None]:
years = re.findall(year_regex, person.content)

In [None]:
print(years)

In [None]:
year_occurences = {x:years.count(x) for x in years}
years_in_order = sorted(year_occurences.items())


In [None]:
years_in_order[:8]
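The counting step above can also be sketched with ``collections.Counter`` from the standard library, which tallies the list in a single pass. The ``sample`` string is invented for illustration and stands in for ``person.content``.

```python
import re
from collections import Counter

sample = "1918 1920 1918 1867 1935 1918 1920"  # made-up stand-in for an article

counts = Counter(re.findall(r"[0-9]{4}", sample))

print(sorted(counts.items()))  # chronological order
print(counts.most_common(2))   # the two most frequent years
```

``Counter`` behaves like a dictionary, so ``sorted(counts.items())`` gives the same kind of list as ``years_in_order``.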

Let's keep only the years we are interested in, that is, those within his lifetime (1867 to 1935):

In [None]:
def check_year(item):
    """Return True for (year, count) pairs whose year lies between 1867 and 1935."""
    year, count = item
    return 1866 < int(year) < 1936

In [None]:
relevant_years = list(filter(check_year, years_in_order))


Finally, we can plot the result:

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

years_df = pd.DataFrame(relevant_years, columns=['Year', 'Occurences'])
sns.set(rc={'figure.figsize':(20,16)})
ax = plt.gca()
ax.xaxis.set_major_locator(ticker.MultipleLocator(base=5))
sns.lineplot(data=years_df, x="Year", y="Occurences")

### Python Web Scraping

Sometimes there are websites that make text directly available for us (e.g. books).

If we want to download a Polish book, we can do it as follows:

In [None]:
!wget https://wolnelektury.pl/media/book/txt/lalka-tom-pierwszy.txt

Now we have a text file (visible in the file browser on the side), and we just need to read it:

In [None]:
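# encoding is given explicitly so that Polish characters are read the same way on every platform
with open('lalka-tom-pierwszy.txt', 'r', encoding='utf-8') as book: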
    lalka = book.read()

print(lalka[:300])

### Exercises

* Choose an interesting period with a long article (e.g. Romanticism on the English Wikipedia) and pick some important names (e.g. Byron, Mickiewicz, Goethe). Using regular expressions to catch all variations of each name, find how many times each is mentioned.

and/or

* Do the same with a book and the characters of its story. If the book has multiple parts (like "Lalka"), you can count the occurrences separately for each part.
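As a starting point for either exercise, here is a rough sketch of counting name mentions with a regex. The ``sample`` text is invented for illustration, and matching a name followed by any word characters is only one possible way to catch variations; inflected languages may need a shorter stem.

```python
import re

# Made-up sample text standing in for an article or a book
sample = (
    "Lord Byron travelled widely. Byron's poems influenced Mickiewicz, "
    "and Goethe admired Byron."
)

names = ["Byron", "Mickiewicz", "Goethe"]

# \b anchors the name at a word boundary; \w* allows endings attached to it
mentions = {
    name: len(re.findall(rf"\b{re.escape(name)}\w*", sample))
    for name in names
}
print(mentions)
```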



