# Seminar 01. Jupyter, Python, NumPy, NLTK

[Dr. Constantine Korikov, Huawei](mailto:constantine.korikov@huawei.com)

Dr. Valentin Malykh, Huawei

Dr. Ilseyar Alimova, Huawei

## 1. Jupyter notebook

### 1.1.  What is Jupyter
[Jupyter Notebook Users Manual](https://jupyter.brynmawr.edu/services/public/dblank/Jupyter%20Notebook%20Users%20Manual.ipynb)
![](https://jupyter.org/assets/labpreview.png)

Jupyter is a de facto standard in programming education. A notebook consists of cells, which can be of different types:
1. code
2. markdown
3. raw

Under the hood, a notebook is a JSON document with metadata. For instance, the first few lines of this notebook are listed below.
```json
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Seminar 01. Python, NumPy, NLTK\n",
    "\n",
    "Dr. Constantine Korikov, Huawei"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Python crash course"
   ]
  }
 ]
}
```
> **Note**
>
> JSON (JavaScript Object Notation) is a lightweight data-interchange format.
> It is easy for humans to read and write. It is easy for machines to parse and generate.
>
> [Read more about JSON](https://www.json.org/json-en.html)
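
Since a notebook file is plain JSON, it can be inspected with Python's standard `json` module. The sketch below parses a tiny hand-written notebook (an assumption for illustration, not the actual file of this seminar) and lists its cell types.

```python
import json

# A minimal notebook document written by hand for this example.
raw = """
{
 "cells": [
  {"cell_type": "markdown", "metadata": {}, "source": ["# Title\\n", "\\n", "Some text"]},
  {"cell_type": "code", "metadata": {}, "source": ["1+2+3"], "outputs": [], "execution_count": null}
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
"""

notebook = json.loads(raw)

# Every cell is a dict; "source" holds the cell text as a list of lines.
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
first_cell_text = "".join(notebook["cells"][0]["source"])

print(cell_types)
print(first_cell_text)
```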

### 1.2. Cells

#### Markdown cells

All text in this file is contained in markdown cells. Jupyter supports rich features of markdown. For example, some text formatting usage is shown below.

`**This is bold text**`

**This is bold text**

`__This is bold text__`

__This is bold text__

`*This is italic text*`

*This is italic text*

`_This is italic text_`

_This is italic text_

`~~Strikethrough~~`

~~Strikethrough~~

`![logo](http://www-file.huawei.com/-/media/corporate/images/home/logo/huawei_logo.png)`

![logo](http://www-file.huawei.com/-/media/corporate/images/home/logo/huawei_logo.png)

`$$\int_{-\infty}^{+\infty} e^{-x^2}\,dx = \sqrt{\pi}$$`

$$
    \int_{-\infty}^{+\infty} e^{-x^2}\,dx = \sqrt{\pi}
$$

> **Note**
>
> Markdown is a lightweight markup language with plain-text-formatting syntax.
>
> [Read more about markdown features in Jupyter](https://athena.brynmawr.edu/jupyter/hub/dblank/public/Jupyter%20Notebook%20Users%20Manual.ipynb)

#### Code cells

Code cells let users work with a programming backend (a kernel) in REPL mode.

> **Note**
>
> REPL (read–eval–print loop) is a simple, interactive computer programming environment that takes single user
> inputs, evaluates them, and returns the result to the user. A program written in a REPL environment is executed
> piecewise.  
>
> [Read more about REPL](https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop)

Typically, if you type code into a code cell and press `Shift` + `Enter`, you will see the result produced by the backend. Here, the backend is Python.

In [1]:
1+2+3

6

If you put the `!` symbol before a command, the line is passed to the system shell instead.

In [2]:
!python --version

Python 3.10.12


### 1.3. Magic commands

Jupyter provides internal commands known as magic commands. They are part of neither Python nor the shell. Line magics start with the `%` symbol and apply to a single line; cell magics start with `%%` and apply to the whole cell. To get the full list of supported magic commands, run the following line.

In [3]:
%lsmagic

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %conda  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %shell  %store  %sx  %system  %tb  %tensorflow_version  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%bigquery  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%late

For example, the following command shows the list of variables defined in the session.

In [4]:
%who_ls

[]

In [5]:
a = 1

In [6]:
%who_ls

['a']

The `%timeit` command runs Python code many times to reduce the influence of other tasks on the machine, such as disk flushing and OS scheduling.

In [7]:
%timeit 2**128

356 ns ± 5.49 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
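
Outside a notebook, a similar measurement can be made with the standard `timeit` module. This is a minimal sketch: the loop count is fixed by hand here, whereas `%timeit` chooses it automatically, and the sketch reports the fastest run rather than the mean and standard deviation.

```python
import timeit

# repeat=7 mirrors the "7 runs" in the output above; number is loops per run.
loops = 100_000
run_times = timeit.repeat("2**128", repeat=7, number=loops)

# The fastest run is the one least disturbed by other processes.
best_ns_per_loop = min(run_times) / loops * 1e9
print(f"{best_ns_per_loop:.1f} ns per loop (best of {len(run_times)} runs)")
```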


The following magic command helps to install additional Python packages. It runs the package manager `pip`.

In [8]:
%pip install matplotlib



> **Note**
>
> [Read more about magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html)

## 2.  Python crash course

### 2.1. Python basics

Python is a friendly language named after the famous British show "Monty Python". Let's print the title of the show.

In [9]:
print("Monty Python")

Monty Python


Now, let's introduce a variable and print its value. Here, the `f` before the opening `"` marks a formatted string (f-string).

In [10]:
x = 6
print(f"Variable x={x}")

Variable x=6


The display format can be tuned.

In [11]:
y = 3.1415926
print(f"Variable y={y:.3f}")

Variable y=3.142


Python has a rich ecosystem where you can find packages for many tasks. To use additional functionality, just import a module by its name.

Everything in Python is an object, and variables hold references to objects rather than the objects themselves. An example of an object is a `list`.

In [12]:
a = [1,2,3,5]

Let's add a new value to the end of the list.

In [13]:
a.append(4)
a
# We put `a` on the last line because append modifies the list in place
# and returns None
# (by the way, this is an example of a comment in Python)

[1, 2, 3, 5, 4]

Now, let's introduce another name `b` for the same list and append `5` through it.

In [14]:
b = a
b.append(5)
b

[1, 2, 3, 5, 4, 5]

Because `a` and `b` refer to the same object, we see the change in `a` too.

In [15]:
a

[1, 2, 3, 5, 4, 5]

In [16]:
a.append(b)

In [17]:
a

[1, 2, 3, 5, 4, 5, [...]]

In [18]:
from copy import deepcopy
d = deepcopy(a)

In [19]:
d

[1, 2, 3, 5, 4, 5, [...]]

In [20]:
a

[1, 2, 3, 5, 4, 5, [...]]

In [21]:
a.append("ce")

In [22]:
d

[1, 2, 3, 5, 4, 5, [...]]

So, be careful with mutable objects. If you really need an independent copy of a list, use `deepcopy` from the `copy` module.

In [23]:
from copy import deepcopy
c = deepcopy(a)
a.append(6)
c

[1, 2, 3, 5, 4, 5, [...], 'ce']
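
The difference between a second name, a shallow copy and a deep copy can be sketched on a small nested list. The variable names below are chosen for this example only.

```python
from copy import copy, deepcopy

original = [1, 2, [3, 4]]
alias = original             # a second name for the same list object
shallow = copy(original)     # new outer list, but the inner list is shared
deep = deepcopy(original)    # fully independent copy, inner list included

original[2].append(5)        # mutate the nested list
original.append(6)           # mutate the outer list

print(alias is original)     # True
print(alias)                 # [1, 2, [3, 4, 5], 6]
print(shallow)               # [1, 2, [3, 4, 5]]
print(deep)                  # [1, 2, [3, 4]]
```

A shallow copy is enough for flat lists; nested mutable objects need `deepcopy`.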

### 2.2. Types and operations

The following code shows integer arithmetic.

In [24]:
x = 5
print(type(x))
print(x + 1)
print(x - 1)
print(x * 2)
print(x ** 2)  # Power

print(x)
x += 1
print(x)
x *= 2
print(x)

<class 'int'>
6
4
10
25
5
6
12


In [26]:
y = 2.5
print(type(y))
print(y, y + 1, y * 2, y ** 2)

<class 'float'>
2.5 3.5 5.0 6.25


Boolean operations

In [27]:
t = True
f = False
print(type(t))
print(t and f)
print(t or f)
print(not t)
print(t != f)

<class 'bool'>
False
True
False
True


Strings

In [28]:
hello = 'hello'
world = "world"
print(hello, )
print(len(hello))  # String length
hw = hello + ' ' + world  # String concatenation
print(hw)

hello
5
hello world


In [29]:
multiline = """
One
   Two
      Three
"""
print(multiline) # \n


One 
   Two
      Three



Some string methods

In [30]:
print(hw.capitalize())  # Capitalize a string
print(hw.upper())       # Convert a string to uppercase
print(hw.replace('l', '(ell)'))  # Replace all instances of one substring with another

Hello world
HELLO WORLD
he(ell)(ell)o wor(ell)d


Control flow and loops

In [31]:
x = 1
# If-else
if x < 0:
    print("Negative")
elif x == 0:
    print('Zero')
else:
    print('Positive')

# Ternary operator
a = "One" if x>0 else "Two"
print(a)

# While loop
while x > 0:
    print(x)
    x-=1

# For loop
for _ in range(5):
    print(x)

# List comprehension
b = [x**2 for x in range(5)]
print(b)

Positive
One
1
0
0
0
0
0
[0, 1, 4, 9, 16]


### 2.3. Functions

A function is a useful unit for decomposing a program. The following example shows how to define a function.

In [32]:
def plus_one(x: int) -> int:
    """This function returns incremented value"""
    return x+1

It is simple to use. Note that type hints are not enforced at run time, so passing a float works too.

In [33]:
plus_one(5.14)

6.14
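
The annotations and the docstring are stored on the function object and can be inspected. The following sketch repeats the definition above so that it runs on its own.

```python
def plus_one(x: int) -> int:
    """This function returns incremented value"""
    return x + 1

print(plus_one.__annotations__)  # {'x': <class 'int'>, 'return': <class 'int'>}
print(plus_one.__doc__)          # the docstring text

result = plus_one(5.14)          # no error: the hint is documentation only
print(type(result))              # <class 'float'>
```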

If the details are not needed, a function can be defined more briefly.

In [34]:
def plus_one(x): return x+1

plus_one(5)

6

or even without name

In [35]:
(lambda x: x+1)(5)

6

In [36]:
fun = lambda x: x+1

In [37]:
fun(6)

7

Another kind of function is a generator:

In [38]:
def counts(value=0, step=1): # takes input and keeps its state between calls
    while 1:
        value += step
        yield value

g = counts(step=3)
type(g), next(g), next(g), next(g)

(generator, 3, 6, 9)

> **Note**
>
> Lazy evaluation is an evaluation strategy which delays the evaluation
> of an expression until its value is needed and which also avoids repeated evaluations.
>
> [Read more about lazy evaluation](https://en.wikipedia.org/wiki/Lazy_evaluation)
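
Generators are Python's main tool for lazy evaluation. The sketch below, which assumes the `counts` generator from above, takes a finite prefix of an infinite sequence with `itertools.islice` and shows a generator expression that computes values only on request.

```python
from itertools import islice

def counts(value=0, step=1):
    """Infinite generator: each next() call resumes after the last yield."""
    while True:
        value += step
        yield value

# islice takes a finite prefix of an infinite generator.
first_five = list(islice(counts(step=3), 5))
print(first_five)        # [3, 6, 9, 12, 15]

# A generator expression: no squares are computed until they are requested.
squares = (n * n for n in range(10**9))
print(next(squares))     # 0
print(next(squares))     # 1
```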

### 2.4. Containers in python

Python has several built-in containers; some of them are listed below.

In [40]:
c_tpl = (1, 1.2, "x")
c_rng = range(10)
c_fst = frozenset({1,2,3}) # readonly set
c_lst = [1,2,3]
c_dct = {1: "One", 2: "Two", 3: "Three"}
c_set = {1,2,3}
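
A short sketch of how these containers differ in practice: lists, sets and dicts can be changed in place, while tuples and frozensets cannot. The names follow the cell above; `tuple_error` is introduced here for illustration.

```python
c_tpl = (1, 1.2, "x")
c_rng = range(10)
c_fst = frozenset({1, 2, 3})
c_lst = [1, 2, 3]
c_dct = {1: "One", 2: "Two", 3: "Three"}
c_set = {1, 2, 3}

c_lst.append(4)            # lists are mutable
c_set.add(2)               # sets ignore duplicates
c_dct[4] = "Four"          # dicts map keys to values

try:
    c_tpl[0] = 2           # tuples are immutable
except TypeError as error:
    tuple_error = str(error)

print(c_lst, c_set, c_dct[4])
print(5 in c_rng, len(c_rng))   # True 10
print(type(c_fst | {4}))        # a union builds a new frozenset
print(tuple_error)
```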

Additional containers can be found in the `collections` module. For example, `namedtuple` is often useful.

In [41]:
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
p = Point(1,2)

In [42]:
p

Point(x=1, y=2)

In [43]:
a = (1, 2, 3)

In [44]:
b = a

In [45]:
b = b + ('2',)
a

(1, 2, 3)

In [46]:
b

(1, 2, 3, '2')

or `Counter`, a multiset implementation

In [47]:
from collections import Counter

s = 'hello world'
c = Counter(s)
c.most_common()

[('l', 3),
 ('o', 2),
 ('h', 1),
 ('e', 1),
 (' ', 1),
 ('w', 1),
 ('r', 1),
 ('d', 1)]

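or `dataclass` as a mutable alternative to `namedtuple`. The `dataclasses` module has been part of the standard library since Python 3.7, so the installation step below is only needed on older versions.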

In [48]:
%pip install dataclasses

Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Installing collected packages: dataclasses
Successfully installed dataclasses-0.6


In [49]:
from dataclasses import dataclass

@dataclass
class Structure:
    name: str
    value: float

s = Structure("x", 2)
s

Structure(name='x', value=2)
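
The practical difference between the two can be sketched as follows: a dataclass instance can be modified in place, while a `namedtuple` has to be rebuilt. The definitions are repeated so that the example runs on its own.

```python
from collections import namedtuple
from dataclasses import dataclass

Point = namedtuple("Point", ["x", "y"])

@dataclass
class Structure:
    name: str
    value: float

p = Point(1, 2)
s = Structure("x", 2)

s.value = 3.5                  # a dataclass instance can be modified in place
try:
    p.x = 10                   # a namedtuple cannot
except AttributeError:
    p = p._replace(x=10)       # instead, build a new tuple with one field changed

print(s)                       # Structure(name='x', value=3.5)
print(p)                       # Point(x=10, y=2)
```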

### 2.5.  Regular expressions in python


A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.
Usually such patterns are used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

Python supports regular expressions through the `re` module. It is convenient to experiment with online services like [http://www.pyregex.com/](http://www.pyregex.com/) to see how regexps work.

There are other services:
- [https://www.regextester.com/](https://www.regextester.com/)
- [https://regex101.com/](https://regex101.com/)
- [https://regexr.com/](https://regexr.com/)
- [https://pythex.org/](https://pythex.org/)

A pattern string can contain:
- Special characters like `\t` (tab) or `\\` (a literal backslash).
- Character classes, which match one symbol from a set, e.g. `[ae]` (a or e), `[A-Z]` (any symbol from A to Z), `\d` (any digit), `.` (any symbol except a newline).
- Anchors like `^` (start of the line) or `$` (end of the line).
- Matching groups, which can be accessed after the pattern is applied to a string: `()` (the subpattern of a group is placed between parentheses). There are several modifiers of matching groups.
- Quantifiers, which specify how many instances of the previous element to match, like `*` (0 or more) or `+` (1 or more).
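
Each of the pattern elements listed above can be tried on a short string. The sample strings and variable names are made up for this sketch.

```python
import re

# Character classes
vowels = re.findall(r"[ae]", "seminar")              # ['e', 'a']
digits = re.findall(r"\d", "NLTK 3.8")               # ['3', '8']

# Anchors
starts = bool(re.search(r"^Monty", "Monty Python"))  # True
not_start = bool(re.search(r"^Python", "Monty Python"))  # False
ends = bool(re.search(r"Python$", "Monty Python"))   # True

# Quantifiers: * allows zero repetitions, + requires at least one
star = re.findall(r"ab*", "a ab abb")                # ['a', 'ab', 'abb']
plus = re.findall(r"ab+", "a ab abb")                # ['ab', 'abb']

# Matching groups
year, month = re.search(r"(\d+)-(\d+)", "date: 2024-02").groups()

print(vowels, digits, starts, not_start, ends, star, plus, year, month)
```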

> **Note**
>
> [Read more about regular expressions in Python](https://docs.python.org/library/re.html)

For instance, let's look at how to extract the integer and fractional parts of a float number using a regular expression.

In [52]:
import re

pattern = r"(\d+)\.(\d+)" # the r prefix makes this a raw string, so backslashes are kept as is
matches = re.match(pattern, "3.1415926")
matches.groups()

('3', '1415926')

This regular expression works as follows. The first matching group `(\d+)` captures `3`: `\d` matches a digit (same as `[0-9]` for ASCII text), and the `+` quantifier means one or more times, as many as possible. The next element, `\.`, matches the character `.` literally. The second matching group is the same as the first, and it captures `1415926`.

In [53]:
# . \w \W \d \D \s \S \b

Let's now try writing a regex that would match

all alphanumeric strings with a hyphen, apostrophe or point inside (i.e. it should be able to find "44.44", "a-ha", "it's")

**OR**

any non-whitespace character followed by zero or more alphanumeric characters (i.e. "hello!.?" should result in "hello", "!", "." and "?").

In [54]:
dummy_example = "It's a dum-dum example, we'll place it here to prove a point. Also, look at this number: 300.99."

In [None]:
pattern = re.compile('')
print(re.findall(pattern, dummy_example))

The most common methods that you will probably use are **search**, **findall**, **split** and **sub**.
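
A minimal sketch of these four functions on made-up strings. It also shows that `re.match` only tries the start of the string, which is why `re.search` is usually the safer choice.

```python
import re

text = "Prices: 300.99 and 12.5, order #42."

# search: the first match anywhere in the string (or None)
first = re.search(r"\d+\.\d+", text)
print(first.group())                       # 300.99

# match: anchored at the start, so it finds nothing here
print(re.match(r"\d+", text))              # None

# findall: all non-overlapping matches as a list of strings
all_numbers = re.findall(r"\d+(?:\.\d+)?", text)
print(all_numbers)                         # ['300.99', '12.5', '42']

# split: cut the string wherever the pattern matches
parts = re.split(r"[,;]\s*", "a, b;c,  d")
print(parts)                               # ['a', 'b', 'c', 'd']

# sub: replace every match
masked = re.sub(r"\d", "#", "call 555-0100")
print(masked)                              # call ###-####
```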

## 3. NumPy

[NumPy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)

NumPy is a Python library for scientific computing that provides efficient arrays. An array in NumPy is called an *ndarray*, short for N-dimensional array.

To start using NumPy, just import the library. If a library is used frequently in a program, it is useful to give it an alias. Here, `np` is the widely used alias for NumPy.

In [56]:
import numpy as np

### 3.1. Why are NumPy arrays efficient?

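An ndarray consists of 3 parts:
- data buffer (a packed sequence of homogeneous data)
- metadata describing the data type
- metadata describing the shape

Because the data is packed and uniformly typed, operations run as compiled loops over the buffer instead of interpreting each element. That is why an ndarray is much faster than a built-in Python list.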

In [57]:
%timeit a = [i**2 for i in range(1000)]

302 µs ± 6.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [58]:
%timeit b = np.arange(1000)**2

2.75 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
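
The three parts of an ndarray can be inspected directly through its attributes. This is a small sketch on an array made up for illustration.

```python
import numpy as np

x = np.arange(6, dtype=np.int32).reshape(2, 3)

print(x.dtype)     # int32: every element has the same type
print(x.shape)     # (2, 3)
print(x.itemsize)  # 4 bytes per element
print(x.nbytes)    # 24 bytes in the data buffer
print(x.strides)   # (12, 4): bytes to step to the next row / next column

# Reshaping only rewrites the metadata; the data buffer is shared.
y = x.reshape(3, 2)
print(np.shares_memory(x, y))  # True
```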


### 3.2. Simple NumPy operations

A simple 1D array

In [59]:
# Create array
x = np.array([1, 2, 3], np.int32)

# Append an element (np.append returns a new array; it does not modify x in place)
x = np.append(x, np.int32(4))

# Some information about array (+pack them into tuple)
(type(x),
 x.shape, # shape
 x.dtype, # data type
 x[2],    # get element by index
 x[:2]    # slice (get subarray)
)

(numpy.ndarray, (4,), dtype('int32'), 3, array([1, 2], dtype=int32))

### 3.3. Indexes

In [63]:
x = np.array([1,2,3], np.int8)
x[::-1]

array([3, 2, 1], dtype=int8)

In [64]:
x[::2]

array([1, 3], dtype=int8)

In [65]:
x = np.array([[1,2,3],
              [4,5,6],
             ],
             np.int8)
x.reshape(6,1)

array([[1],
       [2],
       [3],
       [4],
       [5],
       [6]], dtype=int8)

Indexing by mask

In [66]:
x = np.array([1,2,3], np.int8)
x[[False, True, False]]

array([2], dtype=int8)

A mask can also be computed from a condition on the array

In [67]:
x = np.arange(1, 10)
x[x%2==0]

array([2, 4, 6, 8])
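
The expression `x%2==0` is itself a boolean array, which is then used as an index. The sketch below makes the mask explicit and shows two common variations; the variable names are chosen for this example.

```python
import numpy as np

x = np.arange(1, 10)

mask = x % 2 == 0                        # element-wise comparison gives a boolean array
print(mask.dtype, mask.tolist())

evens = x[mask]                          # keep positions where the mask is True
big_evens = x[(x % 2 == 0) & (x > 4)]    # combine conditions with & and parentheses
zeroed = np.where(mask, x, 0)            # keep evens, replace the rest with 0

print(evens.tolist())                    # [2, 4, 6, 8]
print(big_evens.tolist())                # [6, 8]
print(zeroed.tolist())                   # [0, 2, 0, 4, 0, 6, 0, 8, 0]
```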

Indexing by a list of indexes

In [68]:
x = np.array([1,2,3], np.int8)
x[[1,2]]

array([2, 3], dtype=int8)

### 3.4. Some useful built-in array operations

Hadamard (element-wise) product and matrix product

In [69]:
a = np.array([[1,2],[3,4]])
b = np.array([[1,0],[0,1]])

print(a*b, a.dot(b), sep='\n\n')

[[1 0]
 [0 4]]

[[1 2]
 [3 4]]
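
With the identity matrix as `b`, the matrix product simply returns `a`. A non-identity matrix shows the difference between the two products more clearly. The sketch below also uses the `@` operator, which is equivalent to `dot` for 2-D arrays.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])   # swaps columns when multiplied from the right

hadamard = a * b                 # element-wise product
matrix = a @ b                   # matrix product, same as a.dot(b) for 2-D arrays
inner = np.dot(np.array([1, 2, 3]), np.array([4, 5, 6]))  # inner product of 1-D arrays

print(hadamard.tolist())         # [[0, 2], [3, 0]]
print(matrix.tolist())           # [[2, 1], [4, 3]]
print(inner)                     # 32
```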


Sum, mean, max, argmax

In [70]:
x = np.random.rand(10)
(
    x,
    x.sum(),
    x.mean(),
    x.max(),
    x.argmax()
)

(array([0.34095855, 0.74454496, 0.35789532, 0.86597506, 0.47198057,
        0.18782758, 0.97902178, 0.57557235, 0.07335848, 0.39863506]),
 4.995769720218177,
 0.49957697202181767,
 0.9790217843093959,
 6)

Broadcasting

In [71]:
a = np.array([[1,2],[3,4]])
b = np.array([[0,1]])
a+b

array([[1, 3],
       [3, 5]])
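
Broadcasting compares shapes from the last dimension backwards; two dimensions are compatible when they are equal or when one of them is 1, in which case that axis is repeated. A minimal sketch with arrays made up for illustration:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])        # shape (2, 2)
row = np.array([[0, 1]])              # shape (1, 2): repeated along rows
col = np.array([[10], [20]])          # shape (2, 1): repeated along columns

print((a + row).tolist())             # [[1, 3], [3, 5]]
print((a + col).tolist())             # [[11, 12], [23, 24]]
print((row + col).tolist())           # [[10, 11], [20, 21]], shapes (1, 2) and (2, 1) give (2, 2)

broadcast_failed = False
try:
    a + np.array([1, 2, 3])           # shapes (2, 2) and (3,) are incompatible
except ValueError as error:
    broadcast_failed = True
    print("ValueError:", error)
```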

## 4. NLTK

The package can be installed directly from Jupyter.

In [72]:
! pip install nltk



We will use some NLTK modules that need additional data. The `download` function fetches it.

In [73]:
import nltk
#nltk.set_proxy('http://user:password@proxy.example.com:8080')
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

The proxy settings are optional here. Let's use the text of the Zen of Python to play with some functions of NLTK.

In [74]:
import this
import codecs

zen_of_python = codecs.encode(this.s, 'rot13')

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [75]:
russian_text = """Граф Лев Николаевич Толсто́й[К 1] (28 августа [9 сентября] 1828, Ясная Поляна, Тульская губерния, Российская
империя — 7 [20] ноября 1910, станция Астапово, Рязанская губерния, Российская империя) — один из наиболее известных русских
писателей и мыслителей, один из величайших писателей-романистов мира[4]. Участник обороны Севастополя. Просветитель, публицист,
религиозный мыслитель, его авторитетное мнение послужило причиной возникновения нового религиозно-нравственного течения —
толстовства. За свои взгляды был отлучен от церкви. Член-корреспондент Императорской Академии наук (1873), почётный академик
по разряду изящной словесности (1900)[5]. Был номинирован на Нобелевскую премию по литературе (1902, 1903, 1904, 1905).
Впоследствии отказался от дальнейшей номинации.

Писатель, ещё при жизни признанный главой русской литературы[6]. Творчество Льва Толстого ознаменовало новый этап в русском и
мировом реализме, выступив мостом между классическим романом XIX века и литературой XX века. Лев Толстой оказал сильное влияние
на эволюцию европейского гуманизма, а также на развитие реалистических традиций в мировой литературе. Произведения Льва
Толстого многократно экранизировались и инсценировались в СССР и за рубежом; его пьесы ставились на сценах всего мира[6]. Лев
Толстой был самым издаваемым в СССР писателем за 1918—1986 годы: общий тираж 3199 изданий составил 436,261 млн экземпляров[7].

Наиболее известны такие произведения Толстого, как романы «Война и мир», «Анна Каренина», «Воскресение»,
автобиографическая[8][6] трилогия «Детство», «Отрочество», «Юность»[К 2], повести «Казаки», «Смерть Ивана Ильича», «Крейцерова
соната», «Отец Сергий», «Хаджи-Мурат», цикл очерков «Севастопольские рассказы», драмы «Живой труп», «Плоды просвещения» и
«Власть тьмы», автобиографические религиозно-философские произведения «Исповедь» и «В чём моя вера?» и др.
"""

> **Note**
>
> We decode with rot13 because the `this` module stores the text ROT13-encoded in `this.s`.
>
> [See source of this module](https://github.com/python/cpython/blob/master/Lib/this.py)

### 4.1. Tokenization

Tokenization is the process of splitting text into tokens. Let's split the text into sentences.

In [76]:
from nltk.tokenize import sent_tokenize

tokens = sent_tokenize(zen_of_python)
print(tokens)

['The Zen of Python, by Tim Peters\n\nBeautiful is better than ugly.', 'Explicit is better than implicit.', 'Simple is better than complex.', 'Complex is better than complicated.', 'Flat is better than nested.', 'Sparse is better than dense.', 'Readability counts.', "Special cases aren't special enough to break the rules.", 'Although practicality beats purity.', 'Errors should never pass silently.', 'Unless explicitly silenced.', 'In the face of ambiguity, refuse the temptation to guess.', 'There should be one-- and preferably only one --obvious way to do it.', "Although that way may not be obvious at first unless you're Dutch.", 'Now is better than never.', 'Although never is often better than *right* now.', "If the implementation is hard to explain, it's a bad idea.", 'If the implementation is easy to explain, it may be a good idea.', "Namespaces are one honking great idea -- let's do more of those!"]


In [77]:
tokens[0]

'The Zen of Python, by Tim Peters\n\nBeautiful is better than ugly.'

Then, we split it into words.

In [78]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(zen_of_python)
print(tokens)

['The', 'Zen', 'of', 'Python', ',', 'by', 'Tim', 'Peters', 'Beautiful', 'is', 'better', 'than', 'ugly', '.', 'Explicit', 'is', 'better', 'than', 'implicit', '.', 'Simple', 'is', 'better', 'than', 'complex', '.', 'Complex', 'is', 'better', 'than', 'complicated', '.', 'Flat', 'is', 'better', 'than', 'nested', '.', 'Sparse', 'is', 'better', 'than', 'dense', '.', 'Readability', 'counts', '.', 'Special', 'cases', 'are', "n't", 'special', 'enough', 'to', 'break', 'the', 'rules', '.', 'Although', 'practicality', 'beats', 'purity', '.', 'Errors', 'should', 'never', 'pass', 'silently', '.', 'Unless', 'explicitly', 'silenced', '.', 'In', 'the', 'face', 'of', 'ambiguity', ',', 'refuse', 'the', 'temptation', 'to', 'guess', '.', 'There', 'should', 'be', 'one', '--', 'and', 'preferably', 'only', 'one', '--', 'obvious', 'way', 'to', 'do', 'it', '.', 'Although', 'that', 'way', 'may', 'not', 'be', 'obvious', 'at', 'first', 'unless', 'you', "'re", 'Dutch', '.', 'Now', 'is', 'better', 'than', 'never', '.

For example, we can use this list of tokens to find the most common words in the text.

In [79]:
from nltk.probability import FreqDist
dist = FreqDist(tokens)
dist.most_common(10)

[('.', 18),
 ('is', 10),
 ('better', 8),
 ('than', 8),
 ('to', 5),
 ('the', 5),
 (',', 4),
 ('of', 3),
 ('Although', 3),
 ('never', 3)]

Sometimes, we also need to use n-grams. An n-gram is a contiguous sequence of n items from a given sample of text or speech.

In [80]:
from nltk import ngrams

In [81]:
sent = 'В Тюмени больше ясных дней, чем в Краснодаре.'.split()
list(ngrams(sent, 1)) # unigrams

[('В',),
 ('Тюмени',),
 ('больше',),
 ('ясных',),
 ('дней,',),
 ('чем',),
 ('в',),
 ('Краснодаре.',)]

In [82]:
list(ngrams(sent, 2)) # bigrams

[('В', 'Тюмени'),
 ('Тюмени', 'больше'),
 ('больше', 'ясных'),
 ('ясных', 'дней,'),
 ('дней,', 'чем'),
 ('чем', 'в'),
 ('в', 'Краснодаре.')]

In [83]:
list(ngrams(sent, 3)) # trigrams

[('В', 'Тюмени', 'больше'),
 ('Тюмени', 'больше', 'ясных'),
 ('больше', 'ясных', 'дней,'),
 ('ясных', 'дней,', 'чем'),
 ('дней,', 'чем', 'в'),
 ('чем', 'в', 'Краснодаре.')]

In [84]:
list(ngrams(sent, 5)) # ... pentagrams?

[('В', 'Тюмени', 'больше', 'ясных', 'дней,'),
 ('Тюмени', 'больше', 'ясных', 'дней,', 'чем'),
 ('больше', 'ясных', 'дней,', 'чем', 'в'),
 ('ясных', 'дней,', 'чем', 'в', 'Краснодаре.')]
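
To see what happens inside, the same sliding window can be sketched in plain Python. The helper `simple_ngrams` is hypothetical and written for this example; it is not how NLTK implements `ngrams`.

```python
def simple_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples (hypothetical helper)."""
    # zip stops at the shortest slice, so len(tokens) - n + 1 windows remain.
    return list(zip(*(tokens[i:] for i in range(n))))

words = "Simple is better than complex".split()

bigrams = simple_ngrams(words, 2)
trigrams = simple_ngrams(words, 3)

print(bigrams)    # [('Simple', 'is'), ('is', 'better'), ('better', 'than'), ('than', 'complex')]
print(trigrams)   # [('Simple', 'is', 'better'), ('is', 'better', 'than'), ('better', 'than', 'complex')]
```

A sequence of `k` tokens yields `k - n + 1` n-grams, and none at all when `n` is larger than `k`.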

Unfortunately, NLTK is not optimized for Russian, so we use another tool. A common tokenizer for Russian is `razdel`; see [here](https://natasha.github.io/razdel/) for more on how it works.


In [85]:
%pip install razdel

Collecting razdel
  Downloading razdel-0.5.0-py3-none-any.whl (21 kB)
Installing collected packages: razdel
Successfully installed razdel-0.5.0


In [86]:
from razdel import sentenize, tokenize

text_generator = sentenize(russian_text)
print(next(text_generator))
print(next(text_generator))

list(tokenize(russian_text))[:20]

Substring(0, 308, 'Граф Лев Николаевич Толсто́й[К 1] (28 августа [9 сентября] 1828, Ясная Поляна, Тульская губерния, Российская \nимперия — 7 [20] ноября 1910, станция Астапово, Рязанская губерния, Российская империя) — один из наиболее известных русских \nписателей и мыслителей, один из величайших писателей-романистов мира[4].')
Substring(309, 338, 'Участник обороны Севастополя.')


[Substring(0, 4, 'Граф'),
 Substring(5, 8, 'Лев'),
 Substring(9, 19, 'Николаевич'),
 Substring(20, 28, 'Толсто́й'),
 Substring(28, 29, '['),
 Substring(29, 30, 'К'),
 Substring(31, 32, '1'),
 Substring(32, 33, ']'),
 Substring(34, 35, '('),
 Substring(35, 37, '28'),
 Substring(38, 45, 'августа'),
 Substring(46, 47, '['),
 Substring(47, 48, '9'),
 Substring(49, 57, 'сентября'),
 Substring(57, 58, ']'),
 Substring(59, 63, '1828'),
 Substring(63, 64, ','),
 Substring(65, 70, 'Ясная'),
 Substring(71, 77, 'Поляна'),
 Substring(77, 78, ',')]

In [87]:
for text in text_generator:
  print(text)
  break

Substring(339, 500, 'Просветитель, публицист, \nрелигиозный мыслитель, его авторитетное мнение послужило причиной возникновения нового религиозно-нравственного течения — \nтолстовства.')


### 4.2. Stemming

Usually, we want to preprocess text before analysis. Normalization is a preprocessing technique that simplifies analysis, and stemming is one type of normalization. The following code shows how to use the Porter stemming method to get the stems of words.

<img src="https://miro.medium.com/v2/resize:fit:1358/0*PD8n0zNFSjFKNP_m" alt="drawing" style="width:50x;"/>


In [105]:
from nltk.stem import PorterStemmer, SnowballStemmer
porter = PorterStemmer()
[porter.stem(word) for word, freq in dist.most_common(10)]

['.', 'is', 'better', 'than', 'to', 'the', ',', 'of', 'although', 'never']

In [89]:
dist.most_common(10)

[('.', 18),
 ('is', 10),
 ('better', 8),
 ('than', 8),
 ('to', 5),
 ('the', 5),
 (',', 4),
 ('of', 3),
 ('Although', 3),
 ('never', 3)]

In [90]:
porter.stem("Painting")

'paint'

Let's stem the word "Stemming".

In [94]:
porter.stem("Stemming")

'stem'

In [107]:
stemmer = SnowballStemmer('russian')
tokenized_example = word_tokenize("На русском стемминг может казаться (кажется) очень странным решением, особенно, если основа слова меняется, например, у слов человек и люди.")
stemmed_example = [stemmer.stem(w) for w in tokenized_example]
print(' '.join(stemmed_example))

на русск стемминг может каза ( кажет ) очен стран решен , особен , есл основ слов меня , например , у слов человек и люд .


### 4.3. Lemmatization

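Another normalization method is lemmatization, which maps a word to its dictionary form (lemma). Let's apply the WordNet lemmatizer to the same words. By default, `lemmatize` treats every word as a noun (`pos='n'`), so verbs and adjectives such as `is` and `better` come back unchanged unless a part of speech is passed.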

In [95]:
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
wnl = WordNetLemmatizer()
[wnl.lemmatize(word) for word, freq in dist.most_common(10)]

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


['.', 'is', 'better', 'than', 'to', 'the', ',', 'of', 'Although', 'never']

In [96]:
wnl.lemmatize("corpora")

'corpus'

Lemmatization for Russian is not an easy task either. There are several common tools for it: `pymorphy2`, `Mystem` and `Natasha`; we will try each of them.

In [97]:
%pip install pymorphy2

Collecting pymorphy2
  Downloading pymorphy2-0.9.1-py3-none-any.whl (55 kB)
Collecting dawg-python>=0.7.1 (from pymorphy2)
  Downloading DAWG_Python-0.7.2-py2.py3-none-any.whl (11 kB)
Collecting pymorphy2-dicts-ru<3.0,>=2.4 (from pymorphy2)
  Downloading pymorphy2_dicts_ru-2.4.417127.4579844-py2.py3-none-any.whl (8.2 MB)
Collecting docopt>=0.6 (from pymorphy2)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: docopt
  Building wheel for docopt (setup.py) ... done
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl

In [98]:
from pymorphy2 import MorphAnalyzer

In [99]:
morph = MorphAnalyzer()

morph.parse(next(tokenize(russian_text)).text)

[Parse(word='граф', tag=OpencorporaTag('NOUN,anim,masc sing,nomn'), normal_form='граф', score=0.846153, methods_stack=((DictionaryAnalyzer(), 'граф', 52, 0),)),
 Parse(word='граф', tag=OpencorporaTag('NOUN,inan,masc sing,nomn'), normal_form='граф', score=0.076923, methods_stack=((DictionaryAnalyzer(), 'граф', 34, 0),)),
 Parse(word='граф', tag=OpencorporaTag('NOUN,inan,masc sing,accs'), normal_form='граф', score=0.038461, methods_stack=((DictionaryAnalyzer(), 'граф', 34, 3),)),
 Parse(word='граф', tag=OpencorporaTag('NOUN,inan,femn plur,gent'), normal_form='графа', score=0.038461, methods_stack=((DictionaryAnalyzer(), 'граф', 55, 8),))]

In [100]:
morph.parse("Стали")

[Parse(word='стали', tag=OpencorporaTag('VERB,perf,intr plur,past,indc'), normal_form='стать', score=0.975342, methods_stack=((DictionaryAnalyzer(), 'стали', 945, 4),)),
 Parse(word='стали', tag=OpencorporaTag('NOUN,inan,femn sing,gent'), normal_form='сталь', score=0.010958, methods_stack=((DictionaryAnalyzer(), 'стали', 13, 1),)),
 Parse(word='стали', tag=OpencorporaTag('NOUN,inan,femn plur,nomn'), normal_form='сталь', score=0.005479, methods_stack=((DictionaryAnalyzer(), 'стали', 13, 6),)),
 Parse(word='стали', tag=OpencorporaTag('NOUN,inan,femn sing,datv'), normal_form='сталь', score=0.002739, methods_stack=((DictionaryAnalyzer(), 'стали', 13, 2),)),
 Parse(word='стали', tag=OpencorporaTag('NOUN,inan,femn sing,loct'), normal_form='сталь', score=0.002739, methods_stack=((DictionaryAnalyzer(), 'стали', 13, 5),)),
 Parse(word='стали', tag=OpencorporaTag('NOUN,inan,femn plur,accs'), normal_form='сталь', score=0.002739, methods_stack=((DictionaryAnalyzer(), 'стали', 13, 9),))]

Let's try MyStem:

In [101]:
!wget http://download.cdn.yandex.net/mystem/mystem-3.0-linux3.1-64bit.tar.gz
!tar -xvf mystem-3.0-linux3.1-64bit.tar.gz
!cp mystem /bin
from pymystem3 import Mystem
mystem_analyzer = Mystem()

--2024-02-15 18:04:38--  http://download.cdn.yandex.net/mystem/mystem-3.0-linux3.1-64bit.tar.gz
Resolving download.cdn.yandex.net (download.cdn.yandex.net)... 5.45.205.242, 5.45.205.243, 5.45.205.244, ...
Connecting to download.cdn.yandex.net (download.cdn.yandex.net)|5.45.205.242|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://cachev2-m9-10.cdn.yandex.net/download.cdn.yandex.net/mystem/mystem-3.0-linux3.1-64bit.tar.gz?lid=178 [following]
--2024-02-15 18:04:39--  http://cachev2-m9-10.cdn.yandex.net/download.cdn.yandex.net/mystem/mystem-3.0-linux3.1-64bit.tar.gz?lid=178
Resolving cachev2-m9-10.cdn.yandex.net (cachev2-m9-10.cdn.yandex.net)... 37.9.111.215, 2a02:6b8:c35:4:0:562:0:26
Connecting to cachev2-m9-10.cdn.yandex.net (cachev2-m9-10.cdn.yandex.net)|37.9.111.215|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16457938 (16M) [application/octet-stream]
Saving to: ‘mystem-3.0-linux3.1-64bit.tar.gz’


2024-02-15 18:04:41 (9.5

In [102]:
mystem_analyzer.analyze(next(tokenize(russian_text)).text)

[{'analysis': [{'lex': 'граф', 'wt': 0.8203572035, 'gr': 'S,муж,од=им,ед'}],
  'text': 'Граф'},
 {'text': '\n'}]

In [103]:
mystem_analyzer.analyze("Стали")

[{'analysis': [{'lex': 'становиться',
    'wt': 0.9821285009,
    'gr': 'V,нп=прош,мн,изъяв,сов'}],
  'text': 'Стали'},
 {'text': '\n'}]

Let's try Natasha:

In [108]:
!pip install natasha

Collecting natasha
  Downloading natasha-1.6.0-py3-none-any.whl (34.4 MB)
Collecting navec>=0.9.0 (from natasha)
  Downloading navec-0.10.0-py3-none-any.whl (23 kB)
Collecting slovnet>=0.6.0 (from natasha)
  Downloading slovnet-0.6.0-py3-none-any.whl (46 kB)
Collecting yargy>=0.16.0 (from natasha)
  Downloading yargy-0.16.0-py3-none-any.whl (33 kB)
Collecting ipymarkup>=0.8.0 (from natasha)
  Downloading ipymarkup-0.9.0-py3-none-any.whl (14 kB)
Collecting intervaltree>=3 (from ipymarkup>=0.8.0->natasha)
  Downloading intervaltree-3.1.0.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: intervaltree
  Building wheel for intervaltree (setup.py) ... done
  Created wheel for inte

In [109]:
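from natasha import Doc, MorphVocab, Segmenter, NewsEmbedding, NewsMorphTagger

# Build the pipeline components once and reuse them across calls.
segmenter = Segmenter()
morph_vocab = MorphVocab()
emb = NewsEmbedding()
morph_tagger = NewsMorphTagger(emb)

def natasha_lemmatize(text):
    """Return a {token text: lemma} mapping for the given text."""
    doc = Doc(text)
    doc.segment(segmenter)       # split the text into sentences and tokens
    doc.tag_morph(morph_tagger)  # morphological tags are needed before lemmatization
    for token in doc.tokens:
        token.lemmatize(morph_vocab)
    return {token.text: token.lemma for token in doc.tokens}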

In [110]:
natasha_lemmatize('Как дела?')

{'Как': 'как', 'дела': 'дело', '?': '?'}

In [111]:
natasha_lemmatize(next(tokenize(russian_text)).text)

{'Граф': 'граф'}

In [112]:
natasha_lemmatize("Стали")

{'Стали': 'стать'}
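
Note that `natasha_lemmatize` returns a dictionary keyed by token text, so if the same surface form occurs twice in a text, the later lemma overwrites the earlier one. The sketch below illustrates the effect without calling Natasha. `Token` is a hypothetical stand-in that models only the `text` and `lemma` attributes, and the lemma values are hand-written for illustration, not actual tagger output.

```python
from collections import namedtuple

# Hypothetical stand-in for a lemmatized token; only the two attributes
# used by natasha_lemmatize are modelled here.
Token = namedtuple("Token", ["text", "lemma"])


def lemma_dict(tokens):
    """Same shape as natasha_lemmatize: repeated surface forms collapse."""
    return {t.text: t.lemma for t in tokens}


def lemma_pairs(tokens):
    """Alternative shape: one (text, lemma) pair per token, order preserved."""
    return [(t.text, t.lemma) for t in tokens]


# Hand-written example: "стали" can be a form of the verb "стать"
# or of the noun "сталь" (steel).
tokens = [Token("стали", "стать"), Token("стали", "сталь")]

print(lemma_dict(tokens))   # {'стали': 'сталь'}
print(lemma_pairs(tokens))  # [('стали', 'стать'), ('стали', 'сталь')]
```

A list of pairs is the safer choice when per-token results matter, for example when evaluating disambiguation.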

## mystem vs. pymorphy vs. natasha

1) Platform. MyStem is too slow on Windows for long texts, so we recommend running it on Linux.

2) Disambiguation. MyStem can resolve homonymy from context, although it does not always succeed. pymorphy2 takes a single word as input, so it cannot choose the right word form from context at all. Natasha also fails on some cases. The examples below use the homonym "дорог", which is either the genitive plural of "дорога" (road) or the short form of "дорогой" (dear, expensive):

In [113]:
homonym1 = 'Страна дорог'
homonym2 = 'Мал золотник, да дорог'

In [114]:
mystem_analyzer = Mystem()

print(mystem_analyzer.analyze(homonym1)[-2])
print(mystem_analyzer.analyze(homonym2)[-2])

{'analysis': [{'lex': 'дорогой', 'wt': 0.1396013647, 'gr': 'A=ед,кр,муж'}], 'text': 'дорог'}
{'analysis': [{'lex': 'дорога', 'wt': 0.8603986502, 'gr': 'S,жен,неод=род,мн'}], 'text': 'дорог'}


In [115]:
print(natasha_lemmatize(homonym1))

{'Страна': 'страна', 'дорог': 'дорога'}


In [116]:
print(natasha_lemmatize(homonym2))

{'Мал': 'маленький', 'золотник': 'золотник', ',': ',', 'да': 'да', 'дорог': 'дорогой'}
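
To compare the tools side by side, it can help to collect their lemmas for the ambiguous token in one table. The following is a minimal sketch under the assumption that the lemmas have already been collected. The values for "дорог" are copied from the outputs above, and the helper name `disagreements` is introduced here for illustration only.

```python
# Lemmas assigned to the token "дорог", copied from the outputs above.
observed = {
    "Страна дорог": {"mystem": "дорогой", "natasha": "дорога"},
    "Мал золотник, да дорог": {"mystem": "дорога", "natasha": "дорогой"},
}


def disagreements(observed):
    """Return the phrases on which the tools produced different lemmas."""
    return [
        phrase
        for phrase, lemmas in observed.items()
        if len(set(lemmas.values())) > 1
    ]


for phrase in disagreements(observed):
    print(phrase, observed[phrase])
```

Both phrases are reported, since the two tools chose opposite readings of "дорог" in each of them.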


There are considerably more advanced text processing solutions that support many languages, for instance [spaCy](https://spacy.io/models/ru) and [Stanza](https://stanfordnlp.github.io/stanza/).

[Here](https://github.com/natasha/naeval) you can find a benchmark of text processing solutions for Russian.

## Stop-words and punctuation

*Stop words* are words from a stop list (also called a stoplist or negative dictionary) that are filtered out (i.e. stopped) before or after processing natural language text because they are considered insignificant.

In [118]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # fetch the stop word lists if they are not cached yet
print(stopwords.words('russian'))

['и', 'в', 'во', 'не', 'что', 'он', 'на', 'я', 'с', 'со', 'как', 'а', 'то', 'все', 'она', 'так', 'его', 'но', 'да', 'ты', 'к', 'у', 'же', 'вы', 'за', 'бы', 'по', 'только', 'ее', 'мне', 'было', 'вот', 'от', 'меня', 'еще', 'нет', 'о', 'из', 'ему', 'теперь', 'когда', 'даже', 'ну', 'вдруг', 'ли', 'если', 'уже', 'или', 'ни', 'быть', 'был', 'него', 'до', 'вас', 'нибудь', 'опять', 'уж', 'вам', 'ведь', 'там', 'потом', 'себя', 'ничего', 'ей', 'может', 'они', 'тут', 'где', 'есть', 'надо', 'ней', 'для', 'мы', 'тебя', 'их', 'чем', 'была', 'сам', 'чтоб', 'без', 'будто', 'чего', 'раз', 'тоже', 'себе', 'под', 'будет', 'ж', 'тогда', 'кто', 'этот', 'того', 'потому', 'этого', 'какой', 'совсем', 'ним', 'здесь', 'этом', 'один', 'почти', 'мой', 'тем', 'чтобы', 'нее', 'сейчас', 'были', 'куда', 'зачем', 'всех', 'никогда', 'можно', 'при', 'наконец', 'два', 'об', 'другой', 'хоть', 'после', 'над', 'больше', 'тот', 'через', 'эти', 'нас', 'про', 'всего', 'них', 'какая', 'много', 'разве', 'три', 'эту', 'моя', 'впр

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [119]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [120]:
noise = stopwords.words('russian') + list(punctuation)
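
A combined noise list like this is typically used to filter a token sequence. The sketch below shows one way to do it. To stay self-contained it uses a miniature stop list (a few entries copied from the list printed above) instead of `stopwords.words('russian')`, and the helper name `remove_noise` is an assumption introduced for illustration.

```python
from string import punctuation

# Miniature stop list for illustration; in the notebook the full list
# comes from stopwords.words('russian').
sample_stopwords = ["и", "в", "не", "что", "на", "да"]

# A set gives constant-time membership tests, unlike a list.
noise_set = set(sample_stopwords) | set(punctuation)


def remove_noise(tokens, noise):
    """Keep only the tokens whose lowercased form is not in the noise set."""
    return [tok for tok in tokens if tok.lower() not in noise]


tokens = ["Мал", "золотник", ",", "да", "дорог"]
print(remove_noise(tokens, noise_set))  # ['Мал', 'золотник', 'дорог']
```

Lowercasing before the lookup matters because the stop list contains only lowercase forms.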

In [121]:
noise

['и',
 'в',
 'во',
 'не',
 'что',
 'он',
 'на',
 'я',
 'с',
 'со',
 'как',
 'а',
 'то',
 'все',
 'она',
 'так',
 'его',
 'но',
 'да',
 'ты',
 'к',
 'у',
 'же',
 'вы',
 'за',
 'бы',
 'по',
 'только',
 'ее',
 'мне',
 'было',
 'вот',
 'от',
 'меня',
 'еще',
 'нет',
 'о',
 'из',
 'ему',
 'теперь',
 'когда',
 'даже',
 'ну',
 'вдруг',
 'ли',
 'если',
 'уже',
 'или',
 'ни',
 'быть',
 'был',
 'него',
 'до',
 'вас',
 'нибудь',
 'опять',
 'уж',
 'вам',
 'ведь',
 'там',
 'потом',
 'себя',
 'ничего',
 'ей',
 'может',
 'они',
 'тут',
 'где',
 'есть',
 'надо',
 'ней',
 'для',
 'мы',
 'тебя',
 'их',
 'чем',
 'была',
 'сам',
 'чтоб',
 'без',
 'будто',
 'чего',
 'раз',
 'тоже',
 'себе',
 'под',
 'будет',
 'ж',
 'тогда',
 'кто',
 'этот',
 'того',
 'потому',
 'этого',
 'какой',
 'совсем',
 'ним',
 'здесь',
 'этом',
 'один',
 'почти',
 'мой',
 'тем',
 'чтобы',
 'нее',
 'сейчас',
 'были',
 'куда',
 'зачем',
 'всех',
 'никогда',
 'можно',
 'при',
 'наконец',
 'два',
 'об',
 'другой',
 'хоть',
 'после',
 'на