## Live-Coding lesson on Markov chains

We'll apply them to text generation.

In [2]:
text_file = "./divinacommedia_cleaned.txt"

## Familiarize with data

In [6]:
with open(text_file, 'r', encoding='utf8') as infile:
    for line in infile:
        pass

How many words do I find in my file?
How many couples do I need?

I have 30 possible characters (26 letters + space, newline, full stop, exclamation point).

In [7]:
# First-order chain: number of possible couples
30**2

900

In [8]:
# Second-order chain: number of possible triplets
30**3

27000

In [9]:
# Third-order chain: number of possible quadruplets
30**4

810000

I need at least 10e7 characters to make a good estimate at that order.

In our corpus we have about 5e5 characters, so we can't make a good second-order estimate. But we know that combinations are not equally likely: some never occur at all! Most of the possible combinations won't appear.
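
To get a feeling for how sparse the combinations are, one can compare the number of distinct n-grams actually observed with the theoretical maximum. The sketch below is only an illustration: it uses the first verse of the poem instead of the full corpus, and `count_ngrams` is a hypothetical helper that is not part of the lesson code.

```python
from collections import Counter

def count_ngrams(text, n):
    """Count every run of `n` consecutive characters in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# A short sample (the first verse) stands in for the whole corpus.
sample = "nel mezzo del cammin di nostra vita"
alphabet_size = len(set(sample))

for n in (2, 3):
    observed_ngrams = count_ngrams(sample, n)
    possible = alphabet_size ** n
    print(f"n={n}: {len(observed_ngrams)} distinct of {possible} possible")
```

On the real text the same comparison can be run by passing the whole file content instead of `sample`.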

In [10]:
letters=0
with open(text_file, 'r', encoding='utf8') as infile:
    for line in infile:
        for letter in line:
            letters +=1
print(letters)

529771


In [15]:
# How many letters, and which ones, do I actually have in my text?
from collections import Counter
observed = Counter()
with open(text_file, 'r', encoding='utf-8-sig') as infile:
    for line in infile:
        for letter in line:
            observed[letter] += 1

In [16]:
observed

Counter({'N': 258,
         'e': 46094,
         'l': 23087,
         ' ': 83019,
         'm': 11681,
         'z': 1848,
         'o': 37254,
         'd': 14599,
         'c': 20267,
         'a': 42035,
         'i': 39200,
         'n': 26229,
         's': 22188,
         't': 22478,
         'r': 25805,
         'v': 7946,
         '\n': 19054,
         'p': 10790,
         'u': 13408,
         ',': 8513,
         'h': 7109,
         'é': 903,
         '.': 3275,
         'A': 377,
         'q': 3025,
         'è': 925,
         'g': 7121,
         'f': 4947,
         '!': 232,
         'T': 283,
         '’': 7623,
         'ù': 1080,
         ';': 1628,
         'b': 2758,
         'ò': 938,
         'I': 359,
         'M': 405,
         'à': 855,
         'E': 586,
         'ì': 1383,
         'P': 473,
         'Q': 328,
         '«': 1062,
         '»': 1062,
         'R': 117,
         ':': 988,
         'ï': 427,
         'ó': 30,
         '?': 278,
         'O': 356,
   

We can keep these unusual characters as they are, or we can replace them in some way.
Here we apply a manual normalization.

In [29]:
to_replace = {'Ë':'E', 'Ï':'I',
              'ö':'o', 'ä':'a',
              'ü':'u', 'ë':'e',
              'ï':'i' }

def letter_normalization_naive(letter):
    if letter in to_replace:
        return to_replace[letter]
    return letter
 
def letter_normalization_short(letter):
    return to_replace.get(letter, letter)    
    

In [22]:
letter_normalization_naive('Ë')

'E'

In [23]:
letter_normalization_short('Ë')

'E'

In [24]:
%timeit letter_normalization_naive('Ë')

110 ns ± 4.04 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [25]:
%timeit letter_normalization_short('Ë')

122 ns ± 0.361 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


The naive version is slightly faster!

In [26]:
letter_normalization = letter_normalization_naive
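
When the whole text is already available as a string, the same replacements can also be applied in one call with the built-in `str.maketrans` and `str.translate`. This is only a sketch of an alternative, not the approach used in the lesson; `text_normalization` and `translation_table` are hypothetical names.

```python
# Same replacements as the `to_replace` dictionary used in this lesson.
to_replace = {'Ë': 'E', 'Ï': 'I',
              'ö': 'o', 'ä': 'a',
              'ü': 'u', 'ë': 'e',
              'ï': 'i'}

# str.maketrans turns a {character: replacement} dict into a translation table.
translation_table = str.maketrans(to_replace)

def text_normalization(text):
    """Normalize every character of `text` in a single call."""
    return text.translate(translation_table)

print(text_normalization("vïe Ëra"))
```

Characters that are not keys of `to_replace` pass through unchanged, exactly as in the per-letter functions.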

In [27]:
# How many letters, and which ones, do I have after normalization?
from collections import Counter
observed = Counter()
with open(text_file, 'r', encoding='utf-8-sig') as infile:
    for line in infile:
        for letter in line:
            modified_letter = letter_normalization(letter)
            observed[modified_letter] += 1

In [28]:
observed

Counter({'N': 258,
         'e': 46177,
         'l': 23087,
         ' ': 83019,
         'm': 11681,
         'z': 1848,
         'o': 37255,
         'd': 14599,
         'c': 20267,
         'a': 42043,
         'i': 39627,
         'n': 26229,
         's': 22188,
         't': 22478,
         'r': 25805,
         'v': 7946,
         '\n': 19054,
         'p': 10790,
         'u': 13463,
         ',': 8513,
         'h': 7109,
         'é': 903,
         '.': 3275,
         'A': 377,
         'q': 3025,
         'è': 925,
         'g': 7121,
         'f': 4947,
         '!': 232,
         'T': 283,
         '’': 7623,
         'ù': 1080,
         ';': 1628,
         'b': 2758,
         'ò': 938,
         'I': 360,
         'M': 405,
         'à': 855,
         'E': 588,
         'ì': 1383,
         'P': 473,
         'Q': 328,
         '«': 1062,
         '»': 1062,
         'R': 117,
         ':': 988,
         'ó': 30,
         '?': 278,
         'O': 356,
         'V': 222,
   

In [30]:
from random import choices

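Completely random selection: each letter is drawn independently of the previous one, weighted only by how often it occurs in the text.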

In [51]:
letters = list(observed.keys())
occurrences = list(observed.values())
generated = choices(letters, occurrences, k=20)

collated = "".join(generated)
print(collated)

qmpao 
i
ml rato
m v
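
A quick way to convince oneself that `choices` respects the weights is to draw many letters and compare the measured frequencies with the expected ones. The sketch below uses made-up toy counts instead of the real `observed` counter, and a seeded `random.Random` instance so that repeated runs give the same draws.

```python
from collections import Counter
from random import Random

# Toy counts standing in for the `observed` counter of the lesson.
toy_counts = Counter({'a': 6, 'b': 3, ' ': 1})

rng = Random(42)  # seeded generator, so repeated runs give the same draws
letters = list(toy_counts.keys())
weights = list(toy_counts.values())
sample = rng.choices(letters, weights, k=10_000)

total = sum(weights)
sampled = Counter(sample)
for letter in letters:
    expected = toy_counts[letter] / total
    measured = sampled[letter] / len(sample)
    print(f"{letter!r}: expected {expected:.2f}, measured {measured:.2f}")
```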


## First-order Markov chain

We need to take letters in couples.

"home" -> (h,o), (o,m), (m,e)

In [54]:
def couples_from_seq(seq):
    """`seq` is a sequence of characters (a string or a list)"""
    # pair each element with the one that follows it
    return zip(seq, seq[1:])

list(couples_from_seq('home'))

[('h', 'o'), ('o', 'm'), ('m', 'e')]

Let's load the whole Divina Commedia as a single string:

### NOT MEMORY FRIENDLY!!!!

In [60]:
with open(text_file, 'r', encoding='utf-8-sig') as infile:
    whole_text = "".join(line for line in infile)

In [64]:
print(whole_text[:120])

Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura,
ché la diritta via era smarrita.

Ahi quanto a dir


In [65]:
all_couples = list(couples_from_seq(whole_text))

In [66]:
all_couples[:10]

[('N', 'e'),
 ('e', 'l'),
 ('l', ' '),
 (' ', 'm'),
 ('m', 'e'),
 ('e', 'z'),
 ('z', 'z'),
 ('z', 'o'),
 ('o', ' '),
 (' ', 'd')]

In [142]:
from collections import defaultdict
# Counter behaves much like defaultdict(int),
# so defaultdict(Counter) works as a counter of counters
couples_counter = defaultdict(Counter)
for first_letter, second_letter in all_couples:
    couples_counter[first_letter][second_letter] +=1
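
Since loading the whole text and the list of all couples is not memory friendly, the counts can also be accumulated while reading the file character by character. The following is a sketch under that assumption; `stream_characters` and `count_couples_streaming` are hypothetical helpers, and `io.StringIO` stands in for the real file so the example is self-contained.

```python
import io
from collections import Counter, defaultdict

def stream_characters(infile):
    """Yield the characters of an open text file one at a time."""
    for line in infile:
        yield from line

def count_couples_streaming(characters):
    """Count (previous, current) couples without storing them in a list."""
    counter = defaultdict(Counter)
    previous = None
    for current in characters:
        if previous is not None:
            counter[previous][current] += 1
        previous = current
    return counter

# io.StringIO stands in for the real file so the sketch is self-contained.
fake_file = io.StringIO("home\nhole\n")
streamed_counter = count_couples_streaming(stream_characters(fake_file))
print(streamed_counter['o'])
```

Only the previous character and the counters are kept in memory, at the price of a slightly longer loop.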

In [77]:
foo = defaultdict(Counter)
foo['b']['a'] += 1
foo['b']['c'] += 1

# For a missing key, the defaultdict behaves like this:
if 'c' not in foo:
    foo['c'] = Counter()
foo['c']['a'] += 1

foo

foo['b']

Counter({'a': 1, 'c': 1})

In [78]:
couples_counter

defaultdict(collections.Counter,
            {'N': Counter({'e': 42,
                      'o': 170,
                      'a': 14,
                      'i': 9,
                      'é': 18,
                      'u': 4,
                      'ï': 1}),
             'e': Counter({'l': 3162,
                      'z': 194,
                      'r': 6508,
                      's': 2693,
                      ' ': 17016,
                      '\n': 1209,
                      'n': 4318,
                      ';': 398,
                      '.': 630,
                      'a': 619,
                      '’': 288,
                      't': 1528,
                      'i': 506,
                      'm': 894,
                      'c': 618,
                      'g': 1066,
                      'd': 1156,
                      'o': 74,
                      ',': 1879,
                      'v': 323,
                      '»': 154,
                      '?': 69,
                      '!':

How many couples did we actually observe?

In [79]:
sum(sum(counts.values()) for counts in couples_counter.values())

529770
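
This total is one less than the 529771 characters counted earlier, because a text of length L contains exactly L - 1 overlapping couples. The sketch below checks this on a toy string and also shows how a row of counts can be turned into transition probabilities; `transition_probabilities` is a hypothetical helper, not part of the lesson code.

```python
from collections import Counter, defaultdict

toy_text = "home\nhole\n"
toy_counter = defaultdict(Counter)
for first, second in zip(toy_text, toy_text[1:]):
    toy_counter[first][second] += 1

# A text of length L contains exactly L - 1 overlapping couples.
total_couples = sum(sum(c.values()) for c in toy_counter.values())
print(total_couples, len(toy_text) - 1)

def transition_probabilities(counts):
    """Turn a Counter of next-letter counts into probabilities."""
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}

print(transition_probabilities(toy_counter['o']))
```

`choices` does this normalization internally, so the generation code can keep passing the raw counts as weights.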

In [130]:
letter = 'N'
possible_letters = list(couples_counter[letter].keys())
counts = list(couples_counter[letter].values())
choices(possible_letters, counts)[0]

'o'

To run the simulation, I repeat this procedure, each time using the last generated letter as the new starting letter.

In [131]:
text = ['N']

In [133]:
for i in range(200):
    last_letter = text[-1]
    possible_letters = list(couples_counter[last_letter].keys())
    counts = list(couples_counter[last_letter].values())
    next_letter = choices(possible_letters, counts)[0]
    text.append(next_letter)
print("".join(text))

No! ra ll sera isso
Ma l’un so datali viarl’li delo
Quador oneachediria s’n mme soltono,
va cialtto.
e Pe velua condi ure me lorolalosivi fi pi: mal ne, ’ l’lagn bo usorimmico ltrvastivïe,

Inta;
eme chisuido.

esticheror
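
The loop above can be wrapped in a reusable function. This is a minimal sketch, assuming a counter shaped like `couples_counter`; the name `generate_first_order`, the `seed` parameter and the early stop on a letter with no known follower are additions for illustration, and a single verse stands in for the corpus.

```python
from collections import Counter, defaultdict
from random import Random

def generate_first_order(couples_counter, start, length, seed=None):
    """Generate up to `length` extra letters after `start`."""
    rng = Random(seed)
    text = [start]
    for _ in range(length):
        followers = couples_counter.get(text[-1])
        if not followers:  # letter never seen as a first element: stop
            break
        letters = list(followers.keys())
        counts = list(followers.values())
        text.append(rng.choices(letters, counts)[0])
    return "".join(text)

# Toy counter built from one verse, standing in for the full corpus.
verse = "nel mezzo del cammin di nostra vita"
toy_counter = defaultdict(Counter)
for first, second in zip(verse, verse[1:]):
    toy_counter[first][second] += 1

print(generate_first_order(toy_counter, 'n', 40, seed=0))
```

Passing the same `seed` reproduces the same text, which is handy when comparing chains of different order.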


## Second-order Markov chain

In [134]:
with open(text_file, 'r', encoding='utf-8-sig') as infile:
    whole_text = "".join(line for line in infile)

In [266]:
def triplets_from_seq(seq):
    """`seq` is a sequence of characters (a string or a list)"""
    # group each element with the two that follow it
    return zip(seq, seq[1:], seq[2:])

In [267]:
all_triplets = list(triplets_from_seq(whole_text))

In [268]:
all_triplets[:10]

[('N', 'e', 'l'),
 ('e', 'l', ' '),
 ('l', ' ', 'm'),
 (' ', 'm', 'e'),
 ('m', 'e', 'z'),
 ('e', 'z', 'z'),
 ('z', 'z', 'o'),
 ('z', 'o', ' '),
 ('o', ' ', 'd'),
 (' ', 'd', 'e')]

In [199]:
from collections import defaultdict
# Counter behaves much like defaultdict(int),
# so defaultdict(Counter) works as a counter of counters
triplets_counter = defaultdict(Counter)
for first_letter, second_letter, third_letter in all_triplets:
    triplets_counter[(first_letter,second_letter)][third_letter] +=1

In [235]:
letters = tuple('Ne')
possible_letters = list(triplets_counter[letters].keys())
counts = list(triplets_counter[letters].values())
choices(possible_letters, counts)[0]

't'

In [236]:
text = ['N', 'e', 'l']
tuple(text[-2:])

('e', 'l')

In [250]:
text = ['N', 'e', 'l']
tuple(text[-2:])
last_letters = tuple(text[-2:])
possible_letters = list(triplets_counter[last_letters].keys())
counts = list(triplets_counter[last_letters].values())
choices(possible_letters, counts)[0]

' '

In [270]:
text = ['N', 'e']
for i in range(200):
    last_letters = tuple(text[-2:])
    possible_letters = list(triplets_counter[last_letters].keys())
    counts = list(triplets_counter[last_letters].values())
    next_letter = choices(possible_letters, counts)[0]
    text.append(next_letter)
print("".join(text))

Nestuando
co caccome s’adrataza,
fia so sar prio oma vidime la l’aggide so a ger che pienziel ch’io congentorso se:
on Dio;
e gi».

sette volto ch’alt’ a acquel Fio: «Ala alta illorso la e sti sond’uni.


## Markov chain of order N

In [9]:
from collections import Counter, defaultdict
from random import choices

text_file = "./divinacommedia_cleaned.txt"
with open(text_file, 'r', encoding='utf-8-sig') as infile:
    whole_text = "".join(line for line in infile)
    
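def n_tuples_from_seq(n, seq):
    """Return all the overlapping n-tuples of consecutive items of `seq`"""
    # zip n copies of the sequence, each shifted one step further
    tuple_of_seq = tuple(seq[i:] for i in range(n))
    return zip(*tuple_of_seq)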

n=7
all_tuples = list(n_tuples_from_seq(n,whole_text))

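tuples_counter = defaultdict(Counter)
for key_tuple in all_tuples:
    # the first n-1 letters are the context, the last one is what follows it
    tuples_counter[key_tuple[:-1]][key_tuple[-1]] += 1

# start from the first n-1 letters of the poem
text = list(whole_text[:n-1])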
for i in range(1000):
    last_letters = tuple(text[-(n-1):])
    possible_letters = list(tuples_counter[last_letters].keys())
    counts = list(tuples_counter[last_letters].values())
    next_letter = choices(possible_letters, counts)[0]
    text.append(next_letter)
print("".join(text))

Nel mezzo del buon parliamo e profondo
Marte e ne la vostro,
non dissi:

ché quel de l’etterno ricco patre!».

E question ch’io divise;
onde uscì presso lui;
poi per lo mele; e qui alcuna ch’io mi fossero impediti e accorti,

rispuose: «Figlio,
sì che Giove
parer lo suono
incominciai, «com’ a lui: «Li occhi più fero e vetusto
di questa corse maggi
lumi, li quale io preme;

e quel tanto s’avaccio cadde, e ’l sole
per troppa similmente a lo rimembri,
ricenti sì come lucerna,
sì che ne la mia donna dietro fu che conosco,
quel convenne a me, come tu vedi,
che spezza.

Venne Cefàs e venne
un sol galeoto,
ch’a guisa di lei
sì presso al passo passo:

così fatta letizia, che lo sguardo morse,
chi va di gonna il pesce andava timida si fu al direi che saver d’alcuna via di tal moto per la grazia,

come degna;

sì com’ i’ ho caro
parer di madre a sua naturalmente inganni lagrimando, d’una de la voce tua discordia e a tal ber si guardate e figlio la sete,
ai quanta.

Ma dimmi la fiamma; e per voi 
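
One caveat of the loop above: if the chain reaches a context that was never followed by anything in the corpus (for example the very last letters of the text), the list of possible letters is empty and `choices` cannot draw from it. The sketch below shows one possible way to handle this by stopping early. It is not the lesson's code: `build_model` and `generate` are hypothetical names, contexts are stored as strings rather than tuples, and a single verse stands in for the corpus.

```python
from collections import Counter, defaultdict
from random import Random

def build_model(text, n):
    """Map each (n-1)-character context to a Counter of next characters."""
    if n < 2:
        raise ValueError("n must be at least 2")
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, next_char = text[i:i + n - 1], text[i + n - 1]
        model[context][next_char] += 1
    return model

def generate(model, seed_text, length, n, seed=None):
    """Extend `seed_text` by up to `length` characters."""
    rng = Random(seed)
    text = seed_text
    for _ in range(length):
        followers = model.get(text[-(n - 1):])
        if not followers:  # unseen context: stop instead of raising
            break
        text += rng.choices(list(followers), list(followers.values()))[0]
    return text

verse = "nel mezzo del cammin di nostra vita"
model = build_model(verse, 3)
print(generate(model, "ne", 30, 3, seed=1))
```

In this toy verse the context "ta" only occurs at the very end, so a generation that reaches it stops there instead of failing.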

In [12]:
import requests
url_base = ("https://raw.githubusercontent.com/UniboDIFABiophysics"+
                "/programmingCourseDIFA/master/divine_comedy/")
filename = "divinacommedia_cleaned.txt"
response = requests.get(url_base + filename)    
response.raise_for_status()    
with open(filename, 'wb') as handle:
    handle.write(response.content)
    
    