# NLP Text Analysis

This notebook performs a **statistical analysis** of two texts:
- *Manuskrypt Wojnicza* (the Voynich Manuscript)
- *One Hundred Years of Solitude*

It uses the **Zipf** module, which is also part of this project.

The sections below walk through each step of the analysis.

For further explanation and usage of the functions, refer to the Zipf module itself, which contains the full documentation.

<hr>

#### Imports
First, we import the classes we need:

1. From *processing.zipf*, the `ZipfAnalyzer` and `ZipfPrinter` classes
2. From *models.text*, the `Text` class

We also import `matplotlib.pyplot` for plotting.

In [1]:
from processing.zipf import ZipfAnalyzer, ZipfPrinter
from models.text import Text

import matplotlib.pyplot as plt

#### Variable declaration
We create two `Text` objects, one per text, each with a name and a source URL. The author is given only for the first text.

In [2]:
text_1 = Text(text_name="One hundred years of solitude", 
              text_author="Gabriel Garcia Marquez",
              text_file_url="https://gist.githubusercontent.com/ismaproco/b0306cfd10817b7f2a221275d611a077/raw/64d141cee684169ddcd96ac604e274710e0ae18f/one_hundred_years_of_solitude.txt")

text_2 = Text(text_name="Manuskrypt Wojnicza", 
              text_file_url="https://www.ic.unicamp.br/~stolfi/voynich/mirror/reeds/docs/FSG.txt")

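#### Plot helper
We define a helper function that plots word frequencies against word ranks for a single text

In [3]:
def plot_zipf_law(words_ranks: list, word_frequencies: list, text_name: str, text_author: str) -> None:
    """Plot word frequency against word rank for a single text."""
    plt.figure(figsize=(20, 10))

    # Rank on the x-axis, frequency on the y-axis
    plt.plot(words_ranks, word_frequencies, color='magenta')

    plt.xlabel('Word Ranks')
    plt.ylabel('Word Frequencies')
    plt.title(f'Zipf Law for "{text_name}" by {text_author}')

    plt.show()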
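
The function above draws frequency against rank on linear axes. Zipf's law predicts that frequency is roughly proportional to 1/rank^s, which becomes a straight line with slope -s when both axes are logarithmic. The following is a minimal sketch of estimating that exponent with an ordinary least-squares fit on the logarithms, using only the standard library. `estimate_zipf_exponent` is a hypothetical helper written for illustration and is not part of the Zipf module; the data is synthetic.

```python
import math


def estimate_zipf_exponent(ranks, frequencies):
    """Return s from a least-squares fit of log(frequency) = c - s * log(rank).

    Hypothetical helper for illustration; not part of the Zipf module.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line = covariance(x, y) / variance(x)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic, idealised data (assumption): frequency = 1000 / rank
ideal_ranks = list(range(1, 51))
ideal_freqs = [1000 / r for r in ideal_ranks]
exponent = estimate_zipf_exponent(ideal_ranks, ideal_freqs)
```

For a text that follows Zipf's law closely, the estimated exponent should be near 1. Real texts usually deviate at the highest and lowest ranks.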

#### Zipf text analysis
For each text, we compute the rank and frequency data and the n-grams with `ZipfAnalyzer`, then print the results with `ZipfPrinter`

In [4]:
zipf_1 = ZipfAnalyzer(text_1)

ranks, freq = zipf_1.calc_zipf_law_properties()
n_grams = zipf_1.calculate_n_grams()
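
The internals of `calc_zipf_law_properties` are documented in the Zipf module and are not reproduced here. As a rough illustration of what rank and frequency data look like, here is a minimal sketch that assumes plain whitespace tokenisation with no case folding. `rank_frequency` is a hypothetical helper, not the module's implementation.

```python
from collections import Counter


def rank_frequency(text):
    """Return (ranks, frequencies), most frequent word first.

    Hypothetical helper; tokenisation by whitespace is an assumption.
    """
    words = text.split()
    counts = Counter(words)
    # Rank 1 belongs to the most frequent word, rank 2 to the next, and so on
    frequencies = sorted(counts.values(), reverse=True)
    ranks = list(range(1, len(frequencies) + 1))
    return ranks, frequencies


sample = "the cat and the dog and the bird"
sample_ranks, sample_freqs = rank_frequency(sample)
```

Two lists of equal length like these are the kind of input that `plot_zipf_law` expects for its first two arguments.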

In [5]:
zipf_printer_1 = ZipfPrinter(text_1, ranks, freq, n_grams)

In [6]:
zipf_printer_1.print_collocations_result()

- - - - - COLLOCATIONS ANALYSIS - - - - -

dreaming occurs in 4 collocations	
expense occurs in 1 collocations	
early occurs in 7 collocations	
cousin’s occurs in 1 collocations	
wet occurs in 7 collocations	
inoffensive occurs in 1 collocations	
“Sooner occurs in 1 collocations	
stench. occurs in 1 collocations	
better occurs in 31 collocations	
fraud occurs in 1 collocations	
axle. occurs in 1 collocations	
Alacoque occurs in 1 collocations	
trains occurs in 5 collocations	
sleepless occurs in 1 collocations	
Cripple occurs in 1 collocations	
prodigious occurs in 9 collocations	
cutting occurs in 3 collocations	
crocodiles occurs in 2 collocations	
rebuilding occurs in 1 collocations	
ship’s occurs in 1 collocations	
politicians occurs in 4 collocations	
down!” occurs in 1 collocations	
maidens occurs in 1 collocations	
former occurs in 21 collocations	
“Now occurs in 9 collocations	
parading occurs in 1 collocations	
girls. occurs in 1 collocations	
(because occurs in 1 collocations

In [7]:
zipf_printer_1.print_n_grams_result()



- - - - - N-GRAMS ANALYSIS - - - - -


2-GRAMs
---------
YEARS LATER: 2
as he: 76
he faced: 5
faced the: 7
the firing: 10
firing squad: 10
Colonel Aureliano: 196
Aureliano Buendía: 173
Buendía was: 29
was to: 34
to remember: 9
remember that: 6
that distant: 4
afternoon when: 21
when his: 10
his father: 31
father took: 2
took him: 21
him to: 98
to discover: 9
At that: 18
that time: 94
Macondo was: 10
was a: 170
a village: 5
village of: 4
of twenty: 5
adobe houses: 2
built on: 3
on the: 425
the bank: 2
bank of: 2
of a: 256
river of: 2
of clear: 2
clear water: 2
water that: 5
along a: 5
a bed: 3
bed of: 7
which were: 14
white and: 2
The world: 2
world was: 6
was so: 78
so recent: 2
recent that: 2
that many: 8
many things: 5
names and: 6
and in: 62
in order: 92
order to: 91
it was: 215
was necessary: 6
necessary to: 9
Every year: 2
during the: 93
month of: 2
of March: 2
a family: 5
family of: 2
set up: 28
up their: 7
near the: 5
the village: 17
village and: 3
and with: 75
with a: 324
a g
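
The listing above pairs adjacent words with their counts, and every count shown is at least 2. Here is a minimal sketch of how such 2-gram counts could be produced, assuming whitespace tokenisation and a minimum count of 2. Both are assumptions inferred from the output, and `count_n_grams` is a hypothetical helper rather than the module's `calculate_n_grams`.

```python
from collections import Counter


def count_n_grams(text, n=2, min_count=2):
    """Count runs of n adjacent tokens, keeping those seen at least min_count times.

    Hypothetical helper; the tokenisation and threshold are assumptions.
    """
    tokens = text.split()
    # zip over n shifted copies of the token list yields consecutive n-tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return {gram: count for gram, count in counts.items() if count >= min_count}


sample = "it was a village and it was a river"
bigrams = count_n_grams(sample)
```

Because the tokens are not normalised, a pair such as `YEARS LATER` is counted separately from `years later`, which matches the mixed-case entries in the output.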

In [8]:
zipf_2 = ZipfAnalyzer(text_2)

ranks, freq = zipf_2.calc_zipf_law_properties()
n_grams = zipf_2.calculate_n_grams()

In [9]:
zipf_printer_2 = ZipfPrinter(text_2, ranks, freq, n_grams)

In [10]:
zipf_printer_2.print_collocations_result()

- - - - - COLLOCATIONS ANALYSIS - - - - -

 occurs in 190 collocations	
GT8G occurs in 2 collocations	
O8AM occurs in 64 collocations	
(D|H)CO occurs in 1 collocations	
4OPOR occurs in 4 collocations	
AEDAM occurs in 8 collocations	
HCOR occurs in 4 collocations	
OHTAK occurs in 6 collocations	
ODOOE occurs in 1 collocations	
ESC8ARG occurs in 1 collocations	
HZO8G occurs in 17 collocations	
TADA(M|N) occurs in 1 collocations	
AET(A|O)R occurs in 1 collocations	
4DZC8G occurs in 3 collocations	
8TCOHG occurs in 1 collocations	
TCAEG occurs in 4 collocations	
OEDCC8GE occurs in 1 collocations	
GDO8G occurs in 2 collocations	
PZAE occurs in 2 collocations	
AETAR occurs in 1 collocations	
DZCG occurs in 27 collocations	
AKTG occurs in 1 collocations	
DT8AE occurs in 2 collocations	
8AIHZG occurs in 1 collocations	
O8E occurs in 2 collocations	
4ODTCO8G occurs in 1 collocations	
2AIR(| occurs in 1 collocations	
OHTCO occurs in 1 collocations	
4OFSCC(|C)G occurs in 1 collocations	
8GC8G occ

In [11]:
zipf_printer_2.print_n_grams_result()



- - - - - N-GRAMS ANALYSIS - - - - -


2-GRAMs
---------
HZAR HZAR: 2
OR GDAM: 2
8AM ODAM: 9
ODAM OR: 3
SG SOE: 2
DO2 8AM: 2
8AM SOR: 5
TOE O8AM: 2
8AM SDZCG: 2
TOE TOE: 22
TOE DOR: 2
TOE SO: 4
SO TOE: 3
TO8AM SO: 2
TOR TCG: 2
TOE 8AN: 2
8AR SCG: 7
ODOE 8AM: 3
HZOE 8AM: 6
ODCCG ODG: 2
ODG 8AM: 5
8AM ODTCG: 3
DCCG 8AE: 2
TAR OR: 3
OR OTG: 2
8AE 8OE: 3
8AR TCG: 6
TODG TOE: 3
TOE HZOE: 7
TG HZG: 4
GDOE 8OE: 2
SOE DOE: 2
DG TOE: 2
TOE 8AM: 37
SOR ODOE: 2
ODOE TOE: 2
TOE 8OE: 5
8OE DG: 2
8AR SOE: 5
8AE TO8G: 2
SOR HZG: 3
TOR TG: 4
TOE 8AL: 2
8TOE 8TG: 2
8TG HZG: 2
SCCG 4ODCG: 2
2O TOE: 2
SCDG 8AM: 2
8AM HZCG: 3
2 AM: 10
8AE TG: 2
TG TOR: 2
TOR HZOR: 3
TAL 2AM: 2
SO GDCCG: 3
GDCCG TCG: 2
TCG 8AM: 7
8AM THZG: 7
THZG : 2
 DOOM: 2
SO SOE: 4
TOR 8AM: 10
8AM OHTG: 2
TOE TO8G: 4
THZG 8AM: 3
8AM SO: 7
8OR TOE: 5
8AM 8OR: 3
TOE TOR: 7
TOR TOE: 11
8AM OHTOR: 3
TAL 8AM: 2
TCOE TOE: 4
GTCOR TOR: 2
8AM HZG: 11
HZG 2TCG: 2
TOR TAE: 2
TO 4ODOE: 2
TOR HZOK: 2
OHAE 8AK: 2
SCOE SCG: 3
ODCCG 4ODCCG: 8
OR TO