# How to use the MuNLP library

MuNLP is a basic NLP module for text analysis and generation.  
This notebook documents how to use it.  
It is a work in progress.



Automatically reload modules so that changes made to the MuNLP library take effect immediately instead of being served from the import cache:

In [1]:
%load_ext autoreload
%autoreload 2


First, import the library. If it is not in the same directory as this notebook, add its location to sys.path before importing.

In [2]:
from MuNLP import MuNLP
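
If the library lives in a different directory, one way to make it importable is to add that directory to sys.path before running the import. A minimal sketch; the folder location used here is hypothetical and must be adapted to your setup:

```python
import sys
from pathlib import Path

# Hypothetical location of the folder that contains the MuNLP module.
library_dir = Path("..") / "MuNLP"

# Add the folder to the module search path, but only once.
if str(library_dir) not in sys.path:
    sys.path.insert(0, str(library_dir))

# After this, `from MuNLP import MuNLP` can find the module,
# provided the folder really contains it.
```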


# Load content from a text string

In [3]:
sample = """Processing raw text intelligently is difficult: most words are rare, 
and it’s common for words that look completely different to mean almost the same thing. 
The same words in a different order can mean something completely different. 
Even splitting text into useful word-like units can be difficult in many languages. 
While it’s possible to solve some problems starting from only the raw characters, 
it’s usually better to use linguistic knowledge to add useful information."""

In [4]:
len(sample)  # includes newlines

479

## Instantiate a MuNLP object
Pass the text to the MuNLP class:

In [5]:
snippet = MuNLP(sample)


In [6]:
snippet.getStats(verbose=1)  # word, sentence and char counts; verbose=1 also prints them

Total words:  74
Unique words:  54
Approximate total number of sentences:  4
Total chars:  479


(74, 54, 4, 479)
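
The returned tuple holds total words, unique words, approximate sentences, and total characters, in that order. To make those four numbers concrete, here is a rough standard-library sketch of how such counts could be computed. It is not MuNLP's implementation: basic_stats and its regular expressions are illustrative assumptions, so its results may differ from the library's.

```python
import re

def basic_stats(text):
    """Return (total words, unique words, approximate sentences, total chars)."""
    # A word is a run of word characters, optionally joined by ', ’ or -.
    words = re.findall(r"\w+(?:[’'-]\w+)*", text.lower())
    # Approximate sentences by splitting on runs of ., ! or ?.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words), len(set(words)), len(sentences), len(text)

stats = basic_stats("The cat sat. The dog sat too!")
print(stats)  # (7, 5, 2, 29)
```

Splitting on punctuation is only a heuristic, which is probably why the library labels its own sentence count as approximate.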

In [7]:
snippet.getNumWords()

74

In [8]:
snippet.getLength()

479

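The words kept for analysis are stored in 'tokens'. The list is shorter than the total word count (48 versus 74 here), presumably because common stop words are filtered out: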

In [9]:
len(snippet.tokens)

48

In [10]:
snippet.getMostCommonWords(7)  # top 7 used words

[('words', 3),
 ('it’s', 3),
 ('different', 3),
 ('raw', 2),
 ('text', 2),
 ('difficult', 2),
 ('completely', 2)]
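
A ranking like this can be reproduced with collections.Counter. The sketch below is an assumption about the general technique, not MuNLP's code; the stop word list and the most_common_words helper are made up for illustration.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the list MuNLP uses).
STOP_WORDS = {"the", "a", "is", "and", "to", "in", "of"}

def most_common_words(words, n=10):
    """Drop stop words, then return the n most frequent remaining words."""
    tokens = [w for w in words if w not in STOP_WORDS]
    return Counter(tokens).most_common(n)

words = "the raw text is the raw material and the text is short".split()
top = most_common_words(words, 2)
print(top)  # [('raw', 2), ('text', 2)]
```

Words with equal counts are returned in the order they first appear, which is how Counter.most_common breaks ties.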

In [120]:
snippet.getTopBigrams()

[(('completely', 'different'), 2)]
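
A bigram is a pair of adjacent tokens. As a hedged sketch of the idea (the top_bigrams helper and its min_count threshold are assumptions, chosen because the output above only lists a pair that occurs twice):

```python
from collections import Counter

def top_bigrams(tokens, n=10, min_count=2):
    """Count adjacent token pairs and keep those seen at least min_count times."""
    pairs = zip(tokens, tokens[1:])
    counts = Counter(pairs)
    return [(pair, c) for pair, c in counts.most_common(n) if c >= min_count]

tokens = ["completely", "different", "order", "completely", "different"]
bigrams = top_bigrams(tokens)
print(bigrams)  # [(('completely', 'different'), 2)]
```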

In [34]:
snippet.getNumSentences()

4

In [35]:
snippet.sentences

['Processing raw text intelligently is difficult: most words are rare, \nand it’s common for words that look completely different to mean almost the same thing.',
 'The same words in a different order can mean something completely different.',
 'Even splitting text into useful word-like units can be difficult in many languages.',
 'While it’s possible to solve some problems starting from only the raw characters, \nit’s usually better to use linguistic knowledge to add useful information.']
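
Note that newlines inside a sentence are preserved in the output above. A simple way to get similar behaviour is to split only after sentence-ending punctuation that is followed by whitespace. This is a sketch under that assumption; split_sentences is a hypothetical helper, and the library may use a different rule.

```python
import re

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace; keep the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = split_sentences("Most words are rare, \nand order matters. Is it hard? Yes!")
print(sentences)
# ['Most words are rare, \nand order matters.', 'Is it hard?', 'Yes!']
```

Abbreviations such as "e.g." would be split incorrectly by this rule, so the resulting count is only approximate.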

# Use a text file

In [13]:
import os
os.getcwd()


'/Users/Massimo/Documents/workspace/MuStudio/MuNLP'

In [14]:
workDirectory = '../../../MyBooks/05 racconti/'

In [15]:
filePath = "01dream/infos.txt"


In [16]:
info = MuNLP.fromText(workDirectory+filePath) # read content from text file
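
Concatenating the two strings works here only because workDirectory ends with a slash. A slightly more robust alternative is to build the path with pathlib. Whether fromText accepts Path objects is not documented here, so this sketch assumes the path is converted back to a string before the call:

```python
from pathlib import Path

# Same locations as above, joined without relying on a trailing slash.
work_directory = Path("../../../MyBooks/05 racconti")
file_path = work_directory / "01dream" / "infos.txt"

# Convert to a plain string before handing it to the library.
path_for_library = str(file_path)
print(file_path.as_posix())  # ../../../MyBooks/05 racconti/01dream/infos.txt
```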

In [17]:
info.getStats()

(12, 12, 1, 107)

The whole content is stored in 'text', while the cleaned version used for analysis (no punctuation, no newlines, lowercase) is in 'cleanText':

In [18]:
info.text

'Title:          What dreams are made of\nSerie:          Racconti\nVolume:         1\nAuthor:         Massimo\n'

In [19]:
info.cleanText

'title          what dreams are made of serie          racconti volume         1 author         massimo'
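
The cleaning described above (lowercase, no punctuation, no newlines) can be sketched with the standard library. The clean_text function below is an illustrative assumption, not the library's code. Like the output above, it keeps runs of spaces and turns each newline into a single space.

```python
import string

def clean_text(text):
    """Lowercase, strip ASCII punctuation, and replace newlines with spaces."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return no_punct.replace("\n", " ").strip()

cleaned = clean_text("Title:  What dreams\nVolume: 1\n")
print(repr(cleaned))  # 'title  what dreams volume 1'
```

string.punctuation covers ASCII characters only, so a typographic apostrophe such as the one in "it’s" is left untouched.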

In [122]:
info.getMostCommonWords(3)

[]

In [123]:
info.getTopBigrams()

[]

# Use an existing PDF document as input

In [21]:
file = '05invasion/Daily life in the American indigenous societies before the European conquest.pdf'


## Instantiate MuNLP object from PDF

Call fromPDF on the class itself and pass it the file name; it returns a MuNLP object containing the text extracted from the file.

In [22]:
doc = MuNLP.fromPDF(workDirectory+file) # read content from PDF file

Total read pages:  10


In [23]:
type(doc)

MuNLP.MuNLP
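
Methods such as fromText and fromPDF are called on the class and return an instance, which is the usual "alternative constructor" pattern in Python. How MuNLP implements them is not shown here. The sketch below illustrates the pattern with a hypothetical TextDoc class and a plain text file, since the PDF backend used by the library is unknown.

```python
import tempfile
from pathlib import Path

class TextDoc:
    """Hypothetical stand-in for a class like MuNLP."""

    def __init__(self, text):
        self.text = text

    @classmethod
    def from_text_file(cls, path, encoding="utf-8"):
        """Alternative constructor: read the file, then build the object."""
        return cls(Path(path).read_text(encoding=encoding))

# Write a small temporary file and load it through the alternative constructor.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "infos.txt"
    sample_path.write_text("Volume: 1\n", encoding="utf-8")
    doc_from_file = TextDoc.from_text_file(sample_path)

print(type(doc_from_file).__name__)  # TextDoc
```

Keeping the file handling in a class method leaves the regular constructor free to accept text directly, as in the string example earlier.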

In [24]:
print("Total number of characters: ", doc.getLength())

Total number of characters:  22806


In [25]:
print("Total number of words: ", doc.getNumWords()) 

Total number of words:  3060


In [26]:
doc.getStats(verbose=1)  # both words and chars, verbose prints the total numbers

Total words:  3060
Unique words:  1067
Approximate total number of sentences:  120
Total chars:  22806


(3060, 1067, 120, 22806)

In [27]:
doc.language  # the document's language; 'EN' is the default

'EN'

In [28]:
len(doc.tokens)

2081

In [30]:
doc.getMostCommonWords()

[('indigenous', 55),
 ('social', 27),
 ('european', 26),
 ('cultural', 26),
 ('practices', 24),
 ('societies', 23),
 ('often', 19),
 ('trade', 18),
 ('native', 18),
 ('american', 16)]

In [124]:
doc.getTopBigrams(5)

[(('indigenous', 'societies'), 13),
 (('european', 'contact'), 12),
 (('indigenous', 'peoples'), 11),
 (('native', 'american'), 11),
 (('social', 'structures'), 7)]

In [31]:
doc.getNumSentences()

120

In [33]:
doc.sentences[1:4]

['Please consider checking important information.',
 "The generated content does not represent the developer's viewpoint.",
 'summary\nDaily life in American indigenous societies before the European conquest was char-\nacterized by a rich diversity of cultures, social structures, and sustainable interactions \nwith the environment.']