<h1><i>kiara</i>: Natural Language Processing (NLP)</h1>

Welcome back! Now that we're comfortable with what *kiara* looks like and what it can do to help track your data and your research process, let's try out some of the digital analysis tools, starting with **Natural Language Processing**.

<h1>Why NLP?</h1>

First of all, why bother with NLP? Natural language processing technology allows researchers to sort through unstructured data such as plain text. In other words, by representing text numerically, computers can <i>understand</i> language and perform advanced operations such as text categorisation, labelling, summarisation and so on.
There are two main stages in NLP: pre-processing and analysis (that is, algorithm development and/or implementation). Here we cover both stages. In the first part, we work through some of the most common pre-processing operations, such as tokenisation, lowercasing and removing stopwords. In the second part, we use a widely used text analysis method called topic modelling.
For more information about the pre-processing operations and topic modelling, and a more in-depth discussion aimed particularly at humanities research, please refer to [this repository](https://github.com/DHARPA-Project/TopicModelling-).

<h3>Starting the Process</h3>

Let's start by double checking that we have all the required plugins, and setting up an API for us to use *kiara*. We'll do this all in one go this time, but if you're unsure, feel free to head back to the [installation notebook](http://dharpa.org/kiara.documentation/latest/workshop/workshop/) to look over this section again.

In [1]:
import sys

try:
    from kiara_plugin.jupyter import ensure_kiara_plugins
except ImportError:
    # The helper plugin is missing: install it into the current environment, then retry.
    print("Installing 'kiara_plugin.jupyter'...")
    !{sys.executable} -m pip install -q kiara_plugin.jupyter
    from kiara_plugin.jupyter import ensure_kiara_plugins

# Install a pinned version of the language processing plugin.
!{sys.executable} -m pip install kiara_plugin.language_processing==0.4.13

ensure_kiara_plugins()

from kiara.api import KiaraAPI
kiara = KiaraAPI.instance()

Now we're all set up, we want to download some text to work with in our language processing analysis. <br/>
For our example here we will be using a relatively small number of texts. This is a sample taken from the larger corpus [*ChroniclItaly 3.0*](http://doi.org/10.5281/zenodo.4596345) (Viola and Fiscarelli 2021, [Viola 2021](https://www.euppublishing.com/doi/full/10.3366/ijhac.2021.0268)), an open access digital heritage collection of Italian immigrant newspapers published in the United States from 1898 to 1936.
The corpus that we use here includes the digitized (OCRed) front pages of the Italian language newspaper *La rassegna* as collected from [*Chronicling America*](https://chroniclingamerica.loc.gov/newspapers/), an Internet-based, searchable database of U.S. newspapers published in the United States from 1789 to 1963 made available by the Library of Congress.
These files are also a good example because their filenames already contain important metadata such as the publication date. The file name structure is: LCCNnumber_date_edition_pageNumber_ocr.txt. Therefore, the file name ‘sn84037025_1917-04-14_ed-1_seq-1_ocr.txt’ refers to the OCR text file of the first page of the first edition of *La Rassegna* published on 14 April 1917. *kiara* allows us to retrieve both the files and the metadata in the filenames. This is very useful for historical research, but also for keeping track of how we are intervening on our sources. Let's see how this works.
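To make the file name convention concrete before we turn to *kiara*, here is a minimal plain-Python sketch of how such a name could be split into its parts by hand. The regular expression, the function name `parse_filename` and the dictionary keys are our own illustration, not part of *kiara*, and the pattern is an assumption based only on the single example name given above.

```python
import re
from datetime import date

# Pattern for names like 'sn84037025_1917-04-14_ed-1_seq-1_ocr.txt'.
# The group names are illustrative and not taken from kiara.
FILENAME_PATTERN = re.compile(
    r"^(?P<lccn>[^_]+)_(?P<date>\d{4}-\d{2}-\d{2})"
    r"_ed-(?P<edition>\d+)_seq-(?P<page>\d+)_ocr\.txt$"
)


def parse_filename(name):
    """Split a file name into LCCN, publication date, edition and page."""
    match = FILENAME_PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"Unexpected file name: {name!r}")
    return {
        "lccn": match["lccn"],
        "date": date.fromisoformat(match["date"]),
        "edition": int(match["edition"]),
        "page": int(match["page"]),
    }


info = parse_filename("sn84037025_1917-04-14_ed-1_seq-1_ocr.txt")
print(info["lccn"], info["date"].isoformat(), info["edition"], info["page"])
```

Names that do not follow the pattern raise a `ValueError` instead of being silently skipped, which makes irregular files in a corpus easy to spot.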

In [2]:
kiara.list_operation_ids('download')

['download.file', 'download.file_bundle']

NLP tasks usually require large numbers of files. <br/>
This is why *kiara* allows us to work both with individual files and larger corpora. Here we will use the `download.file_bundle` command to load several files.

In [3]:
kiara.retrieve_operation_info('download.file_bundle')

For the purpose of this notebook tutorial, we are using files that are online, so in order to use them, we need to specify the URL where our files live. If our files are stored locally, however, we can use the `import.file.from.local_path` command.
Let's now download our online files.

Again, we need to define the <span style="color:green">inputs</span>, use `kiara.run_job` with our chosen operation `download.file_bundle` and store this as our <span style="color:red">outputs</span>.

In [4]:
inputs = {
    "url": "https://github.com/DHARPA-Project/kiara.examples/archive/refs/heads/main.zip",
    "sub_path": "kiara.examples-main/examples/data/text_corpus/data"
 }

outputs = kiara.run_job('download.file_bundle', inputs=inputs)
outputs


patool: Extracting /tmp/tmpppd_hbwm ...
patool: ... /tmp/tmpppd_hbwm extracted to `/tmp/tmpesvds0hl'.


Great, we've successfully imported a bundle of files. This has given us both the metadata for the files, and the files themselves. As you can see, *kiara* also gives us additional information on the composition of the text files, that is, the number of tokens. This information will be useful later, when we intervene on these files, to keep track of how we have changed them. For now, let's save the files in a separate variable for us to use later.

In [5]:
file_bundle = outputs['file_bundle']

<h3>Preparing the Texts</h3>

Now that we have imported the files, let's give them some structure. For this, we will need the `create.table.from.file_bundle` function (similar to the [installation notebook](https://github.com/DHARPA-Project/kiara.documentation/blob/develop/docs/workshop/workshop.ipynb), which you are welcome to revisit at any time). Let's have a look at the details of this operation.

In [6]:
kiara.retrieve_operation_info('create.table.from.file_bundle')

Let's use the file bundle we downloaded earlier and saved in our variable, and run this *kiara* table function.

In [7]:
inputs = {
    'file_bundle' : file_bundle
}

outputs = kiara.run_job('create.table.from.file_bundle', inputs=inputs)
outputs

Great, this has taken all the information from the files we downloaded and made it a bit easier to navigate. In order to process and analyse our sources, we need to work with the files' content, which is in the column 'content'. Let's run `kiara.list_operation_ids('table')` to see how we might be able to do that.

In [8]:
kiara.list_operation_ids('table')

['create.database.from.table',
 'create.network_data.from.tables',
 'create.table.from.file',
 'create.table.from.file_bundle',
 'export.table.as.csv_file',
 'extract.date_array.from.table',
 'import.table.from.local_file_path',
 'import.table.from.local_folder_path',
 'query.table',
 'table.pick.column',
 'table_filter.drop_columns',
 'table_filter.select_columns',
 'table_filter.select_rows']

As we are interested in one column, the `table.pick.column` operation seems like a good fit.

In [9]:
kiara.retrieve_operation_info('table.pick.column')

So here we need two <span style="color:green">inputs</span>, the **table** we just made and the name of the **column** we want to pick.

Let's specify our outputs again and run the function. In this way, we retain the content of the files as the variable we need for NLP.

In [10]:
inputs = {
    'table' : outputs['table'],
    'column_name' : 'content'
}

outputs = kiara.run_job('table.pick.column', inputs=inputs)
outputs

<h3>Natural Language Processing (Stage 1)</h3>

Now we are ready to prepare our text for analysis. Let's see what operations are included in *kiara* for NLP in the `kiara_plugin.language_processing` package.

In [11]:
infos = kiara.retrieve_operations_info()
operations = {}
for op_id, info in infos.item_infos.items():
    if info.context.labels.get("package", None) == "kiara_plugin.language_processing":
        operations[op_id] = info

print(operations.keys())

dict_keys(['create.stopwords_list', 'generate.LDA.for.tokens_array', 'preprocess.tokens_array', 'tokenize.string', 'tokenize.texts_array'])
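If you want to repeat this kind of filtering for other plugins, the loop can be wrapped in a small helper. The sketch below is only an illustration: the function name `operations_for_package` is ours, the attribute path `info.context.labels` is copied from the cell above, and the stand-in objects (including the placeholder package name `some.other.package`) exist only so that the example runs without *kiara* installed.

```python
from types import SimpleNamespace


def operations_for_package(item_infos, package):
    """Return {op_id: info} for operations whose 'package' label matches."""
    return {
        op_id: info
        for op_id, info in item_infos.items()
        if info.context.labels.get("package") == package
    }


def _fake_info(package):
    # Stand-in for a kiara operation info object (hypothetical, for the demo only).
    return SimpleNamespace(context=SimpleNamespace(labels={"package": package}))


fake_infos = {
    "tokenize.string": _fake_info("kiara_plugin.language_processing"),
    "some.other.operation": _fake_info("some.other.package"),
}

selected = operations_for_package(fake_infos, "kiara_plugin.language_processing")
print(sorted(selected))
```

With a real *kiara* session, the first argument would presumably be `kiara.retrieve_operations_info().item_infos`, as in the cell above.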


The contents of our text files have been stored as an **array**. Before performing any operation, we should start by tokenising our text. We can do this by using the `tokenize.texts_array` function.

If you're unsure about which of these operations you should run, you can refer to the in-built explanation in each *kiara* module, which clarifies what each operation does. For further information about the pros and cons of each pre-processing operation, please refer to [this repository](https://github.com/DHARPA-Project/TopicModelling-).

In [12]:
kiara.retrieve_operation_info('tokenize.texts_array')

Great, let's give it a go!

In [13]:
inputs = {
    'texts_array': outputs['array']
}

outputs = kiara.run_job('tokenize.texts_array', inputs=inputs)
outputs

We can see from the printed preview that this has tokenized the contents for each of the text files we imported.
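If tokenisation is new to you, the idea can be sketched in a few lines of plain Python. This is a deliberately simple regular-expression tokeniser for illustration; it is not the method *kiara* uses, and the two sample sentences are invented for the demo.

```python
import re


def simple_tokenize(text):
    # \w+ matches runs of letters, digits and underscores (Unicode-aware),
    # so accented Italian letters stay inside their word and punctuation is dropped.
    return re.findall(r"\w+", text)


texts = ["La Rassegna è un giornale.", "Pubblicato nel 1917!"]
tokens_per_text = [simple_tokenize(text) for text in texts]
print(tokens_per_text)
```

Each text becomes its own list of tokens, which mirrors the shape of the tokens array: one entry per file, each entry holding that file's tokens.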

Now we can work on pre-processing the text. Let's look at what options we have in the `preprocess.tokens_array` operation.

In [14]:
kiara.retrieve_operation_info('preprocess.tokens_array')

*kiara* includes the most widely used text analysis pre-processing operations. Let's try some of them and take a few moments to notice how they change our text.

Let's start by removing the so-called stopwords. These are low-information words such as articles, pronouns, prepositions and conjunctions, which are not semantically salient. There are numerous stopword lists available for many, though not all, languages, and they can be easily adapted to the individual researcher's needs. Here we are defining our **stopword list**, but do experiment yourself with adding and changing some of the words.

In [15]:
kiara.retrieve_operation_info('create.stopwords_list')

Let's say we trust the default 'italian' stopword list that comes with nltk, but want to augment it with a few of our custom stopwords. We'd do it like this:

In [16]:
custom_stopword_list = ['la', 'i']

inputs = {
    "languages": ["italian"],
    "stopwords": custom_stopword_list
}

stopwords_outputs = kiara.run_job('create.stopwords_list', inputs=inputs)
my_stopwords_list = stopwords_outputs['stopwords_list']
my_stopwords_list.data

ListModel(list_data=['a', 'abbia', 'abbiamo', 'abbiano', 'abbiate', 'ad', 'agl', 'agli', 'ai', 'al', 'all', 'alla', 'alle', 'allo', 'anche', 'avemmo', 'avendo', 'avesse', 'avessero', 'avessi', 'avessimo', 'aveste', 'avesti', 'avete', 'aveva', 'avevamo', 'avevano', 'avevate', 'avevi', 'avevo', 'avrai', 'avranno', 'avrebbe', 'avrebbero', 'avrei', 'avremmo', 'avremo', 'avreste', 'avresti', 'avrete', 'avrà', 'avrò', 'avuta', 'avute', 'avuti', 'avuto', 'c', 'che', 'chi', 'ci', 'coi', 'col', 'come', 'con', 'contro', 'cui', 'da', 'dagl', 'dagli', 'dai', 'dal', 'dall', 'dalla', 'dalle', 'dallo', 'degl', 'degli', 'dei', 'del', 'dell', 'della', 'delle', 'dello', 'di', 'dov', 'dove', 'e', 'ebbe', 'ebbero', 'ebbi', 'ed', 'era', 'erano', 'eravamo', 'eravate', 'eri', 'ero', 'essendo', 'faccia', 'facciamo', 'facciano', 'facciate', 'faccio', 'facemmo', 'facendo', 'facesse', 'facessero', 'facessi', 'facessimo', 'faceste', 'facesti', 'faceva', 'facevamo', 'facevano', 'facevate', 'facevi', 'facevo', 'fai

In [17]:
inputs = {
    'tokens_array': outputs['tokens_array'],
    'remove_stopwords' : my_stopwords_list
}

outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs
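To make the effect of this step concrete, here is a plain-Python sketch of the same idea: merge a default list with custom words, then drop every token that appears in the merged list. The four-word default list is a tiny stand-in for a real stopword list, the sample tokens are invented, and the exact-match comparison is our assumption rather than a description of how *kiara* implements the operation.

```python
# Tiny stand-in for a full default stopword list (illustrative only).
default_stopwords = {"il", "di", "e", "che"}
# The same custom words used in the stopword cell above.
custom_stopwords = ["la", "i"]

# A set union merges both lists and removes duplicates.
stopwords = default_stopwords | set(custom_stopwords)


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


tokens = ["la", "rassegna", "e", "i", "giornali", "di", "Filadelfia"]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

Using a set rather than a list for the stopwords keeps each membership check fast, which matters once a corpus grows to many thousands of tokens.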

Great. Let's take this a bit further and try and combine two of our options in one function. In reality, we can add all the <span style="color:green">inputs</span> together in one job, but let's start with converting everything into lowercase and removing any words with non-alphanumeric symbols.

In [18]:
inputs = {
    'tokens_array': outputs['tokens_array'],
    'to_lowercase' : True,
    'remove_non_alpha' : True
}

outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs
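As a rough plain-Python equivalent of these two options, the sketch below lowercases every token and then keeps only the purely alphabetic ones. The function name and the sample tokens are ours, and the use of `str.isalpha()` is an assumption about what "non-alpha" means; check the operation info above for *kiara*'s exact definition.

```python
def lowercase_and_keep_alpha(tokens):
    """Lowercase all tokens, then drop any token containing non-letter characters."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token.isalpha()]


tokens = ["La", "Rassegna", "1917", "ed-1", "città", "U.S."]
cleaned = lowercase_and_keep_alpha(tokens)
print(cleaned)
```

Note that order can matter when steps are chained by hand: with exact-match stopword removal, a capitalised "La" would survive if stopwords were removed before lowercasing. Whether this applies to *kiara*'s operations is worth checking in the printed previews.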

Now that we're happy with our pre-processed texts, we can use `generate.LDA.for.tokens_array` to try out some topic modelling. The default number of topics is set at seven, but just like the `preprocess.tokens_array` operation, we can play around with the options. Let's have a look.

In [19]:
kiara.retrieve_operation_info('generate.LDA.for.tokens_array')

We'll stick with the default for now, and generate some topics for our text.

In [20]:
inputs = {
    'tokens_array' : outputs['tokens_array']
}

outputs = kiara.run_job('generate.LDA.for.tokens_array', inputs=inputs)
outputs
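It can help to know what a topic model actually consumes. LDA treats each document as a "bag of words": word order is discarded and only per-document word counts are kept. The standard-library sketch below builds such counts for two invented toy documents. It illustrates the general representation only; whether *kiara* builds exactly this structure internally is an assumption on our part.

```python
from collections import Counter

# Two toy documents, already tokenised and pre-processed (invented for the demo).
docs = [
    ["guerra", "italia", "guerra", "pace"],
    ["lavoro", "italia", "lavoro"],
]

# Give every distinct token a stable integer id.
vocabulary = sorted({token for doc in docs for token in doc})
token_ids = {token: index for index, token in enumerate(vocabulary)}


def to_bag_of_words(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(doc)
    return sorted((token_ids[token], n) for token, n in counts.items())


corpus = [to_bag_of_words(doc) for doc in docs]
print(vocabulary)
print(corpus)
```

This also shows why pre-processing matters so much for topic modelling: every stopword or OCR fragment left in the tokens becomes another entry in the vocabulary and another count the model has to account for.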

<h3>Recording and Tracing our Data</h3>

We've successfully downloaded, organised and pre-processed our text files, and now generated some topics for them. <br/>
Fantastic!

As we know, this means we've made lots of decisions about our research process and our data. But by using *kiara*, we can trace what's changed and the decisions we've made. Let's have a look!


<span style="color:blue">As with the installation notebook, there is not much to see here yet, but this will be updated as changes come. It would potentially be useful, for operations that take options (like the pre-processing), to know whether an option has been selected or not.</span>

In [21]:
topics = outputs['topic_models']

topics.lineage
