# **Topic modelling with *kiara***
In this tutorial, you will learn how to use the state-of-the-art data orchestration software *kiara* to perform topic modelling on a large-scale corpus.

# **Digital research**
As digital methods are increasingly used in research, the need for transparency and critical reflection is ever more pressing. Regardless of the type of digital analysis performed, or even the software used, documenting the *process* and not just the product becomes integral to the validity of the research findings.

Despite a fast-changing technological landscape, however, there is a frustrating lack of tools that allow you to document the process of digital knowledge production. In other words, how can we keep track of our methodological decisions and interventions?

# **Introducing *kiara***
*kiara* is state-of-the-art software that allows researchers to document and critically reflect not only on the digital methods and approaches they use but, perhaps more importantly, on how those methods and choices affect the sources. When you begin using *kiara*, your research choices and the changes to the data are recorded, so that you can visualise and examine the individual steps you took and how your data changed accordingly. This matters for the researcher, but also for reproducibility and replicability.

*kiara* currently features several digital research approaches such as textual analysis and network analysis. In this tutorial, we will cover **Natural Language Processing** (NLP). 

For more information and updates on *kiara*, please check the [project's repository](https://github.com/DHARPA-Project/kiara).

# **Why NLP?**

Natural language processing allows researchers to sort through unstructured data such as plain text. In other words, by representing text numerically, computers can 'understand' language and perform advanced operations such as text categorisation, labelling and summarisation. NLP has two main stages: pre-processing and analysis (that is, developing and/or implementing an algorithm). Here we will cover both. For pre-processing, we will use some of the most common operations, such as tokenisation, lowercasing and removing stopwords. For the analysis, we will use a widely used text analysis method called **Topic Modelling**. For more information about pre-processing operations and topic modelling, and for a more in-depth discussion aimed at humanities research, please refer to [this repository](https://github.com/DHARPA-Project/TopicModelling-).
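
To make the idea of giving text a numerical form more concrete, here is a minimal sketch in plain Python. It is not part of *kiara*: the two example sentences and the variable names are made up for illustration, and the word-count representation shown is only one of many ways to turn text into numbers.

```python
from collections import Counter

# Two made-up example documents (not taken from the corpus).
documents = [
    "gli italiani in america",
    "the italians in america",
]

# Build the vocabulary: every distinct word gets one position in the vector.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Represent each document as a list of word counts over that vocabulary.
vectors = []
for doc in documents:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Each document becomes a list of numbers of the same length, which is the kind of input that methods such as topic modelling can work with.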

# **The case study: narratives of migration by Italian Americans, 1898-1936**
This tutorial uses *kiara* to examine how the changing experience of migration, identity construction, and assimilation is reflected over time in the accounts of migrants themselves. Using a corpus of Italian American newspapers, the analysis will allow us to address the fundamental question of whether ethnic media facilitate assimilation and integration, or rather isolate immigrants from their new society by keeping them in the cultural sphere of their homeland (Parks, 2014).
The case study used for this tutorial is partially based on [Viola and Verheul (2019)](https://academic.oup.com/dsh/article/35/4/921/5601610#209907403).

# **Before starting**
For this tutorial, you will need a basic understanding of Python and how to run Python code, whether that be via Jupyter Notebooks or using a text editor and the command line. You will also need to install *kiara*. Please refer to the installation procedure [here](http://dharpa.org/kiara.documentation/latest/workshop/workshop/).

In [None]:
%%capture
! pip install git+https://github.com/DHARPA-Project/kiara_plugin.dh_benelux_2023

# **Preparing the environment**
In order to use *kiara*, we first need to create a KiaraAPI instance. An API allows us to control and interact with *kiara* and its functions. In *kiara* this also allows us to obtain information on the available operations and on how these operations impact on the sources. For more information about the KiaraAPI, see the *kiara* API documentation [here](https://dharpa.org/kiara/latest/reference/kiara/interfaces/python_api/__init__/#kiara.interfaces.python_api.KiaraAPI).

In [None]:
from kiara.api import KiaraAPI

kiara = KiaraAPI.instance()

We can now ask *kiara* to list all the operations that are included with the plugins we installed in the previous step.

In [None]:
kiara.list_operation_ids()

['assemble.network_data.from.files',
 'assemble.network_data.from.tables',
 'create.database.from.file',
 'create.database.from.file_bundle',
 'create.database.from.table',
 'create.network_data.from.file',
 'create.stopwords_list',
 'create.table.from.file',
 'create.table.from.file_bundle',
 'date.check_range',
 'date.extract_from_string',
 'download.file',
 'download.file_bundle',
 'export.file.as.file',
 'export.network_data.as.csv_files',
 'export.network_data.as.graphml_file',
 'export.network_data.as.sql_dump',
 'export.network_data.as.sqlite_db',
 'export.table.as.csv_file',
 'extract.date_array.from.table',
 'file_bundle.pick.file',
 'file_bundle.pick.sub_folder',
 'generate.LDA.for.tokens_array',
 'import.database.from.local_file_path',
 'import.file',
 'import.file_bundle',
 'import.local.file',
 'import.local.file_bundle',
 'import.network_data.from.local_file_paths',
 'import.table.from.local_file_path',
 'import.table.from.local_folder_path',
 'list.contains',
 'logic.and

# **Data onboarding**
In this tutorial, we will be using a sample from [*ChroniclItaly 3.0*](https://zenodo.org/record/4596345#.ZFJaGnZBw2w) (Viola and Fiscarelli 2021, Viola 2021), an open access digital heritage collection of Italian immigrant newspapers published in the United States from 1898 to 1936. This corpus includes the digitized (OCRed) front pages of ten Italian American newspapers as collected from [*Chronicling America*](https://chroniclingamerica.loc.gov/), an Internet-based, searchable database of U.S. newspapers published from 1789 to 1963 and made available by the Library of Congress. The corpus is also a good example of how we can use metadata for historical research. The names of the files in *ChroniclItaly 3.0* already contain important metadata such as the publication date and the publication reference number. The file name structure is:

LCCNnumber_date_pageNumber_ocr.txt

Therefore, the file name ‘sn84037025_1917-04-14_ed-1_seq-1_ocr.txt’ refers to the OCR text file of the first page of the first edition of the title *La Rassegna* published on 14 April 1917. If your file names also contain metadata, *kiara* allows you to retrieve both the files and the metadata in the file names. This is very useful for historical research, but also for keeping track of how we are intervening on our sources. Let's see how this works.
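
Before handing the files over to *kiara*, it may help to see what such a file name encodes. The following is a minimal sketch in plain Python, not *kiara*'s own mechanism: the regular expression, the function name `parse_filename` and the field labels are assumptions made for illustration, based only on the example file name above.

```python
import re
from datetime import date

# Pattern for names such as 'sn84037025_1917-04-14_ed-1_seq-1_ocr.txt'.
# The group names are illustrative labels chosen for this sketch.
FILENAME_PATTERN = re.compile(
    r"^(?P<lccn>[^_]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_"
    r"ed-(?P<edition>\d+)_"
    r"seq-(?P<page>\d+)_ocr\.txt$"
)


def parse_filename(filename: str) -> dict:
    """Split a ChroniclItaly-style file name into its metadata fields."""
    match = FILENAME_PATTERN.match(filename.strip())
    if match is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    return {
        "lccn": match["lccn"],
        "date": date.fromisoformat(match["date"]),
        "edition": int(match["edition"]),
        "page": int(match["page"]),
    }


file_metadata = parse_filename("sn84037025_1917-04-14_ed-1_seq-1_ocr.txt")
print(file_metadata)
```

Having the date as a proper `date` object, rather than as part of a string, is what later makes it possible to group or filter documents by period.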

In [None]:
inputs = {
    "url": "https://github.com/DHARPA-Project/kiara.examples/archive/refs/heads/main.zip",
    "sub_path": "kiara.examples-main/examples/workshops/dh_benelux_2023/data"
 }

outputs = kiara.run_job('download.file_bundle', inputs=inputs)
outputs

patool: Extracting /tmp/tmpq42o27ez ...
patool: running /usr/bin/7z x -o/tmp/tmpg701j27e -- /tmp/tmpq42o27ez
patool: ... /tmp/tmpq42o27ez extracted to `/tmp/tmpg701j27e'.


Great, we've successfully imported a bundle of files. This has given us both the files themselves and their metadata. As you can see, *kiara* also gives us additional information on the composition of the text files, namely the number of tokens. This information will be useful later, when we intervene on these files, to keep track of how we have changed them. For now, let's save the files in a separate variable for later use.

In [None]:
file_bundle = outputs['file_bundle']

Now that we have imported the files, let's give them some structure. For example, we can create a dataframe to contain our files and metadata. Let's have a look at the `create.table.from.file_bundle` function. 

In [None]:
kiara.retrieve_operation_info('create.table.from.file_bundle')

As you can see, *kiara* also provides us with meta information on each of its functions. This built-in documentation is also helpful for teaching purposes. Now that we know what `create.table.from.file_bundle` does, we can use it with our sources.

In [None]:
inputs = {
    'file_bundle' : file_bundle
}

outputs = kiara.run_job('create.table.from.file_bundle', inputs=inputs)
outputs

*kiara* has taken all the information from our files and made it a bit easier to navigate. In order to process and analyse our sources, we need to work with the textual content in the column 'content' of the table we just created. Let's explore the available operations specific for tables by running `kiara.list_operation_ids('table')`.

In [None]:
kiara.list_operation_ids('table')

['assemble.network_data.from.tables',
 'create.database.from.table',
 'create.table.from.file',
 'create.table.from.file_bundle',
 'export.table.as.csv_file',
 'extract.date_array.from.table',
 'import.table.from.local_file_path',
 'import.table.from.local_folder_path',
 'query.table',
 'table.pick.column',
 'table_filter.drop_columns',
 'table_filter.select_columns',
 'table_filter.select_rows']

As we are interested in one column, the `table.pick.column` operation seems like a good fit.

In [None]:
inputs = {
    'table' : outputs['table'],
    'column_name' : 'content'
}

outputs = kiara.run_job('table.pick.column', inputs=inputs)
outputs

# **Natural Language Processing (Stage 1)**
We are now ready to prepare our texts for analysis; this is stage 1. Let's see which NLP operations *kiara* includes in the `kiara_plugin.language_processing` package.

In [None]:
# Retrieve the metadata for every available operation
infos = kiara.retrieve_operations_info()

# Keep only the operations provided by the language-processing plugin
operations = {}
for op_id, info in infos.item_infos.items():
    if info.context.labels.get("package", None) == "kiara_plugin.language_processing":
        operations[op_id] = info

print(operations.keys())

dict_keys(['create.stopwords_list', 'generate.LDA.for.tokens_array', 'preprocess.tokens_array', 'tokenize.string', 'tokenize.texts_array'])


Before performing any operation, we should start by tokenising our text. Tokenisation is a process that tells the machine at which level we want to perform our operations (e.g., character level, word level, sentence level). In other words, by tokenising we define the boundaries of the elements in our texts. 
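
To illustrate what these levels look like in practice, here is a minimal sketch in plain Python. It uses deliberately naive rules based on regular expressions; the example sentence is made up, and the rules that *kiara*'s `tokenize.texts_array` operation applies may well differ.

```python
import re

# A made-up example text (not taken from the corpus).
text = "Gli italiani in America. The Italians in America."

# Word level: runs of letters, digits or underscores (a deliberately naive rule).
word_tokens = re.findall(r"\w+", text)

# Sentence level: split after sentence-final punctuation followed by whitespace.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Character level: every non-space character is its own token.
char_tokens = [ch for ch in text if not ch.isspace()]

print(word_tokens)
print(sentence_tokens)
print(len(char_tokens))
```

The same text yields very different units depending on the level chosen, which is why this decision comes before any other pre-processing step.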

If you're unsure about which of these operations you should run, you can refer to the built-in explanation in each *kiara* module or to [this repository](https://github.com/DHARPA-Project/TopicModelling-) for further information about the pros and cons of each pre-processing operation.

In [None]:
kiara.retrieve_operation_info('tokenize.texts_array')

As we will later perform topic modelling, let's tokenise our texts at word level.

In [None]:
inputs = {
    'texts_array': outputs['array']
}

outputs = kiara.run_job('tokenize.texts_array', inputs=inputs)
outputs

Now that the text of our files is tokenised, we can experiment with different pre-processing operations. Again, we can use `kiara.retrieve_operation_info` to explore what *kiara* can do.

In [None]:
kiara.retrieve_operation_info('preprocess.tokens_array')

*kiara* includes the most widely used text analysis pre-processing operations. Let's try some of them and take a few moments to notice how they change our texts.

Let's start by removing stopwords. These are low-information words that are not semantically salient, such as articles, pronouns, prepositions and conjunctions. Stopword lists are available for many, though not all, languages, and they can easily be adapted to the individual researcher's needs. Here we define our own stopword list, but feel free to experiment with yours.

In [None]:
import pandas as pd
custom_stopword_list = list(pd.read_csv('stop_words.csv')['stopword'])
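
The cell above expects a file called `stop_words.csv` with a column headed `stopword`. If you want to build your own list, the sketch below shows one layout that would fit that expectation, read here with Python's standard `csv` module instead of pandas. The four words are hypothetical placeholders, not the contents of the actual file.

```python
import csv
import io

# Hypothetical file content: one header row, then one stopword per line.
# The real stop_words.csv is not reproduced here.
csv_text = "stopword\nche\ndel\nthe\nand\n"

# DictReader uses the header row as keys, so each row is {'stopword': ...}.
reader = csv.DictReader(io.StringIO(csv_text))
custom_stopwords = [row["stopword"] for row in reader]

print(custom_stopwords)
```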

Now that we have defined our stopword list, let's tell *kiara* to use it. 

In [None]:
inputs = {
    "languages": ["italian", "english"],
    "stopwords": custom_stopword_list
}

stopwords_outputs = kiara.run_job('create.stopwords_list', inputs=inputs)
my_stopwords_list = stopwords_outputs['stopwords_list']
my_stopwords_list.data

ListModel(list_data=['Indiana', 'Not', 'Which', 'Who', 'a', 'ab', 'abbastanza', 'abbia', 'abbiamo', 'abbiano', 'abbiate', 'about', 'above', 'ac', 'accidenti', 'ad', 'adesso', 'af', 'affinche', 'after', 'again', 'against', 'agl', 'agli', 'ahime', 'ahimã¨', 'ahimè', 'ai', 'ain', 'al', 'alcuna', 'alcuni', 'alcuno', 'ali', 'alio', 'all', 'alla', 'alle', 'allo', 'allora', 'altre', 'altri', 'altrimenti', 'altro', 'altrove', 'altrui', 'am', 'ami', 'an', 'anche', 'ancho', 'anco', 'ancora', 'and', 'ani', 'anni', 'anno', 'ano', 'ansa', 'anticipo', 'any', 'aono', 'ap', 'ar', 'are', 'aren', "aren't", 'as', 'assai', 'at', 'attesa', 'attraverso', 'au', 'avanti', 'avemmo', 'avendo', 'avente', 'aver', 'avere', 'averlo', 'avesse', 'avessero', 'avessi', 'avessimo', 'aveste', 'avesti', 'avete', 'aveva', 'avevamo', 'avevano', 'avevate', 'avevi', 'avevo', 'avrai', 'avranno', 'avrebbe', 'avrebbero', 'avrei', 'avremmo', 'avremo', 'avreste', 'avresti', 'avrete', 'avrà', 'avrò', 'avuta', 'avute', 'avuti', 'avu

As you can see, our custom stopword list contains both Italian and English words. It is tailored to the specificity of our sources: newspapers written by Italian Americans that are mostly in Italian but also include some English content. Using a custom list is preferable in most cases, as it is likely to improve the quality of the results. We can now go ahead and ask *kiara* to remove the stopwords from our texts.

In [None]:
inputs = {
    'tokens_array': outputs['tokens_array'],
    'remove_stopwords' : my_stopwords_list
}

outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs
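
Conceptually, stopword removal is a filter applied to each list of tokens. The following is a minimal sketch in plain Python with made-up data; the function name `remove_stopwords` is chosen for illustration, and the sketch is not how *kiara* implements the operation.

```python
# Hypothetical tokenised documents and a tiny bilingual stopword list.
documents = [
    ["il", "giornale", "degli", "italiani"],
    ["the", "news", "of", "the", "colony"],
]
stopwords = {"il", "degli", "the", "of"}


def remove_stopwords(tokens, stopword_set):
    """Return the tokens that are not in the stopword set, in original order."""
    return [token for token in tokens if token not in stopword_set]


filtered = [remove_stopwords(doc, stopwords) for doc in documents]
print(filtered)
```

Note that this sketch compares tokens exactly, so a capitalised 'The' would not match the stopword 'the'. Whether matching is case-sensitive in your own pipeline is worth checking, as it determines whether stopwords should be removed before or after lowercasing.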

Another pre-processing operation that is often performed is lowercasing. Again, it is important to keep in mind that each intervention on our sources heavily transforms them and, as such, has pros and cons. For this reason, although it is possible to combine several operations into one function call, it is a good idea to proceed step by step and see how our choices affect the material. For the purposes of this tutorial, let's combine lowercasing with removing non-alphabetic tokens, that is, all tokens that include punctuation or numbers (e.g., ex1a.mple).

In [None]:
inputs = {
    'tokens_array': outputs['tokens_array'],
    'to_lowercase' : True,
    'remove_non_alpha' : True
}

outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs
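
To see what these two interventions do to individual tokens, here is a minimal sketch in plain Python. The token list is made up, and the use of `str.isalpha()` as the test for alphabetic tokens is an assumption of this sketch rather than a description of *kiara*'s internals.

```python
# A made-up list of tokens, including the example from the text above.
tokens = ["Italia", "NEW", "ex1a.mple", "1917", "colonia,", "York"]

# Step 1: lowercase every token.
lowercased = [token.lower() for token in tokens]

# Step 2: keep only tokens made up entirely of letters.
alpha_only = [token for token in lowercased if token.isalpha()]

print(lowercased)
print(alpha_only)
```

Because `str.isalpha()` is Unicode-aware, accented Italian words such as 'avrà' are kept, while any token containing a digit or a punctuation mark, including a trailing comma, is dropped entirely.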

Another operation that we can perform to potentially improve the quality of our results is to remove all tokens whose length is shorter than or equal to a value that we set. Again, this is an intervention on the sources that depends on the researcher's choices and on the specificity of the sources. In this case, as we are working with historical newspapers, the quality of the OCR content is not optimal, meaning that our texts contain several errors. Removing tokens shorter than or equal to X characters may remove many such errors. Here we set this value to 2, but feel free to experiment with different options based on your sources.

In [None]:
inputs = {
    'tokens_array': outputs['tokens_array'],
    'remove_short_tokens' : 2
}

outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs 
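
As a final illustration, here is a minimal sketch in plain Python of what removing short tokens amounts to. The token list and the function name `remove_short_tokens` are made up for this example; the threshold follows the description above, where tokens shorter than or equal to the chosen value are removed.

```python
def remove_short_tokens(tokens, max_short_length):
    """Keep only tokens that are longer than max_short_length characters."""
    return [token for token in tokens if len(token) > max_short_length]


# Made-up tokens, including fragments of the kind OCR errors can produce.
tokens = ["la", "ii", "tbe", "colonia", "x", "italiana"]

kept = remove_short_tokens(tokens, 2)
print(kept)
```

With a threshold of 2, one- and two-character fragments disappear, but a three-character OCR error such as 'tbe' survives. Raising the threshold would catch it at the cost of also removing legitimate short words, which is exactly the kind of trade-off to weigh against your own sources.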