<h1><i>kiara</i>: Language Processing</h1>

Welcome back! Now that we're comfortable with what *kiara* looks like and what it can do to help track your data and your research process, let's try out some of the digital analysis tools, starting with **Language Processing**.


<h3>Starting the Process</h3>

Let's start by double-checking that we have all the required plugins and setting up an API so we can use *kiara*. We'll do this all in one go this time, but if you're unsure, feel free to head back to the [installation notebook](http://dharpa.org/kiara.documentation/latest/workshop/workshop/) to look over this section again.

In [2]:
try:
    from kiara_plugin.jupyter import ensure_kiara_plugins
except ImportError:
    import sys
    print("Installing 'kiara_plugin.jupyter'...")
    !{sys.executable} -m pip install -q kiara_plugin.jupyter
    from kiara_plugin.jupyter import ensure_kiara_plugins

ensure_kiara_plugins()

from kiara import KiaraAPI
kiara = KiaraAPI.instance()
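
The cell above tries an import and installs the plugin only if that import fails. If you'd like to check whether a package is available without importing it, here is a minimal sketch using only the Python standard library. The helper name `is_installed` is our own invention and not part of *kiara*.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when the parent package of a dotted name is missing.
        return False


# 'json' ships with Python, so this is always available.
print(is_installed("json"))  # True

# The result here depends on what is installed in your environment.
print(is_installed("kiara_plugin.jupyter"))
```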

Now that we're all set up, we want to download some text to work with in our language processing analysis. <br/>
Let's have a look and see what there is.

In [3]:
kiara.list_operation_ids('download')

['download.file', 'download.file_bundle']

Last time we only wanted one file, but with language processing we might want a bigger corpus. <br/>
Let's have a look at `download.file_bundle` this time.

In [4]:
kiara.retrieve_operation_info('download.file_bundle')

So we still need a URL, but this time one that points to a zip file we can download.
Here's some example data for us to use.

Again, we need to define the <span style="color:green">inputs</span>, use `kiara.run_job` with our chosen operation `download.file_bundle` and store this as our <span style="color:red">outputs</span>.

In [5]:
inputs = {
    "url": "https://github.com/DHARPA-Project/kiara.examples/archive/refs/heads/main.zip",
    "sub_path": "kiara.examples-main/examples/data/text_corpus/data"
}

outputs = kiara.run_job('download.file_bundle', inputs=inputs)
outputs


patool: Extracting /var/folders/5h/j266_5ss6qj7x37qd77pydh0p248rs/T/tmpjr9umxt6 ...
patool: ... /var/folders/5h/j266_5ss6qj7x37qd77pydh0p248rs/T/tmpjr9umxt6 extracted to `/var/folders/5h/j266_5ss6qj7x37qd77pydh0p248rs/T/tmpnha6uj9d'.
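
To get a feel for what the `sub_path` input is doing, here is a rough sketch that keeps only the files under one folder of a zip archive. It assumes `sub_path` simply restricts the bundle to that folder. This is not how *kiara* implements the download, and the helper `read_sub_path` and the file names are made up for illustration.

```python
import io
import zipfile


def read_sub_path(zip_bytes: bytes, sub_path: str) -> dict[str, str]:
    """Return {relative_path: text} for archive members under `sub_path`."""
    prefix = sub_path.rstrip("/") + "/"
    files = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside the chosen folder.
            if name.startswith(prefix) and not name.endswith("/"):
                files[name[len(prefix):]] = archive.read(name).decode("utf-8")
    return files


# Build a tiny in-memory archive (made-up file names) to try it out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("repo-main/README.md", "not part of the corpus")
    archive.writestr("repo-main/data/a.txt", "first text")
    archive.writestr("repo-main/data/b.txt", "second text")

bundle = read_sub_path(buffer.getvalue(), "repo-main/data")
print(sorted(bundle))  # ['a.txt', 'b.txt']
```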


Great, we've successfully imported a bundle of files this time rather than just one. This has given us both the metadata for the files, and the files themselves. Let's save the files themselves in a separate variable for us to use later.

In [6]:
file_bundle = outputs['file_bundle']

<h3>Preparing the Texts</h3>

Before we can start using the language processing functions, we need to prepare the texts.
Let's start by creating a table, just like we did with the CSV file in the installation notebook, but this time with the `create.table.from.file_bundle` function. Have a look and see what it needs.

In [5]:
kiara.retrieve_operation_info('create.table.from.file_bundle')

Let's use the file bundle we downloaded earlier and saved in our variable, and run this *kiara* table function.

In [7]:
inputs = {
    'file_bundle' : file_bundle
}

outputs = kiara.run_job('create.table.from.file_bundle', inputs=inputs)
outputs

Great, this has taken all the information from the files we downloaded and made it a bit easier to navigate. At the moment, for our language processing, we're just interested in the texts themselves, stored in the 'content' column. Let's have a look at what we might be able to do with our table.

In [8]:
kiara.list_operation_ids('table')

['create.database.from.table',
 'create.network_data.from.tables',
 'create.table.from.file',
 'create.table.from.file_bundle',
 'export.table.as.csv_file',
 'extract.date_array.from.table',
 'filter.table',
 'import.table.from.local_file_path',
 'import.table.from.local_folder_path',
 'query.table',
 'table.pick.column',
 'table_filter.drop_columns',
 'table_filter.select_columns',
 'table_filter.select_rows']

We only want one column, so let's have a look at `table.pick.column`.

In [10]:
kiara.retrieve_operation_info('table.pick.column')

So here we need two <span style="color:green">inputs</span>: the **table** we just made and the name of the **column** we want to pick.

Let's use our outputs again and run the function. This way we can keep just the information we need for our language processing.

In [11]:
inputs = {
    'table' : outputs['table'],
    'column_name' : 'content'
}

outputs = kiara.run_job('table.pick.column', inputs=inputs)
outputs
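
If it helps to picture the last two steps, here is a plain-Python sketch of a table with one row per file, from which we then pick a single column. Only the `content` column name comes from this notebook. The other column names, the example texts and the helper `pick_column` are assumptions for illustration, and *kiara*'s own tables are richer than this.

```python
# Made-up stand-in for a file bundle: file name -> file text.
files = {
    "a.txt": "La stampa di oggi",
    "b.txt": "I lettori del giornale",
}

# A column-oriented table: each key is a column holding one value per file.
table = {
    "file_name": list(files.keys()),
    "content": list(files.values()),
    "size": [len(text) for text in files.values()],
}


def pick_column(table: dict, column_name: str) -> list:
    """Return a copy of one column, or fail clearly if it is missing."""
    if column_name not in table:
        raise KeyError(f"No column named {column_name!r}; have {sorted(table)}")
    return list(table[column_name])


texts = pick_column(table, "content")
print(texts)  # ['La stampa di oggi', 'I lettori del giornale']
```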

<h3>Language Processing and Topic Modelling</h3>

Great, we've downloaded and prepped our material for text analysis. Now we want to see what kind of functions are included in *kiara* for language processing. Let's have a look and see what's included in the `kiara_plugin.language_processing` package.


In [12]:
infos = kiara.retrieve_operations_info()
operations = {}
for op_id, info in infos.item_infos.items():
    if info.context.labels.get("package", None) == "kiara_plugin.language_processing":
        operations[op_id] = info

print(operations.keys())

dict_keys(['create.stopwords_list', 'generate.LDA.for.tokens_array', 'preprocess.tokens_array', 'remove_stopwords.from.tokens_array', 'tokenize.string', 'tokenize.texts_array'])


The contents of our text files have been stored as an **array**, so let's start by tokenizing this using the `tokenize.texts_array` function.

In [11]:
kiara.retrieve_operation_info('tokenize.texts_array')

Great, let's give it a go!

In [13]:
inputs = {
    'texts_array': outputs['array']
}

outputs = kiara.run_job('tokenize.texts_array', inputs=inputs)
outputs
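
Tokenizing means splitting each text into a list of smaller units, usually words and punctuation marks. The sketch below shows the idea with a simple regular expression. It is an assumption-laden stand-in, not the tokenizer *kiara* uses, so the exact splits may differ from the preview above. The example sentences are made up.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches runs of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


texts = ["La stampa, oggi.", "I lettori del 1917!"]

# One list of tokens per text, mirroring the idea of a tokens array.
tokens_array = [tokenize(text) for text in texts]
print(tokens_array[0])  # ['La', 'stampa', ',', 'oggi', '.']
print(tokens_array[1])  # ['I', 'lettori', 'del', '1917', '!']
```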

We can see from the printed preview that this has tokenized the contents for each of the text files we imported.

Now we can work on pre-processing some of this text. Let's look at what options we have in the `preprocess.tokens_array` operation.

In [14]:
kiara.retrieve_operation_info('preprocess.tokens_array')

There are lots of options here to choose from to help pre-process our texts.

Let's start by removing the words in a custom **stopword list**. Feel free to have a play around, and add and change some of the words!

In [15]:
stopword_list = ['la', 'i']

inputs = {
    'tokens_array': outputs['tokens_array'],
    'remove_stopwords' : stopword_list
}

outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs
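
As a rough picture of what stopword removal does, here is a small sketch that drops every token found in the list. The function `remove_stopwords` and the example tokens are made up, and the matching here is exact and case-sensitive, which may or may not be how the *kiara* operation behaves.

```python
def remove_stopwords(tokens: list[str], stopwords: list[str]) -> list[str]:
    """Keep only the tokens that do not appear in `stopwords`."""
    stopword_set = set(stopwords)  # set lookups are fast for long lists
    return [token for token in tokens if token not in stopword_set]


stopword_list = ["la", "i"]
tokens_array = [
    ["la", "stampa", "e", "i", "lettori"],
    ["La", "Stampa"],
]

filtered = [remove_stopwords(tokens, stopword_list) for tokens in tokens_array]

# 'La' survives in the second list because this sketch is case-sensitive.
print(filtered)  # [['stampa', 'e', 'lettori'], ['La', 'Stampa']]
```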

Great. Let's take this a bit further and combine two of our options in one job. In reality, we can add all the <span style="color:green">inputs</span> together in one job, but let's start with converting everything into lowercase and removing any tokens that contain non-alphabetic characters.

In [16]:
inputs = {
    'tokens_array': outputs['tokens_array'],
    'to_lowercase' : True,
    'remove_non_alpha' : True
}

outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs
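
Here is a small sketch of how two options like these can be combined in a single pass over the tokens. It assumes that "non-alpha" means any token containing a character that is not a letter, so numbers and punctuation are dropped. The operation info printed earlier is the place to check what *kiara* actually does. The function `preprocess` and the tokens are made up.

```python
def preprocess(
    tokens: list[str],
    to_lowercase: bool = False,
    remove_non_alpha: bool = False,
) -> list[str]:
    """Apply the selected clean-up steps to a list of tokens."""
    result = []
    for token in tokens:
        # str.isalpha() is True only if every character is a letter.
        if remove_non_alpha and not token.isalpha():
            continue
        result.append(token.lower() if to_lowercase else token)
    return result


tokens = ["La", "Stampa", ",", "1917", "d'oro", "oggi"]

print(preprocess(tokens, to_lowercase=True, remove_non_alpha=True))
# ['la', 'stampa', 'oggi']
```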

Now that we're happy with our pre-processed texts, we can use `generate.LDA.for.tokens_array` to try out some topic modelling. The default number of topics is set at seven, but just like with the `preprocess.tokens_array` operation, we can play around with the options. Let's have a look.

In [18]:
kiara.retrieve_operation_info('generate.LDA.for.tokens_array')

We'll stick with the default for now, and generate some topics for our text.

In [21]:
inputs = {
    'tokens_array' : outputs['tokens_array']
}

outputs = kiara.run_job('generate.LDA.for.tokens_array', inputs=inputs)
outputs

<h3>Recording and Tracing our Data</h3>

We've successfully downloaded, organised and pre-processed our text files, and now generated some topics for them. <br/>
Fantastic!

As we know, this means we've made lots of decisions about our research process and our data. But by using *kiara*, we can trace what's changed and the decisions we've made. Let's have a look!



In [23]:
topics = outputs['topic_models']

topics.lineage