# Keyword Analysis

In this notebook, you will use the KeywordAnalysis tool to analyse the words in a collection of corpora and identify whether particular words are over- or under-represented in one corpus compared with their representation in the other corpora.  

**Note:** the statistical calculations used in this tool are a Python implementation of the calculations provided on this [website](https://ucrel.lancs.ac.uk/llwizard.html).

<div class="alert alert-block alert-warning">
<b>User guide to using a Jupyter Notebook</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

## 1. Setup
Before you begin, you need to import the KeywordAnalysis package and the necessary libraries, and initialise them to run in this notebook.

In [1]:
# import the KeywordAnalysis tool
print('Loading KeywordAnalysis...')
from keyword_analysis import KeywordAnalysis, DownloadFileLink

# initialise the KeywordAnalysis tool
ka = KeywordAnalysis()
print('Finished loading.')

Loading KeywordAnalysis...
Finished loading.


## 2. Load the data
This notebook allows you to upload text data in a text file (or a number of text files). Alternatively, you can upload text stored in a text column of an Excel spreadsheet (see the example below).  

<table style='margin-left: 10px'><tr>
<td> <img src='./img/txt_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/xlsx_icon.png' style='width: 55px'/> </td>
<td> <img src='./img/csv_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/zip_icon.png' style='width: 45px'/> </td>
</tr></table>  

<table style='margin-left: 10px'><tr>
<td> <img src='./img/excel_sample.png' style='width: 600px'/> </td>
</tr></table> 

<div class="alert alert-block alert-warning">
<b>Uploading your text files</b> 
    
If you have a large number of text files (more than 10MB in total), we suggest you compress (zip) them and upload the zip file instead. If you need assistance on how to compress your file, please check [the user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for more info.  
    
If you upload an Excel spreadsheet, please ensure it includes the three columns (text_name, text and source), as shown above. Alternatively, you can upload compressed text files (a zip of .txt files) one corpus at a time. In this case, please make sure you enter the corpus name for each corpus below.
</div>
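
The three-column layout described above can be prepared or checked before uploading. The sketch below uses only the Python standard library to write a small CSV file in that layout and to confirm that the required columns are present. The example rows and the helper names (`rows_to_csv`, `missing_columns`) are hypothetical, and it assumes a CSV file follows the same three-column layout as the Excel example, which this notebook does not state explicitly.

```python
import csv
import io

# Column names taken from the layout described above.
REQUIRED_COLUMNS = ["text_name", "text", "source"]

# Hypothetical example rows, one per text; 'source' is the corpus name.
rows = [
    {"text_name": "text1", "text": "First document in corpus one.", "source": "corpus1"},
    {"text_name": "text2", "text": "Second document in corpus one.", "source": "corpus1"},
    {"text_name": "text3", "text": "A document in corpus two.", "source": "corpus2"},
]


def rows_to_csv(rows):
    """Serialise the rows to CSV text with the required header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def missing_columns(csv_text):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [column for column in REQUIRED_COLUMNS if column not in header]


csv_text = rows_to_csv(rows)
print(csv_text)
print("Missing columns:", missing_columns(csv_text))
```

A check like this reports a missing or misspelled column name before you upload the file.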

<div class="alert alert-block alert-danger">
<b>Large file upload</b> 
    
If you have ongoing issues with the file upload, please re-launch the notebook via Binder again. If the issue persists, consider restarting your computer.
</div>

In [2]:
# upload the text files and/or Excel spreadsheets onto the system
display(ka.upload_box)
print('Uploading large files may take a while. Please be patient.')

VBox(children=(Text(value='', description='Corpus Name:', placeholder='Enter corpus name...', style=Descriptio…

Uploading large files may take a while. Please be patient.


In [3]:
# display the first n rows of the uploaded text
n = 5

ka.text_df.head(n)

Unnamed: 0,text_name,text,source,text_id
0,text1,"Facebook and Instagram, which Facebook owns, f...",corpus1,b587a07f85
1,text2,(CBC News)\nRepublican lawmakers and previous ...,corpus1,0d03616d2f
2,text3,Federated States of Micronesia President David...,corpus2,286e117603
3,text4,Chinese state media has launched its strongest...,corpus2,f3a2479a87


## 3. Calculate word statistics
Once your texts have been uploaded, you can begin to calculate the statistics for the words in the corpus. 

<div class="alert alert-block alert-danger">
<b>Memory limitation in Binder</b> 
    
The free Binder deployment is only guaranteed a maximum of 2GB of memory. Processing very large text files may cause the session (kernel) to restart due to insufficient memory. Check [the user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for more info. 
</div>

In [4]:
# begin the process of calculating word statistics
ka.calculate_word_statistics()

Step 1/3: 100%|██████████| 2/2 [00:00<00:00, 889.28it/s]
Step 2/3: 100%|██████████| 2/2 [00:00<00:00, 22.61it/s]
Step 3/3: 100%|██████████| 2/2 [00:00<00:00, 166.78it/s]
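
To give a sense of the per-corpus counts that keyword statistics are built on, here is a minimal sketch that counts word frequencies in a list of texts and normalises a word's count by the corpus size. The regular-expression tokeniser and the function names (`count_words`, `normalised_count`) are assumptions made for illustration. They are not the tokeniser or the internal functions of the KeywordAnalysis tool.

```python
import re
from collections import Counter


def count_words(texts):
    """Count lower-cased word tokens across all texts in one corpus."""
    counts = Counter()
    for text in texts:
        # Simple illustrative tokeniser: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def normalised_count(word, counts):
    """Word frequency divided by the total number of tokens in the corpus."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Hypothetical mini-corpus for illustration only.
corpus1 = ["The cat sat.", "The dog"]
counts1 = count_words(corpus1)

print(counts1["the"])
print(normalised_count("the", counts1))
```

Normalising by corpus size makes counts comparable between corpora of different lengths.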


## 4. Analyse word statistics
Once the tool has finished calculating the statistics, you can begin to analyse the outcome.  

<div class="alert alert-block alert-warning">
<b>Pairwise analysis</b> 
    
Below, you can analyse the statistics for a pair of sources (one corpus vs the rest of the corpora) and see the statistics for all words in the corpus. Use the tool below to select which corpus to include in the graph and which statistics to show, e.g., normalised word count, log-likelihood, Bayes factor BIC, effect size for log-likelihood (ELL), relative risk, log ratio and odds ratio. 
    
**Note:** The graph only shows 40 words at a time. However, you can use the selection slider to select the words you wish to display on the chart.
</div>
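
For readers who want to see how the statistics listed above relate to raw counts, here is a minimal sketch for a single word in two corpora. It uses the commonly published formulas for these measures, and it is an assumption that the tool computes them in exactly this way. The function name `pairwise_stats` and its arguments are hypothetical: `a` and `b` are the word's frequencies in corpus 1 and corpus 2, and `c` and `d` are the total word counts of the two corpora. Ratio-based measures are left as `None` when either frequency is zero.

```python
import math


def pairwise_stats(a, b, c, d):
    """Keyword statistics for one word: a, b = frequencies; c, d = corpus sizes."""
    n = c + d
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / n
    e2 = d * (a + b) / n
    # Log-likelihood (terms with a zero observed frequency contribute nothing).
    ll = 2 * sum(o * math.log(o / e) for o, e in ((a, e1), (b, e2)) if o > 0)

    stats = {
        "log_likelihood": ll,
        # Bayes factor (BIC approximation) with one degree of freedom.
        "bayes_factor_bic": ll - math.log(n),
        "ell": None,
        "relative_risk": None,
        "log_ratio": None,
        "odds_ratio": None,
    }

    # Effect size for log-likelihood; undefined when log(min expected) is 0.
    min_expected = min(e1, e2)
    if min_expected > 0 and min_expected != 1:
        stats["ell"] = ll / (n * math.log(min_expected))

    # Ratio-based measures need non-zero frequencies in both corpora.
    if a > 0 and b > 0:
        norm1, norm2 = a / c, b / d
        stats["relative_risk"] = norm1 / norm2
        stats["log_ratio"] = math.log2(norm1 / norm2)
        if c > a and d > b:
            stats["odds_ratio"] = (a / (c - a)) / (b / (d - b))
    return stats


# Hypothetical counts: 100 of 1,000 words in corpus 1 and 50 of 1,000 in corpus 2.
example = pairwise_stats(100, 50, 1000, 1000)
for name, value in example.items():
    print(name, value)
```

In this example the word is twice as frequent in corpus 1 as in corpus 2, so the relative risk is about 2 and the log ratio is about 1.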

In [5]:
# generate pair-wise corpus analysis
ka.analyse_stats()

VBox(children=(HBox(children=(VBox(children=(HTML(value='<b>Select corpus:</b>', placeholder=''), SelectMultip…

You can also run the code below to save the pairwise analysis to an Excel spreadsheet and download it to your local computer.

In [None]:
# specify the saving parameters
df = ka.pairwise_compare
output_dir = './output/'
file_name = 'pairwise_analysis.xlsx'
sheet_name = 'pairwise-analysis'

# select the number of rows to display
display_n = 5

# save and display the first n rows
ka.save_analysis(df, output_dir, file_name, sheet_name, display_n)

<div class="alert alert-block alert-warning">
<b>Multi-corpora analysis</b> 
    
Below, you can analyse statistics across multiple corpora at the same time and see the statistics for all words in the corpora. As with the pairwise analysis, you can use the tool below to select which statistics to show, and the selection slider to select the words you wish to display on the chart.
</div>
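
When more than two corpora are compared at once, the log-likelihood statistic is usually generalised by summing over all corpora. The sketch below shows that generalisation for a single word. Whether the tool uses exactly this form is an assumption, and the function name `multi_corpus_log_likelihood` is hypothetical.

```python
import math


def multi_corpus_log_likelihood(freqs, sizes):
    """Log-likelihood for one word across several corpora.

    freqs: the word's frequency in each corpus.
    sizes: the total word count of each corpus, in the same order.
    """
    total_freq = sum(freqs)
    total_size = sum(sizes)
    ll = 0.0
    for observed, size in zip(freqs, sizes):
        # Expected frequency if the word were equally likely in every corpus.
        expected = size * total_freq / total_size
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts for one word in three corpora of 1,000 words each.
print(multi_corpus_log_likelihood([100, 50, 20], [1000, 1000, 1000]))
```

With two corpora this reduces to the pairwise log-likelihood, and it is zero when the word has the same relative frequency in every corpus.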

In [None]:
# generate multi-corpus analysis
ka.analyse_stats(multi=True)

Finally, you can run the code below to save the multi-corpora analysis to an Excel spreadsheet and download it to your local computer.

In [None]:
# specify the saving parameters
df = ka.multicorp_comparison
output_dir = './output/'
file_name = 'multi_corpus_analysis.xlsx'
sheet_name = 'multi-corpus-analysis'

# select the number of rows to display
display_n = 5

# save and display the first n rows
ka.save_analysis(df, output_dir, file_name, sheet_name, display_n)