# Keywords Analysis

In this notebook, you will use the KeywordsAnalysis tool to analyse words in a collection of texts (a corpus) and identify whether certain words are over- or under-represented in a particular corpus (the study corpus) compared to their frequency in another corpus (the reference corpus).  

**Note:** the statistical calculations used in this tool (Log Likelihood, %Diff, Bayes Factor, Effect Size for Log Likelihood, Relative Risk, Log Ratio, Odds Ratio) are a Python implementation of the statistical calculations on this [website](https://ucrel.lancs.ac.uk/llwizard.html), where they are explained with relevant attribution and links.

<div class="alert alert-block alert-warning">
<b>User guide to using a Jupyter Notebook</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

## 1. Setup
Before you begin, you need to import the KeywordsAnalysis package and the necessary libraries, and initialise them to run in this notebook.

In [None]:
# import the KeywordsAnalysis tool
print('Loading KeywordsAnalysis...')
from keywords_analysis import KeywordsAnalysis, DownloadFileLink

# initialize the KeywordsAnalysis
ka = KeywordsAnalysis()
print('Finished loading.')

## 2. Load the data
This notebook allows you to upload text data as one or more text/corpus files. You upload each file/corpus in turn and then compare them. For instance, you could identify keywords in four different corpora that you have uploaded one after the other as separate zip files. Alternatively, you could upload all your corpora at once by specifying the source/corpus name in an Excel spreadsheet (see the example below).  

<table style='margin-left: 10px'><tr>
<td> <img src='./img/txt_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/xlsx_icon.png' style='width: 55px'/> </td>
<td> <img src='./img/csv_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/zip_icon.png' style='width: 45px'/> </td>
</tr></table>  

<table style='margin-left: 10px'><tr>
<td> <img src='./img/excel_sample.png' style='width: 550px'/> </td>
</tr></table>  

<div class="alert alert-block alert-warning">
<b>Uploading word frequency list</b> 
    
There may be times when you only have the word frequencies and no access to the actual corpus. In this case, you can store the word frequencies in an Excel spreadsheet (the first column should contain the words and the second column the word frequencies; see the example below) and upload it here. Please make sure to give a corpus name for each uploaded spreadsheet and tick the 'Uploading word frequency list' box.  
</div>

<table style='margin-left: 10px'><tr>
<td> <img src='./img/word_freq.png' style='width: 300px'/> </td>
</tr></table>  

<div class="alert alert-block alert-danger">
<b>Tokens in the word frequency list</b>    

This tool uses scikit-learn's CountVectorizer to tokenize the texts. A token is identified as one or more alphanumeric characters in the texts; punctuation is ignored and treated as a token separator, e.g., "high-school" will be tokenized as two tokens, "high" and "school". We suggest following the same token format when uploading your own word frequency list. For more information about the CountVectorizer, please visit [this page](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
</div>
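The token rule described above can be approximated with a regular expression. The sketch below is only an illustration built on Python's `re` module, not the tool's own code: the pattern `\b\w+\b` and the helper names `tokenize` and `word_frequencies` are assumptions chosen to match the description. Note that scikit-learn's default `token_pattern` only keeps tokens of two or more characters, and that `\w` also matches underscores.

```python
import re
from collections import Counter

# Assumed pattern: one or more word characters; punctuation acts as a separator.
TOKEN_PATTERN = re.compile(r"\b\w+\b")

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def word_frequencies(texts):
    """Count how often each token occurs across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

# "high-school" is split into two tokens because the hyphen is a separator.
print(tokenize("High-school students"))
print(word_frequencies(["High-school students", "school teachers"]))
```

A word frequency list prepared along these lines (one row per token, with its count) should be broadly comparable with the counts the tool produces from raw texts.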

In [None]:
# upload the text files and/or excel spreadsheets onto the system
ka.upload_file_widget()

<div class="alert alert-block alert-danger">
<b>Large file upload</b> 
    
If you have ongoing issues with the file upload, please re-launch the notebook. If the issue persists, consider restarting your computer.
</div>

## 3. Calculate word statistics
Once your texts have been uploaded, you can begin to calculate the statistics for the words in the corpus. 

<div class="alert alert-block alert-info">
<b>Tools:</b>    

- scikit-learn's CountVectorizer: used to tokenize the texts.  

<b>Note:</b> a token is identified as one or more alphanumeric characters in the texts. Here, punctuation is completely ignored and always treated as a token separator. For further information about the CountVectorizer, please visit [this page](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
</div>

<div class="alert alert-block alert-danger">
<b>Memory limitation in Binder</b> 
    
The free Binder deployment is only guaranteed a maximum of 2GB of memory. Processing very large text files may cause the session (kernel) to restart due to insufficient memory. Check [the user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for more information. 
</div>

In [None]:
# begin the process of calculating word statistics
ka.calculate_word_statistics()

## 4. Analyse word statistics
Once the tool has finished calculating the statistics, you can begin to analyse the outcome.  

<div class="alert alert-block alert-warning">
<b>Pairwise analysis</b> 
    
Below, you can analyse statistics between pairs of datasets (study corpus vs reference corpus) and see the statistics for the words in the corpus. When you have more than two datasets to compare (e.g., corpus 1, corpus 2 and corpus 3), you can either compare one corpus to another (e.g., study corpus: corpus 1 vs reference corpus: corpus 2) or compare a corpus with the rest of the data (e.g., study corpus: corpus 2 vs reference corpus: rest of corpus, which includes corpus 1 and 3). You can then use the tool below to display one or more statistics, e.g., normalised word count, log-likelihood, percentage difference, Bayes factor BIC, effect size for log-likelihood (ELL), relative risk, log ratio and/or odds ratio.  

By default, the graph displays the first 30 words in the corpus (the x-axis) and the selected statistic value(s) for each word (the y-axis), sorted in alphabetical order. However, you can use the 'Select index' widget to display other words in the corpus (move the index up/down by 10 using the up/down arrow, or enter your own index number and press 'Tab' to select any index number).  
    
You can also use the 'Sorted by' drop-down menu to sort the words based on the statistic (from the highest to the lowest) if you wish. If you want to save the graph and download it to your local computer, you can use the 'save' icon on the right-hand side of the graph.    

Lastly, you can save the data to an Excel spreadsheet and download it to your local computer by pressing the 'Save data to excel' button.
    
<b>Notes:</b> 
- You can select multiple statistics by holding down the Ctrl key and left-clicking the options you want in the list. 
- Press the 'Display chart' button to display a new graph based on the selected corpora and statistic(s) and reset the index to zero (0).
</div>
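The "rest of corpus" option described above amounts to pooling the word counts of every corpus other than the study corpus. The following is a minimal sketch of that idea; the function name `rest_of_corpus` and the example counts are hypothetical and are not taken from the tool.

```python
from collections import Counter

def rest_of_corpus(counts_by_corpus, study_name):
    """Sum the word counts of every corpus except the study corpus."""
    reference = Counter()
    for name, counts in counts_by_corpus.items():
        if name != study_name:
            reference.update(counts)  # adds the counts word by word
    return reference

# Hypothetical word counts for three small corpora.
counts_by_corpus = {
    "corpus 1": Counter({"school": 4, "teacher": 1}),
    "corpus 2": Counter({"school": 2, "student": 3}),
    "corpus 3": Counter({"teacher": 5, "student": 1}),
}

# Study corpus: corpus 2; reference corpus: corpus 1 and corpus 3 combined.
reference = rest_of_corpus(counts_by_corpus, "corpus 2")
print(dict(reference))
```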

In [None]:
# generate pair-wise corpus analysis
ka.analyse_stats(right_padding=0.9) # adjust the 'right_padding' to move the legend box left/right

<div class="alert alert-block alert-warning">
<b>What information is included in the above chart?</b> 

**normalised_wc/normalised_reference_corpus_wc:** the normalised count of the word in the study corpus vs the reference corpus. Here, the normalised word count is calculated by dividing the number of occurrences of the word in a corpus by the total number of words in that corpus.
    
**log-likelihood$^{1}$:** the log-likelihood that a word is statistically different in the study corpus vs the reference corpus. 
    
**percent-diff$^{2}$:** the percentage difference between the use of a word in the study corpus vs the reference corpus. 
    
**bayes factor BIC$^{3}$:** the degree of evidence that a word is statistically different in the study corpus vs the reference corpus.  
    
**effect size for log-likelihood (ELL)$^{4}$:** the relative frequency of the log-likelihood of a particular word in the study corpus vs the reference corpus.  
    
**relative risk$^{5}$:** the relative frequency (how many times more frequent) of a particular word in the study corpus vs the reference corpus.
    
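**log ratio$^{2}$:** the binary logarithm of the ratio of the word's relative frequencies in the study corpus vs the reference corpus, so each additional point corresponds to a doubling (2^n) of the relative frequency.  
    
**odds ratio$^{5}$:** the odds that a particular word is used in the study corpus relative to the odds that it is used in the reference corpus.  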
    
**Notes:**  
$^{1}$Large value indicates that the use of the word is statistically different in the study corpus vs the reference corpus.  
$^{2}$Positive value indicates overuse of that word in the study corpus vs the reference corpus, and vice versa.  
$^{3}$Large positive value indicates higher degree of evidence that a word is statistically different in the study corpus vs the reference corpus.  
$^{4}$Value closer to 0 (e.g., 0.0001) indicates lower degree of evidence that a word is statistically different in the study corpus vs the reference corpus.  
$^{5}$Large value (above 1) indicates overuse of the word in the study corpus vs the reference corpus, and a value below 1 indicates underuse.  

For more information on the above statistics, please visit this [website](https://ucrel.lancs.ac.uk/llwizard.html).  
</div>
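To make the definitions above concrete, here is a small sketch that computes the pairwise statistics for a single word from four numbers. It follows the formulas as they are commonly given for the log-likelihood wizard linked above; it is not the tool's own code, the function name `pairwise_stats` and the example figures are hypothetical, and it assumes the word occurs at least once in both corpora (how the tool treats zero frequencies is not shown here).

```python
import math

def pairwise_stats(a, b, c, d):
    """Keyness statistics for one word.

    a, b: frequency of the word in the study / reference corpus
    c, d: total number of words in the study / reference corpus
    """
    n = c + d
    e1 = c * (a + b) / n  # expected frequency in the study corpus
    e2 = d * (a + b) / n  # expected frequency in the reference corpus
    # A zero observed frequency contributes nothing to the log-likelihood sum.
    ll = 2 * sum(o * math.log(o / e) for o, e in ((a, e1), (b, e2)) if o > 0)
    norm1, norm2 = a / c, b / d  # normalised word counts
    return {
        "normalised_wc": norm1,
        "normalised_reference_corpus_wc": norm2,
        "log-likelihood": ll,
        "percent-diff": (norm1 - norm2) * 100 / norm2,
        "bayes factor BIC": ll - math.log(n),  # one degree of freedom
        "ELL": ll / (n * math.log(min(e1, e2))),
        "relative risk": norm1 / norm2,
        "log ratio": math.log2(norm1 / norm2),
        "odds ratio": (a / (c - a)) / (b / (d - b)),
    }

# Hypothetical example: 100 hits in 10,000 words vs 50 hits in 10,000 words.
stats = pairwise_stats(a=100, b=50, c=10_000, d=10_000)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

In this example the word is twice as frequent in the study corpus, so the relative risk is 2, the log ratio is 1 and the percentage difference is 100.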

<div class="alert alert-block alert-warning">
<b>Multi-corpora analysis</b> 
    
Below, you can analyse the overall statistics at the multi-corpora level, for cases where you explore more than two corpora. This option is only available for some of the statistics, because the other statistics are only applicable to pairwise comparisons.  
 
Similar to the above, by default, the graph displays the first 30 words in the corpus (the x-axis) and the selected statistic value(s) for each word (the y-axis), sorted in alphabetical order. You can use the 'Select index' widget to display other words in the corpus, use the 'Sorted by' drop-down menu to sort the words based on the statistic values, or save the graph using the 'save' icon on the right-hand side of the graph.  
    
Lastly, you can save the analysis to an Excel spreadsheet and download it to your local computer by pressing the 'Save data to excel' button.
    
<b>Notes:</b> 
- You can select multiple statistics by holding down the Ctrl key and left-clicking the options you want in the list. 
- Press the 'Display chart' button to display a new graph based on the selected statistic(s) and reset the index to zero (0).
</div>

In [None]:
# generate multi-corpus analysis
ka.analyse_stats(right_padding=0.5, multi=True)

<div class="alert alert-block alert-warning">
<b>What information is included in the above chart?</b> 

**log-likelihood$^{1}$:** the log-likelihood that the use of a word is statistically different across the corpora.  
    
**bayes factor BIC$^{2}$:** the degree of evidence that the use of a word is statistically different across the corpora.  
    
**effect size for log-likelihood (ELL)$^{3}$:** the effect size of the log-likelihood for a particular word across the corpora.  
    
**Notes:**  
$^{1}$Large value indicates that the use of the word is statistically different across the corpora.  
$^{2}$Large positive value indicates higher degree of evidence that the use of the word is statistically different across the corpora.  
$^{3}$Value closer to 0 (e.g., 0.0001) indicates lower degree of evidence that the use of the word is statistically different across the corpora.  
For more information on the above statistics, please visit this [website](https://ucrel.lancs.ac.uk/llwizard.html). 
</div>
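The three multi-corpora statistics can be sketched in the same way as the pairwise ones, by extending the log-likelihood sum to every corpus. This is an illustration under the same assumptions as the formulas on the linked website, not the tool's own code; `multi_corpus_stats` and the example figures are hypothetical.

```python
import math

def multi_corpus_stats(freqs, sizes):
    """Log-likelihood, Bayes factor (BIC) and ELL for one word across corpora.

    freqs: frequency of the word in each corpus
    sizes: total number of words in each corpus
    """
    n = sum(sizes)
    total = sum(freqs)
    # Expected frequency in each corpus if the word were used evenly.
    expected = [size * total / n for size in sizes]
    ll = 2 * sum(o * math.log(o / e) for o, e in zip(freqs, expected) if o > 0)
    dof = len(freqs) - 1  # degrees of freedom: number of corpora minus one
    return {
        "log-likelihood": ll,
        "bayes factor BIC": ll - dof * math.log(n),
        "ELL": ll / (n * math.log(min(expected))),
    }

# Hypothetical example: one word counted in three corpora of equal size.
multi = multi_corpus_stats(freqs=[100, 50, 50], sizes=[10_000, 10_000, 10_000])
print(multi)
```

With only two corpora, this reduces to the pairwise log-likelihood.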

## 5. Welch t-test and Fisher permutation test
In this section, you can use a statistical test to investigate whether the use of a certain word in one corpus is statistically different from the use of that same word in a different corpus. All you need to do is enter the word you wish to analyse, select the two corpora you wish to compare, apply a data transformation if needed (optional) and select the statistical test to perform using the tool below.

<div class="alert alert-block alert-info">
<b>Tools:</b>    
    
- scipy: collection of math algorithms and functions built on the NumPy extension of Python
- nltk: natural language processing toolkit
</div>

<div class="alert alert-block alert-warning">
<b>Welch t-test</b> 

The Welch t-test is used to test whether two populations have equal means, without assuming that they have equal variances. In this context, the Welch t-test is used to test whether the mean (average) frequency of a word in one corpus is the same as the mean frequency of that word in a different corpus. If the test indicates that the two means differ (for example, the p-value is below your chosen significance level), the difference can be said to be statistically significant.     
    
**Note:** for more information about the Welch t-test, please visit this [website](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#r3566833beaa2-2).
</div>
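As a rough illustration of what the Welch t-test computes, the sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom by hand using only the standard library. The function name `welch_t` and the per-text counts are hypothetical. In practice, `scipy.stats.ttest_ind(sample1, sample2, equal_var=False)` performs the Welch test and also returns the p-value, which this sketch does not compute.

```python
import math
from statistics import mean, variance

def welch_t(sample1, sample2):
    """Return the Welch t statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1  # squared standard error of the first mean
    v2 = variance(sample2) / n2  # squared standard error of the second mean
    t = (mean(sample1) - mean(sample2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for unequal variances.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical counts of one word in each text of two corpora.
corpus_a = [1, 2, 3, 4, 5]
corpus_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(corpus_a, corpus_b)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof:.4f}")
```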

<div class="alert alert-block alert-warning">
<b>Fisher permutation test</b> 

The Fisher permutation test is used to test if all observations in the data are sampled from the same distribution. In this context, the Fisher permutation test will be used to test if the frequencies of a word in a corpus and the frequencies of that word in another corpus are the same. If not, and the difference is significant, then it can be said that the use of that word in one corpus is statistically different to that in the other corpus.          
    
**Note:** for more information about Fisher permutation test, please visit this [website](https://docs.scipy.org/doc//scipy/reference/generated/scipy.stats.permutation_test.html).
</div>
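The idea behind a permutation test can be sketched in a few lines: pool the observations, reshuffle them many times, and count how often a random split gives a difference at least as large as the one observed. The sketch below assumes the absolute difference in means as the test statistic and uses random resampling; the function name `permutation_test` and the sample data are hypothetical, and the tool itself may use a different statistic or settings (see `scipy.stats.permutation_test`, linked above).

```python
import random
from statistics import mean

def permutation_test(sample1, sample2, n_resamples=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(sample1) - mean(sample2))
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n1]) - mean(pooled[n1:]))
        # Small tolerance so ties are not lost to floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical counts of one word in each text of two corpora.
corpus_a = [10, 11, 12, 13, 14, 15]
corpus_b = [0, 1, 2, 3, 4, 5]
p_value = permutation_test(corpus_a, corpus_b)
print(f"p-value = {p_value:.4f}")
```

A small p-value suggests that the two sets of per-text frequencies are unlikely to come from the same distribution.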

In [None]:
ka.word_usage_analysis()

<div class="alert alert-block alert-warning">
<b>Data transformation</b> 

Statistical tests often assume that data are normally distributed (bell-shaped distribution). However, real-world data can be messy and are often not normally distributed. Whilst it is not always possible to achieve this, you can try to transform your data to more closely match a normal distribution. In the above tool, you have the option to apply (1) a log transformation, or (2) a square root transformation to your data if you wish.  
</div>
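The two transformations mentioned above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the log transformation is written as log(x + 1) on the assumption that zero counts need to stay defined; the tool may treat zeros differently.

```python
import math

def log_transform(values):
    """Apply log(x + 1) so that zero counts map to zero (an assumption)."""
    return [math.log1p(v) for v in values]

def sqrt_transform(values):
    """Apply a square root transformation to non-negative counts."""
    return [math.sqrt(v) for v in values]

# Hypothetical right-skewed counts of a word in each text of a corpus.
counts = [0, 1, 1, 2, 4, 9, 25]
print(log_transform(counts))
print(sqrt_transform(counts))
```

Both transformations compress large values more than small ones, which tends to reduce the right skew typical of word counts.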

<div class="alert alert-block alert-danger">
<b>Word frequency list</b> 
    
You cannot perform these statistical tests if you only upload a word frequency list, because the analyses are conducted on the number of words in each text within the corpus. Please upload the actual text files to use this section. 
</div>