# Text Summarization 

This notebook shows summaries for specific subthemes and agreement levels. The summaries help explain the different agreement-level groupings and identify potential areas for improvement in the WES design. Summaries can be generated for all of the text, for a specific theme or subtheme, and for various agreement levels.

### Instructions for use

This notebook can be used to create summaries for text. You can select which subtheme and agreement level you want to look at. Two different algorithms are provided for generating a summary.


**Option 1: PageRank - cosine similarity**
This is an implementation adapted from [Prateek Joshi's blog](https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/), and the code is in the text summary script. This method is slower than Option 2, in part because the pre-trained embeddings have to be loaded. To reduce the run time, you can keep the embeddings loaded during the first run and pass them to later calls, so subsequent summaries are faster.

The pre-trained embeddings must be downloaded locally. 
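
For intuition, the idea behind this option can be sketched in a few lines: each sentence is represented by a vector, pairwise cosine similarities form a weighted graph, and PageRank scores pick out the most central sentences. The sketch below is illustrative only and is not the code in the text summary script. The function names (`cosine`, `pagerank`, `top_sentences`) are hypothetical, the toy vectors stand in for averaged pre-trained embeddings, and the similarities are assumed to be non-negative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on a symmetric, non-negative weight matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional to the edge weight
            incoming = sum(
                scores[j] * weights[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

def top_sentences(sentences, vectors, k):
    """Return the k highest-ranked sentences, kept in their original order."""
    n = len(sentences)
    weights = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]

# Toy stand-ins for sentence embeddings (hypothetical values)
sentences = ["Pay is too low.", "Salaries lag behind.", "Parking is limited."]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]
summary = top_sentences(sentences, vectors, k=2)
```

Here the two pay-related sentences are strongly connected to each other and only weakly to the third, so they receive the highest scores.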

**Option 2: Variation TextRank - BM25 similarity** 
This method comes from the [Gensim package](https://radimrehurek.com/gensim/summarization/summariser.html). It is a variation on the TextRank algorithm and is much faster than our implementation.
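
To illustrate what "BM25 similarity" means here, the sketch below scores tokenized sentences against one another with a common form of the Okapi BM25 formula. It is a simplified stand-in rather than Gensim's code: the function name `bm25_scores`, the parameter defaults, and the smoothed IDF are assumptions, and Gensim's own implementation may differ in details. Note also that the `gensim.summarization` module is only available in Gensim releases before 4.0.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in term_freq:
                continue
            # Smoothed IDF (the "1 +" keeps it positive even for common terms)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates via k1 and is normalized by document length via b
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

sentences = [
    "the workload is too heavy".split(),
    "heavy workload causes stress".split(),
    "cafeteria food tastes good".split(),
]
# Similarity of the first sentence to every sentence, including itself
scores = bm25_scores(sentences[0], sentences)
```

In a TextRank-style summarizer, scores like these would replace the cosine similarities as edge weights in the sentence graph.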

Both methods give similar summaries with overlapping sentences. For detailed usage examples, read the documentation for `generate_text_summary`.

### Running from the command line

This notebook can also be run from the command line. It prints the summary to the screen and writes it to a CSV file in the `data/processed` folder.


### Info about working directories

This notebook has been set up to run from the project root directory. To switch the working directory, follow the instructions in the cell below.


In [None]:
# This code chunk changes the working directory to the project root

import os
# uncomment and run this line once before proceeding
#os.chdir("..")   # then comment it out again so it is not run twice
os.getcwd()

# the file path displayed is the current working directory
# this should be the project root for the code below to run
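
As an alternative to toggling `os.chdir("..")` by hand, the project root could be located programmatically. The sketch below is only a suggestion and is not part of this project: `find_project_root` is a hypothetical helper, and it assumes the root is the nearest parent directory that contains a `src` folder (the package imported in the cells below).

```python
from pathlib import Path

def find_project_root(start, marker="src"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")

# Example usage (not run here):
# import os
# os.chdir(find_project_root(os.getcwd()))
```

Because the helper searches upwards from the current directory, running it twice has no further effect, unlike a repeated `os.chdir("..")`.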

In [None]:
import pandas as pd
import numpy as np
import time
from src.analysis.text_summary import generate_text_summary

In [None]:
# ensure packages reload after every change 
%load_ext autoreload
%autoreload 2

import src

from src.analysis.text_summary import *
# from src.analysis.emotion_analysis import *
# from src.data.preprocessing_text import *

## Summaries

### Option 1: PageRank

Credit to: Prateek Joshi

https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

In [None]:
start = time.time()
summary1, loaded_embedding = generate_text_summary("./data/interim/joined_qual_quant.csv",  
                                        "./references/data-dictionaries/theme_subtheme_names.csv",
                                        5,
                                        "subtheme",
                                        13,
                                        "weak",
                                        "pre_trained_embedding",
                                        "./references/pretrained_embeddings.nosync/fasttext/crawl-300d-2M.vec",
                                        embedding_return=True)
end = time.time()
print((end - start) / 60, "mins")

In [None]:
start = time.time()
generate_text_summary("./data/interim/joined_qual_quant.csv",  
                                        "./references/data-dictionaries/theme_subtheme_names.csv",
                                        5,
                                        "subtheme",
                                        13,
                                        "weak",
                                        "pre_trained_embedding",
                                        embedding=loaded_embedding)
end = time.time()
print((end - start) / 60, "mins")
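
The first call is slow largely because the embedding file has to be read in full. As a rough illustration of that cost, the sketch below parses the fastText `.vec` text layout: a header line with the vocabulary size and dimension, then one word and its values per line. It is not the loader used by `generate_text_summary`; `load_vec` and its `limit` parameter are hypothetical, and the tiny in-memory sample stands in for the real file.

```python
import io

def load_vec(handle, limit=None):
    """Parse word vectors from an open text file in the fastText .vec layout."""
    # Header line: "<number of words> <dimension>"
    _, dim = (int(x) for x in handle.readline().split())
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        embeddings[word] = [float(x) for x in values]
        if limit is not None and len(embeddings) >= limit:
            break
    return embeddings

# Tiny in-memory sample: 3 words, 2 dimensions (made-up values)
sample = io.StringIO("3 2\nwork 0.1 0.2\nlife 0.3 0.4\nbalance 0.5 0.6\n")
embeddings = load_vec(sample, limit=2)
```

With roughly two million 300-dimensional vectors, this parsing step dominates the run time, which is why reusing the already loaded embeddings in later calls is worthwhile.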

### Option 2: Gensim Package: TextRank

In [None]:
start = time.time()
generate_text_summary("./data/interim/linking_joined_qual_quant.csv",  
                                        200,
                                        "subtheme",
                                        13,
                                        "weak",
                                        "textrank")

end = time.time()
print((end - start) / 60, "mins")