# Text Analysis on Cultural Indicators

The factors that drive the advancement of society, along with their effects and consequences, are now studied in depth. The **volume of research** is such that it becomes **increasingly difficult** to **assimilate all this information** and extract **useful insights from it**.

These investigations are usually oriented to **specific topics** and our goal is to **understand the relevance** of these subjects in these papers. In order to do this we have a set of **1082 indicators**, divided between **819 general indicators** (covering a wide range of affairs, from *CO2 production* to *GDP growth*) and **263 cultural indicators**. All of them together are assumed to represent the most important topics that are extensively covered by these studies.

However, language is not *objective*, and these indicators have been **arbitrarily** selected (with good reasons, but arbitrarily nevertheless). This method can therefore lead to errors and misunderstandings due to the *subjective* nature of language itself. This raises the following questions:

* To what extent can we trust these **general** and **cultural** indicators to summarize and understand investigations?
* Are **cultural indicators more specific** than their general counterparts? If so, to what degree?

In this analysis we will try to answer these questions and apply the results to real *papers* and *reports* such as the **Paris Agreement** or the **Agenda 2030**. We will also match general and cultural indicators to understand their connections and redundancy, and so validate this method.

This process is split into **4 notebooks**, listed below, each linking to its own documentation. This first document (*index*) is the **connecting link** between them: its purpose is to convey the **big picture** of the project, while the other 4 documents are more **specific and technical** and require some extra effort to understand what is happening under the hood.

## [Indicators Cleaning](./Clean%20Indicators.html)

### Introduction

Looking at the given indicators, one can see they are *far* from perfect (some of them include *specific dates*, *non-English words* or an *unknown encoding*). This can be quite *detrimental* to the quality of the model, which might fail to **recognise patterns** or become **biased** towards specific dates instead of inferring the intent of the indicator itself.

### Process

We will mainly apply the following processes:

* Remove **wrong characters**

* **Translate non-English** indicators

* **Drop duplicates** from both types
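
As a rough illustration of the first and last steps above, here is a minimal sketch using only the Python standard library. The helper names `clean_indicator` and `drop_duplicates`, the sample rows, and the choice to compare duplicates case-insensitively within each type are assumptions made for this example, not the notebook's actual code. Translation is left out because it depends on an external service.

```python
import re
import unicodedata


def clean_indicator(text: str) -> str:
    """Normalise one indicator string (hypothetical helper)."""
    # Fold compatibility characters such as non-breaking spaces.
    text = unicodedata.normalize("NFKC", text)
    # Drop the Unicode replacement character left behind by a failed decode.
    text = text.replace("\ufffd", "")
    # Collapse every run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    # Remove any remaining non-printable characters (control codes, etc.).
    text = "".join(ch for ch in text if ch.isprintable())
    return text.strip()


def drop_duplicates(rows):
    """Keep the first occurrence of each (indicator, type) pair."""
    seen = set()
    unique = []
    for indicator, kind in rows:
        key = (indicator.casefold(), kind)  # assumption: case-insensitive match
        if key not in seen:
            seen.add(key)
            unique.append((indicator, kind))
    return unique


# Toy rows, invented for the example.
rows = [
    ("GDP  growth\u00a0(annual %)", "general"),
    ("gdp growth (annual %)", "general"),
    ("Museum visits\ufffd per capita", "cultural"),
]
cleaned = drop_duplicates([(clean_indicator(t), k) for t, k in rows])
for indicator, kind in cleaned:
    print(kind, "|", indicator)
```

Normalising before deduplicating matters here: the first two rows only become recognisable as the same indicator once spacing and case differences are removed.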

### Output 

The result of this notebook is a clean **two-column CSV file** containing the cleaned indicators and their type, which is used in the next step.

## [Model Selection, Fine Tuning & Indicators Matching](./Model%20Selection%20and%20Fine%20Tuning.html)

### Introduction

Now that our indicators are clean we need to **analyse them** properly. To do that we have to **find a model** that fits our objective and *fine-tune* it to obtain the most accurate results when matching the **cultural indicators** with the **general ones**.

### Process

This notebook is the most **complex** and **extensive** one and covers a broad variety of topics:

* Model *Selection*

* Model *Exploration / Validation* (**K-means**)

* *Fine-Tuning* (**PCA**)

* **Error Analysis**

* **Indicators Matching**

### Output 

This notebook exports two **CSV files**. The first (*indicator_matches.csv*) contains each indicator we want to match together with the **top 5 similar indicators** from the other type. The second contains the **encoded matrix** of the clean indicators, to be used directly in the next notebook; the *model* and the *PCA* are initialized again there.
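
To make the "top 5 similar indicators" idea concrete, the sketch below ranks candidate indicators by cosine similarity between embedding vectors. The similarity measure is an assumption (this index does not say which one the notebook uses), and the three-dimensional vectors and the names `cosine` and `top_matches` are invented for the example; real embeddings would come from the selected model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_matches(query, candidates, k=5):
    """Return the k candidate names most similar to the query vector."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings of general indicators (made-up numbers).
general = {
    "GDP growth": [0.9, 0.1, 0.0],
    "CO2 production": [0.0, 0.2, 0.9],
    "School enrolment": [0.3, 0.9, 0.1],
}
cultural_vec = [0.2, 0.8, 0.1]  # toy embedding of one cultural indicator

matches = top_matches(cultural_vec, general, k=2)
for name, score in matches:
    print(f"{name}: {score:.3f}")
```

Running the same function with the roles swapped (a general indicator as the query, cultural indicators as candidates) would give the matches in the other direction.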

## [Model Application](./Model%20Application.html)

### Introduction

Our model is now *fine-tuned* and ready to work, but transformers (*word embeddings*) have a drawback we have not mentioned so far: **transformer** performance decreases with the length of the text passed through the model. This is not critical, but we try to mitigate it in two ways: first by **reducing the number of words** in each text through *stopword removal*, and then by applying the model not once to the whole text but to every **subsentence**, defined as the words between consecutive '**.**' and '**,**' characters. With this we aim to **reduce the vagueness** of the model on long texts.
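
A minimal sketch of this preprocessing, under a few assumptions: the stopword list is a tiny illustrative set (a real run would use a fuller list), the helper names are hypothetical, and the text is split before stopwords are removed so that punctuation does not stick to the words.

```python
import re

# Tiny illustrative stopword list, not the one used in the notebook.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "we", "our", "that"}


def split_subsentences(text):
    """Split a text on '.' and ',' and drop empty fragments."""
    return [part.strip() for part in re.split(r"[.,]", text) if part.strip()]


def remove_stopwords(fragment):
    """Drop stopwords from a fragment, ignoring case."""
    return " ".join(w for w in fragment.split() if w.lower() not in STOPWORDS)


text = "We recognise the importance of culture, and we commit to the protection of heritage."
subsentences = [remove_stopwords(s) for s in split_subsentences(text)]
subsentences = [s for s in subsentences if s]  # discard fragments that became empty
print(subsentences)
```

Note that splitting on every '.' also breaks decimal numbers such as "1.5" apart, so a production version would likely need a more careful rule.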

### Process

This notebook is the most **code-intensive** one, but it should not be difficult to understand, as very few things are happening:

* Remove **Stopwords**

* **Apply the model** to a vector of sentences (*Hadamard Product*)

* Create a **Soft Voting Classifier** for the model
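
One common reading of soft voting is to average the scores that each voter assigns to every class and pick the classes with the highest mean. The sketch below applies that idea to subsentences, assuming each subsentence yields a similarity score per indicator. The function name `soft_vote`, the score dictionaries and the plain (unweighted) average are assumptions for illustration; the notebook may combine the scores differently.

```python
def soft_vote(score_rows, k=5):
    """Average per-indicator scores over all subsentences and return the k best."""
    if not score_rows:
        return []
    totals = {}
    for row in score_rows:
        for indicator, score in row.items():
            totals[indicator] = totals.get(indicator, 0.0) + score
    # Indicators missing from a row implicitly contribute a score of 0.
    averaged = {name: total / len(score_rows) for name, total in totals.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)[:k]


# One dictionary of toy scores per subsentence.
scores = [
    {"Cultural heritage": 0.75, "GDP growth": 0.25},
    {"Cultural heritage": 0.5, "GDP growth": 0.5},
]
ranking = soft_vote(scores, k=1)
print(ranking)
```

Averaging lets a long text be summarised by the indicators that score consistently well across its parts, rather than by a single fragment with one very high score.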

### Output

A **Python file** (*module*) containing the necessary functions to work with the model at any time. This will be used in the last step.

## [PDF Reader & Model Application](./PDF%20Reader.html)

### Introduction

Now that our model is ready to be applied to large texts and not only short sentences, it is time to develop a **PDF file reader** to extract the **relevant information** from the desired files. We then **apply the model** to the [Agenda 2030](../Reports/Agenda%202030.pdf) to check whether it can summarise our documents properly.

### Process

This last step is short and does only the following:

* **PDF Reader** application

* **Model application** to the extracted text

### Output

As this is the final step, it exports a **CSV file** with one sentence from the text per row, together with its **top 5 indicators** (*general* and *cultural*).
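
The shape of such a file can be sketched with the standard `csv` module. The column names (`sentence`, `match_1`, ...), the helper `write_matches` and the sample rows are hypothetical; in particular, this index does not say whether general and cultural matches are stored in separate columns, so the example keeps a single list of matches per sentence.

```python
import csv
import io


def write_matches(rows, handle, k=5):
    """Write one row per sentence followed by its first k matched indicators."""
    writer = csv.writer(handle)
    writer.writerow(["sentence"] + [f"match_{i}" for i in range(1, k + 1)])
    for sentence, matches in rows:
        top = list(matches[:k])
        top += [""] * (k - len(top))  # pad when fewer than k matches exist
        writer.writerow([sentence] + top)


# Invented sentences and matches, written to memory instead of a real file.
rows = [
    ("We commit to protecting heritage, tangible and intangible.",
     ["Cultural heritage", "Museum visits"]),
    ("Growth must be sustainable.", ["GDP growth"]),
]
buffer = io.StringIO()
write_matches(rows, buffer, k=2)

# Reading the text back shows that sentences containing commas survive intact.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed)
```

Using `csv.writer` rather than joining strings by hand matters here because the sentences themselves contain commas, which the writer quotes automatically.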

You can take a look at all the matches for the **Agenda 2030** [here](https://docs.google.com/spreadsheets/d/1AUyOkvd8HSA-VE2eaxsC2kTp_KzXS0bA48KQcQ-ntwY/edit?usp=sharing).

## Notebook Exporting

#!jupyter nbconvert "Index" --to html_toc --TemplateExporter.exclude_input=True --TagRemovePreprocessor.remove_cell_tags="hide"