# Chain of Thought for NER with Llama 3.1

## Open-weights LLM (Llama3.1 8B)

Large Language Models (LLMs) have reshaped natural language processing (NLP), offering powerful capabilities in tasks like information extraction (IE) from historical texts. Chat-based generative models change the way we can interact with and analyse our corpora: users can query their data in natural language, which has encouraged a wide adoption of AI tools across text-based tasks. However, concerns about **data privacy** and **access** have arisen due to the dominance of closed-source models from industry giants like OpenAI and Google. To address these issues, there is a growing interest in open-weights alternatives, which provide transparency and control over models and data.

This Jupyter Notebook explores the potential of open-weights LLMs for named entity recognition (NER) and aspect recognition in historical texts. We'll showcase zero- and few-shot learning to overcome **data scarcity**, a pivotal problem in applying IE in literary-historical contexts. We aim to showcase how open-weights LLMs can illuminate the past and shape the future of historical scholarship!


The Notebook showcases the following procedures:



1.  **Chain of Thought with Few-shot NER/aspect extraction.**

    *   With [Llama 3.1 8B](https://ai.meta.com/blog/meta-llama-3-1/) (**multilingual model trained on English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai**). More information on [HuggingFace](https://huggingface.co/meta-llama/Llama-3.1-8B).



We implement the code using the package **[LangChain](https://www.langchain.com/)**, a popular wrapper around both closed and open-weights LLMs. The models run on the [Together AI service](https://www.together.ai/), and you will need an API key to execute this notebook.

# 1.- Required background knowledge 🧠

❗🎓 To adapt and use this Notebook to produce entities for your own texts, you need to have an intuitive understanding of the following concepts:



*   [named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)
*   prompting
    * few-shot modelling
    *   zero-shot modelling
    *   Chain of thought
*   Large Language Models (generative AI)
*   [HuggingFace model hub](https://huggingface.co/)
*   [BIO-labels / span evaluation](https://pypi.org/project/nervaluate/)
*   Evaluation metrics (F1, accuracy, precision, recall)
*   GitHub
*   [LangChain](https://www.langchain.com/)
*   [Together.ai](https://www.together.ai/)


To adapt the code, you need to know about:


* Functions and classes in Python
* Pandas dataframe operations
* Jupyter Notebooks


# 2.- Load packages 📚


First of all, we are going to set up the environment.

In [None]:

#pip install -r requirements
!pip install langchain nervaluate langchain-community langchain-core
!pip install session-info
!pip install --upgrade langchain-together

In [9]:
import pandas as pd
import os
import json
import time
import ast
from pydantic import BaseModel
from pydantic_core import from_json
import glob
from typing import List, Optional
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_together import ChatTogether

In [2]:
import session_info
session_info.show()

# 3.- Set environment ❗


**IMPORTANT STEP**: before you can proceed with the code in this Notebook, you have to request an API key from Together AI. Make an account on the website, and follow their directions to create a key. The key identifies your account, which is how the service controls how many API calls you can make.
Also select your model from the list of available models.

In [3]:
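# Add your API key here, or set the TOGETHER_API_KEY environment variable before starting the notebook
TOGETHER_API_KEY = os.environ.get("TOGETHER_API_KEY")
model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"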

# 4.- Chain of thought with Few-shot NER/aspect extraction


Here, we'll use the framework LangChain to send a request to the open-weights generative LLM to extract aspects from our texts.

The model choice is a model id, which the user can adjust according to their needs.
As an example, we're using the multilingual generative LLM **meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo**.

We construct a structured prompt example for the model to extract entities/aspects from the texts in several categories, which you can fully adapt to your needs and texts.

**CATEGORIES**


---


The entities and categories we will focus on in this notebook are the following:
- **FAUNA**
- **FLORA**
- **PERSON**
- **LOCATION**
- **ORGANISATION**

## 4.1 Validate the output of the LLM

We want our prompt to return our NER results as valid JSON. However, LLMs tend to output incomplete or invalid JSON, or to hallucinate keys that we did not ask for. Luckily, **Pydantic** is a library that helps us handle these issues.

First, we'll construct a **Pydantic Class** to assert which data types we expect from the model output for each entity.

In [4]:
class NER(BaseModel):
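    """
    This class asserts the data types we expect from the output of the LLM.
      person: Optionally a list, otherwise None.
      organisation: Optionally a list, otherwise None.
      location: Optionally a list, otherwise None.
      fauna: Optionally a list, otherwise None.
      flora: Optionally a list, otherwise None.
    """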
    person: Optional[list] = None
    organisation: Optional[list] = None
    location: Optional[list] = None
    fauna: Optional[list] = None
    flora: Optional[list] = None

Let's test this out! We'll try to simulate an incomplete JSON output and feed it to our class.

In [5]:
#Example of validation
partial_json = '{"location": ["Rome"], "person": ["pepe", "Capt. Cook"], "random": ["hallucination"]'

In [10]:
#Example of validation
validator = NER.model_validate(from_json(partial_json,allow_partial=True))

print(repr(validator))
type(validator)

NER(person=['pepe', 'Capt. Cook'], organisation=None, location=['Rome'], fauna=None, flora=None)


__main__.NER

As you can see, the class helps us to parse out the objects which are interesting to our use-case: **hallucinations in the output are ignored** (the unexpected key "random" is dropped), and **the partial JSON-object is validated** automatically!


Now we can easily take the attributes from our validator!

In [11]:
#Example of validation
validator.person

['pepe', 'Capt. Cook']
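
The `NER` class above accepts any list as a value. If you also want to check what is inside the lists, a stricter variant is possible. The following is only a sketch under assumptions: it presumes Pydantic v2, and the names `TypedNER` and `safe_validate` are hypothetical and not part of the notebook.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError
from pydantic_core import from_json


class TypedNER(BaseModel):
    """Hypothetical stricter variant of NER: every category is a list of strings."""
    person: Optional[List[str]] = None
    organisation: Optional[List[str]] = None
    location: Optional[List[str]] = None
    fauna: Optional[List[str]] = None
    flora: Optional[List[str]] = None


def safe_validate(raw_json: str) -> Optional[TypedNER]:
    """Return a validated object, or None when the JSON cannot be used."""
    try:
        return TypedNER.model_validate(from_json(raw_json, allow_partial=True))
    except (ValidationError, ValueError):
        return None


# Same kind of partial JSON as above: the closing brace is missing.
ok = safe_validate('{"location": ["Rome"], "fauna": ["deer"]')
# A plain string where a list is expected fails validation.
bad = safe_validate('{"location": "Rome"}')

print(repr(ok))
print(bad)
```

Whether the stricter check is worth it depends on your model: it rejects more malformed answers, but every rejected answer also means lost entities.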

## 4.2.- START PLAYING: WRITE YOUR SENTENCE HERE
Code for a single sentence. Change the content of the `sentence` variable.

### 4.2.1 Build a prompt

As an experiment, we will feed the LLM several pieces of information that we deem relevant to our use-case.
As with modelling, there is no single correct way to build a prompt; it's **all a matter of experimentation**!

🧠❗ Play around with the question, personality, and template!


In [12]:
sentence = "I was walking in Rome when I saw a beautiful deer and rabbits. I wanted to touch it but it ran through the dandelions."

Building the prompt

In [13]:
# specify the question/request posed to the LLM

question = "Extract the relevant entities from the given sentence."

In [14]:
# specify the personality you expect from the LLM

personality = "You are a historian and literary scholar with expertise on historical travel literature, colonial literature and labelling named entities."

In [15]:
# add a JSON object with the category names followed by the expected data type

schema_entity={"person": ["string"],
        "location": ["string"],
        "fauna": ["string"],
        "flora": ["string"],
        "organisation": ["string"]
      }
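
One detail worth knowing: when a Python dictionary is interpolated into an f-string, it is rendered with Python's own notation (single quotes), which is not valid JSON. The sketch below shows the difference and one possible alternative, serialising the schema with `json.dumps` before placing it in a prompt. The variable names `as_repr` and `schema_json` are only illustrative.

```python
import json

schema_entity = {
    "person": ["string"],
    "location": ["string"],
    "fauna": ["string"],
    "flora": ["string"],
    "organisation": ["string"],
}

# Interpolating the dict directly uses Python's repr (single quotes).
as_repr = f"{schema_entity}"

# json.dumps produces valid JSON with double quotes.
schema_json = json.dumps(schema_entity, indent=2)

print(as_repr)
print(schema_json)
```

Whether this changes the model's answers is something to test on your own data; the few-shot examples in the prompt already show double-quoted JSON.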


In [16]:
# add the category names with small global introduction/definition as a string

categories = """
person: proper names of people,
location: proper names of locations,
fauna: common and scientific names of animals and fauna,
flora: common and scientific names of vegetation, plants, flowers and flora,
organisation: proper names of organisations"""

In [17]:
# This brings all the elements above together in a template.
# The sentence is clearly indicated by <<<>>>, which helps the model to stick to the text given.

template_0=f"{personality}"

template_1= f"""
Your task is to extract relevant named entities from the given sentence based on the following labels:
{categories}
Only respond in JSON format, Do not add any more comment.
The structure of the JSON format is like this:
      {schema_entity} 

Let's approach this step-by-step:

Example 1:
Sentence: <<<The New York Zoo is home to jaguars and giant water lilies.>>>

Step 1: Identify potential named entities
- New York Zoo
- jaguars
- giant water lilies

Step 2: Categorize each entity and assign a label
- New York Zoo: organisation (Zoo)
- New York: location (proper name of a city)
- jaguars: fauna (common name of an animal)
- giant water lilies: flora (common name of a plant)

Step 3: Format the output

   {{"person": [],
    "location": ["New York"],
    "fauna": ["jaguars"],
    "flora": ["giant water lilies"],
    "organisation": ["New York Zoo"]
   }}  
    
Example 2:
Sentence: <<<Hurricane Katrina devastated New Orleans in 2005>>>

Step 1: Identify potential named entities
- New Orleans

Step 2: Categorize each entity and assign a label
- New Orleans: location (proper name of a city)

Step 3: Format the output

   {{"person": [],
    "location": ["New Orleans"],
    "fauna": [],
    "flora": [],
    "organisation": []
  }}

Example 3:
Sentence: <<<The Great Wall of China stretches across the Gobi Desert, where Bactrian camels roam freely.>>>

Step 1: Identify potential named entities
- China
- Bactrian camels

Step 2: Categorize each entity and assign a label
- China: location (name of a Country)
- Bactrian camels: fauna (common name of an animal)

Step 3: Format the output

   {{"person": [],
    "location": ["China"],
    "fauna": ["Bactrian camels"],
    "flora": [],
    "organisation": [] 
  }}

Now, DO NOT take into account the previous examples, use them JUST AS A REFERENCE. 

Question: {question}
The sentence for analyzing is: <<< {sentence} >>>

Answer: """

### 4.2.2 Calling the Model

In [18]:
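llm = ChatTogether(
    model=model,
    temperature=0,
    api_key=TOGETHER_API_KEY,
)
# The system message carries the personality, the human message the full prompt
messages = [
    ("system", template_0),
    ("human", template_1),
]

response = llm.invoke(messages)
# Validate the (possibly partial) JSON answer against the NER class
result = NER.model_validate(from_json(response.content, allow_partial=True))
print(repr(result))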

NER(person=['I'], organisation=[], location=['Rome'], fauna=['deer', 'rabbits'], flora=['dandelions'])


### 4.2.3 Check the results

In [19]:
result.fauna

['deer', 'rabbits']

In [20]:
result.location

['Rome']

## 4.3.- START PLAYING: Use your files and ask the model
If your data is stored in a different input format, you will first have to read it and load it into a dataframe. The code below processes all rows of the dataframe.

### 4.3.1 Functions

In this section, we write functions for making calls to the LLM and parsing the output. These functions are used later in the code.


1.  In our function *parse_llm_response*, we validate the content of the response with [Pydantic](https://docs.pydantic.dev/latest/concepts/json/), which transforms partial JSON output into a valid object, and then collect every entity text together with its label. If validation fails, the function returns an empty list.

2.   In our function *llm_output*, we call the LLM and obtain the validated output as a dictionary.



---

❗💭 **Mind you that these functions will have to be adapted according to the output of your LLM of choice, given that the output is unpredictable and changes when your prompt does.**






In [21]:
# parse the LLM response:
# validate the (partial) JSON and collect all entities with their categories

def parse_llm_response(response, basemodel_class = NER):
  try:
    result = basemodel_class.model_validate(from_json(response.content, allow_partial = True))
    category_entity = []
    # iterating over a Pydantic model yields (field name, value) pairs
    for category, entity_text_list in result:
      if entity_text_list is not None: # skip categories the model did not return
        for ent in entity_text_list:
          category_entity.append((ent, category))

    return category_entity

  except Exception: # invalid JSON or failed validation: return no entities
    return []
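
Because the prompt asks the model to reason step by step, a reply may contain the written-out steps before the JSON object, in which case validation of the raw content fails. One possible pre-processing step is to search the reply for the first complete JSON object. The helper below is a minimal sketch using only the standard library; the name `extract_first_json` and the sample reply are made up for illustration, and a truncated object would still need the partial parsing shown earlier.

```python
import json
from typing import Optional


def extract_first_json(text: str) -> Optional[dict]:
    """Return the first complete JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON at this position, try the next brace
        start = text.find("{", start + 1)
    return None


# Made-up reply in which the model repeats its reasoning before the answer
reply = """Step 1: Identify potential named entities
- Rome
- deer
Step 3: Format the output
{"person": [], "location": ["Rome"], "fauna": ["deer"], "flora": [], "organisation": []}
"""
print(extract_first_json(reply))
```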


In [22]:
### call to the LLM and parse the response

def llm_output(content_user, model_1=model, content_sys=template_0, basemodel_class=NER):
    # A new client is created on every call; temperature=0 keeps the output as deterministic as possible
    llm = ChatTogether(
        model=model_1,
        temperature=0,
        api_key=TOGETHER_API_KEY,
    )
    messages = [
        ("system", content_sys),
        ("human", content_user),
    ]

    response = llm.invoke(messages)
    # Validate the (possibly partial) JSON answer and convert it to a plain dictionary
    validated = basemodel_class.model_validate(from_json(response.content, allow_partial=True))
    result = validated.model_dump()

    return result
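
Calls to a hosted API can fail temporarily, for instance when too many requests are sent in a short time. A common way to deal with this is to retry with an increasing pause. The sketch below is one possible approach, not part of the original pipeline: `call_with_retry` is a hypothetical helper, and `flaky_llm_output` is a stand-in that simulates two failures so the example runs without an API key.

```python
import time


def call_with_retry(func, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Call func, retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as error:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.2f}s")
            time.sleep(delay)


# Stand-in for llm_output that fails twice before succeeding
calls = {"count": 0}


def flaky_llm_output(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limit reached")
    return {"location": ["Rome"]}


result = call_with_retry(flaky_llm_output, "some prompt", base_delay=0.01)
print(result)
```

In the notebook this could wrap the existing function, for example `call_with_retry(llm_output, sub_template, model, template_0)`.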

### 4.3.2 Apply the LLM to a Pandas DataFrame

Here, we take a sample of our corpus to showcase a possible approach but the most important take-away is that you can apply this pipeline to your own data!


#### 4.3.2.1.- Chunk the text into smaller parts

The Llama 3.1 8B model has a context window of **128K** tokens. Therefore, if our text is over the limit, we need to split it into smaller pieces before we proceed.

Let's split up our text in chunks of at most 64,000 characters (the splitter below measures length in characters, not tokens), and make a new row for each chunk.

The model is probably more inclined to make mistakes when the text chunks are too large. One of the reasons for this is that the models have a tendency to focus on the beginning or the end of an input, and pay less attention to the middle part (the paper [Lost in the Middle](https://arxiv.org/pdf/2307.03172) explains this in detail). On the other hand, smaller chunks mean more API calls, and hosted inference services enforce rate limits (see, for example, the [HuggingFace API FAQ](https://huggingface.co/docs/api-inference/faq)). Experiment with these settings to see if this approach is useful for your use-case!

In [23]:
def text_splitter(sample_text, chunk_size = 64000):
  # Initialize the text splitter with custom parameters
  custom_text_splitter = RecursiveCharacterTextSplitter(
      # Maximum size of a chunk, in characters
      chunk_size = chunk_size,
      # Number of characters shared between neighbouring chunks
      chunk_overlap  = 20,
      # Measure size with len(), i.e. in characters rather than tokens
      length_function = len,
  )

  # Create the chunks and keep only their text content
  texts = custom_text_splitter.create_documents([sample_text])
  texts_content = [text.page_content for text in texts]

  return texts_content
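
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified illustration: a fixed-width sliding window over the characters of a string. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to cut at separators such as paragraph breaks), and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into pieces of at most chunk_size characters that share
    chunk_overlap characters with the previous piece."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
for piece in pieces:
    print(piece)
# prints:
# abcdefghij
# ijklmnopqr
# qrstuvwxyz
```

The overlap means that an entity cut in half at the end of one chunk has a chance of appearing whole at the start of the next one.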

#### For testing purposes, first execute section 5 (below) to build the English_corpus dataframe

In [30]:
#Example of using text_splitter
# .copy() gives us an independent dataframe, so adding a column does not trigger a SettingWithCopyWarning
English_corpus_sample = English_corpus[1:2].copy()

In [31]:
#Example of using text_splitter
English_corpus_sample["chunks"] = English_corpus_sample.text.apply(text_splitter)


In [32]:
#Example of using text_splitter
English_corpus_sample = English_corpus_sample.explode("chunks")

#### 4.3.2.2.- Loading in a dataframe and chunking the data!

To show you how the LLM works, let's run it on a small sample of our text. The LLM sometimes outputs incorrect or incomplete JSON, which results in a loss of output when the answer is validated and parsed: our validation approach discards invalid JSON objects in the LLM output, which means we may also lose some correctly extracted entities.



---


❗ You have to decide for yourself whether this loss is something you can work with - or you could further experiment with your prompt and validation settings to circumvent this problem as much as possible.

In [33]:
#pick the first two chunks of our sample

test = English_corpus_sample[:2]
test = test.reset_index(drop=True)

In [34]:
test

| | file | text | language | chunks |
| --- | --- | --- | --- | --- |
| 0 | Florence_and_Northern_Tuscany_with_Genoa.txt | Title: Florence and Northern Tuscany with Geno... | English | Title: Florence and Northern Tuscany with Geno... |
| 1 | Florence_and_Northern_Tuscany_with_Genoa.txt | Title: Florence and Northern Tuscany with Geno... | English | It is perhaps in the Via Garibaldi, Via Cairol... |


#### These are the sentences (chunks) to process

You need to build a system prompt and a user prompt. The system prompt only has to be defined once: to adapt it, change the `personality` variable. The user prompt has to be built separately for each chunk in each row of the dataframe.

In [35]:
test.chunks

0    Title: Florence and Northern Tuscany with Geno...
1    It is perhaps in the Via Garibaldi, Via Cairol...
Name: chunks, dtype: object

#### Building the user and system prompts


In [36]:
# specify the question/request posed to the LLM

question = "Extract the relevant entities from the given sentence."

In [37]:
# specify the personality you expect from the LLM

personality = "You are a historian and literary scholar with expertise on historical travel literature, colonial literature and labelling named entities."
template_0=f"{personality}"


In [38]:
# add a JSON object with the category names followed by the expected data type

schema_entity={"person": ["string"],
        "location": ["string"],
        "fauna": ["string"],
        "flora": ["string"],
        "organisation": ["string"]
      }


In [39]:
# add the category names with small global introduction/definition as a string

categories = """
person: proper names of people,
location: proper names of locations,
fauna: common and scientific names of animals and fauna,
flora: common and scientific names of vegetation, plants, flowers and flora,
organisation: proper names of organisations"""

Now we build a list that holds, for each row of the dataframe, one prompt per chunk.

In [40]:
chunk_size=20000
templates_text=[]
for text in test["chunks"]:
  chunks=text_splitter(text, chunk_size)
  templates=[]
  for sentence in chunks:
  
    template_1 = f"""
          Your task is to extract relevant named entities from the given sentence based on the following labels:
          {categories}
          Only respond in JSON format, Do not add any more comment.
          The structure of the JSON format is like this:
                {schema_entity} 

          Let's approach this step-by-step:

          Example 1:
          Sentence: <<<The New York Zoo is home to jaguars and giant water lilies.>>>

          Step 1: Identify potential named entities
          - New York Zoo
          - jaguars
          - giant water lilies

          Step 2: Categorize each entity and assign a label
          - New York Zoo: organisation (Zoo)
          - New York: location (proper name of a city)
          - jaguars: fauna (common name of an animal)
          - giant water lilies: flora (common name of a plant)

          Step 3: Format the output

            {{"person": [],
              "location": ["New York"],
              "fauna": ["jaguars"],
              "flora": ["giant water lilies"],
              "organisation": ["New York Zoo"]
            }}  
              
          Example 2:
          Sentence: <<<Hurricane Katrina devastated New Orleans in 2005>>>

          Step 1: Identify potential named entities
          - New Orleans

          Step 2: Categorize each entity and assign a label
          - New Orleans: location (proper name of a city)

          Step 3: Format the output

            {{"person": [],
              "location": ["New Orleans"],
              "fauna": [],
              "flora": [],
              "organisation": []
            }}

          Example 3:
          Sentence: <<<The Great Wall of China stretches across the Gobi Desert, where Bactrian camels roam freely.>>>

          Step 1: Identify potential named entities
          - China
          - Bactrian camels

          Step 2: Categorize each entity and assign a label
          - China: location (name of a Country)
          - Bactrian camels: fauna (common name of an animal)

          Step 3: Format the output

            {{"person": [],
              "location": ["China"],
              "fauna": ["Bactrian camels"],
              "flora": [],
              "organisation": [] 
            }}

          Now, DO NOT take into account the previous examples, use them JUST AS A REFERENCE. 

          Question: {question}
          The sentence for analyzing is: <<< {sentence} >>>

          Answer: """
    templates.append(template_1)
  templates_text.append(templates)
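
The loop above repeats the complete prompt text inline. If you expect to adjust the prompt often, it can help to keep the fixed parts in one place and fill in only the sentence. The following is a sketch of that idea, not the notebook's own code: `build_prompt` and `FEW_SHOT_EXAMPLES` are hypothetical names, and the examples are abbreviated to a single one to keep the block short.

```python
import json

# Abbreviated stand-in for the three worked examples used in the prompt above
FEW_SHOT_EXAMPLES = """Example 1:
Sentence: <<<Hurricane Katrina devastated New Orleans in 2005>>>
Step 1: Identify potential named entities
- New Orleans
Step 2: Categorize each entity and assign a label
- New Orleans: location (proper name of a city)
Step 3: Format the output
{"person": [], "location": ["New Orleans"], "fauna": [], "flora": [], "organisation": []}
"""


def build_prompt(sentence, question, categories, schema):
    """Assemble the user prompt for one sentence or chunk."""
    return (
        "Your task is to extract relevant named entities from the given sentence "
        "based on the following labels:\n"
        f"{categories}\n"
        "Only respond in JSON format. Do not add any more comment.\n"
        "The structure of the JSON format is like this:\n"
        f"{json.dumps(schema)}\n\n"
        "Let's approach this step-by-step:\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Question: {question}\n"
        f"The sentence for analyzing is: <<< {sentence} >>>\n\n"
        "Answer: "
    )


# Made-up chunks for illustration
chunks = ["I saw a deer in Rome.", "Dandelions grew along the Tiber."]
prompts = [
    build_prompt(
        chunk,
        question="Extract the relevant entities from the given sentence.",
        categories="location: proper names of locations,\nfauna: common and scientific names of animals",
        schema={"location": ["string"], "fauna": ["string"]},
    )
    for chunk in chunks
]
print(prompts[0])
```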

#### 4.3.2.3 Executing NxM prompts
In this section we execute N×M prompts, where N is the number of rows (files) and M the number of chunks per row. All results are merged into a single dictionary with the same keys as the NER class.

In [41]:
# Accumulator dictionary for the final results
final_result = {}
for templates in templates_text:
    for sub_template in templates:
        # Entities extracted from one chunk: a dictionary of lists (or None)
        sub_result = llm_output(sub_template, model, template_0)
        for key, value in sub_result.items():
            # Make sure every category starts out as an empty list
            final_result.setdefault(key, [])
            # Only lists are merged; None (category missing in the output) is skipped
            if isinstance(value, list):
                final_result[key] += value

# Show the final result
print(final_result)

{'person': ['Edward Hutton', 'John Evelyn', 'Tennyson', 'Philip of Spain', 'Visconti', 'Cesare Borgia', 'St. Catherine Adorni', 'Andrea Doria', 'St. Nazarus', 'St. Celsus', 'St. Laurence', 'St. Augustine', 'Luitprand', 'Charlemagne', 'Otho', 'Godfrey de Bouillon', 'Urban II', 'Peter the Hermit', 'Guglielmo Embriaco', 'Nicodemus', 'Napoleon', 'Andrea Doria', 'Simone Boccanegra', 'Gian Galeazzo Visconti', 'Filippo Maria Visconti', 'Tommaso Fregosi', 'Francesco Spinola', 'Pietro Fregosi', 'Charles VIII', 'Mahomet', 'Sforza', 'Galeazzo', 'Ludovico Sforza', 'Louis XII', 'Columbus', 'Columbus', 'David', 'Louis of France', 'Francis I', 'Julius II', 'Charles V', 'Andrea Doria', 'Giannettino', 'Gian Luigi Fieschi', 'Abbate di San Fruttuoso', 'Guglielmo Boccanegra', 'Luca Pinelli', 'Pope Alexander III', 'Richard Cordelion', 'Federigo Barbarossa', 'Henry IV', 'Innocent IV', 'Henry VII', 'St. Catherine of Siena', 'St. Catherine Adorni', 'Louis XII', 'Don John of Austria', 'Velasquez', 'Vandyck', '

#### 4.3.2.4 Checking and saving results
You can check the output list based on the category dictionary keys!

In [42]:
final_result["person"]

['Edward Hutton',
 'John Evelyn',
 'Tennyson',
 'Philip of Spain',
 'Visconti',
 'Cesare Borgia',
 'St. Catherine Adorni',
 'Andrea Doria',
 'St. Nazarus',
 'St. Celsus',
 'St. Laurence',
 'St. Augustine',
 'Luitprand',
 'Charlemagne',
 'Otho',
 'Godfrey de Bouillon',
 'Urban II',
 'Peter the Hermit',
 'Guglielmo Embriaco',
 'Nicodemus',
 'Napoleon',
 'Andrea Doria',
 'Simone Boccanegra',
 'Gian Galeazzo Visconti',
 'Filippo Maria Visconti',
 'Tommaso Fregosi',
 'Francesco Spinola',
 'Pietro Fregosi',
 'Charles VIII',
 'Mahomet',
 'Sforza',
 'Galeazzo',
 'Ludovico Sforza',
 'Louis XII',
 'Columbus',
 'Columbus',
 'David',
 'Louis of France',
 'Francis I',
 'Julius II',
 'Charles V',
 'Andrea Doria',
 'Giannettino',
 'Gian Luigi Fieschi',
 'Abbate di San Fruttuoso',
 'Guglielmo Boccanegra',
 'Luca Pinelli',
 'Pope Alexander III',
 'Richard Cordelion',
 'Federigo Barbarossa',
 'Henry IV',
 'Innocent IV',
 'Henry VII',
 'St. Catherine of Siena',
 'St. Catherine Adorni',
 'Louis XII',
 'Do
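
The merged list contains repeated names, because the same entity can be found in several chunks. Depending on your research question, you may want a list of unique entities, a frequency count, or both. A minimal sketch with the standard library follows; the short `people` list is a hand-picked excerpt of the output above, and `deduplicate` is a hypothetical helper name.

```python
from collections import Counter


def deduplicate(entities):
    """Remove repeated entity strings while keeping first-seen order."""
    return list(dict.fromkeys(entities))


people = ["Andrea Doria", "Columbus", "Columbus", "Louis XII", "Andrea Doria"]

unique_people = deduplicate(people)
counts = Counter(people)

print(unique_people)
print(counts["Columbus"])
```

Note that this only merges exact string matches: variants such as 'Visconti' and 'Gian Galeazzo Visconti' remain separate entries.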

#### Save results to CSV files

If we're satisfied with the results, we can save them: each category is converted to a DataFrame and written to its own CSV file.

In [None]:
path = "./CLSinfra/"
for key, values in final_result.items():
    # Convert the list of entities for this category to a DataFrame
    df = pd.DataFrame(values, columns=[key])
    # Save as a CSV file named after the category
    filename = f"{key}.csv"
    df.to_csv(os.path.join(path, filename), index=False)
    print(f"File saved: {filename}")

# 5.- EXAMPLE for loading data to a dataframe 📜

In this code snippet we collect a multilingual corpus of travel literature from the GitHub repository pertaining to GhentCDH. You can find more information on this example corpus on our [GitHub repository](https://github.com/GhentCDH/CLSinfra).

To show you how this workflow can work for different languages, we'll load in our **Dutch** and **English** annotations. We annotated two aspects in these texts: **fauna** 🐱 and **flora** 🌺. These include common names and scientific denominations.


In [None]:
# Load in our example texts
!git clone https://github.com/GhentCDH/CLSinfra.git

In [25]:
path = "./CLSinfra/Example_data_CLS/"

In [26]:
all_travelogues = []

for filename in glob.glob(f"{path}*/*.txt"):
  print(filename)

  name_file = os.path.basename(filename) #find filename
  folder_name = os.path.basename(os.path.dirname(filename)) #find folder name (in our case: the language), independent of the path separator

  with open(filename, "r") as travelogue:

    text = travelogue.read()
    travelogue_data = {"file": name_file, "text": text, "language": folder_name}
    all_travelogues.append(travelogue_data)

travel_df = pd.DataFrame(all_travelogues)

./CLSinfra/Example_data_CLS/Dutch/haan098besc02_01.txt
./CLSinfra/Example_data_CLS/Dutch/blin001verz01_01.txt
./CLSinfra/Example_data_CLS/Dutch/have010vree01_01.txt
./CLSinfra/Example_data_CLS/Dutch/piet077omst01_01.txt
./CLSinfra/Example_data_CLS/Dutch/gerr049besc01_01.txt
./CLSinfra/Example_data_CLS/Dutch/oltm003vade01_01.txt
./CLSinfra/Example_data_CLS/Dutch/have010door01_01.txt
./CLSinfra/Example_data_CLS/Dutch/haff003reiz01_01.txt
./CLSinfra/Example_data_CLS/Dutch/haff003lotg02_01.txt
./CLSinfra/Example_data_CLS/Dutch/woen003aant01_01.txt
./CLSinfra/Example_data_CLS/German/TP1207.txt
./CLSinfra/Example_data_CLS/German/TP1213.txt
./CLSinfra/Example_data_CLS/German/TP1210.txt
./CLSinfra/Example_data_CLS/German/TP938.txt
./CLSinfra/Example_data_CLS/German/TP1062.txt
./CLSinfra/Example_data_CLS/German/TP1044.txt
./CLSinfra/Example_data_CLS/German/TP1179.txt
./CLSinfra/Example_data_CLS/German/TP934.txt
./CLSinfra/Example_data_CLS/German/TP1027.txt
./CLSinfra/Example_data_CLS/German/TP9

In [27]:
#Make separate corpora per language
English_corpus = travel_df[travel_df["language"] == "English"]
Dutch_corpus = travel_df[travel_df["language"] == "Dutch"]
German_corpus = travel_df[travel_df["language"] == "German"]
French_corpus = travel_df[travel_df["language"] == "French"]

In [28]:
EN_fauna_flora = pd.read_csv("./CLSinfra/Example_data_CLS/EN_fauna_flora_df.csv")
NL_fauna_flora = pd.read_csv("./CLSinfra/Example_data_CLS/NL_fauna_flora_df.csv")

In [29]:
NL_fauna_flora.sample(10)

| Unnamed: 0 | sentence | text | _sentence_text | aspect_cat |
| --- | --- | --- | --- | --- |
| 133 | BHL_957_sample_Dutch_19.0.txt_15419-15433 | Hirundo | P. ) Hirundo ? | FAUNA |
| 239 | BHL_794_sample_Dutch_18.0.txt_1106-1417 | bladeren | Men wirt hier de bladeren van 't geboomte niet... | FLORA |
| 387 | BHL_7_sample_Dutch_19.0.txt_1946-2236 | obtusipetalus | Onder de soorten met twee ( zelden één ) midde... | FLORA |
| 3 | DBNL-151_sample_IAA_19.txt_1626-1841 | kaaiman | Op onze vaart daarheen hadden wij nog het voor... | FAUNA |
| 234 | BHL_794_sample_Dutch_18.0.txt_940-1003 | vee | Men zait maar wei- nig garft , en dat nog alle... | FAUNA |
| 550 | DBNL-10_sample20 (1).txt_9491-9642 | hout | De Corso is een gedeelte van de boulevard , zo... | FLORA |
| 448 | BHL_7_sample_Dutch_19.0.txt_13264-13333 | knobbels | Die knobbels zijn dicht opeengedrongen en spir... | FLORA |
| 377 | BHL_7_sample_Dutch_19.0.txt_443-642 | randdorens | Indien zij verschillen , kunnen de middendoren... | FLORA |
| 101 | BHL_957_sample_Dutch_19.0.txt_13870-13879 | Canis | Canis — ? | FAUNA |
| 413 | BHL_7_sample_Dutch_19.0.txt_4330-4440 | M. communis | Alleen merkt hij op , dat onder den naam van M... | FLORA |
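
Gold annotations like these can be used to score the entities extracted by the LLM. As a rough first check, the sketch below compares two lists of entity strings as sets and computes precision, recall and F1. It is a simplification under assumptions: it ignores where in the text an entity occurs (span-level evaluation, as offered by nervaluate, is stricter), the function name `precision_recall_f1` is hypothetical, and the two lists are made up for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Compare two collections of entity strings as sets (case-insensitive)."""
    predicted_set = {p.strip().lower() for p in predicted}
    gold_set = {g.strip().lower() for g in gold}
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up lists for illustration, not results from the corpus
predicted_fauna = ["deer", "rabbits", "jaguar"]
gold_fauna = ["deer", "rabbits", "kaaiman", "vee"]

p, r, f1 = precision_recall_f1(predicted_fauna, gold_fauna)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.67 recall=0.50 f1=0.57
```

With the real data, the gold list could be taken from the annotation dataframe, for example the `text` values of the rows where `aspect_cat` equals "FAUNA", assuming that the `text` column holds the annotated entity string.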
