# Introduction
This Jupyter notebook outlines the data preprocessing workflow for using machine learning (ML) to automatically extract key information from solar plant installation reports. The target data includes location, installation date, and total generation capacity (expressed in units such as kW or MW).

Data Source:

- Synthetic Reports: Because real-world installation reports may be scarce, this notebook uses reports generated by Large Language Models (LLMs). The synthetic reports are written to mimic the structure and content of real reports, so the entities they contain appear in the formats we want to extract.

Named Entity Recognition (NER):

- The core task of this notebook is Named Entity Recognition (NER), which identifies and classifies specific entities within text. In this case, we aim to extract locations (using gazetteer-based methods or pre-trained location recognition models), dates (using regular expressions or pre-trained date recognition models), and capacity values (matching numerical patterns together with their units). Preprocessing steps will involve:

   - Sentence Segmentation: Splitting the report text into individual sentences enhances the performance of some NER models by providing better contextual boundaries for entity recognition.
   - Data Labeling: Training an NER model requires manually annotating entities within a subset of the synthetic reports. This involves tagging relevant words or phrases with their corresponding categories (e.g., "LOCATION", "DATE", "CAPACITY").
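
As a rough illustration of what a labeled example might look like, the sketch below stores each annotation as a character-offset span tagged with one of the categories named above. The sentence, the `annotate` helper, and the tuple layout are assumptions made for illustration; they are not the notebook's actual storage format.

```python
# Hypothetical report sentence (not taken from the dataset).
sentence = "La planta de 80 kw en Santiago se instaló el 12/03/2021."


def annotate(text, phrase, label):
    """Return a (start, end, label) character span for the first occurrence of phrase."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return (start, start + len(phrase), label)


annotations = [
    annotate(sentence, "80 kw", "CAPACITY"),
    annotate(sentence, "Santiago", "LOCATION"),
    annotate(sentence, "12/03/2021", "DATE"),
]

# Print each label with its offsets and the text it covers.
for start, end, label in annotations:
    print(f"{label:<9} [{start}:{end}] {sentence[start:end]!r}")
```

Character offsets keep the annotation independent of any particular tokenizer, which is why they are a common interchange format for NER training data.
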

Leveraging Modern NLP Models:

Compared to traditional NLP pipelines, the models used in this notebook benefit from recent advances in contextual language understanding. This may allow us to skip certain preprocessing steps that older models typically require. The main preprocessing considerations are:

- Stop Word Removal: Modern deep learning models can inherently understand the importance of words within their context, potentially rendering stop word removal unnecessary.
- Tokenization: The process of splitting text into individual units (words, punctuation) is often integrated within deep learning models, eliminating the need for a separate tokenization step in our workflow.
- Stemming/Lemmatization: These techniques aim to reduce words to their base form. However, with powerful contextual models, understanding and maintaining the specific word form (e.g., "generating" vs. "generate") may be beneficial.

## Import dependencies

In [41]:
import pandas as pd
import os
import spacy
from spacy.pipeline import Sentencizer
from spacy.tokens import DocBin
from spacy.training import Corpus
import random
import re
from spacy import displacy

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100)
pd.set_option('display.expand_frame_repr', True)

## Data loading

### Load the report texts

In [42]:
DATA_PATH = os.path.join('..', '..', 'data', 'reports')

### Explore the raw data 

In [43]:
texts = []
for file in os.listdir(DATA_PATH):
    # The reports are in Spanish; the files are assumed to be UTF-8 encoded
    with open(os.path.join(DATA_PATH, file), 'r', encoding='utf-8') as f:
        texts.append(f.read())
# Print the number of texts
print(f'Number of texts: {len(texts)}')
# Show one randomly chosen report
print(f'Example text: {random.choice(texts)}')


Number of texts: 10
Example text: La planta contará con una capacidad de 80 kw para satisfacer la demanda energética.
Con una capacidad de 130 kilovatios, la planta será una fuente confiable de energía renovable.
Se espera que la planta genere 85 kw de electricidad al día.
La capacidad instalada de la planta será de 110 kw.
La planta estará diseñada para generar 75 kilovatios de energía solar.
La planta tendrá una capacidad de 95 kw para alimentar la red eléctrica.
La capacidad de generación de 140 kw asegurará un suministro sostenible.
La planta será capaz de generar 70 kw de electricidad limpia y renovable.
Con una capacidad de 105 kw, la planta será un elemento crucial en el panorama energético.
Se estima que la planta genere 125 kilovatios durante su vida útil.
La planta contará con una capacidad de 115 kw, destacándose en el sector energético.
Con una producción de 90 kilovatios, la planta será una importante fuente de energía renovable.
Se espera que la planta genere 65 kw de ele

## Pre-processing

### Sentence segmentation
Sentence segmentation is a crucial preprocessing step for many Natural Language Processing (NLP) tasks, including Named Entity Recognition (NER). NER models often rely on surrounding words to accurately identify entities. Splitting text into sentences provides clear contextual boundaries, enhancing the model's ability to understand the relationships between words and identify entities more effectively.

In [44]:
nlp = spacy.blank('es')
# Rule-based sentence splitter; it needs no trained model
nlp.add_pipe("sentencizer")

# Sanity check: the pipeline should detect two sentences
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2

In [45]:
sentences = []
for text in texts:
    doc = nlp(text)
    for sent in doc.sents:
        sentences.append(sent.text)

print(f'Number of sentences: {len(sentences)}')
print(f'Example sentence: {sentences[1]}')
print(f'Average sentences per text: {len(sentences) / len(texts)}')

Number of sentences: 395
Example sentence: 
La planta está ubicada en la provincia de Santiago, República Dominicana.
Average sentences per text: 39.5


### Semi-Automated Data Annotation for NER
Training an NER model requires a labelled dataset where each piece of text (sentence or document) has its relevant entities identified and categorized. Here, we'll leverage a semi-automated approach to expedite the annotation process:

1. Rule-Based Extraction:

   - Regular Expressions (Regex): We use regular expressions to extract entities that follow consistent formats in the synthetic reports, such as:
      - Energy Capacity: Patterns like \d+(KW|MWh) target numerical values followed by a unit (kilowatts or megawatt-hours).
      - Dates: Regex can identify common date formats (e.g., "DD/MM/YYYY", "YYYY-MM-DD") for installation dates.

2. Pre-trained NER Model Integration:

   - Location Recognition: Because the reports are written in Spanish, we use a pre-trained Spanish spaCy pipeline (es_core_news_md) to identify potential locations within the reports. This is particularly useful for recognizing place names such as cities and provinces. Keep in mind that accuracy for specific locations depends on the model's training data.

3. Combining Results and Manual Refinement:

   - Automatic Annotation: The entities identified by both the regex rules and the pre-trained model are automatically tagged within the text data.
   - Manual Review: The automatic annotations are then reviewed manually to correct errors, resolve ambiguous cases, and add entities that the automated methods missed.
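
The rule-based step can be sketched with the standard `re` module alone. The patterns below are simplified, illustrative assumptions (they spell out a few unit names and the two numeric date formats mentioned above) and are not the notebook's final regexes; `find_entities` is a hypothetical helper name.

```python
import re

# Simplified, illustrative patterns (assumptions, not the notebook's final regexes).
CAPACITY_RE = re.compile(
    r"\b\d+(?:\.\d+)?\s?(?:kw|mw|gw|kilovatios|megavatios)\b", re.IGNORECASE
)
DATE_RE = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")


def find_entities(text):
    """Return (start, end, label) spans for capacity and date matches, sorted by start."""
    spans = [(m.start(), m.end(), "ENERGY_CAPACITY") for m in CAPACITY_RE.finditer(text)]
    spans += [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]
    return sorted(spans)


# Hypothetical sentence in the style of the synthetic reports.
sample = "Con una capacidad de 130 kilovatios, la planta se inauguró el 05/11/2022."
for start, end, label in find_entities(sample):
    print(label, repr(sample[start:end]))
```

Listing the spelled-out unit names matters for reports like these, where "kw" and "kilovatios" both appear; a pattern that only allows unit abbreviations would silently skip the latter and leave them for manual review.
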

In [55]:
capacity_pattern = r'\b(\d+(\.\d+)?)(\s?)([kmgt])(\s?(w|h|wp)?)\b'
date_pattern = r'(\b(\d{1,2}(\sde\s)?(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(\sde)?\s\d{4})\b)|(\b(\d{1,2}/\d{1,2}/\d{4})\b)'
nlp_es = spacy.load('es_core_news_md')

docs = []
for sentence in sentences:
    doc = nlp(sentence)
    # Collect candidate spans as (start_char, end_char, label)
    candidates = [(m.start(), m.end(), "ENERGY_CAPACITY")
                  for m in re.finditer(capacity_pattern, sentence, re.IGNORECASE)]
    candidates += [(m.start(), m.end(), "DATE")
                   for m in re.finditer(date_pattern, sentence)]
    # Locations predicted by the pre-trained Spanish pipeline
    candidates += [(ent.start_char, ent.end_char, "LOC")
                   for ent in nlp_es(sentence).ents if ent.label_ == 'LOC']

    entities = []
    for start, end, label in candidates:
        # char_span returns None when the offsets do not align with token boundaries
        span = doc.char_span(start, end, label=label)
        if span is not None:
            entities.append(span)

    # set_ents rejects overlapping spans, so keep the longest non-overlapping ones
    doc.set_ents(spacy.util.filter_spans(entities))
    docs.append(doc)

random.shuffle(docs)
for doc in docs[:10]:
    if doc.ents:
        displacy.render(doc, style="ent")
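
When regex matches and model predictions are merged, two candidate spans can cover the same characters, and spaCy's `doc.set_ents` raises an error for overlapping entities. spaCy provides `spacy.util.filter_spans` for this case; the plain-Python sketch below mirrors the same "longest span wins" idea on character offsets so the logic is easy to inspect. The `drop_overlaps` name and the sample offsets are hypothetical.

```python
def drop_overlaps(spans):
    """Keep the longest spans first and discard any span overlapping one already kept.

    spans: iterable of (start, end, label) character offsets, end exclusive.
    """
    kept = []
    # Longest span wins; ties are broken by the earlier start offset.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append(span)
    return sorted(kept)


candidates = [
    (10, 28, "DATE"),  # hypothetical regex match
    (16, 21, "LOC"),   # hypothetical model prediction inside the date span
    (40, 48, "LOC"),
]
print(drop_overlaps(candidates))
```

Preferring the longest span is only one possible policy; a label-priority rule (for example, always trusting regex dates over model locations) would be an equally reasonable choice to review during manual refinement.
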

### Save dataset to be used in next steps of the pipeline

In [48]:

doc_bin = DocBin(docs=docs)
doc_bin.to_disk("./data.spacy")
reader = Corpus("./data.spacy")
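
A later pipeline step would typically read the serialized docs back and check that the entity annotations survived. The sketch below shows that round trip under stated assumptions: it builds a single hypothetical doc and serializes it to bytes in memory rather than reading `./data.spacy`, so it runs without the dataset; `to_disk` and `from_disk` follow the same pattern with a file path.

```python
import spacy
from spacy.tokens import DocBin

# Blank Spanish pipeline: only its tokenizer and vocab are needed here.
nlp = spacy.blank("es")

# One small annotated doc as a stand-in for the saved dataset (hypothetical example).
text = "Capacidad de 95 kw para la red"
doc = nlp(text)
start = text.index("95 kw")
span = doc.char_span(start, start + len("95 kw"), label="ENERGY_CAPACITY")
doc.set_ents([span])

# Serialize to bytes and restore the docs against the pipeline's vocab.
data = DocBin(docs=[doc]).to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))

for restored_doc in restored:
    print([(ent.text, ent.label_) for ent in restored_doc.ents])
```
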