# Guide to DataFrame Conversion

Data comes in various formats, each with unique structures and purposes. Transforming these diverse file types into a standardized DataFrame (df) is a crucial step in data analysis. Let's explore some of the most common file types and how to work with them:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [105]:
# Set drive/MyDrive/ as base working directory
%cd /content/drive/MyDrive/didip_ss/D01/

/content/drive/MyDrive/didip_ss/D01


## Txt files

The code in this section creates a Pandas DataFrame from plain-text (.txt) transcriptions. The resulting DataFrame has two columns:

- ID: The document identifier. For a single file that holds several documents, this is the folio reference (for example `c. 5v`); when each file is one document, it is the name of the file without the .txt extension.
- Text: The text content of the document.


**Step-by-Step Explanation**

1. **File Reading:**
   - `with open(file_path, 'r') as f:` opens the file in read mode (`'r'`), ensuring proper resource management (the file is automatically closed after use).
   - `text = f.read()` reads the entire file content into a single string variable `text`.

2. **Regular Expression Pattern:**
   - `doc_pattern = r'(c\.\s*\d+[vr])\n\n(.*?)(?=(?:c\.\s*\d+[vr]\n\n|$))'` is a regular expression designed to match the following:
      - `(c\.\s*\d+[vr])`: Captures the document ID, which starts with "c.", followed by optional whitespace (`\s*`), one or more digits (`\d+`), and ends with either "r" or "v" (recto/verso in manuscript terminology).
      - `\n\n`: Matches two newline characters separating the ID from the text.
      - `(.*?)`: Captures the document text content in a non-greedy way (`*?`) to avoid overmatching.
      - `(?=(?:c\.\s*\d+[vr]\n\n|$))`: Positive lookahead assertion to ensure that the match ends either before the next document ID or at the end of the file (`$`).

3. **Data Extraction and Transformation:**
   - `for match in re.finditer(doc_pattern, text, re.DOTALL):`: Iterates over all non-overlapping matches of the `doc_pattern` in the `text`. The `re.DOTALL` flag makes the dot (`.`) in the regular expression match newline characters as well.
      - `doc_id = match.group(1).strip()`: Extracts the document ID (group 1 of the match) and removes leading/trailing spaces.
      - `doc_text = match.group(2).strip().replace('\n', ' ')`: Extracts the document text (group 2), removes leading/trailing whitespace, and replaces the remaining newline characters (`\n`) with spaces so that each document becomes a single line.
      - `data.append({'ID': doc_id, 'Text': doc_text})`: Appends a dictionary containing the ID and text of the extracted document to the `data` list.

4. **DataFrame Creation:**
   - `return pd.DataFrame(data)`: Creates and returns a Pandas DataFrame using the `data` list. The DataFrame will have two columns: "ID" and "Text," organizing the extracted information in a structured format for further analysis or processing.
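To see what the pattern captures, here is a minimal sketch that runs the same regular expression on a short made-up string instead of a file. The sample text is an assumption that imitates the expected layout (folio ID, blank line, text); the pattern and the cleaning steps are the ones described above.

```python
import re

# Hypothetical sample that imitates the expected layout:
# a folio ID, a blank line, then the text belonging to that folio.
sample = (
    "c. 5v\n\n"
    "In Dei nomine\nAmen.\n\n"
    "c.6r\n\n"
    "idem dominus prior\n"
)

doc_pattern = r'(c\.\s*\d+[vr])\n\n(.*?)(?=(?:c\.\s*\d+[vr]\n\n|$))'

records = []
for match in re.finditer(doc_pattern, sample, re.DOTALL):
    doc_id = match.group(1).strip()                        # folio reference
    doc_text = match.group(2).strip().replace('\n', ' ')   # flatten to one line
    records.append({'ID': doc_id, 'Text': doc_text})

for record in records:
    print(record)
```

Each record corresponds to one folio, and line breaks inside a folio's text are replaced by spaces. Testing the pattern on a small string like this is a quick way to check that it fits a new transcription before loading the whole file.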


### Single txt file

In [95]:
import pandas as pd
import re

def load_documents(file_path):
    with open(file_path, 'r') as f:
        text = f.read()

    data = []
    doc_pattern = r'(c\.\s*\d+[vr])\n\n(.*?)(?=(?:c\.\s*\d+[vr]\n\n|$))'  # Regular expression pattern to match document ID and text
    for match in re.finditer(doc_pattern, text, re.DOTALL):
        doc_id = match.group(1).strip()  # Extract document ID
        doc_text = match.group(2).strip().replace('\n', ' ')  # Replace newlines with spaces
        data.append({'ID': doc_id, 'Text': doc_text})  # Append data to the list

    return pd.DataFrame(data)

# Open the file into a df:
df = load_documents('data/single_txt/transcription.txt')
print(df)

        ID                                               Text
0    c. 5v  In Dei nomine Amen. Anno domini Millesimo IIII...
1     c.6r  idem dominus prior dissit se habuisse et rece ...
2   c. 30v  In Dey nomine Amen. Anno domini Millesimo CCCC...
3   c. 31r  dictum bovem de dicto laboritio et secum condu...
4   c. 32r  In Dei nomine Amen. Millesimo CCCC XII indicti...
5   c. 32v  soccitus sopradictus promisit eidem Bonaccurss...
6   c. 33r  evangelia suprascripta et infrascripta adtende...
7   c. 35v  Millesimo CCCC XII indictione quinta tempore S...
8   c. 36r  presenti stipulanti recipienti et ementi pro s...
9   c. 36v  venditor a dicto emptore manu aliter habuit et...
10  c. 37r  ralem defensionem facere contra omnem litigant...
11  c. 37v  Millesimo CCCC XIII indictione sesta die XXIII...
12  c. 38r  Antonius coram me notario et supra scrictis se...
13  c. 38r  Sub dicto millesimo et indictione et die XXVII...
14  c. 38v  Bonaccurssius Massii de Firmo p vice et nomine...
15  c. 3

### Multiple txts in a directory

This function is designed to process all `.txt` files within a given directory and extract information from them to create a structured dataset in the form of a Pandas DataFrame.

**Detailed Explanation**

1. **Initialization:**
   - `data = []`: Creates an empty list called `data`. This list will serve as a container to store dictionaries representing the extracted data from each file.

2. **File Iteration:**
   - `for filename in os.listdir(directory):`: Iterates through each file (or subdirectory) name within the specified `directory`. The `os.listdir` function from the `os` module is used to get this list of file names.

3. **File Type Filtering:**
   - `if filename.endswith('.txt'):`: Checks whether the current `filename` ends with the `.txt` extension. This filtering ensures that only text files are processed.

4. **File Path Construction:**
   - `filepath = os.path.join(directory, filename)`: Combines the `directory` path with the `filename` to create the complete path to the file. The `os.path.join` function is used to ensure that the path is constructed correctly, taking into account the operating system's path conventions.

5. **File Reading:**
   - `with open(filepath, 'r') as f:`: Opens the text file in read mode ('r') using a context manager (`with open...`). This ensures that the file is automatically closed after reading, preventing resource leaks.
   - `text = f.read()`: Reads the entire content of the file into the `text` variable as a single string.

6. **Data Appending:**
   - `base_filename = os.path.splitext(filename)[0]`: Splits off the extension, leaving the file name without `.txt`.
   - `data.append({'ID': base_filename, 'Text': text})`: Appends a dictionary to the `data` list. This dictionary contains two key-value pairs:
     - `'ID'`: The file name without the `.txt` extension. Because `os.path.splitext` works on names of any length, no fixed character offset is needed.
     - `'Text'`: The entire text content read from the file.

7. **DataFrame Creation:**
   - `return pd.DataFrame(data)`: Converts the list of dictionaries (`data`) into a Pandas DataFrame. Each dictionary in the list becomes a row in the DataFrame, with keys as column names and values as cell contents. The DataFrame is then returned as the output of the function.
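The same steps can also be sketched with `pathlib`, which bundles path construction, extension handling, and file reading into one object. This is an alternative sketch, not the notebook's own function: the name `create_dataset_pathlib`, the UTF-8 encoding, and the demo file names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def create_dataset_pathlib(directory):
    """Hypothetical pathlib-based variant of the directory loader."""
    rows = []
    # sorted() gives a reproducible row order; os.listdir returns names in arbitrary order
    for path in sorted(Path(directory).glob('*.txt')):
        rows.append({
            'ID': path.stem,                           # file name without the extension
            'Text': path.read_text(encoding='utf-8'),  # encoding is an assumption
        })
    return pd.DataFrame(rows, columns=['ID', 'Text'])


# Demonstration on a throwaway directory with made-up files
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, 'charter_b.txt').write_text('second text', encoding='utf-8')
    Path(tmp_dir, 'charter_a.txt').write_text('first text', encoding='utf-8')
    Path(tmp_dir, 'notes.md').write_text('ignored', encoding='utf-8')
    demo_df = create_dataset_pathlib(tmp_dir)

print(demo_df)
```

Sorting the paths is a design choice: it keeps the row order stable between runs, which helps when DataFrame indices are later used to refer back to documents.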


In [98]:
import pandas as pd
import os

def create_dataset(directory):
    data = []
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r') as f:
                text = f.read()

            # Robust ID extraction (handles varying filename lengths)
            base_filename = os.path.splitext(filename)[0]
            data.append({'ID': base_filename, 'Text': text})

    return pd.DataFrame(data)

df = create_dataset('data/multiple_txt/')
df[:5]

Unnamed: 0,ID,Text
0,_txt/AMSPO_FSV_1640.txt,Connosçida cosa sea a quantos esta carta viren...
1,_txt/AMSPO_FSV_1644.txt,Connosçida cosa sea a quantos esta carta viren...
2,_txt/AMSPO_FSV_1638.txt,Connosçida cosa sea a quantos esta carta viren...
3,_txt/AMSPO_FSV_1642.txt,Connosçida cosa sea a quantos esta carta viren...
4,_txt/AMSPO_FSV_1639.txt,Connosçida cossa sea a quantos esta carta uire...


## Xlsx files

This code reads data from multiple sheets within an Excel file (.xlsx) located at a specified path. Each sheet is processed separately, and a Pandas DataFrame is created for each sheet.

In [None]:
!pip install openpyxl==3.0.10  # Install openpyxl for handling xlsx files

Collecting openpyxl==3.0.10
  Downloading openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Installing collected packages: openpyxl
  Attempting uninstall: openpyxl
    Found existing installation: openpyxl 3.1.5
    Uninstalling openpyxl-3.1.5:
      Successfully uninstalled openpyxl-3.1.5
Successfully installed openpyxl-3.0.10


In [99]:
file_path = 'data/xlsx/urbar_1324_part.xlsx'
xls = pd.ExcelFile(file_path)

In [100]:
dfs = {}
print(xls.sheet_names)  # Print the sheet names
for sheet_name in xls.sheet_names:
    df = pd.read_excel(xls, sheet_name)
    dfs[sheet_name] = df

['M1 472-517', 'Ager', 'Area', 'Curia', 'Domus', 'Feudum', 'Huba', 'Laneus', 'Misc', 'Mühle', 'Orto', 'Wiese', 'Vinea', 'mit Feiertagen']


In [101]:
dfs['Huba']

Unnamed: 0,Hofmark St. Pölten M1:472-517,Kennz.,Identifier,Anzahl,Bezeichnung,Besitzverb,Or,Besitzer,Zähler,Einkunft Silignis,...,Ova,Tag/Anm..7,Verbum.10,Zähler.11,Ordei,Tag/Anm..8,Verbum.11,Spezial,Stelle,Anm.
0,,7.0,StP1324/348,0.5,huba,habet,St. Pölten,Sidlinus Pellifex,30,metr.,...,,,,,,,,,M1:473,#7-9 selbe Hube
1,,8.0,StP1324/40,0.25,huba,habet,St. Pölten,Gerunch bei dem tor,30,metr.,...,,,,,,,,,M1:473,#7-9 selbe Hube
2,,9.0,StP1324/53,0.25,huba,[habet],St. Pölten,Heinrich der Chaeser,30,metr.,...,,,,,,,,,M1:473,"#7-9 selbe Hube, Fußjnote 268: Heinrich der Ch..."
3,,31.0,StP1324/412,0.5,huba,habet,St. Pölten,Walbraun,,,...,,,,,,,,,M1:475,"""4 porcos, quorum quilibet valet 60 den."""
4,,261.0,StP1324/2211; StP1324/212,1,Huba,,Egelsee,Leutoldus Lechner de Egelse & filius suus,,,...,,,,,,,,,M1:495,"Abgaben des Sohns sind einzeln angeführt, 1 me..."
5,,268.0,StP1324/98,1,Huba,,Egelsee,Leupoldus de Egelse,,,...,,,,,,,,,M1:496,"""minoris mensure in huba""?"
6,,289.0,StP1324/99,1,Huba,,Foriaeh/Vorhaech [damals Wald sö. Wagram b. St...,Andreas de Egelse,,,...,,,,,,,,,M1:497,
7,,356.0,StP1324/328,1,huba,,außerhalb St. Pölten entlang der Traisen,Mychael de Reichgreben,,,...,,,,,,,,,M1:502,
8,,359.0,StP1324/27,1,Huba,,Hub bei St. Pölten,Bertoldus in Angulo ibidem [Egelse?],,,...,,,,,,,,,M1:502,
9,,368.0,StP1324/454,1,huba,,außerhalb St. Pölten entlang der Traisen,Conradus Zwischendemprunn,,,...,,,,,,,,,M1:503,


## XML

### Corpus Corporum

In [102]:
import pandas as pd
from lxml import etree

# Parse the XML file
file_path = 'data/corpus_corporum/140_Innocentius-III_Bulla-de-canonizatione-S.-Cunegundis.xml'
parser = etree.XMLParser(recover=True)  # This allows the parser to recover from some errors
tree = etree.parse(file_path, parser=parser)
root = tree.getroot()

# Extract relevant information
data = {
    'title': [],
    'author': [],
    'author_date': [],
    'editor': [],
    'publisher': [],
    'publication_place': [],
    'publication_date': [],
    'series_title': [],
    'language': [],
    'text_content': []
}

# Define namespaces
namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Helper: return the first XPath result for a node, or '' when the node or the result is missing
def first_text(node, path):
    results = node.xpath(path, namespaces=namespaces) if node is not None else []
    return results[0] if results else ''

# Extract metadata; every list receives exactly one value, so all columns keep the same length
# (pd.DataFrame raises an error when the lists differ in length, e.g. if <author> is missing)
header = root.find('.//tei:teiHeader', namespaces=namespaces)
author = header.find('.//tei:author', namespaces=namespaces) if header is not None else None
data['title'].append(first_text(header, './/tei:title/text()'))
data['author'].append(author.text.strip() if author is not None and author.text else '')
data['author_date'].append(first_text(author, 'tei:date/text()'))
data['editor'].append(first_text(header, './/tei:editor/text()'))
data['publisher'].append(first_text(header, './/tei:publisher/text()'))
data['publication_place'].append(first_text(header, './/tei:pubPlace/text()'))
data['publication_date'].append(first_text(header, './/tei:publicationStmt/tei:date/text()'))
data['series_title'].append(first_text(header, './/tei:seriesStmt/tei:title/text()'))
data['language'].append(first_text(header, './/tei:language/text()'))

# Extract text content
text_content = []
for p in root.xpath('.//tei:text//tei:p', namespaces=namespaces):
    text_content.append(' '.join(p.xpath('.//text()')))
data['text_content'].append(' '.join(text_content))

# Create DataFrame
df = pd.DataFrame(data)

# Display the first few rows of the DataFrame
df[:5]

Unnamed: 0,title,author,author_date,editor,publisher,publication_place,publication_date,series_title,language,text_content
0,Bulla de canonizatione S. Cunegundis,Innocentius III,-1216,Jacques-Paul Migne,J. P. Migne,Parisiis,1853,"Patrologia Latina, vol. 140",latin,"\n Innocentius episcopus, servus ser..."


### Cora-XML text extractor
Data source: Reference Corpus of Middle High German.

In [103]:
import os
import pandas as pd
from lxml import etree

# Define the directory containing the XML files
xml_directory = 'data/cora-xml'

# Function to parse an XML file and extract data
def parse_xml_to_df(xml_file):
    tree = etree.parse(xml_file)
    root = tree.getroot()

    data = []
    sentence = []

    for token in root.xpath('//token'):
        token_id = token.get('id')
        trans = token.get('trans')
        token_type = token.get('type')

        tok_dipl = token.find('tok_dipl')
        tok_dipl_id = tok_dipl.get('id') if tok_dipl is not None else None
        tok_dipl_trans = tok_dipl.get('trans') if tok_dipl is not None else None
        tok_dipl_utf = tok_dipl.get('utf') if tok_dipl is not None else None

        if tok_dipl_utf:
            sentence.append(tok_dipl_utf)

        tok_anno = token.find('tok_anno')
        tok_anno_ascii = tok_anno.get('ascii') if tok_anno is not None else None
        tok_anno_id = tok_anno.get('id') if tok_anno is not None else None
        tok_anno_trans = tok_anno.get('trans') if tok_anno is not None else None
        tok_anno_utf = tok_anno.get('utf') if tok_anno is not None else None
        norm = tok_anno.find('norm').get('tag') if tok_anno is not None and tok_anno.find('norm') is not None else None
        lemma = tok_anno.find('lemma').get('tag') if tok_anno is not None and tok_anno.find('lemma') is not None else None
        lemma_gen = tok_anno.find('lemma_gen').get('tag') if tok_anno is not None and tok_anno.find('lemma_gen') is not None else None
        lemma_idmwb = tok_anno.find('lemma_idmwb').get('tag') if tok_anno is not None and tok_anno.find('lemma_idmwb') is not None else None
        pos = tok_anno.find('pos').get('tag') if tok_anno is not None and tok_anno.find('pos') is not None else None
        pos_gen = tok_anno.find('pos_gen').get('tag') if tok_anno is not None and tok_anno.find('pos_gen') is not None else None
        infl = tok_anno.find('infl').get('tag') if tok_anno is not None and tok_anno.find('infl') is not None else None
        inflClass = tok_anno.find('inflClass').get('tag') if tok_anno is not None and tok_anno.find('inflClass') is not None else None
        inflClass_gen = tok_anno.find('inflClass_gen').get('tag') if tok_anno is not None and tok_anno.find('inflClass_gen') is not None else None

        data.append([
            token_id, trans, token_type, tok_dipl_id, tok_dipl_trans, tok_dipl_utf,
            tok_anno_ascii, tok_anno_id, tok_anno_trans, tok_anno_utf, norm,
            lemma, lemma_gen, lemma_idmwb, pos, pos_gen, infl, inflClass, inflClass_gen
        ])

    # Define the DataFrame columns
    columns = [
        'token_id', 'trans', 'token_type', 'tok_dipl_id', 'tok_dipl_trans', 'tok_dipl_utf',
        'tok_anno_ascii', 'tok_anno_id', 'tok_anno_trans', 'tok_anno_utf', 'norm',
        'lemma', 'lemma_gen', 'lemma_idmwb', 'pos', 'pos_gen', 'infl', 'inflClass', 'inflClass_gen'
    ]

    # Create the DataFrame
    df = pd.DataFrame(data, columns=columns)
    return df, ' '.join(sentence)

# Initialize an empty DataFrame to hold all data
all_data_df = pd.DataFrame(columns=[
    'token_id', 'trans', 'token_type', 'tok_dipl_id', 'tok_dipl_trans', 'tok_dipl_utf',
    'tok_anno_ascii', 'tok_anno_id', 'tok_anno_trans', 'tok_anno_utf', 'norm',
    'lemma', 'lemma_gen', 'lemma_idmwb', 'pos', 'pos_gen', 'infl', 'inflClass', 'inflClass_gen'
])

# Initialize a DataFrame to store file names and documents
file_sentence_df = pd.DataFrame(columns=['file_name', 'document'])

# Loop through each file in the directory
for filename in os.listdir(xml_directory):
    if filename.endswith('.xml'):
        file_path = os.path.join(xml_directory, filename)
        df, sentence = parse_xml_to_df(file_path)
        all_data_df = pd.concat([all_data_df, df], ignore_index=True)
        file_sentence_df = pd.concat([file_sentence_df, pd.DataFrame([[filename, sentence]], columns=['file_name', 'document'])], ignore_index=True)

# Display the file names and sentences DataFrame
file_sentence_df['document'][:5]

# Optionally, save the file names and sentences DataFrame to a CSV file
#file_sentence_df.to_csv('/content/drive/My Drive/NLPSchool_test/cora-xml/file_sentences.csv', index=False)

# Save the combined DataFrame to a CSV file if needed
#all_data_df.to_csv('/content/drive/My Drive/NLPSchool_test/cora-xml/combined_data.csv', header=True, index=False)


0    Ad equū erręhet Man gieng after wege . zoh ſin...
1    daz dv niht enſprecheſt . noh nehein din dinch...
2    ad reſtingendū ſanguinē . In nōie .p.  f .  ...
3    Ad fluxū ſanguiniſ nariū Xpict unde iohan gien...
Name: document, dtype: object

In [None]:
file_sentence_df['document'][0]

'Ad equū erręhet Man gieng after wege . zoh ſin Roſ inhandon . do begagenda imo min trohtin mit ſinero arngrihte . weſman geſtu zune rideſtu . waz mag ih riten . min roſ iſt erręhet . nu ziuh ez da bifiere . tu runeimo in daz ora . drit ez anden ceſewen fuoz . ſo wirt imo deſ erræhetenbuͦz Pat̄ nr̄ . & terge crura eiꝰ & pedeſ dicenſ . alſo ſciero werde diſemo cuiꝰcūq̲ coloriſ ſit . rot . ſuarz . blanc . ualo . griſel . feh . roſſe deſ erræhotenbuͦz ſamo demo got daſelbo buͦzta .'

### Archives Départementales de l'Isère


The code in this section performs the following tasks:

1. Parse an XML file structured according to the Text Encoding Initiative (TEI) guidelines.
2. Extract specific pieces of information from the XML using XPath expressions.
3. Clean and concatenate text from various elements.
4. Organize the extracted information into a Pandas DataFrame for further analysis or processing.

**Step-by-Step Explanation**

1. **XML Parsing and Namespace Handling:**
   - `from lxml import etree`: Imports the necessary library to work with XML data.
   - `tree = etree.parse(...)`: Parses the XML file into an `etree` object, which provides tools for navigating and manipulating the XML structure.
   - `root = tree.getroot()`: Gets the root element of the XML tree.
   - `ns = {'tei': 'http://www.tei-c.org/ns/1.0'}`: Defines a namespace dictionary to handle the `tei` namespace, which is commonly used in TEI XML files.

2. **`extract_text` Function:**
   - This function takes an XML element as input and does the following:
      - If the element is not None:
         - It uses `element.itertext()` to iterate over all text content within the element and its descendants.
         - It joins all text pieces with spaces (`' '.join()`) and removes leading/trailing whitespace (`strip()`) to get a cleaned text string.
      - If the element is None:
         - It returns an empty string (`''`).

3. **Information Extraction:**
   - Using `root.find()` and `root.findall()` with XPath expressions and the namespace dictionary (`ns`), the code locates and extracts various pieces of information from the XML:
      - `region`, `settlement`, `repository`, `idno`, `material`, `condition`, `copy_status` are extracted as single text values.
      - `items` and `main_paragraphs` are extracted as lists of XML elements representing items and paragraphs, respectively.

4. **`get_full_text` Function:**
   - This function recursively processes an XML element and its children to extract the full text content:
      - It handles line breaks (`<lb>`) by inserting newline characters (`\n`).
      - It cleans the extracted text by removing extra spaces.

5. **Text Concatenation:**
   - The code uses list comprehensions and `'\n'.join()` to combine:
      - The text of all `main_paragraphs` into a single string (`main_text`).
      - The text of all `items` into a single string (`items_text`).

6. **DataFrame Creation:**
   - `data = {...}`: Creates a dictionary where keys represent the extracted information categories (e.g., "Region," "Archival ID") and values are the corresponding extracted text or concatenated text strings.
   - `df = pd.DataFrame([data])`: Creates a Pandas DataFrame from the `data` dictionary. Since there's only one document in this example, the DataFrame will have a single row with columns for each information category.

7. **Display:**
   - `df`: This line displays the resulting DataFrame.
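The text handling in steps 2 and 4 depends on how element trees store text: the text before the first child sits in `.text`, and the text that follows a closing tag sits in that child's `.tail`. The sketch below illustrates this on a made-up TEI fragment. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies; for this simple case the attributes used (`.text`, `.tail`, `iter()`, `itertext()`) are expected to behave the same way in `lxml`.

```python
import xml.etree.ElementTree as ET

TEI_NS = 'http://www.tei-c.org/ns/1.0'

# Made-up TEI fragment: a paragraph with a highlighted word and a line break
snippet = (
    f'<p xmlns="{TEI_NS}">Noverint <hi>universi</hi> et singuli'
    '<lb/>presentem paginam</p>'
)
p = ET.fromstring(snippet)

# itertext() yields every text node in document order (.text and .tail alike)
text_nodes = list(p.itertext())
print(text_nodes)

# Walking the tree by hand shows where each piece of text is stored
for elem in p.iter():
    local_name = elem.tag.split('}')[1]   # drop the '{namespace}' prefix
    print(local_name, repr(elem.text), repr(elem.tail))

# Line-break-aware join: insert '\n' for <lb/> and skip the tail of the
# starting element, because that text lies outside the paragraph
parts = []
for elem in p.iter():
    if elem.tag == f'{{{TEI_NS}}}lb':
        parts.append('\n')
    if elem.text:
        parts.append(elem.text)
    if elem is not p and elem.tail:
        parts.append(elem.tail)
text_with_breaks = ''.join(parts)
print(repr(text_with_breaks))
```

Joining the raw pieces without stripping each one keeps the original spacing intact; stripping and re-joining with spaces, as the notebook code does, is an alternative that normalizes whitespace at the cost of occasionally adding a space before punctuation.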

In [104]:
import os
import pandas as pd
from lxml import etree

# Namespace dictionary
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Function to extract text from elements, taking care of namespaces and concatenating text from child elements
def extract_text(element):
    if element is not None:
        return ' '.join(element.itertext()).strip()
    return ''

# Function to extract and clean text from an element and all of its descendants
def get_full_text(element):
    if element is not None:
        text = []
        for elem in element.iter():
            if elem.text:
                text.append(elem.text.strip())
            if elem.tag == '{http://www.tei-c.org/ns/1.0}lb':
                text.append('\n')  # Handle line breaks
            # The tail of the starting element lies outside of it, so skip that one
            if elem is not element and elem.tail:
                text.append(elem.tail.strip())
        return ' '.join(text).replace('  ', ' ').strip()
    return ''

def process_xml_file(file_path):
    # Parse the XML file
    tree = etree.parse(file_path)
    root = tree.getroot()

    # Extract relevant text parts
    region = extract_text(root.find('.//tei:region', ns))
    settlement = extract_text(root.find('.//tei:settlement', ns))
    repository = extract_text(root.find('.//tei:repository', ns))
    idno = extract_text(root.find('.//tei:idno', ns))
    material = extract_text(root.find('.//tei:support/tei:material', ns))
    condition = extract_text(root.find('.//tei:condition/tei:objectName', ns))
    copy_status = extract_text(root.find('.//tei:copyStatus', ns))
    items = root.findall('.//tei:item', ns)
    main_paragraphs = root.findall('.//tei:p', ns)

    # Concatenate all main text paragraphs
    main_text = '\n'.join(get_full_text(p) for p in main_paragraphs)

    # Collect items text
    items_text = '\n'.join(extract_text(item) for item in items)

    # Create a dictionary with the extracted information
    data = {
        "Filename": os.path.basename(file_path),
        "Region": region,
        "Settlement": settlement,
        "Repository": repository,
        "Archival ID": idno,
        "Material": material,
        "Condition": condition,
        "Copy Status": copy_status,
        "Abstract": items_text,
        "Main Text": main_text
    }

    return data

# Directory containing the XML files
directory = 'data/ad_xml/'

# List to store data from each file
all_data = []

# Iterate over all XML files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.xml'):
        file_path = os.path.join(directory, filename)
        file_data = process_xml_file(file_path)
        all_data.append(file_data)

# Convert the list of dictionaries to DataFrame
df = pd.DataFrame(all_data)

# Display
df[:5]

Unnamed: 0,Filename,Region,Settlement,Repository,Archival ID,Material,Condition,Copy Status,Abstract,Main Text
0,AD38-B-3545-6.xml,France,Saint-Martin-d'Hères,Archives Départementales de l'Isère,"B 3545, n°6",Non déterminé,Photographie personnelle,Non déterminé,"Reg. Viv. 205 . Ratfication par Guillemette, v...",\nNoverint universi et singuli presentem pagin...
1,AD38-B-3552-3.xml,France,Saint-Martin-d'Hères,Archives Départementales de l'Isère,"B 3552, n°3",Parchemin,Photographie personnelle,Original,Reg. Viv. 238 . Quittance délivrée par Odebert...,\nNoverint universi presentes pariter et futur...
2,AD38-B-3545-3.xml,France,Saint-Martin-d'Hères,Archives Départementales de l'Isère,"B 3545, n°3",Parchemin,Photographie personnelle,Original,Reg. Viv. 204 . Obligation souscrite par Odebe...,\nSceau protégé\nNoverint universi et singuli ...
3,AD38-B-3894-5.xml,France,Saint-Martin-d'Hères,Archives Départementales de l'Isère,"B 3894, n°5",Parchemin,Photographie personnelle,Original,"Reg. Viv. 120 . Acte par lequel Aymard, comte ...",\nReste de ficelle\ntraces de chirographe\nNov...
4,AD38-B-3894-4.xml,France,Saint-Martin-d'Hères,Archives Départementales de l'Isère,"B 3894, n°4",Parchemin,Photographie personnelle,Original,Reg. Viv. 119 . Vente par Henri de Barrès à Ai...,"\nReste de ficelle\ntraces de chirographe\n, e..."


## Docx (Classical/Medieval Greek)

In [79]:
!pip install python-docx==0.8.11 --quiet

In [81]:
import pandas as pd
from docx import Document
import re
import unicodedata

# Path to your document
document_path = 'data/medieval_greek/phlorios_and_platziaflora_part.docx'
document = Document(document_path)

# Extract text and metadata
data = []
current_line_number = 0

def is_greek(char):
    try:
        return unicodedata.name(char).startswith(('GREEK', 'COMBINING GREEK'))
    except ValueError:
        return False

def keep_greek_and_punctuation(text):
    return ''.join(char for char in text if is_greek(char) or unicodedata.category(char).startswith('P') or char.isspace())

for paragraph in document.paragraphs:
    lines = paragraph.text.split('\n')  # Split paragraph into lines based on \n
    for line in lines:
        line = line.strip()

        # Regular expression to match line numbers in parentheses
        match = re.search(r'\((\d+)\)', line)
        if match:
            current_line_number = int(match.group(1))  # Extract line number
            line = line.replace(match.group(0), '')  # Remove line number

            # Split on first tab or a series of spaces if present
            parts = re.split(r'\t|  +', line.strip(), maxsplit=1)  # Strip spaces after removing line number
            text = parts[0]

            # Keep Greek characters, punctuation, and whitespace
            text = keep_greek_and_punctuation(text)

            data.append({'Line Number': current_line_number, 'Text': text})
        elif current_line_number > 0:  # Continuation of previous line
            continuation_text = keep_greek_and_punctuation(line)
            data[-1]['Text'] += ' ' + continuation_text  # Append to the last entry's text
        else:
            text = keep_greek_and_punctuation(line)
            data.append({'Line Number': current_line_number, 'Text': text})  # handle title line

# Create DataFrame
df = pd.DataFrame(data)

df[:50]

Unnamed: 0,Line Number,Text
0,1,Εἷς καβελλάρης εὐγενὴς ὁρμώμενος ἐκ Ρώμης ἀνδρ...
1,5,"Ὑπῆρχε γὰρ εὐγενική, τὸ εἶδος κρυσταλλόχροια, ..."
2,10,Ἰδὼν δὲ ὁ αὐτῆς ἀνὴρ αὐτῆς τὴν ἀτεκνίαν ἐκ βάθ...
3,15,τοῦ χάριν δοῦναι αἰτήσεως ἵνα τεκνοποιήσῃ ὁ δὲ...
4,20,Ἰδὼν δὲ τὴν ὑπόσχεσιν ἀπάρτι πληρωθεῖσαν ἔλαβε...
5,25,"Μετὰ δὲ τοῦ πορεύεσθαι στράταν τοῦ ταξιδίου, ἐ..."
6,30,πλῆθος πολλῶν καβαλλαριῶν ἔσυρεν συντροφία καὶ...
7,35,καὶ βίγλας ἔστησεν πολλὰς βλέποντες τὰς κλεισο...
8,40,"Ὥστε ὑπῆρχεν μετ’ αὐτῶν καὶ ὁ ἀνὴρ ἐκεῖνος, ἐκ..."
9,45,τοῖς ὁμοφύλλοις ἤρξατο κελεύειν καὶ προστάσσει...


In [82]:
print(df['Text'][2])

Ἰδὼν δὲ ὁ αὐτῆς ἀνὴρ αὐτῆς τὴν ἀτεκνίαν ἐκ βάθους τῆς αὑτοῦ ψυχῆς Θεὸν ἐξιλεοῦτο καὶ πρέσβυν παρεστήσατο μύστην τοῦ τηλικούτου Ἰάκωβον, τὸν ἔνδοξον ἀπόστολον Κυρίου, ὡσὰν νομίζων παρρησιὰν ἔχειν πρὸς τὸν Δεσπότην


## Hugging Face dataset

In [None]:
!pip install datasets==2.10.0 --quiet

In [None]:
from datasets import load_dataset

# Load a dataset from the Hugging Face Hub (replace the name with the dataset you need)
dataset = load_dataset("pnadel/latin_sentences")

# Explore the dataset splits and features
dataset
dataset['train'].features

# Access and print the first few examples
dataset["train"][:5]





{'f_name': ['./data/stoa0238/stoa007/stoa0238.stoa007.perseus-lat2.xml',
  './data/phi1294/phi002/phi1294.phi002.perseus-lat2.xml',
  './data/phi1020/phi001/phi1020.phi001.perseus-lat2.xml',
  './data/stoa0089/stoa012/stoa0089.stoa012.perseus-lat2.xml',
  './data/phi0860/phi001/phi0860.phi001.perseus-lat2.xml'],
 'title': ['Contra Symmachum',
  'Epigrammata',
  'Thebais',
  'Panegyricus de sexto consulatu Honorii Augusti',
  'Historiae Alexandri Magni'],
 'author': ['Prudentius',
  'Martial',
  'P. Papinius Statius',
  'Claudian',
  'Curtius Rufus, Quintus'],
 'text': ['aut docet occultus quae sacra Diespiter infans ',
  'Fulmineo spumantis apri sum dente perempta, ',
  'sed premit et saevas miserantibus ingerit hastas, ',
  'mens tamen ad silvas et sua lustra redit.',
  'Redditis deinde litteris constituerunt prima lucead Parmenionem coire  Iamque ceteris quoque litterasregis attulerat, iam ad eum venturi erant, cum ParmenioniPolydamanta venisse nuntiaverunt']}

In [None]:
df = pd.DataFrame({'Split': ['train'] * len(dataset["train"]["text"]) + ['test'] * len(dataset["test"]["text"]),
                   'Text': dataset["train"]["text"] + dataset["test"]["text"]})
df[:5]

Unnamed: 0,Split,Text
0,train,aut docet occultus quae sacra Diespiter infans
1,train,"Fulmineo spumantis apri sum dente perempta,"
2,train,sed premit et saevas miserantibus ingerit hast...
3,train,mens tamen ad silvas et sua lustra redit.
4,train,Redditis deinde litteris constituerunt prima l...


## CSV / TSV

CSV/TSV (Comma-/Tab-Separated Values) is a simple, widely supported plain-text format for storing tabular data. Its human-readable structure, lack of special library requirements, and straightforward row-and-column layout make it easy to use and understand. CSV is often suitable for small to medium datasets, quick analyses, or data exchange where compatibility is paramount. While not as performant as Parquet for large datasets or complex queries, its simplicity and universality make it a valuable tool for many data tasks.
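The only difference between these variants is the delimiter, which `pd.read_csv` takes through its `sep` parameter. The following minimal sketch reads the same made-up table once with semicolons and once with tabs. The values are invented for illustration, and `io.StringIO` stands in for real files on disk.

```python
import io

import pandas as pd

# Made-up sample data; io.StringIO stands in for files on disk
semicolon_data = "atom_id;cei_lang_MOM;cei_date\nA1;Latin;10220101\nA2;;10221231\n"
tab_data = "atom_id\tcei_lang_MOM\tcei_date\nA1\tLatin\t10220101\nA2\t\t10221231\n"

df_semicolon = pd.read_csv(io.StringIO(semicolon_data), sep=';')
df_tab = pd.read_csv(io.StringIO(tab_data), sep='\t')

print(df_semicolon)
print(df_semicolon.equals(df_tab))  # compares values and dtypes of both frames
```

If a file loads as a single wide column, the delimiter is the first thing to check: a semicolon-separated file read with the default `sep=','` usually produces exactly that symptom.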

In [None]:
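import pandas as pd

# Read the semicolon-separated CSV into a DataFrame
# (relative path, resolved against the working directory set at the top of the notebook)
df = pd.read_csv('data/99NLP/99_charters_NLP.csv', sep=";")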

df[:5]

Unnamed: 0,atom_id,cei_placeName,cei_lang_MOM,cei_tenor,cei_date,cei_date_ATTRIBUTE_value,cei_dateRange,cei_dateRange_ATTRIBUTE_from,cei_dateRange_ATTRIBUTE_to,cei_abstract,cei_abstract_foreign,cei_graphic_ATTRIBUTE_url_orig,cei_graphic_ATTRIBUTE_url_copy
0,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,"XVI. QUOD MULTA ALIA, SE VIVENTE, ADQUISIERI...",99999999.0,99999999.0,,,,,,,00000147.png
1,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,XXX. In Aldomhem habet casam indominicatam c...,99999999.0,99999999.0,,,,,,,"00000213.png, 00000214.png"
2,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,"XXVII. In Pupurninga villa habet ecclesiam, ...",99999999.0,99999999.0,,,,,,,"00000212.png, 00000213.png"
3,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,"LVI. DE ELECTIONE JOHANNIS, EPISCOPI MORINEN...",99999999.0,99999999.0,,,,,,,00000376.png
4,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,IX. DE C0NSTB1 CTIONE CENOBII SANCTI VINNOCC...,,,assumpte Christi nativitatis vigésimo secundo\...,10220101.0,10221231.0,,,,"00000288.png, 00000289.png"


## Parquet

**Why Parquet?**

Parquet is a columnar storage format that is highly optimized for big data processing and analytics. Here are some of its advantages over CSV:

- Compression: Parquet files are typically much smaller than the equivalent CSV files thanks to columnar encoding and compression, which saves storage space and can reduce read/write times.
- Performance: Parquet enables faster data loading and querying, especially for analytical workloads, as it only needs to read the columns relevant to your query.
- Type Preservation: Parquet preserves the data types (e.g., integer, float, string) of your columns, ensuring data integrity.

In [None]:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Save as Parquet: convert the DataFrame to an Arrow table, then write it to disk
pq.write_table(pa.Table.from_pandas(df), 'data/99NLP/99_charters_NLP.parquet')

In [None]:
df = pd.read_parquet('data/99NLP/99_charters_NLP.parquet')
df[:5]

Unnamed: 0,atom_id,cei_placeName,cei_lang_MOM,cei_tenor,cei_date,cei_date_ATTRIBUTE_value,cei_dateRange,cei_dateRange_ATTRIBUTE_from,cei_dateRange_ATTRIBUTE_to,cei_abstract,cei_abstract_foreign,cei_graphic_ATTRIBUTE_url_orig,cei_graphic_ATTRIBUTE_url_copy
0,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,"XVI. QUOD MULTA ALIA, SE VIVENTE, ADQUISIERI...",99999999.0,99999999.0,,,,,,,00000147.png
1,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,XXX. In Aldomhem habet casam indominicatam c...,99999999.0,99999999.0,,,,,,,"00000213.png, 00000214.png"
2,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,"XXVII. In Pupurninga villa habet ecclesiam, ...",99999999.0,99999999.0,,,,,,,"00000212.png, 00000213.png"
3,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,"LVI. DE ELECTIONE JOHANNIS, EPISCOPI MORINEN...",99999999.0,99999999.0,,,,,,,00000376.png
4,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,IX. DE C0NSTB1 CTIONE CENOBII SANCTI VINNOCC...,,,assumpte Christi nativitatis vigésimo secundo\...,10220101.0,10221231.0,,,,"00000288.png, 00000289.png"


## JSON

**Why JSON?**

JSON (JavaScript Object Notation) is a lightweight, human-readable data interchange format widely used for transmitting data between web applications and servers. Here are some reasons to consider it:

- Universality: JSON is language-independent, meaning you can easily use it with Python, JavaScript, and other programming languages.
- Human Readable: The structure of JSON data is easy for humans to understand, making it great for debugging or manual inspection.
- Size: JSON repeats keys for every record, so files are generally larger than the equivalent CSV, and JSON is far less efficient than Parquet for numerical data analysis. In exchange, it can represent nested structures that a flat table cannot.
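
How a DataFrame is laid out in JSON depends on the `orient` argument of `to_json`. The sketch below uses a made-up two-row DataFrame (`df_demo`; the values are illustrative only) to compare the default column-oriented layout with line-delimited records, one JSON object per row, which is often easier to inspect and to process line by line.

```python
import io
import pandas as pd

# Hypothetical demo data reusing two column names from the charter table
df_demo = pd.DataFrame({
    "atom_id": ["a", "b"],
    "cei_tenor": ["XVI. QUOD MULTA", "XXX. In Aldomhem"],
})

# Default layout (orient="columns"): {column: {row_label: value}}
as_columns = df_demo.to_json()

# Line-delimited records: one JSON object per row
as_lines = df_demo.to_json(orient="records", lines=True)

# Read the line-delimited form back into a DataFrame
restored = pd.read_json(io.StringIO(as_lines), lines=True)
```

Whichever layout is chosen for writing, the matching options (`orient`, `lines`) have to be passed to `pd.read_json` when reading the file back.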

In [None]:
# Save to JSON (replace the path with your desired filename)
df.to_json('data/99NLP/99_charters_NLP.json')

In [None]:
df = pd.read_json('data/99NLP/99_charters_NLP.json')  # Replace with your file path
df[:5]

Unnamed: 0,atom_id,cei_placeName,cei_lang_MOM,cei_tenor,cei_date,cei_date_ATTRIBUTE_value,cei_dateRange,cei_dateRange_ATTRIBUTE_from,cei_dateRange_ATTRIBUTE_to,cei_abstract,cei_abstract_foreign,cei_graphic_ATTRIBUTE_url_orig,cei_graphic_ATTRIBUTE_url_copy
0,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,"XVI. QUOD MULTA ALIA, SE VIVENTE, ADQUISIERI...",99999999.0,99999999.0,,,,,,,00000147.png
1,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,XXX. In Aldomhem habet casam indominicatam c...,99999999.0,99999999.0,,,,,,,"00000213.png, 00000214.png"
2,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,"XXVII. In Pupurninga villa habet ecclesiam, ...",99999999.0,99999999.0,,,,,,,"00000212.png, 00000213.png"
3,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,"LVI. DE ELECTIONE JOHANNIS, EPISCOPI MORINEN...",99999999.0,99999999.0,,,,,,,00000376.png
4,"tag:www.monasterium.net,2011:/charter/AbbayeDe...",,,IX. DE C0NSTB1 CTIONE CENOBII SANCTI VINNOCC...,,,assumpte Christi nativitatis vigésimo secundo\...,10220101.0,10221231.0,,,,"00000288.png, 00000289.png"
