### **Reading XML/TEI documents**

https://komax.github.io/blog/text/python/xml/parsing_tei_xml_python/

In [86]:
import os
import pandas as pd
import lxml
from slugify import slugify
from bs4 import BeautifulSoup

*Renaming the XML files*  
Zotero's default naming convention produces long file names that are awkward to handle. We rename the files to a shorter format to make them easier to work with.

In [None]:
# If the files have already been renamed, these
# instructions change nothing (so it is safe to run them again)

folder = '../xml_tei'
tei_docs = os.listdir(folder)

for file in tei_docs:
    # Slugify the stem only, so the extension never ends up in the new name
    stem = os.path.splitext(file)[0]
    new_file_name = '-'.join(slugify(stem).split('-')[:6])
    os.rename(os.path.join(folder, file), os.path.join(folder, new_file_name + '.xml'))
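
Truncating every name to six words can map two different files to the same short name, and on Unix `os.rename` silently replaces an existing destination file. The sketch below previews the renaming and reports such collisions before anything is touched. It uses only the standard library: `simple_slug` is a rough stand-in for `slugify` (it does not transliterate accented characters), and `short_name`, `find_collisions` and the sample file names are hypothetical, introduced for illustration only.

```python
import re
from collections import defaultdict
from pathlib import Path


def simple_slug(text: str) -> str:
    """Rough stand-in for slugify(): lowercase, runs of non-alphanumerics become '-'."""
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')


def short_name(file_name: str, max_words: int = 6) -> str:
    """Short name built from the first `max_words` slug tokens of the file stem."""
    stem = Path(file_name).stem
    return '-'.join(simple_slug(stem).split('-')[:max_words]) + '.xml'


def find_collisions(file_names):
    """Map each short name shared by several files to the files that produce it."""
    targets = defaultdict(list)
    for name in file_names:
        targets[short_name(name)].append(name)
    return {new: olds for new, olds in targets.items() if len(olds) > 1}


# Hypothetical Zotero-style file names, for illustration only
sample = [
    'Simon - 1955 - A behavioral model of rational choice.xml',
    'Simon - 1955 - A behavioral model of rational choice (2).xml',
    'Agosto - 2002 - Bounded rationality and satisficing in young people.xml',
]
collisions = find_collisions(sample)
print(collisions)
```

An empty dictionary means every file gets its own short name; otherwise the listed files would need a longer prefix or a manual rename first.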

*Parsing a first file*

In [None]:
tei_doc = os.listdir(folder)[0]
tei_doc

'agosto-2002-bounded-rationality-and-satisficing.xml'

In [None]:
with open(f'{folder}/{tei_doc}', 'r', encoding='utf-8') as tei:
    soup = BeautifulSoup(tei, features='xml')

In [90]:
# Display the content of the title element
title = soup.title.getText()
title

"Bounded rationality and satisficing in young people's Web‐based decision making"

In [91]:
# Display the content of the abstract
abstract = soup.abstract
abstract

<abstract/>

In [92]:
# Display the DOI of the document
doi = soup.find('idno', type='DOI').getText()
doi

'10.1002/asi.10024'
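
The lookup above assumes the file contains a DOI element. When it does not, `find` returns `None` and calling `.getText()` on it raises an `AttributeError`. A small helper can make that case explicit. `text_or_none` is a hypothetical name, not part of BeautifulSoup; the sketch only assumes its argument is either `None` or an object with a `get_text` method.

```python
def text_or_none(tag, separator=''):
    """Return the text of a parsed element, or None when the element is missing.

    `tag` is whatever soup.find(...) returned: an element, or None.
    """
    if tag is None:
        return None
    return tag.get_text(separator)
```

With this helper, the DOI lookup could be written as `text_or_none(soup.find('idno', type='DOI'))`, which yields `None` instead of an exception for files without a DOI.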

In [93]:
# Display the publisher of the article
publisher = soup.publisher.getText()
publisher

'Wiley'

In [94]:
# Display the publication date of the article
publication_date = soup.find('date', type='published')['when']
publication_date

'2001-11-16'
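
Across the corpus, the `when` attribute does not always hold a full date: the `published` column further down contains values such as `2010` and `1973-11` next to `2001-11-16`. If real dates are needed later (for sorting, for instance), a sketch like the following could normalize them. `parse_when` is a hypothetical helper, and filling a missing month or day with 1 is an assumption that adds precision the source does not have.

```python
from datetime import date


def parse_when(when):
    """Turn a TEI `when` value ('YYYY', 'YYYY-MM' or 'YYYY-MM-DD') into a date.

    A missing month or day defaults to 1; None or malformed input returns None.
    """
    if not when:
        return None
    try:
        parts = [int(part) for part in when.split('-')]
        # Pad with defaults, then keep year, month, day
        year, month, day = (parts + [1, 1])[:3]
        return date(year, month, day)
    except ValueError:
        return None


for value in ['2001-11-16', '1973-11', '2010', None]:
    print(value, '->', parse_when(value))
```

Keeping the original string alongside the parsed date preserves the information about how precise each value really was.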

In [95]:
# Display the first author of the article
first_author = soup.author.persName.get_text(" ")
first_author

'Denise E Agosto'

In [96]:
# Display the title of the journal
journal_title = soup.find('title', level='j').getText()
journal_title

'Journal of the American Society for Information Science and Technology'

In [97]:
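# Display the body of the text
body = ''

# Top-level sections only, so nested <div> elements are not processed twice
sections = soup.body.find_all('div', recursive=False)
for section in sections:
    # Some sections have no <head> element
    header = section.find('head')
    if header is not None:
        body += header.getText()
    body += '\n'

    paras = section.find_all('p')
    for para in paras:
        body += para.getText()
        body += '\n'

    body += '\n'

print(body)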

Introduction
In the simplest terms, decision making involves "selecting among possible actions" (Gilhooly, 1988, p. 132). With his theories of bounded rationality and satisficing, Simon (1955Simon ( , 1956) ) suggested that decision makers operate within time and cognitive limitations that prevent them from evaluating all possible decision outcomes. These theories are largely accepted in relation to adult decision making in traditional information environments (Tyszka, 1989). However, these theories have not been rigorously tested in relation to young people's decision making or in relation to decision making in the World Wide Web environment.
The vast majority of user studies employ adults as subjects, and computer designers and electronic resource evaluators rarely study or consult young people before designing or selecting electronic resources for youth (Druin, 1999;Laurel, 1990). This study turned to young people as sources of knowledge about their own Web-based decision making to 

*Parsing every file in a folder and collecting the desired content in a DataFrame*

In [114]:
def xml_tei_to_df(folder: str) -> pd.DataFrame:
    # Build a list of dictionaries that will become our DataFrame
    df = []

    # Rename the files if needed
    tei_docs = os.listdir(folder)

    for file in tei_docs:
        # Slugify the stem only, so the extension never ends up in the new name
        tei_doc = '-'.join(slugify(os.path.splitext(file)[0]).split('-')[:6])
        os.rename(os.path.join(folder, file), os.path.join(folder, tei_doc + '.xml'))

        # Parse the file's XML with BeautifulSoup
        with open(f'{folder}/{tei_doc}.xml', 'r', encoding='utf-8') as tei:
            soup = BeautifulSoup(tei, features='xml')

            # DOI
            try:
                doi = soup.find('idno', type='DOI').getText()
            except AttributeError:  # find() returned None: no DOI element
                print(tei_doc, 'no doi info')
                doi = None

            # first author
            try:
                first_author = soup.author.persName.get_text(" ")
            except AttributeError:  # no <author> or no <persName>
                print(tei_doc, 'no first author info')
                first_author = None

            # title
            try:
                title = soup.title.getText()
            except AttributeError:  # no <title> element
                print(tei_doc, 'no title info')
                title = None

            # abstract
            abstract = soup.abstract.get_text() if soup.abstract is not None else None

            # publication date
            try:
                publication_date = soup.find('date', type='published')['when']
            except (TypeError, KeyError):  # no such <date>, or no `when` attribute
                print(tei_doc, 'no date info')
                publication_date = None

            # journal title
            try:
                journal_title = soup.find('title', level='j').getText()
            except AttributeError:  # no journal-level <title>
                print(tei_doc, 'no journal info')
                journal_title = None

            # publisher name
            try:
                publisher = soup.publisher.getText()
            except AttributeError:  # no <publisher> element
                print(tei_doc, 'no publisher info')
                publisher = None

            # body text
            try:
                body = "\n".join(
                    [
                        (section.find('head').get_text() if section.find('head') else "") +
                        "\n" +
                        "\n".join([p.get_text() for p in section.find_all('p')])
                        for section in soup.body.find_all('div', recursive=False)
                    ]
                )
            except AttributeError:  # no <body> element
                print(tei_doc, 'no body info')
                body = None


            dic = {
                'doi' : doi,
                'first_author' : first_author,
                'title' : title,
                'abstract' : abstract,
                'published' : publication_date,
                'journal' : journal_title,
                'publisher' : publisher,
                'body' : body
            }

        df.append(dic)

    return pd.DataFrame(df)

In [116]:
df = xml_tei_to_df('../xml_tei')

azzopardi-et-al-2011-report-on no doi info
azzopardi-et-al-2013-how-query no doi info
berryman-2006-what-defines-enough-information no doi info
bouzdine-chameeva-et-al-2006-stopping no doi info
browne-et-al-2005-stopping-rule no doi info
browne-et-al-2007-cognitive-stopping no doi info
card-et-al-2001-information-scent no doi info
dalton-and-charnigo-2004-historians-and no doi info
dostert-and-kelly-2009-users-stopping no doi info
duff-and-johnson-2002-accidentally-found no doi info
gerhart-and-windsor-2017-cognitive-stopping no doi info
keen-1992-presenting-results-of-experimental no doi info
kraft-and-lee-1979-stopping-rules no doi info
nickles-et-al-1995-judgment-based no doi info
simon-1955-a-behavioral-model-of no doi info
simon-1955-a-behavioral-model-of no date info
simon-1955-a-behavioral-model-of no journal info
white-and-harding-2008-identifying-auditor no doi info
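
The messages above report gaps file by file. To get a per-field overview before building the DataFrame, the list of dictionaries can be tallied directly. The sketch below is one way to do it with the standard library; `count_missing` is a hypothetical helper and the records are illustrative values shaped like the dictionaries built in `xml_tei_to_df`, not actual results.

```python
from collections import Counter


def count_missing(records):
    """Count, per field, how many records hold None or an empty string."""
    missing = Counter()
    for record in records:
        for field, value in record.items():
            if value is None or value == '':
                missing[field] += 1
    return dict(missing)


# Illustrative records, shaped like the dictionaries built in xml_tei_to_df
records = [
    {'doi': '10.1002/asi.10024', 'abstract': '', 'publisher': 'Wiley'},
    {'doi': None, 'abstract': 'Some abstract text', 'publisher': None},
    {'doi': None, 'abstract': 'Another abstract', 'publisher': None},
]
print(count_missing(records))
```

Counting empty strings as missing matters here, because an empty `<abstract/>` element yields `''` rather than `None`.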


In [117]:
df

| | doi | first_author | title | abstract | published | journal | publisher | body |
|---|---|---|---|---|---|---|---|---|
| 0 | 10.1002/asi.10024 | Denise E Agosto | Bounded rationality and satisficing in young p... | | 2001-11-16 | Journal of the American Society for Informatio... | Wiley | Introduction\nIn the simplest terms, decision ... |
| 1 | | Leif Azzopardi | Report on the SIGIR 2010 Workshop on the Simul... | \nAll search in the real-world is inherently i... | 2010 | Information Storage and Retrieval | | Introduction\nThe use of simulation to evaluat... |
| 2 | | Leif Azzopardi | How Query Cost Affects Search Behavior | \nWe investigate how the cost associated with ... | 2012 | Communications of the ACM | | INTRODUCTION\nDuring interactive information r... |
| 3 | | Jennifer Berryman | What defines 'enough' information? How policy ... | \nIntroduction. Reports findings from research... | 2001 | Journal of the American Society for Informatio... | | Introduction\nThis paper reports on ongoing re... |
| 4 | | Professor Tatiana Bouzdine-Chameeva | STOPPING RULES IN INFORMATION SEARCH IN ONLINE... | \nHe is lecturing in system analysis, manageme... | 2014-06-04 | Journal of Wine Research | | ABSTRACT\nDuring the recent years, the World W... |
| 5 | | Glenn J Browne | Stopping Rule Use During Web-Based Search | \nThe world wide web has become a ubiquitous t... | 1984 | Acta Psychologica | | Introduction\nSearch behavior is now a princip... |
| 6 | | Glenn J Browne | Cognitive Stopping Rules for Terminating Infor... | \nOnline search has become a significant activ... | 1984 | Acta Psychologica | | Introduction MI^H\nSearch behavior is a ubiqui... |
| 7 | | Stuart K Card | Information Scent as a Driver of Web Behavior ... | \nThe purpose of this paper is to introduce a ... | 1999 | Int. J. of Human-Computer Studies | | INTRODUCTION\nThe development of predictive sc... |
| 8 | 10.1002/asi.4630240603 | William S Cooper | On selecting a measure of retrieval effectiven... | \nIt was argued in Part I (see JASIS, March-Ap... | 1973-11 | Journal of the American Society for Informatio... | Wiley | \ntrieval system is, in principle at least, to... |
| 9 | 10.1145/3406522.3446030 | Anita Crescenzi | Adaptation in Information Search and Decision-... | \nPrior work in IR has found that searchers un... | 2021-03-14 | Journal of the American society for Informatio... | ACM | INTRODUCTION\nA recognized challenge in inform... |


In [119]:
df.to_csv('../csvs/Ilani_Nowkarizi_Arastoopoor_2024.csv', index=False)
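
`to_csv` writes into an existing directory but does not create a missing one, so the call above fails if `../csvs` is absent. A minimal sketch of a guard, using only the standard library; `ensure_parent_dir` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of `file_path` if it is missing; return the path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / 'csvs' / 'example.csv')
    print(target.parent.is_dir())
```

Since `to_csv` accepts path objects, the export could then be written as `df.to_csv(ensure_parent_dir('../csvs/Ilani_Nowkarizi_Arastoopoor_2024.csv'), index=False)`.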