# Extracting NewsML Headlines with XML

NewsML is a standard XML format for news articles, used by many news and
media outlets. This notebook demonstrates how to extract headlines and other fields from NewsML XML files
using the [`xml.etree.ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html) module.
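
As a minimal sketch of the lookup style used throughout this notebook, the snippet below parses a tiny hand-written fragment and reads both an element's text and an attribute. The fragment is not a real AFP file; its element names simply mirror the ones queried later, such as `NewsLines/HeadLine` and `OfInterestTo`.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; real NewsML files are much larger.
SAMPLE_XML = """
<NewsML>
  <NewsItem>
    <NewsComponent>
      <NewsLines>
        <HeadLine>  Example headline  </HeadLine>
      </NewsLines>
      <DescriptiveMetadata>
        <OfInterestTo FormalName="sport"/>
      </DescriptiveMetadata>
    </NewsComponent>
  </NewsItem>
</NewsML>
"""

root = ET.fromstring(SAMPLE_XML)

# './/' searches all descendants of the root, at any depth.
headline = root.find('.//NewsLines/HeadLine')
headline_text = headline.text.strip() if headline is not None and headline.text else None

# Attributes are read with .get(), which returns None when the attribute is absent.
tag = root.find('.//DescriptiveMetadata/OfInterestTo')
tag_name = tag.get('FormalName') if tag is not None else None

print(headline_text)
print(tag_name)
```

Note that `find` returns `None` when nothing matches, which is why each lookup is guarded before `.text` or `.get()` is used.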

The project is based on the following job spec:
> Need a simple Python script to periodically
search through a given folder for XML files and
parse them into a list of dicts. The list will be
passed to a different component that creates
an RSS service xml file, which already exists
and is not required to be written.
>
>A sample of the folder to check is attached. It
includes multiple formats, some of them with
images attached. Some of the xml files will
refer to images that were not added to the zip
file to keep it small. The format has a spec
(NewsML 1.2) and is documented here:
https://www.afp.com/communication/iris/Guide_to_AFP_NewsML-G2.html
Skeleton code for checking the folder and
passing the data to the RSS service will be
provided, but the parse function is what's
missing.
>
>Required values to parse are:
>* Headline
>* Topic
>* Tags
>* Authors
>* Date
>* Content
>* Location

I'm focussing on text extraction. Images could be handled by adding another function, but the right approach would depend on the client's requirements. I wasn't able to discuss that with them, so I'm leaving images out.

In [1]:
import os
import pandas as pd
from collections import deque
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

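import random
# Note: this seeds Python's `random` module only. pandas' `DataFrame.sample`
# draws from NumPy's generator, so pass `random_state=` there for a repeatable sample.
random.seed(42)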

In [2]:
def extract_author_or_provider(file_path):
    '''Extract author information from an XML file,
      or provider information as a fallback'''
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()

        # First, try to find author information
        for byline in root.iter('ByLine'):
            author_name = byline.text
            return author_name  # Return author name if found

        # If author is not found, try to find provider information
        for provider in root.iter('Provider'):
            party = provider.find('Party')
            if party is not None and 'FormalName' in party.attrib:
                provider_name = party.attrib['FormalName']
                return provider_name  # Return provider name if found

        # Return None if neither author nor provider is found
        return None

    except ET.ParseError:
        return "XML Parse Error"
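
`extract_author_or_provider` parses the file itself, although `parse_newsml_xml` already holds the parsed root when it calls it. A possible variant, sketched here under the assumption that the same `ByLine` and `Provider/Party` lookups apply, takes the root directly. The name `author_or_provider_from_root` is hypothetical, and unlike the function above this sketch skips a `ByLine` whose text is empty.

```python
import xml.etree.ElementTree as ET

def author_or_provider_from_root(root):
    """Return the first non-empty ByLine text, else the provider's FormalName, else None."""
    # Prefer an explicit author name.
    for byline in root.iter('ByLine'):
        if byline.text and byline.text.strip():
            return byline.text.strip()
    # Fall back to the provider's FormalName attribute.
    for provider in root.iter('Provider'):
        party = provider.find('Party')
        if party is not None and party.get('FormalName'):
            return party.get('FormalName')
    return None

# Hand-written fragments for illustration only.
with_byline = ET.fromstring('<NewsItem><ByLine> Jane Doe </ByLine></NewsItem>')
provider_only = ET.fromstring(
    '<NewsItem><Provider><Party FormalName="AFP"/></Provider></NewsItem>'
)
neither = ET.fromstring('<NewsItem/>')

print(author_or_provider_from_root(with_byline))
print(author_or_provider_from_root(provider_only))
print(author_or_provider_from_root(neither))
```

Taking the root as an argument avoids reading each file twice and keeps parse errors handled in one place.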

In [3]:
def parse_headlines(root):
    """
    Parse the XML root and extract the headline, falling back to the first
    body paragraph when no headline is found.

    Args:
    root (Element): The root element of the parsed XML document.

    Returns:
    str: The extracted headline or alternative text, or an appropriate message if not found.
    """
    try:
        # Looking for the HeadLine tag within the NewsLines section
        headline = root.find('.//NewsLines/HeadLine')
        if headline is not None and headline.text is not None:
            return headline.text.strip()
        else:
            # If the headline is not found or empty, look for the first paragraph in body.content
            first_paragraph = root.find('.//body.content/p')
            if first_paragraph is not None and first_paragraph.text is not None:
                return first_paragraph.text.strip()
            else:
                return "Headline or alternative text not found in the file."
    except Exception as e:
        return f"Error processing the XML: {e}"

In [4]:
def extract_content(root):
    """
    Extract news text from <DataContent> within <ContentItem> tags.

    Args:
    root (ET.Element): The root of the ET tree.

    Returns:
    str: The extracted news text.
    """
    news_text = []

    # Find all <ContentItem> tags and then extract text from <DataContent> <p> tags
    for content_item in root.findall('.//ContentItem'):
        for data_content in content_item.findall('.//DataContent'):
            for p in data_content.findall('.//p'):
                if p.text:
                    news_text.append(p.text)

    return '\n'.join(news_text)
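
One thing to keep in mind: `p.text` holds only the text that precedes the first child element, so a paragraph containing inline markup would be cut short. Whether the AFP files contain such markup is not established here. If they do, a variant built on `itertext()` could look like the sketch below; the function name and the sample fragment are made up for illustration.

```python
import xml.etree.ElementTree as ET

def extract_content_with_inline_markup(root):
    """Like extract_content, but keeps text nested inside child elements of <p>."""
    paragraphs = []
    for content_item in root.findall('.//ContentItem'):
        for data_content in content_item.findall('.//DataContent'):
            for p in data_content.findall('.//p'):
                # itertext() yields the element's text plus all nested text and tails.
                text = ''.join(p.itertext()).strip()
                if text:
                    paragraphs.append(text)
    return '\n'.join(paragraphs)

# Hand-written fragment for illustration only.
sample_root = ET.fromstring(
    '<NewsComponent><ContentItem><DataContent>'
    '<p>Plain paragraph.</p>'
    '<p>Text with <b>bold</b> inside.</p>'
    '</DataContent></ContentItem></NewsComponent>'
)

second_p = sample_root.findall('.//p')[1]
print(repr(second_p.text))  # only the text before <b>
print(extract_content_with_inline_markup(sample_root))
```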

In [5]:
def parse_newsml_xml(file_path):
    """
    Parse a NewsML XML file and extract the required information.

    :param file_path: Path to the NewsML XML file.
    :return: Dictionary with extracted data.
    """
    try:
        # Parse the XML file
        tree = ET.parse(file_path)
        root = tree.getroot()

        # Initialize data dictionary
        news_data = {
            'Headline': None,
            'Topic': None,
            'Tags': None,
            'Authors': None,
            'Date': None,
            'Content': None,
            'Location': None
        }

        # Extracting Headline
        news_data['Headline'] = parse_headlines(root)

        # Extracting Topic (NameLabel)
        topic = root.find(".//Identification/NameLabel")
        if topic is not None:
            news_data['Topic'] = topic.text

        # Extracting Tags (OfInterestTo)
        tags = root.findall(".//DescriptiveMetadata/OfInterestTo")
        if tags:
            tags = ', '.join([tag.get('FormalName') for tag in tags if tag.get('FormalName')])
            news_data['Tags'] = tags.replace('--', ', ')
        else:
            news_data['Tags'] = None

        # Extracting Date (FirstCreated)
        date = root.find(".//NewsManagement/FirstCreated")
        if date is not None:
            date_text = date.text
            try:
                # First, try to parse without timezone assuming UTC ('Z' at the end)
                datetime_obj = datetime.strptime(date_text, "%Y%m%dT%H%M%SZ")
                datetime_obj = datetime_obj.replace(tzinfo=timezone.utc)
            except ValueError:
                # Next, try to parse with timezone offset
                datetime_obj = datetime.strptime(date_text, "%Y%m%dT%H%M%S%z")
                # Convert to UTC
                datetime_obj = datetime_obj.astimezone(timezone.utc)

            # Format the datetime object to ISO 8601 format in UTC
            news_data['Date'] = datetime_obj.strftime("%Y-%m-%dT%H:%M:%S%z")


        # Extracting Location
        location = root.find(".//Location")
        if location is not None:
            country = location.find(".//Property[@FormalName='Country']")
            city = location.find(".//Property[@FormalName='City']")
            # Join whichever of city and country carry a Value, skipping missing ones
            parts = [prop.get('Value') for prop in (city, country)
                     if prop is not None and prop.get('Value')]
            news_data['Location'] = ', '.join(parts)

        # Extracting Authors (falls back to the provider name when no ByLine exists)
        news_data['Authors'] = extract_author_or_provider(file_path)

        # Extracting Content
        news_data['Content'] = extract_content(root)

        return news_data

    except ET.ParseError:
        return {'error': 'Failed to parse XML file'}
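
The date handling in `parse_newsml_xml` can be exercised in isolation. The helper below is a sketch that mirrors the two `strptime` formats used there; the name `to_utc_string` and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timezone

def to_utc_string(date_text):
    """Normalise a compact timestamp ('...Z' or with a numeric offset) to UTC."""
    try:
        # Format with a literal 'Z' suffix: treat as UTC.
        parsed = datetime.strptime(date_text, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    except ValueError:
        # Format with a numeric offset such as +0200: convert to UTC.
        parsed = datetime.strptime(date_text, "%Y%m%dT%H%M%S%z").astimezone(timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%S%z")

print(to_utc_string("20231126T120145Z"))
print(to_utc_string("20231126T140145+0200"))
```

Both calls print `2023-11-26T12:01:45+0000`. A string that matches neither format raises `ValueError` from the second `strptime`, which `parse_newsml_xml` does not catch, since it only handles `ET.ParseError`.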


In [6]:
def process_xml_files_iteratively(folder_path):
    """
    Iteratively searches through the folder tree for XML files, applies parse_newsml_xml to each,
    and counts the number of files processed in each folder. Returns the parsed records as a DataFrame.
    """
    xml_count = {}
    content_list = []
    queue = deque([folder_path])  # Initialize the queue with the root folder

    while queue:
        current_path = queue.popleft()  # Get the next directory to process
        current_count = 0

        # Attempt to list directories and files in the current_path
        try:
            with os.scandir(current_path) as it:
                for entry in it:
                    if entry.is_dir():
                        queue.append(entry.path)  # Add subdirectories to the queue
                    elif entry.is_file() and entry.name.endswith('.xml'):
                        content_list.append(parse_newsml_xml(entry.path))
                        current_count += 1
        except PermissionError:
            print(f"Permission denied: {current_path}")

        if current_count > 0:
            xml_count[current_path] = current_count

    for x in xml_count:
        print(f'Processed {xml_count[x]} files in {x}')

    df = pd.DataFrame(content_list)
    return df

content = process_xml_files_iteratively('afp')

# Save to csv
content.to_csv('parsed_xml.csv', index=False)

content.info()

Processed 48 files in afp\ARA_Media_SansSport
Processed 26 files in afp\arabic\journal\minaldounia
Processed 22 files in afp\arabic\journal\sport
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Headline  96 non-null     object
 1   Topic     96 non-null     object
 2   Tags      47 non-null     object
 3   Authors   95 non-null     object
 4   Date      96 non-null     object
 5   Content   96 non-null     object
 6   Location  95 non-null     object
dtypes: object(7)
memory usage: 5.4+ KB


`Tags` is missing for about half of the entries (47 of 96 are non-null). This reflects the source data: the files in `afp\ARA_Media_SansSport` contain nothing suitable to use as tags. Everything else extracted almost completely, and the occasional missing value appears to come from the files themselves.
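
To confirm that the missing tags cluster in one folder, each record would need to carry its source folder, which `parse_newsml_xml` does not currently return. As a sketch, assuming a hypothetical `Folder` key were added to each dict, missing values could be tallied per folder as follows. The records here are made up for illustration.

```python
from collections import Counter

def count_missing_by_folder(records, field):
    """Count records per folder whose `field` is None or empty."""
    missing = Counter()
    for record in records:
        if not record.get(field):
            missing[record.get('Folder')] += 1
    return dict(missing)

# Made-up records; 'Folder' is a hypothetical extra key, not produced by the notebook.
records = [
    {'Folder': 'folder_a', 'Tags': None},
    {'Folder': 'folder_a', 'Tags': ''},
    {'Folder': 'folder_b', 'Tags': 'arabic, journal, sport'},
    {'Folder': 'folder_b', 'Tags': None},
]

print(count_missing_by_folder(records, 'Tags'))
```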

In [7]:
sample = content.sample(10)
sample

Unnamed: 0,Headline,Topic,Tags,Authors,Date,Content,Location
65,للكلاب مطعمها في روما,ايطاليا-حيوانات-اطعمة-مطعم-خدمة دنيا,"arabic, journal, minaldounia",غيلداس لو رو,2023-11-26T12:01:45+0000,"تسود مطعم ""فيوتو"" في روما أجواء مميزة، فالإضاء...","روما, ITA"
20,وصول الرهائن المفرج عنهم من غزة إلى الأراضي ال...,اسرائيل/فلسطينيون/مفقودون/نزاع,,أ ف ب,2023-11-25T22:14:57+0000,أعلن الجيش الإسرائيلي أن المجموعة الثانية من ا...,"القدس, ZZZ"
74,دوري المحترفين: ايرفينغ ضد خطة الاستئناف في أو...,Basket-NBA-health-virus,"arabic, journal, sport",مايك ستوبي,2020-06-13T22:29:35+0000,أكدت تقارير صحافية متعددة السبت أن كايري ايرفي...,"نيويورك, USA"
56,"شخصية افتراضية عبر لعبة ""فورتنايت"" لنجدة الأطف...",فرنسا-اطفال-استغلال-العاب-خدمة دنيا,"arabic, journal, minaldounia",أرنو بوفييه,2020-06-15T13:48:34+0000,من خلال شخصية افتراضية مجنحة ترتدي الأزرق دخلت...,"باريس, FRA"
33,حماس تعلن أنها أفرجت عن رهينة روسية وسلمتها ال...,تنبيه,,أ ف ب,2023-11-26T14:41:54+0000,حماس تعلن أنها أفرجت عن رهينة روسية وسلمتها ال...,"غزة, PSE"
75,بطولة إسبانيا: برشلونة يدشن عودته إلى المنافسا...,قدم-إسبانيا-بطولة,"arabic, journal, sport",خايمي رينا,2020-06-13T21:58:58+0000,دشَّن برشلونة المتصدر وحامل اللقب عودته إلى ال...,"مدريد, ESP"
19,وصول الرهائن المفرج عنهم من قطاع غزة الى مصر (...,تنبيه,,أ ف ب,2023-11-25T21:39:51+0000,وصول الرهائن المفرج عنهم من قطاع غزة الى مصر (...,"القاهرة, EGY"
6,كييف تؤكد إسقاط 71 مسيّرة روسية ليل الجمعة السبت,تنبيه,,أ ف ب,2023-11-25T07:36:54+0000,كييف تؤكد إسقاط 71 مسيّرة روسية ليل الجمعة الس...,"كييف, UKR"
84,الألعاب الآسيوية تشعل الحلم الأولمبي للرياضات ...,اسياد-كمبيوتر-2023-2022-اندونيسيا-العاب,"arabic, journal, sport",فيصل كمال وسونغهي هوانغ,2023-09-19T05:59:38+0000,شكّل إدراج الرياضات الإلكترونية كمسابقة رسمية ...,"نيودلهي, IND"
7,هجوم المسيّرات الروسية كان الأكبر على كييف منذ...,تنبيه,,أ ف ب,2023-11-25T07:41:03+0000,هجوم المسيّرات الروسية كان الأكبر على كييف منذ...,"كييف, UKR"


I don't speak or read Arabic. I can tell that the extracted text matches the contents of the original files, but it would be better to check that the extractions actually make sense.

## Translating to English

I'm going to use [OpenAI](https://openai.com/) to translate the text to English.

In [8]:
import os
import openai
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()

# Set the OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')

def translate_to_english(text):
    # Return an empty string for missing values instead of calling the API
    if pd.isnull(text):
        return ""

    # Note: `openai.ChatCompletion` is the pre-1.0 interface of the `openai` package
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            # Adjacent string literals avoid embedding indentation in the prompt
            {"role": "system", "content": (
                "You are a helpful assistant that translates any text to English. "
                "Return only the translated text; do not comment on it."
            )},
            {"role": "user", "content": text}
        ]
    )
    # Extract the assistant's reply
    translation = response['choices'][0]['message']['content']
    return translation

# `applymap` was deprecated in pandas 2.1 in favour of `DataFrame.map`
translated_df = sample.drop('Date', axis=1).applymap(translate_to_english)
translated_df.to_csv('translated_sample.csv', index=False)
translated_df

Unnamed: 0,Headline,Topic,Tags,Authors,Content,Location
65,Dogs have their restaurant in Rome.,Italy-animals-food-restaurant-world service,"Arabic, journal, from the world.",Guildas Law Ro,"The ""Fido"" restaurant in Rome has a unique atm...","Rome, Italy"
20,The released hostages from Gaza have arrived i...,Israel/Palestinians/Missing/Conflict,,AFP,The Israeli army announced that the second gro...,"Jerusalem, ZZZ"
74,Professional League: Irving against the restar...,Basketball-NBA-health-virus,"Arabic, journal, sport",Mike Stuby,Multiple news reports on Saturday confirmed th...,"New York, USA"
56,"Virtual character through the game ""Fortnite"" ...",France - Children - Exploitation - Games - Dun...,"Arabic, journal, from the world.",Arno buffet,Through a virtual character wearing blue wings...,"Paris, FRA"
33,Hamas announces that it has released a Russian...,Warning,,AFP,Hamas announces that it has released a Russian...,"Gaza, PSE"
75,Spanish Championship: Barcelona starts its com...,Spain hosted the championship.,"العربية، مجلة، رياضة\n\nArabic, journal, sport",Khaimi Rina,"Barcelona, ​​the leader and defending champion...","Madrid, ESP"
19,The released hostages from Gaza have arrived i...,Warning,,AFP,The released hostages from Gaza have arrived i...,"Cairo, EGY"
6,How to confirm the downing of 71 Russian drone...,Warning,,AFP,How do you confirm the downing of 71 Russian d...,"Kyiv, Ukraine."
84,The Asian Games ignite the Olympic dream for e...,Masters-Computer-2023-2022-Indonesia-Games,"Arabic, journal, sport",Faisal Kamal and Songhee Huang,The inclusion of esports as an official compet...,"New Delhi, IND"
7,The Russian drone attack was the largest on Ky...,Warning,,AFP,The Russian drone attack on Kiev was the large...,"Kyiv, Ukraine"
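
The sample shows that values already in Latin script, such as the `Tags` strings and the country codes, were also sent to the model, and one tag came back with extra text. A possible refinement, sketched here and not part of the run above, is to call the API only for text that contains Arabic characters. The helper names are hypothetical, and the check covers only the basic Arabic Unicode block (U+0600 to U+06FF).

```python
import re

# Matches any character in the basic Arabic Unicode block.
ARABIC_CHARS = re.compile(r'[\u0600-\u06FF]')

def contains_arabic(text):
    """Return True when `text` is a string with at least one Arabic character."""
    return isinstance(text, str) and bool(ARABIC_CHARS.search(text))

def translate_if_arabic(text, translate):
    """Call `translate` only for Arabic text; return anything else unchanged."""
    return translate(text) if contains_arabic(text) else text

# Stand-in translator for illustration; a real run would pass translate_to_english.
fake_translate = lambda text: "<translated>"

print(translate_if_arabic("arabic, journal, sport", fake_translate))
print(translate_if_arabic("\u0631\u0648\u0645\u0627, ITA", fake_translate))
```

With this guard, missing values and Latin-only strings pass through untouched, which would also reduce the number of API calls.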
