## Exercise 6 - XML

1 Import the necessary libraries (xml and pandas).

In [3]:
import xml.etree.ElementTree as ET
import pandas as pd

2 Save the XML structure shown below as a string in a variable. In addition, load the content from the file *users_first.xml* into another variable.

In [5]:
xml_string = """<?xml version='1.0' encoding='UTF-8'?>
<records>
    <record type='customer'>
        <name>Lawrence Burke</name>
        <phone>(815) 571-8746</phone>
        <email>vulputate.lacus@elitfermentum.net</email>
        <address>
            <streetaddress>P.O. Box 478, 5254 Mi St.</streetaddress>
            <country>Mexico</country>
            <postalZip>22044</postalZip>
        </address>
    </record>
    <record type='employee'>
        <name>Angela Hinton</name>
        <phone>(997) 412-6965</phone>
        <email>elit.dictum@egestas.net</email>
        <address>
            <streetaddress>696-4409 Nunc Rd.</streetaddress>
            <country>Germany</country>
            <postalZip>661217</postalZip>
        </address>
    </record>
    <record type='customer'>
        <name>Whitney Flowers</name>
        <phone>1-573-471-9738</phone>
        <email>consectetuer.adipiscing.elit@ridiculusmus.ca</email>
        <address>
            <streetaddress>6978 Et, Av.</streetaddress>
            <country>Indonesia</country>
            <postalZip>28164</postalZip>
        </address>
    </record>
    <record type='customer'>
        <name>Darius Burt</name>
        <phone>1-675-212-4934</phone>
        <email>sociis.natoque@ut.org</email>
        <address>
            <streetaddress>7705 Ut Street</streetaddress>
            <country>Indonesia</country>
            <postalZip>94783-300</postalZip>
        </address>
    </record>
    <record type='customer'>
        <name>Solomon Blair</name>
        <phone></phone>
        <email>pede.suspendisse.dui@commodoat.edu</email>
        <address>
            <streetaddress>Ap #877-8946 Sociosqu Street</streetaddress>
            <country>Germany</country>
            <postalZip>113513</postalZip>
        </address>
    </record>
</records>"""

In [6]:
# Load XML from file
with open('users_first.xml', 'r', encoding='utf-8') as f:
    xml_file_content = f.read()
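
As an alternative sketch, `ET.parse` can read an XML file directly, so the parser itself handles the encoding named in the XML declaration. The file below is a hypothetical stand-in written to a temporary directory so that the snippet runs on its own; its name and content are assumptions, not the real *users_first.xml*.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for users_first.xml (made-up content for illustration).
sample_xml = (
    "<?xml version='1.0' encoding='UTF-8'?>\n"
    "<records><record type='customer'><name>Example User</name></record></records>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_users.xml")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(sample_xml)

    # Parse straight from the file path; the root element is <records>.
    sample_root = ET.parse(sample_path).getroot()

# ET.tostring returns bytes unless encoding="unicode" is requested.
sample_as_string = ET.tostring(sample_root, encoding="unicode")
```

Reading the file into a string, as the exercise asks, remains the right choice here because step 3 treats both inputs the same way.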

3 From the contents of both variables, build lists C1 and C2 that store the XML records as dictionary objects. **NOTE**: the attributes defined at the record level must be included in the objects, and the elements nested under the address element must be lifted to the top level, next to the user's other properties!

Below is a model showing the structure of the first object:

<pre>[{
    'type': 'customer',
    'name': 'Morgan Sykes',
    'phone': '1-183-546-6564',
    'email': 'lorem.auctor@vulputatemauris.net',
    'streetaddress': '2594 Tellus St.',
    'country': 'Russian Federation',
    'postalZip': '584585'
}]</pre>



In [8]:
# Function to convert XML data into list of dictionaries
def xml_to_list(xml_data):
    root = ET.fromstring(xml_data)
    result = []
    for rec in root.findall('record'):
        record_dict = {
            'type': rec.get('type'),  # record-level attribute
            'name': rec.findtext('name'),
            'phone': rec.findtext('phone'),
            'email': rec.findtext('email'),
            # Path syntax reaches into <address>; yields None instead of raising if <address> is absent
            'streetaddress': rec.findtext('address/streetaddress'),
            'country': rec.findtext('address/country'),
            'postalZip': rec.findtext('address/postalZip')
        }
        result.append(record_dict)
    return result

# Create lists of dictionaries
C1 = xml_to_list(xml_string)
C2 = xml_to_list(xml_file_content)

# Optional: preview first item
C1[0]

{'type': 'customer',
 'name': 'Lawrence Burke',
 'phone': '(815) 571-8746',
 'email': 'vulputate.lacus@elitfermentum.net',
 'streetaddress': 'P.O. Box 478, 5254 Mi St.',
 'country': 'Mexico',
 'postalZip': '22044'}
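
The flattening rule from step 3 (keep the record attributes, lift nested elements one level up) can also be sketched without hard-coding the field names. The helpers `flatten_record` and `xml_to_list_generic` and the sample record are hypothetical names and data introduced only for this illustration, and the sketch assumes nesting is at most one level deep, as in the exercise data.

```python
import xml.etree.ElementTree as ET

def flatten_record(rec):
    """Flatten one <record>: attributes first, then child elements,
    lifting the children of any nested element (such as <address>)."""
    flat = dict(rec.attrib)  # record-level attributes, e.g. 'type'
    for child in rec:
        if len(child):  # the element has children of its own
            for grandchild in child:
                flat[grandchild.tag] = grandchild.text or ''
        else:
            flat[child.tag] = child.text or ''
    return flat

def xml_to_list_generic(xml_data):
    """Return one flat dictionary per <record> under the root element."""
    root = ET.fromstring(xml_data)
    return [flatten_record(rec) for rec in root.findall('record')]

# Made-up sample record, not taken from the exercise files.
sample = """<records>
    <record type='customer'>
        <name>Example User</name>
        <phone></phone>
        <address>
            <streetaddress>1 Example St.</streetaddress>
            <country>Nowhere</country>
        </address>
    </record>
</records>"""

flattened = xml_to_list_generic(sample)
```

The explicit dictionary in `xml_to_list` fixes the set of keys, which keeps the DataFrame columns predictable; the generic version trades that guarantee for not having to list each field.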

4 Convert both lists into dataframes and finally combine them into one dataframe (remember to reset the index of the new dataframe!). Print the content of the generated dataframe.

In [10]:
# Convert lists to DataFrames
df1 = pd.DataFrame(C1)
df2 = pd.DataFrame(C2)

# Combine both DataFrames and reset index
df_combined = pd.concat([df1, df2], ignore_index=True)

# Print the combined DataFrame
print(df_combined)

        type               name           phone  \
0   customer     Lawrence Burke  (815) 571-8746   
1   employee      Angela Hinton  (997) 412-6965   
2   customer    Whitney Flowers  1-573-471-9738   
3   customer        Darius Burt  1-675-212-4934   
4   customer      Solomon Blair                   
5   customer       Morgan Sykes  1-183-546-6564   
6   employee       Zia Stafford  (621) 483-5714   
7   employee     Cooper Ramirez  (716) 895-3353   
8   customer  Richard Armstrong  (450) 317-2624   
9   employee     Flavia Delaney                   
10  customer  Branden Henderson  1-314-584-1088   
11  employee    Hector Kirkland  (325) 535-9750   
12  customer        Luke Hooper                   
13  employee      Moses Dejesus  1-337-767-2455   
14  customer          Finn Pena  1-746-189-1226   
15  employee    Lester Espinoza  1-378-721-6318   
16  employee     Nelle Stafford  1-246-844-4648   
17  customer        Amber Casey                   
18  customer      Jordan Strong
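
To see why step 4 asks for the index to be reset, here is a minimal sketch on two made-up frames (the names `left` and `right` are assumptions for illustration). Without a reset, the concatenated frame keeps each source frame's own row labels, so labels repeat.

```python
import pandas as pd

# Two small made-up frames standing in for df1 and df2.
left = pd.DataFrame({'name': ['A', 'B']})
right = pd.DataFrame({'name': ['C', 'D', 'E']})

# Without resetting, each frame keeps its own labels: 0, 1, 0, 1, 2.
kept = pd.concat([left, right])

# Either form yields a fresh 0..4 index.
via_ignore = pd.concat([left, right], ignore_index=True)
via_reset = pd.concat([left, right]).reset_index(drop=True)
```

Both forms give the same result; `ignore_index=True` simply avoids the intermediate frame with duplicated labels.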

5 Some users are missing a phone number. Print the number of these users.

In [12]:
# Find how many users are missing a phone number
missing_phone_count = df_combined['phone'].isna().sum() + (df_combined['phone'] == '').sum()
print("Number of users missing a phone number:", missing_phone_count)

Number of users missing a phone number: 4
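
The count above treats a phone as missing when it is `NaN`/`None` or an empty string. If whitespace-only values should also count as missing, one possible variant is sketched below on a made-up frame; the frame `toy` and its values are assumptions for illustration, and whether the exercise data contains such values is not established here.

```python
import pandas as pd

# Made-up data: one real number, one empty string, one None, one whitespace-only value.
toy = pd.DataFrame({'phone': ['(815) 571-8746', '', None, '   ']})

# Treat None/NaN, empty strings, and whitespace-only strings as missing.
is_missing = toy['phone'].fillna('').str.strip().eq('')
missing_count = int(is_missing.sum())
```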
