# XML File Parsing

XML is a software- and hardware-independent format for storing and transporting data.

What is XML?

- XML stands for eXtensible Markup Language
- XML is a markup language much like HTML
- XML was designed to store and transport data
- XML was designed to be self-descriptive
- XML is a W3C Recommendation

XML Does Not DO Anything

This may be a little hard to grasp, but XML does not DO anything by itself.

This note is a note to Tove from Jani, stored as XML:
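
A note of this kind might look like the sketch below, embedded in a Python string so that it can be parsed with the standard library. Only the sender (Jani) and receiver (Tove) come from the text; the element names `note`, `to`, `from`, `heading`, and `body`, as well as the heading and body text, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical note: element names, heading, and body text are illustrative assumptions
NOTE_XML = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Placeholder message text</body>
</note>"""

# Parse the string and collect each child element's tag and text
note = ET.fromstring(NOTE_XML)
fields = {child.tag: child.text for child in note}
print(fields)
```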

The XML above is quite self-descriptive:

- It has sender information
- It has receiver information
- It has a heading
- It has a message body

But still, the XML above does not DO anything. XML is just information wrapped in tags.

Someone must write a piece of software to send, receive, store, or display it:

In [1]:
# Python code to illustrate parsing of XML files
# Import the module used to fetch XML over HTTP
import requests

# XML Parsing to extract Flight Data from https://rapidapi.com/hub

In [2]:
import os
import requests

source = input("Enter Source Airport Code: ").strip()
dest = input("Enter Destination Airport Code: ").strip()
date = input("Enter Journey Date in YYYYMMDD Format: ").strip()

url = f"https://timetable-lookup.p.rapidapi.com/TimeTable/{source}/{dest}/{date}/"

headers = {
    # Read the API key from the environment instead of hard-coding it.
    # RAPIDAPI_KEY is an assumed environment variable name; set it to your own key.
    "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
    "X-RapidAPI-Host": "timetable-lookup.p.rapidapi.com"
}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # Stop early on an HTTP error
raw_text = response.text.split('\n')  # One list entry per line of XML

Enter Source Airport Code: blr
Enter Destination Airport Code: got
Enter Journey Date in YYYYMMDD Format: 20221024


In [3]:
# Structure of the XML Data:
# raw_text

In [4]:
# Export the XML data to a pandas DataFrame for easier inspection
import pandas as pd

columns = ['Total Miles', 'Trip Duration', 'Departure Date Time', 'Departure Time Offset',
           'Departure Airport Code', 'Departure Airport Name', 'Arrival Date Time', 'Arrival Time Offset',
           'Arrival Airport Code', 'Arrival Airport Name', 'Stops', 'Flight Legs', 'Flight Number',
           'Airline Code', 'Airline Name']
rows = []
for c, line in enumerate(raw_text):
    if '<FlightDetails TotalFlightTime' not in line:
        continue
    # NOTE: the fixed line offsets below depend on the exact layout of this response
    # The 12 lines after the opening tag hold the trip-level attributes (name="value")
    col = [raw_text[i].strip().split('=', 1)
           for i in range(c + 1, c + 13) if raw_text[i].strip()]
    # Flight number attribute
    col.append(raw_text[c + 25].strip().split('=', 1))
    # Attributes of <MarketingAirline>: the first is the airline code, the third its name
    flight = raw_text[c + 37][1:-2].split('<MarketingAirline ')[1].split('" ')
    col.append((flight[0] + '"').split('=', 1))
    col.append((flight[2] + '"').split('=', 1))
    # Keep only the attribute values and drop the surrounding quotes
    rows.append([pair[1].replace('"', '') for pair in col])

df = pd.DataFrame(rows, columns=columns)
df

Unnamed: 0,Total Miles,Trip Duration,Departure Date Time,Departure Time Offset,Departure Airport Code,Departure Airport Name,Arrival Date Time,Arrival Time Offset,Arrival Airport Code,Arrival Airport Name,Stops,Flight Legs,Flight Number,Airline Code,Airline Name
0,5590,PT15H50M,2022-10-24T01:20:00,530,BLR,Kempegowda International Airport,2022-10-24T13:40:00,200,GOT,Goteborg,Connect,3,2289,KL,KLM
1,5590,PT19H55M,2022-10-24T01:20:00,530,BLR,Kempegowda International Airport,2022-10-24T17:45:00,200,GOT,Goteborg,Connect,3,2289,KL,KLM
2,5152,PT14H30M,2022-10-24T03:05:00,530,BLR,Kempegowda International Airport,2022-10-24T14:05:00,200,GOT,Goteborg,Connect,2,755,LH,Lufthansa
3,5611,PT16H05M,2022-10-24T03:05:00,530,BLR,Kempegowda International Airport,2022-10-24T15:40:00,200,GOT,Goteborg,Connect,3,3223,SK,SAS
4,5611,PT17H45M,2022-10-24T03:05:00,530,BLR,Kempegowda International Airport,2022-10-24T17:20:00,200,GOT,Goteborg,Connect,3,3223,SK,SAS
5,5434,PT18H30M,2022-10-24T03:05:00,530,BLR,Kempegowda International Airport,2022-10-24T18:05:00,200,GOT,Goteborg,Connect,3,755,LH,Lufthansa
6,5611,PT19H20M,2022-10-24T03:05:00,530,BLR,Kempegowda International Airport,2022-10-24T18:55:00,200,GOT,Goteborg,Connect,3,3223,SK,SAS
7,5115,PT16H35M,2022-10-24T03:55:00,530,BLR,Kempegowda International Airport,2022-10-24T17:00:00,200,GOT,Goteborg,Connect,3,573,QR,Qatar Airways
8,5115,PT19H35M,2022-10-24T03:55:00,530,BLR,Kempegowda International Airport,2022-10-24T20:00:00,200,GOT,Goteborg,Connect,3,573,QR,Qatar Airways
9,5115,PT16H00M,2022-10-24T04:30:00,530,BLR,Kempegowda International Airport,2022-10-24T17:00:00,200,GOT,Goteborg,Connect,3,4787,QR,Qatar Airways


In [5]:
# Save the extracted flight data as a CSV file
df.to_csv("XML-Dataset/Flight_Data.csv")
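
The line-offset approach above only works while the response keeps exactly the same line layout. As a hedged alternative, the sketch below reads attributes with the standard library's `xml.etree.ElementTree`. The element names `FlightDetails` and `MarketingAirline` appear in the parsing code; the wrapper element `Flights`, the attribute names `TotalMiles`, `TotalTripTime`, `Code`, and `CompanyShortName`, and the sample fragment itself are assumptions, and a real response may also use XML namespaces, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment; attribute names are assumptions
SAMPLE_RESPONSE = """<Flights>
  <FlightDetails TotalMiles="5590" TotalTripTime="PT15H50M">
    <FlightLegDetails>
      <MarketingAirline Code="KL" CompanyShortName="KLM"/>
    </FlightLegDetails>
  </FlightDetails>
</Flights>"""

def flight_rows(xml_text):
    """Return one dict per <FlightDetails> element, read by name instead of line offset."""
    root = ET.fromstring(xml_text)
    rows = []
    for details in root.iter("FlightDetails"):
        # First <MarketingAirline> anywhere below this flight, or None if absent
        airline = details.find(".//MarketingAirline")
        rows.append({
            "Total Miles": details.get("TotalMiles"),
            "Trip Duration": details.get("TotalTripTime"),
            "Airline Code": airline.get("Code") if airline is not None else None,
            "Airline Name": airline.get("CompanyShortName") if airline is not None else None,
        })
    return rows

rows = flight_rows(SAMPLE_RESPONSE)
print(rows)
```

A list of dicts like this can be passed directly to `pd.DataFrame(rows)`.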

# Local XML File Parsing

Before running the code below, unzip XML-Dataset/UN Population Data.zip so that the XML file it contains is available in the XML-Dataset folder.
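
The archive can also be extracted from Python. This is a minimal sketch using the standard library's `zipfile` module; the helper name `extract_archive` is hypothetical, and the sketch assumes the archive sits at the path given above.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()

archive_path = Path("XML-Dataset/UN Population Data.zip")
if archive_path.exists():
    # Extract next to the archive so the XML file lands in XML-Dataset/
    print(extract_archive(archive_path, archive_path.parent))
```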

In [6]:
# Convert an XML file to a dictionary
import xmltodict
# The with-block closes the file once it has been read
with open('XML-Dataset/UN Population Data.xml', encoding="utf8") as file:
    data = xmltodict.parse(file.read())
#data

In [7]:
# Structure of the XML Data:
# data['ROOT']['data']['record']

An OrderedDict is a dictionary subclass that remembers the order in which keys were first inserted. Depending on the installed xmltodict version, parse() returns either an OrderedDict or a plain dict.

Since Python 3.7, a regular dict also preserves insertion order, so iterating over either type yields the keys in the order they were added. The remaining differences are that equality between two OrderedDicts is order-sensitive, and that OrderedDict offers reordering methods such as move_to_end().

In [8]:
# Export the XML data to a pandas DataFrame for easier inspection
import pandas as pd

columns = ['Country', 'Year', 'Area', 'Sex', 'City', 'City Type', 'Record Type',
           'Reliability', 'Source Year', 'Value', 'Value Footnotes']
rows = []
for value in data['ROOT']['data']['record']:
    col1 = []
    for val in value['field']:
        try:
            #print(val['@name'], ':', val['#text'])
            col1.append(val['#text'])
        except (KeyError, TypeError):
            # The field has no text content
            col1.append('')
    rows.append(col1)

# Build the DataFrame once; appending row by row with df.loc is slow for large files
df1 = pd.DataFrame(rows, columns=columns)
df1

Unnamed: 0,Country,Year,Area,Sex,City,City Type,Record Type,Reliability,Source Year,Value,Value Footnotes
0,Åland Islands,2021,Total,Both Sexes,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2022,11724,1
1,Åland Islands,2021,Total,Male,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2022,5638,1
2,Åland Islands,2021,Total,Female,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2022,6086,1
3,Åland Islands,2020,Total,Both Sexes,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2021,11692,1
4,Åland Islands,2020,Total,Male,MARIEHAMN,City proper,Estimate - de jure,"Final figure, complete",2021,5599.5,1
...,...,...,...,...,...,...,...,...,...,...,...
69485,Zimbabwe,1992,Total,Female,Bulawayo,City proper,Census - de facto - complete tabulation,"Final figure, complete",1992,311878,
69486,Zimbabwe,1992,Total,Female,Chitungwiza,City proper,Census - de facto - complete tabulation,"Final figure, complete",1992,137022,
69487,Zimbabwe,1992,Total,Female,Gweru,City proper,Census - de facto - complete tabulation,"Final figure, complete",1992,63565,
69488,Zimbabwe,1992,Total,Female,HARARE,City proper,Census - de facto - complete tabulation,"Final figure, complete",1992,565758,


In [9]:
# Save the extracted data as a csv file
df1.to_csv("XML-Dataset/Population_Data.csv")
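
Reading the whole file into memory works here, but for larger files a streaming parser may be preferable. The sketch below uses `xml.etree.ElementTree.iterparse` from the standard library rather than xmltodict. The `ROOT`/`data`/`record`/`field` nesting and the `name` attribute follow the structure used above; the helper `iter_records`, the sample fragment, and the particular `name` values are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical fragment that mimics the record/field layout of the dataset
SAMPLE_XML = """<ROOT>
  <data>
    <record>
      <field name="Country">Zimbabwe</field>
      <field name="Year">1992</field>
      <field name="Value">311878</field>
      <field name="Value Footnotes"/>
    </record>
  </data>
</ROOT>"""

def iter_records(source):
    """Yield one dict per <record>, parsing the source incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            # Empty fields have text None; store them as empty strings
            yield {f.get("name"): (f.text or "") for f in elem.iterfind("field")}
            elem.clear()  # Release the children of the processed record

records = list(iter_records(io.BytesIO(SAMPLE_XML.encode("utf-8"))))
print(records)
```

`iter_records` also accepts a file path, so it could be pointed at the unzipped XML file instead of the in-memory sample.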

# XML File Parsing from URL

In [10]:
url = "https://medlineplus.gov/xml/mplus_topics_2022-08-26.xml"

In [11]:
import requests
import xmltodict
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early on an HTTP error
data = xmltodict.parse(response.text)

In [12]:
# Structure of the XML Data:
# data['health-topics']['health-topic']
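
Each health-topic entry carries a full-summary value that is an HTML fragment rather than plain text. As a hedged sketch, such a fragment could be converted to text with the standard library's `html.parser` instead of chained string replacements. The class `SummaryTextExtractor`, the helper `summary_to_text`, the bracketed link format, and the shortened sample fragment are all assumptions for illustration.

```python
from html.parser import HTMLParser

class SummaryTextExtractor(HTMLParser):
    """Collect the text of an HTML fragment, keeping link targets in brackets."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.parts.append(f"[{href}] ")
        elif tag == "li":
            self.parts.append("\n- ")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def summary_to_text(fragment):
    parser = SummaryTextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()

# Shortened, illustrative fragment
sample = ('<p>A1C is a blood test for '
          '<a href="https://medlineplus.gov/diabetestype2.html">type 2 diabetes</a>.</p>')
text = summary_to_text(sample)
print(text)
```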

In [15]:
# Extract relevant data from the XML
for value in data['health-topics']['health-topic']:
    print("TITLE: ", value['@title'], '\n')
    # Crude clean-up: label link targets, then delete the remaining markup fragments
    v = value['full-summary'].replace('href=', 'URL: ')
    replace_val = ['<p>', '<a', '</a>', '>', '</p', '<ul', '<li', '</ul', '<p', '</li', '\t']
    for rep in replace_val:
        v = v.replace(rep, '')
    print(v)
    # xmltodict returns a single dict (not a list) when a topic has only one group
    groups = value['group']
    if isinstance(groups, dict):
        groups = [groups]
    for val in groups:
        try:
            print(val['@url'], ':', val['#text'])
        except (KeyError, TypeError):
            # Skip groups that lack a URL or text
            pass
    break           # Stop after the first topic; comment this out to process every topic

    print('\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n')

TITLE:  A1C 

A1C is a blood test for  URL: "https://medlineplus.gov/diabetestype2.html"type 2 diabetes and  URL: "https://medlineplus.gov/prediabetes.html"prediabetes. It measures your average blood glucose, or  URL: "https://medlineplus.gov/bloodsugar.html"blood sugar, level over the past 3 months. Doctors may use the A1C alone or in combination with other diabetes tests to make a diagnosis. They also use the A1C to see how well you are managing your diabetes. This test is different from the blood sugar checks that people with diabetes do every day.

Your A1C test result is given in percentages. The higher the percentage, the higher your blood sugar levels have been:

A normal A1C level is below 5.7%
Prediabetes is between 5.7 to 6.4%. Having prediabetes is a risk factor for getting type 2 diabetes. People with prediabetes may need retests every year.
Type 2 diabetes is above 6.5%
If you have diabetes, you should have the A1C test at least twice a year. The A1C goal for many people w