**What is XML?**

+ XML  ---> Extensible Markup Language

+ XML is a markup language and **file format** for serialization
(storing, transmitting, and reconstructing data).

+ It defines a set of rules for encoding documents in a way that is both human-readable and machine-readable.

+ The design goals of XML emphasize simplicity, generality, and usability across the internet. It is a textual data format with strong support via **Unicode** for different human languages.

+ As a markup language, XML labels, categorizes, and structurally organizes information, whereas HTML is used for creating web pages.
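
The serialization idea above (store or transmit, then reconstruct) can be sketched with Python's standard `xml.etree.ElementTree`. The `note` tag, the `lang` attribute, and the Japanese text are made up for illustration; the point is only that Unicode text survives the round trip.

```python
import xml.etree.ElementTree as ET

# Build a small document in memory; tag and attribute names are hypothetical
note = ET.Element("note", attrib={"lang": "ja"})
note.text = "こんにちは"

# Serialize to UTF-8 bytes, as if writing to a file or sending over a network
payload = ET.tostring(note, encoding="utf-8")

# Reconstruct the element from the bytes
restored = ET.fromstring(payload)
print(restored.tag, restored.get("lang"), restored.text)
```

The bytes in `payload` are what would be stored or transmitted; `restored` is the reconstructed element.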

**What is XSD?**

* XSD  ---> XML Schema Definition

* XSD defines the metadata needed to interpret and validate an XML file; it is also referred to as a canonical schema.

* An XML document that adheres to the basic XML syntax rules is a well-formed XML document.

* One that also adheres to its schema (XSD) is valid.
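
The difference between well-formed and valid can be sketched with lxml. The one-element schema and the three documents below are made up for illustration and are not the bank-transaction files used later; `classify` is a hypothetical helper name.

```python
from lxml import etree

# Hypothetical schema: <amount> must hold a decimal number
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="amount" type="xs:decimal"/>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))

def classify(xml_bytes):
    """Return 'not well-formed', 'well-formed but invalid', or 'valid'."""
    try:
        doc = etree.fromstring(xml_bytes)   # fails if the XML syntax is broken
    except etree.XMLSyntaxError:
        return "not well-formed"
    return "valid" if schema.validate(doc) else "well-formed but invalid"

results = [
    classify(b"<amount>12.50</amount>"),    # obeys syntax and schema
    classify(b"<amount>twelve</amount>"),   # obeys syntax, breaks the schema
    classify(b"<amount>12.50"),             # missing closing tag
]
print(results)
```

A document has to pass the first check (syntax) before the second check (schema) can even be attempted.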

# LOGICAL STRUCTURES OF XML

<root>
  <element>
    <subelement>...</subelement>
  </element>
</root>
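
As a rough sketch of how this nesting looks to a parser, the snippet below walks the same skeleton with the standard library and records each tag with its depth. The text `value` and the helper name `outline` are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET

xml_text = """<root>
  <element>
    <subelement>value</subelement>
  </element>
</root>"""

def outline(node, depth=0):
    """Return (depth, tag) pairs in document order."""
    pairs = [(depth, node.tag)]
    for child in node:                      # iterating an element yields its children
        pairs.extend(outline(child, depth + 1))
    return pairs

structure = outline(ET.fromstring(xml_text))
for depth, tag in structure:
    print("  " * depth + tag)
```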

![Online Image](https://way2tutorial.com/images/xml_employee.png)

**With this understanding, let's try to validate the XML using XSD**

In [1]:
# importing necessary libraries

from lxml import etree
import pandas as pd
import numpy as np

In [2]:
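# creating a function to validate XML against XSD

def validate_xml(xml_file, xsd_file):
    """Validate xml_file against xsd_file and print the outcome."""
    try:
        schema = etree.XMLSchema(file=xsd_file)   # load and compile the XSD
        xml_doc = etree.parse(xml_file)           # parse the XML (must be well-formed)
        schema.assertValid(xml_doc)               # raises DocumentInvalid on failure

        print("Validation Successful.The XML file is Valid against XSD")

    except etree.XMLSchemaError as e:
        # the XSD itself could not be located or compiled
        print(f"Schema Error : {e} ")
    except etree.XMLSyntaxError as e:
        # the XML file is not well-formed
        print(f"XML syntax error : {e}")
    except etree.DocumentInvalid as e:
        print(f"Document invalid : {e}")
        print("Validation failed.The XML is not Valid against XSD")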
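
`assertValid` stops at an exception. When every violation should be listed, one option is lxml's `validate` method together with the schema's `error_log`. The sketch below assumes a made-up one-element schema; `collect_schema_errors` is a hypothetical helper, not part of lxml.

```python
from lxml import etree

def collect_schema_errors(xml_doc, schema):
    """Return one 'line N: message' string per violation (empty list if valid)."""
    if schema.validate(xml_doc):
        return []
    return [f"line {err.line}: {err.message}" for err in schema.error_log]

# Hypothetical schema and documents, for illustration only
xsd = etree.fromstring(b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Balance" type="xs:decimal"/>
</xs:schema>""")
demo_schema = etree.XMLSchema(xsd)

good = etree.fromstring(b"<Balance>100.5</Balance>")
bad = etree.fromstring(b"<Balance>abc</Balance>")

good_errors = collect_schema_errors(good, demo_schema)
bad_errors = collect_schema_errors(bad, demo_schema)
print(good_errors)
print(bad_errors)
```

Returning a list instead of printing makes the result easier to log or to test.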

In [3]:
# validating XML against XSD using created function

xml_file_path = "20Feb2023.xml"
xsd_file_path = "bank_transactions.xsd"

validate_xml(xml_file_path,xsd_file_path)

Schema Error : Failed to locate the main schema resource at 'bank_transactions.xsd'. 
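
The error above means the relative path did not resolve from the notebook's working directory. A small pre-check, sketched here with the standard library, makes that kind of failure explicit before lxml is called. The helper name `require_file` and the file name in the demo are hypothetical.

```python
from pathlib import Path

def require_file(path):
    """Return the absolute path as a string, or raise FileNotFoundError."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"No such file: {resolved} (cwd is {Path.cwd()})")
    return str(resolved)

# Illustration: a name assumed not to exist in the working directory
try:
    require_file("definitely_missing_schema.xsd")
    found = True
except FileNotFoundError as exc:
    found = False
    print(exc)
```

The message includes the current working directory, which is usually the first thing to check when a relative path fails.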


In [4]:
# validating XML against XSD using created function

xml_file_path = "C:\\Users\\91755\\Downloads\\XML Files\\XML Files\\20Feb2023.xml"
xsd_file_path = "C:\\Users\\91755\\Downloads\\XML Files\\XML Files\\bank_transaction_xsd_corrected.xsd"
validate_xml(xml_file_path,xsd_file_path)

Validation Successful.The XML file is Valid against XSD


**LET'S PARSE THE XML FILE, FIND THE ROOT ELEMENT AND CHILD ELEMENTS**

In [5]:
#parsing an XML file
xml_file_path = "C:\\Users\\91755\\Downloads\\XML Files\\XML Files\\20Feb2023.xml"
tree=etree.parse(xml_file_path)
root=tree.getroot()
print("ROOT_ELEMENT  = ",root.tag)

ROOT_ELEMENT  =  BankTransactions


In [6]:
# child element of the parent element root

unique_child_elements = set()

# Iterate through all child elements under the root
for child_element in root:
    # Add the tag (element name) to the set
    unique_child_elements.add(child_element.tag)

# Print the unique child element tags
for tag in unique_child_elements:
    print(f"Unique Child Element Tag: {tag}")

Unique Child Element Tag: Transaction


In [7]:
# child element of transaction element

column_name=list(set([element.tag for i in root.findall("Transaction") for element in i]))
column_name

['HolderName',
 'TransactionAmount',
 'Balance',
 'TransactionDate',
 'TransactionDescription',
 'AccountNumber',
 'TransactionType',
 'Currency',
 'ConversionRate']
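
A `set` does not keep document order, so the column order above can differ between runs. If a stable order matters, one possible approach is `dict.fromkeys`, which keeps the first occurrence of each key in insertion order. The miniature document below is made up and uses the standard library parser.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of a transactions file
xml_text = """<BankTransactions>
  <Transaction><AccountNumber>1</AccountNumber><Balance>10</Balance></Transaction>
  <Transaction><AccountNumber>2</AccountNumber><Currency>SGD</Currency></Transaction>
</BankTransactions>"""

root = ET.fromstring(xml_text)

# dict.fromkeys drops duplicates while keeping first-seen order
ordered_tags = list(dict.fromkeys(
    child.tag for txn in root.findall("Transaction") for child in txn
))
print(ordered_tags)
```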

In [8]:
# creating an empty dataframe using transaction child elements
# as columns

df=pd.DataFrame(columns=column_name)

In [9]:
# filling the columns with the element texts

transaction_data_list = []

for transaction in root.findall("Transaction"):
    transaction_data = {}

    for element in transaction:
        # strip surrounding whitespace; empty elements become None
        element_txt = element.text.strip() if element.text else None
        transaction_data[element.tag] = element_txt

    transaction_data_list.append(transaction_data)

df = pd.DataFrame(transaction_data_list, columns=column_name)

df.head()

| | HolderName | TransactionAmount | Balance | TransactionDate | TransactionDescription | AccountNumber | TransactionType | Currency | ConversionRate |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Qgmklsxylxw Nkufloaddbmm | 125.73 | 13728.84 | 20Feb2023 | JhkqWLDy | 663709 | Credit | SGD | 1.0 |
| 1 | Laauiwd Hmjmsgscxwgqovi | 290.08 | 8826.28 | 20Feb2023 | hdUdNytCNYVFlWzEILSbsDeVfd | 702382 | Credit | JPY | 0.49 |
| 2 | Dwqtm Oplgnhvgaubspaf | 210.78 | 15548.06 | 20Feb2023 | fDDpYwjtmfVVwzBFNEvSSzb | 263664 | Debit | CNY | 0.19 |
| 3 | Mnjnhdbdewe Fogkesgnhrivc | 297.18 | 4351.6 | 20Feb2023 | fVFjvzQogRhBZKnDQUkiyFo | 619749 | Credit | SGD | 1.0 |
| 4 | Vjrhlpotbfgamh Urpr | 323.84 | 12058.33 | 20Feb2023 | eXmJYQANBwUCAQlivzkI | 969272 | Debit | SGD | 1.0 |


In [10]:
# shape of the created df
df.shape

(1000, 9)

In [11]:
# df information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   HolderName              1000 non-null   object
 1   TransactionAmount       999 non-null    object
 2   Balance                 1000 non-null   object
 3   TransactionDate         1000 non-null   object
 4   TransactionDescription  1000 non-null   object
 5   AccountNumber           1000 non-null   object
 6   TransactionType         1000 non-null   object
 7   Currency                1000 non-null   object
 8   ConversionRate          1000 non-null   object
dtypes: object(9)
memory usage: 70.4+ KB


**NOTE :** When parsing an XML file using lxml or xml.etree, the library creates element objects to represent the XML structure, and these objects store the values as strings by default.
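
A minimal sketch of that behaviour, using the standard library parser and a made-up transaction fragment: the text of an element comes back as `str`, and an empty element has `None` as its text.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: one filled element and one empty element
txn = ET.fromstring(
    "<Transaction><Balance>13728.84</Balance><TransactionAmount/></Transaction>"
)

balance_text = txn.find("Balance").text            # a str, not a number
amount_text = txn.find("TransactionAmount").text   # None for an empty element

balance = float(balance_text)                      # explicit conversion is needed
print(type(balance_text).__name__, type(balance).__name__, amount_text)
```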

In [12]:
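# type casting of columns

columns=['ConversionRate','Balance','TransactionAmount']

for i in columns:
    # np.number is an abstract type; pd.to_numeric converts the strings
    # to float64 and turns missing values (None) into NaN
    df[i]=pd.to_numeric(df[i])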

In [13]:
# verifying the type casting
df[columns].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ConversionRate     1000 non-null   float64
 1   Balance            1000 non-null   float64
 2   TransactionAmount  999 non-null    float64
dtypes: float64(3)
memory usage: 23.6 KB
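
If a column might contain text that is not a number, `pd.to_numeric` with `errors="coerce"` is one way to convert it without raising. The values below are made up for illustration.

```python
import pandas as pd

# Made-up column: two numeric strings, one missing value, one junk string
raw = pd.Series(["125.73", None, "n/a", "290.08"], name="TransactionAmount")

# errors="coerce" turns anything unparseable into NaN instead of raising
amounts = pd.to_numeric(raw, errors="coerce")
print(amounts.dtype, int(amounts.isna().sum()))
```

Coercion hides bad input as NaN, so it is worth counting the NaN values afterwards, as the last line does.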


In [14]:
# missing value analysis
df.isna().sum()

HolderName                0
TransactionAmount         1
Balance                   0
TransactionDate           0
TransactionDescription    0
AccountNumber             0
TransactionType           0
Currency                  0
ConversionRate            0
dtype: int64
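
To see which row holds a missing value, a boolean mask from `isna()` can be used to filter the frame. The sketch below uses a made-up three-row frame; whether to drop or fill such rows depends on the analysis and is left open here.

```python
import pandas as pd

# Made-up data with one missing amount
demo = pd.DataFrame({
    "AccountNumber": ["A1", "A2", "A3"],
    "TransactionAmount": [125.73, None, 210.78],
})

# Rows where the amount is missing
missing_rows = demo[demo["TransactionAmount"].isna()]
print(missing_rows)

# One option: drop rows that lack an amount
cleaned = demo.dropna(subset=["TransactionAmount"])
```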

In [15]:
# summary statistic of numerical columns of df

df.describe()

| | TransactionAmount | Balance | ConversionRate |
|---|---|---|---|
| count | 999.0 | 1000.0 | 1000.0 |
| mean | 251.326917 | 15652.49804 | 0.99848 |
| std | 143.407599 | 8429.549013 | 0.362915 |
| min | 1.3 | 1013.0 | 0.02 |
| 25% | 123.445 | 8479.4775 | 1.0 |
| 50% | 254.07 | 16008.225 | 1.0 |
| 75% | 380.055 | 22963.89 | 1.0 |
| max | 499.06 | 29993.9 | 1.99 |


In [16]:
# value_counts of transactiontype column
df['TransactionType'].value_counts()

Debit     513
Credit    487
Name: TransactionType, dtype: int64

In [17]:
# value counts of Currency column
df['Currency'].value_counts()

SGD    595
KRW     89
CNY     83
JPY     82
USD     78
AUD     73
Name: Currency, dtype: int64
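
Counts are often easier to compare as shares. A small sketch with a made-up series: `value_counts(normalize=True)` returns fractions that sum to 1 instead of raw counts.

```python
import pandas as pd

# Made-up currency column
currency = pd.Series(["SGD", "SGD", "SGD", "USD"], name="Currency")

# normalize=True returns relative frequencies instead of counts
shares = currency.value_counts(normalize=True)
print(shares)
```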