 ## Parsing XML file using Pyspark : Part 2
 
 ### Parsing a really large 5GB XML file using Pyspark



`Author : Deepika Sharma` | `Time : September 2020`

# Method 1 : PySpark way

## Analysis

***

Here are the steps for parsing a really large XML file using PySpark.

In this notebook, I create a pipeline that feeds the data to PySpark in batches, since the file is too big to be handed over as a single unit.

Due to memory limitations, only `1 million` records are parsed in this notebook. Once the data ingestion process is in place, PySpark parses the XML at lightning speed.

`Steps involved:`
1. Read the file split by \n.
2. Parallelize the RDD to partition it further.
3. Distribute the partitioned RDDs.
4. Write the parser that extracts fields from the XML.
5. Register the parser with Spark so that Spark can identify it.
6. Get one partitioned RDD (calling it `chunk`) at a time and give it to Spark, since Spark holds data in memory.
7. Further partition the chunk and distribute the partitioned chunk to all the nodes along with the registered function.
8. Collect results, remove duplicates, clean fringe cases, and append to the final list.
9. Write the final list to a DataFrame.

***
Note that you can also find commercial solutions such as Databricks, which offer easier, plug-and-play facilities and are quick to try out. This notebook aims to identify ways to parse large files using PySpark.

### Step 1

In [1]:
import time
import os,re,gc
import pyspark
import pandas as pd
from itertools import islice
from pyspark.sql.types import *
from pyspark import SparkContext
import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession

In [2]:
num_of_th = 128; repartition_size = num_of_th*4; chunk_size = 1000000
sc = SparkContext(master = "local[20]").getOrCreate()
spark = SparkSession(sc)

In [3]:
file_rdd = spark.read.text("./xml_data/enwiki-latest-abstract.xml", wholetext=False)

Here is the catch: if we read the whole XML as wholetext, the file `will not` be split by \n, and it will be difficult to fit it in memory.
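To make the line-splitting idea concrete outside Spark, here is a minimal plain-Python sketch (not the notebook's code). It reads a file-like object lazily and hands out fixed-size batches of lines. The helper name `iter_line_batches` and the tiny in-memory sample are assumptions for illustration only.

```python
import io
from itertools import islice

def iter_line_batches(fileobj, batch_size):
    """Yield lists of at most batch_size lines, reading lazily."""
    while True:
        batch = [line.rstrip("\n") for line in islice(fileobj, batch_size)]
        if not batch:
            return
        yield batch

# Hypothetical stand-in for the large XML file: one element per line.
sample = io.StringIO(
    "<feed>\n<doc>\n<title>Wikipedia: Anarchism</title>\n"
    "<url>https://en.wikipedia.org/wiki/Anarchism</url>\n</doc>\n</feed>\n"
)

batches = list(iter_line_batches(sample, batch_size=4))
for batch in batches:
    print(len(batch), batch[0])
```

Only one batch of lines is held in memory at a time, which is the property the line-split read relies on.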

In [4]:
# file_rdd.count()

In [5]:
# The file has 74 million records. It is a big file to parse; let's see if we can do it using PySpark.

In [6]:
#Lets look at sample data ...
file_rdd.take(10)

[Row(value='<feed>'),
 Row(value='<doc>'),
 Row(value='<title>Wikipedia: Anarchism</title>'),
 Row(value='<url>https://en.wikipedia.org/wiki/Anarchism</url>'),
 Row(value='<abstract>Anarchism is a political philosophy and movement that rejects all involuntary, coercive forms of hierarchy. It calls for the abolition of the state which it holds to be undesirable, unnecessary and harmful.</abstract>'),
 Row(value='<links>'),
 Row(value='<sublink linktype="nav"><anchor>Etymology, terminology and definition</anchor><link>https://en.wikipedia.org/wiki/Anarchism#Etymology,_terminology_and_definition</link></sublink>'),
 Row(value='<sublink linktype="nav"><anchor>History</anchor><link>https://en.wikipedia.org/wiki/Anarchism#History</link></sublink>'),
 Row(value='<sublink linktype="nav"><anchor>Pre-modern era</anchor><link>https://en.wikipedia.org/wiki/Anarchism#Pre-modern_era</link></sublink>'),
 Row(value='<sublink linktype="nav"><anchor>Modern era</anchor><link>https://en.wikipedia.org/wiki

Let's parse the XML.

### Step 2

Limiting parsing to 1 million entries due to memory limitations. If system memory can hold up to 74 million rows, change `chunk_size` accordingly.

In [7]:
file_chunk = file_rdd.take(chunk_size)

### Step 3

Parallelize the RDD 

In [8]:
# The RDD is partitioned so that each partition can be picked up separately and parsed to extract the fields.
myRDD = sc.parallelize(file_chunk)#,100)

Let's get the count of entries in each partitioned RDD.

In [9]:
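def count_in_a_partition(idx, iterator):
    # Count the rows in one partition. mapPartitionsWithIndex expects an
    # iterable back, so yield a single count per partition; returning the
    # bare tuple (idx, count) would be flattened into two separate elements.
    count = 0
    for _ in iterator:
        count += 1
    yield count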

In [10]:
count_list = myRDD.mapPartitionsWithIndex(count_in_a_partition).collect()

In [11]:
# count_list = [val for i,val in enumerate(count_list) if i%2 != 0]
print("Number of partitions  :" ,len(count_list))

Number of partitions  : 20
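One detail worth keeping in mind: `mapPartitionsWithIndex` expects the function to return an iterable for each partition, and it concatenates those iterables into the resulting RDD. The plain-Python emulation below shows what that means for a counting function. The helper `emulate_map_partitions_with_index` is hypothetical and not a Spark API; it only mimics the flattening behaviour on toy data.

```python
from itertools import chain

def emulate_map_partitions_with_index(partitions, func):
    """Hypothetical stand-in: apply func(idx, iterator) per partition and flatten."""
    return list(chain.from_iterable(
        func(idx, iter(part)) for idx, part in enumerate(partitions)
    ))

def count_returning_tuple(idx, iterator):
    count = sum(1 for _ in iterator)
    return idx, count                  # a tuple is iterable, so it gets flattened

def count_yielding_value(idx, iterator):
    yield sum(1 for _ in iterator)     # exactly one element per partition

partitions = [["a", "b", "c"], ["d", "e"]]   # toy data: two partitions
flattened = emulate_map_partitions_with_index(partitions, count_returning_tuple)
per_partition = emulate_map_partitions_with_index(partitions, count_yielding_value)
print(flattened)       # [0, 3, 1, 2]
print(per_partition)   # [3, 2]
```

With the tuple-returning variant, the length of the collected list is twice the number of partitions, so only every second element is a count.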


### Step 4

Let's write a parser to get values from the XML chunk passed to PySpark.

In [12]:
elements_parsed = {"title" : [], "url" : [], "abstract" : [], "anchor" : [], "link" : []}


def get_values(i,x,elements_parsed):
    try:
        root = ET.fromstring(x[0]) # can be run over multiple threads by increasing batch size
        for child in root.iter():
            if child.tag == "title":
                elements_parsed["title"].append(child.text.split(":")[1].strip())
#                 elements_parsed["count"].append(i)
            if child.tag == "url":
                elements_parsed["url"].append(child.text)
            if child.tag == "abstract":
                elements_parsed["abstract"].append(child.text)
            if child.tag == "anchor":
                elements_parsed["anchor"].append(child.text)                
            if child.tag == "link": 
                elements_parsed["link"].append(child.text)
            
    except Exception:
        # Lines that are not well-formed XML on their own (e.g. '<feed>') are skipped.
        pass
    gc.collect()
    return elements_parsed
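Because each line is parsed on its own, only lines that form a complete element (for example a full `<title>` element) parse successfully. Structural lines such as `<feed>` or `<links>` raise a parse error and are skipped. Here is a minimal sketch of that behaviour, using lines taken from the sample output above. The helper `parse_line` is an illustrative assumption, not the notebook's function.

```python
import xml.etree.ElementTree as ET

WANTED = {"title", "url", "abstract", "anchor", "link"}

def parse_line(line):
    """Return (tag, text) pairs found in one line, or [] if the line
    is not well-formed XML on its own."""
    try:
        root = ET.fromstring(line)
    except ET.ParseError:
        return []
    return [(el.tag, el.text) for el in root.iter() if el.tag in WANTED]

lines = [
    "<feed>",
    "<title>Wikipedia: Anarchism</title>",
    '<sublink linktype="nav"><anchor>History</anchor>'
    "<link>https://en.wikipedia.org/wiki/Anarchism#History</link></sublink>",
]
for line in lines:
    print(parse_line(line))
```

Returning the parsed pairs instead of appending to a shared dictionary keeps the function free of side effects, which tends to be easier to reason about when the work is spread over partitions.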

### Step 5
- Register the UDF with Spark.

In [13]:
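# Register the parser with PySpark so it can also be called from Spark SQL.
# Note: the RDD job in Step 7 calls get_values directly as a plain Python function.
spark.udf.register(name="get_values", f=get_values, returnType=StringType())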

<function __main__.get_values(i, x, elements_parsed)>

Let's get the XML chunks for parsing.

Writing functions for removing duplicates from the output, cleaning up a few fringe cases, and resetting intermediate variables.

In [14]:
def elements_parsed_rm_frindges(elements_parsed):
    '''
    This function removes repetitive records from the output after each iteration;
    one iteration runs on one partitioned RDD.
    
    The partitioned RDD is further partitioned so that Spark can send it across nodes.
    Spark sends all the data to all the nodes for computation. There might be ways to avoid Spark sending data to all the nodes.
    
    Input : Parsed elements after each iteration.
    '''
    
    
    parsed_records_list = [] 
    for i in range(len(elements_parsed)): 
        if elements_parsed[i] not in elements_parsed[i + 1:]: 
            parsed_records_list.append(elements_parsed[i]) 
    
    return parsed_records_list
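The loop above compares every record with all later ones, so its cost grows quadratically with the number of records. As an alternative sketch (an illustrative assumption, not the notebook's method), records can be de-duplicated in a single pass by hashing a canonical string form of each record. Here `json.dumps(..., sort_keys=True)` serves as that key, which assumes the records are JSON-serialisable dicts or lists. It keeps the first occurrence of a duplicate, whereas the loop above keeps the last.

```python
import json

def dedupe_records(records):
    """Order-preserving de-duplication for JSON-serialisable records."""
    seen = set()
    unique = []
    for rec in records:
        key = json.dumps(rec, sort_keys=True)   # canonical, hashable form
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Toy records shaped like the parsed output.
records = [
    {"title": ["A"], "url": ["u1"]},
    {"url": ["u1"], "title": ["A"]},   # same content, different key order
    {"title": ["B"], "url": ["u2"]},
]
print(dedupe_records(records))
```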


Resetting intermediate variables.

In [15]:
def reset(file,name):
    '''
    This function resets the intermediate variables to free up memory.
    Input : the variable that needs to be freed up, and its name.
    '''
    if name == "elements_parsed":
        file.clear()
       
    if name == "parsed_records_list":
        file.clear()
        
    return file    

### Step 6,7,8
- Get one partitioned RDD at a time to give it to Spark since Spark holds data in memory.
- Further partition the chunk and distribute the partitioned chunk to all the nodes along with registered function.
- Collect results, remove duplicates, clean fringe cases, and append to the final list.

In [16]:
parsed_records_ini = []; start = time.time()
for idx_,count in enumerate(count_list):    
    # 6. Getting one partitioned RDD at a time.
    chunk = myRDD.mapPartitionsWithIndex(lambda i, it: islice(it, 0, count) if i == idx_ else []).collect()
    
    # 7. Parallelizing it further.    
    myRDD_ = sc.parallelize(chunk)  
    myRDD_ = myRDD_.repartition(repartition_size)
    
    #Initiate the dict in which elements will be appended
    elements_parsed = {"count" : [], "title" : [], "url" : [], "abstract" : [], "anchor" : [], "link" : []}
    
    # 7. Run the job on spark
    elements_parsed = sc.runJob(myRDD_, lambda part: [get_values(i,x,elements_parsed) for i,x in enumerate(part)]) 
    
    # 8. Remove duplicates and consolidate the parsed records, a hygiene check.
    parsed_records_list = elements_parsed_rm_frindges(elements_parsed)
        
    #reset the variable to free up memory
    elements_parsed = reset(elements_parsed, name = "elements_parsed")   
    
    # 8. Create a master list in which all the cleaned records will continue appending while all the intermediate variables are reset.
    parsed_records = parsed_records_ini+parsed_records_list
    parsed_records_ini = parsed_records
    
    # Note: reset() has no branch for "parsed_records_ini", so this call leaves the master list intact
    parsed_records_ini = reset(parsed_records_ini, name = "parsed_records_ini")
    
   
    #Garbage collection to further save up memory
    gc.collect()

    if idx_ > 0 and idx_ %4 == 0:
        print(idx_, time.time()-start)

4 1270.425047636032
8 3373.918560028076
12 4226.878239631653
16 4789.349941253662


In [17]:
(time.time()-start)/60

88.84068657159806

In [18]:
pd.DataFrame(parsed_records)

Unnamed: 0,count,title,url,abstract,anchor,link
0,[],"[Wikipedia: Distance education, Wikipedia: Equ...",[https://en.wikipedia.org/wiki/Distance_educat...,"[Distance education, also called distance lear...","[[Later life, Thoroughbred racing, Death, Lega...",[[https://en.wikipedia.org/wiki/Desi_Arnaz#Lat...
1,[],"[Wikipedia: DNA virus, Wikipedia: Estimator, W...","[https://en.wikipedia.org/wiki/DNA_virus, http...",[A DNA virus is a virus that has DNA as its ge...,"[[Bibliography, See also, References], [Radio ...",[[https://en.wikipedia.org/wiki/Desi_Arnaz#Bib...
2,[],"[Wikipedia: Death of a Hero, Wikipedia: Euphor...",[https://en.wikipedia.org/wiki/Death_of_a_Hero...,[Death of a Hero is a World War I novel by Ric...,"[[Group I: dsDNA viruses, Host range, Taxonomy...",[[https://en.wikipedia.org/wiki/DNA_virus#Grou...
3,[],"[Wikipedia: Degree Confluence Project, Wikiped...",[https://en.wikipedia.org/wiki/Degree_Confluen...,"[Ezra (; , ;""[God] helps"" – Emil G. Hirsch, Is...","[[CRESS DNA viruses, Cruciviridae, Molecular b...",[[https://en.wikipedia.org/wiki/DNA_virus#CRES...
4,[],"[Wikipedia: Epic poetry, Wikipedia: Energy, Wi...","[https://en.wikipedia.org/wiki/Epic_poetry, ht...",[thumb|upright|right|A GPS unit at confluence ...,"[[Marine and other, Satellite viruses, Phyloge...",[[https://en.wikipedia.org/wiki/DNA_virus#Mari...
...,...,...,...,...,...,...
1284,[],"[Wikipedia: Catherine of Alexandria, Wikipedia...",[https://en.wikipedia.org/wiki/Catherine_of_Al...,"[| birth_place = Alexandria, Roman Egypt, thu...","[[Legend, Torture and martyrdom, Veneration, H...",[[https://en.wikipedia.org/wiki/Catherine_of_A...
1285,[],"[Wikipedia: Camma, Wikipedia: IACR, Wikipedia:...","[https://en.wikipedia.org/wiki/Camma, https://...",[Camma was a Galatian princess and priestess o...,"[[Medieval cult, Legacy, In art, Contemporary ...",[[https://en.wikipedia.org/wiki/Catherine_of_A...
1286,[],"[Wikipedia: The Ring (magazine), Wikipedia: Ca...",[https://en.wikipedia.org/wiki/The_Ring_(magaz...,"[| issn = 0035-5410, Camulus or Camulos was a ...","[[History, The Ring world champions, Current T...",[[https://en.wikipedia.org/wiki/The_Ring_(maga...
1287,[],"[Wikipedia: Julia Margaret Cameron, Wikipedia:...",[https://en.wikipedia.org/wiki/Julia_Margaret_...,[Canola oil is a vegetable oil derived from a ...,"[[List of pound for pound #1 fighters, Scandal...",[[https://en.wikipedia.org/wiki/The_Ring_(maga...


### Step 9
 Writing results to a DataFrame

In [19]:
title_list = []; url_list = []; abstract_list = []; anchor_list = []; link_list = []
for i in range(len(parsed_records_ini)):
    title = title_list + parsed_records_ini[i]["title"]
    title_list = title    
    
    url = url_list + parsed_records_ini[i]["url"]
    url_list = url
    
    abstract = abstract_list + parsed_records_ini[i]["abstract"]
    abstract_list = abstract
    
    anchor = anchor_list + [val for val in parsed_records_ini[i]["anchor"] if len(val) > 0 ]#[0:len(parsed_records_ini[i]["title"])]
    anchor_list = anchor
    
    link = link_list + [val for val in parsed_records_ini[i]["link"] if len(val) > 0]#[0:len(parsed_records_ini[i]["title"])]
    link_list = link
    
    if len(parsed_records_ini[i]["title"]) != len(parsed_records_ini[i]["anchor"]):
        print(parsed_records_ini[i]["title"])
        print(parsed_records_ini[i]["anchor"])
        print("\n\n")

['Wikipedia: Distance education', 'Wikipedia: Equivalence relation', 'Wikipedia: Euclidean geometry', 'Wikipedia: Eiffel (programming language)', 'Wikipedia: European Space Operations Centre', 'Wikipedia: Finnish', 'Wikipedia: Four Pillars', 'Wikipedia: Galilean moons', 'Wikipedia: Gamma-Hydroxybutyric acid', 'Wikipedia: Gilles Apap', 'Wikipedia: High fidelity', 'Wikipedia: Hymenoptera', 'Wikipedia: Insulin', 'Wikipedia: Devanagari numerals', 'Wikipedia: Coen brothers', 'Wikipedia: Jawaharlal Nehru', 'Wikipedia: Foreign relations of Kazakhstan']
[['Later life', 'Thoroughbred racing', 'Death', 'Legacy', 'Filmography', 'As actor', 'As producer', 'As writer', 'As director', 'Soundtracks', 'History', 'University correspondence courses', 'Open universities', '2019–20 coronavirus pandemic', 'Technologies', 'Notation', 'Definition', 'Examples', 'The Elements', 'Axioms', 'Parallel postulate', 'Methods of proof', 'System of measurement and arithmetic', 'Notation and terminology', 'History', 'Pr


['Wikipedia: Idiopathic intracranial hypertension', 'Wikipedia: Rugby football', 'Wikipedia: KTH Royal Institute of Technology', 'Wikipedia: Round (music)', 'Wikipedia: Rædwald of East Anglia', 'Wikipedia: Saxhorn', 'Wikipedia: Sorbian languages', 'Wikipedia: Human rights in Sudan', 'Wikipedia: Symplectic manifold', 'Wikipedia: Sweet tea', 'Wikipedia: Stefan Banach', 'Wikipedia: Telephone', 'Wikipedia: Tau Ceti', 'Wikipedia: The Skeptical Environmentalist', 'Wikipedia: Universal Decimal Classification', 'Wikipedia: Batavia (1628 ship)', 'Wikipedia: Vowel']
[['Police actions and national liberation', 'Early traditions of pacifism', 'China', 'Lemba', 'Moriori', 'Greece', 'Roman Empire', 'Christianity', 'Modern history', 'Peace movements', 'Signs and symptoms', 'Causes', 'Mechanism', 'Diagnosis', 'Investigations', 'Forms', 'History', 'Antecedents of rugby', 'Establishment of modern rugby', 'Global status of rugby codes', 'History', 'R1 nuclear reactor', 'Schools', 'International and nati

['Wikipedia: Presbyterian Church (USA)', 'Wikipedia: Quisling', 'Wikipedia: Rome', 'Wikipedia: Telecommunications in Russia', 'Wikipedia: Reproduction', 'Wikipedia: Saint Lucia', 'Wikipedia: Supermarine', 'Wikipedia: Safe semantics', 'Wikipedia: Space exploration', 'Wikipedia: List of synthetic polymers', 'Wikipedia: Second Council of Nicaea', 'Wikipedia: Politics of Turkmenistan', 'Wikipedia: Saint Timothy', 'Wikipedia: Torpoint Ferry', 'Wikipedia: United Religions Initiative', 'Wikipedia: Vittorio De Sica']
[['History', 'Origins', '19th century', '20th century to the present'], ['Origin', 'Popularization in World War II', 'Etymology', 'History', 'Earliest history', 'Legend of the founding of Rome', 'Early history', '1998 financial crisis', '2000s', 'Regulation', 'Universal Service Fund', 'Romulus', 'Numa Pompilius', 'Tullus Hostilius', 'Ancus Marcius', 'Lucius Tarquinius Priscus', 'Servius Tullius', 'Lucius Tarquinius Superbus', 'Public offices after the monarchy', 'Notes and referen

['Wikipedia: Airmed', 'Wikipedia: Jowett Cars', 'Wikipedia: Thoughtcrime', 'Wikipedia: Andrew Fleming', 'Wikipedia: 2080s', 'Wikipedia: John Kennedy (disambiguation)', 'Wikipedia: Santa Claus, Indiana', 'Wikipedia: Computer and network surveillance', 'Wikipedia: Bath County, Virginia', 'Wikipedia: Humphreys County, Tennessee', 'Wikipedia: Clark County, South Dakota', 'Wikipedia: Huntersville', 'Wikipedia: Garfield County, Nebraska', 'Wikipedia: Humphreys County, Mississippi', 'Wikipedia: Cook County, Minnesota']
[['References', 'Frequency-division multiple access', 'Time division multiple access', 'Code division multiple access and spread spectrum multiple access', 'Space-division multiple access', 'Power-division multiple access', 'Packet mode methods', 'Duplexing methods', 'Hybrid channel access scheme application examples', 'Definition within certain application areas', 'Local and metropolitan area networks', 'Early history', 'Inter-war years', 'Second World War', 'Post-war', 'Pe

['Wikipedia: Ogma', 'Wikipedia: Open Audio License', 'Wikipedia: Belgrade (disambiguation)', 'Wikipedia: Ada Vélez', 'Wikipedia: Flywheel', 'Wikipedia: Meurthe-et-Moselle', 'Wikipedia: Waukesha County, Wisconsin', 'Wikipedia: Geodesic', 'Wikipedia: York County, South Carolina', 'Wikipedia: Tuscarawas County, Ohio', 'Wikipedia: Xibalba', 'Wikipedia: Clarke County, Mississippi', 'Wikipedia: Anoka County, Minnesota']
[['Route description', 'Services', 'History', 'Tampa area', 'Orlando area', 'Future', 'I-4 Ultimate ===', 'See also', 'References'], [], ['References', 'Places', 'Belgium', 'Croatia', 'United States of America', 'Russian Empire', 'Soviet Years', 'Post-Soviet era', 'Politics', 'Administrative divisions', 'Economy', 'Demographics', 'Religion', 'Sister relationships', 'See also', 'See also', 'Notes', 'References'], ['Early life', 'Career', 'Personal life', 'Style', 'Bibliography', 'Pseudonyms', 'Legacy', 'Etymology', 'Fjölsvinnsmál', 'Theories', 'Citations', 'Explanatory notes',

['Wikipedia: Derek Bailey (guitarist)', 'Wikipedia: Charles VI', 'Wikipedia: Watervliet', 'Wikipedia: Laima', 'Wikipedia: Mictlāntēcutli', 'Wikipedia: Seine-et-Marne', 'Wikipedia: Hardy County, West Virginia', 'Wikipedia: Military strategy', 'Wikipedia: Brown County, Texas', 'Wikipedia: Ascribed characteristics', 'Wikipedia: Delaware County, Pennsylvania', 'Wikipedia: Tsar Bomba', 'Wikipedia: Jacques Rogge', 'Wikipedia: Hancock County, Ohio', 'Wikipedia: Anguta', 'Wikipedia: Unhcegila', 'Wikipedia: Victoria Williams', 'Wikipedia: Woolwich', 'Wikipedia: Callaway County, Missouri', 'Wikipedia: Yellow Medicine County, Minnesota', 'Wikipedia: Schoolcraft County, Michigan']
[['References'], ['Sport', 'Arts, entertainment and media', 'Films', 'Games', 'Literature', 'Other uses', 'See also', 'Australia', 'Canada', 'United Kingdom', 'United States', 'Elsewhere', 'Ships', 'Sport', 'Other uses', 'See also', 'Films', 'Primary sources', 'Further reading'], ['Disambiguation pages with short descrip

['Wikipedia: Bride', 'Wikipedia: Wilhelm Maybach', 'Wikipedia: Ninjutsu', 'Wikipedia: Accipitridae', 'Wikipedia: Agrippa Postumus', 'Wikipedia: Quilt', 'Wikipedia: Cihuateteo', 'Wikipedia: KRS', 'Wikipedia: Wharton County, Texas', 'Wikipedia: Kimble County, Texas', 'Wikipedia: Benton County, Tennessee', 'Wikipedia: Fairfield County, South Carolina', 'Wikipedia: Ix', 'Wikipedia: Sverdrup', 'Wikipedia: Johann Olav Koss', 'Wikipedia: Robeson County, North Carolina', 'Wikipedia: Sun Electric (band)', 'Wikipedia: Cudham', 'Wikipedia: Long Beach, California']
[['Etrusca Disciplina', 'Priests and officials', 'Beliefs', 'Spirits and deities', 'Afterlife', 'See also', 'Notes', 'References'], ['Etymology', 'Attire', 'History', 'Religion', 'Early life and career beginnings (1846 to 1869)', "Daimler and Otto's four-stroke engine (1869 to 1880)", 'Daimler Motors: fast and small engines (1882)', 'The Daimler Engine', 'Inspiration', 'Activities', 'Crafts', 'Role-playing', 'Conventions', 'Websites and

['Wikipedia: Voltumna', 'Wikipedia: Cailleach', 'Wikipedia: Glottochronology', 'Wikipedia: Fairbanks North Star Borough, Alaska', 'Wikipedia: NewsRadio', 'Wikipedia: Joe Cocker', 'Wikipedia: Sweet corn', 'Wikipedia: Xipe Totec', 'Wikipedia: Gilda Radner', 'Wikipedia: Industrial music', 'Wikipedia: Trinity County, Texas', 'Wikipedia: Jackson County, Texas', 'Wikipedia: Uilleann pipes', 'Wikipedia: Pasquotank County, North Carolina', 'Wikipedia: Cibola County, New Mexico', 'Wikipedia: Finsbury', 'Wikipedia: Ruislip', 'Wikipedia: Ozark County, Missouri', 'Wikipedia: Walthall County, Mississippi', 'Wikipedia: Pope County, Minnesota']
[['See also', 'Notes', 'References', 'Writings', 'References', 'Further reading'], ['Declassified documents', 'Morrison Center', 'Computer Science Department', 'Micron Center for Materials Research', 'Other campuses', 'Academics and organization', 'Publishing', 'Athletics', 'Albertsons Stadium', 'ExtraMile Arena', 'Student life', 'Methodology', 'Word list', 'G

In [20]:
print(len(title_list),len(url_list),len(abstract_list),len(anchor_list),len(link_list))

22060 22059 22059 16191 16602
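The per-field concatenation in Step 9 can also be expressed with `itertools.chain`, which avoids rebuilding each list on every iteration. Below is a minimal sketch with toy records shaped like the parsed output. The helper name `flatten_field` and the sample values are illustrative assumptions.

```python
from itertools import chain

def flatten_field(records, field, drop_empty=False):
    """Concatenate the lists stored under `field` across all records."""
    values = chain.from_iterable(rec[field] for rec in records)
    if drop_empty:
        return [v for v in values if len(v) > 0]
    return list(values)

# Toy records: one anchor list per title, some of them empty.
records = [
    {"title": ["A", "B"], "anchor": [["History"], []]},
    {"title": ["C"], "anchor": [["See also", "References"]]},
]
title_list = flatten_field(records, "title")
anchor_list = flatten_field(records, "anchor", drop_empty=True)
print(len(title_list), len(anchor_list))
```

Dropping empty anchor and link lists, as the Step 9 loop does with `len(val) > 0`, is one reason those counts can come out lower than the title count.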


In [25]:
print("Total number of distinct titles in 1 Million rows of the wikipedia data :" , len(set(title_list)))

Total number of distinct titles in 1 Million rows of the wikipedia data : 22060


# Method 2
- `Using XML iterparse`: This is a very efficient and simple way to parse a large XML file iteratively.


- The first method is for those who want hands-on practice with PySpark on different problems.
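As a small illustration of the iterparse idea, the sketch below emits one record each time a `<doc>` element closes and then clears that element so memory stays bounded. It assumes every record is wrapped in a `<doc>` element, as the sample output in Step 1 suggests. The helper `iter_docs` and the abridged in-memory sample are hypothetical; this is not the notebook's own loop.

```python
import io
import xml.etree.ElementTree as ET

def iter_docs(source):
    """Yield one dict per <doc> element, clearing each element once consumed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "doc":
            continue
        yield {
            "title": elem.findtext("title"),
            "url": elem.findtext("url"),
            "abstract": elem.findtext("abstract"),
            "anchor": [a.text for a in elem.iter("anchor")],
            "link": [l.text for l in elem.iter("link")],
        }
        elem.clear()  # free the children of the finished <doc>

# Abridged toy sample in the same shape as the Wikipedia abstract file.
sample = io.BytesIO(b"""<feed>
<doc>
<title>Wikipedia: Anarchism</title>
<url>https://en.wikipedia.org/wiki/Anarchism</url>
<abstract>Anarchism is a political philosophy.</abstract>
<links>
<sublink linktype="nav"><anchor>History</anchor><link>https://en.wikipedia.org/wiki/Anarchism#History</link></sublink>
</links>
</doc>
</feed>""")

docs = list(iter_docs(sample))
print(docs[0]["title"], docs[0]["anchor"])
```

Keying on the closing `<doc>` tag keeps a record's title, URL, abstract, anchors and links together without tracking state across events.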

In [21]:
anch = []; lnk = []; start = time.time(); df = pd.DataFrame(); i = 0
elements_parsed = {"title" : [], "url" : [], "abstract" : [], "anchor" : [], "link" : []}



for event, elem in ET.iterparse("./xml_data/enwiki-latest-abstract.xml"):    
    if elem.tag == "title":
        elements_parsed["title"].append(elem.text.split(":")[1])
              
    if elem.tag == "url":
        elements_parsed["url"].append(elem.text)
    if elem.tag == "abstract":
        elements_parsed["abstract"].append(elem.text)
    
    if len(elem) > 0:
        for child in elem:
            if child.tag == "anchor":
                anch.append(child.text)

            if child.tag == "link": 
                lnk.append(child.text)
        
        
    if i > 0 and elem.tag == "title":
        elements_parsed["anchor"].append(anch)  
        elements_parsed["link"].append(lnk)        
        anch = []; lnk = []
        if len(df) == 0:
            df["title"] = [elements_parsed["title"][0]]
            df["url"] = [elements_parsed["url"][0]]
            df["abstract"] = [elements_parsed["abstract"][0]]
            df["anchor"] = [elements_parsed["anchor"][0]]
            df["link"] = [elements_parsed["link"][0]]
        if len(df) > 0:
            df.loc[i] = [elements_parsed["title"][0],elements_parsed["url"][0],elements_parsed["abstract"][0],elements_parsed["anchor"][0],elements_parsed["link"][0]]
        
        elements_parsed.clear()
        elements_parsed = {"title" : [], "url" : [], "abstract" : [], "anchor" : [], "link" : []}
        elements_parsed["title"].append(elem.text)
            
            
    i = i+1
    if i > 0 and i % 1000000 == 0:#00
        print(i,"Time taken...", time.time()-start, "seconds..")
    if i == 1000000:
        break

1000000 Time taken... 125.71665978431702 seconds..


In [22]:
df.shape

(17472, 5)

In [23]:
df[1:]

Unnamed: 0,title,url,abstract,anchor,link
77,Anarchism,https://en.wikipedia.org/wiki/Anarchism,Anarchism is a political philosophy and moveme...,"[Etymology, terminology and definition, Histor...",[https://en.wikipedia.org/wiki/Anarchism#Etymo...
163,Wikipedia: Autism,https://en.wikipedia.org/wiki/Autism,| onset = By age two or three,"[Characteristics, Social development, Communic...",[https://en.wikipedia.org/wiki/Autism#Characte...
231,Wikipedia: Albedo,https://en.wikipedia.org/wiki/Albedo,"Albedo () (, meaning 'whiteness') is the measu...","[Terrestrial albedo, White-sky, black-sky, and...",[https://en.wikipedia.org/wiki/Albedo#Terrestr...
287,Wikipedia: A,https://en.wikipedia.org/wiki/A,][][][][][][][][][][],"[History, Typographic variants, Use in writing...","[https://en.wikipedia.org/wiki/A#History, http..."
451,Wikipedia: Alabama,https://en.wikipedia.org/wiki/Alabama,(We dare defend our rights),"[Etymology, History, Pre-European settlement, ...",[https://en.wikipedia.org/wiki/Alabama#Etymolo...
...,...,...,...,...,...
999845,Wikipedia: AD 5,https://en.wikipedia.org/wiki/AD_5,__NOTOC__,"[Events, By place, Roman Empire, Births, Death...","[https://en.wikipedia.org/wiki/AD_5#Events, ht..."
999874,Wikipedia: AD 6,https://en.wikipedia.org/wiki/AD_6,AD 6 was a common year starting on Friday (lin...,"[Events, By place, Roman Empire, China, Births...","[https://en.wikipedia.org/wiki/AD_6#Events, ht..."
999903,Wikipedia: AD 7,https://en.wikipedia.org/wiki/AD_7,AD 7 was a common year starting on Saturday (l...,"[Events, By place, Roman Empire, China, Persia...","[https://en.wikipedia.org/wiki/AD_7#Events, ht..."
999947,Wikipedia: AD 8,https://en.wikipedia.org/wiki/AD_8,AD 8 was a leap year starting on Sunday (link ...,"[Events, By place, Roman Empire, Europe, Persi...","[https://en.wikipedia.org/wiki/AD_8#Events, ht..."


In [28]:
print("Total number of distinct titles in the records parsed using XMLiterparse : ", df.shape[0])

Total number of distinct titles in the records parsed using XMLiterparse :  17472


End note: XML iterparse misses a few records, though it is very fast compared to the PySpark way of parsing.