In this lecture, we process data presented in several common encodings, such as CSV, JSON, and XML, using Python. Unlike the lectures on data structures, this chapter does not focus on specific algorithms but on the problem of getting data in and out of a program.

## Reading and Writing CSV Data

You want to read or write data encoded as a CSV file. For most kinds of CSV data, use the csv library. For example, suppose you have some stock market data in a file named data.csv. Note that every value read this way is a string:

In [38]:
import csv
with open('data.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    #print(headers[1])
    for row in f_csv:
        print(type(row[1]))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


If you do not want to use indexes to reach a specific column in the CSV file, you can use named tuples instead:

In [23]:
from collections import namedtuple
with open('data.csv') as f:
    f_csv=csv.reader(f)
    # headings holds the column names
    headings=next(f_csv)
    Row=namedtuple('Row',headings)
    for r in f_csv:
        row=Row(*r)
        print(row)

Row(Symbol='AA', Price='39.48', Date='6/11/2007', Time='9:36am', Change='-0.18', Volume='181800')
Row(Symbol='AIG', Price='71.38', Date='6/11/2007', Time='9:36am', Change='-0.15', Volume='195500')
Row(Symbol='AXP', Price='62.58', Date='6/11/2007', Time='9:36am', Change='-0.46', Volume='935000')
Row(Symbol='BA', Price='98.31', Date='6/11/2007', Time='9:36am', Change='+0.12', Volume='104800')
Row(Symbol='C', Price='53.08', Date='6/11/2007', Time='9:36am', Change='-0.25', Volume='360900')
Row(Symbol='CAT', Price='78.29', Date='6/11/2007', Time='9:36am', Change='-0.23', Volume='225400')


Another alternative is to read the data as a sequence of dictionaries. To do that, the csv module provides another object, __DictReader()__.

In [21]:
with open('data.csv') as f:
    f_csv=csv.DictReader(f)
    for row in f_csv:
        print(row['Volume'])

181800
195500
935000
104800
360900
225400


### How to write data to a CSV file

In [22]:
headers

['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']

In [25]:
rows=[['AA', '39.48', '6/11/2007', '9:36am', '-0.18', '181800'],
['AIG', '71.38', '6/11/2007', '9:36am', '-0.15', '195500'],
['AXP', '62.58', '6/11/2007', '9:36am', '-0.46', '935000'],
['BA', '98.31', '6/11/2007', '9:36am', '+0.12', '104800'],
['C', '53.08', '6/11/2007', '9:36am', '-0.25', '360900'],
['CAT', '78.29', '6/11/2007', '9:36am', '-0.23', '225400']
     ]

In [26]:
with open('stocks.csv','w') as f:
    f_csv=csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)

In [27]:
with open('stocks.csv') as f:
    for line in f:
        row=line.split(',')
        print(row)

['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume\n']
['\n']
['AA', '39.48', '6/11/2007', '9:36am', '-0.18', '181800\n']
['\n']
['AIG', '71.38', '6/11/2007', '9:36am', '-0.15', '195500\n']
['\n']
['AXP', '62.58', '6/11/2007', '9:36am', '-0.46', '935000\n']
['\n']
['BA', '98.31', '6/11/2007', '9:36am', '+0.12', '104800\n']
['\n']
['C', '53.08', '6/11/2007', '9:36am', '-0.25', '360900\n']
['\n']
['CAT', '78.29', '6/11/2007', '9:36am', '-0.23', '225400\n']
['\n']
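The empty `['\n']` rows in the output above are the usual symptom of writing a CSV file on Windows without `newline=''`; the csv module documentation recommends opening the file with `newline=''` so that the writer controls line endings itself. The following is a minimal sketch of that fix, assuming a shortened version of the `rows` list from above, and reading the file back with `csv.reader` rather than `str.split(',')`:

```python
import csv

headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
# Shortened copy of the rows used above, for illustration only
rows = [
    ['AA', '39.48', '6/11/2007', '9:36am', '-0.18', '181800'],
    ['AIG', '71.38', '6/11/2007', '9:36am', '-0.15', '195500'],
]

# newline='' lets the csv module write its own line endings,
# which avoids the extra blank lines seen above
with open('stocks.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)

# Read the file back with csv.reader, which also handles quoted fields
with open('stocks.csv', newline='') as f:
    read_back = list(csv.reader(f))

for row in read_back:
    print(row)
```

Using `csv.reader` for the check also matters for correctness: splitting on commas by hand breaks as soon as a field contains a quoted comma.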


## If the headers are not valid identifiers (for example, a header named num-premises), how do we process the CSV file?

In [30]:
import re
with open('data.csv') as f:
    f_csv=csv.reader(f)
    headers=[ re.sub('[^a-zA-Z_]','_',h) for h in next(f_csv)]
    print("AFTER:", headers)
    Row=namedtuple('Row',headers)
    for r in f_csv:
        row=Row(*r)
        print(row.Symbol)

AFTER: ['Symbol', 'Price', '_Date', 'Time', 'Change', 'Volume']


ValueError: Field names cannot start with an underscore: '_Date'
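The `ValueError` above suggests that the `Date` header in data.csv carries a leading space: the regular expression turns `' Date'` into `'_Date'`, and `namedtuple` rejects field names that start with an underscore. One possible way around this, sketched below, is to strip whitespace before substituting and then repair any name that still starts with an underscore or a digit. The sample data and the helper name `clean_header` are assumptions made for illustration; they are not part of the original file.

```python
import csv
import io
import re
from collections import namedtuple

# Hypothetical sample: note the leading space in ' Date' and the hyphen
sample = io.StringIO(
    'Symbol,Price, Date,num-premises\n'
    'AA,39.48,6/11/2007,3\n'
)

def clean_header(name):
    """Turn a raw header into a name that namedtuple accepts (sketch only)."""
    # Strip surrounding whitespace, then replace every invalid character
    name = re.sub(r'[^a-zA-Z0-9_]', '_', name.strip())
    # namedtuple rejects names starting with an underscore or a digit
    name = name.lstrip('_')
    if not name or name[0].isdigit():
        name = 'f_' + name
    return name

f_csv = csv.reader(sample)
headers = [clean_header(h) for h in next(f_csv)]
Row = namedtuple('Row', headers)
parsed = [Row(*r) for r in f_csv]

print(headers)
print(parsed[0].Date, parsed[0].num_premises)
```

This sketch does not handle Python keywords or duplicate names. For those cases, `namedtuple('Row', headers, rename=True)` is an alternative that replaces invalid names with positional ones such as `_2`.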

# How to convert string data to other types

In [36]:
# 1st way
# list the data type of each column, in column order
col_types=[str,float,str,str, float, int]
with open('data.csv') as f:
    f_csv=csv.reader(f)
    header=next(f_csv)
    for row in f_csv:
        row=[converted_type(value)
             for converted_type,value in zip(col_types,row)]
        print(type(row[1]))

<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>


## Converting the types of selected fields (columns) of a CSV file

In [42]:
field_types=[('Price',float),('Change',float),('Volume',int)]
with open('data.csv') as f:
    for row in csv.DictReader(f):
        row.update((key,conversion(row[key]))
                   for key, conversion in field_types)
        print(type(row['Volume']))

<class 'int'>
<class 'int'>
<class 'int'>
<class 'int'>
<class 'int'>
<class 'int'>


## Reading and Writing JSON Data

JSON stands for JavaScript Object Notation. Python provides a module named __json__. With this module we can read and write JSON data.

In [43]:
# Let us convert a Python dictionary into JSON data
import json

data={'name':'MISY',
     'code': 3871,
     'section': 1
     }
# convert this dictionary into a JSON string
json_str=json.dumps(data)
print(json_str)

{"name": "MISY", "code": 3871, "section": 1}


In [44]:
# let us write this dictionary to a JSON file
with open('data.json','w') as f:
    json.dump(data,f)

In [45]:
#read the json file
with open('data.json','r') as f:
    data=json.load(f)
    print(data)

{'name': 'MISY', 'code': 3871, 'section': 1}


In [46]:
yes=True  # True becomes true in JSON
variable=None  # None becomes null in JSON
d={'a': True,
  'b': 'Hello',
  'c': None}

json.dumps(d)

'{"a": true, "b": "Hello", "c": null}'

In [49]:
from urllib.request import urlopen
from pprint import pprint
# a historical record of motor racing data for non-commercial purposes.
u=urlopen('http://ergast.com/api/f1/2004/1/results.json')
response=json.loads(u.read().decode('utf-8'))
pprint(response)

{'MRData': {'RaceTable': {'Races': [{'Circuit': {'Location': {'country': 'Australia',
                                                              'lat': '-37.8497',
                                                              'locality': 'Melbourne',
                                                              'long': '144.968'},
                                                 'circuitId': 'albert_park',
                                                 'circuitName': 'Albert Park '
                                                                'Grand Prix '
                                                                'Circuit',
                                                 'url': 'http://en.wikipedia.org/wiki/Melbourne_Grand_Prix_Circuit'},
                                     'Results': [{'Constructor': {'constructorId': 'ferrari',
                                                                  'name': 'Ferrari',
                                                          

In [53]:
type(response) 

dict

In [54]:
response.keys()

dict_keys(['MRData'])

In [56]:
actual_data=response['MRData']
type(actual_data)

dict

In [57]:
actual_data.keys()

dict_keys(['xmlns', 'series', 'url', 'limit', 'offset', 'total', 'RaceTable'])

In [60]:
for k,v in actual_data.items():
    print(k, ': ',v)

xmlns :  http://ergast.com/mrd/1.5
series :  f1
url :  http://ergast.com/api/f1/2004/1/results.json
limit :  30
offset :  0
total :  20
RaceTable :  {'season': '2004', 'round': '1', 'Races': [{'season': '2004', 'round': '1', 'url': 'http://en.wikipedia.org/wiki/2004_Australian_Grand_Prix', 'raceName': 'Australian Grand Prix', 'Circuit': {'circuitId': 'albert_park', 'url': 'http://en.wikipedia.org/wiki/Melbourne_Grand_Prix_Circuit', 'circuitName': 'Albert Park Grand Prix Circuit', 'Location': {'lat': '-37.8497', 'long': '144.968', 'locality': 'Melbourne', 'country': 'Australia'}}, 'date': '2004-03-07', 'Results': [{'number': '1', 'position': '1', 'positionText': '1', 'points': '10', 'Driver': {'driverId': 'michael_schumacher', 'code': 'MSC', 'url': 'http://en.wikipedia.org/wiki/Michael_Schumacher', 'givenName': 'Michael', 'familyName': 'Schumacher', 'dateOfBirth': '1969-01-03', 'nationality': 'German'}, 'Constructor': {'constructorId': 'ferrari', 'url': 'http://en.wikipedia.org/wiki/Scu

In [63]:
race_data=actual_data['RaceTable']

In [64]:
race_data.keys()

dict_keys(['season', 'round', 'Races'])
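Once the nesting is known, the same step-by-step indexing can be turned into a loop. The sketch below does not call the web service; it uses a hand-trimmed dictionary that mirrors only some of the keys visible in the output above (`Races`, `Results`, `Driver`, and so on), so the data is an abbreviated stand-in for the real response:

```python
# Abbreviated stand-in for race_data, keeping only a few keys
# that appear in the response printed above
race_data = {
    'season': '2004',
    'round': '1',
    'Races': [{
        'raceName': 'Australian Grand Prix',
        'date': '2004-03-07',
        'Results': [{
            'position': '1',
            'points': '10',
            'Driver': {'givenName': 'Michael', 'familyName': 'Schumacher'},
        }],
    }],
}

summary = []
for race in race_data['Races']:          # 'Races' is a list of dicts
    for result in race['Results']:       # each race holds a list of results
        driver = result['Driver']
        full_name = '{} {}'.format(driver['givenName'], driver['familyName'])
        # Numbers arrive as strings in this feed, so convert them explicitly
        summary.append((race['raceName'], int(result['position']),
                        full_name, int(result['points'])))

for entry in summary:
    print(entry)
```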

# How to convert JSON data into a Python object rather than a dict or list

In [65]:
json_data='{"name":"MISY", "code":3871, "section":1}'

# First define a class that serves as a template for generating objects
class JSONObject:
    def __init__(self,d):
        self.__dict__= d
        
data=json.loads(json_data,object_hook=JSONObject)
        

In [66]:
data.name

'MISY'

In [67]:
data.code

3871

In [68]:
data.section

1

In [69]:
type(data)

__main__.JSONObject

TypeError: Object of type JSONObject is not JSON serializable
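The `TypeError` above is the error `json.dumps()` raises when it is handed an object it does not know how to encode, such as a `JSONObject` instance. One way to serialize such an object, assuming all of its attributes are themselves JSON-compatible, is to pass a `default` function that returns the instance dictionary. The helper name `to_serializable` below is made up for this sketch:

```python
import json

class JSONObject:
    def __init__(self, d):
        self.__dict__ = d

obj = json.loads('{"name": "MISY", "code": 3871, "section": 1}',
                 object_hook=JSONObject)

def to_serializable(o):
    # json.dumps calls this only for objects it cannot encode itself
    if isinstance(o, JSONObject):
        return vars(o)  # the instance dictionary is a plain dict
    raise TypeError(
        'Object of type {} is not JSON serializable'.format(type(o).__name__))

round_trip = json.dumps(obj, default=to_serializable)
print(round_trip)
```

Because `default` is applied recursively, nested `JSONObject` instances produced by `object_hook` are handled the same way.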

In [74]:
data={'name':'MISY',
     'code': 3871,
     'section': 1
     }

print(json.dumps(data))

{"name": "MISY", "code": 3871, "section": 1}


In [73]:
print(json.dumps(data,indent=4))

{
    "name": "MISY",
    "code": 3871,
    "section": 1
}


# Parsing Simple XML Data

The module used to process XML documents is xml.etree.ElementTree.

In [2]:
from urllib.request import urlopen
from xml.etree.ElementTree import parse
# Download the RSS feed and parse it.
url=urlopen('https://planetpython.org/rss20.xml')
#then parse this feed
document=parse(url)

In [5]:
#Extracting and output tags of interest from the document
for item in document.iterfind('channel/item'):
    title=item.findtext('title')
    pubDate=item.findtext('pubDate')
    link=item.findtext('link')
    print(title)
    print(pubDate)
    print(link) 

Python Morsels: What is an iterator?
Mon, 14 Mar 2022 15:00:00 +0000
https://www.pythonmorsels.com/what-is-an-iterator/
Kushal Das: Targeted WebID for privacy in Solid
Mon, 14 Mar 2022 14:31:48 +0000
https://kushaldas.in/posts/targeted-webid-for-privacy-in-solid.html
Real Python: Python Class Constructors: Control Your Object Instantiation
Mon, 14 Mar 2022 14:00:00 +0000
https://realpython.com/python-class-constructor/
Mike Driscoll: PyDev of the Week: Jessica Greene
Mon, 14 Mar 2022 12:30:44 +0000
https://www.blog.pythonlibrary.org/2022/03/14/pydev-of-the-week-jessica-greene/
Python Software Foundation: The Pi-thon 2022 PSF Spring Fundraiser!
Mon, 14 Mar 2022 11:59:43 +0000
http://pyfound.blogspot.com/2022/03/the-pi-thon-2022-psf-spring-fundraiser.html
Talk Python to Me: #356: Tips for ML / AI startups
Mon, 14 Mar 2022 08:00:00 +0000
https://talkpython.fm/episodes/show/356/tips-for-ml-ai-startups
Matthew Wright: Analyzing intraday and overnight stock returns with pandas
Mon, 14 Mar 20

In [6]:
# use of the find() method on the document object
e=document.find('channel/title')
e

<Element 'title' at 0x000000BF619A59F8>

In [7]:
e.tag

'title'

In [8]:
e.text

'Planet Python'

In [9]:
e.get('some_attribute_name')
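The last cell prints nothing because `get()` returns `None` when the element has no attribute with that name. A small self-contained sketch, using a made-up XML snippet with a hypothetical `lang` attribute instead of the live feed, shows the difference between `tag`, `text`, and `get()`:

```python
from xml.etree.ElementTree import fromstring

# Made-up snippet for illustration; the real feed may not have this attribute
root = fromstring('<channel><title lang="en">Planet Python</title></channel>')
title = root.find('title')

print(title.tag)                   # element name
print(title.text)                  # text between the opening and closing tags
print(title.get('lang'))           # value of an existing attribute
print(title.get('missing'))        # None when the attribute is absent
print(title.get('missing', 'n/a')) # optional default value
```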

For parsing XML files, the ElementTree module is not the only option.
You might consider using the __lxml__ module. You only need to change the previous import statement to the following.

In [10]:
# Instead of the following import
#from xml.etree.ElementTree import parse
#use this version
from lxml.etree import parse
from xml.etree.ElementTree import iterparse
iterparse?


In [11]:
def parse_and_remove(filename,path):
    path_parts=path.split('/')
    document=iterparse(filename,events=('start','end'))
    #skip the root element
    next(document)
    
    tags=[]
    elements=[]
    for event, element in document:
        if event=='start':
            tags.append(element.tag)
            elements.append(element)
        elif event=='end':
            if tags==path_parts:
                yield element
                elements[-2].remove(element)
            try:
                tags.pop()
                elements.pop()
            except IndexError:
                pass          
        

In [22]:
# Suppose you want to write a script that ranks employees by their id numbers.
# Let us do it together.
from collections import Counter
employees_by_id=Counter()

data=parse_and_remove('employee.xml','root/row')
for employee in data:
    print(employee.findtext('id'))
    employees_by_id[employee.findtext('id')]+=1
    
for eid, num in employees_by_id.most_common():
    print(eid,num)   

In [23]:
data

<generator object parse_and_remove at 0x000000BF64255C48>

None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None


# Turning Dictionary into XML File

In [32]:
from xml.etree.ElementTree import Element
# Write a function that converts a given Python dictionary to an XML element
def dict_to_xml(tag,dictionary):
    elem=Element(tag)
    for key, value in dictionary.items():
        child=Element(key)
        child.text=str(value)
        elem.append(child)
    return elem

In [33]:
employee={'name':'Engin','surname':'Kandiran','age':32}
e=dict_to_xml('record',employee)

In [34]:
e

<Element 'record' at 0x000000BF63F05958>

In [36]:
from xml.etree.ElementTree import tostring

In [37]:
tostring(e)

b'<record><name>Engin</name><surname>Kandiran</surname><age>32</age></record>'
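By default `tostring()` returns bytes, as the `b'...'` prefix above shows. Passing `encoding='unicode'` returns a regular string instead, and attributes can be attached to the element with `set()`. The sketch below repeats `dict_to_xml` so that it runs on its own; the `id` attribute and its value are made up for illustration:

```python
from xml.etree.ElementTree import Element, tostring

def dict_to_xml(tag, dictionary):
    # Same approach as above: one child element per dictionary key
    elem = Element(tag)
    for key, value in dictionary.items():
        child = Element(key)
        child.text = str(value)
        elem.append(child)
    return elem

employee = {'name': 'Engin', 'surname': 'Kandiran', 'age': 32}
e = dict_to_xml('record', employee)
e.set('id', '1234')  # hypothetical attribute, for illustration only

xml_text = tostring(e, encoding='unicode')  # str rather than bytes
print(type(xml_text))
print(xml_text)
```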

In [40]:
def dict_to_xml_str(tag,dictionary):
    parts=['<{}>'.format(tag)]
    for key,value in dictionary.items():
        parts.append('<{0}>{1}</{0}>'.format(key,value))
    parts.append('</{}>'.format(tag))
    return ''.join(parts)

e_str=dict_to_xml_str('record',employee)
print(e_str)

<record><name>Engin</name><surname>Kandiran</surname><age>32</age></record>
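Building the XML by hand produces the same text for this record, but unlike `Element` and `tostring()` it does not escape special characters, so a value containing `&`, `<`, or `>` would yield malformed XML. If strings must be built manually, one option is `escape()` from `xml.sax.saxutils`. The function name `dict_to_xml_str_escaped` and the sample value are assumptions made for this sketch:

```python
from xml.etree.ElementTree import fromstring
from xml.sax.saxutils import escape

def dict_to_xml_str_escaped(tag, dictionary):
    # Same string-building approach as above, with values escaped
    parts = ['<{}>'.format(tag)]
    for key, value in dictionary.items():
        parts.append('<{0}>{1}</{0}>'.format(key, escape(str(value))))
    parts.append('</{}>'.format(tag))
    return ''.join(parts)

record = {'name': 'A & B <Ltd>'}  # made-up value with special characters
xml_text = dict_to_xml_str_escaped('record', record)
print(xml_text)

# The escaped text parses cleanly and yields the original value again
print(fromstring(xml_text).findtext('name'))
```

Keys are still inserted as-is, so this sketch assumes the dictionary keys are valid XML tag names.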
