# 3.2. Files

Module M-227-04: Programming for Data Analytics

Instructor: prof. Dmitry Pavlyuk

## Files

Files are a common way to store and transfer data. In this topic we review how to read and write files in the following popular formats:

* Plain text
* CSV
* XLSX
* JSON
* XML
* Pickle


## Text files

### Reading a plain text file

In [1]:
text = ""
with open('files/alice_in_wonderland.txt', 'r') as file:
    for line in file:
        text = text+" "+line
print(text[0:500])

 Alice's Adventures in Wonderland
 
                 ALICE'S ADVENTURES IN WONDERLAND
 
                           Lewis Carroll
 
                THE MILLENNIUM FULCRUM EDITION 3.0
 
 
 
 
                             CHAPTER I
 
                       Down the Rabbit-Hole
 
 
   Alice was beginning to get very tired of sitting by her sister
 on the bank, and of having nothing to do:  once or twice she had
 peeped into the book her sister was reading, but it had no
 pictures or conversations in


Let's implement this as a function so that we can reuse it later:

In [2]:
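def print_file_head(filename, count=500):
    """Print the first `count` characters of a text file."""
    text = ""
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            # A space is prepended to every line, as in the cell above
            text = text + " " + line
            # Stop reading as soon as enough characters are collected
            if len(text) > count:
                break
    print(text[0:count])
    if len(text) >= count:
        print("...")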
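If only the first characters of a file are needed, `file.read(count)` returns at most `count` characters without looping over lines. The sketch below is a minimal variant under that assumption; the helper name `read_head` and the temporary sample file are hypothetical and not part of the course files.

```python
import os
import tempfile


def read_head(filename, count=500, encoding="utf-8"):
    """Return at most `count` characters from the start of a text file."""
    with open(filename, "r", encoding=encoding) as file:
        return file.read(count)


# Demonstration on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as file:
        file.write("Down the Rabbit-Hole\n" * 100)
    head = read_head(sample_path, count=20)

print(head)
```

Unlike `print_file_head`, this variant returns the text instead of printing it and does not prepend a space to each line.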

### Basic text processing

In [3]:
import re
# Keep only letters and spaces, lowercase everything, and split into a list of words
text=re.sub(r'[^A-Za-z ]+', '', text).lower().split()

# The set contains every distinct word once
words = set(text)
print(f"Number of unique words: {len(words)}")

word_stat = {}
for word in words:
    word_stat[word] = text.count(word)
word_stat = sorted(word_stat.items(), key=lambda x: x[1], reverse=True)
print("Most popular words:")
word_stat[:10]

Number of unique words: 2749
Most popular words:


[('the', 1632),
 ('and', 845),
 ('to', 721),
 ('a', 627),
 ('she', 537),
 ('it', 526),
 ('of', 508),
 ('said', 462),
 ('i', 401),
 ('alice', 386)]
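Calling `list.count` once per distinct word rescans the whole word list each time. A possible alternative is `collections.Counter` from the standard library, which counts all words in a single pass. The sketch below uses a short made-up sentence instead of the book text.

```python
import re
from collections import Counter

# Made-up sample sentence, used only for illustration.
sample_text = "The Queen said to Alice: the tarts! The tarts, said the King."

# Same cleaning steps: keep letters and spaces, lowercase, split on whitespace.
tokens = re.sub(r"[^A-Za-z ]+", "", sample_text).lower().split()

word_counts = Counter(tokens)  # one pass over the token list
print(f"Number of unique words: {len(word_counts)}")
print(word_counts.most_common(1))
```

`most_common(n)` returns the `n` most frequent words as `(word, count)` pairs, so no separate sorting step is needed.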

## CSV and Excel files

### CSV files

A CSV (comma-separated values) file is a plain text file that stores tabular data: each line is one row of the table, and the values within a row are separated by commas.

Internal CSV structure:

In [4]:
print_file_head('files/countries_codes.csv')

 "Country","Alpha-2 code","Alpha-3 code","Numeric code","Latitude (average)","Longitude (average)"
 "Afghanistan","AF","AFG","4","33","65"
 "Albania","AL","ALB","8","41","20"
 "Algeria","DZ","DZA","12","28","3"
 "American Samoa","AS","ASM","16","-14.3333","-170"
 "Andorra","AD","AND","20","42.5","1.6"
 "Angola","AO","AGO","24","-12.5","18.5"
 "Anguilla","AI","AIA","660","18.25","-63.1667"
 "Antarctica","AQ","ATA","10","-90","0"
 "Antigua and Barbuda","AG","ATG","28","17.05","-61.8"
 "Argentina",
...


### Reading CSV files

In [5]:
def print_head(dictionary, count=6):
    print(dict(list(dictionary.items())[0:count]))

In [6]:
import csv

iso2toiso3 = {}
with open('files/countries_codes.csv', 'r') as file:
    reader = csv.DictReader(file)
    for line in reader:
        iso2toiso3[line['Alpha-2 code']] = line['Alpha-3 code']
print_head(iso2toiso3)

{'AF': 'AFG', 'AL': 'ALB', 'DZ': 'DZA', 'AS': 'ASM', 'AD': 'AND', 'AO': 'AGO'}
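Two details of `csv.DictReader` are easy to miss: a comma inside a quoted field does not split the field, and every value is returned as a string. The sketch below illustrates both on a small in-memory sample (the sample rows are written for this example and are not read from `countries_codes.csv`).

```python
import csv
import io

# In-memory sample; the second data row has a comma inside a quoted field.
sample_csv = (
    '"Country","Alpha-2 code","Alpha-3 code","Numeric code"\n'
    '"Latvia","LV","LVA","428"\n'
    '"Korea, Republic of","KR","KOR","410"\n'
)

reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)

print(reader.fieldnames)
print(rows[1]["Country"])

# Values are strings, so numeric columns need an explicit conversion.
numeric_code = int(rows[0]["Numeric code"])
print(numeric_code + 1)
```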


### Reading XLSX files

XLSX is the spreadsheet format of the Office Open XML specification, which was developed by Microsoft and standardized as ECMA-376 and ISO/IEC 29500. An XLSX file is a ZIP archive of XML documents.

In [7]:
import openpyxl

workbook = openpyxl.load_workbook("files/country_population.xlsx")
sheet = workbook["country_population"]

rows = sheet.iter_rows(values_only=True)
header = next(rows)  # the first row holds the column names
population_dict = {}
for row in rows:
    # Pair every cell value with its column name
    country_row = dict(zip(header, row))
    population_dict[country_row['Country Code']] = country_row['Population, 2021']
print_head(population_dict)

{'ABW': 107195, 'AFE': 694665117, 'AFG': 39835428, 'AFW': 470898870, 'AGO': 33933611, 'ALB': 2811666}


## JSON files

JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays.

Internal JSON structure:

In [8]:
print_file_head('files/countries.json')

 {
   "AD": {
     "name": "Andorra",
     "native": "Andorra",
     "phone": [
       376
     ],
     "continent": "EU",
     "capital": "Andorra la Vella",
     "currency": [
       "EUR"
     ],
     "languages": [
       "ca"
     ]
   },
   "AE": {
     "name": "United Arab Emirates",
     "native": "دولة الإمارات العربية المتحدة",
     "phone": [
       971
     ],
     "continent": "AS",
     "capital": "Abu Dhabi",
     "currency": [
       "AED"
     ],
     "languages": [
       "ar"

...


### Reading JSON files

In [9]:
import json

with open('files/countries.json', encoding='utf-8') as f:
    country_dict = json.load(f)
print(json.dumps(country_dict["LV"], indent=2))

{
  "name": "Latvia",
  "native": "Latvija",
  "phone": [
    371
  ],
  "continent": "EU",
  "capital": "Riga",
  "currency": [
    "EUR"
  ],
  "languages": [
    "lv"
  ]
}
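When JSON is parsed, each JSON type is mapped to a Python type: objects become `dict`, arrays become `list`, `true`/`false` become `bool`, and `null` becomes `None`. The following sketch shows this on a small hand-written string (the keys `landlocked`, `note`, and `area` are invented for the example and do not occur in `countries.json`).

```python
import json

# Hand-written sample; some keys are invented for illustration only.
sample = '{"name": "Latvia", "phone": [371], "landlocked": false, "note": null, "area": 64589.0}'

parsed = json.loads(sample)
for key, value in parsed.items():
    print(key, type(value).__name__)

# Serializing and parsing again gives back an equal object.
round_trip = json.loads(json.dumps(parsed))
print(round_trip == parsed)
```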


## XML files

XML (Extensible Markup Language) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data.

Internal XML structure:

In [10]:
print_file_head("files/countries_gdp.xml")

 ﻿<?xml version="1.0" encoding="utf-8"?>
 <Root xmlns:wb="http://www.worldbank.org">
   <data>
     <record>
       <field name="Country or Area" key="ABW">Aruba</field>
       <field name="Item" key="NY.GDP.PCAP.CD">GDP per capita (current US$)</field>
       <field name="Year">1960</field>
       <field name="Value" />
     </record>
     <record>
       <field name="Country or Area" key="ABW">Aruba</field>
       <field name="Item" key="NY.GDP.PCAP.CD">GDP per capita (current US$)</field>
   
...


### Reading XML files

In [11]:
from xml.dom import minidom

file = minidom.parse('files/countries_gdp.xml')
records = file.getElementsByTagName('record')
record_list = []
for record in records:
    fs = {}
    for field in record.getElementsByTagName("field"):
        if (field.firstChild is not None):
            value = field.getAttribute("key") if field.hasAttribute("key") else field.firstChild.nodeValue
            fs[field.getAttribute("name")]=value
    record_list.append(fs)
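The same extraction can also be sketched with `xml.etree.ElementTree`, another XML parser from the standard library. This is an alternative to `minidom`, not the approach used in this topic. The sample document below mimics the structure of `countries_gdp.xml`; the 1961 value is made up.

```python
import xml.etree.ElementTree as ET

# Small sample with the same structure; the 1961 value is made up.
sample_xml = """<Root>
  <data>
    <record>
      <field name="Country or Area" key="ABW">Aruba</field>
      <field name="Year">1960</field>
      <field name="Value" />
    </record>
    <record>
      <field name="Country or Area" key="ABW">Aruba</field>
      <field name="Year">1961</field>
      <field name="Value">123.4</field>
    </record>
  </data>
</Root>"""

root = ET.fromstring(sample_xml)
sample_records = []
for record in root.iter("record"):
    fields = {}
    for field in record.findall("field"):
        if field.text is None:
            continue  # empty element such as <field name="Value" />
        # Prefer the "key" attribute (the country code) over the element text.
        fields[field.get("name")] = field.get("key", field.text)
    sample_records.append(fields)

print(sample_records)
```

As in the `minidom` version, a record with an empty `Value` element simply has no `'Value'` key.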



* Preprocessing: selecting the most recent available value for each country

In [12]:
gdp_dict = {}
year_latest_gdp_available = {}
for record in record_list:
    if 'Value' in record:
        country = record['Country or Area']
        year = int(record['Year'])
        value = float(record['Value'])
        if country not in year_latest_gdp_available or year_latest_gdp_available[country] < year:
            year_latest_gdp_available[country] = year
            gdp_dict[country] = value
gdp_dict["LVA"]

20642.1679221253
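The "most recent value" rule can be checked on a small list where the years are deliberately out of order. The records below are invented for this sketch (country codes `AAA` and `BBB` are placeholders), and a single dictionary of `(year, value)` pairs is used instead of two separate dictionaries.

```python
# Invented records; the years are intentionally not sorted.
records = [
    {"Country or Area": "AAA", "Year": "2019", "Value": "10.0"},
    {"Country or Area": "AAA", "Year": "2021", "Value": "12.5"},
    {"Country or Area": "AAA", "Year": "2020", "Value": "11.0"},
    {"Country or Area": "BBB", "Year": "2020"},  # no value reported
    {"Country or Area": "BBB", "Year": "2018", "Value": "7.0"},
]

latest = {}  # country code -> (year, value)
for record in records:
    if "Value" not in record:
        continue  # skip years without data
    country = record["Country or Area"]
    candidate = (int(record["Year"]), float(record["Value"]))
    # Keep the candidate if the country is new or the year is more recent.
    if country not in latest or candidate[0] > latest[country][0]:
        latest[country] = candidate

latest_values = {country: value for country, (year, value) in latest.items()}
print(latest_values)
```

For `AAA` the 2021 value is kept even though a 2020 record comes later in the list, and for `BBB` the 2018 value is kept because 2020 has no data.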

## Example: extending the data

Now we are going to extend the __country_dict__ information on countries (read from the JSON file) with information on population (__population_dict__ from the XLSX file) and GDP per capita (__gdp_dict__ from the XML file). The __iso2toiso3__ dictionary from the CSV file is used to match the ISO-2 country codes (used in __country_dict__) with the ISO-3 country codes (used in __population_dict__ and __gdp_dict__).

In [13]:
for country_code in country_dict:
    country = country_dict[country_code]
    if country_code in iso2toiso3:
        # Translate the ISO-2 code into the ISO-3 code used by the other sources
        iso3_code = iso2toiso3[country_code]
        if iso3_code in population_dict:
            country["population"] = population_dict[iso3_code]
        if iso3_code in gdp_dict:
            country["GDP_per_capita"] = gdp_dict[iso3_code]
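The matching step can be tried in isolation on a few hand-made dictionaries. In the sketch below the names `countries`, `iso2_to_iso3`, and `population` are stand-ins for the real dictionaries, and the entry `XX` is a hypothetical code with no match.

```python
# Hand-made stand-ins for the real dictionaries.
countries = {"LV": {"name": "Latvia"}, "XX": {"name": "Nowhere"}}
iso2_to_iso3 = {"LV": "LVA"}
population = {"LVA": 1883162}

for iso2, country in countries.items():
    iso3 = iso2_to_iso3.get(iso2)  # None when the ISO-2 code is unknown
    if iso3 in population:
        country["population"] = population[iso3]

print(countries)
```

Countries without a match are left unchanged, so later code has to handle a missing `population` key.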


### Extended data

In [14]:
print(json.dumps(country_dict["LV"], indent=2))

{
  "name": "Latvia",
  "native": "Latvija",
  "phone": [
    371
  ],
  "continent": "EU",
  "capital": "Riga",
  "currency": [
    "EUR"
  ],
  "languages": [
    "lv"
  ],
  "population": 1883162,
  "GDP_per_capita": 20642.1679221253
}


## Saving data to files

### Writing JSON

We store the extended __country_dict__ dictionary into a JSON file:

In [15]:
with open("files/countries_extended.json", "w") as outfile:
    json.dump(country_dict, outfile, indent=2)
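By default the `json` module escapes non-ASCII characters as `\uXXXX` sequences, so native country names are not readable in the output file. One option is to pass `ensure_ascii=False` and open the file with an explicit UTF-8 encoding. The sketch below uses a hand-made entry and a temporary file.

```python
import json
import os
import tempfile

# Hand-made sample entry with a non-ASCII native name.
data = {"AT": {"name": "Austria", "native": "Österreich"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(data, outfile, indent=2, ensure_ascii=False)
    with open(path, encoding="utf-8") as infile:
        raw_text = infile.read()

restored = json.loads(raw_text)
escaped_text = json.dumps(data)  # default: non-ASCII characters are escaped

print(raw_text)
print(escaped_text)
```

Both forms are valid JSON and load back to the same dictionary; the choice only affects how readable the file is.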

### Writing CSV

Now we store selected fields of __country_dict__ in a CSV file:

In [16]:
with open('files/countries_extended.csv', 'w', encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(('ISO2','Name','Population','GDP per capita'))
    for country_code, country in country_dict.items():
        # Missing fields are written as empty strings
        writer.writerow((country_code, country["name"],
                         country.get("population", ""),
                         country.get("GDP_per_capita", "")))
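When rows are already dictionaries, `csv.DictWriter` may be a convenient alternative: `restval` fills in missing keys and `extrasaction="ignore"` skips keys that are not listed in `fieldnames`. A minimal sketch with hand-made data, written to an in-memory buffer instead of a file:

```python
import csv
import io

# Hand-made sample; "XX" is a hypothetical entry without population data.
countries = {
    "LV": {"name": "Latvia", "population": 1883162, "continent": "EU"},
    "XX": {"name": "Nowhere"},
}

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["ISO2", "name", "population"],
    restval="",             # value written for missing keys
    extrasaction="ignore",  # silently skip keys such as "continent"
)
writer.writeheader()
for code, country in countries.items():
    writer.writerow({"ISO2": code, **country})

csv_text = buffer.getvalue()
print(csv_text)
```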

## Pickle files

Pickle is used to serialize Python object structures, that is, to convert an object in memory into a byte stream that can be stored as a binary file on disk. Only unpickle data from sources you trust, because loading a pickle file can execute arbitrary code.

### Writing Pickle

In [17]:
import pickle

with open('files/countries_extended.pickle', 'wb') as outfile:
    pickle.dump(country_dict, outfile)

### Reading Pickle

In [18]:
with open('files/countries_extended.pickle', 'rb') as infile:
    country_dict_stored = pickle.load(infile)
    
print("Successfully stored? ", country_dict_stored == country_dict)
print(json.dumps(country_dict_stored["LV"], indent=2))

Successfully stored?  True
{
  "name": "Latvia",
  "native": "Latvija",
  "phone": [
    371
  ],
  "continent": "EU",
  "capital": "Riga",
  "currency": [
    "EUR"
  ],
  "languages": [
    "lv"
  ],
  "population": 1883162,
  "GDP_per_capita": 20642.1679221253
}
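One reason to choose Pickle over JSON is that it preserves Python-specific types. The sketch below round-trips a hand-made dictionary containing a tuple, a set, and a date through `pickle.dumps` and `pickle.loads`, and shows that `json.dumps` rejects the same object.

```python
import json
import pickle
from datetime import date

# Hand-made sample with types that JSON has no equivalent for.
record = {
    "codes": ("LV", "LVA"),
    "languages": {"lv"},
    "updated": date(2021, 12, 31),
}

# In-memory round trip: object -> bytes -> object.
restored = pickle.loads(pickle.dumps(record))
print(restored == record)
print(type(restored["codes"]).__name__, type(restored["languages"]).__name__)

json_error = None
try:
    json.dumps(record)
except TypeError as error:
    json_error = error
    print("JSON cannot encode this object:", error)
```

The trade-off is that a pickle file can only be read by Python, while JSON is readable by almost any language.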

