<img src="./intro_images/MIE.PNG" width="100%" align="left" />

<table style="float:right;">
    <tr>
        <td>                      
            <div style="text-align: right"><a href="https://alandavies.netlify.com" target="_blank">Dr Alan Davies</a></div>
            <div style="text-align: right">Senior Lecturer Health Data Science</div>
            <div style="text-align: right">University of Manchester</div>
         </td>
         <td>
             <img src="./intro_images/alan.PNG" width="30%" />
         </td>
     </tr>
</table>

# 11.0 File handling
****

#### About this Notebook
This notebook introduces working with common files in Python. This includes how you can save, load and append to files.

<div class="alert alert-block alert-warning"><b>Learning Objectives:</b> 
<br/> At the end of this notebook you will be able to:
    
- Investigate core concepts of file handling with Python

- Practice basic file handling tasks using Python 

</div> 

<a id="top"></a>

<b>Table of contents</b><br>

11.1 [Basic file handling](#basic)

11.2 [Working with other types of file](#other)

11.3 [Working with common file formats](#common)

11.4 [Navigating the file system](#nav)

The ability of a programming language to access and manipulate files and folders in a given operating system (e.g. Windows, Linux, macOS etc.) is a powerful tool that programmers can use to automate many processes. This is also a key part of many web-based applications. 

Files are stored in folders (directories) on a given operating system. You may be familiar with selecting files and folders using software like the file explorer in Windows as seen in the image below where the <code>R</code> folder is inside another folder called <code>Documents</code>. 

<img src="./intro_images/files.PNG" width="40%" align="left" />

The exact location of a given file on an operating system can be found with something called a <code>file path</code>. For example, an image of the author called <code>alan.PNG</code> is located at the following path: 

<code>C:\Users\Alan_Davies\NLP\alan.PNG</code>

This tells us several things. The first letter <code>C</code> is the drive the file is stored on (C is the default drive on a Windows machine). Then we have a number of folders separated by backslashes. On other operating systems, such as Linux and macOS, forward slashes are used instead. In this example, on the drive <code>C</code>, there is a folder called <code>Users</code>, within this, another folder called <code>Alan_Davies</code>, then a folder called <code>NLP</code> which contains the file <code>alan.PNG</code>. The other point to note is the letters that come after the dot (period) in a file name. This is referred to as the <code>file extension</code> and indicates what type of file it is. In this case it's an image, specifically a Portable Network Graphic (PNG). You may be familiar with other types of file such as a Word document <code>.docx</code> or a Portable Document Format (PDF) file <code>.pdf</code> and so on. This example is depicted graphically below.

<img src="./intro_images/filesfolders.PNG" width="40%" align="left" />
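To make the anatomy of a path more concrete, the short sketch below uses the standard library's <code>pathlib</code> module to split the example path into its parts. <code>PureWindowsPath</code> is used here (our choice for illustration) so that the Windows-style path is interpreted the same way on any operating system. It only works with the text of the path and does not check that the file exists.

```python
from pathlib import PureWindowsPath

# The example path from the text (a raw string keeps the backslashes as-is)
example_path = PureWindowsPath(r"C:\Users\Alan_Davies\NLP\alan.PNG")

drive = example_path.drive          # the drive the file is stored on
folders = example_path.parts[1:-1]  # the folders between the drive and the file
file_name = example_path.name       # the file name including its extension
extension = example_path.suffix     # the file extension

print(drive)      # C:
print(folders)    # ('Users', 'Alan_Davies', 'NLP')
print(file_name)  # alan.PNG
print(extension)  # .PNG
```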

<a id="basic"></a>
#### 11.1 Basic file handling

Python uses the <code>open</code> function to open files. The two parameters we will use most are the filename (with path) and the mode in which you want to open the file. Python has 4 core modes: <code>read</code>, <code>write</code>, <code>append</code> and <code>create</code>. These and some other mode options are shown in the table below:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-kiyi{font-weight:bold;border-color:inherit;text-align:left}
.tg .tg-fymr{font-weight:bold;border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-kiyi">Mode</th>
    <th class="tg-kiyi">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">"r"</td>   
    <td class="tg-0pky">Read a file</td>
  </tr>
  <tr>
    <td class="tg-xldj">"w"</td>   
    <td class="tg-0pky">Write to a file (creates it if needed and overwrites any existing contents)</td>
  </tr>
  <tr>
    <td class="tg-xldj">"a"</td>   
    <td class="tg-0pky">Append (add to the end of an existing file), creating the file if it doesn't already exist</td>
  </tr>
  <tr>
    <td class="tg-xldj">"x"</td>   
    <td class="tg-0pky">Create a new file (fails if the file already exists)</td>
  </tr>
  <tr>
    <td class="tg-xldj">"t"</td>   
    <td class="tg-0pky">Open in text mode (the default option)</td>
  </tr>
  <tr>
    <td class="tg-xldj">"b"</td>   
    <td class="tg-0pky">Opens in binary mode</td>
  </tr>
    <tr>
    <td class="tg-xldj">"+"</td>   
    <td class="tg-0pky">Opens for updating (both reading and writing)</td>
  </tr>
</table>

<div class="alert alert-success">
    <b>Note:</b> Some of these can also be combined. For example <code>f = open("file.txt", "rt")</code> to read text or <code>f = open("file.txt", "r+b")</code> for reading and writing in binary mode.
</div>
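To see how some of these modes differ in practice, here is a minimal sketch that writes to a throwaway file in a temporary folder. The file name <code>modes_demo.txt</code> and the use of the <code>tempfile</code> module are just for illustration and are not part of the notebook's data files.

```python
import os
import tempfile

# Work in a temporary folder so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

    # "w" creates the file (or empties it if it already exists)
    with open(demo_path, "w") as f:
        f.write("first line\n")

    # "a" adds to the end of the existing contents
    with open(demo_path, "a") as f:
        f.write("second line\n")

    with open(demo_path, "r") as f:
        after_append = f.read()

    # "x" refuses to open a file that already exists
    try:
        with open(demo_path, "x") as f:
            f.write("never written\n")
        x_mode_failed = False
    except FileExistsError:
        x_mode_failed = True

    # "w" again: the previous contents are replaced
    with open(demo_path, "w") as f:
        f.write("replaced\n")

    with open(demo_path, "r") as f:
        after_overwrite = f.read()

print(repr(after_append))     # 'first line\nsecond line\n'
print(x_mode_failed)          # True
print(repr(after_overwrite))  # 'replaced\n'
```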

Let's try to open a file. Here we will create a variable that stores the path and filename of the required file. This file is a text file (<code>*.txt</code>) and contains the lyrics for the Nirvana song "Something in the way", which is also the name of the file. The <code>./</code> at the start makes this a <code>relative</code> path (relative to this notebook). This means the file can be found wherever we put the notebook, as long as there is a folder called <code>file_handling</code> in the same folder as the notebook. This saves us having to hard code the specific path. For example when I wrote this notebook, the file was here: <code>C:\Users\Alan_Davies\Intro to programming (Python)\file_handling\Something in the way.txt</code>. I send these notebooks to someone who uploads them onto a server, where the path could be quite different. It might for example be: <code>E:\data_files\teaching\FBMH\SHS\intro_programming\file_handling\Something in the way.txt</code>. This is where the power of relative paths comes in. As long as the folder <code>file_handling</code> sits alongside this notebook file, we don't have to explicitly add the full file path to work with the file. 

In [11]:
file_path = "./file_handling/Something in the way.txt"

Next we can use the <code>open</code> function to open the file, passing in the file path variable. To open a file for reading we can omit the mode, as reading in text mode is the default. We could also have included it (e.g. <code>open(file_path, "r")</code>).

In [12]:
f = open(file_path)

You will note that if we try to output the file using the <code>print</code> function, we just output details of the file object. We need to use the <code>read</code> function instead if we want to see the contents.

<div class="alert alert-success">
<b>Note:</b> You can also pass a number into the <code>read</code> function to specify the number of characters you want to read, e.g. for the first 10 characters you could write <code>f.read(10)</code>.
</div>

In [13]:
print(f)

<_io.TextIOWrapper name='./file_handling/Something in the way.txt' mode='r' encoding='cp1252'>


The read function now displays the contents of the file below.

In [14]:
print(f.read())

"Something In The Way"

Underneath the bridge
Tarp has sprung a leak
And the animals I've trapped
Have all become my pets
And I'm living off of grass
And the drippings from the ceiling
It's okay to eat fish
'Cause they don't have any feelings

Something in the way, mmm
Something in the way, yeah, mmm
Something in the way, mmm
Something in the way, yeah, mmm
Something in the way, mmm
Something in the way, yeah, mmm

Underneath the bridge
Tarp has sprung a leak
And the animals I've trapped
Have all become my pets
And I'm living off of grass
And the drippings from the ceiling
It's okay to eat fish
'Cause they don't have any feelings

Something in the way, mmm
Something in the way, yeah, mmm
Something in the way, mmm
Something in the way, yeah, mmm
Something in the way, mmm
Something in the way, yeah, mmm
Something in the way, mmm
Something in the way, yeah, mmm


Finally we close the file with the <code>close</code> function.

In [15]:
f.close()

<div class="alert alert-success">
<b>Note:</b> Closing files when you are done with them is generally seen as good practice. Reasons for this include: too many open files may slow your program down; you may run into an upper limit on the number of files that can be open at any given time; in some cases changes to a file are not written until the file is closed; and operating systems like Windows treat open files as locked, which prevents certain actions being carried out on them. 
</div>

<div class="alert alert-block alert-info">
<b>Task 1:</b>
<br> 
1. Open the file in the location: <code>./file_handling/A/American Pie.txt</code> for reading.<br>
2. Output its contents <br>
3. Close the file
</div>

In [2]:
file_path = "./file_handling/A/American Pie.txt"
f = open(file_path)
print(f.read())
f.close()

"American Pie"

...A long, long time ago
I can still remember how that music used to make me smile
And I knew if I had my chance that I could make those people dance
And maybe they'd be happy for a while
... But February made me shiver
With every paper I'd deliver
Bad news on the doorstep
I couldn't take one more step
... I can't remember if I cried
When I read about his widowed bride
But something touched me deep inside
The day the music died
... So bye-bye, Miss American Pie
Drove my Chevy to the levee, but the levee was dry
And them good old boys were drinkin' whiskey and rye
Singin' "This'll be the day that I die
This'll be the day that I die"
... Did you write the book of love, and do you have faith in God above
If the Bible tells you so?
Now do you believe in rock and roll, can music save your mortal soul
And can you teach me how to dance real slow?
... Well, I know that you're in love with him
'Cause I saw you dancin' in the gym
You both kicked off your shoes
Man, I dig those rh

If you want to read the lines one at a time, you can use the <code>readline</code> function.
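As a small sketch of how <code>read</code> with a number and <code>readline</code> behave, the example below writes two lines of the lyrics to a temporary file (the file name <code>lines_demo.txt</code> is made up for this example) and then reads them back in pieces. Each call carries on from where the previous one stopped.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "lines_demo.txt")  # hypothetical file name
    with open(demo_path, "w") as f:
        f.write("Underneath the bridge\nTarp has sprung a leak\n")

    with open(demo_path) as f:
        first_chars = f.read(10)     # the first 10 characters
        rest_of_line = f.readline()  # carries on to the end of the first line
        second_line = f.readline()   # the whole of the next line
        at_end = f.readline()        # an empty string signals the end of the file

print(repr(first_chars))   # 'Underneath'
print(repr(rest_of_line))  # ' the bridge\n'
print(repr(second_line))   # 'Tarp has sprung a leak\n'
print(repr(at_end))        # ''
```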

Another thing you will want to do when working with files is to add some error handling to deal with issues such as a file not being found, so that your program provides a meaningful response rather than just crashing. Let's consider opening a file to write to it. It should create a new file if it doesn't already exist.

In [16]:
file_path = "./file_handling/"
file_name = "mytestfile.txt"

f = open(file_path + file_name, 'a')
try:
    f.write("\nThis is a line of text to write to the file.")
finally:
    f.close()

Here we opened the file for appending (so if you run this again it will keep adding lines to the file). We use the <code>try</code> statement to attempt to write to the file, and the <code>finally</code> block makes sure the file is closed even if the write fails. Another less verbose way of doing this is to use a <code>context manager</code>. Python does this with the <code>with</code> statement. 

In [17]:
with open(file_path + file_name, 'a') as f:
    f.write("\nHere is another line.")

This works in the same way as the earlier option but uses fewer lines of code and closes the file automatically for us once we leave the <code>with</code> block. 
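The sketch below checks this behaviour using the file object's <code>closed</code> attribute, and also shows one way of handling a missing file with <code>try</code> and <code>except</code>. The file names and the temporary folder are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "context_demo.txt")  # hypothetical file name

    with open(demo_path, "a") as f:
        f.write("A line written inside the with block.\n")
        closed_inside = f.closed   # False: still open inside the block

    closed_outside = f.closed      # True: closed automatically on leaving the block

    # Opening a file that does not exist for reading raises FileNotFoundError
    missing_path = os.path.join(tmp_dir, "no_such_file.txt")
    try:
        with open(missing_path, "r") as f:
            contents = f.read()
    except FileNotFoundError:
        contents = None
        print("File not found:", os.path.basename(missing_path))

print(closed_inside, closed_outside)  # False True
```

Catching the specific <code>FileNotFoundError</code>, rather than every possible error, keeps other unexpected problems visible.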

<div class="alert alert-block alert-info">
<b>Task 2:</b>
<br> 
Using a context manager, open the file <code>./file_handling/A/Silk.txt</code> and display its contents.
</div>

In [3]:
with open("./file_handling/A/Silk.txt", 'r') as f:
    print(f.read())

"Silk"

Your broad shoulders, my wet tears
You're alive and I'm still here
As some half-human creature thing
Can you bring life to anything? Ooh
("Take this to make you better"
Though eventually you'll die)
If you don't love me, don't tell me
I've never asked who and I'll never ask why
("It's such a shame, she used to be so delightful")
Well, whose fault is that, if it wasn't Mum and Dad's?
"Well it must be yours"
We'll have none of that, no

Just looking for a protector
God never reached out in time
There's love that is a saviour
But that ain't no love of mine
My love it kills me slowly
Slowly I could die
And when she sleeps she hears the blues
And sees shades of black and white
(Run, run, run away)
(Run, run, run away)

Got to stay cool, you hot, hot head
Count to a thousand before you sleep in bed
Read the news, pass the time
Drink the juice, feeling fine
Got to stay cool, you hot, hot head
Count to a thousand before you sleep in bed
Read the news, pass the time
Drink the juice, fee

The <code>read</code> function reads the entire file in one go. We can also get a list of the lines in a file using the <code>readlines</code> function.

In [22]:
with open(file_path + file_name, 'r') as f:
    print(f.readlines())

['This is a line of text to write to the file.\n', 'Here is another line.This is a line of text to write to the file.\n', 'Here is another line.']


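This works well for smaller files, but for much larger files it is not very memory efficient as the whole file is loaded at once. We can instead loop over the file object, which reads the file one line at a time. Each line keeps its newline character and <code>print</code> adds another, which is why the output below is double spaced.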

In [24]:
with open(file_path + file_name, 'r') as f:
    for line in f:
        print(line)

This is a line of text to write to the file.

Here is another line.This is a line of text to write to the file.

Here is another line.
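A variation on the loop above, sketched with a temporary file (the name <code>iterate_demo.txt</code> is made up), strips the newline character from the end of each line before printing and numbers the lines with <code>enumerate</code>.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "iterate_demo.txt")  # hypothetical file name
    with open(demo_path, "w") as f:
        f.write("This is a line of text to write to the file.\nHere is another line.\n")

    cleaned_lines = []
    with open(demo_path, "r") as f:
        # Only one line is held in memory at a time
        for line_number, line in enumerate(f, start=1):
            text = line.rstrip("\n")  # drop the trailing newline character
            cleaned_lines.append(text)
            print(line_number, text)
```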


To store and use characters digitally they are represented with an encoding system. There are different character encoding standards such as <code>ASCII</code> (American Standard Code for Information Interchange). For example the letter <code>a</code> is represented by the ASCII code <code>097</code>, which is <code>01100001</code> in binary. <code>Unicode</code> extends this idea to characters from many more languages and symbol sets, giving each one a number called a code point. The letter <code>a</code> is <code>U+0061</code> (61 in hexadecimal is 97 in decimal). Encodings such as <code>UTF-8</code> (Unicode Transformation Format, 8-bit) then store these code points using a variable number of bytes per character. In Python, we can write a Unicode code point directly in a string like so:

In [49]:
u"\u0061"

'a'

<div class="alert alert-success">
<b>Note:</b> You may need to convert the character encoding of text data that you import to carry out further processing and to better represent certain symbols (e.g. emojis &#127773;).
</div>

We can explicitly state the encoding for a file by passing it in as a parameter e.g. <code>f = open("test_file.txt", mode='r', encoding='utf-8')</code>.
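The relationship between characters, code points and bytes can be explored with a few built-in functions. This is only a rough sketch: the pound sign is our own choice of example character, and the last line shows what can happen when bytes saved as UTF-8 are read using a different encoding such as <code>cp1252</code>.

```python
letter = "a"

code_point = ord(letter)               # 97, the number behind the letter "a"
as_binary = format(code_point, "08b")  # '01100001'
as_hex = format(code_point, "04X")     # '0061', as in U+0061
back_again = chr(0x0061)               # 'a'

# UTF-8 uses a different number of bytes depending on the character
ascii_bytes = "a".encode("utf-8")            # 1 byte
pound_bytes = "\u00A3".encode("utf-8")       # 2 bytes (the pound sign)
moon_bytes = "\U0001F31D".encode("utf-8")    # 4 bytes (the emoji in the note above)

# Decoding with the wrong encoding gives the wrong characters
wrong_decoding = pound_bytes.decode("cp1252")

print(code_point, as_binary, as_hex, back_again)
print(len(ascii_bytes), len(pound_bytes), len(moon_bytes))  # 1 2 4
print(wrong_decoding)
```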

<div class="alert alert-block alert-info">
<b>Task 3:</b>
<br> 
Using the method above for displaying the letter <code>a</code>, output the character <code>%</code>.
</div>

In [4]:
u"\u0025"

'%'

<a id="other"></a>
#### 11.2 Working with other types of file

Let's create some data and store it in a dictionary. Here we have some details for a patient. We may want to save this data to a file or transmit it over a network. A dictionary like this could equally represent other things, such as the settings for a program or app.

In [26]:
my_data = {
    'name': 'Paul Smith',
    'id': '1342',
    'age': 45,
    'diagnosis': 'NIDDM',
    'PMH': ['Hypertension', 'IBS', 'Bowel CA']
}

We can do this easily in Python using the <code>pickle</code> module.

In [27]:
import pickle

We can create a new file and open it for writing in binary mode with <code>wb</code>. By convention pickled files are given the file extension <code>.pkl</code>. Writing the data then works in a similar way to before, but using <code>pickle.dump</code>, passing in the dictionary and the file object.

In [33]:
new_path = "./file_handling/B/health_record.pkl"

with open(new_path, 'wb') as f:
    pickle.dump(my_data, f)
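The cell above only writes the data. As a sketch of the full round trip, the example below saves the same patient record to a temporary folder (used here so the notebook's own files are left alone) and reads it back with <code>pickle.load</code>, opening the file for reading in binary mode with <code>rb</code>.

```python
import os
import pickle
import tempfile

record = {
    'name': 'Paul Smith',
    'id': '1342',
    'age': 45,
    'diagnosis': 'NIDDM',
    'PMH': ['Hypertension', 'IBS', 'Bowel CA']
}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "health_record.pkl")

    # Serialize: dictionary -> bytes on disk
    with open(pkl_path, "wb") as f:
        pickle.dump(record, f)

    # De-serialize: bytes on disk -> a new dictionary
    with open(pkl_path, "rb") as f:
        loaded_record = pickle.load(f)

print(loaded_record == record)  # True: same contents
print(loaded_record is record)  # False: a separate copy
```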

Pickling refers to the <code>serializing</code> and <code>de-serializing</code> of Python object structures. Serializing converts objects like dictionaries into a stream of bytes that can be sent over a network or saved to disk, and de-serializing turns those bytes back into objects. This can be used to save information that you want to persist after a program has finished running. Alternatively you may want to use this to send data packets via the internet or other network. The main drawback is that this only works with Python. You should also only load pickled data from sources you trust, as unpickling can run arbitrary code. If you want to make your programs more <code>interoperable</code> (work with other systems and languages) then you should consider using something like JSON (JavaScript Object Notation) instead. JSON is widely supported and is not limited to any specific programming language despite originating from JavaScript. For this we need to use the <code>json</code> module.

In [34]:
import json

As with pickled files we need to change the file extension to reflect the fact that the file is a JSON file. We do this by using the <code>.json</code> extension. We can convert the dictionary into a JSON formatted string using <code>json.dumps</code> and then write this to a file using the <code>write</code> function. 

In [35]:
new_path = "./file_handling/B/health_record.json"

with open(new_path, 'w') as f:
    json_file = json.dumps(my_data)
    f.write(json_file)

We can load the data back in again using <code>json.load</code>. First we create a new empty dictionary called <code>new_dict</code> to store the loaded data.

In [36]:
new_dict = {}

Here we load back in the data, store it in the new empty dict and print the contents. We can see this does indeed contain the data that we saved in the json file.

In [37]:
with open(new_path) as f:
    health_data = json.load(f)

new_dict = health_data
print(new_dict)
print(new_dict)

{'name': 'Paul Smith', 'id': '1342', 'age': 45, 'diagnosis': 'NIDDM', 'PMH': ['Hypertension', 'IBS', 'Bowel CA']}
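As an alternative to <code>json.dumps</code> followed by <code>write</code>, the <code>json</code> module also has <code>json.dump</code> (without the "s"), which writes straight to an open file. The sketch below uses it with the optional <code>indent</code> argument to make the saved file easier to read, again working in a temporary folder chosen for this example.

```python
import json
import os
import tempfile

record = {
    'name': 'Paul Smith',
    'id': '1342',
    'age': 45,
    'diagnosis': 'NIDDM',
    'PMH': ['Hypertension', 'IBS', 'Bowel CA']
}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "health_record.json")

    # json.dump writes the dictionary directly to the open file
    with open(json_path, "w") as f:
        json.dump(record, f, indent=4)

    # Read the raw text to see what was saved
    with open(json_path, "r") as f:
        raw_text = f.read()

    # json.load turns the file contents back into a dictionary
    with open(json_path, "r") as f:
        loaded_record = json.load(f)

print(raw_text)
print(loaded_record == record)  # True
```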


<div class="alert alert-block alert-info">
<b>Task 4:</b>
<br> 
1. Using the method above, create some data in a dictionary.<br>
2. Use <code>json.dumps</code> to save the file in the folder <code>./file_handling/B</code>.<br>
3. Load the file into a new empty dictionary and output its contents.<br><br>
<strong>Note:</strong> We don't provide a solution here as the data you choose to store will be decided by each individual. 
</div>

<a id="common"></a>
#### 11.3 Working with common file formats

Python can also cope with loading lots of different file types. This usually involves importing a specific module for that purpose. Here are some examples with a couple of common file formats. In one of the folders we have included a chapter from the book below as a Word document.

<img src="./intro_images/book.PNG" width="20%" align="left" />

We can use the <code>docx</code> module to work with Word documents.

In [39]:
from docx import Document

We create a document by loading in the Word doc like so. This is actually creating an instance of the <code>Document</code> class and passing in the file path to the constructor. 

In [40]:
doc = Document("./file_handling/B/Chapter 1.docx")

You can view the text one paragraph at a time.

In [41]:
print(doc.paragraphs[0].text)

Chapter 1:


Or display the entire document. The module has many other functions for working with Word documents aside from the few shown here.

In [42]:
for para in doc.paragraphs:
    print(para.text)

Chapter 1:
How to record a 12-lead ECG
Alan Davies and Alwyn Scott

Physiology 
Sinoatrial node 
Interatrial/internodal tracts 
Atrioventricular node 
Bundle of His 
Right bundle branch 
Left bundle branch 
What is an ECG 
Patient positioning 
Electrode placement 
Attaching the cables 
The machine 
What to write on the ECG 
Quiz 
Summary of key points 


Physiology
The heart is located in the chest between the lungs in the mediastinum. It is surrounded by a protective sac called the pericardium (figure 1.1). Essentially the heart is split into four functional chambers; a left and right atrium, and a left and right ventricle (figure 1.2). Deoxygenated blood (blood with no oxygen in it) is emptied into the right atrium via the vena cava. The inferior vena cava returns blood from the lower portion of the body as the superior vena cava returns blood from the higher portion. This blood is then pumped through the tricuspid valve into the right ventricle. Blood is then passed into the lungs v

You may also want to work with Portable Document Format (PDF) files. We have included a PDF of the paper shown below. We can load this into Python using the <code>PyPDF2</code> module.

<img src="./intro_images/reviews.PNG" width="30%" align="left" />

In [43]:
import PyPDF2

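The short snippet below opens the PDF document for reading in binary mode. We can do things like print the number of pages and display the text of the first page with <code>getPage</code> and <code>extractText</code>. Note that these names are from older versions of <code>PyPDF2</code>. From version 3.0 the equivalents are <code>PdfReader</code>, <code>len(reader.pages)</code>, <code>reader.pages[0]</code> and <code>extract_text</code>.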

In [45]:
pdf_file = open("./file_handling/B/Sytematic reviews.pdf", "rb")
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
print(pdf_reader.numPages)
pdf_page = pdf_reader.getPage(0)
print(pdf_page.extractText())
pdf_file.close()

7
1008
 

 British Journal of Nursing, 2019, Vol 28, No 15
 © 2019 MA Healthcare Ltd

S
ince their inception  in the late 1970s, systematic 

reviews have gained influence in the health 

professions (Hanley and Cutts, 2013). Systematic 

reviews and meta-analyses are considered to be 

the most credible and authoritative sources of 

evidence available (Cognetti et al, 2015) and are regarded as 

the pinnacle of evidence in the various ‘hierarchies of evidence’. 

Reviews published in the Cochrane Library (https://www.
cochranelibrary.com) are widely considered to be the ‘gold’ 

standard. Since Guyatt et al (1995) presented a users’ guide to 

medical literature for the Evidence-Based Medicine Working 

Group, various hierarchies of evidence have been proposed. 


Figure 1
 illustrates an example. 

Systematic reviews can be qualitative or quantitative. One of 

the criticisms levelled at hierarchies such as these is that qualitative 

research is often positioned towards or even is 

The final example shows how we can load Comma Separated Value (CSV) files with the <code>pandas</code> module. This is frequently used for data science and data analysis as we can load data from files into a data frame object that lets us display and work with data tables.

In [46]:
import pandas as pd
exam_data = pd.read_csv("./file_handling/B/exam data.csv")
exam_data

Unnamed: 0,name,candidate_number,score
0,Adam West,124532,56
1,Paul Bradley,543433,62
2,Suzzane Smith,445342,71
3,Wiktoria Chleb,654257,66
4,Paulina Westbury,543254,77
5,David Green,541663,51
6,Harrold Grund,765332,48


The important takeaway here is that for most common file types, Python has a module that can be used to work with them.
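Some of these modules come with Python itself. As one illustration, the sketch below uses the standard library's <code>csv</code> module (rather than <code>pandas</code>) to write and read back a few of the exam rows shown above. The file name <code>exam_sample.csv</code> and the temporary folder are made up for the example.

```python
import csv
import os
import tempfile

# A header row followed by a few rows taken from the exam data above
rows = [
    ["name", "candidate_number", "score"],
    ["Adam West", "124532", "56"],
    ["Paul Bradley", "543433", "62"],
    ["Suzzane Smith", "445342", "71"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "exam_sample.csv")  # hypothetical file name

    # newline="" lets the csv module control the line endings itself
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # DictReader uses the header row as the keys for each record
    with open(csv_path, "r", newline="") as f:
        records = list(csv.DictReader(f))

# Values are read as text, so convert the scores to numbers
scores = [int(record["score"]) for record in records]
print(records[0]["name"], scores)  # Adam West [56, 62, 71]
```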

We can also use modules such as <code>tkinter</code>, Python's standard toolkit for building graphical user interfaces. The code below opens a file loading dialog box (this may appear behind other windows if you have multiple windows open). Here you can select a file and once selected the file path will be output. This can be used to create programs where the user can select a file to open.

In [2]:
import tkinter as tk
from tkinter import filedialog

root = tk.Tk()
root.withdraw()

file_path = filedialog.askopenfilename()
print(file_path)

C:/Users/Alan_Davies/Documents/uxd marking table.docx


<a id="nav"></a>
#### 11.4 Navigating the file system

As you can see, to use files we also have to be comfortable with navigating through files and folders. One very useful module for this is the <code>os</code> (operating system) module.

In [3]:
import os

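Below is an example of 'walking' through the file/folder structure from a root folder. In this case we start at the <code>file_handling</code> folder relative to this notebook. For each folder visited, <code>os.walk</code> gives us the folder's path (<code>root</code>), the folders inside it (<code>dirs</code>) and the files inside it (<code>files</code>).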

In [50]:
data_path = "./file_handling/"
for root, dirs, files in os.walk(data_path):
    print("Root = ", root)
    print("Dirs = ", dirs)
    print("Files =", files)

Root =  ./file_handling/
Dirs =  ['A', 'B']
Files = ['mytestfile.txt', 'Something in the way.txt']
Root =  ./file_handling/A
Dirs =  ['C']
Files = ['American Pie.txt', 'Silk.txt']
Root =  ./file_handling/A\C
Dirs =  []
Files = ['Zombie.txt']
Root =  ./file_handling/B
Dirs =  []
Files = ['Chapter 1.docx', 'exam data.csv', 'health_record.json', 'health_record.pkl', 'Sytematic reviews.pdf']


Another useful bit of functionality is the ability to detect if a file exists or not. We could use this to check for the presence of a file before creating or opening it to avoid errors. We do this with <code>os.path.exists</code>. 

In the example below we check for the presence of the file <code>./file_handling/mytestfile.txt</code> with an <code>if</code> statement. If found we display the message <code>The file exists!</code> otherwise <code>File not found</code>.

In [6]:
test_file_path = "./file_handling/mytestfile.txt"
if os.path.exists(test_file_path):
    print("The file exists!")
else:
    print("File not found")

The file exists!


We can also use it to do things like list the files and folders in a directory.

In [51]:
os.listdir(data_path)

['A', 'B', 'mytestfile.txt', 'Something in the way.txt']

<div class="alert alert-success">
    <b>Note:</b> You can find out more about the <code>os</code> module <a href="https://python101.pythonlibrary.org/chapter16_os.html" target="_blank">here</a>. 
</div>
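Putting a few of these functions together, the sketch below builds a small folder tree in a temporary location (loosely modelled on the <code>A</code>, <code>B</code> and <code>C</code> folders above, with placeholder contents) and then walks through it to collect every <code>.txt</code> file. <code>os.path.join</code> is used to build paths so that the correct slashes are used on each operating system.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small folder tree with placeholder files
    os.makedirs(os.path.join(tmp_dir, "A", "C"))
    os.makedirs(os.path.join(tmp_dir, "B"))
    for parts in [("A", "Silk.txt"), ("A", "C", "Zombie.txt"), ("B", "exam data.csv")]:
        with open(os.path.join(tmp_dir, *parts), "w") as f:
            f.write("placeholder\n")

    text_files = []
    for root, dirs, files in os.walk(tmp_dir):
        for file_name in files:
            if file_name.lower().endswith(".txt"):
                full_path = os.path.join(root, file_name)
                # Store the path relative to the starting folder
                text_files.append(os.path.relpath(full_path, tmp_dir))

text_files.sort()
print(text_files)
```

The exact output depends on the operating system, as the separators in the paths will be backslashes on Windows and forward slashes elsewhere.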

<div class="alert alert-block alert-info">
<b>Task 5:</b>
<br> 
Let's see if we can pull some of this together.<br>
1. Using <code>tkinter</code> let the user select a <code>CSV</code> file.<br>
2. Use the <code>os</code> functions to check if the file exists or not.<br>
3. Open the <code>CSV</code> file with the <code>pandas</code> library and display the contents.<br><br>
    <strong>Hint:</strong> You can try to load the <code>exam data.csv</code> file from earlier. 
</div>

In [9]:
import os
import pandas as pd
import tkinter as tk
from tkinter import filedialog

root = tk.Tk()
root.withdraw()

file_path = filedialog.askopenfilename()
if os.path.exists(file_path):
    exam = pd.read_csv(file_path)
    print(exam)
else:
    print("File not found")

               name  candidate_number  score
0         Adam West            124532     56
1      Paul Bradley            543433     62
2     Suzzane Smith            445342     71
3    Wiktoria Chleb            654257     66
4  Paulina Westbury            543254     77
5       David Green            541663     51
6     Harrold Grund            765332     48


### Notebook details
<br>
<i>Notebook created by <strong>Dr. Alan Davies</strong>.
<br>
&copy; Alan Davies 2022</i>

## Notes: