# Week 2 Notebook 2 Files

## Overview

The Python standard library contains many useful functions, including functions for reading and writing files. In this lesson we will cover:
* how to open a file and read data from it
* how to write text to a file

### Basic Directory Commands

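In Jupyter notebooks, you can run some shell-style commands, such as `pwd` and `ls`, directly in a cell that contains no Python code. IPython provides these as magic commands, so they also work on Windows, as the output below shows.
`pwd` will show the present working directory.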

In [19]:
pwd

'C:\\Users\\Acer\\Data Science\\Pertemuan 2'

You can use `ls` to **list** the contents of the present working directory.

In [20]:
ls


 Volume in drive C is Acer
 Volume Serial Number is 648A-762C

 Directory of C:\Users\Acer\Data Science\Pertemuan 2

23/09/2023  22:38    <DIR>          .
23/09/2023  16:53    <DIR>          ..
23/09/2023  16:53    <DIR>          .ipynb_checkpoints
23/09/2023  16:54                98 13002.txt
23/09/2023  16:54               230 invoices.csv
23/09/2023  16:54             4.753 iris_csv.csv
20/09/2023  09:37            53.995 Kelompok Kuda Hitam 2.ipynb
20/09/2023  09:58           192.926 Kelompok Kuda Hitam.ipynb
23/09/2023  16:54            13.519 penguins_size.csv
23/09/2023  22:15            28.648 Week 2 Notebook 1 Collections.ipynb
23/09/2023  22:38            14.670 Week 2 Notebook 2 Files.ipynb
23/09/2023  16:53           175.021 Week 2 Notebook 3 Pandas Intro.ipynb
23/09/2023  16:53           159.261 Week 2 Notebook 4 Pandas Wrangling.ipynb
23/09/2023  16:53            24.331 Week 2 Notebook 5 Data Visualization.ipynb
              11 File(s)        667.452 bytes
              

In the same directory as this notebook file, you should have downloaded and saved a text file `13002.txt`, and it should appear in the listing of files above.

If not, save it into this directory so that you can read it in easily.
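
The same checks can also be made from Python code. The sketch below uses `pathlib` from the standard library; the helper name `list_files` is hypothetical and only serves this illustration.

```python
from pathlib import Path


def list_files(directory="."):
    """Return the sorted names of the regular files in `directory`."""
    return sorted(p.name for p in Path(directory).iterdir() if p.is_file())


# Path.cwd() is the Python counterpart of `pwd`.
print(Path.cwd())

# True only if 13002.txt has been saved next to this notebook.
has_invoice = "13002.txt" in list_files()
print(has_invoice)
```

If the last line prints `False`, the file is not in the present working directory yet.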

You can open the file in Jupyter with the File->Open command in the menu above. You will find that it contains information about an invoice, written in plain text. Let's try to read in the data using Python.

## Reading Files


To access a file, you can use Python's built-in function `open()`.

The default mode is `'r'`, which opens the file for reading text.

In [21]:
#open a file for reading, with default mode
myfile = open('13002.txt')


In [22]:
print(myfile)

<_io.TextIOWrapper name='13002.txt' mode='r' encoding='cp1252'>


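You can see that when we print `myfile`, it gives us the filename, the mode, and the text encoding. The encoding shown (`cp1252` here) is the platform default and may differ on your machine.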
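
The same details are available as attributes of the file object. The sketch below is a minimal illustration under two assumptions: it writes a one-line stand-in for `13002.txt` into a temporary directory so that it runs anywhere, and it passes `encoding='utf-8'` explicitly instead of relying on the platform default. It also uses the `with` statement, which closes the file automatically.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for 13002.txt, created only for this example.
    sample_path = Path(tmp) / "13002.txt"
    sample_path.write_text("Invoice No: 13002\n", encoding="utf-8")

    # `with` closes the file automatically when the indented block ends.
    with open(sample_path, encoding="utf-8") as myfile:
        file_info = (Path(myfile.name).name, myfile.mode, myfile.encoding)
        print(file_info)

    # The file object still exists, but it is now closed.
    print(myfile.closed)
```

Using `with` means a forgotten `close()` cannot leave the file open.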

To read the data from the file, you can call the `readline()` method on the `myfile` object.

In [23]:
# Execute this cell again to read another line
mydataAsLines = myfile.readline()
print(mydataAsLines)

Invoice No: 13002



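As you can see, it has read one line of the data. The blank line in the output appears because the line itself ends with `\n` and `print()` adds another newline. If you run the cell above again, it will read the next line. Once there are no more lines to read, `readline()` returns an empty string, so only a blank line is printed.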
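
Repeated calls to `readline()` can be written as a loop that stops at the end of the file. The sketch below assumes a two-line stand-in for `13002.txt` in a temporary directory; the names `sample_path` and `lines_read` are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for 13002.txt, created only for this example.
    sample_path = Path(tmp) / "13002.txt"
    sample_path.write_text(
        "Invoice No: 13002\nCustomer Name: Lighthouse Entertainment\n",
        encoding="utf-8",
    )

    lines_read = []
    with open(sample_path, encoding="utf-8") as myfile:
        while True:
            line = myfile.readline()
            if line == "":  # an empty string signals the end of the file
                break
            lines_read.append(line)

print(lines_read)
```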

You can also read a specific number of characters by using the `read()` method.

In [24]:
# close the file before reopening 
myfile.close()

# open the file again
myfile = open('13002.txt', 'r')

# read 4 characters. 
myfile.read(4)

'Invo'

Try changing the argument value to read more characters. If the argument is -1 or omitted, everything from the current position to the end of the file will be read.
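
Each call to `read()` continues from where the previous one stopped. The sketch below illustrates this on a hypothetical two-line stand-in for `13002.txt`, written to a temporary directory so that the example is self-contained.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for 13002.txt, created only for this example.
    sample_path = Path(tmp) / "13002.txt"
    sample_path.write_text(
        "Invoice No: 13002\nDate: 13 Jan 2020\n", encoding="utf-8"
    )

    with open(sample_path, encoding="utf-8") as myfile:
        first = myfile.read(4)   # the first four characters
        second = myfile.read(3)  # continues after the first four
        rest = myfile.read()     # everything that is left
        at_end = myfile.read()   # nothing left, so an empty string

print(repr(first), repr(second), repr(rest), repr(at_end))
```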

In [25]:
# close the file from the previous open
myfile.close()

# myfile.read() with no argument will read the whole file.
myfile = open('13002.txt', 'r')
contents = myfile.read()

# then close the file
myfile.close()

# display the string that was read
contents

You can see that the file is read as a single string, including the newline character `\n` at the end of each line.

You can use the `print()` function to print out the string.

In [27]:
# myfile.read() will read the whole file.
myfile = open('13002.txt', 'r')

# using the print() function will print the string that is read
print(myfile.read())

myfile.close()

Invoice No: 13002
Customer Name: Lighthouse Entertainment
Date: 13 Jan 2020
Invoice Amount: $45.60


You can also read *all* the lines, using the `readlines()` function. This will store all the lines as elements of a list.


In [28]:
# open the file for reading
myfile = open('13002.txt', 'r')

# using the readlines() function. 
mydata = myfile.readlines()

# This stores each line as an element in a list.
print(mydata)

myfile.close()

['Invoice No: 13002\n', 'Customer Name: Lighthouse Entertainment\n', 'Date: 13 Jan 2020\n', 'Invoice Amount: $45.60']


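This time, the data is stored as a list, where the first element, at index 0, is `'Invoice No: 13002\n'`. Each element keeps its trailing `\n`, except the last one, because the file does not end with a newline.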
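
The trailing newlines are often unwanted. As a minimal sketch, the list below copies the output shown above (the name `raw_lines` is hypothetical) and removes the newlines in two equivalent ways.

```python
# The list as returned by readlines() in the cell above.
raw_lines = [
    "Invoice No: 13002\n",
    "Customer Name: Lighthouse Entertainment\n",
    "Date: 13 Jan 2020\n",
    "Invoice Amount: $45.60",
]

# Option 1: strip the trailing newline from each element.
clean_lines = [line.rstrip("\n") for line in raw_lines]
print(clean_lines)

# Option 2: split the whole text at line boundaries instead.
also_clean = "".join(raw_lines).splitlines()
print(also_clean == clean_lines)
```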

You can also process each line of the file using a `for` loop.

In [29]:
# open the file
myfile = open('13002.txt', 'r')

# process each line
for line in myfile:
    # separate the line at the colon
    parts = line.split(":")
    
    # print the split parts 
    print(parts)
    
    # print only the data
    print(parts[1])

myfile.close()
    

['Invoice No', ' 13002\n']
 13002

['Customer Name', ' Lighthouse Entertainment\n']
 Lighthouse Entertainment

['Date', ' 13 Jan 2020\n']
 13 Jan 2020

['Invoice Amount', ' $45.60']
 $45.60
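
The split parts still carry a leading space and, on most lines, a trailing newline. One possible refinement, sketched below, strips that whitespace and collects the fields in a dictionary. The list `lines` is a hypothetical copy of the file's lines, and `split(":", 1)` is used so that only the first colon on each line separates the label from the value.

```python
# Hypothetical copy of the lines of 13002.txt, as the for loop would see them.
lines = [
    "Invoice No: 13002\n",
    "Customer Name: Lighthouse Entertainment\n",
    "Date: 13 Jan 2020\n",
    "Invoice Amount: $45.60",
]

invoice = {}
for line in lines:
    # Split at the first colon only, then remove surrounding whitespace.
    label, value = line.split(":", 1)
    invoice[label.strip()] = value.strip()

print(invoice)
print(invoice["Invoice Amount"])
```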


## Writing to Files

You can also open files for writing, using `mode = 'w'`. If the file does not exist, it will be created. 

In [None]:
#open a file for writing
anotherfile= open('sample.txt', 'w')
anotherfile.write('Invoice No: 13003\n')
anotherfile.close()


Let's check the file contents by reading it.  

In [None]:
anotherfile = open('sample.txt','r')
print(anotherfile.readline())
anotherfile.close()

However, if you open an existing file for writing again, its existing contents are erased as soon as the file is opened.

In [None]:
# open the file for writing, again
anotherfile= open('sample.txt', 'w')
anotherfile.write('testing')
anotherfile.close()

In [None]:
# check the file contents again
anotherfile = open('sample.txt','r')
print(anotherfile.readline())
anotherfile.close()
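
The overwrite behaviour can be reproduced in a single self-contained sketch. It writes `sample.txt` into a temporary directory (an assumption made so that no real file is touched) and uses `with` so that each file is closed automatically.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"

    # First write: the file does not exist yet, so it is created.
    with open(sample_path, "w", encoding="utf-8") as f:
        chars_written = f.write("Invoice No: 13003\n")
    with open(sample_path, encoding="utf-8") as f:
        first_contents = f.read()

    # Second write: opening with 'w' again empties the file first.
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("testing")
    with open(sample_path, encoding="utf-8") as f:
        second_contents = f.read()

print(chars_written)
print(repr(first_contents))
print(repr(second_contents))
```

Note that `write()` returns the number of characters written and does not add a newline on its own.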

### Appending Data

If you do not want to overwrite an existing file, you can *append* lines to it. 

In order to append data to the file, we use `mode = 'a'`. This means that data will be added to the end of the file.

Let's say we have an existing file, `invoices.csv`. Let's read the file to check its contents first.


In [None]:
# open the file for reading
myfile = open('invoices.csv', 'r')
print(myfile.read())
myfile.close()

We will open the file to append, by setting the mode to 'a'. The file pointer will be set at the end of the file.

In [None]:
# Open the file to append data
myfile = open('invoices.csv', 'a')
myfile.write('13005, Additional Row, 14 Jan 2020, $100.00\n')
myfile.close()
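
To confirm that appending leaves the earlier rows intact, the sketch below builds a small, hypothetical copy of `invoices.csv` in a temporary directory, appends one row, and reads the result back.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical two-line copy of invoices.csv, created for this example.
    csv_path = Path(tmp) / "invoices.csv"
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("Invoice No, Customer Name, Date, Invoice Amount\n")
        f.write("13004, Raju Store, 14 Jan 2020, $300.20\n")

    # Mode 'a' adds to the end and keeps what is already there.
    with open(csv_path, "a", encoding="utf-8") as f:
        f.write("13005, Additional Row, 14 Jan 2020, $100.00\n")

    with open(csv_path, encoding="utf-8") as f:
        rows = f.readlines()

print(len(rows))
print(rows[-1])
```

The appended row only lands on its own line because the existing file already ends with `\n`.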

You can open a file for a combination of reading and writing by specifying the mode. 
* `mode = 'r+'` opens an existing file for reading and writing, with the file pointer at the beginning of the file, ready to read.
* `mode = 'a+'` opens the file for appending and reading, with the file pointer at the end of the file, ready for adding more data.
* `mode = 'w+'` opens the file for writing and reading, but first erases any existing contents, so the file starts out empty.

In any of these modes, `file.seek(0)` will set the pointer back to the beginning of the file.
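
The pointer positions described above can be checked with a small experiment. This is a minimal sketch that uses a hypothetical one-line file in a temporary directory; the variable names are illustrative only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "invoices.csv"
    with open(path, "w", encoding="utf-8") as f:
        f.write("Invoice No, Customer Name\n")

    with open(path, "a+", encoding="utf-8") as f:
        before_seek = f.read()  # pointer starts at the end: nothing to read
        f.write("13002, Lighthouse Entertainment\n")
        f.seek(0)               # move the pointer back to the beginning
        after_seek = f.read()   # now the whole file is returned

    with open(path, "w+", encoding="utf-8") as f:
        after_truncate = f.read()  # 'w+' emptied the file when it was opened

print(repr(before_seek))
print(repr(after_seek))
print(repr(after_truncate))
```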

In [44]:
# open the file again, this time to read and write.
# mode = 'r+' starts with the file pointer at the beginning of the file,
# so that you can read first
myfile = open('invoices.csv', 'r+')

# read the file and print its contents
print('Before appending new data')
print(myfile.read())

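# write one more row; the read above left the pointer at the end of the file,
# so the row is added after the existing data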
myfile.write('13005, One more row, 15 Jan 2020, $250.30\n')

# go back to the beginning of the file
myfile.seek(0)

# read again
print('After appending new data')
print(myfile.read())

myfile.close()


Before appending new data
Invoice No, Customer Name, Date, Invoice Amount
13002, Lighthouse Entertainment, 13 Jan 2020, $45.60
13003, Main Street News, 13 Jan 2020, $100.20
13003, Lee Enterprise, 14 Jan 2020, $30.00
13004, Raju Store, 14 Jan 2020, $300.20

After appending new data
Invoice No, Customer Name, Date, Invoice Amount
13002, Lighthouse Entertainment, 13 Jan 2020, $45.60
13003, Main Street News, 13 Jan 2020, $100.20
13003, Lee Enterprise, 14 Jan 2020, $30.00
13004, Raju Store, 14 Jan 2020, $300.20
13005, One more row, 15 Jan 2020, $250.30



## Exercise

Write the code to open the 'invoices.csv' file for reading, and then print only the heading 'Invoice No' and the data for the invoice numbers.

Use the following comments to guide you.

Name: Hashimatul Zaria
Student ID (NIM): E1E121059

In [16]:
# Open the file for reading, with default mode
myfile = open('invoices.csv')

# read all the lines
mydataAsLines = myfile.readlines()

# header is the first element in the list of lines
header = mydataAsLines[0]

# use a split() function to separate the header names by the comma
header = header.strip().split(',')

# print the first element of the header names after splitting
print(header[0])

# use a for loop (with a slice starting at index 1) for the rest of the lines
for line in mydataAsLines[1:]:
    # split at the comma
    parts = line.split(",")
    # print the first element, which is the invoice number
    print(parts[0])

# close the file
myfile.close()

Invoice No
13002
13003
13003
13004
13005


Being able to extract just the invoice numbers is useful, because it helps us keep track of the running invoice number.

In this lesson, we have learned how to read from and write to files. However, there are many data science libraries that simplify these operations, and we will make use of the pandas library to read CSV files.
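
Before moving on to pandas, note that the standard library also includes a `csv` module that does the splitting for us. The sketch below is only an illustration of that module, not the pandas approach used in the next notebook. It reads a hypothetical in-memory copy of the first rows of `invoices.csv`, and `skipinitialspace=True` drops the space that follows each comma.

```python
import csv
import io

# Hypothetical in-memory copy of the first rows of invoices.csv.
sample = io.StringIO(
    "Invoice No, Customer Name, Date, Invoice Amount\n"
    "13002, Lighthouse Entertainment, 13 Jan 2020, $45.60\n"
    "13003, Main Street News, 13 Jan 2020, $100.20\n"
)

# skipinitialspace=True ignores the blank that follows each comma.
reader = csv.reader(sample, skipinitialspace=True)
header = next(reader)  # the first row holds the column names
invoice_numbers = [row[0] for row in reader]

print(header[0])
print(invoice_numbers)
```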

