# W2-03 - File Processing


**Objective: Learn the basic concepts of file processing in Python**

**Competencies:**
* Participants will be able to read and write files
 
**Tools:** Python, Anaconda, Jupyter

**Analysis case study:** -

**Data source(s) and fields:** N/A

**Step-by-step Guide:**
 

Processing data is an essential starting point towards any data science / machine learning work. Without data, nothing can be done. This module looks into how files can be read and written by a Python program. We start with some generic reading and writing operations, and then move to JSON and CSV files, two popular data file types.

* [Reading and Writing to Files](#Reading-and-Writing-to-Files)
* [JSON Files](#JSON-Files)
* [CSV Files](#CSV-Files)


## Reading and Writing to Files

*A portion of this content is adapted from the [official Python tutorial](https://docs.python.org/3.6/tutorial/inputoutput.html).*

`open()` returns a file object, and is most commonly used with two arguments: `open(filename, mode)`.

The first argument is a string containing the filename.

The second argument is another string containing a few characters describing the way in which the file will be used. 

_mode_ can be:
- `'r'` read (the default)
- `'w'` write
- `'a'` append

Normally, files are opened in __text mode__, which means that you read and write __strings__ from and to the file, and these strings are stored using a specific encoding. If no encoding is specified, the default depends on your platform.

`'b'` appended to the mode opens the file in __binary mode__: now the data is read and written in the form of __bytes__ objects.

Example:
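
The following is a minimal sketch of the difference between text mode and binary mode. The file name `modes_demo.txt` and the temporary directory are assumptions made for illustration only; they are not part of this module's data.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (illustration only).
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# Text mode: we write a str. Naming the encoding avoids the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write("café\n")

# Text mode again: read() returns a str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode ('b' added to the mode): read() returns the raw bytes on disk.
with open(path, "rb") as f:
    raw = f.read()

print(type(text), repr(text))
print(type(raw), repr(raw))
```

Note that the accented character takes one position in the string but more than one byte in the file, which is why the two results differ in length.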


It is good practice to use the __`with`__ keyword when dealing with file objects. The advantage is that the file is properly closed after its suite finishes, even if an exception is raised at some point. Using `with` is also much shorter than writing equivalent `try-finally` blocks:
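
As a rough comparison, the sketch below writes to the same file twice, once with an explicit `try-finally` block and once with `with`. The path is a hypothetical scratch file in a temporary directory, chosen only so that the example can run anywhere.

```python
import os
import tempfile

# Hypothetical scratch file (illustration only).
path = os.path.join(tempfile.mkdtemp(), "workfile")

# Version 1: try-finally, closing the file by hand.
f = open(path, "w")
try:
    f.write("first line\n")
finally:
    f.close()  # runs even if write() raised an exception

# Version 2: the with statement closes the file for us.
with open(path, "a") as g:
    g.write("second line\n")

# Both file objects are closed at this point.
print(f.closed, g.closed)

with open(path) as h:
    contents = h.read()
print(contents)
```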

### Methods of file objects

A file object **`f`** provides the following methods:

- __`f.write(s)`__ writes the string __`s`__ to __`f`__
- __`f.read()`__ reads the whole file
- __`f.readline()`__ reads one line
- to read a file line by line, you can loop over the file object itself:

```python
with open('workfile', 'r') as f:
    for line in f:
        print(line)
```
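
The sketch below exercises the methods listed above on a small file. The file name and its three lines are made up for the example. It also shows that each line keeps its trailing newline, which is why `print(line)` in the loop above leaves a blank line between lines.

```python
import os
import tempfile

# Hypothetical scratch file (illustration only).
path = os.path.join(tempfile.mkdtemp(), "methods_demo.txt")

# write(): store three lines in the file.
with open(path, "w") as f:
    f.write("line one\nline two\nline three\n")

with open(path) as f:
    first = f.readline()  # reads up to and including the first newline
    rest = f.read()       # reads everything that is left

# Looping over the file object yields one line at a time;
# rstrip("\n") removes the trailing newline that each line keeps.
with open(path) as f:
    stripped = [line.rstrip("\n") for line in f]

print(repr(first))
print(repr(rest))
print(stripped)
```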

### Writing files
Let's create a file named `example_file.txt`.

In [None]:
filename = 'example_file.txt'

In [None]:
# check your current working directory 
import os
print(os.getcwd())

In [None]:
# We can pass a string directly
with open(filename, 'w') as my_file:
    my_file.write("\n********** This is the top line for this operation ***************\n")
    my_file.write("Here is some example text\n")

    # More often, we want to write data (output) from our Python program to file.
    n = 100
    example_string = "The value of n is %d.\n"%n
    my_file.write(example_string)

    # Here, we write to a file from within a loop
    my_file.write("\nAnd now, here's a song about 99 ipsum lorem lines:\n")
    for nlines in range(1, n, 1):
        line = "%d ipsum lorem lines printed here. %d ipsum lorem lines. One is never enough, we need another, %d ipsum lorem lines printed!\n"%(nlines,nlines,nlines)
        my_file.write(line)

Using the Jupyter notebook directory page on your web browser, locate the file and open it. You can also open it using your OS's file manager. Verify that the contents make sense by comparing them to the code above.

### Reading files
Let's read the file in Python. Again, we will use the `open()` function, this time with the `'r'` parameter to indicate that we are reading.

In [None]:
# Again, the open() function gives us a file object
with open(filename, 'r') as input_file:

    # The read() function puts the contents of the file into a string object
    file_contents = input_file.read()

    # Let's see what we got
print(file_contents)

In [None]:
# Now we append to the file: mode 'a' adds to the end instead of overwriting
with open(filename, 'a') as my_file:
    my_file.write("\n********** This is the top line for this operation ***************\n")
    my_file.write("Here is some example text\n")

    # More often, we want to write data (output) from our Python program to file.
    n = 100
    example_string = "The value of n is %d.\n"%n
    my_file.write(example_string)

    # Here, we write to a file from within a loop
    my_file.write("\nAnd now, here's a song about 99 ipsum lorem lines:\n")
    for nlines in range(1, n, 1):
        line = "%d ipsum lorem lines printed here. %d ipsum lorem lines. One is never enough, we need another, %d ipsum lorem lines printed!\n"%(nlines,nlines,nlines)
        my_file.write(line)

### Parsing
It will often be necessary to "parse" the contents of a file. This means we will extract information from the string into Python variables that we can operate on. When the file was not designed to be parsed conveniently, this can get messy!

As an example, let's collect all the words that appear in 25 of Pythagoras' famous sayings. Instead of reading the entire file at once, we read it line by line so that we can process each line as we go, i.e. split it into words and collect those words in a list. This is done by iterating over the file object.

In [None]:
with open('Data/pythagoras.txt','r') as input_file:

    # Throw away the first 3 lines. They do not contain the quotes.
    # Notice that we are reading the first 3 lines but we are not doing anything with them
    # Using the _ name is a Python convention for a variable name that we are not going to use
    for j in range(3):
        _ = input_file.readline()

    words = []
    i = 0
    # Iterate over remaining lines
    for line in input_file:

        # Action: think about this line. What does it do?
        words.extend(line.split())
        
        # Action: what about this line. what does it do?
        words = [w.lower() for w in words]
        
        i = i + 1
        
    print(words)
    #print(sorted(words))
    print("number of words:", len(words))
    print("number of lines:", i)   # sanity check to ensure it actually went through 25 lines

### Side topic: Bytes and strings

When you talk about “text” you’re probably thinking of “characters and symbols on my computer screen.”

But computers don’t deal in characters and symbols; they deal in **bits and bytes**. 

Every piece of text you’ve ever seen on a computer screen is actually stored in a particular **character encoding**.

Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. 

There are many different character encodings, some optimized for particular languages like Russian, Hindi or Chinese or English, and others that can be used for multiple languages.

A very common __encoding__ is __UTF-8__.

Python has two different classes: __`bytes`__ and __`str`__ (string).

To transform bytes into a string, you need to __decode__ them, specifying the __encoding__.

To transform a string into bytes, you need to __encode__ it, specifying the __encoding__.


#### Examples

In [None]:
s1 = '义勇军进行曲'
print(s1)
s2 = 'जन गण मन'
print(s2)

In [None]:
b1 = s1.encode()
print(b1)
b2 = s2.encode()
print(b2)

The method `encode()` converts the characters of a string into bytes; you may have noticed the `b'` prefix in the output above, which marks a `bytes` object. By default, it uses 'utf-8', the 8-bit Unicode Transformation Format.

In [None]:
type(b1), type(b2)

In [None]:
b1.decode(), b2.decode()

In [None]:
help(b2.decode)

English text, which we are familiar with, can be encoded with ASCII as well.

In [None]:
s3 = 'china'
s3.encode('ASCII') # American Standard Code for Information Interchange, a 7-bit character set containing 128 characters.

In [None]:
# ascii() returns a printable representation of an object in which non-ASCII characters are escaped
print(ascii("china"))
print(ascii("This å is a special character"))
print(ascii("This å is a special character!"))

# ord() returns the Unicode code point of a character (the same as its ASCII value for ASCII characters)
print(ord("c"))
print(ord("å"))

Some encodings do not work for certain types of characters!

In [None]:
'义勇军进行曲'.encode('ASCII')
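
The call above raises a `UnicodeEncodeError`. As a hedged sketch, the snippet below shows two common ways to deal with this: catching the exception, or passing the `errors` argument of `str.encode()` to substitute the characters that cannot be encoded. Which strategy is appropriate depends on the application; switching to an encoding that covers the text, such as UTF-8, is usually the better fix.

```python
text = '义勇军进行曲'

# Option 1: catch the error and report it.
try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    print("Cannot encode:", err.reason)

# Option 2: replace each unencodable character with '?'.
replaced = text.encode('ascii', errors='replace')
print(replaced)

# Option 3: keep the information as backslash escape sequences.
escaped = text.encode('ascii', errors='backslashreplace')
print(escaped)

# UTF-8 can represent every character, so no error handling is needed.
utf8 = text.encode('utf-8')
print(len(text), "characters ->", len(utf8), "bytes")
```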

In [None]:
'abc'.encode().find(b'c')

In [None]:
with open("file_china_chars", 'bw') as f:
    f.write(s1.encode())

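In [None]:
!type file_china_chars
# 'cat' instead of 'type' for Linux

In [None]:
# how to read a binary file
with open("file_china_chars", 'rb') as input_file:

    # In binary mode, read() returns the contents of the file as a bytes object
    file_contents = input_file.read()

# Decode the bytes back into a string to see what we got
print(file_contents.decode())

In [None]:
# clean up: delete the file now that we have read it
!del file_china_chars
# 'rm' instead of 'del' for Linux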

## JSON Files
JSON (JavaScript Object Notation) is an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. 

Many APIs use the JSON format to send and receive data. Popular alternatives to JSON are YAML and XML. JSON is often compared to XML because the two are used in similar ways, and JSON has become a common alternative to the older XML.

<b>Difference between JSON and dictionary</b>: JSON is a data format (a string), whereas a Python dictionary is a data structure (an in-memory object). For example, you cannot write a dictionary directly to a file or send it directly in an HTTP request; instead, you first convert it to JSON and then store or transfer the data as a JSON file or JSON string.



JSON is standardised in [RFC 7159](https://tools.ietf.org/html/rfc7159) and [ECMA-404](http://www.ecma-international.org/publications/standards/Ecma-404.htm).

### Description

JSON's data types are:

- __Number__
- __String__
- __Boolean__: __true__ / __false__
- __Array__: ordered list of values. Each of which may be of any type. Arrays use square bracket __[ ]__ notation with elements being __comma-separated ,__ .
- __Object__: an unordered collection of name/value pairs where the names (also called keys) are strings. Objects are delimited with __curly brackets {}__ and use __commas to separate each pair__, while within each pair the __colon ':' character separates the key or name from its value__.
- __null__: An empty value, using the word __null__.

| Python        | JSON          | Example in JSON   |
| ------------- |:-------------:| -----------------:|
| int or float  | Number        |  34               |
| str           | String        |  "foo bar"        |
| bool          | Boolean       |    true           |
| None          | null          |    null           |
| list          | Array         | [1, 2, 3]         |
| dict          | Object        |{"type": "home", "number": "212 555-1234"}|
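
To check the mapping in the table, the short sketch below parses a JSON string with `json.loads()` and prints the Python type of each value. The keys and values are invented for the example.

```python
import json

# A JSON document (a string) containing one value of each JSON type.
document = '''{
    "count": 34,
    "ratio": 0.5,
    "name": "foo bar",
    "ok": true,
    "items": [1, 2, 3],
    "owner": {"type": "home"},
    "spouse": null
}'''

parsed = json.loads(document)

# Show which Python type each JSON value was converted to.
for key, value in parsed.items():
    print(key, "->", type(value).__name__)
```

Going the other way, `json.dumps()` converts `True`, `False` and `None` back to `true`, `false` and `null`.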

### Example
Here's an example from [wikipedia](https://en.wikipedia.org/wiki/JSON#Data_types.2C_syntax_and_example)

```json
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
}
```

In [None]:
import json

**Note! Indentation doesn't matter** when writing JSON data: whitespace between elements is ignored, as long as the syntax follows the specification.

In [None]:
# How to load JSON data format into dictionary data structure

json_data = '''{
            "type": "home",
        "number": "212 555-1234",
    "active": true,
            "status": [1, 0, 4],
    "user": {
        "ID": "benley83",
                "passcode": "432Kd3D"
    }
}'''

data = json.loads(json_data)  # loads the JSON data
print(data)
print(type(data))

In [None]:
# wrong example: json.loads() expects a JSON string, not a dictionary
dict_data = {"key" : "value"}
data = json.loads(dict_data) # raises a TypeError

In [None]:
with open("example.json", 'w') as f:
    json.dump(data, f)

In [None]:
# bad example: writing a dictionary to a file without converting it to JSON first
with open("example.txt", 'w') as f:
    f.write(dict_data)  # raises a TypeError: write() expects a string, not a dict

In [None]:
# saving dict data into json
import json

dict_data = {"apple" : 5, "orange" : 10}

with open('example.json', 'w') as fp:
    json.dump(dict_data, fp)
    
with open('example.json', 'r') as fp:
    print(json.load(fp))

In [None]:
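# One major difference between JSON and a dictionary:
# a dict can have a tuple as a key, but JSON object keys must be strings
import json

dict_data = {("apple", "orange") : 5, "orange" : 10}

print(dict_data)

# Converting this dict to JSON raises a TypeError because of the tuple key.
# We use json.dumps() here instead of json.dump() so that example.json,
# which the following cells read, is left untouched.
try:
    print(json.dumps(dict_data))
except TypeError as err:
    print("TypeError:", err)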

In [None]:
!type example.json
# 'cat' instead of 'type' for Linux

In [None]:
with open("example.json", 'r') as f:
    data2 = json.load(f)
    
print(data2)

The `json.tool` module can be run from the command line with `python -m` to pretty-print (indent) a JSON file:

In [None]:
!python -m json.tool example.json

This does not change the formatting of the original JSON file (check the file); it only pretty-prints the contents so that you can inspect them. Normally, we want a JSON file to be as compact and lightweight as possible, so having no extra whitespace is actually a good thing.
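
The same trade-off between readability and size is available from within Python. The sketch below, using a made-up record, serialises one dictionary twice with `json.dumps()`: once with `indent` for reading, and once with compact `separators` for storage or transfer.

```python
import json

# A small made-up record for illustration.
record = {"type": "home", "status": [1, 0, 4]}

# Human-friendly: one item per line, indented by 4 spaces.
pretty = json.dumps(record, indent=4)
print(pretty)

# Compact: no whitespace after ',' or ':'.
compact = json.dumps(record, separators=(",", ":"))
print(compact)

# Both strings describe exactly the same data.
print(json.loads(pretty) == json.loads(compact))
print(len(pretty), "characters vs", len(compact), "characters")
```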

We now load some data containing the top rated movies in Sweden (`top-rated-movies-01.json`). First, try to understand how the data is structured inside this JSON file.

In [None]:
with open('Data/top-rated-movies-01.json') as json_data:
    d = json.load(json_data)


In [None]:
!python -m json.tool Data/top-rated-movies-01.json

You probably want to take a peek at the first entry's keys. Too many entries and values printed just makes it hard to read!

In [None]:
with open('Data/top-rated-movies-01.json') as json_data:
    d = json.load(json_data)
    print(d[0].keys())

**Quick Exercise 1** Print the number of movies, followed by the full list of movies with their respective IMDB ratings.

In [None]:
with open('Data/top-rated-movies-01.json') as json_data:
    d = json.load(json_data)
    
    # write your code after this line
    
    

## CSV Files

A Comma-Separated Values (CSV) file stores tabular data (numbers and text) in plain text. More often than not, a CSV file can be opened and parsed comfortably in spreadsheet programs such as MS Excel or OpenOffice Calc.

Each line of the file is a data record. Each record consists of one or more fields, separated by commas. 


### Basic Rules 

- CSV is a delimited data format that has fields/columns separated by the comma (,) character and records/rows terminated by newlines.
- A CSV file does not require a specific character encoding, byte order, or line terminator format (some software does not support all line-end variations).
- A record ends at a line terminator. However, line-terminators can be embedded as data within fields, so software must recognize quoted line-separators (see below) in order to correctly assemble an entire record from perhaps multiple lines.
- All records should have the same number of fields, in the same order.
- Adjacent fields must be separated by a single comma. However, "CSV" formats vary greatly in this choice of separator character. In particular, in locales where the comma is used as a decimal separator, semicolon (;), TAB, or other characters are used instead.
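
To illustrate the rule about quoted fields, the sketch below writes a field containing a comma and a field containing a line break, and then reads them back. It uses an in-memory `io.StringIO` buffer instead of a real file, and the rows are invented for the example.

```python
import csv
import io

# In-memory text buffer standing in for a file (illustration only).
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Model", "Note"])
writer.writerow(["Exora", "7 seats, MPV"])        # field contains a comma
writer.writerow(["Alza", "line one\nline two"])   # field contains a newline

raw_text = buffer.getvalue()
print(raw_text)  # the two special fields are wrapped in double quotes

# Reading back: the reader reassembles the quoted fields, so the last
# record is recovered as one row even though it spans two physical lines.
rows = list(csv.reader(io.StringIO(raw_text, newline="")))
for row in rows:
    print(row)
```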

### Example

Write CSV file with standard Python library

In [None]:
import csv

with open('test.csv', 'w') as csvfile:
    writer = csv.writer(csvfile, lineterminator = '\n')
    writer.writerow(['Year', 'Make', 'Model'])
    writer.writerow(['2009', 'Proton', 'Exora'])
    writer.writerow(['2009', 'Perodua', 'Alza'])

In [None]:
!type test.csv


Read a file and store csv records row-by-row into a nested list (i.e. each inner list holds a row of record):

In [None]:
import csv

def load_csv(filename, delim=','):
    data = []
    with open(filename, 'r') as f:
        reader = csv.reader(f, delimiter=delim)
        for row in reader:
            data.append(row)
    return data

data = load_csv('test.csv')
print(data)

**TASK** We have obtained information from Malaysia's Open Data Portal on the number of vehicles in Malaysia over the period of 2008-2015. Examine the contents of the CSV file carefully. 

1. Find the total number of vehicles in Malaysia per year.
2. What is generally the ratio of active to inactive vehicles in Malaysia (across all states)?

In [None]:
vehicle = load_csv('Data/kenderaan.csv', delim=';')

print(vehicle[0])   # header stuff
print(vehicle[1])

# need to remove header before processing further
vehicle = vehicle[1:]

Tip: Notice that the vehicle counts use a comma as a thousands separator. You have to remove this comma before the number can be used in arithmetic.
* Hint: use the string methods covered in notebook w1-01
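
As a small sketch of the idea, the snippet below converts one number written with thousands separators into an integer. The value `"1,234,567"` is a hypothetical example and is not taken from the data file.

```python
# Hypothetical value formatted with thousands separators.
raw_count = "1,234,567"

# Remove the commas, then convert the remaining digits to an integer.
count = int(raw_count.replace(",", ""))

print(count + 1)  # the result now behaves like a number
```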

In [None]:
# write your code here


