## 1. Unifying encoding types 
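* Use a **Unicode** encoding consistently, _e.g. utf-8, utf-16, utf-32_   
* Opening files with **utf-8-sig** strips a leading BOM (Byte Order Mark) when reading, which avoids encoding issues between files created on different operating systems. When writing a new file, utf-8-sig adds a BOM at the start. 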
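
The effect of the BOM can be checked with a small sketch like the one below. The temporary file names (`bom_example.txt`, `written.txt`) are made up for this illustration and are not part of the example data used in this notebook.

```python
import os
import tempfile

# Hypothetical temporary files, created only for this illustration
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'bom_example.txt')

# Write raw bytes: a UTF-8 BOM (EF BB BF) followed by the text 'Hello'
with open(path, 'wb') as f:
    f.write(b'\xef\xbb\xbfHello')

# Plain 'utf-8' keeps the BOM as an invisible first character
with open(path, 'r', encoding='utf-8') as f:
    with_bom = f.read()

# 'utf-8-sig' strips the BOM while reading
with open(path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()

print(repr(with_bom))     # '\ufeffHello'
print(repr(without_bom))  # 'Hello'

# Writing with 'utf-8-sig' puts a BOM at the start of a new file
written_path = os.path.join(folder, 'written.txt')
with open(written_path, 'w', encoding='utf-8-sig') as f:
    f.write('Hi')
with open(written_path, 'rb') as f:
    raw_bytes = f.read()
print(raw_bytes)          # b'\xef\xbb\xbfHi'
```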


## 2. File path
* Absolute path: path from the root directory
    * _ex._ `/Users/workspace/python-pandas`  
        * starts at the root directory (`/`), with `/` separating each directory   
        * the trailing `/` is optional
    
* Relative path: path from working directory
    * _ex._ `main.py`
        * the file named `main.py` in the present working directory
    * _ex._ `../python-keras`   
        * refers to `/Users/workspace/python-keras` when the working directory is `/Users/workspace/python-pandas`
        * `../` points to the parent folder
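
As a rough sketch, the relative path above can be resolved by hand with the standard `posixpath` module. The working directory is assumed to be `/Users/workspace/python-pandas`, as in the absolute path example; `posixpath` is used here (instead of `os.path`) so the result looks the same on every OS.

```python
import posixpath

# Assumed working directory, taken from the absolute path example
working_dir = '/Users/workspace/python-pandas'

# Join the working directory with the relative path, then collapse '../'
resolved = posixpath.normpath(posixpath.join(working_dir, '../python-keras'))
print(resolved)                          # /Users/workspace/python-keras

# An absolute path starts at the root directory; a relative path does not
print(posixpath.isabs(working_dir))      # True
print(posixpath.isabs('main.py'))        # False

# The trailing '/' is optional: both forms normalize to the same path
same = posixpath.normpath(working_dir + '/') == posixpath.normpath(working_dir)
print(same)                              # True
```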

## 3. Plain Text files

### 3.1 Read files
* `.read()` returns the whole file as **one string**
* `.readline()` returns **one line** per call
* `.readlines()` returns the whole file as a **list** with one string per line
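
The three methods can be compared side by side with a minimal sketch. It uses a throwaway temporary file (`sample.txt`, a made-up name) so it does not depend on the files in this notebook.

```python
import os
import tempfile

# Hypothetical temporary file with three lines (no newline after the last one)
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('first\nsecond\nthird')

# .read(): the whole file as one string
with open(path, 'r', encoding='utf-8') as f:
    whole = f.read()

# .readline(): one line per call, including its trailing '\n'
with open(path, 'r', encoding='utf-8') as f:
    first_line = f.readline()
    second_line = f.readline()

# .readlines(): a list with one string per line
with open(path, 'r', encoding='utf-8') as f:
    lines = f.readlines()

print(repr(whole))        # 'first\nsecond\nthird'
print(repr(first_line))   # 'first\n'
print(repr(second_line))  # 'second\n'
print(lines)              # ['first\n', 'second\n', 'third']
```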

In [40]:
# Mode options: 'r' = read, 'w' = write (overwrite), 'a' = append
data_file = open('files/text_data.txt', 'r', encoding='utf-8-sig')
data_lines = data_file.readlines()   # list with one string per line
data_file.close()                    # release the file handle
print(data_lines)
print(type(data_lines))

['Hello World!\n', 'This is an example file\n', '-----------------------\n', 'Sensor Cloud, Kookmin University']
<class 'list'>


In [41]:
# Each line keeps its trailing '\n', so print() shows an extra blank line after it
for data_line in data_lines:
    print(data_line)

Hello World!

This is an example file

-----------------------

Sensor Cloud, Kookmin University


In [42]:
data_file = open('files/text_data.txt', 'r', encoding='utf-8-sig')
data = data_file.read()   # whole file as one string
data_file.close()
print(data)

Hello World!
This is an example file
-----------------------
Sensor Cloud, Kookmin University


### 3.2 Write files
* `.write()` writes data to the file, _**BUT**_ if the file was opened with the 'w' option, it **overwrites** the original file

In [61]:
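# Mode options: 'r' = read, 'w' = write (overwrite), 'a' = append

# Write the file ('w' replaces any existing content)
data_file = open('files/text_data.txt', 'w', encoding='utf-8-sig')
data_file.write('Overwritten !!!')
data_file.close()   # close so the data is flushed to disk before reading

# Read the file
data_file = open('files/text_data.txt', 'r', encoding='utf-8-sig')
data = data_file.read()
data_file.close()
print(data)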

Overwritten !!!


In [62]:
# Mode options: 'r' = read, 'w' = write (overwrite), 'a' = append

# Add a line ('a' appends to the end of the existing content)
data_file = open('files/text_data.txt', 'a', encoding='utf-8-sig')
data_file.write('\nLine added !!!')
data_file.close()   # close so the data is flushed to disk before reading

# Read the file
data_file = open('files/text_data.txt', 'r', encoding='utf-8-sig')
data = data_file.read()
print(data)

data_file.close()

Overwritten !!!
Line added !!!
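
The difference between the 'w' and 'a' options can also be sketched with `with open()`, which closes the file automatically at the end of each block. The file name `modes.txt` is a made-up temporary file, not one of the notebook's data files.

```python
import os
import tempfile

# Hypothetical temporary file used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

# 'w' creates the file (or empties an existing one) before writing
with open(path, 'w', encoding='utf-8') as f:
    f.write('first version')

# Opening with 'w' again discards 'first version'
with open(path, 'w', encoding='utf-8') as f:
    f.write('Overwritten !!!')

# 'a' keeps the existing content and writes at the end
with open(path, 'a', encoding='utf-8') as f:
    f.write('\nLine added !!!')

with open(path, 'r', encoding='utf-8') as f:
    content = f.read()

print(content)
```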


## 4. CSV files

* Example of the CSV (Comma-Separated Values) file format   
If we want to store the following data,    

Index | Value
:-----: | :-----------:
Name | Seiwon Park
Univ | Kookmin

then we can save it in `.csv` format as follows:   

Index, Value   
Name, Seiwon Park   
Univ, Kookmin   

### 4.1 Read files
* Import the `csv` module
* `csv.reader()` reads the file row by row; each row is returned as a list of strings

In [67]:
import csv

data_file = open('files/csv_data.csv', 'r', encoding='utf-8-sig')
data_lines = csv.reader(data_file, delimiter=',')   # delimiter: the character that separates fields
for data_line in data_lines:
    print(data_line)
    
data_file.close()

['Index', ' Value']
['Name', ' Seiwon Park']
['Univ', ' Kookmin']
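
The values in the output above keep the space that follows each comma. One possible way to drop it is the `skipinitialspace=True` option of `csv.reader()`. The sketch below reads the same three rows from an in-memory string (an assumption made so that it runs without `files/csv_data.csv`).

```python
import csv
import io

# Same content as the CSV example, held in memory instead of a file
raw_text = 'Index, Value\nName, Seiwon Park\nUniv, Kookmin\n'

# Default behaviour: the space after each comma is kept in the value
rows_default = list(csv.reader(io.StringIO(raw_text), delimiter=','))

# skipinitialspace=True ignores whitespace right after the delimiter
rows_stripped = list(csv.reader(io.StringIO(raw_text), delimiter=',',
                                skipinitialspace=True))

print(rows_default[0])    # ['Index', ' Value']
print(rows_stripped[0])   # ['Index', 'Value']
```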


### 4.2 Write files
* Import the `csv` module
* `csv.writer()` writes rows to a file
* _**NOTE**_: passing `newline=''` to `open()` is recommended. `csv.writer()` ends each row with "\r\n" by default, and without `newline=''` Windows translates the "\n" again, so an empty line appears after every row
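
The line ending that `csv.writer()` produces can be inspected with an in-memory buffer, as in this small sketch. `io.StringIO(newline='')` stands in for a file opened with `newline=''`; it is used here only so the example needs no file on disk.

```python
import csv
import io

# In-memory text buffer; newline='' means no newline translation is applied
buffer = io.StringIO(newline='')

writer = csv.writer(buffer, delimiter=',')
writer.writerow(['1', '2', '3'])

raw_row = buffer.getvalue()
print(repr(raw_row))   # '1,2,3\r\n'
```

Because the writer already emits "\r\n", any extra translation of "\n" by the file object would produce "\r\r\n", which shows up as a blank line between rows.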

In [71]:
import csv

# Write file
data_file = open('files/csv_data.csv', 'w', encoding='utf-8-sig', newline='')
data_write = csv.writer(data_file, delimiter=',')
data_write.writerow(['1', '2', '3'])
data_file.close()   # close so the row is flushed to disk before reading

# Read file
data_file = open('files/csv_data.csv', 'r', encoding='utf-8-sig')
data_lines = csv.reader(data_file, delimiter=',')   # delimiter: the character that separates fields
for data_line in data_lines:
    print(data_line)
data_file.close()

['1', '2', '3']


In [75]:
import csv

# Write file
# `with open()` closes the file automatically when the block ends
with open('files/new_csv_data.csv', 'w', encoding='utf-8-sig', newline='') as csv_file:
    
    # Set headers
    field_names = ['Index', 'Value']
    writer = csv.DictWriter(csv_file, fieldnames=field_names)
    
    # Write rows
    writer.writeheader()
    writer.writerow({'Index': 'Name', 'Value': 'Seiwon Park'})
    writer.writerow({'Index': 'Univ', 'Value': 'Kookmin'})


# Read file
with open('files/new_csv_data.csv', 'r', encoding='utf-8-sig', newline='') as csv_file:
    data_lines = csv.reader(csv_file, delimiter=',')   # delimiter: the character that separates fields
    for data_line in data_lines:
        print(data_line)

['Index', 'Value']
['Name', 'Seiwon Park']
['Univ', 'Kookmin']
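
Rows written with `csv.DictWriter` can be read back as dictionaries with `csv.DictReader`, which uses the header row as keys. The sketch below is one possible round trip. It writes to an in-memory buffer instead of `files/new_csv_data.csv` so that it runs on its own.

```python
import csv
import io

# In-memory buffer standing in for a file opened with newline=''
buffer = io.StringIO(newline='')

# Write the header and two rows from dictionaries
field_names = ['Index', 'Value']
writer = csv.DictWriter(buffer, fieldnames=field_names)
writer.writeheader()
writer.writerow({'Index': 'Name', 'Value': 'Seiwon Park'})
writer.writerow({'Index': 'Univ', 'Value': 'Kookmin'})

# Go back to the start of the buffer and read each row as a dict
buffer.seek(0)
records = [dict(row) for row in csv.DictReader(buffer)]

for record in records:
    print(record['Index'], '->', record['Value'])
```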


## 5. XML files 
* XML (Extensible Markup Language)
    * Markup language: uses _**tags**_ to represent the elements in a file
* Structure 

        <tag attribute="attribute_value">content</tag>
        
        <examples type="test">
            <example a="12345" b="hello">This is for testing</example>
            <example c="678910" d="world">Sensor Cloud</example>
        </examples>
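
To see how tags, attributes and content map to Python objects, the example above can be parsed with the standard library's `xml.etree.ElementTree`. This is only a sketch of the structure; it is a different tool from the BeautifulSoup approach used in the next section.

```python
import xml.etree.ElementTree as ET

# The example structure from above, as a string
xml_text = '''<examples type="test">
    <example a="12345" b="hello">This is for testing</example>
    <example c="678910" d="world">Sensor Cloud</example>
</examples>'''

root = ET.fromstring(xml_text)

# The root element: its tag name and its attributes (a dict)
print(root.tag, root.attrib)   # examples {'type': 'test'}

# Each child element: tag, attributes and text content
texts = []
for child in root:
    print(child.tag, child.attrib, child.text)
    texts.append(child.text)
```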
        
        
### 5.1 Read XML files


In [48]:
!pip install beautifulsoup4



In [45]:
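from bs4 import BeautifulSoup

with open('files/xml_data.xml', 'r', encoding='utf-8-sig') as data_file:
    # features='xml' or 'lxml' require the lxml package; 'html' uses whichever HTML parser is available
    soup = BeautifulSoup(data_file, features="html")
    elements = soup.select('a')   # every <a> element in the document
    for element in elements:
        print(element.text)       # text of the element and its children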



Hello XML !!!
XML stands for Extensible Markup Language


Sensor Cloud
Kookmin University




In [50]:
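from bs4 import BeautifulSoup

with open('files/xml_data.xml', 'r', encoding='utf-8-sig') as data_file:
    # features='xml' or 'lxml' require the lxml package; 'html' uses whichever HTML parser is available
    soup = BeautifulSoup(data_file, features="html")
    elements = soup.select('a')   # every <a> element in the document
    for element in elements:
        print(element.select_one('ti'))   # first <ti> child, including its tags
        print(element.select_one('co'))   # first <co> child, including its tags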

<ti>Hello XML !!!</ti>
<co>XML stands for Extensible Markup Language</co>


## Practice 1. Gathering 'Related Search Terms' from Google

### 1. Get 'http://suggestqueries.google.com/complete/search?output=toolbar&q='
### 2. Search for the following attribute

        <suggestion data="related search term"> 

In [58]:
# Practice 1. Gathering 'Related Search Terms'

from bs4 import BeautifulSoup
import requests

url = 'http://suggestqueries.google.com/complete/search?output=toolbar&q='
search_term = 'christmas'
response = requests.get(url + search_term)        # request the suggestions as XML
soup = BeautifulSoup(response.content, 'html')
attr = soup.select('suggestion')                  # every <suggestion> element

for ele in attr:
    print(ele['data'])                            # value of the 'data' attribute

christmas
christmas tree
christmas wallpaper
christmas card
christmas card message
christmas carol
christmas illustration
christmas background
christmas cake
christmas aesthetic
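
The two steps can also be sketched without a network connection. In the sketch below, `build_url()` and `extract_suggestions()` are hypothetical helper names, and `sample_response` is an assumed, shortened response. Only the `<suggestion data="...">` part comes from the notes above; the surrounding tags are a guess at the layout. `urllib.parse.quote()` is used so that search terms containing spaces are encoded safely.

```python
from urllib.parse import quote
import xml.etree.ElementTree as ET

base_url = 'http://suggestqueries.google.com/complete/search?output=toolbar&q='

def build_url(search_term):
    """Hypothetical helper: append the URL-encoded search term to the base URL."""
    return base_url + quote(search_term)

def extract_suggestions(xml_text):
    """Hypothetical helper: collect the 'data' attribute of every <suggestion>."""
    root = ET.fromstring(xml_text)
    return [node.get('data') for node in root.iter('suggestion')]

# Assumed sample response (shortened); the real layout may differ
sample_response = (
    '<toplevel>'
    '<CompleteSuggestion><suggestion data="christmas"/></CompleteSuggestion>'
    '<CompleteSuggestion><suggestion data="christmas tree"/></CompleteSuggestion>'
    '</toplevel>'
)

print(build_url('christmas tree'))
print(extract_suggestions(sample_response))   # ['christmas', 'christmas tree']
```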


## 6. JSON files
* JSON stands for JavaScript Object Notation

### 6.1 Read JSON files
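* `json.load(file)` reads a .json file and returns Python data (usually a dict or a list)
* `json.loads(string)` parses a JSON-formatted string into Python data
* `json.dump(data, file)` writes Python data to a file in JSON format
* `json.dumps(data)` returns the data as a JSON-formatted string
* `json.dumps(data, indent=2)` returns the same string, pretty-printed with an indent of 2 spaces
* `json.load(json_file).keys()` returns the top-level keys when the file holds a JSON object; wrap it in `list()` to get a list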
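
The functions listed above can be tried together in a short round trip. The dictionary `data` and the temporary file `example.json` are made up for this sketch and are unrelated to `US_category_id.json`.

```python
import json
import os
import tempfile

# Made-up data for this illustration
data = {'items': [{'id': '1', 'title': 'Film & Animation'}], 'kind': 'example'}

# dumps: Python data -> JSON-formatted string
text = json.dumps(data)
pretty = json.dumps(data, indent=2)   # same content, indented by 2 spaces

# loads: JSON-formatted string -> Python data
parsed = json.loads(text)

# dump / load: the same conversions, but through a file
path = os.path.join(tempfile.mkdtemp(), 'example.json')
with open(path, 'w', encoding='utf-8') as json_file:
    json.dump(data, json_file)
with open(path, 'r', encoding='utf-8') as json_file:
    loaded = json.load(json_file)

# keys() gives a view of the top-level keys; list() turns it into a list
keys = list(loaded.keys())

print(type(text), type(parsed))   # <class 'str'> <class 'dict'>
print(keys)                       # ['items', 'kind']
print(pretty)
```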

In [61]:
import json

with open('Examples/00_data/US_category_id.json', 'r', encoding='utf-8-sig') as json_file:
    json_data = json.load(json_file)
    for item in json_data['items']:
        print(item)

{'kind': 'youtube#videoCategory', 'etag': '"m2yskBQFythfE4irbTIeOgYYfBU/Xy1mB4_yLrHy_BmKmPBggty2mZQ"', 'id': '1', 'snippet': {'channelId': 'UCBR8-60-B28hp2BmDPdntcQ', 'title': 'Film & Animation', 'assignable': True}}
{'kind': 'youtube#videoCategory', 'etag': '"m2yskBQFythfE4irbTIeOgYYfBU/UZ1oLIIz2dxIhO45ZTFR3a3NyTA"', 'id': '2', 'snippet': {'channelId': 'UCBR8-60-B28hp2BmDPdntcQ', 'title': 'Autos & Vehicles', 'assignable': True}}
{'kind': 'youtube#videoCategory', 'etag': '"m2yskBQFythfE4irbTIeOgYYfBU/nqRIq97-xe5XRZTxbknKFVe5Lmg"', 'id': '10', 'snippet': {'channelId': 'UCBR8-60-B28hp2BmDPdntcQ', 'title': 'Music', 'assignable': True}}
{'kind': 'youtube#videoCategory', 'etag': '"m2yskBQFythfE4irbTIeOgYYfBU/HwXKamM1Q20q9BN-oBJavSGkfDI"', 'id': '15', 'snippet': {'channelId': 'UCBR8-60-B28hp2BmDPdntcQ', 'title': 'Pets & Animals', 'assignable': True}}
{'kind': 'youtube#videoCategory', 'etag': '"m2yskBQFythfE4irbTIeOgYYfBU/9GQMSRjrZdHeb1OEM1XVQ9zbGec"', 'id': '17', 'snippet': {'channelId': 'UC