## Python open data file formats

### JSON 
* JSON is a simple file format that is very easy for any programming language to read.
* Working with JSON in Python is almost the same as working with a Python dictionary.
* You will need the json library, which is part of the standard library.

In [2]:
import json

# Open the file in a with block so it is closed automatically
with open("sample_data.json") as json_file:
    data = json.load(json_file)
data

{'felineIQ': None,
 'isCat': True,
 'miceCaught': 0,
 'name': 'Zophie',
 'napsTaken': 37.5}
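
To make the dictionary comparison concrete, here is a minimal sketch that parses a JSON string shaped like the output above and writes it back out. The inline string is an assumption standing in for sample_data.json, whose actual contents are not shown here.

```python
import json

# Hypothetical JSON text, modeled on the output shown above
raw = '{"name": "Zophie", "isCat": true, "miceCaught": 0, "napsTaken": 37.5, "felineIQ": null}'

# json.loads parses a string; json.load (used above) reads from a file object
cat = json.loads(raw)

# JSON true/false/null become Python True/False/None
print(cat["name"], cat["isCat"], cat["felineIQ"])

# The result is an ordinary dict, so normal dict operations work
cat["miceCaught"] += 1

# json.dumps turns the dict back into JSON text
print(json.dumps(cat, indent=2, sort_keys=True))
```

Because the parsed value is a plain dict, anything that works on dictionaries (indexing, iteration, updates) works on the loaded data as well.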

### XML 
* XML is a widely used format for data exchange.
* It keeps the structure in the data.
* Reading it is pretty easy in Python as well.
* You will need the minidom library (xml.dom.minidom in the standard library).

In [13]:
from xml.dom import minidom
xmldoc = minidom.parse("sample_data.xml")
itemlist = xmldoc.getElementsByTagName("name")
print(itemlist)
for i in itemlist:
    # The element's text lives in its first child node;
    # getUserData is a method and has no .data attribute
    print(i.firstChild.data)

[<DOM Element: name at 0x49d16d0>, <DOM Element: name at 0x49d13d8>, <DOM Element: name at 0x49d1048>, <DOM Element: name at 0x49c8df0>, <DOM Element: name at 0x49c8af8>]
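
Since sample_data.xml itself is not shown, the following is a self-contained sketch of reading element text with minidom. The XML string and its tag names other than name are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical XML document; the real sample_data.xml is not shown
xml_text = """<staff>
  <member><name>Alice</name></member>
  <member><name>Bob</name></member>
</staff>"""

# parseString works like parse, but takes the XML as a string
xmldoc = minidom.parseString(xml_text)

names = []
for element in xmldoc.getElementsByTagName("name"):
    # The text inside <name>...</name> is a child Text node of the element
    text_node = element.firstChild
    if text_node is not None and text_node.nodeType == text_node.TEXT_NODE:
        names.append(text_node.data)

print(names)
```

The check on firstChild guards against empty elements such as <name/>, which have no child node to read.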

### XLS - Spreadsheets
* Example: Microsoft Excel.
* This data can often be used immediately, provided there are correct descriptions of what the different columns mean.
* One option is to convert the file with a tool like xls2csv and then work with the output as a CSV file.
* The best source I found is www.python-excel.org.

### CSV - Comma-Separated Values
* CSV is compact and thus suitable for transferring large sets of data with the same structure.
* You can use the csv library from the Python standard library.
* CSV files carry no structural metadata inside the document.
* Developers therefore need to write a parser that interprets each document as it appears.

In [1]:
import csv
# newline='' is the mode the csv module recommends for reading
with open('sample_data.csv', newline='') as exampleFile:
    exampleReader = csv.reader(exampleFile)
    exampleData = list(exampleReader)
exampleData

[['4/5/2015 13:34', 'Apples', '73'],
 ['4/5/2015 3:41', 'Cherries', '85'],
 ['4/6/2015 12:46', 'Pears', '14'],
 ['4/8/2015 8:59', 'Oranges', '52'],
 ['4/10/2015 2:07', 'Apples', '152'],
 ['4/10/2015 18:10', 'Bananas', '23'],
 ['4/10/2015 2:40', 'Strawberries', '98']]
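
Every value in the output above is a string, because the file does not say what each column means. A small sketch of the kind of parser this calls for follows. The column meanings (timestamp, item, quantity) and the month/day/year order are assumptions, and io.StringIO stands in for the file.

```python
import csv
import io
from datetime import datetime

# Two rows copied from the output above; io.StringIO stands in for the file
sample = "4/5/2015 13:34,Apples,73\n4/6/2015 12:46,Pears,14\n"

def parse_row(row):
    """Convert one row of strings into (datetime, str, int).

    The column meanings are an assumption; the file itself does not state them.
    """
    timestamp = datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    return timestamp, row[1], int(row[2])

records = [parse_row(row) for row in csv.reader(io.StringIO(sample))]
for timestamp, item, quantity in records:
    print(timestamp.isoformat(), item, quantity)
```

If the file had a header row, csv.DictReader would be a natural alternative, since it maps each row to a dictionary keyed by column name.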

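In [None]:
import csv
# Open in text mode; newline='' lets the csv module handle line endings
with open('eggs.csv', newline='') as csvfile:
    # Pass the open file object to the reader, not a path string
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        print(', '.join(row))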

### PDF 
* Many datasets are published as PDF, and unfortunately it isn’t easy to read and then edit them.
* PDF is presentation oriented, not content oriented.
* You can still use PDFMiner to extract text from it.

### HTML
* A lot of data is available in HTML format on various sites.
* HTML is a presentation language built from tags.
* Yahoo developed a tool, YQL, that can extract structured information from a website.
* I use the Python library Beautiful Soup for my projects.



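In [None]:
from bs4 import BeautifulSoup
# html_file is assumed to be an open file object or a string of HTML
soup = BeautifulSoup(html_file, "html.parser")
soup.title               # the <title> element
soup.title.name          # tag name, 'title'
soup.title.string        # text inside <title>
soup.title.parent.name   # name of the enclosing tag
soup.p                   # first <p> element
soup.p['class']          # value of its class attribute
soup.a                   # first <a> element
soup.find_all('a')       # list of all <a> elements
soup.find(id="link3")    # element whose id is "link3"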
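
For a dependency-free view of what pulling data out of tags involves, here is a rough sketch that collects link targets with the standard library's html.parser instead of Beautiful Soup. The LinkCollector class and the HTML snippet are hypothetical and only meant for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

# Hypothetical page content
html_text = (
    '<p class="story">See '
    '<a href="http://example.com/elsie" id="link1">Elsie</a> and '
    '<a href="http://example.com/lacie" id="link2">Lacie</a>.</p>'
)

collector = LinkCollector()
collector.feed(html_text)
print(collector.links)
```

This covers roughly the same ground as soup.find_all('a') in the cell above. Beautiful Soup remains the more convenient choice once you need searching by id, class, or nesting.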

### Scanned image
* Formats such as TIFF and JPEG-2000 can at least carry documentation of what is in the picture.
* If the images are clean, containing only text and no noise, you can use a library called pytesser.
* You will need the PIL library to use it.


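In [15]:
from PIL import Image  # PIL provides the Image class
from pytesser import image_to_string  # pytesser must be installed separately

image = Image.open('phototest.tif')  # Open image object using PIL
print(image_to_string(image))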

ImportError: No module named 'pytesser'

### TXT - text files

In [None]:
help(open)

In [17]:
try:
    # "file root" is a placeholder path; replace it with a real file
    with open("file root", "r") as text_file:
        lines = text_file.read()
except OSError:
    print("Failed to open file.")

Failed to open file.


In [None]:
# Write a file
with open("test.txt", "wt") as out_file:  # closed automatically on exit
    out_file.write("Sample File:\n")
    out_file.write("Line one\n")
    out_file.write("Line two\n")
    out_file.write("Line three\n")

In [None]:
# Read a file
with open("test.txt", "rt") as in_file:  # closed automatically on exit
    text = in_file.read()
text = '*'*40 + '\n' + text + '\n' + '='*40

print(text)


In [None]:
with open("test.txt") as f:  # f is a file object, closed when the block ends
    line = f.readline()  # Read the first line
    while line:  # readline() returns '' at end of file
        print('*' + line, end='')  # end='' suppresses the newline that print would append
        line = f.readline()

In [None]:
for line in open("test.txt"):
    print(line, end='')

In [None]:
def printFile(filename):
    '''
    Prints the content of the given file. Raises an exception if the file does not exist.
    '''
    print("-"*40)
    print(filename)
    print("-"*40)
    with open(filename) as f:  # closes the file when done
        for line in f:  # a file object yields its lines one at a time
            print(line, end='')  # end='' suppresses the newline that print would append

printFile("test.txt")


### Check if a file exists

In [None]:
import os
os.path.exists('test123.txt')
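
Note that os.path.exists also returns True for directories, and a file can disappear between the check and the later open call. The sketch below contrasts the check with simply trying to open the file. The helper read_if_present and the file names are hypothetical, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

def read_if_present(path):
    """Return the file's text, or None if it does not exist (hypothetical helper)."""
    try:
        with open(path) as handle:
            return handle.read()
    except FileNotFoundError:
        return None

# Work inside a throwaway directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as folder:
    present = os.path.join(folder, "present.txt")
    missing = os.path.join(folder, "missing.txt")
    with open(present, "w") as handle:
        handle.write("hello\n")

    checks = {
        "present exists": os.path.exists(present),
        "missing exists": os.path.exists(missing),
        "folder exists": os.path.exists(folder),   # True: directories count too
        "folder is file": os.path.isfile(folder),  # False: isfile is stricter
    }
    present_text = read_if_present(present)
    missing_text = read_if_present(missing)

print(checks)
print(repr(present_text), repr(missing_text))
```

When the goal is to read the file anyway, catching the exception avoids the gap between checking and opening.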

In [None]:
# word count example

wordcount = {}
with open("test.txt", "r") as f:  # read-only mode is enough for counting
    for word in f.read().split():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
#for k,v in wordcount.items():
#    print (k, v)
wordcount
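
The same count can be sketched more compactly with collections.Counter from the standard library. The string below repeats the text written to test.txt in the earlier cell, so the example runs without the file; the names text and counts are chosen for illustration.

```python
from collections import Counter

# Same text that the earlier cell writes to test.txt
text = "Sample File:\nLine one\nLine two\nLine three\n"

# str.split() with no arguments splits on any run of whitespace
counts = Counter(text.split())

# A Counter behaves like a dict of word -> number of occurrences
print(counts["Line"])
print(counts.most_common(1))
```

As in the loop version, punctuation stays attached to words, so "File:" is counted separately from "File".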