# Lecture 2. Text processing and file input output. 

# Section 1. Introduction to File types 

For each of the following file types, we give a brief summary of the file type, its relevance and its structure. Then we discuss how to get data of that type into and out of a DataFrame. 

In [None]:
## JSON 

In [None]:
## HTML

## XML

XML stands for 'extensible markup language'. XML files can be generic or have a document type. For example, GraphML is really just XML with a specific schema that is used for social network graphs. 

Like HTML, XML is a markup language that uses less-than and greater-than signs (```<``` and ```>```) to enclose the element tags. The text inside these tags must have some special characters escaped, such as ```<``` and ```&```. 
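
As a minimal sketch of that escaping, the standard library's ```xml.sax.saxutils``` module offers ```escape``` and ```unescape```. The sample sentence is made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

raw_text = "Fish & chips cost < 10 pounds"   # made-up sample text

# escape() replaces &, < and > with their XML entities
escaped = escape(raw_text)
print(escaped)

# unescape() reverses the process
restored = unescape(escaped)
print(restored == raw_text)
```

The escaped version is what you would place between an opening and a closing tag.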

~~~ xml 
<start> 
    <middle>
        <end1>   Here is an element! </end1>
        <end2>   Here is an element! </end2>
    </middle>
</start>
~~~

Elements have an "element tree". Above, ```start``` is the root node, ```middle``` is a child and ```end1``` is a child of middle. ```end1``` and ```end2``` are siblings. 
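
The tree above can be walked with the standard library's ```xml.etree.ElementTree``` module. This is only a sketch using the small document from this section; the course's later examples use Beautiful Soup instead.

```python
import xml.etree.ElementTree as ET

xml_text = """
<start>
    <middle>
        <end1>   Here is an element! </end1>
        <end2>   Here is an element! </end2>
    </middle>
</start>
""".strip()

root = ET.fromstring(xml_text)                 # the root node: <start>
middle = root.find("middle")                   # a child of the root
siblings = [child.tag for child in middle]     # the children of <middle>
first_text = middle.find("end1").text.strip()  # text inside <end1>

print(root.tag)
print(siblings)
print(first_text)
```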

XML is a self-documenting style, which means that you can insert details about the elements into the document itself. This can be accomplished with keys that are often prepended to the top of the document just below any details about the formatting. 

Most of the time, we will not be so concerned with the top of an XML document. Rather, we will simply want to navigate the element tree to get to the element(s) that are of concern to us. 

In the script below, we will use urllib to request an XML document from Wikipedia. Then we will use a module called 'beautiful soup' to navigate the document and return aspects of the XML.  

## Bytestreams 

Sometimes we will want to read data in as a bytestream. This is not very common, but it matters, for example, when reading zip files from code. Below is an example of reading a file as text and then as a bytestream to see the difference. 

Bytestreams are **encoded** character sets (i.e. they are written in the code computers understand). Strings have been **decoded** so that they can be printed for people to read. Depending on your operating system's default encoding, you might not be able to write some characters to a file. We typically want to encode to UTF-8, which writes the file as bytes that a computer can later decode when it needs to represent the characters to a user. 
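
A small sketch of the same idea without any files: ```str.encode``` turns a string into bytes and ```bytes.decode``` turns bytes back into a string. Decoding with the wrong codec (here ```cp1252```, chosen as an example of a common Windows default) produces garbled text rather than an error.

```python
text = "hello 👋"

encoded = text.encode("utf-8")          # str -> bytes
print(encoded)
print(len(text), len(encoded))          # characters versus bytes

decoded_ok = encoded.decode("utf-8")    # bytes -> str with the right codec
decoded_bad = encoded.decode("cp1252")  # same bytes, wrong codec
print(decoded_ok)
print(decoded_bad)
```

The emoji is a single character but takes four bytes in UTF-8, which is why the two lengths differ.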


In [20]:
x = "hello 👋"

try: 
    fileout = open("temp.txt",'w')
    fileout.write(x)
    fileout.close()
except UnicodeEncodeError: 
    print("This program may have difficulty encoding the emoji")

fileout = open("temp.txt",'w',  encoding='UTF-8')
fileout.write(x)
fileout.close()

print("Below we are reading as a string a file that has been encoded")
filein = open("temp.txt",'r')
print(filein.read())
filein.close()

print("Below we are reading as a byte and then decoding it")
filein = open("temp.txt",'rb')
print(filein.read().decode('UTF-8'))
filein.close()



This program may have difficulty encoding the emoji
Below we are reading as a string a file that has been encoded
hello ðŸ‘‹
Below we are reading as a byte and then decoding it
hello 👋


## Serialization

Sometimes, you want to close a program and pick up right where you left off. This might mean ensuring that all the objects are in the state that you want them to be with no further processing. This process of creating a file that will represent the state of some values is called serialization. We 'serialize' variables or data structures. 

Now, python being python, they had to give it a more friendly name. So in python if you want to save the state of a variable or set of variables as is, you can 'pickle' them. You can then 'unpickle' them later on. 

Pickling is useful when you are processing text on a server and doing something complicated: you can pickle your current state in case the program goes sour, then pick up where you left off.

You can only serialize one object at a time, but of course that object can be a collection of numerous other objects. This is done with the following syntax: 

~~~ py 
import pickle 
x = <object> 
pickle.dump(x,open(<file>,'wb')) 
~~~

And to load the object again (with any name):
~~~
y = pickle.load(open(<file>, 'rb'))
~~~

In [23]:
import pickle

x = ['1','2']
pickle.dump(x,open("temp.txt",'wb'))
x = pickle.load(open("temp.txt",'rb'))
print(x)

['1', '2']


### Pickles can expire: Check the version number

Notice that we are using 'rb' and 'wb' with the pickles. This is because Python 3's default pickling version writes the pickled object as a bytestream. Also note that this pickled object will probably not be readable by python 2.  You can set the pickle protocol so that it is readable by python 2 as an option. 
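
As a sketch of that option: both ```pickle.dump``` and ```pickle.dumps``` accept a ```protocol``` argument, and protocol 2 is the highest one that Python 2 can read. The data here is made up.

```python
import pickle

data = {"words": ["best", "script", "ever"]}   # made-up example object

default_bytes = pickle.dumps(data)             # Python 3's default protocol
legacy_bytes = pickle.dumps(data, protocol=2)  # readable by Python 2 as well

print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

restored = pickle.loads(legacy_bytes)
print(restored == data)
```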

## CSV 

Comma-separated values (CSV) is a common data storage format. Yet, despite its prevalence, there are a few variations to consider:
- How are strings represented? Do they use "<string>" for every string, no string or only strings with commas in them? 
- How are new lines represented? 
- Is there a header? 
- Is there a trailing \n?

It is simple to roll your own CSV reader. So much so that you did a version of this on the first day. Yet, there are often enough details to consider that you might want to rely on an existing tool. Python offers two. The first is the ```csv``` module, which is part of the standard library and only needs to be imported. It has many options for separators and for whether there's a header. It also has some nice ways to index the data. For example, if you want to read each row as a dictionary with the header names as keys and that row's values as values, this does the trick. 

The basic usage, however, is to iterate through a file line by line. Instead of iterating through with 'readline' and splitting the text that comes back, you create a "csv reader", and this iterates line by line returning not a string of text, but a list split at every comma (or user-defined separator). 

~~~ python 
import csv 

with open('data.csv', newline='') as file_to_read:
    filereader = csv.reader(file_to_read, delimiter=' ', quotechar='|')
    for row in filereader:
        print('<>'.join(row))
~~~

The nice thing about ```csv```, particularly when not using pandas, is the DictReader. For each row, it returns a dictionary with the header names as the keys and the values in that row as the values. If there's no header line, you can specify a list as ```fieldnames```. 
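
A minimal sketch of DictReader, with and without a header line. The data is made up, and ```io.StringIO``` stands in for an open file so the example does not depend on anything on disk.

```python
import csv
import io

# Made-up data; io.StringIO behaves like an open text file.
csv_text = "name,age\nAda,36\nGrace,45\n"

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)
print(rows[0]["name"], rows[0]["age"])

# No header line: supply the column names yourself via fieldnames.
no_header = "Ada,36\nGrace,45\n"
reader2 = csv.DictReader(io.StringIO(no_header), fieldnames=["name", "age"])
rows2 = list(reader2)
print(rows2[1])
```

Note that every value comes back as a string, so numbers need to be converted explicitly.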

The second option is to read the file with pandas' default importer, ```read_csv```. 

~~~
import pandas as pd

df = pd.read_csv(<path_to_file>) 

~~~


## Excel 

Excel is the popular spreadsheet program from Microsoft. Files can be stored as either .xls or .xlsx. The first is a proprietary binary (bytestream) file format, but the details are handled by pandas. The second was published as an open standard and is in fact a zipped collection of XML files.  

In [None]:
# This is for our DataFrames
import pandas as pd

# This package allows us to parse json
import json

# This function allows us to take a list in json and turn it into a DataFrame
# (pandas 1.0 and later expose it at the top level of the package)
from pandas import json_normalize

# These packages allow us to download data from the web and to quote URLs
import urllib.request
import urllib.parse

# This is a package called "Beautiful Soup" that parses webpages/XML and makes them available as an object
import bs4

# This package stops time so you can work on your assignment
# Just kidding, it allows you to pause a program
import time

# This package allows you to use plt to plot data
import matplotlib.pyplot as plt

# This package allows you to use regular expressions
import re

# This magic command places image output in the workbook
# If you want to save a figure to your computer use plt.savefig(PATH)
%matplotlib inline


## Review from last week

In python we can create data structures that we can use for managing data, filtering it and combining it in new ways. We introduced the **Series** data structure that has properties of both lists (ordered, indexable) and dictionaries (queryable by key). We also introduced the **DataFrame** which is like a collection of Series as columns, with rows as cases. 

Basic file manipulation
-----------------------

File manipulation consists of creating file openers and then working with those openers. The openers take two arguments. The first is the path to the file. If there is no path given beyond the file name, it will assume that the file is in the same directory as the code. 

    fileopener = open("path","r/w/a")

The second argument is either 'r', 'w' or 'a'. These respectively refer to read, write and append. Once you have a file opener you can either read the file or write the file. 

**Reading from a file**

When reading a file, you can either read the entire contents with the read() command:

    entire_file = fileopener.read()
    
You can also read line by line using an iterator. 

    for line in fileopener:
        print(line)

Remember! When you do this, each line will be returned. These lines typically have a new line character at the end of the line. So when you print them, you will print a blank line in between these lines. The way to get rid of that character is to "strip" it from the text:

    for line in fileopener:
        print(line.strip())

This will always start from the cursor. So if you have already read the file, you will either have to re-open it or set the cursor back to the original position. 

    fileopener.read()
    fileopener.seek(0)
    for line in fileopener:
         print(line)

When you are done you should close the file:

    fileopener.close()
    
**Writing to a file**

Writing to a file involves again creating an opener. The command to write to a file is:


    fileopener.write(STRING)
     
When you are done writing to the file it is also a good idea to close the file. Again, it is:

    fileopener.close()

Closing is less important for reading than it is for writing. Why? Because writing to the file doesn't always ensure that the data has physically been written to the hard drive. Closing the file will 'flush' the contents of the file to the hard disk. If you think it's important to periodically flush the data to the hard disk, this is possible, too:

    fileopener.flush()
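
A common way to make sure a file is always closed (and therefore flushed) is the ```with``` statement, which closes the file when the indented block ends, even if an error occurs. A minimal sketch, using a temporary directory and a made-up file name so that nothing important is touched:

```python
import os
import tempfile

# Made-up file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "with_example.txt")

with open(path, "w") as fileopener:
    fileopener.write("first line\n")
    fileopener.write("second line\n")
# The file has been closed (and flushed) by this point.

with open(path) as fileopener:
    lines = [line.strip() for line in fileopener]

print(fileopener.closed)
print(lines)
```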

WARNING! Creating a writable file opener actually creates a pointer to a file on the computer using the name of the file. This is especially important to understand because you run the risk of destroying another file with the same name. This is what is called "clobbering". When you work in a word processor and try to save a file with the same name, the computer will ask you 'file name already exists. are you sure?' or something to that effect. Python will not give such a warning. So please be careful with the file names you give to your file openers.
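
If you would rather have Python refuse to overwrite, one option is the exclusive-creation mode ```'x'```, which raises ```FileExistsError``` when the file is already there. A minimal sketch, with a made-up file name in a temporary directory:

```python
import os
import tempfile

# Made-up file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "precious.txt")

with open(path, "w") as fileout:
    fileout.write("important data")

# Mode 'x' creates a new file but refuses to overwrite an existing one.
try:
    with open(path, "x") as fileout:
        fileout.write("oops")
    clobbered = True
except FileExistsError:
    clobbered = False
    print("File already exists; nothing was overwritten.")

with open(path) as filein:
    contents = filein.read()
print(contents)
```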



In [None]:
txt = "hello world"
FILENAME = "example_clobber.txt"
fileout =  open(FILENAME, "w")
fileout.write(txt)
fileout.close()

filein = open(FILENAME)
print(filein.read())

filein.close()

# Re-opening in 'w' mode silently empties (clobbers) the existing file.
fileout = open(FILENAME, 'w')
fileout.close()

In [None]:

#############################################
# FILE MANIPULATION EXAMPLES 

NAME = "EXAMPLE_FILE"

# Write a file 
fileout = open("%s_testOut.txt" % (NAME),"w")
fileout.write("yes!\n"*10)
fileout.close()

# Read a file 
filein = open("%s_testOut.txt" % (NAME), 'r')
print("Begin printing in full")

print(filein.read())

print("Begin printing line by line")

filein.seek(0) # this resets the cursor to the beginning of the file. 

for count,line in enumerate(filein):
    print(count,":",line.strip())
    
# for line in filein: 
#     print(line)
    
filein.close()
print("***File closed***")

In [None]:
# Append a file 
fileout = open("%s_testOut.txt" % NAME,"a")
fileout.write("No!")
fileout.close()

filein = open("%s_testOut.txt" % NAME,"r")

for count,line in enumerate(filein):
    print(count,":",line)

filein.close()

Writing bytestrings and serializing files
-----------------------------------------

Sometimes you want to store some python objects exactly as the program is using them. This is called serialization in programming parlance. In python, this is called "pickling", as in the way one preserves things by putting them in jars with spices. In the simplest instance, you merely want to call the "dump" or "load" methods. However, with Python 3, this works slightly differently than before. Now instead of writing the text as you read it, you have to write it as a stream of bytes. Go to a text editor and open a .jpeg or a .mp3 and you will see what a bytestring looks like. They are pretty unintelligible to humans but they work just fine for the computer. It is often a good idea to pickle some work if you are on a server and concerned about what would happen if the program failed. 

To create a file opener as a bytestring, you have to append 'b' to the read or write argument

    open(PATH, 'wb') 
    
So the resulting code to open a file for pickling is

    pickle.dump(PYTHON_OBJECT, open(PATH, 'wb'))
    
To read from the file (and typically assign it to a variable):

    new_object = pickle.load(open(PATH, 'rb'))

In [None]:
import pickle
# Write a pickle 
x = ["best","script","ever"]
y = {"comic book guy": "best . script . ever."}

z = (x,y)

pickle.dump(z,open("testPickle.pkl",'wb'))

new_z = pickle.load(open("testPickle.pkl",'rb'))
print(new_z)

# JSON

JSON stands for JavaScript Object Notation. It is a common lightweight format for data exchange. We can see JSON on the web really easily. Just navigate to reddit, as in one of the nicer subreddits, such as www.reddit.com/r/aww, then append .json to the end of the URL. See there! It's basically just lists and dictionaries. You can load JSON data with 

~~~
import json

datastructure = json.loads(THE_DATA)
~~~

Then you can just query it as a series of nested lists and dictionaries. You can also print this data in a nice readable format (called pretty printing) in the following way: 

~~~
print(json.dumps(datastructure, indent=4))
~~~
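
Putting the two calls together, here is a minimal sketch with a made-up JSON string standing in for text downloaded from the web:

```python
import json

# A made-up JSON string, standing in for text downloaded from the web.
raw_json = (
    '{"kind": "Listing", "children": '
    '[{"title": "cute cat", "ups": 10}, {"title": "cute dog", "ups": 7}]}'
)

# json.loads turns the text into nested dictionaries and lists
datastructure = json.loads(raw_json)
first_title = datastructure["children"][0]["title"]
print(first_title)

# json.dumps turns the structure back into text; indent=4 pretty prints it
pretty = json.dumps(datastructure, indent=4)
print(pretty)
```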

There are nicer ways to pretty print. Let's look at json from Reddit pretty printed. Open a browser window and head to:

https://jsonformatter.curiousconcept.com 

In there type: 

http://reddit.com/r/aww.json 

Notice how you can collapse and expand the json file. We will use this to navigate through the file then download it for processing. 

In [None]:
# Here's an example snippet. 
# We are going to download from the aww subreddit, then use json_normalize to stick it right in a DataFrame
# 1.
SUBREDDIT = "aww"
URL = "http://www.reddit.com/r/%s.json" % SUBREDDIT

# URL queries are preceded by a header. Your browser has a header, too. 
# Check: https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending
# Part of that header is the "user agent" string. Mine is: 
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36 OPR/49.0.2725.39
# Reddit expects a string other than the python default. 
# You should alter it to be unique to you. 
# 1/0 # DELETE ME AFTER YOU'VE CHANGED THE HEADER
# 2. 
req = urllib.request.Request( URL, headers={'User-Agent': 'OII class 2018.1/Hogan'})

# IF you make repeated queries to Reddit, you must pause between them for a minimum of three seconds. 
# This snippet (time.sleep(3)) will do that for you, you just have to "import time" 
# time.sleep(4)
# 3. 
infile = urllib.request.urlopen(req)

# Here we read the result and decode it; this is the THE_DATA that json.loads expects. 
# 4. 
redditData = json.loads(infile.read().decode('utf8'))

# Now we just take the JSON and make a table. 
# 5. 
rtable = json_normalize(redditData["data"]["children"])
rtable

In [None]:
# When we have a DataFrame with at least two numerical columns, we can 
# plot it as a scatter plot. 
plt.scatter(rtable['data.ups'],rtable['data.num_comments'])

# We can add a label to the x-axis, the y-axis and the entire plot. 
plt.xlabel('Upvotes')
plt.ylabel('Comments')
plt.title('Reddit upvotes by comments in r/%s' % SUBREDDIT)

plt.savefig("test.png")
plt.show()

XML. Navigating nested mark-up language
------------------------------------------

Both XML and HTML are examples of mark-up **languages** with a DOM tree. That is, the documents follow a hierarchical structure and use mark-up tags to indicate which part of the structure you are in. The tags that denote the structure are in angle brackets. Opening tags have a word (and some options); closing tags have the same word but with a / in front of it. Here is a basic HTML document:

    <html>
        <head>
            <title> 
                This is the title! 
            </title>
        </head>
        <body>
            This is a webpage!
        </body>
    </html>

There are a number of programs that will convert a raw HTML or XML document into a python object that can be navigated. In my opinion, one of the nicest is the package "BeautifulSoup". It takes a bit of getting used to, but it will be of significant help. We can start by downloading an HTML or XML document and then parsing it. The Anaconda package should have beautifulsoup embedded.





In [None]:
# You can set this Wikipage to be any string that has a wikipedia page.
WIKIPAGE = "United Kingdom"

# Here we use urllib.parse.quote to turn spaces and special characters into
# the characters needed for a URL. So for example spaces become %20
URL = "http://en.wikipedia.org/wiki/Special:Export/%s" % urllib.parse.quote(WIKIPAGE)

# View the output here to see how quote strings get formatted. 
# You don't need to remember the codes, just that you typically need to "quote" before
# requesting from a browser. 
EXAMPLE = "Here are some quoted strings:\n!_@_#_$_%_^_&_*_(_)_-_=_+_/_?" 
qEXAMPLE = urllib.parse.quote(EXAMPLE)
print(EXAMPLE,"\n\n",qEXAMPLE)

print(URL,"\n")
# Let's look at this page in an XML browser. Copy and paste it into your browser. 

In [None]:
# Again, don't forget the header! 
req = urllib.request.Request( URL, headers={'User-Agent': 'OII class 2017.1/Hogan'})

# This is the data we receive by opening the URL. But it's not the file, it's just a pointer. 
infile = urllib.request.urlopen(req)

wikitext = infile.read()
# print(wikitext)
# print("#######\n\n\n\n\n######\n")
# print(wikitext.decode('utf8'))

# This does a lot of things. 
# wikitext.decode('utf8') >> decodes the downloaded bytes, assuming the page is UTF-8 rather than plain ASCII
# soup = bs4.BeautifulSoup(TEXT, "lxml") >> this converts the text to a "soup" that can be queried.
#                                           "lxml" selects lxml's lenient HTML parser, which also copes with this XML;
#                                           features="xml" would select lxml's XML parser instead.
soup = bs4.BeautifulSoup(wikitext.decode('utf8'), "lxml")

text_to_parse = soup.mediawiki.page.text

# print(text_to_parse)

You might be wondering how I know that it was:

    mediawiki.page.text
    
As well as:

    mediawiki.page.revision.id

There are two ways to learn this, the hard way and the harder way. The hard way is to just look at the raw XML and fumble around printing through the tree until you find the node that you're looking for. The "harder" way is to navigate through the page using the tree structure that BeautifulSoup builds. The latter way is pretty darn hard without guidance from the document itself. 

In [None]:
for i in soup.children: print(i.name)

In [None]:
for i in soup.mediawiki.children: print(i.name)

In [None]:
for i in soup.mediawiki.page.children: print(i.name)

In [None]:
print(soup.mediawiki.page.text)

Part 2. Basic Text Scraping
---------------------------

Basic text scraping is the practice of taking some data and cleaning it in such a way that it can be used by other programs. Below are a series of exercises designed to help you understand the fundamentals of text processing. In particular, we will focus on the process of handling whitespace. This will involve using several additional files that should be uploaded to your workspace. 

1. Cleaning up by line breaks. 
2. Splitting text by space. 
3. Finding specific words and characters. 
4. Converting from one character set to another.



**Part 2.1 - Stripping characters.**

Below we will take a file, read it, print it and then get rid of the return characters. Please pay attention to the line-breaks when the file is being printed. 

In [None]:
with open("example_lines.txt") as file:
    for i in file:
        print(i)

**Cleaning up the lines**
Did you notice that there is a blank line in between each of the lines? This is because we printed:

    "Testing Line 1\n"
    
And this is because when python reads the file it does it line by line. It splits the file at the new line character but keeps that character in the string when it returns the string. To get rid of these new line characters we would **strip()** the whitespace from the ends of the string.
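
A quick sketch of the stripping methods on a single made-up line. ```strip()``` removes whitespace from both ends, while ```rstrip``` and ```lstrip``` work on one end only; ```repr``` is used so the invisible characters show up when printed.

```python
line = "  Testing Line 1\n"   # made-up line with leading spaces and a newline

both_ends = line.strip()          # all whitespace, both ends
newline_only = line.rstrip("\n")  # only the trailing newline
left_only = line.lstrip()         # only the leading whitespace

print(repr(both_ends))
print(repr(newline_only))
print(repr(left_only))
```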

In [None]:
with open("example_lines.txt") as file:
    for i in file:
        print(i.strip())

In [None]:
words = []
with open("example_lines.txt") as file:
    for i in file:
        words += i.split()
        
# print(words)

wordseries = pd.Series(words)
display(wordseries.value_counts())
for i in list(wordseries.value_counts()[wordseries.value_counts() > 1].index):
    print(i)

**Word frequency**

So, as we can see above, we have all sorts of issues with words. The word 'line' appears in both upper and lower case, sometimes the text uses numbers, and sometimes it has periods in there. We can do all sorts of things to these data to clean them. 
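
One possible way to sketch such cleaning, which is a different technique from the character-by-character approach used in this notebook, is to lower-case the text, keep only runs of letters with a regular expression, and count them with ```collections.Counter```. The sample text is made up and stands in for the contents of a file.

```python
import re
from collections import Counter

# Made-up sample lines standing in for the contents of a text file.
sample = "Testing Line 1.\ntesting line 2\nLine three, testing!\n"

# Lower-case everything, then keep only runs of letters as words.
# Digits and punctuation are dropped entirely by this pattern.
words = re.findall(r"[a-z]+", sample.lower())
counts = Counter(words)

print(words)
print(counts.most_common(2))
```

This is stricter than trimming one character from each end: a token such as "2edgy4me" would be split into separate letter runs, so whether it suits your data is a judgment call.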

In [None]:
word_dict = {}
with open("example_lines.txt") as file:
    for i in file:
        words = i.split()
        for j in words:
            j = j.lower() # all words are now lower case

            try: 
                if not j[-1].isalpha(): j = j[:-1] # non-alpha suffix
                if not j[0].isalpha(): j = j[1:] # non-alpha prefix
                if len(j) <= 1: continue # empty strings

            except IndexError:
                    continue
                
            # Once cleaned, we can then add the words to a dictionary 
            # The word will be the 'key' and the frequency will be the 'value'
            if j in word_dict: word_dict[j] += 1
            else: word_dict[j] = 1

print(pd.Series(word_dict))

In [None]:
data = pd.Series(word_dict,name="Count")
print(data.value_counts())

data.hist()

Part 3. Simple regular expressions
--------------------------------

"Regular expressions" are patterns that describe a whole family of strings sharing a regular form, even if the individual characters differ. For example, when you encounter a URL on a webpage and right click on it, the browser knows that this is a URL and asks "open link in new tab". It does not need to know every URL, just what URLs are supposed to look like (for instance, that they start with "http://" or "https://"). 
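
As a rough sketch of the URL idea (real URL matching is more involved than this), the pattern below looks for "http://" or "https://" followed by characters that are neither whitespace nor a comma. The sample text is made up.

```python
import re

# Made-up text for illustration.
page_text = "See https://www.python.org and http://example.com/docs for details, not ftp://old.site."

# s? makes the "s" optional; [^\s,]+ means one or more characters
# that are neither whitespace nor a comma.
url_pattern = re.compile(r"https?://[^\s,]+")
found_urls = url_pattern.findall(page_text)
print(found_urls)
```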

In [None]:
import re 

example_text = "1234\t3333\t10000\t1,500,442\t3.14"
print(example_text)

In [None]:
# Find each individual digit
reg01 = re.compile("[0-9]")
print(reg01.findall(example_text))

In [None]:
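# Look for the word "walk" followed by zero or more whitespace characters
# (there is no "walk" in this text, so nothing is found)
reg01 = re.compile(r"walk[\s]*")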
print(reg01.findall(example_text))

In [None]:
# Hmm...it seems * matches 0 or more instances 
reg01 = re.compile("[0-9]+")
print(reg01.findall(example_text))

In [None]:
# Let's deal with those commas
reg01 = re.compile(r"[\d,.]+")
print(reg01.findall(example_text))

As you can see from the above examples, working with regexes involves compiling a 'regular expression' and then applying that to text. Obviously, we could have just split on the tab character in this particular instance, but it's the logic of specifying regexes that's important, such as saying "all digits" or "one or more digits or commas." After a character or character set in the regular expression we can ask for zero or more (```*```), one or more (```+```), zero or one (```?```) or a predetermined number (```{n}```) of characters. The characters can be in a range, such as 0-9 or a-z. But we can also use escape codes for the characters, such as ```\d``` for a digit or ```\s``` for whitespace. See below for examples of regexes with text.  
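
The quantifiers can be compared side by side on a small made-up string. Each pattern asks for a lower-case letter followed by a different number of digits.

```python
import re

sample = "a1 b22 c333 d"   # made-up text

zero_or_more = re.findall(r"[a-z][0-9]*", sample)    # * : zero or more digits
one_or_more = re.findall(r"[a-z][0-9]+", sample)     # + : one or more digits
zero_or_one = re.findall(r"[a-z][0-9]?", sample)     # ? : zero or one digit
exactly_two = re.findall(r"[a-z][0-9]{2}", sample)   # {2} : exactly two digits

print(zero_or_more)
print(one_or_more)
print(zero_or_one)
print(exactly_two)
```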

Also, as a hint, don't forget that you can use the **help()** command, for example ```help(re)```. 

In [None]:
new_string = example_text.replace(",","")
print("Old string: ", example_text)
print("New string: ", new_string)

In [None]:
example_text = "😱 hello fellow kids, ur 2edgy4me 😝😝😝; you're like 3edgy5me. " + \
                "Yh, I replaced the 2 with a 3.\nI'm that edgy 😈."
print(example_text)

In [None]:
reg02 = re.compile("[a-z]+")
print(reg02.findall(example_text))

In [None]:
reg02 = re.compile(r"\S+")
print(reg02.findall(example_text))

In [None]:
reg02 = re.compile("[😱😝😈🎅🏾🙈]")
emojilist = reg02.findall(example_text)
print(emojilist)

emojiset = set(emojilist)
print(emojiset)

print(pd.Series(emojilist).value_counts())

# Returning to the XML above with regex

In [None]:
print(soup.mediawiki.page.text)

Now that we have the text we can look through it for regularly formatted text. This is ideal for Wikipedia since it is a wiki. Wikis use regularly formatted text for all of their features, and Wikipedians are keen to make sure that the page is formatted properly. It should come as no surprise that MediaWiki itself uses a ton of regular expressions to parse the wiki text in the first place. 
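
As a small sketch of what such a pattern can do, the example below pulls the link targets out of wiki-style ```[[target]]``` and ```[[target|label]]``` links. The sample wikitext is made up, and the pattern is an illustrative assumption rather than the one MediaWiki itself uses.

```python
import re

# Made-up wikitext for illustration.
wikitext_sample = (
    "The [[United Kingdom]] includes [[England]], "
    "[[Scotland|the Scots]] and [[England]]."
)

# Capture the link target; an optional |label part is matched but not captured.
link_pattern = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
targets = link_pattern.findall(wikitext_sample)
unique_targets = sorted(set(targets))

print(targets)
print(unique_targets)
```

Because the pattern has a single capture group, ```findall``` returns just the captured targets rather than the full bracketed links.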

In [None]:
# Assumes you have done the above code
re_inner_links = re.compile(r'\[\[.*?\]\]')
re_outer_links = re.compile(r'https?://[\w\./?&=%]*')
inner_links = re_inner_links.findall(text_to_parse)
# outer_links = re_outer_links.findall(text_to_parse)
# print(inner_links)
# print("The program found %s wikilinks, of which %s are unique." % (len(inner_links),len(set(inner_links))))
print(pd.Series(inner_links).value_counts()[pd.Series(inner_links).value_counts() > 1])

# print(inner_links)