# Phase II – Parsing Your Documents to Add Useful Structures to Your Database  


_Once you have collected all your documents together as described in Phase I, you can begin to search for additional structure within them._

## 1. Finding Meaning and Structure in your Documents

Whenever humans read documents, we are reading for meaning. We read a news article headline and have an idea of what it means. We want to know more, so we read deeper into the article looking for more details and further explanation. Or consider the moment it takes for an email to load after we have read its subject line and clicked on it: we are already wondering what our colleague would like from us, or what news they have to share pertaining to the subject line.

So as we read, we are usually *looking* for information that will answer some question or offer detail to some incomplete model in our heads. And we have an idea of *where in the document* we can find that information. We might skip the sports section of the newspaper and focus on world news to find out about the latest brinkmanship on the Korean peninsula. Or maybe we read the abstract of a research article, then go straight to the discussion section to learn more about the findings and their implications, and then go back to the methods section to ensure proper procedures were followed.

The same goes when it is our job to read through thousands of documents. But when reading through thousands (even millions) of documents of a similar kind, we are also especially motivated to learn to read them in a more time-efficient way. That's where the computers come in.


#### Overview of Phase II

Whatever documents you are preparing for computational text analysis, you are likely to find similarities across them – 'structural' similarities that allow you to skip some sections, focus on others, or even better: interpret the meaning of content words and phrases based on where they appear in the document. 

In Phase II, you will learn to find reliable structures in your documents (like subheadings), along with the meanings they imply about the content found within them. Then, you will learn to program a computer to find those reliable (i.e. repeated across documents) structures and meanings, and store them for later. Once you've done so, and after Phase III, researchers and the public will be able to use a computer to quickly scan through your thousands of documents to find just the sort of information they are interested in learning more about (while ignoring all the rest)... and even to analyze that information using exciting computer-assisted approaches that allow researchers to ask and answer questions they only dreamed of answering a generation ago.

We will walk you through all of this step-by-step. Our goal is for you to experience each step as something that "Of course, makes perfect sense." As your learning accumulates, you will find that programming a computer is actually pretty do-able for you. You already have the necessary basic skills. You know how to make detailed task lists. And you know how to write things down carefully, if you try. You've read (and maybe even drawn) a flow chart before. Everything else is just translating into a foreign language. And that is not hard exactly. It just requires one to set aside a few hours a few times over a few days. 

We recommend doing this Phase II tutorial over the course of two or three evenings. But however you come to it, know that all of your efforts will be immensely appreciated by researchers and analysts, many of whom dedicate their lives to helping humans better understand and learn from human history. This will be fun! And it could even make the world a better place! 

### 1.A. Different Meanings at Different Scales

The meaning we can extract from thousands of documents is often different than the sort of meaning we are interested in if we are looking at just one or a few documents at a time.

##### Practice:
We're going to jump right in! Take a look at the documents below: 

         
            
            ----
            
            To: Jake Ryland Williams
                  Professor
                  Drexel University
            From: Nick Adams
            Date: 12-05-2017
            
            Hello good sir, 
            
            I am very pleased with the progress on the project. I am learning a great deal from your code and am happy to pass along the knowledge you've shared to others. 
            
            Warmly, 
            
            Nick
            
            
            ----
            
            To: Nick Adams
                  Founder & Director
                  GoodlyLabs
            From: Jake Ryland Williams
            Date: 12-12-2017
            In response to: your correspondence of 12-05-2017
            
            Good day friend,
            
            I am so glad it is working out. We should all meet up before the end of the year to celebrate! 
            
            Best regards,
            
            Jake
                     

##### Practice:
When you read these documents, what sorts of things do you want to know? What information could seem valuable?

Type your answer here:

Maybe you responded that you want to know who the people are, or what project they are discussing. Those are fine things to be curious about. But as an exercise: let's imagine how your curiosity would shift if you knew, first of all, that you could never access information that doesn't appear in the documents, and second, that you have tens of thousands of similarly formatted documents to analyze – a massive amount of correspondence among many people over a longer period of time.


##### Practice: 
Go ahead and look again at the documents with that in mind. What sort of questions do you want to ask?

Type your answer here:

#### Did your curiosity shift?

First, let the overwhelm recede. You won't literally have to read tens of thousands of these documents. The computer is going to help you with that (once you learn to tell it what kind of help you need). 

If you look again and imagine so many similarly constructed documents, you have to wonder: what kinds of things does each person correspond about? Or maybe: who corresponds with whom and how often? What do those networks of correspondence look like? Are there patterns in the timing of messages?

The whole scope of the questions changes. The range of what we can ask and answer changes. And that is the beauty of big textual data. Whereas before, we could only manage to be curious about details; now we can wonder about larger patterns, networks of connection, interactions through time, how ideas flow through a network of people communicating via language. 


##### Practice: 
What specific information could you gather so that you could ask and answer those questions?

Answer key:

1. Name of sender.  
2. Name of recipient.  
3. Date of message.  
4. Whether or not the "In response to:" field exists. The data following the "In response to:" field if it exists.  
5. Body of the message.  
6. Size of the message. 

### 1.B. The Value of Structure

Earlier, we discussed the example of people reading the daily newspaper. No one reads cover to cover, every word. Instead, we look for the sections we find interesting, and read those. Newspapers are designed to allow readers this capability. They are highly structured. Every day, the sections are the same. Not only that, the dates of all their articles, and the names of all those articles' authors, are found in the same place for each article each time. Newspapers are very carefully organized, very structured. That structure is valuable to the human newsreader. But it is even more valuable when a human wants to enlist the help of a computer to read thousands of documents. 

##### Practice:
Take a look again at the correspondence documents (again imagining that you want to analyze thousands of them). Are there reliable structures from document to document? Can you list some?


Type your answer here:

Answer Key:

1.  Four hyphens indicate the start of a message.  
2.  There is a line that begins with "To:". Data that follows the "To:" field identifies the recipient.  
3.  The line following the "To:" field begins with "From:". Data that follows the "From:" field is the name of the sender.  
4.  The next line begins with "Date:". What follows this field indicates the date when the message was sent/received. 
5.  The next line may or may not begin with "In response to:" If the "In response to:" field exists, then the email is a response to a previous email. Otherwise, the email is the first of this chain. 
6.  The lines following the "In response to:" field (or the "Date:" field in case the "In response to:" field does not exist) form the body of the message.
7.  The first non-empty line of the body is a salutation.  
8.  The non-empty line just before the last line is a farewell greeting.
9.  The last non-empty line in the message is a signature, i.e. a name the sender uses to identify him/herself.  

Maybe you noticed that each correspondence starts with four hyphens "----" and then has a "To:" field where a recipient's name appears, and a "From:" field where the sender's name appears. There is also a field for "Date:" and, at least sometimes, one for "In response to:".

Now, let's imagine: if we could program a computer to keep track of all the information appearing alongside these structures, we could easily recover entire chains of correspondence among the people sending and receiving correspondence in some organization. For instance, we could ask the computer to display all the email chains between Nick and Johannes that were more than two emails long (i.e. more than just a message with a single reply). 

### Even more structure: 

You might have noticed even more structure in the documents. Each correspondence begins with an opening salutation and ends with a closing salutation. The text in between could be called the 'body' of the correspondence. And these different structures are separated by line breaks. It is possible, therefore, that we could separate out these portions of the text. Someone else might even go further, noting that we could break the text down into individual sentences. In fact, each word, each letter is a sort of linguistic structure. The question emerges: where should we stop? When we're seeking to identify or extract structure from our documents, how granular should we get?

There is not a correct answer for all cases. One can use the analogy of building a flight of stairs. There are hard constraints like the distance the stairs must cover in horizontal and vertical directions. We might liken these hard constraints to the sub-headings, table of contents, and other higher-level structures in our documents (e.g., if there are no sub-headings, we cannot use sub-headings as a landing for our staircase). And builders must also consider what size step is comfortable for a human who will be walking up and down the stairs. In our case, we need to think about the comfort of researchers. What size of structure is comfortable or useful for them? We don't want them to have to take 60 tiny steps or 15 huge steps, when they could have taken a comfortable 30.

### 1.C. How much structure is enough? Some guidelines:

1. Take what is given easily and reliably (as long as it is relevant). Our example documents have a very clearly demarcated 'date' field, and researchers will almost always be interested in the dates of correspondence. But not everything that is very easy and reliable to extract is interesting. For example, if the documents also included some corporate logo and tagline at the bottom of each message, we would probably want to just ignore them.
2. Identify/extract structures that nearly all researchers will want, not what any researcher *could* want. All researchers investigating correspondence will want to know who is writing what to whom and when. But it is only the rare researcher who will want to study something like opening and closing salutations. We want to identify/extract structures that we can imagine a broad set of researchers (or the public) using in a query. People will want to search for correspondence by date, recipient, and sender. They are unlikely to want to search for correspondence based on whether it begins with 'Hi' or 'Hello' or 'Dear Friend.'
3. Only identify/extract those structures that do not require judgment calls. Many researchers will be quite interested in the topics of conversation across all of the correspondence. In fact, that might be the primary reason they are looking into a digital archive. But identifying such topics of conversation requires making decisions a researcher will want to make for herself based on the latest techniques and norms of her particular field. From programming a computer to use a particular algorithm, to training human research assistants to make judgments about which words and phrases constitute which topics, there are many, many approaches to topic classification for researchers to choose from. This is a lively field of methodological development in the social sciences. While a few researchers might greatly appreciate your specific efforts to identify latent semantic patterns (like conversational topics) across a digital archive, most will want to take their own approach to this work. The greatest gift you can offer researchers is a database that is ready for research, but does not include any data produced via methodological approaches or assumptions they will need to defend to a panel of editors or reviewers.     

In our present case, these guidelines suggest that we should identify/extract the very clearly demarcated fields of correspondence (date, to, from, etc.), and an expansively defined 'body' of the correspondence, to include the opening and closing salutations. If some researcher, at a later date, wants to investigate the causes or effects of such salutations, s/he will still be able to do so, since we won't throw away that information. But we can rest easy knowing that we've identified everything we can easily identify without over-reaching into the territory of making judgment calls that researchers would be forced to accept (and to defend to their colleagues) when they want to use our data.


### 1.D. Structure and Meaning Together

At this point, we've identified some reliable structures across our documents, and we have a sense of the meaningful information appearing alongside those structures. The word "To:" is a reliable structure, and we know that the words immediately following on that same line will indicate the name of a person receiving the correspondence. 


##### Practice: 
For the sake of clarity, go ahead and write out all the reliable structures you see in the document and the sorts of meaningful information you expect to find near those structures.

Type your answer here:

To: --> the name of a person receiving the correspondence.

From: --> the name of a person sending the correspondence.

Date: --> the date when the correspondence was sent.


## 2. Containers: Storing your Data as you Collect it

Some readers might have noticed that our description of our task has shifted from merely 'identifying' structures and meaning in our documents to 'identifying and extracting' structures and meaning. If we are going to extract structure and meaning, the next question becomes: where are we going to put it?

As in physical reality, computers have a range of containers we can choose from. And, as in physical reality, different containers are built for different purposes. We wouldn't pour milk into a cardboard box, or corn flakes into a gallon jug. Here, too, we need to put some thought into the containers we will need to hold all of the structure and meaning we want to extract from our documents. In fact, even before we start extracting data, it is wise to have our containers ready.

Computers are built to store and perform various operations on information. But information comes in different forms. In our case, we are mostly dealing with words, which are a type of data computer scientists call 'strings.' Words are sequences (hence strings) of characters (letters, spaces, punctuation) that are not at all meaningful to a computer (which speaks the language of mathematics), but that must be stored in their particular sequence. 

Compare this type of data to 'numeric' data. Numeric data, numbers, have all sorts of fun and interesting properties as described by mathematics. You can multiply, add and subtract them, and more. When storing 'numeric' data, computers are sure to keep track of the fact that numbers behave in all the ways we learned about in our high school math classes. In the case of strings, however, a computer will not even try to multiply, add, or subtract the data. 'The quick brown fox' $*$ 'jumped'/ 'over the lazy dog' does not *mean* anything in the way that 3 $*$ 4/6 = 2. So, a computer puts 'strings' in a special type of container made just for strings. 

We can imagine that string containers have a label on them reading "Don't even try to do math on the contents of this container." Other containers hold numeric data on which a computer can do math. And others still hold categorical data, where each individual datum is one of a few categories. Such data are fit for some statistical analyses or visualizations, but not others. So, it will have its own container with its own labels telling the computer which operations are even appropriate to try. 

Since, at some point, we will want to shuttle around and re-organize the data we extract from our documents, it's important to store them in proper containers. 

##### Practice: 
The exercises in section 1 guided you through identifying structures and meaning in documents of correspondence. Review your work. What kinds of data are we collecting? Mostly we are collecting strings, right? Are there any exceptions?


Type your answer here:

Answer key: We are collecting strings. A possible exception could be the "Date:" field.   
The date could be stored as a string, but it could also be a special data type. This is because the date can only take on certain values.
Another exception could be a composite object, e.g. the sender's name and date could be combined to form a tuple.  



Now let's create some containers so we can store our data as we are extracting it. You will want to name each container by the distinct, meaningful information it contains. For example, as we identify/extract information from the "To:" field, we will want to put it in a container called 'recipient.' Calling the container 'To' might be an okay choice, though someone looking at it later might have trouble interpreting what is inside. Calling it 'Name' would be imprecise, and fail to distinguish the information within from information we will be collecting from the "From:" field, which also contains names. 

##### Practice:
Go ahead and create short and informative names for all your containers, and then list a data type for each.  

Type your answer here:

Now, we're going to take our first step with the computer. We're going to ask it to create some containers with the names we want and the properties we need to hold the proper data type for each. This is rather simple to do in Python. We just supply the name of our container and use an = to let Python know we are creating something new. Then, by using "quotation marks", [brackets], {braces}, or other delimiters we signal to Python what kind of data are in the container. Here is an example to get you started.

In [None]:
recipient = ""

Note that we didn't add any information within the quotation marks. What we have now is an empty container that is ready to accept textual string data. Like an empty lunch box with your name on it, the container is ready when you are. 
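To make the role of delimiters a little more concrete, here is a small sketch of how Python reports the kind of container it has created. The names `example_text`, `example_list`, and `example_dict` are placeholders invented for this illustration; they are not containers we will use for the correspondence project.

```python
# Hypothetical placeholder names, used only for this illustration.
example_text = ""    # quotation marks -> an empty string (str)
example_list = []    # brackets -> an empty list
example_dict = {}    # braces -> an empty dictionary (dict)

# type() reports which kind of container Python created.
print(type(example_text).__name__)
print(type(example_list).__name__)
print(type(example_dict).__name__)

# Numbers support ordinary arithmetic.
number_result = 3 * 4 / 6

# Multiplying one string by another is refused with a TypeError.
try:
    "The quick brown fox" * "jumped"
    string_math_worked = True
except TypeError:
    string_math_worked = False

print(number_result, string_math_worked)
```

One caveat worth knowing: Python does give `+` a meaning for strings, but it joins them end to end rather than doing arithmetic on them.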

##### Practice:
Now, go ahead and create some other containers for this project.

Type your answer here:

Answer key:

    sender = ""
    date = ""
    subject = ""
    body = ""

Don't worry if you needed to do an online search to confirm how to do this. People who have been programming for years in a specific computer language still need to look up syntax when they are writing new code. The most important skills you can acquire for programming computers have little to do with knowing precisely how to tell a computer what to do. They are:
1. being able to clearly specify (in your native language) what you need the computer to do. We call this pseudo-coding. And,
2. having the persistence to use Google and StackOverflow to figure out how to translate your pseudo-code into a language the computer understands.

The former depends entirely on your ability to clearly understand and communicate the task you are trying to accomplish. This notebook is designed to explain all of that. The latter is somewhat a skill, and is somewhat affected by your experience with a given computer language. But it is substantially a disposition. With sufficient patience and/or grit, anyone can translate pseudo-code into working code. We're going to try to make this easy for you, but just know: whether it takes you 10 hours or 200 hours to write a script that readies your archive for computational research, by going through this process you will be saving researchers and the public *thousands* of hours of work, while greatly increasing the value and utility of your archive. 
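Returning to the earlier answer key's remark that the date could be stored as a special data type, here is one possible sketch using Python's standard `datetime` module. It assumes the dates in our sample documents are written month-day-year (the documents themselves do not say), and the names `sent_on`, `reply_on`, and `sender_and_date` are invented for this illustration.

```python
from datetime import datetime

# Assumption: dates are written month-day-year, e.g. "12-05-2017".
date_text = "12-05-2017"
sent_on = datetime.strptime(date_text, "%m-%d-%Y").date()

# A real date can be compared with other dates, which a plain string cannot do reliably.
reply_on = datetime.strptime("12-12-2017", "%m-%d-%Y").date()
days_between = (reply_on - sent_on).days

# A tuple groups related values into one composite object.
sender_and_date = ("Nick Adams", sent_on)

print(sent_on, days_between, sender_and_date[0])
```

Storing the date as a string is still a perfectly reasonable first step; a conversion like this one can always be applied later, once the text has been collected.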

## 3. Collecting Data from your Documents

Now, we are getting to the fun part! In this section, we will learn how to extract structured information from our documents and send it into our containers for later analysis. Creating an effective parsing script requires knowledge of several different computer operations and how to combine them effectively. None of them are difficult to understand on their own, but many people become overwhelmed when facing the entire process. So, we will be walking you through each element – explaining the task to be accomplished, and how to specify it as pseudo-code. Then, in subsequent sections, we will show you how to bring the elements together into an efficient, well-functioning parsing script so a computer can understand it. 

### 3.A. Extracting Meaning Based on Structure: The Concept of 'Document Region'
 
Let's look again at our documents and remember what we've learned about structure and meaning. The name 'Jake Ryland Williams' is interesting/meaningful to any researcher looking at these documents. But note that the meaning of that name differs for each document. In the first, it is the name of a recipient. In the second, it is the name of the sender. And we know that *because* the name is adjacent to a different unit of structure in each document. 

We have discussed various sorts of structure in documents. Earlier we mentioned that punctuation and parts of speech can be treated as structure, along with the four hyphens that mark the beginnings of documents. To add some precision to our notion of structure, we are going to refer to different portions of a document as 'regions.' Our documents have a region demarcated by 'To:' and a region that we could call 'the body' of the correspondence, among others. 

Understanding the regions of your documents is important because – as in the example immediately above – the very same content (i.e. 'Jake Ryland Williams') has very different meaning depending on the region in which it is found.


         
         
            
            ----
            
            To: Jake Ryland Williams
                  Professor
                  Drexel University
            From: Nick Adams
            Date: 12-05-2017
            
            Hello good sir, 
            
            I am very pleased with the progress on the project. I am learning a great deal from your code and am happy to pass along the knowledge you've shared to others. 
            
            Warmly, 
            
            Nick
            
            
            ----
            
            To: Nick Adams
                  Founder & Director
                  GoodlyLabs
            From: Jake Ryland Williams
            Date: 12-12-2017
            In response to: your correspondence of 12-05-2017
            
            Good day friend,
            
            I am so glad it is working out. We should all meet up before the end of the year to celebrate! 
            
            Best regards,
            
            Jake
            

##### Practice: 
As a warm-up, let's write down all the regions of the documents above, ignoring other structures.

Type your answer here:

Answer key:

1.  Recipient region: The part of the document that follows the "To:" field.
2.  Sender region: The data that follows the "From:" field.
3.  Date region: The information found following the "Date:" field.
4.  Subject region: This region may or may not be present. If present, it is the part of the document that follows the "In response to:" field.
5.  Body region: The part of the document that follows the subject region if the subject region exists. Otherwise, it is the part of the document that follows the Date region.
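Regions live inside a single message, so a computer first needs to know where one message ends and the next begins. As noted earlier, four hyphens mark the start of each message. The sketch below is one possible way to use that marker, assuming all the messages sit in one block of text; the sample text is an abbreviated version of our two documents, and the names `archive_text`, `messages`, and `current_lines` are invented here. It uses a loop, a tool for repetitive work that this tutorial has not yet explained, so treat it as a preview.

```python
# Assumption: every message starts with a line containing only four hyphens.
archive_text = """
----

To: Jake Ryland Williams
From: Nick Adams
Date: 12-05-2017

Hello good sir,

----

To: Nick Adams
From: Jake Ryland Williams
Date: 12-12-2017
In response to: your correspondence of 12-05-2017

Good day friend,
"""

messages = []
current_lines = []
for line in archive_text.splitlines():
    if line.strip() == "----":
        # A new message starts here; store the previous one if it has content.
        if any(text.strip() for text in current_lines):
            messages.append("\n".join(current_lines).strip())
        current_lines = []
    else:
        current_lines.append(line)

# Store the final message, which has no marker after it.
if any(text.strip() for text in current_lines):
    messages.append("\n".join(current_lines).strip())

print(len(messages))
print(messages[1].splitlines()[0])
```

Working line by line, rather than cutting the text wherever four hyphens happen to appear, keeps a row of hyphens inside a sentence from being mistaken for the start of a new message.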

### 3.B. Extracting and Storing Text Strings

We now know the regions of our documents, and which kind of information is found in each region. 
And we have already built containers to hold each kind of information. So, we just need to write out some tasks for the computer. 

The fundamental task we'll be asking the computer to perform involves grabbing/extracting information (i.e. a text string) and sending it to containers. But, we need a way of specifying the conditions that must be met before the computer does that grabbing and sending. Fortunately, as we will discuss further below, it is rather easy to tell a computer to do work under some particular conditions. We just use an if/then statement. Here is an example:

##### Practice:
Using the names you have already developed for your regions and containers, pseudo-code your own if/then statements that will grab the text from each region and send it to its appropriate container. Here is some pseudo-code as an example:

In [None]:
if you see "From:",
  then collect all words following From:
  and send them to the 'sender' container.
          

Type your answer here:

In [None]:
Answer key:

if you see "From:",
  then collect all words following From:
  and send them to the 'sender' container.

or else, if you see "To:",
  then collect all words following To: + the next two lines
  and send them to the 'recipient' container.

or else, if you see "Date:",
  then collect the data following "Date:"
  and send it to the 'date' container.

or else, if you see "In response to:",
  then collect the data following "In response to:"
  and send it to the 'subject' container.

or else, if you see anything other than the above,
  collect the data
  and send it to the 'body' container.  
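For the curious, here is one way the pseudo-code in the answer key might be translated into Python. It is a sketch under several assumptions, not the finished parsing script: the containers are gathered into a single dictionary keyed by container name, the function name `parse_message` is invented for this illustration, the four-hyphen marker line is skipped rather than sent to the body, and the loop it relies on is a tool for repetitive work that this tutorial has not yet explained.

```python
def parse_message(text):
    """Sort the lines of one message into named containers (a sketch)."""
    containers = {"recipient": "", "sender": "", "date": "", "subject": "", "body": ""}
    lines = [line.strip() for line in text.strip().splitlines()]
    position = 0
    while position < len(lines):
        line = lines[position]
        if line.startswith("From:"):
            containers["sender"] = line[len("From:"):].strip()
        elif line.startswith("To:"):
            # The name, plus the next two lines (title and institution).
            name = line[len("To:"):].strip()
            containers["recipient"] = ", ".join([name] + lines[position + 1:position + 3])
            position += 2  # skip the two lines just collected
        elif line.startswith("Date:"):
            containers["date"] = line[len("Date:"):].strip()
        elif line.startswith("In response to:"):
            containers["subject"] = line[len("In response to:"):].strip()
        elif line != "----":
            # Anything else is body text; the marker line is skipped (an assumption).
            containers["body"] += line + "\n"
        position += 1
    containers["body"] = containers["body"].strip()
    return containers


sample = """
----

To: Jake Ryland Williams
      Professor
      Drexel University
From: Nick Adams
Date: 12-05-2017

Hello good sir,

I am very pleased with the progress on the project.

Warmly,

Nick
"""

parsed = parse_message(sample)
print(parsed["sender"])
print(parsed["recipient"])
```

Notice how closely each `if` and `elif` line mirrors an "if you see" line of the pseudo-code. A sketch this simple would be fooled by a body sentence that happens to begin with "From:", which is one reason careful pseudo-coding matters before any code is written.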

At this point, you have a good idea about what it takes to extract and store meaningful text strings from a single document. But to get the computer to extract information from all your (potentially thousands of) documents, you will need to know a bit more about how the computer reads, and works, and how you can tell it to do repetitive tasks... And that brings us to our next section!   

## 4. Reading Like a Machine, Working like a Machine


When we humans read, we're doing a lot of very complex and impressive processing. As our eyes move left to right (in English) perceiving and processing words, we have some notion of what each word means, but we're also updating that notion as we read additional words. For instance, we don't know whether the word "building" refers to a creative action or a large physical object unless we understand its role in a sentence. We don't understand a clause of a sentence without understanding other clauses. But we humans have impressive processors, and we're quite capable of delaying that complete and stable understanding while we're reading to the end of a long sentence, or even a paragraph, or chapter. We essentially hold multiple hypotheses about a sentence's meaning in our minds until the words come together to resolve any ambiguities and give us a clearer picture. 

For example, when we hear, "a man walked into a bar..." we imagine a guy strolling through a door into a dimly lit space where someone wearing an apron is standing between a very long, flat table and a shelf stocked with bottles of alcohol ready to be mixed into cocktails. But when we hear the end of the sentence – "...cut his head open, and had to get stitches." – we immediately revise our imagination of the sentence's meaning, and visualize a construction worker on a building site banging his head on a steel bar. 

Computers do not read this way. They are engineered to read mathematical and logical statements expressed in "abstract languages," which unlike our human languages are far less flexible and context dependent. In fact, one of the defining features of an abstract language is that each term can have only one meaning. So, confusion about walking into a "bar" could never happen in an abstract language.

When computers *do* "read" our human languages – often called "natural" languages – they do so very differently from us. Instead of reading one word at a time sequentially, constructing and adjudicating various interpretations of the meaning as they go, they ingest into their memory whatever portion of text we tell them to ingest, and then perform operations on that portion of text. 

Pointing back to our examples above, we could tell the computer to ingest an entire correspondence into its memory, and then look for certain key terms. But mind you, the computer cannot develop any sense of the content of the correspondence. It is not imagining a conversation between two people working on a project together, nor thinking back to what else was going on in December of 2017. It simply holds the 'string' data – a sequence of arbitrary letters and punctuation marks from its standpoint – in its memory. Then, when we tell the computer to look for a particular sequence of characters within the string, e.g. "From:" it will look for that sequence of characters. But, it still has no idea what the word "From" means in this or any other context. 
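Here is a small sketch of what that kind of "reading" looks like in Python. The text in `message` is an abbreviated version of our first sample document, and the variable names are invented for this illustration.

```python
# One message held in memory as a single string (abbreviated sample text).
message = "To: Jake Ryland Williams\nFrom: Nick Adams\nDate: 12-05-2017\n\nHello good sir,"

# Ask whether the character sequence appears at all.
has_from = "From:" in message

# Ask where it starts; find() returns -1 when the sequence is absent.
from_position = message.find("From:")
missing_position = message.find("Subject:")

# The match is purely character by character: a different case does not match.
lowercase_position = message.find("from:")

print(has_from, from_position, missing_position, lowercase_position)
```

The computer can say where "From:" sits in the string, but nothing in this code involves knowing what a sender is. To the machine, "from:" and "From:" are simply different sequences of characters.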

This is a strange way of "reading," to be sure. And there are other exotic ways we can program computers to "read" documents. But weird as they are, these different ways of reading can be very powerful and efficient for some purposes. Once we understand how a computer "reads" we can begin to understand how to use these strange ways of reading to complement our own ways of reading. (For a fuller treatment of machine reading vs. human reading, see Chapter 2 of the upcoming book: Adams, Nicholas. 2018. *Hybrid Text Analysis: Humans and Machines Together*. SAGE Publishing. London.)

Here, and in the remainder of this section, we offer a number of insights into the ways computers read and work. While we intend to deliver these insights through analogies and examples any beginner can understand, we will also introduce you to some computer science terminology you are likely to encounter when you communicate with others (or consult StackOverflow) about this sort of work. Though the terminology may seem obscure at first, by the end of this section, you will be prepared to pull it all together into an integrated set of concepts that will allow you to describe to a computer (and to a sufficiently experienced human) exactly how the machine should read and work on your documents. 

#### "Reading" while Reading/Doing/Flowing

When we enlist the help of a computer, we are actually having it read two different documents in two different ways. It is performing an exotic "reading" of our documents as described above. But it is also reading a programming language script by which we humans are giving it very specific instructions about *how* to "read" our documents. Like an American student using a textbook written in English to decipher ancient Greek, the computer reads its own language fluently, seeking instructions on how to "read" our natural languages, which it experiences as extremely exotic. To disambiguate these two very different notions of "reading," we will use a computer science term to refer to the way the computer follows the set of instructions humans have written out in an abstract programming language: "flow." Henceforth, we will talk about how we can write instructions a computer will *flow* through to help us read thousands of our natural language documents.

#### Perfectly Following Instructions

Computers are amazing workers. Whatever we tell them to do, they do it. They do it the same every time. And if they don't know what to do because our instructions are not clear, they stop. They don't try to do something that might work out poorly. They just stop. 

It's not super easy to write instructions for computers. But it's not super hard either. And unlike our communications with one another through natural language, we can know when our instructions are clearly communicated. That's because – unlike human languages – computer programming languages have such very strict rules, and so much reliable structure, that meaning conveyed by the language can *never* be misinterpreted by the computer. The computer either understands every instruction perfectly and performs accordingly, or it halts, making no interpretation at all. (Imagine how many human misunderstandings would be avoided if we had this same property, if we only acted when we had perfectly accurate understandings of what others meant!!)

Computers' perfect following of perfectly-constructed instructions is incredible. But it also requires us to create perfect instructions. For those of us who have managed employees, taught students, or raised children, we know how hard this can be. But we also know that it is possible. And when we do perfectly specify instructions, and they are perfectly followed, we know how satisfying that can be. So let's get to it.  

### 4.A. Different Instructions Under Different Conditions:  If ____ , Then ____  Statements and "Control Flow"

We demonstrated earlier that our natural language documents can be broken out into different regions *AND* that we want to do different things with the content that appears in those different regions. For example, some content will go to a 'sender' container, and some different content found in a different region will go to a 'recipient' container. If we want to instruct a computer to treat these different sets of content differently, we need a way of telling it to only follow some instructions under certain conditions.

##### Imagine:
Imagine a robot in a kitchen baking a pastry according to a very detailed recipe. It follows the protocol exactly as written every single time. We can imagine our robot baking a dessert, and there is a note near the bottom of the recipe that reads: 

In [None]:
if you are baking this pastry at an elevation greater than 2000 meters above sea level,
    then,
        increase cooking temperature by 5% 
        decrease cooking time by 10%
        increase flour by 3%
        decrease sugar by 2%
        decrease baking powder by 40%
        decrease baking soda by 20%

Obviously we don't want the robot to take any of those steps if it is baking in Florida or some other low-lying place. So, every programming language has been created in a way that allows us to easily direct the flow of the computer through different instructions based on different conditions. The "If, Then Statement" is the most common way of doing this and is found in almost every programming language. It's also pretty easy to understand. *If* some condition is met (e.g. the computer is baking on a mountaintop) *Then* the computer ought to flow through a series of tasks following the 'then' statement. Otherwise (and this literally goes without saying in most computer languages), the computer ought to ignore that series of tasks entirely, and just continue to flow through the rest of the instructions.
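
For readers curious how such a condition looks in a real programming language, here is a minimal Python sketch of the elevation check (Python itself is introduced properly in Section 5). The function name `adjust_temperature`, the constant `HIGH_ELEVATION_M`, and the sample temperature of 180 are illustrative assumptions, not part of any real recipe.

```python
# Hypothetical sketch: names and numbers are for illustration only.
HIGH_ELEVATION_M = 2000  # threshold taken from the recipe note above

def adjust_temperature(temperature, elevation_m):
    """Return the cooking temperature, raised by 5% only at high elevation."""
    if elevation_m > HIGH_ELEVATION_M:
        # The computer flows through this line only when the condition is TRUE.
        temperature = temperature * 1.05
    # Either way, the computer continues with the next instruction.
    return temperature

low = adjust_temperature(180, 380)    # condition is FALSE, so nothing changes
high = adjust_temperature(180, 2500)  # condition is TRUE, so the adjustment applies
print(low, high)
```

Notice that there is no explicit 'otherwise' branch: when the condition is FALSE, the adjustment is simply skipped.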

If we were to flow through the baking instructions like our robot would, we would read them like this:
"Oh look. Here is an 'if' statement. I need to evaluate it to see if it is TRUE or FALSE. If it is TRUE, I will do all the stuff after the 'then' statement. If it is FALSE, I will ignore all of those tasks and look for the next instruction. Here I go: My sensors tell me that my current elevation is 380 meters, which is less than 2000. So, the 'if' statement is FALSE. Therefore, I will ignore each of the steps following the 'then' statement."   

#### Controlling Flow with Indentation

In our baking example above, we have used another element of computer languages that often guides the flow of the computer: indentation. Notice that all the baking tasks to be performed at higher elevation are right-indented compared to the 'if' which begins the if statement. This indentation is meaningful to the computer as it flows through our instructions. Specifically, a computer will never flow through any right-indented code unless it has already flowed through code above it that begins further to the left. One can think of right-indented tasks as lower priority compared to those tasks further to the left. (In our example, the main task is to evaluate an if statement. Then, if the statement is TRUE, the computer is to complete subtasks affecting the baking of the pastry.) 

If one wanted to get a quick sense of what a script (a human-written set of instructions for the computer) is telling a computer to do, she could just read all of the statements that were left-most indented. The Practice exercise under Section 5 below will invite you to interpret a more complicated script. 
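
The role of indentation can be sketched in Python, where indentation is part of the language's grammar. In this hypothetical example (the step names and the `elevation_m` value are made up for illustration), the two indented lines run only if the `if` above them is TRUE, while the left-most lines always run.

```python
# Hypothetical sketch: the step names are illustrative assumptions.
elevation_m = 380

steps = ["mix ingredients"]                     # left-most: always runs
if elevation_m > 2000:                          # left-most: always evaluated
    steps.append("raise temperature by 5%")     # indented: runs only if TRUE
    steps.append("shorten baking time by 10%")  # indented: runs only if TRUE
steps.append("bake")                            # left-most again: always runs

print(steps)
```

Reading only the left-most lines (set up the steps, check the elevation, bake) gives a quick summary of what the script does.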

### 4.B. Your Own Little Robot 

What we have learned so far allows us to write a lot of useful instructions for the computer. But if we put it to work now, we'd find that it halted frequently and still needed too much supervision from us. There are some confusing portions of the instructions we've written so far, and there are tasks important to our overall document parsing goals that we haven't even tried to write instructions about yet. But before we can improve our instructions, we need to understand a bit about our computer's capacities and limitations. In this section, we will introduce you to your own little robot. Step-by-step, we will show you how to instruct your robot to do work for you that would have required hundreds of hours of your effort. 

#### Recall

Recall the baking robot we discussed a moment ago. It had a sensor letting it know its own altitude. Because it had that sensor, we could tell it to evaluate an 'if' statement about its current altitude and then flow through one or another set of instructions based upon its evaluation of that 'if' statement. For our work, most of our important 'if' statements ask the computer to evaluate _if_ it is in a particular document region, and _then_ flow through instructions accordingly. Or they suggest that _if_ it sees particular words, _then_ it should proceed through a particular set of instructions. But what does that mean? What does it mean for the computer to 'be in' a region or 'see' particular words? Lend us your imaginations so we can better understand how to direct the computer.

#### Imagine 

Imagine you have printed out a copy of each of your documents. You have them spread out on your desk. And you have in your hand a tiny miniature version of your computer/robot, about 2 centimeters tall. Now we're going to place your computer/robot on your first document, in the top left corner. And we're going to instruct it to walk the same path that our eyes would traverse if we were reading the document. It's a good computer/robot, so it does as we instruct. It walks along the entire first line from the left all the way to the right; and when it gets to the end of that first line it jumps all the way to the beginning of the second line on the left side of the page (just like our eyes would). Then, it begins walking to the right side of the page again, all the way to the end, where it then jumps to the third line, and so on.

#### Practice
Pause to imagine this little robot. Notice that as it travels through our document, it is walking through all the regions we identified before. Quickly list in order, the regions the computer/robot would travel through as it walks through our example correspondence documents.

Type your answer here:

We've now got our computer/robot inside our documents and given it the ability to move through them. Pretty cool, right? But it gets cooler, because the robot has some other great abilities. It can scan our documents into its memory as it moves through them, *and* it can flow through a set of instructions we give to it. So, it can do work with the text as it is reading it: it can identify structures for us, extract the nearby text, and send that text to our containers. It can do nearly everything we need it to do.

So, we grab our pseudo-code from Section 3.B, modify it slightly, and tell the robot to get to work. Here are the instructions we feed it: 

In [None]:
Move over the line of text, scanning it into your memory, and flow through this list of instructions: 

    if you see "To:",
      then collect all words following To:
      and collect the next two lines of text      
      and then send all the text to the 'recipient' container.

    or else, if you see "From:",
      then collect all words following From:
      and send them to the 'sender' container    

    or else, if you see "Date:",
      then collect the data following "Date:"
      and send it to the 'date' container.

    or else, if you see "In response to:",
      then collect the data following "In response to:"
      and send it to the 'subject' container.

    or else, if you see anything other than the above,
      collect the data
      and send it to the 'body' container. 

    jump to the next line of text in the document, and repeat these instructions in full. 

And it gets to work. There's just one problem. 

#### Our Little Robot is an Amnesiac.

Our computer/robot is very small and short, and it has a very small, short memory. It can't hold on to very much information. In fact, each time it jumps from the end of one line to the beginning of the next, it loses everything in its memory. So it just wakes up fresh on each new line, starts at the top of the list of instructions we have given it, and tries to do what we have asked. This limitation of our robot is not always a problem. If we place it on the "From:" line of one of our correspondence documents, it does pretty well. 

#### Practice: 
Imagine our robot/computer starting at the line in our first correspondence document which begins with "From." It follows our instructions, moving over the line of text, scanning it, and then checking the rest of the instructions we have given it to see if it should do any work. It sees that the second 'if' statement is TRUE, so it does the work specified after the corresponding 'then' statement: it sends the rest of the line's text to the 'sender' container. It checks the other 'if' statements on the list. They are all FALSE. So it flows to the last line of our instructions and performs that instruction. What happens next?

Type your answer here:

Answer Key:

The robot jumps to a new line in our document and begins flowing from the top of our instructions. It checks 'if' statements. The "Date" statement is TRUE so it sends the rest of the line to the 'date' container. Then, it jumps to the next line, does fine; jumps to the next line, does fine, etc. till it gets to the end of the document, encounters ---- and sends that to the 'body' container. Then, it gets to the "To:" line of the 2nd document and does fine except that it has no conception of what to do with "the next two lines" since it only handles one line at a time. Perhaps it halts at the end of that line. Or perhaps, it jumps to the next line and erroneously sends that text to the 'body' container.

#### The Problem 

When our robot gets to the 2nd and 3rd lines of the 'recipient' region, it doesn't know it is still in that region. It doesn't see any of the structures denoting a region or any of the keywords in the if statements, so it follows the instructions meant for text in the 'body' region. Our 'body' container now includes information about recipients' professional titles and organizations. 
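
To make the problem concrete, here is a minimal Python sketch of a robot with no memory between lines (Python is introduced properly in Section 5). The sample lines are an assumption modeled on our example correspondence, and the function name `route_line` is hypothetical. Because each call sees only one line, the second and third recipient lines end up routed to 'body'.

```python
# Hypothetical sketch of the "amnesiac" robot: each line is judged in isolation.
sample_lines = [
    "To: Jake Ryland Williams",
    "Professor",
    "Drexel University",
    "From: Nick Adams",
    "Date: 12-05-2017",
    "Hello good sir,",
]

def route_line(line):
    """Pick a container using only what is visible on this one line."""
    if line.startswith("To:"):
        return "recipient"
    elif line.startswith("From:"):
        return "sender"
    elif line.startswith("Date:"):
        return "date"
    else:
        return "body"  # nothing on the line says otherwise

routes = [(route_line(line), line) for line in sample_lines]
for container, line in routes:
    print(container, "<-", line)
```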

Somehow, we need our little robot to remember where it is in our document when it finds itself on a new line without any orienting structural keywords like "To:" or "From:" or any memory of where it was even a moment ago. Real amnesiacs (people who are debilitatingly forgetful) often write notes for themselves before falling asleep – notes they can read the following morning so they will not be totally disoriented.

We can do the same for our robot. As it pauses at the end of each line to do work, we can also have it write a note to itself about what region it is in. That way, when it wakes up on a new line with no memory of what happened before, it can at least find a note in its pocket telling it where it is. Here's how we'd write our instructions to do this:

In [None]:
for each line of text, move over the line of text, scanning it into your memory, and flow through this list of instructions 

    if you see "To:",
      then collect all words other than "To:" that appear on this line
      and send them to the 'recipient' container
      and write yourself a note that you are in the 'recipient' region
      and throw away any other notes that are in your pocket.

    or else, if you see "From:",
      then collect all words following From:
      and send them to the 'sender' container
      and write yourself a note that you are in the 'sender' region
      and throw away any other notes that are in your pocket (so you don't get confused).

    or else, if you see "Date:",
      then collect the data following "Date:"
      and send it to the 'date' container
      and write yourself a note that you are in the 'date' region
      and throw away any other notes that are in your pocket.

    or else, if you see "In response to:",
      then collect all info following "In response to:"
      and send it to the 'subject' container
      and write yourself a note that you are in the 'subject' region
      and throw away any other notes that are in your pocket.

    #Human note to human: we deal with lines of text that do not include the structures above using the following lines of code.
    
    if you see anything other than the above AND you have a note in your pocket saying you are in the 'recipient' region
      then collect all words that appear on this line
      and send them to the 'recipient' container
    
    or else, if you see anything other than the above AND you do NOT have a note in your pocket saying you are in the 'recipient' region,
      collect the data
      and send it to the 'body' container
      and write yourself a note that you are in the 'body' region
      and throw away any other notes that are in your pocket.

    jump to the next line of text in the document, and repeat these instructions in full. 

#### Practice: 
Be the computer/robot. Walk through the correspondence documents following the instructions we have written for the robot. What happens when you get to the line that begins with the word "Professor"? What happens when you get to the line that begins, "Hello good sir"?

In [None]:
Type your answer here:


## 5. Translating from English to Python

As we've told you from the beginning, you know pretty much everything you need to know to be a computer programmer. You know how to make task lists. You know how to write instructions. And you know what you want from your documents. With the pseudo-code above, you have a pretty complete set of instructions to guide your little robot. Now, we just need to translate those instructions into a computer language that the robot can understand.


### 5.a Putting your Robot in the Document

The first thing you will need to do for any project is to get your little robot inside the document. Flexible computer languages like Python and R are able to put little robots in all sorts of files, but just like different data types allow for different sorts of operations (e.g. we can perform algebra on numeric data but not on textual data), different types of files can be used for different purposes and 'read' by the robot in different ways. Fortunately, other humans have created a tiny piece of software that instructs the computer to recognize what kind of file we are trying to work with. Such tiny pieces of software are usually called 'functions' and they are very commonly found throughout any programming script. Because we want you to understand how your instructions translate directly into Python, we won't use many functions today. But we will use a few here and there, including a function to help us open our file and get our little robot inside. 

To illustrate how this works, first go to the folder where you are storing all of the Jupyter notebooks of these tutorials. You should see a file called 'test2.txt' somewhere amidst the files. Click on that and observe. Now to ensure that our little robot can observe the same thing you have just seen, click on the Jupyter notebook cell immediately below this one; then hold down the 'Shift' key while pressing the 'Enter' key. 

In [None]:
with open("data/test2.txt") as ourfile:   #with a function called open, open the 'test2.txt' file and call it 'ourfile' 
    for line in ourfile:             #for each line of text in ourfile do the following
        print(line)                  #print the line

### 5.c Getting the robot to do (any) work
The code immediately above opens the test2.txt file and calls it 'ourfile.' Then the robot is instructed that for each line in ourfile, it should print the line. Following those instructions perfectly, the little robot should have printed the entire file so that it looks just like the file you observed with your own two eyes.

Now, some of us might feel like the robot printed too much white space, too many new line characters. So, the code below adds one instruction that the robot should perform for each line in our file. It tells the robot to redefine the line as a new version of the line where extra white-space has been stripped from the line. 

In [None]:
with open("data/test2.txt") as ourfile:   #with a function called open, open the 'test2.txt' file and call it ourfile 
    for line in ourfile:             #for each line of ourfile do the following
        line = line.rstrip()         #replace the line with a version of the line that has had trailing white-space (including the new line character) removed
        print(line)                  #print the line

Now, we know our robot can open and do (some extremely simple) work on the same documents we can read with our own eyes. We're on the move!
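
To see what the stripping step does on its own, here is a small sketch using a made-up line of text (the sample string is an assumption, not taken from test2.txt). `rstrip()` removes white-space, including the new line character, from the right end only; `strip()`, which appears in the later scripts, removes it from both ends.

```python
raw_line = "  From: Nick Adams  \n"   # hypothetical line, as the robot might scan it

right_stripped = raw_line.rstrip()    # trailing spaces and the new line are removed
fully_stripped = raw_line.strip()     # leading spaces are removed as well

# repr() shows the white-space characters explicitly
print(repr(raw_line))
print(repr(right_stripped))
print(repr(fully_stripped))
```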

If we review our pseudo-code from earlier, we can see that we still have a fair amount of translation work to do. But, the first line is definitely translated.

In [None]:
for each line of text, move over the line of text, scanning it into your memory, and flow through this list of instructions 

    if you see "To:",
      then collect all words other than "To:" that appear on this line
      and send them to the 'recipient' container
      and write yourself a note that you are in the 'recipient' region
      and throw away any other notes that are in your pocket.

    or else, if you see "From:",
      then collect all words following From:
      and send them to the 'sender' container
      and write yourself a note that you are in the 'sender' region
      and throw away any other notes that are in your pocket (so you don't get confused).

    or else, if you see "Date:",
      then collect the data following "Date:"
      and send it to the 'date' container
      and write yourself a note that you are in the 'date' region
      and throw away any other notes that are in your pocket.

    or else, if you see "In response to:",
      then collect all info following "In response to:"
      and send it to the 'subject' container
      and write yourself a note that you are in the 'subject' region
      and throw away any other notes that are in your pocket.

    #Human note to human: we deal with lines of text that do not include the structures above using the following lines of code.
    
    if you see anything other than the above AND you have a note in your pocket saying you are in the 'recipient' region
      then collect all words that appear on this line
      and send them to the 'recipient' container
    
    or else, if you see anything other than the above AND you do NOT have a note in your pocket saying you are in the 'recipient' region,
      collect the data
      and send it to the 'body' container
      and write yourself a note that you are in the 'body' region
      and throw away any other notes that are in your pocket.

    jump to the next line of text in the document, and repeat these instructions in full. 

#### Practice:
Can you identify the line of Python code that corresponds with the first line of our pseudo-code?

Type your answer here:

Answer key:
for line in ourfile:

### 5.d Moving Forward
Much of our pseudocode involves finding structure in our documents, and identifying which region the robot is currently inside, before instructing it to gather the text from a particular line and send it to one of our containers. So, let's start by doing that for just a couple of our easier regions that don't have multiple lines of text in each. We'll try it for the sender and date regions of just one of our email documents.

Here was our pseudo-code about the date region for reference:

    if you see "Date:",
      then collect the data following "Date:"
      and send it to the 'date' container
      and write yourself a note that you are in the 'date' region
      and throw away any other notes that are in your pocket.

We're going to get a little help for our robot on the task of 'seeing' "Date:" and collecting the data following "Date:".

To identify the location of "Date:", we will import a library of functions that is called 're'. The 're' library contains Python's implementation of the 'Regular Expressions' (RegEx) search language. We will explain Regular Expressions in more detail in the next notebook. For now, think of Regular Expressions as a more flexible version of your good old Ctrl/Command+F search function.
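
As a small taste of what 're' can do, here is a sketch of `re.match` applied to two made-up lines (both sample strings are assumptions). Each pair of parentheses in the pattern defines a numbered group that can be pulled out of a successful match. Note that `re.match` only looks for the pattern at the very beginning of the string, so the second line produces no match at all.

```python
import re

sample_line = "Date: 12-05-2017"            # hypothetical line of text

check = re.match(r"(Date:)(.*)", sample_line)
label = check.group(1)                      # the text matched by (Date:)
value = check.group(2)                      # everything after it, matched by (.*)

# re.match anchors at the start of the string, so this returns None
no_match = re.match(r"(Date:)(.*)", "Sent on Date: 12-05-2017")

print(repr(label))
print(repr(value))
print(no_match)
```

The second group keeps the space that follows the colon; calling `value.strip()` would remove it.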

After identifying the location, we need to remember that if we want to send things to a 'date' container, we need to first create one. We'll do these two steps at the top of our script before opening our file and putting our robot to work. If you press Shift + Enter on the code cell below, what is your output? Does it look right compared to the test2.txt document? 

In [None]:
import re

# Setting up our containers. 
#We tell Python that the containers are meant for strings by using the ''. These lines say, "create two string containers that are currently empty and call them 'sender' and 'date'."

sender = ''
date = ''                             

with open("data/test2.txt") as ourfile:
    
    for line in ourfile:  
        line = line.strip()
          
        if "From:" in line:
            region = 'sender'
            #we are telling our robot to look within our line for a match to the word "From:" and to grab "From:" and everything occurring after it ".*" and to send both to a temporary container called 'check'  
            check = re.match(r"(From:)(.*)", line)
            #the next line translates as "if check exists, i.e. has any data in it, then..."
            if check:
                #...then take the second item in the check container and put it to our sender container
                sender = check.group(2)
            
        #'elif' is short for "or else, if..."
        elif "Date:" in line:
            region = 'date'
            check = re.match(r"(Date:)(.*)", line)
            if check:
                date = check.group(2) 

                
#Printing the data from our containers so we can see if it worked.                
print(sender)
print(date)

#you may want to un-comment the following two lines so you can see the two items currently in the check container.
#print(check.group(1))
#print(check.group(2))

That worked. So, we have now demonstrated to ourselves that we can tell our robot how to find meaningful structures, how to write itself notes about what region it is in, how to collect information associated with those regions, and how to put our data in containers. We even had the robot print out the extracted data to prove to us that it got the right information. Let's see if we can add in instructions for our other document regions.

Some of these other regions will be relatively easy, drawing on translations we've already done. But as we saw when pseudo-coding, the recipient and body regions are not as easy to work with as sender and date regions. While the latter two regions never extend to more than one line of text, a recipient's name, title, and affiliation might span three lines. The body of a correspondence can span many lines as well. These pose particular challenges for our little robot since it loses its memory when it jumps to each new line. We will handle this challenge similarly to how we handled it in our pseudo-code, with instructions for these contingencies near the bottom of our script. See comments within the code for extra explanation.

In [None]:
import re

recipient = ''
sender = ''
date = ''                             
subject = ''
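body = ''
region = ''   #the robot starts with no note in its pocket (this avoids an error if the first line matches none of the structures below)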
    
with open("data/test2.txt") as ourfile:
    
    for line in ourfile:  
        line = line.strip()
        
        
        if "----" in line:
            region = 'start'
        
        elif line.startswith('To:'):
            region = 'recipient'
            check = re.match(r"(To:)(.*)", line)
            if check:
                recipient = check.group(2)
            #we deal with additional lines of recipient data below. This portion of code just grabs the first line of recipient data
            
        elif "From:" in line:
            region = 'sender'
            check = re.match(r"(From:)(.*)", line)
            if check:
                sender = check.group(2)
            
        elif "Date:" in line:
            region = 'date'
            check = re.match(r"(Date:)(.*)", line)
            if check:
                date = check.group(2)    
            
        elif "In response to:" in line:
            region = 'subject'
            check = re.match(r"(In response to:)(.*)", line)
            if check:
                subject = check.group(2)
        
        #The following line is better read as "or else if you, the robot, don't see anything like the above and there is no note in your pocket saying you are in the recipient region, then..."     
        elif region != 'recipient':
            region = 'body'
            #we deal with body data below because it is multi-line

#We handle regions with multiple lines here
#Note: We use 'if' again instead of elif, which triggers our robot to treat the next instructions as a separate and complete set of conditions to consider

        if region == 'recipient' and not line.startswith('To:'):  #the 'and not...' just ensures we don't repeat the first line of recipient data we already gathered above
            recipient = recipient + ', ' + line

        elif region == 'body':
            #The code below tells the robot to update the data in the body container by adding a space and whatever text is on the current line
            body = body + ' ' + line


            
# Printing the data along with some labels this time
print("Recipient info: %s" % recipient)
print("Sender info: %s" %sender)
print("Date: %s" %date)
print("Subject: %s" %subject)
print("Body: %s" % body)



This code is not too shabby! It has our robot collecting everything we need. However, you might notice a bit of extra white space in the body. That is caused by empty lines in the document. While our line = line.strip() command removes unnecessary tabs and spaces, it does not get rid of entire lines. If we want our robot to ignore lines that are totally empty (instead of sending them to our body container with a ' ' space), we can tell it to just move on or 'continue' whenever it encounters a line that is not filled with any information. We've added that in below as the first if statement in the script. Take a look at the difference in the output, particularly in the body container.

In [3]:
import re

recipient = ''
sender = ''
date = ''                             
subject = ''
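body = ''
region = ''   #the robot starts with no note in its pocket (this avoids an error if the first line matches none of the structures below)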
    
with open("data/test2.txt") as ourfile:
    
    for line in ourfile:  
        line = line.strip()
        
        if not line:
            #then, just move on to the next line and start over with these instructions
            continue
        
        elif "----" in line:
            region = 'start'
        
        elif line.startswith('To:'):
            region = 'recipient'
            check = re.match(r"(To:)(.*)", line)
            if check:
                recipient = check.group(2)
            #we deal with additional lines of recipient data below. This portion of code just grabs the first line of recipient data
            
        elif "From:" in line:
            region = 'sender'
            check = re.match(r"(From:)(.*)", line)
            if check:
                sender = check.group(2)
            
        elif "Date:" in line:
            region = 'date'
            check = re.match(r"(Date:)(.*)", line)
            if check:
                date = check.group(2)    
            
        elif "In response to:" in line:
            region = 'subject'
            check = re.match(r"(In response to:)(.*)", line)
            if check:
                subject = check.group(2)
        
        #The following line is better read as "or else if you, the robot, don't see anything like the above and there is no note in your pocket saying you are in the recipient region, then..."     
        elif region != 'recipient':
            region = 'body'
            #we deal with body data below because it is multi-line

#We handle regions with multiple lines here
#Note: We use 'if' again instead of elif, which triggers our robot to treat the next instructions as a separate and complete set of conditions to consider

        if region == 'recipient' and not line.startswith('To:'):  #the 'and not...' just ensures we don't repeat the first line of recipient data we already gathered above
            recipient = recipient + ', ' + line

        elif region == 'body':
            #The code below tells the robot to update the data in the body container by adding a space and whatever text is on the current line
            body = body + ' ' + line


            
# Printing the data along with some labels this time
print("Recipient info: %s" % recipient)
print("Sender info: %s" %sender)
print("Date: %s" %date)
print("Subject: %s" %subject)
print("Body: %s" % body)

Recipient info:  Jake Ryland Williams, Professor, Drexel University
Sender info:  Nick Adams
Date:  12-05-2017
Subject: 
Body:  Hello good sir, I am very pleased with the progress on the project. I am learning a great deal from your code and am happy to pass along the knowledge you've shared to others. Warmly, Nick
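
The `if not line:` test in the script above works because Python treats an empty string as FALSE. Here is a minimal sketch of that idea in isolation, using a made-up list of lines (an assumption, not the contents of test2.txt). After stripping, blank and white-space-only lines become empty strings, and `continue` sends the robot straight to the next line.

```python
# Hypothetical lines, including one empty line and one with only spaces
lines = ["Hello good sir,", "", "   ", "Warmly,", "Nick"]

kept = []
for line in lines:
    line = line.strip()
    if not line:          # an empty string counts as FALSE, so 'not line' is TRUE
        continue          # skip the rest of the loop body for this line
    kept.append(line)

body = " ".join(kept)
print(body)
```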


### This code works great if you have a single document per file!  But we have many!   :'(

As long as you have a separate file for each document you wish to parse, a script like the one above can get you very far with any parsing task. But, more often than not, digital archivists and researchers have files that include many documents per file (or they may have large folders of files) that they wish to parse with a single script. 

You can see for yourself how poorly the above script would perform in such a scenario. Just replace "test2.txt" with "test3.txt" in the script above. When you read test3.txt with your own eyes, you see that it is two emails separated by '----' four dashes. However when you run the script above on test3.txt, the resulting output does not look right at all. The recipient container only has information about the second email. The same goes with our sender, date, and subject containers. Our parser over-wrote the information from the first email. And the parser didn't do much better for the body container, which now has information from both emails totally undifferentiated. 
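
The overwriting can be reproduced in miniature. In the sketch below, the list of lines is a made-up stand-in for a file holding two emails (an assumption, not the actual contents of test3.txt). Because there is only one 'sender' container, the second "From:" line replaces whatever the first one stored.

```python
# Hypothetical stand-in for a file containing two emails
lines = [
    "----",
    "From: Nick Adams",
    "----",
    "From: Jake Ryland Williams",
]

sender = ''
for line in lines:
    if line.startswith("From:"):
        # Each assignment replaces the previous contents of the container.
        sender = line[len("From:"):].strip()

print(sender)   # only the sender of the last email survives
```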

We need to make some edits to our instructions!

### 5.e. Last step: A Container for our Containers

Earlier we discussed containers, and the fact that computers use different containers for different kinds of data much as we use different containers for liquids vs. solids or gases. So far our script only includes containers for strings (i.e. sequences of characters that are not mathematically computable). Those containers are doing their jobs just fine when they only need to house information about one email. But, ideally, we would be able to fill a whole set of those string containers for each email in our file (and maybe even store all of those sets of containers in a mega container).

Fortunately, we are not the first humans to come across this challenge. Others who have been developing the Python language and its functions have already made the containers we need. One is called a 'dictionary'. The other is a 'list'. Forget for a moment the names of these containers, which can be more confusing than helpful since we already have our own notions about the dictionaries and lists we encounter on a regular basis. The Python dictionary container allows us to treat our region-specific containers as a set of containers, such that the string in each container goes with the strings in all the others. We can even name each of our string containers just like we did before. In the code below, under the elif "----" in line: statement, you can see how we create a set of containers all at once. Using the { } curly braces, we designate the name of each container in the set along with its starting contents (here an empty string, ''), with a : colon in between. We name our set of containers (or Python dictionary) 'current_email' because the code opens a fresh set of these containers for each email (which, as we already know, begins with a '----').
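Here is a minimal sketch of a dictionary on its own, outside of any parsing loop. The container names match the ones used in this tutorial; the sender value is typed in by hand purely for illustration.

```python
# A fresh set of named containers, each starting out as an empty string.
current_email = {'recipient': '',
                 'sender': '',
                 'date': '',
                 'subject': '',
                 'body': ''}

# Fill one container by name (this value is made up for illustration).
current_email['sender'] = 'Nick Adams'

# Read a container back by name.
print(current_email['sender'])

# Containers we have not filled yet still hold their empty starting string.
print(repr(current_email['subject']))
```

The name in square brackets plays the same role that a variable name like sender played in the earlier script, except that all five containers now travel together as one set.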

The creation of this set of region-specific containers (i.e. dictionary) only gets us so far, though. If we stopped here, the set of containers would still be overwritten with each new email our robot traverses. So, we need a way of storing each email before moving on to the next. To meet that challenge we create a container for all our emails – one that could store thousands of emails and certainly the two in our test3.txt document. You will see that we create this container with a line near the top of the code, which reads:

emails = []

Using the [] square brackets to create our container designates the container as a 'list.' Lists are a very flexible sort of Python container that can store any kind of data in a sequence. You could make a list of strings, a list of numbers, or (in our case) a list of dictionaries. This means we can keep all of our data as long as we add a new dictionary to the list for each email document in our file. If you look again under the elif "----" in line: statement and immediately under the creation of our current_email dictionary, you will see a line of code that does this: 

emails.append(current_email)

The .append tells our robot to add to the emails list, without replacing the data already inside of it. So, with each new email in our file, our robot begins a new dictionary and appends it to our emails list. Then, it goes about traversing our document filling that dictionary with information. 
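It may seem odd that the dictionary is appended to the list while it is still empty. The sketch below, which uses a shortened two-container dictionary and made-up values, suggests why this works: the list holds the dictionary itself rather than a copy, so anything stored in the dictionary after the append is still visible through the list.

```python
emails = []

for name in ['Nick Adams', 'Jake Ryland Williams']:
    # Start a fresh (shortened, illustrative) set of containers.
    current_email = {'sender': '', 'body': ''}

    # Add the still-empty dictionary to the list...
    emails.append(current_email)

    # ...and fill it afterwards. The list sees the change because it
    # holds the very same dictionary, not a copy of it.
    current_email['sender'] = name

print(len(emails))
print(emails[0]['sender'])
print(emails[1]['sender'])
```

Because a brand new dictionary is created on each pass through the loop, filling the second one leaves the first one untouched.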

Finally, at the bottom of the code in the cell below, we print all the data from the emails container, one email at a time. 


In [5]:
import re

#This will be our container of containers. For each email, it will house a separate set of containers corresponding with each of our text regions. In Python terms, this container is a list, designated with []
emails = []
    
with open("test3.txt") as ourfile:
    
    for line in ourfile:  
        line = line.strip()
        
        if not line:
            continue
        
        elif "----" in line:
            region = 'start'
            #as the robot encounters a new document in our file, we create a fresh Python dictionary called 'current_email' – a set of containers for the data the robot will encounter as it traverses that document
            current_email = {'recipient': '',
                             'sender': '',
                             'date': '',
                             'subject': '',
                             'body': ''}
            #and as the robot encounters a new document, it adds the entire set of containers (i.e. dictionary) to our 'emails' container (of containers) 
            emails.append(current_email)
        
        elif line.startswith('To:'):
            region = 'recipient'
            check = re.match(r"(To:)(.*)", line)
            if check:
                #everything looks the same except we are sending our region-specific data containers inside the current_email set of containers (i.e. dictionary), which is inside our 'emails' mega-container 
                current_email['recipient'] = check.group(2)
                #just like before, we still deal with additional lines of recipient data below. This portion of code just grabs the first line of recipient data
            
        elif "From:" in line:
            region = 'sender'
            check = re.match(r"(From:)(.*)", line)
            if check:
                current_email['sender'] = check.group(2)
            
        elif "Date:" in line:
            region = 'date'
            check = re.match(r"(Date:)(.*)", line)
            if check:
                current_email['date'] = check.group(2)    
            
        elif "In response to:" in line:
            region = 'subject'
            check = re.match(r"(In response to:)(.*)", line)
            if check:
                current_email['subject'] = check.group(2)
        
        elif region != 'recipient':
            region = 'body'
            

        #We handle regions with multiple lines here

        if region == 'recipient' and not line.startswith('To:'):  
            current_email['recipient'] = current_email['recipient'] + ', ' + line

        elif region == 'body':
            #again, everything is the same except our containers are inside a set of containers inside another container
            current_email['body'] = current_email['body'] + ' ' + line


#We can print each email in our list in sequence using the following code. Note that 'email' is being defined here as a single item in the emails list. Python knows that the name between the words 'for' and 'in' refers to one item at a time from the container named after the word 'in'
for email in emails:
    print("Recipient info: %s" % email['recipient'])
    print("Sender info: %s" % email['sender'])
    print("Date: %s" % email['date'])
    print("Subject: %s" % email['subject'])
    print("Body: %s" % email['body'])
    #We print a bunch of dashes just so the output clearly shows where each new email begins
    print('-' * 80)

Recipient info:  Jake Ryland Williams, Professor, Drexel University
Sender info:  Nick Adams
Date:  12-05-2017
Subject: 
Body:  Hello good sir, I am very pleased with the progress on the project. I am learning a great deal from your code and am happy to pass along the knowledge you've shared to others. Warmly, Nick
--------------------------------------------------------------------------------
Recipient info:  Nick Adams, Founder & Director, GoodlyLabs
Sender info:  Jake Ryland Williams
Date:  12-12-2017
Subject:  your correspondence of 12-05-2017
Body:  Good day friend, I am so glad it is working out. We should all meet up before the end of the year to celebrate! Best regards, Jake
--------------------------------------------------------------------------------


## 6. Summary of What You've Learned

In this section, you not only learned a flexible approach for parsing almost any text document, large or small, you also learned quite a bit about programming a computer. As we've insisted, you already had most of the necessary skills. Now, we hope, you also have some confidence that you can implement a version of the scripts above for your own purposes on your own set of documents. It might please you to learn, too, that we actually covered a lot of basic computing curriculum that people sometimes struggle to learn. If you want to impress your computer science friends or colleagues, you can tell them that you learned how to use a state machine to implement a streaming approach to the parsing of documents. They will be quite impressed.
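To make the 'state machine' idea concrete, here is one way the final script could be repackaged as a reusable function. This is a sketch under stated assumptions, not the script above verbatim: the names parse_emails, HEADERS and HEADER_PATTERN are hypothetical, the sample text is made up to resemble the test emails, header values are stripped of surrounding spaces, and header-looking lines are treated as body text once the body has begun. The region variable is the 'state', and the function reads one line at a time, which is the 'streaming' part.

```python
import re

# Maps each header label to the container (and state) it fills.
HEADERS = {'To': 'recipient', 'From': 'sender',
           'Date': 'date', 'In response to': 'subject'}
HEADER_PATTERN = re.compile(r"(To|From|Date|In response to):(.*)")


def parse_emails(lines):
    """Turn an iterable of text lines into a list of email dictionaries."""
    emails = []
    current_email = None   # no email has started yet
    region = None          # the current state of our state machine

    for line in lines:
        line = line.strip()
        if not line:
            continue

        if "----" in line:
            # A separator starts a fresh set of containers.
            current_email = {key: '' for key in HEADERS.values()}
            current_email['body'] = ''
            emails.append(current_email)
            region = 'start'
            continue

        if current_email is None:
            # Assumption: text before the first separator is ignored.
            continue

        header = HEADER_PATTERN.match(line)
        if header and region != 'body':
            # A header line switches the state and fills its container.
            region = HEADERS[header.group(1)]
            current_email[region] = header.group(2).strip()
        elif region == 'recipient':
            # Recipient details can continue over several lines.
            current_email['recipient'] += ', ' + line
        else:
            # Anything else is body text.
            region = 'body'
            current_email['body'] = (current_email['body'] + ' ' + line).strip()

    return emails


# Hypothetical sample text, shaped like the emails used in this tutorial.
sample = """----
To: Jake Ryland Williams
Professor
Drexel University
From: Nick Adams
Date: 12-05-2017
Hello good sir,
Warmly, Nick
----
To: Nick Adams
From: Jake Ryland Williams
Date: 12-12-2017
In response to: your correspondence of 12-05-2017
Good day friend,
Best regards, Jake
"""

parsed = parse_emails(sample.splitlines())
for email in parsed:
    print(email)
```

Because the function accepts any iterable of lines, the same call should work on an open file object (for example, inside a with open(...) block) as well as on a list of strings.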