# Getting Data

In this lesson we will look at different ways of getting data.

## Using the command line in a notebook

One can check the operating system and run commands from within a notebook.

In [None]:
# You can get a directory listing
%ls

In [None]:
# You can create directories and remove them
%mkdir Red

In [None]:
# Now we can check if the directory is there
%ls

In [None]:
%rmdir Red

In [None]:
%ls
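
The `%` commands above hand the work to the operating system's own commands, so they can behave differently from one platform to another. As a minimal sketch, the same steps can be done in plain Python with the standard `os`, `platform` and `tempfile` modules. The throwaway working directory is an assumption added here so that nothing in your own folder is touched.

```python
import os
import platform
import tempfile

# Name of the operating system, e.g. 'Linux', 'Darwin' (macOS) or 'Windows'.
print(platform.system())

# Work inside a temporary directory that is deleted when the block ends.
with tempfile.TemporaryDirectory() as workdir:
    red = os.path.join(workdir, "Red")
    os.mkdir(red)                      # like %mkdir Red
    after_mkdir = os.listdir(workdir)  # like %ls
    os.rmdir(red)                      # like %rmdir Red
    after_rmdir = os.listdir(workdir)  # like %ls again

print(after_mkdir)
print(after_rmdir)
```

The first listing should contain `Red` and the second should be empty.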

## Opening and Reading a Text File

Once we know where a file is we can open it. Here we will get some text online and copy it to a text file for use.

* Get some text from the [Project Gutenberg version of Plato's Phaedrus](https://www.gutenberg.org/ebooks/1636)
* Create a text file in Jupyter, paste it in, and save it.
* Check that the file is there.

Here is some code to load text.

In [None]:
# This is the preferred way to open a file and read the text into a variable.
with open("StoryOfWriting.txt", "r") as file:
    theStory = file.read()
    
# And we can check what we got.
theStory[:100]

This uses a ```with ... as ...``` context manager. For more see [With Statement Context Managers](https://docs.python.org/3/reference/datamodel.html#context-managers). It does the same as opening a file, reading it, and then closing it, but it closes the file for you even if an error occurs while reading.
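
To see what "safely" means here, the sketch below checks the file object's `closed` attribute after a normal read and after a read that fails. The file `sample.txt` and its contents are hypothetical, created in a temporary directory only so the example can run on its own.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# "sample.txt" is a hypothetical name, not a file from this lesson.
workdir = tempfile.mkdtemp()
sample_path = os.path.join(workdir, "sample.txt")
with open(sample_path, "w", encoding="utf-8") as file:
    file.write("In the beginning was the word.")

# Normal case: the file is closed as soon as the block ends.
with open(sample_path, "r", encoding="utf-8") as file:
    text = file.read()
closed_after_read = file.closed

# Error case: the file is still closed, even though the block raised.
try:
    with open(sample_path, "r", encoding="utf-8") as file:
        raise ValueError("something went wrong while reading")
except ValueError:
    closed_after_error = file.closed

print(closed_after_read, closed_after_error)

# Tidy up the throwaway file and directory.
os.remove(sample_path)
os.rmdir(workdir)
```

Both values should be `True`: the context manager closed the file in each case without an explicit call to `close()`.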

In [None]:
# A simpler way of doing it, but not as safe.
file2Open = open('StoryOfWriting.txt', "r")
theStory = file2Open.read()
file2Open.close()
theStory[:100]

## Getting a file from the web

We don't have to only get text files from the file system. We can also retrieve them from the web.

In [None]:
# We start by importing a module that can make requests over the web.
import urllib.request

# We provide the URL and path for the document we want.
path = 'http://www.gutenberg.org/cache/epub/1636/pg1636.txt'

# We now use a context manager to open the URL and read the response.
with urllib.request.urlopen(path) as response:
    fullDialogue = response.read().decode('utf-8')
    
# We print out the first 100 characters to check.
print(fullDialogue[:100])

**Note** how we have two methods chained together. We ```read()``` the response, which returns bytes, and then ```decode()``` those bytes as UTF-8 to get a string.
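
The difference between bytes and a string can be seen without going to the web. This small sketch uses a made-up string with non-ASCII characters as a stand-in for a web response; the variable names are illustrative only.

```python
# A short string with non-ASCII characters, standing in for a web response.
original = "Phædrus, Φαῖδρος"

raw = original.encode("utf-8")  # bytes, like the result of response.read()
text = raw.decode("utf-8")      # str, like the result of .decode('utf-8')

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```

The byte count is larger than the character count because each non-ASCII character takes more than one byte in UTF-8.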

## Doing things to the resulting text

Now we can do some things to the dialogue.

In [None]:
fullDialogue

In [None]:
# Note how much nicer the scrolling output is when you print it!
print(fullDialogue)

In [None]:
# Get the length in characters
len(fullDialogue)

In [None]:
fullDialogue.count("truth")

In [None]:
fullDialogue[10700:10800]

In [None]:
fullDialogue.find("truth")
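
The cells above use `len()`, `count()`, `find()` and slicing on the full text. Here is the same set of operations on a short made-up sentence, which makes the results easier to check by eye. The sentence is an illustration and is not taken from the dialogue.

```python
sample = "The truth is that truth is rarely simple."

print(len(sample))              # number of characters
print(sample.count("truth"))    # how many times "truth" occurs
first = sample.find("truth")    # index of the first occurrence
print(first)
print(sample[first:first + 5])  # slicing at that index gives the word back
print(sample.find("beauty"))    # -1 means the text was not found
```

Note that `find()` only reports the first match, and that indexes start at 0.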

**Question** How can we cut out only the dialogue?

In [None]:
fullDialogue.find("SOCRATES: My dear Phaedrus, whence come you, and whither are you going?")
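
One possible approach to the question, offered as a sketch and not the only answer: use `find()` to locate where the dialogue starts and where it ends, then slice between the two positions. The sample text and the end marker below are hypothetical stand-ins; with `fullDialogue` you would need to look at the text and choose a real end marker yourself.

```python
# A tiny stand-in for fullDialogue; the real text is much longer.
sample = (
    "Title: Phaedrus\n"
    "INTRODUCTION (editor's notes)\n"
    "SOCRATES: My dear Phaedrus, whence come you, and whither are you going?\n"
    "PHAEDRUS: (reply)\n"
    "*** END MARKER (hypothetical) ***\n"
    "Licence text\n"
)

start_marker = "SOCRATES: My dear Phaedrus"
end_marker = "*** END MARKER"  # hypothetical; pick one from the real text

start = sample.find(start_marker)
if start == -1:
    raise ValueError("start marker not found")

# Search for the end marker only after the start position.
end = sample.find(end_marker, start)
if end == -1:
    end = len(sample)  # no end marker: keep everything to the end

dialogue = sample[start:end]
print(dialogue)
```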

### Questions

* What other sorts of data do we want to be able to load?
    * How might we do it?
* Where can you get data?

----
# Exercise: Concatenating Text

Can you create a single text from a set of web pages? 
* Google some interesting subject that you want to study
* Collect a set of at least 5 URLs that have interesting content on your subject
* Write a notebook that goes to each of those pages and gets the text
* Then concatenate the texts into one variable
* Count the number of characters


## Optional

Can you figure out how to write the concatenated text out to a text file for later use?