# Getting Data

In this lesson we will look at different ways of getting data.

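## Using the command line in a notebook

You can run operating system commands from within a notebook. The lines starting with ```%``` below are IPython magic commands that stand in for common shell commands such as ```ls```, ```mkdir```, and ```cd```.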

In [1]:
# You can get a directory listing
%ls

BDTA Lesson 1 Using Jupyter.ipynb
BDTA Lesson 2 Hello World.ipynb
BDTA Lesson 3 Lists.ipynb
BDTA Lesson 4 Review.ipynb
BDTA Lesson 5 Getting Text.ipynb
BDTA Lesson 6 Functions and For Loops.ipynb
README.md
SimpleSentimentAnalysisExample.ipynb
StoryOfWriting.txt


In [2]:
# You can create directories and remove them
%mkdir Red

In [4]:
# Now we can check if the directory is there
%ls

BDTA Lesson 1 Using Jupyter.ipynb
BDTA Lesson 2 Hello World.ipynb
BDTA Lesson 3 Lists.ipynb
BDTA Lesson 4 Review.ipynb
BDTA Lesson 5 Getting Text.ipynb
BDTA Lesson 6 Functions and For Loops.ipynb
README.md
Red/
SimpleSentimentAnalysisExample.ipynb
StoryOfWriting.txt


In [7]:
%cd ..

/Users/grockwel/Sync/Rockwell IPython Stuff/Big-Data-Class-Materials


In [8]:
%rmdir Red

In [10]:
%ls

BDTA Lesson 1 Using Jupyter.ipynb
BDTA Lesson 2 Hello World.ipynb
BDTA Lesson 3 Lists.ipynb
BDTA Lesson 4 Review.ipynb
BDTA Lesson 5 Getting Text.ipynb
BDTA Lesson 6 Functions and For Loops.ipynb
README.md
SimpleSentimentAnalysisExample.ipynb
StoryOfWriting.txt
theWritingStory.txt
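
The same steps can also be done in plain Python instead of with magic commands. The sketch below is only an illustration and is not part of the original notebook: it uses the standard ```pathlib``` and ```tempfile``` modules, and it works inside a temporary folder (an assumption made here so that nothing in your course folder is touched).

```python
import tempfile
from pathlib import Path

# Work in a throwaway folder so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as workspace:
    base = Path(workspace)

    # Create a directory, like %mkdir Red
    red = base / "Red"
    red.mkdir()

    # List the folder contents, like %ls
    names_after_mkdir = sorted(item.name for item in base.iterdir())
    print(names_after_mkdir)

    # Remove the (empty) directory, like %rmdir Red
    red.rmdir()

    # List again to confirm the directory is gone
    names_after_rmdir = sorted(item.name for item in base.iterdir())
    print(names_after_rmdir)
```

Working with ```Path``` objects keeps the code the same on Windows, macOS, and Linux, where the shell commands themselves can differ.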


## Opening and Reading a Text File

Once we know where a file is we can open it. Here we will get some text online and copy it to a text file for use.

* Get some text from the [Project Gutenberg version of Plato's Phaedrus](https://www.gutenberg.org/ebooks/1636)
* Create a text file in Jupyter, paste it in, and save it.
* Check that the file is there.

Here is some code to load text.

In [11]:
# This is the preferred way to open a file and read the text into a variable.
with open("theWritingStory.txt", "r") as file:
    theStory = file.read()
    
# And we can check what we got.
theStory[:100]

'SOCRATES: At the Egyptian city of Naucratis, there was a famous old god,\nwhose name was Theuth; the '

In [12]:
%ls

BDTA Lesson 1 Using Jupyter.ipynb
BDTA Lesson 2 Hello World.ipynb
BDTA Lesson 3 Lists.ipynb
BDTA Lesson 4 Review.ipynb
BDTA Lesson 5 Getting Text.ipynb
BDTA Lesson 6 Functions and For Loops.ipynb
README.md
SimpleSentimentAnalysisExample.ipynb
StoryOfWriting.txt
theWritingStory.txt


This uses a ```with ... as ...``` context manager. For more see [With Statement Context Managers](https://docs.python.org/3/reference/datamodel.html#context-managers). It does the same as opening a file, reading it, and then closing it, but safely: the file is closed even if an error occurs while reading.
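
To make the comparison concrete, here is a minimal sketch of what the context manager does for us, written out by hand with ```try ... finally```. The sample file is created on the fly in a temporary location (an assumption for this illustration, standing in for ```theWritingStory.txt```), and the variable names are made up for the example.

```python
import os
import tempfile

# Create a small sample file to read (a stand-in for theWritingStory.txt).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as sample:
    sample.write("SOCRATES: At the Egyptian city of Naucratis")
    sample_path = sample.name

# The with statement closes the file as soon as the indented block ends.
with open(sample_path, "r", encoding="utf-8") as file:
    text_with = file.read()
closed_after_with = file.closed
print(closed_after_with)

# Roughly the same thing written out by hand.
file = open(sample_path, "r", encoding="utf-8")
try:
    text_manual = file.read()
finally:
    file.close()  # runs even if read() raises an error

print(text_with == text_manual)

# Tidy up the sample file.
os.remove(sample_path)
```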

In [None]:
# A simpler way of doing it, but not as safe.
file2Open = open('StoryOfWriting.txt', "r")
theStory = file2Open.read()
file2Open.close()
theStory[:100]

## Getting a file from the web

We are not limited to text files on the local file system. We can also retrieve them from the web.

In [15]:
# We start by importing a module that can make requests over the web.
import urllib.request

# We provide the URL and path for the document we want.
path = 'http://www.gutenberg.org/cache/epub/1636/pg1636.txt'

# We now use a context manager to open the URL and read the response.
with urllib.request.urlopen(path) as response:
    fullDialogue = response.read().decode('utf-8')
    
# We print out the first 100 characters to check.
print(fullDialogue[:100])

The Project Gutenberg EBook of Phaedrus, by Plato

This eBook is for the use of anyone anywhere a


**Note** how we have two methods chained together. We ```read()``` the response, which gives us raw bytes, and then ```decode()``` those bytes from UTF-8 into a Python string.
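
The sketch below shows the decoding step on its own, without going to the web. The bytes are made up for illustration: they begin with a UTF-8 byte order mark (BOM), which some downloaded text files carry at the very start. Decoding with ```'utf-8'``` keeps the BOM as an invisible first character, while the ```'utf-8-sig'``` codec drops it if it is there.

```python
# Made-up sample bytes: a UTF-8 byte order mark followed by some text.
raw = b"\xef\xbb\xbfThe Project Gutenberg EBook of Phaedrus, by Plato"

# read() on a web response gives bytes like these, not a string.
print(type(raw))

# decode('utf-8') keeps the BOM as an invisible first character.
with_bom = raw.decode("utf-8")
print(repr(with_bom[:4]))

# decode('utf-8-sig') removes the BOM when one is present.
without_bom = raw.decode("utf-8-sig")
print(repr(without_bom[:4]))
```

If the first characters of a downloaded text look odd, trying ```'utf-8-sig'``` in place of ```'utf-8'``` is one thing to check.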

## Doing things to the resulting text

Now we can do some things to the dialogue.

In [16]:
fullDialogue



In [17]:
# Note how much nicer the scrolling output is when you print it!
print(fullDialogue)

The Project Gutenberg EBook of Phaedrus, by Plato

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Phaedrus

Author: Plato

Translator: B. Jowett

Posting Date: October 30, 2008 [EBook #1636]
Release Date: February 1999

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK PHAEDRUS ***




Produced by Sue Asscher





PHAEDRUS

By Plato


Translated by Benjamin Jowett




INTRODUCTION.

The Phaedrus is closely connected with the Symposium, and may be
regarded either as introducing or following it. The two Dialogues
together contain the whole philosophy of Plato on the nature of love,
which in the Republic and in the later writings of Plato is only
introduced playfully or as a figure of speech. But in the Phaedrus and
Sym

In [18]:
fullDialogue.find("SOCRATES: My dear Phaedrus, whence come you, and whither are you going?")

87840

In [19]:
fullDialogue2 = fullDialogue[87840:]

In [21]:
fullDialogue2[:500]

'SOCRATES: My dear Phaedrus, whence come you, and whither are you going?\r\n\r\nPHAEDRUS: I come from Lysias the son of Cephalus, and I am going to\r\ntake a walk outside the wall, for I have been sitting with him the whole\r\nmorning; and our common friend Acumenus tells me that it is much more\r\nrefreshing to walk in the open air than to be shut up in a cloister.\r\n\r\nSOCRATES: There he is right. Lysias then, I suppose, was in the town?\r\n\r\nPHAEDRUS: Yes, he was staying with Epicrates, here at the house of'
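
Rather than typing the position by hand, we can keep the result of ```find()``` in a variable and slice with it. This is a minimal sketch on a short made-up sample instead of the full text. The end marker used here is hypothetical, so check the actual file to see what text follows the dialogue before relying on it.

```python
# A short made-up stand-in for fullDialogue.
sample = (
    "INTRODUCTION.\n"
    "The Phaedrus is closely connected with the Symposium.\n"
    "SOCRATES: My dear Phaedrus, whence come you?\n"
    "PHAEDRUS: I come from Lysias.\n"
    "*** END OF EBOOK ***\n"
)

start_marker = "SOCRATES: My dear Phaedrus"
end_marker = "*** END"  # hypothetical marker for this sample only

start = sample.find(start_marker)
end = sample.find(end_marker)

# find() returns -1 when the text is not found, so check before slicing.
if start == -1 or end == -1:
    dialogue_only = sample
else:
    dialogue_only = sample[start:end]

print(dialogue_only)
```

Checking for ```-1``` matters because slicing with ```-1``` would not raise an error; it would quietly give the wrong piece of text.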

In [None]:
# Get the length in characters
len(fullDialogue)

In [None]:
fullDialogue.count("truth")

In [None]:
fullDialogue[10700:10800]

In [None]:
fullDialogue.find("truth")
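
Here is how ```len()```, ```count()```, and ```find()``` behave on a short string where the answers are easy to check by eye. The sample sentence and variable names are made up for this illustration.

```python
sample = "the truth about truth"

# Length in characters, spaces included.
length = len(sample)
print(length)        # 21

# How many times the word appears.
how_many = sample.count("truth")
print(how_many)      # 2

# Position of the first occurrence (counting starts at 0).
first_at = sample.find("truth")
print(first_at)      # 4

# find() gives -1 when the word is not there.
missing = sample.find("beauty")
print(missing)       # -1

# Slicing from the position found gives back the word itself.
print(sample[first_at:first_at + len("truth")])
```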

**Question** How can we cut out only the dialogue?

In [None]:
fullDialogue.find("SOCRATES: My dear Phaedrus, whence come you, and whither are you going?")

### Questions

* What other sorts of data do we want to be able to load?
    * How might we do it?
* Where can you get data?

----
# Exercise: Concatenating Text

Can you create a single text from a set of web pages? 
* Google some interesting subject that you want to study
* Collect a set of at least 5 URLs that have interesting content on your subject
* Write a notebook that goes to each of those pages and gets the text
* Then concatenate the texts into one variable
* Count the number of characters
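
If you want a starting point, one possible shape for the notebook is sketched below. It is only a sketch: the function names ```fetch_text``` and ```concatenate_pages``` are invented for this example, the pages are assumed to be UTF-8 (characters that do not decode are replaced), and what comes back is the raw HTML of each page rather than cleaned-up text.

```python
import urllib.request


def fetch_text(url):
    """Download one page and return its contents as a string."""
    with urllib.request.urlopen(url) as response:
        # Assumes UTF-8; undecodable bytes become a replacement character.
        return response.read().decode("utf-8", errors="replace")


def concatenate_pages(urls, fetch=fetch_text):
    """Fetch every URL in the list and join the texts into one string."""
    texts = []
    for url in urls:
        texts.append(fetch(url))
    # Put a newline between pages so they do not run together.
    return "\n".join(texts)


# Example use (replace the list with your own five or more URLs):
# my_urls = ["https://www.gutenberg.org/ebooks/1636"]
# allText = concatenate_pages(my_urls)
# print(len(allText))
```

The ```fetch``` parameter is there so the joining step can be tried out with a stand-in function before going to the web.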


## Optional

Can you figure out how to write the concatenated text out to a text file for later use?