# File operations

You have seen how Python treats files as objects that are first opened, then read from or written to.

We will now put this into practice and work with real files.

The end result will be external data loaded into Python and stored as common Python data structures, such as lists and dictionaries. You can then work with the data in whatever way you need.


## Opening a file and reading it

In the `data` folder is a file named `poem.txt`.

As part of a `with`... `as` statement, use the `open()` function to access the file, then read its contents and store them in a variable named `poem`.

Then, look at what data type the variable is and how long it is.

In [11]:
#Add your code below

with open('data/poem.txt', 'r') as f:
    poem = f.read()

print(type(poem))
print(len(poem))


<class 'str'>
611


*__Windows users:__ If you get an error, you may need to use `'rb'` ('read binary') as the second argument to `open()` when reading files; if so, this applies to all questions that involve reading files. Note that in binary mode `read()` returns `bytes` rather than `str`.*
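
One common cause of read errors is a mismatch between the file's encoding and the platform default. The sketch below is an illustration only: the temporary file, the sample text and the choice of UTF-8 are assumptions, not part of the exercise data. It shows that text mode with an explicit `encoding` gives a `str`, while `'rb'` gives `bytes` that must be decoded before use.

```python
import os
import tempfile

# Hypothetical sample text; the accented character takes two bytes in UTF-8.
sample = "Fire burn and cauldron bubble.\nCafé\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # Write the raw UTF-8 bytes so the file content is the same on every platform.
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))

    # Text mode with an explicit encoding returns a str.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

    # Binary mode returns bytes.
    with open(path, "rb") as f:
        raw = f.read()

print(type(text))
print(type(raw))

# Decoding the bytes with the same encoding recovers the text.
print(raw.decode("utf-8") == text)
```

If you do read with `'rb'`, string methods such as `.splitlines()` will return `bytes` objects until you call `.decode()`.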

There are two ways to inspect a variable in a notebook. You can type the variable name on its own to see the raw data:

In [12]:
poem

"Three witches, casting a spell ...\n\nRound about the cauldron go;\nIn the poison'd entrails throw.\nToad, that under cold stone\nDays and nights hast thirty one\nSwelter'd venom sleeping got,\nBoil thou first i' the charmed pot.\n\nDouble, double toil and trouble;\nFire burn and cauldron bubble.\n\nFillet of a fenny snake,\nIn the cauldron boil and bake;\nEye of newt, and toe of frog,\nWool of bat, and tongue of dog,\nAdder's fork, and blind-worm's sting,\nLizard's leg, and howlet's wing,\nFor a charm of powerful trouble,\nLike a hell-broth boil and bubble.\n\nDouble, double toil and trouble;\nFire burn and cauldron bubble.\n"

Or you can use `print()`, which renders any special characters in the string, such as tabs or linebreaks:

In [13]:
print(poem)

Three witches, casting a spell ...

Round about the cauldron go;
In the poison'd entrails throw.
Toad, that under cold stone
Days and nights hast thirty one
Swelter'd venom sleeping got,
Boil thou first i' the charmed pot.

Double, double toil and trouble;
Fire burn and cauldron bubble.

Fillet of a fenny snake,
In the cauldron boil and bake;
Eye of newt, and toe of frog,
Wool of bat, and tongue of dog,
Adder's fork, and blind-worm's sting,
Lizard's leg, and howlet's wing,
For a charm of powerful trouble,
Like a hell-broth boil and bubble.

Double, double toil and trouble;
Fire burn and cauldron bubble.



Linebreaks are represented by the newline character, `\n`. This may look like two separate characters, but Python treats it as one!
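
A quick way to convince yourself of this is to measure the string with `len()`. The strings below are made-up examples rather than the contents of `poem.txt`.

```python
# "\n" is written with two keystrokes but stored as a single character.
newline = "\n"
print(len(newline))

# A raw string (r"...") keeps the backslash and the n as two separate characters.
raw_newline = r"\n"
print(len(raw_newline))

two_lines = "Double, double\ntoil and trouble"
print(repr(two_lines))  # shows the \n escape, like typing the variable name
print(two_lines)        # renders the linebreak
```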

Currently, our data is one long string. If we want each line of the poem to be its own string, there are two ways to do this.

First, use `readlines()` instead of `read()` to return a list of strings (the newline character `\n` will determine where each string ends):

In [14]:
with open('data/poem.txt', 'r') as f:
    poem = f.readlines()
    
poem


['Three witches, casting a spell ...\n',
 '\n',
 'Round about the cauldron go;\n',
 "In the poison'd entrails throw.\n",
 'Toad, that under cold stone\n',
 'Days and nights hast thirty one\n',
 "Swelter'd venom sleeping got,\n",
 "Boil thou first i' the charmed pot.\n",
 '\n',
 'Double, double toil and trouble;\n',
 'Fire burn and cauldron bubble.\n',
 '\n',
 'Fillet of a fenny snake,\n',
 'In the cauldron boil and bake;\n',
 'Eye of newt, and toe of frog,\n',
 'Wool of bat, and tongue of dog,\n',
 "Adder's fork, and blind-worm's sting,\n",
 "Lizard's leg, and howlet's wing,\n",
 'For a charm of powerful trouble,\n',
 'Like a hell-broth boil and bubble.\n',
 '\n',
 'Double, double toil and trouble;\n',
 'Fire burn and cauldron bubble.\n']
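
Notice that every string in the list still ends with `\n`. If that is unwanted, one option is to strip it from each line. The sketch below uses `io.StringIO` as a stand-in for an open file (an assumption made so the example runs without `data/poem.txt`).

```python
import io

# io.StringIO behaves like an open text file; the two lines are sample text.
fake_file = io.StringIO(
    "Round about the cauldron go;\nIn the poison'd entrails throw.\n"
)

lines = fake_file.readlines()
print(lines)

# Remove the trailing newline from each line, leaving other characters alone.
stripped = [line.rstrip("\n") for line in lines]
print(stripped)
```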

Second, we could process the data _after_ loading it. Since `read()` returns a `str`, the result has a `.splitlines()` method that chops it into a list of smaller strings, dropping the newline characters; use this method on `poem`:

In [15]:
with open('data/poem.txt', 'r') as f:
    poem = f.read()

poem = poem.splitlines()

poem


['Three witches, casting a spell ...',
 '',
 'Round about the cauldron go;',
 "In the poison'd entrails throw.",
 'Toad, that under cold stone',
 'Days and nights hast thirty one',
 "Swelter'd venom sleeping got,",
 "Boil thou first i' the charmed pot.",
 '',
 'Double, double toil and trouble;',
 'Fire burn and cauldron bubble.',
 '',
 'Fillet of a fenny snake,',
 'In the cauldron boil and bake;',
 'Eye of newt, and toe of frog,',
 'Wool of bat, and tongue of dog,',
 "Adder's fork, and blind-worm's sting,",
 "Lizard's leg, and howlet's wing,",
 'For a charm of powerful trouble,',
 'Like a hell-broth boil and bubble.',
 '',
 'Double, double toil and trouble;',
 'Fire burn and cauldron bubble.']
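
You may wonder why we use `.splitlines()` rather than `.split('\n')`. A small comparison on a made-up two-line string (not the full poem) shows the difference: when the text ends with a newline, `.split('\n')` leaves an extra empty string at the end.

```python
text = "Double, double toil and trouble;\nFire burn and cauldron bubble.\n"

# splitlines() drops the newlines and does not add a trailing empty string.
by_splitlines = text.splitlines()
print(by_splitlines)

# split("\n") treats the final newline as a separator, so the last item is ''.
by_split = text.split("\n")
print(by_split)

# keepends=True keeps the newline characters, as readlines() does.
with_ends = text.splitlines(keepends=True)
print(with_ends)
```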

## Opening docx files

Import the `docx2txt` package and use its `process()` function on the file at `data/air_quality.docx`, assigning the result to `doc_text`.

Check the `type()` of `doc_text`.

In [None]:
import docx2txt
#Add your code below
#doc_text = ...

doc_text = docx2txt.process('data/air_quality.docx')

print(doc_text)
type(doc_text)


Now that you have a string, use the `.splitlines()` method to return a list of the contents of each line, assigning the result to `lines`:

In [None]:
lines = doc_text.splitlines()
lines

Remove all blank lines (i.e. empty strings) from the list of `lines`, assigning the remaining list of lines to `content`:

- Don't forget that [Stack Overflow](https://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings) can help you!

In [None]:
content = []

# Keep only the lines that are not empty strings
for line in lines:
    if line != '':
        content.append(line)

In [None]:
content
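
The same filtering can be written more compactly as a list comprehension. The sketch below uses a made-up list in place of the real `lines`, and also shows a variant that drops lines containing only whitespace; whether your document has such lines is an assumption you would need to check.

```python
# Hypothetical stand-in for the list produced by doc_text.splitlines()
lines = ["First line", "", "   ", "Second line", ""]

# Equivalent to the for loop: keep everything except empty strings.
content = [line for line in lines if line != ""]
print(content)

# Variant: line.strip() is '' (falsy) for whitespace-only lines, so they go too.
content_no_whitespace = [line for line in lines if line.strip()]
print(content_no_whitespace)
```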