# Files and Libraries

## I. Libraries, Modules, and Functions

With Python alone, a programmer can perform some basic operations using simple functions like **print()**, **len()**, **max()**, **min()**, **sorted()** as well as some methods applied directly to particular data types (like **.lower()** and **.upper()** for strings).

However, to do more advanced or specialized things, we need to import (and sometimes first install) Python **libraries**, also known as **packages**. 

A **Python library** is a collection of files (known as **modules**) that each contain **functions** to complete a set of related tasks. 

*Confused?* 

*This can get confusing as some large libraries have multiple sub-packages each with many different modules. In other cases a library consists of a single module.* ***The important thing to know is that you need to import each library or module you want to use.***

Some commonly used modules are found in [Python's Standard Library](https://docs.python.org/3/library/). 




## II. Working with Libraries

For more on functions, we can refer to the more detailed explanation in [**Constellate's** Python 4 lesson](https://lab.constellate.org/perfusion-stearns-eliot/notebooks/tdm-notebooks-2023-04-03T23%3A17%3A07.601Z/python-basics-4.ipynb):

```
**Functions**

You can identify a function by the fact that it ends with a set of parentheses () where arguments can be passed into the function. Depending on the function (and your goals for using it), a function may accept no arguments, a single argument, or many arguments. For example, when we use the print() function, a string (or a variable containing a string) is passed as an argument.

Functions are a convenient shorthand, like a mini-program, that makes our code more modular. We don't need to know all the details of how the print() function works in order to use it. Functions are sometimes called "black boxes", in that we can put an argument into the box and a return value comes out. We don't need to know the inner details of the "black box" to use it. (Of course, as you advance your programming skills, you may become curious about how certain functions work. And if you work with sensitive data, you may need to peer in the black box to ensure the security and accuracy of the output.)

**Libraries & Modules**

While Python comes with many functions, there are thousands more that others have written. Adding them all to Python would create mass confusion, since many people could use the same name for functions that do different things. The solution then is that functions are stored in modules that can be imported for use. A module is a Python file (extension ".py") that contains the definitions for the functions written in Python. These modules (individual Python files) can then be collected into even larger groups called packages and libraries. Depending on how many functions you need for the program you are writing, you may import a single module, a package of modules, or a whole library.
```

1. First, we will import the [**math module**](https://docs.python.org/3/library/math.html).




The formula for importing a module is:

```
import module_name
```

In [None]:
import math

By importing the math module, for example, we have access to a wide range of mathematical functions (see the [math module documentation here](https://docs.python.org/3/library/math.html)). 

2. For example, we can calculate the square root of a large number (using the **math** module's **sqrt()** function) or look up the value of pi. The module also supports more advanced mathematical operations.

In [None]:
print(math.sqrt(34785739923))

In [None]:
print(math.pi)

3. Use the **help()** function to learn more about the math module.
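Because `help(math)` prints many pages of text, it can be easier to start small. The sketch below shows two lighter ways to explore a module: `dir()` to list the names it defines, and `help()` on a single function. The variable name `public_names` is only illustrative.

```python
import math

# List the public names (functions and constants) defined in the math module
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names[:10])

# Read the documentation for one function rather than the whole module
help(math.sqrt)
```

Calling `help(math)` instead prints the documentation for every function in the module.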

We can also import the [time module](https://docs.python.org/3/library/time.html) to apply different time functions. 



4. Import the **time** module and use the **help** function to learn more about what it can do.

In [None]:
import time
help(time) 

5. Run the following: 

In [None]:
# A program that waits 3 seconds then prints "Done"

import time # We import all the functions in the `time` module

print('Waiting 3 seconds...')
time.sleep(3) # We run the sleep() function from the time module using `time.sleep()`
print('Done')

We can also just import the sleep() function without importing the whole time module. The syntax is:

```
from module import function
```

6. Run the following, but change the wait time to 4 seconds:


In [None]:
# A program that waits 3 seconds then prints "Done"

from time import sleep # We import just the sleep() function from the time module


print('Waiting 3 seconds...')

sleep(3) # Notice that we just call the sleep() function, not time.sleep()
print('Done')

When writing code in Python, it helps immensely to use descriptive variable names. Python does not mind if you name all your variables "x012b" and "x2Ayn" (although a name cannot begin with a digit, so "2xAyn" would be rejected). But it helps the human readers of your code if you are as descriptive as possible with variable names. And, with auto-complete, you don't have to worry about typing your descriptive variable names in their entirety. Just type the first few letters of the variable name and a menu of pre-defined variables and functions will appear.

7. Read the code below. What does it do? Distinguish between the names of variables, functions, and modules.

In [None]:
import math  # needed for math.sqrt() below
import time
start_time = time.time()
large_num = 9027905973792487678342070492070492704992341896198123931989
sq_root = math.sqrt(large_num)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"On this machine / server, Python has calculated that the square root of {large_num} is {sq_root} in just {elapsed_time} seconds.")
print(sq_root ** 2)  # double-check the answer (approximate, because sq_root is a float)
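The double-check in the cell above will not reproduce the original number digit for digit, because `math.sqrt()` returns a float with limited precision. As a hedged aside, the sketch below compares it with `math.isqrt()`, which returns the exact integer part of the square root. It assumes Python 3.8 or later, and the names `approx_root` and `exact_root` are illustrative.

```python
import math

large_num = 9027905973792487678342070492070492704992341896198123931989

# math.sqrt() returns a float, which keeps only about 15-17 significant digits
approx_root = math.sqrt(large_num)

# math.isqrt() returns the exact integer part of the square root (Python 3.8+)
exact_root = math.isqrt(large_num)

print(approx_root)
print(exact_root)

# The integer root is the largest whole number whose square fits in large_num
print(exact_root ** 2 <= large_num < (exact_root + 1) ** 2)
```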

8. Notice the "f" before the string in the cell above. This denotes a **formatted string** (also called an f-string). Inside a formatted string, we can place a variable name within "{}" and Python will replace it with the value stored in that variable. For example, try changing the string values assigned to the variables below:

In [None]:
person = ""  
adj = ""
adv = ""
verb = ""
obj = ""
time_period = ""
print(f"During {time_period}, {person} {verb} the {adj} {obj} {adv}.")
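As a small illustration of formatted strings, the sketch below fills in a few variables and also shows a format specifier, which controls how a number is displayed. All of the values are made up for the example.

```python
# Hypothetical sample values
person = "Ada"
verb = "debugged"
obj = "program"
elapsed_time = 0.000123456

# Each {name} is replaced with the value stored in that variable
sentence = f"{person} {verb} the {obj}."
print(sentence)

# A format specifier after a colon controls the display (here, 6 decimal places)
timing = f"Elapsed: {elapsed_time:.6f} seconds"
print(timing)
```

A specifier such as `:.6f` would make the long `elapsed_time` value in step 7 easier to read.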

Or we can use Python's [datetime library](https://docs.python.org/3/library/datetime.html) to work with... you guessed it, dates and times. It overlaps with **time** in some respects but also offers tools that **time** does not.

Sometimes, for larger libraries, we may want to import only individual modules, classes, or functions, following the formula:

```
from [library name] import name1, name2, etc.
```

9. Try:

In [None]:
from datetime import date, datetime

In [None]:
date.today()

In [None]:
now = datetime.now()
print(now)
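Once we have a date or datetime object, we can pull out its parts, format it as text, or do arithmetic with it. The sketch below uses a fixed, made-up moment (rather than `datetime.now()`) so that the output is the same on every run.

```python
from datetime import date, datetime

# A fixed, made-up moment so the results are reproducible
moment = datetime(2023, 4, 3, 23, 17, 7)

# Individual parts are available as attributes
print(moment.year, moment.month, moment.day)

# strftime() formats the datetime as a string
stamp = moment.strftime("%Y-%m-%d %H:%M")
print(stamp)

# Subtracting one date from another gives a timedelta
days_left = (date(2023, 12, 25) - moment.date()).days
print(days_left)
```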

## III. Additional Practice with Libraries and Modules

There are also many modules and libraries designed to work with strings and texts. 

For example:
+ [string](https://docs.python.org/3/library/string.html?highlight=string) - for common string functions
+ [re](https://docs.python.org/3/library/re.html?highlight=re#module-re) - regular expression operations. Regular expressions allow you to search for particular patterns in texts. For example, to find phone numbers in a text, you would search for all strings that match the pattern ###-###-#### or ###-####. Or, to find an email address, you might search for continuous strings that include an "@" sign followed by ".com", ".edu", or ".org". Regular expressions help you do this.
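The phone number and email searches described above can be sketched as follows. The sample text and both patterns are illustrative assumptions. They are deliberately simplified and would miss many real-world phone and email formats.

```python
import re

# Made-up sample text
text = "Call 555-867-5309 or 867-5309; write to ada@example.edu."

# Three digits and a dash (optional), then three digits, a dash, and four digits
phone_pattern = r"\b(?:\d{3}-)?\d{3}-\d{4}\b"
phones = re.findall(phone_pattern, text)
print(phones)

# Characters, an "@" sign, more characters, then one of a few endings
email_pattern = r"\b[\w.+-]+@[\w-]+\.(?:com|edu|org)\b"
emails = re.findall(email_pattern, text)
print(emails)
```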

10. Run the following:

In [None]:
import string, re
print(string.punctuation)

11. Run and read the code below. It uses functions from both the **re** and **string** modules. Can you figure out what it does and how it works?

In [None]:
sent = "[To be or not to be]: that is the question?!?"
punct = string.punctuation
new_sent = re.sub(f"[{re.escape(punct)}]", "", sent)  # re.escape() keeps characters like ] and \ from being read as regex syntax
print(new_sent)


## IV. Working with Files

An essential skill in Python is navigating the files on your computer, whether to read existing files into Python or to write new ones. 

To navigate through files on your computer, we will use the **pathlib** library. (The older **os** library offers similar tools; its import is commented out below because we will not need it here.) 

12. Let's import it now.

In [None]:
#import os
from pathlib import Path

13. Examine what the following lines do. Hint: **cwd()** means "current working directory."

In [None]:
print(Path.cwd())
print(Path.cwd().parent)
print(Path.cwd().parent.parent)
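Paths can also be built up piece by piece. The sketch below joins folder and file names with the `/` operator and then walks back up with `.parent`. The folder and file names are hypothetical and nothing is read from disk.

```python
from pathlib import Path

# Join path pieces with the / operator (hypothetical names)
speech_path = Path("shared") / "data" / "speech.txt"
print(speech_path)

# .parent moves up one level each time it is used
print(speech_path.parent)
print(speech_path.parent.parent)

# .parts splits the path into its individual components
print(speech_path.parts)
```

Passing several pieces to `Path()` at once, as in `Path("shared", "data", "speech.txt")`, gives the same result.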

14. We can build a path to a dataset of texts using the following code:

In [None]:
#sotudir = Path(Path.cwd().parent, "state-of-the-union-dataset", "txt")
sotudir = Path("~/shared/RR-workshop-data/state-of-the-union-dataset/txt").expanduser()
print(sotudir)

15. Note, "sotudir" merely saves a filepath. To see if this filepath actually links to a real folder, try:

In [None]:
sotudir.exists()
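Besides **exists()**, a path can also report whether it points to a folder or to a file. The sketch below is a minimal illustration that creates a temporary folder so it can run anywhere. The file names in it are made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    present = folder / "notes.txt"          # a file we create below
    missing = folder / "no_such_file.txt"   # a path that points to nothing
    present.write_text("hello", encoding="utf-8")

    checks = {
        "folder exists": folder.exists(),
        "folder is a directory": folder.is_dir(),
        "present is a file": present.is_file(),
        "missing exists": missing.exists(),
    }

print(checks)
```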

16. We can then print out all files ending in the ".txt" extension using:

In [None]:
pathlist = sorted(sotudir.glob("*.txt")) 
print(pathlist)

17. For each path in the pathlist, we can extract only the name of the file (rather than the whole path) using the **.name** attribute. For more on pathlib functions and methods, see the [pathlib documentation](https://docs.python.org/3/library/pathlib.html).

In [None]:
print([path.name for path in pathlist])

18. We can also extract just the file extension from each file using the **.suffix** attribute. To keep only the unique file types, we can then pass the list of extensions to the **set()** function. Try:

In [None]:
print(set([path.suffix for path in pathlist]))
print(set([1,2,3,2,3,2,4,2,1,5,8,9,9,1]))
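To see how **.name**, **.stem**, and **.suffix** relate to each other, the sketch below applies them to a short list of paths. Only `Washington_1794.txt` comes from this lesson. The other filenames are hypothetical, and no files need to exist for this to run.

```python
from pathlib import Path

# A small, partly made-up list of paths
sample_paths = [
    Path("txt", "Washington_1794.txt"),
    Path("txt", "Adams_1797.txt"),
    Path("txt", "README.md"),
]

names = [path.name for path in sample_paths]      # filename with extension
stems = [path.stem for path in sample_paths]      # filename without extension
suffixes = {path.suffix for path in sample_paths}  # unique extensions, as a set

print(names)
print(stems)
print(suffixes)
```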

19. We can open an individual file using the **open()** function. 

*Note*: In Python, it is recommended that you always close your files after finishing with them. One way to do this is to place an **open()** command within a **with statement**. This way, the file is closed as soon as we exit the indented block underneath the with statement. Another way is to call **close()** on the file immediately after extracting the information you need from it. The cell below uses the first approach. See [Why Close Python Files](https://realpython.com/why-close-file-python/). 

In [None]:
with open(Path(sotudir, 'Washington_1794.txt')) as f:
    wash94 = f.read()
print(wash94[:300])
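The second approach mentioned in the note, closing the file yourself, might look like the sketch below. It writes a small stand-in file to a temporary folder so that it runs without the dataset. The file name and its contents are made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp, "sample.txt")   # hypothetical stand-in file
    sample_path.write_text("Fellow Citizens of the Senate", encoding="utf-8")

    f = open(sample_path, encoding="utf-8")
    try:
        text = f.read()
    finally:
        f.close()   # runs even if reading the file raises an error

    print(text)
    print(f.closed)
```

The **with statement** does the same cleanup automatically, which is why it is usually the more convenient choice.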

## Exercise

20. Pick another State of the Union Address (see #17 above for a list of filenames). Read it in (within a **with** statement), and print out the last 400 characters of this address.

21. Now calculate the length of this address. 

## V. Iterating through Multiple Files using For Loops

22. Using a for loop, we can then iterate through each file (whose path is saved in pathlist) and extract information from each:

In [None]:
pathlist = sorted(sotudir.glob('*.txt'))       # glob() yields each path only once, but sorted() stores them in a list we can reuse
for path in pathlist:
    fn = path.stem
    print(fn)
    with open(path) as f:
        txt = f.read()
    print(len(txt))
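Rather than only printing each result, we can also collect what we extract into a dictionary for later use. The sketch below shows the idea on a temporary folder of tiny made-up files, so it runs without the dataset. The file names and the `lengths` variable are illustrative.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create a few small stand-in files
    (folder / "b_speech.txt").write_text("four", encoding="utf-8")
    (folder / "a_speech.txt").write_text("sixsix", encoding="utf-8")
    (folder / "notes.md").write_text("ignored", encoding="utf-8")

    lengths = {}
    for path in sorted(folder.glob("*.txt")):   # only the .txt files, in sorted order
        with open(path, encoding="utf-8") as f:
            lengths[path.stem] = len(f.read())  # map each filename to its length

print(lengths)
```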

## Exercises (Part V)

23. Using the **pathlib** module, create a filepath to the U.S. Inaugural Addresses dataset.

24. Create a "pathlist" of all files in the U.S. Inaugural Addresses folder that are plain text (.txt) files. Then, print out the name of each of these files.

25. Iterate through the Inaugural Addresses using a for loop. Print out the first 250 characters of each address. 