# Files and Libraries

## I. Libraries, Modules, and Functions

With Python alone, a programmer can perform some basic operations using simple functions like **print()**, **len()**, **max()**, **min()**, **sorted()** as well as some methods applied directly to particular data types (like **.lower()** and **.upper()** for strings).

However, to do more advanced or specialized things, we need to install (if necessary) and import Python **libraries**, also known as **packages**.

A **Python library** is a collection of files (known as **modules**) that each contain **functions** to complete a set of related tasks. 

*Confused?* 

*This can get confusing as some large libraries have multiple sub-packages each with many different modules. In other cases a library consists of a single module.* ***The important thing to know is that you need to import each library or module you want to use.***

Some commonly used modules are found in [Python's Standard Library](https://docs.python.org/3/library/). 




## II. Working with Libraries

For more on functions, we can refer to the more detailed explanation in [**Constellate's** Python 4 lesson](https://lab.constellate.org/perfusion-stearns-eliot/notebooks/tdm-notebooks-2023-04-03T23%3A17%3A07.601Z/python-basics-4.ipynb):

```
**Functions**

You can identify a function by the fact that it ends with a set of parentheses () where arguments can be passed into the function. Depending on the function (and your goals for using it), a function may accept no arguments, a single argument, or many arguments. For example, when we use the print() function, a string (or a variable containing a string) is passed as an argument.

Functions are a convenient shorthand, like a mini-program, that makes our code more modular. We don't need to know all the details of how the print() function works in order to use it. Functions are sometimes called "black boxes", in that we can put an argument into the box and a return value comes out. We don't need to know the inner details of the "black box" to use it. (Of course, as you advance your programming skills, you may become curious about how certain functions work. And if you work with sensitive data, you may need to peer in the black box to ensure the security and accuracy of the output.)

**Libraries & Modules**

While Python comes with many functions, there are thousands more that others have written. Adding them all to Python would create mass confusion, since many people could use the same name for functions that do different things. The solution then is that functions are stored in modules that can be imported for use. A module is a Python file (extension ".py") that contains the definitions for the functions written in Python. These modules (individual Python files) can then be collected into even larger groups called packages and libraries. Depending on how many functions you need for the program you are writing, you may import a single module, a package of modules, or a whole library.
```

1. First, we will import the [**math module**](https://docs.python.org/3/library/math.html).




The formula for importing a module is:

```
import module_name
```

In [107]:
import math

By importing the math module, for example, we have access to a wide range of mathematical functions (see the [math module documentation here](https://docs.python.org/3/library/math.html)). 

2. For example, we can calculate the square root of a large number using the **math** module's **sqrt()** function, or look up the value of pi. The module also supports more advanced mathematical operations.

In [108]:
print(math.sqrt(34785739923))

186509.35612724634


In [109]:
print(math.pi)

3.141592653589793


3. Use the **help()** function to learn more about a module or function.

In [110]:
help(math)

Help on built-in module math:

NAME
    math

DESCRIPTION
    This module provides access to the mathematical functions
    defined by the C standard.

FUNCTIONS
    acos(x, /)
        Return the arc cosine (measured in radians) of x.
        
        The result is between 0 and pi.
    
    acosh(x, /)
        Return the inverse hyperbolic cosine of x.
    
    asin(x, /)
        Return the arc sine (measured in radians) of x.
        
        The result is between -pi/2 and pi/2.
    
    asinh(x, /)
        Return the inverse hyperbolic sine of x.
    
    atan(x, /)
        Return the arc tangent (measured in radians) of x.
        
        The result is between -pi/2 and pi/2.
    
    atan2(y, x, /)
        Return the arc tangent (measured in radians) of y/x.
        
        Unlike atan(y/x), the signs of both x and y are considered.
    
    atanh(x, /)
        Return the inverse hyperbolic tangent of x.
    
    ceil(x, /)
        Return the ceiling of x as an Integral.
      

We can also import the [time module](https://docs.python.org/3/library/time.html) to apply different time functions. 



4. Use the **help** function to learn more about what the time module can do.

In [111]:
import time
help(time) 

Help on built-in module time:

NAME
    time - This module provides various functions to manipulate time values.

DESCRIPTION
    There are two standard representations of time.  One is the number
    of seconds since the Epoch, in UTC (a.k.a. GMT).  It may be an integer
    or a floating point number (to represent fractions of seconds).
    The Epoch is system-defined; on Unix, it is generally January 1st, 1970.
    The actual value can be retrieved by calling gmtime(0).
    
    The other representation is a tuple of 9 integers giving local time.
    The tuple items are:
      year (including century, e.g. 1998)
      month (1-12)
      day (1-31)
      hours (0-23)
      minutes (0-59)
      seconds (0-59)
      weekday (0-6, Monday is 0)
      Julian day (day in the year, 1-366)
      DST (Daylight Savings Time) flag (-1, 0 or 1)
    If the DST flag is 0, the time is given in the regular time zone;
    if it is 1, the time is given in the DST time zone;
    if it is -1, mktime() sh

5. Run the following: 

In [112]:
# A program that waits 3 seconds then prints "Done"

import time # We import all the functions in the `time` module

print('Waiting 3 seconds...')
time.sleep(3) # We run the sleep() function from the time module using `time.sleep()`
print('Done')

Waiting 3 seconds...
Done


We can also just import the sleep() function without importing the whole time module. The syntax is:

```
from module import function
```

6. Run the following, but change the wait time to 4 seconds:


In [113]:
# A program that waits 3 seconds then prints "Done"

from time import sleep # We import just the sleep() function from the time module


print('Waiting 3 seconds...')

sleep(3) # Notice that we just call the sleep() function, not time.sleep()
print('Done')

Waiting 3 seconds...
Done


When writing code in Python, it helps immensely to use descriptive variable names. Python does not mind if you name all your variables "x012b" and "xAyn2" (a name may not begin with a digit, though, so "2xAyn" would raise an error). But it helps the human readers of your code if variable names are as descriptive as possible. And with auto-complete, you don't have to type your descriptive variable names in their entirety. Just type the first few letters of the variable name (in many notebook environments, followed by the Tab key) and a menu of pre-defined variables and functions will appear.

7. Read the code below. What does it do? Distinguish between the names of variables, functions, and modules.

In [114]:
import math  # needed for math.sqrt() below
import time
start_time = time.time()
large_num = 9027905973792487678342070492070492704992341896198123931989
sq_root = math.sqrt(large_num)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"On this machine / server, Python has calculated that the square root of {large_num} is {sq_root} in just {elapsed_time} seconds.")
print(sq_root ** 2)  ##double-checking the answer

On this machine / server, Python has calculated that the square root of 9027905973792487678342070492070492704992341896198123931989 is 9.501529336792309e+28 in just 0.0 seconds.
9.02790597379249e+57
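
The `0.0 seconds` reported above may reflect the limited resolution of `time.time()` on some systems rather than a truly instantaneous calculation. Here is a minimal sketch of an alternative, assuming we swap in `time.perf_counter()` (a standard-library clock intended for timing short intervals) and add `math.isqrt()` (Python 3.8+) to get an exact integer square root. Neither function appears in the original cell.

```python
import math
import time

large_num = 9027905973792487678342070492070492704992341896198123931989

start = time.perf_counter()          # high-resolution clock for measuring short durations
float_root = math.sqrt(large_num)    # floating-point result (limited precision)
int_root = math.isqrt(large_num)     # exact integer part of the square root (Python 3.8+)
elapsed = time.perf_counter() - start

print(f"Float root: {float_root}")
print(f"Integer root: {int_root}")
print(f"Elapsed: {elapsed:.9f} seconds")

# isqrt() guarantees int_root**2 <= large_num < (int_root + 1)**2
print(int_root ** 2 <= large_num < (int_root + 1) ** 2)
```

Because `sq_root ** 2` in the cell above uses floating-point numbers, it only approximates the original value. The integer check here is exact.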


8. Notice the "f" before the string above. This denotes a **formatted string** (often called an f-string). In a formatted string, we can insert variable names inside "{}" and Python will replace each name with the value stored in that variable. For example, try changing the string values assigned to the variables below:

In [115]:
person = ""  
adj = ""
adv = ""
verb = ""
obj = ""
time_period = ""
print(f"During {time_period}, {person} {verb} the {adj} {obj} {adv}.")

During ,   the   .
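
As a sketch of what a filled-in version might look like (the values below are invented purely for illustration), here is the same sentence with non-empty strings. The second part shows that the braces can also hold expressions, and that a format specifier after a colon controls how a number is displayed.

```python
person = "Ada"
verb = "debugged"
adj = "stubborn"
obj = "program"
adv = "patiently"
time_period = "the workshop"

sentence = f"During {time_period}, {person} {verb} the {adj} {obj} {adv}."
print(sentence)

# Expressions and format specifiers are also allowed inside the braces
elapsed_time = 0.000123456
timing = f"Elapsed: {elapsed_time:.6f} seconds; doubled: {elapsed_time * 2:.6f}"
print(timing)
```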


We can also use Python's [datetime library](https://docs.python.org/3/library/datetime.html) to work with... you guessed it, dates and times. It overlaps with **time** in some respects but also offers additional tools, such as dedicated date and datetime objects.

Sometimes, for larger libraries, we may want to import only particular modules, classes, or functions, following the formula:

```
from [library name] import name1, name2, etc.
```

9. Try:

In [116]:
from datetime import date, datetime

In [117]:
date.today()

datetime.date(2023, 4, 4)

In [118]:
now = datetime.now()
print(now)

2023-04-04 00:07:42.118961
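
To illustrate what **datetime** adds beyond **time**, here is a small sketch of date arithmetic and custom formatting. The dates are fixed, arbitrary examples chosen so the results are reproducible. `timedelta` and `strftime()` are part of the standard `datetime` module but are not used elsewhere in this lesson.

```python
from datetime import date, datetime, timedelta

# Fixed example dates (chosen arbitrarily so the results are reproducible)
workshop_day = date(2023, 4, 4)
new_year = date(2024, 1, 1)

days_left = (new_year - workshop_day).days   # subtracting dates gives a timedelta
print(days_left)

one_week_later = workshop_day + timedelta(weeks=1)
print(one_week_later)

# strftime() turns a datetime into a custom-formatted string
stamp = datetime(2023, 4, 4, 0, 7, 42)
formatted = stamp.strftime("%Y-%m-%d %H:%M")
print(formatted)
```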


## III. Additional Practice with Libraries and Modules

There are also many modules and libraries designed to work with strings and texts. 

For example:
+ [string](https://docs.python.org/3/library/string.html?highlight=string) - common string constants and helper functions
+ [re](https://docs.python.org/3/library/re.html?highlight=re#module-re) - regular expression operations (regular expressions allow you to search for particular patterns in texts. For example, to find phone numbers in a text, you would search for all strings that match the pattern ###-###-#### or ###-####. Or, to find an email address, you might search for continuous strings that include an "@" sign followed by ".com" or ".edu". Regular expressions help you do this.)
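
The phone number and email patterns described above can be sketched as follows. The sample text is invented, and both patterns are deliberately simplified for illustration. Real-world phone numbers and email addresses come in many more shapes than these patterns cover.

```python
import re

# Invented sample text for illustration
text = "Call 603-555-0123 or 555-0199, or write to reader@example.edu or help@example.com."

# Optional three-digit area code, then ###-####
phone_pattern = r"\b(?:\d{3}-)?\d{3}-\d{4}\b"
phones = re.findall(phone_pattern, text)
print(phones)

# A deliberately simple email pattern: name, "@", domain, then ".com" or ".edu"
email_pattern = r"[\w.+-]+@[\w.-]+\.(?:com|edu)\b"
emails = re.findall(email_pattern, text)
print(emails)
```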

10. Run the following:

In [119]:
import string, re
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


11. Run and read the code below. It uses the **sub()** function from the **re** library and the **punctuation** constant from the **string** library. Can you figure out what it does and how it works?

In [120]:
sent = "[To be or not to be]: that is the question?!?"
punct = string.punctuation
new_sent = re.sub(f"[{punct}]", "", sent)
print(new_sent)


To be or not to be that is the question
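
The pattern above works partly because the backslash inside `string.punctuation` happens to escape the `]` that follows it. A slightly more defensive sketch, assuming we would rather not rely on that coincidence, escapes every special character with `re.escape()`. `str.translate()` is shown as a regex-free alternative. Neither is used in the original cell.

```python
import re
import string

sent = "[To be or not to be]: that is the question?!?"

# re.escape() backslash-escapes characters that are special in a regular expression
pattern = f"[{re.escape(string.punctuation)}]"
cleaned_regex = re.sub(pattern, "", sent)

# str.maketrans("", "", chars) builds a table that deletes every character in chars
table = str.maketrans("", "", string.punctuation)
cleaned_translate = sent.translate(table)

print(cleaned_regex)
print(cleaned_translate)
```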


## IV. Working with Files

An essential skill in Python is being able to navigate through files on your computer, either to read existing files into Python or to write new files.

To navigate through files on your computer, we will use the **pathlib** library. (The older **os** library offers similar tools, but we will not need it here, so its import is commented out below.)

12. Let's import it now.

In [121]:
#import os
from pathlib import Path

13. Examine what the following lines do. Hint: **cwd()** means "current working directory."

In [122]:
print(Path.cwd())
print(Path.cwd().parent)
print(Path.cwd().parent.parent)

c:\Users\F0040RP\Documents\DartLib_RDS\Python
c:\Users\F0040RP\Documents\DartLib_RDS
c:\Users\F0040RP\Documents


14. We can build a path to a dataset of texts using the following code:

In [123]:
#sotudir = Path(Path.cwd().parent, "state-of-the-union-dataset", "txt")
sotudir = Path("~/shared/RR-workshop-data/state-of-the-union-dataset/txt").expanduser()
print(sotudir)

C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt


15. Note that "sotudir" merely stores a filepath. To check whether this filepath actually points to a real folder, try:

In [124]:
sotudir.exists()

True

16. We can then print out all files ending in the ".txt" extension using:

In [125]:
pathlist = sorted(sotudir.glob("*.txt")) 
print(pathlist)

[WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1797.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1798.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1799.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1800.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1825.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1826.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1827.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1828.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Arthur_1881.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-unio
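
To experiment with **exists()** and **glob()** without the workshop dataset, here is a minimal sketch that builds a throwaway folder using the standard `tempfile` module. The folder and the file names in it are invented for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)

    # Create three small files; only two of them end in ".txt"
    for filename in ["Adams_1797.txt", "Bush_1989.txt", "notes.csv"]:
        Path(demo_dir, filename).write_text("sample text", encoding="utf-8")

    folder_exists = demo_dir.exists()
    missing_exists = Path(demo_dir, "no_such_folder").exists()

    # glob() yields matches lazily; sorted() collects them into an ordered list
    txt_names = [path.name for path in sorted(demo_dir.glob("*.txt"))]

print(folder_exists, missing_exists)
print(txt_names)
```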

17. For each path in the pathlist, we can extract only the name of the file (rather than the whole path) using the **.name** attribute. For more on pathlib methods and attributes, see the [pathlib documentation](https://docs.python.org/3/library/pathlib.html).

In [126]:
print([path.name for path in pathlist])

['Adams_1797.txt', 'Adams_1798.txt', 'Adams_1799.txt', 'Adams_1800.txt', 'Adams_1825.txt', 'Adams_1826.txt', 'Adams_1827.txt', 'Adams_1828.txt', 'Arthur_1881.txt', 'Arthur_1882.txt', 'Arthur_1883.txt', 'Arthur_1884.txt', 'Buchanan_1857.txt', 'Buchanan_1858.txt', 'Buchanan_1859.txt', 'Buchanan_1860.txt', 'Buren_1837.txt', 'Buren_1838.txt', 'Buren_1839.txt', 'Buren_1840.txt', 'Bush_1989.txt', 'Bush_1990.txt', 'Bush_1991.txt', 'Bush_1992.txt', 'Bush_2001.txt', 'Bush_2002.txt', 'Bush_2003.txt', 'Bush_2004.txt', 'Bush_2005.txt', 'Bush_2006.txt', 'Bush_2007.txt', 'Bush_2008.txt', 'Carter_1978.txt', 'Carter_1979.txt', 'Carter_1980.txt', 'Carter_1981.txt', 'Cleveland_1885.txt', 'Cleveland_1886.txt', 'Cleveland_1887.txt', 'Cleveland_1888.txt', 'Cleveland_1893.txt', 'Cleveland_1894.txt', 'Cleveland_1895.txt', 'Cleveland_1896.txt', 'Clinton_1993.txt', 'Clinton_1994.txt', 'Clinton_1995.txt', 'Clinton_1996.txt', 'Clinton_1997.txt', 'Clinton_1998.txt', 'Clinton_1999.txt', 'Clinton_2000.txt', 'Coolid

18. We can also extract just the file extension from each file, using the **.suffix** attribute. To keep only the unique file types, we can then pass the list of file extensions to the **set()** function. Try:

In [127]:
print(set([path.suffix for path in pathlist]))
print(set([1,2,3,2,3,2,4,2,1,5,8,9,9,1]))

{'.txt'}
{1, 2, 3, 4, 5, 8, 9}
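
The pieces of a path can be inspected even when the file does not exist on disk. Here is a short sketch that uses a made-up relative path modelled on the dataset's file names.

```python
from pathlib import Path

# A path does not have to exist on disk for its parts to be inspected
path = Path("state-of-the-union-dataset", "txt", "Adams_1797.txt")

print(path.name)         # full filename
print(path.stem)         # filename without the extension
print(path.suffix)       # extension, including the dot
print(path.parent.name)  # name of the containing folder

# These are attributes, so no parentheses are needed after them
president, year = path.stem.split("_")
print(president, int(year))
```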


19. We can open an individual file using the **open()** function. 

*Note*: In Python, it is recommended that you always close your files when you are finished with them. One way to do this is to place the **open()** call within a **with statement**. This way, the file is closed as soon as we exit the indented block underneath the with statement. Another way is to call **close()** on the file immediately after extracting the information you need from it. The cell below uses the first approach. See [Why Close Python Files](https://realpython.com/why-close-file-python/). 

In [128]:
with open(Path(sotudir, 'Washington_1794.txt')) as f:
    wash94 = f.read()
print(wash94[:300])

Fellow-Citizens of the Senate and House of Representatives:

When we call to mind the gracious indulgence of Heaven by which the
American people became a nation; when we survey the general prosperity of
our country, and look forward to the riches, power, and happiness to which
it seems destined, wit
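
The second approach mentioned in the note, calling **close()** yourself, is not shown in the cell above. Here is a minimal sketch of both approaches side by side. It uses a temporary file so that it runs without the workshop data, and the file name and contents are placeholders.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp, "demo_address.txt")
    demo_path.write_text("Fellow-Citizens of the Senate", encoding="utf-8")

    # Option 1: the with statement closes the file when the block ends
    with open(demo_path, encoding="utf-8") as f:
        text_with = f.read()
    closed_after_with = f.closed

    # Option 2: open, read, then close explicitly
    f2 = open(demo_path, encoding="utf-8")
    try:
        text_manual = f2.read()
    finally:
        f2.close()   # runs even if read() raises an error
    closed_after_manual = f2.closed

print(text_with == text_manual)
print(closed_after_with, closed_after_manual)
```

The with statement is usually the more convenient choice, since it closes the file even if an error occurs inside the block.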


## Exercise

20. Choose another State of the Union Address (see #17 above for a list of filenames). Read it in (within a **with** statement), and print out the last 400 characters of this address.

21. Now calculate the length of this address. 

## V. Iterating through Multiple Files using For Loops

22. Using a for loop, we can then iterate through each file (whose path is saved in pathlist) and extract information from each:

In [132]:
pathlist = sorted(sotudir.glob('*.txt'))  # glob() returns a one-pass generator; sorted() stores the results in a reusable list
for path in pathlist:
    fn = path.stem  # filename without the ".txt" extension
    print(fn)
    with open(path) as f:
        txt = f.read()
    print(len(txt))  # number of characters in the address

Adams_1797
12440
Adams_1798
13362
Adams_1799
9204
Adams_1800
8349
Adams_1825
53953
Adams_1826
46443
Adams_1827
42442
Adams_1828
44163
Arthur_1881
24150
Arthur_1882
19026
Arthur_1883
23821
Arthur_1884
55191
Buchanan_1857
82015
Buchanan_1858
98487
Buchanan_1859
74052
Buchanan_1860
84247
Buren_1837
68889
Buren_1838
69847
Buren_1839
80109
Buren_1840
54991
Bush_1989
27817
Bush_1990
21396
Bush_1991
22395
Bush_1992
26606
Bush_2001
25293
Bush_2002
22617
Bush_2003
31842
Bush_2004
30575
Bush_2005
29839
Bush_2006
31413
Bush_2007
31963
Bush_2008
33803
Carter_1978
26530
Carter_1979
19510
Carter_1980
20090
Carter_1981
217947
Cleveland_1885
120992
Cleveland_1886
92835
Cleveland_1887
31647
Cleveland_1888
55422
Cleveland_1893
76748
Cleveland_1894
97755
Cleveland_1895
89753
Cleveland_1896
94919
Clinton_1993
39214
Clinton_1994
42280
Clinton_1995
51285
Clinton_1996
36346
Clinton_1997
38998
Clinton_1998
42215
Clinton_1999
43552
Clinton_2000
44204
Coolidge_1923
41107
Coolidge_1924
42466
Coolidge_1925
66252
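
Printing works for a quick look, but storing the results makes them easier to reuse. Here is a sketch that collects each length in a dictionary keyed by the file's stem. It assumes a small temporary folder standing in for `sotudir`, and the file names and contents are invented.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    samples = {"Adams_1797.txt": "short", "Bush_1989.txt": "a longer text"}
    for filename, contents in samples.items():
        Path(demo_dir, filename).write_text(contents, encoding="utf-8")

    lengths = {}
    for path in sorted(demo_dir.glob("*.txt")):
        with open(path, encoding="utf-8") as f:
            lengths[path.stem] = len(f.read())

print(lengths)

# max() with key=lengths.get returns the key with the largest value
longest = max(lengths, key=lengths.get)
print(longest)
```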


## Exercises (Part V)

23. Using the **pathlib** module, create a filepath to the U.S. Inaugural Addresses dataset.

In [130]:
inaugdir = Path("~/shared/RR-workshop-data/US_Inaugural_Addresses").expanduser()
inaugdir.exists()

True

24. Create a "pathlist" of all files in the U.S. Inaugural Addresses folder that are plain text (.txt) files. Then, print out the names of each of these files.

In [131]:
pathlist = sorted(inaugdir.glob("*.txt")) 
print([path.name for path in pathlist])

['._01_washington_1789.txt', '._02_washington_1793.txt', '._03_adams_john_1797.txt', '._04_jefferson_1801.txt', '._05_jefferson_1805.txt', '._06_madison_1809.txt', '._07_madison_1813.txt', '._08_monroe_1817.txt', '._09_monroe_1821.txt', '._10_adams_john_quincy_1825.txt', '._11_jackson_1829.txt', '._12_jackson_1833.txt', '._13_van_buren_1837.txt', '._14_harrison_1841.txt', '._15_polk_1845.txt', '._16_taylor_1849.txt', '._17_pierce_1853.txt', '._18_buchanan_1857.txt', '._19_lincoln_1861.txt', '._20_lincoln_1865.txt', '._21_grant_1869.txt', '._22_grant_1873.txt', '._23_hayes_1877.txt', '._24_garfield_1881.txt', '._25_cleveland_1885.txt', '._26_harrison_1889.txt', '._27_cleveland_1893.txt', '._28_mckinley_1897.txt', '._29_mckinley_1901.txt', '._30_roosevelt_theodore_1905.txt', '._31_taft_1909.txt', '._32_wilson_1913.txt', '._33_wilson_1917.txt', '._34_harding_1921.txt', '._35_coolidge_1925.txt', '._36_hoover_1929.txt', '._37_roosevelt_franklin_1933.txt', '._38_roosevelt_franklin_1937.txt',
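
The names listed above all begin with `._`. Files with that prefix are typically metadata files that macOS creates when copying to other file systems, not the addresses themselves, so reading them as text may fail or return gibberish. Here is a sketch of one way to skip them. It assumes the real addresses sit in the same folder under the same names without the prefix, which the truncated listing above does not confirm.

```python
from pathlib import PurePath

# A stand-in for a folder listing that mixes "._" files with regular ones
names = ["._01_washington_1789.txt", "01_washington_1789.txt",
         "._02_washington_1793.txt", "02_washington_1793.txt"]
pathlist = [PurePath(name) for name in names]

# Keep only the paths whose filename does not start with "._"
clean_paths = [path for path in pathlist if not path.name.startswith("._")]
print([path.name for path in clean_paths])
```

With a real folder, the same list comprehension could be applied to the result of `sorted(inaugdir.glob("*.txt"))`.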

25. Iterate through the Inaugural Addresses using a for loop. Print out the first 250 characters of each address. 