# Intro to Python Notebook 2: Working with Texts (and other data)

**A Reproducible Research Workshop**

(A Collaboration between Dartmouth Library and Research Computing)

[*Click here to view or register for our current list of workshops*](http://dartgo.org/RRADworkshops)

*This notebook created by*:
+ Version 1.0: Jeremy Mikecz, Research Data Services (Dartmouth Library)
+ Version 2.0: ???
<!--
+ Some of the inspiration for the code and information in this notebook was taken from https://www.w3schools.com/python/python_intro.asp -- This is a great resource if you want to learn more about Python!-->

This is **Notebook 2** of 3 for the **Introduction to Text Analysis in Python** workshop:
+ Notebook 1: The Basics - getting started with Python
+ **Notebook 2: Working with Texts (and other data) - importing, reviewing, and modifying texts and other data**
+ Notebook 3: Dataframes - importing texts and other data, placing this data into a dataframe, and then modifying, analyzing, visualizing, and exporting this data

In this lesson, you will learn how to:
1. import texts and other data files from your computer
2. place data in lists and modify and analyze those lists
3. iterate through lists and an entire directory of files
4. write functions to create reproducible code
<!--1. import multiple text and data files from local folders
2. extract data from these files and read this data into data tables (known as "dataframes" in Python)-->

**Table of Contents**

+ I. Built-In Functions and Methods
+ II. Python Libraries
+ III. Working with Lists
+ IV. Looping through Lists
+ V. Writing Functions
+ VI. Working with Files
+ VII. Reading in Text Files
+ VIII. Applying Functions to Files

<!--
+ II. Working with Files
+ III. Writing Functions
+ IV. Working with Lists
+ V. Looping through Lists
+ VI. Looping through Files-->

## I. Python Built-In Functions and Methods

For more on functions, we can refer to a more detailed lesson provided by [**Constellate's** Python 4 lesson](https://lab.constellate.org/perfusion-stearns-eliot/notebooks/tdm-notebooks-2023-04-03T23%3A17%3A07.601Z/python-basics-4.ipynb):

**Functions**

```
You can identify a function by the fact that it ends with a set of parentheses () where arguments can be passed into the function. Depending on the function (and your goals for using it), a function may accept no arguments, a single argument, or many arguments. For example, when we use the print() function, a string (or a variable containing a string) is passed as an argument.

Functions are a convenient shorthand, like a mini-program, that makes our code more modular. We don't need to know all the details of how the print() function works in order to use it. Functions are sometimes called "black boxes", in that we can put an argument into the box and a return value comes out. We don't need to know the inner details of the "black box" to use it. (Of course, as you advance your programming skills, you may become curious about how certain functions work. And if you work with sensitive data, you may need to peer in the black box to ensure the security and accuracy of the output.)
```

### Ia. Built-in functions

With Python alone, a programmer can perform some basic operations using simple functions introduced in Notebook 1, such as **print()**, **len()**, **max()**, **min()**, and **sorted()**. To call a built-in function, write the name of the function followed by parentheses containing any arguments you want to pass in:

```
name_of_function(argument1, argument2, ...)
```

Arguments (a.k.a. "parameters") are often, but not always, optional.
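As a rough illustration of these cases, the sketch below calls three built-in functions with no arguments, a required argument, and an optional argument. The list `scores` is made up for this example.

```python
# print() can be called with no arguments at all: it prints an empty line.
print()

# len() requires exactly one argument; calling len() with none raises a TypeError.
scores = [3, 1, 2]   # hypothetical data for illustration
print(len(scores))

# sorted() requires one argument (the data to sort) and accepts optional
# keyword arguments such as reverse, which defaults to False.
print(sorted(scores))
print(sorted(scores, reverse=True))
```

Optional arguments such as `reverse` are usually passed by name, which keeps the call readable.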

As you may recall, to print (output) some information, you would call the **print()** function passing in a text string you want to print. 

1. Try entering some of the following commands in the code cells below. Compare the results:

```
print()
print("Good morning! How are you?")
your_name = "Bob"
print("Hello", your_name)
print("Hello", your_name, "!")
print("Hello ", your_name, "!", sep = "")

country_code = "+1"
area_code = "555"
phone_num = "555-0123"
print(country_code, area_code, phone_num, sep = "-")
```

In [207]:
print()




In [208]:
print("Good morning! How are you?")

Good morning! How are you?


In [209]:
your_name = "Bob"
print("Hello", your_name)

Hello Bob


In [210]:
print("Hello", your_name, "!")

Hello Bob !


In [211]:
print("Hello ", your_name, "!", sep = "")

Hello Bob!


In [212]:
country_code = "+1"
area_code = "555"
phone_num = "555-0123"
print(country_code, area_code, phone_num, sep = "-")

+1-555-555-0123


To learn more about a particular function we can access its documentation using:

```
?function_name
```

For example:

In [213]:
?print

Signature: print(*args, sep=' ', end='\n', file=None, flush=False)
Docstring:
Prints the values to a stream, or to sys.stdout by default.

sep
  string inserted between values, default a space.
end
  string appended after the last value, default a newline.
file
  a file-like object (stream); defaults to the current sys.stdout.
flush
  whether to forcibly flush the stream.
Type:      builtin_function_or_method
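The `?` shortcut is a feature of Jupyter/IPython rather than of Python itself. As a rough equivalent that should also work in a plain Python session, the built-in **help()** function prints similar documentation. The variable name `doc_text` below is chosen only for this sketch.

```python
# help() is built into Python and prints a function's documentation.
help(print)

# The raw documentation text is also stored on the function itself.
doc_text = print.__doc__
print(doc_text)
```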

### Ib. Methods

Methods work much like functions. The key difference is that:
* functions are independent and can be called by their name only
* methods act on objects of a particular class. In plain terms, some methods work only on text strings. Others work only on integers or dataframes.

Thus, the syntax for calling a method is:

```
object_name.method_name()
```
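To make the contrast concrete, here is a small sketch that calls a function and a method on the same string. The variable `greeting` is invented for this example.

```python
greeting = "hello"   # hypothetical example string

# A function is called by name, with the object passed in as an argument.
print(len(greeting))

# A method is called on the object itself, using dot notation.
print(greeting.upper())

# String methods exist only on strings: an integer has no .upper() method,
# so calling it raises an AttributeError.
try:
    (123).upper()
except AttributeError as error:
    print("AttributeError:", error)
```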

Examples of [common methods for text strings](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) include:
+ str.capitalize()
+ str.encode()
+ str.endswith()
+ str.startswith()
+ str.lower()
+ str.islower()
+ str.upper()
+ str.isupper()
+ str.strip()
+ str.replace()
+ str.split()

Note: "str" above refers to either a raw text string such as `"hello!"` or a variable that contains a string such as `question = "what is happening?"`. To test if an object is a string (str), use the **type()** function, such as: 

```
type(123)
type("Hello")
type("123")
```

2. Try typing these commands in the cell below.

In [214]:
print(type(123))
print(type("Hello"))
type("123")           #print function not necessary for last line in cell

<class 'int'>
<class 'str'>


str

3. Let's see what some of these string methods do:

In [215]:
sent = "  two roads diverged in a wood and I - I took the one less traveled by, and that has made all the difference.   "

In [216]:
type(sent)

str

In [217]:
sent.upper()

'  TWO ROADS DIVERGED IN A WOOD AND I - I TOOK THE ONE LESS TRAVELED BY, AND THAT HAS MADE ALL THE DIFFERENCE.   '

In [218]:
sent.split()

['two',
 'roads',
 'diverged',
 'in',
 'a',
 'wood',
 'and',
 'I',
 '-',
 'I',
 'took',
 'the',
 'one',
 'less',
 'traveled',
 'by,',
 'and',
 'that',
 'has',
 'made',
 'all',
 'the',
 'difference.']

In [219]:
print(sent)
print(sent.strip())

  two roads diverged in a wood and I - I took the one less traveled by, and that has made all the difference.   
two roads diverged in a wood and I - I took the one less traveled by, and that has made all the difference.


In [220]:
sent.endswith(".")

False

In [221]:
sent.strip().endswith(".")   #chaining methods

True

In [222]:
sent = sent.strip()
sent.replace("two", "three").replace("one", "two")

'three roads diverged in a wood and I - I took the two less traveled by, and that has made all the difference.'

In [223]:
encoded_name = "Andrés González & Günter Schröder".encode()
print(encoded_name)
print(type(encoded_name))


b'Andr\xc3\xa9s Gonz\xc3\xa1lez & G\xc3\xbcnter Schr\xc3\xb6der'
<class 'bytes'>


In [224]:
decoded_name = encoded_name.decode()
print(decoded_name)
print(type(decoded_name))

Andrés González & Günter Schröder
<class 'str'>


4. We can also apply built-in functions to a string object. We outputted information above using the print() function. We can also use the **len()** [length] function.

In [225]:
len(sent)

107

We will examine some other methods below in the section on lists.


### *A note on bytes and strings*

Byte strings:
+ are intended for storage on a computer
+ are what character strings must be *encoded* into to be read by a computer

Strings:
+ are intended for human readers
+ are what byte strings must be *decoded* into to be read correctly by humans
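The short sketch below illustrates the round trip between the two types. It assumes UTF-8, which is the default for `.encode()` and `.decode()`, and uses a made-up variable `name`.

```python
name = "Günter"   # hypothetical example string

# Encode the character string into bytes (UTF-8 is the default encoding
# used by str.encode()).
name_bytes = name.encode("utf-8")

# "ü" is one character but takes two bytes in UTF-8, so the lengths differ.
print(len(name))
print(len(name_bytes))

# Decoding with the same encoding recovers the original string.
print(name_bytes.decode("utf-8") == name)

# Decoding with a different encoding (here Latin-1) produces garbled text.
print(name_bytes.decode("latin-1"))
```

Garbled characters in imported text are often a sign that a file was decoded with a different encoding than the one used to write it.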

## II. Python Libraries

Beyond the basic functionality provided by Python's built-in functions and methods, if we want to do more advanced or specialized things, we need to install and import Python **libraries**, also known as **packages**.

A **Python library** is a collection of files (known as **modules**) that each contain **functions** and/or **methods** to complete a set of related tasks. 

*Confused?* 

*This can get confusing as some large libraries have multiple sub-packages each with many different modules. In other cases a library consists of a single module.* ***The important thing to know is that you need to import each library or module you want to use.***

For more on Python libraries, we can refer to a more detailed lesson provided by [**Constellate's** Python 4 lesson](https://lab.constellate.org/perfusion-stearns-eliot/notebooks/tdm-notebooks-2023-04-03T23%3A17%3A07.601Z/python-basics-4.ipynb):


```
While Python comes with many functions, there are thousands more that others have written. Adding them all to Python would create mass confusion, since many people could use the same name for functions that do different things. The solution then is that functions are stored in modules that can be imported for use. A module is a Python file (extension ".py") that contains the definitions for the functions written in Python. These modules (individual Python files) can then be collected into even larger groups called packages and libraries. Depending on how many functions you need for the program you are writing, you may import a single module, a package of modules, or a whole library.
```

Some commonly used modules are found in [Python's Standard Library](https://docs.python.org/3/library/) - *these are installed with Python and require no separate installation.* 

Other libraries need to be installed first and then imported.

### IIb. Working with core Python Libraries

5. First, we will import the [**math module**](https://docs.python.org/3/library/math.html).

The syntax for importing a module or library is:

```
import module_name
```

In [226]:
import math

6. Try experimenting with some functions from the math module (see the [documentation here](https://docs.python.org/3/library/math.html))

In [227]:
math.pi

3.141592653589793

In [228]:
math.floor(2.99999)

2

In [229]:
math.dist((2,3), (-4, 1))

6.324555320336759

In [230]:
math.sqrt(64)

8.0
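The import statement comes in a few common variants. The sketch below shows three of them using the math module; the alias `m` is an arbitrary choice for illustration.

```python
# Import the whole module; functions are accessed with the module name as a prefix.
import math
print(math.sqrt(64))

# Import one specific function; it can then be called without the prefix.
from math import floor
print(floor(2.99999))

# Import a module under a shorter alias (the alias "m" is arbitrary).
import math as m
print(m.pi)
```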

7. There are a variety of Python libraries and modules that help us work with texts. One interesting module that comes with the Python Standard Library is [**difflib**](https://docs.python.org/3/library/difflib.html). It allows us to compare the difference between sequences of text.

Examine the code cells below. They apply the **ndiff** function from the difflib library to two lists of words. Examine the results:

In [231]:
import difflib
sent1 = "What in the world is going on over there?".split()
sent2 = "What the heck is goin' on down there?".split()


In [232]:
from difflib import ndiff
diff = ndiff(sent1, sent2)
print('\n'.join(diff))



  What
- in
  the
- world
+ heck
  is
- going
?     ^

+ goin'
?     ^

  on
- over
+ down
  there?
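Besides listing differences, difflib can also summarize how similar two sequences are. The sketch below uses the module's **SequenceMatcher** class on the same two sentences; the variable names `matcher` and `similarity` are chosen only for this example.

```python
from difflib import SequenceMatcher

sent1 = "What in the world is going on over there?".split()
sent2 = "What the heck is goin' on down there?".split()

# The first argument (None) means no elements are treated as "junk".
matcher = SequenceMatcher(None, sent1, sent2)

# ratio() returns a float between 0 (nothing in common) and 1 (identical).
similarity = matcher.ratio()
print(round(similarity, 2))
```

Because the sentences were split into words first, the score reflects shared words rather than shared characters.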


## III. Working with Lists

There are two basic Python data structures for storing ordered sequences of information. These are **lists** and **tuples**. 

+ **Lists** are enclosed by `[]` and each item is separated by a comma.
+ Lists are *mutable*, meaning they can be modified (items can be added, modified, deleted)
+ **Tuples** are enclosed by `()` and each item is separated by a comma.
+ Tuples are *immutable*, meaning that, once created, they cannot be modified.

```
a_list_of_numbers = [4, 6, 2]
a_tuple_of_numbers = (33, 17, 42, 2)
```

You can read more about [the differences between tuples and lists here](https://builtin.com/software-engineering-perspectives/python-tuples-vs-lists). Below, however, we focus on lists.
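A minimal sketch of what mutable and immutable mean in practice, using the two example sequences above. The `[0]` notation refers to the first item of a sequence; indexing is covered in section IIIb.

```python
a_list_of_numbers = [4, 6, 2]
a_tuple_of_numbers = (33, 17, 42, 2)

# Lists are mutable: an item can be replaced in place.
a_list_of_numbers[0] = 99
print(a_list_of_numbers)

# Tuples are immutable: the same operation raises a TypeError.
try:
    a_tuple_of_numbers[0] = 99
except TypeError as error:
    print("TypeError:", error)

# Reading items and len() work the same way on both.
print(a_tuple_of_numbers[0], len(a_tuple_of_numbers))
```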


### IIIa. Creating Lists

Storing data in individual variables makes sense when you have a few unique values. 

However, operating on many individual variables would be tedious and time-consuming. 

Instead, we can store multiple values under one variable using lists. Say, for example, you want to store quiz scores for a class or multiple race results for a single track athlete. You could store them in a list.

```
**Scores - Quiz 1**   
73.5
86.2
81.9
90.1
57.8
88.0
```

8. Lists require two principal symbols:

* ```[]``` to store values
* ```,``` to separate values

So the list of scores for the first quiz could be stored by:

In [233]:
# Run this cell by pressing Ctrl+Enter or pressing the play button while selecting this cell.
quiz1 = [73.5, 86.2, 81.9, 90.1, 57.8, 88.0]

9. Lists can contain any type of data. For example, a list of strings may look like this:

In [234]:
student_names = ["Ezekiel", "Maura", "Aly", "Xavi", "Donato", "Jenny"]

10. Or booleans, like:

In [235]:
passed = [True, True, True, True, False, True]

Note: Python only recognizes `True` and `False` (capitalized, without quotation marks) as booleans, not `TRUE` or `false`.

11. You can print out the contents of these lists using the built-in function **print()**.

In [236]:
print(quiz1)

[73.5, 86.2, 81.9, 90.1, 57.8, 88.0]


11. To check to see if a variable name contains a list, use the **type()** function.

In [237]:
type(passed)

list

12. You can also calculate the total number of items in a list using the built-in function **len()**. Run the following code:

In [238]:
len(quiz1)

6

13. We can run some basic functions to retrieve other information from a list.

In [239]:
print(max(quiz1))
print(min(quiz1))
print(max(student_names))
min(student_names)         #no print function necessary for last line in a cell

90.1
57.8
Xavi


'Aly'

14. *As we already learned, **len()** works on character strings as well as lists.* But it counts character strings differently. Run the following code and then calculate the length of each variable. What do you notice?

In [240]:
sent = "This is a sentence."
words = sent.split()
print(words)

['This', 'is', 'a', 'sentence.']


In [241]:
#calculate the length of "sent" and "words".
print(len(sent))
print(len(words))

19
4


<div class="alert alert-info" role="alert" style="color:blue"><h3 style="color:blue;">Exercises for Part III</h3>
    
<p style="color:blue;">15. Create a list of at least five of your favorite numbers (or numbers that recall a significant event, person, or stage of your life). Save to the variable **fav_numbers**.</p>
</div>

In [242]:
fav_numbers = [3, 7, 11, 79, 111, 2012, 2015, 2021, 88888888]

<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">16. Create a list of at least five people that have inspired you. Save to the variable **inspirational_people**.</p></div>

In [243]:
inspirational_people = ["Homer Simpson", "Abe Simpson", "Groundskeeper Willie", "Sideshow Bob", "Lisa Simpson", "Marge", "Lenny", "Karl"]

<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">17. Now, print out each of these lists.</p></div>

In [244]:
print(fav_numbers)
print(inspirational_people) #note: for long variable names you can just type the first few letters and a list of matching variables saved in memory will appear,
                             #or you can press TAB to auto-complete the variable name
inspirational_people    #print is not necessary to output the last line in a cell, but notice how the output is formatted differently

[3, 7, 11, 79, 111, 2012, 2015, 2021, 88888888]
['Homer Simpson', 'Abe Simpson', 'Groundskeeper Willie', 'Sideshow Bob', 'Lisa Simpson', 'Marge', 'Lenny', 'Karl']


['Homer Simpson',
 'Abe Simpson',
 'Groundskeeper Willie',
 'Sideshow Bob',
 'Lisa Simpson',
 'Marge',
 'Lenny',
 'Karl']

<div class="alert alert-info" role="alert" style="color:blue"><p>18. Print out the length of each of these lists.</p>
</div>

In [245]:
print(len(fav_numbers))
len(inspirational_people)

9


8

<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">19. Print out the maximum value from each list.</p>
</div>

In [246]:
print(max(fav_numbers))
print(max(inspirational_people))

88888888
Sideshow Bob


### IIIb. List Indexes and Slices

We can retrieve portions of the list using indexes and slices. 

One important note: in Python, the first item in a sequence always has index 0.

20. Thus, to retrieve the first item in our list (that is, the first quiz score), you would simply run:

In [247]:
quiz1[0]

73.5

<div class="alert alert-info" role="alert" style="color:blue">
    <p>21. Using the same format, now try retrieving the 3rd person from your list of inspirational people.</p>
</div>

In [248]:
inspirational_people[2]

'Groundskeeper Willie'

Now imagine you have a long list. You can identify the last item in a list by using the index [-1]. For example:

```
name_of_list[-1]
```
<div class="alert alert-info" role="alert" style="color:blue">
    <p><b>22. Code Together</b>: Try that below with our list of quiz scores.</p>
</div>


In [249]:
quiz1[-1]

88.0

<div class="alert alert-info" role="alert" style="color:blue">
    <p>23. Indexing beyond the end of the list will produce an "IndexError". Try, for example, retrieving the 100th item in our list of quiz scores.</p>
</div>

In [250]:
#quiz1[99]     #will raise an IndexError if uncommented

To retrieve multiple, consecutive items from a list, we can use **slices**.

The format is as follows:

```
name_of_list[start:end]

```

A slice begins at the start index and stops one item before the end index.

So, **"start"** is the index of the first item to retrieve (counting from zero) and
**"end"** is the index of the last item to retrieve, plus one.

24. Run the cell below

In [251]:
quiz1[0:2]

[73.5, 86.2]

<div class="alert alert-info" role="alert" style="color:blue">
    <h3> Exercise</h3>
    <p>25. Using this format, retrieve the second through fourth item in our list. Compare the results to our original list. If you did it wrong, adjust the indices you are using.</p>
</div>

In [252]:
# remember that indexing starts at 0!
quiz1[1:4] 

[86.2, 81.9, 90.1]
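Slices also accept a few shorthand forms. The sketch below uses a copy of the six quiz scores, stored under the made-up name `scores` so that `quiz1` is left untouched.

```python
scores = [73.5, 86.2, 81.9, 90.1, 57.8, 88.0]  # copy of the quiz scores above

# Omitting "start" begins the slice at the first item.
print(scores[:3])

# Omitting "end" continues the slice through the last item.
print(scores[3:])

# Negative numbers count from the end: this returns the last two items.
print(scores[-2:])

# Unlike indexing, slicing past the end does not raise an IndexError.
print(scores[4:100])
```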

### IIIc. Lists with Operators, Functions, and list methods

26. Create two lists of integers. Try applying different operators to the lists: `+ - * / **`.

In [253]:
a = [3,2,1,4]
b = [5,8,7,6]

In [254]:
a + b

[3, 2, 1, 4, 5, 8, 7, 6]

In [255]:
#a * b
a * 2

[3, 2, 1, 4, 3, 2, 1, 4]

In [256]:
#b - a

In [257]:
c = a + b + [29,13,62]
print(c)

[3, 2, 1, 4, 5, 8, 7, 6, 29, 13, 62]


27. As with most things in Python, there are almost always multiple ways to perform a particular task. For example, to sort a list, you may use either the **sorted()** function or the **.sort()** method. The main difference is that:
+ the **.sort()** method permanently sorts the values of the list
+ while the **sorted()** function returns a new, sorted copy of the list and leaves the original unchanged (unless you save the result over the old list name or under a new list name)

In [258]:
print(sorted(c))  # the sorted function only outputs a sorted list; to save it, type: c = sorted(c)
print(c)

[1, 2, 3, 4, 5, 6, 7, 8, 13, 29, 62]
[3, 2, 1, 4, 5, 8, 7, 6, 29, 13, 62]


In [259]:
c.sort()     # the .sort() method sorts a list in place
print(c)

[1, 2, 3, 4, 5, 6, 7, 8, 13, 29, 62]


In [260]:
c.sort(reverse = True)
print(c)

[62, 29, 13, 8, 7, 6, 5, 4, 3, 2, 1]
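One common pitfall follows from this difference: because **.sort()** works in place, it returns `None` rather than the sorted list. A small sketch, using a made-up list `numbers_to_sort`:

```python
numbers_to_sort = [3, 1, 2]   # hypothetical list for illustration

# .sort() changes the list in place and returns None,
# so assigning its result to a variable does not capture the sorted list.
result = numbers_to_sort.sort()
print(result)
print(numbers_to_sort)

# sorted() returns a new list, so its result can be assigned safely.
descending = sorted(numbers_to_sort, reverse=True)
print(descending)
```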


In [261]:
%whos

Variable                 Type             Data/Info
---------------------------------------------------
Initials                 function         <function Initials at 0x000002BA609100E0>
Path                     type             <class 'pathlib.Path'>
a                        list             n=4
actor_inits              list             n=597
actorlist                list             n=597
area_code                str              555
average                  function         <function average at 0x000002BA60912FC0>
b                        list             n=4
c                        list             n=11
collections              module           <module 'collections' fro<...>ollections\\__init__.py'>
country_code             str              +1
decoded_name             str              Andrés González & Günter Schröder
diff                     generator        <generator object Differ.<...>re at 0x000002BA5F151800>
difflib                  module           <module 'difflib' from '

### IIId. Modifying a List

There are multiple ways to modify a list. These include the following methods:

+ **.append()**
+ **.extend()**
+ **.pop()**


28. Try applying these methods to some of the lists we have already created.

In [262]:
print(quiz1)
quiz1.append(96.6) # appending an element to the list
print(quiz1)

#run this cell multiple times to see what happens

[73.5, 86.2, 81.9, 90.1, 57.8, 88.0]
[73.5, 86.2, 81.9, 90.1, 57.8, 88.0, 96.6]


The **append** method modifies a list by [*insert your explanation here*].

Now, let's compare it to the extend method.

In [263]:
print(quiz1)
quiz_group2 = [78.7, 94.0, 89.3]
quiz1.extend(quiz_group2)
print(quiz1)

[73.5, 86.2, 81.9, 90.1, 57.8, 88.0, 96.6]
[73.5, 86.2, 81.9, 90.1, 57.8, 88.0, 96.6, 78.7, 94.0, 89.3]


The **extend** method modifies a list by ....

Can you guess what pop does?

In [264]:
print(quiz1)
quiz1.pop()
print(quiz1)

[73.5, 86.2, 81.9, 90.1, 57.8, 88.0, 96.6, 78.7, 94.0, 89.3]
[73.5, 86.2, 81.9, 90.1, 57.8, 88.0, 96.6, 78.7, 94.0]


The **pop** method modifies a list by ...

Often, there are multiple ways to accomplish the same thing. For example:

In [265]:
print(quiz1)
quiz_group3 = [88.9, 93.3, 98.6]
quiz1 = quiz1 + quiz_group3 # notice that we've used '+' here!
print(quiz1)

[73.5, 86.2, 81.9, 90.1, 57.8, 88.0, 96.6, 78.7, 94.0]
[73.5, 86.2, 81.9, 90.1, 57.8, 88.0, 96.6, 78.7, 94.0, 88.9, 93.3, 98.6]


In [266]:
# We can also use built-in functions with lists
max(quiz1) # returns the maximum value in a list

98.6

In [267]:
# sorting a list
sorted(student_names, reverse=True)

['Xavi', 'Maura', 'Jenny', 'Ezekiel', 'Donato', 'Aly']

<div class="alert alert-info" role="alert" style="color:blue"><h3 style="color:blue;">Exercises for Part IIId</h3>
    
<p style="color:blue;">29. Create a list of some of your favorite musicians, actors, or writers (at least 4).</p>
</div>

In [268]:
other_favs = ["The Edge", "Adele", "Fergie", "Gaga, Lady", "Cher", "Bono", "Drake"]    #I'm using musicians who go by one name because I am lazy.

<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">30. Add that list to your list of inspirational people. Print out this new list.</p></div>

In [269]:
inspirational_people.extend(other_favs)
print(inspirational_people)

['Homer Simpson', 'Abe Simpson', 'Groundskeeper Willie', 'Sideshow Bob', 'Lisa Simpson', 'Marge', 'Lenny', 'Karl', 'The Edge', 'Adele', 'Fergie', 'Gaga, Lady', 'Cher', 'Bono', 'Drake']


<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">31. Calculate the number of people in your inspirational_people list. Then replace the 3rd to last person with another important person to you.</p></div>

In [270]:
print(len(inspirational_people))

15


In [271]:
inspirational_people[-3] = "Chuck"
print(inspirational_people)

['Homer Simpson', 'Abe Simpson', 'Groundskeeper Willie', 'Sideshow Bob', 'Lisa Simpson', 'Marge', 'Lenny', 'Karl', 'The Edge', 'Adele', 'Fergie', 'Gaga, Lady', 'Chuck', 'Bono', 'Drake']


<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">32. Multiply your list of inspirational_people by two:</p>

```
print(inspirational_people * 2)
```

What happens?
</div>

In [272]:
print(inspirational_people * 2)

['Homer Simpson', 'Abe Simpson', 'Groundskeeper Willie', 'Sideshow Bob', 'Lisa Simpson', 'Marge', 'Lenny', 'Karl', 'The Edge', 'Adele', 'Fergie', 'Gaga, Lady', 'Chuck', 'Bono', 'Drake', 'Homer Simpson', 'Abe Simpson', 'Groundskeeper Willie', 'Sideshow Bob', 'Lisa Simpson', 'Marge', 'Lenny', 'Karl', 'The Edge', 'Adele', 'Fergie', 'Gaga, Lady', 'Chuck', 'Bono', 'Drake']


33. Run the following code. What does the **set()** function do?

In [273]:
numbers = [12,4,5,2,3,2,2,4,7,8,9,2,3,1,2,1,1,2,1,4,67,3,5,3,5,7,4,1,3,2,4,1,7,2,16,23,4,5]
print(set(numbers))

{1, 2, 3, 4, 5, 67, 7, 8, 9, 12, 16, 23}


34. We can also sort lists. Run the following. How do they differ?

In [274]:
print(sorted(numbers))
print(sorted(numbers, reverse = True))
print(set(sorted(numbers)))
print(sorted(set(numbers)))

[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 7, 7, 7, 8, 9, 12, 16, 23, 67]
[67, 23, 16, 12, 9, 8, 7, 7, 7, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
{1, 2, 3, 4, 5, 67, 7, 8, 9, 12, 16, 23}
[1, 2, 3, 4, 5, 7, 8, 9, 12, 16, 23, 67]


<div class="alert alert-info" role="alert" style="color:blue"><h3 style="color:blue;">Exercises for Part IIId (continued)</h3>
    
<p style="color:blue;">35. Print out a list of your inspirational people, sorted in descending order.</p></div>

In [275]:
print(sorted(inspirational_people, reverse = True))

['The Edge', 'Sideshow Bob', 'Marge', 'Lisa Simpson', 'Lenny', 'Karl', 'Homer Simpson', 'Groundskeeper Willie', 'Gaga, Lady', 'Fergie', 'Drake', 'Chuck', 'Bono', 'Adele', 'Abe Simpson']


<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">36. Calculate the class's average quiz score using functions and methods introduced in this lesson.</p></div>

In [276]:
def average(numlist):
    avg = sum(numlist) / len(numlist)
    return(avg)

list_of_nums = [5, 3, 7, 13, 6]
average(list_of_nums)

6.8
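The same function can be applied to a list of quiz scores. The sketch below uses the six original quiz scores, stored under the made-up name `quiz_scores`, and compares the result with the `mean()` function from the standard library's statistics module as a cross-check.

```python
from statistics import mean

def average(numlist):
    # Same logic as the cell above: total divided by the number of items.
    return sum(numlist) / len(numlist)

# The six original quiz scores (before any were appended).
quiz_scores = [73.5, 86.2, 81.9, 90.1, 57.8, 88.0]

print(average(quiz_scores))
print(mean(quiz_scores))
```

The two results may differ in the last decimal places because they are computed in slightly different ways.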

<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">37. Apply the **min()** and **max()** functions to your list of inspirational_people. What happens?</p></div>

In [277]:
print(min(inspirational_people))
print(max(inspirational_people))


Abe Simpson
The Edge


## IV. Looping through Lists

Often we want to examine or modify each individual item in a list. An easy way to iterate over a list is using a **for loop**. The general structure of a for loop is:

```
for item in named_list:
    [instructions for what to do with each item]
```

In such for loops, named_list must be an already established list. "item", however, is an arbitrary variable name we are assigning to each item in the list. 

38. Run the simple for loop below:


In [278]:
for name in student_names:  #note: student_names is an already defined list; *name* is assigned here to each individual item in this list
    print(name)

Ezekiel
Maura
Aly
Xavi
Donato
Jenny
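Two built-in helpers are often used with for loops: **enumerate()** and **zip()**. The sketch below assumes that the names and the six original quiz scores line up by position, which the notebook does not state explicitly; the list name `scores` is made up for this example.

```python
student_names = ["Ezekiel", "Maura", "Aly", "Xavi", "Donato", "Jenny"]
scores = [73.5, 86.2, 81.9, 90.1, 57.8, 88.0]  # assumed to match the names by position

# enumerate() yields each item together with its index (starting at 0).
for position, name in enumerate(student_names):
    print(position, name)

# zip() pairs up items from two lists, one pair per loop iteration.
for name, score in zip(student_names, scores):
    print(name, score)
```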


39. We can use for loops to modify items in a list. 

For example, see what happens when we apply the **.lower()** method to each item in our sorted_names list.

In [279]:
sorted_names = sorted(student_names)
for name in sorted_names:
    print(name.lower())

aly
donato
ezekiel
jenny
maura
xavi


<div class="alert alert-info" role="alert" style="color:blue"><h3 style="color:blue;">Exercises for Part IV </h3>
    
<p style="color:blue;">40. Can you guess how to convert all student names into uppercase? Try to do so below:</p>
</div>

In [280]:
sorted_names = sorted(student_names)
for name in sorted_names:
    print(name.upper())   #note: this just prints out each name that has been converted to upper case. To save these new upper-cased names into memory see the next step:

ALY
DONATO
EZEKIEL
JENNY
MAURA
XAVI


41. To save these variations, we need to save them to a new list. We can do so using the following formula:

```
new_list = []  #creates a new, empty list
for item in existing_list:
    new_item = [modify original item]
    new_list.append(new_item)
```

In [281]:
student_names_lower = []
for name in sorted_names:
    lower_name = name.lower()
    student_names_lower.append(lower_name)

42. Print out this new list:

In [282]:
print(student_names_lower)

['aly', 'donato', 'ezekiel', 'jenny', 'maura', 'xavi']


We can do the same thing, but more concisely, with a **list comprehension**. The formula for a list comprehension is:

```
new_list = [new_item for item in existing_list]

```

43. An example is below, this time using the **.swapcase()** method for strings:

In [283]:
student_names_swapped = [name.swapcase() for name in sorted_names]
print(student_names_swapped)

['aLY', 'dONATO', 'eZEKIEL', 'jENNY', 'mAURA', 'xAVI']
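A list comprehension can also filter items by adding an `if` condition at the end: `[new_item for item in existing_list if condition]`. A brief sketch, in which the four-character threshold is an arbitrary choice for illustration:

```python
sorted_names = ["Aly", "Donato", "Ezekiel", "Jenny", "Maura", "Xavi"]

# Keep only the names longer than four characters.
long_names = [name for name in sorted_names if len(name) > 4]
print(long_names)

# A transformation and a filter can be combined in one comprehension.
long_names_upper = [name.upper() for name in sorted_names if len(name) > 4]
print(long_names_upper)
```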


<div class="alert alert-info" role="alert" style="color:blue"><h3 style="color:blue;">Exercises for Part IV (continued)</h3>
    
<p style="color:blue;">44. Create a new list of your inspirational people, but converted to upper case.</p>
</div>

In [284]:

# we can do this using a traditional for loop
upper_list = []

for name in inspirational_people:
    upper_list.append(name.upper())

print(upper_list)

['HOMER SIMPSON', 'ABE SIMPSON', 'GROUNDSKEEPER WILLIE', 'SIDESHOW BOB', 'LISA SIMPSON', 'MARGE', 'LENNY', 'KARL', 'THE EDGE', 'ADELE', 'FERGIE', 'GAGA, LADY', 'CHUCK', 'BONO', 'DRAKE']


In [285]:
# or we can use a list comprehension
upper_list = [name.upper() for name in inspirational_people]
print(upper_list)

['HOMER SIMPSON', 'ABE SIMPSON', 'GROUNDSKEEPER WILLIE', 'SIDESHOW BOB', 'LISA SIMPSON', 'MARGE', 'LENNY', 'KARL', 'THE EDGE', 'ADELE', 'FERGIE', 'GAGA, LADY', 'CHUCK', 'BONO', 'DRAKE']


<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">45. You are the students' instructor. You have decided to grade their quizzes on a curve (see variable "quiz1" from the beginning of this lesson), increasing each grade by 10%. Create a new list storing students' updated scores.</p>
</div>

In [286]:
print(quiz1)
quiz1_curved = []
for grade in quiz1:
    quiz1_curved.append(grade * 1.1)

#also could do: quiz1_curved = [grade * 1.1 for grade in quiz1]
print(quiz1_curved)


[73.5, 86.2, 81.9, 90.1, 57.8, 88.0, 96.6, 78.7, 94.0, 88.9, 93.3, 98.6]
[80.85000000000001, 94.82000000000001, 90.09000000000002, 99.11, 63.580000000000005, 96.80000000000001, 106.26, 86.57000000000001, 103.4, 97.79000000000002, 102.63000000000001, 108.46000000000001]
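The long decimals in the output above are a side effect of how computers store floating-point numbers. One possible way to tidy them is the built-in **round()** function, sketched here on the first three scores only (the names `scores`, `curved`, and `curved_rounded` are made up for this example):

```python
scores = [73.5, 86.2, 81.9]  # first three quiz scores, for brevity

# Multiplying floats can leave tiny representation errors in the last digits.
curved = [grade * 1.1 for grade in scores]
print(curved)

# round(value, 1) keeps one decimal place, matching the original scores.
curved_rounded = [round(grade * 1.1, 1) for grade in scores]
print(curved_rounded)
```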


<div class="alert alert-info" role="alert" style="color:blue">
<p style="color:blue;">46. Create a list that stores the number of characters found in the name of each of your inspirational people. Then, calculate the average name length of these people.</p>
</div>

In [287]:
name_lengths = [len(name) for name in inspirational_people]
print(name_lengths)
average(name_lengths)

# we could also add the following code to identify the person with the longest name in the list:
#max_length = max(name_lengths)
#print(max_length)
#max_idx = name_lengths.index(max_length)
#print(max_idx)
#print(inspirational_people[max_idx])

[13, 11, 20, 12, 12, 5, 5, 4, 8, 5, 6, 10, 5, 4, 5]


8.333333333333334

<div style = "background-color:#f3e5f5" style="color:purple">
<h3 style = "color:purple">V. Python Basics: Writing Functions</h3>

<p style="color:purple"><b>FUNCTIONS:</b> Next, we would like to create some new columns with lists of tokens that are lower-cased and with stopwords removed. To do so, it is helpful to write a function that does this for a single text. Then, we can apply that function across the entire corpus of SOTU addresses stored in this dataframe.</p>

<p style="color:purple">We have already used a variety of core Python functions such as <b>sum()</b>, <b>len()</b>, and <b>print()</b>. We have also called on functions defined in other Python modules, such as the <b>sqrt()</b> function from the <b>math</b> module and the <b>ndiff()</b> function from the <b>difflib</b> module we imported.</p>

<p style="color:purple">Sometimes, however, we will want to create our own functions.</p>

<p style="color:purple">A function is a piece of code that only runs when it is called (this will make a bit more sense after we see an example). We can pass parameters (data) into a function, which will perform operations on them. </p>

<p style="color:purple">In Python, we use the <code>def</code> keyword to define a function. </p>

```python
def functionName(argumentsToPassIn):
    # function instructions go here
    return(resultsOfFunction)
```
    
<p style="color:purple">47. Most, but not all, functions return something using the <code>return</code> statement. Here is a super simple function that prints a phrase but does not return anything.</p>
</div>

In [308]:
# First, create a function
def print_hello():
    print('Hello!')

In [309]:
# Next, call the function
print_hello()

Hello!


In [None]:
# Now, let's try adding some arguments/parameters to our functions
def add(x, y):
    # add the two numbers and return the result
    # (the variable is named "total" so we don't overwrite Python's built-in sum() function)
    total = x + y

    return total

add(7, 3)
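
To make the difference between printing and returning concrete, here is a small sketch that stores the result of each function call in a variable. The names `result_a` and `result_b` are only for illustration.

```python
def print_hello():
    print('Hello!')        # prints, but has no return statement

def add(x, y):
    return x + y           # hands the result back to the caller

result_a = print_hello()   # 'Hello!' is printed while the function runs
result_b = add(7, 3)

print(result_a)            # None: nothing was returned
print(result_b)            # 10
```

A function without a `return` statement returns `None`, so only the value from `add()` can be reused in later calculations.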

<div style = "background-color:#f3e5f5" style="color:purple">

<p style="color:purple"><b>48. A SIMPLE FUNCTION:</b> So, for example, if we had a list of names and we wanted to create a function to retrieve the initials of each, we could use the following function:</p>

</div>

In [288]:
import re
def Initials(fullname):
    caps = re.findall('([A-Z])', fullname) #this uses the findall function from the re package to find all capitalized letters
    inits = ''.join(caps)  #takes our list of capitalized letters stored in "caps" and concatenates it
    return(inits)
    
fullname = "Jeremy M. Mikecz"     #replace w/ your name
Initials(fullname)

'JMM'

<div style = "background-color:#f3e5f5"><p style="color:purple">49. We can now apply this function to quickly return the initials from a long list of names.</p></div>

In [289]:
actorlist = ['Christoph Waltz','Tom Hardy','Doug Walker','Daryl Sabara','J.K. Simmons','Brad Garrett','Chris Hemsworth','Alan Rickman','Henry Cavill','Kevin Spacey','Giancarlo Giannini','Johnny Depp','Johnny Depp','Henry Cavill','Peter Dinklage','Chris Hemsworth','Johnny Depp','Will Smith','Aidan Turner','Emma Stone','Mark Addy','Aidan Turner','Christopher Lee','Naomi Watts','Leonardo DiCaprio','Robert Downey Jr.','Liam Neeson','Bryce Dallas Howard','Albert Finney','J.K. Simmons','Robert Downey Jr.','Johnny Depp','Hugh Jackman','Steve Buscemi','Glenn Morshower','Bingbing Li','Tim Holmes','Emma Stone','Jeff Bridges','Joe Mantegna','Ryan Reynolds','Tom Hanks','Christian Bale','Jason Statham','Peter Capaldi','Jennifer Lawrence','Benedict Cumberbatch','Eddie Marsan','Leonardo DiCaprio','Jake Gyllenhaal','Charlie Hunnam','Glenn Morshower','Harrison Ford','A.J. Buckley','Kelly Macdonald','Sofia Boutella','John Ratzenberger','Tzi Ma','Oliver Platt','Robin Wright','Channing Tatum','Christoph Waltz','Jim Broadbent','Jennifer Lawrence','Christian Bale','John Ratzenberger','Amy Poehler','Robert Downey Jr.','Chloë Grace Moretz','Will Smith','Jet Li','Will Smith','Jimmy Bennett','Tom Cruise','Jeanne Tripplehorn','Joseph Gordon-Levitt','Amy Poehler','Scarlett Johansson','Robert Downey Jr.','Chris Hemsworth','Angelina Jolie Pitt','Gary Oldman','Tamsin Egerton','Keanu Reeves','Scarlett Johansson','Jon Hamm','Judy Greer','Damon Wayans Jr.','Jack McBrayer','Tom Hanks','Vivica A. Fox','Gerard Butler','Nick Stahl','Bradley Cooper','Matthew McConaughey','Leonardo DiCaprio','Mark Chinnery','Aidan Turner','Paul Walker','Brad Pitt','Jennifer Lawrence','Jennifer Lawrence','Nicolas Cage','Jimmy Bennett','Johnny Depp','Justin Timberlake','Dominic Cooper',
    'J.K. Simmons','Bruce Spence','Jennifer Garner','Zack Ward','Anthony Hopkins','Robert Pattinson','Robert Pattinson','Will Smith','Will Smith','Johnny Depp','Janeane Garofalo','Christian Bale','Bernie Mac','Robin Williams','Hugh Jackman','Essie Davis','Josh Gad','Steve Bastoni','Chris Hemsworth','Tom Hardy','Tom Hanks','Chris Hemsworth','Chloë Grace Moretz','Kelli Garner','Liam Neeson','Johnny Depp','Tom Cruise','Anthony Hopkins','Christoph Waltz','Matthew Broderick','Angelina Jolie Pitt','Seychelle Gabriel','Philip Seymour Hoffman','Channing Tatum','Elisabeth Harnois','Hugh Jackman','Hugh Jackman','Ty Burrell','Brad Pitt','Jada Pinkett Smith','Toby Stephens','Ed Begley Jr.','Bruce Willis','Will Smith','Robin Wright','J.K. Simmons','Tom Cruise','Hugh Jackman','John Michael Higgins','Tom Cruise','Christian Bale','Chris Hemsworth','J.K. Simmons','Gerard Butler','Gerard Butler','Sam Shepard','Matt Frewer','Jet Li','Kevin Rankin','Channing Tatum','Matthew McConaughey','Steve Buscemi','Chris Evans','Colin Salmon','James DArcy','Robert Pattinson','Robin Williams','Ty Burrell','Don Johnson','Mark Rylance','Leonardo DiCaprio','Ryan Reynolds','Johnny Depp','Benedict Cumberbatch','Matt Damon','Angelina Jolie Pitt','Judy Greer','Jennifer Lawrence','Robert Pattinson','Jim Parsons','Tom Cruise','Will Smith','Salma Hayek','Angelina Jolie Pitt','Anthony Hopkins','Toby Jones','Daniel Radcliffe','Essie Davis','Will Smith','Alfre Woodard','Rupert Grint','Robin Williams',
    'J.K. Simmons','Daniel Radcliffe','Ryan Reynolds','Mark Chinnery','Johnny Depp','Rupert Grint','Jennifer Lawrence','Tom Hanks','Miguel Ferrer','Hugh Jackman','Paul Walker','Robert Downey Jr.','Liam Neeson','Ronny Cox','Tony Curran','Jeremy Renner','Michael Gough','Clint Howard','Jake Gyllenhaal','Tom Cruise','Karen Allen','Chris Evans','Suraj Sharma','Nicolas Cage','Matt Damon','Demi Moore','Michael Fassbender','Nathan Lane','Matt Damon','Vin Diesel','Gary Oldman','Scott Porter','Shelley Conn','Tom Cruise','Morgan Freeman','Natalie Portman','Natalie Portman','Steve Buscemi','Hugh Jackman','Natalie Portman','Ryan Reynolds','Alain Delon','Nicolas Cage','Chris Hemsworth','Noel Fisher','Phaldut Sharma','Jamie Renée Smith','Stephen Amell','Tim Blake Nelson','Robin Williams','Dwayne Johnson','Vincent Schiavelli','Heath Ledger','Brad Pitt','Brad Pitt','Kate Winslet','Leonardo DiCaprio','James Corden','Christoph Waltz','George Peppard','Eva Green','Mahadeo Shivraj','Steve Buscemi','Naomi Watts','Hugh Jackman','Jacob Tremblay','Jason Patric','Harrison Ford','Bruce Willis','Christopher Lee','Jim Broadbent','Will Smith','Sean Hayes','Will Smith','Liam Neeson','Chazz Palminteri','Oprah Winfrey','Matt Damon','Mathew Buck','Scarlett Johansson','Del Zamora','Nicolas Cage','Djimon Hounsou','Tom Cruise','Daniel Radcliffe','Eva Green','Cary-Hiroyuki Tagawa','Joe Morton','Johnny Depp','Denzel Washington','Jamie Lee Curtis','Denzel Washington','Robert De Niro','Dwayne Johnson','Vanessa Williams','Leonardo DiCaprio','Demi Moore','Eartha Kitt','Jason Statham','Nicolas Cage','Djimon Hounsou','Catherine OHara','Hugh Jackman','Josh Hutcherson','Johnny Depp','CCH Pounder','Leonardo DiCaprio','Leonardo DiCaprio','Michael Gough','Jake Busey','Tom Hanks','Abbie Cornish','Frances Conroy','Dwayne Johnson','Joseph Gordon-Levitt','Will Ferrell','Jason Statham','Ray Winstone','Jamie Kennedy','Chris Hemsworth','Rosario Dawson','Matt Damon','Francesca Capaldi','Ben Gazzara','Dwayne Johnson',
    'Leonardo DiCaprio','Christian Bale','Jeff Bridges','Jon Lovitz','Ioan Gruffudd','Will Ferrell','Milla Jovovich','Chris Noth','Frank Welker','Peter Dinklage','Hayley Atwell','Michael Imperioli','Alexander Gould','Orlando Bloom','Christopher Lee','Jeff Bridges','Angelina Jolie Pitt','Johnny Depp','Michael Jeter','James Franco','Martin Short','Bruce Willis','Dennis Quaid','Holly Hunter','Christopher Masterson','Logan Lerman','Will Smith','Tom Hanks','Denzel Washington','Mei Melançon','Harrison Ford','Will Forte','Denis Leary','Adam Scott','Bill Murray','Leonardo DiCaprio','Ming-Na Wen','Robert Downey Jr.','Robin Wright','Bruce Willis','Robert Downey Jr.','Morgan Freeman','Leonard Nimoy','Bella Thorne','Tom Cruise','Adam Sandler','Peter Dinklage','Haley Joel Osment','Marsha Thomason','Matthew McConaughey','Greg Grunberg','Curtiss Cook','Logan Lerman','Gerard Butler','Daniel Radcliffe','Alun Armstrong','Brad Pitt','Don Cheadle','Anne Hathaway','Robin Williams','Don Cheadle','Harrison Ford','Liam Neeson','Tim Blake Nelson','William Smith','Paddy Considine','Shirley Henderson','Jeff Bridges','Philip Seymour Hoffman','Paul Walker','Tom Hanks','Robin Williams','Matt Damon','Harrison Ford','Brad Pitt','Milla Jovovich','Steve Buscemi','Jeff Bennett','Caroline Dhavernas','Denzel Washington','Ioan Gruffudd','Matthew Broderick','Kate Winslet','Will Smith','Meryl Streep','Al Pacino','Jon Favreau','Kate Winslet','Bob Hoskins','Dwayne Johnson',
    'F. Murray Abraham','Li Gong','Amber Stevens West','Jim Broadbent','Anthony Hopkins','Raymond Cruz','Roy Scheider','Julia Roberts','Anna Kendrick','Glenn Morshower','Larry Miller','Sarah Michelle Gellar','Wood Harris','Adam Sandler','Ted Danson','Jack McBrayer','Kristen Stewart','Seth MacFarlane','Robert Downey Jr.','Robert Duvall','Morgan Freeman','Jason Statham','Tom Cruise','Jennifer Lawrence','Bradley Cooper','Michael Gough','Bruce Willis','Tia Carrere','Steve Buscemi','Morgan Freeman','Bruce Willis','Adam Sandler','Amy Poehler','Steve Buscemi','Bill Murray','Keanu Reeves','Leonardo DiCaprio','Jon Favreau','Jim Broadbent','Nicolas Cage','Adam Sandler','Tom Hanks','Adam Sandler','Elden Henson','Steve Buscemi','Rosario Dawson','Philip Seymour Hoffman','Denzel Washington','Robin Williams','Liam Neeson','Bill Murray','Roger Rees','Keanu Reeves','Julia Roberts','Brad Pitt','Harrison Ford','Justin Timberlake','Matt Damon','Rosario Dawson','Gary Oldman','Denzel Washington','Vanessa Redgrave','Steve Buscemi','Elizabeth Montgomery','Quincy Jones','Mark Addy','Charlize Theron','Hugh Jackman','Michael Emerson','Robin Williams','Adam Sandler','Matt Damon','Natalie Portman','Nissim Renard','Anthony Hopkins','Bruce Willis','Bruce Greenwood','Sylvester Stallone','Charlie Rowe','Richard Tyson','Brendan Fraser','Fergie','Paul Walker','Olivia Williams','Adam Goldberg','Vin Diesel','Bob Neill','Mia Farrow','Pedro Armendáriz Jr.','David Oyelowo','Sasha Roiz','Sariann Monaco','Adam Goldberg','Matthew Broderick','Josh Hutcherson','Will Forte','Philip Seymour Hoffman',
    'J.K. Simmons','Al Pacino','Paul Walker','Jeff Bridges','Roger Rees','Robert De Niro','Steve Coogan','Jason Flemyng','Steve Carell','Will Smith','Ariana Richards','Jada Pinkett Smith','Charlie Hunnam','Hugh Jackman','Angelina Jolie Pitt','Nicolas Cage','Denis Leary','Adam Sandler','Jerry Stiller','James DArcy','Matthew Broderick','Morgan Freeman','Steve Buscemi','Tom Hanks','Harold Perrineau','Don Cheadle','Nicholas Lea','Philip Seymour Hoffman','Robert De Niro','Loretta Devine','Adam Arkin','Dwayne Johnson','Ayelet Zurer','Bruce Willis','Tom Selleck','Henry Cavill','Adam Sandler','Steve Buscemi','Bruce Willis','Julia Ormond','Bai Ling','Henry Cavill','Jimmy Bennett','Matt Damon','Harrison Ford','Connie Nielsen','Christopher Meloni','Brendan Fraser','Dennis Quaid','Robin Wright','Steve Carell','Jon Hamm','Nicolas Cage','Peter Coyote','Peter Dinklage','Matthew McConaughey','Adam Sandler','Jennifer Garner','Will Ferrell','Raven-Symoné','Mhairi Calvey','Jake Gyllenhaal','Albert Brooks','Martin Landau','Sylvester Stallone','David Gant','Bryce Dallas Howard','Oliver Platt','Rory Culkin','Rupert Everett','John Ratzenberger','Julia Roberts','Vin Diesel','Tim Conway','Lili Taylor','Michael Fassbender','Robin Williams','Dwayne Johnson','Bruce Willis','Jeremy Renner','Nicole Beharie','Tom Cruise','Bryce Dallas Howard','Sanaa Lathan','Amy Poehler','Jon Hamm']

In [290]:
actor_inits = [Initials(actor) for actor in actorlist]
print(actor_inits[:20])


['CW', 'TH', 'DW', 'DS', 'JKS', 'BG', 'CH', 'AR', 'HC', 'KS', 'GG', 'JD', 'JD', 'HC', 'PD', 'CH', 'JD', 'WS', 'AT', 'ES']
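
Collecting every capital letter works for most names, but it treats a mid-word capital (as in "DiCaprio") as an extra initial. The sketch below compares that approach with an alternative that takes the first character of each whitespace-separated word. Both function names are hypothetical, and neither approach is "the" correct one; which is better depends on how the names are formatted.

```python
import re

def initials_from_caps(fullname):
    # same idea as Initials(): keep every capital letter A-Z
    return ''.join(re.findall('[A-Z]', fullname))

def initials_from_words(fullname):
    # alternative: first character of each whitespace-separated word
    return ''.join(word[0] for word in fullname.split())

for name in ['Leonardo DiCaprio', 'J.K. Simmons', 'Tom Hardy']:
    print(name, initials_from_caps(name), initials_from_words(name))
```

The capital-letter version yields 'LDC' for "Leonardo DiCaprio", while the word-based version yields 'JS' for "J.K. Simmons" because "J.K." counts as a single word.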


<div class="alert alert-info" role="alert" style="color:blue">
<h3>Exercises for Part V</h3>
    
<p>50. Create a new function that returns the first name and the first initial of the last name followed by a period (e.g. "Naomi Watts" --> "Naomi W."). Then apply the function to <code>actorlist</code> and save the results in a new list.</p>
</div>
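
Try the exercise yourself first. If you get stuck, here is one possible approach, sketched under the assumption that the first word is the first name and the last word is the last name. The function name and the short `sample_actors` list are hypothetical; swap in `actorlist` to run it on the full list.

```python
def first_name_last_initial(fullname):
    parts = fullname.split()
    # single-word names (e.g. 'Fergie') are returned unchanged
    if len(parts) < 2:
        return fullname
    # first word + first letter of the last word + a period
    return f"{parts[0]} {parts[-1][0]}."

sample_actors = ['Naomi Watts', 'Tom Hardy', 'Fergie', 'Robert Downey Jr.']
short_names = [first_name_last_initial(actor) for actor in sample_actors]
print(short_names)
```

Notice that suffixes are not handled: "Robert Downey Jr." becomes "Robert J.", so a more careful version would need extra rules for names like this.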

## VI. Working with Files

An essential skill in Python is navigating the files on your computer, whether to read existing files into Python or to write out new ones. Fortunately, you now have experience with the prerequisite skills for navigating through and importing files:
+ importing Python libraries
+ applying functions 
+ looping through lists

To enable navigating through files on your computer, we will use the **pathlib** library. 

51. Let's import it now.

In [291]:
from pathlib import Path 

In [292]:
?Path

Init signature: Path(*args, **kwargs)
Docstring:     
PurePath subclass that can make system calls.

Path represents a filesystem path but unlike PurePath, also offers
methods to do system calls on path objects. Depending on your system,
instantiating a Path will return either a PosixPath or a WindowsPath
object. You can also instantiate a PosixPath or WindowsPath directly,
but cannot instantiate a WindowsPath on a POSIX system or vice versa.
File:           c:\users\f0040rp\appdata\local\programs\python\python311\lib\pathlib.py
Type:           type
Subclasses:     PosixPath, WindowsPath

In [293]:
dir(Path)

['__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__fspath__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rtruediv__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__truediv__',
 '_cached_cparts',
 '_cparts',
 '_drv',
 '_format_parsed_parts',
 '_from_parsed_parts',
 '_from_parts',
 '_hash',
 '_make_child',
 '_make_child_relpath',
 '_parse_args',
 '_parts',
 '_pparts',
 '_root',
 '_scandir',
 '_str',
 'absolute',
 'anchor',
 'as_posix',
 'as_uri',
 'chmod',
 'cwd',
 'drive',
 'exists',
 'expanduser',
 'glob',
 'group',
 'hardlink_to',
 'home',
 'is_absolute',
 'is_block_device',
 'is_char_device',
 'is_dir',
 'is_fifo',
 'is_file',
 'is_mount',
 'is_relative_to',
 'is_reserved',
 'is_socket',
 'is_symlink',
 

52. Examine what the following functions do. Hint: **cwd()** means "current working directory."

In [323]:
print(Path.cwd())
print(Path.cwd().parent)
print(Path.cwd().parent.parent)

c:\Users\F0040RP\Documents\DartLib_RDS\intro-to-python\TAW
c:\Users\F0040RP\Documents\DartLib_RDS\intro-to-python
c:\Users\F0040RP\Documents\DartLib_RDS
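
Just as **.parent** moves up one folder, the `/` operator moves down into subfolders by joining path pieces. A minimal sketch follows; "data" and "texts" are hypothetical folder names, and nothing is created on disk.

```python
from pathlib import Path

here = Path.cwd()

# build a path to a (hypothetical) folder that sits next to the current one
data_dir = here.parent / "data" / "texts"

print(data_dir)
print(data_dir.name)          # last piece of the path
print(data_dir.parent.name)   # the folder one level up
```

Because pathlib inserts the correct separator for your operating system, the same code works on Windows, macOS, and Linux.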


53. We can use pathlib's **Path** class to store a file path to another folder (outside the current working directory). The **.expanduser()** method replaces `~` with the path to your home directory.

In [295]:
textdir = Path("~/shared/RR-workshop-data/state-of-the-union-dataset/txt").expanduser()
print(textdir)

C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt
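
Creating a `Path` does not check that the folder actually exists, so a typo in the path only shows up later (for example, as an empty list of files). As a sketch, we can check up front with **.is_dir()**. The variable `dir_found` is just for illustration, and the result depends on whether the workshop data is present on your machine.

```python
from pathlib import Path

textdir = Path("~/shared/RR-workshop-data/state-of-the-union-dataset/txt").expanduser()

# True only if the path exists and is a folder
dir_found = textdir.is_dir()

if dir_found:
    print("Found:", textdir)
else:
    print("Folder not found, check the path:", textdir)
```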


54. We can then place all .txt files in a list and print out that list.

In [324]:
pathlist = sorted(textdir.glob("*.txt")) 
print(pathlist)

[WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1797.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1798.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1799.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1800.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1825.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1826.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1827.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Adams_1828.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-union-dataset/txt/Arthur_1881.txt'), WindowsPath('C:/Users/F0040RP/shared/RR-workshop-data/state-of-the-unio
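
To see what **.glob()** and **sorted()** contribute without relying on the workshop data, here is a self-contained sketch that builds a temporary folder containing three hypothetical files and then lists only the .txt files.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmpdir = Path(tmp)
    # create three small files with hypothetical names
    for name in ["b.txt", "a.txt", "notes.csv"]:
        (tmpdir / name).write_text("placeholder", encoding="utf-8")

    # "*.txt" matches any file name ending in .txt; sorted() puts them in order
    found = [p.name for p in sorted(tmpdir.glob("*.txt"))]

print(found)   # ['a.txt', 'b.txt']
```

The pattern filters out `notes.csv`, and `sorted()` guarantees a consistent order, which **.glob()** alone does not promise.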

In [297]:
for path in pathlist:
    print(path)     #try path.name, .stem, .suffix, .parts

C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Adams_1797.txt
C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Adams_1798.txt
C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Adams_1799.txt
C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Adams_1800.txt
C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Adams_1825.txt
C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Adams_1826.txt
C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Adams_1827.txt
C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Adams_1828.txt
C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Arthur_1881.txt
C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Arthur_1882.txt
C:\Users\F0040RP\shared\RR-workshop-data\state-of-the-union-dataset\txt\Arthur_1883.txt
C:\Users\F0040RP\shared\RR-workshop-data
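
The comment in the loop above suggests trying **.name**, **.stem**, **.suffix**, and **.parts**. The sketch below shows what each returns for a single example path. It uses `PurePosixPath` (a pathlib class that only manipulates path text and never touches the disk) so the output is the same on every operating system. The shortened path is an assumption for illustration.

```python
from pathlib import PurePosixPath

example = PurePosixPath("state-of-the-union-dataset/txt/Adams_1797.txt")

print(example.name)     # Adams_1797.txt  (file name with extension)
print(example.stem)     # Adams_1797      (file name without extension)
print(example.suffix)   # .txt            (the extension)
print(example.parts)    # ('state-of-the-union-dataset', 'txt', 'Adams_1797.txt')
```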

55. In some cases, the names of files contain metadata about the files themselves. In these cases, we can use a for loop to iterate through our files and save some of this metadata.

In [298]:
presidents = []
for path in pathlist:
    file_stem = path.stem
    stem_parts = file_stem.split("_")
    pres_name = stem_parts[0]
    year = stem_parts[1]
    presidents.append(pres_name)
print(presidents)



['Adams', 'Adams', 'Adams', 'Adams', 'Adams', 'Adams', 'Adams', 'Adams', 'Arthur', 'Arthur', 'Arthur', 'Arthur', 'Biden', 'Biden', 'Biden', 'Buchanan', 'Buchanan', 'Buchanan', 'Buchanan', 'Buren', 'Buren', 'Buren', 'Buren', 'Bush', 'Bush', 'Bush', 'Bush', 'Bush', 'Bush', 'Bush', 'Bush', 'Bush', 'Bush', 'Bush', 'Bush', 'Carter', 'Carter', 'Carter', 'Carter', 'Cleveland', 'Cleveland', 'Cleveland', 'Cleveland', 'Cleveland', 'Cleveland', 'Cleveland', 'Cleveland', 'Clinton', 'Clinton', 'Clinton', 'Clinton', 'Clinton', 'Clinton', 'Clinton', 'Clinton', 'Coolidge', 'Coolidge', 'Coolidge', 'Coolidge', 'Coolidge', 'Coolidge', 'Eisenhower', 'Eisenhower', 'Eisenhower', 'Eisenhower', 'Eisenhower', 'Eisenhower', 'Eisenhower', 'Eisenhower', 'Fillmore', 'Fillmore', 'Fillmore', 'Ford', 'Ford', 'Ford', 'Grant', 'Grant', 'Grant', 'Grant', 'Grant', 'Grant', 'Grant', 'Grant', 'Harding', 'Harding', 'Harrison', 'Harrison', 'Harrison', 'Harrison', 'Hayes', 'Hayes', 'Hayes', 'Hayes', 'Hoover', 'Hoover', 'Hoove
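
The loop above extracts the year but only stores the president's name. As a sketch of how both pieces of metadata could be kept together, we can append a tuple for each file. The short `stems` list stands in for the file stems, and `records` is a hypothetical variable name.

```python
# stand-in for [path.stem for path in pathlist]
stems = ["Adams_1797", "Adams_1798", "Arthur_1881"]

records = []
for stem in stems:
    pres_name, year = stem.split("_")         # unpack the two parts
    records.append((pres_name, int(year)))    # store the year as a number

print(records)
```

Converting the year to an integer makes it possible to sort or filter by date later (for example, keeping only addresses given after 1900).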

In [299]:
set(presidents)

{'Adams',
 'Arthur',
 'Biden',
 'Buchanan',
 'Buren',
 'Bush',
 'Carter',
 'Cleveland',
 'Clinton',
 'Coolidge',
 'Eisenhower',
 'Fillmore',
 'Ford',
 'Grant',
 'Harding',
 'Harrison',
 'Hayes',
 'Hoover',
 'Jackson',
 'Jefferson',
 'Johnson',
 'Kennedy',
 'Lincoln',
 'Madison',
 'McKinley',
 'Monroe',
 'Nixon',
 'Obama',
 'Pierce',
 'Polk',
 'Reagan',
 'Roosevelt',
 'Taft',
 'Taylor',
 'Truman',
 'Trump',
 'Tyler',
 'Washington',
 'Wilson'}

56. For lists that include repeated values, we can use the **collections** module's **Counter** class (a subclass of Python's dictionary) to count the frequency of each item in a list.

In [300]:
import collections
collections.Counter(presidents)

Counter({'Roosevelt': 20,
         'Bush': 12,
         'Johnson': 10,
         'Adams': 8,
         'Cleveland': 8,
         'Clinton': 8,
         'Eisenhower': 8,
         'Grant': 8,
         'Jackson': 8,
         'Jefferson': 8,
         'Madison': 8,
         'Monroe': 8,
         'Obama': 8,
         'Truman': 8,
         'Wilson': 8,
         'Reagan': 7,
         'Washington': 7,
         'Coolidge': 6,
         'Nixon': 5,
         'Arthur': 4,
         'Buchanan': 4,
         'Buren': 4,
         'Carter': 4,
         'Harrison': 4,
         'Hayes': 4,
         'Hoover': 4,
         'Lincoln': 4,
         'McKinley': 4,
         'Pierce': 4,
         'Polk': 4,
         'Taft': 4,
         'Trump': 4,
         'Tyler': 4,
         'Biden': 3,
         'Fillmore': 3,
         'Ford': 3,
         'Harding': 2,
         'Kennedy': 2,
         'Taylor': 1})
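
A `Counter` behaves like a dictionary, so we can look up individual counts, and its **.most_common()** method returns the top items in order. Here is a short sketch using a small made-up list in place of `presidents`.

```python
from collections import Counter

# small made-up sample standing in for the presidents list
sample = ['Adams', 'Adams', 'Arthur', 'Biden', 'Adams', 'Biden']
counts = Counter(sample)

top_two = counts.most_common(2)   # the two most frequent items, with counts
print(top_two)                    # [('Adams', 3), ('Biden', 2)]
print(counts['Arthur'])           # 1
print(counts['Lincoln'])          # 0 (missing items count as zero)
```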

## VII. Reading in Text Files

57. You may remember we saved a list of file paths to our text files in `pathlist`. To read in one of these files, we will want to:
+ identify the file path to one text and save it as `path1`
+ open this file path as `f`
    - we will open the file within a `with` statement. The file is then closed automatically as soon as the interpreter leaves the indented block, even if an error occurs inside it. This helps ensure we don't accidentally leave the file open or corrupt it.
+ read the file using the **.read()** method and save it as `txt1`

In [301]:
path1 = pathlist[0]
#with open(Path(textdir,"Bush_2002.txt"), encoding='utf-8') as f:
with open(path1, encoding = 'utf-8') as f:
    txt1 = f.read()

print(path1.name)

Adams_1797.txt
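
To confirm that the `with` statement really closes the file, we can inspect the file object's **.closed** attribute. The sketch below writes a small temporary file (the file name and text are made up) so it runs without the workshop data. It also shows pathlib's **.read_text()** method, a shortcut that opens, reads, and closes the file in a single call.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"   # hypothetical file name
    sample_path.write_text("Gentlemen of the Senate", encoding="utf-8")

    with open(sample_path, encoding="utf-8") as f:
        sample_txt = f.read()
        open_inside = not f.closed     # True: still open inside the block

    closed_after = f.closed            # True: closed once the block ends

    # pathlib shortcut: open, read, and close in one call
    same_txt = sample_path.read_text(encoding="utf-8")

print(open_inside, closed_after, sample_txt == same_txt)
```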


58. What do the following code cells do?

In [302]:
print(len(txt1)) 

12440


In [303]:
txt1[0:100]

'Gentlemen of the Senate and Gentlemen of the House of Representatives:\n\nI was for some time apprehen'

In [304]:
txt1[:100] 

'Gentlemen of the Senate and Gentlemen of the House of Representatives:\n\nI was for some time apprehen'

In [305]:
txt1[100:300]

'sive that it would be necessary, on account of\nthe contagious sickness which afflicted the city of Philadelphia, to\nconvene the National Legislature at some other place. This measure it was\ndesirable '

In [306]:
txt1[-60:]

' measures you may rely on my zealous and hearty concurrence.'
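
The cells above all use slicing: `text[start:stop]` returns the characters from position `start` up to, but not including, position `stop`. Leaving out `start` means "from the beginning," leaving out `stop` means "to the end," and negative numbers count from the end. Here is a short sketch with a made-up string.

```python
sample = 'Gentlemen of the Senate'

print(len(sample))      # 23 characters, counting spaces
print(sample[:9])       # 'Gentlemen'  (same as sample[0:9])
print(sample[10:12])    # 'of'
print(sample[-6:])      # 'Senate'     (the last six characters)
```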

<div class="alert alert-info" role="alert" style="color:blue">
    <p><b>59b. Exercises</b>:</p> 
    <p>Open a text (of your own choosing) and save it into the variable "txt2".</p>
</div>

In [311]:
with open(Path(textdir,"Roosevelt_1944.txt"), encoding='utf-8') as f:
    txt2 = f.read()

<div class="alert alert-info" role="alert" style="color:blue"><h3>Exercises for Part VII</h3>
    
<p>60. Add a coding cell below and print out the first and last 200 characters in your selected <b>txt2</b> text. Can you identify any major themes from the opening and closing words of this address? If not, expand the number of characters you are examining.</p></div>

In [312]:
print(txt2[:200])
txt2[-200:]

To the Congress:

This Nation in the past two years has become an active partner in the
world's greatest war against human slavery.

We have joined with like-minded people in order to defend ourselves


'nd his Government.\n\nEach and every one of us has a solemn obligation under God to serve this\nNation in its most critical hour--to keep this Nation great--to make this\nNation greater in a better world.'

## VIII. Applying Functions to Files

61. The function below reads in every .txt file in a directory, stores each text's length (in characters) and file stem as a tuple in a list, and then returns the *n* shortest texts.


In [320]:
def get_shortest_texts(directory, n = 5):
    # collect a (length, file stem) tuple for every .txt file in the directory
    text_info_list = []
    pathlist = sorted(directory.glob("*.txt")) 
    for path in pathlist:
        with open(path, encoding = 'utf-8') as f:
            txt = f.read()
        filestem = path.stem
        text_len = len(txt)     # length in characters
        text_info_list.append((text_len, filestem))
    # tuples are sorted by their first item, so the shortest texts come first
    sorted_list = sorted(text_info_list)
    return(sorted_list[:n])
    

        

In [322]:
short_texts = get_shortest_texts(textdir, n = 8)
print(short_texts)

[(6767, 'Washington_1790'), (8349, 'Adams_1800'), (9204, 'Adams_1799'), (9809, 'Nixon_1973'), (11014, 'Madison_1809'), (11629, 'Washington_1793'), (12257, 'Washington_1795'), (12440, 'Adams_1797')]
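
The function works because Python sorts tuples by their first item (here, the text length) and only uses the second item to break ties. The sketch below shows this with three tuples taken from the output above, along with the `reverse=True` option of **sorted()**, which sorts from largest to smallest.

```python
# three (length, file stem) tuples taken from the output above
text_info = [(12440, 'Adams_1797'), (6767, 'Washington_1790'), (9809, 'Nixon_1973')]

shortest_two = sorted(text_info)[:2]
print(shortest_two)   # [(6767, 'Washington_1790'), (9809, 'Nixon_1973')]

longest_one = sorted(text_info, reverse=True)[:1]
print(longest_one)    # [(12440, 'Adams_1797')]
```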


<div class="alert alert-info" role="alert" style="color:blue"><h3>Exercises</h3>

<p>62. Write a function returning the <i>n</i> longest texts from a directory.</p>
</div>