# Intro to Python Notebook 2: Working with Texts (and other data)

**A Reproducible Research Workshop**

(A Collaboration between Dartmouth Library and Research Computing)

[_Click here to view or register for our current list of workshops_](http://dartgo.org/RRADworkshops)

**Instructors: Jeremy Mikecz, Lora Leligdon, and Simon Stone**

_This notebook created by_:
- Version 1.0: Jeremy Mikecz, Research Data Services (Dartmouth Library) with revisions by Mansa Krishna (Earth Science) - (_much thanks Mansa!_)

<!--
- Some of the inspiration for the code and information in this notebook was taken from https://www.w3schools.com/python/python_intro.asp -- This is a great resource if you want to learn more about Python!-->

This is **Notebook 2** of 3 for the **Introduction to Text Analysis in Python** workshop:

- Notebook 1: The Basics - getting started with Python
- **Notebook 2: Working with Texts (and other data) - importing, reviewing, and modifying texts and other data**
- Notebook 3: Dataframes - importing texts and other data, placing this data into a dataframe, and then modifying, analyzing, visualizing, and exporting this data

In this lesson, you will learn how to:

1. import texts and other data files from your computer
2. place data in lists and modify and analyze those lists
3. iterate through lists and an entire directory of files
4. write functions to create reproducible code
<!--1. import multiple text and data files from local folders
5. extract data from these files and read this data into data tables (known as "dataframes" in Python)-->

**Table of Contents**

- I. Built-In Functions and Methods
- II. Python Libraries
- III. Working with Lists
- IV. Looping through Lists
- V. Writing Functions
- VI. Working with Files
- VII. Reading in Text Files
- VIII. Applying Functions to Files

<!--
+ II. Working with Files
+ III. Writing Functions
+ IV. Working with Lists
+ V. Looping through Lists
+ VI. Looping through Files-->


## I. Python Built-In Functions and Methods

For more on functions, we can refer to a more detailed lesson provided by [**Constellate's** Python 4 lesson](https://lab.constellate.org/perfusion-stearns-eliot/notebooks/tdm-notebooks-2023-04-03T23%3A17%3A07.601Z/python-basics-4.ipynb):

**Functions**

```
You can identify a function by the fact that it ends with a set of parentheses () where arguments can be passed into the function. Depending on the function (and your goals for using it), a function may accept no arguments, a single argument, or many arguments. For example, when we use the print() function, a string (or a variable containing a string) is passed as an argument.

Functions are a convenient shorthand, like a mini-program, that makes our code more modular. We don't need to know all the details of how the print() function works in order to use it. Functions are sometimes called "black boxes", in that we can put an argument into the box and a return value comes out. We don't need to know the inner details of the "black box" to use it. (Of course, as you advance your programming skills, you may become curious about how certain functions work. And if you work with sensitive data, you may need to peer in the black box to ensure the security and accuracy of the output.)
```

### Ia. Built-in functions

With Python alone, a programmer can perform some basic operations using simple functions introduced in Notebook 1, such as **print()**, **len()**, **max()**, **min()**, **sorted()**. To call a built-in function, you simply write the name of the function followed by any arguments you want to pass in placed within parentheses:

```
name_of_function(argument1, argument2, ...)
```

Arguments (also known as "parameters") are often, but not always, optional.

As you may recall, to print (output) some information, you would call the **print()** function passing in a text string you want to print.

1. Try entering some of the following commands in the code cells below. Compare the results:

```
print()
print("Good morning! How are you?")
your_name = "Bob"
print("Hello", your_name)
print("Hello", your_name, "!")
print("Hello ", your_name, "!", sep = "")

country_code = "+1"
area_code = "555"
phone_num = "555-0123"
print(country_code, area_code, phone_num, sep = "-")
```


To learn more about a particular function we can access its documentation using:

```
?function_name
```

For example:


In [None]:
?print

### Ib. Methods

Methods perform in similar ways to functions. The key difference is that:

- functions are independent and can be called by their name only
- methods act on objects of a particular class. In plain terms, some methods work only on text strings. Others work only on integers or dataframes.

Thus, the syntax for calling a method is:

```
object_name.method_name()
```
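As a minimal sketch of this difference (the variable name `greeting` is only an illustration), the code below calls a built-in function on a string and then calls a string method on the same string:

```python
greeting = "hello, world"

# a function is called by its name, with the object passed in as an argument
greeting_length = len(greeting)

# a method is attached to the object with a dot
greeting_upper = greeting.upper()

print(greeting_length)
print(greeting_upper)
```

The function call puts the object inside the parentheses, while the method call places the object in front of the dot.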

Examples of [common methods for text strings](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) include:

- str.capitalize()
- str.encode()
- str.endswith()
- str.startswith()
- str.lower()
- str.islower()
- str.upper()
- str.isupper()
- str.strip()
- str.replace()
- str.split()

Note: "str" above refers to either a raw text string such as `"hello!"` or a variable that contains a string such as `question = "what is happening?"`. To test if an object is a string (str), use the **type()** function, such as:

```
type(123)
type("Hello")
type("123")
```

2. Try typing these commands in the cell below.


3. Let's see what some of these string methods do:


In [None]:
sent = "  Two roads diverged in a wood and I - I took the one less traveled by, and that has made all the difference.   "

In [None]:
# try applying some of the string methods listed above to the sentence 'sent'


In [None]:
sent.strip().endswith(".")  # you can chain multiple methods together. Note: Python applies the method attached to the object first and then moves to the 2nd, 3rd, etc. methods. 
## So, in this case it first applies the .strip() method to our sentence and then applies the .endswith() to the results returned by strip

In [None]:
#another example of chained methods
sent = sent.strip()
sent.replace("Two", "Three").replace("one", "two")  # note: .replace() is case-sensitive, so "two" would not match "Two"

4. We can also apply built-in functions to a string object. We printed information above using the print() function. We can also use the **len()** [length] function.


We will examine some other methods below in the section on lists.


## II. Python Libraries

Python's built-in functions and methods provide basic functionality. To do more advanced or specialized things, we need to install and import Python **libraries**, also known as **packages**.

A **Python library** is a collection of files (known as **modules**) that each contain **functions** and/or **methods** to complete a set of related tasks.

_Confused?_

_This can get confusing as some large libraries have multiple sub-packages each with many different modules. In other cases a library consists of a single module._ **_The important thing to know is that you need to import each library or module you want to use._**

For more on Python libraries, we can refer to a more detailed lesson provided by [**Constellate's** Python 4 lesson](https://lab.constellate.org/perfusion-stearns-eliot/notebooks/tdm-notebooks-2023-04-03T23%3A17%3A07.601Z/python-basics-4.ipynb):

```
While Python comes with many functions, there are thousands more that others have written. Adding them all to Python would create mass confusion, since many people could use the same name for functions that do different things. The solution then is that functions are stored in modules that can be imported for use. A module is a Python file (extension ".py") that contains the definitions for the functions written in Python. These modules (individual Python files) can then be collected into even larger groups called packages and libraries. Depending on how many functions you need for the program you are writing, you may import a single module, a package of modules, or a whole library.
```

Some commonly used modules are found in [Python's Standard Library](https://docs.python.org/3/library/) - _these are installed with Python and require no separate installation._

Other libraries need to be installed first and then imported.

### IIb. Working with core Python Libraries

5. First, we will import the [**math module**](https://docs.python.org/3/library/math.html).

The syntax for importing a module or library is:

```
import module_name
```


In [None]:
import math

6. Try experimenting with some functions from the math module (see the [documentation here](https://docs.python.org/3/library/math.html))
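If you are unsure where to start, here is a small sketch using a few functions and one constant from the math module. The variable names are only illustrations:

```python
import math

root = math.sqrt(16)            # square root, returned as a float
rounded_down = math.floor(3.7)  # largest integer less than or equal to 3.7
rounded_up = math.ceil(3.2)     # smallest integer greater than or equal to 3.2
circle_area = math.pi * 2 ** 2  # area of a circle with radius 2

print(root, rounded_down, rounded_up)
print(round(circle_area, 2))
```

Note that each function is called with the module name in front of it (`math.sqrt`, not `sqrt`), because we imported the module as a whole.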


7. There are a variety of Python libraries and modules that help us work with texts. One interesting module that comes with the Python Standard Library is [**difflib**](https://docs.python.org/3/library/difflib.html). It allows us to compare the difference between sequences of text.

Examine the code cells below. They apply the **ndiff** function from the difflib library to two lists of words. Examine the results:


In [None]:
import difflib

sent1 = "What in the world is going on over there?".split()
sent2 = "What the heck is goin' on down there?".split()

In [None]:
from difflib import ndiff

diff = ndiff(sent1, sent2)
print("\n".join(diff))

## III. Working with Lists

There are two basic Python data structures for storing ordered sequences of information. These are **lists** and **tuples**.

- **Lists** are enclosed by `[]` and each item is separated by a comma.
- Lists are _mutable_, meaning they can be modified (items can be added, modified, deleted)
- **Tuples** are enclosed by `()` and each item is separated by a comma.
- Tuples are _immutable_, meaning that, once created, they cannot be modified.

```
a_list_of_numbers = [4, 6, 2]
a_tuple_of_numbers = (33, 17, 42, 2)
```

You can read more about [the differences between tuples and lists here](https://builtin.com/software-engineering-perspectives/python-tuples-vs-lists). Below, however, we focus on lists.
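To see what mutable and immutable mean in practice, here is a short sketch that tries to change the first item of the list and of the tuple defined above. The `try`/`except` block is only there so the error does not stop the cell:

```python
a_list_of_numbers = [4, 6, 2]
a_tuple_of_numbers = (33, 17, 42, 2)

# lists are mutable: we can change an item in place
a_list_of_numbers[0] = 10
print(a_list_of_numbers)

# tuples are immutable: the same operation raises a TypeError
try:
    a_tuple_of_numbers[0] = 10
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print("TypeError:", error)
```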


### IIIa. Creating Lists

Storing data in individual variables makes sense when you have a few unique values.

However, operating on many individual variables would be tedious and time-consuming.

Instead, we can store multiple values under one variable using lists. Say, for example, you want to store quiz scores for a class or multiple race results for a single track athlete. You could store them in a list.

````
**Scores - Quiz 1**
73.5
86.2
81.9
90.1
67.8
88.0
````


8. So the list of scores for the first quiz could be stored by:


In [None]:
# Run this cell by pressing Ctrl+Enter or pressing the play button while selecting this cell.
quiz1 = [73.5, 86.2, 81.9, 90.1, 67.8, 88.0]

<div class="alert alert-info" role="alert" style="color:blue"><h3 style="color:blue;">Exercises for Part III</h3>
    
<p style="color:blue;">15. Create a list of at least five of your favorite numbers (or numbers that recall a significant event, person, or stage of your life). Save to the variable <b>fav_numbers</b>.</p>
</div>


<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">16. Create a list of at least five people that have inspired you. Save to the variable <b>inspirational_people</b>.</p>

<p><i>Note:</i> All character strings must be contained within quotes, such as: </p>
    
<code>namelist = ["Aly", "Bob", "Charlie"]</code>
</div>


<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">17. Now, print out each of these lists.</p></div>


<div class="alert alert-info" role="alert" style="color:blue"><p>18. Print out the length of each of these lists using the <b>len()</b> function.</p>
</div>


<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">19. Print out the maximum value from each list.</p>
</div>


19b. _As we already learned, **len()** works on character strings as well as lists._ However, it counts character strings differently. Run the following code and then calculate the length of each variable. What do you notice?


In [None]:
sent = "This is a sentence."
words = sent.split()
print(words)

In [None]:
# calculate the length of "sent" and "words".
print(len(sent))
print(len(words))

### IIIb. List Indexes and Slices


We can retrieve portions of the list using indexes and slices.

One important note: in Python, the first item in a sequence always has index 0.

20. Thus, to retrieve the first item in our list (that is, the first quiz score), you would simply run:


In [None]:
quiz1[0]

<div class="alert alert-info" role="alert" style="color:blue">
    <p>21. Using the same format, now try retrieving the 3rd person from your list of inspirational people.</p>
</div>


Now imagine you have a long list. You can identify the last item in a list by using the index [-1]. For example:

```
name_of_list[-1]
```

<div class="alert alert-info" role="alert" style="color:blue">
    <p><b>22. Code Together</b>: Try that below with our list of quiz scores.</p>
</div>


<div class="alert alert-info" role="alert" style="color:blue">
    <p>23. Indexing beyond the end of the list will produce an "IndexError". Try, for example, retrieving the 100th item in our list of quiz scores.</p>
</div>


To retrieve multiple, consecutive items from a list, we can use **slices**.

The format is as follows:

```
name_of_list[start:end]
```

List slices begin at the start index but stop one item before the end index.

So, **"start"** is the index of the first item in the slice (counting from zero) and
**"end"** is the index of the last item in the slice + 1.
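The start and end rules can be sketched with a made-up list (the name `letters` is only for illustration). The sketch also shows that either number may be left out:

```python
letters = ["a", "b", "c", "d", "e", "f"]

first_three = letters[0:3]      # indexes 0, 1, and 2 (index 3 is excluded)
same_first_three = letters[:3]  # omitting start means "from the beginning"
from_index_four = letters[4:]   # omitting end means "through the last item"
last_two = letters[-2:]         # negative indexes count from the end

print(first_three)
print(from_index_four)
print(last_two)
```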

24. Run the cell below


In [None]:
quiz1[0:2]

<div class="alert alert-info" role="alert" style="color:blue">
    <h3> Exercise</h3>
    <p>25. Using this format, retrieve the second through fourth item in our list. Compare the results to our original list. If you did it wrong, adjust the indices you are using.</p>
</div>


In [None]:
# remember that indexing starts at 0!


### IIIc. Lists with Operators, Functions, and list methods


26. Create two lists of integers. Try applying different operators to the lists `+ - * / **`.


27. As with most things in Python, there are almost always multiple ways to perform a particular task. For example, to sort a list, you may either use the **sorted()** function (`sorted(listname)`) or the **.sort()** method (`listname.sort()`). The main difference is that:

- the **.sort()** method sorts the list in place, permanently changing the order of its values
- while the **sorted()** function returns a new, sorted copy and leaves the original list unchanged (unless you save the result over the old list name or under a new list name)
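A minimal sketch of the difference, using a made-up list named `scores`. Both `sorted()` and `.sort()` also accept an optional `reverse=True` argument for descending order:

```python
scores = [86.2, 73.5, 81.9]  # hypothetical scores

# sorted() returns a new list and leaves the original untouched
scores_ascending = sorted(scores)
scores_descending = sorted(scores, reverse=True)
print(scores)             # still in the original order
print(scores_descending)

# .sort() reorders the list itself and returns None
result = scores.sort()
print(scores)             # now sorted in place
print(result)
```

Because `.sort()` returns `None`, writing `scores = scores.sort()` would overwrite your list with `None`. Use `sorted()` when you want to assign the result to a variable.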


<div class="alert alert-info" role="alert" style="color:blue">
    <h3> Exercise</h3>
    <p>27b. Sort your list of inspirational people. Make sure to save the new sorted list. Then sort it in reverse and save it to a separate list.</p>
</div>


### IIId. Modifying a List

There are multiple ways to modify a list. These include the following methods:

- **.append()**
- **.extend()**
- **.pop()**

28. Try applying these methods to some of the lists we have already created. Remember the format is:

```
list_name.method_name(parameters)
```
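Here is a brief sketch of what each of the three methods does, using a made-up list named `colors`:

```python
colors = ["red", "green"]  # hypothetical list

colors.append("blue")                # adds one item to the end of the list
colors.extend(["yellow", "purple"])  # adds every item from another list
removed = colors.pop()               # removes and returns the last item

print(colors)
print(removed)
```

All three methods change the list in place, so there is no need to save the result to a new variable.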


Often, there are multiple ways to accomplish the same thing. For example:
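One illustration, sketched with a made-up list named `base`: adding items with the **.extend()** method and with the `+` operator produces the same result.

```python
base = [1, 2, 3]

# option 1: the .extend() method changes a list in place
with_extend = base.copy()  # work on a copy so that base stays unchanged
with_extend.extend([4, 5])

# option 2: the + operator builds a new, combined list
with_plus = base + [4, 5]

print(with_extend)
print(with_plus)
print(with_extend == with_plus)
```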


<div class="alert alert-info" role="alert" style="color:blue"><h3 style="color:blue;">Exercises for Part IIId</h3>
    
<p style="color:blue;">29. Create a list of some of your favorite musicians, actors, or writers (at least 4).</p>
</div>


<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">30. Add that list to your list of inspirational people. Print out this new list.</p></div>


<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">31. Calculate the number of people in your inspirational_people list. Then replace the 3rd to last person with another important person to you.</p>

<p>For example, if the 4th student in our quiz1 list retook the quiz, you could replace their score with:</p>
<code>quiz1[3] = 88</code>
</div>


33. Run the following code. What does the **set()** function do?


In [None]:
numbers = [
    12,
    4,
    5,
    2,
    3,
    2,
    2,
    4,
    7,
    8,
    9,
    2,
    3,
    1,
    2,
    1,
    1,
    2,
    1,
    4,
    67,
    3,
    5,
    3,
    5,
    7,
    4,
    1,
    3,
    2,
    4,
    1,
    7,
    2,
    16,
    23,
    4,
    5,
]
print(set(numbers))

<div class="alert alert-info" role="alert" style="color:blue"><h3 style="color:blue;">Exercises for Part IIId (continued)</h3>
    
<p style="color:blue;">35. Print out a list of your inspirational people, sorted in descending order.</p></div>


<div class="alert alert-info" role="alert" style="color:blue"><p style="color:blue;">37. Apply the <b>min()</b> and <b>max()</b> functions to your list of inspirational_people. What happens?</p></div>


## IV. Looping through Lists

Often we want to examine or modify each individual item in a list. An easy way to iterate over a list is using a **for loop**. The general structure of a for loop is:

```
for item in named_list:
    [instructions for what to do with each item]
```

In such for loops, named_list must be an already established list. "item", however, is an arbitrary variable name we are assigning to each item in the list.

38. Run the simple for loop below:


In [None]:
student_names = ["Thor", "Black Widow", "Loki", "Iron Man"]
# note: student_names is an already defined list; *name* is assigned here to each individual item in this list
for name in student_names:
    print(name)

39. We can use for loops to modify items in a list.

For example, see what happens when we apply the **.lower()** method to each item in our sorted_names list.


In [None]:
sorted_names = sorted(student_names)
for name in sorted_names:
    print(name.lower())

<div class="alert alert-block alert-info">

**Iterables in Python:**

There are many more kinds of objects you can use in a `for`-loop than just a `list`. These objects are also called _iterables_ because you can iterate over their contents. To be an iterable, an object needs to have some way to determine the _next_ element. Try using some other objects (e.g., of type `str`, `set`, or `dict`) in the `for`-loop! What would you expect to happen?

</div>


<div class="alert alert-info" role="alert" style="color:blue"><h3 style="color:blue;">Exercises for Part IV </h3>
    
<p style="color:blue;">40. Can you guess how to convert all student names into uppercase? Try to do so below:</p>
</div>


41. To save these variations, we need to save them to a new list. We can do so using the following formula:

```
new_list = []  #creates a new, empty list
for item in existing_list:
    new_item = [modify original item]
    new_list.append(new_item)
```
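A concrete version of this pattern might look like the sketch below. The list `cities` and the choice of the **.capitalize()** method are only illustrations:

```python
cities = ["hanover", "lebanon", "norwich"]  # hypothetical list

capitalized_cities = []  # creates a new, empty list
for city in cities:
    new_city = city.capitalize()         # modify the original item
    capitalized_cities.append(new_city)  # save the modified item

print(capitalized_cities)
print(cities)  # the original list is unchanged
```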


42. Print out this new list:


### IVb. List Comprehensions

We can do the same thing, but more concisely, with a **list comprehension**. The formula for a list comprehension is:

```
new_list = [new_item for item in existing_list]

```

43. An example is below, this time using the **.swapcase()** method for strings:


In [None]:
student_names_swapped = [name.swapcase() for name in sorted_names]
print(student_names_swapped)

<div class="alert alert-info" role="alert" style="color:blue"><h3 style="color:blue;">Exercises for Part IV (continued)</h3>
    
<p style="color:blue;">44. Create a new list of your inspirational people, but converted to upper case.</p>
</div>


In [None]:
# we can do this using a traditional for loop


In [None]:
# or we can use a list comprehension


## V. Python Basics: Writing Functions

**FUNCTIONS:** When we want to repeat the same operation on many texts (for example, lower-casing the tokens of a text and removing its stopwords), it is helpful to write a function that does this for a single text. Then, we can apply that function across an entire corpus of texts.

We have already used a variety of core Python functions such as **sum()**, **len()**, and **print()**. We have also called functions defined in other Python libraries and modules, such as the **ndiff()** function from the **difflib** module we imported.

Sometimes, however, we will want to create our own functions.

A function is a piece of code that only runs when it is called (this will make a bit more sense after we see an example). We can pass parameters (data) into a function, which will perform operations on them.

In Python, we use the `def` keyword to define a function.

```python
def function_name(arguments_to_pass_in):
    # function instructions go here
    return results_of_function
```

47. Most, but not all, functions return something using the `return` statement. Here is a very simple function that prints a phrase but does not return anything.


In [None]:
# First, create a function
def print_hello():
    print("Hello!")

In [None]:
# Next, call the function
print_hello()

In [None]:
# Now, let's try adding some arguments/parameters to our functions
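
As one possible sketch of a function that takes arguments and returns a value, consider the example below. The function name `make_greeting` and its parameters are made up for illustration:

```python
def make_greeting(name, punctuation="!"):
    """Return a greeting for name; punctuation is optional and defaults to '!'."""
    greeting = "Hello, " + name + punctuation
    return greeting


# the returned value can be saved to a variable and reused later
greeting_bob = make_greeting("Bob")
greeting_ana = make_greeting("Ana", punctuation="?")
print(greeting_bob)
print(greeting_ana)
```

Unlike `print_hello()`, this function hands its result back to the caller, so the result can be stored, printed, or passed into another function.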


48. **A SIMPLE FUNCTION:** So, for example, if we had a list of names and we wanted to create a function to retrieve the initials of each, we could use the following function:


In [None]:
import re


def Initials(fullname):
    # use the findall function from the re module to find all capitalized letters
    caps = re.findall(r"[A-Z]", fullname)
    # take our list of capitalized letters stored in "caps" and concatenate it
    inits = "".join(caps)
    return inits


fullname = "Jeremy M. Mikecz"  # replace w/ your name
Initials(fullname)

49. We can now apply this function to quickly return the initials from a long list of names.


In [None]:
actorlist = [
    "Christoph Waltz",
    "Tom Hardy",
    "Doug Walker",
    "Daryl Sabara",
    "J.K. Simmons",
    "Brad Garrett",
    "Chris Hemsworth",
    "Alan Rickman",
    "Henry Cavill",
    "Kevin Spacey",
    "Giancarlo Giannini",
    "Johnny Depp",
    "Johnny Depp",
    "Henry Cavill",
    "Peter Dinklage",
    "Chris Hemsworth",
    "Johnny Depp",
    "Will Smith",
    "Aidan Turner",
    "Emma Stone",
    "Mark Addy",
    "Aidan Turner",
    "Christopher Lee",
    "Naomi Watts",
    "Leonardo DiCaprio",
    "Robert Downey Jr.",
    "Liam Neeson",
    "Bryce Dallas Howard",
    "Albert Finney",
    "J.K. Simmons",
    "Robert Downey Jr.",
    "Johnny Depp",
    "Hugh Jackman",
    "Steve Buscemi",
    "Glenn Morshower",
    "Bingbing Li",
    "Tim Holmes",
    "Emma Stone",
    "Jeff Bridges",
    "Joe Mantegna",
    "Ryan Reynolds",
    "Tom Hanks",
    "Christian Bale",
    "Jason Statham",
    "Peter Capaldi",
    "Jennifer Lawrence",
    "Benedict Cumberbatch",
    "Eddie Marsan",
    "Leonardo DiCaprio",
    "Jake Gyllenhaal",
    "Charlie Hunnam",
    "Glenn Morshower",
    "Harrison Ford",
    "A.J. Buckley",
    "Kelly Macdonald",
    "Sofia Boutella",
    "John Ratzenberger",
    "Tzi Ma",
    "Oliver Platt",
    "Robin Wright",
    "Channing Tatum",
    "Christoph Waltz",
    "Jim Broadbent",
    "Jennifer Lawrence",
    "Christian Bale",
    "John Ratzenberger",
    "Amy Poehler",
    "Robert Downey Jr.",
    "Chloë Grace Moretz",
    "Will Smith",
    "Jet Li",
    "Will Smith",
    "Jimmy Bennett",
    "Tom Cruise",
    "Jeanne Tripplehorn",
    "Joseph Gordon-Levitt",
    "Amy Poehler",
    "Scarlett Johansson"
]

<div class="alert alert-info" role="alert" style="color:blue"><h3>Exercises</h3>
    <p>50. Apply the Initials function to each name in our actorlist using either a for loop or a list comprehension.</p>
</div>


## VI. Working with Files

An essential skill in Python is navigating the files on your computer, either to read existing files into Python or to write new files. Fortunately, you now have experience with the prerequisite skills for navigating through and importing files:

- importing Python libraries
- applying functions
- looping through lists

To enable navigating through files on your computer, we will use the **pathlib** library.

51. Let's import it now.


In [None]:
from pathlib import Path

To review the information about an object we can type `?Object_name` and to get a list of methods for that object: `dir(Object_name)`

52. Examine what the following functions do. Hint: **cwd()** means "current working directory."


In [None]:
print(Path.cwd())
print(Path.cwd().parent)
print(Path.cwd().parent.parent)

53. We can use pathlib's **Path** class to store a file path to another folder (outside the current working directory).

```
pathdirname = Path("path/to/folder")
```
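A `Path` object does more than hold a string. The sketch below shows a few of its attributes. The folder and file names are placeholders that do not need to exist on your computer:

```python
from pathlib import Path

# "path/to/folder" and "speech_1990.txt" are placeholders, not real files
folder = Path("path/to/folder")
file_path = folder / "speech_1990.txt"  # the / operator joins path parts

print(file_path.name)    # file name with its extension
print(file_path.stem)    # file name without its extension
print(file_path.suffix)  # the extension only
print(file_path.parent == folder)
```

The `.stem` attribute is especially handy when file names carry metadata, because it gives us the name without the extension.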


54. We can then place all .txt files in a list and print out that list.
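One way to do this is with the **.glob()** method of a `Path` object. The sketch below first builds a throwaway folder with made-up file names so that it runs on any computer. In the workshop, you would instead point `textdir` at your own folder of texts:

```python
import tempfile
from pathlib import Path

# build a temporary folder with three small files (hypothetical names)
textdir = Path(tempfile.mkdtemp())
for filename in ["Adams_1797.txt", "Bush_1989.txt", "notes.csv"]:
    (textdir / filename).write_text("placeholder text", encoding="utf-8")

# .glob("*.txt") finds every path ending in .txt; sorted() returns them as a sorted list
pathlist = sorted(textdir.glob("*.txt"))
for path in pathlist:
    print(path.name)
```

The `*` in `"*.txt"` is a wildcard that matches any file name, so the .csv file is left out of the list.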


55. In some cases, the names of files contain metadata about the files themselves. In these cases, we can use a for loop to iterate through our files and save some of this metadata.


In [None]:
presidents = []
for path in pathlist:
    file_stem = path.stem
    stem_parts = file_stem.split("_")
    pres_name = stem_parts[0]
    year = stem_parts[1]
    presidents.append(pres_name)
print(presidents)

56. For lists that include repeated values, we can use the **Counter** class from the **collections** module to count the frequency of each item in a list.
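A minimal sketch, using a short made-up list of names instead of the list built from our files:

```python
from collections import Counter

# a short, made-up list with repeated values
names = ["Adams", "Bush", "Adams", "Clinton", "Bush", "Adams"]

name_counts = Counter(names)
print(name_counts["Adams"])        # how many times "Adams" appears
print(name_counts.most_common(2))  # the two most frequent items with their counts
```

A `Counter` behaves like a dictionary whose keys are the items and whose values are the counts.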


## VII. Reading in Text Files

57. To read in a text file, we will want to:

- identify the file path to one text and save it as `path_name`
- open this file path as `f`
  - we will open the file within a `with` statement. This will mean the file is closed immediately once the interpreter has moved beyond the indented portion of the code. This will help ensure we don't accidentally corrupt the file.
- read the file using the **.read()** method and save it as `txt1`

The result should look something like this:

```
path_name = Path("path/to/textfile/filename")
with open(path_name, encoding = 'utf-8') as f:
    txt1 = f.read()
```
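To try this pattern without depending on a particular file, the sketch below first writes a small temporary file and then reads it back. The file name and its contents are made up for illustration:

```python
import tempfile
from pathlib import Path

# create a small temporary file so there is something to read
path_name = Path(tempfile.mkdtemp()) / "sample.txt"
path_name.write_text("Two roads diverged in a wood.", encoding="utf-8")

with open(path_name, encoding="utf-8") as f:
    txt1 = f.read()

# the file is closed automatically once the with block ends
print(f.closed)
print(txt1)
```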


58. Try slicing the `txt1` text string you just created. For example, retrieve the first and last 100 characters of the string.


<div class="alert alert-info" role="alert" style="color:blue">
    <p><b>59b. Exercises</b>:</p> 
    <p>Open a text (of your own choosing) and save it into the variable "txt2".</p>
</div>


<div class="alert alert-info" role="alert" style="color:blue"><h3>Exercises for Part VII</h3>
    
<p>60. Add a coding cell below and print out the first and last 200 characters in your selected <b>txt2</b> text. Can you identify any major themes from the opening and closing words of this text? If not, expand the number of characters you are examining.</p></div>


## VIII. Applying Functions to Files

61. The function below reads in every text file in a folder, stores the length and file name of each text as a tuple in a list, and then returns the _n_ shortest texts.


In [None]:
def get_shortest_texts(folder, n=5):
    """
    This function
    + takes a folder path,
    + reads in all text files found within the folder
    + calculates the character length of each text
    + and then identifies and returns the n shortest texts in the collection
    """
    text_info_list = []
    pathlist = sorted(folder.glob("*.txt"))
    for path in pathlist:
        # specify the encoding so the files are read the same way on every system
        with open(path, encoding="utf-8") as f:
            txt = f.read()
        filestem = path.stem
        text_len = len(txt)
        # store the length first so that sorting orders the tuples by length
        text_info_list.append((text_len, filestem))
    sorted_list = sorted(text_info_list)
    return sorted_list[:n]

In [None]:
short_texts = get_shortest_texts(textdir, n=8)
print(short_texts)