## Distant reading study week 3 (VT-23)

### Learning material 3a: Processing several files (txt)

Matti La Mela

This Notebook introduces us to reading several files with glob (global) and processing them in Python, which is a way to scale up your research.


### 1. Reading several text files

In [22]:
# glob allows us to search for filenames in a path. We can use the result for reading the files name by name.

import glob

# Let us see what text files we have in our dhq_corpus_complete_2007_2020 path. If we use the wildcard *.txt, we get all the files that have
# the .txt file extension. Glob creates a list of the file names:

list_files = glob.glob("../text_week3/dhq_corpus_complete_2007_2020/*.txt")

# Let's print the first five filenames:

for filename in list_files[0:5]:
    print(filename)


../text_week3/dhq_corpus_complete_2007_2020\dhq-2007-000001-Drucker-Philosophy.txt
../text_week3/dhq_corpus_complete_2007_2020\dhq-2007-000002-Howard-Interpretative.txt
../text_week3/dhq_corpus_complete_2007_2020\dhq-2007-000003-VandeCreek-Webs.txt
../text_week3/dhq_corpus_complete_2007_2020\dhq-2007-000004-Patrik-Encoding.txt
../text_week3/dhq_corpus_complete_2007_2020\dhq-2007-000005-Wolff-Reading.txt
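
A small, self-contained sketch of the same idea. glob returns filenames in arbitrary order, so wrapping the call in sorted() is a common way to get a stable order. The folder and file names below are made up for illustration and are created in a temporary directory, so the example does not depend on the dhq corpus.

```python
import glob
import os
import tempfile

# Build a small throwaway folder so the example runs anywhere
# (the folder and the file names are hypothetical, not part of the dhq corpus).
demo_dir = tempfile.mkdtemp()
for name in ["b_article.txt", "a_article.txt", "notes.csv"]:
    with open(os.path.join(demo_dir, name), mode="w", encoding="utf-8") as file:
        file.write("demo")

# "*.txt" matches only the files ending in .txt, so notes.csv is left out.
# sorted() puts the resulting list in alphabetical order.
txt_files = sorted(glob.glob(os.path.join(demo_dir, "*.txt")))

for filename in txt_files:
    print(os.path.basename(filename))
```

With the pattern "*" instead of "*.txt", notes.csv would also end up in the list.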


In [16]:
# We open every file in our list_files in the for-loop and do other relevant operations with them (e.g. data cleaning could be done here).

# In this example we save our files as strings to a list ("texts"), where we could then continue working with them.

texts = []

for filename in list_files:
    with open(filename, mode="r", encoding="utf-8") as file:
        article = file.read()
        texts.append(article)
        

In [17]:
# We have now the whole dhq corpus stored to a list. 

# Let's print the first 200 characters of the first article in our list: first index is for the list element [0], and the second for the range in the string [0:200]

print(texts[0][0:200])

Philosophy and Digital Humanities: A review of Willard McCarty, Humanities
      Computing (London and NY: Palgrave, 2005)
Johanna Drucker
3 April 2007

Humanities computing is still a fledgling disci


### 2. Reading files and processing them in spaCy


In [18]:
# In this example, we open multiple files and process them in spaCy. If we have lots of files, it might not be possible to keep them all in variables
# due to memory restrictions. We process each file when we open it, save the information we need, and then continue with the next file.

# Let's import spacy and load the language model.

import spacy

nlp = spacy.load("en_core_web_sm")
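
The "one file at a time" idea can be sketched without spaCy. The helper iter_texts below is a hypothetical name, not part of the course material: it is a generator that hands over one text at a time, so only the current text is held in memory. The demo files are created in a temporary folder for illustration.

```python
import os
import tempfile


def iter_texts(filenames):
    """Yield (filename, text) pairs one at a time instead of building a list of all texts."""
    for filename in filenames:
        with open(filename, mode="r", encoding="utf-8") as file:
            yield filename, file.read()


# Two small demo files (made-up names and contents)
demo_dir = tempfile.mkdtemp()
demo_files = []
for name, content in [("one.txt", "first text"), ("two.txt", "second, longer text")]:
    path = os.path.join(demo_dir, name)
    with open(path, mode="w", encoding="utf-8") as file:
        file.write(content)
    demo_files.append(path)

lengths = []
for filename, text in iter_texts(demo_files):
    # We keep only the information we need; "text" is replaced on the next round
    lengths.append(len(text))

print(lengths)
```

In the loop body we could equally call nlp(text) and store only the lemmas, which is what the next cells do with an ordinary for-loop.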


In [26]:
# We produce a list of the filenames with glob for our dhq corpus, that is, all the .txt files:

import glob

list_files = glob.glob("./dhq_corpus_complete_2007_2020/*.txt")

In [27]:
# In this example, we want to store three kinds of information in lists and save them for later use:

# 1. the filenames of our files we open: "texts_filenames"
# 2. the lengths of our texts: "texts_length"
# 3. all the noun lemmas: "texts_noun_lemmas"

texts_filenames = []
texts_length = []
texts_noun_lemmas = []

# We will process only ten files from the dhq corpus. We can do this by limiting our for loop with the index range [0:10]
# This is a good way to test that everything works, and when everything works, we could process all the files (by removing the index range).

# In the for-loop, we process the files one by one, when we open them:

for filename in list_files[0:10]:
    
    with open(filename, mode="r", encoding="utf-8") as file:
        
        text = file.read() # We read the file as a string to the variable "text"
        
        text_doc = nlp(text, disable=["parser", "ner"]) # We process "text" with spaCy. We disable the parser and NER components, as we want to save some time
        
        print("Processing file " + filename + " through spacy nlp pipeline") # Let's print some info to the user so we know where we are
        
        # we want to store all the lemmas of nouns found in the article. We do this with another for-loop, like in the week 2 exercises.
        
        lemmas = []
        
        for token in text_doc:
            if token.is_alpha and not token.is_stop:
                if token.pos_ == "NOUN":
                    lemmas.append(token.lemma_)

        # We are also interested in the length of the original text, which we save in the list "texts_length".
        # Finally, we store the filename in the list "texts_filenames". We add the information to all three lists at the same time,
        # so we can iterate over the lists simultaneously and always get the corresponding values (e.g. filename - length - lemmas)

        # The filename with the path is pretty long, so let's take only the name of the file. We could use regular expressions, but we know that
        # the beginning of the path name is always the same. We can simply use the index from character 45, thus index 44 (calculated manually!)
        
        print("Done! Storing information for : " + filename[44:] + "\n")
        
        texts_noun_lemmas.append(lemmas)
        texts_filenames.append(filename[44:])
        texts_length.append(len(text))
                  

Processing file ./dhq_corpus_complete_2007_2020\dhq-2007-000001-Drucker-Philosophy.txt through spacy nlp pipeline
Done! Storing information for : 001-Drucker-Philosophy.txt

Processing file ./dhq_corpus_complete_2007_2020\dhq-2007-000002-Howard-Interpretative.txt through spacy nlp pipeline
Done! Storing information for : 002-Howard-Interpretative.txt

Processing file ./dhq_corpus_complete_2007_2020\dhq-2007-000003-VandeCreek-Webs.txt through spacy nlp pipeline
Done! Storing information for : 003-VandeCreek-Webs.txt

Processing file ./dhq_corpus_complete_2007_2020\dhq-2007-000004-Patrik-Encoding.txt through spacy nlp pipeline
Done! Storing information for : 004-Patrik-Encoding.txt

Processing file ./dhq_corpus_complete_2007_2020\dhq-2007-000005-Wolff-Reading.txt through spacy nlp pipeline
Done! Storing information for : 005-Wolff-Reading.txt

Processing file ./dhq_corpus_complete_2007_2020\dhq-2007-000006-Raben-Tenure.txt through spacy nlp pipeline
Done! Storing information for : 006-Ra
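
The hard-coded index 44 only works as long as the folder path stays exactly the same. As an alternative sketch, the standard library can pick out the filename for us. Note that this returns the full filename (including the "dhq-2007-000" part), not the shortened form stored above. The path below is copied from the output; PureWindowsPath is used because the path contains a Windows-style backslash, and it works on any operating system.

```python
from pathlib import PureWindowsPath

# A path as printed above ("\\" is how we write a single backslash in a Python string)
filename = "./dhq_corpus_complete_2007_2020\\dhq-2007-000001-Drucker-Philosophy.txt"

# PureWindowsPath treats both "/" and "\" as separators
short_name = PureWindowsPath(filename).name    # filename with the extension
stem = PureWindowsPath(filename).stem          # filename without the extension

print(short_name)
print(stem)
```

For paths that are native to the machine running the code, os.path.basename(filename) gives the same result as .name.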

In [29]:
# When we are ready, we should have the lemmas of nouns from the articles in a list, the filenames in another, and the lengths in the third.

# let's print the filename of the first article, the first ten noun lemmas from it, and also its length.

print("The article filename is : " + texts_filenames[0])
print("The first ten noun lemmas are : " + str(texts_noun_lemmas[0][0:10]))
print("The length of the first article is : " + str(texts_length[0]))


The article filename is : 001-Drucker-Philosophy.txt
The first ten noun lemmas are : ['philosophy', 'review', 'computing', 'fledgling', 'discipline', 'spite', 'claim', 'lineage', 'decade', 'labor']
The length of the first article is : 15012


In [30]:
# We could take our list to pandas for counting, but we use here another way for simple counting: the Counter that we get from
# the Python module "collections" is handy for this (though further operations might be easier to do in pandas)

from collections import Counter

example_article = texts_noun_lemmas[0] # We assign the lemmas of the first article to our variable "example_article"

print("Top 5 lemmas in the first article: " + str(Counter(example_article).most_common(5))) # Here we print the 5 most common lemmas in our list

Top 5 lemmas in the first article: [('humanity', 17), ('knowledge', 14), ('way', 11), ('study', 10), ('field', 9)]
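
To see what Counter does on a small scale, here is a sketch with a made-up list of lemmas. most_common(n) returns a list of (element, count) pairs, and asking for an element that never occurred gives 0 instead of an error.

```python
from collections import Counter

# A toy list of lemmas (invented for illustration)
toy_lemmas = ["text", "corpus", "text", "method", "corpus", "text"]

counts = Counter(toy_lemmas)

top_two = counts.most_common(2)   # the two most frequent lemmas with their counts
print(top_two)                    # [('text', 3), ('corpus', 2)]

print(counts["text"])             # 3
print(counts["archive"])          # 0, because "archive" is not in the list
```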


In [31]:
# These steps offer you possibilities for processing several files. Another option would be to simply store all the information to one list
# and then work with that.

# NB! spaCy tokens take a lot of memory as they contain much information. It is thus good to process only one text (or a part of a text)
# at a time, so we don't get memory errors. The default maximum text size that spaCy can process is 1M characters. We have an example of
# processing files that are too big in Material_3b.
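
One possible way to stay under a size limit is to cut a long text into smaller pieces before processing. The sketch below is only an illustration in plain Python and may differ from the approach in Material_3b. The function name split_into_chunks and the choice to break at blank lines (paragraphs) are assumptions made for this example.

```python
def split_into_chunks(text, max_chars=1_000_000):
    """Split text into pieces of at most max_chars characters,
    breaking at paragraph boundaries (blank lines) where possible."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single paragraph longer than max_chars is cut into fixed-size slices
            while len(paragraph) > max_chars:
                chunks.append(paragraph[:max_chars])
                paragraph = paragraph[max_chars:]
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# A tiny demo with a limit of 10 characters instead of one million
demo_text = "aaaa\n\nbbbb\n\ncccccccccccc"
demo_chunks = split_into_chunks(demo_text, max_chars=10)

print([len(chunk) for chunk in demo_chunks])
```

Each chunk could then be passed to nlp() separately. Cutting a very long paragraph at a fixed position can split a word in two, so the tokens at such a boundary should be treated with some care.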


In [34]:
# Finally, let's write our lists to a csv, so we can continue our work in Excel for example.

# We use the module called csv:

import csv

i = 0 # We use i as our index when we iterate the texts we have processed

with open("./lemmas_output.csv", "w", encoding="utf-8", newline="") as file: # We save our output to the file lemmas_output.csv. We open it for writing ("w"); newline="" is recommended for the csv module and avoids empty rows on Windows
    
    # We create a writer object "write" that is able to write things to the "file" that we have created above.
    # The default delimiter (or separator) for writing csv output is "," but as we have commas in the elements
    # that we want to write, we use another delimiter ";". For tab the delimiter is "\t"

    write = csv.writer(file, delimiter = ";")
    
    write.writerow(["Filename", "Article length", "Top five lemmas", "All lemmas of nouns"])      # Let's write first a row with the column names to the csv.
    
    for files in texts_filenames: # In this for-loop, we iterate over all the files we have processed, and write their information to the csv

        lemmas_string = " ".join(texts_noun_lemmas[i]) # We join our list of lemmas into a string variable "lemmas_string"
        count = Counter(texts_noun_lemmas[i]).most_common(5) # We store the top five lemmas to variable "count"

        # We write four fields: filename, length, "count" (the top 5 lemmas from Counter, which the writer converts to a string), and then the lemmas:

        write.writerow([texts_filenames[i], str(texts_length[i]), count, lemmas_string])
        
        i += 1  # We loop through all the items in the texts_filenames: we use i to access the right lists. i increases with every iteration
    

# Keeping several parallel lists is one way to store metadata about our texts. We can also use pandas dataframes for this.
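
The manual index i in the loop above can be avoided with zip(), which walks through several lists in parallel. The sketch below uses made-up data and writes to a temporary file, so it does not touch lemmas_output.csv. It then reads the file back to check what was written.

```python
import csv
import os
import tempfile
from collections import Counter

# Toy data standing in for texts_filenames, texts_length and texts_noun_lemmas
filenames = ["a.txt", "b.txt"]
lengths = [120, 80]
noun_lemmas = [["text", "corpus", "text"], ["method", "method", "tool"]]

output_path = os.path.join(tempfile.mkdtemp(), "lemmas_demo.csv")

with open(output_path, mode="w", encoding="utf-8", newline="") as file:
    writer = csv.writer(file, delimiter=";")
    writer.writerow(["Filename", "Article length", "Top lemma", "All lemmas of nouns"])

    # zip() gives us the matching items from all three lists on every round
    for name, length, lemmas in zip(filenames, lengths, noun_lemmas):
        top_lemma = Counter(lemmas).most_common(1)[0][0]  # the most frequent lemma only
        writer.writerow([name, length, top_lemma, " ".join(lemmas)])

# Read the file back to see the rows as lists of strings
with open(output_path, mode="r", encoding="utf-8", newline="") as file:
    rows = list(csv.reader(file, delimiter=";"))

for row in rows:
    print(row)
```

Note that everything comes back as strings when read with csv.reader, including the article length.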

### 3. Using pandas dataframe for creating a csv (optional)

We can also read our lists into a pandas dataframe, where it is possible to perform further calculations or visualize the data.

We have more columns now (cf. Series in the previous exercise, with only one list), so we create a Pandas DataFrame (i.e. several Series). Pandas is also used for writing our csv file.

In [35]:
# Let's first import pandas

import pandas as pd

# There are several ways to read lists into a pandas dataframe, e.g. using dictionaries. We do the reading by simply adding new columns to our dataframe.

# We create a pandas dataframe, add our list texts_filenames as the first column and name it "Filename"

lemmas_df = pd.DataFrame(texts_filenames, columns=["Filename"])

print(lemmas_df)


                        Filename
0     001-Drucker-Philosophy.txt
1  002-Howard-Interpretative.txt
2        003-VandeCreek-Webs.txt
3        004-Patrik-Encoding.txt
4          005-Wolff-Reading.txt
5           006-Raben-Tenure.txt
6       007-Flanders-Welcome.txt
7      008-Raben-Introducing.txt
8         009-Jerz-Somewhere.txt
9                010-Eve-All.txt
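
The dictionary route mentioned in the comment above builds all the columns in one step: the keys become the column names and the lists become the columns. A minimal sketch with made-up data (the variable names are chosen for this example only):

```python
import pandas as pd

# Toy data standing in for our three lists; all lists must have the same length
filenames = ["a.txt", "b.txt"]
lengths = [120, 80]
noun_lemmas = [["text", "corpus"], ["method", "tool"]]

demo_df = pd.DataFrame({
    "Filename": filenames,
    "Text length": lengths,
    "Lemmas": noun_lemmas,
})

print(demo_df)
```

The result has the same shape as a dataframe built column by column with insert().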


In [36]:
# We can insert more columns with insert()

# Here we insert the next list, "texts_length", into our dataframe. With loc, we define where we want our column (0 -> first column, 1 -> last in this case)

lemmas_df.insert(loc=1, column="Text length", value = texts_length)

print(lemmas_df)

                        Filename  Text length
0     001-Drucker-Philosophy.txt        15012
1  002-Howard-Interpretative.txt        67395
2        003-VandeCreek-Webs.txt        59335
3        004-Patrik-Encoding.txt        33031
4          005-Wolff-Reading.txt        40675
5           006-Raben-Tenure.txt         4705
6       007-Flanders-Welcome.txt         6356
7      008-Raben-Introducing.txt         7594
8         009-Jerz-Somewhere.txt       128041
9                010-Eve-All.txt        92290


In [37]:
# Finally, we insert our lemmas to the dataframe with insert()

lemmas_df.insert(loc=2, column="Lemmas", value = texts_noun_lemmas)

print(lemmas_df)

                        Filename  Text length  \
0     001-Drucker-Philosophy.txt        15012   
1  002-Howard-Interpretative.txt        67395   
2        003-VandeCreek-Webs.txt        59335   
3        004-Patrik-Encoding.txt        33031   
4          005-Wolff-Reading.txt        40675   
5           006-Raben-Tenure.txt         4705   
6       007-Flanders-Welcome.txt         6356   
7      008-Raben-Introducing.txt         7594   
8         009-Jerz-Somewhere.txt       128041   
9                010-Eve-All.txt        92290   

                                              Lemmas  
0  [philosophy, review, computing, fledgling, dis...  
1  [pedagogy, quest, genre, gaming, activity, str...  
2  [significance, history, product, type, library...  
3  [year, tradition, philosophy, meditation, effo...  
4  [researcher, humanity, effort, use, computer, ...  
5  [launching, journal, status, community, public...  
6  [issue, access, journal, issue, time, making, ...  
7  [issue, feature, 

In [38]:
# The lemmas are in a list format. If we want a simple string instead, we can convert each list into a string with the str.join() method that is available in Pandas.
# This is similar to join() we have used for lists outside Pandas, e.g. text = " ".join(our_list)

lemmas_df["Lemmas"] = lemmas_df["Lemmas"].str.join(" ") # "Lemmas" is the name of the column, we add a " " between the list elements that we join together.

print(lemmas_df)

# Ok, now the lemmas are a string of lemmas separated with a " "

                        Filename  Text length  \
0     001-Drucker-Philosophy.txt        15012   
1  002-Howard-Interpretative.txt        67395   
2        003-VandeCreek-Webs.txt        59335   
3        004-Patrik-Encoding.txt        33031   
4          005-Wolff-Reading.txt        40675   
5           006-Raben-Tenure.txt         4705   
6       007-Flanders-Welcome.txt         6356   
7      008-Raben-Introducing.txt         7594   
8         009-Jerz-Somewhere.txt       128041   
9                010-Eve-All.txt        92290   

                                              Lemmas  
0  philosophy review computing fledgling discipli...  
1  pedagogy quest genre gaming activity structure...  
2  significance history product type library reso...  
3  year tradition philosophy meditation effort ma...  
4  researcher humanity effort use computer techno...  
5  launching journal status community publication...  
6  issue access journal issue time making effort ...  
7  issue feature iss

In [40]:
# Let's write this to csv. We use the delimiter / separator ";" again.
# Also, we give the parameter index = False, because we do not want to write row labels (in this case, 0, 1, 2, ...)

lemmas_df.to_csv("./lemmas_output_pandas.csv", encoding="utf-8", index = False, sep = ";")
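
To check what such a file looks like and that it can be read back, here is a small round-trip sketch with made-up data. It writes to an in-memory buffer (io.StringIO) instead of a file on disk. When reading, we need to give the same separator, sep=";".

```python
import io

import pandas as pd

# A toy dataframe standing in for lemmas_df
demo_df = pd.DataFrame({
    "Filename": ["a.txt", "b.txt"],
    "Text length": [120, 80],
    "Lemmas": ["text corpus", "method tool"],
})

# Write the csv into a text buffer with the same parameters as above
buffer = io.StringIO()
demo_df.to_csv(buffer, sep=";", index=False)
print(buffer.getvalue())

# Go back to the beginning of the buffer and read the csv into a new dataframe
buffer.seek(0)
loaded_df = pd.read_csv(buffer, sep=";")
print(loaded_df)
```

For a real file, the filename (e.g. "./lemmas_output_pandas.csv") takes the place of the buffer in both calls.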