# Data description

The dataset for this project consists mainly of plain-text files produced by OCR from scanned pages of Encyclopedia Britannica. The scans were also provided with positional XML files, and an inventory CSV file gives the full title for each text file.

We decided to use only the textual data for this project. The text folder contained 155 TXT files and an inventory CSV file. The TXT file names are opaque and irregular, such as 144850377.txt and 194474782.txt, so their full titles can only be looked up in the inventory CSV file. According to the inventory, each edition of Encyclopedia Britannica is split across numerous files, each representing one section of the printed work. Our first step was therefore to merge the files of each edition manually with an online TXT file merger, deleting repeated and unnecessary files along the way. After merging, we had 8 large TXT files, one for each edition of Encyclopedia Britannica.
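
The merge was done by hand, but the same grouping could probably be scripted from the inventory. The following is only a minimal sketch: the function name `merge_by_edition` and the inventory column names `filename` and `edition` are assumptions for illustration, since the real inventory layout is not shown here. Files are concatenated in the order they appear in the inventory.

```python
import csv
from collections import defaultdict
from pathlib import Path


def merge_by_edition(text_dir, inventory_csv, out_dir,
                     file_col="filename", edition_col="edition"):
    """Concatenate the text files of each edition into one file per edition.

    file_col and edition_col are hypothetical column names; adjust them to
    match the real inventory CSV.
    """
    text_dir, out_dir = Path(text_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group file names by edition, keeping the inventory's row order
    groups = defaultdict(list)
    with open(inventory_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            groups[row[edition_col]].append(row[file_col])

    written = []
    for edition, names in sorted(groups.items()):
        target = out_dir / f"Edition{edition}.txt"
        with open(target, "w", encoding="utf-8") as out:
            for name in names:
                out.write((text_dir / name).read_text(encoding="utf-8"))
                out.write("\n")
        written.append(target.name)
    return written
```

A call such as `merge_by_edition("text/", "text/inventory.csv", "encyclopedia/")` would then write `Edition1.txt`, `Edition2.txt` and so on, assuming the edition column holds plain edition numbers. Repeated or unnecessary files would still need to be removed from the inventory beforehand.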

The Encyclopedia Britannica covers a broad range of human knowledge. Each edition was published over a period of several years. Compared with the 1st (1768–1771), 2nd (1777–1784) and 3rd (1788–1797) editions, the 4th (1801–1810), 5th (1815–1817) and 6th (1820–1823) editions cover far more material in more volumes, and the 7th (1830–1842) and 8th (1853–1860) editions each add one further volume.

Our project aims to analyze the textual data and trace how the treatment of a particular topic changed over time. For my part, I chose to analyze how the topic of climate change was reflected during the years of the first industrial revolution (1760-1840). According to my research, the Earth's natural greenhouse effect was recognized in 1860. By 1896, carbon dioxide levels were being linked to the effect. Later, in the 1940s and 1950s, the Carbon Dioxide Theory of Climate Change was formulated. It would be interesting to see how people in the era of the first industrial revolution described climate issues and how their concerns shifted over time. The last edition in our dataset was published in the 1850s, so the issue was very likely not widely recognized during the period, but I'm interested in finding some early clues, such as heavy use of fossil fuels and records of extreme weather and temperature changes.

# Explore the data

The most direct way to pursue this goal is to analyze the frequency of words relating to climate change. Visualizing how keyword frequencies change across editions could give useful results. To do that, I need code that cleans the textual data, counts the keywords, and writes the results to a data frame or CSV file for visualization. In the next part, I show the code for this process with explanations.

First, load the data by listing the text files in the encyclopedia folder. The folder contains the 8 previously merged text files, each named after its edition.

In [1]:
import os
import re
from nltk.stem.lancaster import LancasterStemmer
import pandas as pd

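# List the merged edition files: keep only .txt files and sort them by name
# (filtering by extension skips hidden files instead of relying on list positions)
dirs = sorted(name for name in os.listdir("encyclopedia/") if name.endswith(".txt"))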
print(dirs)
book_text = ''

['Edition1.txt', 'Edition2.txt', 'Edition3.txt', 'Edition4.txt', 'Edition5.txt', 'Edition6.txt', 'Edition7.txt', 'Edition8.txt']


Next, I need to iterate through each text file and extract information from it. Several steps are involved in this part.

In a for loop, I have to:

1. Clean the text, which includes removing line breaks and non-English characters as well as stemming.
2. Tokenize each text.
3. Define a dictionary of keywords to search for in each text.
4. Count how often each keyword occurs in each edition of Encyclopedia Britannica.
5. Create a dataframe to record the search results. The row index + 1 gives the edition of the book, and the column headers are the keywords.
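
The cleaning, tokenizing and counting steps above can be sketched on a small string before running them on whole editions. This is an illustrative sketch, not the notebook's actual cell: the helper names `tokenize` and `count_keywords` are made up, stemming is left out to avoid third-party dependencies, and the sketch assumes that words hyphenated across a line break should be rejoined before other hyphens are dropped.

```python
import re


def tokenize(text):
    """Return lower-case alphanumeric tokens from raw OCR text."""
    # Rejoin words split across a line break, e.g. "tempera-\nture"
    text = re.sub(r"-\s*\n\s*", "", text)
    # Drop remaining hyphens and lower-case the text
    text = text.replace("-", "").lower()
    # Replace every run of other characters with a single space, then split
    return re.sub(r"[^a-z0-9]+", " ", text).split()


def count_keywords(tokens, keywords):
    """Count exact-match occurrences of each keyword, in keyword order."""
    counts = {keyword: 0 for keyword in keywords}
    for token in tokens:
        if token in counts:
            counts[token] += 1
    return counts


sample = "The Climate here is mild. Tempera-\nture and climate vary; coal-gas is burned."
tokens = tokenize(sample)
result = count_keywords(tokens, ["climate", "temperature", "coal"])
print(result)  # {'climate': 2, 'temperature': 1, 'coal': 0}
```

Because matching is exact, "coal-gas" becomes the single token "coalgas" and is not counted as "coal". Plurals and inflected forms are missed for the same reason.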

In [2]:
# Iterate the directory and open txts
index = 0
for file in dirs:
    if file.endswith(".txt"):
        f = open("encyclopedia/" + file,"r",encoding = "utf_8")
        book_text = f.read()

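        # Clean the text
        # Remove hyphens and convert to lower case
        book_text = book_text.replace("-", "").lower()
        # Replace every run of non-alphanumeric characters (including line breaks) with a space
        book_text = re.sub("[^a-zA-Z0-9]+", " ", book_text)
        # Stemming
        # Note: stem() expects a single word and returns its result; the return value
        # is not assigned here, so book_text itself stays unstemmed
        lancaster_stem = LancasterStemmer()
        lancaster_stem.stem(book_text)
        # Note: len() on a string counts characters, not words
        print("Total word count of "+ file)
        print(len(book_text))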
        # Tokenize the text
        clean_text = book_text.split()

        # Print some of the text to see if it is loaded successfully
        print("——————— Cleaned sample of " + file)
        print(clean_text[800:850])
    
        # Define keywords and search keyword frequencies
        dic = {"cholera":0,"typhus":0,"diphtheria":0,"silicosis":0,"pneumoconiosis":0}

        # Search keyword frequencies in the text
        for word in clean_text:
            if word in dic:
                dic[word] += 1
        # Release the token list before loading the next edition
        clean_text = ''

        print("\nFrequency search result in edition " + str(index+1) + ": ")
        print(dic)
        print("\n")

        # Add the frequency result to the dataframe
        if(index == 0):
            # Create the first line of the dataframe
            df = pd.DataFrame(data=dic, index = [index])
        else:
            # Convert the current line of dictionary values to an list
            values = dic.values()
            values_list = list(values)
            # Add the line to the dataframe
            df.loc[index] = values_list
        index += 1
        f.close()

Total word count of Edition1.txt
13618679
——————— Cleaned sample of Edition1.txt
['made', 'with', 'refjpect', 'to', 'mineralogy', 'materia', 'me', 'die', 'a', 'pa', 'thology', 'phyfiology', 'and', 'therapeutics', 'thele', 'are', 'fo', 'interwoven', 'with', 'anatomy', 'botany', 'chemiftry', 'and', 'medicine', 'that', 'in', 'a', 'work', 'of', 'this', 'kind', 'it', 'was', 'almoh', 'impoffible', 'without', 'many', 'none', 'cellary', 'repetitions', 'to', 'treat', 'them', 'as', 'diiiincd', 'fciences', 'indeed', 'properly', 'ipeaking', 'they']

Frequency search result in edition 1: 
{'cholera': 5, 'typhus': 0, 'diphtheria': 0, 'silicosis': 0, 'pneumoconiosis': 0}


Total word count of Edition2.txt
54708550
——————— Cleaned sample of Edition2.txt
['more', 'agreeably', 'inculcated', 'the', 'various', 'topics', 'of', 'art', 'and', 'fcience', 'a', '2', 'have', 'e', 'p', 'r', 'e', 'f', 'a', 'c', 'have', 'been', 'ranged', 'in', 'a', 'fyftematic', 'order', 'and', 'volumes', 'profefledly', 'written', 

Now I have a dataframe of word frequencies, with one row per edition and one column per keyword. For convenience, I also save the dataframe as a CSV file.

In [3]:
# Add the Edition column at the first position
df.insert(0, 'edition', ['1', '2', '3', '4', '5', '6', '7', '8'])
# Output result
print("\n Final output dataframe:")
print(df)
df.to_csv("output_word_proportion_.csv")


 Final output dataframe:
  edition  cholera  typhus  diphtheria  silicosis  pneumoconiosis
0       1        5       0           0          0               0
1       2       31      27           0          0               0
2       3       33      38           0          0               0
3       4       35      61           0          0               0
4       5       35      69           0          0               0
5       6       36      62           0          0               0
6       7       23      11           0          0               0
7       8      110      26           1          0               0
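
The editions differ greatly in length, so raw counts may partly reflect the size of each edition. One possible refinement is to express each count per million tokens. The sketch below uses a hypothetical helper, `per_million`, and the token total in the example is invented purely for illustration; the real totals would come from `len(clean_text)` for each edition.

```python
def per_million(counts, total_tokens):
    """Convert raw keyword counts into occurrences per million tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return {word: count * 1_000_000 / total_tokens for word, count in counts.items()}


# Illustrative numbers only: the token total below is made up for the example
example_counts = {"cholera": 5, "typhus": 0}
example_total_tokens = 2_500_000
rates = per_million(example_counts, example_total_tokens)
print(rates)  # {'cholera': 2.0, 'typhus': 0.0}
```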


## Visualization

With the results, it is now possible to analyze the distributions of keyword frequencies and the relationships between them. Since most of the values show an upward trend, line charts are a suitable way to show how they evolve.

For the first plot, I chose to visualize the words relating to climate change and natural disasters, to see if the trend of climate change was recorded during the publication of Encyclopedia Britannica.
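
Each plot selects dataframe columns by keyword, so every plotted word must have been included in the keyword dictionary during the counting loop. A small guard can report which words are missing before plotting. This is only a sketch, and the helper name `split_available` is hypothetical.

```python
def split_available(columns, wanted):
    """Split the wanted keywords into those present in columns and those missing."""
    available = [word for word in wanted if word in columns]
    missing = [word for word in wanted if word not in columns]
    return available, missing


# Column names as produced by the disease-keyword run shown earlier
counted = ["edition", "cholera", "typhus", "diphtheria", "silicosis", "pneumoconiosis"]
available, missing = split_available(counted, ["climate", "temperature", "typhus"])
print(available)  # ['typhus']
print(missing)    # ['climate', 'temperature']
```

With a dataframe, the same call would be `split_available(df.columns, [...])`. Any word reported as missing needs to be added to the keyword dictionary and the counting loop rerun.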

In [4]:
import seaborn as sns

df.index = df['edition']
sns.set_theme(style="darkgrid")
plot1 = sns.lineplot(data=df['climate'], label='climate')
plot1.set(xlabel = "edition of the book", ylabel='number of occurrences')
plot1.set_title('Frequency of words relating to climate change and natural disasters')
plot2 = sns.lineplot(data=df['temperature'], label='temperature')
plot3 = sns.lineplot(data=df['hazard'], label='hazard')
plot4 = sns.lineplot(data=df['disaster'], label='disaster')

KeyError: 'climate'

It seems that the relevant terms did appear more often over time. Notably, the frequencies of "climate" and "temperature" both rose rapidly along a similar trend, possibly because they often appear in the same contexts in the books.

For the second plot, I want to know if climate change's impact on living creatures was noticed and recorded. The result is shown in the second plot.

In [None]:
sns.set_theme(style="darkgrid")
plot1 = sns.lineplot(data=df['animal'], label='animal')
plot1.set(xlabel = "edition of the book", ylabel='number of occurrences')
plot1.set_title('Frequency of words relating to biodiversity')
plot2 = sns.lineplot(data=df['plant'], label='plant')
plot3 = sns.lineplot(data=df['extinct'], label='extinct')
plot4 = sns.lineplot(data=df['diminish'], label='diminish')

The terms "animal" and "plant" increased over time, and the words "extinct" and "diminish" also appeared more frequently in the later editions.

Next, I want to visualize the trend of the first energy revolution. Since the main source of energy during the first industrial revolution was fossil fuel, the appearance of words relating to energy should be increasing. For this part, I chose to use both a line plot and a stacked area plot. In the output, the lines represent conceptual words relating to energy, while the stacked areas represent physical energy sources.

In [None]:
plot1 = df[['coal','gas','fuel']].plot.area()
plot1.set(xlabel = "edition of the book", ylabel='number of occurrences')
plot1.set_title('Frequency of words relating to energy')
plot3 = sns.lineplot(data=df['energy'], label='energy', color='r')
plot2 = sns.lineplot(data=df['carbon'], label='carbon', color='k')
plot4 = sns.lineplot(data=df['emission'], label='emission', color = 'm')

The counts suggest that energy sources were discussed far more in the later editions, and it is likely that this growing coverage also increased the appearance of "carbon". The next question is: did these books record what the emission of greenhouse gases did to their environment? The results are shown in the next plot.

In [None]:
sns.set_theme(style="darkgrid")
plot1 = sns.lineplot(data=df['atmosphere'], label='atmosphere')
plot1.set(xlabel = "edition of the book", ylabel='number of occurrences')
plot1.set_title('Frequency of words relating to greenhouse effects')
plot2 = sns.lineplot(data=df['atmospheric'], label='atmospheric')
plot3 = sns.lineplot(data=df['sealevel'], label='sealevel')
plot4 = sns.lineplot(data=df['greenhouse'], label='greenhouse')

Although the overall trend is upward, the rise in some term frequencies is abrupt, so it does not support firm conclusions.

Lastly, I want to know whether there was an early awareness of environmental protection, and whether this awareness strengthened before the emergence of climate change theories. The result is shown in the final plot.

In [None]:
sns.set_theme(style="darkgrid")
plot1 = sns.lineplot(data=df['environment'], label='environment')
plot1.set(xlabel = "edition of the book", ylabel='number of occurrences')
plot1.set_title('Frequency of words relating to environmental protection')
plot2 = sns.lineplot(data=df['protection'], label='protection')
plot3 = sns.lineplot(data=df['preservation'], label='preservation')
plot4 = sns.lineplot(data=df['conservation'], label='conservation')

# Reflect and Hypothesise

Based on the information in the plots, I made the following observations:

1. The trend of climate change and its increasing effects is reflected across the editions of Encyclopedia Britannica.

2. The growth in "animal" and "plant" runs parallel with another trend, the decrease of biodiversity.

3. The books recorded the increase in energy consumption and carbon emission during the boom of the first industrial revolution.

4. Terms relating to the greenhouse effect increased rapidly after the publication of the fifth edition (1815 to 1817).

5. Words relating to environmental protection show a clear increasing trend in the later editions.

Reflecting on the results, I propose three hypotheses:

1. Although the theory of climate warming did not emerge before the 1900s, people in earlier periods had noticed related trends and were concerned about them.

2. An increasing trend of natural disasters and species extinction served as a warning to the people of the time.

3. Awareness of environmental protection had begun to take root before the advent of climate change theories.

The first hypothesis mainly comes from observations 1 and 4. However, word frequencies in encyclopedias are not enough on their own; studying more contemporary sources, such as reports and weather records, would make this hypothesis more convincing. The second hypothesis comes from observation 2 and from extending the first hypothesis. Although more animals and plants are added to the books (which is natural considering the increasing number of volumes), references to extinction also seem to be growing. Nevertheless, the frequency of the words "extinct" and "diminish" alone is not enough to prove this hypothesis. If the text were processed further to extract terms of more than one word, such as "endangered animal", the hypothesis could be tested further.
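
Counting multi-word terms could be done on the same token list used for single keywords. The following is a minimal sketch under that assumption; the helper name `count_phrases` is made up, and the plain sliding-window comparison would be slow on a full edition without further optimisation.

```python
def count_phrases(tokens, phrases):
    """Count how often each multi-word phrase occurs as consecutive tokens."""
    counts = {}
    for phrase in phrases:
        words = phrase.lower().split()
        width = len(words)
        # Slide a window of the phrase's length over the token list
        counts[phrase] = sum(
            1
            for start in range(len(tokens) - width + 1)
            if tokens[start:start + width] == words
        )
    return counts


example_tokens = "the endangered animal and the extinct animal and an endangered animal".split()
phrase_counts = count_phrases(example_tokens, ["endangered animal", "extinct animal", "sea level"])
print(phrase_counts)  # {'endangered animal': 2, 'extinct animal': 1, 'sea level': 0}
```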

The third hypothesis might be the most convincing one, considering that words relating to environmental protection clearly appeared more often in the middle and late industrial revolution (5th to 8th editions of the encyclopedia). Moreover, according to previous research, 1860 was the year in which the Earth's natural greenhouse effect was recognized, right after the publication of the 8th edition. It can be assumed that the growing awareness of environmental protection prompted people to do relevant research. To test this hypothesis further, it will be necessary to study more relevant publications from the first fifty years of the 1800s.

Overall, the word frequency analysis did provide meaningful results and interesting hypotheses, but the data processing is rather rough. The hypotheses can be analyzed further with more advanced methods. In addition to the methods mentioned above, the textual data provided could be processed further through spelling correction, more thorough stemming, and other Natural Language Processing (NLP) techniques.
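
As one example of spelling correction, the cleaned sample of the first edition contains tokens such as 'chemiftry', 'phyfiology' and 'fciences', where the OCR has read the long s as "f". A possible correction, sketched below, tries replacing "f" with "s" and accepts the result only if it is a known word. The function name `fix_long_s` and the tiny vocabulary are assumptions for illustration; a real run would need a proper word list.

```python
from itertools import product


def fix_long_s(token, vocabulary, max_f=6):
    """Return a vocabulary word obtained by turning some 'f' into 's', else the token."""
    if token in vocabulary:
        return token
    positions = [i for i, char in enumerate(token) if char == "f"]
    # Skip tokens without 'f' or with too many to enumerate cheaply
    if not positions or len(positions) > max_f:
        return token
    # Try every combination of keeping 'f' or replacing it with 's'
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for position, char in zip(positions, choice):
            chars[position] = char
        candidate = "".join(chars)
        if candidate in vocabulary:
            return candidate
    return token


# Hypothetical mini vocabulary; replace with a real English word list
vocabulary = {"chemistry", "physiology", "sciences", "impossible", "of", "fossil"}
ocr_tokens = ["chemiftry", "phyfiology", "fciences", "impoffible", "of", "foffil"]
corrected = [fix_long_s(token, vocabulary) for token in ocr_tokens]
print(corrected)
```

Tokens that are damaged in other ways, such as 'almoh' in the same sample, are left unchanged by this approach.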