# Assignment 1: Pick your own CSV to analyze with Pandas

## Course: Analyzing Structured Data with Pandas and ElementTree

### [Centre for Data, Culture & Society](https://www.cdcs.ed.ac.uk/)

#### Instructor Example by Lucy Havens

March 17, 2022

***

## About the CSV Data in this Jupyter Notebook

In this Jupyter Notebook, I explore a subset of data available on the National Library of Scotland (NLS) [Data Foundry](https://data.nls.uk), a website where the NLS has published digitized versions of some of its collections.  Using the XML file of metadata from [The National Bibliography of Scotland (version 1)](https://data.nls.uk/data/metadata-collections/national-bibliography-of-scotland/), I created a CSV file with a subset of the metadata in the Bibliography.  Then, I extracted a subset of that metadata, taking only rows for books that were published in Edinburgh in the 1900s.

As explained in the *Exploring The National Bibliography of Scotland (version 1)* notebook ([available here](https://data.nls.uk/tools/jupyter-notebooks/exploring-the-national-bibliography-of-scotland/)):

*This dataset is the first version of the bibliographic records for the National Bibliography of Scotland (NBS). This version of the National Bibliography of Scotland references materials published in Scotland, materials in language Scots, or materials in language Scottish Gaelic from National Library of Scotland's main catalogue. This is the first iteration of the new National Bibliography of Scotland, which was originally produced in April 2019. National Bibliography of Scotland is an ongoing programme of work.*

The metadata was created by NLS employees through a combination of manually typed entries and entries copied (also by manual typing) from handwritten records.


#### 1. Import libraries

We'll import [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html), a library for data science tasks that organizes data in a tabular format, and [NumPy](https://numpy.org/learn/), a library on which pandas is built that has special data structures and methods that are particularly efficient for working with numeric data (more efficient than what's available in Python alone).

*For another introduction to pandas and NumPy, check out [this Codecademy webpage](https://www.codecademy.com/article/introduction-to-numpy-and-pandas).*

In [2]:
import pandas as pd  # give pandas the alias 'pd' for more efficient programming (typing two letters is faster than six!)
import numpy as np   # give NumPy the alias 'np' for the same reason

#### 2. Load the CSV file
There is more than one way to read data into your Jupyter Notebook with Python.  Here we will use the `read_csv()` method from the pandas library, which reads in a file of data organized as comma separated values (CSV) as a **DataFrame**.  A DataFrame is a data structure particular to pandas.  DataFrames organize data into rows and columns, much as Microsoft Excel displays CSV data in a spreadsheet.  Once data are organized into a DataFrame, pandas provides powerful, efficient, and flexible methods for summarizing, analyzing, restructuring, and changing them, which makes it a helpful programming tool for data scientists.

The `read_csv()` method takes a file path as its argument.  Since the method is specific to the pandas library, we must reference the method as a part of pandas.  This looks like `pd.read_csv(filepath)`, where the file path is a **string** of any folders the computer needs to go through to reach the data, ending with the file name of the data itself (including the extension `.csv` if that's in the file name!).  A file path might look like...

* `"../data_file_name.csv"`                   -->  if the data is located outside of the current folder (a.k.a. directory)

* `"../data_folder_name/data_file_name.csv"`  -->  if the data is located in a folder stored outside of the current folder

* `"../../data_file_name.csv"`                -->  if the data is located outside of the folder containing the current folder

* `"./data_file_name.csv"`                    -->  if the data is located in the current folder

* `"data_file_name.csv"`                      -->  if the data is located in the current folder
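To see how these relative paths resolve, here is a minimal sketch using Python's standard `posixpath` module.  The starting folder `/home/user/course/notebooks` is a made-up example, not the actual location of the course files.

```python
import posixpath

# Hypothetical current folder (an assumption for illustration only)
current_folder = "/home/user/course/notebooks"

relative_paths = [
    "../data_file_name.csv",
    "../data_folder_name/data_file_name.csv",
    "../../data_file_name.csv",
    "./data_file_name.csv",
    "data_file_name.csv",
]

# Join each relative path onto the current folder, then collapse "." and ".."
resolved = {
    path: posixpath.normpath(posixpath.join(current_folder, path))
    for path in relative_paths
}

for path, full_path in resolved.items():
    print(path, "->", full_path)
```

Each `..` moves up one folder and `.` stays in the current one, which is why the last two paths point to the same file.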

What folder is our data, `NBS_Edinburgh1900s.csv`, located in?  How would we write its file path?

In [3]:
df = pd.read_csv("NBS_Edinburgh1900s.csv") # I named my variable `df` as an abbreviation for DataFrame

In [4]:
df.head()  # To preview the first five rows of the DataFrame

Unnamed: 0,author,title,topic,language,publication_place,publication_date
0,"Macpherson, Iain.",Attracting new students to adult education :,Adult education,,Edinburgh :,1989
1,,The breeding birds of south-east Scotland :,"['Birds', 'Birds']",,Edinburgh :,1998
2,,"Adult education, the challenge of change",Adult education,,Edinburgh,1975
3,,Nature conservation in the Cairngorms :,Nature conservation,,Edinburgh :,1989
4,"Ehrenborg, Cecil G.",What are you doing here? Being the adventures ...,,,"Edinburgh,",1950
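If you'd like to try `read_csv()` and `head()` without the course file, the sketch below reads a tiny CSV from a string instead of from disk.  The three rows are copied from the preview above, and `csv_text` and `toy_df` are names made up for this example.

```python
import io
import pandas as pd

# A tiny CSV with the same kind of columns as our data
csv_text = (
    "author,title,publication_date\n"
    '"Macpherson, Iain.",Attracting new students to adult education :,1989\n'
    ",The breeding birds of south-east Scotland :,1998\n"
    ",Nature conservation in the Cairngorms :,1989\n"
)

# read_csv() accepts a file path or a file-like object such as StringIO
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)    # (rows, columns)
print(toy_df.head(2))  # head() takes an optional number of rows to preview
```

Note that the author's name contains a comma, so it is wrapped in double quotes in the CSV text; otherwise `read_csv()` would treat it as two separate values.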


#### 3. Explore the metadata
Pandas and NumPy together provide many methods for exploring a dataset.  Let's try some out!

**Method 1:** `shape` - this attribute is from pandas; because it is an attribute rather than a method, it is written without parentheses (read about it in the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html?highlight=shape#pandas.DataFrame.shape))

In [105]:
rows, columns = df.shape
print(df.shape)
print("Rows:", rows)
print("Columns:", columns)

(85364, 6)
Rows: 85364
Columns: 6


**Method 2:** `columns` - this attribute is also from pandas (read about it in the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html))

In [106]:
print(df.columns)

Index(['author', 'title', 'topic', 'language', 'publication_place',
       'publication_date'],
      dtype='object')


What data types do you think the values in each of our `df`'s columns are?  Let's check...

In [107]:
column_names = df.columns
for col in column_names:
    print(col, ": ", type(df.loc[0,col]))

author :  <class 'str'>
title :  <class 'str'>
topic :  <class 'str'>
language :  <class 'str'>
publication_place :  <class 'str'>
publication_date :  <class 'numpy.int64'>


**Method 3:** `array()` - this method is from NumPy (read about it in the [docs](https://numpy.org/doc/stable/reference/generated/numpy.array.html))

NumPy stores integers and lists a bit differently from Python, all in the name of more efficient data processing!  We can create NumPy arrays with `np.array()` to store numbers, including the `numpy.int64` values in our `publication_date` column.

In [5]:
dates = np.array(list(df.publication_date))
print("First ten dates:", dates[:10])  # same as: print(dates[0:10])

First ten dates: [1989 1998 1975 1989 1950 1958 1967 1966 1906 1907]


In [14]:
print("Mean:", dates.mean())
print("Min:", dates.min())
print("Max:", dates.max())

Mean: 1976.9460779719789
Min: 1900
Max: 1999


**Methods 4, 5 & 6:** `mean()`, `amin()`, `amax()` - these methods are from NumPy (read about them in the docs [here](https://numpy.org/doc/stable/reference/generated/numpy.mean.html?highlight=mean#numpy.mean), [here](https://numpy.org/doc/stable/reference/generated/numpy.amin.html#numpy.amin), and [here](https://numpy.org/doc/stable/reference/generated/numpy.amax.html?highlight=amax#numpy.amax))

In [15]:
# With NumPy's functions, we can pass a DataFrame column directly instead of first turning it into an array
print("Mean:", np.mean(df.publication_date))
print("Min:", np.amin(df.publication_date))              
print("Max:", np.amax(df.publication_date))
print("Standard Deviation:", np.std(df.publication_date))

Mean: 1976.9460779719789
Min: 1900
Max: 1999
Standard Deviation: 24.414016395926836


**Method 7:** `describe` - this method is from pandas and is for *numeric* data (read about it in the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html?highlight=describe#pandas.DataFrame.describe))

In [16]:
df.describe()

Unnamed: 0,publication_date
count,85364.0
mean,1976.946078
std,24.414159
min,1900.0
25%,1972.0
50%,1986.0
75%,1994.0
max,1999.0
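You may have noticed that the standard deviation from `np.std()` (24.414016...) differs slightly from the `std` row of `describe()` (24.414159).  By default, `np.std()` divides by the number of values N (the population standard deviation), while pandas' `describe()` and `std()` divide by N - 1 (the sample standard deviation).  The sketch below shows the same difference with Python's built-in `statistics` module, using only the first ten publication dates printed earlier.

```python
import math
import statistics

# First ten publication dates, copied from the output earlier in this notebook
dates = [1989, 1998, 1975, 1989, 1950, 1958, 1967, 1966, 1906, 1907]
n = len(dates)

population_std = statistics.pstdev(dates)  # divides by N, like np.std() by default
sample_std = statistics.stdev(dates)       # divides by N - 1, like pandas' std() by default

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)

# The two differ only by a factor of sqrt(N / (N - 1))
print(math.isclose(sample_std, population_std * math.sqrt(n / (n - 1))))
```

With 85,364 rows that factor is very close to 1, which is why the two results in this notebook agree to three decimal places.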


**Method 8:** `unique` - this method is also from pandas (read about it in the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html?highlight=unique#pandas.Series.unique))

What if we want to know more about columns with values of the *string* data type?

In [111]:
author_count = len(df.author.unique())
title_count = len(df.title.unique())
language_count = len(df.language.unique())
pubplace_count = len(df.publication_place.unique())
pubdate_count = len(df.publication_date.unique())

print("Total authors:", author_count)
print("Total titles:", title_count)
print("Total languages:", language_count)
print("Total publication places:", pubplace_count)
print("Total publication dates:", pubdate_count)

Total authors: 16172
Total titles: 61932
Total languages: 59
Total publication places: 272
Total publication dates: 100


In [112]:
# Last ten titles in the data
unique_titles = list(df.title.unique())   # same as: list(df["title"].unique())
unique_titles[-10:]  

# But how is the data organized?  What does "last" mean?

['The Holyrood building project /',
 'The impact of the EU Urban Waste Water Treatment Directive on the fish processing industry /',
 'Order of election of MSPs from regional vote /',
 'Rehabilitation in adult nursing practice /',
 'Client-centred practice in occupational therapy :',
 'Understanding acupuncture /',
 'Self assessment in clinical pharmacology /',
 "St. Giles' Cathedral, High Kirk, Edinburgh. Order of Divine Service held on ... February 4, 1900 ... in aid of the Lord Provost of Edinburgh's War Relief Funds. (Under the auspices of the lodges in the Metropolitan district and Province of Midlothian.).",
 'French cookery for English homes.',
 'The wrongs of Indian womanhood /']

**Method 9:** `sort_values` - this method is from pandas (read about it in the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html?highlight=sort_values#pandas.DataFrame.sort_values))

To decide for ourselves what "last" means, we can sort the rows by a column of our choice, such as the publication date:

In [113]:
df_sorted = df.sort_values(by="publication_date")
df_sorted.tail(10)

Unnamed: 0,author,title,topic,language,publication_place,publication_date
48033,,[Planning appeal decision :,Electric signs,,Edinburgh :,1999
48034,,[Planning appeal decision :,Electric signs,,Edinburgh :,1999
42462,,Implementation studies in schools.,"['Curriculum planning', 'Education, Higher']",,Edinburgh :,1999
5922,,"Jubilee Public Hall Trust, Lesmahagow, South L...",,,Edinburgh :,1999
48035,,[Planning appeal decision :,Signs and signboards,,Edinburgh :,1999
48036,,[Planning appeal decision :,Electric signs,,Edinburgh :,1999
48037,,[Planning appeal decision :,City planning,,Edinburgh :,1999
81550,,Gossip :,Social integration,,Edinburgh :,1999
24788,,Music :,Music,,Edinburgh :,1999
73094,,Looking after children in Scotland :,Children,,[Edinburgh] :,1999


Some notes:
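* There are fewer unique titles than rows in the dataframe, so some titles appear more than once.  These may be several editions or versions of the same book, or different items that share a title (such as the repeated `[Planning appeal decision :` entries above).
* There are 100 unique publication dates ranging from 1900 to 1999, so every year in the 1900s is included in the metadata.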
* When we printed the first five rows of the dataframe, the `publication_place` column showed all five books for those rows were published in Edinburgh, but Edinburgh was written in 3 different ways, so those will all be counted as unique places in the 272 total publication places printed above.
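The last note can be checked on a small scale.  The sketch below uses the five `publication_place` values from the preview of the first five rows; `raw_places` is a name made up for this example.

```python
# The publication_place values from the first five rows shown earlier
raw_places = ["Edinburgh :", "Edinburgh :", "Edinburgh", "Edinburgh :", "Edinburgh,"]

# A set keeps one copy of each distinct string
unique_raw = set(raw_places)
print(len(unique_raw), unique_raw)

# Trim spaces, colons, and commas from both ends before comparing
unique_trimmed = {place.strip(" :,") for place in raw_places}
print(len(unique_trimmed), unique_trimmed)
```

Three spellings collapse into one place once the stray punctuation is removed, which is the motivation for the cleaning step that follows.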

Note that we didn't count the total topics because we can see that some of the values in the `topic` column are single topics and others look like lists.  One option is to flatten the lists and create separate columns for `topic 2`, `topic 3`, ... `topic n`.

#### 4. Clean the metadata
We'll begin with the "topic" column:

In [114]:
# create a Series of all values in the dataframe's 'topic' column
topics = df.topic

In [115]:
print(topics[1])
print(type(topics[1]))

['Birds', 'Birds']
<class 'str'>


Hmmm.  It seems that the lists of topics are actually stored as strings, so we can't simply check for the data type to identify which books have a list of topics and which have a single topic.  Let's try using Regular Expressions ([RegEx](https://www.w3schools.com/python/python_regex.asp)) to pull out the words in any topic value with more than one topic.

In [116]:
import re # Regex, or Regular Expressions

In [117]:
# Determine how many topic values a single row may have
# and remove any duplicate topics within a single row
clean_topics = []
max_topics = 1
for t in topics:
    t_series = pd.Series(re.findall(r"\w+", t)).unique()  # raw string (r"...") so that \w reaches the regex engine unchanged
    # If there are multiple unique topics:
    if (len(t_series) > 1):
        clean_topics += [t_series]
        if (len(t_series) > max_topics):
            max_topics = len(t_series)
    # If there is only one unique topic:
    else:
        clean_topics += [t_series[0]]

print("Maximum number of topics in a single row:", max_topics)
assert len(clean_topics) == len(topics)    # There should be the same number of topic entries as in our dataframe

Maximum number of topics in a single row: 20


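Wow, 20 is a lot of topics!  Keep in mind, though, that `\w+` matches one word at a time, so a multi-word topic such as 'Adult education' is split into two items and the 20 counts words rather than whole topics.  Maybe I'll leave those as lists within the single topic column for now.  The 'for' loop above should still have done some cleaning on the topic field that we can apply to our dataframe, having removed any duplicate topics listed in a single row.  Let's replace the `topic` column of the dataframe with our new `clean_topics` list: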

In [118]:
df.topic = clean_topics
df.head()

Unnamed: 0,author,title,topic,language,publication_place,publication_date
0,"Macpherson, Iain.",Attracting new students to adult education :,"[Adult, education]",,Edinburgh :,1989
1,,The breeding birds of south-east Scotland :,Birds,,Edinburgh :,1998
2,,"Adult education, the challenge of change","[Adult, education]",,Edinburgh,1975
3,,Nature conservation in the Cairngorms :,"[Nature, conservation]",,Edinburgh :,1989
4,"Ehrenborg, Cecil G.",What are you doing here? Being the adventures ...,,,"Edinburgh,",1950
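In the output above, 'Adult education' has become `[Adult, education]`, because `\w+` matches one word at a time.  If we wanted to keep multi-word topics intact, one alternative is sketched below.  It is not the approach used in this notebook: it assumes that every list-like value in the `topic` column is a valid Python list literal (as `"['Birds', 'Birds']"` is), and `parse_topics` is a hypothetical helper name.

```python
import ast

def parse_topics(raw):
    """Return a list of unique topics, keeping multi-word topics whole."""
    if raw.startswith("["):
        # Turn the text "['Birds', 'Birds']" into the list ['Birds', 'Birds']
        topics = ast.literal_eval(raw)
    else:
        # A single topic stored as plain text
        topics = [raw]
    # dict.fromkeys() drops duplicates while keeping the original order
    return list(dict.fromkeys(topics))

print(parse_topics("['Birds', 'Birds']"))
print(parse_topics("Adult education"))
print(parse_topics("['Curriculum planning', 'Education, Higher']"))
```

`ast.literal_eval()` only evaluates literals such as lists and strings, so it is a safer choice than `eval()` for text read from a file.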


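Next, let's remove leading and trailing whitespace and punctuation from the other metadata entries.  To do so, we'll use Python's [.strip()](https://www.w3schools.com/python/ref_string_strip.asp) method, which removes the characters we name from both ends of a string (and whitespace if we name none):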

In [24]:
authors = df.author
titles = df.title
languages = df.language
pubplaces = df.publication_place

In [25]:
df.language.unique()

array(['None', 'English.', 'English', 'English & Irish (Middle Irish)',
       'Scottish Gaelic.', 'Arabic.', 'Kayan.', 'Murut.', 'Chinamwanga.',
       'Italian.', 'Japanese.', 'German.', 'French.',
       'English. - Scottish dialect.', 'Urdu', 'Gujarati', 'Cantonese',
       'Punjabi.', 'Hindi.', 'English & Scottish Gaelic.', 'Tumbuka.',
       'Scots.', 'Torres Islands. - Loh dialect.', 'English. ', 'Penan.',
       'Efik.', 'Luba. - Sanga Dialect.', 'Nyanja. - Union Nyanja.',
       'Luba. - Sanga dialect.', 'English & Old French', 'Tagal.',
       'Tonga (Nyasa).', 'Kyangonde.', 'Tamachek.', 'Namwanga.',
       'Nyanja.', 'Konde.', 'Polyglott.', 'Lomwe.', 'Chewa.', 'Songo.',
       'Tonga. - Nyasaland.', 'Luba-Sanga.', 'Bemba.', 'Wisa-Lala.',
       'Latin.', 'Santo. - South Santo dialect.', 'Tonga, Malawi.',
       'Kikuyu.', 'Lobiri.', 'Mumuye.', 'Jívara.', 'Kanuri.', 'Lobi.',
       'Tangoa.', 'Greek.', 'Spanish.', 'Swedish.', 'Kenyan.'],
      dtype=object)

In [26]:
clean_languages = []
for a in languages:
    a = a.strip()
    a = a.strip(".")
    a = a.strip(",")
    a = a.strip(":")
    a = a.strip(";")
    clean_languages += [a]

In [27]:
df.language = clean_languages
df.language.unique()

array(['None', 'English', 'English & Irish (Middle Irish)',
       'Scottish Gaelic', 'Arabic', 'Kayan', 'Murut', 'Chinamwanga',
       'Italian', 'Japanese', 'German', 'French',
       'English. - Scottish dialect', 'Urdu', 'Gujarati', 'Cantonese',
       'Punjabi', 'Hindi', 'English & Scottish Gaelic', 'Tumbuka',
       'Scots', 'Torres Islands. - Loh dialect', 'Penan', 'Efik',
       'Luba. - Sanga Dialect', 'Nyanja. - Union Nyanja',
       'Luba. - Sanga dialect', 'English & Old French', 'Tagal',
       'Tonga (Nyasa)', 'Kyangonde', 'Tamachek', 'Namwanga', 'Nyanja',
       'Konde', 'Polyglott', 'Lomwe', 'Chewa', 'Songo',
       'Tonga. - Nyasaland', 'Luba-Sanga', 'Bemba', 'Wisa-Lala', 'Latin',
       'Santo. - South Santo dialect', 'Tonga, Malawi', 'Kikuyu',
       'Lobiri', 'Mumuye', 'Jívara', 'Kanuri', 'Lobi', 'Tangoa',
       'Greek', 'Spanish', 'Swedish', 'Kenyan'], dtype=object)

**More Methods!** 

`groupby` - group your data by a particular column's values (read about it in the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby))

`count()` - count the number of rows that fit into each group (read about it in the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.count.html?highlight=count#pandas.core.groupby.GroupBy.count))

`drop()` - remove columns (`axis=1`) or rows (a.k.a. remove from the index, `axis=0`) from your DataFrame (read about it in the [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html))

In [40]:
# Count the non-empty values in every other column for each language
language_groups = df.groupby(by="language").count()

# Keep a single count column by dropping the rest
language_groups = language_groups.drop(labels=["author","title","topic","publication_place"], axis=1)
# The above line could also be written as (run only one of the two, or the second will raise a KeyError):
# language_groups.drop(labels=["author","title","topic","publication_place"], axis=1, inplace=True)

language_groups = language_groups.rename(columns={"publication_date":"total_books"})

# Sort from the most books to the fewest
language_groups.sort_values(by="total_books", ascending=False, inplace=True)
# The above line could also be written as:
# language_groups = language_groups.sort_values(by="total_books", ascending=False)

language_groups

language,total_books
,85119
English,133
Namwanga,10
Tonga (Nyasa),7
Scottish Gaelic,7
Tumbuka,6
Nyanja. - Union Nyanja,6
Efik,5
Tonga. - Nyasaland,5
Tamachek,5
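To see the `groupby`, `count()`, `drop()`, `rename()`, and `sort_values()` steps on data small enough to check by eye, here is a sketch with a made-up four-row DataFrame.  The `toy` data and the `counts` name are assumptions for illustration only.

```python
import pandas as pd

# Four made-up books: three in English, one in Scots
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "language": ["English", "Scots", "English", "English"],
    "publication_date": [1901, 1950, 1975, 1999],
})

counts = (
    toy.groupby(by="language").count()                       # one row per language
       .drop(labels=["title"], axis=1)                       # keep a single count column
       .rename(columns={"publication_date": "total_books"})  # give it a clearer name
       .sort_values(by="total_books", ascending=False)       # most books first
)

print(counts)
```

Chaining the calls inside parentheses avoids reassigning the variable after every step; each method returns a new DataFrame that the next one works on.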


More cleaning...

In [120]:
clean_authors = []
for a in authors:
    a = a.strip()
    a = a.strip('.')
    a = a.strip(",")  # this won't remove commas in the middle that may separate family and given names
    a = a.strip(":")
    a = a.strip(";")
    # We'll keep question marks (?) that may appear at the end of an author
    # name as their presence often indicates uncertainty in library metadata
    clean_authors += [a]

assert len(clean_authors) == len(authors)

In [121]:
df.author = clean_authors
df.head()

Unnamed: 0,author,title,topic,language,publication_place,publication_date
0,"Macpherson, Iain",Attracting new students to adult education :,"[Adult, education]",,Edinburgh :,1989
1,,The breeding birds of south-east Scotland :,Birds,,Edinburgh :,1998
2,,"Adult education, the challenge of change","[Adult, education]",,Edinburgh,1975
3,,Nature conservation in the Cairngorms :,"[Nature, conservation]",,Edinburgh :,1989
4,"Ehrenborg, Cecil G",What are you doing here? Being the adventures ...,,,"Edinburgh,",1950


In [122]:
clean_titles = []
for a in titles:
    a = a.strip()
    a = a.strip(".")
    a = a.strip(",")
    a = a.strip(":")
    a = a.strip(";")
    a = a.strip("/")
    a = a.strip("[")
    a = a.strip()     # remove any whitespace left exposed after the punctuation was stripped
    clean_titles += [a]
    
assert len(clean_titles) == len(titles)

In [123]:
df.title = clean_titles
df.tail()

Unnamed: 0,author,title,topic,language,publication_place,publication_date
85359,,Planning appeal decision,"[City, planning]",,Edinburgh :,1999
85360,,"Census 1971, Scotland, report for Lothian Regi...",,,Edinburgh,1976
85361,,"St. Giles' Cathedral, High Kirk, Edinburgh. Or...",,,"Edinburgh,",1900
85362,French Cookery,French cookery for English homes,,,"Edinburgh,",1900
85363,"Fuller, Marcus B",The wrongs of Indian womanhood,,,"Edinburgh,",1900
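One subtlety when stripping one kind of character at a time is that the order of the calls matters: removing a trailing `/` or `:` can expose a space that an earlier whitespace strip did not see.  The sketch below shows this with one of the titles from the list of last ten titles shown earlier; the variable names are made up for this example.

```python
raw_title = "The Holyrood building project /"

# Whitespace first, then the slash: the space before the slash survives
step_by_step = raw_title.strip()
step_by_step = step_by_step.strip("/")

# Naming every unwanted character in a single call removes them in any order
all_at_once = raw_title.strip(" /")

print(repr(step_by_step))
print(repr(all_at_once))
```

Using `repr()` when printing makes leading and trailing spaces visible, which helps when checking cleaned text.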


In [127]:
clean_pubplaces = []
for a in pubplaces:
    a = a.strip()
    a = a.strip(".")
    a = a.strip(",")
    a = a.strip(":")
    a = a.strip(";")
    a = a.strip("[")
    a = a.strip("]")
    a = a.strip(" ")
    clean_pubplaces += [a]
    
assert len(clean_pubplaces) == len(pubplaces)

In [128]:
df.publication_place = clean_pubplaces
df.head()

Unnamed: 0,author,title,topic,language,publication_place,publication_date
0,"Macpherson, Iain",Attracting new students to adult education,"[Adult, education]",,Edinburgh,1989
1,,The breeding birds of south-east Scotland,Birds,,Edinburgh,1998
2,,"Adult education, the challenge of change","[Adult, education]",,Edinburgh,1975
3,,Nature conservation in the Cairngorms,"[Nature, conservation]",,Edinburgh,1989
4,"Ehrenborg, Cecil G",What are you doing here? Being the adventures ...,,,Edinburgh,1950


That's looking much better!

#### 5.  Analyze the metadata

**Write a Function**

*Functions* help programmers like us write code more efficiently. Sometimes, especially as your coding tasks become more complex, you'll find that you want to reuse the same code, or slight variations of the same code, again and again. Functions let you reuse code without having to rewrite it every time you want to use it. By storing lines of code you've written within the bounds of a function that you name, you can refer to those lines of code by the name to reuse that code.

When we want to write a function, we begin the first line with `def`, for define, and end the first line with a colon `:`.  We define inputs to the function as *parameters* inside parentheses.  The indented lines that follow form the body of the function, which usually ends by returning a particular value with `return`:

`def functionName(parameter1, parameter2, ... parameterN):`

`    ...`

`    return X`

When we want to use (or "call") the function, we put the *arguments*, the actual values on which to apply the function, inside the parentheses that follow the function's name:

`functionName(argument1, argument2, argument3)`
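As a concrete example, the cleaning loops in the previous section repeat the same steps for each column, and a function could hold those steps once.  The sketch below is illustrative only: `clean_values` and its `chars` input are names made up for this example and are not used elsewhere in this notebook.

```python
def clean_values(values, chars=" .,:;"):
    """Return a new list with `chars` stripped from both ends of each value."""
    cleaned = []
    for value in values:
        cleaned += [value.strip(chars)]
    return cleaned

# Call the function with different inputs instead of rewriting the loop
clean_authors_demo = clean_values(["Macpherson, Iain.", "Ehrenborg, Cecil G."])
clean_places_demo = clean_values(["Edinburgh :", "[Edinburgh] :", "Edinburgh,"], chars=" .,:;[]")

print(clean_authors_demo)
print(clean_places_demo)
```

Giving `chars` a default value means the common case needs only one input, while columns with extra punctuation (such as square brackets) can pass their own set of characters.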

Let's take another look at the languages in our data:

In [41]:
langs = (df["language"]).unique()
# langs = df.language.unique()
print(langs)

['None' 'English' 'English & Irish (Middle Irish)' 'Scottish Gaelic'
 'Arabic' 'Kayan' 'Murut' 'Chinamwanga' 'Italian' 'Japanese' 'German'
 'French' 'English. - Scottish dialect' 'Urdu' 'Gujarati' 'Cantonese'
 'Punjabi' 'Hindi' 'English & Scottish Gaelic' 'Tumbuka' 'Scots'
 'Torres Islands. - Loh dialect' 'Penan' 'Efik' 'Luba. - Sanga Dialect'
 'Nyanja. - Union Nyanja' 'Luba. - Sanga dialect' 'English & Old French'
 'Tagal' 'Tonga (Nyasa)' 'Kyangonde' 'Tamachek' 'Namwanga' 'Nyanja'
 'Konde' 'Polyglott' 'Lomwe' 'Chewa' 'Songo' 'Tonga. - Nyasaland'
 'Luba-Sanga' 'Bemba' 'Wisa-Lala' 'Latin' 'Santo. - South Santo dialect'
 'Tonga, Malawi' 'Kikuyu' 'Lobiri' 'Mumuye' 'Jívara' 'Kanuri' 'Lobi'
 'Tangoa' 'Greek' 'Spanish' 'Swedish' 'Kenyan']


In [135]:
def hasDialect(language_str):
    dialects = []
    dialects += re.findall(r'\w+ dialect', language_str)       # e.g. "Scottish dialect" (lowercase "dialect" only)
    dialects += re.findall(r'\w+ \(\w+ \w*\)', language_str)   # e.g. "Irish (Middle Irish)"
    if len(dialects) >= 1:
        return True
    else:
        return False

In [136]:
unique_langs = []
for l in langs:
    l_list = []
    if '&' in l:
        l_list += l.split('&')
    if '-' in l:
        l_list += l.split('-')
    if '&' not in l and '-' not in l:
        l_list += [l]
    for l_item in l_list:
        if not hasDialect(l_item):
            if l_item not in unique_langs:
                unique_langs += [l_item]
print(unique_langs)

['None', 'English', 'English ', 'Scottish Gaelic', 'Arabic', 'Kayan', 'Murut', 'Chinamwanga', 'Italian', 'Japanese', 'German', 'French', 'English. ', 'Urdu', 'Gujarati', 'Cantonese', 'Punjabi', 'Hindi', ' Scottish Gaelic', 'Tumbuka', 'Scots', 'Torres Islands. ', 'Penan', 'Efik', 'Luba. ', ' Sanga Dialect', 'Nyanja. ', ' Union Nyanja', ' Old French', 'Tagal', 'Tonga (Nyasa)', 'Kyangonde', 'Tamachek', 'Namwanga', 'Nyanja', 'Konde', 'Polyglott', 'Lomwe', 'Chewa', 'Songo', 'Tonga. ', ' Nyasaland', 'Luba', 'Sanga', 'Bemba', 'Wisa', 'Lala', 'Latin', 'Santo. ', 'Tonga, Malawi', 'Kikuyu', 'Lobiri', 'Mumuye', 'Jívara', 'Kanuri', 'Lobi', 'Tangoa', 'Greek', 'Spanish', 'Swedish', 'Kenyan']
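Notice that the list above still contains `' Sanga Dialect'` even though `' Sanga dialect'` was filtered out: the pattern `\w+ dialect` only matches a lowercase "dialect".  One possible adjustment, sketched below, is to ask the regex engine to ignore case.  `has_dialect_any_case` is a hypothetical variant written for this example, not the function used in this notebook.

```python
import re

def has_dialect_any_case(language_str):
    """Return True if the text contains '<word> dialect' in any capitalisation."""
    return re.search(r"\w+ dialect", language_str, flags=re.IGNORECASE) is not None

results = {}
for text in [" Sanga dialect", " Sanga Dialect", "English"]:
    case_sensitive = bool(re.findall(r"\w+ dialect", text))  # the check used above
    results[text] = (case_sensitive, has_dialect_any_case(text))
    print(repr(text), "case-sensitive:", case_sensitive, "ignore case:", has_dialect_any_case(text))
```

Only the middle example changes: the case-sensitive pattern misses "Dialect" with a capital D, while the `re.IGNORECASE` version catches it.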


In [137]:
# Create a dictionary to count the occurrence of each language
lang_counts = dict.fromkeys(unique_langs, 0)
dflangs = list(df.language)
for dfl in dflangs:
    for unique_l in unique_langs:
        if unique_l in dfl:
            lang_counts[unique_l] += 1

In [138]:
lang_counts

{'None': 85119,
 'English': 138,
 'English ': 3,
 'Scottish Gaelic': 8,
 'Arabic': 2,
 'Kayan': 1,
 'Murut': 2,
 'Chinamwanga': 1,
 'Italian': 2,
 'Japanese': 1,
 'German': 2,
 'French': 3,
 'English. ': 2,
 'Urdu': 1,
 'Gujarati': 1,
 'Cantonese': 1,
 'Punjabi': 1,
 'Hindi': 1,
 ' Scottish Gaelic': 1,
 'Tumbuka': 6,
 'Scots': 2,
 'Torres Islands. ': 1,
 'Penan': 1,
 'Efik': 5,
 'Luba. ': 4,
 ' Sanga Dialect': 1,
 'Nyanja. ': 6,
 ' Union Nyanja': 6,
 ' Old French': 1,
 'Tagal': 1,
 'Tonga (Nyasa)': 7,
 'Kyangonde': 1,
 'Tamachek': 5,
 'Namwanga': 10,
 'Nyanja': 7,
 'Konde': 2,
 'Polyglott': 2,
 'Lomwe': 1,
 'Chewa': 2,
 'Songo': 1,
 'Tonga. ': 5,
 ' Nyasaland': 5,
 'Luba': 5,
 'Sanga': 5,
 'Bemba': 1,
 'Wisa': 1,
 'Lala': 1,
 'Latin': 2,
 'Santo. ': 1,
 'Tonga, Malawi': 2,
 'Kikuyu': 1,
 'Lobiri': 1,
 'Mumuye': 1,
 'Jívara': 1,
 'Kanuri': 1,
 'Lobi': 2,
 'Tangoa': 1,
 'Greek': 1,
 'Spanish': 1,
 'Swedish': 1,
 'Kenyan': 1}
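Two things are worth noticing in these counts.  The loop uses `in`, which is a substring test, so a short name is also counted whenever it appears inside a longer one (for instance, `'Lobi'` is found inside `'Lobiri'`).  Names that differ only by surrounding spaces or full stops (`'English'`, `'English '`, `'English. '`) also get separate keys.  The sketch below, which uses a handful of made-up entries rather than the real column, counts whole names instead.

```python
import re
from collections import Counter

# Made-up language entries in the same style as the notebook's data
entries = ["English", "English & Scottish Gaelic", "Lobi", "Lobiri", "None"]

# Substring test, as in the loop above: "Lobi" is also found inside "Lobiri"
substring_lobi = sum("Lobi" in entry for entry in entries)

# Whole-name counting: split on & and -, trim spaces and full stops, then count
exact_counts = Counter()
for entry in entries:
    parts = [part.strip(" .") for part in re.split(r"[&-]", entry)]
    exact_counts.update(parts)

print("Substring count for Lobi:", substring_lobi)
print("Whole-name counts:", dict(exact_counts))
```

Splitting on every hyphen also breaks apart names such as `'Luba-Sanga'`, just as the loop above does, so a real cleaning step would need a decision about which hyphens separate languages.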

In [19]:
ed_count = 0
unique_places = list(df.publication_place.unique())
for p in unique_places:
    if "Edinburgh" in p:
        ed_count += 1
print(str(ed_count) + " out of " + str(len(unique_places)) + " unique publication entries include the city of Edinburgh")

199 out of 199 unique publication entries include the city of Edinburgh


This is as expected, since we thought the data file was meant to include only books published in Edinburgh.

In [20]:
unique_places

['Edinburgh',
 'Glasgow and Edinburgh',
 'Edinburgh & London',
 'Edinburgh & Glasgow',
 'Edinburgh?',
 'Edinburgh]',
 'Edinburgh [etc.',
 'Edinburgh?]',
 'Edinburgh] : \\b Scottish Office Education and Industry Department',
 'Freiburg; Edinburgh',
 'Edinburgh R. & R. Clark',
 'Edinburgh :',
 'Edinburgh (15 Teviot Place, Edinburgh 1)',
 'Edinburgh and London',
 'Edinburgh]The Society',
 'Edinburgh Castle',
 'Edinburgh, etc',
 'Edinburgh : Fruitmarket Gallery',
 'Edinburgh (121 George St., Edinburgh [EH2 4YN])',
 'Edinburgh, Scotland',
 'Edinburgh (68 Dundas St., Edinburgh EH3 6QZ)',
 'London, Edinburgh',
 'Edinburgh (4 Viewforth, Edinburgh)',
 'Edinburgh (Colinton Rd, Edinburgh 10)',
 'Edinburgh (c/o Mrs Kilbey, 15 Cluny Ave., Edinburgh 10)',
 'Edinburgh (27 Walker St., EH3 7HZ)',
 'Edinburgh (Edinburgh EH9 3HJ)',
 'Edinburgh EH11 4BN',
 'Edinburgh (14a Manor Place, Edinburgh 3)',
 'Edinburgh (14 George Sq., Edinburgh EH8 9JZ)',
 'Edinburgh (c/o Royal Botanic Garden, Arboretum Row, Edin

It looks like other publication places are included as well, though!  Some books were published in Edinburgh *and* another city.

Next, let's organize the dataframe chronologically by publication date:

In [130]:
df_chrono = df.sort_values(by="publication_date", ascending=True)
df_chrono.head()

Unnamed: 0,author,title,topic,language,publication_place,publication_date
85363,"Fuller, Marcus B",The wrongs of Indian womanhood,,,Edinburgh,1900
60128,"Johnson, Samuel",Lives of Milton and Addison,,,Edinburgh & London,1900
24275,"Ireland, William Wotherspoon",Supplementary notes on the Scottish De Quencys,,,Edinburgh,1900
24257,"Anderson, Robert Rowand",The place of architecture in the domain of art,,,Edinburgh,1900
68429,"Thomson, Alexis",On neuroma and neuro-fibromatosis,,,Edinburgh,1900


In [131]:
df_date_count = df_chrono.groupby("publication_date").count()
df_date_count.drop(columns=["author", "topic", "language", "publication_place"], inplace=True)
print(df_date_count.shape)
df_date_count.head()

(100, 1)


publication_date,title
1900,401
1901,355
1902,282
1903,312
1904,316


In [132]:
print("The most books published in a single year of the 1900s:",df_date_count.title.max())
print("The fewest books published in a single year of the 1900s:",df_date_count.title.min())

The most books published in a single year of the 1900s: 4477
The fewest books published in a single year of the 1900s: 86


In [133]:
df_date_count.sort_values("title")

publication_date,title
1917,86
1918,89
1941,95
1942,104
1919,117
...,...
1993,3275
1996,3411
1995,3608
1998,3657


So, in this subset of the NBS, the year of the 1900s with the fewest published books is 1917, and the year with the most is 1999!
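Rather than reading the extremes off a sorted table, pandas can also report the index label of the smallest and largest values directly with `idxmin()` and `idxmax()`.  The sketch below uses only the five yearly counts for 1900 to 1904 shown earlier, so its answers apply to those five years and not to the whole century; `toy_counts` is a name made up for this example.

```python
import pandas as pd

# Yearly title counts for 1900-1904, copied from the grouped output above
toy_counts = pd.Series({1900: 401, 1901: 355, 1902: 282, 1903: 312, 1904: 316}, name="title")

fewest_year = toy_counts.idxmin()  # index label (year) of the smallest count
most_year = toy_counts.idxmax()    # index label (year) of the largest count

print("Fewest:", fewest_year, "with", toy_counts.min(), "titles")
print("Most:", most_year, "with", toy_counts.max(), "titles")
```

On the full `df_date_count` DataFrame, the same idea would be applied to its `title` column.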