# Datasets

This is a growing repository of datasets broadly related to culture and the humanities. The sources of the datasets as well as brief descriptions and example uses can be found below.

If you'd like to add a dataset or an example use case to this page, please open [an issue on GitHub](https://github.com/melaniewalsh/Intro-Cultural-Analytics/issues) or email me at melanie.walsh@cornell.edu.

## Film 🎬


### Hollywood Film Dialogue By Character Gender and Age
(1925-2015)

Get the data: {download}`Download Hollywood Film Dialogue data <../data/Pudding/Pudding-Film-Dialogue-Clean.csv>`  
Original source: Hannah Anderson and Matt Daniels  

Brief description:  

This {download}`CSV file <../data/Pudding/Pudding-Film-Dialogue-Clean.csv>` is a consolidated and slightly modified version of data shared by Hannah Anderson and Matt Daniels for their Pudding article, ["Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age"](https://pudding.cool/2017/03/film-dialogue/). The original datasets can be found [on GitHub here](https://github.com/matthewfdaniels/scripts/). 

For 2,000 films from 1925 to 2015, the dataset includes characters' names, genders, and ages, how many words they spoke in each film, the release year of each film, and how much money the film grossed. Anderson and Daniels determined character age and gender (which they code as binary) based on the corresponding IMDb information for actors. They acknowledge this is an imperfect approach. For more on the compilation of the dataset and their methodology, see [FAQ for the “Film Dialogue, By Gender” Project](https://medium.com/@matthew_daniels/faq-for-the-film-dialogue-by-gender-project-40078209f751).
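As a rough sketch of how a file like this might be explored, the snippet below sums the words spoken by each gender and converts the totals into shares of all dialogue. The column names (`title`, `release_year`, `character`, `gender`, `words`) and the sample rows are hypothetical stand-ins, not the file's actual headers or values, so check the CSV's header row before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the real CSV. The column names and
# values are assumptions made for illustration only.
SAMPLE_CSV = """title,release_year,character,gender,words
Film A,1990,Alice,woman,1200
Film A,1990,Bob,man,1800
Film B,2005,Carol,woman,500
Film B,2005,Dan,man,1500
"""


def words_by_gender(csv_file):
    """Sum the number of words spoken, grouped by the gender column."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["gender"]] += int(row["words"])
    return dict(totals)


def share_by_gender(totals):
    """Convert raw word totals into each gender's share of all dialogue."""
    grand_total = sum(totals.values())
    return {gender: count / grand_total for gender, count in totals.items()}


totals = words_by_gender(io.StringIO(SAMPLE_CSV))
shares = share_by_gender(totals)
print(totals)
print(shares)
```

To run this against the downloaded file, replace `io.StringIO(SAMPLE_CSV)` with an opened file object and adjust the column names to match the real headers.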

Example uses:
- ["Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age"](https://pudding.cool/2017/03/film-dialogue/), *The Pudding*
- ["Pandas Basics Part 3"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part3.html), *Introduction to Cultural Analytics & Python*
---

## Literature 📚

### *Lost in the City* (1993), Edward P. Jones
HathiTrust Extracted Features

Get the data: {download}`Download Lost in the City data <../texts/literature/Lost-in-the-City-HTRC-Extracted-Features.zip>`   
Original source: [HathiTrust Digital Library](https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329)

Brief description:  

The {download}`Lost in the City HTRC Extracted Features zip file <../texts/literature/Lost-in-the-City-HTRC-Extracted-Features.zip>` contains word frequencies per page—or "extracted features"—made available by the [HathiTrust Digital Library](https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329) for Edward P. Jones's short story collection *Lost in the City*. I have also added each short story title for the correct corresponding pages to the dataset.

There are three CSV files in the zip file. "Lost-in-the-City-HTRC-Extracted-Features(PerPage).csv" contains lowercased word frequencies per page as well as part-of-speech information (49,330 rows). "Lost-in-the-City-HTRC-Extracted-Features(PerStory).csv" contains lowercased word frequencies per story as well as part-of-speech information (20,748 rows). "Lost-in-the-City-HTRC-Extracted-Features(>5PerStory).csv" contains pre-processed lowercased word frequencies (with stopwords and punctuation removed) for words that appear more than 5 times in a story as well as part-of-speech information (1,364 rows).
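The relationship between the three files can be illustrated with a small sketch: per-page counts are summed into per-story counts, and the per-story counts are then filtered to non-stopword tokens that appear more than 5 times. The rows, story labels, and stopword list below are hypothetical, and this is not necessarily how the files were actually produced.

```python
from collections import Counter

# Hypothetical per-page rows: (story, page, lowercased token, count).
PER_PAGE_ROWS = [
    ("Story One", 1, "city", 4),
    ("Story One", 2, "city", 3),
    ("Story One", 2, "the", 10),
    ("Story Two", 3, "city", 2),
    ("Story Two", 3, "pigeons", 6),
]

# Tiny illustrative stopword list (an assumption, not the list used for the dataset).
STOPWORDS = {"the", "a", "and"}


def per_story_counts(rows):
    """Add up per-page token counts so each (story, token) pair has one total."""
    counts = Counter()
    for story, _page, token, count in rows:
        counts[(story, token)] += count
    return counts


def frequent_words(counts, min_count=5, stopwords=STOPWORDS):
    """Keep non-stopword tokens that appear more than min_count times in a story."""
    return {
        key: total
        for key, total in counts.items()
        if total > min_count and key[1] not in stopwords
    }


story_counts = per_story_counts(PER_PAGE_ROWS)
frequent = frequent_words(story_counts)
print(frequent)
```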

Example uses:
- ["TF-IDF With HathiTrust Data"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/TF-IDF-HathiTrust.html), *Introduction to Cultural Analytics & Python*
___

### African American Literature
(1853-1923)

Get the data: {download}`Download African American Literature data <../texts/literature/African-American-Literature-1853-1923.zip>`  
Source: Amardeep Singh

Brief description:  

The {download}`African American Literature (1853-1923) zip file <../texts/literature/African-American-Literature-1853-1923.zip>` contains 100 works of fiction and poetry by African American writers published between 1853 and 1923. It also contains an Excel file of metadata about the publisher, publication date, and publication place of each text. This corpus was compiled and shared by Amardeep Singh. You can read more about this corpus and its creation in [Singh's blog post about the African American Literature corpus](http://www.electrostani.com/2020/07/announcing-open-access-african-american.html).

Lastly, Singh asks users of this data to adhere to the spirit of the [Colored Conventions Project's principles](https://coloredconventions.org/about-records/ccp-corpus/).

---

### Colonial South Asian Literature
(1850-1923)

Get the data: {download}`Download Colonial South Asian Literature data <../texts/literature/Colonial-South-Asian-Literature-1850-1923.zip>`  
Source: Amardeep Singh

Brief description:  

The {download}`Colonial South Asian Literature (1850-1923) zip file <../texts/literature/Colonial-South-Asian-Literature-1850-1923.zip>` contains ~100 works of literature by South Asian and British writers published between 1850 and 1923. It also contains an Excel file of metadata about publication and the nationality of each author. This corpus was compiled and shared by Amardeep Singh. You can read more about this corpus and its creation in [Singh's blog post about the Colonial South Asian Literature corpus](http://www.electrostani.com/2020/08/text-corpus-colonial-south-asian.html).

___

### txtLAB's Multilingual Novels
(1771-1932)

Get the data: [Link to Multilingual Novels data](https://figshare.com/articles/txtlab_Novel450/2062002/3)   
Source: Andrew Piper, McGill [txtLAB](https://txtlab.org/data-sets/)

Brief description:  

The [txtLAB multilingual novels link](https://figshare.com/articles/txtlab_Novel450/2062002/3) takes you to a repository where you can download a directory of 150 English-language novels, 150 German-language novels, and 150 French-language novels, which span from 1771 to 1932. Authors featured include Goethe, Franz Kafka, Herman Melville, Mary Shelley, Kate Chopin, Virginia Woolf, Victor Hugo, Alexandre Dumas, and many more. These text files were compiled and shared by Andrew Piper and the txtLAB.

Example uses:
- ["Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel,"](http://piperlab.mcgill.ca/pdfs/Piper_NovelConversions.pdf) Andrew Piper
---

### Modernist Journal Data
(1890s-1920s)

Get the data: [Link to Modernist Journal data](https://sourceforge.net/projects/mjplab/files/)   
Source: [The Modernist Journals Project](https://modjourn.org/)

Brief description:  

The [Modernist journal data link](https://sourceforge.net/projects/mjplab/files/) takes you to a repository where you can download publication metadata for 14 modernist journals from the 1890s to the 1920s — such as *Poetry Magazine*, *The Little Review*, and *The Crisis*. The Modernist Journals Project, which has digitized these journals, provides CSV and tab-separated text files that contain information for every contributor and every work published in the journals.  

Example uses:
- [Comparative Charts―Involving Contributions to The Egoist, The Little Review, and Others (1915-1919)](https://modjourn.org/comparative-charts%e2%80%95involving-contributions-to-the-egoist-the-little-review-and-others-1915-1919/), The Modernist Journals Project

---

### Seattle Public Library Check-out Data
(2005-present)

Get the data: [Link to Seattle Public Library Check-out Data](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6/data)  
Source: [City of Seattle](http://www.seattle.gov/tech/initiatives/open-data/about-the-open-data-program)

Brief description:  
The [Seattle Public Library check-out data link](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6/data) takes you to a database that contains circulation data about the Seattle Public Library system from 2005 until the present. You can filter this database or search for keywords (e.g., "James Baldwin") and export a file of the filtered data by clicking "Export" and your desired file type.

Example uses:
- ["James Baldwin Checkouts from the SPL,"](http://tweetsofanativeson.com/Seattle-Public-Library/) Melanie Walsh

---

### Game of Thrones Character Relationships

Get the data: {download}`Download Game of Thrones Character data <../data/game-of-thrones-characters.zip>`  
Source: A. Beveridge and J. Shan

Brief description:  

The {download}`Game of Thrones Character Relationships zip file <../data/game-of-thrones-characters.zip>` contains network data for character relationships within George R. R. Martin's *A Storm of Swords*, the third novel in his series *A Song of Ice and Fire* (adapted for television by HBO as *Game of Thrones*). This data was originally compiled by A. Beveridge and J. Shan for their article, ["Network of Thrones"](https://www.maa.org/sites/default/files/pdf/Mathhorizons/NetworkofThrones%20%281%29.pdf).

The nodes csv contains 107 different characters, and the edges csv contains 353 weighted relationships between those characters, which were calculated based on how many times two characters' names appeared within 15 words of one another in the novel. For more on the methodology, see Beveridge and Shan's [original article](https://www.maa.org/sites/default/files/pdf/Mathhorizons/NetworkofThrones%20%281%29.pdf).
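The weighting idea described above, counting how often two names fall within a fixed number of words of each other, can be sketched roughly as follows. This is a simplified illustration under stated assumptions (whitespace tokenization, single-word names, an inclusive window, and a made-up toy sentence), not Beveridge and Shan's actual procedure.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(tokens, names, window=15):
    """Count how often two different names appear within `window` tokens of each other.

    Assumption: "within 15 words" is treated as a distance of at most
    `window` positions between the two mentions.
    """
    # Record the position of every token that is a character name.
    mentions = [(i, tok) for i, tok in enumerate(tokens) if tok in names]

    weights = Counter()
    for (i, a), (j, b) in combinations(mentions, 2):
        if a != b and abs(i - j) <= window:
            # Sort the pair so (A, B) and (B, A) share one undirected edge.
            weights[tuple(sorted((a, b)))] += 1
    return weights


# Toy sentence invented for illustration; it is not text from the novel.
toy_text = (
    "Jon and Sam walked north while Arya trained far away "
    "in another city and Jon waited"
)
tokens = toy_text.split()
names = {"Jon", "Sam", "Arya"}

narrow = cooccurrence_weights(tokens, names, window=5)
wide = cooccurrence_weights(tokens, names, window=15)
print(narrow)
print(wide)
```

Each key in the resulting counter corresponds to one row of an edges table (source, target, weight), which is the shape the edges CSV takes.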

Example uses:
- ["Network of Thrones,"](https://www.maa.org/sites/default/files/pdf/Mathhorizons/NetworkofThrones%20%281%29.pdf) A. Beveridge and J. Shan
- ["Network Analysis,"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Network-Analysis/Network-Analysis.html) *Introduction to Cultural Analytics & Python*

---

## Politics 🗳️ & History 📜


### *The New York Times* Obituaries
(1852-2007)

Get the data: {download}`Download *New York Times* Obituaries data <../texts/history/NYT-Obituaries.zip>`     
Original source: [Matthew Lavin](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#lesson-dataset)

Brief description:  

The {download}`*New York Times* obituaries zip file (.zip) of text files (.txt) <../texts/history/NYT-Obituaries.zip>` contains 379 *New York Times* obituaries (1852-2000) based on those collected by Matt Lavin for his *Programming Historian* tutorial, [Analyzing Documents with TF-IDF](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#lesson-dataset).

I re-scraped the 366 obituaries included in Lavin's tutorial so that the obituary subject's name and death year are included in each text file name. I also added 13 more ["Overlooked"](https://www.nytimes.com/interactive/2018/obituaries/overlooked.html) obituaries, which are belated obituaries of remarkable women and minorities who did not receive a *NYT* obituary at the time of their death. Obituary subjects include academics, military generals, artists, athletes, activists, politicians, and businesspeople, such as Ada Lovelace, Ulysses Grant, Marilyn Monroe, Virginia Woolf, Jackie Robinson, Marsha P. Johnson, Cesar Chavez, John F. Kennedy, Ray Kroc, and many more.

Example uses:
- [Analyzing Documents with TF-IDF](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf), Matt Lavin
- ["Topic Modeling Text Files"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-Text-Files.html), *Introduction to Cultural Analytics & Python*

---
        
### U.S. Inaugural Addresses
(1789-2017)

Get the data: {download}`Download U.S. Inaugural Addresses data <../texts/history/US_Inaugural_Addresses.zip>`

Brief description: 

The {download}`U.S. Inaugural Addresses zip file (.zip) of text files (.txt) <../texts/history/US_Inaugural_Addresses.zip>` contains U.S. Inaugural Addresses ranging from President George Washington (1789) to President Donald Trump (2017). Each text file is titled with a number, the corresponding last name of the U.S. President, and the corresponding year of the Inaugural Address.
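Because each file name carries a number, a last name, and a year, that metadata can be recovered from the name alone. The sketch below assumes a hypothetical `NN_lastname_YYYY.txt` pattern with underscore separators; the actual separators and capitalization in the zip file may differ, so adjust the regular expression after inspecting the real file names.

```python
import re
from pathlib import Path

# Assumed pattern: number, last name, and four-digit year joined by underscores.
# This is a guess at the naming scheme, not a documented format.
FILENAME_PATTERN = re.compile(r"^(?P<number>\d+)_(?P<president>.+)_(?P<year>\d{4})$")


def parse_address_filename(filename):
    """Return the number, president, and year encoded in a file name, or None."""
    match = FILENAME_PATTERN.match(Path(filename).stem)
    if match is None:
        return None
    return {
        "number": int(match.group("number")),
        "president": match.group("president"),
        "year": int(match.group("year")),
    }


first = parse_address_filename("01_washington_1789.txt")
unmatched = parse_address_filename("notes.txt")
print(first)
print(unmatched)
```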

Example uses:
- ["jsLDA: In-browser topic modeling,"](https://github.com/mimno/jsLDA) David Mimno

---

### Nobel Prize Winners
(1901-2017)

Get the data: {download}`Download Nobel Prize Winners data <../data/nobel-prize-winners.zip>`  
Original source: The European Data Portal and [the official Nobel Prize API](https://www.nobelprize.org/about/developer-zone-2/)

Brief description:  

The {download}`Nobel Prize winners CSV file (.CSV) <../data/nobel-prize-winners.zip>` contains information about 957 Nobel Prize winners from 1901 to 2017. This information includes the Nobel laureate's name, birth and death date (if applicable), birth and death location (plus **latitude and longitude coordinates** for the locations), the year they won the Nobel Prize, the category of the Nobel Prize, and the "motivation" for the Nobel Prize.

Nobel laureates include Marie Curie, Johannes Stark, Woodrow Wilson, Jane Addams, Rabindranath Tagore, John Steinbeck, Gabriel Garcia Marquez, Karl Ziegler, Toni Morrison, and many more.

Example uses:
- [Google Maps, QGIS, and Palladio maps](https://github.com/melaniewalsh/geospatial-lab) from a WUSTL digital humanities graduate seminar that I taught

---

### Refugee Arrivals to the U.S.
(2005-2015)

Get the data: {download}`Download U.S. Refugee Arrivals data <../data/us-refugee-arrivals.zip>`  
Original source: Department of State's Refugee Processing Center and [Jeremy Singer-Vine](https://github.com/BuzzFeedNews/2015-11-refugees-in-the-united-states)

Brief description:   

The {download}`U.S. Refugee Arrivals zip file <../data/us-refugee-arrivals.zip>` contains data about refugee arrivals to the United States between 2005 and 2015. This data was originally compiled from the Department of State's Refugee Processing Center by Jeremy Singer-Vine for his BuzzFeed article ["Where U.S. Refugees Come From — And Go — In Charts."](https://www.buzzfeednews.com/article/jsvine/where-us-refugees-come-from-and-go-in-charts#.vooNwy74jO)

The "refugee-arrivals-by-destination" csv contains information about the number of refugees who arrived in each U.S. city and state, the year that they arrived, and the country from which they arrived. The "refugee-arrivals-by-religion" csv contains information about the number of refugees who arrived in the U.S., the year in which they arrived, and their religious affiliation.

Example uses:
- ["Where U.S. Refugees Come From — And Go"](https://www.buzzfeednews.com/article/jsvine/where-us-refugees-come-from-and-go-in-charts#.vooNwy74jO), Jeremy Singer-Vine
- [Tableau map](https://public.tableau.com/profile/melanie.walsh#!/vizhome/RefugeeArrivalstotheU_S_2005-2015/TotalRefugeeArrivalstoU_S_2005-2015) from a WUSTL digital humanities graduate seminar that I taught

---

### Irish Immigrants Admitted to NYC's Bellevue Almshouse
(1840s)

Get the data: [Link to Bellevue Almshouse data](https://docs.google.com/spreadsheets/d/1uf8uaqicknrn0a6STWrVfVMScQQMtzYf5I_QyhB9r7I/edit#gid=2057113261)  
Source: Anelise Shrout, [Digital Almshouse Project](https://www.nyuirish.net/almshouse/)

Brief description:   

The [Bellevue Almshouse link](https://docs.google.com/spreadsheets/d/1uf8uaqicknrn0a6STWrVfVMScQQMtzYf5I_QyhB9r7I/edit#gid=2057113261) takes you to a Google spreadsheet that contains data about Irish-born immigrants who were admitted to the Bellevue Almshouse in the 1840s. The Bellevue Almshouse was part of New York City's public health system, a place where poor, sick, homeless, and otherwise marginalized people were sent, sometimes voluntarily and sometimes forcibly. This dataset was transcribed from the almshouse's own admissions records by Anelise Shrout. For more information about this dataset, see [The Almshouse Records](https://www.nyuirish.net/almshouse/the-almshouse-records/).

Example uses:
- <a href=http://crdh.rrchnm.org/essays/v01-10-(re)-humanizing-data/>“(Re)Humanizing Data: Digitally Navigating the Bellevue Almshouse”</a>, Anelise Hanson Shrout
- ["Pandas Basics Part 1"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part1.html), *Introduction to Cultural Analytics & Python*

---

## Social Media 🕸️


### Donald Trump's Tweets
(2009-2020)

Get the data: {download}`Download Donald Trump tweets data <../texts/politics/Trump-Tweets.csv>`  
Original source: [Trump Twitter Archive](http://www.trumptwitterarchive.com/)

Brief description:   

The {download}`Donald Trump tweets CSV file (.csv) <../texts/politics/Trump-Tweets.csv>` contains nearly 30,000 tweets from Donald Trump's account from 2009 to March 2020. The information about each tweet includes the source, tweet text, and date of the tweet, as well as retweet and favorite counts. More recent data can be downloaded at [Trump Twitter Archive](http://www.trumptwitterarchive.com/archive).
 
Example uses:
- ["Topic Modeling Time Series"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-Time-Series.html), *Introduction to Cultural Analytics & Python*
- ["How Trump Reshaped the Presidency in 11,000 Tweets"](https://www.nytimes.com/interactive/2019/11/02/us/politics/trump-twitter-presidency.html), *New York Times*

---

### "Am I The Asshole?" Reddit Posts

Get the data: ({download}`Download "Am I The Asshole?" Reddit data <../data/top-reddit-aita-posts.csv>`)

Brief description:  

The {download}`"Am I The Asshole?" Reddit posts CSV file (.csv) <../data/top-reddit-aita-posts.csv>` contains 2,932 Reddit posts from the subreddit "Am I the Asshole?" that have an upvote score of at least 2,000. The information in the dataset includes the date of the post, title, body text, URL, upvote score, number of comments, and number of crossposts. This data was collected with [PSAW](https://github.com/dmarx/psaw), a wrapper for the [Pushshift API](https://github.com/pushshift/api).

Example uses:
- ["Topic Modeling CSV Files"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-CSV.html), *Introduction to Cultural Analytics & Python*

---

## Food 🍔


### The New York Public Library's Menu Dataset
(1840-present)

Get the data: [Link to The New York Public Library's Menu Dataset](http://menus.nypl.org/data)  
Source: The New York Public Library

Brief description:   

The [New York Public Library's menu dataset link](http://menus.nypl.org/data) takes you to a web page where you can download data from the New York Public Library's massive menu collection: tens of thousands of transcribed menus and menu items from the 1840s to the present. Click "Download the latest data export in CSV format" for the most up-to-date menu data.

Example uses:
- [*Curating Menus*](http://curatingmenus.org/), Katie Rawson and Trevor Muñoz

---

## Other Dataset Compilations

Below are some other great compilations of cultural and humanities-related datasets:
- Jeremy Singer-Vine's [*Data Is Plural* archive](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0) (you can [subscribe to his excellent dataset newsletter here](https://tinyletter.com/data-is-plural))
- *The Pudding*'s [GitHub repository](https://github.com/the-pudding/data)
- Alan Liu's [DH Toychest](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets)
- Reddit's [r/datasets subreddit](https://www.reddit.com/r/datasets/)

## Suggestions?

If you'd like to add a dataset or an example use case, please open [an issue on GitHub](https://github.com/melaniewalsh/Intro-Cultural-Analytics/issues) or email me at melanie.walsh@cornell.edu.