# DS 101 Project 2 - Mapping Emotions in Python, The American Civil War 

#### Group Members
Joseph Fenuku,
Mason Spillman,
Colin Lambert,
Quinn Sheppard,
Claudia Castrillo 


### 1. Introduction

In this project for Digital Studies 101: Foundations in Digital Studies, our group used CSV catalog data compiled from Project Gutenberg to create a Geoparser map of Virginia showing the most prominent emotions expressed about the American Civil War. We began by building a custom corpus data frame from the Gutenberg catalog and then scraping further information for the works listed in it. Once we had a usable data frame, we cleaned the data, split it into sentences, cleaned the individual sentences, dropped any unnecessary data, and removed any sentences that lacked toponyms or other relevant information. Finally, we loaded Geoparser and simplified the emotion scores to make the mapped data easier to understand. The result is a finished Geoparser map showing the emotions of the American Civil War.
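
The sentence-level cleaning described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the regular-expression splitter and the `known_places` list are stand-ins for whatever sentence tokenizer and toponym tagger were really used, and the sample text is invented.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    flat = re.sub(r"\s+", " ", text).strip()  # collapse \r\n and repeated spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]

def keep_sentences_with_toponyms(sentences, known_places):
    """Return (sentence, toponyms) pairs, dropping sentences with no known place."""
    kept = []
    for sentence in sentences:
        found = [p for p in known_places
                 if re.search(rf"\b{re.escape(p)}\b", sentence)]
        if found:
            kept.append((sentence, found))
    return kept

# Invented sample text; the real input would be the text_data column
sample = "The army left Richmond at dawn.\r\nIt rained all day. We reached Columbus by night!"
places = ["Richmond", "Columbus"]  # hypothetical stand-in for a toponym tagger

rows = keep_sentences_with_toponyms(split_sentences(sample), places)
for sentence, toponyms in rows:
    print(toponyms, "->", sentence)
```

A splitter this simple will break on abbreviations such as "Gen. Taylor", which is one reason a dedicated tokenizer is usually preferred for real texts.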


### 2. Hypothesis

We believe that the Geoparser map will primarily show negative emotions surrounding the war in Virginia, with negative keywords like loss, defeat, and surrender being the most prominent. Virginia will not be the only place on the map where emotions appear: places like West Virginia, California, and Pennsylvania should also show up through various associations found in the texts. It will also be interesting to see which emotions are present for other countries that were tied to the American Civil War. England, France, and Spain all had some form of involvement in the conflict, but texts that mention them will be few, which may skew the accuracy of the emotions shown for those countries.


### 3. Corpus Description

Since we wanted to map different emotions during the Civil War, we had a vast amount of data to comb through. We started with 720 texts after filtering only for literature whose subjects contained “Civil War”. This gave us a base for our data, but it was far too much, so we had to filter further. We added a second filter to include only texts whose subjects contained “United States”. We did not want texts from other countries, so this step was essential for accurate data, but it still left us with 630 results, which was still too many. Finally, we added a condition to include only texts whose subjects contained “Campaigns”. With all three conditions applied, we ended up with 70 texts.


In [10]:
#df_civil = pg_catalog_clean[pg_catalog_clean.subjects.str.contains('Civil War') & pg_catalog_clean.subjects.str.contains('United States') & pg_catalog_clean.subjects.str.contains('Campaigns')]
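
To show how each condition narrows the catalog, here is a small sketch that applies the same three subject terms one at a time and prints the running count. The `catalog` data frame and its four rows are invented for illustration; only the search terms and the `subjects` column name come from the filter above.

```python
import pandas as pd

# Toy stand-in for the Gutenberg catalog (invented rows)
catalog = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "subjects": [
        "United States -- History -- Civil War, 1861-1865 -- Campaigns",
        "United States -- History -- Civil War, 1861-1865 -- Fiction",
        "Great Britain -- History -- Civil War, 1642-1649",
        "Generals -- United States -- Biography",
    ],
})

terms = ["Civil War", "United States", "Campaigns"]
mask = pd.Series(True, index=catalog.index)
for term in terms:
    # regex=False matches the term literally; na=False treats missing subjects as no match
    mask &= catalog["subjects"].str.contains(term, regex=False, na=False)
    print(f"after '{term}': {mask.sum()} texts")

filtered = catalog[mask]
print(filtered["title"].tolist())
```

Building the mask step by step gives the same result as chaining the conditions with `&`, but it makes it easy to report how many texts survive each filter.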


In [11]:
#Import the results and show a sample
import pandas as pd
df_civil_TEXTS = pd.read_pickle('df_civil_TEXTS.pickle')
df_civil_TEXTS.sample(5)

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death,text_data
22029,22100,Text,2007-07-19,Slavery and four years of war,en,"United States -- History -- Civil War, 1861-18...",E456,US Civil War; Slavery; Browsing: History - Ame...,,Keifer,Joseph Warren,1836,1932,SLAVERY AND FOUR YEARS OF WAR ***\r\n\r\n\r\n\...
45529,45603,Text,2014-05-07,Two Wars: An Autobiography of General Samuel G...,en,"United States -- History -- Civil War, 1861-18...",E456,Browsing: Biographies; Browsing: History - Ame...,,French,Samuel Gibbs,1818,1910,TWO WARS: AN AUTOBIOGRAPHY OF GENERAL SAMUEL G...
45362,45436,Text,2014-04-18,Mosby's War Reminiscences; Stuart's Cavalry Ca...,en,"United States -- History -- Civil War, 1861-18...",E456,Browsing: History - American; Browsing: Histor...,,Mosby,John Singleton,1833,1916,***
2583,2616,Text,2004-06-01,Memoirs of General William T. Sherman — Volume 1,en,Generals -- United States -- Biography; United...,E456,US Civil War; Browsing: Biographies; Browsing:...,,Sherman,William T. (William Tecumseh),1820,1891,MEMOIRS OF GENERAL WILLIAM T. SHERMAN — VOLUME...
66171,66250,Text,2021-09-08,"An account of the battle of Wilson's Creek, or...",en,"United States -- History -- Civil War, 1861-18...",E456,Browsing: History - American; Browsing: Histor...,"Adams, Thomas W.",Holcombe,R. I. (Return Ira),1845,1916,"AN ACCOUNT OF THE BATTLE OF WILSON'S CREEK, OR..."


### 4. Geoparsing Results

The geoparsing process was relatively straightforward compared to the other steps. We gave the tool the pickle file containing the list of locations, and each location was parsed from the sentences and their surrounding context. This list came from the 70 texts described above, all of which matched the filters we applied to keep the data aligned with the American Civil War and the correct time frame. We were also wary of false positives from the tool and tried our best to filter out inconsistencies such as personal names being confused with locations.

Some false positives can still be seen in the geoparser output, such as Washington state being confused with Washington, D.C. This happens because the tool cannot consider the context or time period of the surrounding sentence and strictly matches the literal name to a location. To help remove false positives like these in the future, we could apply filters that check the surrounding context, although this may be very time consuming. A shorter yet imperfect way to do this would be to correct the most common mistakes directly, for example by noting that Washington was not yet a state, and to treat similar locations the same way. This method could be refined further by checking the results against a specific location list given to the tool.
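
One way the two ideas above (correcting common mistakes and checking against a location list) might look in code is sketched below. This is an assumption-laden illustration, not something the project implemented: `OVERRIDES`, `allowed`, and `resolve` are hypothetical names, and the coordinates for Washington, D.C. and Washington state are approximate values added for the example. The Columbus and Hong Kong entries reuse values from the sample output above.

```python
# Hypothetical manual corrections for names the geoparser tends to misplace.
# Coordinates for Washington, D.C. are approximate.
OVERRIDES = {
    "Washington": ("Washington, D.C.", 38.8951, -77.0364),
}

def resolve(toponym, geoparsed, overrides, allowed=None):
    """Return a (place, lat, lon) tuple, or None if the match should be dropped.

    toponym   -- the literal name found in the sentence
    geoparsed -- the (place, lat, lon) tuple the geoparser returned
    overrides -- manual corrections keyed by toponym
    allowed   -- optional set of acceptable place names
    """
    if toponym in overrides:
        return overrides[toponym]
    if allowed is not None and geoparsed[0] not in allowed:
        return None
    return geoparsed

results = [
    ("Washington", ("Washington", 47.50012, -120.50147)),  # resolved to the state
    ("Columbus", ("Columbus", 39.96118, -82.99879)),
    ("Victoria", ("Hong Kong", 22.27832, 114.17469)),      # likely false positive
]
allowed = {"Columbus", "Washington, D.C."}  # hypothetical location list

cleaned = []
for toponym, geoparsed in results:
    resolved = resolve(toponym, geoparsed, OVERRIDES, allowed)
    if resolved is not None:
        cleaned.append(resolved)

print(cleaned)
```

The override table handles known ambiguous names, while the allowed list drops anything that falls outside the set of places expected for the period.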


In [14]:
#Import the results and show a sample
df_civil_PLACES = pd.read_pickle('df_civil_PLACES.pickle')
df_civil_PLACES.sample(5)

Unnamed: 0,cleaned_sentences,text_id,title,subjects,last_name,first_name,birth,death,sentences,toponyms,nltk_toponym_count,place,latitude,longitude,feature_name
27957,The next morning at the dawn of day fugitives ...,45603,Two Wars: An Autobiography of General Samuel G...,"United States -- History -- Civil War, 1861-18...",French,Samuel Gibbs,1818,1910,The next morning at the dawn of day fugitives ...,[Columbus],544,[Columbus],[39.96118],[-82.99879],[seat of a first-order administrative division]
28668,The shrewd and aggressive officers of the Hud...,43590,"The Life of Isaac Ingalls Stevens, Volume 2 (o...","United States -- History -- Civil War, 1861-18...",Stevens,Hazard,1842,1918,The shrewd\r\nand aggressive officers of the H...,"[Victoria, San Juan]",28,"[Hong Kong, San Juan]","[22.27832, 18.46633]","[114.17469, -66.10572]","[capital of a political entity, capital of a p..."
22056,Penetrated in all directions by watercourses n...,23747,Destruction and Reconstruction: Personal Exper...,"United States -- History -- Civil War, 1861-18...",Taylor,Richard,1826,1879,Penetrated in all directions by watercourses n...,[Mississippi],1182,[Mississippi],[32.75041],[-89.75036],[first-order administrative division]
9310,"Gen. Taylor, with a proper escort, rode to En...",45603,Two Wars: An Autobiography of General Samuel G...,"United States -- History -- Civil War, 1861-18...",French,Samuel Gibbs,1818,1910,"Gen. Taylor, with a proper escort, rode\r\nto ...",[Santa Anna],58,"[La Encantada, Santa Anna]","[17.68333, -25.45027]","[-94.81667, -65.60471]","[populated place, populated place]"
10140,"Getty met a strong force along Meadow Brook, n...",22100,Slavery and four years of war,"United States -- History -- Civil War, 1861-18...",Keifer,Joseph Warren,1836,1932,"Getty met a strong force along Meadow Brook, n...",[Middletown],62,"[Meadow Brook, Middletown]","[41.9376, 40.26174]","[-71.16755, -79.6206]","[populated place, populated place]"


In [15]:
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Display each exported map image, skipping any file that is missing
for path in ['newplot-america.png', 'newplot-europe.png']:
    if not os.path.exists(path):
        print(f'Image not found, skipping: {path}')
        continue
    plt.imshow(mpimg.imread(path))
    plt.axis('off')
    plt.show()

### 5. Sentiment Analysis Results


Looking at the average scores of the positive, neutral, and negative word associations, we see that neutral is the largest. This makes sense: because these texts are about campaigns, they probably focus on the raw facts rather than opinions about them. Of the other two, the negative word association is larger in the overall averages, and the overall score is negative when all three are combined. This also makes sense because it was a war, wars tend to be sad, unenjoyable affairs, and Southern authors would likely write negatively since they lost. Although the negative score was bigger, the positive score was not much smaller. This could be because Northern authors would write positively since they won the war, while the average still leans negative because they would also write negatively about the lives lost.
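
The comparison of averages described above could be computed along these lines. This is only a sketch: the column names `neg`, `neu`, and `pos` and the three rows of scores are assumptions made for illustration, and the real columns in `df_civil_SENTIMENTS` may be named differently.

```python
import pandas as pd

# Invented per-sentence scores; column names are assumed, not taken from the project
scores = pd.DataFrame({
    "neg": [0.20, 0.05, 0.11],
    "neu": [0.70, 0.90, 0.80],
    "pos": [0.10, 0.05, 0.09],
})

averages = scores[["neg", "neu", "pos"]].mean()  # one mean per category
largest = averages.idxmax()                      # category with the highest mean

# A simple combined score: positive minus negative, averaged over all sentences
scores["net"] = scores["pos"] - scores["neg"]
net_average = scores["net"].mean()

print(averages.round(3))
print("largest category:", largest)
print("net average:", round(net_average, 3))
```

A net average below zero corresponds to the "more negative than positive" pattern discussed above, while a dominant neutral mean reflects mostly factual writing.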

In [None]:
#Import the results and show a sample
df_civil_SENTIMENTS = pd.read_pickle('df_civil_SENTIMENTS.pickle')
df_civil_SENTIMENTS.sample(5)

In [None]:
emotions = mpimg.imread('emotions.png')
avg = mpimg.imread('avg.png')

plt.imshow(emotions)
plt.axis('off') 
plt.show()

plt.imshow(avg)
plt.axis('off') 
plt.show()

### 6. Mapping

For the mapping process, we removed locations with extremely high counts to make the data more balanced and easier to interpret. Some places had over 1,000 mentions, which would have made the overall emotional trends harder to see, so filtering them out was necessary. Locations that did not have enough useful data were also excluded in order to focus on the most important areas. The map was then centered and zoomed in to highlight key regions like the U.S. and parts of Europe, where the most emotional data was found.
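
A count-based filter of this kind might be sketched as follows. The function name, the toy rows, and the tiny thresholds are all hypothetical; the project's upper cutoff was around 1,000 mentions, and the lower cutoff is not stated, so both are left as parameters here.

```python
from collections import Counter

def filter_places_by_count(place_lists, min_count, max_count):
    """Keep only places mentioned between min_count and max_count times overall.

    place_lists -- one list of place names per sentence
    Returns the filtered lists and the full Counter of mentions.
    """
    counts = Counter(place for places in place_lists for place in places)
    keep = {p for p, n in counts.items() if min_count <= n <= max_count}
    filtered = [[p for p in places if p in keep] for places in place_lists]
    return filtered, counts

# Toy data: one list of places per sentence
rows = [
    ["Mississippi", "Columbus"],
    ["Mississippi"],
    ["Mississippi"],
    ["Columbus"],
    ["Middletown"],
]

# Tiny thresholds for the toy data only
filtered, counts = filter_places_by_count(rows, min_count=2, max_count=2)
print(counts.most_common())
print(filtered)
```

Here "Mississippi" is dropped for being too frequent and "Middletown" for being too rare, which mirrors the two exclusions described above at a much smaller scale.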

The map shows more clearly where the texts express strong emotions about the war, with red areas marking negative emotions tied to loss and defeat, especially in heavily affected states like Virginia. The color scheme corresponds to the different emotions: the darker (purple) colors represent neutral feelings, the reddish colors represent negative emotions (like loss or defeat), and blue represents positive emotions (like victory). The map also includes places like England and France, which were connected to the war in smaller ways but still had some impact. Even though the U.S. had the most data, these European countries still appeared because of their diplomatic ties to the war. Overall, this helped display the emotional reactions to the Civil War effectively, both in the U.S. and internationally.
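
The color scheme above implies a step that turns each numeric score into one of three labels. A minimal sketch of that step is shown here; the `label_score` function and the cutoff of 0.05 are assumptions for illustration, since the project's actual thresholds are not given. Only the color assignments come from the description above.

```python
# Colors as described in the text: red = negative, purple = neutral, blue = positive
COLORS = {"negative": "red", "neutral": "purple", "positive": "blue"}

def label_score(score, cutoff=0.05):
    """Map a combined sentiment score to a label (cutoff is an assumed value)."""
    if score <= -cutoff:
        return "negative"
    if score >= cutoff:
        return "positive"
    return "neutral"

# Invented (place, score) pairs
points = [("Virginia", -0.40), ("England", 0.01), ("Pennsylvania", 0.30)]

labeled = [(place, label_score(score), COLORS[label_score(score)])
           for place, score in points]
for place, label, color in labeled:
    print(f"{place}: {label} ({color})")
```

A label-to-color dictionary of this shape is what Plotly Express accepts through its `color_discrete_map` argument when the `color` argument names the label column, so the same mapping could be reused when drawing the map.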


In [None]:
import plotly.express as px
#import matplotlib.pyplot as plt

# Define a threshold for the minimum count
#threshold = 200

# Filter the dataframe to include only rows where location_count is above the threshold
#df_filtered = df_civil[df_civil['location_count'] >= threshold]

# Plot the histogram of the filtered location counts
#df_filtered['location_count'].plot.hist(bins=10, alpha=0.7)

# Add labels and title for clarity
#plt.xlabel('Location Count')
#plt.ylabel('Frequency')
#plt.title('Histogram of Location Counts (Filtered)')
#plt.show()

### 7. Conclusion

When starting this project, I assumed that public opinion on the American Civil War would be divided at first, with a mainly negative view that would get progressively sadder as the war went on. I also assumed that the Southern states would become considerably more negative than the Northern states, considering their defeat. To my surprise, there was less negativity and sadness than I had expected. I believe this is because I had not considered the demographics of the authors, which may have made the opinions about the war more biased and optimistic in a one-sided way. It is also possible that these effects are not as apparent in the time frame we had chosen. The first issue we ran into with the data was that there were far too many subjects and sources, so we needed a way to filter and narrow down our results. Restricting our search to the time frame, the American Civil War, and campaigns helped to resolve this issue.


Filtering remains an issue, though, as false positives and inaccurate data can still be found in our results. If we had another chance at this project, one thing our team would probably work on is narrowing down our results by telling apart places with similar names. It would also have been beneficial to look at non-war-related texts from the era to see how general writing had been affected.