<a href="https://colab.research.google.com/github/BrockDSL/Analyzing_Web_Archives/blob/main/Meme_Web_Archive_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![dsl_logo.png](https://raw.githubusercontent.com/BrockDSL/Analyzing_Web_Archives/main/dsl_logo.png)

# Analyzing Web Archives

Welcome to the Digital Scholarship Lab's Analyzing Web Archives workshop. This notebook investigates the [Meme Generator dataset](https://www.loc.gov/item/2018655320/) from the Library of Congress in the United States. It shows how we can use a derivative dataset, and the Python programming language, to come up with some interesting results.


## How this notebook works

This webpage is a Google Colab notebook made up of different *cells*. Some are code cells that run Python snippets. To work through these cells, simply click the triangle _run_ button in each cell. Click in the cell below to see the run button, then click it to begin.

In [None]:
!pip install langdetect
!pip install pandas
!pip install matplotlib

import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image,display,IFrame
from ipywidgets import widgets,interact,interact_manual
from langdetect import detect

#placeholder for the random entry that later cells share
rando = None

%matplotlib inline
print("\nLibraries loaded, and ready to run!")

The workshop will proceed as follows: we'll explain a few concepts, run a few cells, and then ask you to do the same. We'll then ask for your feedback in the chat box or over the microphone.

# Loading our data set

We'll be using the [Pandas](https://pandas.pydata.org/) Python Library to analyze our dataset and we'll do our best to describe what is going on in the code with comments and descriptions.

The information from this archive is saved in a _CSV_ file, which is basically something similar to a spreadsheet. In the next cell we will load this file into something called a Pandas DataFrame, and we'll look at the first 5 entries by using the **head(5)** function call.

In [None]:
#Open up the CSV file and put it in a Pandas Dataframe variable
meme_data = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/Analyzing_Web_Archives/main/memegenerator.csv", sep=",")
meme_data["Meme ID"] = meme_data["Meme ID"].astype(str)

meme_data.head(5)
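
To get a feel for what a CSV file looks like before Pandas turns it into a table, here is a minimal sketch that loads a tiny made-up CSV from a string instead of a URL. The rows and values are invented for illustration; only the column names mirror ones used in this notebook.

```python
import io
import pandas as pd

# A tiny, made-up CSV: a header line, then one line per row
csv_text = """Meme ID,Base Meme Name,Alternate Text
101,Example Cat,hello there
102,Example Dog,good boy
103,Example Cat,so sleepy
"""

# read_csv accepts a file-like object as well as a URL or a file path
toy = pd.read_csv(io.StringIO(csv_text))

# Store the IDs as text so they are treated as labels, not numbers
toy["Meme ID"] = toy["Meme ID"].astype(str)

print(toy.head(2))  # the first two rows
```

Each comma in the text becomes a column boundary, and the first line supplies the column names.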



---


# Some General Data Exploration


### How much data?

We find out how many data points we have by counting the number of rows in our dataframe. We do this using the **len()** function.


In [None]:

print("We have this many memes to look at: ",len(meme_data))


### Random Entry

To get a better sense of what is in our dataset let's look at a random entry by using **sample(1)**. Run the cell below a few times to see a few different entries. The meme image might take a few seconds to load.

In [None]:
rando = meme_data.sample(1)
#find the URL of the picture file and display it
display(Image(url=rando['Archived URL'].values[0], format='jpg'))
print("View on Memegenerator: \t",rando['Meme Page URL'].values[0])
print("View on Archive: \t\t\t",rando['Archived URL'].values[0])
rando
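
If you ever need the *same* random entry every time (for example, to share a result with someone), `sample` accepts a `random_state` seed. This is a small sketch on a made-up table; the seed value 42 is arbitrary.

```python
import pandas as pd

# Made-up data standing in for the meme table
toy = pd.DataFrame({
    "Meme ID": ["1", "2", "3", "4", "5"],
    "Score": [10, 20, 30, 40, 50],
})

# With a fixed seed, sample() picks the same row on every run
first = toy.sample(1, random_state=42)
second = toy.sample(1, random_state=42)

print(first.equals(second))
```

Leaving `random_state` out, as the workshop cell does, gives a fresh pick on each run.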

## Question 1 ##

Have a look at the data that is associated with the random record. In the chat box, suggest some things you might want to explore with this data.

## Category of memes?

As you might know, memes come in many different flavours. Let's see if we can find out how many types there are. We'll do this by **grouping** our _Base Meme Name_ column and **counting** how many entries are in each.

In [None]:
# Count works by showing us how many rows have a value (not empty) in each
# group. It does this for every column in the dataframe, which is why we
# have so many columns in our result

meme_data.groupby(["Base Meme Name"]).count()
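
To see the grouping step on something small, here is a sketch with a made-up three-row table. It contrasts `count()`, which counts filled-in values per column, with `size()`, which counts rows per group, and uses `nunique()` to count the distinct categories.

```python
import pandas as pd

# Made-up rows; one 'Alternate Text' value is deliberately missing
toy = pd.DataFrame({
    "Base Meme Name": ["Cat", "Cat", "Dog"],
    "Alternate Text": ["hi", None, "woof"],
})

per_column = toy.groupby("Base Meme Name").count()  # non-empty values per column
per_group = toy.groupby("Base Meme Name").size()    # rows per group
n_categories = toy["Base Meme Name"].nunique()      # number of distinct names

print(per_column)
print(per_group)
print("Distinct categories:", n_categories)
```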

## Question 2 ## 

Based on the above summary, how many different types of memes do we have? Share your response in the chat box.

Yikes! That looks like a lot. Let's just keep the top **25** entries. We'll do this by **sorting**.

In [None]:
# Since the memes are sorted from biggest to smallest we can use the 
# slice operator with [0:25] to only show the top 25 memes
# slice operator - https://www.w3schools.com/python/ref_func_slice.asp

top_25_cat = pd.DataFrame(meme_data.groupby(["Base Meme Name"]).count().sort_values(by="Meme ID",ascending=False)["Meme ID"][0:25])
top_25_cat
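
The group, count, sort, and slice steps can also be written with `value_counts()`, which counts each distinct value and sorts from biggest to smallest in one go. This is a sketch on made-up names, shown as an alternative rather than a replacement for the cell above.

```python
import pandas as pd

# Made-up category names
names = pd.Series(["Cat", "Dog", "Cat", "Bird", "Cat", "Dog"])

# value_counts() sorts by frequency (descending); head(2) keeps the top 2
top_2 = names.value_counts().head(2)

print(top_2)
```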

## A Random Entry by Meme Category


Let's create an interactive form that will allow us to pick a category to see a random entry. Please click the run button in the cell below to show the form. Copy/Paste the name of a meme category from above in the text box below and click **Show**. 

(By the way, if you choose something that isn't in the category list it will show an error. No matter: just type in something else and run the cell again.)


In [None]:
#Random Entry Form
def show_random(choice):
  rando = meme_data[meme_data["Base Meme Name"] == str(choice)].sample(1)
  print("View on Memegenerator: \t",rando['Meme Page URL'].values[0])
  print("View on Archive: \t\t\t",rando['Archived URL'].values[0])
  display(Image(url=rando['Archived URL'].values[0], format='jpg'))
  display(rando)


title_textbox = widgets.Text(
    value='Me Gusta',
    description='Category',
)
print("Enter a meme category from the list above to see a random entry in that category")
print("Click 'Show' to display\n")
show_random_control = interact_manual.options(manual_name="Show")
show_random_control(show_random,choice=title_textbox);
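
The error mentioned above appears because filtering on an unknown category leaves an empty table, and you can't sample from nothing. One way to handle that, sketched here with a hypothetical helper called `pick_random` on made-up data, is to check whether the filtered table is empty first.

```python
import pandas as pd

def pick_random(frame, column, value):
    """Return one random row where column == value, or None if none match."""
    matches = frame[frame[column] == value]
    if matches.empty:
        return None  # nothing to sample from
    return matches.sample(1)

# Made-up data
toy = pd.DataFrame({
    "Base Meme Name": ["Cat", "Cat", "Dog"],
    "Meme ID": ["1", "2", "3"],
})

print(pick_random(toy, "Base Meme Name", "Cat"))
print(pick_random(toy, "Base Meme Name", "No Such Meme"))
```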




## Now, a (bit) of Math

Let's do a touch of analysis on the categories of memes...

What's the **average** number of memes in each type?



In [None]:
# We group the data again by 'Base Meme Name' then we apply count() to see
#how many items fall within that category
#finally we apply the mean() function to calculate the mean average
meme_type_mean = meme_data.groupby(["Base Meme Name"])["Meme ID"].count().mean()
print("Average number of entries per meme category: ",meme_type_mean)

## Question 3 ##

We are going to ask you to modify the code in the following cell to extend our running analysis.

The _mean_ average might be a little misleading. Let's also check what the median number is for each base meme. The code chunk below is incomplete. Can you resolve the error?

In [None]:
# We group the data again by 'Base Meme Name' then we apply count() to see
#how many items fall within that category
#modify the line below to complete this cell
meme_type_median = meme_data.groupby(["Base Meme Name"])["Meme ID"].count().()
print("The median number of entries per base meme is: ",meme_type_median)

As you can see, the difference between the mean and the median is significant. This is because there's a skewed distribution in our dataset. 

Do you have any guesses as to why this might be the case? Share your thoughts in the chat!
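
To see how a single very large value pulls the mean away from the median, here is a small sketch using Python's built-in `statistics` module. The numbers are made up: seven small categories and one huge one.

```python
import statistics

# Made-up category sizes: most are tiny, one is enormous
sizes = [1, 1, 2, 2, 3, 3, 4, 500]

mean_size = statistics.mean(sizes)      # sum of values divided by how many there are
median_size = statistics.median(sizes)  # middle value once sorted

print("Mean:  ", mean_size)
print("Median:", median_size)
```

The one category with 500 entries drags the mean up to 64.5, while the median stays at 2.5, which is much closer to what a typical category looks like.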



## Huh, that's weird

Let's visualize our lopsided distribution by drawing a **histogram**, which shows the frequency distribution in our data. We will use the [Matplotlib](https://matplotlib.org/) library to do this. Run the cell below to draw it. (You don't need to modify anything.)

In [None]:
bins = 150

plt.hist(meme_data.groupby(["Base Meme Name"]).count().sort_values(by="Meme ID",ascending=False)["Meme ID"],bins)

plt.title("Category Frequency")
plt.xlabel("Number of Entries")
plt.ylabel("Number of Categories")
plt.show()
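
A histogram works by splitting the range of values into equal-width *bins* and counting how many values land in each. The sketch below does that counting without drawing anything, using NumPy's `histogram` function on made-up category sizes.

```python
import numpy as np

# Made-up category sizes: five small ones and one large one
sizes = [1, 1, 2, 2, 3, 50]

# Split the range 1..50 into 5 equal-width bins and count values in each
counts, edges = np.histogram(sizes, bins=5)

print("Bin edges: ", edges)
print("Bin counts:", counts)
```

Almost everything lands in the first bin and the last bin holds the single large value, which is the same lopsided shape the plot above shows.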

## Question 4 ##


Can you describe this graph? What is the biggest value that it is showing? Share your thoughts in the chat box.



---


# Enriching the Data

We've had some fun looking at different components of the data, but let's see if we can **enrich** the data by adding more columns of information to it.

# Language info

As we've seen in our examples there are many different languages represented in our dataset. Let's see if we can **enrich** our dataset by automatically detecting the language of each meme and adding that as a new column. We'll use the [langdetect](https://pypi.org/project/langdetect/) library to do this. We can use the text in the _Alternate Text_ column.

In [None]:
#Let's look at our random item again
rando

In [None]:
# Let's detect the language of the random entry from earlier
# We'll use the 'Alternate Text' column to detect the language
# We'll get a two-letter language code that represents one of the languages in the list of ISO 639-1 codes (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). 
# .values[0] pulls out just the text, without the index and type labels that str() of a whole column adds
print(detect(str(rando["Alternate Text"].values[0])))
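
A small detail worth knowing: a column selected from a one-row dataframe is still a Pandas *Series*, and turning a whole Series into a string includes its index and type labels along with the text. The sketch below, on a made-up row, shows the difference between that and pulling out the single value first.

```python
import pandas as pd

# A made-up one-row table
row = pd.DataFrame({"Alternate Text": ["bonjour tout le monde"]})

as_series_text = str(row["Alternate Text"])           # includes index and labels
as_plain_text = str(row["Alternate Text"].values[0])  # just the text itself

print(repr(as_series_text))
print(repr(as_plain_text))
```

Passing only the plain text to a language detector avoids feeding it extra English-looking labels.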

## Question 5

Why do you think we used the _Alternate Text_ column instead of any others for the language detection?


## Don't look behind the curtain

It would take too long to calculate all these language values now for all of the entries in the dataset. So the next cell will just add a new column to our dataset. (It took 8 minutes for the language detection code to run on the original dataset.)

Have a look at the new column _Language_ that was added.

In [None]:
#We build our dataset from scratch so that the cells in this notebook will
#always work as expected

#reload original dataset
meme_data = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/Analyzing_Web_Archives/main/memegenerator.csv", sep=",")
meme_data["Meme ID"] = meme_data["Meme ID"].astype(str)

#load and merge language info
lang_data = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/Analyzing_Web_Archives/main/language_data.csv")
lang_data["Meme ID"] = lang_data["Meme ID"].astype(str)
meme_data = pd.merge(meme_data,lang_data,on="Meme ID", how = "outer")


print("Language Information added to the dataset!")
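
To check what an *outer* merge does, it can help to try it on two tiny tables. This sketch uses made-up IDs and the `indicator=True` option of `pd.merge`, which adds a `_merge` column saying whether each row came from the left table, the right table, or both.

```python
import pandas as pd

# Made-up tables that only partly overlap on 'Meme ID'
memes = pd.DataFrame({
    "Meme ID": ["1", "2", "3"],
    "Base Meme Name": ["Cat", "Dog", "Cat"],
})
langs = pd.DataFrame({
    "Meme ID": ["2", "3", "4"],
    "Language": ["en", "fr", "de"],
})

# how="outer" keeps every ID from both tables; gaps are filled with NaN
merged = pd.merge(memes, langs, on="Meme ID", how="outer", indicator=True)

print(merged)
```

Rows marked `left_only` or `right_only` are the ones with no match in the other table.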


## How Effective was Language Detection?

Since the language detection was an automated process there might be some problems with results. As described in the [notebook](https://github.com/BrockDSL/Analyzing_Web_Archives/blob/main/Meme_Language_Detection.ipynb) that generates the data, if a match can't be found it adds `Could Not Detect` to the column instead of a language code. 
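
The linked notebook has the actual code. As a rough sketch of the general pattern only, the hypothetical `safe_detect` helper below wraps any detector function and falls back to a placeholder when the text is blank or the detector raises an error. The `fake_detector` function is a made-up stand-in so the example runs without any language library installed.

```python
def safe_detect(text, detector, fallback="Could Not Detect"):
    """Run detector(text); return fallback if text is blank or detection fails."""
    if not isinstance(text, str) or not text.strip():
        return fallback
    try:
        return detector(text)
    except Exception:
        # Any failure inside the detector becomes the placeholder value
        return fallback

def fake_detector(text):
    # Stand-in for a real detector: fails when the text contains no letters
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no usable text")
    return "en"

print(safe_detect("hello world", fake_detector))
print(safe_detect("12345", fake_detector))
print(safe_detect("", fake_detector))
```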

## Question 6

What percentage of memes had a language successfully detected? Share a number in the chat box before running the cells below.


Let's see how many couldn't be detected:

In [None]:
mismatches = meme_data[meme_data["Language"] == "Could Not Detect"]["Meme ID"].count()
print("Total number of mismatches: ",mismatches)
#Order of operations just like in algebra class
print("Successful languages looked up: ", (1 - (mismatches/len(meme_data))) * 100,"%")
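
The same percentage can be worked out another way: compare the column against the placeholder to get a column of True/False values, then take its mean. Since True counts as 1 and False as 0, the mean is the share of True values. This sketch uses four made-up labels.

```python
import pandas as pd

# Made-up language labels; "Could Not Detect" marks a failed detection
languages = pd.Series(["en", "fr", "Could Not Detect", "en"])

detected = languages != "Could Not Detect"  # True where detection worked
success_pct = detected.mean() * 100         # share of True values, as a percentage

print(f"Detected: {success_pct:.1f}%")
```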

Yikes! Let's see what happened here by looking at a couple of random entries with that value. If we were going to do some serious analysis we would have to correct those, probably with a manual process.

In [None]:
meme_data[meme_data["Language"] == "Could Not Detect"].sample(5)

Yuck. We've found some memes without good _Alternate Text_ values. For now let's just proceed by keeping them in the dataset. Depending on what your analysis needs to do, you might want to delete them from the dataset.


## Summary of Language Information

Run the next cell to generate a pie graph of the top 10 languages seen in the memes.

In [None]:
#We group, count, & sort
#We then use that slice operator again to only get the top 10 values.
pie_data = meme_data.groupby(["Language"]).count().sort_values(by="Meme ID",ascending=False)[0:10]["Meme ID"]
plt.pie(pie_data, labels = pie_data.keys())
plt.title("Top 10 Languages in the Memes")
plt.show()

## Question 7 ##

Is that how you thought the languages would be distributed? Share your thoughts in the chat box.

# Meme Scores!

Memegenerator has a voting capability. By clicking the up or down arrow, users can increase or decrease a meme's score. Let's see this for our random meme. Run the next cell to generate the preview.

In [None]:
preview_url = str(rando['Meme Page URL'].values[0])
preview_url = preview_url.replace("http:","https:")

IFrame(preview_url,width=1000, height=700)

To enrich our dataset even more we found the scores of all of the memes in the dataset.

We did this by **downloading** all 60,000 meme webpages and screen scraping them to find the score that was presented on each page. Downloading the information took about **4 hours**, so we won't try to do that again. We will, however, open a CSV file of these scores and add them to our dataset, just like we did with the language information. For memes that we could not find a score for (because they were deleted from the site, for example) we just use the placeholder value `NaN` instead.



Run the next cell to do this and preview a few random scores. Feel free to run it a few times.

In [None]:
#Let's open the file and have a peek.
meme_scores = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/Analyzing_Web_Archives/main/meme_scores.csv",dtype={'Meme ID': object})
meme_scores["Meme ID"] = meme_scores["Meme ID"].astype(str)
meme_scores.sample(5)




---


## Constructing the Final Data Set

Let's add this data to our original dataset by matching on the **Meme ID** column. Then let's look at a couple of random entries of our newly enriched completed dataset. 


We drop memes that we couldn't get a score for from the dataset. This is usually a good strategy, but for your research projects you will need to decide what is right for what you are investigating.

Run the next cell to create our final version of the dataset with all of the enriched data and display a few random entries. Notice how we add a column called _Score_.


In [None]:
#This cell will build the whole dataset from scratch. ie all 3 steps
#this is necessary so that if you run the cells out of order everything 
#will still work as expected

#Original Dataset
meme_data = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/Analyzing_Web_Archives/main/memegenerator.csv", sep=",")
meme_data["Meme ID"] = meme_data["Meme ID"].astype(str)

#open CSV of language info and create a dataframe
lang_data = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/Analyzing_Web_Archives/main/language_data.csv")
lang_data["Meme ID"] = lang_data["Meme ID"].astype(str)

#merge language data to meme data frame
meme_data = pd.merge(meme_data,lang_data,on="Meme ID", how = "outer")

#Meme Score Added
meme_scores = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/Analyzing_Web_Archives/main/meme_scores.csv",dtype={'Meme ID': object})
meme_scores["Meme ID"] = meme_scores["Meme ID"].astype(str)
meme_data = pd.merge(meme_data,meme_scores,on="Meme ID", how = "outer")

#Keep only memes that have real (non-NaN) values in at least 8 columns; any meme with fewer gets dropped
meme_data.dropna(thresh=8,inplace=True)

#set our random item to be from our new dataframe
rando = meme_data.sample(1)

print("\nFinal Dataset built and ready to go.")

meme_data.sample(5)
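
The `thresh` option of `dropna` can be easy to misread: it is the minimum number of filled-in (non-NaN) values a row needs in order to be *kept*. The sketch below tries it on a made-up three-column table and, for comparison, shows `subset`, which drops rows based on one specific column instead.

```python
import numpy as np
import pandas as pd

# Made-up table with gaps
toy = pd.DataFrame({
    "Meme ID": ["1", "2", "3"],
    "Language": ["en", None, "fr"],
    "Score": [5.0, np.nan, np.nan],
})

# Keep rows that have at least 2 filled-in values
by_thresh = toy.dropna(thresh=2)

# Keep rows that have a value in the 'Score' column specifically
by_score = toy.dropna(subset=["Score"])

print(by_thresh)
print(by_score)
```

Which of the two fits best depends on the question you are asking of the data.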

## Another Random Entry

As we've been doing, let's take a look at another random entry, this time with all of the information that we have added.

Feel free to run this cell a few times. Remember, it might take a few moments for the image to load.

In [None]:
rando = meme_data.sample(1)
display(Image(url=rando['Archived URL'].values[0], format='jpg'))
rando

## Question 8

What kinds of questions can we ask and answer with this new enriched dataset? Feel free to share some thoughts in the chat box or over the microphone.



---


# The Final Analysis: Scores

Let's try looking at some trends in our final dataset. First our average meme score...

In [None]:
print("Average score of memes: ",meme_data["Score"].mean())

## Average Scores by Language

Let's see what our average scores are for all of the languages in our dataset.

In [None]:
print("Average score by languages: ")
#select the Score column before averaging so that text columns are left out of the calculation
top_by_language = pd.DataFrame(meme_data.groupby("Language")["Score"].mean().sort_values(ascending=False))
top_by_language
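
An average on its own can hide how many memes it is based on: a language with a single high-scoring meme will top the list. One way to keep that in view, sketched here on made-up scores, is to ask for the mean and the count together with `agg`.

```python
import pandas as pd

# Made-up scores: 'fr' has one meme, 'de' has three
toy = pd.DataFrame({
    "Language": ["en", "en", "fr", "de", "de", "de"],
    "Score": [10, 20, 100, 1, 2, 3],
})

# For each language, compute both the mean score and the number of memes
summary = (
    toy.groupby("Language")["Score"]
    .agg(["mean", "count"])
    .sort_values(by="mean", ascending=False)
)

print(summary)
```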


## Displaying Top Score by Language

Run the cell below and enter the two-letter code of a _Language_ in the text box to see the highest scoring meme in that language. Click the **Show** button to retrieve the top meme. If you put in a code that it can't find, it will give you an error. That's OK: just enter a different value in the text box and try again.

In [None]:
def show_top_by_language(language_choice):
  top_lang_score = pd.DataFrame(meme_data[meme_data["Language"] == language_choice])
  top_lang_score = top_lang_score.sort_values(by="Score",ascending=False).head(1)
  display(Image(url=top_lang_score['Archived URL'].values[0], format='jpg'))
  display(top_lang_score)

language_textbox = widgets.Text(
    value = 'de',
    description = 'Language'

)

print("Type in any two letter language code to see the top scoring meme in that language")
show_top_lang_control = interact_manual.options(manual_name="Show")
show_top_lang_control(show_top_by_language,language_choice=language_textbox);




## Question 9

Try some different language codes in the form above. Share any interesting results you find in the chat box.

## Top Score by Category (via Mean)

Let's see what our average scores are for the top 25 meme categories in our dataset.

In [None]:
print("Mean score by base meme, top 25 only: ")
#we group, select the Score column, find the mean and then sort our results
#we use slice to only show the top 25
top_by_category_mean = pd.DataFrame(meme_data.groupby("Base Meme Name")["Score"].mean().sort_values(ascending=False)[0:25])
top_by_category_mean

## Top Score by Category (via Median)

We saw before that if we used **median** instead of **mean** we got different results. See if you can modify the code in the next cell to have it calculate median instead of mean. 

In [None]:
print("Median score by base meme, top 25 only: ")
#we group, select the Score column, find the median and then sort our results
#we use slice to only show the top 25
#modify the line below to complete this cell
top_by_category_median = pd.DataFrame(meme_data.groupby("Base Meme Name")["Score"].().sort_values(ascending=False)[0:25])
top_by_category_median

## Displaying Highest Scoring Meme by Meme Category

Run the cell below and add the name of a _Base Meme Name_ in the text box to see the highest scoring meme in that category. Click the **Show** button to retrieve the top meme. You can copy/paste from the list above. If you type in a value that it doesn't find, it will cause an error message. That's OK: just try again with a different category.

In [None]:
#Run this cell to load the previewer
def show_top_by_category(category_choice):
  top_cat_score = pd.DataFrame(meme_data[meme_data["Base Meme Name"] == category_choice])
  top_cat_score = top_cat_score.sort_values(by="Score",ascending=False).head(1)
  display(Image(url=top_cat_score['Archived URL'].values[0], format='jpg'))
  display(top_cat_score)
 
category_textbox = widgets.Text(
    value = 'Sudden Clarity Clarence',
    description = 'Category'

)
print("Copy/Paste a Meme Category from above to see the top scoring meme in that Category.")
show_top_cat_control = interact_manual.options(manual_name="Show")
show_top_cat_control(show_top_by_category,category_choice=category_textbox);

## Question 10

Try experimenting with different meme categories and share any interesting results in the chat box.



---



## Highest Scoring Meme in the dataset!


Now that we've explored some different dimensions of the data let's take a look at the highest scoring meme in the whole dataset. Run the final code cell below to find out what it is.

In [None]:
display(Image(url=meme_data[meme_data['Score'] == meme_data['Score'].max()]['Archived URL'].values[0], format='jpg'))
meme_data[meme_data['Score'] == meme_data['Score'].max()]
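
Comparing against `max()`, as above, returns *every* row that shares the top score, so ties show up as more than one row. If you only ever want a single row, `nlargest` is an alternative. This sketch shows both on made-up scores that contain a tie.

```python
import pandas as pd

# Made-up scores with a tie for first place
toy = pd.DataFrame({
    "Meme ID": ["1", "2", "3"],
    "Score": [7, 42, 42],
})

# Every row that has the maximum score (may be more than one)
all_top = toy[toy["Score"] == toy["Score"].max()]

# Exactly one row: the first one found with the largest score
top_one = toy.nlargest(1, "Score")

print(all_top)
print(top_one)
```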


# Congratulations

You've now had a tour of the Memegenerator dataset! This notebook showed us how to load the original dataset, augment it with additional information, and run some interesting analysis. With luck it has given you some ideas on how you can use web archives & the Python programming language to conduct some interesting research.

We're happy to answer questions in the chat box or please send us a message at: **dsl @ brocku.ca**