## Summary: PDF and HTML Scraping
This notebook uses examples to demonstrate how to:
1. Perform **PDF scraping** to extract content from PDFs.
2. Use **HTML scraping** to extract data from web pages.
3. Save and further process the extracted data.
Each step comes with comments and explanations.

## Combined Notebook with Annotations
This notebook merges and annotates the content from all uploaded notebooks.

#### Content from: Excercise 2. Modelling the Weighted Social Network of Hamlet..ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

Let's practice what we have just learnt! Let's follow this article (https://litlab.stanford.edu/LiteraryLabPamphlet2.pdf) and build the social network of **Hamlet** by William Shakespeare. We will be using this version of the book: https://www.gutenberg.org/files/1524/1524-h/1524-h.htm

**Note**: we are not using *Vingt mille lieues sous les mers* because it is a very long book (300+ pages) with relatively few dialogues, so it may not be the best case study for social networks given the amount of time that we have. However, Network Analysis with NetworkX can be done multilingually!

# 1. First we import the libraries

# 2. Then we create the G object

Just so you know, with this command we can clear our network (for example, if we make a spelling mistake that contaminates our graph).
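As a minimal sketch of what clearing looks like, assuming the graph is a plain `nx.Graph()` and that `G.clear()` is the command meant here (the misspelled node name is hypothetical):

```python
import networkx as nx

G = nx.Graph()
G.add_node("Hamlt")   # hypothetical typo that contaminates the graph
G.clear()             # removes every node and edge, G itself stays usable

print(G.number_of_nodes(), G.number_of_edges())
```

If only one node is wrong, `G.remove_node("Hamlt")` removes just that node and its edges instead of emptying the whole graph.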

# 3. Characters

Now we transform every character into a node by writing each name inside **G.add_node()**. Only the main characters are included in here. 

These are the characters of the play (you can find this information at the beginning of the book). Remember to replace "Claudius" with "King" and "Gertrude" with "Queen", as that is how they appear throughout the play.

Dramatis Personæ

* HAMLET, Prince of Denmark
* CLAUDIUS, King of Denmark, Hamlet’s uncle
* The GHOST of the late king, Hamlet’s father
* GERTRUDE, the Queen, Hamlet’s mother, now wife of Claudius
* POLONIUS, Lord Chamberlain
* LAERTES, Son to Polonius
* OPHELIA, Daughter to Polonius
* HORATIO, Friend to Hamlet
* FORTINBRAS, Prince of Norway
* VOLTEMAND, Courtier
* CORNELIUS, Courtier
* ROSENCRANTZ, Courtier
* GUILDENSTERN, Courtier
* MARCELLUS, Officer
* BARNARDO, Officer
* FRANCISCO, a Soldier
* OSRIC, Courtier
* REYNALDO, Servant to Polonius
* Players
* A Gentleman, Courtier
* A Priest
* Two Clowns, Grave-diggers
* A Captain
* English Ambassadors.
* Lords, Ladies, Officers, Soldiers, Sailors, Messengers, and Attendants

# 4. Textual Interactions

Then we count (old-school style, by reading the book) who is talking to whom, and we write that down in **G.add_edge()**. If we make a mistake and accidentally record the same pair of characters twice, it doesn't matter: the networkx library will only take into account one edge per pair of nodes.

In theatre plays it can be a bit confusing to know who is talking to whom, as in some scenes (such as the last one in Hamlet) everybody is talking (or shouting!) at the same time and it is a total mess! It's OK if your edges are not 100% accurate: an approximation will be fine!
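Here is a small sketch of how this behaves in practice, assuming an undirected `nx.Graph()`. The weights are made-up placeholders, not real counts from the play:

```python
import networkx as nx

G = nx.Graph()
G.add_edge("Hamlet", "Horatio", weight=5)   # hypothetical count
G.add_edge("Hamlet", "Horatio", weight=5)   # accidental duplicate of the same pair
G.add_edge("Hamlet", "King", weight=3)      # hypothetical count

print(G.number_of_edges())                  # the duplicate did not create a new edge
print(G["Hamlet"]["Horatio"]["weight"])
```

Note that a repeated `add_edge()` call replaces the stored weight rather than adding to it, so the last value you write for a pair is the one that stays.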

# 5. Checking the structure of our network

Now let's have a look at the number of nodes that we have. Use the G.number_of_nodes() method and then transform G.nodes into a list.

Let's do the same with the edges.
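A minimal sketch of both checks on a toy graph (the four characters and two edges are only an illustration, not the full network):

```python
import networkx as nx

G = nx.Graph()
G.add_nodes_from(["Hamlet", "Horatio", "King", "Queen"])
G.add_edge("Hamlet", "Horatio")
G.add_edge("King", "Queen")

print(G.number_of_nodes())
node_list = list(G.nodes)   # nodes in the order they were added
print(node_list)

print(G.number_of_edges())
edge_list = list(G.edges)   # each edge is a (node, node) tuple
print(edge_list)
```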

And now let's check the weighted edges.

And now let's sort that list to see who talks the most!
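One possible way to do the sorting, sketched on a toy graph with placeholder weights (the real counts come from your own reading):

```python
import networkx as nx

G = nx.Graph()
G.add_edge("Hamlet", "Horatio", weight=5)   # hypothetical counts
G.add_edge("Hamlet", "King", weight=3)
G.add_edge("King", "Queen", weight=4)

# data="weight" yields (node, node, weight) tuples
weighted_edges = list(G.edges(data="weight"))

# Sort by the third element (the weight), biggest first
ranked = sorted(weighted_edges, key=lambda edge: edge[2], reverse=True)
for u, v, w in ranked:
    print(u, v, w)
```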

Let's separate the edges based on their weights to visualize things better. This shows plot weight much more clearly than our previous graph.
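A sketch of one way to split the edges, assuming a single cut-off value. The `threshold` of 4 and the weights are hypothetical; pick whatever suits your own counts:

```python
import networkx as nx

G = nx.Graph()
G.add_edge("Hamlet", "Horatio", weight=5)   # hypothetical counts
G.add_edge("Hamlet", "King", weight=3)
G.add_edge("King", "Queen", weight=4)

threshold = 4  # hypothetical cut-off between "strong" and "weak" ties

strong = [(u, v) for u, v, w in G.edges(data="weight") if w >= threshold]
weak = [(u, v) for u, v, w in G.edges(data="weight") if w < threshold]

print("strong:", strong)
print("weak:", weak)
```

Each list can later be passed as the `edgelist` argument of `nx.draw_networkx_edges()`, so that strong and weak ties get different line styles.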

And finally let's get the position of the nodes in the network.

# 6. Network Metrics

This time, because we have a new element (weight) let's explore the network before we actually draw it. We do this because we are interested in tracking down the hub of the network (that is, the person with the biggest number of connections). We can create a network in which we assign those values (network degree) to the nodes, and we can quickly see the relationship between plot agency and hub size.

1. Calculating **Network Degree: who has more connections?**

Sort that variable!
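As a rough sketch, the weighted degree can be computed and sorted like this. The graph and its weights are placeholders:

```python
import networkx as nx

G = nx.Graph()
G.add_edge("Hamlet", "Horatio", weight=5)   # hypothetical counts
G.add_edge("Hamlet", "King", weight=3)
G.add_edge("King", "Queen", weight=4)

# weight="weight" sums the edge weights of each node instead of counting edges
degree = dict(G.degree(weight="weight"))

ranked = sorted(degree.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(name, value)
```

The first entry of `ranked` is the hub of this toy network.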

2. Calculating **Betweenness Centrality Scores**: who is the person that connects the most nodes in the network? Sort your values.
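A small sketch using `nx.betweenness_centrality()` on a toy graph in which every path to Horatio has to pass through Hamlet (the edges are an illustration only):

```python
import networkx as nx

G = nx.Graph()
G.add_edge("Hamlet", "Horatio")
G.add_edge("Hamlet", "King")
G.add_edge("Hamlet", "Queen")
G.add_edge("King", "Queen")

scores = nx.betweenness_centrality(G)   # normalized scores between 0 and 1

ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(name, round(value, 3))
```

Hamlet gets the only non-zero score here, because King and Queen can reach each other directly while Horatio can only be reached through Hamlet.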

3. **Communities**: who forms different communities within this network?
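One way to look for communities is greedy modularity maximization. This is an assumption about the method, since the text does not say which algorithm the notebook used, and the grouping below is a made-up toy example:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
# Hypothetical "court" triangle
G.add_edges_from([("King", "Queen"), ("Queen", "Polonius"), ("King", "Polonius")])
# Hypothetical "friends" triangle
G.add_edges_from([("Hamlet", "Horatio"), ("Horatio", "Marcellus"), ("Hamlet", "Marcellus")])
# A single bridge between the two groups
G.add_edge("Hamlet", "King")

communities = greedy_modularity_communities(G)
for number, members in enumerate(communities, start=1):
    print(number, sorted(members))
```

Each community comes back as a frozenset of node names, so `sorted()` is only used here to print the members in a stable order.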

# 7. Network Visualization

And now let's have a look at our network! We can represent our weighted network a) by adding labels to the edges and showing the weight in there, or b) by showing the weight in different node sizes.

# A. Edges weight

# B. Nodes Weight

To draw a network by node weight, we need to know the network degree (**the hub**, and then, in order of importance, who has more weighted connections). Let's print that value again.

Now let's change the colour of the hub of the network to red.

# 8. Saving up our data

Let's transform our network into a Pandas Dataframe. We can use nx.to_pandas_adjacency() to do this. It will return an adjacency matrix where each cell holds the weight of the edge between two characters (the number of lines they exchange), and 0 if there is no interaction.
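A minimal sketch of the conversion, assuming pandas is installed. The weights are placeholders, and the CSV is produced as text here so you can choose your own file name:

```python
import networkx as nx

G = nx.Graph()
G.add_edge("Hamlet", "Horatio", weight=5)   # hypothetical counts
G.add_edge("Hamlet", "King", weight=3)

# Rows and columns are the characters, cells are the edge weights
adjacency = nx.to_pandas_adjacency(G, weight="weight")
print(adjacency)

# Without a path, to_csv() returns the CSV as a string
csv_text = adjacency.to_csv()
```

To write a file instead, pass a path of your choice, for example `adjacency.to_csv("hamlet_network.csv")` (the file name is just a suggestion).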

#### Content from: Excercise 1. Modelling the Unweighted Social Network of Hamlet..ipynb

Let's practice what we have just learnt! Let's follow this article (https://litlab.stanford.edu/LiteraryLabPamphlet2.pdf) and build the social network of **Hamlet** by William Shakespeare. We will be using this version of the book: https://www.gutenberg.org/files/1524/1524-h/1524-h.htm

# 1. First we import the libraries

# 2. Then we create the G object

Just so you know, with this command we can clear our network (for example, if we make a spelling mistake that contaminates our graph).

# 3. Characters

Now we transform every character into a node by writing each name inside **G.add_node()**. Only the main characters are included in here. 

These are the characters of the play (you can find this information at the beginning of the book). Remember to replace "Claudius" with "King" and "Gertrude" with "Queen", as that is how they appear throughout the play.

Dramatis Personæ

* HAMLET, Prince of Denmark
* CLAUDIUS, King of Denmark, Hamlet’s uncle
* The GHOST of the late king, Hamlet’s father
* GERTRUDE, the Queen, Hamlet’s mother, now wife of Claudius
* POLONIUS, Lord Chamberlain
* LAERTES, Son to Polonius
* OPHELIA, Daughter to Polonius
* HORATIO, Friend to Hamlet
* FORTINBRAS, Prince of Norway
* VOLTEMAND, Courtier
* CORNELIUS, Courtier
* ROSENCRANTZ, Courtier
* GUILDENSTERN, Courtier
* MARCELLUS, Officer
* BARNARDO, Officer
* FRANCISCO, a Soldier
* OSRIC, Courtier
* REYNALDO, Servant to Polonius
* Players
* A Gentleman, Courtier
* A Priest
* Two Clowns, Grave-diggers
* A Captain
* English Ambassadors.
* Lords, Ladies, Officers, Soldiers, Sailors, Messengers, and Attendants

# 4. Textual Interactions

Then we count (old-school style, by reading the book) who is talking to whom, and we write that down in **G.add_edge()**. If we make a mistake and accidentally record the same pair of characters twice, it doesn't matter: the networkx library will only take into account one edge per pair of nodes.

# 5. Checking the structure of our network

Now let's have a look at the number of nodes that we have. Use the G.number_of_nodes() method and then transform G.nodes into a list.

Let's do the same with the edges.

# 6. Network Visualization

And now let's have a look at our network!

# 7. Network metrics

Look at which characters appear in the center of the network. Let's try to discover who is the **hub**: the node of the network with the highest number of connections.

1. Calculating **Network Degree**: who has more connections?

2. Calculating **Betweenness Centrality Scores**: who is the person that connects the most nodes in the network?

3. **Communities**: who forms different communities within this network?

And then we can check whether there are some narrative sub-groups that tend to interact more with each other, and we do indeed observe four different communities.

# 8. Saving up our data

Let's transform our network into a Pandas Dataframe. We can use nx.to_pandas_adjacency() to do this. It will return an adjacency matrix where each cell holds 1 if there is an interaction between two characters, and 0 if there is no interaction.

#### Content from: Ego Centric Networks. Exercise.ipynb

# Ego-Centric Networks Exercise

Let's practice Ego-Centric Networks using Hamlet! Let's zoom in on the Ego-Centric Network of Hamlet himself.

# 1. We import the libraries

And let's not forget to create G.

Reminder: how to clear our network if we need to

# 2. Hamlet

Book in hand, let's count how many times Hamlet talks and with whom (we can reuse our previous Exercise 2 weighted social network script).
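If you already have the full weighted network from Exercise 2, one shortcut is `nx.ego_graph()`, which keeps only a chosen node and its direct neighbours. This is an alternative to typing the nodes and edges again, not necessarily what the original notebook does, and the weights below are placeholders:

```python
import networkx as nx

# Stand-in for the socio-centric network built in Exercise 2 (hypothetical weights)
G = nx.Graph()
G.add_edge("Hamlet", "Horatio", weight=5)
G.add_edge("Hamlet", "King", weight=3)
G.add_edge("King", "Queen", weight=4)
G.add_edge("Queen", "Polonius", weight=2)

# Keep Hamlet plus everyone he talks to directly
ego = nx.ego_graph(G, "Hamlet")

print(list(ego.nodes))
print(list(ego.edges(data="weight")))
```

Edges between two of Hamlet's neighbours are also kept if they exist in the original graph, so remove them afterwards if you only want Hamlet's own conversations.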

#### Nodes

#### Edges

# 3. We check the structure of our Network

Let's check the number of nodes

And now let's check the number of edges

Let's check the weight of the edges

# 4. Network Metrics

1. Calculating **Network Degree** (the hub of the network)

The hub of the network is **Hamlet** because this is an Ego-Centric Network!

# 5. Network Visualization

First of all let's define the position of the nodes in the network.

Let's now use Network Degree

Voilà! There we have our network. We can now proceed to do some cool visualizations. For example, we can use colours to bring attention to the hub of the network (Hamlet), indicate the nodes with the second and third highest weighted degree (Horatio and Polonius), and change the colour of the other nodes in the network.

# And now you are a total expert on Networks ;)

#### Content from: Ego Centric Networks. .ipynb

# Ego-Centric Networks vs Socio-Centric Networks

In our previous data analysis (weighted vs unweighted networks), we have been building a type of network called a socio-centric network, where we seek to analyze the social structure of a given community (characters in movies/novels/theatre plays). However, there is another type of network called an ego-centric network, which uses an individual as its object of study. If you would like to know more about both, this is an excellent article: https://bebr.ufl.edu/sites/default/files/SNA_Encyclopedia_Entry_0.pdf

Let's use **Around the World in 80 Days** and model one ego-centric network: that of Phileas Fogg.

# 1. We import the libraries

And let's not forget to create G!

# 2. Phileas Fogg

Book in hand, let's count how many times Phileas Fogg talks and with whom.

#### Nodes

#### Edges

# 3. We check the structure of our Network

Let's check the number of nodes

And now let's check the number of edges

Let's check the weight of the edges

# 4. Network Metrics

1. Calculating **Network Degree** (the hub of the network)

The hub of the network is **Phileas Fogg** because this is an Ego-Centric Network!

# 5. Network Visualization

First of all let's define the position of the nodes in the network.

Let's now use Network Degree

Voilà! There we have our network. We can now proceed to do some cool visualizations. For example, we can use colours to bring attention to the hub of the network (Phileas Fogg), indicate the nodes with the second and third highest weighted degree (Jean Passepartout and Auda), and change the colour of the other nodes in the network.

# Exercise Ego Centric Networks

#### Content from: 2. Weighted Sociocentric Networks..ipynb

**References**: the majority of the scripts in this notebook come from adapting the previous one and asking questions to the language model **Perplexity AI** (for example: https://www.perplexity.ai/search/1fc28433-34c7-4240-a394-459ecb1475e1?s=u), which is a great way to teach yourself how to write code!

# Modeling a Weighted Network of Around the World in Eighty Days

Now that we have calculated the social network (who knows whom) of **Around the World in Eighty Days**, let's go one step further and calculate **how many times the characters talk to each other**. So, book in hand, this time let's count not only whether one character talks to another, but also how many times they do it. We will add one extra value to each edge called "weight", where we add 1 every time there is a textual interaction.

This is called a **weighted graph** (if you are interested in reading more, this is a good place to start: https://www.geeksforgeeks.org/applications-advantages-and-disadvantages-of-weighted-graph/). It adds extra information to unweighted graphs (like the one that we just built in the previous notebook). Let's see how these results either complement or contradict our previous ones!

# 1. We import the libraries

Then we create G (that we will use during all this notebook).

Remember that with this command we can clear our network.

# 2. Characters

Now we transform every character into a node by writing each name inside **G.add_node()**, as we did before.

# 2. Textual Interactions

Now we add one new variable, "weight = X", where we include the number of times there is a dialogue between two characters. So: old-school style, we grab the script and, with lots of patience, we just count dialogues (specifically, we count the number of lines in each dialogue). For example:

* **Phileas Fogg**: Good morning Passepartout!
* **Passepartout**: Good morning sir!

That would be represented as: 

**G.add_edge("Phileas Fogg", "Jean Passepartout", weight = 2)** 
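If you prefer to let Python do the adding up, here is a small sketch. It assumes you note each line of dialogue as a (speaker, listener) pair; the `dialogue_lines` list is a hypothetical stand-in for your own notes:

```python
import networkx as nx

# One entry per line of dialogue: (who speaks, who is spoken to)
dialogue_lines = [
    ("Phileas Fogg", "Jean Passepartout"),   # "Good morning Passepartout!"
    ("Jean Passepartout", "Phileas Fogg"),   # "Good morning sir!"
]

G = nx.Graph()
for speaker, listener in dialogue_lines:
    if G.has_edge(speaker, listener):
        G[speaker][listener]["weight"] += 1   # one more line between this pair
    else:
        G.add_edge(speaker, listener, weight=1)

print(G["Phileas Fogg"]["Jean Passepartout"]["weight"])
```

Because the graph is undirected, a line from Passepartout to Fogg lands on the same edge as a line from Fogg to Passepartout, which gives the weight of 2 from the example above.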

# 3. Checking the structure of our network

Now let's have a look at the number of nodes that we have.

Let's do the same with the edges.

And now let's check our weighted edges

Let's sort that list to see which pair talks the most!

Unsurprisingly, it is Jean Passepartout and Fix, followed by Phileas Fogg and Passepartout. Let's separate the edges based on their weights to visualize things better. This shows plot weight much more clearly than our previous graph.

And finally, let's define the position of the nodes in the network. 

# 4. Network metrics

This time, because we have a new element (**weight**) let's explore the network before we actually draw it. We do this because we are interested in tracking down **the hub of the network** (that is, the person with the biggest number of connections). We can create a network in which we assign those values (network degree) to the nodes, and we can quickly see the relationship between plot agency and hub size.

1. Calculating **Network Degree**: who has more connections?

So: **the network hub is Phileas Fogg!** He is the character that has the biggest number of connections, followed by Fix and Passepartout. These three characters are the ones that have by far the highest amount of text weight. Weighted Networks then add extra value to unweighted ones: now we can analyze **plot character agency**. 

2. Calculating **Betweenness Centrality Scores**: who is the person that connects the most nodes in the network?

So now that goes to Passepartout!

3. **Communities**: who forms different communities within this network?

And now we can see different (and more coherent) clusters. We observe one Jean Passepartout cluster, one Phileas Fogg one, one English gentlemen one, and then two "small weight characters" ones. 

# 5. Network Visualization

And now let's have a look at our network! We can represent our weighted network a) by adding labels to the edges and showing the weight in there, or b) by showing the weight in different node sizes. 

# A. Edges weight

# B. Nodes Weight

To draw a network by node weight, we need to know the network degree (**the hub**, and then, in order of importance, who has more weighted connections). Let's print that value again.

We can also add different colours to specific nodes. For example, if we would like our **Hub** (Phileas Fogg) to stand out from the rest, we can colour him differently by adding a node_color argument.

We can repeat that process as many times as we would like to, by just adding more colours. Let's add two more hubs: **Jean Passepartout** and **Fix**.
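A sketch of how the colour and size lists could be prepared. The weights, the colour choices and the size factor of 100 are all hypothetical, and the drawing call is left as a comment because it needs matplotlib:

```python
import networkx as nx

G = nx.Graph()
G.add_edge("Phileas Fogg", "Jean Passepartout", weight=12)   # hypothetical counts
G.add_edge("Jean Passepartout", "Fix", weight=8)
G.add_edge("Phileas Fogg", "Sir Francis Cromarty", weight=4)
G.add_edge("Phileas Fogg", "Fix", weight=6)

degree = dict(G.degree(weight="weight"))

# Colours for the nodes we want to highlight; every other node gets a default
highlight = {"Phileas Fogg": "red", "Jean Passepartout": "orange", "Fix": "gold"}
node_colors = [highlight.get(node, "lightblue") for node in G.nodes]

# Bigger weighted degree means a bigger circle
node_sizes = [degree[node] * 100 for node in G.nodes]

# In a notebook with matplotlib installed:
# nx.draw(G, with_labels=True, node_color=node_colors, node_size=node_sizes)
```

Both lists follow the order of `G.nodes`, which is what `nx.draw()` expects when it receives one colour or size per node.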

# 6. Conclusion

This network provides much more fine-grained information than our previous one.

# 7. Saving up our data

Let's transform our network into a Pandas Dataframe. We can use nx.to_pandas_adjacency() to do this. It will return an adjacency matrix where each cell holds the weight of the edge between two characters (the number of lines they exchange), and 0 if there is no interaction.

#### Content from: 1. Unweighted Sociocentric Networks. Modelling the Social Network of Around the World in 80 Days.ipynb

**References**: the method of building a network using characters as nodes and textual interactions as edges is inspired by this article (https://litlab.stanford.edu/LiteraryLabPamphlet2.pdf). There the text used is a play (and therefore it is easier to model networks), while we are using a novel, but the idea is the same. Some of the scripts have been adapted from this tutorial (https://melaniewalsh.github.io/Intro-Cultural-Analytics/06-Network-Analysis/02-Making-Network-Viz-with-Bokeh.html), and also from this other one (https://networkx.org/documentation/stable/tutorial.html).

# Modelling the Social Network of **Around the World in 80 Days**.

In this notebook, we are going to use **Network Analysis** to model the social network of **Around the World in Eighty Days**. We are going to count **"who knows whom"**. We are going to transform each character into a **node**. After that, we are going to record every pair of characters that talk to each other, and we are going to call that an **edge**. This method, while not perfect (it only records each textual interaction once, and therefore we don't know what is being said, or how much each character speaks), can nevertheless be useful as a first approximation for empirically measuring changing narrative weights by identifying **hubs** (which essentially means nodes with lots of connections: https://en.wikipedia.org/wiki/Hub_(network_science)).

# 1. First we import the libraries

# 2. Then we create the G object

Just so you know, with this command we can clear our network (for example, if we make a spelling mistake that contaminates our graph).

# 3. Characters

Now we transform every character into a node by writing each name inside **G.add_node()**. Only the main characters are included in here. 

# 4. Textual Interactions

Then we count (old-school style, by reading the book) who is talking to whom, and we write that down in **G.add_edge()**. If we make a mistake and accidentally record the same pair of characters twice, it doesn't matter: the networkx library will only take into account one edge per pair of nodes.

# 5. Checking the structure of our network

Now let's have a look at the number of nodes that we have.

Let's do the same with the edges.

# 6. Network Visualization

And now let's have a look at our network!

# 7. Network metrics

Look at which characters appear in the center of the network. Let's try to discover who is the **hub**: the node of the network with the highest number of connections.

1. Calculating **Network Degree**: who has more connections?

So: the network hub is Phileas Fogg! He is the character that has the biggest number of connections, followed by Jean Passepartout. This could be considered as a preliminary metric of **plot character agency**, showing that in terms of who knows whom, these two characters are the most popular ones in the story.

2. Calculating **Betweenness Centrality Scores**: who is the person that connects the most nodes in the network?

This is another metric to determine who can put more people in touch within the network. That position goes to Phileas Fogg, and by far!

3. **Communities**: who forms different communities within this network?

And then we can check whether there are some narrative sub-groups that tend to interact more with each other, and we do indeed observe four different communities.

# 8. Conclusion

While our network does not give any information about the content of the interactions between characters, or about how much they talk to each other (amount of text), it is a preliminary approach that shows how the hub of the network is Phileas Fogg, followed by Jean Passepartout, and therefore, these two characters direct the narrative. 

# 9. Saving up our data

Let's transform our network into a Pandas Dataframe. We can use nx.to_pandas_adjacency() to do this. It will return an adjacency matrix where each cell holds 1 if there is an interaction between two characters, and 0 if there is no interaction.

# Exercise 1

#### Content from: Harry Potter around the World.ipynb

# Mapping the world of Harry Potter

**Script source:** several queries to Perplexity AI!

Harry Potter is one of the most translated (and popular) books around the world and it is available in 85 languages! (https://en.wikipedia.org/wiki/List_of_Harry_Potter_translations)

Let's write a Python script to visualize this geospatially!

# 1. We import the libraries

# 2. We create a variable with the capitals of the countries where Harry Potter has been translated

In the real world you would need to do this step yourself! How would you do this using Python?

Also: every time there is more than one language in a country (e.g. South Africa: English and Afrikaans), I have used two cities in that country (e.g. Pretoria and Cape Town) to show linguistic diversity!

# 3. We create a list with the Latitude and Longitude of those cities using Geopy

Let's first practice getting the lat and lon of 3 English speaking main cities: London, Dublin, and New York City. 

Now let's do that for all the cities in our list!

# 4. Pandas Data Frame

Now let's create a Pandas Dataframe that contains our cities and their lat and lon

First let's create a column with the names of the cities

Now create two variables: one for latitude and one for longitude

And now let's add those columns to our data frame
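A minimal sketch of these steps for the three practice cities. The coordinates are typed in by hand here (rounded decimal values) rather than fetched with Geopy, and the column names `city`, `lat` and `lon` are just one possible choice:

```python
import pandas as pd

cities = ["London", "Dublin", "New York City"]
latitudes = [51.5072, 53.3498, 40.7128]      # N is positive
longitudes = [-0.1276, -6.2603, -74.0060]    # W is negative

# First a data frame with only the city names...
df = pd.DataFrame({"city": cities})

# ...then the two coordinate columns
df["lat"] = latitudes
df["lon"] = longitudes

print(df)
```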

# 5. And now let's visualize things!

Change the colour to red (for Gryffindor!)

And now change the colour to green (for Slytherin!)

#### Content from: Geospatial Analysis.ipynb

# Using Geospatial Analysis to visually analyze Travel Literature!

Geospatial Analysis can be a great tool to help us dig into the textual analysis of literary texts. This can be particularly useful if we want to add extra layers of analysis to some genres such as **Travel Literature**. In this notebook we are going to explore how to use the Python library Plotly: https://plotly.com/python/getting-started/

**Sources:** the majority of the scripts in this notebook come from these Plotly pages: https://plotly.com/python/mapbox-layers/, https://plotly.com/python/scatter-plots-on-maps/, https://plotly.com/python/reference/scattermapbox/#scattermapbox-marker-symbol. For more advanced scripts about geospatial data science, this is an excellent course: https://github.com/suneman/socialdata2023.

# 1. We import the libraries

# 2. We manually inspect our city dataset

All Digital Humanities projects involve some degree of close reading analysis. We need to inspect our "GPE_aroundtheworld.txt" file and decide which cities we are going to include in our selection! (You will see that there is a considerable amount of noise even when using Spacy, and that some place names are contemporary to the age of Jules Verne but have changed since.)

# 3. We create our GPS dataset.

To be able to map our cities, we need to extensively google the Latitude and Longitude of all of them, and manually annotate the results in several lists (as we will need to create a CSV dataframe to be able to plot things in maps with Plotly).

Be aware that:

**GPS Lat-Long signs: N+, S-, W-, E+.**

For example:

* Rio de Janeiro: 22.9068° S, 43.1729° W (-22.9068, -43.1729)
* London: 51.5072° N, 0.1276° W (51.5072, -0.1276)
* Stockholm: 59.3293° N, 18.0686° E (59.3293, 18.0686)
* Sydney: 33.8688° S, 151.2093° E (-33.8688, 151.2093)
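The sign rule can also be written as a tiny helper. `to_signed` is a hypothetical function made up for this illustration, not part of any library:

```python
def to_signed(value: float, hemisphere: str) -> float:
    """Turn a coordinate with a hemisphere letter into a signed decimal.

    N and E stay positive, S and W become negative.
    """
    hemisphere = hemisphere.upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"Unknown hemisphere: {hemisphere!r}")
    return -value if hemisphere in {"S", "W"} else value


rio = (to_signed(22.9068, "S"), to_signed(43.1729, "W"))
london = (to_signed(51.5072, "N"), to_signed(0.1276, "W"))
stockholm = (to_signed(59.3293, "N"), to_signed(18.0686, "E"))

print(rio, london, stockholm)
```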

# Activity for you

Please google "Lat Long decimal" and add the coordinates of **Denver, Bloomington (Indiana), Sacramento**. Add the latitude, the longitude, and the country to the corresponding lists. Remember to remove the dots (they are just there to show you where to write things) and to write the closing bracket of the list! Once you are finished, run the scripts and you will automatically have a Pandas dataframe with all the information!

# Geopy

And now let's try another python library called GEOPY that will tell us the coordinates of our cities! If you are curious, you can read the documentation in here: https://geopy.readthedocs.io/en/stable/. For a faster tutorial you can have a look at https://pypi.org/project/geopy/

Let's scale that to our full dataset of cities (so: if we have a file with all the GPE locations, we feed it into this script and it will be super fast!)

When we get a `None` result, it means that geopy does not know where that city is.

# 5. And now we visualize things!

Let's first try this map.

##### A. Mapbox Maps

Mapbox maps are also called tile-based maps and they allow you to zoom in "google maps" style. For more information have a look at: https://plotly.com/python/mapbox-layers/

#### Activity for you

Change the color_discrete_sequence = [] argument from "fuchsia" to "green". You can try other colours!

##### Activity for you

Move your mouse to the top right corner of the map and click on the camera icon, where it says "Download plot as PNG". You will be able to download your map to your own laptop!

##### B. Geo maps

Geo Maps only show the physical boundaries of countries. Have a look at: https://plotly.com/python/map-configuration/

Which one do you like the most?

# Exercise 1

#### Content from: Mapping Jules Verne. NER with Spacy.ipynb

Now that we have done things at the chapter level, let's do it at the book level! Let's focus on geographically mapping the world of Jules Verne by extracting the GPE and LOC entities of **Around the World in 80 Days**.

# 1. We import our libraries

# 2. We get our data

This data has not been cleaned and pre-processed to avoid confusing the parser (only \r\n characters have been removed!)

# 3. We import the English pipeline

# 4. We create the Spacy nlp object

# 5. We inspect the English model labels

Let's remember the entities that we have in Spacy:

# 6. We print the entities

# 7. We create one list with GPE 

While LOC is possibly a label that contains interesting information, as this is an introductory DH course, let's just focus on GPE!

Now let's drop the duplicates in there!
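One simple way to drop duplicates while keeping the order in which the places first appear in the book is `dict.fromkeys()`. The `gpe` list below is a hypothetical stand-in for the list created in the previous step:

```python
# Hypothetical extraction output: the same place can show up many times
gpe = ["London", "Paris", "London", "Suez", "Paris"]

# Dictionary keys are unique and keep their insertion order
unique_gpe = list(dict.fromkeys(gpe))

print(unique_gpe)
```

Using `set(gpe)` also removes duplicates, but it does not keep the original order.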

Let's save our values!

# Exercise 3

#### Content from: Information Extraction.ipynb

# Information Extraction: NLTK and Spacy

Script Sources:

* **NLTK**: Tsilimos, Maria. Python: Introduction to Natural Language Processing (NLP). IT Central, University of Zurich.
* **Spacy**: https://spacy.io/usage/spacy-101

**Information Extraction (IE)** consists of transforming **Natural Language unstructured data** (written or spoken) into **structured data** ready to be used by machines.

In this notebook we are going to learn two different IE methods: **Part of Speech Tagging (POS)** and **Named Entity Recognition (NER)**.

There are many excellent Python libraries out there to write scripts that will allow us to do both things. In this notebook we will learn how to use **NLTK** and **Spacy** and understand the advantages and disadvantages of both!

# 1. Importing our data

Let's begin by using the first chapter of **Around the World in Eighty Days** by Jules Verne.

If you remember, in the previous chapter we did 4 steps of cleaning and pre-processing:

* Tokenization
* Lowercasing
* Removing Punctuation
* Removing Stopwords

Now **we are not going to do any of those things**. We need to do **POS tagging**, and for that, it is necessary to keep punctuation and stopwords to avoid confusing the parser. 

The only thing that we are going to remove are the noisy characters "\r\n".

For that, we are going to use this script: **re.sub(r"\r\n", " ", data)** (in case you want to replicate it on your own dataset).

For efficiency purposes a clean first chapter has been created for you with that process already incorporated.
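If you want to try that substitution yourself, here is a minimal sketch. The `data` string is a short made-up sample with one Windows-style line break in it:

```python
import re

# A short sample with a "\r\n" line break in the middle of the sentence
data = "Mr. Phileas Fogg lived, in 1872,\r\nat No. 7, Saville Row."

# Replace every "\r\n" with a single space; everything else stays untouched
clean = re.sub(r"\r\n", " ", data)

print(clean)
```

Punctuation, capital letters and stopwords are all still there, which is what the POS tagger needs.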

# 2. Understanding Information Extraction Architecture: NLTK

#### A. We import the libraries

#### B. We initialize the Information Extraction Pipeline:

1. Sentence Segmentation
2. Tokenization
3. POS Tagging
4. Chunking
5. NER

##### 1. Sentence Segmentation

##### 2. Tokenization

##### 3. POS Tagging

##### 4. Chunking and NER

##### Chunking

##### NER

And now let's transform that into a list!

Source = https://nanonets.com/blog/named-entity-recognition-with-nltk-and-spacy/

That looks good so far! Let's now check **Geopolitical Entities (GPE)**

That also looks quite good! However we observe some **issues**: is American or Londoner a person or a GPE?

# Exercise 1

# Spacy

And now let's try Spacy. Spacy does not follow the same architecture as NLTK: we don't need to follow the 4-step pipeline (sentence segmentation, tokenization, POS tagging, NER chunking). All of that is implemented in their code! Have a look at: https://spacy.io/usage/linguistic-features#named-entities

You may need to install the Spacy pipeline. If so, remove the # symbol in the following cells.

Let's first have a look at the existing Entity Labels

We have a winner!

# Exercise 2

#### Content from: Exercises Information Extraction.ipynb

And now let's practice what we have just learnt but now with a multilingual text!

Script Sources:

* **NLTK**: Tsilimos, Maria. Python: Introduction to Natural Language Processing (NLP). IT Central, University of Zurich.
* **Spacy**: https://spacy.io/usage/spacy-101

# Exercise 1: replicating the NLTK IE architecture with the first chapter of Twenty Thousand Leagues Under the Sea

##### A. We import our data

The second chapter of **Around the World in 80 days** has been created for you (without being cleaned and pre-processed, yet without \r\n characters). Write some code to open it!

(P.S. Again, if you want to replicate the code for your own exercises, run the following script: import re; data = re.sub(r"\r\n", " ", data)).

##### B. We import the libraries

##### C. Sentence Segmentation

##### D. Tokenization

##### E. POS Tagging

##### F. Chunking and NER

##### Chunking

Try extracting a sentence that you like. 

Now create a tree out of that!

##### G. Transforming that into a list and creating three different lists: 1. Person, 2. Organization, 3. GPE 

1. Creating a list

2. Creating a person list

3. Creating a GPE list

4. Creating an organization list
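The four steps above can be sketched in plain Python, assuming the entities have already been flattened into (text, label) pairs. The sample `entities` list and the helper name `entities_with_label` are illustrative assumptions and do not come from the original notebook.

```python
# Hypothetical sample: (entity text, entity label) pairs, as one might
# obtain after flattening the chunked NLTK tree into a list.
entities = [
    ("Phileas Fogg", "PERSON"),
    ("London", "GPE"),
    ("Reform Club", "ORGANIZATION"),
    ("Passepartout", "PERSON"),
    ("Paris", "GPE"),
]


def entities_with_label(pairs, label):
    """Return the entity texts whose label equals `label`, in order."""
    return [text for text, entity_label in pairs if entity_label == label]


persons = entities_with_label(entities, "PERSON")
gpes = entities_with_label(entities, "GPE")
organizations = entities_with_label(entities, "ORGANIZATION")

print(persons)
print(gpes)
print(organizations)
```

The same helper works for any label, so a fourth list only needs one more call.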

# Exercise 2: Spacy

Now let's repeat the exercise with Spacy to compare the performance of both.

##### A. We import the libraries

##### B. We download the French SPACY pipeline and we inspect the entity labels

You may need to do this (remove the # symbol)

##### C. We initialize the NLP object

##### D. We create a list with the entities

##### E. We create three lists: one with person (PERSON), one with Geopolitical Entities (GPE), one with Organization (ORG).

So: once again we see that Spacy really outperforms NLTK!

#### Content from: PDF text extraction.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

Script source: many queries to Perplexity AI (https://www.perplexity.ai/)

# 1. We import the libraries

We install the library

### Example: PDF scraping

In [None]:
# Install the PyPDF2 library (only needed once per environment).

pip install PyPDF2

And then we import it

### Example: PDF scraping

In [None]:
# Import PyPDF2 so that we can read PDF files.

import PyPDF2

We also import pandas

And the operating system library (os)

# 2. We extract the text from the PDF

Let's select the PDF 2412.18779 that we have in our directory

### Example: PDF scraping

In [None]:
# Open one PDF and collect the text of all its pages in a single string.

#Create empty list
text = []

# Open the PDF file
with open('2412.18779.pdf', 'rb') as file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(file)

    # Get the number of pages in the PDF
    num_pages = len(pdf_reader.pages)

    # Initialize an empty string to store the extracted text
    extracted_text = ''

    # Loop through each page and extract the text
    for page_num in range(num_pages):
        page = pdf_reader.pages[page_num]
        extracted_text += page.extract_text()

# 3. We do that with all our files

I have run a new query in the ArXiv notebook (check the notebook in this same folder!) using the term **Facebook**. Let's locate the directory.

Now let's open it

### Example: PDF scraping

In [None]:
# Set the folder that holds the PDFs and list the files inside it.

# Set the directory path where the PDF files are located
pdf_dir = 'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 2\\PDF extraction\\arxiv_pdfs_facebook'

# Get a list of all files in the directory
all_files = os.listdir(pdf_dir)

There we have our files!

Now let's extract all the text inside them.

### Example: PDF scraping

In [None]:
# Extract the text of every PDF in the folder and store it with its file name.

# Create an empty list to store the results
extracted_texts = []

# Loop through each PDF file
for file in all_files:
    file_path = os.path.join(pdf_dir, file)  # Build the full path to the file
    with open(file_path, 'rb') as pdf_file:
        # Create a PDF reader object
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Get the number of pages in the PDF
        num_pages = len(pdf_reader.pages)

        # Initialize an empty string to store the extracted text
        extracted_text = ''

        # Loop through each page and extract the text
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            extracted_text += page.extract_text()
    
    # Add the extracted text to the list
    extracted_texts.append(['Doc ' + file, extracted_text])

# 4. We save that into our computer

First we extract the headers

And now we extract the text of the articles

Now we create a Pandas DataFrame

And finally we save it into our laptop
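A minimal sketch of these four steps, assuming `extracted_texts` is the list of `[header, text]` pairs built in section 3 and that pandas is installed. The sample rows, the column names `Title` and `Text`, and the output file name are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the list built in section 3:
# each item is [header, extracted text].
extracted_texts = [
    ["Doc first.pdf", "Text of the first article."],
    ["Doc second.pdf", "Text of the second article."],
]

# Split the pairs into one list of headers and one list of texts.
headers = [item[0] for item in extracted_texts]
texts = [item[1] for item in extracted_texts]

# Build the DataFrame with one column per list.
df = pd.DataFrame({"Title": headers, "Text": texts})

# Save it as a CSV file (a temporary folder is used here; the name is an assumption).
csv_path = os.path.join(tempfile.mkdtemp(), "arxiv_pdfs.csv")
df.to_csv(csv_path, index=False)

# Read it back to confirm that nothing was lost.
df_check = pd.read_csv(csv_path)
print(df_check.shape)
```

Passing `index=False` keeps the row numbers out of the file, which avoids an extra unnamed column when the CSV is read again.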

# EXERCISE

Now let's create a new corpus of PDFs from ArXiv to use in our future data analysis

* In this same folder you will find a notebook containing code to do an ArXiv query. So far we have used the key terms "twitter" and "facebook". Try doing a new one using a key term that is interesting to you. If you would like to use two words (Mark Zuckerberg, Climate Change, Donald Trump...) use this syntax: '"climate change"' (quotations inside quotations).

* Now that you have your data, **duplicate this notebook** to have an extra copy of your code. Call the new version "PDF text extraction EXERCISE".

* Once you have duplicated your notebook and acquired your data, first extract the text of one PDF (just like we do in here). Remember to change the name to the new file!

* Then repeat the process with the whole folder. Remember to: 
    * Change the name of the folder in pdf_dir
    * Change the name of the csv file that stores your data so that you do not overwrite your previous data.

#### Content from: Exercise Getting Data from Arxiv-2.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

# Accessing *ArXiv*

## Elena Fernández Fernández

Let's use **again** the ArXiv API to do a new query: https://pypi.org/project/arxiv/

# ArXiv API

First let's import the libraries

Let's build a query

And let's extract the 50 most recent Twitter ArXiv articles

Ok, so now let's first have a look at the titles. 

And now, if we compare that with the actual ArXiv website, it looks correct: https://arxiv.org/search/cs?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50

Now let's begin saving the first pdf in that list in our laptop.

### Example: PDF scraping

In [None]:
# Look up one paper by its ArXiv ID and download its PDF.

paper = next(arxiv.Client().results(arxiv.Search(id_list = ["2412.18779"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()

### Example: PDF scraping

In [None]:
# Download the same paper again, this time choosing the file name.

# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename = "2412.18779.pdf")

And now we need to create a folder in our directory. Remember that we can also do that using bash commands here in Jupyter Notebooks. 

### Example: PDF scraping

In [None]:
# Download the paper into a specific folder, again with a custom file name.

# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath = "C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 2\\PDF extraction\\files_arxiv",
                    filename = "2412.18779.pdf")

So far I have been using just the code provided by the Arxiv API to do all those things (https://pypi.org/project/arxiv/). Now let's go to the next level and let's extract a bunch of articles. Looking at the API it looks like we need to extract the IDs of the papers. Let's do that!

What we need is just the ID number of the paper. Let's select that.
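One way to do this can be sketched without calling the API. Each result returned by the `arxiv` package has an `entry_id` of the form `http://arxiv.org/abs/<id>`, so the ID is the part that follows `/abs/`. The sample URLs and the helper name `short_id` are illustrative assumptions; the list name `ids_2` is the one used by the download loop in this notebook.

```python
# Hypothetical sample of entry_id values, in the form returned by the arxiv package.
entry_ids = [
    "http://arxiv.org/abs/2412.18779v1",
    "http://arxiv.org/abs/2406.12444v1",
]


def short_id(entry_id):
    """Return the part of an ArXiv entry URL that follows '/abs/'."""
    return entry_id.split("/abs/")[-1]


ids_2 = [short_id(entry_id) for entry_id in entry_ids]
print(ids_2)
```

Splitting on `/abs/` rather than on the last slash also keeps older identifiers that contain a slash (for example `hep-th/9901001v1`) in one piece.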

And now let's loop over that to get all the PDFs onto our laptop. First let's create a new folder called "arxiv_pdfs_facebook"

# IMPORTANT

REMEMBER to create a new folder for the new query that you are going to do to create your own PDF database. Remember to change it in the filename variable too

And now let's get all the pdfs

### Example: PDF scraping

In [None]:
# Download the PDF of every paper in ids_2, pausing between requests.

for paper_id in ids_2:
    # Search for the article with the given ID
    search = arxiv.Search(id_list=[paper_id])
    paper = next(client.results(search))

    # Download the PDF into the "arxiv_pdfs_facebook" folder, using the ID as the file name
    filename = os.path.join("arxiv_pdfs_facebook", urllib.parse.quote(paper_id))
    paper.download_pdf(filename=f"{filename}.pdf")

    time.sleep(3)  # Wait 3 seconds between requests, as the ArXiv API asks

And voilà! According to the documentation of the arxiv package (https://pypi.org/project/arxiv/1.4.8/), the API's limit is 300,000 results per query: that is a lot!

#### Content from: 5. Body.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

Now that we have our final dataframe, we still need to do some cleaning and preprocessing of our articles text. Let's do that!

# 1. We import the libraries

# 2. We get the data

# 3. We select the text

Checking if there are some float numbers (nan) that stand for missing data

This is happening because we selected more metadata than proper articles (due to the 5000 download limit restrictions for full articles). So, there are some missing articles in there. Let's get rid of them!
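Checking for missing values and dropping the affected rows can be sketched as follows, assuming pandas and NumPy are installed. The column names `ID` and `Text` and the sample rows are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second row has metadata but no article text.
df = pd.DataFrame(
    {
        "ID": ["A1", "A2", "A3"],
        "Text": ["First article.", np.nan, "Third article."],
    }
)

# Count the missing values in the text column.
print(df["Text"].isna().sum())

# Keep only the rows that have text, then renumber the index.
df_clean = df.dropna(subset=["Text"]).reset_index(drop=True)
print(len(df_clean))
```

Using `subset` restricts the check to the text column, so rows with gaps in other columns are kept.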

We have a clean dataframe! Let's go back to the body part

# 4. We clean and pre-process

Time to do some cleaning

We can see that there is a rebel \' character that has survived our cleaning function. Let's get rid of that!

Now let's change our column in the csv dataframe

# 5. Saving our data

And now we are ready to save our super clean dataframe for future Text Data Mining analysis!

#### Content from: 4. Merging Dataframes.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

As we will see in the next Jupyter Notebook (5. Body), to be able to clean and pre-process the body we need to drop the rows of our dataframe that have missing data. To simplify that process, let's now merge both dataframes before we proceed to cleaning the body of the articles.

# 1. We import our libraries

# 2. We get our data

First we get the metadata 

Now let's change the name of the column "Gale Document Number" to ID to be able to merge dataframes in just a second

And now we get the titles and the unclean body

# 3. Let's merge dataframes

Now let's merge both dataframes using the ID column on both of them
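The renaming and merging steps can be sketched with pandas as follows. The sample rows, the `Body` column name, and the choice of `how="left"` are assumptions made for illustration; the column names "Gale Document Number", "Document Title", "Title" and "ID" are the ones mentioned in this notebook.

```python
import pandas as pd

# Hypothetical metadata and article tables; all values are made up.
metadata = pd.DataFrame(
    {
        "Gale Document Number": ["CS1", "CS2", "CS3"],
        "Document Title": ["Title one", "Title two", "Title three"],
    }
)
articles = pd.DataFrame(
    {
        "ID": ["CS1", "CS3"],
        "Title": ["Title one", "Title three"],
        "Body": ["Text one", "Text three"],
    }
)

# Rename the key column so that both tables share the name "ID".
metadata = metadata.rename(columns={"Gale Document Number": "ID"})

# A left merge keeps every metadata row; articles that are missing get NaN.
merged = metadata.merge(articles, on="ID", how="left")
print(merged)
```

With a left merge, metadata rows without a matching article survive with empty text, which is the kind of row that has to be dropped before cleaning the body.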

# 4. Cleaning new Dataframe

If we want to make sure that the merge was done correctly, we can compare the "Document Title" column from the metadata dataframe with the "Title" column from the articles dataframe. That being said: let's clean this dataframe a little bit and get rid of the columns Publisher, Subject, and Language. Let's keep the Title one (we can drop it later on if it is not useful for us).

# 5. Saving our data

And now let's save our data into a csv file

#### Content from: 3. Headers.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

Now let's begin by organizing (AKA cleaning and pre-processing) the titles (headers) of our articles.

# 1. We import the libraries

# 2. We get the data

# 3. We split the title to get the CS identifier

The way in which we are going to be able to match data (titles and articles) with metadata is by doing a match between the CS identifier in both dataframes. So: we need to extract that from the titles of the articles in here.

First we split things by "CS" (an alternative way would be to do this using regex but it's much more complicated)

And now we need to add CS again to make sure that we can later on concatenate it with the Metadata.

And now we need to get rid of the final .txt to be able to later on match things with the metadata dataframe
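The three steps above (split on "CS", add "CS" back, drop ".txt") can be sketched as follows. The file titles are hypothetical, since the exact naming pattern of the Gale files is not shown here.

```python
# Hypothetical file titles; the real Gale file names may look different.
titles = [
    "A Royal Visit_CS117834541.txt",
    "Shipping News_CS203456789.txt",
]

article_ids = []
for title in titles:
    # 1. Split on "CS" and keep what follows the last occurrence.
    tail = title.split("CS")[-1]
    # 2. Put "CS" back in front so the value matches the metadata.
    identifier = "CS" + tail
    # 3. Drop the trailing ".txt" extension.
    if identifier.endswith(".txt"):
        identifier = identifier[: -len(".txt")]
    article_ids.append(identifier)

print(article_ids)
```

Taking the part after the last "CS" guards against headlines that happen to contain those two letters earlier in the title.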

# 4. And now we create a new CSV data frame with a new column: Article ID

First we break that list into two different ones

And now we create the new csv

And now we link that to the original dataframe with the proper text

So now we have our clean dataset!

# 5. We export everything into a csv file

#### Content from: 2. Importing Text Data Gale.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

Now that we have our dataframe with the Metadata, let's find a way to use the text files that we can download from Gale. First, let's import them into our computer.

# 1. We import the libraries

# 2. We set a Path to get the files

To be able to access the files, we need to first find where they are located in our computer. So, we need to set a path.

# 3. We create the dataframe

Now that we have the files, we need to import them into our laptop and create a dataframe with titles in one column and the text of the article in another column.

# 4. We export the dataframe

Success! We have our dataframe and we are ready to export it to a CSV file to start the process of cleaning and pre-processing.

#### Content from: 1. Gale Metadata.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

Let's begin transforming the Gale Metadata Dataframe into something that we can use to later on merge with our text data.

# 1. We import the libraries

# 2. We import our data

# 3. We modify the metadata column that we need to do the matching later on

To be able to do a matching between the text dataframe and this column, we need to remove "GALE"
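A minimal sketch of that removal on a plain list of strings. The sample values and the "|" separator are assumptions about how the "Gale Document Number" column looks.

```python
# Hypothetical values of the "Gale Document Number" column.
document_numbers = ["GALE|CS117834541", "GALE|CS203456789"]

# Remove the "GALE" prefix and any separator left in front of "CS".
ids = [number.replace("GALE", "").lstrip("|") for number in document_numbers]
print(ids)
```

On a pandas column, the same idea can be applied with the `.str.replace` accessor.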

Now let's substitute the original column with that

# 4. We export that to a CSV dataframe that we can use later on

#### Content from: Exercises cleaning and pre-processing.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

# Cleaning and Preprocessing Exercises

Let's practice what we have just learnt!

# Exercise 1. Multilingual cleaning and pre-processing

##### A. Import the libraries

##### B. Open the file that you created in the first exercise (**Vingt mille lieues sous les mers**).

##### C. Extract the text of the chapters into a list

##### D. Tokenize your text

##### E. Lowercase your text

##### F. Remove Punctuation

##### G. Remove Stopwords

##### H. Check out Stopwords in German (this is useful for those of you who want to use German text!)

# Exercise 2

##### A. Create a Pandas Dataframe with your clean text

##### B. Save that into a csv file

##### C. Transform your super_clean variable (a list) into a single string

##### D. Store that into a txt file

#### Content from: Cleaning and preprocessing data.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

# Step 2. Cleaning and Pre-processing data

Now that you have your data (Webscraping, APIs, PDFs, databases...) the next step in our Digital Humanities project is **cleaning and pre-processing**. In this notebook we are going to use the file that we just created in the previous notebook (*Around the World in Eighty Days*) and we are going to:

* Tokenize
* Lowercase
* Remove Punctuation
* Remove Stopwords
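The four steps listed above can be sketched with the standard library alone. This is a simplified stand-in: the regular-expression tokenizer and the tiny `STOPWORDS` set are assumptions made for illustration and are much cruder than a dedicated NLP tokenizer and stopword list.

```python
import re
import string

# A tiny illustrative stopword list; real stopword lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "at", "was", "and", "to"}


def clean(text):
    """Tokenize, lowercase, and drop punctuation and stopwords."""
    # Tokenize: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Lowercase every token.
    tokens = [token.lower() for token in tokens]
    # Remove tokens that are punctuation marks.
    tokens = [token for token in tokens if token not in string.punctuation]
    # Remove stopwords.
    return [token for token in tokens if token not in STOPWORDS]


print(clean("Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row."))
```

Lowercasing before the stopword filter matters: "The" at the start of a sentence would otherwise survive, because the stopword list is in lower case.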

# 1. We import our libraries

# 2. We get our data

# 3. We extract the text

# 4. Tokenization

# 5. Lower casing

To lowercase our data, we need to write a double loop, as we are looping over each element (token) contained in each element of the list. So, first we loop over the list, and then we loop over each token in each list item.

# 6. Punctuation

# 7. Stopwords

Let's first have a look at stopwords in English.

Great! We have 37 very clean chapters of **Around the World in 80 days**!

# Exercise 1

# 8. Using that data

Now that we have **very clean data**, we have two options:
    
1. We use it chapter by chapter the way we have it (in case we want to see how things evolve over the novel)
2. We transform it into a single string in case we want to analyse the whole book at once.
    
Let's do both things!

**8.1. Chapters**

We can store things in a dictionary

Or into a csv file

And now let's save that into a dataframe

**8.2. Full text**

Let's now save that into a txt file format (which is widely used in Digital Humanities!).
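A minimal sketch of turning the chapter lists into one string and writing it to a txt file. The sample tokens and the file name are illustrative assumptions; `super_clean` is assumed to be a list with one list of tokens per chapter.

```python
import os
import tempfile

# Hypothetical clean data: one list of tokens per chapter.
super_clean = [
    ["phileas", "fogg", "lived", "saville", "row"],
    ["passepartout", "convinced", "found", "ideal"],
]

# Join the tokens of each chapter, then join the chapters.
full_text = " ".join(" ".join(chapter) for chapter in super_clean)

# Write the string to a txt file (a temporary folder is used here).
txt_path = os.path.join(tempfile.mkdtemp(), "around_the_world_clean.txt")
with open(txt_path, "w", encoding="utf-8") as handle:
    handle.write(full_text)

# Read it back to confirm the round trip.
with open(txt_path, encoding="utf-8") as handle:
    print(handle.read() == full_text)
```

Stating `encoding="utf-8"` explicitly avoids surprises with accented characters on systems whose default encoding is different.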

You should now have that file on your laptop! We will be using it during the following days.

# Exercise 2

#### Content from: 1. API The Guardian.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

**References**: the code in this notebook comes from this script (written by my baby-programmer self 5 years ago: https://github.com/effernan/New-York-Times-Archive-API-code), and this more senior script: https://github.com/rochelleterman/scrape-interwebz/blob/master/1_APIs/3_api_workbook.ipynb 

# The Guardian

In this exercise we will be using the API of **The Guardian** (that does provide the full text of articles): https://open-platform.theguardian.com/access

First, you need a key, that you can get in here: https://bonobo.capi.gutools.co.uk/register/developer 

Let's explore **The Guardian** documentation website: https://open-platform.theguardian.com/documentation/

There are **5 endpoints**: Content, Tags, Sections, Editions, Single item. The one that we need is **content**: https://open-platform.theguardian.com/documentation/search 

So, in there, we have all the different options that the API is providing. The base_url is: "https://content.guardianapis.com/search?"

Everything else is very similar to The New York Times, but in here, there is one option that we can include in the parameters that will return the full text of the articles: "show-fields": "body". How **super cool** is that?
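As a sketch of how these pieces fit together, the request URL can be assembled with the standard library before any call is made. The key value `YOUR_API_KEY` is a placeholder and the search term is only an example; `q`, `show-fields` and `api-key` are parameter names of the content endpoint.

```python
from urllib.parse import urlencode

base_url = "https://content.guardianapis.com/search?"

# "YOUR_API_KEY" is a placeholder; replace it with your own key.
parameters = {
    "q": "David Beckham",
    "show-fields": "body",  # ask for the full text of each article
    "api-key": "YOUR_API_KEY",
}

# urlencode escapes the values and joins the pairs with "&".
request_url = base_url + urlencode(parameters)
print(request_url)
```

Building the URL this way takes care of spaces and special characters in the search term.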

##### 1. Let's build the API request

Now the keys have changed! What we need to access is "results"

Now let's build the proper call modifying our previous script.

Let's have a look at that first element in our list of documents.

So, what we want is: id, webPublicationDate, webTitle, webUrl, and the content of the article (that is in fields). Let's modify our function to get us that!
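A minimal sketch of such a function, run on a hand-written response so that no API call is needed. The sample values and the function name `extract_articles` are hypothetical; the key names are the ones listed above.

```python
# Hypothetical response, trimmed to the keys discussed above; values are made up.
response_json = {
    "response": {
        "results": [
            {
                "id": "football/2013/may/16/example-article",
                "webPublicationDate": "2013-05-16T12:00:00Z",
                "webTitle": "Example headline",
                "webUrl": "https://www.theguardian.com/football/2013/may/16/example-article",
                "fields": {"body": "<p>Example body text.</p>"},
            }
        ]
    }
}


def extract_articles(data):
    """Return one dict per article with the five values we want to keep."""
    articles = []
    for item in data["response"]["results"]:
        articles.append(
            {
                "id": item["id"],
                "date": item["webPublicationDate"],
                "title": item["webTitle"],
                "url": item["webUrl"],
                # "fields" is only present when show-fields was requested.
                "body": item.get("fields", {}).get("body", ""),
            }
        )
    return articles


articles = extract_articles(response_json)
print(articles[0]["title"])
```

A list of dictionaries like this one can be passed directly to `pandas.DataFrame` to build the table that gets saved.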

And now let's store that. We have a super cool David Beckham dataset that we can use for our future data analysis!

# Exercise

And now: repeat the exercise but enter some term that you are interested in (e.g. another athlete, or some other group of news that you would like to see). Remember to change the name of the csv file to not overwrite your data!

#### Content from: Exercise Getting Data from Arxiv.ipynb

##### Annotation:
This section includes the content and logic from the original notebook. Annotations have been added to explain the purpose and function of each code cell.

# Accessing *ArXiv*

## Elena Fernández Fernández

Let's try to webscrape the ArXiv website (disclaimer: I previously emailed them and asked if it was ok and legal to do this and they said yes!).

ArXiv is one of the most popular Computer Science article repositories out there and a great resource for Text Data Mining Research! Let's try to get the latest 50 articles about Twitter: https://arxiv.org/search/cs?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50

If you right-click on the page and view its source code, you will see that it is very similar to La Gaceta de Madrid. So: let's try to re-use that script for this!

The first thing that you need to do is to import the necessary libraries for webscraping

### Example: PDF scraping

In [None]:
# Import the libraries needed for web scraping and for saving files.

import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import shutil  # this one is for saving the PDFs to our computer
import os

Then let's store the url in a variable

And now let's access the text

The PDFs, which is what we are looking for, are linked from the "a" tags of the page, so we need to filter our search to get all the "a" elements.

Once we have all the "a" elements, we need to refine our search even more, as the PDF links that we are looking for are stored in the "href" attribute of each "a" element. We will store them in a list (pdfs).

Now we need to narrow our search even more. What we need are all the strings in which the PDFs are stored. Let's try to get them!
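The filtering idea can be sketched on a small HTML fragment. To keep the example free of extra installs it uses the standard-library `html.parser` instead of BeautifulSoup, so it is a stand-in for the approach and not the notebook's own code. The HTML fragment and the class name `LinkCollector` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical HTML fragment; the real search page is much larger.
html = """
<p><a href="https://arxiv.org/abs/2406.12444">arXiv:2406.12444</a>
<a href="https://arxiv.org/pdf/2406.12444">pdf</a>
<a href="/help">Help</a></p>
"""

collector = LinkCollector()
collector.feed(html)

# Keep only the links that point to a PDF.
pdf_links = [link for link in collector.links if "/pdf/" in link]
print(pdf_links)
```

The two-stage structure mirrors the text: first every "href" is collected, then the list is narrowed down to the PDF links.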

Something is not working. What is it? Let's ask Perplexity AI: https://www.perplexity.ai/search/Im-trying-to-U7GFFcNcTb6df18PJ_RpZg

So: the ArXiv web developers have built some sort of mechanisms that do not allow us to webscrape their website! But good news: we can use their API: https://pypi.org/project/arxiv/

# ArXiv API

First let's install the python library arxiv

And now let's import it

Let's also import the time library

Let's build a query

And let's extract the 50 most recent Twitter ArXiv articles

Ok, so now let's first have a look at the titles. 

And now, if we compare that with the actual ArXiv website, it looks correct: https://arxiv.org/search/cs?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50

Now let's begin saving the first pdf in that list in our laptop.

### Example: PDF scraping

In [None]:
# Look up one paper by its ArXiv ID and download its PDF.

paper = next(arxiv.Client().results(arxiv.Search(id_list=["2406.12444"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()

### Example: PDF scraping

In [None]:
# Download the same paper again, this time choosing the file name.

# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="2406.12444v1.pdf")

And now we need to create a folder in our directory. Remember that we can also do that using bash commands here in Jupyter Notebooks. 

# IMPORTANT

REMEMBER to create a new folder for your new query when you do the exercise.

And remember to change the name of the folder in the dirpath when you do the exercise. 

### Example: PDF scraping

In [None]:
# Download the paper into a specific folder, again with a custom file name.

# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath = 'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 1\\APIs 1. Arxiv\\files_arxiv',
                    filename = "2406.12444v1.pdf")

So far I have been using just the code provided by the Arxiv API to do all those things (https://pypi.org/project/arxiv/). Now let's go to the next level and let's extract a bunch of articles. Looking at the API it looks like we need to extract the IDs of the papers. Let's do that!

What we need is just the ID number of the paper. Let's select that.

And now let's loop over that to get all the PDFs onto our laptop. First let's create a new folder called "arxiv_pdfs"

# IMPORTANT

REMEMBER to create a new folder for your new query when you do the exercise. Remember to change the name of the folder in the filename variable below too.

And now let's get all the pdfs

### Example: PDF scraping

In [None]:
# Download the PDF of every paper in ids_2, pausing between requests.

for paper_id in ids_2:
    # Search for the article with the given ID
    search = arxiv.Search(id_list=[paper_id])
    paper = next(client.results(search))

    # Download the PDF into the "arxiv_pdfs" folder, using the ID as the file name
    filename = os.path.join("arxiv_pdfs", urllib.parse.quote(paper_id))
    paper.download_pdf(filename=f"{filename}.pdf")

    time.sleep(3)  # Wait 3 seconds between requests, as the ArXiv API asks

And voilà! According to the documentation of the arxiv package (https://pypi.org/project/arxiv/1.4.8/), the API's limit is 300,000 results per query: that is a lot!

# Exercise

Now use this same notebook and do a new search using a different key term. 
* If you would like to use two words (Mark Zuckerberg, Climate Change, Donald Trump...) use this syntax: '"climate change"' (quotations inside quotations).
* Remember to **change the name of the folders** when you create new ones for the individual PDF and for the list of PDFs, so that both queries do not end up mixed in the same folder