# Summary: PDF and HTML Scraping
This notebook combines content from the uploaded notebooks and uses selected examples to show how to extract data from PDFs (PDF scraping) and how to process HTML web pages (HTML scraping). All code and text chunks are annotated.

### Example: PDF Scraping
#### Explanation:
This section shows how PDF scraping works.

In [None]:
pip install PyPDF2

### Example: API Data Query
#### Explanation:
This section shows how an API data query works.

In [None]:
import requests
import json
import time  # to pause after each API call
import math
import csv
import matplotlib.pyplot as plt
import pandas as pd  # to view our CSV

### Example: Data Cleaning
#### Explanation:
This section shows how data cleaning works.

In [None]:
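#Solution
from nltk.corpus import stopwords

# punctuation_free is the list of punctuation-free strings built in an earlier step
french_stopwords = set(stopwords.words("french"))  # build the lookup set once

super_clean = []
for i in punctuation_free:
    super_clean.append(" ".join(c for c in i.split() if c not in french_stopwords))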

### Contents from: 1. Gale Metadata.ipynb
#### Summary and explanation:

Let's begin transforming the Gale Metadata Dataframe into something that we can use to later on merge with our text data.

# 1. We import the libraries

In [None]:
import pandas as pd
import csv

# 2. We import our data

In [None]:
metadata = pd.read_csv("Times 1980 January February_metadata.csv", index_col = False)

In [None]:
metadata

# 3. We modify the metadata column that we need to do the matching later on

In [None]:
text = metadata["Gale Document Number"].to_list()

In [None]:
text

To be able to match this column against the text dataframe, we need to remove the "GALE|" prefix.

In [None]:
new_text = []

for i in text:
    new_text.append(i.replace('GALE|', ''))
    

In [None]:
new_text
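
As a minimal sketch of the same step, `str.removeprefix` (Python 3.9+) strips `GALE|` only at the start of a string, whereas `replace` removes it wherever it occurs. The helper name `strip_gale_prefix` and the document numbers below are made up for illustration.

```python
def strip_gale_prefix(doc_number: str, prefix: str = "GALE|") -> str:
    """Return doc_number without a leading prefix; all other text is untouched."""
    return doc_number.removeprefix(prefix)


# Hypothetical document numbers, not taken from the real metadata file
sample_numbers = ["GALE|CS0000000001", "GALE|CS0000000002"]
stripped_numbers = [strip_gale_prefix(n) for n in sample_numbers]
print(stripped_numbers)
```

Values that do not start with the prefix are returned unchanged, so running the cell twice is harmless.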

Now let's substitute the original column with that

In [None]:
metadata["Gale Document Number"] = new_text

In [None]:
metadata

# 4. We export that to a CSV file that we can use later on

In [None]:
metadata.to_csv("final_metadata.csv")

### Contents from: 2. Importing Text Data Gale.ipynb
#### Summary and explanation:

Now that we have our dataframe with the metadata, let's find a way to use the text files that we can download from Gale. First, let's import them into our computer.

# 1. We import the libraries

In [None]:
import pandas as pd
import os

# 2. We set a Path to get the files

To be able to access the files, we first need to find where they are located on our computer. So, we need to set a path.

In [None]:
pwd

In [None]:
ls

In [None]:
directory_path = "yourpath"

In [None]:
directory_path

# 3. We create the dataframe

Now that we have located the files, we need to read them in and create a dataframe with the titles in one column and the text of each article in another column.

In [None]:
titles = []
contents = []

In [None]:
for filename in os.listdir(directory_path):
    if filename.endswith('.txt'):  # Ensure you're only processing text files
        with open(os.path.join(directory_path, filename), 'r') as file:
            titles.append(filename)  # Store the filename as title
            contents.append(file.read())  # Read and store the content
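
The loop above can also be sketched with `pathlib`. This version sorts the file names so the row order is reproducible, and it assumes the files are UTF-8 encoded, which may not hold for every download. The function name `read_text_files` and the demo files are hypothetical.

```python
from pathlib import Path
import tempfile


def read_text_files(directory):
    """Return (titles, contents) for every .txt file in directory, sorted by name."""
    titles, contents = [], []
    for path in sorted(Path(directory).glob("*.txt")):
        titles.append(path.name)                            # file name used as title
        contents.append(path.read_text(encoding="utf-8"))   # assumed encoding
    return titles, contents


# Demo on a throwaway folder with made-up files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b_article.txt").write_text("second", encoding="utf-8")
    Path(tmp, "a_article.txt").write_text("first", encoding="utf-8")
    Path(tmp, "notes.csv").write_text("ignored", encoding="utf-8")
    demo_titles, demo_contents = read_text_files(tmp)

print(demo_titles)
```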

In [None]:
titles

In [None]:
len(titles)

In [None]:
contents

In [None]:
type(contents[0])

In [None]:
len(contents)

In [None]:
df = pd.DataFrame({
    'Title': titles,
    'Content': contents
})

In [None]:
df

# 4. We export the dataframe

Success! We have our dataframe and we are ready to export it to a CSV file to start the process of cleaning and pre-processing.

In [None]:
df.to_csv('raw_data.csv', index = False)

### Contents from: 3. Headers.ipynb
#### Summary and explanation:

Now let's begin by organizing (AKA cleaning and pre-processing) the titles (headers) of our articles.

# 1. We import the libraries

In [None]:
import pandas as pd
import re

# 2. We get the data

In [None]:
data = pd.read_csv("raw_data.csv")

In [None]:
data

# 3. We split the title to get the CS identifier

We will match the data (titles and articles) with the metadata through the CS identifier, which appears in both dataframes. So we need to extract it from the titles of the articles here.

In [None]:
title = data["Title"].to_list()

In [None]:
title

First we split things by "CS" (an alternative way would be to do this using regex but it's much more complicated)

In [None]:
result = [s.split('CS') for s in title]

In [None]:
result

And now we need to add CS again to make sure that we can later on concatenate it with the Metadata.

In [None]:
modified_data = [[inner[0], 'CS' + inner[1]] for inner in result]

In [None]:
modified_data

And now we need to get rid of the final .txt to be able to later on match things with the metadata dataframe

In [None]:
cleaned_data = [[item[0], item[1].replace('.txt', '')] for item in modified_data]

In [None]:
cleaned_data

In [None]:
len(cleaned_data)
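
The split, re-add and strip steps above can also be done in one pass with a regular expression. This is only a sketch: it assumes the identifier is "CS" followed by digits and sits directly before the ".txt" extension, and the file name used below is invented.

```python
import re

# Assumed pattern: "CS" followed by digits, right before the ".txt" extension
ID_PATTERN = re.compile(r"^(?P<title>.*?)(?P<id>CS\d+)\.txt$")


def split_title_and_id(filename):
    """Return [title, article_id], or None if the name does not match the pattern."""
    match = ID_PATTERN.match(filename)
    if match is None:
        return None
    return [match.group("title"), match.group("id")]


# Hypothetical file name in the assumed format
print(split_title_and_id("A_Headline_CS123456789.txt"))
```

Returning `None` for names that do not match makes it easy to spot files that need manual attention.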

# 4. And now we create a new CSV data frame with a new column: Article ID

First we break that list into two different ones

In [None]:
title = [i[0] for i in cleaned_data]

In [None]:
len(title)

In [None]:
id_articles = [i[1] for i in cleaned_data]

In [None]:
len(id_articles)

And now we create the new csv

In [None]:
final_data = pd.DataFrame(title, columns = ["Title"])

In [None]:
final_data

In [None]:
final_data["ID"] = id_articles

In [None]:
final_data

And now we link that to the original dataframe with the proper text

In [None]:
text = data["Content"].to_list()

In [None]:
text

In [None]:
final_data["Article"] = text

In [None]:
final_data

So now we have our clean dataset!

# 5. We export everything into a csv file

In [None]:
final_data.to_csv("headers.csv")

### Contents from: 4. Merging Dataframes.ipynb
#### Summary and explanation:

As we will see in the next Jupyter Notebook (5. Body), to clean and pre-process the body we need to drop the rows of our dataframe that have missing data. To simplify that process, let's merge both dataframes before we clean the body of the articles.

# 1. We import our libraries

In [None]:
import pandas as pd

# 2. We get our data

First we get the metadata 

In [None]:
metadata = pd.read_csv("final_metadata.csv", index_col = 0)

In [None]:
metadata

Now let's change the name of the column "Gale Document Number" to ID to be able to merge dataframes in just a second

In [None]:
metadata.rename(columns = {"Gale Document Number" :"ID"}, inplace = True)

In [None]:
metadata

And now we get the titles and the unclean body

In [None]:
articles = pd.read_csv("headers.csv", index_col = 0)

In [None]:
articles

# 3. Let's merge dataframes

Now let's merge both dataframes using the ID column on both of them

In [None]:
merged_df = pd.merge(metadata, articles, on = 'ID', how = 'outer')

In [None]:
merged_df
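
To see which rows found no partner in an outer merge, pandas can add a `_merge` column through the `indicator=True` argument. The sketch below uses two toy dataframes with made-up IDs instead of the real CSV files.

```python
import pandas as pd

# Toy frames with made-up IDs; the real ones come from final_metadata.csv and headers.csv
toy_metadata = pd.DataFrame({"ID": ["CS1", "CS2", "CS3"],
                             "Document Title": ["a", "b", "c"]})
toy_articles = pd.DataFrame({"ID": ["CS1", "CS2"],
                             "Article": ["text a", "text b"]})

toy_merged = pd.merge(toy_metadata, toy_articles, on="ID", how="outer", indicator=True)

# "_merge" is "both", "left_only" or "right_only" for each row
unmatched = toy_merged[toy_merged["_merge"] != "both"]
print(unmatched[["ID", "_merge"]])
```

Rows marked `left_only` have metadata but no article text, which is where the missing values in the "Article" column come from.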

# 4. Cleaning new Dataframe

To make sure that the merge was done correctly, we can compare the "Document Title" column from the metadata dataframe with the "Title" column from the articles dataframe. That being said, let's clean this dataframe a little and drop the columns Publisher, Subject, and Language. Let's keep the Title column for now (we can drop it later if we no longer need it).

In [None]:
final_data = merged_df.drop(['Publisher', 'Subject', 'Language'], axis=1)

In [None]:
final_data

# 5. Saving our data

And now let's save our data into a CSV file

In [None]:
final_data.to_csv("final_data.csv")

### Contents from: 5. Body.ipynb
#### Summary and explanation:

Now that we have our final dataframe, we still need to do some cleaning and pre-processing of our articles' text. Let's do that!

# 1. We import the libraries

In [None]:
import pandas as pd
import re

# 2. We get the data

In [None]:
data = pd.read_csv("final_data.csv", index_col = 0)

In [None]:
data

# 3. We select the text

In [None]:
body = data["Article"].to_list()

In [None]:
len(body)

In [None]:
body[0]

In [None]:
type(body[0])

Checking if there are some float numbers (nan) that stand for missing data

In [None]:
for i in body:
    if type(i) == float:
        print(type(i))

In [None]:
for index, value in enumerate(body):
    if isinstance(value, float):
        print(f"Index: {index}, Value Type: {type(value)}")
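
The two checks above flag every float, including real numbers. A slightly stricter sketch tests for NaN explicitly, since that is the value pandas uses for a missing cell. The helper name and the sample list are made up.

```python
import math


def missing_positions(values):
    """Return the positions whose value is a float NaN (a missing cell)."""
    return [i for i, v in enumerate(values)
            if isinstance(v, float) and math.isnan(v)]


sample_body = ["first article", float("nan"), "third article"]  # made-up data
print(missing_positions(sample_body))
```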

In [None]:
body[228]

In [None]:
data.iloc[228:229]

This is happening because we selected more metadata than proper articles (due to the 5000 download limit restrictions for full articles). So, there are some missing articles in there. Let's get rid of them!

In [None]:
data_clean = data.dropna(subset = ["Article"]).copy()  # copy so later column assignments do not touch a view of data

In [None]:
data_clean

In [None]:
len(data_clean)

We have a clean dataframe! Let's go back to the body part

In [None]:
body = data_clean["Article"].to_list()

In [None]:
body[0]

# 4. We clean and pre-process

Time to do some cleaning

In [None]:
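def clean_text(text):
    # Use distinct local names so the function name is not shadowed
    no_newlines = [i.replace("\n", "") for i in text]
    no_backslashes = [i.replace("\\", "") for i in no_newlines]
    # No backslashes are left at this point, so this pattern can no longer match
    really_final_text = [i.replace("\\'", "") for i in no_backslashes]
    return really_final_text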

In [None]:
final_text = clean_text(body)

In [None]:
final_text[0]

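We can see that a rebel \' sequence still shows up in the output. Since every backslash has already been removed, this is just how Python displays an apostrophe inside a single-quoted string. Let's get rid of the apostrophes too!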

In [None]:
def clean_text_final(text):
    final_clean_text = [i.replace("\n", "").replace("\\", "").replace("'", "") for i in text]
    return final_clean_text

In [None]:
really_final_text = clean_text_final(final_text)

In [None]:
really_final_text[0]
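
As a sketch of an alternative, `str.translate` can delete all three characters in a single pass. The names `UNWANTED` and `strip_unwanted` are illustrative, and the sample text is invented.

```python
# Characters to delete: newline, backslash and apostrophe
UNWANTED = str.maketrans("", "", "\n\\'")


def strip_unwanted(texts):
    """Remove every newline, backslash and apostrophe from each string in texts."""
    return [t.translate(UNWANTED) for t in texts]


sample = ["It\\'s a\nshort test"]  # made-up text
print(strip_unwanted(sample))
```

Like the functions above, this deletes newlines outright, so words on either side of a line break are glued together; mapping the newline to a space instead may be preferable for some analyses.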

Now let's change our column in the csv dataframe

In [None]:
data_clean.loc[:, "Article"] = really_final_text

In [None]:
data_clean

In [None]:
data_clean["Article"].iloc[1]  # positional access; index label 1 may have been dropped

# 5. Saving our data

And now we are ready to save our super clean dataframe for future Text Data Mining analysis!

In [None]:
data_clean.to_csv("final_TDM_dataframe.csv")

### Contents from: Exercises Information Extraction.ipynb
#### Summary and explanation:

And now let's practice what we have just learnt but now with a multilingual text!

Script Sources:

* **NLTK**: Tsilimos, Maria. Python: Introduction to Natural Language Processing (NLP). IT Central, University of Zurich.
* **Spacy**: https://spacy.io/usage/spacy-101

# Exercise 1: replicating the NLTK IE architecture with the second chapter of Around the World in 80 Days

#### A. We import our data

The second chapter of **Around the World in 80 days** has been prepared for you (not cleaned or pre-processed, but without \r\n characters). Write some code to open it!

(P.S. Again, if you want to replicate the code for your own exercises, run the following script: import re; data = re.sub(r"\r\n", " ", data)).

In [None]:
#Your code in here

In [None]:
#Solution
with open("chapter_2_80.txt", "r", encoding = "utf-8") as f:
    data = f.read()

#### B. We import the libraries

In [None]:
#Your code in here

In [None]:
#Solution
import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk import pos_tag
from nltk import ne_chunk
from nltk.chunk import conlltags2tree, tree2conlltags
from nltk.draw import draw_trees

#### C. Sentence Segmentation

In [None]:
#Your code in here

In [None]:
#Solution
sentences = sent_tokenize(data) 

#### D. Tokenization

In [None]:
#Your code in here

In [None]:
#Solution
token_sentences = [word_tokenize(sentence) for sentence in sentences] 

In [None]:
token_sentences

#### E. POS Tagging

In [None]:
#Your code in here

In [None]:
#Solution
pos_sentences = [nltk.pos_tag(sentence) for sentence in token_sentences ] 

In [None]:
pos_sentences

#### F. Chunking and NER

#### Chunking

Try extracting a sentence that you like. 

In [None]:
#Your code in here

In [None]:
#Solution
sentence = pos_sentences[2]

In [None]:
sentence

Now create a tree out of that!

In [None]:
grammar = "NP: {<DT>?<JJ>*<NN>}" 

In [None]:
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence) 
print(result)

In [None]:
result.draw()

In [None]:
#Your code in here

In [None]:
#Solution
chunked_sentences = nltk.ne_chunk_sents(pos_sentences)

In [None]:
for sent in chunked_sentences:
    for chunk in sent: 
        if hasattr(chunk,'label'): 
            print(chunk.label(), ' '.join(c[0] for c in chunk))

#### G. Transforming that into a list and creating three different lists: 1. Person, 2. Organization, 3. GPE 

In [None]:
#Your code in here

In [None]:
#Solution

1. Creating a list

In [None]:
chunked_sentences = nltk.ne_chunk_sents(pos_sentences) #remember to always write this again! 

In [None]:
named_entities = []

In [None]:
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label"):
            named_entities.append((chunk.label(), ' '.join(c[0] for c in chunk)))

In [None]:
named_entities

2. Creating a person list

In [None]:
person = []

for a,b in named_entities:
    if a == "PERSON":
        person.append([a, b])

In [None]:
person

3. Creating a GPE list

In [None]:
GPE = []

for a,b in named_entities:
    if a == "GPE":
        GPE.append([a, b])

In [None]:
GPE

4. Creating an organization list

In [None]:
organization = []

for a,b in named_entities:
    if a == "ORGANIZATION":
        organization.append([a, b])

In [None]:
organization

# Exercise 2: Spacy

Now let's repeat the exercise with Spacy to compare the performance of both.

#### A. We import the libraries

In [None]:
#Your code

In [None]:
#Solution
import spacy

#### B. We download the English SPACY pipeline and we inspect the entity labels

You may need to do this (remove the # symbol)

In [None]:
#!python -m spacy download en_core_web_sm

In [None]:
#Your code

In [None]:
#Solution
nlp = spacy.load("en_core_web_sm")

In [None]:
nlp.get_pipe('ner').labels

#### C. We initialize the NLP object

In [None]:
#Your code in here

In [None]:
#Solution
doc = nlp(data)

#### D. We create a list with the entities

In [None]:
#Your code in here

In [None]:
#Solution
named_entities = []

for ent in doc.ents:
    named_entities.append([ent.text, ent.label_])

In [None]:
named_entities

#### E. We create three lists: one with person (PERSON), one with Geopolitical Entities (GPE), one with Organization (ORG).

In [None]:
#Your code in here

In [None]:
#Solution
person = []

for ent in doc.ents:
    if ent.label_ == "PERSON":
        person.append([ent.text, ent.label_])

In [None]:
person

In [None]:
#Solution
GPE = []

for ent in doc.ents:
    if ent.label_ == "GPE":
        GPE.append([ent.text, ent.label_])

In [None]:
GPE

In [None]:
org = []

for ent in doc.ents:
    if ent.label_ == "ORG":
        org.append([ent.text, ent.label_])

In [None]:
org

So: once again we see that Spacy really outperforms NLTK!
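
One way to make such a comparison concrete is to put the entity texts from both libraries into sets and look at the overlap. The two sets below are invented stand-ins for the PERSON lists built above; with the real lists you would first pull out the text element of each pair.

```python
# Made-up results standing in for the two PERSON lists built above
nltk_persons = {"Phileas Fogg", "Passepartout", "Saville Row"}
spacy_persons = {"Phileas Fogg", "Passepartout", "James Forster"}

found_by_both = nltk_persons & spacy_persons   # entities both libraries agree on
only_nltk = nltk_persons - spacy_persons       # candidates to check by hand
only_spacy = spacy_persons - nltk_persons

print(sorted(found_by_both), sorted(only_nltk), sorted(only_spacy))
```

The entries that only one library reports are the ones worth reading closely, since they show where each tool makes mistakes.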

### Contents from: Information Extraction.ipynb
#### Summary and explanation:

# Information Extraction: NLTK and Spacy

Script Sources:

* **NLTK**: Tsilimos, Maria. Python: Introduction to Natural Language Processing (NLP). IT Central, University of Zurich.
* **Spacy**: https://spacy.io/usage/spacy-101

**Information Extraction (IE)** consists of transforming **unstructured Natural Language data** (written or spoken) into **structured data** ready to be used by machines.

In this notebook we are going to learn two different IE methods: **Part of Speech (POS) Tagging** and **Named Entity Recognition (NER)**.

There are many excellent Python libraries out there to write scripts that will allow us to do both things. In this notebook we will learn how to use **NLTK** and **Spacy** and understand the advantages and disadvantages of both!

# 1. Importing our data

Let's begin by using the first chapter of **Around the World in Eighty Days** by Jules Verne.

If you remember, in the previous chapter we did 4 steps of cleaning and pre-processing:

* Tokenization
* Lowercasing
* Removing Punctuation
* Removing Stopwords

Now **we are not going to do any of those things**. We need to do **POS tagging**, and for that, it is necessary to keep punctuation and stopwords to avoid confusing the parser. 

The only thing that we are going to remove are the noisy characters "\r\n".

For that, we are going to use this script: **re.sub(r"\r\n", " ", data)** (in case you want to replicate it on your own dataset).

For efficiency purposes a clean first chapter has been created for you with that process already incorporated.
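
A minimal sketch of that replacement on an invented sample string (the variable names are illustrative):

```python
import re

raw = "Mr. Phileas Fogg lived\r\nin Saville Row."  # made-up sample with a Windows line break
data_sample = re.sub(r"\r\n", " ", raw)            # each \r\n pair becomes a single space
print(data_sample)
```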

In [None]:
with open("chapter_1_80.txt", "r", encoding = "utf-8") as f:
    data = f.read()

# 2. Understanding Information Extraction Architecture: NLTK

### A. We import the libraries

In [None]:
import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk import pos_tag
from nltk import ne_chunk
from nltk.chunk import conlltags2tree, tree2conlltags
from nltk.draw import draw_trees

### B. We initialize the Information Extraction Pipeline:

1. Sentence Segmentation
2. Tokenization
3. POS Tagging
4. Chunking
5. NER

#### 1. Sentence Segmentation

In [None]:
sentences = sent_tokenize(data) 
sentences

#### 2. Tokenization

In [None]:
token_sentences = [word_tokenize(sentence) for sentence in sentences] 

In [None]:
print(token_sentences)

#### 3. POS Tagging

In [None]:
pos_sentences = [nltk.pos_tag(sentence) for sentence in token_sentences ] 

In [None]:
pos_sentences

#### 4. Chunking and NER

#### Chunking

In [None]:
sentence = pos_sentences[7]

In [None]:
sentence

In [None]:
grammar = "NP: {<DT>?<JJ>*<NN>}" 

In [None]:
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence) 
print(result)

In [None]:
result.draw()

#### NER

In [None]:
chunked_sentences = nltk.ne_chunk_sents(pos_sentences)

In [None]:
chunked_sentences

In [None]:
for sent in chunked_sentences:
    for chunk in sent: 
        if hasattr(chunk,'label'): 
            print(chunk.label(), ' '.join(c[0] for c in chunk))

And now let's transform that into a list!

Source = https://nanonets.com/blog/named-entity-recognition-with-nltk-and-spacy/

In [None]:
chunked_sentences = nltk.ne_chunk_sents(pos_sentences)

In [None]:
named_entities = []

In [None]:
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label"):
            named_entities.append((chunk.label(), ' '.join(c[0] for c in chunk)))

In [None]:
named_entities

In [None]:
person = []

for a,b in named_entities:
    if a == "PERSON":
        person.append([a, b])

In [None]:
person

That looks good so far! Let's now check **Geopolitical Entities (GPE)**

In [None]:
GPE = []

for a,b in named_entities:
    if a == "GPE":
        GPE.append([a, b])

In [None]:
GPE

That also looks quite good! However we observe some **issues**: is American or Londoner a person or a GPE?

In [None]:
organization = []

for a,b in named_entities:
    if a == "ORGANIZATION":
        organization.append([a, b])

In [None]:
organization
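
The three filtering loops above can also be collapsed into a single grouping pass. The sketch below uses `collections.defaultdict` on a few invented (label, text) pairs in the same shape as `named_entities`.

```python
from collections import defaultdict

# Made-up (label, text) pairs in the same shape as named_entities
sample_entities = [("PERSON", "Phileas Fogg"), ("GPE", "London"),
                   ("ORGANIZATION", "Reform Club"), ("GPE", "London")]

entities_by_label = defaultdict(list)
for label, text in sample_entities:
    entities_by_label[label].append(text)  # one list per entity label

print(entities_by_label["GPE"])
```

This keeps every label the chunker produces, so labels beyond PERSON, GPE and ORGANIZATION are available without writing another loop.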

# Exercise 1

# Spacy

And now let's try Spacy. Spacy does not follow the same architecture as NLTK: we don't need to run the pipeline step by step (sentence segmentation, tokenization, POS tagging, chunking and NER). All of that is implemented in their code! Have a look at: https://spacy.io/usage/linguistic-features#named-entities

You may need to install the Spacy pipeline. If so, remove the # symbol in the following cells.

In [None]:
#!pip install spacy

In [None]:
#!python -m spacy download en_core_web_sm

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
doc = nlp(data)

Let's first have a look at the existing Entity Labels

In [None]:
nlp.get_pipe('ner').labels

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
for ent in doc.ents:
    if ent.label_ == "PERSON":
        print(ent.text, ent.label_)

In [None]:
for ent in doc.ents:
    if ent.label_ == "GPE":
        print(ent.text, ent.label_)

In [None]:
for ent in doc.ents:
    if ent.label_ == "ORG":
        print(ent.text, ent.label_)

We have a winner!

# Exercise 2

### Contents from: Mapping Jules Verne. NER with Spacy.ipynb
#### Summary and explanation:

Now that we have done things at the chapter level, let's do it at the book level! Let's focus on geographically mapping the world of Jules Verne by extracting the GPE and LOC entities of **Around the World in 80 days**.

# 1. We import our libraries

In [None]:
import spacy

# 2. We get our data

This data has not been cleaned and pre-processed to avoid confusing the parser (only \r\n characters have been removed!)

In [None]:
with open("around_the_world.txt", "r", encoding = "utf-8") as f:
    data = f.read()

# 3. We import the English pipeline

In [None]:
nlp = spacy.load("en_core_web_sm")

# 4. We create the Spacy nlp object

In [None]:
doc = nlp(data)

# 5. We inspect the English model labels

Let's remember the entities that we have in Spacy:

In [None]:
nlp.get_pipe('ner').labels

# 6. We print the entities

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

# 7. We create one list with GPE 

While LOC is possibly a label that contains interesting information, as this is a DH introductory course, let's just focus on GPE!

In [None]:
GPE = []

for ent in doc.ents:
    if ent.label_ == "GPE":
        GPE.append([ent.text, ent.label_])

In [None]:
GPE

Now let's drop the duplicates in there!

In [None]:
GPE_places = []

for a, b in GPE:
    GPE_places.append(a)  

In [None]:
unique_GPE = set(GPE_places)

In [None]:
unique_GPE

Let's save our values!

In [None]:
with open("GPE_aroundtheworld.txt", "w", encoding = "utf-8") as f:
    f.write(str(unique_GPE))
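
Writing `str(unique_GPE)` stores the set as one long Python literal. As a hedged alternative, the sketch below writes one place per line, which is easier to read back and to inspect by hand. The helper names, the file name and the sample places are made up.

```python
from pathlib import Path
import tempfile


def save_places(places, path):
    """Write one place name per line, sorted alphabetically."""
    Path(path).write_text("\n".join(sorted(places)), encoding="utf-8")


def load_places(path):
    """Read the file back into a set of place names."""
    return set(Path(path).read_text(encoding="utf-8").splitlines())


sample_places = {"London", "Suez", "Bombay"}  # made-up subset
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "GPE_sample.txt")
    save_places(sample_places, target)
    reloaded = load_places(target)

print(sorted(reloaded))
```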

# Exercise 3

### Contents from: Geospatial Analysis.ipynb
#### Summary and explanation:

# Using Geospatial Analysis to visually analyze Travel Literature!

Geospatial Analysis can be a great tool to help us dig into the textual analysis of literary texts. This can be particularly useful if we want to add extra layers of analysis to some genres such as **Travel Literature**. In this notebook we are going to explore how to use the Python library Plotly: https://plotly.com/python/getting-started/

**Sources:** the majority of the scripts in this notebook come from these Plotly sources: https://plotly.com/python/mapbox-layers/, https://plotly.com/python/scatter-plots-on-maps/, https://plotly.com/python/reference/scattermapbox/#scattermapbox-marker-symbol. For more advanced scripts about geospatial data science, this is an excellent course: https://github.com/suneman/socialdata2023.

# 1. We import the libraries

In [None]:
#pip install plotly

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py 
from plotly.figure_factory import create_table # for creating nice table 

# 2. We manually inspect our city dataset

All Digital Humanities projects involve some degree of close reading analysis. We need to inspect our "GPE_aroundtheworld.txt" file and decide which cities we are going to include in our selection! (you will see that there is a considerable amount of noise even using Spacy, and that some place names were current in Jules Verne's time but have changed since).

# 3. We create our GPS dataset.

To be able to map our cities, we need to extensively google the Latitude and Longitude of all of them, and manually annotate the results in several lists (as we will need to create a CSV dataframe to be able to plot things in maps with Plotly).

Be aware that:

**GPS Lat-Long signs: N+, S-, W-, E+.**

For example:

* Rio de Janeiro: 22.9068° S, 43.1729° W (-22.9068, -43.1729)
* London: 51.5072° N, 0.1276° W (51.5072, -0.1276)
* Stockholm: 59.3293° N, 18.0686° E (59.3293, 18.0686)
* Sydney: 33.8688° S, 151.2093° E (-33.8688, 151.2093)
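
The sign rule above can be sketched as a small helper that turns an unsigned value plus a hemisphere letter into a signed decimal degree. The function name `to_signed` is made up for illustration.

```python
def to_signed(value, hemisphere):
    """Convert an unsigned coordinate plus N/S/E/W into a signed decimal degree."""
    hemisphere = hemisphere.upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"Unknown hemisphere: {hemisphere!r}")
    # South and West are negative, North and East are positive
    return -abs(value) if hemisphere in {"S", "W"} else abs(value)


rio = (to_signed(22.9068, "S"), to_signed(43.1729, "W"))
stockholm = (to_signed(59.3293, "N"), to_signed(18.0686, "E"))
print(rio, stockholm)
```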

# Activity for you

Please google "Lat Long decimal" and add the coordinates of **Denver, Bloomington (Indiana), Sacramento**. Add the latitude, the longitude, and the country to the corresponding lists. Remember to remove the dots (they are only there to show you where to write) and to add the closing bracket of each list! Once you are finished, run the scripts and you will automatically have a Pandas dataframe with all the information!

In [None]:
places = ["Huangpu River (Shanghai River)", "Salt Lake City", "Lima (Peru)", "Green Creek", "Hong Kong", "Rewa (Vindhias)",
          "San Francisco", "Aurangabad", "Macao", "New York", "Omaha", "Philadelphia (Pensilvania)", "Dublin", "Turin", 
          "Burhampoor", "Stockholm (Sweden)", "Golconda", "Dover", "Bundelkhand (Bundelcund)", "Ganges", "Bihar (Behar)", 
          "Havre", "Tokyo (Japanese Empire)", "Surat", "Fire Island", "Birmingham", "Cairo (Egypt)", "Paris (France)", 
          "Formosa", "Yokohama", "Brasilia (Brasil)", "Cambridge (Kirkland)", "Weber River", "Desmoines (Iowa)", 
          "Jackson (Missisipi)", "Pittsburg", "Mexico City (Mexico)", "Brindisi", "Jersey City", "Canberra (New Holland)",
          "London", "Portland (Oregon)", "Murshedabad", "Singapur", "Victoria", "New Hampshire (Vermont)", "Glasgow",
          "Calcutta", "Malacca", "Kansas City (Kansas)", "Edinburgh", "Carson City (Nevada)", 
          "Lawrence Kansas (Fort Saunders)", "Rock Island (Illinois)", "the Strait of Bab-el-Mandeb (Bab-el-Mandeb)",
          "Hamburg", "Oslo", "Little Rock (Arkansas)", "Khandallah", "Nagasaki", "Mumbai", "Queenstown", "Edo (Yeddo)", 
          "Odgen", "Amman (Jordan)", "Burdwan", "Yokohama", "Oakland", "Marylebone", "Greenwich", "Columbus",
          "Reno", "Amsterdam (Holland)", "Chicago (Illinois)", "Aden", "Cheyenne (Wyoming)", "Bardhaman (Burdivan)", 
          "Paris", "Liverpool", "Elephanta Island", "Southampton", "Long Island", "Fort Wayne", "Saddle Peak", 
          "Allahaban"]

In [None]:
lat = [31.267401, 40.758701, -12.046374, 41.344525, 22.302711, 24.530727,  37.773972, 19.901054, 22.210928,  40.730610,
        41.257160, 39.952583,  53.350140, 45.116177, 21.307373, 59.334591, 17.383336111111, 39.161079,  25.4556,
       29.7666636, 25.612677, 49.490002, 35.652832, 21.170240, 40.630239, 52.489471, 30.033333, 48.864716, -26.18948040,
       35.443707, -15.793889, 42.373611, 40.71578,  41.619549, 35.514706, 40.440624, 19.432608, 40.633331, 40.719074,
       -35.282001, 51.509865,  45.523064, 24.175903,  1.290270,  -37.020100, 44.000000, 55.860916, 22.572645, 
       2.200844, 39.106667,  55.953251, 39.1638, 38.960213, 41.487076, 12.583, 53.551086, 59.911491, 34.746483, 
       -41.24500000, 32.764233, 19.076090, -45.031162, 35.822994, 34.273178, 31.963158, 23.232513, 35.443707,
       37.804363, 51.518875, 51.477928, 39.983334,  39.530895, 52.377956, 41.881832, 12.800000, 41.161079, 
       23.232513,  48.864716,  53.400002,  18.963253,  50.909698, 40.792240, 41.093842, 11.623377, 25.473034]

In [None]:
lon = [121.522179, -111.876183, -77.042793, -82.968471, 114.177216,  81.299110, -122.431297, 75.352478, 113.552971, 
       -73.935242, -95.995102, -75.165222,  -6.266155, 7.742615, 76.230415, 18.063240, 78.404169444444, -75.525681, 
       78.5636, 78.1999992, 85.158875, 0.100000, 139.839478, 72.831062, -73.308549,  -1.898575,  31.233334, 2.349014,
       -58.22428060, 139.638031, -47.882778, -71.110558,  -110.898227, -93.598022,  -89.912506, -79.995888,  -99.133209,
        17.933332, -74.050552, 149.128998, -0.118092, -122.676483, 88.280182, 103.851959, 144.964600, -72.699997, 
       -4.251433, 88.363892, 102.240143, -94.676392, -3.188267,  -119.7674, -95.277390,  -90.589691, 43.417, 
       9.993682, 10.757933,  -92.289597, 174.79422000, 129.872696, 72.877426, 168.662643,  139.753493,  -77.818047, 
        35.930359,  87.863419, 139.638031, -122.271111, -0.149895, -0.001545, -82.983330, -119.814972,  4.897070, 
        -87.623177, 45.033333, -104.805450, 87.863419, 2.349014,  -2.983333,  72.931442, -1.404351, -73.138260, 
       -85.139236,  92.726486, 81.878357]

In [None]:
countries = ["China", "USA", "Peru", "USA", "Hong Kong", "India", "USA", "India", "China", "USA", "USA", "USA", "Ireland", 
             "Italy", "India", "Sweden", "India", "UK", "India", "India", "India", "France", "Japan", "India", "USA", 
              "UK", "Egypt", "France", "Argentina", "Japan", "Brazil", "USA", "USA", "USA", "USA", "USA", "Mexico",
             "Italy", "USA", "Australia", "UK", "USA", "India", "Singapore", "Australia", "USA", "UK", "India", "Malaysia",
             "USA", "UK", "USA", "USA", "USA", "Ocean", "Germany", "Norway", "USA", "New Zealand", "Japan", "India", 
             "Australia", "Japan", "USA", "Jordan", "India", "Japan", "USA", "UK", "UK", "USA", "USA", "Netherlands", 
             "USA", "Yemen", "USA", "India", "France", "UK", "India", "UK", "USA", "USA", "India", "India"] 

In [None]:
data = pd.DataFrame(places, columns = ["cities"])

In [None]:
data["lat"] = lat

In [None]:
data["lon"] = lon

In [None]:
data["countries"] = countries

In [None]:
data
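
Because the four lists are typed by hand, one missing value would shift every later city to the wrong coordinates or make the column assignment fail. A small length check before building the dataframe can catch that early. This is only a sketch; the helper name and the miniature lists are made up.

```python
def check_same_length(**columns):
    """Raise ValueError if the named lists differ in length; otherwise return that length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Made-up miniature lists; with the real data you would pass places, lat, lon and countries
n_rows = check_same_length(cities=["London", "Paris"], lat=[51.5, 48.9], lon=[-0.1, 2.3])
print(n_rows)
```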

# Geopy

And now let's try another python library called GEOPY that will tell us the coordinates of our cities! If you are curious, you can read the documentation in here: https://geopy.readthedocs.io/en/stable/. For a faster tutorial you can have a look at https://pypi.org/project/geopy/

In [None]:
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter  # needed for the rate-limited geocode below

In [None]:
# Initialize the Nominatim geocoder
geolocator = Nominatim(user_agent="MyApp", timeout = 5)
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

In [None]:
#Choose a city
location = geolocator.geocode("Paris")

In [None]:
location

In [None]:
print("The latitude of the location is: ", location.latitude)
print("The longitude of the location is: ", location.longitude)

In [None]:
paris = [location.latitude, location.longitude]

In [None]:
paris

Let's scale that to our full dataset of cities (so: if we have a file with all the GPE locations, we can feed it into this script, which will be much faster than looking up each place by hand!)

In [None]:
places

In [None]:
city_coords = []

In [None]:
for city in places:
    location = geocode(city)  # rate-limited wrapper defined above
    if location:
        city_coords.append((location.point.latitude, location.point.longitude))
    else:
        city_coords.append(None)

When we get a None value, it means that geopy could not find that place

In [None]:
city_coords
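
Some names appear more than once in `places`, so each repeat costs an extra request. The sketch below looks up every distinct name only once and keeps `None` for unknown places. It takes the geocoding function as a parameter, so it can be tried offline; `geocode_all`, `FakeLocation` and the lookup data are invented for the demo, and with the notebook you would pass the rate-limited `geocode` instead.

```python
from collections import namedtuple


def geocode_all(names, geocode):
    """Look up each distinct name once; return coordinates in the original order.

    `geocode` is any callable returning an object with .latitude and
    .longitude, or None when the place is unknown.
    """
    cache = {}
    for name in names:
        if name not in cache:
            location = geocode(name)
            cache[name] = (location.latitude, location.longitude) if location else None
    return [cache[name] for name in names]


# Offline stand-in for a real geocoder, with made-up lookup data
FakeLocation = namedtuple("FakeLocation", "latitude longitude")
fake_db = {"Paris": FakeLocation(48.86, 2.35)}
calls = []


def fake_geocode(name):
    calls.append(name)  # record every lookup to show the cache at work
    return fake_db.get(name)


coords = geocode_all(["Paris", "Atlantis", "Paris"], fake_geocode)
print(coords, len(calls))
```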

# 5. And now we visualize things!

Let's first try this map.

#### A. Mapbox Maps

Mapbox maps are also called tile-based maps and they allow you to zoom in "google maps" style. For more information have a look at: https://plotly.com/python/mapbox-layers/

### Activity for you

Change the color_discrete_sequence = [] argument from "fuchsia" to "green". You can try other colours!

In [None]:
import pandas as pd
import plotly.express as px

fig = px.scatter_mapbox(data, lat="lat", lon="lon", hover_name="cities", hover_data=["countries"], #this is the text
                        color_discrete_sequence=["fuchsia"], zoom=3, height=300)                     #that goes inside the boxes

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

fig.show()

#### Activity for you

Move your mouse to the top right corner of the map and click on the camera icon, where it says "Download plot as PNG". You will be able to download your map to your own laptop!

#### B. Geo maps

Geo Maps only show the physical boundaries of countries. Have a look at: https://plotly.com/python/map-configuration/

In [None]:
data['text'] = data['cities'] + ', ' + data["countries"].astype(str)

fig = go.Figure(data=go.Scattergeo(
        lon = data['lon'],
        lat = data['lat'],
        text = data['text'],
        mode = 'markers',
        ))

fig.update_layout(
        title = 'Around the World in 80 days',
        geo_scope='world',
    )
fig.show()

Which one do you like the most?

# Exercise 1

### Contents from: Harry Potter around the World.ipynb
#### Summary and explanation:

# Mapping the world of Harry Potter

**Script source:** several queries to Perplexity AI!

Harry Potter is one of the most translated (and popular) books around the world and it is available in 85 languages! (https://en.wikipedia.org/wiki/List_of_Harry_Potter_translations)

Let's write a Python script to visualize this on a map!

# 1. We import the libraries

In [None]:
#Your code in here

In [None]:
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py 
from plotly.figure_factory import create_table

import geopy

# 2. We create a variable with the capitals of the countries where Harry Potter has been translated

In the real world you would need to do this step yourself! How would you do this using Python?

Also: whenever a country has more than one language (e.g. South Africa: English and Afrikaans), I have used two cities in that country (e.g. Pretoria and Cape Town) to show linguistic diversity!
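One possible way to build such a list yourself is to keep a mapping from country to cities and then flatten it. This is only a sketch: `cities_by_country` is a hypothetical name, and only the South Africa entry comes from the text above.

```python
# Hypothetical mapping from country to the cities chosen for it.
# A country with two languages gets two cities.
cities_by_country = {
    "United Kingdom": ["London"],
    "South Africa": ["Pretoria", "Cape Town"],
}

# Flatten the mapping into a single list of city names
cities_flat = [city for group in cities_by_country.values() for city in group]

print(cities_flat)
```

Keeping the country next to each city also makes it easier to check later why a given city is in the list.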

In [None]:
cities = ["London", "Dublin", "Sydney", "Wellington", "Toronto", "Pretoria", "Cape Town", "Washington DC", "New Delhi", "Kuala Lumpur", 
          "Manila", "Singapore", "Tirana", "Pristina", "Cairo", "Yerevan", "Oviedo", "Baku", "Bilbao", "Minsk", "Dhaka", "Sarajevo", 
          "Rennes", "Sofia", "Barcelona", "Beijing", "Taiwan", "Hong Kong", "Macau", "Taiwan", "Zagreb", "Prague", "Copenhagen", 
          "Ghent", "Amsterdam", "Paramaribo", "Tallinn", "Tórshavn", "Pasay", "Helsinki", "Brussels", "Quebec", "Paris", "Monaco",
          "Lausanne", "Luxembourg", "Leeuwarden", "Santiago", "Tbilisi", "Vienna", "Berlin", "Vaduz", "Zurich", "Echternach",
          "Hamburg", "Athens", "Thessaloniki", "Nuuk", "Gandhinagar", "Honolulu", "Jerusalem", "Mumbai", "Budapest", 
          "Reykjavik", "Jakarta", "Galway", "Belfast", "Rome", "San Marino", "Lugano", "Tokyo", "Phnom Penh", "Lahore", 
          "Seoul", "Milan", "Riga", "Vilnius", "Diekirch", "Skopje", "Kuantan", "Thiruvananthapuram", "Auckland", 
          "Nagpur", "Ulaanbaatar", "Kathmandu", "Trondheim", "Bordeaux", "Girona", "Tehran", "Warsaw", "Lisboa", 
          "Brasília", "Bucharest", "Chișinău", "Moscow", "Edinburgh", "Belgrade", "Podgorica", "Trebinje", "Colombo", 
          "Bratislava", "Ljubljana", "Madrid", "Rosario", "Buenos Aires", "Stockholm", "Nyland", "Chennai", "Amaravati",
          "Bangkok", "Lhasa", "Ankara", "Kyiv", "Islamabad", "Hanoi", "Cardiff", "Jerusalem", "Mostar", "New York City", 
          "Utrecht", "Krakow", "Sibiu", "Gothenburg", "Horlivka"]

In [None]:
len(cities)
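The list contains a few repeated entries ("Taiwan" and "Jerusalem" each appear twice), so its length is not the number of distinct places. A quick way to spot repeats, sketched here on a short excerpt of the list (`cities_sample` is a name made up for this illustration), is `collections.Counter`:

```python
from collections import Counter

# A short excerpt of the `cities` list above, kept small for illustration
cities_sample = ["Beijing", "Taiwan", "Hong Kong", "Macau", "Taiwan", "Zagreb"]

# Count how often each name occurs and keep those that occur more than once
counts = Counter(cities_sample)
duplicates = [city for city, n in counts.items() if n > 1]

print(duplicates)
```

Whether to keep or drop the repeats is a judgment call: a repeated city is geocoded twice and drawn as two overlapping markers.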

# 3. We create a list with the Latitude and Longitude of those cities using Geopy

Let's first practice getting the lat and lon of three major English-speaking cities: London, Dublin, and New York City.

In [None]:
#Your code in here

In [None]:
#Solution

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

In [None]:
geolocator = Nominatim(user_agent="MyApp", timeout = 5)
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

In [None]:
# Use the rate-limited `geocode` wrapper so requests are spaced at least 1 second apart
london = geocode("London")
dublin = geocode("Dublin")
nyc = geocode("New York")

In [None]:
london

In [None]:
print(f"The lat and long of {london} is: ", london.latitude, london.longitude)
print(f"The lat and long of {dublin} is: ", dublin.latitude, dublin.longitude)
print(f"The lat and long of {nyc} is: ", nyc.latitude, nyc.longitude)

Now let's do that for all the cities in our list!

In [None]:
#Your code in here

In [None]:
#Solution
city_coords = []

In [None]:
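for city in cities:
    # Go through the rate-limited wrapper: one request per second at most
    location = geocode(city)
    if location:
        city_coords.append((location.point.latitude, location.point.longitude))
    else:
        # Keep a placeholder so city_coords stays aligned with cities
        city_coords.append(None)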
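The same loop can be wrapped in a small function that accepts any geocoding callable, which makes it easy to try out without network access. This is a sketch under assumptions: `geocode_all` and `fake_geocode` are hypothetical names, and the fake geocoder only imitates the `latitude` and `longitude` attributes of a geopy location.

```python
from types import SimpleNamespace


def geocode_all(names, geocode):
    """Return one (lat, lon) tuple per name, or None where the lookup fails."""
    coords = []
    for name in names:
        location = geocode(name)
        if location is not None:
            coords.append((location.latitude, location.longitude))
        else:
            coords.append(None)
    return coords


# Hypothetical offline stand-in for a real geocoder (illustrative coordinates)
_known = {"London": (51.5074, -0.1278)}


def fake_geocode(name):
    if name in _known:
        lat, lon = _known[name]
        return SimpleNamespace(latitude=lat, longitude=lon)
    return None


result = geocode_all(["London", "Hogwarts"], fake_geocode)
print(result)
```

In the notebook you would pass the rate-limited `geocode` defined earlier instead of `fake_geocode`, for example `geocode_all(cities, geocode)`.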

In [None]:
city_coords

In [None]:
len(city_coords)

# 4. Pandas Data Frame

Now let's create a pandas DataFrame that contains our cities and their lat and lon.

First let's create a column with the names of the cities

In [None]:
#Your code in here

In [None]:
#Solution
harry_potter = pd.DataFrame(cities, columns = ["Cities"])

In [None]:
harry_potter

Now create two variables: one for latitude and one for longitude

In [None]:
#Your code in here

In [None]:
#Solution

In [None]:
type(city_coords)

In [None]:
city_coords[0]

In [None]:
type(city_coords[0])

In [None]:
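# Cities that geopy could not find are stored as None, so guard before indexing
lat = [x[0] if x else None for x in city_coords]
lon = [x[1] if x else None for x in city_coords]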

In [None]:
len(lat)

In [None]:
len(lon)

And now let's add those columns to our data frame

In [None]:
#Your code in here

In [None]:
#Solution
harry_potter["lat"] = lat
harry_potter["lon"] = lon

In [None]:
harry_potter

# 5. And now let's visualize things!

Change the colour to red (for Gryffindor!)

In [None]:
# hover_name sets the text that goes inside the hover boxes
fig = px.scatter_mapbox(harry_potter, lat="lat", lon="lon", hover_name="Cities",
                        color_discrete_sequence=["fuchsia"], zoom=3, height=300)

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

fig.show()

And now change the colour to green (for Slytherin!)

In [None]:
import plotly.graph_objects as go

# 'harry_potter' is our data frame with 'Cities', 'lat', and 'lon' columns
harry_potter['marker_color'] = 'blue'  # Assign a default color
# Recolor rows whose city name matches a pattern. No city in our list contains
# 'Harry Potter', so adjust the pattern if you want to highlight specific cities.
harry_potter.loc[harry_potter['Cities'].str.contains('Harry Potter'), 'marker_color'] = 'red'

fig = go.Figure(data=go.Scattergeo(
    lon = harry_potter['lon'],
    lat = harry_potter['lat'],
    text = harry_potter['Cities'],
    mode = 'markers',
    marker = dict(
        color = harry_potter['marker_color'])))

fig.update_layout(
    title = 'Harry Potter Translations',
    geo_scope = 'world'
)

fig.show()