
### Neural Machine Translation (Language Codes)
In this notebook we are going to scrape the languages together with their language codes from [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).
___

Topic: `NMT`

Date: `2022/07/24`

Programming Language: `python`

Main: `Natural Language Processing (NLP)`

___

### Imports
In the following code cell we import the Python packages used in this notebook.

In [5]:
from google.colab import files, drive
import os, json
import requests
from bs4 import BeautifulSoup as bs

### Mounting the drive

In the following code cell we are going to mount our Google Drive, since we are going to save our languages file there.

In [6]:
drive.mount('/content/drive')

Mounted at /content/drive


### Defining our path.

This is the path where we are going to save our `languages.json` file.

In [7]:
base_dir = '/content/drive/My Drive/NLP Data/nmt'

assert os.path.exists(base_dir), f"The path '{base_dir}' does not exist, check if you have mounted the Google Drive."

save_path = os.path.join(base_dir, "languages.json")

### Getting the Content of the website
 
In the following code cell we are going to get the content of the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) using the Python `requests` library.

In [8]:
url = "https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes"
html = requests.get(url)
html

<Response [200]>

### Soup Object

In the following code cell we are going to create a `soup` object.

In [9]:
soup = bs(html.content, 'html.parser')

### Defining columns
The following columns will be used as the keys in our `languages.json` file.

In [22]:
columns = ["Language", "ISO 639-1", "ISO 639-2/T", "ISO 639-2/B", "ISO 639-3", "Description"]

### Scraping the language table.

In the following code cell we are going to scrape the whole table of languages.

In [31]:
# find the first table in the DOM
table = soup.find("table")
# get all the rows (tr) of the table
trs = table.find_all('tr')
data = []
for tr in trs[1:]: # skip the header row
  # collect the text of every cell (td) in the row
  row = tuple(x.text for x in tr.find_all('td'))
  data.append(row)
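
The list comprehension in the next cell unpacks every row into exactly six values, so a row with a different number of cells (for example a row that contains only `th` elements and therefore yields an empty tuple) would raise a `ValueError`. The following is a minimal sketch of a guard, assuming that incomplete rows can simply be skipped. The sample rows and the helper name `keep_complete_rows` are hypothetical and not part of the notebook.

```python
# Hypothetical sample that mimics the tuples collected in `data`.
columns = ["Language", "ISO 639-1", "ISO 639-2/T", "ISO 639-2/B", "ISO 639-3", "Description"]

sample_rows = [
    ("Abkhazian", "ab", "abk", "abk", "abk", "also known as Abkhaz\n"),
    (),  # e.g. a row made only of <th> cells, so no <td> was found
    ("Afar", "aa", "aar", "aar", "aar", "\n"),
    ("Broken", "xx"),  # a row with missing cells
]


def keep_complete_rows(rows, expected_len):
    """Return only the rows that have exactly `expected_len` cells."""
    return [row for row in rows if len(row) == expected_len]


clean_rows = keep_complete_rows(sample_rows, len(columns))
print(len(clean_rows))
```

Filtering on the column count keeps the later six-value unpacking safe, at the cost of silently dropping rows, so it may be worth logging how many rows were skipped.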

### Generating `json` data

In the following code cell we are going to generate the `json_data` as a list of Python dictionaries.

In [43]:
json_data = [ {
              "Language": l, "ISO 639-1": c1, "ISO 639-2/T": c2, "ISO 639-2/B": c3, "ISO 639-3":c4, "Description":d
              } for (l, c1, c2, c3, c4, d) in data
             ]
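
The same records could also be built from the `columns` list defined earlier, which avoids repeating the key names. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it additionally strips surrounding whitespace (such as a trailing newline in the description), and `sample_data` is a hypothetical stand-in for `data`.

```python
columns = ["Language", "ISO 639-1", "ISO 639-2/T", "ISO 639-2/B", "ISO 639-3", "Description"]

# Hypothetical stand-in for the scraped `data` list.
sample_data = [
    ("Abkhazian", "ab", "abk", "abk", "abk", "also known as Abkhaz\n"),
]

# Pair each column name with the matching cell and strip whitespace.
records = [
    {key: value.strip() for key, value in zip(columns, row)}
    for row in sample_data
]

print(records[0]["Description"])
```

Note that `zip` stops at the shorter of its inputs, so a row with fewer cells would produce a record with missing keys rather than an error.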

Checking a single example

In [44]:
json_data[0]

{'Description': 'also known as Abkhaz\n',
 'ISO 639-1': 'ab',
 'ISO 639-2/B': 'abk',
 'ISO 639-2/T': 'abk',
 'ISO 639-3': 'abk',
 'Language': 'Abkhazian'}
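
Since every record carries its ISO 639-1 code, the list can be turned into a lookup table keyed by that code. This is a small usage sketch on hypothetical sample records shaped like `json_data`; the name `code_to_language` is an assumption and does not appear in the notebook.

```python
# Hypothetical records with the same keys as the entries of `json_data`.
sample_records = [
    {"Language": "Abkhazian", "ISO 639-1": "ab"},
    {"Language": "Afar", "ISO 639-1": "aa"},
]

# Map each two-letter code to its language name.
code_to_language = {rec["ISO 639-1"]: rec["Language"] for rec in sample_records}

print(code_to_language["ab"])
```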

Saving our `languages.json` file

In [45]:
with open(save_path, 'w') as writer:
  writer.write(json.dumps(json_data, indent=2))

print("Done")

Done
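
To check that the saved file can be read back unchanged, it can be reloaded with `json.load`. The sketch below writes to a temporary directory instead of Google Drive so that it runs anywhere. The sample records, the temporary path, and the explicit `encoding="utf-8"` and `ensure_ascii=False` settings are assumptions, not what the cell above does.

```python
import json
import os
import tempfile

# Hypothetical records shaped like the entries of `json_data`.
sample_records = [{"Language": "Abkhazian", "ISO 639-1": "ab"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "languages.json")

    # Write the records as indented JSON.
    with open(tmp_path, "w", encoding="utf-8") as writer:
        json.dump(sample_records, writer, indent=2, ensure_ascii=False)

    # Read the file back and parse it.
    with open(tmp_path, encoding="utf-8") as reader:
        loaded = json.load(reader)

print(loaded == sample_records)
```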


Downloading our `languages.json` file

In [46]:
files.download(save_path)

<IPython.core.display.Javascript object>
