# Explore Existing Datasets

## CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

* https://aclanthology.org/2020.emnlp-main.480.pdf
* https://www.statmt.org/cc-aligned/

**Summary**

CCAligned consists of *parallel or comparable web-document pairs in 137 languages aligned with English*.

These pairs were constructed by performing language identification on raw web documents and checking that the language codes in the URLs of the paired documents correspond.

This pattern-matching approach yielded more than 100 million aligned documents paired with English. Because an English document is often aligned to multiple documents in different target languages, we can join on the English documents to obtain pairs that directly align two non-English documents (e.g., Arabic-French).
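
The join on English documents can be sketched as follows. This is a minimal illustration, assuming each aligned pair has already been reduced to an (English URL, target URL) tuple; the function name `pivot_on_english` and the toy URLs are hypothetical and not part of the CCAligned release.

```python
from collections import defaultdict

def pivot_on_english(pairs_a, pairs_b):
    """Pair documents of two target languages that share the same English document.

    pairs_a, pairs_b: iterables of (english_url, target_url) tuples.
    Returns a list of (target_url_a, target_url_b) tuples.
    """
    # Index the second language by its English URL
    by_english = defaultdict(list)
    for en_url, url_b in pairs_b:
        by_english[en_url].append(url_b)

    # Join the first language against that index
    joined = []
    for en_url, url_a in pairs_a:
        for url_b in by_english.get(en_url, []):
            joined.append((url_a, url_b))
    return joined

# Toy data (hypothetical URLs)
en_hi = [("example.com/en/a", "example.com/hi/a"),
         ("example.com/en/b", "example.com/hi/b")]
en_ta = [("example.com/en/a", "example.com/ta/a"),
         ("example.com/en/c", "example.com/ta/c")]

hi_ta = pivot_on_english(en_hi, en_ta)
print(hi_ta)
```

Only the English document present in both lists produces a Hindi-Tamil pair; English documents aligned to a single target language are dropped by the join.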

In [6]:
!mkdir hindi tamil # directory to store hindi and tamil data

In [None]:
# Download Hindi-English Document Aligned Corpus
!curl -O https://data.statmt.org/cc-aligned/en_XX-hi_IN.tsv.xz

# Download Tamil-English Document Aligned Corpus
!curl -O https://data.statmt.org/cc-aligned/en_XX-ta_IN.tsv.xz

In [None]:
# Unzip Hindi-English Aligned Documents
!xz -d -k -f -c en_XX-hi_IN.tsv.xz > /content/hindi/en_XX-hi_IN.tsv

# Unzip Tamil-English Aligned Documents
!xz -d -k -f -c en_XX-ta_IN.tsv.xz > /content/tamil/en_XX-ta_IN.tsv

In [None]:
import pandas as pd

data = pd.read_csv('/content/hindi/en_XX-hi_IN.tsv', delimiter='\t', encoding='utf-8')

# Access individual columns using column names
column_names = data.columns
print(column_names)

# Process the data as needed
print(data.head())
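
Since the corpus files can be large, it may be convenient to peek at the first few rows straight from the compressed archive before decompressing and loading everything. The sketch below uses only the standard library; the helper name `iter_tsv_rows` is hypothetical, and no assumption is made about the number or meaning of the columns.

```python
import lzma
import os
import tempfile

def iter_tsv_rows(path, limit=None):
    """Yield each row of an .xz-compressed TSV file as a list of column values."""
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for i, line in enumerate(handle):
            if limit is not None and i >= limit:
                break
            yield line.rstrip("\r\n").split("\t")

# Demo on a tiny temporary file so the snippet runs without the full download
demo_path = os.path.join(tempfile.mkdtemp(), "sample.tsv.xz")
with lzma.open(demo_path, mode="wt", encoding="utf-8") as handle:
    handle.write("a\tb\tc\n1\t2\t3\n")

rows = list(iter_tsv_rows(demo_path, limit=1))
print(rows)
```

On the real data the call would look like `iter_tsv_rows("en_XX-hi_IN.tsv.xz", limit=5)`, which reads only as much of the archive as is needed for five rows.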


# **Building Datasets**

## **Approach**

Let's write a script that randomly selects 20,000 articles from Tamil Wikipedia and extracts the title and URI of each. I will then loop through this list to check whether each article also has a Hindi and an English version.

My assumption is that Tamil has the least data of the three languages, so an article that exists in Tamil will likely exist in Hindi and certainly in English.
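
One way to test this assumption cheaply is to ask the MediaWiki API for random Tamil articles together with their interlanguage links in a single request, then keep only the pages linked to both Hindi and English. The sketch below is an assumption-laden outline rather than the notebook's actual method: the helper names and the User-Agent string are hypothetical, and because `lllimit` applies to the whole response, a full implementation would also need to follow the API's `continue` tokens.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://ta.wikipedia.org/w/api.php"

def build_params(n_pages):
    """Query parameters: n random main-namespace pages plus their language links."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "generator": "random",
        "grnnamespace": 0,
        "grnlimit": n_pages,
        "prop": "langlinks",
        "lllimit": "max",
    }

def filter_pages(response):
    """Return titles of pages that link to both a Hindi and an English version."""
    pages = response.get("query", {}).get("pages", [])
    kept = []
    for page in pages:
        langs = {link["lang"] for link in page.get("langlinks", [])}
        if {"hi", "en"} <= langs:
            kept.append(page["title"])
    return kept

def fetch_random_sample(n_pages=20):
    """Perform the request (needs network access; not called in this demo)."""
    url = API_URL + "?" + urllib.parse.urlencode(build_params(n_pages))
    # Hypothetical User-Agent; replace with real contact details
    request = urllib.request.Request(url, headers={"User-Agent": "comparable-corpus-notebook/0.1"})
    with urllib.request.urlopen(request) as reply:
        return filter_pages(json.load(reply))

# Offline demo with a hand-written response in the formatversion=2 layout
mock_response = {"query": {"pages": [
    {"title": "A", "langlinks": [{"lang": "en", "title": "A"}, {"lang": "hi", "title": "A-hi"}]},
    {"title": "B", "langlinks": [{"lang": "en", "title": "B"}]},
    {"title": "C"},
]}}
print(filter_pages(mock_response))
```

The share of sampled pages that survive the filter gives a rough estimate of how many Tamil articles need to be drawn to reach 20,000 three-way matches.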

In [3]:
!pip install --upgrade pip
!pip install wikipedia-api
!pip install mwclient
!pip install SPARQLWrapper

## **Number of Articles in Wikipedia**

As of May 17, 2023:
1. Tamil Wikipedia has **153,998** articles.
2. Hindi Wikipedia has **156,511** articles. 
3. English Wikipedia has **6,657,489** articles.

In [None]:
# CODE BY CHATGPT 3.5
import requests
import locale

def get_article_count(language):
    url = f"https://{language}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "format": "json",
        "meta": "siteinfo",
        "siprop": "statistics",
    }
    response = requests.get(url, params=params)
    data = response.json()
    return data["query"]["statistics"]["articles"]

# Use the system's default locale so that numbers get its thousands separator
locale.setlocale(locale.LC_ALL, "")

# Get number of articles in Tamil Wikipedia
tamil_count = int(get_article_count("ta"))
formatted_tamil_count = locale.format_string("%d", tamil_count, grouping=True)
print("Number of articles in Tamil Wikipedia:", formatted_tamil_count)

# Get number of articles in Hindi Wikipedia
hindi_count = int(get_article_count("hi"))
formatted_hindi_count = locale.format_string("%d", hindi_count, grouping=True)
print("Number of articles in Hindi Wikipedia:", formatted_hindi_count)

# Get number of articles in English Wikipedia
english_count = int(get_article_count("en"))
formatted_english_count = locale.format_string("%d", english_count, grouping=True)
print("Number of articles in English Wikipedia:", formatted_english_count)

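# Collect the counts in one dictionary for use in later cells
article_counts = {"ta": tamil_count, "hi": hindi_count, "en": english_count}
print(article_counts)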

## **Requirement**
The dataset is a **3-way comparable corpus in English-Hindi-Tamil**.

### **W1** (n = 20,000)
* For comparing monolingual models (Neural-ProdLDA and LDA) with ZeroShotTM in each language (hi, en, ta).
* For calculating NPMI Coherence score

### **W2** (n = 100,000)

**METRICS FOR EVALUATION**
* *Matches (Mat)*
* *KL Divergence (KL)*
* *Centroid Embeddings (CD)*
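
As a rough sketch of how the first two metrics could be computed, assume each test document has one predicted topic distribution for its English version and one for its version in another language. The definitions used here (Matches as the share of documents whose most probable topic agrees, KL as the mean KL divergence between the two distributions, with English as the reference) are my reading of the metric names and not a confirmed specification; the function names and toy numbers are hypothetical.

```python
import math

def top_topic(distribution):
    """Index of the most probable topic."""
    return max(range(len(distribution)), key=distribution.__getitem__)

def matches(dists_en, dists_other):
    """Share of documents whose top topic is the same in both languages."""
    same = sum(top_topic(p) == top_topic(q) for p, q in zip(dists_en, dists_other))
    return same / len(dists_en)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps avoids division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(dists_en, dists_other):
    """Average KL divergence over all document pairs."""
    values = [kl_divergence(p, q) for p, q in zip(dists_en, dists_other)]
    return sum(values) / len(values)

# Toy topic distributions for two documents and three topics
dists_en = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
dists_hi = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]

print("Mat:", matches(dists_en, dists_hi))
print("KL:", mean_kl(dists_en, dists_hi))
```

Higher Mat and lower KL both indicate that the model assigns similar topics to the same document across languages.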

**AUTOMATIC EVALUATION**

*Baseline 1 (Ori)*

* Performs topic modeling on documents translated into English via DeepL.
* This is an easily accessible baseline, but automatic translation may introduce bias in the representations.

*Baseline 2 (Uni)*

* Computes all the metrics over a uniform distribution.
* This baseline gives a lower bound.
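
To make the uniform baseline concrete, the sketch below compares a predicted topic distribution against a uniform one. It assumes the KL metric is computed as KL(p || uniform), which reduces to log(K) minus the entropy of p for K topics; that direction, and the helper names, are my assumptions.

```python
import math

def uniform_distribution(n_topics):
    """Distribution that assigns equal probability to every topic."""
    return [1.0 / n_topics] * n_topics

def kl_to_uniform(p):
    """KL(p || uniform) = sum_i p_i * log(p_i * K), with K = number of topics."""
    n_topics = len(p)
    return sum(pi * math.log(pi * n_topics) for pi in p if pi > 0)

n_topics = 4
print(kl_to_uniform(uniform_distribution(n_topics)))  # uniform vs. uniform
print(kl_to_uniform([1.0, 0.0, 0.0, 0.0]))            # fully peaked vs. uniform
print(1 / n_topics)                                    # chance level for a top-topic match
```

A fully peaked prediction is as far from uniform as possible (log K), and a top-topic match against a uniform guess happens with probability 1/K.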

**MANUAL EVALUATION**

*Rating of Predicted Topics*

* We rated the predicted topics for 300 test documents in five languages on an ordinal scale from 0 to 3.
* A rating of 0 means the predicted topic is wrong, 1 means it is somewhat related, 2 means it is good, and 3 means it is entirely associated with the document.

*Inter-rater Reliability*

* We evaluated the inter-rater reliability using Gwet AC1 with ordinal weighting.
* The resulting value of 0.88 indicates consistent scoring.
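
For reference, here is a small sketch of Gwet's agreement coefficient with ordinal weights for the two-rater case, as I understand the formula: observed weighted agreement is corrected by a chance term built from the average category shares. Whether the reported 0.88 was computed with exactly this two-rater form is not stated, and the function names and toy ratings are hypothetical.

```python
def ordinal_weights(q):
    """Ordinal agreement weights for q >= 2 ordered categories (1.0 on the diagonal)."""
    raw = [[(abs(k - l) + 1) * abs(k - l) / 2 for l in range(q)] for k in range(q)]
    largest = max(max(row) for row in raw)
    return [[1 - value / largest for value in row] for row in raw]

def gwet_ac_ordinal(ratings_a, ratings_b, q):
    """Weighted Gwet coefficient for two raters with categories 0 .. q-1."""
    n = len(ratings_a)
    w = ordinal_weights(q)
    # Observed weighted agreement
    pa = sum(w[a][b] for a, b in zip(ratings_a, ratings_b)) / n
    # Average share of each category across both raters
    pi = [(ratings_a.count(k) + ratings_b.count(k)) / (2 * n) for k in range(q)]
    # Chance agreement
    total_weight = sum(sum(row) for row in w)
    pe = total_weight / (q * (q - 1)) * sum(p * (1 - p) for p in pi)
    return (pa - pe) / (1 - pe)

# Toy ratings on the 0-3 scale
rater_1 = [0, 1, 2, 3, 3, 2]
rater_2 = [0, 1, 2, 3, 2, 2]
print(gwet_ac_ordinal(rater_1, rater_2, q=4))
```

With ordinal weights, a disagreement between adjacent ratings (e.g. 2 vs. 3) is penalised less than one between distant ratings (e.g. 0 vs. 3).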

In [None]:
!pip install mwapi

In [None]:
import wikipediaapi
import wikipedia
import random

article_counts = {'ta': tamil_count, 'hi': hindi_count, 'en': english_count}

# Total counts of articles for each language
total_counts = article_counts

# Language codes
languages = ['ta', 'hi', 'en']

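# Number of articles to select (small trial run; the target is 20,000)
num_articles = 20

# Seed Python's random module for reproducibility
# (this does not affect the server-side sampling done by wikipedia.random())
random.seed(20)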

# Initialize lists to store the titles and URIs of selected articles
titles = []
uris = []

import mwapi

session = mwapi.Session("https://{lang}.wikipedia.org".format(lang=languages[0]))

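# Returns the title of a Wikipedia article, or None if the page does not exist
def get_article(title):
    response = session.get(action="query", prop="info", titles=title, formatversion=2)
    pages = response['query']['pages']
    for page in pages:
        # With formatversion=2, nonexistent pages carry a 'missing' flag
        if 'title' in page and not page.get('missing', False):
            return page['title']
    return None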

# Set language to Tamil
wikipedia.set_lang('ta')

# Get random article titles from Tamil Wikipedia
random_pages = [wikipedia.random(1) for _ in range(num_articles)]

# Store and print the titles of the randomly selected articles
titles = list(random_pages)
for page in titles:
    print("Article Title:", page)

In [None]:
import wikipediaapi

# Set language to Tamil
# (newer wikipedia-api releases may also require a user_agent argument)
wiki_wiki = wikipediaapi.Wikipedia('ta')

enc = 0  # Tamil articles that link to an English version
hic = 0  # Tamil articles that link to a Hindi version
tot = 0  # Tamil articles that link to both
# Look up each sampled title in Tamil Wikipedia
for title in titles:
    page = wiki_wiki.page(title)

    # Check if the article exists in English and Hindi
    langlinks = page.langlinks
    exists_hi = 'hi' in langlinks
    exists_en = 'en' in langlinks
    hic += exists_hi
    enc += exists_en

    # Print the result
    if exists_hi and exists_en:
        tot += 1
        print("hi/en yes")
print("Total: " + str(tot))
print(hic, enc)

In [None]:
import wikipedia
import wikipediaapi

# Set language to Tamil
wikipedia.set_lang('ta')
wiki_tamil = wikipediaapi.Wikipedia('ta')

target = 100         # articles needed that exist in both Hindi and English
max_attempts = 2000  # safety cap so the loop always terminates

enc = 0       # sampled articles with an English version
hic = 0       # sampled articles with a Hindi version
attempts = 0  # number of random articles inspected

# List to store the retrieved article titles
article_titles = []

# Keep sampling until enough articles are found or the cap is reached.
# Real titles come from the random-page API, because numeric indices
# do not correspond to article titles.
while len(article_titles) < target and attempts < max_attempts:
    attempts += 1
    title = wikipedia.random(1)

    # Get the page and its interlanguage links
    langlinks = wiki_tamil.page(title).langlinks
    exists_hi = 'hi' in langlinks
    exists_en = 'en' in langlinks
    hic += exists_hi
    enc += exists_en

    # Keep the title if the article exists in both English and Hindi
    if exists_hi and exists_en and title not in article_titles:
        article_titles.append(title)

# Print the retrieved article titles
for title in article_titles:
    print("Title:", title)

# Print the final counts
print("Articles inspected:", attempts)
print("Total articles:", len(article_titles))
print("Hindi articles:", hic)
print("English articles:", enc)



In [None]:
!pip install wikipedia

In [None]:
import wikipediaapi

wiki_tamil = wikipediaapi.Wikipedia('ta')

# Function to check if a Tamil article has a counterpart in a given language
def check_article_exists(title, language):
    page = wiki_tamil.page(title)
    if not page.exists():
        return False
    return language in page.langlinks

# Loop through the list of Tamil article titles collected above
for title in article_titles:
    # Check if the article exists in Hindi
    exists_hi = check_article_exists(title, 'hi')

    # Check if the article exists in English
    exists_en = check_article_exists(title, 'en')

    # Print the article title and its availability in Hindi and English
    print(f"Article Title: {title}")
    print(f"Available in Hindi: {exists_hi}")
    print(f"Available in English: {exists_en}")
    print()


## Comparable Corpus Creation from Wikipedia

This will be a novel contribution.

In [None]:
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?article ?abstract WHERE {
    ?article dbo:abstract ?abstract .
    # Language tag of the abstracts to retrieve ("hi" = Hindi)
    FILTER (lang(?abstract) = "hi")
}
LIMIT 10
"""

In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()


In [None]:
for result in results["results"]["bindings"]:
    print(result)


In [None]:
import wikipediaapi

# Set language to English
wiki_wiki = wikipediaapi.Wikipedia('en')

# Get the page for "India"
page = wiki_wiki.page("India")

# Check if the article exists in Tamil
exists_ta = False
if 'ta' in page.langlinks:
    exists_ta = True

# Print the result
if exists_ta:
    print("Article is available in Tamil")
else:
    print("Article is not available in Tamil")



Article is available in Tamil


In [None]:
import wikipediaapi

# Create a Wikipedia API object
wiki_wiki = wikipediaapi.Wikipedia('en')

# Retrieve the article in English
page_en = wiki_wiki.page("India (country)")
summary_en = page_en.summary
print("Summary (English):")
print(summary_en)

# Retrieve the article in Hindi
wiki_wiki = wikipediaapi.Wikipedia('hi')
page_hi = wiki_wiki.page("भारत")
summary_hi = page_hi.summary
print("Summary (Hindi):")
print(summary_hi)

# Retrieve the article in Tamil
wiki_wiki = wikipediaapi.Wikipedia('ta')
page_ta = wiki_wiki.page("இந்தியா")
summary_ta = page_ta.summary
print("Summary (Tamil):")
print(summary_ta)



Summary (English):
India, officially the Republic of India (ISO: Bhārat Gaṇarājya), is a country in South Asia. It is the seventh-largest country by area, the most populous country as of 1 May 2023, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia. 
Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.
Their long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic diversity. Settled life emerged on the subcontinent in the western margins of the Ind
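
Rather than typing each title by hand, the Hindi and English counterparts can be reached through the `langlinks` of the Tamil page, which map language codes to page objects. The sketch below assembles one three-way record this way; the function name `build_record` is hypothetical, and stand-in objects replace real `wikipediaapi` pages so that the snippet runs offline.

```python
from types import SimpleNamespace

def build_record(page_ta):
    """Return title, URL and summary for ta/hi/en, or None if a language is missing."""
    links = page_ta.langlinks
    if "hi" not in links or "en" not in links:
        return None
    pages = {"ta": page_ta, "hi": links["hi"], "en": links["en"]}
    return {
        lang: {"title": p.title, "url": p.fullurl, "summary": p.summary}
        for lang, p in pages.items()
    }

# Stand-in pages (hypothetical data mimicking the attributes used above)
fake_en = SimpleNamespace(title="India", fullurl="https://en.wikipedia.org/wiki/India",
                          summary="English summary", langlinks={})
fake_hi = SimpleNamespace(title="भारत", fullurl="https://hi.wikipedia.org/wiki/भारत",
                          summary="Hindi summary", langlinks={})
fake_ta = SimpleNamespace(title="இந்தியா", fullurl="https://ta.wikipedia.org/wiki/இந்தியா",
                          summary="Tamil summary", langlinks={"en": fake_en, "hi": fake_hi})

record = build_record(fake_ta)
print(record["hi"]["title"], record["en"]["title"])

# With the real library the call would be:
# record = build_record(wikipediaapi.Wikipedia('ta').page("இந்தியா"))
```

Returning `None` for incomplete triples makes it easy to filter a list of sampled Tamil pages down to the three-way comparable subset.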

In [None]:

# article_indices = random.sample(range(1, total_counts['ta'] + 1), num_articles)

# # "பகுப்பு:{index}" represents an article title in the form of "Category:index" or "Category followed by the index number"
# for index in article_indices:
#     title = "பகுப்பு:{index}".format(index=index)
#     article = get_article(title)
#     # Process the article as needed
#     print(article)