# Coding All Stars Task 1

After mirroring the Class Central website with HTTrack, this notebook reads every HTML file in the local copy, extracts its text, and translates it into Hindi.

First, as mentioned, I used HTTrack to copy the website.
Note that some images may appear blurry, but they are not lost. I suspect this is because I set both the `max depth` and `maximum external depth` parameters to 2, so these images may sit in deeper directories. It may also be a mechanism HTTrack uses to keep the project size and the user's storage use down.

For HTML parsing and manipulation, I'll use the BeautifulSoup library, and for text translation, I'll use Google's translation API through the [deep-translator](https://pypi.org/project/deep-translator/) wrapper package.

## Importing Necessary Modules

In [None]:
from bs4 import BeautifulSoup
import os
from deep_translator import GoogleTranslator
from textwrap import wrap
from typing import List, Dict

## Obtaining a list of all HTML file paths 

In [None]:
root_directory = '/home/eyadshokry/Downloads/todo'

def get_html_files_paths(root_dir: str) -> List[str]:
    """Takes a directory path and returns all HTML files below it.

    Args:
      root_dir: Path of the directory that should be traversed.

    Returns:
      List of the paths of all HTML files.
    """
    html_files_list = []

    # os.walk visits root_dir and every subdirectory below it.
    for root, _, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith('.html'):
                file_path = os.path.join(root, filename)
                html_files_list.append(file_path)
    return html_files_list

html_files = get_html_files_paths(root_dir=root_directory)
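
As a quick sanity check of the traversal, the sketch below builds a throwaway directory tree and collects its `.html` files. It uses `pathlib.Path.rglob` as an alternative to `os.walk`, which is not what the notebook itself does. The function name `find_html_files` and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def find_html_files(root_dir: str) -> List[str]:
    """Return the paths of all .html files below root_dir, sorted."""
    return sorted(
        str(path) for path in Path(root_dir).rglob("*.html") if path.is_file()
    )


# Build a small, hypothetical mirror layout to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "courses").mkdir()
    (Path(tmp) / "index.html").write_text("<p>Home</p>")
    (Path(tmp) / "courses" / "python.html").write_text("<p>Python</p>")
    (Path(tmp) / "courses" / "logo.png").write_bytes(b"")  # must be skipped

    # Show the results relative to the temporary root for readability.
    found = [Path(p).relative_to(tmp).as_posix() for p in find_html_files(tmp)]

print(found)
```

Like the `endswith('.html')` check above, this pattern is case-sensitive on most systems and does not match `.htm` files.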

## Writing a helper to translate a string from English to Hindi with the `deep_translator` package

In [None]:
# Helper to check whether the given text is in Hindi.
def is_hindi(text: str) -> bool:
    """Indicates whether the given text is in Hindi.

    Only the character with the highest code point is tested against the
    Devanagari Unicode block (U+0900 to U+097F).

    Args:
      text: The text to check.

    Returns:
      True if the text appears to be Hindi, False otherwise.
    """
    if not text:
        # max() raises ValueError on an empty string.
        return False
    return '\u0900' <= max(text) <= '\u097f'
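
Because the check looks only at the character with the highest code point, it behaves in ways that are easy to miss on mixed input. The sketch below repeats the same heuristic under a different, illustrative name (`max_char_in_devanagari`) and runs it on a few made-up samples.

```python
def max_char_in_devanagari(text: str) -> bool:
    """Same heuristic as is_hindi: test only the highest code point in the text."""
    return bool(text) and "\u0900" <= max(text) <= "\u097f"


samples = [
    "Rankings",                 # plain English
    "रैंकिंग",                    # plain Hindi
    "Top रैंकिंग",                # mixed: the Hindi characters have the highest code points
    "रैंकिंग \u2013 2023",        # Hindi plus an en dash (U+2013), which lies above U+097F
    "",                         # empty string
]

results = {sample: max_char_in_devanagari(sample) for sample in samples}
for sample, verdict in results.items():
    print(repr(sample), verdict)
```

In practice this means a string that mixes English and Hindi is treated as already translated and skipped, while Hindi text containing typographic punctuation is sent to the translator again.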

In [None]:
def translate_to_hindi(text) -> str:
    """Translates English text to Hindi.

    Args:
      text: The text that should be translated.

    Returns:
      Translated Hindi text, or None if `text` is not a string.
    """
    max_translation_count = 5000

    if not isinstance(text, str):
        return None

    # Nothing to translate: blank text, plain digits, or text that is already Hindi.
    if not text.strip() or text.isdigit() or is_hindi(text):
        return text

    translator = GoogleTranslator(source='en', target='hi')

    if len(text) > max_translation_count:
        # Text is too long for one request: split it into chunks and translate them as a batch.
        text_chunks = wrap(text, max_translation_count)
        translated_chunks = translator.translate_batch(text_chunks)
        return ' '.join(translated_chunks)

    return translator.translate(text)
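
The chunking branch relies on `textwrap.wrap`, which breaks the text at whitespace and returns pieces no longer than the given width. The sketch below shows that step in isolation, with an illustrative width of 40 instead of 5000 and without any network call. The sample sentence is made up.

```python
from textwrap import wrap

max_chunk_length = 40  # illustrative; the notebook uses 5000
text = (
    "Class Central helps learners find online courses "
    "from universities around the world."
)

# Split at whitespace so that no chunk exceeds the limit.
chunks = wrap(text, max_chunk_length)
for chunk in chunks:
    print(len(chunk), repr(chunk))

# Joining with single spaces restores this sample, because it contains
# only single spaces between words.
rebuilt = " ".join(chunks)
print(rebuilt == text)
```

By default `wrap` also replaces newlines and tabs with spaces, so the whitespace layout of a long text is not preserved exactly after the chunks are rejoined.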

In [None]:
translate_to_hindi(".")

## Finding all the texts in HTML files and replacing them with the translated version

#### BeautifulSoup library issue
Please note that, in some cases, BeautifulSoup fails to find all the desired texts. It mainly happens in nested tags.

Example:

```html
<h2 class="head-2 medium-up-head-1 row horz-align-center vert-align-middle text-center margin-bottom-xsmall" style="color: #EA5918;">
    <svg class="margin-right-xsmall" fill="none" viewbox="0 0 54 40" width="45" xmlns="http://www.w3.org/2000/svg">
        <path d="M0 0L54 1.11315e-05L54 39.0034L0 39.0034L16.1273 19.4042L0 0Z" fill="#0D4DF1">
        </path>
        <path d="M38.9577 22.0208C38.803 21.5515 38.9607 21.0248 39.3365 20.7552L43.5606 17.7255C44.3595 17.1525 43.9976 15.7647 43.0494 15.7647H37.5488C37.1516 15.7647 36.7965 15.4873 36.6587 15.0695L34.8884 9.69909C34.5829 8.77218 33.4139 8.77218 33.1084 9.69909L31.3381 15.0695C31.2003 15.4873 30.8452 15.7647 30.448 15.7647H24.9506C24.0021 15.7647 23.6405 17.1531 24.4398 17.7258L28.668 20.7549C29.0444 21.0245 29.2021 21.5521 29.0469 22.0217L27.3984 27.007C27.082 27.9641 28.0321 28.8234 28.7991 28.2738L33.4875 24.914C33.799 24.6908 34.1978 24.6908 34.5093 24.914L39.2017 28.2766C39.9683 28.826 40.9181 27.9678 40.6026 27.0108L38.9577 22.0208Z" fill="white">
         </path>
    </svg>
        Rankings
</h2>
```
The `Rankings` string here won't be captured: the `<h2>` tag has several children (the `<svg>` element and the surrounding text nodes), so its `.string` attribute is `None`. As a result, the text won't be replaced automatically.
This issue is well known in the community. Please take a look at [this thread](https://stackoverflow.com/questions/31958637/beautifulsoup-search-by-text-inside-a-tag).

As a workaround, I implemented the recursive `parse_nested_tags` function. Please take a look at [this solution](https://stackoverflow.com/questions/73377105/how-to-translate-html-file-contents-into-another-language).
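
The behaviour can be reproduced on a shortened version of the markup above. In this minimal sketch, `str.upper` stands in for the translation call so that it runs offline, and `transform_text_nodes` is an illustrative helper, not the notebook's `parse_nested_tags`.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<h2><svg width="45"><path d="M0 0"></path></svg> Rankings </h2>'
soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h2")

# The tag has two children (<svg> and a text node), so .string is None.
print(heading.string)


def transform_text_nodes(node: Tag, transform) -> None:
    """Walk the children recursively and rewrite every non-blank text node."""
    # Iterate over a copy, because replace_with modifies node.contents.
    for child in list(node.contents):
        if isinstance(child, NavigableString):
            if child.strip():
                child.replace_with(transform(str(child)))
        elif isinstance(child, Tag):
            transform_text_nodes(child, transform)


# str.upper is a stand-in for the real translation function.
transform_text_nodes(heading, str.upper)
print(heading.get_text(strip=True))
```

Walking `.contents` reaches the text node that sits next to the `<svg>` element, which is what a lookup through `.string` alone misses.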

In [None]:
def parse_nested_tags(soup_list: List, excluded_tags: List[str], translated_words: Dict[str, str]):
    """Parses nested tags recursively and replaces their texts with Hindi translations.

    Args:
      soup_list: List of tags and text nodes that should be parsed.
      excluded_tags: List of tag names that should be excluded from parsing and translating.
      translated_words: Dictionary of already translated texts, used as a cache.
    """
    for node in soup_list:
        if node.name in excluded_tags:
            continue
        if node.string:
            # Single text child: translate it, reusing a cached translation if available.
            original = str(node.string)
            if original not in translated_words:
                translated_words[original] = translate_to_hindi(original)
            hindi = translated_words[original]
            if hindi is not None:
                node.string.replace_with(hindi)
        elif node.text and node.contents:
            # Several children: recurse so that text next to nested tags is reached too.
            parse_nested_tags(node.contents, excluded_tags=excluded_tags, translated_words=translated_words)


def replace_html_file_text(file_path: str):
    """Replaces the text of an HTML file with its Hindi translation, in place.

    Args:
      file_path: The path of the HTML file that should be translated.
    """
    # Read the HTML file; the context manager closes the handle afterwards.
    with open(file_path, "r") as html_file:
        contents = html_file.read()

    # Create a BeautifulSoup object and specify the parser.
    soup = BeautifulSoup(contents, 'html.parser')

    # Tags whose content should not be translated.
    excluded_tags = ['style', 'script', 'head', 'meta', '[document]', 'img']

    try:
        parse_nested_tags(soup_list=soup.find_all(), excluded_tags=excluded_tags, translated_words={})
    except Exception as error:
        # Report the failure, then still save whatever was translated so far.
        print(f"******************** Error Aborted: {error} ********************")
    finally:
        translated_html = soup.prettify('utf-8')
        with open(file_path, "wb") as output_file:
            output_file.write(translated_html)

 
for file in html_files:
    print("**** File Path ****")
    print(file)
    print("*******************")
    replace_html_file_text(file_path=file)
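
Each call to `replace_html_file_text` starts with an empty `translated_words` dictionary, so strings that repeat across pages, such as navigation labels, are translated again for every file. One possible refinement, sketched below, is a cache shared across files. Here `make_cached_translator` is an illustrative helper and `fake_translate` is a hypothetical stand-in for the real translation call, so the sketch runs offline. Neither is part of the notebook.

```python
from typing import Callable, Dict, List


def make_cached_translator(translate: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a translation function so that each distinct text is translated once."""
    cache: Dict[str, str] = {}

    def cached(text: str) -> str:
        if text not in cache:
            cache[text] = translate(text)
        return cache[text]

    return cached


calls: List[str] = []  # records which texts reached the underlying translator


def fake_translate(text: str) -> str:
    """Hypothetical stand-in for translate_to_hindi; it only tags the text."""
    calls.append(text)
    return f"[hi] {text}"


translate_cached = make_cached_translator(fake_translate)

# Two made-up pages that share the "Courses" label.
pages = [["Courses", "Rankings"], ["Courses", "Sign Up"]]
translated_pages = [[translate_cached(text) for text in page] for page in pages]

print(translated_pages)
print(calls)
```

With the shared cache, the repeated label reaches the underlying translator only once, which should reduce the number of requests on sites with a common header and footer.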