## Word themes and complexity analysis

In [None]:
# Libs preparation and installation
!pip install undetected-chromedriver
!pip install pandas numpy nltk

I want to collect the most frequently used words and determine their context and level on a scale from 0 (preschool) to 100 (highly complex scientific literature). First, I'll collect data from these sources:
- Literature and books from public-domain libraries of English-speaking countries such as the UK and the USA.
- YouTube video subtitles with the highest level of English, grouped by theme.

Parsing themes and their links from the free library ["Gutenberg"](http://www.gutenberg.org/ebooks), which offers books for download in `.txt` format:

In [None]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = uc.Chrome(headless=False)

try:
    driver.get('https://www.gutenberg.org/ebooks/bookshelf/')
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'bookshelf_pages'))
    )
    bookshelves = []
    shelf_lists = driver.find_elements(By.CLASS_NAME, 'bookshelf_pages')
    for shelf_list in shelf_lists:
        items = shelf_list.find_elements(By.TAG_NAME, 'li')
        for item in items:
            link = item.find_element(By.TAG_NAME, 'a')
            text = link.text.strip()
            href = link.get_attribute('href')
            title = ' '.join(text.split()[1:]) if text.split() and text.split()[0].isdigit() else text
            bookshelves.append([title, href])
    for i, (title, url) in enumerate(bookshelves, 1):
        print(f"{i:>3}. {title}\n     {url}")
    print(f"\nTotal bookshelves found: {len(bookshelves)}")
    print(bookshelves)
finally:
    # Ensure the browser closes whether or not an error occurred
    driver.quit()

  1. Best Loved Spanish Literary Classics
     https://www.gutenberg.org/ebooks/bookshelf/420
  2. Adventure
     https://www.gutenberg.org/ebooks/bookshelf/82
  3. Africa
     https://www.gutenberg.org/ebooks/bookshelf/5
  4. African American Writers
     https://www.gutenberg.org/ebooks/bookshelf/6
  5. Ainslee's
     https://www.gutenberg.org/ebooks/bookshelf/195
  6. American Revolutionary War
     https://www.gutenberg.org/ebooks/bookshelf/196
  7. Anarchism
     https://www.gutenberg.org/ebooks/bookshelf/7
  8. Animal
     https://www.gutenberg.org/ebooks/bookshelf/150
  9. Animals-Domestic
     https://www.gutenberg.org/ebooks/bookshelf/151

**...**

Total bookshelves found: 338


After filtering and analysis, the new list of the most important and common themes looks like this:


In [2]:
processed_list = [
    ['Adventure', 'https://www.gutenberg.org/ebooks/bookshelf/82'],
    ['Africa', 'https://www.gutenberg.org/ebooks/bookshelf/5'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/196'],
    ['Anarchism', 'https://www.gutenberg.org/ebooks/bookshelf/7'],
    ['Animal', 'https://www.gutenberg.org/ebooks/bookshelf/150'],
    ['Domestic', 'https://www.gutenberg.org/ebooks/bookshelf/151'],
    ['Birds', 'https://www.gutenberg.org/ebooks/bookshelf/154'],
    ['Insects', 'https://www.gutenberg.org/ebooks/bookshelf/155'],
    ['Mammals', 'https://www.gutenberg.org/ebooks/bookshelf/156'],
    ['Amphibians', 'https://www.gutenberg.org/ebooks/bookshelf/157'],
    ['Trapping', 'https://www.gutenberg.org/ebooks/bookshelf/153'],
    ['Wild', 'https://www.gutenberg.org/ebooks/bookshelf/152'],
    ['Anthropology', 'https://www.gutenberg.org/ebooks/bookshelf/8'],
    ['Archaeology', 'https://www.gutenberg.org/ebooks/bookshelf/9'],
    ['Architecture', 'https://www.gutenberg.org/ebooks/bookshelf/10'],
    ['Argentina', 'https://www.gutenberg.org/ebooks/bookshelf/112'],
    ['Art', 'https://www.gutenberg.org/ebooks/bookshelf/11'],
    ['Legends', 'https://www.gutenberg.org/ebooks/bookshelf/160'],
    ['Astronomy', 'https://www.gutenberg.org/ebooks/bookshelf/101'],
    ['Atheism', 'https://www.gutenberg.org/ebooks/bookshelf/199'],
    ['Australia', 'https://www.gutenberg.org/ebooks/bookshelf/113'],
    ['Racism', 'https://www.gutenberg.org/ebooks/bookshelf/65'],
    ['Association', 'https://www.gutenberg.org/ebooks/bookshelf/422'],
    ['America', 'https://www.gutenberg.org/ebooks/bookshelf/136'],
    ['Listings', 'https://www.gutenberg.org/ebooks/bookshelf/13'],
    ['Bibliomania', 'https://www.gutenberg.org/ebooks/bookshelf/15'],
    ['Biographies', 'https://www.gutenberg.org/ebooks/bookshelf/16'],
    ['Biology', 'https://www.gutenberg.org/ebooks/bookshelf/201'],
    ['Botany', 'https://www.gutenberg.org/ebooks/bookshelf/115'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/137'],
    ['Law', 'https://www.gutenberg.org/ebooks/bookshelf/205'],
    ['Buddhism', 'https://www.gutenberg.org/ebooks/bookshelf/116'],
    ['Camping', 'https://www.gutenberg.org/ebooks/bookshelf/148'],
    ['Canada', 'https://www.gutenberg.org/ebooks/bookshelf/118'],
    ['Chemistry', 'https://www.gutenberg.org/ebooks/bookshelf/211'],
    ['Series', 'https://www.gutenberg.org/ebooks/bookshelf/17'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/18'],
    ['History', 'https://www.gutenberg.org/ebooks/bookshelf/19'],
    ['Literature', 'https://www.gutenberg.org/ebooks/bookshelf/20'],
    ['Books', 'https://www.gutenberg.org/ebooks/bookshelf/22'],
    ['Christianity', 'https://www.gutenberg.org/ebooks/bookshelf/119'],
    ['Christmas', 'https://www.gutenberg.org/ebooks/bookshelf/23'],
    ['Antiquity', 'https://www.gutenberg.org/ebooks/bookshelf/24'],
    ['Cooking', 'https://www.gutenberg.org/ebooks/bookshelf/419'],
    ['Crafts', 'https://www.gutenberg.org/ebooks/bookshelf/27'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/28'],
    ['Nonfiction', 'https://www.gutenberg.org/ebooks/bookshelf/29'],
    ['History', 'https://www.gutenberg.org/ebooks/bookshelf/220'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/30'],
    ['Society', 'https://www.gutenberg.org/ebooks/bookshelf/31'],
    ['Engineering', 'https://www.gutenberg.org/ebooks/bookshelf/32'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/139'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/33'],
    ['Esperanto', 'https://www.gutenberg.org/ebooks/bookshelf/34'],
    ['Ecology', 'https://www.gutenberg.org/ebooks/bookshelf/224'],
    ['Education', 'https://www.gutenberg.org/ebooks/bookshelf/138'],
    ['Egypt', 'https://www.gutenberg.org/ebooks/bookshelf/121'],
    ['Fantasy', 'https://www.gutenberg.org/ebooks/bookshelf/36'],
    ['Folklore', 'https://www.gutenberg.org/ebooks/bookshelf/37'],
    ['Forestry', 'https://www.gutenberg.org/ebooks/bookshelf/145'],
    ['France', 'https://www.gutenberg.org/ebooks/bookshelf/122'],
    ['Forest', 'https://www.gutenberg.org/ebooks/bookshelf/226'],
    ['Geology', 'https://www.gutenberg.org/ebooks/bookshelf/227'],
    ['Germany', 'https://www.gutenberg.org/ebooks/bookshelf/123'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/39'],
    ['Greece', 'https://www.gutenberg.org/ebooks/bookshelf/124'],
    ['Classics', 'https://www.gutenberg.org/ebooks/bookshelf/40'],
    ['Hinduism', 'https://www.gutenberg.org/ebooks/bookshelf/125'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/41'],
    ['Horror', 'https://www.gutenberg.org/ebooks/bookshelf/42'],
    ['Horticulture', 'https://www.gutenberg.org/ebooks/bookshelf/43'],
    ['Humor', 'https://www.gutenberg.org/ebooks/bookshelf/44'],
    ['India', 'https://www.gutenberg.org/ebooks/bookshelf/45'],
    ['Islam', 'https://www.gutenberg.org/ebooks/bookshelf/126'],
    ['Italy', 'https://www.gutenberg.org/ebooks/bookshelf/127'],
    ['Judaism', 'https://www.gutenberg.org/ebooks/bookshelf/128'],
    ['Education', 'https://www.gutenberg.org/ebooks/bookshelf/46'],
    ['Love', 'https://www.gutenberg.org/ebooks/bookshelf/47'],
    ['Manufacturing', 'https://www.gutenberg.org/ebooks/bookshelf/146'],
    ['Mathematics', 'https://www.gutenberg.org/ebooks/bookshelf/102'],
    ['Medicine', 'https://www.gutenberg.org/ebooks/bookshelf/48'],
    ['Microbiology', 'https://www.gutenberg.org/ebooks/bookshelf/105'],
    ['Microscopy', 'https://www.gutenberg.org/ebooks/bookshelf/109'],
    ['Books', 'https://www.gutenberg.org/ebooks/bookshelf/49'],
    ['Music', 'https://www.gutenberg.org/ebooks/bookshelf/50'],
    ['Mycology', 'https://www.gutenberg.org/ebooks/bookshelf/129'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/51'],
    ['Mythology', 'https://www.gutenberg.org/ebooks/bookshelf/52'],
    ['Bookshelf', 'https://www.gutenberg.org/ebooks/bookshelf/149'],
    ['America', 'https://www.gutenberg.org/ebooks/bookshelf/53'],
    ['History', 'https://www.gutenberg.org/ebooks/bookshelf/54'],
    ['Zealand', 'https://www.gutenberg.org/ebooks/bookshelf/130'],
    ['Association', 'https://www.gutenberg.org/ebooks/bookshelf/244'],
    ['Norway', 'https://www.gutenberg.org/ebooks/bookshelf/131'],
    ['Plays', 'https://www.gutenberg.org/ebooks/bookshelf/55'],
    ['Opera', 'https://www.gutenberg.org/ebooks/bookshelf/56'],
    ['Paganism', 'https://www.gutenberg.org/ebooks/bookshelf/132'],
    ['Philosophy', 'https://www.gutenberg.org/ebooks/bookshelf/57'],
    ['Photography', 'https://www.gutenberg.org/ebooks/bookshelf/158'],
    ['Physics', 'https://www.gutenberg.org/ebooks/bookshelf/103'],
    ['Physiology', 'https://www.gutenberg.org/ebooks/bookshelf/110'],
    ['Plays', 'https://www.gutenberg.org/ebooks/bookshelf/59'],
    ['Poetry', 'https://www.gutenberg.org/ebooks/bookshelf/60'],
    ['Politics', 'https://www.gutenberg.org/ebooks/bookshelf/61'],
    ['Precursors', 'https://www.gutenberg.org/ebooks/bookshelf/62'],
    ['Gutenberg', 'https://www.gutenberg.org/ebooks/bookshelf/63'],
    ['Psychology', 'https://www.gutenberg.org/ebooks/bookshelf/64'],
    ['Reference', 'https://www.gutenberg.org/ebooks/bookshelf/66'],
    ['Stories', 'https://www.gutenberg.org/ebooks/bookshelf/67'],
    ['Science', 'https://www.gutenberg.org/ebooks/bookshelf/106'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/68'],
    ['Women', 'https://www.gutenberg.org/ebooks/bookshelf/403'],
    ['Scouts', 'https://www.gutenberg.org/ebooks/bookshelf/144'],
    ['Stories', 'https://www.gutenberg.org/ebooks/bookshelf/69'],
    ['Slavery', 'https://www.gutenberg.org/ebooks/bookshelf/70'],
    ['Sociology', 'https://www.gutenberg.org/ebooks/bookshelf/134'],
    ['Africa', 'https://www.gutenberg.org/ebooks/bookshelf/135'],
    ['America', 'https://www.gutenberg.org/ebooks/bookshelf/71'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/140'],
    ['Suffrage', 'https://www.gutenberg.org/ebooks/bookshelf/72'],
    ['Technology', 'https://www.gutenberg.org/ebooks/bookshelf/143'],
    ['Microbiology', 'https://www.gutenberg.org/ebooks/bookshelf/105'],
    ['Journal', 'https://www.gutenberg.org/ebooks/bookshelf/423'],
    ['Transportation', 'https://www.gutenberg.org/ebooks/bookshelf/74'],
    ['Travel', 'https://www.gutenberg.org/ebooks/bookshelf/75'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/141'],
    ['Kingdom', 'https://www.gutenberg.org/ebooks/bookshelf/76'],
    ['States', 'https://www.gutenberg.org/ebooks/bookshelf/136'],
    ['Law', 'https://www.gutenberg.org/ebooks/bookshelf/302'],
    ['Western', 'https://www.gutenberg.org/ebooks/bookshelf/77'],
    ['Witchcraft', 'https://www.gutenberg.org/ebooks/bookshelf/78'],
    ['Journals', 'https://www.gutenberg.org/ebooks/bookshelf/80'],
    ['Woodwork', 'https://www.gutenberg.org/ebooks/bookshelf/147'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/142'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/325'],
    ['Zoology', 'https://www.gutenberg.org/ebooks/bookshelf/303']
]

Downloading several texts (25 at most) per theme for these themes:

In [None]:
import os
import re
import time
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import undetected_chromedriver as uc

# Initialize the driver
driver = uc.Chrome(headless=True)

def download_file(url, save_path):
    """Download a file from URL to specified path"""
    response = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled connection
    if response.status_code == 200:
        with open(save_path, 'wb') as f:
            f.write(response.content)
        return True
    return False

def process_book(book_url, theme_dir):
    """Process individual book page and download if conditions are met"""
    try:
        driver.get(book_url)
        print(f"{theme_dir}:{book_url}")

        # 1. Check language (wait for element to load)
        try:
            lang_xpath = "//table[@class='bibrec']//tr[th='Language']/td"
            lang_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, lang_xpath))
            )
            language = lang_element.text.strip()
            print(f"Language {language}")
            if language not in ["English", "Английский"]:
                return False
        except Exception:
            return False

        # 2. Extract reading ease score
        try:
            # Use presence_of_all_elements_located to find every element that matches
            note_xpath = "//table[@class='bibrec']//tr[th='Note']/td"
            note_elements = WebDriverWait(driver, 10).until(
                EC.presence_of_all_elements_located((By.XPATH, note_xpath))
            )

            # Check if any notes were found
            if not note_elements:
                return False

            all_notes_data = []
            reading_score = None

            # Loop through each found note element
            for note_element in note_elements:
                note_text = note_element.text.strip()
                print(f"Note: {note_text}")

                # Extract reading score from each note using regex
                match = re.search(r"Reading ease score:\s*([\d.]+)", note_text)

                current_score = match.group(1) if match else None

                # Store the first valid reading score found
                if current_score and reading_score is None:
                    reading_score = current_score

                all_notes_data.append({
                    "note_text": note_text,
                    "reading_score": current_score
                })

            # Check if at least one note had a reading score.
            if reading_score is None:
                return False

        except Exception as e:
            print(f"An error occurred while parsing notes: {e}")
            return False

        # 3. Find and process download link
        try:
            selectors = [
                "//a[@class='link ' and text()='Plain Text UTF-8']"
            ]

            download_link = None
            for selector in selectors:
                try:
                    # Use find_elements to avoid an exception if not found
                    links = driver.find_elements(By.XPATH, selector)
                    if links:
                        download_link = links[0]
                        break
                except Exception:
                    # Continue to the next selector
                    continue

            if not download_link:
                print("Error: Download link not found.")
                return False

            download_url = download_link.get_attribute('href')
            book_id_match = re.search(r'/ebooks/(\d+)', book_url)
            if not book_id_match:
                print("Error: Could not extract book ID from URL.")
                return False

            book_id = book_id_match.group(1)
            filename = f"{book_id}_{reading_score}.txt"
            save_path = os.path.join(theme_dir, filename)

            # Download the file
            if download_file(download_url, save_path):
                print(f"Downloaded: {filename}")
                return True
            else:
                print(f"Failed to download: {filename}")
                return False

        except Exception as e:
            print(f"An error occurred during download link processing: {e}")
            return False

    except Exception as e:
        print(f"Error processing {book_url}: {str(e)}")
        return False

    return False

# Main processing loop
for theme_name, theme_url in processed_list:
    try:
        # Create theme directory
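        # Sanitize the theme name and place its directory under data/, so that
        # process_book() saves files into the directory that was actually created
        theme_dir = os.path.join('data', re.sub(r'[\\/*?:"<>|]', "", theme_name))
        os.makedirs(theme_dir, exist_ok=True)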
        print(f"\nProcessing theme: {theme_name}")

        # Navigate to bookshelf page
        driver.get(theme_url)
        time.sleep(2)  # Initial load wait

        # Process pagination
        page_count = 0
        books_processed = 0
        while True:
            page_count += 1
            print(f"  Page {page_count}")

            # Find all book links
            book_links = []
            try:
                book_elements = WebDriverWait(driver, 15).until(
                    EC.presence_of_all_elements_located((By.XPATH, "//li[@class='booklink']/a[@class='link']"))
                )
                book_links = [elem.get_attribute('href') for elem in book_elements]
            except Exception:
                pass  # No book links found on this page

            # Process each book
            for book_url in book_links:
                if process_book(book_url, theme_dir):
                    books_processed += 1
                time.sleep(1)  # Be polite to server

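            # Check for next page. Note: process_book() leaves the driver on the
            # last book page it visited, so this lookup normally fails there and
            # the loop ends after the first page of the bookshelf.
            try:
                next_button = driver.find_element(By.XPATH, "//a[@title='Go to next page']")
                if "disabled" in next_button.get_attribute("class"):
                    break

                next_button.click()
                time.sleep(3)  # Wait for page load
            except Exception:
                break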

        print(f"  Finished theme: {theme_name} | Books downloaded: {books_processed}")

    except Exception as e:
        print(f"Error processing theme {theme_name}: {str(e)}")

# Clean up
driver.quit()
print("\nProcessing completed!")

After 3 hours of execution, we have 1.4 GB of text data, separated by theme and reading complexity. Files are saved in per-theme directories, with file names of the form `{BookId}_{TextComplexity}.txt`.
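
As a small illustration of the naming scheme, the sketch below splits such a file name back into its book ID and score. The helper name `parse_book_filename` and the sample file name are hypothetical, introduced only for this example.

```python
from pathlib import Path


def parse_book_filename(filename):
    """Split '{BookId}_{TextComplexity}.txt' into (book_id, score)."""
    stem = Path(filename).stem           # drop the '.txt' extension
    book_id, score = stem.split('_', 1)  # split on the first underscore only
    return int(book_id), float(score)


# Hypothetical file name, used only to illustrate the format
print(parse_book_filename('1342_70.5.txt'))
```

Taking the stem before splitting keeps the dot inside the score (`70.5`) from being confused with the extension.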

In [3]:
dirs = [element[0] for element in processed_list]
dirs = list(set(dirs))
dirs.sort()
print(", ".join(dirs))

Adventure, Africa, America, Amphibians, Anarchism, Animal, Anthropology, Antiquity, Archaeology, Architecture, Argentina, Art, Association, Astronomy, Atheism, Australia, Bibliomania, Biographies, Biology, Birds, Books, Bookshelf, Botany, Buddhism, Camping, Canada, Chemistry, Christianity, Christmas, Classics, Cooking, Crafts, Domestic, Ecology, Education, Egypt, Engineering, Esperanto, Fantasy, Fiction, Folklore, Forest, Forestry, France, Geology, Germany, Greece, Gutenberg, Hinduism, History, Horror, Horticulture, Humor, India, Insects, Islam, Italy, Journal, Journals, Judaism, Kingdom, Law, Legends, Listings, Literature, Love, Mammals, Manufacturing, Mathematics, Medicine, Microbiology, Microscopy, Music, Mycology, Mythology, Nonfiction, Norway, Opera, Paganism, Philosophy, Photography, Physics, Physiology, Plays, Poetry, Politics, Precursors, Psychology, Racism, Reference, Science, Scouts, Series, Slavery, Society, Sociology, States, Stories, Suffrage, Technology, Transportation, T

Next, we create a per-word dataset with these columns:

| Word | Adventure | Africa | ... | Zealand | Zoology | Total | Complexity |
|-|-|-|-|-|-|-|-|
| `str` | `int` | `int` | ... | `int` | `int` | `int` | `float` |

by this algorithm:
1. Open every text in every directory (theme).
2. Clean and tokenize the text with NLTK, removing stop words.
3. Add every token:
  - if the word already exists, update its theme counter, its total counter, and its complexity by this formula:
  $
  \text{complexity}_{\text{new}} = \frac{\text{complexity}_{\text{old}}\cdot \text{total}+\text{text_complexity}}{\text{total}+1}
  $
  - if the word does not exist yet, add it, update its counters, and set its complexity to the complexity of the current text.
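
The update rule above is the standard incremental mean. A minimal sketch, assuming made-up scores and a hypothetical helper `update_complexity`, shows that applying it one occurrence at a time gives the same value as averaging all scores at once:

```python
def update_complexity(old_complexity, total, text_complexity):
    """Apply the incremental-mean update for one new occurrence of a word."""
    return (old_complexity * total + text_complexity) / (total + 1)


# Made-up reading scores of the texts in which a word occurred
scores = [70.0, 55.0, 82.0, 61.0]

complexity, total = 0.0, 0
for score in scores:
    complexity = update_complexity(complexity, total, score)
    total += 1

# The running value matches the plain arithmetic mean (up to rounding)
assert abs(complexity - sum(scores) / len(scores)) < 1e-9
print(total, complexity)
```

With `total = 0` the formula reduces to the text's own score, which covers the case of a word seen for the first time.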


In [None]:
import os
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import re

# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

def preprocess_text(text):
    """Tokenize, clean, and lemmatize text"""
    # Convert to lowercase and remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords and short tokens
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]

def process_directory(base_dir, dirs):
    """Process all theme directories and create the word dataset"""
    # Filter themes to only include those in dirs list
    themes = [d for d in dirs if os.path.isdir(os.path.join(base_dir, d))]
    print(f"Processing {len(themes)} themes: {', '.join(themes)}")

    # Initialize word dictionary
    word_data = {}

    # Process each theme
    for theme in themes:
        theme_dir = os.path.join(base_dir, theme)
        print(f"Processing theme: {theme}")

        # Process each file in the theme directory
        for filename in os.listdir(theme_dir):
            if filename.endswith('.txt'):
                # Extract reading score from filename
                try:
                    book_id, reading_score = filename.split('_')[:2]
                    reading_score = float(reading_score.replace('.txt', ''))
                except Exception as e:
                    print(f"Error parsing filename {filename}: {e}")
                    continue

                filepath = os.path.join(theme_dir, filename)

                try:
                    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
                        text = f.read()

                    # Preprocess text
                    tokens = preprocess_text(text)

                    # Update word data
                    for word in tokens:
                        if word not in word_data:
                            # Initialize new word entry
                            word_data[word] = {
                                'Total': 0,
                                'Complexity': 0.0,
                                **{t: 0 for t in themes}  # Initialize all themes to 0
                            }

                        # Get current word stats
                        current = word_data[word]
                        total_old = current['Total']

                        # Update counts
                        current['Total'] += 1
                        current[theme] += 1

                        # Update complexity using moving average formula
                        current['Complexity'] = (
                            (current['Complexity'] * total_old) + reading_score
                        ) / (total_old + 1)

                except Exception as e:
                    print(f"Error processing {filename}: {str(e)}")

    # Convert to DataFrame
    df = pd.DataFrame.from_dict(word_data, orient='index')
    df.reset_index(inplace=True)
    df.rename(columns={'index': 'Word'}, inplace=True)

    # Reorder columns: Word, themes, Total, Complexity
    columns = ['Word'] + themes + ['Total', 'Complexity']
    return df[columns]

# Main processing
base_directory = "data"  # Directory containing theme subdirectories (same name as used by the downloader)
output_file = "word_dataset.csv"

# Use the dirs array from the previous cell
df = process_directory(base_directory, dirs)

# Display some statistics
print(f"\nDataset created with {len(df)} unique words")
print(f"Top 10 most frequent words:")
print(df.sort_values('Total', ascending=False).head(10)[['Word', 'Total', 'Complexity']])

# Save results
df.to_csv(output_file, index=False)
print(f"\nDataset saved to {output_file}")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\79150\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\79150\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\79150\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\79150\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


Processing 111 themes: Adventure, Africa, America, Amphibians, Anarchism, Animal, Anthropology, Antiquity, Archaeology, Architecture, Argentina, Art, Association, Astronomy, Atheism, Australia, Bibliomania, Biographies, Biology, Birds, Books, Bookshelf, Botany, Buddhism, Camping, Canada, Chemistry, Christianity, Christmas, Classics, Cooking, Crafts, Domestic, Ecology, Education, Egypt, Engineering, Esperanto, Fantasy, Fiction, Folklore, Forest, Forestry, France, Geology, Germany, Greece, Gutenberg, Hinduism, History, Horror, Horticulture, Humor, India, Insects, Islam, Italy, Journal, Journals, Judaism, Kingdom, Law, Legends, Listings, Literature, Love, Mammals, Manufacturing, Mathematics, Medicine, Microbiology, Microscopy, Music, Mycology, Mythology, Nonfiction, Norway, Opera, Paganism, Philosophy, Photography, Physics, Physiology, Plays, Poetry, Politics, Precursors, Psychology, Racism, Reference, Science, Scouts, Series, Slavery, Society, Sociology, States, Stories, Suffrage, Techno

In [None]:
df = pd.read_csv('word_dataset.csv')
df = df[df['Total'] > 10]

df.sort_values('Total', ascending=False).head(100)[['Word', 'Total', 'Complexity']]

| | Word | Total | Complexity |
|-|-|-|-|
| 52 | one | 749024 | 67.143501 |
| 608 | would | 477428 | 67.192777 |
| 140 | said | 460270 | 74.127681 |
| 683 | time | 406210 | 66.731788 |
| 17 | may | 384135 | 64.051919 |
| ... | ... | ... | ... |
| 1437 | set | 113125 | 67.948997 |
| 229 | order | 111802 | 63.708995 |
| 869 | far | 111664 | 65.489456 |
| 4 | world | 111031 | 65.826793 |


## Oxford dictionary parsing
The results reflect the expected correspondence between a word's themes and its complexity. However, for very frequent common words, both the complexity and the thematic distribution converge to the average values. The same happens with words that belong to several subthemes. In addition, we also collected proper names, Roman numerals, and other words that are of no value to an English language learner.

To create a list of words that are really useful for learning, we decided to parse word lists from an authoritative source: the Oxford dictionary. It divides words by topic, subtopic, and subsubtopic, and also gives their CEFR difficulty level and part of speech. After research and testing, the following parser was written:

In [None]:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import undetected_chromedriver as uc


# Initialize the driver
driver = uc.Chrome()
driver.maximize_window()

# Create an empty DataFrame
columns = ['word', 'topic', 'subtopic', 'subsubtopic', 'CEFR_level', 'word_class', 'link']
df = pd.DataFrame(columns=columns)

try:
    # Step 1: Parse all topics from the main page
    main_url = "https://www.oxfordlearnersdictionaries.com/topic/"
    driver.get(main_url)
    time.sleep(0.5)  # Allow page to load

    topic_elements = driver.find_elements(By.CLASS_NAME, 'topic-box')
        # By.CSS_SELECTOR, 'div.topic-content-container > div > a')
    print(f"Found {len(topic_elements)} topic elements")

    topics_data = []
    for topic_elem in topic_elements:
        topic_name = topic_elem.find_element(By.CLASS_NAME, 'topic-label').text.strip()
        topic_elem = topic_elem.find_element(By.TAG_NAME, 'a')
        topic_link = topic_elem.get_attribute('href')
        topics_data.append((topic_name, topic_link))

    # Step 2: Process each topic
    for topic_name, topic_link in topics_data:
        print(f"Processing topic: {topic_name} - {topic_link}")
        driver.get(topic_link)
        time.sleep(0.3)  # Allow page to load

        # Find all subtopic boxes
        subtopic_boxes = driver.find_elements(By.CLASS_NAME, 'topic-box-secondary-heading')
        print(f"  Found {len(subtopic_boxes)} subtopic boxes:")

        subtopic_boxes_links = []
        subtopic_names = []
        for box in subtopic_boxes:
            # Extract subtopic name, dropping the '(see all)' suffix
            subtopic_name = box.text.strip()
            if '(see all)' in subtopic_name:
                subtopic_name = subtopic_name.replace('(see all)', '').strip()
            print(f'text:{subtopic_name}')
            subtopic_names.append(subtopic_name)

            # Store the link to the subtopic's word-list page
            link = box.get_attribute('href')
            subtopic_boxes_links.append(link)

        for i in range(len(subtopic_boxes_links)):
            link = subtopic_boxes_links[i]
            subtopic_name = subtopic_names[i]
            try:
                driver.get(link)
                # word_list = box.find_element(By.CLASS_NAME, 'top-g')
                # time.sleep(100)

                wait = WebDriverWait(driver, 10)
                word_list = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'top-g')))

                word_items = word_list.find_elements(By.TAG_NAME, 'li')

                print(f"    Found {len(word_items)} words in subtopic: {subtopic_name}")

                for word_li in word_items:
                    try:
                        # Extract word and word class
                        word = word_li.find_element(By.TAG_NAME, 'a').text
                        word_class = word_li.find_element(By.CLASS_NAME, 'pos').text
                        link = word_li.find_element(By.TAG_NAME, 'a').get_attribute('href')

                        # Extract CEFR level and subsubtopic from data attributes
                        cefr_level = None
                        subsubtopic = None
                        attrs = driver.execute_script(
                            'var items = {}; for (index = 0; index < arguments[0].attributes.length; ++index) { items[arguments[0].attributes[index].name] = arguments[0].attributes[index].value }; return items;',
                            word_li
                        )
                        for attr, value in attrs.items():
                            if attr.endswith('_t'):
                                cefr_level = value
                                subsubtopic = attr[:-2]  # Remove '_t' suffix
                                subsubtopic = subsubtopic[5:] # remove "data-"
                                break

                        # Append to DataFrame
                        new_row = {
                            'word': word,
                            'topic': topic_name,
                            'subtopic': subtopic_name,
                            'subsubtopic': subsubtopic,
                            'CEFR_level': cefr_level,
                            'word_class': word_class,
                            'link': link
                        }
                        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

                    except Exception as e:
                        print(f"      Error processing a word: {e}")
            except Exception as e:
                print(f"    Error processing a subtopic box: {e}")

except Exception as e:
    print(f"An error occurred: {e}")
finally:
    driver.quit()
    # Save to CSV
    df.to_csv('oxford_vocabulary.csv', index=False)
    print("Data saved to 'oxford_vocabulary.csv'")

Since the parsing was done on a small number of pages, the data was collected in just a few hours.
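
The attribute-parsing step inside the loop above can be isolated into a small pure function, which makes it testable without a browser. This is a minimal sketch: the function name `parse_topic_attr` and the example dictionary are hypothetical, and it assumes the attribute layout used by the scraper, namely an attribute called `data-<subsubtopic>_t` whose value is the CEFR level.

```python
def parse_topic_attr(attrs):
    """Return (subsubtopic, cefr_level) from a dict of HTML attributes.

    Assumes an attribute named 'data-<subsubtopic>_t' holds the CEFR level,
    as in the scraper above. Returns (None, None) if no such attribute exists.
    """
    for name, value in attrs.items():
        if name.startswith("data-") and name.endswith("_t"):
            # Strip the 'data-' prefix and the '_t' suffix
            return name[len("data-"):-len("_t")], value
    return None, None


# Hypothetical attribute dict, shaped like the one returned by execute_script
example = {"class": "top-g", "data-farm_animals_t": "c2"}
print(parse_topic_attr(example))
```

Unlike the inline loop, this version also checks for the `data-` prefix before slicing, so an unrelated attribute that happens to end in `_t` is not cut in the wrong place.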

## Translation of words and examples
From the key data, we obtained an authoritative list of words with their level and part of speech, and investigated the frequency of word usage in English literature. Since our app is built for learning English, we also need translations of these words, as well as example sentences that use them. The sentences serve two purposes: generating word help and generating the tasks themselves, where the word under study is removed from the sentence and its context is used to create 3 other options for the user to choose from (see the NLP part from Anton).
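
To make the task format concrete, here is a rough sketch of how a gap-fill task could be assembled from one example sentence. The function `make_cloze_task` and its parameters are hypothetical, and the distractors are simply drawn at random from a word list. That is a placeholder only: the context-based choice of options belongs to the NLP part and is not reproduced here.

```python
import random
import re


def make_cloze_task(sentence, target, vocabulary, n_distractors=3, seed=None):
    """Blank out `target` in `sentence` and return the gapped text with shuffled options."""
    rng = random.Random(seed)
    # Match the target as a whole word, ignoring case
    pattern = re.compile(rf"\b{re.escape(target)}\b", flags=re.IGNORECASE)
    gapped, n_subs = pattern.subn("____", sentence, count=1)
    if n_subs == 0:
        return None  # the target does not occur in this sentence

    # Placeholder distractors: random words other than the target
    candidates = sorted({w for w in vocabulary if w.lower() != target.lower()})
    distractors = rng.sample(candidates, min(n_distractors, len(candidates)))

    options = distractors + [target]
    rng.shuffle(options)
    return {"sentence": gapped, "options": options, "answer": target}


task = make_cloze_task(
    "Oh, my God, this is a workplace.",
    "workplace",
    ["adder", "alligator", "alpaca", "workplace"],
    seed=0,
)
print(task["sentence"])
print(task["options"])
```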

The following websites were analyzed:  

| Website | Functionality | Translation | Transcription | Example Sentences with Translation | Collocations | CAPTCHA | Notes |  
|---------|--------------|-------------|---------------|------------------------------------|--------------|---------|-------|  
| [Yandex Translate](https://translate.yandex.ru/dictionary/Английский-Русский/fluently) | Translator | ✅ | ✅ | ✅ | ❌ | ✅ | Split by meanings |  
| [WooordHunt](https://wooordhunt.ru/word/fluently) | Dictionary | ✅ | ✅ | ⚠️ (merged, not split by meanings) | ✅ | ❌ | All translations combined |  
| [KartaSlov](https://en.kartaslov.ru/перевод-в-контексте/fluently) | Contextual Dictionary | ✅ | ✅ | ✅ | ❌ | ❌ | Focus on context |  
| [Cambridge Dictionary](https://dictionary.cambridge.org/ru/словарь/английский/fluently) | English Dictionary | ❌ | ✅ | ⚠️ (no translation) | ❌ | ❌ | Authentic examples |  


At first I tried to write a parser for Yandex Translate:

In [None]:
import time
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import undetected_chromedriver as uc

# Initialize the driver
driver = uc.Chrome()
driver.maximize_window()

df = pd.read_csv("oxford_vocabulary.csv")

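def translate(word, max_attempts=3):
    """Return the dictionary translation of `word`, or None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            driver.get(f'https://translate.yandex.ru/dictionary/Английский-Русский/{word}')
            result = ""
            not_first_time = False
            while result == "":
                # If the result box is still empty, pause briefly before polling again
                if not_first_time:
                    time.sleep(0.1)
                wait = WebDriverWait(driver, 10)
                # Wait until the source word is shown, then until the result box exists
                wait.until(EC.text_to_be_present_in_element((By.ID, 'fakeArea'), word))
                box = wait.until(EC.presence_of_element_located((By.ID, 'dstBox')))
                area = box.find_element(By.TAG_NAME, 'p')
                texts = area.find_elements(By.TAG_NAME, 'span')
                result = "".join([text.text for text in texts])
                not_first_time = True
            return result
        except Exception as e:
            # Retry a bounded number of times instead of recursing without a return value
            print(f"Attempt #{attempt} failed for '{word}': {e}")
    return None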

print(translate("something"))
driver.quit()

However, a CAPTCHA appeared on the third request, so I decided to parse the KartaSlov service instead:

In [None]:
import time
import re
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import undetected_chromedriver as uc
from tqdm import tqdm
import os

def clean_word(word):
    """Cleans the word of unwanted characters and replaces spaces with %20"""
    word = re.sub(r'[^\w\s\-]', '', word)  # Deleting non-letter characters (except spaces and hyphens)
    word = re.sub(r'\s+', ' ', word).strip()  # Removing unnecessary spaces
    return "%20".join(word.split())  # Replacing spaces with %20
def translate(word, attemps = 0, max_attempts = 1, fast_translation = False):
    word = "%20".join(word.split())
    try:
        driver.get(f'https://en.kartaslov.ru/перевод-в-контексте/{word}')
        results = []
        translations = []
        sentences = [[]]
        not_first_time = False
        while translations == []:
            # If page is not completely loaded, sleep a bit and continue
            if not_first_time:
                time.sleep(0.1)

            # Page loading and fast translations
            wait = WebDriverWait(driver, 10)
            fast_headers = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'v2-ctx-tra-list')))
            fast_translation_boxes = fast_headers.find_elements(By.TAG_NAME, 'li')
            print(f"Fast translations:{', '.join([box.text for box in fast_translation_boxes])}")
            if fast_translation:
                translations = [box.text for box in fast_translation_boxes]
                for translation in translations:
                    result = {
                        'translation': translation,
                        'sentences' : sentences
                    }
                    results.append(result)
                return results

            # Words translations by headers in boxes
            headers = driver.find_elements(By.CLASS_NAME, 'v2-tr-header')
            translations = []
            for header in headers:
                translation_box = header.find_element(By.TAG_NAME, 'a')
                translation = translation_box.text
                translations.append(translation)

            # Sentences words and translations
            sentences = driver.find_elements(By.CLASS_NAME,'v2-tr-row')
            sentences_list = []
            print(f"Total: {len(sentences)}, {len(sentences)/len(translations)} per word")
            for i in range(len(sentences)):
                sentence = sentences[i]
                sentence_en = sentence.find_element(By.CSS_SELECTOR,".v2-tr-column-src.v2-tr-text")
                sentence_en = sentence_en.text

                sentence_ru = sentence.find_element(By.CSS_SELECTOR,".v2-tr-column-trg.v2-tr-text")
                sentence_ru = sentence_ru.text
                sentences_list.append([sentence_en, sentence_ru])
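            if len(sentences_list)>=5:
                # Sentences are grouped by 5 per translation; visit every group, including the last one
                for i in range(0, len(sentences_list), 5):
                    if i//5 >= len(translations):
                        break  # more sentence groups than translation headers
                    result = {
                        'translation': translations[i//5],
                        'sentences' : sentences_list[i:(i+5)]
                    }
                    results.append(result)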
            else:
                results = [{
                        'translation': translations[0],
                        'sentences' : sentences_list
                }]
        return results
    except Exception as e:
        attemps += 1
        if attemps >= max_attempts:
            return None
        else:
            print(f"Error {e} \nAttempt #{attemps} for word '{word}'")
            # Return the retry's result and keep the caller's settings
            return translate(word, attemps, max_attempts, fast_translation)

driver = uc.Chrome()
driver.set_page_load_timeout(0.1)

df = pd.read_csv("oxford_vocabulary.csv")
output_file = "translated_vocabulary.csv"

# Loading existing results, if the file exists
if os.path.exists(output_file):
    result_df = pd.read_csv(output_file)
    
    processed_words = set(result_df['word_clean'].dropna().unique())
    last_word = result_df.tail(1)['word'].tolist()[0]
    start_index = df[df['word'] == last_word].index[0] + 1
else:
    start_index = 0
    processed_words = set()  # nothing has been translated yet
    result_df = pd.DataFrame(columns=["word", "word_clean", "translation", "sentences", "original_index"])

for index, row in tqdm(df.iloc[start_index:].iterrows(), total=len(df), initial=start_index):
    original_word = row["word"]
    
    if not isinstance(original_word, str) or original_word.strip() == "":
        continue
        
    cleaned = clean_word(original_word)
    
    if cleaned in processed_words:
        continue
        
    translations = translate(cleaned)
    
    if translations is None:
        print(f"Failed to get a translation for: {original_word}")
        continue
        
    for trans in translations:
        new_row = {
            "word": original_word,
            "word_clean": cleaned,
            "translation": trans["translation"],
            "sentences": trans["sentences"],
            "original_index": index
        }
        result_df = pd.concat([result_df, pd.DataFrame([new_row])], ignore_index=True)
    
    # Regular saving of results
    if index % 100 == 0:
        result_df.to_csv(output_file, index=False)
    
    # Updating the list of processed words
    processed_words.add(cleaned)

result_df.to_csv(output_file, index=False)
driver.quit()
print("Processing complete!")

After several hours, with 7050 of 29997 words translated, my IP was banned (the service was down only for my PC):
![503 error](image.png)
So I rewrote the script to resume from where it crashed and ran it on another team member's PC. After approximately 14 hours in total, the script had parsed all the words.
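
The resume step can also be expressed as a filter over the word list rather than a lookup of the last saved row. The sketch below is an alternative illustration, not the code that was actually run: the helper name `words_left_to_process` is hypothetical, and it assumes the already processed words are available as any iterable (for example, the `word_clean` column of the output file).

```python
def words_left_to_process(all_words, processed_words):
    """Return the words that still need scraping, preserving the original order."""
    done = set(processed_words)
    remaining = []
    for word in all_words:
        if word not in done:
            remaining.append(word)
            done.add(word)  # also skips repeated entries in the source list
    return remaining


# Hypothetical example: 'adder' was already saved before the crash
print(words_left_to_process(["adder", "alligator", "adder", "alpaca"], ["adder"]))
```

Filtering by membership does not depend on where the last periodic save happened to land, so a crash between two saves only costs the words scraped since the previous save.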

## Final results
We ended up with three datasets: the Oxford vocabulary without translations, translations of its words together with translated example sentences, and a list of all words from the literature with their frequency. We drop the unnecessary columns, join the tables on their key columns, and clean up the resulting table.

In [1]:
import pandas as pd

translated = pd.read_csv('translated_vocabulary.csv')
translated.drop(columns = ['word_clean', 'original_index'], inplace = True)

oxford = pd.read_csv('oxford_vocabulary.csv')
oxford.drop(columns = ['word_class', 'link'], inplace = True)

result = pd.merge(oxford,translated, on = 'word', how = 'right')
result.drop_duplicates(inplace=True)

clean = result[~result['sentences'].duplicated(keep=False)]

words = pd.read_csv('word_dataset.csv')
result = pd.merge(words[['Word', 'Total']], clean, right_on = 'word', left_on = 'Word', how = 'right')

result.drop(columns = ['Word'], inplace = True)
result.to_csv('data.csv')

result

Unnamed: 0,Total,word,topic,subtopic,subsubtopic,CEFR_level,translation,sentences
0,462.0,adder,Animals,Animals,amphibians_and_reptiles,c2,гадюка,[['No poison darts tipped with the venom of an...
1,770.0,alligator,Animals,Animals,amphibians_and_reptiles,c1,аллигатор,"[[""Tell him I've been bit by an alligator."", '..."
2,770.0,alligator,Animals,Animals,amphibians_and_reptiles,c1,крокодил,[['You wanted to honor the man by showing him ...
3,770.0,alligator,Animals,Animals,amphibians_and_reptiles,c1,из крокодиловой кожи,[['Even that alligator handbag his wife left o...
4,196.0,alpaca,Animals,Animals,farm_animals,c2,альпака,"[[""— It's very lightweight alpaca."", 'Альпака?..."
...,...,...,...,...,...,...,...,...
34738,4.0,workplace,Work and business,Working life,office_life,b2,на работе,[['These laws were designed to protect a secre...
34739,4.0,workplace,Work and business,Working life,office_life,b2,рабочий,"[['Devon, looking at my computer is a violatio..."
34740,4.0,workplace,Work and business,Working life,office_life,b2,работать,"[['Oh, my God, this is a workplace.', 'Господи..."
34741,4.0,workplace,Work and business,Working life,office_life,b2,служебный,"[['The old workplace romance trick.', 'Старый ..."
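
Because the `sentences` column was written to CSV as the text form of a Python list, it comes back as a plain string when the file is read again. A minimal sketch of turning such a cell back into sentence pairs with the standard library is shown below. The helper name `parse_sentences` and the sample cell are hypothetical, and the sketch assumes every cell is a valid Python literal of the form `[[english, russian], ...]`.

```python
import ast


def parse_sentences(cell):
    """Turn a stringified 'sentences' cell into a list of (english, russian) pairs."""
    pairs = ast.literal_eval(cell)  # safely evaluates literals only, no arbitrary code
    # Skip malformed or empty entries such as []
    return [tuple(pair) for pair in pairs if len(pair) == 2]


# Hypothetical cell, shaped like the values in the table above
cell = "[['Hello.', 'Привет.'], []]"
print(parse_sentences(cell))
```

With pandas, the same function could be applied to the whole column, for example via `result['sentences'].apply(parse_sentences)`.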
