## Part 1: Word themes and complexity analysis

In [None]:
# Libs preparation and installation
!pip install undetected-chromedriver
!pip install pandas numpy nltk

I want to collect the most frequently used words and find the context and level of each word on a scale from 0 (preschool) to 100 (highly complex scientific literature). First, I'll collect data from these sources:
- Literature and books from public-domain sources and libraries of English-speaking countries such as the UK and the USA.
- Subtitles of YouTube videos with a high level of English, grouped by theme.

Parsing themes and their links from the free library ["Gutenberg"](http://www.gutenberg.org/ebooks), which offers books for download in `.txt` format:

In [None]:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = uc.Chrome(headless=False)

try:
    driver.get('https://www.gutenberg.org/ebooks/bookshelf/')
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'bookshelf_pages'))
    )
    bookshelves = []
    shelf_lists = driver.find_elements(By.CLASS_NAME, 'bookshelf_pages')
    for shelf_list in shelf_lists:
        items = shelf_list.find_elements(By.TAG_NAME, 'li')
        for item in items:
            link = item.find_element(By.TAG_NAME, 'a')
            text = link.text.strip()
            href = link.get_attribute('href')
            title = ' '.join(text.split()[1:]) if text.split() and text.split()[0].isdigit() else text
            bookshelves.append([title, href])
    for i, (title, url) in enumerate(bookshelves, 1):
        print(f"{i:>3}. {title}\n     {url}")
    print(f"\nTotal bookshelves found: {len(bookshelves)}")
    print(bookshelves)
finally:
    # Ensure the browser closes whether or not an error occurred
    driver.quit()

  1. Best Loved Spanish Literary Classics
     https://www.gutenberg.org/ebooks/bookshelf/420
  2. Adventure
     https://www.gutenberg.org/ebooks/bookshelf/82
  3. Africa
     https://www.gutenberg.org/ebooks/bookshelf/5
  4. African American Writers
     https://www.gutenberg.org/ebooks/bookshelf/6
  5. Ainslee's
     https://www.gutenberg.org/ebooks/bookshelf/195
  6. American Revolutionary War
     https://www.gutenberg.org/ebooks/bookshelf/196
  7. Anarchism
     https://www.gutenberg.org/ebooks/bookshelf/7
  8. Animal
     https://www.gutenberg.org/ebooks/bookshelf/150
  9. Animals-Domestic
     https://www.gutenberg.org/ebooks/bookshelf/151

**...**

Total bookshelves found: 338


After filtering and analysis, the new list of the most important and common themes looks like this:


In [2]:
processed_list = [
    ['Adventure', 'https://www.gutenberg.org/ebooks/bookshelf/82'],
    ['Africa', 'https://www.gutenberg.org/ebooks/bookshelf/5'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/196'],
    ['Anarchism', 'https://www.gutenberg.org/ebooks/bookshelf/7'],
    ['Animal', 'https://www.gutenberg.org/ebooks/bookshelf/150'],
    ['Domestic', 'https://www.gutenberg.org/ebooks/bookshelf/151'],
    ['Birds', 'https://www.gutenberg.org/ebooks/bookshelf/154'],
    ['Insects', 'https://www.gutenberg.org/ebooks/bookshelf/155'],
    ['Mammals', 'https://www.gutenberg.org/ebooks/bookshelf/156'],
    ['Amphibians', 'https://www.gutenberg.org/ebooks/bookshelf/157'],
    ['Trapping', 'https://www.gutenberg.org/ebooks/bookshelf/153'],
    ['Wild', 'https://www.gutenberg.org/ebooks/bookshelf/152'],
    ['Anthropology', 'https://www.gutenberg.org/ebooks/bookshelf/8'],
    ['Archaeology', 'https://www.gutenberg.org/ebooks/bookshelf/9'],
    ['Architecture', 'https://www.gutenberg.org/ebooks/bookshelf/10'],
    ['Argentina', 'https://www.gutenberg.org/ebooks/bookshelf/112'],
    ['Art', 'https://www.gutenberg.org/ebooks/bookshelf/11'],
    ['Legends', 'https://www.gutenberg.org/ebooks/bookshelf/160'],
    ['Astronomy', 'https://www.gutenberg.org/ebooks/bookshelf/101'],
    ['Atheism', 'https://www.gutenberg.org/ebooks/bookshelf/199'],
    ['Australia', 'https://www.gutenberg.org/ebooks/bookshelf/113'],
    ['Racism', 'https://www.gutenberg.org/ebooks/bookshelf/65'],
    ['Association', 'https://www.gutenberg.org/ebooks/bookshelf/422'],
    ['America', 'https://www.gutenberg.org/ebooks/bookshelf/136'],
    ['Listings', 'https://www.gutenberg.org/ebooks/bookshelf/13'],
    ['Bibliomania', 'https://www.gutenberg.org/ebooks/bookshelf/15'],
    ['Biographies', 'https://www.gutenberg.org/ebooks/bookshelf/16'],
    ['Biology', 'https://www.gutenberg.org/ebooks/bookshelf/201'],
    ['Botany', 'https://www.gutenberg.org/ebooks/bookshelf/115'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/137'],
    ['Law', 'https://www.gutenberg.org/ebooks/bookshelf/205'],
    ['Buddhism', 'https://www.gutenberg.org/ebooks/bookshelf/116'],
    ['Camping', 'https://www.gutenberg.org/ebooks/bookshelf/148'],
    ['Canada', 'https://www.gutenberg.org/ebooks/bookshelf/118'],
    ['Chemistry', 'https://www.gutenberg.org/ebooks/bookshelf/211'],
    ['Series', 'https://www.gutenberg.org/ebooks/bookshelf/17'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/18'],
    ['History', 'https://www.gutenberg.org/ebooks/bookshelf/19'],
    ['Literature', 'https://www.gutenberg.org/ebooks/bookshelf/20'],
    ['Books', 'https://www.gutenberg.org/ebooks/bookshelf/22'],
    ['Christianity', 'https://www.gutenberg.org/ebooks/bookshelf/119'],
    ['Christmas', 'https://www.gutenberg.org/ebooks/bookshelf/23'],
    ['Antiquity', 'https://www.gutenberg.org/ebooks/bookshelf/24'],
    ['Cooking', 'https://www.gutenberg.org/ebooks/bookshelf/419'],
    ['Crafts', 'https://www.gutenberg.org/ebooks/bookshelf/27'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/28'],
    ['Nonfiction', 'https://www.gutenberg.org/ebooks/bookshelf/29'],
    ['History', 'https://www.gutenberg.org/ebooks/bookshelf/220'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/30'],
    ['Society', 'https://www.gutenberg.org/ebooks/bookshelf/31'],
    ['Engineering', 'https://www.gutenberg.org/ebooks/bookshelf/32'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/139'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/33'],
    ['Esperanto', 'https://www.gutenberg.org/ebooks/bookshelf/34'],
    ['Ecology', 'https://www.gutenberg.org/ebooks/bookshelf/224'],
    ['Education', 'https://www.gutenberg.org/ebooks/bookshelf/138'],
    ['Egypt', 'https://www.gutenberg.org/ebooks/bookshelf/121'],
    ['Fantasy', 'https://www.gutenberg.org/ebooks/bookshelf/36'],
    ['Folklore', 'https://www.gutenberg.org/ebooks/bookshelf/37'],
    ['Forestry', 'https://www.gutenberg.org/ebooks/bookshelf/145'],
    ['France', 'https://www.gutenberg.org/ebooks/bookshelf/122'],
    ['Forest', 'https://www.gutenberg.org/ebooks/bookshelf/226'],
    ['Geology', 'https://www.gutenberg.org/ebooks/bookshelf/227'],
    ['Germany', 'https://www.gutenberg.org/ebooks/bookshelf/123'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/39'],
    ['Greece', 'https://www.gutenberg.org/ebooks/bookshelf/124'],
    ['Classics', 'https://www.gutenberg.org/ebooks/bookshelf/40'],
    ['Hinduism', 'https://www.gutenberg.org/ebooks/bookshelf/125'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/41'],
    ['Horror', 'https://www.gutenberg.org/ebooks/bookshelf/42'],
    ['Horticulture', 'https://www.gutenberg.org/ebooks/bookshelf/43'],
    ['Humor', 'https://www.gutenberg.org/ebooks/bookshelf/44'],
    ['India', 'https://www.gutenberg.org/ebooks/bookshelf/45'],
    ['Islam', 'https://www.gutenberg.org/ebooks/bookshelf/126'],
    ['Italy', 'https://www.gutenberg.org/ebooks/bookshelf/127'],
    ['Judaism', 'https://www.gutenberg.org/ebooks/bookshelf/128'],
    ['Education', 'https://www.gutenberg.org/ebooks/bookshelf/46'],
    ['Love', 'https://www.gutenberg.org/ebooks/bookshelf/47'],
    ['Manufacturing', 'https://www.gutenberg.org/ebooks/bookshelf/146'],
    ['Mathematics', 'https://www.gutenberg.org/ebooks/bookshelf/102'],
    ['Medicine', 'https://www.gutenberg.org/ebooks/bookshelf/48'],
    ['Microbiology', 'https://www.gutenberg.org/ebooks/bookshelf/105'],
    ['Microscopy', 'https://www.gutenberg.org/ebooks/bookshelf/109'],
    ['Books', 'https://www.gutenberg.org/ebooks/bookshelf/49'],
    ['Music', 'https://www.gutenberg.org/ebooks/bookshelf/50'],
    ['Mycology', 'https://www.gutenberg.org/ebooks/bookshelf/129'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/51'],
    ['Mythology', 'https://www.gutenberg.org/ebooks/bookshelf/52'],
    ['Bookshelf', 'https://www.gutenberg.org/ebooks/bookshelf/149'],
    ['America', 'https://www.gutenberg.org/ebooks/bookshelf/53'],
    ['History', 'https://www.gutenberg.org/ebooks/bookshelf/54'],
    ['Zealand', 'https://www.gutenberg.org/ebooks/bookshelf/130'],
    ['Association', 'https://www.gutenberg.org/ebooks/bookshelf/244'],
    ['Norway', 'https://www.gutenberg.org/ebooks/bookshelf/131'],
    ['Plays', 'https://www.gutenberg.org/ebooks/bookshelf/55'],
    ['Opera', 'https://www.gutenberg.org/ebooks/bookshelf/56'],
    ['Paganism', 'https://www.gutenberg.org/ebooks/bookshelf/132'],
    ['Philosophy', 'https://www.gutenberg.org/ebooks/bookshelf/57'],
    ['Photography', 'https://www.gutenberg.org/ebooks/bookshelf/158'],
    ['Physics', 'https://www.gutenberg.org/ebooks/bookshelf/103'],
    ['Physiology', 'https://www.gutenberg.org/ebooks/bookshelf/110'],
    ['Plays', 'https://www.gutenberg.org/ebooks/bookshelf/59'],
    ['Poetry', 'https://www.gutenberg.org/ebooks/bookshelf/60'],
    ['Politics', 'https://www.gutenberg.org/ebooks/bookshelf/61'],
    ['Precursors', 'https://www.gutenberg.org/ebooks/bookshelf/62'],
    ['Gutenberg', 'https://www.gutenberg.org/ebooks/bookshelf/63'],
    ['Psychology', 'https://www.gutenberg.org/ebooks/bookshelf/64'],
    ['Reference', 'https://www.gutenberg.org/ebooks/bookshelf/66'],
    ['Stories', 'https://www.gutenberg.org/ebooks/bookshelf/67'],
    ['Science', 'https://www.gutenberg.org/ebooks/bookshelf/106'],
    ['Fiction', 'https://www.gutenberg.org/ebooks/bookshelf/68'],
    ['Women', 'https://www.gutenberg.org/ebooks/bookshelf/403'],
    ['Scouts', 'https://www.gutenberg.org/ebooks/bookshelf/144'],
    ['Stories', 'https://www.gutenberg.org/ebooks/bookshelf/69'],
    ['Slavery', 'https://www.gutenberg.org/ebooks/bookshelf/70'],
    ['Sociology', 'https://www.gutenberg.org/ebooks/bookshelf/134'],
    ['Africa', 'https://www.gutenberg.org/ebooks/bookshelf/135'],
    ['America', 'https://www.gutenberg.org/ebooks/bookshelf/71'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/140'],
    ['Suffrage', 'https://www.gutenberg.org/ebooks/bookshelf/72'],
    ['Technology', 'https://www.gutenberg.org/ebooks/bookshelf/143'],
    ['Microbiology', 'https://www.gutenberg.org/ebooks/bookshelf/105'],
    ['Journal', 'https://www.gutenberg.org/ebooks/bookshelf/423'],
    ['Transportation', 'https://www.gutenberg.org/ebooks/bookshelf/74'],
    ['Travel', 'https://www.gutenberg.org/ebooks/bookshelf/75'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/141'],
    ['Kingdom', 'https://www.gutenberg.org/ebooks/bookshelf/76'],
    ['States', 'https://www.gutenberg.org/ebooks/bookshelf/136'],
    ['Law', 'https://www.gutenberg.org/ebooks/bookshelf/302'],
    ['Western', 'https://www.gutenberg.org/ebooks/bookshelf/77'],
    ['Witchcraft', 'https://www.gutenberg.org/ebooks/bookshelf/78'],
    ['Journals', 'https://www.gutenberg.org/ebooks/bookshelf/80'],
    ['Woodwork', 'https://www.gutenberg.org/ebooks/bookshelf/147'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/142'],
    ['War', 'https://www.gutenberg.org/ebooks/bookshelf/325'],
    ['Zoology', 'https://www.gutenberg.org/ebooks/bookshelf/303']
]

Parsing several texts (25 at most) per theme for these themes:

In [None]:
import os
import re
import time
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import undetected_chromedriver as uc

# Initialize the driver
driver = uc.Chrome(headless=True)

def download_file(url, save_path):
    """Download a file from URL to specified path"""
    response = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled download
    if response.status_code == 200:
        with open(save_path, 'wb') as f:
            f.write(response.content)
        return True
    return False

def process_book(book_url, theme_dir):
    """Process individual book page and download if conditions are met"""
    try:
        driver.get(book_url)
        print(f"{theme_dir}:{book_url}")

        # 1. Check language (wait for element to load)
        try:
            lang_xpath = "//table[@class='bibrec']//tr[th='Language']/td"
            lang_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, lang_xpath))
            )
            language = lang_element.text.strip()
            print(f"Language {language}")
            if language not in ["English", "Английский"]:
                return False
        except Exception:
            return False

        # 2. Extract reading ease score
        try:
            # Use presence_of_all_elements_located to find every element that matches
            note_xpath = "//table[@class='bibrec']//tr[th='Note']/td"
            note_elements = WebDriverWait(driver, 10).until(
                EC.presence_of_all_elements_located((By.XPATH, note_xpath))
            )

            # Check if any notes were found
            if not note_elements:
                return False

            all_notes_data = []
            reading_score = None

            # Loop through each found note element
            for note_element in note_elements:
                note_text = note_element.text.strip()
                print(f"Note: {note_text}")

                # Extract reading score from each note using regex
                match = re.search(r"Reading ease score:\s*([\d.]+)", note_text)

                current_score = match.group(1) if match else None

                # Store the first valid reading score found
                if current_score and reading_score is None:
                    reading_score = current_score

                all_notes_data.append({
                    "note_text": note_text,
                    "reading_score": current_score
                })

            # Check if at least one note had a reading score.
            if reading_score is None:
                return False

        except Exception as e:
            print(f"An error occurred while parsing notes: {e}")
            return False

        # 3. Find and process download link
        try:
            selectors = [
                "//a[@class='link ' and text()='Plain Text UTF-8']"
            ]

            download_link = None
            for selector in selectors:
                try:
                    # Use find_elements to avoid an exception if not found
                    links = driver.find_elements(By.XPATH, selector)
                    if links:
                        download_link = links[0]
                        break
                except Exception:
                    # Continue to the next selector
                    continue

            if not download_link:
                print("Error: Download link not found.")
                return False

            download_url = download_link.get_attribute('href')
            book_id_match = re.search(r'/ebooks/(\d+)', book_url)
            if not book_id_match:
                print("Error: Could not extract book ID from URL.")
                return False

            book_id = book_id_match.group(1)
            filename = f"{book_id}_{reading_score}.txt"
            save_path = os.path.join(theme_dir, filename)

            # Download the file
            if download_file(download_url, save_path):
                print(f"Downloaded: {filename}")
                return True
            else:
                print(f"Failed to download: {filename}")
                return False

        except Exception as e:
            print(f"An error occurred during download link processing: {e}")
            return False

    except Exception as e:
        print(f"Error processing {book_url}: {str(e)}")
        return False

    return False

# Main processing loop
for theme_name, theme_url in processed_list:
    try:
        # Create theme directory
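        # Sanitize the theme name and keep every theme under the data/ directory,
        # so that files are saved into the directory that is actually created
        theme_dir = os.path.join('data', re.sub(r'[\\/*?:"<>|]', "", theme_name))
        os.makedirs(theme_dir, exist_ok=True)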
        print(f"\nProcessing theme: {theme_name}")

        # Navigate to bookshelf page
        driver.get(theme_url)
        time.sleep(2)  # Initial load wait

        # Process pagination
        page_count = 0
        books_processed = 0
        while True:
            page_count += 1
            print(f"  Page {page_count}")

            # Find all book links
            book_links = []
            try:
                book_elements = WebDriverWait(driver, 15).until(
                    EC.presence_of_all_elements_located((By.XPATH, "//li[@class='booklink']/a[@class='link']"))
                )
                book_links = [elem.get_attribute('href') for elem in book_elements]
            except Exception:
                # No book links found on this page (e.g. the wait timed out)
                pass

            # Process each book
            for book_url in book_links:
                if process_book(book_url, theme_dir):
                    books_processed += 1
                time.sleep(1)  # Be polite to server

            # Check for next page
            try:
                next_button = driver.find_element(By.XPATH, "//a[@title='Go to next page']")
                if "disabled" in next_button.get_attribute("class"):
                    break

                next_button.click()
                time.sleep(3)  # Wait for page load
            except Exception:
                break

        print(f"  Finished theme: {theme_name} | Books downloaded: {books_processed}")

    except Exception as e:
        print(f"Error processing theme {theme_name}: {str(e)}")

# Clean up
driver.quit()
print("\nProcessing completed!")

After 3 hours of execution, we get 1.4 GB of text data separated by theme and reading complexity. Files are saved in one directory per theme, and each file is named `{BookId}_{TextComplexity}.txt`.
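
As a quick sanity check on the download, the file naming scheme can be parsed back into book IDs and scores. The sketch below is only an illustration: `parse_book_filename` and `summarize_corpus` are hypothetical helpers, and the regular expression assumes that every file follows the naming pattern described above.

```python
import os
import re

# Hypothetical helper pattern mirroring the `{BookId}_{TextComplexity}.txt` naming.
FILENAME_RE = re.compile(r"^(\d+)_(\d+(?:\.\d+)?)\.txt$")

def parse_book_filename(filename):
    """Return (book_id, reading_score), or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

def summarize_corpus(base_dir):
    """Map each theme directory to (number of books, mean reading score)."""
    summary = {}
    for theme in sorted(os.listdir(base_dir)):
        theme_dir = os.path.join(base_dir, theme)
        if not os.path.isdir(theme_dir):
            continue
        scores = []
        for name in os.listdir(theme_dir):
            parsed = parse_book_filename(name)
            if parsed is not None:
                scores.append(parsed[1])
        if scores:
            summary[theme] = (len(scores), sum(scores) / len(scores))
    return summary
```

Calling `summarize_corpus` on the download directory gives a per-theme overview that makes empty or very small themes easy to spot before building the word dataset.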

In [3]:
dirs = [element[0] for element in processed_list]
dirs = list(set(dirs))
dirs.sort()
print(", ".join(dirs))

Adventure, Africa, America, Amphibians, Anarchism, Animal, Anthropology, Antiquity, Archaeology, Architecture, Argentina, Art, Association, Astronomy, Atheism, Australia, Bibliomania, Biographies, Biology, Birds, Books, Bookshelf, Botany, Buddhism, Camping, Canada, Chemistry, Christianity, Christmas, Classics, Cooking, Crafts, Domestic, Ecology, Education, Egypt, Engineering, Esperanto, Fantasy, Fiction, Folklore, Forest, Forestry, France, Geology, Germany, Greece, Gutenberg, Hinduism, History, Horror, Horticulture, Humor, India, Insects, Islam, Italy, Journal, Journals, Judaism, Kingdom, Law, Legends, Listings, Literature, Love, Mammals, Manufacturing, Mathematics, Medicine, Microbiology, Microscopy, Music, Mycology, Mythology, Nonfiction, Norway, Opera, Paganism, Philosophy, Photography, Physics, Physiology, Plays, Poetry, Politics, Precursors, Psychology, Racism, Reference, Science, Scouts, Series, Slavery, Society, Sociology, States, Stories, Suffrage, Technology, Transportation, T

Next, I build a "one word per row" dataset with these columns:

| Word | Adventure | Africa | ... | Zealand | Zoology | Total | Complexity |
|-|-|-|-|-|-|-|-|
| `str` | `int` | `int` | ... | `int` | `int` | `int` | `float` |

by this algorithm:
1. Open every text from every directory (theme)
2. Clean and tokenize the text with NLTK, removing stop words
3. Add every token:
  - if the word already exists, update its theme counter, total counter, and complexity by this formula:
  $
  \text{complexity}_{\text{new}} = \frac{\text{complexity}_{\text{old}}\cdot \text{total}+\text{text\_complexity}}{\text{total}+1}
  $
  - if the word does not exist, add it, update the counters, and set its complexity to the complexity of the current text
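
The update formula is an incremental arithmetic mean: after all updates, a word's complexity equals the average reading score over every occurrence of that word, so a long book contributes more than a short one. A minimal sketch with made-up scores (the function name `update_complexity` is chosen here only for illustration):

```python
def update_complexity(complexity_old, total, text_complexity):
    """Return the mean after adding one more occurrence scored text_complexity."""
    return (complexity_old * total + text_complexity) / (total + 1)

# Made-up reading scores of three texts in which the same word occurs once each
scores = [80.0, 60.0, 70.0]

complexity, total = 0.0, 0
for score in scores:
    complexity = update_complexity(complexity, total, score)
    total += 1

# The incremental result matches the plain average of all scores
plain_mean = sum(scores) / len(scores)
```

Because `total` is 0 for a new word, the first update simply sets the complexity to the score of the current text, which covers the "word does not exist" case as well.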


In [None]:
import os
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import re

# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

def preprocess_text(text):
    """Tokenize, clean, and lemmatize text"""
    # Convert to lowercase and remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords and short tokens
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]

def process_directory(base_dir, dirs):
    """Process all theme directories and create the word dataset"""
    # Filter themes to only include those in dirs list
    themes = [d for d in dirs if os.path.isdir(os.path.join(base_dir, d))]
    print(f"Processing {len(themes)} themes: {', '.join(themes)}")

    # Initialize word dictionary
    word_data = {}

    # Process each theme
    for theme in themes:
        theme_dir = os.path.join(base_dir, theme)
        print(f"Processing theme: {theme}")

        # Process each file in the theme directory
        for filename in os.listdir(theme_dir):
            if filename.endswith('.txt'):
                # Extract reading score from filename
                try:
                    book_id, reading_score = filename.split('_')[:2]
                    reading_score = float(reading_score.replace('.txt', ''))
                except Exception as e:
                    print(f"Error parsing filename {filename}: {e}")
                    continue

                filepath = os.path.join(theme_dir, filename)

                try:
                    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
                        text = f.read()

                    # Preprocess text
                    tokens = preprocess_text(text)

                    # Update word data
                    for word in tokens:
                        if word not in word_data:
                            # Initialize new word entry
                            word_data[word] = {
                                'Total': 0,
                                'Complexity': 0.0,
                                **{t: 0 for t in themes}  # Initialize all themes to 0
                            }

                        # Get current word stats
                        current = word_data[word]
                        total_old = current['Total']

                        # Update counts
                        current['Total'] += 1
                        current[theme] += 1

                        # Update complexity using moving average formula
                        current['Complexity'] = (
                            (current['Complexity'] * total_old) + reading_score
                        ) / (total_old + 1)

                except Exception as e:
                    print(f"Error processing {filename}: {str(e)}")

    # Convert to DataFrame
    df = pd.DataFrame.from_dict(word_data, orient='index')
    df.reset_index(inplace=True)
    df.rename(columns={'index': 'Word'}, inplace=True)

    # Reorder columns: Word, themes, Total, Complexity
    columns = ['Word'] + themes + ['Total', 'Complexity']
    return df[columns]

# Main processing
base_directory = "data"  # Directory containing theme subdirectories (same name as used by the downloader)
output_file = "word_dataset.csv"

# Use the dirs array from the previous cell
df = process_directory(base_directory, dirs)

# Display some statistics
print(f"\nDataset created with {len(df)} unique words")
print(f"Top 10 most frequent words:")
print(df.sort_values('Total', ascending=False).head(10)[['Word', 'Total', 'Complexity']])

# Save results
df.to_csv(output_file, index=False)
print(f"\nDataset saved to {output_file}")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\79150\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\79150\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\79150\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\79150\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


Processing 111 themes: Adventure, Africa, America, Amphibians, Anarchism, Animal, Anthropology, Antiquity, Archaeology, Architecture, Argentina, Art, Association, Astronomy, Atheism, Australia, Bibliomania, Biographies, Biology, Birds, Books, Bookshelf, Botany, Buddhism, Camping, Canada, Chemistry, Christianity, Christmas, Classics, Cooking, Crafts, Domestic, Ecology, Education, Egypt, Engineering, Esperanto, Fantasy, Fiction, Folklore, Forest, Forestry, France, Geology, Germany, Greece, Gutenberg, Hinduism, History, Horror, Horticulture, Humor, India, Insects, Islam, Italy, Journal, Journals, Judaism, Kingdom, Law, Legends, Listings, Literature, Love, Mammals, Manufacturing, Mathematics, Medicine, Microbiology, Microscopy, Music, Mycology, Mythology, Nonfiction, Norway, Opera, Paganism, Philosophy, Photography, Physics, Physiology, Plays, Poetry, Politics, Precursors, Psychology, Racism, Reference, Science, Scouts, Series, Slavery, Society, Sociology, States, Stories, Suffrage, Techno
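
One cleaning step that could be added before tokenizing is dropping the Project Gutenberg license header and footer, so that license wording does not inflate the counts of every theme. The sketch below assumes the marker lines look like `*** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***` and `*** END OF THE PROJECT GUTENBERG EBOOK <TITLE> ***`; files without such markers are returned unchanged. `strip_gutenberg_boilerplate` is a hypothetical helper, not part of the pipeline above.

```python
import re

# Assumed marker format; older files may say "THIS" instead of "THE".
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if present."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()
```

It could be applied to `text` right after the file is read and before `preprocess_text(text)` is called.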

In [22]:
# Import df from csv
df = pd.read_csv('word_dataset.csv')
# Keep only words that occur more than 10 times
df = df[df['Total'] > 10]

df.sort_values('Total', ascending=False).head(100)[['Word', 'Total', 'Complexity']]


| | Word | Total | Complexity |
|-|-|-|-|
| 52 | one | 749024 | 67.143501 |
| 608 | would | 477428 | 67.192777 |
| 140 | said | 460270 | 74.127681 |
| 683 | time | 406210 | 66.731788 |
| 17 | may | 384135 | 64.051919 |
| ... | ... | ... | ... |
| 1437 | set | 113125 | 67.948997 |
| 229 | order | 111802 | 63.708995 |
| 869 | far | 111664 | 65.489456 |
| 4 | world | 111031 | 65.826793 |
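
A note on the direction of the scale: the `Complexity` column is an average of Project Gutenberg's "Reading ease score", which appears to follow the Flesch reading ease convention, where a higher score means easier text. To express words on the 0 (preschool) to 100 (scientific literature) scale described at the start, the score would have to be inverted. The mapping below (`100 - score`, clipped to the range 0 to 100) is only one assumed option, and `to_difficulty` is a hypothetical helper.

```python
def to_difficulty(reading_ease):
    """Invert a reading ease score so that larger values mean harder text.

    Assumed mapping: 100 - score, clipped to the range [0, 100].
    """
    return min(100.0, max(0.0, 100.0 - reading_ease))

# Two rows copied from the output above: (word, average reading ease)
rows = [("said", 74.127681), ("may", 64.051919)]

difficulty = {word: to_difficulty(score) for word, score in rows}
```

Under this mapping, "said" comes out as easier than "may", which matches the intuition that dialogue-heavy texts are simpler to read.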


After computing the complexity of words, we can also add spoken-English data by parsing YouTube videos per theme and downloading their subtitles.

## Part 2: Exercises
Vocabulary alone is not enough: the way words are used matters too. To let users practice new words and learn grammar, syntax, and semantics, we will create a database of exercises from textbooks, websites, and other resources, and build a model on textual data that predicts difficult spots in sentences and generates assignments with solutions for them.

Anton Korotkov, ML engineer.<br>
I have received the data from George and analyzed which approaches might be useful for our purposes. For our goal, we need to design an architecture that analyzes the word dataset together with educational datasets (for example, open collections of English tasks) and generates a task from a given word. To build this architecture, we need to define the problem explicitly:<br>
**Text-to-Task Generation:**
- Input: (word, complexity, context, topic)
- Output: A task (e.g., fill-in-the-blank, match with definition, generate a sentence, etc.)
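
The input and output of this problem can be written down as two small record types. This is only a sketch of the interface: the class names, the field types, and the `kind` labels are assumptions made for illustration, not a fixed design.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class TaskRequest:
    """Input of text-to-task generation."""
    word: str
    complexity: str   # assumed encoding, e.g. a level label such as "B1"
    context: str
    topic: str

@dataclass(frozen=True)
class Task:
    """Output of text-to-task generation."""
    kind: str                     # e.g. "fill-in-the-blank", "match-definition"
    text: str                     # the exercise shown to the learner
    answer: Optional[str] = None  # expected solution, if the task has one

request = TaskRequest(
    word="helmet",
    complexity="B1",
    context="Safety equipment for cyclists",
    topic="Health",
)
```

Keeping the request and the task as plain records makes it possible to swap the generator behind them without changing the rest of the system.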

There are 3 approaches to reach our goal:

- **Encoder-Decoder Models (seq2seq)**:
    - T5 (Text-to-Text Transfer Transformer):
        - Every task is a text generation task
        - Input: "Generate vocabulary exercise for: [word], complexity: B1, context: ..., topic: Health"
        - Output: "Fill in the blank: You should always wear a ____ when cycling. (helmet)"
        - This approach is very useful because it is flexible, pretrained on NLP tasks, and **fine-tunable**
    - BART (Bidirectional and Auto-Regressive Transformers)
        - Similar to T5, but trained for denoising tasks and generation
        - More suitable for masked-style tasks (e.g., cloze tasks)
    - FLAN-T5 / mT5
        - Fine-tuned versions of T5 with better zero-shot generalization

- **Prompt-based Models (Instruction-Tuned LLMs)**
    - GPT-4 or GPT-3.5 (via OpenAI API)
    - LLaMA 3 / Mistral / Mixtral
    - Claude (Anthropic)
    - These models can be prompted like:
        - *Given the word "photosynthesis", complexity level "B2", context "The process plants use to make food...", and topic "Biology", generate a vocabulary exercise appropriate for a B2-level English learner.*
    - We can start the whole system with this approach, since it is **lightweight** and does not need heavy fine-tuning.
- **Fine-tuned Task Generators**
    - This approach is well suited to producing pedagogically focused output, since the models are tuned specifically for educational use
    - RuBERT or RoBERTa fine-tuned on educational datasets
    - Task-specific fine-tuning on datasets like:
        - George's work (open literature datasets)
        - BEA (Building Educational Applications)
        - Newsela (simplified texts by grade level)
        - TNT (Teacher Needs Tasks) datasets.

Over the next few weeks, we will decide which approach to use for our architecture and develop it rapidly.
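
Whichever approach is chosen, the T5-style input and output strings shown above can be produced without any model, which is one possible way to bootstrap training pairs or a rule-based baseline. The sketch below is an assumption about how that could look: `build_prompt` and `make_cloze` are hypothetical helpers, and the cloze rule (blank out the first whole-word match) is deliberately simple.

```python
import re

def build_prompt(word, complexity, context, topic):
    """Serialize the four inputs into one text-to-text prompt string."""
    return (
        f"Generate vocabulary exercise for: {word}, complexity: {complexity}, "
        f"context: {context}, topic: {topic}"
    )

def make_cloze(sentence, word, blank="____"):
    """Blank out the first whole-word occurrence of `word`; None if absent."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    if not pattern.search(sentence):
        return None
    question = pattern.sub(blank, sentence, count=1)
    return f"Fill in the blank: {question} ({word})"

prompt = build_prompt("helmet", "B1", "...", "Health")
target = make_cloze("You should always wear a helmet when cycling.", "helmet")
```

Pairs of `prompt` and `target` built this way from sentences in the collected books could serve as a first fine-tuning set, with the caveat that the rule does not check whether the blank is actually a good exercise.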