# CREATING A LIST OF ENGLISH WORDS

## 1. WordNet Noun Filtering and Export

This cell extracts and filters English **nouns** from the WordNet lexical database using the NLTK library. It works as follows:

1. **NLTK Resources Download**  
   Downloads WordNet and its multilingual extension (only needed once).

2. **Synset Iteration**  
   Loops through all noun synsets (`'n'`) and collects their lemmas (canonical word forms).

3. **Filtering Criteria**
   - Only words with **length between 4 and 8 characters**
   - Only **alphabetic words** (no numbers or punctuation)
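   - **No plurals**: words ending in `'s'` are excluded. This is a heuristic, so it also drops singular nouns that end in `'s'`, such as `glass`.
   - Lemmas are cleaned (lowercased, underscores removed, so a multi-word lemma such as `ice_cream` becomes `icecream`)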

4. **Uniqueness**  
   A Python `set` is used to ensure all words are unique.

5. **Alphabet Coverage (Optional)**  
   Collects the set of first letters covered by the resulting words (for diagnostic purposes).

6. **Saving to File**  
   The final sorted list is saved as a JSON file named `nouns_filtered.json`.

7. **Output**  
   Prints the total number of words saved and the initial letters covered.

This filtered noun list can be useful for word-based games, vocabulary exercises, or further linguistic processing.

In [1]:
import nltk
from nltk.corpus import wordnet as wn
import json
import string

# Download the required data (first time only)
nltk.download('wordnet')
nltk.download('omw-1.4')

# Set to avoid duplicates
noun_set = set()

# Extraction: nouns only, length 4-8, letters only, no plurals
for syn in wn.all_synsets('n'):
    for lemma in syn.lemmas():
        word = lemma.name().lower().replace('_', '')
        if (
            4 <= len(word) <= 8 and
            word.isalpha() and
            not word.endswith('s')
        ):
            noun_set.add(word)

# Collect the initial letters covered by the word list (optional, diagnostic only)
present_letters = set(word[0] for word in noun_set if word[0] in string.ascii_lowercase)

# Sort the words alphabetically
noun_list = sorted(noun_set)

# Save the JSON file
with open("nouns_filtered.json", "w") as f:
    json.dump(noun_list, f, indent=2)

print(f"{len(noun_list)} words saved to nouns_filtered.json")
print(f"Initial letters covered: {sorted(present_letters)}")

[nltk_data] Downloading package wordnet to /Users/giotto/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/giotto/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


30002 words saved to nouns_filtered.json
Initial letters covered: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
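
The coverage diagnostic is easier to read when it reports the letters that are missing instead of the ones that are present. A minimal sketch, assuming a hypothetical helper named `missing_initials` (not part of the notebook) and a tiny hand-written sample list:

```python
import string

def missing_initials(words):
    """Return the lowercase ASCII letters that no word in `words` starts with."""
    present = {w[0] for w in words if w and w[0] in string.ascii_lowercase}
    return sorted(set(string.ascii_lowercase) - present)

# Hypothetical sample; in the notebook this would be noun_set or noun_list.
sample = ["abbey", "bread", "candle", "zebra"]
print(missing_initials(sample)[:3])  # ['d', 'e', 'f']
```

An empty result means every letter of the alphabet starts at least one word, which is the case for the output shown above.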


This cell extracts and filters English **nouns** from the WordNet lexical database using the NLTK library. It works exactly like the previous cell, but keeps words that are 9 or 10 letters long.

In [None]:
import nltk
from nltk.corpus import wordnet as wn
import json
import string

nltk.download('wordnet')
nltk.download('omw-1.4')

noun_set = set()

# Keep only words that are 9 or 10 letters long
for syn in wn.all_synsets('n'):
    for lemma in syn.lemmas():
        word = lemma.name().lower().replace('_', '')
        if (
            9 <= len(word) <= 10 and
            word.isalpha() and
            not word.endswith('s')
        ):
            noun_set.add(word)

noun_list = sorted(noun_set)

with open("filtered_nouns_9-10.json", "w") as f:
    json.dump(noun_list, f, indent=2)

print(f"{len(noun_list)} words saved to filtered_nouns_9-10.json")
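
The two filtering cells differ only in their length bounds, so the shared logic could live in one function. The sketch below is an illustration, not the notebook's code: `filter_lemma_names` is a hypothetical name, and it takes plain strings so that it can be tried without downloading WordNet.

```python
def filter_lemma_names(lemma_names, min_len, max_len):
    """Normalize lemma names and keep those that pass the notebook's filters."""
    kept = set()  # a set removes duplicates such as "Abbey" / "abbey"
    for name in lemma_names:
        word = name.lower().replace("_", "")
        if (
            min_len <= len(word) <= max_len
            and word.isalpha()
            and not word.endswith("s")  # plural heuristic, also drops "glass"
        ):
            kept.add(word)
    return sorted(kept)

# Hand-written sample standing in for WordNet lemma names.
sample = ["Abbey", "ice_cream", "glass", "cat", "abbey", "r2d2", "abduction"]
print(filter_lemma_names(sample, 4, 8))   # ['abbey', 'icecream']
print(filter_lemma_names(sample, 9, 10))  # ['abduction']
```

With NLTK available, the lemma names could be supplied as `(lemma.name() for syn in wn.all_synsets('n') for lemma in syn.lemmas())`, which is the same iteration the cells above perform.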

## 2. Asynchronous Validation of Nouns via Cambridge Dictionary

This cell performs a parallel validation of a list of English nouns by checking whether each word exists in the [Cambridge Dictionary](https://dictionary.cambridge.org). The code uses `asyncio` and `aiohttp` for efficient web scraping. Here's how it works:

1. **Environment Setup for Jupyter**  
   `nest_asyncio.apply()` is used to allow asynchronous code to run in environments like Jupyter Notebooks, which already have a running event loop.

2. **Configuration**
   - Sets a limit of **5 concurrent HTTP requests** using an asyncio `Semaphore`.
   - Custom `User-Agent` header simulates a real browser to avoid being blocked by the site.

3. **Loading Words**  
   Reads a list of words from `nouns_filtered.json`, previously filtered from WordNet.

4. **Validation Function**
   - For each word, an HTTP request is sent to its Cambridge Dictionary page.
   - If the response is valid (`HTTP 200`), it parses the HTML using `BeautifulSoup`.
   - It checks for a `<div>` with the class `entry-body__el`, which signals a valid dictionary entry.
   - Valid words are appended to the `valid_words` list.
   - Progress and results are printed live in the notebook.

5. **Async Execution**
   - `asyncio.gather()` schedules all checks at once; the `Semaphore` ensures that no more than `CONCURRENT_REQUESTS` requests are in flight at any time.

6. **Saving Results**
   - After all words are processed, the validated words are written into `valid_nouns.json`.

This asynchronous approach significantly reduces the time required to verify large word lists compared to a sequential implementation, making it suitable for large-scale lexical validation tasks.
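
The concurrency limit described above can be observed without any network access. The following is a minimal sketch, assuming a hypothetical helper `bounded_gather` and a fake lookup that only sleeps; it records the peak number of simultaneous calls to show that the semaphore caps it.

```python
import asyncio

async def bounded_gather(items, worker, limit):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))

async def demo():
    in_flight = 0
    peak = 0

    async def fake_lookup(word):
        # Stand-in for the HTTP request: it only sleeps and counts callers.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return word.upper()

    words = ["abbey", "abbot", "abdomen", "ability"]
    results = await bounded_gather(words, fake_lookup, limit=2)
    return results, peak

results, peak = asyncio.run(demo())
print(results, peak)  # ['ABBEY', 'ABBOT', 'ABDOMEN', 'ABILITY'] 2
```

In a notebook cell, `await demo()` can replace `asyncio.run(demo())`, since Jupyter already runs an event loop.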

In [None]:
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import json
import nest_asyncio

# Apply patches for environments with active event loops: Jupyter
nest_asyncio.apply()

# Maximum number of parallel requests
CONCURRENT_REQUESTS = 5

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

# Load the words from the JSON file
with open('nouns_filtered.json', 'r') as f:
    words = json.load(f)

valid_words = []

# Limit the number of parallel requests
semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)

async def check_word(session, word, index):
    url = f"https://dictionary.cambridge.org/dictionary/english/{word}"
    async with semaphore:
        try:
            async with session.get(url, headers=headers, timeout=aiohttp.ClientTimeout(total=10)) as response:
                if response.status == 200:
                    html = await response.text()
                    soup = BeautifulSoup(html, 'html.parser')
                    if soup.find("div", class_="entry-body__el"):
                        valid_words.append(word)
                        print(f"[{index}] ✅ {word}")
                    else:
                        print(f"[{index}] ❌ {word} not found")
                else:
                    print(f"[{index}] ❌ {word} - HTTP {response.status}")
        except Exception as e:
            print(f"[{index}] ⚠️ Error with word '{word}': {e}")

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [check_word(session, word, i) for i, word in enumerate(words)]
        await asyncio.gather(*tasks)

# Start the program in async mode compatible with Jupyter
await main()

# Save the valid words to a JSON file
with open('valid_nouns.json', 'w') as f:
    json.dump(valid_words, f, indent=2)

[0] ❌ abaca not found
[1] ❌ abadan not found
[3] ❌ abamp not found
[2] ✅ abalone
[4] ❌ abampere not found
[5] ✅ abandon
[9] ✅ abattoir
[7] ❌ abator not found
[8] ❌ abattery not found
[6] ❌ abasia not found
[13] ✅ abbey
[11] ❌ abbacy not found
[10] ✅ abaya
[12] ❌ abbe not found
[14] ✅ abbot
[15] ✅ abdomen
[17] ✅ abductor
[18] ❌ abel not found
[19] ❌ abelard not found
[16] ✅ abducent
[21] ✅ abelia
[20] ❌ abele not found
[22] ❌ abelmosk not found
[24] ❌ aberdare not found
[23] ❌ abenaki not found
[27] ✅ abetment
[26] ✅ aberrant
[25] ✅ aberdeen
[30] ✅ abettor
[29] ❌ abetter not found
[28] ❌ abettal not found
[31] ✅ abeyance
[32] ❌ abfarad not found
[33] ❌ abhenry not found
[35] ❌ abidance not found
[34] ❌ abhorrer not found
[36] ✅ abidjan
[37] ❌ abience not found
[39] ❌ abilene not found
[38] ❌ abila not found
[41] ❌ abjurer not found
[40] ✅ ability
[42] ✅ abkhaz
[44] ✅ ablation
[45] ✅ ablative
[43] ❌ abkhazia not found
[47] ✅ ableism
[46] ❌ ablaut not found
[48] ❌ ablism not found
[49] ✅ 
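
The indices in the log above appear out of order because each check finishes at its own pace, and a list filled with `append` inherits that completion order. One possible alternative, sketched here with a fake check and a hypothetical `KNOWN` set instead of real HTTP requests, is to have each coroutine return its verdict: `asyncio.gather()` returns results in the order of its arguments, so the valid words can be rebuilt in input order.

```python
import asyncio

# Hypothetical stand-in for "the dictionary has an entry for this word".
KNOWN = {"abalone", "abandon"}

async def fake_check(word, delay, completed):
    await asyncio.sleep(delay)   # simulates a variable response time
    completed.append(word)       # completion order, like appending to valid_words
    return word in KNOWN         # verdict returned instead of stored globally

async def demo():
    words = ["abalone", "abamp", "abandon"]
    delays = [0.03, 0.01, 0.02]
    completed = []
    verdicts = await asyncio.gather(
        *(fake_check(w, d, completed) for w, d in zip(words, delays))
    )
    # gather() keeps the input order, so zip lines words up with verdicts.
    valid = [w for w, ok in zip(words, verdicts) if ok]
    return completed, valid

completed, valid = asyncio.run(demo())
print(completed)  # ['abamp', 'abandon', 'abalone']
print(valid)      # ['abalone', 'abandon']
```

The merge step in section 3 sorts the final list anyway, so this only matters if the intermediate file is expected to be ordered.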

This cell performs a parallel validation of a list of English nouns by checking whether each word exists in the [Cambridge Dictionary](https://dictionary.cambridge.org). The code uses `asyncio` and `aiohttp` for efficient web scraping. It works exactly like the previous cell, but for longer nouns.


In [None]:
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import json
import nest_asyncio

nest_asyncio.apply()

CONCURRENT_REQUESTS = 5
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

# Source file: the 9-10 letter list written by the filtering cell in section 1
with open('filtered_nouns_9-10.json', 'r') as f:
    words = json.load(f)

valid_words_910 = []
semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)

async def check_word(session, word, index):
    url = f"https://dictionary.cambridge.org/dictionary/english/{word}"
    async with semaphore:
        try:
            async with session.get(url, headers=headers, timeout=aiohttp.ClientTimeout(total=10)) as response:
                if response.status == 200:
                    html = await response.text()
                    soup = BeautifulSoup(html, 'html.parser')
                    if soup.find("div", class_="entry-body__el"):
                        valid_words_910.append(word)
                        print(f"[{index}] ✅ {word}")
                    else:
                        print(f"[{index}] ❌ {word} not found")
                else:
                    print(f"[{index}] ❌ {word} - HTTP {response.status}")
        except Exception as e:
            print(f"[{index}] ⚠️ Error with word '{word}': {e}")

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [check_word(session, word, i) for i, word in enumerate(words)]
        await asyncio.gather(*tasks)

await main()

# Save the validated 9-10 letter words to valid_nouns_9-10.json
with open('valid_nouns_9-10.json', 'w') as f:
    json.dump(valid_words_910, f, indent=2)

print(f"✅ {len(valid_words_910)} valid words saved to valid_nouns_9-10.json")

## 3. Merge and Sort Valid Noun Lists (Lengths 4–8 and 9–10)

This code block performs the following operations:

1. **Loads** two JSON files:
   - `valid_nouns_9-10.json`: contains validated English nouns that are 9 or 10 letters long.
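   - `valid_nouns_4-8.json`: contains validated English nouns that are 4 to 8 letters long. The validation cell in section 2 writes this list as `valid_nouns.json`, so rename that file or adjust the path before running this cell.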

2. **Merges** the two lists into a single set to remove any duplicate entries.

3. **Sorts** the combined set alphabetically using Python’s built-in `sorted()` function, which internally uses the efficient Timsort algorithm.

4. **Saves** the resulting sorted list into a new JSON file named `valid_nouns_4-10.json`.

5. **Prints** the total number of unique valid nouns saved.

This step ensures you have a comprehensive, alphabetically sorted list of validated English nouns ranging from 4 to 10 letters.

In [2]:
import json

# File names
file1 = 'valid_nouns_9-10.json'
file2 = 'valid_nouns_4-8.json'
output_file = 'valid_nouns_4-10.json'

# Load the two lists
with open(file1, 'r') as f:
    list1 = json.load(f)
with open(file2, 'r') as f:
    list2 = json.load(f)

# Merge the two lists; the set removes any duplicates
merged_set = set(list1 + list2)

# Sort alphabetically (Python's sorted() uses Timsort)
sorted_list = sorted(merged_set)

# Save to the final file
with open(output_file, 'w') as f:
    json.dump(sorted_list, f, indent=2)

print(f"✅ {len(sorted_list)} words saved to {output_file}")

✅ 21114 words saved to valid_nouns_4-10.json
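
A quick sanity check on the merged file could confirm that every word falls in the intended 4 to 10 letter range. This is a sketch under assumptions: the helper names `length_histogram` and `out_of_range` are hypothetical, and the sample list stands in for the contents of `valid_nouns_4-10.json`.

```python
from collections import Counter

def length_histogram(words):
    """Count how many words there are of each length, ordered by length."""
    return dict(sorted(Counter(len(w) for w in words).items()))

def out_of_range(words, min_len=4, max_len=10):
    """Return the words whose length falls outside [min_len, max_len]."""
    return [w for w in words if not (min_len <= len(w) <= max_len)]

# Hypothetical sample; "cat" is deliberately too short.
sample = ["abbey", "abbot", "abdomen", "abduction", "cat"]
print(length_histogram(sample))  # {3: 1, 5: 2, 7: 1, 9: 1}
print(out_of_range(sample))      # ['cat']
```

On the real data, the list would come from `json.load()` on the output file, and `out_of_range` is expected to return an empty list.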
