# Complex Word Identification in NLP


### In NLP, Complex Word Identification (CWI) is the task of automatically detecting words that may be difficult for a reader to understand. This task is especially important in applications such as:

* Text simplification

* Educational tools

* Assistive reading systems

* Content adaptation for non-native speakers

This notebook covers two approaches to complex word identification:

1. A rule-based approach using word length and frequency

2. A machine learning-based approach using a Decision Tree classifier

### Libraries and Tools Used

**1. NLTK** - Used to access linguistic corpora such as the Brown Corpus, which provides word frequency statistics.

**2. spaCy** - Installed for general NLP support and consistency with previous labs.

**3. pandas** - Used to organize data into tabular form for machine learning.

**4. scikit-learn** - Used to build and train a Decision Tree classifier for word complexity prediction.

In [None]:
!pip install nltk spacy scikit-learn pandas
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


The Brown Corpus is a standard English corpus containing text from various domains. It is commonly used for linguistic analysis and frequency estimation.

### Word Frequency Distribution

#### A frequency distribution is created by counting how often each word occurs in the corpus.

Why frequency matters:

* Common words are generally easier to understand

* Rare words are more likely to be complex
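To make the counting concrete, here is a minimal sketch that builds a frequency distribution over a few toy sentences instead of the Brown Corpus. The `toy_corpus` data and the `relative_frequency` helper are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

# Toy corpus standing in for the Brown Corpus (an assumption for illustration)
toy_corpus = [
    "the plant needs light",
    "the plant uses photosynthesis",
    "light helps the plant grow",
]

# Count lowercased tokens, mirroring how the notebook counts brown.words()
toy_freq = Counter(
    token.lower() for sentence in toy_corpus for token in sentence.split()
)

def relative_frequency(word):
    """Share of all tokens that are `word` (hypothetical helper)."""
    total = sum(toy_freq.values())
    return toy_freq[word.lower()] / total

# Frequent words have high counts; rare words have low ones
print(toy_freq["the"], toy_freq["photosynthesis"])

# A Counter returns 0 for unseen words instead of raising KeyError
print(toy_freq["metamorphosis"])
```

Because a `Counter` returns 0 for missing keys, words absent from the corpus are treated as maximally rare without any special handling.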

### **Rule-Based Complexity Detection**

#### In the rule-based approach, a word is considered complex if its length exceeds a fixed threshold.

Rule Used:

If length(word) > 7 → Complex

Otherwise → Simple

This approach is easy to implement but limited because:

* It ignores word meaning

* It does not consider context

In [None]:
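# Length-based complex word detection
# (the frequency table built here is used later as a model feature)

import nltk
from nltk.corpus import brown
from collections import Counter

nltk.download("brown")

# Count how often each lowercased token occurs in the Brown Corpus
words = brown.words()
freq_dist = Counter(word.lower() for word in words)

def is_complex(word, threshold=7):
    # A word is complex if it has more than `threshold` characters
    return len(word) > threshold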

sentence = "Photosynthesis is an essential biochemical process".split()

for word in sentence:
    if is_complex(word):
        print(word, "-> is complex")
    else:
        print(word, "-> is simple")



[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


Photosynthesis -> is complex
is -> is simple
an -> is simple
essential -> is complex
biochemical -> is complex
process -> is simple
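
The overview lists word length and frequency as the two rule-based signals, while the rule above checks length only. One possible variant that also consults frequency is sketched below. The name `is_complex_combined`, the `min_count` cutoff of 5, and the toy counts are assumptions chosen for illustration, not values taken from the notebook or from the Brown Corpus.

```python
# Hypothetical counts standing in for the Brown-based frequency table
toy_counts = {"process": 196, "essential": 70, "is": 10000, "an": 3700}

def is_complex_combined(word, counts, max_length=7, min_count=5):
    """Flag a word as complex if it is long OR rare in `counts`."""
    too_long = len(word) > max_length
    too_rare = counts.get(word.lower(), 0) < min_count
    return too_long or too_rare

for word in "Photosynthesis is an essential biochemical process".split():
    label = "complex" if is_complex_combined(word, toy_counts) else "simple"
    print(word, "->", label)
```

Combining the checks with `or` also catches short words that are missing from the counts; switching to `and` would make the rule stricter and flag only words that are both long and rare.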


### **Machine Learning-Based Complex Word Identification**



#### 1. Feature Extraction

Each word is converted into numerical features that help the model decide whether a word is complex or simple.

The features used are:

* Word length – longer words are often more complex

* Word frequency – rare words are usually harder to understand






#### 2. Dataset Preparation

A small labeled dataset is created:

0 → Simple word

1 → Complex word





#### 3. Model Used: Decision Tree

A Decision Tree classifier is used because:

* It works well with small datasets and is easy to interpret

* It makes decisions based on clear rules
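
To illustrate what a "clear rule" looks like, the sketch below hand-builds a one-level tree (a decision stump) over word length. It is a simplified stand-in and not scikit-learn's algorithm; `fit_stump` is a hypothetical helper, and the words and labels are the same six labelled examples the notebook trains on.

```python
def fit_stump(values, labels):
    """Find the threshold t where predicting 1 for value > t makes the
    fewest mistakes on the training data (a one-level decision tree)."""
    candidates = sorted(set(values))
    best_threshold, best_errors = None, len(labels) + 1
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2  # midpoint between neighbouring values
        errors = sum(
            (value > threshold) != bool(label)
            for value, label in zip(values, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

# Labelling scheme from the notebook: 0 = simple, 1 = complex
train_words = ["dog", "cat", "photosynthesis", "biochemical", "run", "photosynthetic"]
train_labels = [0, 0, 1, 1, 0, 1]

threshold = fit_stump([len(w) for w in train_words], train_labels)
print(f"if length > {threshold} -> complex, else simple")
```

A full decision tree repeats this kind of split recursively and across several features, but each node is still a readable threshold test like the one printed here.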




#### 4. Prediction

After training, the model predicts whether each word is simple or complex based on learned patterns. The test words are:

* plant

* metamorphosis


In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

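def extract_features(word):
    # Relies on freq_dist (Brown Corpus counts) built in the rule-based cell;
    # a Counter returns 0 for words it has never seen
    return {
        "length": len(word),
        "frequency": freq_dist[word.lower()]
    }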

data = [
    ("dog", 0),
    ("cat", 0),
    ("photosynthesis", 1),
    ("biochemical", 1),
    ("run", 0),
    ("photosynthetic", 1)
]

df = pd.DataFrame(data, columns=["word", "label"])

X = pd.DataFrame(df["word"].apply(extract_features).tolist())
y = df["label"]

model = DecisionTreeClassifier(random_state=42)
model.fit(X, y)

test_words = ["plant", "metamorphosis"]
test_features = pd.DataFrame([extract_features(w) for w in test_words])
predictions = model.predict(test_features)

for word, label in zip(test_words, predictions):
    print(word, "->", "complex" if label else "simple")


plant -> simple
metamorphosis -> complex
