# Complete Guide to Natural Language Processing (NLP)
## **Introduction**

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and human language. The NLP pipeline consists of several sequential steps that transform raw text into a format suitable for machine learning algorithms and analysis.

## **Overview of NLP Pipeline**

The typical NLP pipeline follows these major phases:

1. **Text Acquisition**
2. **Text Preprocessing**
3. **Feature Extraction**
4. **Model Training/Application**
5. **Evaluation and Deployment**

---

## 1. **Text Acquisition**

### Definition
Text acquisition is the initial step where raw text data is collected from various sources.

### **Common Sources**
- Web scraping
- APIs (Twitter, Reddit, etc.)
- Databases
- Document files (PDF, Word, etc.)
- User input
- Sensor data (speech-to-text)

### **Key Considerations**
- Data quality and reliability
- Legal and ethical considerations
- Data volume and diversity
- Encoding formats (UTF-8, ASCII, etc.)

---

## 2. **Text Preprocessing**

Text preprocessing is often the most influential phase of an NLP project: it cleans and standardizes raw text so that it is suitable for analysis, and mistakes made here propagate to every later step.

### 2.1 Text Cleaning

#### Noise Removal
- **Purpose**: Remove irrelevant characters and formatting
- **Includes**:
  - HTML tags removal
  - Special characters elimination
  - Extra whitespace normalization
  - Non-printable character removal
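
The noise-removal steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a production cleaner: `clean_text` is a hypothetical helper name, and the regex-based tag removal is a deliberate simplification that will not handle malformed HTML.

```python
import html
import re


def clean_text(raw: str) -> str:
    """Remove common noise: HTML tags, entities, control characters, extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop HTML tags (crude, regex-based)
    text = html.unescape(text)            # decode entities such as &amp;
    # keep printable characters and whitespace, drop other control characters
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


print(clean_text("<p>Hello &amp; welcome!\x00</p>\n\n  <b>NLP</b>   rocks."))
# Hello & welcome! NLP rocks.
```

For real web pages, a dedicated HTML parser is usually a safer choice than a regular expression.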

#### Case Normalization
- **Purpose**: Standardize text case for consistency
- **Methods**:
  - Convert to lowercase (most common)
  - Convert to uppercase
  - Title case conversion

### 2.2 Tokenization

#### Definition
Tokenization is the process of breaking down text into individual units called tokens (words, phrases, or characters).

#### Types of Tokenization
- **Word Tokenization**: Splitting text into individual words
- **Sentence Tokenization**: Dividing text into sentences
- **Subword Tokenization**: Breaking words into smaller units (BPE, WordPiece)
- **Character Tokenization**: Splitting into individual characters

#### Challenges
- Handling punctuation
- Contractions (don't, won't)
- Hyphenated words
- URLs and email addresses
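
To make the tokenization types and challenges above concrete, here is a small regex-based sketch. The pattern and the helper names `word_tokenize` and `sentence_tokenize` are assumptions made for this illustration; library tokenizers such as spaCy's use far richer rules.

```python
import re

# Alternatives are tried left to right, so URLs and emails win over plain words.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"                     # URLs
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"    # email addresses
    r"|\w+(?:'\w+)?"                    # words, keeping contractions such as don't
    r"|[^\w\s]"                         # any other single non-space character
)


def word_tokenize(text):
    """Split text into word-level tokens."""
    return TOKEN_PATTERN.findall(text)


def sentence_tokenize(text):
    """Naive rule: split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "Don't panic: email test@example.com or visit https://example.com now!"
print(word_tokenize(text))
print(sentence_tokenize("I love NLP. It is fun!"))
print(list("NLP"))  # character tokenization
```

The naive sentence splitter breaks on abbreviations such as "Dr.", which is one reason rule-based or trained sentence segmenters are preferred in practice.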

### 2.3 Stop Words Removal

#### Definition
Stop words are common words that carry little semantic meaning and are often removed to focus on more meaningful terms.

#### Common Stop Words
- Articles: a, an, the
- Prepositions: in, on, at, by
- Pronouns: I, you, he, she, it
- Conjunctions: and, or, but

#### Considerations
- Language-specific stop word lists
- Domain-specific stop words
- Context-dependent importance
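
The point about context-dependent importance can be illustrated with a tiny filter. The stop-word set below is a hand-written sample rather than a library list, and `remove_stop_words` with its `keep` parameter is a hypothetical helper.

```python
# A small hand-written sample; real stop-word lists are much longer.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "by", "i", "you",
              "he", "she", "it", "and", "or", "but", "is", "not"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, keep=()):
    """Drop stop words, except those explicitly listed in `keep`."""
    keep = {w.lower() for w in keep}
    return [t for t in tokens
            if t.lower() not in stop_words or t.lower() in keep]


tokens = "the movie is not good".split()
print(remove_stop_words(tokens))                # ['movie', 'good']
print(remove_stop_words(tokens, keep={"not"}))  # ['movie', 'not', 'good']
```

Dropping "not" flips the apparent meaning of the sentence, which is why sentiment-analysis pipelines often keep negation words.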

### 2.4 Stemming

#### Definition
Stemming reduces words to a root form by stripping affixes (mostly suffixes) with fixed rules. The result is not always a valid dictionary word.

#### Purpose
- Reduce vocabulary size
- Group related words together
- Improve computational efficiency

#### Common Algorithms
- **Porter Stemmer**: Most widely used English stemmer
- **Snowball Stemmer**: Multilingual stemming algorithm
- **Lancaster Stemmer**: More aggressive than Porter

#### Example
- running, runs → run
- ran → ran (irregular forms are left unchanged)
- better → better (a stemmer cannot map it to good)

### 2.5 Lemmatization

#### Definition
Lemmatization reduces words to their canonical dictionary form (lemma) using linguistic analysis.

#### Difference from Stemming
- More accurate than stemming
- Considers word context and part of speech
- Produces valid dictionary words
- Computationally more expensive

#### Example
- better → good (lemmatization)
- better → better (stemming)
- mice → mouse (lemmatization)
- mice → mice (stemming)

### 2.6 Part-of-Speech (POS) Tagging

#### Definition
POS tagging assigns grammatical categories (noun, verb, adjective, etc.) to each word in a sentence.

#### Common POS Tags
- **Nouns**: NN (singular), NNS (plural), NNP (proper noun)
- **Verbs**: VB (base form), VBD (past tense), VBG (gerund or present participle)
- **Adjectives**: JJ (adjective), JJR (comparative), JJS (superlative)
- **Adverbs**: RB (adverb), RBR (comparative adverb)

#### Applications
- Information extraction
- Syntax analysis
- Machine translation
- Text summarization

### 2.7 Named Entity Recognition (NER)

#### Definition
NER identifies named entities in text, such as people, organizations, and places, along with expressions like dates and monetary amounts, and classifies them into predefined categories.

#### Common Entity Types
- **Person**: Names of individuals
- **Organization**: Company names, institutions
- **Location**: Cities, countries, landmarks
- **Date/Time**: Dates, times, durations
- **Money**: Currency amounts
- **Percentage**: Numerical percentages

#### Applications
- Information extraction
- Question answering systems
- Content categorization
- Knowledge graph construction

---

## 3. **Feature Extraction**

### 3.1 Bag of Words (BoW)

#### Definition
BoW represents text as a collection of words, disregarding grammar and word order but keeping track of frequency.

#### Characteristics
- Simple and intuitive
- Ignores word order
- Sparse representation
- Good baseline for many tasks

### 3.2 TF-IDF (Term Frequency-Inverse Document Frequency)

#### Definition
TF-IDF weighs words based on their frequency in a document and rarity across the entire corpus.

#### Components
- **Term Frequency (TF)**: How often a term appears in a document
- **Inverse Document Frequency (IDF)**: How rare a term is across all documents

#### Formula
TF-IDF = TF(t,d) × IDF(t,D)
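
As a rough sketch of the formula, the function below uses one common textbook variant: TF is the term count divided by the document length, and IDF is `log(N / df)`. Libraries often use smoothed variants, so their numbers will differ. The function name `tf_idf` and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy corpus "the" occurs in every document, so its weight is 0, while a word unique to one document such as "cat" receives the highest weight.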

### 3.3 N-grams

#### Definition
N-grams are contiguous sequences of n words from a text.

#### Types
- **Unigrams**: Single words
- **Bigrams**: Two consecutive words
- **Trigrams**: Three consecutive words

#### Benefits
- Capture local word dependencies
- Preserve some word order information
- Useful for language modeling
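
A minimal sketch of n-gram extraction over a whitespace-tokenized sentence follows. The helper name `ngrams` is chosen for this example.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love NLP".split()
print(ngrams(tokens, 1))  # [('I',), ('love',), ('NLP',)]
print(ngrams(tokens, 2))  # [('I', 'love'), ('love', 'NLP')]
print(ngrams(tokens, 3))  # [('I', 'love', 'NLP')]
```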

### 3.4 Word Embeddings

#### Definition
Word embeddings represent words as dense vectors in a continuous vector space where semantically similar words are closer together.

#### Popular Methods
- **Word2Vec**: Skip-gram and CBOW models
- **GloVe**: Global Vectors for Word Representation
- **FastText**: Extends Word2Vec with subword information

#### Advantages
- Capture semantic relationships
- Dense representation
- Transfer learning capabilities

---

## 4. **Advanced Preprocessing Techniques**

### 4.1 Dependency Parsing

#### Definition
Dependency parsing analyzes the grammatical structure of sentences by identifying relationships between words.

#### Applications
- Relationship extraction
- Question answering
- Machine translation

### 4.2 Coreference Resolution

#### Definition
Coreference resolution identifies when different expressions refer to the same entity.

#### Example
"John went to the store. He bought milk." (He refers to John)

### 4.3 Sentiment Analysis Preprocessing

#### Specific Considerations
- Handling negations
- Emoticon processing
- Slang and informal language
- Context-dependent sentiment
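
One simple way to handle negations is to mark every token that follows a negation word until the next punctuation mark. The sketch below assumes the text is already tokenized; the `NOT_` prefix, the word list, and the helper name `mark_negation` are illustrative choices, not a fixed standard.

```python
import re

NEGATIONS = {"not", "no", "never"}


def mark_negation(tokens):
    """Prefix tokens after a negation word with NOT_ until the next punctuation mark."""
    marked, negated = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,;:!?]", tok):
            negated = False                      # punctuation ends the negation scope
            marked.append(tok)
        elif tok.lower() in NEGATIONS or tok.lower().endswith("n't"):
            negated = True                       # start a negation scope
            marked.append(tok)
        else:
            marked.append("NOT_" + tok if negated else tok)
    return marked


print(mark_negation("I did not like the movie , but the music was great .".split()))
# ['I', 'did', 'not', 'NOT_like', 'NOT_the', 'NOT_movie', ',', 'but', 'the', 'music', 'was', 'great', '.']
```

With this marking, a bag-of-words model sees "NOT_like" and "like" as different features.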

---

## 5. **Text Normalization Techniques**

### 5.1 Spelling Correction

#### Purpose
Correct misspelled words to improve text quality.

#### Methods
- Dictionary-based correction
- Statistical models
- Phonetic matching
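
Dictionary-based correction can be sketched with `difflib.get_close_matches` from the standard library, which ranks candidates by string similarity. The vocabulary, the `cutoff` value of 0.8, and the helper name `correct` are assumptions for this example.

```python
from difflib import get_close_matches

# A toy vocabulary; a real system would load a full dictionary.
VOCABULARY = ["receive", "recipe", "deceive", "language", "processing"]


def correct(word, vocabulary=VOCABULARY):
    """Return the closest vocabulary word, or the word itself if nothing is close enough."""
    if word in vocabulary:
        return word
    matches = get_close_matches(word, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word


print(correct("recieve"))  # receive
print(correct("xyzzy"))    # xyzzy (no close match, left unchanged)
```

String similarity alone ignores context, so statistical models are typically layered on top when several candidates are plausible.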

### 5.2 Contraction Expansion

#### Purpose
Expand contractions to their full forms.

#### Examples
- don't → do not
- won't → will not
- I'm → I am
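
A lookup table is often enough for contraction expansion. The sketch below covers only the three examples above; the mapping and the helper name `expand_contractions` are illustrative.

```python
import re

# Tiny hand-written mapping for illustration; real lists are much longer.
CONTRACTIONS = {"don't": "do not", "won't": "will not", "i'm": "I am"}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I'm sure they won't mind, so don't worry."))
# I am sure they will not mind, so do not worry.
```

Note that this version does not restore capitalization, so a sentence-initial "Don't" becomes "do not".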

### 5.3 Handling Special Characters and Symbols

#### Considerations
- Preserve meaningful symbols
- Remove or replace irrelevant characters
- Handle Unicode characters
- Process mathematical symbols

---

## 6. **Language-Specific Considerations**

### 6.1 Multilingual Processing

#### Challenges
- Different writing systems
- Varying grammatical structures
- Language detection
- Cross-lingual transfer

### 6.2 Domain-Specific Processing

#### Considerations
- Medical texts
- Legal documents
- Social media content
- Technical literature

---

## 7. **Quality Assurance and Validation**

### 7.1 Data Quality Checks

#### Validation Steps
- Encoding verification
- Completeness assessment
- Consistency checks
- Duplicate detection

### 7.2 Preprocessing Validation

#### Verification Methods
- Sample inspection
- Statistical analysis
- A/B testing
- Domain expert review

---

## 8. **Tools and Libraries**

### Popular NLP Libraries

#### Python
- **NLTK**: Comprehensive NLP toolkit
- **spaCy**: Industrial-strength NLP
- **Gensim**: Topic modeling and document similarity
- **TextBlob**: Simple NLP processing

#### Other Languages
- **Stanford CoreNLP** (Java)
- **OpenNLP** (Java)
- **Gate** (Java)

---

## 9. **Best Practices**

### 9.1 Preprocessing Guidelines

1. **Understand Your Data**: Analyze text characteristics before preprocessing
2. **Preserve Important Information**: Don't over-clean the data
3. **Document Your Steps**: Maintain clear records of preprocessing decisions
4. **Validate Results**: Always verify preprocessing outcomes
5. **Consider Domain Context**: Adapt preprocessing to specific use cases

### 9.2 Performance Considerations

- **Computational Efficiency**: Balance accuracy with processing speed
- **Memory Management**: Handle large datasets efficiently
- **Scalability**: Design for growth in data volume
- **Reproducibility**: Ensure consistent results across runs

---

## 10. **Common Pitfalls and Solutions**

### 10.1 Data Leakage

#### Problem
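Information from the target variable or from held-out data is inadvertently included in the features, for example by fitting a vectorizer on the full dataset before splitting it.

#### Solution
Split the data first, fit vectorizers and other learned preprocessing steps on the training split only, and evaluate with cross-validation.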

### 10.2 Over-preprocessing

#### Problem
Removing too much information, leading to loss of meaningful signals.

#### Solution
Iterative approach with performance monitoring.

### 10.3 Inconsistent Preprocessing

#### Problem
Different preprocessing for training and inference data.

#### Solution
Standardized preprocessing pipelines and version control.
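
The pitfalls above share one remedy: keep preprocessing in a single object, fit it on training data only, and reuse it unchanged at inference time. The class below is a toy, pure-Python sketch of that pattern; `BowVectorizer` is a hypothetical helper, not a library class.

```python
from collections import Counter


class BowVectorizer:
    """Toy bag-of-words vectorizer (hypothetical helper for illustration)."""

    def preprocess(self, text):
        # The single preprocessing function used for both training and inference.
        return text.lower().split()

    def fit(self, texts):
        vocab = sorted({tok for text in texts for tok in self.preprocess(text)})
        self.vocabulary_ = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            counts = Counter(self.preprocess(text))
            rows.append([counts.get(tok, 0) for tok in self.vocabulary_])
        return rows


train_texts = ["NLP is fun", "I love NLP"]
new_texts = ["NLP is hard"]

vec = BowVectorizer().fit(train_texts)  # vocabulary learned from training data only
print(vec.vocabulary_)                  # {'fun': 0, 'i': 1, 'is': 2, 'love': 3, 'nlp': 4}
print(vec.transform(new_texts))         # [[0, 0, 1, 0, 1]]  ('hard' is unseen and ignored)
```

Because `transform` calls the same `preprocess` method that `fit` used, training and inference cannot drift apart.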

---

## Conclusion

The NLP preprocessing pipeline is foundational to successful natural language processing applications. Each step serves a specific purpose in transforming raw text into a format suitable for machine learning algorithms. The key to effective preprocessing lies in understanding your specific use case, maintaining data quality, and following best practices while avoiding common pitfalls.

Successful NLP projects require careful consideration of each preprocessing step, appropriate tool selection, and continuous validation of results. As the field evolves, new techniques and tools emerge, making it essential to stay updated with the latest developments in NLP preprocessing methodologies.

In [96]:
import spacy as sc

In [97]:
nlp=sc.blank('en')  # blank English pipeline: tokenizer only, no trained components

In [98]:
sentence=nlp("I am Maarij")  # tokenize the sentence into a Doc

In [99]:
for word in sentence:
    print(word.text)

I
am
Maarij


## Checking for emails

In [100]:
data=nlp('''name|course|email
Maarij Aqeel | BS Artificial Intelligence | maarij.aqeel@example.com 
Ayesha Khan | BS Data Science | ayesha.khan@example.com
Usman Tariq | BS Computer Science | usman.tariq@example.com
Fatima Noor | BS Software Engineering | fatima.noor@example.com
Ahmed Raza | BS Cyber Security | ahmed.raza@example.com
Sara Malik | BS Information Technology | sara.malik@example.com
Bilal Hussain | BS Robotics | bilal.hussain@example.com
''')

In [101]:
emails=[]
for word in data:
    if word.like_email:  # True when the token looks like an email address
        emails.append(word)

In [102]:
emails

[maarij.aqeel@example.com,
 ayesha.khan@example.com,
 usman.tariq@example.com,
 fatima.noor@example.com,
 ahmed.raza@example.com,
 sara.malik@example.com,
 bilal.hussain@example.com]

In [103]:
nlp.add_pipe('sentencizer')  # rule-based sentence splitter; the blank pipeline has no component that sets sentence boundaries

<spacy.pipeline.sentencizer.Sentencizer at 0x7f8ba929d050>

In [104]:
doc=nlp('This is 1st sentence. This is 2nd sentence.')

In [105]:
for sent in doc.sents:
    print(sent)

This is 1st sentence.
This is 2nd sentence.


In [106]:
nlp=sc.load('en_core_web_sm')# Load the pre-trained English Model

In [107]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [108]:
doc=nlp('I live in Pakistan. I completed the Course.')

# tagger

- Assigns Part-of-Speech (POS) tags to each token.
- Tells you if a word is a noun, verb, adjective, etc.

# parser

- Performs syntactic dependency parsing.
- Determines how words relate to each other grammatically (subject, object, modifiers, etc).

# attribute_ruler

- Adds custom attributes and rules for lemmatization, POS tagging, or entity recognition.

# lemmatizer

- Converts words to their base form (lemma).

# ner (Named Entity Recognizer)

- Finds and classifies named entities in text (people, places, organizations, dates, etc).

In [109]:
for token in doc:
    print(token,"|",token.pos_)  # tagger: coarse-grained POS tag of each token

I | PRON
live | VERB
in | ADP
Pakistan | PROPN
. | PUNCT
I | PRON
completed | VERB
the | DET
Course | NOUN
. | PUNCT


In [110]:
for token in doc:
    print(token.text,'|',token.dep_,'|',token.head.text)  # parser: dependency label and head word

I | nsubj | live
live | ROOT | live
in | prep | live
Pakistan | pobj | in
. | punct | live
I | nsubj | completed
completed | ROOT | completed
the | det | Course
Course | dobj | completed
. | punct | completed


In [122]:
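ruler = nlp.get_pipe('attribute_ruler')  # component that applies rule-based attribute overrides
# Whenever the token text is exactly "Course", set its lemma to "NLP Course"
ruler.add(patterns=[[{"TEXT": "Course"}]], attrs={"LEMMA": "NLP Course"})
doc=nlp('I live in Pakistan. I completed the Course.')

for token in doc:
    print(token.text,'|',token.lemma_)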

I | I
live | live
in | in
Pakistan | Pakistan
. | .
I | I
completed | complete
the | the
Course | NLP Course
. | .


In [113]:
for token in doc:
    print(token.text,'|',token.lemma_)  # lemmatizer; this cell ran before the attribute_ruler rule was added, so 'Course' still maps to 'course'

I | I
live | live
in | in
Pakistan | Pakistan
. | .
I | I
completed | complete
the | the
Course | course
. | .


In [114]:
for ent in doc.ents:
    print(ent.text,'|',ent.label_)  # ner: entity text and its label

Pakistan | GPE
Course | PRODUCT


# **Stemming**
- Uses fixed rules to derive a base form from a given word

In [115]:
from nltk.stem import PorterStemmer

In [116]:
stemmer=PorterStemmer()
words=['eating','climbing','ran','adjustable','spended','ability']
for word in words:
    print(word,'|',stemmer.stem(word))  # rule-based suffix stripping: irregular forms such as 'ran' are not mapped to 'run'

eating | eat
climbing | climb
ran | ran
adjustable | adjust
spended | spend
ability | abil


# **Lemmatization**
- Using linguistic knowledge to derive a base word

In [70]:
import spacy as sc

In [71]:
nlp=sc.load('en_core_web_sm')
words=nlp('''eating climbing ran adjustable spended ability''')

In [72]:
for word in words:
    print(word.text,'|',word.lemma_)

eating | eat
climbing | climbing
ran | run
adjustable | adjustable
spended | spend
ability | ability


# **Word Representation/Feature Extraction**
- Label Encoding
- One hot Encoding
- Bag Of Words
- TF-IDF

## 1. Label Encoding
- Converts each category into a unique integer.
- Imposes an ordinal relationship (even if none exists).

- Example:

| Fruit   | Label |
|---------|-------|
| apple   |   0   |
| banana  |   1   |
| orange  |   2   |


## 2. One-Hot Encoding
- Converts each category into a binary vector.
- Each category gets a separate column with 1 or 0.

- Example: `['apple', 'banana', 'orange']`

| apple | banana | orange |
|-------|--------|--------|
|   1   |   0    |   0    |
|   0   |   1    |   0    |
|   0   |   0    |   1    |
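
Label encoding and one-hot encoding can both be sketched in plain Python. The names `label_of` and `one_hot` are chosen for this illustration; in practice a library encoder would usually be used.

```python
categories = ["apple", "banana", "orange"]

# Label encoding: map each distinct category to an integer.
label_of = {cat: i for i, cat in enumerate(sorted(set(categories)))}
print(label_of)  # {'apple': 0, 'banana': 1, 'orange': 2}


def one_hot(cat):
    """One-hot encoding: a vector with a single 1 at the category's label index."""
    vec = [0] * len(label_of)
    vec[label_of[cat]] = 1
    return vec


for cat in categories:
    print(cat, one_hot(cat))
```

The one-hot vectors avoid the artificial ordering that integer labels imply, at the cost of one dimension per category.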

## 3. Bag of Words (BoW)
- Treats each document as a "bag" of words, ignoring grammar and word order.
- Only keeps track of word counts or presence.

- Example:
    - Step 1: Build a vocabulary of all unique words: `["I", "love", "NLP", "is", "fun"]`
    - Step 2: Create a vector for each sentence using word counts:

| Sentence     | I | love | NLP | is | fun |
|--------------|---|------|-----|----|-----|
| I love NLP   | 1 |  1   |  1  | 0  |  0  |
| NLP is fun   | 0 |  0   |  1  | 1  |  1  |
| I love fun   | 1 |  1   |  0  | 0  |  1  |
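
The table above can be reproduced with a few lines of plain Python. This sketch keeps the vocabulary in order of first appearance so the columns match the table; the variable names are illustrative.

```python
from collections import Counter

sentences = ["I love NLP", "NLP is fun", "I love fun"]

# Step 1: vocabulary of unique words, in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Step 2: one count vector per sentence.
vectors = []
for sentence in sentences:
    counts = Counter(sentence.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['I', 'love', 'NLP', 'is', 'fun']
for sentence, vector in zip(sentences, vectors):
    print(sentence, vector)
```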

## 3.1 Bag Of N-grams
- Bag of N-Grams is an extension of the Bag of Words model
- Instead of individual words, it represents text using sequences of N consecutive words (n-grams)
- N-Gram: A contiguous sequence of N words from a given text.
    - Unigram (1-gram): "I", "love", "NLP"
    - Bigram (2-gram): "I love", "love NLP"
    - Trigram (3-gram): "I love NLP"
- Bag: Ignores the order in which the n-grams occur and counts only their frequency (word order inside each n-gram is still preserved).

## 4. TF-IDF
- TF-IDF stands for:
    - TF → Term Frequency
    - IDF → Inverse Document Frequency
- It gives higher weight to words that:

    - Appear often in one article (TF)

    - But appear in fewer articles overall (IDF)

- And gives lower weight to words that:

    - Are common across all articles (like "the", "is", etc.)

# Bag Of Words

In [108]:
from sklearn.feature_extraction.text import CountVectorizer

In [121]:
def preprocess(text):
    doc=nlp(text)
    filtered=[word.lemma_ for word in doc if not word.is_stop and not word.is_punct]
    return ' '.join(filtered)

In [98]:
text=['Mohamed Osman Mohamud, 23, who was convicted in 2013 ',
' attempting to use a weapon of mass destruction (explosives) in connection with ',
'a plot to detonate a vehicle bomb at an annual Christmas tree lighting ceremony in Portland,',
'was sentenced today to serve 30 years in prison, followed by a lifetime term of supervised release.']

In [100]:
filtered_out=[preprocess(line) for line in text]
filtered_out

['Mohamed Osman Mohamud 23 convict 2013',
 '  attempt use weapon mass destruction explosive connection',
 'plot detonate vehicle bomb annual Christmas tree lighting ceremony Portland',
 'sentence today serve 30 year prison follow lifetime term supervised release']

In [103]:
vectorizer=CountVectorizer(ngram_range=(1,2))
vectorizer.fit(filtered_out)
vectorizer.vocabulary_

{'mohamed': 32,
 'osman': 36,
 'mohamud': 34,
 '23': 1,
 'convict': 16,
 '2013': 0,
 'mohamed osman': 33,
 'osman mohamud': 37,
 'mohamud 23': 35,
 '23 convict': 2,
 'convict 2013': 17,
 'attempt': 7,
 'use': 56,
 'weapon': 60,
 'mass': 30,
 'destruction': 18,
 'explosive': 22,
 'connection': 15,
 'attempt use': 8,
 'use weapon': 57,
 'weapon mass': 61,
 'mass destruction': 31,
 'destruction explosive': 19,
 'explosive connection': 23,
 'plot': 38,
 'detonate': 20,
 'vehicle': 58,
 'bomb': 9,
 'annual': 5,
 'christmas': 13,
 'tree': 54,
 'lighting': 28,
 'ceremony': 11,
 'portland': 40,
 'plot detonate': 39,
 'detonate vehicle': 21,
 'vehicle bomb': 59,
 'bomb annual': 10,
 'annual christmas': 6,
 'christmas tree': 14,
 'tree lighting': 55,
 'lighting ceremony': 29,
 'ceremony portland': 12,
 'sentence': 44,
 'today': 52,
 'serve': 46,
 '30': 3,
 'year': 62,
 'prison': 41,
 'follow': 24,
 'lifetime': 26,
 'term': 50,
 'supervised': 48,
 'release': 43,
 'sentence today': 45,
 'today ser

In [105]:
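# Note: this raw text is not passed through preprocess(), so 'convicted' does not match the vocabulary entry 'convict'
vectorizer.transform(['Mohamed Osman Mohamud, 23, who was convicted in 2013']).toarray()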

array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [104]:
vectorizer.transform(['Mohamed Osman 23 was convicted in 2013']).toarray()

array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [106]:
vectorizer.transform(['Ali detonated a vehicle']).toarray()  # 'Ali' and 'detonated' are out of vocabulary (OOV); only 'vehicle' (index 58) is counted

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])

# TF-IDF

In [110]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [136]:
corpus=[  "She enjoys reading books on psychology every evening.",
    "Every evening, she likes to read psychology books.",
    "Reading psychology books in the evening is something she enjoys.",
    "She finds joy in going through psychology literature at night."]

In [140]:
vectorizer=TfidfVectorizer()
vectorizer.fit(corpus)


In [141]:
features=vectorizer.get_feature_names_out()
print(features)

['at' 'books' 'enjoys' 'evening' 'every' 'finds' 'going' 'in' 'is' 'joy'
 'likes' 'literature' 'night' 'on' 'psychology' 'read' 'reading' 'she'
 'something' 'the' 'through' 'to']


In [142]:
for word in features:
    index=vectorizer.vocabulary_.get(word)  # column index of the word in the vocabulary
    print(f"{word} {vectorizer.idf_[index]}")  # IDF weight: higher means the word is rarer across the corpus

at 1.916290731874155
books 1.2231435513142097
enjoys 1.5108256237659907
evening 1.2231435513142097
every 1.5108256237659907
finds 1.916290731874155
going 1.916290731874155
in 1.5108256237659907
is 1.916290731874155
joy 1.916290731874155
likes 1.916290731874155
literature 1.916290731874155
night 1.916290731874155
on 1.916290731874155
psychology 1.0
read 1.916290731874155
reading 1.5108256237659907
she 1.0
something 1.916290731874155
the 1.916290731874155
through 1.916290731874155
to 1.916290731874155
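
The numbers printed above are IDF weights, not per-document TF-IDF scores. They are consistent with the smoothed IDF that scikit-learn documents as the default for `TfidfVectorizer` (`smooth_idf=True`), where \(n\) is the number of documents and \(\mathrm{df}(t)\) is the number of documents containing term \(t\). A worked check for this four-document corpus:

```latex
\begin{aligned}
\mathrm{idf}(t) &= \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad n = 4 \\
\text{psychology } (\mathrm{df} = 4) &:\ \ln\tfrac{5}{5} + 1 = 1.0 \\
\text{books } (\mathrm{df} = 3) &:\ \ln\tfrac{5}{4} + 1 \approx 1.2231 \\
\text{enjoys } (\mathrm{df} = 2) &:\ \ln\tfrac{5}{3} + 1 \approx 1.5108 \\
\text{at } (\mathrm{df} = 1) &:\ \ln\tfrac{5}{2} + 1 \approx 1.9163
\end{aligned}
```

Words that occur in every document, such as "she" and "psychology", get the minimum weight of 1.0 rather than 0 because of the added 1.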


# **Stop Words**

In [2]:
from spacy.lang.en.stop_words import STOP_WORDS
import spacy

In [3]:
nlp=spacy.load('en_core_web_sm')

In [6]:
doc=nlp('We are going to remove stop words. They are not essential in some cases.')
# Print only the tokens that spaCy flags as stop words
for word in doc:
    if word.is_stop:
        print(word.text)

We
are
to
They
are
not
in
some


In [55]:
def preprocess(text):
    doc=nlp(text)
    non_stop_words=[word.text for word in doc if not word.is_stop and not word.is_punct]
    return ' '.join(non_stop_words)

In [56]:
text='We are going to remove stop words. They are not essential in some cases.'
filtered=preprocess(text)
filtered

'going remove stop words essential cases'

# Question Answer Dataset practice

In [57]:
import pandas as pd

In [58]:
df=pd.read_json('./combined.json',lines=True)
df

Unnamed: 0,id,title,contents,date,topics,components
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,...",2014-10-01T00:00:00-04:00,[],[National Security Division (NSD)]
1,12-919,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...,2012-07-25T00:00:00-04:00,[],[Environment and Natural Resources Division]
2,11-1002,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...,2011-08-03T00:00:00-04:00,[],[Environment and Natural Resources Division]
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...,2010-01-08T00:00:00-05:00,[],[Environment and Natural Resources Division]
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]
...,...,...,...,...,...,...
13082,16-735,Yuengling to Upgrade Environmental Measures to...,The Department of Justice and the U.S. Environ...,2016-06-23T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]
13083,10-473,Zarein Ahmedzay Pleads Guilty to Terror Violat...,The Justice Department announced that Zarein...,2010-04-23T00:00:00-04:00,[],[Office of the Attorney General]
13084,17-045,Zimmer Biomet Holdings Inc. Agrees to Pay $17....,Subsidiary Agrees to Plead Guilty to Violating...,2017-01-12T00:00:00-05:00,[Foreign Corruption],"[Criminal Division, Criminal - Criminal Fraud ..."
13085,17-252,ZTE Corporation Agrees to Plead Guilty and Pay...,ZTE Corporation has agreed to enter a guilty p...,2017-03-07T00:00:00-05:00,"[Asset Forfeiture, Counterintelligence and Exp...","[National Security Division (NSD), USAO - Texa..."


In [59]:
df.drop(['date','topics','components','id'],axis=1,inplace=True)

In [60]:
df

Unnamed: 0,title,contents
0,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,..."
1,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...
2,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...
3,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...
4,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir..."
...,...,...
13082,Yuengling to Upgrade Environmental Measures to...,The Department of Justice and the U.S. Environ...
13083,Zarein Ahmedzay Pleads Guilty to Terror Violat...,The Justice Department announced that Zarein...
13084,Zimmer Biomet Holdings Inc. Agrees to Pay $17....,Subsidiary Agrees to Plead Guilty to Violating...
13085,ZTE Corporation Agrees to Plead Guilty and Pay...,ZTE Corporation has agreed to enter a guilty p...


In [61]:
df=df.head(100)
df

Unnamed: 0,title,contents
0,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,..."
1,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...
2,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...
3,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...
4,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir..."
...,...,...
95,"After Nearly 20 Years, International Fugitive ...","A former New York businessman, who disappeared..."
96,After Supreme Court Declines to Hear Same-Sex ...,Attorney General Eric Holder announced today t...
97,"After UBS Produces Singapore-Based Documents, ...",UBS AG has complied with an Internal Revenue S...
98,AGCO Corp. to Pay $1.6 Million in Connection w...,"WASHINGTON – AGCO Corp., a U.S. corporation ..."


In [62]:
df=df.copy()  # head() returned a slice of the original frame; copying it avoids pandas' SettingWithCopyWarning
df['non_stop_words']=df['contents'].apply(preprocess)


In [63]:
df

Unnamed: 0,title,contents,non_stop_words
0,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,...",PORTLAND Oregon Mohamed Osman Mohamud 23 convi...
1,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...,WASHINGTON North Carolina Waccamaw River wa...
2,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...,BOSTON $ 1 million settlement reached n...
3,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...,WASHINGTON federal grand jury Las Vegas tod...
4,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",U.S. Department Justice U.S. Environmental Pro...
...,...,...,...
95,"After Nearly 20 Years, International Fugitive ...","A former New York businessman, who disappeared...",New York businessman disappeared day federal j...
96,After Supreme Court Declines to Hear Same-Sex ...,Attorney General Eric Holder announced today t...,Attorney General Eric Holder announced today f...
97,"After UBS Produces Singapore-Based Documents, ...",UBS AG has complied with an Internal Revenue S...,UBS AG complied Internal Revenue Service IRS s...
98,AGCO Corp. to Pay $1.6 Million in Connection w...,"WASHINGTON – AGCO Corp., a U.S. corporation ...",WASHINGTON AGCO Corp. U.S. corporation base...


In [64]:
df['contents'].iloc[4][:400]

'The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin contaminated sediment and soil at the Centredale Manor Restoration Project Superfund Site in North Provid'

In [67]:
df['non_stop_words'].iloc[4][:400]

'U.S. Department Justice U.S. Environmental Protection Agency EPA Rhode Island Department Environmental Management RIDEM announced today subsidiaries Stanley Black Decker Inc.—Emhart Industries Inc. Black Decker Inc.—have agreed clean dioxin contaminated sediment soil Centredale Manor Restoration Project Superfund Site North Providence Johnston Rhode Island \xa0  pleased reach resolution collaborative'

# Problems with Bag of Words and TF-IDF
Both Bag of Words (BoW) and TF-IDF convert text into numerical form, but they have serious limitations:

- 🔹 1. No Understanding of Meaning
    - They treat words as independent tokens, with no idea of their meaning.
    - Example: "good" and "excellent" are treated as totally unrelated, even though they are semantically similar.

- 🔹 2. No Context
    - These methods don't understand where or how a word is used.
    - Example: The word "bank" in:
        - "He went to the bank to withdraw money."
        - "She sat on the river bank."
    - BoW and TF-IDF treat "bank" the same in both sentences.

- 🔹 3. High Dimensionality
    - The vocabulary can be huge, so they create very large sparse vectors (mostly 0s).
    - This is inefficient and can hurt model performance.



## Word Embedding
- A word embedding represents each word as a dense vector of real numbers in a continuous vector space, where semantics (meaning) and relationships between words are preserved.
- Instead of one-hot vectors or word counts, each word is mapped to a low-dimensional vector (e.g., 100 or 300 dimensions).
- Words that appear in similar contexts are close together in the vector space.

- **Example**:
    - Imagine a word embedding space like this:

| Word   | Vector (simplified)       |
|--------|----------------------------|
| king   | [0.5, 0.1, 0.9, ...]       |
| queen  | [0.5, 0.2, 0.9, ...]       |
| apple  | [0.9, 0.1, 0.1, ...]       |

You can do math like:
- king - man + woman ≈ queen
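
The analogy arithmetic can be illustrated with hand-made vectors. The 3-dimensional values below are invented for this sketch (roughly: royalty, male, female) and are not real embeddings; trained vectors have hundreds of dimensions and only approximately satisfy such analogies.

```python
import math

# Made-up 3-d vectors chosen only to illustrate the arithmetic.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the remaining words by cosine similarity to the target vector.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```

The input words are excluded from the candidates, which is the usual convention when evaluating analogies.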




In [4]:
!python -m spacy download "en_core_web_lg"

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [21]:
import spacy as sc

In [22]:
nlp=sc.load('en_core_web_lg')

In [23]:
doc=nlp('Ali like to eat bread. But sutlaji dont drink tea.' )
for word in doc:
    print(f"word:{word}\t|\tis_vector:{word.has_vector}\t|\tOOV:{word.is_oov}")

word:Ali	|	is_vector:True	|	OOV:False
word:like	|	is_vector:True	|	OOV:False
word:to	|	is_vector:True	|	OOV:False
word:eat	|	is_vector:True	|	OOV:False
word:bread	|	is_vector:True	|	OOV:False
word:.	|	is_vector:True	|	OOV:False
word:But	|	is_vector:True	|	OOV:False
word:sutlaji	|	is_vector:False	|	OOV:True
word:do	|	is_vector:True	|	OOV:False
word:nt	|	is_vector:True	|	OOV:False
word:drink	|	is_vector:True	|	OOV:False
word:tea	|	is_vector:True	|	OOV:False
word:.	|	is_vector:True	|	OOV:False


In [25]:
base=nlp("Tomato")
for word in doc:
    print(f"{word.text}<->{base.text}  |\t{word.similarity(base[0])}")# Represents similarity between two words

Ali<->Tomato  |	0.07391463965177536
like<->Tomato  |	0.24707908928394318
to<->Tomato  |	0.13988150656223297
eat<->Tomato  |	0.41074666380882263
bread<->Tomato  |	0.6368071436882019
.<->Tomato  |	0.15089364349842072
But<->Tomato  |	0.23196357488632202
sutlaji<->Tomato  |	0.0
do<->Tomato  |	0.14112377166748047
nt<->Tomato  |	0.07228509336709976
drink<->Tomato  |	0.326096773147583
tea<->Tomato  |	0.3690275549888611
.<->Tomato  |	0.15089364349842072


