# Text Preprocessing and Linguistic Analysis in Natural Language Processing (NLP)

#### Natural Language Processing (NLP) is a field of AI that enables machines to understand, analyze, and derive meaning from human language. However, machines cannot interpret raw text directly, so the text must first be preprocessed and linguistically analyzed.

This notebook focuses on the fundamental building blocks of NLP, which convert raw text into structured linguistic information. These steps help machines understand how text is formed, what each word represents, and how words relate to one another grammatically and semantically.

In [None]:
# Install the necessary libraries and the small English spaCy model

!pip install -U nltk spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## Tokenization
#### Tokenization is the process of breaking text into smaller units called tokens.

* Sentence Tokenization: Splits text into sentences

* Word Tokenization: Splits sentences into individual words

##### Why it is needed: Most NLP tools operate on discrete units such as words and sentences rather than on raw character strings. Tokenization is the first step that transforms unstructured text into these manageable units.
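To make the two levels concrete, here is a minimal sketch of a tokenizer built only on Python's `re` module. The function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the regular expressions are a rough approximation. They are not how NLTK's Punkt models work, and they will mis-split abbreviations such as "Dr.".

```python
import re

def simple_sent_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "Tokenization comes first. It splits text into units!"
sentences = simple_sent_tokenize(sample)
words = [simple_word_tokenize(s) for s in sentences]

print(sentences)
print(words)
```

A trained tokenizer is usually preferable in practice, because simple rules like these cannot tell a sentence-ending period from one inside an abbreviation or a number.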

In [None]:
import nltk
nltk.download('punkt')      # punkt -> sentence and word tokenizer model
nltk.download('punkt_tab')  # punkt_tab -> additional resource required by recent NLTK versions
from nltk.tokenize import word_tokenize, sent_tokenize
# word_tokenize -> splits a sentence into words and punctuation marks
# sent_tokenize -> splits text into sentences

text= " Natural Language Processing is very Intresting."

print("Sentences:",sent_tokenize(text))
print("Words:",word_tokenize(text))

Sentences: [' Natural Language Processing is very Intresting.']
Words: ['Natural', 'Language', 'Processing', 'is', 'very', 'Intresting', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Stop Word Removal
#### Stop words are commonly used words in a language that usually do not add significant meaning to a sentence.
* Examples: is, the, a, in, on, very


##### In most NLP tasks, these words:

* Appear very frequently

* Do not help in understanding the intent or topic
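The frequency claim can be checked on a small example. The sketch below counts tokens with `collections.Counter` and then filters them against a deliberately tiny, hypothetical stop word set named `TOY_STOP_WORDS`. It is not NLTK's English list, which is considerably longer.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word set (not NLTK's list).
TOY_STOP_WORDS = {"is", "the", "a", "in", "on", "very"}

tokens = "the cat is on the mat in the very sunny room".split()

# Count how often each token occurs.
counts = Counter(tokens)

# Keep only tokens that are not stop words.
content_tokens = [t for t in tokens if t not in TOY_STOP_WORDS]

print(counts.most_common(1))
print(content_tokens)
```

In this sentence the most frequent token is a stop word, and the tokens that survive filtering are the ones that carry the topic.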

In [None]:
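# Stop word removal
nltk.download('stopwords')

from nltk.corpus import stopwords
# Use a set for fast membership checks
stop_words = set(stopwords.words('english'))
# Lowercase each token before the lookup, because the NLTK list is lowercase
filterd_words=[w for w in word_tokenize(text) if w.lower() not in stop_words]

print(filterd_words)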


['Natural', 'Language', 'Processing', 'Intresting', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Lemmatization
#### Lemmatization is the process of converting words into their base or dictionary form, called a lemma.

Examples:

* running → run

* studies → study

Unlike stemming, lemmatization:

* Uses vocabulary and grammar rules

* Produces valid, meaningful words
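The contrast with stemming can be illustrated with a toy example. The sketch below assumes a hypothetical suffix list `SUFFIXES` and a hypothetical lookup table `TOY_LEMMAS` that stands in for a lexical database. Neither reflects the actual rules of the Porter stemmer or the contents of WordNet.

```python
# Hypothetical toy suffix rules; real stemmers use many more rules.
SUFFIXES = ("ing", "ies", "es", "s")

# Hypothetical toy lemma dictionary standing in for a lexical database.
TOY_LEMMAS = {"running": "run", "studies": "study", "better": "good"}

def toy_stem(word):
    """Chop the first matching suffix without checking the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["running", "studies", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer returns fragments such as `runn` and `stud`, while the dictionary lookup returns real words and can also handle irregular forms that no suffix rule would catch.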

In [None]:
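nltk.download('wordnet')
# WordNet is a lexical database used to find the base form of words

from nltk.stem import WordNetLemmatizer
lemmatizer= WordNetLemmatizer()
# pos='v' treats every token as a verb, which is why 'Processing' becomes 'process';
# tokens that WordNet cannot reduce as verbs are returned unchanged
lemmatized = [lemmatizer.lemmatize(word.lower(),pos='v') for word in filterd_words]
print(lemmatized)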

[nltk_data] Downloading package wordnet to /root/nltk_data...


['natural', 'language', 'process', 'intresting', '.']


## Part-of-Speech Tagging

#### POS tagging assigns grammatical labels (noun, verb, adjective, etc.) to each word in a sentence.

Example:

* Apple → Noun

* opening → Verb

* office → Noun

##### Why it is needed: Understanding grammar helps machines infer meaning, detect relationships, and disambiguate words based on context.
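Disambiguation by context can be sketched with a toy rule-based tagger. The lexicon `TOY_LEXICON`, the tag names, and the two context rules below are all hypothetical and chosen only for illustration. Statistical taggers such as the one NLTK provides learn their decisions from annotated text rather than from hand-written rules.

```python
# Hypothetical toy lexicon: each word maps to the tags it can take.
TOY_LEXICON = {
    "i": ["PRON"], "the": ["DET"], "a": ["DET"], "to": ["PART"],
    "book": ["NOUN", "VERB"], "read": ["VERB"], "want": ["VERB"],
    "flight": ["NOUN"],
}

def toy_pos_tag(tokens):
    """Tag each token, using the previous tag to resolve NOUN/VERB ambiguity."""
    tagged = []
    prev_tag = None
    for tok in tokens:
        # Unknown words default to NOUN in this sketch.
        options = TOY_LEXICON.get(tok.lower(), ["NOUN"])
        if len(options) == 1:
            tag = options[0]
        elif prev_tag == "DET":
            tag = "NOUN"   # after a determiner, expect a noun: "the book"
        elif prev_tag in ("PART", "PRON"):
            tag = "VERB"   # after "to" or a pronoun, expect a verb: "to book"
        else:
            tag = options[0]
        tagged.append((tok, tag))
        prev_tag = tag
    return tagged

print(toy_pos_tag("I read the book".split()))
print(toy_pos_tag("I want to book a flight".split()))
```

The word "book" receives a different tag in each sentence, which is the kind of context-dependent decision a POS tagger has to make.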

In [None]:
nltk.download('averaged_perceptron_tagger_eng')  # model used by pos_tag

from nltk import pos_tag

# Each token is tagged with its grammatical role (Penn Treebank tags, e.g. NNP, VBZ)

tokens = word_tokenize("Apple is opening a new office in india.")
print(pos_tag(tokens))

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


[('Apple', 'NNP'), ('is', 'VBZ'), ('opening', 'VBG'), ('a', 'DT'), ('new', 'JJ'), ('office', 'NN'), ('in', 'IN'), ('india', 'NN'), ('.', '.')]


In [None]:
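import spacy
# Load the small English pipeline downloaded earlier
nlp=spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in india.")
# Coarse-grained part-of-speech tag for each token
for token in doc:
    print(token.text,token.pos_)
# Named entities detected in the text, with their labels
for ent in doc.ents:
    print(ent.text, "->", ent.label_)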

Apple PROPN
is AUX
opening VERB
a DET
new ADJ
office NOUN
in ADP
india PROPN
. PUNCT
Apple -> ORG
india -> GPE
