# Part-of-Speech (POS) Tagging

Part-of-Speech (POS) tagging is a fundamental process in Natural Language Processing (NLP) that involves assigning a part of speech to each word in a given text. These parts of speech include nouns, verbs, adjectives, adverbs, conjunctions, and more. POS tagging helps in understanding the grammatical structure of a sentence and is essential for various NLP tasks such as parsing, named entity recognition, and machine translation.

In this notebook, we use the spaCy library to perform POS tagging on a sample sentence from a Ruskin Bond book. The results are collected in a pandas DataFrame, `df_tokens`, which holds each token alongside its POS tag, and are then summarized by tag frequency.

In [2]:
%pip install spacy

Collecting spacy
  Downloading spacy-3.8.3-cp312-cp312-win_amd64.whl.metadata (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.11-cp312-cp312-win_amd64.whl.metadata (2.0 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.10-cp312-cp312-win_amd64.whl.metadata (8.6 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.9-cp312-cp312-win_amd64.whl.metadata (2.2 kB)
Collecting thinc<8.4.0,>=8.3.0 (from spacy)
  Downloading thinc-8.3.3-cp312-cp312-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy)
  Downloading wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.4.3 (from spacy)
  Downloading srsly-2.5.0-cp312-cp312-win_amd64

In [4]:
import spacy
import pandas as pd

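# Download the small English pipeline (only needed once per environment)
!python -m spacy download en_core_web_sm

# Load the pipeline, which includes the POS tagger and the NER component
nlp = spacy.load('en_core_web_sm')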

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
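Running the download command in every session fetches the package again each time. A minimal sketch of a guarded load follows, assuming the same `en_core_web_sm` pipeline; the helper name `load_or_download` is hypothetical and not part of spaCy. In some environments a kernel restart may still be needed after the first download.

```python
import spacy


def load_or_download(model_name: str = "en_core_web_sm"):
    """Load a spaCy pipeline, downloading it first only if it is missing.

    Hypothetical helper for illustration; not part of the spaCy API.
    """
    try:
        return spacy.load(model_name)
    except OSError:
        # spacy.load raises OSError when the pipeline package is not installed
        from spacy.cli import download

        download(model_name)
        return spacy.load(model_name)
```

With this helper, the loading step in the cell above could become `nlp = load_or_download()`.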

In [7]:
# Sample sentence, long enough to demonstrate POS tagging
sent = "The room was filled with the scent of roses, and the sound of bees humming in the garden outside the window."

# Process the text
doc = nlp(sent)
# Create a dataframe with token text and POS
token_data = {'Text': [token.text for token in doc], 'POS': [token.pos_ for token in doc]}
df_tokens = pd.DataFrame(token_data)

df_tokens

|   | Text   | POS   |
|---|--------|-------|
| 0 | The    | DET   |
| 1 | room   | NOUN  |
| 2 | was    | AUX   |
| 3 | filled | VERB  |
| 4 | with   | ADP   |
| 5 | the    | DET   |
| 6 | scent  | NOUN  |
| 7 | of     | ADP   |
| 8 | roses  | NOUN  |
| 9 | ,      | PUNCT |
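The tags in the `POS` column are coarse-grained Universal POS labels, and some of them (`ADP`, `AUX`, `CCONJ`) are not self-explanatory. As a small sketch, `spacy.explain` can look up a short description for each label. The list of tags below is copied by hand from this notebook's outputs.

```python
import spacy

# Coarse-grained tags that appear in this notebook's outputs (copied by hand)
tags = ["DET", "NOUN", "AUX", "VERB", "ADP", "PUNCT", "CCONJ"]

# spacy.explain is a glossary lookup, so no pipeline needs to be loaded here
tag_glossary = {tag: spacy.explain(tag) for tag in tags}

for tag, meaning in tag_glossary.items():
    print(f"{tag:<6} {meaning}")
```

The same lookup could be added as a column with `df_tokens['POS'].map(spacy.explain)`.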


In [10]:
df_tokens.groupby('POS').count().sort_values(by='Text', ascending=False)

| POS   | Text |
|-------|------|
| NOUN  | 7    |
| ADP   | 5    |
| DET   | 5    |
| PUNCT | 2    |
| VERB  | 2    |
| AUX   | 1    |
| CCONJ | 1    |
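Raw counts are easier to compare across sentences of different lengths when expressed as shares of all tokens. The sketch below recomputes the totals from the counts shown above. The numbers are hard-coded so that the block runs on its own, and the names `pos_counts`, `pos_share`, and `summary` are illustrative.

```python
import pandas as pd

# Counts copied from the grouped output above (hard-coded for a standalone sketch)
pos_counts = pd.Series(
    {"NOUN": 7, "ADP": 5, "DET": 5, "PUNCT": 2, "VERB": 2, "AUX": 1, "CCONJ": 1},
    name="count",
)

# Total number of tokens, then each tag's share of that total
total_tokens = int(pos_counts.sum())
pos_share = (pos_counts / total_tokens).round(3).rename("share")

# Side-by-side view of counts and shares, indexed by POS tag
summary = pd.concat([pos_counts, pos_share], axis=1)
print(total_tokens)
print(summary)
```

In the notebook itself, `df_tokens['POS'].value_counts(normalize=True)` should give the same shares directly from the DataFrame.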


# Named Entity Recognition Tagging

Named Entity Recognition (NER) is an NLP task that identifies named entities in text and classifies them into predefined categories such as person names, organizations, locations, and dates. Here we use spaCy to run NER on the same sample sentence and collect the detected entities and their labels in a pandas DataFrame.

In [17]:
from spacy import displacy

# Run NER on the same sentence, reusing the `nlp` pipeline loaded earlier
doc = nlp(sent)

# doc.ents yields Span objects; a Span exposes its entity type as `label_`
# (`ent_type_` is a Token attribute)
token_label_data = {'Text': [ent.text for ent in doc.ents], 'Label': [ent.label_ for ent in doc.ents]}
df_labels = pd.DataFrame(token_label_data)

df_labels.head()

# Optional: highlight the entities inline in the notebook
#displacy.render(doc, style='ent', jupyter=True)


|   | Text | Label |
|---|------|-------|
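The DataFrame comes back empty, presumably because the sample sentence mentions no people, places, organizations, or dates. To show what a non-empty result looks like, here is a minimal sketch under two assumptions that are not in the original notebook: an invented example sentence, and a rule-based `entity_ruler` on a blank English pipeline in place of the statistical `en_core_web_sm` model, so the block runs without a model download and gives deterministic output.

```python
import pandas as pd
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp_rules = spacy.blank("en")

# Rule-based entity matcher; these patterns are illustrative assumptions
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Ruskin Bond"},
    {"label": "GPE", "pattern": "Mussoorie"},
])

# Invented sentence that contains named entities
demo_doc = nlp_rules("Ruskin Bond lives in Mussoorie.")

# Each entity is a Span: `text` is the surface string, `label_` the entity type
df_demo = pd.DataFrame({
    "Text": [ent.text for ent in demo_doc.ents],
    "Label": [ent.label_ for ent in demo_doc.ents],
})
print(df_demo)
```

The DataFrame construction is the same as in the cell above, so swapping `nlp_rules` for the loaded `en_core_web_sm` pipeline should work unchanged, although the statistical model's predictions may differ from these hand-written rules.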
