<table>
  <tr>
    <td> <b> Roll no.: </b> N031 </td>
    <td> <b> Name: </b> Shourya Gupta </td>
  </tr>
  <tr>
    <td> <b> Program and Division: </b> MBA Tech CE D</td>
    <td> <b> Batch: </b> B1 </td>
  </tr>   
</table>

# **Experiment 4**

### **Aim**

Perform part-of-speech (POS) tagging and named entity recognition (NER) on text data using spaCy.

### **Import Libraries**

In [None]:
import spacy
from spacy import displacy

**Downloading pretrained spaCy models**

In [None]:
# Pretrained English model
!python -m spacy download en_core_web_sm

In [None]:
# Pretrained French model
!python -m spacy download fr_core_news_sm

### **Implementing POS tagging using Python**

In [None]:
# Load the pretrained English pipeline and list its components
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
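doc = nlp("Captain america ate $100 of samosa. Then he said I can do this all day")
# Print each token, a description of its coarse-grained POS, and its fine-grained tag ID.
# Note: token.tag is an integer ID; token.tag_ returns the same tag as a readable string.
for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.tag)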

Captain  |  proper noun  |  15794550382381185553
america  |  proper noun  |  15794550382381185553
ate  |  verb  |  17109001835818727656
$  |  symbol  |  11283501755624150392
100  |  numeral  |  8427216679587749980
of  |  adposition  |  1292078113972184607
samosa  |  noun  |  783433942507015291
.  |  punctuation  |  12646065887601541794
Then  |  adverb  |  164681854541413346
he  |  pronoun  |  13656873538139661788
said  |  verb  |  17109001835818727656
I  |  pronoun  |  13656873538139661788
can  |  auxiliary  |  16235386156175103506
do  |  verb  |  14200088355797579614
this  |  pronoun  |  15267657372422890137
all  |  determiner  |  15267657372422890137
day  |  noun  |  15308085513773655218
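The third column above is `token.tag`, the integer ID spaCy uses for the fine-grained tag, so tokens with the same fine-grained tag print the same number. The sketch below makes that visible by grouping the tokens by ID. It uses only the rows printed above, copied by hand into a hypothetical `rows` list; it does not re-run spaCy.

```python
from collections import defaultdict

# (token, coarse POS description, integer tag ID), copied from the output above.
rows = [
    ("Captain", "proper noun", 15794550382381185553),
    ("america", "proper noun", 15794550382381185553),
    ("ate", "verb", 17109001835818727656),
    ("$", "symbol", 11283501755624150392),
    ("100", "numeral", 8427216679587749980),
    ("of", "adposition", 1292078113972184607),
    ("samosa", "noun", 783433942507015291),
    (".", "punctuation", 12646065887601541794),
    ("Then", "adverb", 164681854541413346),
    ("he", "pronoun", 13656873538139661788),
    ("said", "verb", 17109001835818727656),
    ("I", "pronoun", 13656873538139661788),
    ("can", "auxiliary", 16235386156175103506),
    ("do", "verb", 14200088355797579614),
    ("this", "pronoun", 15267657372422890137),
    ("all", "determiner", 15267657372422890137),
    ("day", "noun", 15308085513773655218),
]


def group_by_tag_id(rows):
    """Map each integer tag ID to the list of tokens that carry it."""
    groups = defaultdict(list)
    for text, _pos, tag_id in rows:
        groups[tag_id].append(text)
    return dict(groups)


groups = group_by_tag_id(rows)
for tag_id, tokens in groups.items():
    print(tag_id, tokens)
```

In this output, "ate" and "said" share one ID while "do" has another, and "this" and "all" share an ID even though their coarse POS differs. This shows that the coarse `pos_` and the fine-grained tag are separate levels of annotation.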


### **Named Entity Recognition (NER)**

In [None]:
doc = nlp(" NMIMS ltd. is an educational institute. Tesla Inc is going to acquire twitter ltd. for $45 billions. I am staying in Mumbai, Maharashtra, India, Earth.")
# Print each entity's text, its label, and a description of that label.
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

NMIMS ltd.  |  ORG  |  Companies, agencies, institutions, etc.
Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
twitter ltd.  |  ORG  |  Companies, agencies, institutions, etc.
$45 billions  |  MONEY  |  Monetary values, including unit
Mumbai  |  GPE  |  Countries, cities, states
Maharashtra  |  GPE  |  Countries, cities, states
India  |  GPE  |  Countries, cities, states
Earth  |  LOC  |  Non-GPE locations, mountain ranges, bodies of water
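Once entities are extracted, a common next step is to summarize them by label or keep only some categories. The sketch below does this for the entities printed above, copied by hand into a hypothetical `entities` list. In a live session the same pairs could be built with `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter

# (entity text, label) pairs, copied from the output above.
entities = [
    ("NMIMS ltd.", "ORG"),
    ("Tesla Inc", "ORG"),
    ("twitter ltd.", "ORG"),
    ("$45 billions", "MONEY"),
    ("Mumbai", "GPE"),
    ("Maharashtra", "GPE"),
    ("India", "GPE"),
    ("Earth", "LOC"),
]

# How many entities were found per label.
label_counts = Counter(label for _text, label in entities)

# Keep only place-like entities (geopolitical entities and other locations).
places = [text for text, label in entities if label in {"GPE", "LOC"}]

print(label_counts)
print(places)
```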


In [None]:
displacy.render(doc,style="ent")

### **Customizing Tokenizer**

In [None]:
from spacy.symbols import ORTH

# Add a special case so that "gimme" is split into the two tokens "gim" and "me".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
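To show what a special case does conceptually, here is a minimal pure-Python sketch. It is not spaCy's tokenizer: it only splits on whitespace and then looks each chunk up in a hypothetical `SPECIAL_CASES` table. The helper names `check_special_cases` and `tokenize` are also made up for illustration. The check that the pieces join back into the original string reflects the constraint that a special case must not change the text.

```python
# Hypothetical special-case table: surface form -> replacement tokens.
SPECIAL_CASES = {"gimme": ["gim", "me"]}


def check_special_cases(special_cases):
    """Ensure every rule preserves the text it replaces."""
    for surface, pieces in special_cases.items():
        if "".join(pieces) != surface:
            raise ValueError(f"pieces {pieces} do not spell {surface!r}")


def tokenize(text, special_cases=SPECIAL_CASES):
    """Whitespace-split the text, expanding any chunk that has a special case."""
    check_special_cases(special_cases)
    tokens = []
    for chunk in text.split():
        tokens.extend(special_cases.get(chunk, [chunk]))
    return tokens


print(tokenize("gimme a slice of pizza"))
```

A real tokenizer also handles punctuation, prefixes, and suffixes, which this sketch deliberately ignores.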

### **Support for other languages**

In [None]:
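# Load the pretrained French pipeline
nlp = spacy.load("fr_core_news_sm")

doc = nlp("Captain America a mangé 100 pizzas. Puis il a dit que je pouvais faire ça toute la journée")
# token.tag prints an integer ID; token.tag_ returns the same tag as a readable string.
for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.tag)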

Captain  |  noun  |  92
America  |  proper noun  |  96
a  |  auxiliary  |  87
mangé  |  verb  |  100
100  |  numeral  |  93
pizzas  |  noun  |  92
.  |  punctuation  |  97
Puis  |  coordinating conjunction  |  89
il  |  pronoun  |  95
a  |  auxiliary  |  87
dit  |  verb  |  100
que  |  subordinating conjunction  |  98
je  |  pronoun  |  95
pouvais  |  verb  |  100
faire  |  verb  |  100
ça  |  pronoun  |  95
toute  |  adjective  |  84
la  |  determiner  |  90
journée  |  noun  |  92


### **Observations and Learning**

Part-of-speech (POS) tagging plays a crucial role in NLP by assigning a part of speech, such as noun, verb, adjective, or adverb, to each word in a text. Named entity recognition (NER) identifies specific entities in text and classifies them into predefined categories. These entities are typically proper nouns and can include names, organizations, locations, and dates and times.
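To make "identify and classify" concrete, the toy sketch below finds entities by looking up known names in a small dictionary (a gazetteer) and reports where each one occurs. This is only an illustration of the input and output of NER. The `GAZETTEER` table, the `find_entities` helper, and the example sentence are all hypothetical, and spaCy's pretrained `ner` component uses a statistical model rather than a lookup like this.

```python
import re

# Hypothetical gazetteer: known surface form -> entity category.
GAZETTEER = {
    "Tesla Inc": "ORG",
    "Mumbai": "GPE",
    "India": "GPE",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, surface, label) for every gazetteer match, in text order."""
    matches = []
    for surface, label in gazetteer.items():
        for m in re.finditer(re.escape(surface), text):
            matches.append((m.start(), m.end(), surface, label))
    return sorted(matches)


# Made-up example sentence, used only for this illustration.
text = "Tesla Inc has an office in Mumbai, India."
for start, end, surface, label in find_entities(text):
    print(f"{surface!r} [{start}:{end}] -> {label}")
```

A lookup like this cannot label names it has never seen, which is the main reason to use a trained model.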

### **Conclusion**

Successfully explored the spaCy package for POS tagging and NER.
