<a href="https://colab.research.google.com/github/Ghishan/NLP/blob/main/basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import spaCy and load the small English model
# (install it first if needed: python -m spacy download en_core_web_sm)
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
# Tokenization (tokenizer only; no tagging or parsing is run)
sentence = nlp.tokenizer("We live in Karachi.")
# Number of tokens in the sentence
print("The number of tokens: ", len(sentence))
# Print the individual tokens (words and punctuation)
print("The tokens: ")
for token in sentence:
    print(token)


The number of tokens:  5
The tokens: 
We
live
in
Karachi
.
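
`nlp.tokenizer` runs only the tokenizer, so the same split can be reproduced without a trained model. The sketch below assumes a blank English pipeline (`spacy.blank("en")`) as a stand-in for `en_core_web_sm`; the names `blank_nlp` and `words_only` are illustrative and not part of the notebook.

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm,
# since tokenization alone does not need a trained model.
blank_nlp = spacy.blank("en")

doc = blank_nlp("We live in Karachi.")
tokens = [token.text for token in doc]
print(tokens)

# Lexical flags such as is_punct are set by the tokenizer itself,
# so they are available even without a tagger or parser.
words_only = [token.text for token in doc if not token.is_punct]
print(words_only)
```

Attributes that depend on trained components, such as part-of-speech tags, still require the full `nlp(...)` call.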


In [None]:
# Print tokens for example sentence without chunking
for token in nlp("My parents live in New York City."):
 print(token.text)

My
parents
live
in
New
York
City
.


In [None]:
# Print chunks for example sentence
for chunk in nlp("My parents live in New York City.").noun_chunks:
 print(chunk.text)

My parents
New York City


In [20]:
import pandas as pd
# Import Jeopardy questions
data = pd.read_csv('/content/jeopardy_questions.csv')
# Lowercase the column names and strip surrounding whitespace
data.columns = data.columns.str.lower().str.strip()
# Keep only the first 1,000 rows to reduce processing time
data = data.head(1000).copy()
# Run the spaCy pipeline on each of the 1,000 Jeopardy questions
data["question_tokens"] = data["question"].apply(nlp)
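
Calling the pipeline once per row works, but spaCy also offers `nlp.pipe`, which processes an iterable of texts as a stream. The following is a minimal sketch under stated assumptions: it uses a blank English pipeline so it runs without a model download, and the second question in the list is a hypothetical stand-in for rows of the `question` column.

```python
import spacy

# Assumption: blank pipeline for a self-contained sketch;
# the notebook itself uses en_core_web_sm.
nlp_sketch = spacy.blank("en")

questions = [
    # First question from the dataset, as printed later in this notebook
    "For the last 8 years of his life, Galileo was under house arrest "
    "for espousing this man's theory",
    # Hypothetical second question, for illustration only
    "This planet is known as the Red Planet.",
]

# nlp.pipe yields one Doc per input text, in input order
docs = list(nlp_sketch.pipe(questions))
token_counts = [len(doc) for doc in docs]
print(token_counts)
```

With the real data, `nlp.pipe(data["question"])` would take the place of the hard-coded list.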

In [21]:
# View first question
example_question = data.question[0]
example_question_tokens = data.question_tokens[0]
print("The first question is:")
print(example_question)

The first question is:
For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory


In [22]:
print("The tokens from the first question are:")
for token in example_question_tokens:
    print(token)

The tokens from the first question are:
For
the
last
8
years
of
his
life
,
Galileo
was
under
house
arrest
for
espousing
this
man
's
theory


**Part-of-Speech Tagging**

In [23]:
# Print Part-of-speech tags for tokens in the first question
print("Here are the Part-of-speech tags for each token in the first question:")
for token in example_question_tokens:
    print(token.text, token.pos_, spacy.explain(token.pos_))

Here are the Part-of-speech tags for each token in the first question:
For ADP adposition
the DET determiner
last ADJ adjective
8 NUM numeral
years NOUN noun
of ADP adposition
his PRON pronoun
life NOUN noun
, PUNCT punctuation
Galileo PROPN proper noun
was AUX auxiliary
under ADP adposition
house NOUN noun
arrest NOUN noun
for ADP adposition
espousing VERB verb
this DET determiner
man NOUN noun
's PART particle
theory NOUN noun
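
A quick way to summarize this output is to tally how often each tag occurs. The sketch below copies the tags from the printed output above into a list (the variable names are illustrative) and counts them with `collections.Counter`.

```python
from collections import Counter

# Tags copied from the printed output above, in token order
pos_tags = [
    "ADP", "DET", "ADJ", "NUM", "NOUN", "ADP", "PRON", "NOUN", "PUNCT", "PROPN",
    "AUX", "ADP", "NOUN", "NOUN", "ADP", "VERB", "DET", "NOUN", "PART", "NOUN",
]

pos_counts = Counter(pos_tags)

# Show the three most frequent tags
for tag, count in pos_counts.most_common(3):
    print(tag, count)
# NOUN 6
# ADP 4
# DET 2
```

With the parsed question in memory, the same tally can be built directly with `Counter(token.pos_ for token in example_question_tokens)`.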


In [24]:
# Visualize the dependency parse
from spacy import displacy
displacy.render(example_question_tokens, style='dep',
 jupyter=True, options={'distance': 120})
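
Outside a notebook, `displacy.render` can return the markup as a string when `jupyter=False`. The sketch below uses displaCy's manual mode so that it runs without a trained model; the words, tags, and arcs are hand-written for illustration and are not output from `en_core_web_sm`.

```python
from spacy import displacy

# Hand-written parse for illustration only (an assumption, not model output)
manual_parse = {
    "words": [
        {"text": "Galileo", "tag": "PROPN"},
        {"text": "was", "tag": "AUX"},
        {"text": "under", "tag": "ADP"},
        {"text": "house", "tag": "NOUN"},
        {"text": "arrest", "tag": "NOUN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 1, "end": 2, "label": "prep", "dir": "right"},
        {"start": 3, "end": 4, "label": "compound", "dir": "left"},
        {"start": 2, "end": 4, "label": "pobj", "dir": "right"},
    ],
}

# manual=True tells displaCy to draw the dict as given;
# jupyter=False makes render return the SVG markup as a string
svg = displacy.render(manual_parse, style="dep", manual=True,
                      jupyter=False, options={"distance": 120})
print(svg[:80])
```

The returned string can be written to an `.svg` file and opened in a browser.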

In [25]:
# Build a table of each token and its lemma for the first question
lemmatization = pd.DataFrame(
    [(token.text, token.lemma_) for token in example_question_tokens],
    columns=["original", "lemmatized"],
)
lemmatization

,original,lemmatized
0,For,for
1,the,the
2,last,last
3,8,8
4,years,year
5,of,of
6,his,his
7,life,life
8,",",","
9,Galileo,Galileo
