### Setup

In [65]:
!pip install eng-to-ipa
!pip install -U spacy
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
2022-11-17 11:32:36.485825: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [66]:
pip freeze | grep -E "spacy|ipa|nltk"

en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl
eng-to-ipa==0.0.2
nltk==3.7
spacy==3.4.3
spacy-legacy==3.0.10
spacy-loggers==1.0.3


In [2]:
import spacy
import eng_to_ipa as ipa

## Phonetics
Convert English text into phonetic transcription using eng-to-ipa.

The **eng-to-ipa** library uses the Carnegie Mellon University (CMU) Pronouncing Dictionary to convert English text into the International Phonetic Alphabet.

**IPA:** International Phonetic Alphabet

* **`convert()`** converts a string into IPA.
* **`ipa_list()`** returns a list with one entry per word, where each entry is a list of all possible transcriptions of that word.
* **`get_rhymes()`** returns a list of rhymes for a word or set of words.
* **`syllable_count()`** returns an integer giving the number of syllables in a word, or a list of syllable counts if the input string contains more than one word.
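
To make the dictionary-lookup idea concrete, the sketch below mimics the approach on a toy scale. The three dictionary entries, the partial ARPAbet-to-IPA table, and the function names `toy_convert` and `toy_syllable_count` are all made up for illustration. This is not eng-to-ipa's actual code or data, and stress marks are ignored.

```python
# Toy entries in the style of the CMU Pronouncing Dictionary (ARPAbet phones,
# vowels carry a stress digit). Hand-written subset, for illustration only.
TOY_CMU = {
    "go": ["G", "OW1"],
    "home": ["HH", "OW1", "M"],
    "is": ["IH1", "Z"],
}

# Partial ARPAbet -> IPA table covering only the phones used above.
ARPABET_TO_IPA = {"G": "g", "OW": "oʊ", "HH": "h", "M": "m", "IH": "ɪ", "Z": "z"}


def toy_convert(text):
    """Transcribe each word by dictionary lookup; flag unknown words with '*'."""
    out = []
    for word in text.lower().split():
        phones = TOY_CMU.get(word)
        if phones is None:
            out.append(word + "*")  # convention chosen for this sketch
            continue
        # Strip the stress digit (0, 1, 2) before mapping each phone to IPA.
        out.append("".join(ARPABET_TO_IPA[p.rstrip("012")] for p in phones))
    return " ".join(out)


def toy_syllable_count(word):
    """Count syllables as the number of phones that carry a stress digit."""
    return sum(p[-1].isdigit() for p in TOY_CMU[word.lower()])


print(toy_convert("go home"))
print(toy_syllable_count("home"))
```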

---

### Learn & play:

> https://en.wikipedia.org/wiki/International_Phonetic_Alphabet

> https://pypi.org/project/eng-to-ipa/

In [17]:
# Convert to phonetic transcript (IPA)
ipa.convert('unfortunately', retrieve_all=True)

['ənˈfɔrʧunətli', 'ənˈfɔrʧənətli']

In [8]:
ipa.convert('go home')

'goʊ hoʊm'

In [10]:
# Check for different pronunciations
ipa.convert('aluminium', retrieve_all=True )

['əˈlumɪnəm', 'ˌæˈljumɪnəm']

In [19]:
ipa.ipa_list('natural language processing is fun but difficult')

[['ˈnæʧrəl', 'ˈnæʧərəl'],
 ['ˈlæŋgwəʤ', 'ˈlæŋgwɪʤ'],
 ['ˈprɑsɛsɪŋ'],
 ['ɪz'],
 ['fən'],
 ['bət'],
 ['ˈdɪfəkəlt']]

In [29]:
# Find rhymes to a word
ipa.get_rhymes('rhyming function')

[['climbing', 'diming', 'liming', 'priming', 'timing'],
 ['compunction',
  'conjunction',
  'dysfunction',
  'injunction',
  'junction',
  'malfunction']]

In [30]:
ipa.get_rhymes('science')

['alliance',
 'appliance',
 'bioscience',
 'compliance',
 'defiance',
 'non-compliance',
 'noncompliance',
 'pseudoscience',
 'reliance']

In [31]:
# Calculate the number of syllables per token
ipa.syllable_count('interesting Wednesday business computer')

[4, 2, 2, 3]

## Morphology
Tokenization, lemmatization, stemming using spaCy.

spaCy provides a number of pre-trained model packages that you can download with the `spacy download` command:
`python -m spacy download en_core_web_sm`

For example, the "en_core_web_sm" package is a small English model that supports all core capabilities and is trained on web text.

The `spacy.load` method loads a model package by name and returns an `nlp` object. The package provides the binary weights that enable spaCy to make predictions. It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.
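
As a rough illustration of what the loaded `nlp` object carries, the sketch below prints its language class, pipeline components, and some meta information. The fallback to `spacy.blank("en")` is an assumption added here so the snippet still runs when the model package is missing; with the blank pipeline the component list is empty.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Assumption for this sketch: if the model package is not installed,
    # fall back to a blank English pipeline (tokenizer only, no trained components).
    nlp = spacy.blank("en")

print(type(nlp).__name__)   # language class chosen from the package's meta information
print(nlp.pipe_names)       # names of the pipeline components, in processing order
print(nlp.meta.get("name"), nlp.meta.get("version"))  # package name and version
```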

**Tokenization**

`Token` objects represent the tokens in a document – for example, a word or a punctuation character. To get a token at a specific position, you can index into the doc.
`Token` objects also provide various attributes that let you access more information about the tokens:

**`.text`** attribute returns the verbatim token text.

**`.like_num`** attribute checks whether a token resembles a number.

**`.i`** attribute returns the token's index in the document, so `doc[token.i + 1]` is the token that follows the current one.
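
A minimal sketch of how these attributes combine: find number-like tokens and look at the token that follows each one. It uses `spacy.blank("en")`, which only tokenizes, because `.text`, `.like_num` and `.i` do not need a trained model.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; enough for lexical attributes
doc = nlp("I studied 3 languages.")

for token in doc:
    # token.i is the token's position, so doc[token.i + 1] is its right neighbour.
    # The bounds check avoids an IndexError on the last token.
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        print(token.text, "is followed by", next_token.text)
# prints: 3 is followed by languages
```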


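**Stemming**

spaCy does not include a stemmer. It reduces words to their base form through lemmatization instead, exposed as `token.lemma_`.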


---

Learn & play more:

> https://spacy.io/





In [38]:
# Load the spaCy language model, and create the nlp object.
nlp = spacy.load('en_core_web_sm')

In [102]:
# Process text. spaCy creates a Doc object when you process a text with the nlp object.
# The Doc gives structured access to the text, and no information is lost.
doc = nlp("I studied 3 languages.")

In [103]:
# Iterate over tokens in a Doc
for token in doc:
    print(token, token.lemma_)

I I
studied study
3 3
languages language
. .


In [104]:
# count the tokens
len(doc)

5

In [105]:
# Index into the Doc. doc[1] is the token "studied", stored here in the variable `studied`
studied = doc[1]
studied

studied

In [106]:
# Output the string for a given element
doc[1].text

'studied'

In [107]:
# Slice the Doc to drop the final period
doc[0:4]

I studied 3 languages

In [109]:
# Get the token "3"
number = doc[2]
number

3

In [110]:
number.like_num

True

In [111]:
studied.like_num

False

In [112]:
number.like_email

False

In [113]:
# ID of the verbatim text content.
number.orth

602994839685422785

In [87]:
# Get the parent document
number.doc

I studied 3 languages.

In [114]:
# ID of the base form of the token, with no inflectional suffixes.
studied.lemma

4251533498015236010

In [115]:
# Base form of the token, with no inflectional suffixes.
studied.lemma_

'study'
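
The attribute pairs above follow one pattern: the name without an underscore (`orth`, `lemma`) gives an integer ID, and the name with a trailing underscore (`orth_`, `lemma_`) gives the string. The sketch below shows, under the assumption of a blank English pipeline, how the ID and the string map to each other through `nlp.vocab.strings`.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; the string store works without a model
doc = nlp("I studied 3 languages.")
token = doc[1]

orth_id = token.orth                             # integer hash of the verbatim text
print(nlp.vocab.strings[orth_id])                # hash -> string
print(nlp.vocab.strings["studied"] == orth_id)   # string -> hash gives the same ID
print(token.orth_ == token.text)                 # the underscore variant is the string
```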

In [116]:
len(studied)

7

In [117]:
studied.text.capitalize()

'Studied'

In [119]:
studied.text.count('d')

2

In [127]:
# Loop over the tokens in the Doc, using the variable 'token' to refer to each one
for token in doc:
    # Print the token and the results of morphological analysis
    print(token.text, token.morph, sep=': ')
    print(' ')

I: Case=Nom|Number=Sing|Person=1|PronType=Prs
 
studied: Tense=Past|VerbForm=Fin
 
3: NumType=Card
 
languages: Number=Plur
 
.: PunctType=Peri
 


In [93]:
studied.morph

Tense=Past|VerbForm=Fin

In [94]:
studied.morph.get('Tense')

['Past']

In [95]:
number.morph

NumType=Card
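
Morphological features can also be read as a dictionary or used as a filter. In the sketch below the features are set by hand with `Token.set_morph`, copied from the output above, so that it runs on a blank pipeline; a trained pipeline would predict them instead.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I studied 3 languages.")

# Hand-set annotations copied from the output above (assumption of this sketch;
# with en_core_web_sm loaded, the morphologizer fills these in).
doc[1].set_morph("Tense=Past|VerbForm=Fin")
doc[3].set_morph("Number=Plur")

# Read all features of a token as a plain dict.
for token in doc:
    features = token.morph.to_dict()
    if features:
        print(token.text, features)

# morph.get() returns a list of values, empty when the feature is absent.
past_tense = [t.text for t in doc if "Past" in t.morph.get("Tense")]
print(past_tense)
```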

## Syntax

POS:
```
    "ADJ": "adjective",
    "ADP": "adposition", <-- prepositions and postpositions together
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CONJ": "conjunction",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
    "EOL": "end of line",
    "SPACE": "space"
```

Noun chunks:
 ```
    "NP": "noun phrase",
    "PP": "prepositional phrase",
    "VP": "verb phrase",
    "ADVP": "adverb phrase",
    "ADJP": "adjective phrase",
    "SBAR": "subordinating conjunction",
    "PRT": "particle",
    "PNP": "prepositional noun phrase"
```

Dependency labels (selected):

```
    "advmod": "adverbial modifier",
    "amod": "adjectival modifier",
    "attr": "attribute",
    "aux": "auxiliary",
    "auxpass": "auxiliary (passive)",
    "case": "case marking",
    "conj": "conjunct",
    "csubj": "clausal subject",
    "det": "determiner",
    "dobj": "direct object",
    "expl": "expletive",
    "hyph": "hyphen",
    "infmod": "infinitival modifier",
    "meta": "meta modifier",
    "neg": "negation modifier",
    "nmod": "modifier of nominal",
    "nn": "noun compound modifier",
    "nsubj": "nominal subject",
    "nsubjpass": "nominal subject (passive)",
    "nounmod": "modifier of nominal",
    "npmod": "noun phrase as adverbial modifier",
    "num": "number modifier",
    "number": "number compound modifier",
    "nummod": "numeric modifier",
    "obj": "object",
    "pcomp": "complement of preposition",
    "pobj": "object of preposition",
    "poss": "possession modifier",
    "possessive": "possessive modifier",
    "prep": "prepositional modifier",
    
```
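
The descriptions listed above do not need to be memorised: `spacy.explain` looks a label up in spaCy's glossary. A short sketch, using a handful of labels taken from the lists above:

```python
import spacy

# Labels picked from the POS, phrase, and dependency lists above.
labels = ["ADP", "PROPN", "NP", "nsubj", "pobj"]

for label in labels:
    # spacy.explain returns the glossary description for a known label.
    print(f"{label:>6}: {spacy.explain(label)}")
```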
---


Learn & Play

Full list of tags: 
> https://github.com/explosion/spaCy/blob/master/spacy/glossary.py

Displacy options:
> https://spacy.io/api/top-level#displacy_options

In [128]:
from spacy import displacy

In [129]:
doc = nlp('Lucerne is a beautiful town in Switzerland.')

In [130]:
for token in doc:
    print(token.text + '\n\t-->' + token.pos_)

Lucerne
	-->PROPN
is
	-->AUX
a
	-->DET
beautiful
	-->ADJ
town
	-->NOUN
in
	-->ADP
Switzerland
	-->PROPN
.
	-->PUNCT


In [134]:
# Dependency parse labelled with coarse-grained POS tags (token.pos_)
displacy.render(doc, style="dep", jupyter=True)

In [136]:
# Same parse labelled with fine-grained POS tags (token.tag_)
displacy.render(doc, style="dep", jupyter=True, options={'fine_grained': True})

In [101]:
# Merge noun phrases into one token.
displacy.render(doc, style="dep", jupyter=True, options={'collapse_phrases': True})
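
Outside a notebook, `displacy.render` can return the markup as a string when `jupyter=False`, which makes it possible to save the diagram to a file. The sketch below uses displaCy's manual input mode with a hand-written parse, so the words, tags, and arcs are illustrative rather than model output; the file name `lucerne_dep.svg` is likewise an arbitrary choice.

```python
from spacy import displacy

# Hand-written parse in displaCy's manual input format (illustrative only,
# not produced by a trained pipeline).
parse = {
    "words": [
        {"text": "Lucerne", "tag": "PROPN"},
        {"text": "is", "tag": "AUX"},
        {"text": "beautiful", "tag": "ADJ"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 1, "end": 2, "label": "acomp", "dir": "right"},
    ],
}

# jupyter=False makes render() return the SVG markup instead of displaying it.
# "compact" and "distance" are layout options of the dependency visualizer.
svg = displacy.render(
    parse,
    style="dep",
    manual=True,
    jupyter=False,
    options={"compact": True, "distance": 120},
)

with open("lucerne_dep.svg", "w", encoding="utf-8") as f:
    f.write(svg)
```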

In [140]:
spacy.explain(u'NNP')

'noun, proper singular'

In [137]:
# Get a description for a given POS tag, dependency label or entity type
spacy.explain(u'AUX')

'auxiliary'

In [143]:
# Stop words
stopwords = nlp.Defaults.stop_words

# Default stop word list in spaCy
print(len(stopwords))
print(stopwords)

326
{'thereupon', 'to', 'upon', 'via', 'call', 'same', 'hers', 'along', 'when', 'would', 'beforehand', 'myself', 'very', 'them', 'us', 'all', 'other', 'then', 'there', 'below', "'d", 'was', 'say', '’d', 'thence', 'will', '’m', 'part', 'except', 'elsewhere', 'keep', '‘re', 'within', 'whereby', 'fifteen', '‘d', 'during', 'either', 'down', 'former', 'none', 'such', 'up', 'top', 'though', 'without', 'whose', 'almost', 'does', 'mostly', 'whoever', 'due', 'wherever', 'noone', 'every', 'were', 'yours', 'you', 'several', '‘ll', 'everyone', 'since', 'see', 'we', 'she', 'no', 'our', 'often', 'so', 'hereby', 'above', 'because', 'indeed', 'thereafter', 'sometimes', 'move', 'for', 'whence', 'become', 'last', 'doing', 'however', 'yet', 'hundred', 'make', 'before', 'unless', '‘s', 'never', 'another', 'namely', '’ll', 'my', 'could', '‘ve', 'twenty', 'first', 'nor', 'are', 'an', 'into', 'one', 'over', 'own', 'which', 'whereas', 'here', 'already', 'something', 'i', 'cannot', 'formerly', 'be', 'herein', 

In [104]:
# Check stop words in document
for token in doc:
    print(token.text, '\n-->' , token.is_stop)

Lucerne 
--> False
is 
--> True
a 
--> True
beautiful 
--> False
town 
--> False
in 
--> True
Switzerland 
--> False
. 
--> False


In [144]:
# Add stop words
nlp.Defaults.stop_words |= {"newstopword1","newstopword2"}
nwords = [i for i in stopwords if(i.startswith('n'))]
print(nwords)

['none', 'noone', 'no', 'never', 'namely', 'nor', 'newstopword1', 'nevertheless', "n't", 'not', 'n‘t', 'now', 'nothing', 'newstopword2', 'nobody', 'neither', 'next', 'n’t', 'nine', 'nowhere', 'name']


In [145]:
# Remove stop words from default list
nlp.Defaults.stop_words -= {"newstopword1","newstopword2"}
nwords = [i for i in stopwords if(i.startswith('n'))]
print(nwords)

['none', 'noone', 'no', 'never', 'namely', 'nor', 'nevertheless', "n't", 'not', 'n‘t', 'now', 'nothing', 'nobody', 'neither', 'next', 'n’t', 'nine', 'nowhere', 'name']


In [146]:
# Remove stop words from text
sentence_wo_stops = []
for token in doc:
  if token.text in stopwords:
    print('stop word \t-->', token)
  else:
    print('content word \t-->', token)
    sentence_wo_stops.append(token.text)

print('\nThe sentence after stop word removal contains only content words:', ' '.join(sentence_wo_stops))

content word 	--> Lucerne
stop word 	--> is
stop word 	--> a
content word 	--> beautiful
content word 	--> town
stop word 	--> in
content word 	--> Switzerland
content word 	--> .

The sentence after stop word removal contains only content words: Lucerne beautiful town Switzerland .
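
The cell above tests `token.text in stopwords`, which is case-sensitive and keeps punctuation. A possible alternative, sketched here on a blank English pipeline with a made-up example sentence, is to filter on the token flags `is_stop` and `is_punct`:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; stop-word flags are lexical attributes
doc = nlp("The town is in Switzerland.")

# The stop word set holds lowercase entries, so a plain membership test
# misses the capitalised sentence-initial "The" ...
print("The" in nlp.Defaults.stop_words)
# ... while is_stop compares the lowercased form.
print(doc[0].is_stop)

# Keep tokens that are neither stop words nor punctuation.
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_words)
```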


## Semantics

In [155]:
s1_apple = nlp('She worked at Apple.')
s2_apple = nlp('She ate an apple.')

In [156]:
# Word sense disambiguation (WSD) using POS tags
displacy.render(s1_apple, style="dep", jupyter=True, options={'fine_grained': False})

In [157]:
# Word sense disambiguation (WSD) using POS tags
displacy.render(s2_apple, style="dep", jupyter=True, options={'fine_grained': False})

In [158]:
s1_java = nlp('She worked in Java.')
s2_java = nlp('She codes in Java.')

In [159]:
displacy.render(s1_java, style="dep", jupyter=True, options={'fine_grained': True})

In [160]:
displacy.render(s2_java, style="dep", jupyter=True, options={'fine_grained': True})