
## it_nltmadj_01_enus_04

### WordNet with NLTK

**WordNet** is a database of English words that are linked together by their semantic relationships. It is like a supercharged dictionary/thesaurus with a graph structure.

In [29]:
'''
WordNet is available as a corpus in Python's Natural Language Toolkit (NLTK). It is a large lexical database of English
nouns, adjectives, adverbs and verbs, grouped into sets of cognitive synonyms called synsets.
'''
import nltk
nltk.download('wordnet')

from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nukes\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### Synsets
A synonym set, or synset, is a group of synonyms that share one meaning. A word with several meanings belongs to several synsets.

In [30]:
syn_arr = wordnet.synsets('good') # look up a word in WordNet; the result has one synset per meaning of the word
print(syn_arr)

[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'), Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'), Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'), Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'), Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'), Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'), Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'), Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]


#### Definition 
Synsets also come with a prose **definition** and some **example** sentences:

In [31]:
syn_arr[3].definition() # Each synset has its own definition; index 3 is commodity.n.01

'articles of commerce'

In [32]:
syn_arr[1].examples() # Returns the example sentences for this synset. You can try the other indexes as well

['there is much good to be found in people']

#### Lemmas
The synonyms contained within a synset are called **lemmas**.

In [33]:
syn_arr[1].lemma_names()


['good', 'goodness']

##### Synonyms and Antonyms

In [34]:
# Creating a list of all the synonyms and antonyms for a particular word
syn = list() # Empty synonyms list
ant = list() # Empty antonyms list
for synset in syn_arr: 
    for lemma in synset.lemmas():
        syn.append(lemma.name())    #add the synonyms to the list
        if lemma.antonyms():    #When antonyms are available, add them into the list
            ant.append(lemma.antonyms()[0].name())
print('Synonyms: ' + str(syn))
print('Antonyms: ' + str(ant))

Synonyms: ['good', 'good', 'goodness', 'good', 'goodness', 'commodity', 'trade_good', 'good', 'good', 'full', 'good', 'good', 'estimable', 'good', 'honorable', 'respectable', 'beneficial', 'good', 'good', 'good', 'just', 'upright', 'adept', 'expert', 'good', 'practiced', 'proficient', 'skillful', 'skilful', 'good', 'dear', 'good', 'near', 'dependable', 'good', 'safe', 'secure', 'good', 'right', 'ripe', 'good', 'well', 'effective', 'good', 'in_effect', 'in_force', 'good', 'good', 'serious', 'good', 'sound', 'good', 'salutary', 'good', 'honest', 'good', 'undecomposed', 'unspoiled', 'unspoilt', 'good', 'well', 'good', 'thoroughly', 'soundly', 'good']
Antonyms: ['evil', 'evilness', 'bad', 'badness', 'bad', 'evil', 'ill']
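
The synonym list printed above repeats the same lemma many times, because every synset contributes its own copy. The following is a minimal sketch of removing the repeats while keeping first-seen order. The sample list is copied from the start of the output above, and unique_in_order() is an illustrative helper name, not part of NLTK.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the order in which they first appear."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

# First few entries of the synonym list printed above
sample = ['good', 'good', 'goodness', 'good', 'goodness', 'commodity', 'trade_good']
unique_synonyms = unique_in_order(sample)
print(unique_synonyms)
```

Using set() would also remove the repeats, but it would not preserve the order in which WordNet returned the senses.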


#### WordNet Hierarchy

As we have seen, synsets are connected to other synsets by semantic relations. Together these relations form a hierarchy of concepts.

That hierarchy can be described with two terms:

- Hypernyms: the more general synsets, one level up

- Hyponyms: the more specific synsets, one level down

In [35]:
print('Hypernyms: ',syn_arr[0].hypernyms())
print('Hyponyms: ',syn_arr[0].hyponyms())

Hypernyms:  [Synset('advantage.n.01')]
Hyponyms:  [Synset('common_good.n.01')]



## it_nltmadj_01_enus_05

### Wordnet Relations & Semantic Similarity

In [36]:
import nltk
nltk.download('wordnet')
nltk.download('omw') # The Open Multilingual Wordnet links wordnets in other languages to the synsets of the
# original English WordNet, so lemma names can be looked up per language

from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nukes\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw to
[nltk_data]     C:\Users\nukes\AppData\Roaming\nltk_data...
[nltk_data]   Package omw is already up-to-date!


#### Meronym
A meronym denotes a part or member of something. For example, 'wheel' is a meronym of 'automobile'.

Hypernyms and hyponyms are called lexical relations because they relate one synset to another. These two relations navigate up and down the "is-a" hierarchy. Another important way to navigate the WordNet network is from items to their components (meronyms) or to the things they are contained in (holonyms). For example, the parts of a tree are its trunk, crown, and so on; these are returned by part_meronyms(). The substances a tree is made of include heartwood and sapwood; these are returned by substance_meronyms(). A collection of trees forms a forest; this is returned by member_holonyms().

In [37]:
syn_arr = wordnet.synsets('tree')
print('part_meronyms: ',syn_arr[0].part_meronyms())
print('substance_meronyms: ',syn_arr[0].substance_meronyms()) 

part_meronyms:  [Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')]
substance_meronyms:  [Synset('heartwood.n.01'), Synset('sapwood.n.01')]


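#### Holonym
A holonym is the whole of which something is a part, substance, or member. It is the inverse of a meronym.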

In [38]:
print('part_holonym: ',wordnet.synset('trunk.n.01').part_holonyms())
print('substance_holonym: ',wordnet.synset('heartwood.n.01').substance_holonyms())

part_holonym:  [Synset('tree.n.01')]
substance_holonym:  [Synset('tree.n.01')]


#### Entailments
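Entailment is a semantic relationship between two verbs: a verb entails another verb when the first action cannot take place without the second.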

In [39]:
wordnet.synset('tease.v.03').entailments()

[Synset('arouse.v.07'), Synset('disappoint.v.01')]

#### Semantic Similarity
To compute the similarity between two sentences, we build on the semantic similarity between their word senses. Synonyms and antonyms are one step in this direction.

Some similarity approaches can be found below.

- Vector space model
  - Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)

- Semantic (via WordNet)
  - Similarity measures (Pedersen et al.)
    - Path lengths between concepts
      - lch (Leacock and Chodorow)
      - wup (Wu and Palmer)
      - path (path similarity)
    - Information content
      - res (Resnik)
      - lin (Lin)
      - jcn (Jiang and Conrath)
  - Measures of relatedness (Pedersen et al.)
    - hso (Hirst and St-Onge)
    - lesk (Banerjee and Pedersen)
    - vector (Patwardhan; related to the 'vector space model' above)
- Possibly other approaches…



Using WordNet, similarity can be calculated with the *path_similarity()* function. It returns a score between 0 and 1 that indicates how similar two word senses are, based on the shortest path that connects them in the WordNet hypernym/hyponym hierarchy.

In [40]:
# Note: a notebook cell displays only the value of its last expression (water vs. sea here)
wordnet.synset('bus.n.01').path_similarity(wordnet.synset('car.n.01'))
wordnet.synset('water.n.01').path_similarity(wordnet.synset('sea.n.01'))

0.125

In [41]:
wordnet.synset('car.n.01').path_similarity(wordnet.synset('car.n.01'))

1.0
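
To make the score more concrete: path similarity is 1 / (d + 1), where d is the number of edges on the shortest path between the two senses. That is why a synset compared with itself scores 1.0. The sketch below applies the formula to a tiny hand-made hierarchy. The HYPERNYM dictionary and both helper functions are hypothetical stand-ins for illustration only, not WordNet data or NLTK functions.

```python
from collections import deque

# Toy "is-a" links (child -> parent); NOT real WordNet data
HYPERNYM = {
    'corgi': 'dog',
    'poodle': 'dog',
    'dog': 'animal',
    'cat': 'animal',
}

def shortest_path_length(a, b):
    """Breadth-first search over the toy hierarchy, treating links as undirected."""
    neighbours = {}
    for child, parent in HYPERNYM.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected

def toy_path_similarity(a, b):
    """Score = 1 / (shortest path length + 1), or None when no path exists."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)

print(toy_path_similarity('corgi', 'corgi'))   # identical concepts
print(toy_path_similarity('corgi', 'poodle'))  # two edges apart
print(toy_path_similarity('corgi', 'cat'))     # three edges apart
```

The further apart two concepts sit in the hierarchy, the closer the score gets to 0.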

#### Multilingual WordNet

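Through the Open Multilingual Wordnet, WordNet is linked to wordnets in many other languages, which makes it one of the largest multilingual dictionaries. It is reportedly also used by Google Translate as part of the translation process between languages.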

In [45]:
print(sorted(wordnet.langs())) # List of languages supported by wordnet
print('\n')
print(wordnet.synset('sea.n.01').lemma_names('ita')) # Example of translation into other languages

['eng']


['mare']



## it_nltmadj_01_enus_06

### Semantic similarity of two sentences using WordNet

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw')

from nltk.corpus import wordnet as wn
import string
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
english_stopwords = set(stopwords.words('english'))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw to /root/nltk_data...
[nltk_data]   Package omw is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
# This method is defined to remove the punctuations from the given text
regular_punct = list(string.punctuation) # creating a list of punctuations 
def remove_punctuation(text,punct_list):
    for punc in punct_list:
        if punc in text:
            text = text.replace(punc, ' ') # The replace method is used to replace a specific phrase with another
    return text.strip() # The strip() method returns a copy of the string by removing both the leading and the trailing spaces 
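
An alternative way to write the same cleaning step is a single translation table, which replaces every punctuation character in one pass instead of looping over the list. This is only a sketch of that variant; remove_punctuation_translate() is an illustrative name and is not used elsewhere in this notebook.

```python
import string

# Map every punctuation character to a space in one translation table
_PUNCT_TO_SPACE = str.maketrans({ch: ' ' for ch in string.punctuation})

def remove_punctuation_translate(text):
    """Replace punctuation with spaces, then trim leading and trailing spaces."""
    return text.translate(_PUNCT_TO_SPACE).strip()

cleaned = remove_punctuation_translate('The Cardigan dog breed is superior.')
print(cleaned)
```

As with the loop version, contractions such as "don't" are split into two tokens ("don" and "t"), because the apostrophe is also replaced by a space.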

In [None]:
def preprocess(sentence):
    '''
    Typical preprocessing steps for sentence similarity:
    Clean: remove punctuation characters and numbers.
    Normalize: lowercase and expand contractions (don't -> do not).
    Lemmatization: obtain "root" words that can be found in a dictionary. This maps inflected forms (e.g. past tense)
    to their base form and converts plural words to singular.
    Tokenization.
    Determine parts of speech.
    Remove stop words: these are low-value words that occur so frequently that they offer no discriminating value
    when determining the similarity between texts.
    Obtain and save synsets for each sentence.

    This function covers punctuation removal, tokenization, POS tagging, stop-word filtering and noun-synset lookup.
    '''
    sentence = remove_punctuation(sentence,regular_punct)
    tokens = nltk.word_tokenize(sentence)
    tagged_tokens = nltk.pos_tag(tokens)
    
    # Filter stopwords. Each item is a (word, tag) tuple, so the word itself must be tested
    tagged_sentence_words = [(word, pos) for word, pos in tagged_tokens if word.lower() not in english_stopwords]
    
    synsets_list = []
    ##get synsets
    for word,pos in tagged_sentence_words:
        
        synsets = wn.synsets(word.lower() ,pos = 'n')
        synsets_list.extend(synsets)
    
    return synsets_list

In [None]:
def SimScore(synsets1, synsets2):
    """
    Purpose: Computes sentence similarity using Wordnet path_similarity().
    Input: Synset lists representing sentence 1 and sentence 2.
    Output: Similarity score as a float
    """

    print("-----")
    print("Synsets1: %s\n" % synsets1)
    print("Synsets2: %s\n" % synsets2)

    sumSimilarityscores = 0
    scoreCount = 0

    # For each synset in the first sentence...
    for synset1 in synsets1:

        synsetScore = 0
        similarityScores = []

        # For each synset in the second sentence...
        for synset2 in synsets2:

            # Only compare synsets with the same POS tag. Word to word knowledge
            # measures cannot be applied across different POS tags.
            if synset1.pos() == synset2.pos():

                # Note below is the call to path_similarity mentioned above. 
                synsetScore = synset1.path_similarity(synset2)

                # path_similarity() can return None, so guard before comparing.
                if synsetScore is not None and synsetScore < 1:
                    print("Path Score %0.2f: %s vs. %s" % (synsetScore, synset1, synset2))
                    similarityScores.append(synsetScore)

                # If there are no similarity results but the SAME WORD is being
                # compared then it gives a max score of 1.
                elif synset1.name().split(".")[0] == synset2.name().split(".")[0]:
                    synsetScore = 1
                    print("Path MAX-Score %0.2f: %s vs. %s" % (synsetScore, synset1, synset2))
                    similarityScores.append(synsetScore)

                synsetScore = 0

        if(len(similarityScores) > 0):
            sumSimilarityscores += max(similarityScores)
            scoreCount += 1

    # Average the per-synset maximum scores; return 0.0 when nothing was comparable.
    avgScores = sumSimilarityscores / scoreCount if scoreCount > 0 else 0.0

    return avgScores

In [None]:
sent1 = 'The Cardigan dog breed is superior.'
sent2 = 'Pembroke breed is the best.'

# Preprocess and extract synsets
synsets1 = preprocess(sent1)
synsets2 = preprocess(sent2)

In [None]:
print(synsets1)
print('\n')
print(synsets2)
print('\n--------------------\n')
print(synsets1[0].path_similarity(synsets2[5]))

[Synset('cardigan.n.01'), Synset('cardigan.n.02'), Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('breed.n.01'), Synset('breed.n.02'), Synset('superior.n.01'), Synset('superior.n.02'), Synset('victor.n.01'), Synset('lake_superior.n.01'), Synset('superior.n.05'), Synset('superscript.n.01')]


[Synset('pembroke.n.01'), Synset('breed.n.01'), Synset('breed.n.02'), Synset('best.n.01'), Synset('best.n.02'), Synset('best.n.03')]

--------------------

0.07142857142857142


In [None]:
wn.synset('cardigan.n.01').definition()
wn.synset('cardigan.n.02').definition()

'slightly bowlegged variety of corgi having rounded ears and a long tail'

#### SimScore equation

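As implemented in SimScore, each synset of the first sentence is matched with its most similar synset of the same part of speech in the second sentence, and these best scores are averaged:

$$\mathrm{SimScore}(S_1, S_2) = \frac{1}{|S_1'|} \sum_{s_1 \in S_1'} \; \max_{s_2 \in S_2,\ \mathrm{pos}(s_2) = \mathrm{pos}(s_1)} \mathrm{path\_similarity}(s_1, s_2)$$

where $S_1'$ is the subset of synsets in $S_1$ that have at least one comparable synset in $S_2$.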

In [None]:
scores = SimScore(synsets1, synsets2)


-----
Synsets1: [Synset('cardigan.n.01'), Synset('cardigan.n.02'), Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('breed.n.01'), Synset('breed.n.02'), Synset('superior.n.01'), Synset('superior.n.02'), Synset('victor.n.01'), Synset('lake_superior.n.01'), Synset('superior.n.05'), Synset('superscript.n.01')]

Synsets2: [Synset('pembroke.n.01'), Synset('breed.n.01'), Synset('breed.n.02'), Synset('best.n.01'), Synset('best.n.02'), Synset('best.n.03')]

Path Score 0.07: Synset('cardigan.n.01') vs. Synset('pembroke.n.01')
Path Score 0.07: Synset('cardigan.n.01') vs. Synset('breed.n.01')
Path Score 0.05: Synset('cardigan.n.01') vs. Synset('breed.n.02')
Path Score 0.06: Synset('cardigan.n.01') vs. Synset('best.n.01')
Path Score 0.09: Synset('cardigan.n.01') vs. Synset('best.n.02')
Path Score 0.07: Synset('cardigan.n.01') vs. Synset('best.n.03')
Path Score 0.33: Synset('cardigan.n.02') vs

In [None]:
print("Func Score: %0.2f" % scores)

Func Score: 0.27
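
The aggregation step inside SimScore can be seen in isolation with a small sketch: for each item of the first sentence take the best score against the second sentence, then average those maxima. The similarity table below is made up for illustration, and average_best_match() and toy_similarity() are hypothetical names, not WordNet values or NLTK functions.

```python
def average_best_match(items1, items2, similarity):
    """Average, over items1, of the best similarity each item reaches in items2."""
    best_scores = []
    for a in items1:
        # Keep only the pairs for which a score is defined
        scores = [s for s in (similarity(a, b) for b in items2) if s is not None]
        if scores:
            best_scores.append(max(scores))
    return sum(best_scores) / len(best_scores) if best_scores else 0.0

# Made-up pairwise scores, for illustration only
TOY_SCORES = {
    ('corgi', 'dog'): 0.5,
    ('corgi', 'best'): 0.1,
    ('breed', 'dog'): 0.2,
    ('breed', 'best'): None,  # no score available for this pair
}

def toy_similarity(a, b):
    return TOY_SCORES.get((a, b))

score = average_best_match(['corgi', 'breed'], ['dog', 'best'], toy_similarity)
print(round(score, 2))
```

Note that this measure is not symmetric: swapping the two sentences can change the result. One common workaround is to compute it in both directions and average the two values.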



## it_nltmadj_01_enus_07

### **Regular Expressions**

**Regular expressions** (REs, regexes, or regex patterns) are a powerful language for matching text patterns. They are useful for searches, e.g., finding e-mail addresses or phone numbers. This notebook gives a basic introduction to regular expressions and their use in Python.

In [None]:
#import libraries
import re

To define a reusable pattern, compile it with re.compile().

In [None]:
test_string = '123abc456789abc123ABC'
pattern = re.compile(r'abc')
print(pattern)

re.compile('abc')


Some **search methods** in regex are:
- match(): Determine if the RE matches at the beginning of the string.
- search(): Scan through a string, looking for any location where this RE matches.
- findall(): Find all substrings where the RE matches, and returns them as a list.
- finditer(): Find all substrings where the RE matches, and returns them as an iterator.


In [None]:
# match
my_string = 'abc123ABC123abc'
pattern = re.compile(r'abc')
match = pattern.match(my_string)
print(match)

<re.Match object; span=(0, 3), match='abc'>


In [None]:
# match
my_string = 'abc123ABC123abc'
pattern = re.compile(r'123')
match = pattern.match(my_string)
print(match)

None


In [None]:
# Search
my_string = 'abc123ABC123abcDOG'
pattern = re.compile(r'DOG')
match = pattern.search(my_string)
print(match)

<re.Match object; span=(15, 18), match='DOG'>


In [None]:
# Search
my_string = 'abc123ABC123abc'
pattern = re.compile(r'123')
match = pattern.search(my_string)
print(match)

<re.Match object; span=(3, 6), match='123'>


In [None]:
# findall()
my_string = 'abc123ABC123abc'
pattern = re.compile(r'123')
matches = pattern.findall(my_string)
print(matches)
for match in matches:
    print(match)

['123', '123']
123
123


In [None]:
# finditer()
test_string = '123abc456789abc123ABC'
matches = re.finditer(r'abc', test_string)
for match in matches:
    print(match)

<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(12, 15), match='abc'>


## it_nltmadj_01_enus_08

#### Methods of the Match object

- group(): Return the string matched by the RE
- start(): Return the starting position of the match
- end(): Return the ending position of the match
- span(): Return a tuple containing the (start, end) positions of the match

In [None]:
my_string = 'abc123ABC123abc'
pattern = re.compile(r'abc')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)
    print(match.span(), match.start(), match.end())
    print(match.group()) # returns the string

<re.Match object; span=(0, 3), match='abc'>
(0, 3) 0 3
abc
<re.Match object; span=(12, 15), match='abc'>
(12, 15) 12 15
abc


#### Meta characters
Metacharacters are characters with a special meaning. They include . ^ $ * + ? { } [ ] \ | ( )

- . Any character (except newline character) "he..o"
- ^ Starts with "^hello"
- $ Ends with "world$"
- (*) Zero or more occurrences "aix*"
- (+) One or more occurrences "aix+"
- { } Exactly the specified number of occurrences "al{2}"
- [ ] A set of characters "[a-m]"
- \ Signals a special sequence (can also be used to escape special characters) "\d"
- | Either or "falls|stays"
- ( ) Capture and group
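
The example patterns from the list above can be tried directly. The sketch below uses short sample strings chosen for illustration; the variable names are arbitrary.

```python
import re

text = 'hello world'

# ^ anchors at the start of the string, $ at the end
starts_with_hello = re.search(r'^hello', text) is not None
ends_with_world = re.search(r'world$', text) is not None

# . matches any single character except a newline
dot_matches = re.findall(r'he..o', text)

# [a-m] matches one character in the range a to m
set_matches = re.findall(r'[a-m]', 'regex')

# {2} repeats the preceding item exactly twice
exact_matches = re.findall(r'al{2}', 'tall al')

# | matches either alternative
either_matches = re.findall(r'falls|stays', 'The rain falls and the snow stays')

print(starts_with_hello, ends_with_world)
print(dot_matches, set_matches, exact_matches, either_matches)
```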

In [None]:
test_string = 'demo.com'
pattern = re.compile(r'\.')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(4, 5), match='.'>


#### More Metacharacters / Special Sequences

Special sequences: '\' followed by a character

- \d : Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].
- \D : Matches any non-digit character; this is equivalent to the class [^0-9].
- \s : Matches any whitespace character.
- \S : Matches any non-whitespace character.
- \w : Matches any alphanumeric (word) character; for ASCII text this is equivalent to the class [a-zA-Z0-9_].
- \W : Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
- \b : Matches where the specified characters are at the beginning or at the end of a word r"\bain" r"ain\b"
- \B : Matches where the specified characters are present, but NOT at the beginning (or at the end) of a word r"\Bain" r"ain\B"
- \A : Matches if the specified characters are at the beginning of the string "\AThe"
- \Z : Matches if the specified characters are at the end of the string "Spain\Z"
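
The word-boundary and string-anchor sequences are easier to follow on a concrete sentence. The sketch below runs the example patterns from the list on a sample string picked for illustration.

```python
import re

text = 'The rain in Spain'

# \b marks a word boundary: 'ain' at the start vs. at the end of a word
ain_at_word_start = re.findall(r'\bain', text)
ain_at_word_end = re.findall(r'ain\b', text)

# \B is the opposite: 'ain' NOT at the start / NOT at the end of a word
ain_not_at_start = re.findall(r'\Bain', text)
ain_not_at_end = re.findall(r'ain\B', text)

# \A and \Z anchor at the very beginning and the very end of the string
starts_with_the = re.search(r'\AThe', text) is not None
ends_with_spain = re.search(r'Spain\Z', text) is not None

# \w+ collects runs of word characters
words = re.findall(r'\w+', text)

print(ain_at_word_start, ain_at_word_end)
print(ain_not_at_start, ain_not_at_end)
print(starts_with_the, ends_with_spain, words)
```

Both "rain" and "Spain" end in "ain", so the end-of-word pattern finds two matches while the start-of-word pattern finds none.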

In [None]:
test_string = 'This is demo string - 12345'
pattern = re.compile(r'\d')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

print()
pattern = re.compile(r'\s')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(22, 23), match='1'>
<re.Match object; span=(23, 24), match='2'>
<re.Match object; span=(24, 25), match='3'>
<re.Match object; span=(25, 26), match='4'>
<re.Match object; span=(26, 27), match='5'>

<re.Match object; span=(4, 5), match=' '>
<re.Match object; span=(7, 8), match=' '>
<re.Match object; span=(12, 13), match=' '>
<re.Match object; span=(19, 20), match=' '>
<re.Match object; span=(21, 22), match=' '>


#### Quantifiers
- (*) : 0 or more
- (+) : 1 or more
- ? : 0 or 1, used when a character can be optional
- {4} : exactly 4 occurrences
- {4,6} : between 4 and 6 occurrences (min, max)
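
Each quantifier in the list can be checked with a one-line pattern. The sample strings below are made up for illustration.

```python
import re

# * allows zero or more 'b' after 'a'; + requires at least one
star_matches = re.findall(r'ab*', 'a ab abbb')
plus_matches = re.findall(r'ab+', 'a ab abbb')

# ? makes the preceding character optional
optional_matches = re.findall(r'colou?r', 'color colour')

# {4} needs exactly four digits in a row (a longer run yields its first four)
exact_matches = re.findall(r'\d{4}', 'pin 1234 and 56789')

# {4,6} with word boundaries keeps only whole numbers of 4 to 6 digits
range_matches = re.findall(r'\b\d{4,6}\b', '123 1234 123456 1234567')

print(star_matches, plus_matches)
print(optional_matches, exact_matches, range_matches)
```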

In [None]:
my_string = 'hello_12'
pattern = re.compile(r'_?\d*')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

print('\n--------------')
pattern = re.compile(r'\d+')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

print('\n--------------')   
pattern = re.compile(r'\d{3}') # or if you need a range r'\d{3,5}'
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(1, 1), match=''>
<re.Match object; span=(2, 2), match=''>
<re.Match object; span=(3, 3), match=''>
<re.Match object; span=(4, 4), match=''>
<re.Match object; span=(5, 8), match='_12'>
<re.Match object; span=(8, 8), match=''>

--------------
<re.Match object; span=(6, 8), match='12'>

--------------


## it_nltmadj_01_enus_09

#### Conditions
Use | for an either/or condition: the pattern matches if any one of the alternatives matches.

In [None]:
my_string = """
Mr. Simpson
Mrs Simpson
Mr. Brown
Ms Smith
Mr. T
"""
pattern = re.compile(r'Mr\.?\s\w+')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

print('\n----------------')
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s\w+')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(1, 12), match='Mr. Simpson'>
<re.Match object; span=(25, 34), match='Mr. Brown'>
<re.Match object; span=(44, 49), match='Mr. T'>

----------------
<re.Match object; span=(1, 12), match='Mr. Simpson'>
<re.Match object; span=(13, 24), match='Mrs Simpson'>
<re.Match object; span=(25, 34), match='Mr. Brown'>
<re.Match object; span=(35, 43), match='Ms Smith'>
<re.Match object; span=(44, 49), match='Mr. T'>
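
The parentheses around the alternatives also create a capture group, so the matched title can be read back separately from the rest of the match. The sketch below is a variation on the pattern above: it adds a second group around the surname, which is an addition for illustration and not part of the original pattern.

```python
import re

names = """
Mr. Simpson
Mrs Simpson
Mr. Brown
Ms Smith
Mr. T
"""

# Group 1 captures the title, group 2 (added here) captures the surname
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s(\w+)')

# group(0) would be the whole match; group(1) and group(2) are the captured parts
pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(names)]
for title, surname in pairs:
    print(title, surname)
```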


#### Compilation Flags
- ASCII, A : Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
- DOTALL, S : Make . match any character, including newlines.
- IGNORECASE, I : Do case-insensitive matches.
- LOCALE, L : Do a locale-aware match.
- MULTILINE, M : Multi-line matching, affecting ^ and $.
- VERBOSE, X (for ‘extended’) : Enable verbose REs, which can be organized more cleanly and understandably.

In [None]:
my_string = "Hello World"
pattern = re.compile(r'world', re.IGNORECASE) # No match without I flag
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(6, 11), match='World'>
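
The other flags work the same way as IGNORECASE. Below is a short sketch of MULTILINE, DOTALL and VERBOSE, using a two-line sample string made up for illustration.

```python
import re

text = 'first line\nSecond line'

# Without MULTILINE, ^ matches only at the start of the whole string
first_words_default = re.findall(r'^\w+', text)
# With MULTILINE, ^ also matches right after each newline
first_words_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# By default . does not match a newline; DOTALL lets it
dot_default = re.search(r'line.Second', text)
dot_dotall = re.search(r'line.Second', text, re.DOTALL)

# VERBOSE ignores whitespace in the pattern and allows comments
title_pattern = re.compile(r'''
    (Mr|Ms|Mrs)   # title
    \.?           # optional period
    \s            # one whitespace character
    \w+           # surname
''', re.VERBOSE)
verbose_match = title_pattern.search('Mrs Simpson').group()

print(first_words_default, first_words_multiline)
print(dot_default, dot_dotall is not None, verbose_match)
```

Flags can be combined with |, for example re.IGNORECASE | re.MULTILINE.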


## it_nltmadj_01_enus_10

### Sentiment extraction using SentiWordNet

In [None]:
nltk.download("sentiwordnet")

[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


True

In [None]:
from nltk.corpus import sentiwordnet as swn

In [None]:
sent = swn.senti_synset('good.n.03')

In [None]:
sent.pos_score()

0.625

In [None]:
sent.neg_score()

0.0

In [None]:
# Objectivity score = 1.0 - (pos_score + neg_score)
sent.obj_score()

0.375


## it_nltmadj_01_enus_11

### Sentiment Classification Using SentiWordNet


In [None]:
import nltk
nltk.download("sentiwordnet")
nltk.download("stopwords")
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


def penn_to_wn(tag):
    """
    English Penn Treebank part-of-speech Tagset
    
    A tagset is a list of part-of-speech tags, i.e. labels used to indicate the part of speech and often also other grammatical 
    categories (case, tense etc.) of each token in a text corpus.
    Convert between the PennTreebank tags to simple Wordnet tags
    """
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None


def get_sentiment(word,tag):
    """ Returns [pos_score, neg_score, obj_score] for the first sense of the word, or an empty list if the word
    is a verb, has an unsupported tag, or is not found in WordNet. """

    wn_tag = penn_to_wn(tag)
    if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
        return []

    lemma = lemmatizer.lemmatize(word, pos=wn_tag)
    if not lemma:
        return []

    synsets = wn.synsets(lemma, pos=wn_tag)  # look up the lemmatized form computed above
    if not synsets:
        return []

    # Take the first sense, the most common
    synset = synsets[0]
    swn_synset = swn.senti_synset(synset.name())

    return [swn_synset.pos_score(),swn_synset.neg_score(),swn_synset.obj_score()]




[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/LearnDataSci/articles/master/Python%20Pandas%20Tutorial%20A%20Complete%20Introduction%20for%20Beginners/IMDB-Movie-Data.csv")

In [None]:
data.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


In [None]:
# Running sentiment for single Movie Plot 
ps = PorterStemmer()
words_data = data['Description'][0].split()
print(data['Description'][0])
# words_data = [ps.stem(x) for x in words_data] # if you want to further stem the word

pos_val = nltk.pos_tag(words_data)
senti_val = [get_sentiment(x,y) for (x,y) in pos_val]

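# Note: index 2 of each list is the objectivity score, so this is the mean objectivity per word
# (words without a SentiWordNet entry contribute 0), not a positive/negative polarity.
sentiment =  sum([eachlist[2] for eachlist in senti_val if eachlist])/len(words_data)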

A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.


In [None]:
print(f"pos_val is {pos_val}")
print(f"senti_val for each word is {senti_val}")
print(f"Total sentiment is {sentiment}")

pos_val is [('A', 'DT'), ('group', 'NN'), ('of', 'IN'), ('intergalactic', 'JJ'), ('criminals', 'NNS'), ('are', 'VBP'), ('forced', 'VBN'), ('to', 'TO'), ('work', 'VB'), ('together', 'RB'), ('to', 'TO'), ('stop', 'VB'), ('a', 'DT'), ('fanatical', 'JJ'), ('warrior', 'NN'), ('from', 'IN'), ('taking', 'VBG'), ('control', 'NN'), ('of', 'IN'), ('the', 'DT'), ('universe.', 'NN')]
senti_val for each word is [[], [0.0, 0.0, 1.0], [], [0.0, 0.0, 1.0], [0.0, 0.25, 0.75], [], [], [], [], [0.0, 0.0, 1.0], [], [], [], [0.375, 0.5, 0.125], [0.0, 0.0, 1.0], [], [], [0.0, 0.0, 1.0], [], [], []]
Total sentiment is 0.27976190476190477
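
The value printed as "Total sentiment" is computed from index 2 of each score list, which is the objectivity score. To turn the same scores into a polarity label, one simple option is to sum positive minus negative scores over the words that have an entry. The sketch below copies senti_val from the output above. The summing rule and the threshold at 0 are assumptions chosen for illustration, not a method prescribed by SentiWordNet.

```python
# Per-word [pos, neg, obj] scores copied from the output above ([] = no entry)
senti_val = [[], [0.0, 0.0, 1.0], [], [0.0, 0.0, 1.0], [0.0, 0.25, 0.75], [], [], [], [],
             [0.0, 0.0, 1.0], [], [], [], [0.375, 0.5, 0.125], [0.0, 0.0, 1.0], [], [],
             [0.0, 0.0, 1.0], [], [], []]

scored = [scores for scores in senti_val if scores]

# Mean objectivity over all words: the value computed in the cell above
mean_objectivity = sum(obj for _, _, obj in scored) / len(senti_val)

# Net polarity: total positive minus total negative (an illustrative rule)
net_polarity = sum(pos - neg for pos, neg, _ in scored)

if net_polarity > 0:
    label = 'positive'
elif net_polarity < 0:
    label = 'negative'
else:
    label = 'neutral'

print(f"Mean objectivity: {mean_objectivity:.4f}")
print(f"Net polarity: {net_polarity:.3f} ({label})")
```

For this plot summary the negative scores of "criminals" and "fanatical" outweigh the single positive score, so the net polarity comes out slightly negative.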
