# Natural Language Processing using NLTK

In [164]:
# Install NLTK - pip install nltk
import nltk
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /projects/aebc210b-e912-4df
[nltk_data]     7-91ea-37e0f8451ece/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /projects/aebc210b-e912-4df7-
[nltk_data]     91ea-37e0f8451ece/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## NLP Part 0 - Get some Data!

This section's code is mostly given to you as a review for how you can scrape and manipulate data from the web. 

In [165]:
import urllib
import bs4 as bs
import re

In [166]:
# We will read the contents of the Wikipedia article "Global_warming" as an example, please feel free to use your own! You can use the url below:
url = 'https://en.wikipedia.org/wiki/Valorant' # you can change this to use other sites as well.

# We can open the page using "urllib.request.urlopen" then read it using ".read()"
source = urllib.request.urlopen(url).read()

# Beautiful Soup is a Python library for pulling data out of HTML and XML files.
# you may need to install a parser library --> "!pip3 install lxml"
# Parsing the data/creating BeautifulSoup object

soup = bs.BeautifulSoup(source,"html.parser") 

# Fetching the data
text = ""
for paragraph in soup.find_all('p'): #The <p> tag defines a paragraph in the webpages
    text += paragraph.text

# Preprocessing the data

text = re.sub(r'\[[0-9]*\]',' ',text) # [0-9]* --> Matches zero or more repetitions of any digit from 0 to 9
text = text.lower() #everything to lowercase
text = re.sub(r'\W^.?!',' ',text) # \W --> Matches any character which is not a word character except (.?!)
text = re.sub(r'\d',' ',text) # \d --> Matches any decimal digit
text = re.sub(r'\s+',' ',text) # \s --> Matches any characters that are considered whitespace (Ex: [\t\n\r\f\v].)

In [167]:
text[:200]

' valorant (stylized as valorant) is a free-to-play first-person hero shooter developed and published by riot games, for microsoft windows. first teased under the codename project a in october , the ga'

## NLP Part 1 - Tokenization of paragraphs/sentences

In this section we are going to tokenize our sentences and words. If you aren't familiar with tokenization, we recommend looking up "what is tokenization". 

You should also spend time on the [NLTK documentation](https://www.nltk.org/). If you're not sure how to do something, or get an error, it is best to google it first and ask questions as you go!

In [168]:
'''
Your code here: Tokenize the words from the data and set it to a variable called words.
Hint: how to this might be on the very home page of NLTK!
'''
words = nltk.word_tokenize(text)

In [169]:
print(words[:10])

['valorant', '(', 'stylized', 'as', 'valorant', ')', 'is', 'a', 'free-to-play', 'first-person']


In [170]:
'''
Your code here: Tokenize the sentences from the data  and set it to a variable called sentences.
Hint: try googling how to tokenize sentences in NLTK!
'''
from nltk.tokenize import sent_tokenize
sentences = nltk.sent_tokenize(text)

In [171]:
print(sentences[:10])

[' valorant (stylized as valorant) is a free-to-play first-person hero shooter developed and published by riot games, for microsoft windows.', 'first teased under the codename project a in october , the game began a closed beta period with limited access on april , , followed by an official release on june , .', 'the development of the game started in .', 'valorant takes inspiration from the counter-strike series of tactical shooters, borrowing several mechanics such as the buy menu, spray patterns, and inaccuracy while moving.', 'valorant is a team-based first-person hero shooter set in the near future.', 'players play as one of a set of agents, characters based on several countries and cultures around the world.', 'in the main game mode, players are assigned to either the attacking or defending team with each team having five players on it.', 'agents have unique abilities, each requiring charges, as well as a unique ultimate ability that requires charging through kills, deaths, orbs,

## NLP Part 2 - Stopwords and Punctuation
Now we are going to work to remove stopwords and punctuation from our data. Why do you think we are going to do this? Do some research if you don't know yet. 

In [172]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /projects/aebc210b-e912-4
[nltk_data]     df7-91ea-37e0f8451ece/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [173]:
'''
define a function called "remove_stopwords" that takes in a list of the sentences of the text and returns one that doesn't have any stopwords.
'''
def remove_stopwords(sentences):
    ### Some code goes here. Hint: You may have to look up how to remove stopwords in NLTK if you get stuck. ###
    for i, sent in enumerate(sentences):
        words = nltk.word_tokenize(sent)
        new_words = [word for word in words if word not in stopwords.words('english')]
        sentences[i] = ' '.join(new_words)
    return sentences

###Then actually apply your function###
sentences = remove_stopwords(sentences)
print(sentences[:10]) #Check if it worked correctly. Are all stopwords removed?

['valorant ( stylized valorant ) free-to-play first-person hero shooter developed published riot games , microsoft windows .', 'first teased codename project october , game began closed beta period limited access april , , followed official release june , .', 'development game started .', 'valorant takes inspiration counter-strike series tactical shooters , borrowing several mechanics buy menu , spray patterns , inaccuracy moving .', 'valorant team-based first-person hero shooter set near future .', 'players play one set agents , characters based several countries cultures around world .', 'main game mode , players assigned either attacking defending team team five players .', 'agents unique abilities , requiring charges , well unique ultimate ability requires charging kills , deaths , orbs , objectives .', "every player starts round `` classic '' pistol one `` signature ability '' charges .", 'weapons ability charges purchased using in-game economic system awards money based outcome p

In [174]:
'''
define a function called "remove_punctuation" that removes punctuation from the sentences.
'''
def remove_punctuation(sentences):
    ### Some code goes here. Hint: Try looking up how to remove stopwords in NLTK if you get stuck. ###
    for i in range(len(sentences)):
        words = nltk.word_tokenize(sentences[i])
        words = [word for word in words if word not in ",.?()"]
        sentences[i] = ' '.join(words)
        return sentences
sentences = remove_punctuation(sentences)
print(sentences[:10]) #eliminating all punctuation.

['valorant stylized valorant free-to-play first-person hero shooter developed published riot games microsoft windows', 'first teased codename project october , game began closed beta period limited access april , , followed official release june , .', 'development game started .', 'valorant takes inspiration counter-strike series tactical shooters , borrowing several mechanics buy menu , spray patterns , inaccuracy moving .', 'valorant team-based first-person hero shooter set near future .', 'players play one set agents , characters based several countries cultures around world .', 'main game mode , players assigned either attacking defending team team five players .', 'agents unique abilities , requiring charges , well unique ultimate ability requires charging kills , deaths , orbs , objectives .', "every player starts round `` classic '' pistol one `` signature ability '' charges .", 'weapons ability charges purchased using in-game economic system awards money based outcome previous 

## NLP Part 3a - Stemming the words
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. There is an example below!

In [175]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# try each of the words below
stemmer.stem('troubled')
#stemmer.stem('trouble')
#stemmer.stem('troubling')
#stemmer.stem('troubles')

'troubl'

In [176]:
'''
Your code here:
Define a function called "stem_sentences" that takes in a list of sentences and returns a list of stemmed sentences.
'''
def stem_sentences(sentences):
    for i, sent in enumerate(sentences):
        words = nltk.word_tokenize(sent)
        new_words = [stemmer.stem(word) for word in words]
        sentences[i] = ' '.join(new_words)
    return sentences
    ### Some code goes here. Hint: Try looking up how to stem words in NLTK if you get stuck (or simply use the example above and run stemmer in a loop!). ###


In [177]:
stemmed_sentences = stem_sentences(sentences)
print(stemmed_sentences[:10])

['valor styliz valor free-to-play first-person hero shooter develop publish riot game microsoft window', 'first teas codenam project octob , game began close beta period limit access april , , follow offici releas june , .', 'develop game start .', 'valor take inspir counter-strik seri tactic shooter , borrow sever mechan buy menu , spray pattern , inaccuraci move .', 'valor team-bas first-person hero shooter set near futur .', 'player play one set agent , charact base sever countri cultur around world .', 'main game mode , player assign either attack defend team team five player .', 'agent uniqu abil , requir charg , well uniqu ultim abil requir charg kill , death , orb , object .', 'everi player start round `` classic `` pistol one `` signatur abil `` charg .', 'weapon abil charg purchas use in-gam econom system award money base outcom previou round , kill player respons , object complet .']


## NLP Part 3b - Lemmatization
Lemmatization considers the context and converts the word to its meaningful base form. There is a cool tutorial and definition of lemmatization in NLTK [here](https://www.geeksforgeeks.org/python-lemmatization-with-nltk/).

In [178]:
from nltk.stem import WordNetLemmatizer
    
## Step 1: Import the lemmatizer
lemmatizer = WordNetLemmatizer()

'''
Your code here: Define a function called "lem_sentences" that: loops through the sentences, split the sentences up by words and applies "lemmatizer.lemmatize" to each word and then join everything back into a sentence
'''
##Similar to stopwords: For loop through the sentences, split by words and apply "lemmatizer.lemmatize" to each word and join back into a sentence
def lem_sentences(sentences):

    return sentences
sentences = lem_sentences(sentences)
print(sentences[:10]) 

['valor styliz valor free-to-play first-person hero shooter develop publish riot game microsoft window', 'first teas codenam project octob , game began close beta period limit access april , , follow offici releas june , .', 'develop game start .', 'valor take inspir counter-strik seri tactic shooter , borrow sever mechan buy menu , spray pattern , inaccuraci move .', 'valor team-bas first-person hero shooter set near futur .', 'player play one set agent , charact base sever countri cultur around world .', 'main game mode , player assign either attack defend team team five player .', 'agent uniqu abil , requir charg , well uniqu ultim abil requir charg kill , death , orb , object .', 'everi player start round `` classic `` pistol one `` signatur abil `` charg .', 'weapon abil charg purchas use in-gam econom system award money base outcom previou round , kill player respons , object complet .']


In [179]:
print(sentences[:10])

['valor styliz valor free-to-play first-person hero shooter develop publish riot game microsoft window', 'first teas codenam project octob , game began close beta period limit access april , , follow offici releas june , .', 'develop game start .', 'valor take inspir counter-strik seri tactic shooter , borrow sever mechan buy menu , spray pattern , inaccuraci move .', 'valor team-bas first-person hero shooter set near futur .', 'player play one set agent , charact base sever countri cultur around world .', 'main game mode , player assign either attack defend team team five player .', 'agent uniqu abil , requir charg , well uniqu ultim abil requir charg kill , death , orb , object .', 'everi player start round `` classic `` pistol one `` signatur abil `` charg .', 'weapon abil charg purchas use in-gam econom system award money base outcom previou round , kill player respons , object complet .']


## NLP Part 4 - POS Tagging
Parts of speech tagging is marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context.

In [180]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to /project
[nltk_data]     s/aebc210b-e912-4df7-91ea-37e0f8451ece/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [181]:
# POS Tagging example
# CC - coordinating conjunction
# NN - noun, singular (cat, tree)
all_words = nltk.word_tokenize(text)  ###If we want to look at part of speech taking before we stem/lem

tagged_words = nltk.pos_tag(all_words)
##Creates a list of lists where each element of the list is [word,partofspeech abbreviation]

# Tagged word paragraph
word_tags = []
for tw in tagged_words:
    word_tags.append(tw[0]+"_"+tw[1])

tagged_paragraph = ' '.join(word_tags)

'''
Your code here: print the first 1000 characters of tagged_paragraph.
'''
print(tagged_paragraph[:1000])

valorant_NN (_( stylized_VBN as_IN valorant_NN )_) is_VBZ a_DT free-to-play_JJ first-person_JJ hero_NN shooter_NN developed_VBD and_CC published_VBN by_IN riot_NN games_NNS ,_, for_IN microsoft_JJ windows_NNS ._. first_RB teased_VBN under_IN the_DT codename_NN project_NN a_DT in_IN october_NN ,_, the_DT game_NN began_VBD a_DT closed_JJ beta_NN period_NN with_IN limited_JJ access_NN on_IN april_NN ,_, ,_, followed_VBN by_IN an_DT official_JJ release_NN on_IN june_NN ,_, ._. the_DT development_NN of_IN the_DT game_NN started_VBD in_IN ._. valorant_JJ takes_VBZ inspiration_NN from_IN the_DT counter-strike_JJ series_NN of_IN tactical_JJ shooters_NNS ,_, borrowing_VBG several_JJ mechanics_NNS such_JJ as_IN the_DT buy_NN menu_NN ,_, spray_NN patterns_NNS ,_, and_CC inaccuracy_NN while_IN moving_VBG ._. valorant_NN is_VBZ a_DT team-based_JJ first-person_JJ hero_NN shooter_NN set_VBN in_IN the_DT near_JJ future_NN ._. players_NNS play_VBP as_IN one_CD of_IN a_DT set_NN of_IN agents_NNS ,_, cha

# Word2Vec Model Visualization



In [182]:
# Install gensim - pip install gensim
import nltk
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
nltk.download('punkt')
from wordcloud import WordCloud

[nltk_data] Downloading package punkt to /projects/aebc210b-e912-4df7-
[nltk_data]     91ea-37e0f8451ece/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [183]:
#Let's go ahead and create a list that's formatted how word2vec needs:
    # a list of lists where the ith entry in the list is the word tokenizaiton of the ith sentence (after preprocessing)
tokenized = [nltk.word_tokenize(sentence) for sentence in sentences]

In [184]:
# print the tokenized list of lists
print(word2vec_list)

[['valorant', '(', 'stylized', 'as', 'valorant', ')', 'is', 'a', 'free-to-play', 'first-person', 'hero', 'shooter', 'developed', 'and', 'published', 'by', 'riot', 'games', ',', 'for', 'microsoft', 'windows', '.', 'first', 'teased', 'under', 'the', 'codename', 'project', 'a', 'in', 'october', ',', 'the', 'game', 'began', 'a', 'closed', 'beta', 'period', 'with', 'limited', 'access', 'on', 'april', ',', ',', 'followed', 'by', 'an', 'official', 'release', 'on', 'june', ',', '.', 'the', 'development', 'of', 'the', 'game', 'started', 'in', '.', 'valorant', 'takes', 'inspiration', 'from', 'the', 'counter-strike', 'series', 'of', 'tactical', 'shooters', ',', 'borrowing', 'several', 'mechanics', 'such', 'as', 'the', 'buy', 'menu', ',', 'spray', 'patterns', ',', 'and', 'inaccuracy', 'while', 'moving', '.', 'valorant', 'is', 'a', 'team-based', 'first-person', 'hero', 'shooter', 'set', 'in', 'the', 'near', 'future', '.', 'players', 'play', 'as', 'one', 'of', 'a', 'set', 'of', 'agents', ',', 'chara

## Training the Word2Vec model

For this part you may want to follow a guide [here](https://radimrehurek.com/gensim/models/word2vec.html). 



In [186]:
''' Training the Word2Vec model. You should pass:
1. a list of lists where the ith entry in the list is the word tokenizaiton of the ith sentence
2. min_count=1 --> Ignores all words with total frequency lower than 1 (i.e., include everything).
'''
# create the model
model = gensim.models.Word2Vec(word2vec_list)
# get the most common words of the model (it's entire vocabulary)

# save the model to use it later

# model = Word2Vec.load("word2vec.model")

[nltk_data] Downloading package punkt to /projects/aebc210b-e912-4df7-
[nltk_data]     91ea-37e0f8451ece/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to /project
[nltk_data]     s/aebc210b-e912-4df7-91ea-37e0f8451ece/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


AttributeError: 'list' object has no attribute 'most_common'

In [0]:
#print the first 10 most common words.

In [0]:
# Look up the most similar words to certain words in your text using the model.wv.most_similar() function

## Testing our model

In [0]:
    # Finding Word Vectors - print word vectors for certain words in your text


In [0]:
    ### Finding the most similar words in the model ###


In [0]:
similar1, similar2

In [0]:
# code to print a wordcloud for your sentences
wordcloud = WordCloud(
                        background_color='white',
                        max_words=100,
                        max_font_size=50, 
                        random_state=42
                        ).generate(str(sentences))
fig = plt.figure(1)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

### Why did we do all this work?

In [0]:
# reFetching the data
lame_text = ""
for paragraph in soup.find_all('p'): #The <p> tag defines a paragraph in the webpages
    lame_text += paragraph.text

In [0]:
'''
Doing the same without removing stop words or lemming
'''
# tokenize the text using sent_tokenize

# from this list of sentences, create a list of lists where the ith entry in the list is the word tokenizaiton of the ith sentence (after preprocessing)

In [0]:
# Redo the word cloud but set stopwords to empty so it looks really bad
wordcloud = WordCloud(
                        background_color='white',
                        max_words=100,
                        max_font_size=50, 
                        random_state=42, ###SET STOPWORDS = [] and/or include_numbers = True or you will get the same thing!!!
                        stopwords = [],
                        include_numbers = True).generate(str(lame_sentences)) 
fig = plt.figure(1)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [0]:
# Training the Word2Vec model (same code as before), but one change: use our lame data that was not preprocessed

# Try printing this after training the model.
words = model.wv.index_to_key
print(words[:10])

In [0]:
# Finding a vector of a word, but badly

In [0]:
### Finding the most similar words in the model but... you get the idea ###



## Reflection

How important do you think proper preprocessing in NLP is?

