# Introduction to NLP

### In this notebook I explain and implement some common data preprocessing techniques for NLP using the nltk library

#### **Tokenization**
#### **Stopwords**
#### **Stemming**
#### **Lemmatization**
#### **Bag of Words**
#### **TFIDF**

In [1]:
import nltk
# nltk.download("all")
# Uncomment and run the line above if nltk is installed but its additional data packages have not been downloaded yet

paragraph = """ The ultimate productivity hack is saying no.

Not doing something will always be faster than doing it. This statement reminds me of the old computer programming saying, “Remember that there is no code faster than no code.”

The same philosophy applies in other areas of life. For example, there is no meeting that goes faster than not having a meeting at all.

This is not to say you should never attend another meeting, but the truth is that we say yes to many things we don't actually want to do. There are many meetings held that don't need to be held. There is a lot of code written that could be deleted.

How often do people ask you to do something and you just reply, “Sure thing.” Three days later, you're overwhelmed by how much is on your to-do list. We become frustrated by our obligations even though we were the ones who said yes to them in the first place.

It's worth asking if things are necessary. Many of them are not, and a simple “no” will be more productive than whatever work the most efficient person can muster.

But if the benefits of saying no are so obvious, then why do we say yes so often?
Why We Say Yes
We agree to many requests not because we want to do them, but because we don't want to be seen as rude, arrogant, or unhelpful. Often, you have to consider saying no to someone you will interact with again in the future—your co-worker, your spouse, your family and friends.

Saying no to these people can be particularly difficult because we like them and want to support them. (Not to mention, we often need their help too.) Collaborating with others is an important element of life. The thought of straining the relationship outweighs the commitment of our time and energy.

For this reason, it can be helpful to be gracious in your response. Do whatever favors you can, and be warm-hearted and direct when you have to say no.

But even after we have accounted for these social considerations, many of us still seem to do a poor job of managing the tradeoff between yes and no. We find ourselves over-committed to things that don't meaningfully improve or support those around us, and certainly don't improve our own lives.

Perhaps one issue is how we think about the meaning of yes and no.

The Difference Between Yes and No
The words “yes” and “no” get used in comparison to each other so often that it feels like they carry equal weight in conversation. In reality, they are not just opposite in meaning, but of entirely different magnitudes in commitment.

When you say no, you are only saying no to one option. When you say yes, you are saying no to every other option.

I like how the economist Tim Harford put it, “Every time we say yes to a request, we are also saying no to anything else we might accomplish with the time.” Once you have committed to something, you have already decided how that future block of time will be spent.

In other words, saying no saves you time in the future. Saying yes costs you time in the future. No is a form of time credit. You retain the ability to spend your future time however you want. Yes is a form of time debt. You have to pay back your commitment at some point.

No is a decision. Yes is a responsibility.

The Role of No
Saying no is sometimes seen as a luxury that only those in power can afford. And it is true: turning down opportunities is easier when you can fall back on the safety net provided by power, money, and authority. But it is also true that saying no is not merely a privilege reserved for the successful among us. It is also a strategy that can help you become successful.

Saying no is an important skill to develop at any stage of your career because it retains the most important asset in life: your time. As the investor Pedro Sorrentino put it, “If you don’t guard your time, people will steal it from you.”

You need to say no to whatever isn't leading you toward your goals. You need to say no to distractions. As one reader told me, “If you broaden the definition as to how you apply no, it actually is the only productivity hack (as you ultimately say no to any distraction in order to be productive).”

Nobody embodied this idea better than Steve Jobs, who said, “People think focus means saying yes to the thing you’ve got to focus on. But that’s not what it means at all. It means saying no to the hundred other good ideas that there are. You have to pick carefully.”

There is an important balance to strike here. Saying no doesn't mean you'll never do anything interesting or innovative or spontaneous. It just means that you say yes in a focused way. Once you have knocked out the distractions, it can make sense to say yes to any opportunity that could potentially move you in the right direction. You may have to try many things to discover what works and what you enjoy. This period of exploration can be particularly important at the beginning of a project, job, or career.

Upgrading Your No
Over time, as you continue to improve and succeed, your strategy needs to change.

The opportunity cost of your time increases as you become more successful. At first, you just eliminate the obvious distractions and explore the rest. As your skills improve and you learn to separate what works from what doesn't, you have to continually increase your threshold for saying yes.

You still need to say no to distractions, but you also need to learn to say no to opportunities that were previously good uses of time, so you can make space for great uses of time. It's a good problem to have, but it can be a tough skill to master.

In other words, you have to upgrade your “no's” over time.

Upgrading your no doesn't mean you'll never say yes. It just means you default to saying no and only say yes when it really makes sense. To quote the investor Brent Beshore, “Saying no is so powerful because it preserves the opportunity to say yes.”

The general trend seems to be something like this: If you can learn to say no to bad distractions, then eventually you'll earn the right to say no to good opportunities.

How to Say No
Most of us are probably too quick to say yes and too slow to say no. It's worth asking yourself where you fall on that spectrum.

If you have trouble saying no, you may find the following strategy proposed by Tim Harford, the British economist I mentioned earlier, to be helpful. He writes, “One trick is to ask, “If I had to do this today, would I agree to it?” It’s not a bad rule of thumb, since any future commitment, no matter how far away it might be, will eventually become an imminent problem.”

If an opportunity is exciting enough to drop whatever you're doing right now, then it's a yes. If it's not, then perhaps you should think twice.

This is similar to the well-known “Hell Yeah or No” method from Derek Sivers. If someone asks you to do something and your first reaction is “Hell Yeah!”, then do it. If it doesn't excite you, then say no.

It's impossible to remember to ask yourself these questions each time you face a decision, but it's still a useful exercise to revisit from time to time. Saying no can be difficult, but it is often easier than the alternative. As writer Mike Dariano has pointed out, “It’s easier to avoid commitments than get out of commitments. Saying no keeps you toward the easier end of this spectrum.”

What is true about health is also true about productivity: an ounce of prevention is worth a pound of cure.

The Power of No
More effort is wasted doing things that don't matter than is wasted doing things inefficiently. And if that is the case, elimination is a more useful skill than optimization.

I am reminded of the famous Peter Drucker quote, “There is nothing so useless as doing efficiently that which should not be done at all.” """



# Tokenization
Tokenization is the process of breaking a larger text into smaller units:

* paragraphs to sentences

* sentences to words

In [2]:
# Tokenize sentences
Sentences = nltk.sent_tokenize(paragraph)

# Tokenizing words
Words = nltk.word_tokenize(paragraph)
print("A few sentences after applying tokenization\n")
for i in range(3,6):
    print(Sentences[i])
print("\nFirst 10 Words: ")
print(Words[:10])

A few sentences after applying tokenization

For example, there is no meeting that goes faster than not having a meeting at all.
This is not to say you should never attend another meeting, but the truth is that we say yes to many things we don't actually want to do.
There are many meetings held that don't need to be held.

First 10 Words: 
['The', 'ultimate', 'productivity', 'hack', 'is', 'saying', 'no', '.', 'Not', 'doing']


# Stemming
Stemming is the process of reducing inflected words to their stem, usually by stripping suffixes.

* History Historical -> Histor

* Wait Waiting Waited -> Wait


# Stopwords
Articles, prepositions, pronouns and other very frequent words such as "them" or "there" appear in almost every text and add little value in many use cases. To keep the algorithm from depending on such words, we remove them using a stopword list.
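
As a minimal sketch of the idea, the snippet below filters a sentence against a small hand-written stopword set. The `toy_stop_words` set and the `remove_stop_words` helper are made up for this illustration and are not part of nltk, whose English list is much longer.

```python
# Toy stopword list, invented for this illustration only (nltk's list is much longer).
toy_stop_words = {"the", "is", "a", "an", "of", "and", "there", "them", "to"}

def remove_stop_words(sentence):
    """Return the words of `sentence` that are not in the toy stopword list."""
    # Lowercase before comparing so that "The" and "the" are treated alike.
    return [word for word in sentence.split() if word.lower() not in toy_stop_words]

filtered = remove_stop_words("The ultimate productivity hack is saying no")
print(filtered)
```

Lowercasing before the lookup is a design choice of this sketch: a case-sensitive comparison would let capitalized stopwords at the start of a sentence slip through.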

In [3]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords  # stopword lists for several languages -> stopwords.words("language")

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
sentences = nltk.sent_tokenize(paragraph)
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Remove stopwords, then stem the remaining words.
    # The comparison is case-sensitive, so capitalized stopwords such as "The" are kept.
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # Join the words again after preprocessing

for i in range(0,5):
    print(sentences[i])

the ultim product hack say .
not someth alway faster .
thi statement remind old comput program say , “ rememb code faster code. ” the philosophi appli area life .
for exampl , meet goe faster meet .
thi say never attend anoth meet , truth say ye mani thing n't actual want .


# Shortcomings of Stemming
Stemming can produce strings that are not real words, as seen in the output above (for example "ultim" and "thi").

Sometimes it also creates false positives, mapping words with different meanings to the same stem:

* Universe Universal University -> Univers  

These words do not share a meaning, yet stemming gives the same output for all of them.
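
The behaviour described above can be checked with a short sketch using nltk's `PorterStemmer`, the same stemmer used earlier in this notebook. The word lists are just the examples from the text, and the variable names are assumptions for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of one word collapse to a shared stem.
wait_stems = [stemmer.stem(w) for w in ["wait", "waiting", "waited"]]

# Words with different meanings can collapse to the same stem too.
univers_stems = [stemmer.stem(w) for w in ["universe", "universal", "university"]]

print(wait_stems)
print(univers_stems)
```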

# Lemmatization
Similar to stemming, but each word is mapped to its dictionary form (lemma), so the result is a meaningful word.
* History Historical -> History

In [4]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
sentences = nltk.sent_tokenize(paragraph)
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Remove stopwords (case-sensitive check), then lemmatize the remaining words.
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # Join the words again after preprocessing

for i in range(0,5):
    print(sentences[i])


The ultimate productivity hack saying .
Not something always faster .
This statement reminds old computer programming saying , “ Remember code faster code. ” The philosophy applies area life .
For example , meeting go faster meeting .
This say never attend another meeting , truth say yes many thing n't actually want .


### Lemmatization provides a much better representation of the actual word. In practical applications it is usually preferred over stemming when the output needs to consist of real words

# Bag of Words
### Bag of words converts a sentence into a vector of word counts:
* S1 : Avengers is a good movie and Thor is an avenger.
* S2 : IronMan is an Avenger 
* S3 : CaptainAmerica is the leader of the Avengers

The sentences are first preprocessed with lemmatization and stopword removal.

We then build a matrix with one row per sentence and one column per word; each cell holds the number of times that word occurs in that sentence.

| BoW | avenger | good | movie | thor | ironman | captainamerica | leader |
| --- | --- | --- | --- | --- | --- | --- | --- |
| S1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 |
| S2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| S3 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
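
The table can be reproduced with a few lines of plain Python. This is only a sketch: it assumes the three sentences have already been preprocessed into the lowercase forms shown in `docs`, and the names `docs`, `vocabulary` and `bow` are chosen for illustration.

```python
from collections import Counter

# The three example sentences, assumed to be already lemmatized and stripped of stopwords.
docs = {
    "S1": "avenger good movie thor avenger",
    "S2": "ironman avenger",
    "S3": "captainamerica leader avenger",
}

# Column order taken from the table above.
vocabulary = ["avenger", "good", "movie", "thor", "ironman", "captainamerica", "leader"]

# One row per sentence: how many times each vocabulary word occurs in it.
bow = {}
for name, text in docs.items():
    counts = Counter(text.split())
    bow[name] = [counts[word] for word in vocabulary]

for name, row in bow.items():
    print(name, row)
```

A `Counter` returns 0 for words it has not seen, which is what fills the empty cells of each row.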


In [5]:
import re 
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
sentences = nltk.sent_tokenize(paragraph)
for i in range(len(sentences)):
    sentence = re.sub('[^a-zA-Z]', ' ', sentences[i])  # replace every non-letter character with a space
    sentence = sentence.lower()
    # Collapsing repeated spaces is not needed here: .split() without arguments already handles them.
    words = sentence.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)
for i in range(0,5):
    print(sentences[i])

ultimate productivity hack saying
something always faster
statement reminds old computer programming saying remember code faster code philosophy applies area life
example meeting go faster meeting
say never attend another meeting truth say yes many thing actually want


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# Rows are sentences, columns are the distinct words (the vocabulary)
bag_of_words = cv.fit_transform(sentences).toarray()
print(f"No of sentences = {bag_of_words.shape[0]}\nNo of words = {bag_of_words.shape[1]}")

No of sentences = 77
No of words = 351


# TFIDF 
### Term Frequency Inverse Document Frequency

TFIDF is a technique for quantifying how important a word is to a document within a corpus. Each word gets a weight that grows with how often it occurs in the document and shrinks as it appears in more documents. It is widely used in Information Retrieval and Text Mining. In the example below, each sentence is treated as one document.


S1 : Avengers is a good movie and Thor is an avenger.

S2 : IronMan is an Avenger 

S3 : CaptainAmerica is the leader of the Avengers


After pre-processing:

S1 : avenger good movie thor avenger

S2 : ironman avenger 

S3 : captainamerica leader avenger


    Term Frequency (TF) = No. of occurrences of a word in a sentence / No. of words in the sentence

    Inverse Document Frequency (IDF) = log(No. of sentences / No. of sentences containing the word)

Here log is the natural logarithm.

e.g. TF of avenger in sentence 1 is 2 (number of times avenger occurs) / 5 (number of words in the first sentence) = 2/5

e.g. IDF of thor is log(3 (number of sentences) / 1 (number of sentences thor appears in)) = log(3)

e.g. TFIDF of thor in the 1st sentence is: TF (thor | sentence 1) * IDF (thor) = 1/5 * log(3) = 0.22 (rounded)

<center> TF Table </center>

| TF | avenger | good | movie | thor | ironman | captainamerica | leader|
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| S1	| 2/5	| 1/5	| 1/5	| 1/5	| 0	| 0	| 0 |
| S2	| 1/2	| 0	| 0	| 0	| 1/2	| 0	| 0 |
| S3	| 1/3	| 0	| 0	| 0	| 0	| 1/3	| 1/3 |

<center> IDF Table </center>

| Words | avenger | good | movie | thor | ironman | captainamerica | leader|
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| IDF | log(3/3) | log(3/1) | log(3/1) | log(3/1) | log(3/1) | log(3/1) | log(3/1) |

<center> TFIDF Table </center>

| TFIDF | avenger | good | movie | thor | ironman | captainamerica | leader|
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| S1	| 0	| 0.22	| 0.22	| 0.22	| 0	| 0	| 0 |
| S2	| 0	| 0	| 0	| 0	| 0.55 | 0	| 0 |
| S3	| 0	| 0	| 0	| 0	| 0	| 0.37	| 0.37 |
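
The three tables above can be recomputed with the standard library alone. The sketch below follows the TF and IDF formulas exactly as stated in this section (natural logarithm, no smoothing); the helper names `tf` and `idf` and the `tfidf_table` dictionary are assumptions made for illustration.

```python
import math

# Preprocessed example sentences, split into words.
docs = {
    "S1": "avenger good movie thor avenger".split(),
    "S2": "ironman avenger".split(),
    "S3": "captainamerica leader avenger".split(),
}
vocabulary = ["avenger", "good", "movie", "thor", "ironman", "captainamerica", "leader"]

def tf(word, words):
    """Occurrences of `word` in the sentence divided by the sentence length."""
    return words.count(word) / len(words)

def idf(word):
    """Natural log of (number of sentences / number of sentences containing `word`)."""
    containing = sum(1 for words in docs.values() if word in words)
    return math.log(len(docs) / containing)

# One row per sentence, rounded to two decimals like the table above.
tfidf_table = {
    name: [round(tf(word, words) * idf(word), 2) for word in vocabulary]
    for name, words in docs.items()
}

for name, row in tfidf_table.items():
    print(name, row)
```

Because avenger occurs in every sentence, its IDF is log(3/3) = 0, so its TFIDF weight is 0 everywhere regardless of how often it appears.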


In [7]:
# Using the same sentences array from bag of words 
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_df = pd.DataFrame(tfidf.fit_transform(sentences).toarray())
print("No of sentences = " +
      str(tfidf_df.shape[0])+"\nNo of words = "+str(tfidf_df.shape[1]))


No of sentences = 77
No of words = 351
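
The weights produced by `TfidfVectorizer` will generally not match a table computed by hand with the formulas above. With its default settings, scikit-learn multiplies the raw term count by a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below re-implements those defaults in plain Python on the three Avengers sentences; it is an approximation for illustration rather than the library's code, and the helper names are assumptions.

```python
import math

# Preprocessed example sentences, split into words.
docs = {
    "S1": "avenger good movie thor avenger".split(),
    "S2": "ironman avenger".split(),
    "S3": "captainamerica leader avenger".split(),
}
n_docs = len(docs)

def smoothed_idf(word):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, never zero."""
    df = sum(1 for words in docs.values() if word in words)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(words):
    """Raw count times smoothed IDF, then L2-normalized over the sentence."""
    raw = {word: words.count(word) * smoothed_idf(word) for word in set(words)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = {name: tfidf_row(words) for name, words in docs.items()}

for name, row in weights.items():
    print(name, {word: round(value, 2) for word, value in sorted(row.items())})
```

With smoothing, a word that occurs in every sentence (avenger here) keeps a small non-zero weight instead of dropping to 0, while rarer words still score higher.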
