
# **Analysing Text Data**

## Preprocessing data using **tokenization**

Tokenization is the process of dividing text into meaningful pieces, such as sentences, words, or punctuation marks. These pieces are called
tokens.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

# Punkt models back sent_tokenize and word_tokenize.
# Newer NLTK releases may require nltk.download('punkt_tab') instead.
nltk.download('punkt')

text = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action."

# Split the text into sentences
sent_tokenize_list = sent_tokenize(text)
print("Sentence tokenizer:\n")
print(sent_tokenize_list)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Are you curious about tokenization?', "Let's see how it works!", 'We need to analyze a couple of sentences with punctuations to see it in action.']


In [None]:
from nltk.tokenize import word_tokenize
print("Word Tokenizer: \n")
print(word_tokenize(text))

['Are', 'you', 'curious', 'about', 'tokenization', '?', 'Let', "'s", 'see', 'how', 'it', 'works', '!', 'We', 'need', 'to', 'analyze', 'a', 'couple', 'of', 'sentences', 'with', 'punctuations', 'to', 'see', 'it', 'in', 'action', '.']


In [None]:
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()
print("\nWord punct tokenizer:")
print(word_punct_tokenizer.tokenize(text))


Word punct tokenizer:
['Are', 'you', 'curious', 'about', 'tokenization', '?', 'Let', "'", 's', 'see', 'how', 'it', 'works', '!', 'We', 'need', 'to', 'analyze', 'a', 'couple', 'of', 'sentences', 'with', 'punctuations', 'to', 'see', 'it', 'in', 'action', '.']
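
The two word-level outputs above differ only in how they treat `Let's`: `word_tokenize` keeps the clitic `'s` as one token, while `WordPunctTokenizer` splits on every run of punctuation. As far as the NLTK documentation describes it, `WordPunctTokenizer` is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The sketch below applies that pattern with the standard `re` module only; the function name `regex_word_punct` is made up for illustration and is not part of NLTK.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer:
# either a run of word characters, or a run of characters
# that are neither word characters nor whitespace.
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def regex_word_punct(s):
    """Hypothetical helper: split s into word and punctuation runs."""
    return WORD_PUNCT_PATTERN.findall(s)


tokens = regex_word_punct("Let's see how it works!")
print(tokens)
# ['Let', "'", 's', 'see', 'how', 'it', 'works', '!']
```

Because the apostrophe is not a word character, it becomes a token of its own, which is why `'s` appears as `"'"` and `'s'` in the `WordPunctTokenizer` output.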


## **Stemming** text data

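Text often contains the same word in several forms, such as "dogs" and "dog" or "eating" and "eat". The goal of stemming is to reduce these different forms to a common base form. Stemming uses a heuristic process that
cuts off the ends of words to extract the base form, so the result is not always a valid word.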

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

words = ['airplane', 'eventually', 'dogs', 'eating', 'was', 'wolf', 'beaches', 'grounded', 'enjoying', 'goal', 'vision']
stemmers = ['PORTER', 'LANCASTER', 'SNOWBALL']

# One instance of each stemmer, to compare them side by side
stemmer_porter = PorterStemmer()
stemmer_lancaster = LancasterStemmer()
stemmer_snowball = SnowballStemmer('english')

# Right-align every column to 16 characters: the word plus one column per stemmer
formatted_row = '{:>16}' * (len(stemmers) + 1)
print('\n', formatted_row.format('WORD', *stemmers), '\n')

for word in words:
    stemmed_words = [stemmer_porter.stem(word), stemmer_lancaster.stem(word), stemmer_snowball.stem(word)]
    print(formatted_row.format(word, *stemmed_words))


             WORD          PORTER       LANCASTER        SNOWBALL 

        airplane         airplan           airpl         airplan
      eventually          eventu              ev          eventu
            dogs             dog             dog             dog
          eating             eat             eat             eat
             was              wa             was             was
            wolf            wolf            wolf            wolf
         beaches           beach           beach           beach
        grounded          ground          ground          ground
        enjoying           enjoy           enjoy           enjoy
            goal            goal            goal            goal
          vision          vision             vis          vision
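
To make the "cut off the ends of words" idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy, not the Porter, Lancaster, or Snowball algorithm: the suffix list, the name `toy_stem`, and the `min_stem_len` parameter are all assumptions chosen for illustration, and the real stemmers apply many more rules.

```python
# Hypothetical suffix list, checked in this order (not taken from any real stemmer)
SUFFIXES = ("ing", "ly", "ed", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, unless the remaining stem would be too short."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word  # no rule applied


for word in ["dogs", "eating", "beaches", "grounded", "was", "wolf"]:
    print(word, "->", toy_stem(word))
```

The minimum stem length is what keeps this toy from turning "was" into "wa", which is the kind of over-stemming visible in the PORTER column above.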


## Converting text to its base form using **Lemmatization**

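The goal of lemmatization is also to reduce words to their base forms, but it takes a more structured
approach. Instead of chopping off suffixes, the lemmatizer looks words up in a vocabulary (here WordNet) and takes the part of speech into account, so with the correct part of speech the result is usually a valid word. For example, the verb lemmatizer maps "was" to "be", whereas the Porter stemmer reduces it to "wa".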

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

# WordNet provides the vocabulary used by the lemmatizer
nltk.download('wordnet')

words = ['airplane', 'eventually', 'dogs', 'eating', 'was', 'wolf', 'beaches', 'grounded', 'enjoying', 'goal', 'vision']
lemmatizers = ['NOUN LEMMATIZER', 'VERB LEMMATIZER']
lemmatizer_wordnet = WordNetLemmatizer()

# Right-align every column to 24 characters: the word plus one column per part of speech
formatted_row = '{:>24}' * (len(lemmatizers) + 1)
print('\n', formatted_row.format('WORD', *lemmatizers), '\n')

for word in words:
    # Lemmatize each word once as a noun (pos='n') and once as a verb (pos='v')
    lemmatized_words = [lemmatizer_wordnet.lemmatize(word, pos='n'), lemmatizer_wordnet.lemmatize(word, pos='v')]
    print(formatted_row.format(word, *lemmatized_words))


                     WORD         NOUN LEMMATIZER         VERB LEMMATIZER 

                airplane                airplane                airplane
              eventually              eventually              eventually
                    dogs                     dog                     dog
                  eating                  eating                     eat
                     was                      wa                      be
                    wolf                    wolf                    wolf
                 beaches                   beach                   beach
                grounded                grounded                  ground
                enjoying                enjoying                   enjoy
                    goal                    goal                    goal
                  vision                  vision                  vision


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
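
The table above shows why the part of speech matters: "eating" stays "eating" as a noun but becomes "eat" as a verb. A rough way to picture this is a lookup keyed on both the word and its part of speech, falling back to the unchanged word when nothing is found. The sketch below is a toy under that assumption; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, the table is hand-written, and it does not reproduce how WordNet is actually queried or every row of the output above.

```python
# Hypothetical hand-written lookup table; WordNet's real vocabulary is far larger
LEMMA_TABLE = {
    ("dogs", "n"): "dog",
    ("beaches", "n"): "beach",
    ("eating", "v"): "eat",
    ("was", "v"): "be",
    ("grounded", "v"): "ground",
    ("enjoying", "v"): "enjoy",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if it is not in the table."""
    return LEMMA_TABLE.get((word, pos), word)


for word in ["dogs", "eating", "was"]:
    print(word, toy_lemmatize(word, pos="n"), toy_lemmatize(word, pos="v"))
```

Unlike a stemmer, a lookup of this kind can map "was" to "be", a form that no suffix-stripping rule would produce.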
