<a href="https://colab.research.google.com/github/Hari-1903/Applied-Machine-Learning/blob/main/Text_Analyser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Analysing Text Data**

## Preprocessing data using **tokenization**

Tokenization is the process of dividing text into meaningful pieces, such as sentences or words. These pieces are called
tokens.

In [None]:
import nltk
# Download the Punkt sentence-tokenizer models (recent NLTK releases may need 'punkt_tab' instead)
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
text = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action."
# Split the text into sentences
sent_tokenize_list = sent_tokenize(text)
print("Sentence tokenizer:\n")
print(sent_tokenize_list)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Sentence tokenizer:

['Are you curious about tokenization?', "Let's see how it works!", 'We need to analyze a couple of sentences with punctuations to see it in action.']


In [None]:
from nltk.tokenize import word_tokenize
print("Word Tokenizer: \n")
print(word_tokenize(text))

Word Tokenizer: 

['Are', 'you', 'curious', 'about', 'tokenization', '?', 'Let', "'s", 'see', 'how', 'it', 'works', '!', 'We', 'need', 'to', 'analyze', 'a', 'couple', 'of', 'sentences', 'with', 'punctuations', 'to', 'see', 'it', 'in', 'action', '.']


In [None]:
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()
print("\nWord punct tokenizer:")
print(word_punct_tokenizer.tokenize(text))


Word punct tokenizer:
['Are', 'you', 'curious', 'about', 'tokenization', '?', 'Let', "'", 's', 'see', 'how', 'it', 'works', '!', 'We', 'need', 'to', 'analyze', 'a', 'couple', 'of', 'sentences', 'with', 'punctuations', 'to', 'see', 'it', 'in', 'action', '.']
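
The difference between the two word tokenizers above comes down to how punctuation is split off. As a minimal sketch, the block below reproduces the `WordPunctTokenizer` behaviour with only the standard library, assuming the regular expression `\w+|[^\w\s]+` that NLTK documents for this tokenizer. The function name `word_punct_tokenize` is made up for this illustration.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (that is, punctuation).
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Split text into alternating word and punctuation tokens (illustrative helper)."""
    return WORD_PUNCT_PATTERN.findall(text)


tokens = word_punct_tokenize("Let's see how it works!")
print(tokens)
# ['Let', "'", 's', 'see', 'how', 'it', 'works', '!']
```

Because the apostrophe is not a word character, `Let's` becomes three tokens here, whereas `word_tokenize` keeps `'s` together as a single token.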


## **Stemming** text data

The goal of stemming is to reduce the different inflected forms of a word to a common base form. It uses a heuristic process that
cuts off the ends of words to extract the base form, so the result is not always a valid word.

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
words = ['airplane','eventually','dogs','eating','was','wolf','beaches','grounded','enjoying','goal','vision']
stemmers = ['PORTER','LANCASTER','SNOWBALL']

# Create one instance of each stemmer to compare
stemmer_porter = PorterStemmer()
stemmer_lancaster = LancasterStemmer()
stemmer_snowball = SnowballStemmer('english')

# One right-aligned column for the word plus one per stemmer
formatted_row = '{:>16}'*(len(stemmers)+1)
print('\n',formatted_row.format('WORD',*stemmers),'\n')

# Stem each word with all three stemmers and print the results side by side
for word in words:
  stemmed_words = [stemmer_porter.stem(word),stemmer_lancaster.stem(word),stemmer_snowball.stem(word)]
  print(formatted_row.format(word,*stemmed_words))


             WORD          PORTER       LANCASTER        SNOWBALL 

        airplane         airplan           airpl         airplan
      eventually          eventu              ev          eventu
            dogs             dog             dog             dog
          eating             eat             eat             eat
             was              wa             was             was
            wolf            wolf            wolf            wolf
         beaches           beach           beach           beach
        grounded          ground          ground          ground
        enjoying           enjoy           enjoy           enjoy
            goal            goal            goal            goal
          vision          vision             vis          vision
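
To make the "cut off the ends of words" heuristic concrete, here is a toy suffix stripper. It is not the Porter, Lancaster, or Snowball algorithm, only a sketch of the general idea. The suffix list, the `min_stem` threshold, and the name `toy_stem` are assumptions chosen for this illustration.

```python
# Suffixes to try, longest first (a hypothetical list for this sketch)
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


for word in ["dogs", "eating", "beaches", "grounded", "was", "horses"]:
    print(word, "->", toy_stem(word))
```

The rules are purely mechanical: `beaches` becomes `beach`, but `horses` becomes `hors`, which shows how a stemmer can produce a base form that is not a real word.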


## Converting text to its base form using **lemmatization**

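The goal of lemmatization is also to reduce words to their base forms, but it takes a more structured
approach than stemming: instead of chopping off suffixes, it uses a vocabulary and morphological analysis to return the dictionary form of a word (its lemma). The result depends on the part of speech, so the example below lemmatizes each word both as a noun and as a verb.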

In [None]:
from nltk.stem import WordNetLemmatizer
import nltk
# The lemmatizer looks words up in the WordNet database
nltk.download('wordnet')
words = ['airplane','eventually','dogs','eating','was','wolf','beaches','grounded','enjoying','goal','vision']
lemmatizers = ['NOUN LEMMATIZER','VERB LEMMATIZER']
lemmatizer_wordnet = WordNetLemmatizer()

formatted_row = '{:>24}'*(len(lemmatizers) + 1)
print('\n',formatted_row.format('WORD',*lemmatizers),'\n')

# Lemmatize each word twice: once as a noun (pos='n') and once as a verb (pos='v')
for word in words:
  lemmatized_words = [lemmatizer_wordnet.lemmatize(word,pos='n'),lemmatizer_wordnet.lemmatize(word,pos='v')]
  print(formatted_row.format(word,*lemmatized_words))

[nltk_data] Downloading package wordnet to /root/nltk_data...



                     WORD         NOUN LEMMATIZER         VERB LEMMATIZER 

                airplane                airplane                airplane
              eventually              eventually              eventually
                    dogs                     dog                     dog
                  eating                  eating                     eat
                     was                      wa                      be
                    wolf                    wolf                    wolf
                 beaches                   beach                   beach
                grounded                grounded                  ground
                enjoying                enjoying                   enjoy
                    goal                    goal                    goal
                  vision                  vision                  vision
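
The table shows that the part of speech changes the result: `eating` is left alone as a noun but becomes `eat` as a verb. The sketch below illustrates why, using a toy lemmatizer that applies suffix rules per part of speech and accepts a candidate only if it appears in a small vocabulary. This is not WordNet's actual procedure. The rules, the exception table, the vocabulary, and the name `toy_lemmatize` are all hypothetical.

```python
# Toy suffix rules per part of speech: (suffix to remove, replacement)
RULES = {
    "n": [("es", ""), ("s", "")],
    "v": [("ing", ""), ("ed", ""), ("s", "")],
}
# Irregular forms that no suffix rule can handle
EXCEPTIONS = {"v": {"was": "be"}}
# Tiny stand-in for a real dictionary of known lemmas
VOCABULARY = {
    "n": {"dog", "beach", "wolf", "goal"},
    "v": {"eat", "ground", "enjoy", "be"},
}


def toy_lemmatize(word, pos):
    """Return a known lemma for word under the given part of speech, else word itself."""
    if word in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][word]
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)] + replacement
            # Unlike a stemmer, accept the candidate only if it is a known word
            if candidate in VOCABULARY[pos]:
                return candidate
    return word


for word in ["dogs", "eating", "was", "grounded"]:
    print(word, toy_lemmatize(word, "n"), toy_lemmatize(word, "v"))
```

The vocabulary check is what separates this from stemming: a candidate that is not a known word is rejected and the input is returned unchanged. With this toy vocabulary `was` stays `was` as a noun, unlike the WordNet output above, which depends on WordNet's much larger dictionary.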


## Building a **bag of words** model

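Machine learning algorithms need numerical data, so text must be converted to numbers before it can be analyzed. The bag of words model does this: it
learns a vocabulary from all the words in all the documents, then represents each document by how many times each vocabulary word occurs in it.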
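
As a minimal sketch of that idea, the block below builds a vocabulary and a count matrix with only the standard library. The two example documents and the helper name `bag_of_words` are made up for illustration. Real vectorizers such as scikit-learn's `CountVectorizer` add tokenization rules and frequency filters on top of this.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts vocabulary[j] in documents[i]."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word across all documents, in sorted order
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical example documents
documents = ["the cat sat on the mat", "the dog sat"]
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)      # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Word order is discarded: each document is reduced to one row of counts, which is the form a classifier can consume.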



In [None]:
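# The Brown corpus supplies the text that is split into chunks below
nltk.download('brown')
nltk.download('maxent_ne_chunker')

import numpy as np
from nltk.corpus import brown


def splitter(data, num_words):
  # Split the text into chunks of num_words words each
  words = data.split(' ')
  output = []
  cur_count = 0
  cur_words = []

  for word in words:
    cur_words.append(word)
    cur_count += 1
    # Close the current chunk once it holds num_words words
    if cur_count == num_words:
      output.append(' '.join(cur_words))
      cur_words = []
      cur_count = 0
  # Add the leftover words as a final chunk (an empty string if none remain)
  output.append(' '.join(cur_words))
  return output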



[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


In [None]:
if __name__=='__main__':
  # Join the first 10,000 words of the Brown corpus into one string
  data = ' '.join(brown.words()[:10000])
  # Number of words in each chunk
  num_words = 2000
  text_chunks = splitter(data, num_words)
  # Tag each chunk with its position in the sequence
  chunks = [{'index': index, 'text': text} for index, text in enumerate(text_chunks)]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Keep words that occur in at least 5 chunks and in at most 95% of them
vectorizer = CountVectorizer(min_df=5, max_df=.95)
# Rows are chunks, columns are vocabulary words
doc_term_matrix = vectorizer.fit_transform([chunk['text'] for chunk in chunks])
vocab = np.array(vectorizer.get_feature_names_out())
print("\nVocabulary:")
print(vocab)
print("\nDocument term matrix:")
chunk_names = ['Chunk-0', 'Chunk-1', 'Chunk-2', 'Chunk-3', 'Chunk-4']
formatted_row = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_row.format('Word', *chunk_names), '\n')
# Transpose so that each row holds one word's counts across the chunks
for word, item in zip(vocab, doc_term_matrix.T):
  # item.data holds only the nonzero counts for this word
  output = [str(x) for x in item.data]
  print(formatted_row.format(word, *output))


Vocabulary:
['about' 'after' 'against' 'aid' 'all' 'also' 'an' 'and' 'are' 'as' 'at'
 'be' 'been' 'before' 'but' 'by' 'committee' 'congress' 'did' 'each'
 'education' 'first' 'for' 'from' 'general' 'had' 'has' 'have' 'he'
 'health' 'his' 'house' 'in' 'increase' 'is' 'it' 'last' 'made' 'make'
 'may' 'more' 'no' 'not' 'of' 'on' 'one' 'only' 'or' 'other' 'out' 'over'
 'pay' 'program' 'proposed' 'said' 'similar' 'state' 'such' 'take' 'than'
 'that' 'the' 'them' 'there' 'they' 'this' 'time' 'to' 'two' 'under' 'up'
 'was' 'were' 'what' 'which' 'who' 'will' 'with' 'would' 'year' 'years']

Document term matrix:

         Word     Chunk-0     Chunk-1     Chunk-2     Chunk-3     Chunk-4 

       about           1           1           1           1           3
       after           2           3           2           1           3
     against           1           2           2           1           1
         aid           1           1           1           3           5
         all       

## **Text Classifier**

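The goal of text classification is to assign text documents to predefined classes. It is a widely used
analysis technique in NLP. The example below trains a multinomial Naive Bayes classifier on tf-idf features from five categories of the 20 Newsgroups dataset, then predicts the category of three new sentences.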

In [None]:
from sklearn.datasets import fetch_20newsgroups

# Map each newsgroup name to a readable label
category_map = {'misc.forsale': 'Sales', 'rec.motorcycles': 'Motorcycles', 'rec.sport.baseball': 'Baseball',
'sci.crypt': 'Cryptography', 'sci.space': 'Space'}

# Load the training split for the selected categories only
training_data = fetch_20newsgroups(subset='train', categories=list(category_map.keys()), shuffle=True, random_state=7)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_termcounts =vectorizer.fit_transform(training_data.data)
print ("\nDimensions of training data:", X_train_termcounts.shape)


Dimensions of training data: (2968, 40605)


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer

# Sentences to classify
input_data = ["The curveballs of right handed pitchers tend to curve to the left",
"Caesar cipher is an ancient form of encryption",
"This two-wheeler is really good on slippery roads"]

# Convert the raw term counts to tf-idf weights
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_termcounts)

# Train a multinomial Naive Bayes classifier on the tf-idf features
classifier = MultinomialNB().fit(X_train_tfidf, training_data.target)
# Transform the input with the vocabulary and idf weights learned from the training data
X_input_termcounts = vectorizer.transform(input_data)
X_input_tfidf = tfidf_transformer.transform(X_input_termcounts)
predicted_categories = classifier.predict(X_input_tfidf)

# Map each predicted class index back to its readable label
for sentence, category in zip(input_data, predicted_categories):
  print('\nInput:', sentence, '\nPredicted category:', category_map[training_data.target_names[category]])


Input: The curveballs of right handed pitchers tend to curve to the left 
Predicted category: Baseball

Input: Caesar cipher is an ancient form of encryption 
Predicted category: Cryptography

Input: This two-wheeler is really good on slippery roads 
Predicted category: Motorcycles
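
To show what a multinomial Naive Bayes classifier does with word counts, here is a small sketch using only the standard library. It works on raw counts with add-one smoothing and skips the tf-idf weighting used above, so it is an illustration of the idea rather than a reimplementation of scikit-learn's `MultinomialNB`. The toy corpus and the names `train_nb` and `predict_nb` are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_nb(model, doc):
    """Return the class with the highest log prior plus summed log likelihoods."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior: fraction of training documents in this class
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in doc.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no information
            # Add-one smoothing keeps unseen class/word pairs from zeroing the score
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus
docs = [
    "the pitcher threw a fast curveball",
    "the batter hit a home run",
    "the cipher encrypts the secret message",
    "the key decrypts the encrypted message",
]
labels = ["Baseball", "Baseball", "Cryptography", "Cryptography"]

model = train_nb(docs, labels)
print(predict_nb(model, "secret cipher message"))    # Cryptography
print(predict_nb(model, "pitcher hit a curveball"))  # Baseball
```

Scores are summed in log space to avoid multiplying many small probabilities together, which is the usual way to keep the computation numerically stable.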
