<a href="https://colab.research.google.com/github/Satyendra0207/Text-and-image_analytics/blob/main/1_text_analysis_basic_a.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **TEXT PREPROCESSING**



1.   Text preprocessing is the first step in a natural language processing (NLP) pipeline.
2.   **Text preprocessing is the process of bringing text into a form that is predictable and analyzable for a specific task. A task is the combination of an approach and a domain.**
3.   For example, extracting top keywords with TF-IDF (approach) from Tweets (domain) is a task. The main objective of text preprocessing is to break the text into a form that machine learning algorithms can digest.
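
To make the "TF-IDF keywords from Tweets" task concrete, here is a minimal sketch using only the standard library. The three tweets are hypothetical, the tokenizer is deliberately crude, and the weighting (term frequency times `log(N / df)`) is one common TF-IDF variant among several; the helper names `tokenize`, `tf_idf`, and `top_keywords` are made up for this illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy "tweets"; illustrative only, not real data.
tweets = [
    "Summer break is coming, summer plans are ready!",
    "Exams are coming and the library is full.",
    "The library is closed during the break.",
]

def tokenize(text):
    # Lowercase and keep alphabetic runs only (a deliberately crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

def top_keywords(doc_scores, k=3):
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

scores = tf_idf(tweets)
for tweet, doc_scores in zip(tweets, scores):
    print(top_keywords(doc_scores), "<-", tweet)
```

A word such as "is" appears in every tweet, so its score is zero, while "summer" appears twice in a single tweet and ranks first there.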





**Noise removal**

Noise removal is about **removing digits**, **characters**, and **pieces of text** that interfere with text analysis.
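
As a rough illustration of what "noise" can mean for social-media text, the sketch below strips URLs, @mentions, hashtag symbols, and digits with regular expressions. Which patterns count as noise depends on the task; the function name `remove_noise` and the sample string are assumptions made for this example.

```python
import re

def remove_noise(text):
    # Drop URLs first so their punctuation and digits do not leak into later steps.
    text = re.sub(r"https?://\S+", " ", text)
    # Drop @mentions, and the '#' of hashtags (the hashtag word itself is kept).
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    # Drop runs of digits.
    text = re.sub(r"\d+", " ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())

sample = "@sam check https://example.com for 3 #NLP tips!"
print(remove_noise(sample))
```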

In [None]:
# import the necessary libraries
import nltk
import string
import re

In [None]:
def text_lowercase(text):
    return text.lower()

def text_upper(text):
    # uppercase counterpart of text_lowercase
    return text.upper()
 
input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
print(text_lowercase(input_str))
print(text_upper(input_str))

hey, did you know that the summer break is coming? amazing right !! it's only 5 more days !!
HEY, DID YOU KNOW THAT THE SUMMER BREAK IS COMING? AMAZING RIGHT !! IT'S ONLY 5 MORE DAYS !!


In [None]:
# Remove numbers
def remove_numbers(text):
    result = re.sub(r'\d+', '', text)  # \d+ matches one or more consecutive digits
    return result
 
input_str = "There are 3 balls in this bag, and 12 in the other one."
remove_numbers(input_str)

'There are  balls in this bag, and  in the other one.'

Converting numbers to words

In [None]:
# import the inflect library
import inflect
p = inflect.engine()
 
# convert number into words
def convert_number(text):
    # split string into list of words
    temp_str = text.split()
    # initialise empty list
    new_string = []
 
    for word in temp_str:
        # if the token consists only of digits, convert it
        # to words and append it to the new_string list
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)
        # append the word as it is
        else:
            new_string.append(word)
 
    # join the words of new_string to form a string
    temp_str = ' '.join(new_string)
    return temp_str
 
input_str = 'There are 3 balls in this bag, and 1216 in the other one.'
convert_number(input_str)

'There are three balls in this bag, and one thousand, two hundred and sixteen in the other one.'

In [None]:
# remove punctuation
# string.punctuation is: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
def remove_punctuation(text):
    translator = str.maketrans('','',string.punctuation)
    return text.translate(translator)
 
input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
remove_punctuation(input_str)

'Hey did you know that the summer break is coming Amazing right  Its only 5 more days '

In [None]:
# remove whitespace from text
def remove_whitespace(text):
    # split() with no arguments drops leading/trailing whitespace and splits on
    # runs of whitespace, so joining with one space normalizes the text
    return " ".join(text.split())
 
input_str = "   we don't need   the given questions"
remove_whitespace(input_str)

"we don't need the given questions"

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

**Tokenization** is the process of splitting text into smaller units called tokens, often discarding certain characters, such as punctuation, along the way. Tokens can be words, numbers, symbols, n-grams, or characters. An n-gram is a sequence of n consecutive words or characters. A word tokenizer does its job by locating word boundaries.

Languages such as English and French are referred to as space-delimited, as most words are separated from each other by spaces. Languages such as Chinese and Thai are said to be unsegmented, as words do not have clear boundaries. Tokenization is also affected by writing systems. The structures of languages can be grouped into three categories:

**Isolating:** Words do not divide into smaller units. Example: Mandarin

**Agglutinative**: Words divide into smaller units. Example: Japanese, Tamil

**Inflectional:** Boundaries between morphemes are not clear and ambiguous in terms of grammatical meaning. Example: Latin
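
For a space-delimited language, word boundaries and n-grams can be sketched with the standard library alone. This is not the algorithm behind NLTK's `word_tokenize`; the regular expression and the names `simple_tokenize` and `ngrams` are assumptions chosen for illustration.

```python
import re

def simple_tokenize(text):
    # A token is a word (with an optional internal apostrophe part)
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = simple_tokenize("Hey nice to see you! Have a great day")
print(tokens)
print(ngrams(tokens, 2)[:3])
```

A sentence with 10 tokens yields 9 bigrams; in general a list of k tokens yields k - n + 1 n-grams when k is at least n.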

In [None]:
from nltk.tokenize import word_tokenize
from collections import Counter
sentence = "Hey nice to see you! Have a great day"
token = word_tokenize(sentence)
count = Counter(token)
print("len(count) =", len(count))  # number of distinct tokens
print("most_common =", count.most_common(10))  # the 10 most frequent tokens with their counts
print(token)

len(count) = 10
most_common = [('Hey', 1), ('nice', 1), ('to', 1), ('see', 1), ('you', 1), ('!', 1), ('Have', 1), ('a', 1), ('great', 1), ('day', 1)]
['Hey', 'nice', 'to', 'see', 'you', '!', 'Have', 'a', 'great', 'day']


**Remove default stopwords:**

Stopwords are very common words (such as "the", "is", and "and") that usually contribute little to the meaning of a sentence. For many tasks they can be removed without losing much information, although some of them, such as "not", can matter for tasks like sentiment analysis. The NLTK library has a list of stopwords, and we can use it to remove stopwords from our text and return a list of word tokens.

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
# remove stopwords function
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))  # NLTK's English stopword list (all lowercase)
    word_tokens = word_tokenize(text)  # split the text into word and punctuation tokens
    print(word_tokens)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text
 
example_text = "This is a sample sentence and we are going to remove the stopwords from this."
remove_stopwords(example_text)

['This', 'is', 'a', 'sample', 'sentence', 'and', 'we', 'are', 'going', 'to', 'remove', 'the', 'stopwords', 'from', 'this', '.']


['This', 'sample', 'sentence', 'going', 'remove', 'stopwords', '.']
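
In the output above, 'This' survives because the stopword list is lowercase and the membership test is case-sensitive. One possible adjustment is to compare the lowercased token instead. The sketch below uses a tiny hand-picked stopword set as a stand-in for `stopwords.words("english")` so that it runs without NLTK data; both the set and the name `remove_stopwords_ci` are assumptions for this example.

```python
# Tiny hand-picked stand-in for stopwords.words("english"); an assumption made
# so the sketch runs without downloading NLTK data.
STOP_WORDS = {"this", "is", "a", "and", "we", "are", "to", "the", "from"}

def remove_stopwords_ci(tokens, stop_words=STOP_WORDS):
    # Compare the lowercased token, but keep the original casing in the result.
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ['This', 'is', 'a', 'sample', 'sentence', 'and', 'we', 'are',
          'going', 'to', 'remove', 'the', 'stopwords', 'from', 'this', '.']
print(remove_stopwords_ci(tokens))
```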

**Normalization**
Normalization is the process of converting a token into its base form. Inflection is removed from the token to get the base form of the word. This reduces the number of unique tokens and the redundancy in the data, which in turn lowers the dimensionality of the data.

There are two techniques to perform normalization: **Stemming** and **Lemmatization**.

**Stemming:**

Stemming is the process of reducing a word to its stem. The stem or root is the part to which affixes (-ed, -ize, -de, -s, etc.) are added. A stemmer obtains it by stripping affixes with heuristic rules rather than by consulting a dictionary, so the result of stemming may not be an actual word.

books      --->    book ,
looked     --->    look ,
denied     --->    deni ,
flies      --->    fli
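
The idea of rule-based suffix stripping can be sketched in a few lines. The rules below are a handful of illustrative assumptions that happen to reproduce the four examples above; they are not the Porter algorithm, which uses many more rules and conditions.

```python
# Ordered (suffix, replacement) rules; longer suffixes are tried first.
# These few rules are illustrative and are NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def naive_stem(word, min_stem_len=2):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        # Apply the first matching rule that leaves a long enough stem.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return word

for w in ["books", "looked", "denied", "flies"]:
    print(w, "--->", naive_stem(w))
```

The `min_stem_len` guard keeps very short words such as "is" from being reduced to a single letter.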

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
 
# stem words in the list of tokenized words
def stem_words(text):
    word_tokens = word_tokenize(text)
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems
 
text = 'data science uses scientific methods algorithms and many types of processes'
stem_words(text)

['data',
 'scienc',
 'use',
 'scientif',
 'method',
 'algorithm',
 'and',
 'mani',
 'type',
 'of',
 'process']

**Lemmatization:**

Like stemming, lemmatization converts a word to its base form. **The key difference is that lemmatization ensures that the resulting word (the lemma) belongs to the language.** We will get valid words if we use lemmatization. In NLTK, we use the WordNetLemmatizer to get the lemmas of words. The lemma of a word depends on its part of speech, so we pass the part of speech as a parameter.

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
# lemmatize string
def lemmatize_word(text):
    word_tokens = word_tokenize(text)
    # provide context, i.e. the part of speech; pos='v' treats every token as a verb
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokens]
    return lemmas
 
text = 'data science uses scientific methods algorithms and many types of processes'
lemmatize_word(text)

['data',
 'science',
 'use',
 'scientific',
 'methods',
 'algorithms',
 'and',
 'many',
 'type',
 'of',
 'process']
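
In the output above, 'methods' and 'algorithms' stay plural because every token was lemmatized as a verb. A common workaround is to tag each token first and map its Penn Treebank tag to one of the WordNet part-of-speech codes. The helper below is a sketch of such a mapping; the name `penn_to_wordnet_pos` and the choice to fall back to noun are assumptions for this example.

```python
def penn_to_wordnet_pos(tag):
    # WordNet POS codes: 'n' noun, 'v' verb, 'a' adjective, 'r' adverb.
    if tag.startswith("JJ"):
        return "a"
    if tag.startswith("VB"):
        return "v"
    if tag.startswith("RB"):
        return "r"
    # Fall back to noun, which is also WordNetLemmatizer's default.
    return "n"

tagged = [("methods", "NNS"), ("uses", "VBZ"), ("scientific", "JJ")]
print([(word, penn_to_wordnet_pos(tag)) for word, tag in tagged])

# With NLTK and its data installed, the mapping could be applied as:
#   lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))
```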

**Part of Speech Tagging:**

The part of speech explains **how a word is used in a sentence**. The same word can take on different roles and meanings depending on its context. **Basic natural language processing models like bag-of-words** fail to capture these relations between words. Hence, we use part-of-speech tagging to label each word with a part-of-speech tag based on its context. It is also used to extract relationships between words.

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
  
# convert text into word_tokens with their tags
def pos_tagging(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)
  
pos_tagging('You just gave me a scare')  # result is not shown: a cell displays only its last expression
pos_tagging('listen to me I will not call you')

[('listen', 'NN'),
 ('to', 'TO'),
 ('me', 'PRP'),
 ('I', 'PRP'),
 ('will', 'MD'),
 ('not', 'RB'),
 ('call', 'VB'),
 ('you', 'PRP')]

We can get the details of all the part of speech tags using the Penn Treebank tagset.

In [None]:
# download the tagset 
nltk.download('tagsets')
  
# extract information about the tag
nltk.help.upenn_tagset('NN')

[nltk_data] Downloading package tagsets to /root/nltk_data...


NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


[nltk_data]   Unzipping help/tagsets.zip.


**Chunking:**

Chunking is the process of **extracting phrases from unstructured text** and adding more structure to it. It is also **known as shallow parsing**. It is done on top of part-of-speech tagging. It groups words into “chunks”, mainly noun phrases. Here, chunking is done using regular expressions over the part-of-speech tags.

In [None]:
import tkinter as tk

In [None]:
from nltk.tokenize import word_tokenize 
from nltk import pos_tag
  
# define chunking function with text and regular
# expression representing grammar as parameter
def chunking(text, grammar):
    word_tokens = word_tokenize(text)
  
    # label words with part of speech
    word_pos = pos_tag(word_tokens)
    print(word_pos)
    # create a chunk parser using grammar
    chunkParser = nltk.RegexpParser(grammar)
    print(chunkParser)
    # test it on the list of word tokens with tagged pos
    tree = chunkParser.parse(word_pos) 
    for subtree in tree.subtrees():
        print(subtree)
    tree.draw()  # opens a Tk window; raises TclError on runtimes without a display (e.g. Colab)
      
sentence = 'the little yellow bird is flying in the sky'
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar)

[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('bird', 'NN'), ('is', 'VBZ'), ('flying', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('sky', 'NN')]
chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<DT>?<JJ>*<NN>'>
(S
  (NP the/DT little/JJ yellow/JJ bird/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)


TclError: ignored

In the given example, the grammar is defined using a simple regular expression rule. This rule says that an NP (Noun Phrase) chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).
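
To see what the rule does step by step, the sketch below mimics the pattern `<DT>?<JJ>*<NN>` with a plain loop over already-tagged tokens. It is a hand-written illustration, not how `nltk.RegexpParser` is implemented, and the function name `np_chunks` is an assumption.

```python
def np_chunks(tagged):
    """Greedy left-to-right matcher for the tag pattern <DT>?<JJ>*<NN>."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":     # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # any number of adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":     # required noun
            chunks.append([word for word, _ in tagged[i:j + 1]])
            i = j + 1
        else:
            i += 1  # no chunk starts at this position; move on
    return chunks

tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('bird', 'NN'),
          ('is', 'VBZ'), ('flying', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('sky', 'NN')]
print(np_chunks(tagged))
```

On the tagged sentence from the example, this yields the same two noun phrases as the NP subtrees printed above.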

**Named Entity Recognition:**

Named Entity Recognition is used to extract information from unstructured text. It classifies the entities present in a text into categories such as person, organization, event, and place. This tells us which real-world entities a text mentions and is a starting point for finding relationships between them.

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

def named_entity_recognition(text):
    # tokenize the text
    word_tokens = word_tokenize(text)
  
    # part of speech tagging of words
    word_pos = pos_tag(word_tokens)
  
    # tree of word entities
    print(ne_chunk(word_pos))
  
text = 'Bill works for GeeksforGeeks so he went to Delhi for a meetup.'
named_entity_recognition(text)

(S
  (PERSON Bill/NNP)
  works/VBZ
  for/IN
  (ORGANIZATION GeeksforGeeks/NNP)
  so/RB
  he/PRP
  went/VBD
  to/TO
  (GPE Delhi/NNP)
  for/IN
  a/DT
  meetup/NN
  ./.)


**Sentiment Analysis** is a use case of Natural Language Processing (NLP) and comes under the category of text classification. To put it simply, sentiment analysis involves classifying a text into sentiment categories, such as positive or negative, or happy, sad, or neutral. Thus, the ultimate goal of sentiment analysis is to decipher the underlying mood, emotion, or sentiment of a text. This is also known as **Opinion Mining.**

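From a machine learning point of view, this task is typically framed as a supervised learning task: a model learns from texts that are labeled with their sentiment and then predicts the label of new texts.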
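
The supervised framing can be illustrated with a very small Naive Bayes classifier written with the standard library. The four labeled sentences are hypothetical, the tokenizer is a plain whitespace split, and Naive Bayes with add-one smoothing is only one of many possible classifiers; the names `fit` and `predict` are assumptions for this sketch.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled examples, made up for illustration only.
train = [
    ("what a great and amazing day", "positive"),
    ("i love this wonderful movie", "positive"),
    ("this was a terrible boring movie", "negative"),
    ("i hate this awful day", "negative"),
]

def fit(examples):
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(text, model):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit(train)
print(predict("a great movie", model))
print(predict("boring and awful", model))
```

In practice the preprocessing steps from this notebook (lowercasing, punctuation removal, stopword removal, lemmatization) would run before the text reaches the classifier, and far more labeled data would be needed.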