<a href="https://colab.research.google.com/github/Devil-Rick/NLP-/blob/main/NLP_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Case Conversion

In [1]:
string = 'The quick brown fox jumped over The Big Dog'
string

'The quick brown fox jumped over The Big Dog'

In [2]:
# Convert the text from one case to another.
# Text data is usually lowercased before further processing so that 'The' and 'the' count as the same word.
print(f'In lowercase: "{string.lower()}"')
print(f'In uppercase: "{string.upper()}"')
print(f'In title case (first letter of each word capitalized): "{string.title()}"')

In lowercase: "the quick brown fox jumped over the big dog"
In uppercase: "THE QUICK BROWN FOX JUMPED OVER THE BIG DOG"
In title case (first letter of each word capitalized): "The Quick Brown Fox Jumped Over The Big Dog"
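
As a small illustration of why lowercasing is convenient, the sketch below counts words with and without case normalization. The sample sentence and variable names are made up for this example and are not part of the notebook's data.

```python
from collections import Counter

# Hypothetical sample sentence (not from the notebook's data)
sample = "The dog saw the other dog and THE cat"

# Without normalization, 'The', 'the' and 'THE' are three different keys
raw_counts = Counter(sample.split())

# After lowercasing, all three collapse into a single key
lower_counts = Counter(sample.lower().split())

print(raw_counts["the"], lower_counts["the"])
```

The trade-off is that lowercasing also erases distinctions that may matter, such as "US" (the country) versus "us" (the pronoun).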


# Using NLTK

In [3]:
import nltk 

In [4]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Tokenization
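Tokenization is the process of segmenting running text into sentences and words. In essence, it is the task of cutting a text into pieces called `tokens`. Depending on the tokenizer, certain characters such as punctuation may be thrown away at the same time; NLTK's `word_tokenize`, used below, keeps punctuation marks as separate tokens.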
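
To make the idea concrete without any library, here is a minimal regex-based sketch. The function names `simple_sent_tokenize` and `simple_word_tokenize` and the sample text are hypothetical, and the rules are deliberately naive (abbreviations such as "e.g." would be split incorrectly); NLTK's tokenizers are considerably more careful.

```python
import re

def simple_sent_tokenize(text):
    """Naive sentence splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    """Naive word tokenizer: keep runs of letters, digits and underscores, drop everything else."""
    return re.findall(r"\w+", sentence)

sample = "Summit is fast. It has 4,608 servers!"
sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sentences[1])

print(sentences)
print(words)
```

Note that this sketch throws punctuation away and splits "4,608" into two tokens, whereas `nltk.word_tokenize` keeps numbers such as "200,000" intact and emits punctuation as separate tokens.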

In [5]:
text = ("US unveils world's most powerful supercomputer, beats China. " 
               "The US has unveiled the world's most powerful supercomputer called 'Summit', " 
               "beating the previous record-holder China's Sunway TaihuLight. With a peak performance "
               "of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, "
               "which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, "
               "which reportedly take up the size of two tennis courts.")
text

"US unveils world's most powerful supercomputer, beats China. The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts."

In [6]:
# tokenization using NLTK

# 1. Sentence tokens (splitting the paragraph into sentences)
nltk.sent_tokenize(text)

["US unveils world's most powerful supercomputer, beats China.",
 "The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight.",
 'With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second.',
 'Summit has 4,608 servers, which reportedly take up the size of two tennis courts.']

In [7]:
# 2. Word tokens (splitting the paragraph into words); only the first 15 are shown
nltk.word_tokenize(text)[:15]

['US',
 'unveils',
 'world',
 "'s",
 'most',
 'powerful',
 'supercomputer',
 ',',
 'beats',
 'China',
 '.',
 'The',
 'US',
 'has',
 'unveiled']

## Stopwords
Stopwords are very `common words` that provide `little or no value` to the `NLP objective`, so they are `filtered and excluded` from the text before it is processed.
In English they include articles, pronouns, prepositions and conjunctions such as “the”, “he”, “to” and “and”.
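
A minimal sketch of the filtering step, assuming a tiny hand-picked stopword set (`STOP_WORDS` and `remove_stopwords` are hypothetical names for this illustration; NLTK's English list is much longer):

```python
# Tiny hand-picked stopword set for illustration only
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "has"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "US", "has", "unveiled", "the", "supercomputer"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Comparing the lowercase form matters because stopword lists are usually stored in lowercase; a case-sensitive check would let a sentence-initial "The" slip through.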

In [8]:
from nltk.corpus import stopwords

In [9]:
# the first 20 English stopwords provided by NLTK
stopwords.words('english')[:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

## Stemming
Stemming is the process of slicing off the `end or the beginning` of a word with the intention of `removing affixes`.

Affixes attached at the beginning of a word are called prefixes (e.g. “astro” in “astrobiology”), and those attached at the end are called suffixes (e.g. “ful” in “helpful”).

The biggest problem with stemming is that it gives no guarantee that the stem is a meaningful word. `For example, in the output below "previous" becomes "previou" and "tennis" becomes "tenni"`.

Then why do we use stemming?

We use `stemming` because it is very fast and simple to use. If `speed is more important` than grammatical correctness for our `NLP model`, stemming is the way to go.
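
The following toy suffix stripper shows why stems are cheap to compute and why they are not always real words. It is only a sketch: `SUFFIXES` and `toy_stem` are hypothetical, and the Porter algorithm used below applies a much richer, ordered set of rules.

```python
# Toy suffix list for illustration; not the Porter rule set
SUFFIXES = ("ing", "ed", "ly", "s")

def toy_stem(word, min_stem_len=3):
    """Lowercase the word and strip the first matching suffix,
    keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["beating", "unveiled", "servers", "tennis", "reportedly"]]
print(stems)
```

Each word costs only a few string comparisons, but the rule that correctly turns "servers" into "server" also turns "tennis" into "tenni".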


In [10]:
from nltk.stem import PorterStemmer

In [11]:
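Stemmer_Lines = PorterStemmer()
Lines = nltk.sent_tokenize(text)

# build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))

for i in range(len(Lines)):
  words = nltk.word_tokenize(Lines[i])
  # drop stopwords (the match is case-sensitive, so 'The' is kept), then stem the remaining words
  stem_words = [Stemmer_Lines.stem(word) for word in words if word not in stop_words]
  Lines[i] = " ".join(stem_words)
Lines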

["US unveil world 's power supercomput , beat china .",
 "the US unveil world 's power supercomput call 'summit ' , beat previou record-hold china 's sunway taihulight .",
 'with peak perform 200,000 trillion calcul per second , twice fast sunway taihulight , capabl 93,000 trillion calcul per second .',
 'summit 4,608 server , reportedli take size two tenni court .']

## Lemmatization
`Lemmatization` resolves words to their `dictionary form` (known as the lemma). To do this it requires detailed dictionaries in which the algorithm can look up words and link them to their corresponding lemmas.

Unlike stemming, lemmatization produces words that have a dictionary meaning.
`For example, in the output below "servers" becomes "server" and "calculations" becomes "calculation"`.

The lemmatizer also takes the `part of speech` of each word as a `parameter`; by default every word is treated as a noun.
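
The dictionary lookup and the role of the part-of-speech parameter can be sketched with a toy table. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical; in NLTK, WordNet plays the role of the dictionary and the lookup logic is more involved.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech)
LEMMA_TABLE = {
    ("cars", "n"): "car",
    ("servers", "n"): "server",
    ("running", "v"): "run",
    ("unveiled", "v"): "unveil",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos); unknown pairs come back unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("cars"))          # looked up as a noun (the default)
print(toy_lemmatize("running"))       # no noun entry, so the word is returned as-is
print(toy_lemmatize("running", "v"))  # found once the verb tag is supplied
```

The second call shows the weakness of the noun default: a verb is only reduced to its lemma when the correct part of speech is passed in.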

In [12]:
from nltk.stem import WordNetLemmatizer

In [13]:
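# lemmatizing with the default settings (every word is treated as a noun)

Lemmatizer_Lines = WordNetLemmatizer()
Lines = nltk.sent_tokenize(text)

# build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))

for i in range(len(Lines)):
  words = nltk.word_tokenize(Lines[i])
  # drop stopwords (case-sensitive match), then lemmatize the remaining words
  lemma_words = [Lemmatizer_Lines.lemmatize(word) for word in words if word not in stop_words]
  Lines[i] = " ".join(lemma_words)
Lines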

["US unveils world 's powerful supercomputer , beat China .",
 "The US unveiled world 's powerful supercomputer called 'Summit ' , beating previous record-holder China 's Sunway TaihuLight .",
 'With peak performance 200,000 trillion calculation per second , twice fast Sunway TaihuLight , capable 93,000 trillion calculation per second .',
 'Summit 4,608 server , reportedly take size two tennis court .']

In [14]:
# using different parts of speech

# difference between lemmatizing as a noun ('n') and as a verb ('v')
print(f"'cars' as a noun changes to 'car'   ({Lemmatizer_Lines.lemmatize('cars' , 'n')})")
print(f"'running' as a noun is left unchanged   ({Lemmatizer_Lines.lemmatize('running' , 'n')})")
print(f"'cars' as a verb is left unchanged   ({Lemmatizer_Lines.lemmatize('cars' , 'v')})")
print(f"'running' as a verb changes to 'run'   ({Lemmatizer_Lines.lemmatize('running' , 'v')})")

'cars' as a noun changes to 'car'   (car)
'running' as a noun is left unchanged   (running)
'cars' as a verb is left unchanged   (cars)
'running' as a verb changes to 'run'   (run)


**To lemmatize every word with its correct part of speech:**
1. Word-tokenize each line
2. Apply POS tagging (part-of-speech tagging) to the tokens
3. Convert the POS tags to WordNet tags
4. Apply lemmatization using the converted tags

In [15]:
# convert the POS tags produced by nltk.pos_tag into WordNet tags
from nltk.corpus import wordnet
def post_wordnet(words_token):
  # only the first letter of the tag is used (e.g. 'VBD' -> 'v'); unknown tags default to noun
  tag_map = {'j': wordnet.ADJ, 'v': wordnet.VERB, 'n': wordnet.NOUN, 'r': wordnet.ADV}
  new_tagged_tokens = [(word, tag_map.get(tag[0].lower(), wordnet.NOUN)) for word, tag in words_token]
  return new_tagged_tokens

In [16]:
Lemmatizer_Lines = WordNetLemmatizer()
Lines = nltk.sent_tokenize(text)
stop_words = set(stopwords.words('english'))  # build the stopword set once
Line_list = []
for i in range(len(Lines)):
  words = nltk.word_tokenize(Lines[i])

  # POS-tag the tokens
  words_token = nltk.pos_tag(words)

  # convert the POS tags to WordNet tags
  final_words = post_wordnet(words_token)

  # lemmatize each non-stopword with its tag and rebuild the line
  lemma_words = [Lemmatizer_Lines.lemmatize(word , tag) for (word , tag) in final_words if word not in stop_words]
  Lines[i] = " ".join(lemma_words)
  Line_list.append(Lines[i])
Lines

["US unveils world 's powerful supercomputer , beat China .",
 "The US unveil world 's powerful supercomputer call 'Summit ' , beat previous record-holder China 's Sunway TaihuLight .",
 'With peak performance 200,000 trillion calculation per second , twice fast Sunway TaihuLight , capable 93,000 trillion calculation per second .',
 'Summit 4,608 server , reportedly take size two tennis court .']

## When to use Stemming and when to use Lemmatization?

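Stemming is mostly used where speed matters more than readable output, for example when indexing documents in a search engine, whereas lemmatization is more common where the output must consist of real words, for example in chatbots.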

## Bag of Words
The Bag of Words model is used to `preprocess the text` by converting it into a bag of words, which keeps a `count of how often each word occurs` in each document and ignores word order.

This model can be visualized as a table with one column per word and one row per document, where each cell holds the count of that word in that document.

A major drawback of this model is that it gives equal weight to every word, which makes tasks such as sentiment analysis hard.

`For example:`

sent 1 = 'he is a good boy'

sent 2 = 'he is a bad boy'

Here 'good', 'bad' and 'boy' are given equal priority, but 'good' and 'bad' should carry more weight because they describe what the boy is like.
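
The two example sentences can be turned into count vectors by hand. This is a minimal sketch using only the standard library and plain whitespace splitting; it is not how `CountVectorizer` tokenizes (for instance, its default settings would drop the one-letter word 'a').

```python
from collections import Counter

sentences = ["he is a good boy", "he is a bad boy"]

# Vocabulary: sorted unique words across all sentences
vocabulary = sorted({word for sentence in sentences for word in sentence.split()})

# One count vector per sentence, aligned with the vocabulary order
vectors = []
for sentence in sentences:
    counts = Counter(sentence.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

The two vectors differ only in the 'bad' and 'good' columns, and those columns carry exactly the same weight as 'boy' or 'he', which is the drawback described above.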


In [17]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [26]:
# Let's use the Lines after lemmatizing and convert them into bag of words
Bow_vector = CountVectorizer() # Bow = Bag of words
final_vector = Bow_vector.fit_transform(Line_list).toarray()
final_vector

array([[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
       [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1],
       [2, 1, 0, 1, 0, 2, 0, 1, 0, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 2, 0, 0,
        0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
        1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]])

In [28]:
vocab = Bow_vector.get_feature_names_out()
vocab

array(['000', '200', '608', '93', 'beat', 'calculation', 'call',
       'capable', 'china', 'court', 'fast', 'holder', 'peak', 'per',
       'performance', 'powerful', 'previous', 'record', 'reportedly',
       'second', 'server', 'size', 'summit', 'sunway', 'supercomputer',
       'taihulight', 'take', 'tennis', 'the', 'trillion', 'twice', 'two',
       'unveil', 'unveils', 'us', 'with', 'world'], dtype=object)
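
The vocabulary contains '000', '200', '608' and '93' but no '4' and no 's'. This follows from the defaults of `CountVectorizer`: text is lowercased and tokens are extracted with the pattern `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. The sketch below reproduces that tokenization with the standard `re` module on the last lemmatized line; it assumes no other vectorizer options were changed.

```python
import re

# Default token pattern documented for scikit-learn's CountVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

line = "Summit 4,608 server , reportedly take size two tennis court ."

# CountVectorizer lowercases by default before tokenizing
tokens = re.findall(TOKEN_PATTERN, line.lower())
print(tokens)
```

The comma splits "4,608" into "4" and "608", and the single-character "4" is then discarded, so only '608' reaches the vocabulary.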

In [29]:
pd.DataFrame(final_vector , columns=vocab)

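| | 000 | 200 | 608 | 93 | beat | calculation | call | capable | china | court | fast | holder | peak | per | performance | powerful | previous | record | reportedly | second | server | size | summit | sunway | supercomputer | taihulight | take | tennis | the | trillion | twice | two | unveil | unveils | us | with | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 2 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |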


## TF-IDF

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [25]:
TFIDF_vector = TfidfVectorizer()
final_vector = TFIDF_vector.fit_transform(Line_list).toarray()
final_vector

array([[0.        , 0.        , 0.        , 0.        , 0.362529  ,
        0.        , 0.        , 0.        , 0.362529  , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.362529  , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.362529  ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.45982207, 0.362529  ,
        0.        , 0.362529  ],
       [0.        , 0.        , 0.        , 0.        , 0.23154214,
        0.        , 0.29368184, 0.        , 0.23154214, 0.        ,
        0.        , 0.29368184, 0.        , 0.        , 0.        ,
        0.23154214, 0.29368184, 0.29368184, 0.        , 0.        ,
        0.        , 0.        , 0.23154214, 0.23154214, 0.23154214,
        0.23154214, 0.        , 0.        , 0.29368184, 0.        ,
        0.        , 0.        , 0.29368184, 0.        , 0.23154214,
        0.     

In [22]:
vocab = TFIDF_vector.get_feature_names_out()
vocab

array(['000', '200', '608', '93', 'beat', 'calculation', 'call',
       'capable', 'china', 'court', 'fast', 'holder', 'peak', 'per',
       'performance', 'powerful', 'previous', 'record', 'reportedly',
       'second', 'server', 'size', 'summit', 'sunway', 'supercomputer',
       'taihulight', 'take', 'tennis', 'the', 'trillion', 'twice', 'two',
       'unveil', 'unveils', 'us', 'with', 'world'], dtype=object)
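
To see where a value such as 0.362529 in the first row of the TF-IDF array comes from, the sketch below recomputes that row by hand. It assumes the vectorizer's default settings (`smooth_idf=True`, `norm='l2'`, raw term counts), under which the inverse document frequency is ln((1 + n) / (1 + df)) + 1 and each row is divided by its Euclidean length. The document frequencies are read off the bag-of-words matrix above; the helper name `smoothed_idf` is made up for this illustration.

```python
import math

n_docs = 4  # number of lines in Line_list

def smoothed_idf(doc_freq, n_docs=n_docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# First line: every term occurs once; 'unveils' appears in 1 line, the other terms in 2 lines
doc_freqs = {"beat": 2, "china": 2, "powerful": 2, "supercomputer": 2,
             "unveils": 1, "us": 2, "world": 2}

# Term frequency is 1 for every term in this line, so tf * idf is just the idf
raw = {term: 1 * smoothed_idf(df) for term, df in doc_freqs.items()}

# L2-normalize the row
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
tfidf = {term: value / norm for term, value in raw.items()}

print(round(tfidf["beat"], 6), round(tfidf["unveils"], 6))
```

The rarer term 'unveils' receives a higher weight than terms that also occur in another line, which is how TF-IDF addresses the equal-weight drawback of the plain bag of words.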

In [30]:
pd.DataFrame(final_vector , columns=vocab)



# Using spaCy