
# Natural Language Processing

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

In [1]:
# import the necessary libraries
import nltk
import string
import re


# Text Lowercase
We lowercase the text to reduce the size of its vocabulary: variants such as "The" and "the" collapse into a single token.

In [8]:
def text_lowercase(text):
    return text.lower()

input_str = "Mosharafa obtained his primary certificate in 1910, ranking second nationwide. He obtained his Baccalaureate at the age of 16, becoming the youngest student at that time to be awarded such a certificate and, again, ranking second. He preferred to enroll in the Teachers' College rather than the faculties of Medicine or Engineering due to his deep interest in mathematics."
text_lowercase(input_str)


"mosharafa obtained his primary certificate in 1910, ranking second nationwide. he obtained his baccalaureate at the age of 16, becoming the youngest student at that time to be awarded such a certificate and, again, ranking second. he preferred to enroll in the teachers' college rather than the faculties of medicine or engineering due to his deep interest in mathematics."

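As a rough illustration of the vocabulary reduction, the sketch below counts distinct whitespace-separated tokens before and after lowercasing. The helper name `vocabulary` and the sample sentence are made up for this example and are not part of the notebook.

```python
def vocabulary(text):
    # Split on whitespace and keep only the distinct tokens.
    return set(text.split())

sample = "The cat saw the other cat THE END"

raw_vocab = vocabulary(sample)            # "The", "the" and "THE" count separately
lower_vocab = vocabulary(sample.lower())  # all three collapse into "the"

print(len(raw_vocab), len(lower_vocab))
```

The lowercased vocabulary is smaller because casing variants of the same word no longer count as separate entries.
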
# Remove numbers
We can either remove numbers or convert them into their textual representation.
Removing them is straightforward with a regular expression.

In [10]:
# Remove numbers
def remove_numbers(text):
    result = re.sub(r'\d+', '', text)
    return result

input_str = "Mosharafa 55 obtained his primary certificate in 1910, ranking second nationwide. He obtained his Baccalaureate at the age of 16, becoming the youngest student at that time to be awarded such a certificate and, again, ranking second. He preferred to enroll in the Teachers' College rather than the faculties of Medicine or Engineering due to his deep interest in mathematics."
remove_numbers(input_str)


"Mosharafa  obtained his primary certificate in , ranking second nationwide. He obtained his Baccalaureate at the age of , becoming the youngest student at that time to be awarded such a certificate and, again, ranking second. He preferred to enroll in the Teachers' College rather than the faculties of Medicine or Engineering due to his deep interest in mathematics."

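The pattern `\d+` also strips digits that are part of a word. If that is not wanted, one possible variant (a sketch, not the notebook's method) adds word boundaries so that only standalone numbers are removed. The function name `remove_standalone_numbers` is hypothetical.

```python
import re

def remove_standalone_numbers(text):
    # \b is a word boundary, so digits glued to letters (e.g. "Python3") are kept.
    return re.sub(r"\b\d+\b", "", text)

sample = "Python3 was released in 2008 with 3 big changes"
cleaned = remove_standalone_numbers(sample)
print(cleaned)
```

As with the original function, the removed numbers leave extra spaces behind, which a whitespace-normalization step can clean up afterwards.
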
We can also convert the numbers into words. This can be done by using the inflect library.

In [11]:
# import the inflect library
import inflect
p = inflect.engine()

# convert number into words
def convert_number(text):
    # split string into list of words
    temp_str = text.split()
    # initialise empty list
    new_string = []

    for word in temp_str:
        # if the word consists only of digits, convert it
        # to words and append the result to the new_string list
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)

        # append the word as it is
        else:
            new_string.append(word)

    # join the words of new_string to form a string
    temp_str = ' '.join(new_string)
    return temp_str

input_str = 'There are 3 balls in this bag, and 12 in the other one.'
convert_number(input_str)


'There are three balls in this bag, and twelve in the other one.'
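
Because the function above tests whole whitespace-separated words with isdigit(), a token such as "12," (with attached punctuation) is left unchanged. A possible alternative is to substitute each run of digits with a regular expression. The sketch below assumes a converter callable; `demo_number_to_words` is a hypothetical stand-in that only knows the demo values, and in the notebook one would pass `p.number_to_words` instead.

```python
import re

# Hypothetical stand-in for inflect's number_to_words; covers only the demo values.
_DEMO_WORDS = {"3": "three", "12": "twelve"}

def demo_number_to_words(digits):
    # Fall back to the original digits when the value is unknown.
    return _DEMO_WORDS.get(digits, digits)

def convert_numbers_regex(text, to_words=demo_number_to_words):
    # Replace every run of digits, even when punctuation is attached to it.
    return re.sub(r"\d+", lambda match: to_words(match.group()), text)

sample = "Bag one holds 3, bag two holds 12."
converted = convert_numbers_regex(sample)
print(converted)
```

Note that this simple pattern treats a decimal such as "3.5" as two separate numbers, so it is only suitable for integer values.
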

# Remove punctuation
We remove punctuation so that the same word does not appear in several forms. If we keep the punctuation, then "been.", "been," and "been!" will be treated as separate tokens.

In [12]:
# remove punctuation
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)
input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
remove_punctuation(input_str)


'Hey did you know that the summer break is coming Amazing right  Its only 5 more days '
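
Deleting punctuation outright can fuse neighbouring words when no space surrounds the mark. One alternative, shown here only as a sketch, maps each punctuation character to a space instead. The function name `punctuation_to_spaces` is made up for this example.

```python
import string

def punctuation_to_spaces(text):
    # Map every ASCII punctuation character to a single space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

sample = "state-of-the-art models,however,need data"
spaced = punctuation_to_spaces(sample)
print(spaced)
```

With plain deletion the same input would become "stateoftheart modelshoweverneed data". The trade-off is that this variant can introduce repeated spaces, so it pairs naturally with whitespace normalization.
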

# Remove whitespace

We can combine the split and join methods to collapse every run of whitespace into a single space and to strip leading and trailing whitespace.

In [13]:
# remove whitespace from text
def remove_whitespace(text):
    return " ".join(text.split())
input_str = "we don't need   the given questions"
remove_whitespace(input_str)


"we don't need the given questions"

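Called without arguments, split() breaks on any whitespace, so the same approach also handles tabs, newlines and leading or trailing spaces. The short sketch below repeats the function body under the hypothetical name `normalize_whitespace` to show this.

```python
def normalize_whitespace(text):
    # split() with no arguments splits on any whitespace run and drops empty pieces.
    return " ".join(text.split())

sample = "  we don't\tneed \n the given   questions  "
normalized = normalize_whitespace(sample)
print(normalized)
```
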
# Remove default stopwords
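Stopwords are very common words, such as "the", "is" and "and", that carry little meaning on their own. For many tasks they can be removed without losing much information, although this is not always safe: removing "not", for example, can invert the sense of a sentence. The NLTK library ships a list of English stopwords, which we can use to filter our text and return a list of word tokens. Note that the comparison is case-sensitive and the NLTK list is lowercase, so a capitalized word such as "This" is kept unless the text is lowercased first.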

In [16]:
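# download the stopword list and the Punkt tokenizer models
# (assumption: recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')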

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [17]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# remove stopwords function
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text

example_text = "This is a sample sentence and we are going to remove the stopwords from this."
remove_stopwords(example_text)


['This', 'sample', 'sentence', 'going', 'remove', 'stopwords', '.']
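
In the output above, 'This' survives because the membership test is case-sensitive. A possible variant compares the lowercased token while keeping the original casing in the result. To stay self-contained, the sketch below uses a small illustrative set, `DEMO_STOPWORDS`, as a stand-in for `stopwords.words("english")`, and takes an already tokenized list. Both names are assumptions made for this example.

```python
# Illustrative stopword set; a stand-in for stopwords.words("english").
DEMO_STOPWORDS = {"this", "is", "a", "and", "we", "are", "to", "the", "from"}

def remove_stopwords_casefold(tokens, stop_words=DEMO_STOPWORDS):
    # Compare the lowercased token, but keep the original casing in the output.
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["This", "is", "a", "sample", "sentence", "."]
filtered = remove_stopwords_casefold(tokens)
print(filtered)
```

Lowercasing the whole text before tokenizing would achieve the same filtering, at the cost of losing the original casing in the returned tokens.
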