# NLP
NLP stands for Natural Language Processing. It is the branch of Artificial Intelligence that gives machines the ability to understand and process human languages. Human language can come in the form of text or audio.

# What is text pre-processing?

Text pre-processing is the process of transforming unstructured text to structured text to prepare it for analysis.

When you pre-process text before feeding it to algorithms, you increase the accuracy and efficiency of said algorithms by removing noise and other inconsistencies in the text that can make it hard for the computer to understand.

Making the text easier to understand also helps to reduce the time and resources the computer needs to process the data.

# Processes involved in text pre-processing
Getting your text into the right state for further analysis takes several operations, applied step by step, to arrive at well-structured text.

# Tokenization
Tokenization is the first stage of the process.

Here your text is analysed and then broken down into chunks called ‘tokens’ which can either be words or phrases. This allows the computer to work on your text token by token rather than working on the entire text in the following stages.

The two main types of tokenisation are word and sentence tokenisation.

Word tokenisation is the most common kind of tokenisation.

Here, each token is a word, meaning the algorithm breaks down the entire text into individual words:

In [None]:
text = 'Wisdoms daughter walks alone. The mark of Athena burns through rome'

words = text.split()
print(words)

['Wisdoms', 'daughter', 'walks', 'alone.', 'The', 'mark', 'of', 'Athena', 'burns', 'through', 'rome']


Sentence tokenisation, on the other hand, breaks text down into sentences instead of words. It is a less common type of tokenisation, used in only a few Natural Language Processing (NLP) tasks.
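
As a rough illustration of sentence tokenisation, the sketch below splits text after sentence-ending punctuation using only the standard library. The function name `sentence_tokenize` is made up for this example, and the rule is a simple heuristic that will mis-split abbreviations such as "Dr.".

```python
import re

def sentence_tokenize(text):
    # Split wherever '.', '!' or '?' is followed by whitespace.
    # This is a heuristic, not a full sentence-boundary detector.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if s]

text = 'Wisdoms daughter walks alone. The mark of Athena burns through rome'
print(sentence_tokenize(text))
# ['Wisdoms daughter walks alone.', 'The mark of Athena burns through rome']
```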

# Case normalisation

This technique converts all the letters in your text to a single case, either uppercase or lowercase.

Case normalisation ensures that your data is stored in a consistent format and makes it easier to work with the data.

In [None]:
text = "'To Sleep Or NOT to SLEep, THAT is THe Question'"

def lower_case(text):
    text = text.lower()
    return text

# Store the result under a new name so the lower_case function is not overwritten
lowered_text = lower_case(text)  # converts everything to lowercase
print(lowered_text)


'to sleep or not to sleep, that is the question'


# Stemming
Stemming reduces words to their base form. Words like coding, coder, and coded all share the same base word, which is code.

More often than not, ML models can treat these words as derived from one base word. They can work with your text without the tenses, prefixes, and suffixes that we as humans would normally need to make sense of it.

Stemming your text reduces the number of words the model has to work with and, by extension, improves the efficiency of the model.



In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

text = "She enjoys coding, coded many projects, and is a skilled coder."

stemmed_words = [stemmer.stem(word) for word in text.split()]

print("Stemmed Words:", stemmed_words)


Stemmed Words: ['she', 'enjoy', 'coding,', 'code', 'mani', 'projects,', 'and', 'is', 'a', 'skill', 'coder.']
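
In the output above, tokens such as `'coding,'` and `'coder.'` keep their trailing punctuation, so the stemmer leaves them unchanged. One possible way around this, sketched here with the standard library only, is to extract punctuation-free tokens before stemming. The regular expression is an assumption chosen for this example, not part of the original notebook.

```python
import re

text = "She enjoys coding, coded many projects, and is a skilled coder."

# \w+ matches runs of letters, digits and underscores, so punctuation is left out
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['she', 'enjoys', 'coding', 'coded', 'many', 'projects', 'and', 'is', 'a', 'skilled', 'coder']
```

These tokens could then be passed to `stemmer.stem` in place of `text.split()`.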


# Lemmatisation
This method is very similar to stemming in that it also identifies the base form of words. It is, however, a more complex and accurate technique than stemming.

Lemmatization, unlike stemming, reduces words to their base or dictionary form (lemma), ensuring the root word remains meaningful.

In [None]:
import nltk

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

text = "She enjoys coding, coded many projects, and is a skilled coder."

lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in text.split()]

print("Lemmatized Words:", lemmatized_words)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Lemmatized Words: ['She', 'enjoy', 'coding,', 'cod', 'many', 'projects,', 'and', 'be', 'a', 'skilled', 'coder.']


# Punctuation removal

In human communication, punctuation marks such as `‘’`, `!`, `[`, `}`, `*`, `#`, `/`, and `?` are incredibly relevant and necessary. They help to fully convey the message of the writer. For many text-analysis tasks, however, they add little information and are treated as noise, so they are removed.

In [None]:
import re

text = ' (to love is to destroy, and to be loved, is to be "the" one <destroyed>} '

def remove_punctuations(text):
    punctuation = re.compile(r'[{};():,."/<>-]')
    text = punctuation.sub(' ', text)
    return text

clean_text = remove_punctuations(text)
print(clean_text)

  to love is to destroy  and to be loved  is to be  the  one  destroyed   


# Accent removal
This process removes language-specific accent marks (diacritics) from text.

Some characters are written with accents or other marks, either to indicate a different pronunciation or to signal that a word has a different meaning from its unaccented spelling.

In [None]:
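import unicodedata

text = "her fiancé's résumé is beautiful"

def remove_accents(text):
    # Decompose each accented character into its base letter plus a combining mark (NFD),
    # then drop the combining marks. This handles any accented letter, not only é and è.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))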

cleaned_text = remove_accents(text)
print(cleaned_text)

her fiance's resume is beautiful


# Lab Tasks
### **Note:** You will perform all the tasks using the dataset.txt file

# Task 1: Read the Dataset from a File
In this task you will read the dataset from a text file (dataset.txt).

In [None]:
# File name taken from the task statement above
with open("dataset.txt", "r", encoding="utf-8") as file:
  data = file.read()
data

"The car is driven on the road.\nThe truck is parked in the lot.\nThis pasta is delicious and affordable.\nI enjoy coding in Python.\nArtificial Intelligence is transforming industries.\nThe weather is sunny today.\nShe loves reading books.\nThe cake tastes amazing.\nLearning new things every day is fulfilling.\nThe sky is clear and blue.\nThe dog chased the cat around the yard.\nHe is studying for his final exams.\nThe phone battery is running low.\nNature has a calming effect on the mind.\nThe coffee is too hot to drink right now.\nShe enjoys painting landscapes in her free time.\nThe train arrived at the station early.\nHe is preparing a presentation for work.\nI am excited about the upcoming event.\nThe movie was thrilling and full of suspense.\nRunning in the park is a great way to start the day.\nThe artist painted a beautiful portrait of the woman.\nShe is learning French in her spare time.\nThe garden is blooming with colorful flowers.\nThe bird sang a melodious tune in the mor

# Task 2: Text Pre-processing and Tokenization
Given a document, perform text cleaning (remove HTML tags, emojis, and special characters), convert text to lowercase, and then tokenize it into words.


In [None]:
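import re

def cleanFunction(data):
  html_tags = re.compile(r'<.*?>')        # anything between angle brackets
  special_chars = re.compile(r'[^\w\s]')  # punctuation, emojis, and other symbols
  text = html_tags.sub(' ', data)         # strip HTML tags first, while < and > are still present
  text = special_chars.sub(' ', text)
  text = text.lower()
  # split() with no argument splits on any whitespace (including '\n') and drops empty strings
  return text.split()
text=cleanFunction(data)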

In [None]:
text

['the',
 'car',
 'is',
 'driven',
 'on',
 'the',
 'road',
 'the',
 'truck',
 'is',
 'parked',
 'in',
 'the',
 'lot',
 'this',
 'pasta',
 'is',
 'delicious',
 'and',
 'affordable',
 'i',
 'enjoy',
 'coding',
 'in',
 'python',
 'artificial',
 'intelligence',
 'is',
 'transforming',
 'industries',
 'the',
 'weather',
 'is',
 'sunny',
 'today',
 'she',
 'loves',
 'reading',
 'books',
 'the',
 'cake',
 'tastes',
 'amazing',
 'learning',
 'new',
 'things',
 'every',
 'day',
 'is',
 'fulfilling',
 'the',
 'sky',
 'is',
 'clear',
 'and',
 'blue',
 'the',
 'dog',
 'chased',
 'the',
 'cat',
 'around',
 'the',
 'yard',
 'he',
 'is',
 'studying',
 'for',
 'his',
 'final',
 'exams',
 'the',
 'phone',
 'battery',
 'is',
 'running',
 'low',
 'nature',
 'has',
 'a',
 'calming',
 'effect',
 'on',
 'the',
 'mind',
 'the',
 'coffee',
 'is',
 'too',
 'hot',
 'to',
 'drink',
 'right',
 'now',
 'she',
 'enjoys',
 'painting',
 'landscapes',
 'in',
 'her',
 'free',
 'time',
 'the',
 'train',
 'arrived',
 'at'
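
The dataset shown above contains only plain sentences, so it does not exercise the HTML-tag and emoji removal that the task asks for. The sketch below applies the same kind of cleaning to a made-up noisy sample. Both the helper name `clean_and_tokenize` and the sample string are assumptions introduced for illustration.

```python
import re

def clean_and_tokenize(raw):
    # Hypothetical helper: remove HTML tags, then every character that is
    # neither a word character nor whitespace (punctuation, emojis, symbols).
    text = re.sub(r'<[^>]+>', ' ', raw)
    text = re.sub(r'[^\w\s]', ' ', text)
    # Lowercase and split on any whitespace, dropping empty strings
    return text.lower().split()

sample = "<p>The car is driven on the road. 🚗</p>\n<b>I enjoy coding in Python!</b>"
print(clean_and_tokenize(sample))
# ['the', 'car', 'is', 'driven', 'on', 'the', 'road', 'i', 'enjoy', 'coding', 'in', 'python']
```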

# Task 3: One-Hot Encoding for a Given Text
Write a program that converts a sentence into one-hot encoded vectors.


In [None]:
import numpy as np
def one_hot_encoding(sentence):
  # split() with no argument splits on any whitespace, so newlines do not end up inside tokens
  words=sentence.lower().split()
  uniqueWords=sorted(set(words))
  word_to_index={ word:i for i, word in enumerate(uniqueWords)}
  one_hot_vector=[]
  print(len(uniqueWords))
  for word in words:
    vector=np.zeros(len(uniqueWords))
    vector[word_to_index[word]]=1   # mark the position of this word in the sorted vocabulary
    one_hot_vector.append(vector)
  return one_hot_vector
one_hot_encoding('''
I think this looks good
yeah this is fine
''')

8


[array([0., 0., 1., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 1., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 1., 0.]),
 array([0., 0., 0., 0., 1., 0., 0., 0.]),
 array([0., 1., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 1.]),
 array([0., 0., 0., 0., 0., 0., 1., 0.]),
 array([0., 0., 0., 1., 0., 0., 0., 0.]),
 array([1., 0., 0., 0., 0., 0., 0., 0.])]

# Task 4: TF-IDF Calculation
Implement a function to calculate Term Frequency (TF) and Inverse Document Frequency (IDF) for a given corpus of documents.

In [None]:
import math
import pandas as pd

def term_frequency(sentence):
    # split() with no argument splits on any whitespace, so newlines need no special handling
    wordsInSentence = sentence.lower().split()
    totalWords = len(wordsInSentence)

    # Count occurrences of each word, keeping the words in sorted order
    wordCountDict = {word: 0 for word in sorted(set(wordsInSentence))}
    for word in wordsInSentence:
        wordCountDict[word] += 1

    # TF = occurrences of the word / total number of words in the sentence
    termFrequencies = [count / totalWords for count in wordCountDict.values()]

    return wordCountDict, termFrequencies

def calculate_idf(corpus):
    numDocuments = len(corpus)
    wordDocCount = {}

    for document in corpus:
        uniqueWords = set(document.lower().split(' '))
        for word in uniqueWords:
            if word in wordDocCount:
                wordDocCount[word] += 1
            else:
                wordDocCount[word] = 1

    idfDict = {}
    for word, docCount in wordDocCount.items():
        idfDict[word] = math.log(numDocuments / docCount)

    return idfDict

def tf_idf(sentence, corpus):
    wordCountDict, termFrequencies = term_frequency(sentence)
    idfValues = calculate_idf(corpus)

    idfList = []

    for word in wordCountDict.keys():
        idfList.append(idfValues.get(word, 0))

    resultDf = pd.DataFrame({'Word': list(wordCountDict.keys()), 'TF': termFrequencies, 'IDF': idfList})

    return resultDf

sentence = '''
I think this looks good
 yeah this is fine
'''

corpus = [
    "I love programming and coding",
    "This is a document about Python",
    "I think this looks good yeah this is fine",
    "Learning is important and Python helps"
]

df = tf_idf(sentence, corpus)
print(df)


    Word        TF       IDF
0   fine  0.111111  1.386294
1   good  0.111111  1.386294
2      i  0.111111  0.693147
3     is  0.111111  0.287682
4  looks  0.111111  1.386294
5  think  0.111111  1.386294
6   this  0.222222  0.693147
7   yeah  0.111111  1.386294
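
The table above lists TF and IDF separately. The TF-IDF score of a word is the product of the two, so a word scores highly when it is frequent in the sentence but rare across the corpus. The sketch below shows one way to compute it with the same unsmoothed IDF, `log(N / document count)`. The function name `tf_idf_scores` is made up for this example, and giving unseen words an IDF of 0 is an assumption that mirrors the `idfValues.get(word, 0)` default used above.

```python
import math

def tf_idf_scores(sentence, corpus):
    words = sentence.lower().split()
    n_docs = len(corpus)
    # One set of words per document, for fast membership checks
    doc_sets = [set(doc.lower().split()) for doc in corpus]

    scores = {}
    for word in sorted(set(words)):
        tf = words.count(word) / len(words)
        doc_count = sum(word in doc for doc in doc_sets)
        # Words that appear in no document get an IDF of 0 (assumption)
        idf = math.log(n_docs / doc_count) if doc_count else 0.0
        scores[word] = tf * idf
    return scores

corpus = [
    "I love programming and coding",
    "This is a document about Python",
    "I think this looks good yeah this is fine",
    "Learning is important and Python helps",
]

scores = tf_idf_scores("I think this looks good yeah this is fine", corpus)
for word, score in scores.items():
    print(f"{word:<6} {score:.6f}")
```

Note that library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed IDF and normalise the vectors by default, so their numbers will differ from this hand-rolled version.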


# Task 5: Word2Vec Model Implementation
Build a simple Word2Vec model using Gensim for a given corpus.

Use `gensim.models`, for example `from gensim.models import Word2Vec`.

In [None]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
corpus = [
    "I love programming and coding",
    "This is a document about Python",
    "I think this looks good yeah this is fine",
    "Learning is important and Python helps",
]

tokenizedCorpus = [word_tokenize(sentence.lower()) for sentence in corpus]
model = Word2Vec(sentences=tokenizedCorpus, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv.most_similar('python'))


[('looks', 0.2529098391532898), ('i', 0.17018885910511017), ('think', 0.15016482770442963), ('document', 0.13887980580329895), ('fine', 0.10852645337581635), ('love', 0.034764934331178665), ('about', 0.033071890473365784), ('yeah', 0.016065234318375587), ('is', 0.004503022879362106), ('important', -0.005897548049688339)]
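
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below computes cosine similarity by hand on small made-up vectors, only to show what the number means. These vectors are illustrative and are not taken from the trained model. With a corpus as small as the one above, the learned vectors stay close to their random initialisation, so the scores should not be read as meaningful.

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors for illustration
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]   # same direction as v1
v3 = [0.0, 0.0, 3.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # 0.0: nothing in common
```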


# Task 6: Sentence Matching Based on Given Dataset
In this task, you will be given a large dataset of sentences (provided below). Your goal is to match a query with sentences from this dataset. You will implement a function to find sentences that contain specific words or phrases from a user query.
## **Steps:**
- **Data Acquisition:** Use the provided dataset of sentences.
- **Text Preparation:** Clean the dataset (remove punctuation, convert to lowercase, etc.).
- **Feature Engineering:** Use TF-IDF (Term Frequency-Inverse Document Frequency) to create numerical vectors for each sentence.
- **Search:** Match the query against the sentence vectors using cosine similarity.
- **Return Matched Sentences:** Display the top matching sentences.

## **Required Libraries:**
- **NLTK** for text cleaning and preprocessing.
- **TfidfVectorizer** from **scikit-learn** for converting text to vectors.
- **Cosine Similarity** for finding similar sentences.


In [None]:
import nltk
import string
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('stopwords')
from nltk.corpus import stopwords

stopWords = set(stopwords.words('english'))

sentences = [
    "The cat sat on the mat.",
    "Dogs are great companions.",
    "Cats and dogs can be friends.",
    "The mat is comfortable.",
    "I love to walk my dog.",
    "The sun is shining brightly."
]

def clean_text(sentences):
    cleanedSentences = []
    for sentence in sentences:
        sentence = sentence.translate(str.maketrans('', '', string.punctuation)).lower()
        words = [word for word in sentence.split() if word not in stopWords]
        cleanedSentences.append(' '.join(words))
    return cleanedSentences

def compute_tf_Idf(sentences):
    vectorizer = TfidfVectorizer()
    tfidfMatrix = vectorizer.fit_transform(sentences)
    return tfidfMatrix, vectorizer

def match_query(query, tfidfMatrix, vectorizer, topN=3):
    cleanedQuery = clean_text([query])[0]
    queryVector = vectorizer.transform([cleanedQuery])
    cosineSimilarities = cosine_similarity(queryVector, tfidfMatrix).flatten()
    topIndices = cosineSimilarities.argsort()[-topN:][::-1]
    return [(sentences[i], cosineSimilarities[i]) for i in topIndices]

cleanedSentences = clean_text(sentences)
tfidfMatrix, vectorizer = compute_tf_Idf(cleanedSentences)

query = "I want to know about dogs."
matchedSentences = match_query(query, tfidfMatrix, vectorizer)
print("Top matching sentences:")
for sentence, score in matchedSentences:
    print(f"Sentence: '{sentence}' - Similarity Score: {score:.4f}")


Top matching sentences:
Sentence: 'Cats and dogs can be friends.' - Similarity Score: 0.5016
Sentence: 'Dogs are great companions.' - Similarity Score: 0.5016
Sentence: 'The sun is shining brightly.' - Similarity Score: 0.0000
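
The third result above has a similarity of 0.0000, meaning it shares no terms with the query and is returned only to fill the top three. One possible refinement, sketched below in plain Python, is to drop zero-score results. The helper name `top_matches` is hypothetical, and the scores are the rounded values from the output above, with the remaining sentences assumed to score 0.

```python
def top_matches(sentences, scores, top_n=3):
    # Rank sentences by score (highest first), keep at most top_n,
    # and discard anything that does not overlap with the query at all.
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return [(sentence, score) for sentence, score in ranked[:top_n] if score > 0]

sentences = [
    "The cat sat on the mat.",
    "Dogs are great companions.",
    "Cats and dogs can be friends.",
    "The mat is comfortable.",
    "I love to walk my dog.",
    "The sun is shining brightly.",
]
# Illustrative scores: two rounded values from the output above, the rest assumed to be 0
scores = [0.0, 0.5016, 0.5016, 0.0, 0.0, 0.0]

for sentence, score in top_matches(sentences, scores):
    print(f"Sentence: '{sentence}' - Similarity Score: {score:.4f}")
```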


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
