# NLP
NLP stands for Natural Language Processing. It is the branch of Artificial Intelligence that gives machines the ability to understand and process human languages. Human language can arrive as text or as audio.

# What is text pre-processing?

Text pre-processing is the process of transforming unstructured text to structured text to prepare it for analysis.

When you pre-process text before feeding it to algorithms, you increase the accuracy and efficiency of said algorithms by removing noise and other inconsistencies in the text that can make it hard for the computer to understand.

Making the text easier to understand also helps to reduce the time and resources the computer needs to process the data.

# Processes involved in text pre-processing
Getting your text into the right state for further analysis takes several operations, usually applied one after another. Each operation is described in turn in the sections that follow.

# Tokenization
Tokenization is the first stage of the process.

Here your text is analysed and then broken down into chunks called ‘tokens’ which can either be words or phrases. This allows the computer to work on your text token by token rather than working on the entire text in the following stages.

The two main types of tokenisation are word and sentence tokenisation.

Word tokenisation is the most common kind of tokenisation.

Here, each token is a word, meaning the algorithm breaks down the entire text into individual words:

In [None]:
text = 'Wisdoms daughter walks alone. The mark of Athena burns through rome'

words = text.split()
print(words)

['Wisdoms', 'daughter', 'walks', 'alone.', 'The', 'mark', 'of', 'Athena', 'burns', 'through', 'rome']


On the other hand, sentence tokenisation breaks down text into sentences instead of words. It is a less common type of tokenisation, used in only a few Natural Language Processing (NLP) tasks.
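
Sentence tokenisation can be sketched with a regular expression. This is a minimal illustration, not a robust tokenizer: it assumes sentences end with `.`, `!` or `?` followed by whitespace and that the text contains no abbreviations such as "Dr."; `sentence_tokenize` is a hypothetical helper name.

```python
import re

def sentence_tokenize(text):
    # Split wherever '.', '!' or '?' is followed by whitespace.
    # The lookbehind keeps the punctuation attached to its sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

text = 'Wisdoms daughter walks alone. The mark of Athena burns through rome'
print(sentence_tokenize(text))
# ['Wisdoms daughter walks alone.', 'The mark of Athena burns through rome']
```

For real-world text, a trained sentence tokenizer is usually a better choice than a hand-written pattern.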

# Case normalisation

This technique converts all the letters in your text to a single case, either uppercase or lowercase.

Case normalisation ensures that your data is stored in a consistent format and makes it easier to work with the data.

In [None]:
text = "'To Sleep Or NOT to SLEep, THAT is THe Question'"

def lower_case(text):
    # Convert every character to lowercase
    return text.lower()

# Store the result under a new name so the function is not overwritten
lowered_text = lower_case(text)
print(lowered_text)


'to sleep or not to sleep, that is the question'


# Stemming
Words like coding, coder, and coded all share the same base word, which is code. Stemming reduces such words to a common stem by stripping their endings.

Once words are stemmed, ML models can, more often than not, treat the different forms as one word. They can work with your text without the tenses, prefixes, and suffixes that we as humans would normally need to make sense of it.

Stemming your text reduces the number of distinct words the model has to work with and, by extension, improves the efficiency of the model.



In [None]:
from nltk.stem import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# Example text
text = "She enjoys coding, coded many projects, and is a skilled coder."

# Tokenize and stem each word
stemmed_words = [stemmer.stem(word) for word in text.split()]

print("Stemmed Words:", stemmed_words)


Stemmed Words: ['she', 'enjoy', 'coding,', 'code', 'mani', 'projects,', 'and', 'is', 'a', 'skill', 'coder.']


# Lemmatisation
This method is very similar to stemming in that it is also used to identify the base of words. It is however a more complex and accurate technique than stemming.

Lemmatization, unlike stemming, reduces words to their base or dictionary form (lemma), ensuring the root word remains meaningful.

In [None]:
import nltk

# Download required NLTK data, like WordNet and others
nltk.download('wordnet')
nltk.download('omw-1.4')  # Optional, for enhanced language support
nltk.download('punkt')     # If you need tokenizers
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Example text
text = "She enjoys coding, coded many projects, and is a skilled coder."

# Tokenize and lemmatize each word
lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in text.split()]

print("Lemmatized Words:", lemmatized_words)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Lemmatized Words: ['She', 'enjoy', 'coding,', 'cod', 'many', 'projects,', 'and', 'be', 'a', 'skilled', 'coder.']


# Punctuation removal

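During human conversations, punctuation marks like `‘’, ! , [, }, *, #, /, ?, and ‘’` are incredibly relevant and necessary to have a proper conversation. They help to fully convey the message of the writer. For many text-analysis tasks, however, they add noise rather than meaning, so they are commonly removed during pre-processing.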

In [None]:
import re

text = ' (to love is to destroy, and to be loved, is to be "the" one <destroyed>} '

def remove_punctuations(text):
    punctuation = re.compile(r'[{};():,."/<>-]')
    text = punctuation.sub(' ', text)
    return text

clean_text = remove_punctuations(text)
print(clean_text)

  to love is to destroy  and to be loved  is to be  the  one  destroyed   
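
As an alternative to listing characters in a regular expression, the standard library's `string.punctuation` constant can be used. The sketch below assumes that only ASCII punctuation needs removing; `remove_punctuation` is a hypothetical helper name. Note that it deletes the characters rather than replacing them with spaces, so a hyphenated word such as "well-known" would become "wellknown".

```python
import string

def remove_punctuation(text):
    # Build a translation table that deletes every ASCII punctuation character
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

text = ' (to love is to destroy, and to be loved, is to be "the" one <destroyed>} '
# split() also discards the leftover leading and trailing whitespace
print(remove_punctuation(text).split())
```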


# Accent removal
This process removes language-specific accent marks (diacritics) from the characters in a text.

Some characters are written with accents or other marks, either to signal a different pronunciation or to distinguish a word from another word that would otherwise be spelled the same way.

In [None]:
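import unicodedata

text = "her fiancé's résumé is beautiful"

def remove_accents(text):
    # Decompose each accented character into its base letter plus
    # combining marks (NFD), then drop the combining marks so that
    # any accented letter is handled, not only é and è
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

cleaned_text = remove_accents(text)
print(cleaned_text)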

her fiance's resume is beautiful


# Lab Tasks
### **Note:** You will perform all the tasks using the dataset.txt file.

# Task 1: Read the Dataset from a File
In this task you will read the dataset from a text file (dataset.txt).

['The car is driven on the road.\n', 'The truck is parked in the lot.\n', 'This pasta is delicious and affordable.\n', 'I enjoy coding in Python.\n', 'Artificial Intelligence is transforming industries.\n', 'The weather is sunny today.\n', 'She loves reading books.\n', 'The cake tastes amazing.\n', 'Learning new things every day is fulfilling.\n', 'The sky is clear and blue.\n', 'The dog chased the cat around the yard.\n', 'He is studying for his final exams.\n', 'The phone battery is running low.\n', 'Nature has a calming effect on the mind.\n', 'The coffee is too hot to drink right now.\n', 'She enjoys painting landscapes in her free time.\n', 'The train arrived at the station early.\n', 'He is preparing a presentation for work.\n', 'I am excited about the upcoming event.\n', 'The movie was thrilling and full of suspense.\n', 'Running in the park is a great way to start the day.\n', 'The artist painted a beautiful portrait of the woman.\n', 'She is learning French in her spare time.\

In [None]:
with open('dataset.txt', 'r') as f:
    text = f.read()
    print(text)

The car is driven on the road.
The truck is parked in the lot.
This pasta is delicious and affordable.
I enjoy coding in Python.
Artificial Intelligence is transforming industries.
The weather is sunny today.
She loves reading books.
The cake tastes amazing.
Learning new things every day is fulfilling.
The sky is clear and blue.
The dog chased the cat around the yard.
He is studying for his final exams.
The phone battery is running low.
Nature has a calming effect on the mind.
The coffee is too hot to drink right now.
She enjoys painting landscapes in her free time.
The train arrived at the station early.
He is preparing a presentation for work.
I am excited about the upcoming event.
The movie was thrilling and full of suspense.
Running in the park is a great way to start the day.
The artist painted a beautiful portrait of the woman.
She is learning French in her spare time.
The garden is blooming with colorful flowers.
The bird sang a melodious tune in the morning.
The conference is s

# Task 2: Text Pre-processing and Tokenization
Given a document, perform text cleaning (remove HTML tags, emojis, and special characters), convert text to lowercase, and then tokenize it into words.


In [None]:
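def lower_case(text):
    # Convert every character to lowercase
    return text.lower()

# Store the result under a new name so the function is not overwritten
lowered_text = lower_case(text)
# Split on whitespace to get word tokens
words = lowered_text.split()
print(words)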

['the', 'car', 'is', 'driven', 'on', 'the', 'road.', 'the', 'truck', 'is', 'parked', 'in', 'the', 'lot.', 'this', 'pasta', 'is', 'delicious', 'and', 'affordable.', 'i', 'enjoy', 'coding', 'in', 'python.', 'artificial', 'intelligence', 'is', 'transforming', 'industries.', 'the', 'weather', 'is', 'sunny', 'today.', 'she', 'loves', 'reading', 'books.', 'the', 'cake', 'tastes', 'amazing.', 'learning', 'new', 'things', 'every', 'day', 'is', 'fulfilling.', 'the', 'sky', 'is', 'clear', 'and', 'blue.', 'the', 'dog', 'chased', 'the', 'cat', 'around', 'the', 'yard.', 'he', 'is', 'studying', 'for', 'his', 'final', 'exams.', 'the', 'phone', 'battery', 'is', 'running', 'low.', 'nature', 'has', 'a', 'calming', 'effect', 'on', 'the', 'mind.', 'the', 'coffee', 'is', 'too', 'hot', 'to', 'drink', 'right', 'now.', 'she', 'enjoys', 'painting', 'landscapes', 'in', 'her', 'free', 'time.', 'the', 'train', 'arrived', 'at', 'the', 'station', 'early.', 'he', 'is', 'preparing', 'a', 'presentation', 'for', 'work.
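
The cell above lowercases and tokenizes the text but leaves punctuation attached to words (for example `'road.'`). A minimal sketch of the cleaning step the task describes follows. It assumes that a simple regular expression is good enough for stripping HTML tags, and that any character other than a lowercase letter, digit, or whitespace counts as a special character, which also removes emojis. `clean_and_tokenize` is a hypothetical helper name.

```python
import re

def clean_and_tokenize(text):
    # Remove HTML tags such as <p> or </div>
    text = re.sub(r"<[^>]+>", " ", text)
    # Convert to lowercase before filtering characters
    text = text.lower()
    # Replace everything except letters, digits, and whitespace
    # (this covers punctuation, emojis, and other special characters)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Split on whitespace to get word tokens
    return text.split()

sample = "<p>The car is driven on the road. 😀</p>"
print(clean_and_tokenize(sample))
# ['the', 'car', 'is', 'driven', 'on', 'the', 'road']
```

Because the pattern keeps only ASCII letters and digits, accented characters would be dropped as well, so accent removal should run first if the text contains them.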

# Task 3: One-Hot Encoding for a Given Text
Write a program that converts a sentence into one-hot encoded vectors.


In [3]:
import numpy as np

def one_hot_encode(text):
    word_to_index = {}
    index = 0
    for word in text.split():
        if word not in word_to_index:
            word_to_index[word] = index
            index += 1

    one_hot_vectors = []
    for word in text.split():
        vector = np.zeros(len(word_to_index))
        vector[word_to_index[word]] = 1
        one_hot_vectors.append(vector)

    return one_hot_vectors

with open('dataset.txt', 'r') as f:
    # Keep everything before the first full stop, i.e. the first sentence
    first_sentence = f.read().split('.', 1)[0]

first_sentence = first_sentence.strip().lower()

one_hot_vectors = one_hot_encode(first_sentence)

print("One-Hot Encoded Vectors:")
first_sentence_words = first_sentence.split()
for i, vector in enumerate(one_hot_vectors):
    print(f"Word: {first_sentence_words[i]}, Vector: {vector}")


One-Hot Encoded Vectors:
Word: the, Vector: [1. 0. 0. 0. 0. 0.]
Word: car, Vector: [0. 1. 0. 0. 0. 0.]
Word: is, Vector: [0. 0. 1. 0. 0. 0.]
Word: driven, Vector: [0. 0. 0. 1. 0. 0.]
Word: on, Vector: [0. 0. 0. 0. 1. 0.]
Word: the, Vector: [1. 0. 0. 0. 0. 0.]
Word: road, Vector: [0. 0. 0. 0. 0. 1.]


# Task 4: TF-IDF Calculation
Implement a function to calculate Term Frequency (TF) and Inverse Document Frequency (IDF) for a given corpus of documents.
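
For reference, the quantities computed by the code in this task can be written as follows. The symbol names are chosen here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents that contain $t$. The last formula is the usual way of combining the two into a single TF-IDF weight.

```latex
\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}},
\qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t},
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
\]
```

A term that appears in every document has $n_t = N$, so its IDF, and therefore its TF-IDF weight, is zero.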

In [7]:
import math

def calculate_tf(sentence):
    words = sentence.split()
    total_words = len(words)
    tf = {}

    for word in words:
        tf[word] = tf.get(word, 0) + 1

    for word in tf:
        tf[word] /= total_words

    return tf

def calculate_idf(sentences):
    total_sentences = len(sentences)
    word_doc_count = {}

    for sentence in sentences:
        unique_words = set(sentence.split())
        for word in unique_words:
            word_doc_count[word] = word_doc_count.get(word, 0) + 1

    idf = {}
    for word, count in word_doc_count.items():
        idf[word] = math.log(total_sentences / count)

    return idf

with open('dataset.txt', 'r') as f:
    content = f.read().strip().lower()
    sentences = content.split('. ')

tf_scores = [calculate_tf(sentence) for sentence in sentences]

idf_scores = calculate_idf(sentences)

print("TF Scores for each sentence:")
for i, tf in enumerate(tf_scores):
    print(f"Sentence {i+1}: {tf}")

print("\nIDF Scores:")
print(idf_scores)


TF Scores for each sentence:
Sentence 1: {'the': 0.1272390364422483, 'car': 0.0030883261272390363, 'is': 0.07226683137739345, 'driven': 0.0006176652254478073, 'on': 0.01173563928350834, 'road.': 0.0006176652254478073, 'truck': 0.0006176652254478073, 'parked': 0.0006176652254478073, 'in': 0.015441630636195183, 'lot.': 0.0006176652254478073, 'this': 0.0024706609017912293, 'pasta': 0.0006176652254478073, 'delicious': 0.0018529956763434219, 'and': 0.012353304508956145, 'affordable.': 0.0006176652254478073, 'i': 0.0012353304508956147, 'enjoy': 0.0006176652254478073, 'coding': 0.0006176652254478073, 'python.': 0.0006176652254478073, 'artificial': 0.0006176652254478073, 'intelligence': 0.0006176652254478073, 'transforming': 0.0006176652254478073, 'industries.': 0.0006176652254478073, 'weather': 0.0012353304508956147, 'sunny': 0.0012353304508956147, 'today.': 0.0006176652254478073, 'she': 0.019147621988882025, 'loves': 0.0006176652254478073, 'reading': 0.0024706609017912293, 'books.': 0.001235
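
The scores above list nearly the whole vocabulary under Sentence 1, which suggests that splitting on `'. '` may not separate sentences that end with a line break. The sketch below assumes one sentence per line, as in the Task 1 output, splits on lines instead, and combines TF and IDF into a single TF-IDF weight per word. `tf_idf` is a hypothetical helper name, and the two sample sentences are taken from the dataset for illustration.

```python
import math

def tf_idf(documents):
    """Return one {word: tf-idf weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears
    doc_freq = {}
    for tokens in tokenized:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    scores = []
    for tokens in tokenized:
        doc_scores = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores

raw_text = "The car is driven on the road.\nThe truck is parked in the lot.\n"
# One document per non-empty line
documents = [line.strip() for line in raw_text.splitlines() if line.strip()]
scores = tf_idf(documents)
print(scores[0])
```

Words shared by both sample sentences, such as "the" and "is", get a weight of zero, while words unique to one sentence get a positive weight.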

# Task 5: Word2Vec Model Implementation
Build a simple Word2Vec model using Gensim for a given corpus.

Use `gensim.models`: `from gensim.models import Word2Vec`

In [10]:
from gensim.models import Word2Vec

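def read_corpus(filename):
    with open(filename, 'r') as f:
        content = f.read().lower()
    # Assumes one sentence per line, as in the dataset shown in Task 1,
    # so split on line breaks rather than on '. '
    sentences = content.splitlines()
    # Tokenize each non-empty sentence into a list of words
    return [sentence.split() for sentence in sentences if sentence.strip()]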

def train_word2vec(sentences):
    model = Word2Vec(sentences, vector_size=50, window=5, min_count=1)
    return model

def explore_model(model):
    print("Vector for 'car':")
    print(model.wv['car'])

    print("\nMost similar words to 'car':")
    print(model.wv.most_similar('car'))

filename = 'dataset.txt'
sentences = read_corpus(filename)

model = train_word2vec(sentences)

explore_model(model)


Vector for 'car':
[-6.2789692e-04  1.3909686e-03 -1.3658030e-02  5.1096007e-03
 -4.1698324e-03  3.8797366e-03  1.2211216e-02 -9.3356794e-05
 -8.4986286e-03  1.4764344e-02  1.0723642e-02  1.1896982e-02
  8.9222938e-03  1.6201288e-02  1.3497480e-02 -1.7707658e-03
  1.9731827e-02  1.0337951e-02  7.3222485e-03 -5.1035830e-03
 -2.7140039e-03 -6.1729928e-03 -1.3300695e-02  6.2286849e-03
  1.9564744e-02  1.9066224e-02 -5.3435699e-03  1.2281654e-02
  4.1817403e-03 -8.7886471e-03  3.9139073e-03 -1.0632699e-02
  1.7158762e-02  1.0093577e-02  4.0351283e-03  9.3076676e-03
  1.9015599e-02  1.3732970e-02 -1.2057308e-03 -1.5911512e-02
  1.8536422e-02  1.4409487e-02  2.1235095e-03 -3.0685693e-03
 -7.4846991e-03 -8.1663262e-03  1.6542181e-02 -7.9411743e-03
 -1.0800276e-02  2.0949097e-02]

Most similar words to 'car':
[('useful', 0.49448415637016296), ('sky.', 0.41240453720092773), ('known', 0.39266136288642883), ('diverse', 0.3710325360298157), ('flowers', 0.36098313331604004), ('community.', 0.3440743

# Task 6: Sentence Matching Based on Given Dataset
In this task, you will be given a large dataset of sentences (provided below). Your goal is to match a query with sentences from this dataset. You will implement a function to find sentences that contain specific words or phrases from a user query.
## **Steps:**
- **Data Acquisition:** Use the provided dataset of sentences.
- **Text Preparation:** Clean the dataset (remove punctuation, convert to lowercase, etc.).
- **Feature Engineering:** Use TF-IDF (Term Frequency-Inverse Document Frequency) to create numerical vectors for each sentence.
- **Search:** Match the query against the sentence vectors using cosine similarity.
- **Return Matched Sentences:** Display the top matching sentences.

## **Required Libraries:**
- **NLTK** for text cleaning and preprocessing.
- **TfidfVectorizer** from **scikit-learn** for converting text to vectors.
- **Cosine Similarity** for finding similar sentences.
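
The steps above can be sketched as follows. This is one possible outline rather than the expected lab solution: it assumes scikit-learn is installed, uses a regular expression instead of NLTK for the cleaning step, and runs on four sample sentences from the dataset instead of the full file. `clean` and `match_sentences` are hypothetical helper names.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean(sentence):
    # Lowercase and replace anything that is not a letter, digit, or whitespace
    return re.sub(r"[^a-z0-9\s]", " ", sentence.lower()).strip()

def match_sentences(query, sentences, top_n=3):
    cleaned = [clean(s) for s in sentences]

    # Learn the vocabulary and build one TF-IDF vector per sentence
    vectorizer = TfidfVectorizer()
    sentence_vectors = vectorizer.fit_transform(cleaned)

    # Project the query into the same vector space
    query_vector = vectorizer.transform([clean(query)])

    # Cosine similarity between the query and every sentence
    similarities = cosine_similarity(query_vector, sentence_vectors).ravel()

    # Indices of the highest-scoring sentences, best first
    ranked = similarities.argsort()[::-1][:top_n]
    return [(sentences[i], float(similarities[i])) for i in ranked if similarities[i] > 0]

sentences = [
    "The car is driven on the road.",
    "The truck is parked in the lot.",
    "I enjoy coding in Python.",
    "The weather is sunny today.",
]

matches = match_sentences("coding in python", sentences)
for sentence, score in matches:
    print(f"{score:.3f}  {sentence}")
```

Sentences with a similarity of zero share no vocabulary with the query, so they are left out of the results.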
