# NLP
NLP stands for Natural Language Processing. It is the branch of Artificial Intelligence that gives machines the ability to understand and process human languages. Human language can take the form of text or audio.

# What is text pre-processing?

Text pre-processing is the process of transforming unstructured text to structured text to prepare it for analysis.

When you pre-process text before feeding it to algorithms, you increase the accuracy and efficiency of said algorithms by removing noise and other inconsistencies in the text that can make it hard for the computer to understand.

Making the text easier to understand also helps to reduce the time and resources the computer needs to process the data.

# Processes involved in text pre-processing
To pre-process your text properly and get it into the right state for further analysis, you need to perform several operations on it, following a few steps to arrive at well-structured text.

# Tokenization
Tokenization is the first stage of the process.

Here your text is analysed and then broken down into chunks called ‘tokens’ which can either be words or phrases. This allows the computer to work on your text token by token rather than working on the entire text in the following stages.

The two main types of tokenisation are word and sentence tokenisation.

Word tokenisation is the most common kind of tokenisation.

Here, each token is a word, meaning the algorithm breaks down the entire text into individual words:

In [1]:
text = 'Wisdoms daughter walks alone. The mark of Athena burns through rome'

words = text.split()
print(words)

['Wisdoms', 'daughter', 'walks', 'alone.', 'The', 'mark', 'of', 'Athena', 'burns', 'through', 'rome']
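
Note that `split()` keeps punctuation attached to words (`'alone.'` in the output above). As a minimal sketch, assuming a simple regular expression is acceptable for your data, the tokens can be extracted without the punctuation. The function name `word_tokenize_simple` is made up for this illustration and is not part of any library.

```python
import re

text = 'Wisdoms daughter walks alone. The mark of Athena burns through rome'

def word_tokenize_simple(text):
    # \w+ matches runs of letters, digits, and underscores,
    # so punctuation such as the trailing "." is left out of the tokens
    return re.findall(r"\w+", text)

tokens = word_tokenize_simple(text)
print(tokens)
```

This pattern also splits contractions such as "don't" into two tokens, so a dedicated tokenizer may be a better fit for real-world text.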


On the other hand, sentence tokenisation breaks down text into sentences instead of words. It is a less common type of tokenisation, used in only a few Natural Language Processing (NLP) tasks.
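
Sentence tokenisation can be sketched in a similar way. The example below is a rough illustration that assumes sentences end with `.`, `!`, or `?` followed by whitespace; the helper name `sentence_tokenize_simple` is hypothetical.

```python
import re

text = 'Wisdoms daughter walks alone. The mark of Athena burns through rome'

def sentence_tokenize_simple(text):
    # Split at whitespace that directly follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop any empty strings left over from the split
    return [part for part in parts if part]

sentences = sentence_tokenize_simple(text)
print(sentences)
```

A rule this simple will wrongly split after abbreviations such as "Dr.", which is one reason trained sentence tokenizers exist.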

# Case normalisation

This technique converts all the letters in your text to a single case, either uppercase or lowercase.

Case normalisation ensures that your data is stored in a consistent format and makes it easier to work with the data.

In [2]:
text = "'To Sleep Or NOT to SLEep, THAT is THe Question'"

def lower_case(text):
    text = text.lower()
    return text

lowered_text = lower_case(text)  # converts everything to lowercase
print(lowered_text)


'to sleep or not to sleep, that is the question'
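
For text that is not plain English, `lower()` may not be enough to make equivalent spellings compare equal. The sketch below uses Python's built-in `str.casefold()`, which applies more aggressive Unicode case folding; the word list is only an illustrative assumption.

```python
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps the German sharp s, so two distinct forms remain
lowered = [word.lower() for word in words]

# casefold() folds the sharp s to "ss", so all three forms match
folded = [word.casefold() for word in words]

print(lowered)
print(folded)
```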


# Stemming
Stemming reduces words to their base form. Words like coding, coder, and coded all share the same base word, which is code.

More often than not, ML models can treat these words as derived from one base word. They can work with your text without the tenses, prefixes, and suffixes that we as humans would normally need to make sense of it.

Stemming your text reduces the number of words the model has to work with and, by extension, improves the efficiency of the model.



In [3]:
from nltk.stem import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# Example text
text = "She enjoys coding, coded many projects, and is a skilled coder."

# Tokenize and stem each word
stemmed_words = [stemmer.stem(word) for word in text.split()]

print("Stemmed Words:", stemmed_words)


Stemmed Words: ['she', 'enjoy', 'coding,', 'code', 'mani', 'projects,', 'and', 'is', 'a', 'skill', 'coder.']


# Lemmatisation
This method is very similar to stemming in that it is also used to identify the base of words. It is, however, a more complex and accurate technique than stemming.

Lemmatization, unlike stemming, reduces words to their base or dictionary form (lemma), ensuring the root word remains meaningful.

In [4]:
import nltk

# Download required NLTK data, like WordNet and others
nltk.download('wordnet')
nltk.download('omw-1.4')  # Optional, for enhanced language support
nltk.download('punkt')     # If you need tokenizers
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Example text
text = "She enjoys coding, coded many projects, and is a skilled coder."

# Tokenize and lemmatize each word
lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in text.split()]

print("Lemmatized Words:", lemmatized_words)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Lemmatized Words: ['She', 'enjoy', 'coding,', 'cod', 'many', 'projects,', 'and', 'be', 'a', 'skilled', 'coder.']


# Punctuation removal

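During human conversations, punctuation marks like `‘’`, `!`, `[`, `}`, `*`, `#`, `/`, and `?` are incredibly relevant and necessary to have a proper conversation. They help to fully convey the message of the writer. For many text-analysis tasks, however, they add little information, so they are often removed during pre-processing.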

In [5]:
import re

text = ' (to love is to destroy, and to be loved, is to be "the" one <destroyed>} '

def remove_punctuations(text):
    punctuation = re.compile(r'[{};():,."/<>-]')
    text = punctuation.sub(' ', text)
    return text

clean_text = remove_punctuations(text)
print(clean_text)

  to love is to destroy  and to be loved  is to be  the  one  destroyed   
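
The regular expression above lists punctuation characters by hand, so anything left out of the list survives. As an alternative sketch, assuming only ASCII punctuation matters, the standard library's `string.punctuation` can be used with `str.translate`. The function name `remove_punctuation_all` is made up for this example.

```python
import string

text = ' (to love is to destroy, and to be loved, is to be "the" one <destroyed>} '

def remove_punctuation_all(text):
    # Map every ASCII punctuation character to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # Collapse the repeated spaces left behind and trim both ends
    return " ".join(text.translate(table).split())

clean_text = remove_punctuation_all(text)
print(clean_text)
```

Unlike the version above, this one also collapses the leftover whitespace, which keeps later `split()` calls tidy.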


# Accent removal
This process is about removing language-specific character symbols from text.

Some characters are written with specific accents or symbols to either imply a different pronunciation or to signify that words containing such accented texts have a different meaning.

In [6]:
import re

text = "her fiancé's résumé is beautiful"

def remove_accents(text):
    accents = re.compile(u"[\u0300-\u036F]|é|è")
    text = accents.sub(u"e", text)
    return text

cleaned_text = remove_accents(text)
print(cleaned_text)

her fiance's resume is beautiful
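
The pattern above only handles `é` and `è`, and it would turn any other accented letter into an `e`. A more general sketch, assuming Unicode decomposition is acceptable for your text, uses the standard `unicodedata` module. The function name `remove_accents_nfkd` is hypothetical.

```python
import unicodedata

def remove_accents_nfkd(text):
    # Decompose each accented character into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_accents_nfkd("her fiancé's résumé is beautiful"))
print(remove_accents_nfkd("naïve façade"))
```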


# Lab Tasks
### **Note:** you will perform all the tasks using the dataset.txt file

# Task 1: Read the Dataset from a File
In this task you will read the dataset from a text file (dataset.txt).

In [34]:


with open('dataset.txt', 'r') as file:
  dataset = file.readlines()
print(dataset)


['The car is driven on the road.\n', 'The truck is parked in the lot.\n', 'This pasta is delicious and affordable.\n', 'I enjoy coding in Python.\n', 'Artificial Intelligence is transforming industries.\n', 'The weather is sunny today.\n', 'She loves reading books.\n', 'The cake tastes amazing.\n', 'Learning new things every day is fulfilling.\n', 'The sky is clear and blue.\n', 'The dog chased the cat around the yard.\n', 'He is studying for his final exams.\n', 'The phone battery is running low.\n', 'Nature has a calming effect on the mind.\n', 'The coffee is too hot to drink right now.\n', 'She enjoys painting landscapes in her free time.\n', 'The train arrived at the station early.\n', 'He is preparing a presentation for work.\n', 'I am excited about the upcoming event.\n', 'The movie was thrilling and full of suspense.\n', 'Running in the park is a great way to start the day.\n', 'The artist painted a beautiful portrait of the woman.\n', 'She is learning French in her spare time.\

# Task 2: Text Pre-processing and Tokenization
Given a document, perform text cleaning (remove HTML tags, emojis, and special characters), convert text to lowercase, and then tokenize it into words.


In [11]:
pip install emoji

Collecting emoji
  Downloading emoji-2.13.2-py3-none-any.whl.metadata (5.8 kB)
Downloading emoji-2.13.2-py3-none-any.whl (553 kB)
Installing collected packages: emoji
Successfully installed emoji-2.13.2


In [12]:
import re
import emoji

def clean_and_tokenize(text):
  """Cleans and tokenizes text.

  Args:
    text: The input text.

  Returns:
    A list of tokens.
  """
  # Remove HTML tags, then common punctuation characters
  text = re.sub(r'<[^<]+?>|[{};():,."/<>-]', '', text)

  # Remove emojis first, then any remaining special characters
  text = emoji.replace_emoji(text, replace='')
  text = re.sub(r'[^\w\s]', '', text)

  # Convert to lowercase
  text = text.lower()

  # Tokenize into words
  tokens = text.split()

  return tokens


for line in dataset:
    tokens = clean_and_tokenize(line)
    print(tokens)

['the', 'car', 'is', 'driven', 'on', 'the', 'road']
['the', 'truck', 'is', 'parked', 'in', 'the', 'lot']
['this', 'pasta', 'is', 'delicious', 'and', 'affordable']
['i', 'enjoy', 'coding', 'in', 'python']
['artificial', 'intelligence', 'is', 'transforming', 'industries']
['the', 'weather', 'is', 'sunny', 'today']
['she', 'loves', 'reading', 'books']
['the', 'cake', 'tastes', 'amazing']
['learning', 'new', 'things', 'every', 'day', 'is', 'fulfilling']
['the', 'sky', 'is', 'clear', 'and', 'blue']
['the', 'dog', 'chased', 'the', 'cat', 'around', 'the', 'yard']
['he', 'is', 'studying', 'for', 'his', 'final', 'exams']
['the', 'phone', 'battery', 'is', 'running', 'low']
['nature', 'has', 'a', 'calming', 'effect', 'on', 'the', 'mind']
['the', 'coffee', 'is', 'too', 'hot', 'to', 'drink', 'right', 'now']
['she', 'enjoys', 'painting', 'landscapes', 'in', 'her', 'free', 'time']
['the', 'train', 'arrived', 'at', 'the', 'station', 'early']
['he', 'is', 'preparing', 'a', 'presentation', 'for', 'work'

# Task 3: One-Hot Encoding for a Given Text
 Write a program that converts a sentence into one-hot encoded vectors.


In [33]:
import numpy as np
corpus =  dataset

# Create a set of unique words in the corpus
unique_words = set()
for sentence in corpus:
    for word in sentence.split():
        unique_words.add(word.lower())

# Create a dictionary to map each
# unique word to an index
word_to_index = {}
for i, word in enumerate(unique_words):
    word_to_index[word] = i

# Create one-hot encoded vectors for
# each word in the corpus
one_hot_vectors = []
for sentence in corpus:
    sentence_vectors = []
    for word in sentence.split():
        vector = np.zeros(len(unique_words))
        vector[word_to_index[word.lower()]] = 1
        sentence_vectors.append(vector)
    one_hot_vectors.append(sentence_vectors)

# Print the one-hot encoded vectors
# for the first sentence
print("One-hot encoded vectors for the first sentence:")
for vector in one_hot_vectors[0]:
    print(vector)

One-hot encoded vectors for the first sentence:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0.
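
Because a Python `set` has no guaranteed order, the position of the `1` in each vector can change between runs. The sketch below is a small, dependency-free illustration that sorts the vocabulary to keep indices reproducible; the sentence and the function name `one_hot_encode` are assumptions made for this example.

```python
sentence = "The car is driven on the road"

def one_hot_encode(sentence):
    tokens = sentence.lower().split()
    # Sort the unique words so every run assigns the same index to each word
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for token in tokens:
        vector = [0] * len(vocab)
        vector[index[token]] = 1  # mark the position of this word
        vectors.append(vector)
    return vocab, vectors

vocab, vectors = one_hot_encode(sentence)
print(vocab)
for token, vector in zip(sentence.lower().split(), vectors):
    print(token, vector)
```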

# Task 4: TF-IDF Calculation
Implement a function to calculate Term Frequency (TF) and Inverse Document Frequency (IDF) for a given corpus of documents.
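
For reference, the unsmoothed definitions that the code in this task follows can be written as below. The symbols are introduced here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents that contain $t$. Libraries such as scikit-learn use smoothed variants, so their values will differ.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$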

In [23]:
import math
from collections import Counter

def compute_tf(text):
    # Tokenize the text by splitting on whitespace
    words = text.split()
    # Count the occurrences of each word in the document
    word_counts = Counter(words)
    # Compute the term frequency for each word
    tf = {word: count / len(words) for word, count in word_counts.items()}
    return tf

def compute_idf(corpus):
    # Total number of documents
    num_docs = len(corpus)
    # Count the number of documents that contain each word
    word_doc_counts = Counter()
    for text in corpus:
        words = set(text.split())
        for word in words:
            word_doc_counts[word] += 1
    # Compute the inverse document frequency for each word
    idf = {word: math.log(num_docs / count) for word, count in word_doc_counts.items()}
    return idf

def compute_tf_idf(text, corpus):
    tf = compute_tf(text)
    idf = compute_idf(corpus)
    # Compute the TF-IDF for each word in the document
    tf_idf = {word: tf[word] * idf[word] for word in tf}
    return tf_idf

# Example usage
corpus = dataset
text = dataset[0]
tf_idf = compute_tf_idf(text, corpus)
print(tf_idf)

{'The': 0.06585054196083108, 'car': 0.5087208689434358, 'is': 0.05955625770454101, 'driven': 0.7386405707197359, 'on': 0.3180064308388159, 'the': 0.10397610550683362, 'road.': 0.7386405707197359}


# Task 5: Word2Vec Model Implementation
Build a simple Word2Vec model using Gensim for a given corpus.

Use `gensim.models`:
### from gensim.models import Word2Vec

In [27]:
from gensim.models import Word2Vec

corpus = dataset

# Tokenize the corpus
tokenized_corpus = [sentence.split() for sentence in corpus]

# Create and train the Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Save the model
model.save("word2vec.model")

# Load the model
model = Word2Vec.load("word2vec.model")

# Find similar words
similar_words = model.wv.most_similar("airplane")
print(similar_words)

[('footage', 0.3249693214893341), ('cabin', 0.29437825083732605), ('exams.', 0.28312361240386963), ('run', 0.2572650909423828), ('roses.', 0.25433534383773804), ('changing', 0.25385114550590515), ('vacation.', 0.2497791349887848), ("company's", 0.24248816072940826), ('for', 0.24247483909130096), ('car.', 0.24011121690273285)]


# Task 6: Sentence Matching Based on Given Dataset
In this task, you will be given a large dataset of sentences (provided below). Your goal is to match a query with sentences from this dataset. You will implement a function to find sentences that contain specific words or phrases from a user query.
## **Steps:**
- **Data Acquisition:** Use the provided dataset of sentences.
- **Text Preparation:** Clean the dataset (remove punctuation, convert to lowercase, etc.).
- **Feature Engineering:** Use TF-IDF (Term Frequency-Inverse Document Frequency) to create numerical vectors for each sentence.
- **Search:** Match the query against the sentence vectors using cosine similarity.
- **Return Matched Sentences:** Display the top matching sentences.
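
To make the search step concrete, cosine similarity measures the angle between two vectors: 1 means they point in the same direction, 0 means they share no terms. The sketch below is a plain-Python illustration with made-up vectors and a hypothetical function name; in this task the scikit-learn function is used instead.

```python
import math

def cosine_similarity_simple(a, b):
    # Dot product of the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean length of each vector
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # An all-zero vector has no direction, so treat it as not similar
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity_simple([1, 0], [1, 0]))
print(cosine_similarity_simple([1, 0], [0, 1]))
print(cosine_similarity_simple([1, 1], [1, 0]))
```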

## **Required Libraries:**
- **NLTK** for text cleaning and preprocessing.
- **TfidfVectorizer** from **scikit-learn** for converting text to vectors.
- **Cosine Similarity** for finding similar sentences.


**Install required libraries**

In [28]:
pip install nltk scikit-learn



**Import Libraries**

In [29]:
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

**Data Preparation**:
- **Clean the dataset**: Remove punctuation, convert to lowercase, etc.
- **Tokenize the text**: Split the text into words.

In [30]:
nltk.download('punkt')

def clean_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize the text
    words = nltk.word_tokenize(text)
    return ' '.join(words)

# Clean the dataset
cleaned_dataset = [clean_text(sentence) for sentence in dataset]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Feature Engineering**:
- Use TF-IDF to create numerical vectors for each sentence.

In [31]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned_dataset)

**Search**:
- Match the query against the sentence vectors using cosine similarity.

In [32]:
def find_similar_sentences(query, tfidf_matrix, vectorizer, top_n=3):
    # Clean the query
    cleaned_query = clean_text(query)
    # Transform the query to TF-IDF vector
    query_vector = vectorizer.transform([cleaned_query])
    # Compute cosine similarity
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    # Get the top n similar sentences
    top_indices = cosine_similarities.argsort()[-top_n:][::-1]
    # Strip the trailing newline that readlines() leaves on each sentence
    return [(dataset[i].strip(), cosine_similarities[i]) for i in top_indices]

# Example query
query = "She is learning"
matched_sentences = find_similar_sentences(query, tfidf_matrix, vectorizer)
for sentence, score in matched_sentences:
    print(f"Sentence: {sentence}, Similarity Score: {score}")

Sentence: She is learning French in her spare time., Similarity Score: 0.45701242718209695
Sentence: She is learning to play the guitar in her free time., Similarity Score: 0.42180067088279133
Sentence: She is learning to play the piano with the help of a tutor., Similarity Score: 0.405846214556852
