<a href="https://colab.research.google.com/github/Hesam-s/Preprocessing/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
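import nltk

# Punkt supplies the sentence-boundary models used by sent_tokenize
# (and, indirectly, by word_tokenize)
nltk.download('punkt')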

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
paragraph = """
Preprocessing and lemmatization are essential steps in Natural Language Processing (NLP) that contribute significantly to enhancing the accuracy and efficiency of text analysis tasks. Preprocessing involves cleaning and transforming raw text data into a format that is more suitable for analysis, while lemmatization aims to reduce words to their base or dictionary form. Together, these processes help standardize text data, improve feature extraction, and ultimately enhance the performance of NLP models.

In the realm of preprocessing, the first step typically involves removing any unnecessary characters, such as punctuation marks, special symbols, and numerical digits, that do not carry significant semantic meaning for the analysis task at hand. This step simplifies the text data and reduces noise, making it easier for subsequent processing steps to focus on the relevant linguistic content. Additionally, preprocessing often includes converting the entire text to lowercase to ensure uniformity in word representations and to prevent the model from treating words with different cases as distinct entities.

Furthermore, another crucial aspect of preprocessing is tokenization, which involves breaking down the text into individual tokens, usually words or phrases. Tokenization serves as the foundation for subsequent analysis steps, enabling the NLP model to process text at the granular level of individual linguistic units. This step is particularly important for tasks such as sentiment analysis, part-of-speech tagging, and named entity recognition, where understanding the meaning and context of each word is essential for accurate analysis.

In addition to tokenization, preprocessing often includes the removal of stopwords—commonly occurring words that do not carry significant semantic meaning, such as "the," "is," "and," etc. Removing stopwords helps reduce the dimensionality of the feature space and focuses the model's attention on the words that are more indicative of the text's content. However, the list of stopwords may vary depending on the specific context and domain of the text data, necessitating customization for optimal performance.

After preprocessing, lemmatization plays a critical role in further standardizing the text data by reducing words to their base or dictionary forms, known as lemmas. Unlike stemming, which simply chops off affixes to derive the root form of words, lemmatization takes into account the morphological analysis of words and applies linguistic rules to accurately identify their canonical forms. For example, the word "running" would be lemmatized to "run," and "better" would be reduced to "good." By converting words to their lemmas, lemmatization helps consolidate different inflected forms of words into a common representation, thereby improving the consistency and interpretability of the text data.

Moreover, lemmatization contributes to reducing the sparsity of the feature space and alleviating data sparsity issues, which can negatively impact the performance of NLP models, especially in tasks with limited training data. By grouping together variant forms of words under their respective lemmas, lemmatization enhances the generalization capabilities of the model and facilitates better recognition of patterns and relationships within the text data.

In conclusion, preprocessing and lemmatization are indispensable steps in NLP that serve to refine and standardize raw text data for effective analysis. Through techniques such as cleaning, tokenization, removal of stopwords, and lemmatization, text data is transformed into a structured and coherent format that enhances the accuracy, efficiency, and interpretability of NLP models across a wide range of applications. By incorporating these preprocessing and lemmatization techniques into the NLP pipeline, researchers and practitioners can unlock the full potential of text data for extracting insights, understanding semantics, and building robust language understanding systems."""

In [20]:
# Tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)
sentences

["Essay on Theme of Curiosity in H. G. Wells 'The Time Machine'\nThe Time Traveler started his story at the time when he finished his time machine.",
 '“I suppose a suicide who holds a pistol to\nhis skull feels much the same wonder at what will come next as I felt then.” (Wells 15).',
 'He is very nervous since he is the test\nsubject of his creation, here as observed that human experimentation was accepted in the era of H.G.',
 'Wells, in the 1800s, and\nunlike now, these experiments are inhumane.',
 'Upon arrival, he saw a white Sphinx statue, if portrayed in real life the meaning of\nSphinx is a symbol of mystery and benevolence.',
 'Such a symbol may foreshadow trial and hardship in his adventure into the\nworld he entered.',
 'Knowing that he is alone in his adventure, he panics and fears what might happen to him.',
 '“I felt naked in a\nstrange world.',
 'I felt as if perhaps a bird may feel in the clear air, knowing the hawk wings above and will swoop.',
 'My fear grew\nto fren
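
A plain punctuation split shows why a trained sentence tokenizer is used here. The sketch below is a baseline for comparison only: `naive_sent_split` and the sample string are hypothetical and not part of NLTK. It splits after every `.`, `!` or `?` that is followed by whitespace, so it also breaks after abbreviations.

```python
import re

def naive_sent_split(text):
    """Hypothetical baseline: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sample = "Text is split into units, e.g. sentences. Each unit is then analyzed."
result = naive_sent_split(sample)
print(result)
# ['Text is split into units, e.g.', 'sentences.', 'Each unit is then analyzed.']
```

`nltk.sent_tokenize` relies on the Punkt models downloaded earlier, which are designed to recognize many abbreviations of this kind. Initials can still trip it up, as the split between `H.G.` and `Wells` in the output above suggests.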

In [21]:
# Tokenizing words
words = nltk.word_tokenize(paragraph)
words

['Essay',
 'on',
 'Theme',
 'of',
 'Curiosity',
 'in',
 'H.',
 'G.',
 'Wells',
 "'The",
 'Time',
 "Machine'",
 'The',
 'Time',
 'Traveler',
 'started',
 'his',
 'story',
 'at',
 'the',
 'time',
 'when',
 'he',
 'finished',
 'his',
 'time',
 'machine',
 '.',
 '“',
 'I',
 'suppose',
 'a',
 'suicide',
 'who',
 'holds',
 'a',
 'pistol',
 'to',
 'his',
 'skull',
 'feels',
 'much',
 'the',
 'same',
 'wonder',
 'at',
 'what',
 'will',
 'come',
 'next',
 'as',
 'I',
 'felt',
 'then.',
 '”',
 '(',
 'Wells',
 '15',
 ')',
 '.',
 'He',
 'is',
 'very',
 'nervous',
 'since',
 'he',
 'is',
 'the',
 'test',
 'subject',
 'of',
 'his',
 'creation',
 ',',
 'here',
 'as',
 'observed',
 'that',
 'human',
 'experimentation',
 'was',
 'accepted',
 'in',
 'the',
 'era',
 'of',
 'H.G',
 '.',
 'Wells',
 ',',
 'in',
 'the',
 '1800s',
 ',',
 'and',
 'unlike',
 'now',
 ',',
 'these',
 'experiments',
 'are',
 'inhumane',
 '.',
 'Upon',
 'arrival',
 ',',
 'he',
 'saw',
 'a',
 'white',
 'Sphinx',
 'statue',
 ',',
 'i

In [7]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [25]:
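stemmer = PorterStemmer()
# Build the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

# Stemming: drop stopwords, then reduce each remaining token to its stem.
# Note: this overwrites `sentences` in place with the stemmed text.
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # The stopword list is lowercase, so compare the lowercased token
  words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]
  sentences[i] = ' '.join(words)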
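
NLTK's English stopword list contains only lowercase entries, so a case-sensitive membership test lets capitalized tokens such as `The` through. The sketch below illustrates the effect with a hypothetical three-word stop list and a made-up token list; neither comes from the notebook's data.

```python
# Hypothetical mini stop list; NLTK's real English list is much longer
# but, like this one, contains only lowercase entries.
stop_words = {"the", "is", "and"}

tokens = ["The", "model", "is", "fast", "and", "the", "data", "is", "clean"]

# Case-sensitive test: "The" is not equal to "the", so it is kept.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing before the test removes it as intended.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'model', 'fast', 'data', 'clean']
print(case_insensitive)  # ['model', 'fast', 'data', 'clean']
```

Building the set once, outside the loop, also avoids reloading the stopword list for every token.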

In [10]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [26]:
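lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once, reuse for every token

# Start again from the raw paragraph, since `sentences` now holds stemmed text
sentencess = nltk.sent_tokenize(paragraph)

# Lemmatization: lemmatize() assumes pos='n' (noun) unless told otherwise,
# so verbs such as "running" are left unchanged without pos='v'
for i in range(len(sentencess)):
  words = nltk.word_tokenize(sentencess[i])
  words = [lemmatizer.lemmatize(word) for word in words if word.lower() not in stop_words]
  sentencess[i] = ' '.join(words)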
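
The difference between the two cells above can be sketched with toy stand-ins. This is an illustration only: `toy_stem`, `toy_lemmatize`, the suffix list, and the lookup table are all hypothetical and are not the Porter or WordNet algorithms. The stemmer strips a suffix by rule, while the lemmatizer looks the word up in a dictionary of known forms.

```python
# Toy illustration only: neither function is NLTK's algorithm.
SUFFIXES = ("ing", "ed", "s")  # hypothetical suffix list
LEMMAS = {"running": "run", "better": "good", "studies": "study"}  # hypothetical lookup

def toy_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return LEMMAS.get(word, word)

for w in ["running", "better", "studies"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# running runn run
# better better good
# studies studie study
```

Rule-based stripping is fast but can produce non-words such as `runn`. A dictionary lookup returns real words but only for forms it knows, which is one reason a real lemmatizer needs a resource like WordNet.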

In [28]:
# Cleaning the texts
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once, reuse for every token
sentencess = nltk.sent_tokenize(paragraph)
corpus = []
for i in range(len(sentencess)):
  # Keep letters only; read the raw sentence, not the already-stemmed `sentences`
  review = re.sub('[^a-zA-Z]', ' ', sentencess[i])
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  corpus.append(review)

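# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
# Keep at most the 1500 most frequent terms as features
cv = CountVectorizer(max_features = 1500)
# Dense count matrix: one row per sentence, one column per term
x = cv.fit_transform(corpus).toarray()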
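
To make the count matrix concrete, here is a minimal hand-rolled sketch of what a bag-of-words model computes. It assumes `CountVectorizer`'s default behavior (lowercasing, tokens of two or more word characters, columns in alphabetical order). The two-document `docs` list is hypothetical, and the code does not call scikit-learn.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, already cleaned and stemmed.
docs = ["token split text", "lemma reduc text text"]

# Same default token rule as CountVectorizer: runs of 2+ word characters.
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]

# Vocabulary: sorted unique terms; the list position is the column index.
vocab = sorted({t for doc in tokenized for t in doc})

# One row per document, one column per term, cell = term count.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[t] for t in vocab])

print(vocab)   # ['lemma', 'reduc', 'split', 'text', 'token']
print(matrix)  # [[0, 0, 1, 1, 1], [1, 1, 0, 2, 0]]
```

Word order is discarded: each sentence is reduced to a row of counts, which is why the representation is called a "bag" of words.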

In [29]:
sentencess = nltk.sent_tokenize(paragraph)
stop_words = set(stopwords.words('english'))  # build once, reuse for every token
corpus = []
for i in range(len(sentencess)):
  # Keep letters only; read the raw sentence, not the already-stemmed `sentences`
  review = re.sub('[^a-zA-Z]', ' ', sentencess[i])
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  corpus.append(review)

# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
# Dense matrix: one row per sentence, each row an L2-normalized TF-IDF vector
x = tf.fit_transform(corpus).toarray()
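
The weighting can be sketched by hand. This assumes `TfidfVectorizer`'s default settings (raw term counts, `smooth_idf=True`, `norm='l2'`), under which the inverse document frequency of a term $t$ is $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$ for $n$ documents. The tiny `docs` corpus and the helper `tfidf_row` are hypothetical, and the code does not call scikit-learn.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized corpus of two documents.
docs = [["token", "split", "text"], ["lemma", "reduc", "text", "text"]]

vocab = sorted({t for d in docs for t in d})
n = len(docs)

# Document frequency: in how many documents each term appears.
df = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty rows
    return [v / norm for v in raw]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print([round(v, 3) for v in row])
```

Here `text` appears in both documents, so its idf is exactly 1, the minimum. In the first row it therefore gets a lower weight than `token` or `split`, which occur in only one document.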