<a href="https://colab.research.google.com/github/ranamaddy/NLP/blob/main/1_NLP_BASIC_(Text_Preprocessing)with_Python_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NLP Outline for Python**

Here is an outline of Natural Language Processing (NLP) in Python:

**Text Preprocessing**: Python provides several libraries for text preprocessing, including NLTK, spaCy, and TextBlob. These libraries can be used for tasks such as tokenization, stemming, and part-of-speech tagging.

**Language Understanding**: Python libraries such as spaCy and TextBlob provide tools for language understanding, including named entity recognition and semantic analysis.

**Language Generation**: Language generation covers tasks such as text summarization and machine translation. NLTK and TextBlob offer only limited building blocks for these tasks, which are usually handled by dedicated models and libraries.

**Sentiment Analysis:** Python libraries such as TextBlob and NLTK provide tools for sentiment analysis, allowing you to determine the emotional tone or sentiment behind text.

**Speech Processing:** Python provides libraries for speech processing, such as SpeechRecognition and pyttsx3, which can be used for tasks such as speech recognition and speech synthesis.

**Machine Learning**: Python provides several libraries for machine learning, such as scikit-learn and TensorFlow, which can be used for tasks such as text classification and language modeling.

**Web Scraping:** Python's BeautifulSoup library can be used for web scraping, allowing you to extract text data from websites for NLP tasks.

Python is a popular language for NLP because it is easy to use and has many powerful libraries. With these libraries and tools you can perform a wide range of NLP tasks in Python, from simple text preprocessing to complex machine learning models.
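To make the step from preprocessing to machine learning a little more concrete, here is a minimal bag-of-words sketch that uses only the standard library. The `bag_of_words` function and the two-sentence corpus are hypothetical and only illustrate how text can become numeric features. Libraries such as scikit-learn provide more complete tools for this.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors), with one count vector per document."""
    # Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat", "the dog sat down"]  # hypothetical toy corpus
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'down', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Each position in a vector corresponds to one vocabulary word, so documents of different lengths end up with vectors of the same size.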

**Text Preprocessing with NLTK**

Text preprocessing is a crucial step in natural language processing (NLP) that involves cleaning and preparing text data for further analysis. The NLTK (Natural Language Toolkit) library in Python provides several tools for text preprocessing, including:

**1. Tokenization:** Tokenization is the process of splitting text into individual tokens, such as words or sentences. NLTK provides several tokenizers, including `word_tokenize`, which splits text into words and punctuation marks, and `sent_tokenize`, which splits text into sentences.


In [1]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "This is a sample sentence. It contains punctuation marks!"

tokens = word_tokenize(text)
print(tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['This', 'is', 'a', 'sample', 'sentence', '.', 'It', 'contains', 'punctuation', 'marks', '!']
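To show roughly what a tokenizer does, here is a simplified regex-based sketch for words and sentences. The names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and this is not the algorithm NLTK uses. It happens to give the same word tokens for the sample text above, but it would mishandle abbreviations such as "Dr." and contractions such as "didn't".

```python
import re


def simple_word_tokenize(text):
    # Match a run of word characters, or one character that is neither
    # a word character nor whitespace (i.e. a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sent_tokenize(text):
    # Split on whitespace that directly follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "This is a sample sentence. It contains punctuation marks!"
print(simple_word_tokenize(text))
print(simple_sent_tokenize(text))
# ['This is a sample sentence.', 'It contains punctuation marks!']
```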


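**2. Stopword Removal:** Stopwords are very common words, such as "the" and "and", that carry little meaning on their own. NLTK provides stopword lists that can be used to filter them out of a list of tokens. The English list also contains negations such as "not", which the example below filters out.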

In [6]:
import nltk
from nltk.corpus import stopwords

# Download the stopword lists (does nothing if they are already present)
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

tokens = ["I", "am", "not", "interested", "in", "this", "topic", ".", "It", "is", "boring", "."]

# Keep only the tokens whose lowercase form is not a stopword
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)

['interested', 'topic', '.', 'boring', '.']
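The filtered output above no longer contains "not", so "not interested" reads as "interested". One possible workaround, sketched here under stated assumptions, is to exempt negation words from removal. `STOP_WORDS` is a hand-picked subset standing in for NLTK's English list, and `NEGATIONS` and `remove_stopwords` are hypothetical names.

```python
# Illustrative subset only; in practice use stopwords.words('english').
STOP_WORDS = {"i", "am", "not", "in", "this", "it", "is"}
NEGATIONS = {"not", "no", "never"}


def remove_stopwords(tokens, stop_words, keep=frozenset()):
    """Drop stopwords, except those explicitly listed in `keep`."""
    return [
        token for token in tokens
        if token.lower() not in stop_words or token.lower() in keep
    ]


tokens = ["I", "am", "not", "interested", "in", "this", "topic", ".", "It", "is", "boring", "."]
print(remove_stopwords(tokens, STOP_WORDS))
# ['interested', 'topic', '.', 'boring', '.']
print(remove_stopwords(tokens, STOP_WORDS, keep=NEGATIONS))
# ['not', 'interested', 'topic', '.', 'boring', '.']
```

Whether to keep negations depends on the task. They matter for sentiment analysis, for example, and much less for topic keywords.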






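**3. Stemming and Lemmatization:** Stemming and lemmatization both reduce words to a base form. A stemmer strips suffixes by rule and may produce strings that are not real words, while a lemmatizer looks words up in a dictionary and returns a valid base form (the lemma). NLTK provides several stemmers and lemmatizers, including the Porter stemmer and the WordNet lemmatizer. `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why the output below leaves "running" unchanged and turns "was" into "wa".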

In [9]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('wordnet')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

text = "He was running and eating at the same time. He has a bad habit of running in the street."

tokens = word_tokenize(text)

stemmed_tokens = [stemmer.stem(token) for token in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

print(stemmed_tokens)
print(lemmatized_tokens)


[nltk_data] Downloading package wordnet to /root/nltk_data...


['he', 'wa', 'run', 'and', 'eat', 'at', 'the', 'same', 'time', '.', 'he', 'ha', 'a', 'bad', 'habit', 'of', 'run', 'in', 'the', 'street', '.']
['He', 'wa', 'running', 'and', 'eating', 'at', 'the', 'same', 'time', '.', 'He', 'ha', 'a', 'bad', 'habit', 'of', 'running', 'in', 'the', 'street', '.']
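The difference between the two techniques can be shown with a deliberately tiny toy: a rule-based suffix stripper next to a dictionary lookup that depends on the part of speech. `toy_stem`, `toy_lemmatize`, and the `LEMMAS` table are hypothetical and far simpler than the Porter algorithm or WordNet. They only show why a stemmer can return non-words and why a lemmatizer needs to know the part of speech.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hand-written (word, part-of-speech) -> lemma table; illustrative only.
LEMMAS = {
    ("running", "v"): "run",
    ("eating", "v"): "eat",
    ("was", "v"): "be",
    ("has", "v"): "have",
}


def toy_lemmatize(word, pos="n"):
    """Look the word up for the given part of speech; default to noun."""
    return LEMMAS.get((word.lower(), pos), word)


print(toy_stem("running"))            # 'runn' (not a real word)
print(toy_lemmatize("running"))       # 'running' (treated as a noun)
print(toy_lemmatize("running", "v"))  # 'run'
print(toy_lemmatize("was", "v"))      # 'be'
```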


**4. Part-of-Speech Tagging:** Part-of-speech (POS) tagging labels each word in a text with its part of speech, such as noun, verb, or adjective. NLTK provides a pre-trained POS tagger, `pos_tag`, that works on a list of tokens. Its tags follow the Penn Treebank tag set, for example `NNP` for a proper noun and `VBZ` for a third-person singular present-tense verb.

In [11]:
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text = "John likes to play football in the park."

tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)

print(tagged_tokens)


[('John', 'NNP'), ('likes', 'VBZ'), ('to', 'TO'), ('play', 'VB'), ('football', 'NN'), ('in', 'IN'), ('the', 'DT'), ('park', 'NN'), ('.', '.')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
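Once tokens are tagged, the `(word, tag)` pairs are ordinary Python tuples and can be filtered or counted directly. The sketch below copies the tagged output printed above into a literal so that it runs without NLTK. It relies only on the fact that noun tags start with `NN` and verb tags with `VB` in the Penn Treebank tag set.

```python
from collections import Counter

# Copied from the pos_tag output shown above.
tagged_tokens = [
    ("John", "NNP"), ("likes", "VBZ"), ("to", "TO"), ("play", "VB"),
    ("football", "NN"), ("in", "IN"), ("the", "DT"), ("park", "NN"), (".", "."),
]

# Group by the tag prefix: NN* tags are nouns, VB* tags are verbs.
nouns = [word for word, tag in tagged_tokens if tag.startswith("NN")]
verbs = [word for word, tag in tagged_tokens if tag.startswith("VB")]
tag_counts = Counter(tag for _, tag in tagged_tokens)

print(nouns)             # ['John', 'football', 'park']
print(verbs)             # ['likes', 'play']
print(tag_counts["NN"])  # 2
```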


**5. Named Entity Recognition:** Named entity recognition (NER) identifies and labels named entities in a text, such as people, organizations, and locations. NLTK provides a pre-trained named entity chunker, `ne_chunk`, that works on POS-tagged tokens. It returns a tree in which each recognized entity is a subtree labeled with its type, such as `PERSON` or `GPE` (geo-political entity).

In [12]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

text = "Barack Obama was the 44th President of the United States of America."

tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)
named_entities = ne_chunk(tagged_tokens)

print(named_entities)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  the/DT
  44th/JJ
  President/NNP
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  of/IN
  (GPE America/NNP)
  ./.)


[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
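To turn the printed tree into a flat list of entities, one option is to pull out every labeled group. In a notebook, the tree object returned by `ne_chunk` can be traversed directly. The sketch below instead parses the printed form shown above so that it runs without NLTK. `extract_entities` is a hypothetical helper and assumes that entity groups are not nested.

```python
import re

# The printed ne_chunk output from above, copied as text.
tree_text = """(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  the/DT
  44th/JJ
  President/NNP
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  of/IN
  (GPE America/NNP)
  ./.)"""


def extract_entities(text):
    """Return (label, phrase) pairs for each innermost parenthesized group."""
    entities = []
    for label, body in re.findall(r"\((\w+)\s+([^()]+)\)", text):
        # Each item looks like 'word/TAG'; keep only the word part.
        words = [item.rsplit("/", 1)[0] for item in body.split()]
        entities.append((label, " ".join(words)))
    return entities


print(extract_entities(tree_text))
# [('PERSON', 'Barack'), ('PERSON', 'Obama'), ('GPE', 'United States'), ('GPE', 'America')]
```

The result also makes the chunker's mistakes easy to see: "Barack" and "Obama" are reported as two separate people, and "United States of America" is split into two locations.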


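**6. Word Embeddings:** Word embeddings are dense vector representations of words that capture aspects of their meaning. NLTK itself does not train word embeddings. Models such as Word2Vec and GloVe are usually trained or loaded with other libraries, such as gensim. The example below tokenizes a text with NLTK and trains a small Word2Vec model with gensim. A two-sentence corpus is far too small to produce meaningful vectors, so the example only demonstrates the API.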

In [13]:
import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

text = "I like to eat pizza and pasta. Pizza is my favorite food."

sentences = nltk.sent_tokenize(text)
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

model = Word2Vec(tokenized_sentences, min_count=1)

pizza_vector = model.wv['pizza']
print(pizza_vector)


[-9.5766364e-03  8.9430977e-03  4.1696127e-03  9.2342924e-03
  6.6383476e-03  2.9204944e-03  9.8048961e-03 -4.4205645e-03
 -6.8064686e-03  4.2235320e-03  3.7343241e-03 -5.6667118e-03
  9.7071612e-03 -3.5584576e-03  9.5499996e-03  8.3605840e-04
 -6.3371290e-03 -1.9779752e-03 -7.3787277e-03 -2.9853259e-03
  1.0427843e-03  9.4836736e-03  9.3608815e-03 -6.5972861e-03
  3.4747063e-03  2.2763265e-03 -2.4899244e-03 -9.2293778e-03
  1.0275932e-03 -8.1658373e-03  6.3170199e-03 -5.8003748e-03
  5.5417293e-03  9.8350449e-03 -1.5834211e-04  4.5298999e-03
 -1.8063794e-03  7.3588388e-03  3.9380039e-03 -9.0122502e-03
 -2.4018812e-03  3.6301538e-03 -1.0429607e-04 -1.1998524e-03
 -1.0579068e-03 -1.6744122e-03  6.0184632e-04  4.1674497e-03
 -4.2544161e-03 -3.8310036e-03 -5.7033147e-05  2.7015869e-04
 -1.7265114e-04 -4.7848234e-03  4.3171979e-03 -2.1732899e-03
  2.1029287e-03  6.6403928e-04  5.9715942e-03 -6.8387738e-03
 -6.8197562e-03 -4.4776555e-03  9.4360467e-03 -1.5933317e-03
 -9.4336243e-03 -5.42717
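Embedding vectors are usually compared with cosine similarity, which measures the angle between two vectors regardless of their length. The sketch below implements it with the standard library. The three-dimensional vectors are made-up values chosen for illustration and are not output from the model above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, not real Word2Vec output.
pizza = [0.9, 0.1, 0.0]
pasta = [0.8, 0.2, 0.1]
street = [0.0, 0.2, 0.9]

print(cosine_similarity(pizza, pasta))   # close to 1: similar direction
print(cosine_similarity(pizza, street))  # close to 0: nearly orthogonal
```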

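**Complete Preprocessing Example:** The individual steps above can be combined into a single pass over a text.

In [16]: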
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download required resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Define text to preprocess
text = "The quick brown foxes jumped over the lazy dog's tail. The dogs didn't seem to mind."

# Convert text to lowercase
text = text.lower()

# Tokenize the text
tokens = word_tokenize(text)

# Remove stopwords
stopwords_list = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stopwords_list]

# Lemmatize the remaining tokens
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

# Print the result
print(lemmatized_tokens)
# The code above lowercases the text, tokenizes it, removes stopwords,
# and lemmatizes the remaining tokens, then prints the result.

['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog', "'s", 'tail', '.', 'dog', "n't", 'seem', 'mind', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
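The printed result still contains punctuation and clitic fragments such as `'s` and `n't`. A possible follow-up step, sketched here on a copy of that output, is to keep only alphabetic tokens and count them. Note that this also discards the negation carried by `n't`, which may or may not be acceptable for a given task.

```python
from collections import Counter

# Copied from the pipeline output shown above.
lemmatized_tokens = ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog', "'s",
                     'tail', '.', 'dog', "n't", 'seem', 'mind', '.']

# str.isalpha() is False for '.', "'s" and "n't", so those are dropped.
words_only = [token for token in lemmatized_tokens if token.isalpha()]
frequencies = Counter(words_only)

print(words_only)
# ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog', 'tail', 'dog', 'seem', 'mind']
print(frequencies.most_common(1))  # [('dog', 2)]
```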
