#  NLP Pipeline
## Text Preprocessing

1. Data Acquisition
2. Text Processing
    * Text cleanup
    * Basic Preprocessing
    * Advanced Preprocessing
3. Feature Engineering
4. Modeling
5. Deployment

# Data Acquisition

## Dataset
    * CSV file (table)
    * Database -- Data Agensis
    * Less Data -- Data Augmentation

# Text Processing
1. Text cleanup
2. Basic Preprocessing
3. Advanced Preprocessing

# Text cleanup

1. html tag
2. emoji
3. spelling check

In [1]:
import pandas as pd

**pandas is an open-source data manipulation and analysis library for Python. It provides data structures and functions that make working with structured data (CSV files, Excel spreadsheets, SQL databases, and more) much easier and more efficient. It is widely used in data science, data analysis, and data manipulation tasks.**

## Here are some key components and concepts of the pandas library:

1. Data Structures:

  * Series: A one-dimensional labeled array, similar to a column in a spreadsheet or a list in Python.
  * DataFrame: A two-dimensional labeled data structure with columns that can be of different types (numeric, string, boolean, etc.). It's similar to a table in a database or a spreadsheet.
  
2. Key Features:

  * Data Reading and Writing: pandas can read data from various file formats like CSV, Excel, SQL databases, and more. It also supports writing data to these formats.
  * Indexing and Selection: pandas provides flexible ways to index and select data, including label-based indexing, integer-based indexing, and boolean indexing.
  * Data Cleaning and Transformation: pandas allows you to handle missing data, duplicate data, and perform operations like filtering, sorting, grouping, and reshaping.
  * Aggregation and Statistics: You can easily compute summary statistics, apply functions to groups of data, and perform aggregations.
  * Merging and Joining: pandas supports combining data from different sources through operations like merging and joining.
  * Time Series Data: pandas has robust support for handling time series data, including date and time parsing, resampling, and time-based calculations.
  * Visualization: While pandas itself doesn't provide advanced visualization capabilities, it can integrate well with libraries like Matplotlib and Seaborn for data visualization.
3. Getting Started:
To use pandas, you need to import the library. The common import convention is:  
 **import pandas as pd**
 
4. Basic Usage:

  * Creating a DataFrame from data: df = pd.DataFrame(data)
  * Reading data from a CSV file: df = pd.read_csv('data.csv')
  * Displaying the first few rows: df.head()
  * Indexing and selecting data: df['column_name'] or df.loc[row_label]
  * Applying functions: df['column_name'].apply(func)
  * Grouping data: df.groupby('group_column').agg(func)
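
The calls listed above can be strung together on a tiny in-memory table. This is only a sketch: the four reviews and the `length` column are made up for illustration and are not part of the IMDB dataset used later.

```python
import pandas as pd

# Hypothetical toy data, standing in for a real CSV file.
data = {
    "review": ["Great film", "Too slow", "Loved it", "Boring plot"],
    "sentiment": ["positive", "negative", "positive", "negative"],
}
df = pd.DataFrame(data)            # create a DataFrame from a dict of columns

first_rows = df.head(2)            # first two rows
one_column = df["review"]          # select a column (a Series)
one_row = df.loc[0]                # select a row by its index label

df["length"] = df["review"].apply(len)                   # apply a function to each value
mean_length = df.groupby("sentiment")["length"].mean()   # aggregate per group
print(mean_length)
```

Grouping on `sentiment` and averaging `length` gives one number per class, which is the same pattern you would use to summarize any per-review feature.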

In [2]:
df = pd.read_csv('IMDB Dataset.csv')

In [3]:
df.shape

(50000, 2)

In [4]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
df['review'][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [6]:
df['review'] = df['review'].str.lower()

# Remove HTML Tags

# Importing the re Module:
**The code begins by importing the re module, which stands for "regular expressions." This module provides functions for working with regular expressions, which are powerful tools for pattern matching and manipulation of strings.**

In [7]:
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [8]:
text = "<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"

In [9]:
remove_html_tags(text)

' Movie 1 Actor - Aamir Khan Click here to download'
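
The regular expression works well for simple markup like the sample above, but it leaves character references such as `&amp;` untouched. As a possible alternative, here is a minimal sketch built on the standard library's `html.parser`; the names `TagStripper` and `strip_tags` are hypothetical helpers, not part of the notebook.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are skipped.
        self.parts.append(data)


def strip_tags(text):
    parser = TagStripper()
    parser.feed(text)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.parts)


sample = "<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"
print(strip_tags(sample))
```

On the sample string this should give the same text as `remove_html_tags`; the difference shows up only on input that contains entities or less regular markup.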

In [10]:
df['review'] = df['review'].apply(remove_html_tags)

In [11]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

# Remove URL

In [12]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [13]:
text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'

In [14]:
remove_url(text4)

'For notebook click  to search check '

In [15]:
remove_url(text2)

'Check out my notebook '

# Remove Punctuation

In [16]:
import string,time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [17]:
exclude = string.punctuation

In [18]:
def remove_punc(text):
    for char in exclude:
        text = text.replace(char,'')
    return text

In [19]:
text = 'string. With. Punctuation?'

In [20]:
start = time.time()
print(remove_punc(text))
time1 = time.time() - start
print(time1*50000)

string With Punctuation
0.0


In [21]:
def remove_punc1(text):
    return text.translate(str.maketrans('', '', exclude))

In [22]:
start = time.time()
remove_punc1(text)
time2 = time.time() - start
print(time2*50000)

0.0


In [23]:
time1/time2

ZeroDivisionError: float division by zero
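
Both timings print 0.0 because a single call finishes faster than `time.time()` can resolve on this machine, and dividing by that zero raises the error. One way to get a usable comparison is to repeat each call many times with the standard `timeit` module. The sketch below assumes 10,000 repetitions is enough; the function names are renamed copies of the two versions above.

```python
import string
import timeit

exclude = string.punctuation
table = str.maketrans('', '', exclude)  # build the translation table once


def remove_punc_loop(text):
    # One str.replace call per punctuation character.
    for char in exclude:
        text = text.replace(char, '')
    return text


def remove_punc_translate(text):
    # A single pass over the string using the prebuilt table.
    return text.translate(table)


text = 'string. With. Punctuation?'
loop_time = timeit.timeit(lambda: remove_punc_loop(text), number=10_000)
translate_time = timeit.timeit(lambda: remove_punc_translate(text), number=10_000)

print(remove_punc_loop(text), '|', remove_punc_translate(text))
print(f'loop: {loop_time:.4f}s  translate: {translate_time:.4f}s')
```

The absolute numbers depend on the machine, so only the ratio between the two is worth reading.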

In [24]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

In [25]:
remove_punc1(df['review'][5])

'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'

# Chat Words Conversion

In [26]:
# Keys are upper case because chat_conversion looks words up with w.upper().
chat_words = {
    'LOL': 'laugh out loud',
    'BRB': 'be right back',
    'IMHO': 'in my humble opinion',
    'BTW': 'by the way',
    'ROFL': 'rolling on the floor laughing',
    'TTYL': 'talk to you later',
    'FYI': 'for your information',
    'OMG': 'oh my god',
    'IDK': "I don't know",
    'BFF': 'best friends forever',
    'IMO': 'in my opinion',
    'TMI': 'too much information',
    'JK': 'just kidding',
    'GTG': 'got to go',
    'THX': 'thanks',
    'NP': 'no problem',
    'WTF': 'what the fudge',
    'ICYMI': 'in case you missed it',
    'AFK': 'away from keyboard',
    'DM': 'direct message',
    'SMH': 'shaking my head',
    'LMAO': 'laughing my ass off',
    'GR8': 'great',
    'OMW': 'on my way',
    'ROFLMAO': 'rolling on the floor laughing my ass off',
    'TL;DR': "too long; didn't read",
    'YOLO': 'you only live once',
    'FOMO': 'fear of missing out',
    'NSFW': 'not safe for work',
    'OOTD': 'outfit of the day',
    'IRL': 'in real life',
    'AMA': 'ask me anything',
    'BAE': 'before anyone else',
    'RN': 'right now',
    'TBH': 'to be honest',
    'NVM': 'never mind',
    'FWIW': "for what it's worth",
    'POV': 'point of view'
}


In [27]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [29]:
chat_conversion('IMHO he is the best BRB')

'in my humble opinion he is the best be right back'
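
Because `chat_conversion` splits on whitespace, an abbreviation with punctuation attached (for example `brb!`) is not found in the dictionary. A possible workaround, sketched here with a regular expression and a small illustrative subset of the dictionary, is to match abbreviations on word boundaries instead. The names `chat_words_subset`, `chat_pattern`, and `expand_chat_words` are hypothetical.

```python
import re

# Small illustrative subset of the chat_words dictionary above.
chat_words_subset = {
    'IMHO': 'in my humble opinion',
    'BRB': 'be right back',
    'LOL': 'laugh out loud',
}

# \b anchors stop 'LOL' from matching inside a longer word such as 'lollipop'.
chat_pattern = re.compile(
    r'\b(' + '|'.join(re.escape(key) for key in chat_words_subset) + r')\b',
    flags=re.IGNORECASE,
)


def expand_chat_words(text):
    # Look up each match in upper case, mirroring chat_conversion.
    return chat_pattern.sub(lambda m: chat_words_subset[m.group(0).upper()], text)


print(expand_chat_words('IMHO he is the best, brb!'))
```

Case-insensitive matching has a cost: a short key that is also an ordinary word would be expanded too, so the dictionary needs to be chosen with that in mind.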

# Spelling Correction

In [30]:
!pip install textblob



In [31]:
from textblob import TextBlob

***This line of code imports the TextBlob class from the textblob library. The TextBlob class is the main entry point for working with textual data using the TextBlob library. It provides various methods for text analysis, such as sentiment analysis, part-of-speech tagging, noun phrase extraction, and more.***

In [32]:
incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.'

textBlb = TextBlob(incorrect_text)
textBlb.correct().string

'certain conditions during several generations are modified in the same manner.'

In [33]:
text = "This is a sample text for analysis."
blob = TextBlob(text)

In [34]:
sentiment = blob.sentiment
print(sentiment)

Sentiment(polarity=0.0, subjectivity=0.0)


In [35]:
blob.correct().string

'His is a sample text for analysis.'

# Removing Stop Words

In [36]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [37]:
from nltk.corpus import stopwords

In [38]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [39]:
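# Build the stop-word set once; calling stopwords.words() inside the loop reloads the list for every word.
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Each stop word is replaced by an empty string, which leaves extra spaces in the result.
    # The check is case-sensitive, so a capitalized word such as 'This' is kept.
    new_text = ['' if word in stop_words else word for word in text.split()]
    return " ".join(new_text)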

In [40]:
remove_stopwords('probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times')


'probably  all-time favorite movie,  story  selflessness, sacrifice  dedication   noble cause,    preachy  boring.   never gets old, despite   seen   15   times'

In [41]:
remove_stopwords("This is an example text containing some words that are common stopwords. Stopwords are often removed to improve text analysis.")

'This   example text containing  words   common stopwords. Stopwords  often removed  improve text analysis.'
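
The two outputs above show the side effects of this approach: removed words leave double spaces behind, and the capitalized "This" survives. A possible variant, sketched here with a hand-picked stop-word set so that it runs without downloading the NLTK corpus, drops the words entirely and compares in lower case. The name `remove_stopwords_compact` and the contents of `stop_words` are assumptions for illustration.

```python
# Hypothetical, hand-picked stop words; in practice this would be
# set(stopwords.words('english')) from NLTK.
stop_words = {'this', 'is', 'an', 'some', 'that', 'are', 'to'}


def remove_stopwords_compact(text):
    # Skip stop words instead of replacing them with '', and ignore case.
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(kept)


print(remove_stopwords_compact(
    'This is an example text containing some words that are common stopwords.'
))
```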

In [42]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


# Remove Emoji

In [50]:
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; very broad range that also covers CJK and other non-emoji text
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

 ***import re: This line imports the re module, which provides support for regular expressions in Python.***

***def remove_emoji(text):: This defines a function named remove_emoji that takes a single argument text, which is the input text from which you want to remove emojis.***

***emoji_pattern = re.compile("[ ... ]+", flags=re.UNICODE): This line defines a regular expression pattern to match emojis. The pattern is a combination of Unicode code point ranges for different categories of emojis. Unicode code points are numerical representations of characters, including emojis.***

***The code includes multiple ranges of Unicode code points corresponding to different categories of emojis. For example, u"\U0001F600-\U0001F64F" represents a range of emoticon emojis. Other ranges are specified for symbols, transport symbols, flags, and more.***

***flags=re.UNICODE: This flag selects Unicode matching. In Python 3 it is already the default for string patterns, so it is redundant here but harmless.***

***return emoji_pattern.sub(r'', text): This line uses the sub() method of the emoji_pattern regular expression to substitute matched emojis with an empty string (r''). This effectively removes the emojis from the input text.***

In [55]:
remove_emoji("Loved the Nature. It was 😘😘")

'Loved the Nature. It was '

In [56]:
remove_emoji("Lmao 😂😂")

'Lmao '
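
One caveat with the hand-written ranges: the last one runs from U+24C2 to U+1F251, which is wide enough to include CJK characters, so text in those scripts would be stripped as well. As a rough alternative, the sketch below filters on the Unicode general category instead, using the standard `unicodedata` module. It assumes that dropping every character in category `So` (Symbol, other) is acceptable, which also removes signs such as the copyright symbol; `remove_symbol_chars` is a hypothetical name.

```python
import unicodedata


def remove_symbol_chars(text):
    # Keep every character whose Unicode general category is not 'So' (Symbol, other).
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'So')


print(remove_symbol_chars('Loved the Nature. It was 😘😘'))
print(remove_symbol_chars('日本語 😂'))
```

This does not handle every emoji sequence (skin-tone modifiers and joiners belong to other categories), so treat it as a starting point rather than a complete filter.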

In [58]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
Installing collected packages: emoji
Successfully installed emoji-2.8.0
Note: you may need to restart the kernel to use updated packages.


## In this example, the emoji library is used to convert emojis to text representations, text representations to emojis, and extract emojis from text.

In [59]:
import emoji
print(emoji.demojize('Python is 🔥'))

Python is :fire:


In [62]:
import emoji
print(emoji.demojize('Loved the movie. It was 😊'))

Loved the movie. It was  :smiling_face_with_smiling_eyes:


In [66]:
text_with_emojis = "Python is 🔥 and fun! 😄"
converted_text = emoji.demojize(text_with_emojis)
print(converted_text)


Python is :fire: and fun! :grinning_face_with_smiling_eyes:


In [67]:
text_with_aliases = "I :heart: Python :rocket:"
emoji_text = emoji.emojize(text_with_aliases)
print(emoji_text)

I :heart: Python 🚀

**Note: :heart: is left as text because it is an alias rather than the name the library uses by default. Calling emoji.emojize(text_with_aliases, language='alias') should convert it as well.**


In [75]:
import re
import emoji

In [79]:
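def extract_emojis(text):
    # Find :alias: style shortcodes only; literal emoji characters such as 👋 are not matched.
    emoji_pattern = re.compile(r':[^:]+:')
    emoji_aliases = emoji_pattern.findall(text)
    # Convert each shortcode the library recognizes into its emoji character.
    emojis = [emoji.emojize(alias) for alias in emoji_aliases]
    return emojis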

text_to_extract = "👋 Hello, 🌍 world! :rocket:"
extracted_emojis = extract_emojis(text_to_extract)
print(extracted_emojis)

['🚀']


# Tokenization
**Tokenization is the process of breaking a text or a sequence of characters into smaller units called tokens. Tokens are the fundamental building blocks of text analysis and natural language processing (NLP). In English, tokens are typically words, but they can also be sentences, phrases, or even individual characters, depending on the context and task.**

## Key Points:

**1. Word Tokenization: In word tokenization, a text is divided into individual words. For example, the sentence "I love programming" would be tokenized into the tokens "I," "love," and "programming."**

**2. Sentence Tokenization: In sentence tokenization, a text is divided into sentences. For example, the paragraph "NLTK is great. It helps with NLP tasks." would be tokenized into the sentences "NLTK is great." and "It helps with NLP tasks."**

**3. Importance: Tokenization is a crucial preprocessing step in NLP. It provides the foundation for various tasks, including sentiment analysis, language modeling, machine translation, and more.**

**4. Whitespace and Punctuation: Tokenization is often based on whitespace (spaces, tabs, line breaks) and punctuation marks. Punctuation can be retained as separate tokens or combined with adjacent words, depending on the application.**

**5. Ambiguities: Tokenization can be challenging for languages with complex word structures, such as agglutinative languages (e.g., Turkish) and languages with no clear word boundaries (e.g., Chinese).**

**6. Subword Tokenization: Subword tokenization breaks down words into smaller subword units, which can be useful for handling rare or out-of-vocabulary words and for applying techniques like Byte-Pair Encoding (BPE).**

**7. NLTK and Other Libraries: Libraries like NLTK and SpaCy provide tools for tokenization, and different libraries might use different strategies and algorithms. (Python's built-in tokenize module is a different thing: it tokenizes Python source code, not natural language.)**

**Example:
Consider the sentence: "NLTK is a library for natural language processing." Word tokenization would result in the following tokens: "NLTK," "is," "a," "library," "for," "natural," "language," "processing."**

**And for sentence tokenization, the text "NLTK is great. It helps with NLP tasks." would be tokenized into the sentences: "NLTK is great." and "It helps with NLP tasks."**

## 1. Using the split function

In [81]:
# word tokenization
sent1 = 'I am going to dhaka'
sent1.split()

['I', 'am', 'going', 'to', 'dhaka']

In [82]:
# sentence tokenization
sent2 = 'I am going to Dhaka. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to Dhaka',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [84]:
# Problems with split function
sent3 = 'I am going to dhaka!'
sent3.split()

['I', 'am', 'going', 'to', 'dhaka!']

In [85]:
sent4 = 'Where do think I should go? I have 3 day holiday'
sent4.split('.')

['Where do think I should go? I have 3 day holiday']
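
Splitting on a literal period misses sentences that end in `?` or `!`. One possible fix, sketched here with the `re` module covered in the next section, is to split on whitespace that follows any sentence-ending mark. The lookbehind keeps the punctuation attached to its sentence. The name `split_sentences` is hypothetical.

```python
import re


def split_sentences(text):
    # Split on whitespace that is preceded by '.', '!' or '?'.
    return re.split(r'(?<=[.!?])\s+', text)


sent4 = 'Where do think I should go? I have 3 day holiday'
print(split_sentences(sent4))

# Limitation: abbreviations that end in a period are treated as sentence ends.
print(split_sentences('I have a Ph.D. in A.I.'))
```

The second call shows why rule-based splitting only goes so far: "Ph.D." is cut off from the rest of its sentence, which is the kind of case dedicated tokenizers are built to handle.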

## 2. Regular Expression
**A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. It is a powerful tool used for pattern matching and manipulation of strings. Regular expressions are widely used in programming, text processing, and various applications, including data validation, searching, and text manipulation.**

## Key Concepts:

1. Pattern Matching: Regular expressions allow you to specify a pattern that defines the structure and content of text you want to match.

2. Metacharacters: Regular expressions use special characters, known as metacharacters, to represent classes of characters, repetitions, alternatives, and more.

3. Quantifiers: Quantifiers indicate the number of occurrences of a character or group. Common quantifiers include * (zero or more), + (one or more), ? (zero or one), {n} (exactly n occurrences), {n,} (n or more occurrences), and {n,m} (between n and m occurrences).

4. Character Classes: Character classes define sets of characters that can match at a particular position. For example, [a-z] matches any lowercase letter, and \d matches a digit.

5. Anchors: Anchors specify positions in the string where a match should occur. Common anchors include ^ (start of line), $ (end of line), and \b (word boundary).

6. Groups and Alternation: Parentheses () are used to create groups, which can be used for capturing or grouping subpatterns. The vertical bar | represents alternation, allowing you to match one of multiple alternatives.

7. Escape Sequences: Some characters have special meanings in regular expressions. To match these characters literally, you need to escape them using a backslash \

## Example:

### Consider the regular expression: ^[\w.-]+@\w+\.\w+$

This regex matches email addresses with the following structure

1. ^: Start of the line.
2. [\w.-]+: One or more word characters, dots, or hyphens (matches the username part of the email).
3. @: Matches the at symbol.
4. \w+: One or more word characters (matches the domain name).
5. \.: Matches a dot.
6. \w+: One or more word characters (matches the top-level domain, like ".com").
7. $: End of the line.
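
To see the pattern in action, here is a minimal sketch that runs it against a few strings. The candidate addresses are made up for illustration, apart from the one that appears later in this notebook.

```python
import re

# Pattern taken from the walkthrough above.
email_pattern = re.compile(r'^[\w.-]+@\w+\.\w+$')

candidates = ['nks@gmail.com', 'first.last@example.co.uk', 'not-an-email']
for candidate in candidates:
    print(candidate, bool(email_pattern.match(candidate)))
```

The second address is rejected because the pattern allows exactly one dot after the `@`, which shows that this regex is a teaching example rather than a complete email validator.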

In [87]:
import re
sent3 = 'I am going to dhaka!'
tokens = re.findall(r"[\w']+", sent3)  # raw string, so \w is not treated as a string escape
tokens

['I', 'am', 'going', 'to', 'dhaka']

In [88]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile('[.!?] ').split(text)
sentences

['Lorem Ipsum is simply dummy text of the printing and typesetting industry',
 "\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

## 3. NLTK

**NLTK (Natural Language Toolkit) is a powerful Python library that provides tools and resources for working with human language data, particularly in the field of natural language processing (NLP). It offers various functionalities for tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and more.**

## Key Features and Components:
1. Tokenization: NLTK provides methods for tokenizing text into sentences and words. The sent_tokenize() and word_tokenize() functions are commonly used for this purpose.
2. Stemming and Lemmatization:  
    * NLTK includes stemmers and lemmatizers for reducing words to their root forms. Common stemmers include the Porter Stemmer and Snowball Stemmer.
    * The WordNet Lemmatizer is available for lemmatization, which reduces words to their base or dictionary forms.

3. Part-of-Speech Tagging:
   * NLTK offers part-of-speech tagging, which identifies the grammatical category (noun, verb, adjective, and so on) of each word in a sentence.
   * The pos_tag() function assigns parts of speech to each word in a sentence.

4. Corpora and Datasets:

   * NLTK provides a variety of pre-processed corpora and datasets for various languages and domains. These corpora are useful for research, experimentation, and training models.
   
5. Text Classification:

   * NLTK supports text classification tasks, such as sentiment analysis, using various machine learning algorithms.
   
6. Parsing and Chunking:

   * NLTK includes parsers and chunkers that can analyze sentence structure and identify grammatical chunks.
   
7. Named Entity Recognition (NER):

    * NLTK has tools for identifying named entities (e.g., names of people, organizations, locations) in text.
    
8. Frequency Distribution:

   * NLTK's FreqDist class can compute frequency distributions of words in a text.
   
9. Concordance and Collocations:

   * NLTK provides tools to find concordances (contexts where a word appears) and collocations (common word pairs) in a text.


In [91]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [92]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [94]:
sent1 = 'I am going to visit dhaka!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'dhaka', '!']

In [95]:
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

word_tokenize(sent5)

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

In [96]:
word_tokenize(sent7)

['A', '5km', 'ride', 'cost', '$', '10.50']

## 4. Spacy
**SpaCy is an open-source natural language processing (NLP) library for Python that provides efficient and fast tools for working with textual data. It's designed to be both user-friendly and production-ready, making it a popular choice for various NLP tasks.**


## Key Features and Components:

1. Tokenization: SpaCy performs efficient tokenization of text into words and sentences, including handling complex cases like contractions and punctuation.

2. Part-of-Speech Tagging: SpaCy provides part-of-speech tagging to identify the grammatical category (noun, verb, adjective, and so on) of each word in a sentence.

3. Lemmatization: Lemmatization reduces words to their base or dictionary forms, helping to handle different forms of the same word.

4. Dependency Parsing: SpaCy performs dependency parsing to determine the grammatical relationships between words in a sentence.

5. Named Entity Recognition (NER): SpaCy identifies named entities such as names of people, organizations, and locations.

6. Text Classification: SpaCy supports text classification tasks by training models on labeled data.

7. Word Vectors: SpaCy's medium and large pipelines (for example en_core_web_md) ship with pre-trained word vectors that can be used for similarity and other NLP tasks. The small pipeline used below does not include them.

8. Customization: SpaCy allows you to train custom models on your own data for specific tasks.

9. Language Support: English is the best-supported language, but SpaCy also offers support for other languages.

10. Performance: SpaCy is designed for efficiency and speed, making it suitable for both research and production environments.

In [114]:
!pip install spacy



In [105]:
import spacy
spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [106]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [110]:
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent1)

In [111]:
for token in doc4:
    print(token)

I
am
going
to
visit
dhaka
!


In [117]:
for token in doc2:
    print(token, token.pos_)

We PRON
're AUX
here ADV
to PART
help VERB
! PUNCT
mail VERB
us PRON
at ADP
nks@gmail.com PROPN


In [118]:
text = "SpaCy is a powerful NLP library."
doc = nlp(text)

In [119]:
for token in doc:
    print(token.text, token.pos_)

SpaCy PROPN
is AUX
a DET
powerful ADJ
NLP PROPN
library NOUN
. PUNCT


# Stemming
 
**Stemming is a text normalization technique used in natural language processing (NLP) and information retrieval to reduce words to their base or root form. The idea behind stemming is to strip affixes (prefixes and suffixes) from words so that variations of a word can be treated as the same word.**

## Key Points:

1. Purpose: Stemming aims to simplify words to their basic form, which can help improve text analysis, information retrieval, and various NLP tasks.

**Example: With the Porter stemmer, "running" and "runs" are both reduced to "run," while "runner" is left unchanged.**

2. Reduced Vocabulary: Stemming reduces the vocabulary size by considering different forms of the same word as a single word. This can help in tasks like text classification, where features are based on individual words.

3. Algorithmic Approaches: Stemming algorithms use linguistic rules and heuristics to identify and remove prefixes and suffixes from words. Some common stemming algorithms include the Porter Stemmer and Snowball Stemmer (also known as the Porter2 Stemmer).

4. Drawbacks: While stemming is a simple and effective way to normalize words, its heuristic nature means the result is not always a valid word or the true root. For example, "movie" is stemmed to "movi," and "better" stays "better" because mapping it to "good" requires dictionary knowledge that a stemmer does not have.

5. Usage: Stemming is often used when the specific linguistic meaning of words is less important than capturing their basic form. It's commonly used in search engines, information retrieval systems, and text preprocessing pipelines.

In [123]:
# example
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [124]:
words = ["running", "runner", "runs"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['run', 'runner', 'run']


**In this example, the words "running," "runner," and "runs" are stemmed to "run," "runner," and "run" respectively.**

**Stemming can be a useful technique for basic text analysis tasks where capturing the general meaning of words is more important than preserving linguistic accuracy. However, for more advanced NLP tasks, lemmatization (which produces valid words) might be preferred over stemming.**
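
To make the "rules and heuristics" idea concrete, here is a toy suffix stripper. It is a deliberately simplified sketch, not the Porter algorithm: the suffix list and the minimum stem length of three characters are assumptions chosen for illustration, and `toy_stem` is a hypothetical name.

```python
def toy_stem(word):
    # Try longer suffixes first and keep at least three characters of stem.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in "walk walks walking walked".split()])
print(toy_stem("running"), toy_stem("this"))
```

The first line collapses all four forms to "walk." The second shows the weakness of blind rules: "running" becomes "runn" because nothing undoes the doubled consonant, and "this" loses its final "s" even though it is not a plural. Real stemmers add many more rules to handle such cases, yet still produce non-words at times.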

In [125]:
from nltk.stem.porter import PorterStemmer

In [126]:
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [127]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [128]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
print(text)

probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie


In [129]:
stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

In [133]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [137]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [136]:

import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"

# Build a new list instead of removing items from the list being iterated over,
# which can silently skip the token that follows each removed one.
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    # pos='v' treats every token as a verb, so nouns such as "hours" stay unchanged
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


# Lemmatization
**Lemmatization is a natural language processing technique used to reduce words to their base or dictionary form, known as a lemma. The goal of lemmatization is to transform different inflected or derived forms of a word into a common base form, so that they can be analyzed or compared more effectively. Lemmatization is particularly useful for text analysis tasks like information retrieval, text classification, and language modeling.**

### Here's an example to illustrate lemmatization:

**Original words: "running", "runner", "runs"**

**After lemmatization (as verbs): "run", "runner", "run"**

**In this example, the lemmatizer maps the inflected verb forms "running" and "runs" to the lemma "run". "runner" is left unchanged: it is not an inflected form of the verb "run" but a separate noun.**

**Lemmatization is different from stemming, another text normalization technique. Stemming strips affixes (mostly suffixes) with fixed rules, so the result is not always a real word: the Porter stemmer above turns "movie" into "movi". Lemmatization instead uses a vocabulary and the word's part of speech to return its dictionary form. As a result, lemmatization tends to produce more accurate and meaningful results than stemming, usually at the cost of being slower.**
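
**The contrast can be sketched without any library. The snippet below is a toy illustration only: `SUFFIXES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical names, and the rules are far simpler than the Porter algorithm or WordNet. It shows how blind suffix stripping can yield non-words while a dictionary lookup returns real base forms.**

```python
# Toy contrast between rule-based stemming and dictionary-based lemmatization.
# SUFFIXES and LEMMA_TABLE are hypothetical; they are NOT NLTK's rules or WordNet.

SUFFIXES = ("ing", "ed", "es", "ly", "s")

LEMMA_TABLE = {
    "studies": "study",
    "running": "run",
    "walked": "walk",
    "went": "go",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


print("{0:12}{1:12}{2:12}".format("Word", "Stem", "Lemma"))
for word in ["studies", "running", "walked", "went"]:
    print("{0:12}{1:12}{2:12}".format(word, toy_stem(word), toy_lemmatize(word)))
```

**With these toy rules, "studies" is stemmed to "studi" and "running" to "runn", neither of which is a word, and the irregular form "went" is not reduced at all. The lookup table returns "study", "run", and "go".**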

In [138]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

wordnet_lemmatizer = WordNetLemmatizer()

words = ["running", "runner", "runs"]
lemmatized_words = [wordnet_lemmatizer.lemmatize(word, pos='v') for word in words]

print(lemmatized_words)

['run', 'runner', 'run']


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Lemma

**A lemma, in linguistics, refers to the base or canonical form of a word. It is the form of the word that is typically found in dictionaries and represents the word's core meaning. Lemmas are used in various natural language processing tasks to simplify the analysis of text by reducing inflected or derived forms of words to a common base.**

**For example, the lemma of the word "running" (as a verb) is "run", the lemma of "better" (as an adjective) is "good", and the lemma of "went" is "go".**


**Lemmatization is the process of finding the lemma of a word. It involves considering the word's context and its part of speech to determine the appropriate base form. Lemmatization is commonly used in tasks such as text analysis, information retrieval, and language modeling to ensure that words with the same meaning are treated as equivalent, regardless of their inflected forms.**
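
**Why the part of speech matters can be shown with a small lookup keyed by both the word and its tag. This is a minimal sketch, not how NLTK or spaCy work internally: `LEMMAS_BY_POS` and `lookup_lemma` are hypothetical names, and the table holds only a handful of hand-picked entries.**

```python
# Toy part-of-speech-aware lemma lookup.
# LEMMAS_BY_POS and lookup_lemma are hypothetical names for illustration;
# a real lemmatizer draws on a full lexical resource such as WordNet.

LEMMAS_BY_POS = {
    ("better", "a"): "good",  # adjective
    ("went", "v"): "go",      # verb
    ("saw", "v"): "see",      # verb: past tense of "see"
    ("saw", "n"): "saw",      # noun: the cutting tool
}


def lookup_lemma(word, pos):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMAS_BY_POS.get((word.lower(), pos), word)


for word, pos in [("saw", "v"), ("saw", "n"), ("better", "a"), ("better", "v")]:
    print("{0:10}{1:5}{2:10}".format(word, pos, lookup_lemma(word, pos)))
```

**The same surface form "saw" maps to "see" when tagged as a verb and stays "saw" when tagged as a noun. A pair that is missing from the table, such as "better" tagged as a verb, is returned unchanged, which is the same fallback seen earlier when nouns were lemmatized with `pos='v'`.**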

**In many languages, lemmatization requires access to language-specific linguistic resources, such as dictionaries and morphological rules, to accurately determine the lemmas of words. Natural language processing libraries like NLTK and spaCy provide tools for performing lemmatization in various languages.**