# 1. NLTK

NLTK, or the Natural Language Toolkit, is a comprehensive suite of libraries and tools designed for working with human language data (text) in Python. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Additionally, it includes wrappers for industrial-strength NLP libraries.

### Key Features and Components of NLTK

1. **Corpora**:
   - NLTK includes a wide range of text corpora (large and structured sets of texts) for use in various natural language processing (NLP) tasks.
   - Examples include the Brown Corpus, Gutenberg Corpus, and movie reviews dataset.

2. **Lexical Resources**:
   - Resources like WordNet, a lexical database for the English language, are available. WordNet groups English words into sets of synonyms and provides short definitions and usage examples.

3. **Text Processing Libraries**:
   - **Tokenization**: Splitting text into words or sentences.
     ```python
     from nltk.tokenize import word_tokenize, sent_tokenize
     text = "Hello world. This is a test."
     print(word_tokenize(text))  # ['Hello', 'world', '.', 'This', 'is', 'a', 'test', '.']
     print(sent_tokenize(text))  # ['Hello world.', 'This is a test.']
     ```
   - **Stemming**: Reducing words to their root forms.
     ```python
     from nltk.stem import PorterStemmer
     ps = PorterStemmer()
     print(ps.stem('running'))  # 'run'
     ```
   - **Lemmatization**: Reducing words to their base or dictionary form.
     ```python
     from nltk.stem import WordNetLemmatizer
     lemmatizer = WordNetLemmatizer()
     print(lemmatizer.lemmatize('running', pos='v'))  # 'run'
     ```
   - **Part-of-Speech Tagging**: Identifying the grammatical parts of speech in a sentence.
     ```python
     from nltk import pos_tag
     words = word_tokenize("The quick brown fox jumps over the lazy dog")
     print(pos_tag(words))  # [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
     ```
   - **Parsing**: Analyzing the grammatical structure of sentences.
     ```python
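     import nltk
     from nltk import CFG
     from nltk.tokenize import word_tokenize
     # VP may be a bare verb, so the intransitive sentence below can be parsed
     grammar = CFG.fromstring("""
     S -> NP VP
     NP -> DT NN
     VP -> VBZ | VBZ NP
     DT -> 'the'
     NN -> 'dog'
     VBZ -> 'barks'
     """)
     sentence = word_tokenize("the dog barks")
     parser = nltk.ChartParser(grammar)
     for tree in parser.parse(sentence):
         print(tree)  # (S (NP (DT the) (NN dog)) (VP (VBZ barks)))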
     ```

4. **Text Classification**:
   - Tools for building machine learning models to classify text into categories.
   - Example workflow:
     - Preprocess text (tokenization, stemming/lemmatization, etc.).
     - Extract features (e.g., bag of words, TF-IDF).
     - Train classifiers (Naive Bayes, Decision Trees, etc.).

5. **Semantic Reasoning**:
   - Tools for understanding the meaning and context of words and sentences, often using lexical resources like WordNet.

6. **Interfaces to Other NLP Libraries**:
   - NLTK provides wrappers for external NLP tools such as Stanford CoreNLP, MaltParser, and SENNA.
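
The classification workflow in item 4 can also be sketched with NLTK's own `NaiveBayesClassifier`, which needs no downloaded data. This is a minimal illustration under assumptions: the `bag_of_words` helper and the four training sentences are hypothetical, and a real model would need far more data.

```python
from nltk.classify import NaiveBayesClassifier

def bag_of_words(text):
    """Map a text to a feature dict: one boolean feature per lowercase word."""
    return {word: True for word in text.lower().split()}

# Hypothetical toy training data, for illustration only.
train_data = [
    ("I love this movie", "positive"),
    ("What a great film", "positive"),
    ("I hate this movie", "negative"),
    ("What a terrible film", "negative"),
]
train_set = [(bag_of_words(text), label) for text, label in train_data]

# Train the classifier on (feature dict, label) pairs, then classify new text.
classifier = NaiveBayesClassifier.train(train_set)
prediction = classifier.classify(bag_of_words("I love this great film"))
print(prediction)  # positive
```

NLTK classifiers take feature dictionaries rather than vectors, so the feature extraction step is an ordinary Python function.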

### Installation and Basic Usage

To install NLTK, you can use pip:

```bash
pip install nltk
```

Once installed, you can download the necessary datasets and models using the NLTK downloader:

```python
import nltk
nltk.download()
```

In a desktop environment, this opens a graphical interface where you can choose and download specific datasets and models. Without a display, it falls back to a text-based menu.
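
For scripts and notebooks it is often more convenient to download resources by name. The sketch below assumes the resource identifiers used by many NLTK releases; newer releases rename some of them (for example `punkt_tab`), so treat the list as an assumption and adjust it to your version. The `download_resources` helper is hypothetical.

```python
import nltk

# Resource identifiers are assumptions based on common NLTK releases.
RESOURCES = [
    "punkt",                       # sentence and word tokenizer models
    "stopwords",                   # stop word lists
    "wordnet",                     # WordNet database used by the lemmatizer
    "averaged_perceptron_tagger",  # POS tagger model
    "maxent_ne_chunker",           # named entity chunker model
    "words",                       # word list used by the NE chunker
]

def download_resources(names=RESOURCES):
    """Download each resource by identifier and return the names that failed."""
    failed = []
    for name in names:
        # nltk.download returns True on success and False otherwise.
        if not nltk.download(name, quiet=True):
            failed.append(name)
    return failed
```

Calling `download_resources()` once before running the examples in this document should fetch what they need; an empty return value means every download succeeded.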

### Example Workflow Using NLTK

Here’s an example of a simple NLP workflow using NLTK to perform text preprocessing, feature extraction, and classification:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Sample data (a toy set: far too small for a meaningful train/test evaluation)
documents = ["I love this movie", "I hate this movie", "This movie is okay"]
labels = ["positive", "negative", "neutral"]

# Text preprocessing
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and word.isalpha()]
    return ' '.join(tokens)

documents = [preprocess(doc) for doc in documents]

# Feature extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train a classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
```

### Conclusion

NLTK is a powerful and flexible library for natural language processing in Python. It supports a wide range of tasks from basic text processing to advanced NLP techniques, making it a valuable tool for researchers, practitioners, and developers working with textual data.

# 2. Tokenization

Tokenization is the process of splitting text into smaller units called tokens, which can be words, sentences, or subwords. It's a fundamental step in natural language processing (NLP) because it prepares raw text data for further analysis, such as parsing, part-of-speech tagging, or machine learning model training.

### Types of Tokenization

1. **Word Tokenization**
2. **Sentence Tokenization**
3. **Regular Expression Tokenization**
4. **Whitespace Tokenization**

### Using NLTK for Tokenization

#### 1. Word Tokenization
Word tokenization splits text into individual words. NLTK provides several tools for word tokenization, with `word_tokenize` being the most common.

**Example**:
```python
import nltk
from nltk.tokenize import word_tokenize

text = "Hello, world! This is a test."
tokens = word_tokenize(text)
print(tokens)
# Output: ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
```

#### 2. Sentence Tokenization
Sentence tokenization splits text into individual sentences. NLTK’s `sent_tokenize` function uses a pre-trained Punkt model to detect sentence boundaries.

**Example**:
```python
from nltk.tokenize import sent_tokenize

text = "Hello, world! This is a test. NLP is fun."
sentences = sent_tokenize(text)
print(sentences)
# Output: ['Hello, world!', 'This is a test.', 'NLP is fun.']
```

#### 3. Regular Expression Tokenization
Regular expression tokenization allows for more complex and customizable tokenization patterns using regular expressions. NLTK provides `RegexpTokenizer` for this purpose.

**Example**:
```python
from nltk.tokenize import RegexpTokenizer

text = "The price of bitcoin is $42,000 as of January 2024."
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['The', 'price', 'of', 'bitcoin', 'is', '42', '000', 'as', 'of', 'January', '2024']
```
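
The `\w+` pattern above breaks `$42,000` into `42` and `000`. As a sketch of how the pattern can be customized, the hypothetical tokenizer below adds an alternative that keeps a price together as one token. The pattern is an illustrative assumption, not a general-purpose number matcher.

```python
from nltk.tokenize import RegexpTokenizer

text = "The price of bitcoin is $42,000 as of January 2024."

# First alternative: optional "$", digits, optional thousands groups and decimals.
# Second alternative: any other run of word characters.
price_aware = RegexpTokenizer(r"\$?\d+(?:,\d{3})*(?:\.\d+)?|\w+")
price_tokens = price_aware.tokenize(text)
print(price_tokens)
# ['The', 'price', 'of', 'bitcoin', 'is', '$42,000', 'as', 'of', 'January', '2024']
```

The groups are non-capturing (`(?:...)`) because `RegexpTokenizer` expects patterns without capturing parentheses.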

#### 4. Whitespace Tokenization
Whitespace tokenization splits text on whitespace characters only, leaving punctuation attached to the neighboring word. NLTK provides `WhitespaceTokenizer` for this, and Python’s built-in `split` method gives the same tokens for simple cases.

**Example**:
```python
text = "Hello, world! This is a test."
tokens = text.split()
print(tokens)
# Output: ['Hello,', 'world!', 'This', 'is', 'a', 'test.']
```
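
NLTK's `WhitespaceTokenizer` class produces the same tokens as `split()` here. A short sketch follows, which also uses `span_tokenize` to recover character offsets, something `split()` does not provide.

```python
from nltk.tokenize import WhitespaceTokenizer

text = "Hello, world! This is a test."
ws_tokenizer = WhitespaceTokenizer()

ws_tokens = ws_tokenizer.tokenize(text)
print(ws_tokens)
# ['Hello,', 'world!', 'This', 'is', 'a', 'test.']

# (start, end) character offsets of each token in the original string
ws_spans = list(ws_tokenizer.span_tokenize(text))
print(ws_spans)
# [(0, 6), (7, 13), (14, 18), (19, 21), (22, 23), (24, 29)]
```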

### Tokenization Examples with NLTK

#### Word Tokenization with `TreebankWordTokenizer`
The `TreebankWordTokenizer` is another tokenizer provided by NLTK that uses the Penn Treebank tokenization conventions.

**Example**:
```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```

#### Sentence Tokenization with `PunktSentenceTokenizer`
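The `PunktSentenceTokenizer` implements the Punkt algorithm, which learns abbreviations and sentence starters from raw text without supervision. `sent_tokenize` loads a pre-trained Punkt model; instantiating the class directly, as below, gives a tokenizer with default parameters unless you pass it training text.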

**Example**:
```python
from nltk.tokenize import PunktSentenceTokenizer

text = "Hello, world! This is a test. NLP is fun."
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)
print(sentences)
# Output: ['Hello, world!', 'This is a test.', 'NLP is fun.']
```

### Tokenization Considerations

1. **Handling Punctuation**:
   - In word tokenization, punctuation marks can be treated as separate tokens or be removed.
   - Example: "Hello, world!" can be tokenized as ['Hello', ',', 'world', '!'] or ['Hello', 'world'].

2. **Handling Contractions**:
   - Contractions (e.g., "don't", "it's") need careful handling to split them correctly (e.g., ["do", "n't"], ["it", "'s"]).

3. **Multi-language Support**:
   - Tokenization rules vary across languages. NLTK supports tokenization in several languages but might require customization for specific needs.

4. **Efficiency**:
   - For large text corpora, efficient tokenization is crucial. NLTK’s tokenizers are adequate for moderate volumes, but faster alternatives such as spaCy exist for very large datasets.
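
The punctuation and contraction points above can be sketched with two tokenizers that need no downloaded models. The sample sentences are arbitrary.

```python
from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer

treebank = TreebankWordTokenizer()
words_only = RegexpTokenizer(r"\w+")

# Punctuation: keep it as separate tokens, or drop it entirely.
with_punct = treebank.tokenize("Hello, world!")
without_punct = words_only.tokenize("Hello, world!")
print(with_punct)     # ['Hello', ',', 'world', '!']
print(without_punct)  # ['Hello', 'world']

# Contractions: Penn Treebank conventions split off "n't" and "'s".
contraction_tokens = treebank.tokenize("I don't think it's ready.")
print(contraction_tokens)
# ['I', 'do', "n't", 'think', 'it', "'s", 'ready', '.']
```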

### Advanced Tokenization Techniques

1. **Subword Tokenization**:
   - Splits text into subword units rather than words or sentences, used in models like BERT and GPT.
   - **Byte Pair Encoding (BPE)** and **WordPiece Tokenization** are common subword tokenization methods.

2. **Custom Tokenizers**:
   - Creating custom tokenizers tailored to specific needs using regular expressions or rule-based methods.
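
To make the idea behind Byte Pair Encoding concrete, here is a minimal pure-Python sketch of the merge-learning loop: repeatedly find the most frequent adjacent symbol pair and merge it. It is a teaching sketch under assumptions (toy word frequencies, ties broken by first-seen order, no end-of-word marker), not the implementation used by any particular model or library.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_vocab = {}
    for symbols, freq in vocab.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged_vocab[tuple(new_symbols)] = freq
    return merged_vocab

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

# Hypothetical word frequencies, for illustration only.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
print(vocab[("n", "e", "w", "est")])  # 6
```

After three merges, the frequent suffix `est` has become a single symbol, which is how subword vocabularies come to share pieces across words.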

### Example: Custom Tokenizer with Regular Expressions
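```python
from nltk.tokenize import regexp_tokenize

text = "Hello, world! This is a test."
# The pattern describes the separators (whitespace and punctuation),
# so gaps=True tells NLTK to split on matches instead of returning them.
pattern = r'[\s.,;!?\'"]+'
tokens = regexp_tokenize(text, pattern, gaps=True)
print(tokens)
# Output: ['Hello', 'world', 'This', 'is', 'a', 'test']
```

In this example, the regular expression describes the separators (whitespace and punctuation marks), and `gaps=True` makes `regexp_tokenize` split the text on them. Without `gaps=True`, the pattern is treated as a description of the tokens themselves, and the function would return the separators instead of the words.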

### Conclusion

Tokenization is a crucial step in NLP that breaks down text into manageable pieces for further processing. NLTK provides a range of tokenization tools to handle different needs, from simple word and sentence tokenization to more complex regular expression-based tokenization. Understanding the characteristics of your text data and the requirements of your NLP tasks will help you choose the most appropriate tokenization method.

# 3. Word Rooting

### 1. Stemming

Stemming reduces words to a root form, usually by stripping suffixes with heuristic rules. The resulting stem is not always a dictionary word, but it normalizes related word forms, which is useful in text processing tasks like search and indexing. NLTK provides several stemming algorithms.

#### Porter Stemmer
The Porter Stemmer is one of the most commonly used stemming algorithms. It applies a sequence of suffix-stripping rules to reduce a word to its stem.

**Example**:
```python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('running'))  # Output: run
```

#### Lancaster Stemmer
The Lancaster Stemmer is more aggressive than the Porter Stemmer, often resulting in shorter stems.

**Example**:
```python
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
print(stemmer.stem('running'))  # Output: run
```

#### Snowball Stemmer
The Snowball Stemmer supports multiple languages. Its English variant, also known as the Porter2 stemmer, is an improvement over the original Porter Stemmer.

**Example**:
```python
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
print(stemmer.stem('running'))  # Output: run
```
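
All three stemmers agree on `running`, so the differences only show up on other words. The sketch below runs them side by side on a few arbitrary sample words; apart from the `running` row, no outputs are listed here because they are best checked by running the code.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}
words = ["running", "generously", "fairly", "studies"]  # arbitrary sample words

# One row per word, one column per stemmer.
table = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}
for word, stems in table.items():
    print(f"{word:12}", stems)
# The first row shows 'run' for all three stemmers.
```

Comparing rows like this is a quick way to decide how aggressive a stemmer your task can tolerate.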

### 2. Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma). Unlike stemming, it looks words up in a dictionary and takes the part of speech into account, so the result is a dictionary word rather than a truncated stem.

#### WordNet Lemmatizer
The WordNet Lemmatizer uses the WordNet lexical database to find the base form of a word. It treats words as nouns unless a `pos` argument says otherwise, so `lemmatize('running')` without `pos='v'` returns `'running'` unchanged.

**Example**:
```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # Output: run
```

### 3. Stop Words Removal

Stop words are common words (such as "the", "is", and "a") that usually do not contribute significantly to the meaning of a sentence. Removing them helps reduce noise in the text data. NLTK provides a list of common stop words for various languages.

**Example**:
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
text = "This is a sample sentence, showing off the stop words filtration."
words = word_tokenize(text)
# Compare in lowercase so that capitalized stop words like "This" are removed too
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)  # Output: ['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
```

# 4. Language Recognition

### 1. Parts of Speech (POS) Tagging

POS tagging assigns a part of speech (noun, verb, adjective, etc.) to each word in a sentence. This helps in understanding the grammatical structure and syntactic relationships within the text.

**How POS Tagging Works in NLTK**:
1. **Tokenization**: First, the text is tokenized into words.
2. **Tagging**: Each token is then assigned a POS tag.

**Example**:
```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# Sample text
text = "This is a sample sentence."

# Tokenize the text
words = word_tokenize(text)

# POS tagging
pos_tags = pos_tag(words)
print(pos_tags)
# Example output (exact tags can vary with the tagger model):
# [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'NN'), ('sentence', 'NN'), ('.', '.')]
```

In this example:
- `DT`: Determiner
- `VBZ`: Verb, 3rd person singular present
- `NN`: Noun, singular or mass
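
To make the tagging step concrete without downloading a model, here is a minimal rule-based sketch using NLTK's `RegexpTagger`. This is not how `pos_tag` works internally (it relies on a trained model); the patterns below are hypothetical hand-written rules chosen only for illustration.

```python
from nltk.tag import RegexpTagger

# Rules are tried in order; the last pattern is a catch-all default.
patterns = [
    (r"^(the|a|an|this)$", "DT"),  # a few determiners
    (r"^is$", "VBZ"),              # verb, 3rd person singular present
    (r".*ing$", "VBG"),            # gerunds
    (r".*ed$", "VBD"),             # simple past
    (r".*s$", "NNS"),              # plural nouns
    (r".*", "NN"),                 # everything else: singular noun
]
rule_tagger = RegexpTagger(patterns)

rule_tags = rule_tagger.tag("this is a sample sentence".split())
print(rule_tags)
# [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'NN'), ('sentence', 'NN')]
```

Rules like these break down quickly on real text, which is why the trained tagger behind `pos_tag` is the usual choice.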

### 2. Named Entity Recognition (NER)

NER identifies named entities in text and classifies them into predefined categories such as person names, organizations, locations, dates, etc. This helps in extracting important information and relationships from the text.

**How NER Works in NLTK**:
1. **Tokenization and POS Tagging**: The text is first tokenized and POS tagged.
2. **Chunking**: The tagged tokens are then chunked to identify named entities.

**Example**:
```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

# Sample text
text = "Barack Obama was born on August 4, 1961 in Honolulu, Hawaii."

# Tokenize and POS tagging
words = word_tokenize(text)
pos_tags = pos_tag(words)

# Named Entity Recognition: ne_chunk returns a tree of tagged tokens and entity chunks
named_entities = ne_chunk(pos_tags)
print(named_entities)

# Optional: named_entities.draw() opens a window showing the tree.
# draw() returns None, so do not assign or print its result.
```

In the printed tree, entity chunks carry labels such as:
- `PERSON`: Person names
- `GPE`: Geopolitical entities (countries, cities, states)
- `DATE`: Dates (depending on the model, dates may be left as plain tokens rather than labeled)
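
The chunking step can be illustrated without the pre-trained model by using `RegexpParser` on hand-tagged tokens. This is a simplified sketch, not what `ne_chunk` does internally: the POS tags are written by hand, and the single rule (one or more consecutive proper nouns) and its `NE` label are assumptions made for illustration.

```python
from nltk.chunk import RegexpParser

# Hand-written POS tags (assumed, not produced by a tagger).
tagged = [
    ("Barack", "NNP"), ("Obama", "NNP"), ("was", "VBD"), ("born", "VBN"),
    ("in", "IN"), ("Honolulu", "NNP"), (".", "."),
]

# Chunk rule: consecutive proper nouns form one candidate entity ("NE").
chunker = RegexpParser("NE: {<NNP>+}")
tree = chunker.parse(tagged)

# Walk the tree and join the words of each NE chunk.
candidates = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NE")
]
print(candidates)  # ['Barack Obama', 'Honolulu']
```

The same `subtrees(filter=...)` pattern works on the tree returned by `ne_chunk`, with labels such as `PERSON` or `GPE` in place of `NE`.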