# Practical 1
## Introduction to Python libraries for feature extraction and NLP

### Feature extraction is a crucial part of any Natural Language Processing (NLP) project, as it transforms raw text into numerical representations that machine learning models can work with. Python offers several libraries that make this task easier, with each library providing tools to handle different aspects of feature extraction, text processing, and NLP tasks.

## 1. NLTK (Natural Language Toolkit)
### NLTK is one of the most popular libraries for working with human language data (text). It provides various tools for tokenization, stemming, tagging, parsing, and semantic reasoning. It also includes a suite of text corpora to help build models.

### Key Features:

* Tokenization: Breaking text into words or sentences.
* Stemming: Stripping suffixes to reduce words to a common stem (for example, "running" becomes "run").
* POS Tagging: Labeling each token with its part of speech.
* Chunking: Grouping tokens into syntactic phrases, such as noun phrases.
* Corpora Access: Pre-defined datasets like WordNet, Gutenberg, etc.

In [28]:
import nltk
nltk.download('punkt')  # tokenizer data; newer NLTK releases (3.9+) may require 'punkt_tab' instead
from nltk.tokenize import word_tokenize

text = "Natural language processing with Python is amazing."
tokens = word_tokenize(text)
print(tokens)

['Natural', 'language', 'processing', 'with', 'Python', 'is', 'amazing', '.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PCD\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
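
Stemming is listed among the key features above but is not demonstrated. The following is a minimal sketch using NLTK's `PorterStemmer`; the word list is an illustrative assumption, not part of the original practical.

```python
from nltk.stem import PorterStemmer

# The Porter stemmer applies suffix-stripping rules and needs no downloaded data.
stemmer = PorterStemmer()

# Illustrative word list (an assumption for this sketch)
words = ["running", "processing", "studies"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Note that a stem does not have to be a dictionary word: "studies" is reduced to "studi". When real words are required, lemmatization is usually the better fit.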


## 2. spaCy
### spaCy is a more recent library built for efficient, large-scale text processing. It integrates with deep learning frameworks, offers pre-trained pipelines for multiple languages, and is faster than NLTK for some operations.

### Key Features:

* Named Entity Recognition (NER): Identifying real-world entities such as people, organizations, and locations.
* Dependency Parsing: Analyzing grammatical relationships between words.
* Tokenization: Splitting text into tokens.
* Pre-trained models: Ready-to-use pipelines for language processing.

In [27]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Natural language processing with Python is amazing.")

for token in doc:
    print(token.text,"\t\t",token.pos_,"\t",token.tag_,"\t\t\t",spacy.explain(token.tag_))

Natural 		 ADJ 	 JJ 			 adjective (English), other noun-modifier (Chinese)
language 		 NOUN 	 NN 			 noun, singular or mass
processing 		 NOUN 	 NN 			 noun, singular or mass
with 		 ADP 	 IN 			 conjunction, subordinating or preposition
Python 		 PROPN 	 NNP 			 noun, proper singular
is 		 AUX 	 VBZ 			 verb, 3rd person singular present
amazing 		 ADJ 	 JJ 			 adjective (English), other noun-modifier (Chinese)
. 		 PUNCT 	 . 			 punctuation mark, sentence closer


## 3. Scikit-learn
### Scikit-learn is not a dedicated NLP library, but it provides many useful tools for feature extraction from text. It includes methods for converting text into numerical features like Bag-of-Words (BoW) and TF-IDF representations.

### Key Features:

* CountVectorizer: Converts a collection of text documents to a matrix of token counts (BoW).
* TfidfVectorizer: Converts a collection of text documents to a matrix of TF-IDF features.
* Text classification: The resulting feature matrices can be passed directly to Scikit-learn classifiers for training.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Natural language processing with Python.", 
        "Python is great for machine learning."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())

['for' 'great' 'is' 'language' 'learning' 'machine' 'natural' 'processing'
 'python' 'with']
[[0 0 0 1 0 0 1 1 1 1]
 [1 1 1 0 1 1 0 0 1 0]]
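
To make the Bag-of-Words matrix less of a black box, here is a pure-Python sketch of roughly what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary, and count. It is a simplified illustration, not the library's actual implementation.

```python
import re
from collections import Counter

docs = ["Natural language processing with Python.",
        "Python is great for machine learning."]

def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters
    # (similar to CountVectorizer's default token pattern)
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each column corresponds to one vocabulary term, so the word "python" (the ninth column) has a count of 1 in both rows.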


## 4. Gensim
### Gensim is used for topic modeling and document similarity analysis. It's designed to handle large text corpora by providing efficient algorithms like Word2Vec for word embeddings.

### Key Features:

* Topic modeling: Latent Dirichlet Allocation (LDA).
* Word2Vec: Creating dense word embeddings.
* Document similarity: Calculating similarity between documents.

In [5]:
import gensim
from gensim.models import Word2Vec

sentences = [["I", "love", "natural", "language", "processing"],
             ["Python", "is", "great", "for", "NLP"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
vector = model.wv['Python']
print(vector)

[-0.00713902  0.00124103 -0.00717672 -0.00224462  0.0037193   0.00583312
  0.00119818  0.00210273 -0.00411039  0.00722533 -0.00630704  0.00464722
 -0.00821997  0.00203647 -0.00497705 -0.00424769 -0.00310898  0.00565521
  0.0057984  -0.00497465  0.00077333 -0.00849578  0.00780981  0.00925729
 -0.00274233  0.00080022  0.00074665  0.00547788 -0.00860608  0.00058446
  0.00686942  0.00223159  0.00112468 -0.00932216  0.00848237 -0.00626413
 -0.00299237  0.00349379 -0.00077263  0.00141129  0.00178199 -0.0068289
 -0.00972481  0.00904058  0.00619805 -0.00691293  0.00340348  0.00020606
  0.00475375 -0.00711994  0.00402695  0.00434743  0.00995737 -0.00447374
 -0.00138926 -0.00731732 -0.00969783 -0.00908026 -0.00102275 -0.00650329
  0.00484973 -0.00616403  0.00251919  0.00073944 -0.00339215 -0.00097922
  0.00997913  0.00914589 -0.00446183  0.00908303 -0.00564176  0.00593092
 -0.00309722  0.00343175  0.00301723  0.00690046 -0.00237388  0.00877504
  0.00758943 -0.00954765 -0.00800821 -0.0076379   0.
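
Similarity between embeddings is usually measured with cosine similarity. In Gensim, `model.wv.similarity('Python', 'NLP')` returns this value for two words, although with a two-sentence toy corpus the result is not meaningful. The sketch below shows the underlying calculation in plain Python; the short vectors are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0, regardless of their lengths.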

## 5. Transformers (Hugging Face)
### Transformers from Hugging Face is a powerful library for working with pre-trained transformer models like BERT, GPT, and RoBERTa. It provides tools for text classification, question answering, and text generation.

### Key Features:

* Pre-trained transformer models.
* Tokenization.
* Text generation.
* Text classification.
* Question answering.

In [6]:
from transformers import pipeline

nlp = pipeline("sentiment-analysis")
result = nlp("Natural Language Processing is amazing with Python!")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'POSITIVE', 'score': 0.9998145699501038}]
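
The `score` in the result is a probability. For a two-class model such as the default one, the pipeline typically obtains it by applying a softmax to the model's raw output scores (logits) and reporting the most likely label. The sketch below illustrates that step in plain Python; the logits are hypothetical values, not actual model output.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.2, 4.4]  # hypothetical raw scores, not actual model output

probs = softmax(logits)

# Report the label with the highest probability, in the pipeline's output shape
best = max(range(len(labels)), key=lambda i: probs[i])
result = [{"label": labels[best], "score": probs[best]}]
print(result)
```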


## 6. TextBlob
### TextBlob is a simpler library built on top of NLTK and Pattern. It offers an easy-to-use API for common NLP tasks like part-of-speech tagging, noun phrase extraction, and sentiment analysis.

### Key Features:

* Part-of-Speech Tagging.
* Noun Phrase Extraction.
* Sentiment Analysis.
* Translation (available in older releases; deprecated in newer versions of TextBlob).

In [11]:
from textblob import TextBlob

text = "Natural Language Processing is fun."
blob = TextBlob(text)
print(blob.sentiment)

Sentiment(polarity=0.2, subjectivity=0.30000000000000004)



> In TextBlob, sentiment analysis returns two main components:

- Polarity: A value between -1 and 1 that tells how positive or negative a sentence is. A score closer to 1 means positive sentiment, while a score closer to -1 means negative sentiment.
- Subjectivity: A value between 0 and 1 that tells how subjective (personal opinion, feeling) or objective (fact-based) the sentence is. A score closer to 1 means high subjectivity.
  
> In the case of Sentiment(polarity=0.2, subjectivity=0.30000000000000004), the breakdown is as follows:

* Polarity = 0.2: The text is slightly positive, but not strongly so. It's more neutral with a slight tilt toward positive sentiment.
* Subjectivity = 0.3: The text is mostly objective, meaning it is more fact-based, with a bit of personal opinion or subjectivity.
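
TextBlob's default analyzer is lexicon-based: roughly speaking, it looks up the words that have known sentiment values and averages them. The sketch below imitates that idea in plain Python. The lexicon entries are illustrative assumptions chosen to mirror the result above, not values read from TextBlob, and the real analyzer also handles negation and intensifiers.

```python
# Hypothetical lexicon: word -> (polarity, subjectivity); illustrative values only
lexicon = {
    "natural": (0.1, 0.4),
    "fun": (0.3, 0.2),
}

text = "Natural Language Processing is fun."
words = text.lower().rstrip(".").split()

# Keep only the words that have a sentiment entry
scored = [lexicon[word] for word in words if word in lexicon]

# Average the per-word values to get the sentence-level scores
polarity = sum(p for p, _ in scored) / len(scored)
subjectivity = sum(s for _, s in scored) / len(scored)
print(polarity, subjectivity)
```

Words without a lexicon entry, such as "language" or "is", do not contribute to either score.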

## 7. TfidfTransformer (Scikit-learn)
### TfidfTransformer is another Scikit-learn feature extraction tool. It converts a matrix of token counts into TF-IDF scores, which down-weight words that appear in many documents and so give a better picture of how important each word is to a document.

In [20]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love Python.", "Python is amazing for NLP tasks."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(X)
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())

['amazing' 'for' 'is' 'love' 'nlp' 'python' 'tasks']
[[0.         0.         0.         0.81480247 0.         0.57973867
  0.        ]
 [0.4261596  0.4261596  0.4261596  0.         0.4261596  0.30321606
  0.4261596 ]]
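
`TfidfVectorizer`, mentioned earlier, combines both steps: it is equivalent to `CountVectorizer` followed by `TfidfTransformer`. A short sketch that checks this on the same two documents, assuming default settings throughout:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ["I love Python.", "Python is amazing for NLP tasks."]

# Two steps: raw token counts, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer performs both
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step form is mainly useful when the count matrix is needed separately or has already been computed.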


### Document 1: "I love Python."
#### Row 1: [0. 0. 0. 0.81480247 0. 0.57973867 0. ]
-> The TF-IDF scores for words in Document 1 (the single-character token "I" is dropped by CountVectorizer's default tokenizer, so it has no column):
* 'amazing' → 0 (not present)
* 'for' → 0 (not present)
* 'is' → 0 (not present)
* 'love' → 0.8148 (highest score, because it appears only in this document)
* 'nlp' → 0 (not present)
* 'python' → 0.5797 (present, but weighted lower because it appears in both documents)
* 'tasks' → 0 (not present)
### Document 2: "Python is amazing for NLP tasks."
#### Row 2: [0.4261596 0.4261596 0.4261596 0. 0.4261596 0.30321606 0.4261596 ]
-> The TF-IDF scores for words in Document 2:
* 'amazing' → 0.4262 (present, appears only in this document)
* 'for' → 0.4262 (present, appears only in this document)
* 'is' → 0.4262 (present, appears only in this document)
* 'love' → 0 (not present)
* 'nlp' → 0.4262 (present, appears only in this document)
* 'python' → 0.3032 (present, but lowest score because it appears in both documents)
* 'tasks' → 0.4262 (present, appears only in this document)
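
The scores in Row 1 can be reproduced by hand. With Scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), the inverse document frequency is ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit length. A minimal sketch of that calculation for Document 1, using only the standard library:

```python
import math

n_docs = 2

# Document frequency: number of documents that contain each term
df = {"love": 1, "python": 2}

# Raw counts in Document 1 ("I love Python."); "I" is dropped as a one-character token
tf = {"love": 1, "python": 1}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in df}

# Multiply counts by IDF, then L2-normalize the row
raw = {term: tf[term] * idf[term] for term in tf}
norm = math.sqrt(sum(value * value for value in raw.values()))
tfidf = {term: value / norm for term, value in raw.items()}

for term, weight in tfidf.items():
    print(f"{term}: {weight:.4f}")
```

This prints 0.8148 for "love" and 0.5797 for "python", matching the matrix above. Because "python" appears in both documents, its IDF is exactly 1, the lowest possible value, which is why it scores below "love".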

### Conclusion
#### These libraries together cover the essential aspects of feature extraction and NLP tasks. You can choose based on your project’s needs: for large-scale processing (spaCy), deep learning models (Transformers), traditional ML (Scikit-learn), or specific NLP tasks (NLTK).