# NLP Concepts using Python

## Text Analysis

In [1]:
import nltk
import spacy
import re
from nltk.tokenize import (word_tokenize,
                           sent_tokenize,
                           TreebankWordTokenizer,
                           wordpunct_tokenize,
                           TweetTokenizer,
                           MWETokenizer)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ganad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ganad\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
# a more complex sentence to showcase tokenization
complex_sentence = """Dr. Emily Wong, Ph.D., exclaimed, 'OMG! Despite reaching 7.8 billion in 2021, global population metrics, like those in Section 123(i)(3)(A)(ii), remain baffling,' and then she added in a mix of French and English, 'C’est incroyable, no? Check my blog at www.emilywong-science.com or email me at wong@scienceworld.com for the full story on Mycobacterium tuberculosis complex (MTBC) research, which is, you know, the state-of-the-art stuff—I've literally termed it "the Occam’s razor of epidemiology."""
sentence = "The quick brown fox jumps over the lazy dog."

### Tokenization

In [3]:
tokens_word = nltk.word_tokenize(sentence)
token_sent = nltk.sent_tokenize(sentence)
token_treebank = TreebankWordTokenizer().tokenize(sentence)
token_wordpunct = wordpunct_tokenize(sentence)
token_tweet = TweetTokenizer().tokenize(sentence)
token_mwe = MWETokenizer().tokenize(sentence.split())


display(tokens_word,
        token_sent,
        token_treebank,
        token_wordpunct,
        token_tweet,
        token_mwe)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

['The quick brown fox jumps over the lazy dog.']

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']

In [4]:
tokens_word_c = nltk.word_tokenize(complex_sentence)
token_sent_c = nltk.sent_tokenize(complex_sentence)
token_treebank_c = TreebankWordTokenizer().tokenize(complex_sentence)
token_wordpunct_c = wordpunct_tokenize(complex_sentence)
token_tweet_c = TweetTokenizer().tokenize(complex_sentence)
token_mwe_c = MWETokenizer().tokenize(complex_sentence.split())

print(tokens_word_c)
print()
print(token_sent_c)
print()
print(token_treebank_c)
print()
print(token_wordpunct_c)
print()
print(token_tweet_c)
print()
print(token_mwe_c)

['Dr.', 'Emily', 'Wong', ',', 'Ph.D.', ',', 'exclaimed', ',', "'OMG", '!', 'Despite', 'reaching', '7.8', 'billion', 'in', '2021', ',', 'global', 'population', 'metrics', ',', 'like', 'those', 'in', 'Section', '123', '(', 'i', ')', '(', '3', ')', '(', 'A', ')', '(', 'ii', ')', ',', 'remain', 'baffling', ',', "'", 'and', 'then', 'she', 'added', 'in', 'a', 'mix', 'of', 'French', 'and', 'English', ',', "'", 'C', '’', 'est', 'incroyable', ',', 'no', '?', 'Check', 'my', 'blog', 'at', 'www.emilywong-science.com', 'or', 'email', 'me', 'at', 'wong', '@', 'scienceworld.com', 'for', 'the', 'full', 'story', 'on', 'Mycobacterium', 'tuberculosis', 'complex', '(', 'MTBC', ')', 'research', ',', 'which', 'is', ',', 'you', 'know', ',', 'the', 'state-of-the-art', 'stuff—I', "'ve", 'literally', 'termed', 'it', '``', 'the', 'Occam', '’', 's', 'razor', 'of', 'epidemiology', '.']

["Dr. Emily Wong, Ph.D., exclaimed, 'OMG!", "Despite reaching 7.8 billion in 2021, global population metrics, like those in Secti

### Observation
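- The word tokenizer and the Treebank tokenizer seem to do better than the others on the complex sentence. `word_tokenize`, for instance, keeps 'Dr.', 'Ph.D.', '7.8', the URL, and 'state-of-the-art' intact, although it splits the email address into three tokens.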

- Picking the right tokenizer depends on the task at hand. For example, if you are working on a sentiment analysis task, you might want to use a tokenizer that preserves emoticons and hashtags. If you are working on a machine translation task, you might want to use a tokenizer that preserves punctuation and capitalization.
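
To make the point about task-specific tokenizers concrete, here is a minimal sketch of a regex tokenizer that keeps emoticons, hashtags, and @mentions together. The pattern and the name `social_tokenize` are illustrative assumptions, not part of NLTK; NLTK's `TweetTokenizer`, imported above, is built for this kind of text and covers far more cases.

```python
import re

# Alternatives are tried left to right, so emoticons and hashtags win
# over the generic "single punctuation mark" fallback.
TOKEN_PATTERN = re.compile(
    r"""
    [:;=][\-o\*']?[\)\]\(\[dDpP/\\]   # a few western-style emoticons
    | [#@]\w+                         # hashtags and @mentions
    | \w+(?:[-']\w+)*                 # words, allowing inner hyphens and apostrophes
    | [^\w\s]                         # any other single punctuation mark
    """,
    re.VERBOSE,
)

def social_tokenize(text):
    """Split text into tokens while keeping emoticons and hashtags whole."""
    return TOKEN_PATTERN.findall(text)

print(social_tokenize("Loving the new phone :) #happy @nltk it's great!"))
# ['Loving', 'the', 'new', 'phone', ':)', '#happy', '@nltk', "it's", 'great', '!']
```

A sentiment model would see `:)` and `#happy` as single features here, whereas a tokenizer that splits on every punctuation mark would break them apart.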

### Penn Treebank Tagset Overview

A tagset is a list of part-of-speech tags, i.e., labels used to indicate the part of speech and often also other grammatical categories (case, tense etc.) of each token in a text corpus.

#### Introduction to Penn Treebank Tagset

The English Penn Treebank tagset is utilized with English corpora annotated by the TreeTagger tool, developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. This version of the tagset contains modifications developed by Sketch Engine (earlier version).

[See a more recent version of this tagset](https://www.sketchengine.eu/penn-treebank-tagset/).

### What is a POS Tag?

POS tags classify words into grammatical categories which can help in understanding the structure and context of text. The table below shows the English Penn TreeBank tagset with Sketch Engine modifications (earlier version).

Example: Using `[tag="NNS"]` finds all nouns in the plural, e.g., people, years when used in the CQL concordance search (always use straight double quotation marks in CQL).

| POS Tag | Description                                   | Example             |
|---------|-----------------------------------------------|---------------------|
| CC      | Coordinating conjunction                      | and                 |
| CD      | Cardinal number                               | 1, third            |
| DT      | Determiner                                    | the                 |
| EX      | Existential there                             | there is            |
| FW      | Foreign word                                  | les                 |
| IN      | Preposition, subordinating conjunction        | in, of, like        |
| IN/that | That as subordinator                          | that                |
| JJ      | Adjective                                     | green               |
| JJR     | Adjective, comparative                        | greener             |
| JJS     | Adjective, superlative                        | greenest            |
| LS      | List marker                                   | 1)                  |
| MD      | Modal                                         | could, will         |
| NN      | Noun, singular or mass                        | table               |
| NNS     | Noun, plural                                  | tables              |
| NP      | Proper noun, singular                         | John                |
| NPS     | Proper noun, plural                           | Vikings             |
| PDT     | Predeterminer                                 | both the boys       |
| POS     | Possessive ending                             | friend’s            |
| PP      | Personal pronoun                              | I, he, it           |
| PPZ     | Possessive pronoun                            | my, his             |
| RB      | Adverb                                        | however, usually    |
| RBR     | Adverb, comparative                           | better              |
| RBS     | Adverb, superlative                           | best                |
| RP      | Particle                                      | give up             |
| SENT    | Sentence-break punctuation                    | . ! ?               |
| SYM     | Symbol                                        | / [ = *             |
| TO      | Infinitive ‘to’                               | to go               |
| UH      | Interjection                                  | uhhuhhuhh           |
| VB      | Verb, base form                               | be                  |
| VBD     | Verb, past tense                              | was, were           |
| VBG     | Verb, gerund/present participle               | being               |
| VBN     | Verb, past participle                         | been                |
| VBP     | Verb, sing. present, non-3d                   | am, are             |
| VBZ     | Verb, 3rd person sing. present                | is                  |
| VH      | Verb have, base form                          | have                |
| VHD     | Verb have, past tense                         | had                 |
| VHG     | Verb have, gerund/present participle          | having              |
| VHN     | Verb have, past participle                    | had                 |
| VHP     | Verb have, sing. present, non-3d              | have                |
| VHZ     | Verb have, 3rd person sing. present           | has                 |
| VV      | Verb, base form                               | take                |
| VVD     | Verb, past tense                              | took                |
| VVG     | Verb, gerund/present participle               | taking              |
| VVN     | Verb, past participle                         | taken               |
| VVP     | Verb, sing. present, non-3d                   | take                |
| VVZ     | Verb, 3rd person sing. present                | takes               |
| WDT     | Wh-determiner                                 | which               |
| WP      | Wh-pronoun                                    | who, what           |
| WP$     | Possessive wh-pronoun                         | whose               |
| WRB     | Wh-adverb                                     | where, when         |
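
The `[tag="NNS"]` query mentioned above can be mimicked in plain Python to show what such a search does. This is only a sketch: the tagged tokens are hand-annotated for illustration, `find_by_tag` is a hypothetical helper, and it assumes (as in CQL) that the tag value is treated as a regular expression that must match the whole tag.

```python
import re

# Hand-tagged tokens using tags from the table above (illustrative data only).
tagged_tokens = [
    ("people", "NNS"),
    ("live", "VVP"),
    ("for", "IN"),
    ("many", "JJ"),
    ("years", "NNS"),
]

def find_by_tag(tokens, tag_pattern):
    """Return the words whose tag fully matches the given regular expression."""
    return [word for word, tag in tokens if re.fullmatch(tag_pattern, tag)]

print(find_by_tag(tagged_tokens, "NNS"))   # ['people', 'years']
print(find_by_tag(tagged_tokens, "V.*"))   # ['live']
```

Tags that contain regex metacharacters, such as `WP$`, would need escaping (for example with `re.escape`) before being used as a pattern.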

### Main Differences to the Default Penn Tagset
- In TreeTagger:
  - Distinguishes 'be' (VB) and 'have' (VH) from other (non-modal) verbs (VV).
  - For proper nouns, NNP and NNPS have become NP and NPS.
  - SENT for end-of-sentence punctuation (other punctuation tags may also differ).
- In TreeTagger tool + Sketch Engine modifications:
  - The word 'to' is tagged IN when used as a preposition and TO when used as an infinitive marker.
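
The differences listed above can be sketched as a small mapping from TreeTagger-style tags back to the default Penn tags. The function name `treetagger_to_penn` is hypothetical, and the sketch only covers the differences named in this section (verbs, proper nouns, and SENT); every other tag is returned unchanged, which is a simplification.

```python
import re

# Proper-noun tags renamed by TreeTagger.
PROPER_NOUNS = {"NP": "NNP", "NPS": "NNPS"}

def treetagger_to_penn(tag):
    """Map a TreeTagger tag to its default Penn Treebank counterpart."""
    if tag in PROPER_NOUNS:
        return PROPER_NOUNS[tag]
    if tag == "SENT":
        return "."  # Penn tags sentence-final punctuation as "."
    # VV*, VH* (and their suffixed forms) all collapse to Penn's VB* series.
    match = re.fullmatch(r"V[VH]([DGNPZ]?)", tag)
    if match:
        return "VB" + match.group(1)
    return tag  # assumption: all remaining tags are left as they are

for tag in ["VVD", "VHZ", "NP", "SENT", "JJ"]:
    print(tag, "->", treetagger_to_penn(tag))
# VVD -> VBD
# VHZ -> VBZ
# NP -> NNP
# SENT -> .
# JJ -> JJ
```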

### Bibliography

M. Marcus, B. Santorini and M.A. Marcinkiewicz (1993). Building a large annotated corpus of English: The Penn Treebank. In Computational Linguistics, volume 19, number 2, pp. 313–330.

### Part of Speech POS Tags

### Remove Punctuation

In [7]:
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

print(remove_punctuation(sentence))
print(remove_punctuation(complex_sentence))

The quick brown fox jumps over the lazy dog
Dr Emily Wong PhD exclaimed OMG Despite reaching 78 billion in 2021 global population metrics like those in Section 123i3Aii remain baffling and then she added in a mix of French and English Cest incroyable no Check my blog at wwwemilywongsciencecom or email me at wongscienceworldcom for the full story on Mycobacterium tuberculosis complex MTBC research which is you know the stateoftheart stuffIve literally termed it the Occams razor of epidemiology
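
The output above shows a side effect of stripping every punctuation mark: `7.8` becomes `78` and the URL collapses into one word. One possible alternative, sketched below under the assumption that punctuation between two ASCII letters or digits is worth keeping, removes only the "outer" punctuation. The name `remove_outer_punctuation` is illustrative.

```python
import re

# Match a punctuation mark unless it has an ASCII letter or digit on BOTH sides.
# (ASCII-only on purpose; accented letters would need a wider character class.)
OUTER_PUNCT = re.compile(r"(?<![A-Za-z0-9])[^\w\s]|[^\w\s](?![A-Za-z0-9])")

def remove_outer_punctuation(text):
    """Drop punctuation except where it sits inside a number, URL, or hyphenated word."""
    return OUTER_PUNCT.sub("", text)

print(remove_outer_punctuation("It reached 7.8 billion, see www.example.com!"))
# It reached 7.8 billion see www.example.com
```

Whether this is preferable depends on the downstream task; for a bag-of-words model the simpler version above may be perfectly adequate.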


### Lowercasing

In [9]:
def lowercase(text):
    return text.lower()

print(lowercase(sentence), "\n", lowercase(complex_sentence))

the quick brown fox jumps over the lazy dog. 
 dr. emily wong, ph.d., exclaimed, 'omg! despite reaching 7.8 billion in 2021, global population metrics, like those in section 123(i)(3)(a)(ii), remain baffling,' and then she added in a mix of french and english, 'c’est incroyable, no? check my blog at www.emilywong-science.com or email me at wong@scienceworld.com for the full story on mycobacterium tuberculosis complex (mtbc) research, which is, you know, the state-of-the-art stuff—i've literally termed it "the occam’s razor of epidemiology.


### Stemming

In [10]:
from nltk.stem import PorterStemmer

def stemming(words):
    """Stem each token in a list of words with the Porter stemmer."""
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in words]

# Tokenize first: passing a raw string would make the function iterate over single characters
print(stemming(word_tokenize(sentence)), "\n", stemming(word_tokenize(complex_sentence)))

### Lemmatization

In [12]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

def lemmatization(words):
    """Lemmatize each token in a list of words with WordNet (tokens are treated as nouns by default)."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]

# Tokenize first: passing a raw string would make the function iterate over single characters
print(lemmatization(word_tokenize(sentence)), "\n", lemmatization(word_tokenize(complex_sentence)))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ganad\AppData\Roaming\nltk_data...

## Text Classification

In [2]:
import os
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import pickle

model_dir = r'C:\Users\ganad\Desktop\Resume Work\NLP_HUB\models'

# Fetch the dataset
data = fetch_20newsgroups(subset='train')
categories = data.target_names

# Create pipelines for different classifiers
models = {
    'Naive Bayes': make_pipeline(TfidfVectorizer(), MultinomialNB()),
    'SVM': make_pipeline(TfidfVectorizer(), SVC(probability=True)),
    'Logistic Regression': make_pipeline(TfidfVectorizer(), LogisticRegression())
}

# Train and save each model in the specified directory
os.makedirs(model_dir, exist_ok=True)  # create the directory if it does not exist yet
for name, model in models.items():
    model.fit(data.data, data.target)
    model_path = os.path.join(model_dir, f'{name.lower().replace(" ", "_")}_classifier.pkl')
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

In [5]:
# Define the directory where models are saved
model_dir = r'C:\Users\ganad\Desktop\Resume Work\NLP_HUB\models'

# Dictionary to map model names to file names
model_files = {
    'Naive Bayes': os.path.join(model_dir, 'naive_bayes_classifier.pkl'),
    'SVM': os.path.join(model_dir, 'svm_classifier.pkl'),
    'Logistic Regression': os.path.join(model_dir, 'logistic_regression_classifier.pkl')
}

# Function to load a model
def load_model(model_name):
    with open(model_files[model_name], 'rb') as f:
        model = pickle.load(f)
    return model

# Load the dataset to fetch categories
data = fetch_20newsgroups(subset='train')
categories = data.target_names

# Load all models
models = {name: load_model(name) for name in model_files.keys()}


# Print predictions from each model
for name, model in models.items():
    prediction = model.predict([sentence])
    print(f"Prediction from {name}: {categories[prediction[0]]}")

for name, model in models.items():
    prediction = model.predict([complex_sentence])
    print(f"Prediction from {name}: {categories[prediction[0]]}")

Prediction from Naive Bayes: rec.motorcycles
Prediction from SVM: rec.motorcycles
Prediction from Logistic Regression: rec.motorcycles
Prediction from Naive Bayes: soc.religion.christian
Prediction from SVM: sci.med
Prediction from Logistic Regression: sci.med


## Text Similarity

**Similarity Measures**
- Jaccard Index or Jaccard Similarity:
    - Jaccard Similarity is the size of the intersection divided by the size of the union of two sets.
    - Jaccard Similarity = (the number in both sets) / (the number in either set).
    - Jaccard Similarity = |A ∩ B| / |A ∪ B|; multiply by 100 to express it as a percentage.
- Cosine Similarity:
    - Cosine similarity measures the cosine of the angle between two vectors, so it compares direction rather than magnitude.
    - Cosine Similarity = (A . B) / (||A|| ||B||).
    - Cosine Similarity = (Sum of A[i] * B[i]) / ((Square Root of Sum of A[i]^2) * (Square Root of Sum of B[i]^2)).
- Euclidean Distance:
    - Euclidean distance is the straight-line distance between two points (here, two vectors); smaller values mean more similar inputs.
    - Euclidean Distance = sqrt(sum((x - y)^2)).
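
The cosine and Euclidean formulas above can be written out directly in plain Python. This is a minimal sketch that uses bag-of-words counts as the vectors, which is an assumption made to keep the example dependency-free; the cells below use spaCy embeddings instead. The helper names `bag_of_words`, `cosine_sim`, and `euclidean_dist` are illustrative.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Turn a sentence into a word-count vector (a Counter returns 0 for missing words)."""
    return Counter(text.lower().split())

def cosine_sim(a, b):
    """(A . B) / (||A|| ||B||) over two count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)

def euclidean_dist(a, b):
    """sqrt(sum((x - y)^2)) over the union of both vocabularies."""
    return math.sqrt(sum((a[term] - b[term]) ** 2 for term in a.keys() | b.keys()))

s1 = bag_of_words("Mcdonalds is unhealthy because it has processed food")
s2 = bag_of_words("Denny's is unhealthy because it has processed food")

print(cosine_sim(s1, s2))      # approximately 0.875 (7 shared words out of 8 in each sentence)
print(euclidean_dist(s1, s2))  # approximately 1.414 (two words differ, so sqrt(2))
```

With count vectors, two sentences that differ in a single word share no information about how related those words are; embeddings, as used below, can place "Denny's" and "Mcdonalds" close together.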

In [1]:
def jaccard_similarity(x, y):
    """Return the Jaccard similarity between two lists of tokens."""
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    if not union:  # both inputs empty: return 0.0 instead of dividing by zero
        return 0.0
    return len(set_x & set_y) / len(union)

In [2]:
sentences = ["Mcdonalds is unhealthy because it has processed food",
"Denny's is unhealthy because it has processed food"]
sentences = [sent.lower().split(" ") for sent in sentences]
jaccard_similarity(sentences[0], sentences[1])

0.7777777777777778

In [10]:
import spacy
from scipy.spatial.distance import euclidean

nlp = spacy.load("en_core_web_md")

sentences = ["Mcdonalds is unhealthy because it has processed food", "Mcdonalds is unhealthy because it has processed food"]

# Generate embeddings for each sentence
embeddings = [nlp(sentence).vector for sentence in sentences]

# Calculate the Euclidean distance between the first two embeddings
distance = euclidean(embeddings[0], embeddings[1])
print(distance)

0.0


In [11]:
from math import exp
def distance_to_similarity(distance):
  return 1/exp(distance)

distance_to_similarity(distance) 

1.0

In [18]:
from scipy.spatial.distance import cosine

nlp = spacy.load("en_core_web_md")

sentences = ["Denny is unhealthy because it has processed food", "Mcdonalds is unhealthy because it has processed food"]
embeddings = [nlp(sentence).vector for sentence in sentences]

cosine_similarity = 1 - cosine(embeddings[0], embeddings[1])
print(cosine_similarity)

0.9957017397069031


## Keyword Extraction

In [20]:
!pip install rake_nltk



In [22]:
from rake_nltk import Rake
rake = Rake()
text = "Artificial intelligence and machine learning are transforming healthcare by enabling predictive analytics, personalized medicine, and efficient diagnostic tools, which significantly improve patient outcomes and operational efficiencies."
rake.extract_keywords_from_text(text)
keywords = rake.get_ranked_phrases_with_scores()
for score, phrase in keywords:
    print(f"{score}: {phrase}")

16.0: significantly improve patient outcomes
9.0: enabling predictive analytics
9.0: efficient diagnostic tools
4.0: transforming healthcare
4.0: personalized medicine
4.0: operational efficiencies
4.0: machine learning
4.0: artificial intelligence
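
The scores above follow a simple pattern: a phrase of n distinct words scores n squared (16, 9, 4). The sketch below reproduces that arithmetic with a degree-to-frequency word score, which is assumed here to be the default metric in `rake_nltk`. It is not the library's implementation, and the four-word stopword set is a stand-in chosen only for this text.

```python
import re
from collections import defaultdict

STOPWORDS = {"and", "are", "by", "which"}  # tiny illustrative list, not NLTK's

def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into phrases at punctuation marks and stopwords."""
    phrases = []
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text, stopwords=STOPWORDS):
    """Score each phrase as the sum of degree(word) / frequency(word)."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, itself included
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}

sample = ("Artificial intelligence and machine learning are transforming healthcare "
          "by enabling predictive analytics, personalized medicine, and efficient "
          "diagnostic tools, which significantly improve patient outcomes and "
          "operational efficiencies.")

for phrase, score in sorted(rake_scores(sample).items(), key=lambda kv: -kv[1])[:3]:
    print(f"{score}: {phrase}")
# 16.0: significantly improve patient outcomes
# 9.0: enabling predictive analytics
# 9.0: efficient diagnostic tools
```

Because every word in this text occurs once, each word's score equals the length of its phrase, so longer phrases are favoured. That is also why the long phrases dominate the job-post ranking further down.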


**Let's try using a job post as an example**
https://g.co/kgs/pvxFeR9

Machine Learning Engineer, Applied Machine

Job highlights
Identified by Google from the original job post

**Qualifications**
- Solid understanding of machine learning, deep learning (including LLMs) and natural language processing
- 5+ years proven programming skills using standard ML tools such as C/C++, Python, PyTorch, Tensorflow, HuggingFace, etc
- Hands-on experience working (training, fine-tuning, optimizing, deploying) with large language models
- Hands-on experience applying common machine learning optimization techniques, like quantization and distillation, to reduce the resource consumption and/or eliminate latency
- Bachelors or MS in Computer Science or equivalent with Artificial Intelligence, Machine Learning, Data Science or related field

**Responsibilities**
- You will help design and implement our machine learning strategy to take our developer experience platform to the next level and help accelerate app development inside Apple
- We will be collaborating and working with multi-functional teams and applying algorithms to large-scale data
- You will work data scientist, Machine Learning engineers, Software engineers to deliver and end to end AI enabled solution for this platform
- You will work with existing and new model evaluate them, fine tune them come up with use cases to solve business problems

**Benefits**
- Pay & Benefits
- At Apple, base pay is one part of our total compensation package and is determined within a range
- The base pay range for this role is between $138,900.00 and $256,500.00, and your base pay will depend on your skills, qualifications, experience, and location
- You’ll also receive benefits including: Comprehensive medical and dental coverage, retirement benefits, a range of discounted products and free services, and for formal education related to advancing your career at Apple, reimbursement for certain educational expenses — including tuition
- Additionally, this role might be eligible for discretionary bonuses or commission payments as well as relocation

In [23]:
text = "Qualifications Solid understanding of machine learning, deep learning (including LLMs) and natural language processing. 5+ years proven programming skills using standard ML tools such as C/C++, Python, PyTorch, Tensorflow, HuggingFace, etc. Hands-on experience working (training, fine-tuning, optimizing, deploying) with large language models. Hands-on experience applying common machine learning optimization techniques, like quantization and distillation, to reduce the resource consumption and/or eliminate latency. Bachelors or MS in Computer Science or equivalent with Artificial Intelligence, Machine Learning, Data Science or related field Responsibilities. You will help design and implement our machine learning strategy to take our developer experience platform to the next level and help accelerate app development inside Apple. We will be collaborating and working with multi-functional teams and applying algorithms to large-scale data. You will work data scientist, Machine Learning engineers, Software engineers to deliver and end to end AI enabled solution for this platform. You will work with existing and new model evaluate them, fine tune them come up with use cases to solve business problems"

rake.extract_keywords_from_text(text)
keywords = rake.get_ranked_phrases_with_scores()
for score, phrase in keywords:
    print(f"{score}: {phrase}")

64.0: years proven programming skills using standard ml tools
36.06666666666666: experience applying common machine learning optimization techniques
34.0: help accelerate app development inside apple
14.5: end ai enabled solution
9.566666666666666: machine learning strategy
9.066666666666666: machine learning engineers
9.0: solve business problems
9.0: related field responsibilities
9.0: qualifications solid understanding
9.0: new model evaluate
9.0: natural language processing
9.0: developer experience platform
8.0: large language models
8.0: c ++, python
7.333333333333334: work data scientist
6.566666666666666: machine learning
6.566666666666666: machine learning
6.5: applying algorithms
6.0: help design
5.5: experience working
5.166666666666666: deep learning
4.5: software engineers
4.333333333333334: scale data
4.333333333333334: data science
4.0: use cases
4.0: resource consumption
4.0: next level
4.0: like quantization
4.0: including llms
4.0: functional teams
4.0: eliminate late

## Topic Modeling

In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD

# Sample clean corpus
clean_corpus = [
    'artificial intelligence and machine learning are transforming healthcare',
    'predictive analytics and personalized medicine are becoming more prevalent',
    'efficient diagnostic tools and operational efficiencies are improving patient outcomes',
    'healthcare systems are increasingly adopting artificial intelligence for better performance'
]

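def print_topics(model, vectorizer, top_n=10):
    feature_names = vectorizer.get_feature_names_out()  # look up the vocabulary once
    for idx, topic in enumerate(model.components_):
        print(f"Topic {idx + 1}:")
        # indices of the top_n highest-weighted terms, in descending order
        top_indices = topic.argsort()[:-top_n - 1:-1]
        print([(feature_names[i], topic[i]) for i in top_indices])
        print("\n")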

# LDA 
vectorizer_lda = CountVectorizer(stop_words='english')
doc_term_matrix_lda = vectorizer_lda.fit_transform(clean_corpus)

lda_model = LatentDirichletAllocation(n_components=2, random_state=42)
lda_model.fit(doc_term_matrix_lda)

print("LDA Topics:")
print_topics(lda_model, vectorizer_lda)

# LSA 
vectorizer_lsa = TfidfVectorizer(stop_words='english')
tfidf_matrix_lsa = vectorizer_lsa.fit_transform(clean_corpus)

lsa_model = TruncatedSVD(n_components=2, random_state=42)
lsa_model.fit(tfidf_matrix_lsa)

print("LSA Topics:")
print_topics(lsa_model, vectorizer_lsa)

# Document-Topic Distributions for LDA
doc_topic_distributions_lda = lda_model.transform(doc_term_matrix_lda)
print("LDA Document-Topic Distributions:")
print(doc_topic_distributions_lda)

# Document-Topic Distributions for LSA
doc_topic_distributions_lsa = lsa_model.transform(tfidf_matrix_lsa)
print("LSA Document-Topic Distributions:")
print(doc_topic_distributions_lsa)

LDA Topics:
Topic 1:
[('improving', 1.4974504780382687), ('diagnostic', 1.4974504780382687), ('patient', 1.4974504780382687), ('outcomes', 1.4974504780382687), ('operational', 1.4974504780382687), ('tools', 1.4974504780382687), ('efficiencies', 1.4974504780382687), ('efficient', 1.4974504780382687), ('prevalent', 1.4958830831626244), ('medicine', 1.495883083162623)]


Topic 2:
[('intelligence', 2.4967267443510845), ('artificial', 2.496726744293454), ('healthcare', 2.4967267442656587), ('adopting', 1.497264948062675), ('systems', 1.497264948062675), ('better', 1.497264948062675), ('performance', 1.497264948062675), ('increasingly', 1.497264948062675), ('learning', 1.4963600003121), ('transforming', 1.4963600001151094)]


LSA Topics:
Topic 1:
[('artificial', 0.40478509903979104), ('healthcare', 0.40478509903979104), ('intelligence', 0.40478509903979104), ('transforming', 0.27875639219256004), ('machine', 0.27875639219256004), ('learning', 0.27875639219256004), ('adopting', 0.234662179407

## Text Summarization

In [31]:
import spacy
import pytextrank
 
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textrank")
 
example_text = """Deep learning (also known as deep structured learning) is part of a 
broader family of machine learning methods based on artificial neural networks with 
representation learning. Learning can be supervised, semi-supervised or unsupervised. 
Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, 
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, speech recognition, natural language processing, 
machine translation, bioinformatics, drug design, medical image analysis, material
inspection and board game programs, where they have produced results comparable to 
and in some cases surpassing human expert performance. Artificial neural networks
(ANNs) were inspired by information processing and distributed communication nodes
in biological systems. ANNs have various differences from biological brains. Specifically, 
neural networks tend to be static and symbolic, while the biological brain of most living organisms
is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple
layers in the network. Early work showed that a linear perceptron cannot be a universal classifier, 
but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can.
Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size, 
which permits practical application and optimized implementation, while retaining theoretical universality 
under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely 
from biologically informed connectionist models, for the sake of efficiency, trainability and understandability, 
whence the structured part."""
print('Original Document Size:',len(example_text))
doc = nlp(example_text)
 
for sent in doc._.textrank.summary(limit_phrases=2, limit_sentences=2):
    print(sent)
    # len(sent) counts spaCy tokens, whereas len(example_text) above counts characters
    print('Summary Length:',len(sent))

Original Document Size: 1820
Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, 
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, speech recognition, natural language processing, 
machine translation, bioinformatics, drug design, medical image analysis, material
inspection and board game programs, where they have produced results comparable to 
and in some cases surpassing human expert performance.
Summary Length: 81
The adjective "deep" in deep learning refers to the use of multiple
layers in the network.
Summary Length: 20


## Language Translation

In [32]:
from deep_translator import GoogleTranslator
to_translate = 'I like to live my life'
translated = GoogleTranslator(source='auto', target='ar').translate(to_translate)

In [34]:
print(translated)

أحب أن أعيش حياتي


In [35]:
to_translate = 'I love you'
translated = GoogleTranslator(source='auto', target='fr').translate(to_translate)

In [36]:
print(translated)

Je t'aime


**I know English, Arabic, and a bit of French, which is why I chose these examples.**

Both translations are correct!

# References:
- Bird, Steven, Edward Loper, and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.
- https://neptune.ai/blog/tokenization-in-nlp
- https://www.sketchengine.eu/penn-treebank-tagset/
- https://www.newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python#toc-3
- https://towardsdatascience.com/keyword-extraction-process-in-python-with-natural-language-processing-nlp-d769a9069d5c
- https://www.datacamp.com/tutorial/what-is-topic-modeling
- https://www.geeksforgeeks.org/text-summarization-in-nlp/
- https://medium.com/analytics-vidhya/how-to-translate-text-with-python-9d203139dcf5