<a href="https://colab.research.google.com/github/MhAmine/intro-to-NLP-Workshop/blob/main/Applied_NLP_NLP_with_NLTK_and_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Natural Language Processing (NLP) enables machines to understand and process human language. Two widely used Python NLP libraries are:

- NLTK (Natural Language Toolkit): a toolkit oriented toward teaching and linguistic experimentation, with tokenizers, stemmers, and corpora.
- spaCy: a fast library designed for production use, with pretrained pipelines for tasks such as Named Entity Recognition (NER) and dependency parsing.

We'll explore key NLP tasks in English and French using NLTK and spaCy.

### 1. Installing Required Libraries

In [None]:
# Uncomment if the libraries are not already installed:
# !pip install nltk spacy


In [2]:
import spacy

### ✨ Note: For French NLP with spaCy, download the French language model:

In [6]:
!python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")


Collecting fr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl (16.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.3/16.3 MB 84.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
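
If a model has not been downloaded, `spacy.load` raises an error. A minimal sketch of checking for the package first, assuming a hypothetical helper `load_or_blank` (not part of spaCy) that falls back to a blank, tokenizer-only pipeline when the model is missing:

```python
import spacy
from spacy.util import is_package


def load_or_blank(model_name, lang_code):
    """Load an installed spaCy model, or fall back to a blank pipeline.

    Hypothetical helper for illustration only. The blank pipeline can
    tokenize text but has no tagger, lemmatizer, or NER component.
    """
    if is_package(model_name):
        return spacy.load(model_name)
    print(f"{model_name} is not installed; using a blank '{lang_code}' pipeline")
    return spacy.blank(lang_code)


nlp_fr_checked = load_or_blank("fr_core_news_sm", "fr")
# An empty list here means the fallback (tokenizer-only) pipeline is in use
print(nlp_fr_checked.pipe_names)
```

A fallback like this is only suitable for tokenization; the later sections on lemmas, POS tags, and entities need the trained model.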


### ✨ 2. Tokenization (Splitting Text into Words or Sentences)

🔹 Using spaCy

In [12]:
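import spacy

# Both models must be installed first, e.g.:
#   python -m spacy download en_core_web_sm
nlp_en = spacy.load("en_core_web_sm")
nlp_fr = spacy.load("fr_core_news_sm")

# Sample sentences (reconstructed from the token output printed below)
text_en = "Natural Language Processing is amazing! Let's learn it."
text_fr = "Le traitement du langage naturel est fascinant ! Apprenons-le."

# Run each pipeline on its text to get a Doc object
doc_en = nlp_en(text_en)
doc_fr = nlp_fr(text_fr)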

tokens_en_spacy = [token.text for token in doc_en]
tokens_fr_spacy = [token.text for token in doc_fr]

print("spaCy English Tokens:", tokens_en_spacy)
print("spaCy French Tokens:", tokens_fr_spacy)


spaCy English Tokens: ['Natural', 'Language', 'Processing', 'is', 'amazing', '!', 'Let', "'s", 'learn', 'it', '.']
spaCy French Tokens: ['Le', 'traitement', 'du', 'langage', 'naturel', 'est', 'fascinant', '!', 'Apprenons', '-', 'le', '.']
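
The heading also mentions sentences, which the cell above does not cover. One way to split text into sentences without a trained model is spaCy's rule-based `sentencizer` component on a blank pipeline. This is a sketch with illustrative `_demo` names; a trained pipeline such as `en_core_web_sm` exposes `doc.sents` as well, using its parser instead of punctuation rules.

```python
import spacy

# Blank English pipeline: tokenizer only, no model download required
nlp_demo = spacy.blank("en")
# Rule-based sentence splitter that breaks on sentence-final punctuation
nlp_demo.add_pipe("sentencizer")

doc_demo = nlp_demo("Natural Language Processing is amazing! Let's learn it.")
sentences_demo = [sent.text for sent in doc_demo.sents]
print(sentences_demo)
```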


### ✨ 3. Stopwords Removal (Common Words Filtering)

- Using spaCy: each token carries an `is_stop` flag based on the language's built-in stop word list, so no separate word list has to be loaded.

In [13]:
filtered_en_spacy = [token.text for token in doc_en if not token.is_stop]
filtered_fr_spacy = [token.text for token in doc_fr if not token.is_stop]

print("Filtered spaCy English Tokens:", filtered_en_spacy)
print("Filtered spaCy French Tokens:", filtered_fr_spacy)


Filtered spaCy English Tokens: ['Natural', 'Language', 'Processing', 'amazing', '!', 'Let', 'learn', '.']
Filtered spaCy French Tokens: ['traitement', 'langage', 'naturel', 'fascinant', '!', 'Apprenons', '-', '.']
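
The filtered lists above still contain punctuation. A small sketch that also drops punctuation with `token.is_punct`, run on a blank English pipeline so it works without a downloaded model. The `_demo` names are illustrative; the same comprehension applies to the notebook's `doc_en` and `doc_fr`.

```python
import spacy

# Stop word and punctuation flags are lexical attributes, so a blank
# pipeline (tokenizer only) is enough to use them
nlp_stop_demo = spacy.blank("en")
doc_stop_demo = nlp_stop_demo("Natural Language Processing is amazing! Let's learn it.")

# Keep only tokens that are neither stop words nor punctuation
content_tokens = [
    token.text
    for token in doc_stop_demo
    if not token.is_stop and not token.is_punct
]
print(content_tokens)

# The stop word list itself lives on the language defaults
print(len(nlp_stop_demo.Defaults.stop_words), "English stop words")
```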


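### ✨ 4. Stemming & Lemmatization (Finding Word Roots)
- Using spaCy (Lemmatization): spaCy does not include a stemmer; it provides lemmatization, which maps each token to its dictionary form.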

In [14]:
lemmas_en = [token.lemma_ for token in doc_en]
lemmas_fr = [token.lemma_ for token in doc_fr]

print("Lemmatized English Tokens:", lemmas_en)
print("Lemmatized French Tokens:", lemmas_fr)


Lemmatized English Tokens: ['Natural', 'Language', 'Processing', 'be', 'amazing', '!', 'let', 'us', 'learn', 'it', '.']
Lemmatized French Tokens: ['le', 'traitement', 'de', 'langage', 'naturel', 'être', 'fasciner', '!', 'apprendre', '-', 'le', '.']
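
Stemming is not shown above because spaCy has no stemmer. A minimal sketch using NLTK's Snowball stemmer, which supports both English and French and needs no extra data download. The word lists are illustrative. Unlike lemmas, stems are truncated forms and are not guaranteed to be real words.

```python
from nltk.stem.snowball import SnowballStemmer

# One stemmer per language; Snowball is rule-based and needs no corpus
stemmer_en = SnowballStemmer("english")
stemmer_fr = SnowballStemmer("french")

# Illustrative word lists (not taken from the notebook's variables)
words_en = ["running", "learning", "amazing"]
words_fr = ["apprenons", "fascinant", "traitement"]

# Map each word to its stem
stems_en = {word: stemmer_en.stem(word) for word in words_en}
stems_fr = {word: stemmer_fr.stem(word) for word in words_fr}

print("English stems:", stems_en)
print("French stems:", stems_fr)
```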


### ✨ 5. Part-of-Speech (POS) Tagging
- Identifying the grammatical role of words.

In [15]:
pos_tags_en_spacy = [(token.text, token.pos_) for token in doc_en]
pos_tags_fr_spacy = [(token.text, token.pos_) for token in doc_fr]

print("spaCy POS Tags (English):", pos_tags_en_spacy)
print("spaCy POS Tags (French):", pos_tags_fr_spacy)


spaCy POS Tags (English): [('Natural', 'PROPN'), ('Language', 'PROPN'), ('Processing', 'PROPN'), ('is', 'AUX'), ('amazing', 'ADJ'), ('!', 'PUNCT'), ('Let', 'VERB'), ("'s", 'PRON'), ('learn', 'VERB'), ('it', 'PRON'), ('.', 'PUNCT')]
spaCy POS Tags (French): [('Le', 'DET'), ('traitement', 'NOUN'), ('du', 'ADP'), ('langage', 'NOUN'), ('naturel', 'ADJ'), ('est', 'AUX'), ('fascinant', 'VERB'), ('!', 'PUNCT'), ('Apprenons', 'PRON'), ('-', 'PROPN'), ('le', 'PROPN'), ('.', 'PUNCT')]
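
The tag names are abbreviations from the Universal POS tag set. `spacy.explain` returns a short description for a label, which helps when reading output like the above. The sketch below reuses the English tags printed above as literal data, so it runs without a model; the variable names are illustrative.

```python
from collections import Counter

import spacy

# (token, tag) pairs copied from the English output above
tagged = [
    ("Natural", "PROPN"), ("Language", "PROPN"), ("Processing", "PROPN"),
    ("is", "AUX"), ("amazing", "ADJ"), ("!", "PUNCT"), ("Let", "VERB"),
    ("'s", "PRON"), ("learn", "VERB"), ("it", "PRON"), (".", "PUNCT"),
]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)

# Print each tag with its count and spaCy's description of the label
for tag, count in tag_counts.most_common():
    print(f"{tag:<6} {count}  {spacy.explain(tag)}")
```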


### ✨ 6. Named Entity Recognition (NER)
- Extracting names, dates, locations, etc.

In [16]:
text_en = "Elon Musk founded SpaceX in 2002 in the United States."
text_fr = "Emmanuel Macron est le président de la France depuis 2017."

doc_en = nlp_en(text_en)
doc_fr = nlp_fr(text_fr)

print("English Entities:")
for ent in doc_en.ents:
    print(ent.text, "→", ent.label_)

print("\nFrench Entities:")
for ent in doc_fr.ents:
    print(ent.text, "→", ent.label_)


English Entities:
Elon Musk → PERSON
2002 → DATE
the United States → GPE

French Entities:
Emmanuel Macron → PER
la France → LOC
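
The English output above does not list SpaceX. One possible way to make sure known names are recognized is spaCy's rule-based `entity_ruler` component. The sketch below uses a blank pipeline with a single hand-written pattern so it runs without a model; the pattern list is an assumption for illustration. With the notebook's `nlp_en`, the ruler would typically be added with `before="ner"` so that the rule-based matches are set before the statistical component runs.

```python
import spacy

# Blank pipeline: no statistical NER, only the rules added below
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Illustrative pattern: label the exact string "SpaceX" as an organization
ruler.add_patterns([{"label": "ORG", "pattern": "SpaceX"}])

doc_rules = nlp_rules("Elon Musk founded SpaceX in 2002 in the United States.")
entities = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(entities)
```

Note also that the two models use different label schemes (`PERSON` and `GPE` in English, `PER` and `LOC` in French), so code that filters on labels needs to account for the language.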
