<a href="https://colab.research.google.com/github/Nkeeydata/NLP_PRACTICE/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Natural Language Processing Assignments on Foundational Courses

### What is Natural Language Processing (NLP)?

To me, Natural Language Processing (NLP) is like teaching computers to understand and communicate the way humans do. It helps machines read, listen, and respond to what we say or write, much like how Siri or Google Assistant answers questions and follows commands.


### Real-World Applications of NLP

a. Chatbots and Virtual Assistants
Apps like Siri, Alexa, and Copilot use NLP to interpret what we say and respond appropriately. They help us set reminders, search the web, and control devices by voice.

b. Language Translation
Google Translate uses NLP to convert text from one language into another, which is helpful when traveling or chatting with people from other countries.

c. Spam Detection in Emails
NLP helps email apps separate spam from important messages by scanning each message's words and patterns, which keeps the inbox clean.


### Challenges That Make NLP Complex

a. Ambiguity in Language
Words can mean different things depending on how they’re used. For example, “bat” could mean a flying animal or a baseball bat. Without enough context, a computer struggles to pick the right meaning.

b. Slang and Informal Language
People use slang, emojis, and abbreviations all the time, especially online. Expressions like “LOL” or “I’m dead 😂” are hard for machines to interpret unless they were trained on plenty of informal text.
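
The "bat" ambiguity can be made concrete with a toy disambiguator. The sketch below picks a sense by counting how many context words overlap with a list of cue words, which is a much-simplified take on dictionary-overlap methods. The `SENSES` table and the `guess_sense` helper are made up for illustration and do not come from any real lexicon or library.

```python
# Toy word-sense disambiguation: pick the sense whose cue words
# overlap most with the sentence. SENSES is a hypothetical, hand-made table.
SENSES = {
    "bat": {
        "animal": {"cave", "wings", "night", "flew", "fly"},
        "sports equipment": {"baseball", "hit", "ball", "swing", "player"},
    }
}

def guess_sense(word, sentence):
    """Return the best-matching sense label, or None when no cue word matches."""
    context = set(sentence.lower().replace(".", "").split())
    best_label, best_overlap = None, 0
    for label, cues in SENSES.get(word, {}).items():
        overlap = len(context & cues)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label

print(guess_sense("bat", "The bat flew out of the cave at night."))
print(guess_sense("bat", "The player hit the ball with his bat."))
print(guess_sense("bat", "I saw a bat."))
```

The last call returns `None` because the sentence gives no usable context, which is exactly the situation described above.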


### Number 4

In [None]:
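# Install spaCy into the running kernel's environment (%pip targets the active kernel)
%pip install spacy
# Download the small English model used for lemmatization later on
!python -m spacy download en_core_web_sm
import spacy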



Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "C:\Users\Nkechi Pc\anaconda3\Lib\site-packages\spacy\__init__.py", line 6, in <module>
  File "C:\Users\Nkechi Pc\anaconda3\Lib\site-packages\spacy\errors.py", line 3, in <module>
    from .compat import Literal
  File "C:\Users\Nkechi Pc\anaconda3\Lib\site-packages\spacy\compat.py", line 4, in <module>
    from thinc.util import copy_array
  File "C:\Users\Nkechi Pc\anaconda3\Lib\site-packages\thinc\__init__.py", line 5, in <module>
    from .config import registry
  File "C:\Users\Nkechi Pc\anaconda3\Lib\site-packages\thinc\config.py", line 5, in <module>
    from .types import Decorator
  File "C:\Users\Nkechi Pc\anaconda3\Lib\site-packages\thinc\types.py", line 27, in <module>
    from .compat import cupy, has_cupy
  File "C:\Users\Nkechi Pc\anaconda3\Lib\site-packages

ImportError: numpy.core.multiarray failed to import (auto-generated because you didn't call 'numpy.import_array()' after cimporting numpy; use '<void>numpy._import_array' to disable if you are certain you don't need it).

In [None]:
import re
import string
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
import pandas as pd



In [None]:
# Sample text
text = "Contact us at support@company.com or sales@business.org. For more, email info@service.net."

# Regex pattern to match email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Extract all matches
emails = re.findall(email_pattern, text)

# Display result
print("Extracted Emails:", emails)

Extracted Emails: ['support@company.com', 'sales@business.org', 'info@service.net']
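
As a possible extension, the same pattern can be split into named groups so that the user part and the domain of each address come back separately. This is only a sketch: the group names `user` and `domain` are my own choice, and the pattern remains a simplification that does not cover every address the email standards allow.

```python
import re

text = "Contact us at support@company.com or sales@business.org. For more, email info@service.net."

# Same character classes as the pattern above, with named groups for the two halves
email_re = re.compile(
    r'(?P<user>[a-zA-Z0-9._%+-]+)@(?P<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
)

# finditer yields match objects, so each group can be read by name
pairs = [(m.group("user"), m.group("domain")) for m in email_re.finditer(text)]

# Unique domains, lowercased and sorted for a stable result
domains = sorted({domain.lower() for _, domain in pairs})

print(pairs)
print(domains)
```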


### Number 5

In [None]:
# Original text
text = "NLP makes AI smarter! But, sometimes, it’s challenging... Don’t you agree?"

# Step 1: Remove ASCII punctuation (string.punctuation does not include the curly apostrophe ’)
text_no_punct = text.translate(str.maketrans('', '', string.punctuation))

# Step 2: Converting to lowercase
text_lower = text_no_punct.lower()

# Step 3: Split into words
words = text_lower.split()

# Display result
print("Cleaned Words:", words)


Cleaned Words: ['nlp', 'makes', 'ai', 'smarter', 'but', 'sometimes', 'it’s', 'challenging', 'don’t', 'you', 'agree']
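
Notice that `it’s` and `don’t` keep their apostrophes in the output above: `string.punctuation` only lists ASCII characters, and the curly apostrophe `’` is not one of them. One way to handle this, sketched here with the standard `unicodedata` module, is to drop every character whose Unicode category starts with `P` (punctuation). The helper name `strip_punctuation` is just for illustration.

```python
import unicodedata

text = "NLP makes AI smarter! But, sometimes, it’s challenging... Don’t you agree?"

def strip_punctuation(s):
    """Drop every character whose Unicode category starts with 'P' (punctuation)."""
    return "".join(ch for ch in s if not unicodedata.category(ch).startswith("P"))

# Same pipeline as before: remove punctuation, lowercase, split on whitespace
words = strip_punctuation(text).lower().split()
print("Cleaned Words:", words)
```

With this approach the two contractions become `its` and `dont`. Symbols such as `$` belong to a different Unicode category and would need separate handling.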


### Text Cleaning Task

In [None]:
# Original text
text = "OMG!! NLP is soooo coool 🤩...!!! It costs $1000. Learn it now at https://3mtt.com 😎."

# Step 1: Remove emojis and URLs
text = re.sub(r'https?://\S+', '', text)              # remove URLs
text = re.sub(r'[^\x00-\x7F]+', '', text)             # remove emojis and non-ASCII characters

# Step 2: Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))

# Step 3: Convert to lowercase
text = text.lower()

# Step 4: Split into words
words = text.split()

# Display result
print("Cleaned Words:", words)


Cleaned Words: ['omg', 'nlp', 'is', 'soooo', 'coool', 'it', 'costs', '1000', 'learn', 'it', 'now', 'at']
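
The cleaned list still contains stretched words such as `soooo` and `coool`. A common follow-up, not part of the original task, is to squeeze letters repeated three or more times down to two. The sketch below assumes lowercase input and only touches letters, so `1000` is left alone; the helper name `squeeze_repeats` is hypothetical.

```python
import re

words = ['omg', 'nlp', 'is', 'soooo', 'coool', 'it', 'costs', '1000', 'learn', 'it', 'now', 'at']

def squeeze_repeats(word):
    """Collapse any lowercase letter repeated 3+ times in a row down to two copies."""
    return re.sub(r'([a-z])\1{2,}', r'\1\1', word)

squeezed = [squeeze_repeats(w) for w in words]
print(squeezed)
```

This turns `coool` into `cool` but only gets `soooo` as far as `soo`, so a spell-checking step would still be needed for a full normalization.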


### Tokenization Task

In [None]:
# Download required resources
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Nkechi
[nltk_data]     Pc\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [None]:
text = "Tokenization is the first step in NLP. It splits text into smaller pieces for analysis."


In [None]:
# Sentence-level tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:")
print(sentences)

# Word-level tokenization
words = word_tokenize(text)
print("\nWord Tokenization:")
print(words)


Sentence Tokenization:
['Tokenization is the first step in NLP.', 'It splits text into smaller pieces for analysis.']

Word Tokenization:
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP', '.', 'It', 'splits', 'text', 'into', 'smaller', 'pieces', 'for', 'analysis', '.']
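
To see roughly what the tokenizers are doing, here is a minimal regex-based sketch using only the standard library. It is not how NLTK works internally, and it will mis-handle abbreviations like "Dr." and contractions, but it gives the same result for this simple text.

```python
import re

text = "Tokenization is the first step in NLP. It splits text into smaller pieces for analysis."

# Sentence split: break at whitespace that follows ., ! or ?
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# Word split: a token is either a run of word characters or a single punctuation mark
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)
print(words)
```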


### Stemming and Lemmatization Task

In [None]:
words = ["running", "flies", "studies", "easily", "studying", "better"]


# Initialize stemmer
stemmer = PorterStemmer()

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in words]
print("Stemmed Words:", stemmed_words)


Stemmed Words: ['run', 'fli', 'studi', 'easili', 'studi', 'better']
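
Stems like `fli` and `easili` look odd because a stemmer applies suffix rules mechanically rather than looking words up in a dictionary. The sketch below implements simplified versions of just two Porter rules (plural stripping and final `y` to `i`) to show where those outputs come from. It is not the full algorithm, and the function names are my own.

```python
VOWELS = set("aeiou")

def step_plural(word):
    """Simplified Porter step 1a: strip common plural endings."""
    if word.endswith("sses"):
        return word[:-2]          # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]          # flies -> fli
    if word.endswith("ss"):
        return word               # caress stays caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

def step_y_to_i(word):
    """Simplified Porter step 1c: final y -> i when the rest of the word has a vowel."""
    if word.endswith("y") and any(ch in VOWELS for ch in word[:-1]):
        return word[:-1] + "i"    # easily -> easili
    return word

for w in ["flies", "studies", "easily"]:
    print(w, "->", step_y_to_i(step_plural(w)))
```

Words like `running` need further rules (removing `-ing` and undoubling the consonant) that this sketch leaves out.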


In [None]:
# Load spaCy English model (requires the earlier `import spacy` to have succeeded)
nlp = spacy.load("en_core_web_sm")

# Apply lemmatization
doc = nlp(" ".join(words))
lemmatized_words = [token.lemma_ for token in doc]
print("Lemmatized Words:", lemmatized_words)



NameError: name 'spacy' is not defined
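
Since the spaCy cell did not run, there is no lemmatized output to compare against the stems. As a rough illustration of the idea only, the sketch below maps words to dictionary forms with a tiny hand-written table. `LEMMA_TABLE` is hypothetical and is not spaCy's data; real lemmatizers use much larger resources and part-of-speech information.

```python
# Hypothetical lemma table for illustration only
LEMMA_TABLE = {
    "running": "run",
    "flies": "fly",
    "studies": "study",
    "studying": "study",
    "better": "good",
}

def lookup_lemma(word):
    """Return the table entry for a word, or the word itself if it is unknown."""
    return LEMMA_TABLE.get(word.lower(), word)

words = ["running", "flies", "studies", "easily", "studying", "better"]
lemmas = [lookup_lemma(w) for w in words]
print("Lookup Lemmas:", lemmas)
```

Unlike the stems, every result here is a real word, which is the main practical difference between the two techniques.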

In [None]:
%pip install spacy


Note: you may need to restart the kernel to use updated packages.


### Assignment 3

In [None]:
# Define vocabulary
vocab = ["dog", "fox", "kernel", "lazy", "quick"]

# Create one-hot encoded vectors
one_hot = pd.DataFrame(
    [[int(word == target) for word in vocab] for target in vocab],
    columns=vocab,
)

# Display result
print("One-Hot Encoded Vectors:")
print(one_hot)



One-Hot Encoded Vectors:
   dog  fox  kernel  lazy  quick
0    1    0       0     0      0
1    0    1       0     0      0
2    0    0       1     0      0
3    0    0       0     1      0
4    0    0       0     0      1
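
Each row of the table is the vector for one word. For a single lookup, the same idea can be sketched without pandas; the `one_hot_vector` helper below is a hypothetical name, and raising `KeyError` for unknown words is one design choice among several.

```python
vocab = ["dog", "fox", "kernel", "lazy", "quick"]

# Map each word to its position so lookups do not scan the list
index = {word: i for i, word in enumerate(vocab)}

def one_hot_vector(word):
    """Return a list with a single 1 at the word's vocabulary position."""
    if word not in index:
        raise KeyError(f"{word!r} is not in the vocabulary")
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot_vector("kernel"))
```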


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Dataset
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog sleeps in the kernel"
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform
bow_matrix = vectorizer.fit_transform(sentences)

# Convert to DataFrame for readability
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

print("\nBag of Words Representation:")
print(bow_df)



Bag of Words Representation:
   brown  dog  fox  in  jumps  kernel  lazy  over  quick  sleeps  the
0      1    1    1   0      1       0     1     1      1       0    2
1      0    1    0   1      0       1     0     0      0       1    2
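
To make the counts less of a black box, the sketch below rebuilds the same matrix with the standard library. It assumes `CountVectorizer`'s default behaviour of lowercasing and keeping tokens of two or more word characters; the variable names are my own.

```python
import re
from collections import Counter

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog sleeps in the kernel",
]

def tokenize(s):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", s.lower())

# One Counter of term frequencies per sentence
counts = [Counter(tokenize(s)) for s in sentences]

# Alphabetically sorted vocabulary across all sentences
vocabulary = sorted(set().union(*counts))

# A Counter returns 0 for missing terms, which fills the matrix
bow_rows = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in bow_rows:
    print(row)
```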


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform
tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)

# Convert to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print("\nTF-IDF Representation:")
print(tfidf_df)



TF-IDF Representation:
      brown       dog       fox        in     jumps    kernel      lazy  \
0  0.342369  0.243598  0.342369  0.000000  0.342369  0.000000  0.342369   
1  0.000000  0.302531  0.000000  0.425196  0.000000  0.425196  0.000000   

       over     quick    sleeps       the  
0  0.342369  0.342369  0.000000  0.487197  
1  0.000000  0.000000  0.425196  0.605061  
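
The numbers above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and scaling each row to unit Euclidean length. It uses only the standard library, and the helper names are illustrative.

```python
import math
import re
from collections import Counter

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog sleeps in the kernel",
]

def tokenize(s):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", s.lower())

docs = [Counter(tokenize(s)) for s in sentences]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {
    term: math.log((1 + n_docs) / (1 + sum(term in d for d in docs))) + 1
    for term in vocabulary
}

def tfidf_row(doc):
    """Raw term count times idf, then scaled to unit Euclidean length."""
    weights = [doc[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf_rows = [tfidf_row(d) for d in docs]
for row in tfidf_rows:
    print([round(w, 6) for w in row])
```

Words that appear in both sentences (`dog`, `the`) get the minimum idf of 1, which is why `dog` scores lower than words unique to one sentence.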


### Word2Vec Assignment

In [None]:
# Sample dataset
sentences = [
    ["the", "cat", "meows"],
    ["the", "dog", "barks"],
    ["the", "bird", "sings"]
]
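
The notebook stops before any model is trained, so no embeddings are shown here. As a sketch of what the skip-gram flavour of Word2Vec learns from, the code below turns the sample dataset into (center, context) pairs. The window size of 1 and the `skipgram_pairs` helper are assumptions made for illustration, not part of the assignment.

```python
sentences = [
    ["the", "cat", "meows"],
    ["the", "dog", "barks"],
    ["the", "bird", "sings"],
]

def skipgram_pairs(tokens, window=1):
    """Yield (center, context) pairs for words within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

pairs = [p for sent in sentences for p in skipgram_pairs(sent)]
print(len(pairs))
print(pairs[:4])
```

A model trained on such pairs would see `cat`, `dog`, and `bird` all next to `the`, which is the kind of shared context that pulls related words toward similar vectors.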


In [None]:
# Note: upgrading thinc without a version pin can move it outside the range spaCy requires
%pip install --upgrade numpy scipy h5py thinc


Collecting scipy
  Downloading scipy-1.16.1-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting h5py
  Downloading h5py-3.14.0-cp312-cp312-win_amd64.whl.metadata (2.7 kB)
Collecting thinc
  Downloading thinc-9.1.1-cp312-cp312-win_amd64.whl.metadata (15 kB)
Collecting blis<1.1.0,>=1.0.0 (from thinc)
  Downloading blis-1.0.2-cp312-cp312-win_amd64.whl.metadata (7.8 kB)
Downloading scipy-1.16.1-cp312-cp312-win_amd64.whl (38.5 MB)
   -------------

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 3.8.7 requires thinc<8.4.0,>=8.3.4, but you have thinc 9.1.1 which is incompatible.
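
The conflict reported above can be checked before installing anything. The sketch below reads installed versions with the standard `importlib.metadata` module and tests a version against the range quoted in the error message. The helper names are hypothetical, and the comparison assumes plain dotted version numbers such as `8.3.4`.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def as_tuple(version):
    """Turn '8.3.4' into (8, 3, 4); assumes plain dotted integers."""
    return tuple(int(part) for part in version.split("."))

def in_range(version, minimum, upper_bound):
    """True when minimum <= version < upper_bound."""
    return as_tuple(minimum) <= as_tuple(version) < as_tuple(upper_bound)

# Range taken from the error message: thinc>=8.3.4,<8.4.0
print(in_range("9.1.1", "8.3.4", "8.4.0"))   # the version pip installed
print(in_range("8.3.4", "8.3.4", "8.4.0"))   # a version inside the range

# Report what is currently installed in this environment
for name in ("numpy", "thinc", "spacy"):
    print(name, installed_version(name))
```

The first check comes back `False`, matching pip's complaint that thinc 9.1.1 is incompatible with spaCy 3.8.7.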


### Thank You!