<a href="https://colab.research.google.com/github/MehraeenTimas/nlp-course/blob/main/NameEntityRecognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Named Entity Recognition (NER)



In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. announced a new partnership with OpenAI at the annual Oscar Award event in California."

doc = nlp(text)

print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")


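# Label key:
# ORG         = organization (companies, agencies, institutions)
# GPE         = geopolitical entity (countries, cities, states)
# WORK_OF_ART = titles of books, songs, etc.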


Named Entities:
Apple Inc. -> ORG
OpenAI -> GPE
Oscar Award -> WORK_OF_ART
California -> GPE
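
The printed pairs can also be tallied by label, which makes questionable tags easier to spot (in the output above, `OpenAI` is tagged `GPE` rather than `ORG`). A minimal sketch, assuming the entities are available as `(text, label)` pairs; the helper name `group_by_label` is hypothetical and not part of spaCy:

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (entity_text, label) pairs into {label: [entity_text, ...]}."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Pairs copied from the output above (hard-coded so the sketch runs
# without a spaCy model installed).
sample_pairs = [
    ("Apple Inc.", "ORG"),
    ("OpenAI", "GPE"),
    ("Oscar Award", "WORK_OF_ART"),
    ("California", "GPE"),
]

for label, texts in group_by_label(sample_pairs).items():
    print(f"{label}: {texts}")
```

With a loaded pipeline, the same helper could be called as `group_by_label((ent.text, ent.label_) for ent in doc.ents)`.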


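The preprocessing cell below trims Project Gutenberg boilerplate by scanning for marker lines. Marker wording is not uniform across files (both `START OF THIS PROJECT GUTENBERG` and `START OF THE PROJECT GUTENBERG` occur), and some copies carry no markers at all; the NLTK Hamlet file appears to be one of these, since its output begins directly with the title line. A rough sketch of a more tolerant matcher, using a made-up sample string and a hypothetical helper `strip_gutenberg_boilerplate`:

```python
import re

# Accept both the "THIS" and "THE" wording of the marker lines.
START_RE = re.compile(r"START OF (?:THIS|THE) PROJECT GUTENBERG", re.IGNORECASE)
END_RE = re.compile(r"END OF (?:THIS|THE) PROJECT GUTENBERG", re.IGNORECASE)


def strip_gutenberg_boilerplate(text):
    """Keep the lines between the start and end markers.

    A missing marker leaves that side of the text untouched.
    """
    lines = text.splitlines()
    start_idx, end_idx = 0, len(lines)
    for i, line in enumerate(lines):
        if START_RE.search(line):
            start_idx = i + 1
        elif END_RE.search(line):
            end_idx = i
            break
    return "\n".join(lines[start_idx:end_idx]).strip()


# Made-up sample, not taken from any real file.
sample = (
    "Some licence preamble\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***\n"
    "Actus Primus. Scoena Prima.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***\n"
    "Some licence footer\n"
)

print(strip_gutenberg_boilerplate(sample))
```

Text without any marker passes through unchanged apart from surrounding whitespace, which matches the fall-through behaviour of `clean_text` in the next cell.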
In [None]:
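import nltk
import string
from nltk.corpus import gutenberg, stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the NLTK resources used below (a no-op if already present).
# "punkt_tab" is requested by newer NLTK releases, "punkt" by older ones.
for resource in ["gutenberg", "stopwords", "punkt", "punkt_tab", "wordnet"]:
    nltk.download(resource, quiet=True)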


# Select a text file from Gutenberg (e.g., 'shakespeare-hamlet.txt')
file_id = "shakespeare-hamlet.txt"
raw_text = gutenberg.raw(file_id)

# Step 1: Text Cleaning (Removing Gutenberg Header/Footer)
def clean_text(text):
    lines = text.split("\n")  # split the raw text into individual lines
    start_idx, end_idx = 0, len(lines)

    # Locate the Gutenberg start/end marker lines; if a marker is absent,
    # that side of the text is left untrimmed
    for i, line in enumerate(lines):
        if "START OF THIS PROJECT GUTENBERG" in line:
            start_idx = i + 1
        if "END OF THIS PROJECT GUTENBERG" in line:
            end_idx = i
            break

    cleaned_lines = lines[start_idx:end_idx]
    cleaned_text = " ".join(cleaned_lines)
    return cleaned_text

text = clean_text(raw_text)

# Step 2: Lowercase
text = text.lower()

# Step 3: Tokenization
tokens = word_tokenize(text)

# Step 4: Remove Punctuation & Stopwords
# (isalnum() also drops any token containing an apostrophe or hyphen, e.g. "'s")
stop_words = set(stopwords.words("english"))
tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

# Step 5: Stemming & Lemmatization
# Note: lemmatize() treats each token as a noun unless a pos argument is passed
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_tokens = [stemmer.stem(word) for word in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

# Step 6: Convert back to text
stemmed_text = " ".join(stemmed_tokens)
lemmatized_text = " ".join(lemmatized_tokens)

# Output Results
# (`text` here is the cleaned, lowercased text from Steps 1-2, not the raw file)
print("Original Text (First 500 characters):\n", text[:500])
print("\nStemmed Text (First 500 characters):\n", stemmed_text[:500])
print("\nLemmatized Text (First 500 characters):\n", lemmatized_text[:500])


Original Text (First 500 characters):
 [the tragedie of hamlet by william shakespeare 1599]   actus primus. scoena prima.  enter barnardo and francisco two centinels.    barnardo. who's there?   fran. nay answer me: stand & vnfold your selfe     bar. long liue the king     fran. barnardo?   bar. he     fran. you come most carefully vpon your houre     bar. 'tis now strook twelue, get thee to bed francisco     fran. for this releefe much thankes: 'tis bitter cold, and i am sicke at heart     barn. haue you had quiet guard?   fran. not

Stemmed Text (First 500 characters):
 tragedi hamlet william shakespear 1599 actu primu scoena prima enter barnardo francisco two centinel barnardo fran nay answer stand vnfold self bar long liue king fran barnardo bar fran come care vpon hour bar strook twelu get thee bed francisco fran releef much thank bitter cold sick heart barn haue quiet guard fran mous stir barn well goodnight meet horatio marcellu riual watch bid make hast enter horatio marcellu f