# ## Named Entity Recognition (NER) and Noun Phrase Extraction with spaCy

In [None]:
!python -m spacy download en_core_web_sm


In [4]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Apple is looking at buying U.K. startup for $1 billion"

# Process the text with spaCy
doc = nlp(text)

# Named Entity Recognition
print("Named Entities, with their labels:")
for ent in doc.ents:
    print(ent.text, ent.label_)

# Noun Phrase Extraction
print("\nNoun Phrases:")
for chunk in doc.noun_chunks:  # "chunk" avoids shadowing the common numpy alias "np"
    print(chunk.text)


Named Entities, with their labels:
Apple ORG
U.K. GPE
$1 billion MONEY

Noun Phrases:
Apple
U.K.


# ## Text Classification with Random Forests/XGBoost and TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.datasets import fetch_20newsgroups

# Load dataset
data = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])
X, y = data.data, data.target

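# Split the raw documents first so the held-out texts never influence the vectorizer
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# TF-IDF Vectorization: learn the vocabulary and IDF weights from the training split only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Full-corpus TF-IDF matrix, kept because the cross-validation cell further down uses X_tfidf
# (its IDF weights are computed from every document)
X_tfidf = TfidfVectorizer().fit_transform(X)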

# Random Forest Classifier (fixed seed so repeated runs give the same score)
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)
print(f"Random Forest Classifier Accuracy: {rf_clf.score(X_test, y_test):.3f}")

# XGBoost Classifier
xgb_clf = XGBClassifier(random_state=42)
xgb_clf.fit(X_train, y_train)
print(f"XGBoost Classifier Accuracy: {xgb_clf.score(X_test, y_test):.3f}")

# ## Why We Don't Recommend "Old" NLP Pipelines


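Old NLP pipelines often involved stemming, stop word removal, and bag-of-words models. These steps can discard important information and context: stemming merges distinct word forms, stop word removal can drop negations, and a bag of words ignores word order. TF-IDF still ignores word order, but it down-weights common terms instead of deleting them, while word embeddings and transformers retain much more of the context.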
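
As a small illustration of the information loss described above, the sketch below runs sentences with opposite meanings through stop word removal and a bag-of-words count. The stop word list and the helper name `bag_of_words` are assumptions made for this example and do not come from any library.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list. Real lists are much longer,
# and many of them also include negations such as "not".
STOP_WORDS = {"the", "was", "not", "a", "is", "it"}

def bag_of_words(text):
    """Lowercase, split on whitespace, drop stop words, and count the rest."""
    tokens = text.lower().split()
    return Counter(tok for tok in tokens if tok not in STOP_WORDS)

positive = bag_of_words("the movie was good")
negative = bag_of_words("the movie was not good")

# Both sentences collapse to the same counts, so the negation is gone.
print(positive == negative)  # True

# Word order is lost too: who bit whom cannot be recovered from the counts.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True
```

A classifier that only sees these counts cannot tell the two reviews apart, which is the kind of lost context the paragraph above refers to.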

# ## Over-fitting, Training on Test Data, and Data Leaks

Over-fitting occurs when a model fits the training data too closely, including its noise and outliers. It then scores well on the data it has already seen but generalizes poorly to new, unseen data.
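
The gap between training and unseen-data performance can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not a benchmark: the data is hand-made, two training labels are flipped on purpose to act as noise, and the test points are deliberately placed near those noisy points so the effect is visible. `memorizer` and `simple_rule` are hypothetical names.

```python
# Toy 1-D data: the underlying rule is "label 1 when x >= 5".
# Two training points, (3, 1) and (8, 0), carry deliberately flipped labels (noise).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
test = [(1.5, 0), (2.6, 0), (3.4, 0), (6.5, 1), (7.6, 1), (8.4, 1)]

def memorizer(x):
    """1-nearest-neighbour: return the label of the closest training point."""
    nearest_x, nearest_label = min(train, key=lambda point: abs(point[0] - x))
    return nearest_label

def simple_rule(x):
    """A single threshold that ignores the noisy points."""
    return 1 if x >= 5 else 0

def accuracy(model, data):
    """Fraction of (x, label) pairs the model predicts correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(f"{name}: train={accuracy(model, train):.2f}, test={accuracy(model, test):.2f}")
# memorizer: train=1.00, test=0.33
# simple rule: train=0.75, test=1.00
```

The memorizer reproduces every training label, noise included, and pays for it on the test points. The simpler rule gets the two noisy training points "wrong" but matches the underlying pattern.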

Training on test data, or any other data leak (where information from the test set is used during training), leads to overly optimistic performance estimates and to models that do not generalize well.
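
One leak that is easy to miss in text classification is fitting the vectorizer before splitting the data. The sketch below uses a hypothetical three-document corpus and a hypothetical helper, `build_vocabulary`, to show how a test-only term ends up in the feature set when the vocabulary is built from all documents.

```python
train_docs = ["the rocket reached orbit", "the debate about belief"]
test_docs = ["the telescope reached orbit"]

def build_vocabulary(docs):
    """Collect the sorted set of lowercase whitespace tokens in docs."""
    return sorted({token for doc in docs for token in doc.lower().split()})

# Leaky: the vocabulary is built before the split, so it includes test-only terms.
leaky_vocab = build_vocabulary(train_docs + test_docs)

# Clean: the vocabulary comes from the training documents alone.
clean_vocab = build_vocabulary(train_docs)

leaked_terms = sorted(set(leaky_vocab) - set(clean_vocab))
print(leaked_terms)  # ['telescope']
```

In scikit-learn, the usual way to get the clean behaviour is to place the vectorizer and the classifier in a single `Pipeline` (or `make_pipeline`) and pass that to `cross_val_score`, so the vectorizer is refitted on the training portion of every fold.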

# ## k-Fold Stratified Cross Validation

In [8]:
# Stratified k-Fold Cross Validation ensures that each fold of the dataset has the same proportion of classes as the original dataset.

skf = StratifiedKFold(n_splits=5)
rf_clf = RandomForestClassifier()

# Perform cross-validation
scores = cross_val_score(rf_clf, X_tfidf, y, cv=skf)
print(f"Stratified k-Fold Cross Validation Scores: {scores}")
print(f"Mean CV Score: {scores.mean()}")

Stratified k-Fold Cross Validation Scores: [0.96744186 0.94883721 0.98139535 0.97196262 0.98130841]
Mean CV Score: 0.970189089328407
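
To make the stratification idea from the cell above concrete, here is a minimal pure-Python sketch that deals the samples of each class out to the folds in round-robin order. It only illustrates the principle and is not scikit-learn's actual `StratifiedKFold` algorithm. The function name `stratified_folds` and the toy labels are assumptions made for this example.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, dealing every class out round-robin."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for class_indices in indices_by_class.values():
        for position, index in enumerate(class_indices):
            folds[position % n_splits].append(index)
    return folds

# Imbalanced toy labels: eight samples of class 0 and four of class 1.
labels = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]

for fold_number, fold in enumerate(stratified_folds(labels, n_splits=4)):
    counts = Counter(labels[i] for i in fold)
    # Each fold holds two samples of class 0 and one of class 1.
    print(fold_number, sorted(fold), dict(counts))
```

Every fold keeps the 2:1 class ratio of the full label list, which is the property that stratification is meant to preserve.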


# ## Translation and Summarization with Transformers

In [10]:
from transformers import pipeline

# Translation: pin the model and revision instead of relying on the task default
translator = pipeline("translation_en_to_fr", model="google-t5/t5-base", revision="686f1db")
translation = translator("Hello, how are you?")
print(f"Translation: {translation[0]['translation_text']}")


Translation: Bonjour, comment êtes-vous?


In [11]:
# Summarization: pin the model and revision instead of relying on the task default
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", revision="a4f8f3e")
summary = summarizer("The quick brown fox jumps over the lazy dog. " * 10)
print(f"Summary: {summary[0]['summary_text']}")

Your max_length is set to 142, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


Summary:  The quick brown fox jumps over the lazy dog . Quick brown fox jumped over the laziest dog in a series of pictures . Foxes jump over lazy dogs in a bid to keep their lazy owners happy . The foxes are often confused for each other and confused for the dog's identity .
