### **Load and Preprocess Text Data**

In [None]:
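# Install rake-nltk before anything tries to import it
!pip install rake-nltk

import pandas as pd
import nltk
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the NLTK resources that rake-nltk relies on
nltk.download('stopwords')
nltk.download('punkt_tab')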

# Sample text data
data = [
    "Natural language processing is a field of artificial intelligence.",
    "Keyword extraction helps in identifying the important terms in text.",
    "SpaCy and NLTK are useful Python libraries for NLP tasks.",
    "RAKE is a rapid automatic keyword extraction method.",
]

# Create a DataFrame
df = pd.DataFrame(data, columns=["text"])



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### **TF-IDF Keyword Extraction**

In [None]:
# Initialize the vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Fit the model and transform the data
tfidf_matrix = vectorizer.fit_transform(df['text'])

# Get the feature names (words) and sum each word's TF-IDF score over all documents
feature_names = vectorizer.get_feature_names_out()
scores = tfidf_matrix.sum(axis=0).A1

# Combine the words and their scores into a DataFrame
tfidf_keywords = pd.DataFrame(zip(feature_names, scores), columns=['keyword', 'score'])

# Sort by score in descending order
tfidf_keywords = tfidf_keywords.sort_values(by='score', ascending=False)

# Display the top 5 keywords
top_n_tfidf = tfidf_keywords.head(5)
print("Top N Keywords using TF-IDF:")
print(top_n_tfidf)

Top N Keywords using TF-IDF:
       keyword     score
2   extraction  0.659851
8      keyword  0.659851
18       rapid  0.436719
17        rake  0.436719
1    automatic  0.436719
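
To make the scores above less opaque, here is a minimal sketch that recomputes them by hand. It assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector). The `STOP_WORDS` set, `tokenize`, and `smoothed_idf` are hypothetical stand-ins written for this example, not part of scikit-learn.

```python
import math
import re
from collections import Counter

docs = [
    "Natural language processing is a field of artificial intelligence.",
    "Keyword extraction helps in identifying the important terms in text.",
    "SpaCy and NLTK are useful Python libraries for NLP tasks.",
    "RAKE is a rapid automatic keyword extraction method.",
]

# Assumption: a tiny stand-in stop list that only covers this corpus;
# scikit-learn's built-in 'english' list is much longer.
STOP_WORDS = {"is", "of", "in", "the", "and", "are", "for"}

def tokenize(text):
    # Lowercase, keep tokens of two or more word characters, drop stop words.
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS]

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

# Sum the L2-normalized TF-IDF weights of each term over all documents.
summed_scores = Counter()
for tokens in tokenized:
    counts = Counter(tokens)
    weights = {t: c * smoothed_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    for t, w in weights.items():
        summed_scores[t] += w / norm

for term, score in summed_scores.most_common(5):
    print(f"{term:<12}{score:.6f}")
```

Under these assumptions, `keyword` and `extraction` rank highest because each appears in two documents and its two weights are added together, while terms such as `rapid` contribute from a single document.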


### **RAKE Keyword Extraction**

We'll now use RAKE (Rapid Automatic Keyword Extraction) to extract keywords from the text.



In [None]:
from rake_nltk import Rake

# Initialize RAKE
rake = Rake()

# Extract keywords for each document
rake_keywords = []
for text in df['text']:
    rake.extract_keywords_from_text(text)
    keywords = rake.get_ranked_phrases_with_scores()
    rake_keywords.append(keywords)

# Display RAKE results
print("RAKE Keyword Extraction Results:")
for idx, keywords in enumerate(rake_keywords):
    print(f"Document {idx + 1}: {keywords[:5]}")  # Top 5 keywords per document



RAKE Keyword Extraction Results:
Document 1: [(9.0, 'natural language processing'), (4.0, 'artificial intelligence'), (1.0, 'field')]
Document 2: [(9.0, 'keyword extraction helps'), (4.0, 'important terms'), (1.0, 'text'), (1.0, 'identifying')]
Document 3: [(9.0, 'useful python libraries'), (4.0, 'nlp tasks'), (1.0, 'spacy'), (1.0, 'nltk')]
Document 4: [(25.0, 'rapid automatic keyword extraction method'), (1.0, 'rake')]
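
The scores above follow from RAKE's degree-to-frequency ratio: each word in a phrase scores its co-occurrence degree divided by its frequency, and a phrase scores the sum over its words. The sketch below is a simplified reimplementation for illustration, not the rake-nltk source. `STOP_WORDS`, `candidate_phrases`, and `rake_scores` are hypothetical names, and the stop list is a small stand-in for NLTK's English list.

```python
import re
from collections import defaultdict

# Assumption: a tiny stand-in stop list; rake-nltk uses NLTK's English list.
STOP_WORDS = {"is", "a", "of", "in", "the", "and", "are", "for"}

def candidate_phrases(text):
    """Split text into runs of consecutive non-stop words."""
    phrases, current = [], []
    # Tokens are either words or single punctuation marks; stop words and
    # punctuation both end the current phrase.
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        if token in STOP_WORDS or not token[0].isalnum():
            if current:
                phrases.append(tuple(current))
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Score each phrase as the sum of degree / frequency over its words."""
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    return sorted(((s, p) for p, s in scored.items()), reverse=True)

print(rake_scores("Natural language processing is a field of artificial intelligence."))
# [(9.0, 'natural language processing'), (4.0, 'artificial intelligence'), (1.0, 'field')]
```

A three-word phrase whose words appear nowhere else scores 3 + 3 + 3 = 9, which is why longer phrases dominate the ranking.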


### **spaCy Keyword Extraction (Using Named Entity Recognition)**

In [None]:

# Load the small English spaCy model
# (install it first if needed: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# Function to extract named entities
def extract_spacy_keywords(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    return entities

# Apply spaCy NER to each document
spacy_keywords = df['text'].apply(extract_spacy_keywords)

# Display results
print("spaCy Named Entity Recognition Results:")
print(spacy_keywords)




spaCy Named Entity Recognition Results:
0             []
1      [Keyword]
2    [NLTK, NLP]
3             []
Name: text, dtype: object


### **Compare Results**

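Now that we've applied TF-IDF, RAKE, and spaCy for keyword extraction, let's compare the results:

*   TF-IDF: Scores each term by how often it occurs in a document, discounted by how many documents in the corpus contain it. Here the per-document scores are summed to rank terms across the whole corpus.
*   RAKE: Splits each document into candidate phrases at stop words and punctuation, then scores each phrase from the co-occurrence degree and frequency of its words.
*   spaCy NER: Extracts named entities, such as people, locations, and organizations.
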
In [None]:
print("Comparison of TF-IDF, RAKE, and spaCy results:")
print("TF-IDF top N keywords:")
print(top_n_tfidf)
print("\nRAKE keywords (top 5 per document):")
for idx, keywords in enumerate(rake_keywords):
    print(f"Document {idx + 1}: {keywords[:5]}")
print("\nspaCy Named Entities:")
print(spacy_keywords)


Comparison of TF-IDF, RAKE, and spaCy results:
TF-IDF top N keywords:
       keyword     score
2   extraction  0.659851
8      keyword  0.659851
18       rapid  0.436719
17        rake  0.436719
1    automatic  0.436719

RAKE keywords (top 5 per document):
Document 1: [(9.0, 'natural language processing'), (4.0, 'artificial intelligence'), (1.0, 'field')]
Document 2: [(9.0, 'keyword extraction helps'), (4.0, 'important terms'), (1.0, 'text'), (1.0, 'identifying')]
Document 3: [(9.0, 'useful python libraries'), (4.0, 'nlp tasks'), (1.0, 'spacy'), (1.0, 'nltk')]
Document 4: [(25.0, 'rapid automatic keyword extraction method'), (1.0, 'rake')]

spaCy Named Entities:
0             []
1      [Keyword]
2    [NLTK, NLP]
3             []
Name: text, dtype: object
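
The three outputs above have different shapes, so comparing them by eye is awkward. As a rough sketch, the snippet below copies the printed results into plain Python structures and reports, per document, which TF-IDF top terms also appear in the top RAKE phrase or among the spaCy entities. The helper `overlap_report` and the variable names are hypothetical, introduced only for this illustration.

```python
# Results copied from the printed outputs above.
tfidf_top = ["extraction", "keyword", "rapid", "rake", "automatic"]
rake_top_phrase = {
    1: "natural language processing",
    2: "keyword extraction helps",
    3: "useful python libraries",
    4: "rapid automatic keyword extraction method",
}
spacy_entities = {1: [], 2: ["Keyword"], 3: ["NLTK", "NLP"], 4: []}

def overlap_report(tfidf_terms, rake_phrases, entities):
    """For each document, list the TF-IDF top terms found in the other outputs."""
    tfidf_set = set(tfidf_terms)
    report = {}
    for doc_id, phrase in rake_phrases.items():
        rake_words = set(phrase.split())
        entity_words = {e.lower() for e in entities.get(doc_id, [])}
        report[doc_id] = {
            "rake_overlap": sorted(tfidf_set & rake_words),
            "spacy_overlap": sorted(tfidf_set & entity_words),
        }
    return report

report = overlap_report(tfidf_top, rake_top_phrase, spacy_entities)
for doc_id, row in report.items():
    print(f"Document {doc_id}: {row}")
```

Word-level overlap is a crude measure, but it is enough to show where the methods agree (documents 2 and 4) and where the corpus-level TF-IDF ranking says nothing about a document (documents 1 and 3).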


**Final Notes:**
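*   TF-IDF, summed over documents as done here, highlights single terms that carry weight across the corpus, such as `keyword` and `extraction`.
*   RAKE works on one document at a time and favors multi-word phrases, such as `natural language processing`, so its scores are not comparable across documents.
*   spaCy works well for named entity recognition but may not capture all keywords, especially non-entity terms. Here it returned no entities for two of the four documents.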




