In [8]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords")
# WordNetLemmatizer also needs the WordNet corpus; run nltk.download("wordnet") once if it is missing.

text = """
In the rapidly evolving world of artificial intelligence, data has become the most valuable asset of modern organizations. Companies across industries—from healthcare and finance to retail and manufacturing—are increasingly relying on data-driven insights to make strategic decisions. However, collecting data is only the beginning. The true value lies in transforming raw, unstructured information into meaningful knowledge that can guide action.

Natural Language Processing (NLP), a subfield of artificial intelligence, focuses on enabling machines to understand, interpret, and generate human language. Every day, billions of text messages, emails, social media posts, and online reviews are generated across platforms such as Twitter, LinkedIn, and various news websites. These textual data sources contain hidden patterns, opinions, and trends that can significantly impact business outcomes.

For example, consider an e-commerce company that receives thousands of customer reviews daily. By applying sentiment analysis, the company can identify whether customers feel satisfied, frustrated, or neutral about specific products. Topic modeling can further reveal recurring issues, such as delayed deliveries, poor packaging, or product defects. Named Entity Recognition (NER) can detect brand names, product categories, and competitor mentions within the reviews. When combined, these techniques empower businesses to act proactively rather than reactively.

The NLP pipeline typically begins with text preprocessing. This includes converting text to lowercase, removing punctuation, eliminating stopwords, and performing stemming or lemmatization. Once cleaned, the text is converted into numerical representations using techniques such as Bag-of-Words, TF-IDF, or word embeddings like Word2Vec and GloVe. In recent years, transformer-based models such as BERT and GPT have revolutionized NLP by capturing contextual meaning rather than relying solely on word frequency.

Despite these advancements, challenges remain. Language is inherently ambiguous. Words often have multiple meanings depending on context, and sarcasm can mislead even sophisticated models. Additionally, handling multilingual data introduces complexities related to grammar, syntax, and cultural nuances. Data privacy and ethical concerns also play a crucial role, especially when analyzing personal communications.

From an industry perspective, the most successful NLP projects are not defined solely by complex models but by well-defined problem statements, clean data pipelines, and proper evaluation metrics. Precision, recall, and F1-score often provide better performance insights than accuracy alone, particularly when dealing with imbalanced datasets such as fraud detection or spam classification.

As organizations continue to digitize operations, NLP will become increasingly central to customer experience, automation, and decision intelligence. Professionals who understand both the theoretical foundations and practical implementations of NLP will be well-positioned to lead innovation in the coming years.
"""
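lemmatizer = WordNetLemmatizer()  # a lemmatizer, not a stemmer
stop_words = set(stopwords.words("english"))  # set gives O(1) membership tests

def processor(text):
    """Strip punctuation, lowercase, drop stopwords, and lemmatize each word."""
    # Deleting (rather than spacing out) punctuation fuses hyphenated words,
    # e.g. "data-driven" becomes "datadriven".
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    words = text.lower().split()
    words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return " ".join(words)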

clean_text = processor(text)
docs = [
    "NLP is amazing",
    "Machine learning uses data",
    clean_text
]

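vectorizer = TfidfVectorizer(
    ngram_range=(2, 3),  # word bigrams and trigrams only, no unigrams
    binary=True,         # term frequency is 0/1 instead of a raw count
    max_features=500,    # keep the 500 most frequent n-grams in the corpus
)
X = vectorizer.fit_transform(docs).toarray()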

features = vectorizer.get_feature_names_out()
features  # not displayed: a notebook cell echoes only its last expression
X


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\behli\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.04503773, 0.04503773, 0.04503773, ..., 0.04503773, 0.04503773,
        0.04503773]], shape=(3, 500))
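
The repeated value 0.04503773 in the last row is what one would expect when every n-gram of a document occurs in that document only. With binary=True and L2 normalisation, each of k non-zero entries then equals 1/sqrt(k), and 1/sqrt(493) is about 0.04503773, so the output is consistent with 493 non-zero features in that row. The sketch below reproduces that effect on two toy documents using only the standard library. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper names `word_ngrams` and `binary_tfidf` are hypothetical, tokenisation is a plain whitespace split, and no max_features cut-off is applied.

```python
import math

def word_ngrams(tokens, lo=2, hi=3):
    """Return all word n-grams with lo <= n <= hi, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def binary_tfidf(docs, lo=2, hi=3):
    """Binary-tf, smoothed-idf, L2-normalised weights per document."""
    # A set per document makes term frequency binary (present or absent).
    grams_per_doc = [set(word_ngrams(d.lower().split(), lo, hi)) for d in docs]
    vocab = sorted(set().union(*grams_per_doc))
    n_docs = len(docs)
    # Smoothed idf: ln((1 + n_docs) / (1 + df)) + 1
    idf = {
        g: math.log((1 + n_docs) / (1 + sum(g in s for s in grams_per_doc))) + 1
        for g in vocab
    }
    rows = []
    for grams in grams_per_doc:
        raw = [idf[g] if g in grams else 0.0 for g in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard empty rows
        rows.append([v / norm for v in raw])
    return vocab, rows

toy_docs = ["nlp is amazing", "machine learning uses data"]
vocab, rows = binary_tfidf(toy_docs)
nonzero = [v for v in rows[0] if v > 0]
# Every non-zero weight in a row should match 1 / sqrt(number of non-zeros).
print(len(vocab), len(nonzero), nonzero[0], 1 / math.sqrt(len(nonzero)))
```

Because the two toy documents share no n-grams, every idf is identical and the normalised weights within a row collapse to a single value, just as in the third row of X above.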