# Day 16: Natural Language Processing (NLP)

In [1]:
import numpy as np
import pandas as pd


In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample Text Data
texts = ["I love machine learning!", "NLP is amazing.", "Understanding language is fascinating."]

# Text Vectorization
vectorizer = CountVectorizer()
vectorized_text = vectorizer.fit_transform(texts)

print("Vocabulary:", vectorizer.vocabulary_)
print("Vectorized Text:", vectorized_text.toarray())

Vocabulary: {'love': 5, 'machine': 6, 'learning': 4, 'nlp': 7, 'is': 2, 'amazing': 0, 'understanding': 8, 'language': 3, 'fascinating': 1}
Vectorized Text: [[0 0 0 0 1 1 1 0 0]
 [1 0 1 0 0 0 0 1 0]
 [0 1 1 1 0 0 0 0 1]]
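
The output above can be reproduced by hand, which shows what the vectorizer is doing. The following is a minimal sketch in plain Python, not scikit-learn's implementation: `bag_of_words` is a hypothetical helper, and it assumes CountVectorizer's default behavior of lowercasing the text and keeping only tokens of two or more word characters (which is why "I" does not appear in the vocabulary).

```python
import re

def bag_of_words(texts):
    """Hypothetical helper: build a sorted vocabulary and a count matrix."""
    # Lowercase, then keep tokens of two or more word characters
    # (assumed to mirror CountVectorizer's default token pattern).
    tokenized = [re.findall(r"\b\w\w+\b", text.lower()) for text in texts]

    # Assign column indices in alphabetical order of the distinct tokens.
    distinct = sorted({tok for doc in tokenized for tok in doc})
    vocab = {tok: idx for idx, tok in enumerate(distinct)}

    # One row per document, one column per vocabulary entry.
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for tok in doc:
            row[vocab[tok]] += 1
        matrix.append(row)
    return vocab, matrix

texts = ["I love machine learning!", "NLP is amazing.", "Understanding language is fascinating."]
vocab, matrix = bag_of_words(texts)
print("Vocabulary:", vocab)
for row in matrix:
    print(row)
```

Each value in the vocabulary is a column index, so `matrix[0][vocab["love"]]` is the count of "love" in the first text.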


In [5]:
# Text Features
# One of the simplest ways to encode text is by word counts: take each snippet of text,
# count the occurrences of each word within it, and put the results in a table. For example:
sample = ['problem of evil', 'evil queen', 'horizon problem']

# To vectorize this data by word count, we construct one column per word
# ("problem", "evil", "horizon", and so on). Doing this by hand is possible,
# but Scikit-Learn's CountVectorizer handles it for us:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(sample)
X  # a SciPy sparse matrix; only the last expression in a cell is displayed

# The result is easier to inspect if we convert it to a DataFrame with labeled columns:
import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

# Raw word counts put too much weight on words that appear very frequently. One fix is
# term frequency-inverse document frequency (TF-IDF), which weights each word count by
# a measure of how often the word appears across the documents:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

       evil   horizon        of   problem     queen
0  0.517856  0.000000  0.680919  0.517856  0.000000
1  0.605349  0.000000  0.000000  0.000000  0.795961
2  0.000000  0.795961  0.000000  0.605349  0.000000
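
The weights in the table above can be worked out by hand. The sketch below uses only the standard library and rests on stated assumptions: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which are understood to be TfidfVectorizer's defaults. Splitting on whitespace is a simplification that happens to be enough for this sample.

```python
import math

sample = ['problem of evil', 'evil queen', 'horizon problem']

# Simplified tokenization: whitespace split (sufficient for this sample).
docs = [text.split() for text in sample]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
df = {word: sum(word in doc for doc in docs) for word in vocab}

# Smoothed IDF (assumed default): rarer words get larger weights.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

tfidf = []
for doc in docs:
    # Term frequency times IDF for every vocabulary word.
    raw = [doc.count(word) * idf[word] for word in vocab]
    # L2-normalize the row so each document vector has unit length.
    norm = math.sqrt(sum(value * value for value in raw))
    tfidf.append([value / norm for value in raw])

print(vocab)
for row in tfidf:
    print([round(value, 6) for value in row])
```

In the first document, "of" occurs in only one of the three snippets, so it receives a higher weight than "problem" and "evil", which each occur in two.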


In [None]:
Definition:

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. It bridges the gap between human communication and machine understanding, making it possible for machines to process and analyze vast amounts of natural language data.

Importance in Machine Learning:

Text Analysis: NLP plays a crucial role in extracting meaningful insights from unstructured text data, which is commonly estimated to make up a large share of the data generated every day.

Applications Across Industries: NLP is applied in many domains, from chatbots and virtual assistants to sentiment analysis, machine translation, and fraud detection.

Enhanced User Experience: By powering tools like voice recognition and auto-suggestions, NLP improves the way users interact with technology.

Facilitates Automation: Automating processes like content moderation and document summarization saves time and reduces manual effort.