# Feature Extraction from Text


## Bag of Words

* Represents each document by how many times each vocabulary word occurs in it; word order is ignored.
* The resulting count vectors are a common input for Multinomial Naive Bayes.
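
The counting idea can be sketched by hand with the standard library. This is only an illustration, not what `CountVectorizer` does internally: the helper name `bag_of_words` is hypothetical, and the tokenizer is a simplification that lowercases the text and keeps word tokens of two or more characters.

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Count word occurrences in one sentence (hypothetical helper)."""
    # Lowercase, then keep word tokens with at least two characters
    tokens = re.findall(r"\b\w\w+\b", sentence.lower())
    return Counter(tokens)


counts = bag_of_words("hello and welcome, welcome again")
for word, count in sorted(counts.items()):
    print(word, count)
```

No stop words are removed here, so "and" and "again" are counted like any other word.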


In [15]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [16]:
# Define sample sentences (documents)
A1 = 'hello and welcome, welcome again'
A2 = 'shri love NLP' 
A3 = 'shri is good boy'

### Stopwords: Benefits and Drawbacks

**Benefits of Removing Stopwords:**
1. **Noise Reduction:** Removes common words (like "the", "is") that often don't carry significant meaning for the topic.
2. **Dimensionality Reduction:** Reduces the number of features, leading to smaller models and faster computation.

**Performance Impact Example:**
Imagine a dataset with 1,000 documents.
*   **With Stopwords:** The vocabulary might contain 5,000 unique words. The resulting matrix is 1,000 x 5,000 (5 million elements).
*   **Without Stopwords:** Removing stopwords might reduce the vocabulary to 3,000 words. The matrix becomes 1,000 x 3,000 (3 million elements).
*   **Result:**
    *   **Memory:** 40% fewer matrix elements, so a dense copy of the matrix needs 40% less memory.
    *   **Speed:** Algorithms (like Naive Bayes or Logistic Regression) generally train faster because they have fewer features to process.
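
The arithmetic in this example can be checked with a few lines of Python. The sizes are the hypothetical ones from the example, and the 8 bytes per element is an assumption (a dense `int64` matrix). `CountVectorizer` returns a sparse matrix, so real memory use depends on the number of non-zero entries rather than on the full shape.

```python
n_docs = 1_000
vocab_with_stopwords = 5_000      # hypothetical vocabulary size
vocab_without_stopwords = 3_000   # hypothetical vocabulary size

cells_with = n_docs * vocab_with_stopwords
cells_without = n_docs * vocab_without_stopwords

# Fraction of matrix elements saved by removing stop words
reduction = 1 - cells_without / cells_with

bytes_per_cell = 8  # assumption: dense int64 storage
mb_with = cells_with * bytes_per_cell / 1e6
mb_without = cells_without * bytes_per_cell / 1e6

print(f"Elements: {cells_with:,} vs {cells_without:,}")
print(f"Dense size: {mb_with:.0f} MB vs {mb_without:.0f} MB")
print(f"Reduction: {reduction:.0%}")
```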

**Drawbacks of Removing Stopwords:**
1. **Loss of Context/Meaning:** Some stopwords can be crucial for meaning (e.g., "not" in "not good").
2. **Structure Loss:** Removes grammatical structure, which may be important for some NLP tasks.

In [17]:
# Initialize CountVectorizer with English stop words removed
vectorizer = CountVectorizer(stop_words='english')

# Fit the model and learn the vocabulary; then transform the data into a document-term matrix
vectors = vectorizer.fit_transform([A1,A2,A3])

# Get the unique words (features) identified in the data
feature_names = vectorizer.get_feature_names_out()

# Print the sparse matrix, then convert it to a dense format for readability
print(f"Sparse matrix: {vectors}")
dense = vectors.todense()
print(f"Dense matrix: {dense}")

# Create a DataFrame to visualize the word counts for each document
result = pd.DataFrame(dense, columns=feature_names)

print(result)

Sparse matrix: <Compressed Sparse Row sparse matrix of dtype 'int64'
	with 8 stored elements and shape (3, 7)>
  Coords	Values
  (0, 2)	1
  (0, 6)	2
  (1, 5)	1
  (1, 3)	1
  (1, 4)	1
  (2, 5)	1
  (2, 1)	1
  (2, 0)	1
Dense matrix: [[0 0 1 0 0 0 2]
 [0 0 0 1 1 1 0]
 [1 1 0 0 0 1 0]]
   boy  good  hello  love  nlp  shri  welcome
0    0     0      1     0    0     0        2
1    0     0      0     1    1     1        0
2    1     1      0     0    0     1        0


## TF - IDF

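* Weights each word's count (term frequency) by how rare the word is across all documents (inverse document frequency), so words that appear in many documents contribute less than words specific to a few.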

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [19]:
A1 = 'hello and welcome, welcome again'
A2 = 'shri love NLP'
A3 = 'shri is good boy'

In [20]:
# Initialize TfidfVectorizer with English stop words removed
vectorizer = TfidfVectorizer(stop_words='english')

# Fit the model to learn vocabulary and transform documents into TF-IDF matrix
vectors = vectorizer.fit_transform([A1,A2,A3])

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print the sparse matrix, then convert it to a dense matrix for readability
print(f"Sparse matrix: {vectors}")
dense = vectors.todense()
print(f"Dense matrix: {dense}")

# Create DataFrame to view TF-IDF scores
result = pd.DataFrame(dense, columns=feature_names)

print(result)

Sparse matrix: <Compressed Sparse Row sparse matrix of dtype 'float64'
	with 8 stored elements and shape (3, 7)>
  Coords	Values
  (0, 2)	0.4472135954999579
  (0, 6)	0.8944271909999159
  (1, 5)	0.4736296010332684
  (1, 3)	0.6227660078332259
  (1, 4)	0.6227660078332259
  (2, 5)	0.4736296010332684
  (2, 1)	0.6227660078332259
  (2, 0)	0.6227660078332259
Dense matrix: [[0.         0.         0.4472136  0.         0.         0.
  0.89442719]
 [0.         0.         0.         0.62276601 0.62276601 0.4736296
  0.        ]
 [0.62276601 0.62276601 0.         0.         0.         0.4736296
  0.        ]]
        boy      good     hello      love       nlp     shri   welcome
0  0.000000  0.000000  0.447214  0.000000  0.000000  0.00000  0.894427
1  0.000000  0.000000  0.000000  0.622766  0.622766  0.47363  0.000000
2  0.622766  0.622766  0.000000  0.000000  0.000000  0.47363  0.000000
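
The scores above can be reproduced by hand, which shows where the numbers come from. The sketch below assumes the default `TfidfVectorizer` settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The token lists are written out manually (lowercased, with the stop words "and", "again", and "is" already removed), and the helper name `tfidf_row` is hypothetical.

```python
import math
from collections import Counter

# Tokens per document after lowercasing and stop word removal (written by hand)
docs = [
    ["hello", "welcome", "welcome"],
    ["shri", "love", "nlp"],
    ["shri", "good", "boy"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf_row(doc):
    """Return L2-normalized TF-IDF weights for one document (hypothetical helper)."""
    counts = Counter(doc)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # Scale the row to unit Euclidean length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print({term: round(w, 6) for term, w in sorted(row.items())})
```

"shri" occurs in two of the three documents, so its IDF is lower and it gets a smaller weight (about 0.4736) than "love" or "nlp" (about 0.6228), which each occur in only one document.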
