# Feature-extraction summary (bullet-point form)

- Bag of Words (Count-based)
    - What: Represents each document as counts of vocabulary terms (term-frequency).
    - How in this notebook: `cv = CountVectorizer()` → `bow = cv.fit_transform(df['text']).toarray()`, which gives a 4×6 integer count matrix (4 documents, 6 vocabulary terms). After the later refit with `ngram_range=(1, 2)`, `bow` is 4×13.
    - Pros: Simple, interpretable, fast; works well for many baseline models.
    - Cons: Discards word order and context; high-dimensional and sparse; raw counts are not normalized for document length.
    - Tip: Use `CountVectorizer(binary=True)` for pure presence/absence features, or normalize/scale the counts before fitting models that are sensitive to feature scale.

- One‑Hot Encoding (binary presence)
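    - What: Each vocabulary term becomes a binary feature (1 if it appears in the document, else 0). Strictly speaking this is a binary bag of words (multi-hot), since one document can switch on several features at once.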
    - How to get it here: `CountVectorizer(binary=True)` or convert counts: `(bow > 0).astype(int)`.
    - Pros: Removes the effect of repeated terms, which reduces (but does not eliminate) document-length bias; simple for presence-based signals.
    - Cons: Still sparse/high-dimensional; ignores frequency and context.

- N‑grams (capture local order)
    - What: Extend tokens to contiguous sequences of length n (bigrams, trigrams, etc.) to encode short-term word order.
    - How in this notebook: `CountVectorizer(ngram_range=(2,2))` for bigrams, `ngram_range=(1,2)` for unigrams + bigrams (current `cv = CountVectorizer(ngram_range=(1, 2))`).
    - Pros: Captures local phrases and context (e.g., "write comment"), improves performance when phrase meaning matters.
    - Cons: Sharp increase in dimensionality and sparsity as n and corpus size grow; may need feature selection or hashing.

- TF‑IDF (term-frequency × inverse-document-frequency)
    - What: Weights terms by frequency in a doc and rarity across corpus: downweights common words and upweights discriminative ones.
    - How in this notebook: `tfidf = TfidfVectorizer()` → `tfidf.fit_transform(df['text']).toarray()`; after fitting, `tfidf.idf_` holds the inverse document frequency of each vocabulary term.
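    - Pros: Often better for classification/IR than raw counts; reduces the impact of common tokens, and `TfidfVectorizer` L2-normalizes each row by default, which compensates for document length.
    - Cons: Still sparse and high-dimensional; on small corpora or very short documents the idf estimates rest on few observations and can be unreliable.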

- Practical guidance / trade-offs
    - Start with TF‑IDF (unigrams) for classification problems; add n‑grams if phrases are important.
    - Use binary features (one‑hot) if presence matters more than frequency.
    - Apply dimensionality reduction / feature selection (chi2, mutual info, TruncatedSVD) or hashing for large vocabularies.
    - Always inspect `cv.vocabulary_`, feature matrix shape, and sparsity before modeling; tune `ngram_range`, `min_df`, `max_df`, and stop-words.
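
The inspection and pruning advice above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the corpus is assumed to be the four documents implied by the printed vocabularies (first document "people with campusx"), and the names `docs`, `density`, and `cv_pruned` are introduced only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed corpus, reconstructed from the vocabularies printed in this notebook.
docs = [
    "people with campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Unigrams + bigrams; keep the result sparse instead of calling .toarray().
cv = CountVectorizer(ngram_range=(1, 2))
X = cv.fit_transform(docs)

# Shape and density (fraction of non-zero cells) before modeling.
n_docs, n_features = X.shape
density = X.nnz / (n_docs * n_features)
print(X.shape, round(density, 3))

# min_df=2 drops every term that occurs in fewer than two documents.
cv_pruned = CountVectorizer(ngram_range=(1, 2), min_df=2)
X_pruned = cv_pruned.fit_transform(docs)
print(sorted(cv_pruned.vocabulary_))
```

On a corpus this small the pruning is drastic; on real data `min_df` and `max_df` are usually tuned together with `ngram_range`.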

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Bag of Words

In [3]:
df = pd.DataFrame({"text": ["people with campusx", "campusx watch campusx", "people write comment", "campusx write comment"], "output": [1, 1, 0, 0]})

In [4]:
df

                    text  output
0    people with campusx       1
1  campusx watch campusx       1
2   people write comment       0
3  campusx write comment       0


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [8]:
bow = cv.fit_transform(df['text']).toarray()
print(cv.vocabulary_)

{'people': 2, 'with': 4, 'campusx': 0, 'watch': 3, 'write': 5, 'comment': 1}


In [9]:
print(bow)

[[1 0 1 0 1 0]
 [2 0 0 1 0 0]
 [0 1 1 0 0 1]
 [1 1 0 0 0 1]]


In [13]:
bow[0]

array([1, 0, 1, 0, 1, 0])

In [15]:
cv.transform(["campusx watch and write comments of campusx"]).toarray()

array([[2, 0, 0, 1, 0, 1]])
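
Two behaviours are worth checking at this point: words not seen during `fit` are silently dropped at `transform` time, and a binary (presence/absence) matrix can be obtained either with `binary=True` or by thresholding the counts. The following is a minimal sketch under the assumption that the corpus is the one implied by the printed vocabulary; `docs`, `counts`, `binary`, and `new_doc` are names introduced for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Assumed corpus, reconstructed from the printed vocabulary.
docs = [
    "people with campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

cv = CountVectorizer()
counts = cv.fit_transform(docs).toarray()

# "and", "comments", and "of" are not in the fitted vocabulary, so they are ignored.
new_doc = cv.transform(["campusx watch and write comments of campusx"]).toarray()
print(new_doc)

# Presence/absence features: binary=True clips every non-zero count to 1.
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Thresholding the raw counts gives the same matrix.
print(np.array_equal((counts > 0).astype(int), binary))
print(counts[1], binary[1])
```

Note that "comments" does not match "comment": without stemming or lemmatization the two are distinct tokens.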

# N‑grams (bigrams only)

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2,2))

In [17]:
bow = cv.fit_transform(df['text']).toarray()
print(cv.vocabulary_)

{'people with': 2, 'with campusx': 5, 'campusx watch': 0, 'watch campusx': 4, 'people write': 3, 'write comment': 6, 'campusx write': 1}


# Unigrams + Bigrams

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,2))

In [19]:
bow = cv.fit_transform(df['text']).toarray()
print(cv.vocabulary_)

{'people': 4, 'with': 9, 'campusx': 0, 'people with': 5, 'with campusx': 10, 'watch': 7, 'campusx watch': 1, 'watch campusx': 8, 'write': 11, 'comment': 3, 'people write': 6, 'write comment': 12, 'campusx write': 2}
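
To see how quickly the feature space grows with `ngram_range`, the vocabulary sizes can be compared directly. This is an illustrative sketch, assuming the corpus implied by the printed vocabularies; `docs` and `sizes` are example names, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed corpus, reconstructed from the printed vocabularies.
docs = [
    "people with campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Number of features produced by each n-gram setting.
sizes = {}
for ngram_range in [(1, 1), (2, 2), (1, 2)]:
    cv = CountVectorizer(ngram_range=ngram_range)
    sizes[ngram_range] = cv.fit_transform(docs).shape[1]
    print(ngram_range, sizes[ngram_range])
```

The counts agree with the vocabularies printed in the notebook: 6 unigrams, 7 bigrams, and 13 features for the combined range. On a realistic corpus the bigram vocabulary is typically far larger than the unigram one, which is why `min_df` or hashing becomes relevant.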


# TF‑IDF

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()

array([[0.44809973, 0.        , 0.55349232, 0.        , 0.70203482,
        0.        ],
       [0.78722298, 0.        , 0.        , 0.61666846, 0.        ,
        0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.        ,
        0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.        ,
        0.61366674]])

In [22]:
print(tfidf.vocabulary_)

{'people': 2, 'with': 4, 'campusx': 0, 'watch': 3, 'write': 5, 'comment': 1}


In [23]:
print(tfidf.idf_)

[1.22314355 1.51082562 1.51082562 1.91629073 1.91629073 1.51082562]
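
The `idf_` values can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t. The corpus is assumed from the printed vocabulary, and `docs`, `doc_freq`, and `idf_manual` are names introduced for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Assumed corpus, reconstructed from the printed vocabulary.
docs = [
    "people with campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents does each term appear?
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf, matching the TfidfVectorizer default (smooth_idf=True).
idf_manual = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = TfidfVectorizer().fit(docs)
print(idf_manual)
print(np.allclose(idf_manual, tfidf.idf_))
```

"campusx" occurs in three of the four documents, so it receives the lowest weight, ln(5/4) + 1; "watch" and "with" each occur in a single document and receive the highest, ln(5/2) + 1.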


In [24]:
print(tfidf.get_feature_names_out())

['campusx' 'comment' 'people' 'watch' 'with' 'write']
