# Feature Extraction in Machine Learning
Feature Extraction is the process of transforming raw data into numerical or symbolic features that can be used by machine learning algorithms. It is especially important when working with high-dimensional, unstructured, or complex data such as text, images, or time series.

## 📌 What is Feature Extraction?
- It converts raw input into a set of measurable, informative features.
- Good features capture the underlying structure or pattern in the data while reducing dimensionality and retaining the important information.
- 🧠 Goal: Extract informative features that improve model performance and reduce noise or redundancy.

### 🔍 Why Is Feature Extraction Important?

| Benefit                          | Description                                     |
| -------------------------------- | ----------------------------------------------- |
| 🧹 Reduces dimensionality        | Compresses the data into fewer variables                      |
| 🎯 Enhances model performance    | Removes irrelevant or redundant information                   |
| 🔍 Improves interpretability     | Converts raw input into understandable features               |
| ⚙️ Required for non-tabular data | Text, images, and audio must be converted to numeric vectors  |


# Types of Feature Extraction (Based on Data Type)

## 1️⃣ For Text Data (NLP)

#### 🔹 Bag of Words (BoW)
- Counts the number of times each word appears in a document.
- Ignores grammar and word order.

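```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus; replace with your own list of document strings.
dataset = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(dataset)  # sparse matrix: documents x vocabulary
```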
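
To make the counting step concrete, here is a minimal from-scratch sketch using only the standard library. The function name `bag_of_words` and the toy corpus are illustrative assumptions, and the whitespace tokenizer is far simpler than what `CountVectorizer` does.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of strings."""
    # Lowercase and split on whitespace; a real tokenizer also handles punctuation.
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives every word a stable column index.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

docs = ["the cat sat", "the cat saw the dog"]  # toy corpus for illustration
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column one vocabulary word, so word order is lost, as noted above.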

#### 🔹 TF-IDF (Term Frequency-Inverse Document Frequency)
- Down-weights terms that appear in many documents, so that words distinctive to a document stand out.

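```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; replace with your own list of document strings.
dataset = ["the cat sat on the mat", "the dog chased the cat"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(dataset)  # sparse matrix of TF-IDF weights
```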
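
The weighting itself can be sketched in a few lines. This version assumes the textbook definition (term frequency times `log(N / df)`); the function name `tf_idf` and the toy corpus are illustrative, and `TfidfVectorizer` uses a smoothed and normalized variant by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency (relative count) times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog barked"]  # toy corpus for illustration
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "the" occurs in every document, so its weight is zero, while "cat" keeps a positive weight.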

#### 🔹 Word Embeddings
- Converts words to dense vectors preserving meaning (semantic similarity).

| Method   | Description                                                                      |
| -------- | -------------------------------------------------------------------------------- |
| Word2Vec | Learns one static vector per word from its surrounding context words             |
| GloVe    | Combines global matrix factorization with local context                          |
| BERT     | Contextual representation: the same word gets different vectors in each sentence |

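```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences (lists of words).
sentences = [["machine", "learning", "is", "fun"],
             ["deep", "learning", "uses", "neural", "networks"]]

# min_count=1 keeps every word; the default of 5 would drop all words in a corpus this small.
model = Word2Vec(sentences, vector_size=100, min_count=1)
vector = model.wv['machine']  # 100-dimensional vector for "machine"
```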
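
Semantic similarity between embeddings is commonly measured with cosine similarity. The sketch below assumes made-up 3-dimensional vectors (`vec_a`, `vec_b`, `vec_c`) purely to show the arithmetic; real embeddings would come from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors, not taken from any trained model.
vec_a = np.array([0.90, 0.80, 0.10])
vec_b = np.array([0.85, 0.75, 0.20])  # points in nearly the same direction as vec_a
vec_c = np.array([0.10, 0.00, 0.90])  # points in a very different direction

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

The first pair scores close to 1 and the second much lower, which is how "similar meaning" shows up numerically.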

#### 🔹 Other NLP Features
| Feature                              | Description                   |
| ------------------------------------ | ----------------------------- |
| Text length                          | Number of characters or words |
| Count of specific POS (nouns, verbs) | Linguistic structure          |
| Sentiment score                      | Polarity of sentiment         |
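
The length-based features in the table need no NLP library. The following is a small sketch; the function name `basic_text_features` and the choice of statistics are assumptions for illustration. POS counts and sentiment scores require a tagger or sentiment model and are not covered here.

```python
def basic_text_features(text):
    """Return simple length statistics for a string."""
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        # Guard against empty input to avoid division by zero.
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

features = basic_text_features("Feature extraction helps models")
print(features)  # {'n_chars': 31, 'n_words': 4, 'avg_word_len': 7.0}
```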

---

## 2️⃣ For Image Data

#### 🔹 Raw Pixel Values
- Flatten the image matrix (e.g., 28x28 becomes 784 features).

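```python
import numpy as np

image = np.zeros((28, 28))   # placeholder 28x28 grayscale image
pixels = image.reshape(-1)   # flattened to shape (784,)
```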

#### 🔹 Histogram of Oriented Gradients (HOG)
- Captures edge directions and shapes.

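```python
from skimage import data
from skimage.feature import hog

image = data.camera()  # sample grayscale image from scikit-image
# With visualize=True, hog returns the feature vector and an image of the gradients.
features, hog_image = hog(image, visualize=True)
```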

#### 🔹 Color Histograms
- Extract color distribution in the image.
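
A color histogram can be computed with NumPy alone. This sketch assumes an `H x W x 3` `uint8` image and 8 bins per channel; the function name `color_histogram` and the random stand-in image are illustrative.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Concatenate per-channel histograms of an H x W x 3 uint8 image."""
    channels = []
    for c in range(image.shape[2]):
        hist, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        channels.append(hist)
    features = np.concatenate(channels).astype(float)
    # Normalize so the feature vector does not depend on image size.
    return features / features.sum()

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # random stand-in image
features = color_histogram(image)
print(features.shape)  # (24,)
```

The result discards where colors occur and keeps only how much of each color range is present.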

#### 🔹 Deep Learning Feature Extraction (CNN)
- Use pre-trained models (VGG16, ResNet, etc.) to extract features from intermediate layers.

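```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

model = VGG16(include_top=False, weights='imagenet')   # convolutional base only
image = np.random.rand(224, 224, 3) * 255              # placeholder RGB image
# predict expects a batch, so add a leading axis and apply VGG16 preprocessing.
batch = preprocess_input(np.expand_dims(image, axis=0))
features = model.predict(batch)                        # shape (1, 7, 7, 512)
```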

---

## 3️⃣ For Audio Data

#### 🔹 MFCC (Mel-Frequency Cepstral Coefficients)
- Summarizes the short-term spectral envelope of the signal, which relates to timbre.

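```python
import librosa

# "audio.wav" is a placeholder path; point it at your own file.
audio, sr = librosa.load("audio.wav")
mfcc = librosa.feature.mfcc(y=audio, sr=sr)  # shape (n_mfcc, n_frames); n_mfcc defaults to 20
```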

#### 🔹 Chroma Features
- Represents how spectral energy is distributed across the 12 pitch classes (semitones) of the octave.

#### 🔹 Spectral Features
- Includes spectral centroid, roll-off, flux, etc.
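
As one example, the spectral centroid is the magnitude-weighted mean frequency of the spectrum. The sketch below computes it once over a whole synthetic signal; audio libraries typically compute it per short frame instead, and the function name and test tone here are assumptions for illustration.

```python
import numpy as np

def spectral_centroid(signal, sr):
    """Magnitude-weighted mean frequency (Hz) of the whole signal."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * magnitudes) / np.sum(magnitudes))

sr = 8000                           # sampling rate in Hz
t = np.arange(sr) / sr              # one second of samples
tone = np.sin(2 * np.pi * 440 * t)  # pure 440 Hz sine wave
centroid = spectral_centroid(tone, sr)
print(round(centroid))  # 440
```

For a pure tone the centroid sits at the tone's frequency; brighter sounds with more high-frequency energy push it upward.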

---

## 4️⃣ For Time Series Data

#### 🔹 Lag Features
- Values from previous time steps are used as input features for the current step.
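
A minimal sketch with pandas, assuming a single column named `value` (the column name and the numbers are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 15, 11, 14]})  # toy series

# shift(k) moves the column down by k rows, so each row sees the value k steps earlier.
for lag in (1, 2):
    df[f"lag_{lag}"] = df["value"].shift(lag)

# The first rows have no history and contain NaN, so drop them before training.
lagged = df.dropna()
print(lagged)
```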

#### 🔹 Rolling Statistics
- Statistics computed over a sliding window, such as the rolling mean (moving average), minimum, and standard deviation.
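
A short sketch with pandas, assuming a window of 3 steps (the window size and the toy values are illustrative choices):

```python
import pandas as pd

values = pd.Series([10, 12, 15, 11, 14])  # toy series

# Each statistic is computed over the current value and the two before it.
window = values.rolling(window=3)
rolling_stats = pd.DataFrame({
    "mean": window.mean(),
    "min": window.min(),
    "std": window.std(),
})
print(rolling_stats)
```

The first two rows are NaN because a full window is not yet available. For forecasting, it is common to shift these columns by one step so the window excludes the value being predicted.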

#### 🔹 Fourier / Wavelet Transforms
- The Fourier transform maps a signal from the time domain to the frequency domain; wavelet transforms give a joint time-frequency view.
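
As a sketch of a Fourier-based feature, the dominant frequency of a signal can be read off the FFT magnitude spectrum. The synthetic signal and sampling rate below are assumptions chosen for illustration.

```python
import numpy as np

fs = 100                   # sampling rate in Hz
t = np.arange(1000) / fs   # 10 seconds of samples
# Strong 5 Hz component plus a weaker 12 Hz component.
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.abs(np.fft.rfft(signal))            # magnitude per frequency bin
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)  # bin centers in Hz
dominant = float(freqs[np.argmax(spectrum)])
print(dominant)  # 5.0
```

Other features derived the same way include the energy in chosen frequency bands or the top few peak frequencies.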

#### 🔹 Autocorrelation Features
- Measures how strongly the current value correlates with its own past values at a given lag.
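
A minimal sketch of the sample autocorrelation at a few lags, using NumPy. The function name `autocorr` and the synthetic series with period 20 are assumptions for illustration.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Covariance between the series and its lagged copy, scaled by the variance term.
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

t = np.arange(200)
seasonal = np.sin(2 * np.pi * t / 20)  # synthetic series that repeats every 20 steps

acf_features = {lag: autocorr(seasonal, lag) for lag in (1, 10, 20)}
print(acf_features)
```

The value at lag 20 is strongly positive (one full period) and the value at lag 10 strongly negative (half a period), which is the kind of signal these features expose to a model.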

#### 🔹 tsfresh / Kats
- Python libraries for automatic time series feature extraction.

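```python
from tsfresh import extract_features

# df is assumed to be a long-format DataFrame with an 'id' column identifying each
# series, a 'time' column giving the order, and one or more value columns.
features = extract_features(df, column_id='id', column_sort='time')
```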

---

## 5️⃣ For Tabular Data
Even tabular data may benefit from:

- PCA (Principal Component Analysis)
- ICA (Independent Component Analysis)
- Autoencoders (deep learning-based)
- t-SNE / UMAP (mainly for visualization rather than as model inputs)
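
As a sketch of the first option, PCA from scikit-learn can compress correlated columns into a few components. The synthetic table below (5 columns driven by 2 latent factors) is an assumption made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))  # 2 hidden factors
# 5 observed columns are linear mixes of the 2 factors plus a little noise.
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # shape (200, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # share of variance kept by 2 components
```

Because the columns here are almost exactly driven by two factors, two components retain nearly all the variance; on real tables the retained share guides how many components to keep.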


### 📦 Libraries for Feature Extraction

| Data Type   | Libraries                                             |
| ----------- | ----------------------------------------------------- |
| Text        | `sklearn`, `nltk`, `spacy`, `gensim`                  |
| Image       | `OpenCV`, `scikit-image`, `tensorflow`, `torchvision` |
| Audio       | `librosa`, `pyAudioAnalysis`                          |
| Time Series | `tsfresh`, `kats`, `sktime`                           |
| General     | `sklearn.decomposition`; autoencoders via `tensorflow` |