# Feature Extraction 

Michael Mommert, Stuttgart University of Applied Sciences, 2025

This notebook provides an introduction to feature extraction techniques for different data modalities.

In [None]:
%pip install numpy \
    pandas \
    matplotlib \
    pillow \
    scikit-image \
    nltk

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Quantitative Data

Consider the following DataFrame. Create a new DataFrame that contains the temperature in units of Kelvin and the rain amount in units of inches.

In [None]:
df = pd.DataFrame({
    'temp_C': [-0.3, 0.4, 3.9, 7.4, 12.0, 15.0, 17.2, 16.8, 13.1, 9.1, 3.7, 0.8],
    'rain_mm': [59, 57, 84, 100, 143, 153, 172, 164, 135, 89, 88, 80]},
    index=['jan', 'feb', 'mar', 'apr', 'may', 'jun', 
           'jul', 'aug', 'sep', 'oct', 'nov', 'dec'])
df

**Exercise**: Convert the temperature values to kelvin (0 K = -273.15 °C).

In [None]:
# use this cell for the exercise

**Exercise**: Convert the rain amount from millimeters to inches (1 inch = 25.4 mm).

In [None]:
# use this cell for the exercise

**Exercise**: Create a new DataFrame (`df2`) that contains both the temperature in Kelvin and the rain amount in inches.

In [None]:
# use this cell for the exercise
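
Both conversions above are element-wise arithmetic on DataFrame columns. The following is only a minimal sketch on a small made-up DataFrame; the names `toy` and `toy2`, the values, and the column names `temp_K` and `rain_in` are assumptions for illustration and not part of the exercise data.

```python
import pandas as pd

# hypothetical toy data, not the exercise DataFrame
toy = pd.DataFrame({'temp_C': [0.0, 25.0], 'rain_mm': [25.4, 50.8]},
                   index=['a', 'b'])

# element-wise conversions; 0 K corresponds to -273.15 degrees Celsius
toy2 = pd.DataFrame({
    'temp_K': toy['temp_C'] + 273.15,
    'rain_in': toy['rain_mm'] / 25.4},
    index=toy.index)
print(toy2)
```

Building the new DataFrame from the converted columns leaves the original DataFrame unchanged, which makes it easy to compare the two.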

## Qualitative Data

We download a file that contains the names of the passengers who were on board the Titanic when it sank.

In [None]:
!curl -O https://raw.githubusercontent.com/Hochschule-fuer-Technik-Stuttgart/teaching-mommert/main/dataprocessing/featureextraction/titanic_names.csv

Read in the names of Titanic passengers.

In [None]:
df = pd.read_csv('titanic_names.csv')
df

**Exercise**: Infer the sex of each passenger from their name (hint: consider the titles used to address the passengers).

In [None]:
# use this cell for the exercise
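
One possible approach is to pull the title out of each name with a regular expression and look it up in a small table. The sketch below runs on made-up names and assumes a "Surname, Title. Given names" layout; the helper `guess_sex`, the mapping `title_to_sex`, and the name format are all assumptions, so check them against the actual file before relying on them.

```python
import re

# hypothetical names in "Surname, Title. Given names" form; the actual
# layout of titanic_names.csv may differ
names = ['Doe, Mr. John', 'Doe, Mrs. Jane', 'Roe, Miss. Ann', 'Poe, Dr. Sam']

# assumed mapping from title to sex; ambiguous titles are left unmapped
title_to_sex = {'Mr': 'male', 'Master': 'male',
                'Mrs': 'female', 'Miss': 'female', 'Ms': 'female'}

def guess_sex(name):
    """Return 'male', 'female' or 'unknown' based on the title in name."""
    # the title is taken to be the word between the comma and the next period
    match = re.search(r',\s*([A-Za-z]+)\.', name)
    if match is None:
        return 'unknown'
    return title_to_sex.get(match.group(1), 'unknown')

guesses = [guess_sex(n) for n in names]
print(guesses)
```

Titles such as "Dr" do not reveal the sex, so the sketch returns `'unknown'` for them rather than guessing. The same function could be applied to a DataFrame column with `.apply(guess_sex)`.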

## Image Data

In the following, we download an image and extract different image features from it.

In [None]:
!curl -O https://raw.githubusercontent.com/Hochschule-fuer-Technik-Stuttgart/teaching-mommert/main/dataprocessing/featureextraction/IMG_20230622_085147.jpg

In [None]:
from PIL import Image

# read image
img = np.array(Image.open('IMG_20230622_085147.jpg').convert('RGB'))

# display image
plt.imshow(img)

### Channel Histograms

A channel histogram extracts the pixel value distribution of each image channel or band. 

**Exercise**: Extract the R, G, and B band information into arrays `r`, `g` and `b`, respectively.

In [None]:
# use this cell for the exercise

**Exercise**: Create and plot the channel histograms.

In [None]:
# use this cell for the exercise

**Exercise**: Combine the three channel histograms into a single vector by concatenating their individual vectors.

In [None]:
# use this cell for the exercise
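
The three steps above (splitting channels, building histograms, concatenating) can be sketched with NumPy alone. The example uses a random stand-in image rather than the downloaded photo; the names `demo`, `r_demo`, `g_demo`, `b_demo`, `hists`, and `feature`, as well as the choice of 256 bins, are assumptions for illustration.

```python
import numpy as np

# hypothetical random RGB image standing in for the downloaded photo
rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(32, 48, 3), dtype=np.uint8)

# split the channels along the last axis
r_demo, g_demo, b_demo = demo[..., 0], demo[..., 1], demo[..., 2]

# one 256-bin histogram per channel; np.histogram returns (counts, bin_edges)
hists = [np.histogram(c, bins=256, range=(0, 256))[0]
         for c in (r_demo, g_demo, b_demo)]

# concatenate into a single feature vector of length 3 * 256
feature = np.concatenate(hists)
print(feature.shape)
```

Each entry of `hists` can be passed to `plt.plot` to visualize the distribution of one channel. Fewer bins would give a shorter, coarser feature vector.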

### Canny Edges

The Canny method extracts edges from an image. 

**Exercise**: Use the method (implemented in the scikit-image module) to extract edges from the provided image and plot the resulting array.

In [None]:
from skimage.feature import canny

# use this cell for the exercise
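
Note that `canny` operates on a 2D array, so an RGB image has to be reduced to a single channel first. The sketch below shows this on a synthetic image; the image `demo`, the use of `rgb2gray` for the conversion, and the value `sigma=1.0` are assumptions chosen for illustration, not settings prescribed by the exercise.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import canny

# hypothetical synthetic RGB image: white square on a black background
demo = np.zeros((64, 64, 3), dtype=np.uint8)
demo[16:48, 16:48] = 255

# canny expects a 2D array, so convert RGB to grayscale first
gray = rgb2gray(demo)

# sigma sets the width of the Gaussian smoothing applied before edge detection
edges = canny(gray, sigma=1.0)
print(edges.shape, edges.dtype)
```

The result is a boolean array of the same height and width as the input, which can be displayed with `plt.imshow(edges, cmap='gray')`. Larger values of `sigma` suppress fine detail and keep only the more prominent edges.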

## Text Data

We download the first chapter of the book "Frankenstein" and experiment with some feature extraction methods for text data.

In [None]:
!curl -O https://raw.githubusercontent.com/Hochschule-fuer-Technik-Stuttgart/teaching-mommert/main/dataprocessing/featureextraction/frankenstein_chapter1.txt

In [None]:
# open text file
with open("frankenstein_chapter1.txt", "r") as f:
    data = f.readlines()

# replace all line breaks in the text with spaces
text = ""
for line in data:
    text += line.replace('\n', ' ')
text

**Exercise**: Use `nltk.word_tokenize` to tokenize the text.

In [None]:
import nltk
nltk.download('punkt')
# note: recent NLTK releases may require nltk.download('punkt_tab') instead

# use this cell for the exercise
#tokens = ...

We can assign a part-of-speech tag to each token using a tagger:

In [None]:
nltk.download('averaged_perceptron_tagger')
# note: recent NLTK releases may require 'averaged_perceptron_tagger_eng' instead

nltk.pos_tag(tokens)

We can also use a stemmer to find the word stems of our tokens.

In [None]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

for w in tokens[:100]:
    print(w, stemmer.stem(w))

**Exercise**: Based on the word stems, find the 10 most common words.

In [None]:
# use this cell for the exercise
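
Counting the most frequent items in a list is a natural fit for `collections.Counter`. The sketch below uses a short made-up list in place of the stemmed tokens; the names `stems`, `counts`, and `top`, and the decision to drop punctuation with `str.isalpha`, are assumptions for illustration.

```python
from collections import Counter

# hypothetical list of stems, standing in for the stemmed tokens
stems = ['the', 'monster', 'walk', 'the', 'ice', ',',
         'the', 'monster', 'speak', '.']

# drop punctuation tokens, then count the remaining stems
counts = Counter(s for s in stems if s.isalpha())

# most_common(n) returns (stem, count) pairs sorted by descending count
top = counts.most_common(2)
print(top)
```

For the exercise, the list would hold one stem per token and `most_common(10)` would return the ten most frequent ones. Lowercasing the tokens before stemming avoids counting "The" and "the" separately.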