<a href="https://colab.research.google.com/github/ACTH-DKES/ACTH2025/blob/main/week7/Week7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning in the Humanities

Machine Learning (ML) is a field of computer science that enables computers to learn patterns from data and make **predictions** or decisions without being explicitly programmed for each task. **The results are estimates: we can rarely rely on them 100%.**

In the humanities, ML can help us:
- Group artworks by stylistic similarities
- Discover hidden themes or topics in text
- Classify objects based on metadata
- Generate recommendations or analogies

We will first explore three text ML tasks using the Cleveland Museum of Art Open Access dataset:
1. **Text Classification**
2. **Clustering**
3. **Topic Modeling**

## Key Concepts

### Training, Validation, and Test Sets
- **Training Set**: Used to train the model (learn patterns)
- **Validation Set**: (optional) Tune model parameters
- **Test Set**: Evaluate model on unseen data
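
As a rough illustration of the three sets, here is a minimal sketch of a shuffled three-way split using only the standard library. The helper name `three_way_split` and the 70/15/15 proportions are assumptions chosen for the example; they are not what this notebook uses later on.

```python
import random

def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle items and split them into train / validation / test lists."""
    rng = random.Random(seed)        # fixed seed -> the split is reproducible
    shuffled = list(items)           # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]  # everything that is left
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

The important property is that the three lists never overlap, so the test items stay unseen until the final evaluation.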

### Evaluation Metrics
- **Accuracy**: % of correct predictions
- **Precision/Recall/F1**: Precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of the two
- **Confusion Matrix**: Breakdown of true vs. predicted classes
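
To make these metrics concrete, here is a small sketch that computes them by hand for one class. The labels are invented toy data and the function name `precision_recall_f1` is hypothetical; for real work, scikit-learn's `classification_report` does the same job for every class at once.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels, invented for this example
y_true = ["Bowl", "Bowl", "Painting", "Sword", "Bowl", "Painting"]
y_pred = ["Bowl", "Painting", "Painting", "Sword", "Bowl", "Bowl"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="Bowl")
print(accuracy, precision, recall, f1)
```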

### Supervised vs. Unsupervised Learning
- **Supervised Learning**: Learn from labeled data (e.g., classify artwork type from description)
- **Unsupervised Learning**: Find structure in unlabeled data (e.g., group similar descriptions)


We will use the following Python libraries:

- **pandas**: Loading and manipulating tabular data
- **scikit-learn** (`sklearn`): Machine learning tools
- **nltk** or **spaCy**: Text preprocessing
- **matplotlib / seaborn**: Visualization

In [None]:
# Install required libraries (if needed)
#!pip install pandas scikit-learn matplotlib seaborn nltk

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, NMF

import nltk
import re
nltk.download('stopwords')
from nltk.corpus import stopwords

In [None]:
import pandas as pd

df_filt15k = pd.read_csv("https://raw.githubusercontent.com/ACTH-DKES/ACTH2025/refs/heads/main/week7/filteredCleveland15k.csv")

df_filt15k = df_filt15k.drop(columns=['Unnamed: 0'])

In [None]:
stop_words = set(stopwords.words("english"))  # a set makes membership tests fast

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation (keep word characters and whitespace)
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)

df_filt15k["clean_description"] = df_filt15k["description"].apply(clean_text)

In [None]:
df_filt15k.head()

# Text Classification

We'll use **TF-IDF** (Term Frequency-Inverse Document Frequency) to transform text into numerical vectors, and train a **Logistic Regression** model to predict the object `type` based on the description.

This is a **supervised learning** task.

TF-IDF = Term Frequency * Inverse Document Frequency. It weights words that are frequent in a document but rare across the corpus, giving more importance to informative terms.
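
The formula can be worked through by hand on a tiny corpus. This is a sketch of the textbook definition (relative term frequency times `log(N / df)`), with three invented descriptions. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three invented descriptions, already lowercased
docs = [
    "blue glazed bowl",
    "glazed ceramic bowl",
    "oil painting of a bowl",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: relative term frequency * log(N / document frequency)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tfidf(tokens) for tokens in tokenized]
print(weights[0])
```

In this toy corpus "bowl" appears in every document, so its weight drops to zero, while "blue" (unique to the first description) gets the highest weight there.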

### Logistic Regression

**Logistic Regression** is a supervised machine learning algorithm used for **classification tasks**. Despite its name, it is not used for regression.

#### What It Does
Logistic Regression models the **probability** that a given input belongs to a particular class.

- For **binary classification**, it outputs a value between 0 and 1 using the **sigmoid function**:
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
  This value is interpreted as the probability of belonging to the positive class.

- For **multiclass classification**, it uses the **softmax function** to assign probabilities across all classes.
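
Both functions are short enough to write out directly. The sketch below uses plain Python and made-up scores; it only illustrates the two formulas and is not how scikit-learn implements them internally.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))               # a score of 0 sits exactly on the decision boundary
probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
print(probs)
```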

#### Why Use It
- Simple and computationally efficient
- Performs well with high-dimensional data (such as text represented by TF-IDF vectors)
- Provides interpretable probability outputs
- Includes built-in regularization options to help prevent overfitting

TLDR: **Logistic Regression** is used as a baseline for classification problems.



# Text Classification task

We want to **predict the `type`** (e.g., "Bowl", "Painting", "Sword") from the cleaned description.

To do this:
1. Convert text to numerical features using **TF-IDF**
2. Train a **Logistic Regression** classifier
3. Evaluate the model

We limit the number of features to 3000 for efficiency and to avoid overfitting using:
```python
TfidfVectorizer(max_features=3000)
```
### More info on Overfitting: When the Model Knows Too Much

**Overfitting** happens when a machine learning model performs very well on the training data but poorly on unseen data. This means the model has "memorized" the training set rather than learned general patterns.

#### Symptoms of Overfitting:
- Very high accuracy on the training set
- Much lower accuracy on the test set
- Unusual or overconfident predictions

#### Why Does Overfitting Happen?
- The model is too complex for the amount of data (e.g. too many parameters or features)
- The training set contains noise or biases that the model learns
- The dataset is small or not representative

#### How to Avoid Overfitting:
- Use **simpler models** (e.g., fewer TF-IDF features with `max_features`)
- **Split your data** into training/test sets (we use `test_size=0.2` for this); the split does not prevent overfitting by itself, but it lets you detect it
- Apply **regularization** (Logistic Regression does this by default)
- Get **more data** (hard in the humanities, but ideal)


In [None]:
tfidf = TfidfVectorizer(max_features=3000)  # Keep top 3000 terms
X = tfidf.fit_transform(df_filt15k["clean_description"])  # Fit and transform: learn vocab + transform text to vector
y = df_filt15k["type"]

# Drop rare classes to reduce noise
counts = y.value_counts()
keep_labels = counts[counts > 30].index  # Keep classes with >30 instances

# Convert boolean Series to numpy array for indexing sparse matrix
filter_mask = y.isin(keep_labels).to_numpy()
X = X[filter_mask]
y = y[y.isin(keep_labels)]


# Split data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# test_size=0.2 means 20% of the data will be used for testing
# random_state: reproducibility, it will always split the data in the same way

# Train classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)  # Fit = train the model on (X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)  # Predict = generate output for unseen test data

# Report
print(classification_report(y_test, y_pred, zero_division=0))

In [None]:
new_desc = "A lady with a rose depicted with strong blush and vivid colors, printed by London Inc Press"
cleaned = clean_text(new_desc)
X_new = tfidf.transform([cleaned])  # Note: we use transform, not fit_transform
prediction = clf.predict(X_new)
print("Predicted type:", prediction[0])
#.fit_transform() is only used during training.
#.transform() ensures new data is converted using the existing vocabulary and weights.

In [None]:
probs = clf.predict_proba(X_new)
class_probs = pd.Series(probs[0], index=clf.classes_).sort_values(ascending=False)
print(class_probs.head())

## Exercise: predictive function

Develop the function
`predict_label(description, vectorizer, model, clean_f, prob = False)`
that (i) takes as input a description, a vectorizer, a trained model, a cleaning function, and a Boolean (prob) which is False by default and (ii) returns the predicted class from the model if prob is False, otherwise it returns the top 5 probable classes if prob is True.

Test it with the current vectorizer, trained model and cleaning function written before in the notebook.

<details> <summary>Solution</summary>
<pre>
def predict_label(description, vectorizer, model, clean_f, prob = False):
    cleaned = clean_f(description)
    X_new = vectorizer.transform([cleaned])
    if prob:
        probs = model.predict_proba(X_new)
        class_probs = pd.Series(probs[0], index=model.classes_).sort_values(ascending=False)
        return class_probs.head()
    else:
        prediction = model.predict(X_new)
        return prediction[0]

predict_label("This photograph represents Mount Fuji snowing", tfidf, clf, clean_text, True)
</pre>
</details>

# Check overfitting

If the accuracy on the training set is very high but the accuracy on the test set is low, the model is overfitting (i.e., it has learned the patterns of the data it has seen but struggles with new data).

In [None]:
from sklearn.metrics import accuracy_score

# Predict on the training data
y_train_pred = clf.predict(X_train)

# Compare predictions to true training labels
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Training Accuracy:", train_accuracy)

In [None]:
y_test_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

In [None]:
import matplotlib.pyplot as plt

plt.bar(["Training", "Test"], [train_accuracy, test_accuracy], color=["pink", "yellow"])
plt.title("Training vs Test Accuracy")
plt.ylim(0, 1)
plt.ylabel("Accuracy")
plt.grid(axis='y')
plt.show()


# Confusion matrix

Visualizing how classes are confused by the model

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import numpy as np

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
cm_df = pd.DataFrame(cm, index=clf.classes_, columns=clf.classes_)

# Plot
plt.figure(figsize=(15, 10))
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.ylabel("Actual Label")
plt.xlabel("Predicted Label")
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Clustering

Clustering is **unsupervised learning** — we don't provide labels. We let the model group similar items based on their TF-IDF vectors.

We use:
- **KMeans**: Standard clustering algorithm. We set `n_clusters=5` to force 5 groups.
- **PCA**: Principal Component Analysis reduces high-dimensional vectors to 2D for visualization.

`.fit_predict()` fits the KMeans model and returns the cluster each item belongs to.
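
To see what KMeans does under the hood, here is a toy sketch of the classic two-step loop (assign each point to its nearest centroid, then move each centroid to the mean of its members). The function `kmeans` below and the six 2-D points are invented for illustration; scikit-learn's `KMeans` is far more refined (smarter initialization, convergence checks, sparse input).

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Toy k-means for 2-D points: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                 # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids

# Two obvious groups: one near (0, 0), one near (5, 5)
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary: what matters is which points share a label, not whether a group is called 0 or 1.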


In [None]:
# Run KMeans clustering
k = 5
kmeans = KMeans(n_clusters=k, random_state=0)
clusters = kmeans.fit_predict(X)

# Reduce dimensionality for plotting
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X.toarray())  # Convert sparse to dense for PCA

# Build a DataFrame for plotting
df_plot = pd.DataFrame({
    "x": X_reduced[:, 0],
    "y": X_reduced[:, 1],
    "cluster": clusters
})

# Visualize the clusters
plt.figure(figsize=(8,6))
sns.scatterplot(data=df_plot, x="x", y="y", hue="cluster", palette="tab10")
plt.title("KMeans Clustering of Artwork Descriptions (PCA Projection)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend(title="Cluster")
plt.show()


In [None]:
df_clustered = df_filt15k[filter_mask].copy()
df_clustered["cluster"] = clusters

In [None]:
for i in range(kmeans.n_clusters):
    print(f"\nCluster {i} — {df_clustered[df_clustered['cluster'] == i].shape[0]} artworks")
    #print(df_clustered[df_clustered["cluster"] == i][["title", "type", "clean_description"]].head(5))

In [None]:
df_clustered[df_clustered["cluster"] == 3]["description"]

Want more machine learning? Look at the scikit-learn documentation: https://scikit-learn.org/stable/

# Machine Learning on the visual aspects


We’ll use a combination of libraries:

- **Pandas**: for data handling
- **NumPy**: for matrix operations
- **Matplotlib**: for visualizations
- **Requests**: to fetch images from URLs
- **PIL (Pillow)**: to handle image loading and resizing
- **TensorFlow / Keras**: to use pre-trained convolutional neural networks (CNNs)
- **Scikit-learn (sklearn)**: for clustering (KMeans) and dimensionality reduction (t-SNE)

### What is a Convolutional Neural Network (CNN)?
A Convolutional Neural Network (CNN) is a type of deep neural network specialized for processing image data. Unlike standard (fully connected) networks, CNNs use convolutional layers that slide small filters over the input image to detect spatial patterns like edges, textures, shapes, and objects.
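
The "sliding filter" idea can be sketched in a few lines of plain Python. The 4×4 image and the hand-written vertical-edge filter below are invented for illustration; in a real CNN the filter values are learned during training, and the operation runs over many filters and channels at once.

```python
def convolve2d(image, kernel):
    """Slide kernel over image (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the filter with the image patch under it and sum the result
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        output.append(row)
    return output

# 4x4 grayscale "image": dark left half, bright right half
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Filter that responds where brightness increases from left to right
kernel = [
    [-1, 1],
    [-1, 1],
]
response = convolve2d(image, kernel)
print(response)
```

The response is zero over the flat regions and large only in the column where the dark-to-bright edge sits, which is exactly the kind of spatial pattern a convolutional layer detects.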

### Keras
Keras is a high-level API for building and training deep learning models. It’s part of the TensorFlow ecosystem.
It's used to easily define layers, losses, optimizers, and metrics; load pre-trained models from large public datasets (e.g., ImageNet); train and evaluate models on your own data.

We use Keras because:

- It simplifies complex model building.
- It integrates with TensorFlow in Python.
- It provides many tools for transfer learning (like MobileNetV2, VGG16, etc.).

### Pre-trained models

A pre-trained model is a neural network that has already been trained on a large dataset, typically ImageNet (the full collection contains over 14 million images; the subset usually used for pre-training has about 1.3 million images across 1,000 categories).

Instead of training a CNN from scratch (which requires a lot of data and time), we reuse the early layers of a pre-trained model to extract general-purpose features. This process is known as transfer learning.

Two ways we use pre-trained models:

1.   Feature Extraction: Freeze all layers and use them as-is
2.   Fine-Tuning: Unfreeze some layers and retrain on a smaller dataset to specialize the model for a new task.

#### Models in Keras

A model in Keras is the full computational pipeline, from input to output:

- It consists of layers: Convolutional → Pooling → Dense, etc.
- It must be compiled with a loss function and an optimizer.
- It can be trained using `.fit()`, evaluated using `.evaluate()`, and used for prediction using `.predict()`.

In our case:

- The input will be an image tensor of shape `(224, 224, 3)`, i.e. (height, width, RGB channels). Many models are pre-trained on this format, so we need to convert our images to it, just as we turned the texts into TF-IDF vectors.
- The output will be a softmax probability distribution over the classes of our chosen category.




## Filtering Images

We’ll use the `"image_web"` field from the Cleveland dataset, which contains direct URLs to artwork images.

Steps:
1. Filter the dataset to only entries with valid image URLs
2. Sample a subset (e.g. 500–1000) to keep runtime reasonable on Colab

In [None]:
df_filt15k = df_filt15k.dropna(subset=["image_web"])
df_filt15k10 = df_filt15k.sample(n=500, random_state=42)

## Download and preprocess images

Each image will be:
- Downloaded via `requests`
- Resized to 224×224 (the standard input size for VGG16)
- Converted to a NumPy array of shape `(224, 224, 3)`
- Normalized using `preprocess_input()` from `keras.applications.vgg16` (because the model is pre-trained, our images must go through the same preprocessing that was used when it was trained)

We skip any failed downloads or unreadable images.
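
To make the normalization step less of a black box, here is a sketch of what the `preprocess_input` functions for VGG16 and MobileNetV2 do to a single RGB pixel, as far as the Keras documentation describes them: the VGG16 version reorders the channels to BGR and subtracts fixed ImageNet channel means, while the MobileNetV2 version rescales values to the range [-1, 1]. The helper names `vgg16_style` and `mobilenet_v2_style` are made up for this example; in practice, always call the library function that belongs to your model.

```python
# ImageNet channel means used by VGG16-style preprocessing, in B, G, R order
VGG16_BGR_MEANS = (103.939, 116.779, 123.68)

def vgg16_style(pixel_rgb):
    """Reorder RGB -> BGR, then subtract the per-channel ImageNet mean."""
    r, g, b = pixel_rgb
    bgr = (b, g, r)
    return tuple(value - mean for value, mean in zip(bgr, VGG16_BGR_MEANS))

def mobilenet_v2_style(pixel_rgb):
    """Rescale each channel from [0, 255] to [-1, 1]."""
    return tuple(value / 127.5 - 1.0 for value in pixel_rgb)

pixel = (255, 128, 0)  # an orange pixel, as (R, G, B)
print(vgg16_style(pixel))
print(mobilenet_v2_style(pixel))
```

The two results look nothing alike, which is why images prepared for one model should not be fed to the other.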

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
from PIL import Image
from io import BytesIO
from tqdm import tqdm

from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

import tensorflow as tf
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

def load_and_preprocess_image(url):
    try:
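        response = requests.get(url, timeout=5)
        response.raise_for_status()  # treat HTTP errors (404, 500, ...) as failures
        img = Image.open(BytesIO(response.content)).convert("RGB")
        img = img.resize((224, 224))
        arr = img_to_array(img)
        arr = preprocess_input(arr)  # Normalize with VGG16 preprocessing
        return arr
    except Exception:  # skip failed downloads or unreadable images
        return None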

images = []
valid_indices = []

for idx, url in enumerate(tqdm(df_filt15k10["image_web"])):  # tqdm wraps the column so the bar knows the total
    img = load_and_preprocess_image(url)
    if img is not None:
        images.append(img)
        valid_indices.append(idx)

images = np.array(images)
df_filt15k10 = df_filt15k10.iloc[valid_indices].reset_index(drop=True)


In [None]:
len(images)

In [None]:
import pickle

with open('images.pickle', 'wb') as handle:
    pickle.dump(images, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
with open('images.pickle', 'rb') as handle:
    images = pickle.load(handle)

In [None]:
## The file was too big for github: https://gigamove.rwth-aachen.de/en/download/373880ad2e3f1d7ff454a8826540ad27/password

# Feature Extraction with VGG16

We use a **pre-trained VGG16 model** from Keras:
- Trained on ImageNet (over 1 million images)
- We exclude the top classification layers (`include_top=False`)
- Output: a 7×7×512 feature map per image

We then flatten these into 1D feature vectors.

### Why is the VGG16 Output Shape `(n_images, 7, 7, 512)`?

When we pass images through the **VGG16** model (with `include_top=False`), we get feature maps with shape:
(n_images, 7, 7, 512)

Remember, input is
- `224 x 224` = width and height of the image in pixels
- `3` = number of color channels (RGB)

---

#### Architecture of VGG16

VGG16 has 13 convolutional layers organized into 5 blocks. Each block ends with a **MaxPooling layer** that reduces the spatial size (width and height) by half.

Here’s how the image dimensions change as it goes through the network:

| Block | Layers                        | Output Shape         |
|-------|-------------------------------|----------------------|
| 1     | Conv → Conv → MaxPool         | (112, 112, 64)       |
| 2     | Conv → Conv → MaxPool         | (56, 56, 128)        |
| 3     | Conv → Conv → Conv → MaxPool  | (28, 28, 256)        |
| 4     | Conv → Conv → Conv → MaxPool  | (14, 14, 512)        |
| 5     | Conv → Conv → Conv → MaxPool  | (7, 7, 512)          |

Each `MaxPool` operation halves the width and height.
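
The table can be reproduced with a few lines of arithmetic. This sketch only tracks the shapes (it does not build the network); the function name `vgg16_feature_shapes` is hypothetical, and the filter counts are taken from the table above.

```python
def vgg16_feature_shapes(height=224, width=224):
    """Return the output shape after each of the 5 VGG16 blocks."""
    filters_per_block = [64, 128, 256, 512, 512]
    shapes = []
    for n_filters in filters_per_block:
        height //= 2   # each block ends with a MaxPool that halves the height...
        width //= 2    # ...and the width
        shapes.append((height, width, n_filters))
    return shapes

shapes = vgg16_feature_shapes()
for block, shape in enumerate(shapes, start=1):
    print(f"Block {block}: {shape}")

h, w, c = shapes[-1]
flat_length = h * w * c   # length of the vector after flattening the last feature map
print("Flattened length:", flat_length)
```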

---

#### Final Output

After Block 5, the final feature map has the shape:
- `7 x 7` = spatial dimensions (small version of original image)
- `512` = number of learned filters (pattern detectors)

So for a batch of `n_images`, the total output shape is `(n_images, 7, 7, 512)`.

Flattening each feature map gives a vector of length **7 * 7 * 512 = 25088**, so each image is now represented as a 25,088-dimensional feature vector that encodes high-level visual information learned from ImageNet.


In [None]:
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Load the model without the top classification layers
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Extract features
features = base_model.predict(images)  # Output shape: (n_images, 7, 7, 512)
features_flat = features.reshape(features.shape[0], -1)  # Flatten: shape (n_images, 25088)

# Clustering with KMeans

We cluster images based on their visual features:
- `n_clusters=5` groups images into 5 visual categories
- `.fit_predict()` both fits the model and assigns each image to a cluster

In [None]:
kmeans = KMeans(n_clusters=5, random_state=0)
clusters = kmeans.fit_predict(features_flat)

df_filt15k10["cluster"] = clusters

## t-SNE Visualization

To visualize clusters in 2D, we use **t-SNE** (t-distributed stochastic neighbor embedding), a dimensionality reduction technique that preserves local structure.

We project each image’s 25,088-dimensional feature vector to 2D.

We also set the **perplexity** parameter.

---

### Perplexity

Perplexity controls the balance between local and global structure in the data when projecting it to 2D.

A lower perplexity (e.g. 5–10) focuses more on very local structure, meaning small neighborhoods.

A higher perplexity (e.g. 40–50) preserves broader patterns, meaning more global relationships.

`perplexity=30` is a common default that balances both.

---

Then, we plot each image as a colored point according to its cluster. Optionally, you can overlay image thumbnails in place of points (see advanced versions).

This **SHOULD help** us visually interpret the kinds of images that group together.

In [None]:
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
coords = tsne.fit_transform(features_flat)

df_filt15k10["x"] = coords[:, 0]
df_filt15k10["y"] = coords[:, 1]

import plotly.express as px

df_filt15k10["hover_text"] = df_filt15k10.apply(
    lambda row: f"Title: {row['title']}<br>Cluster: {row['cluster']}", axis=1
)

# Build the interactive scatter plot
fig = px.scatter(
    df_filt15k10,
    x="x", y="y",
    color="cluster",
    hover_name="hover_text",
    hover_data={"x": False, "y": False, "cluster": False},
    title="Artwork Clusters (t-SNE + VGG16 Features)",
    width=950,
    height=750
)

# Clean hover template — show only custom hover text, no extra info
fig.update_traces(hovertemplate="%{hovertext}<extra></extra>")

# Show plot
fig.show()




# Inspect Images by Cluster

We can now examine representative samples from each cluster to understand what visual themes/patterns are emerging.


In [None]:
from IPython.display import display, Image as IPyImage  # alias avoids shadowing PIL's Image

for cluster_id in sorted(df_filt15k10["cluster"].unique()):
    print(f"\n### Cluster {cluster_id} samples:\n")

    # Get the first 3 items from this cluster
    cluster_subset = df_filt15k10[df_filt15k10["cluster"] == cluster_id].head(3)

    for _, row in cluster_subset.iterrows():
        print(f"Title: {row['title']}")
        print(f"Type: {row['type']}")
        display(IPyImage(url=row['image_web'], width=200))
        print("\n" + "*"*50)


## Fine-tuning and image classifiers!

We’ll build and train a deep learning model to **classify artworks based on their image using transfer learning.**

We will use MobileNetV2 as a base. We will prepare the images again according to the new model and fine-tune it to predict the `type`. We will then evaluate the model's performance so that we can compare visual and textual results. (Keep in mind that for text we used 15,000 descriptions, while here we use only about 450 images, because otherwise training would take too long!)



In [None]:

target_col = "type"

# Only keep top 10 frequent labels for balance and feasibility, because we
# do not have a lot of images (compared to the texts we had)
top_labels = df_filt15k10[target_col].value_counts().head(10).index.tolist()
df_classify = df_filt15k10[df_filt15k10[target_col].isin(top_labels)].copy()

print("Classes:", top_labels)
print("Number of images:", len(df_classify))

We’ll load the images with `requests` and PIL, as before, to:

- Load images directly from URLs again
- Resize them to (224, 224) again
- Apply preprocessing compatible with MobileNetV2

In [None]:
import numpy as np
from io import BytesIO  # needed to open the downloaded bytes as an image
from PIL import Image
import requests
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# Load images and labels
images, labels = [], []

for _, row in df_classify.iterrows():
    try:
        response = requests.get(row["image_web"], timeout=5)
        img = Image.open(BytesIO(response.content)).convert("RGB")
        img = img.resize((224, 224))
        arr = np.array(img)
        arr = preprocess_input(arr)  # MobileNetV2 preprocessing
        images.append(arr)
        labels.append(row[target_col])
    except Exception:  # skip failed downloads or unreadable images
        continue

images = np.array(images)
labels = np.array(labels)
print(len(images))

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Encode labels as integers
le = LabelEncoder()
y_encoded = le.fit_transform(labels)
y_cat = to_categorical(y_encoded)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    images, y_cat, test_size=0.2, stratify=y_cat, random_state=42
)

print("Train samples:", len(X_train))
print("Test samples:", len(X_test))
print("Classes:", le.classes_)


In [None]:
# BUILDING THE MODEL!!!

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.models import Model
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.optimizers import Adam

# Load base model
base_model = MobileNetV2(include_top=False, input_shape=(224, 224, 3), weights="imagenet")

# Freeze base layers (initial training)
for layer in base_model.layers:
    layer.trainable = False

# Add classification head
x = GlobalAveragePooling2D()(base_model.output)
output = Dense(y_cat.shape[1], activation="softmax")(x)
model = Model(inputs=base_model.input, outputs=output)

# Compile
model.compile(optimizer=Adam(), loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()


In [None]:
# First stage of training!!! with frozen base

history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))


In [None]:
# Fine tuning! Unfreezing the layers
for layer in base_model.layers:
    layer.trainable = True

# Recompile with a much lower learning rate so fine-tuning does not wipe out the pre-trained weights
model.compile(optimizer=Adam(learning_rate=1e-5), loss="categorical_crossentropy", metrics=["accuracy"])

# Fine-tune
history_fine = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))


In [None]:
# Get the classification report

from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Get model predictions (one-hot to class index)
y_pred = model.predict(X_test)
y_pred_labels = le.inverse_transform(np.argmax(y_pred, axis=1))
y_true_labels = le.inverse_transform(np.argmax(y_test, axis=1))

In [None]:
# Print a standard precision/recall/F1 report
print(classification_report(y_true_labels, y_pred_labels))

In [None]:
# Create confusion matrix
cm = confusion_matrix(y_true_labels, y_pred_labels, labels=le.classes_)
cm_df = pd.DataFrame(cm, index=le.classes_, columns=le.classes_)

# Plot confusion matrix as heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix — Image Classifier")
plt.ylabel("Actual Label")
plt.xlabel("Predicted Label")
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


In [None]:
# Saving a model

model.save("image_classifier_model.h5")
# model.save("image_classifier_model.keras")  # TensorFlow 2.11+

# Saving the encoder

import pickle

with open("label_encoder.pkl", "wb") as f:
    pickle.dump(le, f)

In [None]:
# Reloading a model

from tensorflow.keras.models import load_model

model = load_model("image_classifier_model.h5")

# Reloading the encoder

import pickle

with open("label_encoder.pkl", "rb") as f:
    le = pickle.load(f)

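# Predict on a new image

# Load, preprocess, and predict
# Replace the placeholder with the path to a local image file
# (Image.open cannot read URLs directly; download with requests first)
img = Image.open("path_or_url_to_image").convert("RGB").resize((224, 224))
arr = preprocess_input(np.expand_dims(np.array(img), axis=0))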

pred = model.predict(arr)
predicted_class = le.inverse_transform([np.argmax(pred)])
print("Predicted type:", predicted_class[0])

### Exercise: train two models, one for text and one for images, to classify artworks by the "culture" tag

**Hint**: it may help to clean the culture tag and keep only the first part

In [None]:
print(len(set(df_filt15k["culture"])))

In [None]:
def clean_culture(text):
    text = text.lower()
    text = text.split(",")[0]
    text = re.sub(r"[^\w\s]", "", text) # regex to remove non alphanumeric ch
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)

df_filt15k["clean_culture"] = df_filt15k["culture"].apply(clean_culture)

In [None]:
print(len(set(df_filt15k["clean_culture"])))

In [None]:
# Step 1: take the top 30 cultures and make a dataframe filtered with those
# Step 2: train a text classifier based on the cleaned descriptions to classify
# the cultures
# Step 3: make a classification report on it and a confusion matrix
# Step 4: extract a subset of images from the df (around 500) and fit them
# to a model for image classification
# Step 5: fine tune MobileNetV2 to classify images based on culture
# Step 6: make a classification report on it and a confusion matrix