# SWIN TESTING

This notebook tests the SWIN model from the Hugging Face transformers library.

## Testing and reproducing the feature extractor

In this section we inspect the structure of the feature extractor used by the Hugging Face library and try to reproduce it.

In [2]:
from transformers import AutoFeatureExtractor, SwinForImageClassification
from torchvision import transforms
import torch
from torch import nn
from PIL import Image, ImageMath
import numpy as np
import requests

The following code prints the configuration of the Swin Transformer feature extractor. It first resizes the image to 224x224 (the `resample: 3` entry is PIL's code for bicubic interpolation; bilinear would be 2), then normalizes it with the mean and standard deviation of the ImageNet dataset, and finally converts the result to a tensor. The `feature_extractor_type` field only names the feature extractor class, here `ViTFeatureExtractor`.

In [3]:
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
# vit_feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
# model = SwinForImageClassification.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
print(feature_extractor)

ViTFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "resample": 3,
  "size": 224
}
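
The `resample` entry in the printed config is an integer filter code rather than a name. As a small aid for reading it, the sketch below maps Pillow's integer resampling codes to their names. The dictionary and the helper `describe_resample` are illustrative names, not part of transformers or Pillow.

```python
# Pillow's integer resampling filter codes and their names.
PIL_RESAMPLE_NAMES = {
    0: "NEAREST",
    1: "LANCZOS",
    2: "BILINEAR",
    3: "BICUBIC",
    4: "BOX",
    5: "HAMMING",
}


def describe_resample(code: int) -> str:
    """Return the Pillow filter name for an integer resample code."""
    return PIL_RESAMPLE_NAMES.get(code, f"unknown ({code})")


# Values copied from the config printed above.
config = {"resample": 3, "size": 224}
resample_name = describe_resample(config["resample"])
print(resample_name)  # BICUBIC
```

With a recent Pillow installed, the same lookup is likely available directly through the `Image.Resampling` enum.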



Here we download a test image to see the effects of the feature extractor and our replication.

In [4]:
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

Next we try to replicate the feature extractor by hand. We resize the image with PIL, then use torchvision to convert it to a tensor and to apply the same normalization.

In [5]:
NORMALIZE_MEAN = feature_extractor.image_mean
NORMALIZE_STD = feature_extractor.image_std

PILToTensor = transforms.PILToTensor()
normalize = transforms.Normalize(NORMALIZE_MEAN, NORMALIZE_STD)

# Resize with PIL first; filter code 2 is PIL's BILINEAR
# (the extractor config above reports resample 3).
transform_test_image1 = image.resize((224, 224), 2)
# Convert to a uint8 CxHxW tensor, scale to [0, 1], then normalize per channel.
transform_test_image1 = PILToTensor(transform_test_image1)
transform_test_image1 = transform_test_image1.float() / 255.0
transform_test_image1 = normalize(transform_test_image1)
transform_test_image1

tensor([[[ 0.3138,  0.4337,  0.4679,  ..., -0.3541, -0.3369, -0.3369],
         [ 0.3652,  0.4337,  0.4679,  ..., -0.3541, -0.3541, -0.3883],
         [ 0.3138,  0.3994,  0.4166,  ..., -0.4397, -0.4226, -0.4054],
         ...,
         [ 1.8893,  1.7865,  1.6667,  ...,  1.5982,  1.4783,  1.4098],
         [ 1.8722,  1.8037,  1.7523,  ...,  1.3413,  1.0844,  0.9303],
         [ 1.8550,  1.7180,  1.7180,  ...,  0.2282, -0.0458, -0.3541]],

        [[-1.5980, -1.6155, -1.6155,  ..., -1.7906, -1.7906, -1.8081],
         [-1.5630, -1.5630, -1.5630,  ..., -1.7556, -1.7556, -1.7731],
         [-1.6155, -1.5980, -1.5630,  ..., -1.7906, -1.7906, -1.7906],
         ...,
         [-0.4076, -0.5126, -0.6176,  ..., -0.7577, -0.8277, -0.8803],
         [-0.4076, -0.4601, -0.5651,  ..., -0.8803, -1.0203, -1.0903],
         [-0.4251, -0.5651, -0.5826,  ..., -1.4405, -1.5455, -1.6681]],

        [[-0.7936, -0.6193, -0.6541,  ..., -1.2293, -1.1247, -1.1770],
         [-0.8110, -0.7238, -0.6715,  ..., -1
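
The arithmetic behind these numbers can be checked by hand. The sketch below is plain Python: it scales a single 8-bit value to [0, 1] and applies the per-channel mean and standard deviation from the config. The helper name `normalize_pixel` is hypothetical, and the input values 142 and 143 are chosen only because they reproduce entries seen in the printed red channel.

```python
# Per-channel statistics copied from the feature extractor config.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(value_8bit: int, channel: int) -> float:
    """Scale an 8-bit value to [0, 1], then subtract the mean and divide by the std."""
    scaled = value_8bit / 255.0
    return (scaled - IMAGENET_MEAN[channel]) / IMAGENET_STD[channel]


print(round(normalize_pixel(142, 0), 4))  # 0.3138
print(round(normalize_pixel(143, 0), 4))  # 0.3309
```

Neighbouring 8-bit levels therefore land about 0.017 apart after normalization, which is worth keeping in mind when comparing outputs entry by entry.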

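The second attempt keeps the whole pipeline in torchvision: the image is converted to a tensor first and resized afterwards, with antialiasing enabled.

In [13]: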
PILToTensor = transforms.PILToTensor()
normalize = transforms.Normalize(NORMALIZE_MEAN, NORMALIZE_STD)
resize = transforms.Resize((224, 224), transforms.InterpolationMode.BILINEAR, antialias=True)
to_float = transforms.ConvertImageDtype(torch.float32)

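# The same steps as a single Compose. Note that it is not applied below and
# that it sets antialias=False, unlike the standalone `resize` above.
transform = transforms.Compose([
    transforms.PILToTensor(),
    transforms.Resize((224, 224), transforms.InterpolationMode.BILINEAR, antialias=False),
    transforms.ConvertImageDtype(torch.float32),
    transforms.Normalize(NORMALIZE_MEAN, NORMALIZE_STD)
])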

transform_test_image2 = image
transform_test_image2 = PILToTensor(transform_test_image2)
transform_test_image2 = resize(transform_test_image2)
transform_test_image2 = to_float(transform_test_image2)
transform_test_image2 = normalize(transform_test_image2)
transform_test_image2

tensor([[[ 0.3309,  0.4337,  0.4679,  ..., -0.3541, -0.3369, -0.3369],
         [ 0.3652,  0.4337,  0.4679,  ..., -0.3541, -0.3541, -0.3883],
         [ 0.3138,  0.3994,  0.4166,  ..., -0.4397, -0.4226, -0.4054],
         ...,
         [ 1.8893,  1.7865,  1.6667,  ...,  1.5982,  1.4783,  1.4098],
         [ 1.8722,  1.8037,  1.7352,  ...,  1.3242,  1.0844,  0.9303],
         [ 1.8722,  1.7180,  1.7180,  ...,  0.2282, -0.0629, -0.3541]],

        [[-1.5980, -1.6155, -1.6155,  ..., -1.7906, -1.7906, -1.8081],
         [-1.5455, -1.5805, -1.5630,  ..., -1.7731, -1.7556, -1.7731],
         [-1.6155, -1.5980, -1.5630,  ..., -1.7906, -1.7906, -1.7906],
         ...,
         [-0.4076, -0.5301, -0.6176,  ..., -0.7402, -0.8102, -0.8803],
         [-0.4076, -0.4601, -0.5651,  ..., -0.8803, -1.0203, -1.0903],
         [-0.4251, -0.5651, -0.5826,  ..., -1.4405, -1.5455, -1.6681]],

        [[-0.7936, -0.6193, -0.6541,  ..., -1.2293, -1.1247, -1.1596],
         [-0.8110, -0.7238, -0.6715,  ..., -1

We now run the feature extractor itself on the image. The output is close to our custom preprocessing, but a few values differ slightly (for example 0.4851 versus 0.4679 in the first row).

In [7]:
inputs = feature_extractor(images=image, return_tensors="pt")
inputs["pixel_values"]

tensor([[[[ 0.3138,  0.4337,  0.4851,  ..., -0.3541, -0.3369, -0.3541],
          [ 0.3652,  0.4337,  0.4679,  ..., -0.3541, -0.3541, -0.3883],
          [ 0.3138,  0.3994,  0.4166,  ..., -0.4568, -0.4226, -0.3883],
          ...,
          [ 1.9064,  1.7865,  1.6495,  ...,  1.6153,  1.4954,  1.4440],
          [ 1.8722,  1.8037,  1.7523,  ...,  1.4098,  1.1358,  0.9817],
          [ 1.8722,  1.7180,  1.7352,  ...,  0.1254, -0.1657, -0.4739]],

         [[-1.6155, -1.6155, -1.6155,  ..., -1.7906, -1.7906, -1.8081],
          [-1.5630, -1.5630, -1.5630,  ..., -1.7731, -1.7556, -1.7731],
          [-1.6331, -1.5980, -1.5630,  ..., -1.8081, -1.7906, -1.7906],
          ...,
          [-0.3901, -0.5301, -0.6352,  ..., -0.7402, -0.8102, -0.8627],
          [-0.3901, -0.4426, -0.5651,  ..., -0.8452, -1.0028, -1.0728],
          [-0.4251, -0.5651, -0.5826,  ..., -1.4930, -1.5980, -1.7206]],

         [[-0.7936, -0.6018, -0.6541,  ..., -1.2293, -1.1247, -1.1596],
          [-0.8458, -0.7238, -
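
Comparing printed tensors by eye is error-prone, so a numerical summary may help. The following is a minimal sketch of such a check, assuming two tensors of the same shape. The helper `compare_tensors` and its `atol` default are illustrative choices, and the tensors used here are synthetic stand-ins rather than the real notebook outputs.

```python
import torch


def compare_tensors(a: torch.Tensor, b: torch.Tensor, atol: float = 1e-3) -> dict:
    """Summarize the element-wise absolute difference between two same-shaped tensors."""
    diff = (a - b).abs()
    return {
        "max_abs_diff": diff.max().item(),
        "mean_abs_diff": diff.mean().item(),
        "allclose": torch.allclose(a, b, atol=atol),
    }


# Synthetic stand-ins: identical except for one entry that is off by 0.0175.
reference = torch.zeros(3, 4, 4)
candidate = reference.clone()
candidate[0, 0, 0] = 0.0175

report = compare_tensors(reference, candidate)
print(report)
```

In this notebook the same helper could be called as `compare_tensors(inputs["pixel_values"][0], transform_test_image1)`, since the feature extractor output carries an extra leading batch dimension.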

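To quantify the mismatch, we compare the two custom pipelines element-wise. The mean absolute difference is above the 1e-3 threshold, so the two do not agree exactly.

In [14]: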
embedding_difference = (transform_test_image1 - transform_test_image2).abs()
print(embedding_difference.mean().le(1e-3).item())
print(embedding_difference.argmax())
print(embedding_difference.max())
print(embedding_difference.mean())

False
tensor(68662)
tensor(0.0175)
tensor(0.0023)
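
One possible reading of these numbers, offered as a hedged interpretation rather than a verified cause: the maximum difference of 0.0175 matches the size of a single 8-bit intensity step in the green channel after normalization, and the flat `argmax` index also falls in the green channel. The sketch below works through that arithmetic in plain Python, assuming a 3x224x224 tensor layout. The helper name `level_step` is hypothetical.

```python
# Standard deviations from the feature extractor config, by channel index.
IMAGENET_STD = [0.229, 0.224, 0.225]
HEIGHT = WIDTH = 224


def level_step(std: float) -> float:
    """Size of one 8-bit intensity level after dividing by 255 and by std."""
    return 1.0 / (255.0 * std)


# Unravel the flat argmax index printed above into (channel, row, col).
flat_index = 68662
channel, remainder = divmod(flat_index, HEIGHT * WIDTH)
row, col = divmod(remainder, WIDTH)
print(channel, row, col)  # 1 82 118

step = level_step(IMAGENET_STD[channel])
print(round(step, 4))  # 0.0175
```

If this reading is right, the largest disagreement between the two pipelines is a single intensity level at one pixel, which would be consistent with the resize implementations rounding differently.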


All that's left is to package this into a PyTorch transformation module:

In [56]:
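swin_preprocessor = transforms.Compose([
    transforms.PILToTensor(),
    # InterpolationMode.BILINEAR is the named equivalent of PIL filter code 2.
    transforms.Resize((224, 224), transforms.InterpolationMode.BILINEAR),
    transforms.ConvertImageDtype(torch.float),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])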

swin_preprocessor(image)

tensor([[[ 0.3309,  0.4337,  0.4679,  ..., -0.3712, -0.3198, -0.2171],
         [ 0.4679,  0.4166,  0.4508,  ..., -0.4054, -0.3369, -0.4397],
         [ 0.2624,  0.4166,  0.3823,  ..., -0.4911, -0.5424, -0.5082],
         ...,
         [ 1.8722,  1.7523,  1.5982,  ...,  1.6153,  1.4612,  1.3927],
         [ 1.8379,  1.8208,  1.7523,  ...,  1.3927,  1.1872,  0.9817],
         [ 1.8893,  1.6495,  1.6667,  ..., -0.0629, -0.2171, -0.5767]],

        [[-1.5980, -1.6331, -1.5980,  ..., -1.8256, -1.8081, -1.7731],
         [-1.4405, -1.5805, -1.5455,  ..., -1.7556, -1.7206, -1.6856],
         [-1.6856, -1.5980, -1.5980,  ..., -1.7906, -1.8081, -1.8957],
         ...,
         [-0.4426, -0.6176, -0.6176,  ..., -0.7752, -0.8627, -0.9678],
         [-0.4251, -0.4251, -0.5826,  ..., -0.7927, -0.9153, -1.0203],
         [-0.4251, -0.6176, -0.6352,  ..., -1.5805, -1.6506, -1.8081]],

        [[-0.8110, -0.5321, -0.5844,  ..., -1.2816, -1.1073, -1.1073],
         [-0.6890, -0.8284, -0.7413,  ..., -1

We see that we were not able to completely recreate the embedding that SWIN uses with PyTorch modules alone, so we will use the transformers library to handle the encoding.