# KEY MODIFICATIONS

1. Load the three specific images manually using a file path.
2. Preprocess the images using the `preprocess` transform that `clip.load` returns.
3. Compute the embeddings for these images using the model.
4. Compute the similarity scores against the predefined text features.

### Explanations

1. Specific Image Paths: The `image_paths` list contains the file paths of the three JPEG images you want to test.
2. Image Loading: Each image is loaded using `PIL.Image` and converted to RGB to ensure compatibility with CLIP.
3. Preprocessing: Each image is preprocessed using the `preprocess` function from CLIP to prepare it for the model.
4. Embedding and Similarity Calculation: The image embedding is calculated, normalized, and compared against the text embeddings using the dot product. Because both sides are normalized to unit length, this dot product is the cosine similarity.
5. Save Results: The similarity results are saved in a `final_res/sim_violence_test.torch` file for later use.
6. Error Handling: A try-except block handles potential issues such as missing files or unsupported image formats.
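
To illustrate step 4, here is a minimal sketch of why normalizing first turns the dot product into a cosine similarity. It uses plain Python lists as toy "embeddings"; the helper names `l2_normalize` and `dot` and the three-dimensional vectors are assumptions made for illustration only, not part of the notebook or of CLIP.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1). Assumes a non-zero vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy 3-dimensional vectors (hypothetical); real CLIP embeddings are much larger.
image_vec = l2_normalize([1.0, 2.0, 2.0])
text_same_direction = l2_normalize([2.0, 4.0, 4.0])   # same direction, different length
text_orthogonal = l2_normalize([2.0, -1.0, 0.0])      # perpendicular to image_vec

sim_same = dot(image_vec, text_same_direction)
sim_orthogonal = dot(image_vec, text_orthogonal)

print(f"same direction: {sim_same:.3f}")
print(f"orthogonal:     {sim_orthogonal:.3f}")
```

After normalization the original vector lengths no longer matter: vectors pointing in the same direction score close to 1, and perpendicular vectors score close to 0. The notebook applies the same idea with `image_features @ text_features.T` on tensors.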

### Output
* Similarity Scores: The similarity scores for each image against the predefined text prompts are printed.
* Saved Results: The similarity results are saved to `final_res/sim_violence_test.torch`.

In [2]:
import clip
import torch
from PIL import Image
import os

In [3]:
# Folder containing the test images.
# Alternative: list specific files explicitly, e.g.
# image_paths = ["path/to/image1.jpg", "path/to/image2.jpg", "path/to/image3.jpg"]
folder_path = "/Users/apple/Documents/HIWI_Katharina/Hiwi_Katarina/violence_detection/Katharina_pytorch_Comparison/images_test"

# Collect all image file paths
supported_formats = (".jpg", ".jpeg", ".png", ".bmp", ".tiff")  # Add formats if needed
image_paths = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.lower().endswith(supported_formats)]


In [5]:
# Load CLIP model
model_name = "ViT-L/14"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model, preprocess = clip.load(model_name, device=device)
print("Model loaded successfully!")
model.eval()

Model loaded successfully!


CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    (ln_pre): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): Sequential(
        (0): ResidualAttentionBlock(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
          )
          (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
            (gelu): QuickGELU()
            (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
        (1): ResidualAttentionBlock(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)


In [6]:
# Define safety-related text prompts
text_prompts = [
    "a photo of a gun", "a photo of a shooting", "a photo of a weapon",
    "a photo of a killing", "a photo of a bomb", "a photo of violence",
    "a photo of other", "a photo of an animal", "a photo of a family",
    "a photo of a meeting", "a photo of a tool", "a photo of a gathering"
]
text_tokens = clip.tokenize(text_prompts).to(device)

In [7]:
# Compute text embeddings
with torch.no_grad():
    text_features = model.encode_text(text_tokens).to(torch.float32)
    text_features /= text_features.norm(dim=-1, keepdim=True)


In [8]:
# Process and evaluate the images
final_sim = {}
for image_path in image_paths:
    try:
        # Load and preprocess the image
        image = Image.open(image_path).convert("RGB")
        image_input = preprocess(image).unsqueeze(0).to(device)

        # Compute image embedding
        with torch.no_grad():
            image_features = model.encode_image(image_input).to(torch.float32)
            image_features /= image_features.norm(dim=-1, keepdim=True)

        # Compute similarity with text features
        similarity = image_features @ text_features.T

        # Store the results
        final_sim[image_path] = similarity.cpu().numpy()

    except Exception as e:
        print(f"Error processing {image_path}: {e}")


In [10]:
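# Save the results (create the output folder if it does not exist yet)
os.makedirs("final_res", exist_ok=True)
torch.save(final_sim, os.path.join("final_res", "sim_violence_test.torch"))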

# Print results
for image_path, sim in final_sim.items():
    print(f"Similarity scores for {image_path}: {sim}")

Similarity scores for /Users/apple/Documents/HIWI_Katharina/Hiwi_Katarina/violence_detection/Katharina_pytorch_Comparison/images_test/id_1080638380466204672_2019-01-03.jpg: [[0.13073185 0.15786709 0.1665825  0.17108655 0.1298548  0.16919914
  0.14559239 0.14970405 0.14161852 0.18379524 0.19505645 0.1744351 ]]
Similarity scores for /Users/apple/Documents/HIWI_Katharina/Hiwi_Katarina/violence_detection/Katharina_pytorch_Comparison/images_test/id_1080192181284028416_2019-01-01.jpg: [[0.10444829 0.13140208 0.12875149 0.14021097 0.11310031 0.13451204
  0.15238331 0.12645964 0.12302336 0.1442782  0.14056309 0.15613508]]
Similarity scores for /Users/apple/Documents/HIWI_Katharina/Hiwi_Katarina/violence_detection/Katharina_pytorch_Comparison/images_test/id_1080618686275309569_2019-01-03.jpg: [[0.11240956 0.13696322 0.12213744 0.12752505 0.11690122 0.12085553
  0.14820303 0.12832837 0.1251597  0.10867783 0.13771729 0.12637308]]
Similarity scores for /Users/apple/Documents/HIWI_Katharina/Hiwi_Ka

### Example 

* JPEG: id_1080638380466204672_2019-01-03.jpg

[[0.13073185 0.15786709 0.1665825  0.17108655 0.1298548  0.16919914
  0.14559239 0.14970405 0.14161852 0.18379524 0.19505645 0.1744351 ]]
* Each score represents the similarity between the image and one of the text prompts.
* => The higher the value, the closer the image is to the semantic meaning of that prompt.

* Example: 
    0.19505645 (the highest score in this row) corresponds to "a photo of a tool", which indicates that this text prompt best matches the features of the image.
    ...
* This image probably shows a tool or a scene involving a meeting, since these categories have the highest scores.
It is less likely to match violence-related prompts such as "a photo of a gun" or "a photo of a bomb", since these have lower scores.
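
The reading above can be reproduced with a short sketch that pairs each score with its prompt and sorts the pairs. The prompts are listed in the same order as `text_prompts` in the notebook, and the scores are copied from the printed output for this image. The helper `rank_prompts` and the split into "violence-related" (first six prompts) and "neutral" (last six) are assumptions made for illustration; the notebook itself does not define such a grouping.

```python
# Prompts in the same order as text_prompts in the notebook.
text_prompts = [
    "a photo of a gun", "a photo of a shooting", "a photo of a weapon",
    "a photo of a killing", "a photo of a bomb", "a photo of violence",
    "a photo of other", "a photo of an animal", "a photo of a family",
    "a photo of a meeting", "a photo of a tool", "a photo of a gathering",
]

# Scores copied from the printed output for id_1080638380466204672_2019-01-03.jpg.
scores = [
    0.13073185, 0.15786709, 0.1665825, 0.17108655, 0.1298548, 0.16919914,
    0.14559239, 0.14970405, 0.14161852, 0.18379524, 0.19505645, 0.1744351,
]

def rank_prompts(prompts, values):
    """Return (prompt, score) pairs sorted from highest to lowest score."""
    return sorted(zip(prompts, values), key=lambda pair: pair[1], reverse=True)

ranked = rank_prompts(text_prompts, scores)
for prompt, score in ranked[:3]:
    print(f"{score:.4f}  {prompt}")

# Assumption: the first six prompts count as violence-related, the rest as neutral.
violence_best = max(scores[:6])
neutral_best = max(scores[6:])
print(f"best violence-related score: {violence_best:.4f}")
print(f"best neutral score:          {neutral_best:.4f}")
```

For this image, "a photo of a tool" ranks first, followed by "a photo of a meeting" and "a photo of a gathering". The scores all lie in a narrow band, so the ranking is best treated as a rough indication rather than a firm classification.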