<a href="https://colab.research.google.com/github/AXXionn/ML-Projects/blob/main/Logo_similarity_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I built a logo-similarity pipeline using TensorFlow and Keras for the CNN, pandas and NumPy for data handling, and scikit-learn for its cosine_similarity function.

First, I uploaded the .parquet file directly to Google Colab, since I didn't want to go through Google Drive.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving logos.snappy.parquet to logos.snappy (3).parquet


Next, I imported the modules needed for the project, checked whether a GPU is available (running on the CPU is slower), and created the directory where the logos downloaded for the domains listed in the .parquet file are saved.

In [None]:
import tensorflow as tf
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))
import os
import requests
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import concurrent.futures
if tf.config.list_physical_devices('GPU'):
  print("Running on GPU")
  tf.config.experimental.set_memory_growth(tf.config.list_physical_devices('GPU')[0], True)
else:
  print("No GPU detected. Running on CPU")
# The path to save the logos
logo_dir = "/content/logos"
os.makedirs(logo_dir, exist_ok=True)

parquet_file_name = list(uploaded.keys())[0]

df = pd.read_parquet(parquet_file_name)
domains = df["domain"].tolist()

Num GPUs Available: 0
No GPU detected. Running on CPU


### First step, getting the images from each site and storing them in the target directory

Throughout the project I used ThreadPoolExecutor to run the per-logo operations concurrently, which speeds up each stage of the pipeline.

I also commented each step, since that made the code easier for me, and for anyone reading it, to follow and to see how I approached the problem.

In [None]:
# Fetch one logo from the Clearbit logo endpoint and save it as a PNG
def fetch_logo(domain):
  base_url = f"https://logo.clearbit.com/{domain}"
  try:
    # A timeout keeps one unresponsive host from blocking a worker thread
    response = requests.get(base_url, stream=True, timeout=10)
  except requests.RequestException as e:
    print(f"Failed to fetch logo for {domain}: {e}")
    return
  if response.status_code == 200:
    with open(f"{logo_dir}/{domain}.png", "wb") as file:
      for chunk in response.iter_content(1024):
        file.write(chunk)
    print(f"Saved {domain}.png")
  else:
    print(f"Failed to fetch logo for {domain}")

# Download the logos concurrently with 10 worker threads
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(fetch_logo, domains)

Streaming output truncated to the last 5000 lines.
Saved indianmotorcyclesturgis.com.png
Saved titancontainers.ie.png
Saved lindt.at.png
Failed to fetch logo for wedos.com
Saved maxus-norddeutschland.de.png
Saved toyotayatubodrum.com.tr.png
Saved aamcosandiego-miramar.com.png
Failed to fetch logo for bestwesternbraddock.com
Saved subway-franchise.fr.png
Saved wurth.kz.png
Saved carglass.rs.png
Failed to fetch logo for subaru.nc
Saved decathlonkz.com.png
Saved gs1tn.org.png
Saved entergy-nuclear.com.png
Saved funiber.fr.png
Failed to fetch logo for daad-australia.org
Failed to fetch logo for mcdonalds-mcdelivery.es
Saved starbucks.com.gr.png
Failed to fetch logo for mazda-autohaus-lehmann-senftenberg.de
Saved astrazeneca.com.hk.png
Saved whatsonsaudiarabia.com.png
Saved culliganofsouthwestwisconsin.com.png
Saved toyota.com.mk.png
Saved nhszketpo.hu.png
Saved culliganwisconsin.com.png
Saved brother.com.ar.png
Saved narscosmetics.de.png
Saved toysrus.com.bn.png
Saved linde-g
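
The effect of the thread pool on I/O-bound work such as these downloads can be illustrated with a small, self-contained sketch. It does not contact any server: fake_fetch is a hypothetical stand-in that sleeps to simulate network latency, and the domain names and the 0.05 s delay are made up for illustration.

```python
import concurrent.futures
import time

def fake_fetch(domain):
    # Hypothetical stand-in for a network request: sleep instead of downloading
    time.sleep(0.05)
    return f"{domain}.png"

# Made-up example domains, not taken from the dataset
domains = [f"example{i}.com" for i in range(20)]

# Sequential baseline: one simulated request at a time
start = time.perf_counter()
sequential = [fake_fetch(d) for d in domains]
sequential_time = time.perf_counter() - start

# Thread pool: up to 10 simulated requests in flight at once
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # executor.map yields results in the same order as the input
    threaded = list(executor.map(fake_fetch, domains))
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

Threads help here because the work is mostly waiting on the network; steps that are dominated by computation generally benefit less.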

### Creating the CNN

After downloading the images, I built a simple CNN that takes each 224×224 RGB image and maps it to a 64-dimensional vector, which later serves as the logo's feature representation. The model is compiled but not trained in this notebook, so the features come from its initial weights.

In [None]:
# Make our CNN model
def create_model():
  model = Sequential([
      Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
      MaxPooling2D((2, 2)),
      Conv2D(64, (3, 3), activation='relu'),
      MaxPooling2D((2, 2)),
      Conv2D(128, (3, 3), activation='relu'),
      MaxPooling2D((2, 2)),
      Flatten(),
      Dense(256, activation='relu'),
      Dropout(0.5),
      Dense(128, activation='relu'),
      Dense(64, activation='relu')
  ])
  return model


In [None]:
# Creating and compiling the model
model = create_model()
model.compile(optimizer=Adam(learning_rate=0.001),
              loss=["mse"])




In [None]:
# Print a model summary to check the parameter count and confirm that all parameters are trainable
model.summary()
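
The summary output is not reproduced here, but the layer shapes and parameter counts can be worked out by hand. The sketch below assumes the Keras defaults for the layers in create_model ('valid' padding and stride 1 for Conv2D, stride equal to the pool size for MaxPooling2D); the helper names are made up for illustration.

```python
def conv_out(size, kernel=3):
    # 'valid' padding, stride 1: the output shrinks by kernel - 1
    return size - kernel + 1

def pool_out(size, pool=2):
    # 2x2 max pooling with stride 2 rounds odd sizes down
    return size // pool

def conv_params(in_ch, out_ch, kernel=3):
    # One kernel x kernel x in_ch filter per output channel, plus one bias each
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(in_units, out_units):
    # Weight matrix plus one bias per output unit
    return in_units * out_units + out_units

size, channels = 224, 3
total = 0
for filters in (32, 64, 128):
    total += conv_params(channels, filters)
    size = pool_out(conv_out(size))
    channels = filters
    print(f"after conv({filters}) + pool: {size}x{size}x{channels}")

flat = size * size * channels
units = flat
for out_units in (256, 128, 64):
    total += dense_params(units, out_units)
    units = out_units

print("flattened size:", flat)
print("total parameters:", total)
```

Under these assumptions the flattened vector has 26 × 26 × 128 = 86,528 entries, so the first Dense layer accounts for most of the roughly 22.3 million parameters.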

### After creating the CNN model, we preprocess the images: each one is resized to 224×224, converted to a NumPy array scaled to [0, 1], and given a batch dimension so the model can consume it as a tensor

In [None]:
# Load and preprocess one image, then return its feature vector from the model
def extract_features(img_path):
  img = image.load_img(img_path, target_size=(224, 224))
  img_array = image.img_to_array(img) / 255.0
  img_array = np.expand_dims(img_array, axis=0)
  return model.predict(img_array).flatten()
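
The array handling inside extract_features can be checked in isolation. This is a minimal sketch that uses a synthetic random image in place of a real logo, so it needs neither TensorFlow nor any files on disk.

```python
import numpy as np

# Synthetic 224x224 RGB image with 8-bit pixel values (stand-in for a loaded logo)
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Scale pixel values from [0, 255] to [0, 1]
img_array = img.astype("float32") / 255.0

# Add a leading batch dimension: (224, 224, 3) -> (1, 224, 224, 3)
batch = np.expand_dims(img_array, axis=0)

print(batch.shape, batch.dtype, batch.min() >= 0.0, batch.max() <= 1.0)
```

The leading dimension of size 1 is what lets a single image be passed to a model that expects a batch of images.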


In [None]:
# Logo file paths
logo_files = [f for f in os.listdir(logo_dir) if f.endswith(".png")]

### The next crucial step is extracting a feature vector for each logo so that the logos can be compared; again I used ThreadPoolExecutor to shorten the runtime

In [None]:
# Extract features for all logos

def process_logo(logo):
    path = os.path.join(logo_dir, logo)
    try:
        return logo, extract_features(path)
    except Exception as e:
        print(f"Error processing {logo}: {e}")
        return logo, None

logo_features = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = executor.map(process_logo, logo_files)
    for logo, features in results:
        if features is not None:
            logo_features[logo] = features

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 889ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 899ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 980ms/step




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 864ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1s/step



[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 456ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 376ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 481ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 499ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 580ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 774ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 688ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 984ms/s

### The last, and arguably most important, step is computing the similarities and saving them so we can inspect how well the model finds similar logos

I used cosine similarity rather than Euclidean distance because, for this task, the angle between the feature vectors matters more than their magnitude: cosine similarity depends only on the vectors' direction, so two logos whose features differ only in scale still count as similar.
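
A small worked example makes the difference concrete. The vectors below are made up for illustration and are not features from the model: b points in the same direction as a but is ten times longer, while c points elsewhere but has a similar length.

```python
import numpy as np

def cosine(a, b):
    # Cosine of the angle between a and b
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                      # same direction, ten times the magnitude
c = np.array([3.0, -1.0, 0.5])    # different direction, similar magnitude to a

print("cosine(a, b):   ", cosine(a, b))
print("euclidean(a, b):", np.linalg.norm(a - b))
print("cosine(a, c):   ", cosine(a, c))
print("euclidean(a, c):", np.linalg.norm(a - c))
```

Euclidean distance ranks c as closer to a than b is, purely because of vector length, whereas cosine similarity treats a and b as pointing the same way.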

In [None]:
# Compute similarity and save results
def compute_similarity(target_logo):
    if target_logo not in logo_features:
        return []
    similarities = []
    target_features = logo_features[target_logo]
    for logo, features in logo_features.items():
        if logo != target_logo:
            try:
                similarity = cosine_similarity([target_features], [features])[0][0]
                similarities.append((target_logo, logo, similarity))
            except Exception as e:
                print(f"Error computing similarity between {target_logo} and {logo}: {e}")
    similarities.sort(key=lambda x: x[2], reverse=True)
    return similarities[:5]

output_parquet = "/content/logo_similarity_results.parquet"
all_results = []

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = executor.map(compute_similarity, logo_files)
    for result in results:
        all_results.extend(result)

similarity_df = pd.DataFrame(all_results, columns=["Logo", "Similar Logo", "Similarity Score"])
similarity_df.to_parquet(output_parquet, index=False)
print(f"Similarity results saved to {output_parquet}")


Similarity results saved to /content/logo_similarity_results.parquet
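
As an alternative to comparing one pair at a time, the same top-5 ranking can be sketched with a single matrix product. This is an illustration under stated assumptions rather than the notebook's method: top_k_similar is a hypothetical helper, the file names and feature values are made up, and the full n × n similarity matrix is assumed to fit in memory.

```python
import numpy as np

def top_k_similar(names, features, k=5):
    # Normalise each row to unit length so a dot product equals cosine similarity
    X = np.asarray(features, dtype="float64")
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.clip(norms, 1e-12, None)
    sims = X @ X.T
    # Exclude self-matches by pushing the diagonal to the bottom of every ranking
    np.fill_diagonal(sims, -np.inf)
    rows = []
    for i, name in enumerate(names):
        # Indices of the k highest scores for row i, best first
        best = np.argsort(sims[i])[::-1][:k]
        rows.extend((name, names[j], float(sims[i, j])) for j in best)
    return rows

# Made-up file names and 3-dimensional features, for illustration only
names = ["a.png", "b.png", "c.png", "d.png"]
features = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.2, 1]]
results = top_k_similar(names, features, k=2)
for row in results:
    print(row)
```

The returned tuples have the same (logo, similar logo, score) layout as the rows written to the results file, so they could be passed to a DataFrame in the same way.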


### Lastly, I added a download link for the results, which is a simpler and safer way of retrieving the file than browsing the Colab directory to find it

In [None]:
# Download link
files.download(output_parquet)


### For this project I used Google Colab with a Python runtime, because in PyCharm I could only install TensorFlow 1.8, which lacks some of the features used in this model

* But why use TensorFlow and not PyTorch or any other library?
  - I used TensorFlow because, before this project, I had little experience with Machine Learning or Deep Learning, and TensorFlow was what I mostly used while learning the field.

### But how efficient really is the program?

Running on the GPU runtime, with the threading I added and tuned along the way, the pipeline took about 70 minutes to fetch the logos, preprocess them, extract features, and compute and save the results. The average similarity score was 94% across all entries in the project file, with 14505 matches saved in the .parquet file that I am attaching to the project so that you can check the results as well.

### Threading versus non-threading

Before I switched to ThreadPoolExecutor, the run had passed the 3-hour mark and the model was still computing its outputs. I decided that, for anyone who wants to use the model on their own data, especially a file much larger than the one provided, some adjustments were needed to keep the computing time from exceeding 4 hours.
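
One further adjustment along these lines, offered as a suggestion rather than something the notebook does, is to group images into batches so the model is called once per batch instead of once per image. The sketch below is framework-free: fake_predict is a hypothetical stand-in for model.predict, the 8×8 image size is chosen only to keep the example light, and the batch size of 32 is arbitrary.

```python
import numpy as np

def batches(items, batch_size):
    # Yield consecutive slices of at most batch_size items
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_predict(batch):
    # Hypothetical stand-in for model.predict: one 64-dim vector per image
    return np.zeros((batch.shape[0], 64), dtype="float32")

# 70 synthetic preprocessed images
images = [np.zeros((8, 8, 3), dtype="float32") for _ in range(70)]

calls = 0
features = []
for chunk in batches(images, batch_size=32):
    batch = np.stack(chunk)          # shape: (len(chunk), 8, 8, 3)
    features.append(fake_predict(batch))
    calls += 1

features = np.concatenate(features)
print(calls, features.shape)
```

With 70 images and a batch size of 32, the stand-in model is called 3 times instead of 70.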

### Closing thoughts and documentation

For me this was the first 'big' project I built with Deep Learning, and it opened up a new way of thinking about my future. I really enjoyed manipulating the data, crunching the numbers and, especially, making the model as efficient as possible.

The documentation I used:

* The official TensorFlow documentation, which can be found on their site: https://www.tensorflow.org
* Stack Overflow, for the errors I ran into
* Some articles on Medium about data science and convolutional neural networks

