# Bike Price Prediction

## 1. Problem Formulation


The goal of the project is to create a deep-learning model that predicts the offer price in euros of a bicycle from an image of that bicycle. 
The idea for this problem is inspired by the [CS229 2018 Stanford course](https://www.example.com), where one of the instructors proposes a similar semester project. 
<br>
The web is scraped for images of bicycles and their corresponding prices. 

2. First, the data is explored to identify potential areas for improvement. Issues to address during preprocessing might include: 
- non-bicycle images that need to be filtered out
- bicycles in front of a non-uniform background
- bicycles not shown from the side (more data is needed to cover these views)
- bias between different image sources (cf. the "tank classification problem")
- incorrect prices


3. Then the initial hyperparameters have to be determined:
 - Loss Function:

## 2. Preliminaries

#### Neural Networks

#### Convolutional Neural Networks
Convolutional Neural Networks (CNNs) were introduced by LeCun et al. 
They are mostly used in computer vision to extract information from images, although they have also been applied to other tasks such as time-series prediction.

CNNs implement an operation called convolution. 


This operation is well suited to image data, because the same small filter is applied at every image position and thus exploits local spatial structure with few parameters. 
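
To make the convolution operation concrete, the following minimal sketch slides a 3x3 filter over a tiny grayscale image using plain NumPy. The function name `conv2d_valid`, the toy image and the edge filter are illustrative assumptions and not part of the project code. As in most deep-learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # the same filter weights are reused at every position (i, j)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image (assumption): left half dark, right half bright, i.e. one vertical edge
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Simple vertical-edge filter: responds where brightness increases from left to right
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

response = conv2d_valid(image, kernel)
print(response)  # each row is [0, 3, 3, 0]: the filter fires only around the edge
```

In a CNN the filter weights are not fixed by hand as here but learned during training.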

LeCun, Yann et al. “Gradient-based learning applied to document recognition.” Proc. IEEE 86 (1998): 2278-2324.

In [None]:
## He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

## 3. Data exploration, overview, cleaning

### 3.1 Webscraping and cleaning considerations
50000 images from the web are accumulated. 

#### Scraping
1. High-quality images come from the websites "fahrrad.de" (1500 images) and "www.fahrrad-xxl.de" (5500 images). 
2. The remaining ~7000 images are scraped from the "Shopping" tab of "www.google.com". After trial and error, the following two methods were found to work well with the keywords "bicycle", "citybike", "ebike", "mountainbike", "racingbike", "childrens bike", "trekkingbike", "holland bicycle", "bmx" and their German equivalents. These keywords cover the common bicycle types.

   2.1. First method: Scraping is performed as a breadth-first search, starting with the initial keywords above. The links at depth n+1 are taken from the "further recommendations" of each search at depth n. Already downloaded images are tracked in a CSV file. A depth of n=2 turns out to be sufficient, since searches at depth n>2 yield few new images.
   
   2.2. Second method: The keywords above are combined with bicycle colors (['blue','brown','yellow','orange','turkis','purple','white', 'grey', 'green','red','silver']) and with price ranges (200 to 1500 in steps of 50, 1500 to 3000 in steps of 500, 3000 to 6000 in steps of 3000) to scrape further images.  
3. In both cases, the subsequent pages of each search result are crawled as well. The technologies used for web crawling are [Scrapy](https://scrapy.org/) (for 1.), an efficient scraping library, and [Microsoft Playwright](https://playwright.dev/) (for 2.). Playwright, like [Selenium](https://www.selenium.dev/), is a browser-based test-automation tool that can also be used for scraping. Because it drives a real browser, it gives better access to page content than Scrapy. 

#### Considerations for subsequent exploration and cleaning
1. The data scraped from google.com in particular includes images of other objects or of multiple bicycles. Images of bicycles in front of a non-uniform background or not shown from the side may make it harder for an algorithm to extract the features of a bike, so 50000 images may not be sufficient.
1. The scraped images and prices are likely to carry a regional bias (Germany), and some prices may be incorrect or valid only regionally. The prices also depend on the extraction date (June 2023). As mentioned in the problem formulation, the assigned prices are solely offer prices and may not reflect the prices at which the bicycles are actually sold. 

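Option: Precompute the image embeddings produced by the frozen layers of the CNN. Since frozen layers do not change during training, their outputs only need to be computed once, which saves computational cost and time.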

Exploratory data analysis ideas: compute an "average image" (see eigenfaces or Fisherfaces, https://en.wikipedia.org/wiki/Eigenface) or simply the mean image, optionally per price range; plot the image dimensions on the x and y axes; plot the colors per price range to see whether there are any color biases. 
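
The mean image per price range could be computed roughly as follows. This is a minimal sketch under assumptions: the images are already loaded and resized to a common shape, the function name `mean_image_per_price_bin` and the bin edges are hypothetical, and random arrays stand in for real bicycle images.

```python
import numpy as np

def mean_image_per_price_bin(images, prices, bin_edges):
    """Average all images whose price falls into [low, high) for each price bin.

    images: array of shape (N, H, W, 3); prices: array of shape (N,).
    Returns a dict mapping (low, high) to the mean image of that bin.
    """
    images = np.asarray(images, dtype=np.float64)
    prices = np.asarray(prices, dtype=np.float64)
    means = {}
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (prices >= low) & (prices < high)
        if mask.any():  # skip empty price bins
            means[(low, high)] = images[mask].mean(axis=0)
    return means

# Synthetic stand-in data (assumption): six random 8x8 RGB "images" with prices in euros
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(6, 8, 8, 3))
prices = np.array([150.0, 450.0, 900.0, 1200.0, 2500.0, 5000.0])

means = mean_image_per_price_bin(images, prices, bin_edges=[0, 500, 1500, 6000])
for (low, high), img in means.items():
    print(f"{low}-{high} EUR: mean image of shape {img.shape}")
```

Each mean image can then be shown with an image-plotting function to look for systematic differences between price ranges, for example in background or color.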

Use Weights & Biases to track the data.

### 3.2 Data Exploration and cleaning

In [5]:
import os
from pathlib import Path
from tqdm.auto import tqdm
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
DRIVE_FOLDER = Path('/content/drive/MyDrive/DataExplorationProject/')

In [12]:
unzipped = False

In [13]:
if not unzipped:
  !unzip /content/drive/MyDrive/DataExplorationProject/bicycles_xxl.zip
  !unzip /content/drive/MyDrive/DataExplorationProject/bicycles_de.zip
  !unzip /content/drive/MyDrive/DataExplorationProject/bicycles_goog.zip
  unzipped = True

The last 5000 lines of the streaming output were truncated.
  inflating: content/bicycles_all/640adc6f-d52b-3d0a-8664-55f7384c6969_1599.00.jpg  
  inflating: content/bicycles_all/0d5dae00-08c7-3347-87c3-ac45ad426b94_205.87.jpg  
  inflating: content/bicycles_all/0b11fbe7-7fd1-3177-a761-45ec673de440_736.49.jpg  
  inflating: content/bicycles_all/e4258919-1b2a-39f0-a4a8-054942570cb0_444.00.jpg  
  inflating: content/bicycles_all/8f976931-bd44-3bad-b494-6154f91ab677_2799.00.jpg  
  inflating: content/bicycles_all/germany_004ed149-5218-4bb4-80f7-a9bba61037f10_3899.00.jpg  
  inflating: content/bicycles_all/f3eb4d06-a7a5-3682-acba-ebb04514d658_525.00.jpg  
  inflating: content/bicycles_all/ce0c325d-203b-3bf8-b51f-097d71ea10ef_6029.50.jpg  
  inflating: content/bicycles_all/germany_9248014e-b490-4e68-b170-e85fb98869770_1199.00.jpg  
  inflating: content/bicycles_all/a08f2f6d-23e6-305f-ad8a-ca46d3ca8edd_367.67.jpg  
  inflating: content/bicycles_all/germany_84427a65-a02d-
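
The file names in the output above appear to encode the offer price after the last underscore (e.g. `..._1599.00.jpg`). Assuming this convention holds for all files, the price label can be recovered as sketched below; the helper name `price_from_filename` is hypothetical.

```python
from pathlib import Path

def price_from_filename(path):
    """Return the price that follows the last underscore of the file name."""
    stem = Path(path).stem               # file name without the ".jpg" extension
    return float(stem.rsplit("_", 1)[1])  # text after the last underscore

# File names taken from the unzip output above
print(price_from_filename("content/bicycles_all/640adc6f-d52b-3d0a-8664-55f7384c6969_1599.00.jpg"))  # 1599.0
print(price_from_filename("content/bicycles_all/germany_004ed149-5218-4bb4-80f7-a9bba61037f10_3899.00.jpg"))  # 3899.0
```

Files that do not follow the pattern raise an `IndexError` or a `ValueError`, which can be caught to flag incorrectly labelled images during cleaning.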

#### 3.2.1 Removing duplicate images
The web-scraping process does not ensure that only unique images are downloaded.
1. Compute image embeddings with a standard 50-layer ResNet (ResNet50) pretrained on the ImageNet dataset. Only the base model without the prediction head is used; "model.summary()" below shows its architecture. The output of the last average-pooling layer serves as the embedding, so each image is represented by a 2048-dimensional vector.



In [14]:
model.summary()

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_4 (InputLayer)           [(None, 224, 224, 3  0           []                               
                                )]                                                                
                                                                                                  
 conv1_pad (ZeroPadding2D)      (None, 230, 230, 3)  0           ['input_4[0][0]']                
                                                                                                  
 conv1_conv (Conv2D)            (None, 112, 112, 64  9472        ['conv1_pad[0][0]']              
                                )                                                                 
                                                                                            

In [10]:
import os
import numpy as np

from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model
from sklearn.metrics.pairwise import cosine_similarity


# Load the ResNet50 model pretrained on the 1000-class ImageNet dataset
base_model = ResNet50(weights='imagenet')
model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)

# preprocess imgs
def preprocess_image(img_path):
  img = image.load_img(img_path, target_size=(224, 224))
  img_data = image.img_to_array(img)
  img_data = np.expand_dims(img_data, axis=0)
  img_data = preprocess_input(img_data)
  return img_data

# compute embeddings
def compute_embeddings(folder_path, batch_size=64):
  dataset_name = folder_path.split('/')[-1]
  p = DRIVE_FOLDER/(dataset_name+'_resnet50embeddings.npz')
  
  if os.path.exists(p): # load precomputed
    print(f"load already precomputed embeddings in file {p}")
    return np.load(p)

  image_list = os.listdir(folder_path)
  num_images = len(image_list)
  num_batches = int(np.ceil(num_images / batch_size))

  image_embeddings = {}

  for i in tqdm(range(num_batches),total=num_batches):
    batch_images = []
    start = i * batch_size
    end = min((i + 1) * batch_size, num_images)

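    batch_paths = []
    for j in range(start, end):
      img_path = os.path.join(folder_path, image_list[j])
      try:
        img_data = preprocess_image(img_path)
      except Exception as e:
        # skip unreadable files so that embeddings stay aligned with their paths
        print(e, image_list[j])
        continue
      batch_images.append(img_data)
      batch_paths.append(img_path)

    if not batch_images:  # every image of this batch failed to load
      continue

    batch_data = np.vstack(batch_images)
    batch_embeddings = model.predict(batch_data)

    for path, embedding in zip(batch_paths, batch_embeddings):
      image_embeddings[path] = embedding

  # cache the result; the loading branch above expects this file
  np.savez(p, **image_embeddings)
  return image_embeddings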

# Compute embeddings for images in all folders
folder1_embeddings = compute_embeddings('/content/bicycles_goog')
folder2_embeddings = compute_embeddings('/content/bicycles_de')
folder3_embeddings = compute_embeddings('/content/bicycles_xxl')

/content/drive/MyDrive/DataExplorationProject/bicycles_goog_resnet50embeddings.npz
/content/drive/MyDrive/DataExplorationProject/bicycles_de_resnet50embeddings.npz
/content/drive/MyDrive/DataExplorationProject/bicycles_xxl_resnet50embeddings.npz


In [17]:
list(folder3_embeddings.keys())[0]

'/content/bicycles_xxl/73096c95-4873-4703-934e-a39771bfa089_5999.jpg'
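
With the embeddings available, near-duplicate images can be flagged by comparing them pairwise with cosine similarity. The following is a minimal sketch under assumptions: the function name `find_near_duplicates` and the threshold of 0.98 are hypothetical and would have to be tuned on real data, and three tiny toy vectors stand in for the 2048-dimensional embeddings. The similarity is computed with NumPy here; `cosine_similarity` from scikit-learn, which is imported above, yields the same matrix.

```python
import numpy as np

def find_near_duplicates(embeddings, threshold=0.98):
    """Return (path_a, path_b, similarity) for all pairs at or above `threshold`.

    embeddings: mapping from image path to a 1-D embedding vector.
    """
    paths = list(embeddings.keys())
    vectors = np.stack([np.asarray(embeddings[p], dtype=np.float64) for p in paths])

    # normalise to unit length so that the dot product equals the cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T

    duplicates = []
    # only look at the upper triangle so that each pair is reported once
    for i, j in zip(*np.triu_indices(len(paths), k=1)):
        if similarity[i, j] >= threshold:
            duplicates.append((paths[i], paths[j], float(similarity[i, j])))
    return duplicates

# Toy embeddings (assumption): "a.jpg" and "b.jpg" are almost identical
toy_embeddings = {
    "a.jpg": [1.0, 0.0, 0.0],
    "b.jpg": [0.999, 0.01, 0.0],
    "c.jpg": [0.0, 1.0, 0.0],
}
pairs = find_near_duplicates(toy_embeddings)
print(pairs)  # only the pair ("a.jpg", "b.jpg") is reported
```

The full similarity matrix grows quadratically with the number of images, so for the complete dataset it may be necessary to compute it in chunks.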

## 3. Hyperparameter selection (And reasons for selection): Loss Function, Image Size, Model, Finetuning approach (transfer learning)

## 4. Hyperparameter selection: Optimizer, Learning Rate, Batch Size, Regression Layer depth, (Epochs)

- Use Bayesian optimization, see https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f
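
The idea of Bayesian optimization can be illustrated with a small self-contained sketch. It is an assumption-laden toy, not the tuning code of this project: a Gaussian-process surrogate with an expected-improvement criterion is only one common variant and not necessarily the one described in the linked article, all function names are hypothetical, and `toy_validation_loss` stands in for training the model and returning its validation loss. In practice a dedicated library such as Optuna or scikit-optimize would normally be used.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean Gaussian process."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)
    mean = solved.T @ y_train
    var = 1.0 - np.sum(k_tq * solved, axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, best):
    """Expected improvement over `best` when minimising."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf

def toy_validation_loss(log10_lr):
    """Hypothetical objective with its minimum at a learning rate of 1e-3."""
    return (log10_lr + 3.0) ** 2

def bayes_opt(objective, low, high, n_init=3, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, size=n_init)      # random initial trials
    y = np.array([objective(v) for v in x])
    grid = np.linspace(low, high, 200)           # candidate hyperparameter values
    for _ in range(n_iter):
        scale = y.std() if y.std() > 0 else 1.0
        y_norm = (y - y.mean()) / scale          # match the zero-mean, unit-variance prior
        mean, std = gp_posterior(x, y_norm, grid)
        ei = expected_improvement(mean, std, y_norm.min())
        x_next = grid[np.argmax(ei)]             # most promising candidate
        x = np.append(x, x_next)
        y = np.append(y, objective(x_next))
    best = np.argmin(y)
    return x[best], y[best]

best_log10_lr, best_loss = bayes_opt(toy_validation_loss, low=-5.0, high=-1.0)
print(f"best learning rate ~ 10^{best_log10_lr:.2f}, loss {best_loss:.4f}")
```

Each iteration fits the surrogate to all trials so far and evaluates the real objective only at the candidate with the highest expected improvement, which is why this approach needs fewer training runs than grid or random search.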

## 5. Layer Activation Visualization

==One problem was that the visualization did not work: I used a Keras model as the base model with my own regression head on top and therefore could not access the layers of the base model.==

Possibly add an embedding visualization with t-SNE / UMAP, e.g. to show that bicycles of the same class lie close to each other (plotting the actual images in a 2D plot). This would suggest that the pretrained model without the regression head could be used for classification.

## 6. Conclusion