# Bike Price Prediction

## 1. Problem Formulation


The goal of the project is to create a deep-learning model that predicts the offer price in euros of a bicycle from an image of that bicycle. 
The idea is inspired by the [CS229 2018 Stanford course](https://www.example.com), where one of the instructors proposes a similar semester project. 
<br>
1. The web is scraped for images of bicycles and their corresponding prices. 

2. The data is then explored to reveal potential areas for improvement. Issues to address during preprocessing may include: 
- Non-bicycle images that have to be filtered out
- Bicycles in front of a non-uniform background
- Bicycles not shown from the side (more data needed)
- Bias between different image sources (cf. the "tank classification problem")
- Incorrect prices


3. Then the initial hyperparameters have to be determined:
 - Loss Function:

## 2. Data exploration, overview, cleaning

### 2.1 Web scraping and cleaning considerations
In total, 50000 images are collected from the web. 

#### Scraping
1. High-quality images come from the websites "fahrrad.de" (1500 images) and "www.fahrrad-xxl.de" (5500 images). 
2. The remaining ~7000 images are scraped from the "shopping" tab of "www.google.com". After trial and error, the following two methods were found to work well with the keywords "bicycle", "citybike", "ebike", "mountainbike", "racingbike", "childrens bike", "trekkingbike", "holland bicycle", "bmx" and their German equivalents. These keywords cover the common bicycle types.

   2.1. First method: Scraping is executed as a breadth-first search, starting from the initial keywords above. The links at depth n+1 are taken from the "further recommendations" of each search at depth n. Already downloaded images are tracked in a CSV file. A depth of n=2 was found to be sufficient, since searches at depth n>2 yield only few new images.
   
   2.2. Second method: The keywords above are combined with bicycle colors (['blue','brown','yellow','orange','turkis','purple','white', 'grey', 'green','red','silver']) and with price ranges (200 to 1500 in steps of 50, 1500 to 3000 in steps of 500, 3000 to 6000 in steps of 3000) to scrape further images.  
3. In both cases, the subsequent pages of each search result are crawled as well. The technologies used for web crawling are [Scrapy](https://scrapy.org/) (for 1.), an efficient scraping library, and [microsoft-playwright](https://playwright.dev/) (for 2.), a browser-based test-automation tool similar to [Selenium](https://www.selenium.dev/) that can also be used for scraping. Because Playwright drives a real browser, it can reach pages that are harder to access with Scrapy. 
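
The two search strategies described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual crawler: `fetch` is a hypothetical callable standing in for the Scrapy/Playwright request, the query string format is invented for illustration, and counting the seed keywords as depth 0 is an assumption.

```python
import csv
import itertools
from collections import deque
from pathlib import Path


def load_seen(csv_path):
    """Read the already downloaded image URLs from the tracking CSV file."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="") as handle:
        return {row[0] for row in csv.reader(handle) if row}


def append_seen(csv_path, urls):
    """Append newly downloaded image URLs to the tracking CSV file."""
    with Path(csv_path).open("a", newline="") as handle:
        csv.writer(handle).writerows([url] for url in urls)


def crawl_breadth_first(seeds, fetch, csv_path, max_depth=2):
    """Breadth-first crawl over search queries (first method).

    `fetch(query)` is a hypothetical callable returning a tuple
    (image_urls, recommended_queries) for one search.
    Assumption: the seed keywords count as depth 0.
    """
    seen_images = load_seen(csv_path)
    visited = set(seeds)
    queue = deque((query, 0) for query in seeds)
    new_images = []
    while queue:
        query, depth = queue.popleft()
        image_urls, recommendations = fetch(query)
        for url in image_urls:
            if url not in seen_images:
                seen_images.add(url)
                new_images.append(url)
        if depth < max_depth:
            for rec in recommendations:
                if rec not in visited:
                    visited.add(rec)
                    queue.append((rec, depth + 1))
    append_seen(csv_path, new_images)
    return new_images


def price_bounds():
    """Price boundaries in euros, following the step sizes given above."""
    return [*range(200, 1500, 50), *range(1500, 3000, 500), *range(3000, 6001, 3000)]


def build_queries(keywords, colors, bounds):
    """Combine keywords with colors and with price ranges (second method).

    Assumption: colors and price ranges are combined with the keywords
    separately; the query format is made up for this sketch.
    """
    queries = [f"{color} {kw}" for kw, color in itertools.product(keywords, colors)]
    ranges = list(zip(bounds, bounds[1:]))
    for kw, (low, high) in itertools.product(keywords, ranges):
        queries.append(f"{kw} {low}-{high} euro")
    return queries
```

Calling `crawl_breadth_first` a second time with the same CSV file returns only images that were not recorded before, which mirrors the tracking of downloaded images described in 2.1.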

#### Considerations for subsequent exploration and cleaning
1. Especially the data scraped from google.com includes images of other objects or of multiple bicycles. Including images of bicycles with a non-uniform background or not shown from the side may make it harder for an algorithm to extract the features of a bike, so 50000 images may not be sufficient.
2. The scraped bicycle images and their prices are likely to carry a regional bias (Germany), including incorrect or purely regional pricing. The prices also depend on the extraction date (June 2023). As mentioned in the problem formulation, the assigned prices are solely offer prices and may not reflect the prices at which the bicycles are actually sold. 

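Option: Precompute the image embeddings produced by the frozen layers of the CNN. This saves computational cost and time during training, since the frozen layers do not have to be re-evaluated in every epoch.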
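
A minimal sketch of this caching idea, using only the standard library: `embed_fn` is a hypothetical callable that wraps the frozen CNN layers and maps an image path to its embedding vector, and the cache file format (a pickled dictionary) is an assumption chosen for brevity. The sketch only applies if the images are not randomly augmented before passing through the frozen layers.

```python
import pickle
from pathlib import Path


def precompute_embeddings(image_paths, embed_fn, cache_file):
    """Return one embedding per image, computing each embedding only once.

    `embed_fn` is a hypothetical stand-in for the frozen part of the CNN.
    Embeddings are cached on disk, so later runs (and every training epoch
    of the regression head) can reuse them instead of re-running the CNN.
    """
    cache_file = Path(cache_file)
    cache = pickle.loads(cache_file.read_bytes()) if cache_file.exists() else {}

    # Only images without a cached embedding are passed through the CNN.
    missing = [path for path in image_paths if str(path) not in cache]
    for path in missing:
        cache[str(path)] = embed_fn(path)

    if missing:
        cache_file.write_bytes(pickle.dumps(cache))
    return [cache[str(path)] for path in image_paths]
```

The regression head can then be trained directly on the returned vectors, so the frozen layers are evaluated once per image rather than once per image and epoch.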

## 3. Hyperparameter selection (And reasons for selection): Loss Function, Image Size, Model, Finetuning approach (transfer learning)

## 4. Hyperparameter selection: Optimizer, Learning Rate, Batch Size, Regression Layer depth, (Epochs)

- Use Bayesian optimization, see https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

## 5. Layer Activation Visualization

==One problem was that the visualization did not work: I used a Keras model as the base model with my own regression head on top, and therefore could not access the layers of the base model.==

Maybe some kind of embedding visualization with t-SNE / UMAP, e.g. to show that classes lie close to each other (plot the actual images in a 2D plot) -> the pretrained model without the regression head could be used for classification

## 6. Conclusion