# iPhone Project

*by Alexander Marinskiy*

## Part 1. Data Collection

For data collection, I decided to use avito.ru. At the time of writing, over a million ads are published in the "phones" category, and each ad can contain up to 10 photos. The amount of data available is therefore larger than we can realistically process. Moreover, these are photos taken by the users themselves, which matches the kind of data on which the model will be tested.

### Step 1. Build functions to collect images

In [None]:
# import libraries
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup
import os
import urllib.request 

In [None]:
# function to get web page from url
def get_html(url):
    
    # set user agent
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'

    # get web page from url
    r = requests.get(url, headers={'User-Agent': user_agent})
  
    # return text of web page
    return r.text

In [None]:
# function to get links to all the product images on the page 
def get_links_from_page(html):
    
    # create soup
    soup = BeautifulSoup(html, 'lxml')

    # get links to all the images
    images = [x['src'] for x in soup.find_all('img', {'class': 'large-picture-img'})]
    
    # keep only the images we need: drop links that start with 'https://w'
    product_photo = [i for i in images if not i.startswith('https://w')]

    # return list with links to images
    return product_photo
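
To make the filtering step concrete, here is a minimal sketch of the same prefix test applied to a hand-written list. The URLs and the helper name `keep_product_photos` are hypothetical and are not taken from Avito; only the `'https://w'` prefix comes from the function above.

```python
# Hypothetical helper that isolates the prefix filter used in get_links_from_page.
def keep_product_photos(links):
    # drop every link that starts with 'https://w'
    return [link for link in links if not link.startswith('https://w')]


# made-up sample links, for illustration only
sample_links = [
    'https://www.example.com/static/placeholder.png',
    'https://img.example.com/photos/1.jpg',
    'https://img.example.com/photos/2.jpg',
]

print(keep_product_photos(sample_links))
```

Only the two `img.example.com` links survive, because the first link starts with `https://w`.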

In [None]:
# function to get list of links to pictures of iphone
def get_all_links(name, base_url, n_pages=10, query=''):
    
    # Construct the URL address.
    # iPhones: https://www.avito.ru/rossiya/telefony/iphone?p=1&q=iphone+x
    # Other: https://www.avito.ru/rossiya/telefony/alcatel?p=1
    
    # if downloading iphones we specify model
    if query != '':
        query = '&q=iphone+' + query
    
    # get links to images from all the pages
    links_list = []
    for i in range(1, n_pages+1):
        url_gen = base_url + 'p=' + str(i) + query
        print(url_gen)
        page_html = get_html(url_gen)
        links_list += get_links_from_page(page_html)
        
        # wait for 5 seconds to avoid being blocked by avito
        time.sleep(5)

    # save list of links to csv
    df = pd.DataFrame()
    df['links'] = links_list
    df.to_csv(name+'.csv')
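
The URL construction can be checked without sending any requests. The sketch below copies the string logic of `get_all_links` into a separate helper; the name `build_page_urls` is an assumption made for illustration and is not part of the original notebook.

```python
# Hypothetical helper mirroring the URL construction inside get_all_links.
def build_page_urls(base_url, n_pages=10, query=''):
    # for iPhones the model name is appended as a search query
    if query != '':
        query = '&q=iphone+' + query
    # pages are numbered from 1 to n_pages inclusive
    return [base_url + 'p=' + str(i) + query for i in range(1, n_pages + 1)]


urls = build_page_urls('https://www.avito.ru/rossiya/telefony/iphone?',
                       n_pages=2, query='XR')
for url in urls:
    print(url)
```

Printing the URLs first is a cheap way to confirm the page range and query string before starting a crawl that sleeps 5 seconds per page.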

In [None]:
# get images
def get_images(subfolder, models):
    for model in models:
        # create folder if it does not exist yet
        os.makedirs('dataset/' + subfolder + '/' + model, exist_ok=True)

        # read list of links
        df = pd.read_csv(model+'.csv')

        # print info
        print('Downloading ' + model + '. Total number of photos: ' + str(len(df['links'])))

        # getting photos
        count = 0
        for i in df['links']:         
            count+= 1
            
            # print a progress message every 100 photos
            if count % 100 == 0: 
                print(str(count) + ' done')

            # get the image
            try:
                urllib.request.urlretrieve(i, 'dataset/' + subfolder + '/' + model + '/' + str(count) + '.jpg')
            except Exception as e:
                print('Skip photo ' + str(count) + ' due to error: ' + str(e))

            # wait for 0.5 seconds to avoid a ban
            time.sleep(0.5)

### Step 2. Collect images of iPhones

Since we need to learn to recognize all the iPhone models on the market, we will download photos of each model in equal proportions.

In [None]:
# list of models
iphone_models = ['XR', 'XS', 'X', '8', '7', 'SE', '6S', '6', '5S', '5C', '5', '4S', '4', '3GS', '3G']

In [None]:
# create lists of links for all the models
for model in iphone_models:
    print('Getting links for model', model)
    get_all_links(name=model, base_url='https://www.avito.ru/rossiya/telefony/iphone?', n_pages=12, query=model)

In [None]:
get_images('iphone', iphone_models)

### Step 3. Collect images of non-iPhones

Since there are only 15 iPhone models, it was reasonable to download the same number of photos for each of them. There are many more non-iPhone models, so a different strategy was applied: I looked at the number of ads for each manufacturer on Avito and downloaded photos in the corresponding proportions. This way the dataset reflects as closely as possible the conditions in which the model will be tested. The proportions are saved in the non-iphones.xlsx file.
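
The proportional strategy can be sketched as follows. The brand names and ad counts are invented placeholders, not the figures stored in non-iphones.xlsx, and `allocate_pages` is a hypothetical helper. The "at least one page per brand" rule is also an assumption of this sketch.

```python
# Hypothetical helper: split a page budget across brands in proportion to ad counts.
def allocate_pages(ad_counts, total_pages):
    total_ads = sum(ad_counts.values())
    # assumption: give every brand at least one page so rare brands still appear
    return {brand: max(1, round(total_pages * n / total_ads))
            for brand, n in ad_counts.items()}


# placeholder ad counts, for illustration only
ad_counts = {'brand_a': 60000, 'brand_b': 30000, 'brand_c': 10000}
print(allocate_pages(ad_counts, total_pages=100))
```

Because of rounding and the one-page minimum, the allocated pages may not sum exactly to `total_pages` for less convenient counts.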

In [None]:
# check what number of avito pages we need to download
df_non_iphones = pd.read_excel('non-iphones.xlsx')
df_non_iphones['n_pages'] = df_non_iphones['n_pages'].apply(int)
df_non_iphones['brand'] = df_non_iphones['brand'].apply(str)
df_non_iphones

In [None]:
# create lists of links for all the brands
for i in range(len(df_non_iphones['brand'])):
    print('Getting links for brand', df_non_iphones['brand'][i])
    get_all_links(name=df_non_iphones['brand'][i],
                  base_url='https://www.avito.ru/rossiya/telefony/' + df_non_iphones['brand'][i] + '?', 
                  n_pages=df_non_iphones['n_pages'][i])

In [None]:
# brand names match the CSV files saved in the previous cell
get_images('other', df_non_iphones['brand'].tolist())

Thus, I managed to collect a balanced dataset of 50,000 images of iPhones and phones from other manufacturers. This dataset was then merged with the dataset collected by Anton Anisimov, and the model was trained on the combined dataset.

It is important to note that more than 1,000,000 phone ads are available on Avito, and each ad typically contains 2-4 photos. Thus, our dataset could easily be increased by a factor of over sixty, which could significantly improve the accuracy of the model but would also require more computational resources and training time.
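
The scale-up estimate follows from the figures quoted in the text. The sketch below is a back-of-the-envelope check only: it uses 1,000,000 as a lower bound on the number of ads and compares against the 50,000 images collected here, not against the combined dataset, whose size is not stated.

```python
# figures quoted in the text
n_ads = 1_000_000          # lower bound on the number of phone ads
photos_per_ad = (2, 4)     # range of photos per ad
current_size = 50_000      # images collected in this notebook

low = n_ads * photos_per_ad[0] / current_size
high = n_ads * photos_per_ad[1] / current_size
mid = n_ads * sum(photos_per_ad) / 2 / current_size

print(low, mid, high)
```

At the midpoint of 3 photos per ad the factor is 60, and since the number of ads exceeds 1,000,000 the true factor at that midpoint is somewhat higher.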