# FINAL PROJECT - SECOND-HAND VEHICLE PRICE ESTIMATOR
___
#### MASTER IN DATA SCIENCE - KSCHOOL - 2016/17
#### KOLDO PINA ORTIZ
____

## Motivation

The aim of this project is to build a good estimator of the price of a second-hand vehicle, based on the prices observed in the second-hand market.

## Action Plan

To reach our goal, we will follow these steps:

1. ***Scrape*** the website **motos.net** to obtain the ***data***.
2. ***Clean*** the ***data***.
3. ***Merge*** some ***data***.
4. ***Train*** various models.
5. Compare the metrics and choose the best-performing model.
6. Create a Flask web server for the app.

## Scrape

We are going to scrape each website separately.
To this end, we have developed two Python scripts called ***scraper_motos.py*** and ***scraper_coches.py***.
Both return a dataframe.

You can import them into the notebook.

In [5]:
#Call the scraper_motos function to scrape motos.net and create a csv file with the raw data
#from scraper_motos import scraper_motos
#scraper_motos().to_csv('motos_raw_data.csv', index = False, header = True, encoding = 'utf-8')

In [None]:
#Start time: 2017-05-14 20:41:19.733669
#num_ads 26161
#End time: 2017-05-14 23:58:13.605710

In [2]:
#from scraper_coches import Scraper_coches
#Scraper_coches().to_csv('coches_raw_data.csv', index = False, header = True, encoding = 'utf-8')

In [None]:
#Start time: 2017-05-03 19:44:43.260000s
#num_pages 5365
#End time: 2017-05-04 12:12:29.699000
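
As a quick sanity check on the scraping cost, the logged start and end times above can be turned into durations. This is only a sketch: it assumes the commented timestamps are the scrapers' own log lines, it drops the stray trailing `s` from the first `coches` timestamp, and the helper name `elapsed_seconds` is ours.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Return the number of seconds between two timestamp strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps and counts copied from the log comments above
motos_seconds = elapsed_seconds("2017-05-14 20:41:19.733669",
                                "2017-05-14 23:58:13.605710")
coches_seconds = elapsed_seconds("2017-05-03 19:44:43.260000",
                                 "2017-05-04 12:12:29.699000")

print("motos:  %.0f s in total, %.2f s per ad" % (motos_seconds, motos_seconds / 26161))
print("coches: %.0f s in total, %.1f s per page" % (coches_seconds, coches_seconds / 5365))
```

The motorcycle run took a little over three and a quarter hours, while the car run took roughly sixteen and a half hours, which is worth knowing before re-running either scraper.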

In [105]:
import pandas as pd
df_motos_raw = pd.read_csv('motos_raw_data.csv')
df_motos_raw.shape

(26081, 10)

In [106]:
df_motos_raw.head()

Unnamed: 0,city,brand,model,type,cc,color,km,year,price,url
0,Granada,TRIUMPH,SPEED TRIPLE,Naked,1050.0,Amarillo,32000.0,2006,4500,http://motos.net/ocasion/triumph/speed_triple/...
1,Vizcaya,BMW,R 1200 GS 98cv,Trail,1170.0,Blanco,74200.0,2007,8500,http://motos.net/ocasion/bmw/r_1200_gs_98cv/20...
2,Asturias,BMW,R 1200 GS,Trail,1170.0,Rojo,79000.0,2007,7500,http://motos.net/ocasion/bmw/r_1200_gs/2007-en...
3,Sevilla,HARLEY DAVIDSON,Sportster 883 XLH 53,Custom,883.0,Negro,22000.0,2003,3500,http://motos.net/ocasion/harley_davidson/sport...
4,Valencia,KYMCO,Super Dink 125i,Scooter 125cc,125.0,,33000.0,2012,1950,http://motos.net/ocasion/kymco/super_dink_125i...


## Data cleaning

In [107]:
# Cast every column to text and convert it to lower case
df_motos_raw = df_motos_raw.apply(lambda x: x.astype(str).str.lower())

In [108]:
#Join the words in the 'model' and 'type' fields with an underscore
df_motos_raw['model'] = df_motos_raw['model'].str.replace(' ', '_')
df_motos_raw['type'] = df_motos_raw['type'].str.replace(' ', '_')

In [109]:
# Look for duplicates
df_motos_raw['is_duplicated'] = df_motos_raw.duplicated()
duplicates = df_motos_raw['is_duplicated'].sum()
print('%d duplicates' % duplicates)

61 duplicates


In [110]:
# Remove the duplicates and drop the helper 'is_duplicated' column
df_motos_raw = df_motos_raw.loc[~df_motos_raw['is_duplicated']]
df_motos_raw = df_motos_raw.drop('is_duplicated', axis=1)
df_motos_raw.shape

(26020, 10)
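
The two cells above flag the duplicated rows and then filter them out. A shorter route to the same result is sketched here on a toy dataframe (the rows are made up for illustration and are not taken from the scraped data); `reset_index` is optional and is not something the notebook does.

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "brand": ["bmw", "kymco", "bmw"],
    "model": ["r_1200_gs", "super_dink_125i", "r_1200_gs"],
    "price": ["7500", "1950", "7500"],
})

# Count full-row duplicates (the first occurrence is not counted)
n_duplicates = int(toy.duplicated().sum())

# Drop them in one call and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("%d duplicates" % n_duplicates)
print(deduplicated.shape)
```

Both approaches keep the first occurrence of each repeated row, so the resulting shape should be the same.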

In [111]:
# Let's count, column by column, the NaNs in the dataframe
# (the values are text at this point, so a missing value shows up as the string 'nan')
for column in df_motos_raw.columns:
    n_nan = (df_motos_raw[column] == 'nan').sum()
    print('%s %d -- > %f' % (column, n_nan, n_nan * 1.0 / df_motos_raw.shape[0] * 100))

city 0 -- > 0.000000
brand 0 -- > 0.000000
model 0 -- > 0.000000
type 0 -- > 0.000000
cc 97 -- > 0.372790
color 7094 -- > 27.263643
km 1624 -- > 6.241353
year 0 -- > 0.000000
price 0 -- > 0.000000
url 0 -- > 0.000000
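
The count above compares against the text `'nan'` because the dataframe had already been cast to strings. An alternative, sketched here on a toy dataframe with made-up values, is to summarize the missing values with `isnull()` before any cast, which does not depend on how missing values are rendered as text.

```python
import pandas as pd

# Toy data with real missing values (None), before any cast to text
toy = pd.DataFrame({
    "color": ["Rojo", None, "Negro", None],
    "km": [79000.0, None, 22000.0, 8.0],
})

is_missing = toy.isnull()
summary = pd.DataFrame({
    "n_missing": is_missing.sum(),           # absolute count per column
    "pct_missing": is_missing.mean() * 100,  # share of rows, in percent
})

print(summary)
```

The `summary` frame carries the same two figures per column as the printed lines above: the count and the percentage.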


In [112]:
# Let's count the unique values in the categorical columns
for column in ['city', 'brand', 'model', 'type', 'color', 'year']:
    n_unique = len(df_motos_raw[column].unique())
    print('%s --> %d unique values' % (column, n_unique))

city --> 52 unique values
brand --> 146 unique values
model --> 2989 unique values
type --> 15 unique values
color --> 1932 unique values
year --> 48 unique values


## Merge some data

In [113]:
#In order to calculate our first metric, we will use the following columns:
# "lon" and "lat" : These are the longitude and latitude of the corresponding city. We will add them later.
# "brand", "model", "type", "year"

In [114]:
# Calculate the longitude and latitude of the cities
import geopy
from geopy.geocoders import Nominatim
# Note: geopy 2.x requires an explicit identifier, i.e. Nominatim(user_agent=...)
geolocator = Nominatim()

In [115]:
cities = df_motos_raw['city'].unique()
locations_rows = []
for city in cities:
    # Pass the city name as a plain string rather than a one-element list
    location = geolocator.geocode(city, timeout=15)
    locations_rows.append([city, location.latitude, location.longitude])
# Save to a CSV file
df_locations = pd.DataFrame(locations_rows, columns=['city', 'lat', 'lon'])
df_locations.to_csv('locations_coords.csv', index=False)

In [116]:
df_locations.head(3)

Unnamed: 0,city,lat,lon
0,granada,37.18302,-3.602192
1,vizcaya,43.238264,-2.852207
2,asturias,43.271563,-5.853946
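
Geocoding a bare name can be ambiguous: in the results further down, the `leon` rows carry lat 30.468306 and lon -84.254907, which is nowhere near Spain. The sketch below qualifies each query with a country and collects the names that cannot be resolved instead of failing on them. It assumes every ad is located in Spain, which the notebook does not state explicitly; `geocode_cities` and the fake lookup table are hypothetical helpers, and the `geocode` argument stands in for `geolocator.geocode`, which returns `None` when nothing is found.

```python
from collections import namedtuple


def geocode_cities(cities, geocode, country="Spain"):
    """Return ([city, lat, lon] rows, names not found), querying '<city>, <country>'.

    `geocode` is any callable that takes a query string and returns an object
    with `.latitude` and `.longitude`, or None when there is no match.
    """
    rows, not_found = [], []
    for city in cities:
        location = geocode("%s, %s" % (city, country))
        if location is None:
            not_found.append(city)
            continue
        rows.append([city, location.latitude, location.longitude])
    return rows, not_found


# Offline stand-in for the real geocoder, so the sketch runs without network access
Location = namedtuple("Location", ["latitude", "longitude"])
fake_index = {"granada, Spain": Location(37.18302, -3.602192)}

rows, not_found = geocode_cities(["granada", "atlantis"], fake_index.get)
print(rows)
print(not_found)
```

With the real service, the call would look like `geocode_cities(cities, lambda q: geolocator.geocode(q, timeout=15))`.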


In [117]:
# Merge df_locations with df_motos_raw
df_motos_raw_coord = pd.merge(df_motos_raw, df_locations, on = 'city')
#Save into a csv
df_motos_raw_coord.to_csv('df_motos_raw_coord.csv', index = False)

In [118]:
# We have created two CSV files with a ranking of the motorcycle brands and types,
# called rank_moto_brands.csv and rank_moto_types.csv

# With the first one, rank_moto_brands.csv, we are going to create another column with the score of the corresponding brand
df_rank_moto_brand = pd.read_csv('rank_moto_brands.csv', sep=';')
df_motos_raw_coord_brand = pd.merge(df_motos_raw_coord, df_rank_moto_brand, on = 'brand', how = 'left')
# If the brand does not exist in the ranking, its score is set to zero
# TODO: add the missing motorcycle brands to the ranking before filling with zero
df_motos_raw_coord_brand['brand_score'] = df_motos_raw_coord_brand['brand_score'].fillna(0)
# Save to a CSV file
df_motos_raw_coord_brand.to_csv('df_motos_coord_brand.csv', index = False)

# With the second one, rank_moto_types.csv, we are going to create another column with the score of the corresponding type
df_rank_moto_type = pd.read_csv('rank_moto_types.csv', sep=';')
df_motos_raw_coord_brand_types = pd.merge(df_motos_raw_coord_brand, df_rank_moto_type, on = 'type', how = 'left')
# Save to a CSV file
df_motos_raw_coord_brand_types.to_csv('df_motos_raw_coord_brand_type.csv', index = False)


In [119]:
df_motos_raw_coord_brand_types.head(3)

Unnamed: 0,city,brand,model,type,cc,color,km,year,price,url,lat,lon,brand_score,type_score
0,granada,triumph,speed_triple,naked,1050.0,amarillo,32000.0,2006,4500,http://motos.net/ocasion/triumph/speed_triple/...,37.18302,-3.602192,120.0,15
1,granada,triumph,tiger_800,trail,799.0,negro,39200.0,2011,6200,http://motos.net/ocasion/triumph/tiger_800/201...,37.18302,-3.602192,120.0,14
2,granada,kawasaki,vn_900_classic_special_edition,custom,903.0,negro,32000.0,2011,5000,http://motos.net/ocasion/kawasaki/vn_900_class...,37.18302,-3.602192,378.0,13


In [120]:
df_motos_raw_coord_brand_types.shape

(26020, 14)
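
A left merge keeps every ad but silently leaves the score empty when a brand is absent from the ranking file. One way to see which keys failed to match is the `indicator` option of `pd.merge`, sketched here on toy data. The `kymco` score is a made-up value for illustration; `bmw` and `gas gas` appear in the notebook, but the small frames themselves are hypothetical.

```python
import pandas as pd

ads = pd.DataFrame({
    "brand": ["bmw", "gas gas", "kymco"],
    "price": [7500, 2100, 1950],
})
# Illustrative ranking: 'gas gas' is deliberately missing
ranking = pd.DataFrame({
    "brand": ["bmw", "kymco"],
    "brand_score": [356, 90],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
merged = pd.merge(ads, ranking, on="brand", how="left", indicator=True)
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "brand"].unique())

# Drop the helper column once the check is done
merged = merged.drop("_merge", axis=1)

print(unmatched)
```

Listing the unmatched brands right after the merge makes it easier to decide whether to extend the ranking file or fall back to a default score.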

In [131]:
# OK! We now have a first version of the data we will use to recommend vehicles:
# df_motos_raw_coord_brand_types
# Let's try to calculate the metric with only some fields. We are going to combine these distances:
# city distance, brand distance, type distance, year distance
# We need to create some functions:

def cities_distance(city_lat, city_lon, user_lat, user_lon):
    """
    :param city_lat: latitude of the ad's city (the dataset's lat column)
    :param city_lon: longitude of the ad's city (the dataset's lon column)
    :param user_lat: latitude, in the locations dataset, of the city selected by the user
    :param user_lon: longitude, in the locations dataset, of the city selected by the user

    :return: the distance between the two cities, in kilometers.

    Uses the Vincenty distance.
    """
    # Note: geopy 2.x removed `vincenty`; `geopy.distance.geodesic` is its replacement there
    from geopy.distance import vincenty

    column_city = (city_lat, city_lon)
    user_city = (user_lat, user_lon)

    return vincenty(column_city, user_city).km

def distance_abs_value(a_value, b_value):
    return abs(a_value - b_value)

def w_s(city_row, brand_row, type_row, year_row):
    """Weighted average of the four distances (the weights add up to 100)."""
    import numpy as np
    total_weight = 100
    brand_weight = 40
    type_weight = 40
    year_weight = 10
    city_weight = 10

    # Both arrays follow the same order: city, brand, type, year
    params = np.array([city_row, brand_row, type_row, year_row])
    weights = np.array([city_weight, brand_weight, type_weight, year_weight])

    num = np.sum(params * weights) * 1.0
    return num / total_weight
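
To build some intuition for what `cities_distance` returns, here is a standard-library sketch of the haversine formula. Note that this is a different, simpler technique than the one used above: it treats the Earth as a sphere (mean radius of 6371 km, an assumption of the model), whereas Vincenty works on an ellipsoid, so the two results can differ slightly. The function name `haversine_km` is ours; the coordinates are the ones shown for Granada and Vizcaya in the locations table.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


granada = (37.18302, -3.602192)
vizcaya = (43.238264, -2.852207)

distance = haversine_km(granada[0], granada[1], vizcaya[0], vizcaya[1])
print("%.0f km" % distance)
```

For a ranking metric that only carries a 10% weight, the spherical approximation is probably accurate enough, but we keep the geopy call in the notebook.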

In [122]:
# Example of a user request: (city, brand, type, year)
user_request = ('leon', 'bmw', 'custom', 2000)

# Look up the values that correspond to the request:
user_lat = float(df_locations.loc[df_locations['city'] == user_request[0], 'lat'].iloc[0])
user_lon = float(df_locations.loc[df_locations['city'] == user_request[0], 'lon'].iloc[0])
user_brand = int(df_rank_moto_brand.loc[df_rank_moto_brand['brand'] == user_request[1], 'brand_score'].iloc[0])
user_type = int(df_rank_moto_type.loc[df_rank_moto_type['type'] == user_request[2], 'type_score'].iloc[0])
user_year = user_request[3]

user_vars = [user_brand, user_type, user_year]
score_columns = ['brand_score', 'type_score', 'year']

In [132]:
df_motos_raw_coord_brand_types['city_metric'] = df_motos_raw_coord_brand_types.apply(lambda row: cities_distance(row['lat'], row['lon'], user_lat, user_lon), axis=1)

for i, element in enumerate(['brand', 'type', 'year']):
    new_column = element + '_metric'
    print('%s %s %s' % (new_column, score_columns[i], user_vars[i]))
    df_motos_raw_coord_brand_types[new_column] = df_motos_raw_coord_brand_types.apply(lambda row: distance_abs_value(int(row[score_columns[i]]), user_vars[i]), axis=1)

# TODO: review the motorcycle brand ranking; some brands are missing, which is why NaN appears in the brand_score column after the merge

df_motos_raw_coord_brand_types['total_metric_pond'] = df_motos_raw_coord_brand_types.apply(lambda row: w_s(row['city_metric'], row['brand_metric'], row['type_metric'], row['year_metric']), axis = 1)

brand_metric brand_score 356
type_metric type_score 13
year_metric year 2000
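
The brand, type and year metrics are plain absolute differences, so they can also be computed on whole columns at once instead of row by row with `apply`. The sketch below uses a two-row toy dataframe; the user values (356, 13, 2000) are the ones printed above, and the dictionary `user_values` is a hypothetical replacement for the `user_vars` and `score_columns` lists. It casts to `float` rather than `int`, so the resulting columns hold floats.

```python
import pandas as pd

toy = pd.DataFrame({
    "brand_score": [356.0, 120.0],
    "type_score": [14, 15],
    "year": ["1997", "2006"],  # text, as in the lower-cased dataframe
})

# Source column -> value requested by the user
user_values = {"brand_score": 356, "type_score": 13, "year": 2000}

for column, user_value in user_values.items():
    metric_name = column.replace("_score", "") + "_metric"
    # Column-wise absolute difference, no per-row lambda needed
    toy[metric_name] = (toy[column].astype(float) - user_value).abs()

print(toy[["brand_metric", "type_metric", "year_metric"]])
```

On a dataframe of about 26,000 rows the difference in run time is modest, but the column-wise form is easier to read and to test.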


In [133]:
results = df_motos_raw_coord_brand_types.sort_values(by = ['total_metric_pond'], ascending=True)

In [134]:
results.head(5)

Unnamed: 0,city,brand,model,type,cc,color,km,year,price,url,lat,lon,brand_score,type_score,city_metric,brand_metric,type_metric,year_metric,total_metric_pond
25362,leon,bmw,f_650_st,trail,652.0,blanco,78282.0,1997,1500,http://motos.net/ocasion/bmw/f_650_st/1997-en-...,30.468306,-84.254907,356.0,14,0.0,0,1,3,0.7
25268,leon,bmw,f_650_gs,trail,798.0,,8.0,2005,3600,http://motos.net/ocasion/bmw/f_650_gs/2005-en-...,30.468306,-84.254907,356.0,14,0.0,0,1,5,0.9
25251,leon,bmw,r_1200_gs,trail,1170.0,rojo,97000.0,2005,7000,http://motos.net/ocasion/bmw/r_1200_gs/2005-en...,30.468306,-84.254907,356.0,14,0.0,0,1,5,0.9
25340,leon,bmw,r_1200_gs,trail,1170.0,blanco,91000.0,2006,7400,http://motos.net/ocasion/bmw/r_1200_gs/2006-en...,30.468306,-84.254907,356.0,14,0.0,0,1,6,1.0
25371,leon,bmw,r_1200_gs,trail,1170.0,amarillo,40600.0,2009,11200,http://motos.net/ocasion/bmw/r_1200_gs/2009-en...,30.468306,-84.254907,356.0,14,0.0,0,1,9,1.3
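
The `total_metric_pond` column can be checked by hand. The sketch below recomputes the score of the first two result rows from their individual metrics, using the same weights as `w_s` (city 10, brand 40, type 40, year 10). The names `WEIGHTS` and `weighted_score` are ours and only serve the illustration.

```python
# Same weights as in w_s; they add up to 100
WEIGHTS = {"city": 10, "brand": 40, "type": 40, "year": 10}


def weighted_score(distances, weights=WEIGHTS):
    """Weighted average of the per-field distances."""
    total = float(sum(weights.values()))
    return sum(distances[name] * weights[name] for name in weights) / total


# Metrics copied from the first two rows of the results table
first_row = {"city": 0.0, "brand": 0, "type": 1, "year": 3}
second_row = {"city": 0.0, "brand": 0, "type": 1, "year": 5}

first_score = weighted_score(first_row)    # (0*10 + 0*40 + 1*40 + 3*10) / 100
second_score = weighted_score(second_row)  # (0*10 + 0*40 + 1*40 + 5*10) / 100

print(first_score, second_score)
```

These values agree with the 0.7 and 0.9 shown in the table. Note that the metric adds kilometers, ranking positions and years on the same scale, so the weights also act as implicit unit conversions.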


In [94]:
brand_null = df_motos_raw_coord_brand_types[df_motos_raw_coord_brand_types['brand_score'].isnull()]

In [95]:
for element in brand_null.brand.unique():
    print(element)

gas gas
fb mondial
monkey bike
mash
peugeot
imr
vespino
hudson boss
jin lun
husaberg
mxonda
mv agusta
moto morini
montesa
rav
apollo orion
hanway
zongshen
puch
lifan
lem
ossa
ycf
tgb
elmoto
renault
riya
goes
leonart
chopper nation
cagiva
italjet
victory
indian
wildlander
csr
lml
scorpa
ksr moto
ural
malaguti
hm
jonway
lambretta
lemev
baotian
big dog
zero motorcycles
kangxin
zms motors
oset
mtr
brp
rebel
alpina renania
gowinn
xmotos
brammo
scomadi
bereco
yiying
mobilette
cpi
quadro
samada
rewaco
can-am
qingqi
orcal
jianshe
innocenti
znen
arctic cat
ridley motorcycles
american ironhorse
young rider
orion
vmoto
tbq
sumo
mecatecno
tm
wottan
dorton
mtm
kinroad
motogac
ajp
vor
via scooter
torrot
i-moto
e-max
adly
xispa
hsun
lectric
roan
bunker trike
xingyue
aeon
lanvertti
ymr


In [127]:
import numpy as np

params = np.array([2, 4, 10, 2])
weights = np.array([10, 40, 40, 10])


In [130]:
sum(params * weights)

600