# FINAL PROJECT - Second-hand motorcycles recommender
___
#### MASTER IN DATA SCIENCE - KSCHOOL - 2016/17
#### KOLDO PINA ORTIZ
____

## Motivation

The aim of this project is to build a good recommender for second-hand motorcycles, based on the ads of a second-hand marketplace website.

## Action Plan

To reach our goal, we will follow these steps:

1. ***Scrape*** the website **motos.net** to obtain the ***data***.
2. ***Clean*** the ***data***.
3. ***Merge*** some ***data***.
4. ***Train*** various models.
5. Compare the metrics and choose the model with the best one.
6. Create a Flask web server with the app.

## Scraping...

The first thing we need to do is get the data from the website. The website we are going to scrape is [http://www.motos.net](http://www.motos.net).
To do this, we have created a program called <span style="color:orange">**scraper_motos.py**</span>, located in the folder <span style="color:73b113">**scrapers**</span>. 
This scraper runs through all the motorcycle ads, capturing their features. At the end it creates a CSV file with all the data and places it in the folder <span style="color:73b113">**scraped_data**</span> with the name <span style="color:dba368">**motos_raw_data.csv**</span>.
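
The scraper's internals are not shown here, so the following is only a minimal sketch of the final step, assuming the ads are collected as dictionaries and written with `;` as the separator (the separator used when the file is read later). The field list and the in-memory buffer are illustrative choices, not the scraper's actual code.

```python
import csv
import io

# Two sample rows standing in for scraped ads (illustrative subset of fields).
rows = [
    {'city': 'Madrid', 'brand': 'BMW', 'model': 'R 1200 GS', 'price': 7490},
    {'city': 'Navarra', 'brand': 'KTM', 'model': '390 Duke', 'price': 4700},
]

# Write to an in-memory buffer; a real scraper would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=['city', 'brand', 'model', 'price'], delimiter=';'
)
writer.writeheader()
writer.writerows(rows)

# Read the data back with the same delimiter to check the round trip.
buffer.seek(0)
read_back = list(csv.DictReader(buffer, delimiter=';'))
```

Note that the `csv` module returns every value as a string when reading, so numeric fields have to be converted afterwards.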

Ok. So we already have the most important part of the data we are going to need.
In addition to this, we have also created another scraper that obtains a first ranking of motorcycle brands, based on user votes. This new scraper is called <span style="color:orange">**scraper_moto_brands_rank.py**</span> and is located in the folder <span style="color:73b113">**scrapers**</span>. It obtains the data from [http://en.classora.com/reports/f87259/ranking-of-the-best-motorcycle-brands?id=872&groupCount=50&startIndex=1](http://en.classora.com/reports/f87259/ranking-of-the-best-motorcycle-brands?id=872&groupCount=50&startIndex=1). The resulting data is placed in the folder <span style="color:73b113">**scraped_data**</span> with the name <span style="color:dba368">**rank_moto_brands.csv**</span>.


This is an example of running <span style="color:orange">**scraper_motos**</span>:
```python
from scraper_motos import scraper_motos
scraper_motos()

Start time: 2017-06-13 19:28:56.441707
num_ads 26588
End time: 2017-06-13 22:47:15.427469
```

So, let's take a look at the data...

In [1]:
import pandas as pd
df_motos_raw = pd.read_csv('scraped_data/motos_raw_data.csv', sep=';')
df_rank_moto_brands = pd.read_csv('scraped_data/rank_moto_brands.csv', sep=';')

In [4]:
df_motos_raw.shape

(26511, 10)

In [5]:
df_motos_raw.head()

Unnamed: 0,city,brand,model,type,cc,color,km,year,price,url
0,Madrid,BMW,R 1200 GS,Trail,1170.0,ROJO,57400.0,2004,7490,http://motos.net/ocasion/bmw/r_1200_gs/2004-en...
1,Navarra,KYMCO,Agility 125,Scooter 125cc,125.0,,7000.0,2014,1600,http://motos.net/ocasion/kymco/agility_125/201...
2,Navarra,KTM,390 Duke,Naked,375.0,,9400.0,2014,4700,http://motos.net/ocasion/ktm/390_duke/2014-en-...
3,Navarra,HONDA,CBR 125R,Sport,125.0,,2100.0,2014,2800,http://motos.net/ocasion/honda/cbr_125r/2014-e...
4,Granada,SUZUKI,BURGMAN 250,Scooters +125cc,250.0,,57000.0,2006,1300,http://motos.net/ocasion/suzuki/burgman_250/20...


In [6]:
df_rank_moto_brands.shape

(72, 2)

In [7]:
df_rank_moto_brands.head()

Unnamed: 0,brand,brand_score
0,yamaha,646
1,honda,635
2,suzuki,580
3,ducati,473
4,harley davidson,434


In [8]:
df_rank_moto_brands.tail()

Unnamed: 0,brand,brand_score
67,polaris,8
68,cpi,8
69,husaberg,8
70,emc-puch,8
71,peugeot,3


## Cleaning, transforming and loading...

Our first step is to replace the NaN values that appear in the color column with the words "not specified", which is a little more elegant.

In [9]:
df_motos_raw['color'] = df_motos_raw['color'].fillna('not specified')

Now we are going to convert the dataframe to lower case. Note that this step also casts every column to string.

In [10]:
df_motos_raw = df_motos_raw.apply(lambda x: x.astype(str).str.lower())

And now we are going to join the words in the *model* and *type* fields by replacing the spaces with underscores.

In [11]:
df_motos_raw['model'] = df_motos_raw['model'].str.replace(' ', '_')
df_motos_raw['type'] = df_motos_raw['type'].str.replace(' ', '_')
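
Because the lower-casing cell casts every column to string, numeric fields such as *cc* or *price* lose their numeric type. The sketch below shows one possible variant that only touches the text columns. The small dataframe and the `text_cols` list are illustrative assumptions, not part of the project code.

```python
import pandas as pd

# Illustrative sample; the real dataframe has more columns.
df = pd.DataFrame({
    'brand': ['BMW', 'KYMCO'],
    'model': ['R 1200 GS', 'Agility 125'],
    'cc': [1170.0, 125.0],
})

# Hypothetical list of the columns that hold text.
text_cols = ['brand', 'model']
for col in text_cols:
    df[col] = df[col].str.lower()

# Literal (non-regex) replacement of spaces with underscores.
df['model'] = df['model'].str.replace(' ', '_', regex=False)
```

With this variant the *cc* column keeps its float type, which avoids converting it back before training a model.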

The next thing to do is to find duplicates and delete them.

In [12]:
# Looking for duplicates
df_motos_raw['is_duplicated'] = df_motos_raw.duplicated()
duplicates = df_motos_raw['is_duplicated'].sum()
print('%d duplicates' % duplicates)

93 duplicates


In [13]:
# Remove the duplicates and drop the 'is_duplicated' helper column
df_motos_raw = df_motos_raw.loc[~df_motos_raw['is_duplicated']]
df_motos_raw = df_motos_raw.drop('is_duplicated', axis=1)
df_motos_raw.shape

(26418, 10)
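
The same result can probably be obtained without the helper column. The following is a minimal sketch using `DataFrame.drop_duplicates` on an illustrative three-row dataframe (the data is made up for the example).

```python
import pandas as pd

# Illustrative sample with one exact duplicate row.
ads = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'ktm'],
    'price': [7490, 7490, 4700],
})

# Count fully duplicated rows (the first occurrence is not counted).
n_duplicates = int(ads.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index.
ads_unique = ads.drop_duplicates().reset_index(drop=True)
```

Both approaches keep the first occurrence of each duplicated row, so the resulting shapes should match.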

The next thing we are going to do is build our database of motorcycle ads in a CSV file.

To do this, we start by adding two new columns to the dataframe: lon (longitude) and lat (latitude) of the corresponding city.
We are going to use the geopy library.

In [14]:
# Calculating the longitude and latitude of the cities
import geopy
from geopy.geocoders import Nominatim
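# Newer geopy versions require a user_agent; the name below is an arbitrary example
geolocator = Nominatim(user_agent='motos_recommender')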

In [15]:
cities = df_motos_raw['city'].unique()
locations_rows = []
for city in cities:
    # Pass the city name as a plain string query
    location = geolocator.geocode(city, timeout=15)
    locations_rows.append([city, location.latitude, location.longitude])
# Create a dataframe with the location data
df_locations = pd.DataFrame(locations_rows, columns = ['city', 'lat', 'lon'])
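
A geocoder returns `None` when it cannot resolve a query, in which case reading `location.latitude` raises an `AttributeError`. The sketch below shows one way to guard against that. The `geocode_cities` helper, the `Location` stand-in and the `fake_geocode` stub are all hypothetical names introduced for illustration, so that the example runs without network access.

```python
from collections import namedtuple

# Stand-in for the object a geocoder returns (only the attributes used here).
Location = namedtuple('Location', ['latitude', 'longitude'])


def geocode_cities(cities, geocode):
    """Return ([city, lat, lon] rows, list of cities that were not resolved)."""
    rows, unresolved = [], []
    for city in cities:
        location = geocode(city)
        if location is None:
            # Keep track of the failures instead of crashing.
            unresolved.append(city)
            continue
        rows.append([city, location.latitude, location.longitude])
    return rows, unresolved


def fake_geocode(city):
    """Hypothetical stub that only knows one city."""
    known = {'madrid': Location(40.416705, -3.703582)}
    return known.get(city)


rows, unresolved = geocode_cities(['madrid', 'nowhere'], fake_geocode)
```

With geopy, the stub could be swapped for something like `lambda c: geolocator.geocode(c, timeout=15)`, and the unresolved cities reviewed by hand.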

In [16]:
# Merge df_locations with df_motos_raw
df_motos_raw_coord = pd.merge(df_motos_raw, df_locations, on = 'city')

In [17]:
df_motos_raw_coord.head()

Unnamed: 0,city,brand,model,type,cc,color,km,year,price,url,lat,lon
0,madrid,bmw,r_1200_gs,trail,1170.0,rojo,57400.0,2004,7490,http://motos.net/ocasion/bmw/r_1200_gs/2004-en...,40.416705,-3.703582
1,madrid,ducati,monster_696+,naked,696.0,blanco,18166.0,2011,5999,http://motos.net/ocasion/ducati/monster_696/20...,40.416705,-3.703582
2,madrid,kymco,super_dink_300i_abs,scooters_+125cc,299.0,blanco,9644.0,2011,3299,http://motos.net/ocasion/kymco/super_dink_300i...,40.416705,-3.703582
3,madrid,kymco,xciting_250,scooters_+125cc,249.0,gris,38100.0,2006,1150,http://motos.net/ocasion/kymco/xciting_250/200...,40.416705,-3.703582
4,madrid,bmw,f_800_r,naked,800.0,not specified,2800.0,2016,8500,http://motos.net/ocasion/bmw/f_800_r/2016-en-m...,40.416705,-3.703582


Now we are going to add another column, brand_score, with the score of the brand.

Above we said that we had scraped a website to obtain a ranking of brands. 
Now is when we are going to use that data: <span style="color:dba368">**scraped_data/rank_moto_brands.csv**</span>.

<span style="color:red">We are faced here with the problem that not all the brands in our dataframe appear in the ranking. 
So, those that do not appear in the ranking are given a default score of zero.</span>

Note: we have manually modified the scraped ranking to adjust some brands according to our own criteria.
Feel free to modify the file. The modified file is called <span style="color:dba368">**scraped_data/rank_moto_brands_plus.csv**</span>.

In [21]:
# We have scraped a website to obtain a simple brand ranking.
# It is in the folder named scraped_data and is called rank_moto_brands.csv (here we load the manually adjusted version).
# Nevertheless, not all the brands in the ads appear in it. So, first of all, we are going to add the ones that are missing.
brands_in_df_list = df_motos_raw_coord.brand.unique()
df_rank_moto_brand = pd.read_csv('scraped_data/rank_moto_brands_plus.csv', sep=';')
rank_values = df_rank_moto_brand.values.tolist()
brands_rank_values =  df_rank_moto_brand.brand.values
for df_brand in brands_in_df_list:
    if df_brand not in brands_rank_values:
        rank_values.append([df_brand,0])
df_rank_moto_brands = pd.DataFrame(rank_values, columns=['brand', 'brand_score'])
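
As a possible alternative to extending the ranking by hand, a left merge followed by `fillna(0)` should give the same default score of zero to the brands that are missing from the ranking. The two small dataframes below are illustrative samples, not the project data files.

```python
import pandas as pd

# Illustrative samples: 'hsun' is absent from the ranking.
ads = pd.DataFrame({'brand': ['bmw', 'hsun', 'ducati']})
rank = pd.DataFrame({'brand': ['bmw', 'ducati'], 'brand_score': [356, 473]})

# The left merge keeps every ad; unranked brands get NaN in brand_score.
scored = ads.merge(rank, on='brand', how='left')

# Replace the missing scores with the default of zero.
scored['brand_score'] = scored['brand_score'].fillna(0).astype(int)
```

This keeps the ranking file untouched and applies the default only where it is needed.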

In [22]:
df_rank_moto_brands.tail()

Unnamed: 0,brand,brand_score
159,hsun,0
160,dak,0
161,xingyue,0
162,cfmoto,0
163,lectric,0


So now we need to merge df_motos_raw_coord with df_rank_moto_brands in order to add the new column, brand_score.

In [24]:
df_motos_raw_coord_brand = pd.merge(df_motos_raw_coord, df_rank_moto_brands, on = 'brand', how = 'left')
df_motos_raw_coord_brand.head()

Unnamed: 0,city,brand,model,type,cc,color,km,year,price,url,lat,lon,brand_score
0,madrid,bmw,r_1200_gs,trail,1170.0,rojo,57400.0,2004,7490,http://motos.net/ocasion/bmw/r_1200_gs/2004-en...,40.416705,-3.703582,356
1,madrid,ducati,monster_696+,naked,696.0,blanco,18166.0,2011,5999,http://motos.net/ocasion/ducati/monster_696/20...,40.416705,-3.703582,473
2,madrid,kymco,super_dink_300i_abs,scooters_+125cc,299.0,blanco,9644.0,2011,3299,http://motos.net/ocasion/kymco/super_dink_300i...,40.416705,-3.703582,37
3,madrid,kymco,xciting_250,scooters_+125cc,249.0,gris,38100.0,2006,1150,http://motos.net/ocasion/kymco/xciting_250/200...,40.416705,-3.703582,37
4,madrid,bmw,f_800_r,naked,800.0,not specified,2800.0,2016,8500,http://motos.net/ocasion/bmw/f_800_r/2016-en-m...,40.416705,-3.703582,356


Now we are going to add another column, type_score, with the score of the motorcycle type.

For this purpose, we have created a CSV file with a ranking of the types called <span style="color:dba368">**rank_moto_types.csv**</span>.

We have built this ranking manually. First we obtained the motorcycle types and then we assigned each one a score, at our discretion.
Again, feel free to modify the file.


In [25]:
# With rank_moto_types.csv we create another column with the score of the corresponding type
df_rank_moto_type = pd.read_csv('auxiliary_data/rank_moto_types.csv', sep=';')
df_motos_raw_coord_brand_types = pd.merge(df_motos_raw_coord_brand, df_rank_moto_type, on = 'type', how = 'left')
# Save into a CSV file
df_motos_raw_coord_brand_types.to_csv('app/static/data/df_motos_raw_coord_brand_type.csv', index = False)
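
Since the type ranking is maintained by hand, it may be worth checking that every type in the ads actually has a score after the left merge. The sketch below uses the `indicator=True` option of `merge` for that. The sample types and scores are made up for illustration.

```python
import pandas as pd

# Illustrative samples: 'quad' has no entry in the type ranking.
ads = pd.DataFrame({'type': ['trail', 'naked', 'quad']})
type_rank = pd.DataFrame({'type': ['trail', 'naked'], 'type_score': [8, 7]})

# indicator=True adds a '_merge' column telling where each row came from.
merged = ads.merge(type_rank, on='type', how='left', indicator=True)

# Rows marked 'left_only' are ads whose type is missing from the ranking.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'type'].tolist()
```

An empty `unmatched` list would mean that every ad received a type_score.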

OK! So now we have our motorcycle ads database at the path  <span style="color:dba368">**app/static/data/df_motos_raw_coord_brand_type.csv**</span>.

This file simulates the data as if it were stored in a database management system.

Now that we have our data properly prepared, we need to build our recommender.

We have created a Python class in recommender_class.py. It is located at app/recommender_class.py.


Notes about the recommender_class:
