# Data Scientist Professional Practical Exam

### Company Background

Nearly New Nautical is a website that allows users to advertise their used boats for sale. When users list a boat, they must provide a range of information about it. Boats that get lots of views bring more traffic to the website, and more potential customers.

To boost traffic to the website, the product manager wants to avoid listing boats that are unlikely to receive many views.




### Customer Question

The product manager wants to know the following:
- Can you predict the number of views a listing will receive based on the boat's features?



### Success Criteria

The product manager would consider using your model if, on average, its predictions were within 50% of the true number of views a listing receives.
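
One way to make this criterion measurable is the mean absolute percentage error (MAPE): the average of |predicted - actual| / actual across listings. Reading "50% off" as MAPE ≤ 0.5 is an assumption, not something the brief states. The function name and the sample numbers below are illustrative only.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |predicted - actual| / actual; every actual value must be non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Made-up view counts, purely to show the calculation
actual_views = [100, 200, 50]
predicted_views = [150, 100, 50]

mape = mean_absolute_percentage_error(actual_views, predicted_views)
meets_target = mape <= 0.5   # assumed reading of the success criterion
print(f"MAPE = {mape:.1%}, meets 50% target: {meets_target}")
```

Because MAPE divides by the true value, it penalises errors on low-view listings more heavily than errors of the same absolute size on popular listings.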


### Dataset

The data you will use for this analysis can be accessed here: `"data/boat_data.csv"`

# Import and Install Modules

In [1]:
# Install modules not preinstalled in DC Workspaces
!pip install forex_python
!pip install ftfy

Collecting forex_python
  Downloading forex_python-1.8-py3-none-any.whl (8.2 kB)
Collecting requests
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
Collecting simplejson
  Downloading simplejson-3.18.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (135 kB)
Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.13-py2.py3-none-any.whl (140 kB)
Collecting idna<4,>=2.5
  Downloading idna-3.4-py3-none-any.whl (61 kB)
Collecting c

In [2]:
# Import modules
import pandas as pd
import numpy as np
import chardet as ch 
from forex_python.converter import CurrencyRates
from ftfy import fix_and_explain, fix_text
import datetime

import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import r2_score,mean_squared_error
plt.style.use('ggplot')

# Load Data

In [3]:
with open('data/boat_data.csv', 'rb') as file:             # check the CSV file's encoding to reduce reading errors and data cleanup
    print(ch.detect(file.read()))

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}


In [4]:
df = pd.read_csv('data/boat_data.csv', encoding = "utf-8")     # load data into dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9888 entries, 0 to 9887
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Price                        9888 non-null   object 
 1   Boat Type                    9888 non-null   object 
 2   Manufacturer                 8550 non-null   object 
 3   Type                         9882 non-null   object 
 4   Year Built                   9888 non-null   int64  
 5   Length                       9879 non-null   float64
 6   Width                        9832 non-null   float64
 7   Material                     8139 non-null   object 
 8   Location                     9852 non-null   object 
 9   Number of views last 7 days  9888 non-null   int64  
dtypes: float64(2), int64(2), object(6)
memory usage: 772.6+ KB


In [5]:
df.describe()

Unnamed: 0,Year Built,Length,Width,Number of views last 7 days
count,9888.0,9879.0,9832.0,9888.0
mean,1893.19286,11.570017,3.520124,149.160801
std,460.201582,6.00282,1.220534,151.819752
min,0.0,1.04,0.01,13.0
25%,1996.0,7.47,2.54,70.0
50%,2007.0,10.28,3.33,108.0
75%,2017.0,13.93,4.25,172.0
max,2021.0,100.0,25.16,3263.0


In [6]:
df.head(10)

Unnamed: 0,Price,Boat Type,Manufacturer,Type,Year Built,Length,Width,Material,Location,Number of views last 7 days
0,CHF 3337,Motor Yacht,Rigiflex power boats,new boat from stock,2017,4.0,1.9,,Switzerland Â» Lake Geneva Â» VÃ©senaz,226
1,EUR 3490,Center console boat,Terhi power boats,new boat from stock,2020,4.0,1.5,Thermoplastic,Germany Â» BÃ¶nningstedt,75
2,CHF 3770,Sport Boat,Marine power boats,new boat from stock,0,3.69,1.42,Aluminium,Switzerland Â» Lake of Zurich Â» StÃ¤fa ZH,124
3,DKK 25900,Sport Boat,Pioner power boats,new boat from stock,2020,3.0,1.0,,Denmark Â» Svendborg,64
4,EUR 3399,Fishing Boat,Linder power boats,new boat from stock,2019,3.55,1.46,Aluminium,Germany Â» Bayern Â» MÃ¼nchen,58
5,CHF 3650,Sport Boat,Linder power boats,new boat from stock,0,4.03,1.56,Aluminium,Switzerland Â» Lake Constance Â» Uttwil,132
6,CHF 3600,Catamaran,,"Used boat,Unleaded",1999,6.2,2.38,Aluminium,Switzerland Â» Neuenburgersee Â» Yvonand,474
7,DKK 24800,Sport Boat,,Used boat,0,3.0,,,Denmark Â» Svendborg,134
8,EUR 3333,Fishing Boat,Crescent power boats,new boat from stock,2019,3.64,1.37,,Germany Â» Bayern Â» Boote+service Oberbayern,45
9,EUR 3300,Pontoon Boat,Whaly power boats,new boat from stock,2018,4.35,1.73,,Italy Â» Dormelletto,180


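In [1]:
# Count distinct values per column. The In [1] label and the NameError below indicate
# this cell was executed before In [4] had created df; re-run it after loading the data.
df.nunique()

NameError: name 'df' is not defined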

# Data Clean Up

To Dos:
* \[Price\]:
  * Parse Price into 'Currency' + 'Amount' columns ✔
  * Convert to Euros or USD ✔
* \[Boat Type\]:
  * Search & group similar categories
* \[Manufacturer\]:
  * Remove "power boats" ✔
  * Clean up characters ✔
  * Fuzzy match manufacturers to reduce counts
* \[Type\]:
  * Parse column into 'Condition' + 'Fuel Type' columns
  * Separate into: New / Used / Display, Diesel / Unleaded / Electric / Gas
* \[Material\]:
  * ()
* \[Location\]:
  * Parse Country, Region & City
  * Correct misspelled words / characters
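
The \[Location\] items can be sketched with the standard library alone. This assumes the garbled characters are UTF-8 text that was mis-decoded as Latin-1, which matches the rows shown by `df.head(10)`. `ftfy.fix_text`, already imported in this notebook, handles more cases. The helper names `repair_mojibake` and `parse_location`, and the rule for which part is the region and which is the city, are assumptions made for illustration.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mis-decoded as Latin-1 (e.g. 'Ã©' -> 'é')."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already clean): leave the text unchanged.
        return text


def parse_location(raw):
    """Split 'Country » Region » City' into a (country, region, city) tuple."""
    if not isinstance(raw, str):          # missing values arrive as NaN (a float)
        return (None, None, None)
    parts = [p.strip() for p in repair_mojibake(raw).split("»") if p.strip()]
    country = parts[0] if parts else None
    # Assumption: with two parts the second is the city; with three or more,
    # the second is the region and the last is the city.
    region = parts[1] if len(parts) > 2 else None
    city = parts[-1] if len(parts) > 1 else None
    return (country, region, city)


# Sample values copied from the df.head(10) output
rows = [
    "Switzerland Â» Lake Geneva Â» VÃ©senaz",
    "Germany Â» BÃ¶nningstedt",
    "Denmark Â» Svendborg",
]
parsed = [parse_location(r) for r in rows]
for row in parsed:
    print(row)
```

The third part is not always a city (one sample row ends in a dealer name), so the resulting column would still need a manual review.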

### Data Clean Up

In [8]:
df[['Currency', 'Amount']] = df.Price.str.split(" ", expand=True)          # Split 'Price' into 'Currency' & 'Amount'
print(df.Amount.isnull().values.any())                                     # True if 'Amount' contains any nulls (expect False)
print(df.Currency.isnull().values.any())                                   # True if 'Currency' contains any nulls (expect False)
print(df.Currency.unique())

False
False
['CHF' 'EUR' 'DKK' 'Â£']


In [9]:
df['Currency'] = df['Currency'].str.replace('Â£', 'GBP')                   # Clean up British Pound currency chars
print(df.Currency.unique())

['CHF' 'EUR' 'DKK' 'GBP']


In [10]:
df.Amount.str.isdigit().all()                       # Check that 'Amount' only contains numeric chars (no '.' or ',')

True

In [11]:
df['Amount'] = df['Amount'].astype('float64')         # Convert 'Amount' to numeric

In [12]:
# The following approach is too slow: it issues one GET request per row (about 9.9k requests)

# curr = CurrencyRates()
# df['Amount (USD)'] = df.apply( lambda x: curr.convert( x.Currency, 'USD', x.Amount), axis = 1)

In [13]:
curr = CurrencyRates()
currencies = df.Currency.unique()
conversion_date = datetime.datetime(2023, 1, 3)

rates = [curr.convert(i, 'USD', 1, conversion_date) for i in currencies]
rates_to_USD = dict(zip(currencies, rates))

print(rates_to_USD)

{'CHF': 1.0674157303370786, 'EUR': 1.0545, 'DKK': 0.1417910447761194, 'GBP': 1.1976421951662728}


In [14]:
df['Price (USD)'] = round(df['Amount'] * df['Currency'].map(rates_to_USD), 2)
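
The speed-up in In [13] and In [14] comes from looking up one rate per currency and reusing it for every row. The sketch below shows the same idea in plain Python with no network calls. The rates are copied from the output of In [13], and the helper name `price_to_usd` is hypothetical.

```python
# Rates as printed by In [13] (USD per one unit of each currency, 2023-01-03)
RATES_TO_USD = {
    "CHF": 1.0674157303370786,
    "EUR": 1.0545,
    "DKK": 0.1417910447761194,
    "GBP": 1.1976421951662728,
}


def price_to_usd(price):
    """Convert a raw 'Price' string such as 'EUR 3490' to USD, rounded to cents."""
    currency, amount = price.split(" ", 1)
    if "£" in currency:                   # covers both '£' and the garbled 'Â£'
        currency = "GBP"
    return round(float(amount) * RATES_TO_USD[currency], 2)


# Sample prices in the same format as the 'Price' column
samples = ["EUR 1000", "DKK 25900", "Â£ 2000"]
converted = {s: price_to_usd(s) for s in samples}
print(converted)
```

A fixed dictionary also makes the conversion reproducible, since rerunning the notebook does not depend on the rates service being reachable.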

In [15]:
t = df['Boat Type'].unique()
sorted(t)

['Bowrider',
 'Bowrider,Cabin Boat,Deck Boat',
 'Bowrider,Center console boat,Sport Boat',
 'Bowrider,Classic',
 'Bowrider,Deck Boat,Water ski',
 'Bowrider,Motor Yacht,Sport Boat',
 'Bowrider,Motor Yacht,Wakeboard/Wakesurf',
 'Bowrider,Sport Boat,Wakeboard/Wakesurf',
 'Bowrider,Wakeboard/Wakesurf',
 'Cabin Boat',
 'Cabin Boat,Classic',
 'Cabin Boat,Classic,Flybridge',
 'Cabin Boat,Classic,Motor Yacht',
 'Cabin Boat,Classic,Passenger boat',
 'Cabin Boat,Classic,Trawler',
 'Cabin Boat,Fishing Boat',
 'Cabin Boat,Fishing Boat,House Boat',
 'Cabin Boat,Fishing Boat,Pilothouse',
 'Cabin Boat,Fishing Boat,Sport Boat',
 'Cabin Boat,Flybridge',
 'Cabin Boat,Flybridge,Motor Yacht',
 'Cabin Boat,Hardtop',
 'Cabin Boat,Hardtop,Motor Yacht',
 'Cabin Boat,Hardtop,Sport Boat',
 'Cabin Boat,Hardtop,Trawler',
 'Cabin Boat,House Boat',
 'Cabin Boat,House Boat,Trawler',
 'Cabin Boat,Motor Yacht',
 'Cabin Boat,Motor Yacht,Offshore Boat',
 'Cabin Boat,Motor Yacht,Sport Boat',
 'Cabin Boat,Motor Yacht,Traw

In [16]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.groupby('Boat Type').size())

Boat Type
Bowrider                                        301
Bowrider,Cabin Boat,Deck Boat                     1
Bowrider,Center console boat,Sport Boat           1
Bowrider,Classic                                  1
Bowrider,Deck Boat,Water ski                      1
Bowrider,Motor Yacht,Sport Boat                   1
Bowrider,Motor Yacht,Wakeboard/Wakesurf           1
Bowrider,Sport Boat,Wakeboard/Wakesurf            2
Bowrider,Wakeboard/Wakesurf                       1
Cabin Boat                                      585
Cabin Boat,Classic                                9
Cabin Boat,Classic,Flybridge                      1
Cabin Boat,Classic,Motor Yacht                    3
Cabin Boat,Classic,Passenger boat                 1
Cabin Boat,Classic,Trawler                        1
Cabin Boat,Fishing Boat                           2
Cabin Boat,Fishing Boat,House Boat                1
Cabin Boat,Fishing Boat,Pilothouse                1
Cabin Boat,Fishing Boat,Sport Boat                1
Ca
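
Most multi-label 'Boat Type' values occur only once or twice, so one simple way to group them is to keep only the first label of each comma-separated value. Treating the first label as the primary category is an assumption, not something the data guarantees. `primary_boat_type` is a hypothetical helper.

```python
from collections import Counter


def primary_boat_type(boat_type):
    """Return the first label of a comma-separated 'Boat Type' value."""
    return boat_type.split(",")[0].strip()


# A few values taken from the listing above
sample = [
    "Bowrider",
    "Bowrider,Classic",
    "Bowrider,Sport Boat,Wakeboard/Wakesurf",
    "Cabin Boat",
    "Cabin Boat,Flybridge,Motor Yacht",
]
counts = Counter(primary_boat_type(value) for value in sample)
print(counts.most_common())
```

On the DataFrame, the same function could be applied with `df['Boat Type'].map(primary_boat_type)`. An alternative that keeps every label is one indicator column per category.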

In [17]:
t = df['Manufacturer'].fillna('None').unique()
sorted(t)

['2 emme marine power boats',
 '3B Craft power boats',
 'A. Mostes power boats',
 'AB Yachts power boats',
 'ACM Dufour power boats',
 'AGA-Marine power boats',
 'AICON Yachts power boats',
 'AL Custom power boats',
 'AM Yacht power boats',
 'AMS Marine Yachten power boats',
 'AMT power boats',
 'ARS Mare power boats',
 'AS Marine power boats',
 'ATOMIX power boats',
 'AW Yachts power boats',
 'AW power boats',
 'AXOPAR power boats',
 'AYROS power boats',
 'Abacus power boats',
 'Abati Yachts power boats',
 'Abeking & Rasmussen power boats',
 'Absolute power boats',
 'Acquaviva (IT) power boats',
 'Acroplast power boats',
 'Adagio Yachts power boats',
 'Adec power boats',
 'Adex Nautica power boats',
 'Adler power boats',
 'Admiral power boats',
 'Adventure power boats',
 'Aegean Yachts power boats',
 'Agder power boats',
 'Aicon power boats',
 'Airon Marine power boats',
 'Akerboom power boats',
 'Ala Blu power boats',
 'Alalunga power boats',
 'Albatro power boats',
 'Albemarle power

In [18]:
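df['Manufacturer'] = df['Manufacturer'].fillna('None Specified')                       # Label missing manufacturers explicitly
df['Manufacturer'] = df['Manufacturer'].str.replace(' power boats', '', regex=False)   # Strip the generic ' power boats' suffix
# Collect names containing unexpected characters. Note: ' -.' inside the character class is a
# range (space through '.'), so it also admits characters such as '&', '(' and ')'.
mfrs_misspelled = df[df.Manufacturer.str.contains(r'[^0-9a-zA-Z -.]')].Manufacturer.unique()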

In [19]:
df['Manufacturer'] = [fix_text(i) for i in df['Manufacturer']]
t = df['Manufacturer'].unique()
sorted(t)

['2 emme marine',
 '3B Craft',
 'A. Mostes',
 'AB Yachts',
 'ACM Dufour',
 'AGA-Marine',
 'AICON Yachts',
 'AL Custom',
 'AM Yacht',
 'AMS Marine Yachten',
 'AMT',
 'ARS Mare',
 'AS Marine',
 'ATOMIX',
 'AW',
 'AW Yachts',
 'AXOPAR',
 'AYROS',
 'Abacus',
 'Abati Yachts',
 'Abeking & Rasmussen',
 'Absolute',
 'Acquaviva (IT)',
 'Acroplast',
 'Adagio Yachts',
 'Adec',
 'Adex Nautica',
 'Adler',
 'Admiral',
 'Adventure',
 'Aegean Yachts',
 'Agder',
 'Aicon',
 'Airon Marine',
 'Akerboom',
 'Ala Blu',
 'Alalunga',
 'Albatro',
 'Albemarle',
 'Albin',
 'Alen Yacht',
 'Alfamarine',
 'Alfastreet Marine',
 'Allegra',
 'Allround',
 'Alpa',
 'Altair',
 'Altena',
 'AluForce',
 'AluVenture',
 'Aluminiumjon',
 'Amberg',
 'Amel',
 'Amer',
 'Amerglass',
 'American Marine',
 'Ancora',
 'Antaris',
 'Anytec',
 'Anytec Boats',
 'Apreamare',
 'Aquabat',
 'Aquador',
 'Aqualum',
 'Aquanaut',
 'Aquarius',
 'Aquastar',
 'Aquaviva',
 'Arcoa',
 'Argo',
 'Arkos',
 'Armee Suisse',
 'Arp-Werft ',
 'Arvor',
 'Astinor

In [23]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.groupby('Manufacturer').size().sort_values(ascending = False))

Manufacturer
None Specified                    1338
Bénéteau                           631
Jeanneau                           537
Sunseeker                          383
Princess                           241
Sea Ray                            239
Cranchi                            219
Azimut                             215
Bavaria                            185
Fairline                           172
Quicksilver (Brunswick Marine)     167
Sessa                              148
Bayliner                           142
Sealine                            120
Quicksilver                        118
Prestige Yachts                    108
Galeon                              94
Regal                               90
Riva                                77
Linssen                             70
Windy                               64
Ferretti                            63
Parker                              62
Boesch                              55
Pershing                            54
Four Winns  
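
The counts above contain near-duplicates such as 'Quicksilver' and 'Quicksilver (Brunswick Marine)', which is what the fuzzy-matching to-do targets. Below is a minimal sketch using `difflib` from the standard library. The list of generic words, the 0.85 cutoff, and the helper names are all assumptions. Whether two similar names really are the same manufacturer still needs a manual check.

```python
import difflib
import re

# Hypothetical list of words that carry little identifying information
GENERIC_WORDS = {"yacht", "yachts", "boats", "marine"}


def normalize(name):
    """Lower-case a name, drop parenthesised notes and generic words."""
    stripped = re.sub(r"\(.*?\)", "", name).lower()
    words = [w for w in stripped.split() if w not in GENERIC_WORDS]
    return " ".join(words) or name.lower()   # fall back if nothing is left


def canonicalize(names, cutoff=0.85):
    """Map each name to the first-seen name with a similar normalized form."""
    representative = {}   # normalized form -> first raw name seen
    mapping = {}
    for name in names:
        key = normalize(name)
        close = difflib.get_close_matches(key, list(representative), n=1, cutoff=cutoff)
        if close:
            mapping[name] = representative[close[0]]
        else:
            representative[key] = name
            mapping[name] = name
    return mapping


# Names taken from the manufacturer listings above
names = ["Quicksilver", "Quicksilver (Brunswick Marine)", "Anytec",
         "Anytec Boats", "Aicon", "AICON Yachts", "Bavaria"]
mapping = canonicalize(names)
for raw, canonical in mapping.items():
    print(f"{raw!r} -> {canonical!r}")
```

Because the first name seen becomes the representative, passing names in descending order of frequency keeps the most common spelling.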

In [28]:
# Count listings per 'Type' value, most frequent first (groupby skips rows where 'Type' is null)
print(df.groupby('Type').size().sort_values(ascending = False))

Type
Used boat,Diesel                4140
Used boat,Unleaded              1686
Used boat                       1462
new boat from stock,Unleaded    1107
new boat from stock              665
new boat from stock,Diesel       291
new boat on order,Unleaded       150
Display Model,Unleaded            75
new boat on order,Diesel          61
new boat on order                 61
Diesel                            57
Used boat,Electric                27
Unleaded                          22
Display Model,Diesel              19
Display Model                     18
new boat from stock,Electric      18
Used boat,Gas                     10
Display Model,Electric             6
new boat from stock,Gas            2
Used boat,Propane                  1
Electric                           1
new boat from stock,Hybrid         1
Display Model,Gas                  1
Used boat,Hybrid                   1
dtype: int64
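
Every value above is a comma-separated mix of at most one condition and one fuel, so the \[Type\] to-do can be sketched as a token lookup. The token sets are copied from this output. Collapsing "new boat from stock" and "new boat on order" into a single "New" label follows the to-do list but loses that distinction. `parse_type` is a hypothetical helper.

```python
# Condition tokens seen in the output above, mapped to the labels from the to-do list
CONDITION_LABELS = {
    "new boat from stock": "New",
    "new boat on order": "New",
    "Used boat": "Used",
    "Display Model": "Display",
}
FUELS = {"Diesel", "Unleaded", "Electric", "Gas", "Hybrid", "Propane"}


def parse_type(raw):
    """Split a 'Type' value into (condition, fuel); absent parts become None."""
    condition = fuel = None
    if not isinstance(raw, str):          # missing values arrive as NaN (a float)
        return (condition, fuel)
    for token in raw.split(","):
        token = token.strip()
        if token in CONDITION_LABELS:
            condition = CONDITION_LABELS[token]
        elif token in FUELS:
            fuel = token
    return (condition, fuel)


for value in ["Used boat,Diesel", "new boat from stock", "Diesel", "Display Model,Electric"]:
    print(value, "->", parse_type(value))
```

On the DataFrame, the tuples returned by `df['Type'].map(parse_type)` could then be unpacked into 'Condition' and 'Fuel Type' columns.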
