In this project, I use a k-nearest neighbors algorithm to predict a car's market price from its attributes.
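
To make the idea concrete, here is a minimal sketch of k-nearest neighbors regression in plain Python. It is not the code used later in this project: the function name `knn_predict` and the toy numbers are hypothetical and chosen only for illustration. The prediction for a new car is the average price of the `k` known cars whose features are closest to it.

```python
import math

def knn_predict(train_features, train_prices, query, k=3):
    """Average the prices of the k training rows closest to `query`."""
    # Pair each training row's Euclidean distance to the query with its price.
    distances = [
        (math.dist(row, query), price)
        for row, price in zip(train_features, train_prices)
    ]
    # Sort by distance and keep the k nearest neighbors.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(price for _, price in nearest) / len(nearest)

# Hypothetical toy data: (horsepower, curb-weight) -> price.
toy_features = [(100, 2000), (110, 2100), (150, 3000), (200, 3500)]
toy_prices = [10000, 12000, 20000, 30000]

prediction = knn_predict(toy_features, toy_prices, query=(105, 2050), k=2)
print(prediction)
```

Because the distance is computed on raw values, a feature with a large range (here curb-weight) dominates the result, so features are usually rescaled before kNN is applied to real data.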

# Dataset used in this project
I will be using [UCI's Automobile Data Set](https://archive.ics.uci.edu/ml/datasets/automobile). It contains car data from the 1985 Ward's Automotive Yearbook and can be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data).

## Variables in the dataset

Variable|Meaning|Type
-|-|-
symboling|Risk symbol (A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.)|categorical(numeric)
normalized-losses|Normalized losses in use as compared to other cars.|continuous
make|make of car|categorical(string)
fuel-type|diesel or gas|categorical(string)
aspiration|std or turbo|categorical(string)
num-of-doors|four or two|categorical(string)
body-style|hardtop, wagon, sedan, hatchback, convertible|categorical(string)
drive-wheels|4wd, fwd, rwd|categorical(string)
engine-location|front or rear|categorical(string)
wheel-base|horizontal distance between the centers of the front and rear wheels|continuous
length|car length|continuous
width|car width|continuous
height|car height|continuous
curb-weight|total mass of a vehicle with standard equipment and all necessary operating consumables such as motor oil, transmission oil, brake fluid, coolant, air conditioning refrigerant, and sometimes a full tank of fuel, while not loaded with either passengers or cargo.|continuous
engine-type|dohc, dohcv, l, ohc, ohcf, ohcv, rotor.|categorical(string)
num-of-cylinders|eight, five, four, six, three, twelve, two.|categorical(string)
engine-size|size of car engine|continuous
fuel-system|1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.|categorical(string)
bore|diameter of each cylinder|continuous
stroke|stroke of engine|continuous
compression-ratio|compression ratio of engine|continuous
horsepower|car horsepower|continuous
peak-rpm|peak revolutions per minute|continuous
city-mpg|miles per gallon under city conditions|continuous
highway-mpg|miles per gallon on an open stretch of road|continuous
price|car price|continuous

# Reading in the data

In [1]:
import pandas as pd
col_names = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width", "height",
             "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
cars = pd.read_csv("imports-85.data", names = col_names)

cars.head()

| |symboling|normalized-losses|make|fuel-type|aspiration|num-of-doors|body-style|drive-wheels|engine-location|wheel-base|...|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|price|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|3|?|alfa-romero|gas|std|two|convertible|rwd|front|88.6|...|130|mpfi|3.47|2.68|9.0|111|5000|21|27|13495|
|1|3|?|alfa-romero|gas|std|two|convertible|rwd|front|88.6|...|130|mpfi|3.47|2.68|9.0|111|5000|21|27|16500|
|2|1|?|alfa-romero|gas|std|two|hatchback|rwd|front|94.5|...|152|mpfi|2.68|3.47|9.0|154|5000|19|26|16500|
|3|2|164|audi|gas|std|four|sedan|fwd|front|99.8|...|109|mpfi|3.19|3.4|10.0|102|5500|24|30|13950|
|4|2|164|audi|gas|std|four|sedan|4wd|front|99.4|...|136|mpfi|3.19|3.4|8.0|115|5500|18|22|17450|


In [2]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non

# Cleaning Data

* normalized-losses should be int, but has object dtype. The preview already shows the likely reason: missing data is encoded as "?".
* bore, stroke, horsepower, peak-rpm and price should be int or float, but have object dtype, possibly for the same reason as normalized-losses.
* num-of-cylinders can be converted to int.
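
The last point can be sketched as follows. This is one possible approach, not necessarily the one used later in this project: the names `cylinder_map`, `sample` and `cylinders_as_int` are hypothetical, and the small Series stands in for `cars["num-of-cylinders"]`. The mapping covers the seven spelled-out values listed in the variables table.

```python
import pandas as pd

# Word-to-number mapping for the values listed in the variables table.
cylinder_map = {
    "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "eight": 8, "twelve": 12,
}

# Hypothetical sample standing in for cars["num-of-cylinders"].
sample = pd.Series(["four", "six", "twelve", "four"], name="num-of-cylinders")

# Series.map looks up each value in the dict; unmapped values would become NaN.
cylinders_as_int = sample.map(cylinder_map)
print(cylinders_as_int.tolist())
```

Any value missing from the dictionary would turn into NaN, which makes unexpected spellings easy to spot with `isna()`.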

In [3]:
print("Values in bore, alphabetical descending order:")
print(cars["bore"].value_counts().sort_index(ascending=False).head(5))

print("\nValues in stroke, in alphabetical descending order:")
print(cars["stroke"].value_counts().sort_index(ascending=False).head(5))

print("\nValues in horsepower, in alphabetical descending order:")
print(cars["horsepower"].value_counts().sort_index(ascending=False).head(5))

print("\nValues in peak-rpm, in alphabetical descending order:")
print(cars["peak-rpm"].value_counts().sort_index(ascending=False).head(5))

print("\nValues in price, in alphabetical descending order:")
print(cars["price"].value_counts().sort_index(ascending=False).head(5))

Values in bore, alphabetical descending order:
?       4
3.94    2
3.80    2
3.78    8
3.76    1
Name: bore, dtype: int64

Values in stroke, in alphabetical descending order:
?       4
4.17    2
3.90    3
3.86    4
3.64    5
Name: stroke, dtype: int64

Values in horsepower, in alphabetical descending order:
?     2
97    5
95    7
94    2
92    4
Name: horsepower, dtype: int64

Values in peak-rpm, in alphabetical descending order:
?       2
6600    2
6000    9
5900    3
5800    7
Name: peak-rpm, dtype: int64

Values in price, in alphabetical descending order:
?       4
9995    1
9989    1
9988    1
9980    1
Name: price, dtype: int64


As suspected, missing values are encoded as "?". The next step cleans this up.
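
As an aside, the "?" markers could also be handled at load time rather than afterwards. The sketch below is an alternative to the cleaning step in this project, not a replacement for it: it passes `na_values="?"` to `pd.read_csv`, so pandas parses the affected columns as numbers with NaN for the missing entries. The three-row, three-column excerpt and the name `demo` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt with only three columns, for illustration.
raw = io.StringIO("?,111,13495\n164,102,13950\n164,?,?\n")

# na_values="?" tells pandas to read "?" as a missing value (NaN).
demo = pd.read_csv(
    raw,
    names=["normalized-losses", "horsepower", "price"],
    na_values="?",
)

print(demo.dtypes)
print(demo.isna().sum())
```

Columns that contain NaN are stored as float64, so a column that should be int would still need an explicit conversion once the missing values are dealt with.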

In [4]:
cols_to_clean = ["normalized-losses","bore", "stroke", "horsepower", "peak-rpm", "price"]