# Final Project

## Link to dataset
Here is a <a href="https://www.kaggle.com/lepchenkov/usedcarscatalog" target="_blank">link</a> to the dataset on Kaggle.

## Possible Ideas
- Predict whether the transmission is automatic or mechanical (manual) (*Perceptron, Logistic regression, SVM, k-NN*)
- Find similarities among the vehicles (*k-Means*)
- Predict the duration of the listing given the make and model of the car (*Linear regression*)
- Reduce dimensionality of feature_1 through feature_9 columns to improve predictions (*PCA*)
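
As a rough illustration of the last idea, PCA can be sketched with NumPy alone: center the columns, take the SVD, and project onto the leading components. The random 0/1 matrix below is a made-up stand-in for the `feature_1` through `feature_9` columns, and `pca_reduce` is a hypothetical helper, not part of the project code.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)                 # PCA works on centered columns
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = S**2 / np.sum(S**2)                 # fraction of variance per component
    return X_centered @ Vt[:n_components].T, explained[:n_components]

# Made-up stand-in for the nine 0/1 feature columns (20 cars, 9 features)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(20, 9))

reduced, explained = pca_reduce(X_demo, n_components=3)
print(reduced.shape)      # (20, 3)
print(explained.sum())    # share of the variance kept by the 3 components
```

Whether three components are enough for the real data is an open question; the explained-variance share is the usual number to look at when choosing.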

In [1]:
import numpy as np 
import pandas as pd 


In [2]:
df = pd.read_csv('data/cars.csv')
df.head()

| | manufacturer_name | model_name | transmission | color | odometer_value | year_produced | engine_fuel | engine_has_gas | engine_type | engine_capacity | ... | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | duration_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Subaru | Outback | automatic | silver | 190000 | 2010 | gasoline | False | gasoline | 2.5 | ... | True | True | True | False | True | False | True | True | True | 16 |
| 1 | Subaru | Outback | automatic | blue | 290000 | 2002 | gasoline | False | gasoline | 3.0 | ... | True | False | False | True | True | False | False | False | True | 83 |
| 2 | Subaru | Forester | automatic | red | 402000 | 2001 | gasoline | False | gasoline | 2.5 | ... | True | False | False | False | False | False | False | True | True | 151 |
| 3 | Subaru | Impreza | mechanical | blue | 10000 | 1999 | gasoline | False | gasoline | 3.0 | ... | False | False | False | False | False | False | False | False | False | 86 |
| 4 | Subaru | Legacy | automatic | black | 280000 | 2001 | gasoline | False | gasoline | 2.5 | ... | True | False | True | True | False | False | False | False | True | 7 |


In [3]:
# Check what kind of data is in each column
df.dtypes

manufacturer_name     object
model_name            object
transmission          object
color                 object
odometer_value         int64
year_produced          int64
engine_fuel           object
engine_has_gas          bool
engine_type           object
engine_capacity      float64
body_type             object
has_warranty            bool
state                 object
drivetrain            object
price_usd            float64
is_exchangeable         bool
location_region       object
number_of_photos       int64
up_counter             int64
feature_0               bool
feature_1               bool
feature_2               bool
feature_3               bool
feature_4               bool
feature_5               bool
feature_6               bool
feature_7               bool
feature_8               bool
feature_9               bool
duration_listed        int64
dtype: object

**Find Columns with Missing Values**

In [4]:
cols = df.columns[df.isna().sum()>0]
print(f'Columns with missing values: {cols.values}')

Columns with missing values: ['engine_capacity']


Fill in the missing values with the mean of the column.

In [5]:
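# Assign the result back: calling fillna(inplace=True) on a selected column may not update df in recent pandas versions
df['engine_capacity'] = df['engine_capacity'].fillna(df['engine_capacity'].mean())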
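
A small sketch of the same kind of fill on made-up data, with a check that no gaps remain afterwards. The `demo` frame and its values are invented for illustration. The median is computed alongside the mean only to show how an outlier pulls the two apart; the notebook itself uses the mean.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real column; 8.0 plays the role of an outlier
demo = pd.DataFrame({'engine_capacity': [1.6, 2.0, np.nan, 2.4, 8.0]})

print(demo.isna().sum())                        # missing values per column

mean_fill = demo['engine_capacity'].mean()      # NaN is skipped; pulled up by the outlier
median_fill = demo['engine_capacity'].median()  # less sensitive to the outlier

# Assign the filled column back rather than modifying it in place
demo['engine_capacity'] = demo['engine_capacity'].fillna(mean_fill)

print(mean_fill, median_fill)
print(demo['engine_capacity'].isna().sum())     # 0 once the gap is filled
```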

**Convert True/False Columns to 0/1**

In [6]:
for column in df.columns:
    if df.dtypes[column] == 'bool':
        df[column] = df[column].astype(int)  # np.int was removed in NumPy 1.24; the built-in int works
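
The same conversion can also be sketched without an explicit loop by selecting the boolean columns first. The `demo` frame below is made up for illustration and only borrows a few column names from the dataset.

```python
import pandas as pd

# Made-up frame mixing boolean and non-boolean columns
demo = pd.DataFrame({
    'has_warranty': [True, False, True],
    'engine_has_gas': [False, False, True],
    'price_usd': [10900.0, 5000.0, 2800.0],
})

bool_cols = demo.select_dtypes(include='bool').columns   # names of the boolean columns
demo = demo.astype({col: int for col in bool_cols})      # True -> 1, False -> 0

print(demo.dtypes)   # the two former bool columns are now integers
```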

Once that is finished, plot the data (for example, a histogram of each feature together with a correlation matrix) to check which features are highly correlated and which aren't.
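
The correlation check described above can be sketched without any plotting by computing the pairwise correlation matrix and listing the pairs above a cutoff. The data here is synthetic, and the 0.8 threshold is an arbitrary assumption, not a value taken from the project.

```python
import numpy as np
import pandas as pd

# Made-up numeric data: 'b' is built from 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

corr = demo.select_dtypes(include='number').corr()   # pairwise Pearson correlation

threshold = 0.8                                      # arbitrary cutoff for "highly correlated"
cols = corr.columns
# Visit each unordered pair once (upper triangle, diagonal excluded)
pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(pairs)
```

Pearson correlation only captures linear relationships, so a low value does not rule out a dependence of some other shape.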

**Convert Categorical Data to Numeric Data**

In [7]:
def categorical_to_num(column):
    "Replaces the categorical values in a DataFrame column with integer codes and returns the value-to-code dictionary"
    category = df[column].unique().tolist()          # Get the unique values from the column
    idx = range(len(category))                       # Create an index for each unique value from the column
    category_dict = dict(zip(category, idx))         # Create dictionary of unique values and corresponding index
    
    # https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict
    df[column] = df[column].map(category_dict)       # Assign the remapped column back to the DataFrame
    
    return category_dict                             # Returns a dictionary

In [8]:
# Dictionary to keep the value-to-index mapping of each column
category_dicts = {}

for column in df.columns:
    if df.dtypes[column] == 'O':                             # Check for columns containing string type values
        category_dicts[column] = categorical_to_num(column)  # Update column with int64 values
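
For comparison, `pd.factorize` numbers categories in order of first appearance, which is the same scheme as the dictionary built above. The sketch below uses a made-up `color` column and also builds a reverse lookup so that codes can be turned back into labels; `to_code` and `to_label` are illustrative names.

```python
import pandas as pd

# Made-up column standing in for one of the string columns
demo = pd.DataFrame({'color': ['silver', 'blue', 'red', 'blue', 'black']})

# factorize assigns codes in order of first appearance
codes, uniques = pd.factorize(demo['color'])
demo['color_code'] = codes

# Forward and reverse lookups, so codes can be decoded later
to_code = {label: code for code, label in enumerate(uniques)}
to_label = {code: label for label, code in to_code.items()}

print(demo['color_code'].tolist())                 # [0, 1, 2, 1, 3]
print(demo['color_code'].map(to_label).tolist())   # back to the original labels
```

Integer codes impose an arbitrary ordering on the categories (here `black` > `red`), which distance-based and linear models will treat as meaningful. One-hot encoding, for example with `pd.get_dummies`, is a common alternative when that matters.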

### Cleaned Data

In [9]:
df.head()

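| | manufacturer_name | model_name | transmission | color | odometer_value | year_produced | engine_fuel | engine_has_gas | engine_type | engine_capacity | ... | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | duration_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 190000 | 2010 | 0 | 0 | 0 | 2.5 | ... | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 16 |
| 1 | 0 | 0 | 0 | 1 | 290000 | 2002 | 0 | 0 | 0 | 3.0 | ... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 83 |
| 2 | 0 | 1 | 0 | 2 | 402000 | 2001 | 0 | 0 | 0 | 2.5 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 151 |
| 3 | 0 | 2 | 1 | 1 | 10000 | 1999 | 0 | 0 | 0 | 3.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 86 |
| 4 | 0 | 3 | 0 | 3 | 280000 | 2001 | 0 | 0 | 0 | 2.5 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 7 |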


In [10]:
# Check to ensure that there is no categorical data
df.dtypes

manufacturer_name      int64
model_name             int64
transmission           int64
color                  int64
odometer_value         int64
year_produced          int64
engine_fuel            int64
engine_has_gas         int64
engine_type            int64
engine_capacity      float64
body_type              int64
has_warranty           int64
state                  int64
drivetrain             int64
price_usd            float64
is_exchangeable        int64
location_region        int64
number_of_photos       int64
up_counter             int64
feature_0              int64
feature_1              int64
feature_2              int64
feature_3              int64
feature_4              int64
feature_5              int64
feature_6              int64
feature_7              int64
feature_8              int64
feature_9              int64
duration_listed        int64
dtype: object

### TODO 
Plot a histogram to check which features are the most important.
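
One rough, non-plotting way to approach this TODO is to rank the features by the absolute value of their correlation with a chosen target. This is only a sketch under assumptions: the data is synthetic, `price_usd` is picked as the target purely for illustration, and linear correlation is a crude proxy for importance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
odometer = rng.normal(size=n)
demo = pd.DataFrame({
    'odometer_value': odometer,
    'number_of_photos': rng.normal(size=n),
    # Made-up target driven mostly by the odometer column
    'price_usd': -3 * odometer + rng.normal(scale=0.5, size=n),
})

target = 'price_usd'
ranking = (
    demo.corr()[target]           # correlation of every column with the target
    .drop(target)                 # the target's correlation with itself is always 1
    .abs()                        # the sign does not matter for ranking
    .sort_values(ascending=False)
)
print(ranking)
```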

