# Vroom (Name subject to change)
Jupyter Notebook used to import, explore, and hopefully implement a proper model.

## Importing
Using the Pandas library in Python, we load the dataset from its .csv file.

## Exploring
### Overview
Using Pandas to get a general overview of the given data.

### Changing Localisation
Current locale: Deutsch (Deutschland/Germany).\
As a result, keywords will be identified and replaced with their English equivalents.

### NaNs/Missing Data
Using Pandas/NumPy to fill in missing data, or to remove unrecoverable rows (if their number is negligible).
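
As a rough sketch of the two options mentioned above (dropping versus filling), the cell below uses a small toy frame instead of the real dataset. The rows are made up for illustration, and filling with the most frequent value is only one possible strategy, not necessarily the one this notebook will end up using.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (rows are illustrative only)
toy = pd.DataFrame({
    "gearbox": ["manuell", "automatik", np.nan, "manuell"],
    "fuelType": ["benzin", np.nan, "diesel", "benzin"],
})

# Share of missing values per column, to judge whether it is negligible
missing_share = toy.isna().mean()
print(missing_share)

# Option A: drop rows where a given column is missing
dropped = toy.dropna(subset=["gearbox"])

# Option B: fill missing values with the most frequent value (the mode)
filled = toy.copy()
filled["fuelType"] = filled["fuelType"].fillna(filled["fuelType"].mode()[0])
```

Dropping keeps only observed values but loses rows, while filling keeps every row at the cost of introducing guessed values.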

In [3]:
# Importing libraries
import numpy as np
import pandas as pd

In [4]:
# Loading data

# 1. Using relative path
# (I'm using this due to the file/folder setup within my git repo)
carDat = pd.read_csv("../dataset/autos.csv")

# 2. Use this (uncomment) if .csv is in the same folder
# in addition to commenting the above line of code
#carDat = pd.read_csv("./autos.csv")

# Showing Data
Using head/tail/describe/info

Head: Shows first $n$ rows.
Tail: Shows last $n$ rows.

Describe: Runs basic statistics on numerical columns.

Info: Shows an overview of the dataset (column names, dtypes, and non-null counts).

In [None]:
# Dictating how many rows to view
n = 5

# Using the display() to show
# multiple outputs from the same cell

# Getting the first n rows
display(carDat.head(n))

# Getting the last n rows
display(carDat.tail(n))

In [None]:
# Run this cell (and one below) instead of the above
# For separate cell outputs

# First n rows
carDat.head(n)

In [None]:
# Last n rows
carDat.tail(n)

In [59]:
# Removing the 'index' column:
# Pandas already provides its own index,
# so this column adds no value
carDat.drop(columns = 'index', inplace = True)

In [None]:
# Describing numerical columns
carDat.describe()

In [None]:
# Printing Dataframe info
carDat.info()

# Selecting keywords that matter
Looking at the columns, some values repeat regularly; these are treated as keywords to be translated later.

Getting unique values from non-numerical columns.\
Because the data is in German, switching to English
is preferable,\
unless German is explicitly required.

Noted Columns: [seller, offerType, abtest, vehicleType, gearbox, fuelType, notRepairedDamage]
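
One way to check how regularly a value repeats is `value_counts`. The sketch below runs it on a toy frame with two of the noted columns; the rows are invented for illustration, and `counts_by_col` is a name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy frame with two of the noted columns (rows are illustrative only)
toy = pd.DataFrame({
    "gearbox": ["manuell", "automatik", np.nan, "manuell"],
    "notRepairedDamage": [np.nan, "ja", "nein", "nein"],
})

# Count how often each value occurs; dropna=False also counts NaN
counts_by_col = {}
for col in toy.columns:
    counts_by_col[col] = toy[col].value_counts(dropna=False)
    print(counts_by_col[col])
```

Columns with only a handful of distinct, frequently repeated values are the likely candidates for keyword translation.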

In [None]:
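# Getting list of columns within the dataframe
# whilst excluding numerical columns
# ('number' covers floats as well as int64)
non_numeric = carDat.select_dtypes(exclude='number')
col_names_obj = list(non_numeric.columns)

print(col_names_obj)

# info() prints its own summary and returns None,
# so it is called on its own rather than inside print()
non_numeric.info()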

In [9]:
# Manually excluding object columns that hold
# free-form values (e.g. dates) and
# selecting columns that may have a keyword in German
col_names_unique = ['seller', 'offerType', 'abtest', 'vehicleType', 'gearbox', 'fuelType', 'notRepairedDamage']

In [8]:
# As seen from the above output cell
# we'll only need to extract unique
# words from object columns

for col in col_names_unique:
    print(carDat[col].unique())

['privat' 'gewerblich']
['Angebot' 'Gesuch']
['test' 'control']
[nan 'coupe' 'suv' 'kleinwagen' 'limousine' 'cabrio' 'bus' 'kombi'
 'andere']
['manuell' 'automatik' nan]
['benzin' 'diesel' nan 'lpg' 'andere' 'hybrid' 'cng' 'elektro']
[nan 'ja' 'nein']
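
With the unique values listed above, the translation step could be sketched as a per-column mapping applied with `Series.replace`. The English wording in `translations` is an assumption made for this example (the dataset does not prescribe it), the mapping is deliberately partial, and the toy frame stands in for `carDat`.

```python
import numpy as np
import pandas as pd

# German -> English mapping per column.
# The English wording is an assumption made for this sketch.
translations = {
    "seller": {"privat": "private", "gewerblich": "commercial"},
    "offerType": {"Angebot": "offer", "Gesuch": "request"},
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "fuelType": {"benzin": "petrol", "elektro": "electric", "andere": "other"},
    "notRepairedDamage": {"ja": "yes", "nein": "no"},
}

# Toy frame standing in for the real dataset (rows are illustrative only)
toy = pd.DataFrame({
    "seller": ["privat", "gewerblich", "privat"],
    "gearbox": ["manuell", np.nan, "automatik"],
    "fuelType": ["benzin", "diesel", "elektro"],
})

# Replace keywords column by column; values without a mapping
# (including NaN) are left unchanged
translated = toy.copy()
for col, mapping in translations.items():
    if col in translated.columns:
        translated[col] = translated[col].replace(mapping)

print(translated)
```

`replace` is used here rather than `map` because `map` would turn every value missing from the dictionary into NaN, which would hide keywords that have not been translated yet.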
