# DATA 1 Practical 2 - Questions

Simos Gerasimou


## Classic Cars & Co

Classic Cars & Co is a UK company that has a large collection of classic cars from the 1980s. 

DataVision (the company where you work as a Data Scientist) has been contracted to analyse the data available for the cars and provide insights into their different characteristics (e.g., speed, price). 

This Jupyter Notebook will be presented to the main stakeholders of Classic Cars & Co, who have limited knowledge of data science. Your findings should therefore be complemented by a suitable justification explaining what you observe and, when applicable, what the observation means and, possibly, why it occurs. The analysis, along with the explanation, will help them decide whether to invest more to expand their collection.

### **Important Information**

(1) To answer these exercises, you **must first read Chapter 2: Introduction to NumPy from the Python Data Science Handbook** (https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html)


(2) For each question (task) a description is provided accompanied (most of the time) by two cells: one for writing the Python code and another for providing the justification. Feel free to add more cells if you feel they are needed, but keep the cells corresponding to the same question close by.

**Hint 1**: If you have difficulty solving a task, look at Chapter 2 of the Python Data Science Handbook.

**Hint 2**: Solving each task using NumPy should require fewer than 10 lines of code.

#### **T1) Explore the dataset and for each column write its name, data type (categorical/numerical - nominal,ordinal,discrete,continuous) and its meaning (i.e., what does it capture?)**

* You may want to open the CSV file using a text editor (e.g., Notepad) or a spreadsheet editor (e.g., Excel)

**Write your answer here (the first is given)**
* make: Categorical (Nominal) - the manufacturer (brand) of the car
* fueltype: Categorical (Nominal) - the type of fuel the car takes
* numofdoors: Categorical (Ordinal) - the number of doors the car has
* bodystyle: Categorical (Nominal) - the style/shape of the car body
* drivewheels: Categorical (Nominal) - which wheels drive the car
* wheelbase: Numerical (Continuous) - the distance between the wheel axles
* length: Numerical (Continuous) - how long the car is
* width: Numerical (Continuous) - how wide the car is
* height: Numerical (Continuous) - how tall the car is
* numofcylinders: Categorical (Ordinal) - how many cylinders the engine has
* enginesize: Numerical (Discrete) - the size of the engine
* horsepower: Numerical (Discrete) - the engine power
* citympg: Numerical (Discrete) - miles per gallon of fuel in the city
* highwaympg: Numerical (Discrete) - miles per gallon of fuel on a highway
* price: Numerical (Discrete) - the price of the car

### 1) Reading dataset

The classic cars dataset is available on VLE (look for classicCars.csv in the Practicals section)

In [83]:
#Using NumPy to read the dataset
import numpy as np
import math
#Define the path to the dataset
data_path = "ClassicCars.csv"
#Define the type of each dataset column.
#This is needed because a plain NumPy array holds a single data type, so a file whose columns have different types cannot be read into it directly
#Hence, we are using structured arrays.
#But we will soon move to Pandas, which makes data manipulation easier
types = ['U20', 'U10', 'U5', 'U20', 'U3', 'f4', 'f4', 'f4', 'f4', 'U10', 'i4', 'i4', 'i4', 'i4', 'i4']
#Read the dataset
data = np.genfromtxt(data_path, dtype=types, delimiter=',', names=True)

**Structured Arrays**
* Read more about structured arrays:
  * https://jakevdp.github.io/PythonDataScienceHandbook/02.09-structured-data-numpy.html
  * https://numpy.org/doc/stable/user/basics.rec.html
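
As a minimal sketch of how a structured array behaves, the cell below reads a tiny in-memory CSV in the same way as the dataset above. The three rows, the reduced set of columns and the name `cars` are hypothetical and only stand in for the real file.

```python
import io
import numpy as np

# Hypothetical three-row CSV standing in for the real dataset
csv_text = """make,fueltype,length,price
audi,gas,176.6,13950
bmw,gas,189.0,24565
mazda,diesel,177.8,10795
"""

# One type per column: Unicode strings (U), 32-bit float (f4), 32-bit int (i4)
types = ['U20', 'U10', 'f4', 'i4']
cars = np.genfromtxt(io.StringIO(csv_text), dtype=types, delimiter=',', names=True)

print(cars.dtype.names)  # field (column) names taken from the header row
print(cars.shape)        # one-dimensional: one record per CSV row
print(cars[0])           # a single record holds every field of one car
print(cars['price'])     # a field name selects a whole column
```

Indexing with an integer returns one record, while indexing with a field name returns all values of that column.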

### Analysing the dataset


#### **Extracting the column names**

In [84]:
data.dtype.names

('make',
 'fueltype',
 'numofdoors',
 'bodystyle',
 'drivewheels',
 'wheelbase',
 'length',
 'width',
 'height',
 'numofcylinders',
 'enginesize',
 'horsepower',
 'citympg',
 'highwaympg',
 'price')

#### **Extracting the shape of the array**

In [85]:
print("The shape of the array is: ", data.shape)

The shape of the array is:  (205,)


In [86]:
print(len(data))

205


#### **T2) What do you see?**
* How many entries does the array have?
* What does each entry include? 
* Hint: Print the elements of an entry


**Write your answer here**

* The array has 205 entries
* Each entry includes these fields:

In [87]:
data.dtype.names

('make',
 'fueltype',
 'numofdoors',
 'bodystyle',
 'drivewheels',
 'wheelbase',
 'length',
 'width',
 'height',
 'numofcylinders',
 'enginesize',
 'horsepower',
 'citympg',
 'highwaympg',
 'price')

#### **Extracting the entries of a column given its name**

* By specifying the name of a column, you can get all the entries within the array for this column (reminder: you are using Structured Arrays)


In [88]:
#Print the entries within the 'make' column
print(data['make'])

['alfa-romero' 'alfa-romero' 'alfa-romero' 'audi' 'audi' 'audi' 'audi'
 'audi' 'audi' 'audi' 'bmw' 'bmw' 'bmw' 'bmw' 'bmw' 'bmw' 'bmw' 'bmw'
 'chevrolet' 'chevrolet' 'chevrolet' 'dodge' 'dodge' 'dodge' 'dodge'
 'dodge' 'dodge' 'dodge' 'dodge' 'dodge' 'honda' 'honda' 'honda' 'honda'
 'honda' 'honda' 'honda' 'honda' 'honda' 'honda' 'honda' 'honda' 'honda'
 'isuzu' 'isuzu' 'isuzu' 'isuzu' 'jaguar' 'jaguar' 'jaguar' 'mazda'
 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda'
 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mercedes-benz'
 'mercedes-benz' 'mercedes-benz' 'mercedes-benz' 'mercedes-benz'
 'mercedes-benz' 'mercedes-benz' 'mercedes-benz' 'mercury' 'mitsubishi'
 'mitsubishi' 'mitsubishi' 'mitsubishi' 'mitsubishi' 'mitsubishi'
 'mitsubishi' 'mitsubishi' 'mitsubishi' 'mitsubishi' 'mitsubishi'
 'mitsubishi' 'mitsubishi' 'nissan' 'nissan' 'nissan' 'nissan' 'nissan'
 'nissan' 'nissan' 'nissan' 'nissan' 'nissan' 'nissan' 'nissan' 'nissan'
 'nissan' 'nissa

#### **T3) Extract the bodystyles within the dataset**


In [89]:
#Write your answer here
print(data["bodystyle"])

['convertible' 'convertible' 'hatchback' 'sedan' 'sedan' 'sedan' 'sedan'
 'wagon' 'sedan' 'hatchback' 'sedan' 'sedan' 'sedan' 'sedan' 'sedan'
 'sedan' 'sedan' 'sedan' 'hatchback' 'hatchback' 'sedan' 'hatchback'
 'hatchback' 'hatchback' 'hatchback' 'sedan' 'sedan' 'sedan' 'wagon'
 'hatchback' 'hatchback' 'hatchback' 'hatchback' 'hatchback' 'hatchback'
 'sedan' 'wagon' 'hatchback' 'hatchback' 'sedan' 'sedan' 'sedan' 'sedan'
 'sedan' 'sedan' 'sedan' 'hatchback' 'sedan' 'sedan' 'sedan' 'hatchback'
 'hatchback' 'hatchback' 'sedan' 'sedan' 'hatchback' 'hatchback'
 'hatchback' 'hatchback' 'hatchback' 'sedan' 'hatchback' 'sedan' 'sedan'
 'hatchback' 'sedan' 'sedan' 'sedan' 'wagon' 'hardtop' 'sedan' 'sedan'
 'convertible' 'sedan' 'hardtop' 'hatchback' 'hatchback' 'hatchback'
 'hatchback' 'hatchback' 'hatchback' 'hatchback' 'hatchback' 'hatchback'
 'hatchback' 'sedan' 'sedan' 'sedan' 'sedan' 'sedan' 'sedan' 'sedan'
 'sedan' 'wagon' 'sedan' 'hatchback' 'sedan' 'wagon' 'hardtop' 'hatchback'
 'seda

### What do the car prices look like?


#### **T4) Calculate the range of car prices for the entire dataset**


In [90]:
#Write your answer here
minPrice = np.min(data["price"])
maxPrice = np.max(data["price"])
rangePrice = maxPrice - minPrice
print(rangePrice)

40282
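
The range can also be obtained in one call with `np.ptp` ("peak to peak"). The sketch below uses a hypothetical five-value sample whose smallest and largest values match the minimum and maximum prices reported in this notebook; the three middle values are made up.

```python
import numpy as np

# Hypothetical sample: only the first and last values come from the notebook output
prices = np.array([5118, 7295, 10345, 16500, 45400])

# Range computed explicitly and with NumPy's peak-to-peak helper
price_range = np.max(prices) - np.min(prices)
ptp_range = np.ptp(prices)

print(price_range, ptp_range)
```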


#### **T5) Calculate the min, max, mean and median prices of the cars**


In [91]:
#Write your answer here
meanPrice = np.mean(data["price"])
medianPrice = np.median(data["price"])
print(f"MIN PRICE: ${minPrice}, MAX PRICE: ${maxPrice}, MEAN PRICE: ${meanPrice}, MEDIAN PRICE: ${medianPrice}")

MIN PRICE: $5118, MAX PRICE: $45400, MEAN PRICE: $13300.239024390245, MEDIAN PRICE: $10345.0


#### **T6) Considering the values calculated above, what insights can you extract? Where do you think the majority of car prices will be clustered?**


**Write your answer here**

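The mean price (about 13,300) is noticeably higher than the median (10,345), which suggests a right-skewed distribution: a small number of expensive cars (up to 45,400) pull the mean upwards. By definition, half of the cars cost 10,345 or less, so the majority of prices should be clustered towards the lower end of the range, roughly between the minimum of 5,118 and the mean, rather than spread evenly up to the maximum.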
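
The mean reported above is higher than the median, which is what happens when a few large values stretch the upper end of a distribution. The sketch below illustrates this effect on hypothetical prices; the numbers are invented and are not taken from the dataset.

```python
import numpy as np

# Hypothetical prices: most are modest, two are very expensive
prices = np.array([6000, 7000, 8000, 9000, 10000, 11000, 12000, 35000, 45000])

mean_price = prices.mean()
median_price = np.median(prices)

# Count how many cars are cheaper than the mean
below_mean = np.count_nonzero(prices < mean_price)

print(f"mean={mean_price:.0f}, median={median_price:.0f}, below mean={below_mean} of {len(prices)}")
```

In this made-up sample the two expensive cars lift the mean well above the median, while most cars remain cheaper than the mean.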

#### **T7) Write code to calculate the standard deviation for the car prices. Then use the corresponding NumPy function to confirm the correctness of your calculation**



In [92]:
#Write your answer here
diffSquaredSum = 0
for i in data["price"]:
    diffSquaredSum += math.pow((meanPrice - i), 2)
calculatedSD = math.sqrt(diffSquaredSum / len(data["price"]))
numpySD = np.std(data["price"])
print(f"Calculated Standard Deviation: {calculatedSD}, NumPy Standard Deviation: {numpySD}")

Calculated Standard Deviation: 7969.541401038539, NumPy Standard Deviation: 7969.54140103854
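
The same calculation can be written without an explicit loop. The sketch below uses a hypothetical price sample and also shows the `ddof` argument of `np.std`: the default `ddof=0` divides by n (population standard deviation, as in the loop above), whereas `ddof=1` divides by n - 1 (sample standard deviation).

```python
import numpy as np

# Hypothetical price sample
prices = np.array([5118.0, 7295.0, 10345.0, 16500.0, 45400.0])

# Square root of the mean squared deviation from the mean
manual_sd = np.sqrt(np.mean((prices - prices.mean()) ** 2))

numpy_sd = np.std(prices)            # ddof=0 by default: divides by n
sample_sd = np.std(prices, ddof=1)   # divides by n - 1

print(manual_sd, numpy_sd, sample_sd)
```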


#### **T8) Find the details of cars with the smallest and largest car volumes**
* Hint: see how to calculate the volume of a car https://info.japanesecartrade.com/content-item/297-what-is-m3-cubic-meter-size-of-a-vehicle


In [93]:
#Write your answer here
minM3 = data["length"][0] * data["width"][0] * data["height"][0]
minM3ID = 0
maxM3 = data["length"][0] * data["width"][0] * data["height"][0]
maxM3ID = 0
for i in range(1, len(data["length"])):
    currentM3 = data["length"][i] * data["width"][i] * data["height"][i]
    if currentM3<minM3:
        minM3 = currentM3
        minM3ID = i  #remember the row of the smallest volume seen so far
    if currentM3>maxM3:
        maxM3 = currentM3
        maxM3ID = i  #remember the row of the largest volume seen so far
def printCarDetails(idx, m3):
    #Index every field with the idx argument rather than the loop variable i
    print(f"Make: {data['make'][idx]}, Fuel Type: {data['fueltype'][idx]}, Body Style: {data['bodystyle'][idx]}, Drive Wheel: {data['drivewheels'][idx]}, Wheel Base: {data['wheelbase'][idx]}, Length: {data['length'][idx]}, Width: {data['width'][idx]}, Height: {data['height'][idx]}, M3: {m3}, Num Cylinders: {data['numofcylinders'][idx]}, Engine Size: {data['enginesize'][idx]}, Horse Power: {data['horsepower'][idx]}, City MPG: {data['citympg'][idx]}, Highway MPG: {data['highwaympg'][idx]}, Price: {data['price'][idx]}")
print("CAR WITH THE SMALLEST M3")
printCarDetails(minM3ID, minM3)
print("CAR WITH THE LARGEST M3")
printCarDetails(maxM3ID, maxM3)
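
The volume search can also be expressed with array arithmetic and `np.argmin`/`np.argmax`. This is a sketch on a hypothetical three-car structured array; the labels `car_a` to `car_c` and all dimensions are invented. The product is expressed in cubed units of whatever the length, width and height columns are measured in.

```python
import numpy as np

# Hypothetical cars with made-up dimensions
cars = np.array(
    [('car_a', 140.0, 60.0, 53.0),
     ('car_b', 176.0, 66.0, 54.0),
     ('car_c', 208.0, 72.0, 57.0)],
    dtype=[('make', 'U20'), ('length', 'f4'), ('width', 'f4'), ('height', 'f4')])

# Element-wise product gives one volume per car
volume = cars['length'] * cars['width'] * cars['height']

# argmin/argmax return the row index, which selects the full record
smallest = cars[np.argmin(volume)]
largest = cars[np.argmax(volume)]

print("Smallest:", smallest, volume.min())
print("Largest:", largest, volume.max())
```

Because the index selects a whole record, all details of the car are available without looking up each field separately.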


#### **T9) Find the different types of bodystyles for the cars in the dataset**

* Hint: You may want to check: https://numpy.org/doc/stable/reference/generated/numpy.unique.html

In [94]:
#Write your answer here
print(np.unique(data["bodystyle"]))

['convertible' 'hardtop' 'hatchback' 'sedan' 'wagon']


#### **T10) Find the number of different car *brands* (makes)**


In [95]:
#Write your answer here
differentCarBrands = len(np.unique(data["make"]))
print(f"There are {differentCarBrands} different car brands")

There are 22 different car brands


#### **T11) Find the engine size and the horsepower for the most and least efficient cars when driven in the city and the highway (i.e., the cars with the smallest and largest difference in fuel consumption when driven in the city and the highway)**

In [96]:
#Write your answer here
minCityMPGID = np.argmin(data["citympg"])
maxCityMPGID = np.argmax(data["citympg"])
minHighwayMPGID = np.argmin(data["highwaympg"])
maxHighwayMPGID = np.argmax(data["highwaympg"])
def outputEngineAndHorsepower(ID):
    print(f"Engine Size: {data['enginesize'][ID]}, Horse Power: {data['horsepower'][ID]}")
print("Minimum City MPG:")
outputEngineAndHorsepower(minCityMPGID)
print("\nMaximum City MPG:")
outputEngineAndHorsepower(maxCityMPGID)
print("\n\nMinimum Highway MPG:")
outputEngineAndHorsepower(minHighwayMPGID)
print("\nMaximum Highway MPG:")
outputEngineAndHorsepower(maxHighwayMPGID)

Minimum City MPG:
Engine Size: 326, Horse Power: 262

Maximum City MPG:
Engine Size: 92, Horse Power: 58


Minimum Highway MPG:
Engine Size: 308, Horse Power: 184

Maximum Highway MPG:
Engine Size: 92, Horse Power: 58
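
The task wording defines efficiency through the difference between city and highway consumption, whereas the cell above reports the minimum and maximum of each column separately. One possible reading of the task is sketched below on hypothetical data; the three records are invented, and which end of the gap counts as "most efficient" is an interpretation.

```python
import numpy as np

# Hypothetical cars: engine size, horsepower, city mpg, highway mpg
cars = np.array(
    [(90, 60, 45, 50),
     (300, 250, 14, 17),
     (140, 110, 19, 27)],
    dtype=[('enginesize', 'i4'), ('horsepower', 'i4'),
           ('citympg', 'i4'), ('highwaympg', 'i4')])

# Absolute gap between highway and city consumption for each car
gap = np.abs(cars['highwaympg'] - cars['citympg'])

smallest_gap = np.argmin(gap)  # row with the most similar city/highway mpg
largest_gap = np.argmax(gap)   # row with the most different city/highway mpg

for label, idx in (("Smallest gap", smallest_gap), ("Largest gap", largest_gap)):
    print(f"{label}: Engine Size: {cars['enginesize'][idx]}, Horse Power: {cars['horsepower'][idx]}")
```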


#### **T12) Find the make with the largest number of cars and how many they are**

In [97]:
#Write your answer here
counts = [[],[]]
for i in np.unique(data["make"]):
    counts[0].append(i)
    counts[1].append(np.count_nonzero(data["make"]==i))
print(counts)
numArr = np.array(counts[1])
print(numArr)
maxIndex = np.argmax(numArr)
print(maxIndex)
print(f"{counts[0][maxIndex]} has the most cars, with {counts[1][maxIndex]} cars.")

[['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda', 'isuzu', 'jaguar', 'mazda', 'mercedes-benz', 'mercury', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'renault', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo'], [3, 7, 8, 3, 9, 13, 4, 3, 17, 8, 1, 13, 18, 11, 7, 5, 2, 6, 12, 32, 12, 11]]
[ 3  7  8  3  9 13  4  3 17  8  1 13 18 11  7  5  2  6 12 32 12 11]
19
toyota has the most cars, with 32 cars.
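
`np.unique` can return the counts directly through its `return_counts` argument, which avoids the explicit loop. A minimal sketch on a hypothetical list of makes:

```python
import numpy as np

# Hypothetical 'make' column
makes = np.array(['toyota', 'audi', 'toyota', 'bmw', 'toyota', 'audi'])

# Sorted unique values and how often each one occurs
names, counts = np.unique(makes, return_counts=True)

# Position of the largest count identifies the most frequent make
top = np.argmax(counts)
print(f"{names[top]} has the most cars, with {counts[top]} cars.")
```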


#### **T13) Find how many cars have a wheel base greater than 100**

* Hint: See https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html

In [98]:
#Write your answer here
print(np.count_nonzero(data["wheelbase"]>100))

63


#### **T14) Find if there are any convertible cars that cost less than £15000**

* Hint: See https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html

In [99]:
#Write your answer here
#Combine both conditions into one mask first, then count the matches; a count above 0 means such cars exist
print(np.sum((data["bodystyle"]=="convertible") & (data["price"]<15000)))
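
When two conditions are combined, each comparison needs its own parentheses because `&` binds more tightly than `==` and `<`, and the aggregate (`np.any`, `np.sum` or `np.count_nonzero`) has to wrap the combined mask. The sketch below uses hypothetical arrays rather than the dataset.

```python
import numpy as np

# Hypothetical body styles and prices
bodystyle = np.array(['convertible', 'sedan', 'convertible', 'hatchback'])
price = np.array([12000, 9000, 35000, 7000])

# Element-wise AND of the two boolean masks
mask = (bodystyle == 'convertible') & (price < 15000)

any_match = np.any(mask)             # is there at least one such car?
num_matches = np.count_nonzero(mask) # how many such cars are there?

print(any_match, num_matches)
```

`np.any` answers the yes/no question directly, while the count shows how many cars satisfy both conditions.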


#### **T15) Calculate the interquartile range for the price of all cars**

In [100]:
#Write your answer here
lq = np.percentile(data["price"], 25, interpolation="midpoint")
uq = np.percentile(data["price"], 75, interpolation="midpoint")
iqr = uq - lq
print(iqr)

8715.0
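
Both quartiles can be requested in a single call by passing a list of percentiles. The sketch below uses hypothetical values and NumPy's default linear interpolation. Note that in NumPy 1.22 and later the `interpolation` keyword used above is deprecated in favour of `method`.

```python
import numpy as np

# Hypothetical values
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Lower and upper quartile in one call (default linear interpolation)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

print(q1, q3, iqr)
```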


#### **T16) Calculate the 50th percentile range for the horsepower of all cars. Which value is the 50th percentile equal to?**

In [101]:
#Write your answer here
#The task asks about horsepower; the 50th percentile of a column is its median
percentile50 = np.percentile(data["horsepower"], 50, interpolation="midpoint")
print(percentile50)
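
The 50th percentile is the value below which half of the observations fall, which is the median. A short check on hypothetical horsepower values:

```python
import numpy as np

# Hypothetical horsepower values
horsepower = np.array([58, 68, 95, 114, 184, 262])

p50 = np.percentile(horsepower, 50)
median = np.median(horsepower)

print(p50, median)
```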


### Ideas for practicing further at home

* Find the engine and horsepower of 4wd cars
* Find whether diesel or gas cars are more efficient in the city/highway
* Any other analysis that you think could generate some useful insight.
