# DATA 1 Practical 2 - Model Answers

Simos Gerasimou


## Classic Cars & Co

Classic Cars & Co is a UK company that has a large collection of classic cars from the 1980s. 

DataVision (the company where you work as a Data Scientist) has been contracted to analyse the data available for the cars and provide insights into their different characteristics (e.g., speed, price). 

This Jupyter Notebook will be presented to the main stakeholders of Classic Cars & Co, who have limited knowledge of data science. Your findings should therefore be accompanied by a suitable justification explaining what you observe and, when applicable, what the observation means and, possibly, why it occurs. The analysis and its explanation will help them decide whether to invest more in expanding their collection.

### **Important Information**

(1) To answer these exercises, you **must first read Chapter 2: Introduction to NumPy from the Python Data Science Handbook** (https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html)


(2) For each question (task) a description is provided accompanied (most of the time) by two cells: one for writing the Python code and another for providing the justification. Feel free to add more cells if you feel they are needed, but keep the cells corresponding to the same question close by.

**Hint 1**: If you find difficulties in solving a task, look at Chapter 2 from the Python Data Science Handbook.

**Hint 2**: Solving each task using NumPy should require fewer than 10 lines of code.

#### **T1) Explore the dataset and for each column write its name, data type (categorical/numerical - nominal,ordinal,discrete,continuous) and its meaning (i.e., what does it capture?)**

* You may want to open the CSV file using a text editor (e.g., Notepad) or a spreadsheet editor (e.g., Excel)

**Write your answer here (the first is given)**

* **make**: Categorical (Nominal) - The make (brand) of the car
* **fueltype**: Categorical (Nominal) - The fuel consumed by the car (gas,diesel)
* **numofdoors**: Categorical (Ordinal) - The number of doors (two,four); Note: this is not Numerical (Discrete)
* **bodystyle**: Categorical (Nominal) - The bodystyle of the car (convertible, hardtop, hatchback, sedan, wagon)
* **drivewheels**: Categorical (Nominal) - The drivewheels of the car (4wd, fwd, rwd), i.e., 4-wheel-drive, front-wheel-drive, rear-wheel-drive
* **wheelbase**: Numerical (Continuous) - The car's wheel base
* **length**: Numerical (Continuous) - The car's length
* **width**: Numerical (Continuous) - The car's width
* **height**: Numerical (Continuous) - The car's height
* **numofcylinders**: Categorical (Ordinal) - Number of cylinders (two, three, four, five, six, eight, twelve)
* **enginesize**: Numerical (Discrete) - The car's engine size
* **horsepower**: Numerical (Discrete) - The car's horsepower
* **citympg**: Numerical (Discrete) - Miles per gallon achieved by the car in the city
* **highwaympg**: Numerical (Discrete) - Miles per gallon achieved by the car on the highway
* **price**: Numerical (Discrete) - The car's price

**Other key observations**

1) The vocabulary used is targeted at North America (e.g., gallons). This can happen, for instance, if the software engineering team that designed the software is USA-based or if the company has purchased COTS software. Hence, additional effort may be needed for a UK data scientist to understand this dataset and interpret it for any UK stakeholders.

2) The numerical data lack units. Therefore, we should assume the units most commonly used in the USA (miles, gallons, inches, cubic inches, dollars).

3) Car model information is missing, i.e., only the make is specified, not the model (e.g., Audi, not Audi A4). Thus, the table omits a key piece of information.


### 1) Reading the dataset

The classic cars dataset is available on VLE (look for classicCars.csv in the Practicals section)

In [1]:
#Using NumPy to read the dataset
import numpy as np
#Define the path to the dataset
data_path = "ClassicCars.csv"
#Define the type of each dataset column. 
#This is needed because a plain NumPy array holds a single data type, whereas the CSV columns mix text and numbers
#Hence, we are using structured arrays. 
#But, we will soon move to Pandas which makes data manipulation easier
types = ['U20', 'U10', 'U5', 'U20', 'U3', 'f4', 'f4', 'f4', 'f4', 'U10', 'i4', 'i4', 'i4', 'i4', 'i4']
#Read the dataset
data = np.genfromtxt(data_path, dtype=types, delimiter=',', names=True)

**Structured Arrays**
* Read more about structured arrays:
  * https://jakevdp.github.io/PythonDataScienceHandbook/02.09-structured-data-numpy.html
  * https://numpy.org/doc/stable/user/basics.rec.html
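
As a minimal sketch of how a structured array behaves, the snippet below builds a tiny array by hand rather than reading the CSV file. The four rows are copied from outputs shown in this notebook; keeping only four of the fifteen columns, and the names `car_dtype` and `cars`, are choices made for this illustration.

```python
import numpy as np

# Each field has a name and its own data type (text, float, integer)
car_dtype = [('make', 'U20'), ('bodystyle', 'U20'),
             ('wheelbase', 'f4'), ('price', 'i4')]

cars = np.array([
    ('alfa-romero', 'convertible', 88.6, 13495),
    ('volkswagen', 'convertible', 94.5, 11595),
    ('mercedes-benz', 'sedan', 120.9, 40960),
    ('chevrolet', 'hatchback', 88.4, 5151),
], dtype=car_dtype)

print(cars.shape)        # one dimension: one record per car
print(cars.dtype.names)  # the field (column) names
print(cars['price'])     # a whole column, selected by name
print(cars[0])           # a whole record (row), selected by position
print(cars[0]['make'])   # a single value: record first, then field
```

Selecting by name returns an ordinary one-dimensional array, which is why functions such as `np.min` or `np.mean` can be applied directly to a column.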

### Analysing the dataset


#### **Extracting the column names**

In [2]:
data.dtype.names

('make',
 'fueltype',
 'numofdoors',
 'bodystyle',
 'drivewheels',
 'wheelbase',
 'length',
 'width',
 'height',
 'numofcylinders',
 'enginesize',
 'horsepower',
 'citympg',
 'highwaympg',
 'price')

#### **Extracting the shape of the array**

In [3]:
print("The shape of the array is: ", data.shape)

The shape of the array is:  (205,)


#### **T2) What do you see?**
* How many entries does the array have?
* What does each entry include? 
* Hint: Print the elements of an entry


**Write your answer here**

* The array (dataset) has 205 entries, i.e., it holds information for 205 cars
* Each entry is a record holding the values of the 15 columns (make, fueltype, ..., price) for one car that the company has purchased
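
Following the hint, one possible way to print the elements of an entry is sketched below. To stay runnable without the CSV file, it rebuilds a single record by hand: the column names and types are the ones used in this notebook, and the row is copied from an output shown later in this notebook. The variable names are illustrative.

```python
import numpy as np

# Column names and types as used when reading the dataset in this notebook
names = ['make', 'fueltype', 'numofdoors', 'bodystyle', 'drivewheels',
         'wheelbase', 'length', 'width', 'height', 'numofcylinders',
         'enginesize', 'horsepower', 'citympg', 'highwaympg', 'price']
types = ['U20', 'U10', 'U5', 'U20', 'U3', 'f4', 'f4', 'f4', 'f4', 'U10',
         'i4', 'i4', 'i4', 'i4', 'i4']

# A single record (one car), entered by hand for this sketch
row = ('alfa-romero', 'gas', 'two', 'convertible', 'rwd', 88.6, 168.8, 64.1,
       48.8, 'four', 130, 111, 21, 27, 13495)
cars = np.array([row], dtype=list(zip(names, types)))

# Print every field of the first entry next to its column name
entry = cars[0]
for name in cars.dtype.names:
    print(f"{name}: {entry[name]}")
```

With the real dataset, the same loop can be applied to `data[0]` and `data.dtype.names`.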

#### **Extracting the entries of a column given its name**

* By specifying the name of a column, you can get all the entries within the array for this column (reminder: you are using Structured Arrays)


In [4]:
#Print the entries within the 'make' column
print(data['make'])

['alfa-romero' 'alfa-romero' 'alfa-romero' 'audi' 'audi' 'audi' 'audi'
 'audi' 'audi' 'audi' 'bmw' 'bmw' 'bmw' 'bmw' 'bmw' 'bmw' 'bmw' 'bmw'
 'chevrolet' 'chevrolet' 'chevrolet' 'dodge' 'dodge' 'dodge' 'dodge'
 'dodge' 'dodge' 'dodge' 'dodge' 'dodge' 'honda' 'honda' 'honda' 'honda'
 'honda' 'honda' 'honda' 'honda' 'honda' 'honda' 'honda' 'honda' 'honda'
 'isuzu' 'isuzu' 'isuzu' 'isuzu' 'jaguar' 'jaguar' 'jaguar' 'mazda'
 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda'
 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mazda' 'mercedes-benz'
 'mercedes-benz' 'mercedes-benz' 'mercedes-benz' 'mercedes-benz'
 'mercedes-benz' 'mercedes-benz' 'mercedes-benz' 'mercury' 'mitsubishi'
 'mitsubishi' 'mitsubishi' 'mitsubishi' 'mitsubishi' 'mitsubishi'
 'mitsubishi' 'mitsubishi' 'mitsubishi' 'mitsubishi' 'mitsubishi'
 'mitsubishi' 'mitsubishi' 'nissan' 'nissan' 'nissan' 'nissan' 'nissan'
 'nissan' 'nissan' 'nissan' 'nissan' 'nissan' 'nissan' 'nissan' 'nissan'
 'nissan' 'nissa

#### **T3) Extract the bodystyles within the dataset**


In [5]:
#Write your answer here
data['bodystyle']

array(['convertible', 'convertible', 'hatchback', 'sedan', 'sedan',
       'sedan', 'sedan', 'wagon', 'sedan', 'hatchback', 'sedan', 'sedan',
       'sedan', 'sedan', 'sedan', 'sedan', 'sedan', 'sedan', 'hatchback',
       'hatchback', 'sedan', 'hatchback', 'hatchback', 'hatchback',
       'hatchback', 'sedan', 'sedan', 'sedan', 'wagon', 'hatchback',
       'hatchback', 'hatchback', 'hatchback', 'hatchback', 'hatchback',
       'sedan', 'wagon', 'hatchback', 'hatchback', 'sedan', 'sedan',
       'sedan', 'sedan', 'sedan', 'sedan', 'sedan', 'hatchback', 'sedan',
       'sedan', 'sedan', 'hatchback', 'hatchback', 'hatchback', 'sedan',
       'sedan', 'hatchback', 'hatchback', 'hatchback', 'hatchback',
       'hatchback', 'sedan', 'hatchback', 'sedan', 'sedan', 'hatchback',
       'sedan', 'sedan', 'sedan', 'wagon', 'hardtop', 'sedan', 'sedan',
       'convertible', 'sedan', 'hardtop', 'hatchback', 'hatchback',
       'hatchback', 'hatchback', 'hatchback', 'hatchback', 'hatchback',
      

### What do the car prices look like?


#### **T4) Calculate the range of car prices for the entire dataset**


In [6]:
#Write your answer here
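#Avoid naming variables min/max, which would shadow Python's built-in functions
minPrice=np.min(data['price'])
maxPrice=np.max(data['price'])
print('Min car price:', minPrice)
print('Max car price:', maxPrice)
print('Range: ', maxPrice-minPrice)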

Min car price: 5118
Max car price: 45400
Range:  40282
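
As an alternative sketch, NumPy also offers `np.ptp` ("peak to peak"), which returns the maximum minus the minimum in one call. The four prices below are taken from rows shown in this notebook's outputs and stand in for the full `data['price']` column.

```python
import numpy as np

# A small stand-in for data['price']
prices = np.array([13495, 11595, 40960, 5151])

# np.ptp computes max - min directly
price_range = np.ptp(prices)
print("Range:", price_range)
```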


#### **T5) Calculate the min, max, mean and median prices of the cars**


In [7]:
#Write your answer here
print("Min car price:", np.min(data['price']))
print("Max car price:", np.max(data['price']))
print("Mean car price:", np.mean(data['price']))
print("Median car price:", np.median(data['price']))

Min car price: 5118
Max car price: 45400
Mean car price: 13300.239024390245
Median car price: 10345.0


#### **T6) Considering the values calculated above, what insights can you extract? Where do you think the majority of car prices will be clustered?**


**Write your answer here**

* There is a substantial difference between the minimum and maximum car prices, as also shown by the range calculated in task T4.
* The mean (about 13,300) and the median (10,345) both lie much closer to the minimum price (5,118) than to the maximum (45,400), indicating that most cars are clustered towards the lower end of the price range.
* The mean and median values are also reasonably close to each other; hence, either can be used as a measure of central tendency.
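
One possible way to check where prices cluster is to compare the mean with the median and to compute the share of cars priced below the mean. The sketch below uses a small hypothetical set of prices (not the real dataset) in which a single expensive car pulls the mean upwards.

```python
import numpy as np

# Hypothetical prices: mostly cheap cars plus one expensive car
prices = np.array([5151, 6000, 7000, 8000, 9000, 11595, 13495, 40960])

mean_price = np.mean(prices)
median_price = np.median(prices)

# A boolean mask averaged over all cars gives the fraction that is True
share_below_mean = np.mean(prices < mean_price)

print("Mean:", mean_price)
print("Median:", median_price)
print("Share of cars priced below the mean:", share_below_mean)
```

When the mean is above the median and well over half of the cars are priced below the mean, the prices are clustered at the lower end with a few expensive cars stretching the range.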

#### **T7) Write code to calculate the standard deviation for the car prices. Then use the corresponding NumPy function to confirm the correctness of your calculation**



In [8]:
#Write your answer here
def stdUsingNumpyOnly(prices):
  return np.sqrt(np.sum(np.power(np.subtract(prices, np.mean(prices)), 
                                 2))/len(prices))

def stdImplementation(prices):
  meanPrice    = np.mean(prices)
  priceDiffSq  = [np.power(price-meanPrice, 2) for price in prices]
  priceDiffAvg = np.sum(priceDiffSq)/len(prices)
  return np.sqrt(priceDiffAvg)

print("Standard deviation using only numpy functions: ", stdUsingNumpyOnly(data['price']))
print("Standard deviation by implementing std function: ", stdImplementation(data['price']))
print("Standard deviation using NumPy's std function:", np.std(data['price']))

Standard deviation using only numpy functions:  7969.54140103854
Standard deviation by implementing std function:  7969.54140103854
Standard deviation using NumPy's std function: 7969.54140103854


**Note:** Because the dataset contains the entire population of the company's cars, the denominator for the standard deviation is $n=205$ (i.e., the number of entries in the dataset). If the dataset held only a sample of the company's cars, the denominator would be $n-1$ (see the `ddof` parameter of `np.std`).
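
The effect of the denominator can be sketched with the `ddof` parameter of `np.std`. The four prices below are a small stand-in taken from rows shown in this notebook, so the numbers differ from those of the full dataset.

```python
import numpy as np

# A small stand-in for data['price']
prices = np.array([13495, 11595, 40960, 5151], dtype=float)
n = len(prices)

# Population standard deviation: divide by n (ddof=0 is the default)
population_std = np.std(prices)

# Sample standard deviation: divide by n - 1
sample_std = np.std(prices, ddof=1)

print("Population std (ddof=0):", population_std)
print("Sample std (ddof=1):", sample_std)
```

The sample version is always slightly larger, by a factor of $\sqrt{n/(n-1)}$, and the gap shrinks as $n$ grows.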

#### **T8) Find the details of cars with the smallest and largest car volumes**
* Hint: see how to calculate the volume of a car https://info.japanesecartrade.com/content-item/297-what-is-m3-cubic-meter-size-of-a-vehicle


In [9]:
#Write your answer here
#Solution1
carsVolume = np.multiply(np.multiply(data['length'], data['height']), data['width'])
#Solution2
# carsVolume = data['length']* data['height']* data['width']
#Solution3
# carsVolume = np.prod(np.vstack([data['length'], data['height'], data['width']]), axis=0)

maxVolume = np.max(carsVolume)
minVolume = np.min(carsVolume)

carWithMaxVolume = np.argmax(carsVolume)
carWithMinVolume = np.argmin(carsVolume)

print("Max volume:", maxVolume, " belongs to car ", data[carWithMaxVolume])

print("Min volume:", minVolume, " belongs to car ", data[carWithMinVolume])


Max volume: 846007.7  belongs to car  ('mercedes-benz', 'gas', 'four', 'sedan', 'rwd', 120.9, 208.1, 71.7, 56.7, 'eight', 308, 184, 14, 16, 40960)
Min volume: 452643.2  belongs to car  ('chevrolet', 'gas', 'two', 'hatchback', 'fwd', 88.4, 141.1, 60.3, 53.2, 'three', 61, 48, 47, 53, 5151)
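
The hint refers to volumes in cubic metres. Assuming the dimensions are in inches (an assumption, since the dataset states no units), the volumes can be converted as sketched below. The dimensions are those of the two cars printed above; the constant name is illustrative.

```python
import numpy as np

# 1 inch = 0.0254 m, so 1 cubic inch = 0.0254**3 cubic metres
CUBIC_INCH_TO_M3 = 0.0254 ** 3

# length, width, height of the mercedes-benz and chevrolet rows above,
# assumed to be in inches
length = np.array([208.1, 141.1])
width = np.array([71.7, 60.3])
height = np.array([56.7, 53.2])

volume_in3 = length * width * height
volume_m3 = volume_in3 * CUBIC_INCH_TO_M3

print("Volume (cubic inches):", volume_in3)
print("Volume (cubic metres):", np.round(volume_m3, 2))
```

Note that this is the volume of the bounding box around the car, not of the car body itself.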


#### **T9) Find the different types of bodystyles for the cars in the dataset**

* Hint: You may want to check: https://numpy.org/doc/stable/reference/generated/numpy.unique.html

In [10]:
#Write your answer here
print("Unique bodystyles: ", np.unique(data['bodystyle']))

Unique bodystyles:  ['convertible' 'hardtop' 'hatchback' 'sedan' 'wagon']


#### **T10) Find the number of different car *brands* (makes)**


In [11]:
#Write your answer here
uniqueCarMakes = np.unique(data['make'])
print("There are ", len(uniqueCarMakes), "unique car makes which are:", uniqueCarMakes)

There are  22 unique car makes which are: ['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'renault' 'saab' 'subaru' 'toyota' 'volkswagen'
 'volvo']


#### **T11) Find the engine size and the horsepower for the most and least efficient cars when driven in the city and the highway (i.e., the cars with the smallest and largest difference in fuel consumption when driven in the city and the highway)**

In [12]:
#Write your answer here
fuelDiff = np.subtract(data['highwaympg'], data['citympg'])

minFuelDiff = np.min(fuelDiff)
maxFuelDiff = np.max(fuelDiff)

carWithMinFuelDiff = np.argmin(fuelDiff)
carWithMaxFuelDiff = np.argmax(fuelDiff)

print("A %s with engine size=%d and horsepower=%d has the minimum fuel difference (%d) when driven in the city and the highway" 
      % (data['make'][carWithMinFuelDiff], data["enginesize"][carWithMinFuelDiff], data['horsepower'][carWithMinFuelDiff], minFuelDiff)) 

print("A %s with engine size=%d and horsepower=%d has the maximum fuel difference (%d) when driven in the city and the highway" 
      % (data['make'][carWithMaxFuelDiff], data["enginesize"][carWithMaxFuelDiff], data['horsepower'][carWithMaxFuelDiff], maxFuelDiff)) 

A peugot with engine size=152 and horsepower=95 has the minimum fuel difference (0) when driven in the city and the highway
A porsche with engine size=203 and horsepower=288 has the maximum fuel difference (11) when driven in the city and the highway
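
`np.argmin` and `np.argmax` return only the first matching index, so if several cars share the same difference, only one of them is reported. A possible way to list all of them is sketched below with hypothetical mpg values (not taken from the dataset).

```python
import numpy as np

# Hypothetical city and highway mpg values for five cars
citympg = np.array([21, 24, 19, 28, 19])
highwaympg = np.array([27, 29, 19, 33, 19])

fuelDiff = highwaympg - citympg

# argmin reports only the first car with the smallest difference
firstMin = np.argmin(fuelDiff)

# A boolean mask finds every car that shares the smallest difference
allMin = np.flatnonzero(fuelDiff == fuelDiff.min())

print("First index with the minimum difference:", firstMin)
print("All indices with the minimum difference:", allMin)
```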


#### **T12) Find the make with the largest number of cars and how many there are**

In [13]:
#Write your answer here
makes,counts = np.unique(data['make'], return_counts=True)
maxCarsSameMake = np.argmax(counts)
make = makes[maxCarsSameMake]

print("The company has %d %s cars " % (np.max(counts), make))

The company has 32 toyota cars 
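
The same `return_counts=True` result can also be used to rank every make, not only the largest one. The sketch below uses a short hypothetical list of makes in place of `data['make']`.

```python
import numpy as np

# Hypothetical stand-in for data['make']
carMakes = np.array(['toyota', 'audi', 'toyota', 'bmw', 'audi', 'toyota'])

makes, counts = np.unique(carMakes, return_counts=True)

# argsort orders ascending; reversing it gives the largest count first
order = np.argsort(counts)[::-1]

for make, count in zip(makes[order], counts[order]):
    print(f"{make}: {count}")
```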


#### **T13) Find how many cars have a wheel base greater than 100**

* Hint: See https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html

In [14]:
#Write your answer here
carsWithLargeWheelBase = np.count_nonzero(data['wheelbase']>100)
print("There are %d cars whose wheel base is greater than 100" % (carsWithLargeWheelBase))

There are 63 cars whose wheel base is greater than 100
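
A comparison such as `data['wheelbase']>100` produces a boolean mask, and counting is just one thing that can be done with it. The sketch below uses five hypothetical wheel base values to show equivalent ways of counting and how to get a proportion.

```python
import numpy as np

# Hypothetical wheel base values
wheelbase = np.array([88.6, 94.5, 120.9, 88.4, 100.0])

# One True/False value per car; 100.0 is not strictly greater than 100
mask = wheelbase > 100

countNonzero = np.count_nonzero(mask)  # number of True values
countSum = np.sum(mask)                # True counts as 1, False as 0
proportion = np.mean(mask)             # fraction of cars above 100

print("Mask:", mask)
print("Count:", countNonzero, "or", countSum)
print("Proportion:", proportion)
```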


#### **T14) Find if there are any convertible cars that cost less than £15000**

* Hint: See https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html

In [15]:
#Write your answer here
cheapConvertibles = data[(data['bodystyle']=="convertible") & (data['price']<15000)]
print("Details of convertibles that cost less than £15000:\n", cheapConvertibles)

Details of convertibles that cost less than £15000:
 [('alfa-romero', 'gas', 'two', 'convertible', 'rwd', 88.6, 168.8, 64.1, 48.8, 'four', 130, 111, 21, 27, 13495)
 ('volkswagen', 'gas', 'two', 'convertible', 'fwd', 94.5, 159.3, 64.2, 55.6, 'four', 109,  90, 24, 29, 11595)]
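
Since the task asks whether any such car exists, the mask can also be reduced to a single yes/no answer with `np.any`. The sketch below rebuilds a small structured array from four rows shown in this notebook, keeping only the three columns needed here.

```python
import numpy as np

# Four rows copied from outputs in this notebook (three columns only)
cars = np.array([
    ('alfa-romero', 'convertible', 13495),
    ('volkswagen', 'convertible', 11595),
    ('mercedes-benz', 'sedan', 40960),
    ('chevrolet', 'hatchback', 5151),
], dtype=[('make', 'U20'), ('bodystyle', 'U20'), ('price', 'i4')])

# Both conditions must hold, so the two masks are combined with &
mask = (cars['bodystyle'] == 'convertible') & (cars['price'] < 15000)

anyCheapConvertible = np.any(mask)      # True if at least one car matches
numCheapConvertibles = np.count_nonzero(mask)

print("Any convertible under 15000?", anyCheapConvertible)
print("How many:", numCheapConvertibles)
print("Makes:", cars['make'][mask])
```

The parentheses around each comparison are required because `&` binds more tightly than `==` and `<`.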


#### **T15) Calculate the interquartile range for the price of all cars**

In [16]:
#Write your answer here
Q3P = np.percentile(data['price'], 75) #Third quartile
Q1P = np.percentile(data['price'], 25) #First quartile
IQRP = Q3P - Q1P #Inter Quartile Range
print('Price IQR:', IQRP)

Price IQR: 8715.0
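
`np.percentile` also accepts a list of percentiles, so both quartiles can be obtained in a single call. The sketch below uses four prices from rows shown in this notebook as a stand-in for `data['price']`, so the result differs from the IQR of the full dataset.

```python
import numpy as np

# A small stand-in for data['price']
prices = np.array([5151, 11595, 13495, 40960])

# Passing a list returns one value per requested percentile
Q1P, Q3P = np.percentile(prices, [25, 75])
IQRP = Q3P - Q1P

print("Q1:", Q1P, "Q3:", Q3P, "IQR:", IQRP)
```

By default `np.percentile` interpolates linearly between neighbouring values when the requested percentile falls between two data points.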


#### **T16) Calculate the 50th percentile of the horsepower of all cars. Which statistic is the 50th percentile equal to?**

In [17]:
#Write your answer here
percentile50HP = np.percentile(data['horsepower'], 50) #50th percentile
print('Horsepower 50th percentile:', percentile50HP)
print("Median horsepower:", np.median(data['horsepower']))

Horsepower 50th percentile: 95.0
Median horsepower: 95.0


### Ideas for practicing further at home

* Find the engine size and horsepower of 4wd cars
* Find whether diesel or gas cars are more efficient in the city/on the highway
* Any other analysis that you think could generate some useful insight.
