# <center><font color = 'green'>Simulated data for purchasing a new electric car</font></center>

![car-reg](Images/car-reg.png)

## <center><font color = 'green'>Barry Clarke</font></center>
## <center><font color = 'green'>Programming for Data Analytics - Project - Autumn 2019</font></center>

## Description

This repository simulates data on purchases of new hybrid and fully electric cars in Ireland in 2019. Summary car sales figures show a significant increase in sales of hybrid and fully electric cars in 2019 compared with 2018.[1] The trend in combined hybrid and fully electric sales for 2018 and 2019 can be seen below. <br> 
![New E-car sales summary](Images/E-Car-sales-summary.png)<br>
This data is of interest to the author, as his next car purchase will be either a hybrid or a fully electric car. Note the two peaks in the plot above, one in January and one in July. These reflect the high volume of car sales at the start of the first half of the year and again at the start of the second half. For the dataset below, this trend is not important; rather, the focus is on the overall volume of sales for individual car types over the first 10 months of 2019.

The dataset will include the following columns:
1. **Model:** The model of car
2. **Make:** The manufacturer of the car
3. **Technology:** The engine type
4. **Class:** The class of car
5. **Price:** The price of the car
6. **Location:** The county where the car is registered. <font color = 'Red'>Note:</font> The dataset focuses only on Dublin and its commuter counties, as the author is a commuter and would like data relevant to his own situation
7. **Colour:** The colour of the car
8. **Gender:** Gender of the purchaser
9. **Scrappage:** Yes/No, based on whether the purchaser availed of a scrappage deal

Within the dataset, the relationships between the variables will be discussed and simulated. The final simulated dataset will contain 1000 entries.<br>
**Note:** The simulated dataset is primarily based on data gathered from the [Irish Motor Industry website](https://stats.beepbeep.ie/)

In [1]:
# Import the relevant libraries
import numpy as np
import pandas as pd

The data below is based on the summary data for registrations of hybrid and electric passenger cars in Dublin and its commuter counties in 2019, up to the end of October. This data can be viewed [here](Data/passenger-cars-by-model.xlsx).

In [2]:
# List of options within each variable
model = ['Leaf', 'Kona', 'Niro', 'Zoe', 'E-Golf', 'Outlander', 'Range Rover sport RA', 'Ioniq', 'I3', '5 Series', 'XC90', 
        '3 Series', 'Soul', 'Countryman', 'Model S', 'XC60', 'E-Tron', 'I-Pace', 'Range Rover', 'Model 3', 'Model X',
       'S90', 'Prius', 'Panamera', 'Mondeo', '2 Series', 'Range Rover Evoque R', 'EQC', 'E Class', 'Evalia', '7 Series',
       'Passat', 'Twizy', 'C Class', 'A7', 'V90', 'Cayenne']
make = ['Nissan', 'Hyundai', 'Kia', 'Renault', 'Volkswagen', 'Mitsubishi', 'Land Rover', 'BMW', 'Volvo', 'Mini',
       'Tesla', 'Audi', 'Jaguar', 'Toyota', 'Porsche', 'Ford', 'Mercedes-Benz']
technology = ['Hybrid/Diesel', 'Hybrid/Petrol', 'Electric']
classification = ['small', 'medium', 'large', 'SUV']
Price = ['Whatever']
location = ['Dublin', 'Meath', 'Louth', 'Kildare', 'Wicklow', 'Wexford', 'Carlow']
colour = ['Black', 'White', 'Red', 'Silver', 'Navy', 'Other']
gender = ['Male', 'Female']
scrappage = ['Yes', 'No']

For each simulated sale, the [numpy.random.choice()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html) function is used, with a probability attached to each car model based on its share of the sales statistics for all cars in the data.

In [3]:
# Probabilities of each car model being registered, based on 2019 Summary data
p1=[0.212, 0.220, 0.090, 0.037, 0.062, 0.046, 0.046, 0.003, 0.035, 0.027, 0.022, 0.02, 0.012, 0.013, 0.014, 0.01, 0.011, 0.012, 0.010, 0.009, 0.008, 0.009, 0.005, 0.006, 0.002, 0.002, 0.002, 0.001, 0.001, 0.0003, 0.0007, 0.0007, 0.0003, 0.0003, 0.0003, 0.0003, 0.0003]
# numpy.random.choice() raises an error if the probabilities do not sum to 1.
# Rounded figures rarely sum exactly, so normalise them by dividing each one by their total. [2]
p1 = np.array(p1)
p1 /= p1.sum()
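
As a small illustration of why the normalisation step above is needed, the sketch below uses a short, made-up probability list (the values and the option names `A`, `B`, `C` are assumptions for illustration, not the sales data). The rounded values sum to 0.99, so `np.random.choice()` rejects them until they are divided by their total.

```python
import numpy as np

# Hypothetical options and rounded probabilities, for illustration only
options = ['A', 'B', 'C']
raw = [0.33, 0.33, 0.33]  # sums to 0.99, not 1

try:
    np.random.choice(options, 5, p=raw)
    raised = False
except ValueError:
    # numpy refuses probabilities that do not sum to 1
    raised = True

# Normalise: divide each probability by the total so that they sum to 1
p = np.array(raw)
p = p / p.sum()

# The same call now succeeds
sample = np.random.choice(options, 5, p=p)
```

Normalising keeps the relative weight of each option unchanged; it only rescales the list so that the total is 1.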

In [4]:
# Probabilities of location where car is registered
p2=[0.728, 0.057, 0.033, 0.076, 0.057, 0.034, 0.015]
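
One way to sanity-check a probability list such as `p2` is to confirm that it sums to 1 and then compare the share of a large simulated sample against the intended probability. The sketch below does this for Dublin. It uses NumPy's `default_rng` generator with an arbitrary fixed seed so that the result is repeatable; both the seed and the variable names `rng`, `draws` and `share_dublin` are assumptions, and the notebook itself uses `np.random.choice()`.

```python
import numpy as np

location = ['Dublin', 'Meath', 'Louth', 'Kildare', 'Wicklow', 'Wexford', 'Carlow']
p2 = [0.728, 0.057, 0.033, 0.076, 0.057, 0.034, 0.015]

# The probabilities should sum to 1 (allowing for floating-point rounding)
total = sum(p2)

# Seeded generator so that the check is repeatable (the seed is arbitrary)
rng = np.random.default_rng(2019)
draws = rng.choice(location, size=1000, p=p2)

# Share of simulated registrations in Dublin; should be close to 0.728
share_dublin = np.mean(draws == 'Dublin')
```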

In [5]:
Cars = pd.DataFrame({'Model': np.random.choice(model, 10, p=p1), 'Technology': np.random.choice(technology, 10),
                    'Class': np.random.choice(classification, 10), 'Price': np.random.randint(20000, 80000, size=10),
                    'Location': np.random.choice(location, size=10, p=p2), 'Gender': np.random.choice(gender, 10),
                    'Scrappage': np.random.choice(scrappage, 10)})

In [6]:
Cars

| | Model | Technology | Class | Price | Location | Gender | Scrappage |
|---|---|---|---|---|---|---|---|
| 0 | Kona | Hybrid/Petrol | large | 42492 | Dublin | Female | Yes |
| 1 | XC60 | Hybrid/Petrol | SUV | 36264 | Dublin | Female | Yes |
| 2 | Kona | Hybrid/Petrol | small | 20240 | Dublin | Male | Yes |
| 3 | XC90 | Electric | medium | 35427 | Wexford | Female | No |
| 4 | Range Rover | Hybrid/Diesel | large | 59373 | Kildare | Male | Yes |
| 5 | Countryman | Hybrid/Petrol | medium | 48848 | Dublin | Male | Yes |
| 6 | Niro | Hybrid/Petrol | large | 37759 | Dublin | Female | No |
| 7 | Kona | Electric | medium | 27637 | Dublin | Male | Yes |
| 8 | 5 Series | Hybrid/Petrol | small | 71306 | Dublin | Male | Yes |
| 9 | 5 Series | Electric | medium | 63448 | Dublin | Male | No |
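
Because each column in the sample above is drawn independently, a row can pair a model with a technology or class that does not suit it. One possible way to introduce a relationship between variables is to derive a dependent column from the model with a lookup table. The sketch below does this for the Make column; the `model_to_make` dictionary is a partial, hypothetical lookup written for illustration and is not part of the notebook.

```python
import pandas as pd

# Partial lookup from model to manufacturer, for illustration only
model_to_make = {
    'Leaf': 'Nissan',
    'Kona': 'Hyundai',
    'Niro': 'Kia',
    'Ioniq': 'Hyundai',
}

# A few example models standing in for the simulated 'Model' column
cars = pd.DataFrame({'Model': ['Kona', 'Leaf', 'Niro', 'Kona']})

# Derive the manufacturer from the model instead of sampling it independently
cars['Make'] = cars['Model'].map(model_to_make)
```

Any model missing from the lookup would be given `NaN` by `map()`, so a full version would need an entry for every model in the list.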


## References
1. https://stats.beepbeep.ie/
2. https://stackoverflow.com/questions/46539431/np-random-choice-probabilities-do-not-sum-to-1
3. https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae
4. https://www.analyticsvidhya.com/blog/2017/09/6-probability-distributions-data-science/