# Project - Programming for Data Analysis 

## Problem statement 

For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. 

Specifically, in this project you should: 

- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables. 
- Investigate the types of variables involved, their likely distributions, and their relationships with each other. 
- Synthesise/simulate a data set as closely matching their properties as possible. 
- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook. 

Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set. The next section gives an example project idea.


## Example project idea 

the performance of students studying a ten-credit module. 

most interesting variable - grade
number of hours on average a student studies per week (hours), 
the number of times they log onto Moodle in the first three weeks of term (logins), 
and their previous level of degree qualification (qual), which are related to grade. 

The hours and grade variables will be non-negative real numbers with two decimal places, 
logins will be a non-zero integer 
and qual will be a categorical variable with four possible values: none, bachelors, masters, or phd. 
four hours per week with a standard deviation of a quarter of an hour 

and that a normal distribution is an acceptable model of such a variable. 
Likewise, I investigate the other four variables, 
and I also look at the relationships between the variables. 

I devise an algorithm (or method) to generate such a data set, 
simulating values of the four variables for two-hundred students. 
I detail all this work in my notebook, 
and then I add some code in to generate a data set with those properties.
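
As a minimal sketch of the example above, the `hours` variable could be simulated as follows. The mean of four hours and the standard deviation of a quarter of an hour come from the example; the seed, the use of `numpy.random.default_rng` and the equal probabilities for `qual` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for repeatability

n_students = 200

# hours: normal distribution, mean 4 and standard deviation 0.25 (from the example)
hours = rng.normal(loc=4.0, scale=0.25, size=n_students)
hours = np.clip(hours, 0, None).round(2)  # non-negative, two decimal places

# qual: four categories; equal probabilities are an assumption, not a finding
qual_levels = ['none', 'bachelors', 'masters', 'phd']
qual = rng.choice(qual_levels, size=n_students)

print(hours[:5])
print(qual[:5])
```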


## Researching a real world application of interest

### Choosing a dataset

Having gone through various articles on datasets and data collections across a spectrum of topics, one topic of personal interest is the set of factors that influence automotive fuel efficiency.

Having reviewed several articles on the topic, there seem to be many technical reasons for less than optimal fuel efficiency, and most of them come down to maintenance or a lack thereof. Many of the other factors come down to driving style, which is related to temperament and age. Further factors that come into play are the size of the vehicle, the distance of the commute and weather conditions, so all in all a reasonable set of conditions that can make an interesting dataset.

An overwhelming number of variables can influence the overall fuel efficiency of a vehicle, including very granular maintenance specifics like oil and fuel quality, tyres, air conditioner use and travel distance. The basic factors influencing the outcome are engine size, weight, speed, driving style, aerodynamics and mechanical resistance. 

So as a first pass, this is a rough idea of the data that should allow one to make reasonable estimates of fuel consumption.

Make  |Model  |Sub-Class|Type|CC |Cylinders|Gender|Age|Drivestyle|Serviced|Commute
------|-------|---------|----|---|---------|------|---|----------|--------|-------
Toyota|Corolla|Verso    |MPV |1.6|4        |Male  |55 |Rational  |Annually|32  

Maybe Sub-Class and Type are overcomplicating the matter.



### Defining the data values and types for the dataset


Variable  |Description                 |Type / range|Distribution
----------|----------------------------|------------|-----------------
Make      |Manufacturer                |text        |Geometric
Model     |Model                       |text        |    - 
CC        |Engine size in litres       |0.8-4.5     |Geometric
Cylinders |Cylinders inferred from CC  |2-12        |    - 
Gender    |Gender of driver            |male/female |Bernoulli
Age       |Age of driver to infer style|16-99       |Normal
Drivestyle|Driver type                 |text        |    - 
Serviced  |Serviced annually           |yes/no      |Bernoulli
Commute   |Distance of commute         |1-100       |Normal (Gaussian)
Type      |Urban, Rural, Highway       |text        |Categorical
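
A rough sketch of how the simpler variables in the table might be drawn is given below. Only the value ranges come from the table; every probability, mean and standard deviation is a placeholder assumption that would need to be replaced by researched values, and `numpy.random.default_rng` is used here for repeatability.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed
n = 100

# NOTE: all numeric parameters below are placeholder assumptions, not measured values.

# Two-outcome (Bernoulli-style) variables
gender = rng.choice(['male', 'female'], size=n, p=[0.5, 0.5])
serviced = rng.choice(['yes', 'no'], size=n, p=[0.7, 0.3])

# Normally distributed variables, clipped to the ranges given in the table
age = np.clip(rng.normal(loc=45, scale=15, size=n), 16, 99).round().astype(int)
commute = np.clip(rng.normal(loc=30, scale=15, size=n), 1, 100).round().astype(int)

print(gender[:5], serviced[:5], age[:5], commute[:5])
```

Clipping keeps the values inside the stated ranges, at the cost of piling a few values up at the boundaries; redrawing out-of-range values would be an alternative.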

### Generating a list of manufacturers 
To get an idea of where values should go, the TEA18 dataset, referenced below, was used as a guideline for the distribution of vehicle types.

In [62]:
import pandas as pd                              # Import pandas for dataframe features
pd.options.mode.chained_assignment = None        # disable chained-assignment warnings on dataframe usage
df=pd.read_csv('data/CarsIrelandbyCC.csv')       # import the TEA18 dataset to establish benchmarks
makes=df[df['Make'].str.contains('All ')==False] # filter out the aggregate rows and keep manufacturers only
#makes=df[df['Make'].str.contains('All')==True]  # test result
#df.sort_values('All', ascending=False)          # sort for easy comparison
makes['pct']=((df['All']/168327)*100)            # calculate a percentage for reference 
makes[['Make','All','pct']].sort_values('All', ascending=False)#.head(10)   # display the list in descending order

Unnamed: 0,Make,All,pct
41,Volkswagen,21070,12.517303
14,Ford,18657,11.083783
39,Toyota,16114,9.573033
26,Nissan,12941,7.688012
16,Hyundai,12440,7.390377
3,Audi,8815,5.236831
34,Skoda,8508,5.054448
30,Renault,7444,4.422345
5,BMW,7384,4.3867
27,Opel,7350,4.366501


Looking at the list of new motor vehicles purchased in 2018, sorted by total sales, the list seems to approximate a geometric distribution. The engine sizes seem to be similarly distributed. Regardless of the distribution type, however, what I would ideally like to reproduce is a randomly generated list drawn from 41 manufacturers that will consistently yield around __12.5%__ Volkswagens, __11%__ Fords, __9.6%__ Toyotas and so on.

Several hours were spent trying to reproduce this result using distributions. One sample of such an attempt is shown below.

In [12]:
import numpy.random as rnd
d=rnd.noncentral_chisquare(10,1,30)**2*rnd.randint(3,size=30)
d

array([  0.        ,   0.        ,   0.        ,   0.        ,
        27.8048312 ,   0.        , 141.27470823, 438.66072555,
        88.71402032, 100.44632912,   0.        , 501.24198718,
       342.1967057 , 232.9289895 ,  28.98250259, 249.41054585,
       128.46104679,   0.        ,   0.        , 558.63885155,
       277.57013437,  58.25002349, 662.71228701,   0.        ,
       113.06181449, 277.84497443, 634.37616323, 124.2955145 ,
        95.32618272,   0.        ])

After countless hours working on the problem and testing all kinds of distributions, it turned out that the most elegant solution already exists in `numpy.random.choice`, in its fourth parameter __p__ (_probability_). I had completely missed this point when working with the library; only stumbling on the word _"probabilities"_ in the description of parameter __p__ in the documentation yielded the desired results.

```python
choice(a, size=None, replace=True, p=None)

Parameters
-----------
p : 1-D array-like, optional
    The probabilities associated with each entry in a.
    If not given the sample assumes a uniform distribution over all
    entries in a.
```
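
As a small illustration of the __p__ parameter, the sketch below turns raw counts into probabilities and samples from them. The three counts are taken from the table of makes above; restricting the example to three makes, the seed and the use of `numpy.random.default_rng` (rather than the `rnd` alias used in this notebook) are simplifying assumptions.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(seed=3)  # arbitrary seed

# Counts for the three largest makes, copied from the table above
counts = {'Volkswagen': 21070, 'Ford': 18657, 'Toyota': 16114}

names = list(counts)
weights = np.array(list(counts.values()), dtype=float)
p = weights / weights.sum()  # p must sum to one for choice()

sample = rng.choice(names, size=10_000, p=p)

# Observed share of each make in the sample, for comparison with p
shares = {k: v / len(sample) for k, v in Counter(sample.tolist()).items()}
print(dict(zip(names, p.round(3))))
print(shares)
```

With enough samples the observed shares settle close to the probabilities passed in, which is exactly the behaviour wanted for the list of manufacturers.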

### Determining the winners

In [34]:
makes=df[df['Make'].str.contains('All makes')==False]
m=makes.sort_values('All', ascending=False)
m=list(m['Make'])
print(m)

['Volkswagen', 'Ford', 'Toyota', 'Nissan', 'Hyundai', 'Audi', 'Skoda', 'Renault', 'BMW', 'Opel', 'Kia', 'Peugeot', 'Vauxhall', 'Mercedes Benz', 'Citroen', 'Dacia', 'Mazda', 'Seat', 'Honda', 'Volvo', 'Suzuki', 'Mitsubishi', 'Mini', 'Land Rover', 'Fiat', 'Lexus', 'Jaguar', 'Saab', 'Subaru', 'All other makes', 'Porsche', 'Alfa Romeo', 'Ssangyong', 'Jeep', 'Smart', 'Chevrolet', 'Rover', 'Chrysler', 'Daihatsu', 'Daewoo', 'Austin', 'Dodge']


So arranging by makes in order of popularity:
```python
makes = ['Volkswagen', 'Ford', 'Toyota', 'Nissan', 'Hyundai', 'Audi', 'Skoda', 'Renault', 'BMW', 'Opel', 'Kia', 'Peugeot', 'Vauxhall', 'Mercedes Benz', 'Citroen', 'Dacia', 'Mazda', 'Seat', 'Honda', 'Volvo', 'Suzuki', 'Mitsubishi', 'Mini', 'Land Rover', 'Fiat', 'Lexus', 'Jaguar', 'Saab', 'Subaru', 'All other makes', 'Porsche', 'Alfa Romeo', 'Ssangyong', 'Jeep', 'Smart', 'Chevrolet', 'Rover', 'Chrysler', 'Daihatsu', 'Daewoo', 'Austin', 'Dodge']
```

So in order to create a realistic simulation resembling these findings, a reasonably narrow and specific distribution must be followed.

### The working code

Testing the generator and distribution results

In [48]:
from collections import Counter as count # import the counter for validation of the generator
makes['pct']=makes['All']/sum(makes['All']) # create a proportion column that adds up to one - required by the probability parameter
makes[['Make','All','pct']].sort_values('All', ascending=False) # sort the list for validation purposes
manufacturers=list(makes['Make']) # generate a manufacturers list from the makes dataframe
probability=list(makes['pct']) # create the probability list matching the manufacturers list
#sum(probability)  # test the list for compatibility with the function requirements, i.e. it must add up to one
t=rnd.choice(manufacturers,1000,p=probability) # create 1000 samples
count(t).most_common() # count and show them

[('Volkswagen', 135),
 ('Ford', 105),
 ('Toyota', 92),
 ('Nissan', 68),
 ('Skoda', 61),
 ('Hyundai', 61),
 ('Audi', 53),
 ('Opel', 48),
 ('Renault', 45),
 ('BMW', 45),
 ('Peugeot', 44),
 ('Vauxhall', 29),
 ('Dacia', 29),
 ('Citroen', 24),
 ('Seat', 23),
 ('Kia', 23),
 ('Mazda', 20),
 ('Honda', 19),
 ('Mercedes Benz', 19),
 ('Volvo', 18),
 ('Suzuki', 8),
 ('Mitsubishi', 8),
 ('Mini', 7),
 ('Lexus', 3),
 ('Subaru', 3),
 ('Land Rover', 3),
 ('Fiat', 2),
 ('Porsche', 1),
 ('Ssangyong', 1),
 ('Alfa Romeo', 1),
 ('Jaguar', 1),
 ('Saab', 1)]

This approach yields exactly what I was hoping to achieve. Engine size can be derived in a similar fashion, bearing in mind that not every manufacturer offers every engine size.

### Distilling Engine sizes

So in order to add engine sizes to the mix, a review of the engine statistics and distributions is done.

In [35]:
makes=df[df['Make'].str.contains('All makes')==True]
makes

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
0,All makes,168327,881,10210,20204,15272,18800,43824,48599,7407,3130


To better make sense of this result, it's transposed and cleaned up and stored in a csv file for easier reference and access.

In [52]:
ccd=pd.read_csv('data/cc_dist.csv').sort_values('pct', ascending=False)
ccd

Unnamed: 0,cc,count,pct
6,2000,48599,0.288718
5,1600,43824,0.26035
2,1300,20204,0.120028
4,1500,18800,0.111687
3,1400,15272,0.090728
1,1000,10210,0.060656
7,2400,7407,0.044004
8,>2400,3130,0.018595
0,<900,881,0.005234


The surprise is that the most common engine size, contrary to popular belief, is the 2000 cc category, closely followed by the 1600 cc category. The wider gaps between the categories at this end probably add to the large percentages in these groupings; together the two account for about 55% of the vehicles in the dataset. About 38% is shared by the 1000-1500 cc categories and 4.4% falls in the 2400 cc category. The remaining 2.4% or so fills in the rest, so very few vehicles are above 2400 cc (1.9%) and even fewer are below 900 cc (0.5%).
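
The grouped shares can be recomputed directly from the counts in the `All makes` row. The short check below uses only those counts; the helper name `share` is introduced here for illustration.

```python
# Counts per engine-size band, copied from the 'All makes' row above
counts = {'<900': 881, '1000': 10210, '1300': 20204, '1400': 15272,
          '1500': 18800, '1600': 43824, '2000': 48599, '2400': 7407,
          '>2400': 3130}
total = sum(counts.values())

def share(*bands):
    """Percentage of all vehicles that fall in the given bands."""
    return 100 * sum(counts[b] for b in bands) / total

print(total)                                                # 168327
print(round(share('1600', '2000'), 1))                      # 54.9
print(round(share('1000', '1300', '1400', '1500'), 1))      # 38.3
print(round(share('2400'), 1))                              # 4.4
print(round(share('<900', '>2400'), 1))                     # 2.4
```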

<img src="img/cc_plot.png">

So a list in the order of popularity would be:

```python
cc_order=['2000','1600','1300','1500','1400','1000','2400','>2400','<900']  # strings, since '>2400' and '<900' are labels rather than numbers
```

### Distribution of engine sizes and variance by manufacturers

The order of manufacturers is not that surprising, and sorting the complete list by the various engine size columns creates only minor shifts. The shifting is really down to engine sizes being more prevalent at specific manufacturers. For example, Toyota has no vehicles in the <900 cc category, which is dominated by Nissan, with relatively low volumes. 

It is quite surprising, and then again very satisfying, that this information can be derived from official datasets that are already publicly available. It is an area that deserves to be explored more deeply and more frequently.

For the sake of completeness, before proceeding, let us investigate the variance of manufacturers by engine size.

In [53]:
makes=df[df['Make'].str.contains('All makes')==False]
makes.sort_values('<900', ascending=False).head(3)

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
26,Nissan,12941,496,53,5016,46,6164,1032,93,12,29
9,Dacia,3696,139,0,358,1,3198,0,0,0,0
30,Renault,7444,109,60,1843,45,4965,275,122,17,8


In [54]:
makes.sort_values('1000', ascending=False).head(3)

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
39,Toyota,16114,0,3176,284,6601,626,814,4325,178,110
41,Volkswagen,21070,6,1966,1959,483,399,9080,7119,2,56
16,Hyundai,12440,0,1478,1554,1311,20,1913,5388,772,4


In [55]:
makes.sort_values('1300', ascending=False).head(3)

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
26,Nissan,12941,496,53,5016,46,6164,1032,93,12,29
14,Ford,18657,0,1065,2544,486,1258,9638,3491,135,40
41,Volkswagen,21070,6,1966,1959,483,399,9080,7119,2,56


In [56]:
makes.sort_values('1400', ascending=False).head(3)

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
39,Toyota,16114,0,3176,284,6601,626,814,4325,178,110
27,Opel,7350,0,52,717,1961,39,1998,2581,2,0
16,Hyundai,12440,0,1478,1554,1311,20,1913,5388,772,4


In [57]:
makes.sort_values('1500', ascending=False).head(3)

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
26,Nissan,12941,496,53,5016,46,6164,1032,93,12,29
30,Renault,7444,109,60,1843,45,4965,275,122,17,8
9,Dacia,3696,139,0,358,1,3198,0,0,0,0


In [58]:
makes.sort_values('1600', ascending=False).head(3)

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
14,Ford,18657,0,1065,2544,486,1258,9638,3491,135,40
41,Volkswagen,21070,6,1966,1959,483,399,9080,7119,2,56
34,Skoda,8508,0,942,1499,43,225,4008,1791,0,0


In [59]:
makes.sort_values('2000', ascending=False).head(3)

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
41,Volkswagen,21070,6,1966,1959,483,399,9080,7119,2,56
3,Audi,8815,0,64,115,169,106,1360,6436,3,562
5,BMW,7384,37,0,0,0,458,52,6179,8,650


In [60]:
makes.sort_values('2400', ascending=False).head(3)

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
23,Mercedes Benz,4027,0,0,0,6,506,771,124,2287,333
22,Mazda,3049,0,0,0,326,289,170,197,2064,3
16,Hyundai,12440,0,1478,1554,1311,20,1913,5388,772,4


In [61]:
makes.sort_values('>2400', ascending=False).head(3)

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
5,BMW,7384,37,0,0,0,458,52,6179,8,650
3,Audi,8815,0,64,115,169,106,1360,6436,3,562
21,Lexus,657,0,0,0,0,0,0,126,56,475


### The conclusion on engine sizes

Manufacturer order varies somewhat between engine sizes since not everyone manufactures the same range of engines. For example, BMW and Audi lead the >2400 cc category, while Mercedes Benz leads the 2400 cc category. 

The second aspect is the extreme complexity that accounting for engine sizes across 41 unique manufacturers would introduce into the modelling, while the net effect is almost negligible because of the overall manufacturer popularity distribution. For example, while Mercedes Benz dominates the 2400 cc category, the manufacturer only represents 2.39% of all vehicles, so for roughly half this number, around 1.2% of the set, I may assign a bigger engine to the wrong manufacturer. All in all the set will balance, as the cc distribution probability will still be accurately applied to the overall set. 
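
For comparison, the per-manufacturer alternative weighed up above could look roughly like the sketch below, where each make gets its own engine-size probabilities. The counts for Nissan and BMW are copied from the extracts shown earlier; the function name `draw_cc`, the seed and the restriction to two makes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # arbitrary seed

bands = ['<900', '1000', '1300', '1400', '1500', '1600', '2000', '2400', '>2400']

# Per-make counts by engine-size band, copied from the TEA18 extracts above
by_make = {
    'Nissan': [496, 53, 5016, 46, 6164, 1032, 93, 12, 29],
    'BMW':    [37, 0, 0, 0, 458, 52, 6179, 8, 650],
}

def draw_cc(make, size=1):
    """Draw engine-size bands using the distribution of one specific make."""
    counts = np.array(by_make[make], dtype=float)
    return rng.choice(bands, size=size, p=counts / counts.sum())

bmw = draw_cc('BMW', 1000)
nissan = draw_cc('Nissan', 1000)
print(sorted(set(bmw.tolist())))     # bands with a zero count never appear
print(sorted(set(nissan.tolist())))
```

Doing this for every make would need one probability list per manufacturer, which is the complexity the simpler approach avoids.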

## Modelling the engine sizes

In [69]:
count(rnd.choice(ccd['cc'],1000,p=ccd['pct'])).most_common()
#ccd

[('2000', 287),
 ('1600', 268),
 ('1300', 118),
 ('1400', 101),
 ('1500', 95),
 ('1000', 59),
 ('2400', 57),
 ('>2400', 12),
 ('<900', 3)]
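
Putting the two generators together, a hedged sketch of the combined simulation is shown below. Make and engine size are drawn independently, in line with the conclusion above. To keep the example self-contained it hard-codes the counts of only the five largest makes instead of reading the CSV files, which is a simplification, and the helper `to_p` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=5)  # arbitrary seed

# Counts copied from the tables above (top five makes only, as a simplification)
make_counts = {'Volkswagen': 21070, 'Ford': 18657, 'Toyota': 16114,
               'Nissan': 12941, 'Hyundai': 12440}
cc_counts = {'<900': 881, '1000': 10210, '1300': 20204, '1400': 15272,
             '1500': 18800, '1600': 43824, '2000': 48599, '2400': 7407,
             '>2400': 3130}

def to_p(counts):
    """Normalise a dict of counts into probabilities that sum to one."""
    w = np.array(list(counts.values()), dtype=float)
    return w / w.sum()

n = 200
make = rng.choice(list(make_counts), size=n, p=to_p(make_counts))
cc = rng.choice(list(cc_counts), size=n, p=to_p(cc_counts))

# One (make, cc) pair per simulated vehicle
rows = list(zip(make.tolist(), cc.tolist()))
print(rows[:5])
```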

## References

1. __[5 Ways to Find Interesting Data Sets](https://www.dataquest.io/blog/5-ways-to-find-interesting-data-sets/)__
1. __[18 places to find data sets for data science projects](https://www.dataquest.io/blog/free-datasets-for-projects/)__
1. __[100+ Interesting Data Sets for Statistics](http://rs.io/100-interesting-data-sets-for-statistics/)__
1. __[19 Free Public Data Sets for Your First Data Science Project](https://www.springboard.com/blog/free-public-data-sets-data-science-project/)__
1. __[Cool Data Sets I’ve found](https://towardsdatascience.com/cool-data-sets-ive-found-adc17c5e55e1)__
1. __[Summary of Links to data sources](http://hdip-data-analytics.com/resources/data_sources)__
1. __[13 factors that increase fuel consumption](https://www.monitor.co.ug/Business/Auto/13-factors-that-increase-fuel-consumption/688614-2738644-b69hkkz/index.html)__
1. __[Many Factors Affect Fuel Economy](https://www.fueleconomy.gov/feg/factors.shtml)__
1. __[Want Your MPG? 10 Factors That Affect Fuel Economy](https://www.newgateschool.org/blog/entry/want-your-mpg-10-factors-that-affect-fuel-economy)__
1. __[How to Reduce Fuel Consumption](https://www.carsdirect.com/car-buying/10-ways-to-lower-engine-fuel-consumption)__
1. __[8 Main Causes of Bad Gas Mileage](https://www.carsdirect.com/car-buying/8-main-causes-of-bad-gas-mileage)__
1. __[Cars Dataset](http://www.rpubs.com/dksmith01/cars)__
1. __[The 5 types of drivers on the road](https://rsadirect.ae/blog/5-types-drivers-road)__
1. __[TEA18 - Private Cars Licensed for the First Time](https://data.gov.ie/dataset/tea18-ime-by-engine-capacity-cc-car-make-emission-band-licensing-authority-year-and-statistic-b6cc)__
1. __[Github Markdown reference](https://guides.github.com/features/mastering-markdown/)__
1. __[Jupyter Markdown reference](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html)__
1. __[Latex Reference](http://www.malinc.se/math/latex/basiccodeen.php)__