# Project - Programming for Data Analysis 

## Problem statement 

For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. 

Specifically, in this project you should: 

- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables. 
- Investigate the types of variables involved, their likely distributions, and their relationships with each other. 
- Synthesise/simulate a data set as closely matching their properties as possible. 
- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook. 

Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set. The next section gives an example project idea.


## Example project idea 

the performance of students studying a ten-credit module. 

most interesting variable - grade
number of hours on average a student studies per week (hours), 
the number of times they log onto Moodle in the first three weeks of term (logins), 
and their previous level of degree qualification (qual) 

The hours and grade variables will be non-negative real numbers with two decimal places, 
logins will be a non-zero integer 
and qual will be a categorical variable with four possible values: none, bachelors, masters, or phd. 
four hours per week with a standard deviation of a quarter of an hour 

and that a normal distribution is an acceptable model of such a variable. 
Likewise, I investigate the other four variables, 
and I also look at the relationships between the variables. 

I devise an algorithm (or method) to generate such a data set, 
simulating values of the four variables for two-hundred students. 
I detail all this work in my notebook, 
and then I add some code in to generate a data set with those properties.
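
As a rough illustration of the example idea above, the following is a minimal sketch of such a generator. Only the four variable names, the two-hundred students and the mean of four hours with a standard deviation of a quarter of an hour come from the example; the Poisson model for `logins`, the category probabilities for `qual` and the linear link between `hours` and `grade` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # seeded so the sketch is repeatable
n = 200  # two-hundred students, as in the example

# hours: normal, mean 4 and standard deviation 0.25 (taken from the example)
hours = np.round(rng.normal(loc=4.0, scale=0.25, size=n), 2)

# logins: non-zero integer; 1 + Poisson(5) is an ASSUMED model
logins = 1 + rng.poisson(lam=5, size=n)

# qual: four categories; these probabilities are ASSUMED
qual_levels = ['none', 'bachelors', 'masters', 'phd']
qual = rng.choice(qual_levels, size=n, p=[0.5, 0.3, 0.15, 0.05])

# grade: ASSUMED linear dependence on hours plus noise, clipped to 0-100
grade = np.round(np.clip(10 * hours + rng.normal(15, 8, size=n), 0, 100), 2)

students = pd.DataFrame(
    {'hours': hours, 'logins': logins, 'qual': qual, 'grade': grade}
)
print(students.head())
```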


## Researching a real world application of interest

### Choosing a dataset

After going through various articles on datasets and data collections across a range of topics, I settled on a topic of personal interest: the factors that influence automotive fuel efficiency.

Having reviewed several articles on the topic, there seem to be many technical reasons for less-than-optimal fuel efficiency, and the technical issues mostly come down to maintenance or a lack thereof. Many of the other factors come down to driving style, which is largely related to temperament and age. Other factors coming into play are the size of the vehicle, the distance of the commute and weather conditions, so all in all there is a reasonable set of conditions from which to build an interesting dataset.

There is an overwhelming number of variables that can influence the overall fuel efficiency of a vehicle, including very granular maintenance specifics such as oil and fuel quality, tyres, air conditioner use and travel distance. The basic factors influencing the outcome are engine size, weight, speed, driving style, aerodynamics and mechanical resistance. 

So, as a first pass, this is a rough idea of the data that should allow one to make reasonable estimates of fuel consumption.

Make  |Model  |Sub-Class|Type|CC |Cylinders|Gender|Age|Drivestyle|Serviced|Commute
------|-------|---------|----|---|---------|------|---|----------|--------|-------
Toyota|Corolla|Verso    |MPV |1.6|4        |Male  |55 |Rational  |Annually|32

Maybe Sub-Class and Type are overcomplicating the matter.

### Generating a list of manufacturers 
To get an idea of how the values should be distributed, the TEA18 dataset, referenced below, was used as a guideline for the distribution of vehicle types.

In [41]:
import pandas as pd # import pandas
pd.options.mode.chained_assignment = None  # default='warn'; set after the import so that pd is defined
df=pd.read_csv('data/CarsIrelandbyCC.csv') # import the TEA18 dataset to establish benchmark statistics for generating a dataset
makes=df[df['Make'].str.contains('All ')==False] # filter out the aggregate rows and keep manufacturers only
#makes=df[df['Make'].str.contains('All')==True] # test result
#df.sort_values('All', ascending=False) # sort for easy comparison
makes['pct']=((df['All']/168327)*100) # calculate a percentage of the 168327 total for reference
makes[['Make','All','pct']].sort_values('All', ascending=False)#.head(10) # display the list in descending order

Unnamed: 0,Make,All,pct
41,Volkswagen,21070,12.517303
14,Ford,18657,11.083783
39,Toyota,16114,9.573033
26,Nissan,12941,7.688012
16,Hyundai,12440,7.390377
3,Audi,8815,5.236831
34,Skoda,8508,5.054448
30,Renault,7444,4.422345
5,BMW,7384,4.3867
27,Opel,7350,4.366501


Looking at the list of new motor vehicles purchased in 2018, sorted by total sales, the shares seem to approximate a geometric distribution. Engine sizes seem to be similarly distributed. Regardless of the distribution type, what I would ideally like to reproduce is a randomly generated list drawn from the 41 manufacturers that consistently yields around __12.5%__ Volkswagens, __11%__ Fords, __9.6%__ Toyotas and so on.
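
The geometric impression can be checked roughly with a few lines of code. This is only a sketch: the counts are copied by hand from the top-ten table above, and the check relies on the fact that under a geometric pattern each count divided by the previous one would be roughly constant.

```python
import numpy as np

# Top-ten counts by make, copied from the table above
counts = np.array([21070, 18657, 16114, 12941, 12440,
                   8815, 8508, 7444, 7384, 7350])

# Under a geometric pattern these successive ratios would be roughly constant
ratios = counts[1:] / counts[:-1]
print(np.round(ratios, 3))
```

For these ten makes the ratios range from about 0.71 to nearly 1.0, so a geometric shape is at best a loose description of the data.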

Several hours were spent trying to reproduce this result with standard distributions; one sample of such an attempt is shown below.

In [60]:
import numpy.random as rnd
d=rnd.noncentral_chisquare(10,1,30)**2*rnd.randint(3,size=30)
d

array([ 43.54931294,   0.        , 314.30906248, 137.53244482,
       186.96055942, 166.52014011, 203.98277326, 309.00755531,
       257.70347043, 286.25154138,   0.        ,  65.4212289 ,
       124.68392147, 231.43192445, 131.8435442 ,   0.        ,
         0.        , 257.41754307, 482.5621757 , 220.37892921,
       436.34985592,  47.96785995, 177.11510052,  18.29240102,
       288.71156273, 124.37849733,  65.86384797,   0.        ,
       145.86937635,   0.        ])

After countless hours working on the problem and testing all kinds of distributions, it turned out that the most elegant solution already exists in `numpy.random.choice`: its fourth parameter, __p__, takes the _probability_ of each entry. I had completely missed this point when working with the library, and only stumbling on the word _"probabilities"_ in the description of parameter __p__ in the documentation yielded the desired results.

```python
choice(a, size=None, replace=True, p=None)

Parameters
-----------
p : 1-D array-like, optional
    The probabilities associated with each entry in a.
    If not given the sample assumes a uniform distribution over all
    entries in a.
```
### The working code

Testing the generator and distribution results

In [62]:
from collections import Counter as count
makes['pct']=((df['All']/sum(makes['All']))) # create a percentage column value that adds up to one
makes[['Make','All','pct']].sort_values('All', ascending=False) #sort the list and create lists for validation purposes
manufacturers=list(makes['Make'])
probability=list(makes['pct'])
#sum(probability)
t=rnd.choice(manufacturers,1000,p=probability)
count(t).most_common()

[('Ford', 129),
 ('Volkswagen', 116),
 ('Toyota', 91),
 ('Hyundai', 78),
 ('Nissan', 73),
 ('Audi', 65),
 ('Skoda', 57),
 ('BMW', 41),
 ('Renault', 38),
 ('Opel', 37),
 ('Peugeot', 32),
 ('Vauxhall', 29),
 ('Kia', 27),
 ('Citroen', 25),
 ('Dacia', 22),
 ('Mercedes Benz', 21),
 ('Volvo', 18),
 ('Seat', 18),
 ('Mazda', 15),
 ('Honda', 14),
 ('Suzuki', 12),
 ('Lexus', 10),
 ('Land Rover', 8),
 ('Mini', 7),
 ('Mitsubishi', 4),
 ('Jaguar', 3),
 ('Subaru', 3),
 ('Fiat', 2),
 ('Saab', 2),
 ('Porsche', 1),
 ('Smart', 1),
 ('Rover', 1)]

This approach yields exactly what I was hoping to achieve, and engine size can be derived in a similar fashion, bearing in mind that not every manufacturer offers every engine size.

In [17]:
makes=df[df['Make'].str.contains('All makes')==True]
makes

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
0,All makes,168327,881,10210,20204,15272,18800,43824,48599,7407,3130


<img src="img/cc_plot.png">

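So the engine size bands, listed in descending order of total count, would be:

```python
# Band labels are quoted because '>2400' and '<900' are column names, not numbers
cc_order=['2000','1600','1300','1500','1400','1000','2400','>2400','<900']
```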
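
Rather than typing the order by hand, it could also be derived from the totals. The sketch below assumes the band totals from the 'All makes' row shown earlier, copied into a pandas Series; the names `totals` and `cc_prob` are introduced here for illustration only.

```python
import pandas as pd

# 'All makes' totals per engine band, copied from the output above
totals = pd.Series({'<900': 881, '1000': 10210, '1300': 20204,
                    '1400': 15272, '1500': 18800, '1600': 43824,
                    '2000': 48599, '2400': 7407, '>2400': 3130})

# Bands sorted from most to least common
cc_order = totals.sort_values(ascending=False).index.tolist()

# Matching probabilities, usable as the p argument of numpy.random.choice
cc_prob = (totals / totals.sum()).loc[cc_order]
print(cc_order)
```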

### Order of makes

The order of manufacturers is not that surprising, and sorting the complete list by individual engine size columns rather than by the total creates only minor shifts. The shifting comes down to engine sizes that are more prevalent for specific manufacturers; for example, Toyota would not make 900 cc vehicles, and the <900 cc category is dominated by Nissan, with relatively low volumes. A very surprising aspect is that the most dominant engine size overall is 2000 cc, closely followed by 1600 cc.

In [33]:
makes.sort_values('<900', ascending=False).head(10)

Unnamed: 0,Make,All,<900,1000,1300,1400,1500,1600,2000,2400,>2400
26,Nissan,12941,496,53,5016,46,6164,1032,93,12,29
9,Dacia,3696,139,0,358,1,3198,0,0,0,0
30,Renault,7444,109,60,1843,45,4965,275,122,17,8
5,BMW,7384,37,0,0,0,458,52,6179,8,650
35,Smart,55,32,18,2,2,1,0,0,0,0
13,Fiat,950,23,0,741,61,1,87,35,2,0
41,Volkswagen,21070,6,1966,1959,483,399,9080,7119,2,56
8,Citroen,3801,4,174,191,170,0,2999,249,12,2
6,Chevrolet,44,3,11,9,10,0,4,5,1,1
25,Mitsubishi,1123,3,103,96,81,21,258,227,308,26
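
The per-make counts above can be turned into conditional probabilities, so that an engine band is only ever drawn if the make actually has cars in it. This is a minimal sketch using three rows copied by hand from the table; `band_counts` and `sample_band` are hypothetical names, and a full version would presumably read every row from `df` instead.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seeded for repeatability

bands = ['<900', '1000', '1300', '1400', '1500', '1600', '2000', '2400', '>2400']

# Counts per engine band for three makes, copied from the table above
band_counts = {
    'Nissan': [496, 53, 5016, 46, 6164, 1032, 93, 12, 29],
    'Dacia':  [139, 0, 358, 1, 3198, 0, 0, 0, 0],
    'BMW':    [37, 0, 0, 0, 458, 52, 6179, 8, 650],
}

def sample_band(make):
    """Draw one engine band for `make`, weighted by that make's own counts."""
    counts = np.array(band_counts[make], dtype=float)
    return rng.choice(bands, p=counts / counts.sum())

# Bands with a zero count for a make can never be drawn for that make
cars = [(make, sample_band(make)) for make in ['Nissan', 'Dacia', 'BMW', 'Dacia']]
print(cars)
```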


In [34]:
makes=df[df['Make'].str.contains('All makes')==False]
m=makes.sort_values('All', ascending=False)
m=list(m['Make'])
print(m)

['Volkswagen', 'Ford', 'Toyota', 'Nissan', 'Hyundai', 'Audi', 'Skoda', 'Renault', 'BMW', 'Opel', 'Kia', 'Peugeot', 'Vauxhall', 'Mercedes Benz', 'Citroen', 'Dacia', 'Mazda', 'Seat', 'Honda', 'Volvo', 'Suzuki', 'Mitsubishi', 'Mini', 'Land Rover', 'Fiat', 'Lexus', 'Jaguar', 'Saab', 'Subaru', 'All other makes', 'Porsche', 'Alfa Romeo', 'Ssangyong', 'Jeep', 'Smart', 'Chevrolet', 'Rover', 'Chrysler', 'Daihatsu', 'Daewoo', 'Austin', 'Dodge']


So, arranging the makes in order of popularity:
```python
makes = ['Volkswagen', 'Ford', 'Toyota', 'Nissan', 'Hyundai', 'Audi', 'Skoda', 'Renault', 'BMW', 'Opel', 'Kia', 'Peugeot', 'Vauxhall', 'Mercedes Benz', 'Citroen', 'Dacia', 'Mazda', 'Seat', 'Honda', 'Volvo', 'Suzuki', 'Mitsubishi', 'Mini', 'Land Rover', 'Fiat', 'Lexus', 'Jaguar', 'Saab', 'Subaru', 'All other makes', 'Porsche', 'Alfa Romeo', 'Ssangyong', 'Jeep', 'Smart', 'Chevrolet', 'Rover', 'Chrysler', 'Daihatsu', 'Daewoo', 'Austin', 'Dodge']
```

So in order to 

### Defining the data values and types for the dataset


Variable  |Description                 |Type or range|Distribution
----------|----------------------------|-------------|------------
Make      |Manufacturer                |Text         |Geometric
Model     |Model                       |Text         |    - 
CC        |Engine size (litres)        |0.8-4.5      |Geometric
Cylinder  |Cylinders inferred from CC  |2-12         |    - 
Gender    |Gender of driver            |male/female  |Bernoulli
Age       |Age of driver to infer style|16-99        |Normal
Drivestyle|Driver type                 |Text         |    - 
Serviced  |Serviced annually           |yes/no       |Bernoulli
Commute   |Distance of commute         |1-100        |Normal
Type      |Urban, Rural, Highway       |Text         |Categorical
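
The driver-related rows of this table could be simulated along the following lines. This is only a sketch: the variable names and ranges come from the table, but the sample size and every probability, mean and standard deviation below are assumptions chosen for illustration and would need to be replaced with researched values. Because the road type has three levels, it is drawn as a categorical variable here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)  # seeded for repeatability
n = 200  # ASSUMED sample size

# Bernoulli variables as single binomial trials; 0.5 and 0.7 are ASSUMED
gender = np.where(rng.binomial(1, 0.5, size=n) == 1, 'Male', 'Female')
serviced = np.where(rng.binomial(1, 0.7, size=n) == 1, 'yes', 'no')

# Normal variables clipped to the table ranges; means and sds are ASSUMED
age = np.clip(rng.normal(45, 15, size=n), 16, 99).round().astype(int)
commute = np.clip(rng.normal(30, 15, size=n), 1, 100).round().astype(int)

# Three road types, so a categorical draw; probabilities are ASSUMED
road_type = rng.choice(['Urban', 'Rural', 'Highway'], size=n, p=[0.5, 0.3, 0.2])

drivers = pd.DataFrame({'Gender': gender, 'Age': age, 'Serviced': serviced,
                        'Commute': commute, 'Type': road_type})
print(drivers.head())
```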

In [15]:
import numpy.random as rnd
d=rnd.noncentral_chisquare(10,1,30)**2*rnd.randint(3,size=30)
d

array([   0.        ,    0.        ,  277.12823245,    0.        ,
          0.        ,    0.        ,  139.86625596,   71.36090239,
       1136.50294628,    0.        ,    0.        ,    0.        ,
         35.60118598,   44.32864418,  189.52707968,   41.8604969 ,
         80.03613357,  654.35099001,  230.67076016,    0.        ,
          0.        ,   85.1178053 ,    7.79018641,  989.87195361,
          0.        ,  421.03669429,    0.        ,    0.        ,
        255.36137528,  120.06445116])


## References

1. __[5 Ways to Find Interesting Data Sets](https://www.dataquest.io/blog/5-ways-to-find-interesting-data-sets/)__
1. __[18 places to find data sets for data science projects](https://www.dataquest.io/blog/free-datasets-for-projects/)__
1. __[100+ Interesting Data Sets for Statistics](http://rs.io/100-interesting-data-sets-for-statistics/)__
1. __[19 Free Public Data Sets for Your First Data Science Project](https://www.springboard.com/blog/free-public-data-sets-data-science-project/)__
1. __[Cool Data Sets I’ve found](https://towardsdatascience.com/cool-data-sets-ive-found-adc17c5e55e1)__
1. __[Summary of Links to data sources](http://hdip-data-analytics.com/resources/data_sources)__
1. __[13 factors that increase fuel consumption](https://www.monitor.co.ug/Business/Auto/13-factors-that-increase-fuel-consumption/688614-2738644-b69hkkz/index.html)__
1. __[Many Factors Affect Fuel Economy](https://www.fueleconomy.gov/feg/factors.shtml)__
1. __[Want Your MPG? 10 Factors That Affect Fuel Economy](https://www.newgateschool.org/blog/entry/want-your-mpg-10-factors-that-affect-fuel-economy)__
1. __[How to Reduce Fuel Consumption](https://www.carsdirect.com/car-buying/10-ways-to-lower-engine-fuel-consumption)__
1. __[8 Main Causes of Bad Gas Mileage](https://www.carsdirect.com/car-buying/8-main-causes-of-bad-gas-mileage)__
1. __[Cars Dataset](http://www.rpubs.com/dksmith01/cars)__
1. __[The 5 types of drivers on the road](https://rsadirect.ae/blog/5-types-drivers-road)__
1. __[TEA18 - Private Cars Licensed for the First Time](https://data.gov.ie/dataset/tea18-ime-by-engine-capacity-cc-car-make-emission-band-licensing-authority-year-and-statistic-b6cc)__
1. __[Github Markdown reference](https://guides.github.com/features/mastering-markdown/)__
1. __[Jupyter Markdown reference](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html)__
1. __[Latex Reference](http://www.malinc.se/math/latex/basiccodeen.php)__