## Programming for Data Analysis - Project

### Problem statement

For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.

We suggest you use the numpy.random package for this purpose.

Specifically, in this project you should:

* Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

* Investigate the types of variables involved, their likely distributions, and their relationships with each other.

* Synthesise/simulate a data set as closely matching their properties as possible.

* Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

#### Note:
This project is about simulation – you must synthesise a data set. Some
students may already have some real-world data sets in their own files. It is okay to
base your synthesised data set on these should you wish (please reference them if you do),
but the main task in this project is to create a synthesised data set. The next section
gives an example project idea.

### Example project idea

As a lecturer I might pick the real-world phenomenon of the performance of students
studying a ten-credit module. After some research, I decide that the most interesting
variable related to this is the mark a student receives in the module – this is going to be
one of my variables (grade).

Upon investigation of the problem, I find that the number of hours on average a
student studies per week (hours), the number of times they log onto Moodle in the
first three weeks of term (logins), and their previous level of degree qualification (qual)
are closely related to grade. 

The hours and grade variables will be non-negative real numbers with two decimal places, logins will be a non-zero integer and qual will be a categorical variable with four possible values: none, bachelors, masters, or phd.

After some online research, I find that full-time post-graduate students study on average four hours per week with a standard deviation of a quarter of an hour and that a normal distribution is an acceptable model of such a variable. Likewise, I investigate the other three variables, and I also look at the relationships between the variables.
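As a minimal sketch of how the hours variable could be simulated, the cell below draws values from a normal distribution with a mean of 4 and a standard deviation of 0.25 using `numpy.random`. The sample size of 200 matches the two hundred students in this example; the seed and the variable names are illustrative assumptions.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(42)

# Mean of 4 hours and standard deviation of 0.25 hours, as quoted in the text.
hours = rng.normal(loc=4.0, scale=0.25, size=200)

# hours must be non-negative with two decimal places.
hours = np.round(np.clip(hours, 0, None), 2)

print(hours[:5])
print(hours.mean(), hours.std())
```

Clipping at zero rarely has any effect with these parameters, but it guarantees the non-negativity stated for the variable.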

I devise an algorithm (or method) to generate such a data set, simulating values of the
four variables for two-hundred students. I detail all this work in my notebook, and then
I add some code in to generate a data set with those properties.
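One possible shape for such a generator is sketched below. Only the hours distribution, the four variable names, the qual categories, and the sample size come from the example above. The Poisson model for logins, the category probabilities, and the linear formula for grade are assumptions made purely for illustration, and the name `students` is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n = 200  # two hundred students, as in the example

# hours: normal with mean 4 and standard deviation 0.25 (from the text).
hours = np.round(np.clip(rng.normal(4.0, 0.25, n), 0, None), 2)

# logins: ASSUMED to be 1 + Poisson(9), so every value is a positive integer.
logins = 1 + rng.poisson(9, n)

# qual: the four categories from the text; the probabilities are ASSUMED.
levels = ["none", "bachelors", "masters", "phd"]
qual = rng.choice(levels, size=n, p=[0.1, 0.6, 0.25, 0.05])

# grade: ASSUMED linear relationship plus normal noise, clipped to 0-100.
bonus = {"none": 0, "bachelors": 3, "masters": 6, "phd": 9}
qual_bonus = pd.Series(qual).map(bonus).to_numpy()
grade = 10 * hours + 1.0 * logins + qual_bonus + rng.normal(0, 5, n)
grade = np.round(np.clip(grade, 0, 100), 2)

students = pd.DataFrame({
    "hours": hours,
    "logins": logins,
    "qual": pd.Categorical(qual, categories=levels),
    "grade": grade,
})
print(students.head())
```

Building grade from the other three variables is what introduces the relationships between them; in a real project the coefficients would come from the research rather than being picked by hand.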

In [5]:
import pandas as pd

In [17]:
df = pd.read_csv('covid_impact_on_airport_traffic.csv', nrows=2)
df


Unnamed: 0,AggregationMethod,Date,Version,AirportName,PercentOfBaseline,Centroid,City,State,ISO_3166_2,Country,Geography
0,Daily,2020-07-05,1.0,Kingsford Smith,52,POINT(151.180087713813 -33.9459774986125),Sydney,New South Wales,AU,Australia,"POLYGON((151.164354085922 -33.9301772341877, 1..."
1,Daily,2020-05-28,1.0,Kingsford Smith,61,POINT(151.180087713813 -33.9459774986125),Sydney,New South Wales,AU,Australia,"POLYGON((151.164354085922 -33.9301772341877, 1..."


In [4]:
df = pd.read_csv('drug_poisoning_deaths_by_state-_us_2013_2014-v7.csv')
df


Unnamed: 0,State,2014Rate,2014Number,2014Range,2013Rate,2013Number,2013Range,Change,Significant
0,ND,6.3,43,2.8 to 11.0,2.8,20,2.8 to 11.0,125.0,Significant
1,NE,7.2,125,2.8 to 11.0,6.5,117,2.8 to 11.0,10.8,Not Significant
2,SD,7.8,63,2.8 to 11.0,6.9,55,2.8 to 11.0,13.0,Not Significant
3,IA,8.8,264,2.8 to 11.0,9.3,275,2.8 to 11.0,-5.4,Not Significant
4,TX,9.7,2601,2.8 to 11.0,9.3,2446,2.8 to 11.0,4.3,Not Significant
5,MN,9.6,517,2.8 to 11.0,9.6,523,2.8 to 11.0,0.0,Not Significant
6,VA,11.7,980,11.1 to 13.5,10.2,854,2.8 to 11.0,14.7,Significant
7,MS,11.6,336,11.1 to 13.5,10.8,316,2.8 to 11.0,7.4,Not Significant
8,GA,11.9,1206,11.1 to 13.5,10.8,1098,2.8 to 11.0,10.2,Significant
9,HI,10.9,157,2.8 to 11.0,11.0,158,2.8 to 11.0,-0.9,Not Significant


In [5]:
df = pd.read_csv('death-rate-from-opioid-use.csv')
df



Unnamed: 0,Entity,Code,Year,Deaths - Opioid use disorders - Sex: Both - Age: Age-standardized (Rate)
0,Afghanistan,AFG,1990,0.610114
1,Afghanistan,AFG,1991,0.622036
2,Afghanistan,AFG,1992,0.634234
3,Afghanistan,AFG,1993,0.656677
4,Afghanistan,AFG,1994,0.684709
...,...,...,...,...
6463,Zimbabwe,ZWE,2013,1.164133
6464,Zimbabwe,ZWE,2014,1.148561
6465,Zimbabwe,ZWE,2015,1.151673
6466,Zimbabwe,ZWE,2016,1.158403


In [11]:
pd.set_option('display.max_rows', None)
df = pd.read_csv('death-rates-smoking-age.csv', nrows=1000)
df

Unnamed: 0,Entity,Code,Year,All Ages (Rate),15-49 years (Rate),50-69 years (Rate),70+ years (Rate),Under 5 (Rate),5-14 years (Rate)
0,Afghanistan,AFG,1990,63.895905,16.589519,267.230009,679.006755,,
1,Afghanistan,AFG,1991,61.846347,15.456913,266.975516,677.617648,,
2,Afghanistan,AFG,1992,53.436511,12.767999,266.430053,679.50581,,
3,Afghanistan,AFG,1993,47.044347,11.000425,267.969428,683.973588,,
4,Afghanistan,AFG,1994,45.799808,10.73802,272.403687,691.007773,,
5,Afghanistan,AFG,1995,44.109036,10.161701,273.634237,691.828857,,
6,Afghanistan,AFG,1996,42.616425,10.067977,274.834193,693.401558,,
7,Afghanistan,AFG,1997,41.579362,10.224704,277.037448,696.796549,,
8,Afghanistan,AFG,1998,40.860275,10.520466,279.855021,700.286019,,
9,Afghanistan,AFG,1999,41.321832,11.3709,283.107385,703.955623,,
