# Programming for Data Analysis Project

**Problem statement**

Create a data set by simulating a real-world phenomenon, then model and synthesise the data in Python using
the `numpy.random` package.

**Guidelines**:
    
• Choose a real-world phenomenon that can be measured and for which you could
collect at least one-hundred data points across at least four different variables.

• Investigate the types of variables involved, their likely distributions, and their
relationships with each other.

• Synthesise/simulate a data set as closely matching their properties as possible.

• Detail your research and implement the simulation in a Jupyter notebook – the
data set itself can simply be displayed in an output cell within the notebook.


## Simulating Dublin Airport movements

Dublin Airport is the main airport in the Republic of Ireland. It officially opened 80 years ago, at 9:00am on January 19, 1940. It was a cold Friday morning when the inaugural flight, an Aer Lingus Lockheed 14 bound for Liverpool, departed from Collinstown Airport, as it was then known.

In this assignment, I will simulate a profile of the passengers who used the airport during a one-hour window in 2018.

Number of passengers per year: 31.5 million in 2018 [1]

Average daily passengers: 31.5 million / 364 days (the airport is closed on Christmas Day), which gives about 86.5 thousand passengers per day.

Hourly Passenger Rate = 31.5 million / 364 days / 24 hours

Hourly Passenger Rate ≈ 3,606

An hour has 3,600 seconds, so on average roughly one passenger used the airport each second.
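
The arithmetic above can be checked in a few lines. This is a minimal sketch: the variable names are illustrative and not part of the original notebook, and the 364 operating days follow the Christmas Day closure assumed above.

```python
# Annual passenger total for 2018, taken from reference [1]
annual_passengers = 31_500_000

# Operating days, assuming the airport closes only on Christmas Day
operating_days = 364

# Average rates per day, per hour and per second
daily_rate = annual_passengers / operating_days
hourly_rate = daily_rate / 24
per_second_rate = hourly_rate / 3600

print(f"Daily:      {daily_rate:,.0f}")
print(f"Hourly:     {hourly_rate:,.0f}")
print(f"Per second: {per_second_rate:.2f}")
```

A per-second rate close to 1 is what motivates generating 3600 rows, one per second, for the simulated hour.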

In [1]:
# importing numerical library
import numpy as np

rng = np.random.default_rng()

# importing library to generate data frames
import pandas as pd

# importing libraries for visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# for interactive and inline rendered plots, we use the magic command
%matplotlib inline

# Better sized plots.
plt.rcParams['figure.figsize'] = (12, 8)
# Nicer colours and styles for plots.
# plt.style.use("ggplot")

#### Creating the first column of the data set: a datetime for each second of one hour in 2018

In [15]:
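# importing the standard-library datetime module, used alongside the pandas time series tools
# https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries
import datetime

# three equivalent ways of expressing 6 November 2018; the ISO string avoids
# "06/11/2018" being parsed month-first as 11 June
dti = pd.to_datetime(["2018-11-06", np.datetime64('2018-11-06'), datetime.datetime(2018, 11, 6)])

# one timestamp per second for one hour: 3600 periods at "S" (second) frequency
# (pandas 2.2+ prefers the lowercase alias "s")
DateTime = pd.date_range("2018/11/06", periods = 3600 , freq = "S")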

In [16]:
DateTime

DatetimeIndex(['2018-11-06 00:00:00', '2018-11-06 00:00:01',
               '2018-11-06 00:00:02', '2018-11-06 00:00:03',
               '2018-11-06 00:00:04', '2018-11-06 00:00:05',
               '2018-11-06 00:00:06', '2018-11-06 00:00:07',
               '2018-11-06 00:00:08', '2018-11-06 00:00:09',
               ...
               '2018-11-06 00:59:50', '2018-11-06 00:59:51',
               '2018-11-06 00:59:52', '2018-11-06 00:59:53',
               '2018-11-06 00:59:54', '2018-11-06 00:59:55',
               '2018-11-06 00:59:56', '2018-11-06 00:59:57',
               '2018-11-06 00:59:58', '2018-11-06 00:59:59'],
              dtype='datetime64[ns]', length=3600, freq='S')

### **Gender**

The gender split of passengers was:

- __51% Male__

- __49% Female__ [2]


In [4]:
genders = ("male", "female")

p = (0.51, 0.49)

gender = rng.choice(genders, size = 3600, p = p)

In [10]:
df = pd.DataFrame({"DateTime" : DateTime, "gender" : gender})

In [11]:
df

| | DateTime | gender |
|---|---|---|
| 0 | 2018-11-06 00:00:00 | male |
| 1 | 2018-11-06 00:00:01 | male |
| 2 | 2018-11-06 00:00:02 | female |
| 3 | 2018-11-06 00:00:03 | male |
| 4 | 2018-11-06 00:00:04 | male |
| ... | ... | ... |
| 3595 | 2018-11-06 00:59:55 | female |
| 3596 | 2018-11-06 00:59:56 | male |
| 3597 | 2018-11-06 00:59:57 | male |
| 3598 | 2018-11-06 00:59:58 | male |
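
A quick sanity check on the gender column is to compare the simulated proportions with the 51% / 49% target. The sketch below regenerates the column with a fixed seed, which is an assumption added here for repeatability (the notebook itself uses an unseeded generator), and the name `observed` is illustrative.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily so the check is repeatable (not in the original notebook)
rng = np.random.default_rng(42)

genders = ("male", "female")
p = (0.51, 0.49)
gender = rng.choice(genders, size=3600, p=p)

# Observed share of each category in the simulated sample
observed = pd.Series(gender).value_counts(normalize=True)
print(observed)
```

With 3600 draws the observed shares should land within a percentage point or two of the targets; exact values depend on the seed.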


#### Generating the `Country of Residence` of passengers


People from all over the world use Dublin Airport. By country of residence, passengers break down as follows:

- 48% live in the Republic of Ireland __(Ireland)__

- 18% come from the UK, including Northern Ireland __(UK / NI)__

- 17% come from Continental Europe __(Europe)__

- 16% come from North America __(N.America)__

- 1% come from the Rest of the World __(RoW)__
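
The shares listed above can be simulated in the same way as the gender column. This is a minimal sketch rather than the notebook's own implementation: the names `countries`, `p_country` and `country`, and the suggested column name in the comment, are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()

# Category labels and shares as listed above (the shares sum to 100%)
countries = ("Ireland", "UK / NI", "Europe", "N.America", "RoW")
p_country = (0.48, 0.18, 0.17, 0.16, 0.01)

# One draw per simulated passenger, matching the 3600 rows of the other columns
country = rng.choice(countries, size=3600, p=p_country)

# Hypothetical column name; in the notebook this could be attached with
# df["country_of_residence"] = country
print(pd.Series(country).value_counts())
```

Because the Rest of the World share is only 1%, that category should appear only a few dozen times in 3600 draws.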

In [7]:
dframe = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['India', 'USA', 'China', 'Russia'])

# compute a formatted string from each floating point value in frame
changefn = lambda x: '%.2f' % x


# Make changes element-wise
dframe['d'].map(changefn)

India     -0.68
USA        0.41
China     -0.66
Russia    -0.60
Name: d, dtype: object

## References:

[1] Dublin Airport; Facts and Figures: https://www.dublinairport.com/corporate/about-us/facts-and-figures

[2] Dublin Airport; Passenger Profile: https://www.dublinairport.com/corporate/about-us/passenger-profile