# Programming for Data Analysis Project

**Problem statement**

Create a data set by simulating a real-world phenomenon, then model and synthesise the data in Python using
the `numpy.random` package.

**Guidelines**:
    
• Choose a real-world phenomenon that can be measured and for which you could
collect at least one-hundred data points across at least four different variables.

• Investigate the types of variables involved, their likely distributions, and their
relationships with each other.

• Synthesise/simulate a data set as closely matching their properties as possible.

• Detail your research and implement the simulation in a Jupyter notebook – the
data set itself can simply be displayed in an output cell within the notebook.


## Simulating Dublin Airport Movements

Dublin Airport is the main airport in the Republic of Ireland. It was established 80 years ago, officially opening at 9:00am on January 19, 1940. It was a cold Friday morning when the inaugural flight, an Aer Lingus Lockheed 14 bound for Liverpool, departed from Collinstown Airport, as it was then known.

In this assignment, I will simulate a profile of the passengers who used the airport within a one-hour time frame in 2018.

Number of passengers per year: 31.5 million in 2018 [1]

Average daily passengers = 31.5 million / 364 days (the airport is closed on Christmas Day), which gives about 86,500 passengers per day.

Hourly Passenger Rate = 31.5 million / 364 days / 24 hours

Hourly Passenger Rate ≈ 3,606

Since an hour has 3,600 seconds, we can anticipate that roughly one passenger used the airport each second.
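
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the variable names are illustrative, and it assumes passengers are spread evenly over all 24 hours of every open day, which is a simplification.

```python
# Annual passenger figure for 2018, taken from the text above [1].
annual_passengers = 31_500_000
# Assumption from the text: the airport is open 364 days a year.
open_days = 364

daily_passengers = annual_passengers / open_days
hourly_passengers = daily_passengers / 24
per_second = hourly_passengers / 3600

print(f"daily:      {daily_passengers:,.0f}")
print(f"hourly:     {hourly_passengers:,.0f}")
print(f"per second: {per_second:.2f}")
```

The per-second value comes out very close to one, which is why one simulated passenger per second is used for the hour.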

In [1]:
# importing numerical library
import numpy as np

rng = np.random.default_rng()

# importing library to generate data frames
import pandas as pd

# importing libraries for visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# for interactive and inline rendered plots, we use the magic command
%matplotlib inline

# Better sized plots.
plt.rcParams['figure.figsize'] = (12, 8)
# Nicer colours and styles for plots.
# plt.style.use("ggplot")
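
The generator created above is unseeded, so every run of the notebook produces a different data set. If repeatable output is wanted, a seed can be passed to `default_rng`. The sketch below is only an illustration; the seed value `2018` is an arbitrary assumption and is not used elsewhere in this notebook.

```python
import numpy as np

seed = 2018  # arbitrary value chosen for illustration

# Two generators built from the same seed produce the same stream of draws.
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

sample_a = rng_a.choice(("male", "female"), size=5, p=(0.51, 0.49))
sample_b = rng_b.choice(("male", "female"), size=5, p=(0.51, 0.49))

print(sample_a)
print("identical draws:", bool((sample_a == sample_b).all()))
```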

### Creating the first array in the data set, which represents the **DateTime** for one hour in 2018 :

In [2]:
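# importing the standard-library datetime module; pandas time series guide:
# https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries
import datetime

# illustration only (not used below): pandas parses several date representations;
# the string is written in ISO format because "06/11/2018" would be read month-first (June 11)
dti = pd.to_datetime(["2018-11-06", np.datetime64('2018-11-06'), datetime.datetime(2018, 11, 6)])

# one timestamp per second for one hour (3600 seconds);
# recent pandas releases prefer the lowercase alias "s"
DateTime = pd.date_range("2018/11/06", periods = 3600 , freq = "S")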

In [3]:
DateTime

DatetimeIndex(['2018-11-06 00:00:00', '2018-11-06 00:00:01',
               '2018-11-06 00:00:02', '2018-11-06 00:00:03',
               '2018-11-06 00:00:04', '2018-11-06 00:00:05',
               '2018-11-06 00:00:06', '2018-11-06 00:00:07',
               '2018-11-06 00:00:08', '2018-11-06 00:00:09',
               ...
               '2018-11-06 00:59:50', '2018-11-06 00:59:51',
               '2018-11-06 00:59:52', '2018-11-06 00:59:53',
               '2018-11-06 00:59:54', '2018-11-06 00:59:55',
               '2018-11-06 00:59:56', '2018-11-06 00:59:57',
               '2018-11-06 00:59:58', '2018-11-06 00:59:59'],
              dtype='datetime64[ns]', length=3600, freq='S')
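
A quick sanity check can confirm that the index covers exactly one hour at one-second resolution. This is a sketch under assumptions: the name `date_time` is illustrative, and the frequency is passed as a `pd.Timedelta` rather than a string alias so that the example does not depend on how a particular pandas version spells the alias.

```python
import pandas as pd

# One timestamp per second, 3600 of them, starting at midnight.
date_time = pd.date_range("2018-11-06", periods=3600, freq=pd.Timedelta(seconds=1))

# Distance between the first and last timestamp.
span = date_time[-1] - date_time[0]

print(len(date_time))
print(date_time.is_unique)
print(span)
```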

### **Gender** :

Representation of the passengers by gender:

- __51% Male__

- __49% Female__ [2]


In [4]:
genders = ("male", "female")

p = (0.51, 0.49)

Gender = rng.choice(genders, size = 3600, p = p)
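
To see how closely a sample of 3600 draws follows the target split, the labels can be counted with `np.unique`. The sketch below uses its own seeded generator (the seed is an arbitrary assumption) so that it runs independently of the cells above.

```python
import numpy as np

rng = np.random.default_rng(2018)  # arbitrary seed, assumed for repeatability

genders = ("male", "female")
p = (0.51, 0.49)
gender = rng.choice(genders, size=3600, p=p)

# Count each label and convert the counts to observed shares.
labels, counts = np.unique(gender, return_counts=True)
shares = dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))

print(shares)
```

The observed shares will differ slightly from 0.51 and 0.49 because of sampling variation.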

### **Age** :

People of all ages pass through our doors here at Dublin Airport.

- __15%__ of our passengers are __Under 25__

- __54%__ are between the ages of __25-49__

- __30%__ are aged __50__ or __older__ [2]

In [8]:
age_ranges = ("under_25", "25-49", "older_than_50")

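# the published shares (15%, 54%, 30%) sum to 99% because of rounding;
# the middle group is set to 0.55 so that the probabilities sum to 1
p = (0.15, 0.55, 0.30)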

Age = rng.choice(age_ranges, size = 3600, p = p)
Age

array(['25-49', '25-49', 'under_25', ..., 'older_than_50', 'under_25',
       '25-49'], dtype='<U13')
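
The published age shares (15%, 54%, 30%) add up to 99%, while `rng.choice` requires probabilities that sum to 1. One way to handle this, shown here as an alternative sketch rather than the approach used in this notebook, is to rescale the published figures instead of adjusting a single group by hand. The names `published` and `weights` and the seed are assumptions made for the example.

```python
import numpy as np

# Published percentages from the passenger profile [2]; they sum to 99.
published = {"under_25": 15, "25-49": 54, "older_than_50": 30}

# Rescale so that the probabilities sum to exactly 1.
weights = np.array(list(published.values()), dtype=float)
p = weights / weights.sum()

rng = np.random.default_rng(2018)  # arbitrary seed
age = rng.choice(list(published.keys()), size=3600, p=p)

print(dict(zip(published, p.round(4).tolist())))
```

Rescaling spreads the missing 1% proportionally over all three groups, whereas the notebook assigns it to the 25-49 group.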

### **Country of Residence** :

#### Generating the passengers' `Country of Residence` array


People from all over the world visit Dublin Airport.

- __48%__ of our passengers call the Republic of Ireland their home __(IE)__

- __18%__ arrive on our shores from the UK (including NI) __(UK / NI)__

- __17%__ visit us from Continental Europe __(EUP)__

- __16%__ come from North American destinations __(N-AMR)__

- __1%__ come from the Rest of the World __(RoW)__ [2]

In [17]:
# Country of Residence: CoR

# labels kept in their own variable so that CoR only ever holds the sampled array
countries = ("IE", "UK/NI", "EUP", "N-AMR", "RoW")
p = (0.48, 0.18, 0.17, 0.16, 0.01)

CoR = rng.choice(countries, size = 3600, p = p)


In [18]:
CoR

array(['N-AMR', 'IE', 'N-AMR', ..., 'IE', 'IE', 'EUP'], dtype='<U5')
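
The gender, age and country columns are all generated with the same pattern: a set of labels, a matching set of probabilities, and a call to `rng.choice`. That pattern could be wrapped in a small function. `sample_categories` below is a hypothetical helper written for this illustration, not part of NumPy; it also checks that the probabilities sum to 1 before sampling.

```python
import numpy as np

def sample_categories(rng, shares, size):
    """Hypothetical helper: draw `size` labels from a {label: probability} mapping."""
    labels = list(shares.keys())
    p = np.array(list(shares.values()), dtype=float)
    if not np.isclose(p.sum(), 1.0):
        raise ValueError(f"probabilities sum to {p.sum():.2f}, expected 1")
    return rng.choice(labels, size=size, p=p)

rng = np.random.default_rng(2018)  # arbitrary seed
cor_shares = {"IE": 0.48, "UK/NI": 0.18, "EUP": 0.17, "N-AMR": 0.16, "RoW": 0.01}
cor = sample_categories(rng, cor_shares, size=3600)

print(cor[:5])
```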

### Generating The Data Frame

In [9]:
df = pd.DataFrame({"DateTime" : DateTime, "Gender" : Gender, "Age" : Age, "CoR" : CoR})

In [10]:
df

| | DateTime | Gender | Age | CoR |
|---|---|---|---|---|
| 0 | 2018-11-06 00:00:00 | male | 25-49 | EUP |
| 1 | 2018-11-06 00:00:01 | female | 25-49 | UK/NI |
| 2 | 2018-11-06 00:00:02 | female | under_25 | N-AMR |
| 3 | 2018-11-06 00:00:03 | male | older_than_50 | IE |
| 4 | 2018-11-06 00:00:04 | male | 25-49 | N-AMR |
| ... | ... | ... | ... | ... |
| 3595 | 2018-11-06 00:59:55 | female | 25-49 | UK/NI |
| 3596 | 2018-11-06 00:59:56 | male | older_than_50 | EUP |
| 3597 | 2018-11-06 00:59:57 | male | older_than_50 | UK/NI |
| 3598 | 2018-11-06 00:59:58 | male | under_25 | IE |
| 3599 | 2018-11-06 00:59:59 | male | 25-49 | EUP |


In [22]:
df.transpose()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3590 | 3591 | 3592 | 3593 | 3594 | 3595 | 3596 | 3597 | 3598 | 3599 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DateTime | 2018-11-06 00:00:00 | 2018-11-06 00:00:01 | 2018-11-06 00:00:02 | 2018-11-06 00:00:03 | 2018-11-06 00:00:04 | 2018-11-06 00:00:05 | 2018-11-06 00:00:06 | 2018-11-06 00:00:07 | 2018-11-06 00:00:08 | 2018-11-06 00:00:09 | ... | 2018-11-06 00:59:50 | 2018-11-06 00:59:51 | 2018-11-06 00:59:52 | 2018-11-06 00:59:53 | 2018-11-06 00:59:54 | 2018-11-06 00:59:55 | 2018-11-06 00:59:56 | 2018-11-06 00:59:57 | 2018-11-06 00:59:58 | 2018-11-06 00:59:59 |
| Gender | male | female | female | male | male | male | male | female | male | male | ... | female | female | female | female | female | female | male | male | male | male |
| Age | 25-49 | 25-49 | under_25 | older_than_50 | 25-49 | older_than_50 | 25-49 | 25-49 | older_than_50 | 25-49 | ... | 25-49 | 25-49 | older_than_50 | older_than_50 | under_25 | 25-49 | older_than_50 | older_than_50 | under_25 | 25-49 |
| CoR | EUP | UK/NI | N-AMR | IE | N-AMR | EUP | IE | UK/NI | N-AMR | IE | ... | IE | N-AMR | EUP | EUP | UK/NI | UK/NI | EUP | UK/NI | IE | EUP |


In [23]:
df.head()

| | DateTime | Gender | Age | CoR |
|---|---|---|---|---|
| 0 | 2018-11-06 00:00:00 | male | 25-49 | EUP |
| 1 | 2018-11-06 00:00:01 | female | 25-49 | UK/NI |
| 2 | 2018-11-06 00:00:02 | female | under_25 | N-AMR |
| 3 | 2018-11-06 00:00:03 | male | older_than_50 | IE |
| 4 | 2018-11-06 00:00:04 | male | 25-49 | N-AMR |


In [24]:
df.describe()

| | DateTime | Gender | Age | CoR |
|---|---|---|---|---|
| count | 3600 | 3600 | 3600 | 3600 |
| unique | 3600 | 2 | 3 | 5 |
| top | 2018-11-06 00:02:00 | male | 25-49 | IE |
| freq | 1 | 1801 | 1971 | 1752 |
| first | 2018-11-06 00:00:00 | | | |
| last | 2018-11-06 00:59:59 | | | |


In [26]:
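# note: without parentheses, `df.mean` only displays the bound method (and with it the frame);
# every column here is categorical or a timestamp, so a numeric mean is not meaningful
df.mean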

<bound method DataFrame.mean of                 DateTime  Gender            Age    CoR
0    2018-11-06 00:00:00    male          25-49    EUP
1    2018-11-06 00:00:01  female          25-49  UK/NI
2    2018-11-06 00:00:02  female       under_25  N-AMR
3    2018-11-06 00:00:03    male  older_than_50     IE
4    2018-11-06 00:00:04    male          25-49  N-AMR
...                  ...     ...            ...    ...
3595 2018-11-06 00:59:55  female          25-49  UK/NI
3596 2018-11-06 00:59:56    male  older_than_50    EUP
3597 2018-11-06 00:59:57    male  older_than_50  UK/NI
3598 2018-11-06 00:59:58    male       under_25     IE
3599 2018-11-06 00:59:59    male          25-49    EUP

[3600 rows x 4 columns]>
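
Because every simulated column is categorical, relative frequencies are a more informative summary than a mean. The sketch below rebuilds a frame with the same probabilities used in this notebook and prints the observed share of each category. It is self-contained, and the seed is an arbitrary assumption, so its numbers will not match the outputs shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2018)  # arbitrary seed
n = 3600

df = pd.DataFrame({
    "Gender": rng.choice(("male", "female"), size=n, p=(0.51, 0.49)),
    "Age": rng.choice(("under_25", "25-49", "older_than_50"), size=n, p=(0.15, 0.55, 0.30)),
    "CoR": rng.choice(("IE", "UK/NI", "EUP", "N-AMR", "RoW"), size=n,
                      p=(0.48, 0.18, 0.17, 0.16, 0.01)),
})

# Observed share of each category, to compare against the target probabilities.
for column in ("Gender", "Age", "CoR"):
    shares = df[column].value_counts(normalize=True).round(3)
    print(shares, end="\n\n")
```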

## References:

[1] Dublin Airport; Facts and Figures: https://www.dublinairport.com/corporate/about-us/facts-and-figures

[2] Dublin Airport; Passenger Profile: https://www.dublinairport.com/corporate/about-us/passenger-profile

[3] caktusgroup: https://www.caktusgroup.com/blog/2020/04/15/quick-guide-generating-fake-data-with-pandas/
