# Project 2019 Programming for Data Analysis

* Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.<br>
* Investigate the types of variables involved, their likely distributions, and their relationships with each other<br>
* Synthesise/simulate a data set as closely matching their properties as possible.<br>
* Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.<br>


# Section 1

# Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

For this project, I have selected a dataset that is available from the Irish Government's open data project to research, investigate then simulate some of the variables.<br>
The Open Data project is an initiative by the government of Ireland that makes data held by public bodies available and easily accessible online for reuse and redistribution to create interest and encourage engagement with open data.<br>
I have chosen the [Office of Public Works Heritage Site Details](https://www.opw.ie/en/media/opw-heritage-site-details.csv) open dataset which contains one hundred data points across twenty-four variables and was collected in 2015.<br>
The Office of Public Works (OPW) is a Government department with responsibility for the day-to-day running of all National Monuments in State care and National Historic Properties. <br>

The real-world phenomenon that is presented is a collection of information relating to the Heritage Sites that are open to the public.  <br>
I chose this dataset because it is of interest to me in my professional life. <br>
In the next section of the project I will explore the kinds of variables that appear in a dataset relating to Heritage Sites, the relationships (if any) between variables and the distributions that are apparent.

***

# Section 2

# Investigate the types of variables involved, their likely distributions, and their relationships with each other

**Investigation of the original OPW dataset**

In order to simulate a dataset on the subject of Heritage States owned by the State/ citizens of Ireland I must investigate a pre-existing one.[Office of Public Works Heritage Site Details](https://www.opw.ie/en/media/opw-heritage-site-details.csv)<br>
Considering that the dataset was made available as part of a government initiative to create interest and encourage engagement with open data - it was of poor quality.<br>
* The original dataset is relatively small - 100 records and 24 variables - however, the csv file was un-necessarily large when loading due to the inclusion of digits on line 2424 of the original spreadsheet.
* The financial data - in this case cost of entry for different demographics was put together in one column along with other visitor information. This created difficulties for my investigation.
I have adjusted it separately, re-save this truncated version in this github repository and continued the project with the updated format.


**Observations On The Types of Variables**

The dataset contains information about 100 unique, named Heritage Sites managed by The Office of Public Works collected in 2015.<br>
There are 24 different variables in the original dataset, most of which relate to visitor information e.g. GPS co-ordinates and contact details for the site.
The following points are relevant to this exercise and the objective of synthesising data set in a methodical way which can match the contents.

#### Heritage Site Name
* Every Heritage Site name is a unique object

#### Pricing structures in Euro, Datatype: Integer
* Adult	entrance price - an integer between 0 and 12 
* Senior / Group entrance price	- an integer between 0 and 9 
* Child entrance price - an integer between 0 and 7 
* Student entrance price - an integer between 0 and 8 
* Family entrance price - an integer between 0 and 32 

* 51% of the sites have free admission, 35% have an adult entrance fee of €5.
* When an entrance fee is paid, there is a price point for all types of visitors.
* An individual adult is the most expensive ticket with all others reducing by 1 or 2 euro from that point
* A family ticket is approximately the same price as the sum of two adult plus one child tickets

#### Visitor Numbers, Datatype : Integer
* 2015 Visitor Numbers contain integers that range from 0 - 553,348. As previously state there is a strong relationship between the Region and Visitor Numbers.

* 31 of the entries for 2015 Visitor Numbers contain a null value.
* The remaining 69 datapoints show that visitor numbers range from 1750 to 553348
* The total number of visitors is 5.1 million people


#### Geographical Location, Datatype : Object
* The county where the Heritage Site is located affects the Regional classification, if this information were to be shuffled, the county/region need to be linked.
* There is no relationship between the number of Heritage Sites in a county and the visitor numbers.
* There is a strong relationship between the Region and the Visitor Numbers.
* Instead of joining the county and region, I will therefore omit the county variable entirely my reason for this is because the county variable has no strong relationship with any distribution - the regionality is a stronger one.

Heritage Sites are in 7 different geographical regions, 
* Dublin
* Midlands & East Coast 
* North-West
* Shannon
* South-East
* South-West
* West

The majority of sites are located in Dublin, South - East and South-West.

#### Cafe Facilities, Datatype : Integer
In the original dataset, 9 out of 100 Heritage Sites have a Cafe on site

#### Opening Dates, Datatype : Integer
42 of the sites are open all year round, the remainder have seasonal opening times.


**Likely Distributions in the OPW Heritage Sites Dataset used to inform a synthesised Dataset**

What is the type of distribution that appears and that can be used to inform a synthesised Dataset?<br>
The normal distribution is a very common one and can be considered the standard distribution, therefore I will use it where there is an option in randomly generating variables.
On two occasion, the binomial distribution is used, further information about this decision appears below.<br>
In addition, the central limit theorem can be used to support my decision to use normal distribution. <br>
This theorem states that the mean of any sample of variables (with finite mean and variance) with any distribution will approximate the normal distribution.

**Relationships in the OPW Heritage Sites Dataset used to inform a synthesised Dataset**

There is a strong relationship between the Region and the Visitor Numbers in the original Dataset.
To me, this is the the most interesting variable in the original dataset.

An assumption coming to the dataset would be that the busiest sites are in the most populous region of the country.
This assumption is born out by Failte Ireland's (the National Tourism Development Authority) 2018 figures where 4 of the top 10 paying visitor attrations are in Dublin with a further two in the South-East. However, other sites on the list are in underpopulated areas e.g. the Cliffs of Moher that are world renowned for their remoteness and unspolit beauty.

I do not expect to be able to recreate this subtle relationship with synthesised data, however the relationships that are produced will be explored in due course.

# Section 3

# Synthesise/simulate a data set as closely matching the properties of the original as is possible.

In [14]:
#Import modules required for the Assignment
#NumPy package
import numpy as np
#Pandas library
import pandas as pd
#Seaborn package
import seaborn as sns
#Matplotlib library
import matplotlib.pyplot as plt

Following on from the findings in Section 2, I will simulate a data set as closely matching their properties as possible using the numpy random package thus building on my [previous work](http://localhost:8888/?token=98bc2512905f44f91efe55dc0b350cacc78b93d3f4e55086) carried out during this course where I explored the numpy random package.<br>
Unless otherwise stated, the scripts come from this project.<br>
I will :
- Permute the 100 Heritage Sites names from the original OPW Datset using random.permutation
- Synthesise 100 variables from the choice of seven Region names using random.choice
- Synthesise random data for the number of visitors from integers that range from 0 - 553,348 using random.randint
- Synthesise random data for the Adult price point variable and ensure that 51 out of the 100 Sites have free entry / zero value using random.randint and permutation<br>
- Synthesise random data for the number of cafes available at Heritage Sites using random.binomial<br>
- Synthesise random data for the opening hours at Heritage sites using random.binomial<br>


Then merge these dataframes into one large dataset that mirrors the original.


### Permute the Heritage Site names and create a new dataframe
It is not possible to randomly generate this text, therefore I will use the selection the provided in the original Dataset that informs this project.

In [15]:
df=pd.read_csv("https://raw.githubusercontent.com/ClodaghMurphy/ProgDA_ProjectDec2019/master/opw-heritage-site-detailsNEW.csv", encoding="ISO-8859–1",nrows=100)

In [16]:
#code adapted from https://stackoverflow.com/questions/49545599/how-to-turn-a-pandas-column-into-array-and-transpose-it
New_Names = df[['Name']]
#Permute the synthesised dataframe. Permute is a function from the random package that rearranges an array
#this code ensures the output will be in column format
df1 = pd.DataFrame((np.random.permutation(New_Names)), columns = ['New_Names'])
#df1

### Use .random.choice to produce a 100 row dataframe using the given 7 OPW regions

In [17]:
#The .random.choice function randomly chooses a sample from an array
#Code adapted from https://pynative.com/python-random-sample/
#Provide array of 7 OPW regions as they appear in the original dataset
Regions = (["Dublin", "Midlands & East Coast", "North-West", "Shannon", "South-East", "South-West", "West"])
#When 100 is entered into the argument, 100 selections are output
#Calling pd.Dataframe ensures the output is in a datafame format
df2 = pd.DataFrame((np.random.choice(Regions, 100)), columns = ['Region'])
#A uniform distribution is assumed in this function
#df2

### Synthesise random data for the number of visitors from integers that range from 0 - 553348

In [18]:
#As per the numpy documentation, this command returns random integers from the “discrete uniform” distribution
df3 = pd.DataFrame((np.random.randint(0, high=553348, size=100, dtype='l')), columns = ['New_Numbers'])

Instead of forcing a nil amount of Visitors for 31 Heritage Site, I will allow the numpy library to generate data.<br>
My reason for this choice, is that the number of visitors in those sites was not zero - it was simply not collected for various business reasons e.g. the site is a main thoroughfare in the case of St. Stephen's Green. It will be more interesting dataset if these statistics are contained in it.

### Synthesise random data for the Adult price point variable and ensure that 51 out of the 100 Sites have free entry / zero value

There are five different categories of visitor in the original dataset
Adult, Senior / Group, Child, Student and Family.
I have focussed on the Adult price is the most expensive individual entry price and the other amounts are based on it.


In [23]:
#Adult entrance price -  100 integers between 1 and 12 with a normal distribution
#0-12 is not used because this would result in some random values of 0
#Code adapted from library documentation https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.random.html
Adult = np.random.randint(1, high=12, size=100, dtype='l')
Adult = pd.DataFrame(Adult)
#Code adapted from https://stats.stackexchange.com/questions/283572/using-iloc-to-set-values/283575
#Replace 51 values with free entry/zero
Adult.loc[0:50,0] = 0
#Permute the synthesised dataframe. Permute is used because shuffle creates a "key error" when used with a dataframe.
df4 = pd.DataFrame(np.random.permutation(Adult), columns = ['Adult'])
#df4


### Synthesise random data for the number of cafes and opening hours at Heritage Sites

In the original dataset, there is lots of visitor information jumbled together in once cell covering facts such as whether there are toilets, parking, wheelchair access and cafes on site or nearby.
From the dataset I extracted that 100 Heritage Sites have a Cafe on site.
Similarly, through scanning through the original data in excel format which goes into great detail about the individual local opening hours I can put the information in a much simpler way - 42 of the sites are open all year round, the remainder have seasonal opening times.
(These investigations are not shown as part of this assignment - only the results.)

One of the learning Outcomes for this module is that I will be able to model real-world problems as computing problems.
I can display this ability through turning this data-intensive information into a Boolean format, i.e.,
The Heritage Site has a Cafe - True/False
The Heritage Site is open all year - True/False

In order to synthesise data to meet the requirements I will use the binomial distribution from the numpy library.
In the assignment that I completed earlier this year, I wrote about the binomial distribution which can be used in any instance repeated multiple times where there are deemed to be two possible outcomes - success or failure.
The "probability of success" input is taken from the findings in the original dataset 9/100 and 42/100 respectively.



In [25]:
#Code adapted from https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.binomial.html
n, p = 1, .09  # number of trials, probability of each trial is 9/100
df5 = pd.DataFrame((np.random.binomial(n, p, 100)), columns = ['Cafe'])
#df5

In [26]:
#Code adapted from https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.binomial.html
n, p = 1, .42  # number of trials, probability of each trial is 42/100
df6 = pd.DataFrame((np.random.binomial(n, p, 100)), columns = ['Year Round Opening'])
#df6


In [27]:
#Code adapted from https://stackoverflow.com/questions/28135436/concatenate-rows-of-two-dataframes-in-pandas
New_Dataset = pd.concat([New_Names,df2, df3, df4, df5, df6,], axis=1)
New_Dataset



Unnamed: 0,Name,Region,New_Numbers,Adult,Cafe,Year Round Opening
0,Rathcroghan - Royal Celtic Site,North-West,80824,1,0,1
1,Altamont Gardens,North-West,42150,0,0,0
2,Dromore Wood,North-West,21614,7,0,0
3,Ennis Friary,South-West,399083,0,0,1
4,Scattery Island Centre,West,418947,3,0,0
5,Barryscourt Castle,South-East,444283,0,0,0
6,Charles Fort,South-East,505665,0,0,0
7,Desmond Castle,South-West,517661,3,0,0
8,Doneraile Wildlife Park,Midlands & East Coast,245613,4,0,1
9,Fota Arboretum and Gardens,Shannon,217439,10,0,0


### Summary Data of the New Dataset

In [28]:
New_Dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
Name                  100 non-null object
Region                100 non-null object
New_Numbers           100 non-null int32
Adult                 100 non-null int32
Cafe                  100 non-null int32
Year Round Opening    100 non-null int32
dtypes: int32(4), object(2)
memory usage: 3.2+ KB


.info is used to provide a concise summary of the information contained in the New_Dataset DataFrame.
The output above tells me that there are six columns, there are 100 rows of information in each, the datatypes are as expected and the DataFrame uses 3.2KB memory.

In [None]:
### Exploration of the price variable

In [29]:
#Print a description of the output
print("Description of the OPW Dataset")
New_Dataset.describe()
#if parentheses ()are not used, all columns will display but no useful summary statistics!

Description of the OPW Dataset


Unnamed: 0,New_Numbers,Adult,Cafe,Year Round Opening
count,100.0,100.0,100.0,100.0
mean,277482.75,3.05,0.09,0.39
std,174116.156346,3.780265,0.287623,0.490207
min,11579.0,0.0,0.0,0.0
25%,129400.5,0.0,0.0,0.0
50%,250488.5,0.0,0.0,0.0
75%,453500.0,6.0,0.0,1.0
max,551214.0,11.0,1.0,1.0


According to pandas 0.25.1 documentation:
> For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. 
. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

The output confirms that there are 100 data points
A large amount of free sites produces unusual results in many of the columns e.g. the 25% and 50% quartiles are zero across the range.
The standard deviations listed give an indication of the distance between the mean and all values, this figure is affected by the free entrance to over half of the sites in the dataset where tickets prices are concerned.

In [None]:
#Print a description of the output
print ("Pandas Groupby Function - Size of Grouping 'Adult' Admission Prices ")
#Code adapted from https://dfrieds.com/data-analysis/groupby-python-pandas
#Group by of a Single Column and Apply a Single Aggregate Method on a Column
#Groupby splits the data into a "group" depending on your choice of variable
#The size() argument returns a count of non-null values
New_Dataset.groupby(by='Adult').size()


In [None]:
#Print a description of the output
print("OPW Dataset - Use of .loc and Boolean format as a sorting tool")
#this is a useful tool using Boolean values to select item
New_Dataset.loc[New_Dataset.loc[:, 'Region'] == 'West']


In [None]:
#code adapted from https://towardsdatascience.com/how-to-perform-exploratory-data-analysis-with-seaborn-97e3413e841d
New_Dataset.hist(bins=15, figsize=(15, 6), layout=(2, 4));
print ("Data Visualisation - Histograms setting out all numerical data")

sns.countplot(New_Dataset["New_Numbers"]);


#**Comment** When the bin containing the large number of free entry sites(51%) is not taken into account, the distribution of the fee paying sites can be said to have a roughly normal distribution.
#That is 
#>it has a bell shape, the mean and median are equal, and 
#>68% of the data falls within 1 standard deviation.

#Source: [Khanacademy.org](https://www.khanacademy.org/math/statistics-probability/modeling-distributions-of-data/normal-distributions-library/a/normal-distributions-review)


In [None]:
#Code amended from https://stackoverflow.com/questions/42528921/how-to-prevent-overlapping-x-axis-labels-in-sns-countplot
#Following on from the last plot, this countplot give a more accurate visualisation of the number of smokers versus non smokers.

ax = sns.countplot(x="Region", data=New_Dataset)

ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()


In [None]:
***

## References used in completing the project

SAMPLE RTE News, 2010. Galway respite
funding will not be cut [Online].
Available from : http://www.rte.ie/
news/2010/0707/health.html [viewed 1
February 2011].


Assignment 2019 for Programming for Data Analysis module, GMIT. [Online] <br>
Available on: https://github.com/ClodaghMurphy/Assignment-2019-progda [viewed 26 november 2019]<br>

Dan Friedman's Data Science Knowledge Base[Online]<br>
Available on: https://dfrieds.com/   [viewed 30 November 2019]<br>

Failte Ireland: Key Tourism Facts 2018 [Online] <br>
Available on:  https://www.failteireland.ie/FailteIreland/media/WebsiteStructure/Documents/3_Research_Insights/Key-Tourism-Facts-2018.pdf?ext=.pdf [viewed 13 December 2019 <br>

How to Perform Exploratory Data Analysis with Seaborn  [Online]<br>
Available on: https://towardsdatascience.com/how-to-perform-exploratory-data-analysis-with-seaborn-97e3413e841d  [viewed 30 November 2019]<br>

Normal Distribution [Online] <br>
Available on: http://mathworld.wolfram.com/NormalDistribution.html [viewed 7 December 2019]<br>

Normal distributions review [Online]<br>
Available on: https://www.khanacademy.org/math/statistics-probability/modeling-distributions-of-data/normal-distributions-library/a/normal-distributions-review   [viewed 7 December 2019]<br>


OPW OPEN DATA SETS   [Online]<br>
Available on: https://www.opw.ie/en/opendata/#d.en.34620 [viewed 26 November 2019]<br>

Python random.sample() function to Choose multiple items from list [Online] <br>
Available on: https://pynative.com/python-random-sample/ [viewed 30 november 2019] <br>

Random sampling (numpy.random) [online] <br>
Available on: https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.random.html [viewed 3 December 2019]<br>

Using iloc to set values [Online]<br>
Available on: https://stats.stackexchange.com/questions/283572/using-iloc-to-set-values/283575  [viewed 7 December 2019]<br>

The Pandas DataFrame – loading, editing, and viewing data in Python  [Online]<br>
Available on: https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/ [viewed 26 November 2019]<br>


End