# Project 2019 Programming for Data Analysis

* Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.<br>
* Investigate the types of variables involved, their likely distributions, and their relationships with each other<br>
* Synthesise/simulate a data set as closely matching their properties as possible.<br>
* Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.<br>


# Section 1

# Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

For this project, I have selected a dataset from the Irish Government's Open Data project to research, investigate and then simulate some of its variables.<br>
The Open Data project is an initiative by the Government of Ireland that makes data held by public bodies available and easily accessible online for reuse and redistribution, to create interest in and encourage engagement with open data.<br>
I have chosen the [Office of Public Works Heritage Site Details](https://www.opw.ie/en/media/opw-heritage-site-details.csv) open dataset which contains one hundred data points across twenty-four variables and was collected in 2015.<br>
The Office of Public Works (OPW) is a Government department with responsibility for the day-to-day running of all National Monuments in State care and National Historic Properties. <br>

The real-world phenomenon presented is a collection of information about the Heritage Sites that are open to the public.<br>
I chose this dataset because it is of interest to me in my professional life.<br>
In the next section of the project I will explore the kinds of variables that appear in a dataset relating to Heritage Sites, the relationships (if any) between those variables, and the distributions that are apparent.

***

# Section 2

# Investigate the types of variables involved, their likely distributions, and their relationships with each other

**Investigation of the original OPW dataset**

In order to simulate a dataset on the subject of Heritage Sites owned by the State and citizens of Ireland, I must first investigate a pre-existing one: [Office of Public Works Heritage Site Details](https://www.opw.ie/en/media/opw-heritage-site-details.csv)<br>

**Observations on The Types of Variables**

The dataset contains information about 100 unique, named Heritage Sites managed by The Office of Public Works collected in 2015.<br>
There are 24 different variables in the original dataset, most of which relate to visitor information e.g. GPS co-ordinates and contact details for the site.
The following points are relevant to this exercise and to the objective of synthesising a data set in a methodical way that matches the original's contents.

#### Heritage Site Name
* Every Heritage Site name is a unique object

#### Pricing structures in Euro Datatype: Integer
* Adult entrance price - an integer between 0 and 12
* Senior / Group entrance price - an integer between 0 and 9
* Child entrance price - an integer between 0 and 7 
* Student entrance price - an integer between 0 and 8 
* Family entrance price - an integer between 0 and 32 

* 51% of the sites have free admission and 35% have an adult entrance fee of €5.
* When an entrance fee is charged, there is a price point for every type of visitor.
* The individual adult ticket is the most expensive, with all other individual tickets 1 or 2 euro cheaper.
* A family ticket is approximately the same price as the sum of two adult tickets plus one child ticket.
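
The pricing relationships listed above can be sketched in code. This is only an illustration under stated assumptions, not the method used later in this notebook: the seed, the variable names and the choice of `np.random.default_rng` are mine, the child ticket stands in for all the discounted tickets, and the family price is capped at the observed maximum of 32.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2019)  # arbitrary seed, only for repeatability

n_sites = 100
n_free = 51  # 51% of the sites have free admission

# Adult price: 0 for the free sites, otherwise an integer from 1 to 12 inclusive.
adult = np.zeros(n_sites, dtype=int)
adult[n_free:] = rng.integers(1, 13, size=n_sites - n_free)
adult = rng.permutation(adult)

# A discounted ticket is 1 or 2 euro cheaper than the adult ticket,
# but a paying site never charges less than 1 euro. Free sites stay free.
discount = rng.integers(1, 3, size=n_sites)
child = np.where(adult > 0, np.maximum(adult - discount, 1), 0)

# Family ticket: roughly two adult tickets plus one child ticket,
# capped at 32 (the largest family price noted in the original data).
family = np.minimum(2 * adult + child, 32)

prices = pd.DataFrame({"Adult": adult, "Child": child, "Family": family})
print(prices.head())
```

Deriving every ticket from the adult price keeps a free site free for all visitor types, which independent random draws per column would not guarantee.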

#### Visitor Numbers Datatype : Integer
* 2015 Visitor Numbers contain integers that range from 0 to 553,348. As noted under Geographical Location, there is a strong relationship between the Region and Visitor Numbers.

* 31 of the entries for 2015 Visitor Numbers contain a null value.
* The remaining 69 data points show that visitor numbers range from 1,750 to 553,348.
* The total number of visitors is 5.1 million people.
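
Figures like the ones above (null count, minimum, maximum and total) could be obtained with a few pandas calls. The sketch below uses a short made-up series: only the 1750 and 553348 endpoints come from the text, and every other value is invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; they are not read from the OPW file.
visitors = pd.Series([1750, np.nan, 42000, 553348, np.nan, 98000],
                     name="2015 Visitor Numbers")

n_missing = visitors.isna().sum()   # number of null entries
recorded = visitors.dropna()        # only the sites with a recorded figure

print("missing:", int(n_missing))
print("min:", int(recorded.min()), "max:", int(recorded.max()))
print("total:", int(recorded.sum()))
```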


#### Geographical Location Datatype : Object
* The county where the Heritage Site is located determines its Regional classification. If this information were to be shuffled, the county and region would need to stay linked.
* There is no relationship between the number of Heritage Sites in a county and the visitor numbers.
* There is a strong relationship between the Region and the Visitor Numbers.
* Instead of joining the county and region, I will omit the county variable entirely, because it has no strong relationship with any distribution; the regional relationship is the stronger one.
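
Had the county been kept, one way to shuffle it without breaking the county/region link would be to shuffle whole rows rather than single columns. A minimal sketch, assuming a tiny hand-made frame (the county/region pairs are illustrative and are not read from the OPW file):

```python
import pandas as pd

# Tiny illustrative frame; the pairs below are assumptions for the example.
sites = pd.DataFrame({
    "County": ["Dublin", "Kerry", "Galway", "Kilkenny"],
    "Region": ["Dublin", "South-West", "West", "South-East"],
})

# sample(frac=1) returns every row in random order, so County and Region
# move together and each pairing is preserved.
shuffled = sites.sample(frac=1, random_state=1).reset_index(drop=True)
print(shuffled)
```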

Heritage Sites are in 7 different geographical regions:
* Dublin
* Midlands & East Coast 
* North-West
* Shannon
* South-East
* South-West
* West

The majority of sites are located in Dublin, South-East and South-West.
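
The relationship between Region and Visitor Numbers can be examined by grouping on the Region column. The sketch below uses invented numbers purely to show the shape of the calculation; the column names match those used in this notebook.

```python
import pandas as pd

# Made-up numbers for illustration; these are not values from the OPW file.
sample = pd.DataFrame({
    "Region": ["Dublin", "Dublin", "West", "West", "Shannon"],
    "2015 Visitor Numbers": [300000, 150000, 20000, 40000, 10000],
})

# Number of sites, median visitors and total visitors per region.
by_region = sample.groupby("Region")["2015 Visitor Numbers"].agg(["count", "median", "sum"])
print(by_region)
```

Large differences between the regional medians would support treating Region as the variable that drives visitor numbers.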

#### Cafe facilities Datatype : Integer
In the original dataset, 9 out of 100 Heritage Sites have a Cafe on site.

#### Opening Dates Datatype : Object (Yes/No)
42 of the sites are open all year round; the remainder have seasonal opening times.


**Likely Distributions in the OPW Heritage Sites Dataset used to inform a synthesised Dataset**

**Relationships in the OPW Heritage Sites Dataset used to inform a synthesised Dataset**

In [None]:
#Import modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt



In [None]:
#12042019 Investigate the DataSet
#Experimenting with pandas functions
#Adapted from
# https://stackoverflow.com/questions/33034243/calculating-the-mean-and-std-on-excel-file-using-python
#Import pandas module
import pandas as pd

#Load the OPW Heritage Site Details file linked in Section 1
#(this assumes the CSV is still published at this address)
df = pd.read_csv('https://www.opw.ie/en/media/opw-heritage-site-details.csv')

#The standard deviation is the amount of variability (or spread)
#among the numbers in a data set, that is the standard (or typical)
#amount of deviation (or distance) from the mean
#https://wiki.kidzsearch.com/wiki/Standard_deviation
print("'std' calculates and displays the standard deviation of each numeric column")
#numeric_only=True skips text columns such as Name and Region
print(df.std(numeric_only=True))

In [None]:
#Print a description of the output
print("Data Visualisation - Countplot of OPW Regions")
#Code amended from https://stackoverflow.com/questions/42528921/how-to-prevent-overlapping-x-axis-labels-in-sns-countplot
#This countplot shows the number of Heritage Sites in each Region.

ax = sns.countplot(x="Region", data=df)

ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()



In [None]:
#Print a description of the output
print("Data Visualisation - Countplot of OPW Counties")
#Code amended from https://amitkushwaha.co.in/data-visualization-part-1.html
#Following on from the last plot, this countplot shows the number of Heritage Sites in each County.

ax = sns.countplot(x="County", data=df)

ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

#sns.countplot('County', data=df)

In [None]:


#Print a description of the output
print("Data Visualisation - Countplot of 2015 Visitor Numbers")
#Code amended from https://stackoverflow.com/questions/42528921/how-to-prevent-overlapping-x-axis-labels-in-sns-countplot
#This countplot draws one bar for each distinct value in the 2015 Visitor Numbers column.

ax = sns.countplot(x="2015 Visitor Numbers", data=df)

#Rotate the x-axis labels so that they do not overlap
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

In [None]:
#Code adapted from https://seaborn.pydata.org/tutorial/distributions.html
#Joint plot of 2015 Visitor Numbers against Region
sns.jointplot(x="Region", y="2015 Visitor Numbers", data=df);



In [None]:
#code adapted from https://seaborn.pydata.org/generated/seaborn.boxplot.html
ax = sns.boxplot(x="Region", y="2015 Visitor Numbers", data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
#plt.tight_layout()
#plt.show()

# Section 3

# Synthesise/simulate a data set as closely matching the properties of the original as is possible.

Following on from the findings in Section 2, I will simulate a data set that matches the properties of the original as closely as possible. I will use the numpy random package, building on my previous work during this course, where I explored that package.<br>
Unless otherwise stated, the scripts come from this project.<br>
I will:
- Permute the Heritage Site names
- Permute the Region names
- Synthesise random data for the number of visitors from integers that range from 0 to 553,348
- Synthesise random data for the price point variables for each demographic within the original range, then ensure that 51 out of the 100 Sites have free entry (zero value)
- Synthesise random data for the number of cafes available at Heritage Sites
- Synthesise random data for the opening hours at Heritage Sites

I will then merge these dataframes into one large dataset that mirrors the original.
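
For the cafe and opening-date variables, one possible approach is to fix the counts found in Section 2 (9 sites with a cafe, 42 sites open all year) and only randomise which sites receive them. This is a sketch under assumptions: the column names `New_Cafe` and `New_Open_All_Year`, the seed and the use of `np.random.default_rng` are hypothetical choices for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2019)  # arbitrary seed, only for repeatability
n_sites = 100

# Exactly 9 sites with a cafe (1) and 91 without (0), placed in random order.
cafe = rng.permutation(np.repeat([1, 0], [9, n_sites - 9]))

# Exactly 42 sites open all year ("Yes"); the rest are seasonal ("No").
open_all_year = rng.permutation(np.repeat(["Yes", "No"], [42, n_sites - 42]))

extras = pd.DataFrame({"New_Cafe": cafe, "New_Open_All_Year": open_all_year})
print(extras["New_Cafe"].sum(), (extras["New_Open_All_Year"] == "Yes").sum())
```

Fixing the counts reproduces the original proportions exactly. Drawing each site independently with probability 0.09 or 0.42 would only match them on average.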


### Shuffle the Heritage Site names and create a new dataframe

In [None]:
#Print a description of the output
print("Synthesise a Dataset - Shuffle Heritage Site Names and display the dataframe")
#code adapted from https://stackoverflow.com/questions/49545599/how-to-turn-a-pandas-column-into-array-and-transpose-it
New_Names = df[['Name']]
#Permute the names. Permute is used because shuffle creates a "key error" when used with a dataframe.
#np.random.permutation returns a shuffled copy, so df itself is left unchanged.
New_Names = pd.DataFrame(np.random.permutation(New_Names), columns=['New_Names'])
New_Names

### Shuffle the OPW regions and create a new dataframe

In [None]:
#Print a description of the output
print ("Synthesise a Dataset - Permute Heritage Site Regions")
#code adapted from https://stackoverflow.com/questions/49545599/how-to-turn-a-pandas-column-into-array-and-transpose-it

New_Regions = df[['Region']]
df2 = pd.DataFrame((np.random.permutation(New_Regions)), columns = ['New_Region'])

### Synthesise random data for the number of visitors from integers that range from 0 to 553,348

In [None]:
#Print a description of the output
print("Synthesise a Dataset - Visitor Numbers")
#As per the numpy documentation, randint returns random integers from the "discrete uniform" distribution
#The upper bound is exclusive, so high=553349 allows the observed maximum of 553348
df3 = pd.DataFrame(np.random.randint(0, high=553349, size=100, dtype='l'), columns=['New_Numbers'])

**Comment** Instead of forcing a nil number of visitors for 31 Heritage Sites, I will allow the numpy library to generate data for every site. My reason for this choice is that the number of visitors at those sites was not zero; it was simply not collected for various business reasons, e.g. St. Stephen's Green is a main thoroughfare. The dataset will be more interesting if these figures are included in it.

### Synthesise random data for the price point variables for each demographic within the original range, then ensure that 51 out of the 100 Sites have free entry (zero value)

In [None]:
#Print a description of the output
print("Synthesise a Dataset - Create random Adult Entrance Fees between the allowed range of values")
#Adult entrance price - 100 integers between 1 and 12 drawn from a discrete uniform distribution
#The lower bound is 1 rather than 0 so that zeros only come from the free-entry step below
#randint excludes the upper bound, so high=13 allows a price of 12
#Code adapted from library documentation https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.random.html
New_Adult = np.random.randint(1, high=13, size=100, dtype='l')
New_Adult = pd.DataFrame(New_Adult)
#Code adapted from https://stats.stackexchange.com/questions/283572/using-iloc-to-set-values/283575
#Replace 51 values with free entry/zero (.loc slices include both endpoints, so 0:50 covers 51 rows)
New_Adult.loc[0:50,0] = 0
#Permute the synthesised dataframe. Permute is used because shuffle creates a "key error" when used with a dataframe.
df4 = pd.DataFrame(np.random.permutation(New_Adult), columns = ['New_Adult'])


In [None]:
#Print a description of the output
print("Synthesise a Dataset - Create random values for the other Entrance Fees")
#Senior / Group entrance price - an integer between 1 and 9 from a discrete uniform distribution
#Child entrance price - an integer between 1 and 7 from a discrete uniform distribution
#Student entrance price - an integer between 1 and 8 from a discrete uniform distribution
#Family entrance price - an integer between 1 and 32 from a discrete uniform distribution
#randint excludes the upper bound, so each high value is the maximum price plus 1

#Code adapted from library documentation https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.random.html
New_SnrGroup = np.random.randint(1, high=10, size=100, dtype='l')
New_SnrGroup = pd.DataFrame(New_SnrGroup)
#Replace 51 values with free entry/zero, then permute the rows
New_SnrGroup.loc[0:50,0] = 0
df5 = pd.DataFrame((np.random.permutation(New_SnrGroup)), columns = ['New_SnrGroup'])
#
New_Child = np.random.randint(1, high=8, size=100, dtype='l')
New_Child = pd.DataFrame(New_Child)
New_Child.loc[0:50,0] = 0
df6 = pd.DataFrame((np.random.permutation(New_Child)), columns = ['New_Child'])
#
New_Student = np.random.randint(1, high=9, size=100, dtype='l')
New_Student = pd.DataFrame(New_Student)
New_Student.loc[0:50,0] = 0
df7 = pd.DataFrame((np.random.permutation(New_Student)), columns = ['New_Student'])
#
New_Family = np.random.randint(1, high=33, size=100, dtype='l')
New_Family = pd.DataFrame(New_Family)
New_Family.loc[0:50,0] = 0
df8 = pd.DataFrame((np.random.permutation(New_Family)), columns = ['New_Family'])


In [None]:
#Code adapted from https://stackoverflow.com/questions/28135436/concatenate-rows-of-two-dataframes-in-pandas
New_Dataset = pd.concat([New_Names,df2, df3, df4, df5,df6, df7, df8], axis=1)
New_Dataset
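
Because each price column above is permuted on its own, a site that is free in one column is not guaranteed to be free in another. The sketch below shows one possible alternative, which is an assumption on my part and not the method used in this notebook: a single shared mask decides which 51 sites are free, and a simple check counts rows where the columns disagree. Only two of the price columns are shown, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2019)  # arbitrary seed, only for repeatability
n_sites, n_free = 100, 51

# One shared mask marks the free sites, so every price column agrees on them.
is_free = rng.permutation(np.arange(n_sites) < n_free)

# Paid prices are drawn uniformly: adult 1 to 12, child 1 to 7 (upper bound exclusive).
adult = np.where(is_free, 0, rng.integers(1, 13, size=n_sites))
child = np.where(is_free, 0, rng.integers(1, 8, size=n_sites))

check = pd.DataFrame({"New_Adult": adult, "New_Child": child})

# Rows where one ticket is free but the other is not.
mismatch = ((check["New_Adult"] == 0) != (check["New_Child"] == 0)).sum()
print("free sites:", int(is_free.sum()), "mismatched rows:", int(mismatch))
```

The same mismatch count could be run on `New_Dataset` to see how often the independently permuted columns disagree.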



## References used in completing the project


Assignment 2019 for Programming for Data Analysis module, GMIT. [Online] <br>
Available on: https://github.com/ClodaghMurphy/Assignment-2019-progda [viewed 26 November 2019]<br>

OPW OPEN DATA SETS   [Online]<br>
Available on: https://www.opw.ie/en/opendata/#d.en.34620 [viewed 26 November 2019]<br>




End