<img src="images/GMIT logo.jpeg" width="350" align="center">

# Programming for Data Analysis Project 2019

Peter McGowan
G00190832

### Tasks:
1. Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
2. Investigate the types of variables involved, their likely distributions, and their relationships with each other.
3. Synthesise/simulate a data set as closely matching their properties as possible.
4. Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

## Background

I have chosen to simulate data about Adult Education in Ireland. I have chosen the following variables:
* Highest Qualification Awarded
* Field of Study
* Gender
* Age

## Research and Investigate

### General

The Central Statistics Office (CSO) is perhaps best known for carrying out a census every 5 years in Ireland (most recently in 2016). In addition to this, the CSO carries out a range of other important statistical work covering a broad range of themes on a rotational basis.

The General Household Survey occurs approximately quarterly and the theme varies. In Q3 & Q4 2019 the theme covered was Adult Education. Considering the HDip in Data Analytics and the makeup of its student cohort (myself included), the Adult Education Survey (AES) immediately piqued my interest when I came across it.

*** links to documents

Considering the results of this survey and the characteristics of its variables studied, I will simulate a dataset matching its qualities.

### Sample Data Characteristics

The design sample size for the survey carried out was 13,200. Of this, 4,863 valid responses were collated. I intend to simulate 200 samples, under 5% of the total actual sample size.
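As a quick check of these figures, the response rate and the simulated share can be worked out directly. This is only a sketch of the arithmetic; the variable names are my own and do not come from the AES report.

```python
# Figures quoted above from the AES
design_sample = 13_200
valid_responses = 4_863
# Planned size of the simulated dataset
simulated_rows = 200

# Share of the design sample that gave a valid response
response_rate = valid_responses / design_sample
# Size of the simulated dataset relative to the valid responses
simulated_share = simulated_rows / valid_responses

print(f"Response rate: {response_rate:.1%}")
print(f"Simulated share of valid responses: {simulated_share:.1%}")
```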

### Highest Qualification Awarded

The National Framework of Qualifications (NFQ) outlines a ten-level framework for qualifications. I am interested only in those levels which cover postgraduate qualifications:

https://nfq.qqi.ie/

NFQ Level:
0. No formal education or training
1. Primary or below
2. Primary or below
3. Lower Secondary
4. Higher Secondary, Post Leaving Certificate
5. Higher Secondary, Post Leaving Certificate
6. Post Leaving Certificate, Higher Certificate and Equivalent
7. Ordinary Degree or Equivalent
8. Honours Bachelor Degree, Graduate Diploma, Higher Diploma
9. Masters Degree, Post Graduate Diploma
10. Doctoral Degree, Higher Doctorate

Unfortunately, the AES aggregates these levels into the following categories:
* Primary or Below: 341 (7.00%)
* Lower Secondary: 654 (13.45%)
* Higher Secondary or Below: 941 (19.35%)
* Post Leaving Certificate: 799 (16.43%)
* Third Level Non-Honours Degree: 531 (10.92%)
* Third Level Honours Degree or Higher: 1,597 (32.84%)

Without further information, I will simulate a dataset approximating this range, with 6 categories.
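One possible way to draw the qualification variable is a weighted random choice over the six AES categories, using the counts listed above as weights. The sketch below assumes NumPy's `default_rng` generator; the seed value and the variable names are arbitrary choices of mine, not part of the survey.

```python
import numpy as np

# AES counts per category, as listed above
qual_counts = {
    "Primary or Below": 341,
    "Lower Secondary": 654,
    "Higher Secondary or Below": 941,
    "Post Leaving Certificate": 799,
    "Third Level Non-Honours Degree": 531,
    "Third Level Honours Degree or Higher": 1597,
}

labels = list(qual_counts)
counts = np.array(list(qual_counts.values()), dtype=float)
# Convert counts to probabilities that sum to 1
probs = counts / counts.sum()

# Seed is an arbitrary assumption, used only so the draw is repeatable
rng = np.random.default_rng(2019)
quals = rng.choice(labels, size=200, p=probs)
```

With only 200 draws, the simulated proportions will vary around the survey percentages rather than match them exactly.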

### Field of Study

The AES categorises education into 12 different fields:
* Other
* General Programmes & Qualifications
* Education
* Arts and Humanities
* Social Sciences, Journalism and Information
* Business, Administration and Law
* Natural Sciences, Mathematics and Statistics
* Information and Communication Technologies
* Engineering, Manufacturing and Construction
* Agriculture, Forestry, Fisheries and Veterinary
* Health and Welfare
* Services

Fields:
* Other (OTH):
* General Programmes & Qualifications (GEN): 0.23%
* Education (EDU): 9.16%
* Arts and Humanities (AHU): 13.03%
* Social Sciences, Journalism and Information (SJI): 6.35%
* Business, Administration and Law (BAL): 23.16%
* Natural Sciences, Mathematics and Statistics (NMS): 7.90%
* Information and Communication Technologies (ICT): 6.13%
* Engineering, Manufacturing and Construction (EMC): 10.14%
* Agriculture, Forestry, Fisheries and Veterinary (AFV): 1.54%
* Health and Welfare (HWE): 17.60%
* Services (SVC): 4.65%
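The percentages above could be used as weights for a random draw of field of study. In the sketch below, "Other" is left out because no percentage is listed for it, and the remaining eleven shares are rescaled to sum to 1. That rescaling is my own assumption, as are the seed and the variable names.

```python
import numpy as np

# Percentages listed above; "Other" (OTH) is omitted as it has no figure
field_pct = {
    "GEN": 0.23,
    "EDU": 9.16,
    "AHU": 13.03,
    "SJI": 6.35,
    "BAL": 23.16,
    "NMS": 7.90,
    "ICT": 6.13,
    "EMC": 10.14,
    "AFV": 1.54,
    "HWE": 17.60,
    "SVC": 4.65,
}

codes = list(field_pct)
pct = np.array(list(field_pct.values()))
# Rescale so the probabilities sum to exactly 1 (assumption: OTH excluded)
probs = pct / pct.sum()

# Seed is an arbitrary assumption, used only so the draw is repeatable
rng = np.random.default_rng(2019)
fields = rng.choice(codes, size=200, p=probs)
```

This draw treats field of study as independent of gender, so it does not capture the correlation between the two variables.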

### Gender

The gender breakdown of those who participated in an educational activity as given in the AES report is:
* Male: 2,213 (45.5%)
* Female: 2,650 (54.5%)

The survey would indicate that females are slightly more likely to participate in adult education. The simulated dataset will therefore reflect this.

Comparing percentages of males vs females participating in each educational field, it is clear that gender is correlated with field of study.

### Age

The age range covered by the survey was 18-64; however, the results are presented only for ages 25-64. This gives me a convenient range for the simulated ages. Results were presented in the following bands:
* 25-34 year olds
* 35-44 year olds
* 45-54 year olds
* 55-64 year olds

In constructing the dataset it will not be necessary to limit the simulated ages to these bands - I will use the 2016 census data to generate an appropriate distribution.

*** Continuous or ranges?

## Synthesise

### Preliminaries

Firstly, import several libraries to manage, simulate and visualise the data.

In [10]:
# Import pandas to manage the data
# Import numpy to analyse it
# Import matplotlib.pyplot and seaborn for visualisations
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [11]:
# Magic command to ensure that plots render inline
%matplotlib inline

In [12]:
# Control Seaborn aesthetics
# Use darkgrid plot style for contrast
sns.set_style("darkgrid")
# Set the default colour palette
sns.set_palette("colorblind")

## Generate Data

First we will create an empty pandas dataframe to hold the data.

In [8]:
df = pd.DataFrame(columns = ["Qual", "Field", "Gender", "Age" ])

In [9]:
df

| Qual | Field | Gender | Age |
|------|-------|--------|-----|


### Qualification Data

### Field of Study Data

### Gender Data

This will be a categorical variable, either "F" or "M".

In [14]:
genders = ["F", "M"]

# Weight the draw by the AES split: 54.5% female, 45.5% male
np.random.choice(genders, size=200, p=[0.545, 0.455])

array(['M', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F',
       'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F',
       'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'M',
       'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'F',
       'M', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'F',
       'F', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M',
       'M', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'F',
       'F', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F',
       'F', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'M', 'M',
       'F', 'M', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M',
       'F', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'F',
       'M', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'F',
       'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M',
       'M', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M

### Age Data

** Adjust median age, stdev etc

Presuming the ages range from 18 to 85, with a median postgraduate age of 25:

In [23]:
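# Draw 200 ages from a normal distribution with mean 25 and standard deviation 10
a = np.random.normal(25, 10, 200)
# Round to whole years and convert to integers
b = np.round(a).astype(int)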

b

array([37, 20, 39, 12, 23, 18, 25, 39, 23, 22, 22, 18, 17, 18, 18, 33, 33,
       19, 15, 38, 16, 28, 33, 10, 16, 26, 43, 21, 18, 28, 24, 30, 29, 36,
       29, 49, 14, 11, 41, 41, 20, 15, 40, 17, 33, 19, 13, 34, 29, 11, 29,
       22, 29, 25, 26, 48, 35,  7, 15, 27,  8, 30, 22, 31, 26, 38, 27, 43,
        4, 33, 36, 15, 17, 29, 32, 21, 15, 24, 21, 23, 27, 47, 26, 23, 27,
       21, 33, 30, 38, 36, 30, 19, 21, 15, 18, 18, 19, 41, 33, 24,  8,  8,
       19, 39, 35, 17, 24, 18, 24,  6, 25, 21, 32, 32, 42, 23, 26, 16, 46,
       42,  2, 24, 27, 22, 33, 42, 27, 10,  7, 28, 35, 10, 15, -1, 43, 25,
       12, 31, 29, 18, 12, 33, 24, 23,  2, 25, 46, 22, 22, 29, 20, 36, 18,
       12, 41, 34, 43, 13, 32,  7, 18, 18, 23,  3, 21, 22, 34, 28, 17, 44,
       24, 18, 33, 53, 23, 26, 33, 17, 28, 20, 13, 28, 26, 24, 18, 16, 27,
       30, 22, 18, 29, 22, 36, 41, 25, 25, 29, 26, 28, 35])
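A plain normal draw is not bounded, which is why the output above contains ages such as -1 and 2 that fall outside the presumed 18 to 85 range. One possible fix, sketched below, is to redraw until enough values fall inside the range (simple rejection sampling). The function name `sample_ages`, its parameters and the seed are hypothetical choices of mine. Note that cutting off the lower tail at 18 pushes the median of the kept values above 25, so the mean and standard deviation would still need adjusting.

```python
import numpy as np

def sample_ages(n, mean=25, sd=10, low=18, high=85, seed=None):
    """Draw n integer ages from a normal distribution, keeping only
    values between low and high (inclusive)."""
    rng = np.random.default_rng(seed)
    ages = np.empty(0, dtype=int)
    # Keep drawing batches until at least n in-range values are collected
    while ages.size < n:
        draw = np.round(rng.normal(mean, sd, size=n)).astype(int)
        draw = draw[(draw >= low) & (draw <= high)]
        ages = np.concatenate([ages, draw])
    # Trim any surplus from the final batch
    return ages[:n]

# Seed is an arbitrary assumption, used only so the draw is repeatable
ages = sample_ages(200, seed=2019)
```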

## Results