### Project December 2018 for Programming for Data Analysis

### Problem statement
For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.
We suggest you use the numpy.random package for this purpose.

Specifically, in this project you should:
- Choose a real-world phenomenon that can be measured and for which you could
collect at least one-hundred data points across at least four different variables.
- Investigate the types of variables involved, their likely distributions, and their
relationships with each other.
- Synthesise/simulate a data set as closely matching their properties as possible.
- Detail your research and implement the simulation in a Jupyter notebook – the
data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some
students may already have some real-world data sets in their own files. It is okay to
base your synthesised data set on these should you wish (please reference it if you do),
but the main task in this project is to create a **synthesised data set**. 

### Real-world phenomenon chosen

Changing tack here: based on my research into life expectancy probability and my understanding of the project, I have decided to change my real-world phenomenon. [3]

The real-world phenomenon I have chosen is the attitude of primary school children to Santa Claus. I decided the most interesting variable related to this is the age of children when they stop writing letters to Santa Claus (age). 

On investigating the problem, I reasoned that the sex of the child (sex), whether or not the child had older siblings (sibs), and the number of hours per week the child spent on a mobile device (hours) each had a relationship with the age at which they stopped writing letters to Santa Claus.


- The Sex (sex) variable will be M/F and is modelled with a Bernoulli distribution.
- The Siblings (sibs) variable will be Yes/No and is modelled with a Bernoulli distribution.
- The Hours per week (hours) variable will be a non-negative real number with two decimal places.

After extensive research at home observing my children and interrogating other parents, I find that primary school students access mobile devices on average five hours per week with a standard deviation of one hour and that a normal distribution is an acceptable model for such a variable.

In [1]:
# generates a random set of (integers) ages of primary school children from 7 to 12
import numpy as np
age = np.random.randint(7, 13, 100)
age

array([ 7,  9,  9, 12, 10,  9, 10, 11,  9,  9,  9,  8,  8, 11, 10,  7, 11,
       10,  9,  8,  8, 11,  8, 12,  9, 10,  9, 11, 12,  9,  9, 12,  7,  7,
       12,  9,  8, 11,  9,  9, 12,  8,  9,  9,  8, 12, 11, 11, 11,  7,  7,
        7,  9, 11,  7, 11, 12,  7, 12,  9,  9, 10,  9,  9, 10,  8, 10, 12,
       12, 11,  9, 11, 10,  9,  7,  8, 11, 11,  8, 12, 10,  7,  7,  9,  9,
       10, 12, 12, 11, 10, 11, 12,  7,  8,  7, 10, 11,  8,  9, 10])
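Because `np.random.randint` is unseeded here, the ages above change every time the cell is re-run. A minimal sketch of one way to make the draw reproducible, assuming NumPy's newer `Generator` interface is available; the seed value and the name `age_seeded` are arbitrary choices for illustration, not part of the original notebook:

```python
import numpy as np

# Seeded generator so the simulated ages are identical on every run.
# The seed 2018 is an arbitrary, illustrative choice.
rng = np.random.default_rng(2018)

# Integer ages from 7 to 12 inclusive (the upper bound 13 is exclusive).
age_seeded = rng.integers(7, 13, size=100)

print(age_seeded[:10])
print(age_seeded.min(), age_seeded.max())
```

Re-creating the generator with the same seed and repeating the call should give the same 100 ages.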

In [2]:
# generates 100 values of weekly mobile device usage (hours) from a normal distribution with mean 5 and standard deviation 1
hours = np.random.normal(5, 1, 100)
hours

array([3.95802225, 5.24289038, 5.7059411 , 4.5119417 , 5.3527958 ,
       4.97164348, 5.00413708, 4.22990875, 4.74487781, 4.3985171 ,
       3.76539014, 4.61294067, 5.18388036, 4.96917072, 6.20480823,
       3.26810979, 4.31351849, 5.59413503, 5.02969117, 4.38378757,
       5.05051251, 4.54373085, 5.05144976, 3.54825149, 6.33473466,
       5.85912451, 5.10488066, 3.93576947, 5.81468737, 4.9153546 ,
       5.83125338, 4.77006851, 6.35238727, 6.57128843, 6.88797424,
       5.49744491, 3.97894927, 5.84609319, 3.68386633, 4.3724841 ,
       4.44823329, 3.97748691, 5.14022737, 5.75303731, 3.89310973,
       3.9424231 , 4.65232476, 4.33357029, 5.97216184, 4.82640545,
       6.62901088, 5.29675038, 5.10968179, 6.74981495, 4.35959469,
       5.40215509, 5.61430535, 7.47429631, 4.78416903, 4.52500971,
       4.65789509, 6.25775812, 6.51329781, 4.1496974 , 4.1636108 ,
       3.87583632, 5.13232937, 5.96536317, 4.21685139, 5.16655917,
       4.92351724, 4.51437747, 3.5838205 , 5.47954101, 6.20448
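The hours variable was described earlier as a non-negative real number with two decimal places, whereas the raw normal draws carry eight decimals and could in principle fall below zero. A small sketch of one way to enforce both properties; the seed and the names `raw_hours` and `hours_clean` are assumptions made for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# Raw draws: mean 5 hours, standard deviation 1 hour.
raw_hours = rng.normal(loc=5, scale=1, size=100)

# Clip any negative draws up to 0, then round to two decimal places.
hours_clean = np.round(np.clip(raw_hours, 0, None), 2)

print(hours_clean[:5])
```

With a mean of 5 and a standard deviation of 1, negative draws are extremely unlikely, so the clip is mostly a safeguard.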

In [3]:
# generates an array of the sex of the child using the Bernoulli distribution
# (a binomial distribution with a single trial and p = 0.5)
# we will convert the integers to strings later
# Male = 0
# Female = 1

sex = np.random.binomial(1, .5, 100)
sex

array([0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1])

In [4]:
# generates an array of Yes/No answers to whether a child has older siblings using the Bernoulli distribution
# (a binomial distribution with a single trial and p = 0.5)
# we will convert the integers to strings later
# Yes = 0
# No = 1

sibs = np.random.binomial(1, .5, 100)
sibs

array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1])
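Both the sex and sibs cells draw from a Bernoulli distribution with p = 0.5. If the real proportion were different, the same call can take another probability. The sketch below assumes a hypothetical 60% of children having older siblings (a made-up figure, not a researched one) and, unlike the cell above, codes 1 as Yes and 0 as No:

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical share of children with older siblings (illustrative only).
p_older_sibs = 0.6

# One Bernoulli trial per child: 1 = has older siblings, 0 = does not.
has_older_sibs = rng.binomial(n=1, p=p_older_sibs, size=100)

# Convert the 0/1 codes straight to labels.
sibs_labels = np.where(has_older_sibs == 1, 'Yes', 'No')

# The sample proportion should sit near p_older_sibs, but will vary by draw.
print(has_older_sibs.mean())
```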

In [5]:
# Create a dataframe from the randomly generated variables

import pandas as pd
df = pd.DataFrame(data={'Age': age, 'Hours on mobile device': hours, 'Sex': sex, 'Older Siblings': sibs})



df.head()

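|   | Age | Hours on mobile device | Sex | Older Siblings |
|---|-----|------------------------|-----|----------------|
| 0 | 7 | 3.958022 | 0 | 1 |
| 1 | 9 | 5.24289 | 1 | 1 |
| 2 | 9 | 5.705941 | 1 | 1 |
| 3 | 12 | 4.511942 | 0 | 0 |
| 4 | 10 | 5.352796 | 0 | 0 |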


In [6]:
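# WIP
# Replacing numerical values with strings [4]:
# assigning the result back to the column avoids relying on inplace=True on a column selection,
# which newer versions of pandas warn about

df['Sex'] = df['Sex'].replace({0: 'Male', 1: 'Female'})
df['Older Siblings'] = df['Older Siblings'].replace({0: 'Yes', 1: 'No'})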
df.head()


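|   | Age | Hours on mobile device | Sex | Older Siblings |
|---|-----|------------------------|-----|----------------|
| 0 | 7 | 3.958022 | Male | No |
| 1 | 9 | 5.24289 | Female | No |
| 2 | 9 | 5.705941 | Female | No |
| 3 | 12 | 4.511942 | Male | Yes |
| 4 | 10 | 5.352796 | Male | Yes |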


In [7]:
# change order of the columns

df = df[['Sex','Older Siblings','Age','Hours on mobile device']]
df.head()

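|   | Sex | Older Siblings | Age | Hours on mobile device |
|---|-----|----------------|-----|------------------------|
| 0 | Male | No | 7 | 3.958022 |
| 1 | Female | No | 9 | 5.24289 |
| 2 | Female | No | 9 | 5.705941 |
| 3 | Male | Yes | 12 | 4.511942 |
| 4 | Male | Yes | 10 | 5.352796 |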



Below I have added an additional column, the sum of age and average weekly hours spent on a mobile device. From researching online, and based on the average of 5 hours per week spent online, I have concluded that if the value of column **Age+hours** is greater than 14, then the likelihood is that the child did not write a letter to Santa Claus.

In [8]:
# adding additional column based on the other columns [5]:
df['Age+hours'] = (df.Age + df['Hours on mobile device'])
df.head(10)

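|   | Sex | Older Siblings | Age | Hours on mobile device | Age+hours |
|---|-----|----------------|-----|------------------------|-----------|
| 0 | Male | No | 7 | 3.958022 | 10.958022 |
| 1 | Female | No | 9 | 5.24289 | 14.24289 |
| 2 | Female | No | 9 | 5.705941 | 14.705941 |
| 3 | Male | Yes | 12 | 4.511942 | 16.511942 |
| 4 | Male | Yes | 10 | 5.352796 | 15.352796 |
| 5 | Male | Yes | 9 | 4.971643 | 13.971643 |
| 6 | Female | Yes | 10 | 5.004137 | 15.004137 |
| 7 | Female | Yes | 11 | 4.229909 | 15.229909 |
| 8 | Female | No | 9 | 4.744878 | 13.744878 |
| 9 | Female | No | 9 | 4.398517 | 13.398517 |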


In [10]:
# letter is a list of 100 Yes/No answers drawn from the allowed values - each with equal chance of being drawn
# the np.random.choice() documentation covers how to weight the selections (the p parameter)
# so that particular values are drawn more often

wrote_letter = ['Yes', 'No']

letter = [np.random.choice(wrote_letter) for i in range(100)]
letter



['No',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'No',
 'No',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'Yes',
 'No',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'No',
 'No',
 'No',
 'No',
 'Yes',
 'No',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'No',
 'No',
 'Yes',
 'Yes',
 'No',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'No',
 'No',
 'No',
 'No',
 'No',
 'Yes',
 'Yes',
 'No',
 'No',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'No',
 'No',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'No',
 'Yes',
 'No',
 'Yes',
 'No',
 'Yes',
 'No',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'No',
 'No',
 'No',
 'Yes',
 'No',
 'Yes']
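The comments in the cell above mention that `np.random.choice()` can give more weight to particular values. A minimal sketch of that idea using the `p` argument; the 70/30 split is a hypothetical weighting chosen only for illustration, and `letter_weighted` is a name introduced here:

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

wrote_letter = ['Yes', 'No']

# Hypothetical weights: 'Yes' drawn with probability 0.7, 'No' with 0.3.
# The probabilities must sum to 1 and follow the order of wrote_letter.
letter_weighted = rng.choice(wrote_letter, size=100, p=[0.7, 0.3])

print((letter_weighted == 'Yes').mean())
```

Passing `size=100` draws all the answers in one call, so no list comprehension is needed; the result is a NumPy array rather than a Python list.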

In [17]:
# added new column to the dataframe; I need to somehow correlate this new column 'Wrote letter' with 'Age+hours'


df['Wrote letter'] = (letter)
df.head(10)

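|   | Sex | Older Siblings | Age | Hours on mobile device | Age+hours | Wrote letter |
|---|-----|----------------|-----|------------------------|-----------|--------------|
| 0 | Male | No | 7 | 3.958022 | 10.958022 | No |
| 1 | Female | No | 9 | 5.24289 | 14.24289 | No |
| 2 | Female | No | 9 | 5.705941 | 14.705941 | Yes |
| 3 | Male | Yes | 12 | 4.511942 | 16.511942 | Yes |
| 4 | Male | Yes | 10 | 5.352796 | 15.352796 | Yes |
| 5 | Male | Yes | 9 | 4.971643 | 13.971643 | No |
| 6 | Female | Yes | 10 | 5.004137 | 15.004137 | No |
| 7 | Female | Yes | 11 | 4.229909 | 15.229909 | No |
| 8 | Female | No | 9 | 4.744878 | 13.744878 | No |
| 9 | Female | No | 9 | 4.398517 | 13.398517 | Yes |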
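As drawn above, 'Wrote letter' is independent of 'Age+hours', so it does not yet reflect the rule that a value above 14 suggests no letter was written. The sketch below shows two possible ways to link them on a small stand-in dataframe called `demo` (a hypothetical name, built here so the example runs on its own). The hard threshold follows the rule stated earlier; the logistic version is an assumption added for illustration, not something established by the research above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # arbitrary seed

# Stand-in data with the same two inputs as the notebook's dataframe.
demo = pd.DataFrame({
    'Age': rng.integers(7, 13, size=100),
    'Hours on mobile device': rng.normal(5, 1, size=100),
})
demo['Age+hours'] = demo['Age'] + demo['Hours on mobile device']

# Option 1: hard rule. Above 14 means no letter was written.
demo['Wrote letter (rule)'] = np.where(demo['Age+hours'] > 14, 'No', 'Yes')

# Option 2: soft rule (assumed logistic shape). The probability of writing
# a letter is 0.5 at Age+hours = 14 and falls as the sum rises.
p_wrote = 1 / (1 + np.exp(demo['Age+hours'] - 14))
demo['Wrote letter (noisy)'] = np.where(rng.random(100) < p_wrote, 'Yes', 'No')

print(demo.head(10))
```

The hard rule makes the outcome fully determined by the sum, while the soft rule keeps some randomness, which may be closer to how real children behave.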


### References:
1. https://realpython.com/python-random/
2. https://data.gov.ie/
3. https://understandinguncertainty.org/why-life-expectancy-misleading-summary-survival
4. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html
5. https://erikrood.com/Python_References/create_new_col_pandas.html
