# This Jupyter Notebook contains the instructions for Programming for Data Analysis Project 2019

### GMIT H.Dip Data Analytics - Academic Year 2019 - 2020

Student: Henk Tjalsma

GMIT email address:

G00376321@gmit.ie

## Problem statement

For this project I created a data set by simulating a real-world phenomenon of my choosing. The brief allows any phenomenon, ideally one that is of interest to you in your personal or professional life. Then, rather than collecting data related to the phenomenon, you should model and synthesise such data using Python.

The brief suggests using the numpy.random package for this purpose.

Specifically, in this project you should:

1. Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

2. Investigate the types of variables involved, their likely distributions, and their relationships with each other.

3. Synthesise/simulate a data set as closely matching their properties as possible.

4. Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set.

### What is a synthetic dataset?

As the name suggests, quite obviously, a synthetic dataset is a repository of data that is generated programmatically. So, it is not collected by any real-life survey or experiment. Its main purpose, therefore, is to be flexible and rich enough to help an ML practitioner conduct fascinating experiments with various classification, regression, and clustering algorithms. [21]

Desired properties are:

* It can be numerical, binary, or categorical (ordinal or non-ordinal).

* The number of features and length of the dataset should be arbitrary.

* It should preferably be random, and the user should be able to choose from a wide variety of statistical distributions to base the data on, i.e. the underlying random process can be precisely controlled and tuned.

* If it is used for classification algorithms, then the degree of class separation should be controllable to make the learning problem easy or hard.

* Random noise can be injected in a controllable manner.

* For a regression problem, a complex, non-linear generative process can be used for sourcing the data.
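
Two of these properties, controllable noise and a non-linear generative process, can be illustrated with a minimal sketch. The function name `make_noisy_sine`, the sine relationship and the noise levels are my own assumptions for illustration and do not come from [21].

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable


def make_noisy_sine(n_points, noise_scale):
    """Return x and y = sin(x) + Gaussian noise with standard deviation noise_scale."""
    x = np.random.uniform(0, 2 * np.pi, size=n_points)
    noise = np.random.normal(loc=0.0, scale=noise_scale, size=n_points)
    return x, np.sin(x) + noise


# noise_scale is the single knob that controls how noisy the data set is.
x_clean, y_clean = make_noisy_sine(200, noise_scale=0.0)
x_noisy, y_noisy = make_noisy_sine(200, noise_scale=0.5)

# Average distance of the noisy points from the underlying sine curve.
print(np.abs(y_noisy - np.sin(x_noisy)).mean())
```

Raising `noise_scale` makes the regression problem harder, while setting it to zero returns the underlying curve exactly.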

### Real-World Phenomenon 

The phenomenon I have picked is the long hours teenagers (13 to 19 years old) spend on their mobile phones nowadays, and the problems this can cause, such as stress, anxiety, isolation, sleep deprivation and unhappiness, to name a few.

I can see it with our own 15-year-old daughter, who spends a lot of time on her mobile when at home. This means we have to restrict her wifi use. I know from other parents we talked to that they are more lenient, as the topic was causing a lot of arguments. To be honest, many parents have given up on setting any wifi restrictions at all for older teenagers, and some even allow the mobile in the bedroom all night.

Although it is a natural process to some extent, we noticed that our daughter tends to isolate herself in her bedroom a lot more, so her interaction with the family is very limited. She also seems to spend less time socialising with her peers.

Furthermore, our daughter was big into sports as a young teenager, but gave it up completely a few years ago. This means she doesn't get the physical and mental release she needs. On top of that, she stays up or lies awake longer in the evening because her mind isn't switched off, so she doesn't get enough sleep and sleep deprivation kicks in.

![mobile](mobile.jpg)

An article in the Irish Times [15] supports these observations. Some quotes from it are given below:

<https://www.irishtimes.com/business/technology/phone-and-internet-use-making-teenagers-unhappier-than-previous-generations-1.3833840>

* "Teenagers who spend long hours browsing through social media and using their smartphones are significantly less happy and more likely to become depressed than previous generations of young people, a UN study has found."

* "Research carried out among teenagers born after 1995 in the US has found that happiness among American teenagers has declined significantly since 2012 while the time young people spend on screen activities has steadily increased."

* "The Sad State of Happiness in the United States and the Role of Digital Media report, released this week as part of the UN World Happiness study, argues that young people who spend more time with friends and family, exercising and sleeping are happier than peers who spend hours every day using the internet and smartphones."

* "It warns that those born after 1995 as part of the iGen generation are “markedly lower in psychological well-being” than millennials (born between 1980-1994) were at the same age."

##### Rise in digital media
* "Author of the report Dr Jean Twenge, professor of psychology at San Diego State University, notes that while happiness levels among young people increased between 1991 and 2011, both adults and adolescents said they were significantly less happy in 2016-17 compared to how they felt in the early 2000s. Dr Twenge draws a strong correlation between these feelings of unhappiness and the rapid increase in the number of people using smartphones in recent years. This rise in digital media use has resulted in teenagers spending less time with their peers socialising and going to parties, notes Dr Twenge, adding that the way adolescents spend time together has “fundamentally shifted”. She writes that depression, suicidal thoughts and self-harm have increased sharply among teens since 2010, particularly among young women and girls in the US and teens in the UK. In 2017, the average 17/18 year old spent more than six hours a day using the internet, on social media and texting. By 2018, 95 per cent of US teenagers had access to a smartphone while 45 per cent said they were “almost constantly” online. Girls who spend five or more hours a day on social media were found to be three times more likely to be depressed than non-social media users while heavy internet users were twice as likely to be unhappy."

#### Most Interesting Variables related to this Real-World Phenomenon

    1. The most interesting variable in this project is the hours teenagers spend on the mobile 
    (Hours).

    2. The age of the teenager is also important, as I assume the number of hours on the mobile 
    will go up when they get older (Age). 

    3. Another variable of interest is the sex of the teenager (Sex). I assume girls spend more time on 
    it than boys. 

Even though the focus of this piece is on mobile phones and the number of hours teens spend on them, access to other mobile devices that connect teens to other people and other networks might be worth checking as well. The most prevalent of these devices are mobile gaming devices like XBox, Sony PlayStation Portable (PSP), tablets, etc. Mobile gaming devices are owned predominantly by younger teens (those aged 12-14). [16]

    4. Whether or not the teenager has access to other mobile devices (Other_Devices).

    5. Also an interesting variable is Household Income (Income), to capture the difference in phone
    ownership by socio-economic status.

    6. Another variable I would like to research is Race/Ethnicity, and whether there are any 
    discrepancies between groups.  


##### Numpy Distributions to Create Synthesised Data Set

<https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html>

* The hours per week (Hours) variable will be a non-negative real number with two decimal places. I will use a normal (or uniform) distribution.

Based on the definition of the uniform distribution below, I think the normal distribution fits better:

If you need to generate random floats that lie within a specific [x, y] interval, you can use random.uniform(), which draws from the continuous uniform distribution.

A uniform distribution (often called 'rectangular') is one in which all values between two boundaries are equally likely to occur. 

Real-world measurements such as the hours per week in this project usually show some non-uniformity, unless the values have been constructed to be uniform, as in controlled studies. I expect most teenagers to cluster around a typical number of hours, with fewer at the extremes, which is what a normal distribution describes.
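
The difference between the two candidate distributions for Hours can be sketched as follows. The mean of 30 hours, standard deviation of 10 hours and uniform range of 0 to 60 hours are placeholder assumptions for illustration, not researched figures.

```python
import numpy as np

np.random.seed(1)  # fixed seed so the sketch is repeatable
n_teens = 1000

# Normal candidate: assumed mean 30 h/week and standard deviation 10 h/week.
hours_normal = np.random.normal(loc=30.0, scale=10.0, size=n_teens)
# A normal draw can be negative, so clip to [0, 168] (hours in a week),
# then round to two decimal places.
hours_normal = np.round(np.clip(hours_normal, 0, 168), 2)

# Uniform candidate: every value between 0 and 60 h/week is equally likely.
hours_uniform = np.round(np.random.uniform(low=0.0, high=60.0, size=n_teens), 2)

# Compare the centre and spread of the two samples.
print(hours_normal.mean(), hours_normal.std())
print(hours_uniform.mean(), hours_uniform.std())
```

Plotting a histogram of each array makes the contrast visible: the normal sample piles up around its mean, while the uniform sample is flat across its range.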

* The Age variable is a discrete numerical variable. I will generate a random set of integers using the randint function. I will do so for 1000 data points, in the age range 13 to 19. 

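You can generate a random integer between two endpoints in Python with the random.randint() function. This spans the full [x, y] interval and includes both endpoints. [22] Note that numpy.random.randint() behaves differently: it excludes the upper endpoint, so the upper bound has to be set one higher than the largest age wanted.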
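
A minimal sketch of the Age variable, assuming every age from 13 to 19 is equally likely (a simplification I am making for illustration). It also shows the endpoint behaviour of the NumPy function next to the standard-library one.

```python
import random

import numpy as np

np.random.seed(2)  # fixed seeds so the sketch is repeatable
random.seed(2)

# np.random.randint excludes the upper bound, so high=20 yields ages 13..19.
ages = np.random.randint(low=13, high=20, size=1000)

# The standard-library random.randint includes both endpoints.
one_age = random.randint(13, 19)

# Count how often each age occurs in the simulated sample.
values, counts = np.unique(ages, return_counts=True)
print(dict(zip(values, counts)))
```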

* The Sex (Sex) variable will be B/G (Boy/Girl) and I'll use the binomial distribution with a single trial per teenager.

* Access to other mobile devices (Other_Devices) can have the values Yes/No. Again, the binomial distribution should do. 

Alternatively, I can create a categorical variable with a number of choices. You can use np.random.choice() and specify a vector of probabilities corresponding to the array being chosen from. [20]
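
Both approaches to a two-valued variable can be sketched as below. The probabilities (0.5 for girls, 0.7 for having other devices) are assumed values for illustration only.

```python
import numpy as np

np.random.seed(3)  # fixed seed so the sketch is repeatable
n_teens = 1000

# Binomial with a single trial (n=1) gives one 0/1 outcome per teenager.
sex_flags = np.random.binomial(n=1, p=0.5, size=n_teens)
# Map 1 to "G" (girl) and 0 to "B" (boy).
sex = np.where(sex_flags == 1, "G", "B")

# np.random.choice draws the labels directly, with one probability per label.
other_devices = np.random.choice(["Yes", "No"], size=n_teens, p=[0.7, 0.3])

# Observed shares in the simulated sample.
print((sex == "G").mean(), (other_devices == "Yes").mean())
```

The choice-based version avoids the extra mapping step and extends naturally to more than two categories.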

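* The other two variables, Household Income (Income) and Race/Ethnicity, are categorical variables with a number of possible values. A normal or uniform distribution does not map naturally onto labelled categories, so np.random.choice() with a vector of probabilities looks like the better fit here as well.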
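
A minimal sketch of the two multi-valued categorical variables. The category labels and probabilities below are hypothetical placeholders, not researched values; real proportions would have to come from a source such as [16].

```python
import numpy as np

np.random.seed(4)  # fixed seed so the sketch is repeatable
n_teens = 1000

# Hypothetical income brackets with assumed probabilities (must sum to 1).
income_levels = ["Low", "Middle", "High"]
income_probs = [0.3, 0.5, 0.2]
income = np.random.choice(income_levels, size=n_teens, p=income_probs)

# Hypothetical group labels standing in for Race/Ethnicity categories.
group_labels = ["Group A", "Group B", "Group C", "Group D"]
group_probs = [0.4, 0.3, 0.2, 0.1]
ethnicity = np.random.choice(group_labels, size=n_teens, p=group_probs)

# Compare the observed share of each income bracket with its target probability.
for level, prob in zip(income_levels, income_probs):
    print(level, prob, (income == level).mean())
```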

## References

[1] Software Freedom Conservancy. Git.

https://git-scm.com/.

[2] Inc. GitHub. Github.

https://github.com/.

[3] GMIT. Quality assurance framework.

https://www.gmit.ie/general/quality-assurance-framework.

[4] NumPy developers. Numpy.

http://www.numpy.org/.

[5] Project Jupyter. Project jupyter home.

http://jupyter.org/.

[6] Anaconda

https://docs.anaconda.com/anaconda/

[7] Python

https://www.python.org/downloads/

[8] Cmder software

https://cmder.net/

[9] https://pandas.pydata.org/

[10] https://matplotlib.org/

[11] Seaborn

https://anaconda.org/anaconda/seaborn

[12] https://www.scipy.org/

[13] https://code.visualstudio.com/docs

[14] Is there a way to edit a commit message on GitHub?

https://superuser.com/questions/751699/is-there-a-way-to-edit-a-commit-message-on-github

[15] Phone and internet use making teenagers unhappier than previous generations

https://www.irishtimes.com/business/technology/phone-and-internet-use-making-teenagers-unhappier-than-previous-generations-1.3833840

[16] Teens and Mobile Phones Over the Past Five Years: Pew Internet Looks Back

https://www.pewresearch.org/internet/2009/08/19/teens-and-mobile-phones-over-the-past-five-years-pew-internet-looks-back/

[17] https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html

[18] Practical Tutorial on Data Manipulation with Numpy and Pandas in Python

https://www.hackerearth.com/practice/machine-learning/data-manipulation-visualisation-r-python/tutorial-data-manipulation-numpy-pandas-python/tutorial/

[19] https://www.dataquest.io/blog/numpy-tutorial-python/

[20] How to generate random categorical data in python according to a probability distribution?

https://stackoverflow.com/questions/57435469/how-to-generate-random-categorical-data-in-python-according-to-a-probability-dis

[21] Synthetic data generation — a must-have skill for new data scientists

https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae

[22] https://realpython.com/python-random/#the-random-module

[23] https://www.sharpsightlabs.com/blog/numpy-random-choice/#numpy-random-choice-examples

[24] https://www.techbeamers.com/using-python-random/