***
## Title

Programming for Data Analysis - Assignment 2020.  
Submitted by ***Jack Caffrey***
***

## Introduction

The following assignment was undertaken as part of the Higher Diploma in Science - Data Analytics course through the Galway-Mayo Institute of Technology, for the module Programming for Data Analysis. 

***Please note the following:***
1. The assignment criteria are outlined in the ***Problem Statement*** section below.  
2. All references are indicated using [#] and are listed in the ***References*** section of this assignment.
3. Markdown formatting references are as follows.   
     * Basic Syntax. Matt Cone. https://www.markdownguide.org/basic-syntax/
     * Extended Syntax. Matt Cone. https://www.markdownguide.org/extended-syntax/
     * Motivating Examples. Jupyter Team. https://jupyternotebook.readthedocs.io/en/stable/examples/Notebook/Typesetting%20Equations.html

### Problem Statement

Using a Jupyter notebook, explain the use of the ***numpy.random*** package in Python.  
This explanation must include the following:  

1. The **purpose** of the package. 
2. The use of the **"Simple random data"** and **"Permutations"** functions.   
3. The use and purpose of at least five **"Distributions"** functions.   
4. Explain the use of **"seeds"** in generating pseudorandom numbers.   

***Note:***  
The problem statement was defined using the criteria outlined in the Programming for Data Analysis Assignment [1].

***



## 1. History of NumPy. 

Before exploring the purpose of the ***numpy.random*** package, it is important to provide a brief history of the development of the ***NumPy*** package as a whole.  
  
The ***NumPy*** package developed from the Python programming language extensions ***Numeric*** and ***Numarray*** [2]. Numeric was largely developed in 1995 by Jim Hugunin, a software programmer, with contributions from many people including Jim Fulton, David Ascher, Paul DuBois and Konrad Hinsen [3].  
Numeric was originally developed with maximum performance as its main aim. A consequence of focusing solely on maximum performance was a set of design choices that made Numeric inefficient for very large data sets [4]. (A data set is "*a collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer*") [5].  
  
The Numarray package was developed in response to this poor efficiency with large data sets. Numarray was faster than Numeric for large data sets, but slower than Numeric for small ones. This led to both packages being used side by side, with the choice depending on the use case [6].  

At the beginning of 2005, Travis Oliphant, an American data scientist and businessman [7], set out to develop a single array package (an array "*is a data structure consisting of a collection of elements, each identified by at least one array index or key.*") [8]. The new package was first called *SciPy core*, with the intention of implementing it as part of the larger scientific package *SciPy*. This approach led to confusion about how the package operated, resulting in a name change to *numerix*. The *numerix* name was already trademarked by another organisation, so another name change was required and the package ***NumPy*** was born [9].

### What is NumPy?

NumPy is described as "the fundamental package for scientific computing in Python". This Python library makes it possible to perform fast operations on arrays, including but not limited to the following:  
1. Mathematical. 
2. Logical.
3. Shape Manipulation. 
4. Sorting.
5. Random Simulation. [10]

Note: For the purpose of this assignment, random simulation (sampling) and the ***numpy.random*** package will be the main area of focus.
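
As a rough illustration of the five kinds of array operation listed above, the following sketch applies each one to a small array. The array values and the seed are arbitrary choices made for this example, and `np.random.default_rng` assumes NumPy 1.17 or later.

```python
import numpy as np

values = np.array([4, 1, 3, 2])

doubled = values * 2            # mathematical: element-wise arithmetic
is_large = values > 2           # logical: element-wise comparison
grid = values.reshape(2, 2)     # shape manipulation: 1-D array to a 2x2 array
ordered = np.sort(values)       # sorting: returns a sorted copy

# Random simulation: the seed (0) is an arbitrary value for this example.
rng = np.random.default_rng(0)
sample = rng.random(4)          # four floats in the interval [0.0, 1.0)

print(doubled, is_large, ordered, sample, sep="\n")
print(grid)
```

Each operation works on the whole array at once, without an explicit Python loop.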


### Random Sampling & the *numpy.random* Package

Random sampling is a method used to select a sample of data from a larger data population.  
In random sampling each member of the data population has an equal chance of being selected. The selection is meant to be a neutral and unbiased portrayal of the larger data population it is drawn from. When the selection is not a neutral and unbiased portrayal of the larger data population, this is known as "*sampling error*" [11].  
  
Random sampling is regarded as the best method of selecting samples from the data population you are interested in. The sample selected should portray the data population you are investigating and avoid sampling bias [12].

A package used in Python to generate these random values is ***numpy.random***. The ***numpy.random*** package supplements the built-in Python ***random*** module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions [13].  
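
The difference between the two can be sketched as follows. This is an illustrative comparison only, assuming NumPy 1.17 or later; the sizes and distribution parameters are arbitrary.

```python
import random

import numpy as np

# Standard library: one value per call, so a list needs a Python-level loop.
py_values = [random.random() for _ in range(5)]

# numpy.random: a whole array is generated in a single call.
rng = np.random.default_rng()
np_values = rng.random(5)

# Arrays of any shape can be drawn from many probability distributions,
# here a 3x2 array from a normal distribution with mean 0 and std 1.
matrix = rng.normal(loc=0.0, scale=1.0, size=(3, 2))

print(py_values)
print(np_values)
print(matrix)
```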
  
In order to generate the required pseudorandom numbers (samples), a combination of a BitGenerator and a Generator is used.  
* **BitGenerator** - creates sequences of random bits.
* **Generator**  - uses the created sequences to sample from the required statistical distribution [14].
* **Pseudorandom Numbers** - "A set of values or elements that is statistically random, but it is derived from a known starting point and is typically repeated over and over. The algorithm can repeat the sequence, therefore the numbers are not entirely random" [15].
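
The relationship between the two objects can be sketched as below. This is a minimal example, assuming NumPy 1.17 or later; the seed value 12345 is arbitrary.

```python
import numpy as np
from numpy.random import Generator, PCG64

bit_generator = PCG64(12345)      # creates the underlying sequence of random bits
rng = Generator(bit_generator)    # turns those bits into samples from distributions

uniform_floats = rng.random(3)            # floats in [0.0, 1.0)
normal_values = rng.standard_normal(3)    # samples from the standard normal

# np.random.default_rng(seed) is the usual shortcut for the two steps above,
# so the same seed gives the same first values.
shortcut = np.random.default_rng(12345)
same_floats = shortcut.random(3)

print(uniform_floats)
print(same_floats)
```

Because the numbers are pseudorandom, the same starting point reproduces the same sequence.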

Today, Generator uses bits provided by PCG64 (Permuted Congruential Generator (64-bit)) by default. PCG64 has replaced MT19937 (Mersenne Twister 19937) in this role.

MT19937 is a legacy pseudorandom number generator, and the **RandomState** class provides access to it. It is best practice to use this class only when it is essential to obtain random values identical to those produced by previous versions of NumPy, because Generator does not reproduce the exact values that RandomState produces for the normal distribution or any other distribution.  
  
PCG64 provides bits to the **Generator** and has better statistical properties than the MT19937 used by **RandomState**.
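
The following sketch shows the practical consequence: the legacy class is reproducible against itself, but its values do not match those of a Generator given the same seed. The seed 2020 is an arbitrary choice for this example.

```python
import numpy as np
from numpy.random import Generator, PCG64, RandomState

seed = 2020  # arbitrary seed for illustration

# Legacy interface (MT19937 underneath): the same seed repeats the same values.
legacy_draws = RandomState(seed).normal(size=3)
legacy_repeat = RandomState(seed).normal(size=3)

# New interface (PCG64 underneath): a different stream for the same seed.
modern_draws = Generator(PCG64(seed)).normal(size=3)

print(np.array_equal(legacy_draws, legacy_repeat))   # legacy is self-consistent
print(np.array_equal(legacy_draws, modern_draws))    # legacy and new differ
```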

### PCG64 vs MT19937 Quick Comparison

All comparison information is taken from https://numpy.org/doc/stable/reference/random/new-or-different.html#new-or-different
 
| Feature  | Older Equivalent | Notes |
|:--------:|:----------------:|:-----|
|Generator |RandomState |Generator requires a stream source, called a BitGenerator. A number of these are provided. RandomState uses the Mersenne Twister MT19937 by default, but can also be instantiated with any BitGenerator. |
|random |random_sample, rand |Access the values in a BitGenerator and convert them to float64 in the interval [0.0, 1.0). In addition to the size kwarg, now supports dtype='d' or dtype='f', and an out kwarg to fill a user-supplied array. Many other distributions are also supported. |
|integers |randint, random_integers |Use the endpoint kwarg to adjust the inclusion or exclusion of the high interval endpoint. |

For a more detailed comparison please see: https://numpy.org/doc/stable/reference/random/new-or-different.html#new-or-different
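
The keyword arguments mentioned in the table can be tried out as in the sketch below. It assumes NumPy 1.17 or later; the seed, sizes and ranges are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Generator.random: float64 by default, float32 via the dtype kwarg.
floats64 = rng.random(3)
floats32 = rng.random(3, dtype=np.float32)

# The out kwarg fills a user-supplied float64 array in place.
buffer = np.empty(3)
rng.random(out=buffer)

# Generator.integers: the high value is excluded unless endpoint=True.
half_open = rng.integers(1, 6, size=1000)                 # values 1 to 5
closed = rng.integers(1, 6, size=1000, endpoint=True)     # values 1 to 6

print(floats64.dtype, floats32.dtype)
print(half_open.max(), closed.max())
```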

The **Seed** plays a vital role in initialising both PCG64 and MT19937.  

MT19937 uses a ***Seed*** to initialise the pseudorandom number generator. The seed can be any integer between 0 and 2**32 - 1 inclusive. If no value is provided, the BitGenerator is initialised by reading data from /dev/urandom (or the Windows analogue) if available, or from the clock otherwise [16].
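
The effect of seeding the legacy generator can be sketched as follows. The seed values 42 and 2**32 are chosen only to illustrate a repeated seed and an out-of-range seed.

```python
import numpy as np
from numpy.random import RandomState

# The same seed always starts the same sequence.
first = RandomState(seed=42).randint(0, 100, size=5)
second = RandomState(seed=42).randint(0, 100, size=5)
same_seed_matches = np.array_equal(first, second)

# With no seed, the state is initialised from the operating system or clock,
# so the values generally differ from run to run.
unseeded = RandomState().randint(0, 100, size=5)

# An integer seed outside 0 to 2**32 - 1 is rejected.
try:
    RandomState(seed=2**32)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(first, second, same_seed_matches)
print(unseeded, out_of_range_rejected)
```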

The state of PCG64 consists of two 128-bit unsigned integers. The first is the state of the PRNG (pseudorandom number generator), which is advanced by an LCG (Linear Congruential Generator). The second is a fixed odd increment used in the LCG. PCG64 also supports an advance method, which moves the state forward as if a given number of values had already been drawn [17].
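
The two state values and the advance method can be inspected as in this sketch. It assumes NumPy 1.17 or later; the seed 2020 and the step count 3 are arbitrary.

```python
from numpy.random import PCG64

bit_generator = PCG64(2020)  # arbitrary seed

# The state is exposed as a dictionary holding two Python ints.
state = bit_generator.state
lcg_state = state["state"]["state"]   # state of the PRNG, advanced by the LCG
increment = state["state"]["inc"]     # fixed odd increment used in the LCG

# advance(n) moves the state forward as if n values had been drawn.
skipped = PCG64(2020)
skipped.advance(3)

stepped = PCG64(2020)
for _ in range(3):
    stepped.random_raw()              # draw and discard one raw 64-bit value

print(increment % 2)                  # the increment is odd
print(skipped.state["state"]["state"] == stepped.state["state"]["state"])
```
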
***

### Simple Random Data Functions
***

Simple random data is drawn from a population of values where each value has an equal probability of being selected. This random sample is meant to be an unbiased representation of the population [18].  

The following are Simple Random Data Functions used by the numpy.random package:
   * numpy.random.Generator.integers
   * numpy.random.Generator.random
   * numpy.random.Generator.choice
   * numpy.random.Generator.bytes   
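
A short sketch of the four functions listed above is given here. It assumes NumPy 1.17 or later; the seed, the ranges and the `colours` list are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# integers: ten rolls of a six-sided die (high is excluded by default).
dice = rng.integers(low=1, high=7, size=10)

# random: a 2x3 array of floats in the interval [0.0, 1.0).
fractions = rng.random(size=(2, 3))

# choice: pick two different items from a hypothetical list of colours.
colours = ["red", "green", "blue", "yellow"]
picked = rng.choice(colours, size=2, replace=False)

# bytes: eight random bytes returned as a Python bytes object.
raw = rng.bytes(8)

print(dice)
print(fractions)
print(picked)
print(raw)
```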

### Permutation Functions
*** 

A permutation is an ordered arrangement of values from a population without any value being repeated [19].

The following Permutation Functions are provided by the numpy.random package:
   * numpy.random.Generator.shuffle
   * numpy.random.Generator.permutation
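
The main difference between the two functions is whether the input is changed, as the following sketch shows. It assumes NumPy 1.17 or later; the seed and array contents are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed

# permutation returns a reordered copy and leaves the input untouched.
deck = np.arange(10)
reordered = rng.permutation(deck)

# shuffle reorders the array in place and returns None.
cards = np.arange(10)
result = rng.shuffle(cards)

print(deck)        # still 0 to 9 in order
print(reordered)   # same values, new order
print(cards)       # reordered in place
print(result)      # None
```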

### Distribution Functions
*** 

1. Uniform Distribution
2. Pareto Distribution
3. Normal Distribution
4. Exponential Distribution
5. Poisson Distribution

***
### Uniform Distribution

\begin{align}
p(x) & = \frac{1}{b-a}
\end{align}
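
As a rough check of the density above, the sketch below draws samples with `Generator.uniform` and compares them with the interval and its midpoint. The bounds `a` and `b`, the seed and the sample size are arbitrary values chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

a, b = 2.0, 5.0                 # lower and upper bounds of the interval
samples = rng.uniform(low=a, high=b, size=100_000)

density = 1.0 / (b - a)         # p(x) is constant everywhere on [a, b)
sample_mean = samples.mean()    # should be close to the midpoint (a + b) / 2

print(density)
print(samples.min(), samples.max())
print(sample_mean)
```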

***
### Pareto Distribution

\begin{align}
p(x) & = \frac{am^a}{x^{a+1}}
\end{align}
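
Note that `Generator.pareto` draws from the Lomax (Pareto II) form, so the classical Pareto distribution in the formula above is obtained by adding 1 and multiplying by the scale m. The sketch below does this; the shape `a`, scale `m`, seed and sample size are arbitrary values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

a, m = 3.0, 2.0                 # shape a and scale (minimum value) m

# Shift and scale the Lomax samples to get the classical Pareto distribution.
samples = (rng.pareto(a, size=100_000) + 1) * m

expected_mean = a * m / (a - 1)  # mean of the classical Pareto, valid for a > 1
sample_mean = samples.mean()

print(samples.min())             # no sample falls below m
print(expected_mean, sample_mean)
```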

***
### Normal Distribution

\begin{align}
p(x) & = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{align}
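
The roles of the mean and standard deviation in the formula above can be checked with `Generator.normal`, as in this sketch. The values of `mu` and `sigma`, the seed and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

mu, sigma = 10.0, 2.0           # mean and standard deviation
samples = rng.normal(loc=mu, scale=sigma, size=100_000)

sample_mean = samples.mean()    # should be close to mu
sample_std = samples.std()      # should be close to sigma

# Roughly 68% of values are expected within one standard deviation of the mean.
within_one_sigma = np.mean(np.abs(samples - mu) < sigma)

print(sample_mean, sample_std, within_one_sigma)
```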

***
### Exponential Distribution

\begin{align}
f\left(x; \frac{1}{\beta}\right) & = \frac{1}{\beta} \exp\left(-\frac{x}{\beta}\right)
\end{align}
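
In `Generator.exponential` the `scale` argument corresponds to the parameter beta in the formula above, which is also the mean of the distribution. A minimal sketch, with an arbitrary `beta`, seed and sample size:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

beta = 1.5                      # scale parameter; the rate is 1 / beta
samples = rng.exponential(scale=beta, size=100_000)

sample_mean = samples.mean()    # should be close to beta

print(samples.min())            # exponential samples are never negative
print(sample_mean)
```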

***
### Poisson Distribution

\begin{align}
f(k; \lambda) & = \frac{\lambda^k e^{-\lambda}}{k!}
\end{align}
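
The formula above can be compared with samples from `Generator.poisson`, as sketched below. The rate `lam`, the count `k`, the seed and the sample size are arbitrary values chosen for the example.

```python
import math

import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

lam = 4.0                       # expected number of events per interval
samples = rng.poisson(lam=lam, size=100_000)

sample_mean = samples.mean()    # for a Poisson distribution both the mean
sample_var = samples.var()      # and the variance should be close to lam

# Probability of exactly k events from the formula, against the observed share.
k = 2
pmf_k = lam**k * math.exp(-lam) / math.factorial(k)
observed_k = np.mean(samples == k)

print(sample_mean, sample_var)
print(pmf_k, observed_k)
```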

***
### References
[1] ProgDA_Assignment. GMIT.  
[2] People / Jim Hugunin. Peoplepill. https://peoplepill.com/people/jim-hugunin/      
[3] The birth of Numeric.  SciPy History_of_SciPy. https://scipy.github.io/old-wiki/pages/History_of_SciPy    
[4] Python Numeric. History. http://people.csail.mit.edu/jrennie/python/numeric/    
[5] Definitions from Oxford Languages. Dictionary. https://www.google.com/search?client=firefox-b-d&q=what+is+a+data+set  
[6] NumPy. History. https://en.wikipedia.org/wiki/NumPy  
[7] People / Travis Oliphant. Peoplepill. https://peoplepill.com/people/travis-oliphant/  
[8] Array data structure. Wikipedia. https://en.wikipedia.org/wiki/Array_data_structure  
[9] The reunion, aka the birth of NumPy. SciPy History_of_SciPy. https://scipy.github.io/old-wiki/pages/History_of_SciPy   
[10] What is NumPy?. Numpy. https://numpy.org/doc/stable/user/whatisnumpy.html#  
[11] The Economic Times. Definition of 'Random Sampling'. https://economictimes.indiatimes.com/definition/Random-Sampling  
[12] Saul Mcleod. Random Sampling. https://www.simplypsychology.org/sampling.html#:~:text=Random%20samples%20are%20the%20best,time%2C%20effort%20and%20money)  
[13] lmiguelvargasf. Differences between numpy.random and random.random in Python. https://stackoverflow.com/questions/7029993/differences-between-numpy-random-and-random-random-in-python#:~:text=From%20Python%20for%20Data%20Analysis,many%20kinds%20of%20probability%20distributions  
[14] NumPy. Random sampling (numpy.random). https://numpy.org/doc/stable/reference/random/index.html  
[15] Encyclopedia. Pseudo-random numbers. https://www.pcmag.com/encyclopedia/term/pseudo-random-numbers  
[16] NumPy. Parameters. https://numpy.org/doc/stable/reference/random/legacy  
[17] NumPy. State and Seeding. https://numpy.org/doc/stable/reference/random/bit_generators/pcg64.html#numpy.random.PCG64  
[18] Adam Hayes. Simple Random Sample. https://www.investopedia.com/terms/s/simple-random-sample.asp#:~:text=A%20simple%20random%20sample%20is,equal%20probability%20of%20being%20chosen.&text=In%20this%20case%2C%20the%20population,equal%20chance%20of%20being%20chosen.  
[19] Permutations function. Minitab® 18 Support. https://support.minitab.com/en-us/minitab/18/help-and-how-to/calculations-data-generation-and-matrices/calculator/calculator-functions/arithmetic-calculator-functions/permutations-function/#:~:text=A%20permutation%20is%20an%20ordered,from%20a%20group%20without%20repetitions.&text=Use%20the%20Permutation%20function%20to,possible%20outcomes%20(binomial%20experiment)   

***
## End