# Programming for Data Analysis Project

---

### Autumn/Winter 2021
---

<br>

### The Session

The project brief stipulated we *"create a data set by simulating a real-world phenomenon"*. It was further suggested that we pick something of interest to us in our personal or professional lives. I've decided to do something on traditional music sessions! I'm going to try to simulate a dataset based on the attendees at sessions in a given month in a particular pub in Cork city!
I'll try to work out all of the variables, examine their properties and potential data types, and how they relate to each other. I'll then try to work out code to simulate a random dataset based on that information.

![session1](images/session1.jpg)

<br>

#### Variables

1. The players  
There can be anywhere between 3 and sometimes more than 10 musicians on any given night, and this particular pub (pre-covid!) would usually have 7 sessions a week. Taking a month as 4 weeks (28 sessions), that's a monthly range of between 84 and 280+. The brief asks for "at least one-hundred data points across at least four different variables" - hopefully my simulation doesn't return a dataset close to 84!  
The norm would be around 5 or 6 musicians. 3 is always the minimum, as they would be the paid hosts of the session. Sometimes there might be more than 10 for a party or special occasion. Possibly this could be a normal distribution with a mean of 5 or 6 and a standard deviation of 2 or 3? Maybe with a lower cutoff at 3? I could experiment with a standard deviation that occasionally allows more than 10. The datatype would obviously have to be integer. This value could also serve as the index?

2. The instruments  
These can be any of fiddle, accordion, guitar, flute, uilleann pipes, bodhrán, bouzouki, banjo, tin whistle, concertina, double bass and mandolin. Fiddle, accordion, and guitar would be the most common and there would nearly always be at least these three. I guess I could create a list and write code to choose a number of them at random, specifying a much higher percentage chance for the 3 mentioned and various lower percentages for the others. Another issue I could address is that there would never be more than 1 bodhrán or bouzouki. The datatype would be object (string, pardon the pun!!). I can use NumPy's `choice` (`np.random.choice`, or `rng.choice` on a generator) to choose from an array and specify the percentage chance of each coming out.  
```rng.choice(list_1, p=[0.1, 0.1, 0.1, 0.1, 0.1, 0.5], size=6)```

3. Genre  
While the pub mostly has Irish traditional music (ITM), there are also sessions of bluegrass, old-time, and blues music. I could create an array and use `choice` again? Or code the genres as 1,2,3,4?

4. Ability  
All of the musicians would be of differing ability levels. The hosts would normally be the most able. The guests can range from beginners just joining in, to experienced professional musicians passing through town. The pub in question is quite famous so there would be more of the latter. I guess I could create an ability range from 1-10, again maybe with a normal distribution and a mean around 7? This would suggest a datatype of integer.  
Or this could be categorical objects like 'beginner', 'intermediate', 'advanced'? Or even just intermediate and advanced as there wouldn't be many beginners at these sessions. So this could also be an array of 2/3 items? Or a boolean if just two?

5. Age  
The pub in question is over-21s only and is regarded as an 'old man's pub'! Age isn't necessarily connected to ability but can be! I need to decide if this is relevant! The age range would be from 21 up to 80 I'd say. This would be fairly evenly spread I think, with possible spikes at either end of the spectrum - old heads and college kids! So a data type of integer.  
Could also be age groups? In which case it would be a list - maybe college kids, grown-ups, old heads!

6. Repertoire  
A top traditional musician would have a repertoire of hundreds if not thousands of tunes. However, a top younger musician mightn't have learnt that many tunes yet. Equally, an intermediate older musician may have amassed a huge number of tunes but not be great at playing them! So I guess repertoire could be a big range - maybe from 50 up to 2000? And it would be directly related to age and possibly ability. Datatype integer? Or maybe ranges, i.e. 50-100, 100-200, 200-500, 500-1000, >1000? In which case an array?

7. Paid  
The session would always have at least 3 hosts who would be paid. It's most common for these to be the best ability-wise, with the biggest repertoire, and older (but not always!). The typical host instruments would be accordion, fiddle and guitar. This datatype would be boolean - True or False. For the genres outside of ITM more musicians may share the fee.

8. Drink  
Just putting this in for the craic! The typical drinks might be Beamish, Guinness, Murphys, various lagers on tap, craft beers, red wine, spirits, water/soft drinks. The older hosts and musicians tend to drink Beamish and red wine! The younger craft beers and non-alcoholic drinks? I guess a list? Need to do a bit of research on the drinking habits of the other music genres maybe?

9. Night of the week  
Another possibility! The midweek nights have fewer musicians - sometimes just the hosts - while the weekends are always busier. The age profile would always be older during the week (retirees!) while the weekend would have a bigger range. There would also be more drinking at the weekend! I'll maybe create a list of the seven days and maybe another option for musicians who play more than one night?

So what might be the point of all this? Like any dataset, we could investigate whether we can determine someone's age from their music genre and drinking habits. Or determine their ability from instrument, repertoire and what night of the week they played!  
This dataset assumes that each of the musicians only appears once, which is not the case in reality! There is one particular musician who participates in multiple sessions in different genres each week!

<br>

#### Coding the variables

We begin by importing the necessary Python packages.

In [1]:
# numerical arrays
import numpy as np

# plotting
import matplotlib.pyplot as plt

# dataframes
import pandas as pd

1. **The players**  
We'll decide that the month is December so that's 31 days and 31 sessions. The range of musicians per session is 3-10, so we could create an array of 31 values between those 2 limits. We use `numpy.random` to create a random number generator and then create the array.
I can change the '31' argument to change the month. Maybe give that as an option to the user? I can also change or remove the seed to give a different result each time.

In [2]:
# create a random number generator with seed
rng = np.random.default_rng(42)

# create an array of 31 integers from 3 up to, but not including, 11 (so 3 to 10)
x = rng.integers(3, 11, 31)

# print x
x

array([ 3,  9,  8,  6,  6,  9,  3,  8,  4,  3,  7, 10,  8,  9,  8,  9,  7,
        4,  9,  6,  7,  5,  4, 10,  9,  8,  6,  9,  7,  6,  6])
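
The Variables section floated a normal distribution with a lower cutoff of 3 as an alternative to uniform integers, and the paragraph above mentions making the month an option. A minimal sketch of both ideas follows. The function name `nightly_counts`, the mean of 5.5, the standard deviation of 2 and the use of `calendar.monthrange` to look up the number of days are all assumptions of mine rather than part of the notebook's code.

```python
import calendar

import numpy as np


def nightly_counts(year, month, mean=5.5, sd=2.0, minimum=3, seed=42):
    """Simulate the number of musicians at each nightly session in a month."""
    rng = np.random.default_rng(seed)
    # number of days (and so sessions) in the chosen month
    n_days = calendar.monthrange(year, month)[1]
    # draw from a normal distribution and round to whole musicians
    counts = np.rint(rng.normal(mean, sd, n_days)).astype(int)
    # the hosts are always there, so never go below the minimum
    return np.clip(counts, minimum, None)


counts = nightly_counts(2021, 12)
print(counts)
print(counts.sum())
```

Clipping piles the low draws up at exactly 3, which roughly matches the hosts-only midweek nights. Redrawing any value below 3 would be another option if that spike looks too big.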

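We calculate the `.sum()` of those numbers, which gives the total number of musicians for the month. Note that this overwrites `x`, so from here on `x` is the total rather than the array of nightly counts.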

In [3]:
# sum of x
x = x.sum()

We need to turn this total into the first column of the dataframe, and the index?

In [4]:
# create a NumPy array of the integers 0 to x-1 (x is the exclusive upper limit)
data = np.arange(x)

# have a look
data

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18

In [5]:
# turn it into a dataframe
df = pd.DataFrame(data, columns = ['muso'])

# have a look
df

|  | muso |
| --- | --- |
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| ... | ... |
| 208 | 208 |
| 209 | 209 |
| 210 | 210 |
| 211 | 211 |
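
Because `x` ends up holding only the monthly total, the dataframe has no record of which musician was at which session. One way to keep that link, sketched here with names of my own (`counts`, `session`, `demo`), is to hold on to the nightly counts and repeat each session number once per musician with `np.repeat`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# musicians per night, kept in its own variable rather than overwritten
counts = rng.integers(3, 11, 31)
total = counts.sum()

# session 1 repeated counts[0] times, session 2 repeated counts[1] times, ...
session = np.repeat(np.arange(1, 32), counts)

demo = pd.DataFrame({"muso": np.arange(total), "session": session})
print(demo.head())
print(len(demo) == total)
```

A session column like this would also make it possible to set per-session values later on, such as genre or night of the week.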


2. **The instruments**  
We need to create a list of the instruments first.

In [6]:
# create a list of instruments
instrs = ['accordion', 'fiddle', 'guitar', 'flute', 'uilleann_pipes', 'concertina', 'bodhrán',
        'mandolin', 'bass', 'banjo']

In [7]:
# we can pass in the probability for each item favoring one over the others
instrs = rng.choice(instrs, p=[0.2, 0.2, 0.3, 0.05, 0.05, 0.05, 0.05, 0.05, 0.025, 0.025], size=x)

In [8]:
df['instrument'] = instrs.tolist()

In [9]:
df

|  | muso | instrument |
| --- | --- | --- |
| 0 | 0 | guitar |
| 1 | 1 | accordion |
| 2 | 2 | concertina |
| 3 | 3 | guitar |
| 4 | 4 | uilleann_pipes |
| ... | ... | ... |
| 208 | 208 | fiddle |
| 209 | 209 | mandolin |
| 210 | 210 | accordion |
| 211 | 211 | guitar |
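
The single `choice` call treats every row independently, so nothing guarantees the three host instruments or limits a session to one bodhrán. A possible per-session approach is sketched below. The helper `session_instruments` is hypothetical, and the weights are simply the ones from the cell above reused for the guest draws.

```python
import numpy as np

rng = np.random.default_rng(42)

# the three host instruments that are nearly always present
core = ["accordion", "fiddle", "guitar"]
instruments = ["accordion", "fiddle", "guitar", "flute", "uilleann_pipes",
               "concertina", "bodhrán", "mandolin", "bass", "banjo"]
weights = np.array([0.2, 0.2, 0.3, 0.05, 0.05, 0.05, 0.05, 0.05, 0.025, 0.025])


def session_instruments(n_players, rng):
    """Return the instruments for one session: three hosts plus weighted guests."""
    line_up = list(core)
    p = weights.copy()
    for _ in range(n_players - len(core)):
        # renormalise because a weight may have been zeroed out
        pick = str(rng.choice(instruments, p=p / p.sum()))
        line_up.append(pick)
        if pick == "bodhrán":
            # at most one bodhrán: remove it from the remaining draws
            p[instruments.index("bodhrán")] = 0.0
    return line_up


print(session_instruments(8, rng))
```

Calling the helper once per night and concatenating the results would give an instrument column with the same length as the dataframe.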


3. **The genre**  
This is maybe slightly more complicated in that some instruments are more popular with some genres than others. For example all of the ITM sessions will probably have an accordion whereas the blues or bluegrass certainly wouldn't. Of the 7 sessions a week, 4 are ITM and then 1 each of bluegrass, old-time, and blues. For now I'm just going to divide them up and work out the issues later!

In [10]:
# create a list of genres
genres = ['ITM', 'bluegrass', 'old-time', 'blues']

In [11]:
# rough weights for now; 4 ITM sessions out of 7 would be closer to 4/7, 1/7, 1/7, 1/7
genres = rng.choice(genres, p=[0.6, 0.2, 0.1, 0.1], size=x)

In [12]:
df['genre'] = genres.tolist()
df

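|  | muso | instrument | genre |
| --- | --- | --- | --- |
| 0 | 0 | guitar | ITM |
| 1 | 1 | accordion | ITM |
| 2 | 2 | concertina | ITM |
| 3 | 3 | guitar | blues |
| 4 | 4 | uilleann_pipes | ITM |
| ... | ... | ... | ... |
| 208 | 208 | fiddle | ITM |
| 209 | 209 | mandolin | bluegrass |
| 210 | 210 | accordion | bluegrass |
| 211 | 211 | guitar | old-time |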


**N.B. We can already see an accordion player in a bluegrass session (row 210)! Need to work out how to prevent that from happening.**
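
One way this might be prevented is to draw the genre first and then pick the instrument from a list that depends on the genre. The sketch below is only an illustration: the `instrument_weights` dictionary and every probability in it are made-up values, and `pick_instrument` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# hypothetical instrument weights per genre (illustrative values only)
instrument_weights = {
    "ITM": {"accordion": 0.25, "fiddle": 0.25, "guitar": 0.2,
            "flute": 0.1, "uilleann_pipes": 0.1, "concertina": 0.1},
    "bluegrass": {"guitar": 0.3, "banjo": 0.3, "mandolin": 0.2,
                  "fiddle": 0.1, "bass": 0.1},
    "old-time": {"fiddle": 0.4, "banjo": 0.3, "guitar": 0.2, "bass": 0.1},
    "blues": {"guitar": 0.7, "bass": 0.3},
}

# draw the genre first, using the notebook's genre weights
genre_names = list(instrument_weights)
genres = rng.choice(genre_names, p=[0.6, 0.2, 0.1, 0.1], size=213)


def pick_instrument(genre):
    """Draw one instrument using the weights for the given genre."""
    options = instrument_weights[genre]
    return str(rng.choice(list(options), p=list(options.values())))


demo = pd.DataFrame({"genre": genres})
demo["instrument"] = demo["genre"].map(pick_instrument)

# cross-tabulate to check which instruments land in which genre
print(pd.crosstab(demo["genre"], demo["instrument"]))
```

Since accordion only appears under ITM in the dictionary, it can never be drawn for a blues or bluegrass row.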

4. **Ability**  
For the minute I'm going to go with a binary approach - intermediate or advanced. In reality there are no beginners at these sessions. There would be a big range of abilities so I may revisit.  
We'll try a binomial distribution this time. We're going to assume for now that the split between intermediate and advanced is 50/50.

In [13]:
# creates an array of 0s and 1s (one trial per musician)
ability = rng.binomial(n=1, p=0.5, size=x)
ability

array([1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1])

We get a big array of 0s and 1s, which can stand in for booleans! Can I convert this back to strings? In the meantime I'll just add it to the dataframe as is.

In [14]:
df['ability'] = ability.tolist()
df

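|  | muso | instrument | genre | ability |
| --- | --- | --- | --- | --- |
| 0 | 0 | guitar | ITM | 1 |
| 1 | 1 | accordion | ITM | 0 |
| 2 | 2 | concertina | ITM | 0 |
| 3 | 3 | guitar | blues | 1 |
| 4 | 4 | uilleann_pipes | ITM | 1 |
| ... | ... | ... | ... | ... |
| 208 | 208 | fiddle | ITM | 0 |
| 209 | 209 | mandolin | bluegrass | 1 |
| 210 | 210 | accordion | bluegrass | 1 |
| 211 | 211 | guitar | old-time | 0 |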
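
On the question of converting the 0s and 1s back to strings, `np.where` or a pandas `map` with a dictionary should both do it. The sketch below assumes that 1 stands for 'advanced' and 0 for 'intermediate', which the notebook doesn't actually state, and uses a small throwaway dataframe called `demo`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
ability = rng.binomial(n=1, p=0.5, size=10)

# assumption: 1 stands for 'advanced' and 0 for 'intermediate'
labels = np.where(ability == 1, "advanced", "intermediate")

# the same mapping applied to a dataframe column
demo = pd.DataFrame({"ability": ability})
demo["ability_label"] = demo["ability"].map({0: "intermediate", 1: "advanced"})
print(demo)
```

The dictionary version extends more easily if a third level such as 'beginner' is added later.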


5. **Age**  
For now we're going to have 3 age-groups - college kids, grown-ups, old heads. The breakdown is going to be skewed towards the outer 2 ranges.

In [15]:
age_group = ['college_kid', 'grown-up', 'old-head']

In [16]:
age_group = rng.choice(age_group, p=[0.4, 0.2, 0.4], size=x)

In [17]:
df['age_group'] = age_group.tolist()
df

|  | muso | instrument | genre | ability | age_group |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | guitar | ITM | 1 | college_kid |
| 1 | 1 | accordion | ITM | 0 | old-head |
| 2 | 2 | concertina | ITM | 0 | college_kid |
| 3 | 3 | guitar | blues | 1 | grown-up |
| 4 | 4 | uilleann_pipes | ITM | 1 | college_kid |
| ... | ... | ... | ... | ... | ... |
| 208 | 208 | fiddle | ITM | 0 | grown-up |
| 209 | 209 | mandolin | bluegrass | 1 | college_kid |
| 210 | 210 | accordion | bluegrass | 1 | old-head |
| 211 | 211 | guitar | old-time | 0 | old-head |
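
If an integer age between 21 and 80 is wanted as well as the group, one option is to draw the group first and then a uniform age inside that group's range. The age bounds below are guesses of mine for illustration; only the 21 to 80 range and the 0.4/0.2/0.4 split come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical age bounds (inclusive) for each group
bounds = {"college_kid": (21, 30), "grown-up": (31, 59), "old-head": (60, 80)}

# draw the group with the same skew towards the two outer groups
groups = rng.choice(list(bounds), p=[0.4, 0.2, 0.4], size=213)

# look up each musician's bounds, then draw a uniform integer age within them
low = np.array([bounds[g][0] for g in groups])
high = np.array([bounds[g][1] for g in groups])
ages = rng.integers(low, high + 1)

print(groups[:5])
print(ages[:5])
```

Because 40% of the musicians are squeezed into the ten college years and only 20% are spread over the middle band, the resulting ages should show the spikes at either end mentioned in the Variables section.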


6. **Repertoire**  
We're going to create a list of ranges for this - 50-100, 100-200, 200-500, 500-1000, >1000. Later I'll add in parameters to dictate the size of repertoire for a particular person but for now I'll keep it random.

In [18]:
# create a list of repertoire ranges
reportoire = ['50-100', '100-200', '200-500', '500-1000', '>1000']

# create a random array with size x
reportoire = rng.choice(reportoire, p=[0.2, 0.2, 0.2, 0.2, 0.2], size=x)

In [19]:
# add to dataframe
df['reportoire'] = reportoire.tolist()

# have a look
df

|  | muso | instrument | genre | ability | age_group | reportoire |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | guitar | ITM | 1 | college_kid | 50-100 |
| 1 | 1 | accordion | ITM | 0 | old-head | 100-200 |
| 2 | 2 | concertina | ITM | 0 | college_kid | 500-1000 |
| 3 | 3 | guitar | blues | 1 | grown-up | 50-100 |
| 4 | 4 | uilleann_pipes | ITM | 1 | college_kid | >1000 |
| ... | ... | ... | ... | ... | ... | ... |
| 208 | 208 | fiddle | ITM | 0 | grown-up | 200-500 |
| 209 | 209 | mandolin | bluegrass | 1 | college_kid | 200-500 |
| 210 | 210 | accordion | bluegrass | 1 | old-head | 500-1000 |
| 211 | 211 | guitar | old-time | 0 | old-head | 100-200 |
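
To make repertoire depend on age, as suggested in the Variables section, each age group could get its own set of probabilities over the five ranges. This is a rough sketch only: the numbers in `weights_by_age` are invented for illustration, and the boolean-mask loop is just one way of filling the column group by group.

```python
import numpy as np

rng = np.random.default_rng(42)

ranges = ["50-100", "100-200", "200-500", "500-1000", ">1000"]

# hypothetical weights: older musicians lean towards the bigger ranges
weights_by_age = {
    "college_kid": [0.35, 0.30, 0.20, 0.10, 0.05],
    "grown-up": [0.10, 0.20, 0.40, 0.20, 0.10],
    "old-head": [0.05, 0.10, 0.20, 0.30, 0.35],
}

age_group = rng.choice(list(weights_by_age), p=[0.4, 0.2, 0.4], size=213)

# fill the repertoire column one age group at a time
repertoire = np.empty(age_group.size, dtype=object)
for group, p in weights_by_age.items():
    mask = age_group == group
    repertoire[mask] = rng.choice(ranges, p=p, size=mask.sum())

print(age_group[:5])
print(repertoire[:5])
```

The same pattern could be reused for drinks, with one set of weights per age group.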


7. **Paid**  
This will be a simple boolean again. For the moment I'm going to suggest that 20% of musicians are paid.

In [20]:
paid = rng.binomial(n=1, p=0.2, size=x)

In [21]:
df['paid'] = paid.tolist()
df

|  | muso | instrument | genre | ability | age_group | reportoire | paid |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | guitar | ITM | 1 | college_kid | 50-100 | 1 |
| 1 | 1 | accordion | ITM | 0 | old-head | 100-200 | 1 |
| 2 | 2 | concertina | ITM | 0 | college_kid | 500-1000 | 1 |
| 3 | 3 | guitar | blues | 1 | grown-up | 50-100 | 1 |
| 4 | 4 | uilleann_pipes | ITM | 1 | college_kid | >1000 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 208 | 208 | fiddle | ITM | 0 | grown-up | 200-500 | 0 |
| 209 | 209 | mandolin | bluegrass | 1 | college_kid | 200-500 | 0 |
| 210 | 210 | accordion | bluegrass | 1 | old-head | 500-1000 | 0 |
| 211 | 211 | guitar | old-time | 0 | old-head | 100-200 | 0 |
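
The Variables section says every session has at least three paid hosts, which a flat 20% chance can't guarantee. If each row carried a session number, the first three musicians in each session could be flagged as the paid hosts. This is only a sketch: the `session` column and the idea that the hosts come first within a session are my assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# musicians per night and a session number for every musician
counts = rng.integers(3, 11, 31)
session = np.repeat(np.arange(1, 32), counts)
demo = pd.DataFrame({"session": session})

# cumcount numbers the rows 0, 1, 2, ... within each session
demo["paid"] = demo.groupby("session").cumcount() < 3

# number of paid musicians in each session
print(demo.groupby("session")["paid"].sum().unique())
```

This fixes the count at three per session, so it wouldn't cover the non-ITM nights where more musicians may share the fee.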


8. **Drink**  
Here we'll create a list of the common drinks and for now divide them up randomly.

In [22]:
drinks = ['beamish', 'guinness', 'murphys', 'other_draft', 'craft_beers', 'red_wine', 'spirits', 'water/soft drinks']

In [23]:
drinks = rng.choice(drinks, size=x)

In [24]:
df['drinks'] = drinks.tolist()
df

|  | muso | instrument | genre | ability | age_group | reportoire | paid | drinks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | guitar | ITM | 1 | college_kid | 50-100 | 1 | guinness |
| 1 | 1 | accordion | ITM | 0 | old-head | 100-200 | 1 | water/soft drinks |
| 2 | 2 | concertina | ITM | 0 | college_kid | 500-1000 | 1 | murphys |
| 3 | 3 | guitar | blues | 1 | grown-up | 50-100 | 1 | beamish |
| 4 | 4 | uilleann_pipes | ITM | 1 | college_kid | >1000 | 0 | other_draft |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 208 | 208 | fiddle | ITM | 0 | grown-up | 200-500 | 0 | water/soft drinks |
| 209 | 209 | mandolin | bluegrass | 1 | college_kid | 200-500 | 0 | spirits |
| 210 | 210 | accordion | bluegrass | 1 | old-head | 500-1000 | 0 | craft_beers |
| 211 | 211 | guitar | old-time | 0 | old-head | 100-200 | 0 | red_wine |


9. **Night of the week**  
We create a list of the days plus an option for 'multiple' (more than one night). For now I skew it in favour of Friday and Saturday nights. I will revisit this as the blues, bluegrass and old-time sessions are each on a particular night!

In [25]:
night = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'multiple']
night = rng.choice(night, p=[0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.1, 0.1], size=x)

In [26]:
df['night'] = night.tolist()
df

|  | muso | instrument | genre | ability | age_group | reportoire | paid | drinks | night |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | guitar | ITM | 1 | college_kid | 50-100 | 1 | guinness | friday |
| 1 | 1 | accordion | ITM | 0 | old-head | 100-200 | 1 | water/soft drinks | monday |
| 2 | 2 | concertina | ITM | 0 | college_kid | 500-1000 | 1 | murphys | multiple |
| 3 | 3 | guitar | blues | 1 | grown-up | 50-100 | 1 | beamish | thursday |
| 4 | 4 | uilleann_pipes | ITM | 1 | college_kid | >1000 | 0 | other_draft | tuesday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 208 | 208 | fiddle | ITM | 0 | grown-up | 200-500 | 0 | water/soft drinks | friday |
| 209 | 209 | mandolin | bluegrass | 1 | college_kid | 200-500 | 0 | spirits | friday |
| 210 | 210 | accordion | bluegrass | 1 | old-head | 500-1000 | 0 | craft_beers | wednesday |
| 211 | 211 | guitar | old-time | 0 | old-head | 100-200 | 0 | red_wine | saturday |
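
Since each genre has its own night, the night might be better derived from the calendar than drawn at random, with the genre then following from the night. The sketch below assumes December 2021 and a completely made-up schedule in `genre_by_night`. Which night actually hosts which genre isn't stated in the notebook.

```python
import pandas as pd

# one row per session; December 2021 is assumed
dates = pd.date_range("2021-12-01", "2021-12-31", freq="D")
sessions = pd.DataFrame({"date": dates})
sessions["night"] = sessions["date"].dt.day_name().str.lower()

# hypothetical schedule: any night not listed is an ITM session
genre_by_night = {"tuesday": "bluegrass", "wednesday": "old-time", "sunday": "blues"}
sessions["genre"] = sessions["night"].map(genre_by_night).fillna("ITM")

print(sessions.head(7))
```

A table like this, with one row per session, could then be joined to the musicians through a session number so that night and genre always agree.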


<br>

---
So far this is all completely random - just in case anyone is looking this far back!! 🤣

Now I'll try to come up with some code to create and simulate the relationships between the variables.

<br>

---

## References

https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/

https://www.statology.org/add-numpy-array-to-pandas-dataframe/

<br>

---
# END