# Programming for Data Analysis

# Programming for Data Analysis Assignment 2022

***

### Eleanor Sammon, Student # G00411277


## Table of Contents

1. [Introduction](#introduction)
2. [Synthetic Data](#synthetic_data)
3. [Variables](#variables)
4. [Coding the Dataset](#dataset)
5. [Testing the Dataset](#testing)
6. [Conclusion](#conclusion)
7. [References](#references)


## 1. Introduction  <a class="anchor" id="introduction"></a>

The purpose of this project is to synthesise a dataset using data points from a real-life phenomenon.

Having first outlined the pros and cons of synthetic data, I will synthesise my own dataset to model the relationship, if any, between age, sex, social class, education and the tendency towards cigarette smoking. I will outline each of the variables, code a synthetic set of data points for each variable based on real data, and perform analysis on the resulting composite dataset.


## 2. Synthetic Data  <a class="anchor" id="synthetic_data"></a>

Synthetic data is artificially generated data that models real data.  The main advantages of synthetic data are:

**It is easy to generate and use**  Collecting and collating real-world data can be time-consuming and can raise privacy and data protection issues. Synthetic data is cleaner and doesn’t have the inaccuracies, duplicates or formatting niggles that often come with real data.

**It's of superior quality**  Real-world data may be missing values, contain inaccuracies or be biased.  Synthetic data is cheaper and faster to produce, and it can be generated so that it is complete, consistent and balanced.

**It scales well**  Training a predictive model works best with large-scale inputs, and synthetic data can be generated in whatever volume is needed.

**It has no security, moral or regulatory conflicts** Even where real-world data exists in sufficient quantities it can often be ring-fenced because of compliance issues. The General Data Protection Regulation (GDPR) restricts the use of personal data for purposes other than those specified at the time of collection.

Disadvantages of synthetic data include:

**Bias** The quality of the synthesised output correlates directly with the quality of the input. Where there is bias in the input, it will be reflected, and potentially amplified, in the synthesised data, with the potential for false insights and poor decision-making.

**Ignoring outliers** Synthetic data approximates real-world data but can never replicate it entirely, so outliers which may appear in real-world, organic data could be ignored or overlooked in a synthesised dataset.  Such outliers can be significant in themselves.

**Improper use and application** Even synthetic data has to be founded in fact and there are concerns that the underlying real data, which could be sensitive, may still be identifiable to its source.  A synthesised dataset must be carefully aligned to the original problem to ensure it's fulfilling its purpose. 
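
The point about outliers can be illustrated with a small sketch. The "real" values below are hypothetical, invented purely for illustration, and the synthesis method (fitting a normal distribution and resampling) is only one simple assumption about how synthetic data might be produced.

```python
import numpy as np

rng = np.random.default_rng(2022)  # arbitrary seed, assumed for reproducibility

# Hypothetical "real" measurements: 995 typical values plus 5 extreme outliers.
real = np.concatenate([rng.normal(loc=50, scale=5, size=995), np.full(5, 120.0)])

# Synthesise by fitting a normal distribution to the real data and resampling.
synthetic = rng.normal(loc=real.mean(), scale=real.std(), size=1000)

# Count how many extreme values survive in each dataset.
real_extremes = int((real >= 110).sum())
synthetic_extremes = int((synthetic >= 110).sum())
print("real values >= 110:     ", real_extremes)
print("synthetic values >= 110:", synthetic_extremes)
```

The fitted distribution captures the overall centre and spread, but the five extreme values in the hypothetical real data are very unlikely to reappear in the resample.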

## 3. Variables <a class="anchor" id="variables"></a>

I know someone recently diagnosed with stage 4 lung cancer so I decided to synthesise a data set looking at whether certain societal factors might influence whether a person is more likely to be a smoker. My four variables are age, sex, social class and education. 

### 3.1 Age
The smoking statistics I am using are taken from the Healthy Ireland Survey 2021, which looks at smokers in age brackets from 15 to 65+. On this basis I will generate an array of 1000 random ages between 15 and 82. The average [life expectancy](https://www.worlddata.info/europe/ireland/index.php) in Ireland in 2020 was 82.2 years, so I have used 82 as the upper limit of my array.


In [1]:
# import the necessary libraries to run the code and perform analysis
import numpy as np
import pandas as pd 
import seaborn as sns
from scipy.stats import norm
import matplotlib.pyplot as plt



In [2]:
# generate a random array of 1000 ages between 15 and 82 inclusive
# (np.random.randint excludes `high`, so 83 is needed for 82 to be possible)
Age = np.random.randint(low=15, high=83, size=1000)
print(Age)

[79 43 79 42 21 18 46 75 66 35 76 65 36 39 44 42 18 62 74 73 57 70 31 18
 48 61 26 19 19 29 62 17 26 17 38 22 46 64 50 36 67 27 41 70 36 76 61 69
 64 48 15 51 36 55 44 80 49 52 45 16 40 15 31 52 46 72 63 49 20 46 44 60
 71 38 32 54 54 69 50 16 45 68 26 19 17 71 36 61 36 64 78 37 47 54 29 62
 28 32 57 40 43 63 80 15 35 61 20 49 54 53 48 38 60 71 56 34 34 53 61 69
 20 55 51 77 68 62 80 58 31 32 30 32 30 77 75 35 62 26 49 27 56 27 74 17
 51 38 70 71 17 76 36 36 58 69 80 26 53 57 47 40 44 46 71 75 53 76 80 70
 75 78 80 81 48 72 31 38 42 19 55 30 79 79 52 38 35 66 73 65 18 33 49 29
 47 80 63 67 66 24 40 19 40 16 44 34 42 57 35 37 53 34 46 35 22 28 75 36
 73 79 80 16 77 23 70 34 35 38 20 43 46 41 60 45 57 22 67 80 47 20 15 40
 77 72 68 68 62 32 80 75 66 24 53 44 29 16 77 72 52 26 62 64 55 30 31 47
 38 45 23 30 58 78 17 76 36 33 21 22 29 40 53 79 62 57 32 28 65 62 36 44
 77 56 49 26 36 18 38 34 78 67 17 68 48 24 19 49 44 64 72 34 42 36 36 68
 61 61 56 76 40 29 24 78 47 57 42 43 63 57 51 77 47
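
Because the ages are drawn at random, the array changes every time the cell is run. As a minimal sketch, assuming a seeded generator is acceptable for this project (the seed value 2022 is an arbitrary choice of mine), NumPy's `default_rng` gives repeatable draws, and `endpoint=True` makes the upper bound inclusive:

```python
import numpy as np

# A seeded generator makes the synthetic ages reproducible between runs.
# The seed value is arbitrary and only assumed for this sketch.
rng = np.random.default_rng(2022)

# endpoint=True includes the upper bound, so an age of 82 can be drawn.
ages = rng.integers(low=15, high=82, size=1000, endpoint=True)

print(ages.min(), ages.max())
```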

### 3.2 Gender
According to the [World Bank](https://data.worldbank.org/indicator/SP.POP.TOTL.FE.ZS?locations=IE), in 2021 the ratio of females to males in Ireland reached almost parity (50.3% female as against 49.7% male), largely owing to migration in recent years. For the purposes of this exercise I will use these proportions as the probabilities when assigning gender.

In [3]:
Genders = ['Male', 'Female'] 
np.random.choice(Genders, 1000, p=[0.497, 0.503])

array(['Female', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female',
       'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female',
       'Female', 'Female', 'Male', 'Female', 'Female', 'Female', 'Male',
       'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male',
       'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
       'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female',
       'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female',
       'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female',
       'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male',
       'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male',
       'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male',
       'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female',
       'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male',
       'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'F
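
A quick way to confirm that the generated labels follow the intended split is to count them. This is a minimal sketch using a seeded generator (the seed is arbitrary and assumed only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2022)  # arbitrary seed, assumed for this sketch

# Draw 1000 gender labels using the World Bank proportions.
genders = rng.choice(["Male", "Female"], size=1000, p=[0.497, 0.503])

# Count each label and convert the counts to proportions.
labels, counts = np.unique(genders, return_counts=True)
proportions = dict(zip(labels, counts / counts.sum()))
print(proportions)
```

With 1000 draws the observed proportions should sit close to, but rarely exactly on, 0.497 and 0.503.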

### 3.3 Smoking
According to the Healthy Ireland Survey 2021, 18% of the Irish population are currently smokers (daily or occasional): 20% of men and 17% of women.
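
The gender-specific rates can be applied to a whole column at once rather than row by row. The following is a sketch of one possible approach, not necessarily the one used in the final dataset; the seed and the variable names are my own assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2022)  # arbitrary seed, assumed for this sketch

n = 1000
genders = rng.choice(["Male", "Female"], size=n, p=[0.497, 0.503])

# Smoking rate per person: 20% for men, 17% for women (figures quoted above).
smoking_rate = np.where(genders == "Male", 0.20, 0.17)

# A person is marked as a smoker when a uniform draw falls below their rate.
smoker = np.where(rng.random(n) < smoking_rate, "Yes", "No")

sample = pd.DataFrame({"Genders": genders, "Smoker": smoker})

# Share of smokers within each gender (each row sums to 1).
rates = pd.crosstab(sample["Genders"], sample["Smoker"], normalize="index")
print(rates)
```

The `Yes` column of `rates` should land near 0.17 for women and 0.20 for men, with some sampling noise.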

### 3.4 Social Class
The Irish population is classified into one of seven social class groups which are ranked on the basis of occupation, thereby bringing together people with similar levels of occupational skill.  According to the [2016 Census of Ireland](https://www.cso.ie/en/media/csoie/newsevents/documents/census2016summaryresultspart2/Chapter_6_Socio-economic_group_and_social_class.pdf), being the most recent complete data set available, the following is the breakdown of the population by social class:

| Social class | Share of population |
|---|---|
| Professional workers | 8.1% |
| Managerial and technical | 28.1% |
| Non-manual | 17.6% |
| Skilled manual | 14.1% |
| Semiskilled | 10.5% |
| Unskilled | 3.6% |
| All other gainfully occupied and unknown | 18.0% |


In [4]:
Social_Class = np.random.choice(["Professional_Workers", "Managerial_Technical", "Non_Manual", "Skilled_Manual", "Semiskilled", "Unskilled", "Other"], size=1000, p=[0.081, 0.281, 0.176, 0.141, 0.105, 0.036, 0.180], replace=True)
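
To see how closely a sample of 1000 follows the census shares listed above, the observed proportions can be placed side by side with the targets. This is a minimal sketch; the seed and the names `census_share` and `comparison` are assumptions of mine.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2022)  # arbitrary seed, assumed for this sketch

# Target shares from the 2016 Census breakdown quoted above.
census_share = pd.Series({
    "Professional_Workers": 0.081,
    "Managerial_Technical": 0.281,
    "Non_Manual": 0.176,
    "Skilled_Manual": 0.141,
    "Semiskilled": 0.105,
    "Unskilled": 0.036,
    "Other": 0.180,
})

social_class = rng.choice(
    census_share.index.to_numpy(), size=1000, p=census_share.to_numpy()
)

# Observed share of each class in the sample.
observed = pd.Series(social_class).value_counts(normalize=True)

# Align on class name; a class that was never drawn would show as 0.
comparison = pd.DataFrame({"census": census_share, "sample": observed}).fillna(0)
print(comparison)
```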


### 3.5 Education


## 4. Coding the Dataset  <a class="anchor" id="dataset"></a>

In [5]:
# overall dataframe categories
df = pd.DataFrame(columns=['Genders', 'Age', 'Age_brackets','Smoker', 'Social_Class', 'Education'])

# dataframe for age (high is exclusive, so 83 allows ages up to 82)
df['Age'] = np.random.randint(15, 83, 1000)

# Labels for the age brackets
age_groups = ['15-24', '25-34', '35-44', '45-54', '55-64', '65+']

# Bin edges: with right=False each bin is [lower, upper), so 24 falls in '15-24' and 25 in '25-34'
evaluation_bins = [15, 25, 35, 45, 55, 65, np.inf]

df['Age_brackets'] = pd.cut(df['Age'], bins=evaluation_bins, labels=age_groups, right=False)

#dataframe for genders
Genders = ['Male', 'Female'] 
df['Genders'] = np.random.choice(Genders, 1000, p=[0.497, 0.503])

# dataframe for smokers
smokers = ['Yes', 'No']  # smoker / non-smoker labels

# function to assign smoker status based on gender-specific probabilities
def smoker_gender(gender):
    if gender == 'Male':
        return np.random.choice(smokers, p=[0.20, 0.80])
    if gender == 'Female':
        return np.random.choice(smokers, p=[0.17, 0.83])
    raise ValueError(f"Unexpected gender: {gender}")

df['Smoker'] = df['Genders'].apply(smoker_gender)

# dataframe for social class
classes = ['Professional_Workers', 'Managerial_Technical', 'Non_Manual', 'Skilled_Manual', 'Semiskilled', 'Unskilled', 'Other']
df['Social_Class'] = np.random.choice(classes, size=1000, p=[0.081, 0.281, 0.176, 0.141, 0.105, 0.036, 0.180], replace=True)

df

|  | Genders | Age | Age_brackets | Smoker | Social_Class | Education |
|---|---|---|---|---|---|---|
| 0 | Female | 49 | 45-54 | No | Other |  |
| 1 | Female | 72 | 65+ | No | Other |  |
| 2 | Male | 22 | 15-24 | Yes | Semiskilled |  |
| 3 | Male | 39 | 35-44 | Yes | Semiskilled |  |
| 4 | Male | 15 | 15-24 | No | Non_Manual |  |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | Female | 61 | 55-64 | No | Semiskilled |  |
| 996 | Male | 80 | 65+ | Yes | Other |  |
| 997 | Male | 42 | 35-44 | No | Professional_Workers |  |
| 998 | Male | 50 | 45-54 | Yes | Other |  |
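
When `pd.cut` is called with `right=False`, each bin includes its lower edge and excludes its upper edge, so the edges need to be the first age of each bracket. The sketch below checks this on a handful of boundary ages that I have chosen for illustration:

```python
import numpy as np
import pandas as pd

age_groups = ["15-24", "25-34", "35-44", "45-54", "55-64", "65+"]

# With right=False each bin is [lower, upper), so each edge is the
# first age belonging to the next bracket.
bin_edges = [15, 25, 35, 45, 55, 65, np.inf]

# Ages on either side of the bracket boundaries (illustrative values).
boundary_ages = pd.Series([15, 24, 25, 64, 65, 82])

brackets = pd.cut(boundary_ages, bins=bin_edges, labels=age_groups, right=False)
print(pd.DataFrame({"Age": boundary_ages, "Age_brackets": brackets}))
```

Here 24 is labelled `15-24` and 25 is labelled `25-34`, which matches the bracket names.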


## 5. Testing the Dataset  <a class="anchor" id="testing"></a>
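
One possible check, sketched here under my own assumptions rather than as the definitive test for this project, is a chi-square goodness-of-fit test comparing the synthesised social class counts with the counts expected from the census shares. It uses `scipy.stats.chisquare`; the seed and variable names are arbitrary.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(2022)  # arbitrary seed, assumed for this sketch

classes = ["Professional_Workers", "Managerial_Technical", "Non_Manual",
           "Skilled_Manual", "Semiskilled", "Unskilled", "Other"]
census_share = np.array([0.081, 0.281, 0.176, 0.141, 0.105, 0.036, 0.180])

# Synthesise 1000 social class labels from the census shares.
social_class = rng.choice(classes, size=1000, p=census_share)

# Observed count per class, in the same order as census_share.
observed = np.array([(social_class == c).sum() for c in classes])

# Expected count per class if the sample followed the census exactly.
expected = census_share * observed.sum()

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {statistic:.2f}, p-value = {p_value:.3f}")
```

A large p-value would indicate no evidence that the synthetic sample departs from the census proportions, which is what should usually be seen when the sample is drawn directly from them.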

## 6. Conclusion  <a class="anchor" id="conclusion"></a>

## 7. References  <a class="anchor" id="references"></a>

1. https://datagen.tech/guides/synthetic-data/synthetic-data/ (visited 07/11/2022)
2. Video: https://www.youtube.com/watch?v=uG_YMEcyaA8 (watched 07/11/2022)
3. Advantages and disadvantages of synthetic data: https://www.dataversity.net/the-pros-and-cons-of-synthetic-data/ (visited 09/11/2022)
4. https://www.hse.ie/eng/about/who/tobaccocontrol/research/smoking-in-ireland-2021.pdf (visited 09/11/2022)
5. https://www.cso.ie/en/media/csoie/newsevents/documents/census2016summaryresultspart2/Chapter_6_Socio-economic_group_and_social_class.pdf (visited 15/11/2022)
6. Assigning age bins: https://stackoverflow.com/questions/71564603/pandas-cut-and-specifying-specific-bin-sizes (visited 17/11/2022)