
# Programming for Data Analysis Assignment 2022

***

### Eleanor Sammon, Student # G00411277


## Table of Contents

1. [Introduction](#introduction)
2. [Synthetic Data](#synthetic_data)
3. [Variables](#variables)
4. [Coding the Dataset](#dataset)
5. [Testing the Dataset](#testing)
6. [Conclusion](#conclusion)
7. [References](#references)


## 1. Introduction

The purpose of this project is to synthesise a dataset using data points from a real-life phenomenon.

Having first outlined the pros and cons of synthetic data, I will then synthesise my own dataset, which will model the relationship, if any, between age, sex, social class, education and the tendency towards cigarette smoking. I will outline each of the variables, code a synthetic set of data points for each variable based on real data, and perform analysis on the resulting composite dataset.


## 2. Synthetic Data

Synthetic data is artificially generated data that models real data.  The main advantages of synthetic data are:

**It is easy to generate and use** Collecting and collating real-world data can be time-consuming and can raise privacy and data protection issues. Synthetic data is cleaner and doesn’t have the inaccuracies, duplicates or formatting niggles that often come with real data.

**It's of superior quality** Real-world data can be time-consuming to collect and collate, and it may be missing values, contain inaccuracies or be biased. Synthetic data is cheaper and faster to produce and, when generated carefully, can be balanced, unbiased and of high quality, with more accurate patterns.

**It scales well** Training a predictive model works best with large-scale inputs, and synthetic data can be generated in whatever volume is needed.

**It has no security, moral or regulatory conflicts** Even where real-world data exists in sufficient quantities, it can often be ring-fenced because of compliance issues. The General Data Protection Regulation (GDPR), for example, restricts uses of personal data that go beyond what was agreed at the time of collection.

Disadvantages of synthetic data include:

**Bias** The quality of the synthesised output correlates directly with the quality of the input. Where there is bias in the input, it will be reflected and potentially amplified in the synthesised data, which can lead to false insights and poor decision-making.

**Ignoring outliers** Synthetic data approximates real-world data but can never replicate it entirely, so outliers which may appear in real-world, organic data could be ignored or overlooked in a synthesised dataset. Such outliers can be significant in themselves.

**Improper use and application** Even synthetic data has to be founded in fact, and there are concerns that the underlying real data, which could be sensitive, may still be traceable to its source. A synthesised dataset must be carefully aligned to the original problem to ensure it is fulfilling its purpose.

## 3. Variables

I know someone recently diagnosed with stage 4 lung cancer so I decided to synthesise a data set looking at whether certain societal factors might influence whether a person is more likely to be a smoker. My four variables are age, sex, socio-economic status and education. 

### 3.1 Age
The smoking statistics I am using are taken from the Healthy Ireland Survey 2021. The survey covers smokers in the age range from 15 to 65+, so on this basis I will generate an array of 1000 random ages between 15 and 82. The average [life expectancy](https://www.worlddata.info/europe/ireland/index.php) in Ireland in 2020 was 82.2 years, so I have used 82 as the upper bound of my array.


In [4]:
import numpy as np
import pandas as pd 
import seaborn as sns
from scipy.stats import norm
import matplotlib.pyplot as plt

# Generate a random array of 1000 ages starting at 15.
# Note: randint excludes `high`, so the largest age that can be drawn is 81.
Age = np.random.randint(low=15, high=82, size=1000)
print(Age)
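
Because `np.random.randint` excludes its upper bound, one way to make 82 itself a possible age is NumPy's newer `Generator` interface, which accepts `endpoint=True`. The following is only a minimal sketch of that alternative; the seed value and the variable names `rng` and `ages` are my own assumptions and are not used elsewhere in this notebook.

```python
import numpy as np

# A seeded generator makes the "random" ages reproducible between runs.
# The seed value 42 is arbitrary (an assumption for this sketch).
rng = np.random.default_rng(seed=42)

# endpoint=True makes the upper bound inclusive, so ages can be 15..82
ages = rng.integers(low=15, high=82, size=1000, endpoint=True)

# Quick sanity check on the range that was actually drawn
print(ages.min(), ages.max())
```

Seeding is optional, but it means the same synthetic ages are produced each time the cell is re-run, which makes later analysis easier to compare.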

### 3.2 Gender
According to the [World Bank](https://data.worldbank.org/indicator/SP.POP.TOTL.FE.ZS?locations=IE), in 2021 the female to male ratio in Ireland reached almost parity (50.3% female as against 49.7% male), largely owing to migration in recent years. For the purposes of this exercise I will weight the random choice with these proportions, which is very close to a 50/50 split.

In [2]:
Genders = ['Male', 'Female'] 
np.random.choice(Genders, 1000, p=[0.497, 0.503])

array(['Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male',
       'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female',
       'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male',
       'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male',
       'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female',
       'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male',
       'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female',
       'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male',
       'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female',
       'Female', 'Female', 'Male', 'Female', 'Female', 'Female', 'Male',
       'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Female',
       'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male',
       'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female',
       'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female',
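
Printing the raw array makes it hard to see whether the weights had the intended effect. As a rough check, the share of each label in a sample can be counted with `np.unique`. This is a sketch under my own assumptions: the seeded generator `rng` and the names `sample` and `shares` are illustrative and do not appear in the original cells.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only

genders = ['Male', 'Female']
sample = rng.choice(genders, size=1000, p=[0.497, 0.503])

# Count how often each label occurs and convert the counts to proportions
values, counts = np.unique(sample, return_counts=True)
shares = dict(zip(values, counts / counts.sum()))
print(shares)
```

With only 1000 draws the observed shares will wobble by a percentage point or two around 49.7% and 50.3%, which is expected sampling noise rather than an error.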
    

### 3.3 Smoking
18% of the Irish population currently smoke (daily or occasionally). Broken down by gender, 20% of males and 17% of females are smokers.
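
As a consistency check on these figures, the gender-specific rates can be combined with the population shares from section 3.2 to see what overall smoking rate they imply. The dictionary names below are hypothetical and chosen only for this sketch.

```python
# Population shares (section 3.2) and smoking rates by gender (section 3.3)
share = {'Male': 0.497, 'Female': 0.503}
smoking_rate = {'Male': 0.20, 'Female': 0.17}

# Weighted average: each gender's rate weighted by its share of the population
overall = sum(share[g] * smoking_rate[g] for g in share)
print(f"Implied overall smoking rate: {overall:.1%}")
```

The weighted average works out at roughly 18.5%, which is in line with the headline figure of 18% once rounding in the published percentages is allowed for.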

### 3.4 Socio Economic Status


### 3.5 Education


## 4. Coding the Dataset

In [18]:
df = pd.DataFrame(columns=['Genders', 'Age', 'Smoker', 'Socio', 'Education'])

Genders = ['Male', 'Female'] 
df['Genders'] = np.random.choice(Genders, 1000, p=[0.497, 0.503])

df['Age'] = np.random.randint(15,82, 1000)

smokers = ['Yes', 'No']  # Labels for smoker / non-smoker

# Assign smoker status at random, using the smoking rate for the person's gender
def smoker_gender(gender):
    if gender == 'Male':
        return np.random.choice(smokers, p=[0.20, 0.80])  # 20% of males smoke
    if gender == 'Female':
        return np.random.choice(smokers, p=[0.17, 0.83])  # 17% of females smoke

# Apply the function row by row to the Genders column
df['Smoker'] = df['Genders'].apply(smoker_gender)


print(df)

    Genders  Age Smoker Socio Education
0      Male   41     No   NaN       NaN
1      Male   47     No   NaN       NaN
2      Male   74     No   NaN       NaN
3    Female   71     No   NaN       NaN
4      Male   25    Yes   NaN       NaN
..      ...  ...    ...   ...       ...
995  Female   20     No   NaN       NaN
996  Female   42     No   NaN       NaN
997    Male   40     No   NaN       NaN
998  Female   79     No   NaN       NaN
999  Female   39     No   NaN       NaN

[1000 rows x 5 columns]
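
One way to check that the smoker column reflects the intended rates is to group by gender and compute the share of 'Yes' values. The sketch below rebuilds a comparable frame in a vectorised way so that it can run on its own. It is an illustration under my own assumptions, not the method used above: `check_df`, `p_smoke`, `rates` and the seed are hypothetical names and values, and the Socio and Education columns are left out because they have not been defined yet.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2022)  # arbitrary seed for reproducibility
n = 1000

genders = rng.choice(['Male', 'Female'], size=n, p=[0.497, 0.503])
ages = rng.integers(15, 82, size=n, endpoint=True)  # inclusive of 82

# Per-row smoking probability: 20% for males, 17% for females
p_smoke = np.where(genders == 'Male', 0.20, 0.17)

# A uniform draw below the row's probability marks that person as a smoker
smoker = np.where(rng.random(n) < p_smoke, 'Yes', 'No')

check_df = pd.DataFrame({'Genders': genders, 'Age': ages, 'Smoker': smoker})

# Observed share of smokers within each gender
rates = check_df.groupby('Genders')['Smoker'].apply(lambda s: (s == 'Yes').mean())
print(rates)
```

The observed rates should land near 20% and 17%, although with around 500 people per group a difference of a few percentage points from the targets is normal.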


## References

Synthetic data guide - https://datagen.tech/guides/synthetic-data/synthetic-data/ - visited 07/11/2022

Video - https://www.youtube.com/watch?v=uG_YMEcyaA8 - watched 07/11/2022

Advantages and disadvantages of synthetic data - https://www.dataversity.net/the-pros-and-cons-of-synthetic-data/ - visited 09/11/2022