<center><img src="images/data.png" alt="graph" style="width:250px"></center>


# Generating synthetic data
***

<br>

## Table of Contents

#### [1. Introduction](#Intro)

#### [2. Data Exploration - Palmer Archipelago (Antarctica) Penguin Dataset](#Exploration)
&nbsp;&nbsp;&nbsp;&nbsp;[- Data Cleaning](#Clean)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[- Visualise the data](#Visualise)<br>




<br>

***
# <center>1. Introduction</center>
***

Synthetic data is data produced by an algorithm as an alternative to real-world, or "authentic", data. That is, it is computer-generated rather than collected from real-world measurements. Although technically artificial, it is modelled on real-world phenomena and represents them in a mathematical and statistical sense. For this reason, it can be as valuable as real data within its context.

Synthetic data is used to train machine learning algorithms and to validate mathematical models. For example, Amazon's Alexa uses synthetic data to train its language system, while American Express uses it to improve fraud detection. [https://www.statice.ai/post/types-synthetic-data-examples-real-life-examples]

The purpose of this project is to generate synthetic data for learning purposes, using the popular Palmer Archipelago (Antarctica) Penguin Data Set.

<br>

***

<br>

<center><img src="images/palmer_penguin.png" alt="Palmer Penguins" style="width:200px"></center>

# <center>2. Data Exploration</center>
 
### <center><i>The Palmer Archipelago (Antarctica) Penguin Data Set</i></center>



***

<br>

The Palmer Archipelago (Antarctica) Penguin Data Set is often termed the new Iris data set due to its popularity amongst those new to data science. The data was originally collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, and focuses on three penguin species found on the islands of the Palmer Archipelago, Antarctica. [Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081] The [data set](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data) used in this project was sourced from [Kaggle](https://www.kaggle.com/).

There are 17 columns in the data set. For the purpose of this project, the <b>seven attributes</b> below will be used:

- <b>species</b>: penguin species (Chinstrap, Adélie, or Gentoo)
- <b>culmen_length_mm</b>: culmen length in mm
- <b>culmen_depth_mm</b>: culmen depth in mm
- <b>flipper_length_mm</b>: flipper length in mm
- <b>body_mass_g</b>: body mass in grams
- <b>island</b>: the island name (Dream, Torgersen, or Biscoe) in the Palmer Archipelago, Antarctica
- <b>sex</b>: penguin sex

<br>

The <b>culmen</b> is the upper ridge of the beak. Its length and depth measurements are depicted below.

<img src="images/culmen.jpeg" alt="Culmen" style="width:400px">



<br>

### Import libraries
***

In [32]:
# Dataframes.
import pandas as pd

# Machine learning library.
import sklearn as sk

# Fill missing values.
from sklearn.impute import SimpleImputer

# Plotting.
import matplotlib.pyplot as plt

# Stylish plots.
import seaborn as sns

<br>

### Load data
***

In [33]:
# Read in csv data.
data = pd.read_csv('data/penguins_lter.csv')
data.head()

| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
| 1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
| 2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
| 3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
| 4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |


<br>

Remove the columns that will not be used, keeping only the seven attributes listed above.

In [34]:
# Remove columns that will not be used.
data = data.drop(labels=['studyName', 
                         'Sample Number', 
                         'Region', 
                         'Stage', 
                         'Individual ID', 
                         'Clutch Completion', 
                         'Date Egg', 
                         'Delta 15 N (o/oo)', 
                         'Delta 13 C (o/oo)', 
                         'Comments'], 
                          axis=1)
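The attribute list earlier uses short names such as `culmen_length_mm`, while the file's columns are labelled `Culmen Length (mm)` and so on. If matching names are wanted, a rename along the following lines could be applied after the drop. This is a minimal sketch: `toy` and `rename_map` are hypothetical names introduced for illustration, and the single row reuses the values of the first penguin shown in the output above.

```python
import pandas as pd

# Toy frame with the same column labels as the trimmed data set
# (one row, taken from the first record shown above).
toy = pd.DataFrame({
    "Species": ["Adelie Penguin (Pygoscelis adeliae)"],
    "Island": ["Torgersen"],
    "Culmen Length (mm)": [39.1],
    "Culmen Depth (mm)": [18.7],
    "Flipper Length (mm)": [181.0],
    "Body Mass (g)": [3750.0],
    "Sex": ["MALE"],
})

# Hypothetical mapping from the file's labels to the short attribute names.
rename_map = {
    "Species": "species",
    "Island": "island",
    "Culmen Length (mm)": "culmen_length_mm",
    "Culmen Depth (mm)": "culmen_depth_mm",
    "Flipper Length (mm)": "flipper_length_mm",
    "Body Mass (g)": "body_mass_g",
    "Sex": "sex",
}

renamed = toy.rename(columns=rename_map)
print(list(renamed.columns))
```

The rest of this notebook keeps the original column labels, so the rename is optional.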

In [35]:
# Get basic info about dataset.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Species              344 non-null    object 
 1   Island               344 non-null    object 
 2   Culmen Length (mm)   342 non-null    float64
 3   Culmen Depth (mm)    342 non-null    float64
 4   Flipper Length (mm)  342 non-null    float64
 5   Body Mass (g)        342 non-null    float64
 6   Sex                  334 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [36]:
# Statistical summary.
data.describe()

| | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) |
|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |
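Summary statistics like these are one possible starting point for generating synthetic values. The sketch below draws body-mass values from a normal distribution with the mean and standard deviation reported above, then clips them to the observed minimum and maximum. The choice of a normal distribution is an assumption made for illustration, not something the data has been shown to follow (with three species pooled, the real distribution may well be multimodal), and the seed and sample size are arbitrary.

```python
import numpy as np

# Mean, standard deviation and range of body mass, copied from describe().
BODY_MASS_MEAN = 4201.754386
BODY_MASS_STD = 801.954536
BODY_MASS_MIN = 2700.0
BODY_MASS_MAX = 6300.0

# Seeded generator so the sketch is reproducible.
rng = np.random.default_rng(seed=42)

# ASSUMPTION: body mass is treated as normally distributed.
synthetic_body_mass = rng.normal(BODY_MASS_MEAN, BODY_MASS_STD, size=344)

# Keep the synthetic values inside the range seen in the real data.
synthetic_body_mass = np.clip(synthetic_body_mass, BODY_MASS_MIN, BODY_MASS_MAX)

print(synthetic_body_mass[:5].round(1))
print(round(synthetic_body_mass.mean(), 1), round(synthetic_body_mass.std(), 1))
```

Clipping slightly distorts the tails, so the sample mean and standard deviation will only approximate the targets.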


<br>

## Clean the data
***

In [37]:
# Check for null values.
data.isnull().sum()

Species                 0
Island                  0
Culmen Length (mm)      2
Culmen Depth (mm)       2
Flipper Length (mm)     2
Body Mass (g)           2
Sex                    10
dtype: int64

In [45]:
# Show all rows when the data is displayed, so null values can be located by eye.
pd.set_option('display.max_rows', None)

<br>

After a visual check of the data, it was found that the penguins at index 3 and 339 had no values entered in any column other than species and island. The decision was made to fill the missing values rather than delete these rows entirely. The same process is applied to the missing sex values in the implementation below.
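The rows with missing values can also be located programmatically rather than by eye. The sketch below uses a small toy frame (the values are illustrative, not taken from the data set) and selects every row that contains at least one null.

```python
import numpy as np
import pandas as pd

# Toy frame: the row at index 1 has missing values (illustrative data only).
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo"],
    "Culmen Length (mm)": [39.1, np.nan, 46.1],
    "Sex": ["MALE", None, "FEMALE"],
})

# Boolean mask that is True for any row holding at least one null value.
has_null = toy.isnull().any(axis=1)
rows_with_nulls = toy[has_null]

print(rows_with_nulls.index.tolist())  # [1]
```

Applied to `data`, the same mask would list the affected rows without scrolling through all 344 entries.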

<br>


#### Fill missing values

In [39]:
# Code adapted from https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris/notebook
# Fill in missing values with the most frequent occurrence in the column.
imputer = SimpleImputer(strategy='most_frequent') 
data.iloc[:,:] = imputer.fit_transform(data)

In [40]:
# Check data again. 
data.isnull().sum()

Species                0
Island                 0
Culmen Length (mm)     0
Culmen Depth (mm)      0
Flipper Length (mm)    0
Body Mass (g)          0
Sex                    0
dtype: int64
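Filling with the most frequent value treats numeric and categorical columns alike. As a point of comparison, and not the approach used in this notebook, the sketch below fills numeric columns with the median and text columns with the mode, using pandas only. The toy frame and the `filled` name are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative data only).
toy = pd.DataFrame({
    "Body Mass (g)": [3750.0, 3800.0, np.nan, 3450.0],
    "Sex": ["MALE", "FEMALE", None, "FEMALE"],
})

filled = toy.copy()
for column in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[column]):
        # Numeric column: fill with the median of the observed values.
        filled[column] = filled[column].fillna(filled[column].median())
    else:
        # Non-numeric column: fill with the most frequent observed value.
        filled[column] = filled[column].fillna(filled[column].mode().iloc[0])

print(filled)
```

Which strategy is preferable depends on how the filled values will be used downstream. With only a handful of missing entries, either choice should have a small effect on the summary statistics.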

<br>

#### Convert values in Sex column from strings to integers

In [42]:
# Encode the Sex strings as integer labels.
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
data["Sex"] = lb.fit_transform(data["Sex"])

In [43]:
# Check the Sex column.
data.Sex.head()

0    2
1    1
2    1
3    2
4    1
Name: Sex, dtype: int64

<br>

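##### The value 2 denotes male and 1 denotes female. Because `LabelEncoder` assigns labels starting at 0, the column must also contain a third distinct value, encoded as 0, which is worth inspecting.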
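To keep the integer codes explicit rather than relying on the encoder's sorted ordering of the strings, the mapping can be written out by hand. The sketch below is an alternative to `LabelEncoder`, not the method used above. `SEX_CODES` is a hypothetical name, and the series reuses the first five Sex values after imputation.

```python
import pandas as pd

# First five Sex values after imputation, as strings.
sex = pd.Series(["MALE", "FEMALE", "FEMALE", "MALE", "FEMALE"], name="Sex")

# Explicit, hypothetical mapping chosen to match the codes shown above.
SEX_CODES = {"FEMALE": 1, "MALE": 2}

encoded = sex.map(SEX_CODES)

# Any string missing from the mapping becomes NaN, which exposes unexpected values.
unmapped = sex[encoded.isnull()].unique()

print(encoded.tolist())  # [2, 1, 1, 2, 1]
print(list(unmapped))    # []
```

With an explicit mapping, an unexpected entry in the column shows up as a missing value instead of silently receiving its own integer code.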

<br>

#### Check species count

In [44]:
# Count of each species. 
data['Species'].value_counts()

Adelie Penguin (Pygoscelis adeliae)          152
Gentoo penguin (Pygoscelis papua)            124
Chinstrap penguin (Pygoscelis antarctica)     68
Name: Species, dtype: int64
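These counts give the proportion of each species, which is one way a synthetic species column could be generated. The sketch below converts the counts into probabilities and samples labels accordingly. It assumes that sampling species independently, in proportion to their observed frequency, is adequate for the purpose, and the seed and sample size are arbitrary.

```python
import numpy as np

# Species counts copied from the value_counts() output above.
species_counts = {
    "Adelie Penguin (Pygoscelis adeliae)": 152,
    "Gentoo penguin (Pygoscelis papua)": 124,
    "Chinstrap penguin (Pygoscelis antarctica)": 68,
}

labels = list(species_counts)
counts = np.array(list(species_counts.values()), dtype=float)

# Convert counts into sampling probabilities that sum to 1.
probabilities = counts / counts.sum()

# Seeded generator so the sketch is reproducible.
rng = np.random.default_rng(seed=0)
synthetic_species = rng.choice(labels, size=344, p=probabilities)

# Compare the synthetic counts with the originals.
unique, synthetic_counts = np.unique(synthetic_species, return_counts=True)
print(dict(zip(unique.tolist(), synthetic_counts.tolist())))
```

The synthetic counts will vary around the originals from run to run unless the seed is fixed.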

***
# End