# Programming for Data Analysis 


## Project 2020

## Problem Statement

For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose.

Specifically, in this project you should:

• Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

• Investigate the types of variables involved, their likely distributions, and their relationships with each other.

• Synthesise/simulate a data set as closely matching their properties as possible.

• Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set.

## Introduction

This notebook contains my solution to the problem statement above. Author: Jean Bonsenge. Email: g00387887@gmit.ie
 

## References

[1] EU Open Data Portal (2020) “COVID-19 Coronavirus data”. Available at:

https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data/resource/260bbbde-2316-40eb-aec3-7cd7bfc2f590



## Development 

### • Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables. 

In [2]:
# Import pandas, which provides the DataFrame used throughout this notebook
import pandas as pd

# Load a local CSV copy of the COVID-19 dataset [1] into a DataFrame
df = pd.read_csv("COVID-19-geographic-disbtribution-worldwide - Copy - Copy.csv")

## About the dataset
This dataset records COVID-19 cases and deaths worldwide and is published by the EU Open Data Portal [1].
It is a CSV file containing 57640 rows and 12 columns (variables), as the output below shows.

In [3]:
# Display the DataFrame.
df

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
0,24/11/2020,24,11,2020,246,17,Afghanistan,AF,AFG,38041757.0,Asia,6.713675
1,23/11/2020,23,11,2020,252,8,Afghanistan,AF,AFG,38041757.0,Asia,6.655844
2,22/11/2020,22,11,2020,154,12,Afghanistan,AF,AFG,38041757.0,Asia,6.203709
3,21/11/2020,21,11,2020,232,25,Afghanistan,AF,AFG,38041757.0,Asia,6.130106
4,20/11/2020,20,11,2020,282,5,Afghanistan,AF,AFG,38041757.0,Asia,5.672714
...,...,...,...,...,...,...,...,...,...,...,...,...
57635,25/03/2020,25,3,2020,0,0,Zimbabwe,ZW,ZWE,14645473.0,Africa,
57636,24/03/2020,24,3,2020,0,1,Zimbabwe,ZW,ZWE,14645473.0,Africa,
57637,23/03/2020,23,3,2020,0,0,Zimbabwe,ZW,ZWE,14645473.0,Africa,
57638,22/03/2020,22,3,2020,1,0,Zimbabwe,ZW,ZWE,14645473.0,Africa,


In [6]:
# The describe method summarises each numeric column:
# count, mean, std, min, 25%, median (50%), 75% and max
df.describe()

Unnamed: 0,day,month,year,cases,deaths,popData2019,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
count,57640.0,57640.0,57640.0,57640.0,57640.0,57537.0,54781.0
mean,15.786103,6.724774,2019.998838,1028.929441,24.2483,41341960.0,57.131752
std,8.745097,2.767141,0.034074,5836.617972,125.652523,154010200.0,144.085459
min,1.0,1.0,2019.0,-8261.0,-1918.0,815.0,-147.419587
25%,8.0,5.0,2020.0,0.0,0.0,1324820.0,0.64812
50%,16.0,7.0,2020.0,14.0,0.0,7813207.0,6.178242
75%,23.0,9.0,2020.0,237.0,4.0,28608720.0,45.35689
max,31.0,12.0,2020.0,196117.0,4928.0,1433784000.0,1900.83621
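
Two details of the summary above are easy to miss: the `count` row differs between columns (57640 for `cases`, 57537 for `popData2019`), and the `50%` row is the median rather than the mean. The sketch below illustrates both on a small toy frame. The frame `toy` and its missing value are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: popData2019 has one missing value on purpose.
toy = pd.DataFrame({
    "cases": [246, 252, 154, 232, 282],
    "popData2019": [38041757.0, 38041757.0, np.nan, 38041757.0, 38041757.0],
})

summary = toy.describe()

# describe() counts only non-missing values, so the counts differ per column.
cases_count = summary.loc["count", "cases"]
pop_count = summary.loc["count", "popData2019"]

# The 50% row is the median of the column.
cases_median = toy["cases"].median()

print(summary)
print(cases_count, pop_count, cases_median)
```

A lower `count` in one column therefore suggests missing values in that column, which can be checked directly with `df.isna().sum()`.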


In [4]:
# Use the .loc indexer to select the four variables of interest by label:
# cases, deaths, popData2019 and
# Cumulative_number_for_14_days_of_COVID-19_cases_per_100000.
# .loc slices include both endpoints, so 0:99 selects exactly 100 rows (labels 0 to 99).
df.loc[0:99, ["cases", "deaths", "popData2019", "Cumulative_number_for_14_days_of_COVID-19_cases_per_100000"]]

Unnamed: 0,cases,deaths,popData2019,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
0,246,17,38041757.0,6.713675
1,252,8,38041757.0,6.655844
2,154,12,38041757.0,6.203709
3,232,25,38041757.0,6.130106
4,282,5,38041757.0,5.672714
...,...,...,...,...
96,160,8,38041757.0,2.268560
97,0,0,38041757.0,2.024092
98,3,0,38041757.0,2.239644
99,45,5,38041757.0,2.329020
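
Label-based and position-based slices treat the end point differently, which matters when exactly 100 rows are required. The sketch below shows the difference on a hypothetical frame `toy` with the default integer index; the frame is an assumption made for illustration only.

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex (labels 0..199).
toy = pd.DataFrame({"cases": range(200)})

# .loc slices by label and includes BOTH endpoints.
loc_0_to_100 = toy.loc[0:100, "cases"]   # labels 0..100
loc_0_to_99 = toy.loc[0:99, "cases"]     # labels 0..99

# .iloc slices by position and excludes the end, like a Python list.
iloc_0_to_100 = toy.iloc[0:100, 0]       # positions 0..99

print(len(loc_0_to_100), len(loc_0_to_99), len(iloc_0_to_100))
```

With the default index, `.loc[0:99]` and `.iloc[0:100]` select the same 100 rows, while `.loc[0:100]` selects 101.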


In [7]:
# The .iloc indexer selects by zero-based position: rows 0 to 99 of the
# column at position 4 (the fifth column, cases)
df.iloc[0:100, 4]

0     246
1     252
2     154
3     232
4     282
     ... 
95     97
96    160
97      0
98      3
99     45
Name: cases, Length: 100, dtype: int64

In [8]:
# Rows 0 to 99 of the column at position 5 (the sixth column, deaths)
df.iloc[0:100, 5]

0     17
1      8
2     12
3     25
4      5
      ..
95     2
96     8
97     0
98     0
99     5
Name: deaths, Length: 100, dtype: int64

In [9]:
# Rows 0 to 99 of the column at position 9 (the tenth column, popData2019)
df.iloc[0:100, 9]

0     38041757.0
1     38041757.0
2     38041757.0
3     38041757.0
4     38041757.0
         ...    
95    38041757.0
96    38041757.0
97    38041757.0
98    38041757.0
99    38041757.0
Name: popData2019, Length: 100, dtype: float64

In [10]:
# Rows 0 to 99 of the column at position 11 (the twelfth column,
# Cumulative_number_for_14_days_of_COVID-19_cases_per_100000)
df.iloc[0:100, 11]

0     6.713675
1     6.655844
2     6.203709
3     6.130106
4     5.672714
        ...   
95    2.415766
96    2.268560
97    2.024092
98    2.239644
99    2.329020
Name: Cumulative_number_for_14_days_of_COVID-19_cases_per_100000, Length: 100, dtype: float64
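
The four `.iloc` calls above can also be combined into one by passing a list of column positions. The sketch below rebuilds the first five rows shown earlier into a small frame called `sample` (a name chosen here for illustration) and looks the positions up from the column names, so they do not have to be counted by hand.

```python
import pandas as pd

rate = "Cumulative_number_for_14_days_of_COVID-19_cases_per_100000"
columns = ["dateRep", "day", "month", "year", "cases", "deaths",
           "countriesAndTerritories", "geoId", "countryterritoryCode",
           "popData2019", "continentExp", rate]

# First five rows of the dataset, copied from the output shown earlier.
rows = [
    ["24/11/2020", 24, 11, 2020, 246, 17, "Afghanistan", "AF", "AFG", 38041757.0, "Asia", 6.713675],
    ["23/11/2020", 23, 11, 2020, 252, 8, "Afghanistan", "AF", "AFG", 38041757.0, "Asia", 6.655844],
    ["22/11/2020", 22, 11, 2020, 154, 12, "Afghanistan", "AF", "AFG", 38041757.0, "Asia", 6.203709],
    ["21/11/2020", 21, 11, 2020, 232, 25, "Afghanistan", "AF", "AFG", 38041757.0, "Asia", 6.130106],
    ["20/11/2020", 20, 11, 2020, 282, 5, "Afghanistan", "AF", "AFG", 38041757.0, "Asia", 5.672714],
]
sample = pd.DataFrame(rows, columns=columns)

wanted = ["cases", "deaths", "popData2019", rate]

# Look up the zero-based position of each wanted column.
positions = [sample.columns.get_loc(name) for name in wanted]

# Selecting by position and by name gives the same result.
by_position = sample.iloc[:, positions]
by_name = sample.loc[:, wanted]

print(positions)
print(by_position)
```

Selecting by name is usually more robust, because positions shift if a column is added or removed.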

### • Investigate the types of variables involved, their likely distributions, and their relationships with each other. 

### • Synthesise/simulate a data set as closely matching their properties as possible. 
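
One possible starting point for the simulation is sketched below using `numpy.random`. It is a rough illustration, not a fitted model: the distributions (negative binomial for daily cases, binomial for deaths) and the parameters `mean_cases`, `dispersion` and `fatality_rate` are all assumptions chosen for illustration and have not been estimated from the dataset. The 14-day column is assumed here to be the rolling 14-day sum of cases per 100000 population, with rows in chronological order.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2020)

n_days = 100
population = 38041757.0  # popData2019 value from the rows shown above

# Assumed, illustrative parameters (NOT fitted to the real data).
mean_cases = 150       # hypothetical average daily cases
dispersion = 2         # hypothetical negative binomial shape
fatality_rate = 0.04   # hypothetical probability that a case is fatal

# Negative binomial with mean mean_cases: p = n / (n + mean).
p = dispersion / (dispersion + mean_cases)
cases = rng.negative_binomial(dispersion, p, size=n_days)

# Each day's deaths are drawn from that day's cases, so deaths <= cases.
deaths = rng.binomial(cases, fatality_rate)

synthetic = pd.DataFrame({
    "cases": cases,
    "deaths": deaths,
    "popData2019": population,
})

# Assumed definition: rolling 14-day sum of cases per 100000 population.
# The first 13 days have no full window, so they are NaN.
rate = "Cumulative_number_for_14_days_of_COVID-19_cases_per_100000"
rolling_sum = synthetic["cases"].rolling(window=14).sum()
synthetic[rate] = rolling_sum / population * 100000

print(synthetic)
```

Deriving deaths and the 14-day rate from the simulated cases keeps the relationships between the four variables consistent, rather than drawing each column independently.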

### • Detail your research and implement the simulation in a Jupyter notebook.