# SEN163A - Responsible Data Analytics
## Lab session 1: Data collection, description and visualization
### Delft University of Technology
### Q3 2022

**Instructor**: Dr. Ir. Jacopo De Stefani - J.deStefani@tudelft.nl

**TAs**: Antonio Sanchez Martin - A.SanchezMartin@student.tudelft.nl

### Instructions

Lab sessions aim to:
- Show and reinforce how models and ideas presented in class are put to practice.
- Help you gather hands-on data analytics skills.

Lab sessions are:

- Learning environments where you work with Jupyter notebooks and where you can get support from TAs and fellow students.
- Not graded and do not have to be submitted.
- A good preparation for the assignments (which are graded).


### Application: Exploratory data analysis of a medical dataset

In this lab session, we will explore different open data sources and data types, and practice exploratory data analysis on an example dataset.
The example dataset is an extract from a study relating the likelihood of experiencing a stroke to both physiological quantities (e.g. age, gender, glucose level) and contextual information (location of the home, type of job).

#### Learning objectives
After completing the following exercises you will be able to:

1. Explore different publicly available data portals at different geographical levels.
2. Import a data set from multiple sources using pandas.
3. Describe a dataset using pandas.
4. Visualise data and get preliminary insights using seaborn.


# Data collection

## Data sources

### Global websites
- Proprietary sources
    - Kaggle
    - Google Datasets
    - …
- Governmental sources
    - EU: https://data.europa.eu/en
    - US: https://www.data.gov/
    - …
- Academic sources
    - UCI Machine Learning Repository
    - …

### National websites
- Governmental sources
    - https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS
    - https://data.overheid.nl/

### Local websites
- Governmental sources (NL only)
    - https://www.zuid-holland.nl/politiek-bestuur/feiten-cijfers/open-data-0/
    - https://data-delft.opendata.arcgis.com/

## Scraping
![WebServerWorkflow](figures/WebScraping.png)

Scraping will not be discussed in detail in the course, but more information can be found at (among others):

- https://hirinfotech.com/what-is-web-scraping/
- https://www.educative.io/blog/python-web-scraping-tutorial

## Activity 1 - Data sources

Explore the different data sources to find a dataset which is particularly relevant to you. If this seems overwhelming, you can have a look in the `example` folder for some pre-selected examples.

Try to answer the following questions:

- In which format is the dataset you found?
- Which computer application can you use to open this data?
- Is it structured or unstructured data?


Write your answers here

In [None]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)

## Activity 2 - Data formats

We are now going to try to import files in different formats (`.csv`, `.xlsx`, `.json`, `.zip`) into Python.

1. Have a look at the documentation of the [Pandas](https://pandas.pydata.org/) library (and specifically the `read_` functions) in order to determine how to use it.
2. Try to use these functions to read the datasets from the `data/examples` folder into your Python code.
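
As a starting point, the sketch below shows how the `read_` functions behave on tiny in-memory stand-ins. The file contents, the `sample.zip` archive and all variable names are made up for illustration; they are not the files in the lab folder, so replace them with the real paths when you work on the activity.

```python
import io
import json
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature data, standing in for the real lab files.
csv_text = "id,age,gender\n1,67,Male\n2,61,Female\n"
json_text = json.dumps([{"id": 1, "age": 67}, {"id": 2, "age": 61}])

# read_csv and read_json accept a path or a file-like object.
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

# A .zip archive that holds a single CSV file can be passed to read_csv
# directly; the compression is inferred from the file extension.
with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = Path(tmp_dir) / "sample.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("sample.csv", csv_text)
    zip_df = pd.read_csv(zip_path)

# Excel files are read with read_excel, which needs an extra engine
# such as openpyxl to be installed (not executed here):
# excel_df = pd.read_excel("path/to/your_file.xlsx")

print(csv_df)
print(json_df)
print(zip_df)
```

Each function returns a `DataFrame`, so once a file is loaded the rest of the analysis no longer depends on the original format.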


In [None]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)

## Activity 3 - Dataset summarization

Let's start the analysis of the medical dataset (XLS format).

### Context

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

### Attributes Information

1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

### Acknowledgements

(Confidential Source) - Use only for educational purposes


### Tasks

1. Store the dataset in a Python variable as a Pandas DataFrame.
2. Use the Pandas built-in functions (`info, describe, value_counts, nunique`) to provide a summary of the dataset.
    - Do the columns contain any missing values?
    - What are the types of the columns? Categorical, Numerical, Ordinal?
3. Try to select the first 100 lines/samples/datapoints of the dataset.
4. Try to select the first 5 columns/features of the dataset.
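
The sketch below illustrates the summary functions and the row/column selection on a small made-up `DataFrame`. The name `sample_df`, its values and the cut-offs `n_rows` and `n_columns` are assumptions for illustration only; with the real dataset you would use 100 rows and 5 columns as asked in the tasks.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few columns of the stroke dataset.
sample_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "stroke": [1, 1, 1, 0],
})

sample_df.info()                            # column types and non-null counts
print(sample_df.describe())                 # statistics of the numeric columns
print(sample_df["gender"].value_counts())   # frequency of each category
print(sample_df.nunique())                  # number of distinct values per column

# Count the missing values in each column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Positional selection: first rows, then first columns.
n_rows, n_columns = 2, 3                    # use 100 and 5 on the real dataset
first_rows = sample_df.iloc[:n_rows]        # same result as sample_df.head(n_rows)
first_columns = sample_df.iloc[:, :n_columns]
```

`iloc` selects by position, which is why it works here without knowing the row labels or the column names.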

In [None]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)

## Activity 4 - Variable analysis

We are going to use the `pandas` library to perform some exploratory understanding of the data.

1. Load the dataset in the `stroke_df` variable
2. Display the content of the `stroke_df` variable
3. What are the types of the different columns? Use the knowledge from `pandas` to determine the type.
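
One possible way to inspect column types is sketched below. The `stroke_df` built here is a made-up miniature with only a few of the columns, not the real dataset, so load the actual file in its place when you do the activity.

```python
import pandas as pd

# Made-up miniature of stroke_df; load the real file here instead.
stroke_df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0],
    "hypertension": [0, 0, 1],
    "avg_glucose_level": [228.69, 202.21, 105.92],
    "smoking_status": ["formerly smoked", "never smoked", "Unknown"],
})

# Storage type of every column as pandas sees it.
print(stroke_df.dtypes)

# Split the column names by storage type.
numeric_columns = stroke_df.select_dtypes(include="number").columns.tolist()
other_columns = stroke_df.select_dtypes(exclude="number").columns.tolist()

print("Numeric storage:", numeric_columns)
print("Non-numeric storage:", other_columns)
```

Note that the storage type is not always the statistical type: `hypertension` is stored as an integer, yet its 0/1 values encode a category.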


In [None]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)

## Activity 5 - Visualization

Two types of variables can be found in this dataset:
- Numeric: `int64` and `float64`
- Categorical: `object`

We are going to use the `seaborn` library to perform some exploratory visualizations.
Have a look at the documentation of the [Seaborn](https://seaborn.pydata.org/) library and the Gallery to get some inspiration for some plots.

According to the type of variables, different visualizations can be used:

- Numeric - Numeric: Scatterplot
- (Numeric - Categorical)/(Categorical - Numeric): Boxplot, Scatterplot, Violin plot
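
The two cases above can be sketched as follows, assuming `seaborn` and `matplotlib` are installed. The values in `plot_df` are made up for illustration, and the `Agg` backend line is only there so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values that mimic three columns of the stroke dataset.
plot_df = pd.DataFrame({
    "age": [67, 61, 80, 49, 79, 35],
    "avg_glucose_level": [228.7, 202.2, 105.9, 171.2, 174.1, 82.0],
    "gender": ["Male", "Female", "Male", "Female", "Female", "Male"],
})

fig, (scatter_ax, box_ax) = plt.subplots(1, 2, figsize=(10, 4))

# Numeric - Numeric: scatterplot.
sns.scatterplot(data=plot_df, x="age", y="avg_glucose_level", ax=scatter_ax)
scatter_ax.set(xlabel="Age", ylabel="Average glucose level",
               title="Numeric vs numeric")

# Categorical - Numeric: boxplot.
sns.boxplot(data=plot_df, x="gender", y="avg_glucose_level", ax=box_ax)
box_ax.set(xlabel="Gender", ylabel="Average glucose level",
           title="Categorical vs numeric")

fig.tight_layout()
```

Passing `ax=` to each call keeps both plots in one figure, which makes it easier to compare them side by side.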

1. Try to answer some of these questions using the [Seaborn](https://seaborn.pydata.org/) library:
    1) Who has more strokes: male or female patients?
    2) Which age group is more likely to get a stroke?
    3) Is hypertension a cause of strokes?
    4) Is a person with heart disease more likely to get a stroke?
    5) Could marriage be a cause of strokes?
    6) Are people working in private jobs the majority of people with strokes (mostly because of stress)?
    7) Do people living in urban areas have a higher chance of getting a stroke?
    8) Are glucose levels a symptom of stroke? Or of a pathology, such as obesity, leading to stroke?
    9) What is the relationship between BMI, age and gender?
    10) Are people who smoke more likely to get a stroke?
    
2. Provide additional visualizations you deem relevant for the problem at hand using [Seaborn](https://seaborn.pydata.org/)

Make sure to:
- Think carefully about the data types of the different variables while determining the suitable plots, before diving into the actual code
- Respect the best practices in terms of plot readability (axis naming, ticks naming, proper scaling)
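
When comparing groups, raw counts can mislead if the groups differ in size, so it can help to plot a rate per group instead. The sketch below shows one way to do this, assuming `seaborn` and `matplotlib` are installed. The data in `toy_df` is made up and the names `stroke_counts`, `stroke_rates` and `rates_df` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data: two male and four female patients.
toy_df = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Female", "Female"],
    "stroke": [1, 0, 1, 0, 0, 0],
})

# Same number of strokes in both groups...
stroke_counts = toy_df.groupby("gender")["stroke"].sum()
# ...but a different share of patients affected.
stroke_rates = toy_df.groupby("gender")["stroke"].mean()
print(stroke_counts)
print(stroke_rates)

# Plot the rates with explicit axis labels and a fixed 0-1 scale.
rates_df = stroke_rates.reset_index(name="stroke_rate")
fig, ax = plt.subplots(figsize=(5, 4))
sns.barplot(data=rates_df, x="gender", y="stroke_rate", ax=ax)
ax.set(xlabel="Gender", ylabel="Share of patients with a stroke", ylim=(0, 1))
fig.tight_layout()
```

Because `stroke` is coded as 0/1, its mean within a group is the share of patients in that group who had a stroke.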

In [None]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)