# Exploratory Data Analysis and Regression
## Introduction to Data Science
### Kigali, Rwanda
### July 8th, 2019



<img src="fig/logos.jpg">

## What is this course?
This is a 5-day short workshop on the ***practical*** fundamentals of data science using `python`.

This workshop will cover a variety of models solving a range of different data science tasks; it will familiarize participants with useful `python` libraries for data science.

## What is the structure of each day?

Each session of the workshop is 4 hours long: roughly half will be lecture and half will be hands-on exercises.

You will work in groups during class to complete coding exercises solving problems involving ***real*** data.

## What is expected of the participants?

This workshop involves a great deal of programming and requires an intuitive but sound understanding of the mathematical or statistical theory behind machine learning models.

To be successful, participants need to have some previous experience with programming as well as basic statistics; participants need to be committed to gaining a working knowledge of `python` during the workshop.

## How is the workshop graded?

Participants are encouraged to complete all in-class exercises as homework.

There will be an in-class, 4-hour, open-book exam on Day 5. The exam will require participants to solve a data science task using a real data set and will be based on the skills demonstrated in the in-class exercises.

## More information about the course

All materials and information can be found at the course website:
```
https://github.com/onefishy/rwanda_workshop
```

# Introduction to Data Science

## Outline
1. What Is Data?
2. What is Data Science?
3. Exploring Data
  - Descriptive Statistics
  - Data Visualization
4. Data Science Tools

# What is Data?

## What is Data?

*“A **datum** is a single measurement of something on a scale that is understandable to both the recorder and the reader. **Data** is multiple such measurements.”*

**Claim:** everything is (can be) data!

<img src="fig/fig1.jpg" style="height:300px;">

# What is Data Science?

## What Does it Mean to Do Data Science?
The "data science" process.

1. Find a problem you want to solve
2. Collect data you think will help solve it
3. Exploratory Data Analysis:
  - examine your data for patterns and trends.
4. Build mathematical models to describe trends and patterns
5. Analyze your model
  - What does it say about your data?
  - How do your findings help solve your problem?
6. Visualization and Presentation of Results

**Note:** This process is not linear!

## Where Does Data Come From?

**Internal sources:** data already collected by, or part of the overall data collection of, your organization.

**Existing External Sources:** data available in a ready-to-read format from an outside source, for free or for a fee.

**External Sources Requiring Collection Efforts:** data available from an external source, but whose acquisition requires special processing.

## What Does Data Look Like?

Simple or atomic data:
 - **Numeric:** integers, floats
 - **Boolean:** binary or true/false values
 - **Strings:** sequence of symbols

## What Does Data Look Like?

Compound, composed of a bunch of atomic types:
 - **Date and time:** compound value with a specific structure
 - **Lists:** a list is a sequence of values
 - **Dictionaries:** collections of key-value pairs `x : y`, where `x` is the key (usually a string that names the value) and `y` is a value of any type.
 
   **Example:** Student record
      - First: Weiwei
      - Last: Pan
      - Nationality: USA
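
As a rough illustration, the student record above could be written as a Python dictionary. The variable names `student_record` and `record_keys` are chosen here for illustration only.

```python
# A dictionary stores key-value pairs; the keys here are strings
student_record = {
    "First": "Weiwei",
    "Last": "Pan",
    "Nationality": "USA",
}

# Look up a value by its key
print(student_record["First"])

# A list is an ordered sequence of values; here, the keys of the record
record_keys = list(student_record.keys())
print(record_keys)
```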

## How is your data represented and stored?

Data formats:
- **Tabular Data:** a dataset that is a two-dimensional table, where each row typically represents a single data record, and each column represents one type of measurement (csv, xlsx etc.).

- **Structured Data:** each data record is presented in the form of a (possibly complex and multi-tiered) dictionary (json, xml etc.).

- **Semistructured Data:** not all records are represented by the same set of keys or some data records are not represented using the key-value pair structure.
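
To make the difference between these formats concrete, here is a minimal sketch that turns a few dictionary-style records into a table with `pandas`. The records are made up for illustration, and the second one deliberately lacks the `age` key, as can happen in semistructured data.

```python
import pandas as pd

# Made-up records: nested dictionaries, and one record without an "age" key
records = [
    {"name": {"first": "Amina", "last": "Uwase"}, "age": 30},
    {"name": {"first": "Jean", "last": "Mugabo"}},
]

# Flatten the nested dictionaries into the columns of a table
table = pd.json_normalize(records)
print(table)
```

The nested `name` dictionary becomes the columns `name.first` and `name.last`, and the missing `age` shows up as a missing value (`NaN`) in the table.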

## Tabular Data
In tabular data, each row or ***observation*** represents a set of measurements of a single object or event.

<img src="fig/fig2.jpg">

Each type of measurement is an ***attribute*** of the data. The number of attributes is the ***dimension*** of the data.

## Types of Attributes in Tabular Data
It’s vital to distinguish between classes of attributes based on the type of values they take on.
- **Quantitative attribute:** a real-valued number whose values can be ordered.

   **Example:** Height is a quantitative attribute
<br><br>
- **Categorical attribute:** a number or string with no inherent order among its values.

   **Example:** “What kind of pet you have” is a categorical attribute
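
A small sketch of how the two kinds of attributes might look in `pandas`, using a made-up table (the column names `height_cm` and `pet` are assumptions for illustration):

```python
import pandas as pd

# Made-up observations for illustration
survey = pd.DataFrame({
    "height_cm": [160.5, 172.0, 181.3],  # quantitative attribute
    "pet": ["cat", "dog", "cat"],        # categorical attribute
})

# Quantitative values can be sorted and compared
print(survey["height_cm"].sort_values())

# Categorical values form a set of labels with no inherent order
survey["pet"] = survey["pet"].astype("category")
print(survey["pet"].cat.categories)
```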

## Is the data any good?

- **Missing values:** how do we fill in?
- **Wrong values:** how can we detect and correct?
- **Messy format:** how can we convert structured or semistructured data into tabular data?
- **Not usable:** what if the data cannot answer the question posed?
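
As one possible starting point for the first two questions, the sketch below counts missing values, fills them with the column median, and flags implausible values. The data and the plausible range of 0 to 120 are assumptions made for illustration; the right strategy depends on the dataset.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value and one implausible value
df = pd.DataFrame({"age": [25, np.nan, 31, 240]})

# Count the missing values in each column
print(df.isna().sum())

# One simple option: fill missing values with the column median
df["age_filled"] = df["age"].fillna(df["age"].median())

# Flag recorded values outside an assumed plausible range of 0 to 120
df["age_suspect"] = df["age"].notna() & ~df["age"].between(0, 120)
print(df)
```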

# Exploring Data

## Describing Individual Attributes

Given a large dataset, we want to compute a few quantities that intuitively summarize the data. We’d like to know:

1. what are "typical" values for our attributes? 
2. how "representative" are these typical values?



## Describing A Typical Value
We can describe a typical value for $n$ samples of a **quantitative** attribute $x$ by the ***mean*** or the ***median***.

<img src="fig/fig3.jpg" style="height:300px;">

## Describing A Typical Value
We can describe a typical value for $n$ samples of a **categorical** attribute $x$ by the ***mode***.

<img src="fig/fig4.jpg" style="height:300px;">
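
A minimal sketch of computing these typical values with `pandas`, on made-up samples:

```python
import pandas as pd

heights = pd.Series([150, 160, 165, 170, 230])   # quantitative attribute
pets = pd.Series(["cat", "dog", "cat", "fish"])  # categorical attribute

print(heights.mean())    # 175.0
print(heights.median())  # 165.0
print(pets.mode()[0])    # cat
```

Note how the single large value (230) pulls the mean well above the median.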

## Describing the Spread of a Value

The spread of the samples indicates how well the mean or median represents the sample set. Common measures of spread are:

1. ***Range***
2. ***Variance***
3. ***Standard deviation***
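
These measures of spread can be sketched with `numpy` on a made-up sample:

```python
import numpy as np

scores = np.array([2, 4, 4, 4, 5, 5, 7, 9])

value_range = scores.max() - scores.min()  # largest minus smallest value
variance = scores.var()                    # mean squared distance from the mean
std_dev = scores.std()                     # square root of the variance

print(value_range, variance, std_dev)      # 7 4.0 2.0
```

By default `numpy` divides by $n$ (population variance); passing `ddof=1` divides by $n - 1$ instead, which is the default in `pandas`.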


## Why Data Visualization?

The following is called Anscombe’s Quartet; all four sets of data have identical simple summary statistics.

<img src="fig/fig5.jpg" style="height:300px;">


## Anscombe's Quartet


<img src="fig/fig6.jpg" style="height:300px;">
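
To check the claim about identical summary statistics, the sketch below computes the mean of each variable and the correlation for the four data sets of the quartet. The values are those of the published quartet; the variable names are illustrative.

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets I, II, III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    # mean of x, mean of y, and the correlation between x and y
    summary[name] = (np.mean(x), np.mean(y), np.corrcoef(x, y)[0, 1])
    mean_x, mean_y, corr = summary[name]
    print(f"{name}: mean_x={mean_x:.2f} mean_y={mean_y:.2f} corr={corr:.3f}")
```

The four sets agree on these statistics to about two decimal places, even though they look very different when plotted.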



## Why Data Visualization?

Suppose I tell you that the average score for the final exam is 7.64/15.

What does that suggest?

## Why Data Visualization?

If I then show you the following graph, what does it suggest?

<img src="fig/fig7.jpg" style="height:300px;">

## What is Data Visualization Good For?

Analyze:
- Identify hidden patterns and trends
- Help formulate/test hypotheses
- Help determine the next step in analysis/modeling

## What is Data Visualization Good For?
Communicate:
- Present information and ideas succinctly 
- Provide evidence and support
- Influence and persuade

## Types of Data Visualization

What do you want your visualization to show about your data?
- **Distribution:** how one or more variables in the dataset are distributed over a range of possible values.
- **Relationship:** how the values of multiple variables in the dataset relate.
- **Composition:** how the dataset breaks down into subgroups.
- **Comparison:** how trends in multiple variables or datasets compare.

## Visualizing Distributions

A ***histogram*** is a way to visualize how a **quantitative** attribute is distributed across certain values.


<img src="fig/fig8.jpg" style="height:120px;">

**Note:** Trends in histograms are sensitive to the number of bins.
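
A minimal sketch of this sensitivity with `matplotlib`: the same randomly generated values (not real data) are drawn with 5 bins and with 50 bins.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values drawn from a normal distribution, for illustration only
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)

# Draw the same data with two different numbers of bins
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, n_bins in zip(axes, [5, 50]):
    ax.hist(values, bins=n_bins)
    ax.set_title(f"{n_bins} bins")
    ax.set_xlabel("value")
    ax.set_ylabel("count")
# In a notebook, the figure is displayed below the cell
```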

## Visualizing Distributions

A ***bar chart*** is a way to visualize how a **categorical** attribute is distributed across certain values.


<img src="fig/fig4.jpg" style="height:300px;">
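
A bar chart like this can be sketched by counting how often each category occurs and plotting the counts. The pet data below is made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up categorical observations
pets = pd.Series(["cat", "dog", "cat", "fish", "dog", "cat"])

# Count the occurrences of each category
counts = pets.value_counts()

# One bar per category, with height equal to its count
fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_xlabel("pet")
ax.set_ylabel("count")
```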

## Visualizing Relationships

A ***scatter plot*** is a way to visualize the relationship between two different attributes.

<img src="fig/fig9.jpg" style="height:300px;">
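
A minimal scatter plot sketch with `matplotlib`. The two attributes are synthetic, with the second generated as a noisy linear function of the first, so the names `x` and `y` stand in for any pair of quantitative attributes.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic attributes: y depends linearly on x, plus random noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(0, 2, size=100)

# Each point is one observation: (value of x, value of y)
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
```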

## Visualizing Comparisons of Subgroups 

Plotting multiple histograms or curves on the same axes is a way to visualize how different subgroups of data compare.

**For example:** the following are the blood-glucose distributions of three subgroups within the dataset.

<img src="fig/fig10.jpg" style="height:300px;">
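
A plot of this kind can be sketched by drawing one semi-transparent histogram per subgroup on the same axes. The three groups below are synthetic and are not the blood-glucose data shown in the figure.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic measurements for three hypothetical subgroups
rng = np.random.default_rng(2)
groups = {
    "group A": rng.normal(90, 10, size=300),
    "group B": rng.normal(110, 15, size=300),
    "group C": rng.normal(140, 20, size=300),
}

# Overlay the histograms; alpha makes overlapping bars visible
fig, ax = plt.subplots()
for label, values in groups.items():
    ax.hist(values, bins=30, alpha=0.5, label=label)
ax.set_xlabel("measurement")
ax.set_ylabel("count")
ax.legend()
```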

## Generating Hypotheses

We can see that birth weight is positively correlated with femur length.

<img src="fig/fig11.jpg" style="height:200px;">

Can we describe exactly how they are correlated?

## Generating Hypotheses
We can see that types of iris seem to be distinguished by petal and sepal lengths.

<img src="fig/fig12.jpg" style="height:300px;">

Can we predict the type of iris given petal and sepal lengths?

# Data Science Tools

## Tools for the Workshop

In this workshop, we will be using `python`, for which there are many robust and well-documented libraries for machine learning and data science.

Specifically, we will use
1. `colab notebook` - a Google web app for creating interactive documents containing text, equations, live code and outputs.
2. `pandas` - a `python` library for reading and manipulating data
3. `matplotlib` - a `python` library for data visualization
4. `numpy` - a `python` library for manipulating numeric data
5. `scikit-learn` - a `python` library with many machine learning models
6. `keras` - a `python` library for deep learning


# `colab` Notebooks

## What Does Colab Notebook Look Like?

Go to `https://colab.research.google.com/`

<img src="fig/fig13.jpg" style="height:300px;">

## What Does Colab Notebook Look Like?

Each notebook consists of blocks of cells. Each cell is one of two types: (1) rich text (Markdown) cells or (2) code cells.

<img src="fig/fig14.jpg" style="height:200px;">

## What Does Colab Notebook Look Like?

Code is executed by a "computational engine" called the ***kernel*** (IPython). The output of the code is displayed directly below the cell.

<img src="fig/fig15.jpg" style="height:250px;">

## What Does Colab Notebook Look Like?

Each cell can be executed independently, but once a block of code is executed, it lives in the memory of the kernel.

<img src="fig/fig16.jpg" style="height:200px;">

# General Introduction to `python`

## What Does Python Look Like?

Code readability is key, and Python syntax itself is close to plain English.

Your variables should be given descriptive identifiers!

<img src="fig/fig17.jpg" style="height:100px;">

Identifiers for variables should be descriptive words separated by underscores (not spaces) and in all lower case.

## What Does Python Look Like?

You should use white space to increase readability.

<img src="fig/fig18.jpg" style="height:100px;">


## What Does Python Look Like?

You should liberally intersperse your code with comments!

<img src="fig/fig19.jpg" style="height:100px;">

## What Does Python Look Like?

Proper indentation is non-negotiable!

<img src="fig/fig20.jpg" style="height:100px;">

Code blocks are not indicated by delimiters (e.g. `{` `}`), only by indentation!
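
The style points from the last few slides can be sketched together in one short example. The variable names and values are made up for illustration.

```python
# Descriptive, lower-case identifiers separated by underscores
exam_scores = [7, 12, 9, 15]
passing_score = 8

# Count how many scores reach the passing score
number_passed = 0
for score in exam_scores:
    # The indented lines form the body of the loop
    if score >= passing_score:
        # The further-indented line runs only when the condition holds
        number_passed = number_passed + 1

print(number_passed)  # 3
```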

## Python Data Types
The basic built-in Python data types we’ll be using today are:
1. **integers, floating points:** `7`, `7.0`
2. **booleans:** `True`, `False`
3. **strings:** `'hi'`, `"7.0"`
4. **lists:** `['hi', False, 7]`
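
A quick way to explore these types is the built-in `type()` function. The sketch below also shows that list positions are counted from zero; the name `mixed_list` is chosen for illustration.

```python
print(type(7))       # <class 'int'>
print(type(7.0))     # <class 'float'>
print(type(True))    # <class 'bool'>
print(type("7.0"))   # <class 'str'>: the quotes make it a string

# A list can hold values of different types
mixed_list = ['hi', False, 7]
print(mixed_list[0])    # hi (positions start at 0)
print(len(mixed_list))  # 3
```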

## Variables and Assignment
In `python`, the type of a variable is inferred from the value assigned to it.
For example, the assignment
```python
my_var = 7
```
types `my_var` as an integer. Later, the assignment
```python
my_var = 'hello'
```
will cause `my_var` to be typed as a string.
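
The change of type can be observed directly with the built-in `type()` function. This is a small sketch; `first_type` and `second_type` are names introduced here for illustration.

```python
my_var = 7
first_type = type(my_var)
print(first_type)    # <class 'int'>

# Assigning a new value also changes the type of the variable
my_var = 'hello'
second_type = type(my_var)
print(second_type)   # <class 'str'>
```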

## Learning `python`

The course website contains readings, cheatsheets and tutorials for mastering general `python` programming as well as using `python` libraries for data manipulation/visualization.

**Tips for learning a programming language:**
1. Don't read code line by line (at first)
2. Copy as much as possible (learn patterns and templates)
3. Don't be afraid to play (trial and error)
4. Read documentation!

# `pandas`: Reading and Manipulating Data

## `pandas`

`pandas` provides ways of storing tabular data along with labels on the rows and columns.

To use any `python` library, we must first import it. For `pandas`, we include the line
```python
import pandas as pd
```

After this, we can use any function or object in the library through the `pd` prefix, for example `pd.Series()`.


## `pandas` Series

The `pandas` Series object stores a 1D array of data with an index object (labels).

<img src="fig/fig21.jpg" style="height:300px;">

## `pandas` Series
Each `pandas` series has a `.values` and an `.index` attribute.

<img src="fig/fig22.jpg" style="height:200px;">
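
A minimal sketch of creating a Series and inspecting these two attributes. The values and labels are made up for illustration.

```python
import pandas as pd

# A 1D array of values, each with a label in the index
populations = pd.Series([1.2, 3.4, 0.8], index=["city_a", "city_b", "city_c"])

print(populations.values)     # the underlying array of data
print(populations.index)      # the labels
print(populations["city_b"])  # look up a value by its label
```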

## `pandas` DataFrames

The `pandas` DataFrame object stores a 2D table of data with column and row index objects (labels).

<img src="fig/fig23.jpg" style="height:200px;">

Each column in the data frame is a series.
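
A small sketch of building a DataFrame from a dictionary of columns and confirming that a single column is a Series. The table contents are made up for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column label; the index labels the rows
students = pd.DataFrame(
    {"name": ["Amina", "Jean", "Grace"], "score": [12, 9, 14]},
    index=["s1", "s2", "s3"],
)

print(students)
print(type(students["score"]))  # a single column is a pandas Series
```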

## Reading Data with `pandas`

We can import tabular data in a csv file into a data frame:

<img src="fig/fig24.jpg" style="height:200px;">
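
A self-contained sketch of reading csv data with `pd.read_csv`. To avoid depending on a file, the csv text here is made up and wrapped in `io.StringIO`; in practice you would pass the path to your own file, for example `pd.read_csv("my_data.csv")` (a hypothetical file name).

```python
import io

import pandas as pd

# Made-up csv content: the first line holds the column names
csv_text = """name,age,city
Amina,29,Kigali
Jean,35,Huye
"""

# read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```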

## Basic Data Exploration with `pandas`

We should start by checking how large the data is.

We do this by checking the shape of the DataFrame.

<img src="fig/fig26.jpg" style="height:150px;">

## Basic Data Exploration with `pandas`

Next, we should get a rough sense of what’s in the data.

<img src="fig/fig25.jpg" style="height:250px;">

## Basic Data Exploration with `pandas`

The `.head(N)` method returns the first `N` rows of your data frame (5 rows by default)!

<img src="fig/fig27.jpg" style="height:250px;">

## Basic Data Exploration with `pandas`

The `.describe()` method returns descriptive statistics for each numeric column as a data frame object!

<img src="fig/fig28.jpg" style="height:300px;">
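
The exploration steps from the last few slides can be sketched on a small made-up table (the column names `price` and `room_type` are assumptions for illustration):

```python
import pandas as pd

# Made-up table with one numeric and one text column
df = pd.DataFrame({
    "price": [30, 45, 60, 120, 80, 55],
    "room_type": ["private", "shared", "private", "entire", "entire", "private"],
})

print(df.shape)    # (number of rows, number of columns)
print(df.head(3))  # the first 3 rows

# Count, mean, standard deviation, min, quartiles and max of numeric columns
summary = df.describe()
print(summary)
```

By default, `.describe()` summarizes only the numeric columns, so `room_type` does not appear in `summary`.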

## Basic Data Manipulation with `pandas`

Accessing a column by label:

<img src="fig/fig29.jpg" style="height:200px;">

## Basic Data Manipulation with `pandas`

Accessing columns by label:

<img src="fig/fig30.jpg" style="height:200px;">

## Basic Data Manipulation with `pandas`

Selecting rows by criteria (***filtering***):

<img src="fig/fig31.jpg" style="height:300px;">
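
The three kinds of access from the last few slides can be sketched on a small made-up table. The column names and the threshold of 50 are assumptions for illustration.

```python
import pandas as pd

# Made-up table for illustration
df = pd.DataFrame({
    "price": [30, 45, 60, 120],
    "room_type": ["private", "shared", "private", "entire"],
    "city": ["Berlin", "Berlin", "Berlin", "Berlin"],
})

# One label returns a Series
prices = df["price"]

# A list of labels returns a DataFrame with just those columns
price_and_type = df[["price", "room_type"]]

# A boolean condition keeps only the rows where it is True
expensive = df[df["price"] > 50]
print(expensive)
```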

## Basic Data Transformation with `pandas`

Since machine learning models are mathematical functions (i.e. require numerical input), what do we do with categorical attributes?

We encode them as numerical vectors in which each position corresponds to one possible value of the attribute. This is called ***one-hot encoding***. Why can't we encode them as integers?

<img src="fig/fig39.jpg" style="height:300px;">

## Basic Data Transformation with `pandas`

One-hot encoding in `pandas`.

```python
# convert categorical columns to numerical columns by one-hot encoding
one_hot_categorical = pd.get_dummies(df_categorical_columns, prefix=df_categorical_columns.columns)
```
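
For a version that runs on its own, the sketch below applies the same call to a small made-up table. The contents of `df_categorical_columns` are assumptions for illustration. Note that encoding `room_type` as the integers 0, 1, 2 would suggest an order among the categories that does not exist, which is one reason to prefer one-hot encoding.

```python
import pandas as pd

# Made-up table that contains only categorical columns
df_categorical_columns = pd.DataFrame({
    "room_type": ["private", "shared", "entire", "private"],
    "city": ["Berlin", "Berlin", "Potsdam", "Berlin"],
})

# Each possible value of each column becomes its own indicator column
one_hot_categorical = pd.get_dummies(
    df_categorical_columns, prefix=df_categorical_columns.columns
)
print(one_hot_categorical.columns.tolist())
```

Each row has exactly one active indicator per original column, so the `room_type_*` columns of any row sum to 1.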

## Exercise: Perform EDA on the Berlin Airbnb Dataset

Today we will be working with a real data set: the Berlin Airbnb Dataset.

Find the link to this notebook on the workshop website:
```
https://github.com/onefishy/rwanda_workshop
```
