# EXPOSE: Data Analytics Tutorial

### Topics:
* Exploratory data analysis (EDA)
* Data manipulation with Pandas
* Data visualisation with Matplotlib
* Model training with Scikit-learn

### Tutorial TAs:
* Nicholas Russell Saerang (NUS, Y2 DSA)
* Wilson Widyadhana (NUS, Y1 DSA)

---
### Step 0: Importing necessary packages
In this step, we first import the packages essential to what we want to do today.
We will be importing:
* [Scikit-learn](https://scikit-learn.org/) (`sklearn`), a machine learning package that also bundles small example datasets such as the one used here,
* [Pandas](https://pandas.pydata.org/) (`pandas`, usually imported as `pd`), which is often used to manipulate data,
* [Matplotlib](https://matplotlib.org/) (`matplotlib.pyplot`, usually imported as `plt`), used to visualise data,
* `warnings`, which we use here only to keep unnecessary warnings from appearing in our Google Colab notebook.

In [2]:
# Importing necessary packages
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt
import warnings

Next, we replace `warnings.warn` with a function that does nothing, which suppresses warnings (including future warnings) in the notebook. Do not remove this code; just run it.

In [3]:
def warn(*args, **kwargs):
    pass
warnings.warn = warn

---
### Step 1: Loading our dataset
We shall use the iris dataset from the `sklearn` package that we just imported.

The iris dataset records the following for each iris flower:
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
- the species of the flower, which is our dataset label
  * 0 means the species is *Iris setosa*
  * 1 means the species is *Iris versicolor*
  * 2 means the species is *Iris virginica*

However, the labels are stored separately from the four measurements, as shown below.

In [4]:
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.DataFrame(iris.target)

The code above assigns the loaded dataset to a variable `iris`, then puts the measurements and the labels into two separate dataframes, `df` and `target`.
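
To see where `df` and `target` come from, it can help to look inside the loaded object. The following is a small sketch, not part of the original notebook; it relies on the `feature_names` and `target_names` attributes of the object that `load_iris()` returns.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()

# The four measurement names, used as the column names of df
print(iris.feature_names)

# The species names; positions 0, 1, 2 match the integer labels in iris.target
print(iris.target_names.tolist())  # ['setosa', 'versicolor', 'virginica']

df = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.DataFrame(iris.target)

# target has a single column (named 0 by default) and one row per flower
print(df.shape, target.shape)  # (150, 4) (150, 1)
```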

Now let us extract the first five rows of the dataframe, `df`, using the `.head()` method.

In [10]:
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


The purpose of this is to get a glimpse of the dataset without showing all the rows (there can be a lot of them!). If you want to show more or fewer rows, you can specify the number inside the brackets like this.

In [14]:
# Showing the first 8 rows
df.head(8)

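| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |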


Next, to find the number of rows and columns a dataframe has, use the `shape` attribute as shown below.

In [15]:
df.shape

(150, 4)

This means the dataframe has 150 rows and 4 columns.

---
### Step 2: Querying the dataframe
Given our current dataframe `df`, we can try to perform queries and different data manipulations on it.

Extract the column corresponding to the sepal length. This will be used in a later step.
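
One possible way to do this is to select the column by its name, as in the sketch below. The variable name `sepal_length` is our own choice for illustration.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Selecting a single column by its name returns a pandas Series
sepal_length = df["sepal length (cm)"]
print(sepal_length.head())
```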

How many flowers in the dataset have a sepal length of less than 5.6cm?
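
A minimal sketch of one approach: compare the column with the threshold to get a True/False mask, then count the True values. The names `is_short` and `count` are hypothetical.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Comparing a column with a number gives a Series of True/False values
is_short = df["sepal length (cm)"] < 5.6

# True counts as 1, so summing the mask counts the matching rows
count = int(is_short.sum())
print(count)
```

Filtering with `df[is_short]` and taking `len(...)` of the result should give the same number.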

How many flowers have a sepal length between 5 and 6 cm **AND** a petal length between 1.4 and 1.6 cm?
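
Two conditions can be combined with the `&` operator. The sketch below assumes that both ranges include their endpoints, which is the default behaviour of `Series.between`; if the endpoints should be excluded, the comparison needs to be adjusted. The variable names are hypothetical.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# between() is True where the value lies in the range, endpoints included
sepal_ok = df["sepal length (cm)"].between(5, 6)
petal_ok = df["petal length (cm)"].between(1.4, 1.6)

# & combines the two masks element-wise: a row is kept only if both are True
both = df[sepal_ok & petal_ok]
print(len(both))
```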

How can we obtain the first column with its values converted to **millimetres**?
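
Since 1 cm is 10 mm, multiplying the column by 10 does the conversion. This is a small sketch; the name `sepal_length_mm` is our own.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# iloc[:, 0] picks the first column by position (sepal length in cm).
# Arithmetic on a Series applies to every value; df itself is not changed.
sepal_length_mm = df.iloc[:, 0] * 10
print(sepal_length_mm.head())
```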

Join the `target` dataframe into `df`.
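
One way to do this, sketched below, is `DataFrame.join`, which lines the two dataframes up by their row index. `pd.concat([df, target], axis=1)` would be another option.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.DataFrame(iris.target)

# join() matches rows by index and appends target's single column (named 0)
df = df.join(target)
print(df.shape)  # (150, 5)
```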

Rename the columns into "sepal_length", "sepal_width", "petal_length", "petal_width", and "label".
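
A minimal sketch of the renaming step, assuming `target` has already been joined into `df` so that there are five columns in the order listed above:

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.DataFrame(iris.target)
df = df.join(target)

# Assigning a list to df.columns renames all columns by position,
# so the list must have exactly one name per existing column
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "label"]
print(df.head())
```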

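The first five rows of our final processed dataset should look like this:

| | sepal_length | sepal_width | petal_length | petal_width | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |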

---
### Step 3: Data visualisation
Data visualisation enables us to see patterns that are otherwise difficult to observe from just raw data, which is critical for drawing insights and conclusions from the data.

There are several types of visualisations that we can do, including:
* Scatter plots
* Bar charts
* Box plots
* Histograms
* Line plots

In [None]:
# TODO
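
As one possible starting point for the cell above, here is a minimal scatter plot sketch. The choice of the two petal columns and of colouring the points by `iris.target` is an assumption for illustration, not something the tutorial prescribes.

```python
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

fig, ax = plt.subplots()

# One point per flower, coloured by its integer species label (0, 1 or 2)
points = ax.scatter(df["petal length (cm)"], df["petal width (cm)"], c=iris.target)

ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.set_title("Iris petal measurements by species")
plt.show()
```

The other plot types listed above have similar methods on the axes object, such as `ax.hist`, `ax.boxplot`, `ax.bar` and `ax.plot`.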
