# EXPOSE: Data Analytics Tutorial

### Topics:
* Exploratory data analysis (EDA)
* Data manipulation with Pandas
* Data visualisation with Matplotlib
* Model training with Scikit-learn

### Tutorial TAs:
* Nicholas Russell Saerang (NUS, Y2 DSA)
* Wilson Widyadhana (NUS, Y1 DSA)

---
### Step 0: Importing necessary packages
In this step, we import the packages we need for today's session.
We will be importing:
* [Scikit-learn](https://scikit-learn.org/) (`sklearn`), a machine learning package that also ships a few small example datasets,
* [Pandas](https://pandas.pydata.org/) (`pandas`, usually imported as `pd`), which is used to load and manipulate tabular data,
* [NumPy](https://numpy.org/) (`numpy`, usually imported as `np`), the array package that Pandas builds on,
* [Matplotlib](https://matplotlib.org/) (`matplotlib.pyplot`, usually imported as `plt`), which is used to visualise data,
* `warnings`, which we only use to keep unnecessary warnings out of our Google Colab notebook.

In [None]:
# Import the necessary packages
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

# Suppress warnings so they do not clutter the notebook output
warnings.filterwarnings('ignore')

---
### Step 1: Loading our dataset
We shall use the iris dataset that ships with the `sklearn` package we just imported.

The iris dataset describes iris flowers, each with the following information:
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
- the species of the flower - our dataset label
  * 0 means the species is *Iris setosa*
  * 1 means the species is *Iris versicolour*
  * 2 means the species is *Iris virginica*

However, the labels are stored separately from the four measurements, so we have to join them ourselves as shown below.

In [None]:
# Get dataframe and label
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.DataFrame(iris.target)

# Join both

# TODO
df = df.join(target)

# Rename columns
mapper = {
    'sepal length (cm)': 'sepal_length',
    'sepal width (cm)': 'sepal_width',
    'petal length (cm)': 'petal_length',
    'petal width (cm)': 'petal_width',
    0: 'label'
}
df = df.rename(mapper, axis='columns')

The code above assigns the loaded dataset to a variable `iris`, then puts the measurements and the labels into two separate dataframes, `df` and `target`. It then joins `target` onto `df` so that everything lives in one dataframe. Finally, the columns are renamed to tidier names that are easier to type.
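
As a side note, here is a shorter way to get the same kind of dataframe. This is only a sketch, and it assumes your scikit-learn version is recent enough to support the `as_frame=True` argument of `load_iris`. The name `iris_frame` is made up for this example.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to hand back pandas objects;
# .frame holds the four measurements plus a 'target' column
iris_frame = datasets.load_iris(as_frame=True).frame

# Rename all five columns in one go (the last one is the label)
iris_frame.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'label']

print(iris_frame.shape)
print(iris_frame.columns.tolist())
```

The join in the cell above is still worth knowing, because most datasets you meet will not come with such a shortcut.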

Now let us extract the first five rows of the dataframe, `df`, using the `.head()` method.

In [None]:
# TODO
df.head()

The purpose of this is to get a glimpse of the dataset without showing all the rows (there can be a lot of them!). If you want to show more or fewer rows, put a number inside the brackets (so you'll type something like `df.head(<number>)`).

In [None]:
# Showing the first 8 rows

# TODO
df.head(8)

Next, to find the number of rows and columns a dataframe has, use the `shape` attribute as shown below.

In [None]:
# TODO
df.shape

This means the dataframe has 150 rows and 5 columns: the four measurements plus the label.
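
Since `shape` is just a tuple, you can unpack it into two variables. The small sketch below rebuilds the dataframe first so that it runs on its own; the names `n_rows` and `n_cols` are made up for this example.

```python
from sklearn import datasets
import pandas as pd

# Rebuild the tutorial's dataframe so this cell also runs on its own
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['label'] = iris.target

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() of a dataframe is its number of rows
print(len(df) == n_rows)
```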

---
### Step 2: Querying the dataframe
Given our current dataframe `df`, we can try to perform queries and different data manipulations on it.

Extract the column corresponding to the sepal length. This will be used in a later step.

In [None]:
# Indexing the column name directly
df['sepal_length']

In [None]:
# Using iloc, result is the same as above
df.iloc[:, 0]

How many flowers in the dataset have a sepal length of less than 5.6 cm?

In [None]:
df[df['sepal_length'] < 5.6].shape[0]

How many flowers have a sepal length between 5 and 6 cm **AND** a petal length between 1.4 and 1.6 cm (both ranges inclusive)?

In [None]:
# Wrap each condition in brackets, because & binds more tightly than comparisons like >= and <=.
# For OR, use | instead of &.
df[
    (df['sepal_length'] >= 5) &
    (df['sepal_length'] <= 6) &
    (df['petal_length'] >= 1.4) &
    (df['petal_length'] <= 1.6)
].shape[0]
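
To make the comment about `|` concrete, here is a small sketch of an OR query, followed by a shorter way to write the range query above using the `between` method of a pandas column. The thresholds in the OR query (5 cm and 3.5 cm) are picked only for illustration, and the names `either` and `in_range` are made up for this example.

```python
from sklearn import datasets
import pandas as pd

# Rebuild the tutorial's dataframe so this cell also runs on its own
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['label'] = iris.target

# OR: keep a flower if at least one of the two conditions holds
either = df[(df['sepal_length'] < 5) | (df['sepal_width'] > 3.5)]

# between() includes both end points by default, like the >= / <= chain
in_range = df[df['sepal_length'].between(5, 6) & df['petal_length'].between(1.4, 1.6)]

print(either.shape[0], in_range.shape[0])
```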

How can we obtain the first column with its values in **millimetres**? (The current values are in centimetres.)

In [None]:
def cm_to_mm(x):
    # 1 cm = 10 mm
    return 10*x

# apply() calls the function on every value in the column
df.iloc[:, 0].apply(cm_to_mm)
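
For simple arithmetic like this, you can also multiply the whole column directly, without writing a function. The sketch below rebuilds the dataframe so that it runs on its own; the name `sepal_length_mm` is made up for this example.

```python
from sklearn import datasets
import pandas as pd

# Rebuild the tutorial's dataframe so this cell also runs on its own
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['label'] = iris.target

# Arithmetic on a column is applied to every value at once
sepal_length_mm = df['sepal_length'] * 10
print(sepal_length_mm.head())
```

`apply` is still the tool to reach for when the conversion needs real logic, such as the `if`/`else` in the next cell.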

In a similar way, we can also **add another column** for the flower's species based on the label.

In [None]:
def to_species(label):
    if label == 0:
        return 'Iris setosa'
    elif label == 1:
        return 'Iris versicolour'
    else:
        return 'Iris virginica'

df['species'] = df['label'].apply(to_species)
df.head()
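
When the conversion is just a lookup, a dictionary together with `map` may be a shorter option. This is a sketch rather than a replacement for the cell above; the name `species_names` is made up for this example.

```python
from sklearn import datasets
import pandas as pd

# Rebuild the tutorial's dataframe so this cell also runs on its own
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['label'] = iris.target

# Lookup table from label to species name
species_names = {0: 'Iris setosa', 1: 'Iris versicolour', 2: 'Iris virginica'}

# map() replaces each label with its entry in the dictionary
df['species'] = df['label'].map(species_names)
print(df['species'].value_counts())
```

One difference to keep in mind: a label that is missing from the dictionary becomes `NaN`, whereas the `else` branch in `to_species` would call it *Iris virginica*.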

#### Optional: Saving to a file
Of course, there is a way to save the dataframe to a file so you can work with it another time. Below are examples for CSV and Excel files (note that `to_excel` needs an Excel writer package such as `openpyxl` to be installed).

In [None]:
df.to_csv('iris.csv', index=False)

In [None]:
df.to_excel('iris.xlsx', index=False)
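
To load a saved file back in later, `pd.read_csv` does the reverse of `to_csv`. The sketch below writes to a temporary folder so that it does not clutter your working directory; the file name `iris_roundtrip.csv` and the name `restored` are made up for this example.

```python
import os
import tempfile

from sklearn import datasets
import pandas as pd

# Rebuild the tutorial's dataframe so this cell also runs on its own
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['label'] = iris.target

# Hypothetical file name inside the system's temporary folder
csv_path = os.path.join(tempfile.gettempdir(), 'iris_roundtrip.csv')

# Save without the index, then read the file back into a new dataframe
df.to_csv(csv_path, index=False)
restored = pd.read_csv(csv_path)

print(restored.shape)
print(restored.columns.tolist())
```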

---
### Step 3: Data visualisation
Data visualisation enables us to see patterns that are otherwise difficult to observe from just raw data, which is critical for drawing insights and conclusions from the data.

There are several types of visualisations that we can do, including:
* Scatter plots
* Bar charts
* Box plots
* Histograms
* Line plots

We will try to plot each of them using `matplotlib.pyplot`, which we have aliased to `plt`.

**(SANITY CHECK)** Before we start, make sure your dataframe `df` is the one we modified in the previous steps.

In [None]:
df.head()

#### Scatter plot
Let's plot the sepal length (y) against the sepal width (x) across all flowers.

In [None]:
fig = plt.scatter(y=df.sepal_length, x=df.sepal_width)

In [None]:
# Modify to have colours and labels
for label in range(3):
    plt.scatter(y=df.sepal_length[df.label==label], x=df.sepal_width[df.label==label], label=to_species(label))
plt.legend()
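
A plot is much easier to read once the axes are named. Here is a sketch of the same coloured scatter plot with axis labels and a title added. It uses the species names stored in `iris.target_names` instead of `to_species` so that it can run on its own, and the label and title texts are our own wording.

```python
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

# Rebuild the tutorial's dataframe so this cell also runs on its own
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['label'] = iris.target

# One scatter call per species, so each gets its own colour and legend entry
for label in range(3):
    mask = df['label'] == label
    plt.scatter(x=df.loc[mask, 'sepal_width'], y=df.loc[mask, 'sepal_length'],
                label=iris.target_names[label])

# Name the axes and the plot
plt.xlabel('Sepal width (cm)')
plt.ylabel('Sepal length (cm)')
plt.title('Sepal length against sepal width')
plt.legend(title='Species')
```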

#### Bar charts
For each species, plot a bar chart of the number of flowers with sepals of width more than 3 cm.

In [None]:
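def count_wide_sepals(label):
    # Number of flowers of this label whose sepals are wider than 3 cm
    return df[(df.sepal_width > 3) & (df.label == label)].shape[0]

# One bar per species, rather than one (overlapping) bar per row
labels = [0, 1, 2]
fig = plt.bar(x=[to_species(l) for l in labels], height=[count_wide_sepals(l) for l in labels])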

Looks cool! It seems like most of the flowers with sepals wider than 3 cm are *Iris setosa*.
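
If you want the numbers behind the bars, a `groupby` can count the flowers per label in one line. This is a sketch that rebuilds the dataframe so it runs on its own; the names `wide`, `counts` and `share` are made up for this example.

```python
from sklearn import datasets
import pandas as pd

# Rebuild the tutorial's dataframe so this cell also runs on its own
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['label'] = iris.target

# Keep only the flowers with sepals wider than 3 cm, then count them per label
wide = df[df['sepal_width'] > 3]
counts = wide.groupby('label').size()

# Swap the numeric labels for the species names stored in the dataset
counts.index = [iris.target_names[i] for i in counts.index]

# Fraction of all wide-sepal flowers that belongs to each species
share = counts / counts.sum()
print(counts)
print(share.round(2))
```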

#### Box plots
Let's find the distribution of the sepal widths of the flowers.

In [None]:
fig = plt.boxplot(x=df.sepal_width)

In [None]:
# Orange line = median. What about the mean? showmeans=True marks it with a green triangle
fig = plt.boxplot(x=df.sepal_width, showmeans=True)

From the boxplot, we can see that the median of the sepal widths is very close to 3.0 cm, and we have some outlier widths as well :)
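
You can check what the boxplot shows with a few lines of Pandas. The sketch below assumes Matplotlib's default setting, where a point counts as an outlier if it lies more than 1.5 times the interquartile range (IQR) beyond the box. The variable names are made up for this example.

```python
from sklearn import datasets
import pandas as pd

# Rebuild the tutorial's dataframe so this cell also runs on its own
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['label'] = iris.target

widths = df['sepal_width']

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = widths.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as separate outlier dots
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = widths[(widths < lower_fence) | (widths > upper_fence)]

print(f"median = {median:.2f} cm, mean = {widths.mean():.2f} cm")
print(sorted(outliers.tolist()))
```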

#### Histograms
Let's plot the histogram of the petal lengths and see which length interval has the most flowers.

In [None]:
fig = plt.hist(x=df.petal_length)

In [None]:
# Fix the number of bins to 5
fig = plt.hist(x=df.petal_length, bins=5)

It looks like we have two separate groups of petal lengths, one ranging from 1-2 cm and one from 3-7 cm. It may be related to the species of the flowers, who knows? (Explore more to find out for yourself!)
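
One possible way to start that exploration is to look at the smallest and largest petal length within each species. This is only a sketch; the name `ranges` is made up for this example.

```python
from sklearn import datasets
import pandas as pd

# Rebuild the tutorial's dataframe so this cell also runs on its own
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['label'] = iris.target

# Smallest and largest petal length for each label
ranges = df.groupby('label')['petal_length'].agg(['min', 'max'])

# Show species names instead of 0, 1, 2
ranges.index = iris.target_names
print(ranges)
```

Compare the three rows with the gap in the histogram and see which species sits on which side of it.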

#### Line plots
Lastly, we have the line plot.
With the iris dataset, we can draw one line for each of several columns, with the row number on the x-axis.

In [None]:
# x-axis: row numbers, one per flower
row_numbers = range(len(df))
# Columns 2 and 3 are petal_length and petal_width
for col_idx in [2, 3]:
    plt.plot(row_numbers, df.iloc[:, col_idx], label=df.columns[col_idx])
plt.legend()

Interestingly, if you take a look at the dataset again, the first 50 rows are for label 0, next 50 rows for label 1, and next 50 rows for label 2. We can clearly see a distinction between each label by looking at the line plot!
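
You don't have to take our word for the row ordering. Here is a small sketch that checks it directly; the name `block_labels` is made up for this example.

```python
from sklearn import datasets
import pandas as pd

# Rebuild the tutorial's dataframe so this cell also runs on its own
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['label'] = iris.target

# True if the labels never decrease as we go down the rows
print(df['label'].is_monotonic_increasing)

# Which labels appear in rows 0-49, 50-99 and 100-149?
block_labels = [df['label'].iloc[start:start + 50].unique().tolist() for start in (0, 50, 100)]
print(block_labels)
```

Keep this ordering in mind if you ever split the data by row position: taking the first 100 rows would leave out label 2 entirely.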

#### BONUS: Case study
Suppose you want to tell the iris species (labels) apart using only their petal and sepal lengths and widths. Here are some alternatives you can try.

Alternative 1: Use petal length and width. Label 0 is separated very well, but labels 1 and 2 overlap slightly, so we can't always tell them apart.

In [None]:
for label in range(3):
    plt.scatter(y=df.petal_length[df.label==label], x=df.petal_width[df.label==label], label=to_species(label))
plt.legend()

Alternative 2: Use sepal length and petal length. Again, only those with label 0 can be obviously identified.

In [None]:
for label in range(3):
    plt.scatter(y=df.sepal_length[df.label==label], x=df.petal_length[df.label==label], label=to_species(label))
plt.legend()

Alternative 3: Use machine learning?

This is a much more complex topic, so you don't have to worry about the complex code yet.

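But let's see if this complex code can give a good solution too :) It uses linear discriminant analysis (LDA) to squeeze the four measurements into two new axes chosen to separate the labels as well as possible.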

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# The first four columns are the measurements
features = df.iloc[:, :4]

# Fit LDA on the labelled data, then project every flower onto the 2 discriminant axes
lda = LDA(n_components=2)
fit_result = lda.fit(features, df.label).transform(features)
for label in range(3):
    plt.scatter(y=fit_result[df.label==label, 1], x=fit_result[df.label==label, 0], label=to_species(label))
plt.legend()
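
The plot shows how well the labels separate, but the same model can also be used to predict labels. Below is a minimal sketch of the usual train-and-test routine in Scikit-learn. The 70/30 split and `random_state=0` are our own choices for illustration, so your exact score may differ if you change them.

```python
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Measurements (X) and labels (y) straight from the dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold back 30% of the flowers for testing; stratify keeps the species balanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Train on the training flowers only
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Fraction of held-back flowers whose species is predicted correctly
accuracy = lda.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Testing on flowers the model has never seen gives a fairer picture than scoring it on the data it was trained on.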

---
### Step 4: Wrapping up
You have learned the very basics of data manipulation and data visualisation. Give yourself a pat on the back!
If you are interested in seeing more of Matplotlib and Pandas, take a look at https://matplotlib.org/ and https://pandas.pydata.org/ directly!

Matplotlib's website also shows a variety of ways to improve and customise your plots and make them look more elegant, for example by adding plot titles and legends or choosing your own colours. What you saw here are just the basic customisations.

That should be it for today. Hope you enjoyed it, and see you again soon!