## Task 1.3: Revisions - Data processing basics in Python

ITU KSADMAL1KU-NLP - Advanced Machine Learning for NLP in KCS 2024

by Stefan Heinrich, Bertram Højer, Christian H. Rasmussen, & material by Kevin Murphy.

All info and static material: https://learnit.itu.dk/course/view.php?id=3024579

-------------------------------------------------------------------------------

In [None]:
# @title #### Import dependencies

from IPython.display import display
import numpy as np
import pandas as pd
import scipy as scp
import scipy.stats  # make sure the stats subpackage is loaded, since scp.stats is used below
import matplotlib.pyplot as plt
import seaborn as sns

#### Load Iris dataset

https://archive.ics.uci.edu/ml/datasets/iris

In [None]:
# the next line downloads the file, in case you run this in Colab
#!wget https://datahub.io/machine-learning/iris/r/iris.csv -O ../data/iris.csv

iris_file = "../data/iris.csv"
iris_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

#### Load iris dataset as numpy ndarray

Task: briefly experiment with different slices and views of the data

https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html


In [None]:
iris_data_np = np.genfromtxt(iris_file, delimiter=',', names=True, 
                        dtype=[float, float, float, float, 'U15'])
# problem: since data types are mixed, 
# the numpy array is not a 2D matrix, but a 1D structured array of tuple-like records
print(iris_data_np.dtype)
print(iris_data_np[:8])

# we need to select all variables of the same type to build a numpy array that is a 2D matrix
iris_data_np_inputs = np.array([d[:-1] for d in iris_data_np.tolist()])
iris_data_np_targets = [d[-1] for d in iris_data_np.tolist()]
print(iris_data_np_inputs.dtype, "\n")
print(iris_data_np_inputs[:8], "\n")
print(iris_data_np_targets[:8], "\n")

# conversion from labels to ids is a bit complicated
iris_classes = sorted(set(iris_data_np_targets))
iris_classes_dict = {iris_classes[i]:i for i in range(len(iris_classes))}
iris_data_np_targets_ids = np.array([iris_classes_dict[t] for t in iris_data_np_targets])
print(iris_data_np_targets_ids[:8])   
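The structured array and the label-to-id conversion above can also be handled with field access and `np.unique`. The following is a minimal sketch on a small hand-made sample: the values are invented for illustration, and the field names `petallength` and `petalwidth` are assumptions about the CSV header.

```python
import numpy as np

# Toy structured array mimicking the mixed-type iris layout
# (values are made up for illustration, not read from iris.csv)
toy = np.array(
    [(5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
     (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
     (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
     (4.9, 3.0, 1.4, 0.2, "Iris-setosa")],
    dtype=[("sepallength", float), ("sepalwidth", float),
           ("petallength", float), ("petalwidth", float), ("class", "U15")],
)

# Accessing one field by name gives a plain 1D array
sepal_length = toy["sepallength"]

# Stack the numeric fields column-wise into a regular 2D float matrix
numeric_fields = [name for name in toy.dtype.names if name != "class"]
inputs = np.column_stack([toy[name] for name in numeric_fields])

# Slicing a 2D matrix: first two rows, last two columns
block = inputs[:2, 2:]

# Boolean masking: keep only rows whose first column exceeds 5.0
long_sepals = inputs[inputs[:, 0] > 5.0]

# np.unique returns the sorted labels; return_inverse gives one integer id per row
classes, target_ids = np.unique(toy["class"], return_inverse=True)
print(classes, target_ids)
```

Compared with the dictionary-based conversion, `np.unique` does the sorting and the lookup in one call, at the cost of being less explicit about the mapping.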


#### Load iris dataset as pandas dataframe

Task: learn + test different properties of dataframes - accessing & indexing, 
conversion, computations & basic statistics

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [None]:
# the pandas dataframe already takes care of mixed data types
iris_data_pd = pd.read_csv(iris_file)

print(iris_data_pd.dtypes, "\n")
print(iris_data_pd[:8], "\n")
print(iris_data_pd['sepallength'][:8])

# pandas dataframe can be displayed nicely with jupyter
display(iris_data_pd)
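As a starting point for the task, the sketch below walks through accessing and indexing, conversion, and basic statistics on a dataframe. It uses a small hand-made dataframe (values invented for illustration) with the same column names as above, so it runs without the CSV file.

```python
import pandas as pd

# Toy dataframe with the column names used above; the values are made up
toy_df = pd.DataFrame({
    "sepallength": [5.1, 7.0, 6.3, 4.9],
    "sepalwidth": [3.5, 3.2, 3.3, 3.0],
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
})

# Accessing & indexing: loc is label-based (end label included), iloc is position-based
first_rows = toy_df.loc[0:1, ["sepallength", "class"]]
first_cell = toy_df.iloc[0, 0]

# Boolean filtering on a column
setosa = toy_df[toy_df["class"] == "Iris-setosa"]

# Conversion: numeric columns to a 2D numpy matrix, labels to integer ids
inputs = toy_df[["sepallength", "sepalwidth"]].to_numpy()
class_ids = toy_df["class"].astype("category").cat.codes

# Computations & basic statistics
summary = toy_df.describe()  # count, mean, std, quartiles of the numeric columns
class_means = toy_df.groupby("class")["sepallength"].mean()
print(summary, "\n")
print(class_means)
```

The same calls should work on `iris_data_pd` once the file is loaded.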


#### Analyse data with scipy

Task: explore the subpackages in scipy to analyse the data

https://docs.scipy.org/doc/scipy/reference/

In [None]:
print(scp.stats.describe(iris_data_np_inputs))
print(scp.stats.skew(iris_data_np_inputs[:,0]))
print(scp.stats.skew(iris_data_np_inputs[:,1]))
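Beyond `describe` and `skew`, `scipy.stats` also offers correlation measures and hypothesis tests. The sketch below shows a few of them on synthetic data: the arrays are random stand-ins generated for illustration, not the iris measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for two correlated measurement columns (not the iris data)
x = rng.normal(loc=5.8, scale=0.8, size=150)
y = 0.5 * x + rng.normal(scale=0.3, size=150)

# Summary statistics: nobs, minmax, mean, variance, skewness, kurtosis
desc = stats.describe(x)

# Pearson correlation coefficient and its two-sided p-value
r, p_value = stats.pearsonr(x, y)

# Welch's t-test: do two synthetic groups differ in their mean?
group_a = rng.normal(loc=5.0, scale=0.35, size=50)
group_b = rng.normal(loc=6.6, scale=0.60, size=50)
t_res = stats.ttest_ind(group_a, group_b, equal_var=False)

print(desc)
print(f"pearson r = {r:.3f}, p = {p_value:.3g}")
print(f"t = {t_res.statistic:.3f}, p = {t_res.pvalue:.3g}")
```

On the real data, `x` and `y` could be replaced by two columns of `iris_data_np_inputs`, and the two groups by the rows of two different classes.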


#### Plot with matplotlib

Task: briefly inspect different options for visualisations with matplotlib, 
create a figure that compares all pairwise combinations of the four variables 
as scatter plots using six subplots

https://matplotlib.org/3.2.2/gallery/index.html

In [None]:
fig_m, ax_m = plt.subplots()
m_colors = ['blue','orange','green']
m_x = iris_data_np_inputs[:,0]
m_y = iris_data_np_inputs[:,1]
m_t = [m_colors[t] for t in iris_data_np_targets_ids]
ax_m.scatter(m_x, m_y, c=m_t, s=15.0)
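# legends are sometimes complicated ("hacky") as well: build one proxy marker per class,
# iterating over m_colors (not set(m_t)) so the colors stay aligned with iris_classes
fake_line = [plt.Line2D([0,0],[0,0],color=color, marker='o', linestyle='') 
             for color in m_colors]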
ax_m.legend(fake_line, iris_classes)
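One possible way to approach the six-subplot task is sketched below. It is not the reference solution: the data are random placeholders, the class names are hypothetical, and the feature names `petallength` and `petalwidth` are assumptions about the CSV header. Plotting each class separately with a `label` avoids the proxy-artist legend used above.

```python
import itertools

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Placeholder data: 30 samples, 4 features, 3 classes (random, not the iris data)
feature_names = ["sepallength", "sepalwidth", "petallength", "petalwidth"]
class_names = ["class A", "class B", "class C"]  # hypothetical labels
colors = ["blue", "orange", "green"]
inputs = rng.normal(size=(30, 4))
target_ids = np.repeat(np.arange(3), 10)

# All unordered pairs of 4 features: 4 choose 2 = 6 combinations
pairs = list(itertools.combinations(range(len(feature_names)), 2))

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, (i, j) in zip(axes.flat, pairs):
    # One scatter call per class, so each class gets its own legend entry
    for class_id, (name, color) in enumerate(zip(class_names, colors)):
        mask = target_ids == class_id
        ax.scatter(inputs[mask, i], inputs[mask, j], c=color, s=15.0, label=name)
    ax.set_xlabel(feature_names[i])
    ax.set_ylabel(feature_names[j])

# A single legend in the first subplot is enough, since all panels share the classes
axes.flat[0].legend()
fig.tight_layout()
```

With the real data, `inputs`, `target_ids` and `class_names` would correspond to `iris_data_np_inputs`, `iris_data_np_targets_ids` and `iris_classes`.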

#### Plot with seaborn

Task: inspect plotting options in seaborn, and get familiar with controlling 
the output graphs by modifying: appearances (colormaps, symbols) and 
information (legends, axis labels)

https://seaborn.pydata.org/examples/index.html

In [None]:
p_x = iris_data_pd['sepallength']
p_y = iris_data_pd['sepalwidth']
p_t = iris_data_pd['class']
ax_m = sns.scatterplot(x=p_x, y=p_y, hue=p_t)
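To illustrate the kind of control the task asks for, the sketch below changes the colormap, the marker symbols, the axis labels and the legend title of a seaborn scatter plot. The dataframe is a small hand-made sample with invented values; the chosen palette, labels and legend title are arbitrary examples, not prescribed settings.

```python
import pandas as pd
import seaborn as sns

# Toy dataframe with the column names used above; the values are made up
toy_df = pd.DataFrame({
    "sepallength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalwidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

# Appearance: hue picks the color per class, style picks the marker symbol,
# palette selects the colormap, s sets the marker size
ax = sns.scatterplot(
    data=toy_df, x="sepallength", y="sepalwidth",
    hue="class", style="class", palette="colorblind", s=40,
)

# Information: axis labels, title and legend title
ax.set_xlabel("sepal length (cm)")
ax.set_ylabel("sepal width (cm)")
ax.set_title("Sepal dimensions by class")
ax.get_legend().set_title("species")
```

Passing `data=` together with column names, instead of separate series, lets seaborn pick up the axis labels and legend entries from the dataframe.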

## Optional Exercises:
- Recapitulate data loading options with numpy and pandas
- Recapitulate plotting options in matplotlib and seaborn