# Lecture 1 Introduction to Data Science

# What is Data Science?

Three closely related concepts:
- Data Science
- Artificial Intelligence 
- Machine Learning

[Battle of the Data Science Venn Diagrams ](https://www.kdnuggets.com/2016/10/battle-data-science-venn-diagrams.html)

The original Venn diagram from Drew Conway:

<div>
<img src="./img/Data_Science_VD.png" width="300">
</div>

Another diagram from Steven Geringer:

<div>
<img src="./img/moz-screenshot-3-729576.png" width="400">
</div>

Another version:

<div>
<img src="./img/1_-XKVI5SAEpffNR7BusdvNQ.png" width="300">
</div>

Perhaps the reality should be:
<div>
<img src="./img/DataScienceDisciplines.png" width="400">
</div>
<div>
<img src="./img/tumblr_m74i4eR9Ym1qa0uujo1_1280.jpg" width="300">
</div>

[David Robinson's Auto-pilot example](http://varianceexplained.org/r/ds-ml-ai/):
- machine learning: **predict** whether there is a stop sign in the camera
- artificial intelligence: design the **action** of applying brakes (either by rules or from data)
- data science: provide the **insights** why the system does not work well after sunrise

**Peijie's Definition**:
Data Science is the science 
- *of* the data -- what
- *by* the data -- how
- *for* the data -- why

# Mathematics of Data 

### Representation of Data

What data do we have, and how do we relate it to mathematical objects?

#### **Tabular Data**

In [None]:
import pandas as pd
import numpy as np
df_house = pd.read_csv('./data/kc_house_data.csv')
print(df_house.shape)   
df_house.head()

- A structured data table, with $n$ observations and $p$ variables.
- **Mathematical representation**: The data *matrix* $X\in\mathbb{R}^{n\times p}$. In notation, we write
<center>
$X=\left(
 \begin{matrix}
   \mathbf{x}^{(1)}\\
   \mathbf{x}^{(2)} \\
   \vdots \\
   \mathbf{x}^{(n)}
  \end{matrix} 
\right)
$, where the $i$-th row vector represents the $i$-th observation, $\mathbf{x}^{(i)}=(x_{1}^{(i)},\dots,x_{p}^{(i)})\in\mathbb{R}^{p}$.</center>
    
- [Example: Precision Medicine and Single-cell Sequencing.](https://learn.gencore.bio.nyu.edu/single-cell-rnaseq/)
<div>
<img src="./img/scRNA-overview.jpg" width="400">
</div>

- *Roughly speaking*, big data -- large $n$, high-dimensional data -- large $p$.
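
As a minimal sketch of the data-matrix view, the snippet below turns a small table into a NumPy array and reads off $n$, $p$, and one row vector. The table `df_toy` and its values are made up for illustration and are not taken from `kc_house_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: n = 4 observations (rows), p = 3 variables (columns).
df_toy = pd.DataFrame({
    "sqft": [1180.0, 2570.0, 770.0, 1960.0],
    "bedrooms": [3.0, 3.0, 2.0, 4.0],
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
})

X = df_toy.to_numpy()      # data matrix X in R^{n x p}
n, p = X.shape
x_2 = X[1]                 # row vector x^{(2)} (Python indexing starts at 0)

print(n, p)
print(x_2)
```

Each row of `X` is one observation and each column is one variable, which matches the notation $\mathbf{x}^{(i)}\in\mathbb{R}^{p}$ used above.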

#### **Time-series Data**

In [None]:
import matplotlib.pyplot as plt
ts_tesla = pd.read_csv('./data/Tesla.csv')
print(ts_tesla.head())

ts_tesla['Date'] = pd.to_datetime(ts_tesla['Date'])
ts_tesla.set_index('Date',inplace=True)

# Suppose we only focus on the time-series of close price
plt.figure(dpi=80)
plt.title('Close Price History')
plt.plot(ts_tesla['Close'], color='red')
plt.xlabel('Date', fontsize=18)
plt.ylabel('Close Price USD', fontsize = 18)
plt.show()
# this is only about tesla -- we can also have the time-series of apple,amazon,facebook...

- Simple case: $N$ one-dimensional trajectories with each sampled at $T$ time points.
- **Mathematical representation I**: We can still use the data *matrix* $X\in\mathbb{R}^{N\times T}$. In notation, we write
<center>
$X=\left(
 \begin{matrix}
   \mathbf{x}^{(1)}\\
   \mathbf{x}^{(2)} \\
   \vdots \\
   \mathbf{x}^{(N)}
  \end{matrix} 
\right)
$, where the $i$-th row vector represents the $i$-th trajectory, $\mathbf{x}^{(i)}=(x_{1}^{(i)},\dots,x_{T}^{(i)})\in\mathbb{R}^{T}$.
</center>
- Question: How does this differ from tabular data?
- **Mathematical representation II**: Each trajectory is a *function* of time $t$. The whole dataset can be represented as $z=f(\omega,t)$, where $\omega$ labels the sample and $t$ the time. In probability theory, this is called a *stochastic process*.
    - For fixed $\omega$, we have a trajectory, which is a function of time. 
    - For fixed $t$, we obtain an ensemble of values across samples, i.e., draws from a random distribution. 
- Question: How about $N$ $d$-dimensional trajectories with each sampled at $T$ time points?
- [Example: Electroencephalography (EEG) data and Parkinson's disease](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3858815/).
<div>
<img src="./img/3-Figure1-1.png" width="600">
</div>
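
The two representations of time-series data can be sketched with simulated random walks. This is only an illustration: the walks, the sizes `N`, `T`, `d`, and the $N\times T\times d$ layout for multi-dimensional trajectories are assumptions, and other layouts are equally valid.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d = 5, 100, 3

# Representation I: N one-dimensional trajectories stacked as an N x T matrix.
steps = rng.normal(size=(N, T))
X = np.cumsum(steps, axis=1)      # each row is one simulated random walk

# Representation II: z = f(omega, t).
trajectory = X[0, :]   # fixed omega (sample 0): a function of time, length T
ensemble = X[:, 10]    # fixed t (time index 10): one value per sample, length N

# d-dimensional trajectories: one possible layout is an N x T x d tensor.
X_multi = np.cumsum(rng.normal(size=(N, T, d)), axis=1)

print(X.shape, trajectory.shape, ensemble.shape, X_multi.shape)
```

Unlike a generic tabular matrix, the columns here are ordered in time, so shuffling them would destroy the structure of the data.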

#### **Images**
Example: [MNIST handwritten digits data](http://yann.lecun.com/exdb/mnist/): Each image is a 28x28 matrix.

In [5]:
import pandas as pd
mnist = pd.read_csv('./data/train.csv') # stored as data table
mnist.sample(5)

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,pixel11,pixel12,pixel13,pixel14,pixel15,pixel16,pixel17,pixel18,pixel19,pixel20,pixel21,pixel22,pixel23,pixel24,pixel25,pixel26,pixel27,pixel28,pixel29,pixel30,pixel31,pixel32,pixel33,pixel34,pixel35,pixel36,pixel37,pixel38,pixel39,pixel40,pixel41,pixel42,pixel43,pixel44,pixel45,pixel46,pixel47,pixel48,pixel49,pixel50,pixel51,pixel52,pixel53,pixel54,pixel55,pixel56,pixel57,pixel58,pixel59,pixel60,pixel61,pixel62,pixel63,pixel64,pixel65,pixel66,pixel67,pixel68,pixel69,pixel70,pixel71,pixel72,pixel73,pixel74,pixel75,pixel76,pixel77,pixel78,pixel79,pixel80,pixel81,pixel82,pixel83,pixel84,pixel85,pixel86,pixel87,pixel88,pixel89,pixel90,pixel91,pixel92,pixel93,pixel94,pixel95,pixel96,pixel97,pixel98,pixel99,pixel100,pixel101,pixel102,pixel103,pixel104,pixel105,pixel106,pixel107,pixel108,pixel109,pixel110,pixel111,pixel112,pixel113,pixel114,pixel115,pixel116,pixel117,pixel118,pixel119,pixel120,pixel121,pixel122,pixel123,pixel124,pixel125,pixel126,pixel127,pixel128,pixel129,pixel130,pixel131,pixel132,pixel133,pixel134,pixel135,pixel136,pixel137,pixel138,pixel139,pixel140,pixel141,pixel142,pixel143,pixel144,pixel145,pixel146,pixel147,pixel148,pixel149,pixel150,pixel151,pixel152,pixel153,pixel154,pixel155,pixel156,pixel157,pixel158,pixel159,pixel160,pixel161,pixel162,pixel163,pixel164,pixel165,pixel166,pixel167,pixel168,pixel169,pixel170,pixel171,pixel172,pixel173,pixel174,pixel175,pixel176,pixel177,pixel178,pixel179,pixel180,pixel181,pixel182,pixel183,pixel184,pixel185,pixel186,pixel187,pixel188,pixel189,pixel190,pixel191,pixel192,pixel193,pixel194,pixel195,pixel196,pixel197,pixel198,pixel199,pixel200,pixel201,pixel202,pixel203,pixel204,pixel205,pixel206,pixel207,pixel208,pixel209,pixel210,pixel211,pixel212,pixel213,pixel214,pixel215,pixel216,pixel217,pixel218,pixel219,pixel220,pixel221,pixel222,pixel223,pixel224,pixel225,pixel226,pixel227,pixel228,pixel229,pixel230,pixel231,pixel232,pixel233,pixel234,pixel235,pixel236,pixel237,pixel238,pixel239,pixel240,pixel241,pixel242,pixel243,pixel244,pixel245,pixel246,pixel247,pixel248,...,pixel534,pixel535,pixel536,pixel537,pixel538,pixel539,pixel540,pixel541,pixel542,pixel543,pixel544,pixel545,pixel546,pixel547,pixel548,pixel549,pixel550,pixel551,pixel552,pixel553,pixel554,pixel555,pixel556,pixel557,pixel558,pixel559,pixel560,pixel561,pixel562,pixel563,pixel564,pixel565,pixel566,pixel567,pixel568,pixel569,pixel570,pixel571,pixel572,pixel573,pixel574,pixel575,pixel576,pixel577,pixel578,pixel579,pixel580,pixel581,pixel582,pixel583,pixel584,pixel585,pixel586,pixel587,pixel588,pixel589,pixel590,pixel591,pixel592,pixel593,pixel594,pixel595,pixel596,pixel597,pixel598,pixel599,pixel600,pixel601,pixel602,pixel603,pixel604,pixel605,pixel606,pixel607,pixel608,pixel609,pixel610,pixel611,pixel612,pixel613,pixel614,pixel615,pixel616,pixel617,pixel618,pixel619,pixel620,pixel621,pixel622,pixel623,pixel624,pixel625,pixel626,pixel627,pixel628,pixel629,pixel630,pixel631,pixel632,pixel633,pixel634,pixel635,pixel636,pixel637,pixel638,pixel639,pixel640,pixel641,pixel642,pixel643,pixel644,pixel645,pixel646,pixel647,pixel648,pixel649,pixel650,pixel651,pixel652,pixel653,pixel654,pixel655,pixel656,pixel657,pixel658,pixel659,pixel660,pixel661,pixel662,pixel663,pixel664,pixel665,pixel666,pixel667,pixel668,pixel669,pixel670,pixel671,pixel672,pixel673,pixel674,pixel675,pixel676,pixel677,pixel678,pixel679,pixel680,pixel681,pixel682,pixel683,pixel684,pixel685,pixel686,pixel687,pixel688,pixel689,pixel690,pixel691,pixel692,pixel693,pixel694,pixel695,pixel696,pixel697,pixel698,pixel699,pixel700,pixel701,pixel702,pixel703,pixel704,pixel705,pixel706,pixel707,pixel708,pixel709,pixel710,pixel711,pixel712,pixel713,pixel714,pixel715,pixel716,pixel717,pixel718,pixel719,pixel720,pixel721,pixel722,pixel723,pixel724,pixel725,pixel726,pixel727,pixel728,pixel729,pixel730,pixel731,pixel732,pixel733,pixel734,pixel735,pixel736,pixel737,pixel738,pixel739,pixel740,pixel741,pixel742,pixel743,pixel744,pixel745,pixel746,pixel747,pixel748,pixel749,pixel750,pixel751,pixel752,pixel753,pixel754,pixel755,pixel756,pixel757,pixel758,pixel759,pixel760,pixel761,pixel762,pixel763,pixel764,pixel765,pixel766,pixel767,pixel768,pixel769,pixel770,pixel771,pixel772,pixel773,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
10926,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,54,54,54,54,54,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,190,228,253,253,253,253,253,217,158,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,253,253,253,253,253,253,253,226,27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,253,253,174,184,159,147,206,253,181,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,67,246,253,253,243,188,188,188,188,188,188,111,38,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,47,53,53,44,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26965,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,97,155,154,221,255,215,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,26,186,230,202,143,60,44,44,44,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,77,194,243,151,37,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,198,251,200,35,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,180,243,53,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,40,253,104,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,104,56,0,0,3,199,104,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,46,205,44,0,0,194,139,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,160,249,172,145,228,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,117,253,253,216,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
38920,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,26,133,218,240,155,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,28,132,246,254,254,254,254,228,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,161,254,208,116,33,27,86,246,39,0,14,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,107,246,158,27,0,0,0,16,129,36,0,219,149,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,126,197,14,0,0,0,0,0,0,0,8,247,250,32,0,0,...,0,0,0,0,0,0,0,0,194,196,7,0,0,36,226,197,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,247,111,0,0,0,0,107,252,62,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,36,254,76,0,0,0,15,119,254,103,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,210,231,163,128,193,225,254,231,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,141,226,254,254,175,154,17,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10225,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,18,106,192,144,144,62,34,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,210,253,253,253,253,253,253,178,100,12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,151,253,253,253,253,253,253,254,253,218,96,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,15,121,73,25,121,203,231,254,253,253,243,56,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,253,253,170,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,61,253,253,131,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,108,12,12,12,12,12,146,243,253,206,21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,165,253,253,253,253,255,253,253,253,224,45,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,56,220,253,253,253,253,254,253,253,203,23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,19,187,253,253,253,254,224,143,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
14465,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,254,232,47,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,144,253,254,218,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,234,253,254,203,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,88,251,253,254,121,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,175,253,253,231,24,0,0,0,0,...,0,0,0,0,0,0,158,255,254,254,214,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,41,244,254,253,253,213,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,204,253,254,253,253,116,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,43,244,253,254,250,116,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,18,238,253,231,106,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
mnist.shape

In [None]:
target = mnist['label']
mnist = mnist.drop("label",axis=1)

import matplotlib.pyplot as plt
plt.figure(dpi=100)
for i in range(0,70): #plot the first 70 images
    plt.subplot(7,10,i+1)
    grid_data = mnist.iloc[i,:].to_numpy().reshape(28,28)  # reshape from 1d to 2d pixel array
    plt.imshow(grid_data,cmap='gray_r', vmin=0, vmax=255)
    plt.xticks([])
    plt.yticks([])
plt.tight_layout()

- Simple case: $N$ grayscale images with $m\times n$ pixels each.
- **Mathematical Representation I**: Each image can be represented by a matrix $I\in\mathbb{R}^{m\times n}$ whose elements denote the pixel intensities. The whole dataset consists of $N$ matrices of size $m\times n$, which can be stacked into an $N\times m\times n$ *tensor*.

[Illustrated Introduction to Linear Algebra using NumPy](https://medium.com/@kaaanishk/illustrated-introduction-to-linear-algebra-using-numpy-11d503d244a1)
<div>
<img src="./img/1_hd0aMCRIDbyFQo5lYgb5Fw.jpeg" width="400" >
</div>

- **Mathematical representation II**: *Random field model* $z=\mathbf{f}(\omega,x,y)$.
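
A minimal sketch of the tensor representation, using randomly generated intensities as a stand-in for real images (an assumption for illustration). It also shows that flattening each image into a row gives a tabular layout, the form in which MNIST is stored in `train.csv`, and that the reshape is reversible.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, n = 6, 28, 28

# Hypothetical stand-in for N grayscale images with intensities in 0..255.
images = rng.integers(0, 256, size=(N, m, n))   # N x m x n tensor

# Flatten each image row by row into a vector of length m*n (tabular form).
table = images.reshape(N, m * n)

# Reshaping back recovers the original tensor exactly.
restored = table.reshape(N, m, n)

print(images.shape, table.shape, np.array_equal(images, restored))
```

The flattened table keeps every pixel value but no longer records which pixels are spatial neighbors.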


- **Color images**: Decompose into RGB (red, green, and blue) channels and 
    - use three matrices (or three-dimensional tensor) to represent one image, or 
    - build the random field model with vector-valued functions $z=\mathbf{f}(\omega,x,y)\in \mathbb{R}^{3}$
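
Both views of a color image can be sketched on a small random array. The height-by-width-by-channel layout used here is one common convention and is an assumption; some libraries put the channel axis first.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 5

# Hypothetical color image: height m, width n, 3 channels (R, G, B).
rgb = rng.integers(0, 256, size=(m, n, 3))

# View 1: three m x n matrices, one per channel.
red, green, blue = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]

# View 2: a vector-valued function; at one pixel the image takes a value in R^3.
pixel_value = rgb[2, 3]

print(red.shape, pixel_value.shape)
```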
    
[convolutional neural networks](https://www.esantus.com/blog/2019/1/31/convolutional-neural-networks-a-quick-guide-for-newbies)

<div>
<img src="./img/conv_rgb.png" width="400">
</div>
- Question: Can image datasets also be transformed into tabular data? What are the pros/cons?

In [None]:
mnist.head()

#### **Videos**

- *Time-series* of images, or *random field* model $z=\mathbf{f}(\omega,x,y,t)$

#### **Texts**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['He is a good person',
          'He is bad student',
          'He is hardworking']
df = pd.DataFrame(data=corpus, columns=['sentences'])
print(df)
vocabulary = ['he', 'is', 'a', 'good', 'person', 'bad', 'student', 'hardworking']
# token_pattern keeps one-letter tokens such as 'a'; the other settings stay at their defaults
vectorizer = CountVectorizer(vocabulary=vocabulary, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(df['sentences'].values)
# with a fixed vocabulary, the columns of X follow the order of the vocabulary list
result = pd.DataFrame(data=X.toarray(), columns=vocabulary)
result.head()

- **Proposal I**: Tabular data obtained by extracting keywords, the "Document-Term Matrix"
    - useful in sentiment analysis, document clustering, topic modelling
    - popular techniques include bag of words, tf-idf, Word2Vec, etc.
- **Proposal II**: Time-series of individual words.
    - useful in machine translation
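
Proposal II can be sketched in plain Python: each sentence becomes an ordered sequence of word indices, so word order is preserved. The indexing scheme (order of first appearance) is an arbitrary choice for illustration.

```python
corpus = ["He is a good person", "He is bad student", "He is hardworking"]

# Build a vocabulary in order of first appearance (lowercased tokens).
vocab = {}
for sentence in corpus:
    for word in sentence.lower().split():
        vocab.setdefault(word, len(vocab))

# Proposal II: each sentence becomes an ordered sequence of word indices.
sequences = [[vocab[w] for w in s.lower().split()] for s in corpus]

print(vocab)
print(sequences)
```

Unlike the document-term matrix, the sequences have different lengths and keep the position of every word, which matters for tasks such as translation.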
    
[Recurrent neural network model for machine translations](https://smerity.com/articles/2016/google_nmt_arch.html)

<div>
<img src="./img/gnmt_arch_1_enc_dec.svg" width="500">
</div>

#### **Networks**

- Concepts: node/edge/weight, directed/undirected
- **Mathematical Representation**: adjacency matrix
- Question: What about a whole dataset of networks, or time-evolving networks?
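
A small sketch of the adjacency-matrix representation for a directed, weighted network. The edge list is hypothetical, and the convention that entry $(i,j)$ stores the edge from node $i$ to node $j$ is an assumption (the transposed convention is also used).

```python
import numpy as np

# Hypothetical directed, weighted network on 4 nodes: (source, target, weight).
edges = [(0, 1, 1.0), (1, 2, 2.5), (2, 0, 0.5), (2, 3, 1.0)]
num_nodes = 4

A = np.zeros((num_nodes, num_nodes))
for i, j, w in edges:
    A[i, j] = w            # A[i, j] holds the weight of the edge i -> j

# An undirected version has a symmetric adjacency matrix.
# np.maximum keeps the weight when an edge exists in only one direction.
A_undirected = np.maximum(A, A.T)

out_degree = (A > 0).sum(axis=1)   # number of outgoing edges per node

print(A)
print(out_degree)
```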

# Tasks with Data: Machine Learning

The tasks with data can often be transformed into *machine learning* problems, which can be broadly classified as:
- Supervised Learning -- "learning with training";
- Unsupervised Learning -- "learning without training";
- Reinforcement Learning -- "learning by doing".

Our course will focus on the first two categories.

## **Supervised Learning**

- Given the *training dataset* $\{(x^{(i)},y^{(i)})\}_{i=1}^{n}$, where $x^{(i)}\in\mathbb{R}^{p}$ are the observations and $y^{(i)}\in \mathbb{R}^{q}$ are the *labels*, supervised learning aims to find a mapping $\mathbf{f}:\mathbb{R}^{p}\to\mathbb{R}^{q}$ such that $y^{(i)}\approx\mathbf{f}(x^{(i)})$. For a new observation $x^{(new)}$, we then predict $y^{(new)}=\mathbf{f}(x^{(new)})$.

    - when $y$ is continuous, the problem is called *regression*. **Example**: Housing price prediction
    - when $y$ takes values in a discrete set of classes, the problem is called *classification*. **Example**: Handwritten digit recognition


- **Practical Strategy**: Restrict the mapping $\mathbf{f}$ to a parametrized family $\mathbf{f}(x;\theta)$. Then define the loss function of $\theta$
<center>$L(\theta)=\sum\limits_{i=1}^{n}\ell(y^{(i)},\mathbf{f}(x^{(i)};\theta)),$ </center> where $\ell$ quantifies the "distance" between $y^{(i)}$ and $\mathbf{f}(x^{(i)};\theta)$. A common choice for continuous data is the squared error $\ell(y^{(i)},\mathbf{f}(x^{(i)};\theta))=||y^{(i)}-\mathbf{f}(x^{(i)};\theta)||^{2}$, whose average over the dataset is the mean squared error (MSE). We then seek the optimal $\theta$ that minimizes the loss function<center>$\theta^{*}=\mathop{\mathrm{argmin}}\limits_{\theta}L(\theta),$</center>
which can be tackled numerically by optimization methods (including the popular stochastic gradient descent).
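
The loss-minimization strategy can be sketched for a linear model $\mathbf{f}(x;\theta)=x\cdot\theta$ with the squared-error loss. Note the assumptions: the data are synthetic, and the optimizer is plain full-batch gradient descent rather than the stochastic variant, with a step size picked by hand for this particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for illustration): y = 2*x1 - 3*x2 + noise.
n, p = 200, 2
X = rng.normal(size=(n, p))
theta_true = np.array([2.0, -3.0])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def loss(theta):
    """Sum of squared errors L(theta) = sum_i (y_i - x_i . theta)^2."""
    residual = y - X @ theta
    return np.sum(residual ** 2)

def grad(theta):
    """Gradient of L with respect to theta."""
    return -2.0 * X.T @ (y - X @ theta)

theta = np.zeros(p)
learning_rate = 1e-3
for _ in range(500):
    theta = theta - learning_rate * grad(theta)   # step against the gradient

print(theta, loss(theta))
```

Stochastic gradient descent follows the same update rule but estimates the gradient from a random subset of the observations at each step.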


- Different choices of $\mathbf{f}(x;\theta)$ lead to various supervised learning models:
    - Linear function : Linear Regression (for regression)/Logistic Regression (for classification)
    - Composition of linear + nonlinear functions: Neural Network
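
As a rough sketch of these choices, the functions below write out three parametrized families in NumPy. The function names, the `tanh` nonlinearity, and the layer sizes are illustrative assumptions; the parameters are random and nothing is trained here.

```python
import numpy as np

def linear_model(x, w, b):
    """Linear function, as used in linear regression."""
    return x @ w + b

def logistic_model(x, w, b):
    """Linear function followed by a sigmoid, as in logistic regression."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def two_layer_network(x, W1, b1, w2, b2):
    """Composition: linear map, nonlinearity (tanh), then another linear map."""
    hidden = np.tanh(x @ W1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                 # 5 observations, p = 3
w, b = rng.normal(size=3), 0.0
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0

print(linear_model(x, w, b).shape)
print(logistic_model(x, w, b).min(), logistic_model(x, w, b).max())
print(two_layer_network(x, W1, b1, w2, b2).shape)
```

In each case $\theta$ collects all weights and biases, and the output of the logistic model lies strictly between 0 and 1, so it can be read as a class probability.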
    
    
- **Important Terms**:
    - **Training Data**: Both X and y are provided. This is the dataset we use to fit the function.
    - **Test Data**: In principle, only X is provided (sometimes $y^{test}$ is also provided as the ground truth for verification). This is the dataset on which we generate new predictions $y^{pred}$. -- This is the final judgement of your supervised ML model!
    - **Validation Data**: A model that fits the training data well is not guaranteed to perform well on test data. To gain more confidence before applying the model to test data, we "fake" some test data as a "sample exam". To do this, we further split the original training data into new training data and validation data, learn the mapping on the new training data, and judge it on the validation data. We may make adjustments if the model does not perform well in the "sample exam".
    - Intuitive Understanding: Training data is like quizzes -- you want to learn the "mapping" between the question and the correct answer. Test data is like your exam. Validation is like taking a sample exam before the real exam and then working on your weak points.
    - See the illustration [here](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7)
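
The three-way split described above can be sketched with two calls to scikit-learn's `train_test_split`. The data and the 60/20/20 proportions are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 observations, 4 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# First hold out the test set (the "real exam").
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into new training data and validation data
# (the "sample exam").
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

The model is fitted on `X_train`, adjusted according to its performance on `X_val`, and evaluated on `X_test` only once at the end.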

**Example:** The [Wisconsin breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) and low-code ML package [pycaret](https://pycaret.org/).

In [None]:
pip install pycaret #install pycaret -- it's a new package, not coming with Anaconda

In [2]:
from sklearn.datasets import load_breast_cancer # load the dataset
X,y = load_breast_cancer(as_frame = True,return_X_y = True)

In [None]:
X

In [None]:
y

In this dataset, all labels are known. To mimic a real situation, we manually create train and test datasets.

In [3]:
from sklearn.model_selection import train_test_split # manually split into train and test by random sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

In [None]:
X_train.shape

In [None]:
y_test.shape

In [6]:
data_train = pd.concat([X_train,y_train],axis=1) # the whole data table of training
data_train

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
56,19.21,18.57,125.5,1152.0,0.1053,0.1267,0.1323,0.08994,0.1917,0.05961,0.7275,1.193,4.837,102.5,0.006458,0.02306,0.02945,0.01538,0.01852,0.002608,26.14,28.14,170.1,2145.0,0.1624,0.3511,0.3879,0.2091,0.3537,0.08294,0
144,10.75,14.97,68.26,355.3,0.07793,0.05139,0.02251,0.007875,0.1399,0.05688,0.2525,1.239,1.806,17.74,0.006547,0.01781,0.02018,0.005612,0.01671,0.00236,11.95,20.72,77.79,441.2,0.1076,0.1223,0.09755,0.03413,0.23,0.06769,1
60,10.17,14.88,64.55,311.9,0.1134,0.08061,0.01084,0.0129,0.2743,0.0696,0.5158,1.441,3.312,34.62,0.007514,0.01099,0.007665,0.008193,0.04183,0.005953,11.02,17.45,69.86,368.6,0.1275,0.09866,0.02168,0.02579,0.3557,0.0802,1
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,0.4467,0.7732,3.18,53.91,0.004314,0.01382,0.02254,0.01039,0.01369,0.002179,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,0
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,0.3063,1.002,2.406,24.32,0.005731,0.03502,0.03553,0.01226,0.02143,0.003749,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,0
474,10.88,15.62,70.41,358.9,0.1007,0.1069,0.05115,0.01571,0.1861,0.06837,0.1482,0.538,1.301,9.597,0.004474,0.03093,0.02757,0.006691,0.01212,0.004672,11.94,19.35,80.78,433.1,0.1332,0.3898,0.3365,0.07966,0.2581,0.108,1
320,10.25,16.18,66.52,324.2,0.1061,0.1111,0.06726,0.03965,0.1743,0.07279,0.3677,1.471,1.597,22.68,0.01049,0.04265,0.04004,0.01544,0.02719,0.007596,11.28,20.61,71.53,390.4,0.1402,0.236,0.1898,0.09744,0.2608,0.09702,1
252,19.73,19.82,130.7,1206.0,0.1062,0.1849,0.2417,0.0974,0.1733,0.06697,0.7661,0.78,4.115,92.81,0.008482,0.05057,0.068,0.01971,0.01467,0.007259,25.28,25.59,159.8,1933.0,0.171,0.5955,0.8489,0.2507,0.2749,0.1297,0
202,23.29,26.67,158.9,1685.0,0.1141,0.2084,0.3523,0.162,0.22,0.06229,0.5539,1.56,4.667,83.16,0.009327,0.05121,0.08958,0.02465,0.02175,0.005195,25.12,32.68,177.0,1986.0,0.1536,0.4167,0.7892,0.2733,0.3198,0.08762,0
246,13.2,17.43,84.13,541.6,0.07215,0.04524,0.04336,0.01105,0.1487,0.05635,0.163,1.601,0.873,13.56,0.006261,0.01569,0.03079,0.005383,0.01962,0.00225,13.94,27.82,88.28,602.0,0.1101,0.1508,0.2298,0.0497,0.2767,0.07198,1


In [7]:
from pycaret.classification import setup
from pycaret.classification import compare_models

bc = setup(data=data_train, target='target') # target is the y column name we want to predict

Unnamed: 0,Description,Value
0,session_id,5279
1,Target,target
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(381, 31)"
5,Missing Values,False
6,Numeric Features,30
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False


In [8]:
best = compare_models() # pycaret automatically fits different ML models for you, and compares their performance on the training dataset with cross-validation!

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.9625,0.989,0.9941,0.9513,0.9714,0.9173,0.9225,0.01
ada,Ada Boost Classifier,0.9624,0.9892,0.9824,0.9598,0.9704,0.9189,0.9213,0.038
rf,Random Forest Classifier,0.9551,0.9952,0.9647,0.9654,0.964,0.9044,0.9076,0.195
xgboost,Extreme Gradient Boosting,0.9551,0.9851,0.9647,0.965,0.9636,0.9051,0.9089,0.11
ridge,Ridge Classifier,0.955,0.0,0.9882,0.9437,0.9648,0.9023,0.9065,0.009
qda,Quadratic Discriminant Analysis,0.955,0.9885,0.9706,0.959,0.9638,0.9043,0.9075,0.009
et,Extra Trees Classifier,0.9514,0.9944,0.9643,0.9598,0.9607,0.8968,0.9006,0.175
catboost,CatBoost Classifier,0.9514,0.9916,0.9647,0.9592,0.9611,0.8964,0.8989,2.728
lightgbm,Light Gradient Boosting Machine,0.9476,0.9886,0.9585,0.9584,0.9577,0.8887,0.891,0.156
lr,Logistic Regression,0.9399,0.9872,0.9699,0.9399,0.953,0.8694,0.8763,0.413
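
To make the idea of a cross-validated score concrete, here is a sketch that computes 10-fold cross-validation accuracy for one model with scikit-learn. It is not pycaret's internal procedure: pycaret first sets aside its own validation split inside `setup`, so the numbers will generally differ from the table above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# 10-fold cross-validation on the training data only: fit on 9 folds,
# score on the held-out fold, and repeat for each fold.
scores = cross_val_score(LinearDiscriminantAnalysis(), X_train, y_train,
                         cv=10, scoring="accuracy")

print(scores.shape, scores.mean())
```

The mean over the ten folds is the kind of summary reported in the `Accuracy` column of a model comparison.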


In [9]:
best # the best model selected by pycaret

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

In [20]:
from pycaret.classification import predict_model
predict_model(best); # predict on the validation data that pycaret has held out -- sample exam!

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.9391,0.9754,1.0,0.9103,0.953,0.8671,0.8749


In [12]:
from pycaret.classification import finalize_model
best_final = finalize_model(best) # retrain the model on the whole input training data

In [13]:
from pycaret.classification import predict_model
predictions = predict_model(best_final, data = X_test) # make predictions for new patients with the selected best model
predictions

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,Label,Score
512,13.40,20.52,88.64,556.7,0.11060,0.14690,0.14450,0.08172,0.2116,0.07325,...,113.30,844.4,0.15740,0.38560,0.51060,0.20510,0.3585,0.11090,0,0.9973
457,13.21,25.25,84.10,537.9,0.08791,0.05205,0.02772,0.02068,0.1619,0.05584,...,91.29,632.9,0.12890,0.10630,0.13900,0.06005,0.2444,0.06788,1,0.9978
439,14.02,15.66,89.59,606.5,0.07966,0.05581,0.02087,0.02652,0.1589,0.05586,...,96.53,688.9,0.10340,0.10170,0.06260,0.08216,0.2136,0.06710,1,0.9999
298,14.26,18.17,91.22,633.1,0.06576,0.05220,0.02475,0.01374,0.1635,0.05586,...,105.80,819.7,0.09445,0.21670,0.15650,0.07530,0.2636,0.07676,1,0.9978
37,13.03,18.42,82.61,523.8,0.08983,0.03766,0.02562,0.02923,0.1467,0.05863,...,84.46,545.9,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169,1,0.9998
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100,13.61,24.98,88.05,582.7,0.09488,0.08511,0.08625,0.04489,0.1609,0.05871,...,108.60,906.5,0.12650,0.19430,0.31690,0.11840,0.2651,0.07397,0,0.8100
336,12.99,14.23,84.08,514.3,0.09462,0.09965,0.03738,0.02098,0.1652,0.07238,...,87.38,576.0,0.11420,0.19750,0.14500,0.05850,0.2432,0.10090,1,1.0000
299,10.51,23.09,66.85,334.2,0.10150,0.06797,0.02495,0.01875,0.1695,0.06556,...,70.10,362.7,0.11430,0.08614,0.04158,0.03125,0.2227,0.06777,1,1.0000
347,14.76,14.74,94.87,668.7,0.08875,0.07780,0.04608,0.03528,0.1521,0.05912,...,114.20,880.8,0.12200,0.20090,0.21510,0.12510,0.3109,0.08187,1,0.8518


In [14]:
df_compare = pd.concat([predictions['Label'],y_test],axis = 1) # compare with the ground-truth
df_compare

Unnamed: 0,Label,target
512,0,0
457,1,1
439,1,1
298,1,1
37,1,1
...,...,...
100,0,0
336,1,1
299,1,1
347,1,1


In [15]:
import numpy as np
np.mean(predictions['Label'].to_numpy() == y_test.to_numpy()) # calculate the percentage of accurate prediction (accuracy)

0.973404255319149
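
Accuracy is only one of the metrics in the tables above. The sketch below computes accuracy, the confusion matrix, precision, recall, and F1 with scikit-learn on a tiny pair of label vectors, which are hypothetical and not taken from the breast cancer predictions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Hypothetical labels for 8 patients (not taken from the dataset above).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1])

print(accuracy_score(y_true, y_pred))     # fraction of correct predictions
print(confusion_matrix(y_true, y_pred))   # rows: true class, columns: predicted
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
```

Here `accuracy_score` gives the same number as `np.mean(y_true == y_pred)`, the expression used in the cell above.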

In [17]:
from pycaret.classification import create_model
lr = create_model('lr') # what if we only want the logistic regression model?

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,0.9259,1.0,1.0,0.8947,0.9444,0.8344,0.846
2,0.9259,0.9941,0.9412,0.9412,0.9412,0.8412,0.8412
3,0.8889,0.9588,0.8824,0.9375,0.9091,0.7666,0.7689
4,0.9259,0.9882,1.0,0.8947,0.9444,0.8344,0.846
5,0.963,1.0,0.9375,1.0,0.9677,0.9244,0.927
6,0.8846,0.9438,1.0,0.8421,0.9143,0.7417,0.7678
7,0.9231,0.9938,1.0,0.8889,0.9412,0.8312,0.8433
8,0.9615,0.9938,0.9375,1.0,0.9677,0.9202,0.9232
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [19]:
predict_model(lr) # validation dataset -- sample exam!

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.9478,0.9936,0.9437,0.971,0.9571,0.8905,0.8911


Unnamed: 0,mean radius,mean texture,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,...,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target,Label,Score
0,11.940000,18.240000,437.600006,0.08261,0.04751,0.01972,0.01349,0.1868,0.06110,0.2273,...,527.200012,0.11440,0.08906,0.09203,0.06296,0.2785,0.07408,1,1,0.9965
1,17.010000,20.260000,904.299988,0.08772,0.07304,0.06950,0.05390,0.2026,0.05223,0.5858,...,1210.000000,0.11110,0.14860,0.19320,0.10960,0.3275,0.06469,0,0,0.9994
2,11.870000,21.540001,432.000000,0.06613,0.10640,0.08777,0.02386,0.1349,0.06612,0.2560,...,507.200012,0.09457,0.33990,0.32180,0.08750,0.2305,0.09952,1,1,0.9830
3,14.990000,25.200001,698.799988,0.09387,0.05131,0.02398,0.02899,0.1565,0.05504,1.2140,...,698.799988,0.09387,0.05131,0.02398,0.02899,0.1565,0.05504,0,0,0.8766
4,15.060000,19.830000,705.599976,0.10390,0.15530,0.17000,0.08815,0.1855,0.06284,0.4768,...,1025.000000,0.15510,0.42030,0.52030,0.21150,0.2834,0.08234,0,0,0.9959
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110,13.640000,15.600000,575.299988,0.09423,0.06630,0.04705,0.03731,0.1717,0.05660,0.3242,...,683.400024,0.12780,0.12910,0.15330,0.09222,0.2530,0.06510,1,1,0.9839
111,18.049999,16.150000,1006.000000,0.10650,0.21460,0.16840,0.10800,0.2152,0.06673,0.9806,...,1610.000000,0.14780,0.56340,0.37860,0.21020,0.3751,0.11080,0,0,1.0000
112,11.600000,12.840000,412.600006,0.08983,0.07525,0.04196,0.03350,0.1620,0.06582,0.2315,...,512.500000,0.14310,0.18510,0.19220,0.08449,0.2772,0.08756,1,1,0.9978
113,14.480000,21.459999,648.200012,0.09444,0.09947,0.12040,0.04938,0.2075,0.05636,0.4204,...,808.900024,0.13060,0.19760,0.33490,0.12250,0.3020,0.06846,0,0,0.6182


In [21]:
final_lr = finalize_model(lr)

In [22]:
predictions_lr = predict_model(final_lr, data = X_test)
np.mean(predictions_lr['Label'].to_numpy() == y_test.to_numpy())

0.9627659574468085

In [24]:
from pycaret.classification import tune_model
tuned_lr = tune_model(lr) # fine-tuning the parameters in logistic regression

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,0.963,1.0,1.0,0.9444,0.9714,0.9189,0.922
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,0.8889,0.9824,0.8824,0.9375,0.9091,0.7666,0.7689
4,0.9259,0.9882,1.0,0.8947,0.9444,0.8344,0.846
5,0.963,1.0,0.9375,1.0,0.9677,0.9244,0.927
6,0.9615,0.9312,1.0,0.9412,0.9697,0.9172,0.9204
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,0.9231,0.9812,0.9375,0.9375,0.9375,0.8375,0.8375
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0
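
The ten rows above look like per-fold cross-validation scores (one row per fold). A minimal sketch of how one might summarize them, assuming the accuracy column is copied by hand from that table; the variable names are illustrative and not part of pycaret:

```python
import numpy as np

# Per-fold accuracies copied from the "Accuracy" column of the table above
fold_accuracy = np.array([1.0, 0.963, 1.0, 0.8889, 0.9259,
                          0.963, 0.9615, 1.0, 0.9231, 1.0])

mean_accuracy = fold_accuracy.mean()   # average score across folds
std_accuracy = fold_accuracy.std()     # population std; use ddof=1 for the sample std

print(f"mean = {mean_accuracy:.4f}, std = {std_accuracy:.4f}")
```

The spread across folds gives a rough sense of how much a single accuracy number can vary with the choice of split.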


In [25]:
predict_model(tuned_lr) # still doing the sample exam -- validation dataset

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.9478,0.9923,0.9437,0.971,0.9571,0.8905,0.8911


Unnamed: 0,mean radius,mean texture,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,...,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target,Label,Score
0,11.940000,18.240000,437.600006,0.08261,0.04751,0.01972,0.01349,0.1868,0.06110,0.2273,...,527.200012,0.11440,0.08906,0.09203,0.06296,0.2785,0.07408,1,1,0.9979
1,17.010000,20.260000,904.299988,0.08772,0.07304,0.06950,0.05390,0.2026,0.05223,0.5858,...,1210.000000,0.11110,0.14860,0.19320,0.10960,0.3275,0.06469,0,0,0.9999
2,11.870000,21.540001,432.000000,0.06613,0.10640,0.08777,0.02386,0.1349,0.06612,0.2560,...,507.200012,0.09457,0.33990,0.32180,0.08750,0.2305,0.09952,1,1,0.9671
3,14.990000,25.200001,698.799988,0.09387,0.05131,0.02398,0.02899,0.1565,0.05504,1.2140,...,698.799988,0.09387,0.05131,0.02398,0.02899,0.1565,0.05504,0,0,0.9944
4,15.060000,19.830000,705.599976,0.10390,0.15530,0.17000,0.08815,0.1855,0.06284,0.4768,...,1025.000000,0.15510,0.42030,0.52030,0.21150,0.2834,0.08234,0,0,0.9991
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110,13.640000,15.600000,575.299988,0.09423,0.06630,0.04705,0.03731,0.1717,0.05660,0.3242,...,683.400024,0.12780,0.12910,0.15330,0.09222,0.2530,0.06510,1,1,0.9850
111,18.049999,16.150000,1006.000000,0.10650,0.21460,0.16840,0.10800,0.2152,0.06673,0.9806,...,1610.000000,0.14780,0.56340,0.37860,0.21020,0.3751,0.11080,0,0,1.0000
112,11.600000,12.840000,412.600006,0.08983,0.07525,0.04196,0.03350,0.1620,0.06582,0.2315,...,512.500000,0.14310,0.18510,0.19220,0.08449,0.2772,0.08756,1,1,0.9979
113,14.480000,21.459999,648.200012,0.09444,0.09947,0.12040,0.04938,0.2075,0.05636,0.4204,...,808.900024,0.13060,0.19760,0.33490,0.12250,0.3020,0.06846,0,0,0.6504


In [26]:
final_tuned_lr = finalize_model(tuned_lr) #retrain with the whole dataset

In [27]:
predictions_tuned_lr = predict_model(final_tuned_lr, data = X_test)
np.mean(predictions_tuned_lr['Label'].to_numpy() == y_test.to_numpy())

0.9627659574468085

Of course, as a math course, we are not satisfied with merely calling functions in pycaret. In the remaining lectures this quarter, we will dig into the details of some algorithms and learn the underlying math, turning the black box of ML into a white (or at least gray) one!

## **Unsupervised Learning**

It is still challenging to give a general and rigorous definition for unsupervised learning mathematically. Let's focus on more specific tasks.

- Dimension Reduction
  
    Given $X\in \mathbb{R}^{n\times p}$, find a mapping $\mathbf{f}:\mathbb{R}^{p}\to \mathbb{R}^{q}$ with $q\ll p$ such that the low-dimensional coordinates $z^{(i)}=\mathbf{f}(x^{(i)})$ "preserve the information" about $x^{(i)}$.
  - Question: How does this differ from supervised learning?
  - Linear mapping: Principal Component Analysis (PCA)
  - Nonlinear mapping: Manifold Learning, Autoencoder
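
To make the linear case concrete, here is a minimal NumPy sketch of PCA as a linear map $\mathbf{f}(x) = W^\top (x - \bar{x})$, computed from the singular value decomposition of the centered data matrix. The function name `pca_project` and the synthetic data are assumptions for illustration only; this is one standard way to compute PCA, not necessarily how any particular library implements it.

```python
import numpy as np

def pca_project(X, q):
    """Project the rows of X (n x p) onto the top-q principal directions."""
    X_centered = X - X.mean(axis=0)            # PCA is defined on centered data
    # Rows of Vt are orthonormal directions in R^p, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:q].T                               # p x q matrix defining the linear map
    Z = X_centered @ W                         # n x q low-dimensional coordinates
    explained = (S[:q] ** 2) / np.sum(S ** 2)  # fraction of total variance kept per direction
    return Z, W, explained

# Synthetic correlated data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

Z_demo, W_demo, explained_demo = pca_project(X_demo, q=2)
print(Z_demo.shape)       # n x q
print(explained_demo)     # variance fraction captured by each of the q directions
```

Here "preserving information" is read as keeping as much of the data's variance as possible; the `explained` values measure how much of it survives the projection.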

In [None]:
from sklearn.datasets import load_iris
X,y = load_iris(return_X_y = True) # Note that in the hw this week, it's not allowed to load iris data in this way!!!
X

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2) # principal component analysis: reduce 4-dimensional data to 2 dimensions
X_pca = pca.fit_transform(X)
X_pca

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set() # set the seaborn theme style
figure = plt.figure(dpi=100)
# plt.get_cmap replaces plt.cm.get_cmap, which was removed in recent matplotlib releases
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=15, edgecolor='none', alpha=0.5, cmap=plt.get_cmap('tab10', 3))
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.colorbar();

- Clustering

    Given $X\in \mathbb{R}^{n\times p}$, find a partition of the dataset into $K$ groups such that 
    - data within the same group are similar;
    - data from different groups are dissimilar.
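
One common way to make "similar" precise is the squared Euclidean distance to the group's centroid; the sum of these distances over all groups is the objective that K-means tries to minimize. The sketch below evaluates that quantity for two partitions of a synthetic dataset. The helper `within_cluster_ss` and the toy data are assumptions for illustration.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared distances from each point to its own group's centroid."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]             # points assigned to group k
        centroid = members.mean(axis=0)      # group mean in R^p
        total += np.sum((members - centroid) ** 2)
    return total

# Two well-separated synthetic blobs (hypothetical data)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X_demo = np.vstack([blob_a, blob_b])

good_labels = np.repeat([0, 1], 50)           # partition that matches the blobs
random_labels = rng.integers(0, 2, size=100)  # arbitrary partition

print(within_cluster_ss(X_demo, good_labels))
print(within_cluster_ss(X_demo, random_labels))
```

The partition that matches the blobs gives a much smaller value, which is the sense in which its groups are internally "similar".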

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0) #call k-means clustering algorithm
y_km = kmeans.fit_predict(X)
y_km # the groups assigned by algorithm

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
fig, (ax1, ax2) = plt.subplots(1, 2,dpi=150, figsize=(10,4))

# plt.get_cmap replaces plt.cm.get_cmap, which was removed in recent matplotlib releases
fig1 = ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y_km, s=15, edgecolor='none', alpha=0.5, cmap=plt.get_cmap('Set1', 3))
fig2 = ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=15, edgecolor='none', alpha=0.5, cmap=plt.get_cmap('Accent', 3))
ax1.set_title('K-means Clustering')
legend1 = ax1.legend(*fig1.legend_elements(), loc="best", title="Classes")
ax1.add_artist(legend1)
ax2.set_title('True Labels')
legend2 = ax2.legend(*fig2.legend_elements(), loc="best", title="Classes")
ax2.add_artist(legend2)
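
The two panels can also be compared numerically. The following self-contained sketch reruns K-means on the Iris data, cross-tabulates cluster ids against the true labels, and computes the adjusted Rand index, which does not depend on how the clusters happen to be numbered. Choosing this score and setting `n_init=10` are assumptions made for this sketch.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X_iris, y_iris = load_iris(return_X_y=True)

# n_init is set explicitly so the result does not depend on the library default
km = KMeans(n_clusters=3, random_state=0, n_init=10)
cluster_ids = km.fit_predict(X_iris)

# Rows: true species labels; columns: cluster ids assigned by K-means
table = pd.crosstab(pd.Series(y_iris, name="true label"),
                    pd.Series(cluster_ids, name="cluster"))
print(table)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(y_iris, cluster_ids)
print(ari)
```

Cluster ids are arbitrary, so a cluster numbered 0 need not correspond to label 0; the table shows which cluster lines up with which species.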

Question: What is the difference between clustering and classification? Can you try classification on Iris data with pycaret right now?

In [None]:
# try classification with pycaret for Iris data by yourself!