# Welcome to the introduction to machine learning workshop!!!

This is a Python notebook for reference. Feel free to set this up on Google Colab (https://colab.research.google.com/) or your local machine with your favorite IDEs, and follow along with the presentation! 

Once you have the dataset, run the cell below to read in the data and test out all the model demos. The markdown cells explain what the code is doing, but please ask questions if anything is unclear or if you run into issues. There are no stupid questions!

## Categories of ML Tasks

Machine learning tasks fall into three broad categories:
1. Supervised Learning Tasks
    - A model is trained given the input features, X, and their corresponding targets, Y.
    - EX: regression tasks, classification tasks with labeled data.
2. Unsupervised Learning Tasks
    - A model is trained given only the input features, X, without their corresponding targets.
    - EX: clustering, dimensionality reduction.
3. Reinforcement Learning Tasks
    - Train a model or an agent's action policy based on feedback from the environment.
    - EX: problems modeled as Markov decision processes, Q-table learning.
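
In scikit-learn terms, the difference between the first two categories shows up in what is passed to `fit`. The following is a minimal sketch on synthetic data from `make_blobs` (an assumption, not the workshop dataset); the two model choices are only illustrative, and reinforcement learning is left out of the sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# ASSUMPTION: synthetic 2-D points in 3 groups, used only for illustration.
X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Supervised: the model sees both the features X and the targets y.
supervised_model = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: the model sees only the features X.
unsupervised_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(supervised_model.predict(X[:5]))   # predicted class labels
print(unsupervised_model.labels_[:5])    # cluster ids (arbitrary numbering)
```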

In this workshop, we will look at both supervised and unsupervised tasks, including:
* regression
    - linear regression
    - lasso regression
    - support vector machine
* classification
    - logistic regression
    - decision tree
    - random forest
    - support vector machine
* clustering
    - k-means clustering
    - density-based spatial clustering of applications with noise (DBSCAN)

## Steps to solve a ML task

<h4>We follow the `Machine Learning Life Cycle` as a guideline when building our models.</h4>

<!-- ![ML-Lifecycle](ML-lifecycle.drawio.png){ width="800" height="600" style="display: block; margin: 0 auto" } -->
<p align="center">
    <img src="ML-lifecycle.drawio.png">
</p>

In this workshop, we have already collected the data for you, and no model deployment is required. We will focus on model training and evaluation.
* In practice, however, designing and implementing data pipelines and making a model production-ready are often far more time-consuming than building the model itself; we will cover these aspects of ML in future workshops.

## Tools
* `Pandas`: Read in data and do some basic transformations.
* `matplotlib`: Visualize model performance.
* `Scikit-learn (Sklearn)`: A library of ML models and common utility functions.
* `time`: A module from the Python standard library; we use it to measure runtime.

Here, we use `!pip install <package_name>` to install packages inside of a Jupyter Notebook runtime. Alternatively, you can run `pip install ...` (or `conda install ...` if you have Anaconda available) in your terminal to install these packages in a virtual environment; this way you don't have to run this command every time you launch a new notebook instance. 

In [7]:
# Package download and import
# !pip install scikit-learn
# !pip install seaborn 

from sklearn.model_selection import train_test_split
import pandas as pd
import time 

In [11]:
# TODO: replace with path to your downloaded dataset  
mobile_price_filepath = './mobile_price_classification_train.csv'

# load the data into memory
mobile_price_train, mobile_price_test = None, None
mobile_price_df = pd.read_csv(mobile_price_filepath)

# Feel free to play around with this function, but if you change test_size and/or random_state, you will see different results
mobile_price_train, mobile_price_test = train_test_split(mobile_price_df, test_size=0.4, random_state=2023, shuffle=True) 

##### Before we begin training models, we have to understand what data we are working with.

The `mobile_price` dataset contains mock specifications of 2000 smartphones. In the code above, we split the data into train and test sets; with `test_size=0.4`, there are 1200 instances in the train set and 800 in the test set.
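
The split sizes follow directly from `test_size`. The sketch below checks this on a synthetic 2000-row frame (an assumption standing in for the real CSV; the column names `feature` and `target` are placeholders), using the same `train_test_split` arguments as the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# ASSUMPTION: a dummy frame with the same number of rows as the dataset.
demo_df = pd.DataFrame({
    "feature": np.arange(2000),
    "target": np.arange(2000) % 4,
})

demo_train, demo_test = train_test_split(
    demo_df, test_size=0.4, random_state=2023, shuffle=True
)

# 40% of 2000 rows go to the test set, the remaining 60% to the train set.
print(len(demo_train), len(demo_test))  # 1200 800
```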

There are a total of 20 feature columns and 1 target column; here is the `metadata` of the dataset:

| Feature | Description |
| :-------| :-----------|
| `battery_power` | int, total energy a battery can store in mAh |
| `blue` | bool, has Bluetooth or not |
| `clock_speed` | float, speed at which the microprocessor executes instructions |
| `dual_sim` | bool, has dual SIM support or not |
| `fc` | int, front camera megapixels |
| `four_g` | bool, has 4G or not |
| `int_memory` | int, internal memory in gigabytes |
| `m_dep` | float, mobile depth in cm |
| `mobile_wt` | int, weight of mobile phone |
| `n_cores` | int, number of processor cores |
| `pc` | int, primary camera megapixels |
| `px_height` | int, pixel resolution height |
| `px_width` | int, pixel resolution width |
| `ram` | int, random access memory in megabytes |
| `sc_h` | int, screen height of mobile in cm |
| `sc_w` | int, screen width of mobile in cm |
| `talk_time` | int, longest time the battery lasts during a call |
| `three_g` | bool, has 3G or not |
| `touch_screen` | bool, has touch screen or not |
| `wifi` | bool, has Wi-Fi or not |
| `price_range` | int, price category: {0: cheap, 1: medium, 2: expensive, 3: very expensive} |


In [21]:
# check out the first entry
mobile_price_train.iloc[0]

battery_power    1135.0
blue                1.0
clock_speed         2.8
dual_sim            1.0
fc                  9.0
four_g              0.0
int_memory         43.0
m_dep               0.4
mobile_wt         158.0
n_cores             1.0
pc                 11.0
px_height         690.0
px_width         1589.0
ram              3204.0
sc_h               18.0
sc_w               13.0
talk_time           6.0
three_g             1.0
touch_screen        0.0
wifi                0.0
price_range         3.0
Name: 729, dtype: float64

## Models

After we have created our train and test sets, we can start working on building different models!
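
Every model in this section needs a feature matrix and a target column, and the target differs by task (`battery_power` for regression, `price_range` for classification). A minimal sketch of that separation follows; `split_features_target` is a hypothetical helper, and the toy frame uses only a few of the real column names with made-up values beyond the first row. Whether `price_range` stays in as a feature for the regression task is a modeling choice this notebook leaves open.

```python
import pandas as pd


def split_features_target(df, target_col):
    """Return (X, y): every column except target_col, and target_col itself."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    return X, y


# ASSUMPTION: a tiny toy frame; on the real data, pass mobile_price_train instead.
toy_df = pd.DataFrame({
    "battery_power": [1135, 842, 1021],
    "ram": [3204, 2549, 2631],
    "wifi": [0, 1, 0],
    "price_range": [3, 2, 2],
})

X_reg, y_reg = split_features_target(toy_df, "battery_power")  # regression target
X_clf, y_clf = split_features_target(toy_df, "price_range")    # classification target
print(list(X_reg.columns), list(X_clf.columns))
```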

### Supervised Regression 

<b>Goal</b>: use the given features to build a model that predicts a mobile phone's battery power.

* We are 

For supervised regression, we will be looking at three models: 
1. Linear Regression 
2. Linear Regression with L1 Regularization (i.e. Lasso Regression)
3. Support Vector Machine (SVM) for Regression

In [None]:
# Task 1: multivariate linear regression
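
One possible way to fill in this cell is sketched below. It is not the workshop's reference solution: it fits ordinary least squares on synthetic data from `make_regression` (an assumption standing in for the phone features and `battery_power`), times the fit with `time`, and reports test MSE and R^2.

```python
import time

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic data with 20 features, like the number of feature columns.
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=2023)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=2023)

start = time.perf_counter()
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)                 # ordinary least squares fit
fit_seconds = time.perf_counter() - start

y_pred = lin_reg.predict(X_test)
lin_mse = mean_squared_error(y_test, y_pred)  # lower is better
lin_r2 = r2_score(y_test, y_pred)             # 1.0 would be a perfect fit
print(f"fit: {fit_seconds:.4f}s  test MSE: {lin_mse:.2f}  test R^2: {lin_r2:.3f}")
```

On the real data, `X_train` and `y_train` would come from `mobile_price_train`, with `battery_power` as the target.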

In [None]:
# Task 2: Lasso regression (linear regression with regularization)
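
Lasso adds an L1 penalty on the coefficients to the least-squares objective, which can push some coefficients to exactly zero. The sketch below illustrates this on synthetic data (an assumption, not the workshop dataset); `alpha=1.0` is scikit-learn's default penalty strength, not a tuned value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic data where only 5 of the 20 features carry signal.
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=2023)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=2023)

lasso = Lasso(alpha=1.0)   # larger alpha means a stronger L1 penalty
lasso.fit(X_train, y_train)

# Count the features the penalty did not zero out.
n_nonzero = int(np.sum(lasso.coef_ != 0))
lasso_r2 = r2_score(y_test, lasso.predict(X_test))
print(f"non-zero coefficients: {n_nonzero} of {lasso.coef_.size}, test R^2: {lasso_r2:.3f}")
```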

In [None]:
# Task 3: Support Vector Machine 
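
For regression, scikit-learn provides the support vector machine as `SVR`. SVMs are sensitive to feature scale, so the sketch below wraps the model in a pipeline with `StandardScaler`. The data is synthetic, and the kernel, `C`, and `epsilon` values are illustrative assumptions rather than tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# ASSUMPTION: synthetic data standing in for the phone features and battery_power.
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=2023)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=2023)

# Scale the features first, then fit a linear-kernel support vector regressor.
svr_model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=10.0, epsilon=0.1))
svr_model.fit(X_train, y_train)

svr_pred = svr_model.predict(X_test)
svr_r2 = r2_score(y_test, svr_pred)
print(f"SVR test R^2: {svr_r2:.3f}")
```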

### Supervised Classification

For supervised classification, we will be looking at three models: 
1. Logistic Regression
2. Decision Tree 
3. Support Vector Machine (SVM) for Classification

In [None]:
# Task 4: logistic regression (classification)
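
A possible shape for this cell is sketched below, assuming a four-class target like `price_range`. The data comes from `make_classification` (synthetic, not the workshop dataset), and `max_iter=1000` is just a generous iteration cap so the solver converges.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic 4-class data with 20 features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_redundant=2, n_classes=4, n_clusters_per_class=1,
                           random_state=2023)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=2023, stratify=y)

log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
log_reg.fit(X_train, y_train)

log_acc = accuracy_score(y_test, log_reg.predict(X_test))
class_probs = log_reg.predict_proba(X_test[:3])  # one probability per class, per row
print(f"test accuracy: {log_acc:.3f}")
print(class_probs.round(3))
```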

In [None]:
# Task 5: Decision Tree (Random Forest)
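
A random forest is an ensemble of decision trees, each trained on a bootstrap sample of the data, so the two models are natural to compare side by side. The sketch below does that on synthetic four-class data (an assumption); `max_depth=5` and `n_estimators=100` are illustrative values, not tuned ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic 4-class data with 20 features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_redundant=2, n_classes=4, n_clusters_per_class=1,
                           random_state=2023)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=2023, stratify=y)

# A single tree with limited depth.
tree = DecisionTreeClassifier(max_depth=5, random_state=2023).fit(X_train, y_train)
# An ensemble of 100 trees whose votes are combined.
forest = RandomForestClassifier(n_estimators=100, random_state=2023).fit(X_train, y_train)

tree_acc = accuracy_score(y_test, tree.predict(X_test))
forest_acc = accuracy_score(y_test, forest.predict(X_test))
print(f"decision tree accuracy: {tree_acc:.3f}, random forest accuracy: {forest_acc:.3f}")
```

Tree-based models do not need feature scaling, which is why no `StandardScaler` appears here.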

In [None]:
# Task 6: Support Vector Machine 
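
For classification, scikit-learn provides the support vector machine as `SVC`. As with regression, the features are scaled first. This is a sketch on synthetic data, and the RBF kernel with `C=1.0` is an assumed starting point, not a recommendation for the phone dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic 4-class data with 20 features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_redundant=2, n_classes=4, n_clusters_per_class=1,
                           random_state=2023)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=2023, stratify=y)

svc_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svc_model.fit(X_train, y_train)

svc_pred = svc_model.predict(X_test)
svc_acc = accuracy_score(y_test, svc_pred)
print(f"SVC test accuracy: {svc_acc:.3f}")
```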

### Unsupervised Clustering

For unsupervised clustering, we will be looking at two models: 
1. K-means
2. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

In [None]:
# Task 7: Kmeans Clustering 
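
K-means needs the number of clusters up front and assigns every point to its nearest cluster center. The sketch below runs it on synthetic blobs (an assumption, not the workshop dataset); `n_clusters=4` matches how the toy data was generated and would have to be chosen differently on real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# ASSUMPTION: synthetic 2-D data drawn around 4 centers; the true labels are ignored.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=2023)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=2023)
kmeans_labels = kmeans.fit_predict(X)   # cluster id for every point

# Silhouette score lies in [-1, 1]; higher means better-separated clusters.
kmeans_silhouette = silhouette_score(X, kmeans_labels)
print(f"inertia: {kmeans.inertia_:.1f}, silhouette: {kmeans_silhouette:.3f}")
print(kmeans.cluster_centers_.round(2))
```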

In [None]:
# Task 8: DBSCAN 
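
Unlike K-means, DBSCAN does not take a cluster count; it grows clusters from dense regions and labels points in sparse regions as noise (`-1`). The sketch below uses the two-moons toy data from `make_moons` (an assumption), and the `eps` and `min_samples` values are illustrative and would need tuning on other data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# ASSUMPTION: synthetic two-moons data, a shape that centroid-based methods handle poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=2023)

dbscan = DBSCAN(eps=0.2, min_samples=5)
db_labels = dbscan.fit_predict(X)       # -1 marks points treated as noise

# Count clusters (excluding the noise label) and noise points.
n_clusters = len(set(db_labels.tolist()) - {-1})
n_noise = int(np.sum(db_labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```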

## Quick Overview of Other Tasks and Use Cases 