# Machine Learning with Python
<hr>

## Concept

### **Definition:** <br>
  The ability to learn without being explicitly programmed (by finding patterns in data)

### **Major machine learning techniques:** <br>
- Regression/Estimation
  - Predicting continuous values
- Classification
  - Predicting the class/category of a case
- Clustering
  - Finding the structure of data; summarization
- Association
  - Frequent co-occurring items/events
- Anomaly detection
  - Discovering abnormal and unusual cases
- Sequence mining
  - Predicting next events; click stream
- Dimension Reduction
  - Reduce the size of data
- Recommendation systems
  - Recommending items

### **Difference between AI and Machine Learning:**
- AI
  - Computer vision
  - Language processing
  - Creativity
- Machine Learning
  - Classification
  - Clustering
  - Neural Network
- Revolution in ML:
  - Deep learning
<hr>

## Python
- NumPy
- SciPy
- matplotlib
- pandas
- scikit-learn [check the tutorial](https://www.youtube.com/watch?v=hDKCxebp88A)
  - classification / regression / clustering algorithms
  - data preprocessing -> Train/test split -> Algorithm setup -> Model fitting -> Prediction -> Evaluation -> Model export
<hr>

## Supervised vs Unsupervised
- **Supervised** (controlled environment)
  - Using labeled data to teach the model; labels are *numerical* or *categorical*
  - **classification** & **regression**
- **Unsupervised** (less controlled environment)
  - Dimension reduction
  - Density estimation
  - Market basket analysis
  - **Clustering**: grouping data points or objects that are somehow similar
<hr><hr>

## Machine Learning

### Regression
- Variables
  - Dependent variable: **Y** (the value being predicted)
  - Independent variable: **X** (the input feature)

- **Simple linear regression:**<br>
  *residual error*: the difference between the actual and the predicted value; the *mean squared error (MSE)* summarizes how large the residuals are over the dataset.
- **Multiple linear regression:**<br>
  *optimized parameters*
  - ordinary least squares: closed-form solution, but takes time for large datasets
  - optimization algorithm
    - gradient descent
    - the proper approach for very large datasets
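
The two ways of finding the parameters can be compared on a tiny example. This is a minimal sketch for simple linear regression only; the function names, the learning rate, the epoch count and the data are illustrative assumptions, not part of the notes.

```python
def fit_ols(xs, ys):
    """Closed-form ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def fit_gradient_descent(xs, ys, lr=0.01, epochs=5000):
    """Minimize the MSE of y = b0 + b1 * x with batch gradient descent."""
    n = len(xs)
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        # Residuals (predicted minus actual) under the current parameters.
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to b0 and b1.
        grad_b0 = 2 / n * sum(errors)
        grad_b1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1


# Made-up data that roughly follows y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
ols_b0, ols_b1 = fit_ols(xs, ys)
gd_b0, gd_b1 = fit_gradient_descent(xs, ys)
```

On this data both approaches should agree closely. Gradient descent needs many small steps, but each step only requires one pass over the data, which is why it scales to datasets where the closed-form solution becomes expensive.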

### Model Evaluation in Regression
- Dataset
  - Train/Test
- Training Accuracy
  - Over-fit: the model captures noise in the training data and does not generalize to unseen data
- Out-of-Sample Accuracy
  - Should have HIGH out-of-sample accuracy
  - Using *train/test split* evaluation approach
  - *K-fold cross validation*: average the accuracy over the k folds
- Evaluation Metrics
  - **MAE** mean absolute error
  - **MSE** mean squared error
  - **RMSE** root mean squared error
  - **RAE** relative absolute error
  - **RSE** relative squared error
    - $R^2 = 1 - RSE$
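
The metrics listed above can be written out directly. This is a sketch using the usual textbook definitions; `regression_metrics` is a hypothetical helper name and the numbers are made up.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE, RAE, RSE and R^2 for two equal-length lists."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
    sq_err = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mae = sum(abs_err) / n
    mse = sum(sq_err) / n
    rmse = math.sqrt(mse)
    # Relative errors compare the model against always predicting the mean.
    rae = sum(abs_err) / sum(abs(a - mean_actual) for a in actual)
    rse = sum(sq_err) / sum((a - mean_actual) ** 2 for a in actual)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "RAE": rae, "RSE": rse, "R2": 1 - rse}


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
# MAE = 0.25, MSE = 0.125, RAE = 0.125, RSE = 0.025, R2 = 0.975
```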

### Non-Linear Regression
- **Polynomial Regression**
  - a polynomial regression model can be transformed into a linear regression model (treat $x, x^2, x^3, \dots$ as separate features) -> solve with *Least Squares*
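
To make the transformation concrete: each `x` is expanded into the features `[1, x, x^2, ...]`, and the coefficients are then found with ordinary least squares. This is a minimal sketch that solves the normal equations with a small hand-written solver; the helper names and the noise-free data are illustrative assumptions.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares fit of y = b0 + b1*x + ... + bd*x^d."""
    # The model is linear in these expanded features.
    rows = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]  # noise-free quadratic
coeffs = fit_polynomial(xs, ys, degree=2)        # close to [1.0, 2.0, 3.0]
```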

### Classification
- Supervised learning approach
- Binary(0,1) or multi-class classification
- **Algorithm**
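  - Decision Trees (ID3, C4.5, C5.0)
    1. choose an *attribute*
    2. calculate the significance of the attribute for splitting the data
        - More *Predictiveness*, Less *Impurity*, Lower *Entropy* (randomness or uncertainty)
        - *information gain* = entropy before the split minus the weighted entropy after the split (the increase in certainty); choose the attribute with the **HIGHER** gain
    3. split the data based on the value of the best attribute
    4. go back to step 1 for each branch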
  - Naive Bayes
  - Linear Discriminant Analysis
  - k-Nearest Neighbor
    1. pick a value for k (by examining the *accuracy* for different values)
    2. calculate the distance from the new case to all cases in the dataset
    3. find the k nearest data points
    4. choose the most common class among them
  - Logistic Regression, suitable when:
    - the target is binary
    - probabilistic results are needed
    - the decision boundary is linear
    - you need to understand the impact of each feature
  - Neural Networks
  - Support Vector Machines (SVM)
    - Data *transformation*
    - Kernelling (integrated in the library):
      - Linear
      - Polynomial
      - RBF
      - Sigmoid
    - find the *hyperplane* with the *largest* margin, which is determined by the *support vectors*
    - Steps:
      1. map the data to a *high-dimensional* feature space
      2. find a *separator*
- Evaluation Metrics
  - Jaccard index
  - F1-Score (computed from the confusion matrix: TP/FN/FP/TN)
    - $Precision = TP/(TP + FP)$
    - $Recall = TP/(TP + FN)$
    - $F1 = 2 \times (Precision \times Recall)/(Precision + Recall)$
  - Log loss
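
The entropy and information gain used in the decision tree steps above can be sketched as follows. The attribute names and the toy labels are made up for illustration; a perfectly predictive attribute removes all uncertainty and so has the highest gain.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


labels = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "sunny", "rainy", "rainy", "sunny", "rainy"]  # separates the classes
windy = ["a", "b", "a", "b", "a", "b"]                            # barely informative

gain_outlook = information_gain(labels, outlook)
gain_windy = information_gain(labels, windy)
# The tree would split on `outlook` first because its gain is higher.
```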

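The k-Nearest Neighbor steps listed above fit in a few lines. This sketch assumes Euclidean distance and a simple majority vote; `knn_predict` and the sample points are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training case.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest cases and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]
near_a = knn_predict(train_points, train_labels, (2, 2), k=3)
near_b = knn_predict(train_points, train_labels, (6, 5), k=3)
```

In practice k is chosen by trying several values on held-out data and keeping the one with the best accuracy.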

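The classification metrics above can be computed from the confusion-matrix counts. This is a sketch for the binary case; the Jaccard index is taken here as TP / (TP + FP + FN) for the positive class, which is one common definition and an assumption on my part. The helper names and the labels are made up.

```python
import math


def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(actual, predicted, positive=1):
    # Assumes at least one predicted and one actual positive (no zero division).
    tp, fp, fn, _tn = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    jaccard = tp / (tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


def log_loss(actual, probabilities):
    """Mean negative log-likelihood; probabilities must lie strictly in (0, 1)."""
    n = len(actual)
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(actual, probabilities))
    return -total / n


actual = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 1, 0, 0]
scores = classification_metrics(actual, predicted)
# TP=3, FP=1, FN=1, TN=3 -> precision = recall = F1 = 0.75, Jaccard = 0.6
```

Log loss works on predicted probabilities rather than hard labels, so a confident correct prediction is rewarded and a confident wrong one is penalized heavily.
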
### Clustering
**Unlabeled dataset** <br>
A cluster is a group of objects that are *similar to other objects* in the same cluster and *dissimilar to data points* in other clusters.<br>
#### Algorithm
- Partition-based
  - Relatively efficient
  - k-Means, k-Median, Fuzzy c-Means
- Hierarchical Clustering
  - Produces trees of clusters
  - Agglomerative, Divisive
- Density-based
  - Produces arbitrarily shaped clusters
  - DBSCAN
#### K-Means
- divides the data into *non-overlapping* subsets
- Steps:
  1. choose k and initialize the k *centroids* (e.g. chosen randomly)
  2. calculate the distance from each point to each centroid and store it in a distance matrix
  3. assign each point to the closest centroid
  4. the SSE (sum of squared errors) is usually still high, so compute new centroids as the mean of each cluster
  5. repeat steps 2-4 until the centroids no longer change
- Accuracy
  - external approach
    - compare with ground truth (which is not available in most cases)
  - internal approach
    - average the distance between data points within a cluster
  - as k increases, the distance from the points to their centroid always decreases, so a larger k always looks better; choose the **elbow point**, where the rate of decrease sharply shifts
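
The K-Means steps above can be sketched in plain Python. This is a minimal version that assumes Euclidean distance and picks the initial centroids randomly from the data points; the function name, the seed and the toy points are illustrative assumptions.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, assignments, sse) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial centroids
    assignments = None
    for _ in range(max_iter):
        # Steps 2-3: assign each point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # step 5: nothing changed, stop
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    sse = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))
    return centroids, assignments, sse


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, assignments, sse = k_means(points, k=2)

# SSE for several k values; plotting these would show the elbow.
sse_by_k = {k: k_means(points, k)[2] for k in range(1, 5)}
```

Because the initial centroids are random, different seeds can lead to different local optima on harder data, so the algorithm is usually run several times and the result with the lowest SSE is kept.
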
#### Hierarchical
*each node is a cluster* that consists of the clusters of its daughter nodes
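- Agglomerative is going UP (bottom-up: start from single points and merge)
  - Steps:
    1. create *n* clusters, one for each data point
    2. compute the proximity matrix
    3. repeat until only a single cluster remains:
       - merge the two closest clusters
       - update the proximity matrix
- Divisive is going DOWN (top-down: start from one cluster and split)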
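
The agglomerative steps above can be sketched as follows. This version assumes single linkage (the distance between two clusters is the distance between their closest members) and recomputes distances on every pass instead of maintaining a proximity matrix, which is simpler but slower. The function name, the `target_clusters` parameter and the data are hypothetical.

```python
import math


def agglomerative(points, target_clusters=1):
    """Merge clusters bottom-up; return (clusters, merge history) as index lists."""
    clusters = [[i] for i in range(len(points))]  # step 1: n singleton clusters

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    merges = []
    while len(clusters) > target_clusters:
        # Steps 2-3: find the two closest clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three_clusters, _ = agglomerative(points, target_clusters=3)
full_tree, merge_history = agglomerative(points)  # merge down to one cluster
```

Stopping at `target_clusters` corresponds to cutting the tree at a chosen level; the merge history is what a dendrogram would draw.
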
#### DBSCAN
Density-Based Spatial Clustering of Applications with Noise <br>
Locates regions of *high density* and separates outliers; used for class identification.
- R (Radius of neighborhood)
- M (Min number of neighbors)
- each point is either:
  - core
    - has at least M neighbors within R
  - border
    - has fewer than M neighbors within R
    - reachable from a core point
  - outlier
    - not reachable from any core point
- core points that lie within R of each other are connected into ONE cluster, together with their border points
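
The core/border/outlier rules above can be sketched as follows. This version assumes Euclidean distance and counts a point as its own neighbor, which is a common convention but an assumption here; `radius` and `min_neighbors` stand for R and M, and the data is made up.

```python
import math


def dbscan(points, radius, min_neighbors):
    """Return (labels, is_core); label -1 marks an outlier."""
    n = len(points)

    def neighbors_of(i):
        # The neighborhood includes the point itself (assumed convention).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= radius]

    neighborhoods = [neighbors_of(i) for i in range(n)]
    is_core = [len(nb) >= min_neighbors for nb in neighborhoods]
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue  # only an unvisited core point starts a new cluster
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            current = frontier.pop()  # always a core point
            for j in neighborhoods[current]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    if is_core[j]:
                        frontier.append(j)  # border points are not expanded
        cluster_id += 1
    return labels, is_core


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels, is_core = dbscan(points, radius=1.5, min_neighbors=4)
# (2, 1) is a border point of the first cluster; (50, 50) is an outlier.
```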


