# Study Guide ML-Specialist 

## High-level Exam Objectives

### Section 1 - Evaluate business problem including ethical implications
- 1.1 Understand business requirements
- 1.2 Understand what data is available
- 1.3 Understand ethical challenges in the business problem
- 1.4 Perform AI design thinking
- 1.5 Assess progress on the AI Ladder


### Section 2 - Exploratory Data Analysis including data preparation
- 2.1 Identify the methods used to clean, label, and anonymize data
- 2.2 Visualize data
- 2.3 Balance and partition data


### Section 3 - Implement the proper model
- 3.1 Implement Supervised Learning: Regression
- 3.2 Implement Supervised Learning: Classification
- 3.3 Implement Unsupervised Learning: Clustering
- 3.4 Implement Unsupervised Learning: Dimensional Reduction


### Section 4 - Refine and deploy the model
- 4.1 Identify operations and transformations taken to select and engineer features
- 4.2 Select the proper tools
- 4.3 Configure the appropriate environment specifications for training the model
- 4.4 Train the model and optimize hyperparameters
- 4.5 Implement the ability for the model to explain itself
- 4.6 Deploy the model


### Section 5 - Monitor models in production
- 5.1 Assess the model
- 5.2 Monitor the model in production
- 5.3 Determine if there is unfair bias in the model

## 1.1. Understand business requirements
#### SUBTASKS:
- 1.1.1. Explain how IBM Garage Methodology works
- 1.1.2. Understand the CRISP-DM process
- 1.1.3. Identify which business opportunity to prioritize and define success metrics for an MVP
    - REFERENCES:
        - https://www.ibm.com/garage
        - https://thinkinsights.net/digital/crisp-dm/



## 1.2. Understand what data is available
#### SUBTASKS:
- 1.2.1. Use SQL to access data
- 1.2.1.1. Extracting specific columns
- 1.2.1.2. Filtering data
- 1.2.1.3. Combining Tables
- 1.2.2. Use Python APIs to access data
- 1.2.3. Scrape information from a website
- 1.2.4. Read data into a Pandas Dataframe
- 1.2.4.1. Reading different types of data assets
- 1.2.4.2. Manipulating column names
- 1.2.4.3. Obtaining specific rows
    - REFERENCES:
        - https://www.w3schools.com/sql/
        - https://realpython.com/python-api/
        - https://www.crummy.com/software/BeautifulSoup/bs4/doc/
        - https://pandas.pydata.org/docs/getting_started/index.html



## 1.3. Understand ethical challenges in the business problem
#### SUBTASKS:
- 1.3.1. List potential sources of unfair bias
- 1.3.2. List potential sources of privacy violations
- 1.3.3. List potential secondary and tertiary effects of the application
- 1.3.4. Plan to prevent or mitigate negative consequences
    - REFERENCES:
        - https://learn.ibm.com/course/view.php?id=8390
        - https://learning.oreilly.com/library/view/ai-fairness/9781492077664/Introduction
        - https://www.ibm.com/design/thinking/page/courses/AI_Essentials
        - https://www.designethically.com/layers
        - https://www.ibm.com/design/ai/ethics/



## 1.4. Perform AI design thinking
#### SUBTASKS:
- 1.4.1. Align on user intents for a solution
- 1.4.2. Document the available data
- 1.4.3. Determine what training will be required
- 1.4.4. Create hypotheses about what the behavior of the system will be
- 1.4.5. Assess feasibility and refine if needed
- 1.4.6. Consider direct and indirect effects of the solution
    - REFERENCES:
        - https://www.ibm.com/design/thinking/page/courses/AI_Essentials
        - https://www.ibm.com/design/thinking/page/toolkit/activity/ai-essentials-intent
        - https://www.ibm.com/design/thinking/static/team-essentials-for-ai-workbook-8dc9aadb2cc2dc6343cc5e420b522ca2.pdf
        - https://learning.oreilly.com/library/view/operationalizing-ai/9781098101329/ --- Chapter 3



## 1.5. Assess progress on the AI Ladder
#### SUBTASKS:
- 1.5.1. Assess progress in collecting data
- 1.5.2. Assess progress in organizing data
- 1.5.3. Assess progress in analyzing data
- 1.5.4. Assess progress in infusing AI into the organization
    - REFERENCES:
        - https://www.ibm.com/downloads/cas/O1VADKY2
        - https://learn.ibm.com/course/view.php?id=8496




## 2.1. Identify the methods used to clean, label, and anonymize data

#### SUBTASKS:
- 2.1.1. Clean data
    - 2.1.1.1. Fill or drop missing values
    - 2.1.1.2. Remove duplicate rows
    - 2.1.1.3. Remove outliers
    - 2.1.1.4. Converting data types
    - 2.1.1.5. Data normalization
    
    
- 2.1.2. Label data
    - 2.1.2.1. Understand the benefits and challenges to labeling data
    - 2.1.2.2. Explain data labeling approaches
    
    
- 2.1.3. Anonymize data
    
REFERENCES:
- https://www.ibm.com/garage/method/practices/reason/prepare-data-for-machine-learning/
- https://www.ibm.com/garage/method/practices/code/data-preparation-ai-data-science/
- https://www.ibm.com/cloud/learn/data-labeling
- https://dataplatform.cloud.ibm.com/docs/content/wsj/governance/dmg22.html

## 2.1.1 Clean data

### 2.1.1.1. Fill or drop missing values

- Remove missing values
    - `DataFrame.dropna([axis, how, thresh, ...])`


- Fill NA/NaN values using the specified method.
    - `DataFrame.fillna([value, method, axis, ...])`
	
    
- Detect missing values.
    - `DataFrame.isna()`
	
    
- Replace values given in to_replace with value.	
    - `DataFrame.replace([to_replace, value, ...])`
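
The four calls above can be tried on a toy frame. This is a minimal sketch; the column names and the `"unknown"` sentinel are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 40.0],
    "city": ["Rome", "Oslo", None, "unknown"],
})

# Detect: count the missing values in each column.
missing_per_column = df.isna().sum()

# Drop: keep only the rows that have no missing value at all.
dropped = df.dropna(axis=0, how="any")

# Fill: replace missing prices with the column median.
filled = df.fillna({"price": df["price"].median()})

# Replace: turn a sentinel string into a real missing value.
replaced = df.replace("unknown", np.nan)
```

Whether to drop or fill depends on how much data is missing and on whether the missingness itself carries information.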
	

### 2.1.1.2. Remove duplicate rows

- Return DataFrame with duplicate rows removed.
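    - `new_df = df.drop_duplicates(keep='first')`
    - `keep='first'` (the default) keeps one copy of each duplicated row; `keep=False` drops every copy. With the default `inplace=False`, a new DataFrame is returned and `df` is left unchanged.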
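
A short sketch of how the `keep` argument changes the result, using a made-up frame:

```python
import pandas as pd

# Hypothetical data: rows 0/1 and rows 3/4 are exact duplicates.
df = pd.DataFrame({"id": [1, 1, 2, 3, 3], "value": ["a", "a", "b", "c", "c"]})

# keep="first": one copy of every duplicated row survives.
first_kept = df.drop_duplicates(keep="first")

# keep=False: every row that has a duplicate is removed entirely.
none_kept = df.drop_duplicates(keep=False)
```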


### 2.1.1.3. Remove outliers

#### various ways of outlier detection:
- Z-Score
    - A z-score simply tells you how many standard deviations away an individual data value falls from the mean.
    

- IQR-distance from Median
    - Interquartile range. The IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
    
    - sklearn's RobustScaler 
        - Scale features using statistics that are robust to outliers. This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). 
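
Both detection rules can be sketched with pandas. The cut-offs used here (an absolute z-score above 3, and 1.5 × IQR beyond the quartiles) are common conventions, not values taken from this guide, and the data is invented.

```python
import pandas as pd

# Hypothetical sample: twenty ordinary values and one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95])

# Z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z_scores.abs() > 3]

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Removing the outliers is then a simple boolean filter.
cleaned = values[(values >= lower) & (values <= upper)]
```

The z-score rule is itself sensitive to extreme values, because they inflate the mean and standard deviation; the IQR rule is less affected.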



### 2.1.1.4. Converting data types

When working with missing <b>Numerical</b> values:
- dropna()
    - `df.dropna(subset=['price'])`


- drop()
    - `df.drop('price', axis=1)`


- fillna()
    - `median = df['price'].median()`
    - `df['price'] = df['price'].fillna(median)`
      
      
- scikit-learn `SimpleImputer`
    
    `from sklearn.impute import SimpleImputer`
    
    `imputer = SimpleImputer(strategy='median')`
    
    `price_num = df.select_dtypes(include='number')` (the median strategy only works on numeric columns)
    
    `imputer.fit(price_num)`
    
    `X = imputer.transform(price_num)`
    
    `transform_df = pd.DataFrame(X, columns=price_num.columns, index=price_num.index)`
    



When working with missing <b>Text</b> and <b>Categorical</b> attributes:


- Ordinal encoding using scikit-learn `OrdinalEncoder`

    `wrd_cat = df[['words']]`

    `from sklearn.preprocessing import OrdinalEncoder
    ordinal_encoder = OrdinalEncoder()
    wrd_cat_encoded = ordinal_encoder.fit_transform(wrd_cat)`
    
- One-hot encoding using scikit-learn `OneHotEncoder`

    `wrd_cat = df[['words']]`

    `from sklearn.preprocessing import OneHotEncoder
    cat_encoder = OneHotEncoder()
    wrd_cat_1hot = cat_encoder.fit_transform(wrd_cat)`
    
    The result is a SciPy sparse matrix. To convert it to a (dense) NumPy array, use `toarray()`:
    
    `wrd_cat_1hot.toarray()`

### 2.1.1.5. Data normalization

- Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1. Not every dataset requires normalization; it is needed mainly when features have different ranges.

- Min-max scaling (aka normalization) is the simplest form of feature scaling. Values are shifted and rescaled so that they end up ranging from 0 to 1. 
    - scikit-learn MinMaxScaler


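- Z Normalization (Standardization): subtracts the mean and divides by the standard deviation, so each feature ends up with zero mean and unit variance. Unlike min-max scaling, the result is not bounded to a fixed range.
    - scikit-learn StandardScaler


- Unit Vector Normalization: rescales each sample (row) so that its vector has length 1.
    - scikit-learn Normalizer
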

<b>Good practice usage with the MinMaxScaler and other scaling techniques is as follows:</b>

- Fit the scaler using available training data. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the fit() function.
    
    
- Apply the scale to training data. This means you can use the normalized data to train your model. This is done by calling the transform() function.
    
    
- Apply the scale to data going forward. This means you can prepare new data in the future on which you want to make predictions.
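
The three steps above can be sketched with scikit-learn's `MinMaxScaler`. The array is a made-up stand-in for real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: 10 samples, 2 features.
X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

scaler = MinMaxScaler()

# 1. Fit on the training data only: this estimates each feature's min and max.
scaler.fit(X_train)

# 2. Apply the fitted scale to the training data.
X_train_scaled = scaler.transform(X_train)

# 3. Apply the same fitted scale to new or held-out data (no re-fitting).
X_test_scaled = scaler.transform(X_test)
```

Because the minimum and maximum come from the training data, scaled values for new data can fall slightly outside the 0 to 1 range.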



## 2.1.2. Label data



### 2.1.2.1. Understand the benefits and challenges to labeling data

<b>Benefits</b>

Data labeling provides users, teams and companies with greater context, quality and usability. More specifically, you can expect:


- More Precise Predictions: Accurate data labeling ensures better quality assurance within machine learning algorithms, allowing the model to train and yield the expected output. Otherwise, as the old saying goes, “garbage in, garbage out.” Properly labeled data  provide the “ground truth” (i.e., how labels reflect “real world” scenarios) for testing and iterating subsequent models.


- Better Data Usability: Data labeling can also improve usability of data variables within a model. For example, you might reclassify a categorical variable as a binary variable to make it more consumable for a model.  Aggregating data in this way can optimize the model by reducing the number of model variables or enable the inclusion of control variables. Whether you’re using data to build computer vision models (i.e. putting bounding boxes around objects) or NLP models (i.e. classifying text for social sentiment), utilizing high-quality data is a top priority.


<b>Challenges</b>

Data labeling is not without its challenges. In particular, some of the most common challenges are:

- Expensive and time-consuming: While data labeling is critical for machine learning models, it can be costly from both a resource and time perspective. If a business takes a more automated approach, engineering teams will still need to set up data pipelines prior to data processing, and manual labeling will almost always be expensive and time-consuming.


- Prone to Human-Error: These labeling approaches are also subject to human-error (e.g. coding errors, manual entry errors), which can decrease the quality of data. This, in turn, leads to inaccurate data processing and modeling. Quality assurance checks are essential to maintaining data quality.


### 2.1.2.2. Explain data labeling approaches

Data labeling is a critical step in developing a high-performance ML model. Though labeling appears simple, it’s not always easy to implement. As a result, companies must consider multiple factors and methods to determine the best approach to labeling. Since each data labeling method has its pros and cons, a detailed assessment of task complexity, as well as the size, scope and duration of the project is advised.

<b>Here are some paths to labeling your data:</b>

- Internal labeling - Using in-house data science experts simplifies tracking, provides greater accuracy, and increases quality. However, this approach typically requires more time and favors large companies with extensive resources.
   
   
- Synthetic labeling - This approach generates new project data from pre-existing datasets, which enhances data quality and time efficiency. However, synthetic labeling requires extensive computing power, which can increase pricing.


- Programmatic labeling - This automated data labeling process uses scripts to reduce time consumption and the need for human annotation. However, the possibility of technical problems requires human-in-the-loop (HITL) review to remain a part of the quality assurance (QA) process.


- Outsourcing - This can be an optimal choice for high-level temporary projects, but developing and managing a freelance-oriented workflow can also be time-consuming. Though freelancing platforms provide comprehensive candidate information to ease the vetting process, hiring managed data labeling teams provides pre-vetted staff and pre-built data labeling tools.


- Crowdsourcing - This approach is quicker and more cost-effective due to its micro-tasking capability and web-based distribution. However, worker quality, QA, and project management vary across crowdsourcing platforms. One of the most famous examples of crowdsourced data labeling is reCAPTCHA. This project was two-fold in that it controlled for bots while simultaneously improving data annotation of images. For example, a reCAPTCHA prompt would ask a user to identify all the photos containing a car to prove that they were human, and then this program could check itself based on the results of other users. The input from these users provided a database of labels for an array of images.



## 2.1.3. Anonymize data

Data anonymization helps you protect sensitive data, such as personally identifiable information or restricted business data to avoid the risk of compromising confidential information. It is defined in policy rules that are enforced for an asset. Depending on the method of data anonymization, data is redacted, masked, or substituted in the asset preview. 

<b>Methods to anonymize data:</b>

- Redact data values in asset columns.
    This method replaces each data value with a string of exactly ten letters of X to remove information that is, for example, identifying or otherwise sensitive. With redacted data, neither the format of the data nor referential integrity is retained.


- Substitute data values in asset columns.
    This method replaces data with values that don’t match the original format. It preserves referential integrity (RI) to ensure that table relationships are consistent.
    If a value is used several times in a column with substituted data, Substitute uses the same substitution value for identical data values.
    For example, if a column contains the email address userA@example.com several times, each finding is replaced by the same substitution value, such as: 500ddcc98133703531re3456.


- Mask data values in asset columns that contain SSN (US social security numbers) data.
    This method replaces data types like SSN with similarly formatted values that match the original format. It does not preserve referential integrity (RI) or data distribution. 
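
To make the difference between redaction and substitution concrete, here is a rough pandas sketch. It is not how the IBM platform implements these methods; the `redact` and `substitute` helpers, the salt, and the 24-character token length are all assumptions made for illustration.

```python
import hashlib

import pandas as pd

# Hypothetical column with a repeated value.
df = pd.DataFrame(
    {"email": ["userA@example.com", "userB@example.com", "userA@example.com"]}
)


def redact(_value):
    """Replace any value with ten X characters (format and identity are lost)."""
    return "X" * 10


def substitute(value, salt="demo-salt"):
    """Map identical inputs to the same opaque token (illustrative hashing only)."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:24]


df["email_redacted"] = df["email"].map(redact)
df["email_substituted"] = df["email"].map(substitute)
```

In the substituted column the two `userA@example.com` rows share one token, which is what keeps joins between tables consistent; in the redacted column every row looks the same.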


## 2.2. Visualize data
#### SUBTASKS:


- 2.2.1. Choose the column(s) from your dataset to be visualized
- 2.2.2. Identify what the visualization should describe about the column(s)
    - 2.2.2.1. Distribution
    - 2.2.2.2. Correlation
    - 2.2.2.3. Comparison
    - 2.2.2.4. Time Series
- 2.2.3. Select a type of chart based on the descriptive need
    - 2.2.3.1. Histogram/Box plot/Violin plot
    - 2.2.3.2. Scatterplot/Heatmap
    - 2.2.3.3. Bar chart
    - 2.2.3.4. Line plot
- 2.2.4. Select a library or tool for visualization
    - 2.2.4.1. Matplotlib
    - 2.2.4.2. Seaborn
    - 2.2.4.3. Bokeh
    - 2.2.4.4. Plotly
- 2.2.5. Plot the visualization


REFERENCES:
- https://seaborn.pydata.org/introduction.html
- https://matplotlib.org/stable/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py
- https://docs.bokeh.org/en/latest/docs/first_steps.html
- https://plotly.com/python/
- https://learn.ibm.com/course/view.php?id=8794
- https://learning.oreilly.com/library/view/statistics-in-a/9781449361129/ 

## 2.2.1. Choose the column(s) from your dataset to be visualized

Explore data with:
- df.info()
- df.describe()
- df.shape
- df.dtypes
- df.head() / df.tail()
- df['col']
- df.sum()
- df.max() / df.min()
- df['col'].isnull().sum()
- df.isnull().sum().sum()


## 2.2.2. Identify what the visualization should describe about the column(s)

### 2.2.2.1. Distribution
-  Histogram 
    - DataFrame.plot.hist(by=None, bins=10, **kwargs)
    - A histogram is a representation of the distribution of data. This function groups the values of all given Series in the DataFrame into bins 

- Box plot
    - Another useful plot to see the data distribution is the box plot. You can plot it with df.plot.box()
    
- Kernel Density Estimation (KDE) 
    - KDE plots smooth the histogram into a continuous curve, so you do not have to choose a bin size (a bandwidth parameter controls the smoothing instead).
    
- Violin plots 
    - Violin plots combine box plots and KDE plots: the box plot inside shows the summary statistics, and the KDE on the sides shows the shape of the distribution.
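
The four distribution plots above can be drawn side by side with pandas and Matplotlib. This is a sketch on randomly generated data; the `price` column is hypothetical, and the KDE plot assumes SciPy is installed.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 500 draws from a normal distribution.
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.normal(loc=100, scale=15, size=500)})

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
df["price"].plot.hist(bins=20, ax=axes[0], title="Histogram")
df["price"].plot.box(ax=axes[1], title="Box plot")
df["price"].plot.kde(ax=axes[2], title="KDE")
axes[3].violinplot(df["price"].to_numpy())
axes[3].set_title("Violin plot")
fig.tight_layout()
```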


### 2.2.2.2. Correlation
- correlation matrix 
    - A correlation matrix is a matrix that shows the correlation values of the variables in the dataset. df.corr()
    
- correlation heatmap 
    -  plots the correlation as a heatmap. seaborn.heatmap(df.corr())
    
- Correlation Scatter Plot
    - seaborn.regplot() draws a scatter plot of two variables and can also draw a linear regression line through it. The fit_reg parameter turns the line on or off; it defaults to True, so the regression line is plotted unless you disable it.
    
- pair plots
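
A correlation matrix and its heatmap can be sketched as follows. To keep dependencies small, this example draws the heatmap with Matplotlib's `imshow` rather than `seaborn.heatmap`; the data and column names are invented.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: y depends strongly on x, noise is independent.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

# Pairwise Pearson correlation (the default method of DataFrame.corr).
corr = df.corr()

# Heatmap of the matrix, with the color scale fixed to [-1, 1].
fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax)
```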

### 2.2.2.3. Comparison
- Compare values across categories or groups, typically with a bar chart (see 2.2.3.3).


### 2.2.2.4. Time Series
- Show how a value changes over time, typically with a line plot (see 2.2.3.4).

## 2.2.3. Select a type of chart based on the descriptive need
 
### 2.2.3.1. Histogram/Box plot/Violin plot
-  Histogram 
    - DataFrame.plot.hist(by=None, bins=10, **kwargs)
    - A histogram is a representation of the distribution of data. This function groups the values of all given Series in the DataFrame into bins 

- Box plot
    - Another useful plot to see the data distribution is the box plot. You can plot it with df.plot.box()

- Kernel Density Estimation (KDE) 
    - KDE plots smooth the histogram into a continuous curve, so you do not have to choose a bin size (a bandwidth parameter controls the smoothing instead).

- Violin plots 
    - Violin plots combine box plots and KDE plots: the box plot inside shows the summary statistics, and the KDE on the sides shows the shape of the distribution.

### 2.2.3.2. Scatterplot/Heatmap
### 2.2.3.3. Bar chart
### 2.2.3.4. Line plot


## 2.2.4. Select a library or tool for visualization

### 2.2.4.1. Matplotlib
- Most of the Matplotlib utilities lie under the pyplot submodule, which is usually imported under the plt alias:
    - import matplotlib.pyplot as plt

### 2.2.4.2. Seaborn
- Seaborn is a library that uses Matplotlib underneath to plot graphs. It is well suited to visualizing distributions.
- `histplot` takes an array as input and plots a histogram of its values; with `kde=True` it also draws a smoothed density curve. It replaces the older `distplot`, which has been deprecated since seaborn 0.11.
    - import matplotlib.pyplot as plt
    
    import seaborn as sns
    
    sns.histplot([0, 1, 2, 3, 4, 5], kde=True)
    
    plt.show() 
    
### 2.2.4.3. Bokeh
- Bokeh is a Python library for creating interactive visualizations for modern web browsers. It helps you build beautiful graphics, ranging from simple plots to complex dashboards with streaming datasets. With Bokeh, you can create JavaScript-powered visualizations without writing any JavaScript yourself.

### 2.2.4.4. Plotly
- Plotly's Python library (plotly.py) is built on top of the plotly.js JavaScript library, which in turn builds on d3.js. It lets you build interactive charts directly from Python.

## 2.2.5. Plot the visualization

## 2.3. Balance and partition data

#### SUBTASKS:
- 2.3.1. Partition data
- 2.3.1.1. Create train/test/validation splits
    - REFERENCES:
        - https://learn.ibm.com/mod/video/view.php?id=165773 (data leakage mentioned in passing)

- 2.3.1.2. Understand and implement cross validation
    - REFERENCES:
        - https://learn.ibm.com/mod/video/view.php?id=166655
        - https://learn.ibm.com/mod/page/view.php?id=170328&forceview=1

- 2.3.1.3. Prevent data leakage
    - REFERENCES:
        - https://en.wikipedia.org/wiki/Leakage_(machine_learning)
        - https://reproducible.cs.princeton.edu/ (this is a common problem)

- 2.3.1.4. Create data splits that are reproducible
    - REFERENCES:
        - https://cs230.stanford.edu/blog/split/
        - https://learn.ibm.com/mod/video/view.php?id=166646&forceview=1
        - https://learning.oreilly.com/library/view/machine-learning-design/9781098115777/ch06.html#problem-id00022
        
- 2.3.2. Balance data

- 2.3.2.1. Understand why imbalanced data is problematic
    - REFERENCES:
        - https://learn.ibm.com/mod/video/view.php?id=167242
        - https://learn.ibm.com/mod/video/view.php?id=168614
        - https://learn.ibm.com/mod/page/view.php?id=170229&forceview=1

- 2.3.2.2. Understand and implement pros, cons, and how to of up-, down-, and re- sampling
    - REFERENCES:
        - https://learn.ibm.com/mod/video/view.php?id=167243
        - https://learn.ibm.com/mod/video/view.php?id=167247
        - https://learn.ibm.com/mod/video/view.php?id=167246
        - https://learn.ibm.com/mod/page/view.php?id=170230&forceview=1
        - https://learn.ibm.com/mod/video/view.php?id=168610&forceview=1

- 2.3.2.3. Understand and implement other methods to handle imbalanced data, such as weighting and stratified sampling
    - REFERENCES:
        - https://learn.ibm.com/mod/video/view.php?id=167245
        - https://imbalanced-learn.org/stable/index.html (included in

## 2.3.1. Partition data
### 2.3.1.1. Create train/test/validation splits
- from sklearn.model_selection import train_test_split
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    - test_size is the fraction of the data held out for the test set (here 33%); random_state seeds the shuffle so the same split is produced on every run
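
`train_test_split` only makes two parts, so a three-way train/validation/test split is usually done by calling it twice. A minimal sketch, assuming a 60/20/20 split and made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 2 features, balanced binary labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# First cut: hold out 20% of the data as the final test set.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second cut: 25% of the remaining 80% becomes the validation set (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val
)
```

`stratify` keeps the class proportions the same in every part, which matters most when classes are imbalanced.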
    
### 2.3.1.2. Understand and implement cross validation
- With cross-validation (CV for short), a test set should still be held out for final evaluation, but a separate validation set is no longer needed. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

- A model is trained using k - 1 of the folds as training data; the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

- The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.
    - from sklearn import svm
    - from sklearn.model_selection import cross_val_score
    - clf = svm.SVC(kernel='linear', C=1, random_state=42)
    - scores = cross_val_score(clf, X, y, cv=5)
    
- Other cross-validation strategies can be used by passing a cross-validation iterator instead, for instance:
    - from sklearn.model_selection import ShuffleSplit
    - n_samples = X.shape[0]
    - cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
    - cross_val_score(clf, X, y, cv=cv)

Links: 
- https://www.kaggle.com/code/alexisbcook/cross-validation
- https://scikit-learn.org/stable/modules/cross_validation.html
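
To see what `cross_val_score` does internally, the k-fold procedure described above can be written out by hand and compared with the helper. This sketch uses the iris dataset bundled with scikit-learn as a stand-in for real data.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1, random_state=42)

# Five folds, shuffled once with a fixed seed so the folds are repeatable.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: train on k-1 folds, score on the remaining fold.
manual_scores = []
for train_idx, val_idx in kfold.split(X):
    clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(clf.score(X[val_idx], y[val_idx]))

# The helper runs the same procedure on the same folds.
helper_scores = cross_val_score(clf, X, y, cv=kfold)

mean_accuracy = float(np.mean(helper_scores))
```

The mean of the fold scores is the usual summary; the spread across folds gives a rough idea of how stable the estimate is.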

### 2.3.1.3. Prevent data leakage
- Data leakage is the scenario where the machine learning model has already seen some part of the test data during training. Evaluation results then look better than the model will actually perform on new data, which hides overfitting.

- In machine learning, data leakage refers to a mistake made by the creator of a model in which information is accidentally shared between the test and training data sets.

Ways to Help Prevent Data Leakage:
- Understanding the Dataset
- Cleaning Dataset for Duplicates
- Selecting Features with Regard to Target Variable Correlation and Temporal Ordering
- Splitting Dataset into Train, Validation, and Test Groups
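- Normalizing after splitting: fit the scaler on the training data only. With cross-validation, re-fit it inside each fold (for example with a scikit-learn Pipeline) rather than once before cross-validation.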
- Assessing Model Performance with a Healthy Skepticism
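
One way to keep preprocessing from leaking information is to put the scaler and the model in a single scikit-learn `Pipeline`, so the scaler is re-fit on the training portion of every fold. A minimal sketch; the iris dataset and the logistic regression model are stand-ins chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split first, so the test set never influences any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaler and model travel together; fit() on the pipeline fits both.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# In each fold the scaler is fit on that fold's training part only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final check on the untouched test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```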


###  2.3.1.4. Create data splits that are reproducible
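- Pass a fixed random_state (seed) to the splitting function and keep test_size and the input data (including row order) unchanged; the same rows then land in the same split on every run.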
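
A quick check that a fixed seed gives an identical split, using made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two independent calls with the same test_size and random_state.
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)

# Every returned array (X_train, X_test, y_train, y_test) matches.
same_seed_matches = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
```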
  

## 2.3.2. Balance data

### 2.3.2.1. Understand why imbalanced data is problematic


### 2.3.2.2. Understand and implement pros, cons, and how to of up-, down-, and re- sampling
   
### 2.3.2.3. Understand and implement other methods to handle imbalanced data, such as weighting and stratified sampling


# Section 3 – Implement the proper model

## 3.1. Implement Supervised Learning: Regression

#### SUBTASK(S):

- 3.1.1. Describe Regression
    - REFERENCES:
        - https://towardsdatascience.com/supervised-learning-basics-of-linear-regression-1cbab48d0eba
        
- 3.1.2. Understand the benefits of Regression
    - REFERENCES:
        - https://towardsdatascience.com/supervised-learning-the-what-when-why-good-and-bad-part-1-f90e6fe2a606
        
- 3.1.3. Understand some of the most popular Regression algorithms
    - 3.1.3.1. Gradient Boosting Tree
    - 3.1.3.2. Neural Network
    - 3.1.3.3. Random Forest
    - 3.1.3.4. Linear Regression
    - 3.1.3.5. Decision Tree
        - REFERENCES:
            - https://scikit-learn.org/stable/supervised_learning.html




## 3.2. Implement Supervised Learning: Classification
#### SUBTASK(S):
- 3.2.1. Describe Classification
    - REFERENCES:
        - https://towardsdatascience.com/supervised-learning-the-what-when-why-good-and-bad-part-1-f90e6fe2a606

- 3.2.2. Understand the benefits of Classification
    - REFERENCES:
        - https://www.javatpoint.com/regression-vs-classification-in-machine-learning
        
- 3.2.3. Understand some of the most popular Classification algorithms
    - 3.2.3.1. Naïve Bayes
    - 3.2.3.2. Linear SVM
    - 3.2.3.3. Logistic Regression
    - 3.2.3.4. K-Nearest Neighbors
    - 3.2.3.5. Stochastic Gradient Descent
    - 3.2.3.6. Neural Network
    - 3.2.3.7. Decision Trees & Random Forest
    - 3.2.3.8. Boosting Classifiers
        - REFERENCES:
            - https://analyticsindiamag.com/7-types-classification-algorithms/

## 3.3. Implement Unsupervised Learning: Clustering

#### SUBTASK(S):

- 3.3.1. Describe Clustering
    - REFERENCES:
        - https://machinelearningmastery.com/clustering-algorithms-with-python/
- 3.3.2. Understand the benefits of Clustering
    - REFERENCES:
        - https://www.explorium.ai/blog/clustering-when-you-should-use-it-and-avoid-it/
- 3.3.3. Understand some of the most popular Clustering algorithms
- 3.3.3.1. K-means
- 3.3.3.2. Gaussian Mixture Model
- 3.3.3.3. DBSCAN
    - REFERENCES:
        - https://scikit-learn.org/stable/modules/clustering.html
    - Additional REFERENCES:
        - https://www.statlearning.com/
        - http://www.mmds.org/
        - https://scikit-learn.org/stable/modules/mixture.html#gmm
        - https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf
        - https://www.dbs.ifi.lmu.de/Publikationen/Papers/OPTICS.pdf




## 3.4. Implement Unsupervised Learning: Dimensional Reduction
#### SUBTASK(S):
- 3.4.1. Describe Dimensional Reduction
    - REFERENCES:
        - https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/
- 3.4.2. Understand the benefits of Dimensional Reduction
    - REFERENCES:
        - https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/
- 3.4.3. Understand some of the most popular Dimensional Reduction Algorithms
- 3.4.3.1. Singular Value Decomposition
- 3.4.3.2. Latent Dirichlet Analysis
- 3.4.3.3. Principal Component Analysis
    - REFERENCES:
        - https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/
    - General Reference for differentiation:
        - https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning
    - Additional References
        - https://www.statlearning.com/
        - http://www.mmds.org/
        - https://geometria.math.bme.hu/sites/geometria.math.bme.hu/files/users/csgeza/howard-anton-chris-rorres-elementary-linear-algebra-applications-version-11th-edition.pdf
        - https://jmlr.org/papers/volume18/14-546/14-546.pdf
        - https://en.wikipedia.org/wiki/Curse_of_dimensionality

# Section 4 - Refine and deploy the model


## 4.1. Identify operations and transformations taken to select and engineer features
#### SUBTASK(S):
- 4.1.1. Obtain raw data
- 4.1.2. Engineer features using attributes of the raw data
- 4.1.3. Use automated techniques to augment and/or select features for use in learning
    - REFERENCES:
        - https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
        - https://developer.ibm.com/articles/automated-feature-engineering-for-relational-data-with-ibm-autoai/
        - https://developer.ibm.com/patterns/model-mgmt-on-watson-studio-local/


## 4.2. Select the proper tools
#### SUBTASK(S):
- 4.2.1. Identify the tools required based on the:
- 4.2.1.1. Model type
- 4.2.1.2. Type of data
- 4.2.1.3. Feature engineering requirements
- 4.2.1.4. Amount of automation desired
- 4.2.1.5. Production environment requirements
    - REFERENCES:
        - https://www.ibm.com/garage/method/practices/reason/evaluate-and-select-machine-learning-algorithm/
        - https://developer.ibm.com/articles/cc-models-machine-learning/
        - https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_latest/wsj/analyze-data/ml-overview_local.html
        - https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_latest/wsj/getting-started/tools.html


## 4.3. Configure the appropriate environment specifications for training the model

#### SUBTASK(S):
- 4.3.1. Identify the frameworks supported by Watson Machine Learning
- 4.3.2. Explain the GPU-accelerated computing
    - REFERENCES:
        - https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-overview.html
        - https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/pm_service_supported_frameworks.html


## 4.4. Train the model and optimize hyperparameters
#### SUBTASK(S):
- 4.4.1. Choose and justify the type of algorithm
- 4.4.1.1. Regression
- 4.4.1.2. Classification
- 4.4.1.3. Clustering
- 4.4.1.4. Recommendation engines
- 4.4.1.5. Anomaly detection
- 4.4.2. Describe the trade-offs between underfitting and overfitting a model
- 4.4.2.1. Avoid underfitting or overfitting by splitting the data into training, testing, and validation sets
- 4.4.3. Compare model parameters and hyperparameters
- 4.4.4. Explain hyperparameters and hyperparameter tuning
- 4.4.4.1. Tuning is a trial-and-error process
- 4.4.4.2. Tuning is based on the training output loss value
- 4.4.4.3. Learning rate, number of epochs, hidden layers, hidden units, activation functions
- 4.4.5. Summarize search algorithms
- 4.4.5.1. Grid Search
- 4.4.5.2. Random Search
- 4.4.5.3. Bayesian Optimization
- 4.4.6. Ensemble multiple models
- 4.4.7. Choose and justify the type of algorithm
- 4.4.7.1. Regression
- 4.4.7.2. Classification
- 4.4.7.3. Clustering
- 4.4.7.4. Recommendation engines
- 4.4.7.5. Anomaly detection
    - REFERENCES:
        - https://www.ibm.com/garage/method/practices/reason/optimize-train-ai-model/
        - https://www.ibm.com/docs/en/wmla/2.2.0?topic=optimization-hyperparameter-search-algorithms
        - https://www.ibm.com/garage/method/practices/reason/evaluate-and-select-machine-learning-algorithm/
        - https://developer.ibm.com/articles/cc-models-machine-learning/


## 4.5. Implement the ability for the model to explain itself
#### SUBTASK(S):
- 4.5.1. Determine what user profiles need explanations
- 4.5.2. Determine what sort of explanations will make sense to those users
- 4.5.3. Select and apply algorithms to generate model explanations
- 4.5.3.1. Boolean Decision Rule
- 4.5.3.2. Generalized Linear Rule Model
- 4.5.3.3. ProfWeight
- 4.5.3.4. Teaching Explanations for Decisions (TED)
- 4.5.3.5. Contrastive Explanations
- 4.5.3.6. Disentangled Inferred Prior VAE
- 4.5.3.7. ProtoDash
- 4.5.4. Present explanations in a form that will make sense to the target users
    - REFERENCES:
        - https://learn.ibm.com/course/view.php?id=8717
        - https://learn.ibm.com/course/view.php?id=8718
        - https://aix360.mybluemix.net/


## 4.6. Deploy the model
#### SUBTASK(S):
- 4.6.1. Containerize the model with Docker
- 4.6.2. Embed the model into Spark
- 4.6.3. Deploy the model with Watson Machine Learning
    - REFERENCES:
        - https://learn.ibm.com/course/view.php?id=8797

## 5.1. Assess the model
#### SUBTASK(S):
- 5.1.1. Distinguish metrics for Classification Models
- 5.1.1.1. Explain how a confusion matrix works
- 5.1.1.2. Explain what AUC measures
- 5.1.1.3. ROC curve
    - REFERENCES (plots with ROC curves):
        - https://people.inf.elte.hu/kiss/11dwhdm/roc.pdf
        - https://synapse.koreamed.org/articles/1027596
- 5.1.1.4. Difference in distance measurements
- 5.1.1.4.1. Manhattan
- 5.1.1.4.2. Euclidean
- 5.1.1.4.3. Cosine similarity
    - REFERENCES:
        - https://learning.oreilly.com/library/view/thoughtful-machine-learning/9781491924129/ (Chapter 3, Distances)
        - https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
- 5.1.1.5. Understand how tree-based models determine features to split on
- 5.1.2. Distinguish metrics for Regression Models
- 5.1.2.1. How do L1 and L2 Regularization impact the model features
- 5.1.2.2. Understand distinction between Bias and Variance
- 5.1.2.3. What do MSE and R-Squared measure
- 5.1.2.4. Understand common error metrics to evaluate regression models
- 5.1.3. Distinguish metrics for Unsupervised Models
- 5.1.3.1. How do you determine optimal number of K for K-Means Algorithm
- 5.1.3.2. Explain how the inertia metric is calculated
- 5.1.3.3. Explain how the Distortion metric is calculated
- 5.1.3.4. How can you avoid your centroids getting stuck in bad local optima
- 5.1.4. Identify trade-offs between model performance and computational cost
- 5.1.5. Choose the best metric for the model and business problem
    - REFERENCES:
        - https://scikit-learn.org/stable/modules/model_evaluation.html
        - https://learn.ibm.com/mod/video/view.php?id=166640
        - https://learn.ibm.com/mod/page/view.php?id=170322&forceview=1
        - https://learn.ibm.com/mod/video/view.php?id=166668
        - https://learn.ibm.com/mod/video/view.php?id=166669
        - https://learn.ibm.com/mod/video/view.php?id=166785
        - https://learn.ibm.com/mod/page/view.php?id=170325&forceview=1
        - https://learn.ibm.com/mod/video/view.php?id=166786
        - https://learn.ibm.com/mod/page/view.php?id=170329&forceview=1
        - https://learn.ibm.com/mod/video/view.php?id=169061&forceview=1
    - NOTE: The learning material at https://learn.ibm.com/mod/video/view.php?id=166785 contains an error at minute 5:10. The presenter is talking about specificity and the formula for specificity is displayed, but it is incorrectly labeled “Sensitivity”. The next slide shows the corrected version of the formula.
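
Because sensitivity and specificity are easy to mix up, a minimal sketch of how both follow from the four confusion-matrix counts may help. The labels and predictions below are made-up illustration data, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """True positive rate (recall): share of actual positives that were found."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives that were correctly rejected."""
    return tn / (tn + fp)


# Made-up example: 4 actual positives followed by 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)

print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("sensitivity:", sens)
print("specificity:", spec)
```

Sensitivity uses only the actual-positive row of the confusion matrix (TP and FN), while specificity uses only the actual-negative row (TN and FP).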

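The three distance measures listed under 5.1.1.4 can be compared with a short standard-library sketch. The function names and sample vectors are assumptions made for this illustration only.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance (L2 distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a, b = (1, 0), (4, 4)
print("manhattan:", manhattan(a, b))
print("euclidean:", euclidean(a, b))
print("cosine similarity:", cosine_similarity(a, b))

# Vectors pointing the same way have cosine similarity 1.0
# even though their Euclidean distance is non-zero.
print("same direction:", cosine_similarity((1, 0), (2, 0)), euclidean((1, 0), (2, 0)))
```

Manhattan and Euclidean measure how far apart two points are, whereas cosine similarity measures only whether they point in the same direction.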


## 5.2. Monitor the model in production
#### SUBTASK(S):
- 5.2.1. Understand what MLOps is
    - REFERENCES:
        - https://www.ibm.com/blogs/journey-to-ai/2021/04/paving-the-paths-to-ai-engineering-and-modelops/
        - https://ibm-cloud-architecture.github.io/refarch-data-ai-analytics/methodology/MLops/
        - https://learning.oreilly.com/library/view/introducing-mlops/9781492083283/ch01.html
- 5.2.1.1. Understand types of data drift and their impact
- 5.2.2. Monitor model performance metrics using logging
    - REFERENCES:
        - https://learn.ibm.com/mod/video/view.php?id=169287
        - https://learn.ibm.com/mod/page/view.php?id=169579&forceview=1
        - https://learn.ibm.com/mod/page/view.php?id=169581&forceview=1
        - https://learn.ibm.com/mod/page/view.php?id=169598&forceview=1
- 5.2.3. Monitor model business KPIs
    - REFERENCES:
        - https://learn.ibm.com/mod/page/view.php?id=170330&forceview=1
        - https://learn.ibm.com/mod/video/view.php?id=169575
- 5.2.4. Decide when to retrain model
    - REFERENCES:
        - https://learning.oreilly.com/library/view/introducing-mlops/9781492083283/ch07.html#online_evaluation (Champion/Challenger section)
        - https://learning.oreilly.com/library/view/ml-ops-operationalizing/9781492074663/ch01.html#retraining_and_remodeling (Retraining and remodeling section)
        - https://learn.ibm.com/mod/video/view.php?id=169604
- 5.2.5. Use IBM OpenPages to govern models
    - REFERENCES:
        - https://www.ibm.com/docs/en/cloud-paks/cp-data/4.0?topic=governance-set-up-model-openpages-mrg

## 5.3. Determine if there is unfair bias in the model
#### SUBTASK(S):
- 5.3.1. Understand how model bias can creep in
    - REFERENCES:
        - https://www.brookings.edu/research/algorithmic-bias-detection-and-mitigation-best-practices-and-policies-to-reduce-consumer-harms/
        - https://developer.ibm.com/articles/machine-learning-and-bias/
- 5.3.2. Understand the role of transparency in mitigating bias
    - REFERENCES:
        - https://www.forbes.com/sites/cognitiveworld/2020/05/23/towards-a-more-transparent-ai/?sh=b928073d9371
- 5.3.3. Create an AI FactSheet
    - REFERENCES:
        - https://www.ibm.com/blogs/research/2020/07/aifactsheets/
- 5.3.4. Detect bias in models using IBM AI Fairness 360 Toolkit and Watson OpenScale
    - REFERENCES:
        - https://developer.ibm.com/blogs/ai-fairness-360-raise-ai-right/
        - https://github.com/IBM/bias-mitigation-of-machine-learning-models-using-aif360/blob/main/README.md
        - https://learn.ibm.com/mod/video/view.php?id=168628
        - https://www.ibm.com/docs/en/cloud-paks/cp-data/4.0?topic=governance-manage-model-risk

# Next Steps
1. Take the IBM Machine Learning Data Scientist v1 assessment test.
2. If you pass the assessment exam, visit pearsonvue.com/ibm to schedule your testing session.
3. If you fail the assessment exam, review how you did by section and focus on the sections where you need improvement. Keep in mind that you can take the assessment exam as many times as you would like <b>($30 per exam)</b>; however, you will still receive <b>the same questions, only in a different order.</b>