# Study Guide ML-Specialist 

## Section 2 – Exploratory Data Analysis including data preparation

### 2.1. Identify the methods used to clean, label, and anonymize data

### SUBTASKS:
- 2.1.1. Clean data
    - 2.1.1.1. Fill or drop missing values
    - 2.1.1.2. Remove duplicate rows
    - 2.1.1.3. Remove outliers
    - 2.1.1.4. Converting data types
    - 2.1.1.5. Data normalization
    
    
- 2.1.2. Label data
    - 2.1.2.1. Understand the benefits and challenges to labeling data
    - 2.1.2.2. Explain data labeling approaches
    
    
- 2.1.3. Anonymize data
    
REFERENCES:
- https://www.ibm.com/garage/method/practices/reason/prepare-data-for-machine-learning/
- https://www.ibm.com/garage/method/practices/code/data-preparation-ai-data-science/
- https://www.ibm.com/cloud/learn/data-labeling
- https://dataplatform.cloud.ibm.com/docs/content/wsj/governance/dmg22.html

## 2.1.1. Clean data

### 2.1.1.1. Fill or drop missing values

- Remove missing values
    - `DataFrame.dropna([axis, how, thresh, ...])`


- Fill NA/NaN values using the specified method.
    - `DataFrame.fillna([value, method, axis, ...])`
	
    
- Detect missing values.
    - `DataFrame.isna()`
	
    
- Replace values given in to_replace with value.	
    - `DataFrame.replace([to_replace, value, ...])`
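
The four calls above can be tried together on a small frame. This is a minimal sketch: the column names, the values, and the use of `-1` as a "missing" sentinel are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, -1.0],
    "city": ["Austin", "Boston", None, "Denver"],
})

# Detect missing values: boolean mask, then count per column.
missing_per_column = df.isna().sum()

# Drop every row that has at least one missing value.
dropped = df.dropna()

# Fill missing values column by column using a dict of replacements.
filled = df.fillna({"price": df["price"].median(), "city": "unknown"})

# Replace a sentinel value (-1 here, an assumption) with NaN so that
# it is treated as missing by the calls above.
cleaned = df.replace({"price": {-1.0: np.nan}})
```

Whether to drop or fill usually depends on how much data is missing and on whether a sensible fill value (median, mode, a constant) exists for the column.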
	

### 2.1.1.2. Remove duplicate rows

- Return DataFrame with duplicate rows removed.
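    - `new_df = df.drop_duplicates(keep='first', inplace=False)`
    - `keep='first'` (the default) keeps one copy of each duplicated row; `keep=False` drops every copy, including the first.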
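
The effect of the `keep` and `subset` parameters is easiest to see on a tiny frame. The data below is hypothetical and only meant as a sketch.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are exact duplicates; rows 3 and 4 share an id only.
df = pd.DataFrame({"id": [1, 1, 2, 3, 3], "value": ["a", "a", "b", "c", "d"]})

# keep="first" (the default) keeps one copy of each duplicated row.
first_kept = df.drop_duplicates(keep="first")

# keep=False removes every row that has a duplicate, including the first copy.
none_kept = df.drop_duplicates(keep=False)

# subset restricts the duplicate check to the listed columns.
by_id = df.drop_duplicates(subset=["id"])
```

With `inplace=False` (the default) the original `df` is left untouched and a new DataFrame is returned.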


### 2.1.1.3. Remove outliers

#### Various ways of detecting outliers:
- Z-Score
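    - A z-score tells you how many standard deviations an individual value lies from the mean: `z = (x - mean) / std`. Values whose absolute z-score exceeds a chosen threshold (3 is a common choice) are flagged as outliers.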
    

- IQR distance from the median
    - Interquartile range (IQR). The IQR describes the middle 50% of values when they are ordered from lowest to highest. To find it, first find the median (middle value) of the lower half and of the upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3), and the IQR is the difference Q3 - Q1. A common rule flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR as outliers.
    
    - sklearn's `RobustScaler`
        - Scales features using statistics that are robust to outliers. This scaler removes the median and scales the data according to the quantile range (defaults to the IQR). It does not remove outliers; it only limits their influence on the scaling.
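
The detection rules above can be sketched as follows. The data, the z-score threshold of 3, and the 1.5 × IQR multiplier are assumptions chosen for illustration; suitable thresholds depend on the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical data: twenty ordinary values and one extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [100]
s = pd.Series(values, dtype=float)

# Z-score rule: keep values within 3 standard deviations of the mean.
z = (s - s.mean()) / s.std(ddof=0)
z_filtered = s[z.abs() <= 3]

# IQR rule: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_filtered = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

# RobustScaler keeps every row; it centers on the median and divides by the IQR,
# so the extreme value has little influence on how the other values are scaled.
scaled = RobustScaler().fit_transform(s.to_frame())
```

Note that on very small samples the z-score rule can fail to flag an extreme value, because that value inflates the standard deviation it is measured against; the IQR rule is less sensitive to this.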



### 2.1.1.4. Converting data types

When working with missing <b>Numerical</b> values:
- dropna()
    - `df.dropna(subset=['price'])`


- drop()
    - `df.drop('price', axis=1)`


- fillna()
    - `median = df['price'].median()`
    - `df['price'] = df['price'].fillna(median)`
      
      
- scikit-learn `SimpleImputer`
    
    `import pandas as pd`
    
    `from sklearn.impute import SimpleImputer`
    
    `imputer = SimpleImputer(strategy='median')`
    
    `price_num = df[['price']]`
    
    `imputer.fit(price_num)`
    
    `X = imputer.transform(price_num)`
    
    `transform_df = pd.DataFrame(X, columns=price_num.columns, index=price_num.index)`
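
To compare the four options side by side, here is a minimal runnable sketch. The frame, its `price` and `rooms` columns, and the values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric frame with one gap per column.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 50.0],
    "rooms": [1.0, 2.0, np.nan, 4.0],
})

# Option 1: drop the rows where 'price' is missing.
rows_dropped = df.dropna(subset=["price"])

# Option 2: drop the whole 'price' column.
column_dropped = df.drop("price", axis=1)

# Option 3: fill the gaps in 'price' with the column median.
median_filled = df.assign(price=df["price"].fillna(df["price"].median()))

# Option 4: SimpleImputer learns one median per column and fills all of them.
imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(df)
imputed = pd.DataFrame(X, columns=df.columns, index=df.index)
```

One practical difference is that a fitted `SimpleImputer` stores the learned medians in `imputer.statistics_`, so the same values can later be applied to new data with `imputer.transform(...)`.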
    



When working with <b>Text</b> and <b>Categorical</b> attributes:


- Ordinal encoding using scikit-learn `OrdinalEncoder`

    `wrd_cat = df[['words']]`

    `from sklearn.preprocessing import OrdinalEncoder`
    
    `ordinal_encoder = OrdinalEncoder()`
    
    `wrd_cat_encoded = ordinal_encoder.fit_transform(wrd_cat)`
    
- One-hot encoding using scikit-learn `OneHotEncoder`

    `wrd_cat = df[['words']]`

    `from sklearn.preprocessing import OneHotEncoder`
    
    `cat_encoder = OneHotEncoder()`
    
    `wrd_cat_1hot = cat_encoder.fit_transform(wrd_cat)`
    
    The result is a SciPy sparse matrix. To convert it to a (dense) NumPy array, use `toarray()`:
    
    `wrd_cat_1hot.toarray()`
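
The two encoders can be run end to end on a small column. This is a sketch only: the `words` values and the explicit low/medium/high ordering are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column.
df = pd.DataFrame({"words": ["low", "high", "medium", "low"]})
wrd_cat = df[["words"]]

# Default ordinal encoding: categories are sorted, so high=0, low=1, medium=2.
ordinal_encoder = OrdinalEncoder()
wrd_cat_encoded = ordinal_encoder.fit_transform(wrd_cat)

# Passing the categories explicitly gives a meaningful order: low=0, medium=1, high=2.
ordered_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered = ordered_encoder.fit_transform(wrd_cat)

# One-hot encoding: one binary column per category, returned as a sparse matrix.
cat_encoder = OneHotEncoder()
wrd_cat_1hot = cat_encoder.fit_transform(wrd_cat)
dense = wrd_cat_1hot.toarray()
```

Ordinal encoding implies an order between categories, so it tends to suit ranked values; one-hot encoding avoids implying an order at the cost of one column per category.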

### 2.1.1.5. Data normalization

- Min-max scaling (often called normalization) is one of the simplest forms of feature scaling. Each value is shifted by the column minimum and divided by the column range (max - min), so the results fall between 0 and 1.
    - scikit-learn `MinMaxScaler`
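
As a quick check of what min-max scaling does, the sketch below compares `MinMaxScaler` with the formula `(x - min) / (max - min)` computed by hand. The input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data (scikit-learn expects a 2D array).
X = np.array([[10.0], [20.0], [40.0]])

# MinMaxScaler learns the per-column min and max, then rescales to [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# The same result computed directly from the formula.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

Because the minimum and maximum define the scale, a single extreme value compresses all other values into a narrow band, which is one reason outliers are usually handled before scaling.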


## 2.1.2. Label data



### 2.1.2.1. Understand the benefits and challenges to labeling data



### 2.2. Visualize data

### SUBTASKS:
- 2.2.1. Choose the column(s) from your dataset to be visualized


- 2.2.2. Identify what the visualization should describe about the column(s)
    - 2.2.2.1. Distribution
    - 2.2.2.2. Correlation
    - 2.2.2.3. Comparison
    - 2.2.2.4. Time Series


- 2.2.3. Select a type of chart based on the descriptive need
    - 2.2.3.1. Histogram/Box plot/Violin plot
    - 2.2.3.2. Scatterplot/Heatmap
    - 2.2.3.3. Bar chart
    - 2.2.3.4. Line plot


- 2.2.4. Select a library or tool for visualization
    - 2.2.4.1. Matplotlib
    - 2.2.4.2. Seaborn
    - 2.2.4.3. Bokeh
    - 2.2.4.4. Plotly


- 2.2.5. Plot the visualization

REFERENCES:
- https://seaborn.pydata.org/introduction.html
- https://matplotlib.org/stable/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py
- https://docs.bokeh.org/en/latest/docs/first_steps.html
- https://plotly.com/python/
- https://learn.ibm.com/course/view.php?id=8794
- https://learning.oreilly.com/library/view/statistics-in-a/9781449361129/ (Chapter 4)
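
The pairing of descriptive need and chart type listed in the subtasks can be sketched with Matplotlib, one of the libraries named above. The data is randomly generated and the output file name is hypothetical; Seaborn, Bokeh, or Plotly would work equally well for the same charts.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data generated for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Distribution of one column: histogram.
axes[0, 0].hist(x, bins=20)
axes[0, 0].set_title("Distribution: histogram")

# Correlation between two columns: scatterplot.
axes[0, 1].scatter(x, y, s=8)
axes[0, 1].set_title("Correlation: scatterplot")

# Comparison across categories: bar chart.
axes[1, 0].bar(["A", "B", "C"], [3, 7, 5])
axes[1, 0].set_title("Comparison: bar chart")

# Values over time: line plot.
axes[1, 1].plot(np.arange(50), np.cumsum(rng.normal(size=50)))
axes[1, 1].set_title("Time series: line plot")

fig.tight_layout()
fig.savefig("eda_charts.png")  # in a notebook, plt.show() would be used instead
```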