### Cleaning Data in Python

* `NaN` (not a number): a special floating-point value that pandas uses to mark missing data
* Working with duplicates and missing values
    * `isnull()`
    * `notnull()`
    * `dropna()`
    * `fillna()`
    * `replace()`
* Dropping duplicate data
* Deciding which values should be replaced with missing values, for example after identifying and eliminating outliers
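
The pandas methods listed above can be sketched on a small table. The `DataFrame` contents, the column names, and the `-999` sentinel are made up for illustration; `drop_duplicates()` is assumed here as the pandas call for dropping duplicate rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value per column and one duplicated row.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 31.0, 40.0],
    "city": ["Paris", "Rome", "Oslo", "Oslo", None],
})

missing_mask = df.isnull()              # True where a value is missing
present_mask = df.notnull()             # element-wise opposite of isnull()
missing_per_column = df.isnull().sum()  # count of missing values per column

complete_rows = df.dropna()             # keep only rows without missing values

# Fill each column with its own replacement value.
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})

# Turn a sentinel value into NaN so that it is treated as missing.
sentinel = pd.Series([10, -999, 30]).replace(-999, np.nan)

# Rows 2 and 3 are identical; only the first one is kept.
deduplicated = df.drop_duplicates()
```

Whether to drop or fill depends on how much data is missing; both calls return new objects and leave `df` unchanged.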

#### Identifying and Eliminating Outliers
* Outliers are observations that are significantly different from other data points
* Outliers can adversely affect the training process of a machine learning algorithm, resulting in a loss of accuracy.
* A common way to find outliers is to compute the interquartile range (IQR) and flag the values that fall far outside it.

     **interquartile range (IQR) = Q3 − Q1**, where Q1 = `quantile(0.25)` and Q3 = `quantile(0.75)`
     ![boxplot](boxplot.png)
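
A minimal sketch of IQR-based outlier detection with pandas. The sample values are invented, and the 1.5 × IQR cut-off is an assumption: it is the common box-plot convention, not something these notes specify.

```python
import pandas as pd

# Hypothetical measurements; 95 is clearly far from the rest.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]   # the flagged observations
cleaned = values[~is_outlier]   # the data with outliers removed
```

A different multiplier makes the rule stricter or more lenient, so it is worth checking the flagged rows before dropping them.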

     
* **Plotting**
* **Saving**

# Data Preprocessing with scikit-learn
# Preprocessing Techniques
* Data preprocessing is a technique used to convert raw data into a clean data set

### Data preprocessing steps


* Collecting data from sources
* Cleaning out unnecessary data
* Analyzing / processing data

    * We will learn data preprocessing techniques with scikit-learn, one of the most popular frameworks used for data science in industry
    * The scikit-learn library includes tools for data preprocessing and data mining. It is imported in Python via the statement `import sklearn`.

![ddd.PNG](ddd.PNG)

### Data Imputation 
* If the dataset is missing too many values, we usually do not use it
* If only a few of the values are missing, we can perform data imputation to substitute the missing data with some other value(s).
* There are many different methods for data imputation
    * Using the mean value
    * Using the median value
    * Using the most frequent value
    * Filling in missing values with a constant
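
The four methods above map onto the `strategy` parameter of scikit-learn's `SimpleImputer`. The following is a small sketch; the array `X` and the fill value `0.0` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
X = np.array([[1.0, 2.0],
              [np.nan, 2.0],
              [3.0, np.nan],
              [5.0, 8.0]])

# Each strategy is computed per column, ignoring the missing entries.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
median_imputed = SimpleImputer(strategy="median").fit_transform(X)
frequent_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Replace every missing value with the same constant.
constant_imputed = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)
```

The mean is sensitive to extreme values, so the median is often the safer choice when a column contains outliers.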
    

## Feature Scaling

### 1. Standardizing Data


* Data scientists often convert data into a standard format so that features measured on different scales become comparable.
* The standard format refers to data that has 0 mean and unit variance (i.e. standard deviation = 1), and the process of converting data into this format is called data standardization.
* Standardization can improve the performance of models, particularly those that are sensitive to the scale of the input features.
* Variance is a statistical measure that indicates the data's dispersion; after standardization it equals 1.

* The formula for this is:  (𝑥 − 𝜇)/𝜎

    * We subtract the mean (𝜇) from each value (x) and then divide by the standard deviation (𝜎)
    
![stddata.PNG](stddata.PNG)

![std.PNG](std.PNG)
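
As a rough illustration of the formula, the sketch below standardizes a small made-up array with scikit-learn's `StandardScaler` and repeats the calculation by hand with NumPy. The values in `X` are assumptions chosen only to show two features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # each column now has mean 0 and std 1

# The same computation by hand: (x - mean) / std, column by column.
# StandardScaler uses the population standard deviation (ddof=0), as does np.std.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

In a train/test setting, the scaler is normally fitted on the training data only and then applied to the test data with `scaler.transform(...)`.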

    
### 2. Data Range
* Scale data by compressing it into a fixed range
* One of the biggest use cases for this is compressing data into the range [0, 1]
* scikit-learn provides `MinMaxScaler` for this

![minmax.PNG](minmax.PNG)
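
A short sketch of range scaling with `MinMaxScaler`. The input array is invented; the second scaler uses the `feature_range` parameter to show a range other than the default [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data with two features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [5.0, 50.0]])

# Default range is [0, 1]: each column's minimum maps to 0 and its maximum to 1.
X_01 = MinMaxScaler().fit_transform(X)

# A custom range can be requested with feature_range.
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
```

Because the minimum and maximum define the mapping, a single extreme value squeezes all other values into a narrow band.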



### 3. Robust Scaling
* Deals with outliers (data points that are significantly further away from the other data points)
* Robustly scales the data, i.e. avoids being affected by outliers
* Scales using the data's median and interquartile range (IQR)
* An outlier shifts the mean, but the median stays nearly the same, which is why the median is used here
* Subtract the median from each data value, then divide by the IQR
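
The steps above can be sketched with scikit-learn's `RobustScaler`, which by default centers on the median and scales by the 25th to 75th percentile range. The single-feature array below is made up, with 100 acting as the outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature; 100.0 is an outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_robust = RobustScaler().fit_transform(X)

# The same computation by hand: subtract the median, divide by the IQR.
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)
```

The outlier is still present after scaling, but it no longer determines how the other values are spread out.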


### 4. Normalizing Data

* We want to scale the individual data observations (i.e. rows) rather than the features (columns)
* Rescales the data into a smaller range, -1.0 to 1.0 or 0.0 to 1.0
* Used in classification problems and data mining
* When clustering data, we typically apply L2 normalization to each row
* L2 normalization is applied to each row of a data array separately: every value in the row is divided by the row's L2 norm
* The L2 norm of a row is just the square root of the sum of squared values for the row

![norm.PNG](norm.PNG)
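
A minimal sketch of row-wise L2 normalization using scikit-learn's `Normalizer`, together with the equivalent NumPy calculation. The rows of `X` are hypothetical values picked so that the norms are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical observations (rows); the first row has L2 norm sqrt(3^2 + 4^2) = 5.
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [6.0, 8.0]])

X_l2 = Normalizer(norm="l2").fit_transform(X)

# The same computation by hand: divide each row by its own L2 norm.
row_norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
manual = X / row_norms
```

After normalization every row has length 1, so rows that point in the same direction, such as the first and third here, become identical.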


