# Cleaning Data: Missing Data and Dealing with Outliers

![Data Science Workflow](files/img/DSworkflow.png)

## Cleaning Data
- Understand the **data quality**
- **Improve** the quality (if possible)
- Dealing with **missing data** (both missing rows and missing single entries)
    - Examples include 
        - **Replacing** missing values/entries with mean values
        - **Interpolation** of values (in time series)
- Dealing with **data outliers**
    - Examples include 
        - Default values a system records when data is missing: sometimes 0-values
        - Wrong values
- Removing **duplicates**
    - Common problem to have duplicate entries
- Process requires **domain knowledge**

## Missing Data
- Missing data is sometimes referred to as **NA** values in pandas.
- A great resource to learn more is the [pandas user guide on missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)


- Two types of missing data we consider
    1. NaN data
    2. Rows in time series data

Examples
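As a minimal sketch of the two types, the snippet below builds a small DataFrame with NaN entries and a short hourly series with one row missing. The names `nan_frame` and `ts` and all values are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# Type 1: NaN entries inside an otherwise complete table (toy values)
nan_frame = pd.DataFrame({
    "temp": [20.1, np.nan, 19.8],
    "humidity": [0.5, 0.6, np.nan],
})
nan_counts = nan_frame.isna().sum()  # number of NaN entries per column
print(nan_counts)

# Type 2: a whole row is absent from a time series (02:00 is not there)
ts = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime([
        "2021-01-01 00:00",
        "2021-01-01 01:00",
        "2021-01-01 03:00",
    ]),
)
# Time difference between consecutive rows; a gap larger than the
# expected one-hour step hints at missing rows
gaps = ts.index.to_series().diff()
print(gaps[gaps > pd.Timedelta(hours=1)])
```

Note that the second type produces no NaN at all, so `isna()` alone will not reveal it.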

## Outliers
- Requires domain knowledge
- But typical examples could include
    - Default values a system records when data is missing, sometimes 0-values
    - Wrong values

## Demonstration of how it affects results
- [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course/overview)
- The dataset contains a training and testing dataset.
    - The goal is to predict prices on the testing dataset.
- We will explore how dealing with missing values impacts the prediction of a linear regression model

### Remove non-numeric columns for this demonstration
```Python
data = data.select_dtypes(include='number')
```

Then we check for missing values.
```Python
data.info()
```

### Helper function
- Implement a helper function to calculate the r-square score
- It should take independent features `X` and dependent feature `y`
- Then split that into training and testing sets.
- Fit the training set.
- Predict the test set.
- Return the r-square score
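The steps above could be sketched as follows. This is one possible implementation, assuming scikit-learn is available; the function name `regression_r2`, the `test_size` and `random_state` defaults, and the synthetic `demo` data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def regression_r2(X, y, test_size=0.2, random_state=42):
    """Fit a linear regression on a train split and return r-square on the test split."""
    # Split into training and testing sets (fixed seed for repeatable results)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # Fit on the training set
    model = LinearRegression().fit(X_train, y_train)
    # Predict the test set and score the predictions
    y_pred = model.predict(X_test)
    return r2_score(y_test, y_pred)


# Synthetic data only, to show the call; not the housing dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
demo["target"] = 3 * demo["a"] - 2 * demo["b"] + rng.normal(scale=0.1, size=200)

demo_score = regression_r2(demo[["a", "b"]], demo["target"])
print(demo_score)
```

`LinearRegression` does not accept NaN values, so the data has to be cleaned (dropped or filled) before it is passed to the helper.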

### Calculations
- Try first to calculate the r-square by using `data.dropna()`
    - This serves as the usual way we have done it
- Then with `data.fillna(data.mean())`
    - [`fillna()`](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.fillna.html) Fill NA/NaN values using the specified method.
- Then with `data.fillna(data.mode().iloc[0])`
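To see what the three variants do to a table, here is a small sketch on made-up data. The frame `toy`, its columns `x1` and `x2`, and the fractions of values set to NaN are hypothetical; with the real data each variant would then be split into features and target and passed to the r-square helper.

```python
import numpy as np
import pandas as pd

# Toy numeric data with some entries knocked out (illustration only)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.integers(0, 5, size=100).astype(float),
})
toy.loc[toy.sample(frac=0.2, random_state=1).index, "x1"] = np.nan
toy.loc[toy.sample(frac=0.1, random_state=2).index, "x2"] = np.nan

variants = {
    # Drop every row that has at least one NaN
    "dropna": toy.dropna(),
    # Replace NaN with the mean of its column
    "mean": toy.fillna(toy.mean()),
    # Replace NaN with the most frequent value of its column;
    # mode() can return several rows, iloc[0] takes the first one
    "mode": toy.fillna(toy.mode().iloc[0]),
}

# Rows kept and NaN entries remaining for each variant
summary = {
    name: (len(frame), int(frame.isna().sum().sum()))
    for name, frame in variants.items()
}
print(summary)
```

`dropna()` shrinks the dataset, while the two `fillna()` variants keep every row. For a continuous column with no repeated values, `mode().iloc[0]` is simply the smallest value, so the mode is usually a better fit for discrete columns.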

## Time series
- Now let's explore a time series: `files/weather.parquet`


### Missing time series rows
- One way to find missing rows of data in a time series is as follows
```Python
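# Expected hourly timestamps (pandas 2.2+ prefers the alias "h" over "H")
idx = pd.Series(data=pd.date_range(start=df.index.min(), end=df.index.max(), freq="H"))
mask = idx.isin(df.index)  # True where the expected timestamp exists in the data
idx[~mask]  # the timestamps that are missing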
```
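A shorter alternative, sketched here on a toy frame, is to compare the expected index with the actual one using `Index.difference()`. The frame `toy` and its timestamps are invented for illustration, and the frequency is passed as a `pd.Timedelta` to avoid depending on a particular frequency alias.

```python
import pandas as pd

# Toy hourly data where the 02:00 row is absent (illustration only)
observed = pd.to_datetime([
    "2021-01-01 00:00",
    "2021-01-01 01:00",
    "2021-01-01 03:00",
    "2021-01-01 04:00",
])
toy = pd.DataFrame({"temp": [10.0, 11.0, 13.0, 14.0]}, index=observed)

# Every timestamp we expect between the first and the last observation
expected = pd.date_range(
    toy.index.min(), toy.index.max(), freq=pd.Timedelta(hours=1)
)

# Expected timestamps that do not occur in the data
missing = expected.difference(toy.index)
print(missing)
```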

### Insert missing datetimes and interpolate them
- To insert missing datetimes we can use [`reindex()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reindex.html)
- To interpolate the values that are missing we can use [`interpolate()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html)
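The two calls can be chained. The sketch below uses a made-up frame `toy` with one missing hour: `reindex()` inserts the missing timestamp as a row of NaN, and `interpolate()` (linear by default) fills it from its neighbours.

```python
import pandas as pd

# Toy hourly data where the 02:00 row is absent (illustration only)
index = pd.to_datetime([
    "2021-01-01 00:00",
    "2021-01-01 01:00",
    "2021-01-01 03:00",
])
toy = pd.DataFrame({"temp": [10.0, 11.0, 13.0]}, index=index)

# The complete hourly index the data should have
full_index = pd.date_range(
    toy.index.min(), toy.index.max(), freq=pd.Timedelta(hours=1)
)

# reindex() adds the missing timestamp with NaN values,
# interpolate() then fills the NaN linearly between 11.0 and 13.0
filled = toy.reindex(full_index).interpolate()
print(filled)
```

Linear interpolation is a reasonable default for slowly changing measurements. Whether it is appropriate for a given column is a domain question.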

## Outliers
- Let's focus on `Pressure (millibars)` for `2006`
- One way to handle 0-values is with [`replace()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html)
```Python
# Treat 0-readings as missing (requires `import numpy as np`)
df['Pressure (millibars)'].replace(0, np.nan)
```
- Then we can apply `interpolate()`
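Put together on a toy series, the idea looks like this. The values are invented; only the column name is taken from the weather data. The single 0 stands in for a reading the system recorded as 0 instead of leaving it empty.

```python
import numpy as np
import pandas as pd

# Toy pressure readings with one implausible 0-value (illustration only)
pressure = pd.Series(
    [1015.0, 1016.0, 0.0, 1018.0, 1017.0],
    name="Pressure (millibars)",
)

# Turn the 0-reading into NaN, then fill it from its neighbours
cleaned = pressure.replace(0, np.nan).interpolate()
print(cleaned)
```

This only makes sense when 0 cannot be a real measurement, which is the case for atmospheric pressure but not for, say, precipitation.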

### Removing Duplicates
- [`drop_duplicates()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) Return DataFrame with duplicate rows removed.
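A small sketch on made-up data: `duplicated()` flags rows that repeat an earlier row, which is handy for counting them before `drop_duplicates()` removes them. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy data where the second row repeats the first (illustration only)
toy = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen"],
    "temp": [3.0, 3.0, 5.0],
})

# Count rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# Keep the first occurrence of each row and drop the repeats
deduplicated = toy.drop_duplicates()
print(deduplicated)
```

By default all columns are compared and the first occurrence is kept. The `subset` and `keep` parameters change that behaviour.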