# Lecture 4
***

## **Common Types of Datasets**

- **Feature-based data:** Consists of multiple `attributes` or `characteristics` of
the data sample, such as height, weight, age, etc.  
Each column represents a different feature of the data sample.

- **Time series data:** Consists of values that `change over time`, such as stock
prices, weather data, sensor data, or audio data. Each entry represents a
different point in time.  

- **Image data:** Consists of pixels, shapes, or colors, such as photos,
drawings, logos, or icons. Each data sample `contains an image`.

- **Text (language) data:** Consists of words, sentences, or documents, such
as tweets, reviews, news articles, or books. Each data sample contains a
`series of text` or `spoken words`.

## **Data Statistics**

- Examining the first few rows of data can `verify` whether the data was `imported correctly`.

- You also need to recognize the symbol for `NaN values`, which are
assigned to empty cells by default.

- Some datasets may also use specific values, such as `-1 or 999`, to
indicate `missing values`. This is why `checking the metadata`, which
describes the data and its attributes, is important.

- Understanding the data’s shape and statistics, such as the mean, variance,
and correlation, can help you choose a `suitable analysis method`.
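The checks above can be sketched in pandas as follows. This is a minimal illustration, not a fixed recipe: the CSV content is made up, and `999` is assumed to be the missing-value marker documented in the dataset's metadata.

```python
import io
import pandas as pd

# Hypothetical CSV: an empty cell and the sentinel 999 both mean "missing".
csv_text = """height,weight,age
170,65,34
182,,41
165,58,999
175,80,29
"""

# Empty cells become NaN by default; na_values also maps 999 to NaN.
# Note: a plain list applies to every column, so only use it when the
# sentinel can never be a legitimate value.
df = pd.read_csv(io.StringIO(csv_text), na_values=[999])

print(df.head())      # first rows: confirm the import looks right
print(df.shape)       # (number of rows, number of columns)
print(df.describe())  # count, mean, std, min, quartiles, max per column
print(df.var())       # variance of each column
print(df.corr())      # pairwise correlation between columns
```

If the sentinel is only meaningful in some columns, `na_values` also accepts a dictionary that maps column names to their own markers.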

## **Data Processing**

### **Why Do We Need to Process Datasets?**

- If the dataset is based on `real-life data`, it `might not be perfect`
- Your dataset might include:
    - Missing values
    - Erroneous measurements
    - Noise

#### **Missing Values**

##### **How to Find Missing Values in a Pandas DataFrame**

- `Check the data type` for each column using `df.dtypes`. If a column has invalid
data points, such as empty strings or non-numeric values, the data type will
typically be `object` rather than a numeric type.

- You can either `manually change` the data type for all the columns using
`df.astype()` or `replace the invalid` points with `NaN` using `df.replace()`.

- Once all the columns are the proper data type, you can count the number of
`NaN` values using one of these methods:

    <img src="./images/L4/L4-1.png" alt="L4-1.png" width="500"/>
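The steps above can be sketched as follows. The column names and the invalid markers (`""` and `"n/a"`) are assumptions made for illustration; a real dataset may use different ones.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "temp" contains an empty string and a non-numeric entry.
df = pd.DataFrame({
    "temp": ["21.5", "", "n/a", "19.0"],
    "humidity": [40.0, np.nan, 55.0, 60.0],
})
print(df.dtypes)  # "temp" is not numeric because of the invalid entries

# Replace the known invalid markers with NaN, then convert to float.
df["temp"] = df["temp"].replace(["", "n/a"], np.nan).astype(float)

nan_per_column = df.isna().sum()        # NaN count for each column
nan_total = int(df.isna().sum().sum())  # NaN count for the whole DataFrame
print(nan_per_column)
print(nan_total)
```

When the invalid markers are not known in advance, `pd.to_numeric(df["temp"], errors="coerce")` is an alternative that turns anything unparseable into NaN.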

##### **Dealing with Missing Values**

- Determine the type (MCAR, MAR, or MNAR) and cause of the missingness.
This guides your choice of method.
- Examine the pattern and extent of the missing data. This shows how the
missing data affects the data quality and analysis validity.
- Select a method that fits the data and analysis. Common methods are
deleting, imputing, or using the missingness as a feature. Each method
has pros and cons, and you should weigh the trade-offs and assumptions.

<br/>

- ***MCAR:*** *Missing Completely At Random*
- ***MAR:*** *Missing At Random*
- ***MNAR:*** *Missing Not At Random*
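As a rough sketch of examining the extent and pattern of missing data, and of keeping the missingness as a feature, consider the snippet below. The survey-style columns are invented for illustration, and the group comparison is only an informal hint about the missingness type, not a formal test.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in "income".
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [40000.0, np.nan, 52000.0, np.nan, 61000.0],
})

# Extent: fraction of missing entries in each column.
missing_fraction = df.isna().mean()
print(missing_fraction)

# Missingness as a feature: a flag recording where the value was absent.
df["income_missing"] = df["income"].isna()

# Pattern: compare another column between rows with and without the value.
# A clear difference suggests the missingness depends on observed data,
# which would argue against MCAR.
age_by_missing = df.groupby("income_missing")["age"].mean()
print(age_by_missing)
```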

Pandas offers several built-in functions to deal with missing values
in different ways:

- You can remove the rows or columns that contain NaN values.
- You can replace them with a specific value or a calculated
value based on the rest of the data.

##### **Dropping NaN Values**

- Dropping values is easy with a Series, as you can `drop the values individually`.
- For `DataFrames`, it is a bit more `complicated`: every column must keep the
same number of rows, so you cannot remove a single cell.
- You can `drop any row` or `drop any column` that has at least one NaN value
(based on the `specified axis`).
- Or you can use the `how` and `thresh` keywords to control when a row or
column is dropped: `how="all"` drops it only if every value is NaN, and
`thresh` sets the minimum number of non-NaN values it needs to be kept.
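A short sketch of these options with `dropna` is shown below. The small DataFrame is made up so that each row has a different number of NaN values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.dropna())  # a Series simply loses its NaN entries

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
    "c": [7.0, 8.0, 9.0, np.nan],
})

rows_any = df.dropna(axis=0, how="any")  # drop rows with at least one NaN
cols_any = df.dropna(axis=1, how="any")  # drop columns with at least one NaN
rows_all = df.dropna(how="all")          # drop rows only if every value is NaN
rows_thresh = df.dropna(thresh=2)        # keep rows with >= 2 non-NaN values

print(rows_any)
print(cols_any)
print(rows_all)
print(rows_thresh)
```

Note that `thresh` counts the non-NaN values required to keep a row or column, and recent pandas versions do not allow `how` and `thresh` in the same call.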

##### **Filling NaN Values**

- **Forward-fill:** Use the previous valid value to fill the missing value, which
can be useful for time series data.
- **Back-fill:** Use the next valid value to fill the missing value, which can be
useful when the first entries of a series are missing or the data is ordered in
reverse time.
- **Custom code:** Write your own logic to fill the missing values, which can
be useful for complex or specific cases.

The choice of method depends on the source and nature of the data and the
desired outcome.
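The three filling strategies might look like this in pandas. The daily readings are invented, and the "custom" rule (linear interpolation, with the series mean as a fallback for anything left unfilled) is just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with gaps.
s = pd.Series(
    [np.nan, 20.0, np.nan, np.nan, 23.0, np.nan],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

forward = s.ffill()   # carry the previous valid value forward
backward = s.bfill()  # pull the next valid value backward

# Custom logic (an assumption for this sketch): interpolate linearly
# between valid neighbors, then use the mean for any remaining NaN.
custom = s.interpolate(method="linear").fillna(s.mean())

print(forward)
print(backward)
print(custom)
```

Forward-fill cannot fill a gap at the very start of the series, and back-fill cannot fill one at the very end, so it is worth checking for leftover NaN values afterwards.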

#### **Errors and Noise**

##### **Detecting Errors in Real Measurements**

- When collecting data from the real world, there might be some
`inaccuracies` due to various factors such as `faulty equipment` or
`environmental noise`.
- To identify `potential outliers`, you can use different methods depending
on the data's characteristics and shape, such as `visualizing` or
`analyzing statistically`.
- For instance, you can look at how far each data point lies from the center
of the data, such as its distance from the mean or median.
- After finding the errors, you can handle them in the same way as you
handled the `NaN values`. A simple way to code this is to change all the
`incorrect values to np.nan` and then use your preferred method to
replace the missing values.

**Example:**

<img src="./images/L4/L4-2.png" alt="L4-2.png" width="500"/>
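A minimal code sketch of the same idea follows. It is an illustration under stated assumptions rather than the method shown in the figure: the readings are invented, the center is measured with the median and the median absolute deviation (MAD), and the cutoff of 5 MADs is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings; 95.0 simulates a sensor glitch.
s = pd.Series([20.1, 19.8, 20.5, 95.0, 20.2, 19.9, 20.3, 20.0])

# Distance from the center, using the median and the MAD, which a
# single extreme value barely shifts (unlike the mean and std).
median = s.median()
mad = (s - median).abs().median()
is_outlier = (s - median).abs() > 5 * mad  # cutoff of 5 is an assumption

cleaned = s.mask(is_outlier, np.nan)  # mark suspect values as NaN
cleaned = cleaned.ffill()             # then fill them like any other NaN

print(is_outlier)
print(cleaned)
```

Any of the filling or dropping methods described earlier can replace the final `ffill()` step.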

#### **Machine Learning for Error Detection**

- Sometimes, you `can’t detect` the errors with `simple visualization` and 
`statistical methods`.
- For `communication signals`, the error is `contained within` the signal, in the
form of noise.
- Error detection can also be used to `detect bad actors` and other cybersecurity
risks.
- ML methods have been developed to address these issues, and will be
discussed later in this course.

## **Summary**

- It is important to `understand your data` before you start implementing
your machine learning method.
- Data statistics can help you determine `which ML method` to use. This will
be discussed in more detail later in the course.
- `Missing and erroneous/noisy values` need to be
`addressed before starting ML analysis`.