# Week 4: Data Cleaning and Preprocessing

### Objectives
- Learn techniques to handle missing values, identify and remove duplicates, and clean data for analysis.
- Develop skills to transform raw data into a reliable dataset ready for further exploration and modeling.

### Topics
- **Dropping Duplicates**: Identify and remove duplicate rows to ensure data quality.
- **Handling Missing Values**: Explore strategies for dealing with missing values, such as removal or imputation.
- **Data Cleaning**: Gain familiarity with various preprocessing techniques to format, normalize, and prepare data.

### Content

1. **Introduction to Data Cleaning and Preprocessing**
   - Data cleaning is a crucial first step in any data analysis or machine learning project. It involves identifying and handling inconsistencies, missing values, and duplicate entries, which can otherwise lead to incorrect analyses and insights.
   - **Importance of Data Cleaning**:
     - Ensures data quality and reliability.
     - Reduces errors and inconsistencies.
     - Prepares data for accurate analysis and modeling.

2. **Dropping Duplicates**
   - **Identifying Duplicates**:
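     - Duplicates typically arise when the same record is entered more than once, or when overlapping files are combined. Left in place, they inflate counts and skew statistics, so identifying and removing them helps maintain data integrity.
     - Use the `duplicated()` method to check for duplicate rows. It returns a Boolean Series that is `True` for each row repeating an earlier one; by default, the first occurrence is marked `False`.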
     - **Example**:
       ```python
       duplicate_rows = df.duplicated()
       print(duplicate_rows)
       ```
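
     Because the default result only flags the later copies, it can hide how many rows are involved. A minimal sketch, assuming a small made-up DataFrame (the `name` and `city` columns are illustrative only), shows how the `keep` argument changes what gets flagged and how to count duplicates:

     ```python
     import pandas as pd

     # Hypothetical sample data: rows 0 and 2 are identical
     df = pd.DataFrame({
         "name": ["Ana", "Ben", "Ana", "Cy"],
         "city": ["Lima", "Oslo", "Lima", "Rome"],
     })

     # Default keep='first': only the later copy is flagged
     later_copies = df.duplicated()

     # keep=False flags every row that belongs to a duplicate group
     all_copies = df.duplicated(keep=False)

     # True counts as 1, so sum() gives the number of flagged rows
     n_duplicates = int(later_copies.sum())

     # Boolean indexing shows the duplicated rows themselves
     print(df[all_copies])
     ```

     Inspecting the flagged rows before deleting anything is a cheap way to check that they really are unwanted repeats.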
   - **Removing Duplicates with `drop_duplicates()`**:
     - The `drop_duplicates()` method removes duplicate rows, keeping only the first occurrence by default.
     - **Syntax**: `df.drop_duplicates(subset=None, keep='first', inplace=False)`, where `subset` lists the columns to compare (all columns when `None`), `keep` selects which occurrence to retain (`'first'`, `'last'`, or `False` to drop every copy), and `inplace=True` modifies `df` directly and returns `None` instead of a new DataFrame.
     - **Example**:
       ```python
       df_cleaned = df.drop_duplicates()
       ```
   - **Example with Column Subset**:
     - To remove duplicates based on specific columns (e.g., if only certain columns are relevant for identifying duplicates):
       ```python
       df_cleaned = df.drop_duplicates(subset=['column1', 'column2'])
       ```
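
     The `keep` argument matters most when combined with `subset`. The sketch below assumes a hypothetical table with `customer_id` and `updated` columns (both names are made up for illustration), in which the later record for a customer should win:

     ```python
     import pandas as pd

     # Hypothetical data: customer 1 appears twice, the later row is newer
     df = pd.DataFrame({
         "customer_id": [1, 2, 1],
         "updated": ["2024-01-01", "2024-01-05", "2024-02-01"],
     })

     # Keep the last row seen for each customer_id
     latest = df.drop_duplicates(subset=["customer_id"], keep="last")

     # keep=False drops every row whose customer_id appears more than once
     unique_only = df.drop_duplicates(subset=["customer_id"], keep=False)

     # ignore_index=True renumbers the result 0..n-1
     latest_reset = df.drop_duplicates(
         subset=["customer_id"], keep="last", ignore_index=True
     )
     ```

     Note that `keep="last"` depends on row order, so sort the data first if "last" is meant to be the most recent record.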

3. **Handling Missing Values**
   - **Identifying Missing Values**:
     - Missing values are common in datasets. Pandas represents them as `NaN` (Not a Number), `None`, or `NaT` for datetimes. Detect them with the `isnull()` method (an alias of `isna()`), which returns a Boolean DataFrame that is `True` wherever a value is missing.
     - **Example**:
       ```python
       missing_values = df.isnull()
       print(missing_values)
       ```
   - **Counting Missing Values**:
     - Chaining `isnull().sum()` counts the missing values in each column, because each `True` is summed as 1. This makes it easy to identify columns that may need attention.
     - **Example**:
       ```python
       missing_counts = df.isnull().sum()
       print(missing_counts)
       ```
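
     Raw counts are hard to compare across datasets of different sizes, so it can help to look at the share of missing values and at the affected rows as well. A minimal sketch, assuming a made-up DataFrame with `age` and `city` columns:

     ```python
     import numpy as np
     import pandas as pd

     # Hypothetical sample data with gaps in both columns
     df = pd.DataFrame({
         "age": [25, np.nan, 31, np.nan],
         "city": ["Lima", None, "Oslo", "Rome"],
     })

     # Number of missing values per column
     missing_counts = df.isnull().sum()

     # Fraction of missing values per column (mean of True/False)
     missing_share = df.isnull().mean()

     # Rows that contain at least one missing value
     rows_with_missing = df[df.isnull().any(axis=1)]
     ```

     A column with a very high missing share is often a candidate for dropping, while one with only a few gaps may be worth imputing.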
   - **Strategies for Handling Missing Values**:
     - **Dropping Missing Values**:
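       - Use `dropna()` to remove rows or columns with missing values. By default it drops every row that contains at least one missing value; pass `axis=1` to drop columns instead.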
       - **Example (drop rows with any missing values)**:
         ```python
         df_dropped = df.dropna()
         ```
       - **Example (drop columns with missing values)**:
         ```python
         df_dropped_cols = df.dropna(axis=1)
         ```
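
       Dropping every row or column that has any gap can discard a lot of usable data. The `how`, `subset`, and `thresh` parameters of `dropna()` allow more selective removal. A sketch under assumed data (the column names `id`, `score`, `note`, and `unused` are hypothetical):

       ```python
       import numpy as np
       import pandas as pd

       # Hypothetical survey data; 'unused' is empty in every row
       df = pd.DataFrame({
           "id": [1, 2, 3, 4],
           "score": [90.0, np.nan, 75.0, np.nan],
           "note": ["ok", "late", None, None],
           "unused": [np.nan, np.nan, np.nan, np.nan],
       })

       # Drop only the columns in which every value is missing
       no_empty_cols = df.dropna(axis=1, how="all")

       # Drop rows only when the 'score' column is missing
       scored = df.dropna(subset=["score"])

       # Keep rows that have at least 2 non-missing values
       mostly_complete = df.dropna(thresh=2)
       ```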
     - **Imputing Missing Values**:
       - **Fill with a Specific Value**: Use `fillna(value)` to replace missing values with a single value, either a fixed constant such as 0 or a statistic computed from the column, such as its mean or median.
       - **Example**:
         ```python
         df['column'] = df['column'].fillna(0)  # Replace NaN with 0
         ```
       - **Impute with Statistical Measures**:
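         - Fill missing values with the mean, median, or mode of the column to keep every row and column in the dataset. The median is less affected by outliers than the mean, and because `mode()` returns a Series (ties are possible), select its first entry with `.mode()[0]`.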
         - **Example**:
           ```python
           df['column'] = df['column'].fillna(df['column'].mean())
           ```
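
       The mean is not always the best fill value. The sketch below, which assumes a made-up DataFrame with an outlier in a hypothetical `income` column and a text column `city`, uses the median for the numeric column and the mode for the categorical one:

       ```python
       import numpy as np
       import pandas as pd

       # Hypothetical data: 1000.0 is an outlier that would distort the mean
       df = pd.DataFrame({
           "income": [30.0, 40.0, np.nan, 1000.0],
           "city": ["Lima", "Lima", None, "Oslo"],
       })

       # Median of the observed incomes (30, 40, 1000) fills the gap
       df["income"] = df["income"].fillna(df["income"].median())

       # mode() returns a Series, so take its first entry
       df["city"] = df["city"].fillna(df["city"].mode()[0])
       ```

       Whichever statistic is used, imputed values are estimates, so it is worth recording which columns were filled and how.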

4. **Data Cleaning**
   - **Removing Unwanted Columns**:
     - Sometimes datasets contain irrelevant columns that do not contribute to the analysis. Use the `drop()` method to remove these columns.
     - **Example**:
       ```python
       df = df.drop(columns=['unwanted_column1', 'unwanted_column2'])
       ```
   - **Handling Inconsistent Data**:
     - Inconsistent data, such as differing formats or mixed types within a column, can cause equivalent values to be treated as distinct (e.g., `'NYC'` vs. `'nyc'`). Standardize formats, such as date formats, or convert strings to lowercase.
     - **Example (standardizing text format)**:
       ```python
       df['column'] = df['column'].str.lower()  # Convert all text to lowercase
       ```
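
     Lowercasing alone does not catch stray whitespace, and date strings need parsing before they can be compared. A minimal sketch, assuming hypothetical `city` and `joined` columns and an ISO-style date format:

     ```python
     import pandas as pd

     # Hypothetical data with inconsistent casing, spacing, and bad dates
     df = pd.DataFrame({
         "city": [" Lima", "lima ", "LIMA", "Oslo"],
         "joined": ["2024-01-15", "2024-02-30", "2024-03-01", "not a date"],
     })

     # Trim surrounding whitespace, then lowercase
     df["city"] = df["city"].str.strip().str.lower()

     # Parse dates; values that do not match the format become NaT
     df["joined"] = pd.to_datetime(
         df["joined"], format="%Y-%m-%d", errors="coerce"
     )
     ```

     With `errors="coerce"`, unparseable entries turn into missing values instead of raising an error, so they can then be handled like any other missing data.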
   - **Changing Data Types**:
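     - Convert columns to the appropriate data types (e.g., from string to integer) using the `astype()` method. Note that `astype(int)` raises an error if the column contains missing values or strings that cannot be parsed as integers, so handle those first.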
     - **Example**:
       ```python
       df['column'] = df['column'].astype(int)
       ```
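
     When a column may contain unparseable entries, one option is to convert with `pd.to_numeric()` and then cast to pandas' nullable `Int64` type, which can hold integers alongside missing values. A sketch assuming a hypothetical `quantity` column stored as text:

     ```python
     import pandas as pd

     # Hypothetical column: numbers stored as text, plus two unusable entries
     df = pd.DataFrame({"quantity": ["3", "7", "n/a", None]})

     # Unparseable strings become NaN instead of raising an error
     numeric = pd.to_numeric(df["quantity"], errors="coerce")

     # Nullable integer dtype keeps the gaps as <NA>
     df["quantity"] = numeric.astype("Int64")
     ```

     Note the capital "I" in `"Int64"`: the lowercase `"int64"` is the NumPy integer type, which cannot store missing values.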

5. **Practical Data Cleaning Workflow**
   - A typical data cleaning process involves:
     1. Identifying and handling duplicates.
     2. Checking for missing values and deciding on a strategy (e.g., drop or fill).
     3. Dropping unnecessary columns and standardizing data formats.
     4. Ensuring correct data types across columns.
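
   The four steps above can be sketched as a single function. This is one possible arrangement under stated assumptions, not a fixed recipe: the `clean` function and the `order_id`, `amount`, `city`, and `internal_note` columns are hypothetical, and the choice of median imputation is illustrative.

   ```python
   import numpy as np
   import pandas as pd

   def clean(df: pd.DataFrame) -> pd.DataFrame:
       """Apply the four workflow steps to a hypothetical orders table."""
       # 1. Remove exact duplicate rows
       out = df.drop_duplicates()
       # 2. Drop rows without a key, then fill missing amounts with the median
       out = out.dropna(subset=["order_id"])
       out = out.assign(amount=out["amount"].fillna(out["amount"].median()))
       # 3. Drop an unneeded column and standardize text
       out = out.drop(columns=["internal_note"])
       out = out.assign(city=out["city"].str.strip().str.lower())
       # 4. Fix data types now that no order_id is missing
       out = out.astype({"order_id": int})
       return out.reset_index(drop=True)

   raw = pd.DataFrame({
       "order_id": [1.0, 1.0, 2.0, np.nan, 3.0],
       "amount": [10.0, 10.0, np.nan, 5.0, 30.0],
       "city": [" Lima", " Lima", "OSLO", "Rome", "oslo "],
       "internal_note": ["a", "a", "b", "c", "d"],
   })
   cleaned = clean(raw)
   print(cleaned)
   ```

   The order of steps can matter: standardizing text before deduplicating, for example, may reveal duplicates that differ only in casing or whitespace.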

### Exercises
- **Exercise 1**: Load a dataset and identify any duplicate rows. Remove duplicates and inspect the dataset to confirm.
- **Exercise 2**: Identify missing values in a dataset and try different strategies for handling them (e.g., dropping rows, filling with mean).
- **Exercise 3**: Practice converting data types and standardizing formats in a sample dataset to ensure data consistency.
