## Data Preprocessing
Data preprocessing is a crucial step in the data analysis and machine learning pipeline. It cleans and transforms raw data into a format suitable for analysis or as input to a machine learning model. Its main goals are to improve the quality of the data and, in turn, the performance of machine learning models. The right preprocessing techniques depend on the nature of the data: each data type poses its own challenges and calls for different methods.

## Data Types

The columns in a Pandas DataFrame can contain different types of data. Here are some common types you might encounter:

1. **Numerical Data:**
   - **Discrete:** These are numerical data that have a countable number of distinct values. For example, the **number of cars** in a parking lot or the **number of students in a classroom**.
   - **Continuous:** These can take any numeric value within a range, so the number of possible values is infinite. Examples include **height, weight, or temperature**.

2. **Categorical Data:**
   - **Nominal:** This type of data represents categories without any inherent order or ranking. Examples include **gender, color, or types of fruits**.
   - **Ordinal:** These have categories with a meaningful order. Examples include **socioeconomic status (low income, middle income, high income)** and **education level (high school, BS, MS, PhD)**.

3. **Datetime:**
   - **Datetime (datetime64):** Datetime data, also referred to as timestamp data, represents dates and times. It is the basis of time series data, where each observation is tied to a point in time.

4. **Sparse Data:**
    - Sparse data refers to data where a large proportion of the elements have a value of zero. This type of data is common in various fields, such as natural language processing, recommendation systems, and network analysis. An example is given below, where rows represent users and columns represent movies. Each entry holds the rating a user gave to a movie, and 0 means the user has not rated that movie.

    ```
    User       Movie A   Movie B   Movie C   Movie D   Movie E
    User 1        4         0         0         0         0
    User 2        0         0         0         5         0
    User 3        0         0         3         0         0
    User 4        0         0         0         0         2
    User 5        0         1         0         0         0
    ```

    - Sparse data often calls for specialized storage formats and algorithms, because storing and processing every zero value is computationally expensive and adds no meaningful information.

    - The `scipy.sparse` module provides a variety of sparse matrix types and operations for sparse matrix manipulations. It includes formats such as CSR (Compressed Sparse Row), CSC (Compressed Sparse Column), COO (Coordinate), and others.
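
As a minimal sketch of the CSR format mentioned above, the ratings table can be stored as a `scipy.sparse.csr_matrix`. The variable names (`ratings`, `ratings_csr`, `sparsity`) are illustrative choices, not part of the original notebook.

```python
import numpy as np
from scipy import sparse

# Dense user-by-movie ratings copied from the table above (0 = not rated)
ratings = np.array([
    [4, 0, 0, 0, 0],
    [0, 0, 0, 5, 0],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 2],
    [0, 1, 0, 0, 0],
])

# CSR keeps only the non-zero entries together with their positions
ratings_csr = sparse.csr_matrix(ratings)

print(ratings_csr.nnz)      # number of stored (non-zero) entries: 5
print(ratings_csr.data)     # stored ratings, row by row: [4 5 3 2 1]
print(ratings_csr.indices)  # column index of each stored rating: [0 3 2 4 1]

# Fraction of entries that are zero
sparsity = 1 - ratings_csr.nnz / (ratings.shape[0] * ratings.shape[1])
print(sparsity)             # 0.8

# Convert back to a dense array when an algorithm requires it
dense_again = ratings_csr.toarray()
```

Only 5 of the 25 entries are stored here. On realistic rating matrices with millions of cells, that difference is what makes the data fit in memory.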


In [None]:
import pandas as pd
url = 'https://drive.google.com/file/d/19aYZVyCsbKp0UEQl8QQagKyHFmromwQg/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
df = pd.read_csv(path)
df.head()

1. **total_bill (Numeric):** Represents the total bill amount for a meal, usually a float.

2. **tip (Numeric):** Represents the tip amount given by the customer, usually a float.

3. **sex (Categorical):** Represents the gender of the person paying the bill, often categorized as "Male" or "Female."

4. **smoker (Categorical):** Indicates whether the party was a smoker or non-smoker, often categorized as "Yes" or "No."

5. **day (Categorical):** Represents the day of the week when the meal took place, categorized as "Thur," "Fri," "Sat," or "Sun."

6. **time (Categorical):** Indicates whether the meal was lunch or dinner.

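7. **size (Categorical, ordinal):** Represents the size of the dining party. In this dataset it is stored as words ("one", "two", and so on), so it is an ordered category rather than a number.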
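
One way to check how pandas has interpreted each column is to inspect `dtypes` and, where needed, mark text columns as categorical. The sketch below uses a small hypothetical frame (`sample`, with made-up values) that mimics the columns described above, so it runs without downloading the file.

```python
import pandas as pd

# Hypothetical rows mimicking the columns described above (values are illustrative)
sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
    'sex': ['Female', 'Male', 'Male'],
    'smoker': ['No', 'No', 'Yes'],
    'day': ['Sun', 'Sun', 'Sat'],
    'time': ['Dinner', 'Dinner', 'Lunch'],
    'size': ['two', 'three', 'three'],
})

# Text columns are read as generic strings, not as categories
print(sample.dtypes)

# Mark the text columns as categorical explicitly
categorical_cols = ['sex', 'smoker', 'day', 'time', 'size']
sample = sample.astype({col: 'category' for col in categorical_cols})
print(sample.dtypes)

# Columns can then be selected by kind
numeric_part = sample.select_dtypes(include='number')
categorical_part = sample.select_dtypes(include='category')
```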


## Handling Missing Values

Handling missing data is a crucial aspect of data cleaning and analysis. In pandas, missing data is typically represented by `NaN` (Not a Number). Here are some common techniques for handling missing data in pandas:

### 1. Detecting Missing Data:
   - The `isnull()` method can be used to detect missing values in a DataFrame. It returns a DataFrame of the same shape, where each element is `True` or `False` based on whether the corresponding element in the original DataFrame is missing.

   ```python
   # Detect missing values
   missing_values = df.isnull()
   number_of_missing_values = df.isnull().sum()
   ```

### 2. Dropping Missing Values:
   - The [dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method can be used to remove rows or columns containing missing values. The `axis` parameter specifies whether to drop rows (`axis=0`) or columns (`axis=1`). The `how` parameter specifies whether to drop rows/columns with any missing values (`how='any'`) or all missing values (`how='all'`).

   ```python
   # Drop rows containing any missing values
   df_no_missing_rows_any = df.dropna(how='any', axis=0)

   # Drop rows containing all missing values
   df_no_missing_rows_all = df.dropna(how='all', axis=0)

   # Drop columns containing any missing values
   df_no_missing_cols_any = df.dropna(how='any', axis=1)

   # Drop columns containing all missing values
   df_no_missing_cols_all = df.dropna(how='all', axis=1)
   ```

### 3. Filling Missing Values:
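   - The [fillna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) method fills missing values with a given value: a single scalar for the whole DataFrame, or per-column values supplied as a dict or Series (for example, the column means).

   ```python
   # Fill missing values with a specific value
   df_filled = df.fillna(0)
   # Fill missing values with the mean of each numeric column
   # (numeric_only=True skips text columns, which have no mean)
   df_mean_filled = df.fillna(df.mean(numeric_only=True))
   # Fill missing values of one or more columns
   # ('column_name' and value are placeholders for a real column and fill value)
   df_partial_filled = df.fillna({'column_name': value})
   ```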


In [None]:
import pandas as pd
import numpy as np
data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, np.nan,11]
}
df = pd.DataFrame(data)
df

     A    B     C
0  1.0  5.0   9.0
1  2.0  NaN  10.0
2  NaN  NaN   NaN
3  4.0  8.0  11.0


In [None]:
df.isnull().sum()

A    1
B    2
C    1
dtype: int64

In [None]:
df_dropped_any = df.dropna(how='any', axis=0) # Drop rows with any NaN values (how='any', axis=0)
df_dropped_all = df.dropna(how='all', axis=0) # Drop rows with all NaN values (how='all', axis=0)
df_dropped_any_col = df.dropna(how='any', axis=1) # Drop columns with any NaN values (how='any', axis=1)
df_dropped_all_col = df.dropna(how='all', axis=1) # Drop columns with all NaN values (how='all', axis=1)
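
To see what the drop variants actually return, and to try the filling techniques from section 3, here is a minimal sketch. It rebuilds the same small frame under the name `toy` (an illustrative name) so that it is self-contained.

```python
import numpy as np
import pandas as pd

# Same values as the toy frame defined above
toy = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, np.nan, 11],
})

# Shapes after dropping: (rows, columns)
print(toy.dropna(how='any', axis=0).shape)  # (2, 3): only rows 0 and 3 are complete
print(toy.dropna(how='all', axis=0).shape)  # (3, 3): only row 2 is entirely NaN
print(toy.dropna(how='any', axis=1).shape)  # (4, 0): every column has some NaN
print(toy.dropna(how='all', axis=1).shape)  # (4, 3): no column is entirely NaN

# Replace every missing value with a constant
filled_zero = toy.fillna(0)

# Replace missing values with the mean of their own column
filled_mean = toy.fillna(toy.mean())

# Use a different fill value per column; unlisted columns (here 'C') stay untouched
filled_some = toy.fillna({'A': 0, 'B': toy['B'].median()})
print(filled_some)
```

Dropping with `how='any'` can discard a lot of data (here every column), which is often the reason to prefer filling.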

## Encoding Categorical Variables

Encoding categorical variables is a crucial step in the data preprocessing phase, because most machine learning models require numeric input.

Most Common Categorical Data Encoding Techniques:

1. **Label Encoding:**

    - Assigns a unique integer to each category.
    - Suitable for ordinal data where there is a natural order among categories.
    - Not recommended for nominal data as it may imply misleading relationships.

2. **One-Hot Encoding:**

    - Creates binary columns for each category (0 or 1).
    - Suitable for nominal data without any natural order.
    - Increases dimensionality but avoids false ordinal relationships.

### Label Encoding

In [None]:
import pandas as pd
url = 'https://drive.google.com/file/d/19aYZVyCsbKp0UEQl8QQagKyHFmromwQg/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
df = pd.read_csv(path)
df

In [None]:
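# Map each party-size word to its integer label
labels = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6}
# Any value missing from `labels` would become NaN
df['size'] = df['size'].map(labels)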
df
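
As an alternative to a hand-written dictionary, pandas' ordered `Categorical` type can carry the category order itself. The sketch below uses a hypothetical `education` column, borrowed from the ordinal example in the Data Types section, not a column of the dataset above.

```python
import pandas as pd

# Hypothetical ordinal column (education levels)
education = pd.Series(['BS', 'high school', 'PhD', 'MS', 'BS'])

# Declare the categories in their meaningful order
levels = ['high school', 'BS', 'MS', 'PhD']
education_cat = pd.Categorical(education, categories=levels, ordered=True)

# .codes holds each value's position in `levels`, starting at 0
education_codes = pd.Series(education_cat.codes, index=education.index)
print(education_codes.tolist())  # [1, 0, 3, 2, 1]
```

Values that do not appear in `levels` receive the code -1, which makes unexpected categories easy to spot.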

### One-Hot Encoding

In [None]:
# Perform one-hot encoding on the categorical columns
one_hot_encoded_df = pd.get_dummies(df, columns=['sex', 'day', 'time', 'smoker'])
one_hot_encoded_df
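
To make the result of one-hot encoding easier to inspect, here is a minimal sketch on a small hypothetical frame (`toy`, with made-up values). The `dtype` and `drop_first` arguments are optional extras that the cell above does not use.

```python
import pandas as pd

# Hypothetical frame with one nominal column
toy = pd.DataFrame({
    'day': ['Sat', 'Sun', 'Thur', 'Sat'],
    'tip': [2.0, 3.5, 1.0, 4.0],
})

# One indicator column per category; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=['day'], dtype=int)
print(encoded.columns.tolist())  # ['tip', 'day_Sat', 'day_Sun', 'day_Thur']

# drop_first=True removes the first category, which the remaining columns imply
encoded_reduced = pd.get_dummies(toy, columns=['day'], dtype=int, drop_first=True)
print(encoded_reduced.columns.tolist())  # ['tip', 'day_Sun', 'day_Thur']
```

Dropping one column per feature limits the growth in dimensionality and avoids perfectly correlated inputs, which can matter for linear models.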

## Feature Scaling

Feature scaling is a data preprocessing technique that rescales the values of features (or attributes) in a dataset to a common range. Its primary goal is to put all features on similar scales, which can improve the performance of many machine learning algorithms. It matters most for algorithms that are sensitive to the scale of their inputs, such as models trained with gradient descent (e.g., neural networks) and distance-based algorithms (e.g., k-nearest neighbors or support vector machines).

Common methods of feature scaling include:

1. Min-Max Scaling (Normalization):
   - This method scales the feature values to a specific range, typically between 0 and 1.
   - The formula for min-max scaling is:
     ```
     X_normalized = (X - X_min) / (X_max - X_min)
     ```
   - Here, X is the original feature value, X_normalized is the normalized value, X_min is the minimum value in the feature, and X_max is the maximum value in the feature.

    **The function below accepts a Pandas column as a parameter and returns the scaled data:**
    ```python
    def minmax_scale(column):
        min_val = column.min()
        max_val = column.max()
        scaled_column = (column - min_val) / (max_val - min_val)
        return scaled_column
    ```
   

2. Standardization:
   - Also known as z-score standardization, this method rescales the feature values to have a mean (average) of 0 and a standard deviation of 1.
   - The formula for standardization is:
     ```
     X_standardized = (X - mean) / standard deviation
     ```
   - Here, X is the original feature value, X_standardized is the standardized value, mean is the mean of the feature values, and the standard deviation is the standard deviation of the feature values.
   
    **The function below accepts a Pandas column as a parameter and returns the standardized data:**
    ```python
    def zscore_standardize(column):
        mean_val = column.mean()
        # pandas' std() is the sample standard deviation (ddof=1) by default
        std_dev = column.std()
        standardized_column = (column - mean_val) / std_dev
        return standardized_column
    ```

3. Robust Scaling:
   - Robust scaling is a method that scales the features using the interquartile range (IQR) to make it less sensitive to outliers.
   - The formula for robust scaling is:
     ```
     X_robust = (X - X_median) / (Q3 - Q1)
     ```
   - Here, X is the original feature value, X_robust is the robust-scaled value, X_median is the median, Q1 is the first quartile, and Q3 is the third quartile of the feature values.

    **The function below accepts a Pandas column as a parameter and returns the scaled data:**
    ```python
    def robust_scale(column):
        median_val = column.median()
        iqr = column.quantile(0.75) - column.quantile(0.25)
        scaled_column = (column - median_val) / iqr
        return scaled_column
    ```

In [None]:
# Define Min-Max scaling function
def minmax_scale(column):
    min_val = column.min()
    max_val = column.max()
    scaled_column = (column - min_val) / (max_val - min_val)
    return scaled_column

df['total_bill'] = minmax_scale(df['total_bill'])
df['tip'] = minmax_scale(df['tip'])
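
To compare how the three scaling methods react to an outlier, the sketch below applies them to a small hypothetical series (`bills`, with made-up values) in which one amount is far larger than the rest. The three functions are repeated so the cell runs on its own.

```python
import pandas as pd

def minmax_scale(column):
    return (column - column.min()) / (column.max() - column.min())

def zscore_standardize(column):
    return (column - column.mean()) / column.std()

def robust_scale(column):
    iqr = column.quantile(0.75) - column.quantile(0.25)
    return (column - column.median()) / iqr

# Hypothetical bill amounts with one outlier (100.0)
bills = pd.Series([10.0, 12.0, 14.0, 16.0, 100.0])

scaled_minmax = minmax_scale(bills)
scaled_zscore = zscore_standardize(bills)
scaled_robust = robust_scale(bills)

# Min-max: the outlier pins the top of the range, squeezing the rest near 0
print(scaled_minmax.round(3).tolist())  # [0.0, 0.022, 0.044, 0.067, 1.0]

# Z-score: the result has mean 0 and (sample) standard deviation 1
print(scaled_zscore.mean(), scaled_zscore.std())

# Robust: typical values stay spread out; only the outlier is far from 0
print(scaled_robust.tolist())  # [-1.0, -0.5, 0.0, 0.5, 21.5]
```

With min-max scaling, the four typical bills end up within 0.07 of each other. Robust scaling keeps them half a unit apart, which is why it is often preferred when outliers are present.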