# Exploratory Data Analysis

## Exploratory Data Analysis and Data Cleaning

### Variable Identification
1. Identify Predictor and Target variable
```python
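# Drop the target and any columns that carry no predictive information
# (useless_column_names is a list of column names)
predictor_variables = df.drop(columns=['target_variable_name'] + useless_column_names)
target_variable = df['target_variable_name']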

print(predictor_variables.columns)
print(target_variable.head())
```
2. Identify the data type and category of each variable.
- `df.shape` and `df.dtypes` show the dimensions and storage types of the data.
- Storage types alone cannot tell us whether a variable is nominal, ordinal, interval, or ratio. That is better judged from our own knowledge of the data.
    - Read the problem and identify the variables described. Note key properties of the variables, such as what types of values they can take.
    - Identify any variables from step 1 that take values from a limited set of possibilities with no particular ordering. These variables are categorical.

3. Divide the variables into subcategories for further analysis.
https://towardsdatascience.com/data-types-in-statistics-347e152e8bee

#### Categorical Data:

1. **Nominal Data:**
   - Definition: Categories without any inherent order or ranking.
   - Examples: Colors, gender, types of animals.
   - Identification: Check if the data consists of distinct categories without any implied order.
   - Encoding: use one-hot encoding to transform nominal data into numeric features.
   - **Visualization Tool:** Bar Chart or Pie Chart.
     - Use different colors or patterns for each category.
     - Find frequencies, percentages, and proportions; a proportion is the frequency divided by the total number of events (i.e., how often something happened divided by how often it could happen).

2. **Ordinal Data:**
   - Definition: Categories with a clear order or ranking.
   - Examples: Education levels, customer satisfaction ratings.
   - Identification: Look for data where the categories have a meaningful sequence or hierarchy.
   - Encoding: use label (ordinal) encoding to transform ordinal data into a numeric feature.
   - **Visualization Tool:** Ordered Bar Chart, Dot Plot, or Pie Chart.
     - Display the categories in a meaningful order.
     - Summarize frequencies, percentages, and proportions; a proportion is the frequency divided by the total number of events (i.e., how often something happened divided by how often it could happen).
     - Use percentiles, median, mode, and the interquartile range to summarize your data.
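
The two encodings mentioned above can be sketched with pandas alone. This is a minimal illustration on a hypothetical toy table; the column names `color` and `education` and the category order are assumptions, not part of the original notes.

```python
import pandas as pd

# Hypothetical toy data: 'color' is nominal, 'education' is ordinal
toy = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "education": ["High School", "Bachelor", "Master", "Bachelor"],
})

# One-hot encoding for the nominal column: one indicator column per category
one_hot = pd.get_dummies(toy["color"], prefix="color")

# Ordinal (label) encoding: state the order explicitly, then take the integer codes
education_order = ["High School", "Bachelor", "Master"]
education_cat = pd.Categorical(toy["education"], categories=education_order, ordered=True)
toy["education_code"] = education_cat.codes

print(one_hot.columns.tolist())
print(toy["education_code"].tolist())
```

Stating the category order by hand avoids the alphabetical ordering a generic label encoder would otherwise impose.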

#### Numerical Data:

1. **Discrete Data:**
   - Definition: Countable and distinct values.
   - Examples: Number of cars, number of people.
   - Identification: Data points are individual and separate; there are no fractional values.
   - **Visualization Tool:** Histogram or Bar Chart.
     - Display the count of each discrete value.
     - Summarize frequencies, percentages, and proportions; a proportion is the frequency divided by the total number of events (i.e., how often something happened divided by how often it could happen).
     - Use mean, percentiles, median, mode, and the interquartile range to summarize your data.

2. **Continuous Data:**
   - Definition: Infinite possible values within a given range or measurable but not countable.
   - Examples: Height, weight, temperature.
   - Identification: Data points can take any value within a range, including fractional values.
   - **Visualization Tool:** Histogram or Box Plot.
     - Show the distribution of data and identify outliers.
     - Summarize percentiles, median, interquartile range, mean, mode, standard deviation, and range.
     - With a histogram, you can check the central tendency, variability, modality, and kurtosis of a distribution. A histogram does not flag outliers explicitly, which is why we also use box plots.
     https://medium.com/@agarwal.vishal819/outlier-detection-with-boxplots-1b6757fafa21

   - **Interval Data:**
      - Definition: Data with a consistent interval between values, but no true zero point.
      - Examples: Temperature in Celsius, IQ scores.
      - Identification: Differences between values are meaningful, but ratios are not.
      - **Visualization Tool:** Histogram or Line Chart.
       - Highlight trends over a continuous range.

   - **Ratio Data:**
      - Definition: Data with a consistent interval between values and a true zero point.
      - Examples: Height, weight, income.
      - Identification: Ratios between values are meaningful (e.g., one value is twice or three times another).
      - **Visualization Tool:** Scatter Plot or Box Plot.
       - Explore relationships between variables and identify outliers.

In [2]:
nominal_columns = ["column_name1", "column_name2", ...]
ordinal_columns = ["column_name3", "column_name4", ...]
discrete_columns = ["column_name5", "column_name6", ...]
interval_columns = ["column_name7", "column_name8", ...]
ratio_columns = ["column_name9", "column_name10", ...]
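
A rough first guess for these lists can come from the storage types, as a starting point to correct by hand. The sketch below is only a heuristic; the helper name `suggest_variable_types` and the toy columns are hypothetical, and it cannot distinguish nominal from ordinal or interval from ratio.

```python
import pandas as pd

def suggest_variable_types(data):
    """Return a first guess per column: 'categorical', 'discrete' or 'continuous'."""
    suggestions = {}
    for name in data.columns:
        col = data[name]
        if pd.api.types.is_bool_dtype(col) or not pd.api.types.is_numeric_dtype(col):
            # Strings, booleans, dates and similar are treated as categories here
            suggestions[name] = "categorical"
        elif pd.api.types.is_integer_dtype(col):
            # Whole numbers are assumed to be counts
            suggestions[name] = "discrete"
        else:
            suggestions[name] = "continuous"
    return suggestions

# Hypothetical toy data
toy = pd.DataFrame({
    "city": ["A", "B", "A"],
    "children": [0, 2, 1],
    "height_cm": [170.5, 182.0, 165.2],
})
suggestions = suggest_variable_types(toy)
print(suggestions)
```

Integer-coded categories (for example a 1 to 5 rating) would be labeled discrete by this sketch, so the result still needs a manual review.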

### Univariate Analysis
https://medium.com/@agarwal.vishal819/outlier-detection-with-boxplots-1b6757fafa21
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
https://github.com/krishnaik06/Machine-Learning-in-90-days/blob/master/Section%201-%20Python%20Crash%20Course/EDA.ipynb

**Continuous Variable**
- summary statistics, box plot and histogram
- highlight missing and outlier values. 

**Categorical Variable**
- Count and Count% against each category. 
- Bar chart can be used as visualization.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

# Function for univariate analysis of categorical variables
def analyze_categorical_variable(column_name):
    # Frequency table
    frequency_table = df[column_name].value_counts()

    # Percentage of values under each category
    percentage_table = df[column_name].value_counts(normalize=True) * 100

    # Combine count and percentage into a single DataFrame
    summary_df = pd.DataFrame({'Count': frequency_table, 'Count%': percentage_table})

    # Bar chart
    summary_df.plot(kind='bar', y='Count', legend=False)
    plt.title(f'Univariate Analysis - {column_name}')
    plt.xlabel(column_name)
    plt.ylabel('Count')
    plt.show()

# Analyze each categorical variable
categorical_columns = ['Category_Column1', 'Category_Column2']

for column in categorical_columns:
    analyze_categorical_variable(column)


In [None]:
# Univariate
import numpy as np

# Get unique values in the 'targetcolumn' column
unique_cols = df['targetcolumn'].unique()

# Plotting loop: one strip of points per target class
for col in unique_cols:
    data_variable = df.loc[df['targetcolumn'] == col, 'col_name']
    plt.plot(data_variable, np.zeros_like(data_variable), 'o', label=col)

# Set labels and show the plot
plt.xlabel('col name')
plt.legend()
plt.show()

### Bivariate Analysis
- Look for association (or the lack of it) between pairs of variables at a pre-defined significance level.

**Continuous & Continuous**
- scatter plot
- The relationship can be linear or non-linear, which helps in choosing suitable ML algorithms.
```python
# `size` was renamed to `height` in newer seaborn releases
sns.FacetGrid(df, hue="target_variable", height=5).map(plt.scatter, "x_col", "y_col").add_legend()
plt.show()
```
If the classes form groups that do not overlap, a linear model may be enough to separate them.

- To find the strength of the relationship, we use correlation. Correlation varies between -1 and +1.
    - -1: perfect negative linear correlation
    - +1: perfect positive linear correlation
    - 0: no linear correlation
- Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))
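
To make the formula concrete, the sketch below computes the correlation from the covariance and variances and compares it with NumPy's built-in `np.corrcoef`. The two arrays are made-up values for illustration only.

```python
import numpy as np

# Hypothetical paired measurements
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Correlation = Covariance(X, Y) / sqrt(Var(X) * Var(Y)), using sample statistics
cov_xy = np.cov(x, y, ddof=1)[0, 1]
corr_manual = cov_xy / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

# Same quantity from the library routine
corr_numpy = np.corrcoef(x, y)[0, 1]

print(round(corr_manual, 3), round(corr_numpy, 3))
```

Both values come out at about 0.775, a fairly strong positive linear relationship.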

**Categorical & Categorical**
- 3 methods
1. **Two-way table:**
    - two-way table of count and count%
    - rows represents the category of one variable and the columns represent the categories of the other variable
    - show count or count% of observations available in each combination of row and column categories.
2. **Stacked Column Chart:** visual form of Two-way table
3. **Chi-Square Test:** 
    - Derives the statistical significance of the relationship between the variables.
    - Tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population.
    - It is based on the difference between the expected and observed frequencies in the cells of the two-way table. It returns the p-value of the computed chi-square statistic for the table's degrees of freedom.

Interpreting the p-value:

- Close to 0: strong evidence that the two categorical variables are dependent.
- Close to 1: no evidence against independence.
- Less than 0.05: the relationship between the variables is significant at the 95% confidence level.

The chi-square test statistic for a test of independence of two categorical variables is χ² = Σ (O − E)² / E, where O is the observed frequency and E the expected frequency of each cell under independence.
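
A two-way table and the chi-square test can be sketched with `pd.crosstab` and `scipy.stats.chi2_contingency`. The toy data and the column names `gender` and `purchased` are hypothetical and chosen so that the dependence is obvious.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 100 customers, two categorical variables
toy = pd.DataFrame({
    "gender": ["M"] * 50 + ["F"] * 50,
    "purchased": ["yes"] * 40 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 40,
})

# Two-way table of counts (rows: gender, columns: purchased)
table = pd.crosstab(toy["gender"], toy["purchased"])
print(table)

# Chi-square test of independence on the table
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.4g}")

if p_value < 0.05:
    print("The variables appear to be dependent at the 5% significance level.")
else:
    print("No evidence against independence at the 5% significance level.")
```

The `expected` array holds the frequencies each cell would have if the two variables were independent.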

**Categorical & Continuous:**
- Draw box plots for each level of the categorical variable.
    - Box plots alone do not show statistical significance, especially when the levels have few observations. To check significance we can perform a Z-test, T-test, or ANOVA.
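
As one possible way to test significance across the levels, the sketch below runs a one-way ANOVA with `scipy.stats.f_oneway`. The data and the column names `group` and `value` are hypothetical.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data: a continuous value measured in three categories
toy = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5 + ["c"] * 5,
    "value": [10, 11, 9, 10, 12, 20, 19, 21, 22, 20, 10, 9, 11, 10, 10],
})

# One array of values per category level
samples = [g["value"].to_numpy() for _, g in toy.groupby("group")]

# One-way ANOVA: are the group means plausibly all equal?
f_stat, p_value = f_oneway(*samples)
print(f"F={f_stat:.2f}, p-value={p_value:.4g}")
```

With only two levels, a t-test (`scipy.stats.ttest_ind`) answers the same question.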
    

### Multivariate analysis
```python
# `size` was renamed to `height` in newer seaborn releases
sns.pairplot(df, hue="species", height=3)
```

## Handling Missing Data
### Identifying Missing Values
Convert missing-value symbols such as "?" to NaN, then use `isnull` (or `isna`) to count the NaN values in every column.
- Convert "?" to NaN. Any other missing-value symbol can be converted the same way.
    ```python
    dataframe.replace("?", np.nan, inplace = True)
    ```
- Check number of null values for every column
    ```python
    missing_values_count = dataframe.isnull().sum()
    missing_values_count
    ```
- Check the number of total null values and find out the percentage it is taking
    ```python
    # how many total missing values do we have?
    total_cells = np.prod(dataframe.shape)  # np.product was removed in NumPy 2.0
    total_missing = missing_values_count.sum()

    # percent of data that is missing
    percent_missing = (total_missing/total_cells) * 100
    print(percent_missing)
    ```
- If only a small proportion of a column's values is missing, it can reasonably be filled with some form of imputation. If too much is missing to do anything useful at a basic level, we will probably drop the column later or change it into an indicator feature such as "col_name: 1 or 0". A heatmap of the null mask shows where the gaps are.
    ```python
    sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
    ```

In [None]:
def check_missing_values(df):
    df.replace("?", np.nan, inplace = True)
    
    missing_values_count = df.isnull().sum()
    print(missing_values_count)
    
    # how many total missing values do we have?
    total_cells = np.prod(df.shape)  # np.product was removed in NumPy 2.0
    total_missing = missing_values_count.sum()

    # percent of data that is missing
    percent_missing = (total_missing/total_cells) * 100
    print(percent_missing)
    
    sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
    
    return df

### Dropping Missing Values
With the `dropna` method, you can choose to drop rows or columns that contain missing values such as NaN. Specify
- `axis=0` to drop rows with missing values, or
    ```python
    dataframe.dropna(axis=0)
    ```
- axis = 1 to drop columns with missing value
    ```python
    dataframe.dropna(axis=1)
    ```
- To drop only the rows that have missing values in specific columns, pass those column names as `subset` with `axis=0`. Set `inplace=True` to modify the dataframe directly.
    ```python
    column_rows = ['column_name', 'column_name']  # add more column names as needed
    # To modify directly
    dataframe.dropna(subset=column_rows, axis=0, inplace=True)
    # Or to save the result to another dataframe
    another_dataframe = dataframe.dropna(subset=column_rows, axis=0)
    ```
- To remove specific columns that contain missing values, drop them by name.
    ```python
    columns_to_drop = ['column_name', 'column_name']  # add more column names as needed
    dataframe.drop(columns=columns_to_drop, errors='ignore', inplace=True)
    ```

In [None]:
# This function drops missing values 
def drop_missing_values(df, drop_col, drop_row, row_to_drop, row_to_drop_by_col, col_to_drop, columns_to_drop):
    
    if drop_col:
        df = df.dropna(axis=1)
    if drop_row:
        df = df.dropna(axis=0)
    
    if row_to_drop:
        df = df.dropna(subset=row_to_drop_by_col, axis = 0)
    if col_to_drop:
        print("Columns in original dataset: %d \n" % df.shape[1])
        df = df.drop(columns=columns_to_drop, errors='ignore')
        # just how much data did we lose?
        print("Columns with na's dropped: %d" % df.shape[1])
        
    return df

### Imputation according to outliers and normality
- Use the check_outliers function to visualize outliers using box plots for each relevant numerical column.
    -  Look for points beyond the "whiskers" of the box plot. These points are potential outliers.
- Use the check_normality function to assess the normality of each numerical column using Q-Q plots and the Shapiro-Wilk test.
    - **Q-Q Plot:** If the data points on the plot closely follow the diagonal line, the data is likely normally distributed.
    - **Shapiro-Wilk Test:** A p-value > 0.05 means the test does not reject normality, so the data can be treated as approximately normally distributed.

**Based on the results:**
- If outliers are present and the data is not normally distributed, consider median imputation.
- If outliers are minimal and the data is approximately normally distributed, mean imputation can be considered.
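
The decision rule above can be sketched as a small helper. This is one possible reading, not a fixed recipe: the function name `suggest_imputation`, the 1.5 x IQR outlier check, and the choice to fall back to the median when either condition fails are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

def suggest_imputation(series, alpha=0.05):
    """Suggest 'mean' or 'median' imputation from the observed values."""
    observed = series.dropna()

    # Outlier check with the 1.5 x IQR rule
    q1, q3 = observed.quantile([0.25, 0.75])
    iqr = q3 - q1
    has_outliers = ((observed < q1 - 1.5 * iqr) | (observed > q3 + 1.5 * iqr)).any()

    # Normality check with the Shapiro-Wilk test
    _, p_value = shapiro(observed)

    # Assumption: the median is the safer choice if either check fails
    return "median" if has_outliers or p_value <= alpha else "mean"

# Hypothetical column with one extreme value and one missing value
skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100, np.nan])
print(suggest_imputation(skewed))
```

The helper only suggests a strategy; the box plot and Q-Q plot are still worth inspecting before imputing.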

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro
from statsmodels.graphics.gofplots import qqplot

# Function to check for outliers using box plots
def check_outliers(df, column_name):
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[column_name])
    plt.title(f'Boxplot for {column_name}')
    plt.show()

# Function to check for normality using Q-Q plot and Shapiro-Wilk test
def check_normality(df, column_name):
    # The Q-Q plot and the Shapiro-Wilk test cannot handle NaN, so use observed values only
    values = df[column_name].dropna()

    # Q-Q plot (qqplot creates its own figure)
    qqplot(values, line='s')
    plt.title(f'Q-Q Plot for {column_name}')
    plt.show()

    # Shapiro-Wilk test for normality
    stat, p_value = shapiro(values)
    print(f'Shapiro-Wilk test for {column_name}:')
    print(f'Statistic: {stat}, p-value: {p_value}')
    if p_value > 0.05:
        print(f'The data for {column_name} appears to be normally distributed.')
    else:
        print(f'The data for {column_name} does not appear to be normally distributed.')

# Example usage for one or more columns
column_of_interest = ['column_name']

for col in column_of_interest:
    check_outliers(df, col)
    check_normality(df, col)

### Replacing Missing Values
The `fillna` and `replace` methods can be used to fill in missing values with fixed or newly calculated values.
- Replace a column's missing values with a fixed value
    ```python
    dataframe[column_name] = dataframe[column_name].fillna(value)
    ```
- Replace a column's missing values with mean
    ```python
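    mean = dataframe[column_name].mean()
    # If the column must stay integer-valued, round the mean down instead:
    # mean = math.floor(dataframe[column_name].mean())
    # Assign the result back; in-place replace on a selected column may not update the dataframe
    dataframe[column_name] = dataframe[column_name].fillna(mean)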
    ```
- Similar case imputation: calculate the mean of the non-missing values separately for each category, then replace each missing value with the mean of its own category. For example, for "Male" we replace missing values of manpower with 29.75 and for "Female" with 25.
```python
plt.figure(figsize=(12, 7))
sns.boxplot(x='category_column_name', y='targetted_column', data=train, palette='winter')
plt.show()

# If the box plot shows that the categories differ, proceed with imputation
# Mean of the non-missing values for each category
category_means = train.groupby('category_column_name')['targetted_column'].mean()
overall_mean = train['targetted_column'].mean()

def impute_targetted_col(cols):
    targetted_column = cols['targetted_column']
    category_column_name = cols['category_column_name']

    if pd.isnull(targetted_column):
        # Fall back to the overall mean when the category itself is missing
        return category_means.get(category_column_name, overall_mean)

    return targetted_column

train['targetted_column'] = train[['targetted_column', 'category_column_name']].apply(impute_targetted_col, axis=1)
```
- Replace a column's missing values with median
    ```python
    median = dataframe[column_name].median()
    # If the column must stay integer-valued, round the median down instead:
    # median = math.floor(dataframe[column_name].median())
    dataframe[column_name] = dataframe[column_name].fillna(median)
    ```
- Replace a column's missing values with mode if categorical. We use [0] so that if there are multiple mode we take the first one
    ```python
    mode = dataframe[column_name].mode()[0]
    dataframe[column_name] = dataframe[column_name].fillna(mode)
    ```
- Forward fill
    ```python
    # fillna(method='ffill') is deprecated in recent pandas; use ffill() instead
    dataframe[column_name] = dataframe[column_name].ffill()
    ```
- Backward fill
    ```python
    # fillna(method='bfill') is deprecated in recent pandas; use bfill() instead
    dataframe[column_name] = dataframe[column_name].bfill()
    ```
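
The similar-case imputation described above can also be written without a row-wise function, using `groupby(...).transform('mean')`. The toy values below are hypothetical and were chosen only so that the category means match the 29.75 and 25 quoted in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing 'manpower' value per gender
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "manpower": [20, 30, 34, 35, np.nan, 20, 30, np.nan],
})

# Mean of the non-missing values of each row's own category, aligned row by row
group_means = toy.groupby("gender")["manpower"].transform("mean")

# Fill each gap with the mean of its category
toy["manpower_imputed"] = toy["manpower"].fillna(group_means)
print(toy)
```

The missing "Male" row receives 29.75 and the missing "Female" row receives 25.0.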

In [None]:
def replace_missing_values(df, fixed_col_nam, fixed_col_val, mean_col_nam, med_col_nam, mod_col_nam, ffill_col, bfill_col):
    # Replace with Fixed value
    for i, j in zip(fixed_col_nam, fixed_col_val):
        df[i] = df[i].fillna(j)

    # Replace with Mean value
    for i in mean_col_nam:
        mean = df[i].mean()
        df[i] = df[i].fillna(mean)

    # Replace with Median value
    for i in med_col_nam:
        median = df[i].median()
        df[i] = df[i].fillna(median)

    # Replace with Mode value
    for i in mod_col_nam:
        mode = df[i].mode()[0]
        df[i] = df[i].fillna(mode)

    # Forward fill (fillna(method=...) is deprecated in recent pandas)
    for i in ffill_col:
        df[i] = df[i].ffill()

    # Backward fill
    for i in bfill_col:
        df[i] = df[i].bfill()
    
    return df

### Replace missing values with prediction model

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Function to handle missing data using a prediction model
def impute_missing_with_model(df, target_column):
    # Split the dataset into two sets: one with no missing values and one with missing values
    df_train = df.dropna(subset=[target_column])
    df_test = df[df[target_column].isnull()]

    # Identify predictor variables (features) and the target variable
    X = df_train.drop(columns=[target_column])
    y = df_train[target_column]

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create a predictive model (you can use any model suitable for your data)
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Predict missing values
    predictions = model.predict(df_test.drop(columns=[target_column]))

    # Fill missing values in the original DataFrame
    df.loc[df[target_column].isnull(), target_column] = predictions

    # Evaluate the model performance (optional)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f'Mean Squared Error on the test set: {mse}')

# Example usage for a specific column
target_column_to_impute = 'your_target_column'
impute_missing_with_model(df, target_column_to_impute)

## Outlier detection
- Outliers can be of two types: univariate and multivariate.
- Unusual points usually surface during univariate and bivariate analysis.
- The ideal way to tackle them is to find out why they occur.
- The method for dealing with them then depends on the reason for their occurrence.

See the types of error described at https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

- To detect outliers, use visualization methods such as box plots, histograms, and scatter plots. Common rules of thumb:

    - Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR.
    - Capping: any value outside the range of the 5th and 95th percentiles can be considered an outlier.
    - Data points three or more standard deviations away from the mean.
    - Outlier detection is a special case of examining data for influential points, and it also depends on business understanding.
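
The IQR rule from the list above can be sketched as follows. The helper name `iqr_outliers` and the sample values are hypothetical; the multiplier `k=1.5` is the conventional box-plot choice.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical measurements with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 95])
outliers = iqr_outliers(values)
print(outliers.tolist())
```

Here Q1 is 11 and Q3 is 12.5, so the fences are 8.75 and 14.75 and only the value 95 is flagged.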

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore
import numpy as np

# Load your dataset (replace 'your_data.csv' with your actual file)
# df = pd.read_csv('your_data.csv')

# Function to detect outliers using visualization methods
def visualize_outliers(dataframe, column_name):
    plt.figure(figsize=(15, 5))

    # Box-plot
    plt.subplot(1, 3, 1)
    sns.boxplot(x=dataframe[column_name])
    plt.title(f'Boxplot for {column_name}')

    # Histogram
    plt.subplot(1, 3, 2)
    sns.histplot(dataframe[column_name], bins=30, kde=True)
    plt.title(f'Histogram for {column_name}')

    # Scatter Plot (for bivariate outlier detection)
    plt.subplot(1, 3, 3)
    sns.scatterplot(x=dataframe.index, y=dataframe[column_name])
    plt.title(f'Scatter Plot for {column_name}')

    plt.show()

# Function to detect outliers using statistical measures
def detect_outliers(dataframe, column_name):
    # Z-Score method
    z_scores = zscore(dataframe[column_name])
    outliers_zscore = np.where(np.abs(z_scores) > 3)[0]

    # Percentile Capping method
    lower_limit = np.percentile(dataframe[column_name], 5)
    upper_limit = np.percentile(dataframe[column_name], 95)
    outliers_percentile = np.where((dataframe[column_name] < lower_limit) | (dataframe[column_name] > upper_limit))[0]

    # Standard Deviation method
    mean_value = np.mean(dataframe[column_name])
    std_dev = np.std(dataframe[column_name])
    outliers_std_dev = np.where(np.abs((dataframe[column_name] - mean_value) / std_dev) > 3)[0]

    print(f'Outliers using Z-Score method: {outliers_zscore}')
    print(f'Outliers using Percentile Capping method: {outliers_percentile}')
    print(f'Outliers using Standard Deviation method: {outliers_std_dev}')

# Example usage for a specific column
column_of_interest = 'your_column_name'
visualize_outliers(df, column_of_interest)
detect_outliers(df, column_of_interest)

### Outlier treatment
**Delete Observation**
- We delete outlier values if it is due to data entry error, data processing error or outlier observations are very small in numbers. 
- We can also use trimming at both ends to remove outliers.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore
import numpy as np

# Load your dataset (replace 'your_data.csv' with your actual file)
# df = pd.read_csv('your_data.csv')

# Function to visualize and delete outliers using trimming
def visualize_and_delete_outliers(dataframe, column_name, lower_percentile=5, upper_percentile=95):
    plt.figure(figsize=(12, 4))

    # Box-plot before trimming
    plt.subplot(1, 2, 1)
    sns.boxplot(x=dataframe[column_name])
    plt.title(f'Boxplot for {column_name} (Before Trimming)')

    # Trimming outliers
    lower_limit = np.percentile(dataframe[column_name], lower_percentile)
    upper_limit = np.percentile(dataframe[column_name], upper_percentile)
    trimmed_data = dataframe[(dataframe[column_name] >= lower_limit) & (dataframe[column_name] <= upper_limit)]

    # Box-plot after trimming
    plt.subplot(1, 2, 2)
    sns.boxplot(x=trimmed_data[column_name])
    plt.title(f'Boxplot for {column_name} (After Trimming)')

    plt.show()

    return trimmed_data

# Example usage for a specific column
column_of_interest = 'your_column_name'
trimmed_data = visualize_and_delete_outliers(df, column_of_interest)

# Function to delete outliers using Z-Score
def delete_outliers_zscore(dataframe, column_name, zscore_threshold=3):
    z_scores = zscore(dataframe[column_name])
    # Filter with a boolean mask so the result does not depend on the index labels
    keep = np.abs(np.asarray(z_scores)) <= zscore_threshold
    cleaned_data = dataframe[keep]
    return cleaned_data

# Example usage for a specific column
column_of_interest_zscore = 'your_column_name'
cleaned_data_zscore = delete_outliers_zscore(df, column_of_interest_zscore)

# Function to delete outliers using Percentile Capping
def delete_outliers_percentile(dataframe, column_name, lower_percentile=5, upper_percentile=95):
    lower_limit = np.percentile(dataframe[column_name], lower_percentile)
    upper_limit = np.percentile(dataframe[column_name], upper_percentile)
    cleaned_data = dataframe[(dataframe[column_name] >= lower_limit) & (dataframe[column_name] <= upper_limit)]
    return cleaned_data

# Example usage for a specific column
column_of_interest_percentile = 'your_column_name'
cleaned_data_percentile = delete_outliers_percentile(df, column_of_interest_percentile)

# Function to delete outliers using Standard Deviation
def delete_outliers_std_dev(dataframe, column_name, std_dev_threshold=3):
    mean_value = np.mean(dataframe[column_name])
    std_dev = np.std(dataframe[column_name])
    # Filter with a boolean mask so the result does not depend on the index labels
    keep = np.abs((dataframe[column_name] - mean_value) / std_dev) <= std_dev_threshold
    cleaned_data = dataframe[keep]
    return cleaned_data

# Example usage for a specific column
column_of_interest_std_dev = 'your_column_name'
cleaned_data_std_dev = delete_outliers_std_dev(df, column_of_interest_std_dev)

**Transform and Bin**
- Natural log of a value reduces the variation caused by extreme values. 
- Binning is also a form of variable transformation. Decision tree algorithms cope well with outliers because they effectively bin each variable at its split points.
- We can also use the process of assigning weights to different observations.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Load your dataset (replace 'your_data.csv' with your actual file)
# df = pd.read_csv('your_data.csv')

# List of columns of interest for separate techniques
# (a literal ... inside these lists would be passed to the loops below as a column name)
log_transform_columns = ['column1', 'column2']
binning_columns = ['column3', 'column4']
weighting_columns = ['column5', 'column6']

# Function to perform a log transformation
# np.log1p computes log(1 + x), which stays defined when a value is 0
def log_transform(dataframe, column_name):
    dataframe[f'log_{column_name}'] = np.log1p(dataframe[column_name])

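# Function to perform binning using Decision Tree algorithm
def decision_tree_binning(dataframe, column_name):
    # Fit a Decision Tree Regressor on the column itself; its split points become bin edges
    model = DecisionTreeRegressor(max_leaf_nodes=10, random_state=42)
    X = dataframe[column_name].values.reshape(-1, 1)
    y = dataframe[column_name]
    model.fit(X, y)

    # Internal nodes have feature >= 0; leaf nodes only carry a placeholder threshold
    thresholds = np.sort(model.tree_.threshold[model.tree_.feature >= 0])
    bin_edges = np.concatenate(([-np.inf], thresholds, [np.inf]))

    # Create a new column with bin labels
    dataframe[f'bin_{column_name}'] = pd.cut(dataframe[column_name], bins=bin_edges)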

# Function to assign weights to different observations
def assign_weights(dataframe, column_name):
    # Example: Assign higher weight to values above a certain threshold
    threshold = 5
    dataframe[f'weighted_{column_name}'] = np.where(dataframe[column_name] > threshold, dataframe[column_name] * 2, dataframe[column_name])

# Example usage for a specific column
for column in log_transform_columns:
    log_transform(df, column)

for column in binning_columns:
    decision_tree_binning(df, column)

for column in weighting_columns:
    assign_weights(df, column)

# Visualization for demonstration (you can adjust or add more as needed)
plt.figure(figsize=(15, 5))

# Original vs. Log Transform
plt.subplot(1, 3, 1)
sns.histplot(df['column1'], kde=True, color='blue', label='Original')
sns.histplot(df['log_column1'], kde=True, color='orange', label='Log Transform')
plt.title('Log Transformation')
plt.legend()

# Binning: number of observations per bin (the bin column is categorical, so use a bar chart)
plt.subplot(1, 3, 2)
df['bin_column3'].value_counts(sort=False).plot(kind='bar', color='green')
plt.title('Binning using Decision Tree')

# Original vs. Weighted
plt.subplot(1, 3, 3)
sns.histplot(df['column5'], kde=True, color='blue', label='Original')
sns.histplot(df['weighted_column5'], kde=True, color='red', label='Weighted')
plt.title('Assigning Weights')
plt.legend()

plt.show()

**Imputing**
- If an outlier is artificial (for example, the result of a data entry or processing error), we can impute a replacement value.
- The techniques are the same as for handling missing values.

**Treat Separately**
If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the remaining observations as two different groups, build an individual model for each group, and then combine the outputs.


We can always compare the results of each technique and move on with the one that fits the data best.