## Q1. Missing Values in a Dataset
*Missing values* are entries for which no value is stored for a variable in an observation. They arise for reasons such as data corruption, data entry errors, or information that was never collected.
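As a first step, it usually helps to measure how much data is missing. The sketch below uses a small made-up DataFrame (the name `sample` and its columns and values are illustrative assumptions, not data from this notebook) and counts missing entries per column with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; both NaN and None are treated as missing by pandas
sample = pd.DataFrame({
    'age': [25, np.nan, 31, 40],
    'city': ['Pune', 'Delhi', None, None],
})

missing_per_column = sample.isna().sum()   # absolute count of missing entries per column
missing_share = sample.isna().mean()       # fraction of missing entries per column

print(missing_per_column)
print(missing_share)
```

Columns with a high missing share are candidates for deletion, while columns with only a few gaps are usually better imputed.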

*Handling missing values is essential because*:
- They can bias the analysis, leading to incorrect conclusions.
- Some algorithms cannot handle missing values and will throw errors or perform poorly.

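*Algorithms that can tolerate missing values* (depending on the implementation) include:
- Decision Trees (implementations with surrogate splits or native missing-value support)
- Random Forest (when the underlying tree implementation supports missing values)
- XGBoost (treats values matching its `missing` parameter, NaN by default, as missing and learns a default split direction for them)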

In [None]:
## Q2. Techniques to Handle Missing Data
# Techniques:
# 1. Deletion:
#   - Listwise Deletion: Remove every row that contains at least one missing value.
#   - Pairwise Deletion: Keep all rows and let each computation use whatever values are available for the variables it involves.


import pandas as pd

df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})
df_dropped = df.dropna()  # Listwise deletion: only the first row is complete

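# 2. Imputation:
#   - Mean/Median/Mode Imputation: Fill gaps with a summary statistic of the column.

df['A'] = df['A'].fillna(df['A'].mean())  # assign back instead of using inplace=True on a column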

#   - Regression Imputation: Predict missing values from other columns with a regression model.

from sklearn.linear_model import LinearRegression

known = df[df['B'].notna()]    # rows where B is observed (training data)
unknown = df[df['B'].isna()]   # rows where B has to be predicted

lr = LinearRegression()
lr.fit(known[['A']], known['B'])  # requires A to be free of NaN (imputed above)

df.loc[df['B'].isna(), 'B'] = lr.predict(unknown[['A']])

#   - K-Nearest Neighbors (KNN) Imputation: Fill each gap using the most similar rows.

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
df_imputed = imputer.fit_transform(df)  # returns a NumPy array, not a DataFrame

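# 3. Advanced Methods:
#   - Multiple Imputation by Chained Equations (MICE): Model each incomplete variable on the others in turn, and repeat the process to produce several imputed datasets so that the uncertainty of the imputations carries into the analysis.
#   - Machine Learning Models: Using more flexible models (for example tree ensembles) to predict missing values.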
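The chained-equations idea can be sketched with scikit-learn's `IterativeImputer`, which is inspired by MICE but returns a single imputation per fit. Drawing several imputations with `sample_posterior=True` and different seeds, as done here, is one way to approximate multiple imputation; treat it as an illustration under that assumption rather than a full MICE workflow. The DataFrame `data` is made-up example data.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the experimental API)
from sklearn.impute import IterativeImputer

# Hypothetical toy data with gaps in both columns
data = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0, 5.0],
    'B': [2.0, np.nan, 6.0, 8.0, 10.0],
})

# Draw several imputed datasets, each with a different random seed
imputed_sets = []
for seed in range(3):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    imputed_sets.append(pd.DataFrame(imputer.fit_transform(data), columns=data.columns))

# Average the imputed datasets; observed values are unchanged, only the gaps differ
pooled = sum(imputed_sets) / len(imputed_sets)
print(pooled)
```

In a full multiple-imputation analysis the model is fitted on each imputed dataset separately and the results are pooled, instead of averaging the data as this short sketch does.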

## Q3. Imbalanced Data
*Imbalanced data* occurs when the classes in a dataset are not represented equally. This can lead to a model being biased towards the majority class, resulting in poor performance for the minority class.

*If not handled*, the model might:
- Predict the majority class more often.
- Have high overall accuracy but low sensitivity/recall for the minority class.
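The gap between accuracy and minority-class recall can be shown with a small sketch. The labels below are made up for illustration: 95 majority samples, 5 minority samples, and a hypothetical model that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, recall_score

# 95 majority-class (0) and 5 minority-class (1) samples
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)        # share of correct predictions overall
minority_recall = recall_score(y_true, y_pred)   # share of minority samples that were found

print(accuracy, minority_recall)
```

The model is right 95% of the time yet never identifies a single minority sample, which is why accuracy alone is misleading here.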


In [None]:
## Q4. Up-sampling and Down-sampling
# Up-sampling (Over-sampling): Increasing the number of instances in the minority class by replicating them or creating synthetic examples.
# Down-sampling (Under-sampling): Reducing the number of instances in the majority class by discarding some of them.

# Example (assumes df has a binary 'class' column where 1 is the minority class):

from sklearn.utils import resample

minority_class = df[df['class'] == 1]
majority_class = df[df['class'] == 0]

# Up-sampling: draw minority rows with replacement until they match the majority count
upsampled_minority = resample(minority_class,
                              replace=True,
                              n_samples=len(majority_class),
                              random_state=42)

# Down-sampling: draw majority rows without replacement down to the minority count
downsampled_majority = resample(majority_class,
                                replace=False,
                                n_samples=len(minority_class),
                                random_state=42)


# When required:
# - Up-sampling: When the minority class is too small and underrepresented, and no majority data should be discarded.
# - Down-sampling: When the majority class is so large that training becomes computationally inefficient and some rows can be discarded.
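Resampling one class is only half of the job: the resampled class still has to be recombined with the other class. The following sketch shows one way to do that on a made-up DataFrame (`toy`, with a hypothetical `feature` column), shuffling the rows so the classes are not stored in blocks.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 8 majority rows (class 0) and 2 minority rows (class 1)
toy = pd.DataFrame({'feature': range(10), 'class': [0] * 8 + [1] * 2})
minority = toy[toy['class'] == 1]
majority = toy[toy['class'] == 0]

# Up-sample the minority class, then recombine and shuffle
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced_up = pd.concat([majority, minority_up]).sample(frac=1, random_state=42).reset_index(drop=True)

# Down-sample the majority class, then recombine and shuffle
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced_down = pd.concat([majority_down, minority]).sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced_up['class'].value_counts())
print(balanced_down['class'].value_counts())
```

Resampling should be applied to the training split only, so that duplicated or discarded rows do not distort the evaluation data.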

In [None]:
## Q5. Data Augmentation and SMOTE
# Data Augmentation: Techniques that increase the diversity of training data without collecting new data, such as rotating or flipping images.

# SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic minority-class samples by interpolating between an existing minority example and one of its nearest minority-class neighbors.

from imblearn.over_sampling import SMOTE

# X (features) and y (labels) are assumed to be defined already; SMOTE needs numeric features
# and more minority samples than its k_neighbors setting (5 by default).
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
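The interpolation step at the core of SMOTE can be written out by hand. This is a simplified sketch for a single pair of points, not the library's implementation: the two points and the variable names are illustrative assumptions, and the nearest-neighbor search that SMOTE performs first is skipped.

```python
import numpy as np

rng = np.random.default_rng(0)

x_i = np.array([1.0, 2.0])          # an existing minority-class sample
x_neighbor = np.array([3.0, 6.0])   # one of its minority-class neighbors (assumed given)

lam = rng.uniform(0.0, 1.0)         # random position along the connecting segment
x_synthetic = x_i + lam * (x_neighbor - x_i)

print(lam, x_synthetic)
```

Because `lam` lies between 0 and 1, the synthetic point always falls on the line segment between the two original samples.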

## Q6. Outliers in a Dataset
*Outliers* are data points that differ significantly from other observations. They can distort statistical analyses and model predictions.

*Handling outliers is essential* because:
- They can skew summary statistics such as the mean and lead to misleading conclusions.
- They can affect the performance and accuracy of machine learning models.
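A small sketch makes the distortion concrete. The numbers are made up, and the 1.5 × IQR rule used to flag the outlier is one common convention chosen here as an assumption, not a method prescribed by this notebook.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 95])   # 95 is the suspicious point

mean_with = values.mean()             # pulled upward by the extreme value
mean_without = values[:-1].mean()     # mean of the remaining six values
median_with = np.median(values)       # the median barely reacts to the outlier

# Flag points outside 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(mean_with, mean_without, median_with, outliers)
```

A single extreme value roughly doubles the mean here, while the median stays close to the bulk of the data.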

## Q7. Techniques to Handle Missing Data in Analysis
1. *Imputation* (Mean/Median/Mode, KNN, etc.).
2. *Deletion* (Removing rows or columns).
3. *Using Algorithms that Support Missing Values* (XGBoost, and Decision Tree or Random Forest implementations with native missing-value handling).
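For the deletion option, pandas lets you choose between dropping rows and dropping columns. The sketch below uses a made-up DataFrame (`frame`) to show both, plus the `thresh` argument, which keeps only columns with at least a given number of observed values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'b' is mostly missing, column 'c' has one gap
frame = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

rows_dropped = frame.dropna(axis=0)                     # keep only fully observed rows
cols_dropped = frame.dropna(axis=1)                     # keep only fully observed columns
mostly_complete_cols = frame.dropna(axis=1, thresh=3)   # keep columns with >= 3 observed values

print(rows_dropped.shape, list(cols_dropped.columns), list(mostly_complete_cols.columns))
```

Dropping the sparse column `b` first and then handling the single gap in `c` preserves far more data than dropping every incomplete row.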

## Q8. Determining If Missing Data is Random
1. *Missing Completely at Random (MCAR)*: Missingness is unrelated to any observed or unobserved data.
   - Test with Little's MCAR test.
2. *Missing at Random (MAR)*: Missingness depends only on observed data, not on the missing values themselves.
   - Analyze correlations between missingness and other variables.
3. *Missing Not at Random (MNAR)*: Missingness depends on the unobserved values themselves.
   - Cannot be confirmed from the observed data alone; requires domain knowledge to identify patterns.
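The correlation check for MAR can be sketched by turning missingness into an indicator column and comparing it with an observed variable. The `survey` DataFrame and its column names are made-up assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is missing mostly for older respondents
survey = pd.DataFrame({
    'age':    [22, 25, 31, 38, 45, 52, 58, 63],
    'income': [30.0, 32.0, 41.0, 50.0, np.nan, np.nan, np.nan, 70.0],
})

# 1 where income is missing, 0 where it is observed
survey['income_missing'] = survey['income'].isna().astype(int)

# Compare an observed variable across the two groups
mean_age_by_missing = survey.groupby('income_missing')['age'].mean()
corr = survey['age'].corr(survey['income_missing'])

print(mean_age_by_missing)
print(corr)
```

A clear difference between the groups is evidence against MCAR and is consistent with MAR. It cannot rule out MNAR, because that depends on the values that were never observed.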

## Q9. Evaluating Model Performance on Imbalanced Data
1. *Use Metrics like Precision, Recall, F1-Score* instead of accuracy.
2. *ROC-AUC*: Consider the area under the ROC curve.
3. *Confusion Matrix*: Detailed analysis of true positives, false positives, etc.
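These metrics can be computed together with scikit-learn. The labels and scores below are made up for illustration, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# 8 majority-class (0) and 2 minority-class (1) samples with hypothetical model scores
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.1, 0.4, 0.7, 0.4]
y_pred = [int(s >= 0.5) for s in y_score]   # threshold the scores at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)        # uses the raw scores, not the thresholded labels

print(tn, fp, fn, tp)
print(precision, recall, f1, auc)
```

Accuracy for this example is 0.8, yet precision and recall for the minority class are both only 0.5, which is the kind of gap these metrics are meant to expose.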

In [None]:
## Q10. Balancing Dataset by Down-sampling
# 1. Random Under-sampling: Randomly remove samples from the majority class.

from sklearn.utils import resample

# majority_class and minority_class are the per-class DataFrames built in Q4
majority_class_downsampled = resample(majority_class,
                                      replace=False,
                                      n_samples=len(minority_class),
                                      random_state=42)

# 2. Cluster Centroids: Replace clusters of majority class samples with their centroids.

In [None]:
## Q11. Balancing Dataset by Up-sampling
# 1. Random Over-sampling: Randomly duplicate samples from the minority class.
# 2. SMOTE: Generate synthetic samples for the minority class.

from imblearn.over_sampling import SMOTE

# X and y are assumed to be the feature matrix and label vector defined earlier
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)