#### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

#### Q2: List the techniques used to handle missing data. Give an example of each with Python code.

In [None]:
Ans-

Several techniques can be used to handle missing data in a dataset.
The most common ones are shown below, each with a Python example:

In [2]:
# Deletion:
'''This technique removes the observations (rows) or variables (columns) that contain missing values.
It is simple, but it discards information, and the loss grows with the share of data that is missing.'''

import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, 6, np.nan, 8]})

# Drop every row that has at least one missing value
df = df.dropna()
print(df)


     A    B
0  1.0  5.0
1  2.0  6.0
3  4.0  8.0


In [3]:
# Imputation:
'''This technique fills in each missing value with an estimate derived from the observed data.
Common variants include mean, median, mode, and regression imputation.'''

import numpy as np  # np.nan is used below, so this cell should not rely on an earlier import
import pandas as pd
from sklearn.impute import SimpleImputer

# Creating a DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, 6, np.nan, 8]})

# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
df[['A', 'B']] = imputer.fit_transform(df[['A', 'B']])
print(df)


          A         B
0  1.000000  5.000000
1  2.000000  6.000000
2  2.333333  6.333333
3  4.000000  8.000000
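
The cell above covers mean imputation only. Median and mode imputation, which the description also names, can be sketched with the same `SimpleImputer` class by changing `strategy`. The data below is an illustrative assumption chosen so that an outlier makes the mean and the median differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data (an assumption): 'A' contains an outlier (100)
df = pd.DataFrame({
    'A': [1, 2, 2, np.nan, 100],
    'B': [5, 5, np.nan, 8, 6],
})

# Median imputation: less sensitive to outliers than the mean
median_imputer = SimpleImputer(strategy='median')
df_median = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)

# Mode imputation: fills with the most frequent value of each column
mode_imputer = SimpleImputer(strategy='most_frequent')
df_mode = pd.DataFrame(mode_imputer.fit_transform(df), columns=df.columns)

print(df_median)
print(df_mode)
```

Here the mean of column `A` would be 26.25 because of the outlier, while both the median and the mode fill the gap with 2.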


In [4]:
# K-nearest neighbors imputation:
'''This technique fills in a missing value with the average of that variable over the k nearest rows,
where nearness is measured on the variables that are observed.'''

import numpy as np  # np.nan is used below, so this cell should not rely on an earlier import
import pandas as pd
from sklearn.impute import KNNImputer

# Creating a DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, 6, np.nan, 8]})

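# Impute missing values with KNN imputation.
# Note: row 2 is missing in every column, so no distance to the other rows is defined
# and KNNImputer falls back to the column mean. The output therefore matches mean imputation.
imputer = KNNImputer(n_neighbors=2)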
df[['A', 'B']] = imputer.fit_transform(df[['A', 'B']])
print(df)


          A         B
0  1.000000  5.000000
1  2.000000  6.000000
2  2.333333  6.333333
3  4.000000  8.000000
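
In the toy data above the only incomplete row has no observed value at all, so the neighbor search never comes into play. A minimal sketch where it does, assuming an illustrative DataFrame in which only column `A` has a gap:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative data (an assumption): row 2 is missing 'A' but has 'B' observed
df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0, 50.0],
    'B': [5.0, 6.0, 7.0, 8.0, 40.0],
})

# Neighbors are found using the observed column 'B';
# the missing 'A' is the average of 'A' over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_knn)
```

The two rows closest to row 2 in `B` are rows 1 and 3, so the gap is filled with the average of their `A` values (3.0) rather than the column mean (14.25).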


In [5]:
# Model-based imputation:
'''This technique fits a statistical model on the observed data and uses its predictions to fill in the missing values.
For example, linear regression or decision trees can serve as the predictive model.'''

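import numpy as np  # np.nan is used below, so this cell should not rely on an earlier import
import pandas as pd
# IterativeImputer is experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression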

# Creating a DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, 6, np.nan, 8]})

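# Impute missing values with iterative imputation.
# Note: row 2 has no observed value, so it is first filled with the column means, and a
# least-squares line always passes through the means. The output therefore matches mean imputation.
imputer = IterativeImputer(random_state=0, estimator=LinearRegression())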
df[['A', 'B']] = imputer.fit_transform(df[['A', 'B']])
print(df)


          A         B
0  1.000000  5.000000
1  2.000000  6.000000
2  2.333333  6.333333
3  4.000000  8.000000
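
Because the incomplete row above is entirely empty, the regression has nothing to predict from. A minimal sketch where the model does contribute, assuming illustrative data in which `A = 2 * B + 1` on the observed rows; the decision-tree variant is included only because the description mentions trees, and its settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Illustrative data (an assumption): A = 2 * B + 1 wherever A is observed
df = pd.DataFrame({
    'A': [3.0, 5.0, np.nan, 9.0, 21.0],
    'B': [1.0, 2.0, 3.0, 4.0, 10.0],
})

# Linear regression of A on B predicts the missing value from B = 3
lr_imputer = IterativeImputer(estimator=LinearRegression(), random_state=0)
df_lr = pd.DataFrame(lr_imputer.fit_transform(df), columns=df.columns)

# A decision tree can be swapped in as the estimator; it predicts
# the value stored in the leaf that the row with B = 3 falls into
tree_imputer = IterativeImputer(estimator=DecisionTreeRegressor(random_state=0),
                                random_state=0)
df_tree = pd.DataFrame(tree_imputer.fit_transform(df), columns=df.columns)

print(df_lr)
print(df_tree)
```

With the linear model the gap is filled with roughly 7.0, which follows the relationship in the data, whereas mean imputation would give 9.5.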


#### Q3: Explain imbalanced data. What will happen if imbalanced data is not handled?

#### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

#### Q5: What is data Augmentation? Explain SMOTE.

#### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

#### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

#### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

#### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

#### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

#### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?