In [None]:
Q2: List down techniques used to handle missing data. Give an example of each with python code.

In [None]:
Ans: Some techniques used to handle missing data are:

1. Dropping missing values:
import pandas as pd

# Load dataset
df = pd.read_csv('data.csv')

# Drop rows that contain at least one missing value
df_without_rows = df.dropna()

# Alternatively, drop columns that contain at least one missing value
df_without_cols = df.dropna(axis=1)


2. Imputing missing values with a constant or a column statistic:
import pandas as pd

# Load dataset
df = pd.read_csv('data.csv')

# Impute missing values with zero
df_zero = df.fillna(0)

# Alternatively, impute numeric columns with their column mean
df_mean = df.fillna(df.mean(numeric_only=True))



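3. Imputing missing values with interpolation:
import pandas as pd

# Load dataset
df = pd.read_csv('data.csv')

# Linearly interpolate missing values in the numeric columns
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].interpolate()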



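4. Using machine learning algorithms to impute missing values:
import pandas as pd
from sklearn.impute import KNNImputer

# Load dataset
df = pd.read_csv('data.csv')

# KNNImputer works on numeric data only, so select the numeric columns
# (assumes no column is entirely missing)
numeric_df = df.select_dtypes(include='number')

# Impute missing values using KNN imputer
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(numeric_df),
                          columns=numeric_df.columns, index=numeric_df.index)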


In [None]:
Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

In [None]:
Ans: Imbalanced data refers to a situation where the number of observations belonging to one class in
a classification problem is significantly higher or lower than the number of observations
belonging to the other classes. This can cause issues in machine learning models since the algorithm
tends to be biased towards the majority class and may have difficulty accurately predicting the minority
class.

If imbalanced data is not handled, the following problems can arise:
    1. The model may report a misleadingly high accuracy simply because it predicts the majority class correctly, even if it rarely identifies the minority class.
    2. The model may become biased towards the majority class, leading to poor generalization on new, unseen data.
    3. The model may exhibit low recall for the minority class.
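
The first and third points can be illustrated with a small sketch. The labels below are hypothetical (95 majority, 5 minority), and the "model" is just a baseline that always predicts the majority class:

```python
# Hypothetical labels: 95 negatives (majority) and 5 positives (minority).
y_true = [0] * 95 + [1] * 5

# A baseline "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the minority class: share of actual positives that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
minority_recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.95
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

The baseline looks strong on accuracy while never detecting a single minority observation.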

In [None]:
Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-
sampling are required.

In [None]:
Ans: Up-sampling and down-sampling are techniques used to address imbalanced data by either increasing or decreasing
the number of observations in a particular class.

Up-sampling involves randomly duplicating observations from the minority class until the number of observations
in that class is equal to the number of observations in the majority class.


Down-sampling involves randomly removing observations from the majority class until the number of observations
in that class is equal to the number of observations in the minority class.

For example, consider a medical dataset where the number of patients with a rare disease is much
smaller than the number of patients without the disease. Here the data is too scarce to discard any
of it, so up-sampling can be used to duplicate minority observations (or generate synthetic ones)
until both classes are balanced. Conversely, in a dataset for credit card fraud detection,
non-fraudulent transactions can be far more numerous, so down-sampling can be used to remove a
significant portion of them, which reduces the bias towards the majority class.
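
A minimal sketch of both techniques with pandas, assuming a hypothetical toy dataset with a `label` column (the column names and sizes are made up for illustration):

```python
import pandas as pd

# Hypothetical toy dataset: 8 majority rows (label 0) and 2 minority rows (label 1).
df = pd.DataFrame({
    "feature": range(10),
    "label": [0] * 8 + [1] * 2,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority size.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority size.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

Resampling is usually applied to the training split only, so that the test set keeps the real class distribution.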

In [None]:
Q5: What is data Augmentation? Explain SMOTE.

In [None]:
Ans: Data augmentation is a technique used to artificially increase the size of a dataset by
generating new samples from the existing data. This is typically done by applying various
transformations to the original data, such as rotation, flipping, scaling, or adding noise.

SMOTE (Synthetic Minority Over-sampling Technique) is a specific type of data augmentation technique
used to address imbalanced data in classification problems. It generates synthetic examples of the
minority class by picking a minority sample, choosing one of its nearest minority-class neighbors,
and creating a new data point at a random position on the line segment between the two.
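
The interpolation idea behind SMOTE can be sketched with NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_like_samples`, its parameters, and the sample points are hypothetical, and it assumes the minority points are distinct and that `k` is smaller than their count.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point.
        i = rng.integers(len(minority))
        # Distances from the chosen point to every minority point.
        dists = np.linalg.norm(minority - minority[i], axis=1)
        # Indices of the k nearest neighbors, skipping the point itself (distance 0).
        neighbors = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbors)
        # Random position along the segment between the point and its neighbor.
        lam = rng.random()
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# Hypothetical minority-class samples with two features each.
minority_points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points.shape)  # (5, 2)
```

In practice a maintained library implementation is preferable to a hand-rolled version like this one.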

In [None]:
Q6: What are outliers in a dataset? Why is it essential to handle outliers?

In [None]:
Ans: Outliers are observations that deviate significantly from the rest of the data points in a 
dataset. They can be caused by measurement or recording errors, natural variations in the data, 
or rare events.


It is essential to handle outliers because they can have a significant impact on the results of
statistical analysis and machine learning models. Outliers can skew the results of summary statistics
such as the mean and standard deviation, leading to inaccurate inferences about the data. Additionally,
they can affect the performance of machine learning models by introducing noise and bias.
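
As a small illustration of the effect on summary statistics, the sketch below uses hypothetical measurements with one extreme value and flags it with the common 1.5 * IQR rule (one of several possible detection rules):

```python
import statistics

# Hypothetical measurements with one extreme value.
values = [10, 12, 11, 13, 12, 11, 95]

# The mean is pulled towards the extreme value; the median is not.
mean_with_outlier = statistics.mean(values)
median_with_outlier = statistics.median(values)

# Flag outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
cleaned = [v for v in values if lower <= v <= upper]

print(f"mean with outlier: {mean_with_outlier:.2f}")
print(f"median with outlier: {median_with_outlier}")
print(f"outliers: {outliers}")
print(f"mean without outlier: {statistics.mean(cleaned):.2f}")
```

Whether a flagged value should be removed, capped, or kept depends on whether it is an error or a genuine rare event.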

In [None]:
Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

In [None]:
Ans: When dealing with missing data in a customer data analysis project, there are several techniques
you can use. One approach is to simply remove any observations with missing data, although this can
reduce the size of the dataset and potentially introduce bias. Another approach is to impute the
missing values by filling them in with estimated values based on the other observations in the
dataset. Imputation techniques can include mean imputation, median imputation, regression imputation,
or k-nearest neighbors imputation.
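
A minimal sketch contrasting row removal with median imputation, assuming hypothetical customer columns `age` and `monthly_spend`:

```python
import pandas as pd

# Hypothetical customer data with gaps in two numeric columns.
customers = pd.DataFrame({
    "age": [25, None, 40, 35],
    "monthly_spend": [120.0, 80.0, None, 100.0],
})

# Option 1: drop incomplete rows (simple, but half of the customers are lost here).
complete_only = customers.dropna()

# Option 2: fill each gap with the median of its column.
imputed = customers.fillna(customers.median(numeric_only=True))

print(len(complete_only))
print(imputed)
```

The median is less sensitive to extreme values than the mean, which can matter for skewed quantities such as spending.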

In [None]:
Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?

In [None]:
Ans: To determine if missing data is missing at random or if there is a pattern to the missing data,
you can use exploratory data analysis techniques such as data visualization, summary statistics, and
hypothesis testing. One approach is to split the rows into those where a value is missing and those
where it is present, and then compare the distributions of the other variables between the two
groups. If the distributions are similar, it may indicate that the data is missing at random. If
they differ, it may indicate a pattern to the missing data.
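
The comparison can be sketched with pandas. The survey data and column names below are hypothetical, and a real analysis would typically follow up with a formal test:

```python
import pandas as pd

# Hypothetical survey data: income is missing for some respondents.
survey = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 60, 67],
    "income": [30.0, 35.0, 42.0, 50.0, None, None, None, 48.0],
})

# Share of missing values per column.
missing_share = survey.isna().mean()

# Indicator column: True where income is missing.
survey["income_missing"] = survey["income"].isna()

# Compare an observed variable between rows with and without missing income.
mean_age_missing = survey.loc[survey["income_missing"], "age"].mean()
mean_age_observed = survey.loc[~survey["income_missing"], "age"].mean()

print(missing_share)
print(f"mean age, income missing:  {mean_age_missing:.1f}")
print(f"mean age, income observed: {mean_age_observed:.1f}")
```

A clear gap between the two group means, as in this toy data, would suggest that missingness is related to age rather than occurring purely by chance.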

In [None]:
Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

In [None]:
Ans: When dealing with an imbalanced medical diagnosis dataset, some strategies you can use to evaluate
the performance of your machine learning model include using evaluation metrics that are appropriate
for imbalanced datasets, such as precision, recall, F1 score, and area under the ROC curve, rather
than accuracy alone. You can also use stratified sampling for the train/test split or for
cross-validation so that every split preserves the class ratio. Additionally, resampling methods
(such as up-sampling or down-sampling) or cost-sensitive learning can be applied to the training data
to improve the performance of the machine learning model on the minority class.
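
A minimal sketch of how precision, recall, and F1 are computed from predictions. The helper name `precision_recall_f1` and the patient counts are hypothetical:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical dataset: 90 healthy patients (0) and 10 with the condition (1).
y_true = [0] * 90 + [1] * 10
# Hypothetical predictions: 5 false alarms, and 6 of the 10 sick patients detected.
y_pred = [1] * 5 + [0] * 85 + [1] * 6 + [0] * 4

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is 0.91, yet the model misses 4 of the 10 patients with the condition, which recall makes visible.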

In [None]:
Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

In [None]:
Ans: When dealing with an unbalanced customer satisfaction dataset, you can employ methods such as
down-sampling the majority class to balance the dataset. This involves randomly removing some
observations from the majority class (the satisfied customers) to match the size of the minority
class, at the cost of discarding some data. Another approach is to use weighting methods, where you
give more weight to the minority class to balance the impact of the imbalance on the machine
learning model.
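
The weighting approach can be sketched as follows, assuming hypothetical labels and the inverse-frequency heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`:

```python
from collections import Counter

# Hypothetical labels: 80 satisfied customers and 20 unsatisfied ones.
labels = ["satisfied"] * 80 + ["unsatisfied"] * 20

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Inverse-frequency weights: rarer classes receive larger weights.
class_weights = {
    cls: n_samples / (n_classes * count) for cls, count in counts.items()
}

print(class_weights)  # {'satisfied': 0.625, 'unsatisfied': 2.5}
```

With these weights, each class contributes the same total weight (50 here) to the training loss, without removing any observations.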

In [None]:
Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

In [None]:
Ans: When dealing with an unbalanced dataset with a low percentage of occurrences, you can employ
methods such as up-sampling the minority class to balance the dataset. This involves randomly
duplicating minority observations, or generating synthetic data points for the minority class with
a technique such as SMOTE, to increase its size and balance the dataset. Another approach is to use
ensemble methods such as bagging or boosting to improve the performance of the machine learning
model on the minority class.