### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

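Ans
Missing values are entries for which no value was recorded during observation. They can arise for various reasons, such as
human error, sensor failure, or data corruption. It is essential to handle missing values because most data analysis routines
and machine learning algorithms either fail outright on missing entries or give biased results when those entries are silently
dropped. A few algorithms tolerate missing values: Naive Bayes can skip a missing feature when computing class probabilities,
and many tree-based methods (for example, decision trees with surrogate splits or gradient-boosted trees such as XGBoost and
LightGBM) handle missing values natively. K-nearest neighbors is often listed as well, but it only works with a distance
measure that ignores missing coordinates; with a plain Euclidean distance it cannot handle them.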



### Q2: List down techniques used to handle missing data. Give an example of each with python code.

Ans  
Some techniques used to handle missing data are:

Deletion: This involves removing the rows or columns that contain missing values. This is the easiest but not the best option, as it can lead to loss of information and bias. For example, to drop rows with missing values in pandas, you can use:

    import pandas as pd
    import seaborn as sns

    # Load the Titanic dataset and drop every row that has at least one missing value
    df = sns.load_dataset('titanic')
    df = df.dropna()

Imputation: This involves replacing the missing values with some other values, such as a constant, a statistic (mean, median, mode), or a value predicted by another algorithm. For example, to impute missing values with the mean in pandas, you can use:

    import pandas as pd
    import seaborn as sns

    # Replace missing ages with the mean of the observed ages
    df = sns.load_dataset('titanic')
    df['age'] = df['age'].fillna(df['age'].mean())
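
The "value predicted by another algorithm" variant of imputation can be sketched with scikit-learn's `KNNImputer`, which fills each gap with the average of that feature over the nearest rows. The small table below is made-up data for illustration, and `n_neighbors=2` is an arbitrary choice, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data (an assumption for illustration): two ages are missing.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 8.46],
})

# Each missing age is replaced by the mean age of the 2 rows
# that are closest on the features both rows have observed.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

Observed values are left unchanged; only the `NaN` cells are filled.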

### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Ans Imbalanced data describes a classification problem in which one target class has far more (or far fewer) observations than another. For example, in a dataset of credit card transactions, fraudulent transactions are typically much rarer than legitimate ones.

If imbalanced data is not handled, a model can become biased towards the majority class and largely ignore the minority class. Overall accuracy may still look high, which hides poor performance on the class that usually matters most. For example, a model trained on imbalanced data may predict that all transactions are non-fraudulent, even though some of them are frauds.
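
A small numeric sketch makes the problem concrete. The class counts (990 legitimate, 10 fraudulent) are hypothetical, and the "model" is simply a rule that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 990 legitimate transactions (0) and 10 frauds (1).
y_true = np.array([0] * 990 + [1] * 10)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(f"accuracy     = {accuracy:.2f}")  # accuracy     = 0.99
print(f"fraud recall = {recall:.2f}")    # fraud recall = 0.00
```

The classifier looks 99% accurate while detecting none of the frauds.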

### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Ans In the context of imbalanced data, up-sampling and down-sampling are resampling techniques that change the class proportions of a training set. Up-sampling (over-sampling) increases the number of minority-class observations, either by duplicating existing ones or by generating synthetic ones. Down-sampling (under-sampling) decreases the number of majority-class observations by discarding some of them. (The same terms are also used in signal processing, where they mean raising or lowering the sampling rate of a digital signal.)

Up-sampling is typically required when the minority class is small and every observation is valuable, for example a fraud dataset with 100 fraudulent and 100,000 legitimate transactions, where discarding data would leave too little to learn from. Down-sampling is typically required when the majority class is so large that a random subset still represents it well, for example keeping 10,000 of 1,000,000 legitimate transactions to balance the classes and shorten training time.
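
In the class-imbalance sense used by the surrounding questions, both operations can be sketched with `sklearn.utils.resample`. The toy fraud table and its column names are assumptions made for this example.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data (an assumption for illustration): 10 frauds, 90 legitimate rows.
df = pd.DataFrame({
    "amount": range(100),
    "is_fraud": [1] * 10 + [0] * 90,
})
majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Up-sampling: draw minority rows WITH replacement up to the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows WITHOUT replacement down to the minority size.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["is_fraud"].value_counts().to_dict())
print(downsampled["is_fraud"].value_counts().to_dict())
```

Up-sampling keeps all 100 original rows and adds duplicates, while down-sampling keeps only 20 rows in total.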

### Q6: What are outliers in a dataset? Why is it essential to handle outliers? 

Ans Outliers are values in a dataset that differ markedly from most of the other values. They can be caused by natural variation, measurement errors, or other factors such as data-entry mistakes. It is essential to handle outliers because they can skew the mean, standard deviation, and correlation of a dataset, and so distort statistical analyses, hypothesis tests, and models fitted to the data. Outliers can be detected with box plots (the IQR rule) or z-scores, and their influence can be limited by removing or capping them or by using robust statistics such as the median.
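
The two detection rules mentioned above can be sketched in a few lines of NumPy. The sample values are made up, and the cut-offs (1.5 times the IQR, and an absolute z-score above 2) are common conventions assumed here rather than fixed rules.

```python
import numpy as np

# Made-up measurements with one obviously extreme value.
values = np.array([12.0, 13.0, 12.5, 14.0, 13.5, 12.8, 13.2, 95.0])

# IQR rule (the one a box plot uses): flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# z-score rule: flag points more than 2 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 2]

print("IQR outliers:    ", iqr_outliers)
print("z-score outliers:", z_outliers)

# The outlier drags the mean upward, while the median barely moves.
print("mean:", values.mean(), "median:", np.median(values))
```

Note that in very small samples the z-score of a single extreme point is bounded, so the often-quoted threshold of 3 would miss the outlier in this eight-value example.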

### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

Ans There are two main approaches to handling missing data: deletion and imputation. Deletion means removing the rows or columns that have missing values, but this can reduce the sample size and introduce bias. Imputation means filling in the missing values with reasonable estimates, such as the mean, median, mode, or a constant value. More advanced imputation techniques predict the missing values with regression models, nearest neighbors, or other machine learning algorithms. The choice of technique depends on the type and amount of missing data and on the goal of the analysis.
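
A minimal sketch of this workflow on a hypothetical customer table follows: measure how much is missing, see what deletion would cost, then impute numeric columns with the median and categorical columns with the mode. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with gaps in every column.
customers = pd.DataFrame({
    "age": [34, np.nan, 45, 29, np.nan, 52],
    "plan": ["basic", "premium", None, "basic", "basic", None],
    "monthly_spend": [20.0, 55.0, 60.0, np.nan, 22.0, 58.0],
})

# Step 1: fraction of missing values per column.
print(customers.isna().mean())

# Step 2: deletion keeps only fully observed rows (here a single row survives).
complete_rows = customers.dropna()
print("rows left after deletion:", len(complete_rows))

# Step 3: imputation, choosing the statistic by column type.
imputed = customers.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])
print(imputed)
```

Here deletion would discard five of the six customers, which is the kind of information loss that usually makes imputation the better choice.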

### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

It is not easy to determine if the missing data is random or not, because you cannot observe the missing values or the reasons for their missingness. However, some possible strategies are:

Checking whether the data is missing completely at random (MCAR), which means the probability of missingness is the same for every observation and does not depend on any variable, observed or not. You can do this by creating a missing-value indicator and comparing the mean, standard deviation, and distribution of the other variables between rows where the value is missing and rows where it is observed, using a t-test for numeric variables or a chi-square test for categorical ones. Little's MCAR test performs this comparison across all variables at once. Significant differences are evidence against MCAR.

Checking whether the data is missing at random (MAR), which means the probability of missingness depends on some observed variables but not on the missing values themselves. You can do this by fitting a logistic regression with the missing indicator as the outcome and the other variables as predictors. Significant coefficients show that missingness is related to observed variables, which rules out MCAR and is consistent with MAR.

Considering whether the data is missing not at random (MNAR), which means the probability of missingness depends on the missing values themselves or on some unobserved variables. This cannot be tested from the observed data alone, so it relies on domain knowledge and on sensitivity analysis, for example repeating the analysis under several assumed missingness scenarios and checking how much the conclusions change.
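
The missing-indicator comparison can be sketched as follows. The data is simulated so that income is more likely to be missing for older customers (a MAR mechanism chosen for this example), and Welch's t-test from SciPy compares the ages of the two groups. All names and parameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
age = rng.normal(40, 10, n)
income = 30000 + 800 * age + rng.normal(0, 5000, n)

# Simulated mechanism (an assumption): older customers skip the income field more often.
p_missing = 1 / (1 + np.exp(-(age - 50) / 5))
missing = rng.random(n) < p_missing
df = pd.DataFrame({"age": age, "income": np.where(missing, np.nan, income)})

# Compare an observed variable (age) between rows with and without income.
is_missing = df["income"].isna()
age_when_missing = df.loc[is_missing, "age"]
age_when_observed = df.loc[~is_missing, "age"]
t_stat, p_value = stats.ttest_ind(age_when_missing, age_when_observed,
                                  equal_var=False)

print("mean age, income missing: ", round(age_when_missing.mean(), 1))
print("mean age, income observed:", round(age_when_observed.mean(), 1))
print("Welch t-test p-value:", p_value)
```

A small p-value is evidence against MCAR. It does not distinguish MAR from MNAR, because that would require knowing the values that are missing.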

### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Some strategies to evaluate the performance of your machine learning model on this imbalanced dataset are:

Using metrics that reflect performance on the minority class, such as precision, recall, F1-score, or AUC-ROC. Accuracy is not a good metric here because a model that always predicts the majority class still scores highly on it.

Using stratified cross-validation or a stratified train/test split so that every training and test set has the same proportion of classes as the original dataset. This ensures that each fold contains minority-class examples and makes the performance estimates more stable.

Using resampling techniques such as oversampling the minority class or undersampling the majority class to create a balanced training set. This can improve the model's ability to learn from both classes. Resampling should be applied only to the training data, so that the test set keeps the original class distribution.

Using ensemble methods such as bagging, boosting, or stacking to combine multiple models and reduce the variance or bias of the predictions. This can increase the robustness and generalization of the model.

Using cost-sensitive learning to assign different weights or penalties to different classes based on their importance or frequency. This can make the model more sensitive to the minority class and less sensitive to the majority class.
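
Several of these strategies can be combined in one evaluation loop. The sketch below uses a synthetic 95/5 dataset from `make_classification` (a stand-in for the medical data, which is not available here), stratified 5-fold cross-validation, class weighting as a simple form of cost-sensitive learning, and several metrics side by side.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data: roughly 95% negative and 5% positive cases.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# class_weight="balanced" penalises mistakes on the rare class more heavily.
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

for name in metrics:
    print(f"{name:>9}: {scores['test_' + name].mean():.3f}")
```

Reading the metrics together shows the trade-off that accuracy alone hides, such as how much precision is given up to gain recall on the rare class.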

### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

Some methods to balance the dataset and down-sample the majority class are:

Using simple random sampling to select a subset of the majority class that matches the size of the minority class. This can create a balanced dataset, but it may lose some information from the majority class.

Using stratified sampling to select a subset of the majority class that preserves the distribution of some important variables or features. This can create a balanced dataset that maintains the representativeness of the majority class.

Using cluster-based sampling to group the majority class into clusters based on some similarity measure and then select a subset of clusters to form the balanced dataset. This can reduce the variability within the majority class and increase the diversity of the balanced dataset.

Using cost-sensitive learning to assign different weights or penalties to different classes based on their importance or frequency. This can make the model more sensitive to the minority class and less sensitive to the majority class without changing the original dataset.
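
Simple random and stratified down-sampling can be sketched with pandas alone. The survey table, the `region` feature used for stratification, and the class counts are all invented for this example.

```python
import pandas as pd

# Invented survey data: 900 satisfied customers (1) and 100 unsatisfied (0).
df = pd.DataFrame({
    "satisfied": [1] * 900 + [0] * 100,
    "region": ["north"] * 600 + ["south"] * 300
              + ["north"] * 50 + ["south"] * 50,
})
majority = df[df["satisfied"] == 1]
minority = df[df["satisfied"] == 0]

# Simple random down-sampling: pick as many majority rows as there are minority rows.
majority_random = majority.sample(n=len(minority), random_state=0)

# Stratified down-sampling: sample the same fraction from every region,
# so the majority subset keeps its original regional mix.
frac = len(minority) / len(majority)
majority_strat = majority.groupby("region").sample(frac=frac, random_state=0)

# Combine and shuffle to obtain the balanced dataset.
balanced = pd.concat([majority_strat, minority]).sample(frac=1, random_state=0)
print(balanced["satisfied"].value_counts().to_dict())
print(majority_strat["region"].value_counts(normalize=True).round(2).to_dict())
```

Because the per-group sample sizes are rounded, the stratified subset can differ from the minority size by a row or two.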


### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class? 

Some methods to balance the dataset and up-sample the minority class are:

Using simple random sampling with replacement to duplicate observations from the minority class until it matches the size of the majority class. This can create a balanced dataset, but it may introduce overfitting or noise to the minority class.

Using synthetic minority oversampling technique (SMOTE) to generate new observations from the minority class by interpolating between existing ones. This can create a balanced dataset that increases the diversity of the minority class.

Using adaptive synthetic sampling (ADASYN) to generate new observations from the minority class by using a density distribution as a criterion to decide the number of synthetic samples per existing sample. This can create a balanced dataset that adapts to the underlying data distribution and reduces the learning bias.

Using ensemble methods such as bagging, boosting, or stacking to combine multiple models and reduce the variance or bias of the predictions. This can increase the robustness and generalization of the model without changing the original dataset.
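
The interpolation idea behind SMOTE can be sketched in plain NumPy. This is a simplified illustration, not the reference algorithm: the helper `smote_like` and its parameters are hypothetical names introduced here, and the two-dimensional data is simulated. For real projects, the imbalanced-learn package provides maintained `SMOTE` and `ADASYN` implementations.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style sketch (hypothetical helper): create n_new points,
    each on the line segment between a minority point and one of its
    k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)          # starting points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                            # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Simulated data: 200 majority points and 20 rare-event points.
rng = np.random.default_rng(1)
X_majority = rng.normal(0, 1, size=(200, 2))
X_minority = rng.normal(3, 0.5, size=(20, 2))

# Generate enough synthetic rare events to match the majority class.
X_synth = smote_like(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_up = np.vstack([X_minority, X_synth])
print(X_majority.shape, X_minority_up.shape)
```

Unlike duplication with replacement, every synthetic row is a new point, although it always lies between two existing minority observations.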