In [None]:
Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.

In [None]:
-Missing values are data points that are not available or were not recorded for a particular variable
or observation in a dataset. In other words, missing values refer to the absence of a value or a piece 
of information that would typically be expected to be present in a dataset.

-Handling missing values is essential because many machine learning algorithms fail outright if the dataset contains them.
If the missing values are not handled properly (for example, by silently dropping rows), you may end up building a biased
machine learning model, leading to incorrect results. Some algorithms tolerate missing values better than others, but whether
they do so without preprocessing depends on the implementation:

1.Decision trees: Some implementations handle missing values natively, for example through surrogate splits or by learning which branch missing values should follow.

2.Random forests: Random forests are ensembles of decision trees and inherit this ability when the underlying tree implementation supports it.

3.k-Nearest Neighbors (k-NN): k-NN does not accept missing values directly, but it is commonly used to impute them by replacing each missing entry with 
the mean or median value of that feature among the k-nearest neighbors.

4.Support Vector Machines (SVM): SVM cannot train on missing values; they must first be imputed, for example with the mean or median value of the available data.

5.Gaussian Mixture Models (GMM): Standard GMM implementations require complete data, so missing values are typically imputed first with the mean or median value.

6.Principal Component Analysis (PCA): Standard PCA requires complete data, so missing values are typically imputed with the mean or median value before the decomposition.

In [None]:
Q2: List down techniques used to handle missing data. Give an example of each with python code.

In [None]:
Deletion: This involves removing the entire row or column that contains missing data. This method is simple and straightforward, 
but it can lead to loss of important information and may affect the accuracy of the analysis.

Imputation: Imputation involves filling in missing values with estimated or imputed values. 
There are several methods for imputation, such as mean imputation, median imputation, regression imputation, 
and k-nearest neighbor imputation. The imputation method chosen will depend on the type and amount of missing data,
as well as the distribution of the data.

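Marking: In this method, missing values are replaced with a sentinel value such as -999, or flagged in a separate indicator column, so that 
the model can tell the value was missing. This method can be useful in cases where imputation is not possible or not desirable, or when the 
fact that a value is missing is itself informative.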

Model-based imputation: This method involves creating a model to predict the missing values based on the available data. For example,
regression models can be used to predict missing values based on other variables in the dataset.

Multiple imputation: This involves creating several completed copies of the dataset, each with the missing values drawn from a predictive 
distribution, analyzing each copy, and then combining the results to obtain a final estimate. Because it reflects the uncertainty of the 
imputed values, this method can provide more reliable estimates than single imputation methods, but it is also more computationally intensive.

Machine learning-based imputation: In this method, machine learning algorithms are used to predict missing values based on other variables 
in the dataset. This method can be useful when dealing with large datasets with complex relationships between variables.
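
A minimal sketch of the deletion and marking techniques, using pandas on a small made-up table (the column names and values are 
hypothetical and chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries
data = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 26.55, 8.05],
})

# Deletion: drop every row that contains at least one missing value
deleted = data.dropna()

# Marking: keep an indicator column, then fill the gap with a sentinel value
marked = data.copy()
marked["age_missing"] = marked["age"].isna().astype(int)
marked["age"] = marked["age"].fillna(-999)

print(deleted)
print(marked)
```

Deletion keeps only the fully observed rows, while marking keeps every row and records where the gaps were.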

In [3]:
import seaborn as sns
# The titanic dataset contains missing values; below we replace NaN values with the mean, median, and mode
df = sns.load_dataset("titanic")

In [13]:
df["age_mean"] = df["age"].fillna(df["age"].mean())      # mean: use when outliers are absent
df["age_median"] = df["age"].fillna(df["age"].median())  # median: use when outliers are present
# mode() returns a Series, so take its first entry; the mode suits categorical columns such as "embarked"
df["embarked_mode"] = df["embarked"].fillna(df["embarked"].mode()[0])

In [None]:
Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

In [None]:
Imbalanced data refers to a situation where the number of observations in each class or category 
of a binary or multiclass classification problem is not equal. In other words, one or more classes 
may have a much smaller number of observations than the other classes, which can lead to a significant imbalance in the data.

If imbalanced data is not handled, it can lead to several issues, including:

1.Biased model performance: Since most machine learning algorithms are designed to maximize overall accuracy, 
they tend to be biased towards the majority class in imbalanced data. As a result, the model may perform poorly 
on the minority class and may even misclassify most or all of the minority class examples.

2.Poor generalization: Models trained on imbalanced data may have poor generalization performance when applied to new, 
unseen data, particularly if the distribution of classes in the new data is different from that in the training data.

3.Overfitting: Imbalanced data can also lead to overfitting, where the model memorizes the few minority class examples in the training data 
instead of learning patterns that generalize to new minority cases. This can happen when the model is optimized to minimize the training 
error without taking into account the imbalanced nature of the data.
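
The bias towards the majority class can be illustrated with a small sketch. The labels below are made up, and the "model" is a 
deliberately trivial one that always predicts the majority class:

```python
import numpy as np

# Hypothetical labels: 95 negatives (class 0) and 5 positives (class 1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the minority class: fraction of true positives that were found
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The accuracy looks high even though not a single minority example is detected, which is why accuracy alone is a poor guide on imbalanced data.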

In [None]:
Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-
sampling are required.

In [None]:
Up-sampling and down-sampling are two common techniques used to address the issue of imbalanced data in a classification problem.

1.Up-sampling involves increasing the number of instances in the minority class to balance the 
class distribution. This can be done by either duplicating existing instances in the minority
class (simple up-sampling) or generating synthetic instances that are similar to the existing ones 
(e.g., using techniques like SMOTE - Synthetic Minority Over-sampling Technique). Up-sampling can help 
to ensure that the minority class is well-represented in the dataset and can improve the performance of machine learning models.

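2.Down-sampling, on the other hand, involves reducing the number of instances in the majority class to balance the class distribution.
This can be done by either randomly selecting a subset of instances in the majority class (simple down-sampling) or removing majority
instances that sit on the class boundary (e.g., the majority members of Tomek links). Down-sampling can help to prevent the machine
learning models from being biased towards the majority class, and can improve the performance on the minority class.

For example, with 80 majority and 20 minority instances, up-sampling raises the minority class to 80 instances and suits small datasets where
no data can be spared, while down-sampling cuts the majority class to 20 instances and suits large datasets where discarding data is affordable.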

In [14]:
import numpy as np
import pandas as pd

In [22]:
np.random.seed(123)
n_samples = 100
class_0_ratio = 0.8
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [23]:
class_0 = pd.DataFrame({
    'first': np.random.normal(loc=0, scale=1, size=n_class_0),
    'second': np.random.normal(loc=0, scale=1, size=n_class_0),
    'third': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'first': np.random.normal(loc=2, scale=1, size=n_class_1),
    'second': np.random.normal(loc=2, scale=1, size=n_class_1),
    'third': [1] * n_class_1
})

In [24]:
df = pd.concat([class_0,class_1]).reset_index(drop=True)

In [25]:
df_minority=df[df['third']==1]
df_majority=df[df['third']==0]

In [26]:
from sklearn.utils import resample

In [31]:
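# Up-sample: draw minority rows with replacement until they match the majority class size
df_minority_upsample=resample(df_minority,
                              replace=True,
                              n_samples=len(df_majority))

# Down-sample: draw majority rows without replacement until they match the minority class size
df_majority_downsample=resample(df_majority,
                                replace=False,
                                n_samples=len(df_minority))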

In [None]:
Q5: What is data Augmentation? Explain SMOTE.

In [None]:
Data augmentation is a technique used to increase the size of a dataset by creating new,
synthetic samples based on the existing ones. The aim of data augmentation is to improve the performance 
and robustness of machine learning models by providing more diverse and representative examples for training.

One popular data augmentation technique is SMOTE (Synthetic Minority Over-sampling Technique). 
SMOTE is used in cases where the dataset is imbalanced, meaning that one class has significantly 
fewer examples than the other(s). SMOTE generates synthetic examples for the minority class by
interpolating between an existing minority example and one of its nearest minority class neighbors.
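
The interpolation idea can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not a reference 
implementation of SMOTE: the function name `smote_sketch`, its parameters, and the toy points are all hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows from minority samples X_min (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per row

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))           # pick a minority sample
        nb = neighbors[base, rng.integers(k)]     # pick one of its neighbors
        gap = rng.random()                        # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority class with 6 two-dimensional points
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]])
X_new = smote_sketch(X_min, n_new=10)
print(X_new.shape)
```

Each synthetic point lies on the line segment between two real minority points, so it adds variety without leaving the region the minority class already occupies.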

In [None]:
Q6: What are outliers in a dataset? Why is it essential to handle outliers?

In [None]:
Outliers are data points in a dataset that are significantly different from the other data points.
Outliers can be caused by a variety of factors, such as measurement errors, experimental errors, or rare events. 
Outliers can have a significant impact on the statistical analysis of the dataset and can lead to incorrect conclusions.

It is essential to handle outliers for several reasons:

1.They can bias statistical analysis: Outliers can significantly affect the mean and standard deviation of the dataset,
which can lead to incorrect statistical analysis. For example, if the mean is significantly affected by an outlier,
it may not be a representative measure of the central tendency of the data.

2.They can affect machine learning algorithms: Outliers can also affect the performance of machine learning algorithms 
by distorting the relationships between the features and the target variable. This can lead to overfitting, where the model
fits the noise instead of the underlying patterns in the data.

3.They can affect data visualization: Outliers can also affect data visualization by distorting the scales of the plots,
making it difficult to see the patterns in the data.
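
One common way to flag outliers is the 1.5 x IQR rule. The sketch below applies it to a handful of made-up measurements and shows how 
much a single outlier shifts the mean compared with the median:

```python
import numpy as np

# Hypothetical measurements with one obvious outlier (95.0)
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
cleaned = values[(values >= lower) & (values <= upper)]

# The mean shifts a lot when the outlier is removed; the median moves far less
print("outliers:", outliers)
print("mean with / without outlier:", values.mean(), cleaned.mean())
print("median with / without outlier:", np.median(values), np.median(cleaned))
```

Whether a flagged point should be removed, capped, or kept still depends on whether it is an error or a genuine rare event.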

In [None]:
Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

In [None]:
Deletion: This involves removing the entire row or column that contains missing data. This method is simple and straightforward, 
but it can lead to loss of important information and may affect the accuracy of the analysis.

Imputation: Imputation involves filling in missing values with estimated or imputed values. 
There are several methods for imputation, such as mean imputation, median imputation, regression imputation, 
and k-nearest neighbor imputation. The imputation method chosen will depend on the type and amount of missing data,
as well as the distribution of the data.

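Marking: In this method, missing values are replaced with a sentinel value such as -999, or flagged in a separate indicator column, so that 
the model can tell the value was missing. This method can be useful in cases where imputation is not possible or not desirable, or when the 
fact that a value is missing is itself informative.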

Model-based imputation: This method involves creating a model to predict the missing values based on the available data. For example,
regression models can be used to predict missing values based on other variables in the dataset.

Multiple imputation: This involves creating several completed copies of the dataset, each with the missing values drawn from a predictive 
distribution, analyzing each copy, and then combining the results to obtain a final estimate. Because it reflects the uncertainty of the 
imputed values, this method can provide more reliable estimates than single imputation methods, but it is also more computationally intensive.

Machine learning-based imputation: In this method, machine learning algorithms are used to predict missing values based on other variables 
in the dataset. This method can be useful when dealing with large datasets with complex relationships between variables.
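
As a rough sketch of model-based imputation on customer data, the block below fits a straight line to the observed rows and uses it to 
fill the gaps. The table, its column names (`visits`, `spend`), and the choice of a linear model are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: "spend" is missing for two customers
customers = pd.DataFrame({
    "visits": [1, 2, 3, 4, 5, 6],
    "spend": [12.0, 19.0, np.nan, 41.0, np.nan, 62.0],
})

known = customers.dropna(subset=["spend"])
missing = customers["spend"].isna()

# Fit spend = slope * visits + intercept on the rows where spend is observed
slope, intercept = np.polyfit(known["visits"], known["spend"], deg=1)

# Predict only the missing entries and keep the observed values unchanged
imputed = customers.copy()
imputed.loc[missing, "spend"] = slope * imputed.loc[missing, "visits"] + intercept
print(imputed)
```

Unlike mean imputation, the filled values here follow the relationship between the two variables, at the cost of assuming that relationship is linear.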

In [None]:
Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?

In [None]:
When dealing with missing data, it is important to determine the missingness mechanism: missing completely at random (MCAR), where missingness 
is unrelated to any variable; missing at random (MAR), where it depends only on observed variables; or missing not at random (MNAR), where it 
depends on the unobserved values themselves. Here are some strategies that can be used to determine the missing data pattern:

Check the missing data pattern: Visualize the pattern of missing data to see if there is a specific pattern or not.
If the missing data is scattered randomly across the dataset, it is more consistent with MCAR. However, if there is a 
particular pattern (e.g., missing values only for certain categories or time periods), the data is MAR or possibly MNAR.

Explore correlation between missing data and other variables: Create a missing-value indicator and examine whether it is related to other variables 
in the dataset. If there is no relationship, the data is consistent with MCAR. If the indicator is related to observed variables, the data is at least MAR.

Impute the missing values under different assumptions: MNAR cannot be confirmed from the observed data alone, because the values that would reveal it 
are the ones that are missing. By imputing under several plausible assumptions and comparing the results, we can see how sensitive the conclusions are 
to the missingness mechanism. If they change substantially, domain knowledge about why the values are missing is needed.
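
A minimal sketch of the indicator-based check, on simulated data where the missingness is deliberately tied to an observed variable. 
The column names and the simulation setup are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: income is more likely to be missing for younger people,
# so the missingness depends on an observed variable (age)
demo = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})
p_missing = np.where(demo["age"] < 30, 0.5, 0.05)
demo.loc[rng.random(n) < p_missing, "income"] = np.nan

# Step 1: build a missing-value indicator
demo["income_missing"] = demo["income"].isna().astype(int)

# Step 2: compare an observed variable between the two groups
mean_age_by_group = demo.groupby("income_missing")["age"].mean()

# Step 3: correlate the indicator with the observed variable
corr = demo["income_missing"].corr(demo["age"])

print(mean_age_by_group)
print(f"correlation between missingness and age: {corr:.2f}")
```

A clear difference between the groups, as simulated here, suggests that the missingness is not completely random.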

In [None]:
Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

In [None]:
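Imbalanced datasets, where the proportion of samples in each class is unequal, are common in many machine learning applications, 
including medical diagnosis. Accuracy alone is misleading here, because a model that always predicts "no condition" still scores highly.
Here are some strategies that can be used to evaluate the performance of machine learning models on imbalanced datasets:

Use class-aware metrics: Report the confusion matrix together with precision, recall (sensitivity), specificity, and the F1 score for the positive class.
In medical diagnosis, recall is often the priority because a missed case (false negative) is usually costlier than a false alarm.

Use threshold-independent metrics: The ROC curve and its AUC summarize performance across decision thresholds, and the precision-recall curve 
is often more informative when positive cases are rare.

Use stratified splits: Stratified train/test splits or stratified k-fold cross-validation keep the class ratio the same in every fold.

Resample the dataset: One common approach to address class imbalance is to resample the dataset.
Two common techniques for resampling include oversampling the minority class and undersampling the majority class.
Oversampling involves generating more samples for the minority class, while undersampling involves removing samples from the majority class.
However, it's important to resample only the training data, so that the evaluation data keeps the real class distribution and the results are not biased.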
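
For evaluation on an imbalanced test set, metrics such as precision, recall, and F1 are more informative than accuracy. A minimal sketch, 
using made-up labels and predictions, computes them from the confusion matrix counts:

```python
import numpy as np

# Hypothetical test set: 1 = has the condition, 0 = does not
y_true = np.array([0] * 90 + [1] * 10)
# Hypothetical predictions: 85 true negatives, 5 false positives,
# 6 true positives, 4 false negatives
y_pred = np.array([0] * 85 + [1] * 5 + [1] * 6 + [0] * 4)

tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f} f1={f1:.2f}")
```

In this example the accuracy is high while recall shows that a sizeable share of the patients with the condition are missed.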

In [None]:
Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

In [None]:
1.Random undersampling: This method randomly selects a subset of samples from the majority class to match the number of samples in the minority class.

2.Cluster centroids: This method clusters the majority class samples and then replaces them with the centroids of the clusters, which act as 
representatives of the majority class.

3.Tomek links: A Tomek link is a pair of samples from different classes that are each other's nearest neighbors. This method removes the majority class 
samples that are part of Tomek links.

4.NearMiss: NearMiss is an undersampling method that selects samples from the majority class that are closest to the minority class samples.

5.One-sided selection: This method combines Tomek link removal with the condensed nearest neighbor rule to discard borderline, noisy, and redundant 
majority class samples.
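
Random undersampling, the simplest of these methods, can be sketched with pandas alone. The survey table and its column names 
are hypothetical.

```python
import pandas as pd

# Hypothetical survey: 90 satisfied customers, 10 dissatisfied
survey = pd.DataFrame({
    "customer_id": range(100),
    "satisfied": [1] * 90 + [0] * 10,
})

majority = survey[survey["satisfied"] == 1]
minority = survey[survey["satisfied"] == 0]

# Randomly keep only as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle so the classes are not grouped together
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
print(balanced["satisfied"].value_counts())
```

The balanced table is much smaller than the original, which is the main cost of down-sampling.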



In [None]:
Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

In [None]:
When dealing with a dataset with a low percentage of occurrences of a rare event, it's common to use upsampling methods to balance the dataset.
Here are some methods that can be employed to up-sample the minority class:

1.Random oversampling: This method randomly duplicates samples from the minority class to increase their number and balance the dataset.

2.SMOTE: Synthetic Minority Over-sampling Technique (SMOTE) generates new synthetic samples by interpolating between the minority class samples. 
This method avoids exact replication of minority samples and provides more variety to the dataset.

3.ADASYN: Adaptive Synthetic Sampling (ADASYN) generates synthetic samples by considering the difficulty of learning samples in the minority class.
This method creates more synthetic samples for difficult to learn samples and less for easy to learn samples.

4.Borderline-SMOTE: Borderline-SMOTE is a variant of SMOTE which generates synthetic samples only from minority class samples that lie 
near the class boundary, where misclassification is most likely.

5.Class weighting: Strictly an alternative to resampling rather than an up-sampling method, class weighting assigns higher weights to the 
minority class samples during model training to increase their influence on the learning process.
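
Random oversampling and class weighting can both be sketched in a few lines of NumPy. The labels are made up, and the weighting formula 
shown (inverse class frequency) is one common heuristic rather than the only option.

```python
import numpy as np

# Hypothetical labels for a rare event: 95 non-events, 5 events
y = np.array([0] * 95 + [1] * 5)
rng = np.random.default_rng(0)

# Random oversampling: draw minority indices with replacement
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
balanced_idx = np.concatenate([majority_idx, minority_idx, extra])
print("class counts after oversampling:", np.bincount(y[balanced_idx]))

# Class weighting: inverse-frequency weights, n_samples / (n_classes * count)
counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)
print("class weights:", weights)
```

Oversampling changes the data that the model sees, while class weighting leaves the data untouched and changes how much each error costs.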