In [None]:
# 1.Ans.

Missing values are those values in a dataset that are absent or unknown. 
These values can occur due to various reasons such as human errors, data corruption,
or simply missing information.

There are several methods to handle missing values, such as imputation, deletion, or prediction. 
Imputation involves filling in the missing values with an estimated value based on the other observations
in the dataset. Deletion can be done by removing the rows or columns containing missing values, but this
can lead to a loss of valuable information. Prediction involves using machine learning algorithms to predict 
the missing values based on the available data.

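Most algorithms cannot work with missing values directly. The main exceptions are some tree-based
implementations, such as XGBoost, LightGBM, and scikit-learn's histogram-based gradient boosting, which
handle missing values natively. Whether plain decision trees and random forests accept them depends on the
library and version. Support vector machines (SVMs) need the missing values imputed or removed first.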


In [None]:
# 2.Ans.

Some techniques used to handle missing data:
    
Deletion: This involves deleting the rows or columns containing missing values. 
However, this approach should be used with caution, as it may lead to a loss of information.

Imputation: This involves filling in the missing values with estimates. 
Some popular methods for imputation include mean imputation, median imputation, mode 
imputation, and regression imputation.

Encoding: In some cases, missing data may be encoded as a separate category, which can be 
treated as a feature in its own right.

   Examples with Python code:
      
    Deletion:
        
import pandas as pd

data = {'Name': ['Jo', 'sk', 'Ma', 'Ja', 'Da'],
        'Age': [25, 30, 20, None, 35],
        'Salary': [50000, 60000, None, 70000, 80000]}
df = pd.DataFrame(data)

# Drop every row that contains at least one missing value
df = df.dropna()
print(df)

    Imputation:
        
import pandas as pd
from sklearn.impute import SimpleImputer

data = {'Name': ['Jo', 'sk', 'Ma', 'Ja', 'Da'],
        'Age': [25, 30, 20, None, 35],
        'Salary': [50000, 60000, None, 70000, 80000]}
df = pd.DataFrame(data)

# Replace missing values in the numeric columns with each column's mean
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print(df)


    Encoding:
        
import pandas as pd
from sklearn.preprocessing import LabelEncoder

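# Gender has one missing entry, which will become its own category
data = {'Name': ['Jo', 'sk', 'Ma', 'Ja', 'Da'],
        'Gender': ['Male', None, 'Female', 'Male', 'Male']}
df = pd.DataFrame(data)

# Mark the missing value as an explicit 'Unknown' category, then label-encode
df['Gender'] = df['Gender'].fillna('Unknown')
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])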
print(df)


    

In [None]:
# 3.Ans.

Imbalanced data refers to a situation where the number of samples in each class of a
classification problem is not evenly distributed. This means that one or more classes 
have significantly fewer samples compared to the other classes.

If imbalanced data is not handled, it can hurt the performance of machine learning algorithms.
This is because most algorithms are designed to maximize overall accuracy, so they may be biased
towards the majority class, resulting in poor predictive performance for the minority class.
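
To see why overall accuracy can mislead here, consider a small sketch with made-up numbers
(900 majority samples and 100 minority samples, chosen only for illustration). A "model" that
always predicts the majority class looks accurate but never finds a minority sample.

```python
# Hypothetical labels: 900 majority-class (0) and 100 minority-class (1) samples
y_true = [0] * 900 + [1] * 100

# A "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Fraction of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of minority samples that were correctly identified
minority_recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / y_true.count(1)

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.90
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```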



In [None]:
# 4.Ans.

Up-sampling and down-sampling are techniques used in machine learning to handle
imbalanced data by adjusting the class distribution of a dataset.

Up-sampling involves increasing the number of samples in the minority class to balance the class distribution.

Down-sampling involves reducing the number of samples in the majority class to balance the class distribution.

Here's an example of when up-sampling and down-sampling are useful:

Suppose we have a dataset of 1000 customer transactions, where 900 transactions are non-fraudulent and 
100 are fraudulent. In this case, the data is imbalanced because the minority class (fraudulent transactions) 
has only a small percentage of the total number of samples.

If we were to train a machine learning model on this imbalanced dataset, it might be biased towards the
majority class and have poor predictive performance for the minority class. To address this issue, we 
could use up-sampling or down-sampling to balance the class distribution.

For example, we could up-sample the minority class by generating synthetic samples using SMOTE, resulting 
in a new dataset with 900 non-fraudulent transactions and 900 fraudulent transactions (the 100 original 
ones plus 800 synthetic ones). Alternatively, we could down-sample the majority class by randomly selecting 
100 non-fraudulent transactions to create a new dataset with 100 non-fraudulent and 100 fraudulent transactions.

By balancing the class distribution through up-sampling or down-sampling, we can often improve the
performance of the machine learning model for the minority class and reduce its bias 
towards the majority class.
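
The two approaches can be sketched with pandas as follows. This is only an illustration: the
DataFrame, its column names (amount, is_fraud) and the random seed are assumptions, and the
up-sampling here uses random duplication with replacement rather than SMOTE.

```python
import pandas as pd

# Hypothetical dataset matching the example: 900 non-fraudulent, 100 fraudulent
df = pd.DataFrame({
    'amount': range(1000),
    'is_fraud': [0] * 900 + [1] * 100,
})
majority = df[df['is_fraud'] == 0]
minority = df[df['is_fraud'] == 1]

# Down-sampling: keep a random subset of the majority class
down = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# Up-sampling: draw minority rows with replacement until the classes match
up = pd.concat([majority,
                minority.sample(n=len(majority), replace=True, random_state=0)])

print(down['is_fraud'].value_counts().to_dict())
print(up['is_fraud'].value_counts().to_dict())
```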

In [None]:
# 5.Ans.

Data augmentation is a technique used in machine learning to increase the amount 
and diversity of data available for training models.

SMOTE (Synthetic Minority Over-sampling Technique) is a technique used in machine learning 
to address imbalanced datasets where the minority class has significantly fewer instances 
than the majority class. SMOTE generates synthetic instances of the minority class by 
interpolating between an existing minority instance and one of its nearest minority-class neighbors.
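
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration,
not the reference implementation (in practice a library such as imbalanced-learn, which provides a
SMOTE class, would normally be used). The function name smote_sketch, its parameters, and the
sample points are all hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points from minority samples X_min (simplified)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise distances between minority samples (a point is not its own neighbor)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbors per sample

    # Pick a random base sample and one of its neighbors for each new point
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the segment between the two
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Hypothetical minority-class samples with two features
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0], [2.5, 2.2]])
synthetic = smote_sketch(X_min, n_new=10)
print(synthetic.shape)
```

Because every synthetic point lies on a line segment between two real minority samples, the new
points stay inside the region already covered by the minority class.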

In [None]:
# 6.Ans.

Outliers are data points in a dataset that differ significantly from other
observations and can skew the overall statistical analysis and model performance.

It is essential to handle outliers in a dataset for several reasons:

Outliers can bias the statistical analysis of the data, leading to incorrect conclusions and decisions.

Outliers can affect the performance of machine learning models by introducing noise and reducing their 
accuracy and generalization ability.

Outliers can also impact the training process of models, making them more sensitive to noise and
leading to overfitting.
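
As a small illustration of how a single outlier can skew a statistic, the sketch below flags
outliers with the common 1.5 * IQR rule and compares the mean before and after removing them.
The values are made up, and the IQR rule is only one of several possible detection methods.

```python
import numpy as np

# Hypothetical measurements with one extreme value
values = np.array([10, 12, 11, 13, 12, 11, 95])

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Split the data into outliers and the remaining values
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
cleaned = values[~is_outlier]

print(outliers)  # [95]
print(values.mean(), cleaned.mean())
```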



In [None]:
# 7.Ans.

There are several techniques you can use to handle missing data in your analysis:

Deletion: One option is to simply delete any records with missing data. However, 
this method can result in a loss of important information and can bias your analysis 
if the missing data is not missing completely at random.

Imputation: Another option is to impute, or estimate, the missing data using a variety of
techniques. One of the most common techniques is mean imputation, which involves replacing
missing values with the mean value of the non-missing data. Other techniques include regression
imputation, hot-deck imputation, and multiple imputation.

Data augmentation: This technique involves adding data points to your dataset to compensate 
for the missing values, for example by generating synthetic data. Copying values from similar 
records is sometimes listed here as well, although that is essentially hot-deck imputation.
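
Regression imputation, mentioned above, can be sketched as follows. The Age and Salary values
are made up for the example, and a straight-line fit with np.polyfit is just one simple choice
of regression model.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one Salary value is missing
df = pd.DataFrame({'Age': [25, 30, 20, 40, 35],
                   'Salary': [50000, 60000, np.nan, 80000, 70000]})

# Fit Salary as a linear function of Age using only the complete rows
known = df.dropna(subset=['Salary'])
slope, intercept = np.polyfit(known['Age'], known['Salary'], deg=1)

# Fill the missing Salary with the value predicted from Age
missing = df['Salary'].isna()
df.loc[missing, 'Salary'] = slope * df.loc[missing, 'Age'] + intercept
print(df)
```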

In [None]:
# 8.Ans.

There are several strategies you can use to determine if missing data is 
missing at random or if there is a pattern to the missing data:
    
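Analyze the distribution of missing data: Check how many values are missing in each column and 
whether the missing values are concentrated in particular rows or groups.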

Use statistical tests: You can use statistical tests to determine if there is a relationship 
between missing data and other variables in the dataset. For example, you can use a 
chi-square test to determine if there is a significant relationship between missing data and a 
categorical variable in the dataset.

Visualize the missing data: You can use visualization techniques, such as heatmaps or
scatterplots, to visualize the missing data and look for patterns.
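
The first two strategies can be sketched as follows, assuming pandas and SciPy are available.
The dept and salary columns and their values are made up for the example. The test checks
whether missing salary values are related to the department.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: salary is missing far more often in department B
df = pd.DataFrame({
    'dept': ['A'] * 20 + ['B'] * 20,
    'salary': [50000.0] * 18 + [np.nan] * 2 + [60000.0] * 8 + [np.nan] * 12,
})

# Share of missing values per column
print(df.isna().mean())

# Cross-tabulate the categorical variable against a missingness indicator
table = pd.crosstab(df['dept'], df['salary'].isna())
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

A small p-value suggests that missingness depends on the department, so the data is probably
not missing completely at random.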



In [None]:
# 9.Ans.

Confusion Matrix: Start by computing the confusion matrix, which is a table that 
summarizes the number of true positives, false positives, true negatives, and false
negatives for your model's predictions. This will help you understand the trade-offs
between sensitivity (the ability of the model to correctly identify positive cases) and
specificity (the ability of the model to correctly identify negative cases).

Precision-Recall Curve: The precision-recall (PR) curve is a graphical representation of
the performance of a binary classifier at different classification thresholds.
The PR curve can be useful for evaluating classifiers on imbalanced datasets, as it provides a 
more detailed view of the trade-offs between precision and recall (or sensitivity).

F1 Score: The F1 score is the harmonic mean of precision and recall, which provides a single score that 
balances the two. It is often used as a metric to evaluate the performance 
of binary classifiers on imbalanced datasets.
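
These metrics can be computed with scikit-learn as in the sketch below. The labels and
predictions are made up for the example, with class 1 as the positive (minority) class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true labels and model predictions (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Rows are true classes and columns are predicted classes, ordered [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(cm)
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```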


In [None]:
# 10.Ans.

Here are some approaches that you can try:

Undersampling: One approach is to randomly remove samples from the majority class 
until the dataset is balanced. This method is simple to implement, but it can lead
to the loss of important information, especially if the dataset is already small.

Oversampling: Another approach is to randomly replicate samples from the minority class
until the dataset is balanced. This method increases the sample size of the minority class,
but it can also lead to overfitting and poor generalization performance.

Synthetic Data Generation: You can generate synthetic data for the minority class using 
techniques such as SMOTE (Synthetic Minority Over-sampling Technique). This method generates
new samples by interpolating between existing minority samples, which can help to increase the 
size of the minority class with less risk of overfitting than simple duplication.

In [None]:
# 11.Ans.

Oversampling: In this method, we randomly duplicate samples from the minority class 
to increase their numbers. There are also oversampling techniques such as
SMOTE (Synthetic Minority Over-sampling Technique) that generate synthetic data points instead of
duplicating existing ones.

Undersampling: In this method, we randomly remove samples from the majority class to 
reduce their numbers. This method is simple and easy to implement, but it can result 
in loss of information and accuracy.

Hybrid methods: These methods combine both oversampling and undersampling techniques to
balance the dataset. One such technique is the SMOTEENN method, which first oversamples
the minority class using SMOTE and then cleans the result with Edited Nearest Neighbors (ENN),
removing samples whose class disagrees with most of their nearest neighbors.
