### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.


#### <b>Three types of missing values in a dataset:

  1. MCAR - Missing Completely at Random
  2. MAR - Missing at Random
  3. MNAR - Missing Not at Random


#### <b>Why is it essential to handle missing values?

Missing data are problematic because, depending on the type, they can sometimes cause sampling bias. 

In practice, you can often treat MCAR and MAR data as ignorable. For these two types, the likelihood of a data point being missing has nothing to do with the missing value itself (for MAR it may depend on other, observed variables). So it is unlikely that your missing values differ systematically from your observed values once those observed variables are taken into account.

On the flip side, you have a biased dataset if the missing data systematically differ from your observed data. Data that are MNAR are called non-ignorable for this reason.
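The difference between ignorable and non-ignorable missingness can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `income` variable, the 30% missing rate, and the 55,000 cut-off are all hypothetical and chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "income" variable; the true mean is known because the data is simulated.
income = rng.normal(loc=50_000, scale=10_000, size=100_000)

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(income.size) < 0.30

# MNAR: high incomes are far more likely to be missing,
# so missingness depends on the value itself.
mnar_prob = np.where(income > 55_000, 0.60, 0.10)
mnar_mask = rng.random(income.size) < mnar_prob

true_mean = income.mean()
mcar_mean = income[~mcar_mask].mean()  # mean of the values that remain observed
mnar_mean = income[~mnar_mask].mean()

print(f"true mean:          {true_mean:,.0f}")
print(f"observed mean MCAR: {mcar_mean:,.0f}")
print(f"observed mean MNAR: {mnar_mean:,.0f}")
```

Under MCAR the observed mean stays close to the true mean, while under MNAR it is pulled downwards because the large values are under-represented.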

#### <b><u>Name some algorithms that are not affected by missing values:
    
    1. CART methodology (Breiman et al. 1984) 
    2. C5.0 (Quinlan 1993; Kuhn and Johnson 2013)
    3. Naive Bayes
    
Source: http://www.feat.engineering/models-that-are-resistant-to-missing-values.html

### Q2: List down techniques used to handle missing data. Give an example of each with python code.

Techniques for handling missing values generally fall into two categories. We will look at the most common techniques in each category:
  1. Deletion: (i) row deletion (list-wise deletion), which drops the rows that contain missing data, and (ii) column deletion, which drops the columns that contain missing data. As a rule of thumb, column deletion is reasonable when most of a column's values are missing, and row deletion is reasonable when only a small share of the rows contain missing values.
  2. Imputation: filling in missing values with the mean, median, or mode of the column.

#### Deletion example

In [2]:
import seaborn as sns

# Load a dataset that contains null values.

df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [5]:
## Checking missing values, column wise

df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

#### column wise deletion

Above, you can see that most of the values in the 'deck' column are null (688 missing values). In this case, you can delete the 'deck' column.

In [6]:
# dropna(axis=1) drops every column that contains at least one null value. In this case,
# 'age', 'embarked', 'deck' and 'embark_town' are all dropped.

# dropna returns a new DataFrame; pass inplace=True (or reassign the result) to make the change permanent.

df.dropna(axis=1)

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,class,who,adult_male,alive,alone
0,0,3,male,1,0,7.2500,Third,man,True,no,False
1,1,1,female,1,0,71.2833,First,woman,False,yes,False
2,1,3,female,0,0,7.9250,Third,woman,False,yes,True
3,1,1,female,1,0,53.1000,First,woman,False,yes,False
4,0,3,male,0,0,8.0500,Third,man,True,no,True
...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,0,0,13.0000,Second,man,True,no,True
887,1,1,female,0,0,30.0000,First,woman,False,yes,True
888,0,3,female,1,2,23.4500,Third,woman,False,no,False
889,1,1,male,0,0,30.0000,First,man,True,yes,True


In [10]:
# To drop only the 'deck' column, pass its name to drop(). The columns= argument already
# selects columns, so axis=1 is not needed.

df.drop(columns=['deck'], inplace=True)

In [11]:
# 'deck' no longer appears in the list of columns

df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive',
       'alone'],
      dtype='object')
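Instead of naming mostly-empty columns by hand, the same idea can be expressed as a threshold on the fraction of missing values per column. The sketch below uses a small hypothetical frame (`df_demo`) rather than the Titanic data, and the 0.5 threshold is an assumed cut-off, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the Titanic data.
df_demo = pd.DataFrame({
    "age":  [22.0, np.nan, 26.0, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Fraction of missing values per column (isnull() gives booleans, mean() averages them).
missing_fraction = df_demo.isnull().mean()

# Keep only the columns where at most half of the values are missing.
threshold = 0.5  # assumed cut-off, tune per project
df_reduced = df_demo.loc[:, missing_fraction <= threshold]

print(missing_fraction)
print(df_reduced.columns.tolist())
```

Here 'deck' is dropped because most of its values are missing, while 'age' is kept and can be imputed instead.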

#### Row Deletion with Null Value

In [13]:
# This drops every row that contains a null value. It is often undesirable because a large portion of the data is lost.

df.dropna()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,0,3,female,39.0,0,5,29.1250,Q,Third,woman,False,Queenstown,no,False
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,Southampton,yes,True
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,Cherbourg,yes,True


#### Imputation of Missing values.

In [17]:
# Mean Imputation:

'''

1. df['age'].mean() -- calculates the mean of the non-null values of the 'age' column
2. df['age'].fillna(df['age'].mean()) -- fills the missing values with the calculated mean
3. df['age_mean'] = df['age'].fillna(df['age'].mean()) -- creates a new column 'age_mean' in which the missing 
values are replaced by the mean. 

'''

df['age_mean'] = df['age'].fillna(df['age'].mean()) 
df[['age', 'age_mean']]

Unnamed: 0,age,age_mean
0,22.0,22.000000
1,38.0,38.000000
2,26.0,26.000000
3,35.0,35.000000
4,35.0,35.000000
...,...,...
886,27.0,27.000000
887,19.0,19.000000
888,,29.699118
889,26.0,26.000000


In [18]:
# Median Imputation -- preferred when the column contains outliers

'''

1. df['age'].median() -- calculates the median of the non-null values of the 'age' column
2. df['age'].fillna(df['age'].median()) -- fills the missing values with the calculated median
3. df['age_median'] = df['age'].fillna(df['age'].median()) -- creates a new column 'age_median' in which the missing 
values are replaced by the median. 

'''

df['age_median'] = df['age'].fillna(df['age'].median()) 
df[['age', 'age_median']]

Unnamed: 0,age,age_median
0,22.0,22.0
1,38.0,38.0
2,26.0,26.0
3,35.0,35.0
4,35.0,35.0
...,...,...
886,27.0,27.0
887,19.0,19.0
888,,28.0
889,26.0,26.0


In [20]:
# Mode Imputation -- typically used for categorical columns

'''
1. df['embarked'].notna() -- boolean mask: True for non-null values, False for null values
2. df[df['embarked'].notna()] -- the rows of the dataframe where 'embarked' is non-null
3. df[df['embarked'].notna()]['embarked'] -- the 'embarked' column restricted to its non-null values
4. df[df['embarked'].notna()]['embarked'].mode() -- returns a Series with the mode value(s) of those non-null values
5. df[df['embarked'].notna()]['embarked'].mode()[0] -- returns the first mode value

'''

mode_value = df[df['embarked'].notna()]['embarked'].mode()[0]
print("Calculated Mode: ", mode_value)

Calculated Mode:  S


In [26]:
# Fill the missing values of the 'embarked' column with the mode and store the result in a new column 'embarked_mode'.

df['embarked_mode'] = df['embarked'].fillna(mode_value)

In [27]:
df['embarked'].value_counts()

S    644
C    168
Q     77
Name: embarked, dtype: int64

In [29]:
# In the new column, the count of 'S' increases by two, because the two missing values were replaced
# with the mode, which is 'S' in this case.

df['embarked_mode'].value_counts()

S    646
C    168
Q     77
Name: embarked_mode, dtype: int64

### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?


#### <b><u>Imbalanced Data:
Imbalanced data refers to datasets in which the target classes are unevenly distributed, i.e., one class label has a very high number of observations and the other has a very low number of observations. 

Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes.
    
#### <b><u>What will happen if imbalanced data is not handled?

If the dataset is skewed towards one class, an algorithm trained on that data will be biased towards the same class.

The model learns more from the majority-class examples than from the examples in the minority class. One might end up with a scenario where a model assumes that any data you feed it belongs to the majority class.

As a result, the model's predictions are naïve even though it achieves a high accuracy score, because accuracy is dominated by the majority class.
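A tiny worked example makes this concrete. The labels below are hypothetical (95 majority and 5 minority examples), and the "model" simply predicts the majority class every time.

```python
# Hypothetical labels: 95 majority-class (0) and 5 minority-class (1) examples.
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of minority-class examples that the model actually finds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy:        {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The accuracy is 0.95, yet the recall on the minority class is 0.00: the model never identifies a single minority example.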

### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and downsampling are required.


#### <b><u>Oversampling or Up-Sampling:

Oversampling is a technique for balancing a dataset whose classes are unequal in size. It increases the number of samples in the rare (minority) class until the classes are balanced, which is useful when there is too little data to discard any of it.

<b><u>Example:
    
For example, take a classification problem with two classes and 100K data points: 20K belong to the positive class and 80K to the negative class. The positive class, which is the minority class, would need to be oversampled.

To do this, we replicate the 20K positive data points until there are 80K of them, so that each original point appears four times. This yields an equal number of examples for both the positive and negative classes. The size of the dataset would increase to 160K as a result.
    
    
#### <b><u> Undersampling or Down-Sampling:

When there exists a class that is in abundance in an imbalanced dataset, undersampling aims to reduce the size of the abundant class to balance the dataset.
    
<b><u>Example:

Using a similar context to the oversampling example, we have a classification problem with two classes and 100K data points. 20K data points are of the positive class, 80K for the negative class. We would need to undersample the majority class.

This would involve choosing 20K data points randomly from the 80K available. We then have 20K positive and 20K negative data points, bringing the total dataset size to 40K data points.
    

Source: https://www.section.io/engineering-education/imbalanced-data-in-ml/
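Both examples can be sketched with pandas. The frame below is hypothetical and scaled down to 20 positive and 80 negative rows; note that `sample(replace=True)` draws rows at random, so individual rows are not necessarily repeated exactly four times as in the description above.

```python
import pandas as pd

# Hypothetical dataset scaled down from the example: 20 positive and 80 negative rows.
df_demo = pd.DataFrame({
    "feature": range(100),
    "label": [1] * 20 + [0] * 80,
})

minority = df_demo[df_demo["label"] == 1]
majority = df_demo[df_demo["label"] == 0]

# Up-sampling: draw minority rows with replacement until they match the majority count.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
df_upsampled = pd.concat([majority, minority_upsampled])

# Down-sampling: draw majority rows without replacement until they match the minority count.
majority_downsampled = majority.sample(n=len(minority), replace=False, random_state=42)
df_downsampled = pd.concat([majority_downsampled, minority])

print(df_upsampled["label"].value_counts().to_dict())
print(df_downsampled["label"].value_counts().to_dict())
```

Up-sampling grows the data to 160 rows (80 per class), and down-sampling shrinks it to 40 rows (20 per class), mirroring the 160K and 40K totals in the text.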

### Q5: What is data Augmentation? Explain SMOTE.

#### <b><u> Data Augmentation: 
    
Data augmentation is a technique of artificially increasing the training set by creating modified copies of a dataset using existing data. It includes making minor changes to the dataset or using deep learning to generate new data points.  
    
#### <b><u>SMOTE (Synthetic Minority Oversampling Technique) – Oversampling

SMOTE (Synthetic Minority Oversampling Technique) is one of the most commonly used oversampling methods to solve the imbalance problem.
    
It aims to balance the class distribution by adding new, synthetic minority class examples rather than by replicating existing ones.
    
SMOTE synthesises new minority instances between existing minority instances. It generates the virtual training records by linear interpolation for the minority class. These synthetic training records are generated by randomly selecting one or more of the k-nearest neighbors for each example in the minority class. After the oversampling process, the data is reconstructed and several classification models can be applied for the processed data.
    
<b><u>A closer look at how the SMOTE algorithm works:
    
* Step 1: Let A be the minority class set. For each $x \in A$, the k-nearest neighbors of x are obtained by calculating the Euclidean distance between x and every other sample in set A.
* Step 2: The sampling rate N is set according to the imbalance ratio. For each $x \in A$, N examples ($x_1, x_2, \dots, x_N$) are randomly selected from its k-nearest neighbors, and they form the set $A_1$.
* Step 3: For each example $x_k \in A_1$ ($k = 1, 2, \dots, N$), the following formula is used to generate a new example:
          $x' = x + rand(0, 1) \cdot (x_k - x)$
    in which rand(0, 1) represents a random number between 0 and 1, so that $x'$ lies on the line segment between $x$ and $x_k$.
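The three steps can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `smote_sketch` and its parameters are hypothetical, neighbors are drawn with replacement, and each new point is placed on the segment from $x$ to the chosen neighbor, i.e. $x + rand(0, 1) \cdot (x_k - x)$.

```python
import numpy as np

def smote_sketch(minority, n_per_sample=2, k=3, seed=0):
    """Generate n_per_sample synthetic points for each minority sample."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: Euclidean distances from x to every sample in the minority set.
        distances = np.linalg.norm(minority - x, axis=1)
        distances[i] = np.inf  # exclude x itself
        neighbor_idx = np.argsort(distances)[:k]
        # Step 2: randomly pick N of the k nearest neighbors (with replacement).
        chosen = rng.choice(neighbor_idx, size=n_per_sample, replace=True)
        # Step 3: interpolate between x and each chosen neighbor.
        for j in chosen:
            gap = rng.random()  # rand(0, 1)
            synthetic.append(x + gap * (minority[j] - x))
    return np.array(synthetic)

# Hypothetical minority class with four 2-D points.
minority_points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
new_points = smote_sketch(minority_points)
print(new_points.shape)
print(new_points)
```

Because every synthetic point is an interpolation between two existing minority points, none of them falls outside the region already covered by the minority class. For real projects, the imbalanced-learn library provides a tested `SMOTE` class in `imblearn.over_sampling`.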

### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

#### <b><u>Outlier
    
In simple terms, an outlier is a data point that is extremely high or extremely low relative to the other values in the dataset or graph you're working with.

Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.
    
#### <b><u>Why is it essential to handle outliers?
    
Outliers can have a large impact on the results of data analysis and on various statistical measures. It is important to handle them in order to reduce the following effects on the dataset: 

* If the outliers are non-randomly distributed, they can decrease normality.
* They increase the error variance and reduce the power of statistical tests.
* They can cause bias and/or influence estimates.
* They can also violate the basic assumptions of regression as well as other statistical models.
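The influence on estimates is easy to see with a handful of numbers. The salary values below are hypothetical; the second list is the first plus a single extreme value.

```python
import statistics

# Hypothetical salaries (in thousands); the second list adds one extreme value.
salaries = [42, 45, 47, 50, 52, 55, 58]
salaries_with_outlier = salaries + [500]

for name, values in [("without outlier", salaries),
                     ("with outlier", salaries_with_outlier)]:
    print(
        f"{name:>15}: mean={statistics.mean(values):.1f}, "
        f"median={statistics.median(values):.1f}, "
        f"stdev={statistics.stdev(values):.1f}"
    )
```

A single outlier roughly doubles the mean and inflates the standard deviation many times over, while the median barely moves. This is also why median imputation is preferred over mean imputation when a column contains outliers.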

### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?


* We can use a deletion method, such as row deletion, when only some of the records have missing values.
* We can use an imputation method, such as mean, median, or mode imputation, to replace the missing values with the mean, median, or mode of the non-null feature values.

* It is important to choose the right imputation method, and the choice depends on the type of missing data in the dataset.

<table>
    <thead>
        <tr>
            <th>Type of missing data</th>
            <th>Imputation method</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Missing Completely At Random</td>
            <td>Mean, Median, Mode, or any other imputation method</td>
        </tr>
        <tr>
            <td>Missing At Random</td>
            <td>Multiple imputation, Regression imputation</td>
        </tr>
        <tr>
            <td>Missing Not At Random</td>
            <td>Pattern Substitution, Maximum Likelihood estimation</td>
        </tr>
    </tbody>
</table>
    


### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

#### <b><u>Strategies to determine the type of missing data:
    

1. <b>The Missing Data is MCAR (Missing Completely at Random)</b>: This happens if every observation has the same probability of being missing, regardless of the observed or unobserved values in the dataset. 

2. <b>The Missing Data is MAR (Missing at Random)</b>: This happens when the probability of a value being missing is related to other observed variables in the dataset, but not to the missing value itself. This means that not all the observations and variables have the same chance of being missing.

3. <b>The Missing Data is MNAR (Missing Not at Random)</b>: MNAR is considered to be the most difficult scenario among the three types of missing data. It applies when neither MCAR nor MAR applies. In this situation, the probability of being missing depends on the unobserved value itself, and the reasons can be unknown to us. 
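One practical way to look for a pattern is to build a missing-value indicator and compare it against the other, observed variables. The sketch below uses hypothetical customer data (`plan`, `age`, `income`); it is an exploratory check, not a formal test.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: 'income' is missing more often for the 'basic' plan.
df_demo = pd.DataFrame({
    "plan":   ["basic"] * 6 + ["premium"] * 6,
    "age":    [23, 35, 41, 29, 52, 38, 44, 31, 27, 60, 36, 48],
    "income": [np.nan, 30.0, np.nan, np.nan, 41.0, np.nan,
               72.0, 65.0, np.nan, 90.0, 70.0, 81.0],
})

# Indicator column: 1 where 'income' is missing, 0 otherwise.
df_demo["income_missing"] = df_demo["income"].isnull().astype(int)

# Missing rate per group: similar rates are consistent with MCAR, while clearly
# different rates suggest that missingness depends on 'plan' (MAR).
missing_rate_by_plan = df_demo.groupby("plan")["income_missing"].mean()
print(missing_rate_by_plan)

# Compare another observed variable between rows with and without missing income.
print(df_demo.groupby("income_missing")["age"].mean())
```

If the missing rate differs clearly between groups, the data is unlikely to be MCAR. MNAR cannot be confirmed from the observed data alone, because it depends on the values that are missing; it usually has to be judged from domain knowledge about how the data was collected.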

### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?



The most common metrics to use for imbalanced datasets are:

* F1 score
* Precision
* Recall
* AUC score (AUC ROC)
* Average precision score (AP)
* G-Mean

It is good practice to track multiple metrics when developing a machine learning model as each highlights different aspects of model performance.
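Several of these metrics follow directly from the confusion matrix. The sketch below computes them by hand on hypothetical labels and predictions (1 = has the condition, 0 = does not). AUC and average precision are left out because they require predicted scores rather than hard labels.

```python
import math

# Hypothetical ground truth and predictions: 4 patients with the condition, 16 without.
y_true = [1, 1, 1, 1] + [0] * 16
y_pred = [1, 1, 1, 0, 1, 1] + [0] * 14

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # sick, predicted sick
fn = sum(t == 1 and p == 0 for t, p in pairs)  # sick, predicted healthy
fp = sum(t == 0 and p == 1 for t, p in pairs)  # healthy, predicted sick
tn = sum(t == 0 and p == 0 for t, p in pairs)  # healthy, predicted healthy

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
g_mean = math.sqrt(recall * specificity)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} g_mean={g_mean:.3f}")
```

In practice, scikit-learn's `sklearn.metrics` module offers `precision_score`, `recall_score`, `f1_score`, `roc_auc_score`, and `average_precision_score`, so these values rarely need to be computed manually.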


For More Details: https://www.kaggle.com/code/marcinrutecki/best-techniques-and-metrics-for-imbalanced-dataset

### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?


Some of the more widely used undersampling methods that can be used to down-sample the majority class are:

    Random Undersampling
    Condensed Nearest Neighbor Rule (CNN)
    Near Miss Undersampling
    Tomek Links Undersampling
    Edited Nearest Neighbors Rule (ENN)
    One-Sided Selection (OSS)
    Neighborhood Cleaning Rule (NCR)


For more information: https://machinelearningmastery.com/data-sampling-methods-for-imbalanced-classification/
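As an illustration of one method from the list, Tomek Links undersampling removes majority-class samples that are mutual nearest neighbors of a minority-class sample. The sketch below is a simplified NumPy version under stated assumptions: the function name `tomek_link_majority_indices` and the one-dimensional data are hypothetical.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that form a Tomek link with a minority sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a sample is never its own neighbor.
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(distances, np.inf)
    nearest = distances.argmin(axis=1)

    to_remove = []
    for i, j in enumerate(nearest):
        # Tomek link: i and j are mutual nearest neighbors with different labels.
        if nearest[j] == i and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return to_remove

# Hypothetical data: label 0 = satisfied (majority), label 1 = dissatisfied (minority).
X = np.array([[0.0], [1.0], [2.0], [5.0], [5.4], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1])

remove = tomek_link_majority_indices(X, y, majority_label=0)
X_resampled = np.delete(X, remove, axis=0)
y_resampled = np.delete(y, remove)

print(remove)
print(y_resampled)
```

Tomek Links cleans up the class boundary rather than fully balancing the classes, so it is often combined with random undersampling. The imbalanced-learn library provides implementations of the listed methods in `imblearn.under_sampling`, for example `RandomUnderSampler` and `TomekLinks`.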

### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?


Some of the more widely used oversampling methods that can be used in this case are:

    Random Oversampling
    Synthetic Minority Oversampling Technique (SMOTE)
    Borderline-SMOTE
    Borderline Oversampling with SVM
    Adaptive Synthetic Sampling (ADASYN)


For more information: https://machinelearningmastery.com/data-sampling-methods-for-imbalanced-classification/