### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.


Missing values are entries where the value of a variable is absent for an observation. They arise for reasons such as incomplete data collection, data entry errors, or technical issues during data processing.

It is essential to handle missing values because they reduce the accuracy and validity of an analysis and of ML models. Left untreated, they can produce biased or incorrect results and make it difficult to draw reliable conclusions.

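Algorithms that can work with missing values (support depends on the implementation):
- Decision Trees (for example, CART with surrogate splits)
- Random Forest (in implementations whose trees accept missing values)
- Gradient-boosted trees such as XGBoost and LightGBM, which learn a default split direction for missing values
- Naive Bayes, which can skip a missing feature when computing class probabilities

K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Principal Component Analysis (PCA) operate on complete feature vectors, so they generally require missing values to be imputed or removed first.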

### Q2: List down techniques used to handle missing data. Give an example of each with python code.


In [3]:
import pandas as pd
import seaborn as sns

# Load the Titanic dataset; its 'age' column contains missing values
df = sns.load_dataset('titanic')


In [2]:
# 1 Deleting rows containing missing values
df.dropna()

In [None]:
# 2 Deleting columns that contain missing values
df.dropna(axis = 1)

## 3 Imputation
- Mean Imputation 
- Median Imputation 
- Mode Imputation 
- Random Value Imputation

In [None]:
# 1 Mean Imputation 
df['age_mean'] = df['age'].fillna(df['age'].mean())

In [None]:
# 2 Median imputation 
df['age_median'] = df['age'].fillna(df['age'].median())

In [None]:
# 3 Mode Imputation
# mode() returns a Series, so take its first value as the scalar fill value
df['age_mode'] = df['age'].fillna(df['age'].mode()[0])

In [None]:
# 4 Random value Imputation
df['age_random'] = df['age'].fillna(23)  # fill with an arbitrary constant (here 23)
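
A variant of random value imputation draws each fill value from the observed values instead of using one fixed constant. The sketch below uses plain Python so the mechanics are visible; the `ages` list and the seed are hypothetical illustration data, not taken from the Titanic dataset.

```python
import random

# Hypothetical ages; None marks a missing entry.
ages = [22.0, None, 38.0, 26.0, None, 35.0]

rng = random.Random(0)  # fixed seed so the result is reproducible
observed = [a for a in ages if a is not None]

# Replace each missing entry with a value drawn at random from the observed ones.
ages_imputed = [a if a is not None else rng.choice(observed) for a in ages]
```

Because the fill values come from the observed data, the imputed column keeps roughly the same spread as the original, which a single constant does not.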

### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?



Imbalanced data refers to a dataset in which the classes are not represented equally: one or more classes have significantly fewer samples than the others. Imbalanced data is common in many real-world applications, such as fraud detection.

Many ML algorithms are designed on the assumption that the classes are roughly balanced, and they tend to perform poorly when trained on imbalanced data.

If imbalanced data isn't handled, it can lead to several problems, including:
- Poor performance on the minority class
- Biased models that favor the majority class
- Overfitting

#### To handle this, several techniques can be used:
- Resampling
- Cost-sensitive learning
- Algorithm modification
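
As a small illustration of cost-sensitive learning, class weights can be set inversely proportional to class frequency, so that errors on the rare class count more. The sketch below assumes a hypothetical fraud dataset of 95 legitimate and 5 fraudulent labels; the formula is the same heuristic that scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter

# Hypothetical labels: 95 legitimate transactions (0) and 5 fraudulent ones (1).
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Inverse-frequency weights: n_samples / (n_classes * class_count).
class_weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
```

Here the minority class receives a weight of 10.0 and the majority class a weight of about 0.53, so one misclassified fraud case costs roughly as much as nineteen misclassified legitimate ones.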

### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.



#### Upsampling:
##### This involves increasing the number of samples in the minority class to match the majority class. It is done by replicating existing samples in the minority class or by generating new synthetic samples.

##### For example, consider a dataset with 1000 samples of Class A and 100 samples of Class B. If we upsample Class B to 1000 samples, for instance with SMOTE, we get a balanced dataset with 1000 samples of each class.

#### Downsampling:
##### This involves reducing the number of samples in the majority class to match the minority class. It can be done randomly or with more sophisticated techniques, such as clustering or instance selection.

##### For example, consider a dataset with 1000 samples of Class A and 100 samples of Class B. If we downsample Class A to 100 samples, we can create a balanced dataset with 100 samples of each class.
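
The two examples above (1000 samples of Class A, 100 of Class B) can be sketched with random resampling in plain Python. This is a minimal illustration using replication only, not SMOTE; the sample tuples and the seed are hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed for reproducibility

# Hypothetical dataset: 1000 samples of Class A and 100 samples of Class B.
class_a = [("A", i) for i in range(1000)]
class_b = [("B", i) for i in range(100)]

# Up-sampling: draw from Class B with replacement until it matches Class A.
class_b_up = class_b + rng.choices(class_b, k=len(class_a) - len(class_b))

# Down-sampling: keep a random subset of Class A (without replacement)
# that is the same size as Class B.
class_a_down = rng.sample(class_a, k=len(class_b))

upsampled = class_a + class_b_up        # 1000 of A + 1000 of B
downsampled = class_a_down + class_b    # 100 of A + 100 of B
```

Up-sampling keeps every original sample but repeats minority samples, while down-sampling discards 900 of the majority samples, which is why it is usually reserved for cases where the majority class is large enough to spare them.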

### Q5: What is data Augmentation? Explain SMOTE.


#### Data augmentation (DA) artificially expands the size of a training set by creating modified copies of the existing data. It is used to reduce overfitting, to compensate for an initial dataset that is too small to train on, and to improve the model's performance.

#### A popular DA technique is SMOTE (Synthetic Minority Over-sampling Technique). SMOTE is used to address class imbalance where the minority class is significantly smaller than the others. It generates synthetic examples of the minority class by interpolating between existing instances.

#### The basic idea of SMOTE is similar to k-nearest neighbors: for each minority example, it finds the nearest minority-class points. Specifically, SMOTE selects a random point along the line segment connecting the minority example and one of its nearest neighbors and adds this point as a new data point.

#### SMOTE can be very effective in improving the performance of machine learning models on imbalanced datasets. By creating synthetic examples of the minority class, SMOTE can help to address the problem of class imbalance and ensure that the model is better able to generalize to new examples.
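
The interpolation step described above can be sketched in plain Python. This is a simplified illustration, not the reference SMOTE implementation: the function name `smote_sketch`, its parameters, and the sample points are hypothetical, and in practice a library such as imbalanced-learn (its `SMOTE` class) would normally be used.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding the point itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority-class samples
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=6)
```

Every synthetic point lies on a segment between two real minority points, so the new samples stay inside the region the minority class already occupies.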


### Q6: What are outliers in a dataset? Why is it essential to handle outliers?


#### Outliers are points that lie far outside the overall distribution of the dataset. If not treated, they can cause serious problems in statistical analyses.

#### Types of Outliers

##### Outliers are generally classified into two types: Univariate and Multivariate.

##### 1. Univariate Outliers – These outliers are found in the distribution of values in a single feature space.

##### 2. Multivariate Outliers – These outliers are found in the distribution of values in an n-dimensional space (n features).

#### Problems due to outliers

##### 1. Skewed data distribution

##### 2. Misleading statistical measures

##### 3. Biased machine learning models

##### 4. Reduced model performance
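
To make the effect on statistical measures concrete, the sketch below flags univariate outliers with the common 1.5 × IQR rule and compares the mean and median with and without them. The rule and the `values` list are assumptions chosen for illustration; the text above does not prescribe a specific detection method.

```python
import statistics

# Hypothetical feature values with one extreme point.
values = [10, 12, 11, 13, 12, 11, 14, 12, 13, 95]

q1, _, q3 = statistics.quantiles(values, n=4)  # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
cleaned = [v for v in values if lower <= v <= upper]

mean_with, mean_without = statistics.mean(values), statistics.mean(cleaned)
median_with, median_without = statistics.median(values), statistics.median(cleaned)
```

In this example the single value 95 pulls the mean from 12 up to 20.3, while the median stays at 12, which shows why the mean becomes a misleading summary when outliers are present.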

### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?


### There are several techniques that can be used to handle missing data in customer data analysis:

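1. Deletion: remove the rows or columns that contain missing values.
2. Imputation: fill missing values with a statistic such as the mean, median, or mode.
3. Regression: predict the missing values from other variables using a regression model.
4. Multiple imputation: create several imputed datasets, analyze each one, and pool the results.
5. Machine learning: estimate missing values with a model such as KNN, or use an algorithm that tolerates missing values.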
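
Regression-based imputation from the list above can be sketched as follows: fit a line on the complete rows, then predict the missing values. The customer records and the column meaning (age, annual spend) are hypothetical, and a single-predictor least-squares fit is assumed for simplicity.

```python
# Hypothetical customer records: (age, annual_spend); None marks a missing spend.
records = [(25, 500.0), (30, 600.0), (35, None), (40, 800.0), (45, 900.0), (50, None)]

complete = [(x, y) for x, y in records if y is not None]
n = len(complete)
mean_x = sum(x for x, _ in complete) / n
mean_y = sum(y for _, y in complete) / n

# Ordinary least squares for one predictor: y = intercept + slope * x
slope = sum((x - mean_x) * (y - mean_y) for x, y in complete) / sum(
    (x - mean_x) ** 2 for x, _ in complete
)
intercept = mean_y - slope * mean_x

# Fill each missing spend with the value predicted from age.
imputed = [(x, y if y is not None else intercept + slope * x) for x, y in records]
```

Unlike mean imputation, the filled values here depend on the customer's other attributes, which preserves the relationship between the variables.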

### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?


### When dealing with missing data, there are several strategies to determine if the missing data is missing at random or if there is a pattern to the missing data. Here are some of the most commonly used methods:

1. Analyze missing patterns: Summarize and plot where values are missing to see whether the missingness clusters in particular rows, columns, or groups.

2. Correlation analysis: Check whether the missingness of a field correlates with the values of other fields.

3. Imputation and analysis: Impute the missing values using various techniques and compare the results.

4. Expert knowledge: Sometimes expert knowledge can help determine whether the missing data is missing at random or not.

5. Statistical tests: Perform statistical tests, such as Little's MCAR test, to determine whether the data is missing completely at random.
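
A first check for a missingness pattern is to compare the share of missing values across groups of another variable. The sketch below assumes hypothetical survey rows with a customer segment and an income field; the names and numbers are illustrative only.

```python
from collections import defaultdict

# Hypothetical survey rows: (customer_segment, income); None marks a missing income.
rows = [
    ("retail", 42000), ("retail", None), ("retail", 39000), ("retail", 45000),
    ("premium", None), ("premium", None), ("premium", 98000), ("premium", None),
]

missing = defaultdict(int)
total = defaultdict(int)
for segment, income in rows:
    total[segment] += 1
    missing[segment] += int(income is None)

# Share of missing income values within each segment.
missing_rate = {seg: missing[seg] / total[seg] for seg in total}
```

If the data were missing completely at random, the rates would be similar across segments. A large gap, such as 25% versus 75% here, hints that missingness depends on the segment, although a sample this small would need a formal test before drawing that conclusion.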


### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

### Strategies to evaluate performance of ML model on an imbalanced dataset:

1. Confusion matrix: This summarizes the performance of a classification model by showing the counts of true positives, false positives, true negatives, and false negatives. Accuracy alone is misleading on imbalanced datasets, so metrics derived from the matrix, such as precision, recall, and F1-score, are more informative.

2. Resampling techniques: These are used to balance the dataset through upsampling or downsampling. Upsampling can lead to overfitting, and downsampling can lead to underfitting because information is discarded.

3. Ensemble methods: These combine multiple models to improve performance, for example by training multiple models on different subsets of the dataset and averaging their predictions.

4. Cost-sensitive learning: This involves assigning different costs to different types of errors. In the case of an imbalanced dataset, misclassifying a minority class example as a majority class example may be more costly than the opposite. By assigning different costs to different types of errors, the model can be trained to minimize the overall cost of errors rather than just the number of errors.

5. Domain knowledge: This can be used to improve the model's performance on an imbalanced dataset. For example, if the dataset contains demographic information, you can use this information to stratify the dataset and ensure that both classes are represented in each stratum.
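
The metrics that can be read off a confusion matrix are shown in the short sketch below. The counts are hypothetical (1000 patients, 50 of whom have the condition) and are chosen only to show how accuracy can look strong while recall stays modest.

```python
# Hypothetical confusion-matrix counts for 1000 patients, 50 with the condition.
tp, fn = 30, 20    # patients with the condition: detected / missed
fp, tn = 40, 910   # patients without the condition: false alarms / correctly cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)              # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2
```

With these counts, accuracy is 0.94 even though the model misses 40% of the patients who have the condition (recall 0.6) and the F1-score is only 0.5, so the minority-class metrics give a more honest picture.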


### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?


### Several methods can balance an unbalanced dataset by down-sampling the majority class:

1. Random under-sampling: removing instances from the majority class until the dataset is balanced.

2. Cluster-based under-sampling: clustering the majority class instances and then selecting representative instances from each cluster.

3. Tomek Links: This identifies pairs of instances from different classes that are close to each other, and removes the majority class instance from each pair. By doing this, the Tomek Links method creates a clearer separation between the two classes.
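
The Tomek Links idea can be sketched in plain Python: two points form a link when they have different classes and each is the other's nearest neighbor. The function name `tomek_links`, the one-dimensional scores, and the labels are hypothetical; a library implementation would normally be used on real data.

```python
import math


def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and nearest(j) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Hypothetical 1-D satisfaction scores; "sat" is the majority class.
points = [(1.0,), (1.2,), (2.0,), (2.1,), (3.0,), (5.0,), (5.2,)]
labels = ["sat", "sat", "sat", "unsat", "sat", "unsat", "unsat"]

links = tomek_links(points, labels)
# Drop the majority-class member of every link.
to_drop = {i for pair in links for i in pair if labels[i] == "sat"}
kept = [(p, lab) for idx, (p, lab) in enumerate(zip(points, labels)) if idx not in to_drop]
```

Only the satisfied customer at 2.0, which sits right next to the unsatisfied customer at 2.1, is removed. Tomek Links therefore cleans the class boundary rather than fully balancing the dataset, so it is often combined with another under-sampling method.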

### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

### With unbalanced data and a rarely occurring event, we can employ several techniques to balance the dataset and up-sample the minority class:
1. Random over-sampling: This involves randomly duplicating instances from the minority class until the dataset is balanced. 

2. Synthetic minority over-sampling technique (SMOTE): This method involves creating synthetic instances of the minority class by interpolating between existing instances.