#### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Missing values in a dataset refer to instances where no data is available for a particular feature or attribute. Some common reasons for missing values are:

- Data was simply not collected or recorded for some instances
- Data was lost or corrupted 
- Participants skipped certain questions in a survey

It is essential to handle missing values for the following reasons:

1. Many machine learning algorithms cannot handle missing values and will throw errors if values are missing.

2. The presence of missing values can negatively impact the performance of algorithms that can handle them, especially if the values are not missing at random, because the observed data then gives a biased picture of the population.

3. Imputing values for missing data can improve the performance of some algorithms.

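Algorithms that can handle missing values natively include:

- Tree-based gradient boosting implementations such as XGBoost, LightGBM and CatBoost, which treat missing values as a special case when choosing splits
- Decision tree variants such as C4.5 and CART (with surrogate splits)
- Naive Bayes, which can skip a missing feature when computing class probabilities

In contrast, distance-based methods like K-Nearest Neighbors and most neural networks, including Convolutional Neural Networks, generally need missing values to be imputed or removed first. Support also depends on the specific library implementation.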

To handle missing values, common approaches are:

- Drop the instances with missing values 
- Impute values using mean, median or mode of the feature
- Impute using more sophisticated techniques like K-nearest neighbor imputation


#### Q2: List down techniques used to handle missing data. Give an example of each with python code.

In [29]:
import seaborn as sns
import pandas as pd
df = sns.load_dataset('titanic')
df

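| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True | NaN | Southampton | no | True |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 0 | 3 | female | NaN | 1 | 2 | 23.4500 | S | Third | woman | False | NaN | Southampton | no | False |
| 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |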


In [34]:
#Mean Imputation (the mean age is truncated to an integer before filling; returns a new Series)
df['age'].fillna(int(df['age'].mean()))

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    29.0
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

In [35]:
#Median Imputation
df['age'].fillna(int(df['age'].median()))

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

In [49]:
#Mode Imputation (mode() ignores missing values by default, so no pre-filtering is needed)
df['embarked'].fillna(df['embarked'].mode()[0])

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: embarked, Length: 891, dtype: object
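
Deletion is another common technique alongside the imputations above. The following is a minimal sketch on a small hand-made DataFrame; the column names and values are illustrative assumptions, not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Row deletion: keep only rows where 'age' is present.
rows_kept = toy.dropna(subset=["age"])

# Column deletion: drop columns with fewer than 50% non-missing values.
min_non_missing = int(0.5 * len(toy))
cols_kept = toy.dropna(axis=1, thresh=min_non_missing)

print(rows_kept.shape)
print(list(cols_kept.columns))
```

Row deletion suits cases where only a few rows are affected, while column deletion is usually reserved for features that are mostly empty.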

#### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to datasets where the classes are not represented roughly equally: one class has significantly more samples than the other class or classes.

For example:
- Fraud detection datasets often have far fewer fraud instances than non-fraud transactions
- Medical datasets often have many more healthy patients compared to patients with the disease

If imbalanced data is not handled properly, the following can happen:

1. Models tend to be biased towards the majority class - Since the majority class has more examples, models learn to just predict the majority class for all examples. This results in high accuracy but poor performance on the minority class.

2. Failure to detect the minority class - The minority class ends up being overlooked since models learn to mainly predict the majority class.

3. Unreliable performance metrics - Accuracy is misleading for imbalanced data since a model can achieve high accuracy just by predicting the majority class. Precision, recall and F1-score on the minority class are more informative.
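
The accuracy problem can be illustrated with a tiny sketch. The labels below are hypothetical (95 majority and 5 minority examples), and the "model" is simply a rule that always predicts the majority class.

```python
# Hypothetical labels: 95 majority (0) and 5 minority (1) examples.
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy = {accuracy:.2f}")
print(f"minority examples detected = {minority_found} of {y_true.count(1)}")
```

The accuracy looks high even though not a single minority example is detected.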


#### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Up-sampling and down-sampling are two techniques used to deal with imbalanced data.

Up-sampling involves increasing the number of samples from the minority class, either by duplicating existing samples or by creating synthetic ones. This helps make the minority and majority classes more balanced.

Down-sampling involves reducing the number of samples from the majority class. This is done either by random removal of samples or by more targeted techniques such as Tomek links or cluster-based selection.

Up-sampling is required when:

- The minority class represents the class of interest, e.g. fraud transactions. Increasing the number of minority class samples helps models learn better from this class.

- The dataset is small, so removing majority class samples would leave too little training data.

An example of up-sampling is SMOTE, which generates synthetic samples for the minority class.

Down-sampling is required when:

- The majority class has too many samples, causing models to be biased towards it 
- Training models on the entire majority class is computationally expensive

An example of down-sampling is randomly removing majority class samples until the desired balance is achieved.
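
Both ideas can be sketched with plain random resampling in pandas. The transaction data, the column names `amount` and `is_fraud`, and the random seed are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical transactions: 8 legitimate (0) and 2 fraudulent (1).
data = pd.DataFrame({
    "amount": [10, 12, 9, 11, 13, 8, 10, 12, 950, 870],
    "is_fraud": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})
majority = data[data["is_fraud"] == 0]
minority = data[data["is_fraud"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement to match the minority.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["is_fraud"].value_counts().to_dict())
print(downsampled["is_fraud"].value_counts().to_dict())
```

Up-sampling keeps every original row at the cost of repeated minority rows, whereas down-sampling yields a smaller dataset and discards majority rows.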

#### Q5: What is data Augmentation? Explain SMOTE.

Data augmentation refers to techniques used to artificially expand the size of a training dataset by creating modified versions of existing samples. This helps address issues like:

- Limited training data 
- Overfitting to the available data

SMOTE (Synthetic Minority Over-sampling TEchnique) is a specific data augmentation technique used for imbalanced classification problems. It works by generating new synthetic samples for the minority class to balance the class distribution. 

How it works:

1. For each minority class sample, SMOTE finds its k nearest neighbors among the other minority class samples.

2. It then randomly chooses one of the neighbors. 

3. A new synthetic sample is generated along the line segment joining the original sample and the chosen neighbor.

For example, if k=5, for a given minority class sample A:

- SMOTE finds the 5 nearest samples to A from the same minority class
- It randomly chooses one of these 5 samples, say B
- It then generates a new synthetic sample somewhere along the line segment between A and B

This process is repeated until the desired balance between the minority and majority class is achieved.
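
The steps above can be sketched in NumPy. This is a simplified illustration rather than a reference implementation: the function name `smote_sketch`, its parameters, and the toy points are assumptions, and the base sample A is picked at random instead of looping over every minority sample in turn.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min (2-D array)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)          # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))         # pick a minority sample A
        b = rng.choice(neighbours[a])        # pick one of its k neighbours B
        gap = rng.random()                   # position along the segment A -> B
        synthetic[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return synthetic

# Hypothetical minority-class points with two features.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_sketch(X_min, n_new=6, k=2)
print(new_points.shape)
```

Because each synthetic point is an interpolation between two existing minority samples, it always lies within the region those samples span. In practice, a maintained implementation such as `SMOTE` from the imbalanced-learn package (`imblearn.over_sampling`) is usually preferable.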


Overall, data augmentation techniques like SMOTE help generate more training data and improve model performance, especially on small and imbalanced datasets.

#### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points that are very different from the rest of the data in a dataset. They lie outside the overall pattern or distribution of the data.

It is important to handle outliers because:

1. They can significantly skew the results of statistical analyses. Statistics like the mean and standard deviation can be heavily influenced by even a single outlier, distorting any conclusions drawn from the data. The median is far less sensitive to extreme values.

2. They can mask the true signal or pattern in the data. The presence of outliers can obscure the general trend or relationship that exists in the majority of the data. 

3. Machine learning models can be negatively impacted. A model trained on data with outliers may fit those extreme values at the expense of typical cases, which hurts performance on deployment data that rarely contains such values.

4. They can indicate data errors. Outliers may be the result of data entry errors, measurement errors or other issues that need to be addressed.
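
A small sketch shows how a single extreme value shifts the mean, and how the common 1.5 × IQR rule can flag it. The measurements are hypothetical, and the IQR rule is just one of several possible detection rules.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 300.0])

print("mean with outlier:   ", values.mean())
print("median with outlier: ", np.median(values))

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print("outliers:", values[is_outlier])
print("mean without outliers:", values[~is_outlier].mean())
```

Whether a flagged point should be removed, capped, or kept depends on whether it is an error or a genuine rare observation.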

#### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

1. Case deletion - Simply removing all cases with any missing data. However, this can significantly reduce the sample size.

2. Mean imputation - Replacing missing data with the mean or average of the available data for that variable. Simple but can distort your distributions.

3. Median imputation - Similar to mean imputation but uses the median instead. More robust to outliers.

4. Mode imputation - For categorical data, replacing missing data with the mode (the most frequent category).
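
These simple imputations can be combined by choosing the statistic per column type. The sketch below assumes a hypothetical helper `impute_simple` and made-up customer records; it uses the median for numeric columns and the mode for everything else, and it does not handle columns that are entirely missing.

```python
import numpy as np
import pandas as pd

def impute_simple(df):
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Hypothetical customer records; column names and values are illustrative.
customers = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "plan": ["basic", "premium", None, "basic"],
})
filled = impute_simple(customers)
print(filled)
```

The original DataFrame is left unchanged, which makes it easy to compare distributions before and after imputation.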

#### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

Here are some strategies to determine if data is missing at random or in a systematic pattern:

1. Compare variables with missing data to complete cases. Look for statistically significant differences in the mean, median, or distribution of other variables between cases with missing data and complete cases. If differences exist, the data is likely missing systematically.

2. Examine patterns in the missing data. Look for correlations between the missingness of different variables. Missing data that clusters or co-occurs across variables may indicate a systematic pattern.

3. Perform chi-square tests of independence between a missingness indicator (missing vs. observed) and other categorical variables. A statistically significant result suggests the missing data is not random.

4. Plot other variables separately for cases with and without missing data. Visible differences in trend or clustering between the two groups could indicate a non-random pattern.

5. Build predictive models to identify factors associated with missing data. If certain variables strongly predict which cases have missing data, that suggests a systematic cause for the missingness.

6. Look for logical explanations for non-random missing data based on your domain knowledge. Consider data collection methods, question wording, time periods, etc. This contextual information can aid your analysis.

7. Compare missing data rates across subgroups. If certain subgroups (e.g. demographic groups) have significantly higher rates of missing data, that points to a non-random pattern.
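
Two of these checks (strategies 1 and 7) can be sketched with a missingness indicator in pandas. The survey data and column names are hypothetical, and the sketch only computes descriptive comparisons; a formal significance test would still be needed before drawing conclusions.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; 'income' is missing for some respondents.
survey = pd.DataFrame({
    "age_group": ["18-30", "18-30", "18-30", "31-50", "31-50", "31-50", "51+", "51+"],
    "tenure_years": [1, 2, 1, 5, 6, 4, 10, 12],
    "income": [np.nan, np.nan, 30.0, 52.0, 58.0, np.nan, 61.0, 65.0],
})

# Indicator column: True where 'income' is missing.
survey["income_missing"] = survey["income"].isna()

# Strategy 7: missing-data rate per subgroup.
rate_by_group = survey.groupby("age_group")["income_missing"].mean()

# Strategy 1: compare another variable between missing and complete cases.
tenure_by_missing = survey.groupby("income_missing")["tenure_years"].mean()

print(rate_by_group)
print(tenure_by_missing)
```

Large gaps between subgroup rates, or between the two tenure averages, suggest that missingness depends on other variables rather than being completely random.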


#### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

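Here are some strategies to evaluate the performance of a machine learning model on this imbalanced dataset:

1. Confusion matrix: Inspect true positives, false positives, false negatives and true negatives separately rather than relying on accuracy, which is dominated by the healthy class.

2. Precision, recall and F1-score: Report these for the diseased class. Recall (sensitivity) shows how many diseased patients the model finds, and precision shows how many of its positive predictions are correct.

3. ROC-AUC and precision-recall AUC: These summarize performance across decision thresholds. The precision-recall curve is usually more informative when positives are rare.

4. Stratified cross-validation: Keep the class ratio the same in every fold so that each evaluation fold contains diseased patients.

5. Resampling the training data only: If upsampling (adding duplicates or synthetically generated samples of the diseased class) or downsampling (removing some healthy class samples at random) is used to balance the classes, apply it to the training split only and evaluate on data with the original class distribution.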
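
When evaluating on an imbalanced medical dataset like this one, per-class metrics derived from the confusion matrix are more telling than accuracy. The sketch below uses hypothetical predictions for 90 healthy and 10 diseased patients; the helper name `confusion_counts` is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

# Hypothetical results: 90 healthy (0) and 10 diseased (1) patients.
y_true = [0] * 90 + [1] * 10
# The model raises 5 false alarms and finds 6 of the 10 diseased patients.
y_pred = [1] * 5 + [0] * 85 + [1] * 6 + [0] * 4

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is above 0.9 while recall shows that 4 of the 10 diseased patients are missed, which is the figure that matters most for diagnosis.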

#### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

Here are a few methods to downsample the majority class of satisfied customers in your customer satisfaction dataset:

1. Random undersampling: Randomly remove majority class examples until the classes are balanced. This is simple but can discard potentially useful data.

2. Tomek links: A Tomek link is a pair of examples from opposite classes that are each other's nearest neighbors. Remove the majority class example from each link and keep the minority class example. This cleans up borderline or noisy majority examples near the class boundary.

3. Cluster-based undersampling: Group majority class examples into clusters and keep only a few representatives from each cluster (for example, the cluster centroids) until a balance is achieved. This preserves the underlying distribution of the data better than random undersampling.

4. One-sided selection: Combine condensed nearest neighbor with Tomek links. First keep all minority class examples plus the majority class examples needed to classify the data correctly with a 1-nearest-neighbor rule, then remove majority class examples that take part in Tomek links. This discards both redundant and borderline majority examples.

5. Condensed nearest neighbor: Build a reduced subset by keeping all minority class examples and adding a majority class example only if a 1-nearest-neighbor classifier on the current subset misclassifies it. Majority examples far from the class boundary are mostly redundant and get dropped, leaving data points that are more informative for distinguishing the classes.
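
As an illustration of the Tomek links method, the sketch below finds pairs of opposite-class points that are each other's nearest neighbor and removes the majority-class member. The function name `remove_tomek_majority` and the one-dimensional toy data are assumptions; Tomek link removal on its own rarely balances the classes fully and is often combined with other methods.

```python
import numpy as np

def remove_tomek_majority(X, y, majority_label):
    """Drop majority-class points that form a Tomek link with a minority point."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Index of each point's nearest neighbour (excluding itself).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    keep = np.ones(len(X), dtype=bool)
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i          # i and j are each other's nearest neighbour
        if mutual and y[i] != y[j] and y[i] == majority_label:
            keep[i] = False               # remove only the majority-class member
    return X[keep], y[keep]

# Hypothetical 1-D feature: 0 = satisfied (majority), 1 = dissatisfied (minority).
X = np.array([[0.0], [1.0], [2.0], [5.0], [5.2], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1])
X_res, y_res = remove_tomek_majority(X, y, majority_label=0)
print(X_res.ravel(), y_res)
```

In this toy data the satisfied customer at 5.0 sits right next to a dissatisfied one at 5.2, so it is the only point removed.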

#### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

Here are some methods to upsample the minority class of occurrences in your imbalanced dataset:

1. Random duplication: Duplicate randomly chosen minority class examples until the classes are balanced. This is simple but can lead to overfitting since some examples are repeated multiple times.

2. SMOTE: Synthetic Minority Oversampling Technique generates synthetic minority class examples rather than duplicating existing data. For each minority class example it finds the k nearest minority class neighbors, picks one of them at random, and generates a new data point along the line segment joining the two.

3. ADASYN: Adaptive Synthetic Sampling generates different numbers of synthetic examples for different minority class examples, based on how "hard" they are to learn, measured by the share of majority class examples among their nearest neighbors. It focuses more synthetic data around minority examples that are hard to learn.

4. Cluster-based oversampling: Group minority class examples into clusters and generate synthetic samples for each cluster. This helps generate more varied synthetic data that captures some of the underlying structure of the minority class.

5. Gaussian distribution oversampling: Generate synthetic data from a Gaussian distribution centered around existing minority class examples. The width of the distribution controls how similar synthetic data is to existing data.
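
The Gaussian approach can be sketched in a few lines of NumPy. The function name `gaussian_oversample`, the `scale` value, and the toy samples are assumptions for illustration; in real data the noise scale would typically be chosen per feature.

```python
import numpy as np

def gaussian_oversample(X_min, n_new, scale=0.1, seed=0):
    """Create n_new points by adding Gaussian noise to randomly chosen minority samples."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    base = X_min[rng.integers(len(X_min), size=n_new)]      # centres to perturb
    noise = rng.normal(loc=0.0, scale=scale, size=base.shape)
    return base + noise

# Hypothetical minority-class samples (rare events) with two features.
X_min = np.array([[2.0, 3.0], [2.5, 3.5], [3.0, 2.8]])
X_new = gaussian_oversample(X_min, n_new=5, scale=0.05)
X_balanced_min = np.vstack([X_min, X_new])
print(X_balanced_min.shape)
```

A small `scale` keeps synthetic points close to real ones, while a larger value adds variety at the risk of creating unrealistic samples.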