### Q1 : What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Missing values in a dataset refer to the absence of data for certain observations or variables. They can occur due to various reasons, such as data entry errors, equipment failure, survey non-response, or any other issue that prevents the collection of complete data.

Handling missing values is essential for several reasons:

1. Data integrity: Missing values can introduce bias and inaccuracies in data analysis, modeling, and inference. Failing to address missing values can lead to incorrect conclusions and flawed decision-making.

2. Statistical requirements: Many statistical techniques and machine learning algorithms assume complete data, making them unsuitable for datasets with missing values. To apply these methods correctly, it is necessary to handle missing values appropriately.

3. Computational requirements: Missing values can create computational challenges when performing calculations or running algorithms. Some software packages may not handle missing values automatically, requiring explicit handling before analysis.

4. Data utilization: Missing values can result in a loss of information. By handling missing values effectively, we can maximize the use of available data and gain more accurate insights.

Regarding algorithms that are not affected by missing values, some methods can handle missing data without the need for imputation or preprocessing:

1. Tree-based algorithms: Decision trees and some ensemble methods built on them can handle missing values natively. For example, CART can use surrogate splits, and gradient-boosting implementations such as XGBoost and LightGBM learn a default branch for missing values at each split. Support depends on the implementation, so check the library before skipping imputation.

2. Support Vector Machines (SVM): SVMs are sometimes listed here, but standard SVMs require complete feature vectors. Missing entries must be imputed or removed before training, so SVMs are not robust to missing values.

3. Maximum Likelihood Estimation (MLE): Some algorithms based on MLE, such as the Expectation-Maximization (EM) algorithm, handle missing values by alternating between estimating the missing data and re-estimating the model parameters.

4. Probabilistic Graphical Models: Algorithms like Bayesian networks and Markov networks can effectively handle missing values by using probabilistic inference techniques.

It's important to note that while these algorithms can handle missing values, the choice of handling missing data may still impact the performance and accuracy of the resulting model.

### Q2 : List down techniques used to handle missing data.  Give an example of each with python code.

There are several techniques commonly used to handle missing data in datasets. Here are two of the most common, along with examples of how they can be implemented in Python:

1. Deletion: This approach involves removing observations or variables with missing values from the dataset. There are two types of deletion techniques:
   - a. Listwise Deletion: Removes entire observations with missing values.
   - b. Pairwise Deletion: Keeps the observations with complete data for each analysis, ignoring missing values in specific calculations.
2. Mean/Median/Mode Imputation: This approach involves filling missing values with the mean, median, or mode of the respective variable.

In [None]:
import pandas as pd

df = pd.read_csv('titanic.csv')

# Listwise deletion: drop every row that contains at least one missing value
df_clean = df.dropna()

In [None]:
df = pd.read_csv('traveller.csv')

# Pairwise deletion: for an analysis of 'miles' and 'age', drop only the rows
# that are missing one of these two columns
cleaned_data = df[['miles', 'age']].dropna()

In [None]:
df = pd.read_csv('bank_data.csv')

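# Impute missing ages and store each result in a new column
# (fillna returns a new Series, so the result must be assigned)
df['mean_imputation'] = df['cust_age'].fillna(df['cust_age'].mean())

df['median_imputation'] = df['cust_age'].fillna(df['cust_age'].median())

df['mode_imputation'] = df['cust_age'].fillna(df['cust_age'].mode()[0])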
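
The cells above depend on CSV files that may not be at hand. As a self-contained sketch of the same techniques, the following uses a small made-up DataFrame (the `toy` data and its column values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the CSV files above
toy = pd.DataFrame({
    "miles": [120.0, np.nan, 300.0, 450.0, 80.0],
    "age": [25.0, 32.0, np.nan, 41.0, 29.0],
    "income": [40.0, 52.0, 61.0, np.nan, 45.0],
})

# Listwise deletion: keep only rows with no missing value in any column
listwise = toy.dropna()

# Pairwise deletion: each analysis uses the rows complete for its own variables
miles_age = toy[["miles", "age"]].dropna()
pairwise_corr = toy.corr()  # pandas excludes NaN pairwise for each column pair

# Mean imputation: fill the gaps in one column with that column's mean
age_filled = toy["age"].fillna(toy["age"].mean())

print(len(listwise), len(miles_age))
print(age_filled.tolist())
```

Listwise deletion keeps fewer rows than pairwise deletion here, which illustrates the information loss discussed in Q1.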

### Q3 : Explain the imbalanced data. What will happen if imbalanced data is not handled?

#### Imbalanced data:
Imbalanced data refers to a situation where the distribution of target classes in a classification problem is highly skewed, with one class being significantly more prevalent than the other(s). For example, in a binary classification problem, if the positive class comprises only a small percentage of the dataset while the negative class dominates, it is considered imbalanced data.

### If imbalanced data is not handled properly, it can lead to several issues:

#### Biased Model Performance: 
Machine learning models trained on imbalanced data tend to have a biased performance towards the majority class. They may achieve high accuracy by simply predicting the majority class for most instances, ignoring the minority class. Consequently, the model's ability to correctly identify the minority class (often the one of greater interest) is compromised.

#### Poor Generalization: 
Models trained on imbalanced data may generalize poorly to new, unseen data, especially for the minority class. With only a few minority examples, the model sees too little variation to learn that class's patterns, and if the class distribution in production differs from the training distribution, its predictions become less reliable.

#### Increased False Positives or False Negatives: 
Imbalanced data can result in models with a high false positive or false negative rate for the minority class. Depending on the problem domain, this can have serious consequences. For instance, in medical diagnosis, failing to identify a rare disease (false negative) can have severe implications for patient health.
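
To make the "high accuracy, useless model" problem concrete, here is a minimal sketch with made-up labels (95 negatives and 5 positives are assumed for illustration). A classifier that always predicts the majority class scores well on accuracy but never finds a positive case:

```python
import numpy as np

# Hypothetical labels: 95 negatives (0) and 5 positives (1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Recall on the positive class: TP / (TP + FN)
true_pos = np.sum((y_pred == 1) & (y_true == 1))
false_neg = np.sum((y_pred == 0) & (y_true == 1))
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# prints: accuracy=0.95, recall=0.00
```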

### Q4 : What are Up-sampling and Down-sampling? Explain with an example when up-sampling and downsampling are required.

Up-sampling and down-sampling are two common techniques used to address imbalanced data by either increasing or decreasing the representation of certain classes in the dataset. Here's an explanation of both techniques along with examples of when they are required:

#### Up-sampling (Over-sampling):
Up-sampling involves increasing the number of instances in the minority class to balance the class distribution. This can be done by either duplicating existing instances or generating synthetic samples using techniques such as SMOTE (Synthetic Minority Over-sampling Technique).

#### Example: 
Suppose you have a dataset for credit card fraud detection, where fraudulent transactions (positive class) are rare compared to non-fraudulent transactions (negative class). The dataset contains 100 instances, with only 5 being fraudulent. In this case, up-sampling can be applied to increase the representation of the fraudulent class. For instance, you can use SMOTE to generate synthetic fraudulent transactions, resulting in an augmented dataset with a more balanced class distribution.

#### Down-sampling (Under-sampling):
Down-sampling involves reducing the number of instances in the majority class to achieve a balanced class distribution. This can be done by randomly removing instances from the majority class or using more sophisticated techniques that select representative instances.

#### Example: 
Consider a dataset for cancer diagnosis, where the majority class represents non-cancerous cases and the minority class represents cancerous cases. If the dataset contains 1000 instances, with 900 being non-cancerous and only 100 being cancerous, down-sampling can be applied to reduce the dominance of the non-cancerous class. For instance, random under-sampling can be performed to randomly select 100 instances from the non-cancerous class, resulting in a balanced dataset with equal representation of both classes.

#### When to use Up-sampling and Down-sampling:

#### Up-sampling: 
Up-sampling is typically used when the minority class has too few instances for the model to capture its patterns effectively. It suits cases where the goal is to improve recall (sensitivity) on the minority class and the dataset is small enough that discarding majority-class data would be costly. Keep in mind that duplicating minority instances can cause the model to overfit to them.

#### Down-sampling: 
Down-sampling is employed when the majority class is dominant, and its large representation overshadows the minority class. It can be used when the dataset is large, and the majority class has redundant instances that do not provide significant additional information.
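
Both techniques can be sketched with plain pandas. The example below mirrors the 900/100 split from the cancer example above; the DataFrame, its `feature` column, and the `random_state` values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical dataset: 900 majority rows (label 0) and 100 minority rows (label 1)
data = pd.DataFrame({
    "feature": range(1000),
    "label": [0] * 900 + [1] * 100,
})
majority = data[data["label"] == 0]
minority = data[data["label"] == 1]

# Down-sampling: draw as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=42)
down_sampled = pd.concat([majority_down, minority])

# Up-sampling: draw minority rows with replacement up to the majority size
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
up_sampled = pd.concat([majority, minority_up])

print(down_sampled["label"].value_counts().to_dict())
print(up_sampled["label"].value_counts().to_dict())
```

Down-sampling yields 200 rows in total, while up-sampling yields 1800 rows, many of them duplicates.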

### Q5 : What is data Augmentation? Explain SMOTE

#### Data augmentation: 
Data augmentation is a technique used to artificially increase the size and diversity of a dataset by applying various transformations or modifications to the existing data. It is commonly used in machine learning and computer vision tasks to enhance model performance and improve generalization.

#### SMOTE: 
The Synthetic Minority Over-sampling Technique (SMOTE) is a specific data augmentation method designed to address class imbalance in datasets. SMOTE focuses on increasing the representation of the minority class by creating synthetic examples rather than simply duplicating existing instances.

Here's an explanation of how SMOTE works:

#### Identify Minority Class Instances:
SMOTE starts by identifying the instances belonging to the minority class. These are the observations that are underrepresented and need to be augmented.

#### Select Nearest Neighbors:
For each minority class instance, SMOTE finds its k nearest neighbors within the minority class (typically using Euclidean distance). The value of k is a hyperparameter; 5 is a common default.

#### Generate Synthetic Instances:
SMOTE generates synthetic instances along the line segments connecting a minority class instance to its nearest neighbors. For each new instance, it picks one of the k neighbors at random and then picks a random point on the segment between the instance and that neighbor: x_new = x + λ (x_neighbor − x), with λ drawn uniformly from [0, 1]. The feature values of that point form the new data point.

#### Repeat the Process:
The previous steps are repeated until the desired level of augmentation is achieved.
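
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all hypothetical, and it picks base instances at random instead of looping over every minority instance in turn. For real work, a tested library such as imbalanced-learn is the safer choice.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances between minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                 # pick a minority sample
        nb = neighbors[base, rng.integers(k)]  # pick one of its k neighbors
        gap = rng.random()                     # position on the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points with two features
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [2.5, 2.5], [3.0, 1.5], [1.2, 1.8]])
new_points = smote_sketch(X_min, n_new=10, k=3)
print(new_points.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the new data stays inside the region already covered by the minority class.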

### Q6 : What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points that significantly deviate from the general pattern or distribution of the rest of the dataset. These data points are extreme values that are located far away from the majority of the data points. Outliers can occur due to various reasons such as measurement errors, data entry errors, natural variations, or genuinely anomalous observations.

Handling outliers is essential for several reasons:

#### Impact on Statistical Measures: 
Outliers can have a substantial impact on statistical measures such as the mean and standard deviation. Since these measures are sensitive to extreme values, outliers can distort the estimates and misrepresent the true central tendency and variability of the data.
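
A small numeric sketch shows this sensitivity. The values below are made up for illustration; a single extreme value is appended to an otherwise tight sample:

```python
import numpy as np

# Hypothetical measurements, then the same sample with one extreme value added
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
with_outlier = np.append(values, 100.0)

print("mean:  ", values.mean(), "->", with_outlier.mean())
print("median:", np.median(values), "->", np.median(with_outlier))
print("std:   ", values.std(), "->", with_outlier.std())
```

The mean and standard deviation shift sharply, while the median stays at 12.0 in both cases.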

#### Biased Analysis and Modeling: 
Outliers can lead to biased analysis and modeling results. They can unduly influence the parameter estimates and relationships between variables, leading to incorrect conclusions and misleading interpretations. Models trained on datasets with outliers may have reduced predictive accuracy.

#### Violation of Assumptions: 
Outliers can violate the assumptions of many statistical techniques and machine learning algorithms. Assumptions such as normality, linearity, and homoscedasticity may be violated in the presence of outliers, compromising the validity of the results obtained from these methods.

#### Data Integrity and Quality: 
Outliers can be indicative of data quality issues, errors, or anomalies in the data collection process. Identifying and handling outliers is crucial for maintaining data integrity and ensuring the accuracy and reliability of subsequent analyses and decision-making.

#### Influence on Decision-Making: 
Outliers, if left unhandled, can lead to incorrect decisions and actions based on inaccurate or misleading information. Outliers can disproportionately affect certain analyses or models, potentially leading to incorrect predictions or suboptimal decision outcomes.

### Q7 : You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

When dealing with missing data in customer data analysis, several techniques can be employed to handle the missing values. The choice of technique depends on factors such as the nature of the missing data, the amount of missingness, the underlying data distribution, and the specific analysis objectives. Here are three commonly used approaches:

1. Deletion: If the amount of missing data is relatively small and the missingness occurs randomly, deletion techniques can be applied.

   a. Listwise Deletion: Also known as complete-case analysis, this approach involves removing entire observations with missing values. It is suitable when the missingness is random and does not introduce bias.

   b. Pairwise Deletion: In this approach, missing values are ignored for specific calculations, allowing for the maximum utilization of available data. It is suitable when the missingness is not related to the specific analysis being performed.

2. Imputation: Imputation involves filling in the missing values with estimated or predicted values. This allows for the use of the entire dataset while minimizing data loss. Several imputation techniques can be used:

   a. Mean/Median/Mode Imputation: Missing values are replaced with the mean, median, or mode of the respective variable. This technique is simple but works best when values are missing completely at random (MCAR); it also shrinks the variable's variance and weakens its correlations with other variables.

   b. Regression Imputation: Missing values are imputed using regression models, where a variable with missing values is predicted based on other variables. It is suitable when there is a relationship between the missing variable and other variables.

   c. Multiple Imputation: Multiple imputation creates multiple plausible imputed datasets, incorporating the uncertainty associated with imputation. Analysis is performed on each imputed dataset, and the results are pooled for final inference.

3. Domain-Specific Knowledge: In some cases, domain-specific knowledge can guide the handling of missing data. For example, missing values can sometimes be treated as a separate category, or inferred from business rules or external data sources.
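
As an illustration of regression imputation from the list above, the following sketch fits a straight line on the complete rows and uses it to fill the gaps. The `customers` DataFrame and its values are hypothetical, and a single-predictor linear fit is assumed purely to keep the example short:

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: 'income' has gaps, 'age' is fully observed
customers = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "income": [30.0, 35.0, np.nan, 45.0, np.nan, 55.0],
})

observed = customers.dropna(subset=["income"])
missing = customers["income"].isna()

# Fit income = slope * age + intercept on the complete rows
slope, intercept = np.polyfit(
    observed["age"].to_numpy(), observed["income"].to_numpy(), deg=1
)

# Predict the missing incomes from age and fill them in
customers.loc[missing, "income"] = slope * customers.loc[missing, "age"] + intercept
print(customers)
```

Every imputed value falls exactly on the fitted line, so this approach understates the natural spread of the variable. Multiple imputation addresses that by adding appropriate uncertainty.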

### Q8 : You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

When dealing with missing data, it is important to assess whether the missingness is unrelated to any data (Missing Completely at Random - MCAR), is related to other observed variables (Missing at Random - MAR), or depends on the unobserved values themselves (Missing Not at Random - MNAR). Here are some strategies to determine the pattern of missing data:

1. Descriptive Analysis: Conducting a descriptive analysis of the missing data can provide insights into its pattern. Calculate summary statistics, such as the percentage of missing values in each variable, and examine patterns or correlations between missing values across variables. Visualization techniques, such as missing data heatmaps or bar plots, can help identify any patterns visually.

2. Missing Data Mechanism Tests: There are statistical tests available to assess the missing data mechanism. These tests compare rows with and without missing values to determine whether they differ significantly. Commonly used options include Little's MCAR test, and t-tests or chi-square tests of independence that compare other variables across a missing-value indicator.

3. Missing Data Patterns: Examine the patterns in the missing data. Are there specific variables or groups of variables with higher rates of missingness? Analyze if there are underlying reasons or correlations for the missingness. For example, missing data in certain variables could be due to survey non-response or specific data collection procedures.

4. Auxiliary Variables: Identify auxiliary variables that may be associated with the missingness. These are observed variables that can help explain the missing data mechanism. Analyze the relationship between these auxiliary variables and the missingness to identify patterns. If the missingness depends on observed variables, it suggests a Missing at Random (MAR) mechanism.

5. Sensitivity Analysis: Perform sensitivity analysis by imputing missing values using different assumptions about the missing data mechanism (e.g., MCAR, MAR, or MNAR). Compare the results obtained from different imputation strategies and analyze if they lead to consistent conclusions or if the results differ substantially depending on the assumed mechanism.

6. Expert Knowledge: Consult domain experts or individuals with knowledge about the data collection process to gain insights into the potential patterns or reasons for missingness. They may provide valuable context and help identify any systematic patterns.
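
The descriptive and auxiliary-variable checks above can be started with a few pandas calls. In this sketch the `survey` DataFrame is made up, and `age` is assumed to be the auxiliary variable:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data in which older respondents skip the income question
survey = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 60, 67],
    "income": [28.0, 31.0, 40.0, 47.0, np.nan, 63.0, np.nan, np.nan],
})

# Share of missing values per column
missing_share = survey.isna().mean()

# Missing-value indicator for 'income'
income_missing = survey["income"].isna()

# Compare an observed variable between rows with and without missing income
age_when_missing = survey.loc[income_missing, "age"].mean()
age_when_observed = survey.loc[~income_missing, "age"].mean()

print(missing_share)
print(age_when_missing, age_when_observed)
```

A clear gap between the two group means, as here, suggests the missingness depends on `age`, which argues against MCAR. A formal test would be needed to judge significance on real data.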

### Q9 : Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

#### Resampling Techniques: 
Consider using resampling techniques to address the class imbalance during training. These techniques include oversampling the minority class (e.g., SMOTE) or undersampling the majority class. By creating a more balanced training set, the model can better learn from the minority class instances. Apply resampling to the training data only, so that the evaluation data keeps the original class distribution.

#### Stratified Sampling and Cross-Validation: 
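Ensure the evaluation process reflects the class imbalance by using stratified sampling and cross-validation techniques. These techniques maintain the class distribution in each fold or subset, allowing for a more reliable assessment of the model's performance. On each fold, report class-sensitive metrics such as precision, recall, and F1-score, since accuracy alone can look high even when the minority class is missed.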
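
The idea behind stratified folds can be sketched with NumPy alone. The helper `stratified_folds` and the 90/10 label split are hypothetical, written only to show the mechanism. Libraries such as scikit-learn ship a ready-made `StratifiedKFold`.

```python
import numpy as np

def stratified_folds(y, n_splits=5, seed=0):
    """Assign each sample a fold index so that every fold keeps the class ratio."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    folds = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        # Shuffle the indices of this class, then deal them round-robin into folds
        idx = rng.permutation(np.flatnonzero(y == cls))
        folds[idx] = np.arange(len(idx)) % n_splits
    return folds

# Hypothetical labels: 90 patients without the condition, 10 with it
y = np.array([0] * 90 + [1] * 10)
folds = stratified_folds(y, n_splits=5)

for f in range(5):
    in_fold = folds == f
    print(f"fold {f}: size={in_fold.sum()}, positives={y[in_fold].sum()}")
```

Each fold receives 20 samples with exactly 2 positives, so every evaluation split preserves the original 10% positive rate.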

### Q10 : When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

#### Combination of Over-Sampling and Under-Sampling:
Instead of solely down-sampling the majority class, a combination of over-sampling the minority class (e.g., using SMOTE) and under-sampling the majority class can be employed to achieve a more balanced dataset.

#### Random Under-Sampling:
Randomly select a subset of the majority class instances to match the size of the minority class. This approach randomly removes instances from the majority class, resulting in a more balanced dataset.

### Q11 : You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

#### Random Over-Sampling:
Randomly duplicate instances from the minority class to increase its representation. This approach randomly replicates existing minority class instances, which helps balance the class distribution.

#### SMOTE (Synthetic Minority Over-sampling Technique):
SMOTE creates synthetic samples by interpolating between minority class instances and their nearest neighbors. It generates new synthetic instances along the line segments connecting the minority class instances, effectively increasing the minority class representation.

#### ADASYN (Adaptive Synthetic Sampling):
ADASYN is an extension of SMOTE that focuses on the instances that are harder to learn. It weights each minority class instance by how many of its nearest neighbors belong to the majority class, and generates more synthetic examples for these difficult instances.

#### SMOTE-ENN:
This is a combination of over-sampling (SMOTE) and under-sampling (ENN, Edited Nearest Neighbours). SMOTE is first used to generate synthetic samples for the minority class, and then ENN is applied to remove instances from either class whose label disagrees with that of their nearest neighbors.
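
The ENN cleaning step can be sketched as follows. This is a simplified majority-vote version under stated assumptions: the function name `enn_keep_mask`, the choice of k=3, and the one-dimensional sample data are all hypothetical, and library implementations offer additional options.

```python
import numpy as np

def enn_keep_mask(X, y, k=3):
    """Return True for samples whose label matches the majority of their k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise Euclidean distances; a sample is not its own neighbor
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    keep = np.empty(len(y), dtype=bool)
    for i, nb in enumerate(neighbors):
        labels, counts = np.unique(y[nb], return_counts=True)
        keep[i] = labels[np.argmax(counts)] == y[i]
    return keep

# Hypothetical data: two clean clusters plus one class-1 point inside the class-0 cluster
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3], [0.15]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

keep = enn_keep_mask(X, y, k=3)
X_clean, y_clean = X[keep], y[keep]
print(keep)
```

Only the last point, which sits among class-0 neighbors, is removed; the two clusters stay intact.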