#### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Missing values refer to the absence of data for one or more variables in a dataset. These missing values may arise due to various reasons, such as incomplete data collection, data entry errors, or data loss during transmission or storage.

It is essential to handle missing values in a dataset because they can lead to biased or inaccurate results in data analysis and modeling. Ignoring missing values can lead to an incomplete or biased understanding of the relationship between variables, and can also reduce the effectiveness of machine learning models.

Some algorithms can tolerate missing values, although whether they do so in practice depends on the implementation:

- Decision Trees: Some decision tree algorithms (for example, C4.5, or CART with surrogate splits) can handle missing values by splitting on the available variables or by sending missing values down a dedicated branch.

- Random Forest: Because a random forest is an ensemble of decision trees, it can handle missing values in the same way when the underlying tree implementation supports them.

- K-Nearest Neighbors (KNN): KNN relies on distances between data points, so it is affected by missing values unless the distance is computed over the available features only or the values are imputed first.

- Gradient Boosting: Implementations such as XGBoost and LightGBM handle missing values natively by learning, at each split, which branch missing values should follow.

- Naive Bayes: Naive Bayes can handle missing values by skipping the missing attribute and computing the probabilities from the available attributes only.

#### Q2: List down techniques used to handle missing data. Give an example of each with Python code.

There are several techniques used to handle missing data in machine learning and data analysis. Here are some common techniques with examples in Python using the pandas library:

Deletion: This technique involves deleting rows or columns with missing data. This can be done using the dropna() method in pandas.

In [48]:
import pandas as pd

# create a dataframe with missing values
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, None, 8]})

# drop rows with any missing values
df_drop_rows = df.dropna()
print(df_drop_rows)

# drop columns with any missing values
df_drop_cols = df.dropna(axis=1)
print(df_drop_cols)

     A    B
0  1.0  5.0
3  4.0  8.0
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


Imputation: This technique involves filling in missing values with estimated values. This can be done using various methods such as mean, median, mode, or regression. In pandas, we can use the fillna() method to fill in missing values.

##### Mean

In [54]:
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, None, 8]})
df_mean = df.fillna(df.mean())  # fill each column's gaps with that column's mean
df_mean

          A    B
0  1.000000  5.0
1  2.000000  6.5
2  2.333333  6.5
3  4.000000  8.0


##### Median

In [52]:
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, None, 8]})
df_median = df.fillna(df.median())
df_median

     A    B
0  1.0  5.0
1  2.0  6.5
2  2.0  6.5
3  4.0  8.0


##### Mode

In [65]:
df = pd.DataFrame({'A': [1, 2, None, 4, 5], 'B': ['Sky', None, None, 'Sky', 'Rat']})
# mode() returns a Series of the most frequent values; [0] takes the first one
df['B'] = df['B'].fillna(df['B'].mode()[0])
df['B']

0    Sky
1    Sky
2    Sky
3    Sky
4    Rat
Name: B, dtype: object

Interpolation: This technique involves estimating missing values by interpolating between adjacent values. This can be done using methods such as linear interpolation, spline interpolation, or time series interpolation. In pandas, we can use the interpolate() method to interpolate missing values.

In [50]:
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, None, 8]})
df_interpolate = df.interpolate()
df_interpolate

     A    B
0  1.0  5.0
1  2.0  6.0
2  3.0  7.0
3  4.0  8.0


#### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to a situation where the number of observations in one class or category of a binary or multi-class classification problem is significantly smaller than the number of observations in the other class or categories. For example, in a binary classification problem where the task is to predict whether a transaction is fraudulent or not, if the number of fraudulent transactions is much smaller than the number of non-fraudulent transactions, the data is said to be imbalanced.

If imbalanced data is not handled properly, it can lead to biased or inaccurate model performance. Specifically, a model trained on imbalanced data may tend to favor the majority class and perform poorly on the minority class. This is because the model will be optimized to minimize the overall error, and therefore will focus on correctly classifying the majority class, while ignoring the minority class.

In many real-world applications, the minority class is often the one of most interest, as it represents a rare event or a critical outcome. Therefore, it is crucial to address the issue of imbalanced data to ensure that the model can accurately capture the patterns and relationships in both the majority and minority classes.
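A minimal sketch of this effect, using made-up labels: a "model" that always predicts the majority class scores high accuracy while finding none of the minority cases. The class counts below are hypothetical.

```python
import numpy as np

# Hypothetical labels: 950 negatives (majority) and 50 positives (minority)
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Recall on the minority class: true positives / actual positives
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
minority_recall = true_positives / (y_true == 1).sum()

print(f"accuracy = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

The accuracy looks strong only because it is dominated by the majority class, which is why minority-class recall is worth reporting alongside it.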

There are several techniques that can be used to handle imbalanced data, including:

- Undersampling the majority class
- Oversampling the minority class
- Synthetic minority oversampling technique (SMOTE)


#### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

In the context of imbalanced classification, up-sampling and down-sampling are resampling techniques that change how many observations each class contributes to the training data. (The same terms are also used in signal processing, where they mean increasing or reducing the sampling rate of a signal.)

Down-sampling reduces the size of the majority class by selecting a random subset of its observations, usually without replacement, until it is comparable in size to the minority class. It lowers the storage requirements and computational cost of processing large data sets, but it discards data and can therefore lose information. It is a reasonable choice when the majority class is large enough that a subset still represents it well, for example 1,500 satisfied customers against 500 dissatisfied ones.

Up-sampling increases the size of the minority class, either by sampling its observations with replacement (which duplicates rows) or by generating new observations through interpolation, as SMOTE does. It keeps all of the original data, but it increases storage requirements and computational cost, and duplicated rows can encourage overfitting. It is required when the minority class is too small to afford discarding anything, for example 100 fraudulent transactions against 900 legitimate ones.

Both techniques should be applied to the training data only, so that the model is still evaluated on the original class distribution.

##### Upsampling

In [6]:
import numpy as np
import pandas as pd
# majority class: 900 rows with target 0
df1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=900),
    'feature_2': np.random.normal(loc=0, scale=1, size=900),
    'target': [0] * 900
})

In [9]:
# minority class: 100 rows with target 1
df2 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=100),
    'feature_2': np.random.normal(loc=0, scale=1, size=100),
    'target': [1] * 100
})

In [14]:

df = pd.concat([df1, df2]).reset_index(drop=True)

In [16]:
df['target'].value_counts()

0    900
1    100
Name: target, dtype: int64

In [24]:
from sklearn.utils import resample
df_minority = df[df['target'] == 1]  # 1 is the minority class
df_majority = df[df['target'] == 0]  # 0 is the majority class

In [25]:
df_majority['target'].value_counts()

0    900
Name: target, dtype: int64

In [27]:
df_minority.shape

(100, 3)

In [30]:
df_minority_upsample = resample(
    df_minority,
    replace=True,                # sample with replacement
    n_samples=len(df_majority),  # match the majority class size
    random_state=42
)

In [31]:
df_minority_upsample['target'].value_counts()

1    900
Name: target, dtype: int64

In [32]:
df_upsampled = pd.concat([df_minority_upsample , df_majority]).reset_index(drop=True)

In [33]:
df_upsampled.shape

(1800, 3)

##### Downsampling

In [37]:
# loc sets the mean and scale sets the standard deviation
df1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=1500),
    'feature_2': np.random.normal(loc=0, scale=1, size=1500),
    'target': [1] * 1500
})
df2 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=500),
    'feature_2': np.random.normal(loc=0, scale=1, size=500),
    'target': [0] * 500
})

In [39]:
# drop=True discards the old per-frame index and replaces it with a fresh 0..n-1 index
df = pd.concat([df1, df2]).reset_index(drop=True)

In [40]:
df_minority = df[df['target']==0]
df_majority = df[df['target']==1]

In [42]:
df_majority_downsampled = resample(
    df_majority,
    replace=False,               # sample without replacement so no majority row is duplicated
    n_samples=len(df_minority),  # match the minority class size
    random_state=50
)

In [45]:
df_downsampled = pd.concat([df_majority_downsampled,df_minority]).reset_index(drop=True)

In [47]:
print(df_downsampled.shape)
print(df_downsampled['target'].value_counts())

(1000, 3)
1    500
0    500
Name: target, dtype: int64


#### Q5: What is Data Augmentation? Explain SMOTE.

Data augmentation is a technique used in machine learning and computer vision to artificially increase the size of a dataset by creating additional training data from existing data. This is typically done by applying various transformations to the existing data, such as flipping, rotating, scaling, or adding noise, to create new data points that are similar but not identical to the original data.

SMOTE (Synthetic Minority Over-sampling Technique) is a specific data augmentation technique used to address imbalanced data in classification problems. It works by creating synthetic minority class samples by interpolating between existing minority class samples.

The SMOTE algorithm involves the following steps:

1. Select a minority class sample x.
2. Find the k nearest neighbors of x among the minority class samples in the feature space.
3. Randomly select one of the k neighbors, say y.
4. Generate a new synthetic sample on the line segment between x and y: x_new = x + λ(y − x), where λ is drawn at random between 0 and 1.
5. Repeat steps 1-4 until the desired number of synthetic samples has been generated.

SMOTE can increase the size of the minority class and often improves the performance of machine learning models on imbalanced datasets. It is often used in conjunction with other techniques, such as undersampling the majority class or adjusting class weights, to further improve model performance on imbalanced datasets.
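The steps above can be sketched in plain NumPy as follows. This is an illustrative approximation under simple assumptions (Euclidean distance, brute-force neighbor search), not the imbalanced-learn implementation; the function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of each sample
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        idx = rng.integers(n)             # step 1: pick a minority sample x
        nb = rng.choice(neighbors[idx])   # steps 2-3: pick one of its k neighbors y
        lam = rng.random()                # step 4: random weight in [0, 1)
        synthetic[i] = X_min[idx] + lam * (X_min[nb] - X_min[idx])
    return synthetic

# Hypothetical minority class with 20 samples and 2 features
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by the minority class.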

In [76]:
from sklearn.datasets import make_classification
# create a dataset with two features (X) and a binary target (Y); about 90% of samples fall in class 0
X, Y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.90], random_state=1)
df1 = pd.DataFrame(X,columns = ['f1','f2'])
df2 = pd.DataFrame(Y, columns= ['target'])
final_df = pd.concat([df1,df2],axis = 1).reset_index(drop = True)

In [78]:
final_df.shape

(1000, 3)

In [84]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X,Y = oversample.fit_resample(final_df[['f1','f2']],final_df['target'])

In [91]:
X.shape,Y.shape

((1788, 2), (1788,))

In [88]:
df1=pd.DataFrame(X,columns=['f1','f2'])
df2=pd.DataFrame(Y,columns=['target'])
oversample_df=pd.concat([df1,df2],axis=1)

In [89]:
oversample_df.shape

(1788, 3)

#### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

In statistics, outliers are data points that significantly differ from other data points in a dataset. These are extreme values that lie far away from the other data points and do not fit into the overall pattern or distribution of the data.

It is essential to handle outliers because they can have a significant impact on statistical analyses and machine learning models. Outliers can skew the results of statistical measures such as the mean and standard deviation and can lead to inaccurate conclusions. In machine learning, outliers can cause the model to overfit the data or underperform when making predictions on new data.

There are several techniques for handling outliers, including removing them from the dataset, replacing them with a more reasonable value, or transforming the data to reduce the effect of outliers. The best approach depends on the nature of the data, the analysis being performed, and the goals of the study.

Handling outliers can lead to more accurate results and better machine learning models, so it is essential to identify and deal with them appropriately.
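One common way to identify outliers is the interquartile range (IQR) rule. The sketch below assumes the conventional 1.5 × IQR fences, which is a rule of thumb rather than something prescribed above, and uses made-up measurements. It shows two of the handling options mentioned: removing the flagged values and capping them.

```python
import pandas as pd

# Hypothetical measurements with one obvious extreme value
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences
is_outlier = (s < lower) | (s > upper)

# Option 1: remove the flagged values
s_removed = s[~is_outlier]

# Option 2: cap values at the fences instead of dropping them
s_capped = s.clip(lower=lower, upper=upper)

print(s[is_outlier])
print("mean with outlier:", s.mean(), "| mean without:", s_removed.mean())
```

Comparing the two means shows how strongly a single extreme value can pull a summary statistic.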





#### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

When dealing with missing data in customer data analysis, there are several techniques that can be used:

- Deletion: This technique involves deleting rows or columns with missing data. If the amount of missing data is small, this method might be appropriate. However, if the missing data is large, this technique can lead to a loss of valuable information.

- Imputation: This technique involves filling in missing values with estimated values. This can be done using various methods such as mean, median, mode, or regression. Imputation can help preserve valuable information and minimize the impact of missing data.

- Prediction models: This technique involves building a prediction model to estimate missing values. This can be done using machine learning algorithms such as K-Nearest Neighbors or Decision Trees. Prediction models can be more accurate than simple imputation methods and can help preserve valuable information.

- Interpolation: This technique involves estimating missing values by interpolating between adjacent values. This can be done using methods such as linear interpolation, spline interpolation, or time series interpolation. Interpolation can help preserve valuable information and can be particularly useful in time series data.
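As a sketch of the prediction-model approach, scikit-learn's `KNNImputer` fills each missing entry from the rows that are most similar on the observed features. The customer columns and values below are hypothetical, and `n_neighbors=2` is an arbitrary choice for such a small table.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical customer table with gaps in age and monthly spend
customers = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "monthly_spend": [120.0, np.nan, 95.0, 210.0, 180.0],
    "visits": [4, 6, 3, 9, 8],
})

# Each missing entry is replaced by the average of that column over the
# 2 nearest rows, with distances computed on the features that are present
imputer = KNNImputer(n_neighbors=2)
customers_imputed = pd.DataFrame(
    imputer.fit_transform(customers), columns=customers.columns
)
print(customers_imputed)
```

In practice the features would usually be scaled first, since the nearest-neighbor distance is otherwise dominated by the columns with the largest ranges.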

#### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

There are several strategies that can be used to determine if missing data is missing at random or if there is a pattern to the missing data:

- Visual inspection: One simple approach is to visualize where values are missing, for example with a heatmap of the missingness indicator or a bar chart of missing counts per variable. If the data is missing completely at random, the missing entries should be scattered without structure and should not line up with the values of other variables. If there is a pattern to the missing data, we would see missing values cluster in certain rows, variables, or combinations of variables.

- Statistical tests: Another approach is to use statistical tests to determine if the missing data is missing at random. One commonly used test is Little's MCAR test, which tests the null hypothesis that the missing data is missing completely at random (MCAR). If the p-value from the test is high, we fail to reject the null hypothesis, which means the data is consistent with MCAR (although this does not prove it). A low p-value suggests that the data is not MCAR.

- Imputation and analysis: Imputing the missing data using different methods and analyzing the resulting dataset can also provide insights into the nature of the missing data. For example, if the imputed data results in a significant change in the distribution of a variable or in the results of the analysis, this may indicate that the missing data is not missing at random.

- Domain knowledge: Finally, domain knowledge can be used to determine if the missing data is likely to be missing at random or if there is a pattern to the missing data. For example, if missing data occurs only in certain groups of observations, this may indicate that the missing data is not missing at random.

In general, a combination of these strategies may be used to determine if the missing data is missing at random or if there is a pattern to the missing data. It is important to carefully consider the nature of the data and the research question when selecting an approach for handling missing data.
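A simple check along these lines is to build a missingness indicator and compare other variables across it. The sketch below simulates data in which income is, by construction, more likely to be missing for older customers; the column names, probabilities, and age bands are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: income is more likely to be missing for older customers
df_demo = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})
p_missing = np.where(df_demo["age"] > 60, 0.30, 0.05)
df_demo.loc[rng.random(n) < p_missing, "income"] = np.nan

# Share of missing values per column
print(df_demo.isna().mean())

# Compare another variable between rows with and without missing income
income_missing = df_demo["income"].isna().rename("income_missing")
print(df_demo.groupby(income_missing)["age"].mean())

# Missingness rate within age bands
age_band = pd.cut(df_demo["age"], bins=[17, 40, 60, 80])
missing_rate_by_band = income_missing.groupby(age_band, observed=True).mean()
print(missing_rate_by_band)
```

A missingness rate that differs clearly between groups, as it does here by design, is evidence against the data being missing completely at random.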





#### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

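Accuracy alone is misleading on this dataset, because a model that predicts "no condition" for every patient would still score highly. Useful strategies include:

- Class-sensitive metrics: Report the confusion matrix together with precision, recall (sensitivity), and F1-score for the minority class, along with threshold-independent measures such as ROC-AUC or the area under the precision-recall curve.
- Stratified splitting: Use stratified train/test splits or stratified k-fold cross-validation so that every split keeps the original class ratio.
- SMOTE (Synthetic Minority Over-sampling Technique): As mentioned earlier, SMOTE can balance the dataset by generating synthetic samples from the minority class until it matches the majority class. It should be applied to the training data only, so that the model is evaluated on the real class distribution.
- Random under-sampling: This technique involves randomly removing samples from the majority class until the dataset is balanced. This method can be effective when the dataset is large and the majority class has a large number of samples. As with SMOTE, it should be applied to the training data only.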
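The evaluation side of this question can be sketched with scikit-learn as follows. The synthetic data, the logistic regression model, and the use of `class_weight="balanced"` are assumptions chosen for illustration; the point is the stratified split and the minority-aware metrics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 5% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.95], random_state=0
)

# stratify keeps the class ratio the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# class_weight="balanced" reweights errors instead of resampling rows
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, y_pred)
pr_auc = average_precision_score(y_test, y_score)

print(cm)
print(classification_report(y_test, y_pred, digits=3))
print("PR-AUC:", pr_auc)
```

The per-class rows of the classification report show precision and recall for the minority class directly, which a single accuracy figure would hide.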

#### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

We can use down-sampling to balance the dataset. Some down-sampling techniques are:
- Random under-sampling: Randomly removing samples from the majority class until the dataset is balanced. This method can be effective when the dataset is large and the majority class has a significant number of samples.
- Stratified sampling: This method randomly selects samples from both the majority and minority classes while keeping the ratio of samples in each class the same as in the original dataset. Because it preserves the class proportions, it does not balance the dataset on its own. It is mainly useful for reducing the dataset size or creating train/test splits without distorting the class ratio further.
- Cluster-based under-sampling: This method involves grouping samples in the majority class into clusters and then randomly selecting samples from each cluster until the dataset is balanced. This method can be effective when the majority class has a complex distribution.
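A rough sketch of cluster-based under-sampling, assuming k-means as the clustering method and an equal quota per cluster. Both are illustrative choices rather than a fixed recipe, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: 900 satisfied (majority) and 100 dissatisfied (minority)
X_major = rng.normal(loc=0.0, scale=1.0, size=(900, 2))
X_minor = rng.normal(loc=2.0, scale=1.0, size=(100, 2))

# Group the majority class into clusters (the number of clusters is a choice)
n_clusters = 5
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_major)

# Take an equal quota from each cluster, at most what the cluster contains
per_cluster = len(X_minor) // n_clusters
keep = []
for c in range(n_clusters):
    members = np.flatnonzero(labels == c)
    n_take = min(per_cluster, len(members))
    keep.extend(rng.choice(members, size=n_take, replace=False))

X_major_down = X_major[keep]
print(X_major_down.shape, X_minor.shape)
```

Sampling per cluster keeps every region of the majority class represented, which plain random under-sampling does not guarantee.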

#### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

In this scenario, we can use the following up-sampling (over-sampling) methods:
- Random over-sampling: Randomly duplicating samples from the minority class until the dataset is balanced. This method can be effective when the dataset is small and the minority class has a small number of samples.
- SMOTE (Synthetic Minority Over-sampling Technique): This method generates new samples from the minority class by interpolating between the existing samples. It creates synthetic samples that are similar to the minority class samples but not identical. This method can be effective when the minority class has a small number of samples and there is a need to generate new samples.