# Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

---> Missing values in a dataset refer to the absence of an observation for a particular variable.

---> It is essential to handle missing values because many machine learning algorithms fail if the dataset contains missing values.
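
---> Before choosing how to handle missing values, it helps to measure how many there are. The sketch below counts them per column with pandas; the toy DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): NaN / None mark a missing observation.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "income": [50.0, 60.0, np.nan, np.nan],
})

missing_count = df.isna().sum()    # number of missing cells per column
missing_share = df.isna().mean()   # fraction of missing cells per column

print(missing_count)
print(missing_share)
```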

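---> Few algorithms are truly unaffected by missing values, but some tolerate them better than others:
    1) Tree-based methods such as gradient-boosted trees (for example XGBoost, LightGBM, or scikit-learn's HistGradientBoosting estimators), which can route missing values down one branch of a split
    2) Naive Bayes, which can skip a missing feature when computing class probabilities
    3) K-nearest neighbours, if distances are computed only over the features that are present

---> Ordinary linear or logistic regression does not belong in this group: it needs complete rows, so missing values must be dropped or imputed first. Library implementations of Naive Bayes and K-nearest neighbours may also reject missing values, so check before relying on them.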

# Q2: List down techniques used to handle missing data.  Give an example of each with python code.

In [1]:
#1)  Dropping missing data: This is the easiest way to handle missing values in Python. You can get rid of the rows or columns where there is missing information.
# code:
import pandas as pd

df = pd.read_csv('data.csv')
df.dropna(inplace=True)

#2) Imputing missing data with a specific value: You can replace missing values with a specific value such as mean, median or mode.
# code:
import pandas as pd

df = pd.read_csv('data.csv')
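# numeric_only=True skips text columns, which have no mean;
# swap in df.median(numeric_only=True) or df.mode().iloc[0] for median or mode
df.fillna(df.mean(numeric_only=True), inplace=True)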

#3) Using machine learning algorithms: You can use machine learning algorithms such as KNN (K-Nearest Neighbors) to impute missing values.
#code:
import pandas as pd
from sklearn.impute import KNNImputer

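df = pd.read_csv('data.csv')
numeric = df.select_dtypes(include='number')  # KNNImputer only accepts numeric columns
imputer = KNNImputer(n_neighbors=5)
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
df_imputed = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns, index=numeric.index)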

#4) Using deep learning algorithms: You can use deep learning algorithms such as Autoencoders to impute missing values. 
#code:
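import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

df = pd.read_csv('data.csv').select_dtypes(include='number')
# The network cannot train on NaN, so start from a mean-filled copy scaled to [0, 1]
# (the sigmoid output layer and binary cross-entropy loss both assume that range)
scaler = MinMaxScaler()
X = scaler.fit_transform(df.fillna(df.mean()))
n_features = X.shape[1]

input_layer = Input(shape=(n_features,))
encoded = Dense(32, activation='relu')(input_layer)
decoded = Dense(n_features, activation='sigmoid')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X, X, epochs=100)

# Map the reconstruction back to the original units and use it only where values were missing
reconstructed = pd.DataFrame(scaler.inverse_transform(autoencoder.predict(X)), columns=df.columns, index=df.index)
df_imputed = df.fillna(reconstructed)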

# Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

---> Imbalanced data refers to a situation, primarily in classification problems, where one target class accounts for a much larger share of the observations than the others.

---> If imbalanced data is not handled properly, it can lead to poor performance of the model. 

---> The model may be biased towards the majority class and may not be able to predict the minority class accurately.

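---> This shows up as many false negatives when the minority class is the positive class (or many false positives when it is the negative class), often hidden behind a high overall accuracy.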
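
---> A small worked example of this effect, using made-up labels (950 majority, 50 minority) and a "model" that always predicts the majority class:

```python
# Hypothetical labels: 950 negatives (majority) and 50 positives (minority).
y_true = [0] * 950 + [1] * 50

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_minority = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.95
print(f"minority recall = {recall_minority:.2f}")  # minority recall = 0.00
```

---> The accuracy looks good, yet every single minority example is missed.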



# Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and downsampling are required.

---> Up-sampling and Down-sampling are techniques used to handle unbalanced data.

* Up-sampling:

---> Up-sampling involves adding more samples to the minority class so that it becomes balanced with the majority class. 

--->  This can be useful when you have a small dataset and removing some samples will affect the overall performance of the model.

---> Example: Suppose you have a dataset with 1000 samples, out of which 100 belong to class A and 900 belong to class B. In this case, you can add more samples to class A so that it becomes balanced with class B.


* Down-sampling:

---> Down-sampling involves removing some samples from the majority class so that it becomes balanced with the minority class.

---> This can be useful when you have a large dataset and removing some samples will not affect the overall performance of the model.

---> Example: Suppose you have a dataset with 1000 samples, out of which 900 belong to class A and 100 belong to class B. In this case, you can remove some samples from class A so that it becomes balanced with class B.
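
---> Both techniques can be sketched with pandas alone. The DataFrame below is hypothetical and reuses the numbers from the up-sampling example (100 rows of class A, 900 rows of class B); `random_state=42` is an arbitrary seed.

```python
import pandas as pd

# Hypothetical dataset: 100 rows of class A (minority), 900 rows of class B (majority).
df = pd.DataFrame({
    "feature": range(1000),
    "label": ["A"] * 100 + ["B"] * 900,
})
minority = df[df["label"] == "A"]
majority = df[df["label"] == "B"]

# Up-sampling: draw minority rows with replacement until both classes have 900 rows.
upsampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=42),
])

# Down-sampling: keep a random subset of the majority so both classes have 100 rows.
downsampled = pd.concat([
    majority.sample(n=len(minority), replace=False, random_state=42),
    minority,
])

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```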

# Q5: What is data Augmentation? Explain SMOTE.

* Data Augmentation

---> Data augmentation is a technique used to increase the size of a dataset by creating new data from the existing data.

---> This can be useful when you have a small dataset and you want to improve the performance of your model.

---> For image data, augmentation can be done in several ways such as flipping, rotating, cropping, zooming, etc.
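
---> A minimal sketch of a few of these transformations with NumPy, using a tiny made-up array in place of a real image:

```python
import numpy as np

# A tiny 2x3 "image" standing in for real pixel data (hypothetical).
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

flipped = np.fliplr(image)   # horizontal flip (mirror left to right)
rotated = np.rot90(image)    # 90-degree counter-clockwise rotation
cropped = image[:, 1:]       # crop away the first column

# Each transformed copy can be added to the training set next to the original.
augmented = [flipped, rotated, cropped]
```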

* SMOTE

---> SMOTE stands for Synthetic Minority Over-sampling Technique.

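---> It is an over-sampling algorithm that creates synthetic minority-class points instead of duplicating existing ones: each new point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours.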

--->  SMOTE is widely used in preprocessing imbalanced datasets.
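
---> The core idea of SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation from any particular library; the function name `smote_sketch` and its parameters `n_new`, `k`, and `seed` are hypothetical.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))        # pick a minority sample
        neighbour = rng.choice(neighbours[base])    # pick one of its neighbours
        gap = rng.random()                          # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_sketch(X_min, n_new=6)
print(new_points.shape)  # (6, 2)
```

---> In practice a maintained implementation, such as the one in the imbalanced-learn package, is usually preferable to a hand-written one.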

# Q6: What are outliers in a dataset? Why is it essential to handle outliers?

---> An outlier in a data set is a value that is much larger or smaller than the others, or that lies an abnormal distance from the rest of the observations.

---> It is essential to handle outliers because they can have a significant impact on statistical analysis. Outliers can skew results and make it difficult to identify patterns in data.

--->  They can also lead to incorrect conclusions about relationships between variables.
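
---> One common convention for flagging outliers (an assumption here, since no specific rule is named above) is the interquartile range rule: values more than 1.5 x IQR below the first quartile or above the third quartile are treated as outliers. The sketch below applies it to made-up numbers and shows how a single outlier pulls the mean.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95])   # hypothetical measurements

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 1.5 is a convention, not a fixed law

outliers = values[(values < lower) | (values > upper)]
print(outliers)

# The mean is pulled towards the outlier; the median barely moves.
print(values.mean(), np.median(values))
```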

# Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

---> There are several techniques to handle the missing data.

---> One way is to delete the observations with missing values. 

---> Another technique is to impute the missing values. This involves filling in the missing values with estimated values based on the available data.
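
---> A minimal sketch of both options on made-up customer records. The column names are hypothetical, and the choice of median for numeric columns and most frequent value for categorical ones is one reasonable default rather than the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical customer records with gaps in numeric and categorical columns.
customers = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "plan": ["basic", "premium", None, "basic"],
    "monthly_spend": [20.0, 55.0, 30.0, np.nan],
})

# Option 1: delete every row that has at least one missing value.
complete_rows = customers.dropna()

# Option 2: impute. Median for numeric columns, most frequent value for categorical ones.
imputed = customers.copy()
for column in ["age", "monthly_spend"]:
    imputed[column] = imputed[column].fillna(imputed[column].median())
imputed["plan"] = imputed["plan"].fillna(imputed["plan"].mode()[0])

print(len(complete_rows), "complete rows out of", len(customers))
print(imputed)
```

---> Here deletion keeps only one of the four rows, which shows why imputation is often preferred when many rows have a gap somewhere.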

# Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

---> One way is to use your substantive knowledge of the data and your field. For example, the more sensitive a survey question is, the less likely people are to answer it, so the missingness depends on the topic rather than on chance.

---> Another way is to use statistical tests. Little's test, for instance, checks whether data are missing completely at random (MCAR). Whether data are missing at random (MAR) cannot be fully verified from the observed data alone, and because each test needs a hypothesis about where you expect values to be missing, testing every possibility quickly becomes cumbersome.

---> You can also use visualization techniques to explore the pattern of missing values, and stay alert to biases introduced by the data collection method when you have no strong hypothesis at the outset.
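
---> A simple first check is to flag the rows where a variable is missing and compare the other variables across the two groups. The survey data below is invented so that income is missing only for older respondents; a clear difference between groups suggests the data are not missing completely at random.

```python
import numpy as np
import pandas as pd

# Hypothetical survey: income is missing only for the older respondents.
df = pd.DataFrame({
    "age": [22, 25, 30, 35, 41, 48, 52, 58, 63, 70],
    "income": [30.0, 32.0, 40.0, 45.0, 50.0, 55.0, np.nan, np.nan, np.nan, np.nan],
})

# Flag rows with missing income, then compare another variable across the two groups.
df["income_missing"] = df["income"].isna()
mean_age_by_group = df.groupby("income_missing")["age"].mean()
print(mean_age_by_group)
```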

# Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

---> One way is to use appropriate evaluation metrics. Accuracy can be misleading on an imbalanced dataset, because a model that always predicts the majority class still scores highly. Precision, recall, the F1 score, and the area under the precision-recall or ROC curve describe performance on the minority class more faithfully.

---> Another way is to use resampling techniques such as oversampling or undersampling. Resample only the training data, so that the evaluation data keeps the real class distribution.

---> Oversampling involves increasing the number of instances in the minority class by generating synthetic samples.

---> Undersampling involves reducing the number of instances in the majority class by randomly removing samples.

---> Ensemble methods can also be used to balance the dataset and improve model performance. Ensemble methods involve combining multiple models to improve overall performance.

---> Cost-sensitive learning can also be used to compensate for the imbalance without changing the dataset. It assigns different costs to different types of errors, for example a higher penalty for missing a patient who has the condition than for raising a false alarm.
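
---> The metrics can be computed with scikit-learn. The labels and predictions below are made up for illustration: 90 healthy patients and 10 with the condition, with the model catching 6 of the 10 real cases and raising 5 false alarms.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical results: 90 healthy patients (0) and 10 with the condition (1).
y_true = [0] * 90 + [1] * 10
# The model raises 5 false alarms and catches 6 of the 10 real cases.
y_pred = [0] * 85 + [1] * 5 + [1] * 6 + [0] * 4

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

---> Accuracy comes out at 0.91, while recall shows that 4 of the 10 patients with the condition were missed.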


# Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

---> One way is to use resampling techniques such as oversampling or undersampling.

---> Oversampling involves increasing the number of instances in the minority class by generating synthetic samples.

---> Undersampling involves reducing the number of instances in the majority class by randomly removing samples.

---> Another way is to use cost-sensitive learning techniques. Cost-sensitive learning involves assigning different costs to different types of errors.

---> Ensemble methods can also be used to balance the dataset and improve model performance. Ensemble methods involve combining multiple models to improve overall performance.
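
---> As a sketch of the cost-sensitive option, scikit-learn can derive class weights that are inversely proportional to class frequency. The labels below are hypothetical (900 satisfied customers, 100 dissatisfied).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 900 satisfied customers (1) and 100 dissatisfied (0).
y = np.array([1] * 900 + [0] * 100)

classes = np.unique(y)
# "balanced" weights are n_samples / (n_classes * count_of_each_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

---> Many scikit-learn classifiers accept such a dictionary, or simply the string `"balanced"`, through their `class_weight` parameter, so errors on the rare class cost more during training.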


# Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

---> There are several methods that can be employed to balance an unbalanced dataset and up-sample the minority class.

1) Collect more data
2) Try changing your performance metric
3) Try resampling your dataset
4) Try generating synthetic samples
5) Try different algorithms
6) Try penalized models