Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.

Missing values in a dataset refer to the absence of a value for a variable in an observation or record. This can happen for various reasons, such as data entry errors, equipment failure, or subjects refusing to provide certain information. Missing values can be represented in different ways, such as blank cells, NaN (Not a Number), or placeholder codes like "-99" or "N/A".
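As a minimal sketch of how such placeholder codes might be normalised before analysis, the snippet below converts them to NaN with pandas. The `raw` frame, its column names, and the choice of -99, "N/A" and the empty string as placeholders are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data in which missing entries were recorded as placeholder codes.
raw = pd.DataFrame({
    "age": [25, -99, 40, 31],
    "city": ["Paris", "N/A", "", "Rome"],
})

clean = raw.copy()
# Map each column's placeholder codes to NaN so pandas recognises them as missing.
clean["age"] = clean["age"].replace(-99, np.nan)
clean["city"] = clean["city"].replace(["N/A", ""], np.nan)

# Count the missing entries per column after normalisation.
missing_per_column = clean.isna().sum()
print(missing_per_column)
```

Once the placeholders are NaN, functions such as `isnull()`, `dropna()` and `fillna()` treat them as missing.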

Handling missing values is crucial because they can lead to biased or inaccurate analyses and predictions if not dealt with appropriately. Missing values can distort summary statistics and correlations, and many machine learning implementations either reject input that contains them or perform poorly on it. Simply discarding every incomplete record can also result in the loss of valuable information and reduce the representativeness of the sample.

Few algorithms are truly unaffected by missing values, but some can accept them without a separate imputation step, depending on the implementation:

Decision trees: Missing values can be treated as a separate category, or handled with surrogate splits as in CART, so the tree can still route an observation whose value is missing.
Random Forest: As an ensemble of decision trees it can inherit their handling; some implementations instead impute missing values with the median or mode of the same column.
Gradient-boosted trees (for example XGBoost and LightGBM): They learn a default branch direction for missing values at each split.
Gaussian Naive Bayes: Because the features are treated as conditionally independent, a missing feature can be left out of the likelihood calculation for that observation.
K-Nearest Neighbors: Distances can be computed over only the features that are present in both data points.

Support Vector Machines, in contrast, generally need complete inputs, so missing values must be imputed or removed before training.

Q2: List down techniques used to handle missing data. Give an example of each with python code.

There are several techniques used to handle missing data. Here are some of the most common ones with examples in Python:
Deletion:
This technique involves removing the observations or variables with missing values. It can be done in two ways:
a. Listwise deletion: Removing the entire observation that contains any missing value.
b. Pairwise deletion: Skipping an observation only in the calculations that need its missing value, so each analysis uses every observation that is complete for the variables involved.
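The difference between the two can be illustrated on a small frame. This is a sketch under assumptions: the `toy` data and its column names are made up, and it relies on the documented behaviour that `DataFrame.corr()` excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each of the last three rows.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, np.nan, 1.0, 0.0],
})

# Listwise deletion: keep only the rows that are complete in every column.
listwise = toy.dropna()

# Pairwise deletion: each correlation uses the rows complete for that pair of columns.
pairwise_corr = toy.corr()

# Number of rows available to each pair of columns under pairwise deletion.
present = toy.notna().astype(int)
pairwise_counts = present.T @ present

print(len(listwise))
print(pairwise_counts)
```

Listwise deletion keeps fewer rows here than any single pair of columns has available, which is the usual argument for pairwise deletion when missing values are scattered across columns.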

In [1]:
import seaborn as sns 
df=sns.load_dataset('titanic')

In [2]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [3]:
df.isnull()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
887,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False
889,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [5]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

Row-wise: handling the missing values by deleting the rows that contain NaN values

In [6]:
df.dropna()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


Column-wise: handling the missing values by deleting the columns that contain NaN values

In [7]:
df.dropna(axis=1)

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,class,who,adult_male,alive,alone
0,0,3,male,1,0,7.2500,Third,man,True,no,False
1,1,1,female,1,0,71.2833,First,woman,False,yes,False
2,1,3,female,0,0,7.9250,Third,woman,False,yes,True
3,1,1,female,1,0,53.1000,First,woman,False,yes,False
4,0,3,male,0,0,8.0500,Third,man,True,no,True
...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,0,0,13.0000,Second,man,True,no,True
887,1,1,female,0,0,30.0000,First,woman,False,yes,True
888,0,3,female,1,2,23.4500,Third,woman,False,no,False
889,1,1,male,0,0,30.0000,First,man,True,yes,True
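Dropping every row or column that has any NaN can discard a lot of data. As a hedged sketch, `dropna` also accepts `subset` and `thresh` to make deletion more selective; the `toy` frame and the threshold of 3 below are assumptions chosen only to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "deck" is mostly missing, "age" has a single gap.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Drop a row only when "age" is missing; gaps in other columns are ignored.
rows_with_age = toy.dropna(subset=["age"])

# Keep only the columns that have at least 3 non-missing values.
dense_columns = toy.dropna(axis=1, thresh=3)

print(rows_with_age.shape)
print(list(dense_columns.columns))
```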


Handling the missing values by imputation
1 - Mean value imputation

In [8]:
df['Age_mean']=df['age'].fillna(df['age'].mean())

In [9]:
df[['age','Age_mean']]

Unnamed: 0,age,Age_mean
0,22.0,22.000000
1,38.0,38.000000
2,26.0,26.000000
3,35.0,35.000000
4,35.0,35.000000
...,...,...
886,27.0,27.000000
887,19.0,19.000000
888,,29.699118
889,26.0,26.000000


Handling the missing values by Median imputation

In [10]:
df['Age_median']=df['age'].fillna(df['age'].median())

In [11]:
df[['age','Age_median']]

Unnamed: 0,age,Age_median
0,22.0,22.0
1,38.0,38.0
2,26.0,26.0
3,35.0,35.0
4,35.0,35.0
...,...,...
886,27.0,27.0
887,19.0,19.0
888,,28.0
889,26.0,26.0
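The choice between the two matters most when the column is skewed. The following sketch uses a made-up `ages` series with one extreme value (an assumption, not Titanic data) to show that the mean fill value is pulled towards the extreme while the median is not.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme value (80) and one missing entry.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 80.0])

# mean() and median() both skip NaN by default.
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

# Position 4 is the entry that was missing.
print(mean_filled.iloc[4], median_filled.iloc[4])
```

With skewed data or outliers, the median is often the safer default for a numerical column.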


Handling missing values by mode imputation, which is used for categorical columns.

In [12]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Age_mean,Age_median
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,22.0,22.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,38.0,38.0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,26.0,26.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,35.0,35.0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,35.0,35.0


In [13]:
df['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [14]:
df['age'].notna()

0       True
1       True
2       True
3       True
4       True
       ...  
886     True
887     True
888    False
889     True
890     True
Name: age, Length: 891, dtype: bool

In [15]:
df[df['age'].notna()]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Age_mean,Age_median
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False,22.0,22.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,38.0,38.0
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True,26.0,26.0
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,35.0,35.0
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True,35.0,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,0,3,female,39.0,0,5,29.1250,Q,Third,woman,False,,Queenstown,no,False,39.0,39.0
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True,27.0,27.0
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,19.0,19.0
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True,26.0,26.0


In [18]:
mode=df['embarked'].mode()[0]  # mode() skips NaN by default

In [19]:
mode

'S'

In [20]:
df['embarked_mode']=df['embarked'].fillna(mode)

In [21]:
df[['embarked_mode','embarked']]

Unnamed: 0,embarked_mode,embarked
0,S,S
1,C,C
2,S,S
3,S,S
4,S,S
...,...,...
886,S,S
887,S,S
888,S,S
889,C,C


Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to a situation where the distribution of classes in a dataset is not equal. In other words, one class has significantly fewer samples than the other class(es). For example, in a binary classification problem where the task is to identify whether a credit card transaction is fraudulent or not, the majority of the transactions may be legitimate, while the fraudulent transactions make up a tiny minority.

If imbalanced data is not handled properly, it can lead to several issues. Firstly, it can result in a biased model, which may produce inaccurate predictions on the minority class. In the example above, a model trained on imbalanced data may classify all transactions as legitimate, as it is more likely to encounter legitimate transactions in the dataset. This is known as the "accuracy paradox," where a model may have high accuracy but low performance on the minority class.
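The accuracy paradox can be reproduced with a few lines of NumPy. The label counts (990 legitimate, 10 fraudulent) and the always-legitimate "model" are assumptions made for illustration.

```python
import numpy as np

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = np.array([0] * 990 + [1] * 10)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Overall accuracy looks excellent...
accuracy = (y_pred == y_true).mean()

# ...but none of the fraudulent transactions is detected.
fraud_recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(accuracy, fraud_recall)
```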

Secondly, because the model is trained to minimize the overall error and the majority class has more samples, the majority class may dominate the loss function. The model may then learn too little about the minority class, or memorize its few examples, and fail to generalize well to new data.

To address imbalanced data, several techniques can be used, such as oversampling the minority class, undersampling the majority class, or using algorithms specifically designed for imbalanced data, such as cost-sensitive learning or anomaly detection methods.

Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Up-sampling and down-sampling are two common techniques used to address imbalanced data by adjusting the class distribution in a dataset.

Up-sampling involves increasing the number of samples in the minority class to match the number of samples in the majority class. This can be done by replicating existing samples or generating new ones. For example, in a dataset with 100 samples of class A and 20 samples of class B, up-sampling would involve creating new samples of class B until the number of samples in class B is equal to 100. This can be done through techniques such as random oversampling or synthetic minority over-sampling technique (SMOTE).

Down-sampling involves decreasing the number of samples in the majority class to match the number of samples in the minority class. This can be done by randomly removing samples or selecting a subset of samples. For example, in a dataset with 100 samples of class A and 20 samples of class B, down-sampling would involve randomly removing 80 samples of class A until the number of samples in class A is equal to 20.
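The 100-versus-20 example above can be sketched with pandas. The frame `data`, its column names, and the fixed `random_state` are assumptions; `DataFrame.sample` is used for both directions, with replacement for up-sampling and without for down-sampling.

```python
import pandas as pd

# Hypothetical dataset: 100 samples of class A and 20 samples of class B.
data = pd.DataFrame({
    "feature": range(120),
    "label": ["A"] * 100 + ["B"] * 20,
})
majority = data[data["label"] == "A"]
minority = data[data["label"] == "B"]

# Up-sampling: draw minority rows with replacement until both classes have 100 rows.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw 20 majority rows without replacement.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts())
print(downsampled["label"].value_counts())
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the original class distribution.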

Up-sampling and down-sampling are required when dealing with imbalanced datasets, where one class has significantly fewer samples than the other. This can be the case in various domains, such as fraud detection, medical diagnosis, and anomaly detection. In these scenarios, up-sampling and down-sampling can help to balance the class distribution and improve the performance of the machine learning model.

For example, in fraud detection, the number of fraudulent transactions is usually much lower than legitimate ones. Therefore, up-sampling can be used to increase the number of fraudulent transactions, allowing the model to better capture the characteristics of fraudulent transactions. In contrast, when the majority class is very large, as in a medical dataset with many thousands of healthy patients and a low disease prevalence, down-sampling can be used to balance the class distribution so that the model is not dominated by the majority class, at the cost of discarding some majority-class information.

Q5: What is data Augmentation? Explain SMOTE.

Data augmentation is a technique used to artificially increase the size of a dataset by creating new samples from the existing ones. The goal is to improve the performance of a machine learning model by providing it with more training data. This is particularly useful when the available data is limited, as it allows the model to generalize better to new, unseen data.

One popular data augmentation technique for tabular data is the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE is a type of up-sampling that creates new synthetic samples for the minority class instead of duplicating existing ones.
For each synthetic sample, SMOTE picks a minority instance, selects one of its k nearest minority-class neighbours, and places the new point at a random position on the line segment between the two.
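The interpolation idea can be sketched in plain NumPy as follows. This is an illustrative simplification, not a reference implementation: `smote_sketch`, its parameters, and the toy points are hypothetical names chosen here, and in practice a maintained library implementation is usually preferable.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points from minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))              # pick a minority sample
        neighbour = X_min[rng.choice(neighbours[i])]  # pick one of its neighbours
        gap = rng.random()                        # random position in [0, 1)
        # Interpolate between the sample and the chosen neighbour.
        synthetic[j] = X_min[i] + gap * (neighbour - X_min[i])
    return synthetic

# Hypothetical minority-class points in two dimensions.
X_minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [1.5, 1.5]]
new_points = smote_sketch(X_minority, n_new=10)
print(new_points.shape)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies.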

Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points that are significantly different from other data points in a dataset. They are values that are unusually high or low, or that deviate from the general pattern of the data.

Handling outliers is essential because they can have a significant impact on statistical analyses and machine learning models. Outliers can skew the results of descriptive statistics, such as the mean and standard deviation, making them less representative of the true distribution of the data. In machine learning, outliers can also cause overfitting, where the model memorizes the training data, including the outliers, rather than learning the underlying patterns in the data. This can lead to poor generalization performance on new data.

In [2]:
import numpy as np
lit_marks=[45,32,56,75,89,54,32,89,87,67,54,45,98,99,67,74,1000,1100]

In [3]:
minimum,Q1,Q2,Q3,maximum=np.quantile(lit_marks,[0,0.25,0.50,0.75,1.0])

In [4]:
maximum

1100.0

In [5]:
IQR=Q3-Q1
print(IQR)

35.0


In [6]:
lf=Q1-1.5*(IQR)
hf=Q3+1.5*(IQR)

In [7]:
lf,hf

(1.5, 141.5)

In [9]:
outliers=[]
for i in lit_marks:
    # a value inside the fences [lf, hf] is not an outlier
    if i>=lf and i<=hf:
        print(i,'is not an outlier')
    else:
        outliers.append(i)
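The same IQR rule can also be written without a loop. The sketch below repeats the marks so that it runs on its own and uses NumPy boolean masks; the variable names `lower_fence`, `upper_fence`, `is_outlier` and `inliers` are introduced here for illustration. It also compares the mean with and without the flagged values.

```python
import numpy as np

lit_marks = np.array([45, 32, 56, 75, 89, 54, 32, 89, 87, 67,
                      54, 45, 98, 99, 67, 74, 1000, 1100])

# Quartiles and interquartile range.
q1, q3 = np.quantile(lit_marks, [0.25, 0.75])
iqr = q3 - q1

# Fences: values outside [lower_fence, upper_fence] are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (lit_marks < lower_fence) | (lit_marks > upper_fence)
outliers = lit_marks[is_outlier]
inliers = lit_marks[~is_outlier]

print(outliers)
# The mean is strongly affected by the two extreme marks.
print(lit_marks.mean(), inliers.mean())
```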
        

Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

Several techniques can be used to handle the missing data. The simplest are deleting the rows that contain NaN values or deleting the columns that contain NaN values. The alternative is imputation, which fills each missing value with an estimate.
Three simple imputation techniques are:
    1) mean imputation (numerical columns)
    2) median imputation (numerical columns)
    3) mode imputation (categorical columns)

Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?

When working with a large dataset that has missing data, it is important to determine if the missing data is missing at random or if there is a pattern to the missing data. Here are some strategies that can be used to determine the nature of the missing data:

Visual inspection: One approach is to visualize the missing data using a heatmap or scatterplot matrix to identify any patterns in the missing data. For example, missing data may occur more frequently for certain variables or combinations of variables.

Summary statistics: Another approach is to calculate summary statistics, such as the mean or median, for both the complete and incomplete cases. If there is a significant difference between these statistics, it may suggest that the missing data is not missing at random.

Statistical tests: Statistical tests, such as the chi-squared test or t-test, can be used to compare the distribution of the other variables between records where a value is missing and records where it is present. If there is a significant difference between these groups, it suggests that the data is not missing completely at random.

Machine learning models: A model can be trained to predict whether a value is missing (a missingness indicator) from the other variables. If the indicator can be predicted clearly better than chance, the missingness depends on the observed data and is therefore not completely at random.
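A first check along these lines can be done with pandas alone. The sketch below builds a missingness indicator and compares the other variables across it; the `customers` frame, its columns, and its values are made up for illustration, and a large gap between the groups is only a hint that would still need a formal test.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data in which "income" is missing for some records.
customers = pd.DataFrame({
    "age": [23, 25, 31, 35, 52, 58, 61, 67],
    "segment": ["web", "web", "web", "store", "store", "store", "store", "store"],
    "income": [30.0, 32.0, 41.0, 45.0, np.nan, 60.0, np.nan, np.nan],
})

# Indicator: True where income is missing.
customers["income_missing"] = customers["income"].isna()

# Compare a numerical variable between records with and without missing income.
age_by_missing = customers.groupby("income_missing")["age"].mean()

# Compare the missing rate across a categorical variable.
missing_rate_by_segment = customers.groupby("segment")["income_missing"].mean()

print(age_by_missing)
print(missing_rate_by_segment)
```

Here the records with missing income are noticeably older and all come from one segment, which would suggest a pattern rather than random missingness.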

Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Imbalanced datasets can pose a challenge in machine learning, particularly in the medical field where a small proportion of patients may have a rare condition of interest. Here are some strategies that can be used to evaluate the performance of a machine learning model on an imbalanced dataset:

Confusion Matrix: One of the simplest ways to evaluate the performance of a machine learning model is by creating a confusion matrix. It is a matrix that shows the number of true positives, false positives, true negatives, and false negatives. From the confusion matrix, metrics like precision, recall, and F1-score can be calculated.

Precision and Recall: Precision measures the proportion of true positives among all the predicted positives. Recall measures the proportion of true positives among all the actual positives. These metrics are useful when dealing with imbalanced datasets as they focus on the performance of the model on the minority class.
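These quantities can be computed directly from the predictions. The sketch below uses made-up labels for 100 patients (10 with the condition) and plain NumPy; the counts are assumptions chosen so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical results for 100 patients; label 1 means the condition is present.
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.array([1] * 6 + [0] * 4 + [1] * 3 + [0] * 87)

# Entries of the confusion matrix.
tp = int(((y_pred == 1) & (y_true == 1)).sum())  # true positives
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # false negatives
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # false positives
tn = int(((y_pred == 0) & (y_true == 0)).sum())  # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

print(tp, fn, fp, tn)
print(precision, recall, f1, accuracy)
```

Accuracy is high in this example even though four of the ten patients with the condition are missed, which is why precision and recall are reported for the minority class.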

Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

When dealing with an unbalanced dataset, where the majority class dominates the data, one common approach is to balance the dataset by down-sampling the majority class. This approach involves randomly removing samples from the majority class to reduce its size and bring it closer to the size of the minority class.

Random Under-Sampling: This method involves randomly selecting a subset of the majority class equal in size to the minority class. This method is simple to implement and can work well if the dataset is not too imbalanced.

Cluster-Based Under-Sampling: This method involves using clustering techniques to group similar data points together and then selecting representative samples from each cluster. This approach can work well when the majority class has different subgroups or clusters.

Tomek Links: This method identifies pairs of samples from different classes that are each other's nearest neighbours. Removing the majority class sample in each pair cleans the class boundary and slightly reduces the size of the majority class.

Edited Nearest Neighbors: This method involves identifying samples in the majority class that are misclassified by the k-nearest neighbor classifier and removing them. This approach can help improve the accuracy of the model by removing noisy samples from the majority class.

Synthetic Minority Over-sampling Technique (SMOTE): Strictly an up-sampling method rather than a down-sampling one, SMOTE creates new synthetic minority class samples by interpolating between existing minority class samples. It can be combined with the under-sampling methods above to increase the size of the minority class while the majority class is reduced.
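As an illustration of one of the methods above, the following sketch removes the majority-class member of every Tomek link using NumPy. The function name `remove_tomek_majority`, the one-dimensional toy scores, and the labels "sat" and "unsat" are hypothetical; a maintained library implementation would normally be used on real data.

```python
import numpy as np

def remove_tomek_majority(X, y, majority_label):
    """Drop majority samples that form a Tomek link (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Nearest neighbour of every sample, excluding the sample itself.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    keep = np.ones(len(X), dtype=bool)
    for i, j in enumerate(nearest):
        is_mutual = nearest[j] == i  # i and j are each other's nearest neighbour
        if is_mutual and y[i] != y[j] and y[i] == majority_label:
            keep[i] = False          # remove only the majority member of the link
    return X[keep], y[keep]

# Hypothetical one-feature data: four satisfied and two unsatisfied customers.
X = [[0.0], [1.0], [2.5], [5.0], [5.4], [9.0]]
y = ["sat", "sat", "sat", "sat", "unsat", "unsat"]

X_res, y_res = remove_tomek_majority(X, y, majority_label="sat")
print(X_res.ravel(), y_res)
```

Only the satisfied customer at 5.0, which sits right next to an unsatisfied one, is removed; Tomek link removal cleans the boundary rather than equalising the class counts.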



Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?