## Q1

In [None]:
'''
 Missing values are entries in a dataset for which no value was recorded in one or more columns or rows.
 They need to be handled because they can bias the analysis of the dataset and make its results inaccurate.
 Some algorithms can work with missing values directly, for example decision trees that use surrogate splits and gradient-boosted tree libraries such as XGBoost and LightGBM. Many others, including support vector machines, need the missing values to be dropped or imputed first.
'''

## Q2

In [1]:
## Techniques used to handle missing data

### Deletion: removes every row or column that has missing values in the dataset.
### Imputation: fills the missing values in a column with its mean, median, or mode.

## Example of deletion

In [3]:
import seaborn as sns
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
## checking the missing values
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [8]:
df.dropna()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


In [11]:
df.dropna().shape

(182, 15)

In [12]:
df.shape

(891, 15)

In [13]:
## comparing the two shapes, dropna() removed every row that contained a missing value (891 rows down to 182)
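
Dropping every incomplete row is costly here mostly because `deck` is missing in 688 rows. The sketch below shows a few less drastic `dropna` variants. It uses a small made-up frame (`toy`, with hypothetical values) instead of the Titanic data so that it runs without a download.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the Titanic data.
toy = pd.DataFrame({
    "age":      [22.0, np.nan, 26.0, 35.0],
    "deck":     [np.nan, "C", np.nan, "C"],
    "embarked": ["S", "C", np.nan, "S"],
    "fare":     [7.25, 71.28, 7.93, 53.10],
})

rows_any = toy.dropna()                    # drop rows with any missing value
rows_subset = toy.dropna(subset=["age"])   # only require `age` to be present
cols_dropped = toy.dropna(axis=1)          # drop columns that contain a missing value
rows_thresh = toy.dropna(thresh=3)         # keep rows with at least 3 non-missing values

print(len(rows_any), len(rows_subset), list(cols_dropped.columns), len(rows_thresh))
```

Restricting the check to the columns that matter (`subset`) or dropping a mostly empty column first usually keeps far more rows than a plain `dropna()`.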

## Example of Imputation
### mean

In [16]:
df["age_mean_imputed"] = df["age"].fillna(df["age"].mean())

In [17]:
df[["age_mean_imputed","age"]]

Unnamed: 0,age_mean_imputed,age
0,22.000000,22.0
1,38.000000,38.0
2,26.000000,26.0
3,35.000000,35.0
4,35.000000,35.0
...,...,...
886,27.000000,27.0
887,19.000000,19.0
888,29.699118,
889,26.000000,26.0


In [18]:
## above, rows where age was missing (e.g. row 888) now hold the mean of the age column (29.699118)

## Example median
### median is preferred when outliers are present in the data

In [23]:
df.isnull().sum()

survived              0
pclass                0
sex                   0
age                 177
sibsp                 0
parch                 0
fare                  0
embarked              2
class                 0
who                   0
adult_male            0
deck                688
embark_town           2
alive                 0
alone                 0
age_mean_imputed      0
dtype: int64

In [25]:
df["age_median_imputed"] = df["age"].fillna(df["age"].median())

In [27]:
df[["age_median_imputed","age"]]

Unnamed: 0,age_median_imputed,age
0,22.0,22.0
1,38.0,38.0
2,26.0,26.0
3,35.0,35.0
4,35.0,35.0
...,...,...
886,27.0,27.0
887,19.0,19.0
888,28.0,
889,26.0,26.0
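
To illustrate why the median is the safer fill value when outliers are present, here is a minimal plain-Python sketch. The ages are made up, and the value 350 stands in for a hypothetical data-entry error.

```python
from statistics import mean, median

ages = [22, 26, 27, 35, 38]          # hypothetical ages
ages_with_outlier = ages + [350]     # one hypothetical data-entry error

# How far does each statistic move when the outlier is added?
mean_shift = mean(ages_with_outlier) - mean(ages)
median_shift = median(ages_with_outlier) - median(ages)

print(f"mean moves by {mean_shift:.1f}, median moves by {median_shift}")
```

The single bad value drags the mean far away from the typical age, while the median barely changes, so a median-imputed value stays close to the bulk of the data.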


## Example of mode 
### used for categorical data

In [28]:
df["embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [31]:
mode = df["embarked"].mode()[0]

In [32]:
mode

'S'

In [33]:
df["embarked_mode_imputed"] = df["embarked"].fillna(mode)

In [None]:
df[["embarked","embarked_mode_imputed"]]

Unnamed: 0,embarked,embarked_mode_imputed
0,S,S
1,C,C
2,S,S
3,S,S
4,S,S
...,...,...
886,S,S
887,S,S
888,S,S
889,C,C


In [40]:
df["embarked_mode_imputed"].isnull().sum()

0

In [39]:
df["embarked"].isnull().sum()

2

## Q3

In [None]:
'''
Imbalanced data is data in which one class or group has far more observations than the others, i.e. the observations 
are not evenly distributed across the classes or groups.

If the imbalance is not handled, it can lead to a biased and inaccurate ML model. Because the model sees too few examples of the minority class, it tends to 
predict the majority class most of the time, even for data points that belong to the minority class. 

'''

## Q4

In [None]:
'''
Up-sampling involves increasing the number of instances in the minority class by randomly duplicating them. This can be done until the number of instances in the
minority class matches that of the majority class.
Down-sampling, on the other hand, involves reducing the number of instances in the majority class by randomly removing some instances.


Here's an example to illustrate when up-sampling and down-sampling are required:

Let's say we have a dataset with 1000 observations, where 90% of the observations belong to Class A and 10% belong to Class B. 

If we choose to up-sample, we can randomly duplicate instances in Class B to increase the number of instances until it matches that of Class A. 
For example, we can sample Class B with replacement until each of its 100 instances appears nine times on average, giving a new dataset with 900 instances in Class A and 900 instances in Class B.

If we choose to down-sample, we can randomly remove instances in Class A to reduce the number of instances until it matches that of Class B.
For example, we can remove 800 instances from Class A to get a new dataset with 100 instances in Class A and 100 instances in Class B.

'''
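
The 900/100 example above can be sketched in plain Python. The rows are hypothetical `(label, id)` pairs, and the seed is fixed only to make the sketch reproducible.

```python
import random
from collections import Counter

random.seed(0)  # reproducible sketch

# 1000 hypothetical labelled rows: 900 of class "A", 100 of class "B".
data = [("A", i) for i in range(900)] + [("B", i) for i in range(100)]
majority = [row for row in data if row[0] == "A"]
minority = [row for row in data if row[0] == "B"]

# Up-sampling: draw extra minority rows with replacement until the sizes match.
extra = random.choices(minority, k=len(majority) - len(minority))
upsampled = majority + minority + extra

# Down-sampling: keep a random subset of the majority, without replacement.
downsampled = random.sample(majority, k=len(minority)) + minority

up_counts = Counter(label for label, _ in upsampled)
down_counts = Counter(label for label, _ in downsampled)
print(up_counts, down_counts)
```

Up-sampling keeps all the original information but repeats minority rows, while down-sampling discards 800 majority rows, so the choice usually depends on how much data can be spared.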

## Q5

In [None]:
'''
Data augmentation is a technique used in machine learning to increase the size and diversity of a dataset by applying 
various transformations to the original data. These transformations can include rotations, flips, translations, 
and other modifications that preserve the essential features of the original data.

SMOTE (Synthetic Minority Over-sampling Technique) is a popular data augmentation technique used to address the problem
of imbalanced datasets, where the number of instances in the minority class is significantly lower than that in the 
majority class. SMOTE generates synthetic samples by interpolating between existing minority class instances,
thereby creating new data points that are similar to the original data but are not identical. This technique helps to
balance the dataset and can improve the model's performance on the minority class.
'''
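
The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration, not the reference implementation from any library: the function name `smote_sketch`, the parameter `k`, and the sample points are all assumptions made for this example.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority-class points.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_sketch(minority_points, n_new=5)
print(new_points)
```

Each synthetic point falls between two existing minority points, which is why the new samples resemble the originals without duplicating them.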

## Q6

In [None]:
'''

Outliers are data points that deviate significantly from the rest of the data in a dataset. Outliers can occur due to errors in data 
collection, measurement errors, or other factors. It is essential to handle outliers because they can have a significant impact on the
statistical properties of the dataset, such as the mean and variance, and can also affect the performance of machine learning models.

'''
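
As a rough illustration of that impact, the sketch below flags outliers with the common 1.5 × IQR rule and compares the mean and variance with and without them. The rule is one convention among several, and the fare values are hypothetical.

```python
from statistics import mean, pvariance, quantiles

# Hypothetical fares; the last value is extreme.
fares = [7.25, 7.92, 8.05, 13.0, 26.55, 30.0, 51.86, 53.1, 512.33]

q1, _, q3 = quantiles(fares, n=4)            # quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 1.5 * IQR fences

outliers = [x for x in fares if x < low or x > high]
kept = [x for x in fares if low <= x <= high]

print("outliers:", outliers)
print("mean with/without:", round(mean(fares), 2), round(mean(kept), 2))
print("variance with/without:", round(pvariance(fares), 2), round(pvariance(kept), 2))
```

Whether a flagged value should be removed, capped, or kept depends on whether it is an error or a genuine extreme observation.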

## Q7

In [None]:
'''

There are several techniques that can be used to handle missing data in a dataset, including:

Deletion: Removing the rows or columns that contain missing values from the dataset.

Imputation: Replacing the missing values with estimates based on the available data. Common imputation techniques include mean imputation, 
median imputation, mode imputation, and regression imputation.

Interpolation: Estimating missing values from neighbouring observations using techniques such as linear interpolation or 
polynomial interpolation. This is mainly suitable for ordered data such as time series.
'''
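
Linear interpolation can be sketched in plain Python as below. The helper `interpolate_linear` is a hypothetical name used only for this example, and it assumes evenly spaced observations. In pandas, `Series.interpolate()` provides comparable behaviour.

```python
def interpolate_linear(values):
    """Fill None gaps that lie between two known values by linear interpolation.
    Leading and trailing gaps are left as None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

series = [10.0, None, None, 16.0, None, 20.0, None]  # hypothetical readings
filled_series = interpolate_linear(series)
print(filled_series)  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, None]
```

The trailing gap stays empty because there is no later value to interpolate towards.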

## Q8

In [None]:
'''

There are several strategies that can be used to determine if missing data is missing at random or if there is a pattern to the missing data,
including:

Visual inspection: Examining the patterns of missing data using visualizations such as histograms or heat maps.

Statistical tests: Performing statistical tests to determine if the missing data is associated with other variables in the dataset.

Machine learning models: Training machine learning models to predict the missing data and evaluating the performance of the models.

Imputation: Imputing the missing data using different methods and comparing the results to determine if there is a pattern to the missing data.

'''
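
One simple first check, sketched below, is to compare the rate of missing values across the groups of another variable. The `(pclass, age)` records are hypothetical. A large difference in rates only suggests that the data is not missing completely at random, and a formal test would be needed to confirm it.

```python
from collections import defaultdict

# Hypothetical (pclass, age) records; None marks a missing age.
records = [(1, 38.0), (1, 35.0), (1, 54.0), (1, None),
           (3, 22.0), (3, None), (3, None), (3, None)]

counts = defaultdict(lambda: [0, 0])  # pclass -> [missing, total]
for pclass, age in records:
    counts[pclass][0] += age is None
    counts[pclass][1] += 1

# Share of missing ages within each passenger class.
missing_rate = {pclass: missing / total for pclass, (missing, total) in counts.items()}
print(missing_rate)
```

In this made-up sample the age is missing three times as often in class 3 as in class 1, which is the kind of pattern that points to missingness depending on another variable.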

## Q9

In [None]:
'''
Use metrics that stay informative when the model is trained on an imbalanced dataset. Accuracy is not a good metric here because the model can reach high accuracy by simply 
predicting the majority class every time. Metrics such as precision, recall, and F1-score are better suited to imbalanced datasets because 
they take into account the performance on the minority class.

'''
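
The point about accuracy can be checked with a small plain-Python sketch. The labels are hypothetical, with the minority class treated as the positive class, and undefined ratios are reported as 0.0 by assumption.

```python
y_true = [1] * 10 + [0] * 90   # 10 minority (positive) and 90 majority (negative) labels
y_pred = [0] * 100             # a "model" that always predicts the majority class

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)   # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)   # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)   # false negatives

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy, precision, recall, f1)
```

The accuracy looks strong at 0.9, yet recall and F1 are both 0.0 because not a single minority instance is found.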


## Q10

In [None]:
'''
One method to balance the dataset is to down-sample the majority class. Some strategies for down-sampling are:

Random under-sampling: randomly removing instances from the majority class to match the size of the minority class.

Cluster-based under-sampling: grouping instances from the majority class into clusters and selecting only a subset of the clusters to match
the size of the minority class.

'''

## Q11

In [None]:
'''
Oversampling: This technique increases the number of instances in the minority class. It can be done with simple random duplication of existing instances or with more sophisticated methods such as SMOTE (Synthetic Minority Over-sampling Technique), which generates new synthetic instances instead of copying existing ones.
'''