#### (1)-
##### Missing values occur in a dataset when no value is stored for a variable in some observations. There are 3 mechanisms by which missing values occur-
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Missing Not At Random (MNAR)
##### It is important to handle missing values because they can bias the results, lead to inaccurate predictions, and reduce the generalizability of the model.
##### Algorithms often cited as tolerant of missing values are KNN, Random Forest, histogram-based gradient boosting and Naive Bayes. Whether they accept missing values directly depends on the implementation: for example, scikit-learn's histogram-based gradient boosting estimators handle NaN natively, while many other estimators expect the missing values to be imputed first.

#### (2)-
##### We use imputation techniques to handle missing values-
- Mean value imputation
- Median value imputation
- Mode value imputation

#####  1) Mean value imputation-

In [1]:
import seaborn as sns

df= sns.load_dataset("titanic")
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [2]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [5]:
df["age_mean"]= df["age"].fillna(df["age"].mean())        # the NaN values in the age column are replaced with the mean of the age column and stored in a new column- age_mean

In [4]:
df[["age_mean", "age"]]

Unnamed: 0,age_mean,age
0,22.000000,22.0
1,38.000000,38.0
2,26.000000,26.0
3,35.000000,35.0
4,35.000000,35.0
...,...,...
886,27.000000,27.0
887,19.000000,19.0
888,29.699118,
889,26.000000,26.0


##### 2) Median value imputation-

In [6]:
df["age_median"]= df["age"].fillna(df["age"].median())

In [7]:
df[["age_median", "age"]]

Unnamed: 0,age_median,age
0,22.0,22.0
1,38.0,38.0
2,26.0,26.0
3,35.0,35.0
4,35.0,35.0
...,...,...
886,27.0,27.0
887,19.0,19.0
888,28.0,
889,26.0,26.0


##### 3) Mode value imputation-

In [9]:
df[df["embarked"].isnull()]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age_mean,age_median
61,1,1,female,38.0,0,0,80.0,,First,woman,False,B,,yes,True,38.0,38.0
829,1,1,female,62.0,0,0,80.0,,First,woman,False,B,,yes,True,62.0,62.0


In [11]:
df["embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [12]:
df[df["embarked"].notna()]["embarked"].mode()

0    S
Name: embarked, dtype: object

In [13]:
mode_value= df[df["embarked"].notna()]["embarked"].mode()[0]
mode_value

'S'

In [14]:
df["embarked_mode"]= df["embarked"].fillna(mode_value)

In [15]:
df[["embarked_mode", "embarked"]]

Unnamed: 0,embarked_mode,embarked
0,S,S
1,C,C
2,S,S
3,S,S
4,S,S
...,...,...
886,S,S
887,S,S
888,S,S
889,C,C


In [16]:
df["embarked_mode"].isnull().sum()

0

#### (3)-
##### An imbalanced dataset is one in which the target classes have an uneven distribution of observations, i.e. one class label has a very high number of observations and the other has a very low number of observations.
##### If we don't handle the imbalance, the classifier will be biased towards the majority class, because it learns far more from majority-class examples than from minority-class examples. In the extreme case, the model assumes that any data you feed it belongs to the majority class. Such a model is naive in its predictions even though it achieves a high accuracy score.
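##### A minimal sketch of this effect, using made-up labels (95 majority, 5 minority) and a "model" that always predicts the majority class. The numbers are assumptions chosen only to illustrate the point.

```python
import numpy as np

# Hypothetical labels: 95 negatives (majority) and 5 positives (minority).
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Overall accuracy: fraction of labels predicted correctly.
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: fraction of true positives that were found.
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

##### The accuracy is high even though not a single minority observation is detected.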

#### (4)-
##### Up-sampling is a technique in which we randomly duplicate observations from the minority class in order to reinforce its signal. Down-sampling is a technique in which we randomly remove observations from the majority class to prevent its signal from dominating the learning algorithm.
##### Example of down-sampling- in a dataset of patients with and without cancer, the patients who have cancer are usually far fewer, so the dataset is imbalanced. Here we can down-sample the majority class, i.e. the cancer-free patients, to obtain a balanced dataset.
##### Example of up-sampling- in a dataset containing the heights of boys and girls where the boys' heights make up the majority of data points, we can up-sample the minority class, i.e. the girls' heights, to obtain a balanced dataset.

#### (5)-
##### Data augmentation is a method that works much like oversampling, with a twist- rather than making exact duplicates of observations in the minority class, we add small perturbations to the copied data points, i.e. we create synthetic data based on the original data.
##### SMOTE (Synthetic Minority Over-sampling Technique) is a technique used in machine learning to address imbalanced datasets where the minority class has significantly fewer instances than the majority class. SMOTE generates synthetic instances of the minority class by interpolating between an existing minority instance and one of its nearest minority-class neighbours. So, SMOTE performs data augmentation by creating synthetic data points based on the original data points. Its advantage is that it does not generate duplicates, but synthetic data points that are slightly different from the original ones.
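##### The interpolation idea can be sketched with NumPy alone. This is a simplified, SMOTE-like illustration rather than the reference algorithm or any library's implementation; the function name `smote_like_sample`, its parameters and the sample points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space.
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])


def smote_like_sample(X, k=2, n_new=5, rng=rng):
    """Create n_new synthetic points, each lying on the segment between a
    randomly chosen point and one of its k nearest neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # Euclidean distance from point i to every point (including itself).
        dists = np.linalg.norm(X - X[i], axis=1)
        # Skip position 0 of the sorted order, which is the point itself
        # (assuming there are no exact duplicate points).
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # uniform in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)


X_new = smote_like_sample(X_minority)
print(X_new.shape)
```

##### Each synthetic point is a convex combination of two real minority points, so it stays inside the region already covered by the minority class.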

#### (6)-
##### Outliers are extreme values that differ from most other data points in a dataset. It is important to handle outliers because they can have a big impact on our statistical analyses and skew the results of hypothesis tests. The mean is particularly sensitive to outliers: a single extreme value can shift it drastically.
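##### A small sketch of how one extreme value moves the mean while the median barely changes. The values are made up for illustration.

```python
import numpy as np

# Hypothetical measurements without and with a single extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 10, 12])
with_outlier = np.append(values, 100)

mean_before, mean_after = values.mean(), with_outlier.mean()
median_before, median_after = np.median(values), np.median(with_outlier)

print(f"mean:   {mean_before:.3f} -> {mean_after:.3f}")
print(f"median: {median_before:.3f} -> {median_after:.3f}")
```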

#### (7)-
##### We can handle missing data by deleting the affected rows or by deleting the columns that contain null values, using the dropna() method. But this results in loss of data, so we use imputation techniques such as-
1. Mean value imputation- We replace the null values in the column with the mean value of the column. This method is effective if the data is approximately normally distributed.
2. Median value imputation- We replace the null values in the column with the median value of the column. This method is effective if the column contains outliers, since the median is robust to them.
3. Mode value imputation- We replace the null values of a categorical column with the mode (most frequent value) of that column.
4. Random value imputation- We replace each null value in the column with a value drawn at random from the observed values of that column. Random value imputation is used when the data is MCAR.
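##### Random value imputation can be sketched with pandas as follows. The `age` values are hypothetical, and drawing with replacement from the observed values is one reasonable way to do it, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
age = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 27.0], name="age")

# Draw one replacement per missing entry from the observed (non-null) values.
observed = age.dropna()
n_missing = int(age.isna().sum())
random_draws = observed.sample(n=n_missing, replace=True, random_state=42)

# Re-label the draws with the positions of the missing entries so that
# fillna() aligns each draw with one missing row.
random_draws.index = age[age.isna()].index
age_random = age.fillna(random_draws)

print(age_random)
```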

#### (8)-
##### Strategies to determine the type of missing data-
- MCAR- We create a new binary variable where missing is coded as 1 and not missing is coded as 0, and then compare the other variables between the 1's and the 0's, using t-tests for numeric variables and chi-square tests for categorical variables. Little's MCAR test performs this comparison jointly across variables. If there is no significant relation between the missingness and the other variables, the result is consistent with the data being missing completely at random.
- MAR- If the missingness is significantly related to other observed variables (for example, the mean of another feature differs between the 1's and the 0's), the data is not MCAR. Missing at random is then a plausible mechanism, since the missingness can be explained by observed data.
- MNAR- Missing not at random means the missingness depends on the unobserved values themselves, so it cannot be confirmed by a test on the observed data alone. It is usually assessed using domain knowledge about how the data was collected, or through sensitivity analyses.
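##### A hedged sketch of the indicator-variable comparison on synthetic data. The columns `age` and `income`, and the way missingness is generated, are assumptions made up for this example; Welch's t-test from SciPy is used as one possible choice of test.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 500

# Synthetic data: "income" is more likely to be missing for older people,
# so the missingness depends on the observed "age" column.
age = rng.normal(40, 10, size=n)
income = rng.normal(50_000, 8_000, size=n)
p_missing = 1 / (1 + np.exp(-(age - 40) / 5))  # logistic function of age
income[rng.random(n) < p_missing] = np.nan

df = pd.DataFrame({"age": age, "income": income})

# Binary indicator: 1 where income is missing, 0 where it is present.
df["income_missing"] = df["income"].isna().astype(int)

# Compare the mean of another feature (age) between the two groups.
age_missing = df.loc[df["income_missing"] == 1, "age"]
age_present = df.loc[df["income_missing"] == 0, "age"]
t_stat, p_value = stats.ttest_ind(age_missing, age_present, equal_var=False)

print(f"mean age (income missing): {age_missing.mean():.1f}")
print(f"mean age (income present): {age_present.mean():.1f}")
print(f"p-value: {p_value:.3g}")
```

##### A small p-value here is evidence against MCAR. It does not by itself distinguish MAR from MNAR.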

#### (9)-
##### Although accuracy is an important metric for evaluating the performance of a machine learning model, it can be misleading on an imbalanced dataset. In such cases we evaluate the model with other performance metrics such as-
- Confusion matrix- A breakdown of predictions into a table showing correct predictions along the diagonal and the types of incorrect predictions off the diagonal.
- Precision- A measure of a classifier's exactness: the fraction of predicted positives that are actually positive.
- Recall- A measure of a classifier's completeness: the fraction of actual positives that are predicted as positive.
- F1 score- The harmonic mean of precision and recall.
- Cohen's kappa- Agreement between predictions and true labels, corrected for the agreement expected by chance given the class distribution.
- ROC curve- A plot of the true positive rate against the false positive rate across classification thresholds, often summarised by the area under the curve (AUC).
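##### These metrics are all available in `sklearn.metrics`. The sketch below computes them on a small set of made-up labels, predictions and scores.

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical imbalanced example: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
# Hypothetical predicted probabilities for the positive class.
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.35, 0.6, 0.9, 0.4]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # uses scores, not hard predictions

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} kappa={kappa:.3f} roc_auc={auc:.3f}")
```

##### Here the accuracy looks reasonable while precision, recall and kappa expose how weak the model is on the minority class.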

#### (10)-
##### Our dataset here is imbalanced, with the majority of customers satisfied with the product and very few not satisfied. So, we can down-sample the majority class using the resample() function from sklearn.utils. Down-sampling involves randomly selecting examples from the majority class to delete from the training dataset.
##### First we split the data by target class into 2 data frames, one with the majority class and another with the minority class. Then we pass the majority data frame to resample(), set replace to False (because we only want to reduce the data points of the majority class, without duplicating any), and set the sample size equal to the size of the minority data frame. The resampled data frame is then concatenated with the minority data frame, giving a data frame with an equal no. of majority and minority class data points.
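##### The steps above can be sketched as follows. The data frame and its column names (`rating`, `satisfied`) are hypothetical stand-ins for the customer satisfaction data.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical data: 90 satisfied customers (1) and 10 not satisfied (0).
df = pd.DataFrame({
    "rating": list(range(100)),
    "satisfied": [1] * 90 + [0] * 10,
})

# Split by target class.
df_majority = df[df["satisfied"] == 1]
df_minority = df[df["satisfied"] == 0]

# Randomly keep as many majority rows as there are minority rows,
# sampling without replacement so no row is duplicated.
df_majority_down = resample(
    df_majority,
    replace=False,
    n_samples=len(df_minority),
    random_state=42,
)

# Combine the reduced majority class with the untouched minority class.
df_balanced = pd.concat([df_majority_down, df_minority])
print(df_balanced["satisfied"].value_counts())
```

##### In practice this is applied to the training split only, so that the test set keeps the original class distribution.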

#### (11)-
##### Our dataset here is imbalanced, with a low percentage of occurrences of a rare event. So, we can up-sample the minority class using the resample() function from sklearn.utils. Up-sampling involves randomly duplicating examples from the minority class and adding them to the training dataset.
##### First we split the data by target class into 2 data frames, one with the majority class and another with the minority class. Then we pass the minority data frame to resample(), set replace to True, and set the sample size equal to the size of the majority data frame. The resampled data frame is then concatenated with the majority data frame, giving a data frame with an equal no. of majority and minority class data points.
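##### A minimal sketch of these up-sampling steps. The data frame and its column names (`feature`, `event`) are hypothetical stand-ins for the rare-event data.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical data: 95 rows without the event (0) and 5 rows with it (1).
df = pd.DataFrame({
    "feature": list(range(100)),
    "event": [0] * 95 + [1] * 5,
})

# Split by target class.
df_majority = df[df["event"] == 0]
df_minority = df[df["event"] == 1]

# Sample the minority rows with replacement until they match the
# majority class in size; rows are necessarily duplicated.
df_minority_up = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),
    random_state=42,
)

# Combine the untouched majority class with the enlarged minority class.
df_balanced = pd.concat([df_majority, df_minority_up])
print(df_balanced["event"].value_counts())
```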
##### We can also up-sample the minority class using SMOTE, through the fit_resample() method of the SMOTE class (from the imbalanced-learn library). SMOTE generates synthetic instances of the minority class by interpolating between existing instances, so the new data points are slightly different from the original ones rather than exact duplicates.