## Question 1

Missing data or missing values occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Incomplete data can bias the results of the machine learning models and/or reduce the accuracy of the model. 

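Some algorithms can tolerate missing values, depending on the implementation: Naive Bayes can skip a missing attribute when computing class probabilities, tree-based methods such as Random Forest support missing values in some implementations, and k-NN can be used with a distance measure that ignores missing coordinates.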

## Question 2

Several techniques can be used to handle missing values in a dataset, including the following:


###### 1. Dropping columns or rows that have missing values

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
df=sns.load_dataset('titanic')

In [2]:
df.isnull().sum()  # the dataset has missing values in the columns age, embarked, deck and embark_town

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [3]:
df.shape  # here we have 891 entries in total

(891, 15)

In [4]:
## removing the rows that have null values
df.dropna().shape

(182, 15)

In [5]:
df.dropna(axis=1).shape  # removing the columns having missing values in the dataset

(891, 11)

The dataset originally contains 891 rows. After removing every row with a missing value only 182 rows remain, and removing every column with a missing value leaves 11 of the 15 columns. Both options discard a large amount of data, so dropping is not an effective way of handling the missing values in this dataset; it is reasonable only when very few rows or columns are affected.
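Dropping does not have to be all or nothing. The sketch below uses a small hand-made frame (the values and the threshold of 3 are illustrative assumptions, not taken from the Titanic data) to show how the `thresh` and `subset` parameters of `dropna` can limit the loss to the sparsest column and to rows missing one specific field.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one mostly empty column and a few scattered gaps
toy = pd.DataFrame({
    "age":      [22.0, np.nan, 35.0, 41.0, 29.0],
    "embarked": ["S", "C", None, "S", "Q"],
    "deck":     [np.nan, np.nan, "B", np.nan, np.nan],
    "fare":     [7.25, 71.28, 8.05, 53.10, 8.46],
})

# Keep only the columns that have at least 3 non-missing values
dense = toy.dropna(axis=1, thresh=3)

# Then drop only the rows in which 'embarked' is missing
cleaned = dense.dropna(subset=["embarked"])

print(toy.dropna().shape)  # dropping every incomplete row
print(cleaned.shape)       # selective dropping
```

The remaining gaps (here, one missing age) can then be filled with one of the imputation techniques.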

###### 2. Mean Imputation

In mean imputation we replace the missing values in a column with the mean of that column. Mean imputation works best when the data is approximately normally distributed and has no strong outliers, because the mean is sensitive to extreme values.

In [6]:
df['age'].tail()

886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, dtype: float64

In [7]:
df['mean_age']=df['age'].fillna(df['age'].mean())

In [8]:
df['mean_age'].tail()  # the entry at index 888 is replaced by the mean of the age column

886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: mean_age, dtype: float64

###### 3. Median Imputation

In median imputation we replace the missing values with the median of the column. Median imputation is preferred for skewed data and for data that may contain outliers, because the median is robust to extreme values.

In [9]:
df['median_age']=df['age'].fillna(df['age'].median())

In [10]:
df['median_age'].tail()  # the entry at index 888 is replaced by the median value

886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: median_age, dtype: float64
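The difference between the two fill values matters most when a column contains extreme values. As a minimal sketch, assuming a made-up series of ages with one implausible entry, the mean is pulled far away from the typical value while the median is not:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme value and one missing entry
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 200.0, np.nan])

mean_filled = ages.fillna(ages.mean())      # fill value is pulled up by 200
median_filled = ages.fillna(ages.median())  # fill value stays near the typical age

print(ages.mean())    # 58.4
print(ages.median())  # 24.0
```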

###### 4. Mode Imputation

Mode imputation is used to fill missing values in categorical variables. Each missing value is replaced with the mode, i.e. the most frequently occurring category in the column.


In [11]:
df['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [12]:
df[df['embarked'].isnull()]  # there are two entries with NaN values

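|  | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | mean_age | median_age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 61 | 1 | 1 | female | 38.0 | 0 | 0 | 80.0 | NaN | First | woman | False | B | NaN | yes | True | 38.0 | 38.0 |
| 829 | 1 | 1 | female | 62.0 | 0 | 0 | 80.0 | NaN | First | woman | False | B | NaN | yes | True | 62.0 | 62.0 |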


In [17]:
mode_val=df['embarked'].mode()[0]  # mode() ignores NaN by default

In [18]:
df['new_embarked']=df['embarked'].fillna(mode_val)

In [19]:
df['new_embarked'].isnull().sum()

0

## Question 3

Imbalanced data refers to datasets in which the target classes have an uneven distribution of observations, i.e. one class label has a very high number of observations and the other has a very low number. These are called the majority class and the minority class, respectively.
When one class has far more records than the other, a classifier can become biased towards predicting that class, which hurts the overall performance of the model, especially on the minority class.



## Question 4

Upsampling is a procedure in which additional data points for the minority class are added to the dataset, either by resampling existing minority points with replacement or by generating synthetic ones. After this process the counts of both labels are almost the same, which keeps the model from leaning towards the majority class, while the interaction between the target classes remains largely unaltered. The drawback is that the added points carry no genuinely new information, so upsampling can introduce bias and make the model overfit the minority examples.

Downsampling is a mechanism that reduces the number of training samples in the majority class so that the counts of the target categories even out. Because it discards collected data, we can lose valuable information.

In [20]:
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(42)
class_ratio=0.8
number_samples=1200
n_class_0=int(number_samples*class_ratio)
n_class_1=number_samples-n_class_0

In [21]:
class_0=pd.DataFrame({
    'f1':np.random.normal(loc=0,scale=1,size=n_class_0),
    'f2':np.random.normal(loc=0,scale=1,size=n_class_0),
    'target':[0]*n_class_0
})
class_1=pd.DataFrame({
    'f1':np.random.normal(loc=2,scale=1,size=n_class_1),
    'f2':np.random.normal(loc=2,scale=1,size=n_class_1),
    'target':[1]*n_class_1
})

In [22]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [23]:
df

|  | f1 | f2 | target |
| --- | --- | --- | --- |
| 0 | 0.496714 | 0.642723 | 0 |
| 1 | -0.138264 | 1.329153 | 0 |
| 2 | 0.647689 | 0.196521 | 0 |
| 3 | 1.523030 | 0.709004 | 0 |
| 4 | -0.234153 | -0.089736 | 0 |
| ... | ... | ... | ... |
| 1195 | 1.383639 | 0.550355 | 1 |
| 1196 | 1.624804 | 1.078140 | 1 |
| 1197 | 1.682285 | 0.996043 | 1 |
| 1198 | 3.281644 | 2.207267 | 1 |


In [24]:
df['target'].value_counts()  # we have created an imbalanced dataset with 0 being the majority class and 1 being the minority class

0    960
1    240
Name: target, dtype: int64

## Upsampling

In [25]:
# upsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [26]:
from sklearn.utils import resample
# sample the minority class with replacement until it matches the size of the majority class
df_minority_upsampled=resample(df_minority,replace=True,n_samples=len(df_majority),random_state=42)

In [27]:
df_minority_upsampled

|  | f1 | f2 | target |
| --- | --- | --- | --- |
| 1062 | 3.360659 | 3.235782 | 1 |
| 1139 | 2.817890 | 1.339679 | 1 |
| 1052 | 1.230858 | 0.548824 | 1 |
| 974 | 4.644343 | 1.164857 | 1 |
| 1066 | 3.800511 | 0.221412 | 1 |
| ... | ... | ... | ... |
| 1136 | 2.670481 | 2.500666 | 1 |
| 970 | 0.770450 | 2.987335 | 1 |
| 1044 | 0.106385 | 3.617213 | 1 |
| 985 | 1.630390 | 3.533434 | 1 |


In [28]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled])

In [29]:
df_upsampled['target'].value_counts()  # the minority class has been upsampled and now has the same number of occurrences as the majority class

0    960
1    960
Name: target, dtype: int64

## Downsampling

In [30]:
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(42)
class_ratio=0.8
number_samples=1200
n_class_0=int(number_samples*class_ratio)
n_class_1=number_samples-n_class_0

In [31]:
class_0=pd.DataFrame({
    'f1':np.random.normal(loc=0,scale=1,size=n_class_0),
    'f2':np.random.normal(loc=0,scale=1,size=n_class_0),
    'target':[0]*n_class_0
})
class_1=pd.DataFrame({
    'f1':np.random.normal(loc=2,scale=1,size=n_class_1),
    'f2':np.random.normal(loc=2,scale=1,size=n_class_1),
    'target':[1]*n_class_1
})

In [33]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)
df['target'].value_counts()  # we have created an imbalanced dataset with 0 being the majority class and 1 being the minority class

0    960
1    240
Name: target, dtype: int64

In [34]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [35]:
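# resample defaults to replace=True; passing replace=False would guarantee that no majority row is drawn twice
df_majority_downsampled=resample(df_majority,n_samples=len(df_minority),random_state=42)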

In [36]:
df_majority_downsampled

|  | f1 | f2 | target |
| --- | --- | --- | --- |
| 102 | -0.342715 | 1.148766 | 0 |
| 435 | 0.074095 | 0.184680 | 0 |
| 860 | 0.202923 | -0.131257 | 0 |
| 270 | 1.441273 | -0.557492 | 0 |
| 106 | 1.886186 | -1.294681 | 0 |
| ... | ... | ... | ... |
| 376 | 0.872321 | -0.613403 | 0 |
| 282 | 1.586017 | -0.581681 | 0 |
| 957 | 0.456753 | -0.620848 | 0 |
| 632 | -0.158008 | 0.436560 | 0 |


In [37]:
df_downsampled=pd.concat([df_minority,df_majority_downsampled])

In [39]:
df_downsampled['target'].value_counts()  # the majority class has been downsampled to the same number of occurrences as the minority class

1    240
0    240
Name: target, dtype: int64

## Question 5

Data augmentation is a technique for artificially enlarging the training set by creating modified copies of existing data. It includes making minor changes to the data or using deep learning to generate new data points; either way, the augmented data is derived from the original data. In image augmentation, for example, we apply geometric and color-space transformations (flipping, resizing, cropping, changing brightness or contrast) to increase the size and diversity of the training set.
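The image transformations mentioned above can be sketched with plain NumPy. The 4x4 random array stands in for a grayscale image and the brightness offset of 40 is an arbitrary assumption; real pipelines would normally use an image library instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 4x4 grayscale "image" with pixel values in [0, 255]
image = rng.integers(0, 256, size=(4, 4)).astype(np.uint8)

flipped = np.fliplr(image)   # horizontal flip
rotated = np.rot90(image)    # 90-degree rotation
# Brightness increase, clipped so the values stay in the valid range
brighter = np.clip(image.astype(np.int16) + 40, 0, 255).astype(np.uint8)

# One original image has become four training examples
augmented = np.stack([image, flipped, rotated, brighter])
print(augmented.shape)  # (4, 4, 4)
```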

SMOTE (Synthetic Minority Oversampling Technique) is based on the k-nearest-neighbours algorithm: it synthetically generates data points that lie close to the existing points of the minority class. In other words, it places new points on the line segment between a minority point and one of its nearest minority neighbours. The input records must not contain any null values when this approach is applied.
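The interpolation idea behind SMOTE can be illustrated in a few lines of NumPy. This is a simplified sketch, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters and the five minority points are all hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and one of their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point

    base = rng.integers(0, len(X_min), size=n_new)              # random base points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class samples with two features
X_minority = np.array([[2.0, 2.0], [2.5, 1.8], [1.8, 2.6], [3.0, 2.2], [2.2, 3.1]])
synthetic = smote_sketch(X_minority, n_new=10)
print(synthetic.shape)  # (10, 2)
```

Because every synthetic point is a weighted average of two existing minority points, it always falls inside the region already covered by the minority class.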

## Question 6

In simple terms, an outlier (also called an aberration, abnormal point or anomaly) is a data point that is extremely high or extremely low relative to the other values in the dataset. Outliers stand out greatly from the overall pattern of values in a dataset or graph.

It is essential to detect and handle outliers because they can have a significant impact on many statistics, such as the mean and the variance, and on the performance of ML models.
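One common way to flag outliers is the 1.5 x IQR rule, which is used here as an illustrative choice rather than the only option. The measurements below are made up for the example.

```python
import numpy as np

# Hypothetical measurements with one extreme value
values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 95], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences of the 1.5 * IQR rule

outliers = values[(values < lower) | (values > upper)]
print(outliers)                           # [95.]

# The single outlier moves the mean much more than the median
print(values.mean(), np.median(values))
```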

## Question 7

When working on the customer dataset we first look at the type of values in each column. For numerical columns we inspect the distribution: if it is approximately normal and has no outliers, we can fill the missing values with mean imputation; if the data is skewed or contains outliers, we use median imputation, because the mean is affected by outliers.

For categorical variables we use mode imputation, in which each missing value is replaced by the mode, i.e. the most frequently occurring category in the column.

We can also delete the rows or columns that contain missing values, but this leads to loss of data.

Another, more advanced technique is k-nearest-neighbour imputation, in which a machine learning algorithm fills each missing value from the most similar rows (using a Euclidean distance metric).
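A minimal sketch of k-nearest-neighbour imputation, assuming scikit-learn's `KNNImputer` is available. The two customer features and their values are invented for illustration, and `n_neighbors=2` is an arbitrary choice for such a tiny table.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical customer features: [age, income in thousands]
X = np.array([
    [25.0, 40.0],
    [27.0, np.nan],
    [26.0, 44.0],
    [52.0, 90.0],
    [np.nan, 88.0],
])

# Each gap is filled with the average of that feature over the 2 nearest rows,
# where distance is computed on the coordinates both rows have observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Because the distances depend on feature scale, it is usually sensible to standardise the columns before applying this kind of imputation.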

## Question 8

When working with a big dataset in which a small percentage of the data is missing, we can detect the missing values visually with the missingno library, which offers a series of visualizations of the behaviour and distribution of missing data inside a pandas data frame: a bar plot, a matrix plot, a heatmap and a dendrogram.
The matrix plot gives a visual representation of the data frame in which white lines mark the positions of the missing values in each column. The heatmap gives a clearer picture of how the missingness of one feature relates to the missingness of another. Observing these patterns helps us judge whether the data is missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR).
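The quantities behind such plots can also be inspected numerically. The sketch below uses pandas only, on a hand-made frame with hypothetical columns: it computes the share of missing values per column and the correlation between missingness indicators, which is the kind of relationship a nullity heatmap visualizes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which 'deck' and 'cabin' are always missing together
toy = pd.DataFrame({
    "age":   [22.0, np.nan, 35.0, 41.0, np.nan, 30.0],
    "deck":  ["B", None, "C", None, None, "E"],
    "cabin": ["B5", None, "C2", None, None, "E1"],
    "fare":  [7.25, 71.28, 8.05, 53.10, 8.46, 26.0],
})

is_missing = toy.isnull()
missing_share = is_missing.mean()                 # fraction of missing values per column

# Correlate the missingness indicators of columns that have at least one gap
incomplete = is_missing.loc[:, is_missing.any()]
nullity_corr = incomplete.astype(int).corr()

print(missing_share)
print(nullity_corr)
```

A correlation close to 1 between two columns means they tend to be missing in the same rows, which suggests the gaps are not completely random.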

## Question 9

Suppose we are working on a medical diagnosis project. The data will be imbalanced because only a small number of patients have the condition of interest. A model can then reach a high accuracy just by predicting the majority class while failing to capture the minority class, which is usually the reason for building the model in the first place. With highly unbalanced classes, a classifier may end up predicting the most common class almost every time, largely ignoring the features; its accuracy will look high but it will be misleading.

Resampling techniques are therefore used to address the class imbalance problem. They consist of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling). We can use techniques such as random under-sampling and random over-sampling with imblearn, or the Synthetic Minority Oversampling Technique (SMOTE). We can also change the performance metrics and rely on the confusion matrix, precision, recall, F1 score and area under the ROC curve instead of accuracy alone.
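To see why accuracy alone is misleading here, consider a baseline that always predicts the majority class. The sketch below assumes a made-up test set of 95 healthy patients and 5 patients with the condition; `binary_metrics` is a hypothetical helper written for this example.

```python
import numpy as np

# Hypothetical test set: 95 healthy patients (0) and 5 with the condition (1)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)   # baseline that always predicts the majority class

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the confusion-matrix counts."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = binary_metrics(y_true, y_pred)
print(metrics)  # accuracy is 0.95, but precision, recall and F1 are all 0.0
```

The baseline never finds a single patient with the condition, yet its accuracy is 95%, which is why recall and F1 are more informative for this kind of problem.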
 

## Question 10

When estimating customer satisfaction our dataset can be unbalanced. To solve this problem we can use resampling techniques to make the dataset more balanced, so that the model is not biased towards the majority class. One option is under-sampling (downsampling), in which we reduce the number of majority-class data points until it is close or equal to that of the minority class, although this can lead to loss of data. Alternatively, we can upsample the minority class to bring it close to the majority class.
The resample function can be imported from the sklearn.utils module.

## Question 11


When estimating the occurrence of a rare event the data is unbalanced, because there are far fewer records in which the event occurs. To balance the dataset we can upsample the minority class, i.e. the records in which the rare event occurs. The resample function of the sklearn.utils module can be used to upsample this class and bring it closer to the size of the majority class. We can also import SMOTE (Synthetic Minority Over-sampling Technique) from the imblearn (imbalanced-learn) library, which handles the imbalance by creating synthetic data points between pairs of minority data points using the k-NN algorithm.