In [None]:
#1
'''
Missing values in a dataset refer to the absence of a particular value in a specific observation or variable.
They can occur due to various reasons, such as data entry errors, equipment malfunction, or participants refusing
to answer a particular question in a survey.

Handling missing values is crucial for several reasons:

Biased analysis: If missing values are not appropriately handled, they can lead to biased or inaccurate results 
during data analysis. This can impact the validity and reliability of the conclusions drawn from the data.

Distorted relationships: Missing values can affect the relationships and correlations between variables.
Their presence can distort statistical measures and lead to incorrect interpretations of the data.

Algorithm compatibility: Many machine learning algorithms cannot directly handle missing values and may raise
errors or behave unexpectedly. Therefore, it is essential to address missing values before applying these algorithms.

Algorithms:
SVM
Naive-Bayes
'''

In [None]:
#2
'''
Techniques used to handle missing data are:
i. Mean Imputation
ii. Median Imputation
iii. Mode Imputation
'''


In [8]:
#2 Example
import seaborn as sns
df = sns.load_dataset('titanic')
df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [9]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [12]:
#Mean Imputation
mean = df['age'].mean()
df['age_mean']=df['age'].fillna(mean)
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
age_mean         0
dtype: int64

In [13]:
df[['age','age_mean']]

,age,age_mean
0,22.0,22.000000
1,38.0,38.000000
2,26.0,26.000000
3,35.0,35.000000
4,35.0,35.000000
...,...,...
886,27.0,27.000000
887,19.0,19.000000
888,,29.699118
889,26.0,26.000000


In [16]:
#Median Imputation
median = df['age'].median()
df['age_med']=df['age'].fillna(median)
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
age_mean         0
age_med          0
dtype: int64

In [18]:
df[['age','age_med']]

,age,age_med
0,22.0,22.0
1,38.0,38.0
2,26.0,26.0
3,35.0,35.0
4,35.0,35.0
...,...,...
886,27.0,27.0
887,19.0,19.0
888,,28.0
889,26.0,26.0


In [22]:
#mode imputation
df[df['embarked'].isnull()]

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age_mean,age_med
61,1,1,female,38.0,0,0,80.0,,First,woman,False,B,,yes,True,38.0,38.0
829,1,1,female,62.0,0,0,80.0,,First,woman,False,B,,yes,True,62.0,62.0


In [24]:
df['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [30]:
# mode() skips NaN by default, so no explicit notna() filter is needed
mode = df['embarked'].mode()[0]
mode

'S'

In [32]:
df['embarked_mode'] = df['embarked'].fillna(mode)

In [34]:
df[df['embarked'].isnull()]

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age_mean,age_med,embarked_mode
61,1,1,female,38.0,0,0,80.0,,First,woman,False,B,,yes,True,38.0,38.0,S
829,1,1,female,62.0,0,0,80.0,,First,woman,False,B,,yes,True,62.0,62.0,S


In [None]:
#3
'''
Imbalanced data refers to a situation where the distribution of target classes in a dataset is significantly
skewed or disproportionate. In other words, one class has a much larger number of instances compared to the 
other class(es). 

For example, in a binary classification problem, if 95% of the data belongs to Class A and only 5% belongs 
to Class B, the data is imbalanced.

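If the imbalance is not handled, it leads to:
- Poor generalization: the model sees too few minority-class examples to learn that class well.
- Biased model performance: predictions favor the majority class, so overall accuracy can look high
  even when the minority class is mostly misclassified.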
'''
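
The effect described above can be illustrated with a minimal sketch. The labels below are hypothetical and mirror the 95%/5% split from the example; the "model" is a stand-in that always predicts the majority class, not a trained classifier.

```python
import numpy as np

# Hypothetical labels: 95 samples of class 0 (majority), 5 of class 1 (minority)
y_true = np.array([0] * 95 + [1] * 5)

# A stand-in "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

# Overall accuracy: fraction of all predictions that are correct
accuracy = (y_pred == y_true).mean()

# Minority recall: fraction of true class-1 samples that were predicted as class 1
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

The accuracy looks strong while no minority sample is ever detected, which is why accuracy alone is a poor metric on imbalanced data.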

In [None]:
#4
'''
Two common methods for handling imbalanced datasets are:
i. Upsampling
ii. Downsampling

Upsampling:
Upsampling is the process of increasing the number of instances in the minority class to balance it with
the majority class.
- If minority samples are simply duplicated, the model may overfit to the repeated points.

Downsampling:
Downsampling is the process of decreasing the number of instances in the majority class to balance it with
the minority class.
- Data is lost in this approach, because majority-class rows are discarded.
'''
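
Both methods can be sketched with plain pandas sampling. The toy DataFrame, the column names `feature` and `target`, and the variable names below are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy dataset: 8 majority rows (target 0), 2 minority rows (target 1)
df_toy = pd.DataFrame({
    "feature": range(10),
    "target": [0] * 8 + [1] * 2,
})
majority = df_toy[df_toy["target"] == 0]
minority = df_toy[df_toy["target"] == 1]

# Upsampling: draw minority rows with replacement until they match the majority
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Downsampling: keep a random subset of majority rows, without replacement
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["target"].value_counts().to_dict())
print(downsampled["target"].value_counts().to_dict())
```

After upsampling, both classes have as many rows as the original majority class; after downsampling, both have as many as the original minority class, at the cost of discarding majority rows.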

In [None]:
#5
'''
Data augmentation is a technique used to artificially increase the size of a dataset by creating new, synthetic
data points.

SMOTE (Synthetic Minority Over-sampling Technique) is a specific data augmentation technique designed to address
the class imbalance problem. It increases the number of instances in the minority class by generating synthetic
samples that lie between an existing minority class sample and one of its nearest minority class neighbours.
'''
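
The core idea of SMOTE can be sketched in a few lines of NumPy. This is a simplified, hand-written illustration rather than a library implementation: the function name `smote_sketch`, its parameters, and the sample points are all hypothetical. In practice, a maintained implementation such as `SMOTE` from the imbalanced-learn package would normally be used instead.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic points from the minority samples X_min (requires k < len(X_min))."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))         # pick a random minority sample
        nb = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                      # interpolation factor in [0, 1)
        # New point lies on the line segment between the sample and its neighbour
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points in two dimensions
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_sketch(X_minority, n_new=5)
print(new_points.shape)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region the minority class already occupies instead of being exact duplicates.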

In [None]:
#6
'''
- Outliers are data points that significantly deviate from the majority of the data in a dataset. 
- They are observations that lie far away from the central tendency of the data and may exhibit unusual or 
  extreme values. 
- Outliers can occur due to various reasons, such as measurement errors, data corruption, or rare events.

Effects of Outliers:
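- Degraded model performance: extreme values can pull fitted parameters away from the bulk of the data.
- Misleading interpretations: statistics such as the mean and standard deviation are distorted.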
'''
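
A small sketch of how a single outlier distorts a summary statistic. The measurements below are made up for illustration; the last value plays the role of the outlier.

```python
import numpy as np

# Hypothetical measurements; the last value is an extreme outlier
values = np.array([10, 12, 11, 13, 12, 11, 100])

mean_with = values.mean()
median_with = np.median(values)
mean_without = values[:-1].mean()
median_without = np.median(values[:-1])

print("with outlier:    mean =", round(mean_with, 2), " median =", median_with)
print("without outlier: mean =", round(mean_without, 2), " median =", median_without)
```

The mean roughly doubles when the outlier is included, while the median barely moves, which is one reason median-based statistics are often preferred when outliers are present.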

In [None]:
#7
'''
Some techniques that can be used to handle missing data:
i. Deletion:
            Delete the rows that contain a null value. However, this is wasteful, because the
            other values in those rows are lost as well.
ii. Imputation:
            Fill the null values with a measure of central tendency,
            i.e., the mean, median, or mode.
'''
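
The two techniques can be compared side by side on a small example. The DataFrame `toy` and its values are hypothetical; the column names only echo the Titanic columns used earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "embarked": ["S", "C", None, "S", "S"],
})

# Deletion: drop every row that contains at least one null value
deleted = toy.dropna()

# Imputation: median for the numeric column, mode for the categorical one
imputed = toy.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["embarked"] = imputed["embarked"].fillna(imputed["embarked"].mode()[0])

print("rows before:", len(toy))
print("rows after deletion:", len(deleted))
print("rows after imputation:", len(imputed))
print("nulls after imputation:", int(imputed.isnull().sum().sum()))
```

Deletion keeps only the fully observed rows, whereas imputation keeps every row and fills the gaps with a representative value.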

In [None]:
#8
'''
MCAR Test: - This test (e.g., Little's MCAR test) assesses whether the data is missing completely at random (MCAR).
           - Its null hypothesis is that the data is MCAR. If the p-value is not significant, the null
             hypothesis is not rejected, so the missingness is consistent with MCAR. A significant
             p-value suggests that the missingness follows a pattern.
'''

In [None]:
#9
'''
The best approach to handle the imbalanced data in this scenario is to upsample the minority class of
patients, so that no patient data is lost, as it would be with downsampling.
'''

In [None]:
#10
'''SMOTE in combination with downsampling can be used so that there will be less loss of data.'''

In [None]:
#11
'''SMOTE'''