**Q1**
Missing values in a dataset refer to the absence of data for certain variables or observations. They can occur for various reasons, such as data collection errors, equipment malfunctions, or simply because some information was not collected. Handling missing values is essential because they can lead to biased or inaccurate results during analysis or modeling. Ignoring missing values may result in incorrect conclusions, reduced statistical power, and compromised model performance.

Algorithms that can tolerate missing values, depending on the implementation, include:

- Decision Trees: Some implementations handle missing values during tree construction, for example through surrogate splits or by learning which branch missing values should follow.

- Random Forests: Because they are built from decision trees, Random Forests can handle missing values in the same way when the underlying tree implementation supports it.

- Naive Bayes: Each feature contributes its probability independently, so a missing feature can simply be left out of the calculation for that observation.

- K-Nearest Neighbors (KNN): KNN can handle missing values if distances are computed over only the features that are available. Many library implementations still expect complete inputs, so imputation is often applied first.
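
To make the KNN point concrete, here is a minimal sketch of a distance that uses only the features present in both rows. The helper name `nan_euclidean` and the sample rows are assumptions made for this illustration. Rescaling by the share of usable features is one common convention, not the only one.

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over features observed in both rows (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    present = ~np.isnan(a) & ~np.isnan(b)    # features available in both rows
    if not present.any():
        return np.nan                         # no shared features: distance undefined
    sq = np.sum((a[present] - b[present]) ** 2)
    # Rescale so rows with fewer shared features are not artificially "closer"
    return np.sqrt(sq * a.size / present.sum())

# Hypothetical rows with gaps in different positions
row_a = [3.0, np.nan, np.nan, 6.0]
row_b = [1.0, np.nan, 4.0, 5.0]
dist = nan_euclidean(row_a, row_b)
print(dist)
```

Only the first and last features are shared here, so the squared differences of those two are summed and scaled up by 4 / 2 before the square root is taken.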

In [7]:
# Q2
# Dropping missing values
import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')
# Drop every row that contains at least one missing value
df = df.dropna()
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


In [17]:
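# Imputation (filling with a constant)
df = sns.load_dataset('titanic')
print(df)
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which may not update the dataframe in recent pandas versions
df['deck'] = df['deck'].fillna('C')
df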

     survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0           0       3    male  22.0      1      0   7.2500        S   Third   
1           1       1  female  38.0      1      0  71.2833        C   First   
2           1       3  female  26.0      0      0   7.9250        S   Third   
3           1       1  female  35.0      1      0  53.1000        S   First   
4           0       3    male  35.0      0      0   8.0500        S   Third   
..        ...     ...     ...   ...    ...    ...      ...      ...     ...   
886         0       2    male  27.0      0      0  13.0000        S  Second   
887         1       1  female  19.0      0      0  30.0000        S   First   
888         0       3  female   NaN      1      2  23.4500        S   Third   
889         1       1    male  26.0      0      0  30.0000        C   First   
890         0       3    male  32.0      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alo

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,C,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,C,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,C,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,C,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,C,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [19]:
# Imputation (filling with the mean)
df = sns.load_dataset('titanic')
print(df)
# Replace missing ages with the column mean
df['age'] = df['age'].fillna(df['age'].mean())
df

     survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0           0       3    male  22.0      1      0   7.2500        S   Third   
1           1       1  female  38.0      1      0  71.2833        C   First   
2           1       3  female  26.0      0      0   7.9250        S   Third   
3           1       1  female  35.0      1      0  53.1000        S   First   
4           0       3    male  35.0      0      0   8.0500        S   Third   
..        ...     ...     ...   ...    ...    ...      ...      ...     ...   
886         0       2    male  27.0      0      0  13.0000        S  Second   
887         1       1  female  19.0      0      0  30.0000        S   First   
888         0       3  female   NaN      1      2  23.4500        S   Third   
889         1       1    male  26.0      0      0  30.0000        C   First   
890         0       3    male  32.0      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alo

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.000000,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.000000,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.000000,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.000000,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.000000,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.000000,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.000000,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,29.699118,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.000000,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [23]:
# Imputation (filling with the median)
df = sns.load_dataset('titanic')
print(df)
# Replace missing ages with the column median
df['age'] = df['age'].fillna(df['age'].median())
df

     survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0           0       3    male  22.0      1      0   7.2500        S   Third   
1           1       1  female  38.0      1      0  71.2833        C   First   
2           1       3  female  26.0      0      0   7.9250        S   Third   
3           1       1  female  35.0      1      0  53.1000        S   First   
4           0       3    male  35.0      0      0   8.0500        S   Third   
..        ...     ...     ...   ...    ...    ...      ...      ...     ...   
886         0       2    male  27.0      0      0  13.0000        S  Second   
887         1       1  female  19.0      0      0  30.0000        S   First   
888         0       3  female   NaN      1      2  23.4500        S   Third   
889         1       1    male  26.0      0      0  30.0000        C   First   
890         0       3    male  32.0      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alo

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,28.0,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [25]:
# Imputation (forward fill)
df = sns.load_dataset('titanic')
print(df)
# Propagate the last observed age forward; fillna(method='ffill') is deprecated in recent pandas
df['age'] = df['age'].ffill()
df

     survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0           0       3    male  22.0      1      0   7.2500        S   Third   
1           1       1  female  38.0      1      0  71.2833        C   First   
2           1       3  female  26.0      0      0   7.9250        S   Third   
3           1       1  female  35.0      1      0  53.1000        S   First   
4           0       3    male  35.0      0      0   8.0500        S   Third   
..        ...     ...     ...   ...    ...    ...      ...      ...     ...   
886         0       2    male  27.0      0      0  13.0000        S  Second   
887         1       1  female  19.0      0      0  30.0000        S   First   
888         0       3  female   NaN      1      2  23.4500        S   Third   
889         1       1    male  26.0      0      0  30.0000        C   First   
890         0       3    male  32.0      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alo

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,19.0,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


**Q3**

Imbalanced data refers to a situation where the classes in a classification problem are not represented equally. One class (the minority class) has significantly fewer instances than the other class (the majority class). If imbalanced data is not handled, machine learning models may exhibit biased behavior toward the majority class, leading to poor performance in predicting the minority class.

Consequences of not handling imbalanced data include:

- Biased Models: Models tend to predict the majority class, ignoring the minority class.

- Poor Generalization: The model may perform well on the majority class but poorly on the minority class, leading to a lack of generalization.

- Misleading Accuracy: Accuracy may appear high due to successful prediction of the majority class, but the model's effectiveness on the minority class is often low.
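
The misleading-accuracy point can be illustrated with a small sketch. The labels below are hypothetical (an 80/20 split), and the "model" is deliberately trivial: it always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 80% majority class (0), 20% minority class (1)
y_true = np.array([0] * 800 + [1] * 200)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Share of all predictions that are correct
accuracy = np.mean(y_pred == y_true)

# Recall on the minority class: share of true 1s that were predicted as 1
minority_recall = np.mean(y_pred[y_true == 1] == 1)

print(f"accuracy = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

Accuracy matches the majority share even though no minority instance is ever identified, which is why per-class metrics such as recall are worth checking alongside it.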

**Q4**

- **Up-sampling**: Increasing the number of instances in the minority class to balance the class distribution. This is typically done by duplicating or generating new instances from the minority class.
- **Down-sampling**: Decreasing the number of instances in the majority class to balance the class distribution. This is typically done by randomly removing instances from the majority class.

In [45]:
# Setting up the dataframe
import numpy as np
import pandas as pd

np.random.seed(156)
n_sample = 10000
class_0_ratios = 0.8
n_class0 = int(n_sample * class_0_ratios)
n_class1 = n_sample - n_class0

# Creating one dataframe per class

class_0 = pd.DataFrame({
    'feature1': np.random.normal(loc=0, scale=1, size=n_class0),
    'feature2': np.random.normal(loc=0, scale=1, size=n_class0),
    'target': [0] * n_class0,
})

class_1 = pd.DataFrame({
    'feature1': np.random.normal(loc=0, scale=1, size=n_class1),
    'feature2': np.random.normal(loc=0, scale=1, size=n_class1),
    'target': [1] * n_class1,
})

# Stack both classes into a single imbalanced dataframe
df = pd.concat([class_0, class_1]).reset_index(drop=True)

print(df['target'].value_counts())

# Split by class for the resampling cells that follow
df_minority = df[df['target'] == 1]
print(df_minority)
df_majority = df[df['target'] == 0]
print(df_majority)

target
0    8000
1    2000
Name: count, dtype: int64
      feature1  feature2  target
8000 -0.273305  1.961101       1
8001 -0.673125  1.522028       1
8002  0.291849 -0.737003       1
8003 -0.503554  1.255707       1
8004 -0.028202  1.101512       1
...        ...       ...     ...
9995 -1.282189  2.863957       1
9996 -0.091759  0.023590       1
9997  0.570734  0.914088       1
9998 -0.095811  0.047369       1
9999  0.695540 -1.593849       1

[2000 rows x 3 columns]
      feature1  feature2  target
0     0.584918  0.920999       0
1    -1.847740  0.594893       0
2    -0.783089 -1.555339       0
3    -0.595249  0.341117       0
4     1.341287  0.712734       0
...        ...       ...     ...
7995  0.434506 -0.770788       0
7996  0.606495  0.639844       0
7997  0.561406  0.300528       0
7998  1.141498 -0.092981       0
7999 -0.103850  1.628664       0

[8000 rows x 3 columns]


In [51]:
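# Upsampling
from sklearn.utils import resample

# Sample the minority class with replacement until it matches the majority size
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)
print(df_minority_upsampled.shape)
pd.concat([df_majority, df_minority_upsampled])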

(8000, 3)


Unnamed: 0,feature1,feature2,target
0,0.584918,0.920999,0
1,-1.847740,0.594893,0
2,-0.783089,-1.555339,0
3,-0.595249,0.341117,0
4,1.341287,0.712734,0
...,...,...,...
8011,0.084136,-0.234491,1
9447,-0.915306,-0.388701,1
9096,-0.393499,-2.151840,1
8587,1.732261,1.247995,1


In [50]:
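# Downsampling
# Reduce the majority class to the size of the minority class.
# Note: replace=True samples with replacement, so the same majority row can appear
# more than once; replace=False is the usual choice when the goal is to remove rows.
df_majority_downsampled = resample(df_majority, replace=True, n_samples=len(df_minority), random_state=42)
print(df_majority_downsampled.shape)
pd.concat([df_minority, df_majority_downsampled])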

(2000, 3)


Unnamed: 0,feature1,feature2,target
8000,-0.273305,1.961101,1
8001,-0.673125,1.522028,1
8002,0.291849,-0.737003,1
8003,-0.503554,1.255707,1
8004,-0.028202,1.101512,1
...,...,...,...
1500,0.072894,0.762323,0
6737,0.763413,0.953542,0
3294,-0.478911,1.662602,0
5085,0.712954,2.039233,0


**Q5**

- **Data Augmentation**: Data augmentation is a technique used in machine learning to artificially increase the size of a dataset by applying transformations to the existing data, such as rotation, flipping, zooming, or cropping for images. This helps improve model generalization and robustness.

- **SMOTE** (Synthetic Minority Over-sampling Technique): SMOTE is a specific data augmentation technique designed for imbalanced classification problems. It generates synthetic instances of the minority class by interpolating between existing minority class instances. SMOTE helps address the imbalance by introducing diversity into the minority class.
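
The interpolation idea behind SMOTE can be sketched with NumPy alone. This is a simplified illustration, not the imbalanced-learn implementation: the points are hypothetical, the function name `synthesize` is made up for this example, and it always uses the single nearest neighbour, whereas SMOTE typically picks at random among several nearest neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize(points, rng):
    """Create one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))                        # pick a minority point at random
    dists = np.linalg.norm(points - points[i], axis=1)   # distance to every other point
    dists[i] = np.inf                                    # exclude the point itself
    j = np.argmin(dists)                                 # nearest minority neighbour
    lam = rng.random()                                   # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i]), i, j

new_point, i, j = synthesize(minority, rng)
print(new_point)
```

The synthetic point always lies on the line segment between the chosen point and its neighbour, so new samples stay inside the region already occupied by the minority class.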

In [52]:
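from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Assumption: X and y are taken from the synthetic dataframe built in the Q4 cell
X = df[['feature1', 'feature2']]
y = df['target']

# Split first so that synthetic samples never end up in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Oversample only the training portion
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)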


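ImportError: cannot import name '_MissingValues' from 'sklearn.utils._param_validation' (C:\Users\wwwsu\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py)

This error comes from the environment rather than from the SMOTE code itself. It typically indicates that the installed imbalanced-learn release expects a newer scikit-learn than the one present; installing compatible versions of the two packages usually resolves it.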

**Q6**

*Outliers*: Outliers are data points that deviate significantly from the rest of the data in a dataset. They can be exceptionally high or low values compared to the majority of the data.

Importance of Handling Outliers:

- Model Performance: Outliers can unduly influence model training, leading to models that are biased or inaccurate.
- Statistical Analysis: Outliers can distort statistical analyses, affecting measures such as mean and standard deviation.
- Data Interpretation: Outliers can misrepresent the true distribution and patterns in the data.
- Robustness: Removing or transforming outliers can lead to more robust models that generalize better to new data.

Handling outliers can involve techniques such as removing them, transforming them, or using robust statistical methods that are less sensitive to extreme values. The choice of method depends on the specific characteristics of the data and the goals of the analysis.
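
As a minimal sketch of these options, the example below flags outliers with the interquartile range (IQR) rule and then either removes or caps them. The data are hypothetical, and the IQR rule with a 1.5 multiplier is an assumption chosen for illustration; it is one common convention among several.

```python
import numpy as np

# Hypothetical measurements with one extreme value
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# The mean is pulled toward the extreme value, the median much less so
print(np.mean(values), np.median(values))

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

cleaned = values[~is_outlier]              # option 1: remove flagged points
clipped = np.clip(values, lower, upper)    # option 2: cap values at the fences

print(values[is_outlier], cleaned.mean(), clipped.max())
```

Removal discards information, while capping keeps every row but changes the extreme values, so the better option depends on whether the outliers are errors or genuine observations.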

**Q7: Handling Missing Data in Customer Data Analysis**

- **Imputation**: Fill in missing values with estimated or calculated values based on the rest of the data. This could be the mean or median for numerical data, the mode for categorical data, or predictive models for more complex scenarios.
- **Deletion**: Remove rows or columns with missing data. This should be done cautiously to ensure that it doesn't introduce bias into your analysis.
- **Interpolation**: Estimate missing values by interpolating between existing values.
- **Machine Learning Models**: Train models to predict missing values based on other features in the dataset.
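
A short pandas sketch of the first three options follows. The `customers` dataframe and its column names are hypothetical, and linear interpolation only makes sense here under the assumption that row order is meaningful (for example, rows sorted by time).

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with gaps
customers = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "plan": ["basic", "pro", None, "basic"],
    "monthly_spend": [10.0, np.nan, 30.0, 40.0],
})

filled = customers.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())        # numeric: median
filled["plan"] = filled["plan"].fillna(filled["plan"].mode()[0])    # categorical: mode
filled["monthly_spend"] = filled["monthly_spend"].interpolate()     # linear interpolation

# Deletion: keep only the rows with no missing values
dropped = customers.dropna()

print(filled)
print(len(dropped))
```

Comparing `len(dropped)` with `len(customers)` shows how much data deletion costs, which is often the deciding factor between deleting and imputing.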

**Q8: Strategies for Determining Pattern in Missing Data**

- **Statistical Tests**: Conduct statistical tests to determine if missing data is correlated with other variables.
- **Visualization**: Create visualizations, such as heatmaps, to identify patterns in missing data.
- **Imputation with Indicators**: Create an indicator variable to flag missing values and include it in the analysis to account for potential patterns.
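
The indicator approach can be sketched as follows. The `survey` dataframe is hypothetical and constructed so that income is missing more often in one group; comparing missing rates across groups is a simple first check before running a formal test.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is missing more often for the "online" channel
survey = pd.DataFrame({
    "channel": ["online"] * 4 + ["phone"] * 4,
    "income": [np.nan, np.nan, np.nan, 52.0, 48.0, 61.0, np.nan, 55.0],
})

# Indicator variable: 1 where income is missing, 0 otherwise
survey["income_missing"] = survey["income"].isna().astype(int)

# Missing rate per group; a large gap suggests missingness depends on the channel
missing_rate = survey.groupby("channel")["income_missing"].mean()
print(missing_rate)
```

The same indicator column can be kept as a feature after imputation so that a model can still use the fact that the value was originally absent.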

**Q9: Handling Imbalanced Medical Diagnosis Dataset**

- **Resampling**: Either oversample the minority class or undersample the majority class to balance the dataset.
- **Synthetic Data Generation**: Generate synthetic data for the minority class using techniques like SMOTE (Synthetic Minority Over-sampling Technique).
- **Adjusting Class Weights**: In machine learning algorithms, assign higher weights to the minority class to make the model more sensitive to it.
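
As a sketch of the class-weight idea, the snippet below computes weights that are inversely proportional to class frequency. The labels are hypothetical, and the formula is the "balanced" heuristic `n_samples / (n_classes * count)`, which is the same rule scikit-learn applies for `class_weight="balanced"`.

```python
import numpy as np

# Hypothetical diagnosis labels: 0 = healthy (majority), 1 = disease (minority)
y = np.array([0] * 950 + [1] * 50)

counts = np.bincount(y)                       # number of samples per class
weights = len(y) / (len(counts) * counts)     # inverse-frequency ("balanced") weights
class_weight = {cls: float(w) for cls, w in enumerate(weights)}

print(class_weight)
```

A dictionary of this shape can be passed to estimators that accept a `class_weight` argument, so that errors on the rare class cost more during training.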

**Q10: Balancing a Customer Satisfaction Dataset with Downsampling**

- **Downsampling**: Randomly remove samples from the majority class to balance the class distribution.
- **Stratified Sampling**: Ensure that the sample taken from the majority class is representative of the distribution of the majority class in the original dataset.
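
The two points above can be combined in one sketch: downsample the majority class while keeping its internal proportions. The dataframes, the `region` column, and the class sizes are hypothetical and chosen so that the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical data: 800 satisfied (majority) vs 200 unsatisfied (minority) customers
majority = pd.DataFrame({
    "region": ["north"] * 600 + ["south"] * 200,
    "satisfied": 1,
})
minority = pd.DataFrame({
    "region": ["north"] * 120 + ["south"] * 80,
    "satisfied": 0,
})

# Fraction of the majority class to keep so both classes end up the same size
frac = len(minority) / len(majority)

# Sample the same fraction within each region, without replacement
majority_down = majority.groupby("region").sample(frac=frac, random_state=42)

balanced = pd.concat([majority_down, minority]).reset_index(drop=True)
print(balanced["satisfied"].value_counts())
print(majority_down["region"].value_counts(normalize=True))
```

Sampling within each region keeps the north/south mix of the majority class unchanged, which a plain random draw would only preserve approximately.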

**Q11: Balancing a Dataset with Upsampling for a Rare Event**

- **Upsampling**: Replicate or synthetically generate instances of the minority class to balance the dataset.
- **SMOTE (Synthetic Minority Over-sampling Technique)**: Generate synthetic instances of the minority class to increase its representation.
- **Ensemble Techniques**: Use ensemble methods like bagging or boosting, which can handle imbalanced datasets by combining multiple models.