# Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Missing values occur in a dataset when no value is stored for a variable in some observations. There are 3 mechanisms:

1. Missing Completely at Random, MCAR:
Missing completely at random (MCAR) is a type of missing data mechanism in which the probability of a value being missing is unrelated to both the observed data and the missing data. In other words, if the data is MCAR, the missing values are randomly distributed throughout the dataset, and there is no systematic reason for why they are missing.

2. Missing at Random MAR:
Missing at Random (MAR) is a type of missing data mechanism in which the probability of a value being missing depends only on the observed data, but not on the missing data itself. In other words, if the data is MAR, the missing values are systematically related to the observed data, but not to the missing data. For example, if men are less likely than women to answer a survey question about weight, and sex is recorded for everyone, the missing weights are MAR.

3. Missing data not at random (MNAR)
It is a type of missing data mechanism where the probability of missing values depends on the value of the missing data itself. In other words, if the data is MNAR, the missingness is not random and is dependent on unobserved or unmeasured factors that are associated with the missing values.
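
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the `age` and `income` columns, the missingness probabilities, and the sample size are all assumptions chosen for the demonstration, not part of any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: age is always observed, income may go missing.
age = rng.integers(20, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"age": age, "income": income})
true_mean = df["income"].mean()

masks = {
    # MCAR: every row has the same 20% chance of losing its income value.
    "MCAR": rng.random(n) < 0.20,
    # MAR: the chance depends only on an observed column (age).
    "MAR": rng.random(n) < np.where(df["age"] > 50, 0.40, 0.05),
    # MNAR: the chance depends on the unobserved value itself (high incomes hide).
    "MNAR": rng.random(n) < np.where(df["income"] > df["income"].median(), 0.40, 0.05),
}

# Mean of the values that remain observed under each mechanism.
observed = {name: df.loc[~mask, "income"].mean() for name, mask in masks.items()}
for name in masks:
    print(f"{name}: observed mean income = {observed[name]:,.0f} (true mean = {true_mean:,.0f})")
```

Under MCAR the observed mean stays close to the true mean, while under MNAR it is pulled down because high incomes are preferentially hidden. In this simulated setup the MAR case is also shifted, since income rises with age, but that shift could be corrected using the observed `age` column.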

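It is important to handle missing values appropriately:

1. Many machine learning algorithms fail if the dataset contains missing values.
2. A model trained on improperly handled missing values may be biased, leading to incorrect results.
3. Missing data reduces the effective sample size, which lowers the precision of the statistical analysis.

Some algorithms can tolerate missing values. Naive Bayes can skip a missing feature when computing class likelihoods, K-nearest neighbours can be adapted to measure distance over the observed features only, and several tree-based implementations (for example XGBoost, LightGBM, and scikit-learn's `HistGradientBoostingClassifier`) handle missing values natively. Support depends on the implementation: many scikit-learn estimators, for instance, reject `NaN` input unless it is imputed first.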

# Q2: List down techniques used to handle missing data. Give an example of each with python code.

There are 2 primary ways of handling missing values:

## 1. Deleting the Missing values
## 2. Imputing the Missing Values

# Deleting the Missing values

This is a quick and dirty technique and is generally not recommended, because it discards data. If the values are Missing Not At Random (MNAR), deleting them biases the remaining sample, so they should not be deleted.
If the values are Missing Completely At Random (MCAR), deletion does not bias the data and only costs sample size. If they are Missing At Random (MAR), deletion can still introduce bias unless the analysis accounts for the observed variables that drive the missingness, so use it with care.


In [1]:
import seaborn as sns


In [2]:
df = sns.load_dataset('titanic')

In [3]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [7]:
# Check the missing values
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

# There are 2 ways one can delete the missing data values:
## Deleting the entire row


In [11]:
df_delnull=df.dropna(axis=0)
df_delnull.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

## Deleting the entire column
If a column has many missing values, you can drop the entire column instead. The cell below drops `embark_town` to show the syntax; in this dataset `deck`, with 688 missing values, would be the more natural candidate:

In [12]:
df_nullCol = df.drop(['embark_town'], axis=1)
df_nullCol.isnull().sum()

survived        0
pclass          0
sex             0
age           177
sibsp           0
parch           0
fare            0
embarked        2
class           0
who             0
adult_male      0
deck          688
alive           0
alone           0
dtype: int64

## 2. Imputing the missing values

### Replacing with the mean
Mean imputation works well when the data is roughly normally distributed (symmetric, without strong outliers).

In [13]:
df['Age_Mean'] = df['age'].fillna(df['age'].mean())

In [16]:
df[['Age_Mean','age']]

Unnamed: 0,Age_Mean,age
0,22.000000,22.0
1,38.000000,38.0
2,26.000000,26.0
3,35.000000,35.0
4,35.000000,35.0
...,...,...
886,27.000000,27.0
887,19.000000,19.0
888,29.699118,
889,26.000000,26.0


### Replacing with the median
When the column contains outliers or is skewed, the median is a better choice for imputation because it is less sensitive to extreme values.

In [17]:
df['age_median'] = df['age'].fillna(df['age'].median())

In [18]:
df[['age_median','age']]

Unnamed: 0,age_median,age
0,22.0,22.0
1,38.0,38.0
2,26.0,26.0
3,35.0,35.0
4,35.0,35.0
...,...,...
886,27.0,27.0
887,19.0,19.0
888,28.0,
889,26.0,26.0


### Replacing with the mode
Mode imputation is used for categorical features.

In [21]:
df[df['embarked'].isnull()]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Age_Mean,age_median
61,1,1,female,38.0,0,0,80.0,,First,woman,False,B,,yes,True,38.0,38.0
829,1,1,female,62.0,0,0,80.0,,First,woman,False,B,,yes,True,62.0,62.0


In [22]:
df['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [23]:
df['embarked_mode'] = df['embarked'].fillna(df['embarked'].mode()[0])

In [24]:
df[['embarked_mode','embarked']]

Unnamed: 0,embarked_mode,embarked
0,S,S
1,C,C
2,S,S
3,S,S
4,S,S
...,...,...
886,S,S
887,S,S
888,S,S
889,C,C


# Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes.

If imbalanced data is not handled, it can lead to biased models and inaccurate predictions. This is because most machine learning algorithms are designed to maximize overall accuracy, which can lead to the majority class being predicted most of the time, and the minority class being overlooked. This is particularly problematic when the minority class is the one that is of interest, such as in fraud detection, disease diagnosis, or rare event prediction. Typical consequences are:

1. The minority class is ignored: When the data is imbalanced, the minority class can be completely ignored by the model, leading to low recall or sensitivity scores. This means that the model fails to identify many positive cases.

2. Biased predictions: The model can be biased towards the majority class, leading to high precision but low recall scores. This means that the model correctly identifies the negative cases, but misses many positive cases.

3. Overfitting: The model can overfit to the majority class, leading to poor generalization performance on new data.
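
The gap between accuracy and recall can be illustrated with a trivial baseline. The following is a minimal sketch using made-up labels (950 negatives, 50 positives) and a "model" that always predicts the majority class; the numbers are assumptions for illustration only.

```python
import numpy as np

# Hypothetical labels: 950 negatives (majority) and 50 positives (minority).
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = (y_pred == y_true).mean()

# Recall: share of actual positives that the model found.
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}")  # 0.95
print(f"recall   = {recall:.2f}")    # 0.00
```

The baseline scores 95% accuracy while detecting none of the positive cases, which is why accuracy alone is misleading on imbalanced data.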


# Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Up-sampling and down-sampling are two common techniques used to handle imbalanced data in machine learning.

Up-sampling involves increasing the number of observations in the minority class to match the number of observations in the majority class. This can be done by randomly duplicating existing observations in the minority class or generating new synthetic observations using techniques like SMOTE (Synthetic Minority Over-sampling Technique).

Down-sampling, on the other hand, involves reducing the number of observations in the majority class to match the number of observations in the minority class. This can be done by randomly removing observations from the majority class.

### Up-sampling
Up-sampling is generally used when the minority class has very few observations compared to the majority class. For example, in fraud detection, the number of fraudulent transactions is typically much lower than the number of legitimate transactions. In such cases, up-sampling can be used to generate additional synthetic fraudulent transactions, which can help the model learn to identify them better.

### Down-sampling
Down-sampling is generally used when the majority class has too many observations, and the model is biased towards it. For example, in medical diagnosis, the number of healthy patients may be much higher than the number of patients with a particular disease. In such cases, down-sampling can be used to balance the dataset and prevent the model from being biased towards the healthy patients.

In [5]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# create a dataframe with 2 classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0
n_class_0, n_class_1

# Create dataframe with imbalanced dataset
class_0 = pd.DataFrame({
    'feature1' : np.random.normal(loc=0, scale=1, size = n_class_0),
    'feature2' : np.random.normal(loc=0, scale=1, size = n_class_0),
    'target'   : [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature1' : np.random.normal(loc=0, scale=1, size = n_class_1),
    'feature2' : np.random.normal(loc=0, scale=1, size = n_class_1),
    'target'   : [1] * n_class_1
})

# Concatenate the 2 datasets
df = pd.concat([class_0, class_1]).reset_index(drop=True)

df.head(10)

Unnamed: 0,feature1,feature2,target
0,-1.085631,0.551302,0
1,0.997345,0.419589,0
2,0.282978,1.815652,0
3,-1.506295,-0.25275,0
4,-0.5786,-0.292004,0
5,1.651437,-0.116932,0
6,-2.426679,-0.102391,0
7,-0.428913,-2.272618,0
8,1.265936,-0.64261,0
9,-0.86674,0.299885,0


In [6]:
# Create dataframe with imbalanced dataset
class_0 = pd.DataFrame({
    'feature1' : np.random.normal(loc=0, scale=1, size = n_class_0),
    'feature2' : np.random.normal(loc=0, scale=1, size = n_class_0),
    'target'   : [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature1' : np.random.normal(loc=0, scale=1, size = n_class_1),
    'feature2' : np.random.normal(loc=0, scale=1, size = n_class_1),
    'target'   : [1] * n_class_1
})

# Concatenate the 2 datasets
df = pd.concat([class_0, class_1]).reset_index(drop=True)




In [13]:
df.head(10)

Unnamed: 0,feature1,feature2,target
0,-1.774224,0.285744,0
1,-1.201377,0.333279,0
2,1.096257,0.531807,0
3,0.861037,-0.354766,0
4,-1.520367,-1.120815,0
5,-0.44744,-0.043793,0
6,0.463487,0.513008,0
7,0.392493,0.511598,0
8,-1.627167,1.730055,0
9,0.26001,-0.499304,0


In [14]:
df.tail(5)


Unnamed: 0,feature1,feature2,target
995,0.677156,-0.907952,1
996,0.963404,-1.818045,1
997,-0.378524,-0.122733,1
998,1.429559,1.794486,1
999,1.532273,-0.32051,1


In [15]:
df['target'].value_counts()


0    900
1    100
Name: target, dtype: int64

In [18]:
## Upsampling

df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

from sklearn.utils import resample
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)
df_minority_upsampled



Unnamed: 0,feature1,feature2,target
951,0.905343,-0.504849,1
992,0.000977,-0.185167,1
914,-0.072043,0.280911,1
971,0.819483,0.964646,1
960,0.456515,-0.166472,1
...,...,...,...
952,-0.233356,-0.467775,1
965,-0.472670,0.182477,1
976,0.463277,-1.204384,1
942,0.930412,-0.932647,1


In [19]:
df_minority_upsampled.shape


(900, 3)

In [21]:
df_minority_upsampled.head()



Unnamed: 0,feature1,feature2,target
951,0.905343,-0.504849,1
992,0.000977,-0.185167,1
914,-0.072043,0.280911,1
971,0.819483,0.964646,1
960,0.456515,-0.166472,1


In [None]:
# concat minority upsampled dataframe and majority dataset
df_upsampled = pd.concat([df_majority,df_minority_upsampled])


In [22]:
df_upsampled['target'].value_counts()

0    900
1    900
Name: target, dtype: int64

## Down Sampling

In [23]:

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())

0    900
1    100
Name: target, dtype: int64


In [24]:
## downsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [25]:
from sklearn.utils import resample

# Sample the majority class without replacement down to the minority class size
df_majority_downsampled = resample(df_majority,
                                   replace=False,
                                   n_samples=len(df_minority),
                                   random_state=42)

In [26]:
df_majority_downsampled

Unnamed: 0,feature_1,feature_2,target
70,0.468439,1.720920,0
827,1.089165,-0.464899,0
231,0.753869,-0.969798,0
588,0.588686,-0.704720,0
39,0.283627,1.012868,0
...,...,...,...
398,-0.168426,0.553775,0
76,-0.403366,0.081491,0
196,-0.269293,0.611238,0
631,-0.295829,0.671673,0


In [27]:
df_majority_downsampled.shape

(100, 3)

In [28]:
df_majority_downsampled.head(5)

Unnamed: 0,feature_1,feature_2,target
70,0.468439,1.72092,0
827,1.089165,-0.464899,0
231,0.753869,-0.969798,0
588,0.588686,-0.70472,0
39,0.283627,1.012868,0


In [29]:
# Concatenate the down-sampled majority dataframe and the minority dataset
df_downsampled = pd.concat([df_majority_downsampled, df_minority])


In [30]:
df_downsampled

Unnamed: 0,feature_1,feature_2,target
70,0.468439,1.720920,0
827,1.089165,-0.464899,0
231,0.753869,-0.969798,0
588,0.588686,-0.704720,0
39,0.283627,1.012868,0
...,...,...,...
995,1.376371,2.845701,1
996,2.239810,0.880077,1
997,1.131760,1.640703,1
998,2.902006,0.390305,1


In [31]:
df_downsampled['target'].value_counts()

0    100
1    100
Name: target, dtype: int64

# Q5: What is data Augmentation? Explain SMOTE.

Data augmentation is a technique used in machine learning to increase the amount of data available for training a model. This is done by creating new, synthetic data points from the existing data set through various transformations, such as rotation, scaling, cropping, flipping, or adding noise. Data augmentation helps to reduce overfitting and improve the generalization capability of the model by providing it with a more diverse and representative set of examples.

SMOTE, which stands for Synthetic Minority Over-sampling Technique, is a specific type of data augmentation that is commonly used in imbalanced classification problems, where the classes are not equally represented in the training data. SMOTE works by generating synthetic examples of the minority class by interpolating between the existing samples. Specifically, it selects a minority class sample and finds its k nearest neighbors in the feature space. It then generates new synthetic examples by randomly selecting one of the k neighbors and interpolating between the two examples. This process is repeated for a specified number of times until the desired level of over-sampling is achieved.
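
The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation from `imblearn`: the function name `smote_sketch`, its parameters, and the random test data are hypothetical, and it assumes purely numeric features and more than `k` minority samples.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))            # pick a minority sample
        nb = rng.choice(neighbors[base])       # pick one of its k nearest neighbours
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[i] = X[base] + gap * (X[nb] - X[base])
    return synthetic

# Hypothetical minority class: 20 points in 2 dimensions.
X_min = np.random.default_rng(1).normal(loc=2.0, scale=1.0, size=(20, 2))
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)  # (30, 2)
```

Because every synthetic point lies on a line segment between two existing minority samples, the new points never leave the region already covered by the minority class.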

# Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are extreme values that differ from most other data points in a dataset. They can have a big impact on statistical analyses and skew the results of hypothesis tests.

It is essential to handle outliers because they can have a significant impact on statistical measures, such as mean, variance, and correlation coefficients, as well as machine learning algorithms, such as linear regression and k-means clustering. Outliers can also affect the performance of classification models by reducing their accuracy and increasing their false positive or false negative rates.
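
A small numeric example shows how strongly a single outlier can move the mean. The values below are made up for illustration, and the 1.5 × IQR rule used to flag the outlier is one common convention among several.

```python
import numpy as np

# Hypothetical measurements; 500 is an obvious outlier.
values = np.array([48, 50, 51, 52, 49, 50, 53, 47, 500], dtype=float)

mean_value = values.mean()
median_value = np.median(values)
print("mean:", mean_value)      # 100.0, pulled far above the typical value
print("median:", median_value)  # 50.0, barely affected

# Flag points outside 1.5 * IQR from the quartiles (Tukey's rule).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print("flagged:", outliers)     # [500.]
```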

# Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?


Handling missing data is an essential step in any data analysis project, as it can affect the accuracy and reliability of the results. Here are some techniques that can be used to handle missing data in customer data analysis:

Deletion: This technique involves removing the rows or columns that contain missing data. If the missing data is limited to only a few observations, then deleting them may not significantly affect the overall analysis. However, if the missing data is substantial, then deleting it can lead to loss of valuable information.

Imputation: This technique involves replacing the missing data with estimated values. Imputation methods can be classified into three categories:

a. Mean/Median/Mode Imputation: In this method, missing values are replaced by the mean, median, or mode value of the non-missing values in the same column.

b. Regression Imputation: In this method, a regression model is used to predict the missing values based on other variables in the dataset.

c. Predictive Modeling: This generalizes regression imputation to any predictive model (for example K-nearest neighbours or a tree-based model). The model is trained on the rows where the value is present and then used to predict the missing values.
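
The imputation options above can be sketched on a toy table. The `customers` dataframe and its column names are hypothetical, and using `age` as the only predictor for regression imputation is an assumption made to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical customer table; column names are illustrative only.
customers = pd.DataFrame({
    "age":     [25, 32, 47, 51, 38, 29, 60, 44],
    "income":  [30.0, 41.0, np.nan, 72.0, 50.0, np.nan, 85.0, 61.0],
    "segment": ["a", "b", "b", None, "a", "b", "b", "a"],
})

# Mean / median imputation for a numeric column, mode for a categorical one.
income_mean = customers["income"].fillna(customers["income"].mean())
income_median = customers["income"].fillna(customers["income"].median())
segment_mode = customers["segment"].fillna(customers["segment"].mode()[0])

# Regression imputation: fit on the complete rows, predict the missing ones.
known = customers["income"].notna()
model = LinearRegression().fit(customers.loc[known, ["age"]],
                               customers.loc[known, "income"])
income_reg = customers["income"].copy()
income_reg.loc[~known] = model.predict(customers.loc[~known, ["age"]])

print(pd.DataFrame({"mean": income_mean, "median": income_median,
                    "regression": income_reg, "segment": segment_mode}))
```

Mean and median imputation give every missing row the same value, whereas the regression variant produces a different estimate per row based on `age`.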



# Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

Determining whether missing data is missing at random or if there is a pattern to the missing data is essential for selecting appropriate strategies to handle the missing data. Here are some strategies that can be used to determine if the missing data is missing at random or if there is a pattern to the missing data:

Visual inspection: One way to check if there is a pattern to the missing data is to create a visualization that shows the distribution of missing values across variables. This can be done using a heatmap or bar chart, where the y-axis represents variables and the x-axis represents observations. If there is a pattern to the missing data, it will be visible in the heatmap or bar chart.

Statistical tests: Statistical tests can be used to determine if the missing data is missing at random or if there is a pattern to the missing data. One popular test is Little's MCAR (Missing Completely at Random) test, whose null hypothesis is that the data is missing completely at random; a small p-value suggests that the missingness follows a systematic pattern.

Implications for imputation: The diagnosed mechanism guides the choice of imputation strategy. If the data is missing completely at random, simple methods such as mean imputation do not bias the column mean, although they understate its variance. If the data is missing at random, regression imputation or multiple imputation that conditions on the observed variables is more appropriate. If the data is missing not at random, the missingness mechanism itself has to be modeled or examined with a sensitivity analysis, because the observed data alone cannot correct the bias.

Domain knowledge: In some cases, domain knowledge can be used to determine if the missing data is missing at random or if there is a pattern to the missing data. For example, if the missing data is related to a specific variable, then it may be due to a measurement error or other external factor that is affecting that variable.
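
A simple first check is to compare the observed variables between rows with and without a missing value. The sketch below uses simulated data in which `income` is deliberately made to go missing more often for older customers; the column names and probabilities are assumptions for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical data in which 'income' goes missing more often for older customers.
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=n),
    "visits": rng.poisson(3, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})
df.loc[rng.random(n) < np.where(df["age"] > 50, 0.30, 0.05), "income"] = np.nan

# 1. Share of missing values per column.
print(df.isnull().mean())

# 2. Compare observed variables between rows with and without missing income.
missing = df["income"].isnull().rename("income_missing")
print(df.groupby(missing)[["age", "visits"]].mean())

# 3. Correlation between the missingness indicator and the observed variables.
corr = df[["age", "visits"]].corrwith(missing.astype(int))
print(corr)
```

A clear difference in `age` between the two groups (and a noticeable correlation with the missingness indicator) argues against MCAR, while `visits`, which is unrelated to the missingness here, shows no such pattern. Note that checks of this kind can only separate MCAR from the alternatives; they cannot distinguish MAR from MNAR, because that would require the unobserved values.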



# Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?


Dealing with imbalanced datasets in machine learning projects is a common problem, and evaluating the performance of a machine learning model on such datasets can be challenging. Here are some strategies that can be used to evaluate the performance of a machine learning model on an imbalanced dataset in a medical diagnosis project:

Confusion matrix: A confusion matrix can be used to evaluate the performance of a model on an imbalanced dataset. The confusion matrix gives the number of true positives, true negatives, false positives, and false negatives. From the confusion matrix, several performance metrics can be derived, such as accuracy, precision, recall, and F1 score. In the case of an imbalanced dataset, accuracy alone may not be a good performance metric, and it may be more useful to look at precision and recall.

ROC Curve: The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for different thresholds of the model output. The area under the curve (AUC) summarizes it in a single measure of the model's performance. When positives are very rare, the precision-recall curve is often more informative, because the FPR can stay small even when false positives outnumber true positives.

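Resampling techniques: Resampling techniques can be used to address the class imbalance problem in the dataset. This involves either oversampling the minority class or undersampling the majority class. Only the training data should be resampled; the model should then be evaluated on an untouched test set that keeps the original class distribution, otherwise the reported metrics will not reflect real-world performance.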

Cost-sensitive learning: In cost-sensitive learning, the cost of misclassifying a minority class is given more weight than misclassifying a majority class. This approach can be used to train the model, and the performance can be evaluated on the original dataset.

Threshold tuning: Threshold tuning involves adjusting the threshold at which the model output is considered a positive prediction. This can be useful for improving the performance of the model on the minority class.
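
Several of these strategies can be combined in one short evaluation script. The following is a sketch on synthetic data, not a recommended pipeline: the dataset, the 5% positive rate, the logistic regression model, and the two thresholds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a diagnosis dataset: about 5% positives.
n = 4_000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, 2)) + 1.5 * y[:, None]   # positives are shifted

# Stratify so the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight="balanced" is one simple form of cost-sensitive learning.
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, proba)
print("ROC AUC:", auc)

# Threshold tuning: a higher threshold trades recall for precision.
recalls = {}
for threshold in (0.5, 0.8):
    y_pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, y_pred)
    print(f"threshold={threshold}")
    print(confusion_matrix(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred, zero_division=0),
          "recall:", recalls[threshold],
          "F1:", f1_score(y_test, y_pred))
```

In a medical setting a missed diagnosis is usually costlier than a false alarm, so the threshold would typically be chosen to keep recall high and the resulting precision accepted as the price.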

# Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?


When dealing with an unbalanced dataset where the majority class dominates the data, there are several techniques that can be employed to balance the dataset and down-sample the majority class. Here are some methods that can be used:

1. Random under-sampling: In random under-sampling, some of the instances of the majority class are randomly removed from the dataset to balance the classes. This method is simple and easy to implement but can lead to information loss, especially if the majority class contains valuable information.


In [None]:
from sklearn.utils import resample

# Down-sample the majority class
df_majority = df[df['satisfaction'] == 'satisfied']
df_minority = df[df['satisfaction'] == 'unsatisfied']

df_majority_downsampled = resample(df_majority, 
                                   replace=False,     
                                   n_samples=len(df_minority),    
                                   random_state=42) 

df_downsampled = pd.concat([df_majority_downsampled, df_minority])


2. Cluster-based under-sampling: In cluster-based under-sampling, the majority class instances are clustered, and the centroids of the clusters are used as representative samples of the majority class. This method can help to reduce the loss of information that occurs in random under-sampling.

In [None]:
from sklearn.cluster import KMeans

# Down-sample the majority class using clustering
df_majority = df[df['satisfaction'] == 'satisfied']
df_minority = df[df['satisfaction'] == 'unsatisfied']

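# Cluster on the feature columns only (assumed numeric); the label column is a string
feature_cols = df.columns.drop('satisfaction')
kmeans = KMeans(n_clusters=len(df_minority), n_init=10, random_state=42).fit(df_majority[feature_cols])

# Each cluster centroid stands in for the majority instances in that cluster
df_majority_downsampled = pd.DataFrame(kmeans.cluster_centers_, columns=feature_cols)
df_majority_downsampled['satisfaction'] = 'satisfied'
df_downsampled = pd.concat([df_majority_downsampled, df_minority], ignore_index=True)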

3. Synthetic minority over-sampling technique (SMOTE): Strictly speaking, SMOTE is an over-sampling method rather than a down-sampling one. It balances the dataset by generating new synthetic minority class instances, interpolating between the existing minority class instances, instead of removing majority instances. This method can be useful for addressing the class imbalance problem, but it can also lead to overfitting if the synthetic instances are too similar to the original minority instances.

In [None]:
from imblearn.over_sampling import SMOTE

# Up-sample the minority class using SMOTE
df_majority = df[df['satisfaction'] == 'satisfied']
df_minority = df[df['satisfaction'] == 'unsatisfied']

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(df.drop('satisfaction', axis=1), df['satisfaction'])
# axis=1 places features and label side by side instead of stacking them vertically
df_resampled = pd.concat([pd.DataFrame(X_resampled, columns=df.columns.drop('satisfaction')), pd.Series(y_resampled, name='satisfaction')], axis=1)


# Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

When dealing with an imbalanced dataset with a minority class, there are several techniques that can be employed to balance the dataset and up-sample the minority class. Here are some methods that can be used:

1. Random over-sampling: In random over-sampling, some of the instances of the minority class are randomly duplicated in the dataset to balance the classes. This method is simple and easy to implement but can lead to overfitting, especially if the minority class contains noisy or irrelevant instances.

In [None]:
from sklearn.utils import resample

# Up-sample the minority class
df_majority = df[df['event'] == 'no_event']
df_minority = df[df['event'] == 'event']

df_minority_upsampled = resample(df_minority, 
                                 replace=True,     
                                 n_samples=len(df_majority),    
                                 random_state=42) 

df_upsampled = pd.concat([df_majority, df_minority_upsampled])


2. Synthetic minority over-sampling technique (SMOTE): SMOTE generates new synthetic minority class instances by interpolating between the existing minority class instances. This method can be useful for addressing the class imbalance problem, but it can also lead to overfitting if the synthetic instances are too similar to the original minority instances.

In [None]:
from imblearn.over_sampling import SMOTE

# Up-sample the minority class using SMOTE
df_majority = df[df['event'] == 'no_event']
df_minority = df[df['event'] == 'event']

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(df.drop('event', axis=1), df['event'])
# axis=1 places features and label side by side instead of stacking them vertically
df_resampled = pd.concat([pd.DataFrame(X_resampled, columns=df.columns.drop('event')), pd.Series(y_resampled, name='event')], axis=1)

3. Adaptive Synthetic Sampling (ADASYN): ADASYN is similar to SMOTE but generates more synthetic instances for the minority samples that are harder to learn, that is, those whose nearest neighbours are mostly majority class instances. This shifts the focus towards the class boundary, although it can also amplify noisy minority samples.


In [None]:
from imblearn.over_sampling import ADASYN

# Up-sample the minority class using ADASYN
df_majority = df[df['event'] == 'no_event']
df_minority = df[df['event'] == 'event']

adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(df.drop('event', axis=1), df['event'])
# axis=1 places features and label side by side instead of stacking them vertically
df_resampled = pd.concat([pd.DataFrame(X_resampled, columns=df.columns.drop('event')), pd.Series(y_resampled, name='event')], axis=1)
