## ML-Assignment3


### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

**Missing values** in a dataset are entries where no data is recorded for one or more variables (features) in certain observations or records. They are often denoted by special markers such as `NaN` (Not a Number) or `NA`, or are simply left blank. Missing data can occur for various reasons, including data collection errors, sensor malfunctions, non-responses in surveys, or information that was never collected.

Handling missing values is essential for several reasons:

1. **Data Quality**: Missing values reduce the amount of usable data and, when the missingness is related to the values themselves, can skew the results of an analysis or a machine learning model.

2. **Algorithm Compatibility**: Many machine learning algorithms cannot handle missing values directly. Therefore, you need to preprocess the data to make it compatible with these algorithms.

3. **Statistical Analysis**: Summary statistics and tests computed only on the observed values can be biased and have less statistical power, which affects conclusions and decisions based on the data.

4. **Data Visualization**: Missing values can interfere with data visualization, making it challenging to accurately represent the dataset's characteristics.

5. **Ethical and Legal Considerations**: In some cases, handling missing data may be necessary for compliance with ethical or legal requirements, such as data protection regulations.
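
To make the compatibility and statistics points concrete, here is a minimal sketch with NumPy. The `readings` array is a hypothetical example, not data from this assignment.

```python
import numpy as np

# Hypothetical sensor readings; one value was never recorded.
readings = np.array([20.5, 21.0, np.nan, 19.5])

# A plain mean propagates the NaN, so the summary statistic is lost.
plain_mean = np.mean(readings)

# np.nanmean skips NaN entries and averages only the observed values.
observed_mean = np.nanmean(readings)

print(plain_mean)
print(observed_mean)
```

A single missing entry is enough to turn the ordinary mean into `nan`, which is why most pipelines either remove or fill missing values before any computation.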

Some algorithms can tolerate missing values, although the level of support depends on the specific implementation:

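1. **Decision Trees**: Tree-based methods can handle missing values when the implementation supports it. CART (Classification and Regression Trees) can use surrogate splits, and gradient-boosting libraries such as XGBoost and LightGBM learn a default branch direction for missing values at each split. Random Forests inherit this behavior only if the underlying tree implementation provides it.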

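2. **k-Nearest Neighbors (k-NN)**: Standard k-NN needs complete feature vectors to compute distances, so it is not robust to missing values out of the box. It can be adapted by using a distance that skips missing coordinates, and the same neighbor idea is widely used to impute missing values from the most similar rows (k-NN imputation).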

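3. **Naive Bayes**: Because Naive Bayes treats features as conditionally independent given the class, a missing feature can be left out of the probability product, both when estimating per-feature likelihoods and when predicting. Whether a given library does this automatically varies.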

4. **Anomaly Detection Algorithms**: Tree-based detectors such as Isolation Forest can be extended to route missing values the way decision trees do, but common implementations of Isolation Forest and One-Class SVM expect complete inputs, so check the library before relying on this.

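5. **Matrix Factorization Techniques**: Factorization methods used in recommendation systems, such as Alternating Least Squares (ALS) and SVD-style latent factor models, are fitted only on the observed entries of a sparse rating matrix, so unobserved entries do not have to be filled in first. A classical Singular Value Decomposition (SVD), by contrast, requires a complete matrix.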
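
As an illustration of how a distance-based method can be adapted to tolerate missing values, the sketch below implements a NaN-aware Euclidean distance and a tiny k-NN classifier in NumPy. The function names `nan_euclidean` and `knn_predict` and the toy data are assumptions made for this example; this is one possible adaptation, not the behavior of any particular library's k-NN.

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over coordinates observed in both vectors,
    rescaled to compensate for the coordinates that were skipped."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = ~np.isnan(a) & ~np.isnan(b)
    if not observed.any():
        return np.nan  # no shared coordinates, so the distance is undefined
    sq = np.sum((a[observed] - b[observed]) ** 2)
    # Weight = total coordinates / coordinates actually compared.
    return np.sqrt(sq * a.size / observed.sum())

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training rows closest to x."""
    dists = np.array([nan_euclidean(row, x) for row in X_train])
    nearest = np.argsort(dists)[:k]  # NaN distances sort to the end
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical two-feature data with a few missing entries.
X_train = np.array([
    [1.0, 2.0],
    [1.2, np.nan],
    [0.9, 2.1],
    [8.0, 9.0],
    [np.nan, 8.5],
    [7.5, 9.2],
])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_train, y_train, np.array([1.1, np.nan]))
pred_high = knn_predict(X_train, y_train, np.array([8.0, 9.0]))
print(pred_low, pred_high)
```

The rescaling keeps distances comparable between pairs that share many coordinates and pairs that share few. Rows with no coordinate in common with the query get an undefined distance and are never chosen as neighbors.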



### Q2: List the techniques used to handle missing data. Give an example of each with Python code.

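The cells that follow demonstrate row-wise and column-wise deletion on the seaborn `taxis` dataset. Which technique is appropriate depends on why the data is missing, which is described by three missingness mechanisms:

1. Missing completely at random (MCAR): the chance that a value is missing is unrelated to any data, observed or unobserved.
2. Missing at random (MAR): the chance that a value is missing depends only on other observed variables.
3. Missing not at random (MNAR): the chance that a value is missing depends on the missing value itself.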
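
The three mechanisms can be simulated on synthetic data to see how they differ. This is a minimal sketch: the `age` and `income` columns, the probabilities, and the cutoffs are all invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
data = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 10% chance of being missing.
mcar = data.copy()
mcar.loc[rng.random(n) < 0.10, "income"] = np.nan

# MAR: missingness depends on an observed column (age), not on income itself.
mar = data.copy()
p_mar = np.where(data["age"] > 60, 0.30, 0.05)
mar.loc[rng.random(n) < p_mar, "income"] = np.nan

# MNAR: missingness depends on the hidden value itself (high incomes are withheld).
mnar = data.copy()
p_mnar = np.where(data["income"] > 65_000, 0.40, 0.02)
mnar.loc[rng.random(n) < p_mnar, "income"] = np.nan

# Compare how many values are missing and what the observed mean looks like.
for name, frame in [("full", data), ("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, frame["income"].isna().sum(), round(frame["income"].mean(), 1))
```

Under MNAR the observed mean income falls below the true mean because high values are the ones that disappear. This kind of bias is not removed by deleting the incomplete rows.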

In [1]:
import seaborn as sns
import numpy as np
import pandas as pd

In [2]:
df = sns.load_dataset("taxis")

In [3]:
df.head()

| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.6 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.0 | 0.0 | 9.3 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.7 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.1 | 0.0 | 13.4 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |


In [4]:
## Check for missing values (True marks a missing entry)
df.isnull()

| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6428 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 6429 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 6430 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 6431 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |


In [5]:
df.isnull().sum()

pickup              0
dropoff             0
passengers          0
distance            0
fare                0
tip                 0
tolls               0
total               0
color               0
payment            44
pickup_zone        26
dropoff_zone       45
pickup_borough     26
dropoff_borough    45
dtype: int64

In [6]:
df.isnull().sum().max()

45

In [7]:
df.shape

(6433, 14)
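
Raw counts are easier to judge as proportions: 45 missing values out of 6433 rows is about 0.7% of a column. The sketch below computes the percentage of missing values per column and the number of affected rows. It uses a small stand-in frame called `sample` with made-up missing entries so that it runs without downloading the dataset; the same calls can be applied to `df`.

```python
import pandas as pd

# Small stand-in frame (hypothetical values) so the snippet runs on its own.
sample = pd.DataFrame({
    "fare": [7.0, 5.0, 7.5, 27.0],
    "payment": ["credit card", "cash", None, "credit card"],
    "pickup_zone": ["Lenox Hill West", None, None, "Hudson Sq"],
})

# isnull() gives booleans, so the column mean is the fraction of missing values.
missing_pct = sample.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Number of rows that contain at least one missing value.
rows_with_missing = sample.isnull().any(axis=1).sum()
print(rows_with_missing)
```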

In [8]:
## Row-wise deletion: drop every row (data point) that contains a missing value

df.dropna().shape

(6341, 14)

In [9]:
## dropna() returns a new DataFrame, so the original df is unchanged
df.shape

(6433, 14)
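
Row-wise deletion here removes 92 of the 6433 rows (6433 - 6341), roughly 1.4% of the data. When dropping every incomplete row is too aggressive, `dropna` can be restricted with its `subset`, `thresh`, and `how` parameters. The sketch below uses a small hypothetical frame named `sample` rather than the taxis data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with 3, 1, 0, and 2 observed values respectively.
sample = pd.DataFrame({
    "fare": [7.0, 5.0, np.nan, 27.0],
    "payment": ["credit card", None, None, None],
    "pickup_zone": ["Lenox Hill West", None, None, "Hudson Sq"],
})

# Drop a row only if a column that matters for the analysis is missing.
by_subset = sample.dropna(subset=["payment"])

# Keep rows that have at least 2 non-missing values.
by_thresh = sample.dropna(thresh=2)

# Drop a row only when every value in it is missing.
not_all_missing = sample.dropna(how="all")

print(len(by_subset), len(by_thresh), len(not_all_missing))
```

None of these calls modify `sample`; each returns a new DataFrame, which is the same reason `df.shape` is unchanged above.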

In [10]:
## Column-wise deletion: drop every column that contains a missing value
df.dropna(axis=1)

| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green |
| 6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green |
| 6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green |
| 6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green |
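
Column-wise deletion with `dropna(axis=1)` removes a column even if only one value in it is missing, which discards five columns here although each is missing fewer than 1% of its values. A gentler variant, sketched below, drops only columns whose missing fraction exceeds a cutoff. The frame `sample` and the cutoff `max_missing` are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical frame: "payment" is 20% missing, "notes" is 80% missing.
sample = pd.DataFrame({
    "fare": [7.0, 5.0, 7.5, 27.0, 9.0],
    "payment": ["credit card", "cash", None, "credit card", "cash"],
    "notes": [None, None, None, "left bag", None],
})

max_missing = 0.5  # hypothetical cutoff: tolerate up to 50% missing per column

# Fraction of missing values in each column.
missing_fraction = sample.isnull().mean()

# Keep only the columns at or below the cutoff.
keep = missing_fraction[missing_fraction <= max_missing].index
reduced = sample[keep]

print(list(reduced.columns))
print(list(sample.dropna(axis=1).columns))
```

With this cutoff the mostly empty `notes` column is removed while `payment` is kept, whereas plain `dropna(axis=1)` would keep only `fare`.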
