Missing values are a common issue in machine learning. 

A value is missing when no data point was recorded for a variable in a given observation. The resulting gaps leave the dataset incomplete and can reduce the accuracy and reliability of your models.

Handling missing values carefully is essential for robust and unbiased results in your machine-learning projects.

# How is a Missing Value Represented in a Dataset?

Missing values in a dataset can be represented in various ways, depending on the source of the data and the conventions used. Here are some common representations:
1. NaN (Not a Number): Many programming languages and data analysis tools represent missing numeric values as NaN. This is the default marker for missing numeric data in libraries like pandas in Python.

2. NULL or None: In databases and some programming languages, missing values are often represented as NULL or None. For instance, in SQL databases, a missing value is typically recorded as NULL.

3. Empty Strings: Sometimes, missing values are denoted by empty strings (""). This is common in text-based data or CSV files where a field might be left blank.

4. Special Indicators: Datasets might use specific indicators like -999, 9999, or other unlikely values to signify missing data. This is often seen in older datasets or specific industries where such conventions were established.

5. Blanks or Spaces: In some cases, particularly in fixed-width text files, missing values might be represented by spaces or blank fields.

Knowing how missing values are encoded in your dataset is crucial for proper data cleaning and preprocessing. A sentinel such as -999 that is never converted to a proper missing value is treated as real data and distorts both summary statistics and model training.
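The representations listed above can usually be normalized to NaN while loading the data. The following is a minimal sketch, assuming a small made-up CSV snippet (the rows and the sentinel values -999 and 9999 are illustrative, not taken from any real dataset):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical raw CSV text; the values are illustrative only.
raw = io.StringIO(
    "city,cgpa,iq\n"
    "New York,6.8,123\n"
    "Chicago,,-999\n"
    "Boston,NULL,110\n"
    "   ,7.1,9999\n"
)

# na_values adds extra markers on top of the ones pandas already treats
# as missing by default (empty fields, "NULL", "NaN", and so on).
df = pd.read_csv(raw, na_values=["-999", "9999"])

# Whitespace-only strings are not recognised automatically,
# so convert them to NaN explicitly.
df["city"] = df["city"].replace(r"^\s*$", np.nan, regex=True)

# Every representation is now counted as missing.
print(df.isna().sum())
```

Declaring sentinels at load time keeps them from ever entering the data as ordinary numbers.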




# How to Handle Missing Data?
Missing data is a common headache in any field that deals with datasets. It can arise for various reasons, from human error during data collection to limitations of data gathering methods. Luckily, there are strategies to address missing data and minimize its impact on your analysis. Here are two main approaches:

- Deletion: This involves removing rows or columns with missing values. This is a straightforward method, but it can be problematic if a significant portion of your data is missing. Discarding too much data can affect the reliability of your conclusions.

- Imputation: This replaces missing values with estimates. There are various imputation techniques, each with its strengths and weaknesses. Here are some common ones:

  1. Mean/Median/Mode Imputation: Replace missing entries with the average (mean), middle value (median), or most frequent value (mode) of the corresponding column. This is a quick and easy approach, but it shrinks the column's variance and can introduce bias if the values are not missing at random.
 
  2. K-Nearest Neighbors (KNN) Imputation: This method finds the closest data points (neighbors) based on the available features and uses their values to estimate the missing value. KNN is useful when you have a lot of data and the missing values are scattered across rows. Because it relies on distances, it works best when the features are on comparable scales.
 
  3. Model-based Imputation: This involves creating a statistical model to predict the missing values based on other features in the data. This can be a powerful technique, but it requires more expertise and can be computationally expensive.
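As a rough illustration of the imputation techniques above, here is a minimal sketch that compares median imputation with KNN imputation. It assumes scikit-learn is installed (the library is not used elsewhere in this notebook) and uses a tiny made-up frame rather than the placement dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with illustrative values only (not from the placement dataset).
toy = pd.DataFrame({
    "cgpa": [6.0, 7.0, np.nan, 8.0],
    "iq": [100.0, 110.0, 112.0, 130.0],
})

# Median imputation: every gap in the column receives the same value.
median_filled = toy["cgpa"].fillna(toy["cgpa"].median())

# KNN imputation: each gap is filled with the average of that column
# over the n_neighbors rows that are closest on the observed features.
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(median_filled)
print(knn_filled)
```

Here the row with the missing cgpa has an iq closest to the first two rows, so KNN averages their cgpa values, while the median ignores iq entirely. In practice the features would usually be scaled first so that no single column dominates the distance.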

# Example: Mean Imputation with pandas

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv(r'Data-Science/placement-dataset.csv')
data.head()

          city  cgpa     iq  placement
0     New York   6.8  123.0          1
1  Los Angeles   5.9  106.0          0
2      Chicago   NaN  121.0          0
3     New York   7.4  132.0          1
4  Los Angeles   5.8  142.0          0


In [4]:
# Count the missing values in each column

data.isnull().sum()

city         0
cgpa         8
iq           4
placement    0
dtype: int64
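Raw counts are easier to judge as a share of the data. The following is a small sketch on a made-up frame (illustrative values, not the placement dataset) showing the fraction of missing values per column and the rows that contain at least one gap:

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values only.
toy = pd.DataFrame({
    "cgpa": [6.8, 5.9, np.nan, 7.4, np.nan],
    "iq": [123.0, np.nan, 121.0, 132.0, 142.0],
    "placement": [1, 0, 0, 1, 0],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column.
missing_share = toy.isnull().mean()

# Rows that have at least one missing value.
rows_with_gaps = toy[toy.isnull().any(axis=1)]

print(missing_share)
print(rows_with_gaps)
```

A high share in one column may argue for dropping that column, while a low share spread over many rows usually favors imputation.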

In [7]:
# Replacing Null values with mean

# for column : cgpa
data['cgpa'] = data['cgpa'].fillna(data['cgpa'].mean())

In [6]:
data['cgpa'].isnull().sum()

0

In [8]:
# for column : iq
data['iq'] = data['iq'].fillna(data['iq'].mean())

In [10]:
data['iq'].isnull().sum()

0

In [11]:
data.isnull().sum()

city         0
cgpa         0
iq           0
placement    0
dtype: int64
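The two column-by-column steps above can also be written as a single call. The sketch below works on a made-up frame (illustrative values only), and the `_was_missing` flag columns are a hypothetical naming choice for recording which entries were imputed:

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values only.
toy = pd.DataFrame({
    "cgpa": [6.0, np.nan, 8.0, 7.0],
    "iq": [100.0, 120.0, np.nan, 110.0],
})
numeric_cols = ["cgpa", "iq"]

filled = toy.copy()

# Record which entries were originally missing (hypothetical column names).
for col in numeric_cols:
    filled[col + "_was_missing"] = toy[col].isnull()

# fillna accepts a Series of per-column fill values, so each column
# is filled with its own mean in one call.
filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Mean imputation keeps the column mean but reduces its spread.
print(toy["cgpa"].std(), filled["cgpa"].std())
```

The drop in standard deviation is the variance shrinkage that makes mean imputation risky when many values are missing.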

# Example: Deleting Rows with Missing Values

In [12]:
df = pd.read_csv(r'Data-Science/placement-dataset.csv')
df.head()

          city  cgpa     iq  placement
0     New York   6.8  123.0          1
1  Los Angeles   5.9  106.0          0
2      Chicago   NaN  121.0          0
3     New York   7.4  132.0          1
4  Los Angeles   5.8  142.0          0


In [13]:
df.isna().sum()

city         0
cgpa         8
iq           4
placement    0
dtype: int64

In [14]:
df.shape

(100, 4)

In [17]:
# Deleting rows which have null values
new_df = df.dropna()

In [18]:
new_df.shape

(88, 4)
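Here the plain `dropna()` call removed 12 of the 100 rows. When that is too costly, `dropna` can be made more selective. The following is a minimal sketch on a made-up frame (illustrative values only) using its `subset`, `thresh`, and `axis` parameters:

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values only.
toy = pd.DataFrame({
    "city": ["New York", "Chicago", "Boston", "Austin"],
    "cgpa": [6.8, np.nan, 7.4, np.nan],
    "iq": [123.0, 121.0, np.nan, np.nan],
})

# Drop a row only if a specific column is missing.
only_cgpa = toy.dropna(subset=["cgpa"])

# Keep rows that have at least two non-missing values.
at_least_two = toy.dropna(thresh=2)

# Drop columns (instead of rows) that contain any missing value.
complete_cols = toy.dropna(axis=1)

print(only_cgpa.shape, at_least_two.shape, complete_cols.shape)
```

Which variant fits depends on whether the incomplete column is needed by the model at all.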