#### Methods to Handle Missing Values:
1. **Imputation** - filling missing values with a statistical value such as the mean, median, or mode
2. **Dropping** - removing the rows that contain missing values

* In ML and data science, we use datasets to train our ML models.
* Once the model is trained on the dataset, it can make predictions on new data.
* Before feeding the dataset into the model, we need to clean the data, which includes handling missing values.

#### Importing the libraries

*  Pandas for loading the data into a DataFrame, which is a structured table.
*  Matplotlib for creating figures.
*  Seaborn for statistical plots such as histograms.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Loading the dataset to a Pandas DataFrame:

In [2]:
dataset = pd.read_csv('/content/Placement_Dataset.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/content/Placement_Dataset.csv'

#### Checking the first 5 rows of the dataset

In [None]:
dataset.head()

#### Columns Description:
- **sl_no**: Serial Number  
- **gender**: Gender  
- **ssc_p**: Secondary School Certificate Percentage (or 10th Grade Percentage)  
- **ssc_b**: Secondary School Certificate Board (or 10th Grade Board)  
- **hsc_p**: Higher Secondary Certificate Percentage (or 12th Grade Percentage)  
- **hsc_b**: Higher Secondary Certificate Board (or 12th Grade Board)  
- **hsc_s**: Higher Secondary Certificate Stream (or 12th Grade Stream, e.g., Science, Commerce, Arts)  
- **degree_p**: Degree Percentage (or Undergraduate Percentage)  
- **degree_t**: Degree Type (or Undergraduate Degree Type, e.g., B.Tech, B.Com, B.Sc)  
- **workex**: Work Experience (whether the candidate has prior work experience)  
- **etest_p**: Employability Test Percentage (or Entrance Test Percentage)  
- **specialisation**: Specialisation (e.g., MBA specialisation like Marketing, Finance)  
- **mba_p**: MBA Percentage (or Postgraduate Percentage)  
- **status**: Placement Status (whether the candidate is placed or not)  
- **salary**: Salary (offered to the candidate, if placed)  

#### Finding the rows and columns in the dataset

In [None]:
dataset.shape

#### Finding the missing values in each column

In [None]:
dataset.isnull().sum()

* There are 67 missing values in the `salary` column.
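The per-column count can also be expressed as a share of rows, which helps when deciding between imputing and dropping. The sketch below uses a small hypothetical DataFrame (`toy`), not the placement dataset, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed", "Not Placed", "Placed"],
    "salary": [270000.0, np.nan, 250000.0, np.nan, 300000.0],
})

# Number of missing values in each column.
missing_count = toy.isnull().sum()

# Percentage of rows that are missing in each column.
missing_share = toy.isnull().mean() * 100

print(missing_count)
print(missing_share.round(1))
```

Here `isnull().mean()` works because `True` counts as 1 and `False` as 0, so the mean of the boolean mask is the fraction of missing rows.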

#### Central Tendencies:
1. **Mean**: Average of all the values.
2. **Median**: Middle value when the values are sorted.
3. **Mode**: Most frequently occurring value.
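As a minimal sketch of the three measures, the example below computes them on a hypothetical series (`values`) that contains one large outlier. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical values with one large outlier (1000).
values = pd.Series([200, 250, 250, 300, 1000])

mean_value = values.mean()            # sum of values divided by their count
median_value = values.median()        # middle value of the sorted series
mode_value = values.mode().iloc[0]    # mode() returns a Series; take the first entry

print(mean_value, median_value, mode_value)
```

The single outlier pulls the mean well above the median and mode, which both stay at the typical value.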

#### Analyzing the **symmetry** of the dataset to determine which measure of central tendency to use for filling missing values.

In [None]:
# Plot the salary distribution on the axes created by plt.subplots
fig, ax = plt.subplots(figsize=(8, 8))
sns.histplot(dataset['salary'], ax=ax)
plt.show()

* Most salaries are around 250 thousand.
* The data is right-skewed.
   * Right-skewed (positive skewness) → Tail is longer on the right.
   * Left-skewed (negative skewness) → Tail is longer on the left.
   * Symmetric (e.g., normal distribution) → Balanced on both sides.
* The data contains outliers, and the mean is sensitive to outliers, so the mean is not a good choice here.
* For a skewed distribution, we use either the median or the mode.
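The visual reading of the histogram can be cross-checked numerically. The sketch below assumes a hypothetical right-skewed series (`salaries`) rather than the real column; `Series.skew()` returns a positive number for a right-skewed sample, and the mean sits above the median.

```python
import pandas as pd

# Hypothetical right-skewed salaries: most values are near 250k, one is far larger.
salaries = pd.Series([200000, 220000, 250000, 250000, 260000, 300000, 940000])

skewness = salaries.skew()   # sample skewness; positive means a longer right tail

print("skewness:", skewness)
print("mean:", salaries.mean())
print("median:", salaries.median())
```

On the real data, the same check would be `dataset['salary'].skew()`, which skips missing values by default.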



### Imputation Method:

#### Replace the missing values with Median value

In [None]:
dataset['salary'] = dataset['salary'].fillna(dataset['salary'].median())

#### Checking Missing value:

In [None]:
dataset.isnull().sum()

* All the missing values in the `salary` column have been filled with the median value.
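Once values are imputed, it is no longer visible which rows were originally missing. One possible variation, sketched below on a hypothetical DataFrame (`toy`), records that information in an extra column before filling. The column name `salary_was_missing` is an assumption made for this example, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing salaries.
toy = pd.DataFrame({"salary": [200000.0, np.nan, 250000.0, np.nan, 900000.0]})

# Remember which rows were missing before they are overwritten.
toy["salary_was_missing"] = toy["salary"].isnull()

# median() skips NaN values by default, so it uses only the three known salaries.
median_salary = toy["salary"].median()
toy["salary"] = toy["salary"].fillna(median_salary)

print(toy)
```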

### Dropping Method:

In [None]:
salary_dataset = pd.read_csv('/content/Placement_Dataset.csv')

In [None]:
salary_dataset.shape

In [None]:
salary_dataset.isnull().sum()

#### Dropping the missing values:

In [None]:
salary_dataset = salary_dataset.dropna(how = 'any')

In [None]:
salary_dataset.isnull().sum()

In [None]:
salary_dataset.shape
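Comparing the shape before and after `dropna` shows how many rows were lost. The sketch below uses a hypothetical two-column DataFrame (`toy`) and also shows the `subset` parameter, which limits the check to chosen columns so that rows are not dropped because of gaps elsewhere.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
toy = pd.DataFrame({
    "mba_p": [58.8, 66.3, np.nan, 59.4],
    "salary": [270000.0, np.nan, 250000.0, 300000.0],
})

# Drop a row if any of its columns is missing.
dropped_any = toy.dropna(how="any")

# Drop a row only if its salary is missing.
dropped_salary = toy.dropna(subset=["salary"])

print(toy.shape, dropped_any.shape, dropped_salary.shape)
```

With `how="any"`, half of the toy rows are removed, while `subset=["salary"]` keeps the row whose only gap is in `mba_p`.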