
## Aim:
To preprocess and prepare data (data cleaning, imputation, and label encoding) using NumPy and Pandas in Python for effective analysis and modelling.

---
## Software:
Google Colab, Python libraries (NumPy, Pandas)

---
## Theory:
 Real-world data is often messy, incomplete, or inconsistent. To ensure accurate and reliable results from machine learning models or statistical analysis, it is essential to preprocess the data. This involves several key steps:
1. **Data Cleaning:**
The process of identifying and correcting (or removing) corrupt or inaccurate records from a dataset.

  **Common Tasks:**
  *   Removing duplicate rows.
  *   Dropping irrelevant or redundant columns (e.g., IDs, timestamps).
  *   Removing rows or columns with too many missing values.
  *   Correcting data types or inconsistent formatting.


2. **Imputation (Handling Missing Values):**
Imputation is the process of replacing missing or null values with substitute values so that the dataset can be used effectively for modeling.

  **Techniques:**
  *  **Numerical Data:** Replace with a statistical measure such as the mean, median, or mode. The median is less sensitive to outliers than the mean.
  *  **Categorical Data:** Use the mode (most frequent value).
  *  More advanced techniques include KNN imputation and regression imputation, but basic imputation is often sufficient for initial preprocessing.

3. **Label Encoding (Categorical Variable Conversion):**

  Label encoding converts categorical values (e.g., 'Male', 'Female') into numerical values (e.g., 0, 1), which are necessary for most machine learning algorithms.
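The three steps above can be sketched on a small toy table. This is a minimal illustration, not the notebook's own pipeline: the frame `raw` and its column names (`emp_id`, `gender`, `rating`, `score`) are hypothetical, and using pandas category codes for the encoding is one option among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
raw = pd.DataFrame({
    "emp_id": [1, 2, 2, 3, 4],
    "gender": ["m", "f", "f", None, "m"],
    "rating": [5.0, 3.0, 3.0, np.nan, 3.0],
    "score": ["49", "60", "60", "50", "73"],  # numbers stored as text
})

# 1. Data cleaning: drop exact duplicate rows, drop the ID column, fix a dtype.
clean = raw.drop_duplicates().drop(columns=["emp_id"])
clean["score"] = clean["score"].astype(int)

# 2. Imputation: median for the numeric column, mode for the categorical one.
clean["rating"] = clean["rating"].fillna(clean["rating"].median())
clean["gender"] = clean["gender"].fillna(clean["gender"].mode()[0])

# 3. Label encoding: categories are sorted, then numbered from 0 ("f" -> 0, "m" -> 1).
clean["gender_code"] = clean["gender"].astype("category").cat.codes

print(clean)
```

Integer codes imply an ordering, so for categories without a natural order this encoding is usually best suited to models that do not read the codes as magnitudes.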

### Importance of Data Preparation:
- Helps reduce errors in model predictions caused by missing, duplicated, or inconsistent records.
- Improves model accuracy and generalization.
- Ensures compatibility with modeling libraries (e.g., scikit-learn, TensorFlow).
---

**NumPy** provides efficient numerical operations and is commonly used behind the scenes in Pandas operations.

**Pandas** offers high-level data structures (like DataFrame) and functions that make data cleaning and preparation much more convenient.
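As a rough illustration of how the two libraries relate, the sketch below converts a pandas column to a NumPy array and applies a NumPy function to a Series. The values are made up for the example and are not read from any file.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not read from the dataset file.
ages = pd.Series([35, 30, 34, 39, 45], name="age")

arr = ages.to_numpy()            # the column's values as a NumPy array
print(type(arr).__name__)        # ndarray
print(arr.mean(), ages.mean())   # the two means agree

# NumPy functions can be applied to a Series and return a Series.
roots = np.sqrt(ages)

# Missing numeric values are represented by NumPy's NaN.
ratings = pd.Series([5.0, np.nan, 3.0])
print(ratings.isna().tolist())
```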

---

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Load the employee promotion dataset and display it
df = pd.read_csv('emp_promotion.csv')
df

| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
| 3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
| 4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54803 | 3030 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
| 54804 | 74592 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
| 54805 | 13918 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
| 54806 | 13614 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |


In [None]:
df.shape

(54808, 14)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           54808 non-null  int64  
 1   department            54808 non-null  object 
 2   region                54808 non-null  object 
 3   education             52399 non-null  object 
 4   gender                54808 non-null  object 
 5   recruitment_channel   54808 non-null  object 
 6   no_of_trainings       54808 non-null  int64  
 7   age                   54808 non-null  int64  
 8   previous_year_rating  50684 non-null  float64
 9   length_of_service     54808 non-null  int64  
 10  KPIs_met >80%         54808 non-null  int64  
 11  awards_won?           54808 non-null  int64  
 12  avg_training_score    54808 non-null  int64  
 13  is_promoted           54808 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.9+ MB


In [None]:
df.isna().sum()

| Column | Missing count |
|---|---|
| employee_id | 0 |
| department | 0 |
| region | 0 |
| education | 2409 |
| gender | 0 |
| recruitment_channel | 0 |
| no_of_trainings | 0 |
| age | 0 |
| previous_year_rating | 4124 |
| length_of_service | 0 |
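Raw counts are easier to judge as a share of all rows. With 54808 rows, the counts above work out to roughly 4.4% missing for `education` and 7.5% for `previous_year_rating`. A minimal sketch of computing such percentages, using a hypothetical toy frame rather than the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [5.0, np.nan, np.nan, 3.0],
    "age": [35, 30, 34, 39],
})

# isna() gives a boolean frame; the column mean of booleans is the missing fraction.
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```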


In [None]:
# Check for duplicates
df.duplicated().sum()

np.int64(0)
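`df.duplicated()` with no arguments flags only rows that repeat in every column. As a hedged sketch, the check can also be restricted to a key column through the `subset` parameter. The toy frame below is hypothetical and only borrows column names from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "employee_id": [101, 102, 102, 103],
    "department": ["Sales", "Ops", "Ops", "Tech"],
    "age": [35, 30, 31, 45],
})

# No row repeats in every column, so the full-row check finds nothing.
full_dupes = toy.duplicated().sum()

# Restricting the check to the ID column exposes the repeated ID 102.
id_dupes = toy.duplicated(subset="employee_id").sum()

# Keep the first row for each ID and drop later repeats.
deduped = toy.drop_duplicates(subset="employee_id", keep="first")
print(full_dupes, id_dupes, len(deduped))
```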

In [None]:
# Fill missing ratings with the most frequent rating (the mode)
rating = df['previous_year_rating'].mode()
df['previous_year_rating'] = df['previous_year_rating'].fillna(rating[0])

In [None]:
# Fill missing education levels with the most frequent level (the mode)
abhyaas = df['education'].mode()
df['education'] = df['education'].fillna(abhyaas[0])
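The `[0]` index in the cells above is needed because `Series.mode()` always returns a Series: several values can tie for most frequent, and they come back sorted. The sketch below shows this on a hypothetical toy frame, then confirms that no missing values remain after filling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "previous_year_rating": [5.0, 3.0, np.nan, 3.0, 5.0, np.nan],
    "education": ["Bachelor's", None, "Bachelor's",
                  "Master's & above", None, "Bachelor's"],
})

# 3.0 and 5.0 each appear twice, so mode() returns both, sorted ascending.
modes = toy["previous_year_rating"].mode()
print(modes.tolist())

# Fill each column with the first of its modes, as in the cells above.
for col in ["previous_year_rating", "education"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Every count should now be zero.
print(toy.isna().sum())
```

When there is a tie, `[0]` picks the smallest tied value, which may or may not be the preferred fill value for a given column.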