# Data Cleaning with Python

## What is Python?

Python is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability and has a clear and expressive syntax, making it easy to learn and write code in.


Python is a versatile programming language that can be used in various domains and applications. Here are some common areas where Python is widely used:

1. Web Development: Python offers several frameworks for web development, such as Django and Flask. These frameworks provide tools and libraries to build robust and scalable web applications.

2. Data Science and Analytics: Python has become a popular choice for data analysis, machine learning, and scientific computing. Libraries like NumPy, pandas, and scikit-learn provide powerful tools for data manipulation, analysis, and modeling.

3. Scripting and Automation: Python's simple syntax and extensive standard library make it an excellent choice for writing scripts and automating repetitive tasks. It is commonly used for tasks like system administration, file processing, and task scheduling.

5. Web Scraping: Python's libraries, such as Beautiful Soup and Scrapy, make it easy to extract data from websites. Web scraping allows you to gather information from websites automatically.

6. Scientific Computing: Python is widely used in scientific computing and simulation. Libraries like SciPy and matplotlib provide tools for scientific computations, numerical analysis, and data visualization.

7. Internet of Things (IoT): Python can be used for IoT development due to its simplicity and compatibility with microcontrollers and single-board computers like Raspberry Pi.

8. Game Development: Python offers libraries like Pygame that simplify game development. It is commonly used for creating 2D games and prototyping.


## What is Jupyter Notebook?

Jupyter Notebook is an interactive computing environment that allows you to create and share documents that combine:

- code and output
- visualizations
- explanatory text.

It provides an interface where you can write and execute code, view the output, and add text explanations or visualizations alongside the code.

You can think of Jupyter Notebook as a virtual notebook or a digital document that lets you write and run code in small chunks called **cells**. Each cell can contain code or text, and you can execute the code in a cell to see its output or results.


Jupyter Notebook consists of several important components that contribute to its functionality and versatility. The key components of Jupyter Notebook are:

1. **Notebook Interface:** The notebook interface is the main user interface where you interact with Jupyter Notebook. It is typically accessed through a web browser and provides an interactive environment for creating, editing, and executing notebooks.

2. **Code Cells:** Code cells are specifically used to write and execute code. You can enter Python (or other supported languages) code in these cells and run them to see the output or results.

3. **Markdown Cells:** Markdown cells allow you to write text explanations, format them using Markdown syntax, and include headings, lists, links, images, and more. They are useful for documenting your code, providing explanations, or creating reports.

4. **Kernel:** The kernel is a computational engine that executes the code within the notebook. Each notebook is associated with a kernel, which can be Python or another supported language. The kernel handles the execution of code cells and stores the variables and state of the notebook.

5. **Output Area:** The output area displays the output generated by code cells. It can show text output, error messages, tables, plots, and other visualizations. The output is displayed directly below the corresponding code cell.

6. **Toolbar:** The toolbar is a set of icons and buttons that provide quick access to various functions and actions in Jupyter Notebook. It allows you to perform actions such as saving the notebook, adding or deleting cells, changing cell types, running cells, and more.

8. **File Management:** Jupyter Notebook provides features for creating, opening, saving, and managing notebooks. It allows you to organize your notebooks into directories, rename them, and export them in different formats such as HTML, PDF, or Python script.

These components work together to provide an interactive and flexible environment for writing and running code, documenting analyses, and creating interactive reports in Jupyter Notebook. They contribute to its popularity and usefulness for data science, research, and collaborative work.


## What is Pandas?

**Pandas** is a popular open-source **data manipulation** and **analysis** library for the Python programming language. It provides **data structures** and **functions** that make it easy to work with structured data, such as tabular data and time series data. Pandas is built on top of the NumPy library, which adds support for fast **numerical operations** on arrays.

The main data structures in Pandas are the **DataFrame** and the **Series**. 

- A DataFrame is a two-dimensional table-like data structure, similar to a spreadsheet or a SQL table. It consists of rows and columns, where each column can have a different data type (e.g., numbers, strings, dates).

- A Series, on the other hand, is a one-dimensional labeled array that can hold any data type.

Pandas provides a wide range of functions and methods for data manipulation, including 
- Reading data in various formats (such as CSV, Excel, and SQL databases)
- Writing data in various formats (such as CSV, Excel, and SQL databases),
- Filtering data 
- Selecting data
- Handling missing values
- Merging and joining datasets
- Performing statistical calculations.

With its intuitive and expressive syntax, Pandas has become a go-to tool for data scientists, analysts, and developers working with data in Python. It simplifies many common data manipulation tasks and allows for efficient data exploration, transformation, and analysis.

```
pip install pandas
```

In [1]:
# import pandas 
import pandas as pd 

## Load the dataset

When working with pandas, loading your dataset is made simple. For CSV files, you can utilize the `read_csv()` function, while `read_excel()` is ideal for Excel files.

These functions enable you to effortlessly import your dataset into a pandas DataFrame, providing a convenient starting point for data analysis and manipulation. By leveraging these functions, you can easily access and work with your data using pandas' powerful capabilities.

In [2]:
# load data 

data_path = 'data/employee_information.csv'

data = pd.read_csv(data_path)

In [3]:
# check the shape of the dataset
data.shape 

(20010, 18)

In [5]:
# check sample of the dataset
data.tail(10)

Unnamed: 0,Employee ID,First Name,Last Name,Date of Birth,Gender,Email,Contact Number,Address,Nationality,Employment Start Date,Department,Position,Salary,Work Schedule,Employee Type,Emergency Contact,Bank Account Details,Employee Benefits
20000,370027,Jennifer,Cruz,28/02/2001,Male,evansamanda@example.com,-1209,"85680 Karen Vista Apt. 598\nWatkinston, FM 52942",Saint Vincent and the Grenadines,08/07/2021,Marketing,Administrator,7261681.0,8 AM - 4 PM,Part-time,Nicholas Moore,8725509675,Retirement Plan
20001,296403,Christopher,Sanchez,06/09/1994,Male,christopherwilliams@example.com,673-649-1615,"64295 Darlene Rue Suite 291\nNew David, KY 63377",Wallis and Futuna,04/01/2016,Marketing,,5994257.0,8 AM - 4 PM,Part-time,Jeremy Miller,5524968452,Paid Time Off
20002,580467,William,Jacobs,02/08/1991,Male,torr@example.com,001-823-890-8924x05805,"6117 Nunez Bypass\nEast Michaelberg, WI 85619",British Virgin Islands,14/10/2018,Marketing,Marketing Manager,206401.0,10 AM - 6 PM,Contractor,Mark Woodward,2162847141,Retirement Plan
20003,739022,Michael,Brooks,27/12/1973,Female,thomas50@example.net,001-692-103-1489x0179,"08894 Schultz Mount\nWest Crystal, DE 03377",Colombia,23/12/2021,Finance,Manager,3914044.0,9 AM - 5 PM,Part-time,Stephanie Hunter,5373797953,Health Insurance
20004,580425,Kyle,Bender,16/01/1973,Female,josephmccarthy@example.org,819.040.4733x299,"82922 Kimberly Trail Apt. 428\nPort Lisaton, C...",Nicaragua,31/10/2013,Sales,Manager,8571387.0,9 AM - 5 PM,Consultant,Jamie Craig,7859337537,Health Insurance
20005,364595,Penny,Phillips,18/08/1973,Male,estescharles@example.com,347.122.3642x9387,"819 Kevin Crest Suite 379\nLake Robertburgh, C...",Falkland Islands (Malvinas),30/06/2017,Marketing,Administrator,7974400.0,8 AM - 4 PM,Contractor,Melissa Hendricks,4471193371,Paid Time Off
20006,501192,Lisa,Mathews,06/03/2001,Female,hjones@example.org,001-139-616-9023x5251,"6029 Frazier Loop\nWalshview, MT 22458",Sudan,13/05/2019,IT,Analyst,8714308.0,10 AM - 6 PM,Part-time,Teresa Davis,7966717937,Paid Time Off
20007,200736,Angela,Roberts,18/05/1984,Female,dmeyer@example.org,485646305,"17700 Vargas Meadow\nJacobmouth, GU 52370",Reunion,06/12/2018,Finance,,1366559.0,10 AM - 6 PM,Contractor,Jade Cochran,1830416643,Paid Time Off
20008,720604,Benjamin,Johnson,26/10/1964,Female,nfernandez@example.org,166.216.2871,"PSC 5916, Box 8640\nAPO AA 11052",Hong Kong,01/06/2018,Finance,Supervisor,5404725.0,9 AM - 5 PM,Contractor,Alex Hall,7120958383,Health Insurance
20009,230096,Karen,Spears,13/12/2000,Female,seth90@example.com,+1-522-154-7595x752,"14274 Brandon Path Apt. 472\nNew Thomasland, W...",Uganda,12/02/2020,Human Resources,Marketing Manager,6207492.0,9 AM - 5 PM,Consultant,Joe Jackson,8151291243,Health Insurance


In [6]:
# check information about columns
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20010 entries, 0 to 20009
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Employee ID            20010 non-null  int64  
 1   First Name             20010 non-null  object 
 2   Last Name              20010 non-null  object 
 3   Date of Birth          19963 non-null  object 
 4   Gender                 20010 non-null  object 
 5   Email                  20010 non-null  object 
 6   Contact Number         20010 non-null  object 
 7   Address                20010 non-null  object 
 8   Nationality            19950 non-null  object 
 9   Employment Start Date  20010 non-null  object 
 10  Department             19990 non-null  object 
 11  Position               17588 non-null  object 
 12  Salary                 20002 non-null  float64
 13  Work Schedule          20010 non-null  object 
 14  Employee Type          19963 non-null  object 
 15  Em

## Check Missing values

In [7]:
missing_values = data.isnull().sum() 

missing_values

Employee ID                 0
First Name                  0
Last Name                   0
Date of Birth              47
Gender                      0
Email                       0
Contact Number              0
Address                     0
Nationality                60
Employment Start Date       0
Department                 20
Position                 2422
Salary                      8
Work Schedule               0
Employee Type              47
Emergency Contact           0
Bank Account Details        0
Employee Benefits           0
dtype: int64

## Handling Missing values

### Drop rows with missing values

In [8]:
clean_data = data.dropna()  # Drop rows with any missing values

In [9]:
# check again if there is any missing values
clean_data.isnull().sum()

Employee ID              0
First Name               0
Last Name                0
Date of Birth            0
Gender                   0
Email                    0
Contact Number           0
Address                  0
Nationality              0
Employment Start Date    0
Department               0
Position                 0
Salary                   0
Work Schedule            0
Employee Type            0
Emergency Contact        0
Bank Account Details     0
Employee Benefits        0
dtype: int64

In [10]:
# check the shape after removing missing values
clean_data.shape 

(17432, 18)

### Drop columns with missing values

In [11]:
clean_data = data.dropna(axis=1)  # Drop columns with any missing values


In [12]:
# check the shape 
clean_data.shape 

(20010, 12)

In [13]:
## check sample of the data 
clean_data.head(10)

Unnamed: 0,Employee ID,First Name,Last Name,Gender,Email,Contact Number,Address,Employment Start Date,Work Schedule,Emergency Contact,Bank Account Details,Employee Benefits
0,221306,Tony,Nelson,Male,patrick84@example.org,853.129.0409,"093 Scott Station Suite 146\nDerektown, AR 25954",18/02/2019,8 AM - 4 PM,Phillip Boyd,7091407698,Health Insurance
1,887421,Crystal,Higgins,Female,wrightjohn@example.org,669-067-9325x24895,"72433 Moreno Fall Apt. 987\nEdwardmouth, ND 97508",25/03/2016,9 AM - 5 PM,Joshua Meza,5144379937,Retirement Plan
2,511496,Victoria,Davis,Male,allison08@example.com,(583)744-1249x36310,"8974 Anthony Expressway Apt. 711\nMelissabury,...",09/03/2017,10 AM - 6 PM,Claire Harrison,5080981679,Retirement Plan
3,820488,Jim,Hatfield,Female,kstokes@example.com,(873)242-4227x3364,"6943 Ramirez Islands\nMichaelchester, AS 16116",08/02/2023,10 AM - 6 PM,Steven Miller,1231654822,Health Insurance
4,674569,Kristina,Wallace,Female,pgarza@example.net,(580)197-3353x493,"845 Maxwell Gardens Suite 874\nAntoniomouth, M...",10/06/2018,9 AM - 5 PM,Sara Vasquez,1475767673,Paid Time Off
5,231961,Gregory,Aguilar,Male,kyle15@example.org,001-036-942-1793x7399,"99163 Justin Canyon Apt. 452\nNew Chad, MT 77820",13/06/2015,8 AM - 4 PM,Jacob Medina,6823807966,Health Insurance
6,589515,Elijah,Jackson,Female,christinajohnson@example.net,+1-427-571-4009x24881,"870 Rodney Divide\nNorth Wayne, MS 36589",23/08/2014,8 AM - 4 PM,Cristina Oliver,3941341495,Paid Time Off
7,718105,Kenneth,Delgado,Male,erodriguez@example.org,(983)974-4725x77710,"4617 Wright Manors Apt. 216\nPort Wendyland, G...",22/11/2017,9 AM - 5 PM,Haley Smith,1284793099,Retirement Plan
8,537537,Nicole,Morris,Male,robertsstephen@example.org,001-236-374-8086,"083 Ford Lights Suite 712\nSmithburgh, WA 94384",17/04/2018,10 AM - 6 PM,Mark Avery,7043514545,Paid Time Off
9,142377,Kyle,Watson,Male,heidimiller@example.com,661.885.9319x6812,"068 Cooper Springs Suite 909\nSouth Lisabury, ...",10/10/2016,9 AM - 5 PM,Patrick Smith,7844191204,Paid Time Off


### Drop specific column with missing values

In [18]:
# Remove the 'Position' column
clean_data = data.drop('Position', axis=1)

In [19]:
# show sample data 
clean_data.head(10)

Unnamed: 0,Employee ID,First Name,Last Name,Date of Birth,Gender,Email,Contact Number,Address,Nationality,Employment Start Date,Department,Salary,Work Schedule,Employee Type,Emergency Contact,Bank Account Details,Employee Benefits
0,221306,Tony,Nelson,28/11/1965,Male,patrick84@example.org,853.129.0409,"093 Scott Station Suite 146\nDerektown, AR 25954",Colombia,18/02/2019,Marketing,7145427.0,8 AM - 4 PM,Full-time,Phillip Boyd,7091407698,Health Insurance
1,887421,Crystal,Higgins,23/12/1996,Female,wrightjohn@example.org,669-067-9325x24895,"72433 Moreno Fall Apt. 987\nEdwardmouth, ND 97508",Paraguay,25/03/2016,Marketing,4548580.0,9 AM - 5 PM,Consultant,Joshua Meza,5144379937,Retirement Plan
2,511496,Victoria,Davis,13/02/1963,Male,allison08@example.com,(583)744-1249x36310,"8974 Anthony Expressway Apt. 711\nMelissabury,...",Slovenia,09/03/2017,Human Resources,3697485.0,10 AM - 6 PM,Full-time,Claire Harrison,5080981679,Retirement Plan
3,820488,Jim,Hatfield,12/11/1994,Female,kstokes@example.com,(873)242-4227x3364,"6943 Ramirez Islands\nMichaelchester, AS 16116",Seychelles,08/02/2023,Sales,9671957.0,10 AM - 6 PM,Full-time,Steven Miller,1231654822,Health Insurance
4,674569,Kristina,Wallace,29/10/1993,Female,pgarza@example.net,(580)197-3353x493,"845 Maxwell Gardens Suite 874\nAntoniomouth, M...",Ethiopia,10/06/2018,Human Resources,4420532.0,9 AM - 5 PM,Consultant,Sara Vasquez,1475767673,Paid Time Off
5,231961,Gregory,Aguilar,08/02/1973,Male,kyle15@example.org,001-036-942-1793x7399,"99163 Justin Canyon Apt. 452\nNew Chad, MT 77820",Iceland,13/06/2015,Human Resources,9996824.0,8 AM - 4 PM,Consultant,Jacob Medina,6823807966,Health Insurance
6,589515,Elijah,Jackson,01/07/1963,Female,christinajohnson@example.net,+1-427-571-4009x24881,"870 Rodney Divide\nNorth Wayne, MS 36589",Bosnia and Herzegovina,23/08/2014,Finance,6180839.0,8 AM - 4 PM,Part-time,Cristina Oliver,3941341495,Paid Time Off
7,718105,Kenneth,Delgado,08/11/1968,Male,erodriguez@example.org,(983)974-4725x77710,"4617 Wright Manors Apt. 216\nPort Wendyland, G...",Qatar,22/11/2017,IT,5594508.0,9 AM - 5 PM,Contractor,Haley Smith,1284793099,Retirement Plan
8,537537,Nicole,Morris,07/02/1987,Male,robertsstephen@example.org,001-236-374-8086,"083 Ford Lights Suite 712\nSmithburgh, WA 94384",,17/04/2018,Finance,9030800.0,10 AM - 6 PM,Full-time,Mark Avery,7043514545,Paid Time Off
9,142377,Kyle,Watson,14/05/1963,Male,heidimiller@example.com,661.885.9319x6812,"068 Cooper Springs Suite 909\nSouth Lisabury, ...",Mauritania,10/10/2016,,6225546.0,9 AM - 5 PM,Contractor,Patrick Smith,7844191204,Paid Time Off


### Drop specific row that has missing values

In [23]:
data.head(10)

Unnamed: 0,Employee ID,First Name,Last Name,Date of Birth,Gender,Email,Contact Number,Address,Nationality,Employment Start Date,Department,Position,Salary,Work Schedule,Employee Type,Emergency Contact,Bank Account Details,Employee Benefits
0,221306,Tony,Nelson,28/11/1965,Male,patrick84@example.org,853.129.0409,"093 Scott Station Suite 146\nDerektown, AR 25954",Colombia,18/02/2019,Marketing,Manager,7145427.0,8 AM - 4 PM,Full-time,Phillip Boyd,7091407698,Health Insurance
1,887421,Crystal,Higgins,23/12/1996,Female,wrightjohn@example.org,669-067-9325x24895,"72433 Moreno Fall Apt. 987\nEdwardmouth, ND 97508",Paraguay,25/03/2016,Marketing,Supervisor,4548580.0,9 AM - 5 PM,Consultant,Joshua Meza,5144379937,Retirement Plan
2,511496,Victoria,Davis,13/02/1963,Male,allison08@example.com,(583)744-1249x36310,"8974 Anthony Expressway Apt. 711\nMelissabury,...",Slovenia,09/03/2017,Human Resources,,3697485.0,10 AM - 6 PM,Full-time,Claire Harrison,5080981679,Retirement Plan
3,820488,Jim,Hatfield,12/11/1994,Female,kstokes@example.com,(873)242-4227x3364,"6943 Ramirez Islands\nMichaelchester, AS 16116",Seychelles,08/02/2023,Sales,Sales Manager,9671957.0,10 AM - 6 PM,Full-time,Steven Miller,1231654822,Health Insurance
4,674569,Kristina,Wallace,29/10/1993,Female,pgarza@example.net,(580)197-3353x493,"845 Maxwell Gardens Suite 874\nAntoniomouth, M...",Ethiopia,10/06/2018,Human Resources,Developer,4420532.0,9 AM - 5 PM,Consultant,Sara Vasquez,1475767673,Paid Time Off
5,231961,Gregory,Aguilar,08/02/1973,Male,kyle15@example.org,001-036-942-1793x7399,"99163 Justin Canyon Apt. 452\nNew Chad, MT 77820",Iceland,13/06/2015,Human Resources,Administrator,9996824.0,8 AM - 4 PM,Consultant,Jacob Medina,6823807966,Health Insurance
6,589515,Elijah,Jackson,01/07/1963,Female,christinajohnson@example.net,+1-427-571-4009x24881,"870 Rodney Divide\nNorth Wayne, MS 36589",Bosnia and Herzegovina,23/08/2014,Finance,Marketing Manager,6180839.0,8 AM - 4 PM,Part-time,Cristina Oliver,3941341495,Paid Time Off
7,718105,Kenneth,Delgado,08/11/1968,Male,erodriguez@example.org,(983)974-4725x77710,"4617 Wright Manors Apt. 216\nPort Wendyland, G...",Qatar,22/11/2017,IT,,5594508.0,9 AM - 5 PM,Contractor,Haley Smith,1284793099,Retirement Plan
8,537537,Nicole,Morris,07/02/1987,Male,robertsstephen@example.org,001-236-374-8086,"083 Ford Lights Suite 712\nSmithburgh, WA 94384",,17/04/2018,Finance,Developer,9030800.0,10 AM - 6 PM,Full-time,Mark Avery,7043514545,Paid Time Off
9,142377,Kyle,Watson,14/05/1963,Male,heidimiller@example.com,661.885.9319x6812,"068 Cooper Springs Suite 909\nSouth Lisabury, ...",Mauritania,10/10/2016,,,6225546.0,9 AM - 5 PM,Contractor,Patrick Smith,7844191204,Paid Time Off


In [24]:
clean_data = data.drop(9)

In [25]:
# show sample data 
clean_data.head(10)

Unnamed: 0,Employee ID,First Name,Last Name,Date of Birth,Gender,Email,Contact Number,Address,Nationality,Employment Start Date,Department,Position,Salary,Work Schedule,Employee Type,Emergency Contact,Bank Account Details,Employee Benefits
0,221306,Tony,Nelson,28/11/1965,Male,patrick84@example.org,853.129.0409,"093 Scott Station Suite 146\nDerektown, AR 25954",Colombia,18/02/2019,Marketing,Manager,7145427.0,8 AM - 4 PM,Full-time,Phillip Boyd,7091407698,Health Insurance
1,887421,Crystal,Higgins,23/12/1996,Female,wrightjohn@example.org,669-067-9325x24895,"72433 Moreno Fall Apt. 987\nEdwardmouth, ND 97508",Paraguay,25/03/2016,Marketing,Supervisor,4548580.0,9 AM - 5 PM,Consultant,Joshua Meza,5144379937,Retirement Plan
2,511496,Victoria,Davis,13/02/1963,Male,allison08@example.com,(583)744-1249x36310,"8974 Anthony Expressway Apt. 711\nMelissabury,...",Slovenia,09/03/2017,Human Resources,,3697485.0,10 AM - 6 PM,Full-time,Claire Harrison,5080981679,Retirement Plan
3,820488,Jim,Hatfield,12/11/1994,Female,kstokes@example.com,(873)242-4227x3364,"6943 Ramirez Islands\nMichaelchester, AS 16116",Seychelles,08/02/2023,Sales,Sales Manager,9671957.0,10 AM - 6 PM,Full-time,Steven Miller,1231654822,Health Insurance
4,674569,Kristina,Wallace,29/10/1993,Female,pgarza@example.net,(580)197-3353x493,"845 Maxwell Gardens Suite 874\nAntoniomouth, M...",Ethiopia,10/06/2018,Human Resources,Developer,4420532.0,9 AM - 5 PM,Consultant,Sara Vasquez,1475767673,Paid Time Off
5,231961,Gregory,Aguilar,08/02/1973,Male,kyle15@example.org,001-036-942-1793x7399,"99163 Justin Canyon Apt. 452\nNew Chad, MT 77820",Iceland,13/06/2015,Human Resources,Administrator,9996824.0,8 AM - 4 PM,Consultant,Jacob Medina,6823807966,Health Insurance
6,589515,Elijah,Jackson,01/07/1963,Female,christinajohnson@example.net,+1-427-571-4009x24881,"870 Rodney Divide\nNorth Wayne, MS 36589",Bosnia and Herzegovina,23/08/2014,Finance,Marketing Manager,6180839.0,8 AM - 4 PM,Part-time,Cristina Oliver,3941341495,Paid Time Off
7,718105,Kenneth,Delgado,08/11/1968,Male,erodriguez@example.org,(983)974-4725x77710,"4617 Wright Manors Apt. 216\nPort Wendyland, G...",Qatar,22/11/2017,IT,,5594508.0,9 AM - 5 PM,Contractor,Haley Smith,1284793099,Retirement Plan
8,537537,Nicole,Morris,07/02/1987,Male,robertsstephen@example.org,001-236-374-8086,"083 Ford Lights Suite 712\nSmithburgh, WA 94384",,17/04/2018,Finance,Developer,9030800.0,10 AM - 6 PM,Full-time,Mark Avery,7043514545,Paid Time Off
10,580085,Susan,Brooks,20/03/1994,Male,morganmichael@example.org,433-593-6547x027,"45693 Matthew Rue Apt. 459\nJasonburgh, OK 66629",Germany,16/01/2017,Finance,Administrator,6441004.0,9 AM - 5 PM,,Trevor Green,7079279430,Retirement Plan


### Fill missing values with specific value

In [26]:
data['Nationality'] =  data['Nationality'].fillna("no_country_selected")  # Replace missing values with specific value.


In [27]:
data.head(20)

Unnamed: 0,Employee ID,First Name,Last Name,Date of Birth,Gender,Email,Contact Number,Address,Nationality,Employment Start Date,Department,Position,Salary,Work Schedule,Employee Type,Emergency Contact,Bank Account Details,Employee Benefits
0,221306,Tony,Nelson,28/11/1965,Male,patrick84@example.org,853.129.0409,"093 Scott Station Suite 146\nDerektown, AR 25954",Colombia,18/02/2019,Marketing,Manager,7145427.0,8 AM - 4 PM,Full-time,Phillip Boyd,7091407698.0,Health Insurance
1,887421,Crystal,Higgins,23/12/1996,Female,wrightjohn@example.org,669-067-9325x24895,"72433 Moreno Fall Apt. 987\nEdwardmouth, ND 97508",Paraguay,25/03/2016,Marketing,Supervisor,4548580.0,9 AM - 5 PM,Consultant,Joshua Meza,5144379937.0,Retirement Plan
2,511496,Victoria,Davis,13/02/1963,Male,allison08@example.com,(583)744-1249x36310,"8974 Anthony Expressway Apt. 711\nMelissabury,...",Slovenia,09/03/2017,Human Resources,,3697485.0,10 AM - 6 PM,Full-time,Claire Harrison,5080981679.0,Retirement Plan
3,820488,Jim,Hatfield,12/11/1994,Female,kstokes@example.com,(873)242-4227x3364,"6943 Ramirez Islands\nMichaelchester, AS 16116",Seychelles,08/02/2023,Sales,Sales Manager,9671957.0,10 AM - 6 PM,Full-time,Steven Miller,1231654822.0,Health Insurance
4,674569,Kristina,Wallace,29/10/1993,Female,pgarza@example.net,(580)197-3353x493,"845 Maxwell Gardens Suite 874\nAntoniomouth, M...",Ethiopia,10/06/2018,Human Resources,Developer,4420532.0,9 AM - 5 PM,Consultant,Sara Vasquez,1475767673.0,Paid Time Off
5,231961,Gregory,Aguilar,08/02/1973,Male,kyle15@example.org,001-036-942-1793x7399,"99163 Justin Canyon Apt. 452\nNew Chad, MT 77820",Iceland,13/06/2015,Human Resources,Administrator,9996824.0,8 AM - 4 PM,Consultant,Jacob Medina,6823807966.0,Health Insurance
6,589515,Elijah,Jackson,01/07/1963,Female,christinajohnson@example.net,+1-427-571-4009x24881,"870 Rodney Divide\nNorth Wayne, MS 36589",Bosnia and Herzegovina,23/08/2014,Finance,Marketing Manager,6180839.0,8 AM - 4 PM,Part-time,Cristina Oliver,3941341495.0,Paid Time Off
7,718105,Kenneth,Delgado,08/11/1968,Male,erodriguez@example.org,(983)974-4725x77710,"4617 Wright Manors Apt. 216\nPort Wendyland, G...",Qatar,22/11/2017,IT,,5594508.0,9 AM - 5 PM,Contractor,Haley Smith,1284793099.0,Retirement Plan
8,537537,Nicole,Morris,07/02/1987,Male,robertsstephen@example.org,001-236-374-8086,"083 Ford Lights Suite 712\nSmithburgh, WA 94384",no_country_selected,17/04/2018,Finance,Developer,9030800.0,10 AM - 6 PM,Full-time,Mark Avery,7043514545.0,Paid Time Off
9,142377,Kyle,Watson,14/05/1963,Male,heidimiller@example.com,661.885.9319x6812,"068 Cooper Springs Suite 909\nSouth Lisabury, ...",Mauritania,10/10/2016,,,6225546.0,9 AM - 5 PM,Contractor,Patrick Smith,7844191204.0,Paid Time Off


In [28]:
# Convert 'date' column to datetime data type
data['Date of Birth'] = pd.to_datetime(data['Date of Birth'])

data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20010 entries, 0 to 20009
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Employee ID            20010 non-null  int64         
 1   First Name             20010 non-null  object        
 2   Last Name              20010 non-null  object        
 3   Date of Birth          19963 non-null  datetime64[ns]
 4   Gender                 20010 non-null  object        
 5   Email                  20010 non-null  object        
 6   Contact Number         20010 non-null  object        
 7   Address                20010 non-null  object        
 8   Nationality            20010 non-null  object        
 9   Employment Start Date  20010 non-null  object        
 10  Department             19990 non-null  object        
 11  Position               17588 non-null  object        
 12  Salary                 20002 non-null  float64       
 13  W

  data['Date of Birth'] = pd.to_datetime(data['Date of Birth'])


In [29]:
# Interpolate missing values in the 'date' column
data['Date of Birth'] = data['Date of Birth'].fillna('1980-07-01')

In [30]:
data.head(20)

Unnamed: 0,Employee ID,First Name,Last Name,Date of Birth,Gender,Email,Contact Number,Address,Nationality,Employment Start Date,Department,Position,Salary,Work Schedule,Employee Type,Emergency Contact,Bank Account Details,Employee Benefits
0,221306,Tony,Nelson,1965-11-28,Male,patrick84@example.org,853.129.0409,"093 Scott Station Suite 146\nDerektown, AR 25954",Colombia,18/02/2019,Marketing,Manager,7145427.0,8 AM - 4 PM,Full-time,Phillip Boyd,7091407698.0,Health Insurance
1,887421,Crystal,Higgins,1996-12-23,Female,wrightjohn@example.org,669-067-9325x24895,"72433 Moreno Fall Apt. 987\nEdwardmouth, ND 97508",Paraguay,25/03/2016,Marketing,Supervisor,4548580.0,9 AM - 5 PM,Consultant,Joshua Meza,5144379937.0,Retirement Plan
2,511496,Victoria,Davis,1963-02-13,Male,allison08@example.com,(583)744-1249x36310,"8974 Anthony Expressway Apt. 711\nMelissabury,...",Slovenia,09/03/2017,Human Resources,,3697485.0,10 AM - 6 PM,Full-time,Claire Harrison,5080981679.0,Retirement Plan
3,820488,Jim,Hatfield,1994-12-11,Female,kstokes@example.com,(873)242-4227x3364,"6943 Ramirez Islands\nMichaelchester, AS 16116",Seychelles,08/02/2023,Sales,Sales Manager,9671957.0,10 AM - 6 PM,Full-time,Steven Miller,1231654822.0,Health Insurance
4,674569,Kristina,Wallace,1993-10-29,Female,pgarza@example.net,(580)197-3353x493,"845 Maxwell Gardens Suite 874\nAntoniomouth, M...",Ethiopia,10/06/2018,Human Resources,Developer,4420532.0,9 AM - 5 PM,Consultant,Sara Vasquez,1475767673.0,Paid Time Off
5,231961,Gregory,Aguilar,1973-08-02,Male,kyle15@example.org,001-036-942-1793x7399,"99163 Justin Canyon Apt. 452\nNew Chad, MT 77820",Iceland,13/06/2015,Human Resources,Administrator,9996824.0,8 AM - 4 PM,Consultant,Jacob Medina,6823807966.0,Health Insurance
6,589515,Elijah,Jackson,1963-01-07,Female,christinajohnson@example.net,+1-427-571-4009x24881,"870 Rodney Divide\nNorth Wayne, MS 36589",Bosnia and Herzegovina,23/08/2014,Finance,Marketing Manager,6180839.0,8 AM - 4 PM,Part-time,Cristina Oliver,3941341495.0,Paid Time Off
7,718105,Kenneth,Delgado,1968-08-11,Male,erodriguez@example.org,(983)974-4725x77710,"4617 Wright Manors Apt. 216\nPort Wendyland, G...",Qatar,22/11/2017,IT,,5594508.0,9 AM - 5 PM,Contractor,Haley Smith,1284793099.0,Retirement Plan
8,537537,Nicole,Morris,1987-07-02,Male,robertsstephen@example.org,001-236-374-8086,"083 Ford Lights Suite 712\nSmithburgh, WA 94384",no_country_selected,17/04/2018,Finance,Developer,9030800.0,10 AM - 6 PM,Full-time,Mark Avery,7043514545.0,Paid Time Off
9,142377,Kyle,Watson,1963-05-14,Male,heidimiller@example.com,661.885.9319x6812,"068 Cooper Springs Suite 909\nSouth Lisabury, ...",Mauritania,10/10/2016,,,6225546.0,9 AM - 5 PM,Contractor,Patrick Smith,7844191204.0,Paid Time Off


In [31]:
#check missing again missing values in the data

data.isnull().sum() 

Employee ID                 0
First Name                  0
Last Name                   0
Date of Birth               0
Gender                      0
Email                       0
Contact Number              0
Address                     0
Nationality                 0
Employment Start Date       0
Department                 20
Position                 2422
Salary                      8
Work Schedule               0
Employee Type              47
Emergency Contact           0
Bank Account Details        0
Employee Benefits           0
dtype: int64

### Filling missing values with Mean

In [33]:
mean_salary = data['Salary'].mean()  # Calculate mean of 'column'

mean_salary

5062614.427957204

In [34]:
 # Replace missing values in 'column' with the mean
data['Salary'] = data['Salary'].fillna(mean_salary) 


In [35]:
#check missing again missing values in the data

data.isnull().sum() 

Employee ID                 0
First Name                  0
Last Name                   0
Date of Birth               0
Gender                      0
Email                       0
Contact Number              0
Address                     0
Nationality                 0
Employment Start Date       0
Department                 20
Position                 2422
Salary                      0
Work Schedule               0
Employee Type              47
Emergency Contact           0
Bank Account Details        0
Employee Benefits           0
dtype: int64

### Filling missing values with Median

In [36]:
# Calculate median of 'column'
median_value = data['Salary'].median()

median_value 

5075534.0

### Filling missing values with Mode

In [38]:
# Calculate mode of 'column'
mode_position = data['Position'].mode()[0]

mode_position

'Marketing Manager'

In [39]:
 # Replace missing values in 'column' with the mode
data['Position'] = data['Position'].fillna(mode_position) 

In [40]:
# Calculate mode of 'column'
mode_department = data['Department'].mode()[0]

mode_department

'Marketing'

In [41]:
 # Replace missing values in 'column' with the mode
data['Department'] = data['Department'].fillna(mode_department) 

In [42]:
# Calculate mode of 'column'
mode_employee_type = data['Employee Type'].mode()[0]

mode_employee_type

'Consultant'

In [43]:
 # Replace missing values in 'column' with the mode
data['Employee Type'] = data['Employee Type'].fillna(mode_department) 

In [44]:
#check again if you have missing values

data.isnull().sum() 

Employee ID              0
First Name               0
Last Name                0
Date of Birth            0
Gender                   0
Email                    0
Contact Number           0
Address                  0
Nationality              0
Employment Start Date    0
Department               0
Position                 0
Salary                   0
Work Schedule            0
Employee Type            0
Emergency Contact        0
Bank Account Details     0
Employee Benefits        0
dtype: int64

In [45]:
# check the shape 
data.shape 

(20010, 18)

## Check duplicates

### Checking for duplicates rows

In [46]:
 # Check for duplicate rows
duplicates = data.duplicated() 


In [47]:
# show duplicates 
duplicates 

0        False
1        False
2        False
3        False
4        False
         ...  
20005     True
20006     True
20007     True
20008     True
20009     True
Length: 20010, dtype: bool

### Counting duplicates rows

In [48]:
# Count the number of duplicate rows
duplicate_count = data.duplicated().sum()  

# show number of duplicates
duplicate_count

10

### Check for duplicates values in a specific columns

In [49]:
# Check for duplicate values in 'column'
email_duplicates = data['Email'].duplicated()

# show duplicates 
email_duplicates 

0        False
1        False
2        False
3        False
4        False
         ...  
20005     True
20006     True
20007     True
20008     True
20009     True
Name: Email, Length: 20010, dtype: bool

In [51]:
duplicate_emails = data[data['Email'].duplicated()]

In [52]:
duplicate_emails.to_csv('data/duplicates_emails.csv',index=False)

### Check for duplicates values based on multiple columns

In [53]:
# Check for duplicate rows based on 'column1' and 'column2'
duplicates = data.duplicated(['Email', 'Contact Number']).sum() 

# show duplicates

duplicates 


10

## Remove Duplicates

### Drop duplicates rows

In [54]:
 # Drop duplicate rows, keeping the first occurrence
clean_data = data.drop_duplicates() 


# check the shape
clean_data.shape 

(20000, 18)

### Drop duplicates rows based on subset of column

In [55]:
 # Drop duplicate rows based on 'Email' and 'Contact Nuber', keeping the first occurrence
data = data.drop_duplicates(subset=['Email', 'Contact Number']) 

# check the shape
data.shape 

(20000, 18)

## Rename column

In [56]:
# rename column
data = data.rename(columns={'Contact Number': 'Phone Number'})


# show all column name

data.columns 

Index(['Employee ID', 'First Name', 'Last Name', 'Date of Birth', 'Gender',
       'Email', 'Phone Number', 'Address', 'Nationality',
       'Employment Start Date', 'Department', 'Position', 'Salary',
       'Work Schedule', 'Employee Type', 'Emergency Contact',
       'Bank Account Details', 'Employee Benefits'],
      dtype='object')

In [57]:
# show sample data 
data.head()

Unnamed: 0,Employee ID,First Name,Last Name,Date of Birth,Gender,Email,Phone Number,Address,Nationality,Employment Start Date,Department,Position,Salary,Work Schedule,Employee Type,Emergency Contact,Bank Account Details,Employee Benefits
0,221306,Tony,Nelson,1965-11-28,Male,patrick84@example.org,853.129.0409,"093 Scott Station Suite 146\nDerektown, AR 25954",Colombia,18/02/2019,Marketing,Manager,7145427.0,8 AM - 4 PM,Full-time,Phillip Boyd,7091407698,Health Insurance
1,887421,Crystal,Higgins,1996-12-23,Female,wrightjohn@example.org,669-067-9325x24895,"72433 Moreno Fall Apt. 987\nEdwardmouth, ND 97508",Paraguay,25/03/2016,Marketing,Supervisor,4548580.0,9 AM - 5 PM,Consultant,Joshua Meza,5144379937,Retirement Plan
2,511496,Victoria,Davis,1963-02-13,Male,allison08@example.com,(583)744-1249x36310,"8974 Anthony Expressway Apt. 711\nMelissabury,...",Slovenia,09/03/2017,Human Resources,Marketing Manager,3697485.0,10 AM - 6 PM,Full-time,Claire Harrison,5080981679,Retirement Plan
3,820488,Jim,Hatfield,1994-12-11,Female,kstokes@example.com,(873)242-4227x3364,"6943 Ramirez Islands\nMichaelchester, AS 16116",Seychelles,08/02/2023,Sales,Sales Manager,9671957.0,10 AM - 6 PM,Full-time,Steven Miller,1231654822,Health Insurance
4,674569,Kristina,Wallace,1993-10-29,Female,pgarza@example.net,(580)197-3353x493,"845 Maxwell Gardens Suite 874\nAntoniomouth, M...",Ethiopia,10/06/2018,Human Resources,Developer,4420532.0,9 AM - 5 PM,Consultant,Sara Vasquez,1475767673,Paid Time Off


## Change Datatypes

Pandas provides various data types that can be used to represent and manipulate data efficiently. Here are some commonly used data types in pandas along with their explanations:

1. **object**: The object data type is a general-purpose data type that can hold any Python object. It is often used to represent strings or a mix of different data types within a single column.

2. **int64**: The int64 data type represents integer values with a 64-bit precision. It can hold both positive and negative whole numbers.

3. **float64**: The float64 data type represents floating-point numbers with a 64-bit precision. It can hold both integer and decimal values.

4. **bool**: The bool data type represents boolean values (True or False) indicating logical conditions. It is useful for storing binary or categorical data.

5. **datetime64**: The datetime64 data type represents date and time values with a 64-bit precision. It is particularly useful for handling temporal data and performing time-based operations.

6. **timedelta**: The timedelta data type represents differences or durations between two datetime values. It allows for calculations involving time intervals.

7. **category**: The category data type represents categorical or discrete values with a fixed set of possible values. It is memory-efficient and can be useful for columns with a limited number of unique values.

These are some of the commonly used data types in pandas. Depending on the specific requirements and nature of the data, pandas also provides other data types such as uint64 (unsigned integers), int32 (32-bit integers), float32 (32-bit floating-point numbers), and more.

In [58]:
 # Convert 'column' to integer data type
data['Salary'] = data['Salary'].astype('int64') 

# check columns information
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 0 to 19999
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Employee ID            20000 non-null  int64         
 1   First Name             20000 non-null  object        
 2   Last Name              20000 non-null  object        
 3   Date of Birth          20000 non-null  datetime64[ns]
 4   Gender                 20000 non-null  object        
 5   Email                  20000 non-null  object        
 6   Phone Number           20000 non-null  object        
 7   Address                20000 non-null  object        
 8   Nationality            20000 non-null  object        
 9   Employment Start Date  20000 non-null  object        
 10  Department             20000 non-null  object        
 11  Position               20000 non-null  object        
 12  Salary                 20000 non-null  int64         
 13  W

In [59]:
data.head() 

Unnamed: 0,Employee ID,First Name,Last Name,Date of Birth,Gender,Email,Phone Number,Address,Nationality,Employment Start Date,Department,Position,Salary,Work Schedule,Employee Type,Emergency Contact,Bank Account Details,Employee Benefits
0,221306,Tony,Nelson,1965-11-28,Male,patrick84@example.org,853.129.0409,"093 Scott Station Suite 146\nDerektown, AR 25954",Colombia,18/02/2019,Marketing,Manager,7145427,8 AM - 4 PM,Full-time,Phillip Boyd,7091407698,Health Insurance
1,887421,Crystal,Higgins,1996-12-23,Female,wrightjohn@example.org,669-067-9325x24895,"72433 Moreno Fall Apt. 987\nEdwardmouth, ND 97508",Paraguay,25/03/2016,Marketing,Supervisor,4548580,9 AM - 5 PM,Consultant,Joshua Meza,5144379937,Retirement Plan
2,511496,Victoria,Davis,1963-02-13,Male,allison08@example.com,(583)744-1249x36310,"8974 Anthony Expressway Apt. 711\nMelissabury,...",Slovenia,09/03/2017,Human Resources,Marketing Manager,3697485,10 AM - 6 PM,Full-time,Claire Harrison,5080981679,Retirement Plan
3,820488,Jim,Hatfield,1994-12-11,Female,kstokes@example.com,(873)242-4227x3364,"6943 Ramirez Islands\nMichaelchester, AS 16116",Seychelles,08/02/2023,Sales,Sales Manager,9671957,10 AM - 6 PM,Full-time,Steven Miller,1231654822,Health Insurance
4,674569,Kristina,Wallace,1993-10-29,Female,pgarza@example.net,(580)197-3353x493,"845 Maxwell Gardens Suite 874\nAntoniomouth, M...",Ethiopia,10/06/2018,Human Resources,Developer,4420532,9 AM - 5 PM,Consultant,Sara Vasquez,1475767673,Paid Time Off


## Remove  characters

In [60]:
 # Remove '/' from 'column' values

data['Employment Start Date'] = data['Employment Start Date'].str.replace('/', '-') 

In [61]:
# show sample

data.head() 

Unnamed: 0,Employee ID,First Name,Last Name,Date of Birth,Gender,Email,Phone Number,Address,Nationality,Employment Start Date,Department,Position,Salary,Work Schedule,Employee Type,Emergency Contact,Bank Account Details,Employee Benefits
0,221306,Tony,Nelson,1965-11-28,Male,patrick84@example.org,853.129.0409,"093 Scott Station Suite 146\nDerektown, AR 25954",Colombia,18-02-2019,Marketing,Manager,7145427,8 AM - 4 PM,Full-time,Phillip Boyd,7091407698,Health Insurance
1,887421,Crystal,Higgins,1996-12-23,Female,wrightjohn@example.org,669-067-9325x24895,"72433 Moreno Fall Apt. 987\nEdwardmouth, ND 97508",Paraguay,25-03-2016,Marketing,Supervisor,4548580,9 AM - 5 PM,Consultant,Joshua Meza,5144379937,Retirement Plan
2,511496,Victoria,Davis,1963-02-13,Male,allison08@example.com,(583)744-1249x36310,"8974 Anthony Expressway Apt. 711\nMelissabury,...",Slovenia,09-03-2017,Human Resources,Marketing Manager,3697485,10 AM - 6 PM,Full-time,Claire Harrison,5080981679,Retirement Plan
3,820488,Jim,Hatfield,1994-12-11,Female,kstokes@example.com,(873)242-4227x3364,"6943 Ramirez Islands\nMichaelchester, AS 16116",Seychelles,08-02-2023,Sales,Sales Manager,9671957,10 AM - 6 PM,Full-time,Steven Miller,1231654822,Health Insurance
4,674569,Kristina,Wallace,1993-10-29,Female,pgarza@example.net,(580)197-3353x493,"845 Maxwell Gardens Suite 874\nAntoniomouth, M...",Ethiopia,10-06-2018,Human Resources,Developer,4420532,9 AM - 5 PM,Consultant,Sara Vasquez,1475767673,Paid Time Off


In [None]:
# Save clean data

In [62]:
data.to_csv('data/clean_employee_information.csv',index=False)