# Employee Salary Data Cleaning
This notebook demonstrates how to clean string data and convert it to a numeric type using pandas. We will process a raw dataset of employee salaries containing currency symbols and abbreviations.

In [6]:
import pandas as pd

### 1. Create Raw Data
We start by creating a DataFrame with employee details. Note that the `Salary` column currently contains strings with "USD" and "k" (e.g., "USD 50k"), which prevents us from performing mathematical calculations.

In [7]:
df = pd.DataFrame({
    "Employee_ID": ["Emp_010", "Emp_011", "Emp_012", "Emp_013", "Emp_014", "Emp_015", "Emp_016"],
    "Name": ["Ali", "Sara", "Ahmed", "Ayesha", "Hassan", "Fatima", "Ali Akbar"],
    "Age": [25, 22, 30, 28, 27, 26, 24],
    "Role": ["Engineer", "Data Scientist", "Manager", "Intern", "Doctor", "Teacher", "Artist"],
    "Salary": ["USD 50k", "USD 75k", "USD 120k", "USD 45k", "USD 60k", "USD 55k", "USD 65k"]
})
df

|   | Employee_ID | Name | Age | Role | Salary |
|---|---|---|---|---|---|
| 0 | Emp_010 | Ali | 25 | Engineer | USD 50k |
| 1 | Emp_011 | Sara | 22 | Data Scientist | USD 75k |
| 2 | Emp_012 | Ahmed | 30 | Manager | USD 120k |
| 3 | Emp_013 | Ayesha | 28 | Intern | USD 45k |
| 4 | Emp_014 | Hassan | 27 | Doctor | USD 60k |
| 5 | Emp_015 | Fatima | 26 | Teacher | USD 55k |
| 6 | Emp_016 | Ali Akbar | 24 | Artist | USD 65k |
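
To make the problem with the raw strings concrete, here is a minimal sketch. It uses a small hypothetical `raw` Series rather than the notebook's `df`, and shows that adding two values joins the text and that pandas cannot parse the values as numbers.

```python
import pandas as pd

# Hypothetical sample mirroring the raw Salary column.
raw = pd.Series(["USD 50k", "USD 75k", "USD 120k"])

# Adding two raw values concatenates text instead of adding amounts.
combined = raw.iloc[0] + raw.iloc[1]
print(combined)  # USD 50kUSD 75k

# pandas cannot parse these values as numbers; with errors="coerce",
# every unparseable entry becomes NaN.
parsed = pd.to_numeric(raw, errors="coerce")
print(parsed.isna().all())  # True
```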


### 2. Data Cleaning: Remove Currency Symbols
First, we strip the text "USD" from the `Salary` column, then trim the space it leaves behind, to begin isolating the numeric values.

In [8]:
df["Salary"] = df["Salary"].str.replace("USD", "", regex=False).str.strip()
df

|   | Employee_ID | Name | Age | Role | Salary |
|---|---|---|---|---|---|
| 0 | Emp_010 | Ali | 25 | Engineer | 50k |
| 1 | Emp_011 | Sara | 22 | Data Scientist | 75k |
| 2 | Emp_012 | Ahmed | 30 | Manager | 120k |
| 3 | Emp_013 | Ayesha | 28 | Intern | 45k |
| 4 | Emp_014 | Hassan | 27 | Doctor | 60k |
| 5 | Emp_015 | Fatima | 26 | Teacher | 55k |
| 6 | Emp_016 | Ali Akbar | 24 | Artist | 65k |
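
As an alternative sketch, `Series.str.removeprefix` (available in recent pandas versions) removes the text only when it appears at the very start of a value, and leaves values without the prefix untouched. The `salary` Series below is a hypothetical sample, not the notebook's `df`.

```python
import pandas as pd

# Hypothetical sample; the last value has no prefix at all.
salary = pd.Series(["USD 50k", "USD 75k", "60k"])

# Remove the literal prefix "USD " (including the trailing space) where present.
without_prefix = salary.str.removeprefix("USD ")
print(without_prefix.tolist())  # ['50k', '75k', '60k']
```

This can be a safer choice than a plain replace when the same text might legitimately appear elsewhere in a value.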


### 3. Formatting and Type Conversion
We need to finalize the cleaning process:
1. **Replace Abbreviation:** Change "k" to "000" (e.g., "50k" becomes "50000").
2. **Convert Data Type:** Use `.astype(int)` to convert the column from a string dtype (shown as `object` or `str`, depending on the pandas version) to an integer dtype so we can perform analysis later.

In [9]:
df["Salary"] = df["Salary"].str.replace("k", "000").astype(int)
df

|   | Employee_ID | Name | Age | Role | Salary |
|---|---|---|---|---|---|
| 0 | Emp_010 | Ali | 25 | Engineer | 50000 |
| 1 | Emp_011 | Sara | 22 | Data Scientist | 75000 |
| 2 | Emp_012 | Ahmed | 30 | Manager | 120000 |
| 3 | Emp_013 | Ayesha | 28 | Intern | 45000 |
| 4 | Emp_014 | Hassan | 27 | Doctor | 60000 |
| 5 | Emp_015 | Fatima | 26 | Teacher | 55000 |
| 6 | Emp_016 | Ali Akbar | 24 | Artist | 65000 |
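
Replacing "k" with "000" works for whole-number values such as "50k", but it would turn a value like "1.5k" into "1.5000". A more defensive sketch is shown below, assuming the data might contain decimals or other suffixes. The `parse_salary` helper and the `MULTIPLIERS` table are hypothetical names introduced for illustration. It extracts the number and the suffix separately and multiplies them.

```python
import pandas as pd

# Hypothetical suffix table; only "k" appears in the notebook's data.
MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}

def parse_salary(series: pd.Series) -> pd.Series:
    """Convert strings such as 'USD 50k' or '1.5k' to integer amounts."""
    # Capture the numeric part and an optional one-letter suffix.
    parts = series.str.extract(r"(?P<number>\d+(?:\.\d+)?)\s*(?P<suffix>[kKmM]?)")
    number = pd.to_numeric(parts["number"])
    factor = parts["suffix"].str.lower().map(MULTIPLIERS)
    return (number * factor).round().astype("int64")

sample = pd.Series(["USD 50k", "USD 120k", "1.5k"])
print(parse_salary(sample).tolist())  # [50000, 120000, 1500]
```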


### 4. Verify Data Types
Finally, we check the DataFrame's information to confirm that the `Salary` column is now correctly recognized as an integer (`int64`).

In [10]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Employee_ID  7 non-null      str  
 1   Name         7 non-null      str  
 2   Age          7 non-null      int64
 3   Role         7 non-null      str  
 4   Salary       7 non-null      int64
dtypes: int64(2), str(3)
memory usage: 412.0 bytes
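
Beyond reading the `info()` output by eye, the dtype can also be checked in code. The sketch below uses a hypothetical `salary` Series holding the same cleaned values as the table above, confirms the dtype with `pd.api.types.is_integer_dtype`, and runs a few calculations that were not possible on the raw strings.

```python
import pandas as pd

# Hypothetical cleaned sample with the same values as the Salary column.
salary = pd.Series([50000, 75000, 120000, 45000, 60000, 55000, 65000], name="Salary")

# Fail fast if the column is not an integer dtype.
assert pd.api.types.is_integer_dtype(salary)

# Numeric operations now behave as expected.
print(salary.sum())             # 470000
print(salary.max())             # 120000
print(round(salary.mean(), 2))  # 67142.86
```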
