# Step 1: Load the Data
- I loaded the dataset from the file `messy_data.csv` into a pandas DataFrame. The initial view of the data shows a mix of date formats, missing values, and potentially incorrect salary values.
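
A minimal sketch of the loading step. To stay runnable on its own, it reads a small in-memory sample rather than `messy_data.csv`; the sample rows and the choice of columns are assumptions for illustration, not the real data.

```python
import io

import pandas as pd

# Hypothetical stand-in for messy_data.csv (illustrative rows only).
sample_csv = io.StringIO(
    "Name,Join Date,Salary\n"
    "Alice,2021-01-15,52000\n"
    ",2021/02/20,\n"
)

df = pd.read_csv(sample_csv)

# First look: preview the rows and count missing values per column.
print(df.head())
print(df.isna().sum())
```

With the real file, the call would be `pd.read_csv("messy_data.csv")`.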

# Step 2: Handle Missing Values in the 'Join Date' Column
- The 'Join Date' column had missing values, and the date format was inconsistent (both `-` and `/` were used as separators).
- I used `pd.to_datetime()` to convert the dates to a standard format and `errors='coerce'` to convert invalid entries to `NaT`.
- Missing and invalid dates (`NaT`) were then filled with `'Unknown'` so that every row has an entry.
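
One way this step could look, as a sketch rather than the exact code used. It assumes year-first dates and uses made-up sample values. The separator is normalized before parsing because recent pandas versions infer a single format from the first value, so mixed separators could otherwise be coerced to `NaT`.

```python
import pandas as pd

# Hypothetical sample of the 'Join Date' column, not the real data.
df = pd.DataFrame({"Join Date": ["2021-01-15", "2021/02/20", None, "not a date"]})

# Normalize the separator first so every value shares one format.
normalized = df["Join Date"].str.replace("/", "-", regex=False)

# Invalid or missing entries become NaT instead of raising an error.
parsed = pd.to_datetime(normalized, errors="coerce")

# Format valid dates as text, then fill the gaps with the placeholder.
df["Join Date"] = parsed.dt.strftime("%Y-%m-%d").fillna("Unknown")
print(df)
```

Because the placeholder is a string, the resulting column holds text rather than datetime values; keeping the `NaT` entries instead would preserve the datetime type for later date arithmetic.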

# Step 3: Convert the Salary Column to Fixed Ranges
- The 'Salary' column was continuous and needed to be converted into discrete ranges for easier analysis.
- I defined the following salary ranges:
    - 0 to 50k
    - 51k to 100k
    - 101k to 150k
- I used `pd.cut()` to categorize the salary values into these ranges and added a new column `Salary Range` to the DataFrame.
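
The binning described above can be sketched as follows. The sample salaries, the label strings, and the use of `include_lowest=True` (so a salary of exactly 0 lands in the first range) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical salaries, including edge cases at 0, 50k, and above 150k.
df = pd.DataFrame({"Salary": [0, 45_000, 50_000, 75_000, 120_000, 200_000]})

# Bin edges: each bin includes its right edge, e.g. (50k, 100k].
bins = [0, 50_000, 100_000, 150_000]
labels = ["0 to 50k", "51k to 100k", "101k to 150k"]

# include_lowest=True makes the first bin include 0 itself.
df["Salary Range"] = pd.cut(
    df["Salary"], bins=bins, labels=labels, include_lowest=True
)
print(df)
```

Salaries outside the defined bins (above 150k here) get no range and show up as missing values in `Salary Range`, so it may be worth checking for them after binning.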

# Step 4: Handle Missing Values in the 'Name' Field
   - The **'Name'** field contained missing entries.
   - I replaced the missing values with a placeholder value: `'Unknown'`.
   - This ensures that all records have a valid name.
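
A minimal sketch of the placeholder fill, using made-up names:

```python
import pandas as pd

# Hypothetical sample with one missing name.
df = pd.DataFrame({"Name": ["Alice", None, "Bob"]})

# Replace missing names with the placeholder.
df["Name"] = df["Name"].fillna("Unknown")
print(df)
```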

# Step 5: Handle Missing Values in the 'Date' Field
   - The **'Date'** field had missing or invalid date values.
   - I converted all valid dates to a standard format using `pd.to_datetime()`. 
   - For invalid or missing dates, I filled the `NaT` (Not a Time) entries with `'Unknown'` as a placeholder.
   - This ensures consistency in the date field.

# Step 6: Handle Missing Values in the 'Salary' Field
   - The **'Salary'** field had missing values.
   - I replaced the missing salary values with a placeholder, such as `0` or the mean salary, depending on the context.
   - The placeholder helps keep the dataset intact for further analysis without removing records.
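
Both placeholder options can be sketched side by side. The sample values are hypothetical, and which option fits depends on the analysis.

```python
import pandas as pd

# Hypothetical salaries with one missing value.
df = pd.DataFrame({"Salary": [50_000, None, 70_000]})

# Option A: fill with 0.
filled_zero = df["Salary"].fillna(0)

# Option B: fill with the mean of the observed salaries
# (Series.mean() skips missing values by default).
filled_mean = df["Salary"].fillna(df["Salary"].mean())

print(filled_zero.tolist())
print(filled_mean.tolist())
```

Filling with `0` lowers any average computed afterwards, while filling with the mean leaves the average unchanged but hides how many values were actually missing.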


# Step 7: Email Format Correction

## Problem:
The email addresses in the dataset had inconsistent formatting. Some emails were missing the top-level domain (e.g., `user@domain` instead of `user@domain.com`), and others had extra spaces, typos, or incorrect characters.

## Steps Taken:
1. **Remove Leading and Trailing Spaces**: I used the `strip()` function to remove any spaces before or after the email addresses.
2. **Validate Email Format**: I used a regular expression (regex) to check for proper email format. The correct format should include:
    - A string before the `@` symbol (e.g., `user`)
    - The `@` symbol
    - A valid domain name (e.g., `domain.com`)
3. **Fix Invalid Emails**: Invalid emails were replaced with `invalid_email@domain.com` as a placeholder.
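
The steps above can be sketched as follows. The column name `Email`, the sample addresses, and the regex itself are assumptions; the pattern is a deliberately simple check, not a full validator of the email standard. Missing emails are treated as invalid here, which is also an assumption.

```python
import pandas as pd

# Simplified pattern: local part, '@', then a domain with at least one dot.
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

# Hypothetical sample; 'Email' is an assumed column name.
df = pd.DataFrame(
    {"Email": ["  alice@example.com ", "bob@example", "carol.example.com", None]}
)

# 1. Remove leading and trailing spaces.
emails = df["Email"].str.strip()

# 2. Validate: the whole string must match; missing values count as invalid.
is_valid = emails.str.fullmatch(EMAIL_PATTERN, na=False)

# 3. Keep valid emails, replace the rest with the placeholder.
df["Email"] = emails.where(is_valid, "invalid_email@domain.com")
print(df)
```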

# Step 8: Standardize Department Names

1. **Standardize "Sales" Names**:
   - I replaced every department name that starts with "Sales" with plain `Sales`, removing any extra characters.
   - This was done with string matching on the "Sales" prefix, so all such records are labeled consistently.

2. **Handle Other Departments**: 
   - For departments that do not start with "Sales," no changes were made. These department names were left intact.
   
3. **Ensure Consistency**:
   - All records whose department name starts with "Sales" now share one consistent label. Other departments, such as "Engineering", "Support", and "Marketing", keep their original names.
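
A sketch of the prefix-based relabeling. The column name `Department` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample; 'Department' is an assumed column name.
df = pd.DataFrame(
    {"Department": ["Sales_01", "Sales - East", "Engineering", "support", None]}
)

# Flag names that begin with "Sales"; missing values are not flagged.
starts_with_sales = df["Department"].str.startswith("Sales", na=False)

# Overwrite flagged names with "Sales" and leave everything else untouched.
df["Department"] = df["Department"].mask(starts_with_sales, "Sales")
print(df)
```

A plain prefix match would also relabel any unrelated name that happens to begin with "Sales", so it is worth listing the distinct values before and after the change.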

# Step 9: Save the Cleaned Data
- Once the cleaning steps were completed, I saved the cleaned data into a new file called `cleaned_dataset.csv`.
- This file can now be used for further analysis or reporting.
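
A minimal sketch of the save step, with a tiny stand-in DataFrame in place of the cleaned data. Passing `index=False` keeps the row index out of the file, and reading the file back is a quick sanity check.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame.
df = pd.DataFrame(
    {"Name": ["Alice", "Unknown"], "Department": ["Sales", "Engineering"]}
)

# Write without the index column so the file holds only the data columns.
df.to_csv("cleaned_dataset.csv", index=False)

# Read the file back to confirm the rows and columns survived the round trip.
check = pd.read_csv("cleaned_dataset.csv")
print(check)
```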
