# Assignment 1: Data Parsing, Cleansing and Integration
## Task 1 and 2
#### Student Name: Mrwan Alhandi
#### Student ID: s3969393

Date: 06/08/2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* numpy
* xml.etree.ElementTree
* from sklearn.linear_model import LinearRegression
* from sklearn.model_selection import train_test_split
* from sklearn.metrics import mean_squared_error

## Introduction: Data Preparation for United Kingdom Job Advertisements Dataset

This study addresses a pivotal and labor-intensive aspect of data science: data preparation. The process entails a range of tasks such as data acquisition, formatting, and cleaning to facilitate subsequent analytical activities. The dataset under scrutiny pertains to job advertisements in the United Kingdom.

### Attributes Description
The dataset comprises nine attributes, each serving distinct informational roles:
- **Source, Title, Location, Company, Category**: These attributes are nominal in nature. Syntactical errors such as extraneous whitespace and typographical mistakes were rectified during the cleaning process.
  
- **ContractTime**: This attribute is of a categorical data type. A specific data entry error was identified and subsequently rectified.
  
- **Salary**: This attribute had both syntactical and semantic errors, necessitating data type conversions for accurate analysis.
  
- **OpenDate, CloseDate**: These attributes were initially in an incorrect format, prompting a requisite data type conversion for proper analytical execution.

### Error Mitigation and Missing Value Treatment

#### Missing Value Handling
Once syntactical and semantic errors were addressed, the focus shifted to handling missing values using different techniques based on the nature of each attribute.

- **Source**: Missing values were replaced using the mode (most frequent value) of the Source attribute conditioned on the job Category.

- **Title**: No missing values were observed.

- **Location**: No missing values were observed.

- **Company**:
  1. Replaced missing values with the most frequently occurring Company name within the same Location and Category.
  2. For remaining missing values, used the most frequently occurring Company name within the same Source and Category.

- **ContractTime**:
  1. Replaced missing values with the most frequent ContractTime value within the same Company.
  2. For remaining missing values, used the most frequent ContractTime value within the same Category.

- **Category**: No missing values were observed.

- **Salary**: 
  1. Replaced missing values with the most frequent Salary value within the same Company and Category.
  2. Replaced remaining missing values with the most frequent Salary value within the same Company and ContractTime.
  3. Replaced the remaining missing values with the most frequent Salary value within the same Company and Location.
  4. Replaced the remaining missing values with the most frequent Salary value within the same Company.
  5. Replaced the remaining missing values with the most frequent Salary value within the same Category.

- **OpenDate**: The single missing value was replaced with the OpenDate of the job that had the closest CloseDate.

- **CloseDate**: No missing values were observed.

Through these structured steps, the dataset was refined to a clean state, enabling robust and reliable downstream analyses.
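The cascaded imputation described above can be sketched on a toy DataFrame. This is a minimal illustration, not the notebook's exact code: the helper name `fill_with_group_mode` and the toy values are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def fill_with_group_mode(frame, target, group_cols):
    """Fill missing `target` values with the mode of `target` within each group."""
    def group_mode(values):
        modes = values.mode()  # mode() ignores NaN by default
        return modes.iloc[0] if not modes.empty else np.nan

    # transform returns a Series aligned with the original index
    fallback = frame.groupby(group_cols)[target].transform(group_mode)
    result = frame.copy()
    result[target] = result[target].fillna(fallback)
    return result


# Toy data (hypothetical values, not taken from the assignment dataset)
toy = pd.DataFrame({
    "Company": ["A", "A", "A", "B", "B"],
    "Category": ["IT", "IT", "IT", "IT", "IT"],
    "ContractTime": ["permanent", "permanent", np.nan, np.nan, np.nan],
})

# Step 1: mode within Company; step 2: mode within Category for what is left
step1 = fill_with_group_mode(toy, "ContractTime", ["Company"])
step2 = fill_with_group_mode(step1, "ContractTime", ["Category"])
```

Company B has no observed ContractTime, so the first step leaves its rows missing and the broader Category grouping fills them in the second step.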

## Importing libraries 

In [None]:
# Code to import libraries as you need in this assessment, e.g.,
import pandas as pd
import numpy as np
import xml.etree.ElementTree as etree 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error   


## Task 1. Parsing Data

### 1.1. Examining and loading data

In [None]:
# Code to read in the data from the XML file
tree = etree.parse("./s3969393_dataset1.xml")

In [None]:
# root
root = tree.getroot()     
root.tag

In [None]:
# how many children does root have?
len(root)

In [None]:
# does the root have any attributes?
root.attrib

In [None]:
# does the next level have any attributes?
root[0].attrib

In [None]:
# Check all elements
for elem in tree.iter():
    print (elem.tag, elem.text, elem.attrib)
    print('-------------')

- The root element is called Advertisements.
    - Its children are Source elements, each with a Name attribute that holds a website such as "insurancejobs.co.uk".
        - Each Source contains child elements tagged Row, each carrying an ID attribute.
            - Each Row contains the tags Title, Location, Company, ContractType, ContractTime, Category, Salary, OpenDate and CloseDate.
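As a small illustration of this hierarchy, the sketch below parses a hypothetical two-source sample into one record per Row. The sample XML and its values are invented for the example; only the tag and attribute names (Advertisements, Source, Name, Row, ID) follow the structure described above.

```python
import xml.etree.ElementTree as etree

import numpy as np
import pandas as pd

# Hypothetical two-source sample that mirrors the structure described above
sample_xml = """
<Advertisements>
  <Source Name="insurancejobs.co.uk">
    <Row ID="1"><Title>Claims Handler</Title><Location>Leeds</Location></Row>
    <Row ID="2"><Title>Underwriter</Title><Location></Location></Row>
  </Source>
  <Source Name="jobs.ac.uk">
    <Row ID="3"><Title>Lecturer</Title></Row>
  </Source>
</Advertisements>
"""

sample_root = etree.fromstring(sample_xml)
fields = ["Title", "Location"]

records = []
for source in sample_root.findall("Source"):
    for row in source.findall("Row"):
        record = {"id": row.attrib.get("ID"), "SourceName": source.attrib.get("Name")}
        for field in fields:
            # findtext gives None for a missing child and an empty string for an empty one
            text = row.findtext(field)
            record[field] = text if text else np.nan
        records.append(record)

sample_df = pd.DataFrame(records)
```

Building one dictionary per Row keeps the source name, the row id and the field values together, so the columns cannot drift out of alignment.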

In [None]:
# let us check how many records we have in the dataset

# Use XPath to find all 'Row' elements under 'Source' elements
row_elements = root.findall(".//Source/Row")

# Get the count of 'Row' elements
row_count = len(row_elements)

print(f"Total number of Row records: {row_count}")

### 1.2 Parsing data into the required format

In [None]:
# let us find the SourceName for each row

# Initialize a dictionary to store row counts for each SourceName
SourceName_row_counts = {}

# Iterate through each 'SourceName' in 'Advertisements'
for SourceName in root:
    SourceName_name = SourceName.attrib.get("Name", np.nan)

    # Count the number of 'Row' elements in this 'SourceName'
    row_count = len(SourceName.findall("Row"))

    # Store the row count in the dictionary
    SourceName_row_counts[SourceName_name] = row_count

SourceNames = []
# Repeat each SourceName once per Row so the list lines up with the rows
for SourceName_name, row_count in SourceName_row_counts.items():
    for i in range(row_count):
        SourceNames.append(SourceName_name)

# Check the number of SourceNames - Should be 50753
print(len(SourceNames))

# Check the first 10 SourceNames
print(SourceNames[:10]) # good the appending is as expected

In [None]:
# let us get the row id for each row
row_elements = root.findall(".//Source/Row")

row_ids = []
for row in row_elements:
    row_ids.append(row.attrib.get("ID", np.nan))

# Check the number of rows - Should be 50753
print(len(row_ids))

# Check the first 10 row ids
print(row_ids[:10])

In [None]:
def create_list_attribute(attribute):
    """
    Creates a list of values corresponding to a given XML attribute from 'Row' elements within each 'SourceName' in 'Advertisements'.

    The function iterates through each 'SourceName' in the root ('Advertisements') and then through each 'Row' in each 'SourceName'.
    For each 'Row', the function looks for an element that matches the given attribute name. If found and not empty,
    its text is appended to the list. If not found or empty, np.nan is appended to the list.

    Parameters:
    - attribute (str): The name of the XML attribute to search for within each 'Row'.

    Returns:
    list: A list containing the text of each XML element matching the attribute, or np.nan for missing or empty elements.

    Side Effects:
    - Prints the length of the attribute list.
    - Prints the first 10 items in the attribute list.
    """

    attribute_list = []

    # Iterate through each 'SourceName' in 'Advertisements'
    for SourceName in root:
        # Iterate through each 'Row' in 'SourceName'
        for row in SourceName.findall("Row"):
            attribute_element = row.find(f"{attribute}")
            if attribute_element is not None and attribute_element.text:
                attribute_list.append(attribute_element.text)
            else:
                attribute_list.append(np.nan)

    # Check the number of values - should match the number of Row elements
    print(len(attribute_list))

    # Check the first 10 values
    print(attribute_list[:10])

    return attribute_list

In [None]:
# Create dataframe columns
titles = create_list_attribute("Title")
print("---------------------------------------------")
locations = create_list_attribute("Location")
print("---------------------------------------------")
companies = create_list_attribute("Company")
print("---------------------------------------------")
contract_times = create_list_attribute("ContractTime")
print("---------------------------------------------")
contract_types = create_list_attribute("ContractType")
print("---------------------------------------------")
categories = create_list_attribute("Category")
print("---------------------------------------------")
salaries = create_list_attribute("Salary")
print("---------------------------------------------")
open_dates = create_list_attribute("OpenDate")
print("---------------------------------------------")
close_dates = create_list_attribute("CloseDate")

In [None]:
# let us create a dataframe
df = pd.DataFrame({"id":row_ids, "Title": titles, "Location": locations, "Company": companies,"ContractType": contract_types ,"ContractTime": contract_times, "Category": categories, "Salary": salaries, "OpenDate": open_dates, "CloseDate": close_dates, "SourceName": SourceNames})

df

## Task 2. Auditing and cleansing the loaded data

In [None]:
# create an error recorder (i.e. the erlist)
itemlist = ['indexOfdf',"Id",'ColumnName', 'Orignal', 'Modified', 'ErrorType','Fixing']
erlist = pd.DataFrame(columns=itemlist)
erlist

# update error list by attributes
def updateErlist(indexOfdf, Id, ColumnName, Original, Modified, ErrorType, Fixing):
    errItem = [indexOfdf, Id, ColumnName, Original, Modified, ErrorType, Fixing]
    erlist.loc[len(erlist)] = errItem


In [None]:
# data types
df.dtypes

Attributes should be the following:
- SourceName: string
- Title: string
- Location: string
- Company: string
- ContractType: string
- ContractTime: string
- Category: string
- Salary: float
- OpenDate: datetime
- CloseDate: datetime 
- id: integer

Only Salary, OpenDate, CloseDate and id need to be converted.
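The type conversions listed above can be sketched with `pd.to_numeric` and `pd.to_datetime`. The raw values and the date format string below are assumptions made for illustration; the actual formats in the dataset may differ.

```python
import pandas as pd

# Hypothetical raw values; the real formats in the dataset may differ
raw = pd.DataFrame({
    "Salary": ["25000", "30000.5", "35k", None],
    "OpenDate": ["20130105T000000", "20120310T120000", "20131201T150000", "20130601T000000"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising
salary_numeric = pd.to_numeric(raw["Salary"], errors="coerce")

# Entries that were present but could not be parsed need manual inspection
unparsed = raw.loc[salary_numeric.isna() & raw["Salary"].notna(), "Salary"].tolist()

# An explicit format avoids ambiguous day/month guessing (the format is an assumption)
open_dates = pd.to_datetime(raw["OpenDate"], format="%Y%m%dT%H%M%S")
```

Separating "missing" from "present but unparseable" makes it clear which salaries are data-entry errors and which are genuinely absent.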

In [None]:
# let us change id to integer
df['id'] = df['id'].astype(int)

### SourceName

In [None]:
# value counts
df['SourceName'].value_counts()

In [None]:
# all websites end with .co.uk, .com or .net
# let us use regex to extract those that do not end with any of these
filtered_df = df[~df['SourceName'].str.contains(r'\.(?:com|co\.uk|net)$', regex=True, na=False)]
filtered_df['SourceName'].value_counts()


- Jobcentre Plus: gov.uk/contact-jobcentre-plus | https://en.wikipedia.org/wiki/Jobcentre_Plus
- MyUkJobs: https://myukjob.com/
- GAAPweb: https://www.gaapweb.com/
- Brand Republic Jobs: the actual website could not be found, but this website mentions it: https://www.onrec.com/directory/job-boards/brand-republic-jobs
- eFinancialCareers: https://www.efinancialcareers.co.uk/
- PR Week Jobs: https://www.prweekjobs.co.uk/
- Multilingualvacancies: https://www.multilingualvacancies.com/
- Jobs Ac: https://www.jobs.ac.uk/
- Jobs24: https://jobs24.com/
- ijobs: https://ijobscenter.com/
- jobs.scot.nhs.uk: correct!
- JobSearch: a search did not turn up any website called JobSearch
- JustLondonJobs: https://www.justlondonjobs.com/
- Teaching jobs - TES Connect: https://www.tes.com/en-au
- jobs.perl.org: correct!
- TotallyExec: https://www.totallyexec.com/

In [None]:
# let us replace Jobcentre Plus with gov.uk/contact-jobcentre-plus
df['SourceName'] = df['SourceName'].replace('Jobcentre Plus', 'gov.uk/contact-jobcentre-plus')
indices1 = df.loc[df['SourceName'] == 'gov.uk/contact-jobcentre-plus'].index
# MyUkJobs with myukjob.com
df['SourceName'] = df['SourceName'].replace('MyUkJobs', 'myukjob.com')
indices2 = df.loc[df['SourceName'] == 'myukjob.com'].index
# GAAPweb with gaapweb.com
df['SourceName'] = df['SourceName'].replace('GAAPweb', 'gaapweb.com')
indices3 = df.loc[df['SourceName'] == 'gaapweb.com'].index
# Brand Republic Jobs with onrec.com/directory/job-boards/brand-republic-jobs
df['SourceName'] = df['SourceName'].replace('Brand Republic Jobs', 'onrec.com/directory/job-boards/brand-republic-jobs')
indices4 = df.loc[df['SourceName'] == 'onrec.com/directory/job-boards/brand-republic-jobs'].index
# eFinancialCareers with efinancialcareers.co.uk
df['SourceName'] = df['SourceName'].replace('eFinancialCareers', 'efinancialcareers.co.uk')
indices5 = df.loc[df['SourceName'] == 'efinancialcareers.co.uk'].index
# PR Week Jobs with prweekjobs.co.uk
df['SourceName'] = df['SourceName'].replace('PR Week Jobs', 'prweekjobs.co.uk')
indices6 = df.loc[df['SourceName'] == 'prweekjobs.co.uk'].index
# Multilingualvacancies with multilingualvacancies.com
df['SourceName'] = df['SourceName'].replace('Multilingualvacancies', 'multilingualvacancies.com')
indices7 = df.loc[df['SourceName'] == 'multilingualvacancies.com'].index
# Jobs Ac with jobs.ac.uk
df['SourceName'] = df['SourceName'].replace('Jobs Ac', 'jobs.ac.uk')
indices8 = df.loc[df['SourceName'] == 'jobs.ac.uk'].index
# Jobs24 with jobs24.com
df['SourceName'] = df['SourceName'].replace('Jobs24', 'jobs24.com')
indices9 = df.loc[df['SourceName'] == 'jobs24.com'].index
# ijobs with ijobscenter.com
df['SourceName'] = df['SourceName'].replace('ijobs', 'ijobscenter.com')
indices10 = df.loc[df['SourceName'] == 'ijobscenter.com'].index
# JobSearch with NaN (no matching website was found)
indices11 = df.loc[df['SourceName'] == 'JobSearch'].index
df['SourceName'] = df['SourceName'].replace('JobSearch', np.nan)
# JustLondonJobs with justlondonjobs.com
df['SourceName'] = df['SourceName'].replace('JustLondonJobs', 'justlondonjobs.com')
indices12 = df.loc[df['SourceName'] == 'justlondonjobs.com'].index
# Teaching jobs - TES Connect with tes.com
df['SourceName'] = df['SourceName'].replace('Teaching jobs - TES Connect', 'tes.com')
indices13 = df.loc[df['SourceName'] == 'tes.com'].index
# TotallyExec with totallyexec.com
df['SourceName'] = df['SourceName'].replace('TotallyExec', 'totallyexec.com')
indices14 = df.loc[df['SourceName'] == 'totallyexec.com'].index

In [None]:


# updating errors: (index list, original value, replacement)
source_fixes = [
    (indices1, "Jobcentre Plus", "gov.uk/contact-jobcentre-plus"),
    (indices2, "MyUkJobs", "myukjob.com"),
    (indices3, "GAAPweb", "gaapweb.com"),
    (indices4, "Brand Republic Jobs", "onrec.com/directory/job-boards/brand-republic-jobs"),
    (indices5, "eFinancialCareers", "efinancialcareers.co.uk"),
    (indices6, "PR Week Jobs", "prweekjobs.co.uk"),
    (indices7, "Multilingualvacancies", "multilingualvacancies.com"),
    (indices8, "Jobs Ac", "jobs.ac.uk"),
    (indices9, "Jobs24", "jobs24.com"),
    (indices10, "ijobs", "ijobscenter.com"),
    (indices11, "JobSearch", "np.nan"),
    (indices12, "JustLondonJobs", "justlondonjobs.com"),
    (indices13, "Teaching jobs - TES Connect", "tes.com"),
    (indices14, "TotallyExec", "totallyexec.com"),
]

# the indices are index labels, so look the id up with .loc
for indices, original, modified in source_fixes:
    for i in indices:
        updateErlist(i, df.loc[i, 'id'], "SourceName", original, modified, "Syntactical Anomalies", "Replacing Original with Modified")


In [None]:
# Let us check if everything is resolved
filtered_df = df[~df['SourceName'].str.contains(r'\.(?:com|co\.uk|net)$', regex=True, na=False)]
filtered_df['SourceName'].value_counts()

Everything has been resolved. The remaining output consists only of valid websites or the replacements made above.

In [None]:
# Keep a copy of the original 'SourceName' column
original_source = df['SourceName'].copy()

# let us remove leading and trailing white spaces if they exist
df['SourceName'] = df['SourceName'].str.strip()

# Find indices where changes occurred (rows that are NaN before and after are not changes)
changed_indices = df.index[~original_source.eq(df['SourceName']) & ~(original_source.isna() & df['SourceName'].isna())].tolist()

changed_indices



- No leading or trailing whitespace was found.
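The extra `isna()` term in the change-detection mask matters because NaN never compares equal to itself. A minimal sketch on toy values (not from the dataset) shows the difference between a plain comparison and the NaN-safe version:

```python
import numpy as np
import pandas as pd

before = pd.Series(["Leeds ", "London", np.nan], dtype="object")
after = before.str.strip()

# NaN != NaN evaluates to True, so a plain comparison also flags the missing row
naive = before.index[before != after].tolist()

# Excluding rows that are missing on both sides leaves only the real change
both_missing = before.isna() & after.isna()
nan_safe = before.index[(before != after) & ~both_missing].tolist()
```

Here `naive` contains the missing row as well as the stripped one, while `nan_safe` contains only the row whose text actually changed.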

In [None]:
# let us check how many missing values there are for SourceName
df['SourceName'].isna().sum()

In [None]:
# let us check the instances where the missing values are
indices_1 = df.loc[df['SourceName'].isna()].index
df[df['SourceName'].isna()]

In [None]:
# let us group the SourceNames based on categories
df.groupby('Category')['SourceName'].value_counts()

In [None]:
# let us replace the missing values with the most frequent SourceName for each category 
# Compute the mode for each 'Category'
most_frequent_SourceName = df.groupby('Category')['SourceName'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)

# Fill NaN values in 'SourceName' with the computed mode
# (assigning the result back avoids the unreliable chained inplace fillna)
df['SourceName'] = df['SourceName'].fillna(most_frequent_SourceName)


In [None]:
# let us record this change
for i in indices_1:
    updateErlist(i, df.iloc[i]['id'],"SourceName", "NaN", "Most Frequent SourceName Depending on Category", "Missing Values", "Replacing Original with Most Frequent SourceName Depending on Category")

In [None]:
# let us check if all is Resolved
df['SourceName'].isna().sum()

### Title

In [None]:
# we can not check for logical errors since the variability of title is huge and there are no commonalities.
# we can only remove some of the symbols that are not needed like ... or ***
# let us find those that have special characters in the title
filtered_df = df[df['Title'].str.contains(r'[*?!.,:;\-+@#$%^&<>~`_]{1,}|\s{2,}', regex=True, na=False)]
filtered_indices = filtered_df.index
filtered_df

- That is a lot of titles: 26957!
- Let us remove *** from the titles and collapse repeated whitespace.
- Other than these, it is difficult to find other errors in the title since its variability is high.
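The title clean-up described above can be sketched on a few made-up titles. This is an illustration under assumptions: the titles are invented, and the order of operations (symbols first, whitespace second, then a final strip) is a choice made for the sketch rather than the notebook's exact sequence.

```python
import pandas as pd

titles = pd.Series(["Senior  Engineer ***", "Nurse?? (Nights)", "C++ Developer"])

cleaned = (
    titles
    .str.replace(r"[*?]+", "", regex=True)   # drop runs of * and ?
    .str.replace(r"\s{2,}", " ", regex=True)  # collapse repeated whitespace
    .str.strip()                              # trim what the removals left behind
)
```

Removing the symbols before collapsing whitespace means that any gap left behind by a deleted run of asterisks is cleaned up in the same pass. Characters such as + are left alone, so a title like "C++ Developer" is preserved.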

In [None]:
# Initialize dictionaries to hold before and after states
changes_whitespace = {}
changes_special_char = {}

# Keep a copy of original titles to be able to find indices of modified titles
original_titles = df['Title'].copy()

# Apply first transformation (fixing whitespaces)
df['Title'] = df['Title'].str.replace(r'\s{2,}', ' ', regex=True)

# Find and store indices changed by first transformation
indices15 = original_titles[original_titles != df['Title']].index.tolist()

# Update Erlist for first transformation, if any changes were made
if indices15:
    for i in indices15:
        updateErlist(i, df.iloc[i]['id'],"Title", "Multiple Original Values", "Multiple Modified Values", "Syntactical Anomalies", "Removing white spaces")

# Make another copy of the modified titles after whitespace fix
modified_titles_after_whitespace_fix = df['Title'].copy()

# Apply second transformation (fixing special characters)
df['Title'] = df['Title'].str.replace(r'[*?]{1,}', '', regex=True)

# Find and store indices changed by second transformation
indices16 = modified_titles_after_whitespace_fix[modified_titles_after_whitespace_fix != df['Title']].index.tolist()

# Update Erlist for second transformation, if any changes were made
if indices16:
    for i in indices16:
        updateErlist(i, df.iloc[i]['id'],"Title", "Multiple Original Values", "Multiple Modified Values", "Syntactical Anomalies", "Removing special characters")

In [None]:
# let us check how many missing values there are for Title
df['Title'].isna().sum()

### Location

In [None]:
# value counts
df['Location'].value_counts()

In [None]:
# unique values
df['Location'].unique()

In [None]:
# better view format: unique values in order of first appearance
unique = df['Location'].unique().tolist()

unique

In [None]:
# most of the errors are due to typos
# let us replace Leads with Leeds
indices17 = df.loc[df['Location'] == 'Leads'].index
df['Location'] = df['Location'].replace('Leads', 'Leeds')
# let us replace london with London
indices18 = df.loc[df['Location'] == 'london'].index
df['Location'] = df['Location'].replace('london', 'London')
# let us replace SURREY with Surrey
indices19 = df.loc[df['Location'] == 'SURREY'].index
df['Location'] = df['Location'].replace('SURREY', 'Surrey')
# let us replace birmingham with Birmingham
indices20 = df.loc[df['Location'] == 'birmingham'].index
df['Location'] = df['Location'].replace('birmingham', 'Birmingham')
# let us replace Oxfords with Oxford
indices21 = df.loc[df['Location'] == 'Oxfords'].index
df['Location'] = df['Location'].replace('Oxfords', 'Oxford')
# let us replace LANCASHIRE with Lancashire
indices22 = df.loc[df['Location'] == 'LANCASHIRE'].index
df['Location'] = df['Location'].replace('LANCASHIRE', 'Lancashire')
# let us replace HAMpshire with Hampshire
indices23 = df.loc[df['Location'] == 'HAMpshire'].index
df['Location'] = df['Location'].replace('HAMpshire', 'Hampshire')
# let us replace Londn with London
indices24 = df.loc[df['Location'] == 'Londn'].index
df['Location'] = df['Location'].replace('Londn', 'London')
# let us replace ABERDEEN with Aberdeen
indices25 = df.loc[df['Location'] == 'ABERDEEN'].index
df['Location'] = df['Location'].replace('ABERDEEN', 'Aberdeen')
# let us replace DONCASTER with Doncaster
indices26 = df.loc[df['Location'] == 'DONCASTER'].index
df['Location'] = df['Location'].replace('DONCASTER', 'Doncaster')

In [None]:
# Updating records with specific locations: (index list, original value, corrected value)
location_fixes = [
    (indices17, "Leads", "Leeds"),
    (indices18, "london", "London"),
    (indices19, "SURREY", "Surrey"),
    (indices20, "birmingham", "Birmingham"),
    (indices21, "Oxfords", "Oxford"),
    (indices22, "LANCASHIRE", "Lancashire"),
    (indices23, "HAMpshire", "Hampshire"),
    (indices24, "Londn", "London"),
    (indices25, "ABERDEEN", "Aberdeen"),
    (indices26, "DONCASTER", "Doncaster"),
]

# the indices are index labels, so look the id up with .loc
for indices, original, modified in location_fixes:
    for i in indices:
        updateErlist(i, df.loc[i, 'id'], "Location", original, modified, "Syntactical Anomalies", "Replacing Original with Modified")


In [None]:
# Keep a copy of the original 'Location' column
original_location = df['Location'].copy()

# let us remove white spaces if exists
df['Location'] = df['Location'].str.strip()

# Find indices where changes occurred
changed_indices = df.index[original_location != df['Location']].tolist()

changed_indices

- No whitespace errors.

In [None]:
# let us check how many missing values there are for Location
df['Location'].isna().sum()

### Company

In [None]:
# value counts
df['Company'].value_counts()

In [None]:
# checking unique values in a good output format (order of first appearance)
unique = df['Company'].unique().tolist()

unique

- OOOF! That is a lot of companies.
- Let us remove obvious errors such as surrounding whitespace; a company name also cannot consist only of digits or special characters.

In [None]:
# Keep a copy of the original 'Company' column
original_company = df['Company'].copy()

# let us remove leading and trailing white spaces if they exist
df['Company'] = df['Company'].str.strip()

# Find indices where changes occurred
# This accounts for the NaN issue (NaN never compares equal to NaN)
changed_indices = df.index[~original_company.eq(df['Company']) & ~(original_company.isna() & df['Company'].isna())].tolist()

# Update the error list (look the id up with idx, not a stale loop variable)
for idx in changed_indices:
    updateErlist(idx, df.loc[idx, 'id'], "Company", original_company[idx], df['Company'][idx], "Whitespace Anomalies", "Removing white spaces")

In [None]:
# let us replace N/A with np.nan
indices27 = df.loc[df['Company'] == 'N/A'].index
df['Company'] = df['Company'].replace('N/A', np.nan)

# update error list
for i in indices27:
    updateErlist(i, df.loc[i, 'id'], "Company", "N/A", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")

In [None]:
# there are entries that are empty, let us replace them with np.nan
indices28 = df.loc[df['Company'] == ''].index
df['Company'] = df['Company'].replace('', np.nan)

# update error list
for i in indices28:
    updateErlist(i, df.loc[i, 'id'], "Company", "", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")

In [None]:
# let us replace - with np.nan
indices29 = df.loc[df['Company'] == '-'].index
df['Company'] = df['Company'].replace('-', np.nan)

# update error list
for i in indices29:
    updateErlist(i, df.loc[i, 'id'], "Company", "-", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")

In [None]:
# value counts
df['Company'].value_counts()

In [None]:
# let us check how many missing values there are for Company
df['Company'].isna().sum()

In [None]:
# let us check the instances where the missing values are
indices_2 = df.loc[df['Company'].isna()].index
df[df['Company'].isna()]

In [None]:
# Compute the mode for each combination of 'Location' and 'Category'
most_frequent_company = df.groupby(['Location', 'Category'])['Company'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)

# Fill NaN values in 'Company' with the computed mode
# (assigning the result back avoids the unreliable chained inplace fillna)
df['Company'] = df['Company'].fillna(most_frequent_company)


# If we have a null value in the "Company" column for a job that is in, say,
# "London" and in the "IT" category, this null value is replaced by the company that appears most frequently for IT jobs in London in the DataFrame.

In [None]:
# Record
for i in indices_2:
    updateErlist(i, df.loc[i, 'id'], "Company", "NaN", "Most Frequent Company Depending on Location and Category", "Missing Values", "Replacing Original with Most Frequent Company Depending on Location and Category")

In [None]:
# let us check if all is resolved
df['Company'].isna().sum()

In [None]:
# let us see these companies
indices_3 = df[df['Company'].isna()].index
df[df['Company'].isna()]

In [None]:
# let us replace them now with SourceName and category

# Compute the mode for each combination of 'SourceName' and 'Category'
most_frequent_company = df.groupby(['SourceName', 'Category'])['Company'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)

# Fill NaN values in 'Company' with the computed mode
# (assigning the result back avoids the unreliable chained inplace fillna)
df['Company'] = df['Company'].fillna(most_frequent_company)



In [None]:
# Record
for i in indices_3:
    updateErlist(i, df.loc[i, 'id'], "Company", "NaN", "Most Frequent Company Depending on SourceName and Category", "Missing Values", "Replacing Original with Most Frequent Company Depending on SourceName and Category")

In [None]:
# let us check if all is resolved
df['Company'].isna().sum()

'Location' and 'Category' were tried before 'SourceName' and 'Category' on the assumption that a company's location is more informative than the platform on which its job posting is listed. A company typically has a single geographical location, whereas it may advertise its vacancies on multiple platforms.

### ContractType

In [None]:
# value counts
df['ContractType'].value_counts()

In [None]:
indices28_type = df.loc[df['ContractType'] == 'N/A'].index
df['ContractType'] = df['ContractType'].replace('N/A', np.nan)

# update error list
for i in indices28_type:
    updateErlist(i, df.loc[i, 'id'], "ContractType", "N/A", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")

# let us replace - with np.NaN
indices29_type = df.loc[df['ContractType'] == '-'].index
df['ContractType'] = df['ContractType'].replace('-', np.nan)

# update error list
for i in indices29_type:
    updateErlist(i, df.loc[i, 'id'], "ContractType", "-", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")

# let us replace " " with np.nan
indices30_type = df.loc[df['ContractType'] == ' '].index
df['ContractType'] = df['ContractType'].replace(' ', np.nan)

# update error list
for i in indices30_type:
    updateErlist(i, df.loc[i, 'id'], "ContractType", " ", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")

In [None]:
# value counts
df['ContractType'].value_counts()

In [None]:
# any missing values?
df['ContractType'].isna().sum()

In [None]:
# record the indices of missing ContractType values
indices_4_type = df[df['ContractType'].isna()].index

In [None]:
# let us replace the missing ContractType with the most frequent contract type of the category
# Compute the mode (most frequent value) for 'ContractType' for each 'Category'
most_frequent_contract_type = df.groupby('Category')['ContractType'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)

# Fill NaN values in 'ContractType' with the computed mode for each 'Category'
df['ContractType'] = df['ContractType'].fillna(most_frequent_contract_type)

In [None]:
# Record
for i in indices_4_type:
    updateErlist(i, df.loc[i, 'id'], "ContractType", "NaN", "Most Frequent ContractType Depending on Category", "Missing Values", "Replacing Original with Most Frequent ContractType Depending on Category")

In [None]:
# any missing values?
df['ContractType'].isna().sum()

### ContractTime

In [None]:
# value counts
df['ContractTime'].value_counts()

In [None]:
# let us replace N/A with np.NaN
indices28 = df.loc[df['ContractTime'] == 'N/A'].index
df['ContractTime'] = df['ContractTime'].replace('N/A', np.nan)

# update error list
for i in indices28:
    updateErlist(i, df.loc[i, 'id'], "ContractTime", "N/A", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")

# let us replace - with np.NaN
indices29 = df.loc[df['ContractTime'] == '-'].index
df['ContractTime'] = df['ContractTime'].replace('-', np.nan)

# update error list
for i in indices29:
    updateErlist(i, df.loc[i, 'id'], "ContractTime", "-", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")

# let us replace " " with np.nan
indices30 = df.loc[df['ContractTime'] == ' '].index
df['ContractTime'] = df['ContractTime'].replace(' ', np.nan)

# update error list
for i in indices30:
    updateErlist(i, df.loc[i, 'id'], "ContractTime", " ", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")


In [None]:
# value counts
df['ContractTime'].value_counts()

In [None]:
# any missing values?
df['ContractTime'].isna().sum()

In [None]:
# record the indices of missing values
indices_4 = df[df['ContractTime'].isna()].index

In [None]:
# let us replace the missing ContractTime with the most frequent contract time of the company
# Compute the mode (most frequent value) for 'ContractTime' for each 'Company'
most_frequent_contract_time = df.groupby('Company')['ContractTime'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)

# Fill NaN values in 'ContractTime' with the computed mode for each 'Company'
df['ContractTime'] = df['ContractTime'].fillna(most_frequent_contract_time)

In [None]:
# Record
for i in indices_4:
    updateErlist(i, df.iloc[i]['id'],"ContractTime", "NaN", "Most Frequent ContractTime Depending on Company", "Missing Values", "Replacing Original with Most Frequent ContractTime Depending on Company")

In [None]:
# any missing values?
df['ContractTime'].isna().sum()

In [None]:
# record the indices of missing values
indices_5 = df[df['ContractTime'].isna()].index

In [None]:
# let us impute the remaining ContractTime values with the most frequent contract time of the same category
# Compute the mode (most frequent value) for 'ContractTime' for each 'Category'
most_frequent_contract_time = df.groupby('Category')['ContractTime'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)

# Fill NaN values in 'ContractTime' with the computed mode for each 'Category'
df['ContractTime'] = df['ContractTime'].fillna(most_frequent_contract_time)

In [None]:
# Record
for i in indices_5:
    updateErlist(i, df.iloc[i]['id'],"ContractTime", "NaN", "Most Frequent ContractTime Depending on Category", "Missing Values", "Replacing Original with Most Frequent ContractTime Depending on Category")

In [None]:
# any missing values?
df['ContractTime'].isna().sum()

- The rationale for prioritizing 'Company' over 'Category' in the data imputation process stems from the belief that the nature of a company's typical employment contracts holds greater significance than the job type in determining the contract time.
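
The two-step fallback described above (Company first, then Category) can be sketched as a reusable helper. This is a minimal sketch on invented toy rows; `fill_with_group_mode` and the `toy` DataFrame are hypothetical names introduced for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def fill_with_group_mode(frame, target, keys):
    """Fill NaNs in `target` with the most frequent value inside each `keys` group."""
    group_mode = frame.groupby(keys)[target].transform(
        # A group with no observed value has an empty mode, so it stays NaN
        lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
    )
    return frame[target].fillna(group_mode)


# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Company": ["A", "A", "A", "C", "B", "B"],
    "Category": ["IT", "IT", "HR", "HR", "HR", "HR"],
    "ContractTime": ["permanent", "permanent", np.nan, np.nan, "contract", "contract"],
})

# Company is tried first; Category only fills what Company could not
for keys in ["Company", "Category"]:
    toy["ContractTime"] = fill_with_group_mode(toy, "ContractTime", keys)

print(toy["ContractTime"].tolist())
```

In this toy example the missing value for company A is filled from A's own rows, while company C has no observed contract time and falls back to the mode of its category.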

### Category

In [None]:
# value counts
df['Category'].value_counts()

In [None]:
# are there any missing values?
df['Category'].isna().sum()

- No errors

### Salary

In [None]:
# checking salary format
try:
    df["Salary"] = df["Salary"].astype("float")
except ValueError:
    print("It turns out Salary is in different formats or has errors.")

In [None]:
# let us see how many errors we have and what they look like
count = 0
errors = []
for i in df["Salary"]:
    try:
        float(i)
    except ValueError:
        count = count + 1
        if i not in errors:
            errors.append(i)
count

In [None]:
# what type of errors?
errors

In [None]:
# filtering those with range
filtered_df = df[df['Salary'].str.contains(r'\d*\s*?[-~]\s*\d*|\d*\s*?to\s*\d*', regex=True, na=False)]
filtered_indices = filtered_df.index
filtered_df['Salary'].value_counts()

In [None]:
# let us first replace those with just -
indices31 = df.loc[df['Salary'] == '-'].index
mask = df['Salary'] == '-'

# let us replace them with np.nan (the np.NaN alias was removed in NumPy 2.0)
df.loc[mask, 'Salary'] = np.nan

# update error list
for i in indices31:
    updateErlist(i, df.iloc[i]['id'],"Salary", "-", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")

In [None]:
# First, convert the Salary column to string type to ensure the regular expression works on all rows
df['Salary'] = df['Salary'].astype(str)

# Now copy the original Salary column
original_salary = df['Salary'].copy()

# Replace using the regular expression
df['Salary'] = df['Salary'].str.replace(
    r'(\d+)\s*([-~]|to)\s*(\d+)', 
    lambda x: str((float(x.group(1)) + float(x.group(3))) / 2) if x.group(1) and x.group(3) else x.group(0),
    regex=True
)

# Find indices where changes occurred; this will now also include NaNs
changed_indices = df.index[original_salary != df['Salary']].tolist()

# Now, go through these indices and update the error list
for idx in changed_indices:
    if pd.isna(original_salary[idx]) and pd.isna(df['Salary'][idx]):
        # Skip NaNs as these are not "changes"
        continue
    updateErlist(idx, df.iloc[idx]['id'],"Salary", original_salary[idx], df['Salary'][idx], "Semantic Anomalies", "Averaging Salary Range")

In [None]:
# lets deal with those with /year or per Annum
filtered_df = df[df['Salary'].str.contains(r'\d*[/]year|\d*\s*per\s*?Annum', regex=True, na=False)]
filtered_indices = filtered_df.index
filtered_df

In [None]:
# we just need to remove the /year or per Annum
original_salary_before_per_year_removal = df['Salary'].copy()

df['Salary'] = df['Salary'].str.replace(r'[/]year|\s*per\s*?Annum', '', regex=True)

changed_indices = df.index[original_salary_before_per_year_removal != df['Salary']].tolist()

# Update the error list
for idx in changed_indices:
    if pd.isna(original_salary_before_per_year_removal[idx]) and pd.isna(df['Salary'][idx]):
        # Skip NaNs as these are not "changes"
        continue
    updateErlist(idx, df.iloc[idx]['id'],"Salary", original_salary_before_per_year_removal[idx], df['Salary'][idx], "Semantic Anomalies", "Removing '/year' or 'per Annum'")


In [None]:
# let us deal with those with /hour
filtered_df = df[df['Salary'].str.contains(r'\d+(?:\.\d+)?\sper\shour|\d+(?:\.\d+)?\sp/h', regex=True, na=False)]
filtered_indices = filtered_df.index 
filtered_df

- The average UK working week is about 36 hours (the source below reports 36.4 hours; it is rounded down to 36 here).
- https://standout-cv.com/average-working-hours-uk#:~:text=hours%20per%20week.-,Average%20working%20hours%20per%20week%20UK,works%2036.4%20hours%20per%20week.

- An hourly rate is therefore converted to a yearly salary by multiplying it by 36 (hours per week) to get the weekly salary, and then by 52 (weeks per year).
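
The conversion described above can be written as a small helper. This is only a sketch of the arithmetic; the function and constant names are illustrative assumptions and do not appear elsewhere in the notebook.

```python
HOURS_PER_WEEK = 36   # assumed average UK working week, as stated above
WEEKS_PER_YEAR = 52


def hourly_to_annual(hourly_rate, hours_per_week=HOURS_PER_WEEK, weeks_per_year=WEEKS_PER_YEAR):
    """Convert an hourly rate to a yearly salary."""
    return hourly_rate * hours_per_week * weeks_per_year


# 10 per hour -> 10 * 36 = 360 per week -> 360 * 52 = 18720 per year
print(hourly_to_annual(10.0))
```

Keeping the two factors as named parameters makes it easy to rerun the cleaning with a different working-week assumption.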

In [None]:
# let us replace them by multiplying them by 36 and 52
# Copy the original Salary column before the operation for per hour in "p/h" format
original_salary_before_ph_removal = df['Salary'].copy()

# Perform the first string replacement for "p/h"
df['Salary'] = df['Salary'].str.replace(r'(\d+(?:\.\d+)?)\sp[/]h', lambda x: str(float(x.group(1)) * 36 * 52), regex=True)

# Find indices where changes occurred for the first operation
changed_indices_ph = df.index[original_salary_before_ph_removal != df['Salary']].tolist()

# Update the error list for the first operation
for idx in changed_indices_ph:
    if pd.isna(original_salary_before_ph_removal[idx]) and pd.isna(df['Salary'][idx]):
        # Skip NaNs as these are not "changes"
        continue
    updateErlist(idx, df.iloc[idx]['id'],"Salary", original_salary_before_ph_removal[idx], df['Salary'][idx], "Semantic Anomalies", "Converting p/h to yearly")

# Copy the original Salary column before the operation for per hour in "per hour" format
original_salary_before_per_hour_removal = df['Salary'].copy()

# Perform the second string replacement for "per hour"
df['Salary'] = df['Salary'].str.replace(r'(\d+(?:\.\d+)?)\sper\shour', lambda x: str(float(x.group(1)) * 36 * 52), regex=True)

# Find indices where changes occurred for the second operation
changed_indices_per_hour = df.index[original_salary_before_per_hour_removal != df['Salary']].tolist()

# Update the error list for the second operation
for idx in changed_indices_per_hour:
    if pd.isna(original_salary_before_per_hour_removal[idx]) and pd.isna(df['Salary'][idx]):
        # Skip NaNs as these are not "changes"
        continue
    updateErlist(idx, df.iloc[idx]['id'],"Salary", original_salary_before_per_hour_removal[idx], df['Salary'][idx], "Semantic Anomalies", "Converting 'per hour' to yearly")


In [None]:
# let us deal with k
filtered_df = df[df['Salary'].str.contains(r'\d*k', regex=True, na=False)]
filtered_indices = filtered_df.index

In [None]:
# Copy the original Salary column before the operation for replacing "k"
original_salary_before_k_removal = df['Salary'].copy()

# Perform the string replacement for "k"
df['Salary'] = df['Salary'].str.replace(r'(\d*)k', lambda x: str(float(x.group(1)) * 1000), regex=True)

# Find indices where changes occurred for this operation
changed_indices_k = df.index[original_salary_before_k_removal != df['Salary']].tolist()

# Update the error list for this operation
for idx in changed_indices_k:
    if pd.isna(original_salary_before_k_removal[idx]) and pd.isna(df['Salary'][idx]):
        # Skip NaNs as these are not "changes"
        continue
    updateErlist(idx, df.iloc[idx]['id'],"Salary", original_salary_before_k_removal[idx], df['Salary'][idx], "Semantic Anomalies", "Converting 'k' to full numbers")


In [None]:
# there is a single 'N/A' error
# let us replace them with np.NaN
indices32 = df.loc[df['Salary'] == 'N/A'].index
mask = df['Salary'] == 'N/A'
df.loc[mask, 'Salary'] = np.nan

# update error list
for i in indices32:
    updateErlist(i, df.iloc[i]['id'],"Salary", "N/A", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified")

In [None]:
# there's a single whitespace error
# let us replace them with np.NaN
indices33 = df.loc[df['Salary'] == ' '].index
mask = df['Salary'] == ' '
df.loc[mask, 'Salary'] = np.nan

# update error list
for i in indices33:
    updateErlist(i, df.iloc[i]['id'],"Salary", " ", "np.nan", "Syntactical Anomalies", "Replacing Original with Modified; Space to np.nan")

In [None]:
# let us see if any remaining errors
count = 0
errors = []
for i in df["Salary"]:
    try:
        float(i)
    except ValueError:
        count = count + 1
        if i not in errors:
            errors.append(i)
errors

In [None]:
# let us change the data type to float
df["Salary"] = df["Salary"].astype("float")

In [None]:
# are there any missing values for Salary?
df['Salary'].isna().sum()

In [None]:
# record indices of missing values
indices_6 = df[df['Salary'].isna()].index

In [None]:
# let us replace the missing Salary with the mean salary of the same company and category
# Compute the mean for each combination of 'Company' and 'Category'
mean_salary = df.groupby(['Company', 'Category'])['Salary'].transform('mean')

# Fill NaN values in 'Salary' with the computed mean
df['Salary'] = df['Salary'].fillna(mean_salary)


In [None]:
# Record
for i in indices_6:
    updateErlist(i, df.iloc[i]['id'],"Salary", "NaN", "Mean of Salary Depending on Company and Category", "Missing Values", "Replacing Original with Mean Salary Depending on Company and Category")

In [None]:
# are there any missing values for Salary?
df['Salary'].isna().sum()

In [None]:
# record indices of missing values
indices_7 = df[df['Salary'].isna()].index

In [None]:
# let us replace the missing Salary with the mean salary of the same company and contract time
# Compute the mean for each combination of 'Company' and 'ContractTime'
mean_salary = df.groupby(['Company', 'ContractTime'])['Salary'].transform('mean')

# Fill NaN values in 'Salary' with the computed mean
df['Salary'] = df['Salary'].fillna(mean_salary)


In [None]:
# Record
for i in indices_7:
    updateErlist(i, df.iloc[i]['id'],"Salary", "NaN", "Mean of Salary Depending on Company and ContractTime", "Missing Values", "Replacing Original with Mean Salary Depending on Company and ContractTime")

In [None]:
# are there any missing values for Salary?
df['Salary'].isna().sum()

In [None]:
# record indices of missing values
indices_8 = df[df['Salary'].isna()].index

In [None]:
# let us replace the missing Salary with the mean salary of the same company and location
# Compute the mean for each combination of 'Company' and 'Location'
mean_salary = df.groupby(['Company', 'Location'])['Salary'].transform('mean')

# Fill NaN values in 'Salary' with the computed mean
df['Salary'] = df['Salary'].fillna(mean_salary)

In [None]:
# Record
for i in indices_8:
    updateErlist(i, df.iloc[i]['id'],"Salary", "NaN", "Mean of Salary Depending on Company and Location", "Missing Values", "Replacing Original with Mean Salary Depending on Company and Location")

In [None]:
# are there any missing values for Salary?
df['Salary'].isna().sum()

In [None]:
# record indices of missing values
indices_9 = df[df['Salary'].isna()].index

In [None]:
# let us use the company alone now
# Compute the mean salary for each 'Company'
mean_salary = df.groupby(['Company'])['Salary'].transform('mean')

# Fill NaN values in 'Salary' with the computed mean
df['Salary'] = df['Salary'].fillna(mean_salary)

In [None]:
# Record
for i in indices_9:
    updateErlist(i, df.iloc[i]['id'],"Salary", "NaN", "Mean of Salary Depending on Company", "Missing Values", "Replacing Original with Mean Salary Depending on Company")

In [None]:
# are there any missing values for Salary?
df['Salary'].isna().sum()

In [None]:
# record indices of missing values
indices_10 = df[df['Salary'].isna()].index

In [None]:
# let us use the category alone now
# Compute the mean salary for each 'Category'
mean_salary = df.groupby(['Category'])['Salary'].transform('mean')

# Fill NaN values in 'Salary' with the computed mean
df['Salary'] = df['Salary'].fillna(mean_salary)

In [None]:
# Record
for i in indices_10:
    updateErlist(i, df.iloc[i]['id'],"Salary", "NaN", "Mean of Salary Depending on Category", "Missing Values", "Replacing Original with Mean Salary Depending on Category")

In [None]:
# are there any missing values for Salary?
df['Salary'].isna().sum()

Missing 'Salary' values were imputed in the following order, ranked by how strongly I judge each grouping to influence salary levels. Each step fills only the values that the previous steps left missing.

- 1: Company + Category
- 2: Company + Contract Time
- 3: Company + Location
- 4: Company alone
- 5: Category alone
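
The sequence above can also be expressed as a single loop over grouping keys. This is a minimal sketch on invented toy rows, assuming the same column names as the dataset; `FALLBACK_KEYS`, `impute_salary` and `toy_jobs` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Grouping keys, from the most to the least specific (same order as the list above)
FALLBACK_KEYS = [
    ["Company", "Category"],
    ["Company", "ContractTime"],
    ["Company", "Location"],
    ["Company"],
    ["Category"],
]


def impute_salary(frame, key_sets=FALLBACK_KEYS):
    """Return a copy of frame['Salary'] with NaNs filled by successive group means."""
    salary = frame["Salary"].copy()
    for keys in key_sets:
        if not salary.isna().any():
            break  # nothing left to fill
        # Mean of the (partly filled) salaries within each group; NaNs are ignored
        group_mean = salary.groupby([frame[k] for k in keys]).transform("mean")
        salary = salary.fillna(group_mean)
    return salary


# Toy rows, invented for illustration only
toy_jobs = pd.DataFrame({
    "Company": ["A", "A", "B", "B", "C"],
    "Category": ["IT", "IT", "HR", "Sales", "IT"],
    "ContractTime": ["permanent", "permanent", "contract", "contract", "permanent"],
    "Location": ["London", "London", "Leeds", "Leeds", "Leeds"],
    "Salary": [30000.0, np.nan, np.nan, 40000.0, np.nan],
})

toy_jobs["Salary"] = impute_salary(toy_jobs)
print(toy_jobs["Salary"].tolist())
```

In the toy data the second row is filled at step 1, the third row at step 2, and the last row only at step 5, because company C has no observed salary at all.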


In [None]:
# let us make sure Salary is stored as float
df["Salary"] = df["Salary"].astype("float")

### Dates

In [None]:
# what does it look like?
df['OpenDate']

In [None]:
# Print the unique data types in the column; it should be 'object'
print(df['OpenDate'].apply(type).unique())

# Print rows where date conversion fails
for idx, date_str in enumerate(df['OpenDate']):
    try:
        pd.to_datetime(date_str, format='%Y%m%dT%H%M%S')
    except Exception as e:
        print(f"Row {idx} failed conversion: {date_str}, Error: {e}")


df['OpenDate'] = pd.to_datetime(df['OpenDate'], format='%Y%m%dT%H%M%S', errors='coerce')
df['CloseDate'] = pd.to_datetime(df['CloseDate'], format='%Y%m%dT%H%M%S', errors='coerce')

# errors='coerce' converts dates that do not match the format to NaT; this affects one instance here
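
The effect of `errors='coerce'` can be seen on a few hand-made strings. This is a small sketch; the `raw` values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# One well-formed value, one in a different format, and one missing value
raw = pd.Series(["20131113T000000", "2013-11-13", None])

# Values that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%Y%m%dT%H%M%S", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of values that could not be parsed
```

Counting the `NaT` values right after the conversion shows how many rows need a follow-up check.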

In [None]:
# update error list
updateErlist("ALL", "ALL","OpenDate", "20131113T000000", "format='%Y%m%dT%H%M%S'", "Semantic Anomalies", "Changing the format for the entire OpenDate column")
updateErlist("ALL", "ALL","CloseDate", "20131113T000000", "format='%Y%m%dT%H%M%S'", "Semantic Anomalies", "Changing the format for the entire CloseDate column")

In [None]:
# any missing values?
df['OpenDate'].isna().sum()

In [None]:
# any missing values?
df['CloseDate'].isna().sum()

In [None]:
# let us check this missing value
df[df['OpenDate'].isna()]

In [None]:
# checking other instances with Flame Health Associates LLP Company and Cornwall location
df[(df['Company'] == "Flame Health Associates LLP") & (df['Location'] == 'Cornwall')]

In [None]:
# The closest job that closed at the same time is the following:
# 70229120	jobs4medical.co.uk	Allied Health Care Professional : Optometrist ...	Cornwall	Flame Health Associates LLP	permanent	Healthcare & Nursing Jobs	45000	2012-10-29 12:00:00	2012-11-28 12:00:00
# we can replace the missing open date with the open date of this job which is 2012-10-29 12:00:00
# let us replace the missing value with this date
indices_11 = df.loc[df['OpenDate'].isna()].index
df.loc[indices_11, 'OpenDate'] = pd.Timestamp('2012-10-29 12:00:00')


In [None]:
# Record
for i in indices_11:
    updateErlist(i, df.iloc[i]['id'],"OpenDate", "NaT", "2012-10-29 12:00:00", "Missing Values", "Replacing Original with Modified")

## A summary function for this notebook

- This will be used for task 3

In [None]:
def clean_date(df):
    """
    Cleans the dataframe by addressing common errors and inconsistencies found in various attributes
    of this specific database schema. The function performs the following operations:

    1. Normalizes the 'SourceName' field to replace aliases with standard URLs or names.
    2. Cleans the 'Title' by removing extra spaces and certain special characters.
    3. Standardizes 'Location' names by replacing incorrect or alternative spellings.
    4. Strips extra spaces and replaces placeholder values in 'Company' with NaN.
    5. Cleans 'ContractTime' to replace placeholder values with NaN.
    6. Standardizes 'Salary' by converting all formats to a common scale.
    7. Converts 'OpenDate' and 'CloseDate' to pandas datetime format.
    8. Using LinearRegression model to impute missing values for Full-Time Equivalent (FTE) column.

    Parameters:
    df (DataFrame): The DataFrame to be cleaned.

    Returns:
    DataFrame: The cleaned DataFrame.
    """
    
    # SourceName
    df['SourceName'] = df['SourceName'].replace('Jobcentre Plus', 'gov.uk/contact-jobcentre-plus')
    df['SourceName'] = df['SourceName'].replace('MyUkJobs', 'myukjob.com')
    df['SourceName'] = df['SourceName'].replace('GAAPweb', 'gaapweb.com')
    df['SourceName'] = df['SourceName'].replace('Brand Republic Jobs', 'onrec.com/directory/job-boards/brand-republic-jobs')
    df['SourceName'] = df['SourceName'].replace('eFinancialCareers', 'efinancialcareers.co.uk')
    df['SourceName'] = df['SourceName'].replace('PR Week Jobs', 'prweekjobs.co.uk')
    df['SourceName'] = df['SourceName'].replace('Multilingualvacancies', 'multilingualvacancies.com')
    df['SourceName'] = df['SourceName'].replace('Jobs Ac', 'jobs.ac.uk')
    df['SourceName'] = df['SourceName'].replace('Jobs24', 'jobs24.com')
    df['SourceName'] = df['SourceName'].replace('ijobs', 'ijobscenter.com')
    df['SourceName'] = df['SourceName'].replace('JobSearch', np.nan)
    df['SourceName'] = df['SourceName'].replace('JustLondonJobs', 'justlondonjobs.com')
    df['SourceName'] = df['SourceName'].replace('Teaching jobs - TES Connect', 'tes.com')
    df['SourceName'] = df['SourceName'].replace('TotallyExec', 'totallyexec.com')

    df['SourceName'] = df['SourceName'].str.strip()
    most_frequent_SourceName = df.groupby('Category')['SourceName'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    df['SourceName'] = df['SourceName'].fillna(most_frequent_SourceName)

    # Title

    df['Title'] = df['Title'].str.replace(r'\s{2,}', ' ', regex=True)
    df['Title'] = df['Title'].str.replace(r'[*?]{1,}', '', regex=True)

    # Location

    df['Location'] = df['Location'].str.strip()
    df['Location'] = df['Location'].replace('Leads', 'Leeds')
    df['Location'] = df['Location'].replace('london', 'London')
    df['Location'] = df['Location'].replace('SURREY', 'Surrey')
    df['Location'] = df['Location'].replace('birmingham', 'Birmingham')
    df['Location'] = df['Location'].replace('Oxfords', 'Oxford')
    df['Location'] = df['Location'].replace('LANCASHIRE', 'Lancashire')
    df['Location'] = df['Location'].replace('HAMpshire', 'Hampshire')
    df['Location'] = df['Location'].replace('Londn', 'London')
    df['Location'] = df['Location'].replace('ABERDEEN', 'Aberdeen')
    df['Location'] = df['Location'].replace('DONCASTER', 'Doncaster')

    # Company
    df['Company'] = df['Company'].str.strip()
    df['Company'] = df['Company'].replace('N/A', np.nan)
    df['Company'] = df['Company'].replace('', np.nan)
    df['Company'] = df['Company'].replace('-', np.nan)

    # Impute Company with the mode per Location and Category, then per SourceName and Category
    most_frequent_company = df.groupby(['Location', 'Category'])['Company'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    df['Company'] = df['Company'].fillna(most_frequent_company)

    most_frequent_company = df.groupby(['SourceName', 'Category'])['Company'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    df['Company'] = df['Company'].fillna(most_frequent_company)

    # ContractTime
    df['ContractTime'] = df['ContractTime'].replace('N/A', np.nan)
    df['ContractTime'] = df['ContractTime'].replace('-', np.nan)
    df['ContractTime'] = df['ContractTime'].replace(' ', np.nan)

    # Impute ContractTime with the mode per Company, then per Category
    most_frequent_contract_time = df.groupby('Company')['ContractTime'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    df['ContractTime'] = df['ContractTime'].fillna(most_frequent_contract_time)

    most_frequent_contract_time = df.groupby('Category')['ContractTime'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    df['ContractTime'] = df['ContractTime'].fillna(most_frequent_contract_time)

    # ContractType
    df['ContractType'] = df['ContractType'].replace('N/A', np.nan)
    df['ContractType'] = df['ContractType'].replace('-', np.nan)
    df['ContractType'] = df['ContractType'].replace(' ', np.nan)

    # Impute ContractType with the mode per Company, then per Category
    most_frequent_contract_type = df.groupby('Company')['ContractType'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    df['ContractType'] = df['ContractType'].fillna(most_frequent_contract_type)

    most_frequent_contract_type = df.groupby('Category')['ContractType'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    df['ContractType'] = df['ContractType'].fillna(most_frequent_contract_type)

    # Salary
    # Placeholder values become NaN ('nan' after astype(str), which float parsing turns back into NaN)
    df.loc[df['Salary'].isin(['-', 'N/A', ' ']), 'Salary'] = np.nan
    df['Salary'] = df['Salary'].astype(str)
    # Ranges such as "a - b", "a ~ b" or "a to b" are replaced by their midpoint
    df['Salary'] = df['Salary'].str.replace(
        r'(\d+)\s*([-~]|to)\s*(\d+)',
        lambda x: str((float(x.group(1)) + float(x.group(3))) / 2) if x.group(1) and x.group(3) else x.group(0),
        regex=True
    )
    # Drop yearly suffixes, convert hourly rates to yearly (36 hours x 52 weeks) and 'k' to thousands
    df['Salary'] = df['Salary'].str.replace(r'[/]year|\s*per\s*?Annum', '', regex=True)
    df['Salary'] = df['Salary'].str.replace(r'(\d+(?:\.\d+)?)\sp[/]h', lambda x: str(float(x.group(1)) * 36 * 52), regex=True)
    df['Salary'] = df['Salary'].str.replace(r'(\d+(?:\.\d+)?)\sper\shour', lambda x: str(float(x.group(1)) * 36 * 52), regex=True)
    df['Salary'] = df['Salary'].str.replace(r'(\d*)k', lambda x: str(float(x.group(1)) * 1000), regex=True)
    df["Salary"] = df["Salary"].astype("float")
    # Impute missing salaries with group means, from the most to the least specific grouping
    for keys in (['Company', 'Category'], ['Company', 'ContractTime'], ['Company', 'Location'], ['Company'], ['Category']):
        df['Salary'] = df['Salary'].fillna(df.groupby(keys)['Salary'].transform('mean'))

    # OpenDate
    df['OpenDate'] = pd.to_datetime(df['OpenDate'], format='%Y%m%dT%H%M%S', errors='coerce')

    # CloseDate
    df['CloseDate'] = pd.to_datetime(df['CloseDate'], format='%Y%m%dT%H%M%S', errors='coerce')

    return df

## Saving data

In [None]:
# data types
df.dtypes

In [None]:
# summary for df
df.info()

In [None]:
# code to save output data
# let us save the output data as csv
df.to_csv('s3969393_dataset1_solution.csv', index=False)
erlist.to_csv('s3969393_errorlist.csv', index=False)


## Summary

Following data inspection and preprocessing, the dataset is now cleaned and ready for downstream analysis. The cleaning process covered data formatting, missing values, and syntactic and semantic anomalies.

1. **Data Formatting**: The dataset, which covers job advertisements in the United Kingdom, has nine attributes. These were parsed and formatted to ensure data consistency.

2. **Handling Nominal Data**: Columns like 'Source', 'Title', 'Location', 'Company', and 'Category' were assessed for nominal data anomalies. Common syntactic irregularities such as extraneous white spaces and typos were successfully resolved.

3. **Categorical Data**: The 'ContractTime' column was evaluated for data entry errors, and categorical data types were appropriately coded.

4. **Error Resolution**: 'Salary', 'OpenDate', and 'CloseDate' contained both syntactic and semantic errors, which required data type conversions and format standardization.

5. **Handling Missing Values**: Missing values were imputed column by column, using the group-wise mode for nominal attributes and the group-wise mean for 'Salary', conditioned on company, category, or other relevant attributes.

6. **Audit and Validation**: Following the cleaning process, the dataset was scrutinized to confirm the absence of errors and to ensure data integrity. All changes were carefully logged for transparency and future reference.

In conclusion, the cleaned dataset is in a suitable format for downstream tasks such as data analysis, modeling, and visualization.