# Data Cleansing

Date: 13/05/2018

Version: 3.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* pandas
* numpy 
* re 
* matplotlib.pyplot
* difflib


## 1. Introduction
This project applies a range of cleaning methods to the different variables of the dataset in order to wrangle and clean its dirty values.

Tasks:
1. Importing libraries
2. Reading data in
3. Exploring and cleaning variables one by one. 

More details for each task will be given in the following sections.

## 2. Libraries

In [1]:
#importing libraries; more libraries will be added later if needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import difflib # for cleaning location
#!pip install fuzzywuzzy
# from fuzzywuzzy import fuzz
# from fuzzywuzzy import process

#code to show plots inline
%matplotlib inline

#display multiple df in one code segment
from IPython.display import display

## 3. Reading data

In [2]:
df = pd.read_csv('dataset1_with_error.csv')

Data description provided

|COLUMN|DESCRIPTION|
|:---|:---|
|Id	|8 digit Id of the job advertisement|
|Title	|Title of the advertised job position|
|Location|	Location of the advertised job position|
|ContractType|	The contract type of the advertised job position, could be full-time, part-time or non-specified.|
|ContractTime|	The contract time of the advertised job position, could be permanent, contract or non-specified.|
|Company	|Company (employer) of the advertised job position|
|Category|	The Category of the advertised job position, e.g., IT jobs, Engineering Jobs, etc.|
|Salary per annum	|Annual Salary of the advertised job position, e.g., 80000|
|OpenDate	|The opening time for applying for the advertised job position, e.g., 20120104T150000, means 3pm, 4th January 2012.|
|CloseDate	|The closing time for applying for the advertised job position, e.g., 20120104T150000, means 3pm, 4th January 2012.|
|SourceName|The website where the job position is advertised.|
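
The date format described in the table can be parsed with the standard library. This is a minimal sketch: the format string is inferred from the single example `20120104T150000`, and `parse_job_date` is a hypothetical helper name, not something used elsewhere in this notebook.

```python
from datetime import datetime

# Format inferred from the example "20120104T150000" in the data description.
DATE_FORMAT = "%Y%m%dT%H%M%S"

def parse_job_date(raw):
    """Parse an OpenDate/CloseDate string; raises ValueError if it is malformed."""
    return datetime.strptime(raw, DATE_FORMAT)

opened = parse_job_date("20120104T150000")
print(opened)  # 2012-01-04 15:00:00
```

A value that does not fit the format (for example a month of 13) raises `ValueError`, which makes this a simple validity check for the two date columns.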

In [3]:
len(df)

25077

In [4]:
df.head()

| |Id|Title|Location|ContractType|ContractTime|Company|Category|Salary per annum|SourceName|OpenDate|CloseDate|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|12612628|Engineering Systems Analyst|Dorking|not available|permanent|Gregory Martin International|Engineering Jobs|24996|cv-library.co.uk|20121103T000000|20121203T000000|
|1|12612830|Stress Engineer Glasgow|Glasgow|not available|permanent|Gregory Martin International|Engineering Jobs|30000|cv-library.co.uk|20130108T150000|20130408T150000|
|2|12612844|Modelling and simulation analyst|Hampshire|not available|permanent|Gregory Martin International|Engineering Jobs|30000|cv-library.co.uk|20130726T150000|20130924T150000|
|3|12613049|Engineering Systems Analyst / Mathematical Mod...|Surrey|not available|permanent|Gregory Martin International|Engineering Jobs|27504|cv-library.co.uk|20121214T000000|20130314T000000|
|4|12613647|Pioneer, Miser Engineering Systems Analyst|Surrey|not available|permanent|Gregory Martin International|Engineering Jobs|24996|cv-library.co.uk|20131025T000000|20131224T000000|


In [5]:
#checking info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25077 entries, 0 to 25076
Data columns (total 11 columns):
Id                  25077 non-null int64
Title               25077 non-null object
Location            25077 non-null object
ContractType        25077 non-null object
ContractTime        25077 non-null object
Company             21242 non-null object
Category            25077 non-null object
Salary per annum    25077 non-null object
SourceName          25077 non-null object
OpenDate            25077 non-null object
CloseDate           25077 non-null object
dtypes: int64(1), object(10)
memory usage: 2.1+ MB


We can see the following:
- Id is stored as an integer, while every other column is stored as a string (object).
- There are null values in Company.

In [6]:
#checking if all unique
print('id', len(df.Id.unique()))
print('title', len(df.Title.unique()))

id 25077
title 25077


We can also see that every Id and every Title in the dataset is unique.
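
The same uniqueness check can be expressed with `Series.is_unique`, and `duplicated` shows which rows would break it. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({"Id": [101, 102, 103], "Title": ["A", "B", "B"]})

id_unique = toy["Id"].is_unique        # no repeated ids
title_unique = toy["Title"].is_unique  # "B" appears twice

# keep=False marks every member of a duplicated group, not just the repeats.
duplicated_titles = toy.loc[toy["Title"].duplicated(keep=False), "Title"]
print(id_unique, title_unique, list(duplicated_titles))
```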

In [7]:
#checking value counts numerical
df.describe()

| |Id|
|:---|:---|
|count|25077.0|
|mean|66643120.0|
|std|5195261.0|
|min|12612630.0|
|25%|67208300.0|
|50%|68361100.0|
|75%|68713710.0|
|max|69247670.0|


In [8]:
#checking value counts categorical
df.describe(include=['O'])

| |Title|Location|ContractType|ContractTime|Company|Category|Salary per annum|SourceName|OpenDate|CloseDate|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|count|25077|25077|25077|25077|21242|25077|25077|25077|25077|25077|
|unique|25077|482|3|3|4879|8|1589|90|2203|2400|
|top|Senior Staff Nurse (HDU) Guildford|UK|not available|permanent|UKStaffsearch|IT Jobs|35004|totaljobs.com|20130127T150000|20130526T120000|
|freq|1|3996|19499|16194|248|7085|1011|5335|27|23|


## 4.  Checking individual data

## Id

In [9]:
#checking all lengths of id
df.Id.apply(lambda x: len(str(x))).unique()

array([8], dtype=int64)

In [10]:
#checking for characters other than digits
{re.search(r'\D+', str(x)).group() for x in df.Id.tolist() if re.search(r'\D+', str(x))}

set()

We can see that:
- all the Ids are of length 8
- the Ids contain only digits (which was a given, since the column is stored as integers)

Since the data for ID is clean, we move on.
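
Both Id checks (length and digits only) can be folded into one pattern. A minimal sketch, where `is_valid_id` is a hypothetical helper name:

```python
import re

# Exactly eight digits, nothing before or after.
ID_PATTERN = re.compile(r"\d{8}")

def is_valid_id(value):
    """True when the value, read as text, is exactly eight digits."""
    return ID_PATTERN.fullmatch(str(value)) is not None

print(is_valid_id(12612628), is_valid_id(1261262), is_valid_id("1261262a"))
```

Applied to the column, something like `df.Id.apply(is_valid_id).all()` would summarise the result in a single boolean.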

## Location

Let's first have a look at the unique values and simply eyeball the data.

In [11]:
df.Location.unique()

array(['Dorking', 'Glasgow', 'Hampshire', 'Surrey', 'East Midlands',
       'Witney', 'Derby', 'Gateshead', 'UK', 'Avon', 'Wolverhampton',
       'Berkshire', 'North East England', 'North Yorkshire', 'East Sheen',
       'Buckinghamshire', 'London', 'Central London', 'York',
       'Newcastle Upon Tyne', 'Cambridge', 'Manchester', 'Maidstone',
       'Bradford', 'Burgess Hill', 'Birmingham', 'Derbyshire',
       'Leicestershire', 'The City', 'Staffordshire', 'Suffolk',
       'Haywards Heath', 'Jersey', 'South Yorkshire', 'Nottinghamshire',
       'Northamptonshire', 'Milton Keynes', 'Bristol', 'Hertfordshire',
       'Basingstoke', 'Warrington', 'West Yorkshire', 'South East London',
       'Salisbury', 'Leeds', 'North West London', 'Poole', 'Cumbria',
       'Huddersfield', 'Rotherham', 'West Sussex', 'Norwich', 'Cardiff',
       'Bury St. Edmunds', 'Braintree', 'Nottingham', 'Winchester',
       'Surey', 'Kent', 'Durham', 'Sunderland', 'Southampton', 'Essex',
       'Bournemouth', '

In [12]:
#checking value counts
df.Location.value_counts()

UK                      3996
London                  2743
South East London       1433
The City                 558
Leeds                    343
Manchester               334
Surrey                   317
Central London           316
Reading                  300
Birmingham               292
West Midlands            273
East Sheen               222
Bristol                  222
Berkshire                202
Nottingham               196
Oxford                   190
Hampshire                184
Milton Keynes            180
Cambridge                172
Sheffield                160
Kent                     158
Newcastle Upon Tyne      156
Hertfordshire            155
Guildford                147
Aberdeen                 141
Leicester                139
Cheshire                 133
Oxfordshire              127
West Yorkshire           127
Belfast                  125
                        ... 
Harlow                     4
Bexley                     4
Twickenham                 4
Melksham      

Using regexes we can try and find if there are any special characters in the location data.

In [13]:
#looking for characters other than letters, whitespace, full stop, hyphen, backslash and apostrophe
a = r".*[^a-zA-Z\s.\-\\']+.*"
[re.search(a, str(x)).group() for x in df.Location.tolist() if re.search(a, str(x))]

[]

In [14]:
#looking for leading or trailing spaces
a = r"(^\s.*)|(.*\s$)"
[re.search(a, str(x)).group() for x in df.Location.tolist() if re.search(a, str(x))]

[]

In [15]:
#looking for double spaces
a = r".*\s\s+.*"
[re.search(a, str(x)).group() for x in df.Location.tolist() if re.search(a, str(x))]

[]

In [16]:
#looking for double capital words (abbreviations)
a = r".*[A-Z]{2,}.*"
{re.search(a, str(x)).group() for x in df.Location.tolist() if re.search(a, str(x))}

{'UK'}

In [17]:
#looking for values containing a capital U (to catch variants such as United Kingdom)
a = r"U.*"
{re.search(a, str(x)).group() for x in df.Location.tolist() if re.search(a, str(x))}

{'UK', 'Uckfield', 'Upon Thames', 'Upon Tyne', 'Upon-Avon', 'Uxbridge'}

Using regexes we can see that the data:

- does not have trailing, leading or double spaces
- does not have any characters other than
    - letters
    - hyphen
    - apostrophe
    - full stop
- contains only one location with two consecutive capitals, UK, and no "United Kingdom", so the country is written consistently.
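
The individual regex checks above can be bundled so that each value reports which checks it fails. This is a sketch only: the labels and the `location_issues` helper are hypothetical, and the character check here does not allow a backslash, unlike the pattern used in the cell above.

```python
import re

# Hypothetical labels; each pattern corresponds to one of the checks run above.
LOCATION_CHECKS = {
    "unexpected character": re.compile(r"[^a-zA-Z\s.\-']"),
    "leading or trailing space": re.compile(r"^\s|\s$"),
    "repeated space": re.compile(r"\s{2,}"),
}

def location_issues(value):
    """Return the label of every check that the value fails."""
    return [label for label, pattern in LOCATION_CHECKS.items()
            if pattern.search(value)]

for sample in ["Bury St. Edmunds", " London", "Milton  Keynes", "Leeds#1"]:
    print(repr(sample), location_issues(sample))
```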


Once we have checked for any special or incorrect characters, we can check for spelling errors in the data. 

In [18]:
#checking misspellings


#https://stackoverflow.com/questions/41192424/python-how-to-correct-misspelled-names
def is_similar(first, second, ratio):
    return difflib.SequenceMatcher(None, first, second).ratio() > ratio


first = list(set(df.Location.unique()))
second = list(set(df.Location.unique()))

for f in first:
    result = [s for s in second if is_similar(f,s, 0.9)]
    if len(result) > 1:
        print(f, result)

Oxford ['Oxford', 'Oxfords']
Nottinham ['Nottinham', 'Nottingham']
Nottingham ['Nottinham', 'Nottingham']
Oxfords ['Oxford', 'Oxfords']
Herefordshire ['Herefordshire', 'Hertfordshire']
Surrey ['Surrey', 'Surey']
Hertfordshire ['Herefordshire', 'Hertfordshire']
Surey ['Surrey', 'Surey']


Using difflib we can identify a few spelling errors in the location, for example:
- Nottinham
- Surey
- Oxfords

Herefordshire and Hertfordshire are also flagged as similar, but both are genuine English counties, so they are left unchanged.

The code uses a 90% similarity threshold; lowering it would surface more candidate spelling errors, along with more false positives. Another way to identify spelling errors is to obtain a list of clean locations and compare our locations against it. This is difficult, however, because we cannot be sure that the data covers only the UK (although it does look like it), so comparing against UK data alone might produce incorrect matches.
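
If a trusted list of locations were available, the comparison could be sketched with `difflib.get_close_matches`. The reference list below is hypothetical and deliberately tiny; `suggest_correction` is an illustrative helper, not part of the notebook.

```python
import difflib

# Hypothetical reference list; in practice it would come from a trusted source.
KNOWN_LOCATIONS = ["Nottingham", "Surrey", "Oxford", "Hertfordshire", "Herefordshire"]

def suggest_correction(value, known=KNOWN_LOCATIONS, cutoff=0.9):
    """Return the closest known spelling, or None if the value is already
    known or nothing in the list is similar enough."""
    if value in known:
        return None
    matches = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for sample in ["Nottinham", "Surey", "Oxfords", "Hertfordshire"]:
    print(sample, "->", suggest_correction(sample))
```

Because known values are skipped first, legitimate near-twins such as Hertfordshire and Herefordshire are never "corrected" into each other.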

Since we have identified the spelling errors we can go ahead and clean them. 

In [19]:
#fixing spelling errors
df.loc[df.Location == 'Nottinham', 'Location'] = 'Nottingham'
df.loc[df.Location == 'Surey', 'Location'] = 'Surrey'
df.loc[df.Location == 'Oxfords', 'Location'] = 'Oxford'

Since we have seen the example of Oxfords, let's see whether other locations have been incorrectly written in the plural, e.g. Londons.

In [20]:
#looking for locations ending in s
a = r"^.*s$"
{re.search(a, str(x)).group() for x in df.Location.tolist() if re.search(a, str(x))}

{'Barrow-In-Furness',
 'Bognor Regis',
 'Bolton Le Sands',
 'Brent Cross',
 'Bury St. Edmunds',
 'Devizes',
 'Docklands',
 'Dumfries',
 'East Midlands',
 'Gerrards Cross',
 'Glenrothes',
 'Hastings',
 'Henley-On-Thames',
 'Highlands',
 'Inverness',
 'Kingston Upon Thames',
 'Leads',
 'Leeds',
 'Lewes',
 'Milton Keynes',
 'North Shields',
 'Sevenoaks',
 'South Shields',
 'St. Albans',
 'St. Helens',
 'St. Ives',
 'St. Neots',
 'Staines',
 'Stockton-On-Tees',
 'Tunbridge Wells',
 'Wales',
 'Walton-On-Thames',
 'West Midlands',
 'Widnes'}

It looks like there is another spelling error: Leads instead of Leeds. Our difflib code did not pick it up because the two spellings are less than 90% similar. Let's go ahead and correct it as well.
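
To see why the earlier check missed this pair, we can print the similarity ratio directly. A minimal sketch; `similarity` is a hypothetical helper that returns the same measure `is_similar` compares against its threshold.

```python
import difflib

def similarity(first, second):
    """SequenceMatcher ratio: 2 * matched characters / total characters."""
    return difflib.SequenceMatcher(None, first, second).ratio()

# "Le" and "ds" match (4 characters) out of 10 in total: 2 * 4 / 10.
leads_vs_leeds = similarity("Leads", "Leeds")
surey_vs_surrey = similarity("Surey", "Surrey")
print(leads_vs_leeds, surey_vs_surrey)
```

With only five letters, a single substituted character already pulls the ratio down to 0.8, which is why short names need either a lower threshold or a separate check.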

In [21]:
df.loc[df.Location == 'Leads', 'Location'] = 'Leeds'

Lastly, we check whether any location starts with a lower-case letter, e.g. london instead of London.

In [22]:
#looking for locations starting with a lower-case letter
a = r"^[a-z].*$"
{re.search(a, str(x)).group() for x in df.Location.tolist() if re.search(a, str(x))}

set()

Since there are none, we move on. 

## Contract Type

Since we already know what format the contract type attribute should be in, let's just check its unique values.

In [23]:
#checking values
df.ContractType.unique()

array(['not available', 'full_time', 'part_time'], dtype=object)

In [24]:
#checking counts of values
df.ContractType.value_counts()

not available    19499
full_time         4883
part_time          695
Name: ContractType, dtype: int64

The values seem to be clean, but they do not follow the format given in the data description (full-time, part-time, non-specified). We can therefore simply relabel them.

In [25]:
#relabelling the values to match the data description
df.loc[df.ContractType=='not available', 'ContractType'] = 'non-specified'
df.loc[df.ContractType=='full_time', 'ContractType'] = 'full-time'
df.loc[df.ContractType=='part_time', 'ContractType'] = 'part-time'

In [26]:
#checking value counts to confirm
df.ContractType.value_counts()

non-specified    19499
full-time         4883
part-time          695
Name: ContractType, dtype: int64

Now that we have cleaned the format of the data values, we can move on.
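
The relabelling can also be written as a single mapping, which keeps the old and new labels side by side and makes it easy to spot anything outside the expected vocabulary. A minimal sketch on a toy Series; the variable names are illustrative.

```python
import pandas as pd

# Raw labels seen in the data mapped to the labels in the data description.
CONTRACT_TYPE_LABELS = {
    "not available": "non-specified",
    "full_time": "full-time",
    "part_time": "part-time",
}

toy = pd.Series(["not available", "full_time", "part_time", "full_time"])
cleaned = toy.replace(CONTRACT_TYPE_LABELS)

# Anything that is not one of the expected labels would show up here.
unexpected = set(cleaned) - set(CONTRACT_TYPE_LABELS.values())
print(list(cleaned), unexpected)
```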

## Contract Time

We know the expected values for this attribute as well, so let's first see what the values look like.

In [27]:
#checking values
df.ContractTime.unique()

array(['permanent', 'not available', 'contract'], dtype=object)

In [28]:
#checking counts of values
df.ContractTime.value_counts()

permanent        16194
not available     6212
contract          2671
Name: ContractTime, dtype: int64

Again the values seem to be clean; however, "not available" should be "non-specified". Let's go ahead and change that.

In [29]:
#changing not available to non-specified
df.loc[df.ContractTime=='not available','ContractTime'] = 'non-specified'

In [30]:
#checking counts of values
df.ContractTime.value_counts()

permanent        16194
non-specified     6212
contract          2671
Name: ContractTime, dtype: int64

Now that the format is correct we can move on.

## Company 

Cleaning the company would be a bit tricky for the following reasons:
- Company is the only attribute that has null values in the dataset.
- Company names do not follow a specific format. There is no limit on the kind of characters they use.
- Different companies can have similar names in different locations. Again, we do not know whether the data covers only the UK, so we cannot make any irreversible changes to the data based on that assumption.

In [31]:
df.Company.value_counts()

UKStaffsearch                              248
CVbrowser                                  170
Randstad                                   169
JOBG8                                      159
Matchtech Group plc.                       150
Chef Results                               137
Penguin Recruitment                        134
Hays                                       115
London4Jobs                                112
COREcruitment International                111
JAM Recruitment Ltd                        109
Computer People                             99
Clear Selection                             92
Monarch Recruitment                         89
Towngate Personnel                          79
Cherryred Recruitment                       69
Senitor Associates                          67
Capita Resourcing                           62
Support Services Group                      62
Modis                                       62
Aspire Data Recruitment                     61
Adecco       

Just from eyeballing the companies we can see that there is no specific format. Some company names contain a few capitalised letters, some contain digits, and so on.

Let's look at the format a bit later. First, let's see whether the other attributes show any pattern when the company is missing.

In [32]:
df[df.Company.isnull()].describe(include=['O'])

| |Title|Location|ContractType|ContractTime|Company|Category|Salary per annum|SourceName|OpenDate|CloseDate|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|count|3835|3835|3835|3835|0.0|3835|3835|3835|3835|3835|
|unique|3835|414|3|3|0.0|8|713|17|1844|1871|
|top|Peripatetic / Relief Manager Nursing Homes London|UK|non-specified|non-specified| |Healthcare & Nursing Jobs|35004|careworx.co.uk|20130921T150000|20120708T120000|
|freq|1|215|2831|2234| |1951|116|1777|8|8|


In [33]:
sum(df.Company.isnull())

3835

There seems to be no specific pattern in the other attributes. There are a few ways to handle the missing companies:
- Delete the rows that have an empty company.
- Impute the empty values.
    - Since the company is not a numerical value, one way to impute is the mode imputation method. By grouping the data by location and job category, we can take the company that advertises the most jobs of that category in that location. Although this is doable, it comes with no statistical measure of accuracy and might not be the best way to impute.
- Replace the null values with a new value such as non-specified, as was done for ContractTime and ContractType.

In this approach we will replace the null values with non-specified, as that seems to be the convention for missing values in this dataset.

In [34]:
#filling nulls
df['Company'] = df['Company'].fillna('non-specified')

If we wanted to fill the values with the mode instead, the code would look something like this. It groups the data by location and then job category, and imputes each missing company with the most frequent company of its group.

```python
#fixing company using loc and category
loc_cat_mode = df.groupby(['Location', 'Category'])['Company'].transform(lambda x: x.mode().iloc[0] if len(x.mode()) != 0 else np.nan)
df['Company'] = df['Company'].fillna(loc_cat_mode)
```

Some rows with a missing company have no other row with the same category in the same location; for those, the imputation would fall back to the mode of the category alone.

```python
#fixing company using only the category
cat_mode = df.groupby(['Category'])['Company'].transform(lambda x: x.mode().iloc[0] if len(x.mode()) != 0 else np.nan)
df['Company'] = df['Company'].fillna(cat_mode)
```
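
The two-stage fallback can be checked on a small made-up frame. This is a sketch under the assumption that the first mode is an acceptable tie-break; the toy rows and the `group_mode` helper are hypothetical.

```python
import numpy as np
import pandas as pd

def group_mode(values):
    """Most frequent non-null value in the group, or NaN if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Toy rows (made up) with three missing companies.
toy = pd.DataFrame({
    "Location": ["Leeds", "Leeds", "Leeds", "York", "York", "Bath"],
    "Category": ["IT Jobs", "IT Jobs", "IT Jobs", "IT Jobs", "Sales Jobs", "Sales Jobs"],
    "Company": ["Acme", "Acme", None, None, "Bolt", None],
})

by_loc_cat = toy.groupby(["Location", "Category"])["Company"].transform(group_mode)
by_cat = toy.groupby("Category")["Company"].transform(group_mode)

# First try location + category, then fall back to category only.
toy["Company"] = toy["Company"].fillna(by_loc_cat).fillna(by_cat)
print(toy)
```

The Leeds row is filled from its own location and category, while the York and Bath rows have no usable group at that level and are filled from the category mode.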

Now that we have decided how to handle the missing values, let's look at the existing values and see whether there are any dirty values.

In [35]:
#looking for dirty values
a = r".*[^a-zA-Z0-9\s]+.*"
set([re.search(a, str(x)).group() for x in df['Company'].tolist() if re.search(a, str(x))])

{'.',
 '1st Choice Computer Appointments.',
 '24/7 Recruitment',
 '@ITS  Limited',
 'AD LIB Holdings Ltd.',
 'AES TECHNICAL & EXECUTIVE CONSULTANTS',
 'AMR   Gr, Manchester & Cheshire',
 'AMR   London, North & Central',
 'ARV Solutions.',
 'Actis Recruitment.',
 'Adapt Resourcing.',
 'Adecco UK & Ireland',
 'Admiral Hotels & Restaurants',
 'Advanse.co.uk',
 'AdvertAnywhere.com',
 'Alexander Lloyd   Compliance & Financial Services',
 'Apex Engineering Solutions.',
 'Appetite4Recruitment   Contract & Facilities Division',
 'Appetite4Recruitment   High End Restaurants, Pubs & Bars',
 'Appetite4Recruitment   High End Restaurants, Pubs Bars',
 'Arnold Clark.',
 'Aspire, Achieve, Advance Limited',
 'Assurance and Testing, Capita',
 'Astbury Marsden & Partners Limited',
 'B&Q',
 'B&Q PLC',
 'BTR Recruitment Ltd.',
 'Badenoch & Clark   Birmingham.',
 'Badenoch & Clark   Bristol',
 'Badenoch & Clark   Edinburgh.',
 'Badenoch & Clark   London ',
 'Badenoch & Clark   Manchester.',
 'Badenoch & Cl

We can see that there are values that have:
- website addresses
- double spaces or tabs
- unusual symbols such as @, !, /
- dirty values such as a lone '.'
- different formats for writing Ltd, e.g. Ltd, Limited, Ltd., ltd., LTD etc.

This freedom in the company values makes the column very difficult to clean. We cannot simply exclude special characters such as @ from the values, as they might be part of a correct name.

So let's first see what we can find in terms of inconsistent formatting, starting with the companies that have a full stop in their value.

In [36]:
#looking for values containing a full stop
a = r".*\..*"
set([re.search(a, str(x)).group() for x in df['Company'].tolist() if re.search(a, str(x))])

{'.',
 '1st Choice Computer Appointments.',
 'AD LIB Holdings Ltd.',
 'ARV Solutions.',
 'Actis Recruitment.',
 'Adapt Resourcing.',
 'Advanse.co.uk',
 'AdvertAnywhere.com',
 'Apex Engineering Solutions.',
 'Arnold Clark.',
 'BTR Recruitment Ltd.',
 'Badenoch & Clark   Birmingham.',
 'Badenoch & Clark   Edinburgh.',
 'Badenoch & Clark   Manchester.',
 'Badenoch & Clark   Nottingham.',
 'Be A.co.uk',
 'BlueTownOnline.co.uk',
 'Bridge Human Resources Recruitment.',
 'Burnham Resources Ltd.',
 'Buzzrecruit.com',
 'C.K.R. Recruitment Limited',
 'C.O.A.L IT Services Ltd',
 'C.O.A.L IT. Services Ltd',
 'Carr Lyons.',
 'Catch 22.',
 'Church International Ltd.',
 'Client Server Ltd.',
 'Cobalt Recruitment.',
 'Connections Recruitment Ltd.',
 'Coupons.com',
 'D.P. Group',
 'D.R.C. Locums Limited',
 'DCL Search & Selection.',
 'DGH Recruitment Ltd.',
 'Darwin Recruitment.',
 'Deerfoot I.T. Resources Ltd',
 'Delta Consultants.',
 'EasyWebRecruitment.com',
 'Easyvacancy.co.uk',
 'Epsilium Ltd.',
 

In [37]:
#looking for trailing spaces
df.loc[df.Company.str.contains('^.* $', na=False,regex=True), 'Company']

1342                    Ashton Consulting 
1412                                  DMS 
1457               Support Services Group 
1483               Support Services Group 
1578               Support Services Group 
1579               Support Services Group 
1688               Support Services Group 
1732          OCC Computer Personnel  Ltd 
1820          OCC Computer Personnel  Ltd 
1883          OCC Computer Personnel  Ltd 
1900          OCC Computer Personnel  Ltd 
1987          OCC Computer Personnel  Ltd 
2011               Support Services Group 
2012               Support Services Group 
2035               Stafffinders Edinburgh 
2047          OCC Computer Personnel  Ltd 
2048          OCC Computer Personnel  Ltd 
2113          OCC Computer Personnel  Ltd 
2207               Support Services Group 
2246          OCC Computer Personnel  Ltd 
2247          OCC Computer Personnel  Ltd 
2268          OCC Computer Personnel  Ltd 
2291               Support Services Group 
2292       

In [38]:
#looking for double or more spaces
a = r".*\s{2,}.*"
set([re.search(a, str(x)).group() for x in df['Company'].tolist() if re.search(a, str(x))])

{'@ITS  Limited',
 'AMR   East Midlands',
 'AMR   Gr, Manchester & Cheshire',
 'AMR   Home Counties North',
 'AMR   London, North & Central',
 'AMR   South West England',
 'AMR   West Midlands',
 'AMR   West of England',
 'Advanse   Hays',
 'Aegis Media LTD  ',
 'Alexander Lloyd   Accountancy',
 'Alexander Lloyd   Compliance & Financial Services',
 'Angela Mortimer   Main Account',
 'Antal International Limited   Warrington',
 'Appetite4Recruitment   Contract & Facilities Division',
 'Appetite4Recruitment   Contract Facilities Division',
 'Appetite4Recruitment   High End Restaurants, Pubs & Bars',
 'Appetite4Recruitment   High End Restaurants, Pubs Bars',
 'Ashton Consulting  Limited',
 'Avenue Scotland   FALKIRK',
 'Aximis  Limited',
 'BEG  Ltd',
 'BMS   Engineering',
 'BMS   Graduate',
 'BMS   Marketing',
 'BROOK STREET BUREAU   Baker Street',
 'BROOK STREET BUREAU   Cardiff Care',
 'BROOK STREET BUREAU   Derby',
 'BROOK STREET BUREAU   Edgware',
 'BROOK STREET BUREAU   Fenchurch Str

We can see a few things:
- different website formats: .com, .co.uk, www. etc.
- double or more spaces
- full stops at the end of company names
- full stops in some abbreviations and not in others, e.g. I.T. or IT
- a company with the value '.'
- tabs
- companies that include a location or branch in the name but are otherwise the same company

In terms of formatting we can:
- remove website prefixes and suffixes such as www., .com etc. (this might cause some issues, as some company names might genuinely contain .com)
- bring the variants of Limited into one consistent format
- fix certain abbreviations that seem incorrect, such as t/a to T/A
- remove the full stops from company names entirely (this might cause some issues, as some company names might legitimately contain full stops)
- remove the location/branch part of the company name. This depends on what the client wants to keep: if they consider different branches to be different companies, removing the location would create an issue. Since we do not know the requirement, we keep this information rather than lose it.
- replace runs of two or more whitespace characters with a single space
- remove any leading and trailing whitespace

Another option is to convert all the company names to upper case, which would remove the inconsistency between names written in upper case and names written in lower case. However, a company name might actually be registered with a particular capitalisation. Changes of this kind require approval from company stakeholders or the client, so we do not make them here.

In [39]:
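#dictionary for cleaning (raw strings keep the regex escapes explicit;
#'.uk.net' is listed before '.net' so that the longer suffix is removed whole)
Comp_dic_clean = {r'\.com': '', r'\.uk\.net': '', r'\.net': '', r'\.co\.uk': '', 'LTD': 'Ltd', ' Limited': ' Ltd', r'www\.': '', 't/a': 'T/A', r'\.': '', r'\s{2,}': ' '}

#loop for cleaning; regex=True is needed on pandas 2.0+, where str.replace matches literally by default
for key, val in Comp_dic_clean.items():
    df.Company = df.Company.str.replace(key, val, case=False, regex=True)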
    
#remove white spaces from right and left
df.Company = df.Company.str.strip()
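
To see what these substitutions do to individual names, the same rules can be applied to single strings with the standard library. This is a sketch, not the cell above: `clean_company` is a hypothetical helper, and it lists longer domain suffixes before shorter ones so that `.uk.net` is not half-consumed by `.net`.

```python
import re

# Ordered (pattern, replacement) pairs; every pattern is matched case-insensitively.
COMPANY_RULES = [
    (r"\.uk\.net", ""), (r"\.co\.uk", ""), (r"\.com", ""), (r"\.net", ""),
    (r"www\.", ""), (r"LTD", "Ltd"), (r" Limited", " Ltd"), (r"t/a", "T/A"),
    (r"\.", ""), (r"\s{2,}", " "),
]

def clean_company(name):
    """Apply every rule in order, then trim surrounding whitespace."""
    for pattern, replacement in COMPANY_RULES:
        name = re.sub(pattern, replacement, name, flags=re.IGNORECASE)
    return name.strip()

samples = ["Advanse.co.uk", "AD LIB Holdings Ltd.", "Aegis Media LTD  ", "@ITS  Limited", "."]
for raw in samples:
    print(repr(raw), "->", repr(clean_company(raw)))
```

Note that a lone `'.'` becomes an empty string, which is why it has to be mapped to non-specified in a separate step.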

After this cleaning, let's check again which special characters remain in the company names. But first, since one company name was '.' and we removed all full stops, let's convert that now-empty company name to non-specified.

In [40]:
df.loc[df.Company == '', 'Company'] = 'non-specified'

In [41]:
#looking for dirty values
a = r".*([^a-zA-Z0-9\s.])+.*"
set([re.search(a, str(x)).group(1) for x in df['Company'].tolist() if re.search(a, str(x))])

{'!', '&', "'", '+', ',', '-', '/', ':', '@', '’'}

In [42]:
#looking for company names that contain the remaining special characters
df.loc[df.Company.str.contains(r"[!&'+,/:@’]", na=False,regex=True), 'Company']

1414                                    IT SEARCH & SELECT
1415                                    IT SEARCH & SELECT
1757                            Sykes & Co Recruitment Ltd
1767                         Collins King & Associates Ltd
1857                                   P&C Recruitment Ltd
1879                        Green Stone Search & Selection
1998                             OCC Computer Personnel/RI
1999                             OCC Computer Personnel/RI
2000                             OCC Computer Personnel/RI
2002                             OCC Computer Personnel/RI
2091                                   P&C Recruitment Ltd
2165                             OCC Computer Personnel/RI
2173                                   P&C Recruitment Ltd
2363                               Bond Search & Selection
2365                             OCC Computer Personnel/RI
2532                             OCC Computer Personnel/RI
2533                             OCC Computer Personnel/

In [43]:
df.loc[df.Company.str.contains('-', na=False,regex=True), 'Company'].unique()

array(['non-specified'], dtype=object)

In [44]:
df.loc[df.Company.str.contains(',', na=False,regex=True), 'Company'].unique()

array(['Consult, Search and Selection', 'Quality Data Management, Inc',
       'Randstad Construction, Property Engineering',
       'Berkeley Scott Pubs, Bars & Restaurants',
       'Assurance and Testing, Capita',
       'Technology Recruiting Solutions,Inc',
       'Appetite4Recruitment High End Restaurants, Pubs & Bars',
       'Berkeley Scott Pubs, Bars Restaurants',
       'Paramount Recruitment Med Comms, PR & Advertising',
       'AMR Gr, Manchester & Cheshire', 'Radisson Blu Edwardian, London',
       'Caprice Holdings Ltd, Annabel apos;s Clubs, Urban Caprice',
       'Appetite4Recruitment High End Restaurants, Pubs Bars',
       'Aspire, Achieve, Advance Ltd', 'AMR London, North & Central',
       'Staffworx Ltd, UK', 'Bristol, Cardiff, Coventry', 'SolTech, Inc'],
      dtype=object)

In [45]:
df.loc[df.Company.str.contains(':', na=False,regex=True), 'Company']

16229       id:recruitment
16230       id:recruitment
17456       id:recruitment
17633       id:recruitment
18691       id:recruitment
21932              CVL:LDN
22865    Section:Media Ltd
22869    Section:Media Ltd
Name: Company, dtype: object

Having made the changes that we could to the company, let's move on.

## Category

For category, let's first check the value counts.

In [46]:
df.Category.value_counts()

IT Jobs                             7085
Healthcare & Nursing Jobs           4334
Engineering Jobs                    3458
Accounting & Finance Jobs           3099
Sales Jobs                          2609
Hospitality & Catering Jobs         2124
Teaching Jobs                       1378
PR, Advertising & Marketing Jobs     990
Name: Category, dtype: int64

The Category column seems to be clean, so we move on.

## Salary per annum

First, let's find any values that contain something other than digits and decimal points.

In [47]:
df['Salary per annum'].describe()

count     25077
unique     1589
top       35004
freq       1011
Name: Salary per annum, dtype: object

In [48]:
#looking for dirty values
a = r".*[^\d.]+.*"
[re.search(a, str(x)).group() for x in df['Salary per annum'].tolist() if re.search(a, str(x))]

['30K',
 '38K',
 '14K',
 '24K',
 '20896.2 - 23095.8',
 '23K',
 '16153.8 - 17854.2',
 '25K',
 '23712.0 - 26208.0',
 '16K',
 '16416.0 - 18144.0',
 '33K',
 '22321.2 - 24670.8',
 '17578.8 - 19429.2',
 '45K',
 '85K',
 '30871.199999999997 - 34120.8',
 '19K',
 '35146.2 - 38845.8',
 '21853.8 - 24154.2',
 '17100.0 - 18900.0',
 '22K',
 '24646.8 - 27241.2',
 '20896.2 - 23095.8',
 '17100.0 - 18900.0',
 '36480.0 - 40320.0',
 '33253.799999999996 - 36754.200000000004',
 '49020.0 - 54180.0',
 '27553.8 - 30454.2',
 '26K',
 '23746.199999999997 - 26245.800000000003',
 '19003.8 - 21004.2',
 '52246.2 - 57745.8',
 '32K',
 '34K',
 '24K',
 '47503.799999999996 - 52504.200000000004',
 '52K',
 '78K',
 '32K',
 '19K',
 '23K',
 '30403.8 - 33604.200000000004',
 '52K',
 '29184.0 - 32256.0',
 '41325.0 - 45675.0',
 '28500.0 - 31500.0',
 '26K',
 '55K',
 '35K',
 '27553.8 - 30454.2',
 '61753.799999999996 - 68254.2',
 '32K',
 '28K',
 '40846.2 - 45145.8',
 '40378.799999999996 - 44629.200000000004',
 '14614.8 - 16153.2',
 '2

We can see that some of the values are written in terms of K (representing thousands) and some are ranges of minimum and maximum salaries.

Now that we have eyeballed the data, let's extract all the dirty characters.

In [49]:
#looking for dirty characters
dirty = []
a = r"[^\d.]"
b = [re.findall(a, str(x))  for x in df['Salary per annum'].tolist() if re.search(a, str(x))]
for i in b:
    dirty = dirty + i
    
#check all dirty values
set(dirty)

{' ', '-', 'K'}

We find three dirty characters:
- K
- '-'
- space

We clean them in the following ways:
- The hyphen marks a range. Since we do not know the exact salary, we replace the range with the average of its two ends.
- K can be replaced with 000 for thousands.
- The space appears only in the range values, as a range is written as (a - b), so it should disappear once the ranges are cleaned. We will check this assumption after cleaning as well.
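
The plan above can be sketched for a single value with the standard library. `normalise_salary` is a hypothetical helper; unlike the plan, it multiplies by 1000 rather than appending zeros, which is this sketch's own choice so that a value such as 2.5K would also be handled.

```python
def normalise_salary(raw):
    """Convert values like '30K', '17100.0 - 18900.0' or '35004' to a float."""
    text = str(raw).strip()
    # A range "a - b" becomes the midpoint of its two ends.
    if "-" in text:
        low, high = (float(part) for part in text.split("-"))
        return (low + high) / 2
    # A trailing K (either case) stands for thousands.
    if text[-1:] in ("k", "K"):
        return float(text[:-1]) * 1000
    return float(text)

for sample in ["30K", "17100.0 - 18900.0", "35004"]:
    print(sample, "->", normalise_salary(sample))
```

`float` ignores the spaces left around each end of a range, so no separate whitespace step is needed here.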

First, let's correct the ranges.

In [50]:
#fixing range salary

#looking for - in salary
a = r".*[-].*"
[re.findall(a, str(x))  for x in df['Salary per annum'].tolist() if re.search(a, str(x))]


#splitting each range into its two ends and creating a new dataframe
sal_range_fix = df.loc[df['Salary per annum'].str.contains('-',na=False),'Salary per annum'].apply(lambda s: pd.Series({'salary':s ,'left': s.split(' - ')[0],'right':s.split(' - ')[1]}))
#adding mean column
sal_range_fix['Mid'] = (pd.to_numeric(sal_range_fix.left) + pd.to_numeric(sal_range_fix.right))/2

#replace the dirty values
df.loc[df['Salary per annum'].str.contains('-',na=False),'Salary per annum'] = sal_range_fix.Mid

Now, let's change the K to thousands (000).

In [51]:
# checking where the k or K occur
# ([kK] rather than [k|K], which would also match a literal |)
a = r".*[kK].*"
[re.search(a, str(x)).group()  for x in df['Salary per annum'].tolist() if re.search(a, str(x))]

#replace k or K with 000 (the mask ignores case so that it matches the replacement)
is_k = df['Salary per annum'].str.contains('k', case=False, na=False)
df.loc[is_k, 'Salary per annum'] = df.loc[is_k, 'Salary per annum'].str.replace('k', '000', case=False)

Now that we have cleaned it, let's check whether any dirty values remain in the salary and confirm that all the spaces are gone.

In [52]:
#checking whether salary has anything other than numbers and decimals
a = r".*[^\d.].*"
{re.search(a, str(x)).group()  for x in df['Salary per annum'].tolist() if re.search(a, str(x))}

set()

Now the salaries seem to be clean, hence we move on. 

## Source name

We know that the source name should hold the website where the job was posted. Hence, to check for valid source names we have to check whether they are valid website addresses. However, there can be websites that are in the correct format but do not exist. Such errors will unfortunately be very difficult to identify, hence we will only identify incorrect values by checking their format.

The first task is to eyeball the different values in the column.

In [53]:
df['SourceName'].unique()

array(['cv-library.co.uk', 'caterer.com', 'hays.co.uk',
       'theitjobboard.co.uk', 'jobs.catererandhotelkeeper.com',
       'careworx.co.uk', 'securityclearedjobs.com',
       'myjobs.cimaglobal.com', 'jobserve.com', 'jobg8.com',
       'rengineeringjobs.com', 'planetrecruit.com', 'hotrecruit.com',
       'jobsfinancial.com', 'thecareerengineer.com',
       'jobsineducation.co.uk', 'totaljobs.com', 'nijobfinder.co.uk',
       'staffnurse.com', 'emptylemon.co.uk', 'juniorbroker.com',
       'jobsinsocialwork.co.uk', 'strike-jobs.co.uk', 'fish4.co.uk',
       'randstadfp.com', 'simplysalesjobs.co.uk', 'cwjobs.co.uk',
       'OilCareers.com', 'careerbuilder.com', 'salestarget.co.uk',
       'jobs.guardian.co.uk', 'technojobs.co.uk',
       'professionalpensionsjobs.com', 'accountancyagejobs.com',
       'nijobs.com', 'jobs.planningresource.co.uk', 'jobs4medical.co.uk',
       'thegraduate.co.uk', 'jobs.newstatesman.com', 'contractjobs.com',
       'icaewjobs.com', 'jobs.bighospitality.

We can see that the websites use a few different domain endings. Using this we can create a regex to separate correctly formatted websites from incorrect ones.

In [54]:
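#creating regex for websites
#(A-Z rather than A-z, which would also allow characters such as _ and ^;
# the group is non-capturing so that str.contains does not warn about match groups)
a = r'^[a-zA-Z0-9.-]+\.(?:com|co\.uk|uk|net|org)$'

#checking for dirty websites
df[~df.SourceName.str.contains(a, regex=True)].SourceName.unique()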

array(['monashstudent', 'jobcareer', 'admin@caterer.com'], dtype=object)

After applying the regex we can see that there are three values that don't seem to be valid websites.
- monashstudent : seems to be not a website
- jobcareer : also does not seem to be a valid website
- admin@caterer.com : seems to be an email address
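
To see why these three values fail, here is a small sketch that applies the same kind of format check to individual strings with plain `re`. The names `DOMAIN_RE` and `looks_like_domain` are made up for this illustration, and the list of domain endings is only the one used in this notebook, not a complete list of valid endings.

```python
import re

# letters, digits, dots and hyphens, followed by one of the endings seen in the data
DOMAIN_RE = re.compile(r'^[a-zA-Z0-9.-]+\.(?:com|co\.uk|uk|net|org)$')


def looks_like_domain(value):
    """Hypothetical helper: True when the value has the shape of a website name."""
    return bool(DOMAIN_RE.match(str(value)))


for value in ['cv-library.co.uk', 'monashstudent', 'jobcareer', 'admin@caterer.com']:
    print(value, looks_like_domain(value))
```

The first two dirty values have no domain ending at all, and the email address fails because `@` is not in the allowed character set.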

So let's view the rows that have these source names.

In [55]:
a = df.SourceName.isin(['monashstudent', 'jobcareer', 'admin@caterer.com'])
df[a]

Unnamed: 0,Id,Title,Location,ContractType,ContractTime,Company,Category,Salary per annum,SourceName,OpenDate,CloseDate
5663,66932999,Registered Midwives RM Lincolnshire Lincoln,Lincoln,part-time,non-specified,The A24 Group,Healthcare & Nursing Jobs,40140,monashstudent,20130124T150000,20130223T150000
12871,68393935,Killer Javascript role Exclusive to Brightwater,Belfast,full-time,permanent,Brightwater Group,IT Jobs,39996,jobcareer,20120205T150000,20120306T150000
15379,68672352,Digital Account Manager **** plus a Great Bonus,South East London,non-specified,permanent,Blu Digital,Sales Jobs,27504,admin@caterer.com,20120516T000000,20120715T000000


Let's try to figure out what the source name generally is for these companies.

In [56]:

df.loc[df.Company.isin(df.loc[a, 'Company']), ['Company', 'SourceName']].groupby(['Company', 'SourceName']).size().unstack(fill_value=0)

SourceName,admin@caterer.com,jobcareer,monashstudent,nijobs.com,staffnurse.com,totaljobs.com
Company,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Blu Digital,1,0,0,0,0,3
Brightwater Group,0,1,0,5,0,0
The A24 Group,0,0,1,0,9,0


We can see that:
- Blu Digital has the source totaljobs.com
- Brightwater Group has the source nijobs.com
- The A24 Group has the source staffnurse.com

Hence we can replace the dirty values with the correct ones.
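
The reasoning here, taking the source a company uses most often among its clean rows, can be sketched in plain Python as follows. The helper `usual_source` and the handful of sample rows are made up for illustration, and a company with no clean rows simply gets `None`.

```python
from collections import Counter


def usual_source(rows, company, dirty_values):
    """Hypothetical helper: most frequent non-dirty source for one company."""
    counts = Counter(
        source for comp, source in rows
        if comp == company and source not in dirty_values
    )
    # most_common(1) is empty when the company has no clean rows at all
    return counts.most_common(1)[0][0] if counts else None


# a tiny made-up sample of (Company, SourceName) pairs
rows = [
    ('Blu Digital', 'totaljobs.com'),
    ('Blu Digital', 'totaljobs.com'),
    ('Blu Digital', 'admin@caterer.com'),
    ('The A24 Group', 'staffnurse.com'),
    ('The A24 Group', 'monashstudent'),
]
dirty = {'monashstudent', 'jobcareer', 'admin@caterer.com'}

print(usual_source(rows, 'Blu Digital', dirty))
```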

In [57]:
df.loc[df.SourceName == 'admin@caterer.com', 'SourceName'] = 'totaljobs.com'
df.loc[df.SourceName == 'jobcareer', 'SourceName'] = 'nijobs.com'
df.loc[df.SourceName == 'monashstudent', 'SourceName'] = 'staffnurse.com'

Let's confirm that we were able to clean them properly.

In [58]:
df.loc[df.Company.isin(df.loc[a, 'Company']), ['Company', 'SourceName']].groupby(['Company', 'SourceName']).size().unstack(fill_value=0)

SourceName,nijobs.com,staffnurse.com,totaljobs.com
Company,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Blu Digital,0,0,4
Brightwater Group,6,0,0
The A24 Group,0,10,0


Once the source name is clean, we can move forward

## Open Date

The format of the dates is already provided, hence we have to check whether all the values of this variable follow it consistently and whether there are any funky dates in the attribute.

Let's first find out what the minimum and maximum values for the dates are by extracting just the date part and ignoring the time part.

In [59]:
on_t = lambda x: x.split('T')[0]

test = pd.to_numeric(df.OpenDate.apply(on_t))
print(test.min())
print(test.max())

20120101
20133004


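Now that we know the minimum and maximum dates (the range), we can create a regex to check the format. Note that the maximum, 20133004, already looks suspicious, as it has 30 in the month position.

Let's check whether there are dates that do not follow the format and are incorrect.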

In [60]:
#creating regex for the date part (YYYYMMDD)
a = r'^201[23](0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])$'

#checking for dirty dates
df[~df.OpenDate.str.contains(a, regex=True)].OpenDate.unique()

array(['20121103T000000', '20130108T150000', '20130726T150000', ...,
       '20131509T000000', '20132901T150000', '20132108T120000'],
      dtype=object)

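This output is not very selective: the pattern only describes the date part, so it cannot match any value that still carries the `T...` time suffix, and valid dates are listed too. Even so, the values at the end of the output look dirty. They seem to be in an incorrect order, i.e. the day is in the month position and vice versa.

Let's have another look using only the date part and excluding the time part.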

In [61]:
a = r'^201[23](0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])$'
test = test.astype(str)
test[~test.str.contains(a, regex=True)]

1102     20131803
2104     20132606
2839     20122003
5707     20121512
10881    20133004
11948    20131908
15353    20121406
22918    20131509
23007    20132901
23169    20132108
Name: OpenDate, dtype: object

So we have identified one error: ten dates have the day and the month swapped. The next thing to check is whether any date is out of bounds for February, that is, whether the day is 29 or more in the month of February.

In [62]:
#searching for feb leap or wrong
# 2012 was a leap year
a = r'^201[23]02(29|3[01])$'
test = test.astype(str)
set(test[test.str.contains(a, regex=True)].tolist())

{'20120229'}

We can see that there is one such date, 29 February 2012, but a quick search confirms that 2012 was a leap year, so it is valid. February therefore seems to be clean.
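
The regex treats every month as if it could have 31 days, so a date such as 31 April would also pass it. As a hedged alternative, a calendar-aware check can be sketched with the standard `calendar` module. The helper name `day_exists` is hypothetical, and it expects the date part only, in `YYYYMMDD` form.

```python
import calendar


def day_exists(date_str):
    """Hypothetical helper: True when a YYYYMMDD string is a real calendar day."""
    year, month, day = int(date_str[:4]), int(date_str[4:6]), int(date_str[6:])
    if not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of the first day, number of days in the month)
    return 1 <= day <= calendar.monthrange(year, month)[1]


print(day_exists('20120229'))  # 2012 is a leap year
print(day_exists('20130229'))  # 2013 is not
```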

Let's get back to the flipped dates and view all of them.

In [63]:
#checking dirty dates
a = r'^201[23](0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])T(0[0-9]|1[0-9]|2[0-4])[0-5][0-9][0-5][0-9]$'
df[~df.OpenDate.str.contains(a, regex=True)].OpenDate

1102     20131803T120000
2104     20132606T000000
2839     20122003T150000
5707     20121512T150000
10881    20133004T150000
11948    20131908T000000
15353    20121406T120000
22918    20131509T000000
23007    20132901T150000
23169    20132108T120000
Name: OpenDate, dtype: object

Since we know the issue, we know that in order to clean these dates we have to swap the positions of the month and the day. So let's create a function for that, as we might have to use it again for the close date.

In [64]:
#function for fixing a date whose day and month are swapped
def date_fix(x):
    date, time = x.split('T')
    #extracting date parts
    y = date[:4]
    d = date[4:6] #the day, which sits in the month position
    m = date[6:] #the month, which sits in the day position
    #creating the correct date format (year, month, day)
    result = y + m + d + 'T' + time
    return result
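
A swap only helps if the swapped value is itself a real date. A slightly more defensive variant can be sketched with `datetime.strptime`, which rejects impossible dates. The names `STAMP_FORMAT`, `is_real_stamp` and `repair_stamp` are hypothetical and not used elsewhere in this notebook.

```python
from datetime import datetime

STAMP_FORMAT = '%Y%m%dT%H%M%S'  # e.g. 20120104T150000


def is_real_stamp(stamp):
    """True when the stamp parses as a real date and time."""
    try:
        datetime.strptime(stamp, STAMP_FORMAT)
        return True
    except ValueError:
        return False


def repair_stamp(stamp):
    """Return the stamp if valid, the day/month-swapped stamp if that is valid, else None."""
    if is_real_stamp(stamp):
        return stamp
    date, time = stamp.split('T')
    swapped = date[:4] + date[6:] + date[4:6] + 'T' + time
    return swapped if is_real_stamp(swapped) else None


print(repair_stamp('20131803T120000'))
```

Returning `None` for values that cannot be repaired keeps them visible for manual inspection instead of silently passing them through.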

Now that the function is created, let's clean the dates.

In [65]:
#checking dirty dates
a = r'^201[23](0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])T(0[0-9]|1[0-9]|2[0-4])[0-5][0-9][0-5][0-9]$'

#fixing date (only the values that do not match the expected format)
df.OpenDate = df.OpenDate.apply(lambda x: date_fix(x) if re.search(a, x) is None else x)

One final check and, if clean, let's move on.

In [66]:
#checking dirty dates
a = r'^201[23](0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])T(0[0-9]|1[0-9]|2[0-4])[0-5][0-9][0-5][0-9]$'
df[~df.OpenDate.str.contains(a, regex=True)].OpenDate

Series([], Name: OpenDate, dtype: object)

## Close Date

For the close date let's apply the same strategy. Let's first identify the range of the dates.

In [67]:
on_t = lambda x: x.split('T')[0]

test = pd.to_numeric(df.CloseDate.apply(on_t))
print(test.min())
print(test.max())

20120115
20140331


Next, let's see whether there are dates that are not in the correct format.

In [68]:
#checking dirty dates
a = r'^201[2-4](0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])T(0[0-9]|1[0-9]|2[0-4])[0-5][0-9][0-5][0-9]$'
df[~df.CloseDate.str.contains(a, regex=True)].CloseDate

Series([], Name: CloseDate, dtype: object)

There seem to be no dirty dates. But then again, let's check the month of February.

In [69]:
#searching for feb leap or wrong
# 2012 was a leap year; close dates run into 2014, so the year part is 201[2-4]
a = r'^201[2-4]02(29|3[01])$'
test = test.astype(str)
set(test[test.str.contains(a, regex=True)].tolist())

{'20120229'}

Again, 2012 was a leap year, so no worries. 

One major thing that still might be an issue is that we know the close date should always be after the open date. So next, let's check that.

## Checking whether close date is always after open date

In order to check this, it will be easier if we just compare the date parts of the two attributes.

In [70]:
on_t = lambda x: x.split('T')[0]

testO = pd.to_numeric(df.OpenDate.apply(on_t))
testC = pd.to_numeric(df.CloseDate.apply(on_t))

Now that we have these, let's join them together into another dataframe. But since join uses indexes, let's first check the indexes.

In [71]:
testO.index

RangeIndex(start=0, stop=25077, step=1)

In [72]:
testC.index

RangeIndex(start=0, stop=25077, step=1)

The indexes are identical, hence let's go ahead and join them together.

In [73]:
testO = testO.to_frame('Open')
testC = testC.to_frame('Close')

#join the data
testf = testO.join(testC, how='outer')

Now let's check whether the new dataframe is consistent with the old one.

In [74]:
#checking if data is consistent with original dataframe
display(testf.head())
display(df[['OpenDate', 'CloseDate']].head())

Unnamed: 0,Open,Close
0,20121103,20121203
1,20130108,20130408
2,20130726,20130924
3,20121214,20130314
4,20131025,20131224


Unnamed: 0,OpenDate,CloseDate
0,20121103T000000,20121203T000000
1,20130108T150000,20130408T150000
2,20130726T150000,20130924T150000
3,20121214T000000,20130314T000000
4,20131025T000000,20131224T000000


Now that we have the data, it is time to check whether there are any close dates before the open date.

In [75]:
testf[testf.Close - testf.Open < 0]

Unnamed: 0,Open,Close
6659,20130703,20130404
9568,20120306,20120205
11043,20131105,20131006
12473,20131222,20131023
19142,20120803,20120505
24206,20130622,20130324
24297,20130116,20121018
25039,20130708,20130608


Yes, there are a few. It seems that in order to fix them we just have to swap the values between the two columns.

First let's store a boolean mask of the rows where this issue occurs.

In [76]:
dirty_ind = testf.Close - testf.Open < 0

To actually do this, let's create a function and then use the stored mask to swap the columns in the dirty rows.

In [77]:
def replace_open_close(x):
    op = x['OpenDate']
    cl = x['CloseDate']
    x['OpenDate'] = cl
    x['CloseDate'] = op
    return x

df[dirty_ind] = df[dirty_ind].apply(replace_open_close, axis=1)

Now that it's cleaned, let's check the original data set.

In [78]:
df[dirty_ind]

Unnamed: 0,Id,Title,Location,ContractType,ContractTime,Company,Category,Salary per annum,SourceName,OpenDate,CloseDate
6659,67330693,Pensions Implementation Manager,Brighton,non-specified,permanent,non-specified,Accounting & Finance Jobs,35004,hays.co.uk,20130404T000000,20130703T000000
9568,68091856,Manager Technical Operations,Shepton Mallet,non-specified,permanent,Index Recruitment,IT Jobs,62004,totaljobs.com,20120205T000000,20120306T000000
11043,68290941,Planning Enforcement officer,London,full-time,contract,Randstad,Engineering Jobs,39360,jobs.planningresource.co.uk,20131006T000000,20131105T000000
12473,68360715,Penetration Tester / Reverse Engineer,UK,non-specified,contract,Vertex Solutions,IT Jobs,76908,cwjobs.co.uk,20131023T150000,20131222T150000
19142,68729089,Registered Home Manager Loughborough,Loughborough,full-time,non-specified,Compass Associates Ltd,Healthcare & Nursing Jobs,42504,staffnurse.com,20120505T150000,20120803T150000
24206,69173205,"User Interface/UI Developer JavaScript, jQuer...",London,non-specified,permanent,Dawson & Walsh,IT Jobs,45000,totaljobs.com,20130324T150000,20130622T150000
24297,69181738,Registered General Nurse (RGN)/ Registered Men...,Poole,full-time,non-specified,Executive Care Group Ltd,Healthcare & Nursing Jobs,24960,staffnurse.com,20121018T000000,20130116T000000
25039,69228213,"SCIENTIFIC (PHYSICS, OR CHEMISTRY, OR ELECTRON...",Warrington,full-time,permanent,non-specified,Engineering Jobs,24996,fish4.co.uk,20130608T120000,20130708T120000


It seems to be clean now. Let's move on.
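
The check in this section compared only the date parts. As a hedged sketch of the same idea on full timestamps, the pair can be parsed with `datetime.strptime` and put in order. The helper `order_dates` is hypothetical, and it assumes both values already follow the `YYYYMMDDTHHMMSS` format.

```python
from datetime import datetime

FMT = '%Y%m%dT%H%M%S'


def order_dates(open_stamp, close_stamp):
    """Hypothetical helper: return (open, close) with the earlier timestamp first."""
    if datetime.strptime(close_stamp, FMT) < datetime.strptime(open_stamp, FMT):
        return close_stamp, open_stamp
    return open_stamp, close_stamp


print(order_dates('20130703T000000', '20130404T000000'))
```

Because the time part is compared as well, this would also catch a close time earlier than the open time on the same day.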

## Title

As we saw earlier, the titles were all unique. However, let's see whether the column is clean.

Let's first eyeball the data.

In [79]:
df.Title

0                              Engineering Systems Analyst
1                                  Stress Engineer Glasgow
2                         Modelling and simulation analyst
3        Engineering Systems Analyst / Mathematical Mod...
4               Pioneer, Miser Engineering Systems Analyst
5                  Trainee Mortgage Advisor  East Midlands
6                         PROJECT ENGINEER, PHARMACEUTICAL
7        Chef de Partie  Award Winning Restaurant  Exce...
8                                         Quality Engineer
9        Chef de Partie  Award Winning Dining  Live In ...
10            Senior Fatigue and Damage Tolerance Engineer
11                                     C I Design Engineer
12                                 Lead Engineers (Stress)
13         Relief Chef de Partie  Croydon, Surrey  Live in
14             Senior Control and Instrumentation Engineer
15                    Control and Instrumentation Engineer
16                               Electrical / ICA Engine

Just by glancing at the values we can see the following:
- The titles are not only job titles but contain extra information regarding the position as well.
- There are special characters included in the title:
    - stars
    - hyphens
    - slashes
    - etc.

Let's first try to find the special characters.

In [80]:
#looking for characters other than words, -, ', \s.
a = r".*[^a-zA-Z\s.\-\\']+.*"
[re.search(a, str(x)).group() for x in df.Title.tolist() if re.search(a, str(x))]

['Engineering Systems Analyst / Mathematical Modeller',
 'Pioneer, Miser Engineering Systems Analyst',
 'PROJECT ENGINEER, PHARMACEUTICAL',
 'Lead Engineers (Stress)',
 'Relief Chef de Partie  Croydon, Surrey  Live in',
 'Electrical / ICA Engineer',
 'Pastry Chef for **** red star **** rosette hotel  ****',
 'CHEF DE PARTIE POSITION IN **** ROSETTE HOTEL NYORKS  ****k',
 'Senior Sous Chef for **** rosette kitchen, up to ****',
 'General Manager  Funky, Cool Restaurant Concept  London  ****k ',
 'C/C++ Developer',
 'Welwyn Chef de Partie does it get any better than this? ****',
 'Pastry Chef AL**** ****AA Rosette Restaurant',
 'Fine Dining Chefs for rosette restaurants  up to ****',
 'Software Engineer/Mathematical Modeller',
 'Field Sales Executive / Business Development  Wide format printing',
 'FIELD SALES ENGINEER / SALES ACCOUNT MANAGER  PLASTICS',
 'C / C++ SOFTWARE ENGINEERS (All levels)',
 'Chef De Partie up to ****  Tips Ipswich Outskirts',
 'FB Assistant  amazing Hotel, unique

We can see that the asterisks are used for a lot of different purposes.
- Some seem to be salaries or bonuses or tips that are hidden 
    - `**** p/h`
    - `**** K`
    - `**** to ****`
    - `**** Tips`
    - `**** BONUS`
- Some seem to stand for something other than the ones mentioned
    - `**** x`
    - `x ****`
    - `**** RED STAR  ****`
    
There seems to be no general pattern. Removing the stars would leave a lot of titles reading oddly, or would remove nearly the whole text, as in `**** RED STAR  **** ROSETTE KITCHEN  **** to ****`.
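
One way to look for a pattern is to tally the word that directly follows each `****`. The following is only an exploratory sketch in plain Python. The helper `star_contexts` is hypothetical, and an empty key means that nothing word-like followed the stars.

```python
import re
from collections import Counter

# four stars, optional whitespace, then an optional word
STAR_RE = re.compile(r'\*{4}\s*(\w+)?')


def star_contexts(titles):
    """Hypothetical helper: count the word that follows each '****' in the titles."""
    counts = Counter()
    for title in titles:
        for match in STAR_RE.finditer(str(title)):
            counts[(match.group(1) or '').lower()] += 1
    return counts


sample = ['Chef De Partie up to ****  Tips Ipswich Outskirts',
          'CHEF DE PARTIE POSITION IN **** ROSETTE HOTEL NYORKS  ****k']
print(star_contexts(sample))
```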

Another way could be to replace the `*` with salaries where we see `p/h` or `K`, but since some of these could be referring to bonuses, tips, etc., imputing the wrong value would result in incorrect data.

The point of data cleaning in general is to clean the data without removing extra information that might be important. There are a few things to consider:
- Cleaning the `*` might cause incorrect cleaning, e.g. replacing it with the wrong thing.
- Titles are currently unique. This could be deliberate, and they might need to stay unique. Cleaning the stars might create duplicates, as we don't know why the stars were introduced in the first place.
- Maybe the stars in some cases represent truncation, for example if the title was longer than the length allowed in the database and `****` stands in for `...` as a continuation mark.
- The database might be linked to an external application that picks up its values from the dataset. Maybe the stars were introduced because this information should not be shown in the app. Cleaning the `*` would then cause issues.

The point is that the `*` cannot be cleaned accurately without either introducing more noise into the dataset or defeating the actual purpose of the stars. For issues like these it is always best to sit with the client and understand how to clean the data before changing the original dataset.

Another issue is that there is quite a lot of information in the title that is not captured anywhere else in the data set, as there is no attribute that holds it, e.g. `Graduate Software Engineer  C, C++, Java, UML, OOD`. Removing this information would result in loss of data. One solution could have been to extract only the job titles into a new column while leaving the remaining information in an attribute called Details.

One thing that can be done without losing information, however, is cleaning formatting errors such as double spacing. Hence let's try to do that.
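
For a single title, collapsing runs of whitespace can be sketched with `re.sub`. The helper name `collapse_spaces` is made up for this illustration. It leaves single spaces, including a single trailing space, untouched.

```python
import re


def collapse_spaces(title):
    """Hypothetical helper: replace every run of two or more whitespace characters with one space."""
    return re.sub(r'\s{2,}', ' ', title)


print(collapse_spaces('Trainee Mortgage Advisor  East Midlands'))
```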

In [81]:
test = df.iloc[:,1:2]

In [82]:
#looking for titles that contain the word 'for'
a = r".*\sfor\s.*"
[re.search(a, str(x)).group() for x in df.Title.tolist() if re.search(a, str(x))]

['Pastry Chef for **** red star **** rosette hotel  ****',
 'Senior Sous Chef for **** rosette kitchen, up to ****',
 'Fine Dining Chefs for rosette restaurants  up to ****',
 'Health Care Assistants Needed for Nursing Homes in Eastbourne, Sussex',
 'Nurses needed for Acute Hospital in Redhill, Surrey',
 'Nurses needed for Hospitals in Haywards Heath, Sussex',
 'RGNs for Doncaster',
 'Scrub Nurses needed for mobile theatres',
 'Senior Care Consultant Regional Roles Available See Job Description for Locations',
 'Open Day for Care Assistant in Kent',
 'RGN Staff Nurses for Nursing Home Southall',
 'Junior SousBib Gourmand restaurant pushing for Michelin star',
 'Senior Prac for the Court Team',
 'SAS Data Developer Wanted for MIS Team',
 'Experience RGN/RMN needed for stunning nursing home Bridgend area',
 'Clinical Manager for Bournemouth Nursing Home',
 'Media Sales Outdoor Advertisement for a UK Top 100 company',
 'Sous Chef needed for gorgeous gastro pub Up to ****  tips',
 'Pastry 

In [83]:
#looking for titles with two or more consecutive whitespace characters
a = r".*\s{2,}.*"
[re.search(a, str(x)).group() for x in df.Title.tolist() if re.search(a, str(x))]

['Trainee Mortgage Advisor  East Midlands',
 'Chef de Partie  Award Winning Restaurant  Excellent Tips',
 'Chef de Partie  Award Winning Dining  Live In  Share of Tips',
 'Relief Chef de Partie  Croydon, Surrey  Live in',
 'Pastry Chef for **** red star **** rosette hotel  ****',
 'CHEF DE PARTIE POSITION IN **** ROSETTE HOTEL NYORKS  ****k',
 'General Manager  Funky, Cool Restaurant Concept  London  ****k ',
 'Fine Dining Chefs for rosette restaurants  up to ****',
 'Field Sales Executive / Business Development  Wide format printing',
 'FIELD SALES ENGINEER / SALES ACCOUNT MANAGER  PLASTICS',
 'Sales Negotiator  Birmingham',
 'Mechanical Engineer  Design and Substantiation',
 'Chef de Partie  Award Winning Fine Dining Restaurant  Straights',
 'Chef De Partie up to ****  Tips Ipswich Outskirts',
 'FB Assistant  amazing Hotel, unique dining ****benefits',
 'Pastry Demi Chef de Partie  Luxury Award Winning Hotel  Free Live In',
 'Senior Structural Integrity Engineer  Defence',
 'Chef De 

In [84]:
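#collapse runs of whitespace into a single space
#(raw string for the pattern; regex=True is explicit because newer pandas versions treat the pattern literally by default)
df.Title = df.Title.str.replace(r'\s{2,}', ' ', regex=True)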

In [85]:
#checking that no runs of whitespace remain
a = r".*\s{2,}.*"
[re.search(a, str(x)).group() for x in df.Title.tolist() if re.search(a, str(x))]

[]

Let's see what other special characters are in the title.

In [86]:
#looking for dirty values
a = r".*([^a-zA-Z0-9\s.])+.*"
set([re.search(a, str(x)).group(1) for x in df['Title'].tolist() if re.search(a, str(x))])

{'%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '/',
 ':',
 ';',
 '?',
 '[',
 ']',
 '_',
 '`',
 '|',
 '–',
 '—',
 '’',
 '…'}
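
Note that the pattern above starts with a greedy `.*`, so `group(1)` holds only the last special character of each title. A sketch that collects every special character uses `re.findall` instead. The names `SPECIAL` and `special_chars` are hypothetical.

```python
import re

# anything that is not a letter, a digit, whitespace or a full stop
SPECIAL = re.compile(r"[^a-zA-Z0-9\s.]")


def special_chars(titles):
    """Hypothetical helper: set of all special characters found in the titles."""
    found = set()
    for title in titles:
        found.update(SPECIAL.findall(str(title)))
    return found


print(sorted(special_chars(['C/C++ Developer', 'Lead Engineers (Stress)'])))
```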

In [87]:
#looking for dirty values
a = r".*([^a-zA-Z0-9\s.])+.*"
b = ''.join(list(set([re.search(a, str(x)).group(1) for x in df['Title'].tolist() if re.search(a, str(x))])))
# c = "[–?+:'\]’[&)_/,|…*;(%`—]"
c = "[?%…]"
df[df.Title.str.contains(c)].Title

26       Welwyn Chef de Partie does it get any better t...
3902     Digital Brand Manager, FMCG, London ****k plus...
4208                           Senior Seeking Progression?
4644     Community Occupational Therapist ??? Leicester...
4649             Community Physiotherapist ??? Bournemouth
4718     SalesForcecom Consultant Global IT Services **...
5005       Proposal Manager/Cambridgeshire/****k 10% bonus
5236     Home Manager RGN/RMN/RNLD Peterborough ****k p...
5731            No Experience? No Problem Immediate Starts
6703     Telesales Experience? Junior Brokers LONDON **...
7036     Receptionist De Vere Venues Ltd ?? Wokefield Park
7986         Test Analyst Bath up to ****K up to 20% bonus
8548      Graduate Data Executive SQL / SAS Skills? London
8558            New Start for 2013? Sales Customer Service
8708     Fancy Becoming A Broker?? Bromley ****K Basic ...
8759     Client Support Technician Fleet twentyfour/sev...
8795                   Can you handle this UK/US tax rol

There are a lot of special characters in the title, but most of them do make sense, e.g. `+` for the `C++` language, brackets used in sentences, `%` for e.g. a `10%` bonus, and `…` for an incomplete title. One that does not make much sense is `?`. It is used correctly in a lot of cases where the title contains a catch phrase, such as `Experienced Brokers? Fancy a change of Commodi...`, but in other cases such as `?Classic? IT Business Analyst Logistics, Retai...` or `Community Occupational Therapist ??? Leicester...` it doesn't really make sense.

Again, these could carry a special meaning, but that is very unlikely. Also, if we remove them, there will be certain titles where the `?` did make sense and the phrasing will read less well. But overall we gain more from cleaning than we lose. Hence we can go ahead and remove all `?`.


In [88]:
#remove every literal ? (regex=False so that ? is not read as a regex quantifier)
df.Title = df.Title.str.replace('?', '', regex=False)

In [89]:
#looking for dirty values
a = r".*(\?).*"
set([re.search(a, str(x)).group(1) for x in df['Title'].tolist() if re.search(a, str(x))])


set()

After this the title is as clean as it can be. Hence, we move on. 

## 5. write back

Now that the data is clean we can write the cleaned dataset back. 

In [90]:
df.to_csv('dataset1_solution.csv')

In [91]:
len(df)

25077

## 6. Summary
This assessment measured the understanding of basic text file processing techniques in the Python programming language. The main outcomes achieved while applying these techniques were:

Data wrangling and cleaning consumed quite a lot of time. Some of the dirty data was easily identified due to clear format issues. However, a lack of data, or the worry of losing data or adding more noise to it, complicates the decision of how to deal with dirty values.

For example, in the case of the title there seemed to be no pattern, and hence trying to clean it altogether would definitely have added more noise to the data. In this case cleaning would have taken a long time, as we would have had to understand fully what the variable values (such as `****`) mean.

All in all, this exercise shows that cleaning and filtering the data at the collection stage would have been much better than acquiring dirty data and then trying to clean it.