# Natural Language Processing

## Basic Text Pre-processing

#### Notebook By: PALLAVI BHIMTE


Version: 1.0

Environment: Python 3 and Google Colab Jupyter notebook

Libraries used:
* pandas
* re
* numpy
* nltk
* pickle
* os
* sklearn
* itertools


## Data Description:
A large collection of job advertisement documents is provided, containing approximately 50,000 documents. The data folder has 8 subfolders, one per job category, and each document belongs to one of these categories: IT, Accounting_Finance, Engineering, Healthcare_Nursing, Hospitality Catering, PR Advertising Marketing, Sales, or Teaching.

## Introduction
In Task 1, basic pre-processing steps are applied to make the description of each job advertisement clean and ready for further analysis. This notebook covers extracting each job advertisement and cleaning its description by:

* tokenising the text with the regular expression `r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"`
* converting each word to lower case
* removing words with fewer than 2 characters
* removing stopwords using the provided stop words list
* removing words that appear only once in the document collection, based on term frequency
* removing the top 50 most frequent words, based on document frequency
* extracting the top 10 bigrams, based on term frequency

The three output files:
1. vocab.txt saves the unigram vocabulary
2. bigram.txt saves the bigrams found in the whole document collection together with their term frequency
3. job_ads.txt saves the pre-processed description text for every job advertisement document, together with its ID, category, title, and webindex.
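
As a rough illustration of how job_ads.txt can be consumed downstream, the sketch below groups its flat `Key: value` lines into one dictionary per advertisement. It assumes that every record starts with an `ID:` line, as in the listings printed later in this notebook. The helper name `group_records` and the sample values are made up for the example.

```python
def group_records(lines):
    """Group flat 'Key: value' lines into one dict per advertisement.

    Assumption: every record begins with an 'ID:' line.
    """
    records = []
    for line in lines:
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip blank or malformed lines
        if key == "ID":
            records.append({})  # a new advertisement starts here
        if records:
            records[-1][key] = value
    return records


# Toy lines in the same layout (values are illustrative only)
sample = [
    "ID: 00001",
    "Category: Engineering",
    "Title: Plant Engineer",
    "Webindex: 11111111",
    "Description: water treatment plants",
    "ID: 00002",
    "Category: Healthcare_Nursing",
    "Title: Residential Care Worker",
    "Webindex: 22222222",
    "Description: children homes care",
]
records = group_records(sample)
print(len(records), records[1]["Category"])
```

Splitting on the first `": "` only keeps any later colons inside a title or description intact.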


## Importing libraries 

In [1]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!unzip "/content/drive/My Drive/data.zip"

Streaming output truncated to the last 5000 lines.
  inflating: data/Engineering/Job_18156.txt  
  inflating: __MACOSX/data/Engineering/._Job_18156.txt  
  inflating: data/Engineering/Job_19248.txt  
  inflating: __MACOSX/data/Engineering/._Job_19248.txt  
  inflating: data/Engineering/Job_21258.txt  
  inflating: __MACOSX/data/Engineering/._Job_21258.txt  
  inflating: data/Engineering/Job_17932.txt  
  inflating: __MACOSX/data/Engineering/._Job_17932.txt  
  inflating: data/Engineering/Job_20146.txt  
  inflating: __MACOSX/data/Engineering/._Job_20146.txt  
  inflating: data/Engineering/Job_22037.txt  
  inflating: __MACOSX/data/Engineering/._Job_22037.txt  
  inflating: data/Engineering/Job_15843.txt  
  inflating: __MACOSX/data/Engineering/._Job_15843.txt  
  inflating: data/Engineering/Job_14585.txt  
  inflating: __MACOSX/data/Engineering/._Job_14585.txt  
  inflating: data/Engineering/Job_20620.txt  
  inflating: __MACOSX/data/Engineering/._Job_20620.txt  
  inflat

In [3]:
# required libraries
import os
import pickle
import re
from itertools import chain

import nltk
from nltk import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.util import ngrams
from sklearn.datasets import load_files

# sent_tokenize needs the punkt sentence tokenizer models
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### 1. Examining and loading data


In [4]:
rawData = load_files(r"data", load_content=True, encoding= 'utf-8')

In [6]:
# storing the stopwords from stopwords_en.txt in a set for fast membership tests
with open('/content/drive/My Drive/stopwords_en.txt', encoding='utf-8') as f:
    stopwords = set(f.read().splitlines())

In [8]:
# create list with ID
def getNewList(rawData):
  # initialise list
  lines = list()
  # regex to get digits of ID
  regex = re.compile(r'\d+')
  for index, line in enumerate(rawData.data):
    # find digits in the file name only, so digits elsewhere in the path are ignored
    job_id = ''.join(regex.findall(os.path.basename(rawData.filenames[index])))
    # appending ID with the required format
    lines.append("ID: " + job_id)
    # the parent folder name is the job category
    category = os.path.basename(os.path.dirname(rawData.filenames[index]))
    lines.append("Category: " + category)
    # append every line of the document (Title, Webindex, Company, Description)
    for text_line in line.split("\n"):
      lines.append(text_line)

  return lines

# getNewList
newlist = getNewList(rawData)
newlist[1:10]

['Category: Engineering',
 'Title: Plant Engineer',
 'Webindex: 62119057',
 'Company: W5 Recruitment',
 "Description: Our client has established itself as a leading manufacturer and supplier of quality water treatment plants, ranging from basic water softeners and reverse osmosis equipment to customer specified complex water treatment solutions. The company are able to meet their clients' requirements through flexibility in tailoring their product to their needs and budgets. Due to expansion and an increased workload they are seeking to recruit a Planet Engineer to cover accounts along the M4 Corridor Responsibilities will include conducting the routine sampling and analysis of water systems, interpreting results, maintenance and the installation of chemical dosing systems. Servicing accounts within both the industrial and commercial industries the successful candidate will complete all work in accordance to the approved code of practice. The ideal applicant for this position will have

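The ID and the category both come from the file path. The sketch below shows that extraction in isolation, assuming paths of the form `data/<category>/Job_<id>.txt` as seen in the unzip output. `parse_job_path` is a hypothetical helper for illustration and is not part of the notebook.

```python
import os
import re


def parse_job_path(path):
    """Return (job_id, category) for a path like data/<category>/Job_<id>.txt."""
    filename = os.path.basename(path)
    # search the file name only, so digits in parent folders are ignored
    match = re.search(r"\d+", filename)
    job_id = match.group() if match else ""
    # the parent folder name is the job category
    category = os.path.basename(os.path.dirname(path))
    return job_id, category


print(parse_job_path("data/Engineering/Job_18156.txt"))
# A folder name containing digits does not leak into the ID
print(parse_job_path("run2/data/Engineering/Job_18156.txt"))
```

Restricting the digit search to the file name keeps the ID stable even if the data folder is moved to a location whose path contains numbers.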
### 2. Tokenize each job advertisement description

In [None]:
def tokenizeDescription(raw_description):
  # lower-case the text (lowerCase keeps the leading "D" of "Description" capitalised)
  raw_description = lowerCase(raw_description)
  # split the description into sentences
  sentences = sent_tokenize(raw_description)
  # required regex pattern for the tokens
  pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
  # tokenizer built from the above regex pattern
  tokenizer = RegexpTokenizer(pattern)
  # tokenize each sentence into a list of words
  token_lists = [tokenizer.tokenize(sen) for sen in sentences]
  # merge the per-sentence lists into a single list of tokens
  tokens = list(chain.from_iterable(token_lists))
  return tokens
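
To see what the pattern keeps and drops, the sketch below applies it to a made-up sentence with `re.findall` from the standard library. This is a stand-in for `RegexpTokenizer`, on the assumption that both yield the same matches for a pattern without capturing groups.

```python
import re

# Same pattern as in tokenizeDescription
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Made-up, already lower-cased sample text
text = "full-time role, client's state-of-the-art c++ team"
tokens = re.findall(pattern, text)
print(tokens)
# → ['full-time', 'role', "client's", 'state-of', 'the-art', 'c', 'team']
```

The optional group allows at most one hyphen or apostrophe per token, so `state-of-the-art` is split into `state-of` and `the-art`. Digits and symbols are never matched, which is why `c++` is reduced to `c` (and later dropped as a one-character word).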

### 3. Convert words to lower case

In [None]:
def lowerCase(raw_description):
  # convert all characters in the description to lower case
  raw_description = raw_description.lower()
  # re-capitalise the first character (the 'd' of "description") to meet the
  # required format; slicing keeps this safe for an empty string
  my_string = raw_description[:1].upper() + raw_description[1:]
  return my_string

### 4. Remove words with length less than 2

In [None]:
def remove_short_words(tokens):
  # list without short words
  new_tokens = list()
  for t in tokens:
    if len(t) >= 2:
      # append words with length of at least 2
      new_tokens.append(t)
  return new_tokens

### 5. Remove stopwords using the provided stop words list

In [None]:
def removeStopWords(match):
  # initialise list
  list_without_stopwords = list()
  for word in match:
    if word not in stopwords:
      # append words which are not in the stopwords list
      list_without_stopwords.append(word)
  return list_without_stopwords

### 6. Remove words that appear only once in the document collection (TERM FREQUENCY)

In [None]:
def findTermFreqWords(descList):
  # flatten the per-document token lists into one list of tokens
  allDescWordList = list(chain.from_iterable(descList))
  # calculating the term frequency distribution
  term_fd = FreqDist(allDescWordList)
  # set of words that occur exactly once in the whole collection (hapaxes)
  lessFreqWords = set(term_fd.hapaxes())
  return lessFreqWords

def removeLessFreqWordTokens(tokens, lessFreqWords):
  # looping through and returning a list without less frequent words
  return [w for w in tokens if w not in lessFreqWords]
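
A minimal sketch of the same idea on a toy corpus, using `collections.Counter` from the standard library as a stand-in for `FreqDist`. The three documents are invented for the example.

```python
from collections import Counter
from itertools import chain

# Toy corpus: one token list per document
docs = [
    ["water", "treatment", "engineer"],
    ["water", "engineer", "microbrewery"],
    ["care", "worker", "care"],
]

# Term frequency: total occurrences over the whole collection
term_freq = Counter(chain.from_iterable(docs))

# Hapaxes are the words whose total count is exactly 1
hapaxes = {word for word, count in term_freq.items() if count == 1}

# Drop the hapaxes from every document
filtered = [[w for w in doc if w not in hapaxes] for doc in docs]
print(sorted(hapaxes))
print(filtered)
```

Note that `care` survives although it occurs in only one document: term frequency counts every occurrence, so two mentions in the same advertisement are enough to keep a word.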


### 7. Remove the top 50 most frequent words (DOCUMENT FREQUENCY)

In [None]:
def removeDocFreqWords(tokens, top50words):
  # looping through and returning a list without top 50 most frequent words
  return [w for w in tokens if w not in top50words]
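
Document frequency counts the number of documents a word appears in, not its total number of occurrences. The sketch below contrasts the two on an invented toy corpus, again with `collections.Counter` standing in for `FreqDist`.

```python
from collections import Counter
from itertools import chain

# Toy corpus: one token list per document
docs = [
    ["experience", "experience", "sql"],
    ["experience", "team", "sql"],
    ["experience", "team", "chef", "chef", "chef"],
    ["experience", "sql"],
]

# Term frequency counts every occurrence
term_freq = Counter(chain.from_iterable(docs))

# Document frequency counts each word at most once per document
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc))

# The two words that appear in the most documents
top_words = {word for word, _ in doc_freq.most_common(2)}

print(term_freq["chef"], doc_freq["chef"])
print(sorted(top_words))
```

Here `chef` occurs three times but only in one document, so it ranks high by term frequency and low by document frequency. Converting each document to a set before counting is what makes the difference.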

### Executing all above functions

In [None]:
# function for task 1-5
def main1(newlist):
  # initialise list
  newList2 = list()
  
  # looping through all the lines of list
  for match in newlist:
    # condition to check if the line starts with "Description"
    if match.startswith("Description"):
      # tokenize each description string
      tokenized = tokenizeDescription(match)

      # remove short words from the above tokens
      tokenized = remove_short_words(tokenized)
      
      # remove stop words from the above tokens
      match1 = removeStopWords(tokenized) 

      # detokenizing the processed tokens
      match2 = TreebankWordDetokenizer().detokenize(match1)

      # add colon after Description keyword for required format
      match2 = match2[:11] + ":" + match2[11:]

      # append the string to the list
      newList2.append(match2)

    # condition to check if the line starts with "Company"
    elif match.startswith("Company"):
      # skipping as Company is not required
      continue
    # appending all other lines
    else:
      newList2.append(match)
  return newList2


newList2 = main1(newlist)

In [None]:
newList2[1:10]

['Category: Engineering',
 'Title: Plant Engineer',
 'Webindex: 62119057',
 'Description: client established leading manufacturer supplier quality water treatment plants ranging basic water softeners reverse osmosis equipment customer complex water treatment solutions company meet clients requirements flexibility tailoring product budgets due expansion increased workload seeking recruit planet engineer cover accounts corridor responsibilities include conducting routine sampling analysis water systems interpreting results maintenance installation chemical dosing systems servicing accounts industrial commercial industries successful candidate complete work accordance approved code practice ideal applicant position minimum years relevant industry experience knowledge reverse osmosis water softeners water filters uv equipment full uk driving license essential return client offering competitive benefits salary package ideal candidate',
 'ID: 31567',
 'Category: Healthcare_Nursing',
 'Title:

In [None]:
# function to return tokens of all the description
def getAllDescList(newlist):
  # initialise list
  desList = list()

  # looping through all the lines of list
  for match in newlist:
    # condition to check if the line starts with "Description"
    if match.startswith("Description"):
      # tokenize each description string
      tokens = tokenizeDescription(match)
      tokens.remove('Description')
      # appending all the tokens
      desList.append(tokens)

  # print("desList: getAllDescList  ", desList)
  return desList

In [None]:
# function to iterate through the processed list and remove less frequent words
def removeLessFreqWords(prevList):
  newList = list()
  allDescListTokens = getAllDescList(prevList)
  lessFreqWords = findTermFreqWords(allDescListTokens)
  # print("lessFreqWords:--------",lessFreqWords)
  
  # looping through all the lines of list
  for match in prevList:
    # condition to check if the line starts with "Description"
    if match.startswith("Description"):
      # tokenize each description string
      tokenized = tokenizeDescription(match)
      
      # remove less frequent words from the above tokens
      lessFreqWordsRemoved = removeLessFreqWordTokens(tokenized, lessFreqWords) 
      
      # detokenizing the processed tokens
      detokenized = TreebankWordDetokenizer().detokenize(lessFreqWordsRemoved)
      
      # add colon after Description keyword for required format
      detokenized = detokenized[:11] + ":" + detokenized[11:]
      
      # append the string to the list
      newList.append(detokenized)      
    
    # condition to check if the line starts with "Company"
    elif match.startswith("Company"):
      # skipping as Company is not required
      continue
    # appending all other lines   
    else:    
      newList.append(match)

  return newList

In [None]:
# run function to remove less frequent words
ListWithoutLessFreq = removeLessFreqWords(newList2)

lessFreqWords:-------- {'nonvendor', 'microbrewery', 'financialplannerbirmingham', 'aunch', 'asapref', 'salesexecutivefieldsalesareasales', 'ibmi', 'managerexciting', 'salisburyour', 'invernessscotland', 'researchinformed', 'unorthodox', 'serviceptr', 'yorksalary', 'delegatedauthoritydataanalyst', 'cwc', 'nurser', 'stricktest', 'liant', 'maintenanceengineershiftengineerelectricalengineer', 'foundationlevel', 'developerlinux', 'swbacademyhays', 'brightside', 'husky', 'presentationsmanage', 'acareportingaccountant', 'dervers', 'pastrychefal', 'abitity', 'nationallyrecognised', 'metown', 'polarrecruitment', 'haematologists', 'systme', 'bonusduration', 'enob', 'applcations', 'encashments', 'newbusinesssalesaccountexective', 'ecommercebusinessanalystretailbackground', 'responsibilitiespa', 'hippa', 'graduateitsales', 'fixedincomesingledealersupportanalyst', 'accountnegotiator', 'multivenue', 'jobsfairs', 'primaryschoolteacherleeds', 'investmentmanagementsales', 'secteur', 'nickaaslrecruitme

### Top 50 words

In [None]:
# run function to get all description tokens in a list
allDescListTokens = getAllDescList(ListWithoutLessFreq)

In [None]:
# This is a nested list of all description tokens
allDescListTokens[1:3]

[['timeout',
  'children',
  'homes',
  'rapidly',
  'expanding',
  'company',
  'forefront',
  'therapeutic',
  'care',
  'young',
  'people',
  'aged',
  'years',
  'experienced',
  'emotional',
  'behavioural',
  'difficulties',
  'lives',
  'recruit',
  'residential',
  'care',
  'workers',
  'based',
  'homes',
  'swindon',
  'area',
  'successful',
  'candidates',
  'work',
  'collaboratively',
  'cooperatively',
  'timeout',
  'staff',
  'young',
  'people',
  'external',
  'agencies',
  'required',
  'work',
  'consultation',
  'families',
  'social',
  'workers',
  'yot',
  'professionals',
  'involved',
  'young',
  'person',
  'including',
  'education',
  'team',
  'deliver',
  'effective',
  'educational',
  'programmes',
  'successful',
  'applicants',
  'required',
  'provide',
  'enhanced',
  'disclosure',
  'disclosure',
  'expense',
  'met',
  'employer',
  'apply',
  'click',
  'apply',
  'button',
  'redirected',
  'site',
  'complete',
  'application',
  'form'],
 

In [None]:
# flatten the per-document sets of unique tokens, so each word is counted once per document
allDescWords = list(chain.from_iterable([set(w) for w in allDescListTokens]))
# document frequency for each unique word/type from the description
doc_fd = FreqDist(allDescWords)
# most_common returns the 50 words with the highest document frequency
doc_fd.most_common(50)

[('experience', 43644),
 ('role', 34680),
 ('work', 33684),
 ('team', 32585),
 ('working', 30714),
 ('skills', 30412),
 ('client', 26899),
 ('job', 25552),
 ('business', 24739),
 ('uk', 24133),
 ('excellent', 22982),
 ('opportunity', 22678),
 ('company', 22263),
 ('management', 20620),
 ('required', 20555),
 ('development', 20223),
 ('apply', 20133),
 ('based', 19333),
 ('successful', 19118),
 ('join', 18682),
 ('www', 18421),
 ('salary', 18402),
 ('cv', 18383),
 ('support', 18286),
 ('knowledge', 17844),
 ('strong', 16475),
 ('environment', 16408),
 ('posted', 16398),
 ('jobseeking', 16342),
 ('candidate', 16304),
 ('originally', 16294),
 ('leading', 16194),
 ('high', 15922),
 ('service', 15623),
 ('manager', 15587),
 ('good', 15252),
 ('ability', 15154),
 ('including', 14857),
 ('position', 14564),
 ('services', 14501),
 ('benefits', 14434),
 ('training', 14218),
 ('essential', 13915),
 ('experienced', 13826),
 ('key', 13567),
 ('contact', 13551),
 ('level', 13523),
 ('recruitment', 

In [None]:
# list of top 50 word tokens
top50words = set(w[0] for w in doc_fd.most_common(50))
top50words

{'ability',
 'apply',
 'based',
 'benefits',
 'business',
 'candidate',
 'candidates',
 'client',
 'company',
 'contact',
 'cv',
 'development',
 'environment',
 'essential',
 'excellent',
 'experience',
 'experienced',
 'good',
 'high',
 'including',
 'job',
 'jobseeking',
 'join',
 'key',
 'knowledge',
 'leading',
 'level',
 'management',
 'manager',
 'opportunity',
 'originally',
 'position',
 'posted',
 'provide',
 'recruitment',
 'required',
 'role',
 'salary',
 'service',
 'services',
 'skills',
 'strong',
 'successful',
 'support',
 'team',
 'training',
 'uk',
 'work',
 'working',
 'www'}

In [None]:
# function to remove top 50 words from the description
def removeTop50Words(prevList, top50words):
  #  initialise list
  newList = list()

  # looping through all the lines of list
  for match in prevList:
    # condition to check if the line starts with "Description"
    if match.startswith("Description"):
      # tokenize each description string
      tokenized = tokenizeDescription(match)
      
      # run function to remove top 50 words based on document frequency
      topFiftyRemoved = removeDocFreqWords(tokenized, top50words)

      # detokenizing the processed tokens
      detokenized = TreebankWordDetokenizer().detokenize(topFiftyRemoved)
      
      # add colon after Description keyword for required format
      detokenized = detokenized[:11] + ":" + detokenized[11:]

      # append the string to the list
      newList.append(detokenized)
    
    # appending all other lines 
    else:    
      newList.append(match)

  return newList

In [None]:
# run function to remove top 50 words
ListWithoutTop50Words = removeTop50Words(ListWithoutLessFreq, top50words)

In [None]:
ListWithoutTop50Words[1:10]

['Category: Engineering',
 'Title: Plant Engineer',
 'Webindex: 62119057',
 'Description: established manufacturer supplier quality water treatment plants ranging basic water softeners reverse osmosis equipment customer complex water treatment solutions meet clients requirements flexibility tailoring product budgets due expansion increased workload seeking recruit planet engineer cover accounts corridor responsibilities include conducting routine sampling analysis water systems interpreting results maintenance installation chemical dosing systems servicing accounts industrial commercial industries complete accordance approved code practice ideal applicant minimum years relevant industry reverse osmosis water softeners water filters uv equipment full driving license return offering competitive package ideal',
 'ID: 31567',
 'Category: Healthcare_Nursing',
 'Title: Residential Care Worker',
 'Webindex: 66314490',
 'Description: timeout children homes rapidly expanding forefront therapeut

In [None]:
# storing the final list in a new variable
finalJobAds = ListWithoutTop50Words

In [None]:
# Pickle out finalJobAds list (the with-block closes the file so all data is flushed)
with open("/content/drive/My Drive/pickle-data/finalJobAds.pickle", "wb") as pickle_out:
  pickle.dump(finalJobAds, pickle_out)

In [None]:
# Pickle in finalJobAds list
with open("/content/drive/My Drive/pickle-data/finalJobAds.pickle", "rb") as pickle_in:
  finalJobAds = pickle.load(pickle_in)

In [None]:
finalJobAds[1:15]

['Category: Engineering',
 'Title: Plant Engineer',
 'Webindex: 62119057',
 'Description: established manufacturer supplier quality water treatment plants ranging basic water softeners reverse osmosis equipment customer complex water treatment solutions meet clients requirements flexibility tailoring product budgets due expansion increased workload seeking recruit planet engineer cover accounts corridor responsibilities include conducting routine sampling analysis water systems interpreting results maintenance installation chemical dosing systems servicing accounts industrial commercial industries complete accordance approved code practice ideal applicant minimum years relevant industry reverse osmosis water softeners water filters uv equipment full driving license return offering competitive package ideal',
 'ID: 31567',
 'Category: Healthcare_Nursing',
 'Title: Residential Care Worker',
 'Webindex: 66314490',
 'Description: timeout children homes rapidly expanding forefront therapeut

In [None]:
# collecting the description tokens of all job ads (saved to a txt file later)
finalTokens = getAllDescList(finalJobAds)
finalTokens[1:3]

[['timeout',
  'children',
  'homes',
  'rapidly',
  'expanding',
  'forefront',
  'therapeutic',
  'care',
  'young',
  'people',
  'aged',
  'years',
  'emotional',
  'behavioural',
  'difficulties',
  'lives',
  'recruit',
  'residential',
  'care',
  'workers',
  'homes',
  'swindon',
  'area',
  'collaboratively',
  'cooperatively',
  'timeout',
  'staff',
  'young',
  'people',
  'external',
  'agencies',
  'consultation',
  'families',
  'social',
  'workers',
  'yot',
  'professionals',
  'involved',
  'young',
  'person',
  'education',
  'deliver',
  'effective',
  'educational',
  'programmes',
  'applicants',
  'enhanced',
  'disclosure',
  'disclosure',
  'expense',
  'met',
  'employer',
  'click',
  'button',
  'redirected',
  'site',
  'complete',
  'application',
  'form'],
 ['french',
  'restaurant',
  'club',
  'gascon',
  'michelin',
  'established',
  'heart',
  'london',
  'easy',
  'access',
  'bus',
  'train',
  'tube',
  'chef',
  'de',
  'rang',
  'waiter',
  

### 8. Extract the top 10 Bigrams(TERM FREQUENCY)

In [None]:
# function to get bigrams from the list
def extractBigram(blist):
  #  initialise list
  tokenList = list()

  # looping through all the lines of list
  for match in blist:
    # condition to check if the line starts with "Description"
    if match.startswith("Description"):
      # tokenize each description string
      tokens = tokenizeDescription(match)
      tokens.remove('Description')

      # extend the tokens to the list
      tokenList.extend(tokens)
  
  # getting ngrams out of the total tokens
  bigrams = ngrams(tokenList, n = 2)

  # calculating the frequency distribution of bigrams
  fdbigram = FreqDist(bigrams)

  # fetching the top 10 most common bigrams
  bigrams = fdbigram.most_common(10) 

  return bigrams
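
Because `extractBigram` extends one shared token list before calling `ngrams`, a pair can also be formed from the last token of one advertisement and the first token of the next. If such cross-document pairs are not wanted, one possible alternative (not what this notebook does) is to count bigrams per document. The sketch below uses only the standard library; `count_bigrams_per_document` is a hypothetical helper name and the documents are toy data.

```python
from collections import Counter


def count_bigrams_per_document(token_lists):
    """Count adjacent token pairs without crossing document boundaries."""
    counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with its successor inside the same document
        counts.update(zip(tokens, tokens[1:]))
    return counts


docs = [
    ["sql", "server", "developer"],
    ["chef", "de", "partie"],
    ["sql", "server"],
]
counts = count_bigrams_per_document(docs)
print(counts.most_common(1))
# The pair spanning documents 1 and 2 is never counted
print(("developer", "chef") in counts)
```

On a large collection the two approaches should give very similar top-10 lists, since boundary pairs are rare compared with pairs inside a description.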

In [None]:
#  run function to get bigrams
bigram_found = extractBigram(ListWithoutTop50Words)
bigram_found

[(('employment', 'agency'), 8055),
 (('track', 'record'), 5472),
 (('acting', 'employment'), 5095),
 (('sql', 'server'), 4804),
 (('asp', 'net'), 4687),
 (('relation', 'vacancy'), 3977),
 (('sales', 'executive'), 3619),
 (('chef', 'de'), 3586),
 (('nursing', 'home'), 3503),
 (('de', 'partie'), 3396)]

### Vocabulary

In [9]:
# run function to get all description token list
vocab_list = getAllDescList(finalJobAds)

# put all tokens from the above corpus to a single list
words = list(chain.from_iterable(vocab_list))

# sorted vocabulary: the set removes duplicates, sorted returns an ordered list
vocab = sorted(set(words))

len(vocab)

40038


## Saving required outputs
Save the vocabulary, bigrams and job advertisement text as per the specification.
- vocab.txt
- bigram.txt
- job_ads.txt

In [None]:
# save vocab with required format (word:index)
def save_vocab(path, vocab):
  # enumerate gives each word its position without an O(n) vocab.index lookup
  with open(path, "w") as out_file:
    for index, word in enumerate(vocab):
      out_file.write(word + ":" + str(index) + "\n")

# save vocab with only words
def save_vocab_only_words(path, vocab):
  # one vocabulary word per line
  with open(path, 'w') as out_file:
    out_file.write("\n".join(vocab))

# save bigram with required format (word1 word2,frequency)
def save_bigram(path, bigrams):
  with open(path, "w") as out_file:
    for bg in bigrams:
      out_file.write(" ".join(bg[0]) + "," + str(bg[1]) + "\n")

# save jobAds with required format
def save_jobAds(path, finalJobAds):
  # one line per ID, Category, Title, Webindex and Description entry
  with open(path, 'w') as out_file:
    out_file.write("\n".join(finalJobAds))

# save all jobAds token
def save_tokensJobAd(path, finalTokens):
  # one job advertisement per line, tokens separated by spaces
  with open(path, 'w') as out_file:
    out_file.write("\n".join([" ".join(v) for v in finalTokens]))
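
For later tasks the vocabulary usually has to be read back. The sketch below loads a file in the `word:index` layout written by `save_vocab` into a dictionary. `load_vocab` is a hypothetical helper that is not part of the notebook, and the round trip uses a throwaway temporary file with toy words.

```python
import os
import tempfile


def load_vocab(path):
    """Read 'word:index' lines back into a dict mapping word -> int index."""
    vocab_index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore empty lines
            # split on the last colon; the index is always the final field
            word, _, index = line.rpartition(":")
            vocab_index[word] = int(index)
    return vocab_index


# Round trip on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    with open(path, "w", encoding="utf-8") as f:
        for i, word in enumerate(["ability", "chef", "full-time"]):
            f.write(f"{word}:{i}\n")
    loaded = load_vocab(path)

print(loaded)
```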

In [None]:
save_jobAds("/content/drive/My Drive/Task-2-and-3/job_ads.txt",finalJobAds)

In [None]:
save_vocab("/content/drive/My Drive/Task-2-and-3/vocab.txt",vocab)

In [None]:
save_vocab_only_words("/content/drive/My Drive/Task-2-and-3/words-desc-vocab.txt", vocab)

In [None]:
save_bigram("/content/drive/My Drive/Task-2-and-3/bigram.txt",bigram_found)

In [None]:
save_tokensJobAd("/content/drive/My Drive/Task-2-and-3/tokens_job_ads.txt", finalTokens)

## Summary
All the pre-processing steps were successfully performed on the job advertisements: tokenisation, lower-casing, and removal of stopwords, words shorter than 2 characters, words that appear only once, and the top 50 most frequent words.