# Introduction to Natural Language Processing and Deep Learning
## Text Pre-Processing, Count Vectorisation & Word2Vec
### DSF - Data Idols - Summer School 2021

Louise Kirkham (louise.kirkham@royalmail.com) and Paola Puglisi Paola Puglisi (paola.puglisi@royalmail.com), Data Science, Royal Mail

Contents:

1. Tokenisation
2. Word Normalisation - Stemming and Lemmatisation
3. Removing Stop Words
4. Count Vectoriastion (and N-grams)
5. TF-IDF (Term Frequency - Inverse Document Frequency) Transformation
6. Word2Vec Model


In [1]:
import warnings
warnings.filterwarnings('ignore')

### 1. Tokenisation

In [2]:
# Start with a basic string of text (taken from a TrustPilot review)

my_paragraph = "Another straightforward order via Royal Mail website. Needed some first class stamps for the office. Stamp selection clearly set out on website along with pricing. After placing the order, stamps arrived within 48 hrs. Great service, I'd use again."
my_paragraph

"Another straightforward order via Royal Mail website. Needed some first class stamps for the office. Stamp selection clearly set out on website along with pricing. After placing the order, stamps arrived within 48 hrs. Great service, I'd use again."

In [3]:
# Apply tokeniser... 

from nltk.tokenize import regexp_tokenize

tokens = regexp_tokenize(my_paragraph.lower(), "[\w']+")
print(tokens)

['another', 'straightforward', 'order', 'via', 'royal', 'mail', 'website', 'needed', 'some', 'first', 'class', 'stamps', 'for', 'the', 'office', 'stamp', 'selection', 'clearly', 'set', 'out', 'on', 'website', 'along', 'with', 'pricing', 'after', 'placing', 'the', 'order', 'stamps', 'arrived', 'within', '48', 'hrs', 'great', 'service', "i'd", 'use', 'again']


In [4]:
# Expand contracted words...

import contractions

expanded_tokens = []  
for word in tokens:
    expanded_tokens.extend((contractions.fix(word)).split()) 
    
print(expanded_tokens)

['another', 'straightforward', 'order', 'via', 'royal', 'mail', 'website', 'needed', 'some', 'first', 'class', 'stamps', 'for', 'the', 'office', 'stamp', 'selection', 'clearly', 'set', 'out', 'on', 'website', 'along', 'with', 'pricing', 'after', 'placing', 'the', 'order', 'stamps', 'arrived', 'within', '48', 'hrs', 'great', 'service', 'I', 'would', 'use', 'again']


### 2. Word Normalisation - Stemming and Lemmatisation

In [5]:
# Apply stemmer to each of our tokens...

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

stemmed_tokens = [stemmer.stem(word) for word in expanded_tokens]
print(stemmed_tokens)

['anoth', 'straightforward', 'order', 'via', 'royal', 'mail', 'websit', 'need', 'some', 'first', 'class', 'stamp', 'for', 'the', 'offic', 'stamp', 'select', 'clear', 'set', 'out', 'on', 'websit', 'along', 'with', 'price', 'after', 'place', 'the', 'order', 'stamp', 'arriv', 'within', '48', 'hrs', 'great', 'servic', 'i', 'would', 'use', 'again']


In [6]:
# Apply Lemmatisation...

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemma_tokens = [lemmatizer.lemmatize(word) for word in expanded_tokens]
print(lemma_tokens)

['another', 'straightforward', 'order', 'via', 'royal', 'mail', 'website', 'needed', 'some', 'first', 'class', 'stamp', 'for', 'the', 'office', 'stamp', 'selection', 'clearly', 'set', 'out', 'on', 'website', 'along', 'with', 'pricing', 'after', 'placing', 'the', 'order', 'stamp', 'arrived', 'within', '48', 'hr', 'great', 'service', 'I', 'would', 'use', 'again']


### 3. Removing Stop Words

In [7]:
# Use NLTK library's dictionary of stopwords

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

print(stop_words)

{'yourselves', 'now', 'aren', 'do', 'off', 'why', 'on', 'whom', 'myself', 'into', "haven't", 'of', 'so', "you'd", 'are', 'himself', 'should', 'had', 'other', "couldn't", 'he', 'won', 'him', "wouldn't", 'my', 'doesn', 'when', 'you', 'with', 'against', 'before', 'it', 'in', 'how', 'll', 'just', 'where', 'isn', 'while', 'and', 'during', 'all', 'each', 't', 'them', 'from', 'yourself', 'because', 've', "it's", 'only', 'ours', 'themselves', 'itself', 'a', 'who', 'nor', 'is', 'by', 'most', 'been', 'about', 'couldn', 'mustn', 'own', 'wasn', 'after', 'ain', 'me', "didn't", "won't", 'don', 'does', 'or', 'again', "you're", 'very', 'doing', 'which', 'were', "hadn't", 're', 'at', 'such', 'i', 'over', "aren't", 'more', 'up', 'below', 'here', 'both', "mightn't", 'few', "hasn't", 'further', 'have', 'ma', 'theirs', 'there', 'its', "shouldn't", 'those', 'being', 'his', 'an', 'down', 'than', 'weren', 's', 'didn', "you've", 'herself', 'their', 'until', 'our', "shan't", 'needn', 'that', 'same', 'shouldn', 

In [8]:
# Filter these from our tokens...

filtered_tokens = [w for w in expanded_tokens if not w.lower() in stop_words]
print(filtered_tokens)

['another', 'straightforward', 'order', 'via', 'royal', 'mail', 'website', 'needed', 'first', 'class', 'stamps', 'office', 'stamp', 'selection', 'clearly', 'set', 'website', 'along', 'pricing', 'placing', 'order', 'stamps', 'arrived', 'within', '48', 'hrs', 'great', 'service', 'would', 'use']


In [9]:
stop_words_found = [w for w in expanded_tokens if w.lower() in stop_words]
print(stop_words_found)

['some', 'for', 'the', 'out', 'on', 'with', 'after', 'the', 'I', 'again']


In [10]:
# Define a list of custom stop words

custom_stop_words = ['set', 'use']

filtered_tokens = [w for w in filtered_tokens if not w.lower() in custom_stop_words]
print(filtered_tokens)

['another', 'straightforward', 'order', 'via', 'royal', 'mail', 'website', 'needed', 'first', 'class', 'stamps', 'office', 'stamp', 'selection', 'clearly', 'website', 'along', 'pricing', 'placing', 'order', 'stamps', 'arrived', 'within', '48', 'hrs', 'great', 'service', 'would']


### 4. Count Vectorisation

In [11]:
# Create a datafrae of customer review text...

import pandas as pd

my_text_df = pd.DataFrame({
    'clean_text':['great service from royal mail', 'royal mail is great', 'my royal mail postie is fantastic']
})
my_text_df

Unnamed: 0,clean_text
0,great service from royal mail
1,royal mail is great
2,my royal mail postie is fantastic


In [12]:
# First, let's understand 'n-grams'...

import nltk
from nltk.util import ngrams

first_document = regexp_tokenize(my_text_df['clean_text'][0].lower(), "[\w']+")

n_grams = ngrams(first_document, n=2)

for gram in n_grams:
    print(gram)

('great', 'service')
('service', 'from')
('from', 'royal')
('royal', 'mail')


In [13]:
# Define our vectoriser...

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(
#     max_features=1500,    # limit the number of words to keep from all words, otherwise over 7000 features (words)
#     min_df=5,             # min number of documents the word should occur in i.e. too rare
#     max_df=0,             # max. % of documents the word should occur in i.e. too. common
      ngram_range=(1,2))


In [14]:
# Apply to our text documents...

doc_count_matrix = count_vect.fit_transform(my_text_df['clean_text'])

# Convert to a dataframe just to help us visualise...

doc_count_matrix_df = pd.DataFrame.sparse.from_spmatrix(doc_count_matrix)
doc_count_matrix_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,1,1,1,1
1,0,0,0,1,0,1,0,1,1,1,0,0,0,0,0,1,1,0,0
2,1,0,0,0,0,1,1,0,1,0,1,1,1,1,1,1,1,0,0


In [15]:
# Let's look at our features...

feature_names = count_vect.get_feature_names()
print(feature_names)

['fantastic', 'from', 'from royal', 'great', 'great service', 'is', 'is fantastic', 'is great', 'mail', 'mail is', 'mail postie', 'my', 'my royal', 'postie', 'postie is', 'royal', 'royal mail', 'service', 'service from']


### 5. TF-IDF (Term Frequency - Inverse Document Frequency)

In [16]:
# Define our TF-IDF transformer...

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)

# Transform our matrix...
doc_count_matrix_tfidf = tfidf_transformer.fit_transform(doc_count_matrix)
doc_count_matrix_tfidf

# # Convert to dataframe to help us visualise...
doc_count_matrix_tfidf_df = pd.DataFrame.sparse.from_spmatrix(doc_count_matrix_tfidf)
doc_count_matrix_tfidf_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.0,0.388518,0.388518,0.295478,0.388518,0.0,0.0,0.0,0.229465,0.0,0.0,0.0,0.0,0.0,0.0,0.229465,0.229465,0.388518,0.388518
1,0.0,0.0,0.0,0.370954,0.0,0.370954,0.0,0.48776,0.288079,0.48776,0.0,0.0,0.0,0.0,0.0,0.288079,0.288079,0.0,0.0
2,0.340505,0.0,0.0,0.0,0.0,0.258963,0.340505,0.0,0.201108,0.0,0.340505,0.340505,0.340505,0.340505,0.340505,0.201108,0.201108,0.0,0.0


In [17]:
# Let's see how each feature is weighted

df_idf = pd.DataFrame(tfidf_transformer.idf_, index=count_vect.get_feature_names(),columns=["idf_weights"])
df_idf = df_idf.sort_values(by=['idf_weights'])
df_idf

Unnamed: 0,idf_weights
royal mail,1.0
royal,1.0
mail,1.0
great,1.287682
is,1.287682
fantastic,1.693147
postie is,1.693147
postie,1.693147
my royal,1.693147
my,1.693147


### 6. Word2Vec Model

In [18]:
# Use text from wikipedia page...

rm_wiki_text = 'Royal Mail Group plc is a British multinational postal service and courier company, originally established in 1516 as a department of the English government. The company\'s subsidiary Royal Mail Group Limited operates the brands Royal Mail (letters) and Parcelforce Worldwide (parcels). GLS Group, an international logistics company, is a wholly owned subsidiary of Royal Mail Group. The group used the name Consignia for a brief period in the early 2000s.\nThe company provides mail collection and delivery services throughout the UK. Letters and parcels are deposited in post or parcel boxes, or are collected in bulk from businesses and transported to Royal Mail sorting offices. Royal Mail owns and maintains the UK\'s distinctive red pillar boxes, first introduced in 1852, many of which bear the initials of the reigning monarch. Deliveries are made at least once every day except Sundays and bank holidays at uniform charges for all UK destinations. Royal Mail generally aims to make first class deliveries the next business day throughout the nation.For most of its history, the Royal Mail was a public service, operating as a government department or public corporation. Following the Postal Services Act 2011, a majority of the shares in Royal Mail were floated on the London Stock Exchange in 2013. The UK government initially retained a 30% stake in Royal Mail, but sold its remaining shares in 2015, ending 499 years of state ownership. It is a constituent of the FTSE 100 Index.\n\n\n== History ==\n\nThe Royal Mail can trace its history back to 1516, when Henry VIII established a "Master of the Posts", a position that was renamed "Postmaster General" in 1710.Upon his accession to the throne of England at the Union of the Crowns in 1603, James VI moved his court to London. One of his first acts from London was to establish the royal postal service between London and Edinburgh, in an attempt to retain control over the Scottish Privy Council.The Royal Mail service was first made available to the public by Charles I on 31 July 1635, with postage being paid by the recipient. The monopoly was farmed out to Thomas Witherings.In the 1640s, Parliament removed the monopoly from Witherings and during the Civil War and First Commonwealth the parliamentary postal service was run at great profit for himself by Edmund Prideaux (a prominent parliamentarian and lawyer who rose to be attorney-general). To keep his monopoly in those troubled times Prideaux improved efficiency and used both legal impediments and illegal methods.In 1653, Parliament set aside all previous grants for postal services, and contracts were let for the inland and foreign mails to John Manley. Manley was given a monopoly on the postal service, which was effectively enforced by Protector Oliver Cromwell\'s government, and thanks to the improvements necessitated by the war, Manley ran a much improved Post Office service. In July 1655, the Post Office was put under the direct government control of John Thurloe, a Secretary of State, best known to history as Cromwell\'s spymaster general. Previous English governments had tried to prevent conspirators communicating; Thurloe preferred to deliver their post having surreptitiously read it. As the Protectorate claimed to govern all of Great Britain and Ireland under one unified government, on 9 June 1657 the Second Protectorate Parliament (which included Scottish and Irish MPs) passed the "Act for settling the Postage in England, Scotland and Ireland", which created one monopoly Post Office for the whole territory of the Commonwealth. The first Postmaster General was appointed in 1661, and a seal was first fixed to the mail.At the restoration of the monarchy, in 1660, all the ordinances and acts passed by parliaments during the Civil War and the Interregnum passed into oblivion, so the General Post Office (GPO) was officially established by Charles II in 1660.Between 1719 and 1763, Ralph Allen, postmaster at Bath, signed a series of contracts with the post office to develop and expand Britain\'s postal network. He organised mail coaches which were provided by both Wilson & Company of London and Williams & Company of Bath. The early Royal Mail Coaches were similar to ordinary family coaches, but with Post Office livery.The first mail coach ran in 1784, operating between Bristol and London. Delivery staff received uniforms for the first time in 1793, and the Post Office Investigation Branch was established. The first mail train ran in 1830, on the Liverpool and Manchester Railway. The Post Office\'s money order system was introduced in 1838.\n\n\n=== Uniform penny postage ===\n\nIn December 1839, the first substantial reform started when postage rates were revised by the short-lived Uniform Fourpenny Post.Rowland Hill, an English teacher, inventor and social reformer, became disillusioned with the postal service, and wrote a paper proposing reforms that resulted in an approach that would go on to change not only the Royal Mail, but also be copied by postal services around world. His proposal was refused at the first attempt, but he overcame the political obstacles, and was appointed to implement and develop his ideas. He realised that many small purchases would fund the organisation and implemented this by changing it from a receiver-pays to a sender-pays system. This was used as the model for other postal services around the world, but also spilled over to the modern-day crowd-funding approach.Greater changes took place when the Uniform Penny Post was introduced on 10 January 1840, whereby a single rate for delivery anywhere in Great Britain and Ireland was pre-paid by the sender. A few months later, to certify that postage had been paid on a letter, the sender could affix the first adhesive postage stamp, the Penny Black, which was available for use from 6 May the same year. Other innovations were the introduction of pre-paid William Mulready designed postal stationery letter sheets and envelopes.As Britain was the first country to issue prepaid postage stamps, British stamps are the only stamps that do not bear the name of the country of issue on them.By the late 19th century, there were between six and twelve mail deliveries per day in London, permitting correspondents to exchange multiple letters within a single day.The first trial of the London Pneumatic Despatch Company was made in 1863, sending mail by underground rail between postal depots. The Post Office began its telegraph service in 1870.\n\n\n=== Pillar boxes ===\n\nThe first Post Office pillar box was erected in 1852 in Jersey. Pillar boxes were introduced in mainland Britain the following year. British pillar boxes traditionally carry the Latin initials of the reigning monarch at the time of their installation, for example: VR for Victoria Regina or GR for Georgius Rex. Such branding is not used in Scotland, due to a dispute over the current monarch\'s title: some Scottish nationalists argue that Queen Elizabeth II should have simply been Queen Elizabeth, as there had been no previous Queen Elizabeth of Scotland or of the United Kingdom of Great Britain and Northern Ireland (Elizabeth I was only Queen of the pre-1707 Kingdom of England). The dispute involved vandalism and attacks on pillar and post boxes introduced in Scotland which displayed EIIR. To avoid the issue, pillar boxes in Scotland were either marked \'Post Office\' or use the Scots Crown.A national telephone service was opened by the Post Office in 1912. In 1919, the first international airmail service was developed by Royal Engineers (Postal Section) and Royal Air Force. The London Post Office Railway was opened in 1927.In 1941, an airgraph service was introduced between UK and Egypt. The service was later extended to Canada (1941), East Africa (1941), Burma (1942), India (1942), South Africa (1942), Australia (1943), New Zealand (1943) Ceylon (1944) and Italy (1944).\n\n\n=== Statutory corporation ===\n\nUnder the Post Office Act 1969 the General Post Office was changed from a government department to a statutory corporation, known simply as the Post Office. The office of Postmaster General was abolished and replaced with the positions of chairman and chief executive in the new company.The two-class postal system was introduced in 1968, using first-class and second-class services. The Post Office opened the National Giro Bank that year.In 1971, postal services in Great Britain were suspended for two months between January and March as the result of a national postal strike over a pay claim. Postcodes were extended across Great Britain and Northern Ireland between 1959 and 1974.Postal workers held their first national strike for 17 years in 1988, after walking out over bonuses being paid to recruit new workers in London and the South East. Royal Mail established Romec (Royal Mail Engineering & Construction) in 1989 to deliver facilities maintenance services to its business. Romec was 51% owned by Royal Mail, and 49% by Haden Building Management Ltd, which became Balfour Beatty WorkPlace and is now Cofely UK, part of GDF Suez in a joint venture.British Telecom was separated from the Post Office in 1980, and emerged as an independent business in 1981. Girobank was sold to Alliance & Leicester in 1990, and Royal Mail Parcels was rebranded as Parcelforce. The remaining business continued under public ownership, as privatisation of this was deemed to be too unpopular. However, in the 1990s, President of the Board of Trade Michael Heseltine began investigating a possible sale, and eventually a Green Paper on Postal Reform was published in May 1994, outlining various options for privatisation. The ideas, however, proved controversial, and were dropped from the 1994 Queen\'s Speech after a number of Conservative MPs warned Heseltine that they would not vote for the legislation.\n\n\n=== Modernisation ===\nAfter a change of government in 1997, the Labour administration decided to keep the Post Office state-owned, but with more commercial freedom. This led to the Postal Services Act 2000, whereby the Post Office became a public limited company in which the Secretary of State for Trade and Industry owned 50,004 ordinary shares plus 1 special share, and the Treasury Solicitor held 1 ordinary share. The company was renamed Consignia plc in 2001 and the new name was intended to show that the company did more than deliver mail; however, the change was very unpopular with both the public and employees. The Communication Workers Union (CWU) boycotted the name, and the following year, it was announced that the company would be renamed Royal Mail Group plc.In 1999, Royal Mail launched a short-lived e-commerce venture, ViaCode Limited, aimed at providing encrypted online communications services. However it failed to make a profit and closed in 2002.As part of the 2000 Act, the government set up a postal regulator, the Postal Services Commission, known as Postcomm, which offered licences to private companies to deliver mail. In 2001, the Consumer Council for Postal Services, known as Postwatch, was created for consumers to express any concerns they may have with the postal service in Britain.In 2004, the second daily delivery was scrapped in an effort to reduce costs and improve efficiency, meaning a later single delivery would be made. The same year, the travelling post office mail trains were also axed. In 2005, Royal Mail signed a contract with GB Railfreight to operate an overnight rail service between London and Scotland (carrying bulk mail, and without any on-train sorting); this was later followed by a London-Newcastle service.\n\nOn 1 January 2006, the Royal Mail lost its 350-year monopoly, and the British postal market became fully open to competition. Competitors were allowed to collect and sort mail, and pass it to Royal Mail for delivery, a service known as downstream access. Royal Mail introduced Pricing in Proportion (PiP) for first and second class inland mail, whereby prices are affected by the size as well as weight of items. It also introduced an online postage service, allowing customers to pay for postage online.In 2007, the Royal Mail Group plc became Royal Mail Group Ltd, in a slight change of legal status. Royal Mail ended Sunday collections from pillar boxes that year.On 1 October 2008, Postwatch was merged into the new consumer watchdog Consumer Focus.In 2008, due to a continuing fall in mail volumes, the government commissioned an independent review of the postal services sector by Richard Hooper CBE, the former deputy chairman of Ofcom. The recommendations in the Hooper Review led Business Secretary Lord Mandelson to seek to part privatise the company by selling a minority stake to a commercial partner. However, despite legislation for the sale passing the House of Lords, it was abandoned in the House of Commons after strong opposition from backbench Labour MPs. The government later cited the difficult economic conditions for the reason behind the retreat.After the departure of Adam Crozier to ITV plc on 27 May 2010, Royal Mail appointed Canadian Moya Greene as chief executive, the first woman to hold the post.On 6 December 2010, a number of paid-for services including Admail, post office boxes and private postboxes were removed from the Inland Letter Post Scheme (ILPS) and became available under contract. Several free services, including petitions to parliament and the sovereign, and poste restante, were removed from the scheme.\n\n\n=== Privatisation ===\n\nFollowing the 2010 general election, the new Business Secretary in the coalition government, Vince Cable, asked Richard Hooper to expand on his report, to account for EU Directive 2008/6/EC which called for the postal sector to be fully open to competition by 31 December 2012. Based on the Hooper Review Update, the government passed the Postal Services Act 2011. The Act allowed for up to 90% of Royal Mail to be privatised, with at least 10% of shares to be held by Royal Mail employees.As part of the 2011 Act, Postcomm was merged into the communications regulator Ofcom on 1 October 2011, with Ofcom introducing a new simplified set of regulations for postal services on 27 March 2012. On 31 March 2012, the Government took over the historic assets and liabilities of the Royal Mail pension scheme, relieving Royal Mail of its huge pensions deficit. On 1 April 2012, Post Office Ltd became independent of Royal Mail Group, and was reorganised to become a subsidiary of Royal Mail Holdings, with a separate management and board of directors. A 10-year inter-business agreement was signed between the two companies to allow Post Offices to continue issuing stamps and handling letters and parcels for Royal Mail. The Act also contained the option for Post Office Ltd to become a mutual organisation in the future.In July 2013, Cable announced that Royal Mail was to be floated on the London Stock Exchange, and confirmed that postal staff would be entitled to free shares. Cable explained his position before the House of Commons:\n\nThe government\'s decision on the sale is practical, it is logical, it is a commercial decision designed to put Royal Mail\'s future on a long-term sustainable business. It is consistent with developments elsewhere in Europe where privatised operators in Austria, Germany and Belgium produce profit margins far higher than the Royal Mail but have continued to provide high-quality and expanding services.\nRoyal Mail\'s chief executive Moya Greene publicly supported Cable, stating that the sale would provide staff with "a meaningful stake in the company", while the public would be able to "invest in a great British institution". On 12 September 2013, a six-week plan for the sale of at least half of the business was released to the public; the Communication Workers Union (CWU), representing over 100,000 Royal Mail employees, said that 96% of Royal Mail staff opposed the sell-off. A postal staff ballot in relation to a nationwide strike action was expected to take place in late September 2013.Applications for members of the public to buy shares opened on 27 September 2013, ahead of the company\'s listing on the London Stock Exchange on 15 October 2013. The government was expected to retain between a 37.8% and 49.9% holding in the company. A report on 10 October 2013 revealed that around 700,000 applications for shares had been received by HM Government, more than seven times the amount that were available to the public. Cable stated: "The aim is to place the shares with long-term investors, we are absolutely confident that will happen." At the time of the report, Royal Mail staff continued to ballot regarding potential strike action.The initial public offering (IPO) price was set at 330p, and conditional trading in shares began on 11 October 2013, ahead of the full listing on 15 October 2013. Following the IPO, 52.2% of Royal Mail had been sold to investors, with 10% given to employees for free. Due to the high demand for shares, an additional 7.8% was sold via an over-allotment arrangement on 8 November 2013. This left the government with a 30% stake in Royal Mail and £1.98bn raised from the sale of shares.The CWU confirmed on 13 October 2013 that strike action would occur in response to the privatisation of Royal Mail, with a possible start date of 23 October 2013. A union source stated: "It is likely to be an all-out strike first, then rolling strikes in the run up to Christmas", while the CWU had dismissed the offer of an 8.6% rise over three years as "misleading and unacceptable". Prior to the announcement of the strike ballot results on the afternoon of 16 October 2013, employees were offered £300 to cross the picket line if a nationwide postal strike occurred. The CWU called off strike action on 30 October 2013, while negotiations progressed with Royal Mail\'s management. The talks were extended on 13 November 2013, with the aim that an agreement be reached by both sides by 20 November 2013. Royal Mail confirmed that both sides had reached a proposed settlement on 4 December, and the CWU confirmed on 9 December 2013 that it would recommend the deal to its members. On 6 February 2014, the CWU confirmed that Royal Mail staff had voted to accept the settlement.\n\n\n=== Post-privatisation ===\nShare prices rose by 38% on the first day of conditional trading, leading to accusations that the company had been undervalued. Six months later, the market price was 58% more than the sale price, and peaked as high as 87%. Business Secretary Vince Cable defended the low sale price that was finalised, saying that the threat of strike action around the time of the sale meant it was a fair price in the circumstances, following questioning from the House of Commons Business Committee in late April 2014. On behalf of both himself and Business Minister Michael Fallon, Cable stated before the Committee: "We don\'t apologise for it and we don\'t regret it."Cable was required to respond to the sale price issue again on 11 July 2014 after a report was published on that date by the Business, Innovation and Skills (BIS) Committee. Chaired by MP Adrian Bailey, the report concluded:\n\nIt is clear that the Government met its objectives in terms of delivering a privatised Royal Mail with an employee share scheme. However, it is not clear whether value for money was achieved and whether Ministers obtained the appropriate return to the taxpayer. We agree with the National Audit Office that the Government met its primary objective. On the basis of the performance of the share price to date, it appears that the taxpayer has missed out on significant value.\nThe report also concluded that the "Government over-emphasised the risk" in regard to the industrial relations between the government and the CWU, with the BIS Committee referring to the Royal Mail share price before, during and after the finalisation of the pay deal with the union. During the presentation of the report, Bailey referred to the underpinning factors of "fear of failure and poor quality advice", and warned that British taxpayers could sustain further losses in the future due to the inclusion of Royal Mail\'s \'surplus\' assets as part of "the most significant privatisation in years". The BIS Committee called on the UK government to publish a list of the preferred investors involved in the sale, including the details of those investors who sold their shareholding. Billy Hayes, general secretary of the CWU, also responded to the BIS report: "The BIS select committee\'s damning report published today shows the extent of the government\'s incompetence in the privatisation of Royal Mail."In 2014, the London Assembly voted to call for the renationalisation of Royal Mail.On 4 June 2015, the Chancellor of the Exchequer, George Osborne, announced that the government would sell its remaining 30% stake. A 15% stake was subsequently sold to investors on 11 June 2015, raising £750m, with a further 1% passed to the company\'s employees. The government completed the disposal of its shareholding on 12 October 2015, when a 13% stake was sold for £591m and another 1% was given to employees. In total the government raised £3.3bn from the full privatisation of Royal Mail.As of 13 January 2020, Royal Mail shares are trading below the issue price, as they did throughout all of 2019.\n\n\n== Services ==\n\n\n=== Universal service ===\nRoyal Mail is required by law to maintain the universal service, whereby items of a specific size can be sent to any location within the United Kingdom for a fixed price, not affected by distance. The Postal Services Act 2011 guaranteed that Royal Mail would continue to provide the universal service until at least 2021.\n\n\n=== Special Delivery ===\nRoyal Mail Special Delivery is an expedited mail service that guarantees delivery by 1 pm or 9 am the next day for an increased cost. In the event that the item does not arrive on time, there is a money back guarantee. It insures goods to the value of £50 for 9 am or £500 for 1 pm to £2,500 (for either service).\n\n\n=== Business services ===\n\nThe Royal Mail runs, alongside its stamped mail services, another sector of post called business mail. The large majority of Royal Mail\'s business mail service is for PPI or franked mail, where the sender prints their own \'stamp\'. For PPI mail, this involves either a simple rubber stamp and an ink pad, or a printed label. For franked mail, a dedicated franking machine is used.Bulk business mail, using Mailmark technology, attracts reduced prices of up to 32%, if the sender prints an RM4SCC barcode, or prints the address in a specified position on the envelope using a font readable by optical character recognition (OCR) equipment.\n\n\n=== Prohibited goods ===\nRoyal Mail will not carry a number of items which it says could be dangerous for its staff or vehicles. Additionally, a list of \'restricted\' items can be posted subject to conditions. Prohibited goods include alcoholic, corrosive or flammable liquids or solids, gases, controlled drugs, indecent or offensive materials, and human and animal remains.In 2004, Royal Mail applied to the then postal regulator Postcomm to ban the carriage of sporting firearms, saying they caused disruption to the network, that a ban would assist police with firearms control, and that ease of access meant the letters network was a target of criminals. Postcomm issued a consultation on the proposed changes in December 2004, to which 62 people and organisations responded.In June 2005, Postcomm decided to refuse the application on the grounds that Royal Mail had not provided sufficient evidence that carrying firearms caused undue disruption or that a ban would reduce the number of illegal weapons. It also said that a ban would cause unnecessary hardship to individuals and businesses.In August 2012, Royal Mail again attempted to prohibit the carriage of all firearms, air rifles and air pistols from 30 November 2012. It cited Section 14(1) of the 1998 Firearms (Amendment) Act, which requires carriers of firearms to "take reasonable precautions" for their safe custody and argued that to comply would involve disproportionate cost. A Royal Mail public consultation document on the changes said: "We expect the impact on customers to be minimal".The proposals provoked a large negative response, following a campaign led by the British Association for Shooting and Conservation, backed by numerous shooting-related websites and organisations. A total of 1,458 people gave their views in emails and letters sent to Royal Mail. An online petition opposing the proposals was signed by 2,236 people, 1,742 of whom added comments. In the face of such opposition, Royal Mail dropped the proposals in December 2012.\n\n\n=== Unaddressed promotional mail delivery ===\nRoyal Mail\'s "Door to door" service provides delivery of leaflets, brochures, catalogues and other print materials to groups of domestic and business addresses selected by postcode. Such deliveries are made by the mail carrier together as part of the daily round. Companies using the "Door to door" service include Virgin Media, BT, Sky, Talk Talk, Farmfoods, Domino\'s Pizza, Direct Line and Morrisons. In 2005, the service delivered 3.3 billion items.The "Door to door" service does not use the UK Mailing Preference Service; instead, Royal Mail operates its own opt-out database. Warnings about missing government communications given by Royal Mail to customers opting out of their service have been criticised by customers and consumer groups. Clarification given by the company in June 2015 explained that election communications and unaddressed government mail would be delivered to customers even if they had opted out.\n\n\n== Staffing ==\n\nAs of 2019, Royal Mail employs around 162,000 permanent postal workers, of which 143,000 are UK based roles, and 90,000 are postmen and women. An additional 18,000 casual workers are employed during November and December to assist with the additional Christmas post.In 2011, Royal Mail established an in-house agency, Angard Staffing Solutions, to recruit temporary workers. Royal Mail was accused of trying to circumvent the Agency Workers Regulations, but denied this, saying they only wanted to reduce recruitment costs. In January 2012 it was reported that Angard had failed to pay a number of workers for several weeks.Royal Mail\'s industrial disputes include a seven-week strike in 1971 after a dispute over pay and another strike in 1988 due to bonuses being paid to new staff recruited in London and the South East.Royal Mail suffered national wildcat strikes over pay and conditions in 2003. In Autumn 2007, disputes over modernisation began to escalate into industrial action. In mid October the CWU and Royal Mail agreed a resolution to the dispute.In December 2008, workers at mail centres affected by proposals to rationalise the number of mail centres (particularly in north west England) again voted for strike action, potentially affecting Christmas deliveries. The action was postponed less than 24 hours before staff were due to walk out.Localised strikes took place across the UK from June 2009 and grew in frequency throughout the summer. In September 2009 the CWU opened a national ballot for industrial action over Royal Mail\'s failure to reach a national agreement covering protection of jobs, pay, terms and conditions and the cessation of managerial executive action. The ballot was passed in October, causing a number of two- and three-day strikes.\n\n\n=== Penny Post Credit Union ===\n\nPenny Post Credit Union Limited is a savings and loans co-operative established by a joint project with the CWU in 1996, as Royal Mail Wolverhampton and District Employees Credit Union, it became Royal Mail (West) Credit Union in 2000, before adopting the present name in 2001. Based at the North West Midlands Mail Centre, it is a member of the Association of British Credit Unions Limited.The credit union is authorised by the Prudential Regulation Authority and regulated by the Financial Conduct Authority and the PRA. Ultimately, like the banks and building societies, members\' savings are protected against business failure by the Financial Services Compensation Scheme.\n\n\n== Regulation ==\nThe Royal Mail is regulated by Ofcom, while consumer interests are represented by the Citizens Advice Bureau. The relationship between the two bodies\' predecessors (Postcomm and Postwatch) was not always good, and in 2005, Postwatch took Postcomm to judicial review over its decision regarding rebates to late-paying customers.Royal Mail has, in some quarters, a poor reputation for losing mail despite its claims that more than 99.93% of mail arrives safely and in 2006 was fined £11.7 million due to the amount of mail lost, stolen or damaged. In the first three months of 2011, around 120,000 letters were lost.In July 2012 Ofcom consulted on a scheme proposed by Royal Mail to alter its delivery obligations to allow larger postal items to be left with neighbours rather than returning them to a Royal Mail office to await collection. The scheme was presented as offering consumers greater choice for receiving mail when not at home, that is if Royal Mail deliver items as per their stated contractual obligations and was said to follow Royal Mail research from a \'delivery to neighbour\' trial across six areas of the UK that showed widespread consumer satisfaction. In a statement dated 27 September 2012, Ofcom announced it would approve the scheme after noting that more goods were being purchased over the internet and that Royal Mail\'s competitors were permitted to leave undelivered items with neighbours. People who do not wish to have parcels left with neighbours, or to receive those of others, can opt out by displaying a free opt-out sticker near their letterbox. Royal Mail remains liable for undeliverable items until they are received by the addressee or returned to sender.Ofcom suggested in October 2012 that the first and second class post systems could be replaced by a single class. The new class would be set at a higher price than the current second class, but would be delivered in a shorter time-frame.Royal Mail was fined £50 million by Ofcom in 2018 for breach of European Union competition law. Ofcom found that Royal Mail had abused its dominant position in 2014 in the delivery of letters.\n\n\n== Operations ==\nPost is stamped as First Class (traditionally red stamps) or Second Class (blue stamps) and prioritised accordingly. The targets are delivering 93% of First Class post the next working day, and delivering 98.5% of Second Class post within three working days.\n\n\n=== Mail centres ===\n\nRoyal Mail operates a network of 37 mail centres (as of 2019). Each mail centre serves a large geographically defined area of the UK and together they form the backbone network of the mail distribution operation. Mail is collected and brought to one of the mail centres. Mail is exchanged between the mail centres and then forwarded to one of 1,356 delivery offices, from where the final delivery is made or a P739 card is left.As part of the sorting process, mail is collected from pillar boxes, Post Office branches and businesses, and brought to the regional mail centre. The process is divided into two parts. The \'outward\' sorting identifies mail for delivery in the mail centre geographic area, which is retained, and mail intended for other mail centres, which is dispatched. The \'inward\' sorting forwards mail received from other centres to the relevant delivery offices within the mail centre area.\n\n\n==== Integrated mail processing ====\nIntegrated mail processing (IMP) is the method that Royal Mail uses to sort the mail (in bulk) before delivery and has been implementing the technology since 1999. The system works by automated optical character recognition of postcodes. Integrated mail processors scan the front and back of an envelope and translate addresses into machine-readable code. Letters are given a fluorescent orange barcode that represents the address. The barcode follows the RM4SCC pattern. Per mail item there are over 250 types of information that are collected from mail class to indicia type. Some scanning and detection features have been removed as they have been superseded by newer technology. This is known as the IMP Extension of Life (EoL) program.\n\n\n==== Intelligent letter sorting machines ====\nRoyal Mail operates 66 intelligent letter sorting machines (ILSMs) in the UK and were installed in the mid-1980s and early 1990s to improve the speed and efficiency of sorting and delivering mail. It processes more than 36,000 items per hour and was part of their ongoing modernisation programme that commenced in the early 1980s.\n\n\n==== International mail ====\nRoyal Mail operates an international mail sorting centre in Langley, Berkshire close to Heathrow Airport called the Heathrow Worldwide Distribution Centre to handle all international airmail arriving into and leaving the United Kingdom, plus some container and road transported mail.\n\n\n==== List of mail centres ====\nAs of March 2021, the 37 operational mail centres (divided into Royal Mail regions) were:\nEast: Chelmsford, Norwich, Nottingham, Peterborough, Romford, Sheffield, South Midlands (Northampton)\nWest: Birmingham, Chester, Manchester, North West Midlands (Wolverhampton), Preston, Warrington\nSouth East: Croydon, Gatwick (Crawley), Greenford, Home Counties North (Hemel Hempstead), Jubilee (Hounslow), Medway (Rochester), London Central (Mount Pleasant)\nSouth West: Bristol, Cardiff, Dorset (Poole), Exeter, Plymouth, Southampton, Swansea, Swindon, Truro\nNorth: Aberdeen, Inverness, Carlisle, Edinburgh, Glasgow, Leeds, Northern Ireland (Newtownabbey), Tyneside/Newcastle (Gateshead)Mail Centres in the Isle of Man, Jersey and Guernsey are streamlined into the Royal Mail’s domestic network.\n\n\n==== Closures ====\nThe number of mail centres has been declining as part of the Mail Centre Rationalisation Programme. In 2008, there were 69 mail centres and in 2010 there were 64. It was anticipated that around half of these could be closed by 2016. Oldham and Stockport along with Oxford and Reading mail centres all closed in 2009 and Bolton, Crewe, Liverpool, Northampton, Coventry and Milton Keynes were closed in 2010. Farnborough, Watford and Stevenage were closed in 2011. Hemel Hempstead, Southend, Worcester were closed in 2012. Dartford, Tonbridge, Maidstone and Canterbury were closed in 2012 but replaced by a new mail centre in Rochester. The East London and South London mail centres were closed during summer 2012.In 2013 and 2014, a further eight mail centres were planned to be closed. The old mail centres in Northampton, Coventry and Milton Keynes were replaced with the new South Midlands mail centre in Northampton covering Warwickshire, Coventry, Northamptonshire and Milton Keynes. The South Midlands Mail Centre is the largest in the UK.\n\n\n=== Regional Distribution Centres ===\nAs of 2020 there are 7 Regional Distribution Centres (RDCs) across the country.They are responsible for handling large pre-sorted mailings from business customers.\nScottish Distribution Centre (Wishaw)\nPrincess Royal Distribution Centre (London)\nNational Distribution Centre (Northampton)\nSouth West Distribution Centre (Bristol)\nNorth West Distribution Centre (Warrington)\nYorkshire Distribution Centre (Normanton)\nNorthern Ireland Distribution Centre (Newtownabbey)\n\n\n=== Fleet ===\nRoyal Mail is famous for its custom load-carrying bicycles (with the rack and basket built into the frame), made by Pashley Cycles since 1971. Since 2000, old delivery bicycles have been shipped to Africa by the charity Re~Cycle; over 8,000 had been donated by 2004. In 2009, Royal Mail announced it was beginning to phase out bicycle deliveries, to be replaced with more push-trolleys and vans. A spokesman said that they would continue to use bicycles on some rural routes, and that there was no plan to phase out bicycles completely.In addition to running a large number of road vehicles, Royal Mail uses trains, a ship and some aircraft, with an air hub at East Midlands Airport. Dedicated night mail flights are operated by Titan Airways for Royal Mail between East Midlands Airport and Bournemouth Airport and between Exeter International Airport and London Stansted Airport. One Boeing 737-3Y0 was flown in full Royal Mail livery. In June 2013, Royal Mail confirmed it would extend Titan Airways\' contract to operate night flights from Stansted Airport, from January 2014 to January 2017, introducing new routes to Edinburgh and Belfast using three Boeing 737s. The new contract called for the replacement of the British Aerospace 146-200QC (Quick Change) aircraft in favour of a standard Boeing 737 fleet, and the type was withdrawn by Titan Airways in November 2013.In 2021 Royal Mail announced plans to trial using a drone between the UK mainland and St Mary’s airport, Scilly Isles. The twin-engine vehicle is manufactured in the UK by the Windracers Limited and is capable of carrying 100kg of mail, which is the same weight as a typical delivery round. It is able to fly in poor weather conditions, including fog, and will be out of sight of any operator during the 70-mile journey. Vertical take-of and landing drones will take parcels between the islands in the archipelago. Royal Mail delivered its first parcel using a drone in December 2020. A package was sent to a remote lighthouse on Scotland’s Isle of Mull.The RMS St. Helena was a cargo and passenger ship that served the British overseas territory of Saint Helena. It sailed between Cape Town, Saint Helena and Ascension Island. It was one of only two Royal Mail Ships in service, alongside the Queen Mary 2, although it did not belong to Royal Mail Group.Royal Mail operated the London Post Office Railway, a network of driverless trains running on a private underground track, from 1927 until it closed it in 2003.\n\n\t\t\n\t\t\n\n\n== British Overseas Territories and Crown Dependencies ==\n\nBritish Overseas Territories and Crown Dependencies are allowed to establish independent postal systems, and typically now have local government agencies, British government delegates, or BFPO as postal operators. (See List of postal entities.) Though served by independent operators, the three Crown Dependencies use British postcodes in co-operation with Royal Mail; each dependency has its own postal area.  The same prices are charged by the four operators for delivery throughout their collective area, though delivery times vary and interjurisdictional mail must clear customs.\n\n\n== See also ==\nPostage stamps and postal history of Great Britain\nRoyal Mail rubber band\nLondon Penny Post\nBarbados Postal Service and forerunner – created by Royal Mail in 1663 for postal services in the former Crown colony, and a part of RM until 1851\nHongkong Post – created by Royal Mail in 1841 for postal services in the former Crown colony, and a part of RM until 1860\nCanada Post - established as Royal Mail Canada 1867 (replacing colonial postal departments) and renamed late 1960s;  RM managed postal services in pre Confederation Canada from 1775 to 1851\nAustralia Post 1975 - created to replace administration of the former Postmaster-General\'s Department. Before 1901 each colony ran their postal service mainly from their major settlement:\n(Sydney) New South Wales 1809\n(Melbourne) Victoria 1851\n(Adelaide) South Australia 1851\n(Van Diemen\'s Land) Tasmania 1853\n(Brisbane) Queensland 1859\nNew Zealand Post\nCaribbean Postal Union\nCredit unions in the United Kingdom\n\n\n== References ==\n\n\n=== Citations ===\n\n\n=== Sources ===\nCampbell-Smith, Duncan (2011). Masters of the Post: The Authorized History of the Royal Mail. Penguin Books. ISBN 978-0-241-95766-0.\nBrowne, Christopher (1993). Getting the Message: The Story of the British Post Office. Alan Sutton. ISBN 0-7509-0351-1.\nA brief history of the Post Office – A GPO public relations publication 1965\n\n\n== External links ==\nOfficial website \nCorporate website'
print(rm_wiki_text)

Royal Mail Group plc is a British multinational postal service and courier company, originally established in 1516 as a department of the English government. The company's subsidiary Royal Mail Group Limited operates the brands Royal Mail (letters) and Parcelforce Worldwide (parcels). GLS Group, an international logistics company, is a wholly owned subsidiary of Royal Mail Group. The group used the name Consignia for a brief period in the early 2000s.
The company provides mail collection and delivery services throughout the UK. Letters and parcels are deposited in post or parcel boxes, or are collected in bulk from businesses and transported to Royal Mail sorting offices. Royal Mail owns and maintains the UK's distinctive red pillar boxes, first introduced in 1852, many of which bear the initials of the reigning monarch. Deliveries are made at least once every day except Sundays and bank holidays at uniform charges for all UK destinations. Royal Mail generally aims to make first class 

In [19]:
# Tokenise and clean our string of text... 

import nltk

sentences = nltk.sent_tokenize(rm_wiki_text)

for i in range(0, len(sentences)):
    temp = nltk.word_tokenize(sentences[i])
    sentences[i] = [word.lower() for word in temp if word not in stop_words and word.isalpha()]
    
print(sentences)



In [20]:
# Apply Word2Vec model...

from gensim.models import Word2Vec

model = Word2Vec(sentences,
                 size=100,  # dimensionalilty of the word vectors
                 window=5)

In [23]:
# Measures the similarity between words using cosine distance...

model.similarity("post", "mail")

0.30476612

In [24]:
# Finds the top n most similar words...

model.similar_by_word("post", 10)

[('penny', 0.3729333281517029),
 ('new', 0.34610801935195923),
 ('company', 0.3333606719970703),
 ('october', 0.32771211862564087),
 ('established', 0.3169745206832886),
 ('opened', 0.31149032711982727),
 ('mail', 0.30476605892181396),
 ('january', 0.30406635999679565),
 ('royal', 0.30206358432769775),
 ('great', 0.29973503947257996)]

In [25]:
# Save our trained model to use later...

model.save("word2vec.model")