# BBC News predictive algorithm

### Andy McCann 2018/01/18
----
Dataset: BBC raw text files from D. Greene and P. Cunningham, ["Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering"](http://mlg.ucd.ie/datasets/bbc.html), Proc. ICML 2006.

The dataset was also available pre-processed, but as the task asked about preparing the data, I have started from the raw text files and pre-processed here.

I have taken two approaches to this task: a Support Vector Machine model here in R and, further below, a Neural Network model in Azure Machine Learning.  With more time I would probably also have built a Naive Bayes model, as others report similar performance with that classifier and it has the benefit of being more transparent and easier to explain to a lay audience.

## Support Vector Machine in R 
To perform this task, I installed Jupyter Data Science Notebook on a Docker swarm which I set up on AWS EC2.

### Install required packages

In [14]:
# for compatibility with R v3.3.2 in this notebook, older versions of slam and tm are needed
package_url <- "http://cran.r-project.org/src/contrib/Archive/slam/slam_0.1-37.tar.gz"
install.packages(package_url, repos = NULL, type = "source")
package_url <- "https://cran.r-project.org/src/contrib/Archive/tm/tm_0.6-2.tar.gz"
install.packages(package_url, repos = NULL, type = "source")
install.packages('RTextTools')
install.packages('e1071')
library(RTextTools)
library(tm)
library(e1071)

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


### Download the ZIPped dataset from the web to a temporary file

In [4]:
temp <- tempfile()                             # get a temporary file name
download.file("http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip",temp)
file_list <- unzip(temp)                       # unzip the files, remembering the list of names
unlink(temp)                                   # and delete the temporary file

### Create a labelled dataframe from the documents

In [5]:
# Ignore the README.TXT file
# The name of the sub-folder containing the file is the category label
# Read all text lines from each document and collapse to single string with lines separated by spaces
# Use lapply to apply to all files and rbind together

dataset <- do.call("rbind",lapply(file_list[basename(file_list)!="README.TXT"],
   FUN=function(file) {data.frame(label = basename(dirname(file)), text = paste(readLines(file),collapse=' '))}))
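
As a minimal sketch of how the label is derived from each path, the snippet below applies the same `basename(dirname(...))` idea to a few hypothetical paths. The paths are illustrative stand-ins for entries in `file_list`, and no files are read.

```r
# Hypothetical paths shaped like the entries unzip() might return (assumption)
example_files <- c("./bbc/business/001.txt",
                   "./bbc/sport/001.txt",
                   "./bbc/README.TXT")

# Drop the README, as in the cell above
docs <- example_files[basename(example_files) != "README.TXT"]

# The name of the parent folder is the category label
labels <- basename(dirname(docs))
print(labels)
```

Taking the label from the folder name means no separate label file has to be parsed or kept in sync with the documents.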

### Confirm the structure of the dataset

In [6]:
str(dataset)

'data.frame':	2225 obs. of  2 variables:
 $ label: Factor w/ 5 levels "business","entertainment",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ text : Factor w/ 2127 levels "Ad sales boost Time Warner profit  Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three m"| __truncated__,..: 1 2 3 4 5 6 7 8 9 10 ...


### Randomise the dataset (with a set seed, so reproducible)

In [7]:
set.seed(12345)
dataset <- dataset[sample(nrow(dataset)),]

### Examine the first few entries in dataset

In [8]:
head(dataset)

Unnamed: 0,label,text
1605,sport,"Wales coach elated with win Mike Ruddock paid tribute to his Wales side after they came from 15-6 down to beat France 24-18 in the Six Nations. ""After going two tries down in 12 minutes we had to show character,"" said the national team coach. ""I didn't have to tell them anything at half-time because those players have stared down the barrel of a gun before. ""They decided they didn't want to do that again and came out fighting. It was a great team effort and we showed great character to come back."" Man-of-the-match Stephen Jones, who kicked three penalties, a drop goal and conversion, was ecstatic following after the win at Stade de France. ""It's just a special moment. Two years ago we didn't win a single game in the Six Nations. But we're a very happy camp now,"" he said. ""We worked hard as a squad and I'm a proud Welshman. We've got hard matches to come, so we're just happy with the start."" Double try scorer Martyn Williams was keen not to talk about a possible Grand Slam for Wales. ""We've got more self-belief these days. Two or three years ago we might have collapsed after going behind so early. ""There's no mention of a Grand Slam among the players. We've got a tough game against Scotland at Murrayfield. They could bring us crashing down to earth."""
1948,tech,"Loyalty cards idea for TV addicts Viewers could soon be rewarded for watching TV as loyalty cards come to a screen near you. Any household hooked up to Sky could soon be using smartcards in conjunction with their set-top boxes. Broadcasters such as Sky and ITV could offer viewers loyalty points in return for watching a particular channel or programme. Sky will activate a spare slot on set-top boxes in January, marketing magazine New Media Age reported. Sky set-top boxes have two slots. One is for the viewer's decryption card, while the other has been dormant until now. Loyalty cards have become a common addition to most wallets, as High Street brands rush to keep customers with a series of incentives offered by store cards. Now similar schemes look set to enter the highly competitive world of multi-channel TV. Viewers who stay loyal to a particular TV channel could be rewarded by free TV content or freebies from retail partners. Broadcasters aiming content at children could offer smartcards which gives membership to exclusive content and clubs. ""Parents could pre-pay for some content, as a kind of TV pocket money card,"" said Nigel Whalley, managing director of media consultancy Decipher. Viewers could even be rewarded for watching ad breaks, with ideas such as ad bingo being touted by firms keen to make money out of the new market, said Mr Whalley. Credit cards that have been chipped could be used in set-top boxes to pay for movies, gambling and gaming. ""The idea of an intelligent card in boxes offers a lot of possibilities. It will be down to the ingenuity of the content players,"" said Mr Whalley. For the BBC, revenue-generating activity will be of little interest but the new development may prompt changes to Freeview set-top boxes, said Mr Whalley. Currently most Freeview boxes do not have a slot which would allow viewers to use a smartcard. Some 7.4 million households have Sky boxes and Sky is hoping to increase this to 10 million by 2010. 
Loyalty cards could play a role in this, particularly in reducing the number of people who cancel their Sky subscriptions, said Ian Fogg, an analyst with Jupiter Research."
1692,sport,"Ireland call up uncapped Campbell Ulster scrum-half Kieran Campbell is one of five uncapped players included in Ireland's RBS Six Nations squad. Campbell is joined by Ulster colleagues Roger Wilson and Ronan McCormack along with Connacht's Bernard Jackman and Munster's Shaun Payne. Gordon D'Arcy is back after injury while Munster flanker Alan Quinlan also returns to international consideration. ""The squad is selected purely on form. A lot of players put their hands up,"" coach Eddie O'Sullivan told BBC Sport. ""Kieran Campbell was just one of those players. He has been playing very well in the Heineken Cup and deserves his call-up. ""There is big competition in some departments and not so much in others. There were one or two players who were unfortunate just to miss out."" Back-row forwards David Wallace and Victor Costello are omitted, with O'Sullivan having Quinlan, Wilson, Simon Easterby, Anthony Foley, Denis Leamy and Johnny O'Connor vying for the three positions. With David Humphreys, Kevin Maggs, Simon Best and Tommy Bowe again included, it is Ulster's biggest representation in a training panel for quite some time. Munster and Leinster have 12 and 11 players in the squad respectively while Jackman is the sole Connacht representative. Four British-based players are also included. Ulster forward Ronan McCormack said he was ""totally shocked"" to be included. ""I'm really looking forward to it,"" said McCormack. ""I played with guys like Brian O'Driscoll and Denis Hickie back in my school days in Leinster so I do know a few of them although not that well. 
""It will be great to work with them."" S Best (Ulster), S Byrne (Leinster), R Corrigan (Leinster), L Cullen (Leinster), S Easterby (Llanelli), A Foley (Munster), J Hayes (Munster), M Horan (Munster), B Jackman (Connacht), D Leamy (Munster), E Miller (Leinster), R McCormack (Ulster), D O'Callaghan (Munster), P O'Connell (Munster), J O'Connor (Wasps), M O'Kelly (Leinster), F Sheahan (Munster), R Wilson (Ulster), A Quinlan (Munster). T Bowe (Ulster), K Campbell (Ulster), G D'Arcy (Ulster), G Dempsey (Leinster), G Duffy (Harlequins), G Easterby (Leinster), D Hickie (Leinster), A Horgan (Munster), S Horgan (Leinster), D Humphreys (Ulster), K Maggs (Ulster), G Murphy (Leicester), B O'Driscoll, (Leinster), R O'Gara (Munster), S Payne (Munster), P Stringer (Munster). K Gleeson (Leinster), T Howe (Ulster), J Kelly (Munster), N McMillan (Ulster)."
1969,tech,"Musicians 'upbeat' about the net Musicians are embracing the internet as a way of reaching new fans and selling more music, a survey has found. The study by US researchers, Pew Internet, suggests musicians do not agree with the tactics adopted by the music industry against file-sharing. While most considered file-sharing as illegal, many disagreed with the lawsuits launched against downloaders. ""Even successful artists don't think the lawsuits will benefit musicians,"" said report author Mary Madden. For part of the study, Pew Internet conducted an online survey of 2,755 musicians, songwriters and music publishers via musician membership organisations between March and April 2004. They ranged from full-time, successful musicians to artists struggling to make a living from their music. ""We looked at more of the independent musicians, rather than the rockstars of this industry but that reflects more accurately the state of the music industry,"" Ms Madden told the BBC News website. ""We always hear the views of successful artists like the Britneys of the world but the less successful artists rarely get represented."" The survey found that musicians were overwhelming positive about the internet, rather than seeing it as just a threat to their livelihood. Almost all of them used the net for ideas and inspiration, with nine out of 10 going online to promote, advertise and post their music on the web. More than 80% offered free samples online, while two-thirds sold their music via the net. Independent musicians, in particular, saw the internet as a way to get around the need to land a record contract and reach fans directly. ""Musicians are embracing the internet enthusiastically,"" said Ms Madden. ""They are using the internet to gain inspiration, sell it online, tracking royalties, learning about copyright."" Perhaps surprisingly, opinions about online file-sharing were diverse and not as clear cut as those of the record industry. 
Through the Recording Industry Association of America (RIAA), it has pursued an aggressive campaign through the courts to sue people suspected of sharing copyrighted music. But the report suggests this campaign does not have the wholehearted backing of musicians in the US. It found that most artists saw file-sharing as both good and bad, though most agreed that it should be illegal. ""Free downloading has killed opportunities for new bands to break without major funding and backing,"" said one musician quoted by the report. ""It's hard to keep making records if they don't pay for themselves through sales."" However 60% said they did not think the lawsuits against song swappers would benefit musicians and songwriters. Many suggested that rather than fighting file-sharing, the music industry needed to recognise the changes it has brought and embrace it. ""Both successful and struggling musicians were more likely to say that the internet has made it possible for them to make more money from their music, rather than make it harder for them to protect their material from piracy,"" said Ms Madden."
1014,politics,"Bid to cut court witness stress New targets to reduce the stress to victims and witnesses giving evidence in courts in England and Wales have been announced by the lord chancellor. Lord Falconer wants all crown courts and 90% of magistrates' courts to have facilities to keep witnesses separate from defendants within four years. More video links will also be made available so that witnesses do not have to enter courtrooms. It is part of a five-year plan to help build confidence in the justice system. Ministers say the strategy is aimed at re-balancing the court system towards victims, and increasing the number of offenders brought to justice. Launching the Department for Constitutional Affairs' plan, Lord Falconer said: ""One of the top priorities will be a better deal for victims. ""The needs and safety of victims will be at the heart of the way trials are managed. ""Courts, judges, magistrates, prosecutors, police and victim support - all working together to ensure the rights of victims are put first, without compromising the rights of the defendant."" He went on: ""Giving evidence is a nerve-wracking experience, especially when you're a victim. ""Yet with a will and with support it can be done."" Lord Falconer told BBC Radio 4's Today programme it was impossible for some elderly people to go to court to give evidence. Other witnesses could be intimidated by sitting alongside defendants outside courts. ""You are never going to get rid of some element of the trauma of giving evidence,"" he said. ""But you can make people believe that the courts understand the problem, it's not some kind of alien place where they go where they are not thinking about them."" The plan comes as the lord chancellor also considers allowing cameras into courts for the first time since 1925, as long as they were used for cases that did not involve witnesses. 
Another feature of the strategy is constitutional reform, with a government bill to set up a supreme court and a judicial appointments commission returning to the House of Lords on Tuesday. Ministers had proposed getting rid of the title of lord chancellor, but the Lords have over-ruled this. Lord Falconer said it was right for the highest court to be completely distinct from Parliament. The person in charge of the court system should not also be speaker of the House of Lords, he said, and should be the best person chosen from either House of Parliament. What they did, not what they were called, was the critical issue, he added."
370,business,"US manufacturing expands US industrial production increased in December, according to the latest survey from the Institute for Supply Management (ISM). Its index of national manufacturing activity rose to 58.6 last month from 57.8 in November. A reading above 50 indicates a level of growth. The result for December was slightly better than analysts' expectations and the 19th consecutive expansion. The ISM said the growth was driven by a ""significant"" rise in the new orders. ""This completes a strong year for manufacturing based on the ISM data,"" said chairman of the ISM's survey committee. ""While there is continuing upward pressure on prices, the rate of increase is slowing and definitely trending in the right direction."" The ISM's index of national manufacturing activity is compiled from monthly responses of purchasing executives at more than 400 industrial companies, ranging from textiles to chemicals to paper, and has now been above 50 since June 2003. Analysts expected December's figure to come in at 58.1. The ISM manufacturing index's main sister survey - the employment index - eased to 52.7 in December from 57.6 in November, while its ""prices paid"" index, measuring the cost to businesses of their inputs, also eased to 72.0 from 74.0. The ISM's ""new orders"" index rose to 67.4 from 61.5."


### Preprocess text, remove sparse terms and create a document term matrix.  Set sample split 75:25 train:test.
The text data needs to be converted into a Document Term Matrix in order to train a bag-of-words model, using each word (token) in the text as a feature.

The tm library could be used to stem, remove numbers and so on, but `create_matrix` from RTextTools does it in one step.  By default, `create_matrix` converts to lower case and removes punctuation, stop-words and whitespace; number removal and stemming are switched on explicitly in the call below.

I also trained using TF-IDF weighting, but this did not improve the accuracy of the model.
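
To make the bag-of-words idea concrete, here is a small base-R sketch that builds a term-frequency matrix for two toy documents. The documents are invented for illustration, and the sketch omits the stop-word removal and stemming that `create_matrix` performs.

```r
# Two toy documents (illustrative, not taken from the dataset)
docs <- c("profits rise as sales rise", "team wins match")

# Lower-case and split on whitespace to get tokens
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary is the sorted set of unique tokens
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document
counts <- vapply(tokens,
                 function(tok) as.integer(table(factor(tok, levels = vocab))),
                 integer(length(vocab)))

# Rows are documents, columns are terms, values are term frequencies
dtm <- t(counts)
colnames(dtm) <- vocab
print(dtm)
```

Each column is one feature, so the number of features grows with the vocabulary, which is why sparse terms are pruned in the real matrix.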

In [9]:
smp_size <- floor(0.75*nrow(dataset))
matrix <- create_matrix(dataset$text, language="english"
 , removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.998
 , weighting=weightTf)

A `removeSparseTerms` setting of 0.998 removes terms which appear in fewer than five documents (0.2% of the 2,225 documents is 4.45).  This reduces the number of features from 21,212 to 6,371 with no resultant effect on the F-score.
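
The five-document threshold can be checked with a little arithmetic. This sketch assumes that tm keeps a term only when its document count exceeds `n_docs * (1 - sparse)`; the variable names are introduced here for illustration.

```r
n_docs <- 2225     # documents in the dataset, from str(dataset)
sparse <- 0.998    # removeSparseTerms value passed to create_matrix

# Assumed rule: a term survives only if it appears in more than this many documents
cutoff <- n_docs * (1 - sparse)   # about 4.45

# Smallest whole number of documents that clears the cutoff
min_docs <- ceiling(cutoff)
print(min_docs)
```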

In [10]:
str(matrix)

List of 6
 $ i       : int [1:290002] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:290002] 106 185 243 384 432 457 477 683 787 791 ...
 $ v       : num [1:290002] 2 1 1 1 1 1 1 1 2 1 ...
 $ nrow    : int 2225
 $ ncol    : int 6371
 $ dimnames:List of 2
  ..$ Docs : chr [1:2225] "1" "2" "3" "4" ...
  ..$ Terms: chr [1:6371] "aaa" "aaron" "abandon" "abba" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"


### Create a container, train a radial SVM model and classify the testing data
Support Vector Machines (SVMs) are often used for supervised text classification problems.  SVMs are linear discriminants which divide a feature space with a hyperplane, although they often use a kernel function to map the original features to another feature space, creating a non-linear discriminant, as here.
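
As a rough illustration of the kernel idea, the Gaussian radial basis function scores the similarity of two feature vectors as `exp(-gamma * ||x - y||^2)`. The helper `rbf_kernel` and the toy vectors below are hypothetical and are not part of the notebook's pipeline.

```r
# Gaussian radial basis function kernel: K(x, y) = exp(-gamma * ||x - y||^2)
rbf_kernel <- function(x, y, gamma) {
  exp(-gamma * sum((x - y)^2))
}

# Toy term-count vectors (illustrative values)
a <- c(2, 0, 1)
b <- c(0, 1, 1)

print(rbf_kernel(a, a, gamma = 0.1))  # identical vectors give 1
print(rbf_kernel(a, b, gamma = 0.1))  # squared distance is 5, so exp(-0.5)
```

A smaller gamma makes the similarity fall off more slowly with distance, which is why it is one of the parameters worth tuning.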

In [11]:
container <- create_container(matrix,dataset$label, trainSize=1:smp_size, testSize=(smp_size+1):nrow(dataset),
virgin=FALSE)
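
The split sizes implied by `smp_size` can be worked out directly. This is just the arithmetic behind the container call, using the 2,225 documents reported by `str(dataset)`.

```r
n_docs <- 2225
smp_size <- floor(0.75 * n_docs)     # 0.75 * 2225 = 1668.75, floored to 1668

train_idx <- 1:smp_size              # rows used for training
test_idx <- (smp_size + 1):n_docs    # remaining rows used for testing

print(c(train = length(train_idx), test = length(test_idx)))
```

Because the dataset was shuffled beforehand, taking the first 1,668 rows for training gives a random split rather than one ordered by category.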


In [39]:
# Defaults
svm_model <- train_model(container, "SVM")
svm_results <- classify_model(container, svm_model)

In [40]:
# combine predicted class with actual class (and document text)
Predictions <- cbind(container@testing_codes, svm_results, dataset[(smp_size+1):nrow(dataset), 'text'])
colnames(Predictions) <- c('Actual', 'Prediction', 'Confidence', 'Document Text')

# find the number of correct and total predictions
correct_predictions <- sum(Predictions$Actual == Predictions$Prediction)
total_predictions <- nrow(Predictions)

cat('\nF-score (micro-averaged):', correct_predictions, '/', total_predictions, '=', correct_predictions / total_predictions, '\n\n')

# confusion matrix
print(table(Predictions[,1:2]), zero.print=".")


F-score (micro-averaged): 539 / 557 = 0.967684 

               Prediction
Actual          business entertainment politics sport tech
  business           135             .        3     .    4
  entertainment        1            88        1     1    1
  politics             .             .      101     .    .
  sport                .             .        2   121    1
  tech                 2             2        .     .   94


### Attempt to tune the model
The SVM radial model with default parameters performs extremely well and achieves a micro-averaged F-score (which for a multi-class problem is the same as the micro-averaged Precision and Recall scores) of 96.8%.
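
The equivalence of micro-averaged precision, recall and F-score with plain accuracy can be verified from the confusion matrix printed above. The sketch below re-enters that matrix by hand (rows are actual classes, columns are predictions); the variable names are introduced for illustration.

```r
classes <- c("business", "entertainment", "politics", "sport", "tech")

# Confusion matrix copied from the default radial SVM output
cm <- matrix(c(135,  0,   3,   0,  4,
                 1, 88,   1,   1,  1,
                 0,  0, 101,   0,  0,
                 0,  0,   2, 121,  1,
                 2,  2,   0,   0, 94),
             nrow = 5, byrow = TRUE,
             dimnames = list(Actual = classes, Prediction = classes))

tp <- diag(cm)            # correctly classified documents per class
fp <- colSums(cm) - tp    # predicted as the class but actually another
fn <- rowSums(cm) - tp    # actually the class but predicted as another

# Micro-averaging pools the counts over all classes before dividing
micro_precision <- sum(tp) / (sum(tp) + sum(fp))
micro_recall <- sum(tp) / (sum(tp) + sum(fn))
micro_f1 <- 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
accuracy <- sum(tp) / sum(cm)

print(c(precision = micro_precision, recall = micro_recall,
        f1 = micro_f1, accuracy = accuracy))
```

Every misclassified document is a false positive for one class and a false negative for another, so the pooled totals match and all four figures coincide.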

I nevertheless attempted to tune the model with a grid search over values of gamma (the free parameter of the Gaussian radial basis function) and cost (the penalty applied to points on the 'wrong' side of the margin), using 10-fold cross-validation.
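
To see the size of the search, the grid can be enumerated with `expand.grid`. This sketch only lists the parameter combinations and does not fit any models; the variable names are illustrative.

```r
gamma_values <- 10^(-6:-1)   # six values from 1e-06 to 1e-01
cost_values <- 10^(-1:2)     # four values: 0.1, 1, 10, 100

# Every pairing of gamma and cost; the first column varies fastest
grid <- expand.grid(gamma = gamma_values, cost = cost_values)

n_combinations <- nrow(grid)     # 6 * 4
n_fits <- n_combinations * 10    # one fit per fold under 10-fold cross-validation
print(c(combinations = n_combinations, fits = n_fits))
```

With a feature matrix of several thousand columns, that many fits is the main cost of the tuning step.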

In [51]:
model.tuned <- tune.svm(x = container@training_matrix,
                        y = container@training_codes,
                        kernel = "radial",
                        gamma = 10^(-6:-1),
                        cost = 10^(-1:2)
                        )

In [52]:
summary(model.tuned)


Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation 

- best parameters:
 gamma cost
 1e-05  100

- best performance: 0.02996898 

- Detailed performance results:
   gamma  cost      error dispersion
1  1e-06   0.1 0.77039535 0.02389592
2  1e-05   0.1 0.76919775 0.02629504
3  1e-04   0.1 0.75003607 0.08074179
4  1e-03   0.1 0.24461799 0.04988591
5  1e-02   0.1 0.77398817 0.01954183
6  1e-01   0.1 0.77039535 0.02389592
7  1e-06   1.0 0.77039535 0.02389592
8  1e-05   1.0 0.76739413 0.02747888
9  1e-04   1.0 0.13670731 0.03231720
10 1e-03   1.0 0.04195945 0.01325130
11 1e-02   1.0 0.32787678 0.06036609
12 1e-01   1.0 0.69063920 0.02893785
13 1e-06  10.0 0.76859173 0.02528784
14 1e-05  10.0 0.12891206 0.02913243
15 1e-04  10.0 0.03356901 0.01206664
16 1e-03  10.0 0.03477383 0.01409251
17 1e-02  10.0 0.30090181 0.05489305
18 1e-01  10.0 0.69063920 0.02893785
19 1e-06 100.0 0.12831325 0.02940173
20 1e-05 100.0 0.02996898 0.01165950
21 1e-04 100.0 0.03357261 0.01363795
22

In [53]:
svm_model <- train_model(container,"SVM", kernel="radial", gamma=0.00001, cost=100)
svm_results <- classify_model(container,svm_model)

In [54]:
# combine predicted class with actual class (and document text)
Predictions <- cbind(container@testing_codes, svm_results, dataset[(smp_size+1):nrow(dataset), 'text'])
colnames(Predictions) <- c('Actual', 'Prediction', 'Confidence', 'Document Text')

# find the number of correct and total predictions
correct_predictions <- sum(Predictions$Actual == Predictions$Prediction)
total_predictions <- nrow(Predictions)

cat('\nF-score (micro-averaged):', correct_predictions, '/', total_predictions, '=', correct_predictions / total_predictions, '\n\n')

# confusion matrix
print(table(Predictions[,1:2]), zero.print=".")


F-score (micro-averaged): 539 / 557 = 0.967684 

               Prediction
Actual          business entertainment politics sport tech
  business           134             1        3     .    4
  entertainment        1            88        1     1    1
  politics             .             .      101     .    .
  sport                .             .        2   121    1
  tech                 2             1        .     .   95


In this case, tuning does not improve the overall accuracy compared with the default settings: both models classify 539 of the 557 test documents correctly.

### Train a linear SVM model
In fact, a linear SVM model performs marginally better on this dataset than a radial model.

In [55]:
svm_model <- train_model(container,"SVM", kernel="linear")
svm_results <- classify_model(container,svm_model)

In [56]:
# combine predicted class with actual class (and document text)
Predictions <- cbind(container@testing_codes, svm_results, dataset[(smp_size+1):nrow(dataset), 'text'])
colnames(Predictions) <- c('Actual', 'Prediction', 'Confidence', 'Document Text')

# find the number of correct and total predictions
correct_predictions <- sum(Predictions$Actual == Predictions$Prediction)
total_predictions <- nrow(Predictions)

cat('\nF-score (micro-averaged):', correct_predictions, '/', total_predictions, '=', correct_predictions / total_predictions, '\n\n')

# confusion matrix
print(table(Predictions[,1:2]), zero.print=".")


F-score (micro-averaged): 541 / 557 = 0.9712747 

               Prediction
Actual          business entertainment politics sport tech
  business           135             .        3     .    4
  entertainment        1            88        1     1    1
  politics             1             .      100     .    .
  sport                .             .        2   122    .
  tech                 2             .        .     .   96


The linear SVM model achieves an accuracy of 97.1% in classifying the 25% of the dataset reserved for testing.  The list below shows the 16 articles in the testing set which are misclassified.

Some could clearly be classified either way.  For example, the first is an article about the business aspects of technology, while politics and business are often related.

However, in the second example the model is probably picking up on words such as "win" and "commonwealth" and misclassifying an entertainment article as sport. Given more time, it would be interesting to investigate these cases further.
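
One starting point for that investigation is a per-class breakdown of the linear model's confusion matrix. The sketch below re-enters the matrix printed above by hand (rows are actual classes, columns are predictions) and derives per-class precision and recall; the variable names are illustrative.

```r
classes <- c("business", "entertainment", "politics", "sport", "tech")

# Confusion matrix copied from the linear SVM output
cm <- matrix(c(135,  0,   3,   0,  4,
                 1, 88,   1,   1,  1,
                 1,  0, 100,   0,  0,
                 0,  0,   2, 122,  0,
                 2,  0,   0,   0, 96),
             nrow = 5, byrow = TRUE,
             dimnames = list(Actual = classes, Prediction = classes))

recall <- diag(cm) / rowSums(cm)      # share of each actual class that was found
precision <- diag(cm) / colSums(cm)   # share of each predicted class that was right
print(round(cbind(precision, recall), 3))

# Total number of misclassified test documents
misclassified <- sum(cm) - sum(diag(cm))
print(misclassified)
```

The off-diagonal cells show where the errors concentrate: business articles predicted as tech or politics account for 7 of the 16 mistakes.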

### List the incorrectly labelled articles

In [57]:
Predictions[Predictions$Actual != Predictions$Prediction,]

Unnamed: 0,Actual,Prediction,Confidence,Document Text
5,tech,business,0.9037493,"PC ownership to 'double by 2010' The number of personal computers worldwide is expected to double by 2010 to 1.3 billion machines, according to a report by analysts Forrester Research. The growth will be driven by emerging markets such as China, Russia and India, the report predicted. More than a third of all new PCs will be in these markets, with China adding 178 million new PCs by 2010, it said. Low-priced computers made by local companies are expected to dominate in such territories, Forrester said. The report comes less than a week after IBM, a pioneer of the PC business, sold its PC hardware division to China's number one computer maker Lenovo. The $1.75bn (£900m) deal will make the combined operation the third biggest PC vendor in the world. ""Today's products from Western PC vendors won't dominate in those markets in the long term,"" Simon Yates, a senior analyst for Forrester, said. ""Instead local PC makers like Lenovo Group in China and Aquarius in Russia that can better tailor the PC form factor, price point and applications to their local markets will ultimately win the market share battle,"" he said. There are currently 575 million PCs in use globally. The United States, Europe and Asia-Pacific are expected to add 150 million new PCs by 2010, according to the study. The report forecast that there will be 80 million new PC users in India by 2010 and 40 million new users in Indonesia."
11,entertainment,sport,0.551463,"Spark heads world Booker list Dame Muriel Spark is among three British authors who have made the shortlist for the inaugural international Booker Prize. Doris Lessing and Ian McEwan have also been nominated. McEwan and Margaret Atwood are the only nominees to have previously won the main Booker Prize. The new £60,000 award is open to writers of all nationalities who write in English or are widely translated. The prize commends an author for their body of work instead of one book. Gabriel Garcia Marquez, Saul Bellow, Milan Kundera and John Updike also feature on the 18-strong list of world literary figures. But other past winners of the regular Booker Prize, such as Salman Rushdie, JM Coetzee and Kazuo Ishiguro have failed to make the shortlist. The prize, which will be awarded in London in June, will be given once every two years. It will reward an author - who must be living - for ""continued creativity, development and overall contribution to fiction on the world stage"". An author can only win once. The international award was started in response to criticisms that the Booker Prize is only open to British and Commonwealth authors. Margaret Atwood (Canada) Saul Bellow (Canada) Gabriel Garcia Marquez (Colombia) Gunter Grass (Germany) Ismail Kadare (Albania) Milan Kundera (Czech Republic) Stanislaw Lem (Poland) Doris Lessing (UK) Ian McEwan (UK) Naguib Mahfouz (Egypt) Tomas Eloy Martinez (Argentina) Kenzaburo Oe (Japan) Cynthia Ozick (US) Philip Roth (US) Muriel Spark (UK) Antonio Tabucchi (Italy) John Updike (US) Abraham B Yehoshua (Israel)"
16,business,tech,0.8972175,"News Corp eyes video games market News Corp, the media company controlled by Australian billionaire Rupert Murdoch, is eyeing a move into the video games market. According to the Financial Times, chief operating officer Peter Chernin said that News Corp is ""kicking the tyres of pretty much all video games companies"". Santa Monica-based Activison is said to be one firm on its takeover list. Video games are ""big business"", the paper quoted Mr Chernin as saying. We ""would like to get into it"". The success of products such as Sony's Playstation, Microsoft's X-Box and Nintendo's Game Cube have boosted demand for video games. The days of arcade classics such as Space Invaders, Pac-Man and Donkey Kong are long gone. Today, games often have budgets big enough for feature films and look to give gamers as real an experience as possible. And with their price tags reflecting the heavy investment by development companies, video games are proving almost as profitable as they are fun. Mr Chernin, however, told the FT that News Corp was finding it difficult to identify a suitable target. ""We are struggling with the gap between companies like Electronic Arts (EA), which comes with a high price tag, and the next tier of companies,"" he explained during a conference in Phoenix, Arizona. ""These may be too focused on one or two product lines."" Activision has a stock market capitalisation of about $2.95bn (£1.57bn), compared to EA's $17.8bn. Some of the games industry's main players have recently been looking to consolidate their position by making acquisitions. France's Ubisoft, one of Europe's biggest video game publishers, has been trying to remain independent since Electronic Arts announced plans to buy 19.9% of the firm. Analysts have said that industry mergers are likely in the future."
68,sport,politics,0.827585,"Calder fears for Scottish rugby Former Scotland international Finlay Calder fears civil war at the SRU could seriously hamper his country's RBS Six Nations campaign. Four members of the executive board, including the chairman, David Mackay, have resigned after a simmering row. And Calder said: ""This is terrible news for every level of Scottish rugby. ""David is a successful businessman and I thought that if anybody could transform the negative atmosphere and rising debt level, it was him."" Mackay's executive board has been in a power struggle with the general committee, which contains members elected by Scotland's club sides. ""He has been driven out by people who seem happier waging civil war than addressing the central issue that professional rugby can't be run by amateurs,"" said Calder. ""In fact, I don't understand why we are still having this argument 10 years after professionalism arrived. ""But I don't believe the rest of the SRU will take this lying down. ""I think the banks will be dismayed at this decision and, ultimately, it is them who pull the strings. ""So I wouldn't be surprised if they reviewed their position. But, in the wider picture, what message does this send out?"" He thought the work of Scotland's coaches, who have been attempting to arrest the decline of the national side, would be made much more difficult. ""Matt Williams and Willie Anderson must be wondering, 'what have we walked into here?'"" said Calder. ""And we can now expect weeks of arguments and acrimony just at a time when we should be looking forward to the Six Nations Championship. ""I am very, very disappointed, more than you can imagine. Why do so many Scots have this knack of turning on each other when the going gets tough?"""
97,business,politics,0.9065534,"Ban on forced retirement under 65 Employers will no longer be able to force workers to retire before 65, unless they can justify it. The government has announced that firms will be barred from 2006 from imposing arbitrary retirement ages. Under new European age discrimination rules, a default retirement age of 65 will be introduced. Workers will be permitted to request staying on beyond this compulsory retirement age, although employers will have the right to refuse. Trade and Industry Secretary Patricia Hewitt said people would not be forced to work longer than they wanted, saying the default age was not a statutory, compulsory retirement age. She said employers would be free to continue employing people for as long as they were competent. Under age discrimination proposals from the Department of Trade and Industry last year workers were to be allowed to work on till 70 if they wished. Business leaders had opposed the plan as they said it would be too costly and cumbersome. The British Chambers of Commerce welcomed the latest proposal. ""This move today is the best of both worlds,"" it said. ""Employers have the ability to define the end point of the employer-employee relationship and employees have flexibility with a right to request to work past the age of 65."" But Age Concern said imposing a retirement age of 65 was ""cowardly"" and a ""complete u-turn"". ""This makes a mockery of the Government's so-called commitment to outlawing ageism, leaving the incoming age discrimination law to unravel,"" said Gordon Lishman, director general of Age Concern England . ""It is now inevitable that older people will mount legal challenges to the decision using European law."" The decision will have no impact on the age at which workers can collect their state pension, the government has said."
110,business,tech,0.8044108,"BT offers equal access to rivals BT has moved to pre-empt a possible break-up of its business by offering to cut wholesale broadband prices and open its network to rivals. The move comes after telecom regulator Ofcom said in November that the firm must offer competitors ""real equality of access to its phone lines"". At the time, Ofcom offered BT the choice of change or splitting into two. Ofcom is carrying out a strategic review aimed at promoting greater competition in the UK telecom sector. BT's competitors have frequently accused it of misusing its status as the former telecoms monopoly and controller of access to many customers to favour its own retail arm. This latest submission was delivered to the watchdog ahead of a deadline for the second phase of its review. ""Central to the proposals are plans by BT to offer operators lower wholesale prices, faster broadband services and transparent, highly-regulated access to BT's local network,"" the former monopoly said in a statement. ""The United Kingdom has the opportunity to create the most exciting and innovative telecoms market in the world,"" BT chief executive Ben Verwaayen said. ""BT has a critical role to play, and today we are making a set of far-reaching proposals towards that framework,"" he said. BT wants lighter regulation in exchange for the changes, as well as the removal of the break-up threat. The group is to set up a new Access Services division - with a separate board which would include independent members - to ensure equal access for rivals to the ""local loop"", the copper wires that run between telephone exchanges and households. The company also unveiled plans to cut the wholesale prices of its most popular broadband product by about 8% from April in areas of high customer demand. It added that it plans to invest £10bn in the next five years to create a ""21st Century network"". 
To meet the growing demand for greater bandwidth, BT said it would begin trials in April with a view to launching higher-speed services nationally from the autumn. Telecom analysts Ovum welcomed the move, saying BT had ""given a lot of ground"". ""The big question now is whether the industry, and particularly Ofcom feels BT's proposals go far enough ...Now the real negotiation begins,"" director of telecoms research Tony Lavender said. Internet service provider (ISP) Plus.net also backed the proposals saying ""we will be entirely happy if Ofcom accepts them"". ""BT has been challenged to play fair and its plans will introduce a level playing field. The scenario now is how well people execute their business plans as a service provider,"" chief executive Lee Strafford said. Chris Panayis, managing director of ISP Freedom2surf said that it would make the situation clearer for business. ""I think it's the first productive thing we've had from BT,"" he said. AOL backed the price cuts but said regulation was still needed to ensure a level playing field. ""This is a reminder to Ofcom that as long as BT can change the dynamics of the whole broadband market at will, the process of opening up the UK's local telephone network to infrastructure investment and competition remains fragile,"" a spokesman said. ""Ofcom needs to return to regulation of the wholesale broadband service [IPStream] and provide more robust rules for local loop unbundling if consumers are to see the benefits of increased competition and infrastructure investment."" More than 100 telecom firms, consumer groups and other interested parties are expected to make submissions to the regulator during this consultation phase. Ofcom is expected to spend the next few weeks examining the proposals before making an announcement within the next few months."
147,business,tech,0.8473257,"News Corp eyes video games market News Corp, the media company controlled by Australian billionaire Rupert Murdoch, is eyeing a move into the video games market. According to the Financial Times, chief operating officer Peter Chernin said that News Corp is ""kicking the tires of pretty much all video games companies"". Santa Monica-based Activison is said to be one firm on its takeover list. Video games are ""big business"", the paper quoted Mr Chernin as saying. We ""would like to get into it"". The success of products such as Sony's Playstation, Microsoft's X-Box and Nintendo's Game Cube have boosted demand for video games. The days of arcade classics such as Space Invaders, Pac-Man and Donkey Kong are long gone. Today, games often have budgets big enough for feature films and look to give gamers as real an experience as possible. And with their price tags reflecting the heavy investment by development companies, video games are proving almost as profitable as they are fun. Mr Chernin, however, told the FT that News Corp was finding it difficult to identify a suitable target. ""We are struggling with the gap between companies like Electronic Arts, which comes with a high price tag, and the next tier of companies,"" he explained during a conference in Phoenix, Arizona. ""These may be too focused on one or two product lines."""
162,entertainment,politics,0.6015395,"BBC 'should allow more scrutiny' MPs have urged the BBC to give watchdogs more freedom to scrutinise how £2bn in licence fee money is spent. The Public Accounts Committee called for the National Audit Office to be given a ""free hand"" to investigate how the BBC offers value for money. Although six areas are to be opened up to scrutiny the audit office should have more power to choose what it investigated, the MPs said. The call was made in a report into the BBC's Freeview digital service. ""Our aim is not to rewrite the storyline of EastEnders but simply to ensure that the BBC is as accountable to parliament as any other organisation spending public money,"" said the committee chairman, MP Edward Leigh. ""The BBC's spending is not subject to the full independent scrutiny, and accountability to parliament. ""Parliament requires television owners to pay a licence fee and expects the comptroller and auditor general, on behalf of parliament, to be able to scrutinise how that money, over £2 billion a year, is used."" A BBC spokeswoman said: ""We share the committee's interest in ensuring the public money we receive is spent well. Though in its infancy, we think the arrangements with the NAO are working well and should be given time to mature."" The report said the Freeview digital service has had an ""impressive"" take up since its launch but the BBC must still dispel confusion about the service. The committee found the BBC had succeeded in ensuring subscription-free access to digital channels following the collapse of ITV Digital in 2002. But the fact that one in four homes could not access Freeview remained a problem. The report said that while gaps in the coverage were largely due to landscape issues, there was need for detailed explanations on the Freeview website and on promotional literature as to why it was not available in specific areas. 
The government has proposed switch off of the analogue television signal, with 2012 the most recently proposed date. The BBC launched Freeview in 2002 as an alternative to satellite subscription services such as Sky, to allow its digital channels such as BBC Three and News 24 to be seen. There have been an estimated five million Freeview set-top boxes sold since the launch and prices have fallen considerably. The corporation plans to spend up to £138m on Freeview before 2014 to ensure people can receive the service throughout the UK, and are aware of it."
290,business,politics,0.6765124,"Call to save manufacturing jobs The Trades Union Congress (TUC) is calling on the government to stem job losses in manufacturing firms by reviewing the help it gives companies. The TUC said in its submission before the Budget that action is needed because of 105,000 jobs lost from the sector over the last year. It calls for better pensions, child care provision and decent wages. The 36-page submission also urges the government to examine support other European countries provide to industry. TUC General Secretary Brendan Barber called for ""a commitment to policies that will make a real difference to the lives of working people."" ""Greater investment in childcare strategies and the people delivering that childcare will increases the options available to working parents,"" he said. ""A commitment to our public services and manufacturing sector ensures that we can continue to compete on a global level and deliver the frontline services that this country needs."" He also called for ""practical measures"" to help pensioners, especially women who he said ""are most likely to retire in poverty"". The submission also calls for decent wages and training for people working in the manufacturing sector."
354,business,politics,0.7798692,"Crossrail link 'to get go-ahead' The £10bn Crossrail transport plan, backed by business groups, is to get the go-ahead this month, according to The Mail on Sunday. It says the UK Treasury has allocated £7.5bn ($13.99bn) for the project and that talks with business groups on raising the rest will begin shortly. The much delayed Crossrail Link Bill would provide for a fast cross-London rail link. The paper says it will go before the House of Commons on 23 February. A second reading could follow on 16 or 17 March. ""We've always said we are going to introduce a hybrid Bill for Crossrail in the Spring and this remains the case,"" the Department for Transport said on Sunday. Jeremy de Souza, a spokesman for Crossrail, said on Sunday he could not confirm whether the Treasury was planning to invest £7.5bn or when the bill would go before Parliament. However, he said some impetus may have been provided by the proximity of an election. The new line would go out as far as Maidenhead, Berkshire, to the west of London, and link Heathrow to Canary Wharf via the City. Heathrow to the City would take 40 minutes, dramatically cutting journey times for business travellers, and reducing overcrowding on the tube. The line has the support of the Mayor of London, Ken Livingstone, business groups and the government, but there have been three years of arguments over how it should be funded. The Mail on Sunday's Financial Mail said the £7.5bn of Treasury money was earmarked for spending in £2.5bn instalments in 2010, 2011 and 2012."


It would be interesting to see whether an ensemble method, combining the outputs of several different classifiers, could further improve the accuracy. (When I attempted this, the kernel repeatedly died in the environment I had created.)
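
As a rough illustration of the ensemble idea, the sketch below combines three classifiers by simple majority vote in base R. The prediction vectors are invented for illustration and are not outputs of the models in this notebook; `majority_vote` is a hypothetical helper, and ties are broken alphabetically.

```r
# Hypothetical predictions from three classifiers for five articles
# (invented values, not results from this notebook)
preds <- data.frame(
  svm = c("sport", "business", "tech",     "politics", "business"),
  rf  = c("sport", "politics", "tech",     "politics", "business"),
  nb  = c("sport", "business", "business", "politics", "tech"),
  stringsAsFactors = FALSE
)

# Return the most frequent label among the votes for one article;
# table() sorts labels alphabetically, so ties go to the first label
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# Apply the vote row by row to get one ensemble prediction per article
ensemble <- apply(preds[, c("svm", "rf", "nb")], 1, majority_vote)
print(ensemble)
```

A weighted vote using each classifier's predicted probability would be a natural refinement, at the cost of needing calibrated probabilities from every model.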

It would also be interesting to see how accurate a Naive Bayes model would be.  Although the independence assumption is likely to be violated, Naive Bayes models are often used for classifying text, where the actual values of the probabilities are not relevant, just the most likely class.  The Naive Bayes classifier has the advantage that it can be explained relatively easily as being based on the product of the evidence lifts of each separate feature, as opposed to other classifiers, which are less transparent.
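
To make the "product of evidence lifts" explanation concrete, here is a minimal worked sketch in base R. All probabilities are invented toy numbers, and the lift of a word for a class is taken here to mean P(word | class) / P(word), which is one common reading of the term.

```r
# Toy numbers, invented for illustration only
prior <- c(sport = 0.5, politics = 0.5)

# P(word appears | class), one row per word
lik <- rbind(
  rugby     = c(sport = 0.30, politics = 0.01),
  committee = c(sport = 0.05, politics = 0.20)
)

# Words observed in the article to be classified
words <- c("rugby", "committee")

# Marginal probability of each word across both classes
p_word <- drop(lik %*% prior)

# Evidence lift of each word for each class: P(word | class) / P(word)
lift <- lik[words, ] / p_word[words]

# Naive Bayes score: prior multiplied by the product of the lifts.
# The scores are renormalised because the words are not truly independent.
score <- prior * apply(lift, 2, prod)
posterior <- score / sum(score)

print(round(lift, 3))
print(round(posterior, 3))
```

Each lift can be read on its own: a value above 1 is evidence for the class and a value below 1 is evidence against it, which is what makes the model easy to explain.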

## Neural Network model in Azure Machine Learning

I actually started this task by quickly running a few models in Azure Machine Learning Studio (AML).  I then broke off to create the above analysis to demonstrate that I could address the task using R within a Jupyter notebook.  However, I returned to AML, which allows quick experimentation with different classifiers and parameters and also gives a quick means to operationalise a model by creating an API. 

The input dataset for AML was created by combining the raw data in Excel using VBA.

I trained a Random Forest model, but it was outperformed by a multiclass neural network.

AML allows easy pre-processing of text, for instance removing stop words and numbers.  In this case I performed lemmatization, grouping together the inflected forms of a word based on a dictionary, rather than stemming.
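
The difference between the two approaches can be sketched in base R. The dictionary and the suffix rule below are hypothetical toys, far simpler than a real lemmatizer or the Porter stemmer, and are not what AML uses internally.

```r
# Hypothetical mini-dictionary mapping inflected forms to their lemma
lemma_dict <- c(ran = "run", running = "run", better = "good", studies = "study")

# Lemmatization: look each token up in the dictionary, keep it if unknown
lemmatize <- function(tokens) {
  hits <- tokens %in% names(lemma_dict)
  tokens[hits] <- unname(lemma_dict[tokens[hits]])
  tokens
}

# Crude stemming: strip a few common suffixes with no dictionary at all
crude_stem <- function(tokens) sub("(ing|ies|es|s)$", "", tokens)

tokens <- c("ran", "running", "better", "studies")
print(lemmatize(tokens))
print(crude_stem(tokens))
```

The stemmer leaves irregular forms such as "ran" and "better" untouched and can produce non-words, whereas the dictionary lookup maps every form to a real base word.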

Rather than creating a Document Term Matrix, AML allows feature hashing, which turns tokens into numerical indices.  Although this removes any possibility of understanding which individual words are influencing the predictions, it is fast and efficient and has the advantage that using the trained model to predict on new data does not require the original Document Term Matrix.
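
The idea behind feature hashing can be sketched in base R as follows. The hash function is a simple polynomial hash chosen for readability, and `hash_token`, `hash_features` and the bucket count of 16 are illustrative assumptions rather than AML's actual implementation.

```r
# Map a token to a bucket index in 1..n_buckets with a toy polynomial hash
hash_token <- function(token, n_buckets) {
  h <- 0
  for (code in utf8ToInt(token)) {
    h <- (h * 31 + code) %% n_buckets
  }
  h + 1  # R vectors are 1-indexed
}

# Turn a text into a fixed-length vector of bucket counts
hash_features <- function(text, n_buckets = 16) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  features <- numeric(n_buckets)
  for (token in tokens) {
    idx <- hash_token(token, n_buckets)
    features[idx] <- features[idx] + 1
  }
  features
}

# New text is encoded without any stored vocabulary from training
x_train <- hash_features("BT offers equal access to rivals")
x_new   <- hash_features("Ofcom reviews broadband access")
print(x_train)
print(x_new)
```

Because the index depends only on the token itself, "access" lands in the same bucket in both vectors. Different tokens can collide in one bucket, which is part of why individual words can no longer be recovered from the features.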


![image.png](attachment:image.png)

A Multiclass Neural Network with default parameters achieved a 96.4% overall accuracy (micro-averaged F-score).  Tuning these parameters with a 20-pass random sweep improved the accuracy to 97.1%, exactly the same as the linear SVM model in R.
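
The reason overall accuracy and the micro-averaged F-score coincide here is that, when each article has exactly one label, every misclassification counts as one false positive (for the predicted class) and one false negative (for the actual class). The base R sketch below checks this on a handful of invented labels, which are not results from the model.

```r
# Invented labels for illustration only
actual    <- c("sport", "sport",    "business", "business", "tech", "politics")
predicted <- c("sport", "politics", "business", "tech",     "tech", "politics")

# Confusion matrix with the same class order on both axes
classes <- sort(unique(c(actual, predicted)))
cm <- table(factor(actual, levels = classes),
            factor(predicted, levels = classes))

# Per-class counts, pooled across classes for micro-averaging
tp <- diag(cm)
fp <- colSums(cm) - tp
fn <- rowSums(cm) - tp

micro_precision <- sum(tp) / (sum(tp) + sum(fp))
micro_recall    <- sum(tp) / (sum(tp) + sum(fn))
micro_f1 <- 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

accuracy <- mean(actual == predicted)
print(c(micro_f1 = micro_f1, accuracy = accuracy))
```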

![image.png](attachment:image.png)

### Live model deployed

The dataset on which this model was built consists of BBC news articles from 2004-2005.  As such, the trained model will not be ideal for classifying more recent articles.  For instance, the politicians, sports and entertainment stars from over a decade ago will not be the same as the ones cited now.

However, as an exercise, the model has been deployed on a website, using an API created by AML.  It can be tested by pasting articles into the following webpage:

https://bbcamcclassificationdemo.azurewebsites.net/

(though please note that, although this is developed using Microsoft tools, in Internet Explorer it seems to always return 'Sport' even though in Chrome it works fine!)