```{css}
.toc {
  color: blue;
}
```


# Abstract

Text mining is the process of transforming unstructured text data into structured, meaningful information. This project explores a specific text mining task known as authorship attribution, which involves analysing linguistic and stylistic features of a text to predict its author or, in this case, its "speaker". The project uses transcripts of the speeches delivered by South African presidents during the State of the Nation Address (SONA) from 1994 to 2023. The primary goal was to develop a classification model that takes a sentence from a SONA speech and correctly predicts the president who said it. Several models were evaluated, including Classification Trees, Random Forests, XGBoost, Naïve Bayes, and feed-forward Neural Network models. However, I find that .... outperformed all other models with a test ....F1-score of...

# Introduction 

Text mining is a branch of computer science, and more specifically of artificial intelligence (AI), that aims to transform unstructured text data into structured formats (Ibm.com, 2023)^1^. It employs a variety of statistical and machine learning methods, including Classification Trees, Random Forests, and Naïve Bayes models, as well as deep learning algorithms. These methods are used to uncover patterns, trends, and hidden relationships within unstructured text.

While traditional text mining relied solely on machine learning algorithms, modern text mining also employs Natural Language Processing (NLP) methods such as parsing and part-of-speech tagging (Greenbook.org, 2017)^2^. This advancement is largely attributed to the exponential growth in data, of which approximately 80% is estimated to reside in unstructured formats. The sheer volume of such data has made text mining an important task in information technology and data analytics.

Text mining is particularly useful in large organizations where decision-making is central and time is limited (Ibm.com, 2023)^1^. For example, banks employ text mining in risk management to track changes in sentiment within financial reports, which is especially useful when evaluating potential business investments. In medical research, text mining is used to cluster medical documents. It is also used for anomaly and outlier detection, as in spam email filtering.

The many applications of text mining across various domains have led to the development of several models, including supervised, semi-supervised, and unsupervised methods (Dogra et al., 2022)^3^. However, determining the most appropriate and effective model for a specific text mining task remains a complex and nuanced challenge.

The aim of this project is to identify the most effective classification model for authorship attribution, which involves determining the author of a given document (Boukhaled and Ganascia, 2017)^4^. The project focuses on transcripts of the speeches delivered by South African presidents during the State of the Nation Address (SONA) from 1994 to 2023 (www.gov.za, 2023)^5^. For context, SONA serves as the annual opening of the South African Parliament, where the President reports on the socio-economic state of the nation to a joint sitting of the two houses of Parliament, namely the National Assembly (NA) and the National Council of Provinces (NCOP). The main objective is to train a model that takes a sentence extracted from a SONA speech as input and accurately predicts which president said it.

# Literature Review

Several studies have explored numerous machine learning methods for author prediction. These techniques include Support Vector Machines (SVMs), Decision Trees, Random Forests (RF), and Neural Networks (NNs), to name a few. The choice of model often depends on the nature of the data and the specific requirements of the task.

Feature selection plays a crucial role in author prediction, where features are broadly categorized into lexical features (e.g., word usage, sentence length, and punctuation usage) and syntactic features (e.g., part-of-speech tags). Some studies have also explored semantic features (e.g., topics and sentiments) for author prediction.
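
To make the idea of lexical features concrete, the sketch below computes three simple ones (word count, comma count, and total punctuation count) per sentence in base R. The example sentences and the column names `n_words`, `n_commas`, and `n_punct` are hypothetical and chosen for illustration only; they are not taken from the SONA data or from any of the cited studies.

```r
# Hypothetical example sentences (not from the SONA data).
sentences <- c(
  "We must, together, build a better nation.",
  "Jobs, jobs, jobs: that is our priority!"
)

# Split each sentence into tokens on white space, dropping empty strings.
tokens <- lapply(strsplit(sentences, "[[:space:]]+"), function(x) x[nzchar(x)])

# Count how many times a pattern occurs in each sentence.
count_matches <- function(pattern, text, ...) {
  lengths(regmatches(text, gregexpr(pattern, text, ...)))
}

lexical_features <- data.frame(
  n_words  = lengths(tokens),                           # sentence length in words
  n_commas = count_matches(",", sentences, fixed = TRUE),
  n_punct  = count_matches("[[:punct:]]", sentences)    # all punctuation marks
)

print(lexical_features)
```

Features of this kind could be appended to a word-count representation of each sentence, although whether they help for a given task has to be tested empirically.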

Shukri (2021)^6^ proposes a method for author prediction in which models are trained on Arabic opinion articles. The study collected 8109 articles from 428 authors for the period 2016 to 2021. Their NN model achieved the highest accuracy of 81.1%, followed by the Logistic Regression (LR) model with an accuracy of 80.8%.

Bauersfeld et al. (2023)^7^ proposed a transformer-based neural network architecture that uses the text content and the author names in the bibliography to determine the author of an anonymous manuscript. The authors generated the largest authorship-identification dataset to date, drawing on all research papers publicly available on arXiv, and achieved an accuracy of 73%.

Similarly, Khalid (2021)^8^ performed author prediction on 210 000 anonymous news headlines from HuffPost (2012-2022). The study used Bag of Words (BoW) and Latent Semantic Analysis (LSA) features as input to classification algorithms such as SVM, RF, and LR models. The LR model trained on all features outperformed all other models, with an accuracy of 94.9%.
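
For readers unfamiliar with the Bag of Words representation mentioned above, the following is a minimal base R sketch of how a document-term count matrix can be built. It is an illustration under simple assumptions (lower-casing, splitting on non-letters, a hypothetical three-document corpus) and is not the implementation used in any of the cited studies.

```r
# Hypothetical mini-corpus (not from the cited studies or the SONA data).
docs <- c(
  "the economy is growing",
  "the people demand jobs",
  "jobs and the economy"
)

# Tokenise: lower-case, then split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Vocabulary: every distinct token in the corpus.
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per vocabulary term; cells hold counts.
bow <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(bow) <- vocab

print(bow)
```

Each row of `bow` can then serve as the feature vector for one document. Word order is discarded, which is the defining simplification of the approach.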

In conclusion, these manuscripts demonstrate the potential of text mining in various applications. These papers also highlight the variety of author-prediction techniques available, such as NN-, LR-, RF- and SVM- models. However, we also note the challenges in applying these techniques, such as the need for large datasets and the availability of useful features, as well as the ability of these models to discriminate between content-related features and author-specific features.

# Data

## Data Source

The data set contains the speeches delivered by South African presidents during the State of the Nation Address (SONA) from 1994 to 2023. The data is publicly available on the South African government website (www.gov.za, 2023)^5^. In years where an election took place, the State of the Nation Address is delivered twice: once before and once after the election.

## Data Description

Seven speeches are available for former president Mandela (1994-1999), ten for former president Mbeki (2000-2008), and ten for former president Zuma (2009-2017). President Ramaphosa has delivered seven speeches to date (2018-2023). Two outliers exist: one speech each for former president de Klerk (1994) and former president Motlanthe (2009).

Together, these speeches form the *sona* data set, with 36 records and 5 variables: *filename*, *speech*, *year*, *president*, and *date* (the date on which the speech was delivered).
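
As a quick consistency check, the per-president counts given above can be tallied against the 36 records in the data set. The short sketch below does this in base R and also prints each president's share of the speeches, which gives a first impression of class imbalance. The object name `speech_counts` is hypothetical.

```r
# Speech counts per president, as listed in the data description.
speech_counts <- c(
  Mandela    = 7,
  Mbeki      = 10,
  Zuma       = 10,
  Ramaphosa  = 7,
  "de Klerk" = 1,
  Motlanthe  = 1
)

# Total number of speeches; should equal the number of records.
total_speeches <- sum(speech_counts)
print(total_speeches)

# Share of speeches per president, rounded to two decimals.
print(round(speech_counts / total_speeches, 2))
```

Note that these shares are at the speech level; the balance at the sentence level depends on how long each president's speeches are.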

## Data Pre-processing

All data was read into R version 4.3.1 using RStudio version 2023.9.1.494. The year of each speech was extracted from the first four characters of the *filename* column and appended to the *sona* data frame as a new column. The president names were extracted from the *filename* column using a regular expression that matched the alphabetical text immediately preceding the ".txt" extension. Subsequently, unnecessary text such as "http" strings, full stops, ampersands, and greater-than and less-than characters was removed, along with trailing white space and new-line characters. Dates were then re-formatted into *dd-mm-yyyy* format. Finally, the pre-processed data was saved as an RDS object for downstream analysis.
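
The steps described above can be sketched in base R as follows. This is a minimal illustration rather than the project's actual script: the two-row data frame, the file-name pattern, the ISO input date format, the treatment of "http" strings as whole URLs, and the output file name `sona_clean.rds` are all assumptions made for the example.

```r
# Hypothetical two-row stand-in for the raw data (illustrative values only).
sona <- data.frame(
  filename = c("1994_post_elections_Mandela.txt",
               "2009_pre_elections_Motlanthe.txt"),
  speech   = c("Madam Speaker & members.\nSee http://example.org for more  ",
               "  Honourable members, <thank you>.\n"),
  date     = c("1994-05-24", "2009-02-06"),  # assumed ISO input format
  stringsAsFactors = FALSE
)

# Year: the first four characters of the file name.
sona$year <- as.integer(substr(sona$filename, 1, 4))

# President: the run of letters immediately before the ".txt" extension.
sona$president <- sub("^.*[^A-Za-z]([A-Za-z]+)\\.txt$", "\\1", sona$filename)

# Text cleaning, in order: URLs first, then single characters, then spacing.
clean <- sona$speech
clean <- gsub("http[^[:space:]]*", " ", clean)  # assumption: drop whole URLs
clean <- gsub("[.&<>]", " ", clean)             # full stops, &, < and >
clean <- gsub("[[:space:]]+", " ", clean)       # new lines and repeated spaces
sona$speech <- trimws(clean)                    # leading/trailing white space

# Dates: re-format to dd-mm-yyyy.
sona$date <- format(as.Date(sona$date, format = "%Y-%m-%d"), "%d-%m-%Y")

# Save for downstream analysis (written to a temporary directory here).
out_file <- file.path(tempdir(), "sona_clean.rds")
saveRDS(sona, out_file)

print(sona[, c("year", "president", "date", "speech")])
```

URLs are removed before full stops so that the dots inside a URL do not split it into fragments that the "http" pattern would no longer catch.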

# References

1. Ibm.com. (2023). *What is Text Mining? \| IBM*. \[online\] Available at: https://www.ibm.com/. \[Accessed 14 Oct. 2023\].

2. Greenbook.org. (2017). *Text Analytics: A Primer*. \[online\] Available at: https://www.greenbook.org/insights/market-research-leaders/text-analytics-a-primer \[Accessed 14 Oct. 2023\].

3. Dogra, V., Verma, S., Kavita, Chatterjee, P., Shafi, J., Choi, J. and Ijaz, M.F. (2022). A Complete Process of Text Classification System Using State-of-the-Art NLP Models. *Computational Intelligence and Neuroscience*, \[online\] 2022, pp.1--26. doi: https://doi.org/10.1155/2022/1883698.

4. Boukhaled, M.A. and Ganascia, J.-G. (2017). Stylistic Features Based on Sequential Rule Mining for Authorship Attribution. \[online\] doi: https://doi.org/10.1016/b978-1-78548-253-3.50008-1.

5. www.gov.za. (2023). *State of the Nation Address \| South African Government*. \[online\] Available at: https://www.gov.za/state-nation-address \[Accessed 14 Oct. 2023\].

6. Shukri, N. (2021). Author Prediction in Text Mining of the Opinion Articles in Arabic Newspapers. \[online\] 16(2), pp.1--5. doi: https://doi.org/10.9790/2834-1602020105.

7. Bauersfeld, L., Romero, A., Muglikar, M. and Scaramuzza, D. (2023). Cracking double-blind review: Authorship attribution with deep learning. *PLOS ONE*, \[online\] 18(6), e0287611. doi: https://doi.org/10.1371/journal.pone.0287611.

8. Khalid, N. (2021). Author Identification Based on NLP. *European Journal of Computer Science and Information Technology*, \[online\] 9(1), pp.1--26. Available at: https://www.eajournals.org/wp-content/uploads/Author-Identification-Based-on-NLP.pdf.