# Text Classification of Textbooks
### Pierre Lardet

This code is presented as a Python notebook, using Python 3.11.2. My thoughts are presented chronologically.

Versions of libraries used are listed below.


### Data Cleaning

Before anything else, we need to read in the text and convert it into a format that is easy to manipulate. I'm going to use a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). I noticed that the Computer Science text files were nested in an extra directory, so I manually moved them to make the directory structure consistent across subjects. Next, I read in all of the text files in an easily extensible manner, stored them in a list of `[subject, text, path]` rows, and created a new Pandas dataframe from that list to make further manipulation easy.

In [17]:
import glob

subjects = ['Computer_Science', 'History', 'Maths']
raw_texts = []

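def read_subject_data(subject: str) -> None:
    # Collect every .txt file directly inside ./data/<subject>/.
    for path in glob.glob(f'./data/{subject}/*.txt'):
        # The context manager closes the file even if read() raises.
        with open(path, 'r') as f:
            text = f.read()
        raw_texts.append([subject, text, path])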

for subject in subjects:
    read_subject_data(subject)
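
The manual move described above could, in principle, be avoided by searching each subject directory recursively. The following is a minimal sketch, not the approach used in this notebook: `find_text_files` is a hypothetical helper, and it assumes that every `.txt` file anywhere under `./data/<subject>/` belongs to that subject.

```python
import glob
import os

def find_text_files(subject: str, root: str = './data') -> list[str]:
    """Return every .txt path under root/subject, at any nesting depth."""
    # With recursive=True, '**' matches zero or more directories,
    # so files directly in the subject folder are found as well.
    pattern = os.path.join(root, subject, '**', '*.txt')
    return sorted(glob.glob(pattern, recursive=True))
```

Sorting the paths is an extra choice made here so that the row order is reproducible between runs.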

In [46]:
import pandas as pd

textbooks = pd.DataFrame(raw_texts, columns=['subject', 'text', 'dir'])

print('-'*50)
print(f'Sample of the data: \n {textbooks.head()}')
print('-'*50)
print(f'Dimensions: {textbooks.shape}')
print('-'*50)
print(f'Counts of each subject: \n {textbooks.groupby("subject").count()}')
print('-'*50)



--------------------------------------------------
Sample of the data: 
             subject                                               text
0  Computer_Science  4.8 Exercises 275\n4.15 [IS) <§§4.2, 4.3> One ...
1  Computer_Science  4.5 Fallacies and Pitfalls 26.\nFirst we find ...
2  Computer_Science  518 Chapter 7 Large and Fast: Exploiting Memor...
3  Computer_Science  Computers\nReconstructing the\nin the\nAncient...
4  Computer_Science  230 Chapter 3 Arithmetic: for Computers\n3.9 [...
--------------------------------------------------
Dimensions: (1356, 2)
--------------------------------------------------
Counts of each subject: 
                   text
subject               
Computer_Science   642
History            500
Maths              214
--------------------------------------------------


Now we have a dataframe to work with. The text is currently very messy, with lots of extra characters and irregular spacing. To use the text as input to an ML classification model, it needs to be much cleaner. The desired format is a list of lower-case words for each sample, which can later be converted to numeric values. As a first step, the cell below replaces newlines and runs of non-word characters with single spaces.

In [54]:
import re

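def clean_text(text: str) -> str:
    # Replace newlines with spaces so each sample sits on one line.
    text = re.sub(r'\n', ' ', text)
    # Collapse every run of non-word characters (punctuation, whitespace) into one space.
    text = re.sub(r'\W+', ' ', text)
    return text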

textbooks['text'] = textbooks['text'].apply(clean_text)

print(textbooks.head())
textbooks.dropna(how='any',inplace=True)
textbooks.to_csv('output.csv', index=False)


            subject                                               text
0  Computer_Science  4 8 Exercises 275 4 15 IS 4 2 4 3 One user has...
1  Computer_Science  4 5 Fallacies and Pitfalls 26 First we find th...
2  Computer_Science  518 Chapter 7 Large and Fast Exploiting Memory...
3  Computer_Science  Computers Reconstructing the in the Ancient Wo...
4  Computer_Science  230 Chapter 3 Arithmetic for Computers 3 9 IOJ...
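
The output above still contains mixed-case text, so one further step is needed to reach the list of lower-case words described earlier. Here is a minimal sketch of that step, using only the standard library. `tokenize` and `word_counts` are hypothetical names, not functions defined elsewhere in this notebook, and counting words is just one simple way to obtain numeric values.

```python
from collections import Counter

def tokenize(cleaned: str) -> list[str]:
    """Lower-case an already cleaned string and split it on whitespace."""
    return cleaned.lower().split()

def word_counts(tokens: list[str]) -> Counter:
    """Map each distinct word to the number of times it occurs."""
    return Counter(tokens)

# Illustrative input in the same shape as the cleaned text above.
sample = '4 5 Fallacies and Pitfalls 26 First we find'
tokens = tokenize(sample)
print(tokens)
print(word_counts(tokens).most_common(3))
```

In the notebook this could be applied with `textbooks['text'].apply(tokenize)`, assuming the column holds the strings produced by `clean_text`.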
