# PDF Classifier Project Documentation

## 1. Project Overview
This project aims to create a classifier that can distinguish between PDF documents and PowerPoint presentations converted to PDF format. The classifier uses machine learning techniques to analyze features extracted from PDF files.

## 2. Data Preparation

### 2.1 Data Splitting
- Implemented an 80:20 train-test split
- Used `sklearn.model_selection.train_test_split` with the `stratify` parameter so that both classes are proportionally represented in each split
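
A minimal sketch of such a stratified split. The feature matrix, the class sizes, and the label encoding (0 = PDF, 1 = PowerPoint) are placeholders chosen for illustration, not the project's actual data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 documents with 3 features each (assumed, not the real dataset)
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = np.array([0] * 60 + [1] * 40)  # assumed encoding: 0 = PDF, 1 = PowerPoint

# test_size=0.2 gives the 80:20 split; stratify=y keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

Fixing `random_state` makes the split reproducible between runs.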

## 3. Feature Extraction
- Extracted the average RGB values of each PDF using the pymupdf library
- Initially planned to add the average word count per page as a feature. However, visual inspection of the data showed that both the PDF and PPT documents consist mainly of images, so pymupdf would not be able to recognise the words in them.
- Normalized the average RGB values to lie between 0 and 1
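
One way this extraction could look, as a sketch rather than the project's actual code. The helper names `mean_rgb` and `pdf_mean_rgb` are hypothetical, and averaging the per-page means (rather than weighting by page size) is an assumption:

```python
import numpy as np


def mean_rgb(pixels: np.ndarray) -> np.ndarray:
    """Mean R, G, B of an (H, W, 3) uint8 array, scaled to the range [0, 1]."""
    return pixels.reshape(-1, 3).mean(axis=0) / 255.0


def pdf_mean_rgb(pdf_path: str) -> np.ndarray:
    """Average the normalized per-page RGB means over all pages of a PDF."""
    import fitz  # PyMuPDF; imported here so mean_rgb works without it installed

    page_means = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Render the page to an RGB pixmap without an alpha channel
            pix = page.get_pixmap(colorspace=fitz.csRGB, alpha=False)
            pixels = np.frombuffer(pix.samples, dtype=np.uint8)
            pixels = pixels.reshape(pix.height, pix.width, pix.n)
            page_means.append(mean_rgb(pixels))
    # Assumption: every page contributes equally to the document average
    return np.mean(page_means, axis=0)
```

Calling `pdf_mean_rgb("some_file.pdf")` would return three values in [0, 1], one per colour channel.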

## 4. Model Selection and Training
- Chose a Support Vector Machine (SVM)
- Initially considered either an SVM or logistic regression
- Decided to go with the SVM because of its effectiveness in high-dimensional spaces

### 4.1 Model Training
- Implemented hyperparameter tuning using GridSearchCV
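
A sketch of how the tuning might be set up. The parameter grid, the 5-fold cross-validation, the scoring metric, and the synthetic data are all assumptions for illustration; the values actually searched in this project are not listed here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the extracted features (assumed, not the real data)
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical grid: regularization strength, kernel type, and kernel width
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}

# Every parameter combination is scored with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated score:", search.best_score_)
print("held-out score:", search.score(X_test, y_test))
```

After fitting, `search.best_estimator_` is the SVM refit on the full training set with the best parameters found.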

## 5. Initial Model Evaluation
- Accuracy: 0.73
- F1 Score:
  - PDF files: 0.79
  - PowerPoint files: 0.64
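
Metrics like these can be computed with `sklearn.metrics`. The labels below are made-up toy values, not the project's predictions, and the encoding (0 = PDF, 1 = PowerPoint) is an assumption:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only (assumed encoding: 0 = PDF, 1 = PowerPoint)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
# average=None returns one F1 score per class, in label order
f1_pdf, f1_ppt = f1_score(y_true, y_pred, average=None)

print(f"Accuracy: {accuracy:.2f}")
print(f"F1 (PDF): {f1_pdf:.2f}")
print(f"F1 (PowerPoint): {f1_ppt:.2f}")
```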

### 5.1 Visualization
- Created a confusion matrix as a seaborn heatmap; the resulting image can be found in the directory
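
A sketch of how such a heatmap could be produced. The toy labels, the class names, and the helper name `save_confusion_heatmap` are hypothetical, and the output file name is left to the caller:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (assumed encoding: 0 = PDF, 1 = PowerPoint)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)


def save_confusion_heatmap(cm, class_names, out_path):
    """Draw the confusion matrix as an annotated heatmap and save it to a file."""
    # Plotting libraries are imported here so the matrix can be computed without them
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, no display needed
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(4, 4))
    sns.heatmap(
        cm, annot=True, fmt="d", cmap="Blues",
        xticklabels=class_names, yticklabels=class_names, ax=ax,
    )
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```

For example, `save_confusion_heatmap(cm, ["PDF", "PowerPoint"], "confusion_matrix.png")` would write the figure to the working directory.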

## 6. Improved Feature Extraction
- Kept extracting the average RGB values for each PDF using pymupdf
- Converted the pixmaps from pymupdf into images, then used OpenCV to apply Canny edge detection
- Used the resulting edge map to compute the ratio of edge pixels to non-edge pixels and the average intensity of the edges in the image
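
A sketch of how these two edge features could be computed. The Canny thresholds (100, 200) and the helper names are assumptions, and "average intensity of the edges" is interpreted here as the mean grayscale value at edge pixels, which may differ from the original implementation:

```python
import numpy as np


def edge_features(gray: np.ndarray, edges: np.ndarray) -> tuple[float, float]:
    """Return (edge / non-edge pixel ratio, mean grayscale value at edges in [0, 1])."""
    edge_mask = edges > 0
    n_edge = int(edge_mask.sum())
    n_non_edge = edge_mask.size - n_edge
    ratio = n_edge / n_non_edge if n_non_edge else float("inf")
    # Assumed definition: average brightness of the pixels lying on an edge
    mean_intensity = float(gray[edge_mask].mean()) / 255.0 if n_edge else 0.0
    return ratio, mean_intensity


def canny_edge_features(rgb: np.ndarray, low: int = 100, high: int = 200):
    """Run Canny on an (H, W, 3) uint8 RGB image and summarize its edges."""
    import cv2  # OpenCV; imported here so edge_features works without it installed

    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, low, high)  # binary edge map: 0 or 255
    return edge_features(gray, edges)
```

The two returned values can be appended to the RGB averages to form one feature row per document.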

## 7. Improved Results and Model Evaluation
- Accuracy: 0.97 for both PDF and PowerPoint files
- F1 Score: 0.97 for both PDF and PowerPoint files

## 8. Possible Improvements
- If I were to make any improvements, I would create a more extensive feature extraction function
- I would possibly extract details about sentence meaning from the PDF
- I could also have tried techniques I learned at university, such as the Sobel filter instead of the Canny filter for edge detection, to see which one performs better

## 9. How to use the program
- The `classify_pdf.py` program in the `pdf_classifier` directory contains a variable called `test_pdf_path`, which you can change to the path of any other `.pdf` file.
- After changing the variable, run `poetry shell` and then run `python classify_pdf.py`; this should produce the results.

## 10. Conclusion
General overview of the system:

1. Create an 80:20 data split
2. Extract RGB and edge features from each document in both the training and testing data
3. Append the extracted feature data to their respective rows
4. Train the SVM on the data using different hyperparameters
5. Run the model on a chosen document, either PPT or PDF

## Personal Challenges
- Overall, I do not have extensive experience with Python. This was my first time using many of these libraries, such as pymupdf, and only the second time I have used libraries like sklearn, as most of the machine learning work I have done in the past was with MATLAB.
- However, with all this considered, I think the methodology with which I approached this was quite sound.

Thank you for this opportunity and I am looking forward to hearing back from you. 

> **⚠️IMPORTANT NOTE:⚠️** 
> - For some reason, the pymupdf library prints this error: "MuPDF error: argument error: cannot create appearance stream for widgets"
> - You can ignore this error; the program will run fine as long as you let it keep running.
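
If the message is distracting, PyMuPDF has a switch for MuPDF's console output. A minimal sketch, assuming the message is purely informational for these files as the note above suggests; `silence_mupdf_messages` is a hypothetical helper name and is not part of the project:

```python
def silence_mupdf_messages() -> None:
    """Stop MuPDF from printing its error and warning messages to the console."""
    import fitz  # PyMuPDF; imported here so the helper can be defined without it installed

    fitz.TOOLS.mupdf_display_errors(False)
    fitz.TOOLS.mupdf_display_warnings(False)
```

Calling this once before opening any documents should be enough. Note that it hides all MuPDF messages, including ones that might point to a genuinely damaged file.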