Mapmo/Book-Classifier

Sources:

How the project works:

  • extract.py - generates a dataset in JSON format from the most common words in each book
  • classify.py - learns from a dataset created by extract.py and classifies the genre(s) of a given book
  • reqs.txt - lists all the dependencies the scripts need; install them with pip:
    $ pip install -r reqs.txt

Usage:

extract.py
  • You must run the script from a directory that contains subdirectories named after their genres
  • The script looks for a txt directory inside each genre directory. This is useful in case you also want to keep zip files of the books in each genre
  • After you start the script, the runtime depends on how many books you have provided. You can also pass the number of words to extract from each book as a parameter
  • When the extraction finishes, the generated file will be "/tmp/ops.json"
  • $ ls
    Ancient  Classics  Criminal  Fantasy  Horror  Humour  Love  Science  Sci-Fi  Social
    $ ls Ancient/
    all_links  Ancient.words  txt  zips
    $ ls Ancient/txt/ | head
    110-Herodot_-_Istoricheski_noveli.txt
    141-William-Shakespeare_-_Soneti.txt
    1492-Nikolaj_Kun_-_Starogrytski_legendi_i_mitove.txt
    1642-Jean-Froissart_-_Hroniki.txt
    1696-Sun_Dzy_-_Izkustvoto_na_vojnata.txt
    1840-Jean-Pierre-Vernant_-_Starogrytski_mitove_-_Vsemiryt_bogovete_horata.txt
    1984-Konfutsij_-_Dobrijat_pyt_-_Misli_na_velikija_kitajski_mydrets.txt
    2074-Genro_-_Zheljaznata_flejta_sto_dzenski_koana_-_Slovata_na_dzenskite_mydretsi.txt
    217-Starogrytska_lirika.txt
    29-Giovanni-Boccaccio_-_Dekameron.txt
    $ ~/Git/Book-Classifier/extract.py
    Starting Ancient
    Starting Classics
    Starting Criminal
    Starting Fantasy
    Starting Horror
    Starting Humour
    Starting Love
    Starting Science
    Starting Sci-Fi
    Starting Social
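A minimal sketch of what a script like extract.py might do, assuming each genre directory holds a txt/ subdirectory of plain-text books (the function and parameter names below are illustrative, not the real script's):

```python
#!/usr/bin/env python3
# Hypothetical sketch of extract.py: walk genre directories, count
# word frequencies per genre, and dump the top words to a JSON file.
import json
import os
import sys
from collections import Counter

def extract(root=".", top_n=1000, out_path="/tmp/ops.json"):
    dataset = {}
    for genre in sorted(os.listdir(root)):
        txt_dir = os.path.join(root, genre, "txt")
        if not os.path.isdir(txt_dir):
            continue  # skip entries without a txt/ subdirectory
        print("Starting", genre)
        counts = Counter()
        for name in os.listdir(txt_dir):
            path = os.path.join(txt_dir, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                counts.update(f.read().lower().split())
        # keep only the top_n most common words per genre
        dataset[genre] = dict(counts.most_common(top_n))
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(dataset, f, ensure_ascii=False)

if __name__ == "__main__":
    top = int(sys.argv[1]) if len(sys.argv) > 1 else 1000
    extract(top_n=top)
```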

classify.py

  • You must call the script with 2 parameters - *path to the dataset* and *path to the book*
  • The script will return an F1 score from the training and the predicted genre(s) of the book
  • $ ./classify.py /tmp/ops.json Omir_-_Iliada_-_6122-b.txt
    0.6937984496124031
    [('Ancient',)]
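A minimal sketch of the prediction step only, assuming the dataset maps genre to {word: count} as described above; the real classify.py also trains a model and reports the F1 score, and the threshold default here simply reuses the 0.26 value from the tests below:

```python
# Hypothetical sketch of classify.py's prediction step.
# Scores a book against each genre by the fraction of the genre's
# top words that appear in the book, then applies a threshold.
import json

def predict(dataset_path, book_path, threshold=0.26):
    with open(dataset_path, encoding="utf-8") as f:
        dataset = json.load(f)
    with open(book_path, encoding="utf-8", errors="ignore") as f:
        words = set(f.read().lower().split())
    scores = {}
    for genre, counts in dataset.items():
        # fraction of this genre's top words present in the book
        overlap = sum(1 for w in counts if w in words)
        scores[genre] = overlap / max(len(counts), 1)
    # return every genre whose score clears the threshold,
    # falling back to the single best match
    hits = [g for g, s in scores.items() if s >= threshold]
    return hits or [max(scores, key=scores.get)]
```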
    

Tests explained:

  • Test 0 - the idea was to pick the top 3 datasets I had based on their F1 score. Since a single test on a single dataset took ~10 hours, I ran it only on some of the datasets
  • Test 1
    • features: 2,000 - 20,000
    • data_split: 0.1 - 0.45
    • threshold: 0.1 - 0.45
    • № tests: 20
    Results showed that the best solution has to be in this range:
    • features: 5,000 - 14,000
    • data_split: 0.1 - 0.25
    • threshold: 0.25 - 0.3
  • Test 2
    • features: 5,000 - 14,000
    • data_split: 0.1 - 0.25
    • threshold: 0.25 - 0.3
    • № tests: 25
    Results showed that the best solution has to be in this range:
    • features: 5,000 - 10,000
    • data_split: 0.1 - 0.15
    • threshold: 0.25 - 0.3
  • Test 3
    • features: 5,000 - 10,000
    • data_split: 0.1 - 0.15
    • threshold: 0.25 - 0.3
    • № tests: 35
    Results showed that the best solution has to be in this range:
    • features: 7,000
    • data_split: 0.1
    • threshold: 0.26
  • Test 4
    • features: 7,000
    • data_split: 0.1
    • threshold: 0.26
    • № tests: 50
    Results showed that the best possible solution in terms of F1 score is for the 1125-word dataset, with a score of 0.8759894459102902.
    Note that this configuration overfits, so it is not the optimal solution to the task.
  • Tests with books that were not part of the training process
    • Test Hannibal
      • Tests with a 0.15 threshold were accurate, as were tests with more than 1000 top words and a 0.125 threshold
    • Test Twilight
      • Tests with threshold over 0.1 were accurate
    • Test Game of Thrones
      • Tests with threshold 0.1 and above were accurate
    • Test Naked Sun
      • Tests with a threshold of 0.1 and above and more than 950 words were accurate
    • Test Azazel
      • Tests with 0.125 threshold and top words between 1000 and 1250 were accurate

    As a result of the tests, especially the Azazel one, the datasets with the top 1125 and 1250 words seem the most accurate, and since the F1 score of the 1250-word dataset is a bit higher, that is the dataset I consider the winner
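The staged narrowing of parameter ranges in Tests 1-4 can be sketched as repeated random search, where each round samples from the best range found by the previous round. The parameter names and Test 1 ranges below come from the tests above; evaluate() is a hypothetical stand-in for a full train-and-score run:

```python
# Sketch of the staged parameter search: sample n_tests random
# combinations from the given ranges and keep the best-scoring one.
import random

def random_search(evaluate, ranges, n_tests, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_tests):
        params = {
            "features": rng.randint(*ranges["features"]),
            "data_split": round(rng.uniform(*ranges["data_split"]), 3),
            "threshold": round(rng.uniform(*ranges["threshold"]), 3),
        }
        score = evaluate(**params)  # e.g. an F1 score from training
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Test 1 ranges from the README; each later test would reuse the
# narrower range that the previous round identified as best.
test1_ranges = {
    "features": (2000, 20000),
    "data_split": (0.1, 0.45),
    "threshold": (0.1, 0.45),
}
```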

About

University project for artificial intelligence.
