
Marathi author article processing and classification using custom vectorizer and sklearn


AP-Atul/Author-Identification


Author-Identification

This is an implementation of the paper

Stats

1. Information about the .csv file

I have provided a preprocessed CSV file, generated with preprocess.py and dataFrameGen.py. The text vectors are generated by vectorizer.py, which computes features of the articles without using NLTK.

The original dataset contained:

  • Total lines in articles :: 10405
  • Total words in articles :: 358695
  • Total characters in articles :: 1889183
  • Total number of unique words :: 73889
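
Counts like these can be reproduced with a few lines of Python. The following is a minimal sketch, not the code used in this repository: the function name `corpus_stats` is hypothetical, and it assumes that words are whitespace-separated tokens and that characters are counted as Unicode code points, which may differ from how the figures above were obtained.

```python
def corpus_stats(articles):
    """Return line, word, character and unique-word counts for a list of texts.

    Assumption: words are whitespace-separated tokens; characters are
    Unicode code points (so a consonant plus a matra counts as two).
    """
    lines = words = chars = 0
    vocabulary = set()
    for text in articles:
        lines += len(text.splitlines())
        tokens = text.split()
        words += len(tokens)
        chars += len(text)
        vocabulary.update(tokens)
    return {
        "lines": lines,
        "words": words,
        "chars": chars,
        "unique_words": len(vocabulary),
    }


if __name__ == "__main__":
    sample = ["a b\nc", "a d"]
    print(corpus_stats(sample))
```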

2. Features selected

  • line count
  • char count
  • word count
  • average word size
  • vowels per word
  • consonants per word
  • matras per word
  • count of words of fixed size
  • count of words below size
  • count of words above size
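
The features above can be sketched as a small function that turns one article into a ten-element vector. This is an illustration under stated assumptions, not the contents of vectorizer.py: the names `extract_features` and `size` are hypothetical, and vowels, consonants and matras are classified by Devanagari Unicode ranges (independent vowels U+0904 to U+0914, consonants U+0915 to U+0939, dependent vowel signs U+093E to U+094C), which may not match the repository's definitions.

```python
def is_vowel(ch):
    # Assumption: independent Devanagari vowels, U+0904..U+0914.
    return 0x0904 <= ord(ch) <= 0x0914


def is_consonant(ch):
    # Assumption: Devanagari consonants, U+0915..U+0939.
    return 0x0915 <= ord(ch) <= 0x0939


def is_matra(ch):
    # Assumption: dependent vowel signs (matras), U+093E..U+094C.
    return 0x093E <= ord(ch) <= 0x094C


def extract_features(text, size=5):
    """Return the ten features listed above, in the same order.

    `size` is a hypothetical threshold for the three word-size counts.
    Word length is measured in Unicode code points.
    """
    words = text.split()
    n = len(words) or 1  # avoid division by zero on empty text
    lengths = [len(w) for w in words]
    chars = [c for w in words for c in w]
    return [
        len(text.splitlines()),               # line count
        len(text),                            # char count
        len(words),                           # word count
        sum(lengths) / n,                     # average word size
        sum(is_vowel(c) for c in chars) / n,      # vowels per word
        sum(is_consonant(c) for c in chars) / n,  # consonants per word
        sum(is_matra(c) for c in chars) / n,      # matras per word
        sum(l == size for l in lengths),      # words of exactly `size`
        sum(l < size for l in lengths),       # words below `size`
        sum(l > size for l in lengths),       # words above `size`
    ]


if __name__ == "__main__":
    print(extract_features("मी घरी\nआई", size=2))
```

Because no tokenizer or language model is involved, a vector like this needs only the standard library, which is consistent with the "without using NLTK" note above.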

3. Model stats

| Sr. No. | Classifier | Accuracy |
|---|---|---|
| 1 | Naive Bayes (Gaussian) | 89 % |
| 2 | SVC | 70 % |
| 3 | Decision Tree | 99 % |
| 4 | K-Nearest Neighbours | 98.8 % |
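
A comparison of these four classifiers can be set up with sklearn roughly as follows. This is a sketch only: the helper `compare_classifiers` is hypothetical, the train/test split and default hyperparameters are assumptions, and the demo uses synthetic data from `make_classification` rather than the provided CSV, so its output is not expected to match the accuracies in the table.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, test_size=0.25, seed=0):
    """Fit four classifiers on one split and return test accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    # Default hyperparameters are an assumption; the README does not list them.
    models = {
        "Naive Bayes (Gaussian)": GaussianNB(),
        "SVC": SVC(),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "K-Nearest Neighbours": KNeighborsClassifier(),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


if __name__ == "__main__":
    # Synthetic stand-in for the ten-feature article vectors.
    X, y = make_classification(
        n_samples=300, n_features=10, n_informative=6,
        n_classes=3, random_state=0,
    )
    for name, acc in compare_classifiers(X, y).items():
        print(f"{name}: {acc:.1%}")
```

To run it on the real data, replace the synthetic `X` and `y` with the feature columns and author labels loaded from the CSV file.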
