text-mining-challenge

A text mining challenge from the INF582 course.
Please check out the GitHub version of our repository for a tidier display of this README.md file!

Code structure

Directories

The structure of the directories should be kept as it is in our project:

data/
- *.csv (put the node_information.csv file here)
- *.txt (testing_set.txt and training_set.txt)
featureEngineering/
- abstractFeatures/
- graphArticleFeatures/
- graphAuthorsFeatures/
- journalFeatures/
- lsaFeatures/
- originalFeatures/
submissions/
- *.csv (all the submission.csv files will be exported here)
report/

Note that each *Features/ folder must contain an output/ subfolder.
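As a convenience, the layout above could be created with a short script. This is only a sketch: the function name `create_layout` and the `FEATURE_SETS` constant are hypothetical and not part of the repository; the directory names are taken from the listing above.

```python
from pathlib import Path

# Feature-set folders listed under featureEngineering/ in this README.
FEATURE_SETS = [
    "abstractFeatures",
    "graphArticleFeatures",
    "graphAuthorsFeatures",
    "journalFeatures",
    "lsaFeatures",
    "originalFeatures",
]


def create_layout(root):
    """Create the expected directories under `root` and return their paths."""
    root = Path(root)
    created = []
    # Top-level folders for input data, exported submissions and the report.
    for name in ("data", "submissions", "report"):
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    # Each feature-set folder gets its own output/ subfolder.
    for feature_set in FEATURE_SETS:
        path = root / "featureEngineering" / feature_set / "output"
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_layout(".")` from the repository root leaves existing folders untouched and only adds the missing ones.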

Feature Engineering

We compared each feature set individually on the same cross-validation train/test split, using the same regressor: RandomForestRegressor with random_state set to 42.

Bear in mind that these are feature sets, not single features. For instance, the graphAuthors feature set is composed of 7 features ("meanACiteB_col", "maxACiteB_col", "AOut_col", "BIn_col", "ACiteAMean_col", "ACiteASum_col", "BOut_col").

Feature Set	Individual F1 score
'lsa'	0.576185
'original'	0.811745
'graphAuthors'	0.879161
'graphArticles'	0.992431
'journal'	0.611100
'similarity'	0.778112
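The per-feature-set comparison above can be sketched roughly as follows. This is not the project's actual code: the data is synthetic, the helper `score_feature_set` is hypothetical, and binarizing the regressor output at 0.5 before computing F1 is an assumption, as are the 75/25 split and `n_estimators=50`. Only the regressor and `random_state=42` come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def score_feature_set(X, y, threshold=0.5, seed=42):
    """Fit a RandomForestRegressor on one feature set and return its F1 score.

    The regressor outputs continuous values, so predictions are binarized
    at `threshold` (an assumption) before F1 is computed.
    """
    # The fixed seed gives every feature set the same train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = (model.predict(X_test) >= threshold).astype(int)
    return f1_score(y_test, y_pred)


# Synthetic stand-ins for two feature sets: one informative, one pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
feature_sets = {
    "informative": y[:, None] + rng.normal(0, 0.3, size=(400, 3)),
    "noise": rng.normal(0, 1, size=(400, 3)),
}
scores = {name: score_feature_set(X, y) for name, X in feature_sets.items()}
```

Because the split and the model seed are fixed, differences between the entries of `scores` reflect the feature sets rather than sampling noise.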

Model tuning and comparison

We compared the following algorithms and obtained these F1 scores:

Algorithm	F1 score
Gradient Boosting	0.9
Random Forest Regressor	0.8
Logistic Regression	0.9

To prevent overfitting, we ran cross-validation on a sample of the training set of approximately the same size. For each classifier, explain the procedure that was followed to tackle parameter tuning and prevent overfitting.
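One way to cross-validate on a sample of the training set is sketched below. It is an illustration under stated assumptions, not the procedure actually used: the helper `cross_validate_on_sample`, the sample size, the 5 stratified folds and the synthetic data are all hypothetical choices, and only two of the compared algorithms are shown.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cross_validate_on_sample(model, X, y, sample_size, n_splits=5, seed=42):
    """Return the mean F1 of `model` over stratified folds of a random sample."""
    rng = np.random.default_rng(seed)
    # Draw the sample without replacement so no row appears in two folds.
    idx = rng.choice(len(y), size=min(sample_size, len(y)), replace=False)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_scores = cross_val_score(model, X[idx], y[idx], cv=cv, scoring="f1")
    return fold_scores.mean()


# Synthetic binary problem standing in for the real training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=600) > 0).astype(int)

models = {
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
cv_results = {
    name: cross_validate_on_sample(model, X, y, sample_size=400)
    for name, model in models.items()
}
```

Reusing the same seed for every model means each one is scored on the same sample and the same folds, which keeps the comparison fair.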
