This is a machine learning application that classifies any text dataset into labels. The app trains several built-in text classifiers, selects the one with the highest accuracy, and saves it as the model used to predict labels for the user's unlabelled data.
Clone the whole repository, keeping the HTML templates as they are, and run it with any Python IDE.
Follow this guide to install the libraries required to run this application.
Download Microsoft Visual Studio Build Tools, which are required to install some of the libraries. Download link
Next, install all the libraries by opening cmd with administrator rights and running the following command in the repository folder:
pip install -r requirements.txt
Install the spaCy language model by executing this command in administrator mode:
python -m spacy download en
To finish installing NLTK, execute these commands one by one in administrator mode:
python
import nltk
nltk.download()
A window will pop up. Download and install all the files, then close it.
Run the following commands to set up a Linux system to run the program:
sudo python3 -m pip install -r requirements.txt
python3 -m spacy download en
Open a terminal in the local repository and run the fit_dataset.py program.
- For Windows:
python fit_dataset.py
- For Linux:
python3 fit_dataset.py
Once the program is running, go to
http://localhost:5000/
in your preferred browser. You should see the following screen:
Let us consider a sample dataset and use it in our application. The dataset consists of consumer finance complaints in 4 predefined categories: ‘Credit reporting’, ‘Debt collection’, ‘Mortgage’, and ‘Student loan’. (Source: data.gov)
The input data to our application should contain a feature column and a label column. The sample dataset looks like the following:
| | Consumer_complaint_narrative | Product |
|---|---|---|
| 0 | This company refuses to provide me verification and validation of debt per ... | Debt collection |
| 1 | Started the refinance of home mortgage process with cash out option on XX/ ... | Mortgage |
| 2 | I was dropped from my income based repayment plan by FedLoan servicing for ... | Student loan |
| 3 | The first communication that I received from the debt collector was a court ... | Debt collection |
| ... | ... | ... |
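Before uploading a file, it can help to confirm that it really contains the feature and label columns. The following is a minimal sketch using only Python's standard `csv` module; the helper name `check_columns` is hypothetical and is not part of this application.

```python
import csv


def check_columns(csv_path, feature, label):
    """Return the number of rows whose feature and label are both non-empty.

    Hypothetical helper for illustration; raises ValueError when either
    column name is absent from the header row.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in (feature, label) if name not in header]
        if missing:
            raise ValueError(f"missing column(s): {', '.join(missing)}")
        # Count only rows that can actually be used for training.
        return sum(1 for row in reader if row[feature] and row[label])
```

For the sample dataset, the call would be `check_columns("complaints.csv", "Consumer_complaint_narrative", "Product")`, where `complaints.csv` is a placeholder file name.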
Now assuming the application is running, let us proceed:
Step 1: Upload a labelled dataset in .csv format to train the model. Enter the feature name and the label name (here, "Consumer_complaint_narrative" and "Product" respectively). Enter the name under which your trained model should be saved. Finally, click the Train button.
The loading screen is shown until the models are trained.
Step 2: The accuracy table of the classifiers is displayed. The classifier with the best accuracy is chosen automatically and saved as a model under the name given in the previous step.
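The selection in Step 2 can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the code in `fit_dataset.py`: the three candidate classifiers, the TF-IDF features, the hold-out split, and the function name `pick_best_classifier` are all choices made for this example, and the sample texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def pick_best_classifier(texts, labels, test_size=0.25, seed=0):
    """Fit each candidate and return (best_name, best_pipeline, scores)."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # Assumed candidate set; the application may use different classifiers.
    candidates = {
        "MultinomialNB": MultinomialNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "LinearSVC": LinearSVC(),
    }
    scores, fitted = {}, {}
    for name, classifier in candidates.items():
        pipeline = make_pipeline(TfidfVectorizer(), classifier)
        pipeline.fit(x_train, y_train)
        # Accuracy on the held-out split is the selection criterion.
        scores[name] = accuracy_score(y_test, pipeline.predict(x_test))
        fitted[name] = pipeline
    best_name = max(scores, key=scores.get)
    return best_name, fitted[best_name], scores


# Invented example texts, for illustration only.
SAMPLE_TEXTS = [
    "the debt collector keeps calling about a debt I do not owe",
    "collector sent a letter demanding payment of an old debt",
    "I asked the collection agency to validate the debt",
    "debt collection company reported a false debt",
    "my mortgage payment was applied to the wrong month",
    "the bank delayed my home mortgage refinance",
    "mortgage servicer lost my escrow payment for the home loan",
    "refinance of my home mortgage was denied without reason",
]
SAMPLE_LABELS = ["Debt collection"] * 4 + ["Mortgage"] * 4
```

Calling `pick_best_classifier(SAMPLE_TEXTS, SAMPLE_LABELS)` returns the winning classifier's name, its fitted pipeline, and a dictionary of accuracies that corresponds to the accuracy table. With a dataset this small the scores are not meaningful; the point is only the selection logic.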
Step 3: Upload an unlabelled dataset in .csv format to predict the labels. Enter the name of the feature column in the unlabelled file whose labels are to be predicted. Next, enter the filename of the saved model. Then enter the name of the file in which the results will be saved. Click OK.
Step 4: The results are displayed in the following format:
(Note: If you have already trained a model on your dataset and don't want to train it again, you can jump directly to step 3 by clicking "Go to Predict Page" in step 1.)
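The prediction step (Step 3) can also be sketched outside the web interface. The example below assumes the saved model is a pickle file holding an object with a scikit-learn style `predict` method; how the application actually stores its models is not stated here, so treat the storage format, the function names, and the `Predicted_label` column name as assumptions.

```python
import csv
import pickle


def load_model(model_path):
    """Load a pickled model (assumed format). Only unpickle trusted files."""
    with open(model_path, "rb") as handle:
        return pickle.load(handle)


def predict_to_csv(model, input_path, feature, output_path,
                   label="Predicted_label"):
    """Write the input rows plus one predicted-label column; return row count."""
    with open(input_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)
    if feature not in fieldnames:
        raise ValueError(f"missing feature column: {feature}")
    # Empty or missing cells are passed to the model as empty strings.
    predictions = model.predict([row[feature] or "" for row in rows])
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames + [label], extrasaction="ignore"
        )
        writer.writeheader()
        for row, predicted in zip(rows, predictions):
            writer.writerow({**row, label: predicted})
    return len(rows)
```

A typical call would be `predict_to_csv(load_model("my_model.pkl"), "unlabelled.csv", "Consumer_complaint_narrative", "results.csv")`, where all three file names are placeholders.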
- Flask - A web microframework for Python
- scikit-learn - Machine learning library in Python
- spaCy - Industrial-Strength Natural Language Processing
- NLTK - NLP tasks
- pandas - Data structures and data analysis tools for Python
- Allu Praveen
- Deepak Divya Tejaswi
- Nimit Kanani
- Nitish Kumar Naineni