Author: Stephen Tse <***@cmu.edu>
This project implements a crude Natural Language Processing (NLP) system that predicts the tags (marked in Begin-Inside-Outside (BIO) format) of the words in a sentence using multinomial logistic regression. The model is a probability distribution over the current tag y_{t}, parameterized by \theta and a feature vector built from the previous word w_{t-1}, the current word w_{t}, and the next word w_{t+1}.
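A minimal sketch of the per-position model, assuming a sparse binary feature vector indexed by (w_{t-1}, w_{t}, w_{t+1}) plus a bias term, and a parameter matrix `theta` with one row per tag (the names and layout are illustrative, not the project's actual API):

```python
import numpy as np

def tag_distribution(theta, feature_indices):
    """Softmax distribution over tags y_t given the active feature indices.

    theta           -- (num_tags, num_features) parameter matrix (illustrative)
    feature_indices -- indices of the active binary features derived from
                       w_{t-1}, w_t, w_{t+1} and a bias term
    """
    scores = theta[:, feature_indices].sum(axis=1)  # dot product with the sparse feature vector
    scores -= scores.max()                          # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()            # p(y_t | x_t; theta)
```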
- Data files are stored as tab-separated values and may be encoded as Unicode strings.
- The first line of the dataset must not be an empty line.
- This implementation does not perform any form of regularization (L1 / L2).
- The implementation of Stochastic Gradient Descent (SGD) does not shuffle the training data, i.e. it goes through the training dataset in its original order during each epoch. This is not strictly "stochastic", but it produces deterministic results for graders.
- This implementation does not check for log-likelihood convergence; instead, the user must specify the number of SGD epochs for each run.
- When multiple labels have the same likelihood, the program breaks the tie by choosing the label with the smallest ASCII value (see the sketch after this list).
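A minimal sketch of that tie-breaking rule, assuming the tag set is kept sorted so that np.argmax (which returns the first index achieving the maximum) naturally selects the label with the smallest ASCII value; the tag names are illustrative:

```python
import numpy as np

# Illustrative BIO tag set, sorted so that index order matches ASCII order.
TAGS = sorted(["B-fromloc.city_name", "I-fromloc.city_name", "O"])

def predict_tag(probabilities):
    """Pick the most likely tag; ties resolve to the smallest ASCII label.

    Because TAGS is sorted and np.argmax returns the first maximal index,
    equal probabilities yield the lexicographically smallest tag.
    """
    return TAGS[int(np.argmax(probabilities))]
```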
python tagger.py <train input> <validation input> <test input> <train out> <test out> <metrics out> <num epoch>
<train input>: path to the training input .tsv file
<validation input>: path to the validation input .tsv file
<test input>: path to the test input .tsv file
<train out>: path to the output .labels file to which the predictions on the training data should be written
<test out>: path to the output .labels file to which the predictions on the test data should be written
<metrics out>: path to the output .txt file to which metrics such as train and test error should be written
<num epoch>: integer specifying the number of times SGD loops through all of the training data (e.g., if <num epoch> equals 5, then each training example will be used in SGD 5 times)
Input:
python tagger.py largedata/train.tsv largedata/validation.tsv largedata/test.tsv train_out.labels test_out.labels metrics_out.txt 10
Three files are generated in the root directory: train_out.labels, test_out.labels, and metrics_out.txt. The metrics output reports the objective function value (i.e. the negative average log-likelihood) on both the training dataset and the validation dataset after each epoch, as well as the final model prediction error rates on both the training dataset and the test dataset.
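For reference, a minimal sketch of the objective reported in the metrics file, the negative average log-likelihood of the gold tags (function and variable names are illustrative assumptions, not the project's actual code):

```python
import numpy as np

def negative_avg_log_likelihood(prob_rows, gold_indices):
    """Negative average log-likelihood of the gold tags.

    prob_rows    -- per-word probability vectors over the tag set
    gold_indices -- index of the correct tag for each word
    """
    log_likelihoods = [np.log(p[y]) for p, y in zip(prob_rows, gold_indices)]
    return -float(np.mean(log_likelihoods))
```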
The sample datasets included are from the Airline Travel Information System (ATIS) dataset. Each dataset consists of attributes (words) and labels (airline flight information tags in Begin-Inside-Outside (BIO) format). The attributes and tags are separated into sequences (i.e. phrases), with a blank line between each sequence. See here for more information.
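A minimal sketch of reading such a file into sequences, assuming each non-blank line is a tab-separated word/tag pair and blank lines delimit sequences (the reader name and structure are illustrative):

```python
import io

def read_sequences(path):
    """Parse a BIO .tsv file into a list of sequences of (word, tag) pairs."""
    sequences, current = [], []
    with io.open(path, encoding="utf-8") as f:   # data may be Unicode-encoded
        for line in f:
            line = line.rstrip("\n")
            if not line:                         # a blank line ends the current sequence
                if current:
                    sequences.append(current)
                    current = []
                continue
            word, tag = line.split("\t")
            current.append((word, tag))
    if current:                                  # final sequence without a trailing blank line
        sequences.append(current)
    return sequences
```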
Language: Python 2.7 / 3.6
Dependency Requirements: numpy (version 1.7.1)