# Applying Binary Classification with Logistic Regression
---

_In many classification problems, there are more than two classes that are of interest. We might wish to predict the genres of songs from samples of audio, or classify images of galaxies by their types. The goal of multi-class classification is to assign an instance to one of the set of classes. scikit-learn uses a strategy called one-vs.-all, or one-vs.-the-rest, to support multi-class classification. One-vs-all classification uses one binary classifier for each of the possible classes. The class that is predicted with the greatest confidence is assigned to the instance.
LogisticRegression supports multi-class classification using the one-versus-all strategy out of the box. Let's use LogisticRegression for a multi-class classification problem._

## The Problem: 

Here we want to use scikit-learn to find the movies with good reviews.
In this example, we will classify the sentiments of phrases taken from movie reviews in the Rotten Tomatoes data set. Each phrase can be classified as one of the following sentiments: negative, somewhat negative, neutral, somewhat positive, or positive. 
While the classes appear to be ordered, the explanatory variables that we will use do not always corroborate this order due to sarcasm, negation, and other linguistic phenomena. Instead, we will approach this problem as a multi-class classification task.

## The Data Set:
The data can be downloaded from https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data. 
This Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis

## Exploring the Data using Pandas

In [1]:
import pandas as pd
df = pd.read_csv('datasets/train.tsv', header=0, delimiter='\t')
print df.count()
print "--------------------"
print df.head()

PhraseId      156060
SentenceId    156060
Phrase        156060
Sentiment     156060
dtype: int64
--------------------
   PhraseId  SentenceId                                             Phrase  \
0         1           1  A series of escapades demonstrating the adage ...   
1         2           1  A series of escapades demonstrating the adage ...   
2         3           1                                           A series   
3         4           1                                                  A   
4         5           1                                             series   

   Sentiment  
0          1  
1          2  
2          2  
3          2  
4          2  


The Sentiment column contains the response variables. The 0 label corresponds to the sentiment negative, 1 corresponds to somewhat negative, and so on. The Phrase column contains the raw text. Each sentence from the movie reviews has been parsed into smaller phrases. We won't require the PhraseId and SentenceId columns in this example. Let's print some of the phrases and examine them.

In [2]:
print df['Phrase'].head(10)

0    A series of escapades demonstrating the adage ...
1    A series of escapades demonstrating the adage ...
2                                             A series
3                                                    A
4                                               series
5    of escapades demonstrating the adage that what...
6                                                   of
7    escapades demonstrating the adage that what is...
8                                            escapades
9    demonstrating the adage that what is good for ...
Name: Phrase, dtype: object


In [3]:
#Examining the target class
print df['Sentiment'].describe()

count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64


In [4]:
print df['Sentiment'].value_counts()

2    79582
3    32927
1    27273
4     9206
0     7072
Name: Sentiment, dtype: int64


In [5]:
print df['Sentiment'].value_counts()/df['Sentiment'].count()

2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64


The most common class, Neutral, includes more than 50 percent of the instances.
Accuracy will not be an informative performance measure for this problem, as a
degenerate classifier that predicts only Neutral can obtain an accuracy near 0.5.
Approximately one quarter of the reviews are positive or somewhat positive, and
approximately one fifth of the reviews are negative or somewhat negative.

## Train the classifier with scikit-learn

In [6]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model.logistic import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.metrics.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

df = pd.read_csv('datasets/train.tsv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3,verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)
print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:  5.5min
[Parallel(n_jobs=3)]: Done  72 out of  72 | elapsed:  8.3min finished


Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best score: 0.619
Best parameters set:
	clf__C: 10
	vect__max_df: 0.25
	vect__ngram_range: (1, 2)
	vect__use_idf: False


## Evaluating our classifier using Multi-class classification performance metrics

In [7]:
predictions = grid_search.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, predictions)
print 'Confusion Matrix:', confusion_matrix(y_test, predictions)
print 'Classification Report:', classification_report(y_test, predictions)

Accuracy: 0.638164808407
Confusion Matrix: [[ 1129  1719   608    69    11]
 [  905  6033  6210   561    16]
 [  205  3117 32703  3523   161]
 [   33   412  6506  8244  1242]
 [    6    51   525  2354  1687]]
Classification Report:              precision    recall  f1-score   support

          0       0.50      0.32      0.39      3536
          1       0.53      0.44      0.48     13725
          2       0.70      0.82      0.76     39709
          3       0.56      0.50      0.53     16437
          4       0.54      0.36      0.44      4623

avg / total       0.62      0.64      0.63     78030



First, we make predictions using the best parameter set found by using grid searching.
While our classifier is an improvement over the baseline classifier, it frequently
mistakes Somewhat Positive and Somewhat Negative for Neutral.

## Summary:
Here we then discussed multi-class classification, a task in which each instance must be assigned one label from a set of labels. We used the one-vs.-all strategy to classify the sentiments of movie reviews.