# Logistic Regression Lab with pipelines

In this lab you will try out pipelines with what you've learned so far and practice logistic regression on text passages labeled by author.

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report
%matplotlib inline

In [8]:
df_1kH  = pd.read_csv('df_1k_H.csv', index_col=0, encoding='utf-8')
df_1kS  = pd.read_csv('df_1k_S.csv', index_col=0, encoding='utf-8')
df_1kW  = pd.read_csv('df_1k_W.csv', index_col=0, encoding='utf-8')



Then, let's create the training and test sets:

In [6]:
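# sklearn.cross_validation was removed in scikit-learn 0.20; these helpers now live in model_selection
from sklearn.model_selection import train_test_split, cross_val_score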

In [9]:
df_1kW.head()

Unnamed: 0,0,code
0,"[This, book, is, a, record, of, a, pleasure, t...",1
1,"[., But, it, was, not, ., There, was, a, toler...",1
2,"[a, year, ., A, good, many, expedients, were, ...",1
3,"[no, particle, of, trimming, about, this, mons...",1
4,"[fog, !, We, got, plenty, of, fresh, oranges, ...",1


In [10]:
df_1kS.head()

Unnamed: 0,0,code
0,"[This book is a record of a pleasure trip., If...",1
1,[and then staggered away and fell over the coo...,1
2,[When the ship rolled to starboard the whole p...,1
3,"[., ., $21.70 Happiness reigned once more in ...",1
4,[To see it is to see a vision of home itself a...,1


In [11]:
df_1kH.head()

Unnamed: 0,0,code
0,This book is a record of a pleasure trip. If ...,1
1,ly see with the glasses. We could not properl...,1
2,"l's any use--do you? They're only a bother, a...",1
3,"en, and boys and girls, all ragged and barefoo...",1
4,eternally substantial; and everywhere are tho...,1


In [30]:
X = df_1kW['0'].values
y = df_1kW['code'].values

In [13]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [14]:
# Initialize a CountVectorizer
cvec_W = CountVectorizer(decode_error='ignore', stop_words='english') # single-word count vectorizer

In [15]:
# Initialize the TF-IDF transformer, which will transform the count matrix to a TF-IDF matrix
tfidf = TfidfTransformer()

In [16]:
# Initialize the logistic regression
logit = LogisticRegression()

In [17]:
# Turn text into counts.
X_train_count = cvec_W.fit_transform(X_train)

In [18]:
# Turn counts into TF-IDF weights.
X_train_df = tfidf.fit_transform(X_train_count)

In [19]:
# Fit the Logistic Regression to the training data
q_clf = logit.fit(X_train_df, y_train)


In [20]:
X_test_count = cvec_W.transform(X_test)

In [21]:
X_test_df = tfidf.transform(X_test_count)


In [22]:
predicted = q_clf.predict(X_test_df)

In [25]:
print(classification_report(y_test, predicted, target_names=['Twain', 'Wilde', 'Lincoln', 'Modern']))


             precision    recall  f1-score   support

      Twain       0.84      0.81      0.82       350
      Wilde       0.92      0.89      0.90       344
    Lincoln       0.91      0.93      0.92       309
     Modern       0.88      0.91      0.89       319

avg / total       0.88      0.88      0.88      1322



## 1. Model Pipeline

Try out making pipelines with different transformations (look at the scikit-learn documentation for some that you think would be good) with a LogisticRegression instance. 

Notice that a `sklearn.pipeline.Pipeline` can have an arbitrary number of transformation steps, but only one estimator step, which is optional and must be the last one in the chain.

In [26]:
from sklearn.pipeline import Pipeline
quote_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('logit', LogisticRegression())])



In [27]:
quote_fit = quote_clf.fit(X_train, y_train)

In [28]:
quote_pred = quote_fit.predict(X_test)

In [29]:
print(classification_report(y_test, quote_pred, target_names=['Twain', 'Wilde', 'Lincoln', 'Modern']))

             precision    recall  f1-score   support

      Twain       0.80      0.77      0.79       350
      Wilde       0.88      0.87      0.87       344
    Lincoln       0.86      0.94      0.90       309
     Modern       0.91      0.88      0.89       319

avg / total       0.86      0.86      0.86      1322



## 2. Train the model
Use `X_train` and `y_train` to fit the model.
Use `X_test` to generate predicted values for the target variable and save those in a new variable called `y_pred`.
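
As a minimal sketch of this step, the snippet below fits a pipeline and stores its predictions in `y_pred`. The eight-sentence corpus and its 0/1 labels are made up so that the example runs on its own; in the lab you would use the `X_train`, `X_test`, and `y_train` created earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the lab data (two made-up classes)
texts = ["the river boat drifted south", "a river pilot steers the boat",
         "the boat ran aground on the river", "the pilot loved the river",
         "art is quite useless", "beauty is a form of genius",
         "art reveals beauty", "genius and beauty outlast art"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

model = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('logit', LogisticRegression())])

# fit() runs fit_transform on each transformer, then fits the classifier
model.fit(X_train, y_train)

# predict() applies transform (not fit_transform) to the test text
y_pred = model.predict(X_test)
print(y_pred)
```

Because the pipeline owns the vectorizer, the test text never influences the vocabulary or the IDF weights.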

## 3. Evaluate the model accuracy

1. Use the `confusion_matrix` and `classification_report` functions to assess the quality of the model.
1. Embed the results of the `confusion_matrix` in a Pandas DataFrame with appropriate column names and index, so that it's easier to see what kinds of errors the model makes.
1. For each author, are there more false positives or false negatives? (Remember that we are trying to predict the author of each passage.)
1. How does that relate to what the `classification_report` is showing?
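
One possible way to label the confusion matrix is sketched below. The true and predicted labels are invented for illustration, and the mapping of integer codes 0 to 3 onto the four author names is an assumption; check it against the `code` column of your data.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

authors = ['Twain', 'Wilde', 'Lincoln', 'Modern']  # assumed order of the codes

# Made-up labels, only to show the mechanics
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_hat = [0, 1, 1, 1, 2, 3, 3, 3]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2, 3])
cm_df = pd.DataFrame(cm,
                     index=['actual ' + a for a in authors],
                     columns=['predicted ' + a for a in authors])
print(cm_df)

# Per class: false positives are the off-diagonal entries of its column,
# false negatives are the off-diagonal entries of its row
false_pos = cm.sum(axis=0) - cm.diagonal()
false_neg = cm.sum(axis=1) - cm.diagonal()
print(dict(zip(authors, false_pos.tolist())))
print(dict(zip(authors, false_neg.tolist())))
```

In a multi-class problem, false positives and false negatives are defined per class, which is why the precision and recall columns of `classification_report` have one row per author.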

## 4. Improving the model

Can we improve the accuracy of the model?

One way to do this is to tune the parameters controlling it.

You can get a list of all the model parameters using `model.get_params().keys()`.
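
For a pipeline, the parameter names are prefixed with the step name and a double underscore. The sketch below assumes the step names `vect`, `tfidf`, and `logit` used earlier in this lab.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('logit', LogisticRegression())])

# Each tunable parameter appears as '<step name>__<parameter name>'
param_names = sorted(pipe.get_params().keys())
for name in param_names:
    print(name)
```

Names such as `vect__ngram_range` or `logit__C` are the keys you would later use in a parameter grid.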

Discuss with your team which parameters you could try to change.

You can systematically probe parameter combinations by using the `GridSearchCV` class. Implement a new classifier that searches for the best parameter combination.

1. How will you choose the grid granularity?
1. How can you prevent the grid from growing exponentially?
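
A small grid search over the pipeline might look like the sketch below. The toy corpus, the choice of parameters, and the candidate values are all illustrative assumptions, not recommended settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train and y_train
texts = ["the river boat drifted south", "a river pilot steers the boat",
         "the boat ran aground on the river", "the pilot loved the river",
         "art is quite useless", "beauty is a form of genius",
         "art reveals beauty", "genius and beauty outlast art"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('logit', LogisticRegression())])

# Two values times three values gives 6 candidates; every extra parameter
# multiplies this count, so keep each list short
param_grid = {'vect__ngram_range': [(1, 1), (1, 2)],
              'logit__C': [0.1, 1.0, 10.0]}  # log-spaced, a coarse first pass

search = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)

n_candidates = len(search.cv_results_['params'])
print(n_candidates)
print(search.best_params_)
```

A common approach is to start with a coarse, log-spaced grid like this one and then refine around the best value, rather than searching a fine grid over every parameter at once.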

## 5. Assess the tuned model

A fitted grid search stores the best parameter combination and the best estimator as the attributes `best_params_` and `best_estimator_`.

1. Use these to generate a new prediction vector `y_pred`.
1. Use the `confusion_matrix` and `classification_report` functions to assess the accuracy of the new model.
1. How does the new model compare with the old one?
1. What else could you do to improve the accuracy?
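
The sketch below shows one way to read the tuned model back out of a fitted `GridSearchCV` and use it for prediction. The data and the single-parameter grid are made up so that the snippet is self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the lab data
texts = ["the river boat drifted south", "a river pilot steers the boat",
         "the boat ran aground on the river", "the pilot loved the river",
         "art is quite useless", "beauty is a form of genius",
         "art reveals beauty", "genius and beauty outlast art"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('logit', LogisticRegression())])

search = GridSearchCV(pipe, {'logit__C': [0.1, 1.0, 10.0]}, cv=2)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is the whole pipeline
# refitted on all of the training data using best_params_
print(search.best_params_)
y_pred = search.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred))
```

Evaluate on the held-out test set only after the search has finished, so that the comparison with the untuned model stays fair.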

## Bonus

What would happen if we used a different scoring function? Would our results change?
Choose one or two classification metrics from the [sklearn provided metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) and repeat the grid search. Do your results change?
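
To compare scoring functions, one option is to rerun the same search with a different `scoring` string, as sketched below. The toy corpus and the grid are assumptions made for illustration; `'accuracy'` and `'f1_macro'` are two of scikit-learn's built-in scorer names.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train and y_train
texts = ["the river boat drifted south", "a river pilot steers the boat",
         "the boat ran aground on the river", "the pilot loved the river",
         "art is quite useless", "beauty is a form of genius",
         "art reveals beauty", "genius and beauty outlast art"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('logit', LogisticRegression())])
param_grid = {'logit__C': [0.1, 1.0, 10.0]}

results = {}
for scoring in ['accuracy', 'f1_macro']:
    search = GridSearchCV(pipe, param_grid, cv=2, scoring=scoring)
    search.fit(texts, labels)
    # Store the winning parameters and their mean cross-validated score
    results[scoring] = (search.best_params_, search.best_score_)

for scoring, (params, score) in results.items():
    print(scoring, params, round(score, 3))
```

On balanced classes the two metrics often pick the same parameters; differences tend to show up when some classes are much rarer than others.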