# Classify Flu Tweets 

This notebook provides a simple demonstration of how open source Python libraries can be used to classify Twitter messages.

---

First, we import the necessary libraries. We use [scikit-learn](http://scikit-learn.org/stable/) to do most of the heavy lifting in transforming and classifying data; it provides a wide range of tools for data mining and classification tasks. We also use the [pandas](https://pandas.pydata.org/) library to read our data.

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
import pandas as pd

Before working with any of the training data, we set up a classification [pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that combines all required data transformation and modeling steps:

1. [Vectorize](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html): This step transforms text data into numerical data that can be used for classification by counting how often each term occurs in a tweet. With `ngram_range=(1,3)`, a term is any run of one, two, or three consecutive words. You can read more [here](https://en.wikipedia.org/wiki/Bag-of-words_model).
2. [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html): This is an additional transformation that is common when working with text data. It re-weights the raw counts so that terms appearing in many tweets count for less than terms concentrated in a few. You can read more [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
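3. [Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html): Finally, we classify the data transformed by the previous two steps. We use a linear support vector classifier, which is commonly used in text classification tasks. In the code below it is wrapped in a [OneVsRestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html), which trains one binary classifier per class; with a single true/false label, as here, the wrapper is not strictly needed. You can read more [here](https://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM).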

In [2]:
classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,3))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))
])

Now we read the training data Excel file, which has two columns:

1. `text`: The text of the tweet.
2. `valid`: A determination of whether this tweet indicates an actual case of influenza.

In [5]:
df = pd.read_excel("../training_set.xlsx", "Sheet1")

# the tweets are the input
tweets = df.text.astype(str)

# the 'valid' column is the desired output (is this tweet a valid flu tweet?)
valid = df.valid.astype(bool)
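
If the spreadsheet is not at hand, the expected shape of the data can be illustrated with a small hand-made frame. The rows below are invented, and coding `valid` as 1/0 is an assumption about how the spreadsheet stores it; the two conversions are the same ones used above.

```python
import pandas as pd

# Invented rows that mirror the two columns described above
toy_df = pd.DataFrame({
    "text": ["i have the flu", "flu shots available downtown", 12345],
    "valid": [1, 0, 0],
})

# astype(str) guarantees every entry is a string, even if a cell was read as a number
toy_tweets = toy_df["text"].astype(str)

# astype(bool) maps 1/0 to True/False
toy_valid = toy_df["valid"].astype(bool)

print(toy_tweets.tolist())
print(toy_valid.tolist())
```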

Train the classifier with `tweets` as the input and `valid` as the output. **Note:** Because we created a classification pipeline, all of the data transformation and training is done in a single step.

In [6]:
# train the classifier
classifier.fit(tweets, valid)

Pipeline(steps=[('vectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
    ...lti_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=1))])

Finally, we can see how the trained classifier performs on new tweets by calling its `predict` method.

In [7]:
# try it out on some sample tweets
texts = [
    'i hate having the flu', 
    'i dont want to get a flu shot!!', 
    'the flu sucks, i keep coughing', 
    'Studies show that flu rates are rising in the nation: http://www.fakeurl.com',
    'RT i have the flu!!! ahhh!!!!',
    'just got a flu shot, but i hope i dont get the flu!',
    'got a flu shot i better not get the flu now!'
]

outputs = classifier.predict(texts)

for text, output in zip(texts, outputs):
    print('%s --> %s' % (text, str(output)))

i hate having the flu --> True
i dont want to get a flu shot!! --> False
the flu sucks, i keep coughing --> True
Studies show that flu rates are rising in the nation: http://www.fakeurl.com --> False
RT i have the flu!!! ahhh!!!! --> True
just got a flu shot, but i hope i dont get the flu! --> False
got a flu shot i better not get the flu now! --> True
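
`predict` returns only the label. For a sense of how close each call was, the signed distance to the decision boundary is available through `decision_function`. The sketch below is self-contained: it trains the same kind of pipeline on a handful of invented tweets (not the real training set), so the scores are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented examples standing in for the real training spreadsheet
toy_tweets = [
    "i have the flu and feel awful",
    "stuck in bed with the flu",
    "the flu is killing me today",
    "getting my flu shot tomorrow",
    "flu season news report",
    "free flu shots at the pharmacy",
]
toy_valid = [True, True, True, False, False, False]

toy_classifier = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])
toy_classifier.fit(toy_tweets, toy_valid)

new_texts = ["i think i have the flu", "flu shot clinic opens monday"]
labels = toy_classifier.predict(new_texts)

# One score per text; values near 0 are close calls
scores = toy_classifier.decision_function(new_texts)

for text, label, score in zip(new_texts, labels, scores):
    print(text, "-->", label, score)
```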


**Note:** Normally, you would perform [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html) on a model to evaluate its performance, but for this example we kept things simple.
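
As a rough sketch of what that evaluation could look like, the example below passes a pipeline of the same shape to scikit-learn's `cross_val_score`. The tweets and labels are invented so that the block runs on its own; with the real data you would pass `tweets` and `valid` instead. The choice of three folds is arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented examples; in the notebook these would be `tweets` and `valid`
toy_tweets = [
    "i have the flu and feel awful",
    "stuck in bed with the flu",
    "the flu is killing me today",
    "home sick with the flu again",
    "i caught the flu from my roommate",
    "this flu has me coughing all night",
    "getting my flu shot tomorrow",
    "flu season news report",
    "free flu shots at the pharmacy",
    "cdc publishes new flu statistics",
    "schools offer flu vaccine clinics",
    "article about flu shot myths",
]
toy_valid = [True] * 6 + [False] * 6

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])

# 3-fold cross-validation: fit on two thirds of the data,
# score accuracy on the held-out third, and repeat for each fold
cv_scores = cross_val_score(pipeline, toy_tweets, toy_valid, cv=3)

print("accuracy per fold:", cv_scores)
print("mean accuracy: %.2f" % cv_scores.mean())
```

Because the whole pipeline is passed in, the vectorizer and TF-IDF weights are re-fit on each training fold, so no information from the held-out tweets leaks into training.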