## Introduction to Text Analysis with sklearn

### Recommendation System
We live in a world surrounded by recommendation systems: our shopping habits, our reading habits, and our political opinions are heavily influenced by recommendation algorithms. So let's take a closer look at how to build a basic recommendation system.

Simply put, a recommendation system learns from your previous behavior and tries to recommend items that are similar to your previous choices. While there are many approaches to building recommendation systems, we will take a simple one that is easy to understand and performs reasonably well.

For this exercise we will build a recommendation system that predicts which talks you'll enjoy at a conference.

### Before you proceed
This project is still in the alpha stage, so expect bugs, typos, and errors in spelling, grammar, or terminology. If you find one, [open an issue on GitHub](https://github.com/chicagopython/CodingWorkshops/issues/new). Pull requests with corrections, fixes, and enhancements will be received with open arms! Don't forget to add yourself to the [list of contributors to this project](https://github.com/chicagopython/CodingWorkshops/blob/master/README.md).


#### Recommendation for Pycon talks
With 32 tutorials, 12 sponsor workshops, 16 talks at the education summit, and 95 talks at the main conference, Pycon has a lot to offer. Reading through all the talk descriptions and filtering out the ones that you should go to is a tedious process.
Let's build a recommendation system that recommends talks from this year's Pycon based on the ones that you went to last year. This way you don't waste any time deciding which talk to go to and can spend more time making friends on the hallway track!

We will be using [`pandas`](https://pandas.pydata.org/) and [`scikit-learn`](http://scikit-learn.org/) to build the recommendation system from the text descriptions of the talks.


### Definitions
#### Documents
In our example the talk descriptions are the documents.

#### Class
We have two classes to classify our documents
- The talks that the user would like to see "in person". Denoted by 1
- The talks that the user would watch "later online". Denoted by 0

A talk description labeled 0 means the user has chosen to watch it later; a label of 1 means the user has chosen to watch it in person.

### Supervised Learning
In supervised learning we inspect each observation in a given dataset and manually label it. This manually labeled data is used to construct a model that can predict the labels of new data. We will use a supervised learning technique called Support Vector Machines.

In unsupervised learning we do not need any manual labeling. The recommendation system finds the pattern in the data to build a model that can be used for recommendation.

### Dataset
The dataset contains the talk description and speaker details from Pycon 2017 and 2018. All the 2017 talk data has been labeled by a user who has been to Pycon 2017.

### Exercise A: Load the data
The data directory contains a snapshot of one such user's labeling. Let's load that up and start our analysis.

In [17]:
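from nltk.corpus import stopwords

# clean_corpus() below filters words against this set.
# The stop word list must be downloaded once with nltk.download('stopwords').
stop_words = set(stopwords.words('english'))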

In [1]:
import pandas as pd
import numpy as np
df=pd.read_csv('talks.csv')
df.head()

Unnamed: 0,id,title,description,presenters,date_created,date_modified,location,talk_dt,year,label
0,1,5 ways to deploy your Python web app in 2017,You’ve built a fine Python web application and...,Andrew T. Baker,2018-04-19 00:59:20.151875,2018-04-19 00:59:20.151875,Portland Ballroom 252–253,2017-05-08 15:15:00.000000,2017,0.0
1,2,A gentle introduction to deep learning with Te...,Deep learning's explosion of spectacular resul...,Michelle Fullwood,2018-04-19 00:59:20.158338,2018-04-19 00:59:20.158338,Oregon Ballroom 203–204,2017-05-08 16:15:00.000000,2017,0.0
2,3,aiosmtpd - A better asyncio based SMTP server,smtpd.py has been in the standard library for ...,Barry Warsaw,2018-04-19 00:59:20.161866,2018-04-19 00:59:20.161866,Oregon Ballroom 203–204,2017-05-08 14:30:00.000000,2017,1.0
3,4,Algorithmic Music Generation,Music is mainly an artistic act of inspired cr...,Padmaja V Bhagwat,2018-04-19 00:59:20.165526,2018-04-19 00:59:20.165526,Portland Ballroom 251 & 258,2017-05-08 17:10:00.000000,2017,0.0
4,5,An Introduction to Reinforcement Learning,Reinforcement learning (RL) is a subfield of m...,Jessica Forde,2018-04-19 00:59:20.169075,2018-04-19 00:59:20.169075,Portland Ballroom 252–253,2017-05-08 13:40:00.000000,2017,0.0


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190 entries, 0 to 189
Data columns (total 10 columns):
id               190 non-null int64
title            190 non-null object
description      190 non-null object
presenters       190 non-null object
date_created     190 non-null object
date_modified    190 non-null object
location         190 non-null object
talk_dt          190 non-null object
year             190 non-null int64
label            190 non-null float64
dtypes: float64(1), int64(2), object(7)
memory usage: 14.9+ KB


Here is a brief description of the interesting fields.

variable | description
------|------
`title`|Title of the talk
`description`|Description of the talk
`year`|Year of the talk, `2017` or `2018`
`label`|`1` indicates the user preferred seeing the talk in person,<br> `0` indicates they would schedule it for later.

Note that the labels of all 2018 talks are set to 1. These are only placeholders and are not used in training the model. We will use the 2017 data for training and predict the labels of the 2018 talks.

Let's start by selecting the 2017 talk descriptions that the user labeled for watching in person.

```python
df[(df.year==2017) & (df.label==1)]['description']
```

Print the description of the talks that the user preferred watching in person. How many such talks are there?

## Exercise 1: Exploring the dataset

### Exercise 1.1: Select 2017 talk description and labels from the Pandas dataframe. How many of them are present? Do the same for 2018 talks.

In [401]:
count = 0
for year in df.year:
    if year == 2017:
        count += 1

print(count)

95


In [402]:
count = 0
for year in df.year:
    if year == 2018:
        count += 1

print(count)
df[df['year'] == 2018].shape[0]

95


95

In [245]:
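import re

def clean_corpus(corpus):
    """Strip punctuation and stop words from every document in a pandas Series."""
    final_corpus = []
    for item in corpus.tolist():
        # Replace every character that is not a letter or digit with a space
        item = re.sub("[^0-9A-Za-z]", " ", item)
        # Keep only words that are not in the module-level stop_words set
        # (the match is case-sensitive, so capitalized words such as "This" survive)
        final_str = "".join(" " + word for word in item.split()
                            if word not in stop_words)
        final_corpus.append(final_str)

    return final_corpus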

In [403]:
# Select the 2017 labels directly so this cell does not depend on later cells
labels_2017 = df[df.year == 2017]['label']
print('Number of class 1 :', (labels_2017 == 1.0).sum())
print('Number of class 0 :', (labels_2017 == 0.0).sum())


Number of class 1 : 38
Number of class 0 : 57


The 2017 talks will be used for training and the 2018 talks will be used for predicting. Set `year_labeled` and `year_predict` to the appropriate values and print out `description_labeled` and `description_predict`.

In [311]:
year_labeled=2017
year_predict=2018
description_labeled = df[df.year==year_labeled]['description']
description_predict = df[df.year==year_predict]['description']

In [326]:
# Keep only the labeled (2017) talks; the description is the input text
df_2017 = df[df.year == year_labeled]
description_2017 = df_2017['description']


In [329]:
Capture_X_train_2017 = clean_corpus(description_2017)

In [320]:
description_labeled[0]

'You’ve built a fine Python web application and now you’re ready to share it with the world. But what’s the best way to deploy your app in 2017?  This talk will demonstrate popular techniques for deploying Python web applications. We’ll start with a simple Flask application and expose it to the world five times over as we learn to use different tools and services available to the modern Python developer.  Specific topics covered include:   Exposing your local dev environment with ngrok Using a Platform-as-a-Service (PaaS) like Heroku Going “serverless” with AWS Lambda Configuring your own VM with Google Compute Engine Thinking inside the box using Docker   We’ll also briefly touch on the pros and cons of each technique to help you figure out which one is right for your app.  At the end of this talk you will have a basic understanding of how each of these techniques work and you’ll be ready to try them out yourself.'

In [393]:
print(Capture_X_train_2017[0])

 You built fine Python web application ready share world But best way deploy app 2017 This talk demonstrate popular techniques deploying Python web applications We start simple Flask application expose world five times learn use different tools services available modern Python developer Specific topics covered include Exposing local dev environment ngrok Using Platform Service PaaS like Heroku Going serverless AWS Lambda Configuring VM Google Compute Engine Thinking inside box using Docker We also briefly touch pros cons technique help figure one right app At end talk basic understanding techniques work ready try


In [394]:
train_label = df_2017['label']

In [404]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(description_2017, train_label, test_size=0.25)

In [405]:
train_2017_cleaned = clean_corpus(X_train)
test_2017_cleaned = clean_corpus(X_test)

In [406]:
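from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Count words, reweight the counts with TF-IDF, then fit a linear-kernel SVM
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC(kernel='linear'))])
text_clf = text_clf.fit(train_2017_cleaned, Y_train)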

In [407]:
text_clf

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [408]:
predicted_2017 = text_clf.predict(test_2017_cleaned)
print('Accuracy of the model is:',accuracy_score(Y_test,predicted_2017))
print('Precision of the model is:',precision_score(Y_test,predicted_2017))
print('Recall of the model is:',recall_score(Y_test,predicted_2017))
print('confusion Matrix is:',confusion_matrix(Y_test,predicted_2017))

Accuracy of the model is: 0.7083333333333334
Precision of the model is: 0.5
Recall of the model is: 0.2857142857142857
confusion Matrix is: [[15  2]
 [ 5  2]]
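
The metrics printed above can be recomputed by hand from the confusion matrix, which makes the matrix easier to read. In scikit-learn's layout the rows are the true classes and the columns are the predicted classes, so for labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. A minimal sketch using the numbers from the output above (the variable names are illustrative):

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
tn, fp = 15, 2   # true label 0: predicted 0, predicted 1
fn, tp = 5, 2    # true label 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all test talks classified correctly
precision = tp / (tp + fp)                  # of talks predicted "in person", how many really were
recall = tp / (tp + fn)                     # of the true "in person" talks, how many were found

print(accuracy, precision, recall)
```

The low recall shows that this model misses most of the talks the user wanted to see in person, even though its accuracy looks reasonable.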


In [349]:
description_predict_clean = clean_corpus(description_predict)
predicted_2018 = text_clf.predict(description_predict_clean)

In [381]:
# Pair each 2018 description with its predicted label, keeping the original row index
predict_actual = description_predict.to_frame(name='description_predict')
predict_actual['predicted_label'] = predicted_2018

In [387]:
predict_actual[predict_actual.predicted_label==1.0]

Unnamed: 0,description_predict,predicted_label
102,"Nowadays, there are many ways of building data...",1.0
112,Want to know about the latest trends in the Py...,1.0
131,Are you an intermediate python developer looki...,1.0


## Quick Introduction to Text Analysis
![text-analysis](text-analysis.jpg)

Let's have a quick overview of text analysis. Our end goal is to train a machine learning algorithm by making it go through enough documents from each class to recognize the distinguishing characteristics of documents from a particular class.

1. *Labeling* - This is the step where the user (i.e. a human) reviews a set of documents and manually classifies them. For our problem, a Pycon attendee labels each talk description from 2017 as "watch later" (0) or "watch now" (1).
1. *Training/Testing split* - In order to test our algorithm, we split our labeled data into a training set (used to train the algorithm) and a testing set (used to test the algorithm).
1. *Vectorization & feature extraction* - Since machine learning algorithms deal with numbers rather than words, we vectorize our documents, i.e. we split the documents into individual unique words and count the frequency of their occurrence across documents. Different kinds of data normalization are possible at this stage, such as stop word removal and [lemmatization](https://spacy.io/api/lemmatizer), but we will skip them for now. Each individual token occurrence frequency (normalized or not) is treated as a feature.
1. *Model training* - This is where we build the model.
1. *Model testing* - Here we test the model on the previously set aside test set to see how its predictions compare against the known labels.
1. *Tweak and train* - If our measures are not satisfactory, we change the parameters that define different aspects of the machine learning algorithm and train the model again.
1. Once satisfied with the results from the previous step, we are ready to deploy the model and have it classify new unlabeled documents.
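
The vectorization step in the list above can be sketched on two made-up documents. This is a minimal illustration using scikit-learn's `CountVectorizer`; the sample sentences are invented and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up "talk descriptions" (illustrative data, not from the dataset)
docs = ["python web app", "python data data"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse document-term matrix

vocabulary = sorted(vectorizer.vocabulary_)  # the learned tokens, in column order
matrix = counts.toarray().tolist()           # dense copy, one row per document

print(vocabulary)  # ['app', 'data', 'python', 'web']
print(matrix)      # [[1, 0, 1, 1], [0, 2, 1, 0]]
```

Each column is one feature (a word) and each row is one document, so the second row records that "data" occurs twice in the second sentence.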

### sklearn

Using sklearn we will build the feature set by tokenization, counting and normalization of the bi-grams from the text descriptions of the talk. You can find more information on text feature extraction [here](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) and TfidfVectorizer [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
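
As a rough sketch of what bi-gram features look like, the snippet below runs `TfidfVectorizer` with `ngram_range=(2, 2)` on two invented sentences. The choice of `ngram_range=(2, 2)` is an assumption made for illustration; pick the range that suits your own experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up descriptions (illustrative only)
docs = ["deep learning with python", "web apps with python"]

# ngram_range=(2, 2) keeps only pairs of adjacent words (bi-grams)
bigram_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
tfidf = bigram_vectorizer.fit_transform(docs)

bigrams = sorted(bigram_vectorizer.vocabulary_)
print(bigrams)
print(tfidf.shape)  # (number of documents, number of distinct bi-grams)
```

Here the two sentences share only the bi-gram "with python", so the matrix has five columns for two rows.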

We will use the [fit_transform](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) method to learn the vocabulary dictionary and return the document-term matrix.

For the data on which we will do our predictions, we will use the [transform](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform) method to get the document-term matrix.
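
The difference between the two methods is easiest to see on a tiny example. In this sketch (with invented sentences), `fit_transform` learns the vocabulary from the training documents, while `transform` only reuses it, so a word that never appeared in training is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents (illustrative only)
train_docs = ["python web app", "python data science"]
new_docs = ["python robotics"]   # "robotics" never appears in train_docs

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary
new_matrix = vectorizer.transform(new_docs)          # reuses it, learns nothing new

print(train_matrix.shape, new_matrix.shape)  # same number of columns
print(new_matrix.toarray().tolist())         # only the "python" column is set
```

Both matrices have the same columns, which is what lets a model trained on one make predictions on the other.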

Next we will split our data into a training set and a testing set. Holding out a testing set lets us measure how the model performs on talks it has not seen during training, which helps us detect overfitting. Use the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from `sklearn.model_selection`.
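
A minimal sketch of the split on toy data follows. The `stratify` and `random_state` arguments are assumptions added for illustration: the first keeps the class proportions the same in both sets, the second makes the split reproducible.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for descriptions and labels (illustrative only)
texts = [f"talk {i}" for i in range(8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.25,     # hold out a quarter of the data for testing
    stratify=labels,    # keep the 0/1 balance in both sets
    random_state=0,     # fixed seed so the split is repeatable
)

print(len(X_train), len(X_test))  # 6 2
print(sorted(y_test))             # [0, 1]
```

With only 95 labeled talks, a stratified split may be worth considering so that the smaller class is represented in the testing set.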


Finally we get to the stage for training the model. We are going to use a linear [support vector classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) and check its [precision and recall](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) by using the [classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).
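
The training and reporting step might look like the sketch below. It uses `LinearSVC`, the class linked above, together with `make_pipeline` and `TfidfVectorizer`; the tiny training and testing sets are invented for illustration, so the printed numbers say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up labeled descriptions: 1 = watch in person, 0 = watch later
train_texts = [
    "asyncio server internals",
    "packaging and deployment internals",
    "intro to deep learning",
    "machine learning for beginners",
]
train_labels = [1, 1, 0, 0]
test_texts = ["asyncio deployment tips", "deep learning demo"]
test_labels = [1, 0]

# Vectorize the text and fit a linear support vector classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

predicted = model.predict(test_texts)
report = classification_report(test_labels, predicted)
print(report)  # per-class precision, recall, f1-score and support
```

The report lists precision and recall separately for each class, which is more informative than a single accuracy number when the classes are unbalanced.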