# Classifier for Med School Applications

## Understanding Classification
How would you describe apples to a computer? How do they differ from oranges?
Remember, computers can only really understand numbers, true/false values, and strings within a predefined set.
For example, say you have shared data on fruit with your machine. An apple has a height feature, a width feature, a color feature, etc. After "learning" all the fruits, when the machine comes across an unknown, "unlabeled" fruit, it can use its previous experience (the data you shared) to **Classify** the object into a label / class (in the example below, Orange) based on that object's known **Features**.

### USEFUL TERMS
**FEATURES**: Properties that describe data attributes for machine learning - often the variables <br>
**FEATURE VECTOR aka FEATURE REPRESENTATION**: A set of features for a particular item of data

<img src="fruit_example.png">

_Source: Andrew Rosenberg_
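To make the fruit idea concrete, here is a minimal sketch of feature vectors and a very simple classifier (it just picks the label of the closest known fruit). The measurements, the `labeled_fruit` list, and the `classify` helper are all made up for illustration; they are not taken from the figure above.

```python
import math

# Hypothetical labeled fruit: (height_cm, width_cm, redness 0-1) -> label.
# All numbers are invented for illustration.
labeled_fruit = [
    ((7.0, 7.5, 0.9), "apple"),
    ((6.5, 7.0, 0.8), "apple"),
    ((7.5, 7.5, 0.3), "orange"),
    ((8.0, 8.0, 0.2), "orange"),
]

def classify(features):
    """Return the label of the closest known fruit (1-nearest-neighbour)."""
    nearest = min(labeled_fruit, key=lambda item: math.dist(item[0], features))
    return nearest[1]

# An unknown, unlabeled fruit described only by its feature vector.
unknown = (7.8, 7.9, 0.25)
print(classify(unknown))
```

Each tuple is one feature vector; the classifier never sees the fruit itself, only those numbers.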

**WE** are going to classify two different sets of documents in a corpus of Med School Applications:  
1. Students who entered the Family Medicine specialization
2. Students who chose a different specialization

Obviously, we can't use the same features as the fruit example - a digital text document won't have color or weight etc. and we want to use something more _meaningful_.  

One method is to use a **Bag of Words** feature representation. In a Bag-of-Words model, a text (e.g. an individual application document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

So the frequency/occurrence of each word will be used as the document's features for training our classifier.
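As a quick illustration, a bag of words for a single text can be sketched with Python's built-in `Counter`. The sentence and the `bag_of_words` helper are hypothetical, and the tokenizing here is much cruder than what scikit-learn does later on.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text, split it into words, and count each word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical sentence, not taken from the application data.
bag = bag_of_words("Family medicine is about family and about community.")
print(bag)
```

Word order is gone, but the counts (the multiplicity) are kept.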

### BAG of WORDS EXAMPLE

<img src= "BoW_example.png">

_Source: Wikipedia_

# The TO-DO List

Using **scikit-learn** tools (along with a few other useful tools) we're going to analyze our dataset of text documents (i.e. all the Med School Applications):
1. Load our Application data (texts) and organize it according to the categories (Fam Med / Other)
2. Extract feature vectors suitable for machine learning _aka_ Build the Bag-of-Words model for each document
3. Train & Test classifier(s) to perform categorization
4. Finetune the Classifier: use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

## 1. LOADING & ORGANIZING TEXT DATA
We need to first load our JSON datafile (text and metadata from med school applications) & split it into the two categories by the True/False metadata point in our JSON dictionary. Then we'll check the count (should be Total: 260).

In [145]:
import json
#create lists to contain JUST the text part for each group (fm/non) of docs 
fm_texts= []
other_texts= []

#open/read our data then split and append into the lists above
with open('familyMedicine_031720.json', 'r') as f:
    data = json.load(f)
    for doc in data:
        if doc['keyword'] == 'TRUE':
            true_text = doc['text']
            fm_texts.append(true_text)
        if doc['keyword'] == 'FALSE':
            false_text = doc['text']
            other_texts.append(false_text)
            
#check our count to make sure we've got things right
print('Total Docs:', len(data))
print('Fam Med Docs:', len(fm_texts))
print('Other Docs:', len(other_texts))
print ()

Total Docs: 260
Fam Med Docs: 135
Other Docs: 125



To help us organize a bit more, I like to use the PANDAS library. PANDAS DataFrames are essentially spreadsheets (rows and columns). Having our dataset in this format will help us check in on our data, explore a bit, and cut all the data we don't need to include. We want every row to be a doc with just the row number, text, and classification. We could skip this step and use data straight from our JSON in the following steps, but for me this just helps me orient a bit more.

In [146]:
import pandas as pd

fmdf = pd.DataFrame({'text': fm_texts,
                    'label':'family'})
othdf = pd.DataFrame({'text':other_texts, 
                    'label':'other'})

# combining our two dataframes into 1
df = pd.concat([fmdf, othdf])

print (fmdf.head())
print ()
print (othdf.head())
print ()
print (df['label'].value_counts())

                                                text   label
0  I am a DAP (Dual Admissions Program) Student. ...  family
1  I am a Dual Admissions Student. Student Worker...  family
2  Throughout my life it has been proven to me ti...  family
3  When I was in eighth grade, I had a personal m...  family
4  I am ten. We're at Grandma's for the holidays....  family

                                                text  label
0  We each live once.  To live life to its fulles...  other
1  Drums and feet pounded in Lubwe, a town in rur...  other
2  Children, as patients, are almost never found ...  other
3  My father used to tell me Learn from your past...  other
4  In September of 2005 my friend Mark learned th...  other

family    135
other     125
Name: label, dtype: int64


## 2. Extracting features from text files _aka_ Build the Bag-of-Words
Before we can perform any machine learning tasks on text documents, we need to make our text content numerical feature vectors. Remember, computers can only really understand numbers, true/false values, and strings within a predefined set.

**Bags of Words**
- Assign a fixed integer id to each word occurring in any document of the training set (by building a dictionary from words to integer indices). 
- For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary.

**E.G. 'family' is indexed at 7015 and any document in our dataset that uses 'family' will list 7015 and then the number of times that document uses 'family'.**
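The two steps above can be sketched by hand on a tiny corpus. The three documents are hypothetical, and the alphabetical id assignment is just one reasonable choice for this sketch.

```python
# Hypothetical three-document corpus, already lowercased and tokenized.
docs = [
    ["family", "medicine", "family"],
    ["rural", "medicine"],
    ["family", "doctor", "rural", "rural"],
]

# Step 1: give every distinct word a fixed integer id (alphabetical here).
all_words = sorted({word for doc in docs for word in doc})
vocabulary = {word: j for j, word in enumerate(all_words)}

# Step 2: X[i][j] = number of times word j occurs in document i.
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc:
        X[i][vocabulary[word]] += 1

print(vocabulary)
print(X)
```

Each row of `X` is one document's feature vector, and each column belongs to one word in the dictionary.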

NOTE: The bags of words representation implies that n_features is the number of distinct words in the corpus: this number is typically larger than 100,000. If n_samples == 10000, storing X as a NumPy array of type float32 would require 10000 x 100000 x 4 bytes = 4GB in RAM which is barely manageable on today’s computers. Fortunately, most values in X will be zeros since for a given document less than a few thousand distinct words will be used. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the feature vectors in memory. scipy.sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for these structures. (_source: scikit-learn tutorial, Working with Text Data_)
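The memory arithmetic in that note can be checked with a few lines of Python. The sparse figure is only a rough estimate: it assumes about 1,000 non-zero words per document and a layout that stores one value plus one column index per non-zero entry, and both helper functions are hypothetical.

```python
def dense_bytes(n_samples, n_features, bytes_per_value=4):
    """Memory for a full array that stores every cell, zeros included."""
    return n_samples * n_features * bytes_per_value

def sparse_bytes_estimate(n_samples, nonzeros_per_row,
                          bytes_per_value=4, bytes_per_index=4):
    """Rough estimate: one value and one column index per non-zero entry,
    plus one row pointer per row (an assumed layout, for illustration)."""
    nnz = n_samples * nonzeros_per_row
    return nnz * (bytes_per_value + bytes_per_index) + (n_samples + 1) * bytes_per_index

dense = dense_bytes(10_000, 100_000)
sparse = sparse_bytes_estimate(10_000, 1_000)  # 1,000 non-zeros per doc is an assumption
print(dense)
print(sparse)
print(dense / sparse)
```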

### Tokenizing & Counting Text
We know we need to count the occurrences of each term in each doc, but we also need to do some text pre-processing tasks like:<br>
- _Tokenizing_: breaking docs into words<br>
- _Filtering Stopwords_: removing less meaningful words like articles and prepositions

Both of these tasks are already included in SciKit-Learn's CountVectorizer tool. So we can accomplish these AND build a dictionary of features (all the unique words in the total dataset) AND transform individual documents into feature vectors - all in one fairly small chunk of code.
<br>
<br>
**NOTE: at this point is where you might want to apply [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), but we are going to skip narrowing our terms down by frequency since we are dealing with a fairly small dataset (260 documents).**

In [223]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words='english')
fv = count_vect.fit_transform(df['text'])
print(fv.shape)
print(fv[:1])
print()

(260, 20234)
  (0, 4877)	3
  (0, 5904)	3
  (0, 727)	3
  (0, 14348)	3
  (0, 17526)	2



Once fitted, the vectorizer has built a dictionary of feature indices.<br>
The index value of a word in the vocabulary is the column that holds that word's counts in the feature matrix.

In [234]:
print(count_vect.vocabulary_.get(u'family'))

7015


### TF-IDF: let's realllllly think about what frequency means ay
Longer documents will have higher average counts than shorter documents, even though they might talk about the same topics.

To avoid discrepancies, we can divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called **tf–idf** for “Term Frequency times Inverse Document Frequency”.
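Here is a small worked sketch of both ideas using the textbook formulas. The word counts are hypothetical, and the numbers will not match scikit-learn's output exactly, since its TfidfTransformer uses a smoothed idf and normalizes each row by default.

```python
import math

# Hypothetical word counts for three short documents.
counts = [
    {"family": 2, "medicine": 1, "patient": 1},
    {"medicine": 3, "research": 1},
    {"medicine": 1, "patient": 1, "rural": 2},
]

def term_frequencies(doc_counts):
    """tf: each word's count divided by the document's total word count."""
    total = sum(doc_counts.values())
    return {word: n / total for word, n in doc_counts.items()}

def inverse_document_frequency(word, all_counts):
    """Textbook idf: log(N / number of documents containing the word)."""
    df = sum(1 for doc in all_counts if word in doc)
    return math.log(len(all_counts) / df)

tf = term_frequencies(counts[0])
tfidf = {w: tf[w] * inverse_document_frequency(w, counts) for w in tf}
print(tf)
print(tfidf)
```

In this toy corpus 'medicine' appears in every document, so its idf is zero and it drops out, while 'family' (used in only one document) keeps the highest weight.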

Both tf and tf–idf can be computed using TfidfTransformer. The cell below passes `use_idf=False`, so it computes tf only:

In [224]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(fv)
fv_tf = tf_transformer.transform(fv)
print(fv_tf.shape)

(260, 20234)


## 3. Training a Classifier
### Partitioning data into train and test sets
When partitioning data into train and test sets, a good place to start is to use 75% of your data for training, and 25% of your data for testing. We want as much training data as possible so the machine/algorithm can "learn" (imagine telling a baby 'this is a lemon' 1 time versus 100 times), while also having enough testing data to ensure that our trained classifier is generalizable across a number of examples. This will also lead to a more accurate evaluation of our trained classifier.

Again, scikit-learn has a function that will do exactly this!

In [247]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(fv, df['label'],
                                                stratify=df['label'], 
                                                test_size=0.25,
                                                    random_state = 42)

* We use the "stratify" argument because our classes are uneven; we have more Family Medicine Applications (135) than Other Applications (125). By using stratify, we ensure that the train and test sets each keep roughly the same Family/Other proportions as the full dataset.

* In this example, we are using a fixed random state to ensure we get exactly the same train/test split every time we run the notebook. The argument is optional; we set it here so our results do not vary across runs. <br>**Fun Fact Alert!!** 42 is commonly used for the fixed random state b/c of _Hitchhiker's Guide to the Galaxy_
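To see what stratifying does, here is a hand-rolled sketch that samples 25% of each class separately. It is not scikit-learn's actual algorithm, and `stratified_split` is a hypothetical helper, but with our class sizes it gives the same 34 / 31 test counts that show up as the row totals of the confusion matrices later on.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Same class sizes as our dataset: 135 family, 125 other.
labels = ["family"] * 135 + ["other"] * 125
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
print(len(train_idx), len(test_idx))
```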

CHECK-IN: How many docs do we have now in each of our Train / Test groups? And what is the total unique vocabulary?

In [248]:
print(X_train.shape)
print(X_test.shape)

(195, 20234)
(65, 20234)


### Selecting a Classifier

<img src= "algorithm_map.png">

*Source: Andreas Mueller*

**START** <br>
- More than 50 samples: YES (260 docs)<br>
- Predicting a category: YES!<br>
- Have labeled data: YES!<br>
- Less than 100k samples: YES (Again, 260 docs)

**END: Linear SVC Classifier**

## 3.A Trying the Linear SVC Classifier
We're actually going to try a few different classifiers - but let's start with the **Linear SVC Classifier**<br>
We import that Classifier from scikit-learn, then "FIT" our "Training" data to it (75% of our total dataset).<br> This gives the machine something to learn. 

In [272]:
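# NOTE: SVC() defaults to an RBF kernel; for a strictly linear model,
# swap in sklearn.svm.LinearSVC or SVC(kernel='linear') (scores may differ).
from sklearn.svm import SVC
classifier_svc = SVC().fit(X_train, y_train)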

Then we use the Classifier's predict abilities on the "Testing" data (25% of our total dataset) and see what percentage it correctly guesses by comparing the machine's predictions to the testing data's *actual* classification (family/other).

In [273]:
import numpy as np
predicted_svc = classifier_svc.predict(X_test)
np.mean(predicted_svc == y_test)

0.6

We can see what that 60% score looks like across our classifications using a Confusion Matrix:

In [274]:
from sklearn.metrics import confusion_matrix
print (confusion_matrix(y_test, predicted_svc))

[[19 15]
 [11 20]]
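Reading the matrix printed above: scikit-learn puts the true labels on the rows and the predictions on the columns, in sorted label order ('family', then 'other'). A short sketch of what can be pulled out of those four numbers:

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels.
labels = ["family", "other"]
cm = [[19, 15],
      [11, 20]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

for i, label in enumerate(labels):
    recall = cm[i][i] / sum(cm[i])                    # share of true docs we caught
    precision = cm[i][i] / sum(row[i] for row in cm)  # share of predictions that were right
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The diagonal (19 and 20) holds the correct guesses, which is where the 0.6 comes from: 39 correct out of 65 test docs.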


<img src= "confusionMatrix_svc.png">

## 3.B Trying a New Classifier!
**Naïve Bayes** (probabilistic) classifier.<br>
<br>_A probabilistic classifier is able to predict, given an observation of an input (our dataset's feature representation), a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to._ ([Wikipedia, Probabilistic classification](https://en.wikipedia.org/wiki/Probabilistic_classification))<br>
<br>
Scikit-learn includes several variants of Naïve Bayes classifier - we'll use the one most suitable for word counts:<br> 
**The Multinomial Variant**
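To show what "a probability distribution over a set of classes" looks like, here is a from-scratch sketch of a multinomial Naïve Bayes on word counts with add-one smoothing. The training documents are hypothetical and this is a simplified illustration, not scikit-learn's implementation.

```python
import math
from collections import Counter

# Hypothetical tokenized training documents with labels.
train = [
    (["family", "community", "rural"], "family"),
    (["family", "clinic", "community"], "family"),
    (["surgery", "research", "lab"], "other"),
    (["research", "surgery", "hospital"], "other"),
]

vocab = sorted({w for doc, _ in train for w in doc})
classes = sorted({label for _, label in train})
doc_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in classes}
for doc, label in train:
    word_counts[label].update(doc)

def predict_proba(doc):
    """Return P(class | doc) for each class, using add-one (Laplace) smoothing."""
    log_scores = {}
    for c in classes:
        total_words = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / len(train))  # log prior
        for w in doc:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores[c] = score
    # Convert log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

probs = predict_proba(["rural", "family", "clinic"])
print(probs)
print(max(probs, key=probs.get))
```

Instead of only answering 'family', the classifier reports how confident it is in each class.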

In [263]:
from sklearn.naive_bayes import MultinomialNB
classifier_mnb = MultinomialNB().fit(X_train, y_train)

In [264]:
#docs_new = ['a personal medical experience ignited my passion to pursue a career in medicine.', 'I am a Dual admissions program student']
#X_new_counts = count_vect.transform(docs_new)

predicted_mnb = classifier_mnb.predict(X_test)

#for doc, label in zip(X_test, predicted):
 #   print('%r => %s' % (doc, label))



In [265]:
classifier_mnb.score(X_test, y_test)

0.6

In [266]:
from sklearn.metrics import confusion_matrix
print (confusion_matrix(y_test, predicted_mnb))

[[28  6]
 [20 11]]


<img src = "confusion_matrix.png">

### Ok so decidedly not a great score here. SO FAR ...

Not a ton of highly successful classifiers, ay? Well, that's where finetuning a Classifier comes in and that's where I'll start applying some of the hand-coded categories of Jackie's **extremely thorough** tagging of W2V outputs. 

## FINETUNING a CLASSIFIER using hand-tagged categories
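As a starting point, here is a rough sketch of the grid search strategy from step 4 of the TO-DO list, using scikit-learn's `Pipeline` and `GridSearchCV`. It does not use the hand-tagged categories yet. The toy texts stand in for `df['text']` and `df['label']`, and the parameter values in `param_grid` are arbitrary picks for illustration, not recommended settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in texts; in the notebook these would be df['text'] and df['label'].
texts = [
    "family clinic in a rural community", "community health and family care",
    "primary care for every family", "rural family doctor and clinic",
    "surgery research in the lab", "hospital surgery and research",
    "lab research on cardiac surgery", "research fellowship in a hospital lab",
]
labels = ["family"] * 4 + ["other"] * 4

# Chain feature extraction and the classifier so both can be tuned together.
pipeline = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])

# Candidate settings for the feature extraction step and for the classifier.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

# Try every combination with 2-fold cross-validation and keep the best one.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
print(search.best_score_)
```

On the real data, a larger `cv` value (e.g. 5) would likely give a steadier estimate than the 2 folds used for this tiny example.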


### SHOUTOUTS!
Jackie Knapke! and everyone else who contributed from the:<br>
Dept of Family & Community Medicine - University of Cincinnati, College of Medicine<br>
[SciKit-Learn "Working With Text Data" Tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) <br>
[CUNY DHRI Open Curricula: Intro to Machine Learning](https://www.dhinstitutes.org/curricula/)<br>
[Finetuning a Classifier in SciKit-Learn](https://towardsdatascience.com/fine-tuning-a-classifier-in-scikit-learn-66e048c21e65)