
# Privacy Text Classification 
If a provider wants to sell data on the data store, the provider would have to tag the data and it would go into a queue. The PO in change would take a look at the segments and label the segments with one of the following classifications: GA, R1, R2. GA refers to general data, R1 refers to some more private data (e.g. income) and R2 refers to even more private data (e.g. health). The PO can refer to the privacy team for assistance. This process can slow down the provider from releasing the data. 
(See https://docs.google.com/document/d/14ltFfoN0bRWdT71Zaw5A60poGZjG60XHjFSdHhuGfXM/edit?ts=58798cc6# for more information)


With the use of natural language processing, we hope to streamline this process by having a machine classify the data and limit the amount of human intervention necessary.

There are many methods of text classification. One prominent method is to use word2vec to convert the words into vectors and run classical algorithms on it (more success has been seen in word-based ConvNets and LSTMs on the vectors). Traditional NLP methods use bag-of-words, which would involve picking out the most frequent words in the entire data set and counting how often each word appears in the sample. 


Why we decided not to use bag of words... actually we can try bag of words.

In [382]:
import csv
import numpy as np
import tensorflow as tf
import gensim
import sklearn
import random
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import RadiusNeighborsClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import operator

## Data retrieval
The data used in this project were obtained from the following link: https://docs.google.com/spreadsheets/d/1ySI6QxVbsy6fLqlIn2-T0awo82whu87scO6RTpvthUY/edit#gid=463244196

Currently, we pick 2000 random GA segment names because the size of our GA far exceeds what we have in R1 and R2. This is to prevent class imbalance. This is a temporary workaround. Not sure how to approach this yet.

In [369]:
# Set up GA
ga_list = []
with open("GA.csv") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        ga_list.append(row['name'])
random.shuffle(ga_list)
ga_list = ga_list[:2000] # GA has too many data points 

# Set up r1
r1_list = []
with open("R1.csv") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        r1_list.append(row['name'])
        
# Set up r2
r2_list = []
with open("R2.csv") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        r2_list.append(row['name'])  


## Tools for word2vec
word2vec is the main tool used to convert words to vectors in a sensible fashion. We use the pretrained word2vec model provided by Google.

TODO: How we combine the vectors is slightly hackish and needs further investigation

In [330]:
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)  

KeyboardInterrupt: 

In [370]:
def calculate_vectors(lst):
    # given a list of strings
    # use google's word2vec to convert words to their vector forms
    # found average of the vectors in a name
    # return a list of vectors and dictionary from input string to its vector 
    word_score_pairs = {}
    ret = []
    for entry in lst:
        word_list = []
        for words in entry.strip().split(" > "):
            word_list.extend(words.strip().split(" "))
        sum_vect = np.zeros(300)
        num_success = 0
        for word in word_list:
            try: 
                sum_vect += model[word.strip()]
                num_success += 1
            except KeyError: 
                continue
        entry_vect = 1.0*sum_vect/max(num_success, 1)
        word_score_pairs[entry.strip()] = entry_vect
        ret.append(entry_vect)
    return ret, word_score_pairs

In [371]:
r1, r1_pairs = calculate_vectors(r1_list)
r2, r2_pairs = calculate_vectors(r2_list)
ga, ga_pairs = calculate_vectors(ga_list)
word_score_pairs = {**r1_pairs, **r2_pairs, **ga_pairs} 

## Quick look into the vectors and data
Mainly used to justify that word2vec has provided sensible word to vector mapping and the data is unique.

In [372]:
def diff(x,y):
    # Calculate euclidean distance between two points
    print(np.linalg.norm(x - y))

In [373]:
print("Some viability to the vectors")
w = word_score_pairs
v1 = w["Crossix US > Healthcare > Prescription Type > Insomnia"]
v2 = w["Crossix US > Healthcare > Prescription Type > Depression"]
v3 = w["Crossix US > Healthcare > Conversion > Uloric - Packaged"]
v4 = w["V12 > PYCO PERSONALITY > Mature Social Media Users"]
diff(v1,v2)
diff(v1,v3)
diff(v2,v3)
diff(v1,v4)
diff(v2,v4)
diff(v3,v4)

Some viability to the vectors


KeyError: 'V12 > PYCO PERSONALITY > Mature Social Media Users'

In [374]:
def get_all_words(word_score_pairs):
    # input is dictionary with keys as the segment_names
    # return all the words used in the segment_names 
    words = []
    word_list = []
    for words_with_greater in word_score_pairs.keys():
        word_list.extend(words_with_greater.split(' > '))
    for words_with_spaces in word_list:
        words.extend(words_with_spaces.split(' '))
    words = [word for word in words if word != '-']
    return words
words = get_all_words(word_score_pairs)

In [376]:
print("Total number of words: " + str(len(words)))
print("Total number of unique words: " + str(len(set(words))))
num_found = 0
for word in set(words): 
    try:
        model[word]
        num_found += 1
    except KeyError:
        continue
print("Total number of unique words found by Google's word2vec model: " + str(num_found))

Total number of words: 47575
Total number of unique words: 6758
Total number of unique words found by Google's word2vec model: 4036


## Preparing the data
We want to create a combined X and y variables and then split the data such that we have a training set, validation set, and test set.

In [377]:
def merge_and_label_data(data_segments):
    # labels data. pass in an list of lists of data of a partciular segment
    X, y = [], []
    for i in range(len(data_segments)):
        X += data_segments[i]
        y += [i] * len(data_segments[i])
    return X, y

In [378]:
def split_data(X,y):
    # Uses sklearn to split data 
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)
    return x_train, x_val, x_test, y_train, y_val, y_test

In [393]:
X, y = merge_and_label_data([ga, r1, r1])
x_train, x_val, x_test, y_train, y_val, y_test = split_data(X,y)
print("Count of each class in training set: " + str([y_train.count(i) for i in range(3)]))
print("Count of each class in validation set: " + str([y_val.count(i) for i in range(3)]))
print("Count of each class in test set: " + str([y_test.count(i) for i in range(3)]))

Count of each class in training set: [1421, 1475, 1495]
Count of each class in validation set: [296, 323, 323]
Count of each class in test set: [283, 339, 319]


## Classifiers 
Use x_train, y_train to train your models. 

Use x_val, y_val to tune your hyperparameters.

Use x_test, y_test to test your accuracy. DO NOT USE TEST DATA UNTIL YOU HAVE FOUND DESIRED HYPERPARAMETERS

#### K-Nearest Neighbor Classifier
The classic technique to provide a baseline accuracy for other methods.

In [380]:
knn = KNeighborsClassifier(n_neighbors=101)
knn.fit(x_train, y_train)
knn.predict_proba([model["Healthcare"], r2[2]])

array([[ 0.40594059,  0.27722772,  0.31683168],
       [ 0.33663366,  0.35643564,  0.30693069]])

In [296]:
print(accuracy_score(knn.predict(x_val), y_val))

0.602972399151


In [None]:
# k = 3  => accuracy = 0.59130
# k = 5  => accuracy = 0.58386
# k = 10  => accuracy = 0.5637
# k = 15  => accuracy = 0.583
# k = 100  => accuracy = 0.576

#### Radius Neighbor Classifier 
Task for Shida

In [404]:
# playing around with logistic regression 
log_reg = LogisticRegression(C=1)
log_reg.fit(x_train, y_train)
print(accuracy_score(log_reg.predict(x_val), y_val))

0.543524416136


# IGNORE BELOW FOR NOW
## Bag of Words
Another way to generate the vectors to use in classical machine learning models is to use bag of words.
We pick the words that show up the most frequently in our data set and for each sample, determine the number of times certain words show up in the sample. These will be the features. 

We have to pick the number of words to look out for. The following paper used the 50,000 most frequent words: https://arxiv.org/abs/1509.01626 


In [368]:
# NEEDS TO HAVE RUN calculate_vector (even though different representation) 
def get_frequency_map(words): 
    freq = {}
    for word in words:
        if word not in freq.keys():
            freq[word] = 1
        else: 
            freq[word] +=1
    return freq

def get_sorted_freq_list(words):
    freq = get_frequency_map(words)
    sorted_freq_list = sorted(freq.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_freq_list

def pick_bag_of_words_words(words, num_words=100):
    sorted_freq_list = get_sorted_freq_list(words)[:num_words]
    zipped_list = zip(*sorted_freq_list)
    return list(list(zipped_list)[0])

def calculate_bag_of_words_vector(lst, bow_words):
    # given a list of strings
    # use bag_of_words to convert words to their vector forms
    # return a list of vectors 
    # WARNING this method can be really slow. Look to refactor it 
    ret = []
    vector_length = len(bow_words)
    for entry in lst:
        bow_vector = [0] * vector_length
        for words in entry.strip().split(" > "):
            for word in words.split(" "):
                word = word.strip()
                try: 
                    bow_vector[bow_words.index(word)] +=1
                except ValueError:
                    continue
        ret.append(bow_vector)
    return ret

# Example use case
bow_words = pick_bag_of_words_words(words, 500)
r1_bow = calculate_bag_of_words_vector(r1_list, bow_words)
print(sum(r1_bow[441]))

7
