# Mini project 1
## Identifying the girl next door - a study in natural language processing
This notebook contains the code with which we have generated the results of our analysis. Code explanations follow in markdown throughout the notebook.

## Necessary imports


In [20]:
import nltk
import pandas as pd
import io
import re
import pickle
import sys
from random import shuffle
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import collections
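from nltk.classify import NaiveBayesClassifier
from nltk.classify.scikitlearn import SklearnClassifier # only needed for the commented-out sklearn classifiers in bayes()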
from nltk.metrics.scores import precision, recall, f_measure
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import LinearSVC
#nltk.download('punkt') # Uncomment if package not downloaded
#nltk.download('stopwords') #  Uncomment if package not downloaded
# PorterStemmer is part of nltk itself and needs no separate download


## 1.1 Data loading
Data is loaded as a pandas dataframe from the "cleaned" .csv file. 

In [2]:
data = pd.read_csv('cleanerstill.csv', sep=";")
#data.head() # Uncomment to inspect head of dataframe



## 1.2 Data balancing
Firstly, we want to filter out any row where sex, age, ethnicity, essay0 (about me) or essay4 (interests) is not given.
This is done by creating a number of masks of the original data. We then check the sizes of the male and female groups. It becomes apparent that the male group is considerably bigger than the female group. Hence it is reduced by random sampling, after which the male and female groups are concatenated and shuffled into df_final.

In [3]:
mask = (data['ethnicity'] != ' ') & data['ethnicity'].notna() & (data['age'] != ' ') & (data['sex'] != ' ') & (data['essay0'] != ' ') & (data['essay4'] != ' ')
# mask removes all rows where ethnicity, age, sex, essay0 or essay4 is blank. notna() also removes rows where ethnicity is NaN. This particular value is not present for age and sex, hence it is not masked out.
data_new = data[mask]

data_rc = data_new.filter(['age','sex','ethnicity','essay0','essay4'], axis=1)

mask_male = (data_rc['sex'] == 'm') # creates mask of males where sex evaluates to True if == 'm'
mask_female = (data_rc['sex'] == 'f') # creates mask of females where sex evaluates to True if == 'f'

data_om = data_rc[mask_male] # only males for relevant columns
data_of = data_rc[mask_female] # only females for relevant columns

data_om_reduced = data_om.sample(frac=0.6665) # returns a random sample of data_om where parameter frac describes size of sample relative to original data

df_tmp = pd.concat([data_of, data_om_reduced], ignore_index=True)

df_final = df_tmp.sample(frac=1) # frac=1 returns every row in random order, i.e. shuffles df_tmp (currently 34685 obs)



### 1.2.1 Pickling 
df_final is split into test, dev and train data in a 10/10/80 ratio, then pickled and exported as binary files. The data set is quite big and we do a large number of calculations throughout the analysis. The pickle module lets us save intermediate results, which speeds up later calculations. As we do not need to visually inspect the data any further, we save it as binary files.
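
The slice arithmetic used for the split can be checked on a plain Python list. This is only a sketch: `rows` is a hypothetical stand-in for df_final, sized to the 34685 observations mentioned in 1.2. Note that `-n // 10` is evaluated as `(-n) // 10`, which rounds down, so the dev slice can be one row larger than the test slice. The three slices still cover every row exactly once.

```python
# Toy stand-in for df_final (assumption: 34685 rows, as noted in 1.2).
rows = list(range(34685))
n = len(rows)

test = rows[:n // 10]            # first tenth
dev = rows[-n // 10:]            # last tenth; -n // 10 is (-n) // 10
train = rows[n // 10:-n // 10]   # everything in between

print(len(test), len(dev), len(train))  # 3468 3469 27748
```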

In [4]:
test_data = df_final[:len(df_final)//10]
dev_data = df_final[-len(df_final)//10:]
train_data = df_final[len(df_final)//10:-len(df_final)//10]

# Dump each subset to its own binary file; the with-block closes the file after writing
for name, subset in [("test_data", test_data), ("dev_data", dev_data), ("train_data", train_data)]:
    with open(f"Data/{name}", 'wb') as outfile:
        pickle.dump(subset, outfile)

# 2.1 Finding frequent n-grams
Considering the huge number of n-grams in our dataset, we do not want to calculate n-gram frequencies every single time we run an experiment for a specific label. Hence, we use two functions to ease the process. The first is ngram_generator_wlabels(), the second is freq_ngrams(), as explained below.

## 2.1.1 ngram_generator_wlabels
ngram_generator_wlabels() returns a tuple of three dictionaries: one for unigrams, one for bigrams and one for trigrams. Each dictionary is keyed by essay column (essay0 or essay4), and each value is a list with one tuple per essay. That tuple holds the list of all n-grams in the essay and a dictionary of the author's labels (age, ethnicity and sex).

Calculating n-grams for approx. 35,000 essays is time consuming, so the results are pickled and dumped to a binary file. Hence, we only need to create the n-grams once for each data set, saving a substantial amount of time when running experiments.

The function takes one argument: the path of the input file, e.g. the test_data, dev_data or train_data file from 1.2.1. The name of the output file is given in the cell that calls the function.
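
To illustrate the returned structure, here is a minimal sketch on made-up data. The helper `make_ngrams`, the toy tokens and the label values are all hypothetical and not part of the notebook; stemming is skipped so that the example needs no nltk.

```python
def make_ngrams(tokens, n):
    """Join every run of n consecutive tokens into one space-separated string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "love", "live", "music"]                  # toy tokens
labels = {"age": 27, "ethnicity": "white", "sex": "f"}   # made-up labels

# Same shape as one of the three returned dictionaries:
# essay column -> list of (n-grams of one essay, labels of its author).
essay_bigrams = {"essay0": [(make_ngrams(tokens, 2), labels)]}

bigrams, essay_labels = essay_bigrams["essay0"][0]
print(bigrams)               # ['i love', 'love live', 'live music']
print(essay_labels["sex"])   # f
```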

In [9]:
def ngram_generator_wlabels(data):
    essay_list = ['essay0','essay4']
    stop_words = stopwords.words('english')
    porter = PorterStemmer()
    infile = open(data, 'rb')
    data_file = pickle.load(infile)
    infile.close()
    essay_unigrams = {}
    essay_bigrams = {}
    essay_trigrams = {}
    '''essay_unigrams['essay0'] will contain a list of all unigrams for each essay, along with a dictionary of all values for the classifiers
    You access it by doing essay_unigrams['essay0'][i] where i is an index for a tuple of each essay in essay0 and a dictionary of classifier values'''
    classifiers = ["age", "ethnicity", "sex"]
    for es in essay_list:
        all_bigrams = []
        essays = [(idx, e) for idx, e in data_file[es].items()] # Series.iteritems() was removed in pandas 2.0
        unigrams_list = []
        bigrams_list = []
        trigrams_list = []
        for (i, essay) in essays:
            tmp = []
            tmp_list = []
            essay_bigram_list = []
            essay_trigram_list = [] 
            classifier_dictionary = {}
            for clas in classifiers:
                classifier_dictionary[clas] = data_file[clas][i]
            if type(essay) != float:
                tmp.extend([w for w in essay.split()])
                for w in tmp:
                    splt = w.split("'")
                    for s in splt:
                        if not s.isdigit():
                            tmp_list.append(porter.stem(s))
                for j in range(len(tmp_list)-1):
                    essay_bigram_list.append(" ".join((tmp_list[j],tmp_list[j+1])))
                for k in range(len(tmp_list)-2):
                    essay_trigram_list.append(" ".join((tmp_list[k],tmp_list[k+1],tmp_list[k+2])))
                unigrams_list.append((tmp_list, classifier_dictionary))
                bigrams_list.append((essay_bigram_list, classifier_dictionary))
                trigrams_list.append((essay_trigram_list, classifier_dictionary))
        essay_unigrams[es] = unigrams_list
        essay_bigrams[es] = bigrams_list
        essay_trigrams[es] = trigrams_list
    return (essay_unigrams, essay_bigrams, essay_trigrams)

In [12]:
ngrams = ngram_generator_wlabels('Data/train_data') # input file, e.g. 'Data/train_data' or 'Data/dev_data'
outfile = open('Data/train_ngrams', 'wb') # output file; bayes() reads 'Data/train_ngrams' and 'Data/dev_ngrams'
pickle.dump(ngrams, outfile)
outfile.close()

## 2.1.2 freq_ngrams
freq_ngrams() takes a list of essay columns, here essay0 and essay4. It reads the binary train_data file from 1.2.1 and, for each essay column, first creates lists of all stemmed words, bigrams and trigrams, skipping any token that is a stopword. These lists are all_words, all_bigrams and all_trigrams in the code below.

It then computes the frequency of each n-gram in these lists. In total it dumps 9 files: for each essay column (essay0 and essay4) the frequency distributions of uni-, bi- and trigrams in that column, and then three files with the frequency distributions of the n-grams of both columns combined.

This script needs only to be run once.
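
The counting step can be sketched with `collections.Counter` from the standard library. Counter is a stand-in chosen here so the example runs without nltk; the notebook itself uses nltk.FreqDist, which offers the same `most_common()` method. The token lists are made up.

```python
from collections import Counter

essay0_words = ["music", "love", "music", "book"]   # toy tokens for essay0
essay4_words = ["book", "music", "movi"]            # toy tokens for essay4

# One distribution per essay column, plus one over both columns combined.
freq_essay0 = Counter(essay0_words)
freq_essay4 = Counter(essay4_words)
freq_all = Counter(essay0_words + essay4_words)

print(freq_all.most_common(2))   # [('music', 3), ('book', 2)]
```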

In [33]:
def freq_ngrams(essay_list):
    stop_words = stopwords.words('english')
    porter = PorterStemmer()
    with open("Data/train_data", 'rb') as infile: # the with-block closes the file once the data is loaded
        train_data = pickle.load(infile)
    every_word = []
    every_bigram = []
    every_trigram = []
    
    for es in essay_list:
        all_words = []
        all_bigrams = []
        all_trigrams = []
        essays = [(idx, e) for idx, e in train_data[es].items()] # Series.iteritems() was removed in pandas 2.0
        for i, essay in essays:
            tmp = []
            tmp_list = []
            if type(essay) != float:
                tmp.extend([porter.stem(w) for w in essay.split() if not w in stop_words])
                for w in tmp:
                    splt = w.split("'")
                    for s in splt:
                        if not s.isdigit() and s not in stop_words:
                            tmp_list.append(s)
                for j in range(len(tmp_list)-1):
                    all_bigrams.append(" ".join((tmp_list[j],tmp_list[j+1])))
                for k in range(len(tmp_list)-2):
                    all_trigrams.append(" ".join((tmp_list[k],tmp_list[k+1],tmp_list[k+2])))
                all_words.extend(tmp_list)
                
        freq_words = nltk.FreqDist(w for w in all_words)
        freq_bigrams = nltk.FreqDist(w for w in all_bigrams)
        freq_trigrams = nltk.FreqDist(w for w in all_trigrams)
        
        with open(f"Data/{es}_freq_words", 'wb') as file:
            pickle.dump(freq_words, file)
        with open(f"Data/{es}_freq_bigrams", 'wb') as file:
            pickle.dump(freq_bigrams, file)
        with open(f"Data/{es}_freq_trigrams", 'wb') as file:
            pickle.dump(freq_trigrams, file)

        every_word.extend(all_words)
        every_bigram.extend(all_bigrams)
        every_trigram.extend(all_trigrams)
        
    freq_all_words = nltk.FreqDist(w for w in every_word)
    freq_all_bigrams = nltk.FreqDist(w for w in every_bigram)
    freq_all_trigrams = nltk.FreqDist(w for w in every_trigram)

    with open("Data/all_freq_words", 'wb') as file:
        pickle.dump(freq_all_words, file)
    with open("Data/all_freq_bigrams", 'wb') as file:
        pickle.dump(freq_all_bigrams, file)
    with open("Data/all_freq_trigrams", 'wb') as file:
        pickle.dump(freq_all_trigrams, file)

In [34]:
freq_ngrams(['essay0','essay4'])

# 3. Running the experiment
Now that all data is created we can run the actual Naive Bayes classifier. For this purpose we use the nltk library, wrapped in the function bayes(). bayes() uses label_func() and document_features() to create the relevant feature sets.



## 3.1.1 label_func()
label_func() is used for the labels that are not binary i.e. age and ethnicity. 
For these labels we need to group the data into larger groups. 

The function takes two parameters, l, which is a class like sex, age or ethnicity,
and cdl which is a dictionary as given in ngram_generator_wlabels() that contains 
the labels for a given essay.

If you call bayes() with unigrams, essay and sex, label_func will return m or f
given whether the author is male or female. 

If bayes() is called with age, label_func will look up the age of the author of a given
essay and return u_30 if the age of the author is 30 or less and o_30 if age is over 30.

Lastly, if bayes() is called with ethnicity, it will give the individual user the label white or n-white (non-white).

## 3.1.2 document_features()
document_features is utilised to create the feature sets. It is called when defining the variables train_features and dev_features.

This means the function is given a list of each ngram in each essay with its corresponding labels.

From this, it returns a dictionary with each of the 2000 most frequent n-grams for all essays in the category as keys and a boolean of whether that n-gram is in the actual essay it is reviewing at the moment.
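
A toy version of this lookup, as a sketch only: the three-item `word_features` list is a made-up stand-in for the 2000 most frequent n-grams, and it is passed as a parameter here, whereas the notebook's document_features() reads it from the enclosing bayes() function.

```python
def document_features(document, word_features):
    """Map each frequent n-gram to whether it occurs in this essay."""
    document_words = set(document)  # set lookup is faster than scanning the list
    return {f"contains({w})": (w in document_words) for w in word_features}

word_features = ["music", "book", "love"]   # stand-in for the top 2000 n-grams
features = document_features(["love", "live", "music"], word_features)
print(features)
# {'contains(music)': True, 'contains(book)': False, 'contains(love)': True}
```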
    


## 3.1.3 bayes()
Our function bayes() takes four arguments: an integer selecting unigrams (0), bigrams (1) or trigrams (2), an essay to analyse, a label to identify, and a boolean deciding whether to use the most frequent n-grams of just the given essay or of both essays combined.

An essential part of this function is creating the actual feature sets. train_features, for instance, is a list of tuples with two elements each. The first element is the dictionary created by document_features(), which records, for each of the 2000 most frequent n-grams in the chosen essay category, whether that n-gram occurs in the specific essay. The second element is the label of the author of that essay, as returned by label_func().

The train and dev sets are then shuffled with the random module, after which we create our classifier by training the NaiveBayesClassifier on the train set. 

From here we can run the classifier on the test data (in this case the test data is the dev data) computing the accuracy of the classifier and showing the 100 most informative features.

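We then proceed to create two dictionaries, predictions and gold_labels. They map each label to the set of essay indices that were predicted to have that label and that actually have it, respectively. From these sets we compute precision, recall and F1-score per label.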
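
The set-based bookkeeping can be sketched in plain Python. The label lists are made up, and `prf` is a hypothetical helper written for this illustration; the notebook itself calls precision, recall and f_measure from nltk.metrics.scores. Returning zeros when there are no hits is a simplification of this sketch.

```python
from collections import defaultdict

gold = ["f", "f", "m", "m", "m"]        # made-up true labels
predicted = ["f", "m", "m", "m", "f"]   # made-up classifier output

# Map each label to the set of essay indices carrying it.
gold_labels, predictions = defaultdict(set), defaultdict(set)
for i, (g, p) in enumerate(zip(gold, predicted)):
    gold_labels[g].add(i)
    predictions[p].add(i)

def prf(reference, test):
    """Precision, recall and F1 of the predicted set against the gold set."""
    hits = len(reference & test)
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(test)
    recall = hits / len(reference)
    return precision, recall, 2 * precision * recall / (precision + recall)

for label in sorted(predictions):
    print(label, prf(gold_labels[label], predictions[label]))
```

For the label f, one of the two predicted essays is correct and one of the two true essays is found, so precision, recall and F1 are all 0.5.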

    

In [58]:
def label_func(l, cdl):
    if l == "sex" or cdl == "white":
        return cdl
    elif l == "age":
        if cdl <= 30:
            return "u_30"
        else:
            return "o_30"
    else:
        return "n-white"
    
def bayes(ngram, essay, label,all):
    if ngram == 0:
        if all:
            infile = open(f"Data/all_freq_words", 'rb')
        else:
            infile = open(f"Data/{essay}_freq_words", 'rb')
    elif ngram == 1:
        if all:
            infile = open(f"Data/all_freq_bigrams", 'rb')
        else:
            infile = open(f"Data/{essay}_freq_bigrams", 'rb')
    else:
        if all:
            infile = open(f"Data/all_freq_trigrams", 'rb')
        else:
            infile = open(f"Data/{essay}_freq_trigrams", 'rb')
            
    freq_ngrams = pickle.load(infile)
    word_features = [w for (w, f) in freq_ngrams.most_common(2000)]
    infile.close()
      
    def document_features(document):
        document_words = set(document)
        features = {}
        for word in word_features:
            features['contains({})'.format(word)] = (word in document_words)
        return features

    infile = open("Data/train_ngrams", 'rb')
    train_ngrams = pickle.load(infile)
    '''tuple with essay_unigrams, essay_bigrams and essay_trigrams
    to access unigrams, just ngram_tuple[0]["essay0"]'''
    infile.close()
    
    infile = open("Data/dev_ngrams", 'rb')
    dev_ngrams = pickle.load(infile)
    infile.close()
            
    train_features = [(document_features(t), label_func(label, class_dic[label])) for (t, class_dic) in train_ngrams[ngram][essay]]
    dev_features = [(document_features(t), label_func(label, class_dic[label])) for (t, class_dic) in dev_ngrams[ngram][essay]]
    #print(dev_features[-10:])
    shuffle(train_features)
    shuffle(dev_features)
    training_set, testing_set = train_features, dev_features
    classifier = nltk.NaiveBayesClassifier.train(training_set)
    print("Naive Bayes accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
    classifier.show_most_informative_features(100) # prints the table itself and returns None
    predictions, gold_labels = collections.defaultdict(set), collections.defaultdict(set)
    for i, (features, label) in enumerate(testing_set):
        predictions[classifier.classify(features)].add(i)
        gold_labels[label].add(i)
    for label in predictions:
        print(label, 'Precision:', precision(gold_labels[label], predictions[label]))
        print(label, 'Recall:', recall(gold_labels[label], predictions[label]))
        print(label, 'F1-Score:', f_measure(gold_labels[label], predictions[label]))
        print()
        
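    #print(testing_set[:10]) # Uncomment to inspect the first ten (features, label) pairs; predictions is a dict of sets and cannot be sliced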
    '''
    MNB_classifier = SklearnClassifier(MultinomialNB())
    MNB_classifier.train(training_set)
    print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

    BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
    BernoulliNB_classifier.train(training_set)
    print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

    LogisticRegression_classifier = SklearnClassifier(LogisticRegression(max_iter = 150))
    LogisticRegression_classifier.train(training_set)
    print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

    SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
    SGDClassifier_classifier.train(training_set)
    print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

    LinearSVC_classifier = SklearnClassifier(LinearSVC(max_iter=1200))
    LinearSVC_classifier.train(training_set)
    print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)'''

In [57]:
#bayes(ngram, essay, label, all) # run with parameters e.g. as below
bayes(0, 'essay4', 'sex', False)

Naive Bayes accuracy percent: 68.51100811123986
Most Informative Features
        contains(austen) = True                f : m      =      7.5 : 1.0
      contains(prejudic) = True                f : m      =      6.7 : 1.0
           contains(eyr) = True                f : m      =      5.2 : 1.0
         contains(pride) = True                f : m      =      5.0 : 1.0
       contains(anatomi) = True                f : m      =      4.7 : 1.0
          contains(pray) = True                f : m      =      4.5 : 1.0
       contains(barbara) = True                f : m      =      4.1 : 1.0
        contains(geisha) = True                f : m      =      4.0 : 1.0
          contains(etta) = True                f : m      =      4.0 : 1.0
        contains(memoir) = True                f : m      =      3.8 : 1.0
        contains(canada) = True                m : f      =      3.8 : 1.0
          contains(jane) = True                f : m      =      3.7 : 1.0
      contains(difranco) =