# Mini project 1
## Identifying the girl next door - a study in natural language processing
This notebook contains the code with which we have generated the results of our analysis. Code explanations follow in markdown throughout the notebook.

## Necessary imports


In [9]:
import nltk
import pandas as pd
import io
import re
import pickle
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from random import shuffle
#nltk.download('punkt') # Uncomment if package not downloaded
#nltk.download('stopwords') #  Uncomment if package not downloaded
# PorterStemmer is part of nltk itself, so no separate download is needed


## 1.1 Data loading
Data is loaded as a pandas dataframe from the "cleaned" .csv file. 

In [3]:
data = pd.read_csv('cleanerstill.csv', sep=";")
#data.head() # Uncomment to inspect head of dataframe
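
As a minimal sketch of the loading step, the cell below reads a small semicolon-separated sample from memory instead of from disk. The sample rows and the name `toy` are made up for illustration and are not taken from cleanerstill.csv.

```python
import io

import pandas as pd

# Made-up sample in a semicolon-separated layout; not taken from cleanerstill.csv.
raw = "age;sex;essay0\n25;m;hello there\n31;f;i like hiking\n"

# sep=";" tells pandas that fields are separated by semicolons rather than commas.
toy = pd.read_csv(io.StringIO(raw), sep=";")

print(toy.shape)
print(list(toy.columns))
```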



## 1.2 Data balancing
First, we filter out every row where sex, age, ethnicity, essay0 (about me) or essay4 (interests) is not given.
This is done by creating a number of masks of the original data. We then check the sizes of the male and female groups. It becomes apparent that the male group is considerably larger than the female group. Hence it is reduced by random sampling, after which the male and female groups are concatenated and shuffled into df_final.

In [6]:
# mask keeps only rows where ethnicity, age, sex, essay0 and essay4 are given. Rows where ethnicity is NaN are dropped with notna(). This particular value is not present for age and sex, hence it is not masked out there.
mask = (data['ethnicity'] != ' ') & (data['ethnicity'].notna()) & (data['age'] != ' ') & (data['sex'] != ' ') & (data['essay0'] != ' ') & (data['essay4'] != ' ')
data_new = data[mask]

data_rc = data_new.filter(['age','sex','ethnicity','essay0','essay4'], axis=1)

mask_male = (data_rc['sex'] == 'm') # creates mask of males where sex evaluates to True if == 'm'
mask_female = (data_rc['sex'] == 'f') # creates mask of females where sex evaluates to True if == 'f'

data_om = data_rc[mask_male] # only males for relevant columns
data_of = data_rc[mask_female] # only females for relevant columns

data_om_reduced = data_om.sample(frac=0.6665) # returns a random sample of data_om where parameter frac describes size of sample relative to original data

df_tmp = pd.concat([data_of, data_om_reduced], ignore_index=True)

df_final = df_tmp.sample(frac=1) # frac=1 returns every row of df_tmp in random order, i.e. shuffles it (currently 34685 obs)
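
The masking and balancing steps can be sketched on a toy dataframe. This is an illustration under stated assumptions, not the cell used for the analysis: the rows are made up, and the larger group is downsampled to the exact size of the smaller one with `n=` instead of a hand-tuned `frac`. The names `toy`, `clean` and `balanced` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real data; ' ' and NaN mark missing essays.
toy = pd.DataFrame({
    "sex": ["m"] * 6 + ["f"] * 4,
    "essay0": ["text", " ", "text", "text", np.nan, "text",
               "text", "text", " ", "text"],
})

# Keep only rows where the essay is neither a blank placeholder nor NaN.
mask = (toy["essay0"] != " ") & toy["essay0"].notna()
clean = toy[mask]

males = clean[clean["sex"] == "m"]
females = clean[clean["sex"] == "f"]

# Downsample both groups to the size of the smaller one.
n = min(len(males), len(females))
balanced = pd.concat(
    [males.sample(n=n, random_state=0), females.sample(n=n, random_state=0)],
    ignore_index=True,
)

# frac=1 returns every row in random order, i.e. a shuffle.
balanced = balanced.sample(frac=1, random_state=0).reset_index(drop=True)

counts = balanced["sex"].value_counts().to_dict()
print(counts)
```

Sampling with `n=` avoids having to recompute the fraction whenever the filtering changes the group sizes, and `random_state` makes the sample reproducible.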

### 1.2.1 Pickling 
df_final is pickled and exported as a binary file. The data set is quite big and we perform a large number of calculations throughout the analysis. The pickle module lets us save intermediate results, so they do not have to be recomputed on every run. As we do not need to visually inspect the data any further, we save it as a binary file.

In [10]:
with open("Data/cleaned_data", 'wb') as outfile:
    pickle.dump(df_final, outfile)
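
A pickle round trip can be sketched as follows. The dataframe is made up and the file is written to a temporary directory, so the names `frame`, `restored` and the path are assumptions for illustration only.

```python
import os
import pickle
import tempfile

import pandas as pd

# Made-up dataframe standing in for df_final.
frame = pd.DataFrame({"sex": ["m", "f"], "age": [25, 31]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_data")

    # Write the object to a binary file ...
    with open(path, "wb") as outfile:
        pickle.dump(frame, outfile)

    # ... and read it back in a later step without recomputing it.
    with open(path, "rb") as infile:
        restored = pickle.load(infile)

print(restored.equals(frame))
```

The `with` statement closes the file even if dumping or loading raises an exception.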

## 2.1 Finding frequent n-grams
Considering the huge number of n-grams in our dataset, we do not want to recalculate n-gram frequencies every time we run an experiment for a specific label. Hence, we use two functions to ease the process. The first is ngram_generator_wlabels(), the second is freq_ngrams(), as explained below.

### 2.1.1 ngram_generator_wlabels
ngram_generator_wlabels() returns a tuple of three dictionaries, one each for unigrams, bigrams and trigrams. Each dictionary is keyed by essay column ('essay0' or 'essay4'), and the corresponding value is a list with one tuple per essay, containing a list of all the n-grams in the essay and a dictionary of the essay's labels (age, ethnicity and sex).

Calculating n-grams for approx. 35,000 essays is somewhat time-consuming, so this script pickles the results and dumps them to a binary file. Hence, we only need to create the n-grams once, saving a substantial amount of time when running experiments.

In [None]:
def ngram_generator_wlabels(data):
    essay_list = ['essay0','essay4']
    stop_words = stopwords.words('english')
    porter = PorterStemmer()
    with open(data, 'rb') as infile:
        data_file = pickle.load(infile)
    essay_unigrams = {}
    essay_bigrams = {}
    essay_trigrams = {}
    # essay_unigrams['essay0'] will contain one entry per essay in essay0: a tuple of the list of all unigrams in that essay and a dictionary of its label values.
    # Access it as essay_unigrams['essay0'][i], where i is the position of the essay in that list.
    classifiers = ["age", "ethnicity", "sex"]
    for es in essay_list:
        # Series.iteritems() was removed in pandas 2.0; items() yields the same (index, value) pairs
        essays = [(idx, e) for idx, e in data_file[es].items()]
        unigrams_list = []
        bigrams_list = []
        trigrams_list = []
        for (i, essay) in essays:
            tmp = []
            tmp_list = []
            essay_bigram_list = []
            essay_trigram_list = [] 
            classifier_dictionary = {}
            for clas in classifiers:
                classifier_dictionary[clas] = data_file[clas][i]
            if type(essay) != float:
                tmp.extend([w for w in essay.split()])
                for w in tmp:
                    splt = w.split("'")
                    for s in splt:
                        if not s.isdigit():
                            tmp_list.append(porter.stem(s))
                for j in range(len(tmp_list)-1):
                    essay_bigram_list.append(" ".join((tmp_list[j],tmp_list[j+1])))
                for k in range(len(tmp_list)-2):
                    essay_trigram_list.append(" ".join((tmp_list[k],tmp_list[k+1],tmp_list[k+2])))
                unigrams_list.append((tmp_list, classifier_dictionary))
                bigrams_list.append((essay_bigram_list, classifier_dictionary))
                trigrams_list.append((essay_trigram_list, classifier_dictionary))
        essay_unigrams[es] = unigrams_list
        essay_bigrams[es] = bigrams_list
        essay_trigrams[es] = trigrams_list
    return (essay_unigrams, essay_bigrams, essay_trigrams)
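
To make the returned structure concrete, here is a minimal sketch for a single made-up essay. It is a simplification under stated assumptions: stemming is left out so the example runs without nltk, empty tokens are skipped, and the helper names `tokenize` and `ngrams` as well as the label values are hypothetical.

```python
def tokenize(text):
    """Split on whitespace and apostrophes, dropping empty and purely numeric tokens."""
    tokens = []
    for word in text.split():
        for part in word.split("'"):
            if part and not part.isdigit():
                tokens.append(part)
    return tokens


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("I don't like 2 long walks")
labels = {"age": 27, "sex": "f"}  # made-up label values

# Same shape as the dictionaries returned above: essay column -> list of (n-grams, labels).
essay_bigrams = {"essay0": [(ngrams(tokens, 2), labels)]}
essay_trigrams = {"essay0": [(ngrams(tokens, 3), labels)]}

bigram_list, bigram_labels = essay_bigrams["essay0"][0]
print(bigram_list)
print(bigram_labels)
```

Indexing with `["essay0"][0]` selects the first essay in that column, and unpacking the tuple gives its n-gram list and its label dictionary.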

In [None]:
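import sys  # sys.argv is used below, but sys is not among the imports at the top

# sys.argv[1] is the path to the pickled dataframe, sys.argv[2] the path of the output file
ngrams = ngram_generator_wlabels(sys.argv[1])
with open(sys.argv[2], 'wb') as outfile:
    pickle.dump(ngrams, outfile)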

### 2.1.2 freq_ngrams