# Project_Code 2: Clean up and analysis of ELI Data #
## Ben Naismith ##

### Changes since 'Project_Code1' ###

This new document has been created as a number of significant changes have been made to the original code. Based on discussions with other members of the ELI Data Mining Group, the following points were determined:

- For the sake of efficiency, it is better not to merge the different data frames into one big one
- A 'sanitization' step of the data was completed which duplicated some of the steps of my initial code. These duplications include removing unwanted apostrophes, changing all 'null' and 'ull' to NaN, and removing empty or unreal students (who were most likely teachers). As such, the dataset is now ready for more in-depth cleaning and analysis, i.e. the purpose of this notebook. The code for the sanitization step is in 'convert_0_to_1.ipynb', in a private repository of the ELI Data Mining Group.

### Data Sharing Plan ###

The full ELI data set (see project_plan.md) is private at this time. Below is a workbook with the current code for organizing and cleaning that data. In order to see how the code works, snippets of data have been displayed throughout.

A sample of the 'sanitized' data is included in the 'data' folder in this same repository. It contains samples of the four CSV files referred to in this code, consisting of 1000 answers, in order to allow for testing and reproducibility by others of the code. These 1000 answers are the first 1000 from the answer_csv file and correspond to user_file_id 7505 to 10108.

Ultimately, it is the intention of the dataset's authors for the entire dataset to be made public, with a CC license. Please see LICENSE_notes.md for details.

### Initial setup ###

In [None]:
#Import necessary modules
import numpy as np
import pandas as pd
import nltk
import glob
import matplotlib.pyplot as plt
import random

#Display the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Create short-hand for directory root
cor_dir = "/Users/Benjamin's/Documents/ELI_Data_Mining/Data-Archive/1_sanitized/"

In [None]:
#Add starter code created by Na-Rae Han for the ELI research group
from elitools import *

Pretty printing has been turned OFF
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48384 entries, 1 to 48420
Data columns (total 8 columns):
question_id        48384 non-null int64
anon_id            48353 non-null object
user_file_id       48384 non-null int64
text               47175 non-null object
directory          14 non-null object
is_doublespaced    48384 non-null int64
is_plagiarized     48384 non-null int64
is_deleted         48384 non-null int64
dtypes: int64(5), object(3)
memory usage: 3.3+ MB
<class 'pandas.core.frame.DataFrame'>
Index: 913 entries, ez9 to bn6
Data columns (total 20 columns):
gender                       913 non-null object
birth_year                   913 non-null int64
native_language              913 non-null object
language_used_at_home        912 non-null object
language_used_at_home_now    855 non-null object
non_native_language_1        859 non-null object
yrs_of_study_lang1           863 non-null object
study_in_classroom_lang1     863 non-null 

### Student information (S_info_csv and S_info_df) ###

In [None]:
#Process the student_information.csv file
S_info_csv = cor_dir + "student_information.csv"
S_info_df = pd.read_csv(S_info_csv, index_col = 'anon_id')

S_info_df.head() #Issues still apparent with integers turned into floats
S_info_df.tail(10) #6 anon_id with no personal info - perhaps not students and to be 'pruned', as well as teachers with 'English' as the native language

In [None]:
#Remove anyone with 'English' or 'NaN' as their native_language, i.e. not students

#First try to create filters

Englishfilter = S_info_df['native_language'] == 'English' #first filter works
NaNfilter = S_info_df['native_language'] == np.nan #second filter doesn't: NaN never compares equal to anything, so this is always False

fake_Ss = S_info_df.loc[Englishfilter] #works, but...
fake_Ss

#fake_Ss = S_info_df.loc[(Englishfilter) or (NaNfilter)] #doesn't work: 'or' cannot combine two boolean Series
#fake_Ss
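
One possible way to get both filters working is sketched below. It assumes `isnull()` for the NaN test and `|` for combining the two boolean Series, and it uses a small invented frame (`toy_df`) in place of S_info_df, so the names and values are for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for S_info_df (values invented for illustration only)
toy_df = pd.DataFrame(
    {'native_language': ['Arabic', 'English', np.nan, 'Korean']},
    index=pd.Index(['aa1', 'bb2', 'cc3', 'dd4'], name='anon_id'))

english_filter = toy_df['native_language'] == 'English'
nan_filter = toy_df['native_language'].isnull()  # isnull() detects NaN; == np.nan never does

# Combine boolean Series element-wise with | (not the Python keyword 'or')
fake_Ss = toy_df.loc[english_filter | nan_filter]
real_Ss = toy_df.loc[~(english_filter | nan_filter)]

print(list(fake_Ss.index))
print(list(real_Ss.index))
```

The `~` operator inverts the combined filter, which would give the 'pruned' set of real students directly.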


### Student responses (answer_csv and answer_df) ###

In [None]:
#Process answer.csv file
answer_csv = cor_dir + "answer.csv"
answer_df = pd.read_csv(answer_csv, index_col = 'answer_id')

answer_df.head()
answer_df.tail(10)

### Course IDs ###
(should help with finding specific texts and linking other data frames)

In [None]:
#Process course.csv file
course_csv = cor_dir + "course.csv"
course_df = pd.read_csv(course_csv, index_col = 'course_id')

course_df.head()

###  user_file_internal ###
- big csv file with a lot of information
- should help with finding specific texts and linking other data frames
- includes file_type_id, course_id, and paths of text and wav files (i.e. all the spoken responses I need)


In [None]:
#Process user_file_internal.csv file
user_csv = cor_dir + "user_file_internal.csv"
user_df = pd.read_csv(user_csv, index_col = 'user_file_id')

user_df.head()

### Basic info about dataframes ###

The following information is an overview of the four dataframes/csv files currently being looked at:

#### S_info_df ####
Size:
- there are 941 entries, i.e. students, although at least 9 need to be removed once filters can be made to work
- 21 columns including info about languages spoken, personal data like age, and learning preferences
- Some columns will likely be removed if deemed unhelpful/unnecessary (e.g. 4th language spoken)
- Some data are normalized, e.g. years of study, but other fields were open-ended, resulting in very varied responses

Connection to other dataframes:
- link to answer_df is anon_id

Most useful columns for this project:
- anon_id (for linking to other df)
- L1, gender, time studying, age (for data analysis)  


#### answer_df ####
Size:
- there are 47175 'text' entries, i.e. student responses, although 48384 total rows. The remaining rows (including many null texts) need to be removed, as without texts they serve no purpose
- 9 columns including info about the question, the answer, and characteristics of the text (like if it was plagiarized)

Connection to other dataframes:
- link to S_info_df is the anon_id column; link to user_df is the user_file_id column (course_df is then reached through user_df via course_id)

Most useful columns for this project:
- answer_id (shorthand for the individual texts to be analyzed)
- text (the most important column so far) -> to be converted into tokens, bigrams, etc.  
- anon_id (for linking to other df)


#### course_df ####
Size:
- there are 1071 entries, i.e. one row for each course
- 6 columns including info about the course and class, both in terms of their assigned number and a description

Connection to other dataframes:
- link to user_df is course_id 

Most useful columns for this project:
- only really useful as a transition for linking to other df  


#### user_df ####
Size:
- there are 76371 rows, each with a file_id number. However, it is unclear how to use this information effectively.
- There are 29 columns, although many are not useful for this project
- A lot of the cells have no input
- Some columns will likely be removed if deemed unhelpful/unnecessary

Connection to other dataframes:
- link to course_df is course_id column

Most useful columns for this project:
- course_id (to link to other DF)
- file_type_id (for indicating the type of activity used in class)

In [None]:
S_info_df.info()

In [None]:
answer_df.info()

In [None]:
course_df.info()

In [None]:
user_df.info()

### Creating find_stuff function ###

Goal: create a function that allows for easy retrieval of information from the various dataframes.


In [None]:
#adapted from initial work of Brianna - thank you!

#this works to find all the course_id entries for a particular class type, in this case '3' which == speaking

def find_stuff(df, class_type):
    class_id = df.loc[df['class_id'] == class_type]
    return class_id

test = find_stuff(course_df, 3)
test.head()

In [None]:
#test #2

test2 = find_stuff(course_df, 5)
test2.head()

- Next step is to either expand on this function or create other similar ones to allow look up of other types of info
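
As one possible direction for expanding the function, the sketch below takes the column name as a parameter so the same lookup can be reused for other kinds of information. The name `find_rows` and the toy frame are hypothetical and not part of the original code.

```python
import pandas as pd

def find_rows(df, column, value):
    """Return the rows of df whose `column` equals `value`."""
    return df.loc[df[column] == value]

# Toy stand-in for course_df (values invented for illustration only)
toy_course_df = pd.DataFrame(
    {'class_id': [2, 3, 3], 'level_id': [4, 3, 5]},
    index=pd.Index([101, 102, 103], name='course_id'))

speaking = find_rows(toy_course_df, 'class_id', 3)  # same result as find_stuff(df, 3)
level_4 = find_rows(toy_course_df, 'level_id', 4)   # lookup on a different column
```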

### Tokenization of answers ###

Goal: tokenize the text in answer.csv to allow for further analysis (bigrams, lexical diversity, etc.)


In [None]:
#find column to tokenize
answer_df[['text']].head()

In [None]:
#With the magic of stackoverflow, this seems to work: rows with no text (NaN) are dropped, then the rest are tokenized
answer_df = answer_df[answer_df['text'].notnull()].copy() #.copy() avoids a SettingWithCopyWarning on the next line
answer_df['toks'] = answer_df['text'].apply(nltk.word_tokenize)

answer_df.head()

### Bigrams ###

Goal: create a bigrams column from the toks column


In [None]:
#mini-test to make sure I am creating bigrams correctly

bigram_test = answer_df.toks[1]
bigram_test
list(nltk.bigrams(bigram_test))

#test works, let's try on dataframe

answer_df['bigrams'] = answer_df.toks.apply(lambda x: list(nltk.bigrams(x)))
answer_df.head(1)

### Create frequency dictionary for entire corpus ###

Frequency dictionary for all toks

In [None]:
testdict = nltk.FreqDist(answer_df.toks[1])
random.sample(list(testdict.items()),5) #random 5-item sample
#looks ok, now to apply to the whole column

In [None]:
answer_corpus = ' '.join(answer_df['text'])
answer_corpus[:100]
answer_corpus_tok = nltk.word_tokenize(answer_corpus)
answer_corpus_tok[:20]

#probably not the most efficient way but it seems to have worked at least for tokenizing whole corpus.

In [None]:
answer_dict = nltk.FreqDist(answer_corpus_tok)
random.sample(list(answer_dict.items()),5) #random 5-item sample

#success!

### Create frequency dictionary for bigrams of entire corpus ###

Attempting to create frequency dictionary for all bigrams

In [None]:
#Let's try to do this from the answer_corpus_tok

answer_corpus_bigrams = list(nltk.bigrams(answer_corpus_tok))
answer_corpus_bigrams[:10]

In [None]:
#Now time for the dictionary
answer_bigram_dict = nltk.FreqDist(answer_corpus_bigrams)
random.sample(list(answer_bigram_dict.items()),5) #random 5-item sample

## After Progress-report 2

The following is everything that has been completed since Progress Report 2.  See progress_report.MD for details.

### Next goals:
Create another DF called bigram_df with bigrams, MI scores, occurrences per million score, and perhaps more to be added later. To do so:  
1) Create function for calculating MI  
2) Create function for calculating occurrences per million for unigrams and bigrams  
3) Apply the MI formula for pairs of words in the bigram list and create a column in the new DF  
4) Apply the occurrences per million for bigrams and create a column in the new DF  
5) Create a column showing percentage of time the bigrams are used by the three proficiency levels  


### Calculating Mutual Information (MI)

(from https://corpus.byu.edu/mutualInformation.asp)  

Mutual Information is calculated as follows:  
MI = log ( (AB * sizeCorpus) / (A * B * span) ) / log (2)  

Suppose we are calculating the MI for the collocate color near purple in BYU-BNC.  

A = frequency of node word (e.g. purple): 1262  
B = frequency of collocate (e.g. color): 115  
AB = frequency of collocate near the node word (e.g. color near purple): 24  
sizeCorpus= size of corpus (# words; in this case the BNC): 96,263,399  
span = span of words (e.g. 3 to left and 3 to right of node word): 6  
log (2) is literally the log10 of the number 2: .30103  

MI = 11.37 = log ( (24 * 96,263,399) / (1262 * 115 * 6) ) / .30103  
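
The worked example above can be checked with a few lines of Python. This is only an arithmetic check of the quoted figures, not the function used later in this notebook; the variable names are chosen here for illustration.

```python
import math

# Figures quoted above for "color" near "purple"
A = 1262                # frequency of node word
B = 115                 # frequency of collocate
AB = 24                 # frequency of collocate near the node word
size_corpus = 96263399  # size of corpus in words
span = 6                # 3 words to the left and 3 to the right

# log10(x) / log10(2) is the same as log base 2 of x
mi = math.log10((AB * size_corpus) / (A * B * span)) / math.log10(2)
print(round(mi, 2))
```

Dividing by log(2) simply converts the logarithm to base 2, so `math.log2` on the same ratio should give the same value.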

In [None]:
#Found something called 'Pointwise Mutual Information' - I believe it is what I am looking for.

import math
from math import log

def MI(word1, word2):
  prob_word1 = answer_dict[word1] / float(sum(answer_dict.values()))
  prob_word2 = answer_dict[word2] / float(sum(answer_dict.values()))
  prob_word1_word2 = answer_bigram_dict[word1, word2] / float(sum(answer_bigram_dict.values()))
  return math.log(prob_word1_word2/float(prob_word1*prob_word2),2)

In [None]:
#something I imagine has an average MI
answer_bigram_dict['young', 'people']
answer_dict['young']
answer_dict['people']

#Yes - 'young' collocates strongly with 'people' (about 25% of time) but 'people' doesn't collocate strongly with 'young'

In [None]:
MI('young','people')

#That is in the standard range for an MI score

In [None]:
#Time to try one that shouldn't have as high MI, e.g. 'man' with 'the'

answer_bigram_dict['the', 'man']
answer_dict['the']
answer_dict['man']

MI('the', 'man')

#With a smoothing of MI3, this would not show up on collocation lists (a good thing)

### Creating combined dataframe for easier analysis and viewing
- joins answer_df, user_df, and course_df
- removes unnecessary columns
- narrows results down to only answers from writing classes and first versions of their work

In [None]:
#join answer_df and user_df along 'user_file_id' column
combo_df = answer_df.join(user_df, on='user_file_id', lsuffix='user_file_id')

#now join this new df with course_df along 'course_id' column
combo_df = combo_df.join(course_df, on='course_id', lsuffix='user_file_id')

In [None]:
#Dropping unnecessary columns (there are a lot)
combo_df = combo_df.drop(['directoryuser_file_id', 'is_doublespaced', 'is_plagiarized', 'is_deleteduser_file_id',
                            'modifiedby', 'modifieddate', 'allow_submit_after_duedate', 'anon_id', 'file_type_id',
                            'file_info_id', 'user_file_parent_id', 'createdby', 'session_id',
                           'document_id','filename', 'content_text', 'createddate', 'allow_multiple_accesses',
                           'directoryuser_file_id', 'is_doublespaced', 'is_plagiarized', 'is_deleteduser_file_id',
                           'modifiedby', 'modifieddate', 'allow_submit_after_duedate','activity', 'order_num', 
                            'due_date', 'post_date', 'assignment_name', 'directory', 'activity', 'semester',
                            'order_num', 'due_date', 'post_date', 'assignment_name', 'allow_double_spacing',
                           'duration', 'pull_off_date', 'direction', 'grammar_qp_id', 'is_deleted',
                            'section', 'course_description'], axis = 1)

In [None]:
#keeping only 1st versions of students' work
combo_df = combo_df.loc[combo_df['version'] == 1]

#'version' column now unnecessary
combo_df = combo_df.drop(['version'], axis = 1)

In [None]:
#keeping only answers from writing classes (class_id = 2)
combo_df = combo_df.loc[combo_df['class_id'] == 2]

#'class_id' column now unnecessary
combo_df = combo_df.drop(['class_id'], axis = 1)

In [None]:
#just change the order of columns to something more logical and rename some columns
combo_df = combo_df[['question_id','user_file_id', 'anon_iduser_file_id', 'level_id', 'course_id', 'text', 'toks', 'bigrams']]
combo_df.rename(columns={'anon_iduser_file_id':'anon_id'}, inplace=True)

#finished result =  much cleaner
combo_df.head()

In [None]:
#remove level 2 (too few to be usefully analyzed)

combo_df.level_id.unique()

combo_df = combo_df.loc[combo_df['level_id'] != 2]

combo_df.level_id.unique()

### Create function for calculating occurrences per million for unigrams and bigrams  

Formula:

FN = FO(1,000,000) / C

FN = normalized frequency
FO = observed frequency
C = corpus size

In [None]:
#create new freq dicts for combo_df (unigrams and bigrams) using same 
#code as earlier versions with answer_df

combo_corpus = ' '.join(combo_df['text'])
combo_corpus_tok = nltk.word_tokenize(combo_corpus)
combo_corpus_tok = list(map(lambda x:x.lower(),combo_corpus_tok)) #making everything lowercase
combo_unigram_dict = nltk.FreqDist(combo_corpus_tok)

combo_corpus_bigrams = list(nltk.bigrams(combo_corpus_tok))
combo_bigram_dict = nltk.FreqDist(combo_corpus_bigrams)

In [None]:
#total number of unigrams
total_unigrams = len(combo_corpus_tok)

#total number of bigrams
total_bigrams = len(combo_corpus_bigrams)

total_unigrams
total_bigrams

#these differ by one, as a sequence of n tokens always yields n - 1 bigrams
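
A quick check of the unigram/bigram count relationship on a toy token list (invented for illustration). It pairs each token with its successor using `zip`, which gives the same consecutive pairs that `nltk.bigrams` yields.

```python
tokens = ['i', 'like', 'young', 'people']

# Pair each token with the one that follows it
bigrams = list(zip(tokens, tokens[1:]))

print(len(tokens), len(bigrams))
```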

In [None]:
#create function where you enter the unigram and it tells 
#you the frequency in the corpus per million tokens

def unigram_per_M(unigram):
   return (combo_unigram_dict[unigram]*1000000) / total_unigrams

In [None]:
#test manually and with defined function
combo_unigram_dict['the']

(108346*1000000)/2549012
unigram_per_M('the')

In [None]:
#create function where you enter the bigram and it tells you the frequency in the corpus per million tokens

def bigram_per_M(word1, word2):
   return (combo_bigram_dict[word1, word2]*1000000) / total_bigrams

In [None]:
#test manually and with defined function
combo_bigram_dict['the', 'man']

(92*1000000)/2549011
bigram_per_M('the', 'man')

### Create a bigram_df showing relevant info based on above formulas
- columns for this dataframe:
    - default index
    - bigrams
    - MI scores
    - occurrences per million
    - normalized percentage used at each proficiency level

In [None]:
bigram_df = pd.DataFrame.from_dict(combo_bigram_dict,orient='index')
bigram_df = bigram_df.reset_index()
bigram_df = bigram_df.rename(columns = {0:'tokens', 'index': 'bigram'})
bigram_df.head()

#first two bullet points complete - now to add more columns

In [None]:
#Changing bigram tuples to lists for easier manipulation
bigram_df['bigram'] = [list(x) for x in bigram_df['bigram']]

#### Creating MI column

In [None]:
#New MI calculator based on new dictionary

def MI(word1, word2):
  prob_word1 = combo_unigram_dict[word1] / float(sum(combo_unigram_dict.values()))
  prob_word2 = combo_unigram_dict[word2] / float(sum(combo_unigram_dict.values()))
  prob_word1_word2 = combo_bigram_dict[word1, word2] / float(sum(combo_bigram_dict.values()))
  return math.log(prob_word1_word2/float(prob_word1*prob_word2),2)

In [None]:
test = bigram_df.iloc[0][0]
MI(test[0], test[1])

#it works on one cell, so theoretically should work on all...

In [None]:
bigram_df['MI'] = [MI(x[0], x[1]) for x in bigram_df['bigram']]

#it took a few hours to run it, but it worked!
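
The long run time is probably because `MI` recomputes `sum(...)` over both whole dictionaries on every call. A minimal sketch of the same calculation with the totals computed once is shown below. It uses toy `Counter` objects in place of combo_unigram_dict and combo_bigram_dict (nltk.FreqDist is a Counter subclass, so the lookups behave the same way); `fast_MI` and the counts are hypothetical.

```python
import math
from collections import Counter

# Toy counts standing in for combo_unigram_dict / combo_bigram_dict
unigram_counts = Counter({'young': 4, 'people': 10, 'the': 50, 'man': 6})
bigram_counts = Counter({('young', 'people'): 2, ('the', 'man'): 3})

# Compute the totals once instead of on every call
unigram_total = sum(unigram_counts.values())
bigram_total = sum(bigram_counts.values())

def fast_MI(word1, word2):
    """Same formula as MI above, reusing the precomputed totals."""
    prob_word1 = unigram_counts[word1] / unigram_total
    prob_word2 = unigram_counts[word2] / unigram_total
    prob_pair = bigram_counts[word1, word2] / bigram_total
    return math.log(prob_pair / (prob_word1 * prob_word2), 2)

scores = {pair: round(fast_MI(*pair), 2) for pair in bigram_counts}
```

This assumes the dictionaries are not modified after the totals are taken; if they are, the totals would need to be recomputed.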

In [None]:
bigram_df[['MI']] = bigram_df[['MI']].apply(lambda x: pd.Series.round(x, 2))
bigram_df.head()

#### Creating per_million column

In [None]:
#testing on one cell first
bigram_per_M(test[0], test[1])

In [None]:
bigram_df['per_million'] = [bigram_per_M(x[0], x[1]) for x in bigram_df['bigram']]

In [None]:
bigram_df[['per_million']] = bigram_df[['per_million']].apply(lambda x: pd.Series.round(x, 2))
bigram_df.head()

#### Creating 'normalized toks per level' and 'relative percentage per level' columns

In [None]:
#create level dataframes
level_3 = combo_df.loc[combo_df['level_id'] == 3, :] 
level_4 = combo_df.loc[combo_df['level_id'] == 4, :] 
level_5 = combo_df.loc[combo_df['level_id'] == 5, :] 

#create frequency dictionaries for each level
level_3_corpus = ' '.join(level_3['text'])
level_3_tok = nltk.word_tokenize(level_3_corpus)
level_3_tok = list(map(lambda x:x.lower(),level_3_tok))
level_3_bigrams = list(nltk.bigrams(level_3_tok))
level_3_bigram_dict = nltk.FreqDist(level_3_bigrams)

level_4_corpus = ' '.join(level_4['text'])
level_4_tok = nltk.word_tokenize(level_4_corpus)
level_4_tok = list(map(lambda x:x.lower(),level_4_tok))
level_4_bigrams = list(nltk.bigrams(level_4_tok))
level_4_bigram_dict = nltk.FreqDist(level_4_bigrams)

level_5_corpus = ' '.join(level_5['text'])
level_5_tok = nltk.word_tokenize(level_5_corpus)
level_5_tok = list(map(lambda x:x.lower(),level_5_tok))
level_5_bigrams = list(nltk.bigrams(level_5_tok))
level_5_bigram_dict = nltk.FreqDist(level_5_bigrams)
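
The three near-identical blocks above could be folded into one helper. The sketch below is an assumption about how that might look: `level_bigram_counts` is a hypothetical name, `str.split` stands in for `nltk.word_tokenize`, and a plain `Counter` stands in for `nltk.FreqDist`, so that the example runs without NLTK data.

```python
from collections import Counter
import pandas as pd

def level_bigram_counts(df, level, tokenize=str.split):
    """Count lowercased bigrams over all texts at one level_id."""
    corpus = ' '.join(df.loc[df['level_id'] == level, 'text'])
    toks = [t.lower() for t in tokenize(corpus)]
    return Counter(zip(toks, toks[1:]))

# Toy stand-in for combo_df (values invented for illustration only)
toy_df = pd.DataFrame({'level_id': [3, 4, 3],
                       'text': ['In the park', 'In the city', 'the park is big']})

level_dicts = {lv: level_bigram_counts(toy_df, lv) for lv in (3, 4)}
```

In this notebook the call would presumably pass `tokenize=nltk.word_tokenize` and loop over levels 3, 4 and 5.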

In [None]:
#test to see what I want in each cell in the level_3 column
#I need the values from level_3_bigram_dict divided by the value from combo_bigram_dict

#for example
level_3_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] 

#or better yet as a percentage
"{0:.2f}%".format(level_3_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)

#totals for all 3 levels should add up to 100%
"{0:.2f}%".format(level_3_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)
"{0:.2f}%".format(level_4_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)
"{0:.2f}%".format(level_5_bigram_dict['in', 'the'] / combo_bigram_dict['in', 'the'] * 100)

12.17 + 40.75 + 47.07 #close enough!

In [None]:
level_3_bigram_dict['in', 'the']
level_4_bigram_dict['in', 'the']
level_5_bigram_dict['in', 'the']

1360+4553+5259

combo_bigram_dict['in', 'the']

In [None]:
#also necessary to normalize as different number of responses at each level

#weighting for each level
level_3_weighting = len(level_3.index) / len(combo_df.index)
level_4_weighting = len(level_4.index) / len(combo_df.index)
level_5_weighting = len(level_5.index) / len(combo_df.index)

level_3_weighting
level_4_weighting
level_5_weighting

level_3_weighting+level_4_weighting+level_5_weighting #should equal 1

#difference between expected and observed, i.e. expected weighting (1/3) - actual weighting (level_N_weighting)
level_3_change = (1/3) - level_3_weighting
level_4_change = (1/3) - level_4_weighting
level_5_change = (1/3) - level_5_weighting

level_3_change
level_4_change
level_5_change

round(level_3_change + level_4_change + level_5_change, 2) # should be 0

In [None]:
#example of normalizing with ['in', 'the'] bigram

#un-normalized number
level_3_bigram_dict['in', 'the']
level_4_bigram_dict['in', 'the']
level_5_bigram_dict['in', 'the']
combo_bigram_dict['in', 'the']

#normalized number
n3 = level_3_bigram_dict['in', 'the'] + (combo_bigram_dict['in', 'the'] * level_3_change)
n4 = level_4_bigram_dict['in', 'the'] + (combo_bigram_dict['in', 'the'] * level_4_change)
n5 = level_5_bigram_dict['in', 'the'] + (combo_bigram_dict['in', 'the'] * level_5_change)

n3
n4
n5

n3 + n4 + n5
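
Because the three level_N_change values sum to zero, the adjusted counts should still add up to the combined count. The sketch below illustrates this with the raw 'in the' counts summed earlier (1360, 4553, 5259) and invented weightings. The weightings are assumptions, since the real ones depend on the data, and the combined count is taken here to be the sum of the three levels.

```python
# Raw 'in the' counts per level, as summed earlier in this notebook
observed = {3: 1360, 4: 4553, 5: 5259}
total = sum(observed.values())  # assumed equal to combo_bigram_dict['in', 'the']

# Invented share of responses at each level (illustration only)
weighting = {3: 0.2, 4: 0.4, 5: 0.4}

# Same adjustment as above: shift each level by total * (1/3 - weighting)
change = {lv: (1 / 3) - w for lv, w in weighting.items()}
normalized = {lv: observed[lv] + total * change[lv] for lv in observed}

print(normalized)
print(sum(normalized.values()), total)
```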

In [None]:
#create a function for the above

def norm_toks_level3(word1, word2):
    return int((level_3_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_3_change)))

def norm_toks_level4(word1, word2):
    return int((level_4_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_4_change)))
            
def norm_toks_level5(word1, word2):
    return int((level_5_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_5_change)))

#Example time:
norm_toks_level3('in', 'the')
norm_toks_level4('in', 'the')
norm_toks_level5('in', 'the')

In [None]:
#And as a comparative percentage
def norm_percent_level3(word1, word2):
    return "{0:.2f}%".format(((level_3_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_3_change)) / combo_bigram_dict[word1, word2])*100)

def norm_percent_level4(word1, word2):
    return "{0:.2f}%".format(((level_4_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_4_change)) / combo_bigram_dict[word1, word2])*100)

def norm_percent_level5(word1, word2):
    return "{0:.2f}%".format(((level_5_bigram_dict[word1,word2] + (combo_bigram_dict[word1,word2] * level_5_change)) / combo_bigram_dict[word1, word2])*100)

#Example time:
norm_percent_level3('in', 'the')
norm_percent_level4('in', 'the')
norm_percent_level5('in', 'the')

In [None]:
#Normalized tokens applied to the whole dataframe

bigram_df['lv3_norm_toks'] = [norm_toks_level3(x[0], x[1]) for x in bigram_df['bigram']]
bigram_df['lv4_norm_toks'] = [norm_toks_level4(x[0], x[1]) for x in bigram_df['bigram']]
bigram_df['lv5_norm_toks'] = [norm_toks_level5(x[0], x[1]) for x in bigram_df['bigram']]

bigram_df.head()

In [None]:
#And now the comparative percentages

bigram_df['level_3'] = [norm_percent_level3(x[0], x[1]) for x in bigram_df['bigram']]
bigram_df['level_4'] = [norm_percent_level4(x[0], x[1]) for x in bigram_df['bigram']]
bigram_df['level_5'] = [norm_percent_level5(x[0], x[1]) for x in bigram_df['bigram']]

bigram_df.head()

#### Creating level per_million columns

In [None]:
#create per_million columns for each level

bigram_df['lv3_per_M'] = round(bigram_df['lv3_norm_toks']*1000000/total_bigrams, 2)
bigram_df['lv4_per_M'] = round(bigram_df['lv4_norm_toks']*1000000/total_bigrams, 2)
bigram_df['lv5_per_M'] = round(bigram_df['lv5_norm_toks']*1000000/total_bigrams, 2)

bigram_df.head()

#### A lot of work for a very small final dataframe, but at least it should be usable for machine analysis and future research.

### Let's see a few 'Top 20' lists

In [None]:
bigram_df.index += 1 #lists look better starting at 1

In [None]:
top_bigram_MI = bigram_df.sort_values('MI', ascending = False).reset_index(drop=True)
top_bigram_MI.index += 1
top_bigram_MI[top_bigram_MI['tokens'] >= 50].head(20) #set min number to get rid of random names and rarities

In [None]:
top_bigram_toks = bigram_df.sort_values('tokens', ascending = False).reset_index(drop=True)
top_bigram_toks.index += 1 #lists look better starting at 1
top_bigram_toks.head(20)

In [None]:
top_bigram_level3 = bigram_df.sort_values('level_3', ascending = False).reset_index(drop=True)
top_bigram_level3.index += 1
top_bigram_level3.head(20)

top_bigram_level4 = bigram_df.sort_values('level_4', ascending = False).reset_index(drop=True)
top_bigram_level4.index += 1
top_bigram_level4.head(20)

top_bigram_level5 = bigram_df.sort_values('level_5', ascending = False).reset_index(drop=True)
top_bigram_level5.index += 1
top_bigram_level5.head(20)
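
One thing to watch with the level lists: the level_3, level_4 and level_5 columns hold strings such as '12.17%', so `sort_values` orders them as text rather than as numbers. The sketch below shows the difference on an invented frame and one possible workaround, a numeric helper column (`level_3_num` is a hypothetical name).

```python
import pandas as pd

# Toy stand-in for bigram_df (values invented for illustration only)
toy_df = pd.DataFrame({'bigram': [['in', 'the'], ['of', 'a'], ['to', 'be']],
                       'level_3': ['12.17%', '9.50%', '100.00%']})

# Sorting the strings directly compares them character by character
as_text = toy_df.sort_values('level_3', ascending=False)

# Converting to floats first gives a numeric ordering
toy_df['level_3_num'] = toy_df['level_3'].str.rstrip('%').astype(float)
as_number = toy_df.sort_values('level_3_num', ascending=False)

print(list(as_text['level_3']))
print(list(as_number['level_3']))
```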

### Next goals (for final submission of code):  
<br>
_Final analysis touch ups_:
-	Deal with capitalization issues skewing data **COMPLETED AT EARLIER combo_corpus_tok STAGE**
-	Remove levels from combo_df other than 3,4,5 (easy to do but need time to re-run whole script afterwards) **COMPLETED AT EARLIER combo_df STAGE**


_Machine learning_:
- Predict level based on bigram frequency (types and tokens)
- Predict level based on MI of bigrams used 


_Visualizations_:
- Create visualizations (heat maps for predictions and bar graphs for observed stats)
- Sort bigram_df in different orders to produce tables of common bigrams
- Tidy up notebook / add descriptive detail