<a href="https://colab.research.google.com/github/Damntoochill/Learning-ML/blob/master/Spooky_Author_Identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

####Objective : 

   The  dataset contains text from works of fiction written by spooky authors of the public domain: Edgar Allan Poe, HP Lovecraft and Mary Shelley. The data was prepared by chunking larger texts into sentences using CoreNLP's MaxEnt sentence tokenizer. The objective is to accurately identify the author of the sentences in the test set.

In [0]:
#Uploading the dataset
from google.colab import files
files.upload()

In [0]:
#Importing the required dependencies
import nltk
import pandas as pd

In [4]:
#Creating a Dataframe

data = pd.read_csv('train.csv')
data.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


To find out which author wrote a sentence we need to find the frequency of words used by each author. For that, first we need to split the data by each author. Then we can find the frequency of words in each of these data groups that belong to each author.

In [0]:
#Splitting the data by author

data_groups = data.groupby('author')

We need to define a function to create dictionary of  frequencies for each words from the grouped data for each author. Luckily the nltk library has the function built in.

In [0]:
#Function to find the frequency of words

word_frequency = nltk.probability.ConditionalFreqDist()

Before we apply the function we need to combine the sentences(the text column) told by each author. Then we need to convert all the words to lower case (to eliminate the ambiguity between words with upper and lower case letters). Then we need to tokenize each word and calculate the frequency of each words. we would add the frequency of each word to a dictionary.

In [8]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
#Finding the fequency of words used by each author

for name, group in data_groups :
  sentences = group['text'].str.cat(sep = ' ') #Combines all the sentences
  sentences = sentences.lower() #Converts all sentences to lower case
  tokens = nltk.tokenize.word_tokenize(sentences) #Tokenize the combined sentence
  frequency = nltk.FreqDist(tokens) #Calculates the frequency of each token and returns a dictionary
  word_frequency[name] = (frequency) #Returns a dictionary of dictionary

Now lets check how often an author uses a word.

In [10]:
for i in word_frequency.keys():
  print("haunt : " + i)
  print(word_frequency[i].freq('haunt'))
  

haunt : EAP
4.3077638828460535e-06
haunt : HPL
4.5984675606854016e-05
haunt : MWS
1.5888147442008263e-05


Now lets use the dictionary of dictionary we generated above to determine which author wrote  a sentence. For that first we need to create a data frame to contain the dictionary we generated followed by tokenizing the sentence for which we need to find the author. Then we have to find the probability/frequency of each token ( present in the sentence ) in the dataframe grouped by the author names and add them together. The group with maximum probabiity determines the author of the given sentence.

In [0]:
def predictor(test_sentence):
  
  processed_test_sentence = nltk.tokenize.word_tokenize(test_sentence.lower())
  
  test_probabilities = pd.DataFrame(columns = ['author', 'word', 'probability'])

  for i in word_frequency.keys():
    for j in processed_test_sentence:
      token_freq = word_frequency[i].freq(j)
      smoothed_token_freq = token_freq + 0.000001
      output = pd.DataFrame([[i,j,token_freq]], columns = ['author','word','probability'])
      test_probabilities = test_probabilities.append(output, ignore_index = True)
  total_probabilities = []
  #total_probabilities = pd.DataFrame(columns = ['author', 'total_probability'])
  for i in word_frequency.keys():
    single_author = test_probabilities.query('author == "' + i + '"')
    total_probability = single_author.product(numeric_only = True)[0]
    total_probabilities.append(total_probability)
    #output = pd.DataFrame([[i, total_probability]], columns = ['author','total_probability'])
    #total_probabilities = total_probabilities.append(output,ignore_index = True)
  return total_probabilities
  #return total_probabilities.loc[total_probabilities['total_probability'].idxmax(),'author']

In [39]:
sentence = "the sky was ghastly blue"
probability = predictor(sentence)
print(probability)

[3.648284698631943e-16, 6.1946985157252966e-15, 4.538885719571753e-16]


In [29]:
test_data = pd.read_csv('test.csv')
test_data.head()

Unnamed: 0,id,text
0,id02310,"Still, as I urged our leaving Ireland with suc..."
1,id24541,"If a fire wanted fanning, it could readily be ..."
2,id00134,And when they had broken down the frail door t...
3,id27757,While I was thinking how I should possibly man...
4,id04081,I am not sure to what limit his knowledge may ...
