# ASSIGMENT 1
## English doc classifier

This software classifies documents in English or Not English.

Note: the doc should have the same size or bigger then the learning docs.

##  STRUCTURE

-   *Data Fetching*
-   *Pipeline*
-   *Feature Extraction*
-   *Traning the Model*
-   *Results & Conclusion*



In [14]:
#IMPORTS 
import collections
import math
import random
import nltk 
import numpy as np
from tqdm import tqdm
from nltk.corpus import europarl_raw
from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer 
from nltk.stem import WordNetLemmatizer
from nltk.metrics import ConfusionMatrix
from nltk.metrics.scores import precision, recall, f_measure
from string import punctuation

In [15]:
#DOWNLOAD DOCUMENTS 
nltk.download('punkt')
nltk.download("europarl_raw")
nltk.download("udhr")
nltk.download("gutenberg")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sonia\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package europarl_raw to
[nltk_data]     C:\Users\sonia\AppData\Roaming\nltk_data...
[nltk_data]   Package europarl_raw is already up-to-date!
[nltk_data] Downloading package udhr to
[nltk_data]     C:\Users\sonia\AppData\Roaming\nltk_data...
[nltk_data]   Package udhr is already up-to-date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\sonia\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [16]:
#Object creation
st = PorterStemmer() 
#st = LancasterStemmer()
wnl = WordNetLemmatizer()

## Pipeline function
The main purpose of this function is parse the documents, in order to remove useless part of words (with the pipeline), remove most frequent words (stopwords) and select the usefull words for the features.

Pipeline that process the data following this technique:
- Tokenization: Divide docs in single wards to be processed
- Stopwords eliminations: elimination of the first n (*stopwards* parameter) most cummon wards
- Stemming: process of reducing a word to its word stem that affixes to suffixes and prefixes or to the roots of words 
- Lemmatization: process of reducing a word to its lemma

This fun returns also a list of docs processed and labelled 


In [17]:
def pipeline(data, stopwords = 10, limit =2000):
    """
        Given a list of docs returns them processed by the pipeline and labelled
        IN:\n
        text:   list of documents\n
        stopwords:  number of words to ignore\n
        limit: numer of features we keep
        OUT:\n
        topWords:    list of topwords ordered by most frequent
        DataProcessed:  documents with label 
    """
    parole = 0
    dataProcessed = [0 for _ in range(len(data))]
    fdist = FreqDist()
    for i, (doc, l)in enumerate(tqdm(data)):
        temp = ([], l)
        
        #TOKENIZATION 
        #tokenization doc into words
        words = word_tokenize(doc)          
        for word in words:
            if word not in punctuation and not word.isdecimal():
                parole +=1
                #Stemming 
                stemmed= st.stem(word)
                #lemmatization
                lemmatized= wnl.lemmatize(stemmed) 
                #counting words elaborated    
                fdist[lemmatized] += 1
                temp[0].append(lemmatized) 

        dataProcessed[i]= temp
        
    print("Words cardinality: ",parole, "FDist Cardinality: ", len(fdist))
    return list(fdist)[stopwords:limit], dataProcessed



## features_estractor1 fun

Create the features set readable for the NaiveBayes Classifier from a document and top_Words.

In particular a dictionary where for each top_Words (tW extract before) there is a boolean value if it is in that document (d) or not.

In [18]:
def features_estractor1(d, tW):
    """
        Return a dictionary with all words in the tW and a presence value in the document d\n
        IN:\n
        d:  Document to look if the word is in it\n 
        tW: set of Words to check (top_Words)\n
        OUT:\n
        dict:   dictionary with {word: presenceValue(bool)}
    """
    ds = set(d)
    features = {}
    for w in tW:
        features[f'contains({w})'] = (w in ds)
    return features

## Main
### Data fetching
- Var setup (20 docs for English, 30 Non English)
- texts load    (English, French, Danish, Finnish)
- labelling tests 
- Data shuffle

In [19]:
fids = 10
nLen = 3
h_ids = 10  #math.floor(fids/(nLen-1))

data = []
labels = []
tests = []

#europarl docs
en = europarl_raw.english.fileids()[:fids]
fr = europarl_raw.french.fileids()[:h_ids] 
dan = europarl_raw.danish.fileids()[:h_ids]
fin = europarl_raw.finnish.fileids()[:h_ids]

#gutenberg eng docs 
gutENGberg_ids = gutenberg.fileids()[:fids]



#list of tuples with docs and label
#E english N_E not english
for ids in gutENGberg_ids:
    data.append((gutenberg.raw(ids), "E"))
for i in range(fids):
    data.append((europarl_raw.english.raw(en[i]), "E"))
for i in range(h_ids):
    data.append((europarl_raw.french.raw(fr[i]), "N_E"))
    data.append((europarl_raw.danish.raw(dan[i]), "N_E"))
    data.append((europarl_raw.finnish.raw(fin[i]), "N_E"))


#data shuffle
random.shuffle(data)

### Pipeline

Processing the documents

In [20]:
#document process 
#removing the first 10000 words as stopwords and taking the next 5000 words as features 
topWords, dataProcessed = pipeline(data, stopwords=10000, limit = 20000)

100%|██████████| 50/50 [02:31<00:00,  3.03s/it]


Words cardinality:  3403340 FDist Cardinality:  107899


### Feature Extraction
#### Creation of:
-   Features set
-   Training set    .7 Features Set
-   Testing set     .3 Features Set

In [21]:
#features creation
featuresets = [(features_estractor1(d,topWords),l) for (d,l) in tqdm(dataProcessed)]
sep = math.floor(len(featuresets) * 0.7 )
train_set, test_set = featuresets[:sep], featuresets[sep:]

100%|██████████| 50/50 [00:00<00:00, 70.36it/s]


## Traning the Model

In [22]:
#classifier
classifier = nltk.NaiveBayesClassifier.train(train_set) 

## Testing

In [23]:
refsets =  collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i,(feats,label) in enumerate(test_set):
    refsets[label].add(i)
    result = classifier.classify(feats)
    testsets[result].add(i)
    labels.append(label)
    tests.append(result)
cm = ConfusionMatrix(labels, tests)

## Dataset & Metrics

- Docs Labels 
- Confusion Matrix (*N_E* Not English, *E* English)

In [24]:
print(f"English docs: {fids+len(gutENGberg_ids)}\nNot English docs: {len(data)-(fids+len(gutENGberg_ids))}")
print("Avarange words for document: ", round(np.mean([len(d[0]) for d in dataProcessed],0)))
print("\nConfusion Matrix:")
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=5))

English docs: 20
Not English docs: 30
Avarange words for document:  68067

Confusion Matrix:
    |      N        |
    |      _        |
    |      E      E |
----+---------------+
N_E | <66.7%>     . |
  E |      . <33.3%>|
----+---------------+
(row = reference; col = test)



- Accuracy
- Precision
- Recall
- F1 Score

In [25]:
print("Accuracy:",round(nltk.classify.accuracy(classifier, test_set),3)) 
print( 'Precision:', round(precision(refsets['E'], testsets['E']),3) )
print( 'Recall:', round(recall(refsets['E'], testsets['E']),3) )
print("F-Score:", round(f_measure(refsets['E'], testsets['E']),3))

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F-Score: 1.0


- Morst informative features:

In [26]:
classifier.show_most_informative_features(20)

Most Informative Features
      contains(diplomat) = True                E : N_E    =      6.6 : 1.0
          contains(sale) = True                E : N_E    =      6.6 : 1.0
         contains(timid) = True                E : N_E    =      6.6 : 1.0
        contains(absorb) = True                E : N_E    =      5.7 : 1.0
      contains(abstract) = True                E : N_E    =      5.7 : 1.0
       contains(appetit) = True                E : N_E    =      5.7 : 1.0
         contains(blank) = True                E : N_E    =      5.7 : 1.0
      contains(fluctuat) = True                E : N_E    =      5.7 : 1.0
       contains(illumin) = True                E : N_E    =      5.7 : 1.0
        contains(obstin) = True                E : N_E    =      5.7 : 1.0
       contains(permiss) = True                E : N_E    =      5.7 : 1.0
      contains(platform) = True                E : N_E    =      5.7 : 1.0
       contains(redress) = True                E : N_E    =      5.7 : 1.0

##  Results and Conclusions

This classifier classifies documents as *English* or *Not English*, this implies that the document needs to be enoght big to be classified.

"It works good because the task is easy".

Talking about metrics we used a confusion matrix to evaluate the model and we can see that all of them are high this is due to the big difference between the languages used in the big corpus used and the small dataset (50 docs) (overfit).


 

### QUESTIONS

Discuss:
1.  size of the corpus, size of the split training and test sets
2.  performance indicators employed and their nature
3.  employability of the classifier as a Probabilistic Language Model.

1.  In this case the corpus size consist in 50 docs, 20 english and 30 not english (over 65000 words on avarage), that's leads to have high model performances because every doc is big. Then we are working with small docs set thats causes overfitting problems.</br>
Talking about training and test sets we divided the feature sets .7 for train and .3 for test, this division is biased towards training in order cover the overfitting problem. 
2.  The metrics used are:
    -   **Precision** - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations
    -   **Recall** (Sensitivity) - Recall is the ratio of correctly predicted positive observations to the all observations in actual class *E*
    -   **F1 score** - F1 Score is the weighted average of Precision and Recall, it is more robust then *accuracy*
3.  This classifier can classifies "large" documents like long speeches and documents larger then 50000 words.</br>
Maybe the topic could be a problem but with over 60k words and 10k features it should cover a good part of the topics.