# Sprint 1

In the [first sprint](https://trello.com/b/RtxQyUB4/sprint-1-17-nov) we work on the following tasks:

1. [Write the email file reading method](https://trello.com/c/L3ISKIgf/4-email-file-reading-methode-schrijven)
2. [Write the cleaning method](https://trello.com/c/QCD8JKLK/2-cleaning-methode-schrijven) + [Write the tokenize method](https://trello.com/c/UrXLnJBi/3-tokenize-methode-schrijven)
3. [Design the fuzzy logic model (inputs, MFs, outputs, rules, etc.) and read the inspirational paper](http://ijcsi.org/papers/IJCSI-10-3-2-48-58.pdf)

### 1. Write the email file reading method

For this project, resource files are stored in the folder *res*.



In [1]:
# Global variable containing the body of the .txt file
body = ''

# Open the example email file for reading, using a context manager
# so that the file is closed automatically
with open('res/email.txt', 'r', encoding='utf-8') as f:
    size_to_read = 270  # number of characters to read
    body = f.read(size_to_read)

print(body)

﻿Dear Mr. Lee,

I write on behalf of my CEO, Ms. Douglas, of the Dutch cell phone company ‘‘cell.com’’.
You’ve met Ms. Douglas at the Cebit Trade Fair in Hannover. She was highly satisfied with your inspirational inventions, especially the auto rechargeable cell phone. 
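
The reading step above could be wrapped in a reusable method, as the sprint task asks. A minimal sketch, assuming a hypothetical helper name `read_email` and a hypothetical `max_chars` parameter (neither is part of the project yet). It also assumes the file is UTF-8 encoded and uses the `utf-8-sig` codec, which drops a leading byte-order mark while reading:

```python
def read_email(path, max_chars=None):
    """Return the text of the email file at path.

    Hypothetical helper: reads at most max_chars characters,
    or the whole file when max_chars is None.
    """
    # 'utf-8-sig' decodes UTF-8 and skips a leading byte-order mark (BOM)
    with open(path, 'r', encoding='utf-8-sig') as f:
        if max_chars is None:
            return f.read()
        return f.read(max_chars)
```

Called as `read_email('res/email.txt', 270)`, this should return the same text as the cell above, minus the invisible byte-order mark at the start.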


### 2. Write the cleaning + tokenize methods

#### 2.1 Splitting words based on whitespace or regex pattern

Splitting a piece of text with the regex \W+ already cleans the words to some extent. However, words such as "you've" are split into "you" and "ve", which is not what we want.

Splitting a piece of text on whitespace keeps all words intact, but leaves all punctuation in place.

In [2]:
import re

# split on runs of non-word characters (regex \W+)
words = re.split(r'\W+', body)

print('REGEX:\n', words[:50])
print()

# split into words on whitespace
words = body.split()
# remove the byte-order mark (BOM) at the start of the file from the first word
words[0] = words[0].replace('\ufeff', '')

print('WHITESPACE:\n', words[:50])

REGEX:
 ['', 'Dear', 'Mr', 'Lee', 'I', 'write', 'on', 'behalf', 'of', 'my', 'CEO', 'Ms', 'Douglas', 'of', 'the', 'Dutch', 'cell', 'phone', 'company', 'cell', 'com', 'You', 've', 'met', 'Ms', 'Douglas', 'at', 'the', 'Cebit', 'Trade', 'Fair', 'in', 'Hannover', 'She', 'was', 'highly', 'satisfied', 'with', 'your', 'inspirational', 'inventions', 'especially', 'the', 'auto', 'rechargeable', 'cell', 'phone', '']

WHITESPACE:
 ['Dear', 'Mr.', 'Lee,', 'I', 'write', 'on', 'behalf', 'of', 'my', 'CEO,', 'Ms.', 'Douglas,', 'of', 'the', 'Dutch', 'cell', 'phone', 'company', '‘‘cell.com’’.', 'You’ve', 'met', 'Ms.', 'Douglas', 'at', 'the', 'Cebit', 'Trade', 'Fair', 'in', 'Hannover.', 'She', 'was', 'highly', 'satisfied', 'with', 'your', 'inspirational', 'inventions,', 'especially', 'the', 'auto', 'rechargeable', 'cell', 'phone.']
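
A possible middle ground between the two splits is a regex that keeps apostrophes inside a word. A minimal sketch, assuming hypothetical names `WORD_PATTERN` and `split_words`, and assuming that both the ASCII apostrophe and the typographic ’ should count as part of a word:

```python
import re

# One or more word characters, optionally followed by
# (apostrophe + word characters) groups, so contractions stay whole
WORD_PATTERN = re.compile(r"\w+(?:['’]\w+)*")

def split_words(text):
    """Return the words in text, keeping contractions such as "you've" intact."""
    return WORD_PATTERN.findall(text)

sample = "You’ve met Ms. Douglas, of ‘‘cell.com’’."
print(split_words(sample))
```

With this pattern the punctuation is dropped and "You’ve" stays in one piece, but "cell.com" is still split into "cell" and "com".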


#### 2.2 Removing punctuation

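With the 'string' module we can remove the punctuation without losing "you've". Note that string.punctuation only contains ASCII punctuation, so the typographic quotes in ‘‘cell.com’’ and You’ve are left in place; an ASCII apostrophe would have been removed as well.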

In [3]:
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
words = [w.translate(table) for w in words]

print(words[:50])

['Dear', 'Mr', 'Lee', 'I', 'write', 'on', 'behalf', 'of', 'my', 'CEO', 'Ms', 'Douglas', 'of', 'the', 'Dutch', 'cell', 'phone', 'company', '‘‘cellcom’’', 'You’ve', 'met', 'Ms', 'Douglas', 'at', 'the', 'Cebit', 'Trade', 'Fair', 'in', 'Hannover', 'She', 'was', 'highly', 'satisfied', 'with', 'your', 'inspirational', 'inventions', 'especially', 'the', 'auto', 'rechargeable', 'cell', 'phone']


#### 2.3 Normalizing case

Some words contain capital letters and others do not, so we convert every word to lower case.

In [4]:
# convert to lower case
words = [word.lower() for word in words]
print(words[:50])

['dear', 'mr', 'lee', 'i', 'write', 'on', 'behalf', 'of', 'my', 'ceo', 'ms', 'douglas', 'of', 'the', 'dutch', 'cell', 'phone', 'company', '‘‘cellcom’’', 'you’ve', 'met', 'ms', 'douglas', 'at', 'the', 'cebit', 'trade', 'fair', 'in', 'hannover', 'she', 'was', 'highly', 'satisfied', 'with', 'your', 'inspirational', 'inventions', 'especially', 'the', 'auto', 'rechargeable', 'cell', 'phone']


#### 2.4 Filtering stop words & stemming

An alternative way to clean text is to use a package such as NLTK.

To install NLTK, use pip or pip3: *'sudo pip3 install -U nltk'*

In [5]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

# Tokenize; strip the leading byte-order mark (BOM) first, otherwise the
# first token ('\ufeffdear') fails isalpha() below and is silently dropped
tokens = word_tokenize(body.lstrip('\ufeff'))
# print('TOKENS:\n', tokens[:100])

# Convert to lower case
tokens = [w.lower() for w in tokens]

# Remove ASCII punctuation from each word
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# Keep only tokens that are purely alphabetic
words = [word for word in stripped if word.isalpha()]

# Filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]

# Stem each word
porter = PorterStemmer()
words = [porter.stem(word) for word in words]

print(words)

[nltk_data] Downloading package punkt to /home/stefan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/stefan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['dear', 'mr', 'lee', 'write', 'behalf', 'ceo', 'ms', 'dougla', 'dutch', 'cell', 'phone', 'compani', 'cellcom', 'met', 'ms', 'dougla', 'cebit', 'trade', 'fair', 'hannov', 'highli', 'satisfi', 'inspir', 'invent', 'especi', 'auto', 'recharg', 'cell', 'phone']
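
The cleaning steps from 2.1 to 2.3, plus a stop-word filter, could also be bundled into a single method that does not depend on NLTK. A minimal sketch, assuming hypothetical names (`clean_tokens`, `EDGE_CHARS`, `STOP_WORDS`) and a tiny hand-picked stop-word set that only stands in for NLTK's list; stemming is left out:

```python
import string

# Illustrative stop words only; NLTK's English list is much longer
STOP_WORDS = {"i", "on", "of", "my", "the", "at", "in", "she", "was", "with", "your"}

# Characters to strip from the edges of a word: ASCII punctuation,
# typographic quotes, and the byte-order mark
EDGE_CHARS = string.punctuation + "‘’“”\ufeff"

def clean_tokens(text, stop_words=STOP_WORDS):
    """Split on whitespace, strip edge punctuation, lower-case, drop stop words."""
    tokens = []
    for raw in text.split():
        word = raw.strip(EDGE_CHARS).lower()
        # Skip tokens that were punctuation only, and stop words
        if word and word not in stop_words:
            tokens.append(word)
    return tokens
```

Because punctuation is only stripped from the edges of each word, contractions such as "you’ve" and names such as "cell.com" stay in one piece.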


### 3. Design the fuzzy logic model (inputs, MFs, outputs, rules, etc.) and read the inspirational paper

Paper: [http://ijcsi.org/papers/IJCSI-10-3-2-48-58.pdf](http://ijcsi.org/papers/IJCSI-10-3-2-48-58.pdf)

Section 3.1 of the paper, Fuzzy Classification Module, explains how spam words are rated and fed into an FLS (fuzzy logic system). As in the design of our project, words are cleaned, tokenized, and then rated.

For each rated word, a vector of features can be built (for example: "overlap_dept1", "sentiment", "technische_term", etc.), so that all words can be passed on as a vector of feature vectors:

$\mathit{words} = [\overrightarrow{F_1}, \overrightarrow{F_2}, \ldots, \overrightarrow{F_n}]$
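
As an illustration, the vector of feature vectors could be represented as a list of lists. A minimal sketch, assuming the three example feature names from the text, a hypothetical helper `feature_vector`, and made-up scores between 0 and 1 (no real ratings exist yet):

```python
# Feature names taken from the examples in the text
FEATURE_NAMES = ["overlap_dept1", "sentiment", "technische_term"]

# Hypothetical, hand-written scores per word; real values would come
# from the rating step, which is not designed yet
SCORES = {
    "phone": {"overlap_dept1": 0.8, "sentiment": 0.5, "technische_term": 1.0},
    "satisfi": {"overlap_dept1": 0.1, "sentiment": 0.9, "technische_term": 0.0},
}

def feature_vector(word, scores=SCORES):
    """Return the feature vector F for one word; unknown words score 0.0."""
    word_scores = scores.get(word, {})
    return [word_scores.get(name, 0.0) for name in FEATURE_NAMES]

# words = [F_1, F_2, ..., F_n]
tokens = ["phone", "satisfi", "lee"]
words = [feature_vector(t) for t in tokens]
print(words)
```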

Each feature can be passed to the FLS as an input, so that a "ranking" is computed for every word:

$A = \{ (f, \mu_A(f)) \mid f \in F_1 \}$
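
To make the notation concrete, the fuzzy set can be written out as pairs of a feature value and its membership degree. A minimal sketch, assuming a triangular membership function (the shape and its parameters are our own assumption, not taken from the paper) and made-up feature values for $F_1$:

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet a and c and peak b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)  # rising edge
    return (c - x) / (c - b)      # falling edge

# Hypothetical feature values for the first word (F_1)
F_1 = [0.2, 0.5, 0.9]

# Fuzzy set A as (f, mu_A(f)) pairs, for an assumed "medium" set peaking at 0.5
A = [(f, triangular(f, 0.0, 0.5, 1.0)) for f in F_1]
print(A)
```

A value at the peak (0.5) gets membership 1.0, and the membership falls off linearly towards 0.0 and 1.0.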