# Notes from Ch. 2, "Your first NLP example", Getting Started with Natural Language Processing
## by Ekaterina Kochmar (2022)
### https://livebook.manning.com/book/getting-started-with-natural-language-processing/chapter-2/v-10/

### 2.1 Introducing NLP in practice: spam filtering

### Classification
**Classification** refers to the process of identifying which category or class among the set of categories (classes) an observation belongs to based on its properties. In machine learning terms, such properties are called features and the class names are called class labels. If you classify observations into two classes, you are dealing with binary classification; tasks with more than two classes are examples of multi-class classification.

**Binary classification** organizes items into two groups (e.g., motorized vs. unmotorized vehicles). <br>
**Multi-class classification** organizes items into more than two groups: two-wheeled unmotorized vehicles, two-wheeled motorized vehicles, three-wheeled unmotorized vehicles, and so on.
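The difference between binary and multi-class classification can be sketched with hand-written rules over the vehicle example. This is only an illustration: the function names and the `"other"` fallback are assumptions, not something from the chapter.

```python
def classify_binary(motorized):
    """Binary classification: exactly two possible class labels."""
    return "motorized" if motorized else "unmotorized"


def classify_multiclass(wheels, motorized):
    """Multi-class classification: one label chosen from more than two."""
    power = "motorized" if motorized else "unmotorized"
    # Hypothetical mapping from the wheel-count feature to a readable name
    names = {2: "two-wheeled", 3: "three-wheeled", 4: "four-wheeled"}
    return f"{names.get(wheels, 'other')} {power} vehicle"


print(classify_binary(True))
print(classify_multiclass(2, False))
```

Here `wheels` and `motorized` play the role of features, and the returned strings are the class labels.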


We can have computers make classifications. For example, we can write a function that prints a warning when water is hot, based on a simple threshold of 45&deg;C (113&deg;F).

In [None]:
def print_warning(temperature):
    if temperature >= 45:
        print("Caution: Hot water!")
    else:
        print("You may use water as usual")

print_warning(46)
 

Caution: Hot water!


However, when there are multiple factors to take into account and these multiple factors may interact in various ways, a better strategy is to make the machine learn such rules and infer their correspondences from the data rather than hard-code them. 

This approach, in which we supervise the machine while it learns by providing it with labeled data, is called supervised machine learning. The core idea of machine learning is that machines can learn to solve a task if they are given a sufficient number of examples and a general outline of the task. For example, if we define the classes, labels, and features for the machine, it can then learn to assign concepts to the predefined classes based on these features.

#### SUPERVISED MACHINE LEARNING
Supervised machine learning refers to a family of machine learning tasks in which the algorithm learns the correspondences between an input and an output based on the provided labeled input-output examples. Classification is an example of a supervised machine learning task, where the algorithm tries to learn the mapping between the input data and the output class label.
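As a minimal sketch of what learning from labeled input-output examples can look like, the hot-water rule can be inferred from data instead of hard-coded. The labeled temperatures below are invented for illustration, and the search over candidate thresholds is a deliberately simple stand-in for a real learning algorithm.

```python
# Hypothetical labeled examples: (temperature in °C, class label)
labeled_data = [
    (20, "safe"), (30, "safe"), (38, "safe"), (42, "safe"),
    (46, "hot"), (50, "hot"), (60, "hot"), (75, "hot"),
]


def accuracy(threshold, data):
    """Fraction of examples labeled correctly by the rule `temp >= threshold`."""
    correct = sum(
        ("hot" if temp >= threshold else "safe") == label
        for temp, label in data
    )
    return correct / len(data)


def learn_threshold(data):
    """Try each observed temperature as a threshold and keep the most accurate."""
    candidates = sorted(temp for temp, _ in data)
    return max(candidates, key=lambda t: accuracy(t, data))


threshold = learn_threshold(labeled_data)
print(threshold, accuracy(threshold, labeled_data))
```

The temperature is the single feature, `"hot"` and `"safe"` are the class labels, and the learned threshold is the mapping from input to output.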

# Understanding the Task 

Five steps of a machine learning-based text classification project: <br>
Define Classes -> Split into Words -> Extract Features -> Train Classifier -> Test and Evaluate

1. Define the data and classes: define which data represents the “ham” class and which data represents the “spam” class for the machine learning algorithm.
2. Split the text into words.


In [None]:
text = "Define which data represents each class for the machine learning algorithm"
text.split(" ")
# type(text)


['Define',
 'which',
 'data',
 'represents',
 'each',
 'class',
 'for',
 'the',
 'machine',
 'learning',
 'algorithm']

So far, so good. However, what happens to this strategy when we have punctuation marks?
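To see the problem concretely, here is the same whitespace split applied to a sentence that contains quotation marks and a final period. The variable name `tokens` is only illustrative.

```python
text = 'Define which data represents "ham" class and which data represents "spam" class for the machine learning algorithm.'

# Splitting on single spaces leaves punctuation glued to the neighboring word
tokens = text.split(" ")

print(tokens[4])   # the quotation marks stay attached to ham
print(tokens[-1])  # the period stays attached to algorithm
```

Because `"ham"` and `algorithm.` are kept as single strings, a later step would treat `algorithm.` and `algorithm` as two different words.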

In [None]:

text = 'Define which data represents "ham" class and which data represents "spam" class for the machine learning algorithm.'
delimiters = ['/', "."]  # quotation marks are not listed here, so they stay attached to words
words = []
current_word = ""
 
for char in text:
    if char==" ":
        if not current_word=="":
            words.append(current_word)
            current_word = ""
    elif char in delimiters:
        if current_word=="":
            words.append(char)
        else:
            words.append(current_word)
            words.append(char)
            current_word = ""
    else:
        current_word += char
 
print(words) 


['Define', 'which', 'data', 'represents', '"ham"', 'class', 'and', 'which', 'data', 'represents', '"spam"', 'class', 'for', 'the', 'machine', 'learning', 'algorithm', '.']


### Tokenizer

While the above is a helpful exercise, it is more pragmatic to use a tokenizer from an NLP toolkit such as NLTK. These tokenizers are highly optimized for the task: they not only split on whitespace and punctuation marks, but also keep track of the cases that should not be split this way. This helps ensure that the tokenization step results in a list of appropriate English words.


**Tokenization** is the process of identifying or extracting word tokens from running text. It is often the first step in text preprocessing. Whitespace and punctuation marks often serve as reliable word separators; however, simple approaches are likely to run into exceptions like “U.S.A.” and similar. Tokenizers are NLP tools highly optimized for the task of word tokenization, and they may rely on carefully crafted regular expressions or be trained using machine learning algorithms.
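The regular-expression approach can be sketched with Python's standard `re` module. The pattern below is a hand-written assumption for illustration only, not the pattern used by any particular toolkit, and it covers just a few cases (dotted abbreviations, words with an internal apostrophe, and single punctuation marks).

```python
import re

# Alternatives are tried left to right at each position:
#   1. dotted abbreviations such as U.S.A. (two or more letter-plus-period pairs)
#   2. runs of word characters, optionally with one internal apostrophe (don't)
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Return the tokens that TOKEN_PATTERN finds in `text`, in order."""
    return TOKEN_PATTERN.findall(text)


print(tokenize('The U.S.A. imports "ham".'))
# ['The', 'U.S.A.', 'imports', '"', 'ham', '"', '.']
```

Even this small pattern has blind spots: when an abbreviation ends a sentence, its last period is also the sentence-final period, and the pattern cannot tell the two roles apart. Cases like this are why a dedicated tokenizer is usually the better choice.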

In [None]:
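import os
import codecs

def read_in(folder):
  # Read every non-hidden file in `folder` and return the contents as a list of strings
  files = os.listdir(folder)
  a_list = []
  for a_file in files:
    if not a_file.startswith("."):
      # `folder` must end with a path separator, e.g. "enron1/spam/"
      f = codecs.open(folder + a_file,
                      "r", encoding = "ISO-8859-1", errors="ignore")
      a_list.append(f.read())
      f.close()
  return a_list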



In [None]:
# lists are really large - will update with Anaconda/local instructions (taking too long in Drive/Colaboratory)

spam_list = read_in("enron1/spam/") 
ham_list = read_in("enron1/ham/")
print(len(spam_list)) 
print(len(ham_list))
print(spam_list[0])
print(ham_list[0])
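Since loading the full dataset is slow, one option is to check the file-reading logic on a small temporary folder first. The sketch below makes several assumptions: the file names and contents are invented, `os.path.join` and a `with` block replace manual path concatenation and `close()`, and the file list is sorted so that the result order is predictable.

```python
import os
import tempfile


def read_in(folder):
    """Return the text of every non-hidden file in `folder`, sorted by file name."""
    texts = []
    for name in sorted(os.listdir(folder)):
        if name.startswith("."):
            continue  # skip hidden files such as .DS_Store
        path = os.path.join(folder, name)  # works with or without a trailing slash
        with open(path, "r", encoding="ISO-8859-1", errors="ignore") as f:
            texts.append(f.read())
    return texts


# Hypothetical miniature "dataset" written to a throwaway directory
with tempfile.TemporaryDirectory() as folder:
    files = [("a.txt", "Subject: hello"),
             ("b.txt", "Subject: win a prize"),
             (".hidden", "skip me")]
    for name, body in files:
        with open(os.path.join(folder, name), "w", encoding="ISO-8859-1") as f:
            f.write(body)
    sample = read_in(folder)

print(len(sample))
print(sample[0])
```

If the function returns two texts here and skips the hidden file, the same logic can then be pointed at the real `enron1/spam/` and `enron1/ham/` folders.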

