# Assignment 1, from scratch.

## Table of Contents (clickable):
1. [A long, easy to follow version with comments walking you through the code](#Long-version-with-comments)
2. [A long, easy to follow version without comments](#Long-version-without-comments)
3. [A short, harder-to-follow version for people comfortable with programming who want to see some of what Python is capable of](#Concise-version-of-the-code)

---

### Long version with comments
#### Creating a language classifier

Let's develop a plan:
- First, import necessary libraries
- Second, load the data into the correct format
    - Separating the data into "training" and "target" lists
    - Changing all data into some numerical format interpretable by a machine
- Third, instantiate some classifier and let the classifier "learn" the data.

##### Importing libraries
For this task, the only import we actually need is ```MLPClassifier``` from ```sklearn```. If you need to load many data files, you may also want to import ```glob```.

In [1]:
# Import a multilayer perceptron from scikit-learn
from sklearn.neural_network import MLPClassifier
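
The `glob` suggestion above could look roughly like the sketch below. It assumes the word lists sit together in one folder as `.txt` files; the folder name `data` and the helper `find_language_files` are hypothetical and are not used anywhere else in this assignment.

```python
import glob
import os

def find_language_files(folder, pattern="*.txt"):
    """Return the sorted paths of every file in `folder` that matches `pattern`."""
    return sorted(glob.glob(os.path.join(folder, pattern)))

# Hypothetical usage: a "data" folder holding one word list per language.
# If the folder does not exist, glob simply returns an empty list.
files = find_language_files("data")
print(files)
```

Sorting the result keeps the file order, and therefore any numbering derived from it, the same from run to run.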

##### Loading the data
There are several things we need to take into account:
- All of the data needs to be converted into a numerical format, so each word must be translated into numbers.
- Most machine-learning algorithms and neural networks (unless stated otherwise, or unless you build them yourself) need fixed-size samples. For this language classifier, that means all of the words must be the same length (all four-letter words, or all six-letter words, etc.).
- We will need to separate the data into "training" and "target" samples. Training samples contain the data/features that you would like to predict on, and target samples contain the answer for each training sample. For example, "thanks" is a training sample with target "English," and "Kamusta" is a training sample with target "Filipino."
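
As a small illustration of the first two points, the sketch below turns a word into numbers with `ord()` and back again with `chr()`. The helper names `encode_word` and `decode_word` and the constant `WORD_LENGTH` are made up for this example; the notebook code itself does the same work inline.

```python
WORD_LENGTH = 6  # the fixed sample size this example assumes

def encode_word(word):
    """Turn a word into a list of character codes, one number per letter."""
    return [ord(letter) for letter in word.lower()]

def decode_word(codes):
    """Turn a list of character codes back into a word."""
    return "".join(chr(code) for code in codes)

sample = encode_word("thanks")
print(sample)                # [116, 104, 97, 110, 107, 115]
print(decode_word(sample))   # thanks

# "Kamusta" has seven letters, so it would not fit a six-number sample.
print(len(encode_word("Kamusta")) == WORD_LENGTH)  # False
```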

In [2]:
# Specifying which file names to load
files = ["english.txt", "german.txt"]

# Creating both the training and target lists.
training = []
target = []

# Opening each filename within the list "files" in a for loop. This for loop runs twice: 
# once for english.txt and another for german.txt.
for file in files:
    
    # Opening the file using the function open(). The "with" block closes the file
    # again once we have finished reading it.
    with open(file, 'r', encoding='latin-1') as opened_file:

        # Reading the lines within the file using the function readlines()
        file_lines = opened_file.readlines()
    
    # We are trying to determine whether or not the language is English 
    # so we can append the 0 to the target array (0 being our numerical representation of English)
    # There are better, more automated ways of doing this.
    if file == "english.txt":
        
        # To fix the fixed-size sample problem, we will only take six-letter words from the dataset
        # We are going to make a counter that checks how many six-letter words we encounter
        six_letter_words = 0
        
        # We are going through each and every line within the file. As we know, each line in the
        # file actually corresponds to a word. Therefore, this can be seen as looking through 
        # each and every word within the file.
        for line in file_lines:
            
            # If we look at each "word," we see that it ends with a "\n" character, which
            # is the new-line character. We need to remove it, as we only want the real letters,
            # not the characters that mark a line break for outputs or text editors.
            # We use the replace() method to replace "\n" with "", the empty string,
            # effectively deleting it.
            line = line.replace("\n", "")
            
            # We can further normalize the words by making them all lower-case
            line = line.lower()
            
            # Now, let's check if the line/word is of length 6.
            if len(line) == 6:
                
                # If it is of length six, make a list where each element is the number
                # representation of a letter. The ord() function takes a character and
                # returns its numerical representation (its Unicode code point). chr() is the
                # opposite: it takes a number and translates it back to a character.
                orded_line = []
                for letter in line:
                    orded_line.append(ord(letter))
                    
                # Now, let's add that number representation of the word to our training array
                training.append(orded_line)
                
                # Then, let's add 1 to the variable six_letter_words in order to keep track of how
                # many answers we need to make.
                six_letter_words += 1
        
        # We need to make the answers for each of our six-letter words now. Using the
        # six_letter_words variable, we know how many 0s we need: one for each of the
        # training samples. Remember that the number 0 is how we're telling the machine that
        # the training sample is English.
        class_target = []
        for _ in range(six_letter_words):
            class_target.append(0)
        
        # Last, let's concatenate the list of 0s created above to our main target (answer) list.
        # We need to do this because the target list also stores the labels of the other
        # language, such as 1 for German.
        target += class_target
    else:
        # Notice that this is part of the if-statement where the file != english.txt. Therefore,
        # we are accessing the German file.
        # Everything here is functionally the same as the above, but we are instead appending
        # 1 in our target array as we want these to represent German.
        six_letter_words = 0
        for line in file_lines:
            line = line.replace("\n", "")
            line = line.lower()
            if len(line) == 6:
                orded_line = []
                for letter in line:
                    orded_line.append(ord(letter))
                training.append(orded_line)
                six_letter_words += 1
                
        class_target = []
        for _ in range(six_letter_words):
            class_target.append(1)
            
        target += class_target
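
The comments above mention that there are more automated ways of assigning the labels. One possible sketch, not the approach used in this assignment, keeps the file-name-to-label mapping in a dictionary and moves the shared loop into a helper, so the English and German branches no longer need to be written twice. The names `LABELS`, `encode_lines`, `fake_files`, `demo_training` and `demo_target` are hypothetical, and small in-memory lists stand in for the real files.

```python
# Hypothetical mapping from file name to numeric label (0 = English, 1 = German).
LABELS = {"english.txt": 0, "german.txt": 1}

def encode_lines(lines, label, word_length=6):
    """Return (samples, targets) for every word of exactly `word_length` letters."""
    samples = []
    targets = []
    for line in lines:
        word = line.replace("\n", "").lower()
        if len(word) == word_length:
            samples.append([ord(letter) for letter in word])
            targets.append(label)
    return samples, targets

# In-memory stand-ins for file_lines, so the sketch runs without the data files.
fake_files = {
    "english.txt": ["Thanks\n", "hello\n", "planet\n"],
    "german.txt": ["Flüche\n", "danke\n"],
}

demo_training = []
demo_target = []
for name, label in LABELS.items():
    samples, targets = encode_lines(fake_files[name], label)
    demo_training += samples
    demo_target += targets

print(demo_target)  # [0, 0, 1]
```

Adding a third language would then only need one more entry in the dictionary.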

#### Training a pre-defined neural network on our data

Now that we've created our data, let's go ahead and train a pre-defined multilayer perceptron and use it for predictions!

In [3]:
# We'll need to instantiate our neural network. We can do this pretty easily by setting a variable
# to MLPClassifier().
mlp_nn = MLPClassifier()

# To train, we use the fit() function. We give it the training and target lists that we have made
# earlier.
mlp_nn.fit(training, target)

# Once training has finished, you can go ahead and make some predictions using predict().
# The same rules apply as before: the words need to be translated into their numerical
# representation, and they need to be six letters long, as above.

# Here are two words that don't really exist, but are 1-letter modifications to existing
# words
english_word = "hellow"
german_word = "flüche"

orded_eng = []
for letter in english_word:
    orded_eng.append(ord(letter))
    
orded_ger = []
for letter in german_word:
    orded_ger.append(ord(letter))
    
# In general, prediction samples go inside one extra outer list: n-dimensional samples are passed
# as an (n+1)-dimensional list. MLPClassifier works on one-dimensional samples, so we can only
# train on and predict from two-dimensional lists.
# An easy way to think about list dimensionality: a one-dimensional list does not contain a list
# inside of a list. Two-dimensional is a list in a list. Three-dimensional is a list in a list in a
# list, etc. Our word-to-number representation is a one-dimensional list: [0, 0, 0, 0, 0, 0]. Our
# training data is a two-dimensional list, as it is a list of those representations (the target
# list stays one-dimensional: one label per word).
# [[0, 0, 0, 0, 0, 0], where each of these lines is another word within our training list.
#  [0, 0, 0, 0, 0, 0],
#  [0, 0, 0, 0, 0, 0]]

# If you want to only predict one sample:
mlp_nn.predict([[0, 0, 0, 0, 0, 0]])
mlp_nn.predict([orded_eng]) # orded_eng is already a list, as seen above; wrapping it in another pair
                            # of brackets puts it inside a second list, making it two-dimensional.

# If you want to predict more than one, just add more lists to the outermost list. Here we show two samples
mlp_nn.predict([[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]])
mlp_nn.predict([orded_eng, orded_ger])

array([0, 1])
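
The two ideas from the comments above, wrapping a single sample in an outer list and reading the 0/1 output as a language, can be sketched without the classifier itself. The helpers `as_batch` and `name_predictions` and the `LANGUAGES` dictionary are hypothetical names for this illustration; the list `[0, 1]` stands in for the array that `predict()` returned above.

```python
# Same convention as the target list: 0 means English, 1 means German.
LANGUAGES = {0: "English", 1: "German"}

def as_batch(samples):
    """Wrap a single encoded word in an outer list so the result is always two-dimensional."""
    if samples and not isinstance(samples[0], list):
        return [samples]
    return samples

def name_predictions(predictions):
    """Translate numeric predictions back into language names."""
    return [LANGUAGES[int(p)] for p in predictions]

one_word = [ord(c) for c in "hellow"]
print(as_batch(one_word))         # [[104, 101, 108, 108, 111, 119]]
print(name_predictions([0, 1]))   # ['English', 'German']
```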

---

### Long version without comments

In [4]:
files = ["english.txt", "german.txt"]

training = []
target = []

for file in files:
    with open(file, 'r', encoding='latin-1') as opened_file:
        file_lines = opened_file.readlines()

    if file == "english.txt":
        
        six_letter_words = 0
        for line in file_lines:
            line = line.replace("\n", "")
            line = line.lower()
            
            if len(line) == 6:
                orded_line = []
                for letter in line:
                    orded_line.append(ord(letter))
                training.append(orded_line)
                six_letter_words += 1
        
        class_target = []
        for _ in range(six_letter_words):
            class_target.append(0)
        
        target += class_target
    else:
        six_letter_words = 0
        for line in file_lines:
            line = line.replace("\n", "")
            line = line.lower()
            if len(line) == 6:
                orded_line = []
                for letter in line:
                    orded_line.append(ord(letter))
                training.append(orded_line)
                six_letter_words += 1
                
        class_target = []
        for _ in range(six_letter_words):
            class_target.append(1)
            
        target += class_target

mlp_nn = MLPClassifier()
mlp_nn.fit(training, target)

english_word = "hellow"
german_word = "flüche"

orded_eng = []
for letter in english_word:
    orded_eng.append(ord(letter))
    
orded_ger = []
for letter in german_word:
    orded_ger.append(ord(letter))
    
mlp_nn.predict([orded_eng, orded_ger])

array([0, 1])

---

### Concise version of the code
For people comfortable with programming, here is a version of the code that shows how short Python can be and, given enough knowledge of Python, how much more explicit code written in other languages (e.g. Java) can be shortened by taking advantage of Python features such as list comprehensions.

In [5]:
from sklearn.neural_network import MLPClassifier

training = []
target = []
for idx, file in enumerate(["english.txt", "german.txt"]):
    # Same latin-1 encoding as the long version; strip the newline before measuring the length
    with open(file, encoding="latin-1") as f:
        words = [line.replace("\n", "").lower() for line in f]
    sample = [[ord(c) for c in word] for word in words if len(word) == 6]
    training += sample
    target += [idx] * len(sample)

mlp_nn = MLPClassifier()
mlp_nn.fit(training, target)

mlp_nn.predict([[ord(c) for c in word] for word in ["hellow", "flüche"]])

array([0, 1])