### Creating simple artificial intelligence in Python

In this tutorial, we will classify which species of Iris a set of measurements belongs to: setosa, versicolor, or virginica. Using sklearn, we will do this automatically, and hopefully be able to predict the species of a flower *even if we don't know what the species is*.

#### Step 1: Data Preparation
First, we need to load our data from a file and separate it into two arrays:

- Training Array (often denoted as "X")
    - Your training array will contain a set of values. In our example, we will be passing the sepal length and width, and petal length and width.
- Target Array (often denoted as "y")
    - Your target array will contain the *answers* to the given training array.
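
As a minimal sketch of how the two arrays line up, here is a hand-written pair. The names `X` and `y` follow the convention above; the measurement values are illustrative and are not read from the dataset files.

```python
# Each row of X holds four measurements:
# sepal length, sepal width, petal length, petal width.
X = [
    [5.0, 3.3, 1.4, 0.2],
    [6.4, 3.2, 4.5, 1.5],  # illustrative values only
]

# y holds the "answer" for the row at the same position in X.
y = ["Iris-setosa", "Iris-versicolor"]

for features, answer in zip(X, y):
    print(features, "->", answer)
```

The important property is that `X` and `y` have the same length, and that position `i` in one matches position `i` in the other.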

For simplicity, I will be using the following English vs. German example:

|Training Array|Target Array|
|-----|-----|
|ANYONE|English|
|UPROAR|English|
|YELLOW|English|
|BÄRGET|German|
|ZURUFE|German|
|WÜSTEM|German|

The example above is only an illustration; we are going to be using the Iris data instead. First, let's create our training array. We will load the data the traditional way with plain Python, rather than with `pandas`, for reference and simplicity.

##### Traditionally loading files in Python
You can open files natively, without any extra libraries, using the `open()` function. `open()` returns a file object; for a file opened in text mode like this one, that is a `TextIOWrapper` object.

In [1]:
my_file = open("iris-setosa.csv", 'r')
print(type(my_file))

If you want to read a TextIOWrapper object, you can use `read()` or `readlines()`.

Here, we will use `readlines()`, which returns a list of strings, one for each line in the text file.

In [2]:
for text_line in my_file.readlines():
    print(text_line)

# Close the file once we are done reading it.
my_file.close()
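
A common alternative is to open the file in a `with` block, which closes it for you. The sketch below writes a tiny stand-in file first so that it runs on its own; the file name, header, and row are placeholders, not the real dataset.

```python
import os
import tempfile

# Write a tiny stand-in file so the example runs on its own.
# The header and row here are placeholders, not the real dataset.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w") as out_file:
    out_file.write("sepal_length,sepal_width,petal_length,petal_width,species\n")
    out_file.write("5.0,3.3,1.4,0.2,Iris-setosa\n")

# The file is closed automatically when the with-block ends.
with open(path, "r") as in_file:
    lines = in_file.readlines()

print(lines)
print(in_file.closed)
```

Each string in `lines` still ends with `"\n"`, which is why printing the lines one by one shows a blank line between them.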

Now, using what we learnt about glob from yesterday, we can go ahead and load all of the files in a loop. Instead of using pandas, however, we will be using the same method that we used above.

#### Activity: Read multiple files using glob, and append all values into a list

So, using the notebook that we used yesterday, go ahead and create a cell that loads every `*.csv` file and appends each line to a *single list*.
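
If you get stuck, here is one possible shape for that cell. It is only a sketch: it creates two placeholder files in a temporary folder so that it can run anywhere, and the file names and rows are made up for the example.

```python
import glob
import os
import tempfile

# Create two small stand-in CSV files in a temporary folder.
# File names and contents are placeholders for the real dataset.
folder = tempfile.mkdtemp()
for name, row in [("a.csv", "5.0,3.3,1.4,0.2,Iris-setosa\n"),
                  ("b.csv", "5.7,2.8,4.1,1.3,Iris-versicolor\n")]:
    with open(os.path.join(folder, name), "w") as out_file:
        out_file.write(row)

all_lines = []
# sorted() gives a predictable file order; glob alone does not promise one.
for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
    with open(path, "r") as in_file:
        # extend() adds each line to the one shared list.
        all_lines.extend(in_file.readlines())

print(all_lines)
```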

Now we have a list of strings containing every row from all of the species files. However, we need to convert this into a list of *numbers*.

Machine Learning, in the barest form, is Mathematics. It's impossible for us to do math with words, and thus it's imperative to change your data into numbers.

For example, if you were training AI on English words, you may end up converting your word into a list of numbers:

|English|To Numbers|
|-----|-----|
|hello|`[8, 5, 12, 12, 15]`|

Then, you would train your AI on the list of numbers, and your AI will spit out another list of numbers. You would then have to re-translate them back to English.

|Numbers|To English|
|-----|-----|
|`[19, 8, 9, 6, 20]`|shift|
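
The two tables above use each letter's position in the alphabet (a = 1, b = 2, and so on). A minimal sketch of that round trip follows; the function names are made up for this example, and it only handles the letters a to z.

```python
def word_to_numbers(word):
    # ord("a") is 97, so subtracting 96 maps a -> 1, b -> 2, ...
    return [ord(letter) - 96 for letter in word.lower()]

def numbers_to_word(numbers):
    # Reverse the mapping: 1 -> a, 2 -> b, ...
    return "".join(chr(number + 96) for number in numbers)

print(word_to_numbers("hello"))            # [8, 5, 12, 12, 15]
print(numbers_to_word([19, 8, 9, 6, 20]))  # shift
```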

So in this scenario, we have a row that looks like

`"5.0,3.3,1.4,0.2,Iris-setosa"`

We need to convert this string into a list, so that we get the specific values, similar to:

`[5.0, 3.3, 1.4, 0.2, "Iris-setosa"]`

We can split a string into a list really easily using the `split()` method.

In [3]:
my_string = "5.0,3.3,1.4,0.2,Iris-setosa"
string_to_array = my_string.split(",")

print(string_to_array)

Now, we have an array, but our number values are still strings (which is not helpful). We need to convert them into numbers.

In [4]:
# The last element is the species name, so we convert
# every element except that one.
print(range(len(string_to_array) - 1))

for item in range(len(string_to_array) - 1):
    string_to_array[item] = float(string_to_array[item])

print(string_to_array)

range(0, 4)
[5.0, 3.3, 1.4, 0.2, 'Iris-setosa']
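
The same conversion can also be written with slicing and a list comprehension, which keeps the measurements and the species name in separate variables. This is a sketch of an alternative style; the variable names are chosen for this example.

```python
row = "5.0,3.3,1.4,0.2,Iris-setosa"
parts = row.split(",")

# Everything except the last element is a measurement.
features = [float(value) for value in parts[:-1]]
# The last element is the species name, which stays a string for now.
species = parts[-1]

print(features, species)
```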


So, we've converted our values properly, but we have one problem: 'Iris-setosa' needs to become a number. How should we do this? Remember: this is one of the three species included in the dataset.

##### Activity: Apply our solution to every line in the dataset.

Here's one way to do this:

In [5]:
import glob

training = []    # feature rows: four floats per flower
target = []      # integer label for each row
label_dict = {}  # species name -> integer label

for item in glob.glob("iris*.csv"):
    # Use a context manager so each file is closed after reading.
    with open(item, 'r') as csv_file:
        lines = csv_file.readlines()
    del lines[0]  # skip the first line (the header row)
    for line in lines:
        values = line.strip().split(",")
        training.append([float(i) for i in values[:-1]])
        label = values[-1]
        # Give each new species name the next unused integer.
        if label not in label_dict:
            label_dict[label] = len(label_dict)
        target.append(label_dict[label])
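
The label-to-number step in the cell above can also be written in a single line with `dict.setdefault`. The sketch below isolates just that step, using a short hand-written list of labels instead of the files.

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"]

label_dict = {}
target = []
for label in labels:
    # setdefault returns the stored number if the label is known;
    # otherwise it stores len(label_dict) as the new number first.
    target.append(label_dict.setdefault(label, len(label_dict)))

print(target)      # [0, 1, 0, 2]
print(label_dict)
```

Note that the numbers depend on the order in which the labels are first seen, and that labels are compared exactly, so differently capitalised spellings count as different species.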


Perfect! Now we have both of our training and target arrays prepared. Now, let's do some simple machine learning!

First, we need to import the class we want from the `sklearn` library (short for scikit-learn). We can do that using the `from ... import ...` statement.

This statement binds only the *specific* name you ask for, so you don't have to type the full `sklearn.neural_network.` prefix every time. Python still loads the whole module behind the scenes, so the benefit is tidier, more readable code rather than lower memory use or better performance.
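
One thing worth checking for yourself is what `from ... import ...` actually loads. The sketch below uses the standard-library `json` module as a stand-in for `sklearn`, so that it runs without any installation; the behaviour shown is general to Python imports.

```python
import sys

from json import dumps  # import a single name from a standard-library module

# The whole module was still loaded and cached, even though
# only one name was bound in this script.
print("json" in sys.modules)
print(dumps({"species": "Iris-setosa"}))
```

In other words, the statement mainly controls which names are available in your script; it does not load only part of the module.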

In [6]:
from sklearn.neural_network import MLPClassifier

Using `sklearn` is incredibly easy, and creating a neural network can be done in two lines.

In [7]:
# Create the MLPClassifier
mlp_nn = MLPClassifier()

# Fit (or train) the MLPClassifier with the training
# and corresponding target data.
mlp_nn.fit(training, target)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [8]:
mlp_nn.predict([training[145]])

array([3])
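
The number returned by `predict()` is one of the integer codes stored in `label_dict`. To turn it back into a species name, you can flip the dictionary around. In this sketch both the mapping and the `predicted` list are hypothetical stand-ins, since the real codes depend on the order in which the labels were first seen.

```python
# Hypothetical mapping; the real one depends on the order
# in which species names were first seen while loading.
label_dict = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

# Flip keys and values so we can look up a name by its number.
number_to_label = {number: name for name, number in label_dict.items()}

predicted = [2]  # stand-in for the array returned by predict()
print([number_to_label[number] for number in predicted])
```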

Let's say that some values are missing from our dataset (such as NaN values). One easy fix is to take the mean of the affected column for that species and use it to fill in the rows that are missing values.

However, it can often be more accurate to use a neural network to *predict* the values that are missing. Let's look at that now:
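
For reference, filling gaps with the column mean can be sketched in a few lines of plain Python. The rows below are illustrative values rather than data from the files, and `None` marks a missing sepal width at index 1.

```python
# Illustrative rows; None marks a missing sepal width (index 1).
rows = [
    [2.5, 1.2, 2.1, 0.8],
    [2.4, 1.0, 2.0, 0.7],
    [2.6, None, 2.2, 0.9],
]

# Average the values that are present in the column.
known = [row[1] for row in rows if row[1] is not None]
column_mean = sum(known) / len(known)

# Fill each gap in that column with the average.
for row in rows:
    if row[1] is None:
        row[1] = column_mean

print(rows[2])
```

Every filled-in row receives the same value, regardless of its other measurements. That is the limitation a predictive model tries to improve on.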

In [9]:
missing_v = open('missing-iris-virginica.csv', 'r')

for item in missing_v.readlines()[-3:]:
    print(item)

2.559,None,2.047,0.787,Iris-virginica

2.441,None,2.126,0.906,Iris-virginica

2.323,None,2.008,0.709,Iris-VirGinICA



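Now, as usual, let's make the training and target lists. This time the target is the column with the missing values (index 1, the sepal width), and each training row holds the remaining three measurements. We skip the header row and the last three rows, which have no value to learn from.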

In [10]:
missing_v = open('missing-iris-virginica.csv', 'r')
trainingm_list = []
targetm_list = []

for item in missing_v.readlines()[1:-3]:
    this_row = []
    for idx, value in enumerate(item.split(",")[:-1]):
        if idx == 1:
            targetm_list.append(float(value))
        else:
            this_row.append(float(value))
    trainingm_list.append(this_row)
    
display(trainingm_list)
print(targetm_list)

[[2.48, 2.362, 0.984],
 [2.283, 2.008, 0.748],
 [2.795, 2.323, 0.827],
 [2.48, 2.205, 0.709],
 [2.559, 2.283, 0.866],
 [2.992, 2.598, 0.827],
 [1.929, 1.772, 0.669],
 [2.874, 2.48, 0.709],
 [2.638, 2.283, 0.709],
 [2.835, 2.402, 0.984],
 [2.559, 2.008, 0.787],
 [2.52, 2.087, 0.748],
 [2.677, 2.165, 0.827],
 [2.244, 1.969, 0.787],
 [2.283, 2.008, 0.945],
 [2.52, 2.087, 0.906],
 [2.559, 2.165, 0.709],
 [3.031, 2.638, 0.866],
 [3.031, 2.717, 0.906],
 [2.362, 1.969, 0.591],
 [2.717, 2.244, 0.906],
 [2.205, 1.929, 0.787],
 [3.031, 2.638, 0.787],
 [2.48, 1.929, 0.709],
 [2.638, 2.244, 0.827],
 [2.835, 2.362, 0.709],
 [2.441, 1.89, 0.709],
 [2.402, 1.929, 0.709],
 [2.52, 2.205, 0.827],
 [2.835, 2.283, 0.63],
 [2.913, 2.402, 0.748],
 [3.11, 2.52, 0.787],
 [2.52, 2.205, 0.866],
 [2.48, 2.008, 0.591],
 [2.402, 2.205, 0.551],
 [3.031, 2.402, 0.906],
 [2.48, 2.205, 0.945],
 [2.52, 2.165, 0.709],
 [2.362, 1.89, 0.709],
 [2.717, 2.126, 0.827],
 [2.638, 2.205, 0.945],
 [2.717, 2.008, 0.906],
 [2.283,

[1.299, 1.063, 1.181, 1.142, 1.181, 1.181, 0.984, 1.142, 0.984, 1.417, 1.26, 1.063, 1.181, 0.984, 1.102, 1.26, 1.181, 1.496, 1.024, 0.866, 1.26, 1.102, 1.102, 1.063, 1.299, 1.26, 1.102, 1.181, 1.102, 1.181, 1.102, 1.496, 1.102, 1.102, 1.024, 1.181, 1.339, 1.22, 1.181, 1.22, 1.22, 1.22, 1.063, 1.26, 1.299, 1.181, 0.984]


Now that we have our data, let's recreate the neural network. Instead of a classifier, we need a regressor.

The simplest difference between a classifier and a regressor is that a classifier predicts *categories* (discrete labels), while a regressor predicts continuous *numerical values*.

We're not trying to *classify* the data; we're trying to predict the most likely *numerical value* for the missing data. Thus, we will import a different neural network.

In [11]:
from sklearn.neural_network import MLPRegressor

# Create the MLPRegressor. Fixing random_state makes the
# result repeatable from run to run.
m_mlp_nn = MLPRegressor(random_state=672)

# Fit (or train) the MLPRegressor with the training
# and corresponding target data.
m_mlp_nn.fit(trainingm_list, targetm_list)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=672,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

Now, let's write a script to collect the rows with missing values from our data, keeping only the three measurements that are present.

In [12]:
missing_v = open('missing-iris-virginica.csv', 'r')
missing_values_list = []

for item in missing_v.readlines()[-3:]:
    this_row = []
    for idx, value in enumerate(item.split(",")[:-1]):
        if value != 'None':
            this_row.append(float(value))
    missing_values_list.append(this_row)
    
display(missing_values_list)

[[2.559, 2.047, 0.787], [2.441, 2.126, 0.906], [2.323, 2.008, 0.709]]
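
The same parsing can also be done with Python's built-in `csv` module instead of `split(",")`. The sketch below is one way to do it, and it holds the three rows printed earlier in a string so that it runs without the file; the variable names are chosen for this example.

```python
import csv
import io

# The three rows printed earlier, held in memory instead of a file.
text = (
    "2.559,None,2.047,0.787,Iris-virginica\n"
    "2.441,None,2.126,0.906,Iris-virginica\n"
    "2.323,None,2.008,0.709,Iris-VirGinICA\n"
)

known_features = []
for fields in csv.reader(io.StringIO(text)):
    # Drop the species name, then skip the 'None' placeholder.
    measurements = fields[:-1]
    known_features.append([float(v) for v in measurements if v != "None"])

print(known_features)
```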

Now we have the rows whose sepal width is missing. Let's use the neural network we created to predict those missing values.

In [13]:
predicted_values = m_mlp_nn.predict(missing_values_list)
display(predicted_values)

array([1.20175054, 1.30719757, 1.19051255])

Here are the predicted values for our missing data. Let's see how well the neural network did compared with the original values.

In [14]:
original_values = open('iris-virginica.csv', 'r')
original_values_list = []

for item in original_values.readlines()[-3:]:
    this_row = []
    for idx, value in enumerate(item.split(",")[:-1]):
        if idx == 1:
            this_row.append(float(value))
    original_values_list.append(this_row)

print(original_values_list)

[[1.181], [1.339], [1.181]]


It seems to have done quite well! If we had used the mean of the values instead, we would have gotten:

In [15]:
missing_v = open('missing-iris-virginica.csv', 'r')
sepal_width_list = []

for item in missing_v.readlines()[1:-3]:
    for idx, value in enumerate(item.split(",")[:-1]):
        if idx == 1:
            sepal_width_list.append(float(value))
    
mean = sum(sepal_width_list) / len(sepal_width_list)
print(f"The mean of sepal_width column is {mean}")

The mean of sepal_width column is 1.1667446808510633
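
To put a number on the comparison, we can compute the average absolute error of each approach against the original values. This sketch copies the numbers from the outputs above; the helper function name is made up for this example.

```python
# Values copied from the outputs above.
predicted = [1.20175054, 1.30719757, 1.19051255]
actual = [1.181, 1.339, 1.181]
column_mean = 1.1667446808510633

def mean_absolute_error(guesses, truths):
    # Average distance between each guess and the true value.
    return sum(abs(g - t) for g, t in zip(guesses, truths)) / len(truths)

network_error = mean_absolute_error(predicted, actual)
mean_error = mean_absolute_error([column_mean] * len(actual), actual)

print(f"Neural network: {network_error:.4f}")  # 0.0207
print(f"Column mean:    {mean_error:.4f}")     # 0.0669
```

On these three rows the network's predictions are closer on average, mostly because of the second row. Three rows is a very small sample, so treat this as a rough indication rather than proof.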
