<a href="https://colab.research.google.com/github/LoolzMe/MachineLearning/blob/main/TensorflowStackOverflowQuestionsRecognizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Language classifier for StackOverflow questions. The dataset contains questions, each labeled with the programming language it relates to (Python, C#, JavaScript, or Java).

In [7]:
from os import path

import tensorflow as tf


url = "https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz" 

download_db = tf.keras.utils.get_file("stack_overflow_16k", origin=url, untar=True, cache_dir='', cache_subdir='')

path_db = path.dirname(download_db)



In [8]:
# path_db = '/tmp/.keras/stack_overflow_16k'

from os import listdir
print(listdir(path_db))

['stack_overflow_16k.tar.gz', 'test', 'train', 'README.md']


In [9]:
seed = 42

raw_train_dir = tf.keras.utils.text_dataset_from_directory(
    path.join(path_db, 'train'), validation_split=0.2, seed=seed,
    subset='training' 
)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


In [10]:

raw_valid_dir = tf.keras.utils.text_dataset_from_directory(
    path.join(path_db, 'train'), validation_split=0.2, seed=seed,
    subset='validation' 
)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
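Both calls above pass the same `seed`, which is what keeps the training and validation subsets disjoint. The sketch below illustrates the idea in plain Python. It is an assumption-laden toy, not Keras's actual implementation: `split_indices` is a hypothetical helper that shuffles file indices with a fixed seed and cuts off the tail for validation.

```python
import random


def split_indices(n_files, validation_split, seed):
    """Shuffle file indices with a fixed seed, then reserve the tail for validation."""
    indices = list(range(n_files))
    random.Random(seed).shuffle(indices)
    n_val = int(n_files * validation_split)
    return indices[:-n_val], indices[-n_val:]


# Same sizes as reported above: 8000 files, 20% held out.
train_idx, val_idx = split_indices(8000, 0.2, seed=42)
print(len(train_idx), len(val_idx))      # 6400 1600
print(set(train_idx) & set(val_idx))     # set(), i.e. no file lands in both subsets

# If the validation subset were built with a different seed, the shuffle would
# differ and some training files would most likely reappear in validation.
_, val_other_seed = split_indices(8000, 0.2, seed=7)
print(len(set(train_idx) & set(val_other_seed)) > 0)
```

The takeaway is only that the two `text_dataset_from_directory` calls should share `validation_split` and `seed`.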


In [24]:
raw_test_dir = tf.keras.utils.text_dataset_from_directory(
    path.join(path_db, 'test')
)

Found 8000 files belonging to 4 classes.


In [11]:
print(raw_valid_dir.class_names)

['csharp', 'java', 'javascript', 'python']


In [15]:
for text, label in raw_train_dir.take(5):
  for i in range(3):
    print("text ", text.numpy()[i])
    print("language ", label.numpy()[i])

text  b'"set blank to quit on exception? i\'m using blank 3..i\'ve been looking around for an answer to this, but i haven\'t found it yet. basically, i\'m running several blank scripts into a game engine, and each script has its own entry point...i\'d rather not add try: except blocks through all of my code, so i was wondering if it\'s at all possible to tell blank to quit (or perhaps assign a custom function to that ""callback"") on finding its first error, regardless of where or what it found? ..currently, the game engine will continue after finding and hitting an error, making it more difficult than necessary to diagnose issues since running into one error may make a subsequent script not work (as it relies on variables that the error-ing script set, for example). any ideas? ..i know that i could redirect the console to a file to allow for easier scrolling, but just capturing the first error and stopping the game prematurely would be really useful...okay, a couple of extra bits of i

In [18]:
max_features = 100000
seq_len = 250 


vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=max_features,
                                                    output_sequence_length=seq_len)



In [19]:
train_questions = raw_train_dir.map(lambda x, y : x)
vectorize_layer.adapt(train_questions)


In [22]:
print(list(train_questions)[1])

tf.Tensor(
[b'"calculating bandwidth #2 i was successfully calculate bandwidth consumed by my application (which connects to web service to send files) using [this][1].  [1]: calculating bandwidth i used the code provided by rasmus faber... one thing puzzled me: the received bytes is far too more than sent bytes... as far as i am concerned i am sending images, xml files etc and returning error codes.. therefore the sent bytes should be more? ...."\n'
 b'"how can i convert contiguous letters to number separated by dash in blank? i\'d like to convert alphabet letters to numbers and if the alphabet letters are contiguous, i want to make it separated by dash (-). i will leave non-alphabetic character unchanged shown below:..input string: ""20 pizzas an!d, 1 apple a b c""..output string: ""20 16-9-26-26-1-19 1-14!4, 1 1-16-16-12-5 1 2 3""...note: a/a =1, b/b =2, ... y/y = 25, z/z = 26"\n'
 b'"hide content inside brackets using pure blank i want to use pure blank to hide all content inside b

In [25]:
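def vect(text, label):
  # Add a trailing axis so each string is its own example: (batch,) -> (batch, 1).
  # The result must be assigned; tf.expand_dims does not modify its input in place.
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label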


train_ds = raw_train_dir.map(vect)
test_ds = raw_test_dir.map(vect)
val_ds = raw_valid_dir.map(vect)

In [27]:
print(next(iter(train_ds))[0])

tf.Tensor(
[[   81     4   213 ...     0     0     0]
 [  985  3097  1584 ...     0     0     0]
 [   24     4 18289 ...     0     0     0]
 ...
 [   80   449     7 ...     0     0     0]
 [   24     4   205 ...     0     0     0]
 [   16   244    30 ...     0     0     0]], shape=(32, 250), dtype=int64)
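The integer matrix above can be read as: each row is one question, each number is a vocabulary index, and trailing zeros are padding up to `seq_len`. The toy below mimics that behaviour in plain Python, assuming the layer's default settings (lowercasing, punctuation stripping, whitespace splitting, index 0 for padding, index 1 for out-of-vocabulary tokens). The helper names and the mini corpus are hypothetical, and tie-breaking between equally frequent tokens may differ from the real layer.

```python
import string
from collections import Counter


def standardize(text):
    """Lowercase and strip punctuation, similar to the layer's default standardization."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocab(texts, max_tokens):
    """Map tokens to indices by frequency; 0 (padding) and 1 (unknown) are reserved."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def vectorize(text, vocab, seq_len):
    """Convert text to a fixed-length list of indices, truncating or zero-padding."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:seq_len]
    return ids + [0] * (seq_len - len(ids))


# Hypothetical mini corpus, not taken from the dataset.
corpus = ["How do I read a file?", "How do I sort a list in place?"]
vocab = build_vocab(corpus, max_tokens=100)
encoded = vectorize("How do I parse a file?", vocab, seq_len=8)
print(encoded)  # "parse" is unseen, so it maps to 1; the tail is zero padding
```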


In [29]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

In [30]:
embedding_dim = 16 # default

model = tf.keras.Sequential([
        tf.keras.layers.Embedding(max_features + 1, embedding_dim),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(4)
])
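As a sanity check on the model size, the number of trainable parameters can be worked out by hand. Only the `Embedding` and `Dense` layers have weights; `Dropout` and `GlobalAveragePooling1D` have none. The sketch assumes the `Dense` layer receives the 16-dimensional pooled vector, which follows from the layer order above.

```python
max_features = 100000   # vocabulary cap used for the vectorization layer
embedding_dim = 16
num_classes = 4

# One embedding vector per token index (max_features + 1 rows).
embedding_params = (max_features + 1) * embedding_dim
# Dense layer: one weight per (input feature, class) pair plus one bias per class.
dense_params = embedding_dim * num_classes + num_classes

total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)  # 1600016 68 1600084
```

Almost all of the capacity sits in the embedding table, so `max_features` is the main lever on model size.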



In [32]:
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
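Because the last `Dense(4)` layer has no activation, the model outputs raw logits, which is why the loss is built with `from_logits=True`. For a single example the loss is the negative log of the softmax probability assigned to the true class. The plain-Python sketch below shows that formula; `sparse_cce_from_logits` is a hypothetical helper, not the Keras implementation.

```python
import math


def sparse_cce_from_logits(logits, label):
    """Return -log(softmax(logits)[label]), using log-sum-exp for numerical stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# With four equal logits every class has probability 1/4, so the loss is log(4).
uniform_loss = sparse_cce_from_logits([0.0, 0.0, 0.0, 0.0], label=2)
print(uniform_loss)

# A large logit on the true class gives a small loss; on a wrong class, a large one.
print(sparse_cce_from_logits([5.0, 0.0, 0.0, 0.0], label=0))
print(sparse_cce_from_logits([5.0, 0.0, 0.0, 0.0], label=1))
```

A loss near `log(4)` (about 1.386) therefore corresponds to a model that has learned nothing about the four classes.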

In [33]:
epochs = 10

model.fit(train_ds, validation_data=val_ds, epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f155466a810>

In [35]:
loss, accuracy = model.evaluate(test_ds)

print(loss)
print(accuracy)

0.8541976809501648
0.7406250238418579


We can observe some overfitting, but we will overlook it here.
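One common way to limit this effect is early stopping: halt training once the validation loss has not improved for a few epochs. Keras offers this as `tf.keras.callbacks.EarlyStopping` (with `monitor`, `patience`, and `restore_best_weights` arguments). The plain-Python sketch below only illustrates the patience rule. The helper name is hypothetical and the loss values are made up for illustration; they are not taken from the run above.

```python
def stop_epoch_with_patience(val_losses, patience=2):
    """Return (epoch at which training stops, best epoch), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, NOT measured in this notebook.
val_losses = [1.30, 1.10, 0.95, 0.90, 0.92, 0.94, 0.97]
print(stop_epoch_with_patience(val_losses, patience=2))  # (6, 4)
```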

In [36]:
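export_model = tf.keras.Sequential([
                vectorize_layer,
                model,
                # Note: sigmoid yields independent per-class scores that need not sum to 1.
                # For four mutually exclusive classes, 'softmax' is the conventional choice.
                tf.keras.layers.Activation('sigmoid')
])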

export_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                     optimizer='adam', metrics=['accuracy'])



In [40]:
loss, accuracy = export_model.evaluate(raw_test_dir)





In [44]:
example = ["how does streamreader read all chars, including 0x0d 0x0a chars? how does streamreader read all chars, including 0x0d 0x0a chars?..i have an old .txt file i am trying to covert.  many lines (but not all) end with 0x0d 0x0d 0x0a.  ..this code reads all of the lines. ..streamreader srfile = new streamreader(gstpathfilename);.while (!srfile.endofstream) {.    string stfilecontents = srfile.readline();.    ....}...this results in extra strings between each .txt line"]

export_model.predict(example)

array([[0.59661114, 0.4849414 , 0.37768006, 0.52548957]], dtype=float32)
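To turn the score vector into a readable answer, pick the index of the largest score and look it up in the class list printed earlier. The sketch below hard-codes the four scores from the output above and the class order from `raw_valid_dir.class_names`. The softmax step at the end is an illustrative addition, not part of the exported model.

```python
import math

class_names = ["csharp", "java", "javascript", "python"]   # order from class_names
scores = [0.59661114, 0.4849414, 0.37768006, 0.52548957]   # scores from the prediction

# Index of the highest score gives the predicted class.
best = max(range(len(scores)), key=scores.__getitem__)
print(class_names[best])  # csharp

# Sigmoid scores are independent per class, so they do not form a distribution.
print(round(sum(scores), 2))  # about 1.98, not 1

# Undo the sigmoid to recover the logits, then apply softmax for probabilities
# that sum to 1. The ranking of the classes is unchanged.
logits = [math.log(p / (1.0 - p)) for p in scores]
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
print(class_names[probs.index(max(probs))])  # csharp
```

The example question is about `StreamReader`, so `csharp` is the expected label.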

In [48]:
model.save_weights("/content/export/checkpoint-0000.ckpt")
export_model.save_weights("/content/export/export-checkpoint-0000.ckpt")