In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mysession').getOrCreate()

This note shows how the input Parquet file was generated from the raw test file.

The flow is as follows:
`data/test` $\rightarrow$ `data/test_unlabeled` $\rightarrow$ `data/input.parquet`

# Generating Unlabeled Test File

The file `data/test_unlabeled` is simply `data/test` with the labels removed. For the curious, the conversion below uses Python generators, so the file is processed line by line rather than loaded into memory all at once.

This is useful when we have a labeled dataset and want to compare predictions against the true labels to compute performance metrics: we first strip the labels, then run inference on the unlabeled text. In practice, inference is usually run on data that is unlabeled to begin with.
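To compare predictions with the true labels later, the labels need to be kept somewhere before they are stripped. The following is a minimal sketch, assuming the fastText convention that every label token starts with `__label__`; the helper name `split_labels_and_text` is hypothetical and not part of the original notebook.

```python
def split_labels_and_text(line, prefix="__label__"):
    """Split a fastText-formatted line into (labels, text).

    Hypothetical helper: assumes label tokens start with `prefix`.
    """
    labels, words = [], []
    for token in line.split():
        if token.startswith(prefix):
            # Keep the label name without the fastText prefix
            labels.append(token[len(prefix):])
        else:
            words.append(token)
    return labels, " ".join(words)


labels, text = split_labels_and_text(
    "__label__php __label__image making an image greyscale with gd library"
)
print(labels)  # ['php', 'image']
print(text)    # making an image greyscale with gd library
```

Storing the labels alongside the text in this way keeps the ground truth available once predictions have been produced for the unlabeled sentences.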

In [2]:
! head -10 data/test

__label__php __label__image making an image greyscale with gd library
__label__eclipse transforming selected text with a hotkey
__label__sql-server sql server and the guest account what is this for
__label__jquery __label__html how can i change html attribute names with jquery
__label__php __label__ajax how can i send an array to php through ajax
__label__c __label__cocoa c the definitive truth about rand random and arc4random
__label__winforms gantt chart controls on windows forms
__label__php __label__linux build tar file from directory in php without exec/passthru
__label__javascript __label__ajax how do you manage infragistics webgrid data from javascript/ajax code
__label__wcf how to consume json web services from a windows client


In [3]:
def keep_sentence_field(line):
    '''
    Keep only the text of a labeled instance in fastText format.
    Example
    Input:
    '__label__python __label__django help with unit testing in a python app using django'
    Output:
    'help with unit testing in a python app using django'
    '''
    # Label tokens start with the "__label__" prefix; drop them and keep the rest
    words = [x for x in line.split() if not x.startswith("__label__")]
    output = ' '.join(words)
    return output

# Location of input and output files
inputFile = 'data/test'
outputFile = 'data/test_unlabeled'

# Open both files in context managers so they are closed automatically
with open(inputFile, encoding="ISO-8859-1") as fin, open(outputFile, 'w') as fout:
    # Generator that lazily keeps only the sentence field of each line
    sentences = (keep_sentence_field(line) for line in fin)
    # Consume the generator and write the unlabeled sentences
    for sentence in sentences:
        fout.write(sentence + '\n')

In [4]:
! head -10 data/test_unlabeled

making an image greyscale with gd library
transforming selected text with a hotkey
sql server and the guest account what is this for
how can i change html attribute names with jquery
how can i send an array to php through ajax
c the definitive truth about rand random and arc4random
gantt chart controls on windows forms
build tar file from directory in php without exec/passthru
how do you manage infragistics webgrid data from javascript/ajax code
how to consume json web services from a windows client


# Generating Parquet File

Let's create a Spark DataFrame from the unlabeled text file and save it as a Parquet file for later use.
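One caveat, which is an assumption about the data rather than something this note states: `spark.read.csv` splits each line on commas by default, so with a single-column schema a sentence containing a comma would likely be cut at the first comma. The sketch below is a plain-Python check (the helper name `count_lines_with` is hypothetical) that counts such lines before loading. The sample text is made up for illustration.

```python
import io


def count_lines_with(lines, char=","):
    """Count the lines of an iterable that contain `char` (hypothetical helper)."""
    return sum(1 for line in lines if char in line)


# Made-up sample standing in for the unlabeled file
sample = io.StringIO("how to parse a, b and c\nplain sentence\n")
n_with_commas = count_lines_with(sample)
print(n_with_commas)  # 1
```

In the notebook, the same function could be given an open handle on `data/test_unlabeled`. If the count is non-zero, reading the file with `spark.read.text`, which returns a single string column named `value`, may be the safer choice.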

In [5]:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("input", StringType())])

df_input = spark.read.csv('data/test_unlabeled', header=False, schema=schema)

In [6]:
df_input.show(10,False)

+---------------------------------------------------------------------+
|input                                                                |
+---------------------------------------------------------------------+
|making an image greyscale with gd library                            |
|transforming selected text with a hotkey                             |
|sql server and the guest account what is this for                    |
|how can i change html attribute names with jquery                    |
|how can i send an array to php through ajax                          |
|c the definitive truth about rand random and arc4random              |
|gantt chart controls on windows forms                                |
|build tar file from directory in php without exec/passthru           |
|how do you manage infragistics webgrid data from javascript/ajax code|
|how to consume json web services from a windows client               |
+---------------------------------------------------------------------+

In [7]:
df_input.write.mode('overwrite').parquet('data/input.parquet')