# Project 8: Deploy a model with a big data architecture in AWS

*Pierre-Eloi Ragetly*

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Create-a-file-with-all-picture's-path" data-toc-modified-id="Create-a-file-with-all-picture's-path-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Create a file with all picture's path</a></span></li><li><span><a href="#Create-a-Spark-DataFrame-with-a-Path-column" data-toc-modified-id="Create-a-Spark-DataFrame-with-a-Path-column-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Create a Spark DataFrame with a Path column</a></span></li><li><span><a href="#Create-a-Category-column" data-toc-modified-id="Create-a-Category-column-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Create a Category column</a></span></li></ul></div>

## Create a file with all picture's path

In [1]:
# File system management
import os

root_path = 'dataset/fruits-360_dataset/fruits-360'
train_path = os.path.join(root_path, 'Training')
test_path = os.path.join(root_path, 'Test')

In [2]:
def get_files_paths(path, text_name):
    """Walk `path` recursively and write every file path to `text_name`, one per line."""
    list_files = []
    for root, _dirs, files in os.walk(path):
        for f in files:
            # Skip hidden files (names starting with a dot)
            if not f.startswith('.'):
                list_files.append(os.path.join(root, f))
    with open(text_name, 'w') as out:
        out.write('\n'.join(list_files))

In [3]:
get_files_paths(train_path, 'dataset/train_files.txt')
get_files_paths(test_path, 'dataset/test_files.txt')
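
As a quick sanity check, the helper can be exercised on a throwaway directory. The sketch below is only illustrative: it repeats an equivalent copy of `get_files_paths` so that it runs on its own, and the folder layout and file names (`Training/Tomato 4`, `.DS_Store`) are hypothetical stand-ins for the real dataset.

```python
import os
import tempfile


def get_files_paths(path, text_name):
    """Equivalent copy of the helper above, repeated so the sketch is self-contained."""
    list_files = []
    for root, _dirs, files in os.walk(path):
        for f in files:
            if not f.startswith('.'):
                list_files.append(os.path.join(root, f))
    with open(text_name, 'w') as out:
        out.write('\n'.join(list_files))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical miniature dataset: one class folder, one image, one hidden file
    class_dir = os.path.join(tmp, 'Training', 'Tomato 4')
    os.makedirs(class_dir)
    for name in ('r_236_100.jpg', '.DS_Store'):
        open(os.path.join(class_dir, name), 'w').close()

    # The output file sits outside the walked folder, so it is not listed itself
    out_file = os.path.join(tmp, 'train_files.txt')
    get_files_paths(os.path.join(tmp, 'Training'), out_file)

    with open(out_file) as f:
        listed = f.read().splitlines()

print(listed)
```

Only the image path should be listed; the hidden file is filtered out by the leading-dot check.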

## Create a Spark DataFrame with a Path column

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Build (or reuse) the session, then take its underlying SparkContext
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [5]:
def load_files_paths(path):
    """Load file paths from a text file into a one-column Spark DataFrame."""
    # Each line of the text file becomes one Row with a single `path` field
    rdd = sc.textFile(path).map(lambda line: Row(path=line))
    return spark.createDataFrame(rdd)

In [6]:
train = load_files_paths('dataset/train_files.txt')
test = load_files_paths('dataset/test_files.txt')

In [7]:
data = train.union(test)

In [8]:
data.limit(10).show(truncate=False)

+---------------------------------------------------------------------+
|path                                                                 |
+---------------------------------------------------------------------+
|dataset/fruits-360_dataset/fruits-360/Training/Tomato 4/r_236_100.jpg|
|dataset/fruits-360_dataset/fruits-360/Training/Tomato 4/247_100.jpg  |
|dataset/fruits-360_dataset/fruits-360/Training/Tomato 4/257_100.jpg  |
|dataset/fruits-360_dataset/fruits-360/Training/Tomato 4/r_78_100.jpg |
|dataset/fruits-360_dataset/fruits-360/Training/Tomato 4/r_68_100.jpg |
|dataset/fruits-360_dataset/fruits-360/Training/Tomato 4/r_150_100.jpg|
|dataset/fruits-360_dataset/fruits-360/Training/Tomato 4/r_140_100.jpg|
|dataset/fruits-360_dataset/fruits-360/Training/Tomato 4/131_100.jpg  |
|dataset/fruits-360_dataset/fruits-360/Training/Tomato 4/198_100.jpg  |
|d

## Create a Category column

In [9]:
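# Work on a 1,000-row sample while developing the extraction
sample = data.limit(1000)

In [10]:
from pyspark.sql.functions import regexp_extract

# Group 2 captures the class folder name up to its last letter,
# dropping any trailing index (e.g. 'Tomato 4' -> 'Tomato')
regex = r'(.*)/(.*[a-zA-Z])(.*)/'
df = sample.withColumn('category', regexp_extract(sample.path, regex, 2))
df.show(10)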

+--------------------+--------+
|                path|category|
+--------------------+--------+
|dataset/fruits-36...|  Tomato|
|dataset/fruits-36...|  Tomato|
|dataset/fruits-36...|  Tomato|
|dataset/fruits-36...|  Tomato|
|dataset/fruits-36...|  Tomato|
|dataset/fruits-36...|  Tomato|
|dataset/fruits-36...|  Tomato|
|dataset/fruits-36...|  Tomato|
|dataset/fruits-36...|  Tomato|
|dataset/fruits-36...|  Tomato|
+--------------------+--------+
only showing top 10 rows
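
To see what the pattern captures without starting Spark, the same regular expression can be tried with Python's built-in `re` module. This is a minimal sketch, not part of the pipeline: `extract_category` is a hypothetical helper, the second and third sample paths are made up for illustration, and returning an empty string when nothing matches is meant to mirror how `regexp_extract` behaves.

```python
import re

# Same pattern as in the notebook cell above
regex = r'(.*)/(.*[a-zA-Z])(.*)/'


def extract_category(path):
    """Hypothetical helper: return capture group 2, or '' when the pattern does not match."""
    match = re.search(regex, path)
    return match.group(2) if match else ''


samples = [
    'dataset/fruits-360_dataset/fruits-360/Training/Tomato 4/r_236_100.jpg',
    # Illustrative paths, assumed to follow the same folder layout
    'dataset/fruits-360_dataset/fruits-360/Test/Apple Red 1/3_100.jpg',
    'dataset/fruits-360_dataset/fruits-360/Training/Banana/1_100.jpg',
]

for p in samples:
    print(repr(extract_category(p)))
```

The first group is greedy, so it consumes everything up to the slash before the class folder. The second group must end on a letter, which strips a trailing index such as ` 4` while keeping multi-word names intact.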

                                                                                