# Introduction <br>

The very young AgriTech start-up, named “**Fruits**!”,
seeks to offer innovative solutions for fruit harvesting.

The company wants to preserve fruit biodiversity
by developing intelligent picking robots
that apply specific treatments to each species of fruit.

The start-up initially wishes to make itself known by making
a mobile application available to the general public, which would allow
users to take a photo of a fruit and get information about that fruit.

For the start-up, this application would raise awareness of fruit biodiversity
among the general public and make it possible to set up a first version
of the fruit image classification engine.

In addition, the development of the mobile application will make it possible to build
a first version of the **Big Data** architecture required.

## Objectives in this project

1. Develop a first data processing chain that <br />
   includes **preprocessing** and a **dimension reduction** step.
2. Take into account that <u>the volume of data will increase <br />
   very quickly</u> after delivery of this project, which involves:
   - Deploying data processing in a **Big Data** environment
   - Developing scripts in **PySpark** to perform **distributed computing**

## Technical choices <br>
We must take into account the very rapid increase in the volume of data after delivery of the project. This is why it is important to adopt a distributed computing approach from the start. To do this we will write **PySpark** scripts. <br>
**PySpark** is a way to communicate
with **Spark** via the **Python** language.<br />
**Spark**, for its part, is a tool that manages and coordinates
the execution of tasks on data across a group of computers. <br />
<u>Spark (or Apache Spark) is an open source, in-memory distributed computing framework <br />
for processing and analyzing massive data</u>.

![Spark diagram](img/spark-schema.png)

*The driver (sometimes called “Spark Session”) distributes and schedules
the tasks among the different executors, which run them and thereby enable
distributed processing. The driver is responsible for running the code
across the different machines.*

*Each executor is a separate Java Virtual Machine (JVM) process<br />
whose number of CPUs and amount of allocated memory
can be configured. <br />
Each task processes only one data split at a time.*

In both environments (local and cloud) we will therefore use **Spark**,
and we will drive it from Python scripts using **PySpark**.

In the <u>local version</u> of our script we **simulate
distributed computing** in order to validate that our solution works.<br />
In the <u>cloud version</u> we **perform operations on a cluster of machines**.
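
As a rough analogy for the driver/executor split described above, the sketch below uses only the Python standard library, not Spark. The names `split_into_partitions`, `task` and `run_job` are hypothetical and exist only for this illustration: a "driver" function cuts the data into partitions, hands one task per partition to a pool of workers, and gathers the results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Cut a list into contiguous partitions of roughly equal size."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def task(partition):
    """One task handles exactly one partition (here: square each value)."""
    return [value * value for value in partition]


def run_job(items, num_workers=4):
    """Play the role of the driver: split, dispatch one task per partition, collect."""
    partitions = split_into_partitions(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        # map() returns the results in the same order as the partitions.
        results = list(workers.map(task, partitions))
    return [value for part in results for value in part]


squares = run_job(list(range(10)))
print(squares)
```

Spark applies the same idea at a much larger scale, with executors running as separate JVM processes, possibly on different machines.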

## Transfer Learning

**Transfer learning** consists of
reusing the knowledge already acquired
by a trained model (here **MobileNetV2**) and
adapting it to our problem.

We are going to feed our images to the model and
<u>recover the output of the penultimate layer</u>.
Indeed, the last layer of the model is a softmax layer
that classifies the images, which
we do not want in this project.

The output of the penultimate layer is a **reduced
feature vector** of dimension (1,1,1280).

This will make it possible to create a first version of the engine
for classification of fruit images.

**MobileNetV2** was selected for its <u>speed of execution</u>,
which is particularly suitable for processing a large volume
of data, as well as for the <u>low dimensionality of its
output feature vector</u> (1,1,1280).
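
The idea of keeping everything except the softmax layer can be illustrated with a deliberately tiny, pure-Python stand-in. This is not MobileNetV2 and not the Keras API: `backbone`, `softmax_head`, `FULL_MODEL` and `FEATURE_EXTRACTOR` are hypothetical names, and the "features" are just three summary statistics. The point is only that the feature extractor is the full model with its last layer removed.

```python
import math


def backbone(pixels):
    """Stand-in for the pretrained feature layers: returns 3 summary features."""
    return [min(pixels), sum(pixels) / len(pixels), max(pixels)]


def softmax_head(features):
    """Stand-in for the final softmax layer: turns features into 2 class probabilities."""
    scores = [features[0] + features[1], features[2] - features[1]]
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


FULL_MODEL = [backbone, softmax_head]   # the last layer performs the classification
FEATURE_EXTRACTOR = FULL_MODEL[:-1]     # same layers, without the softmax head


def run(layers, x):
    """Apply the layers one after the other."""
    for layer in layers:
        x = layer(x)
    return x


image = [0.0, 0.5, 1.0]                    # made-up "image" with three pixel values
features = run(FEATURE_EXTRACTOR, image)   # output of the penultimate layer
probabilities = run(FULL_MODEL, image)     # what the full classifier would return
print(features, probabilities)
```

With a real pretrained network the same truncation yields the 1280-value feature vector mentioned above instead of three toy features.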

## Working environment

We carried out this project on a macOS operating system. <br>
The installation steps are briefly described below as a reminder for later use.<br>
<br>

```bash
brew install openjdk@11
brew install python
brew install apache-spark
```  

Add Spark to the environment variables by adding the following lines to your <br>
.bash_profile, .zshrc or equivalent depending on your shell: <br>

```bash
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
```  
<br>
Reload your shell configuration file: <br>

```bash
source ~/.zshrc  # or source ~/.bash_profile
```  

#### If you run into a Spark <u>heap size error</u>: <br>

```bash
sudo code /usr/local/Cellar/apache-spark/3.5.1/libexec/conf/spark-defaults.conf
```  
and add the following lines:

```bash
spark.driver.memory 15g
spark.driver.maxResultSize 2g
``` 

## 1. Library Import

In [1]:
import io
import os

import numpy as np
from PIL import Image
from pyspark.ml.feature import PCA
from pyspark.sql import SparkSession
from pyspark.sql.functions import element_at, split, udf

## 2. Setting PATHs to load images and save results

In this local version we assume that the data
are stored in a **data** folder next to the notebook.<br />
We only use an extract of **300 images** to process in this
first local version.<br />
The extract of images to load is stored in the **data/Sample** folder.<br />
We will save the result of our processing
in the "**data/results**" folder.

In [7]:
PATH = os.getcwd()
PATH_Data = PATH+'/data/Sample'
PATH_Result = PATH+'/data/results'

In [8]:
PATHX = '/ai-cloud-computing-spark/train'
PATH_DataX = PATHX+'/data/Sample'
PATH_ResultX = PATHX+'/data/results'
print(f'PATH:        {PATHX}',
      f'PATH_Data:   {PATH_DataX}',
      f'PATH_Result: {PATH_ResultX}',
      sep='\n')

PATH:        /ai-cloud-computing-spark/train
PATH_Data:   /ai-cloud-computing-spark/train/data/Sample
PATH_Result: /ai-cloud-computing-spark/train/data/results


## 3. Creating the SparkSession

A Spark application is controlled through a driver process called **SparkSession**. <br />
<u>A **SparkSession** instance is how Spark executes user-defined operations <br />
throughout the cluster</u>. <u>A SparkSession always corresponds to exactly one Spark application</u>. <br>

To avoid session creation bugs we first stop any previously created session. <br>

<u>Here we create a Spark session by specifying, in order</u>:
 1. a **name for the application**, which will be displayed in the Spark web UI: "**ai-cloud-computing-spark**"
 2. that the application must run **locally**. <br />
    With `.master('local')` Spark runs with a single worker thread; <br />
    `.master('local[4]')` would use 4 cores and `.master('local[*]')` all available cores.<br />
 3. an additional configuration option, `spark.sql.parquet.writeLegacyFormat`, which writes **"parquet" format** files <br />
    in the legacy layout; Parquet is the format we will use to save and load the result of our work.
 4. that we want to **get an existing Spark session** or, if none exists, create a new one

In [9]:
# Stop the session that is still active, if any, so that the new
# configuration is applied (getActiveSession requires PySpark 3.0+).
active_session = SparkSession.getActiveSession()
if active_session is not None:
    active_session.stop()

spark = (
    SparkSession.builder
    .appName('ai-cloud-computing-spark')
    .master('local')
    .config('spark.sql.parquet.writeLegacyFormat', 'true')
    .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/30 11:15:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## 4. Loading data

Images are loaded in binary format, which offers
more flexibility in how to preprocess them. <br>
Before loading the images we specify that we want to load
only files with the **jpg** extension. <br>
We also ask Spark to load all matching files contained
in the subfolders of the given folder.
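
To make the two loading rules concrete (only `.jpg` files, subfolders included), here is a minimal standard-library sketch of the same selection on a local disk. It is not how Spark reads the files; `load_jpg_files` is a hypothetical helper, and the temporary folder and file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def load_jpg_files(root):
    """Return (path, content) pairs for every .jpg file under root, subfolders included."""
    return [
        (str(path), path.read_bytes())
        for path in sorted(Path(root).rglob("*.jpg"))
        if path.is_file()
    ]


# Small demonstration on a throwaway folder with one image and one other file.
with tempfile.TemporaryDirectory() as root:
    fruit_dir = Path(root) / "Apple Braeburn"
    fruit_dir.mkdir()
    (fruit_dir / "0_100.jpg").write_bytes(b"\xff\xd8\xff\xe0")  # first bytes of a JPEG
    (fruit_dir / "notes.txt").write_text("ignored: not a .jpg file")
    files = load_jpg_files(root)

print(len(files), files[0][0])
```

Only the `.jpg` file nested in the subfolder is returned, together with its raw bytes, which mirrors the `path` and `content` columns of the Spark DataFrame.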

<u>Display of the first rows, each containing</u>:
 - the image path
 - the date and time of its last modification
 - its length
 - its content, displayed as hexadecimal values
 - its label

In [10]:
df = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .option("recursiveFileLookup", "true") \
  .load(PATH_Data)
df = df.withColumn('label', element_at(split(df['path'], '/'),-2))
df.show()

                                                                                

+--------------------+--------------------+------+--------------------+--------------+
|                path|    modificationTime|length|             content|         label|
+--------------------+--------------------+------+--------------------+--------------+
|file:/Users/gaeld...|2024-05-30 10:56:...|  5656|[FF D8 FF E0 00 1...|Apple Braeburn|
|file:/Users/gaeld...|2024-05-30 10:56:...|  5627|[FF D8 FF E0 00 1...|Apple Braeburn|
|file:/Users/gaeld...|2024-05-30 10:56:...|  5613|[FF D8 FF E0 00 1...|Apple Braeburn|
|file:/Users/gaeld...|2024-05-30 10:56:...|  5611|[FF D8 FF E0 00 1...|Apple Braeburn|
|file:/Users/gaeld...|2024-05-30 10:56:...|  5611|[FF D8 FF E0 00 1...|Apple Braeburn|
|file:/Users/gaeld...|2024-05-30 10:56:...|  5606|[FF D8 FF E0 00 1...|Apple Braeburn|
|file:/Users/gaeld...|2024-05-30 10:56:...|  5606|[FF D8 FF E0 00 1...|Apple Braeburn|
|file:/Users/gaeld...|2024-05-30 10:56:...|  5602|[FF D8 FF E0 00 1...|Apple Braeburn|
|file:/Users/gaeld...|2024-05-30 10:56:...|
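
The `label` column is built by splitting the path on `/` and taking the second element from the end, that is, the name of the folder containing the image. A plain-Python equivalent of that expression, with a hypothetical helper name and a made-up example path, looks like this:

```python
def label_from_path(path):
    """Return the second-to-last '/'-separated segment, i.e. the parent folder name."""
    return path.split("/")[-2]


# Hypothetical path following the <folder>/<label>/<image>.jpg layout.
example_path = "file:/data/Sample/Apple Braeburn/0_100.jpg"
label = label_from_path(example_path)
print(label)
```

This only works if every image sits directly inside a folder named after its fruit, which is the layout assumed by the notebook.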

<u>Data type for each column in the Spark Dataframe</u> <br>

- path: Character string (can be null)
- modificationTime: Timestamp (can be null)
- length: Long integer (can be null)
- content: Binary data (can be null)
- label: Character string (can be null)

In [11]:
df.printSchema()

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)
 |-- label: string (nullable = true)



## 5. PCA

In the introduction, we presented our objective of reducing the dimension of the data. <br>
To do this we carry out a PCA on the features of the images. <br>
<br>
The Spark documentation tells us that PCA ([Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis)) can only be applied to dense or sparse vectors. <br>
- Dense Vector : explicitly stores all of its elements.
- Sparse Vector : stores only non-zero elements and their indices
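
The difference between the two representations can be sketched in plain Python. The helpers `to_sparse` and `to_dense` are hypothetical and only illustrate the storage idea; in PySpark the corresponding objects are created with `Vectors.dense(...)` and `Vectors.sparse(size, indices, values)` from `pyspark.ml.linalg`.

```python
def to_sparse(dense):
    """Keep only the non-zero entries, as (size, indices, values)."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list, filling the missing positions with zeros."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense


dense_vector = [0.0, 3.0, 0.0, 0.0, 1.5]
sparse_vector = to_sparse(dense_vector)
print(sparse_vector)
print(to_dense(*sparse_vector))
```

Both forms describe the same vector; the sparse form saves memory when most entries are zero.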

PCA is very slow with Spark. To find out how many principal components to keep, we carry out a PCA <br> in another notebook and visualize the cumulative explained variance.

#### <u>We can see that the cumulative variance can be explained by only 31 components.</u>

![PCA Graph](img/PCA_grap.png)
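
Reading the number of components off such a curve amounts to finding the first point where the cumulative explained variance reaches a chosen threshold. The sketch below shows that step with the standard library only; `components_for_variance` is a hypothetical helper, and the ratios and thresholds are made-up values, not the ones measured in this project.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold):
    """Return the smallest number of leading components whose cumulative
    explained variance reaches the threshold (all of them if it never does)."""
    for count, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= threshold:
            return count
    return len(explained_variance_ratio)


# Made-up ratios, sorted from the most to the least informative component.
ratios = [0.5, 0.3, 0.1, 0.1]
print(components_for_variance(ratios, 0.75))
print(components_for_variance(ratios, 0.95))
```

The threshold itself is a modelling choice: a higher value keeps more information but also more components.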

Keeping only these 31 components simplifies the image, which should help the model focus on the most informative features. <br>
Here is a visualization of the original image and of the image reconstructed from 31 components. <br>

![simplified Image](img/simplified_image.png)