<a href="https://colab.research.google.com/github/PiotrMaciejKowalski/kurs-analiza-danych-2022/blob/main/Tydzie%C5%84%206/MLlib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark Setup

## Creating a PySpark environment for computation

We create our own PySpark environment inside our cloud resources.

In [1]:
# Install a headless Java 8 JDK; Spark runs on the JVM
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
!wget -q ftp.ps.pl/pub/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz

In [3]:
!tar xf spark-3.1.2-bin-hadoop2.7.tgz

In [4]:
import os
# Point Spark at the JDK and at the unpacked Spark distribution
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"

In [5]:
!pip install -q findspark

# findspark locates the Spark installation (via SPARK_HOME) and makes pyspark importable
import findspark
findspark.init()
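A wrong path in either environment variable usually only surfaces later, when the session fails to start. The sketch below is a minimal pre-flight check, assuming only that both variables should point at existing directories; the helper name `missing_spark_paths` is hypothetical and not part of findspark or PySpark.

```python
import os

def missing_spark_paths(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point at a directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# Check the real environment of the current process
problems = missing_spark_paths(os.environ)
if problems:
    print("Check these variables before starting Spark:", problems)
else:
    print("JAVA_HOME and SPARK_HOME both point at existing directories")
```

An empty list means both directories exist; it does not prove that they contain a working JDK or Spark distribution.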

## Creating a PySpark session


We create a test session to check that everything works. This step is the same when the Spark systems run continuously rather than being started by our session.

In [6]:
from pyspark.sql import SparkSession
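spark = (
    SparkSession.builder
    .master("local")                  # run Spark locally inside the Colab VM
    .appName("Colab")
    .config('spark.ui.port', '4050')  # port for the Spark web UI
    .getOrCreate()                    # reuse a running session if there is one
)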
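Creating the session does not by itself run any job, so a tiny computation is one way to confirm that it really works. The following is a minimal sketch; the helper name `session_smoke_test` is hypothetical, and it relies only on `SparkSession.range` and `DataFrame.count`.

```python
def session_smoke_test(spark, n=5):
    """Return True when the session can run a trivial job over n rows."""
    # spark.range(n) builds a one-column DataFrame with ids 0..n-1;
    # count() is an action, so it forces Spark to actually execute the job.
    return spark.range(n).count() == n
```

In the notebook this would be called as `session_smoke_test(spark)` right after the session is created.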

## Mounting Google Drive in the Colab session

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# The MLlib package

Although the default way of working with modern data analysis models is to build them natively in Python, this sometimes makes it impossible to, for example, put the model into production. Apache Spark is one of the most capable data processing tools available.

As an example, we will use a much simpler dataset here: Iris.

In [9]:
!ls /content/drive/MyDrive/

 21830110_E_Faktura_20211021.pdf
'Big Data Roles in Lugano.rtf.gdoc'
 cdr_d.csv
'Colab Notebooks'
 draw-io
 flights.csv
'Gmail labels.gsheet'
 iris.data
 irtOw3miIoGwMAw255_XGvQZm8BS0w0K0yL-Ym_w2tQ.png
 karta_lokalizacji_pasazera.pdf
 PART_1650571844111_Resized_20220421_220802.jpeg
 PART_1650571850277_Resized_20220421_220747.jpeg
 PART_1650571874607_Resized_20220421_220901.jpeg
'PART_1650571907856_Resized_20220421_220930 (1).jpeg'
 PART_1650571907856_Resized_20220421_220930.jpeg
 Poland1004_09_13.eu4
'Print tickets'
'SE calc.ods'
'SE calc.ods.gsheet'
 spark-3.1.1-bin-hadoop2.7
 spark-warehouse
 Takeout


In [10]:
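# Column names for the header-less iris.data file.
# Note: the UCI file orders columns as sepal length/width, then petal length/width, so P_ and S_ look swapped if they mean petal/sepal.
pola_zbiorczo = '''P_LEN,P_WIDTH,S_LEN,S_WIDTH,SPECIES'''
pola = pola_zbiorczo.split(',')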

In [11]:
from pyspark.sql.types import StructType, StringType, IntegerType, BooleanType, FloatType, TimestampType, DateType, ArrayType, MapType
from typing import List, Tuple, Dict, Any
map_python_types_2_spark_types = {
    str : StringType(),
    int : IntegerType(),
    bool : BooleanType(),
    float: FloatType(),
    'timestamp' : TimestampType(),
    'date' : DateType(),
    List[str] : ArrayType(StringType()),
    Tuple[str] : ArrayType(StringType()),
    Dict[str, str] : MapType(StringType(), StringType())
}

column_type_collection = {
    float : [ 'P_LEN','P_WIDTH','S_LEN','S_WIDTH' ],
    str : [ 'SPECIES' ]
}

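# Resolve each column name to its Spark type, keeping the column order of `pola`
map_column_names_2_types = {}

for pole in pola:
  for python_type, column_list in column_type_collection.items():
    if pole in column_list:
      # A column missing from every list is silently skipped
      map_column_names_2_types[pole] = map_python_types_2_spark_types[python_type]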

print(map_column_names_2_types)

{'P_LEN': FloatType, 'P_WIDTH': FloatType, 'S_LEN': FloatType, 'S_WIDTH': FloatType, 'SPECIES': StringType}
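The loop above silently skips any field that is missing from `column_type_collection`, which would later produce a schema with fewer columns than the file. The sketch below inverts the `{type: [columns]}` dictionary once and reports unmapped columns explicitly. The helper names `invert_type_collection` and `unmapped_columns` are hypothetical, and plain Python types stand in for the Spark types so that the example runs without Spark.

```python
def invert_type_collection(type_to_columns):
    """Turn {type: [columns]} into {column: type}, rejecting duplicate columns."""
    column_to_type = {}
    for python_type, columns in type_to_columns.items():
        for column in columns:
            if column in column_to_type:
                raise ValueError(f"column {column!r} is listed under two types")
            column_to_type[column] = python_type
    return column_to_type

def unmapped_columns(columns, column_to_type):
    """Return the columns that have no type assigned, in their original order."""
    return [column for column in columns if column not in column_to_type]

column_type_collection = {
    float: ['P_LEN', 'P_WIDTH', 'S_LEN', 'S_WIDTH'],
    str: ['SPECIES'],
}
pola = ['P_LEN', 'P_WIDTH', 'S_LEN', 'S_WIDTH', 'SPECIES']

column_to_type = invert_type_collection(column_type_collection)
print(unmapped_columns(pola, column_to_type))              # []
print(unmapped_columns(pola + ['EXTRA'], column_to_type))  # ['EXTRA']
```

Checking that the first list is empty before building the schema makes a forgotten column fail early instead of shifting the data.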


In [14]:
# Build the schema field by field; every column is declared nullable
schemat = StructType()
for pole, typ in map_column_names_2_types.items():
    schemat = schemat.add(pole, typ, nullable=True)
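`DataFrameReader.schema` also accepts a DDL-formatted string in place of a `StructType`, which can be shorter for flat schemas like this one. The sketch below builds such a string in plain Python; the helper `ddl_schema` and the `SPARK_SQL_NAMES` table are hypothetical names introduced for illustration, and they cover only the four simple types listed.

```python
# Spark SQL type names for a few plain Python types (illustrative subset)
SPARK_SQL_NAMES = {float: "FLOAT", str: "STRING", int: "INT", bool: "BOOLEAN"}

def ddl_schema(columns, column_to_type):
    """Build a DDL schema string such as 'A FLOAT, B STRING'."""
    return ", ".join(
        f"{name} {SPARK_SQL_NAMES[column_to_type[name]]}" for name in columns
    )

pola = ['P_LEN', 'P_WIDTH', 'S_LEN', 'S_WIDTH', 'SPECIES']
column_to_type = {**dict.fromkeys(pola[:4], float), 'SPECIES': str}

ddl = ddl_schema(pola, column_to_type)
print(ddl)  # P_LEN FLOAT, P_WIDTH FLOAT, S_LEN FLOAT, S_WIDTH FLOAT, SPECIES STRING

# In the notebook the string could then be passed to the reader, for example:
# spark.read.format('csv').option("header", False).schema(ddl).load(path)
```

The `StructType` route remains the better fit when fields need metadata or nested types.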

In [15]:
iris = spark.read.format('csv').option("header", False).schema(schemat).load('/content/drive/MyDrive/iris.data')
iris.show(5)

+-----+-------+-----+-------+-----------+
|P_LEN|P_WIDTH|S_LEN|S_WIDTH|    SPECIES|
+-----+-------+-----+-------+-----------+
|  5.1|    3.5|  1.4|    0.2|Iris-setosa|
|  4.9|    3.0|  1.4|    0.2|Iris-setosa|
|  4.7|    3.2|  1.3|    0.2|Iris-setosa|
|  4.6|    3.1|  1.5|    0.2|Iris-setosa|
|  5.0|    3.6|  1.4|    0.2|Iris-setosa|
+-----+-------+-----+-------+-----------+
only showing top 5 rows



In [16]:
iris.printSchema()

root
 |-- P_LEN: float (nullable = true)
 |-- P_WIDTH: float (nullable = true)
 |-- S_LEN: float (nullable = true)
 |-- S_WIDTH: float (nullable = true)
 |-- SPECIES: string (nullable = true)



# Preprocessing

To make full use of the Iris dataset, we encode the SPECIES column as numbers, which gives us access to all of the analytical models.

In [23]:
# Collect the distinct species labels to the driver as a list of one-element lists
types = iris.select('SPECIES').distinct().toPandas().values.tolist()
types

[['Iris-virginica'], ['Iris-setosa'], ['Iris-versicolor']]

In [36]:
# One (label, code) row per species; a single dict would become one row with a column per label
map_type = [(typ[0], i) for i, typ in enumerate(types)]
convert_species = spark.createDataFrame(data=map_type, schema=['type', 'species_code'])
convert_species.show()

+---------------+------------+
|           type|species_code|
+---------------+------------+
| Iris-virginica|           0|
|    Iris-setosa|           1|
|Iris-versicolor|           2|
+---------------+------------+
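A lookup table of species codes is only useful once it is joined back onto the measurements. Because `distinct()` does not guarantee row order, the codes can also differ between runs. The sketch below assigns codes in sorted label order and shows one possible join, assuming a lookup DataFrame with columns `type` and `species_code`; the helper names `species_code_rows` and `add_species_code` are hypothetical.

```python
def species_code_rows(labels):
    """Return [(label, code), ...] with codes assigned in sorted label order."""
    return [(label, code) for code, label in enumerate(sorted(set(labels)))]

def add_species_code(iris_df, lookup_df):
    """Left-join a (type, species_code) lookup DataFrame onto the iris DataFrame."""
    joined = iris_df.join(lookup_df, iris_df["SPECIES"] == lookup_df["type"], "left")
    return joined.drop("type")  # the label is already present as SPECIES

labels = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
rows = species_code_rows(labels)
print(rows)  # [('Iris-setosa', 0), ('Iris-versicolor', 1), ('Iris-virginica', 2)]

# Possible notebook usage, with `types` flattened from its one-element lists:
# rows = species_code_rows([typ[0] for typ in types])
# lookup = spark.createDataFrame(rows, schema=['type', 'species_code'])
# iris_coded = add_species_code(iris, lookup)
```

MLlib's `StringIndexer` (in `pyspark.ml.feature`) is an alternative that fits the label-to-index mapping from the data itself.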



In [35]:
# Side example: a Python dict used as a field value is inferred as a map column
dataDictionary = [
        ('James',{'hair':'black','eye':'brown'}),
        ('Michael',{'hair':'brown','eye':None}),
        ('Robert',{'hair':'red','eye':'black'}),
        ('Washington',{'hair':'red','eye':'grey'}),
        ('Jefferson',{'hair':'red','eye':''})
        ]
df = spark.createDataFrame(data=dataDictionary, schema = ["name","properties"])
df.printSchema()
df.show(truncate=False)

root
 |-- name: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James     |{eye -> brown, hair -> black}|
|Michael   |{eye -> null, hair -> brown} |
|Robert    |{eye -> black, hair -> red}  |
|Washington|{eye -> grey, hair -> red}   |
|Jefferson |{eye -> , hair -> red}       |
+----------+-----------------------------+

