### [INDEX]

* [Spark Creations]
    * [Creating a SparkContext, method 1: SparkSession](#1)
    * [Creating a SparkContext, method 2: SparkSession and SparkConf](#2)
    * [Creating a SparkContext, method 3: SparkContext and SparkConf](#3)
    * [Creating an RDD from Python lists](#4)
    * [Creating an RDD from a Python dictionary](#5)
    * [Creating an RDD from text files](#6)
* [Basic Transformations and Actions](#7)
    * [RDD transformations](#8)
    * [2 RDD transformations](#9)
    * [Basic Actions on one RDD](#10)
* [MAP vs FLATMAP](#11)
* [TEST ON MAP vs FLATMAP](#12)
* [MAP vs FLATMAP Functions](#13)
* [RDD FILTER transformation](#14)
* [RDD-JOIN](#15)
* [PAIR RDD Operations](#16)
* [Excel-Dataframe-RDD](#17)
* [BroadcastVariablesOps](#18)
* [RDD_Wordcount](#19)
* [AND Others](#20)

In [1]:
# RDD: a collection that is parallelized (distributed) across the cluster

In [2]:
import findspark
findspark.init("/Users/resitkadir/spark/spark-2.4.6/")

# Create SparkContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark import SparkContext

### Creating a SparkContext, method 1: SparkSession <a class="anchor" id="1"></a>

In [3]:
# Adjust the settings below to match your machine's memory
spark = SparkSession.builder \
        .master("local[4]") \
        .appName("RDD-1") \
        .config("spark.executor.memory","4g") \
        .config("spark.driver.memory","2g") \
        .getOrCreate()

# Keep a short alias for the SparkContext
sc = spark.sparkContext
#sc.stop()

### Creating a SparkContext, method 2: SparkSession and SparkConf <a class="anchor" id="2"></a>

In [4]:
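# Spark properties are set with .set(); setExecutorEnv() only defines
# environment variables for the executors and does not change Spark settings
conf =   SparkConf() \
        .setMaster("local[4]") \
        .setAppName("RDD-Olusturmak-2") \
        .set("spark.executor.memory","4g") \
        .set("spark.driver.memory","4g")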

pyspark = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

sc = pyspark.sparkContext
sc.stop()

### Creating a SparkContext, method 3: SparkContext and SparkConf <a class="anchor" id="3"></a>

In [5]:
sparkConf = SparkConf() \
            .setMaster("local[4]") \
            .setAppName("RDD-Olusturmak-3") \
            .set("spark.executor.memory","2g") \
            .set("spark.driver.memory","1g")



sc = SparkContext(conf=sparkConf)

#sc.stop()

## Creating an RDD from Python lists <a class="anchor" id="4"></a>

In [6]:
rdd = sc.parallelize([('Ahmet',25),('Cemal',29),('İnci',38),('Burcu',33)])
rdd.take(2)

[('Ahmet', 25), ('Cemal', 29)]

In [7]:
rdd2 = sc.parallelize([['Ahmet',25],['Cemal',29],['İnci',38],['Burcu',33]])
rdd2.take(3)

[['Ahmet', 25], ['Cemal', 29], ['İnci', 38]]

In [8]:
#Count
rdd2.count()

4

In [9]:
sayilarRDD = sc.parallelize([[1,2,3],[4,5,6]])
sayilarRDD.take(3)


[[1, 2, 3], [4, 5, 6]]

In [10]:
sc.stop()

## Creating an RDD from a Python dictionary <a class="anchor" id="5"></a>

In [11]:
# Create a dictionary
my_dict ={
    "Ogrenci":['Ali','Mehmet','Ayse'],
    "Notlar":[70,80,90]}
my_dict

{'Ogrenci': ['Ali', 'Mehmet', 'Ayse'], 'Notlar': [70, 80, 90]}

In [12]:
# Convert the dict to a pandas DataFrame
import pandas as pd
pdDF = pd.DataFrame(my_dict)
pdDF.head()

  Ogrenci  Notlar
0     Ali      70
1  Mehmet      80
2    Ayse      90


In [13]:
conf =   SparkConf() \
        .setMaster("local[4]") \
        .setAppName("RDD-Olusturmak-2") \
        .set("spark.executor.memory","4g") \
        .set("spark.driver.memory","4g")

pyspark = SparkSession.builder \
        .config(conf=conf) \
        .getOrCreate()

sc = pyspark.sparkContext

In [14]:
rdd_from_pandasDF = pyspark.createDataFrame(pdDF)
rdd_from_pandasDF.show()

+-------+------+
|Ogrenci|Notlar|
+-------+------+
|    Ali|    70|
| Mehmet|    80|
|   Ayse|    90|
+-------+------+



In [15]:
# Get the underlying RDD of the Spark DataFrame
rdd_from_pandas = rdd_from_pandasDF.rdd
rdd_from_pandas.take(3)

[Row(Ogrenci='Ali', Notlar=70),
 Row(Ogrenci='Mehmet', Notlar=80),
 Row(Ogrenci='Ayse', Notlar=90)]

### Creating an RDD from text files <a class="anchor" id="6"></a>

In [16]:
rdd_metin = sc.textFile("datasets/OnlineRetail.csv")
rdd_metin.take(2)


['InvoiceNo;StockCode;Description;Quantity;InvoiceDate;UnitPrice;CustomerID;Country',
 '536365;85123A;WHITE HANGING HEART T-LIGHT HOLDER;6;1.12.2010 08:26;2,55;17850;United Kingdom']

In [17]:
sc.stop()

# Basic Transformations and Actions <a class="anchor" id="7"></a>

In [128]:

import findspark
findspark.init("/Users/resitkadir/spark/spark-3.0.0/")
import pyspark 

# Create SparkContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("RDD-Olusturmak") \
    .config("spark.executor.memory","4g") \
    .config("spark.driver.memory","2g") \
    .getOrCreate()

sc = spark.sparkContext

**1 RDD transformations**<a class="anchor" id="8"></a>

<font color='green'>**map()** </font>

<font color='green'>**filter()** </font>

<font color='green'>**flatMap()**  </font>
 "Applies the function to each element, as map does, and then flattens the results into a single RDD"

<font color='green'>**distinct()  :**  </font>*removes duplicates, keeping each value only once*
                
<font color='green'>**sample()**  </font>


![exa](IMG/RDD_1.png)


<font color='green'>**reduceByKey()** : </font> *Merges the values of each key with the given function. With addition, it returns an RDD holding the sum of the values of each key.*
     
   **The values are summed per key**
   
     key 1 keeps 2; key 3 gets 4+6=10
 
    currentRDD{(1,2),(3,4),(3,6)} -> reduceByKey((x,y) => x+y) -> newRDD{(1,2),(3,10)}

In [129]:
# Reduce by key
rdd = sc.parallelize([(1,2),(3,4),(3,6)])
rdd.reduceByKey(lambda x,y :x+y).take(3)


[(1, 2), (3, 10)]
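
To make the per-key merge concrete, here is a minimal plain-Python model of what `reduceByKey` computes. It needs no Spark, ignores partitioning and shuffling, and `reduce_by_key` is a hypothetical helper name rather than a PySpark function.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func, keeping first-seen key order."""
    merged = {}
    for key, value in pairs:
        # First value of a key is stored as-is; later values are merged into it
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


pairs = [(1, 2), (3, 4), (3, 6)]
result = reduce_by_key(pairs, lambda x, y: x + y)
print(result)  # [(1, 2), (3, 10)]
```

Key 1 has a single value, so the function is never called for it; key 3 is merged once, 4 + 6.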

<font color='green'>**groupByKey()** : </font> *Groups the values by key*
    
       currentRDD{(1,2),(3,4),(3,6)} -> groupByKey() -> newRDD{(1,(2)),(3,(4,6))}

In [130]:
# Group by key
rdd.groupByKey().take(3)

[(1, <pyspark.resultiterable.ResultIterable at 0x7f9604dab810>),
 (3, <pyspark.resultiterable.ResultIterable at 0x7f9604dab550>)]
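
The output above shows `ResultIterable` objects instead of the grouped values. In PySpark a common way to see them is `rdd.groupByKey().mapValues(list).collect()`. The sketch below models the same grouping in plain Python, without Spark; `group_by_key` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


grouped = group_by_key([(1, 2), (3, 4), (3, 6)])
print(grouped)  # [(1, [2]), (3, [4, 6])]
```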

<font color='green'>**mapValues()** : </font> *Applies the given function to the values of a pair RDD, for example multiplying each value by 100*
    
       currentRDD{(1,2),(3,4),(3,6)} -> rdd1.mapValues(x => x*100) -> newRDD{(1,200),(3,400),(3,600)}

In [131]:
rdd.mapValues(lambda x :x*100).collect()

[(1, 200), (3, 400), (3, 600)]

<font color='green'>**keys()** : </font>*Returns an RDD holding the keys of the pair RDD*

       currentRDD{(1,2),(3,4),(3,6)} -> rdd1.keys() -> newRDD{1,3,3}
    

In [132]:
rdd.keys().collect()

[1, 3, 3]

<font color='green'>**values()** : </font>*Returns an RDD holding the values of the pair RDD*

       currentRDD{(1,2),(3,4),(3,6)} -> rdd1.values() -> newRDD{2,4,6}
    

In [133]:
rdd.values().collect()

[2, 4, 6]

<font color='green'>**sortByKey()** : </font>*Returns an RDD sorted by key*

       currentRDD{(1,2),(3,4),(3,6)} -> rdd1.sortByKey() -> newRDD{(1,2),(3,4),(3,6)}
    

In [134]:
rdd.sortByKey().collect()

[(1, 2), (3, 4), (3, 6)]

**take**

In [19]:
list_ = [x for x in range(10)]
list_
liste_rdd =sc.parallelize(list_)
liste_rdd.take(10)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [20]:
#MAP
liste_rdd.map(lambda x : x**2 ).take(4)

[0, 1, 4, 9]

In [21]:
#Filter
liste_rdd.filter(lambda x : x < 5 ).take(5)

[0, 1, 2, 3, 4]

In [22]:
text_ = ["He is really cool","she is at home.","it doesn't say so much"]
# Create an RDD with parallelize
text_rdd = sc.parallelize(text_)
text_rdd.take(3)

['He is really cool', 'she is at home.', "it doesn't say so much"]

In [23]:
text_rdd.map(lambda x : x.upper()).take(3)

['HE IS REALLY COOL', 'SHE IS AT HOME.', "IT DOESN'T SAY SO MUCH"]

In [24]:
# flatMap over a string flattens it into individual characters
text_rdd.flatMap(lambda x : x.upper()).take(3)

['H', 'E', ' ']

In [25]:
# Split each line into words, then upper-case each word
text_rdd.flatMap(lambda x : x.split(" ")).map(lambda x : x.upper()).take(4)

['HE', 'IS', 'REALLY', 'COOL']
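
The difference between `map` and `flatMap` in the cells above can be modelled without Spark. This is a plain-Python sketch using the same three sentences: a list comprehension stands in for `map`, and `itertools.chain.from_iterable` stands in for the flattening step of `flatMap`.

```python
from itertools import chain

lines = ["He is really cool", "she is at home.", "it doesn't say so much"]

# map: exactly one output element per input element (here, one list per line)
mapped = [line.split(" ") for line in lines]

# flatMap: same function, but the per-line results are flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

# A string is itself iterable, so flattening str.upper() yields single characters
chars = list(chain.from_iterable(line.upper() for line in lines))

print(len(mapped), len(flat_mapped))  # 3 13
print(flat_mapped[:4])                # ['He', 'is', 'really', 'cool']
print(chars[:3])                      # ['H', 'E', ' ']
```

This is why `flatMap(lambda x: x.upper())` returns letters: the function returns a string, and flattening a string walks over its characters.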

In [26]:
# distinct: only the unique values remain
list_2 = [1,1,2,2,4,5,6,6]
list_2_rdd = sc.parallelize(list_2)
list_2_rdd.distinct().take(10)

[4, 1, 5, 2, 6]

In [27]:
# fraction=0.7: with replacement, each element is expected to be drawn about 0.7 times
# seed=42 keeps the sample the same on every run
list_2_rdd.sample(True,0.7,42).take(10)


[2, 2, 2, 2, 4, 6, 6]

**2 RDD transformations**<a class="anchor" id="9"></a>
![exa](IMG/RDD_2.png)
<font color='green'>**union()** : </font> "Combines the elements of two RDDs and returns a single RDD; duplicates are kept"

    --> rdd1 = {1,2,9,4,5,36}

    --> rdd2 = {1,4,9,16,25,36}

    rdd1.union(rdd2) -> rddUnion {1,2,9,4,5,36,1,4,9,16,25,36}
    
<font color='green'>**intersection()**  : </font>
 "Returns the elements common to both RDDs"

    rdd1.intersection(rdd2) -> rddIntersection{1,4,9,36}

<font color='green'>**subtract()**   : </font>
  "Keeps only the elements of the first RDD that do not appear in the second"
    
     rdd1.subtract(rdd2) --> rddSubtract{2,5}
     
     rdd2.subtract(rdd1) --> rddSubtract{16,25}

<font color='green'>**cartesian()**  : </font> "Returns every possible pair (a, b), with a from the first RDD and b from the second"

    rdd1.cartesian(rdd2) --> {(1,1),(1,4),(1,9),.....,(36,36)}

<font color='green'>**subtractByKey()** : </font>**Removes the elements whose key appears in the other RDD**

    *Drop the pairs whose key is 3 and keep the rest*

    rdd1->{(1,2),(3,4),(3,6)}
    
    rdd2->{(3,9)}
    
    rdd1.subtractByKey(rdd2)-->{(1,2)}
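
`subtractByKey` compares keys, so both sides need to hold (key, value) pairs. A minimal plain-Python model of that behaviour follows; `subtract_by_key` is a hypothetical helper, not a PySpark call. The last lines show what happens when an element is a bare int: there is no `[0]` to read a key from.

```python
def subtract_by_key(left, right):
    """Keep the pairs of left whose key does not occur as a key in right."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]


left = [(1, 2), (3, 4), (3, 6)]
remaining = subtract_by_key(left, [(3, 9)])
print(remaining)  # [(1, 2)]

# An int is not a (key, value) pair, so indexing it fails
element = 3
try:
    element[0]
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```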

In [135]:
rdd = sc.parallelize([(1,2),(3,4),(3,6)])
rdd.collect()

[(1, 2), (3, 4), (3, 6)]

In [136]:
# subtractByKey compares keys, so rdd2 must also hold (key, value) pairs.
# Passing plain ints, as in sc.parallelize([3,9]), fails at collect() with
# "TypeError: 'int' object is not subscriptable".
rdd2 = sc.parallelize([(3,9)])
rdd2.collect()

[(3, 9)]

In [137]:
rdd.subtractByKey(rdd2).collect()

[(1, 2)]


In [28]:
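import subprocess as sp
# "cls" is the Windows clear-screen command; on macOS/Linux the shell cannot find it,
# hence the exit code 127 below (the equivalent there is "clear")
sp.call("cls",shell=True)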

127

In [29]:
rdd1 = sc.parallelize([1,2,9,4,5,36])
rdd2 = sc.parallelize([1,4,9,16,25,36])
#union
rdd1.union(rdd2).take(12) #[1, 2, 9, 4, 5, 36, 1, 4, 9, 16, 25, 36]
#intersection
rdd1.intersection(rdd2).take(6) #[1, 9, 4, 36]
#subtract
rdd1.subtract(rdd2).take(5) #[2, 5]
rdd2.subtract(rdd1).take(5) # [16, 25]
#Cartesian
rdd1.cartesian(rdd2).take(3)

[(1, 1), (1, 4), (1, 9)]
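
The four two-RDD operations above can be checked against a plain-Python model. This sketch uses ordinary lists and sets instead of RDDs, so the element order of `intersection` and `subtract` here is chosen by the sketch, whereas Spark makes no ordering promise for them.

```python
from itertools import product

left = [1, 2, 9, 4, 5, 36]
right = [1, 4, 9, 16, 25, 36]

union = left + right                                 # duplicates are kept
intersection = sorted(set(left) & set(right))        # common elements, deduplicated
left_minus_right = [x for x in left if x not in right]
right_minus_left = [x for x in right if x not in left]
cartesian = list(product(left, right))               # every (a, b) combination

print(union)             # [1, 2, 9, 4, 5, 36, 1, 4, 9, 16, 25, 36]
print(intersection)      # [1, 4, 9, 36]
print(left_minus_right)  # [2, 5]
print(right_minus_left)  # [16, 25]
print(len(cartesian))    # 36
```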

**Basic Actions on one RDD (usage)**<a class="anchor" id="10"></a>

<font color='green'> **collect()** : </font> *Returns all elements of the RDD to the driver machine. Running it on a large RDD is dangerous: for example, with a 4 GB driver and 500 GB of data, the driver runs out of memory.*

    --> rdd1 = {1,2,9,4,5,36}
    The result is returned as a list
    rdd1.collect() --> [1,2,9,4,5,36]
    
    
    

<font color='green'>**count() :** </font>*Counts the elements*

    rdd1.count()--> Long=6

<font color='green'>**countByValue() :** </font>*Counts how many times each element occurs in the RDD and returns the result as a map (a dict in PySpark)*

**Each entry holds the value first and then the number of times it occurs**
      
      
     rdd1.countByValue() --> Map[Int,Long] = Map(5->1,1->1,....,36->1)

<font color='green'>**take() :** </font> *Returns the requested number of elements from the RDD*

    rdd1.take(3)-->Array[Int] = Array(1,2,9)

<font color='green'>**top() :** </font>*Returns the n largest values in descending order (here the top 3)*


    rdd1.top(3)-->Array[Int]=Array(36,9,5)

<font color='green'>**union()** : </font>

    

<font color='green'>**takeOrdered() :** </font>

*Sorts the elements of the RDD and returns the requested number of them as an array*

    rdd1.takeOrdered(6)--->Array[Int]-->Array(1,2,4,5,9,36)

<font color='green'>**takeSample() :** </font>*Returns an array holding a sample of the requested size from the RDD*

**Parameters:**

**withReplacement :** Boolean (is a drawn element put back so that it can be drawn again?),

**num:Int** (how many elements to draw),

**seed:Long** (should every run return the same elements? In the example, seed 42 always yields 9,2,5; without a fixed seed the result changes)

    rdd1.takeSample(false,3,42)-->Array[Int] = Array(9,2,5)

<font color='green'>**reduce() :** </font>*Combines the elements of the RDD in parallel with the given function and produces a single result, for example a sum*

    (1,2,4,5,9,36)
    
    rdd1.reduce((x,y) => x+y) -->57

<font color='green'>**fold() :** </font>*Same as reduce, except that it takes an initial zero value*

**fold(zero)(func)**

    rdd1.fold(0)((x,y) => x+y)-->57

<font color='green'>**aggregate() :** </font> *Applies the aggregation function (seqOp) to the elements of each partition, then merges the per-partition results with the combine function (combOp)*

**aggregate(zero)(seqOp,combOp)**

    rdd1.aggregate((0,0))
    
    ((x,y) => (x._1+y, x._2+1),
     (x,y) => (x._1+y._1, x._2+y._2))  ->(Int,Int)=(57,6)
               
<font color='green'>**foreach()  :** </font>*Applies a function to every element of the RDD; it returns nothing*

In [30]:
rdd = sc.parallelize([1,2,9,4,2,4,5,1,1,7])
#Count
rdd.count()

10

In [31]:
# countByValue: how many times each number occurs
rdd.countByValue()

defaultdict(int, {1: 3, 2: 2, 9: 1, 4: 2, 5: 1, 7: 1})

In [32]:
#take
rdd.take(3)

[1, 2, 9]

In [33]:
# top: the n largest elements, in descending order
rdd.top(5)

[9, 7, 5, 4, 4]

In [34]:
# takeOrdered: like take, but on the sorted elements (ascending)
rdd.takeOrdered(4)

[1, 1, 1, 2]

In [35]:
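# takeSample: draw 5 elements with seed 33.
# Note: the first argument should be a boolean. The string "false" is truthy,
# so these calls sample WITH replacement, which is why 1 can appear four times
# although the RDD holds only three 1s. Pass False to sample without replacement.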
rdd.takeSample("false",5,33)

[1, 1, 1, 1, 9]

In [36]:
rdd.takeSample("false",5,33)

[1, 1, 1, 1, 9]

In [37]:
rdd.takeSample("True",5,33)

[1, 1, 1, 1, 9]
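
All three `takeSample` calls above return the same list, whether the first argument is `"false"` or `"True"`. A likely reason is Python truthiness: any non-empty string counts as true, so both strings request sampling with replacement. The sketch below shows this in plain Python with the standard `random` module; it does not reproduce Spark's sampler, so the drawn values will differ from the outputs above.

```python
import random

data = [1, 2, 9, 4, 2, 4, 5, 1, 1, 7]

flag = "false"
print(bool(flag))   # True: any non-empty string is truthy
print(bool(False))  # False

rng = random.Random(33)                      # fixed seed for reproducibility
without_replacement = rng.sample(data, 5)    # each position is drawn at most once
with_replacement = rng.choices(data, k=5)    # positions may be drawn repeatedly

# Without replacement a value can never occur more often than in the data,
# so at most three 1s are possible; with replacement there is no such limit.
print(without_replacement.count(1) <= data.count(1))  # True
```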

In [38]:
#Reduce
rdd.reduce(lambda x,y :x+y)

36

In [39]:
#Fold
rdd.fold(0,lambda x,y :x+y)

36
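
With a zero value of 0, `fold` and `reduce` give the same sum. A different zero value shows where they diverge: as far as I understand PySpark's `fold`, the zero value is used once per partition and once more when the partition results are combined. The plain-Python model below assumes that behaviour; the `fold` helper and the split into four partitions are hypothetical.

```python
from functools import reduce


def fold(partitions, zero, op):
    """Fold each partition starting from zero, then fold the partial results."""
    per_partition = [reduce(op, part, zero) for part in partitions]
    return reduce(op, per_partition, zero)


data = [1, 2, 9, 4, 2, 4, 5, 1, 1, 7]
# Hypothetical split into 4 partitions (the real split is decided by Spark)
partitions = [data[0:2], data[2:5], data[5:7], data[7:10]]


def add(x, y):
    return x + y


total_zero_0 = fold(partitions, 0, add)
total_zero_1 = fold(partitions, 1, add)
print(total_zero_0)  # 36
print(total_zero_1)  # 41 = 36 + 4 partitions + 1 final combine
```

The practical rule is to pick a zero value that is neutral for the operation: 0 for addition, 1 for multiplication.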

[1,2,3,9,4,10,5,36,8]

**sc.parallelize distributes the data**

*The first three elements (1,2,3) go to thread 1,*

<font color='green'>**x[0]** </font> starts from the zero value 0 and accumulates the sum: in thread 1 the first y is 1, while in thread 2 x[0] again starts at 0 and the first y is 9

<font color='green'>**x[1]** </font> also starts from 0 and counts the number of elements

*and (9,4,10) go to thread 2*

![Aggregate](IMG/Aggregate.png)

In [40]:
#aggregate
rdd_a = [1,2,3,9,4,10,5,36,8]
rdd_a = sc.parallelize(rdd_a)
rdd_a.aggregate((0,0),(lambda x,y : (x[0] + y , x[1]+1)),(lambda x , y : (x[0] + y[0],x[1]+y[1])))


(78, 9)
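
The two lambdas passed to `aggregate` can be traced by hand with a plain-Python model. The sketch assumes the data is split into three partitions of three elements, as in the description above; the `aggregate` helper is hypothetical and Spark may split the data differently.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Run seq_op inside each partition, then merge the partials with comb_op."""
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, per_partition, zero)


data = [1, 2, 3, 9, 4, 10, 5, 36, 8]
partitions = [data[0:3], data[3:6], data[6:9]]  # assumed split


def seq_op(acc, value):
    # acc is a (running sum, running count) tuple, value is one element
    return (acc[0] + value, acc[1] + 1)


def comb_op(a, b):
    # a and b are both (sum, count) tuples coming from different partitions
    return (a[0] + b[0], a[1] + b[1])


total, count = aggregate(partitions, (0, 0), seq_op, comb_op)
print(total, count)   # 78 9
print(total / count)  # the mean, computed in a single pass
```

The per-partition results are (6, 3), (23, 3) and (49, 3), which combine to (78, 9).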

In [41]:
sc.stop()

**MAP vs FLATMAP**<a class="anchor" id="11"></a>


In [42]:
import findspark
findspark.init("/Users/resitkadir/spark/spark-3.0.0/")
from pyspark import SparkContext
#
sc = SparkContext("local[4]","map_flat_Map")
#
ppl_RDD= sc.textFile( "datasets/simple_data.csv")
#
ppl_RDD.take(5)

['sirano,isim,yas,meslek,sehir,aylik_gelir',
 '1,Cemal,35,Isci,Ankara,3500',
 '2,Ceyda,42,Memur,Kayseri,4200',
 '3,Timur,30,Müzisyen,Istanbul,9000',
 '4,Burcu,29,Pazarlamaci,Ankara,4200']

In [43]:
ppl_RDD_ = ppl_RDD.filter(lambda x : "sirano" not in x)
# Drop the header row (the line containing "sirano")
ppl_RDD_.take(5)

['1,Cemal,35,Isci,Ankara,3500',
 '2,Ceyda,42,Memur,Kayseri,4200',
 '3,Timur,30,Müzisyen,Istanbul,9000',
 '4,Burcu,29,Pazarlamaci,Ankara,4200',
 '5,Yasemin,23,Pazarlamaci,Bursa,4800']

In [44]:
# map works on each line as a whole
ppl_RDD_.map(lambda x : x.upper()).take(5)

['1,CEMAL,35,ISCI,ANKARA,3500',
 '2,CEYDA,42,MEMUR,KAYSERI,4200',
 '3,TIMUR,30,MÜZISYEN,ISTANBUL,9000',
 '4,BURCU,29,PAZARLAMACI,ANKARA,4200',
 '5,YASEMIN,23,PAZARLAMACI,BURSA,4800']

In [45]:
# flatMap flattens each line into its characters
ppl_RDD_.flatMap(lambda x :x.upper()).take(5)

['1', ',', 'C', 'E', 'M']

In [46]:
ppl_RDD_.flatMap(lambda x :x.split(",")).map(lambda x :x.upper()).take(15)

['1',
 'CEMAL',
 '35',
 'ISCI',
 'ANKARA',
 '3500',
 '2',
 'CEYDA',
 '42',
 'MEMUR',
 'KAYSERI',
 '4200',
 '3',
 'TIMUR',
 '30']

In [47]:
sc.stop()

**TEST ON MAP vs FLATMAP**<a class="anchor" id="12"></a>

**1. Using the SparkContext class, write a Spark application named "Test" that runs in local mode with 2 cores, 2 GB of driver memory and 3 GB of executor memory, and prints "Hello Spark" to the screen.**

In [48]:
import findspark
findspark.init("/Users/resitkadir/spark/spark-3.0.0/")

In [49]:
# The task asks for the SparkContext class, so build it with that
from pyspark import SparkContext
from pyspark.conf import SparkConf

In [50]:
spark_conf =   SparkConf(). \
                  setMaster("local[2]"). \
                setAppName("Test"). \
                set("spark.driver.memory","2g"). \
                set("spark.executor.memory","3g")

In [51]:
# Create the SparkContext
sc = SparkContext(conf=spark_conf)
print("hello Spark")

hello Spark


**2.) Create an RDD from the numbers 3,7,13,15,22,36,7,11,3,25.**

In [52]:
rdd_=[3,7,13,15,22,36,7,11,3,25]
rdd_s =sc.parallelize(rdd_)
rdd_s.collect()

[3, 7, 13, 15, 22, 36, 7, 11, 3, 25]

**3. Convert all letters of the sentence "Spark'ı öğrenmek çok heyecan verici" ("Learning Spark is very exciting") to upper case.**

In [53]:
a = ["Spark'ı öğrenmek çok heyecan verici"]
a=sc.parallelize(a)
a.map(lambda x : x.upper()).take(4)

# Approach 2
text_rdd = sc.parallelize(["Spark'ı öğrenmek çok heyecan verici"])
text_rdd.map(lambda x :x.upper()).collect()

["SPARK'I ÖĞRENMEK ÇOK HEYECAN VERICI"]

**4.) Read the text file at https://github.com/veribilimiokulu/udemy-apache-spark/blob/master/docs/Ubuntu_Spark_Kurulumu.txt with Spark and print how many lines it contains.**

In [54]:
text_ = sc.textFile("datasets/soru_4.txt").count()
text_

76

**5. Read the text file at https://github.com/veribilimiokulu/udemy-apache-spark/blob/master/docs/Ubuntu_Spark_Kurulumu.txt with Spark and print how many words it contains. (Words may repeat.)**

In [55]:
text_ = sc.textFile("datasets/soru_4.txt")
text_.flatMap(lambda x : x.split(" ")).map(lambda x : x.upper()).count()


237

**6. Using a Spark application, print the intersection (the common numbers) of the number list from question 2 and the list 1,2,3,4,5,6,7,8,9,10.**

In [56]:
rdd_s.collect()
rdd_2 = sc.parallelize([1,2,3,4,5,6,7,8,9,10 ])
rdd_s.intersection(rdd_2).collect()

[3, 7]

**7. Create an RDD holding the distinct values (no repeated numbers) of the numbers from question 2.**

In [57]:
rdd_s.collect()

[3, 7, 13, 15, 22, 36, 7, 11, 3, 25]

In [58]:
rdd_s.distinct().collect()

[22, 36, 3, 7, 13, 15, 11, 25]

**8. Write a Spark application that finds how many times each number from question 2 occurs in the list (its frequency).**

In [59]:
rdd_2=sc.parallelize([3,7,13,15,22,36,7,11,3,25])
rdd_2.collect()

[3, 7, 13, 15, 22, 36, 7, 11, 3, 25]

In [60]:
rdd_2.countByValue()

defaultdict(int, {3: 2, 7: 2, 13: 1, 15: 1, 22: 1, 36: 1, 11: 1, 25: 1})

In [61]:
rdd_2.map(lambda x :(x,1)).collect()

[(3, 1),
 (7, 1),
 (13, 1),
 (15, 1),
 (22, 1),
 (36, 1),
 (7, 1),
 (11, 1),
 (3, 1),
 (25, 1)]

In [62]:
rdd_2.map(lambda x :(x,1)).reduceByKey(lambda x,y :x+y).sortByKey().collect()

[(3, 2), (7, 2), (11, 1), (13, 1), (15, 1), (22, 1), (25, 1), (36, 1)]

In [63]:
rdd_2.map(lambda x :(x,1)).reduceByKey(lambda x,y :x+y).collect()

[(22, 1), (36, 1), (3, 2), (7, 2), (13, 1), (15, 1), (11, 1), (25, 1)]
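
Both approaches to question 8 compute the same frequencies. The plain-Python sketch below mirrors them without Spark: `collections.Counter` plays the role of `countByValue`, and a dictionary merge stands in for `map` to `(x, 1)` followed by `reduceByKey` and `sortByKey`.

```python
from collections import Counter

numbers = [3, 7, 13, 15, 22, 36, 7, 11, 3, 25]

# countByValue style: count occurrences directly
by_value = Counter(numbers)

# map + reduceByKey style: emit (x, 1) pairs, then add up the ones per key
pairs = [(x, 1) for x in numbers]
by_key = {}
for key, one in pairs:
    by_key[key] = by_key.get(key, 0) + one

frequencies = sorted(by_key.items())  # sortByKey
print(frequencies)
# [(3, 2), (7, 2), (11, 1), (13, 1), (15, 1), (22, 1), (25, 1), (36, 1)]
print(dict(by_value) == by_key)  # True
```

`countByValue` is an action that brings the whole result to the driver, while the `reduceByKey` version stays an RDD until `collect()` is called, which matters when the number of distinct values is large.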

In [64]:
sc.stop()

**MAP vs FLATMAP Functions**<a class="anchor" id="13"></a>

In [65]:
## If you get an error here, remember to install findspark by running "pip install findspark" from the command line.
import findspark
findspark.init("/Users/resitkadir/spark/spark-3.0.0/")
import pyspark # only run after findspark.init()

from pyspark import SparkConf, SparkContext
#sc = SparkContext("local","RDD-Olusturmak")

conf = SparkConf() \
        .setMaster("local[4]") \
        .setAppName("RDD_Olusturmak") \
        .set("spark.executor.memory", "4g") \
        .set("spark.driver.memory","2g")

sc = SparkContext(conf=conf)

In [66]:
# filter(lambda x: "InvoiceNo" not in x) drops the header row
retailRDD = sc.textFile("datasets/OnlineRetail.csv") \
.filter(lambda x: "InvoiceNo" not in x)

In [67]:
# Look at the first row: has the header been removed?
retailRDD.first()

'536365;85123A;WHITE HANGING HEART T-LIGHT HOLDER;6;1.12.2010 08:26;2,55;17850;United Kingdom'

**MAP transformation**

     Multiply Quantity by UnitPrice to get the transaction amount, and check whether InvoiceNo starts with the letter C
    to record, as a boolean in a new column, whether the transaction was cancelled
    
    Write the operations applied inside map() as a function


In [68]:
def my_func(line):
    fields = line.split(";")  # split once and reuse the fields
    isCancelled = fields[0].startswith("C")  # cancelled invoices start with "C"
    total = float(fields[3]) * float(fields[5].replace(",","."))  # Quantity * UnitPrice
    return (isCancelled, total)


In [69]:
retailMapPriceRDD = retailRDD.map(my_func) 
    

In [70]:
retailMapPriceRDD.take(3)

[(False, 15.299999999999999), (False, 20.34), (False, 22.0)]
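
The parsing step can be tried on single lines without starting Spark. In this sketch `parse_line` is a hypothetical stand-in for the function passed to `map()`; the first sample line is taken from the dataset output above, the second is an invented cancelled invoice used only for illustration.

```python
def parse_line(line):
    """Return (is_cancelled, total) for one semicolon-separated retail line."""
    fields = line.split(";")
    is_cancelled = fields[0].startswith("C")          # InvoiceNo starting with C
    quantity = float(fields[3])
    unit_price = float(fields[5].replace(",", "."))   # decimal comma -> decimal point
    return is_cancelled, quantity * unit_price


sale = ("536365;85123A;WHITE HANGING HEART T-LIGHT HOLDER;6;"
        "1.12.2010 08:26;2,55;17850;United Kingdom")
# Hypothetical cancelled invoice, not a row from the dataset
cancelled = "C000000;00000;EXAMPLE ITEM;-2;1.12.2010 09:00;1,50;00000;United Kingdom"

print(parse_line(sale))
print(parse_line(cancelled))  # (True, -3.0)
```

The long decimals in the output above (15.299999999999999 rather than 15.3) come from binary floating-point arithmetic, not from the data.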

In [71]:
# Filter the cancelled transactions
retailMapPriceRDD.filter(lambda x: x[0] == True).take(10)

[(True, -27.5),
 (True, -4.65),
 (True, -19.799999999999997),
 (True, -6.959999999999999),
 (True, -6.959999999999999),
 (True, -6.959999999999999),
 (True, -41.400000000000006),
 (True, -19.799999999999997),
 (True, -39.599999999999994),
 (True, -25.5)]

In [72]:
# Count the cancelled transactions
retailMapPriceRDD.filter(lambda x: x[0] == True).count()

9288

In [73]:
# flatMap transformation
retailFlatMapSplittedRDD = retailRDD.flatMap(lambda x: x.split(";"))
retailFlatMapSplittedRDD.count()

4335272

In [74]:
retailRDD.count()

541909

In [75]:
# Upper-case every field produced by flatMap
retailFlatMapUpper = retailRDD.flatMap(lambda x: x.split(";")).map(lambda x: x.upper())
retailFlatMapUpper.take(15)

['536365',
 '85123A',
 'WHITE HANGING HEART T-LIGHT HOLDER',
 '6',
 '1.12.2010 08:26',
 '2,55',
 '17850',
 'UNITED KINGDOM',
 '536365',
 '71053',
 'WHITE METAL LANTERN',
 '6',
 '1.12.2010 08:26',
 '3,39',
 '17850']

In [76]:
## Total amount of the cancelled sales
retailMapPriceRDD.reduceByKey(lambda x,y: x + y).take(2)

retailMapPriceRDD.reduceByKey(lambda x,y: x + y) \
                .filter(lambda x: x[0] == True) \
                .map(lambda x: x[1]) \
                .take(2)

[-896812.4900000116]

In [77]:
print("True") if("ile".startswith("e")) else print("False") 

False


In [78]:
"ile".replace("e","a")

'ila'

In [79]:
float("2,5".replace(",","."))

2.5

In [80]:
def cancelled_price(line):
    fields = line.split(";")  # split once and reuse the fields
    is_cancelled = fields[0].startswith("C")
    quantity = float(fields[3])
    price = float(fields[5].replace(",","."))
    
    total = quantity * price
    return (is_cancelled, total)

In [81]:
retailTotal = retailRDD.map(cancelled_price)
retailTotal.take(5)

[(False, 15.299999999999999),
 (False, 20.34),
 (False, 22.0),
 (False, 20.34),
 (False, 20.34)]

In [82]:
reducedTotal = retailTotal.reduceByKey(lambda x,y:x+y)
reducedTotal.filter(lambda x: x[0] == True).take(2)
reducedTotal.filter(lambda x: x[0] == True).map(lambda x: x[1]).take(2)

[-896812.4900000116]

In [83]:
sc.stop()

### RDD_Filter_Transformation <a class="anchor" id="14"></a>

In [84]:
import findspark
findspark.init("/Users/resitkadir/spark/spark-3.0.0/")
# Create SparkContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# Adjust the settings below to match your machine's memory
spark = SparkSession.builder \
        .master("local[4]") \
        .appName("Dataset-Olusturmak") \
        .config("spark.executor.memory","4g") \
        .config("spark.driver.memory","2g") \
        .getOrCreate()

# Keep a short alias for the SparkContext
sc = spark.sparkContext

retailRDD = sc.textFile("datasets/OnlineRetail.csv").filter(lambda x: 'InvoiceNo' not in x)


In [85]:
retailRDD.take(3)

['536365;85123A;WHITE HANGING HEART T-LIGHT HOLDER;6;1.12.2010 08:26;2,55;17850;United Kingdom',
 '536365;71053;WHITE METAL LANTERN;6;1.12.2010 08:26;3,39;17850;United Kingdom',
 '536365;84406B;CREAM CUPID HEARTS COAT HANGER;8;1.12.2010 08:26;2,75;17850;United Kingdom']

In [86]:
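## Removing the header row
# Note: retailRDD was already filtered on 'InvoiceNo' above, so first() here
# returns the first data row, and the subtract() below removes every line equal to it.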
firstline = retailRDD.first()
firstlinerdd = sc.parallelize([firstline])

In [87]:
retailRDDWithoutHeader = retailRDD.subtract(firstlinerdd)

In [88]:
retailRDDWithoutHeader.take(5)

['536367;84969;BOX OF 6 ASSORTED COLOUR TEASPOONS;6;1.12.2010 08:34;4,25;13047;United Kingdom',
 '536369;21756;BATH BUILDING BLOCK WORD;3;1.12.2010 08:35;5,95;13047;United Kingdom',
 '536370;22326;ROUND SNACK BOXES SET OF4 WOODLAND;24;1.12.2010 08:45;2,95;12583;France',
 '536370;21731;RED TOADSTOOL LED NIGHT LIGHT;24;1.12.2010 08:45;1,65;12583;France',
 '536372;22632;HAND WARMER RED POLKA DOT;6;1.12.2010 09:01;1,85;17850;United Kingdom']

In [89]:
### Filter the orders whose InvoiceNo is 536367
# by writing the InvoiceNo value as a string
retailRDDWithoutHeader.filter(lambda line: line.split(";")[0] == '536367').take(10)

['536367;84969;BOX OF 6 ASSORTED COLOUR TEASPOONS;6;1.12.2010 08:34;4,25;13047;United Kingdom',
 '536367;84879;ASSORTED COLOUR BIRD ORNAMENT;32;1.12.2010 08:34;1,69;13047;United Kingdom',
 '536367;21755;LOVE BUILDING BLOCK WORD;3;1.12.2010 08:34;5,95;13047;United Kingdom',
 "536367;22745;POPPY'S PLAYHOUSE BEDROOM;6;1.12.2010 08:34;2,1;13047;United Kingdom",
 '536367;22310;IVORY KNITTED MUG COSY;6;1.12.2010 08:34;1,65;13047;United Kingdom',
 '536367;48187;DOORMAT NEW ENGLAND;4;1.12.2010 08:34;7,95;13047;United Kingdom',
 '536367;22623;BOX OF VINTAGE JIGSAW BLOCKS;3;1.12.2010 08:34;4,95;13047;United Kingdom',
 '536367;21754;HOME BUILDING BLOCK WORD;3;1.12.2010 08:34;5,95;13047;United Kingdom',
 "536367;22748;POPPY'S PLAYHOUSE KITCHEN;6;1.12.2010 08:34;2,1;13047;United Kingdom",
 '536367;22749;FELTCRAFT PRINCESS CHARLOTTE DOLL;8;1.12.2010 08:34;3,75;13047;United Kingdom']

In [90]:
### Filter the products whose name contains COFFEE
retailRDDWithoutHeader.filter(lambda line: 'COFFEE' in line.split(";")[2]).take(20)

['536739;85159A;BLACK TEA,COFFEE,SUGAR JARS;2;2.12.2010 13:08;6,35;14180;United Kingdom',
 '536750;37370;RETRO COFFEE MUGS ASSORTED;6;2.12.2010 14:04;1,06;17850;United Kingdom',
 '536787;37370;RETRO COFFEE MUGS ASSORTED;6;2.12.2010 15:24;1,06;17850;United Kingdom',
 '536804;37370;RETRO COFFEE MUGS ASSORTED;72;2.12.2010 16:34;1,06;14031;United Kingdom',
 '536805;37370;RETRO COFFEE MUGS ASSORTED;12;2.12.2010 16:38;1,25;14775;United Kingdom',
 '536864;21216;SET 3 RETROSPOT TEA,COFFEE,SUGAR;1;3.12.2010 11:27;11,02;000000;United Kingdom',
 '536865;37370;RETRO COFFEE MUGS ASSORTED;1;3.12.2010 11:28;16,13;000000;United Kingdom',
 '537126;21216;SET 3 RETROSPOT TEA,COFFEE,SUGAR;1;5.12.2010 12:13;4,95;18118;United Kingdom',
 '537231;22304;COFFEE MUG BLUE PAISLEY DESIGN;6;6.12.2010 09:21;2,55;13652;United Kingdom',
 '537236;21216;SET 3 RETROSPOT TEA,COFFEE,SUGAR;8;6.12.2010 09:52;4,95;16858;United Kingdom',
 '537369;72122;COFFEE SCENT PILLAR CANDLE;1;6.12.2010 12:41;0,95;17860;United Kingdom',
 '

In [91]:
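### Purchases priced above 2000.0
# Note: index 6 is CustomerID, so this call keeps rows whose CustomerID is above 2000 (see the output).
# UnitPrice is at index 5 and uses a decimal comma (e.g. '4,25'), so it needs replace(",", ".") before float().
retailRDDWithoutHeader.filter(lambda line: float(line.split(";")[6]) > 2000.0).take(5)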

['536367;84969;BOX OF 6 ASSORTED COLOUR TEASPOONS;6;1.12.2010 08:34;4,25;13047;United Kingdom',
 '536369;21756;BATH BUILDING BLOCK WORD;3;1.12.2010 08:35;5,95;13047;United Kingdom',
 '536370;22326;ROUND SNACK BOXES SET OF4 WOODLAND;24;1.12.2010 08:45;2,95;12583;France',
 '536370;21731;RED TOADSTOOL LED NIGHT LIGHT;24;1.12.2010 08:45;1,65;12583;France',
 '536372;22632;HAND WARMER RED POLKA DOT;6;1.12.2010 09:01;1,85;17850;United Kingdom']
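
In the column order used by this notebook (InvoiceNo;StockCode;Description;Quantity;InvoiceDate;UnitPrice;CustomerID;Country), UnitPrice sits at index 5 and is written with a decimal comma, so `float()` cannot parse it directly. A minimal sketch of a price parser follows. The helper name `unit_price` and the 10.0 threshold in the demo are assumptions for illustration, and thousands separators are not handled.

```python
def unit_price(line):
    """Parse the UnitPrice field (index 5), which uses a decimal comma such as '4,25'."""
    return float(line.split(";")[5].replace(",", "."))

# Two rows copied from the outputs above
sample = [
    "536367;84969;BOX OF 6 ASSORTED COLOUR TEASPOONS;6;1.12.2010 08:34;4,25;13047;United Kingdom",
    "536865;37370;RETRO COFFEE MUGS ASSORTED;1;3.12.2010 11:28;16,13;000000;United Kingdom",
]

# Keep only the rows whose unit price exceeds the demo threshold
expensive = [line for line in sample if unit_price(line) > 10.0]

# With Spark, a filter on the actual price could then be written as:
# retailRDDWithoutHeader.filter(lambda line: unit_price(line) > 2000.0).take(5)
```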

In [92]:
### Filtering with a named function
# Quantity > 10
def miktari_ondan_buyukler(x):
    quantity = x.split(";")[3]   # Quantity is the fourth field
    return int(quantity) > 10


In [93]:
retailRDDWithoutHeader.filter(miktari_ondan_buyukler).take(5)

['536370;22326;ROUND SNACK BOXES SET OF4 WOODLAND;24;1.12.2010 08:45;2,95;12583;France',
 '536370;21731;RED TOADSTOOL LED NIGHT LIGHT;24;1.12.2010 08:45;1,65;12583;France',
 '536378;85183B;CHARLIE & LOLA WASTEPAPER BIN FLORA;48;1.12.2010 09:37;1,25;14688;United Kingdom',
 '536381;22719;GUMBALL MONOCHROME COAT RACK;36;1.12.2010 09:41;1,06;15311;United Kingdom',
 '536384;22470;HEART OF WICKER LARGE;40;1.12.2010 09:53;2,55;18074;United Kingdom']

In [94]:
### A slightly more complex filter function
# Transactions that took place in a given country on or after a given date
import datetime
# InvoiceNo;StockCode;Description;Quantity;InvoiceDate;UnitPrice;CustomerID;Country
def daha_karmasik_filtre(x):
    # Split once and pick only the fields the filter needs
    fields = x.split(";")
    InvoiceDate = fields[4]
    Country = fields[7]
    
    # Dates look like '1.12.2010 09:58' (day.month.year hour:minute)
    tarih = datetime.datetime.strptime(InvoiceDate, "%d.%m.%Y %H:%M")
    
    return tarih >= datetime.datetime(2010, 12, 1, 9, 58) and Country.startswith('United')

In [95]:
retailRDDWithoutHeader.filter(daha_karmasik_filtre).take(10)

['536387;79321;CHILLI LIGHTS;192;1.12.2010 09:58;3,82;16029;United Kingdom',
 '536388;21411;GINGHAM HEART  DOORSTOP RED;3;1.12.2010 09:59;4,25;16250;United Kingdom',
 '536388;22922;FRIDGE MAGNETS US DINER ASSORTED;12;1.12.2010 09:59;0,85;16250;United Kingdom',
 '536388;22469;HEART OF WICKER SMALL;12;1.12.2010 09:59;1,65;16250;United Kingdom',
 '536388;22242;5 HOOK HANGER MAGIC TOADSTOOL;12;1.12.2010 09:59;1,65;16250;United Kingdom',
 '536390;22960;JAM MAKING SET WITH JARS;12;1.12.2010 10:19;3,75;17511;United Kingdom',
 '536390;20668;DISCO BALL CHRISTMAS DECORATION;288;1.12.2010 10:19;0,1;17511;United Kingdom',
 '536390;22197;SMALL POPCORN HOLDER;100;1.12.2010 10:19;0,72;17511;United Kingdom',
 '536390;21786;POLKADOT RAIN HAT;144;1.12.2010 10:19;0,32;17511;United Kingdom',
 '536390;22174;PHOTO CUBE;48;1.12.2010 10:19;1,48;17511;United Kingdom']
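
When several filters touch the same columns, it can help to parse each line once into a typed record and filter on field names. The sketch below is one way to do that under the column order used in this notebook; `RetailRow` and `parse_retail_line` are hypothetical names introduced for illustration.

```python
import datetime
from collections import namedtuple

# Field names follow the header comment used in this notebook
RetailRow = namedtuple(
    "RetailRow",
    "InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country",
)

def parse_retail_line(line):
    """Split a ';'-separated line once and convert the typed fields."""
    f = line.split(";")
    return RetailRow(
        InvoiceNo=f[0],
        StockCode=f[1],
        Description=f[2],
        Quantity=int(f[3]),
        InvoiceDate=datetime.datetime.strptime(f[4], "%d.%m.%Y %H:%M"),
        UnitPrice=float(f[5].replace(",", ".")),  # decimal comma -> decimal point
        CustomerID=f[6],
        Country=f[7],
    )

row = parse_retail_line(
    "536387;79321;CHILLI LIGHTS;192;1.12.2010 09:58;3,82;16029;United Kingdom"
)

# With Spark, one might parse once and then filter on named fields:
# parsed = retailRDDWithoutHeader.map(parse_retail_line)
# parsed.filter(lambda r: r.Quantity > 10 and r.Country.startswith("United"))
```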

In [96]:
sc.stop()

# RDD_Join-01 <a class="anchor" id="15"></a>

In [97]:
import findspark
findspark.init("/Users/resitkadir/spark/spark-3.0.0/")

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("RDDJoin").setMaster("local[4]")

sc = SparkContext(conf=conf)
# Read the data
# Read order_items and drop the header row
order_items_rdd = sc.textFile("datasets/retail_db/order_items.csv") \
                    .filter(lambda x: "orderItemName" not in x) \
                    .repartition(4)

order_items_rdd.take(5)

['11,5,1014,2,99.96,49.98',
 '12,5,957,1,299.98,299.98',
 '13,5,403,1,129.99,129.99',
 '14,7,1073,1,199.99,199.99',
 '15,7,957,1,299.98,299.98']

In [98]:
# Read products and drop the header row
products_rdd = sc.textFile("datasets/retail_db/products.csv") \
                 .filter(lambda x: "productDescription" not in x) \
                 .repartition(4)

products_rdd.take(5)

['11,2,Fitness Gear 300 lb Olympic Weight Set,,209.99,http://images.acmesports.sports/Fitness+Gear+300+lb+Olympic+Weight+Set',
 "12,2,Under Armour Men's Highlight MC Alter Ego Fla,,139.99,http://images.acmesports.sports/Under+Armour+Men%27s+Highlight+MC+Alter+Ego+Flash+Football...",
 "13,2,Under Armour Men's Renegade D Mid Football Cl,,89.99,http://images.acmesports.sports/Under+Armour+Men%27s+Renegade+D+Mid+Football+Cleat",
 '14,2,Quik Shade Summit SX170 10 FT. x 10 FT. Canop,,199.99,http://images.acmesports.sports/Quik+Shade+Summit+SX170+10+FT.+x+10+FT.+Canopy',
 "15,2,Under Armour Kids' Highlight RM Alter Ego Sup,,59.99,http://images.acmesports.sports/Under+Armour+Kids%27+Highlight+RM+Alter+Ego+Superman+Football..."]

In [99]:
# STEP: CONVERT THE LOADED DATA INTO PAIR RDDs

# Build a pair RDD from order_items, keyed by product id
def make_order_items_pair_rdd(line):
    fields = line.split(",")   # split once, then pick fields by position
    orderItemName = fields[0]
    orderItemOrderId = fields[1]
    orderItemProductId = fields[2]
    orderItemQuantity = fields[3]
    orderItemSubTotal = fields[4]
    orderItemProductPrice = fields[5]
    
    # The key is the product id so that the RDD can be joined with products
    return (orderItemProductId, (orderItemName, orderItemOrderId, orderItemQuantity, 
                                 orderItemSubTotal, orderItemProductPrice))
order_item_pair_rdd = order_items_rdd.map(make_order_items_pair_rdd)
order_item_pair_rdd.take(5)

[('1014', ('11', '5', '2', '99.96', '49.98')),
 ('957', ('12', '5', '1', '299.98', '299.98')),
 ('403', ('13', '5', '1', '129.99', '129.99')),
 ('1073', ('14', '7', '1', '199.99', '199.99')),
 ('957', ('15', '7', '1', '299.98', '299.98'))]

In [100]:
# Build a pair RDD from products, keyed by product id
def make_products_pair_rdd(line):
    fields = line.split(",")   # split once, then pick fields by position
    productId = fields[0]
    productCategoryId = fields[1]
    productName = fields[2]
    productDescription = fields[3]
    productPrice = fields[4]
    productImage = fields[5]
    
    return (productId, (productCategoryId, productName, productDescription, productPrice, productImage))

products_pair_rdd = products_rdd.map(make_products_pair_rdd)
products_pair_rdd.take(2)

[('11',
  ('2',
   'Fitness Gear 300 lb Olympic Weight Set',
   '',
   '209.99',
   'http://images.acmesports.sports/Fitness+Gear+300+lb+Olympic+Weight+Set')),
 ('12',
  ('2',
   "Under Armour Men's Highlight MC Alter Ego Fla",
   '',
   '139.99',
   'http://images.acmesports.sports/Under+Armour+Men%27s+Highlight+MC+Alter+Ego+Flash+Football...'))]

In [101]:
# JOIN STEP
# Inner join on the key (product id); each result is (key, (order_item_values, product_values))
order_items_product_pair_rdd = order_item_pair_rdd.join(products_pair_rdd)
order_items_product_pair_rdd.take(2)

[('957',
  (('12', '5', '1', '299.98', '299.98'),
   ('43',
    "Diamondback Women's Serene Classic Comfort Bi",
    '',
    '299.98',
    'http://images.acmesports.sports/Diamondback+Women%27s+Serene+Classic+Comfort+Bike+2014'))),
 ('957',
  (('15', '7', '1', '299.98', '299.98'),
   ('43',
    "Diamondback Women's Serene Classic Comfort Bi",
    '',
    '299.98',
    'http://images.acmesports.sports/Diamondback+Women%27s+Serene+Classic+Comfort+Bike+2014')))]
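
The nested shape of the join result is easier to read with a plain-Python model of an inner join on `(key, value)` pairs. The sketch below only mimics the output shape of `join` on small lists; `inner_join` is a hypothetical helper, and the records are shortened for illustration.

```python
from collections import defaultdict

def inner_join(left_pairs, right_pairs):
    """Mimic the shape of an inner join on plain lists of (key, value) pairs."""
    right_by_key = defaultdict(list)
    for key, value in right_pairs:
        right_by_key[key].append(value)
    # One output record per matching (left, right) combination; unmatched keys are dropped
    return [
        (key, (left_value, right_value))
        for key, left_value in left_pairs
        for right_value in right_by_key.get(key, [])
    ]

# Hypothetical, shortened records
order_items = [("957", ("12", "5")), ("957", ("15", "7")), ("403", ("13", "5"))]
products = [("957", ("43", "Diamondback Women's Serene Classic Comfort Bi"))]

joined = inner_join(order_items, products)
```

Product `957` appears twice on the left side, so it yields two joined records, while `403` has no match on the right and is dropped.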

In [102]:
sc.stop()

# PAIR RDD Operations <a class="anchor" id="16"></a>

In [103]:
## If you get an error here, install findspark first by running "pip install findspark" on the command line.
import findspark
findspark.init("/Users/resitkadir/spark/spark-3.0.0/")
import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext
#########################
sc = SparkContext("local[4]","PairRDD-Ops")
##################################
# Read the CSV, then drop the header row (the line containing the column name "sirano")
insanlarRDD2 = sc.textFile("datasets/simple_data.csv")
insanlarRDD = insanlarRDD2.filter(lambda x: "sirano" not in x)
insanlarRDD.first()

'1,Cemal,35,Isci,Ankara,3500'

In [104]:
def meslek_maas(line):
    # Return an (occupation, salary) pair: meslek = occupation, maas = salary
    fields = line.split(",")
    meslek = fields[3]
    maas = float(fields[5])
    
    return (meslek,maas)

meslek_maas_pairRDD = insanlarRDD.map(meslek_maas)
meslek_maas_pairRDD.take(3)

[('Isci', 3500.0), ('Memur', 4200.0), ('Müzisyen', 9000.0)]

In [105]:
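# Pair every salary with a count of 1 so that sums and counts can be reduced together
# (a separate name avoids overwriting the meslek_maas function defined above)
meslek_maas_count_pairRDD = meslek_maas_pairRDD.mapValues(lambda x: (x,1))
meslek_maas_count_pairRDD.take(5)

[('Isci', (3500.0, 1)),
 ('Memur', (4200.0, 1)),
 ('Müzisyen', (9000.0, 1)),
 ('Pazarlamaci', (4200.0, 1)),
 ('Pazarlamaci', (4800.0, 1))]

In [106]:
# Per occupation, add up the salaries (x[0]) and the counts (x[1])
meslek_maas_RBK = meslek_maas_count_pairRDD.reduceByKey(lambda x,y: (x[0] + y[0], x[1] + y[1]))
meslek_maas_RBK.take(5)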

[('Memur', (12200.0, 3)),
 ('Pazarlamaci', (16300.0, 3)),
 ('Tuhafiyeci', (4800.0, 1)),
 ('Tornacı', (4200.0, 1)),
 ('Isci', (3500.0, 1))]

In [107]:
# Average salary per occupation = total salary / number of people
meslek_ort_maas = meslek_maas_RBK.mapValues(lambda x: x[0] / x[1])
meslek_ort_maas.take(8)

[('Memur', 4066.6666666666665),
 ('Pazarlamaci', 5433.333333333333),
 ('Tuhafiyeci', 4800.0),
 ('Tornacı', 4200.0),
 ('Isci', 3500.0),
 ('Müzisyen', 9900.0),
 ('Doktor', 16125.0),
 ('Berber', 12000.0)]
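
The cells above carry a `(sum, count)` pair per key because an average cannot be combined pairwise the way a sum can. Sums and counts reduce cleanly, and the division happens once at the end. The plain-Python sketch below walks through the same three steps on a small list; the sample pairs and the name `average_by_key` are assumptions for illustration.

```python
# Hypothetical (occupation, salary) pairs in the same shape as meslek_maas_pairRDD
pairs = [("Memur", 4200.0), ("Pazarlamaci", 4200.0), ("Pazarlamaci", 4800.0), ("Memur", 3800.0)]

def average_by_key(pairs):
    """Average the values per key using (sum, count) accumulators."""
    totals = {}
    for key, value in pairs:
        # Same step as mapValues(lambda x: (x, 1))
        sum_count = (value, 1)
        if key in totals:
            # Same combiner as the reduceByKey lambda
            previous = totals[key]
            totals[key] = (previous[0] + sum_count[0], previous[1] + sum_count[1])
        else:
            totals[key] = sum_count
    # Same step as mapValues(lambda x: x[0] / x[1])
    return {key: total / count for key, (total, count) in totals.items()}

averages = average_by_key(pairs)
```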

In [108]:
sc.stop()

# Excel-Dataframe-RDD <a class="anchor" id="17"></a>

In [109]:
import findspark
findspark.init("/Users/resitkadir/spark/spark-3.0.0/")
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark = SparkSession.builder.master("local[4]").appName("Lung_Cancer").getOrCreate()

pdf = pd.read_excel("datasets/simple_data.xlsx")
df_pd = spark.createDataFrame(pdf)
rdd = df_pd.rdd
rdd.take(5)

[Row(sirano=1, isim='Cemal', yas=35, meslek='Isci', sehir='Ankara', aylik_gelir=3500),
 Row(sirano=2, isim='Ceyda', yas=42, meslek='Memur', sehir='Kayseri', aylik_gelir=4200),
 Row(sirano=3, isim='Timur', yas=30, meslek='Müzisyen', sehir='Istanbul', aylik_gelir=9000),
 Row(sirano=4, isim='Burcu', yas=29, meslek='Pazarlamaci', sehir='Ankara', aylik_gelir=4200),
 Row(sirano=5, isim='Yasemin', yas=23, meslek='Pazarlamaci', sehir='Bursa', aylik_gelir=4800)]

In [110]:
spark.stop()

# BroadcastVariablesOps<a class="anchor" id="18"></a>

In [111]:
import findspark
findspark.init("/Users/resitkadir/spark/spark-3.0.0/")
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[4]").setAppName("BroadcastVariablesOps")
# getOrCreate is a class method: it reuses a running context or creates one from conf
sc = SparkContext.getOrCreate(conf=conf)

# Function that reads products.csv and returns a {product_id: product_name} dictionary

def read_products():
    product_id_name = {}
    
    # The with-block closes the file even if parsing fails
    with open("datasets/retail_db/products.csv", "r", encoding="utf-8") as products_file:
        # read line by line
        for line in products_file:
            # skip the header row
            if "productName" not in line:
                fields = line.split(",")
                product_id = int(fields[0])
                product_name = fields[2]
                product_id_name[product_id] = product_name
    return product_id_name

products = read_products()
# Broadcast the dictionary so that each executor receives one read-only copy
broadcast_products = sc.broadcast(products)
broadcast_products.value.get(114)

"Nike Men's Fly Shorts 2.0"

In [112]:
# Read order_items and create an RDD (the filter drops the header row)
order_items_rdd = sc.textFile("datasets/retail_db/order_items.csv") \
                    .filter(lambda x: "orderItemOrderId" not in x)

order_items_rdd.take(5)

['1,1,957,1,299.98,299.98',
 '2,2,1073,1,199.99,199.99',
 '3,2,502,5,250.0,50.0',
 '4,2,403,1,129.99,129.99',
 '5,4,897,2,49.98,24.99']

In [113]:
# Build a pair RDD from order_items: (product_id, sub_total)
def make_order_items_pair_rdd(line):
    fields = line.split(",")
    order_item_product_id = int(fields[2])
    order_item_sub_total = float(fields[4])
    
    return (order_item_product_id, order_item_sub_total)

order_items_pair_rdd = order_items_rdd.map(make_order_items_pair_rdd)
order_items_pair_rdd.take(5)


[(957, 299.98), (1073, 199.99), (502, 250.0), (403, 129.99), (897, 49.98)]

In [114]:
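# Total revenue per product, in descending order:
# swap to (total, product_id), sort by key descending, then swap back
sorted_orders = order_items_pair_rdd.reduceByKey(lambda x,y: x+y) \
            .map(lambda x: (x[1], x[0])) \
            .sortByKey(False) \
            .map(lambda x: (x[1], x[0]))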

In [115]:
sorted_orders.take(5)

[(1004, 6929653.499999708),
 (365, 4421143.019999639),
 (957, 4118425.419999785),
 (191, 3667633.1999997487),
 (502, 3147800.0)]

In [116]:
# Attach product names to the sorted totals by looking each id up in the broadcast products dictionary
sorted_orders_with_product_name = sorted_orders \
                                .map(lambda x: (broadcast_products.value.get(x[0]), x[1]))

sorted_orders_with_product_name.take(5)

[('Field & Stream Sportsman 16 Gun Fire Safe', 6929653.499999708),
 ('Perfect Fitness Perfect Rip Deck', 4421143.019999639),
 ("Diamondback Women's Serene Classic Comfort Bi", 4118425.419999785),
 ("Nike Men's Free 5.0+ Running Shoe", 3667633.1999997487),
 ("Nike Men's Dri-FIT Victory Golf Polo", 3147800.0)]

In [117]:
sc.stop()
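
The lookup `broadcast_products.value.get(x[0])` returns `None` for any product id that is missing from the dictionary. If that case matters, a small helper with an explicit default keeps the result readable. The sketch below is a plain-Python illustration; `with_product_name`, the `"UNKNOWN"` default, the id `9999`, and the rounded totals are all assumptions.

```python
# Hypothetical stand-in for broadcast_products.value
product_names = {
    1004: "Field & Stream Sportsman 16 Gun Fire Safe",
    365: "Perfect Fitness Perfect Rip Deck",
}

def with_product_name(pair, names, default="UNKNOWN"):
    """Replace the product id in (product_id, total) with its name, or a default."""
    product_id, total = pair
    return (names.get(product_id, default), total)

# Hypothetical totals; 9999 is deliberately absent from the dictionary
totals = [(1004, 6929653.5), (365, 4421143.02), (9999, 10.0)]
named = [with_product_name(pair, product_names) for pair in totals]

# With Spark, the lookup would read the broadcast value inside the map:
# sorted_orders.map(lambda x: with_product_name(x, broadcast_products.value))
```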

# RDD_Wordcount<a class="anchor" id="19"></a>

In [118]:
import findspark
findspark.init("/Users/resitkadir/spark/spark-3.0.0/")

# Create SparkContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# You can adjust the settings below to match your machine's memory
spark = SparkSession.builder \
        .master("local[4]") \
        .appName("RDD-Olusturmak") \
        .config("spark.executor.memory","4g") \
        .config("spark.driver.memory","2g") \
        .getOrCreate()

# Keep the SparkContext in a short alias
sc = spark.sparkContext
# Read the file
veri_dosyasi = "datasets/omer_seyfettin_forsa_hikaye.txt"

hikaye_rdd = sc.textFile(veri_dosyasi)
hikaye_rdd.take(5)

['Ömer Seyfettin -         Forsa',
 '',
 'Akdeniz’in, kahramanlık yuvası sonsuz ufuklarına bakan küçük tepe, minimini bir çiçek ',
 '',
 '']

In [119]:
# Split every line on spaces and keep the resulting words in a new RDD
kelimeler = hikaye_rdd.flatMap(lambda satir: satir.split(" "))
# Count the occurrences of each word
kelime_sayilari = kelimeler.map(lambda kelime: (kelime,1)).reduceByKey(lambda x,y: x+y)
# How many distinct words are there?
kelime_sayilari.count()

840

In [120]:
# Look at 15 of the words with their counts (take returns the first 15 elements, not a random sample)
kelime_sayilari.take(15)

[('Ömer', 1),
 ('Seyfettin', 1),
 ('', 85),
 ('Forsa', 1),
 ('Akdeniz’in,', 1),
 ('kahramanlık', 2),
 ('sonsuz', 1),
 ('ufuklarına', 1),
 ('bakan', 1),
 ('uzun', 1),
 ('badem', 1),
 ('alaca', 1),
 ('inen', 1),
 ('keçiyoluna', 1),
 ('rüzgârıyla', 1)]

In [121]:
# To use the counts as keys, swap each pair: the count goes to index 0 and the word to index 1
kelime_sayilari2 = kelime_sayilari.map(lambda x: (x[1], x[0]))

In [122]:
# Now that the counts are the keys, sort in descending order and look at the 20 most frequent words
kelime_sayilari2.sortByKey(False).take(20)

[(85, ''),
 (32, 'bir'),
 (31, '–'),
 (8, 'yıl'),
 (6, 'diye'),
 (5, 'Türk'),
 (5, 'dedi.'),
 (5, 'onun'),
 (5, 'doğru'),
 (5, 'Kırk'),
 (4, 'Yirmi'),
 (4, 'tutsak'),
 (4, 'Ben'),
 (4, 'gibi'),
 (4, 'Ama'),
 (4, 'büyük'),
 (3, 'yanı'),
 (3, 'şey'),
 (3, 'onu'),
 (3, 'geminin')]
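
The top entries in the output are the empty string and a lone dash, and tokens such as `'dedi.'` still carry punctuation, because the lines were split on single spaces only. A minimal tokenizer sketch that addresses this is shown below. The name `tokenize`, the punctuation set, and the third sample line are assumptions for illustration, and case is left untouched.

```python
import string
from collections import Counter

# ASCII punctuation plus the typographic apostrophe and dash seen in the story text
PUNCTUATION = string.punctuation + "’–"

def tokenize(line):
    """Split on any whitespace, strip surrounding punctuation, drop empty tokens."""
    tokens = (word.strip(PUNCTUATION) for word in line.split())
    return [token for token in tokens if token]

# The first line comes from the story; the third is a made-up example
lines = ["Ömer Seyfettin -         Forsa", "", "– Kırk yıl, dedi. Kırk yıl!"]
counts = Counter(token for line in lines for token in tokenize(line))

# With Spark, the same function could replace the split inside flatMap:
# kelimeler = hikaye_rdd.flatMap(tokenize)
```

Calling `split()` without an argument collapses runs of whitespace, so consecutive spaces no longer produce empty tokens.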

In [123]:
sc.stop()

# AND OTHERS <a class="anchor" id="20"></a>

![key_value](IMG/key_value.png)
**To select those younger than 30**
![exa](IMG/exa_1.png)

In [125]:
#**KEY-VALUE**

import findspark
findspark.init("/Users/resitkadir/spark/spark-2.4.6/")

# Create SparkContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# You can adjust the settings below to match your machine's memory
spark = SparkSession.builder \
        .master("local[4]") \
        .appName("Dataset-Olusturmak") \
        .config("spark.executor.memory","4g") \
        .config("spark.driver.memory","2g") \
        .getOrCreate()

# Keep the SparkContext in a short alias
sc = spark.sparkContext


ages = [("ahmet",35),("oscar",22),("jason",98)]
ages_rdd = sc.parallelize(ages)
ages_rdd.collect()

[('ahmet', 35), ('oscar', 22), ('jason', 98)]

In [126]:
ages_rdd.filter(lambda key_value : key_value[1] < 30).take(3)
# key_value[1] is the element at index 1 of each tuple, i.e. the age

[('oscar', 22)]

In [127]:
ages_rdd.filter(lambda key_value : key_value[0] == "oscar").take(3)

[('oscar', 22)]
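
Indexing with `key_value[0]` and `key_value[1]` works, but unpacking the tuple into names can make the predicate easier to read. The sketch below builds such a predicate in plain Python; `younger_than` is a hypothetical helper, not part of the notebook.

```python
ages = [("ahmet", 35), ("oscar", 22), ("jason", 98)]

def younger_than(limit):
    """Build a predicate over (name, age) pairs."""
    def predicate(key_value):
        name, age = key_value  # unpack instead of indexing
        return age < limit
    return predicate

under_30 = list(filter(younger_than(30), ages))

# With Spark, the same predicate could be passed to the RDD filter:
# ages_rdd.filter(younger_than(30)).collect()
```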