# SparkCore
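Spark Core is an in-memory computation engine, which is why it is very fast. It only handles computation and does not provide data storage. Its drawbacks are high memory consumption and a tendency to be less stable.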

In [1]:
# SparkContext is the entry point of a Spark program
from pyspark import SparkContext, SparkConf
import os

In [2]:
sc = SparkContext()

In [3]:
# Display basic information about the Spark context
sc

## Getting Started

### Word Count

In [35]:
# Load the text file as an RDD of lines
path = os.path.join(os.getcwd(), "data/text.txt")
words = sc.textFile(name=path, minPartitions=5)
words

/home/gavin/Machine/Recommend-System/Spark/data/text.txt MapPartitionsRDD[85] at textFile at NativeMethodAccessorImpl.java:0

In [5]:
# Build the whole pipeline (transformations are lazy: nothing is computed yet)
RDD = words.flatMap(lambda line:line.split(" ")).map(lambda x:(x, 1)).reduceByKey(lambda a, b: a + b)

In [6]:
# collect() is an action: it runs the whole pipeline and returns the result
RDD.collect()[:10]

[('of', 3),
 ('to', 3),
 ('no', 1),
 ('', 2),
 ('learn', 1),
 ('computation', 1),
 ('see', 1),
 ('code', 1),
 ('examples.', 1),
 ('versions', 1)]
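
The `('', 2)` entry in the output above is most likely a side effect of `split(" ")`, which produces empty strings for blank lines and for consecutive spaces. The following is a minimal pure-Python sketch of that behaviour. The sample lines are made up for illustration and are not the contents of `data/text.txt`.

```python
# Hypothetical sample lines; the real data/text.txt is not shown in this notebook.
lines = ["learn  code", "", "see examples."]

# split(" ") keeps an empty string for a blank line and between repeated spaces.
with_blanks = [w for line in lines for w in line.split(" ")]

# split() with no argument splits on runs of whitespace and drops empty strings.
without_blanks = [w for line in lines for w in line.split()]

print(with_blanks)
print(without_blanks)
```

If the empty key is unwanted, one option is to use `line.split()` inside `flatMap`, or to add a `filter` that drops empty strings before counting.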

In [7]:
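# Distributed collections and computation
data = [1, 2, 3, 4, 5]
# The second argument sets the number of partitions, i.e. how many sub-tasks the computation is split into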
distData = sc.parallelize(data, 5)
distData.collect()

[1, 2, 3, 4, 5]

In [8]:
distData = sc.parallelize(data, 2)
# reduce is an action: it aggregates all elements into a single value
distData.reduce(lambda a, b:a + b)

15

In [9]:
from operator import add
distData.reduce(add)

15

## Common RDD Operations

### RDD Transformation Operators
Purpose: create a new dataset from an existing one.

In [10]:
# map(func): apply func to every element
rdd1 = sc.parallelize([1, 2, 3, 5, 6, 7, 8], 5)
rdd2 = rdd1.map(lambda x: x+1)
rdd2.collect()

[2, 3, 4, 6, 7, 8, 9]

In [11]:
# filter
rdd3 = rdd2.filter(lambda x:x>5)
rdd3.collect()

[6, 7, 8, 9]

In [12]:
# flatMap: map each element to a sequence, then flatten the results (compared with map)
rdd4 = sc.parallelize(["a b x", "d u j", "l p o"], 4)
rdd5 = rdd4.flatMap(lambda x: x.split(" "))
rdd6 = rdd4.map(lambda x:x.split(" "))
print(rdd5.collect(), rdd6.collect())

['a', 'b', 'x', 'd', 'u', 'j', 'l', 'p', 'o'] [['a', 'b', 'x'], ['d', 'u', 'j'], ['l', 'p', 'o']]
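
The difference between the two outputs above can be reproduced on a plain Python list. This is only a sketch of the semantics, not of how Spark executes the operators: `map` keeps one result per input element, while `flatMap` concatenates the per-element results.

```python
from itertools import chain

lines = ["a b x", "d u j", "l p o"]

# map: one output element per input element (here, one list per line).
mapped = [line.split(" ") for line in lines]

# flatMap: the per-element lists are concatenated into one flat sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(flat_mapped, mapped)
```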


In [13]:
# union: concatenate two RDDs (duplicates are not removed)
rdd1 = sc.parallelize([("a",1),("b",2)])
rdd2 = sc.parallelize([("c",1),("b",3)])
rdd3 = rdd1.union(rdd2)
rdd3.collect()

[('a', 1), ('b', 2), ('c', 1), ('b', 3)]

In [14]:
# groupByKey: group the values of each key, e.g. ("b", [2, 3])
rdd4 = rdd3.groupByKey()
rdd4.collect()

[('a', <pyspark.resultiterable.ResultIterable at 0x7f206b7f4908>),
 ('b', <pyspark.resultiterable.ResultIterable at 0x7f206b7f40b8>),
 ('c', <pyspark.resultiterable.ResultIterable at 0x7f206b7f4b70>)]

In [15]:
list(rdd4.collect()[1][1])

[2, 3]

In [16]:
# reduceByKey
rdd5 = rdd3.reduceByKey(lambda x, y:x+y)
rdd5.collect()

[('a', 1), ('b', 5), ('c', 1)]
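
To make the contrast between `groupByKey` and `reduceByKey` concrete, here is a small pure-Python model of what each returns for the same pairs. The helper names `group_by_key` and `reduce_by_key` are hypothetical and exist only for this sketch; they are not PySpark functions.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("c", 1), ("b", 3)]

def group_by_key(pairs):
    """Collect every value of a key into a list (models groupByKey)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(pairs, func):
    """Fold the values of a key into one value (models reduceByKey)."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

print(group_by_key(pairs))
print(reduce_by_key(pairs, lambda x, y: x + y))
```

When only an aggregate per key is needed, `reduceByKey` is usually the better choice in Spark, since values can be combined within each partition before they are shuffled.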

In [17]:
# sortByKey
temp = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
rdd = sc.parallelize(temp, 5)
rdd1 = rdd.sortByKey()
rdd1.collect()

[('Mary', 1), ('a', 3), ('had', 2), ('lamb', 5), ('little', 4)]

In [18]:
# sortBy
rdd2 = rdd.sortBy(keyfunc=lambda x:x[1], ascending=False)
rdd2.collect()

[('lamb', 5), ('little', 4), ('a', 3), ('had', 2), ('Mary', 1)]

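### RDD Action Operators

- collect

  - Returns a list containing every element of the RDD.
  - Use `collect` only when the data set is small, because the entire result is loaded into the driver's memory.

- reduce

  - **reduce** passes the elements of the **RDD** to the input function two at a time, producing a new value; that value and the next element are passed to the function again, and so on until a single value remains.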
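
The pairwise folding described above can be sketched with `functools.reduce`. Because an RDD is split into partitions, a reasonable mental model is that each partition is reduced first and the partial results are then reduced again, which is why the function should be commutative and associative. The helper `partitioned_reduce` and the partition layouts below are assumptions made for illustration, not Spark's actual data placement.

```python
from functools import reduce
from operator import add, sub

def partitioned_reduce(func, partitions):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

# Two hypothetical ways of partitioning the same data.
one_partition = [[1, 2, 3, 4, 5]]
two_partitions = [[1, 2], [3, 4, 5]]

# Addition is associative and commutative: the layout does not matter.
print(partitioned_reduce(add, one_partition), partitioned_reduce(add, two_partitions))

# Subtraction is neither: the result changes with the layout.
print(partitioned_reduce(sub, one_partition), partitioned_reduce(sub, two_partitions))
```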

In [19]:
# first
rdd.first()

('Mary', 1)

In [20]:
# take(n): return the first n elements
rdd.take(2)

[('Mary', 1), ('had', 2)]

In [21]:
# count
rdd.count()

5

## Hands-on: Clickstream Log Analysis with Spark

### Total Page Views (PV)

In [22]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pv").getOrCreate()
sc = spark.sparkContext
path1 = os.path.join(os.getcwd(), "data/access.log")

In [23]:
# Load the access log
rdd = sc.textFile(path1)
rdd.collect()[:5]

['194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"',
 '183.49.46.228 - - [18/Sep/2013:06:49:23 +0000] "-" 400 0 "-" "-"',
 '163.177.71.12 - - [18/Sep/2013:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"',
 '163.177.71.12 - - [18/Sep/2013:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"',
 '101.226.68.137 - - [18/Sep/2013:06:49:42 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"']

In [24]:
# Each line represents one visit to the site (one page view)
rdd1 = rdd.map(lambda x:("pv", 1)).reduceByKey(lambda a, b :a+b)
# rdd1.saveAsTextFile("data/pv.txt")
rdd1.collect()

[('pv', 90)]

### Unique Visitors (UV)

In [25]:
# getOrCreate() returns the session that is already running rather than creating a new one
spark = SparkSession.builder.appName("uv").getOrCreate()
sc = spark.sparkContext
path1 = os.path.join(os.getcwd(), "data/access.log")

In [26]:
# Extract the IP address (the first space-separated field of each line)
rdd1 = sc.textFile(path1)
rdd2 = rdd1.map(lambda x:x.split(" ")).map(lambda x:x[0])
rdd2.collect()[:5]

['194.237.142.21',
 '183.49.46.228',
 '163.177.71.12',
 '163.177.71.12',
 '101.226.68.137']

In [27]:
rdd3 = rdd2.distinct().map(lambda x:("uv", 1))
rdd4 = rdd3.reduceByKey(lambda a,b:a+b)
rdd4.collect()
# rdd4.saveAsTextFile("data/uv.txt")

[('uv', 17)]
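
The UV computation boils down to counting distinct IP addresses. On a local list the same idea is a one-liner with `set`; the sketch below uses only the five IPs shown in the earlier output, so its result is not the full-log figure.

```python
# The first five IPs from the output above (a small excerpt, not the whole log).
ips = [
    "194.237.142.21",
    "183.49.46.228",
    "163.177.71.12",
    "163.177.71.12",
    "101.226.68.137",
]

# Deduplicate, then count: the local equivalent of distinct() followed by a count.
unique_visitors = len(set(ips))
print(unique_visitors)
```

In PySpark, `rdd2.distinct().count()` should give the same number as the `reduceByKey` chain, returned as a plain integer instead of a one-element list.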

### Top-N Visits

In [28]:
spark = SparkSession.builder.appName("TopN").getOrCreate()
sc = spark.sparkContext
path1 = os.path.join(os.getcwd(), "data/access.log")

In [29]:
rdd1 = sc.textFile(path1)
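# Field 10 of each space-separated line is the referrer; lines with too few fields are skipped
rdd2 = rdd1.map(lambda line: line.split(" ")).filter(lambda x: len(x) > 10).map(lambda x: (x[10], 1))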
rdd3 = rdd2.reduceByKey(lambda a, b:a+b).sortBy(lambda x:x[1], ascending=False)
rdd4 = rdd3.take(5)
rdd4

[('"-"', 27),
 ('"http://blog.fens.me/vps-ip-dns/"', 18),
 ('"http://blog.fens.me/wp-content/themes/silesia/style.css"', 7),
 ('"http://blog.fens.me/nodejs-socketio-chat/"', 7),
 ('"http://blog.fens.me/nodejs-grunt-intro/"', 7)]
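
To see why the pipeline uses index 10 and filters on `len(x) > 10`, it helps to split two of the sample log lines shown earlier by hand. In this sketch the first line has a full request string and yields 13 fields, while the malformed request on the second line yields only 10, so it has no field at index 10.

```python
# Two lines copied from the access.log sample printed earlier in this notebook.
sample_lines = [
    '194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"',
    '183.49.46.228 - - [18/Sep/2013:06:49:23 +0000] "-" 400 0 "-" "-"',
]

results = []
for line in sample_lines:
    fields = line.split(" ")
    # Guard against short lines, as the filter step in the pipeline does.
    referrer = fields[10] if len(fields) > 10 else None
    results.append((len(fields), referrer))

print(results)
```

Splitting on single spaces is fragile because the position of the referrer depends on the request string having exactly three parts; a regular expression for the log format would be a sturdier alternative.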

In [34]:
rdd3.glom().collect()

[[('"-"', 27),
  ('"http://blog.fens.me/vps-ip-dns/"', 18),
  ('"http://blog.fens.me/wp-content/themes/silesia/style.css"', 7),
  ('"http://blog.fens.me/nodejs-socketio-chat/"', 7),
  ('"http://blog.fens.me/nodejs-grunt-intro/"', 7)],
 [('"http://blog.fens.me/nodejs-async/"', 5),
  ('"http://www.angularjs.cn/A00n"', 2),
  ('"http://cos.name/category/software/packages/"', 1),
  ('"http://blog.fens.me/series-nodejs/"', 1),
  ('"http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=6&cad=rja&ved=0CHIQFjAF&url=http%3A%2F%2Fblog.fens.me%2Fvps-ip-dns%2F&ei=j045UrP5AYX22AXsg4G4DQ&usg=AFQjCNGsJfLMNZnwWXNpTSUl6SOEzfF6tg&sig2=YY1oxEybUL7wx3IrVIMfHA&bvm=bv.52288139,d.b2I"',
   1),
  ('"http://www.google.com/url?sa=t&rct=j&q=nodejs%20%E5%BC%82%E6%AD%A5%E5%B9%BF%E6%92%AD&source=web&cd=1&cad=rja&ved=0CCgQFjAA&url=%68%74%74%70%3a%2f%2f%62%6c%6f%67%2e%66%65%6e%73%2e%6d%65%2f%6e%6f%64%65%6a%73%2d%73%6f%63%6b%65%74%69%6f%2d%63%68%61%74%2f&ei=rko5UrylAefOiAe7_IGQBw&usg=AFQjCNG6YWoZsJ_bSj8kTnMHcH51

In [30]:
# Stop the SparkContext
# sc.stop()