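## 1. A first example
1. Read a file into a DataFrame
2. Count the lines containing the letter a, the lines containing the letter b, and the total number of lines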

In [4]:
from pyspark.sql import SparkSession

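# Path of the file to read
logFile = "D:\\spark\\README.md"
# Create (or get) a Spark session and give the application a name
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
# Read the file into a DataFrame and cache it, since it is scanned three times below
logData = spark.read.text(logFile).cache()
# Count the total number of lines
nums = logData.count()
# Count the lines containing the letter a
numAs = logData.filter(logData.value.contains('a')).count()
# Count the lines containing the letter b
numBs = logData.filter(logData.value.contains('b')).count()
# Print the results
print("Lines with a: %i, lines with b: %i, lines: %i" % (numAs, numBs, nums))
# Stop the Spark session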
spark.stop()

Lines with a: 64, lines with b: 32, lines: 108
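
To make clear what the two `filter(...).count()` calls compute, here is a minimal plain-Python sketch of the same logic, without Spark. The sample lines and the helper name `count_lines_containing` are hypothetical and only for illustration. `contains` is a case-sensitive substring test, which the `in` operator mirrors here.

```python
def count_lines_containing(lines, letter):
    """Count how many lines contain `letter` (case-sensitive substring test)."""
    return sum(1 for line in lines if letter in line)

# Hypothetical sample data standing in for the lines of a text file.
sample_lines = [
    "# Apache Spark",
    "",
    "Spark is a unified analytics engine",
    "big data",
]

num_as = count_lines_containing(sample_lines, "a")
num_bs = count_lines_containing(sample_lines, "b")
print("Lines with a: %i, lines with b: %i, lines: %i"
      % (num_as, num_bs, len(sample_lines)))
```

A line is counted once no matter how many times the letter occurs in it, so the counts are numbers of lines, not numbers of characters.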


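## 2. Creating an RDD
This section shows how to create a dataset on Spark. The RDD is Spark's basic object for data processing. An RDD is a schema-less data structure, which makes it very different from a DataFrame.
For PySpark, however, RDD operations have to cross back and forth between Python and the JVM, which adds overhead. DataFrame operations largely avoid this cost and are correspondingly faster.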

### 2.1 Creating an RDD from a variable or a file

In [5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDTestApp").getOrCreate()
sc = spark.sparkContext
# Method 1: .parallelize(...) a collection (a list or array of elements)
data = sc.parallelize([('Amber',22), ('Alfred', 23), ('Skye', 4), ('Albert', 12), ('Amber', 9)])

# Method 2: reference a file stored locally or on an external system
data_from_file = sc.textFile('D:/spark/README.md', 4)

In [6]:
# .parallelize(...) accepts a mix of almost any kind of data structure

data_hetergenous = sc.parallelize([
    ('Ferrari', 'fast'),
    {'Porsche': 10000},
    ['Spain', 'visited', 4504]
]).collect()  # -> .collect() returns the elements as a Python list, so they can be accessed like ordinary Python objects.

data_hetergenous[1]['Porsche']

10000

### 2.2 Reading from an RDD
How to read data back from the RDD objects created above.

In [7]:
# Read the first line of data_from_file, the RDD created from the file
data_from_file.take(1)

['# Apache Spark']

In [8]:
data_from_file.take(2)

['# Apache Spark', '']

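## 3. DataFrame
The introduction of DataFrames greatly improved the performance of PySpark.
### 3.1 Creating a DataFrame
There are two ways: (1) build JSON data and convert it to a DataFrame; (2) read from the file system.
#### 3.1.1 Creating a DataFrame from hand-built JSON data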
 

In [9]:
from pyspark.sql import SparkSession
# Create the Spark session
spark = SparkSession.builder.appName("DataFrameTestApp").getOrCreate()
sc = spark.sparkContext

In [10]:
# Show the URL of the Spark web UI
sc.uiWebUrl

'http://windows10.microdone.cn:4040'

In [11]:
## Step 1: build the JSON data by hand
stringJSONRDD = sc.parallelize(("""
    {
        "id": "123",
        "name": "Kate",
        "age": 19,
        "eyeColor": "brown"
    }""",
    """{
        "id": "234",
        "name": "Michael",
        "age": 22,
        "eyeColor": "green"
    }""",
    """{
        "id": "345",
        "name": "Simone",
        "age": 23,
        "eyeColor": "blue"
    }"""))

In [12]:
## Step 2: create a DataFrame from the JSON
swimmersJSON = spark.read.json(stringJSONRDD)
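
Each element of `stringJSONRDD` is expected to be one complete JSON object held in a string. As a quick sanity check outside Spark, the sketch below parses equivalent strings with Python's standard `json` module and lists the field names in sorted order. The variable names are hypothetical. That Spark's JSON schema inference also orders the inferred columns alphabetically is an assumption here, but it would match the `age, eyeColor, id, name` order in the `show()` output later in this section.

```python
import json

# Hypothetical stand-ins for the strings passed to sc.parallelize(...).
json_strings = [
    '{"id": "123", "name": "Kate", "age": 19, "eyeColor": "brown"}',
    '{"id": "234", "name": "Michael", "age": 22, "eyeColor": "green"}',
    '{"id": "345", "name": "Simone", "age": 23, "eyeColor": "blue"}',
]

# Each string must parse on its own as a single JSON object.
records = [json.loads(s) for s in json_strings]

# Union of all field names, sorted alphabetically.
field_names = sorted({key for record in records for key in record})
print(field_names)

# Python type of each value in the first record.
print({key: type(value).__name__ for key, value in records[0].items()})
```

Because `id` is quoted in the JSON it stays a string, while `age` is parsed as a number. This matches the `Row(age=19, ..., id='123', ...)` values shown in the SQL query output.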

### 3.2 Querying a DataFrame
There are two ways to query: the DataFrame's own API, or SQL statements.
#### 3.2.1 The DataFrame API

In [13]:
## A simple DataFrame query
swimmersJSON.show()

+---+--------+---+-------+
|age|eyeColor| id|   name|
+---+--------+---+-------+
| 19|   brown|123|   Kate|
| 22|   green|234|Michael|
| 23|    blue|345| Simone|
+---+--------+---+-------+



#### 3.2.2 Querying with SQL statements

In [14]:
# SQL query
## First register the DataFrame as a temporary view
swimmersJSON.createOrReplaceTempView("swimmers")
## Run a SQL query against the temporary view
spark.sql("select * from swimmers").collect()

[Row(age=19, eyeColor='brown', id='123', name='Kate'),
 Row(age=22, eyeColor='green', id='234', name='Michael'),
 Row(age=23, eyeColor='blue', id='345', name='Simone')]

In [15]:
## First register the DataFrame as a temporary view
swimmersJSON.createOrReplaceTempView("swimmers")
## Run the SQL query; the result is itself a DataFrame
sqlDF = spark.sql("select * from swimmers")
sqlDF.show()

+---+--------+---+-------+
|age|eyeColor| id|   name|
+---+--------+---+-------+
| 19|   brown|123|   Kate|
| 22|   green|234|Michael|
| 23|    blue|345| Simone|
+---+--------+---+-------+



In [16]:
## First register the DataFrame as a temporary view
swimmersJSON.createOrReplaceTempView("swimmers")
## Run the SQL query and display the result directly
spark.sql("select * from swimmers").show()

+---+--------+---+-------+
|age|eyeColor| id|   name|
+---+--------+---+-------+
| 19|   brown|123|   Kate|
| 22|   green|234|Michael|
| 23|    blue|345| Simone|
+---+--------+---+-------+



In [17]:
spark.stop()

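### 3.3 Example: counting the words in a text file
Read the README.md file under the Spark directory with the DataFrame reader and count how many times each word occurs.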
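
The word count in this section follows the usual split, pair, and sum pattern. Below is a minimal plain-Python sketch of the same steps, without Spark. The sample lines and the function name `word_count` are hypothetical. It also shows why an empty-string key can appear in the result: `split(' ')` yields `''` for blank lines and for consecutive spaces.

```python
from collections import Counter

def word_count(lines):
    """Split each line on single spaces and count every resulting token."""
    counts = Counter()
    for line in lines:
        # Same split as x.split(' '): blank lines and repeated spaces yield ''.
        for word in line.split(' '):
            counts[word] += 1
    return counts

# Hypothetical sample data; note the blank line and the double space.
sample_lines = ["# Apache Spark", "", "Spark is  fast"]
counts = word_count(sample_lines)
print(counts["Spark"])
print(counts[""])

# Calling split() with no argument splits on any whitespace and drops empty tokens.
cleaned = Counter(word for line in sample_lines for word in line.split())
print("" in cleaned)
```

If the empty-string entry is unwanted, splitting with no argument, or filtering out empty tokens before counting, is one way to avoid it.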

In [18]:
from pyspark.sql import SparkSession
from operator import add
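# Create (or get) the Spark session
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
# Read the file into a DataFrame, one row per line, then convert it to an RDD of strings
# The file is treated as n rows by 1 column, so r[0] is the text of each line
lines = spark.read.text("d://spark//README.md").rdd.map(lambda r: r[0])
# flatMap: split each line into words on single spaces
# map: pair each word (the key) with the number 1
# reduceByKey: add up the 1s for each word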
counts = lines.flatMap(lambda x: x.split(' ')) \
                .map(lambda x: (x, 1)) \
                .reduceByKey(add)
output = counts.collect()
output

[('#', 1),
 ('Apache', 1),
 ('Spark', 14),
 ('', 73),
 ('is', 7),
 ('a', 9),
 ('unified', 1),
 ('analytics', 1),
 ('engine', 2),
 ('for', 12),
 ('large-scale', 1),
 ('data', 2),
 ('processing.', 2),
 ('It', 2),
 ('provides', 1),
 ('high-level', 1),
 ('APIs', 1),
 ('in', 5),
 ('Scala,', 1),
 ('Java,', 1),
 ('Python,', 2),
 ('and', 9),
 ('R,', 1),
 ('an', 4),
 ('optimized', 1),
 ('that', 2),
 ('supports', 2),
 ('general', 2),
 ('computation', 1),
 ('graphs', 1),
 ('analysis.', 1),
 ('also', 5),
 ('rich', 1),
 ('set', 2),
 ('of', 5),
 ('higher-level', 1),
 ('tools', 1),
 ('including', 4),
 ('SQL', 2),
 ('DataFrames,', 1),
 ('MLlib', 1),
 ('machine', 1),
 ('learning,', 1),
 ('GraphX', 1),
 ('graph', 1),
 ('processing,', 1),
 ('Structured', 1),
 ('Streaming', 1),
 ('stream', 1),
 ('<https://spark.apache.org/>', 1),
 ('[![Jenkins', 1),
 ('Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-m