# Spark Quick Start

## Hello World in the Big Data World

We will write a Spark program that reads the words in a file and counts how many times each word occurs. This word count is the Hello World of the big data world.

In [15]:
# Use findspark so that Python can locate pyspark
import findspark
findspark.init()

In [16]:
# Import SparkSession and SparkConf, which we need in order to work with RDDs
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

In [18]:
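# Create the SparkConf and SparkSession
# Note: Spark properties are set with set(); setExecutorEnv() only sets
# environment variables for the executors
conf = SparkConf()\
        .setMaster('local[*]')\
        .setAppName("WordCount")\
        .set("spark.executor.memory","2g")\
        .set("spark.driver.memory","2g")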

spark = SparkSession.builder\
        .config(conf=conf)\
        .getOrCreate()
# Get the SparkContext
sc = spark.sparkContext

In [17]:
!hadoop fs -put data/shakespeare.txt /dataset/shakespeare.txt

put: `/dataset/shakespeare.txt': File exists


In [19]:
!hadoop fs -ls /dataset

Found 2 items
drwxr-xr-x   - bigdata supergroup          0 2021-04-24 22:24 /dataset/ml-25m
-rw-r--r--   3 bigdata supergroup    1115394 2021-05-14 22:23 /dataset/shakespeare.txt


In [20]:
# Read the Shakespeare text dataset (data/shakespeare.txt must first be uploaded to HDFS)
shakespeare_path = "/dataset/shakespeare.txt"
shakespeare_rdd = sc.textFile(shakespeare_path)
shakespeare_rdd.take(10)

['First Citizen:',
 'Before we proceed any further, hear me speak.',
 '',
 'All:',
 'Speak, speak.',
 '',
 'First Citizen:',
 'You are all resolved rather to die than to famish?',
 '',
 'All:']

In [21]:
# Get the number of lines in the dataset
shakespeare_rdd.count()

40000

In [22]:
# Remove all punctuation
# Convert all words to lowercase
def lower_clean_str(x):
    punc='!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~-'
    lowercased_str = x.lower()
    for ch in punc:
        lowercased_str = lowercased_str.replace(ch, '')
    return lowercased_str

In [23]:
sentence = 'You are An apple.'
lower_clean_str(sentence)

'you are an apple'
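
The loop in `lower_clean_str` calls `replace` once per punctuation character. As an alternative sketch (not the approach used in this notebook), the same cleanup can be done in a single pass with `str.translate`. The name `lower_clean_str_translate` is made up for this illustration; `string.punctuation` covers the same ASCII punctuation characters as the `punc` string above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCT = str.maketrans('', '', string.punctuation)

def lower_clean_str_translate(text):
    """Lowercase `text` and strip ASCII punctuation in one pass."""
    return text.lower().translate(_DELETE_PUNCT)

print(lower_clean_str_translate('You are An apple.'))
```

Because it is an ordinary one-argument function, it could be passed to `map` in the same way as `lower_clean_str`.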

In [24]:
# Apply lower_clean_str to every line of the text
shakespeare_rdd = shakespeare_rdd.map(lower_clean_str)
# Inspect the transformed data
shakespeare_rdd.take(10)

['first citizen',
 'before we proceed any further hear me speak',
 '',
 'all',
 'speak speak',
 '',
 'first citizen',
 'you are all resolved rather to die than to famish',
 '',
 'all']

In [25]:
# Use split to break each line into words (tokenization), and flatten the per-line word lists into one RDD of words
shakespeare_rdd = shakespeare_rdd.flatMap(lambda line: line.split(" "))
shakespeare_rdd.take(15)

['first',
 'citizen',
 'before',
 'we',
 'proceed',
 'any',
 'further',
 'hear',
 'me',
 'speak',
 '',
 'all',
 'speak',
 'speak',
 '']
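
To see why this step uses `flatMap` rather than `map`, here is a plain-Python sketch that needs no Spark. The list comprehensions only imitate the two transformations on a tiny made-up list of lines; they are not how Spark executes them.

```python
lines = ['first citizen', '', 'speak speak']

# map-like: one output element per input line (each element is a list of words)
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are concatenated into one flat list of words
flat_mapped = [word for line in lines for word in line.split(" ")]

print(mapped)
print(flat_mapped)
```

With the `map`-like version every element is still a list, so counting words would first require unpacking those lists. The flattened version yields one word per element, which is what the later steps expect.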

In [26]:
# Use filter to drop the empty strings produced by splitting
shakespeare_rdd = shakespeare_rdd.filter(lambda x: x != '')
shakespeare_rdd.take(15)

['first',
 'citizen',
 'before',
 'we',
 'proceed',
 'any',
 'further',
 'hear',
 'me',
 'speak',
 'all',
 'speak',
 'speak',
 'first',
 'citizen']
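
The empty strings come from how `split(" ")` treats blank lines and consecutive spaces. The short plain-Python sketch below uses a made-up input to compare it with `split()` called without an argument, which splits on runs of whitespace and returns no empty strings.

```python
line = 'speak  speak'   # two spaces between the words

tokens_space = line.split(" ")   # splitting on a single space keeps an empty string
tokens_default = line.split()    # splitting on runs of whitespace drops it

print(tokens_space)
print(tokens_default)
print(''.split(" "), ''.split())   # a blank line: [''] versus []
```

Under that assumption, using `split()` in the `flatMap` step would likely make the separate `filter` step unnecessary. Note that `split()` also splits on tabs and other whitespace, which may or may not be what you want.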

Count the occurrences of each word

In [27]:
# To count occurrences, first turn the RDD of words into an RDD of (word, 1) pairs
shakespeare_count = shakespeare_rdd.map(lambda word: (word, 1))
shakespeare_count.take(15)

[('first', 1),
 ('citizen', 1),
 ('before', 1),
 ('we', 1),
 ('proceed', 1),
 ('any', 1),
 ('further', 1),
 ('hear', 1),
 ('me', 1),
 ('speak', 1),
 ('all', 1),
 ('speak', 1),
 ('speak', 1),
 ('first', 1),
 ('citizen', 1)]

In [14]:
# Use reduceByKey to sum the counts for each word; sortByKey then orders the result by word
shakespeare_count_rbk = shakespeare_count.reduceByKey(lambda x, y: x + y).sortByKey()
shakespeare_count_rbk.take(10)

[('0indexgut', 1),
 ('1', 308),
 ('10', 4),
 ('100', 2),
 ('10000', 1),
 ('100000000trillion', 1),
 ('100th', 1),
 ('101', 1),
 ('102', 1),
 ('103', 1)]
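
Conceptually, `reduceByKey` merges all values that share a key by applying the given function pairwise. The helper `reduce_by_key` below is a hypothetical, single-machine stand-in written only to illustrate that idea. Spark merges values within each partition first and then across partitions, which is why the function should be associative and commutative.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: merge values sharing a key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

pairs = [('first', 1), ('citizen', 1), ('speak', 1), ('speak', 1), ('first', 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)

# sorted() on the items orders by word, like sortByKey() on (word, count) pairs
print(sorted(counts.items()))
```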

In [15]:
# To sort by frequency in descending order, first convert shakespeare_count_rbk into (count, word) pairs
shakespeare_count_rbk = shakespeare_count_rbk.map(lambda x: (x[1], x[0]))
shakespeare_count_rbk.take(15)

[(1, '0indexgut'),
 (308, '1'),
 (4, '10'),
 (2, '100'),
 (1, '10000'),
 (1, '100000000trillion'),
 (1, '100th'),
 (1, '101'),
 (1, '102'),
 (1, '103'),
 (1, '104'),
 (1, '105'),
 (1, '106'),
 (1, '107'),
 (1, '108')]

In [16]:
# Use sortByKey(False) to sort by key (here the count) in descending order
shakespeare_count_rbk.sortByKey(False).take(100)

[(27643, 'the'),
 (26728, 'and'),
 (20681, 'i'),
 (19198, 'to'),
 (18173, 'of'),
 (14613, 'a'),
 (13649, 'you'),
 (12480, 'my'),
 (11121, 'that'),
 (10967, 'in'),
 (9598, 'is'),
 (8725, 'not'),
 (8244, 'for'),
 (7996, 'with'),
 (7768, 'me'),
 (7690, 'it'),
 (7090, 'be'),
 (6882, 'your'),
 (6857, 'his'),
 (6847, 'this'),
 (6270, 'but'),
 (6251, 'he'),
 (5958, 'as'),
 (5887, 'have'),
 (5485, 'thou'),
 (5268, 'so'),
 (5192, 'him'),
 (4979, 'will'),
 (4465, 'what'),
 (4412, 'by'),
 (4032, 'thy'),
 (3887, 'all'),
 (3851, 'are'),
 (3843, 'her'),
 (3796, 'no'),
 (3754, 'do'),
 (3591, 'shall'),
 (3503, 'if'),
 (3306, 'we'),
 (3178, 'thee'),
 (3123, 'or'),
 (3068, 'our'),
 (3059, 'lord'),
 (3051, 'on'),
 (2861, 'king'),
 (2812, 'good'),
 (2779, 'now'),
 (2754, 'sir'),
 (2646, 'from'),
 (2608, 'o'),
 (2509, 'at'),
 (2507, 'come'),
 (2471, 'they'),
 (2462, 'well'),
 (2316, 'which'),
 (2295, 'would'),
 (2291, 'more'),
 (2229, 'was'),
 (2222, 'then'),
 (2208, 'she'),
 (2168, 'am'),
 (2160, 'how'),
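
Swapping each pair to `(count, word)` is one way to sort by frequency. The plain-Python sketch below, on made-up counts, shows the underlying idea of sorting by the value through a key function instead. PySpark RDDs offer `sortBy` and `takeOrdered`, both of which accept a key function and could play a similar role. Treat this as an illustration rather than a replacement for the cells above.

```python
import heapq

counts = {'the': 5, 'and': 3, 'i': 4, 'to': 1}   # made-up (word, count) data

# Full sort by count, descending, without swapping key and value
by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Only the top n entries, without sorting everything
top2 = heapq.nlargest(2, counts.items(), key=lambda kv: kv[1])

print(by_count)
print(top2)
```

Selecting only the top entries avoids ordering the whole dataset when just the most frequent words are needed.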


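At this point you should have the word counts you were after. You may still have plenty of questions, and that is fine. Next, we will peel back the layers of this program one by one and see how it actually works.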