In [1]:
from pyspark import SparkContext, SparkConf

In [2]:
if 'sc' not in globals():
    conf = SparkConf().setMaster('local[*]')
    sc = SparkContext(conf=conf)

In [3]:
# load the dataset into an RDD to get started
input_rdd = sc.textFile("/home/fieldengineer/Documents/courses/architect_big_data_solutions_with_spark-master/Datasets//movielens/movies.csv")

In [4]:
# Extract the year with this function. If the title contains no parenthesized
# number, return None instead of raising.
import re
def get_year(name):
    year = None
    try:
        # any run of digits wrapped in parentheses, e.g. "(1995)"
        pattern = re.compile(r"\((\d+)\)")
        year = int(pattern.findall(name)[0])
    except (ValueError, IndexError):
        # no match (IndexError) or unparsable digits (ValueError): keep None
        pass

    return year

In [5]:
movie_info_rdd = input_rdd.filter(lambda line: 'movieId' not in line)
movie_info_list_rdd = movie_info_rdd.map(lambda x: x.split(','))

In [6]:
# we can use the map operation to apply our custom function to every name in our rdd
movie_year_rdd = movie_info_list_rdd.map(lambda x: get_year(x[1]))
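
One caveat about the `split(',')` step above: a CSV title that itself contains a comma is wrapped in double quotes, so a plain split cuts it in two and `x[1]` no longer carries the year. The sketch below shows the effect on a single illustrative line in the `movieId,title,genres` layout (the line is an assumption for demonstration, not read from the file) and compares it with Python's standard `csv` module.

```python
import csv
import re

# Illustrative line in the movieId,title,genres layout. The title contains a
# comma, so it is wrapped in double quotes.
line = '11,"American President, The (1995)",Comedy|Drama|Romance'

naive_fields = line.split(',')          # ignores CSV quoting
csv_fields = next(csv.reader([line]))   # honours CSV quoting

# Same pattern as get_year
year_pattern = re.compile(r"\((\d+)\)")

print(naive_fields[1])                         # "American President
print(csv_fields[1])                           # American President, The (1995)
print(year_pattern.findall(naive_fields[1]))   # []
print(year_pattern.findall(csv_fields[1]))     # ['1995']
```

If this turns out to be the cause of the `None` values, one possible fix is to parse each line with `next(csv.reader([line]))` inside the `map` instead of `x.split(',')`.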

### Actions
Actions are RDD methods that return a value to the driver program. This section discusses the commonly used
RDD actions. Transformations such as map and filter are lazy: Spark only "goes to work" when you apply an action to your RDD.

##### collect
The collect method returns all the elements in the source RDD as a Python list. This method should be used with
caution since it moves data from all the worker nodes to the driver program. It can crash the driver program
if called on a very large RDD.

In [7]:
# let's print out all the different years we have movies from in our data
movie_year_rdd.distinct().collect()

[None,
 1994,
 1996,
 1976,
 1992,
 1982,
 1990,
 1940,
 1960,
 1968,
 1988,
 1948,
 1964,
 1950,
 6,
 1952,
 1958,
 1954,
 1934,
 1944,
 1942,
 1946,
 1936,
 1956,
 1938,
 1974,
 1986,
 1980,
 1978,
 1962,
 1984,
 1970,
 1998,
 1930,
 1932,
 1928,
 1966,
 1972,
 2000,
 1926,
 2002,
 2004,
 1916,
 1922,
 1924,
 2006,
 2008,
 500,
 2010,
 2012,
 2014,
 2016,
 1995,
 1967,
 1993,
 1977,
 1965,
 1991,
 1989,
 1937,
 1981,
 1973,
 1955,
 1997,
 1943,
 1957,
 1961,
 1959,
 1963,
 1953,
 1939,
 1941,
 1945,
 1935,
 1947,
 1975,
 1949,
 1971,
 1951,
 1979,
 1987,
 1985,
 1983,
 1933,
 1931,
 1969,
 1927,
 1929,
 1999,
 1925,
 1923,
 2001,
 2003,
 2005,
 2007,
 2009,
 2011,
 2013,
 2015]

##### count
The count method returns the number of elements in the source RDD.

In [8]:
# count the number of movies we have in our data
movie_year_rdd.count()

9125

##### countByValue
The countByValue method returns a count of each unique element in the source RDD. In PySpark it returns a
dictionary (a defaultdict) on the driver, mapping each unique element to its count.
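
Because the result is an ordinary dictionary on the driver, standard Python tools apply to it. A minimal sketch, using a hand-copied subset of the counts from this notebook's output as a stand-in for the real return value (the variable names are illustrative):

```python
from collections import Counter

# Hand-copied subset of the counts shown in this notebook's output,
# standing in for the dictionary returned by countByValue().
year_counts = {1995: 206, None: 2129, 1994: 183, 1996: 220, 1997: 213}

# Drop the None bucket (titles where no year was extracted), then rank by count.
valid = Counter({year: n for year, n in year_counts.items() if year is not None})
top_three = valid.most_common(3)

print(top_three)  # [(1996, 220), (1997, 213), (1995, 206)]
```

Since the whole dictionary is materialized on the driver, the same caution as for collect applies when the RDD has many distinct values.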

In [9]:
# count the number of movies we have from every year
movie_year_rdd.countByValue()

defaultdict(int,
            {1995: 206,
             None: 2129,
             1994: 183,
             1996: 220,
             1976: 31,
             1992: 119,
             1967: 39,
             1993: 158,
             1977: 43,
             1965: 26,
             1982: 74,
             1990: 114,
             1991: 118,
             1989: 118,
             1937: 14,
             1940: 15,
             1981: 71,
             1973: 36,
             1960: 30,
             1955: 33,
             1968: 30,
             1988: 129,
             1948: 20,
             1964: 28,
             1950: 24,
             1997: 213,
             6: 1,
             1943: 17,
             1952: 17,
             1957: 23,
             1961: 25,
             1958: 18,
             1954: 23,
             1934: 9,
             1944: 14,
             1959: 27,
             1963: 29,
             1942: 18,
             1953: 24,
             1939: 19,
             1941: 16,
             1946: 16,
          

##### first
The first method returns the first element in the source RDD.

In [12]:
movie_year_rdd.first()

1995

##### max
The max method returns the largest element in an RDD.

In [16]:
movie_year_rdd.filter(lambda x: x is not None).max()

2016

##### min
The min method returns the smallest element in an RDD.

In [14]:
movie_year_rdd.filter(lambda x: x is not None).min()

6
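
A minimum of 6 is clearly not a release year. It follows from the pattern in `get_year`, which accepts any run of digits in parentheses and keeps the first match. The sketch below contrasts that pattern with a stricter one; both the stricter pattern and the sample title are assumptions made for illustration, not part of the original notebook.

```python
import re

# Same pattern as get_year: any run of digits inside parentheses.
loose = re.compile(r"\((\d+)\)")
# Assumed stricter variant: exactly four digits in parentheses at the end of the title.
strict = re.compile(r"\((\d{4})\)\s*$")

# Illustrative title with a parenthesized number before the year.
title = "(500) Days of Summer (2009)"

loose_year = int(loose.findall(title)[0])
strict_match = strict.search(title)
strict_year = int(strict_match.group(1)) if strict_match else None

print(loose_year)   # 500
print(strict_year)  # 2009
```

Whether a four-digit, end-of-title rule fits every title in the dataset would need to be checked against the data before adopting it.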

##### take
The take method takes an integer N as input and returns a list containing the first N elements in the
source RDD.

In [17]:
movie_year_rdd.take(10)

[1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995]

##### takeOrdered
The takeOrdered method takes an integer N as input and returns a list containing the N smallest
elements in the source RDD, in ascending order. An optional key function changes the ordering.

In [18]:
movie_year_rdd.filter(lambda x: x is not None).takeOrdered(10)

[6, 500, 1916, 1922, 1922, 1922, 1923, 1923, 1923, 1924]

In [19]:
movie_year_rdd.filter(lambda x: x is not None).distinct().takeOrdered(5, key = lambda x: -x)

[2016, 2015, 2014, 2013, 2012]

##### top
The top method takes an integer N as input and returns a list containing the N largest elements in the
source RDD, in descending order.

In [20]:
movie_year_rdd.filter(lambda x: x is not None).top(3)

[2016, 2016, 2016]

In [23]:
movie_year_rdd.filter(lambda x: x is not None).distinct().top(5)

[2016, 2015, 2014, 2013, 2012]
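
The relationship between takeOrdered and top can be reproduced on a plain Python list with the standard `heapq` module. This is only a local analogy for the semantics, under the assumption that a small hand-written list stands in for the RDD; it does not involve Spark.

```python
import heapq

# Small hand-written sample standing in for the RDD's contents.
years = [1995, 2016, 6, 1922, 2016, 500, 1916, 2015, 2016, 1923]

smallest = heapq.nsmallest(3, years)               # like takeOrdered(3)
largest = heapq.nlargest(3, years)                 # like top(3)
largest_distinct = heapq.nlargest(3, set(years))   # like distinct().top(3)
# Negating the key turns "N smallest" into "N largest",
# like distinct().takeOrdered(3, key=lambda x: -x)
via_key = heapq.nsmallest(3, set(years), key=lambda x: -x)

print(smallest)          # [6, 500, 1916]
print(largest)           # [2016, 2016, 2016]
print(largest_distinct)  # [2016, 2015, 1995]
print(via_key)           # [2016, 2015, 1995]
```

This is why the two cells above that combine distinct() with takeOrdered(5, key=lambda x: -x) and with top(5) return the same list.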

### Actions on RDD of Numeric Types

RDDs containing numeric elements (int or float in Python; Integer, Long, Float, or Double in the Scala API) support a few additional actions that are useful for statistical analysis. The commonly used actions from this group are briefly described next.

##### mean
The mean method returns the average of the elements in the source RDD.

In [24]:
sc.parallelize([1,2,3,4,5,6,7,8,9,10]).mean()

5.5

##### stdev
The stdev method returns the population standard deviation of the elements in the source RDD (the sum of squared deviations is divided by N, not N - 1).

In [25]:
sc.parallelize([1,2,3,4,5,6,7,8,9,10]).stdev()

2.8722813232690143

##### sum
The sum method returns the sum of the elements in the source RDD.

In [26]:
sc.parallelize([1,2,3,4,5,6,7,8,9,10]).sum()

55

##### variance
The variance method returns the population variance of the elements in the source RDD.

In [29]:
sc.parallelize([1,2,3,4,5,6,7,8,9,10]).variance()

8.25
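
The outputs 8.25 and 2.8722813232690143 are population statistics: the squared deviations are divided by N. The sketch below checks this with Python's standard `statistics` module on the same ten values and shows the sample (N - 1) versions for comparison. PySpark also offers sampleVariance and sampleStdev for the N - 1 case.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Population statistics: divide by N (these match the Spark outputs above).
population_variance = statistics.pvariance(values)
population_stdev = statistics.pstdev(values)

# Sample statistics: divide by N - 1.
sample_variance = statistics.variance(values)
sample_stdev = statistics.stdev(values)

print(population_variance)  # 8.25
print(population_stdev)
print(sample_variance)
print(sample_stdev)
```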

### Saving an RDD

##### saveAsTextFile
The saveAsTextFile method saves the elements of the source RDD in the specified directory on any
Hadoop-supported file system. Each RDD element is converted to its string representation and stored as a
line of text.

##### saveAsObjectFile
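The saveAsObjectFile method saves the elements of the source RDD as serialized Java objects in the specified directory. It belongs to the Scala and Java APIs; the PySpark counterpart is saveAsPickleFile, which stores the elements as pickled Python objects.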

##### saveAsSequenceFile
The saveAsSequenceFile method saves an RDD of key-value pairs in SequenceFile format. An RDD of key-value pairs can also be saved in text format using the saveAsTextFile method.