## **Steven Miller**
### DSC 650 Winter 2019
### 2019-12-04
#### 2.2 Programming Exercise: Setup Spark, Load Data, and Work with DataFrames

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.appName('Exercise').getOrCreate()

a. Load Data and Show Schema

Load the baby-names.csv file into a Spark dataframe as a text file. Print the dataframe’s schema using the printSchema method.

In [2]:
baby_names_text = spark.read.text('baby-names/baby-names.csv')
baby_names_text.printSchema()

root
 |-- value: string (nullable = true)



b. Filtering and Counting

First, count the number of rows in the dataframe. Second, filter the dataframe so that it only contains rows that contain John. Count the number of rows in the filtered dataframe.

In [3]:
print(f'This file has {baby_names_text.count()} rows.')
johns_text = baby_names_text.filter(baby_names_text.value.contains("John"))
print(f'This file has {johns_text.count()} rows that contain John.')

This file has 5933562 rows.
This file has 21785 rows that contain John.


3. Working with DataFrames

In the previous part of the exercise, you loaded the data into the dataframe as a text file. As a consequence, Spark treated each line as a record with a single field. While this is useful for some applications (processing raw text), it is not useful when our original data contains structure. In this part of the exercise, load baby-names.csv as a CSV file instead of a text file.
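The contrast between the two loading modes can be illustrated without Spark. The following is a plain-Python sketch, not the Spark API: it reads the same few lines once as raw text and once with the standard `csv` module. The sample lines follow the layout of baby-names.csv, and the header line is an assumption based on the column names Spark reports for this file.

```python
import csv
import io

# A few lines in the layout of baby-names.csv (header line is assumed).
SAMPLE = "state,sex,year,name,count\nAK,F,1910,Mary,14\nAK,F,1910,Annie,12\n"

# Text mode: every line, including the header, is one record with a single field.
text_records = [{"value": line} for line in SAMPLE.splitlines()]

# CSV mode: the header names the fields and each later line is split into columns.
csv_records = list(csv.DictReader(io.StringIO(SAMPLE)))

print(text_records[1])
print(csv_records[0])
```

The `csv` module leaves every field as a string. In Spark, it is the `inferSchema=True` option that types columns such as `year` and `count` as integers.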

a. Load Data and Show Schema

Load the baby-names.csv file as a CSV file instead of a text file. Print the schema for this dataframe. In addition to printing the dataframe’s schema, show the first 20 rows of data using the show method.

In [4]:
baby_names_df = spark.read.csv('baby-names/baby-names.csv', header=True, inferSchema=True)
baby_names_df.printSchema()
baby_names_df.show()

root
 |-- state: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- count: integer (nullable = true)

+-----+---+----+---------+-----+
|state|sex|year|     name|count|
+-----+---+----+---------+-----+
|   AK|  F|1910|     Mary|   14|
|   AK|  F|1910|    Annie|   12|
|   AK|  F|1910|     Anna|   10|
|   AK|  F|1910| Margaret|    8|
|   AK|  F|1910|    Helen|    7|
|   AK|  F|1910|    Elsie|    6|
|   AK|  F|1910|     Lucy|    6|
|   AK|  F|1910|  Dorothy|    5|
|   AK|  F|1911|     Mary|   12|
|   AK|  F|1911| Margaret|    7|
|   AK|  F|1911|     Ruth|    7|
|   AK|  F|1911|    Annie|    6|
|   AK|  F|1911|Elizabeth|    6|
|   AK|  F|1911|    Helen|    6|
|   AK|  F|1912|     Mary|    9|
|   AK|  F|1912|    Elsie|    8|
|   AK|  F|1912|    Agnes|    7|
|   AK|  F|1912|     Anna|    7|
|   AK|  F|1912|    Helen|    7|
|   AK|  F|1912|   Louise|    7|
+-----+---+----+---------+-----+
only showing top 20 rows

b. Filtering and Counting

Filter the dataframe so that it only contains rows that contain the name John. Count the number of rows in the filtered dataframe.

In [5]:
df_Johns = baby_names_df.where(baby_names_df.name.contains("John"))
print(f'This dataframe has {df_Johns.count()} rows that contain John.')

This dataframe has 21785 rows that contain John.
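Note that `contains("John")` is a substring test, so it keeps any name that includes "John", not only the exact name John. The distinction can be sketched in plain Python; the names below are hypothetical samples chosen for illustration and are not taken from the dataset.

```python
# Hypothetical sample names, for illustration only (not taken from the dataset).
names = ["John", "Johnny", "Johnathan", "Mary", "Jon"]

# Substring match, analogous to filtering with name.contains("John").
substring_matches = [n for n in names if "John" in n]

# Exact match, analogous to filtering with name == "John".
exact_matches = [n for n in names if n == "John"]

print(substring_matches)
print(exact_matches)
```

If only the exact name is wanted, the equivalent DataFrame filter would compare the column with equality, for example `baby_names_df.name == "John"`.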


c. Sorting and Limits

For this step, filter the dataframe to include only males (sex=‘M’) born in Nebraska (state=‘NE’) in 1980 (year=1980). Sort the dataframe by descending values of count and show the first ten rows. The result should be the ten most popular boys’ names for 1980 in Nebraska.

Make sure that you save files that are output from each task, as they may be needed for other tasks. These include prepared corpus, trained models, and reports.



In [6]:
# year is an integer column, so compare it with the integer 1980 rather than a string
df_Nebraska_Males_1980 = baby_names_df.where((baby_names_df.state == 'NE') & (baby_names_df.sex == 'M') & (baby_names_df.year == 1980))
df_Nebraska_Males_1980 = df_Nebraska_Males_1980.orderBy(desc("count"))

In [7]:
# show() prints 20 rows by default; pass 10 to show only the first ten
df_Nebraska_Males_1980.show(10)

+-----+---+----+-----------+-----+
|state|sex|year|       name|count|
+-----+---+----+-----------+-----+
|   NE|  M|1980|    Matthew|  434|
|   NE|  M|1980|    Michael|  426|
|   NE|  M|1980|      Jason|  409|
|   NE|  M|1980|     Joshua|  366|
|   NE|  M|1980|Christopher|  359|
|   NE|  M|1980|     Justin|  337|
|   NE|  M|1980|       Ryan|  320|
|   NE|  M|1980|      David|  292|
|   NE|  M|1980|     Andrew|  281|
|   NE|  M|1980|      Brian|  278|
+-----+---+----+-----------+-----+
only showing top 10 rows
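The filter, sort, and limit steps can also be sketched in plain Python on a handful of rows copied from the outputs in this exercise. This is an analogy for the DataFrame operations, not Spark code, and the variable names are illustrative.

```python
# A few rows copied from the outputs shown in this exercise.
rows = [
    {"state": "AK", "sex": "F", "year": 1910, "name": "Mary", "count": 14},
    {"state": "NE", "sex": "M", "year": 1980, "name": "Jason", "count": 409},
    {"state": "NE", "sex": "M", "year": 1980, "name": "Matthew", "count": 434},
    {"state": "NE", "sex": "M", "year": 1980, "name": "Michael", "count": 426},
]

# Filter: keep only males born in Nebraska in 1980 (analogous to where).
selected = [
    r for r in rows
    if r["state"] == "NE" and r["sex"] == "M" and r["year"] == 1980
]

# Sort by count, largest first (analogous to orderBy(desc("count"))).
ranked = sorted(selected, key=lambda r: r["count"], reverse=True)

# Limit: keep the first two rows (analogous to showing the first N rows).
top_two = ranked[:2]

print([r["name"] for r in top_two])
```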

