# Note:
1. Create a directory isa460 under your home directory: mkdir isa460
2. Change to isa460: cd isa460
3. Clone the [code](https://github.com/jonesberg/DataAnalysisWithPythonAndPySpark) and [data](https://github.com/jonesberg/DataAnalysisWithPythonAndPySpark-Data) repositories for your textbook into isa460
4. Rename the directory DataAnalysisWithPythonAndPySpark-Data to Data (the capital D matters on Linux; the code in this notebook reads from isa460/Data/)

# My First PySpark Program: What are the most popular words used in the English language? (based on Jane Austen's Pride and Prejudice)

1. Read: Read the input data (we’re assuming a plain text file).

2. Token: Split each line into individual words (tokens).

3. Clean: Lowercase each word, then remove any punctuation and any tokens that aren’t words.

4. Count: Count the frequency of each word present in the text.

5. Answer: Return the top 10 (or 20, 50, 100) most frequent words.
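
Before bringing in Spark, the five steps can be previewed in plain Python. The sketch below is not PySpark: it runs the same read, token, clean, count, answer sequence on a short in-memory string. The helper name `top_words` and the `sample_text` string are made up for illustration, and the cleaning rule (keep the leading run of lowercase letters and apostrophes) only approximates what the PySpark version does.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # 1. Read: split the raw text into lines (stand-in for reading a file)
    lines = text.splitlines()
    # 2. Token: split each line on single spaces
    tokens = [tok for line in lines for tok in line.split(" ")]
    # 3. Clean: lowercase, then keep the leading run of letters/apostrophes
    cleaned = [re.match(r"[a-z']*", tok.lower()).group(0) for tok in tokens]
    # Drop tokens that are empty after cleaning
    words = [w for w in cleaned if w != ""]
    # 4. Count: tally how often each word occurs
    counts = Counter(words)
    # 5. Answer: the n most common words
    return counts.most_common(n)


# Hypothetical sample input, two lines long
sample_text = (
    "It is a truth universally acknowledged,\n"
    "that a single man in possession of a good fortune"
)
print(top_words(sample_text, 3))
```

Spark follows the same sequence, but each step becomes a transformation on a distributed data frame instead of a Python list.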

In [2]:
from pyspark.sql import SparkSession

# change the account name to your email account
account='sli'

# define a root path to the data cloned from DataAnalysisWithPythonAndPySpark-Data
root_path='/net/clusterhn/home/'+account+'/isa460/Data/'

# create a spark session
spark = SparkSession.builder.appName("My First Spark Program")\
        .config("spark.port.maxRetries", "100")\
        .getOrCreate()

# configure the log level (default is WARN)
spark.sparkContext.setLogLevel('ERROR')

# read the text file (each line of the book becomes one row)
book = spark.read.text(root_path+"gutenberg_books/1342-0.txt")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/08 17:09:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Basic Operations

In [3]:
# print the structure (or schema) of the data

book.printSchema()

root
 |-- value: string (nullable = true)



In [4]:
# show a sample of data

book.show(5, truncate=50)

+--------------------------------------------------+
|                                             value|
+--------------------------------------------------+
|The Project Gutenberg EBook of Pride and Prejud...|
|                                                  |
|This eBook is for the use of anyone anywhere at...|
|almost no restrictions whatsoever.  You may cop...|
|re-use it under the terms of the Project Gutenb...|
+--------------------------------------------------+
only showing top 5 rows



## Simple column transformations: Moving from a sentence to a list of words

### four ways to select a column

from pyspark.sql.functions import col
 
- book.select(book.value)
- book.select(book["value"])
- book.select(col("value"))
- book.select("value")

### Rename a column: alias() or withColumnRenamed()

## split the text into a list of words

## exploding a list into rows

## Working with words: changing case and removing punctuation

## use regexp_extract to remove special characters

### [Regular Expression Reference](https://docs.python.org/3/howto/regex.html)
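
The program in this notebook cleans tokens with the pattern `[a-z']*` and group 0. Because `*` also matches an empty string, a token that starts with a character outside the class (a digit or an opening quote, for example) yields an empty result, even if letters follow. The sketch below approximates this with Python's `re.search`. Spark uses Java regular expressions, so treat it as an illustration of the leftmost-match rule, not a replacement for `regexp_extract`. The helper name `extract_word` is hypothetical.

```python
import re

PATTERN = r"[a-z']*"


def extract_word(token):
    """Approximate regexp_extract(token, PATTERN, 0) with Python's re module."""
    match = re.search(PATTERN, token)
    # Return the first (leftmost) match, or "" if there is none
    return match.group(0) if match else ""


# Tokens as they might look after lowercasing
for token in ["prejudice,", "lady's", "mr.", "1813", '"truth']:
    print(repr(token), "->", repr(extract_word(token)))
```

Tokens that come back empty are the ones the later `where(col("word") != "")` step filters out.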

## Filtering Rows

## Grouping records: Counting word frequencies

![GroupedData](https://raw.githubusercontent.com/Suhong88/ISA460_Fall2023/main/images/3.1.A.png)

In [81]:
# Display number of words per letter count



## Ordering the results using orderBy

## Writing data from a data frame

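Spark writes the results to a directory rather than a single file: the path you pass to the writer becomes a folder that holds one part file per partition.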
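
Because the output is a directory, reading it back outside Spark means collecting its part files and skipping marker files such as `_SUCCESS`. Here is a minimal sketch using only the standard library. The helper name `read_spark_csv_dir` is hypothetical, and it assumes two-column rows (word, count) with no header line.

```python
import csv
import glob
import os


def read_spark_csv_dir(directory):
    """Collect (word, count) rows from every part file in a Spark CSV output directory."""
    rows = []
    # Data files are named part-<number>-<id>.csv; everything else is a marker file
    for path in sorted(glob.glob(os.path.join(directory, "part-*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle):
                # Skip blank or malformed lines instead of failing
                if len(row) != 2:
                    continue
                word, count = row
                rows.append((word, int(count)))
    return rows
```

For the program in this notebook, the call would be `read_spark_csv_dir("simple_count_single_partition.csv")`. Despite the `.csv` suffix, that path is a directory.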

## Putting it all together

In [7]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col,
    explode,
    lower,
    regexp_extract,
    split,
)

# change the account name to your email account
account='sli'

# define a root path to the data cloned from DataAnalysisWithPythonAndPySpark-Data
root_path='/net/clusterhn/home/'+account+'/isa460/Data/'

spark = SparkSession.builder.appName(
    "Analyzing the vocabulary of Pride and Prejudice."
).config("spark.port.maxRetries", "100").getOrCreate()
 
book = spark.read.text(root_path+"gutenberg_books/1342-0.txt")
 
lines = book.select(split(book.value, " ").alias("line"))
 
words = lines.select(explode(col("line")).alias("word"))
 
words_lower = words.select(lower(col("word")).alias("word"))
 
words_clean = words_lower.select(
    regexp_extract(col("word"), "[a-z']*", 0).alias("word")
)
 
words_nonull = words_clean.where(col("word") != "")
 
results = words_nonull.groupby(col("word")).count()
 
results.orderBy("count", ascending=False).show(10)

results.coalesce(1).write.mode('overwrite').csv("simple_count_single_partition.csv")

+----+-----+
|word|count|
+----+-----+
| the| 4480|
|  to| 4218|
|  of| 3711|
| and| 3504|
| her| 2199|
|   a| 1982|
|  in| 1909|
| was| 1838|
|   i| 1749|
| she| 1668|
+----+-----+
only showing top 10 rows



## Simplifying our program via method chaining
![Method Chaining 3.3](https://raw.githubusercontent.com/Suhong88/ISA460_Fall2023/main/images/3.3.png)

In [78]:
import pyspark.sql.functions as F
 
results = (
    spark.read.text(root_path+"gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby("word")
    .count()
)

results.show()

+-------------+-----+
|         word|count|
+-------------+-----+
|       online|    4|
|         some|  203|
|        still|   72|
|          few|   72|
|         hope|  122|
|        those|   60|
|     cautious|    4|
|       lady's|    8|
|    imitation|    1|
|          art|    3|
|      solaced|    1|
|       poetry|    2|
|    arguments|    5|
| premeditated|    1|
|      elevate|    1|
|       doubts|    2|
|    destitute|    1|
|    solemnity|    5|
|gratification|    1|
|    connected|   14|
+-------------+-----+
only showing top 20 rows



## Using spark-submit to launch your program in batch mode

- Open a terminal
- Activate the course conda environment: conda activate BigDataAnalysis
- Move to the directory that contains your word_count_submit.py file
- Run the program: spark-submit word_count_submit.py