## Data Wrangling with PySpark: Data Error

In [2]:
from pyspark.sql import SparkSession
from pyspark.context import SparkContext

In [3]:
spark = SparkSession.builder \
        .appName("Spark Data Wrangling Issue") \
        .getOrCreate()

In [4]:
spark

In [5]:
path = "spark_data/sparkify_log_small_error.json"
logs = spark.read.json(path)

In [6]:
logs.head(2)

[Row(_corrupt_record=None, artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046'),
 Row(_corrupt_record=None, artist='Lily Allen', auth='Logged In', firstName='Elizabeth', gender='F', itemInSession=7, lastName='Chase', length=195.23873, level='free', location='Shreveport-Bossier City, LA', method='PUT', page='NextSong', registration=1512718541284, sessionId=5027, song='Cheryl Tweedy', status=200, ts=1513720878284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='1000')]

As we can see, a new field ```_corrupt_record``` has appeared. In the 2 records above, its value is ```None```, so let's try to find a corrupt record:

In [7]:
logs.where(logs["_corrupt_record"].isNotNull()).collect()

[Row(_corrupt_record='{"ts":1513720980284,"userId":597dude,"sessionId":3689,"page":"Home","auth":"Logged In","method":"GET","status":200,"level":"free","itemInSession":0,"location":"Green Bay, WI","userAgent":"\\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\\"","lastName":"Short","firstName":"Alexander","registration":1513594398284,"gender":"M"}', artist=None, auth=None, firstName=None, gender=None, itemInSession=None, lastName=None, length=None, level=None, location=None, method=None, page=None, registration=None, sessionId=None, song=None, status=None, ts=None, userAgent=None, userId=None)]

It looks like the ```userId``` value, ```597dude```, is the problem: it is neither a quoted string nor a valid number, so the line is not valid JSON and Spark could not parse it. When a line cannot be parsed, ```_corrupt_record``` holds the raw text of that line and all other fields are ```None```.
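To see why that line is rejected, here is a minimal sketch using Python's standard ```json``` module rather than Spark. The two strings are shortened, hand-written copies of the log lines (an assumption for illustration), and ```is_valid_json``` is a hypothetical helper, not part of PySpark.

```python
import json

# Shortened copies of the log lines: only the fields needed for the illustration.
valid_line = '{"ts": 1513720872284, "userId": "1046", "page": "NextSong"}'
corrupt_line = '{"ts": 1513720980284, "userId": 597dude, "page": "Home"}'


def is_valid_json(line):
    """Return True if the line parses as JSON, False otherwise."""
    try:
        json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(valid_line))    # True
print(is_valid_json(corrupt_line))  # False: 597dude is not a valid JSON token
```

The same rule applies regardless of parser: an unquoted value must be a number, ```true```, ```false```, or ```null```.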

In [8]:
# Let's define an accumulator with an initial value of 0, using the SparkContext of the active session:
incorrect_records = spark.sparkContext.accumulator(0)

In [9]:
incorrect_records

Accumulator<id=0, value=0>

In [10]:
type(incorrect_records)

pyspark.accumulators.Accumulator

In [11]:
incorrect_records.value

0

In [12]:
# Create a function to increment accumulator:
def add_incorrect_record():
    global incorrect_records
    incorrect_records += 1

In [14]:
# Create a UDF that we can apply to the DataFrame and check the userId values in each row
from pyspark.sql.functions import udf
correct_userId = udf(lambda x: 1 if x.isdigit() else add_incorrect_record())

In [15]:
# Keep only the rows that parsed correctly, then add a new column userId_digit by applying correct_userId to the userId column:
logs = logs.where(logs["_corrupt_record"].isNull()).withColumn("userId_digit", correct_userId(logs.userId))

In [16]:
incorrect_records.value

0

Because of lazy evaluation, the UDF has not run yet: ```withColumn``` only records the transformation, so the accumulator is still 0. Let's force execution with an action, ```collect()```:

In [17]:
logs.collect()

[Row(_corrupt_record=None, artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046', userId_digit='1'),
 Row(_corrupt_record=None, artist='Lily Allen', auth='Logged In', firstName='Elizabeth', gender='F', itemInSession=7, lastName='Chase', length=195.23873, level='free', location='Shreveport-Bossier City, LA', method='PUT', page='NextSong', registration=1512718541284, sessionId=5027, song='Cheryl Tweedy', status=200, ts=1513720878284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='1000', userId_digit='1'),
 R

The output now includes the new ```userId_digit``` column.

In [18]:
incorrect_records.value

336

Let's see what some of these incorrect records look like:

In [19]:
logs.filter(logs["userId_digit"].isNull()).collect()

[Row(_corrupt_record=None, artist=None, auth='Logged Out', firstName=None, gender=None, itemInSession=0, lastName=None, length=None, level='free', location=None, method='PUT', page='Login', registration=None, sessionId=5598, song=None, status=307, ts=1513721196284, userAgent=None, userId='', userId_digit=None),
 Row(_corrupt_record=None, artist=None, auth='Logged Out', firstName=None, gender=None, itemInSession=26, lastName=None, length=None, level='paid', location=None, method='GET', page='Home', registration=None, sessionId=428, song=None, status=200, ts=1513721274284, userAgent=None, userId='', userId_digit=None),
 Row(_corrupt_record=None, artist=None, auth='Logged Out', firstName=None, gender=None, itemInSession=5, lastName=None, length=None, level='free', location=None, method='GET', page='Home', registration=None, sessionId=2941, song=None, status=200, ts=1513722009284, userAgent=None, userId='', userId_digit=None),
 Row(_corrupt_record=None, artist=None, auth='Logged Out', firs

We can see that for many of these records, ```userId``` is an empty string, which contains no digits, so the check fails and ```userId_digit``` is left as ```None```.
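The check itself can be tried outside Spark. The sketch below mirrors the two outcomes of the lambda (```1``` or ```None```) in plain Python. ```is_digit_id``` is a hypothetical helper, and its ```None``` guard is an addition: the original lambda would raise an error if it were ever passed a null ```userId```.

```python
def is_digit_id(user_id):
    """Return 1 for an all-digit ID, None otherwise (the lambda's two outcomes)."""
    # Hypothetical helper: the None guard is not in the original lambda,
    # where calling x.isdigit() on None would raise AttributeError.
    if user_id is not None and user_id.isdigit():
        return 1
    return None


for value in ["1046", "", "597dude", None]:
    print(repr(value), "->", is_digit_id(value))
```

Note that ```"".isdigit()``` is ```False```, which is why the logged-out rows with an empty ```userId``` end up in the filtered result.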

In [20]:
incorrect_records.value

1008

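The accumulator now reads 1008, up from 336. The statement ```logs.filter(logs["userId_digit"].isNull()).collect()``` triggered another action, so the UDF ran again and incremented the accumulator further. The increase of 672 is twice 336, which suggests the UDF was evaluated twice per row in that job. An accumulator updated inside a transformation is therefore not a reliable exact count when the DataFrame is evaluated more than once.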
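The effect can be reproduced with a plain-Python analogy. This is a toy sketch, not Spark code: ```LazyMap```, ```check```, and ```counter``` are hypothetical names that stand in for the lazily evaluated column, the UDF, and the accumulator.

```python
class LazyMap:
    """Toy stand-in for a lazily evaluated column: nothing runs until collect()."""

    def __init__(self, values, func):
        self.values = values
        self.func = func

    def collect(self):
        # Every call re-applies func to every value, like a re-executed Spark job.
        return [self.func(v) for v in self.values]


counter = {"invalid": 0}  # plays the role of the accumulator


def check(user_id):
    """Return 1 for an all-digit ID; otherwise bump the counter and return None."""
    if user_id.isdigit():
        return 1
    counter["invalid"] += 1
    return None


lazy = LazyMap(["1046", "", "1000", ""], check)
print(counter["invalid"])  # 0: defining the pipeline ran nothing
lazy.collect()
print(counter["invalid"])  # 2: the two empty IDs were seen once
lazy.collect()
print(counter["invalid"])  # 4: the same rows were counted again
```

In Spark itself, one common way to get a stable number is to compute it from the data, for example by filtering on the null ```userId_digit``` values and calling ```count()```, rather than relying on a side effect inside a UDF.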