## Data Wrangling with PySpark: Data Error

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
        .appName("Spark Data Wrangling Issue") \
        .getOrCreate()

In [3]:
spark

In [4]:
path = "spark_data/sparkify_log_small_error.json"
logs = spark.read.json(path)

In [5]:
logs.head(2)

[Row(_corrupt_record=None, artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046'),
 Row(_corrupt_record=None, artist='Lily Allen', auth='Logged In', firstName='Elizabeth', gender='F', itemInSession=7, lastName='Chase', length=195.23873, level='free', location='Shreveport-Bossier City, LA', method='PUT', page='NextSong', registration=1512718541284, sessionId=5027, song='Cheryl Tweedy', status=200, ts=1513720878284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='1000')]

The rows now carry an extra field, ```_corrupt_record```. Spark's JSON reader adds it in its default permissive mode when at least one line of the file fails to parse. It is ```None``` in both rows above, so let's look for a row where it is set:

In [6]:
logs.where(logs["_corrupt_record"].isNotNull()).collect()

[Row(_corrupt_record='{"ts":1513720980284,"userId":597dude,"sessionId":3689,"page":"Home","auth":"Logged In","method":"GET","status":200,"level":"free","itemInSession":0,"location":"Green Bay, WI","userAgent":"\\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\\"","lastName":"Short","firstName":"Alexander","registration":1513594398284,"gender":"M"}', artist=None, auth=None, firstName=None, gender=None, itemInSession=None, lastName=None, length=None, level=None, location=None, method=None, page=None, registration=None, sessionId=None, song=None, status=None, ts=None, userAgent=None, userId=None)]

The culprit is ```userId```: its value ```597dude``` is not quoted, so the line is not valid JSON and Spark could not parse it. When a line fails to parse, ```_corrupt_record``` holds the raw text of that line and every other field is ```None```.
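
To see where in the line the parse breaks, one option is to feed the raw text from ```_corrupt_record``` to Python's standard ```json``` module. The snippet below is a minimal sketch, not part of the original notebook: ```locate_json_error``` and ```sample``` are hypothetical names, and ```sample``` is a shortened copy of the corrupt line shown above. Spark parses JSON with its own parser rather than Python's, so the error wording will likely differ from Spark's, but the offending token should be the same.

```python
import json


def locate_json_error(raw):
    """Return (message, column, nearby text) if raw is not valid JSON, else None.

    Hypothetical helper for illustration only.
    """
    try:
        json.loads(raw)
    except json.JSONDecodeError as err:
        # err.pos is the 0-based offset where parsing stopped;
        # slice a little context around it for display.
        start = max(err.pos - 15, 0)
        return err.msg, err.colno, raw[start:err.pos + 15]
    return None


# Shortened copy of the corrupt line above (fields after "page" omitted).
sample = '{"ts":1513720980284,"userId":597dude,"sessionId":3689,"page":"Home"}'

result = locate_json_error(sample)
print(result)
```

In the notebook, the same helper could be applied to the ```_corrupt_record``` value of each row returned by cell 6, which is practical here because only a handful of rows are collected to the driver.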