### Data profiling, cleaning, ingestion

In [1]:
val filePath = "bdad_proj/yelp_academic_dataset_user.json"

In [2]:
// The JSON reader infers the schema by default, so no "inferSchema" option is needed.
val rawDF = spark.read
    .json(filePath)

In [3]:
rawDF.cache

In [4]:
rawDF.printSchema

In [5]:
val baseDF = rawDF.select(
    "average_stars",
    "compliment_cool",
    "compliment_cute",
    "compliment_funny",
    "compliment_hot",
    "compliment_list",
    "compliment_more",
    "compliment_note",
    "compliment_photos",
    "compliment_plain",
    "compliment_profile",
    "compliment_writer",
    "cool",
    "elite",
    "fans",
    "funny",
    "name",
    "review_count",
    "useful",
    "user_id",
    "yelping_since"
 )

Reformat the "yelping_since" field:
convert it from a string to a timestamp.

In [7]:
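import org.apache.spark.sql.functions.to_timestamp
import spark.implicits._  // enables the $"column" syntax

// Drop rows containing any null, then parse the string column into a timestamp.
val transfered_yelping_since = baseDF
    .na.drop()
    .withColumn("yelping_since_transfered", to_timestamp($"yelping_since", "yyyy-MM-dd HH:mm:ss"))
    .drop("yelping_since")
transfered_yelping_since.show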
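
A quick way to confirm that the pattern matches every value is to look for strings that fail to parse. The sketch below is a hypothetical helper (`unparsedRows` is not part of the original notebook) and assumes Spark's default non-ANSI mode, in which `to_timestamp` returns null for input it cannot parse instead of raising an error.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_timestamp}

// Hypothetical helper: rows whose non-null string value in `column`
// cannot be parsed with `pattern` (assumes non-ANSI mode, where a
// failed parse yields null rather than an exception).
def unparsedRows(df: DataFrame, column: String, pattern: String): DataFrame =
  df.filter(col(column).isNotNull && to_timestamp(col(column), pattern).isNull)
```

If `unparsedRows(baseDF, "yelping_since", "yyyy-MM-dd HH:mm:ss").count` returns 0, the conversion did not silently turn any value into null.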



We noticed that some users' names are badly formatted, with leading or trailing spaces.
We need to trim them.

In [9]:
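import org.apache.spark.sql.functions.{col, trim}

// Strip spaces from both ends of the name and replace the original column.
val reforated_name_DF = transfered_yelping_since
    .withColumn("reformat_name", trim(col("name")))
    .drop("name")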

Names such as 'Anaseini and Ｊｏａｎｎｅ are allowed, because they may belong to foreign users whose names are really written this way.
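
To see which names the trimming step actually affects, one option is to list the rows where the trimmed value differs from the original. This is a minimal sketch; `changedByTrim` is a hypothetical helper, not something from the original notebook. Since `trim` only removes spaces from both ends of the string, names with a leading apostrophe or full-width letters should not show up here.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, trim}

// Hypothetical helper: rows where trimming spaces from both ends
// would change the value of `column`.
def changedByTrim(df: DataFrame, column: String): DataFrame =
  df.filter(col(column) =!= trim(col(column)))
```

For example, `changedByTrim(transfered_yelping_since, "name").select("name").show` displays the affected names before they are replaced.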

In [11]:
reforated_name_DF.filter(col("user_id").isNull).count
reforated_name_DF.filter(col("reformat_name").isNull).count
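
The two checks above cover only two columns, and they run after `.na.drop()` has already removed any row with a null. A per-column null count on `baseDF` would show what the data looked like before that step. The helper below is a sketch under that assumption; `nullCounts` is a hypothetical name.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, when}

// Hypothetical helper: a single-row DataFrame holding the number of
// nulls in each column. `when` yields null unless the condition holds,
// and `count` skips nulls, so only the null cells are counted.
def nullCounts(df: DataFrame): DataFrame =
  df.select(df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*)
```

Calling `nullCounts(baseDF).show` prints one count per column; all zeros would support the claim that the source data contains no nulls.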


Summary:

The dataset provided by Yelp is already clean. There are no null values in the table.

Some users have very high "useful" and "review_count" values and have earned the "elite" title every year, while other users rarely use Yelp. These heavy users may be the influencers we are looking for. In the following analysis we should therefore take their weight into account, because they may be more valuable than ordinary customers and their reviews may be more professional.
Many users also have a large number of friends. They may treat Yelp as one of their main social media platforms and may therefore have a more positive attitude when leaving a comment. An influencer's friends may also be influencers. In the main analysis we would like to study the relationships among these frequent users.
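
One possible way to single out these heavy users is to keep everyone at or above a high quantile of a numeric column. The sketch below is only illustrative: `topUsers` is a hypothetical helper, and the 0.99 quantile and 0.01 relative error are assumptions rather than values taken from the analysis.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: users at or above the given quantile of a
// numeric column. approxQuantile returns one value per requested
// probability; it returns an empty array for an empty DataFrame,
// in which case the pattern match below would fail.
def topUsers(df: DataFrame, column: String, quantile: Double = 0.99): DataFrame = {
  val Array(threshold) = df.stat.approxQuantile(column, Array(quantile), 0.01)
  df.filter(col(column) >= threshold)
}
```

For instance, `topUsers(reforated_name_DF, "review_count")` and `topUsers(reforated_name_DF, "useful")` give two candidate sets that could be compared or joined on "user_id".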

In [13]:
val outputDir = "bdad_proj/user_data"

In [14]:
reforated_name_DF.write.mode("overwrite").parquet(outputDir)