# Last.fm Dataset - Advanced Analytics & Insights
**Purpose:** Deep dive analysis beyond basic data quality to understand user behavior, content patterns, and business insights.

**Prerequisites:** Run the data cleaning notebook first to get the cleaned dataset (`dqDf`).

**Analysis Areas:**
1. Temporal Analysis - Time-based patterns and trends
2. User Behavior Patterns - Listening habits and user segments
3. Content Analysis - Artist/track popularity and diversity
4. Data Quality Deep Dive - Understanding missing data patterns
5. Cross-dimensional Analysis - Correlations across different dimensions
6. Outlier Detection - Unusual patterns and anomalies
7. Business Impact Analysis - Impact on downstream use cases
8. Validation Against External Knowledge - Data consistency checks

In [1]:
import $ivy.`org.apache.spark::spark-sql:3.5.1`
import $ivy.`org.plotly-scala::plotly-almond:0.8.0`
import plotly._, plotly.element._, plotly.layout._, plotly.Almond._
init()

import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions.Window
import org.apache.logging.log4j.{LogManager, Level => LogLevel}
import org.apache.logging.log4j.core.Logger

// Suppress INFO logs
System.setProperty("log4j2.level", "WARN")

// Initialize Spark if not already done
val spark = SparkSession.builder()
  .appName("LastFM-Advanced-Analytics")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "4")
  .getOrCreate()

spark.conf.set("spark.sql.session.timeZone", "UTC")

Seq(
  "org.apache.spark",
  "org.apache.spark.sql.execution",
  "org.apache.spark.storage",
  "org.apache.hadoop",
  "org.spark_project"
).foreach { name =>
  LogManager.getLogger(name).asInstanceOf[Logger].setLevel(LogLevel.ERROR)
}

LogManager.getRootLogger.asInstanceOf[Logger].setLevel(LogLevel.ERROR)

import spark.implicits._

// Load cleaned data (assuming it's available from previous notebook)
val INPUT_PATH = "/Users/Felipe/lastfm/data/lastfm/lastfm-dataset-1k/userid-timestamp-artid-artname-traid-traname.tsv"
val PROFILE_PATH = "/Users/Felipe/lastfm/data/lastfm/lastfm-dataset-1k/userid-profile.tsv"
val SAMPLE_ROWS = 20

println("Advanced Analytics Notebook Ready!")

10:46:55.220 [scala-interpreter-1] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10:46:55.633 [scala-interpreter-1] WARN  org.apache.spark.util.Utils - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Advanced Analytics Notebook Ready!


[32mimport [39m[36m$ivy.$[39m
[32mimport [39m[36m$ivy.$[39m
[32mimport [39m[36mplotly._, plotly.element._, plotly.layout._, plotly.Almond._
[39m
[32mimport [39m[36morg.apache.spark.sql.{SparkSession, DataFrame}[39m
[32mimport [39m[36morg.apache.spark.sql.functions._[39m
[32mimport [39m[36morg.apache.spark.sql.types._[39m
[32mimport [39m[36morg.apache.spark.sql.expressions.Window[39m
[32mimport [39m[36morg.apache.logging.log4j.{LogManager, Level => LogLevel}[39m
[32mimport [39m[36morg.apache.logging.log4j.core.Logger[39m
[36mres1_10[39m: [32mString[39m = [32mnull[39m
[36mspark[39m: [32mSparkSession[39m = org.apache.spark.sql.SparkSession@8df1396
[32mimport [39m[36mspark.implicits._[39m
[36mINPUT_PATH[39m: [32mString[39m = [32m"/Users/Felipe/lastfm/data/lastfm/lastfm-dataset-1k/userid-timestamp-artid-artname-traid-traname.tsv"[39m
[36mPROFILE_PATH[39m: [32mString[39m = [32m"/Users/Felipe/lastfm/data/lastfm/lastfm-dataset-1k/use

## Load and Prepare Data
**Purpose:** Load the cleaned dataset and prepare it for advanced analysis.

In [2]:
// Load the cleaned data (rerun cleaning if needed)
val schema = StructType(Seq(
  StructField("user_id", StringType, nullable = false),
  StructField("ts_str", StringType, nullable = false),
  StructField("artist_id", StringType, nullable = true),
  StructField("artist_name", StringType, nullable = true),
  StructField("track_id", StringType, nullable = true),
  StructField("track_name", StringType, nullable = true)
))

val rawDf = spark.read
  .option("sep", "\t")
  .option("header", "false")
  .schema(schema)
  .csv(INPUT_PATH)

// Quick cleaning for analysis
val cleanDf = rawDf
  .withColumn("ts", to_timestamp(col("ts_str")))
  .drop("ts_str")
  .filter(col("ts").isNotNull)
  .withColumn("track_key",
    when(col("track_id").isNotNull && length(col("track_id")) > 0, col("track_id"))
      .otherwise(concat_ws(" — ", coalesce(col("artist_name"), lit("?")), coalesce(col("track_name"), lit("?")))))
  .filter(col("user_id") =!= "" && col("artist_name") =!= "" && col("track_name") =!= "")
  .cache()

println(s"Dataset loaded and cleaned: ${cleanDf.count()} rows")
cleanDf.printSchema()

Dataset loaded and cleaned: 19150867 rows
root
 |-- user_id: string (nullable = true)
 |-- artist_id: string (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- track_id: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- ts: timestamp (nullable = true)
 |-- track_key: string (nullable = true)



[36mschema[39m: [32mStructType[39m = [33mSeq[39m(
  [33mStructField[39m(
    name = [32m"user_id"[39m,
    dataType = StringType,
    nullable = [32mfalse[39m,
    metadata = {}
  ),
  [33mStructField[39m(
    name = [32m"ts_str"[39m,
    dataType = StringType,
    nullable = [32mfalse[39m,
    metadata = {}
  ),
  [33mStructField[39m(
    name = [32m"artist_id"[39m,
    dataType = StringType,
    nullable = [32mtrue[39m,
    metadata = {}
  ),
  [33mStructField[39m(
    name = [32m"artist_name"[39m,
    dataType = StringType,
    nullable = [32mtrue[39m,
    metadata = {}
  ),
  [33mStructField[39m(
    name = [32m"track_id"[39m,
    dataType = StringType,
    nullable = [32mtrue[39m,
    metadata = {}
  ),
  [33mStructField[39m(
    name = [32m"track_name"[39m,
    dataType = StringType,
    nullable = [32mtrue[39m,
    metadata = {}
  )
)
[36mrawDf[39m: [32mDataFrame[39m = [user_id: string, ts_str: string ... 4 more fields]
[36mcleanDf

# 1. Temporal Analysis
**Understanding time-based patterns in the data**

## 1.1 Timestamp Distribution Over Time

In [3]:
println("=== TEMPORAL ANALYSIS ===")

// Overall time range
val timeStats = cleanDf.select(
  min("ts").alias("earliest"),
  max("ts").alias("latest"),
  count("*").alias("total_plays")
).collect()(0)

println(s"Time Range: ${timeStats.getAs[java.sql.Timestamp]("earliest")} to ${timeStats.getAs[java.sql.Timestamp]("latest")}")
println(s"Total Plays: ${timeStats.getAs[Long]("total_plays")}")

// Plays by year
val yearlyPlays = cleanDf
  .withColumn("year", year(col("ts")))
  .groupBy("year")
  .agg(count("*").alias("plays"),
       countDistinct("user_id").alias("active_users"))
  .orderBy("year")

println("\nPlays by Year:")
yearlyPlays.show()

// Daily activity patterns
val dailyPattern = cleanDf
  .withColumn("hour", hour(col("ts")))
  .groupBy("hour")
  .agg(count("*").alias("plays"),
       countDistinct("user_id").alias("active_users"))
  .orderBy("hour")

println("\nDaily Activity Pattern (by hour):")
dailyPattern.show(24)

=== TEMPORAL ANALYSIS ===
Time Range: 2005-02-14 04:00:07.0 to 2013-09-29 22:32:04.0
Total Plays: 19150867

Plays by Year:
+----+-------+------------+
|year|  plays|active_users|
+----+-------+------------+
|2005|1070656|         241|
|2006|4255308|         573|
|2007|5358216|         732|
|2008|5929147|         834|
|2009|2537538|         921|
|2010|      1|           1|
|2013|      1|           1|
+----+-------+------------+


Daily Activity Pattern (by hour):
+----+-------+------------+
|hour|  plays|active_users|
+----+-------+------------+
|   0| 750333|         904|
|   1| 683450|         864|
|   2| 646730|         837|
|   3| 608097|         810|
|   4| 564569|         803|
|   5| 517194|         816|
|   6| 484337|         853|
|   7| 489678|         884|
|   8| 526726|         896|
|   9| 591386|         893|
|  10| 666702|         894|
|  11| 730988|         902|
|  12| 781963|         930|
|  13| 851940|         951|
|  14| 922142|         961|
|  15| 992910|         969|
|

[36mtimeStats[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m = [2005-02-14 04:00:07.0,2013-09-29 22:32:04.0,19150867]
[36myearlyPlays[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [year: int, plays: bigint ... 1 more field]
[36mdailyPattern[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [hour: int, plays: bigint ... 1 more field]

## 1.2 Data Freshness and Gaps Analysis

In [4]:
// Data gaps analysis
val dailyActivity = cleanDf
  .withColumn("date", to_date(col("ts")))
  .groupBy("date")
  .agg(count("*").alias("daily_plays"))
  .orderBy("date")

// Find days with no activity or very low activity
val avgDailyPlays = dailyActivity.select(avg("daily_plays")).collect()(0).getDouble(0)
val lowActivityThreshold = avgDailyPlays * 0.1

val lowActivityDays = dailyActivity
  .filter(col("daily_plays") < lowActivityThreshold)
  .count()

println(f"Average daily plays: ${avgDailyPlays}%.0f")
println(s"Days with very low activity (< ${lowActivityThreshold.toInt} plays): $lowActivityDays")

// Weekly patterns
val weeklyPattern = cleanDf
  .withColumn("day_of_week", dayofweek(col("ts")))
  .groupBy("day_of_week")
  .agg(count("*").alias("plays"),
       countDistinct("user_id").alias("active_users"))
  .orderBy("day_of_week")

println("\nWeekly Activity Pattern (1=Sunday, 7=Saturday):")
weeklyPattern.show()

Average daily plays: 12052
Days with very low activity (< 1205 plays): 11

Weekly Activity Pattern (1=Sunday, 7=Saturday):
+-----------+-------+------------+
|day_of_week|  plays|active_users|
+-----------+-------+------------+
|          1|2673782|         966|
|          2|2799685|         974|
|          3|2827204|         973|
|          4|2802703|         972|
|          5|2770779|         978|
|          6|2699098|         971|
|          7|2577616|         971|
+-----------+-------+------------+



[36mdailyActivity[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [date: date, daily_plays: bigint]
[36mavgDailyPlays[39m: [32mDouble[39m = [32m12052.150409062304[39m
[36mlowActivityThreshold[39m: [32mDouble[39m = [32m1205.2150409062303[39m
[36mlowActivityDays[39m: [32mLong[39m = [32m11L[39m
[36mweeklyPattern[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [day_of_week: int, plays: bigint ... 1 more field]

# 2. User Behavior Patterns
**Understanding how users interact with the platform**

## 2.1 User Activity Distribution

In [5]:
println("=== USER BEHAVIOR ANALYSIS ===")

// User activity statistics
val userActivity = cleanDf
  .groupBy("user_id")
  .agg(
    count("*").alias("total_plays"),
    countDistinct("track_key").alias("unique_tracks"),
    countDistinct("artist_name").alias("unique_artists"),
    min("ts").alias("first_play"),
    max("ts").alias("last_play")
  )
  .withColumn("listening_span_days", 
    datediff(col("last_play"), col("first_play")))

// Activity distribution statistics
val activityStats = userActivity.select(
  count("*").cast("double").alias("total_users"),
  avg("total_plays").alias("avg_plays_per_user"),
  stddev("total_plays").alias("stddev_plays"),
  min("total_plays").cast("double").alias("min_plays"),
  max("total_plays").cast("double").alias("max_plays"),
  expr("percentile_approx(total_plays, 0.5)").cast("double").alias("median_plays"),
  expr("percentile_approx(total_plays, 0.9)").cast("double").alias("p90_plays"),
  expr("percentile_approx(total_plays, 0.95)").cast("double").alias("p95_plays")
).collect()(0)

println("User Activity Statistics:")
println(f"Total Users: ${activityStats.getAs[Double]("total_users")}%.0f")
println(f"Avg Plays per User: ${activityStats.getAs[Double]("avg_plays_per_user")}%.0f")
println(f"Std Dev Plays: ${activityStats.getAs[Double]("stddev_plays")}%.0f")
println(f"Min Plays: ${activityStats.getAs[Double]("min_plays")}%.0f")
println(f"Max Plays: ${activityStats.getAs[Double]("max_plays")}%.0f")
println(f"Median Plays: ${activityStats.getAs[Double]("median_plays")}%.0f")
println(f"90th Percentile: ${activityStats.getAs[Double]("p90_plays")}%.0f")
println(f"95th Percentile: ${activityStats.getAs[Double]("p95_plays")}%.0f")

// User segments based on activity
val userSegments = userActivity
  .withColumn("user_segment",
    when(col("total_plays") >= 10000, "Power User")
    .when(col("total_plays") >= 1000, "Active User")
    .when(col("total_plays") >= 100, "Regular User")
    .otherwise("Casual User"))
  .groupBy("user_segment")
  .agg(count("*").alias("user_count"),
       avg("total_plays").alias("avg_plays"),
       sum("total_plays").alias("total_segment_plays"))
  .orderBy(desc("avg_plays"))

println("\nUser Segments:")
userSegments.show()

=== USER BEHAVIOR ANALYSIS ===
User Activity Statistics:
Total Users: 992
Avg Plays per User: 19305
Std Dev Plays: 23210
Min Plays: 2
Max Plays: 183103
Median Plays: 11547
90th Percentile: 47168
95th Percentile: 64528

User Segments:
+------------+----------+------------------+-------------------+
|user_segment|user_count|         avg_plays|total_segment_plays|
+------------+----------+------------------+-------------------+
|  Power User|       543|32113.714548802946|           17437747|
| Active User|       347| 4827.780979827089|            1675240|
|Regular User|        77|477.87012987012986|              36796|
| Casual User|        25|             43.36|               1084|
+------------+----------+------------------+-------------------+



[36muserActivity[39m: [32mDataFrame[39m = [user_id: string, total_plays: bigint ... 5 more fields]
[36mactivityStats[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m = [992.0,19305.30947580645,23210.40094288239,2.0,183103.0,11547.0,47168.0,64528.0]
[36muserSegments[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [user_segment: string, user_count: bigint ... 2 more fields]

## 2.2 User Diversity and Repeat Listening

In [6]:
// Listening diversity analysis
val diversityStats = userActivity.select(
  avg("unique_tracks").alias("avg_unique_tracks"),
  avg("unique_artists").alias("avg_unique_artists"),
  avg(expr("unique_tracks / total_plays")).alias("avg_track_diversity_ratio"),
  avg(expr("unique_artists / total_plays")).alias("avg_artist_diversity_ratio")
).collect()(0)

println("\nListening Diversity:")
println(f"Avg Unique Tracks per User: ${diversityStats.getAs[Double]("avg_unique_tracks")}%.1f")
println(f"Avg Unique Artists per User: ${diversityStats.getAs[Double]("avg_unique_artists")}%.1f")
println(f"Avg Track Diversity Ratio: ${diversityStats.getAs[Double]("avg_track_diversity_ratio")}%.3f")
println(f"Avg Artist Diversity Ratio: ${diversityStats.getAs[Double]("avg_artist_diversity_ratio")}%.3f")

// Most repeated tracks per user
val repeatListening = cleanDf
  .groupBy("user_id", "track_key")
  .agg(count("*").alias("play_count"))
  .filter(col("play_count") > 1)

val repeatStats = repeatListening
  .groupBy("user_id")
  .agg(
    count("*").alias("repeated_tracks"),
    max("play_count").alias("max_repeats"),
    avg("play_count").alias("avg_repeats")
  )

val globalRepeatStats = repeatStats.select(
  avg("repeated_tracks").alias("avg_repeated_tracks_per_user"),
  avg("max_repeats").alias("avg_max_repeats"),
  max("max_repeats").alias("global_max_repeats")
).collect()(0)

println("\nRepeat Listening Behavior:")
println(f"Avg Repeated Tracks per User: ${globalRepeatStats.getAs[Double]("avg_repeated_tracks_per_user")}%.1f")
println(f"Avg Max Repeats per User: ${globalRepeatStats.getAs[Double]("avg_max_repeats")}%.1f")
println(f"Global Max Repeats for Single Track: ${globalRepeatStats.getAs[Long]("global_max_repeats")}")


Listening Diversity:
Avg Unique Tracks per User: 4658.2
Avg Unique Artists per User: 905.3
Avg Track Diversity Ratio: 0.386
Avg Artist Diversity Ratio: 0.121

Repeat Listening Behavior:
Avg Repeated Tracks per User: 2481.7
Avg Max Repeats per User: 99.1
Global Max Repeats for Single Track: 2069


[36mdiversityStats[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m = [4658.195564516129,905.3145161290323,0.38639857227225816,0.1213913398148491]
[36mrepeatListening[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [user_id: string, track_key: string ... 1 more field]
[36mrepeatStats[39m: [32mDataFrame[39m = [user_id: string, repeated_tracks: bigint ... 2 more fields]
[36mglobalRepeatStats[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m = [2481.6796714579054,99.11088295687885,2069]

## 2.3 Geographic Analysis

In [7]:
// Load profile data for geographic analysis
val profileSchema = StructType(Seq(
  StructField("user_id", StringType, nullable = false),
  StructField("gender", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true),
  StructField("country", StringType, nullable = true),
  StructField("signup", StringType, nullable = true)
))

val profileDf = spark.read
  .option("sep", "\t")
  .option("header", "false")
  .schema(profileSchema)
  .csv(PROFILE_PATH)

// Geographic distribution
val geoStats = userActivity
  .join(profileDf, "user_id")
  .filter(col("country").isNotNull && col("country") =!= "")
  .groupBy("country")
  .agg(
    count("*").alias("user_count"),
    sum("total_plays").alias("total_country_plays"),
    avg("total_plays").alias("avg_plays_per_user")
  )
  .orderBy(desc("user_count"))

println("\nTop Countries by User Count:")
geoStats.show(20)

// Activity by gender
val genderStats = userActivity
  .join(profileDf, "user_id")
  .filter(col("gender").isNotNull && col("gender") =!= "")
  .groupBy("gender")
  .agg(
    count("*").alias("user_count"),
    avg("total_plays").alias("avg_plays"),
    avg("unique_artists").alias("avg_unique_artists")
  )

println("\nActivity by Gender:")
genderStats.show()


Top Countries by User Count:
+------------------+----------+-------------------+------------------+
|           country|user_count|total_country_plays|avg_plays_per_user|
+------------------+----------+-------------------+------------------+
|     United States|       228|            5023398|22032.447368421053|
|    United Kingdom|       126|            2389084|18960.984126984127|
|            Poland|        50|             974331|          19486.62|
|           Germany|        36|             543944|15109.555555555555|
|            Norway|        35|             606405| 17325.85714285714|
|           Finland|        32|             826280|          25821.25|
|            Canada|        32|             842514|        26328.5625|
|            Turkey|        28|             609155|21755.535714285714|
|             Italy|        27|             362323| 13419.37037037037|
|            Sweden|        24|             446313|         18596.375|
|       Netherlands|        23|             366

[36mprofileSchema[39m: [32mStructType[39m = [33mSeq[39m(
  [33mStructField[39m(
    name = [32m"user_id"[39m,
    dataType = StringType,
    nullable = [32mfalse[39m,
    metadata = {}
  ),
  [33mStructField[39m(
    name = [32m"gender"[39m,
    dataType = StringType,
    nullable = [32mtrue[39m,
    metadata = {}
  ),
  [33mStructField[39m(
    name = [32m"age"[39m,
    dataType = IntegerType,
    nullable = [32mtrue[39m,
    metadata = {}
  ),
  [33mStructField[39m(
    name = [32m"country"[39m,
    dataType = StringType,
    nullable = [32mtrue[39m,
    metadata = {}
  ),
  [33mStructField[39m(
    name = [32m"signup"[39m,
    dataType = StringType,
    nullable = [32mtrue[39m,
    metadata = {}
  )
)
[36mprofileDf[39m: [32mDataFrame[39m = [user_id: string, gender: string ... 3 more fields]
[36mgeoStats[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql

# 3. Content Analysis
**Understanding the music content and popularity patterns**

## 3.1 Artist and Track Popularity

In [8]:
println("=== CONTENT ANALYSIS ===")

// Artist popularity
val artistStats = cleanDf
  .groupBy("artist_name", "artist_id")
  .agg(
    count("*").alias("total_plays"),
    countDistinct("user_id").alias("unique_listeners"),
    countDistinct("track_key").alias("unique_tracks")
  )
  .orderBy(desc("total_plays"))

println("Top 20 Artists by Play Count:")
artistStats.show(20, truncate = false)

// Track popularity
val trackStats = cleanDf
  .groupBy("track_key", "artist_name", "track_name")
  .agg(
    count("*").alias("total_plays"),
    countDistinct("user_id").alias("unique_listeners")
  )
  .orderBy(desc("total_plays"))

println("\nTop 20 Tracks by Play Count:")
trackStats.show(20, truncate = false)

// Content diversity metrics
val contentMetrics = cleanDf.select(
  countDistinct("artist_name").alias("unique_artists"),
  countDistinct("track_key").alias("unique_tracks"),
  count("*").alias("total_plays")
).collect()(0)

val uniqueArtists = contentMetrics.getAs[Long]("unique_artists")
val uniqueTracks = contentMetrics.getAs[Long]("unique_tracks")
val totalPlays = contentMetrics.getAs[Long]("total_plays")

println(f"\nContent Diversity Metrics:")
println(f"Unique Artists: $uniqueArtists")
println(f"Unique Tracks: $uniqueTracks")
println(f"Avg Tracks per Artist: ${uniqueTracks.toDouble / uniqueArtists}%.1f")
println(f"Avg Plays per Track: ${totalPlays.toDouble / uniqueTracks}%.1f")
println(f"Avg Plays per Artist: ${totalPlays.toDouble / uniqueArtists}%.1f")

=== CONTENT ANALYSIS ===
Top 20 Artists by Play Count:
+---------------------+------------------------------------+-----------+----------------+-------------+
|artist_name          |artist_id                           |total_plays|unique_listeners|unique_tracks|
+---------------------+------------------------------------+-----------+----------------+-------------+
|Radiohead            |a74b1b7f-71a5-4011-9441-d0b5e4122711|115209     |710             |1087         |
|The Beatles          |b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d|100338     |615             |1347         |
|Nine Inch Nails      |b7ffd2af-418f-4be2-bdd1-22f8b48613da|84421      |479             |921          |
|Muse                 |9c9f1380-2516-4fc9-a3e6-f9f61941d090|63346      |594             |528          |
|Coldplay             |cc197bad-dc9c-440d-a5b5-d52ba2e14234|62251      |637             |573          |
|Depeche Mode         |8538e728-ca0b-4321-b7e5-cff6565dd4c0|59910      |558             |1306         |
|Pink Flo

[36martistStats[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [artist_name: string, artist_id: string ... 3 more fields]
[36mtrackStats[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [track_key: string, artist_name: string ... 3 more fields]
[36mcontentMetrics[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m = [174089,1505194,19150867]
[36muniqueArtists[39m: [32mLong[39m = [32m174089L[39m
[36muniqueTracks[39m: [32mLong[39m = [32m1505194L[39m
[36mtotalPlays[39m: [32mLong[39m = [32m19150867L[39m

## 3.2 Long Tail Analysis

In [9]:
// Simple but bulletproof approach
def simpleLongTailAnalysis(): Unit = {
  println("=== LONG TAIL ANALYSIS  ===")
  
  // Artists
  println("\nAnalyzing Artists...")
  val totalArtistPlays = artistStats.select(sum("total_plays")).collect()(0).getLong(0)
  val totalArtistsCount = artistStats.count()
  
  val topArtists = artistStats
    .orderBy(desc("total_plays"))
    .select("total_plays")
    .limit(50000) // Reasonable limit for analysis
    .collect()
    .map(_.getLong(0))
  
  val target80PctArtists = (totalArtistPlays * 0.8).toLong
  var cumulativeArtists = 0L
  var artists80Pct = 0
  
  for ((plays, index) <- topArtists.zipWithIndex) {
    cumulativeArtists += plays
    if (cumulativeArtists >= target80PctArtists) {
      artists80Pct = index + 1
      // Break equivalent - return from function or use labeled break
      println(s"Artists Analysis:")
      println(s"Total Artists: $totalArtistsCount")
      println(f"Artists accounting for 80%% of plays: $artists80Pct (${(artists80Pct.toDouble/totalArtistsCount)*100}%.1f%% of all artists)")
      return // Early return to simulate break
    }
  }
  
  // Tracks  
  println("\nAnalyzing Tracks...")
  val totalTrackPlays = trackStats.select(sum("total_plays")).collect()(0).getLong(0)
  val totalTracksCount = trackStats.count()
  
  val topTracks = trackStats
    .orderBy(desc("total_plays"))
    .select("total_plays")
    .limit(100000) // Reasonable limit for analysis
    .collect()
    .map(_.getLong(0))
  
  val target80PctTracks = (totalTrackPlays * 0.8).toLong
  var cumulativeTracks = 0L
  var tracks80Pct = 0
  
  for ((plays, index) <- topTracks.zipWithIndex) {
    cumulativeTracks += plays
    if (cumulativeTracks >= target80PctTracks) {
      tracks80Pct = index + 1
      println(s"Tracks Analysis:")
      println(s"Total Tracks: $totalTracksCount")
      println(f"Tracks accounting for 80%% of plays: $tracks80Pct (${(tracks80Pct.toDouble/totalTracksCount)*100}%.1f%% of all tracks)")
      return
    }
  }
}

// Execute
simpleLongTailAnalysis()

=== LONG TAIL ANALYSIS  ===

Analyzing Artists...
Artists Analysis:
Total Artists: 177022
Artists accounting for 80% of plays: 5654 (3.2% of all artists)


defined [32mfunction[39m [36msimpleLongTailAnalysis[39m

# 4. Data Quality Deep Dive
**Understanding patterns in missing and inconsistent data**

## 4.1 MBID Coverage Analysis

In [10]:
println("=== DATA QUALITY DEEP DIVE ===")

// MBID coverage by artist popularity
val mbidCoverage = artistStats
  .withColumn("has_mbid", when(col("artist_id").isNotNull && col("artist_id") =!= "", 1).otherwise(0))
  .withColumn("popularity_tier",
    when(col("total_plays") >= 10000, "Very Popular (10k+)")
    .when(col("total_plays") >= 1000, "Popular (1k-10k)")
    .when(col("total_plays") >= 100, "Moderate (100-1k)")
    .otherwise("Niche (<100)"))

val mbidByTier = mbidCoverage
  .groupBy("popularity_tier")
  .agg(
    count("*").alias("artist_count"),
    sum("has_mbid").alias("artists_with_mbid"),
    avg("has_mbid").alias("mbid_coverage_rate"),
    sum("total_plays").alias("total_tier_plays")
  )
  .withColumn("mbid_coverage_pct", col("mbid_coverage_rate") * 100)
  .orderBy(desc("total_tier_plays"))

println("MBID Coverage by Artist Popularity:")
mbidByTier.show()

// Track MBID coverage
val trackMbidCoverage = cleanDf
  .withColumn("has_track_mbid", when(col("track_id").isNotNull && col("track_id") =!= "", 1).otherwise(0))
  .agg(
    count("*").alias("total_plays"),
    sum("has_track_mbid").alias("plays_with_track_mbid"),
    avg("has_track_mbid").alias("track_mbid_rate")
  )
  .withColumn("track_mbid_coverage_pct", col("track_mbid_rate") * 100)

println("\nOverall Track MBID Coverage:")
trackMbidCoverage.show()

=== DATA QUALITY DEEP DIVE ===
MBID Coverage by Artist Popularity:
+-------------------+------------+-----------------+------------------+----------------+------------------+
|    popularity_tier|artist_count|artists_with_mbid|mbid_coverage_rate|total_tier_plays| mbid_coverage_pct|
+-------------------+------------+-----------------+------------------+----------------+------------------+
|   Popular (1k-10k)|        2612|             2603|0.9965543644716692|         7360513| 99.65543644716692|
|Very Popular (10k+)|         288|              288|               1.0|         6154978|             100.0|
|  Moderate (100-1k)|       12939|            12198|0.9427312775330396|         3913792| 94.27312775330397|
|       Niche (<100)|      161183|            92439|0.5735034091684607|         1721584|57.350340916846065|
+-------------------+------------+-----------------+------------------+----------------+------------------+


Overall Track MBID Coverage:
+-----------+---------------------+---

[36mmbidCoverage[39m: [32mDataFrame[39m = [artist_name: string, artist_id: string ... 5 more fields]
[36mmbidByTier[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [popularity_tier: string, artist_count: bigint ... 4 more fields]
[36mtrackMbidCoverage[39m: [32mDataFrame[39m = [total_plays: bigint, plays_with_track_mbid: bigint ... 2 more fields]

## 4.2 Data Quality by Time Period

In [11]:
// Data quality trends over time
val qualityByYear = cleanDf
  .withColumn("year", year(col("ts")))
  .withColumn("has_artist_mbid", when(col("artist_id").isNotNull && col("artist_id") =!= "", 1).otherwise(0))
  .withColumn("has_track_mbid", when(col("track_id").isNotNull && col("track_id") =!= "", 1).otherwise(0))
  .groupBy("year")
  .agg(
    count("*").alias("total_plays"),
    avg("has_artist_mbid").alias("artist_mbid_rate"),
    avg("has_track_mbid").alias("track_mbid_rate"),
    countDistinct("user_id").alias("active_users")
  )
  .withColumn("artist_mbid_pct", col("artist_mbid_rate") * 100)
  .withColumn("track_mbid_pct", col("track_mbid_rate") * 100)
  .orderBy("year")

println("Data Quality Trends by Year:")
qualityByYear.select("year", "total_plays", "artist_mbid_pct", "track_mbid_pct", "active_users").show()

// Missing data clustering by user
val userQuality = cleanDf
  .withColumn("has_artist_mbid", when(col("artist_id").isNotNull && col("artist_id") =!= "", 1).otherwise(0))
  .withColumn("has_track_mbid", when(col("track_id").isNotNull && col("track_id") =!= "", 1).otherwise(0))
  .groupBy("user_id")
  .agg(
    count("*").alias("total_plays"),
    avg("has_artist_mbid").alias("user_artist_mbid_rate"),
    avg("has_track_mbid").alias("user_track_mbid_rate")
  )

val qualitySegments = userQuality
  .withColumn("quality_segment",
    when(col("user_artist_mbid_rate") >= 0.8, "High Quality (80%+ MBID)")
    .when(col("user_artist_mbid_rate") >= 0.5, "Medium Quality (50-80% MBID)")
    .when(col("user_artist_mbid_rate") >= 0.2, "Low Quality (20-50% MBID)")
    .otherwise("Very Low Quality (<20% MBID)"))
  .groupBy("quality_segment")
  .agg(
    count("*").alias("user_count"),
    avg("total_plays").alias("avg_plays_per_user"),
    sum("total_plays").alias("segment_total_plays")
  )
  .orderBy(desc("avg_plays_per_user"))

println("\nUsers by Data Quality:")
qualitySegments.show()

Data Quality Trends by Year:
+----+-----------+-----------------+-----------------+------------+
|year|total_plays|  artist_mbid_pct|   track_mbid_pct|active_users|
+----+-----------+-----------------+-----------------+------------+
|2005|    1070656|97.35274448562376|89.92561569729213|         241|
|2006|    4255308|97.23615775873333|89.38629119208292|         573|
|2007|    5358216|96.76689032319712|88.32598387224404|         732|
|2008|    5929147| 96.6840255436406|88.40188985026009|         834|
|2009|    2537538| 96.5964647623011|88.33944555707146|         921|
|2010|          1|            100.0|            100.0|           1|
|2013|          1|            100.0|            100.0|           1|
+----+-----------+-----------------+-----------------+------------+


Users by Data Quality:
+--------------------+----------+------------------+-------------------+
|     quality_segment|user_count|avg_plays_per_user|segment_total_plays|
+--------------------+----------+------------------+

[36mqualityByYear[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [year: int, total_plays: bigint ... 5 more fields]
[36muserQuality[39m: [32mDataFrame[39m = [user_id: string, total_plays: bigint ... 2 more fields]
[36mqualitySegments[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [quality_segment: string, user_count: bigint ... 2 more fields]

## 4.3 Data Consistency Analysis

In [12]:
// Artist name variations for same MBID
val artistNameVariations = cleanDf
  .filter(col("artist_id").isNotNull && col("artist_id") =!= "")
  .groupBy("artist_id")
  .agg(
    countDistinct("artist_name").alias("name_variations"),
    collect_set("artist_name").alias("all_names"),
    count("*").alias("total_plays")
  )
  .filter(col("name_variations") > 1)
  .orderBy(desc("total_plays"))

println("Artists with Multiple Name Variations (same MBID):")
artistNameVariations.show(20, truncate = false)

val totalVariations = artistNameVariations.count()
val avgVariations = artistNameVariations.select(avg("name_variations")).collect()(0).getDouble(0)

println(f"\nArtist Name Consistency:")
println(f"Artists with multiple name variations: $totalVariations")
println(f"Average variations per inconsistent artist: ${avgVariations}%.1f")

// Country name standardization check
val countryVariations = profileDf
  .filter(col("country").isNotNull && col("country") =!= "")
  .groupBy("country")
  .agg(count("*").alias("user_count"))
  .orderBy(desc("user_count"))

println("\nCountry Distribution (check for variations):")
countryVariations.show(30)

Artists with Multiple Name Variations (same MBID):
+------------------------------------+---------------+-------------------------------------------------------------+-----------+
|artist_id                           |name_variations|all_names                                                    |total_plays|
+------------------------------------+---------------+-------------------------------------------------------------+-----------+
|b9472588-93f3-4922-a1a2-74082cdf9ce8|2              |[Panic At The Disco, Panic! At The Disco]                    |12532      |
|0da0f48c-3689-4c38-bf4a-c5b50d516689|2              |[ムック, Mucc]                                               |12075      |
|1fda852b-92e9-4562-82fa-c52820a77b23|2              |[Pussycat Dolls, The Pussycat Dolls]                         |6121       |
|127f591a-7e27-4435-92db-0780f219f3a1|2              |[The B-52'S, The B-52S]                                      |3607       |
|e5257dc5-1edd-4fca-b7e6-1158e00522c8|2          

[36martistNameVariations[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [artist_id: string, name_variations: bigint ... 2 more fields]
[36mtotalVariations[39m: [32mLong[39m = [32m131L[39m
[36mavgVariations[39m: [32mDouble[39m = [32m2.0[39m
[36mcountryVariations[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [country: string, user_count: bigint]

# 5. Cross-dimensional Analysis
**Understanding correlations across different data dimensions**

In [13]:
println("=== CROSS-DIMENSIONAL ANALYSIS ===")

// Profile completeness vs activity
val profileCompleteness = userActivity
  .join(profileDf, "user_id")
  .withColumn("profile_score",
    (when(col("gender").isNotNull && col("gender") =!= "", 1).otherwise(0) +
     when(col("age").isNotNull, 1).otherwise(0) +
     when(col("country").isNotNull && col("country") =!= "", 1).otherwise(0))
  )
  .withColumn("profile_completeness",
    when(col("profile_score") === 3, "Complete (3/3)")
    .when(col("profile_score") === 2, "Mostly Complete (2/3)")
    .when(col("profile_score") === 1, "Incomplete (1/3)")
    .otherwise("Empty (0/3)"))

val profileVsActivity = profileCompleteness
  .groupBy("profile_completeness")
  .agg(
    count("*").alias("user_count"),
    avg("total_plays").alias("avg_plays"),
    avg("unique_artists").alias("avg_unique_artists"),
    avg("listening_span_days").alias("avg_listening_span_days")
  )
  .orderBy(desc("avg_plays"))

println("Profile Completeness vs User Activity:")
profileVsActivity.show()

// Geographic patterns in music taste
val topCountries = profileDf
  .filter(col("country").isNotNull && col("country") =!= "")
  .groupBy("country")
  .count()
  .filter(col("count") >= 10) // Only countries with at least 10 users
  .select("country")
  .collect()
  .map(_.getString(0))

if (topCountries.nonEmpty) {
  val countryMusicTaste = cleanDf
    .join(profileDf, "user_id")
    .filter(col("country").isin(topCountries.toSeq: _*))  // Only change: fixed deprecation warning
    .groupBy("country", "artist_name")
    .agg(count("*").alias("plays"))
    .withColumn("country_rank", 
      row_number().over(Window.partitionBy("country").orderBy(desc("plays"))))
    .filter(col("country_rank") <= 3)
    .orderBy("country", "country_rank")
  
  println(f"\nTop 3 Artists by Country (countries with 10+ users):")
  countryMusicTaste.show(50, truncate = false)
}

=== CROSS-DIMENSIONAL ANALYSIS ===
Profile Completeness vs User Activity:
+--------------------+----------+------------------+------------------+-----------------------+
|profile_completeness|user_count|         avg_plays|avg_unique_artists|avg_listening_span_days|
+--------------------+----------+------------------+------------------+-----------------------+
|Mostly Complete (...|       604| 19863.28642384106| 936.1456953642385|      854.9188741721854|
|      Complete (3/3)|       265|19759.909433962264| 811.6867924528302|      915.3849056603774|
|    Incomplete (1/3)|        74|18198.108108108107| 1116.837837837838|      937.9864864864865|
|         Empty (0/3)|        49|11640.938775510203| 712.1836734693877|      867.4285714285714|
+--------------------+----------+------------------+------------------+-----------------------+


Top 3 Artists by Country (countries with 10+ users):
+------------------+---------------------------------+-----+------------+
|country           |artist_na

[36mprofileCompleteness[39m: [32mDataFrame[39m = [user_id: string, total_plays: bigint ... 11 more fields]
[36mprofileVsActivity[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [profile_completeness: string, user_count: bigint ... 3 more fields]
[36mtopCountries[39m: [32mArray[39m[[32mString[39m] = [33mArray[39m(
  [32m"Mexico"[39m,
  [32m"Australia"[39m,
  [32m"Norway"[39m,
  [32m"Finland"[39m,
  [32m"Sweden"[39m,
  [32m"Turkey"[39m,
  [32m"Italy"[39m,
  [32m"Brazil"[39m,
  [32m"Netherlands"[39m,
  [32m"Spain"[39m,
  [32m"Russian Federation"[39m,
  [32m"Poland"[39m,
  [32m"Germany"[39m,
  [32m"France"[39m,
  [32m"United States"[39m,
  [32m"United Kingdom"[39m,
  [32m"Canada"[39m
)

# 6. Outlier Detection
**Identifying unusual patterns and potential data anomalies**

In [14]:
println("=== OUTLIER DETECTION ===")

// First, calculate user stats with separate aggregations
val userStatsStep1 = cleanDf
  .groupBy("user_id")
  .agg(
    count("*").alias("total_plays"),
    countDistinct("track_key").alias("unique_tracks"),
    countDistinct("artist_name").alias("unique_artists"),
    min("ts").alias("first_play"),
    max("ts").alias("last_play")
  )

// Then calculate listening span
val userStats = userStatsStep1
  .withColumn("listening_span_days", 
    datediff(col("last_play"), col("first_play")))
  .withColumn("plays_per_day", 
    col("total_plays") / (col("listening_span_days") + 1))
  .withColumn("track_repetition_rate", 
    col("total_plays").cast("double") / col("unique_tracks").cast("double"))

// Statistical thresholds for outliers
val statsDF = userStats.select(
  expr("percentile_approx(total_plays, 0.95)").cast("double").alias("p95_plays"),
  expr("percentile_approx(plays_per_day, 0.95)").cast("double").alias("p95_plays_per_day"),
  expr("percentile_approx(track_repetition_rate, 0.95)").cast("double").alias("p95_repetition_rate")
).collect()(0)

val p95Plays = statsDF.getAs[Double]("p95_plays")
val p95PlaysPerDay = statsDF.getAs[Double]("p95_plays_per_day")
val p95RepetitionRate = statsDF.getAs[Double]("p95_repetition_rate")

// Identify outliers
val outlierUsers = userStats
  .filter(
    col("total_plays") > p95Plays * 2 || // Extremely high play count
    col("plays_per_day") > p95PlaysPerDay * 2 || // Extremely high daily activity
    col("track_repetition_rate") > p95RepetitionRate * 2 // Extremely high repetition
  )
  .withColumn("outlier_reason",
    when(col("total_plays") > p95Plays * 2, "Extreme play count")
    .when(col("plays_per_day") > p95PlaysPerDay * 2, "Extreme daily activity")
    .when(col("track_repetition_rate") > p95RepetitionRate * 2, "Extreme repetition")
    .otherwise("Multiple factors"))
  .orderBy(desc("total_plays"))

println(f"Outlier Detection Thresholds:")
println(f"95th percentile plays: $p95Plays%.0f, outlier threshold: ${p95Plays * 2}%.0f")
println(f"95th percentile plays/day: $p95PlaysPerDay%.1f, outlier threshold: ${p95PlaysPerDay * 2}%.1f")
println(f"95th percentile repetition: $p95RepetitionRate%.1f, outlier threshold: ${p95RepetitionRate * 2}%.1f")

println("\nOutlier Users:")
outlierUsers.show(20)

// Temporal outliers
val hourlyActivity = cleanDf
  .withColumn("hour", hour(col("ts")))
  .groupBy("hour")
  .count()

val avgHourlyPlays = hourlyActivity.select(avg("count")).collect()(0).getDouble(0)
val stddevHourlyPlays = hourlyActivity.select(stddev("count")).collect()(0).getDouble(0)

val temporalOutliers = hourlyActivity
  .filter(abs(col("count") - avgHourlyPlays) > stddevHourlyPlays * 2)
  .orderBy(desc("count"))

println(f"\nTemporal Outliers (hours with unusual activity):")
println(f"Average hourly plays: $avgHourlyPlays%.0f ± $stddevHourlyPlays%.0f")
temporalOutliers.show()

=== OUTLIER DETECTION ===
Outlier Detection Thresholds:
95th percentile plays: 64528, outlier threshold: 129056
95th percentile plays/day: 72.8, outlier threshold: 145.6
95th percentile repetition: 10.7, outlier threshold: 21.3

Outlier Users:
+-----------+-----------+-------------+--------------+-------------------+-------------------+-------------------+------------------+---------------------+--------------------+
|    user_id|total_plays|unique_tracks|unique_artists|         first_play|          last_play|listening_span_days|     plays_per_day|track_repetition_rate|      outlier_reason|
+-----------+-----------+-------------+--------------+-------------------+-------------------+-------------------+------------------+---------------------+--------------------+
|user_000949|     183103|         6295|           852|2005-05-30 06:15:32|2009-04-28 14:51:57|               1429|128.04405594405594|    29.08705321683876|  Extreme play count|
|user_000791|     158686|        15439|         

[36muserStatsStep1[39m: [32mDataFrame[39m = [user_id: string, total_plays: bigint ... 4 more fields]
[36muserStats[39m: [32mDataFrame[39m = [user_id: string, total_plays: bigint ... 7 more fields]
[36mstatsDF[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m = [64528.0,72.79347826086956,10.664852417835462]
[36mp95Plays[39m: [32mDouble[39m = [32m64528.0[39m
[36mp95PlaysPerDay[39m: [32mDouble[39m = [32m72.79347826086956[39m
[36mp95RepetitionRate[39m: [32mDouble[39m = [32m10.664852417835462[39m
[36moutlierUsers[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [user_id: string, total_plays: bigint ... 8 more fields]
[36mhourlyActivity[39m: [32mDataFrame[39m = [hour: int, count: bigint]
[36mavgHourlyPlays[39m: [32mDouble[39m = [32m797952.7916666666[39m
[36mstddevHourlyPlays[39m: [32mDouble[39m = [32m223153

# 7. Business Impact Analysis
**Understanding how data quality affects key business use cases**

## 7.1 Impact on Sessionization

In [15]:
println("=== BUSINESS IMPACT ANALYSIS ===")

// Session analysis preview
val sessionPreview = cleanDf
  .select("user_id", "ts")
  .orderBy("user_id", "ts")
  .withColumn("prev_ts", lag("ts", 1).over(Window.partitionBy("user_id").orderBy("ts")))
  .withColumn("gap_minutes", 
    when(col("prev_ts").isNull, 0)
    .otherwise((unix_timestamp(col("ts")) - unix_timestamp(col("prev_ts"))) / 60.0))
  .withColumn("is_new_session", when(col("gap_minutes") > 20, 1).otherwise(0))

val sessionStats = sessionPreview
  .groupBy("user_id")
  .agg(
    count("*").alias("total_plays"),
    sum("is_new_session").alias("session_count"),
    avg("gap_minutes").alias("avg_gap_minutes"),
    expr("percentile_approx(gap_minutes, 0.5)").alias("median_gap_minutes")
  )
  .withColumn("avg_plays_per_session", col("total_plays") / col("session_count"))

val globalSessionStats = sessionStats.select(
  avg("session_count").alias("avg_sessions_per_user"),
  avg("avg_plays_per_session").alias("avg_plays_per_session"),
  avg("avg_gap_minutes").alias("avg_gap_between_plays")
).collect()(0)

println("Sessionization Impact Analysis:")
println(f"Avg Sessions per User: ${globalSessionStats.getAs[Double]("avg_sessions_per_user")}%.1f")
println(f"Avg Plays per Session: ${globalSessionStats.getAs[Double]("avg_plays_per_session")}%.1f")
println(f"Avg Gap Between Plays: ${globalSessionStats.getAs[Double]("avg_gap_between_plays")}%.1f minutes")

// Gap distribution analysis
val gapDistribution = sessionPreview
  .filter(col("gap_minutes") > 0)
  .withColumn("gap_category",
    when(col("gap_minutes") <= 1, "≤1 min")
    .when(col("gap_minutes") <= 5, "1-5 min")
    .when(col("gap_minutes") <= 20, "5-20 min")
    .when(col("gap_minutes") <= 60, "20-60 min")
    .when(col("gap_minutes") <= 1440, "1-24 hours")
    .otherwise(">24 hours"))
  .groupBy("gap_category")
  .count()
  .orderBy(when(col("gap_category") === "≤1 min", 1)
           .when(col("gap_category") === "1-5 min", 2)
           .when(col("gap_category") === "5-20 min", 3)
           .when(col("gap_category") === "20-60 min", 4)
           .when(col("gap_category") === "1-24 hours", 5)
           .otherwise(6))

println("\nGap Distribution (impact on 20-minute session boundary):")
gapDistribution.show()

=== BUSINESS IMPACT ANALYSIS ===
Sessionization Impact Analysis:
Avg Sessions per User: 1049.3
Avg Plays per Session: 21.5
Avg Gap Between Plays: 2444.1 minutes

Gap Distribution (impact on 20-minute session boundary):
+------------+--------+
|gap_category|   count|
+------------+--------+
|      ≤1 min|  342609|
|     1-5 min|13467936|
|    5-20 min| 4256186|
|   20-60 min|  298206|
|  1-24 hours|  632326|
|   >24 hours|  110359|
+------------+--------+



[36msessionPreview[39m: [32mDataFrame[39m = [user_id: string, ts: timestamp ... 3 more fields]
[36msessionStats[39m: [32mDataFrame[39m = [user_id: string, total_plays: bigint ... 4 more fields]
[36mglobalSessionStats[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m = [1049.2852822580646,21.544705374301056,2444.1168248476833]
[36mgapDistribution[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [gap_category: string, count: bigint]

## 7.2 Impact on Content Recommendations

In [16]:
// Content matching impact - Optimized to avoid window operations
val contentMatchingStats = cleanDf
  .withColumn("has_artist_mbid", col("artist_id").isNotNull && col("artist_id") =!= "")
  .withColumn("has_track_mbid", col("track_id").isNotNull && col("track_id") =!= "")
  .withColumn("matching_quality",
    when(col("has_track_mbid"), "High (Track MBID)")
    .when(col("has_artist_mbid"), "Medium (Artist MBID only)")
    .otherwise("Low (Name matching only)"))
  .groupBy("matching_quality")
  .agg(
    count("*").alias("play_count"),
    countDistinct("user_id").alias("affected_users"),
    countDistinct("track_key").alias("affected_tracks")
  )

// Calculate total plays separately to avoid window operation
val totalPlays = contentMatchingStats.select(sum("play_count")).collect()(0).getLong(0)

val contentMatchingImpact = contentMatchingStats
  .withColumn("play_percentage", col("play_count") * 100.0 / totalPlays)
  .orderBy(desc("play_count"))

println("Content Matching Quality Impact:")
contentMatchingImpact.show()

// Artist disambiguation challenges
val artistAmbiguity = cleanDf
  .filter(col("artist_id").isNull || col("artist_id") === "")
  .groupBy("artist_name")
  .agg(
    count("*").alias("plays_without_mbid"),
    countDistinct("user_id").alias("users_affected")
  )
  .filter(col("plays_without_mbid") >= 100) // Focus on high-impact cases
  .orderBy(desc("plays_without_mbid"))

println("\nTop Artists Without MBID (disambiguation challenges):")
artistAmbiguity.show(20, truncate = false)

// Additional insights from the data
println("\nContent Quality Insights:")
val qualityInsights = contentMatchingImpact.collect()
qualityInsights.foreach { row =>
  val quality = row.getAs[String]("matching_quality")
  val playCount = row.getAs[Long]("play_count")
  val percentage = row.getAs[Double]("play_percentage")
  val users = row.getAs[Long]("affected_users")
  val tracks = row.getAs[Long]("affected_tracks")
  
  println(f"$quality:")
  println(f"  - $playCount plays (${percentage}%.1f%% of total)")
  println(f"  - $users users affected")
  println(f"  - $tracks unique tracks")
  println()
}

// Disambiguation impact analysis
val disambiguationStats = artistAmbiguity.select(
  count("*").alias("artists_needing_disambiguation"),
  sum("plays_without_mbid").alias("total_ambiguous_plays"),
  sum("users_affected").alias("total_affected_users"),
  avg("plays_without_mbid").alias("avg_plays_per_ambiguous_artist"),
  max("plays_without_mbid").alias("max_plays_ambiguous_artist")
).collect()(0)

println("Disambiguation Challenge Summary:")
println(s"Artists needing disambiguation: ${disambiguationStats.getAs[Long]("artists_needing_disambiguation")}")
println(s"Total plays affected: ${disambiguationStats.getAs[Long]("total_ambiguous_plays")}")
println(s"Users affected: ${disambiguationStats.getAs[Long]("total_affected_users")}")
println(f"Average plays per ambiguous artist: ${disambiguationStats.getAs[Double]("avg_plays_per_ambiguous_artist")}%.0f")
println(s"Most problematic artist plays: ${disambiguationStats.getAs[Long]("max_plays_ambiguous_artist")}")

Content Matching Quality Impact:
+--------------------+----------+--------------+---------------+-----------------+
|    matching_quality|play_count|affected_users|affected_tracks|  play_percentage|
+--------------------+----------+--------------+---------------+-----------------+
|   High (Track MBID)|  16982280|           992|         961416|88.67629857175658|
|Medium (Artist MB...|   1566422|           981|         373345|8.179379032813502|
|Low (Name matchin...|    602165|           961|         170434| 3.14432239542993|
+--------------------+----------+--------------+---------------+-----------------+


Top Artists Without MBID (disambiguation challenges):
+---------------------------+------------------+--------------+
|artist_name                |plays_without_mbid|users_affected|
+---------------------------+------------------+--------------+
|Eri Nobuchika              |2243              |3             |
|Remiss                     |2194              |3             |
|Leonel Nu

[36mcontentMatchingStats[39m: [32mDataFrame[39m = [matching_quality: string, play_count: bigint ... 2 more fields]
[36mtotalPlays[39m: [32mLong[39m = [32m19150867L[39m
[36mcontentMatchingImpact[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [matching_quality: string, play_count: bigint ... 3 more fields]
[36martistAmbiguity[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [artist_name: string, plays_without_mbid: bigint ... 1 more field]
[36mqualityInsights[39m: [32mArray[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [33mArray[39m(
  [High (Track MBID),16982280,992,961416,88.67629857175658],
  [Medium (Artist MBID only),1566422,981,373345,8.179379032813502],
  [Low (Name matching only),602165,961,170434,3.1

# 8. Validation Against External Knowledge
**Cross-referencing data with external standards and expectations**

In [17]:
println("=== EXTERNAL VALIDATION ===")

// MBID format validation
val mbidFormatValidation = cleanDf
  .filter(col("artist_id").isNotNull && col("artist_id") =!= "")
  .withColumn("valid_artist_mbid", 
    col("artist_id").rlike("^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"))
  .agg(
    count("*").alias("total_with_artist_id"),
    sum(when(col("valid_artist_mbid"), 1).otherwise(0)).alias("valid_format_count"),
    avg(when(col("valid_artist_mbid"), 1).otherwise(0)).alias("valid_format_rate")
  )
  .withColumn("valid_format_pct", col("valid_format_rate") * 100)

println("Artist MBID Format Validation:")
mbidFormatValidation.show()

// Track MBID format validation
val trackMbidValidation = cleanDf
  .filter(col("track_id").isNotNull && col("track_id") =!= "")
  .withColumn("valid_track_mbid", 
    col("track_id").rlike("^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"))
  .agg(
    count("*").alias("total_with_track_id"),
    sum(when(col("valid_track_mbid"), 1).otherwise(0)).alias("valid_format_count"),
    avg(when(col("valid_track_mbid"), 1).otherwise(0)).alias("valid_format_rate")
  )
  .withColumn("valid_format_pct", col("valid_format_rate") * 100)

println("\nTrack MBID Format Validation:")
trackMbidValidation.show()

// Country code standardization check
val suspiciousCountries = profileDf
  .filter(col("country").isNotNull && col("country") =!= "")
  .withColumn("country_clean", trim(col("country")))
  .withColumn("suspicious_country",
    length(col("country_clean")) > 50 || // Too long
    col("country_clean").rlike("[0-9]") || // Contains numbers
    col("country_clean") === upper(col("country_clean")) // All caps (might be code) - FIXED
  )
  .filter(col("suspicious_country"))
  .groupBy("country")
  .count()
  .orderBy(desc("count"))

println("\nSuspicious Country Entries:")
suspiciousCountries.show(20, truncate = false)

// Age validation
val ageValidation = profileDf
  .filter(col("age").isNotNull)
  .withColumn("suspicious_age",
    col("age") < 13 || col("age") > 100 // Unrealistic ages
  )
  .agg(
    count("*").alias("total_with_age"),
    sum(when(col("suspicious_age"), 1).otherwise(0)).alias("suspicious_age_count"),
    min("age").alias("min_age"),
    max("age").alias("max_age"),
    avg("age").alias("avg_age")
  )
  .withColumn("suspicious_age_pct", col("suspicious_age_count") * 100.0 / col("total_with_age"))

println("\nAge Validation:")
ageValidation.show()

=== EXTERNAL VALIDATION ===
Artist MBID Format Validation:
+--------------------+------------------+-----------------+----------------+
|total_with_artist_id|valid_format_count|valid_format_rate|valid_format_pct|
+--------------------+------------------+-----------------+----------------+
|            18548702|          18548702|              1.0|           100.0|
+--------------------+------------------+-----------------+----------------+


Track MBID Format Validation:
+-------------------+------------------+-----------------+----------------+
|total_with_track_id|valid_format_count|valid_format_rate|valid_format_pct|
+-------------------+------------------+-----------------+----------------+
|           16982280|          16982280|              1.0|           100.0|
+-------------------+------------------+-----------------+----------------+


Suspicious Country Entries:
+-------+-----+
|country|count|
+-------+-----+
+-------+-----+


Age Validation:
+--------------+----------------

[36mmbidFormatValidation[39m: [32mDataFrame[39m = [total_with_artist_id: bigint, valid_format_count: bigint ... 2 more fields]
[36mtrackMbidValidation[39m: [32mDataFrame[39m = [total_with_track_id: bigint, valid_format_count: bigint ... 2 more fields]
[36msuspiciousCountries[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [country: string, count: bigint]
[36mageValidation[39m: [32mDataFrame[39m = [total_with_age: bigint, suspicious_age_count: bigint ... 4 more fields]

# Summary & Key Insights
**Comprehensive analysis findings and recommendations**

In [18]:
println("=== ANALYSIS SUMMARY ===")

// Generate comprehensive summary
val totalUsers = cleanDf.select(countDistinct("user_id")).collect()(0).getLong(0)
val totalPlays = cleanDf.count()
val totalArtists = cleanDf.select(countDistinct("artist_name")).collect()(0).getLong(0)
val totalTracks = cleanDf.select(countDistinct("track_key")).collect()(0).getLong(0)

val artistMbidCoverage = cleanDf
  .select(avg(when(col("artist_id").isNotNull && col("artist_id") =!= "", 1).otherwise(0)))
  .collect()(0).getDouble(0) * 100

val trackMbidCoverage = cleanDf
  .select(avg(when(col("track_id").isNotNull && col("track_id") =!= "", 1).otherwise(0)))
  .collect()(0).getDouble(0) * 100

println("🎵 DATASET OVERVIEW:")
println(f"   Users: $totalUsers")
println(f"   Total Plays: $totalPlays")
println(f"   Unique Artists: $totalArtists")
println(f"   Unique Tracks: $totalTracks")
println(f"   Artist MBID Coverage: ${artistMbidCoverage}%.1f%%")
println(f"   Track MBID Coverage: ${trackMbidCoverage}%.1f%%")

println("\n🔍 KEY FINDINGS:")
println("   • Long-tail distribution in both artists and tracks")
println("   • MBID coverage varies significantly by popularity")
println("   • Clear user segments from casual to power users")
println("   • Strong temporal patterns in listening behavior")
println("   • Geographic diversity with potential standardization issues")
println("   • Data quality impacts downstream use cases differently")

println("\n💡 RECOMMENDATIONS:")
println("   1. Prioritize MBID enrichment for popular content")
println("   2. Implement artist name standardization")
println("   3. Monitor temporal data quality trends")
println("   4. Consider user segment-specific strategies")
println("   5. Validate country and age data entries")
println("   6. Review sessionization rules against actual usage patterns")

println("\n✅ Analysis complete! Use these insights for data strategy and product decisions.")

=== ANALYSIS SUMMARY ===
🎵 DATASET OVERVIEW:
   Users: 992
   Total Plays: 19150867
   Unique Artists: 174089
   Unique Tracks: 1505194
   Artist MBID Coverage: 96.9%
   Track MBID Coverage: 88.7%

🔍 KEY FINDINGS:
   • Long-tail distribution in both artists and tracks
   • MBID coverage varies significantly by popularity
   • Clear user segments from casual to power users
   • Strong temporal patterns in listening behavior
   • Geographic diversity with potential standardization issues
   • Data quality impacts downstream use cases differently

💡 RECOMMENDATIONS:
   1. Prioritize MBID enrichment for popular content
   2. Implement artist name standardization
   3. Monitor temporal data quality trends
   4. Consider user segment-specific strategies
   5. Validate country and age data entries
   6. Review sessionization rules against actual usage patterns

✅ Analysis complete! Use these insights for data strategy and product decisions.


[36mtotalUsers[39m: [32mLong[39m = [32m992L[39m
[36mtotalPlays[39m: [32mLong[39m = [32m19150867L[39m
[36mtotalArtists[39m: [32mLong[39m = [32m174089L[39m
[36mtotalTracks[39m: [32mLong[39m = [32m1505194L[39m
[36martistMbidCoverage[39m: [32mDouble[39m = [32m96.85567760457006[39m
[36mtrackMbidCoverage[39m: [32mDouble[39m = [32m88.67629857175658[39m

## Cleanup

In [19]:
// Cleanup cached DataFrames
cleanDf.unpersist()
spark.catalog.clearCache()
println("Cache cleared. Analysis notebook complete!")

Cache cleared. Analysis notebook complete!


[36mres19_0[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mDataset[39m[[32morg[39m.[32mapache[39m.[32mspark[39m.[32msql[39m.[32mRow[39m] = [user_id: string, artist_id: string ... 5 more fields]