**Audio Engagement Project EDA**
--

**Structure**
--
1. Overview
2. Data Collection
3. Descriptive Statistics
4. Data Visualization and Correlation Analysis
5. General Observations and Summary

**1. Overview**
--

The goal is to predict how long a user will listen to an audio episode, based on a range of feature data describing both the listener and the audio content.

The dataset was synthetically generated based on real-world user audio consumption patterns. Feature distributions are realistic but not identical to any publicly available dataset.

**train.csv** - the training dataset; `Listening_Time_minutes` is the target variable

**test.csv** - the test dataset; the objective is to predict `Listening_Time_minutes` for each row

**sample_submission.csv** - a sample submission file in the correct format

**2. Data Collection**
--

**Import Required Packages**

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import kaggle
import os
import zipfile
import time

**Download and Extract the Dataset**

In [46]:
kaggle.api.authenticate()
kaggle.api.competition_download_files('audio-engagement-challenge', path='data')

In [47]:
path_to_zip_file = "data/audio-engagement-challenge.zip"
dir_to_extract = "data/"
with zipfile.ZipFile(path_to_zip_file, "r") as zip_ref:
    zip_ref.extractall(dir_to_extract)
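
Re-running the extraction cell overwrites files that are already on disk. A minimal sketch of a guard, assuming it is acceptable to skip any member whose file already exists; the helper name `extract_missing` is hypothetical and not part of the notebook:

```python
import zipfile
from pathlib import Path


def extract_missing(zip_path, out_dir):
    """Extract only the archive members that are not already present in out_dir.

    Returns the list of member names that were extracted on this call.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for member in zip_ref.namelist():
            # Skip members that already exist on disk (assumption: they are intact).
            if not (out_dir / member).exists():
                zip_ref.extract(member, out_dir)
                extracted.append(member)
    return extracted
```

With the paths used above, the call would be `extract_missing("data/audio-engagement-challenge.zip", "data/")`. The existence check does not detect truncated or outdated files, so a fresh download may still call for a full `extractall`.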

**File information**

In [48]:
print(kaggle.api.competition_list_files('audio-engagement-challenge').files)

[{"ref": "", "name": "sample_submission.csv", "description": "", "totalBytes": 3500026, "url": "", "creationDate": "2025-10-11T12:36:50.840Z"}, {"ref": "", "name": "test.csv", "description": "", "totalBytes": 21277250, "url": "", "creationDate": "2025-10-11T12:36:50.840Z"}, {"ref": "", "name": "train.csv", "description": "", "totalBytes": 70036578, "url": "", "creationDate": "2025-10-11T12:36:50.840Z"}]
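
The `totalBytes` values in the listing above are easier to read in megabytes. A small sketch that copies the three sizes from that output into a plain list of dicts (the list `files` and the helper `size_in_mb` are illustrative names, and 1 MB is taken as 1,000,000 bytes):

```python
# File names and sizes copied from the listing above.
files = [
    {"name": "sample_submission.csv", "totalBytes": 3500026},
    {"name": "test.csv", "totalBytes": 21277250},
    {"name": "train.csv", "totalBytes": 70036578},
]


def size_in_mb(total_bytes):
    """Convert a byte count to megabytes (decimal, 1 MB = 1,000,000 bytes)."""
    return total_bytes / 1_000_000


sizes = {f["name"]: round(size_in_mb(f["totalBytes"]), 2) for f in files}
total_mb = round(size_in_mb(sum(f["totalBytes"] for f in files)), 2)

for name, mb in sizes.items():
    print(f"{name}: {mb} MB")
print(f"total: {total_mb} MB")
```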


**Import the CSV Data as Pandas DataFrame**

In [49]:
start = time.time()
test = pd.read_csv("data/test.csv")
train = pd.read_csv("data/train.csv")
sample_submission = pd.read_csv("data/sample_submission.csv")
print("Files loaded in", time.time()-start, "seconds")

Files loaded in 1.7737600803375244 seconds


**3. Descriptive Statistics**
--

**Train dataset**
--

In [50]:
train.shape

(750000, 12)

In [51]:
train.index

RangeIndex(start=0, stop=750000, step=1)

In [52]:
train.columns

Index(['id', 'Podcast_Name', 'Episode_Title', 'Episode_Length_minutes',
       'Genre', 'Host_Popularity_percentage', 'Publication_Day',
       'Publication_Time', 'Guest_Popularity_percentage', 'Number_of_Ads',
       'Episode_Sentiment', 'Listening_Time_minutes'],
      dtype='object')
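
For later analysis it can help to separate the numeric features from the categorical ones and to set aside `id` and the target. A minimal sketch on a toy frame that reuses a few of the column names listed above; the names `TARGET`, `ID_COLUMN`, `numeric_cols` and `categorical_cols` are illustrative, not part of the notebook:

```python
import pandas as pd

# Toy frame with a subset of the train columns; values are for illustration only.
sample = pd.DataFrame({
    "id": [0, 1],
    "Podcast_Name": ["Mystery Matters", "Joke Junction"],
    "Genre": ["True Crime", "Comedy"],
    "Host_Popularity_percentage": [74.81, 66.95],
    "Number_of_Ads": [0.0, 2.0],
    "Listening_Time_minutes": [31.41998, 88.01241],
})

TARGET = "Listening_Time_minutes"
ID_COLUMN = "id"

# Numeric feature columns, excluding the row identifier and the target.
numeric_cols = [
    col for col in sample.select_dtypes(include="number").columns
    if col not in (TARGET, ID_COLUMN)
]
# Everything that is not numeric is treated as categorical here.
categorical_cols = list(sample.select_dtypes(exclude="number").columns)
```

The same two expressions should apply to `train` unchanged, since they depend only on column dtypes.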

In [53]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 12 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           750000 non-null  int64  
 1   Podcast_Name                 750000 non-null  object 
 2   Episode_Title                750000 non-null  object 
 3   Episode_Length_minutes       662907 non-null  float64
 4   Genre                        750000 non-null  object 
 5   Host_Popularity_percentage   750000 non-null  float64
 6   Publication_Day              750000 non-null  object 
 7   Publication_Time             750000 non-null  object 
 8   Guest_Popularity_percentage  603970 non-null  float64
 9   Number_of_Ads                749999 non-null  float64
 10  Episode_Sentiment            750000 non-null  object 
 11  Listening_Time_minutes       750000 non-null  float64
dtypes: float64(5), int64(1), object(6)
memory usage: 68.7+ MB
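
The `+` in the reported memory usage means the figure is a lower bound: the string contents of the `object` columns are not counted. Columns with few distinct values, such as `Genre`, `Publication_Day` and `Episode_Sentiment`, are candidates for the `category` dtype. A minimal sketch on a toy frame, assuming those columns are low-cardinality in the real data as well; the helper `to_category` is a hypothetical name:

```python
import pandas as pd


def to_category(df, columns):
    """Return a copy of df with the given columns converted to the category dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


# Toy frame: 1,000 rows built from a handful of repeated labels.
sample = pd.DataFrame({
    "Genre": ["Comedy", "Education", "Technology", "Health"] * 250,
    "Publication_Day": ["Thursday", "Saturday", "Tuesday", "Monday"] * 250,
    "Episode_Sentiment": ["Positive", "Negative", "Negative", "Neutral"] * 250,
})

compact = to_category(sample, ["Genre", "Publication_Day", "Episode_Sentiment"])

# deep=True includes the string payload, which info() leaves out by default.
before = sample.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

On the full dataset, `train.info(memory_usage="deep")` reports the exact footprint before and after such a conversion.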


**Show top 5 records**

In [54]:
train.head()

| | id | Podcast_Name | Episode_Title | Episode_Length_minutes | Genre | Host_Popularity_percentage | Publication_Day | Publication_Time | Guest_Popularity_percentage | Number_of_Ads | Episode_Sentiment | Listening_Time_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Mystery Matters | Episode 98 | NaN | True Crime | 74.81 | Thursday | Night | NaN | 0.0 | Positive | 31.41998 |
| 1 | 1 | Joke Junction | Episode 26 | 119.8 | Comedy | 66.95 | Saturday | Afternoon | 75.95 | 2.0 | Negative | 88.01241 |
| 2 | 2 | Study Sessions | Episode 16 | 73.9 | Education | 69.97 | Tuesday | Evening | 8.97 | 0.0 | Negative | 44.92531 |
| 3 | 3 | Digital Digest | Episode 45 | 67.17 | Technology | 57.22 | Monday | Morning | 78.7 | 2.0 | Positive | 46.27824 |
| 4 | 4 | Mind & Body | Episode 86 | 110.51 | Health | 80.07 | Monday | Afternoon | 58.68 | 3.0 | Neutral | 75.61031 |


**Show last 5 records**

In [55]:
train.tail(5)

| | id | Podcast_Name | Episode_Title | Episode_Length_minutes | Genre | Host_Popularity_percentage | Publication_Day | Publication_Time | Guest_Popularity_percentage | Number_of_Ads | Episode_Sentiment | Listening_Time_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 749995 | 749995 | Learning Lab | Episode 25 | 75.66 | Education | 69.36 | Saturday | Morning | NaN | 0.0 | Negative | 56.87058 |
| 749996 | 749996 | Business Briefs | Episode 21 | 75.75 | Business | 35.21 | Saturday | Night | NaN | 2.0 | Neutral | 45.46242 |
| 749997 | 749997 | Lifestyle Lounge | Episode 51 | 30.98 | Lifestyle | 78.58 | Thursday | Morning | 84.89 | 0.0 | Negative | 15.26 |
| 749998 | 749998 | Style Guide | Episode 47 | 108.98 | Lifestyle | 45.39 | Thursday | Morning | 93.27 | 0.0 | Negative | 100.72939 |
| 749999 | 749999 | Sports Central | Episode 99 | 24.1 | Sports | 22.45 | Saturday | Night | 36.72 | 0.0 | Neutral | 11.94439 |


**Check missing values**

In [56]:
train.isna().sum()

id                                  0
Podcast_Name                        0
Episode_Title                       0
Episode_Length_minutes          87093
Genre                               0
Host_Popularity_percentage          0
Publication_Day                     0
Publication_Time                    0
Guest_Popularity_percentage    146030
Number_of_Ads                       1
Episode_Sentiment                   0
Listening_Time_minutes              0
dtype: int64
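
Raw counts are easier to judge as a share of the 750,000 rows: 87,093 missing values in `Episode_Length_minutes` is about 11.61%, and 146,030 in `Guest_Popularity_percentage` is about 19.47%. A minimal sketch of a helper that reports both the count and the percentage, shown on a toy frame with made-up gaps; `missing_summary` is a hypothetical name:

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns that have any gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame; the NaN positions are illustrative, not taken from the dataset.
sample = pd.DataFrame({
    "Episode_Length_minutes": [np.nan, 119.8, 73.9, 67.17],
    "Guest_Popularity_percentage": [np.nan, 75.95, np.nan, 78.7],
    "Number_of_Ads": [0.0, 2.0, 0.0, 2.0],
})

report = missing_summary(sample)
print(report)
```

Calling `missing_summary(train)` would give the same view for the full training set.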