Dataset Description: The synthetic dataset contains information about various podcast episodes and their attributes. The goal is to analyze and predict the average listening duration of podcast episodes based on various features.

Columns in the Dataset:

Podcast_Name (Type: string)
Description: Names of popular podcasts.
Example Values: "Tech Talk", "Health Hour", "Comedy Central"

Episode_Title (Type: string)
Description: Titles of the podcast episodes.
Example Values: "The Future of AI", "Meditation Tips", "Stand-Up Special"

Episode_Length (Type: float, minutes)
Description: Length of the episode in minutes.
Example Values: 5.0, 10.0, 30.0, 45.0, 60.0, 90.0

Genre (Type: string)
Description: Genre of the podcast episode.
Possible Values: "Technology", "Education", "Comedy", "Health", "True Crime", "Business", "Sports", "Lifestyle", "News", "Music"

Host_Popularity (Type: float, scale 0-100)
Description: A score indicating the popularity of the host.
Example Values: 50.0, 75.0, 90.0

Publication_Day (Type: string)
Description: Day of the week the episode was published.
Possible Values: "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"

Publication_Time (Type: string)
Description: Time of the day the episode was published.
Possible Values: "Morning", "Afternoon", "Evening", "Night"

Guest_Popularity (Type: float, scale 0-100)
Description: A score indicating the popularity of the guest (if any).
Example Values: 20.0, 50.0, 85.0

Number_of_Ads (Type: int)
Description: Number of advertisements within the episode.
Example Values: 0, 1, 2, 3

Episode_Sentiment (Type: string)
Description: Sentiment of the episode's content.
Possible Values: "Positive", "Neutral", "Negative"

Listening_Time (Type: float, minutes)
Description: The actual average listening duration (target variable).
Example Values: 4.5, 8.0, 30.0, 60.0

In [18]:
# import required modules for this project

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [19]:
# Load csv file
df = pd.read_csv('/Users/sa26/Documents/GitHub/Predict-Podcast-Listening-Time/data/raw/train.csv')

# print out the first 5 rows of data
df.head()

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
0,0,Mystery Matters,Episode 98,,True Crime,74.81,Thursday,Night,,0.0,Positive,31.41998
1,1,Joke Junction,Episode 26,119.8,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241
2,2,Study Sessions,Episode 16,73.9,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.7,2.0,Positive,46.27824
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031


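After opening the df in Data Wrangler, I noticed that several columns have missing values, and I checked how many distinct values each column holds.
'Episode_Length_minutes', 'Guest_Popularity_percentage', and 'Number_of_Ads' have missing values.
A missing guest popularity score may simply mean the episode had no guest, since guests are not always present on a podcast.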

There are 48 distinct podcast names and 100 distinct episode titles. The most frequent podcast name is "Tech Talks" with 22,847 rows. The most frequent episode title is "Episode 71" with 10,515 rows.
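
The counts above came from Data Wrangler, but they could also be reproduced in pandas. Below is a minimal sketch on a made-up toy frame (not the real data); the helper name `summarize_categorical` is my own and not part of any library.

```python
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({
    "Podcast_Name": ["Tech Talks", "Tech Talks", "Joke Junction", "Mind & Body"],
    "Episode_Title": ["Episode 71", "Episode 26", "Episode 71", "Episode 71"],
})

def summarize_categorical(frame, column):
    """Return (number of distinct values, most frequent value, its row count)."""
    counts = frame[column].value_counts()  # sorted from most to least frequent
    return frame[column].nunique(), counts.index[0], int(counts.iloc[0])

name_summary = summarize_categorical(toy, "Podcast_Name")
title_summary = summarize_categorical(toy, "Episode_Title")
print(name_summary)
print(title_summary)
```

On the real data, the same helper would be called as `summarize_categorical(df, "Podcast_Name")`.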

There are some outliers in episode length and host popularity. The mean and median episode length are both a little over one hour (~64 minutes). The middle half of episodes run roughly 36 to 94 minutes. There is an episode that lasted over five hours (~325 minutes)! I would want to investigate the episodes reported to be zero minutes long. The median listening time is ~43 minutes, while the mean is ~45. The max is ~120 minutes! Since typical listening time is well below typical episode length, this suggests most listeners don't completely finish a podcast.
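
Two of the points above can be checked directly: which episodes report a length of zero, and how often listening time falls short of episode length. This is only a sketch on made-up toy values, using the real column names; rows with a missing episode length are left out of the comparison, which is my own assumption about how to handle them.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only (not taken from the real data).
toy = pd.DataFrame({
    "Episode_Length_minutes": [0.0, 119.8, 73.9, np.nan, 30.98],
    "Listening_Time_minutes": [12.5, 88.01, 44.93, 31.42, 35.0],
})

# Episodes reported as zero minutes long.
zero_length = toy[toy["Episode_Length_minutes"] == 0]

# Share of episodes (with a known length) where listeners stop early.
known = toy.dropna(subset=["Episode_Length_minutes"])
unfinished = known["Listening_Time_minutes"] < known["Episode_Length_minutes"]
share_unfinished = unfinished.mean()

print(len(zero_length))
print(share_unfinished)
```

Rows where listening time exceeds episode length (as in the zero-minute toy row) would also be worth a look, since they point to inconsistent records.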

Host popularity mean and median are both ~60%. The middle half of values fall between ~39% and ~80%. The maximum is reported to be almost 120% even though the scale is 0-100.

Guest popularity mean and median are ~52% and ~54%. The middle half of values fall between ~28% and ~77%. The maximum is almost 120%, so there are values above the stated 0-100 scale that are most likely outliers or data errors.
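
To see how many popularity scores fall outside the documented 0-100 scale, something like the following could be used. It is a sketch on made-up toy values; `count_out_of_range` is a hypothetical helper, and missing values are deliberately not counted as out of range.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "Host_Popularity_percentage": [74.81, 119.46, 57.22, 1.3],
    "Guest_Popularity_percentage": [np.nan, 75.95, 119.91, 8.97],
})

def count_out_of_range(series, low=0.0, high=100.0):
    """Count non-missing values outside the inclusive range [low, high]."""
    in_range = series.between(low, high)  # NaN evaluates to False here
    return int((~in_range & series.notna()).sum())

host_bad = count_out_of_range(toy["Host_Popularity_percentage"])
guest_bad = count_out_of_range(toy["Guest_Popularity_percentage"])
print(host_bad, guest_bad)
```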

Sports, technology, and true crime are among the most popular genres out of ten.

Podcasts are posted every day, with most published on Sunday, Monday, and Friday.

Number_of_Ads takes 12 distinct values, including some high outliers (over 100 ads!). The column is heavily skewed to the right. Most podcasts have one ad per episode. This column is supposed to be int, not float, so some adjustments will be made.
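
The distinct values, the most common value, and the direction of the skew can be confirmed with a few pandas calls. This sketch uses made-up toy values, so the numbers will not match the real column.

```python
import pandas as pd

# Toy ad counts for illustration, including one extreme fractional outlier.
ads = pd.Series([0.0, 1.0, 1.0, 1.0, 2.0, 3.0, 103.91], name="Number_of_Ads")

distinct_values = ads.nunique()              # how many different values occur
value_table = ads.value_counts().sort_index()  # frequency of each value
most_common = ads.mode().iloc[0]             # most frequent value
skewness = ads.skew()                        # positive means a long right tail

print(value_table)
print(distinct_values, most_common, skewness > 0)
```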

Podcasts are pretty evenly posted throughout night, evening, afternoon, and morning.

Episode sentiment is pretty evenly split between neutral, negative, and positive. 

There are 750,000 rows x 12 columns in the dataset. 

In [20]:
# Drop id (not in the original list of variables)
df.drop('id', axis=1, inplace=True)
df.head()

Unnamed: 0,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
0,Mystery Matters,Episode 98,,True Crime,74.81,Thursday,Night,,0.0,Positive,31.41998
1,Joke Junction,Episode 26,119.8,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241
2,Study Sessions,Episode 16,73.9,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531
3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.7,2.0,Positive,46.27824
4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031


In [21]:
# Print out general information on this dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 11 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Podcast_Name                 750000 non-null  object 
 1   Episode_Title                750000 non-null  object 
 2   Episode_Length_minutes       662907 non-null  float64
 3   Genre                        750000 non-null  object 
 4   Host_Popularity_percentage   750000 non-null  float64
 5   Publication_Day              750000 non-null  object 
 6   Publication_Time             750000 non-null  object 
 7   Guest_Popularity_percentage  603970 non-null  float64
 8   Number_of_Ads                749999 non-null  float64
 9   Episode_Sentiment            750000 non-null  object 
 10  Listening_Time_minutes       750000 non-null  float64
dtypes: float64(5), object(6)
memory usage: 62.9+ MB


In [22]:
df.describe()

Unnamed: 0,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,Listening_Time_minutes
count,662907.0,750000.0,603970.0,749999.0,750000.0
mean,64.504738,59.859901,52.236449,1.348855,45.437406
std,32.969603,22.873098,28.451241,1.15113,27.138306
min,0.0,1.3,0.0,0.0,0.0
25%,35.73,39.41,28.38,0.0,23.17835
50%,63.84,60.05,53.58,1.0,43.37946
75%,94.07,79.53,76.6,2.0,64.81158
max,325.24,119.46,119.91,103.91,119.97


In [23]:
df.isnull().sum()

Podcast_Name                        0
Episode_Title                       0
Episode_Length_minutes          87093
Genre                               0
Host_Popularity_percentage          0
Publication_Day                     0
Publication_Time                    0
Guest_Popularity_percentage    146030
Number_of_Ads                       1
Episode_Sentiment                   0
Listening_Time_minutes              0
dtype: int64

In [27]:
# 1. Identify fractional floats in 'Number_of_Ads'
numeric_ads = pd.to_numeric(df['Number_of_Ads'], errors='coerce')

# Check for fractional part (value % 1 != 0)
is_fractional = numeric_ads.notna() & (numeric_ads % 1 != 0)

# 2. State the number of fractional floats found
num_fractional_rows = is_fractional.sum()
print(f"Number of rows with fractional floats in 'Number_of_Ads': {num_fractional_rows}")
print("-" * 30)

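# 3. Delete rows containing fractional floats
# (rows with a missing 'Number_of_Ads' are kept, since is_fractional is False for them)
df_cleaned = df.loc[~is_fractional].copy()  # .copy() gives an independent frame, avoiding SettingWithCopyWarning later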
print("DataFrame after removing fractional float rows:")
df_cleaned

Number of rows with fractional floats in 'Number_of_Ads': 7
------------------------------
DataFrame after removing fractional float rows:


Unnamed: 0,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
0,Mystery Matters,Episode 98,,True Crime,74.81,Thursday,Night,,0.0,Positive,31.41998
1,Joke Junction,Episode 26,119.80,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241
2,Study Sessions,Episode 16,73.90,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531
3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.70,2.0,Positive,46.27824
4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031
...,...,...,...,...,...,...,...,...,...,...,...
749995,Learning Lab,Episode 25,75.66,Education,69.36,Saturday,Morning,,0.0,Negative,56.87058
749996,Business Briefs,Episode 21,75.75,Business,35.21,Saturday,Night,,2.0,Neutral,45.46242
749997,Lifestyle Lounge,Episode 51,30.98,Lifestyle,78.58,Thursday,Morning,84.89,0.0,Negative,15.26000
749998,Style Guide,Episode 47,108.98,Lifestyle,45.39,Thursday,Morning,93.27,0.0,Negative,100.72939


In [25]:
df_cleaned.isnull().sum()

Podcast_Name                        0
Episode_Title                       0
Episode_Length_minutes          87093
Genre                               0
Host_Popularity_percentage          0
Publication_Day                     0
Publication_Time                    0
Guest_Popularity_percentage    146028
Number_of_Ads                       1
Episode_Sentiment                   0
Listening_Time_minutes              0
dtype: int64
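
With the fractional rows removed, 'Number_of_Ads' still contains one missing value, so a plain int conversion would fail. One possible next step, shown here as a sketch on a toy stand-in for df_cleaned, is pandas' nullable integer dtype "Int64"; whether to fill or drop the missing row instead is a separate decision.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_cleaned: whole-number floats plus one missing value.
toy_cleaned = pd.DataFrame({"Number_of_Ads": [0.0, 2.0, np.nan, 3.0]})

# "Int64" (capital I) is the nullable integer dtype: the missing entry
# becomes <NA> instead of raising, as a cast to plain int64 would.
toy_cleaned["Number_of_Ads"] = toy_cleaned["Number_of_Ads"].astype("Int64")

ads_dtype = str(toy_cleaned["Number_of_Ads"].dtype)
ads_missing = int(toy_cleaned["Number_of_Ads"].isna().sum())
print(ads_dtype, ads_missing)
```

This cast only works once every non-missing value is a whole number, which is why the fractional rows have to be handled first.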