# Podcast Listening Time Prediction - Exploratory Data Analysis

This notebook explores the Podcast Listening Time Prediction dataset from the Kaggle competition. The goal is to predict how long listeners will tune in to a podcast episode based on various features.

## 1. Setup and Data Loading

In [None]:
!pip install lightgbm xgboost plotly


In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# For visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# For modeling
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score

# Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import lightgbm as lgb
import xgboost as xgb

# Set visualization style
sns.set(style='whitegrid')
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (12, 8)

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')


In [None]:
# Load the datasets
train_path = '../Datasets/train.csv'
test_path = '../Datasets/test.csv'
sample_submission_path = '../Datasets/sample_submission.csv'

train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
sample_submission = pd.read_csv(sample_submission_path)

# Display basic information
print(f"Training set shape: {train.shape}")
print(f"Test set shape: {test.shape}")
print(f"Sample submission shape: {sample_submission.shape}")

## 2. Data Overview and Initial Inspection

In [None]:
# Display first few rows of training data
train.head()

In [None]:
# Display first few rows of test data
test.head()

In [None]:
# Check data types and basic statistics
train.info()

In [None]:
# Statistical summary of numerical features
train.describe()

In [None]:
# Check if there are any duplicate rows in train
print(f"Number of duplicate rows in train: {train.duplicated().sum()}")

In [None]:
# Check if there are any duplicate rows in test
print(f"Number of duplicate rows in test: {test.duplicated().sum()}")

## 3. Missing Value Analysis

In [None]:
# Check for missing values in training set
missing_train = train.isnull().sum().sort_values(ascending=False)
missing_train_percent = (train.isnull().sum() / train.shape[0] * 100).sort_values(ascending=False)
missing_train_df = pd.concat([missing_train, missing_train_percent], axis=1, keys=['Total', 'Percent'])
print("Missing values in training set:")
missing_train_df[missing_train_df['Total'] > 0]

In [None]:
# Check for missing values in test set
missing_test = test.isnull().sum().sort_values(ascending=False)
missing_test_percent = (test.isnull().sum() / test.shape[0] * 100).sort_values(ascending=False)
missing_test_df = pd.concat([missing_test, missing_test_percent], axis=1, keys=['Total', 'Percent'])
print("Missing values in test set:")
missing_test_df[missing_test_df['Total'] > 0]

In [None]:
# Visualize missing values
plt.figure(figsize=(10, 6))
plt.title('Percentage of Missing Values by Feature')
sns.barplot(x=missing_train_df[missing_train_df['Total'] > 0].index, 
            y=missing_train_df[missing_train_df['Total'] > 0]['Percent'])
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 4. Exploratory Data Analysis (EDA)

### 4.1 Target Variable Analysis

In [None]:
# Analyze the distribution of the target variable
plt.figure(figsize=(12, 6))

# Histogram
plt.subplot(1, 2, 1)
sns.histplot(train['Listening_Time_minutes'], kde=True)
plt.title('Distribution of Listening Time')
plt.xlabel('Listening Time (minutes)')

# Box plot
plt.subplot(1, 2, 2)
sns.boxplot(y=train['Listening_Time_minutes'])
plt.title('Box Plot of Listening Time')
plt.ylabel('Listening Time (minutes)')

plt.tight_layout()
plt.show()

# Basic statistics of target variable
print(train['Listening_Time_minutes'].describe())

In [None]:
# Check if there are any outliers in the target variable
Q1 = train['Listening_Time_minutes'].quantile(0.25)
Q3 = train['Listening_Time_minutes'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = train[(train['Listening_Time_minutes'] < lower_bound) | 
                 (train['Listening_Time_minutes'] > upper_bound)]

print(f"Number of outliers in target variable: {len(outliers)}")
print(f"Percentage of outliers: {len(outliers) / len(train) * 100:.2f}%")

### 4.2 Feature Analysis - Categorical Variables

In [None]:
# Identify categorical columns
categorical_cols = [col for col in train.columns if train[col].dtype == 'object']
print(f"Categorical columns: {categorical_cols}")

# Count unique values in each categorical column
for col in categorical_cols:
    print(f"\n{col} - {train[col].nunique()} unique values:")
    print(train[col].value_counts().sort_values(ascending=False).head(10))

In [None]:
# Visualize relationship between categorical features and target variable
for col in ['Genre', 'Publication_Day', 'Publication_Time', 'Episode_Sentiment']:
    plt.figure(figsize=(12, 6))
    
    # Box plot
    plt.subplot(1, 2, 1)
    sns.boxplot(x=col, y='Listening_Time_minutes', data=train)
    plt.title(f'Listening Time by {col}')
    plt.xticks(rotation=45)
    
    # Bar plot for average listening time
    plt.subplot(1, 2, 2)
    train.groupby(col)['Listening_Time_minutes'].mean().sort_values(ascending=False).plot(kind='bar')
    plt.title(f'Average Listening Time by {col}')
    plt.ylabel('Average Listening Time (minutes)')
    plt.xticks(rotation=45)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Count plots for categorical variables
fig = plt.figure(figsize=(15, 15))

for i, col in enumerate(['Genre', 'Publication_Day', 'Publication_Time', 'Episode_Sentiment']):
    plt.subplot(2, 2, i+1)
    sns.countplot(y=col, data=train, order=train[col].value_counts().index)
    plt.title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

### 4.3 Feature Analysis - Numerical Variables

In [None]:
# Identify numerical columns (excluding id and target)
numerical_cols = [col for col in train.select_dtypes(include='number').columns
                  if col not in ['id', 'Listening_Time_minutes']]
print(f"Numerical columns: {numerical_cols}")

In [None]:
# Distribution of numerical features
fig = plt.figure(figsize=(15, 12))

for i, col in enumerate(numerical_cols):
    plt.subplot(2, 2, i+1)
    sns.histplot(train[col].dropna(), kde=True)
    plt.title(f'Distribution of {col}')
    
plt.tight_layout()
plt.show()

In [None]:
# Scatter plots of numerical features vs target
fig = plt.figure(figsize=(15, 12))

for i, col in enumerate(numerical_cols):
    plt.subplot(2, 2, i+1)
    plt.scatter(train[col], train['Listening_Time_minutes'], alpha=0.1)
    plt.title(f'{col} vs Listening Time')
    plt.xlabel(col)
    plt.ylabel('Listening Time (minutes)')
    
plt.tight_layout()
plt.show()

In [None]:
# Correlation between numerical features and target
numerical_data = train[numerical_cols + ['Listening_Time_minutes']].copy()

# Calculate correlation matrix
correlation_matrix = numerical_data.corr()
print("Correlation with target variable:")
print(correlation_matrix['Listening_Time_minutes'].sort_values(ascending=False))

In [None]:
# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.show()

### 4.4 Relationship Analysis - Episode Length vs Listening Time

In [None]:
# Analyze relationship between episode length and listening time
plt.figure(figsize=(10, 6))
plt.scatter(train['Episode_Length_minutes'], train['Listening_Time_minutes'], alpha=0.1)
plt.title('Episode Length vs Listening Time')
plt.xlabel('Episode Length (minutes)')
plt.ylabel('Listening Time (minutes)')

# Add a line where x=y (perfect retention)
max_val = max(train['Episode_Length_minutes'].max(), train['Listening_Time_minutes'].max())
plt.plot([0, max_val], [0, max_val], 'r--', label='Perfect Retention')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Calculate retention rate (listening time / episode length)
# Only for rows where episode length is not missing
train_with_length = train.dropna(subset=['Episode_Length_minutes']).copy()
train_with_length['Retention_Rate'] = train_with_length['Listening_Time_minutes'] / train_with_length['Episode_Length_minutes']

# Analyze retention rate distribution
plt.figure(figsize=(10, 6))
sns.histplot(train_with_length['Retention_Rate'], bins=50, kde=True)
plt.title('Distribution of Retention Rate')
plt.xlabel('Retention Rate (Listening Time / Episode Length)')
plt.axvline(1, color='r', linestyle='--', label='Perfect Retention')
plt.legend()
plt.tight_layout()
plt.show()

print(train_with_length['Retention_Rate'].describe())

### 4.5 Advanced Analysis - Multivariate Relationships

In [None]:
# Listening time by genre (larger standalone view of the box plot from Section 4.2)
plt.figure(figsize=(14, 8))
sns.boxplot(x='Genre', y='Listening_Time_minutes', data=train)
plt.title('Listening Time by Genre')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Listening time by host popularity segments
train['Host_Popularity_Segment'] = pd.qcut(train['Host_Popularity_percentage'], 4, labels=['Low', 'Medium-Low', 'Medium-High', 'High'])

plt.figure(figsize=(12, 6))
sns.boxplot(x='Host_Popularity_Segment', y='Listening_Time_minutes', data=train)
plt.title('Listening Time by Host Popularity Segment')
plt.tight_layout()
plt.show()

In [None]:
# Analyze how sentiment affects listening time across different genres
plt.figure(figsize=(14, 8))
sns.boxplot(x='Genre', y='Listening_Time_minutes', hue='Episode_Sentiment', data=train)
plt.title('Listening Time by Genre and Sentiment')
plt.xticks(rotation=45)
plt.legend(title='Sentiment')
plt.tight_layout()
plt.show()

In [None]:
# Analyze effect of number of ads on listening time
plt.figure(figsize=(12, 6))
sns.boxplot(x='Number_of_Ads', y='Listening_Time_minutes', data=train)
plt.title('Listening Time by Number of Ads')
plt.tight_layout()
plt.show()

## 5. Feature Engineering

In [None]:
# Create a function for feature engineering to apply to both train and test
def engineer_features(df):
    # Create a copy to avoid changing the original dataframe
    df_new = df.copy()
    
    # Extract episode number from Episode_Title
    df_new['Episode_Number'] = df_new['Episode_Title'].str.extract(r'Episode (\d+)', expand=False).astype(float)
    
    # Day of week encoding (numerical)
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    df_new['Day_Num'] = df_new['Publication_Day'].map({day: i for i, day in enumerate(day_order)})
    
    # Is weekend feature
    df_new['Is_Weekend'] = df_new['Publication_Day'].isin(['Saturday', 'Sunday']).astype(int)
    
    # Time of day encoding
    time_order = ['Morning', 'Afternoon', 'Evening', 'Night']
    df_new['Time_Num'] = df_new['Publication_Time'].map({time: i for i, time in enumerate(time_order)})
    
    # Target-derived features exist only where the target is present (train, not test).
    # They are kept for analysis and must not be used as model inputs.
    if 'Listening_Time_minutes' in df_new.columns:
        # Share of the episode that was listened to; NaN where Episode_Length_minutes is missing
        df_new['Retention_Rate'] = df_new['Listening_Time_minutes'] / df_new['Episode_Length_minutes']

        # Mean listening time per podcast (each row's own target contributes to its group mean)
        df_new['Avg_Podcast_Listening'] = df_new.groupby('Podcast_Name')['Listening_Time_minutes'].transform('mean')
    
    # Episode Sentiment encoding
    sentiment_map = {'Positive': 1, 'Neutral': 0, 'Negative': -1}
    df_new['Sentiment_Score'] = df_new['Episode_Sentiment'].map(sentiment_map)
    
    # Interaction features
    df_new['Host_Guest_Popularity_Diff'] = df_new['Host_Popularity_percentage'] - df_new['Guest_Popularity_percentage']
    df_new['Host_Guest_Popularity_Product'] = df_new['Host_Popularity_percentage'] * df_new['Guest_Popularity_percentage']
    
    return df_new

# Apply feature engineering
train_fe = engineer_features(train)
test_fe = engineer_features(test)

# Display new features
new_features = [col for col in train_fe.columns if col not in train.columns]
print(f"New features created: {new_features}")
train_fe[new_features].head()
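
The `Avg_Podcast_Listening` feature above is computed from the target over the full training frame, so each row's own listening time feeds into its own feature value. One common way to make such an aggregate usable as a model input is to compute it out-of-fold. The sketch below is a minimal illustration on a toy frame, not part of the notebook's pipeline: the helper name `oof_group_mean`, the toy values, and the random fold assignment are all assumptions.

```python
import numpy as np
import pandas as pd


def oof_group_mean(df, group_col, target_col, n_splits=5, seed=42):
    """Hypothetical helper: out-of-fold mean of target_col per group_col."""
    rng = np.random.default_rng(seed)
    # Randomly assign each row to one of n_splits folds of near-equal size
    fold_ids = rng.permutation(len(df)) % n_splits
    encoded = pd.Series(np.nan, index=df.index, dtype=float)

    for fold in range(n_splits):
        in_fold = fold_ids == fold
        other_rows = df.loc[~in_fold]
        # Group means are learned only from rows outside the current fold
        means = other_rows.groupby(group_col)[target_col].mean()
        # Groups unseen outside the fold fall back to the out-of-fold overall mean
        fallback = other_rows[target_col].mean()
        encoded.loc[in_fold] = (
            df.loc[in_fold, group_col].map(means).fillna(fallback).to_numpy()
        )
    return encoded


# Toy data standing in for the training set (illustrative values only)
toy = pd.DataFrame({
    "Podcast_Name": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "Listening_Time_minutes": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})
toy["Avg_Podcast_Listening_OOF"] = oof_group_mean(
    toy, "Podcast_Name", "Listening_Time_minutes", n_splits=4
)

# Test rows have no target, so they reuse means computed on all training rows
full_means = toy.groupby("Podcast_Name")["Listening_Time_minutes"].mean()
toy_test = pd.DataFrame({"Podcast_Name": ["A", "C", "Z"]})
toy_test["Avg_Podcast_Listening"] = (
    toy_test["Podcast_Name"].map(full_means).fillna(toy["Listening_Time_minutes"].mean())
)
print(toy_test)
```

With this pattern the same column can exist in both train and test, which the in-place version inside `engineer_features` does not provide.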

## 6. Missing Value Treatment

In [None]:
# Analyze relationship between missing values and target
for col in ['Episode_Length_minutes', 'Guest_Popularity_percentage']:
    train_fe[f'{col}_Missing'] = train_fe[col].isnull().astype(int)
    
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=f'{col}_Missing', y='Listening_Time_minutes', data=train_fe)
    plt.title(f'Listening Time by {col} Missing Status')
    plt.xlabel(f'Is {col} Missing')
    plt.ylabel('Listening Time (minutes)')
    plt.tight_layout()
    plt.show()
    
    # Print average listening time for missing vs non-missing
    missing_avg = train_fe.groupby(f'{col}_Missing')['Listening_Time_minutes'].mean()
    print(f"Average Listening Time by {col} Missing Status:")
    print(missing_avg)

## 7. Baseline Model Building

In [None]:
# Prepare data for modeling
# Define features to use
numerical_features = ['Episode_Length_minutes', 'Host_Popularity_percentage', 'Guest_Popularity_percentage', 
                     'Number_of_Ads', 'Episode_Number', 'Day_Num', 'Time_Num', 'Is_Weekend', 'Sentiment_Score',
                     'Host_Guest_Popularity_Diff', 'Host_Guest_Popularity_Product']

categorical_features = ['Genre', 'Publication_Day', 'Publication_Time', 'Episode_Sentiment']

# Add missing indicator features
for col in ['Episode_Length_minutes', 'Guest_Popularity_percentage']:
    train_fe[f'{col}_Missing'] = train_fe[col].isnull().astype(int)
    test_fe[f'{col}_Missing'] = test_fe[col].isnull().astype(int)
    numerical_features.append(f'{col}_Missing')

# Split features and target
X = train_fe[numerical_features + categorical_features]
y = train_fe['Listening_Time_minutes']

# Train-validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Prepare the test data
X_test = test_fe[numerical_features + categorical_features]

In [None]:
# Create preprocessing pipeline
# Numerical preprocessing - impute missing values and scale
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical preprocessing - one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [None]:
# Create a function to evaluate different models
from sklearn.base import clone

def evaluate_model(model_name, model, X_train, X_val, y_train, y_val):
    # Build the pipeline from a fresh copy of the preprocessor so fitted pipelines do not share state
    pipeline = Pipeline(steps=[('preprocessor', clone(preprocessor)),
                              ('model', model)])
    
    # Train the model
    pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred_train = pipeline.predict(X_train)
    y_pred_val = pipeline.predict(X_val)
    
    # Calculate metrics
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
    val_rmse = np.sqrt(mean_squared_error(y_val, y_pred_val))
    train_r2 = r2_score(y_train, y_pred_train)
    val_r2 = r2_score(y_val, y_pred_val)
    
    # Print results
    print(f"{model_name} Results:")
    print(f"Train RMSE: {train_rmse:.4f}")
    print(f"Validation RMSE: {val_rmse:.4f}")
    print(f"Train R²: {train_r2:.4f}")
    print(f"Validation R²: {val_r2:.4f}")
    print("-" * 50)
    
    return pipeline, val_rmse

In [None]:
# Test several baseline models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42),
    'LightGBM': lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
}

results = {}
best_rmse = float('inf')
best_model = None

for name, model in models.items():
    pipeline, val_rmse = evaluate_model(name, model, X_train, X_val, y_train, y_val)
    results[name] = val_rmse
    
    if val_rmse < best_rmse:
        best_rmse = val_rmse
        best_model = pipeline

In [None]:
# Visualize model performance comparison
plt.figure(figsize=(12, 6))
plt.barh(list(results.keys()), list(results.values()))
plt.xlabel('Validation RMSE (lower is better)')
plt.title('Model Performance Comparison')
plt.tight_layout()
plt.show()

## 8. Feature Importance Analysis

In [None]:
# For tree-based models, analyze feature importance
# Train a simple model without the preprocessing pipeline to get feature names
# Fill missing values for this analysis
X_filled = X.copy()
for col in numerical_features:
    X_filled[col] = X_filled[col].fillna(X_filled[col].median())
    
# One-hot encode categorical features
X_filled_encoded = pd.get_dummies(X_filled, columns=categorical_features, drop_first=False)

# Train a random forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_filled_encoded, y)

# Get feature importances
feature_importances = pd.DataFrame({
    'Feature': X_filled_encoded.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot feature importances
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances.head(20))
plt.title('Top 20 Feature Importances')
plt.tight_layout()
plt.show()

## 9. Make Predictions on Test Data

In [None]:
# Make predictions using the best model
test_predictions = best_model.predict(X_test)

# Create submission file
submission = pd.DataFrame({
    'id': test_fe['id'],
    'Listening_Time_minutes': test_predictions
})

# Save submission file
submission.to_csv('../Datasets/model_submission.csv', index=False)

print("Submission file created with predictions from the best model")
submission.head()

## 10. Conclusion and Next Steps

In this notebook, we've performed a comprehensive exploratory data analysis of the Podcast Listening Time Prediction dataset and built several baseline models. Here's a summary of our findings and potential next steps:

### Key Findings:
1. The dataset contains both categorical features (Genre, Publication Day/Time, Sentiment) and numerical features (Episode Length, Host/Guest Popularity).
2. There are missing values in Episode Length and Guest Popularity that require handling.
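3. We engineered several potentially useful features, such as day/time encodings and popularity interaction features. Retention rate was also computed, but it is derived from the target, so it exists only for training rows and is not used as a model input.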
4. Tree-based models (Random Forest, Gradient Boosting, XGBoost, LightGBM) generally performed better than linear models.
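
The baseline handles the missing values from finding 2 with `_Missing` indicator columns plus median imputation inside the scikit-learn pipeline. The sketch below reproduces that idea in plain pandas on toy frames, to make explicit that the median comes from training rows only. The helper name `add_indicator_and_impute` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def add_indicator_and_impute(train_df, test_df, cols):
    """Hypothetical helper: add <col>_Missing flags, then fill NaNs with the training median."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in cols:
        # Flag missingness before imputing so the signal is not lost
        train_out[f"{col}_Missing"] = train_out[col].isna().astype(int)
        test_out[f"{col}_Missing"] = test_out[col].isna().astype(int)
        # The median is taken from training rows only and reused for test rows
        median = train_out[col].median()
        train_out[col] = train_out[col].fillna(median)
        test_out[col] = test_out[col].fillna(median)
    return train_out, test_out


# Toy frames standing in for train and test (illustrative values only)
toy_train = pd.DataFrame({"Episode_Length_minutes": [30.0, np.nan, 60.0, 90.0]})
toy_test = pd.DataFrame({"Episode_Length_minutes": [np.nan, 45.0]})

toy_train_imp, toy_test_imp = add_indicator_and_impute(
    toy_train, toy_test, ["Episode_Length_minutes"]
)
print(toy_train_imp)
print(toy_test_imp)
```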

### Next Steps:
1. **Feature Engineering**: Create more advanced features like:
   - Podcast-level aggregates (e.g. average listening time per podcast, computed out-of-fold so they can be applied to the test set without leaking the target)
   - Genre-specific features
   - More interaction terms between features

2. **Model Tuning**: Perform hyperparameter optimization for the best-performing models

3. **Ensemble Methods**: Combine predictions from multiple models

4. **Cross-Validation**: Implement k-fold cross-validation for more robust model evaluation

5. **Original Dataset**: Consider incorporating the original Podcast Listening Time dataset as mentioned in the competition description

6. **Advanced Models**: Try neural network approaches or more sophisticated algorithms

7. **Missing Value Handling**: Explore more advanced imputation techniques
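
As a starting point for the cross-validation step, the sketch below scores a model with 5-fold cross-validation using scikit-learn's `KFold` and `cross_val_score`. It runs on synthetic data so that it is self-contained: `X_demo`, `y_demo`, the coefficients, and the choice of `Ridge` are assumptions for illustration, not results from this dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the real features and target (illustrative only)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so RMSE is returned negated
neg_rmse = cross_val_score(
    Ridge(alpha=1.0),
    X_demo,
    y_demo,
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
fold_rmse = -neg_rmse

print(f"RMSE per fold: {np.round(fold_rmse, 4)}")
print(f"Mean RMSE: {fold_rmse.mean():.4f} (std {fold_rmse.std():.4f})")
```

In the notebook itself, the estimator passed to `cross_val_score` would be the full `Pipeline` (preprocessor plus model) with `X` and `y`, so that imputation and scaling are refit inside each fold rather than on the whole training set.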