## Introduction

This project automates the prediction of music streaming numbers with a regression-based machine learning model. To enable real-time predictions, the trained model is hosted in Google Cloud Storage and served through a cloud function, which processes input data, applies the model, and returns the result. Using features such as acousticness, liveness, and audience engagement (comment counts), the system estimates a song's stream count and offers insight into the factors that influence its popularity.

An interactive widget-based user interface allows users to input feature values and retrieve predictions dynamically, providing a seamless and user-friendly experience.

### Workflow

The workflow for this project involves the following steps:

1. **Replicating and Retraining the Regression Model with Selected Features**
2. **Saving the Trained Model and Scaler Locally and Uploading to Google Cloud Storage**
3. **Building a Cloud Function to Process Input Data and Generate Predictions**
4. **Testing the Cloud Function with Sample Input Data for Accuracy and Consistency**
5. **Developing an Interactive User Interface for Dynamic Input and Prediction Retrieval**

In [1]:
import os
import numpy as np
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from google.cloud import storage
import requests

### 1. Model Building

Using the final model developed in the [Music Streams Prediction Project](https://github.com/RaghaviRajumohan/Rhythms-of-Data/blob/main/Music_Streams_Prediction_Model/music_streams_prediction_model.ipynb), the regression-based model was replicated and retrained. It predicts music streaming numbers from audio features such as acousticness, liveness, and valence, together with audience engagement (comment counts). It also includes interaction and quadratic terms to capture non-linear relationships between song attributes and popularity.

The trained model and preprocessing pipeline from this step are used as the foundation for deployment and real-time predictions.

In [2]:
# Step 1: Load the data and drop NaNs before transformations
data = pd.read_csv(
    "/Users/raghavirajumohan/.cache/kagglehub/datasets/salvatorerastelli/spotify-and-youtube/versions/2/Spotify_Youtube.csv"
).iloc[:, 1:]  # Load data and drop the first unnamed column
data.dropna(inplace=True)  # Drop rows with NaN values before transformations

# Step 2: Feature Engineering
data['Licensed'] = data['Licensed'].astype(int)
data['Album_single'] = (data['Album_type'] == 'single').astype(int)
data.drop(columns=['Album_type'], inplace=True)
data['Instrumentalness_logit'] = np.log(data['Instrumentalness'].clip(1e-5, 1 - 1e-5) / 
                                        (1 - data['Instrumentalness'].clip(1e-5, 1 - 1e-5)))
for var in ['Comments', 'Stream', 'Duration_ms']:
    data[f'log_{var}'] = np.log1p(data[var])
data['log_Duration_ms:Liveness'] = data['log_Duration_ms'] * data['Liveness']
data['log_Comments:Licensed'] = data['log_Comments'] * data['Licensed']
data['Album_single:Speechiness'] = data['Album_single'] * data['Speechiness']
data['log_Duration_ms_squared'] = data['log_Duration_ms'] ** 2
data['log_Comments_squared'] = data['log_Comments'] ** 2

# Drop NaNs after transformations
data.dropna(inplace=True)

# Step 3: Define features and target, split the data, and scale the features
features = [
    'Acousticness', 'Liveness', 'Speechiness', 'Instrumentalness_logit', 'Licensed',
    'log_Duration_ms', 'Valence', 'log_Comments', 'Album_single', 'log_Duration_ms:Liveness',
    'log_Comments:Licensed', 'Album_single:Speechiness', 'log_Duration_ms_squared', 'log_Comments_squared'
]
X, y = data[features], data['log_Stream']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Train the model
model = LinearRegression().fit(X_train_scaled, y_train)
model
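
The first cell imports `mean_absolute_error`, `mean_squared_error`, and `r2_score`, but no evaluation step appears in this section. The sketch below shows one way the held-out split could be scored. It runs on synthetic data, and the names `X_demo`, `y_demo`, `scaler_demo`, and `model_demo` are illustrative assumptions, so the printed numbers say nothing about the real model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered features and the log_Stream target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler_demo = StandardScaler().fit(X_tr)
model_demo = LinearRegression().fit(scaler_demo.transform(X_tr), y_tr)
y_pred = model_demo.predict(scaler_demo.transform(X_te))

# Errors are on the same (log) scale as the target.
mae = mean_absolute_error(y_te, y_pred)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
r2 = r2_score(y_te, y_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

With the notebook's own variables, the same three calls would take `y_test` and `model.predict(X_test_scaled)` as arguments.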

### 2. Saving and Uploading Model and Scaler to Google Cloud

The trained regression model and its corresponding scaler are saved as serialized files for deployment. The model, saved as `streams_prediction.sav`, is used to predict music streaming numbers, while the scaler, saved as `scaler.sav`, standardizes the input features to ensure consistency during predictions.

Both files are securely uploaded to a Google Cloud Storage bucket (`streams_prediction`), providing centralized storage and enabling seamless access for deployment and real-time predictions.
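
Before uploading, it can be worth confirming that the serialized files reload into objects that behave the same as the originals. The following is a minimal sketch of such a round-trip check. It uses a tiny synthetic model and a temporary directory, both assumptions made so the block runs on its own, and only the file names match the notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Tiny synthetic model standing in for the real one (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + 3.0

scaler_demo = StandardScaler().fit(X_demo)
model_demo = LinearRegression().fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "streams_prediction.sav")
    scaler_path = os.path.join(tmp_dir, "scaler.sav")
    joblib.dump(model_demo, model_path)
    joblib.dump(scaler_demo, scaler_path)

    # Reload both artifacts the way a consumer of the files would.
    reloaded_model = joblib.load(model_path)
    reloaded_scaler = joblib.load(scaler_path)

# Predictions from the reloaded pair should match the originals.
before = model_demo.predict(scaler_demo.transform(X_demo))
after = reloaded_model.predict(reloaded_scaler.transform(X_demo))
roundtrip_ok = bool(np.allclose(before, after))
print("Round trip preserved predictions:", roundtrip_ok)
```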

In [3]:
# Step 5: Save the Model and Scaler Locally
model_filename = "streams_prediction.sav"
scaler_filename = "scaler.sav"
joblib.dump(model, model_filename)
joblib.dump(scaler, scaler_filename)

print("Model and scaler saved locally")

# Step 6: Upload to Google Cloud Storage
def upload_to_gcs(local_file, bucket_name, destination_blob_name):
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "Credentials.json"
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(local_file)

bucket_name = "streams_prediction" 
upload_to_gcs(model_filename, bucket_name, f"model/{model_filename}")
upload_to_gcs(scaler_filename, bucket_name, f"model/{scaler_filename}")

print("\033[1;34mModel and scaler uploaded successfully to Google Cloud Storage!\033[0m")

Model and scaler saved locally
Model and scaler uploaded successfully to Google Cloud Storage!


### 3. Building a Cloud Function for Predictions

A cloud function was then built to load the trained model and serve real-time predictions. It is organized into three components, each implemented as its own function:

1. **Loading the Model and Scaler:** A dedicated function retrieves the pre-trained model and scaler from Google Cloud Storage, so predictions use the same fitted objects as training.

2. **Feature Engineering:** Another function converts the raw input into the model's feature set by applying log and logit transformations and adding interaction and squared terms.

3. **Prediction:** A final function validates the input data, applies standardization, and uses the regression model to predict log-transformed stream counts, returning the results in a structured JSON format.
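
The validation step in the third component can be sketched in plain Python. The helper name `validate_payload` and its error messages are hypothetical. Only the list of required raw fields is taken from the cloud function. Note that this sketch accepts booleans for fields such as `Licensed`, since `bool` is a subclass of `int` in Python.

```python
REQUIRED_FEATURES = [
    "Acousticness", "Liveness", "Speechiness", "Instrumentalness",
    "Licensed", "Duration_ms", "Valence", "Comments", "Album_single",
]


def validate_payload(payload):
    """Return a list of problems found in the payload (empty if it looks usable)."""
    if not isinstance(payload, dict):
        return ["request body must be a JSON object"]
    problems = []
    for name in REQUIRED_FEATURES:
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], (int, float)):
            problems.append(f"field is not numeric: {name}")
    return problems


# Usage: a payload with one missing field and one non-numeric field.
example = {name: 0.5 for name in REQUIRED_FEATURES}
example["Comments"] = "many"
del example["Valence"]
for problem in validate_payload(example):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets the caller report every issue in a single error response.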

In [4]:
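# Imports required by the cloud function, in addition to those in the first cell
from io import BytesIO
from flask import jsonify

# Function to load the model and scaler from Google Cloud Storage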
def load_model_and_scaler():
    bucket_name = "streams_prediction"
    model_filename = "streams_prediction.sav"
    scaler_filename = "scaler.sav"

    # Initialize Google Cloud Storage client
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    
    # Load the model
    model_blob = bucket.blob(f"model/{model_filename}")
    model_bytes = model_blob.download_as_bytes()
    model = joblib.load(BytesIO(model_bytes))
    
    # Load the scaler
    scaler_blob = bucket.blob(f"model/{scaler_filename}")
    scaler_bytes = scaler_blob.download_as_bytes()
    scaler = joblib.load(BytesIO(scaler_bytes))
    
    return model, scaler

# Feature creation function
def create_features(raw_data):
    try:
        # Convert raw input data into a DataFrame
        input_data = pd.DataFrame([raw_data])
        
        # Perform transformations
        input_data['log_Comments'] = np.log1p(input_data['Comments'])
        input_data['log_Duration_ms'] = np.log1p(input_data['Duration_ms'])
        input_data['Instrumentalness_logit'] = np.log(
            input_data['Instrumentalness'].clip(1e-5, 1 - 1e-5) / 
            (1 - input_data['Instrumentalness'].clip(1e-5, 1 - 1e-5))
        )
        input_data['log_Duration_ms:Liveness'] = input_data['log_Duration_ms'] * input_data['Liveness']
        input_data['log_Comments:Licensed'] = input_data['log_Comments'] * input_data['Licensed']
        input_data['Album_single:Speechiness'] = input_data['Album_single'] * input_data['Speechiness']
        input_data['log_Duration_ms_squared'] = input_data['log_Duration_ms'] ** 2
        input_data['log_Comments_squared'] = input_data['log_Comments'] ** 2

        # Return the transformed data
        return input_data
    except Exception as e:
        raise ValueError(f"Error in feature creation: {str(e)}")

# Prediction function
def predict_streams(request):
    try:
        # Load model and scaler
        model, scaler = load_model_and_scaler()

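        # Parse request JSON (silent=True returns None instead of raising on bad input)
        request_data = request.get_json(silent=True)
        if not isinstance(request_data, dict):
            return jsonify({"status": "error", "message": "Request body must be a JSON object"}), 400

        # Define required raw features
        required_features = [
            'Acousticness', 'Liveness', 'Speechiness', 'Instrumentalness', 
            'Licensed', 'Duration_ms', 'Valence', 'Comments', 'Album_single'
        ]

        # Validate that every required raw feature is present
        missing = [name for name in required_features if name not in request_data]
        if missing:
            return jsonify({"status": "error", "message": f"Missing features: {missing}"}), 400

        # Create features using the helper function
        transformed_data = create_features(request_data)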

        # Select relevant columns for prediction
        features = [
            'Acousticness', 'Liveness', 'Speechiness', 'Instrumentalness_logit', 
            'Licensed', 'log_Duration_ms', 'Valence', 'log_Comments', 'Album_single', 
            'log_Duration_ms:Liveness', 'log_Comments:Licensed', 
            'Album_single:Speechiness', 'log_Duration_ms_squared', 'log_Comments_squared'
        ]

        # Standardize the input features (kept in a separate array so the DataFrame is not overwritten)
        scaled_features = scaler.transform(transformed_data[features])

        # Predict using the regression model; cast to a plain float for JSON serialization
        prediction = float(model.predict(scaled_features)[0])

        # Return prediction
        return jsonify({"status": 200,"Stream Prediction (log)": prediction}), 200

    except Exception as e:
        
        # Return generic errors
        return jsonify({"status": "error","message": str(e)}), 500

### Function to Prepare Input Data

The `prepare_input_data` function transforms raw input data into the structured format the prediction model requires. It converts the data into a DataFrame, applies logarithmic and logit transformations, and creates the interaction and polynomial features that match the model's inputs.
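
The logit transformation is the least obvious of these steps. The sketch below shows, with the standard library only, why the value is clipped first: `log(p / (1 - p))` is undefined at exactly 0 or 1, and `Instrumentalness` can take those values. The function name `clipped_logit` is an illustrative assumption, while the bound `1e-5` matches the notebook code.

```python
import math

EPS = 1e-5  # same clipping bound as the notebook code


def clipped_logit(p, eps=EPS):
    """Logit of p after clipping p into [eps, 1 - eps] to keep the result finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


# Values at the edges of [0, 1] stay finite because of the clipping.
for value in (0.0, 0.5, 0.8, 1.0):
    print(value, clipped_logit(value))
```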


### Define the Cloud Function Request

The `send_to_cloud_function` function sends the processed input data to the deployed cloud function. It converts the processed data to JSON, sends it as a POST request to the specified cloud function URL, and returns the parsed response containing the predicted value.

In [5]:
# Function to prepare input data
def prepare_input_data(input_data):
    # Convert input data to DataFrame
    data = pd.DataFrame([input_data])

    # Log transformations
    data['log_Comments'] = np.log1p(data['Comments'])
    data['log_Duration_ms'] = np.log1p(data['Duration_ms'])

    # Logit transformation for Instrumentalness
    data['Instrumentalness_logit'] = np.log(
        data['Instrumentalness'].clip(1e-5, 1 - 1e-5) / 
        (1 - data['Instrumentalness'].clip(1e-5, 1 - 1e-5))
    )

    # Interaction terms and polynomial features
    data['log_Duration_ms:Liveness'] = data['log_Duration_ms'] * data['Liveness']
    data['log_Comments:Licensed'] = data['log_Comments'] * data['Licensed']
    data['Album_single:Speechiness'] = data['Album_single'] * data['Speechiness']
    data['log_Duration_ms_squared'] = data['log_Duration_ms'] ** 2
    data['log_Comments_squared'] = data['log_Comments'] ** 2

    return data

# Define the cloud function request
def send_to_cloud_function(processed_data, cloud_function_url):
    # Convert the processed data to JSON format
    json_data = processed_data.to_dict(orient='records')[0]
    
    # Send POST request to the cloud function (the timeout avoids waiting indefinitely)
    response = requests.post(cloud_function_url, json=json_data, timeout=30)
    return response.json()

### Using the Cloud Function for Predictions

Input values are supplied, transformed into the required format, and sent to the deployed cloud function using the helper functions defined above. The cloud function processes the input, applies the trained model, and returns the prediction. Because the model predicts log-transformed stream counts, the returned value is converted back to the original scale with `np.expm1` for interpretation.
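
The target was built with `np.log1p`, that is `log(1 + x)`, so the matching inverse is `expm1`, that is `exp(y) - 1`, rather than a plain exponential. A small standard-library sketch of that round trip follows. The helper name `to_stream_count` and the sample value are illustrative assumptions.

```python
import math


def to_stream_count(log_prediction):
    """Invert a log1p-transformed prediction back to a raw stream count."""
    return math.expm1(log_prediction)


streams = 1_000_000                      # hypothetical raw stream count
log_streams = math.log1p(streams)        # what the model is trained to predict
recovered = to_stream_count(log_streams)
print(log_streams, recovered)
```

Using a plain `exp` here instead would overstate every prediction by exactly one stream, which is negligible for popular tracks but incorrect near zero.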

In [6]:
# Example input containing only the raw (untransformed) variables
input_data = {
    'Acousticness': 0.2,
    'Liveness': 0.3,
    'Speechiness': 0.1,
    'Instrumentalness': 0.8,
    'Licensed': 1,
    'Duration_ms': 200000,
    'Valence': 0.5,
    'Comments': 120,
    'Album_single': 1
}

# Cloud function URL
cloud_function_url = "https://us-west2-streamsregression.cloudfunctions.net/streams_prediction"  

# Step 1: Prepare the input data
processed_data = prepare_input_data(input_data)

# Step 2: Send the processed data to the cloud function
response = send_to_cloud_function(processed_data, cloud_function_url)

# Step 3: Extract and transform the output
if response.get("status") == 200:
    # The cloud function returns the predicted log value for streams
    log_streams_prediction = response.get("Stream Prediction (log)")
    
    # Convert back to original scale (exponentiate the log prediction)
    streams_prediction = np.expm1(log_streams_prediction)
    print("\033[1;34mPredicted Streams:\033[0m", f"{streams_prediction:.2f}")

else:
    print(f"Error from Cloud Function: {response.get('message')}")

Predicted Streams: 7184342.80
