import numpy as np
import csv

# Set up file paths
dataDir = 'C:/Users/navne/Python Files/Data Acquisition BIA 627/Final Project/'  # Change this to your local data directory
file_name_test = dataDir + 'testTrack_hierarchy.txt'
file_name_train = dataDir + 'trainIdx2_matrix.txt'
output_file = dataDir + 'output1.csv'  # Predictions are written as CSV

# Load the training data into a nested dictionary: train_data[userID][itemID] = rating
train_data = {}
with open(file_name_train, 'r') as fTrain:
    for line in fTrain:
        arr_train = line.strip().split('|')
        if len(arr_train) < 3:
            continue  # Skip blank or malformed lines
        userID, itemID, rating = arr_train[:3]
        train_data.setdefault(userID, {})[itemID] = int(rating)

# Process the test data and write one prediction per (user, track) pair
with open(file_name_test, 'r') as fTest, open(output_file, 'w', newline='') as fOut:
    csv_writer = csv.writer(fOut)
    csv_writer.writerow(['TrackID', 'Predictor'])  # Header row of the CSV file

    lastUserID = None  # ID of the user whose tracks are currently being collected
    trackID_vec = []   # Track IDs seen so far for the current user
    predRatings = []   # Predicted rating per track; 0 means no album or artist rating was found

    for line in fTest:
        arr_test = line.strip().split('|')
        if len(arr_test) < 4:
            continue  # Skip lines that do not have all four fields
        userID, trackID, albumID, artistID = arr_test[:4]  # Only the first four fields are used

        # On a new user, write out the previous user's predictions and reset the buffers.
        # Lists are used instead of a fixed-size array so any number of tracks per user works.
        if userID != lastUserID:
            for prevTrackID, predRating in zip(trackID_vec, predRatings):
                csv_writer.writerow([f"{lastUserID}_{prevTrackID}", int(predRating > 0)])
            trackID_vec = []
            predRatings = []

        # Predict from the user's album rating, fall back to the artist rating, else 0
        user_ratings = train_data.get(userID, {})
        if albumID in user_ratings:
            predRating = user_ratings[albumID]
        elif artistID in user_ratings:
            predRating = user_ratings[artistID]
        else:
            predRating = 0

        trackID_vec.append(trackID)
        predRatings.append(predRating)
        lastUserID = userID  # Remember which user this line belonged to

    # Write predictions for the last user in the file
    for prevTrackID, predRating in zip(trackID_vec, predRatings):
        csv_writer.writerow([f"{lastUserID}_{prevTrackID}", int(predRating > 0)])


### Matrix Factorization with Alternating Least Squares (ALS): Comprehensive Overview

#### Motivation:
Matrix factorization with ALS aims to discover latent factors that explain the observed ratings in a user-item matrix. It is widely used in recommendation systems and works well on large datasets, because the latent factors can capture user preferences and item attributes that are not apparent from observable metadata such as genres or artists. ALS also copes with the sparsity of user-item interactions, and because each alternating step splits into independent per-user or per-item least-squares problems, it scales to large recommendation workloads.

#### Formula and Methodology:
The ALS algorithm decomposes the user-item rating matrix \(R\) into two lower-dimensional matrices, \(U\) (user factors) and \(I\) (item factors), such that:
\[ R \approx U \times I^T \]
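This decomposition is computed by alternately holding one of \(U\) and \(I\) fixed and solving for the other; with one factor fixed, the problem reduces to a regularized least-squares problem with a closed-form solution. The objective is to minimize the squared error over the observed ratings, plus a regularization term to prevent overfitting:
\[ \min_{U,I} \sum_{(u,i) \in \Omega} \left( R_{ui} - (U I^T)_{ui} \right)^2 + \lambda (\| U \|_F^2 + \| I \|_F^2) \]
Where:
- \( \Omega \) is the set of (user, item) pairs with an observed rating,
- \( \lambda \) is the regularization parameter,
- \( \|\cdot\|_F \) denotes the Frobenius norm.

The model's hyperparameters map onto this formulation as follows:
- `maxIter` is the maximum number of alternating iterations used to refine the model,
- `rank` specifies the number of latent factors, that is, the number of columns of \(U\) and \(I\),
- `regParam` sets the regularization strength \( \lambda \).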
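
The alternating updates described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the implementation used for the results in this notebook: the function name `als`, the toy rating matrix, and the convention that `0` marks a missing rating are all hypothetical. With one factor matrix fixed, each row of the other is the solution of a small ridge regression over that user's (or item's) observed ratings.

```python
import numpy as np

def als(R, mask, rank=2, max_iter=20, reg_param=0.1, seed=0):
    """Factor R ~ U @ I.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    I = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    ridge = reg_param * np.eye(rank)                 # lambda times the identity

    for _ in range(max_iter):
        # Fix I: solve a ridge regression for each user's factor vector
        for u in range(n_users):
            obs = mask[u]  # items rated by user u
            if obs.any():
                I_obs = I[obs]
                U[u] = np.linalg.solve(I_obs.T @ I_obs + ridge, I_obs.T @ R[u, obs])
        # Fix U: solve a ridge regression for each item's factor vector
        for i in range(n_items):
            obs = mask[:, i]  # users who rated item i
            if obs.any():
                U_obs = U[obs]
                I[i] = np.linalg.solve(U_obs.T @ U_obs + ridge, U_obs.T @ R[obs, i])
    return U, I

# Toy data (hypothetical): 5 users x 4 items, 0 means "not rated"
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

U, I = als(R, mask, rank=2, max_iter=20, reg_param=0.1)
pred = U @ I.T  # includes predictions for the unrated cells
train_mse = np.mean((R[mask] - pred[mask]) ** 2)
print(f"training MSE on observed entries: {train_mse:.4f}")
```

The arguments `rank`, `max_iter`, and `reg_param` play the roles of `rank`, `maxIter`, and `regParam` in the text. Production implementations distribute the per-user and per-item solves, which this sketch runs serially.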

#### Discussion:
Matrix factorization through ALS is valued for revealing latent structure in rating data efficiently. It handles large, sparse matrices by inferring user preferences and item characteristics that are not given explicitly. The rank of the factorization and the number of iterations both influence the model's accuracy and its computational cost. Evaluation is also affected by `coldStartStrategy`, which determines how predictions are handled for users or items that have no history in the training data. ALS still requires careful hyperparameter tuning, in particular of the rank and the regularization strength, to balance fitting the training data against overfitting, especially on diverse datasets.
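
To make the cold-start point concrete, here is a small NumPy sketch of why the handling of unseen users matters for the evaluation metric. It is an illustration only: `predict`, `mse`, the toy factors, and the index dictionaries are hypothetical names, and the `"nan"` / `"drop"` labels are chosen to mirror the two values that Spark MLlib's `ALS` accepts for `coldStartStrategy`, assuming that is the implementation in use.

```python
import numpy as np

def predict(U, I, user_index, item_index, pairs):
    """Predict ratings for (user, item) pairs; unseen users or items yield NaN."""
    preds = []
    for user, item in pairs:
        if user in user_index and item in item_index:
            preds.append(float(U[user_index[user]] @ I[item_index[item]]))
        else:
            preds.append(float("nan"))  # no learned factors for this user or item
    return np.array(preds)

def mse(actual, predicted, cold_start="nan"):
    """Mean squared error; with cold_start='drop', NaN predictions are excluded."""
    actual = np.asarray(actual, dtype=float)
    if cold_start == "drop":
        keep = ~np.isnan(predicted)
        actual, predicted = actual[keep], predicted[keep]
    return float(np.mean((actual - predicted) ** 2))

# Toy factors (hypothetical): two known users, two known items
U = np.array([[1.0, 0.0], [0.0, 1.0]])
I = np.array([[4.0, 1.0], [1.0, 5.0]])
user_index = {"u1": 0, "u2": 1}
item_index = {"a": 0, "b": 1}

pairs = [("u1", "a"), ("u2", "b"), ("u3", "a")]  # "u3" never appeared in training
actual = [5.0, 4.0, 3.0]

preds = predict(U, I, user_index, item_index, pairs)
mse_nan = mse(actual, preds, cold_start="nan")    # NaN: one missing prediction poisons the mean
mse_drop = mse(actual, preds, cold_start="drop")  # computed over the two known pairs only
print(mse_nan, mse_drop)
```

Keeping the NaN makes the whole metric undefined as soon as one test pair involves an unseen user or item, while dropping such rows gives a finite score that silently ignores them.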

### Performance Observations:
The ALS model's efficacy is typically measured by the Mean Squared Error (MSE) across various configurations:

- **Varying "Rank"**: Higher ranks generally lower the MSE by capturing more detailed structure in the data, but they also increase computation and the risk of overfitting.
- **Varying "maxIter"**: More iterations let the factor matrices converge further, which shows up as lower MSE values.
- **Varying Data Size**: Larger datasets provide more information about each user and item, allowing more accurate predictions. The improvement diminishes as the dataset grows, because additional ratings add progressively less new information and the remaining error is dominated by noise in the ratings.
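
For reference, the MSE referred to above can be written as follows. This is the standard definition; the symbol \(T\), denoting the set of held-out (user, item) pairs used for evaluation, is introduced here for illustration and is an assumption about how the evaluation split was made.
\[ \mathrm{MSE} = \frac{1}{|T|} \sum_{(u,i) \in T} \left( R_{ui} - (U I^T)_{ui} \right)^2 \]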

#### Specific Results and Configurations:
- **Optimal Settings**: A `rank` of 20 with a `maxIter` of 20 yielded the best results among the configurations tested, indicating a suitable level of model complexity and enough iterations for convergence.
- **Data Size Impact**: Increasing the dataset size typically lowers the MSE, indicating better learning of user preferences and item properties, though the rate of improvement decreases as the size grows.
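
A comparison across `rank` and `maxIter` settings like the one summarized above can be organized as a simple grid search. The sketch below is hypothetical: `grid_search` and `placeholder_score` are illustrative names, and the placeholder merely stands in for "train ALS with these settings and return the validation MSE". Its values are not measured results.

```python
from itertools import product

def grid_search(evaluate, ranks, max_iters):
    """Evaluate every (rank, max_iter) pair and return the pair with the lowest score."""
    results = {(r, m): evaluate(r, m) for r, m in product(ranks, max_iters)}
    best = min(results, key=results.get)
    return best, results

def placeholder_score(rank, max_iter):
    # Stand-in for "fit ALS and compute validation MSE"; NOT real measurements.
    return 1.0 / rank + 1.0 / max_iter

best, results = grid_search(placeholder_score, ranks=[5, 10, 20], max_iters=[5, 10, 20])
print("best (rank, maxIter):", best)
```

Replacing `placeholder_score` with a function that fits the model and scores it on a held-out set keeps the selection logic unchanged.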

#### Conclusion:
ALS matrix factorization is a foundational technique in modern recommender systems for extracting latent structure from user-item interactions. Its success depends on choosing appropriate hyperparameters and on handling large-scale data efficiently. Tuning these parameters to the characteristics of the dataset and the requirements of the application is essential for maximizing the model's predictive performance.