# HW 10 - Tim Demetriades
## Ensemble Modeling
5/1/2021

**Team - LetsGo! - Tim Demetriades and Alex Ibanez**

First we import the needed modules.

In [1]:
import pandas as pd
import numpy as np

The list below contains the names of our result files. Each is a CSV file that is 120,000 lines long and contains the predictions that we submitted to Kaggle. They are in the format:
- `TrackID` - UserID and TrackID separated by an underscore
- `Predictor` - Prediction for that pair: 0 if the user will not like the song, 1 if the user will like it.

Each file name has the Kaggle score appended at the end.
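As a minimal sketch of how the score can be recovered from such a name, assuming the suffix is always six digits after the last underscore (the helper `score_from_name` is a hypothetical name, not part of the notebook):

```python
def score_from_name(result_name: str) -> float:
    """Return the Kaggle score encoded in the six-digit suffix of a result name."""
    digits = result_name.rsplit('_', 1)[-1]    # e.g. '084711'
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f'no six-digit score suffix in {result_name!r}')
    return int(digits) / 10 ** 5               # '084711' -> 0.84711

example_score = score_from_name('PredictionsAlgorithm_Tim_084711')
print(example_score)
```

Validating the suffix up front means a misnamed file raises an error instead of silently producing a wrong weight later on.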

In [2]:
list_of_results = ['MatrixFactorization_077897','PredictionsAlgorithm_Alex_084550',
                   'PredictionsAlgorithm_Tim_084711','Spark_GBT_084444',
                   'Spark_LR_086749','Spark_RF_082444',
                   'Ensemble_Predictions_Random_077897', 'pyspark_predictions_055980',
                   'pyspark_predictions_071049', 'Spark_LR_Tim_084494',
                   'Spark_GBT_Tim_084499']

Initialize some variables.

In [3]:
scores = []
solutions_df = []
index_prediction = 0

The for loop below builds the solutions DF. This DF is 120,000 rows long and contains the `TrackID` column shared by all result files, followed by one column per result file holding that file's `Predictor` values.

In [4]:
for result in list_of_results:    # loop through result files
    columns = ['TrackID', 'Predictor' + str(index_prediction)]    # append index at end of 'Predictor' column name
    file_name = 'Data/' + result + '.csv'
    new_result = pd.read_csv(file_name, names=columns, dtype={1:np.int64}, header=0)    # create temp df of current result file
    
    try:
        # Join the new prediction with the solutions_df
        solutions_df = solutions_df.join(new_result.set_index('TrackID'), on='TrackID')
    except AttributeError:
        # First file: solutions_df is still an empty list (no .join), so create the df
        solutions_df = pd.read_csv(file_name)
    
    # Generate a list with the scores for each prediction
    scores.append(float(result[-6:]) / (10 ** 5))    # turn score into decimal value (between 0 and 1)
    index_prediction += 1    # increment index
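The same alignment on `TrackID` could also be written without the try/except, by collecting one named column per file and concatenating them. This is only a sketch on toy in-memory frames (the IDs and values are made up, and `toy_solutions` is a hypothetical name), not the loop we actually ran:

```python
import pandas as pd

# Toy stand-ins for two result files (hypothetical data, not real submissions)
result_a = pd.DataFrame({'TrackID': ['1_10', '1_11', '2_20'], 'Predictor': [1, 0, 1]})
result_b = pd.DataFrame({'TrackID': ['1_10', '1_11', '2_20'], 'Predictor': [0, 0, 1]})

columns = []
for i, frame in enumerate([result_a, result_b]):
    # Index by TrackID and give every prediction column a unique name
    columns.append(frame.set_index('TrackID')['Predictor'].rename(f'Predictor{i}'))

# Align all prediction columns on TrackID; a row missing from any file would show up as NaN
toy_solutions = pd.concat(columns, axis=1).rename_axis('TrackID').reset_index()
print(toy_solutions)
```

Here every prediction column gets an index suffix, including the first one.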

In [5]:
solutions_df

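|  | TrackID | Predictor | Predictor1 | Predictor2 | Predictor3 | Predictor4 | Predictor5 | Predictor6 | Predictor7 | Predictor8 | Predictor9 | Predictor10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 199810_105760 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 199810_18515 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 2 | 199810_208019 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 3 | 199810_242681 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 4 | 199810_74139 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119995 | 249010_186634 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 119996 | 249010_110470 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 119997 | 249010_72192 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 119998 | 249010_86104 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |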


Now we create the solution matrix $S$ using all of the predictions from the DF. Its shape will be (120,000 x n), where n is the number of Predictor columns in the DF.

In [6]:
# Create initial S matrix with first Predictor column
S = np.array((solutions_df.iloc[:, 1] * 2 - 1))    # convert 0 to -1

In [7]:
S

array([ 1,  1,  1, ..., -1, -1, -1], dtype=int64)

In [8]:
# Loop over the rest of the predictor columns to create the rest of the S matrix
for index in range(2, solutions_df.shape[1]):    # .shape[1] gives the # of columns in df
    S = np.c_[S, (solutions_df.iloc[:, index] * 2 - 1)]

In [9]:
S.shape

(120000, 11)

In [10]:
S

array([[ 1,  1,  1, ...,  1,  1,  1],
       [ 1,  1,  1, ..., -1,  1,  1],
       [ 1, -1,  1, ...,  1, -1, -1],
       ...,
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ...,  1, -1, -1]], dtype=int64)
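For reference, the same 0/1 to -1/+1 conversion can be done for all predictor columns in one vectorized step. The sketch below uses a small made-up frame (`toy_df` and `S_toy` are hypothetical names) and assumes every column after `TrackID` holds 0/1 predictions:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'TrackID': ['1_10', '1_11', '2_20'],
    'Predictor': [1, 0, 1],
    'Predictor1': [0, 0, 1],
})

# Every column after TrackID is a 0/1 prediction; map 0 -> -1 and 1 -> +1
S_toy = toy_df.iloc[:, 1:].to_numpy() * 2 - 1
print(S_toy.shape)
print(S_toy)
```

This avoids growing the matrix one column at a time with `np.c_`, which copies the array on every iteration.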

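Here we generate the least squares solution $a_{LS}$.

$a_{LS}$ is equal to $(S^{T}S)^{-1}S^{T}x$, where $x$ is the vector of true labels in $\pm 1$ form.
We do not know $x$, but the $i$-th entry of $S^{T}x$ is equal to $N(2P_{i}-1)$, where $i$ is the index of the solution, $N$ is the number of rows, and $P_{i}$ is that solution's Kaggle score, treated as the fraction of its predictions that are correct. Each correct prediction contributes $+1$ to the dot product and each incorrect one contributes $-1$, so the entry is $NP_{i} - N(1-P_{i}) = N(2P_{i}-1)$.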
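The identity for $S^{T}x$ can be checked numerically on synthetic data. The sketch below assumes the score $P$ is plain accuracy, and it invents both the ground truth and the predictions, so it only illustrates the algebra and says nothing about our actual submissions:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
x = rng.choice([-1, 1], size=N)     # hypothetical ground truth in {-1, +1}
flip = rng.random(N) < 0.2          # flip roughly 20% of the labels
s = np.where(flip, -x, x)           # one model's predictions in {-1, +1}

P = np.mean(s == x)                 # accuracy of this synthetic model
lhs = s @ x                         # (number correct) - (number wrong)
rhs = N * (2 * P - 1)
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding, which is why the weights can be computed from leaderboard scores alone.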

In [11]:
N = len(S)    # 120,000 rows in this case
ST_x = []

# Generate S(transpose) * x -> N(2P_i - 1)
ST_x = [N * (2 * P - 1) for P in scores]    # list comprehension

# Generate S(transpose) * S
ST_S = np.dot(S.T, S).astype('float') + np.eye(S.shape[1]) * (10 ** -6)    # to prevent singular matrices

# Generate (S(transpose) * S)^-1
ST_S_inv = np.linalg.inv(ST_S)

# Generate a_LS = (S(transpose) * S)^-1 * N(2P_i - 1) 
# a_LS is the Least Squares solution
a_LS = np.dot(ST_S_inv, ST_x)

In [12]:
a_LS

array([ 0.09710312,  0.10840707,  0.30699565, -0.02118903,  0.20215946,
        0.01372319,  0.09711456, -0.01720358,  0.05236355, -0.13783168,
        0.29207387])
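As a side note, the explicit inverse is not strictly needed: `np.linalg.solve` solves the same linear system directly. The sketch below compares the two routes on random toy inputs (`S_toy` and `b_toy` are made-up stand-ins for $S$ and $S^{T}x$), keeping the `1e-6` diagonal term used in the cell above:

```python
import numpy as np

rng = np.random.default_rng(1)
S_toy = rng.choice([-1, 1], size=(600, 3))    # hypothetical +/-1 predictions
b_toy = rng.normal(size=3)                    # stand-in for the S^T x vector

A = S_toy.T @ S_toy + 1e-6 * np.eye(3)        # S^T S plus the small diagonal term
a_solve = np.linalg.solve(A, b_toy)           # solves A a = b without forming A^-1
a_inv = np.linalg.inv(A) @ b_toy              # explicit-inverse route, for comparison
print(np.allclose(a_solve, a_inv))
```

On a well-conditioned system like this one the two agree. `solve` is generally the more numerically robust choice when several columns of $S$ are nearly identical.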

As you can see above, each value of $a_{LS}$ is the weight for the corresponding solution (column of $S$).

Finally, to get $s_{ensemble}$ we multiply matrix $S$ by vector $a_{LS}$.

$s_{ensemble} = S a_{LS}$

In [13]:
s_ensemble = np.dot(S, a_LS)

In [14]:
s_ensemble

array([ 1.02812335,  0.88898909,  0.07903042, ..., -1.02812335,
       -0.99371619, -0.88898909])

In [15]:
s_ensemble_len = len(s_ensemble)    # store length of s_ensemble
s_ensemble_len

120000

In [16]:
final_predictions = np.zeros(s_ensemble_len)    # initialize final predictions as an array of zeros

Here is the main loop that creates the final predictions. It walks through $s_{ensemble}$ in blocks of 6 elements (since each User has 6 tracks), sorts each block in ascending order, and uses the third-smallest value as the threshold. Elements strictly above the threshold (the top 3) get a prediction of 1, and the remaining elements (the bottom 3) stay at 0. Because the comparison is strict, a tie with the threshold value can leave a user with fewer than three 1s.

In [17]:
# Loop through all 6 tracks for each user to get top 3 for each user
for index in range(s_ensemble_len // 6):    # floor division
    # Threshold is the third element in the sorted array
    user_score_threshold = np.sort(s_ensemble[index * 6 : index * 6 + 6])[2]    # sort the 6 values for each user and grab the third element
    for index_user in range(6):
        if s_ensemble[index * 6 + index_user] > user_score_threshold:
            final_predictions[index * 6 + index_user] = 1    # set top 3 to 1 (other 3 will be 0)
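A vectorized variant of this step is sketched below. It is not the loop we ran: instead of a threshold it ranks each block with `np.argsort` and marks the three highest positions, so every user gets exactly three 1s even when scores tie. The function name `top3_per_user` and the toy scores are hypothetical, and the input length is assumed to be a multiple of 6.

```python
import numpy as np

def top3_per_user(scores: np.ndarray, tracks_per_user: int = 6) -> np.ndarray:
    """Mark the three highest-scoring tracks in each block of six with 1."""
    blocks = scores.reshape(-1, tracks_per_user)         # one row per user
    order = np.argsort(blocks, axis=1, kind='stable')    # ascending rank within each row
    preds = np.zeros_like(blocks, dtype=int)
    # The last three positions of `order` index the three largest scores per row
    np.put_along_axis(preds, order[:, -3:], 1, axis=1)
    return preds.reshape(-1)

toy_scores = np.array([0.9, 0.1, 0.5, -0.3, 0.7, -0.8,   # user 1, distinct scores
                       0.2, 0.2, 0.2, 0.2, 0.2, 0.2])    # user 2, all tied
toy_preds = top3_per_user(toy_scores)
print(toy_preds)
```

For the all-tied user the stable sort keeps the original order, so the last three tracks are the ones marked.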

Finally, we build the final predictions DF from the `TrackID` column of the solutions DF and the final predictions generated above. Then we write the DF to a CSV file.

In [18]:
# Generate the final prediction df
final_predictions_df = pd.DataFrame(solutions_df.iloc[:,0])    # make new df using first column of solutions_df
final_predictions_df['Predictor'] = np.array(final_predictions, dtype=int)

In [19]:
final_predictions_df

|  | TrackID | Predictor |
|---|---|---|
| 0 | 199810_105760 | 1 |
| 1 | 199810_18515 | 1 |
| 2 | 199810_208019 | 1 |
| 3 | 199810_242681 | 0 |
| 4 | 199810_74139 | 0 |
| ... | ... | ... |
| 119995 | 249010_186634 | 1 |
| 119996 | 249010_110470 | 1 |
| 119997 | 249010_72192 | 0 |
| 119998 | 249010_86104 | 0 |


In [20]:
final_predictions_df.to_csv('Ensemble_Predictions.csv', index=False)
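Before submitting, it can be worth checking that every user ended up with exactly three predicted likes. The sketch below runs that check on a tiny made-up frame in the submission format (`toy_submission` and its IDs are hypothetical) and assumes the user ID is the part of `TrackID` before the underscore:

```python
import pandas as pd

# Toy frame in the submission format (hypothetical IDs)
toy_submission = pd.DataFrame({
    'TrackID': ['1_10', '1_11', '1_12', '1_13', '1_14', '1_15'],
    'Predictor': [1, 0, 1, 0, 1, 0],
})

# The user ID is the part of TrackID before the underscore
user_ids = toy_submission['TrackID'].str.split('_').str[0]
likes_per_user = toy_submission.groupby(user_ids)['Predictor'].sum()

# Users whose number of predicted likes is not exactly three
bad_users = likes_per_user[likes_per_user != 3]
print(len(bad_users))
```

The same check could be run on `final_predictions_df`; a non-empty `bad_users` would point to ties at the threshold.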

### Results
After experimenting with which files to include in `list_of_results` to see if we could improve our score for this assignment, we reached our best score of **0.88908**, which put us at the top of the Kaggle leaderboard.