**What makes an All Star?**
This notebook develops a machine learning model that predicts which contestants are most likely to be selected for an All Stars season, based on their performance in previous seasons.

Run the prep module to get contestant performance data, scraped from Wikipedia and cleaned for analysis.

In [1]:
import All_Stars_Data_Prep as all_stars_prep
model_data = all_stars_prep.get_all_stars_selection_model_data(range(1,5))

Define a function that computes Pearson correlations and p-values, to explore which features might be most useful in the model.

In [4]:
def get_pearson_correlations(model_data, dv, feature_columns):

    # Create output dataframe
    df = pd.DataFrame(columns=['Independent Variable', 'Correlation', 'P-Value'])

    # For each independent variable in the feature columns, get the Pearson
    # correlation and p-value against the dependent variable, then append
    # them as a new row (row labels start at 1)
    for iv in feature_columns:
        corr, pval = pearsonr(model_data[iv], model_data[dv])
        df.loc[df.shape[0] + 1] = [iv, corr, pval]

    return df
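
As a quick sanity check of the call used inside the function, the sketch below runs `pearsonr` on a small made-up table. The values and the name `toy` are illustrative assumptions, not data from this notebook; only the column names are borrowed from it.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical toy data: six contestants, a count of "High" placements
# and a 0/1 flag for whether they later competed on All Stars.
toy = pd.DataFrame({
    "High": [0, 1, 2, 3, 4, 5],
    "Competed": [0, 0, 0, 1, 1, 1],
})

# pearsonr returns the correlation coefficient and a two-sided p-value.
corr, pval = pearsonr(toy["High"], toy["Competed"])
print(f"correlation={corr:.3f}, p-value={pval:.3f}")
```

Because the dependent variable is binary, the Pearson coefficient here is equivalent to a point-biserial correlation, so it measures how strongly the feature's mean differs between selected and non-selected contestants.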

Load packages for data analysis.

In [6]:
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

Explore correlations to see which independent variables are correlated with a contestant being selected for All Stars.

In [7]:
feature_cols = [ 'Win', 'High', 'Safe', 'Low', 'Bottom', 'Eliminated', 'Guest', 
                 'Season Winner', 'Season Runner-Up', 'Season Miss Congeniality',
                 'Total Appearances', 'Years Since Last Competed' ]
get_pearson_correlations(model_data, 'Competed', feature_cols)

,Independent Variable,Correlation,P-Value
1,Win,0.037493,0.468539
2,High,0.147827,0.004069
3,Safe,0.114581,0.026302
4,Low,-0.006113,0.905957
5,Bottom,0.052854,0.306697
6,Eliminated,-0.095364,0.064713
7,Guest,-0.116583,0.023771
8,Season Winner,-0.110695,0.03188
9,Season Runner-Up,0.102937,0.046077
10,Season Miss Congeniality,0.189553,0.000218


Create a list of feature names to be used in an initial model, selecting those with a p-value of less than 0.05.

In [9]:
feature_cols_selected = [ 'High', 'Safe', 'Guest', 'Season Winner', 'Season Runner-Up', 
                          'Season Miss Congeniality', 'Years Since Last Competed' ]
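
The same selection could be made programmatically by filtering the correlation table on its p-value column. This is only a sketch: the rows are copied from the output shown earlier, and the names `corr_table` and `alpha` are assumptions for illustration. The output shown earlier has no rows for 'Total Appearances' or 'Years Since Last Competed', so they are left out here rather than guessed.

```python
import pandas as pd

# Rows copied from the correlation output shown earlier in the notebook.
corr_table = pd.DataFrame(
    [
        ("Win", 0.037493, 0.468539),
        ("High", 0.147827, 0.004069),
        ("Safe", 0.114581, 0.026302),
        ("Low", -0.006113, 0.905957),
        ("Bottom", 0.052854, 0.306697),
        ("Eliminated", -0.095364, 0.064713),
        ("Guest", -0.116583, 0.023771),
        ("Season Winner", -0.110695, 0.03188),
        ("Season Runner-Up", 0.102937, 0.046077),
        ("Season Miss Congeniality", 0.189553, 0.000218),
    ],
    columns=["Independent Variable", "Correlation", "P-Value"],
)

# Significance threshold (0.05, as stated in the text above).
alpha = 0.05

# Keep only the features whose p-value falls below the threshold.
significant = corr_table.loc[
    corr_table["P-Value"] < alpha, "Independent Variable"
].tolist()
print(significant)
```

In the notebook itself, the dataframe returned by `get_pearson_correlations` could be filtered the same way instead of retyping the values.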

Define a function that fits a logistic regression on a selected set of feature columns and returns the confusion matrix for the held-out test set.

In [11]:
def get_confusion_matrix(model_data, feature_cols, dependent_variable, seed_val=0, test_size=0.25):
    X = model_data[feature_cols]
    y = model_data[dependent_variable]

    # Split into train and test sets (seeded for reproducibility)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed_val)

    # Instantiate the model and fit it on the training set
    lr = LogisticRegression()
    lr.fit(X_train, y_train)

    # Predict on the test set and compare against the true labels
    # (rows of the matrix are true classes, columns are predicted classes)
    y_pred = lr.predict(X_test)
    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

    return cnf_matrix

Examine confusion matrix for the selected columns.

In [12]:
get_confusion_matrix(model_data, feature_cols_selected, 'Competed')



array([[84,  0],
       [10,  0]])
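
To read this result, recall that scikit-learn's confusion matrix puts true classes in rows and predicted classes in columns. The sketch below unpacks the matrix shown above and derives accuracy and recall. It assumes 'Competed' is coded 0/1 with 1 meaning the contestant was selected; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
cnf_matrix = np.array([[84, 0],
                       [10, 0]])

# For a binary problem, ravel() yields the cells in this order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cnf_matrix.ravel()

# Share of all test rows that were classified correctly.
accuracy = (tn + tp) / cnf_matrix.sum()

# Share of actual positives the model found (guard against division by zero).
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

The second column is all zeros, so the model never predicts the positive class on this test split: every one of the 10 selected contestants is missed, and the accuracy simply matches the share of non-selected contestants in the test set.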