
# Project Goal 
> - To predict what percentage of a team's overall salary cap will be paid to its quarterback

# Project Description
> - Using data acquired from various websites, we ran correlation tests to find the most statistically significant features

# Initial Hypothesis
> - Players who extend their team's season (i.e., reach the playoffs) will have a higher percentage of their team's salary cap
> - Players who have more yards and touchdowns will have a higher percentage of their team's salary cap
> - Players who have more interceptions will have a lower percentage of their team's salary cap
> - Players who have a higher passer rating will have a higher percentage of their team's salary cap

# Imports

In [1]:
import os
import random
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pydataset import data
from scipy import stats
from sklearn.feature_selection import SelectKBest, RFE, f_regression, SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoLars, LinearRegression, TweedieRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    MinMaxScaler,
    PolynomialFeatures,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

import wrangle

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
# Seed Python's built-in RNG (this does not seed NumPy or scikit-learn)
random.seed(10)

In [2]:
df = pd.read_csv('last_csv.csv',index_col=[0])

# Acquire
> * Data acquired via web scraping from:
    - https://overthecap.com/position/quarterback
    - https://overthecap.com/contract-history/quarterback
        - The original player salary data consisted of 1,590 rows and 11 columns. Each row contained a player and their salary for a given year.
    - https://www.nfl.com/schedules/2021/POST4/
        - Original playoff data consisted of 162 rows and 7 columns. Each row contained a team and the round that team made it to in a given year between 2010 and 2022.
    - https://www.pro-football-reference.com/years/2022/passing.htm
        - Original player stats data set consisted of 458 rows and 19 columns. Each row contained a single player for a given year between 2010 and 2022.
> * Cached all files to local CSV
> * Each row holds one player's stats for a specific year

# Prepare
> * Visualized the full dataset for univariate exploration
      * Histograms showed different types of distributions

> * Verified datatypes
> * Corrected column names
> * Checked for nulls and removed them
> * Once the 3 data sets were merged, we cleaned the data and conducted feature engineering. Our final data frame consisted of 410 rows and 42 columns for exploration. Each row contains a single player with their stats for a respective year from 2010 to 2022. 
> * Cached combined files to local csv

> * Split the data, stratifying on target variable

In [3]:
df.head()

In [4]:
train, validate, test = wrangle.split_data(df)
columns_list, target, corr_test = wrangle.get_target_and_columns(df, train)
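The internals of `wrangle.split_data` are not shown in this notebook. As a rough illustration only, a three-way split could look like the sketch below. The function name `split_data_sketch`, the 56/24/20 proportions, and the toy column `pass_yds` are assumptions, and stratification is left out because a continuous target would first need to be binned.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, seed=10):
    """Hypothetical stand-in for wrangle.split_data (roughly 56/24/20)."""
    # Hold out 20% of the rows as the final test set
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # Split the remainder into train (70%) and validate (30%)
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed
    )
    return train, validate, test


# Toy frame standing in for the real player data (values are random)
rng = np.random.default_rng(10)
toy = pd.DataFrame(
    {
        "pass_yds": rng.normal(3000, 800, 100),
        "percent_of_cap": rng.uniform(0, 20, 100),
    }
)
train_s, validate_s, test_s = split_data_sketch(toy)
print(len(train_s), len(validate_s), len(test_s))
```

Fixing `random_state` keeps the three subsets identical between runs, which matters when RMSE values are compared across notebook sessions.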

# Univariate Analysis

In [5]:
wrangle.new_visual_univariate_findings(train)

In [6]:
wrangle.univariate_findings()

# Univariate Exploration Summary


# Bivariate Analysis

$H_0$: There is no correlation between our selected features and our target variable.

$H_a$: There is a correlation between our selected features and our target variable.

$\alpha$: 0.05

In [None]:
wrangle.new_visual_multivariate_findings(df, target)

# Bivariate Exploration Summary


In [None]:
wrangle.get_explore_data(columns_list, corr_test)

# Correlation Tests

> - We will use a confidence level of 95%
> - The resulting alpha is .05

$H_0$: There is no statistically significant correlation between our selected features and our target variable.

$H_a$: There is a statistically significant correlation between our selected features and our target variable.

$\alpha$: 0.05

In [None]:
corr_test.sort_values(by= 'p')

In [None]:
columns_list = corr_test.feature[corr_test.p < .05].to_list()
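How `corr_test` is built inside `wrangle` is not shown. The sketch below shows one way such a table could be produced, assuming a Pearson correlation test per feature (the notebook may use a different test, such as Spearman). The helper `correlation_table` and the toy columns `pass_td` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def correlation_table(frame, features, target):
    """Hypothetical helper: Pearson r and p-value of each feature vs. the target."""
    rows = []
    for col in features:
        r, p = stats.pearsonr(frame[col], frame[target])
        rows.append({"feature": col, "r": r, "p": p})
    # Most significant features first
    return pd.DataFrame(rows).sort_values(by="p").reset_index(drop=True)


# Toy data: one feature tied to the target, one pure noise column
rng = np.random.default_rng(10)
x = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "pass_td": x,
        "noise": rng.normal(size=200),
        "percent_of_cap": 2 * x + rng.normal(scale=0.1, size=200),
    }
)
table = correlation_table(toy, ["pass_td", "noise"], "percent_of_cap")
# Keep only features whose p-value falls below alpha = .05
keep = table.feature[table.p < 0.05].to_list()
print(table)
```

The final filter mirrors the `corr_test.p < .05` selection used in the cell above.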

# Modeling

- We will use RMSE as our evaluation metric

- The baseline has an RMSE of 6.9, so a baseline prediction is typically off by about 6.9 (in units of the target, percent of cap)
- 6.9 will be the baseline RMSE we will use for this project
- We will evaluate models developed using four different model types and various hyperparameter configurations
- Models will be evaluated on train and validate data
- The model that performs best will then be evaluated on test data
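The notebook does not show how the baseline RMSE was computed. A common approach, assumed here, is to predict the mean of the training target for every row and measure the RMSE of that constant prediction. The function `baseline_rmse` is a hypothetical illustration, not the `wrangle` implementation.

```python
import numpy as np


def baseline_rmse(y_train, y_eval):
    """RMSE of always predicting the training-set mean (assumed baseline)."""
    baseline = np.mean(y_train)
    errors = np.asarray(y_eval, dtype=float) - baseline
    return float(np.sqrt(np.mean(errors ** 2)))


# Training mean is 4; the evaluation errors are -3 and +3, so the RMSE is 3
print(baseline_rmse([2, 4, 6], [1, 7]))
```

Any model worth keeping should score a lower RMSE than this constant prediction on the same data.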

## Features we are moving forward with

In [None]:
corr_test[corr_test.p < .05].sort_values(by='p').reset_index().drop(columns ='index')

In [None]:
# Splitting the data into features (X) and target (y) for each subset
X_train, X_validate, X_test, y_train, y_validate, y_test = wrangle.get_X_train_val_test(train,validate, test, columns_list,target)

In [None]:
# Scaling on selected features to be sent into model
X_train, X_validate, X_test = wrangle.scale_data(X_train, X_validate,X_test,cols = columns_list)
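Which scaler `wrangle.scale_data` uses is not shown. The sketch below assumes a `MinMaxScaler` (one of the scalers imported above) and illustrates the usual pattern: fit on train only, then apply the same transformation to validate and test. The name `scale_data_sketch` and the toy column are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_data_sketch(X_train, X_validate, X_test, cols):
    """Hypothetical stand-in for wrangle.scale_data using min-max scaling."""
    # Fit on the training data only to avoid leaking validate/test information
    scaler = MinMaxScaler().fit(X_train[cols])
    scaled = []
    for part in (X_train, X_validate, X_test):
        part = part.copy()  # leave the caller's frames untouched
        part[cols] = scaler.transform(part[cols])
        scaled.append(part)
    return tuple(scaled)


toy_train = pd.DataFrame({"pass_yds": [0.0, 5.0, 10.0]})
toy_validate = pd.DataFrame({"pass_yds": [5.0]})
toy_test = pd.DataFrame({"pass_yds": [20.0]})
s_train, s_validate, s_test = scale_data_sketch(
    toy_train, toy_validate, toy_test, cols=["pass_yds"]
)
print(s_train, s_validate, s_test, sep="\n")
```

Because the scaler only sees the training range, values outside that range in validate or test can land outside [0, 1], as the toy test row does here.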

In [None]:
# Running the data through the models
df1, df2, df3,predict_linear, feature_weights, predict_linear_test  = wrangle.get_model_numbers(X_train, X_validate, X_test, y_train, y_validate, y_test)
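The body of `wrangle.get_model_numbers` is not included here. Judging only from the imports, the four model types may be OLS, LassoLars, a Tweedie GLM, and polynomial regression; that is a guess. The sketch below, with hypothetical names and arbitrary hyperparameters, shows how such models could be fit and compared by RMSE on train and validate.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLars, LinearRegression, TweedieRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def compare_models(X_train, y_train, X_validate, y_validate):
    """Fit four assumed model types and report RMSE on train and validate."""
    models = {
        "ols": LinearRegression(),
        "lasso_lars": LassoLars(alpha=0.1),
        "tweedie": TweedieRegressor(power=0, alpha=0.1),
        "poly2_ols": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    }
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        rows.append(
            {
                "model": name,
                "rmse_train": rmse(y_train, model.predict(X_train)),
                "rmse_validate": rmse(y_validate, model.predict(X_validate)),
            }
        )
    return pd.DataFrame(rows).sort_values("rmse_validate").reset_index(drop=True)


# Toy data: a linear signal in two features plus noise
rng = np.random.default_rng(10)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
results = compare_models(X[:140], y[:140], X[140:], y[140:])
print(results)
```

Sorting by validate RMSE puts the candidate for the final test-set evaluation in the first row.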

In [None]:
train['predicted'] = predict_linear.tolist()

# Looking at predicted vs actual for a given year

In [None]:
pd.set_option('display.max_rows', None)
train[['predicted','percent_of_cap', 'year']].sort_values(by=['percent_of_cap','year'], ascending = False)

## Train Data

In [None]:
# Models on the training data
df1

## Validate Data

In [None]:
# Models on the validate data
df2

## Test Data

In [None]:
# Model on the unseen test data
df3

# Using the fold method to split the data

In [None]:
# Running the data through the models with the new train, test method.
master_df,best_parameters = wrangle.run_fold(df, columns_list, target)
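What `wrangle.run_fold` does internally is not shown. Assuming "fold method" refers to k-fold cross-validation, a minimal sketch with scikit-learn could look like this. The function `kfold_rmse`, the choice of five folds, and the use of OLS are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def kfold_rmse(X, y, n_splits=5, seed=10):
    """Per-fold RMSE of an OLS model under shuffled k-fold cross-validation."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(
        LinearRegression(), X, y, cv=folds, scoring="neg_root_mean_squared_error"
    )
    # scikit-learn reports negated errors so that higher is better; flip the sign
    return -scores


# Toy data: a linear signal in two features plus noise
rng = np.random.default_rng(10)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)
fold_scores = kfold_rmse(X, y)
print(fold_scores.mean(), fold_scores.std())
```

Each row is used for validation exactly once, which makes better use of a small dataset than a single train/validate/test split.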

In [None]:
master_df

In [None]:
# Best parameters found by the fold method
best_parameters

# Modeling Summary
> - Our ordinary least squares (OLS) model performed best, with an RMSE of 5.48 on validate
> - On unseen test data, the model beat the baseline

# Conclusion

> - The features followed different distributions; see the univariate charts above

### Features that were statistically significant

In [None]:
corr_test[corr_test.p < .05].sort_values(by='p').reset_index().drop(columns ='index')

### NLP Insights

In [None]:
import w_wrangle as wran

In [None]:
#Acquire and Prepare

comm = wran.acquire_commentary()
comm.player_commentary = comm.player_commentary.apply(wran.clean_strings)

comm.head()

In [None]:
# Get grams to put into visualizations

unigram_high_words, unigram_mid_words, unigram_low_words, bi_tri_high_words, bi_tri_mid_words, bi_tri_low_words = wran.get_grams(comm)
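The implementation of `wran.get_grams` is not shown. As a rough illustration of what counting unigrams, bigrams, and trigrams involves, the sketch below uses only the standard library. The helper `ngram_counts` is hypothetical and assumes the text has already been cleaned and can be split on whitespace.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams in a cleaned, whitespace-separated string."""
    tokens = text.split()
    # zip over n staggered copies of the token list to form sliding windows
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


sample = "super bowl win super bowl"
print(ngram_counts(sample, 1).most_common(2))
print(ngram_counts(sample, 2).most_common(1))
print(ngram_counts(sample, 3))
```

Running the same counter separately on the high, mid, and low cap tiers would give per-tier frequency tables comparable to the ones plotted below.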

In [None]:
#Unigram visualizations

wran.viz_unigrams(unigram_high_words, unigram_mid_words, unigram_low_words)

In [None]:
#Bigram visualizations

wran.viz_bigrams(bi_tri_high_words, bi_tri_mid_words, bi_tri_low_words)

In [None]:
#Trigram visualizations

wran.viz_trigrams(bi_tri_high_words, bi_tri_mid_words, bi_tri_low_words)

### Sentiment Analysis

In [None]:
#Get Sentiment Scores

wran.get_sia_scores(comm)

### Key Findings

- 'Career' is not mentioned in low-percentage caps.
- High-percentage caps have a larger set of unique words. Many of them could be comparisons to "the Greats" (Mahomes, Hurts, Wentz, Brady).
- Low-percentage caps talked a lot about backup (presumably backup quarterbacks), field, and run. Mentions of "Jets" indicate a team with historically low-percentage-cap quarterbacks.
- Across all three cap tiers, the conversation revolves around winning the Super Bowl.
- High-percentage caps talk about the Super Bowl and their performance in it significantly more. Mid- and low-percentage caps also focus on their performance in the NFC.
- Sentiment scores are very high, if not maxed out, across all tiers. Only the low-percentage cap quarterbacks had a slightly lower score.
- Rushing attempts by quarterbacks are trending up, which indicates an evolution of the position. There is a major spike after 2020.
- Quarterbacks who make the playoffs command a higher percentage of a team's salary cap, regardless of the round their team reaches.

# Recommendations
> - Because our model beat the baseline, we recommend using it to assess the value of quarterbacks.



# Next Steps
> - Look into retrieving a corpus from a text aggregator like ChatGPT.
> - Run through the entire pipeline for positions other than quarterback.
> - Look into quantifying qualitative attributes of players, such as behavioral interviews and Wonderlic tests.