This portion of the FIFA project created by: Kayla Brock | August 9, 2022

![adultsoccer.jpeg](attachment:adultsoccer.jpeg)

In [1]:
from imports import *

# Defender Salary Prediction

### Project Goal

The goal of this report is to create a linear regression model that can accurately estimate a player's contract amount within a 10% margin of error. This information can be used by investors to estimate the cost of starting and/or maintaining a team.

### Sub-Project Goal

The goal of this report is to create a linear regression model that can accurately estimate a defender's contract amount within a 10% margin of error. 
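
One possible reading of the 10% criterion is the share of predictions whose absolute error is no more than 10% of the actual wage. The sketch below illustrates that reading; the helper name `within_margin` and the sample numbers are hypothetical and not part of the project's modules, and the calculation assumes every actual wage is nonzero.

```python
import numpy as np

def within_margin(y_true, y_pred, margin=0.10):
    """Return the fraction of predictions within `margin` of the actual value.

    Hypothetical helper for illustration; assumes no actual value is zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Relative error of each prediction against its actual value
    relative_error = np.abs(y_pred - y_true) / np.abs(y_true)
    return float(np.mean(relative_error <= margin))

# Made-up wages, only to show the calculation
actual = [1000.0, 2000.0, 4000.0, 8000.0]
predicted = [1050.0, 2500.0, 3900.0, 8100.0]

share = within_margin(actual, predicted)
print(f"Share of predictions within 10% of the actual wage: {share:.0%}")
```

The same function could be applied to validation-set predictions once a model is fit, alongside a standard metric such as RMSE.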

### Initial Questions

- Is there a relationship between a defender's sliding tackle skill and the salary they earn?
- Is there a relationship between a defender's standing tackle skill level and salary?
- Is there a relationship between a defender's interceptions skill level and salary?
- Is there a relationship between a defender's overall skill level and salary?
- Is there a relationship between a defender's defending skill level and salary?

### Acquire Data

In [2]:
df = acquire.get_fifa_data()

### Prepare Data 

In [3]:
df = prepare.prepped_data(df)

Before dropping nulls, 142079 rows, 111 cols
After dropping nulls. 131489 rows. 66 cols
After cleaning the data and adding additional columns there are: 98804 rows. 83 cols


In [4]:
goalkeeper_df, forward_df, midfielder_df, defender_df = prepare.acquire_players_by_position(df)

In [21]:
defender_df.wage_eur.max()

375000.0

### Split Dataframe

In [None]:
train_d, validate_d, test_d = prepare.split(defender_df)

In [None]:
X_train_defender, y_train_defender, X_validate_defender, y_validate_defender, X_test_defender, y_test_defender = prepare.defender_split(train_d, validate_d, test_d)

# Exploration

#### _Univariate_

In [None]:
# distribution of wage_eur
fig = px.histogram(train_d, 
                   x='wage_eur', 
                   marginal='box', 
                   color_discrete_sequence=['orange'], 
                   title='Distribution of Wage')
fig.update_layout(bargap=0.1)
fig.show()

_Defenders' wages are right-skewed_
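
Right skew can be quantified rather than read off the histogram, and a log transform is one common way to reduce it before fitting a linear model. The sketch below uses synthetic lognormal values as a stand-in for `wage_eur`; the helper `sample_skewness` is hypothetical, and the notebook does not state that any transform was applied to the real data.

```python
import numpy as np

def sample_skewness(values):
    """Return the (biased) sample skewness: the mean cubed z-score."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed data standing in for wage_eur (assumption)
rng = np.random.default_rng(0)
wages = rng.lognormal(mean=8.0, sigma=1.0, size=5000)

raw_skew = sample_skewness(wages)
log_skew = sample_skewness(np.log1p(wages))

print(f"Skewness of raw values:      {raw_skew:.2f}")
print(f"Skewness after log1p:        {log_skew:.2f}")
```

A skewness well above zero indicates a long right tail; a value near zero after the transform suggests the log scale is closer to symmetric.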

#### Bivariate

### _Is there a relationship between a defender's sliding tackle skill and the salary they earn?_

Pearson's Correlation Coefficient

$\alpha$ = .05

$H_{0}$: There is no linear correlation between sliding tackle skill and salary.

$H_{a}$: There is a linear correlation between sliding tackle skill and salary.

In [None]:
plt.figure(figsize = (13, 7))
# pass x and y as keywords; recent seaborn versions reject positional data arguments
sns.boxplot(x=train_d.sliding_tackle, y=train_d.wage_eur)
plt.title("Is there a relationship between a defender's sliding tackle skill and the salary they earn?")
plt.show()

_There appears to be a positive correlation between sliding tackle skill and player wage_

In [None]:
#set alpha
α = 0.05

#perform test
r, p = pearsonr(train_d.sliding_tackle, train_d.wage_eur)

#evaluate coefficient and p-value
print(f'Correlation Coefficient: {r:.3f}\nP-value: {p:.3f}')

#evaluate if p < α
if p < α:
    print('Reject the null hypothesis.')
else:
    print('Fail to reject the null hypothesis.')

### _Is there a relationship between a defender's standing tackle skill level and salary?_

Pearson's Correlation Coefficient

$\alpha$ = .05

$H_{0}$: There is no linear correlation between standing tackle skill and salary.

$H_{a}$: There is a linear correlation between standing tackle skill and salary.

In [None]:
plt.figure(figsize = (13, 7))
# pass x and y as keywords; recent seaborn versions reject positional data arguments
sns.boxplot(x=train_d.standing_tackle, y=train_d.wage_eur)
plt.title("Is there a relationship between a defender's standing tackle skill level and salary?")
plt.show()

_There appears to be a positive correlation between standing tackle skill and player wage_

In [None]:
#set alpha
α = 0.05

#perform test
r, p = pearsonr(train_d.standing_tackle, train_d.wage_eur)

#evaluate coefficient and p-value
print(f'Correlation Coefficient: {r:.3f}\nP-value: {p:.3f}')

#evaluate if p < α
if p < α:
    print('Reject the null hypothesis.')
else:
    print('Fail to reject the null hypothesis.')

### _Is there a relationship between a defender's interceptions skill level and salary?_

Pearson's Correlation Coefficient

$\alpha$ = .05

$H_{0}$: There is no linear correlation between interceptions skill and salary.

$H_{a}$: There is a linear correlation between interceptions skill and salary.

In [None]:
plt.figure(figsize = (13, 7))
# pass x and y as keywords; recent seaborn versions reject positional data arguments
sns.boxplot(x=train_d.interceptions, y=train_d.wage_eur)
plt.title("Is there a relationship between a defender's interceptions skill level and salary?")
plt.show()

_There appears to be a positive correlation between interceptions skill and player wage_

In [None]:
#set alpha
α = 0.05

#perform test
r, p = pearsonr(train_d.interceptions, train_d.wage_eur)

#evaluate coefficient and p-value
print(f'Correlation Coefficient: {r:.3f}\nP-value: {p:.3f}')

#evaluate if p < α
if p < α:
    print('Reject the null hypothesis.')
else:
    print('Fail to reject the null hypothesis.')

### _Is there a relationship between a defender's overall skill level and salary?_

Pearson's Correlation Coefficient

$\alpha$ = .05

$H_{0}$: There is no linear correlation between overall skill and salary.

$H_{a}$: There is a linear correlation between overall skill and salary.

In [None]:
plt.figure(figsize = (13, 7))
# pass x and y as keywords; recent seaborn versions reject positional data arguments
sns.boxplot(x=train_d.overall, y=train_d.wage_eur)
plt.title("Does overall skill impact salary?")
plt.show()

_There appears to be a positive correlation between overall skill and player wage_

In [None]:
#set alpha
α = 0.05

#perform test
r, p = pearsonr(train_d.overall, train_d.wage_eur)

#evaluate coefficient and p-value
print(f'Correlation Coefficient: {r:.3f}\nP-value: {p:.3f}')

#evaluate if p < α
if p < α:
    print('Reject the null hypothesis.')
else:
    print('Fail to reject the null hypothesis.')

### _Is there a relationship between a defender's defending skill level and salary?_

Pearson's Correlation Coefficient

$\alpha$ = .05

$H_{0}$: There is no linear correlation between defending skill and salary.

$H_{a}$: There is a linear correlation between defending skill and salary.

In [None]:
plt.figure(figsize = (13, 7))
# pass x and y as keywords; recent seaborn versions reject positional data arguments
sns.boxplot(x=train_d.defending, y=train_d.wage_eur)
plt.title("Does defending skill impact salary?")
plt.show()

_There appears to be a positive correlation between defending skill and player salary_

In [None]:
#set alpha
α = 0.05

#perform test
r, p = pearsonr(train_d.defending, train_d.wage_eur)

#evaluate coefficient and p-value
print(f'Correlation Coefficient: {r:.3f}\nP-value: {p:.3f}')

#evaluate if p < α
if p < α:
    print('Reject the null hypothesis.')
else:
    print('Fail to reject the null hypothesis.')
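
The five tests above repeat the same steps, so they could be collected into one loop. The sketch below shows one way to do that; `correlation_report` is a hypothetical helper, and the demo DataFrame is synthetic data standing in for `train_d`.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def correlation_report(df, features, target, alpha=0.05):
    """Run Pearson's r for each feature against the target and tabulate the results."""
    rows = []
    for feature in features:
        r, p = pearsonr(df[feature], df[target])
        rows.append({'feature': feature,
                     'r': r,
                     'p_value': p,
                     'reject_null': bool(p < alpha)})
    return pd.DataFrame(rows)

# Synthetic stand-in for train_d (assumption): wage rises with skill, plus noise
rng = np.random.default_rng(42)
skill = rng.integers(40, 90, size=500)
demo = pd.DataFrame({
    'skill': skill,
    'wage': 100 * skill + rng.normal(0, 500, size=500),
    'unrelated': rng.normal(0, 1, size=500),
})

report = correlation_report(demo, ['skill', 'unrelated'], 'wage')
print(report)
```

With the real data, the call would take the five skill columns tested above and `'wage_eur'` as the target. Note that Pearson's r measures linear association only, so a right-skewed target can understate a relationship that is strong but curved.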

# Clustering

#### ANOVA test on clusters built from overall and defending

In [None]:
#significance level 
a = 0.05 
#define x 
X = X_train_defender[['overall', 'defending']]
#define kmeans
kmeans = KMeans(n_clusters=4)
#fit 
kmeans.fit(X)

In [None]:
train_d['clusters'] = kmeans.predict(X)

In [None]:
# Find K: evaluate best k using the elbow method
# Note: on matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(12, 6))
    pd.Series({k: KMeans(k).fit(X).inertia_ for k in range(2, 15)}).plot(marker='x')
    plt.xticks(range(2, 15))
    plt.xlabel('k')
    plt.ylabel('inertia')
    plt.title('Change in inertia as k increases')

In [None]:
# look at the mean of each cluster (select multiple columns with a list)
train_d.groupby('clusters')[['overall', 'defending']].mean()

#### _The ANOVA test is used to check whether mean salary differs significantly between clusters_

- $H_{0}$: There is no significant difference between the mean salaries of the clusters
- $H_{a}$: There is a significant difference between the mean salaries of the clusters

In [None]:
alpha = 0.05

F, p = stats.f_oneway(train_d[train_d.clusters == 0].wage_eur,
                      train_d[train_d.clusters == 1].wage_eur,
                      train_d[train_d.clusters == 2].wage_eur,
                      train_d[train_d.clusters == 3].wage_eur)

print('ANOVA Test Results on defender Clusters')
print('F-value: ',F)
print('p-value: ',p)
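
The cell above prints the F statistic and p-value but stops short of comparing the p-value to the significance level, as the Pearson tests do. A minimal sketch of that last step follows, using four synthetic groups in place of the per-cluster `wage_eur` values; the group means and sizes are made up.

```python
import numpy as np
from scipy import stats

alpha = 0.05

# Synthetic stand-ins for the four clusters' wages (assumption)
rng = np.random.default_rng(1)
groups = [rng.normal(loc=mean, scale=1000, size=200)
          for mean in (5000, 6000, 9000, 12000)]

# One-way ANOVA across all groups
F, p = stats.f_oneway(*groups)

if p < alpha:
    decision = 'Reject the null hypothesis.'
else:
    decision = 'Fail to reject the null hypothesis.'

print(f'F-value: {F:.3f}\np-value: {p:.3g}')
print(decision)
```

Rejecting the null here only indicates that at least one cluster's mean salary differs from the others; it does not say which clusters differ.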