# Before you start:
- Read the README.md file
- Comment your code as much as you can and use the resources in the README.md file
- Happy learning!

In [46]:
%matplotlib inline
# Import the libraries used in this lab (data handling, regex, statistics, plotting)
import pandas as pd
import numpy as np
import re
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import linregress
import matplotlib.pyplot as plt

# Challenge 1 - Analysis of Variance

In this part of the lesson, we will perform an analysis of variance to determine whether the factors in our model create a significant difference in the group means. We will be examining a dataset of FIFA players. We'll start by loading the data using the code in the cell below.

In [2]:
# Run this code:
fifa = pd.read_csv('fifa.csv')

Let's examine the dataset by looking at the `head`.

In [3]:
# Your code here:
fifa.head()

Unnamed: 0,Name,Age,Nationality,Overall,Potential,Club,Value,Preferred Foot,Position,Weak Foot,Acceleration,SprintSpeed,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties
0,L. Messi,31,Argentina,94,94,FC Barcelona,€110.5M,Left,RF,4.0,91.0,86.0,72.0,59.0,94.0,48.0,22.0,94.0,94.0,75.0
1,Cristiano Ronaldo,33,Portugal,94,94,Juventus,€77M,Right,ST,4.0,89.0,91.0,88.0,79.0,93.0,63.0,29.0,95.0,82.0,85.0
2,Neymar Jr,26,Brazil,92,93,Paris Saint-Germain,€118.5M,Right,LW,5.0,94.0,90.0,81.0,49.0,82.0,56.0,36.0,89.0,87.0,81.0
3,De Gea,27,Spain,91,93,Manchester United,€72M,Right,GK,3.0,57.0,58.0,43.0,64.0,12.0,38.0,30.0,12.0,68.0,40.0
4,K. De Bruyne,27,Belgium,91,92,Manchester City,€102M,Right,RCM,5.0,78.0,76.0,90.0,75.0,91.0,76.0,61.0,87.0,94.0,79.0


Players' values are expressed in millions of euros. We would like this column to be numeric, so let's create a numeric value column by stripping all non-numeric characters from each cell. Assign the new data to `ValueNumeric`. There is no need to rescale the values; leave them expressed in millions.

In [11]:
# Your code here:
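# Remove every character that is not a digit or a decimal point (e.g. '€110.5M' -> '110.5')
fifa['ValueNumeric'] = [re.sub(r'[^0-9.]', '', e) for e in fifa['Value']]
# Convert the cleaned strings to floats
fifa['ValueNumeric'] = fifa['ValueNumeric'].astype(float)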
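
Stripping the non-numeric characters keeps only the digits, so it would treat `€500K` and `€500M` as the same number. Only `M` values appear in the rows shown above, so whether `fifa.csv` also contains `K` or suffix-free values is an assumption here; if it does, a unit-aware conversion along these lines may be safer. The helper name `parse_value`, the divisor table, and the sample strings are hypothetical.

```python
import re

# Divisors that turn each assumed suffix into millions of euros.
# "" (no suffix) is assumed to mean plain euros.
DIVISOR_TO_MILLIONS = {"M": 1, "K": 1_000, "": 1_000_000}

def parse_value(text):
    """Convert a string such as '€110.5M' or '€500K' to millions of euros."""
    match = re.fullmatch(r"€?([0-9]*\.?[0-9]+)([MK]?)", text.strip())
    if match is None:
        return float("nan")  # leave unparseable cells as missing
    number, suffix = match.groups()
    return float(number) / DIVISOR_TO_MILLIONS[suffix]

# Made-up sample strings, not rows from the dataset.
samples = ["€110.5M", "€77M", "€500K", "€0"]
print([parse_value(s) for s in samples])
```

On the real data this could be applied with `fifa['Value'].map(parse_value)`.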

In [12]:
fifa.dtypes

Name               object
Age                 int64
Nationality        object
Overall             int64
Potential           int64
Club               object
Value              object
Preferred Foot     object
Position           object
Weak Foot         float64
Acceleration      float64
SprintSpeed       float64
Stamina           float64
Strength          float64
LongShots         float64
Aggression        float64
Interceptions     float64
Positioning       float64
Vision            float64
Penalties         float64
ValueNumeric      float64
dtype: object

#### We'd like to determine whether a player's preferred foot and position have an impact on their value.

Using the `statsmodels` library, we are able to produce an ANOVA table without munging our data. Create an ANOVA table with value as a function of position and preferred foot. Recall that the `C` function tells the formula to treat a column as categorical.

Hint: For columns that have a space in their name, it is best to refer to the column using the dataframe (For example: for column `A`, we will use `df['A']`).

In [20]:
# Your code here:
res = ols("ValueNumeric ~ C(fifa['Preferred Foot'], Sum)*C(fifa['Position'], Sum)", data=fifa).fit()
print(res.summary())
sm.stats.anova_lm(res, typ=2)

                            OLS Regression Results                            
Dep. Variable:           ValueNumeric   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     2.738
Date:                Wed, 29 Apr 2020   Prob (F-statistic):           1.90e-10
Time:                        17:08:04   Log-Likelihood:            -1.2856e+05
No. Observations:               18147   AIC:                         2.572e+05
Df Residuals:                   18093   BIC:                         2.576e+05
Df Model:                          53                                         
Covariance Type:            nonrobust                                         
                                                                             coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------

Unnamed: 0,sum_sq,df,F,PR(>F)
"C(fifa['Preferred Foot'], Sum)",72922.11,1.0,0.8723,0.3503319
"C(fifa['Position'], Sum)",8767522.0,26.0,4.033759,2.141952e-11
"C(fifa['Preferred Foot'], Sum):C(fifa['Position'], Sum)",3050243.0,26.0,1.403355,0.083279
Residual,1512530000.0,18093.0,,
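
Each `F` value in the table above is the term's mean square (`sum_sq / df`) divided by the residual mean square. As a check, the short sketch below recomputes the statistics from the rounded numbers printed in the table; the function name `f_statistic` and the shortened term labels are just illustrative.

```python
def f_statistic(sum_sq, df, resid_sum_sq, resid_df):
    """F = (term sum of squares / term df) / (residual sum of squares / residual df)."""
    return (sum_sq / df) / (resid_sum_sq / resid_df)

# Values copied from the ANOVA table (already rounded in the output).
resid_sum_sq, resid_df = 1512530000.0, 18093.0
terms = {
    "Preferred Foot": (72922.11, 1.0),
    "Position": (8767522.0, 26.0),
    "Preferred Foot:Position": (3050243.0, 26.0),
}

# Recompute F for every term and store it for comparison with the table.
recomputed = {
    name: f_statistic(sum_sq, df, resid_sum_sq, resid_df)
    for name, (sum_sq, df) in terms.items()
}
for name, f_value in recomputed.items():
    print(name, round(f_value, 3))
```

The results agree with the `F` column up to rounding.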


What is your conclusion from this ANOVA?

In [6]:
# Your conclusions here:
# The mean value of a football player differs significantly by position (p = 2.1e-11), but not by preferred foot (p = 0.35).
# The interaction between preferred foot and position is not significant at the 5% level (p = 0.083).

After looking at a model of both preferred foot and position, we decide to create an ANOVA table for nationality. Create an ANOVA table for numeric value as a function of nationality.

In [21]:
# Your code here:
nat = ols("ValueNumeric ~ C(fifa['Nationality'], Sum)", data=fifa).fit()
print(nat.summary())
sm.stats.anova_lm(nat, typ=2)

                            OLS Regression Results                            
Dep. Variable:           ValueNumeric   R-squared:                       0.028
Model:                            OLS   Adj. R-squared:                  0.019
Method:                 Least Squares   F-statistic:                     3.203
Date:                Wed, 29 Apr 2020   Prob (F-statistic):           1.98e-38
Time:                        17:14:33   Log-Likelihood:            -1.2878e+05
No. Observations:               18207   AIC:                         2.579e+05
Df Residuals:                   18043   BIC:                         2.592e+05
Df Model:                         163                                         
Covariance Type:            nonrobust                                         
                                                          coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------

Unnamed: 0,sum_sq,df,F,PR(>F)
"C(fifa['Nationality'], Sum)",42929140.0,163.0,3.202987,1.9762529999999998e-38
Residual,1483605000.0,18043.0,,


What is your conclusion from this ANOVA?

In [22]:
# The mean value of a football player differs significantly by nationality (F = 3.203, p = 1.98e-38).

# Challenge 2 - Linear Regression

Our goal with using linear regression is to create a mathematical model that will enable us to predict the outcome of one variable using one or more additional independent variables.

We'll start by ensuring there are no missing values. Examine all variables for all missing values. If there are missing values in a row, remove the entire row.
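
One way to examine missing values before dropping rows is to count them per column with `isna().sum()`. The sketch below uses a small made-up dataframe (`toy`) rather than the FIFA data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries.
toy = pd.DataFrame({
    "Stamina": [72.0, np.nan, 81.0, 43.0],
    "Position": ["RF", "ST", None, "GK"],
    "Potential": [94, 94, 93, 93],
})

missing_per_column = toy.isna().sum()  # number of missing cells in each column
clean = toy.dropna()                   # drop every row with at least one missing cell

print(missing_per_column)
print(toy.shape, "->", clean.shape)
```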

In [23]:
# Your code here:
print(fifa.shape)
fifa = fifa.dropna()
print(fifa.shape)

(18207, 21)
(17918, 21)


Using the FIFA dataset, in the cell below, create a linear model predicting value using stamina and sprint speed. Create the model using `statsmodels`. Print the model summary.

Hint: remember to add an intercept to the model using the `add_constant` function.
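
To see why the constant matters, here is a numpy-only sketch of an ordinary least squares fit on made-up data. It builds the column of ones by hand, which is what `sm.add_constant` does for you; without that column the fitted line would be forced through the origin.

```python
import numpy as np

# Made-up data that follows y = 2 + 3x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution; coef[0] is the intercept and coef[1] the slope.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```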

In [25]:
fifa.head(3)

Unnamed: 0,Name,Age,Nationality,Overall,Potential,Club,Value,Preferred Foot,Position,Weak Foot,...,SprintSpeed,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,ValueNumeric
0,L. Messi,31,Argentina,94,94,FC Barcelona,€110.5M,Left,RF,4.0,...,86.0,72.0,59.0,94.0,48.0,22.0,94.0,94.0,75.0,110.5
1,Cristiano Ronaldo,33,Portugal,94,94,Juventus,€77M,Right,ST,4.0,...,91.0,88.0,79.0,93.0,63.0,29.0,95.0,82.0,85.0,77.0
2,Neymar Jr,26,Brazil,92,93,Paris Saint-Germain,€118.5M,Right,LW,5.0,...,90.0,81.0,49.0,82.0,56.0,36.0,89.0,87.0,81.0,118.5


In [28]:
# Your code here:
X = sm.add_constant(fifa[['SprintSpeed','Stamina']])
Y = fifa['ValueNumeric']

linmod = sm.OLS(Y,X).fit()
predictions = linmod.predict(X)

linmod.summary()

0,1,2,3
Dep. Variable:,ValueNumeric,R-squared:,0.0
Model:,OLS,Adj. R-squared:,0.0
Method:,Least Squares,F-statistic:,4.454
Date:,"Wed, 29 Apr 2020",Prob (F-statistic):,0.0116
Time:,17:38:17,Log-Likelihood:,-127020.0
No. Observations:,17918,AIC:,254000.0
Df Residuals:,17915,BIC:,254100.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,280.4297,10.390,26.991,0.000,260.065,300.795
SprintSpeed,0.3186,0.188,1.693,0.091,-0.050,0.688
Stamina,-0.5173,0.174,-2.978,0.003,-0.858,-0.177

0,1,2,3
Omnibus:,2098.571,Durbin-Watson:,0.967
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2259.025
Skew:,0.819,Prob(JB):,0.0
Kurtosis:,2.413,Cond. No.,444.0


Report your findings from the model summary. In particular, report about the model as a whole using the F-test and how much variation is predicted by the model using the r squared.

In [11]:
# Your conclusions here:
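# The overall F-test is significant at the 5% level (F = 4.454, p = 0.0116), so sprint speed and stamina
# are jointly related to value, but R-squared is 0.000: the model explains almost none of the variation
# in value and is useless for prediction.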

Next, create a second regression model predicting value using potential. Create the model using `statsmodels` and print the model summary. Remember to add a constant term.

In [42]:
# Your code here:
X = sm.add_constant(fifa[['Potential']])
Y = fifa['ValueNumeric']

linmod1 = sm.OLS(Y,X).fit()
predictions = linmod1.predict(X)

linmod1.summary()

0,1,2,3
Dep. Variable:,ValueNumeric,R-squared:,0.056
Model:,OLS,Adj. R-squared:,0.056
Method:,Least Squares,F-statistic:,1054.0
Date:,"Wed, 29 Apr 2020",Prob (F-statistic):,9.15e-225
Time:,18:28:54,Log-Likelihood:,-126510.0
No. Observations:,17918,AIC:,253000.0
Df Residuals:,17916,BIC:,253000.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1062.4312,24.547,43.281,0.000,1014.316,1110.546
Potential,-11.1326,0.343,-32.469,0.000,-11.805,-10.461

0,1,2,3
Omnibus:,2018.008,Durbin-Watson:,1.099
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2748.489
Skew:,0.953,Prob(JB):,0.0
Kurtosis:,2.78,Cond. No.,834.0


Report your findings from the model summary. In particular, report about the model as a whole using the F-test and how much variation is predicted by the model using the r squared.
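
For an OLS model with an intercept, the overall F statistic and R-squared are linked by `F = (R2 / k) / ((1 - R2) / (n - k - 1))`, where `k` is `Df Model` and `n - k - 1` is `Df Residuals`. Inverting this relation recovers more decimals of R-squared than the summaries print. The sketch below plugs in the values reported in the two summaries above; the function name is illustrative.

```python
def r_squared_from_f(f_stat, df_model, df_resid):
    """Invert F = (R2 / df_model) / ((1 - R2) / df_resid) to recover R2."""
    return (f_stat * df_model) / (f_stat * df_model + df_resid)

# Values copied from the two model summaries (F-statistic, Df Model, Df Residuals).
r2_speed_stamina = r_squared_from_f(4.454, 2, 17915)
r2_potential = r_squared_from_f(1054.0, 1, 17916)

print(round(r2_speed_stamina, 4), round(r2_potential, 4))
```

So the first model explains roughly 0.05% of the variation in value and the second roughly 5.6%, even though both F-tests reject the null hypothesis at the 5% level.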

In [13]:
# Your conclusions here:
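# This model predicts the value of a football player better than the previous one: the F-test is
# highly significant (F = 1054, p = 9.15e-225).
# However, R-squared is only 0.056, so most of the variation in value is still unexplained by the model.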

Plot a scatter plot of value vs. potential. Do you see a linear relationship?

In [49]:
# Your code here:
# Scatter plot of value (y-axis) against potential (x-axis)
plt.scatter(fifa['Potential'], fifa['ValueNumeric'])
plt.xlabel('Potential')
plt.ylabel('ValueNumeric')