***IMPORTING NECESSARY PACKAGES***



We import necessary packages for the analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pylab as pl
from sklearn import preprocessing
import seaborn as sns
%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

***READING IN THE DATA***

In [None]:
df = pd.read_csv("../input/fifa-21-complete-player-dataset/players_21.csv")

In [None]:
pd.set_option('display.max_rows', None) 
pd.set_option('display.max_columns', None) 
pd.set_option('display.width', None) 

***DATA EXPLORATION***

In [None]:
df.head(10) 

In [None]:
df.isnull().sum()

**- TOP 10 TALLEST PLAYERS** 

The first insight we are looking to pull from the data is "who are the top 10 tallest players?"


**Steps**

We select the columns "short_name" and "height_cm" into a new dataframe called "players_heights".

In [None]:
players_heights = df[["short_name", "height_cm"]]

In [None]:
pd.set_option('display.max_rows', 30) 
pd.set_option('display.max_columns', None) 
pd.set_option('display.width', None) 

In [None]:
players_heights

The top 10 tallest players are:

In [None]:
players_heights.nlargest(10, "height_cm")
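Several players can share the same height, so a fixed top 10 may cut off players tied at the boundary. The sketch below uses a small made-up dataframe (`toy`, not the FIFA data) to show how the `keep="all"` option of `nlargest` retains every row tied at the cutoff.

```python
import pandas as pd

# Hypothetical toy data standing in for the "short_name" / "height_cm" columns
toy = pd.DataFrame({
    "short_name": ["A", "B", "C", "D"],
    "height_cm": [200, 198, 198, 190],
})

# Default keep="first": exactly n rows, ties broken by row order
top2_first = toy.nlargest(2, "height_cm")

# keep="all": every row tied with the last selected value is kept as well
top2_all = toy.nlargest(2, "height_cm", keep="all")

print(len(top2_first), len(top2_all))
```

With `keep="all"` the result can contain more than the requested number of rows, which is worth remembering when the list is described as a "top 10".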

Graphical representation of the top 10 tallest players:

In [None]:
Top_10_tallest_players = players_heights.nlargest(10, "height_cm")

In [None]:
Top_10_tallest_players.plot(x ='short_name', y='height_cm', kind = 'scatter', figsize=(15, 10))

From the graph, we can see the tallest player is T. Holý. He is a Czech professional footballer who plays as a goalkeeper for Ipswich Town.
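Because the x axis holds player names rather than numbers, a horizontal bar chart may be easier to read than a scatter plot. The following is only a sketch on made-up values (`toy_heights` is a hypothetical stand-in for the top-10 dataframe).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook; omit under %matplotlib inline
import pandas as pd

# Hypothetical toy data, not the real player heights
toy_heights = pd.DataFrame({
    "short_name": ["A", "B", "C"],
    "height_cm": [206, 203, 201],
})

# Sort ascending so the tallest player ends up at the top of the chart
ax = toy_heights.sort_values("height_cm").plot(
    x="short_name", y="height_cm", kind="barh", legend=False, figsize=(8, 4)
)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Player")
```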

**- TOP 10 SHORTEST PLAYERS** 

The second insight we are looking to pull from the data is "who are the top 10 shortest players?"


The top 10 shortest players are:

In [None]:
players_heights.nsmallest(10, "height_cm")

Graphical representation of the top 10 shortest players:

In [None]:
Top_10_shortest_players = players_heights.nsmallest(10, "height_cm")

In [None]:
Top_10_shortest_players.plot(x ='short_name', y='height_cm', kind = 'scatter', figsize=(15, 10))

From the graph, we can see the shortest player is H. Nakagawa. He is a Japanese footballer who plays as a midfielder for Japanese club Shonan Bellmare, on loan from Kashiwa Reysol.

**- LOWEST RATED PLAYERS**

The lowest rating in FIFA 21 is 47. There are 16 players with this rating. They are shown below:

In [None]:
df.nsmallest(16,"overall")
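The count of 16 is hard-coded in the call above. One way to avoid that, sketched here on a made-up dataframe (`toy_ratings` is hypothetical), is to look up the minimum rating first and then filter on it, which also confirms how many players share it.

```python
import pandas as pd

# Hypothetical toy data standing in for the "overall" column
toy_ratings = pd.DataFrame({
    "short_name": ["A", "B", "C", "D"],
    "overall": [47, 52, 47, 60],
})

lowest_rating = toy_ratings["overall"].min()

# Boolean mask keeps every player whose rating equals the minimum
lowest_rated = toy_ratings[toy_ratings["overall"] == lowest_rating]

print(lowest_rating, len(lowest_rated))
```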

**- TOP 10 STRONGEST PLAYERS**

In [None]:
Strongest_players = df.nlargest(10, "power_strength")

The top 10 strongest players are shown below:

In [None]:
Strongest_players

Graphical representation of the top 10 strongest players:

In [None]:
Strongest_players.plot(x ='short_name', y='power_strength', kind = 'scatter', figsize=(15, 10))

**- 90+ RATED PLAYERS**

The players with 90+ ratings are shown below:

In [None]:
df.nlargest(12,"overall")
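Taking the 12 largest values only matches "90+" if exactly 12 players reach that rating. A threshold filter states the condition directly; the sketch below assumes a made-up dataframe (`toy_top`) in place of the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the "overall" column
toy_top = pd.DataFrame({
    "short_name": ["A", "B", "C", "D"],
    "overall": [93, 89, 90, 91],
})

# Keep ratings of 90 or more, highest first
rated_90_plus = toy_top[toy_top["overall"] >= 90].sort_values("overall", ascending=False)

print(rated_90_plus)
```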

**- AGE DISTRIBUTION IN THE TOP 10 MOST VALUABLE CLUBS IN THE WORLD**

The 10 clubs treated here as the most valuable in the world are Tottenham Hotspur, Paris Saint-Germain (PSG), Arsenal, Chelsea, Manchester City, Liverpool, Manchester United, Bayern Munich, Real Madrid and Barcelona.

In [None]:
top_valuable_club_names = ('FC Barcelona', 'Tottenham Hotspur', 'Paris Saint-Germain', 'Chelsea', 'Manchester City', "Manchester United", "Arsenal", "Liverpool", "Real Madrid", "Bayern Munich")

In [None]:
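# Keep only players whose club is in the list; combining the mask with df['age']
# via & would apply a bitwise AND and silently drop players with an even age
clubs = df.loc[df['club_name'].isin(top_valuable_club_names)]
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
# Pass x and y as keywords; recent seaborn versions reject positional data arguments
ax = sns.boxenplot(x='club_name', y='age', data=clubs, ax=ax)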
ax.set_title(label='Age distribution in the top 10 most valuable clubs', fontsize=25)
plt.xlabel('Clubs', fontsize=20)
plt.ylabel('Age', fontsize=20)
plt.grid()
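A numeric summary can back up what the plot shows. As a rough sketch on made-up data (`toy_clubs` is hypothetical, with only two clubs), grouping by club and aggregating the age column gives the squad size and the age range per club.

```python
import pandas as pd

# Hypothetical toy data standing in for the "club_name" / "age" columns
toy_clubs = pd.DataFrame({
    "club_name": ["Chelsea", "Chelsea", "Arsenal", "Arsenal", "Arsenal"],
    "age": [21, 30, 24, 26, 31],
})

# One row per club: number of players, median age, youngest and oldest
age_summary = toy_clubs.groupby("club_name")["age"].agg(["count", "median", "min", "max"])

print(age_summary)
```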

***PREDICTIVE ANALYTICS***

We want to predict the overall rating (output) a defender would have based on a set of features (input). First, we drop the following columns from the dataframe: "gk_diving", "gk_handling", "gk_kicking", "gk_reflexes", "gk_speed", "gk_positioning", "player_traits", "loaned_from", "player_tags", "nation_jersey_number", "nation_position", "defending_marking". As a common rule of thumb, a column is dropped when more than about 25% to 30% of its values are missing; these columns exceed that threshold.

In [None]:
df.drop(["gk_diving", "gk_handling","gk_kicking","gk_reflexes","gk_speed","gk_positioning", "player_traits", "loaned_from", "player_tags","nation_jersey_number", "nation_position", "defending_marking"], inplace=True, axis = 1)
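The list of columns above is written out by hand. As an illustration only, the same selection could be derived from the share of missing values per column; the sketch below uses a made-up dataframe (`toy_missing`) and an assumed cutoff of 30%, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one mostly empty column, one with a single gap
toy_missing = pd.DataFrame({
    "overall": [70, 65, 80, 75],
    "gk_diving": [np.nan, np.nan, np.nan, 60],  # 75% missing
    "pace": [80, np.nan, 70, 75],               # 25% missing
})

threshold = 0.30  # assumed cutoff, taken from the upper end of the 25% to 30% rule of thumb

# isnull().mean() gives the fraction of missing values in each column
missing_fraction = toy_missing.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

cleaned = toy_missing.drop(columns=cols_to_drop)
print(cols_to_drop)
```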

With those columns dropped, the following columns still have missing values: "club_name", "league_name", "league_rank", "release_clause_eur", "team_position", "team_jersey_number", "joined", "contract_valid_until", "pace", "shooting", "passing", "dribbling", "defending", "physic". Their share of missing values is within the allowed range, so we keep the columns and drop only the rows that contain missing values.

In [None]:
df.dropna(inplace=True)

In [None]:
df.isnull().sum()

**- FEATURE SELECTION**

Now that we have cleaned the data and have no missing values, we select our features for prediction.

In [None]:
x = df[["defending_standing_tackle","mentality_composure","defending_sliding_tackle", "attacking_heading_accuracy", "power_strength","mentality_aggression","mentality_interceptions", "attacking_short_passing", "skill_ball_control","movement_reactions" ,"power_jumping"]]
y = df["overall"]

In [None]:
x = preprocessing.StandardScaler().fit(x).transform(x.astype(float))

**- SPLIT DATA**

Now we split our dataset using an 80-20 split (80% for training and 20% for testing), after which we train a linear regression model.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,  test_size=0.2, random_state = 4)
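In the cells above, the scaler is fitted on the full dataset before the split, so the test rows influence the scaling statistics. A possible alternative, sketched here on synthetic data (all names ending in `_toy`, `_tr` or `_te` are hypothetical), is to wrap the scaler and the regressor in a pipeline so that scaling is learned from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 3 features, linear target with a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=4)

# fit() scales using training statistics only; score() reuses them on the test rows
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)  # R^2 on the held-out 20%
```

For ordinary least squares the predictions are usually unaffected by this kind of scaling, so the difference is mainly one of good practice; it matters more for models that are sensitive to feature scale.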

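Before training, we check that x and y contain the same number of samples, i.e. that the first dimension of each shape matches. scikit-learn expects x to have the shape (n_samples, n_features), so x must not be transposed.

In [None]:
x.shape

In [None]:
y.shape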

**- TRAINING**

In [None]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)

In [None]:
y_hat = regr.predict(x_test)

**- EVALUATION**

In [None]:
from sklearn import metrics

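The coefficient of determination (R² score) on the test set is roughly 85 percent. Note that r2_score expects the true values first and the predictions second.

In [None]:
metrics.r2_score(y_test, y_hat)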
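The signature of `r2_score` is `r2_score(y_true, y_pred)`, and the score is not symmetric in its two arguments. The sketch below uses four made-up values (`y_true_toy`, `y_pred_toy`) to show that swapping them changes the result, and adds two error metrics expressed in rating points that can be reported alongside R².

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy values, not the model's real predictions
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 3.0, 3.5])

r2 = metrics.r2_score(y_true_toy, y_pred_toy)          # correct order: 0.9
r2_swapped = metrics.r2_score(y_pred_toy, y_true_toy)  # swapped order: 0.8

# Errors in the same unit as the target
mae = metrics.mean_absolute_error(y_true_toy, y_pred_toy)
rmse = np.sqrt(metrics.mean_squared_error(y_true_toy, y_pred_toy))

print(r2, r2_swapped, mae, rmse)
```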