# Data exploration

## What does each column mean?
1. **name**: The player's name
2. **age**: The player's age
3. **nationality**: The player's citizenship
4. **club**: The player's current club
5. **position**: The player's primary position
6. **height**: The player's height (cm)
7. **weight**: The player's weight (kg)
8. **foot**: The player's preferred foot
9. **total_matches**: The total number of matches played
10. **total_goals**: The total number of goals scored
11. **total_minutes**: The total number of minutes played
12. **total_assists**: The total number of assists
13. **total_yellow**: The total number of yellow cards received
14. **total_red**: The total number of red cards received
15. **minutes_per_goal_conceded**: The number of minutes per goal conceded
16. **shot_accuracy**: The rate of shots on target
17. **pass_completion_rate**: The rate of successful passes
18. **cross_completion_rate**: The rate of successful crosses
19. **minutes_per_assist**: The number of minutes per assist
20. **dribble_success_rate**: The rate of successful dribbles
21. **tackles**: The number of tackles made
22. **interception**: The number of interceptions made
23. **market_value**: The player's market value (million euros)
24. **titles**: The total number of titles won
25. **injuries**: The total number of injuries

## Import necessary packages

In [8]:
import pandas as pd
import numpy as np

## Read raw data from csv file

In [9]:
df = pd.read_csv('players.csv')

# preview the first rows
display(df.head())

# size of the data
print("Size of data: ", df.shape)

Unnamed: 0,name,age,nationality,club,position,height,weight,foot,total_matches,total_goals,...,shot_accuracy,pass_completion_rate,cross_completion_rate,minutes_per_assist,dribble_success_rate,tackles,interception,market_value,titles,injuries
0,Ardian Ismajli,28.0,Albania,Empoli,Defender - Centre Back,185.0,76.0,Right,227,5,...,0.0,82.93,0.0,,0.0,8.0,16.0,5.0,0.0,3
1,Berat Djimsiti,31.0,Albania,Atalanta,Defender - Centre Back,190.0,83.0,Right,472,15,...,0.0,88.32,0.0,,100.0,21.0,16.0,10.0,2.0,3
2,Elseid Hysaj,30.0,Albania,Lazio,Defender - Right Back,182.0,75.0,Right,487,6,...,0.0,86.67,0.0,,0.0,1.0,0.0,2.5,1.0,2
3,Ivan Balliu,32.0,Albania,Rayo Vallecano,Defender - Right Back,172.0,63.0,Right,421,3,...,0.0,76.73,0.0,,66.67,10.0,6.0,2.0,2.0,19
4,Kristjan Asllani,22.0,Albania,Inter Milan,Midfielder - Attacking Midfield,175.0,63.0,,107,6,...,50.0,90.57,50.0,,0.0,4.0,0.0,18.0,6.0,1


Size of data:  (2092, 25)


### Number of columns and rows

In [10]:
num_cols = len(df.columns)
num_rows = len(df.index)

print('Number of rows: ', num_rows)
print('Number of columns: ', num_cols)

Number of rows:  2092
Number of columns:  25


## Convert attributes to their appropriate data types

### Current data types

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2092 entries, 0 to 2091
Data columns (total 25 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   name                       2092 non-null   object 
 1   age                        2060 non-null   float64
 2   nationality                2092 non-null   object 
 3   club                       2092 non-null   object 
 4   position                   2092 non-null   object 
 5   height                     1813 non-null   float64
 6   weight                     1610 non-null   float64
 7   foot                       1241 non-null   object 
 8   total_matches              2092 non-null   int64  
 9   total_goals                2092 non-null   int64  
 10  total_minutes              2092 non-null   int64  
 11  total_assists              1293 non-null   float64
 12  total_yellow               2092 non-null   int64  
 13  total_red                  2092 non-null   int64
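
The non-null counts above show that several columns are incomplete (for example, `foot` has only 1241 non-null values out of 2092 rows). The following is a minimal sketch of how those counts could be turned into a per-column missing-value summary. The helper name `missing_summary` and the toy data are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def missing_summary(frame):
    """Return missing-value counts and percentages for columns that have gaps.

    Hypothetical helper, not part of the original notebook.
    """
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data, invented for this example only
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "height": [185.0, None, 182.0, 172.0],
    "foot": ["Right", None, None, "Left"],
})
summary = missing_summary(toy)
print(summary)
```

On the real data this would be called as `missing_summary(df)`.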

### Convert attributes to more suitable data types

In [12]:
df = df.convert_dtypes()  # let pandas pick the best nullable dtype for each column

# check conversion results
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2092 entries, 0 to 2091
Data columns (total 25 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   name                       2092 non-null   string 
 1   age                        2060 non-null   Int64  
 2   nationality                2092 non-null   string 
 3   club                       2092 non-null   string 
 4   position                   2092 non-null   string 
 5   height                     1813 non-null   Int64  
 6   weight                     1610 non-null   Int64  
 7   foot                       1241 non-null   string 
 8   total_matches              2092 non-null   Int64  
 9   total_goals                2092 non-null   Int64  
 10  total_minutes              2092 non-null   Int64  
 11  total_assists              1293 non-null   Int64  
 12  total_yellow               2092 non-null   Int64  
 13  total_red                  2092 non-null   Int64
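
In the output above, columns such as `age` and `height` changed from `float64` to `Int64`. This is consistent with how `convert_dtypes()` behaves: a float column whose non-missing values are all whole numbers is converted to a nullable integer type. A small sketch on invented toy data (not the players data set) illustrates the effect:

```python
import pandas as pd

# Toy data, invented for this example only
toy = pd.DataFrame({
    "age": [28.0, None, 30.0],          # whole numbers stored as float because of the gap
    "market_value": [5.0, 2.5, None],   # genuinely fractional values
})
print(toy.dtypes)

converted = toy.convert_dtypes()
print(converted.dtypes)
```

Here `age` becomes a nullable integer column while `market_value` stays a (nullable) float, and the missing entries are kept as `<NA>` rather than being filled.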

## Locate and fix multicollinearity

In [13]:
# locate multicollinearity

# make a copy of the data with only numeric columns
df_numeric = df.select_dtypes(include=[np.number])

# Calculate the correlation matrix
correlation = df_numeric.corr('pearson')

# Set a threshold
threshold = 0.75

# Keep only correlations at or beyond the threshold (self-correlations are removed below)
high_corr_pairs = correlation.where((correlation >= threshold) | (correlation <= -threshold))
high_corr_pairs = high_corr_pairs.stack().reset_index()
high_corr_pairs.columns = ['Feature1', 'Feature2', 'Correlation']

# Filter out self-correlations
high_corr_pairs = high_corr_pairs[high_corr_pairs['Feature1'] != high_corr_pairs['Feature2']]

# Display the highly correlated pairs
display(high_corr_pairs)

Unnamed: 0,Feature1,Feature2,Correlation
1,age,total_matches,0.848744
2,age,total_minutes,0.847019
5,total_matches,age,0.848744
7,total_matches,total_minutes,0.96961
9,total_minutes,age,0.847019
10,total_minutes,total_matches,0.96961
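
Each pair appears twice in the table above because the correlation matrix is symmetric. One possible way to list every pair only once is to keep just the part of the matrix above the diagonal, which also removes the self-correlations. The sketch below assumes this approach; the function name `high_corr_pairs_unique` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs_unique(frame, threshold=0.75):
    """List each highly correlated column pair once (hypothetical helper)."""
    corr = frame.corr(method="pearson")
    # Boolean mask that is True strictly above the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["Feature1", "Feature2", "Correlation"]
    # Compare absolute values so strong negative correlations are kept too
    strong = pairs["Correlation"].abs() >= threshold
    return pairs[strong].reset_index(drop=True)

# Toy data, invented for this example only: b is exactly 2 * a
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
unique_pairs = high_corr_pairs_unique(toy)
print(unique_pairs)
```

On the real data this would be called as `high_corr_pairs_unique(df_numeric)`.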


In [14]:
# drop total_minutes: it is almost perfectly correlated with total_matches (r ≈ 0.97)
df = df.drop(columns=['total_minutes'])

# recompute the correlation matrix on the reduced data set
df_numeric = df.select_dtypes(include=[np.number])
correlation = df_numeric.corr('pearson')
display(correlation)

Unnamed: 0,age,height,weight,total_matches,total_goals,total_assists,total_yellow,total_red,minutes_per_goal_conceded,shot_accuracy,pass_completion_rate,cross_completion_rate,minutes_per_assist,dribble_success_rate,tackles,interception,market_value,titles,injuries
age,1.0,0.09572,0.236332,0.848744,0.328022,0.532603,0.659722,0.452453,0.024285,-0.033643,-0.039968,-0.007701,0.031996,-0.030155,-0.011441,0.050077,-0.179518,0.346237,0.50219
height,0.09572,1.0,0.745898,-0.007357,-0.091702,-0.100888,0.006332,0.089306,-0.085121,-0.142877,-0.08071,-0.15094,0.181064,-0.086065,-0.131199,0.04106,-0.023216,-0.015793,-0.045444
weight,0.236332,0.745898,1.0,0.138864,-0.004909,0.019113,0.095804,0.130594,-0.043934,-0.159005,-0.089288,-0.16287,0.102833,-0.051826,-0.142538,0.016071,-0.041569,0.053138,0.009171
total_matches,0.848744,-0.007357,0.138864,1.0,0.527884,0.749978,0.728797,0.442565,0.12763,0.076624,0.066722,0.04768,-0.009767,0.054129,0.059305,0.069311,0.116166,0.531926,0.59274
total_goals,0.328022,-0.091702,-0.004909,0.527884,1.0,0.686997,0.245173,0.091903,0.151382,0.253418,-0.128172,0.049864,-0.133901,0.070215,-0.133576,-0.188247,0.274699,0.304677,0.268384
total_assists,0.532603,-0.100888,0.019113,0.749978,0.686997,1.0,0.418094,0.22669,0.145955,0.150076,0.087649,0.068908,-0.114645,0.109758,-0.014176,-0.055661,0.32694,0.596083,0.477744
total_yellow,0.659722,0.006332,0.095804,0.728797,0.245173,0.418094,1.0,0.712908,0.099425,0.04181,0.159112,0.07222,0.095313,0.080322,0.264677,0.288575,0.029493,0.306215,0.602618
total_red,0.452453,0.089306,0.130594,0.442565,0.091903,0.22669,0.712908,1.0,0.021312,-0.01569,0.117904,-0.005302,0.110254,0.008592,0.145843,0.211901,-0.04321,0.163902,0.454
minutes_per_goal_conceded,0.024285,-0.085121,-0.043934,0.12763,0.151382,0.145955,0.099425,0.021312,1.0,0.247635,0.012609,0.061915,0.129297,0.092028,0.218257,0.138203,0.130426,0.056064,0.069113
shot_accuracy,-0.033643,-0.142877,-0.159005,0.076624,0.253418,0.150076,0.04181,-0.01569,0.247635,1.0,0.063997,0.129023,-0.038482,0.174526,0.131942,0.061715,0.153885,0.069378,0.074945
