# Synthesising a Real World Phenomenon

In this project I plan to:

1) Choose a real-world phenomenon that can be measured and for which I can collect at least 100 data points across at least four variables.

2) Investigate the types of variables involved, their likely distributions, and their relationships with each other.

3) Synthesise a data set that matches those properties as closely as possible.

## The Python Libraries to be used

NumPy is the fundamental package for scientific computing with Python. Besides its scientific uses, it can also be used as an efficient multi-dimensional container of generic data.

Pandas is a package providing fast, flexible and expressive data structures designed to make working with data both easy and intuitive.

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

In [867]:
# Import the Python packages I plan on using

import pandas as pd
from pandas import DataFrame
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import random
%matplotlib inline 

## Data for this project

I have decided to look at a list of 100 footballers and how variables such as the number of hours they train and their age affect their performance. To measure their performance I will look at variables such as the number of kilometres covered in a match, tackles made and goals scored. I will also factor in their position on the pitch, as this will further affect the other variables.

To start, I am creating a dataframe of 100 footballers whose ages are randomly selected between 18 and 35.

In [868]:
np.random.seed(100)     # seeding the data to get the same data for this project
df = pd.DataFrame({"Footballer":np.arange(1,101,1), "Age": np.random.randint(18,36,100)})  # selecting 100 footballers and adding random ages between 18 and 35
df.head(10)        

Unnamed: 0,Footballer,Age
0,1,26
1,2,21
2,3,25
3,4,33
4,5,34
5,6,28
6,7,20
7,8,20
8,9,20
9,10,32
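
As a side note, here is a minimal sketch of the same age draw using NumPy's newer `default_rng` generator. This is an alternative to `np.random.seed`, not what this notebook uses, and it will produce different ages from the table above. The names `make_players` and `rng` are my own, chosen for illustration. Passing `endpoint=True` makes the upper bound inclusive, so 35 can be written directly.

```python
import numpy as np
import pandas as pd

def make_players(n=100, seed=100):
    # hypothetical helper: builds the Footballer/Age frame with a local generator
    rng = np.random.default_rng(seed)
    # endpoint=True includes the upper bound, so ages run from 18 to 35
    ages = rng.integers(18, 35, size=n, endpoint=True)
    return pd.DataFrame({"Footballer": np.arange(1, n + 1), "Age": ages})

players = make_players()
print(players["Age"].min(), players["Age"].max())
```

Calling `make_players()` twice with the same seed returns the same frame, which gives the same reproducibility as seeding the global generator.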


In [869]:
allowed_position =(['Goalkeeper', 'Defender', 'Midfielder', "Striker",])  # creating a list of the 4 positions on a football team
my_position = [np.random.choice(allowed_position) for i in range(100)]     # unweighted draw of 100 positions, each position equally likely

In football the usual line-up is 1 goalkeeper, 4 defenders, 4 midfielders and 2 strikers, so I will weight the selection of positions accordingly.
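
To make the weighting concrete, the short sketch below works out the exact share of each position in an 11-player line-up. The dictionary name `lineup` is my own. The weights used in the next cell (0.1, 0.35, 0.35, 0.2) are close to these shares but rounded so that they sum to 1.

```python
from fractions import Fraction

# line-up taken from the text: 1 goalkeeper, 4 defenders, 4 midfielders, 2 strikers
lineup = {"Goalkeeper": 1, "Defender": 4, "Midfielder": 4, "Striker": 2}
total = sum(lineup.values())  # 11 players on the pitch

# exact share of the team for each position
exact = {pos: Fraction(n, total) for pos, n in lineup.items()}
for pos, share in exact.items():
    print(f"{pos}: {share} = {float(share):.3f}")
```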

In [870]:
np.random.seed(100)    # weighting the number of specific positions based on 1 goalkeeper, 4 defenders, 4 midfielders, 2 strikers
position = np.random.choice(allowed_position, 100, p=[0.1, 0.35, 0.35, 0.2]) 
print(position)

['Midfielder' 'Defender' 'Defender' 'Striker' 'Goalkeeper' 'Defender'
 'Midfielder' 'Striker' 'Defender' 'Midfielder' 'Striker' 'Defender'
 'Defender' 'Defender' 'Defender' 'Striker' 'Striker' 'Defender' 'Striker'
 'Defender' 'Defender' 'Striker' 'Striker' 'Defender' 'Defender'
 'Defender' 'Goalkeeper' 'Defender' 'Midfielder' 'Goalkeeper' 'Midfielder'
 'Midfielder' 'Defender' 'Defender' 'Goalkeeper' 'Striker' 'Striker'
 'Goalkeeper' 'Striker' 'Midfielder' 'Midfielder' 'Midfielder'
 'Midfielder' 'Goalkeeper' 'Defender' 'Midfielder' 'Midfielder' 'Defender'
 'Defender' 'Striker' 'Striker' 'Striker' 'Defender' 'Midfielder'
 'Defender' 'Defender' 'Defender' 'Defender' 'Goalkeeper' 'Midfielder'
 'Defender' 'Midfielder' 'Midfielder' 'Defender' 'Striker' 'Striker'
 'Midfielder' 'Defender' 'Defender' 'Defender' 'Defender' 'Defender'
 'Defender' 'Striker' 'Striker' 'Midfielder' 'Midfielder' 'Defender'
 'Goalkeeper' 'Midfielder' 'Midfielder' 'Goalkeeper' 'Midfielder'
 'Striker' 'Defender' 'Defend

Now I would like to expand the dataframe by adding a column of the player positions.

In [871]:
np.random.seed(100)
df["Position"] = np.random.choice(allowed_position, 100, p=[0.1, 0.35, 0.35, 0.2]) 
df.head(10)

Unnamed: 0,Footballer,Age,Position
0,1,26,Midfielder
1,2,21,Defender
2,3,25,Defender
3,4,33,Striker
4,5,34,Goalkeeper
5,6,28,Defender
6,7,20,Midfielder
7,8,20,Striker
8,9,20,Defender
9,10,32,Midfielder
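
A quick way to check how closely a sample of 100 follows the weights is to look at the observed share of each position. The sketch below repeats the weighted draw on its own; the names `weights`, `positions` and `observed` are mine and not part of the notebook.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
allowed_position = ["Goalkeeper", "Defender", "Midfielder", "Striker"]
weights = [0.1, 0.35, 0.35, 0.2]
positions = pd.Series(np.random.choice(allowed_position, 100, p=weights))

# share of each position in the sample, to compare against the weights
observed = positions.value_counts(normalize=True)
print(observed)
```

With only 100 players the observed shares will not match the weights exactly. The group sizes passed to `np.random.uniform` in the later cells (9, 42, 29 and 20) appear to be the counts this draw produced in the notebook's run.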


In [872]:
np.random.seed(100)
hours = np.random.randint(4,11,100) # randomly selecting the hours of training between 4 and 10 hours per week.
print(hours)

[ 4  4  7  4  6 10  8  6  9  6  6 10  6  5  4  4  8  7  8  6  4  7  5  9
 10  6  7  8  8  5  9  9  7  8  8  7  7  7  5  5  9 10  7  4  6  5  5 10
  7  6  9  7  4 10  5 10  4  9 10  8  6  4  4  6  9  6  5  4  9  6 10  5
  9  8  6  4  7  7  7  9 10  4  9  5  8  6  7 10  7  8  6 10  8  7  5  4
  8  7  8  9]


Now I would like to add another column to the dataset for the number of hours each player trains per week, which overall ranges between 4 and 10 hours. However, after some research I discovered that players aged 25 or under are more likely to train for longer than players aged over 25. This is reflected in the data, where players aged 25 and under train between 6 and 10 hours per week whereas players aged 26 and over train between 4 and 8 hours per week.

In [873]:
np.random.seed(100)         # Assigning hours trained to players aged 25 and under
df.loc[df.Age <=25 , "Hours_training"] = np.random.randint(6,10,47)   # 47 values, one per player aged 25 and under; randint excludes the upper bound, so this draws 6 to 9
df1 = pd.DataFrame(df)   # Creating a new df to hold the values for 25 and under hours of training
df.head(10)

Unnamed: 0,Footballer,Age,Position,Hours_training
0,1,26,Midfielder,
1,2,21,Defender,6.0
2,3,25,Defender,6.0
3,4,33,Striker,
4,5,34,Goalkeeper,
5,6,28,Defender,
6,7,20,Midfielder,9.0
7,8,20,Striker,9.0
8,9,20,Defender,9.0
9,10,32,Midfielder,


In [874]:
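np.random.seed(100)         # Assigning hours trained to players aged 26 and over
df1.loc[df1.Age >=26 , "Hours_training"] = np.random.randint(4,8,53)   # 53 values, one per player aged 26 and over; randint excludes the upper bound, so this draws 4 to 7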
df1.head(10)

Unnamed: 0,Footballer,Age,Position,Hours_training
0,1,26,Midfielder,4.0
1,2,21,Defender,6.0
2,3,25,Defender,6.0
3,4,33,Striker,4.0
4,5,34,Goalkeeper,7.0
5,6,28,Defender,7.0
6,7,20,Midfielder,9.0
7,8,20,Striker,9.0
8,9,20,Defender,9.0
9,10,32,Midfielder,7.0


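We can now see that players aged 25 and under have been given between 6 and 9 hours per week and players aged 26 and over between 4 and 7. Because np.random.randint excludes its upper bound, upper bounds of 11 and 9 would be needed to reach the intended 6 to 10 and 4 to 8 hours.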
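
Here is a minimal sketch of drawing age-dependent training hours with inclusive ranges and without hard-coding the group sizes. It uses a local `default_rng` generator, which is my assumption and not what the notebook uses, so the values differ from the tables above. The names `young` and `hours` are also my own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)  # assumption: local generator instead of np.random.seed
ages = pd.Series(rng.integers(18, 36, size=100), name="Age")

young = ages <= 25  # True for players aged 25 and under
hours = pd.Series(0, index=ages.index, name="Hours_training")

# upper bounds are exclusive, so 11 and 9 give 6 to 10 and 4 to 8 hours
hours.loc[young] = rng.integers(6, 11, size=young.sum())
hours.loc[~young] = rng.integers(4, 9, size=(~young).sum())

print(hours.groupby(young).agg(["min", "max"]))
```

Counting the rows with `young.sum()` means the draw still works if the seed or the number of players changes.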

In [875]:
df = df1      # resetting the name of the dataframe to df for the next section of data creation.

I would now like to have a look at the number of tackles each player makes in a game. From my research I have found:

1) Goalkeepers, as expected, rarely make tackles, as saves are the name of their game, but they do occasionally come off their line to make a quick slide outside the box: 0.2 per game.

2) Tackling is a defender's bread and butter; it's how they earn their crust, so it's no surprise that they tend to make more tackles than any other position, at 5.5 per game.

3) Midfielders are the engine of a team and, while they cover more ground than anybody, they don't put in as many tackles as their defensive counterparts: 4.2 per game.

4) Strikers are the poster boys of any team and they leave the dirty business of tackling to their team mates: 1.2 per game.

When adding this column to my data set I will weight the data accordingly, knowing that not all players in each position will put in the same number of tackles. Let's assume a difference of 1 either side of the averages mentioned for defenders and midfielders, 0.7 for strikers and 0.1 for goalkeepers.

In [876]:
df["Tackles_made"] = 1.0              # creating a Tackles_made column with a float placeholder, as the tackle values assigned below are floats

In [877]:
np.random.seed(100)         # Assigning weight to goalkeepers tackles
df.loc[df.Position == "Goalkeeper", "Tackles_made"] = np.random.uniform (0.1,0.3,9)
df.head(10)

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made
0,1,26,Midfielder,4.0,1.0
1,2,21,Defender,6.0,1.0
2,3,25,Defender,6.0,1.0
3,4,33,Striker,4.0,1.0
4,5,34,Goalkeeper,7.0,0.208681
5,6,28,Defender,7.0,1.0
6,7,20,Midfielder,9.0,1.0
7,8,20,Striker,9.0,1.0
8,9,20,Defender,9.0,1.0
9,10,32,Midfielder,7.0,1.0


We can see above that the Goalkeeper data for Tackles_made has been generated between the given values of 0.1 and 0.3.

In [878]:
np.random.seed(100) 
df1 = pd.DataFrame(df)   # Creating a new df to hold the Goalkeeper values
df1.loc[df1.Position == "Defender", "Tackles_made"] = np.random.uniform (4.5,6.5,42) # Assigning weight to defenders tackles
df1.head(10)

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made
0,1,26,Midfielder,4.0,1.0
1,2,21,Defender,6.0,5.58681
2,3,25,Defender,6.0,5.056739
3,4,33,Striker,4.0,1.0
4,5,34,Goalkeeper,7.0,0.208681
5,6,28,Defender,7.0,5.349035
6,7,20,Midfielder,9.0,1.0
7,8,20,Striker,9.0,1.0
8,9,20,Defender,9.0,6.189552
9,10,32,Midfielder,7.0,1.0


We can see the Defender data for Tackles_made has now been generated between the given values of 4.5 and 6.5.

In [879]:
np.random.seed(100) 
df2 = pd.DataFrame(df1)  # Creating a new df to hold the Goalkeeper & Defender values
df2.loc[df2.Position == "Midfielder", "Tackles_made"] = np.random.uniform(3.2,5.2,29) # Assigning weight to midfielders tackles
df2.head(10)

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made
0,1,26,Midfielder,4.0,4.28681
1,2,21,Defender,6.0,5.58681
2,3,25,Defender,6.0,5.056739
3,4,33,Striker,4.0,1.0
4,5,34,Goalkeeper,7.0,0.208681
5,6,28,Defender,7.0,5.349035
6,7,20,Midfielder,9.0,3.756739
7,8,20,Striker,9.0,1.0
8,9,20,Defender,9.0,6.189552
9,10,32,Midfielder,7.0,4.049035


We can see the Midfielder data for Tackles_made has now been generated between the given values of 3.2 and 5.2.

In [880]:
np.random.seed(100)
df3 = pd.DataFrame(df2)  # Creating a new df to hold the Goalkeeper, Defender & Midfielder values
df3.loc[df3.Position == "Striker", "Tackles_made"] = np.random.uniform(0.5,1.9,20) # Assigning weight to strikers tackles
df3.head(10)

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made
0,1,26,Midfielder,4.0,4.28681
1,2,21,Defender,6.0,5.58681
2,3,25,Defender,6.0,5.056739
3,4,33,Striker,4.0,1.260767
4,5,34,Goalkeeper,7.0,0.208681
5,6,28,Defender,7.0,5.349035
6,7,20,Midfielder,9.0,3.756739
7,8,20,Striker,9.0,0.889717
8,9,20,Defender,9.0,6.189552
9,10,32,Midfielder,7.0,4.049035


We can see the Striker data for Tackles_made has now been generated between the given values of 0.5 and 1.9.
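
The four per-position steps above could also be written as one loop. The sketch below is a rough alternative, not the method used in this notebook: it does not reseed before each position, so the values differ from the tables above. The names `df_sketch` and `tackle_ranges` are my own.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
positions = np.random.choice(
    ["Goalkeeper", "Defender", "Midfielder", "Striker"], 100, p=[0.1, 0.35, 0.35, 0.2]
)
df_sketch = pd.DataFrame({"Position": positions})

# (low, high) tackles per game for each position, taken from the text above
tackle_ranges = {
    "Goalkeeper": (0.1, 0.3),
    "Defender": (4.5, 6.5),
    "Midfielder": (3.2, 5.2),
    "Striker": (0.5, 1.9),
}

df_sketch["Tackles_made"] = np.nan  # float placeholder column
for position, (low, high) in tackle_ranges.items():
    mask = df_sketch["Position"] == position
    # mask.sum() counts the players in this position, replacing the hard-coded sizes
    df_sketch.loc[mask, "Tackles_made"] = np.random.uniform(low, high, mask.sum())

print(df_sketch.groupby("Position")["Tackles_made"].agg(["min", "max"]))
```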

The next variable I would like to look at is the number of goals scored per game by each player. As with tackling, the research here is not surprising.

1) Goalkeepers are in the business of preventing goals, not scoring them: 0 per game.

2) A defender's primary function is to prevent the other team's players from scoring, but they do get on the score sheet from time to time, particularly the big centre-backs from set pieces: 0.15 per game.

3) Midfielders, being the all-rounders that they are, have a habit of getting forward to assist their strikers and this pays off for them: 0.55 per game.

4) Strikers, as expected, top the list here as it is their job to put the ball in the net: 0.8 per game.

When adding this column to my data set I will weight the data accordingly, knowing that not all players in each position will score exactly the same number of goals. Let's assume a difference of 0.05 either side of the averages mentioned for defenders and midfielders, and 0.1 for strikers.
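
The averages and spreads above translate into lower and upper limits for `np.random.uniform`. The small sketch below does that arithmetic; the helper `bounds` and the dictionary names are my own, for illustration only.

```python
# (average goals per game, assumed spread either side), taken from the text above
goal_stats = {
    "Goalkeeper": (0.0, 0.0),
    "Defender": (0.15, 0.05),
    "Midfielder": (0.55, 0.05),
    "Striker": (0.8, 0.1),
}

def bounds(mean, spread):
    # hypothetical helper: lower and upper limits, rounded to avoid float noise
    return round(mean - spread, 2), round(mean + spread, 2)

goal_ranges = {pos: bounds(mean, spread) for pos, (mean, spread) in goal_stats.items()}
print(goal_ranges)
```

These are the limits passed to `np.random.uniform` in the cells that follow: 0.1 to 0.2 for defenders, 0.5 to 0.6 for midfielders and 0.7 to 0.9 for strikers.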

In [881]:
df3["Goals_scored"] = 1                # creating a Goals_scored column

In [882]:
np.random.seed(100)             # Assigning weight to goalkeepers goals scored. They don't score so the value is 0
df3.loc[df3.Position == "Goalkeeper", "Goals_scored"] = 0
df3.head(10)     # displaying df3, the frame that was just modified

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored
0,1,26,Midfielder,4.0,4.28681,1
1,2,21,Defender,6.0,5.58681,1
2,3,25,Defender,6.0,5.056739,1
3,4,33,Striker,4.0,1.260767,1
4,5,34,Goalkeeper,7.0,0.208681,0
5,6,28,Defender,7.0,5.349035,1
6,7,20,Midfielder,9.0,3.756739,1
7,8,20,Striker,9.0,0.889717,1
8,9,20,Defender,9.0,6.189552,1
9,10,32,Midfielder,7.0,4.049035,1


We can see the Goalkeeper data for Goals_scored has now been set to 0.

In [883]:
np.random.seed(100)             # Assigning weight to defenders goals scored. 
df4 = pd.DataFrame(df3)         # Creating a new df to hold the Goalkeeper values
df4.loc[df4.Position == "Defender", "Goals_scored"] = np.random.uniform(0.1,0.2,42)
df4.head(10)     # displaying df4, the frame that was just modified

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored
0,1,26,Midfielder,4.0,4.28681,1.0
1,2,21,Defender,6.0,5.58681,0.15434
2,3,25,Defender,6.0,5.056739,0.127837
3,4,33,Striker,4.0,1.260767,1.0
4,5,34,Goalkeeper,7.0,0.208681,0.0
5,6,28,Defender,7.0,5.349035,0.142452
6,7,20,Midfielder,9.0,3.756739,1.0
7,8,20,Striker,9.0,0.889717,1.0
8,9,20,Defender,9.0,6.189552,0.184478
9,10,32,Midfielder,7.0,4.049035,1.0


We can see the Defender data for Goals_scored has now been generated between the values 0.10 and 0.20.

In [884]:
np.random.seed(100)             # Assigning weight to Midfielders goals scored. 
df5 = pd.DataFrame(df4)         # Creating a new df to hold the Goalkeeper & Defender values
df5.loc[df5.Position == "Midfielder", "Goals_scored"] = np.random.uniform(.5,.6,29)
df5.head(10)     # displaying df5, the frame that was just modified

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored
0,1,26,Midfielder,4.0,4.28681,0.55434
1,2,21,Defender,6.0,5.58681,0.15434
2,3,25,Defender,6.0,5.056739,0.127837
3,4,33,Striker,4.0,1.260767,1.0
4,5,34,Goalkeeper,7.0,0.208681,0.0
5,6,28,Defender,7.0,5.349035,0.142452
6,7,20,Midfielder,9.0,3.756739,0.527837
7,8,20,Striker,9.0,0.889717,1.0
8,9,20,Defender,9.0,6.189552,0.184478
9,10,32,Midfielder,7.0,4.049035,0.542452


We can see the Midfielder data for Goals_scored has now been generated between the values 0.5 and 0.6.

In [885]:
np.random.seed(100)             # Assigning weight to Strikers goals scored. 
df6 = pd.DataFrame(df5)         # Creating a new df to hold the Goalkeeper, Defender and Midfielder values
df6.loc[df6.Position == "Striker", "Goals_scored"] = np.random.uniform(0.7,0.9,20)
df6.head(10)     # displaying df6, the frame that was just modified

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored
0,1,26,Midfielder,4.0,4.28681,0.55434
1,2,21,Defender,6.0,5.58681,0.15434
2,3,25,Defender,6.0,5.056739,0.127837
3,4,33,Striker,4.0,1.260767,0.808681
4,5,34,Goalkeeper,7.0,0.208681,0.0
5,6,28,Defender,7.0,5.349035,0.142452
6,7,20,Midfielder,9.0,3.756739,0.527837
7,8,20,Striker,9.0,0.889717,0.755674
8,9,20,Defender,9.0,6.189552,0.184478
9,10,32,Midfielder,7.0,4.049035,0.542452


We can see the Striker data for Goals_scored has now been generated between the values 0.7 and 0.9.

The last variable I would like to look at is the number of kilometres covered by these players in a match. After researching online I have found the following:

1) Goalkeepers, as expected, cover the least ground: 5.5km on average per game.

2) Defenders fall into two groups, full-backs and centre-backs. The ground covered by each is different, as full-backs tend to cover more ground by attacking up the wings whereas centre-backs tend to sit back and mind the house. For this project we will take an average of 10km per game.

3) Midfielders are the real workhorses in the game and cover an average of 11.5km per game.

4) Strikers tend to wait up front and conserve their energy until they get the ball and so they average about 9km per game.

When adding this column to my data set I will weight the data accordingly, knowing that not all players in each position will cover exactly the same ground. Let's assume a difference of 1 kilometre either side of the averages mentioned above.

In [886]:
df6["Kms_covered"] = 1                # creating a Kms_covered column

In [887]:
np.random.seed(100)             # Assigning weight to goalkeepers kms covered.
df6.loc[df6.Position == "Goalkeeper", "Kms_covered"] = np.random.uniform (4.5,6.5,9)
df6.head(10)     # displaying df6, the frame that was just modified

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored,Kms_covered
0,1,26,Midfielder,4.0,4.28681,0.55434,1.0
1,2,21,Defender,6.0,5.58681,0.15434,1.0
2,3,25,Defender,6.0,5.056739,0.127837,1.0
3,4,33,Striker,4.0,1.260767,0.808681,1.0
4,5,34,Goalkeeper,7.0,0.208681,0.0,5.58681
5,6,28,Defender,7.0,5.349035,0.142452,1.0
6,7,20,Midfielder,9.0,3.756739,0.527837,1.0
7,8,20,Striker,9.0,0.889717,0.755674,1.0
8,9,20,Defender,9.0,6.189552,0.184478,1.0
9,10,32,Midfielder,7.0,4.049035,0.542452,1.0


We can see the Goalkeeper data for Kms_covered has now been generated between the values 4.5 and 6.5.

In [889]:
np.random.seed(100)             # Assigning weight to defenders kms covered. 
df7 = pd.DataFrame(df6)         # Creating a new df to hold the Goalkeeper values
df7.loc[df7.Position == "Defender", "Kms_covered"] = np.random.uniform(9,11,42)
df7.head(10)     # displaying df7, the frame that was just modified

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored,Kms_covered
0,1,26,Midfielder,4.0,4.28681,0.55434,1.0
1,2,21,Defender,6.0,5.58681,0.15434,10.08681
2,3,25,Defender,6.0,5.056739,0.127837,9.556739
3,4,33,Striker,4.0,1.260767,0.808681,1.0
4,5,34,Goalkeeper,7.0,0.208681,0.0,5.58681
5,6,28,Defender,7.0,5.349035,0.142452,9.849035
6,7,20,Midfielder,9.0,3.756739,0.527837,1.0
7,8,20,Striker,9.0,0.889717,0.755674,1.0
8,9,20,Defender,9.0,6.189552,0.184478,10.689552
9,10,32,Midfielder,7.0,4.049035,0.542452,1.0


We can see the Defender data for Kms_covered has now been generated between the values 9 and 11.

In [890]:
np.random.seed(100)             # Assigning weight to midfielders kms covered. 
df8 = pd.DataFrame(df7)         # Creating a new df to hold the Goalkeeper & Defender values
df8.loc[df8.Position == "Midfielder", "Kms_covered"] = np.random.uniform(10.5, 12.5,29)
df8.head(10)     # displaying df8, the frame that was just modified

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored,Kms_covered
0,1,26,Midfielder,4.0,4.28681,0.55434,11.58681
1,2,21,Defender,6.0,5.58681,0.15434,10.08681
2,3,25,Defender,6.0,5.056739,0.127837,9.556739
3,4,33,Striker,4.0,1.260767,0.808681,1.0
4,5,34,Goalkeeper,7.0,0.208681,0.0,5.58681
5,6,28,Defender,7.0,5.349035,0.142452,9.849035
6,7,20,Midfielder,9.0,3.756739,0.527837,11.056739
7,8,20,Striker,9.0,0.889717,0.755674,1.0
8,9,20,Defender,9.0,6.189552,0.184478,10.689552
9,10,32,Midfielder,7.0,4.049035,0.542452,11.349035


We can see the Midfielder data for Kms_covered has now been generated between the values 10.5 and 12.5.

In [891]:
np.random.seed(100)             # Assigning weight to strikers kms covered. 
df9 = pd.DataFrame(df8)         # Creating a new df to hold the Goalkeeper, Defender and midfielder values
df9.loc[df9.Position == "Striker", "Kms_covered"] = np.random.uniform(8,10,20)
df9.head(10)     # displaying df9, the frame that was just modified

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored,Kms_covered
0,1,26,Midfielder,4.0,4.28681,0.55434,11.58681
1,2,21,Defender,6.0,5.58681,0.15434,10.08681
2,3,25,Defender,6.0,5.056739,0.127837,9.556739
3,4,33,Striker,4.0,1.260767,0.808681,9.08681
4,5,34,Goalkeeper,7.0,0.208681,0.0,5.58681
5,6,28,Defender,7.0,5.349035,0.142452,9.849035
6,7,20,Midfielder,9.0,3.756739,0.527837,11.056739
7,8,20,Striker,9.0,0.889717,0.755674,8.556739
8,9,20,Defender,9.0,6.189552,0.184478,10.689552
9,10,32,Midfielder,7.0,4.049035,0.542452,11.349035


We can see the Striker data for Kms_covered has now been generated between the values 8 and 10.
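
Because every position uses the same 1 km spread, the distance column could also be drawn in a single vectorised step by mapping each position to its lower limit. This is only a sketch under my own assumptions: it uses a local `default_rng` generator and a tiny five-row frame (`km_sketch`), neither of which is part of the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)  # assumption: a local generator
km_sketch = pd.DataFrame(
    {"Position": ["Goalkeeper", "Defender", "Midfielder", "Striker", "Defender"]}
)

# lower limit per position: the averages from the text minus 1 km
km_low = {"Goalkeeper": 4.5, "Defender": 9.0, "Midfielder": 10.5, "Striker": 8.0}
low = km_sketch["Position"].map(km_low).to_numpy()
high = low + 2.0  # each range is 2 km wide (average plus or minus 1 km)

# with array bounds, uniform draws one value per row
km_sketch["Kms_covered"] = rng.uniform(low, high)
print(km_sketch)
```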

In [892]:
df = df9.round({"Tackles_made":2, "Goals_scored":2, "Kms_covered":2})    # rounding the decimals to 2 places and storing the result back in df
df.head(10)

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored,Kms_covered
0,1,26,Midfielder,4.0,4.29,0.55,11.59
1,2,21,Defender,6.0,5.59,0.15,10.09
2,3,25,Defender,6.0,5.06,0.13,9.56
3,4,33,Striker,4.0,1.26,0.81,9.09
4,5,34,Goalkeeper,7.0,0.21,0.0,5.59
5,6,28,Defender,7.0,5.35,0.14,9.85
6,7,20,Midfielder,9.0,3.76,0.53,11.06
7,8,20,Striker,9.0,0.89,0.76,8.56
8,9,20,Defender,9.0,6.19,0.18,10.69
9,10,32,Midfielder,7.0,4.05,0.54,11.35


One last piece of research has indicated that the number of kilometres covered, regardless of the player's position, is directly linked to the number of hours trained. Let's assume for this dataset that players who train for 8 hours or more per week are likely to cover 0.5 kilometres more per game and those who train for 7 hours or less per week are likely to cover 0.5 kilometres less per game.

In [893]:
np.random.seed(100)       # Adjusting the dataset to add .5 kilometers to players training 8 hours or more.  
df["Kms_adjusted"] = df.loc[df.Hours_training >= 8, "Kms_covered"] + .5
df.head(10)

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored,Kms_covered,Kms_adjusted
0,1,26,Midfielder,4.0,4.29,0.55,11.59,
1,2,21,Defender,6.0,5.59,0.15,10.09,
2,3,25,Defender,6.0,5.06,0.13,9.56,
3,4,33,Striker,4.0,1.26,0.81,9.09,
4,5,34,Goalkeeper,7.0,0.21,0.0,5.59,
5,6,28,Defender,7.0,5.35,0.14,9.85,
6,7,20,Midfielder,9.0,3.76,0.53,11.06,11.56
7,8,20,Striker,9.0,0.89,0.76,8.56,9.06
8,9,20,Defender,9.0,6.19,0.18,10.69,11.19
9,10,32,Midfielder,7.0,4.05,0.54,11.35,


In [894]:
np.random.seed(100)       # Adjusting the dataset to remove .5 kilometers from players training 7 hours or less. 
df1 = pd.DataFrame(df)
df1.loc[df1.Hours_training <= 7, "Kms_adjusted"] = df.Kms_covered - .5
df1.head(10)

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored,Kms_covered,Kms_adjusted
0,1,26,Midfielder,4.0,4.29,0.55,11.59,11.09
1,2,21,Defender,6.0,5.59,0.15,10.09,9.59
2,3,25,Defender,6.0,5.06,0.13,9.56,9.06
3,4,33,Striker,4.0,1.26,0.81,9.09,8.59
4,5,34,Goalkeeper,7.0,0.21,0.0,5.59,5.09
5,6,28,Defender,7.0,5.35,0.14,9.85,9.35
6,7,20,Midfielder,9.0,3.76,0.53,11.06,11.56
7,8,20,Striker,9.0,0.89,0.76,8.56,9.06
8,9,20,Defender,9.0,6.19,0.18,10.69,11.19
9,10,32,Midfielder,7.0,4.05,0.54,11.35,10.85


We can now see that the data in the column Kms_adjusted has had 0.5 kilometres added or subtracted accordingly.
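
The two adjustment steps can also be expressed in one line with `np.where`. The sketch below applies it to the ten rows shown in the tables above; the frame name `sample` and the variable `shift` are my own. Since the hours are whole numbers, "7 hours or less" is the same as "not 8 hours or more".

```python
import numpy as np
import pandas as pd

# first ten rows of Hours_training and Kms_covered from the tables above
sample = pd.DataFrame({
    "Hours_training": [4, 6, 6, 4, 7, 7, 9, 9, 9, 7],
    "Kms_covered": [11.59, 10.09, 9.56, 9.09, 5.59, 9.85, 11.06, 8.56, 10.69, 11.35],
})

# +0.5 km for 8 hours or more, -0.5 km otherwise
shift = np.where(sample["Hours_training"] >= 8, 0.5, -0.5)
sample["Kms_adjusted"] = (sample["Kms_covered"] + shift).round(2)
print(sample)
```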

In [895]:
df = df1.drop(columns=["Kms_covered"])    # dropping the Kms_covered column as it's no longer needed; drop returns a new frame, so the result is assigned back to df
df.head(10)

Unnamed: 0,Footballer,Age,Position,Hours_training,Tackles_made,Goals_scored,Kms_adjusted
0,1,26,Midfielder,4.0,4.29,0.55,11.09
1,2,21,Defender,6.0,5.59,0.15,9.59
2,3,25,Defender,6.0,5.06,0.13,9.06
3,4,33,Striker,4.0,1.26,0.81,8.59
4,5,34,Goalkeeper,7.0,0.21,0.00,5.09
5,6,28,Defender,7.0,5.35,0.14,9.35
6,7,20,Midfielder,9.0,3.76,0.53,11.56
7,8,20,Striker,9.0,0.89,0.76,9.06
8,9,20,Defender,9.0,6.19,0.18,11.19
9,10,32,Midfielder,7.0,4.05,0.54,10.85


The above is my completed data frame. I will now look at some of the likely distributions of these variables.
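
Before moving on, a quick sanity check is to average each performance variable by position and compare the result with the figures from my research. The sketch below does this for the ten rows displayed above only, so the means are rough; the names `preview` and `summary` are my own.

```python
import pandas as pd

# first ten rows of the completed data frame shown above
preview = pd.DataFrame({
    "Position": ["Midfielder", "Defender", "Defender", "Striker", "Goalkeeper",
                 "Defender", "Midfielder", "Striker", "Defender", "Midfielder"],
    "Tackles_made": [4.29, 5.59, 5.06, 1.26, 0.21, 5.35, 3.76, 0.89, 6.19, 4.05],
    "Goals_scored": [0.55, 0.15, 0.13, 0.81, 0.00, 0.14, 0.53, 0.76, 0.18, 0.54],
    "Kms_adjusted": [11.09, 9.59, 9.06, 8.59, 5.09, 9.35, 11.56, 9.06, 11.19, 10.85],
})

# mean of each performance variable per position
summary = preview.groupby("Position").mean().round(2)
print(summary)
```

Running the same `groupby` on the full 100-row frame would give a fairer comparison with the per-position averages used to build the data.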