# Synthesising a Real World Phenomenon

In this project I plan to:

1) Choose a real world phenomenon that can be measured and for which I can collect at least 100 data points across at least four variables.

2) Investigate the types of variables involved, their likely distributions, and their relationships with each other.

3) Synthesise a data set that matches those properties as closely as possible.

## The Python Libraries to be used

NumPy is the fundamental package for scientific computing with Python. Besides its scientific uses, it can also be used as an efficient multi-dimensional container of generic data.

Pandas is a package providing fast, flexible and expressive data structures designed to make working with data both easy and intuitive.

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

In [55]:
# Import the Python packages I plan on using

import pandas as pd
from pandas import DataFrame
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import random
%matplotlib inline 

## Data for this project

I have decided to look at a list of 100 footballers and at how variables such as the number of hours they train and their age affect their performance. To measure their performance I will look at variables such as the number of kilometres covered in a match, tackles made and goals scored. I will also factor in their position on the pitch, as this will further affect the other variables.

To start, I create a dataframe of 100 footballers, with each player's age randomly selected between 18 and 35.

In [56]:
np.random.seed(100)     # seeding the data to get the same data for this project
df = pd.DataFrame({"Footballer":np.arange(1,101,1), "Age": np.random.randint(18,36,100)}).set_index("Footballer")
df.head(10)

Footballer,Age
1,26
2,21
3,25
4,33
5,34
6,28
7,20
8,20
9,20
10,32


In [57]:
allowed_position = ['Goalkeeper', 'Defender', 'Midfielder', 'Striker']
# First attempt: every position equally likely (replaced by the weighted draw below)
my_position = [np.random.choice(allowed_position) for i in range(100)]

In football the line-up is 1 goalkeeper and, usually, 4 defenders, 4 midfielders and 2 strikers, so I will weight the data for the positions accordingly.
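As a rough check on the weights used in the next cell, the shares implied by this line-up can be worked out directly. This is only a minimal sketch: the names `formation` and `shares` are my own, chosen for illustration.

```python
# Players per position in an assumed 4-4-2 line-up (11 players in total)
formation = {"Goalkeeper": 1, "Defender": 4, "Midfielder": 4, "Striker": 2}

total = sum(formation.values())
# Share of the team in each position, rounded to two decimal places
shares = {position: round(count / total, 2) for position, count in formation.items()}
print(shares)
```

These shares are close to the rounded probabilities 0.1, 0.35, 0.35 and 0.2 passed to `np.random.choice`, which are easier to read and add up to 1.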

In [134]:
np.random.seed(100)    
position = np.random.choice(allowed_position, 100, p=[0.1, 0.35, 0.35, 0.2]) 
print(position)

['Midfielder' 'Defender' 'Defender' 'Striker' 'Goalkeeper' 'Defender'
 'Midfielder' 'Striker' 'Defender' 'Midfielder' 'Striker' 'Defender'
 'Defender' 'Defender' 'Defender' 'Striker' 'Striker' 'Defender' 'Striker'
 'Defender' 'Defender' 'Striker' 'Striker' 'Defender' 'Defender'
 'Defender' 'Goalkeeper' 'Defender' 'Midfielder' 'Goalkeeper' 'Midfielder'
 'Midfielder' 'Defender' 'Defender' 'Goalkeeper' 'Striker' 'Striker'
 'Goalkeeper' 'Striker' 'Midfielder' 'Midfielder' 'Midfielder'
 'Midfielder' 'Goalkeeper' 'Defender' 'Midfielder' 'Midfielder' 'Defender'
 'Defender' 'Striker' 'Striker' 'Striker' 'Defender' 'Midfielder'
 'Defender' 'Defender' 'Defender' 'Defender' 'Goalkeeper' 'Midfielder'
 'Defender' 'Midfielder' 'Midfielder' 'Defender' 'Striker' 'Striker'
 'Midfielder' 'Defender' 'Defender' 'Defender' 'Defender' 'Defender'
 'Defender' 'Striker' 'Striker' 'Midfielder' 'Midfielder' 'Defender'
 'Goalkeeper' 'Midfielder' 'Midfielder' 'Goalkeeper' 'Midfielder'
 'Striker' 'Defender' 'Defend

Now I would like to expand the dataframe by adding a column of the weighted positions.

In [59]:
np.random.seed(100)
df["Position"] = np.random.choice(allowed_position, 100, p=[0.1, 0.35, 0.35, 0.2]) 
df.head(10)

Footballer,Age,Position
1,26,Midfielder
2,21,Defender
3,25,Defender
4,33,Striker
5,34,Goalkeeper
6,28,Defender
7,20,Midfielder
8,20,Striker
9,20,Defender
10,32,Midfielder


Now I would like to add another column to the dataset, in which the number of hours each player trains is randomly selected between 4 and 10 hours per week.

In [60]:
np.random.seed(100)
hours = np.random.randint(4,11,100)
print(hours)

[ 4  4  7  4  6 10  8  6  9  6  6 10  6  5  4  4  8  7  8  6  4  7  5  9
 10  6  7  8  8  5  9  9  7  8  8  7  7  7  5  5  9 10  7  4  6  5  5 10
  7  6  9  7  4 10  5 10  4  9 10  8  6  4  4  6  9  6  5  4  9  6 10  5
  9  8  6  4  7  7  7  9 10  4  9  5  8  6  7 10  7  8  6 10  8  7  5  4
  8  7  8  9]


In [124]:
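# The seed is not reset in this cell, so these values differ from the seeded preview above
df["Hours_training"] = np.random.randint(4,11,100)   # upper bound 11 is exclusive, giving 4 to 10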
df.head(10)

Footballer,Age,Position,Hours_training
1,26,Midfielder,10
2,21,Defender,9
3,25,Defender,4
4,33,Striker,4
5,34,Goalkeeper,7
6,28,Defender,4
7,20,Midfielder,9
8,20,Striker,5
9,20,Defender,8
10,32,Midfielder,7


I would now like to have a look at the number of tackles each player makes in a game. From my research I have found:

1) Goalkeepers, as expected, rarely make tackles, as saves are the name of their game, but they do occasionally come off their line to make a quick slide outside the box: 0.2 per game.

2) Tackling is a defender's bread and butter; it's how they earn their crust, so it is no surprise that they make more tackles than any other position, at 5.5 per game.

3) Midfielders are the engine of a team and, while they cover more ground than anybody, they don't put in as many tackles as their defensive counterparts: 4.2 per game.

4) Strikers are the poster boys of any team and they leave the dirty business of tackling to their teammates: 1.2 per game.

When adding this column to my data set I will weight the data accordingly, knowing that not all players in each position will put in the same number of tackles. Let's assume a difference of up to 1 tackle either side of the averages mentioned. (The code below narrows this to 0.1 either side for goalkeepers and 0.7 either side for strikers.)

The values generated for the Tackles_made column need to be tied to the four positions in the Position column, because the two are connected: in general a defender will make more tackles than a striker in a match.
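One possible way to tie the two columns together is to fill `Tackles_made` one position at a time, using a boolean mask with `.loc`. The sketch below is an illustration under stated assumptions, not the method used in the rest of this notebook: it builds a stand-in dataframe (`demo`) instead of reusing `df`, and the dictionary `tackle_ranges` is a hypothetical name holding the same ranges as the cells that follow.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)  # local generator, so the notebook's global seed is untouched

positions = ["Goalkeeper", "Defender", "Midfielder", "Striker"]
# Stand-in for the project dataframe: 100 footballers with weighted positions
demo = pd.DataFrame(
    {"Position": rng.choice(positions, 100, p=[0.1, 0.35, 0.35, 0.2])},
    index=pd.RangeIndex(1, 101, name="Footballer"),
)

# Assumed (low, high) tackles per game for each position
tackle_ranges = {
    "Goalkeeper": (0.1, 0.3),
    "Defender": (4.5, 6.5),
    "Midfielder": (3.2, 5.2),
    "Striker": (0.5, 1.9),
}

demo["Tackles_made"] = np.nan
for position, (low, high) in tackle_ranges.items():
    mask = demo["Position"] == position
    # .loc writes into the dataframe itself rather than into a copy of a slice
    demo.loc[mask, "Tackles_made"] = rng.uniform(low, high, mask.sum())

# Smallest and largest value generated for each position
print(demo.groupby("Position")["Tackles_made"].agg(["min", "max"]))
```

Because the values are written through the mask, each one lands on a row with the matching position and the Footballer index is preserved.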

In [186]:
gkr = df[df.Position == "Goalkeeper"]   # Creating a dataframe for goalkeeper
print(gkr)

            Age    Position  Hours_training  Tackles_made
Footballer                                               
5            34  Goalkeeper               7      8.980787
27           25  Goalkeeper               7           NaN
30           19  Goalkeeper               7           NaN
35           27  Goalkeeper               8           NaN
38           35  Goalkeeper               5           NaN
44           30  Goalkeeper               4           NaN
59           28  Goalkeeper               8           NaN
79           24  Goalkeeper               7           NaN
82           20  Goalkeeper               6           NaN


In [188]:
gkr = pd.DataFrame({"Position":np.random.uniform(0.1,0.3,9)})

In [193]:
np.random.seed(100)                   # Assigning weight to goalkeepers tackles 
gkr = pd.DataFrame({"Position":np.random.uniform(0.1,0.3,9)})
print(gkr)

   Position
0  0.208681
1  0.155674
2  0.184904
3  0.268955
4  0.100944
5  0.124314
6  0.234150
7  0.265171
8  0.127341


            Age    Position  Hours_training  Tackles_made
Footballer                                               
5            34  Goalkeeper               7      0.124314
27           25  Goalkeeper               7           NaN
30           19  Goalkeeper               7           NaN
35           27  Goalkeeper               8           NaN
38           35  Goalkeeper               5           NaN
44           30  Goalkeeper               4           NaN
59           28  Goalkeeper               8           NaN
79           24  Goalkeeper               7           NaN
82           20  Goalkeeper               6           NaN


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [149]:
dfr = df[df.Position=="Defender"]     # Creating a dataframe for defender

In [152]:
np.random.seed(100)                   # Assigning weight to defenders tackles 
dfr = pd.DataFrame({"Position":np.random.uniform(4.5,6.5,42)})

In [155]:
mdr = df[df.Position=="Midfielder"]   # Creating a dataframe for midfielder

In [157]:
np.random.seed(100)                   # Assigning weight to midfielder tackles 
mdr = pd.DataFrame({"Position":np.random.uniform(3.2,5.2,29)})

In [160]:
stk = df[df.Position=="Striker"]      # Creating a dataframe for striker

In [162]:
np.random.seed(100)                   # Assigning weight to striker tackles 
stk = pd.DataFrame({"Position":np.random.uniform(0.5,1.9,20)})

In [182]:
frames = [gkr, dfr, mdr, stk]         # Combining the four sets of tackle values
result = pd.concat(frames)
print(result)

    Position
0   0.208681
1   0.155674
2   0.184904
3   0.268955
4   0.100944
5   0.124314
6   0.234150
7   0.265171
8   0.127341
0   5.586810
1   5.056739
2   5.349035
3   6.189552
4   4.509438
5   4.743138
6   5.841498
7   6.151706
8   4.773413
9   5.650187
10  6.282644
11  4.918404
12  4.870656
13  4.716754
14  4.939395
15  6.457248
16  6.123366
17  4.843882
18  6.132449
19  5.048147
20  5.363408
..       ...
19  3.748147
20  4.063408
21  5.080060
22  4.835299
23  3.872224
24  3.550821
25  3.945664
26  3.211377
27  3.704853
28  4.791325
0   1.260767
1   0.889717
2   1.094325
3   1.682687
4   0.506606
5   0.670197
6   1.439049
7   1.656194
8   0.691389
9   1.305131
10  1.747851
11  0.792883
12  0.759460
13  0.651728
14  0.807576
15  1.870073
16  1.636356
17  0.740717
18  1.642715
19  0.883703

[100 rows x 1 columns]
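The printed result shows the index restarting at 0 for each position group, so its labels no longer identify individual footballers. As a small illustration with made-up values (the frames `a` and `b` are hypothetical, not the project data), `pd.concat` keeps each frame's own index unless `ignore_index=True` is passed:

```python
import pandas as pd

# Two tiny made-up frames, each with its own default 0-based index
a = pd.DataFrame({"Tackles_made": [0.2, 0.1]})
b = pd.DataFrame({"Tackles_made": [5.5, 4.8, 6.1]})

kept = pd.concat([a, b])                            # index labels repeat
renumbered = pd.concat([a, b], ignore_index=True)   # fresh labels from 0 upwards

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Even with a fresh index, the rows are ordered by position group rather than by footballer, so the values would still need to be matched to the right players before being added to `df`.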


The next variable I would like to look at is the number of goals scored per game by each player. As with tackling, the research here is not surprising.

1) Goalkeepers are in the business of preventing goals, not scoring them: 0 per game.

2) A defender's primary function is to prevent the other team's players from scoring, but they do get on the score sheet from time to time, particularly the big centre-backs from set pieces: 0.15 per game.

3) Midfielders, being the all-rounders that they are, have a habit of getting forward to assist their strikers and this pays off for them: 0.55 per game.

4) Strikers, as expected, top the list here as it is their job to put the ball in the net: 0.8 per game.

When adding this column to my data set I will weight the data accordingly, knowing that not all players in each position will score exactly the same number of goals. Let's assume a difference of up to 0.05 either side of the averages mentioned.
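The averages and the 0.05 spread described above could be turned into a column along the following lines. This is a hedged sketch on a small stand-in dataframe: the names `demo`, `goal_means` and the column `Goals_scored` are my own, and clipping at zero is an assumption added so that goalkeepers never get a negative value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)  # local generator, independent of the notebook's global seed

# Small stand-in dataframe with two players per position
demo = pd.DataFrame(
    {"Position": ["Goalkeeper", "Defender", "Midfielder", "Striker"] * 2}
)

# Average goals per game for each position, taken from the list above
goal_means = {"Goalkeeper": 0.0, "Defender": 0.15, "Midfielder": 0.55, "Striker": 0.8}

# Look up each player's average, then add uniform noise of up to 0.05 either side
noise = rng.uniform(-0.05, 0.05, len(demo))
demo["Goals_scored"] = demo["Position"].map(goal_means) + noise

# Assumption: a goals-per-game figure cannot be negative, so clip at zero
demo["Goals_scored"] = demo["Goals_scored"].clip(lower=0)

print(demo)
```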

The last variable I would like to look at is the number of kilometres covered by these players in a match. After researching online I have found the following:

1) Goalkeepers, as expected, cover the least ground: 5.5km on average per game.

2) Defenders fall into two groups, full-backs and centre-backs. The ground covered by each differs, as full-backs tend to cover more ground by attacking up the wings whereas centre-backs tend to sit back and mind the house. For this project we will take an average of 10km per game.

3) Midfielders are the real workhorses of the game and cover an average of 11.5km per game.

4) Strikers tend to wait up front and conserve their energy until they get the ball, so they average about 9km per game.

When adding this column to my data set I will weight the data accordingly, knowing that not all players in each position will cover exactly the same ground. Let's assume a difference of up to 1 kilometre either side of the averages mentioned above.

In [15]:
np.random.seed(100)
# Look up each player's average distance (km) by position, then add up to 1 km of noise either side
km_means = {"Goalkeeper": 5.5, "Defender": 10, "Midfielder": 11.5, "Striker": 9}
df["kms_ran"] = df["Position"].map(km_means) + np.random.uniform(-1, 1, len(df))
df.head(10)

In [101]:
round(random.uniform(0.01, 0.03), 2)   # scratch test: one random value between 0.01 and 0.03, rounded to 2 decimals

0.02