# Synthesising a Real World Phenomenon

In this project I plan to:

1) Choose a real-world phenomenon that can be measured and for which I can collect at least 100 data points across at least four variables.

2) Investigate the types of variables involved, their likely distributions, and their relationships with each other.

3) Synthesise a data set that matches those properties as closely as possible.

## The Python Libraries to be used

NumPy is the fundamental package for scientific computing with Python. Besides its scientific uses, it can also be used as an efficient multi-dimensional container of generic data.

Pandas is a package providing fast, flexible and expressive data structures designed to make working with data both easy and intuitive.

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

In [65]:
# Import the Python packages I plan on using

import pandas as pd
from pandas import DataFrame
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import random
%matplotlib inline 

## Data for this project

I have decided to look at a list of 100 footballers and how variables such as the number of hours they train and their age affect their performance. To measure their performance I will look at variables like the number of kilometres covered in a match, tackles made and goals scored. I will also factor in their position on the pitch, as this will further influence the other variables.

To start, I am creating a dataframe of 100 footballers, with their ages randomly selected between 18 and 35.

In [155]:
df = pd.DataFrame({"Footballer":np.arange(1,101,1), "Age": np.random.randint(18,36,100)}).set_index("Footballer")
df.head(10)

Footballer,Age
1,26
2,27
3,32
4,23
5,24
6,18
7,22
8,20
9,18
10,28


In [156]:
allowed_position = ['Goalkeeper', 'Defender', 'Midfielder', 'Striker']
# Unweighted draw: every position is equally likely
my_position = [np.random.choice(allowed_position) for i in range(100)]

In football the line-up is 1 goalkeeper and usually 4 defenders, 4 midfielders and 2 strikers, so I will weight the positions accordingly.
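As a rough check on the weights, the exact shares for that line-up can be worked out by dividing each count by the team size of 11. This is only a sketch: the `formation` and `exact_weights` names are my own, and the 4-4-2 shape is the assumption stated above. The exact shares come to roughly 0.09, 0.36, 0.36 and 0.18, so rounded weights of 0.1, 0.35, 0.35 and 0.2 are a close approximation.

```python
# Players per position in an assumed 4-4-2 line-up (1 goalkeeper + 10 outfield players)
formation = {"Goalkeeper": 1, "Defender": 4, "Midfielder": 4, "Striker": 2}

# Divide each count by the team size so the weights sum to 1
team_size = sum(formation.values())
exact_weights = {pos: n / team_size for pos, n in formation.items()}

for pos, weight in exact_weights.items():
    print(f"{pos}: {weight:.3f}")
```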

In [165]:
np.random.seed(100)    # seed the random number generator so the same data is produced on every run
position = np.random.choice(allowed_position, 100, p=[0.1, 0.35, 0.35, 0.2]) 
print(position)

['Midfielder' 'Defender' 'Defender' 'Striker' 'Goalkeeper' 'Defender'
 'Midfielder' 'Striker' 'Defender' 'Midfielder' 'Striker' 'Defender'
 'Defender' 'Defender' 'Defender' 'Striker' 'Striker' 'Defender' 'Striker'
 'Defender' 'Defender' 'Striker' 'Striker' 'Defender' 'Defender'
 'Defender' 'Goalkeeper' 'Defender' 'Midfielder' 'Goalkeeper' 'Midfielder'
 'Midfielder' 'Defender' 'Defender' 'Goalkeeper' 'Striker' 'Striker'
 'Goalkeeper' 'Striker' 'Midfielder' 'Midfielder' 'Midfielder'
 'Midfielder' 'Goalkeeper' 'Defender' 'Midfielder' 'Midfielder' 'Defender'
 'Defender' 'Striker' 'Striker' 'Striker' 'Defender' 'Midfielder'
 'Defender' 'Defender' 'Defender' 'Defender' 'Goalkeeper' 'Midfielder'
 'Defender' 'Midfielder' 'Midfielder' 'Defender' 'Striker' 'Striker'
 'Midfielder' 'Defender' 'Defender' 'Defender' 'Defender' 'Defender'
 'Defender' 'Striker' 'Striker' 'Midfielder' 'Midfielder' 'Defender'
 'Goalkeeper' 'Midfielder' 'Midfielder' 'Goalkeeper' 'Midfielder'
 'Striker' 'Defender' 'Defend

Now I would like to expand the dataframe by adding a column of the weighted positions.

In [169]:
np.random.seed(100)
df["Position"] = np.random.choice(allowed_position, 100, p=[0.1, 0.35, 0.35, 0.2]) 
df.head(10)

Footballer,Age,Position,hours
1,26,Midfielder,9
2,27,Defender,8
3,32,Defender,8
4,23,Striker,7
5,24,Goalkeeper,7
6,18,Defender,5
7,22,Midfielder,4
8,20,Striker,5
9,18,Defender,6
10,28,Midfielder,6


Now I would like to add another column to the dataset, where the number of hours each player trains is randomly selected between 4 and 10 hours per week.

In [170]:
np.random.seed(100)
hours = np.random.randint(4,11,100)
print(hours)

[ 4  4  7  4  6 10  8  6  9  6  6 10  6  5  4  4  8  7  8  6  4  7  5  9
 10  6  7  8  8  5  9  9  7  8  8  7  7  7  5  5  9 10  7  4  6  5  5 10
  7  6  9  7  4 10  5 10  4  9 10  8  6  4  4  6  9  6  5  4  9  6 10  5
  9  8  6  4  7  7  7  9 10  4  9  5  8  6  7 10  7  8  6 10  8  7  5  4
  8  7  8  9]


In [172]:
df["hours"] = np.random.randint(4,11,100)
df.head(10)

Footballer,Age,Position,hours
1,26,Midfielder,6
2,27,Defender,6
3,32,Defender,10
4,23,Striker,6
5,24,Goalkeeper,5
6,18,Defender,6
7,22,Midfielder,5
8,20,Striker,9
9,18,Defender,6
10,28,Midfielder,6
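The hours in the column do not match the array printed earlier, presumably because the generator was not re-seeded before the second call and so carried on from its previous state. The sketch below illustrates that behaviour; the variable names `first`, `second` and `third` are my own.

```python
import numpy as np

np.random.seed(100)
first = np.random.randint(4, 11, 100)    # draws made straight after seeding

second = np.random.randint(4, 11, 100)   # no re-seed: the generator carries on from where it stopped

np.random.seed(100)
third = np.random.randint(4, 11, 100)    # re-seeded: repeats the first draw

print(np.array_equal(first, third))      # compares the seeded and re-seeded draws
print(np.array_equal(first, second))     # compares the seeded and unseeded draws
```

Calling `np.random.seed(100)` immediately before assigning the column would therefore be one way to make it match the printed array.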


I would now like to have a look at the number of tackles each player makes in a game. From my research I have found:

1) Goalkeepers, as expected, rarely make tackles, as saves are the name of their game, but they do occasionally come off their line to make a quick slide outside the box: 0.2 tackles per game.

2) Tackling is a defender's bread and butter; it's how they earn their crust, so it is no surprise that they make more tackles than any other position, at 5.5 per game.

3) Midfielders are the engine of a team, and while they cover more ground than anybody, they don't put in as many tackles as their defensive counterparts: 4.2 per game.

4) Strikers are the poster boys of any team and leave the dirty business of tackling to their team mates: 1.2 per game.

When adding this column to my data set I will weight the data accordingly, knowing that not all players in each position will put in the same number of tackles. Let's assume a difference of 1 either side of the averages mentioned.
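One possible way to build such a column is sketched below. It assumes the spread around each average is uniform, which the research above does not actually establish, and it uses a small stand-in frame called `players` rather than the project dataframe. Values are clipped at zero because a goalkeeper average of 0.2 minus 1 would otherwise go negative.

```python
import numpy as np
import pandas as pd

np.random.seed(100)

# Small stand-in frame; in the project this would be the 100-player dataframe
players = pd.DataFrame({"Position": ["Goalkeeper", "Defender", "Midfielder", "Striker"] * 5})

# Average tackles per game by position, taken from the list above
mean_tackles = {"Goalkeeper": 0.2, "Defender": 5.5, "Midfielder": 4.2, "Striker": 1.2}

# Assumption: each player deviates uniformly by up to 1 tackle either side of the average
noise = np.random.uniform(-1, 1, len(players))
tackles = players["Position"].map(mean_tackles) + noise

# Clip at zero so goalkeepers never end up with a negative number of tackles
players["Tackles_made"] = tackles.clip(lower=0).round(1)

print(players.groupby("Position")["Tackles_made"].agg(["min", "max"]))
```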

The next variable I would like to look at is the number of goals scored per game by each player. As with tackling, the research here is not surprising.

1) Goalkeepers are in the business of preventing goals, not scoring them: 0 per game.

2) A defender's primary function is to prevent the other team's players from scoring, but they do get on the score sheet from time to time, particularly the big centre-backs from set pieces: 0.15 per game.

3) Midfielders, being the all-rounders that they are, have a habit of getting forward to assist their strikers, and this pays off for them: 0.55 per game.

4) Strikers, as expected, top the list here, as it is their job to put the ball in the net: 0.8 per game.

When adding this column to my data set I will weight the data accordingly, knowing that not all players in each position will score exactly the same number of goals. Let's assume a difference of 0.05 either side of the averages mentioned.
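Since each per-position variable follows the same recipe (a position average plus a fixed spread), the step could be wrapped in a small helper. The function `synthesise_by_position` below is hypothetical, as are the `players` frame and the uniform-noise assumption; the grouped means at the end are a quick sanity check that the synthetic values stay near the researched averages.

```python
import numpy as np
import pandas as pd

def synthesise_by_position(positions, means, spread):
    """Return one value per player: the position average plus uniform noise within +/- spread."""
    centre = positions.map(means)
    noise = np.random.uniform(-spread, spread, len(positions))
    return (centre + noise).clip(lower=0)  # a goals-per-game rate cannot be negative

np.random.seed(100)
allowed_position = ["Goalkeeper", "Defender", "Midfielder", "Striker"]
players = pd.DataFrame(
    {"Position": np.random.choice(allowed_position, 100, p=[0.1, 0.35, 0.35, 0.2])}
)

# Average goals per game by position, taken from the list above
mean_goals = {"Goalkeeper": 0.0, "Defender": 0.15, "Midfielder": 0.55, "Striker": 0.8}
players["Goals_scored"] = synthesise_by_position(players["Position"], mean_goals, 0.05)

# Sanity check: group averages should sit close to the researched figures
print(players.groupby("Position")["Goals_scored"].mean().round(2))
```

Passing a different dictionary of averages and a different `spread` would reuse the same helper for the other variables.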

The last variable I would like to look at is the number of kilometres covered by these players in a match. After researching online I have found the following:

1) Goalkeepers, as expected, cover the least ground: 5.5km on average per game.

2) Defenders fall into two groups, fullbacks and centre-backs. The ground covered by each is different, as fullbacks tend to cover more ground by attacking up the wings, whereas the centre-backs tend to sit back and mind the house. For this project we will take an average of 10km per game.

3) Midfielders are the real workhorses in the game and cover an average of 11.5km per game.

4) Strikers tend to wait up front and conserve their energy until they get the ball and so they average about 9km per game.

When adding this column to my data set I will weight the data accordingly, knowing that not all players in each position will cover exactly the same ground. Let's assume a difference of 1 kilometre either side of the averages mentioned above.

In [182]:
# Look up the Position column by its name (a bare `Position` is not a defined variable)
data = "Position"
print(df[data])

In [None]:
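np.random.seed(100)
# Average kilometres per game by position, taken from the figures above
mean_kms = {"Goalkeeper": 5.5, "Defender": 10, "Midfielder": 11.5, "Striker": 9}
# Assumption: each player falls uniformly within 1 km either side of the position average
df["kms_ran"] = df["Position"].map(mean_kms) + np.random.uniform(-1, 1, len(df))
df.head(10)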

In [64]:
df = pd.DataFrame({"Footballer":[1,2,3,4], "Position":["Goalkeeper","Defender","Midfielder","Striker"]}, columns=["Footballer", "Position", "Age", "Kms_ran", "Tackles_made","Goals_scored"])

,Footballer,Position,Age,Kms_ran,Tackles_made,Goals_scored
0,1,Goalkeeper,,,,
1,2,Defender,,,,
2,3,Midfielder,,,,
3,4,Striker,,,,


In [193]:
# np.random.uniform takes (low, high, size): five values between 4 and 100
data = np.random.uniform(4, 100, 5)

print(data)

[21.50565964 67.73325279 83.16059644 64.20812356 99.45390329]
