# Programming for Data Analysis

## Project 2019

The following project consists of generating a dataset using the numpy.random package, following these steps:

* Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

* Investigate the types of variables involved, their likely distributions, and their relationships with each other.

* Synthesise/simulate a data set as closely matching their properties as possible

* Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

I have an interest in Irish players playing in the English soccer leagues, therefore I have chosen to base my data set around this phenomenon. The variables are Name, Surname, Age, County of Ireland, Playing Position and the League in which they play. I have used a number of sources for my data, and these are listed in each variable section.

In [3]:
# To begin with we add in all the packages we intend to use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

As part of this project a number of external sources of data are required, and these datasets are stored in a subfolder within this repository.

## Name

The first variable chosen was the first name of the player. This variable was taken from the CSO website's "Top baby boys names 2014". Although the source of the data may not be the most current, the statistical calculations that follow still hold. The data was saved as a CSV file in the data folder under the name fornames.

In [4]:
df_fornames = pd.read_csv("data/fornames.csv")

In [5]:
df_fornames.head()

| | Name | Qty | Percentage |
|---|---|---|---|
| 0 | Jack | 786 | 0.055705 |
| 1 | James | 695 | 0.049256 |
| 2 | Daniel | 638 | 0.045216 |
| 3 | Conor | 581 | 0.041176 |
| 4 | Seán | 526 | 0.037279 |


The dataset has three columns: Name, Qty and Percentage. This information can be used to randomly generate a boy's name, using the Name column as the possible values and the Percentage column as their probabilities. For this we use the function:

**numpy.random.choice(a, size=None, replace=True, p=None)**

In [6]:
# First, from df_fornames we define the array of names
forname_array = df_fornames["Name"].tolist()
# We then define the probability of each of these names arising
forname_percent = df_fornames["Percentage"].tolist()

forname = np.random.choice(forname_array, 1, p=forname_percent)
print(forname)

['Seán']
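One point worth checking is that `np.random.choice` requires the values passed as `p` to sum to 1. The short sketch below shows one way to rescale raw counts into valid probabilities and then compares the sampled shares with the intended ones. The three names and quantities are copied from the table above; the variable names and the seeded generator are assumptions made purely for illustration.

```python
import numpy as np

# Names and quantities copied from the first rows of the table above
names = ["Jack", "James", "Daniel"]
qty = np.array([786, 695, 638])

# Rescale the counts so the probabilities sum to exactly 1
probs = qty / qty.sum()

# Seeded generator (an assumption, used only to make the sketch repeatable)
rng = np.random.default_rng(2019)
sample = rng.choice(names, size=1000, p=probs)

# The observed share of each name should be close to its probability
for name, p in zip(names, probs):
    print(name, round(p, 3), round(np.mean(sample == name), 3))
```

The same rescaling could be applied to the Percentage column if it turns out not to sum to 1.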


https://www.cso.ie/en/releasesandpublications/ep/p-1916/1916irl/people/names/

## Surname

Similarly to above, the surname follows the same sequence as the forename. Independent.ie revealed the top 20 Irish surnames in December 2019. We then assumed these to be the possible values of the surname variable. This data was also saved as a CSV file in the data folder.

In [7]:
df_surenames = pd.read_csv("data/surname.csv")

In [8]:
df_surenames.head()

| | Surname |
|---|---|
| 0 | Murphy |
| 1 | Kelly |
| 2 | Byrne |
| 3 | Ryan |
| 4 | O'Brien |


For this variable we have no quantity or percentage breakdown, therefore we give each surname an equal likelihood of occurrence: 1/20.

In [9]:
surname = df_surenames["Surname"]
# Pick a random row position covering every surname in the list (0 to len - 1)
S_name = surname.iloc[np.random.choice(len(surname))]
print(S_name)

Walsh
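When no `p` argument is given, every entry is equally likely, and passing the list itself avoids any index arithmetic. The sketch below illustrates this with the five surnames shown in the table above (the full file holds 20); the seeded generator and variable names are assumptions for illustration.

```python
import numpy as np

# First five surnames from the table above; the full file has 20
surnames = ["Murphy", "Kelly", "Byrne", "Ryan", "O'Brien"]

rng = np.random.default_rng(1)  # seeded only for repeatability (assumption)

# With no p argument every entry is equally likely (1 / len(surnames))
draws = rng.choice(surnames, size=5000)

# Share of the draws taken by each surname
values, counts = np.unique(draws, return_counts=True)
shares = dict(zip(values, counts / counts.sum()))
print(shares)
```

Each share should land close to 0.2 here, and close to 0.05 with the full list of 20 surnames.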


https://www.independent.ie/irish-news/revealed-top-20-irish-surnames-31414892.html

## Age

The age variable is based on another assumption: ages 15 through to 33 with a triangular distribution. We have assumed youngsters start to head to England at the age of 15, that the number peaks at 18, and that it gradually tapers down to the age of 33, where we assume the career of a professional footballer ends. The thought process behind this is that players are scouted by teams at a young age and given short-term contracts; over time, through injuries and not being given contract extensions, the probability of still playing decreases with age. This assumed distribution takes the form of:

**numpy.random.triangular(left, mode, right, size=None)**

In [35]:
# The left-most age is 15, the mode (peak) is 18 and the right-most age is 33
age = np.random.triangular(15, 18, 33, 1)

In [36]:
print(age)

[22.32416539]
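A quick way to sanity-check the chosen parameters is to draw many ages and compare the sample mean with the theoretical mean of a triangular distribution, (left + mode + right) / 3, which is 22 for the values used here. The following is a minimal sketch; the seeded generator and the optional conversion to whole years are assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is an assumption for repeatability

left, mode, right = 15, 18, 33
ages = rng.triangular(left, mode, right, size=10_000)

# Theoretical mean of a triangular distribution
expected_mean = (left + mode + right) / 3

# Every draw should fall between the left and right bounds
print(ages.min() >= left, ages.max() <= right)
print(round(ages.mean(), 2), expected_mean)

# Whole-year ages, if preferred: floor never moves a value below the left bound
whole_ages = np.floor(ages).astype(int)
```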


## County

As before, we decided to base each player's county of origin on the likelihood of coming from each county, in proportion to population. Wikipedia has taken data from the CSO census of 2016 for the entire island of Ireland. This data was saved as a CSV file in the data folder. The CSV file contains three columns: County, Population and Province. We can use the population data to determine the likelihood of each county being a player's county of origin. Further calculations are required.

In [12]:
df_county = pd.read_csv("data/population_per_county.csv")

In [13]:
df_county.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 3 columns):
County        32 non-null object
Population    32 non-null float64
Province      32 non-null object
dtypes: float64(1), object(2)
memory usage: 848.0+ bytes


In [14]:
df_county.head()

| | County | Population | Province |
|---|---|---|---|
| 0 | Dublin | 1347359.0 | Leinster |
| 1 | Antrim | 618108.0 | Ulster |
| 2 | Cork | 542868.0 | Munster |
| 3 | Down | 531665.0 | Ulster |
| 4 | Galway | 258058.0 | Connacht |


To determine the probability breakdown per county we first need to total the Population column, which comes to 6,573,732 people (computed below). Dividing each county's population by this total then gives its probability.

In [15]:
# Total population across all 32 counties
pop = df_county['Population'].sum()
print(pop)
# Each county's share of the total is its selection probability
df_county['propibility'] = df_county['Population'] / pop

6573732.0
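As a worked example of this division, the sketch below rebuilds the calculation for the five counties shown in the earlier table, using the all-island total printed above. The column name `probability` and the small stand-in DataFrame are assumptions for illustration; the notebook itself works on the full 32-county file.

```python
import pandas as pd

# Five largest counties, copied from the table shown earlier
df = pd.DataFrame({
    "County": ["Dublin", "Antrim", "Cork", "Down", "Galway"],
    "Population": [1347359, 618108, 542868, 531665, 258058],
})

# All-island total printed by the cell above
total = 6573732

# Share of the total population = probability of a player coming from there
df["probability"] = df["Population"] / total
print(df.round(6))
```

These five shares do not sum to 1 on their own, because the other 27 counties are left out of the stand-in table.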


In [16]:
df_county.head()

| | County | Population | Province | propibility |
|---|---|---|---|---|
| 0 | Dublin | 1347359.0 | Leinster | 0.204961 |
| 1 | Antrim | 618108.0 | Ulster | 0.094027 |
| 2 | Cork | 542868.0 | Munster | 0.082581 |
| 3 | Down | 531665.0 | Ulster | 0.080877 |
| 4 | Galway | 258058.0 | Connacht | 0.039256 |


As with the previous variables, we use the choice function to randomly choose a county for the player to come from.

In [17]:
county_array = df_county["County"].tolist()
county_percent = df_county["propibility"].tolist()

# np.random.choice with size 1 returns a one-element array
county = np.random.choice(county_array, 1, p=county_percent)

print(county)

['Meath']



https://en.wikipedia.org/wiki/List_of_Irish_counties_by_population

## Position

The position of the player is a slightly more straightforward calculation. Here we look at a basic team formation of 1 Goalkeeper (GK), 4 Defenders (DF), 4 Midfielders (MD) and 2 Forwards (FW). From this we can calculate the probability of each position occurring. As an example, a goalkeeper will occur once in every 11 players on the pitch. For the position variable we randomly select a position for the player using the numpy.random.choice function, as with the previous variables.

In [38]:
# GK = 1/11  ≈ 0.09
# DF = 4/11  ≈ 0.36
# MD = 4/11  ≈ 0.36
# FW = 2/11  ≈ 0.18

position = ['GK', 'DF', 'MD', 'FW']
# Exact fractions sum to 1, which np.random.choice requires
pos_prob = [1/11, 4/11, 4/11, 2/11]

pos = np.random.choice(position, 1, p=pos_prob)
print(pos)

['GK']
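The formation-based probabilities can also be derived from the player counts rather than typed in as rounded decimals, which keeps them summing to exactly 1. The sketch below does this and checks the sampled shares against the exact fractions; the array `players_per_role` and the seeded generator are assumptions for illustration.

```python
import numpy as np

position = ["GK", "DF", "MD", "FW"]
# Players per role in the formation described above (1 + 4 + 4 + 2 = 11)
players_per_role = np.array([1, 4, 4, 2])

# Exact probabilities: 1/11, 4/11, 4/11, 2/11
pos_prob = players_per_role / players_per_role.sum()

rng = np.random.default_rng(7)  # seeded for repeatability (assumption)
draws = rng.choice(position, size=11_000, p=pos_prob)

# Compare the intended probability with the observed share for each role
for role, p in zip(position, pos_prob):
    print(role, round(p, 4), round(np.mean(draws == role), 4))
```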


## League

We can now determine which league the player plays in, again using the choice function in numpy.random. Soccerway.com has a full database of all players playing in any league in the world. This database can be filtered to find all the Irish players playing in each league. This information is stored in a CSV file named leagues in the data folder. The CSV gives a full list of all the current players and the league and club they play for.

In [19]:
df_leagues = pd.read_csv("data/leagues.csv")

In [44]:
df_leagues.head()

| | Player | League | Club |
|---|---|---|---|
| 0 | K. Long | Premier League | Burnley |
| 1 | C. Kelleher | Premier League | Liverpool |
| 2 | L. Richards | Premier League | Wolverhampton Wanderers |
| 3 | G. Kilkenny | Premier League | AFC Bournemouth |
| 4 | S. Coleman | Premier League | Everton |


In [48]:
df_leagues.League.unique()

array(['Premier League', 'Championship', 'League One', 'League Two',
       'National League', 'National League N/S', 'Non League Premier',
       'Non League Div One'], dtype=object)

The next step is to determine the name of each league the players are listed in and its probability. This column in the CSV file is labelled "League".

In [58]:
# Define a Series "LeaguesAndProb" holding each league and its probability
LeaguesAndProb= df_leagues['League'].value_counts(normalize=True)
league_prob = LeaguesAndProb.values
league_name = LeaguesAndProb.index

league = np.random.choice(league_name, 1, p=league_prob)
print(LeaguesAndProb)
print("\n \nThe randomly generated league selected is: ")
print(league)

League One             0.198113
League Two             0.183962
Championship           0.169811
Premier League         0.127358
National League        0.117925
National League N/S    0.099057
Non League Premier     0.066038
Non League Div One     0.037736
Name: League, dtype: float64

 
The randomly generated league selected is: 
['National League N/S']
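The key step above is `value_counts(normalize=True)`, which turns a column of labels into the share of rows taken by each label. A minimal sketch on a hypothetical eight-player column (not the real Soccerway data) shows that this is the same as dividing each count by the total:

```python
import pandas as pd

# Hypothetical league labels for eight players (not the real data)
leagues = pd.Series(
    ["League One", "League One", "League One",
     "Championship", "Championship",
     "Premier League", "League Two", "League Two"],
    name="League",
)

counts = leagues.value_counts()                # rows per league
shares = leagues.value_counts(normalize=True)  # rows per league / total rows

# Both routes give the same probabilities, and they sum to 1
print(shares)
print(counts / counts.sum())
```

Because the shares always sum to 1, they can be passed straight to `np.random.choice` as `p`.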


https://ie.soccerway.com/players/players_abroad/ireland-republic/

## Final Data Point

We can now create a new data frame using all of the above variables and the numpy.random functions.

In [22]:
# Define the dataframe and include all the column names
df_rand_players = pd.DataFrame(columns=['Forname','Surname','County','Age','Position','League'])

In [23]:
# Re-use the surname column loaded from the CSV file earlier
surname = df_surenames["Surname"]

# Generate 10 players, collecting one row (dict) per player
rows = []
for i in range(10):
    name = np.random.choice(forname_array, 1, p=forname_percent)
    # len(surname) rather than 19, so the last surname can also be drawn
    player_surname = surname.iloc[np.random.choice(len(surname))]
    county = np.random.choice(county_array, 1, p=county_percent)
    # Same bounds as the Age section: 15 to 33 with a peak at 18
    age = np.random.triangular(15, 18, 33, 1)
    pos = np.random.choice(position, 1, p=pos_prob)
    league = np.random.choice(league_name, 1, p=league_prob)
    row = {'Forname': name[0],
           'Surname': player_surname,
           'County': county[0],
           'Age': round(age[0], 2),
           'Position': pos[0],
           'League': league[0]}
    print(",".join(str(value) for value in row.values()))
    rows.append(row)

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df_rand_players = pd.DataFrame(rows, columns=df_rand_players.columns)

Max,Walsh,Kerry,19.3,DF,Non League Premier
Liam,O'Neill,Laois,24.93,GK,Non League Premier
Jack,O'Reilly,Wexford,17.24,DF,League Two
Michael,Dunne,Cavan,24.04,MD,National League
Jack,Lynch,Limerick,20.82,DF,Premier League
James,Daly,Down,17.24,DF,National League N/S
Oisin,Byrne,Limerick,25.72,DF,Non League Div One
Max,O'Brien,Meath,20.87,MD,League Two
Liam,O'Sullivan,Dublin,17.11,FW,League One
Rian,O'Brien,Cork,21.45,DF,National League


In [24]:
print(df_rand_players)

   Forname     Surname    County    Age Position               League
0      Max       Walsh     Kerry  19.30       DF   Non League Premier
1     Liam     O'Neill     Laois  24.93       GK   Non League Premier
2     Jack    O'Reilly   Wexford  17.24       DF           League Two
3  Michael       Dunne     Cavan  24.04       MD      National League
4     Jack       Lynch  Limerick  20.82       DF       Premier League
5    James        Daly      Down  17.24       DF  National League N/S
6    Oisin       Byrne  Limerick  25.72       DF   Non League Div One
7      Max     O'Brien     Meath  20.87       MD           League Two
8     Liam  O'Sullivan    Dublin  17.11       FW           League One
9     Rian     O'Brien      Cork  21.45       DF      National League
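Since the numpy.random functions accept a `size` argument, whole columns can also be drawn at once rather than one player per loop iteration, which makes it easy to reach the one hundred data points asked for in the brief. The sketch below is one possible way to do this; the short stand-in lists replace the CSV-derived arrays, and the seeded generator is an assumption for repeatability.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded for repeatability (assumption)
n = 100  # number of simulated players

# Short stand-in lists; the notebook builds these from its CSV files
fornames = ["Jack", "James", "Daniel"]
surnames = ["Murphy", "Kelly", "Byrne"]
positions = ["GK", "DF", "MD", "FW"]

# Each column is drawn in a single call with size=n
players = pd.DataFrame({
    "Forname": rng.choice(fornames, size=n),
    "Surname": rng.choice(surnames, size=n),
    "Age": rng.triangular(15, 18, 33, size=n).round(2),
    "Position": rng.choice(positions, size=n, p=[1/11, 4/11, 4/11, 2/11]),
})
print(players.head())
```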


In [25]:
#df_rand_players.drop(df_rand_players.index, inplace=True)
#df_rand_players.drop(['New_Columns'], axis=1)

In [26]:
if "Irish_Caps" not in df_rand_players:
    df_rand_players["Irish_Caps"] = ""

# Assign international caps based on age and league.
# np.random.randint excludes the upper bound, so high = maximum caps + 1.
for index, row in df_rand_players.iterrows():
    age = row['Age']
    league = row['League']
    cap = ''  # players outside the top two leagues are left without caps
    if league == 'Premier League':
        # Check the oldest band first so that it can actually be reached
        if age >= 28:
            cap = np.random.randint(1, 41)
        elif age >= 20:
            cap = np.random.randint(1, 26)
        else:
            cap = np.random.randint(1, 11)
    elif league == 'Championship':
        # Assumed intent: under-20s get 1 to 2 caps, older players 1 to 5
        if age >= 20:
            cap = np.random.randint(1, 6)
        else:
            cap = np.random.randint(1, 3)
    df_rand_players.at[index, 'Irish_Caps'] = cap




In [39]:
print(df_rand_players)

   Forname     Surname    County    Age Position               League  \
0      Max       Walsh     Kerry  19.30       DF   Non League Premier   
1     Liam     O'Neill     Laois  24.93       GK   Non League Premier   
2     Jack    O'Reilly   Wexford  17.24       DF           League Two   
3  Michael       Dunne     Cavan  24.04       MD      National League   
4     Jack       Lynch  Limerick  20.82       DF       Premier League   
5    James        Daly      Down  17.24       DF  National League N/S   
6    Oisin       Byrne  Limerick  25.72       DF   Non League Div One   
7      Max     O'Brien     Meath  20.87       MD           League Two   
8     Liam  O'Sullivan    Dublin  17.11       FW           League One   
9     Rian     O'Brien      Cork  21.45       DF      National League   

  Irish_Caps  
0             
1             
2             
3             
4       [13]  
5             
6             
7             
8             
9             
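The cap rules can also be written as a small function, which makes the age bands easier to read and to test on their own. The sketch below is one reading of the rules in the cell above, not a definitive version: the helper name `draw_caps`, the under-20 Championship band, the seeded generator, and returning 0 (rather than an empty string) for lower leagues are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # seeded for repeatability (assumption)

def draw_caps(age, league):
    """Return a random number of caps, or 0 for leagues without a rule."""
    if league == "Premier League":
        upper = 40 if age >= 28 else 25 if age >= 20 else 10
    elif league == "Championship":
        upper = 5 if age >= 20 else 2
    else:
        return 0
    # endpoint=True makes the upper bound inclusive
    return int(rng.integers(1, upper, endpoint=True))

print(draw_caps(20.82, "Premier League"))
print(draw_caps(19.30, "Non League Premier"))
```

A function like this could be applied row by row, for example with `DataFrame.apply(..., axis=1)`, in place of the explicit loop.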


In [28]:
df_rand_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
Forname       10 non-null object
Surname       10 non-null object
County        10 non-null object
Age           10 non-null float64
Position      10 non-null object
League        10 non-null object
Irish_Caps    10 non-null object
dtypes: float64(1), object(6)
memory usage: 640.0+ bytes


In [29]:
#df_rand_players.to_csv(r'generated_players.csv', index = None, header=True)