# Project Statement - 

Choose a real-world phenomenon that can be measured and for which you could
collect at least one-hundred data points across at least four different variables.

1) Investigate the types of variables involved, their likely distributions, and their
relationships with each other.

2) Synthesise/simulate a data set as closely matching their properties as possible.

3) Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.


Note that this project is about simulation – you must synthesise a data set. Some
students may already have some real-world data sets in their own files. It is okay to
base your synthesised data set on these should you wish (please reference it if you do),
but the main task in this project is to create a synthesised data set. The next section
gives an example project idea.

In [5]:
# Importing of packages required for visualisations and to view within the notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Prints the data set for review before visualisation

stats = pd.read_csv('https://raw.githubusercontent.com/ShaunaB93/Programming-for-Data-Analytics-Project/master/Data%20files/Stats%20file.csv')
#print(stats)
stats.head()

Unnamed: 0,Rank,Player,Height (cm),Minutes played 17/18,Headed Goals Scored 17/18,Goals Scored 17/18,Minutes played 16/17,Headed Goals Scored 16/17,Goals Scored 16/17,Minutes played 15/16,Headed Goals Scored 15/16,Goals Scored 15/16,Minutes played 14/15,Headed Goals Scored 14/15,Goals Scored 14/15,Total Headed Goals Scored,Total Goals Scored,Total Minutes Played
0,1.0,Jamie Vardy,179.0,3255.0,3.0,20.0,2808.0,0.0,13.0,3139.0,2.0,24.0,2245.0,0.0,5.0,5.0,62.0,11447.0
1,2.0,Harry Kane,188.0,3083.0,6.0,30.0,2531.0,2.0,29.0,3368.0,1.0,25.0,2581.0,5.0,21.0,14.0,105.0,11563.0
2,3.0,Xherdan Shaqiri,169.0,3049.0,0.0,8.0,1707.0,0.0,4.0,2028.0,0.0,3.0,0.0,0.0,0.0,0.0,15.0,6784.0
3,4.0,Mohamed Salah,175.0,2921.0,2.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,30.0,0.0,0.0,2.0,32.0,2951.0
4,5.0,Romelu Lukaku,190.0,2869.0,3.0,16.0,3267.0,6.0,25.0,3174.0,4.0,18.0,2875.0,1.0,10.0,14.0,69.0,12185.0


# Example Scenario 

As a lecturer I might pick the real-world phenomenon of the performance of students
studying a ten-credit module. After some research, I decide that the most interesting
variable related to this is the mark a student receives in the module - this is going to be
one of my variables (grade).
Upon investigation of the problem, I find that the number of hours on average a
student studies per week (hours), the number of times they log onto Moodle in the
first three weeks of term (logins), and their previous level of degree qualification (qual)
are closely related to grade. The hours and grade variables will be non-negative real
numbers with two decimal places, logins will be a non-zero integer and qual will be a
categorical variable with four possible values: none, bachelors, masters, or phd.
After some online research, I find that full-time post-graduate students study on average
four hours per week with a standard deviation of a quarter of an hour and that
a normal distribution is an acceptable model of such a variable. Likewise, I investigate
the other three variables, and I also look at the relationships between the variables. I
devise an algorithm (or method) to generate such a data set, simulating values of the
four variables for two-hundred students. I detail all this work in my notebook, and then
I add some code in to generate a data set with those properties.
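
The example scenario can be sketched in code roughly as follows. Only the hours distribution (normal, mean 4, standard deviation 0.25) and the sample size of two hundred come from the text above. The Poisson model for logins, the category probabilities for qual, and the formula linking the variables to grade are illustrative assumptions, not researched values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
n_students = 200

# hours: normal with mean 4 and standard deviation 0.25 (from the scenario text)
hours = np.clip(rng.normal(4.0, 0.25, n_students), 0, None).round(2)

# logins: ASSUMPTION - a Poisson count shifted by one so it is never zero
logins = rng.poisson(lam=10, size=n_students) + 1

# qual: ASSUMPTION - illustrative category probabilities
qual_levels = ["none", "bachelors", "masters", "phd"]
qual = rng.choice(qual_levels, size=n_students, p=[0.1, 0.6, 0.25, 0.05])

# grade: ASSUMPTION - a made-up linear relationship plus noise, clipped to 0-100
bonus_by_qual = {"none": 0, "bachelors": 3, "masters": 6, "phd": 9}
qual_bonus = pd.Series(qual).map(bonus_by_qual).to_numpy()
hours_z = (hours - 4.0) / 0.25  # standardised study hours
noise = rng.normal(0, 5, n_students)
grade = np.clip(45 + 5 * hours_z + 0.8 * logins + qual_bonus + noise, 0, 100).round(2)

students = pd.DataFrame(
    {"hours": hours, "logins": logins, "qual": qual, "grade": grade}
)
print(students.head())
```

Building grade from the other three variables, rather than drawing it independently, is what lets the simulated data carry the relationships described in the scenario.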

# Suggested Scenario for project brief -


## Real-world phenomenon to examine -

Whether a Premier League striker's height has an impact on the number of headed goals and total goals scored. Data from the 14/15 to 17/18 seasons is used as the original data in order to model and determine the relationships.

## Types of variables involved -

1) Height of the player 

The average height of the strikers who scored goals over the time frame in question is 181.5 cm, with a standard deviation of 6.8 cm. The minimum and maximum heights noted are 169 cm and 201 cm respectively.

The height of the individual player will be a non-negative real number with one decimal place.

2) Minutes played

The average minutes played by the strikers listed over the seasons in question is 3674. The minimum and maximum minutes noted are 1 and 12185 respectively.

The minutes played for each of the players in the simulated dataset will be a non-negative integer.


3) Total goals scored

The average total goals scored by the strikers listed over the seasons in question is 15.5. The minimum and maximum total goals scored noted are 0 and 105 respectively.

Statistics of this kind in sport are generally reported to one decimal place. Therefore, in this study, the simulated total goals scored will be a non-negative real number with one decimal place.


4) Total headed goals scored

The average total headed goals scored by the strikers listed over the seasons in question is 2.7. The minimum and maximum total headed goals scored noted are 0 and 21 respectively.

As with total goals scored, statistics of this kind are generally reported to one decimal place. Therefore, in this study, the simulated total headed goals scored will be a non-negative real number with one decimal place.
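
As a minimal sketch of how one of these variables could be simulated to match its stated type and range, the code below generates player heights. The mean (181.5 cm), standard deviation (6.8 cm) and the 169 to 201 cm range come from the figures above. The choice of a normal distribution and the sample size `n_players` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_players = 115  # ASSUMPTION: illustrative sample size

# ASSUMPTION: heights are modelled as normally distributed
heights = rng.normal(loc=181.5, scale=6.8, size=n_players)

# Keep values inside the observed range and round to one decimal place
heights = np.clip(heights, 169.0, 201.0).round(1)

print(heights[:5])
print("mean:", round(heights.mean(), 1), "std:", round(heights.std(ddof=1), 1))
```

Clipping pulls any extreme draws back to the observed minimum and maximum, which slightly reduces the spread compared with an unclipped normal sample.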


## Likely distributions - 

1) Height of the player 

2) Minutes played

3) Total goals scored

4) Total headed goals scored
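
One possible first step towards choosing a distribution for each variable is to compare its mean, median and skewness. The sketch below uses only the five rows displayed by `stats.head()` earlier, so the numbers are far too few to draw conclusions from. The frame name `sample` and the `summary` layout are hypothetical, and the same calls could be applied to the full `stats` frame.

```python
import pandas as pd

# The five rows shown by stats.head(), restricted to the four study variables
sample = pd.DataFrame({
    "Height (cm)": [179.0, 188.0, 169.0, 175.0, 190.0],
    "Total Minutes Played": [11447.0, 11563.0, 6784.0, 2951.0, 12185.0],
    "Total Goals Scored": [62.0, 105.0, 15.0, 32.0, 69.0],
    "Total Headed Goals Scored": [5.0, 14.0, 0.0, 2.0, 14.0],
})

# Skewness near zero hints at a symmetric shape;
# a large positive value hints at a long right tail
summary = pd.DataFrame({
    "mean": sample.mean(),
    "median": sample.median(),
    "skew": sample.skew(),
})
print(summary)
```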

## Relationships with each other -

It would generally be assumed that a taller striker should score more headed goals than shorter strikers. I believe that minutes played would also have an impact: a striker who plays more minutes over a season or more should score more goals, whether headed or in any other fashion, than someone who plays far fewer minutes.
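
These assumed relationships could be checked with a correlation matrix, and the effect of minutes played could be separated out by expressing headed goals as a rate per 90 minutes. The sketch below again uses only the five rows displayed by `stats.head()`, so it illustrates the calls rather than providing evidence. The column name `Headed goals per 90` is introduced here for illustration.

```python
import pandas as pd

# The five rows shown by stats.head(), restricted to the four study variables
sample = pd.DataFrame({
    "Height (cm)": [179.0, 188.0, 169.0, 175.0, 190.0],
    "Total Minutes Played": [11447.0, 11563.0, 6784.0, 2951.0, 12185.0],
    "Total Goals Scored": [62.0, 105.0, 15.0, 32.0, 69.0],
    "Total Headed Goals Scored": [5.0, 14.0, 0.0, 2.0, 14.0],
})

# Headed goals per 90 minutes, so players with different playing time are comparable
sample["Headed goals per 90"] = (
    sample["Total Headed Goals Scored"] / sample["Total Minutes Played"] * 90
)

# Pearson correlation between every pair of columns
corr = sample.corr()
print(corr["Height (cm)"])
```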

In [10]:
print(stats.describe())

             Rank  Height (cm)  Minutes played 17/18  \
count  115.000000   115.000000            116.000000   
mean    57.973913   181.495652           1192.765217   
std     33.325208     6.812494            978.239869   
min      1.000000   169.000000              1.000000   
25%     29.000000   176.000000            255.000000   
50%     58.000000   182.000000           1066.000000   
75%     86.500000   186.000000           1973.500000   
max    114.000000   201.000000           3255.000000   

       Headed Goals Scored 17/18  Goals Scored 17/18  Minutes played 16/17  \
count                 116.000000          116.000000            116.000000   
mean                    0.795259            4.870474            956.019673   
std                     1.343907            5.920454           1055.072757   
min                     0.000000            0.000000              0.000000   
25%                     0.000000            0.000000              0.000000   
50%                     0.0