# Programming for Data Analytics - Project - Gerard Ball

> For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose.
>
> Specifically, in this project you should:
>
> - Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
> - Investigate the types of variables involved, their likely distributions, and their relationships with each other.
> - Synthesise/simulate a data set as closely matching their properties as possible.
> - Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.
>
> Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set.




## Road Map
1. Introduction 
2. Aim
3. Images of subjects
4. Data Collection
5. Data Synthesis
6. Exploratory Data Analysis
7. Data Visualisation
8. Statistical Analysis
9. Interpretations of results and Discussions
10. Conclusion


## Introduction

Roller coasters offer a budding adrenaline junkie a release from the trials and tribulations of everyday life. Coasters come in all manner of sizes and types, and understanding the relationships between their characteristics makes for a worthwhile data analysis. The mission of this project is to simulate and synthesise a diverse set of roller coasters, capturing variables such as speed, height, type and thrill rating. Through this synthesis, the project aims to create a varied representation of roller coasters and their many types. The dataset will serve as a resource for analysing the relationships between different coaster characteristics. By leveraging this simulated data, I strive to enhance understanding and appreciation of the factors contributing to the thrill and excitement offered by these marvels of modern engineering, whilst offering potential insights for enthusiasts, theme park planners and the amusement industry itself.

## Data Collection

In [None]:
import pandas as pd
import numpy as np

np.random.seed(42)

# Number of data points
num_points = 100
roller_coaster_data = pd.DataFrame({
    'Roller Coaster': [f'Coaster_{i+1}' for i in range(num_points)],
    'Type': np.random.choice(['Steel', 'Wooden', 'Hybrid'], size=num_points),
    'Speed (km/h)': np.random.normal(loc=100, scale=15, size=num_points),
    'Height (m)': np.random.normal(loc=50, scale=15, size=num_points),
})

# Steel coasters are faster and taller on average, and so score a higher thrill rating
steel_mask = roller_coaster_data['Type'] == 'Steel'
roller_coaster_data.loc[steel_mask, 'Speed (km/h)'] += 20
roller_coaster_data.loc[steel_mask, 'Height (m)'] += 10

# Thrill rating: weighted sum of speed and height plus normally distributed noise
roller_coaster_data['Thrill Rating'] = (
    0.3 * roller_coaster_data['Speed (km/h)'] +
    0.4 * roller_coaster_data['Height (m)'] +
    np.random.normal(loc=8, scale=1, size=num_points)
)

# Save the synthesised data set for the analysis cells below
roller_coaster_data.to_csv('coasterss.csv', index=False)

In [2]:
import pandas as pd
synthesised_data = pd.read_csv('coasterss.csv')

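# Mean and median of the numeric columns only
# (numeric_only=True skips the text columns 'Roller Coaster' and 'Type')
mean_values = synthesised_data.mean(numeric_only=True)
median_values = synthesised_data.median(numeric_only=True)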

print("Mean values:")
print(mean_values)

print("\nMedian values:")
print(median_values)

Mean values:
Speed (km/h)     107.954113
Height (m)        55.959383
Thrill Rating     62.919193
dtype: float64

Median values:
Speed (km/h)     106.980178
Height (m)        55.295932
Thrill Rating     62.981672
dtype: float64
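As a quick sanity check, the overall means can be compared against the generating formula. Because each thrill rating is built as 0.3 × speed + 0.4 × height + noise, the same identity holds for the column means. The short sketch below simply copies the three means printed above and works out what the noise term must have averaged; the variable names are illustrative only.

```python
# Overall means copied from the printed output above
mean_speed = 107.954113
mean_height = 55.959383
mean_thrill = 62.919193

# Thrill Rating = 0.3 * speed + 0.4 * height + noise for every row,
# so the column means satisfy the same relationship.
deterministic_part = 0.3 * mean_speed + 0.4 * mean_height
implied_noise_mean = mean_thrill - deterministic_part

print(round(deterministic_part, 3))
print(round(implied_noise_mean, 3))
```

The speed and height contribution comes to roughly 54.77, leaving about 8.15 for the mean of the noise term. The noise was drawn with `loc=8` and `scale=1`, so the mean of 100 draws has a standard error of 0.1, and a value near 8.15 is within ordinary sampling variation.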




In [1]:
import pandas as pd
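# File name must match the one written earlier (case matters on some systems)
roller_coaster_data = pd.read_csv('coasterss.csv')
grouped_data = roller_coaster_data.groupby('Type')
# numeric_only=True skips the text column 'Roller Coaster'
mean_values = grouped_data.mean(numeric_only=True)
median_values = grouped_data.median(numeric_only=True)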

print("Mean values for each type:")
print(mean_values)

print("\nMedian values for each type:")
print(median_values)

Mean values for each type:
        Speed (km/h)  Height (m)  Thrill Rating
Type                                           
Hybrid    100.588160   52.768262      59.414376
Steel     123.042441   60.323624      69.158773
Wooden    100.466050   54.706740      60.217614

Median values for each type:
        Speed (km/h)  Height (m)  Thrill Rating
Type                                           
Hybrid    102.633299   52.622495      58.483179
Steel     119.889298   60.227149      69.147440
Wooden     99.573049   54.025590      60.448528
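The group summaries show that Steel coasters sit higher on all three measures, but they do not show how strongly the variables move together. One possible next step is a correlation matrix. The sketch below is an assumption-laden illustration rather than part of the original analysis: `make_coasters` is a hypothetical helper that rebuilds the data with the same recipe, using `np.random.RandomState(42)` so that the global random state is left alone. It is intended to follow the same draws as the seeded cell above, though that has not been verified against the saved CSV.

```python
import numpy as np
import pandas as pd


def make_coasters(num_points=100, seed=42):
    """Hypothetical helper: rebuild a data set with the recipe used above."""
    rs = np.random.RandomState(seed)  # legacy generator, independent of global state
    df = pd.DataFrame({
        'Type': rs.choice(['Steel', 'Wooden', 'Hybrid'], size=num_points),
        'Speed (km/h)': rs.normal(loc=100, scale=15, size=num_points),
        'Height (m)': rs.normal(loc=50, scale=15, size=num_points),
    })
    # Steel coasters get the same speed and height offsets as above
    steel = df['Type'] == 'Steel'
    df.loc[steel, 'Speed (km/h)'] += 20
    df.loc[steel, 'Height (m)'] += 10
    # Weighted sum of speed and height plus noise
    df['Thrill Rating'] = (
        0.3 * df['Speed (km/h)']
        + 0.4 * df['Height (m)']
        + rs.normal(loc=8, scale=1, size=num_points)
    )
    return df


# Pearson correlation between the numeric columns
numeric_cols = ['Speed (km/h)', 'Height (m)', 'Thrill Rating']
corr = make_coasters()[numeric_cols].corr()
print(corr.round(2))
```

Since the thrill rating is constructed from speed and height, it should correlate positively with both; the more informative entry is the speed and height pair, which are drawn independently and are linked only through the shared Steel offset.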


