# Aim of this notebook

In this notebook I walk through each step of EDA (Exploratory Data Analysis) to gain more practical experience with the technique.

## About The Rollercoaster Database
This dataset contains information about more than 1,000 rollercoasters. The information was scraped from Wikipedia.

https://www.kaggle.com/datasets/robikscube/rollercoaster-database/

## Importing and configuring libraries

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
pd.set_option('display.max_columns', 200)

## Loading the Dataset

In [58]:
df = pd.read_csv('./data/coaster_db.csv')

## Step 1: Data Understanding
- shape of the dataset - DataFrame `shape`
- examples of different observations - `head` and `tail`
- types of data - `dtypes`
- basic information about the distribution of numerical features - `describe`

In [59]:
print(df.shape)

(1087, 56)


In [61]:
df.head(10)

Unnamed: 0,coaster_name,Length,Speed,Location,Status,Opening date,Type,Manufacturer,Height restriction,Model,Height,Inversions,Lift/launch system,Cost,Trains,Park section,Duration,Capacity,G-force,Designer,Max vertical angle,Drop,Soft opening date,Fast Lane available,Replaced,Track layout,Fastrack available,Soft opening date.1,Closing date,Opened,Replaced by,Website,Flash Pass Available,Must transfer from wheelchair,Theme,Single rider line available,Restraint Style,Flash Pass available,Acceleration,Restraints,Name,year_introduced,latitude,longitude,Type_Main,opening_date_clean,speed1,speed2,speed1_value,speed1_unit,speed_mph,height_value,height_unit,height_ft,Inversions_clean,Gforce_clean
0,Switchback Railway,600 ft (180 m),6 mph (9.7 km/h),Coney Island,Removed,"June 16, 1884",Wood,LaMarcus Adna Thompson,,Lift Packed,50 ft (15 m),,gravity,,,Coney Island Cyclone Site,1:00,1600 riders per hour,2.9,LaMarcus Adna Thompson,30°,43 ft (13 m),,,,Gravity pulled coaster,,,,,,,,,,,,,,,,1884,40.574,-73.978,Wood,1884-06-16,6 mph,9.7 km/h,6.0,mph,6.0,50.0,ft,,0,2.9
1,Flip Flap Railway,,,Sea Lion Park,Removed,1895,Wood,Lina Beecher,,,,1.0,,,a single car. Riders are arranged 1 across in ...,,,,12.0,Lina Beecher,,,,,,,,,1902,,,,,,,,,,,,,1895,40.578,-73.979,Wood,1895-01-01,,,,,,,,,1,12.0
2,Switchback Railway (Euclid Beach Park),,,"Cleveland, Ohio, United States",Closed,,Other,,,,,,,,,,,,,,,,,,,,,,,1895.0,,,,,,,,,,,,1896,41.58,-81.57,Other,,,,,,,,,,0,
3,Loop the Loop (Coney Island),,,Other,Removed,1901,Steel,Edwin Prescott,,,,1.0,,,a single car. Riders are arranged 2 across in ...,,,,,Edward A. Green,,,,,Switchback Railway,,,,1910,,Giant Racer,,,,,,,,,,,1901,40.5745,-73.978,Steel,1901-01-01,,,,,,,,,1,
4,Loop the Loop (Young's Pier),,,Other,Removed,1901,Steel,Edwin Prescott,,,,1.0,,,,,,,,Edward A. Green,,,,,,,,,1912,,,,,,,,,,,,,1901,39.3538,-74.4342,Steel,1901-01-01,,,,,,,,,1,
5,Cannon Coaster,,,Coney Island,Removed,1902,Wood,George Francis Meyer,,,40 ft (12 m),,,,,,,,,,,,,,,,,,1907,,,,,,,,,,,,,1902,40.575,-73.98,Wood,1902-01-01,,,,,,40.0,ft,,0,
6,Leap-The-Dips,"1,452 ft (443 m)",10 mph (16 km/h),Lakemont Park,Operating,1902,Wood – Side friction,Federal Construction Company,,,41 ft (12 m),,,,,,1:00,,,Edward Joy Morris,25°,9 ft (2.7 m),,,,,,,,,,,,,,,,,,,,1902,,,Wood,1902-01-01,10 mph,16 km/h,10.0,mph,10.0,41.0,ft,,0,
7,Figure Eight (Euclid Beach Park),,,"Cleveland, Ohio, United States",Closed,,Other,,,,,,,,,,,,,,,,,,,,,,,1895.0,,,,,,,,,,,,1904,41.58,-81.57,Other,,,,,,,,,,0,
8,Drop the Dip,,,Coney Island,Removed,"June 6, 1907",Other,Arthur Jarvis,,,60 ft (18 m),,,,,,1 minute 30 seconds,,,"Christopher Feucht, Welcome Mosley",,,,,,,,,1930s,,,,,,,,,,,,,1907,40.5744,-73.9786,Other,1907-06-06,,,,,,60.0,ft,,0,
9,Scenic Railway (Euclid Beach Park),,,"Cleveland, Ohio, United States",Closed,,Other,,,,,,,,,,,,,,,,,,,,,,,1895.0,,,,,,,,,,,,1907,41.58,-81.57,Other,,,,,,,,,,0,


In [60]:
print(df.dtypes)

coaster_name                      object
Length                            object
Speed                             object
Location                          object
Status                            object
Opening date                      object
Type                              object
Manufacturer                      object
Height restriction                object
Model                             object
Height                            object
Inversions                       float64
Lift/launch system                object
Cost                              object
Trains                            object
Park section                      object
Duration                          object
Capacity                          object
G-force                           object
Designer                          object
Max vertical angle                object
Drop                              object
Soft opening date                 object
Fast Lane available               object
Replaced        

As we can see, several columns describe the same quantity, e.g. `Speed`, `speed1`, `speed2`, `speed1_value`, `speed_mph`.

We can remove the redundant columns to keep the dataset clear and leave only one of them (in this case `speed_mph`).
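
One way to drop the duplicated speed columns is `DataFrame.drop(columns=...)`. The sketch below is only an illustration: it runs on a small hand-made frame (`toy`, a hypothetical name) that copies two rows from the `head` output instead of the full dataset, and the `redundant` list is my assumption about which columns to remove.

```python
import pandas as pd

# Toy frame with the speed-related columns; two rows copied from df.head().
toy = pd.DataFrame({
    "coaster_name": ["Switchback Railway", "Leap-The-Dips"],
    "Speed": ["6 mph (9.7 km/h)", "10 mph (16 km/h)"],
    "speed1": ["6 mph", "10 mph"],
    "speed2": ["9.7 km/h", "16 km/h"],
    "speed1_value": [6.0, 10.0],
    "speed1_unit": ["mph", "mph"],
    "speed_mph": [6.0, 10.0],
})

# Assumed list of columns that repeat the information held in speed_mph.
redundant = ["Speed", "speed1", "speed2", "speed1_value", "speed1_unit"]

# drop() returns a new frame; the original stays untouched.
trimmed = toy.drop(columns=redundant)
print(trimmed.columns.tolist())
```

The same call should work on `df` itself. Because `drop` returns a copy, the original frame is still available if a removed column turns out to be needed later.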

In [66]:
df.columns

Index(['coaster_name', 'Length', 'Speed', 'Location', 'Status', 'Opening date',
       'Type', 'Manufacturer', 'Height restriction', 'Model', 'Height',
       'Inversions', 'Lift/launch system', 'Cost', 'Trains', 'Park section',
       'Duration', 'Capacity', 'G-force', 'Designer', 'Max vertical angle',
       'Drop', 'Soft opening date', 'Fast Lane available', 'Replaced',
       'Track layout', 'Fastrack available', 'Soft opening date.1',
       'Closing date', 'Opened', 'Replaced by', 'Website',
       'Flash Pass Available', 'Must transfer from wheelchair', 'Theme',
       'Single rider line available', 'Restraint Style',
       'Flash Pass available', 'Acceleration', 'Restraints', 'Name',
       'year_introduced', 'latitude', 'longitude', 'Type_Main',
       'opening_date_clean', 'speed1', 'speed2', 'speed1_value', 'speed1_unit',
       'speed_mph', 'height_value', 'height_unit', 'height_ft',
       'Inversions_clean', 'Gforce_clean'],
      dtype='object')

In [None]:

columns = [
  'coaster_name', 'Length', 'Speed', 'Location', 'Status', 'Opening date',
  'Type', 'Manufacturer', 'Height restriction', 'Model', 'Height',
  'Inversions', 'Lift/launch system', 'Cost', 'Trains', 'Park section',
  'Duration', 'Capacity', 'G-force', 'Designer', 'Max vertical angle',
  'Drop', 'Soft opening date', 'Fast Lane available', 'Replaced',
  'Track layout', 'Fastrack available', 'Soft opening date.1',
  'Closing date', 'Opened', 'Replaced by', 'Website',
  'Flash Pass Available', 'Must transfer from wheelchair', 'Theme',
  'Single rider line available', 'Restraint Style',
  'Flash Pass available', 'Acceleration', 'Restraints', 'Name',
  'year_introduced', 'latitude', 'longitude', 'Type_Main',
  'opening_date_clean', 'speed1', 'speed2', 'speed1_value', 'speed1_unit',
  'speed_mph', 'height_value', 'height_unit', 'height_ft',
  'Inversions_clean', 'Gforce_clean'
]

In [62]:
df.describe()

Unnamed: 0,Inversions,year_introduced,latitude,longitude,speed1_value,speed_mph,height_value,height_ft,Inversions_clean,Gforce_clean
count,932.0,1087.0,812.0,812.0,937.0,937.0,965.0,171.0,1087.0,362.0
mean,1.54721,1994.986201,38.373484,-41.595373,53.850374,48.617289,89.575171,101.996491,1.326587,3.824006
std,2.114073,23.475248,15.516596,72.285227,23.385518,16.678031,136.246444,67.329092,2.030854,0.989998
min,0.0,1884.0,-48.2617,-123.0357,5.0,5.0,4.0,13.1,0.0,0.8
25%,0.0,1989.0,35.03105,-84.5522,40.0,37.3,44.0,51.8,0.0,3.4
50%,0.0,2000.0,40.2898,-76.6536,50.0,49.7,79.0,91.2,0.0,4.0
75%,3.0,2010.0,44.7996,2.7781,63.0,58.0,113.0,131.2,2.0,4.5
max,14.0,2022.0,63.2309,153.4265,240.0,149.1,3937.0,377.3,14.0,12.0
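
The `count` row of the summary differs a lot between columns (for example 171 for `height_ft` against 1087 for `year_introduced`), which means many values are missing. A quick way to see this directly is the share of missing values per column. The sketch below uses a small frame with made-up gaps (`toy` and `missing_share` are hypothetical names), not the real data.

```python
import numpy as np
import pandas as pd

# Illustrative frame: the NaN pattern is invented for the example.
toy = pd.DataFrame({
    "speed_mph": [6.0, np.nan, 10.0, np.nan],
    "height_ft": [np.nan, np.nan, np.nan, 13.1],
    "year_introduced": [1884, 1895, 1902, 1907],
})

# isna() marks missing cells; mean() of booleans gives the missing fraction.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running `df.isna().mean()` on the full dataset would give the same kind of ranking and can help decide which columns are too sparse to analyse.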
