# Transit Costs Project

---
### Andrew Chu, Donald Bookman
---

### Introduction
---

Why do some transit lines cost more per kilometer than others? We use the Transit Costs Project data to look for answers. The data comes from transitcosts.com, but we used a tidied .csv file from the TidyTuesday GitHub repository. Researchers collected the data from a variety of sources, such as Wikipedia, the media, and city plans. The cases used were Boston, Istanbul, New York, Milan, and London. The main variables of this project are country, city, and cost/km in millions of USD; other variables matter too, but these are the major ones. We believe that transit lines cost more to build per kilometer in New York and California than elsewhere because of the higher cost of living there.

### Data
---

Below are the 20 variables in the Transit Costs dataset:
- e - Identifier used to track entries
- country - Country code, i.e. the abbreviation of the country in which the line is built
- city - The city that the transit line is being built in
- line - The name of the transit line
- start_year - The year that the transit line’s construction was started
- end_year - The year that the transit line’s construction ended (predicted or actual)
- rr - Boolean flag for whether the transit line is a railroad
- length - Proposed length for the transit line in kilometers (km)
- tunnel_per - Percent of the line’s length that is in tunnel (tunnel divided by length)
- tunnel - Length of the line that is in tunnel, in km
- stations - Number of stations where passengers can board/leave the transit line
- source1 - The source of the data entry
- cost - Cost in millions of local currency
- currency - Currency in which the cost is reported
- year - Midpoint year of construction
- ppp_rate - Purchasing power parity (PPP) rate, based on the midpoint year of construction. PPP compares the prices of specific goods across countries to measure the absolute purchasing power of each country's currency
- real_cost - Cost in millions of USD
- cost_km_millions - Cost/km in millions of USD
- source2 - The source of the cost data
- reference - Reference URL for sources
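
Some of these columns need light cleaning before they can be analyzed. The sketch below is one possible approach, not part of the original notebook: it converts the text column tunnel_per (for example "87.72%") to a number and recomputes the share from tunnel and length. The three rows are copied from the dataset preview in this notebook, and the column names `tunnel_pct` and `tunnel_pct_check` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Three rows copied from the dataset preview in this notebook.
sample = pd.DataFrame({
    "line": ["Broadway", "Vaughan", "Ontario"],
    "length": [5.7, 8.6, 15.5],
    "tunnel": [5.0, 8.6, 8.8],
    "tunnel_per": ["87.72%", "100.00%", "57.00%"],
})

# Strip the trailing "%" and convert the text to a float.
# `tunnel_pct` is a hypothetical column name used only for this sketch.
sample["tunnel_pct"] = sample["tunnel_per"].str.rstrip("%").astype(float)

# Recompute the share from the two length columns for comparison.
sample["tunnel_pct_check"] = (sample["tunnel"] / sample["length"] * 100).round(2)

print(sample[["line", "tunnel_pct", "tunnel_pct_check"]])
```

In these rows the recomputed share agrees with the stored percentage to within rounding, which suggests tunnel_per could be rebuilt from tunnel and length if it were missing.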

### Data Analysis Plan
---

The main outcome variable - cost_km_millions

Predictor variables - real_cost, length, and stations
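
One thing to keep in mind when using these predictors: in the rows we inspected, cost_km_millions appears to be real_cost divided by length, and real_cost appears to be cost multiplied by ppp_rate. The following is a minimal check of that arithmetic on three rows copied from the dataset preview in this notebook. It is a sketch on a hand-copied sample, not a test of the whole file, and the column names `recomputed_cost_km` and `recomputed_real_cost` are hypothetical.

```python
import numpy as np
import pandas as pd

# Rows copied from the dataset preview (Vancouver Broadway, Toronto Vaughan,
# Toronto Scarborough).
sample = pd.DataFrame({
    "line": ["Broadway", "Vaughan", "Scarborough"],
    "length": [5.7, 8.6, 7.8],
    "cost": [2830.0, 3200.0, 5500.0],
    "ppp_rate": [0.84, 0.81, 0.84],
    "real_cost": [2377.2, 2592.0, 4620.0],
    "cost_km_millions": [417.052632, 301.395349, 592.307692],
})

# Recompute both derived columns from their apparent inputs.
sample["recomputed_real_cost"] = sample["cost"] * sample["ppp_rate"]
sample["recomputed_cost_km"] = sample["real_cost"] / sample["length"]

# Compare with the stored values, allowing for rounding in the file.
real_cost_matches = np.allclose(sample["recomputed_real_cost"], sample["real_cost"], atol=1e-4)
cost_km_matches = np.allclose(sample["recomputed_cost_km"], sample["cost_km_millions"], atol=1e-4)
print(real_cost_matches, cost_km_matches)
```

If this relationship holds across the file, real_cost and length define the outcome rather than predict it, so it may be more informative to look at them separately (for example, whether shorter lines tend to cost more per kilometer).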

In [2]:
import numpy as np
import pandas as pd

In [5]:
# Load the tidied TidyTuesday copy of the Transit Costs Project data
test = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-05/transit_cost.csv')
test

Unnamed: 0,e,country,city,line,start_year,end_year,rr,length,tunnel_per,tunnel,stations,source1,cost,currency,year,ppp_rate,real_cost,cost_km_millions,source2,reference
0,7136.0,CA,Vancouver,Broadway,2020,2025,0.0,5.7,87.72%,5.0,6.0,Plan,2830.0,CAD,2018.0,0.84,2377.2,417.052632,Media,https://www.translink.ca/Plans-and-Projects/Ra...
1,7137.0,CA,Toronto,Vaughan,2009,2017,0.0,8.6,100.00%,8.6,6.0,Media,3200.0,CAD,2013.0,0.81,2592,301.395349,Media,https://www.thestar.com/news/gta/transportatio...
2,7138.0,CA,Toronto,Scarborough,2020,2030,0.0,7.8,100.00%,7.8,3.0,Wiki,5500.0,CAD,2018.0,0.84,4620,592.307692,Media,https://urbantoronto.ca/news/2020/03/metrolinx...
3,7139.0,CA,Toronto,Ontario,2020,2030,0.0,15.5,57.00%,8.8,15.0,Plan,8573.0,CAD,2019.0,0.84,7201.32,464.601290,Plan,https://metrolinx.files.wordpress.com/2019/07/...
4,7144.0,CA,Toronto,Yonge to Richmond Hill,2020,2030,0.0,7.4,100.00%,7.4,6.0,Plan,5600.0,CAD,2020.0,0.84,4704,635.675676,Media,https://www.thestar.com/news/gta/2020/06/24/me...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
539,,,,,,,,,,,,,,,,,STD,258.744889,,
540,,,,,,,,,,,,,,,,,MIN,7.789626,,
541,,,,,,,,,,,,,,,,,QUARTILE 1,134.863267,215.7812275,
542,,,,,,,,,,,,,,,,,QUARTILE 3,241.428571,386.2857143,
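
The last rows of this output (STD, MIN, QUARTILE 1, QUARTILE 3) look like spreadsheet summary statistics rather than transit projects: they have no identifier in column e and carry a text label in real_cost. Left in place, they would show up in sorts and distort any averages. Below is a minimal sketch of one way to remove them, assuming that a missing e reliably marks a non-project row. The helper `drop_summary_rows` and the small `toy` frame are hypothetical; the values are copied from the output above.

```python
import numpy as np
import pandas as pd

# A tiny stand-in for `test`: two project rows and two summary rows,
# with values copied from the output above.
toy = pd.DataFrame({
    "e": [7136.0, 7137.0, np.nan, np.nan],
    "city": ["Vancouver", "Toronto", None, None],
    "real_cost": ["2377.2", "2592", "STD", "MIN"],
    "cost_km_millions": [417.052632, 301.395349, 258.744889, 7.789626],
})

def drop_summary_rows(df):
    """Keep only rows that have a project identifier in column `e`."""
    projects = df.dropna(subset=["e"]).copy()
    # real_cost may be read as text because of labels such as "MIN";
    # anything that is not a number becomes NaN.
    projects["real_cost"] = pd.to_numeric(projects["real_cost"], errors="coerce")
    return projects

projects = drop_summary_rows(toy)
print(projects)
```

The same call on the full data would be `drop_summary_rows(test)`, under the assumption stated above.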


In [8]:
# Most expensive lines per kilometer first
test_sorted = test.sort_values(by='cost_km_millions', ascending=False)
test_sorted.head(15)

Unnamed: 0,e,country,city,line,start_year,end_year,rr,length,tunnel_per,tunnel,stations,source1,cost,currency,year,ppp_rate,real_cost,cost_km_millions,source2,reference
138,7411.0,US,New York,East Side Access,2007.0,2022.0,1.0,2.8,100.00%,2.8,1.0,Measured,11000.0,USD,2015.0,1.0,11000.0,3928.571429,Media,https://www.nytimes.com/2018/04/25/nyregion/mt...
137,7410.0,US,New York,Second Avenue Phase 2,2019.0,2029.0,0.0,2.6,100.00%,2.6,3.0,Measured,6390.0,USD,2024.0,1.0,6390.0,2457.692308,Plan,https://www.transit.dot.gov/sites/fta.dot.gov/...
139,7416.0,US,New York,Gateway,2019.0,2026.0,1.0,5.3,100.00%,5.3,0.0,Measured,9500.0,USD,2023.0,1.0,9500.0,1792.45283,Plan,https://www.masstransitmag.com/rail/infrastruc...
136,7409.0,US,New York,Second Avenue Phase 1,2007.0,2016.0,0.0,2.7,100.00%,2.7,3.0,Measured,4450.0,USD,2012.0,1.0,4450.0,1648.148148,Media,https://www.nytimes.com/2016/12/19/nyregion/se...
135,7408.0,US,New York,7 extension,2007.0,2014.0,0.0,1.6,100.00%,1.6,1.0,Measured,2400.0,USD,2011.0,1.0,2400.0,1500.0,Plan,http://web.mta.info/nyct/service/new7LineExten...
96,7329.0,SG,Singapore,Circle Line Stage 6,2017.0,2025.0,0.0,4.0,100.00%,4.0,3.0,Media,4850.0,SGD,2021.0,1.13,5480.5,1370.125,Media,https://www.straitstimes.com/singapore/transpo...
140,7417.0,UK,London,Crossrail,2009.0,2021.0,1.0,21.0,100.00%,21.0,8.0,Measured,13328.0,GBP,2015.0,1.4,18659.2,888.533333,FOI,
8,7152.0,US,Los Angeles,Purple Phase 3,2020.0,2027.0,0.0,4.2,100.00%,4.2,2.0,Media,3600.0,USD,2023.0,1.0,3600.0,857.142857,Media,https://la.streetsblog.org/2020/03/24/metro-si...
150,7435.0,NZ,Auckland,City Rail Link,2013.0,2024.0,1.0,3.5,100.00%,3.5,4.0,Wiki,4419.0,NZD,2018.0,0.677,2991.663,854.760857,Plan,https://ourauckland.aucklandcouncil.govt.nz/ar...
71,7280.0,AU,Melbourne,Metro Tunnel,2018.0,2025.0,0.0,9.0,100.00%,9.0,5.0,Plan,11000.0,AUD,2021.0,0.69,7590.0,843.333333,Plan,https://metrotunnel.vic.gov.au/about-the-proje...


In [9]:
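# Cheapest lines per kilometer first. Summary rows appended to the CSV
# (for example MIN) have no identifier in `e` and can sort to the top.
test_sorted = test.sort_values(by='cost_km_millions')
test_sorted.head(15)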

Unnamed: 0,e,country,city,line,start_year,end_year,rr,length,tunnel_per,tunnel,stations,source1,cost,currency,year,ppp_rate,real_cost,cost_km_millions,source2,reference
540,,,,,,,,,,,,,,,,,MIN,7.789626,,
482,8139.0,CN,Beijing,Capital Airport Express,2005.0,2008.0,0.0,28.1,0.00%,0.0,4.0,Measured,622.3,CNY,2005.0,0.3517,218.89,7.79,Media,http://news.sciencenet.cn/sbhtmlnews/2008/8/20...
249,7634.0,TR,Bursa,Bursaray,1997.0,2014.0,0.0,39.0,16.03%,6.25,38.0,Media,425.0,EUR,2006.0,2.1,892.5,22.88,Media,see scope
121,7378.0,ES,Madrid,1995-98 program,1995.0,1998.0,0.0,56.0,68.00%,38.0,37.0,Trade,1579.0,EUR,1997.0,1.25,1973.75,35.245536,Trade,https://tunnelbuilder.com/metrosur/edition2pdf...
252,7658.0,CN,Shanghai,Line 2 Eastern Extension 1,1999.0,2000.0,0.0,2.72,31.99%,0.87,1.0,Wiki,314.77,CNY,1999.0,0.37,116.46,42.82,Plan,https://zh.wikipedia.org/wiki/%E4%B8%8A%E6%B5%...
247,7632.0,TR,Izmir,M1 Phase 3,2010.0,2012.0,0.0,2.25,100.00%,2.25,2.0,Plan,106.0,TRY,2011.0,1.0,106,47.11,Media,https://www.uab.gov.tr/uploads/pages/kutuphane...
243,7624.0,TR,Istanbul,CR3,2011.0,2019.0,1.0,63.0,0.00%,0.0,36.0,Plan,1298.0,USD,2015.0,2.3,2985.4,47.39,Trade,
17,7169.0,BG,Sofia,Line 1 southern,2013.0,2015.0,0.0,3.0,100.00%,3.0,3.0,Media,44.0,EUR,2014.0,3.32,146.08,48.693333,Media,https://www.novinite.com/articles/168384/Subwa...
183,7504.0,JP,Osaka,Higashi Line,1999.0,2019.0,1.0,20.3,0.00%,0.0,0.0,Plan,120000.0,JPY,2009.0,0.0087,1044,51.428571,Plan,https://www.westjr.co.jp/global/en/ir/library/...
246,7627.0,TR,Izmir,M1 Phase 2,2005.0,2014.0,0.0,5.35,100.00%,5.35,6.0,Plan,252.0,TRY,2010.0,1.1,277.2,51.81,Media,https://www.uab.gov.tr/uploads/pages/kutuphane...


In our preliminary exploratory analysis, we sorted the data by the variable “cost_km_millions” in descending order and found that the five most expensive lines per kilometer are all in New York. This suggests that New York is the least cost-efficient city per kilometer in the dataset. Looking at the most cost-efficient transit lines (ignoring the MIN summary row at the top of that output), we noticed that many were extensions or phases of larger, existing transit lines.
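
To move from individual lines to cities, the rows can be grouped by city and summarized. The sketch below does this for the ten most expensive lines shown in the output above. The values are copied by hand, and ten rows are far too few to draw conclusions from, so this only illustrates the mechanics. The frame `top` is a hypothetical stand-in for the sorted data.

```python
import pandas as pd

# The ten most expensive lines per km, copied from the sorted output above.
top = pd.DataFrame({
    "city": ["New York", "New York", "New York", "New York", "New York",
             "Singapore", "London", "Los Angeles", "Auckland", "Melbourne"],
    "line": ["East Side Access", "Second Avenue Phase 2", "Gateway",
             "Second Avenue Phase 1", "7 extension", "Circle Line Stage 6",
             "Crossrail", "Purple Phase 3", "City Rail Link", "Metro Tunnel"],
    "cost_km_millions": [3928.571429, 2457.692308, 1792.45283, 1648.148148,
                         1500.0, 1370.125, 888.533333, 857.142857,
                         854.760857, 843.333333],
})

# Number of lines and median cost per km for each city, highest median first.
summary = (
    top.groupby("city")["cost_km_millions"]
       .agg(["count", "median"])
       .sort_values("median", ascending=False)
)
print(summary)
```

The median is used here instead of the mean so that a single extreme project, such as East Side Access, does not dominate a city's figure.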

As we comb through the data, we believe that organizing it as a histogram of cost_km_millions would help visualize it, especially since we are comparing the cost efficiency of lines across countries.
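
As a rough illustration of what such a histogram involves, the sketch below counts lines per cost band with `np.histogram`. The 24 values are the cost_km_millions figures visible in the outputs above (the first five preview rows, the ten most expensive lines, and nine of the cheapest). The sample is deliberately skewed toward the extremes and is not the full distribution. The bin edges are an arbitrary choice made for this example.

```python
import numpy as np

# cost_km_millions values copied from the outputs shown in this notebook.
cost_km = np.array([
    # first five preview rows
    417.052632, 301.395349, 592.307692, 464.601290, 635.675676,
    # ten most expensive lines
    3928.571429, 2457.692308, 1792.45283, 1648.148148, 1500.0,
    1370.125, 888.533333, 857.142857, 854.760857, 843.333333,
    # nine of the cheapest lines
    7.79, 22.88, 35.245536, 42.82, 47.11, 47.39, 48.693333, 51.428571, 51.81,
])

# Arbitrary cost bands in millions of USD per km, chosen for this sketch.
bin_edges = [0, 250, 500, 1000, 2000, 4000]
counts, edges = np.histogram(cost_km, bins=bin_edges)

# Print one line per band with its count.
for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low:>6.0f} to {high:>6.0f}: {n}")
```

With matplotlib installed, the same bands could be drawn directly from the data frame, for example with `test['cost_km_millions'].plot.hist(bins=bin_edges)`. Because the values span several orders of magnitude, a logarithmic axis may be worth trying.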

After organizing the data into histograms, we hope to see New York as the city with the highest cost per kilometer, which would support our hypothesis. However, if the data supports another theory, we will tell that story.