Introduction for Python 6022 Assignment

After the end of the Second World War, many governments around the world believed that cars and airplanes were the future of transport and invested heavily in airports and highways. While cars and planes are indeed useful forms of transportation and certainly have their place in the world economy, another form of transportation can transform and revitalize the cities and regions in which it is built: high-speed rail. High-speed rail (HSR) is not a new technology, yet the environmental benefits and positive socio-economic externalities it can bring to communities are considerable. Large reductions in greenhouse gas emissions, less congestion on roads and in the air, and the spurring of development along corridors are just a few of them.
Within the last few decades, many countries in Europe have chosen to create and expand their HSR networks, connecting populations with a faster, greener, and more convenient mode of travel. However, there are still major gaps in those networks, especially across international borders. Given the EU's goal to become a climate-neutral economy by 2050, it is crucial that investments in high-speed rail are carried out in a timely manner. The remaining question is where such systems should be built. This report therefore attempts to answer the following question:
Which city-pairs in Europe could stand to benefit the most from a high-speed rail connection?
The answer to this question is based on three main factors:
a.	The number of current air passengers. 
b.	The distance between city centers.
c.	The metropolitan population of both cities.
In addition, two sub-questions were devised to give better context to the main question:
1.	Which city pairs already have the highest number of air passengers?
2.	Which city pairs fall within the set distance range?

This report focuses only on city pairs that do not already have an HSR line between them. What constitutes a high-speed rail line was defined according to Category 3 of European Union Directive 96/48/EC, Annex 1, which requires tracks to be designed for high speeds, permitting a maximum speed of at least 200 km/h. For the purposes of this report, a route that reaches 250 km/h on only one segment, rather than along the whole length of the line, is considered a normal-speed line and is thus eligible for inclusion in the study.
The geographical area of the report was constrained to six EU economies: France, Germany, Spain, Belgium, Luxembourg, and the Netherlands. These countries were chosen due to their economic and political strength within the EU, their well-developed rail networks, and their likelihood of investing further to improve said networks.
High-speed rail networks are expensive to build, and there are several reasons for not building an HSR connection between two cities, the foremost of which is distance. After consulting multiple sources, the ideal range for an HSR connection between two cities was determined to be between 150 and 1200 km. This range was considered appropriate because it is where the door-to-door travel time for HSR is most competitive with other options. For shorter distances, the time difference between HSR and normal-speed rail does not justify the higher cost of an HSR line, while longer distances face heavy competition from airplanes, especially if those routes cross large bodies of water and/or mountain ranges.
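As a minimal sketch of how this distance criterion could be applied in pandas, the snippet below keeps only the city pairs whose distance falls between 150 and 1200 km. The column names, the placeholder cities and their distances, and the choice to treat both bounds as inclusive are illustrative assumptions and are not taken from the project notebook.

```python
import pandas as pd

# Distance band from the text; treating both bounds as inclusive is an assumption.
MIN_KM, MAX_KM = 150, 1200

# Hypothetical city pairs with made-up distances, for illustration only.
pairs = pd.DataFrame(
    {
        "city_a": ["A", "A", "B"],
        "city_b": ["B", "C", "C"],
        "distance_km": [90.0, 640.0, 1500.0],
    }
)

# Series.between is inclusive on both ends by default.
within_band = pairs["distance_km"].between(MIN_KM, MAX_KM)
in_range = pairs[within_band].reset_index(drop=True)
print(in_range)
```

Only the A-C pair survives here: the first pair is too short for HSR to pay off, and the last is long enough that air travel would dominate.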
Data from Eurostat was used due to its breadth, good formatting, and ease of access. A four-year timespan, from 2016 to 2019, was chosen so that passenger numbers could be compared from year to year and trends could be extrapolated. Data from 2020 onwards was not used due to the large impact of the COVID-19 pandemic and the subsequent lockdowns on the global economy and on passenger traffic between countries.
No lower population limit was placed on cities; they were simply sorted from largest to smallest population.
Geographic features can make a large difference to capital costs, and so were only considered on a case-by-case basis. For example, Frankfurt-Mallorca is a very large route, but a bridge across the Mediterranean would be financially ill-advised.

Data Pipeline
Datasets used
•	Eurostat air passenger data 
Six datasets from Eurostat were used, one for each of the countries: Germany, the Netherlands, Belgium, Luxembourg, Spain, and France [1]. The datasets were processed to extract information on:
o	Year [2016, 2017, 2018, 2019]
o	Origin airport code
o	Destination airport code
o	Number of air passengers carried
•	City name dataset
Since no relevant dataset was found, the airport codes from the Eurostat dataset were given to ChatGPT to create a table with two columns. The first is the 4-letter airport code and the second is the name of the city in which the airport is located. This table was then saved to Excel and imported into the Jupyter notebook.
•	NUTS-3 data
To define the service area of each airport, we decided to use NUTS-3 (Nomenclature of Territorial Units for Statistics) regions. NUTS-3 is the most detailed level of the NUTS classification of territorial statistical units within the European Union, and a region usually covers a large municipality or a group of municipalities. Information on the distances between pairs of NUTS-3 regions is available from Mendeley Data [2]. Furthermore, information on the number of inhabitants of each NUTS-3 region is available from Eurostat [3].
•	High-speed rail information 
Information about the already existing high-speed rail infrastructure was taken from multiple sources and manually combined into a dataset containing a total of 33 unique rail routes. High-speed rail (HSR) refers to rail systems designed to operate at speeds exceeding 250 km/h (155 mph) on specially built tracks. While trains may need to reduce speed in certain areas, such as urban environments, most of the tracks should be suitable for high-speed travel.
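As a rough sketch of how the manually compiled HSR dataset could be used to exclude city pairs that already have a high-speed connection, the snippet below compares pairs in an order-independent way, so that A-B and B-A count as the same route. The helper `pair_key`, the column names and the placeholder city names are hypothetical and only illustrate the idea.

```python
import pandas as pd


def pair_key(city_a: str, city_b: str) -> tuple:
    """Return an order-independent key so that A-B and B-A compare equal."""
    return tuple(sorted((city_a, city_b)))


# Hypothetical candidate city pairs (placeholder names, not real data).
candidates = pd.DataFrame(
    {
        "origin": ["X", "Y", "X"],
        "destination": ["Y", "Z", "Z"],
    }
)

# Hypothetical excerpt of the HSR route table; note the reversed order of X and Y.
hsr_routes = pd.DataFrame({"city_a": ["Y"], "city_b": ["X"]})

# Set of existing HSR connections, independent of direction.
hsr_keys = {pair_key(a, b) for a, b in zip(hsr_routes["city_a"], hsr_routes["city_b"])}

# Flag every candidate pair that already has an HSR line, then drop it.
has_hsr = pd.Series(
    [pair_key(o, d) in hsr_keys for o, d in zip(candidates["origin"], candidates["destination"])],
    index=candidates.index,
)
eligible = candidates[~has_hsr].reset_index(drop=True)
print(eligible)
```

Sorting the two city names before comparing avoids having to list every route twice in the HSR table.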

Step 1: Air passenger data processing
-	The data is loaded into a single dataframe using the Pandas library;
-	Separate origin and destination columns are created in the dataframe from the text containing the countries and airport codes;
-	The dataframe is grouped by year; from this point on, processing takes place in a distinct dataframe for each year;
-	The city name dataset is imported;
-	Each airport code is matched with its city name, and the city names for both origin and destination are added to the dataframe in two new columns;
-	The number of passengers per city pair is obtained by adding the values for travel in both directions, and the data for cities with multiple airports is also combined;
-	The dataframes are sorted by the number of passengers;
-	The data is exported to Excel.
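The steps above can be sketched roughly as follows. This is a simplified illustration rather than the notebook's actual code: the rows are made up, the layout of the airport-pair text (origin country, origin airport, destination country, destination airport, separated by underscores) is an assumption, and all column names and airport codes are placeholders. For brevity, the split by year is done at the end instead of before the city matching.

```python
import pandas as pd

# Made-up rows; the "CC_AIRP_CC_AIRP" layout of the pair text is an assumption.
raw = pd.DataFrame(
    {
        "year": [2019, 2019, 2019, 2019],
        "airport_pair": [
            "AA_AAA1_BB_BBB1",
            "BB_BBB1_AA_AAA1",  # same route, opposite direction
            "AA_AAA2_BB_BBB1",  # second airport of the same city
            "AA_AAA1_CC_CCC1",
        ],
        "passengers": [100, 90, 40, 70],
    }
)

# Hypothetical airport-to-city table (in the report this comes from an Excel file).
airport_to_city = {"AAA1": "Alpha", "AAA2": "Alpha", "BBB1": "Beta", "CCC1": "Gamma"}

# Split the pair text into origin and destination airport codes.
parts = raw["airport_pair"].str.split("_", expand=True)
raw["origin"] = parts[1]
raw["destination"] = parts[3]

# Match airport codes with city names.
raw["origin_city"] = raw["origin"].map(airport_to_city)
raw["destination_city"] = raw["destination"].map(airport_to_city)

# Order the two cities alphabetically so both directions share one key.
ordered = [sorted(p) for p in zip(raw["origin_city"], raw["destination_city"])]
raw["city_a"] = [p[0] for p in ordered]
raw["city_b"] = [p[1] for p in ordered]

# Sum both directions and all airports of a city, then sort by passengers.
totals = (
    raw.groupby(["year", "city_a", "city_b"], as_index=False)["passengers"]
    .sum()
    .sort_values("passengers", ascending=False)
    .reset_index(drop=True)
)

# One dataframe per year, as described in the steps above.
per_year = {year: df.reset_index(drop=True) for year, df in totals.groupby("year")}
print(per_year[2019])

# Export step (hypothetical file name; needs an Excel writer such as openpyxl):
# per_year[2019].to_excel("passengers_2019.xlsx", index=False)
```

In this toy example the Alpha-Beta pair ends up with 230 passengers, because both directions and both Alpha airports are added together.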

Sources:
1.	Luxembourg: https://ec.europa.eu/eurostat/databrowser/bookmark/59d1d1c3-3fb9-4c23-8c26-dcaa43d3bec3?lang=en
Germany: https://ec.europa.eu/eurostat/databrowser/bookmark/7f2a3141-2faf-4797-a7f1-cbdb017aa3ef?lang=en
Netherlands: https://ec.europa.eu/eurostat/databrowser/bookmark/1b1c8c91-6dfb-4f19-8202-919ff6cabf99?lang=en
Belgium: https://ec.europa.eu/eurostat/databrowser/bookmark/6f8ab896-8184-4960-8ef5-2802fa4a68bd?lang=en
France: https://ec.europa.eu/eurostat/databrowser/bookmark/8af738b1-f200-42f0-b00d-60dce508aa50?lang=en
Spain: https://ec.europa.eu/eurostat/databrowser/bookmark/89dbcc52-553b-4eee-b779-0a805244a841?lang=en
2.	https://data.mendeley.com/datasets/hvjzvpfgbp/1
3.	https://ec.europa.eu/eurostat/databrowser/view/demo_r_pjanaggr3/default/table?lang=en&category=reg.reg_dem.reg_dempoar
