# Step 1 - Importing libraries and CSV files

In [None]:
import gdown
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import requests
from bs4 import BeautifulSoup
import big_project_functions as bpf

**This CSV file contains information on 1 million flights that occurred between April and August 2022.**

In [None]:
df = bpf.uploading_flight_data()
df

**This CSV document entitled 'Airport Code Dataframe' lists airport codes and their corresponding cities.**

In [None]:
df_airport_codes = bpf.uploading_airport_code_data()
df_airport_codes

# Step 2 - Merging the two data frames

**2.1 Merging the columns 'city' and 'three-digit code' from the Airport Codes Dataframe into the main dataframe. All other columns are dropped.**

**2.2 Merging the two dataframes to display the city corresponding to each airport code.**

In [None]:
merged_df = bpf.add_city_columns(df, df_airport_codes)
merged_df
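
The implementation of `bpf.add_city_columns` is not shown in this notebook. As a rough sketch of the merge described in 2.1 and 2.2, assuming hypothetical column names (`from_airport_code`, `dest_airport_code`, `code`, `city`) and toy data, two left joins against the airport-code table could look like this:

```python
import pandas as pd

# Toy stand-ins for the real dataframes. All column names are assumptions.
flights = pd.DataFrame({
    "from_airport_code": ["ALG", "ALG", "CDG"],
    "dest_airport_code": ["CDG", "DOH", "ALG"],
})
airport_codes = pd.DataFrame({
    "code": ["ALG", "CDG"],          # DOH is deliberately missing
    "city": ["Algiers", "Paris"],
    "country": ["Algeria", "France"],  # extra column that gets dropped
})

def add_city_columns_sketch(flights, airport_codes):
    """Attach departure and arrival city names with two left joins."""
    # Keep only the code and city columns, as described in 2.1.
    lookup = airport_codes[["code", "city"]]
    out = flights.merge(
        lookup.rename(columns={"code": "from_airport_code", "city": "departure_city"}),
        on="from_airport_code",
        how="left",
    )
    out = out.merge(
        lookup.rename(columns={"code": "dest_airport_code", "city": "arrival_city"}),
        on="dest_airport_code",
        how="left",
    )
    return out

merged = add_city_columns_sketch(flights, airport_codes)
print(merged)
```

A left join keeps every flight even when its airport code has no match, so unmatched codes show up as nulls in the city columns. That is why the null count is checked right after the merge.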

In [None]:
merged_df.isnull().sum()

# Step 3 - Cleaning the data

**3.1 Dropping all rows (flights) that include layovers. Time spent on the ground during a layover is not flying time, so keeping these rows would distort the analysis of flight duration. In addition, for flights with layovers the table lists the aircraft used but not the distance each aircraft flew, which makes it very difficult to attribute emissions to a specific aircraft type. Excluding layovers lets us analyze the emissions and performance of individual aircraft types more precisely.**

**3.2 Renaming the duration column to indicate it is measured in minutes.**

**3.3 Eliminating all rows that do not provide information about CO2 emissions, as this data is crucial for the analysis.**

**3.4 Filling null values in avg_co2_emission_for_this_route and co2_percentage.**

**3.5 The Airport Code dataframe did not include the airport code for Doha (DOH), so we inserted it manually. This is straightforward because the only null values in the city of arrival column belong to flights arriving in Doha (DOH).**

In [None]:
clean_df = bpf.clean_flight_data(merged_df)
clean_df
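
The body of `bpf.clean_flight_data` is not shown here. The sketch below walks through steps 3.1 to 3.5 on toy data. Apart from `avg_co2_emission_for_this_route` and `co2_percentage`, every column name is an assumption, and the fill strategy in 3.4 (the flight's own emissions and `"0%"`) is only one plausible choice, not necessarily what the helper does.

```python
import pandas as pd

# Toy data with placeholder values; most column names are hypothetical.
raw = pd.DataFrame({
    "stops": [0, 1, 0, 0],
    "duration": [135, 410, 90, 600],
    "co2_emissions": [120000.0, 300000.0, None, 450000.0],
    "avg_co2_emission_for_this_route": [110000.0, 280000.0, 95000.0, None],
    "co2_percentage": ["9%", "7%", None, None],
    "dest_airport_code": ["CDG", "JFK", "ORY", "DOH"],
    "arrival_city": ["Paris", "New York", "Paris", None],
})

def clean_sketch(df):
    # 3.1 Keep direct flights only.
    out = df[df["stops"] == 0].copy()
    # 3.2 Make the unit explicit in the column name.
    out = out.rename(columns={"duration": "duration_minutes"})
    # 3.3 Drop flights without CO2 information.
    out = out.dropna(subset=["co2_emissions"])
    # 3.4 Fill remaining gaps (assumed strategy, see lead-in).
    out["avg_co2_emission_for_this_route"] = (
        out["avg_co2_emission_for_this_route"].fillna(out["co2_emissions"])
    )
    out["co2_percentage"] = out["co2_percentage"].fillna("0%")
    # 3.5 Fill the missing arrival city for Doha by hand.
    missing_doha = (out["dest_airport_code"] == "DOH") & out["arrival_city"].isna()
    out.loc[missing_doha, "arrival_city"] = "Doha"
    return out.reset_index(drop=True)

cleaned = clean_sketch(raw)
print(cleaned)
print(cleaned.isnull().sum())
```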

In [None]:
clean_df.isnull().sum()

# Step 4 - Hypotheses

1. Most Polluting Routes in Summer 2022
Analysis: Calculate the total CO2 emissions for each route.
Approach: Sum CO2 emissions for all flights on each route, then rank them to find the most polluting routes.

In [None]:
bpf.most_pollutant_routes(clean_df)
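
The approach for this first analysis (sum per route, then rank) can be sketched with a pandas group-by. The column names and emission values below are placeholders for illustration, and `bpf.most_pollutant_routes` may also plot the result, which this sketch does not do.

```python
import pandas as pd

# Toy data; column names and values are assumptions.
flights = pd.DataFrame({
    "departure_city": ["Algiers", "Algiers", "Paris", "Paris"],
    "arrival_city": ["Paris", "Paris", "Doha", "Algiers"],
    "co2_emissions": [100.0, 150.0, 400.0, 120.0],
})

# Sum emissions per (departure, arrival) pair, then rank in descending order.
route_totals = (
    flights.groupby(["departure_city", "arrival_city"], as_index=False)["co2_emissions"]
    .sum()
    .sort_values("co2_emissions", ascending=False)
    .reset_index(drop=True)
)
print(route_totals)
```

In this sketch, Algiers to Paris and Paris to Algiers count as two different routes. Treating them as one would require sorting the two city names within each row before grouping.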

2. Most Polluting Aircraft Types
Analysis: Determine CO2 emissions by aircraft type.
Approach: Aggregate CO2 emissions by aircraft type and identify which types have the highest emissions.

In [None]:
bpf.most_polluting_aircraft_types(clean_df)

3. CO2 Emissions by Airline
Analysis: Compare the CO2 emissions of different airlines.
Approach: Aggregate CO2 emissions data by airline and analyze their relative environmental impact.

In [None]:
bpf.co2_emissions_by_airline(clean_df)

4. CO2 Emissions by Airport

Analysis: Assess the total CO2 emissions associated with each airport.
Approach: Sum the CO2 emissions for all departures and arrivals at each airport.

In [None]:
bpf.co2_emissions_by_airport(clean_df)
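
Summing over both departures and arrivals takes two aggregations that are then added together. A minimal sketch, assuming hypothetical columns `from_airport_code`, `dest_airport_code`, and `co2_emissions` with placeholder values:

```python
import pandas as pd

# Toy data; column names and values are assumptions.
flights = pd.DataFrame({
    "from_airport_code": ["ALG", "CDG", "CDG"],
    "dest_airport_code": ["CDG", "DOH", "ALG"],
    "co2_emissions": [100.0, 400.0, 120.0],
})

# Emissions of flights leaving each airport.
departures = flights.groupby("from_airport_code")["co2_emissions"].sum()
# Emissions of flights arriving at each airport.
arrivals = flights.groupby("dest_airport_code")["co2_emissions"].sum()

# fill_value=0 keeps airports that appear only as an origin or only as a destination.
airport_totals = departures.add(arrivals, fill_value=0).sort_values(ascending=False)
print(airport_totals)
```

With this definition each flight is counted twice, once at its origin and once at its destination, so the airport totals add up to twice the overall emissions.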

5. Most Popular Routes
Analysis: Identify routes with the highest number of flights.
Approach: Count the number of flights for each route and rank them.

In [None]:
bpf.most_popular_routes(clean_df)

6. Shortest and Longest Routes
Analysis: Find the routes with the minimum and maximum distances.
Approach: Use the distance data to identify the shortest and longest routes.

In [None]:
bpf.shortest_and_longest_routes(clean_df)
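
One way to pick out the extremes is to reduce the table to one row per route and then use `idxmin` and `idxmax`. The `distance_km` column and its values are placeholders, not real distances, and the actual helper may derive distance differently.

```python
import pandas as pd

# Toy data; column names are assumptions and distances are placeholder values.
flights = pd.DataFrame({
    "departure_city": ["Algiers", "Algiers", "Paris", "Doha"],
    "arrival_city": ["Paris", "Paris", "Doha", "Sydney"],
    "distance_km": [1000.0, 1000.0, 5000.0, 12000.0],
})

# One row per route, so repeated flights do not matter.
routes = flights.drop_duplicates(subset=["departure_city", "arrival_city"])

shortest = routes.loc[routes["distance_km"].idxmin()]
longest = routes.loc[routes["distance_km"].idxmax()]
print(shortest, longest, sep="\n\n")
```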

7. Correlation Between Price and Distance
Analysis: Examine how ticket prices vary with distance.
Approach: Calculate the correlation coefficient between ticket price and distance traveled.

In [None]:
bpf.correlation_price_distance(clean_df)
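
The correlation coefficient itself is a single call in pandas. The sketch below uses made-up prices and distances with assumed column names. `Series.corr` computes the Pearson coefficient by default, which only measures linear association.

```python
import pandas as pd

# Toy data; column names and values are assumptions.
flights = pd.DataFrame({
    "price": [100.0, 220.0, 280.0, 400.0],
    "distance_km": [1000.0, 2000.0, 3000.0, 4000.0],
})

# Pearson correlation between ticket price and distance.
r = flights["price"].corr(flights["distance_km"])
print(f"Pearson r = {r:.3f}")
```

The same call with the CO2 emissions column in place of distance would give the coefficient for the price and emissions analysis.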

8. Correlation Between Ticket Price and CO2 Emissions
Analysis: Explore the relationship between ticket price and CO2 emissions per flight.
Approach: Calculate the correlation coefficient between ticket price and CO2 emissions.

In [None]:
bpf.correlation_price_co2_emissions(clean_df)