## Flight Data Analysis Python Project
Analysis and projections of various metrics (airline, destination, arrival time, etc.) for flights in India using Python.

## Import Python Libraries
Importing pandas, `matplotlib.pyplot`, and seaborn for data manipulation and visualization.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load Excel Data
Importing the Excel spreadsheet of flight data for India.

In [None]:
df = pd.read_excel('airlines_flights_data.xlsx')

## Exploring Data Frames
Displaying the data to understand its formatting and structure.

In [None]:
df.info()
df.head(12)

## Data Cleaning
Checking for null or duplicate values

In [None]:
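# A notebook cell only renders its last expression, so print both results explicitly
print(df[df.isnull().any(axis=1)])  # rows with at least one null value
print(df[df.duplicated()])          # rows that duplicate an earlier row
# No duplicates or null values found, so no need to change the original data frame.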
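The same check can also be expressed as counts, which is easier to read when the frame is large. A minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from the flight data):

```python
import pandas as pd

# Hypothetical toy data standing in for the flight data
toy = pd.DataFrame({
    'airline': ['A', 'B', 'B', None],
    'price': [100, 200, 200, 300],
})

null_rows = int(toy.isnull().any(axis=1).sum())  # rows with at least one null
duplicate_rows = int(toy.duplicated().sum())     # rows repeating an earlier row

print(f"Rows with nulls: {null_rows}")
print(f"Duplicate rows: {duplicate_rows}")
```

If both counts are zero on the real data, no cleaning step is needed.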

## Removing Index Column
The `index` column is not needed for the analysis, so we drop it.

In [None]:
df_cleaned = df.drop(columns=['index'])
df_cleaned.head(12)

## Summary Statistics
Mean, median, quartiles, etc. for the numeric columns.

In [None]:
df_cleaned.describe()

## Adding Price ($) Column
Adding a column with the price in USD, converted from INR at a fixed rate of 0.012 USD per INR.

In [None]:
# Convert INR to USD at a fixed rate of 0.012 and round to cents
df_cleaned['price (USD $)'] = (df_cleaned['price (INR ₹)'] * 0.012).round(2)
df_cleaned.head(12)
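One way to make the conversion easier to update is to keep the rate in a named constant. A minimal sketch, assuming the same fixed rate of 0.012; the constant name `INR_TO_USD` and the `toy` prices are hypothetical:

```python
import pandas as pd

INR_TO_USD = 0.012  # assumed fixed exchange rate; update if the rate changes

# Hypothetical INR prices, not taken from the flight data
toy = pd.DataFrame({'price (INR ₹)': [5953, 12000, 50000]})

# Convert to USD and round to cents
toy['price (USD $)'] = (toy['price (INR ₹)'] * INR_TO_USD).round(2)
print(toy)
```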

## Describe Function for Price Column in USD
Checking the distribution of flight prices in USD (mean, quartiles, and extremes) for better understanding.

In [None]:
df_cleaned['price (USD $)'].describe()

## Checking for outliers in Price Column
Prices above the 75th percentile look like possible outliers, so let's check those rows to see whether they're accurate.

In [None]:
print(df_cleaned[df_cleaned['price (USD $)'] > df_cleaned['price (USD $)'].quantile(0.75)].to_string())
## The high prices appear to come from business-class tickets, which are often more expensive.
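Since a quarter of all rows sit above the 75th percentile by definition, a stricter cutoff can help narrow the search. One common convention (not the one used in this notebook) is the interquartile-range rule, which flags values above Q3 + 1.5 × IQR. A minimal sketch on hypothetical prices:

```python
import pandas as pd

# Hypothetical USD prices with one obviously extreme value
prices = pd.Series([50, 55, 60, 62, 65, 70, 72, 75, 80, 400], name='price (USD $)')

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1                  # spread of the middle 50% of prices
upper_fence = q3 + 1.5 * iqr   # values above this are flagged

outliers = prices[prices > upper_fence]
print(f"Upper fence: {upper_fence:.2f}")
print(outliers)
```

On data that mixes two ticket classes, this rule would likely need to be applied per class, for the same reason discussed in the next section.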

## Accounting for Business Class Flights
The high prices arise because the dataset includes both business and economy tickets, so let's keep that in mind when doing our analysis.


In [None]:
## Summary statistics of price for each class in the dataset
print(df_cleaned[df_cleaned['class'] == 'Economy'][['price (INR ₹)','price (USD $)']].describe().round(2))
print(df_cleaned[df_cleaned['class'] == 'Business'][['price (INR ₹)','price (USD $)']].describe().round(2))
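The per-class summaries can also be produced in a single call with `groupby`, which avoids repeating the filter for each class. A minimal sketch on hypothetical data (the `toy` frame is made up for illustration):

```python
import pandas as pd

# Hypothetical tickets, two per class
toy = pd.DataFrame({
    'class': ['Economy', 'Economy', 'Business', 'Business'],
    'price (USD $)': [60.0, 80.0, 500.0, 700.0],
})

# One row of summary statistics per class
summary = toy.groupby('class')['price (USD $)'].describe().round(2)
print(summary)
```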

## Taking a Closer Look at Outliers Near the Maximum Values
Each class still shows outliers near its maximum, so we make a line plot to see at what prices they occur.

In [None]:
# For each class, keep prices above a hard-coded cutoff, sorted ascending so the upper tail is easy to read
economy_prices = df_cleaned[(df_cleaned['class'] == 'Economy') & (df_cleaned['price (USD $)'] > 78.87)]['price (USD $)'].sort_values().reset_index(drop=True)
business_prices = df_cleaned[(df_cleaned['class'] == 'Business') & (df_cleaned['price (USD $)'] > 630.78)]['price (USD $)'].sort_values().reset_index(drop=True)

plt.figure(figsize=(10,6))
plt.plot(range(len(economy_prices)), economy_prices, label='Economy')
plt.plot(range(len(business_prices)), business_prices, label='Business')

plt.xlabel('Rank (sorted by price)')
plt.ylabel('Price (USD $)')
plt.title('Flight Prices by Class')
plt.legend()
plt.show()



## Removing Outliers
Based on the chart above, economy tickets over $200 and business tickets over $800 are clear outliers. Let's remove them from our data.

In [None]:
df_cleaned = df_cleaned.drop(df_cleaned[(df_cleaned['class'] == 'Economy') & (df_cleaned['price (USD $)'] > 200)].index)
df_cleaned = df_cleaned.drop(df_cleaned[(df_cleaned['class'] == 'Business') & (df_cleaned['price (USD $)'] > 800)].index)
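An equivalent way to express the removal is to map each row's class to its cutoff and keep the rows at or below it. A minimal sketch, assuming the cutoffs of 200 and 800 USD from the text; the name `PRICE_CAPS` and the `toy` frame are hypothetical:

```python
import pandas as pd

# Per-class cutoffs in USD, taken from the text above
PRICE_CAPS = {'Economy': 200, 'Business': 800}

# Hypothetical tickets: one normal and one extreme price per class
toy = pd.DataFrame({
    'class': ['Economy', 'Economy', 'Business', 'Business'],
    'price (USD $)': [75.0, 250.0, 640.0, 950.0],
})

# Look up each row's cutoff; a class missing from PRICE_CAPS maps to NaN,
# and its rows would be dropped by the comparison below
caps = toy['class'].map(PRICE_CAPS)
filtered = toy[toy['price (USD $)'] <= caps].reset_index(drop=True)

print(f"Removed {len(toy) - len(filtered)} of {len(toy)} rows")
```

Printing the number of removed rows gives a quick sanity check that the cutoffs only trim the tail rather than a large share of the data.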