<a href="https://colab.research.google.com/github/RenaAbbasova/proyecto_rena/blob/master/flight_price_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# @title #AirLinePrice-Regression { display-mode: "form" }
from google.colab import files
from IPython.display import Image

uploaded = files.upload()

In [None]:
Image('/content/suhyeon-choi-tTfDMaRq-FE-unsplash (2).jpg',
      width = 725)

# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from scipy import stats
import scipy
from matplotlib.pyplot import figure


# Features

The various features of the cleaned dataset are explained below:

**Airline**: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.

**Flight**: Flight stores information regarding the plane's flight code. It is a categorical feature.

**Source City**: City from which the flight takes off. It is a categorical feature having 6 unique cities.

**Departure Time**: A derived categorical feature created by grouping departure times into bins. It has 6 unique time labels.

**Stops**: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.

**Arrival Time**: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.

**Destination City**: City where the flight will land. It is a categorical feature having 6 unique cities.

**Class**: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.

**Duration**: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.

**Days Left**: A derived feature calculated by subtracting the booking date from the trip date.

**Price**: The target variable; it stores the ticket price.
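
The two derived features above can be illustrated with a small sketch. The raw timestamps, the bin edges, and the label names below are assumptions made for illustration; the dataset ships only the derived columns, and its actual cut-offs are not documented here.

```python
import pandas as pd

# Hypothetical raw bookings (not part of the dataset)
raw = pd.DataFrame({
    "departure": pd.to_datetime(["2022-02-11 05:30", "2022-02-11 13:10", "2022-02-20 22:45"]),
    "booked_on": pd.to_datetime(["2022-02-10", "2022-02-01", "2022-01-02"]),
})

# Assumed bin edges (hour of day) and labels: six 4-hour bins
edges = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late_Night", "Early_Morning", "Morning", "Afternoon", "Evening", "Night"]

# Departure Time: bucket the departure hour into one of the six labels
raw["dep_time"] = pd.cut(raw["departure"].dt.hour, bins=edges, labels=labels, right=False)

# Days Left: trip date minus booking date, in whole days
raw["days_left"] = (raw["departure"].dt.normalize() - raw["booked_on"]).dt.days

print(raw[["dep_time", "days_left"]])
```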

# Load and read the data

In [None]:
data = pd.read_csv('/content/drive/MyDrive/Flight Price prediction/Flight_Price_Prediction.csv')

In [None]:
df = data.copy()

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
# drop unnamed column
df.drop('Unnamed: 0',axis=1,inplace=True)


In [None]:
# rename the columns
df=df.rename(columns={'departure_time':'dep_time', 'destination_city':'des_city', 'arrival_time':'arr_time'})

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

# EDA

In [None]:
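# distribution of price (distplot is deprecated in recent seaborn releases, so use histplot with a KDE)
ax = sns.histplot(df.price, kde=True)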

In [None]:
# price is not normally distributed; try a transformation (log, cbrt, sqrt)
ax = sns.histplot(np.cbrt(df.price), kde=True)  # better to keep price untransformed

In [None]:
# boxplot
ax = sns.boxplot(x=df.price)  # there are outliers

In [None]:
#df.describe()
df.describe(include='all')

In [None]:
palette = "Set3"
plt.figure(figsize=(10,4))
sns.boxplot(df,x='airline',y='price', palette=palette)


# Add title and labels
plt.title('Price Distribution Across Airlines')
plt.xlabel('Airline')
plt.ylabel('Price')

# Rotate x-axis labels for better readability if necessary
plt.xticks(rotation=45, ha='right')

# Show the plot
plt.tight_layout()
plt.show()

##### Vistara has the widest price range

In [None]:
plt.figure(figsize=(12,4))
sns.barplot(data=df, x='days_left', y='price', color='skyblue')

# Add title and labels
plt.title('Flight Prices by Days Left Until Departure')
plt.xlabel('Days Left Until Departure')
plt.ylabel('Price')

# Show the plot
plt.show()


##### We can see that the price is very high when the flight is booked 2 or 3 days prior. From 19 to 49 days, the price almost remains the same.
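
By default, `sns.barplot` draws the mean of `y` for each `x` value, so the same comparison can be read as numbers with a `groupby`. The sketch below uses a small made-up frame (`toy`) rather than the real data, so the values are illustrative only.

```python
import pandas as pd

# Made-up prices for illustration; replace `toy` with `df` in the notebook
toy = pd.DataFrame({
    "days_left": [1, 1, 2, 2, 20, 20, 40, 40],
    "price": [21000, 23000, 30000, 28000, 6000, 6400, 6100, 6300],
})

# Mean price per value of days_left, the quantity the bar heights represent
mean_price = toy.groupby("days_left")["price"].mean()
print(mean_price)
```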

### Categorical variables distribution





In [None]:


# Get a list of categorical columns
categorical_columns = ['airline', 'source_city', 'dep_time', 'stops', 'arr_time',
       'des_city', 'class']

# Loop through each categorical column
for column in categorical_columns:
    plt.figure(figsize=(8, 6))  # Adjust the figure size as needed

    # Count the occurrences of each category in the column
    category_counts = df[column].value_counts()

    # Plot a pie chart for the current categorical column
    category_counts.plot(kind='pie', autopct='%1.1f%%', startangle=140)

    # Add title with the column name
    plt.title(f'Distribution of {column}')
    plt.ylabel('')  # Remove the y-label

    plt.axis('equal')

    plt.show()


###### Vistara and Air India account for the largest share of flights, which mostly depart from Mumbai, Delhi, and Bangalore. Morning, early-morning, and evening departures are the most common, and most flights have one stop. Arrivals are concentrated at night, in the morning, and in the evening. Destination cities are fairly evenly distributed, with slightly higher shares for Delhi, Mumbai, and Bangalore. Economy class accounts for 68.9% of tickets.
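
The percentages printed on the pie charts can also be obtained as a table, which is easier to quote than reading them off a plot. This is a minimal sketch on a made-up column (`toy`); the 70/30 split is illustrative and is not the dataset's actual share.

```python
import pandas as pd

# Made-up class column for illustration; replace `toy` with `df` in the notebook
toy = pd.DataFrame({"class": ["Economy"] * 7 + ["Business"] * 3})

# Share of each category in percent, rounded to one decimal like the pie labels
share = toy["class"].value_counts(normalize=True).mul(100).round(1)
print(share)
```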

In [None]:

# Plot how many rows (tickets) each flight code has
plt.figure(figsize=(10, 6))  # Set the figure size

# Count the occurrences of each flight code and plot them in code order
df['flight'].value_counts().sort_index().plot(kind='line')

# Add title and labels
plt.title('Distribution of Flights')
plt.xlabel('flight')
plt.ylabel('Number of Flights')

# Show the plot
plt.grid(True)  # Add grid lines
plt.tight_layout()
plt.show()


In [None]:

sg_8264_data = df[df['flight'] == 'SG-8264']



In [None]:
airline_of_sg_8264 = sg_8264_data['airline']

In [None]:
airline_of_sg_8264.unique()

In [None]:
sns.barplot(data=df, x='dep_time', y='price', color='pink')

##### Prices are highest for night departures and lower for late-night departures.

In [None]:
sns.barplot(data=df, x='arr_time', y='price')


##### Prices are higher when flights arrive in the evening, at night, or in the morning.

# Outlier Treatment

In [None]:
def cap_outliers(df,column):
  Q1=df[column].quantile(0.25)
  Q3=df[column].quantile(0.75)
  IQR=Q3-Q1
  Upper_limit=Q3+1.5*IQR
  Lower_limit=Q1-1.5*IQR

  # Cap only the given column; omitting the column selector would overwrite entire rows
  df.loc[df[column]>Upper_limit,column]=Upper_limit
  df.loc[df[column]<Lower_limit,column]=Lower_limit

cap_outliers(df,'price')


In [None]:
cap_outliers(df,'duration')

In [None]:
#function for finding out outliers
def find_outliers(df,column):
  Q1=df[column].quantile(0.25)
  Q3=df[column].quantile(0.75)
  IQR=Q3-Q1
  Upper_End=Q3+1.5*IQR
  Lower_End=Q1-1.5*IQR

  outlier=df[column][(df[column]>Upper_End)| (df[column]<Lower_End) ]

  return outlier

In [None]:
for column in ['price','duration']:
  print('\n Outliers in column "%s"' %column)

  outlier= find_outliers(df,column)
  print(outlier)
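
The IQR capping used in this section can also be written with `Series.clip`, which returns a new series instead of modifying the frame in place. This is a minimal sketch on made-up numbers; the helper name `iqr_bounds` is hypothetical and is not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious outlier
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

low, high = iqr_bounds(prices)
capped = prices.clip(lower=low, upper=high)  # values outside the fences are pulled to the fence

print(low, high)
print(capped.tolist())
```

After capping, rerunning an outlier check with the same fences on the original quartiles finds nothing, which is the outcome the loop above is meant to confirm.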

In [None]:
df.head()

In [None]:
df.columns

# Convert categorical variable into numerical

In [None]:
!pip install --upgrade category_encoders

In [None]:
import category_encoders as ce

In [None]:
# List of columns containing categorical data


columns_to_encode = ['airline', 'flight', 'source_city', 'dep_time', 'stops', 'arr_time', 'des_city','class']

te = ce.TargetEncoder(cols=columns_to_encode )
df = te.fit_transform(df, df['price'])
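
Target encoding replaces each category with a statistic of the target, so fitting it on the full dataset lets test-set prices leak into the features. One common way to avoid this is to learn the mapping on the training rows only. The sketch below is a plain mean encoding written in pandas on made-up data; it is an assumption-laden illustration and does not reproduce the smoothing that `ce.TargetEncoder` applies.

```python
import pandas as pd

# Made-up training and test rows for illustration
train = pd.DataFrame({
    "airline": ["A", "A", "B", "B", "B"],
    "price": [100.0, 300.0, 50.0, 60.0, 70.0],
})
test = pd.DataFrame({"airline": ["A", "B", "C"]})

# Learn the category -> mean price mapping from the training rows only
global_mean = train["price"].mean()
means = train.groupby("airline")["price"].mean()

train["airline_enc"] = train["airline"].map(means)
# Categories unseen in training (here "C") fall back to the global training mean
test["airline_enc"] = test["airline"].map(means).fillna(global_mean)

print(test)
```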


In [None]:
df.head()

In [None]:
from sklearn.preprocessing import MinMaxScaler


In [None]:
scaler=MinMaxScaler()
df_scaled=scaler.fit_transform(df)



In [None]:
df_scaled = pd.DataFrame(df_scaled,columns=df.columns)

In [None]:
df_scaled.head()
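
Min-max scaling maps each column to [0, 1] via `(x - min) / (max - min)`. If the minimum and maximum are taken from the training split only, the test split is transformed with the same constants and may fall slightly outside [0, 1]. A minimal pandas sketch on made-up data, as an alternative to scaling the whole frame at once:

```python
import pandas as pd

# Made-up splits for illustration
train = pd.DataFrame({"duration": [2.0, 4.0, 10.0], "days_left": [1.0, 25.0, 49.0]})
test = pd.DataFrame({"duration": [6.0, 12.0], "days_left": [13.0, 49.0]})

# Learn the column-wise range from the training split only
col_min, col_max = train.min(), train.max()

train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)  # reuses the training constants

print(test_scaled)
```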

In [None]:
plt.figure(figsize=(10,5))
sns.heatmap(df_scaled.corr(),annot=True)

##### The 'airline' and 'flight' variables are highly correlated. To avoid collinearity, we will drop the 'flight' variable.
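
Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The helper `correlated_pairs` and the 0.8 threshold below are assumptions for illustration, and the frame is made up.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep the strict upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    stacked = upper.stack()
    return stacked[stacked > threshold]

# Made-up encoded features for illustration
toy = pd.DataFrame({
    "airline": [1.0, 2.0, 3.0, 4.0, 5.0],
    "flight": [1.1, 2.0, 3.2, 3.9, 5.1],
    "days_left": [5.0, 1.0, 4.0, 2.0, 3.0],
})

pairs = correlated_pairs(toy)
print(pairs)
```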

In [None]:
df_scaled=df_scaled.drop('flight',axis=1)

# Model Building

In [None]:
X=df_scaled.drop('price',axis=1)
Y=df['price']

In [None]:
X.shape,Y.shape

### Linear Regression Model

In [None]:
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=42)

In [None]:
x_train.shape,y_train.shape,x_test.shape,y_test.shape

In [None]:
#Build model m1 using all features
m1=sm.OLS(y_train,sm.add_constant(x_train)).fit()

In [None]:
#Select the top 5 features based on coefficient magnitude
#(exclude the intercept 'const', which is not a column of x_train)
top_features = m1.params.drop('const', errors='ignore').abs().nlargest(5).sort_index()

In [None]:
#Build model m2 using 5 top features
m2=sm.OLS(y_train,sm.add_constant(x_train[top_features.index.tolist()])).fit()



In [None]:
#Compare performance of m1 and m2
print('Performance m1')
print(m1.summary())

print("\n")

print('Performance m2')
print(m2.summary())

##### R-squared and adjusted R-squared barely change between m1 and m2, which suggests that the top 5 features selected by coefficient magnitude have predictive power similar to that of the full feature set.
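
Adjusted R-squared penalizes R-squared for the number of predictors: with n observations and p features (plus an intercept), it equals 1 - (1 - R2) * (n - 1) / (n - p - 1). The sketch below evaluates that formula for hypothetical values; the numbers are not results from m1 or m2.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared for a model with an intercept and n_features predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Hypothetical example: same R2, different number of features
full = adjusted_r2(0.9, n_obs=101, n_features=10)
reduced = adjusted_r2(0.9, n_obs=101, n_features=5)

print(round(full, 4), round(reduced, 4))
```

For the same R-squared, the model with fewer features gets the higher adjusted value, which is why the two summaries are worth comparing on both metrics.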


# Linear Regression with scikit-learn

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [None]:
# Train the model
m3=LinearRegression()
m3.fit(x_train,y_train)

# Predict on the test set
y_pred=m3.predict(x_test)

# Evaluate model
r_squared = r2_score(y_test,y_pred)
print("R-squared (R2) score:", r_squared)

In [None]:
numerical_columns=df.select_dtypes(include=['float64','int64']).columns.tolist()
numerical_columns

In [None]:
print(f"Categorical columns:\n{categorical_columns}")