# **FLIGHT PRICE ANALYSIS**

**Problem Statement**

The project aims to analyze and predict flight ticket prices based on various factors such as airline, source, destination, journey date, total stops, and duration. Flight pricing is a complex problem influenced by multiple features, including demand, airline policies, and timing. This analysis helps identify trends and build a predictive model to estimate ticket prices, providing valuable insights for travelers and businesses alike.

**Content**  
1. **Import Packages**  
2. **Read Data**  
3. **Understand and Prepare the Data**  
3.1 - Data Types and Dimensions  
3.2 - Feature Engineering  
4. **Splitting and Training the Data**  
4.1 - Prepare the Data  
4.2 - Scale the Data  
4.3 - Machine Learning Algorithms  
5. **Testing Data**  
6. **Conclusion**

**Import Packages**

In [None]:
import pandas as pd
import numpy as np


**EXPLORATORY DATA ANALYSIS - Read the Data**




In [None]:
df=pd.read_excel("/Flight_Price_Train.xlsx")
df

In [None]:
df.head(5)

**Understand and Prepare the Data**

**Data Types and Dimensions**

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
# Backfill each gap with the next valid entry; assign back instead of calling inplace on a column
df['Route'] = df['Route'].bfill()
df['Total_Stops'] = df['Total_Stops'].bfill()
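
Backfill copies the next valid value upward, so a gap in the last row has nothing to copy from and stays missing. The sketch below shows this on a made-up column (the values are illustrative, not taken from the dataset); it may be worth re-running `df.isnull().sum()` after the fill to confirm nothing is left.

```python
import pandas as pd

# Toy column with two gaps; values are invented for illustration.
stops = pd.Series(["non-stop", None, "1 stop", None])

# bfill replaces each gap with the next valid value below it.
filled = stops.bfill()

print(filled.tolist())
print("still missing:", filled.isna().sum())
```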

In [None]:
df.duplicated().sum()


In [None]:
df.drop_duplicates(inplace=True)
df.duplicated().sum()

In [None]:
df

**Feature Engineering**

In [None]:
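# Parse the journey date once (day-first, matching the '%d/%m/%Y' format used for the test set), then split it
journey_date = pd.to_datetime(df['Date_of_Journey'], dayfirst=True)
df['Date'] = journey_date.dt.day
df['Month'] = journey_date.dt.month
df['Year'] = journey_date.dt.year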


In [None]:
df.drop('Date_of_Journey',axis=1,inplace=True)
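
The date is split into numeric parts above, and a similar step could be applied to the flight duration, which otherwise ends up label-encoded as an arbitrary category. The helper below is only a sketch: it assumes duration strings look like `2h 50m`, `19h`, or `45m`, and the name `duration_to_minutes` is hypothetical.

```python
import re

def duration_to_minutes(text):
    """Convert strings such as '2h 50m', '19h' or '45m' to total minutes.

    Assumes hours are marked with 'h' and minutes with 'm'.
    """
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

print(duration_to_minutes("2h 50m"))  # 170
```

If the column is named `Duration` (an assumption), it could be applied with `df['Duration'].apply(duration_to_minutes)` before the encoding step.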

In [None]:
df


**VISUALIZATION**

In [None]:
# Number of flights per airline; mode() gives the most frequent airline
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(22, 6))
sns.countplot(x='Airline',data=df)

most_pref=df['Airline'].mode()
print(most_pref)

In [None]:
# Number of flights per source city; mode() gives the most common departure city
plt.figure(figsize=(22, 6))
sns.countplot(x='Source',data=df)

maj_takeoff=df['Source'].mode()
print(maj_takeoff)


In [None]:
plt.figure(figsize=(22, 6))
sns.countplot(x='Destination',data=df)

most_dest=df['Destination'].mode()
print(most_dest)


In [None]:
# Average ticket price per airline
plt.figure(figsize=(33,10))
sns.barplot(data=df,x='Airline',y='Price')
plt.show()

In [None]:
high_price_business_flights = df[(df['Additional_Info'] == 'Business') & (df['Price'] > 50000)]
print(high_price_business_flights)

**ENCODING**

In [None]:
from sklearn import preprocessing

# Label-encode every text column; keep each fitted encoder so its mapping can be inspected or reused
encoders = {}
for col in df.select_dtypes(include=["object"]).columns:
  encoders[col] = preprocessing.LabelEncoder()
  df[col] = encoders[col].fit_transform(df[col])
  print(f"{col}:{df[col].unique()}")

In [None]:
sns.boxplot(data=df,x='Price')


**INTER-QUARTILE RANGE**

In [None]:
Q1=df['Price'].quantile(0.25)
Q3=df['Price'].quantile(0.75)
IQR=Q3-Q1
iqr_upper=Q3+1.5*IQR
iqr_lower=Q1-1.5*IQR
print(iqr_upper)
print(iqr_lower)
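
To make the fences concrete, here is the same calculation on a small made-up price list (the numbers are illustrative only). A single extreme fare falls above the upper fence, and everything else is kept.

```python
import pandas as pd

# Invented prices with one obvious outlier.
prices = pd.Series([3000, 4000, 5000, 6000, 7000, 8000, 9000, 60000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Tukey fences: 1.5 * IQR beyond each quartile.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = prices[(prices > lower) & (prices < upper)]
print(lower, upper)
print(kept.tolist())
```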

In [None]:
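# Keep rows inside the IQR fences; .copy() avoids SettingWithCopyWarning when columns are added later
df=df[(df['Price']<iqr_upper)&(df['Price']>iqr_lower)].copy()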

In [None]:
df

In [None]:
df.info()

**Splitting and training the data**

In [None]:
import numpy as np
df["price_log"]=np.log1p(df["Price"])
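
Because the target is `log1p(Price)`, every prediction and every error metric computed later is in log units rather than in currency. The round trip below, on made-up prices, shows that `np.expm1` is the inverse needed to get back to the original scale.

```python
import numpy as np

# Invented prices for illustration.
prices = np.array([3000.0, 7500.0, 12000.0])

log_prices = np.log1p(prices)     # the scale the models are trained on
recovered = np.expm1(log_prices)  # back to the original price scale

print(log_prices)
print(np.allclose(recovered, prices))
```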

In [None]:
df

In [None]:
df.drop('Price',axis=1,inplace=True)

In [None]:
df

In [None]:
x=df.drop(['price_log'],axis=1)
y=df['price_log']
print(y)

**ALGORITHMS-MACHINE LEARNING**

In [None]:
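from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hold out 20% of the rows for evaluation
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model1 = LinearRegression()
model1.fit(x_train, y_train)
y_pred = model1.predict(x_test)
r2= r2_score(y_test, y_pred)
print(r2)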

In [None]:
model1.score(x_train,y_train)  #train value
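
As a reminder of what the score means, R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The function below is a minimal sketch of that formula (`r2_manual` is a hypothetical name, and the toy targets are invented); a model that always predicts the mean scores 0, and a perfect model scores 1.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for a single target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_manual(y_true, y_true))      # perfect predictions
print(r2_manual(y_true, [2.5] * 4))   # always predicting the mean
```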

In [None]:
from sklearn.tree import DecisionTreeRegressor
model2= DecisionTreeRegressor()
model2.fit(x_train,y_train)
y2_pred = model2.predict(x_test)
r21 = r2_score(y_test, y2_pred)
print(r21)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

param_distributions = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']
}

random_search = RandomizedSearchCV(DecisionTreeRegressor(), param_distributions, n_iter=100, cv=5, scoring='r2')
random_search.fit(x_train, y_train)
best_model = random_search.best_estimator_
y3_pred = best_model.predict(x_test)
r211 = r2_score(y_test, y3_pred)
print(r211)
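
The search space above is small enough to count directly. The sketch below repeats the same dictionary and multiplies the number of options per parameter. Since every parameter is given as a list, `RandomizedSearchCV` samples without replacement, so `n_iter=100` covers almost the whole grid and an exhaustive grid search would cost about the same.

```python
from math import prod

# Same search space as in the cell above.
param_distributions = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2'],
}

# Total number of distinct parameter combinations.
n_combinations = prod(len(values) for values in param_distributions.values())
print(n_combinations)  # 4 * 3 * 3 * 3 = 108
```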

In [None]:
model2.score(x_train,y_train)   #train value

In [None]:
from sklearn.ensemble import RandomForestRegressor
model3 = RandomForestRegressor()
model3.fit(x_train, y_train)
y3_pred = model3.predict(x_test)
r22 = r2_score(y_test, y3_pred)
print(r22)

**ALGORITHMS- BOOSTING**

In [None]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
xgboost_model=xgb.XGBRegressor()
xgboost_model.fit(x_train,y_train)
xgb_predictions=xgboost_model.predict(x_test)
mse=mean_squared_error(y_test,xgb_predictions)
mae=mean_absolute_error(y_test,xgb_predictions)
r2_xg=r2_score(y_test,xgb_predictions)
print("mean_squared_error=",mse)
print("mean_absolute_error=",mae)
print("r2=",r2_xg)

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gradient_boosting_model=GradientBoostingRegressor()
gradient_boosting_model.fit(x_train,y_train)
gradient_predictions=gradient_boosting_model.predict(x_test)
mse=mean_squared_error(y_test,gradient_predictions)
mae=mean_absolute_error(y_test,gradient_predictions)
r2_gr=r2_score(y_test,gradient_predictions)
print("mean_squared_error=",mse)
print("mean_absolute_error=",mae)
print("r2=",r2_gr)

In [None]:
!pip install catboost
from catboost import CatBoostRegressor
catboost_model=CatBoostRegressor()
catboost_model.fit(x_train,y_train)
cat_predictions=catboost_model.predict(x_test)
mse=mean_squared_error(y_test,cat_predictions)
mae=mean_absolute_error(y_test,cat_predictions)
r2_cat=r2_score(y_test,cat_predictions)
print("mean_squared_error=",mse)
print("mean_absolute_error=",mae)
print("r2=",r2_cat)

**TESTING DATASET**

In [None]:
df_test=pd.read_excel("/content/Flight_Price_Test.xlsx")
df_test

In [None]:
df_test.describe()

In [None]:
df_test.info()

In [None]:
df_test.isnull().sum()

In [None]:
df_test.duplicated().sum()

In [None]:
df_test.drop_duplicates(inplace=True)
df_test.duplicated().sum()

In [None]:
df_test['Date']=pd.to_datetime(df_test['Date_of_Journey'], format='%d/%m/%Y').dt.day # Specify the correct format
df_test['Month']=pd.to_datetime(df_test['Date_of_Journey'], format='%d/%m/%Y').dt.month
df_test['Year']=pd.to_datetime(df_test['Date_of_Journey'], format='%d/%m/%Y').dt.year

In [None]:
df_test.drop('Date_of_Journey',axis=1,inplace=True)

In [None]:
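from sklearn import preprocessing
# Caution: these encoders are fitted on the test set alone, so a category's integer code
# matches the training data only when both sets contain exactly the same categories
for col in df_test.select_dtypes(include=["object"]).columns:
  label_encoding = preprocessing.LabelEncoder()
  label_encoding.fit(df_test[col].unique())
  df_test[col] = label_encoding.transform(df_test[col])
  print(f"{col}:{df_test[col].unique()}")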
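
Fitting a fresh encoder on the test file can assign different integers to the same category than the training encoder did, which would silently corrupt the predictions. One possible alternative, sketched below on made-up airline names, is to take the category list from the training column and reuse it for the test column. Sorting the categories mirrors the alphabetical order `LabelEncoder` uses, and values never seen in training get the code -1.

```python
import pandas as pd

# Invented example frames; only the idea carries over to the real data.
train = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo", "SpiceJet"]})
test = pd.DataFrame({"Airline": ["SpiceJet", "IndiGo", "Vistara"]})

# Category order comes from the training data only.
categories = sorted(train["Airline"].unique())

train_codes = pd.Categorical(train["Airline"], categories=categories).codes
test_codes = pd.Categorical(test["Airline"], categories=categories).codes

print(dict(enumerate(categories)))
print(train_codes.tolist())
print(test_codes.tolist())  # -1 marks a category unseen in training
```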

In [None]:
# Reuse the XGBoost model fitted above; it predicts log1p(Price), so invert with expm1 to get prices
xgb_test_log = xgboost_model.predict(df_test)
xgb_predictions = np.expm1(xgb_test_log)
print(xgb_predictions)

**CONCLUSION**

In [None]:
model_performance = {
    'Linear Regression': r2,
    'Decision Tree': r211,
    'Random Forest': r22,
    'XGBoost': r2_xg,
    'Gradient Boosting': r2_gr,
    'CatBoost': r2_cat
}

optimum_model = max(model_performance, key=model_performance.get)
print(f"The optimum model is: {optimum_model} with r2: {model_performance[optimum_model]}")

In [None]:
plt.figure(figsize=(20, 6))
plt.bar(model_performance.keys(), model_performance.values())
plt.xlabel('Model')
plt.ylabel('R2 Score')
plt.title('Model Performance')
plt.show()

From a business perspective, XGBoost performed best, with the highest R² score among the models compared, so it should give the most accurate flight price estimates of the approaches tried. Note that these scores are computed on the log-transformed price rather than on the price itself. The model can be used for price prediction and may improve further with hyperparameter tuning.