#### This notebook covers EDA, visualisations, feature engineering and the training of 6 different ML models:

#### Linear Regression, Decision Tree, Random Forest, XGBoost, AdaBoost and Gradient Boosting.

#### Each model scored above 90% on the test set, with Random Forest Regression scoring highest at 96.5%. Since these are regression models, "accuracy" here means the R-squared value returned by `model.score`.



In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 
import os
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Preprocessing of data and EDA:

In [None]:
df= pd.read_csv('/kaggle/input/flight-price-prediction/Clean_Dataset.csv')

In [None]:
df

In [None]:
df=df.drop('Unnamed: 0', axis=1)

In [None]:
df=df.drop('flight', axis=1)

In [None]:
df

In [None]:
df.describe()

In [None]:
df['airline'].unique()

## Visualising Some Features:

#### 1. Number of passengers per airline

In [None]:
airline_counts = df['airline'].value_counts().sort_values(ascending=True)
sns.set_style("whitegrid")
colors = ['#4C72B0', '#55A868', '#C44E52', '#8172B2', '#CCB974', '#64B5CD']

# Create horizontal bar chart of airline counts
airline_counts.plot(kind='barh', color=colors)
plt.title("Customer Count by Airline")
plt.xlabel("Count")
plt.ylabel("Airline")
plt.show()


#### 2. Average ticket price for each airline

In [None]:
avg_price = df.groupby('airline')['price'].mean().reset_index()
avg_price = avg_price.sort_values(by='price',ascending=False)
sns.barplot(x='airline', y='price', data=avg_price)

plt.xlabel('Airline')
plt.ylabel('Average Price')
plt.show()

#### 3. Number of passengers in Business and Economy Class

In [None]:
class_counts = df['class'].value_counts()
colors = ['#FFD700', '#ed8e51']
class_counts.plot(kind='pie', colors=colors)
plt.title("Number of fliers in Business Vs Economy Class:")
plt.ylabel('')
plt.show()


#### 4. Ticket prices based on class

In [None]:
class_prices = df.groupby('class')['price'].mean()
sns.set_style("whitegrid")
class_prices.plot(kind='bar', color=['#4C72B0', '#55A868'])
plt.title("Average Ticket Price by Airplane Class")
plt.xlabel("Class")
plt.ylabel("Price")
plt.show()


#### 5. Ticket prices based on duration of flight

In [None]:
plt.scatter(df['duration'], df['price'], s=2, color= '#ed8e51')

plt.title("Flight Duration vs Ticket Price")
plt.xlabel("Duration of Flight")
plt.ylabel("Ticket Price")
plt.show()


#### 6. Relation between number of stops for a flight and the flight ticket price

In [None]:
# Create box plot of number of stops vs ticket price
df.boxplot(column='price', by='stops')

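plt.suptitle("")  # remove the automatic "Boxplot grouped by stops" figure title
plt.title("")  # remove the automatic per-axes title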
plt.xlabel("Number of Stops")
plt.ylabel("Ticket Price")
plt.show()


##  Identifying the categorical features:

In [None]:
# capturing those of type *object*

cat_cols = list(df.select_dtypes(include=['object']).columns)
print(f"Number of categorical columns: {len(cat_cols)}")
print(f"Categorical columns:\n{cat_cols}")

### Performing target encoding for all categorical variables:

* Target encoding, also known as likelihood encoding, is a method of encoding categorical features in which each category is replaced with the mean (or median) of the target variable for that category. In other words, we use the target variable to encode the categories of a categorical feature.

*  For example, in our dataset we have a categorical feature called "airline" and a target variable called "price". We can calculate the mean price for each airline and use these means as the new values for the "airline" feature. This way, we are encoding the "airline" feature with information from the target variable "price".

*  The advantage of target encoding is that it can capture the relationship between the categorical feature and the target variable in a more precise way than one-hot encoding, especially when the categorical feature has a large number of categories. Target encoding can also reduce the dimensionality of the feature space.

* However, target encoding has a potential risk of overfitting if there are too few samples for some categories, leading to a high variance in the target encoding values. 
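The idea described above can be sketched by hand with pandas. This is a minimal illustration on made-up toy values, not the notebook's data, and the column name `airline_encoded` is an assumption for the example. Library encoders such as `category_encoders.TargetEncoder` may additionally blend each category mean with the global mean, so their output need not match these raw means exactly.

```python
import pandas as pd

# Toy data: the values are invented purely for illustration.
toy = pd.DataFrame({
    "airline": ["A", "A", "B", "B", "B"],
    "price": [100.0, 200.0, 400.0, 500.0, 600.0],
})

# Mean of the target for each category.
category_means = toy.groupby("airline")["price"].mean()

# Replace every category label with the mean price of that category.
toy["airline_encoded"] = toy["airline"].map(category_means)
print(toy)
```

A category that appears only once or twice would be encoded with a mean based on very few rows, which is the overfitting risk mentioned above.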

In [None]:
import category_encoders as ce

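# Encode every categorical column using the mean price of each category.
# Note: fitting on the full dataset before the train/test split lets test-set
# prices influence the encoding, which can make the test scores optimistic.
te = ce.TargetEncoder(cols=cat_cols)
df = te.fit_transform(df, df['price'])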


In [None]:
df

## Identifying numerical columns:

In [None]:
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
print(numeric_cols)

Thus, all the columns now hold numerical data instead of categorical data.

## Checking for missing values:

In [None]:
features_with_na = [col for col in df.columns if df[col].isna().sum() > 0]

missing_values_df = pd.DataFrame(df[features_with_na].isnull().mean().sort_values(ascending=False), columns=["percentage"])
missing_values_df.head(10)

#### This dataset has no missing values.

## Scaling Data:

* Scaling is a preprocessing step in machine learning that aims to standardize the range or scale of the input features. The goal of scaling is to ensure that each feature has a similar scale or range, which can help some machine learning models to converge faster and improve their performance. 

* The choice of scaling method depends on the distribution and range of the input features, as well as the specific machine learning model being used. In general, it is a good practice to scale the data before training a machine learning model, unless the model is known to be insensitive to the scale of the input features.

* One commonly used scaling method is MinMaxScaler, which scales the data to a fixed range of values between 0 and 1. It works by subtracting the minimum value of each feature and then dividing by the range (i.e., the difference between the maximum and minimum values). 

* The advantage of MinMaxScaler is that it preserves the shape of the original distribution and does not change the relative position of the data points. It is also relatively simple to use and understand. However, MinMaxScaler may not work well if the data is highly skewed or has outliers, because a single extreme value stretches the range and squeezes the remaining values into a narrow band.
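The formula described above, and its sensitivity to outliers, can be checked on a small example. This is a sketch on invented numbers, where the value 100 plays the role of an outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature with four ordinary values and one outlier (made-up numbers).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation using scikit-learn.
scaled = MinMaxScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

Because of the outlier, the four ordinary values all end up very close to 0, while the outlier alone maps to 1.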

In [None]:
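# Min Max Scaler: transformation of data

names = df.columns
indexes = df.index
sc = MinMaxScaler(feature_range=(0, 1))  # scale every column, including 'price', to the range [0, 1]
scaled_values = sc.fit_transform(df)  # returns a NumPy array, so df itself is left untouched
data_scaled = pd.DataFrame(scaled_values, columns=names, index=indexes)
data_scaled.head()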

## Setting our target variable and input features:

In [None]:
# Set 'price' as the target variable
y = data_scaled['price']

# Extract the input features
X_data = data_scaled.drop(['price'], axis=1)


# Feature engineering:

Extracting the most relevant features in two ways: Pearson's correlation and KBest feature selection.

#### 1. Pearson's correlation:

* Pearson's correlation can be used as a feature engineering technique to identify and select the most relevant features for a machine learning model. By calculating the Pearson's correlation coefficient between each feature and the target variable, we can measure the linear relationship between the feature and the target and determine which features are most predictive of the target variable.
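To make the coefficient concrete, the sketch below computes it by hand on made-up arrays and compares the result with `np.corrcoef`. The second pair of arrays is a hypothetical example of a perfect but nonlinear relationship, for which the coefficient is zero.

```python
import numpy as np

# Nearly linear relationship (invented values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

def pearson_r(a, b):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)

# A symmetric quadratic relationship: y depends entirely on x, yet r is 0.
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
r_quadratic = pearson_r(x_sym, x_sym ** 2)
print(r_quadratic)
```

A low coefficient therefore rules out a linear relationship only, not every kind of dependence.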


In [None]:
#Using Pearson Correlation
plt.figure(figsize=(25,10))
cor = data_scaled.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

In [None]:
#Correlation with target variable price
cor_target = abs(cor["price"])

relevant_features = cor_target
relevant_features 


According to Pearson's correlation, our top 4 relevant features are: class, airline, stops and duration.

#### 2. KBest Selection:

* KBest feature selection is a technique in feature engineering that aims to select the k most important features from a dataset based on some statistical metric. The idea behind this technique is to reduce the dimensionality of the dataset by selecting only the most informative features, which can improve the performance of some machine learning models and reduce overfitting.

* KBest feature selection works by ranking the features according to a statistical metric, such as the chi-squared test, mutual information, or f-score, and selecting the top k features with the highest scores. The specific metric used depends on the type of data and the problem at hand.
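As a rough illustration of the ranking-and-selecting step, the sketch below runs `SelectKBest` on synthetic data with a continuous target. The feature names and coefficients are invented, and `f_regression` is used as the score function on the assumption that the target is numeric.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: two informative features and two pure-noise features.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "signal_strong": rng.normal(size=n),
    "signal_weak": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = 3.0 * X["signal_strong"] + 1.5 * X["signal_weak"] + rng.normal(scale=0.5, size=n)

# Score each feature against the target and keep the two highest-scoring ones.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)

selected = list(X.columns[selector.get_support()])
print(selected)
print(dict(zip(X.columns, selector.scores_.round(1))))
```

Each feature is scored on its own, so this kind of selection cannot detect features that are only useful in combination with others.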


In [None]:
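from sklearn.feature_selection import SelectKBest, f_classif

# Note: f_classif is an ANOVA F-test meant for categorical targets. With a
# continuous target such as price, f_regression is the matching score function
# and may rank the features differently.
selector = SelectKBest(f_classif, k=4)
X_important = selector.fit_transform(X_data, y)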

# Get a boolean mask of the selected features
mask = selector.get_support()

# Create a list of the selected feature names
important_feature_names = X_data.columns[mask]

print(important_feature_names)

According to KBest, our most important features are 'airline', 'source_city', 'destination_city', and 'class'.

#### Why are KBest and Pearson's correlation giving different best features?

* KBest and Pearson's correlation coefficient can select different features because they score features by different criteria. KBest ranks each feature with the chosen score function, here the ANOVA F-test `f_classif`, which treats every distinct target value as a class label, while Pearson's correlation coefficient measures only the linear relationship between each feature and the target variable. Therefore, KBest may select features that are not highly correlated with the target variable but still score well under its metric, while Pearson's correlation coefficient may miss important nonlinear or non-monotonic relationships.

* In practice, it is often a good idea to use multiple feature selection techniques and evaluate their performance on a validation set to choose the best set of features for the machine learning model. This can help to ensure that the selected features are relevant, informative, and not redundant.


#### So what features to select for this dataset?

* Since most of the features are selected by either Pearson's correlation or KBest, we will not eliminate any features and will run the models on all of them.

* We could also eliminate all features except 'class' and 'airline', since both selection techniques ranked these 2 among the best features.
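One way to make this choice empirically is to cross-validate a model on each candidate feature subset and compare the scores. The sketch below does this on synthetic data. The feature names `f1` to `f3`, the subsets and the use of `LinearRegression` are all assumptions for illustration, not results from this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data in which all three features carry some signal.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "f3": rng.normal(size=n),
})
y = 2.0 * X["f1"] + 1.0 * X["f2"] + 0.5 * X["f3"] + rng.normal(scale=0.5, size=n)

# Candidate feature subsets to compare.
candidates = {
    "all features": ["f1", "f2", "f3"],
    "top two": ["f1", "f2"],
}

# Mean 5-fold cross-validated R-squared for each subset.
cv_scores = {}
for name, cols in candidates.items():
    scores = cross_val_score(LinearRegression(), X[cols], y, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()

print(cv_scores)
```

In this synthetic setup, dropping `f3` costs some predictive power; on real data the comparison could go either way.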

# Building, running and evaluating our models:

## What evaluation metrics are being used?

#### 1. score:

* The score method provides a convenient way to quickly evaluate the performance of a trained model on a test dataset, without having to manually compute the evaluation metric. 

* However, it is important to keep in mind that the choice of evaluation metric can have a significant impact on the performance of the model and the conclusions that can be drawn from the results. Therefore, it is often a good idea to use multiple evaluation metrics and perform cross-validation to ensure that the model is robust and generalizes well to new data.

* In scikit-learn, `model.score` returns a fixed default metric: mean accuracy for classifiers and the R-squared value for regressors. Other metrics, such as precision, recall, or F1 score for classification and mean absolute error or mean squared error for regression, are computed separately with the functions in `sklearn.metrics`.
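For scikit-learn regressors specifically, `score` returns the R-squared value, so it matches `r2_score` computed on the same predictions. The sketch below checks this on synthetic data; the coefficients and the split sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem (made-up coefficients).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Simple hold-out split: first 150 rows for training, the rest for testing.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = LinearRegression().fit(X_train, y_train)

score = model.score(X_test, y_test)
r2 = r2_score(y_test, model.predict(X_test))
print(score, r2)
```

This is why the "accuracy" and "R-squared" values printed for each model in this notebook are identical.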

#### 2. Mean squared error:

* The MSE metric measures the average squared deviation of the predicted values from the actual values. It is a non-negative value where a value of zero indicates a perfect match between the predicted and actual values. A larger MSE value indicates a higher degree of error between the predicted and actual values. The MSE metric is sensitive to outliers, meaning that a few large errors can significantly increase the overall MSE value.

* MSE is commonly used to evaluate the performance of regression models and can be used to compare the performance of different regression algorithms or to tune hyperparameters of a regression model.
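A small worked example, using invented values, shows how the metric is computed and how one large error dominates it.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values; the last prediction is off by 3.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 12.0])

# MSE is the mean of the squared differences.
squared_errors = (y_true - y_pred) ** 2
mse_manual = squared_errors.mean()
mse_sklearn = mean_squared_error(y_true, y_pred)

print(squared_errors)
print(mse_manual, mse_sklearn)
```

Here the squared errors are 0.25, 0, 1 and 9, so the single largest error accounts for most of the total.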

#### 3. R-squared :

* R-squared (R²) is a statistical measure that tells you how well the regression model fits the data. It measures the proportion of the variance in the dependent variable (the variable you are trying to predict) that can be explained by the independent variables (the variables you are using to make the prediction).

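* On the training data of a linear model with an intercept, R-squared lies between 0 and 1, with a higher score indicating a better fit. A score of 1 means that the model explains all the variation in the dependent variable, while a score of 0 means that it does no better than always predicting the mean. On held-out data the score can be negative, which means the model performs worse than that constant prediction.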

* R-squared is useful because it provides a simple way to compare the performance of different regression models. However, it only tells you how well the model fits the data overall and does not provide information about the accuracy of individual predictions. So, it is often used along with other evaluation metrics, such as mean squared error, to get a more complete understanding of the model's performance.
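The definition can be written out directly as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses invented values and a hypothetical helper `r_squared`. It also shows that predictions worse than simply predicting the mean produce a negative value.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
good_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])  # close to the actual values
bad_pred = np.array([5.0, 4.0, 3.0, 2.0, 1.0])   # trend predicted in reverse

r2_good = r_squared(y_true, good_pred)
r2_bad = r_squared(y_true, bad_pred)

print(r2_good, r2_score(y_true, good_pred))
print(r2_bad, r2_score(y_true, bad_pred))
```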



## Split the data into training and testing sets:

In [None]:
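from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X_data, y, test_size=0.2, random_state=42)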


## Linear Regression: 90.2% accuracy

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


lr = LinearRegression()

# Fit the model to the training data
lr.fit(X_train, y_train)

#make predictions
y_pred = lr.predict(X_test)

# Evaluate the model on the testing data
score = lr.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model score (R-squared):", score)
print("Mean squared error:", mse)
print("R-squared:", r2)

## Decision Tree: 94.3% accuracy

In [None]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=5, min_samples_split=10)

dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)

dt_score = dt.score(X_test, y_test)
dt_mse = mean_squared_error(y_test, y_pred)
dt_r2 = r2_score(y_test, y_pred)

print("Model score (R-squared):", dt_score)
print("Mean squared error:", dt_mse)
print("R-squared:", dt_r2)



## Random Forest: 96.5% accuracy - our best performing model

In [None]:
from sklearn.ensemble import RandomForestRegressor


rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)

rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

rf_score = rf.score(X_test, y_test)
rf_mse = mean_squared_error(y_test, y_pred)
rf_r2 = r2_score(y_test, y_pred)

print("Model score (R-squared):", rf_score)
print("Mean squared error:", rf_mse)
print("R-squared:", rf_r2)

## XGBoost: 95.4% accuracy

In [None]:
import xgboost as xgb

XGB = xgb.XGBRegressor(objective ='reg:squarederror', n_estimators = 10, seed = 42)

XGB.fit(X_train, y_train)

y_pred = XGB.predict(X_test)

XGB_score = XGB.score(X_test, y_test)
XGB_mse = mean_squared_error(y_test, y_pred)
XGB_r2 = r2_score(y_test, y_pred)

print("Model score (R-squared):", XGB_score)
print("Mean squared error:", XGB_mse)
print("R-squared:", XGB_r2)


## AdaBoost: 93.4% accuracy

In [None]:
from sklearn.ensemble import AdaBoostRegressor

ada = AdaBoostRegressor(n_estimators=50, learning_rate=0.1, random_state=42)

ada.fit(X_train, y_train)

y_pred = ada.predict(X_test)

ada_score = ada.score(X_test, y_test)
ada_mse = mean_squared_error(y_test, y_pred)
ada_r2 = r2_score(y_test, y_pred)

print("Model score (R-squared):", ada_score)
print("Mean squared error:", ada_mse)
print("R-squared:", ada_r2)



## Gradient Boost: 95.2% accuracy

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# 'ls' was renamed to 'squared_error' in scikit-learn 1.0 and later removed
gb = GradientBoostingRegressor(loss='squared_error', learning_rate=0.1, n_estimators=100, max_depth=3)

gb.fit(X_train, y_train)

y_pred = gb.predict(X_test)

gb_score = gb.score(X_test, y_test)
gb_mse = mean_squared_error(y_test, y_pred)
gb_r2 = r2_score(y_test, y_pred)

print("Model score (R-squared):", gb_score)
print("Mean squared error:", gb_mse)
print("R-squared:", gb_r2)

#### Note that hyperparameter tuning could improve the performance of all these models.

#### However, since every model already scores above 90%, I have chosen to skip it.
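For reference, a minimal sketch of what such tuning could look like with `GridSearchCV` is given below. It runs on synthetic data rather than the flight dataset, and the parameter grid is an arbitrary assumption chosen to keep the search small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the real features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.2, size=300)

# Small, illustrative grid of hyperparameters to try.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 6],
}

# Evaluate every combination with 3-fold cross-validation, scored by R-squared.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The same pattern would apply to the other models by swapping in the estimator and a grid of its own parameters.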