{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Flight Price Category Prediction\n",
    "\n",
    "This notebook builds a machine learning model that classifies flights as 'Cheap' or 'Expensive' from their booking features. A flight is labelled 'Cheap' when its price is at most 10,000 and 'Expensive' otherwise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Import necessary libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import confusion_matrix, classification_report"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Data Loading and Preprocessing\n",
    "\n",
    "First, we'll load the dataset and perform some initial cleaning and preprocessing steps. This includes cleaning column names and transforming the continuous 'price' variable into a categorical 'Price_Category' suitable for a classification task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Load the dataset\n",
    "df = pd.read_csv(\"data/FlightBooking.csv\")\n",
    "df.columns = df.columns.str.strip()  # Strip stray whitespace from column names\n",
    "\n",
    "# Drop unwanted index column if present\n",
    "if 'Unnamed: 0' in df.columns:\n",
    "    df.drop(columns=['Unnamed: 0'], inplace=True)\n",
    "\n",
    "# Convert continuous price values into categories:\n",
    "# 'Cheap' for 0 < price <= 10000, 'Expensive' for price > 10000\n",
    "df['Price_Category'] = pd.cut(df['price'],\n",
    "                              bins=[0, 10000, float('inf')],\n",
    "                              labels=['Cheap', 'Expensive'])\n",
    "\n",
    "print(\"First 5 rows of the dataset:\")\n",
    "print(df.head())"
   ]
  },
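  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The binning rule can be checked on a few made-up prices. This is a minimal sketch: `toy_prices` is hypothetical data, not taken from the dataset. Because `pd.cut` closes each bin on the right by default, a price of exactly 10000 is labelled 'Cheap'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical prices chosen to sit on and around the 10000 threshold\n",
    "toy_prices = pd.Series([1500, 10000, 10001, 25000])\n",
    "toy_categories = pd.cut(toy_prices,\n",
    "                        bins=[0, 10000, float('inf')],\n",
    "                        labels=['Cheap', 'Expensive'])\n",
    "print(toy_categories.tolist())  # ['Cheap', 'Cheap', 'Expensive', 'Expensive']"
   ]
  },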
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Feature Engineering and Data Splitting\n",
    "\n",
    "We separate the features (`X`) from the target variable (`y`), dropping both `price` and `Price_Category` from `X` so the target cannot leak into the features. Categorical features are one-hot encoded, and the data is then split 80/20 into training and test sets so the model can be evaluated on unseen data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Prepare features and target\n",
    "X = df.drop(columns=['price', 'Price_Category'])\n",
    "X = pd.get_dummies(X, drop_first=True) # Encode categorical variables\n",
    "y = df['Price_Category']\n",
    "\n",
    "# Split data into training and test sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "\n",
    "print(f\"Training set shape: {X_train.shape}\")\n",
    "print(f\"Testing set shape: {X_test.shape}\")"
   ]
  },
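  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of the encoding step, the sketch below applies `pd.get_dummies` with `drop_first=True` to a made-up frame. The column names `airline` and `stops` and their values are assumptions for illustration only. With `drop_first=True`, the first level of each categorical column is dropped, so a column with *k* categories yields *k* - 1 indicator columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame; column names and values are illustrative assumptions\n",
    "toy = pd.DataFrame({'airline': ['A', 'B', 'C'],\n",
    "                    'stops': ['zero', 'one', 'zero']})\n",
    "toy_encoded = pd.get_dummies(toy, drop_first=True)\n",
    "print(toy_encoded.columns.tolist())  # ['airline_B', 'airline_C', 'stops_zero']"
   ]
  },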
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Model Training and Prediction\n",
    "\n",
    "A Random Forest classifier, an ensemble of decision trees that often performs well on tabular data, is fit on the training set. The fitted model then predicts a price category for each row of the held-out test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Train the Random Forest Classifier (fixed seed so results are reproducible)\n",
    "model = RandomForestClassifier(random_state=42)\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# Predict on the test set\n",
    "y_pred = model.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: Model Evaluation and Visualization\n",
    "\n",
    "To evaluate the model, we generate a confusion matrix and a classification report. The confusion matrix is drawn as a heatmap, and a second plot compares the actual and predicted categories for the first 30 test samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Generate and plot confusion matrix; labels are passed explicitly\n",
    "# so the rows and columns match the tick labels\n",
    "class_labels = ['Cheap', 'Expensive']\n",
    "cm = confusion_matrix(y_test, y_pred, labels=class_labels)\n",
    "plt.figure(figsize=(6, 4))\n",
    "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',\n",
    "            xticklabels=class_labels,\n",
    "            yticklabels=class_labels)\n",
    "plt.title(\"Confusion Matrix\")\n",
    "plt.xlabel(\"Predicted\")\n",
    "plt.ylabel(\"Actual\")\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Print classification report\n",
    "print(\"Classification Report:\")\n",
    "print(classification_report(y_test, y_pred))\n",
    "\n",
    "# Take the first few test samples and their predictions\n",
    "num_samples = 30\n",
    "y_test_sample = y_test.iloc[:num_samples].reset_index(drop=True)\n",
    "y_pred_sample = pd.Series(y_pred[:num_samples])\n",
    "\n",
    "# Plot actual vs predicted categories for the sample\n",
    "plt.figure(figsize=(14, 6))\n",
    "x = np.arange(num_samples)\n",
    "plt.plot(x, y_test_sample, marker='o', label='Actual', color='green')\n",
    "plt.plot(x, y_pred_sample, marker='x', label='Predicted', color='red')\n",
    "plt.title(f'Actual vs Predicted Price Category (Sample of {num_samples})')\n",
    "plt.xlabel('Sample Index')\n",
    "plt.ylabel('Price Category')\n",
    "plt.legend()\n",
    "plt.grid(True)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}