**What is Feature Engineering?**
 
Feature engineering is the process of creating new features or modifying existing ones to improve the performance of machine learning models. It involves techniques like feature extraction, transformation, encoding, and scaling to make data more useful for predictions.

**Why Do We Need Feature Engineering?**

1. **Improves Model Performance** – Informative features help models make better predictions.

2. **Reduces Overfitting** – Removing noisy and irrelevant features leaves the model less to memorize.

3. **Handles Missing Data** – Creates meaningful replacements for missing values.

4. **Enables Better Interpretability** – Makes features more understandable and useful.

5. **Reduces Dimensionality** – Removes unnecessary features, making the model more efficient.

In [2]:
import numpy as np
import pandas as pd

In [4]:
date_time = pd.DataFrame({'TransactionDate':pd.to_datetime(['2025-02-05 14:30:00','2025-02-06 18:45:00'])})

In [5]:
date_time['DayOfWeek'] = date_time['TransactionDate'].dt.dayofweek  # Monday=0 ... Sunday=6

In [6]:
date_time['Hour'] = date_time['TransactionDate'].dt.hour

In [8]:
date_time['Isweekend'] = date_time['DayOfWeek'].apply(lambda x:1 if x>=5 else 0)

In [9]:
date_time

Unnamed: 0,TransactionDate,DayOfWeek,Hour,Isweekend
0,2025-02-05 14:30:00,2,14,0
1,2025-02-06 18:45:00,3,18,0
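
Both sample timestamps fall on weekdays, so the weekend flag is 0 in every row. As a minimal sketch, the same check can be run on dates that include a Saturday and a Sunday. pandas numbers days from Monday=0 to Sunday=6, so `>= 5` selects the weekend. The `sample` frame, its dates, and the `IsWeekend` column name are illustrative assumptions, not part of the data above.

```python
import pandas as pd

# Hypothetical dates: a Friday, a Saturday and a Sunday in February 2025.
sample = pd.DataFrame({
    'TransactionDate': pd.to_datetime(['2025-02-07', '2025-02-08', '2025-02-09'])
})

# dt.dayofweek numbers days Monday=0 ... Sunday=6.
day_of_week = sample['TransactionDate'].dt.dayofweek

# Vectorized comparison instead of apply(lambda ...): True/False cast to 1/0.
sample['IsWeekend'] = (day_of_week >= 5).astype(int)

weekend_flags = sample['IsWeekend'].tolist()
print(weekend_flags)
```

The vectorized comparison gives the same result as the `apply` call and avoids a Python-level loop over rows.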


In [10]:
df_transactions = pd.DataFrame({
    'UserID':[101,102,101,103,102],
    'TransactionAmount':[500,300,700,1000,400]
})

In [12]:
df_transactions.groupby('UserID').agg('mean').reset_index()

Unnamed: 0,UserID,TransactionAmount
0,101,600.0
1,102,350.0
2,103,1000.0
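
The aggregation above returns one row per user. To use the average as a model feature, it usually needs to sit next to every transaction. A sketch of one way to do that with `groupby(...).transform`; the column name `AvgTransactionAmountPerUser` is an assumed name for illustration.

```python
import pandas as pd

df_transactions = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102],
    'TransactionAmount': [500, 300, 700, 1000, 400],
})

# transform('mean') returns one value per original row (aligned on the index),
# so the per-user average can be stored directly as a new column.
df_transactions['AvgTransactionAmountPerUser'] = (
    df_transactions.groupby('UserID')['TransactionAmount'].transform('mean')
)

avg_per_row = df_transactions['AvgTransactionAmountPerUser'].tolist()
print(df_transactions)
```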


In [13]:
from sklearn.preprocessing import OneHotEncoder

In [14]:
df = pd.DataFrame({'ProductCategory':['Electronics','Clothing','Clothing','Grocery']})

In [15]:
encoder = OneHotEncoder(sparse_output=False)  # dense array output; needs scikit-learn >= 1.2 (older releases use sparse=False)

In [17]:
encoded_features = encoder.fit_transform(df[['ProductCategory']])



In [18]:
encoded_features

array([[0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])

In [19]:
encoder.get_feature_names_out()

array(['ProductCategory_Clothing', 'ProductCategory_Electronics',
       'ProductCategory_Grocery'], dtype=object)

In [20]:
df_encoded = pd.DataFrame(encoded_features,columns=encoder.get_feature_names_out())

In [21]:
df_encoded

Unnamed: 0,ProductCategory_Clothing,ProductCategory_Electronics,ProductCategory_Grocery
0,0.0,1.0,0.0
1,1.0,0.0,0.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0


In [22]:
pd.get_dummies(df)

Unnamed: 0,ProductCategory_Clothing,ProductCategory_Electronics,ProductCategory_Grocery
0,False,True,False
1,True,False,False
2,True,False,False
3,False,False,True
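
`pd.get_dummies` builds its columns from whatever categories are present in the frame it is given, so a new batch of data can produce a different set of columns. A minimal sketch of aligning new data to the training columns with `reindex`; the `new_data` frame and the `'Toys'` category are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({'ProductCategory': ['Electronics', 'Clothing', 'Clothing', 'Grocery']})
# Hypothetical new batch: 'Grocery' is absent and 'Toys' was never seen before.
new_data = pd.DataFrame({'ProductCategory': ['Clothing', 'Toys']})

train_encoded = pd.get_dummies(train, dtype=int)
new_encoded = pd.get_dummies(new_data, dtype=int)

# Align to the training columns: categories missing from the new batch become
# all-zero columns, and unseen categories ('ProductCategory_Toys') are dropped.
new_aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(new_aligned)
```

A fitted `OneHotEncoder` keeps the category list learned during `fit`, which is one reason it is often preferred inside a modeling pipeline.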


In [23]:
df = pd.DataFrame({'Transactions':[100,200,5000,10000,20000]})

In [24]:
df['LogTransactionAmount'] = np.log1p(df['Transactions'])

In [25]:
df

Unnamed: 0,Transactions,LogTransactionAmount
0,100,4.615121
1,200,5.303305
2,5000,8.517393
3,10000,9.21044
4,20000,9.903538
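
`np.log1p(x)` computes `log(1 + x)`, so it stays defined when an amount is zero, and `np.expm1` reverses it. A short sketch with an extra zero amount added for illustration:

```python
import numpy as np

amounts = np.array([0, 100, 200, 5000, 10000, 20000], dtype=float)

# log1p(x) = log(1 + x): a zero amount maps to 0 instead of -inf.
log_amounts = np.log1p(amounts)

# expm1(y) = exp(y) - 1 undoes the transform, e.g. to report model
# predictions in the original units.
recovered = np.expm1(log_amounts)

print(log_amounts.round(6))
print(np.allclose(recovered, amounts))
```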


In [26]:
from sklearn.preprocessing import MinMaxScaler,StandardScaler

In [28]:
mm = MinMaxScaler()

In [31]:
df['Normalized_values'] = mm.fit_transform(df[['Transactions']])

In [33]:
ss = StandardScaler()

In [34]:
df['StandardizedTransactionAmount'] = ss.fit_transform(df[['Transactions']])

In [35]:
df

Unnamed: 0,Transactions,LogTransactionAmount,Normalized_values,StandardizedTransactionAmount
0,100,4.615121,0.0,-0.93707
1,200,5.303305,0.005025,-0.923606
2,5000,8.517393,0.246231,-0.277351
3,10000,9.21044,0.497487,0.395831
4,20000,9.903538,1.0,1.742196
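
Both scalers apply simple formulas: min-max normalization computes `(x - min) / (max - min)`, and standardization computes `(x - mean) / std` using the population standard deviation. The sketch below recomputes the two scaled columns with NumPy only, as a check on the values in the table above.

```python
import numpy as np

x = np.array([100, 200, 5000, 10000, 20000], dtype=float)

# Min-max normalization: rescales the values to the range [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean and unit variance. np.std defaults to the
# population standard deviation (ddof=0).
standardized = (x - x.mean()) / x.std()

print(normalized.round(6))
print(standardized.round(6))
```

Because one very large value stretches the range, min-max scaling pushes the small amounts close to 0. This is one reason a log transform is often applied to skewed amounts before scaling.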


**Final Summary of Feature Engineering & Imbalanced Data Handling**
 
Feature Extraction: Derive new signals from raw data (e.g., Hour, DayOfWeek)

Aggregated Features: Compute per-group statistics (e.g., AvgTransactionAmountPerUser)

Encoding: Convert categorical variables into numeric columns (One-Hot Encoding)

Log Transformation: Reduce right skew in a distribution

Feature Scaling: Put numerical features on a comparable scale (min-max normalization, standardization)

Downsampling: Reduce the size of the majority class

Upsampling: Increase the size of the minority class

SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic samples for the minority class
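
The two resampling ideas in the summary can be sketched with pandas alone. The `data` frame below is hypothetical (8 rows of class 0, 2 rows of class 1), and random sampling is only one simple way to rebalance. SMOTE is not shown here; it creates new synthetic minority rows rather than repeating existing ones and is provided by third-party packages such as imbalanced-learn.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 rows of class 0, 2 rows of class 1.
data = pd.DataFrame({
    'feature': range(10),
    'label': [0] * 8 + [1] * 2,
})
majority = data[data['label'] == 0]
minority = data[data['label'] == 1]

# Downsampling: draw as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=42)
downsampled = pd.concat([majority_down, minority])

# Upsampling: draw minority rows with replacement until the classes match.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

down_counts = downsampled['label'].value_counts().to_dict()
up_counts = upsampled['label'].value_counts().to_dict()
print(down_counts)
print(up_counts)
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the original class balance.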

### FEATURES
The various features of the cleaned dataset are explained below:
1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4) Departure Time: A derived categorical feature created by grouping departure times into bins. It has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10) Days Left: A derived feature calculated by subtracting the booking date from the trip date.
11) Price: The target variable; it stores the ticket price.
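
Two of the features in this list are derived: Days Left (trip date minus booking date) and the binned Departure Time. A minimal sketch of how such columns could be built with pandas. The `flights` frame, its column names, the bin edges, and the six label names are all assumptions for illustration; the list above only states that there are six time labels.

```python
import pandas as pd

# Hypothetical bookings; column names and values are illustrative only.
flights = pd.DataFrame({
    'booking_date': pd.to_datetime(['2022-02-10', '2022-02-01']),
    'trip_date': pd.to_datetime(['2022-02-11', '2022-03-20']),
    'departure_hour': [6, 21],
})

# Days Left: trip date minus booking date, expressed in whole days.
flights['days_left'] = (flights['trip_date'] - flights['booking_date']).dt.days

# Departure Time: six 4-hour bins. The edges and label names are assumptions.
labels = ['Late_Night', 'Early_Morning', 'Morning', 'Afternoon', 'Evening', 'Night']
flights['departure_time'] = pd.cut(
    flights['departure_hour'],
    bins=[0, 4, 8, 12, 16, 20, 24],
    labels=labels,
    right=False,  # intervals are [0, 4), [4, 8), ... so hour 0 lands in the first bin
)

print(flights[['days_left', 'departure_time']])
```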

**Column Breakdown:**
 
Airline – Name of the airline (e.g., IndiGo, Air India, Jet Airways)
 
Date_of_Journey – The flight's departure date
 
Source – Flight departure location
 
Destination – Flight arrival location
 
Route – Flight path (e.g., BLR → DEL)
 
Dep_Time – Flight departure time
 
Arrival_Time – Flight arrival time
 
Duration – Total travel duration (e.g., "2h 50m")
 
Total_Stops – Number of stops (e.g., "non-stop", "1 stop", "2 stops")
 
Additional_Info – Extra details (e.g., "No info")
 
Price – Flight ticket price (numeric)
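
Duration is stored as text such as "2h 50m", so it has to be converted to a number before a model can use it. A sketch of one possible conversion to total minutes; `duration_to_minutes` is a hypothetical helper, not part of pandas.

```python
import re

def duration_to_minutes(text):
    """Convert strings such as '2h 50m', '19h' or '45m' to total minutes."""
    hours = re.search(r'(\d+)\s*h', text)
    minutes = re.search(r'(\d+)\s*m', text)
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

# Sample values in the format described in the column breakdown.
examples = ['2h 50m', '19h', '7h 25m']
converted = [duration_to_minutes(d) for d in examples]
print(converted)
```

Once the data is loaded into a DataFrame `df`, the helper could be applied as `df['Duration'].apply(duration_to_minutes)`.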

In [40]:
df = pd.read_csv(r'C:\Users\CVR\Downloads\flight_price(Sheet1).csv')

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


In [43]:
df

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR ? DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU ? IXR ? BBI ? BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL ? LKO ? BOM ? COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU ? NAG ? BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR ? NAG ? DEL,16:50,21:35,4h 45m,1 stop,No info,13302
...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU ? BLR,19:55,22:25,2h 30m,non-stop,No info,4107
10679,Air India,27/04/2019,Kolkata,Banglore,CCU ? BLR,20:45,23:20,2h 35m,non-stop,No info,4145
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR ? DEL,08:20,11:20,3h,non-stop,No info,7229
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR ? DEL,11:30,14:10,2h 40m,non-stop,No info,12648


In [45]:
df.isna().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64

In [46]:
df = df.dropna()

In [47]:
df.isna().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              0
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        0
Additional_Info    0
Price              0
dtype: int64

In [52]:
df = df.drop_duplicates()

In [53]:
df

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR ? DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU ? IXR ? BBI ? BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL ? LKO ? BOM ? COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU ? NAG ? BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR ? NAG ? DEL,16:50,21:35,4h 45m,1 stop,No info,13302
...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU ? BLR,19:55,22:25,2h 30m,non-stop,No info,4107
10679,Air India,27/04/2019,Kolkata,Banglore,CCU ? BLR,20:45,23:20,2h 35m,non-stop,No info,4145
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR ? DEL,08:20,11:20,3h,non-stop,No info,7229
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR ? DEL,11:30,14:10,2h 40m,non-stop,No info,12648


In [55]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [56]:
%matplotlib inline

In [57]:
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR ? DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU ? IXR ? BBI ? BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL ? LKO ? BOM ? COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU ? NAG ? BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR ? NAG ? DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [59]:
df['Airline'].unique()

array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

In [61]:
df = df.copy()  # work on an explicit copy so later column assignments do not trigger SettingWithCopyWarning
df['Date_of_Journey'] = pd.to_datetime(df['Date_of_Journey'],format='%d/%m/%Y')


In [63]:
df.describe()

Unnamed: 0,Date_of_Journey,Price
count,10462,10462.0
mean,2019-05-04 13:38:33.056776960,9026.790289
min,2019-03-01 00:00:00,1759.0
25%,2019-03-27 00:00:00,5224.0
50%,2019-05-15 00:00:00,8266.0
75%,2019-06-06 00:00:00,12344.75
max,2019-06-27 00:00:00,79512.0
std,,4624.849541


In [65]:
df['Arrival_Time'].unique()

array(['01:10 22 Mar', '13:15', '04:25 10 Jun', ..., '06:50 10 Mar',
       '00:05 19 Mar', '21:20 13 Mar'], dtype=object)

In [72]:
df['Date'] = df['Date_of_Journey'].dt.day



In [73]:
df['Month'] = df['Date_of_Journey'].dt.month



In [74]:
df['Year'] = df['Date_of_Journey'].dt.year



In [75]:
df

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year
0,IndiGo,2019-03-24,Banglore,New Delhi,BLR ? DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019
1,Air India,2019-05-01,Kolkata,Banglore,CCU ? IXR ? BBI ? BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019
2,Jet Airways,2019-06-09,Delhi,Cochin,DEL ? LKO ? BOM ? COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019
3,IndiGo,2019-05-12,Kolkata,Banglore,CCU ? NAG ? BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019
4,IndiGo,2019-03-01,Banglore,New Delhi,BLR ? NAG ? DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,2019-04-09,Kolkata,Banglore,CCU ? BLR,19:55,22:25,2h 30m,non-stop,No info,4107,9,4,2019
10679,Air India,2019-04-27,Kolkata,Banglore,CCU ? BLR,20:45,23:20,2h 35m,non-stop,No info,4145,27,4,2019
10680,Jet Airways,2019-04-27,Banglore,Delhi,BLR ? DEL,08:20,11:20,3h,non-stop,No info,7229,27,4,2019
10681,Vistara,2019-03-01,Banglore,New Delhi,BLR ? DEL,11:30,14:10,2h 40m,non-stop,No info,12648,1,3,2019


In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10462 entries, 0 to 10682
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Airline          10462 non-null  object        
 1   Date_of_Journey  10462 non-null  datetime64[ns]
 2   Source           10462 non-null  object        
 3   Destination      10462 non-null  object        
 4   Route            10462 non-null  object        
 5   Dep_Time         10462 non-null  object        
 6   Arrival_Time     10462 non-null  object        
 7   Duration         10462 non-null  object        
 8   Total_Stops      10462 non-null  object        
 9   Additional_Info  10462 non-null  object        
 10  Price            10462 non-null  int64         
 11  Date             10462 non-null  int32         
 12  Month            10462 non-null  int32         
 13  Year             10462 non-null  int32         
dtypes: datetime64[ns](1), int32(3), int64(1), o

In [95]:
a=df['Route'][0].split()

In [96]:
a

['BLR', '?', 'DEL']

In [90]:
sep=df['Route'][0].split()[1]  # second token is the garbled separator; `a` keeps the full token list

In [91]:
sep

'?'

In [101]:
def replace(s):
    # s is the token list from split(), e.g. ['BLR', '?', 'DEL']; join origin and destination
    return s[0]+'->'+s[2]

In [102]:
replace(a)

'BLR->DEL'

In [103]:
a.index('?')

1

In [108]:
b = 'DEL ? GOI ? BOM ? COK'.replace('?','->')

In [109]:
b

'DEL -> GOI -> BOM -> COK'

In [110]:
df['Route'] = df['Route'].str.replace('?','->',regex=False)  # literal replacement; '?' is a regex metacharacter

In [111]:
df['Route']

0                      BLR -> DEL
1        CCU -> IXR -> BBI -> BLR
2        DEL -> LKO -> BOM -> COK
3               CCU -> NAG -> BLR
4               BLR -> NAG -> DEL
                   ...           
10678                  CCU -> BLR
10679                  CCU -> BLR
10680                  BLR -> DEL
10681                  BLR -> DEL
10682    DEL -> GOI -> BOM -> COK
Name: Route, Length: 10462, dtype: object

In [113]:
df['Arrival_Time'] = df['Arrival_Time'].apply(lambda x:x.split(' ')[0])  # keep only HH:MM, dropping a trailing date such as '22 Mar'

In [115]:
df['Arrival_hour']=df['Arrival_Time'].str.split(':').str[0]

In [116]:
df['Arrival_min']=df['Arrival_Time'].str.split(':').str[1]

In [117]:
df.head(2)

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_min
0,IndiGo,2019-03-24,Banglore,New Delhi,BLR -> DEL,22:20,01:10,2h 50m,non-stop,No info,3897,24,3,2019,1,10
1,Air India,2019-05-01,Kolkata,Banglore,CCU -> IXR -> BBI -> BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019,13,15


In [118]:
df.drop('Arrival_Time',axis=1,inplace=True)

In [123]:
df['Departure_hour']=df['Dep_Time'].str.split(':').str[0].astype(int)

In [124]:
df['Departure_minute']=df['Dep_Time'].str.split(':').str[1].astype(int)

In [126]:
df['Total_Stops'] = df['Total_Stops'].map({'non-stop':0,'2 stops':2,'1 stop':1,'3 stops':3,'4 stops':4})

In [127]:
df

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_min,Departure_hour,Departure_minute
0,IndiGo,2019-03-24,Banglore,New Delhi,BLR -> DEL,22:20,2h 50m,0,No info,3897,24,3,2019,01,10,22,20
1,Air India,2019-05-01,Kolkata,Banglore,CCU -> IXR -> BBI -> BLR,05:50,7h 25m,2,No info,7662,1,5,2019,13,15,5,50
2,Jet Airways,2019-06-09,Delhi,Cochin,DEL -> LKO -> BOM -> COK,09:25,19h,2,No info,13882,9,6,2019,04,25,9,25
3,IndiGo,2019-05-12,Kolkata,Banglore,CCU -> NAG -> BLR,18:05,5h 25m,1,No info,6218,12,5,2019,23,30,18,5
4,IndiGo,2019-03-01,Banglore,New Delhi,BLR -> NAG -> DEL,16:50,4h 45m,1,No info,13302,1,3,2019,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,2019-04-09,Kolkata,Banglore,CCU -> BLR,19:55,2h 30m,0,No info,4107,9,4,2019,22,25,19,55
10679,Air India,2019-04-27,Kolkata,Banglore,CCU -> BLR,20:45,2h 35m,0,No info,4145,27,4,2019,23,20,20,45
10680,Jet Airways,2019-04-27,Banglore,Delhi,BLR -> DEL,08:20,3h,0,No info,7229,27,4,2019,11,20,8,20
10681,Vistara,2019-03-01,Banglore,New Delhi,BLR -> DEL,11:30,2h 40m,0,No info,12648,1,3,2019,14,10,11,30
