In [None]:
import pandas as pd

<!-- # Exploratory Data Analysis (EDA)
**N.B.** All cells in this notebook are a part of the EDA process

<br>
 -->
EDA is performed for a variety of reasons: 

-   detecting patterns and relationships in data

-   generating questions or hypotheses

-   preparing data for machine learning models

<br>

**There's one requirement our data must satisfy regardless of our plans after performing EDA ↓↓**

**it must be representative of the population we wish to study**

## Class Imbalance

→ occurs when one class appears more frequently than others. This can bias results.

With categorical data, one of the most important considerations is how well each class (another term for label) is represented.

We can count frequencies of classes using `.value_counts()` and relative frequencies using `.value_counts(normalize=True)` 

Another way to look at class frequency is cross-tabulation (`pd.crosstab(index, columns)`), which lets us examine how often combinations of classes occur. By default, the values are the counts of each combination, but you can change this by specifying the `values` and `aggfunc` arguments together.

In [None]:
planes = pd.read_csv("https://raw.githubusercontent.com/MohamedMostafa259/Pandas-Notes/refs/heads/main/Data/planes.csv")
planes

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882.0
1,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218.0
2,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302.0
3,SpiceJet,24/06/2019,Kolkata,Banglore,CCU → BLR,09:00,11:25,2h 25m,non-stop,No info,3873.0
4,Jet Airways,12/03/2019,Banglore,New Delhi,BLR → BOM → DEL,18:55,10:25 13 Mar,15h 30m,1 stop,In-flight meal not included,11087.0
...,...,...,...,...,...,...,...,...,...,...,...
10655,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107.0
10656,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145.0
10657,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,,11:20,3h,non-stop,,7229.0
10658,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648.0


In [None]:
# Suppose we know that 40 percent of internal Indian flights go to Delhi.
print(planes["Destination"].value_counts(), '\n')
print(planes["Destination"].value_counts(normalize=True))
# The normalized counts are relative frequencies: Delhi accounts for only 11.82% of destinations in our dataset.
# This could suggest that our data is not representative of the population (in this case, internal flights in India).

Destination
Cochin       4391
Banglore     2773
Delhi        1219
New Delhi     888
Hyderabad     673
Kolkata       369
Name: count, dtype: int64 

Destination
Cochin       0.425773
Banglore     0.268884
Delhi        0.118200
New Delhi    0.086105
Hyderabad    0.065257
Kolkata      0.035780
Name: proportion, dtype: float64
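
The comparison described in the comments above can be made explicit in code. The following is a minimal sketch on a small made-up sample rather than the `planes` data. The 40 percent figure is the assumed population share from the example, and the names `expected_share` and `tolerance` (with its 10-point threshold) are hypothetical choices for illustration only.

```python
import pandas as pd

# Toy sample (made up for illustration; not the planes dataset)
destinations = pd.Series(
    ["Cochin"] * 5 + ["Banglore"] * 3 + ["Delhi"] * 2,
    name="Destination",
)

# Assumed population share for Delhi, taken from the example above
expected_share = 0.40
# Hypothetical threshold for flagging a gap; pick one that suits your study
tolerance = 0.10

# Relative frequency of each class in the sample
observed = destinations.value_counts(normalize=True)
# .get() falls back to 0.0 if the class never appears in the sample
observed_share = observed.get("Delhi", 0.0)
gap = observed_share - expected_share

print(f"Observed share: {observed_share:.2%}, expected: {expected_share:.2%}")
if abs(gap) > tolerance:
    print("Delhi's share differs noticeably from the assumed population share")
```

A gap like this does not prove the sample is biased, but it is a useful prompt to check how the data was collected before drawing conclusions.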


#### `pd.crosstab(index, columns)`

In [None]:
pd.crosstab(planes["Source"], planes["Destination"])
# We see that the most popular route is from Delhi to Cochin, with 4318 flights

Source / Destination,Banglore,Cochin,Delhi,Hyderabad,Kolkata,New Delhi
Banglore,0,0,1199,0,0,868
Chennai,0,0,0,0,364,0
Delhi,0,4318,0,0,0,0
Kolkata,2720,0,0,0,0,0
Mumbai,0,0,0,662,0,0
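
For larger tables, the most frequent combination can be located programmatically instead of by eye. This is a rough sketch on made-up toy data (the `routes` frame and its values are hypothetical, not the `planes` dataset); it finds the row holding the largest cell, then the column of that cell.

```python
import pandas as pd

# Toy routes (made up for illustration; not the planes dataset)
routes = pd.DataFrame({
    "Source": ["Delhi", "Delhi", "Delhi", "Kolkata", "Kolkata", "Mumbai"],
    "Destination": ["Cochin", "Cochin", "Cochin", "Banglore", "Banglore", "Hyderabad"],
})

# Frequency of each Source/Destination combination
counts = pd.crosstab(routes["Source"], routes["Destination"])

# Row whose largest cell is the largest overall, then the column of that cell
busiest_source = counts.max(axis=1).idxmax()
busiest_destination = counts.loc[busiest_source].idxmax()
n_flights = counts.loc[busiest_source, busiest_destination]

print(busiest_source, "->", busiest_destination, n_flights)
```

If several combinations tie for the maximum, `idxmax` returns only the first one it encounters, so ties are worth checking separately.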


##### Aggregated values with `pd.crosstab(index, columns, values=, aggfunc=)`
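The `values` and `aggfunc` arguments must be specified together: `values` supplies the data to aggregate, and `aggfunc` says how to aggregate it within each combination of classes.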

In [None]:
# Median ticket price for each Source/Destination combination
pd.crosstab(planes["Source"], planes["Destination"], values=planes["Price"], aggfunc="median")

Source / Destination,Banglore,Cochin,Delhi,Hyderabad,Kolkata,New Delhi
Banglore,,,4823.0,,,10976.5
Chennai,,,,,3850.0,
Delhi,,10262.0,,,,
Kolkata,9345.0,,,,,
Mumbai,,,,3342.0,,
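
To see why `values` and `aggfunc` go together, here is a small sketch on made-up toy data (the `fares` frame and its prices are hypothetical, not the `planes` dataset). It computes a median per combination, then shows that passing `values` without `aggfunc` raises a `ValueError`.

```python
import pandas as pd

# Toy fares (made up for illustration; not the planes dataset)
fares = pd.DataFrame({
    "Source": ["Delhi", "Delhi", "Delhi", "Kolkata", "Kolkata"],
    "Destination": ["Cochin", "Cochin", "Cochin", "Banglore", "Banglore"],
    "Price": [100.0, 200.0, 600.0, 50.0, 150.0],
})

# Median price per Source/Destination combination;
# combinations that never occur in the data show up as NaN
median_price = pd.crosstab(
    fares["Source"], fares["Destination"],
    values=fares["Price"], aggfunc="median",
)
print(median_price)

# Passing only one of the two arguments is rejected by pandas
try:
    pd.crosstab(fares["Source"], fares["Destination"], values=fares["Price"])
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

The median is a reasonable choice for prices because it is less sensitive to a few very expensive tickets than the mean, but any aggregation name that pandas accepts can be used instead.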
