# Pandas for Data Analysis

This notebook covers the core pandas operations used in everyday data analysis:
- DataFrames & Series
- Data exploration
- Handling missing & duplicate data
- Column transformations
- GroupBy aggregations
- Merge, concatenate, pivot, and melt operations

## Importing Pandas

In [1]:
import pandas as pd
import numpy as np

## Creating DataFrames

In [3]:
data = {
    "Name": ["John", "Peter", "Lisa"],
    "Age": [25, 38, 43],
    "Salary": [35000, 40000, 36000]
}

df = pd.DataFrame(data)
df


    Name  Age  Salary
0   John   25   35000
1  Peter   38   40000
2   Lisa   43   36000
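
The overview lists Series alongside DataFrames, so here is a minimal sketch of how the two relate. The variable names (`df_people`, `ages`, `salaries`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative data mirroring the example above.
df_people = pd.DataFrame({
    "Name": ["John", "Peter", "Lisa"],
    "Age": [25, 38, 43],
    "Salary": [35000, 40000, 36000],
})

# Selecting a single column returns a Series that shares the DataFrame's index.
ages = df_people["Age"]

# A Series can also be built directly, with an explicit index and a name.
salaries = pd.Series(
    [35000, 40000, 36000],
    index=["John", "Peter", "Lisa"],
    name="Salary",
)

print(type(ages).__name__)
print(salaries["Peter"])
```

A DataFrame can be thought of as a set of Series that share one index, which is why column selection and label-based lookup behave the same way on both.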


## Reading External Data

In [8]:
df = pd.read_excel("D:/1 Preparation/DATA ANALYSIS PREPARATION/Datasets/Datasets in 20 Excel Shortcuts to Speed Up Your Workflow youtube video description/expenses tracker.xlsx")
print(df.head())

        Date Category       Sub-Category   Amount Payment Mode
0 2023-02-01  Grocery  Fruits and Veggies   456.0         Cash
1 2023-02-02  Grocery                Milk    26.0          UPI
2 2023-02-03     Food          Restaurant   560.0          UPI
3 2023-02-04     Food  Fruits and Veggies   660.0         Cash
4 2023-02-05     Food              Zomato   400.0          UPI
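
The cell above depends on a local Excel file. As a self-contained sketch of the same idea, the example below swaps in `pd.read_csv` on an in-memory string; the sample rows and the name `expenses` are assumptions made for illustration, not the contents of the original workbook.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the expenses workbook.
csv_text = """Date,Category,Amount
2023-02-01,Grocery,456
2023-02-02,Grocery,26
2023-02-03,Food,560
"""

# parse_dates converts the Date column to datetime64 while reading.
expenses = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(expenses.dtypes)
print(expenses.head())
```

`pd.read_excel` accepts a path in the same way, but it needs an Excel engine such as `openpyxl` to be installed.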


## Exploring Data

In [14]:
print(df.head())
print()
print(df.tail())
print()
# info() prints its report directly and returns None, so it does not need print().
df.info()
print()
print(df.describe())

        Date Category       Sub-Category   Amount Payment Mode
0 2023-02-01  Grocery  Fruits and Veggies   456.0         Cash
1 2023-02-02  Grocery                Milk    26.0          UPI
2 2023-02-03     Food          Restaurant   560.0          UPI
3 2023-02-04     Food  Fruits and Veggies   660.0         Cash
4 2023-02-05     Food              Zomato   400.0          UPI

         Date    Category Sub-Category   Amount Payment Mode
24 2023-02-25  Essentials       Perfume  1500.0         Cash
25 2023-02-15  Essentials     Lunch Box   890.0         Cash
26 2023-02-10     Grocery     Chocolate   150.0         Cash
27 2023-02-15     Grocery         Maggi   140.0          UPI
28 2023-02-23        Food    Restaurant   780.0         Card

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           29 non-null     datetime64[ns]

## Checking Missing Values

In [15]:
print(df.isnull())
print()
print(df.isnull().sum())

     Date  Category  Sub-Category   Amount  Payment Mode
0   False     False          False   False         False
1   False     False          False   False         False
2   False     False          False   False         False
3   False     False          False   False         False
4   False     False          False   False         False
5   False     False           True   False         False
6   False     False          False   False         False
7   False     False          False   False         False
8   False     False          False   False         False
9   False     False          False   False         False
10  False     False          False   False         False
11  False     False          False   False         False
12  False     False           True    True         False
13  False     False          False   False         False
14  False     False           True   False         False
15  False     False          False   False         False
16  False     False          Fa
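
The full boolean mask is hard to scan on larger tables. The sketch below, on a small hypothetical frame named `sample`, shows a few more compact summaries that may be easier to read.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column.
sample = pd.DataFrame({
    "Category": ["Grocery", None, "Food", "Food"],
    "Amount": [456.0, 200.0, np.nan, 660.0],
})

# Number of missing values per column.
missing_counts = sample.isnull().sum()

# Share of missing values per column (the mean of a boolean mask).
missing_share = sample.isnull().mean()

# Only the rows that contain at least one missing value.
rows_with_gaps = sample[sample.isnull().any(axis=1)]

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```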

## Handling Missing Data

In [16]:
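# Drop every row that has at least one missing value.
print(df.dropna())
print()
# Fill each gap with the next valid value in the same column.
print(df.bfill())
print()
# Fill each gap with the previous valid value in the same column.
print(df.ffill())
print()
# fillna(..., inplace=True) returns None, so assign the result back before printing.
df["Category"] = df["Category"].fillna("Unknown")
print(df["Category"])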

         Date    Category       Sub-Category    Amount Payment Mode
0  2023-02-01     Grocery  Fruits and Veggies    456.0         Cash
1  2023-02-02     Grocery                Milk     26.0          UPI
2  2023-02-03        Food          Restaurant    560.0          UPI
3  2023-02-04        Food  Fruits and Veggies    660.0         Cash
4  2023-02-05        Food              Zomato    400.0          UPI
6  2023-02-06     Grocery           Chocolate    100.0         Cash
7  2023-02-08     Grocery      Bread and Milk     56.0         Cash
8  2023-02-09     Grocery             Grocery     30.0         Cash
9  2023-02-10        Food          Restaurant   1200.0          UPI
10 2023-02-11  Essentials             Shampoo    780.0         Cash
11 2023-02-11  Essentials            Food Oil    120.0          UPI
13 2023-02-14  Essentials      Salt and Sugar     50.0         Cash
15 2023-02-17     Clothes               Dress   1000.0         Card
16 2023-02-17       Bills          House Rent  1
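
As a compact illustration of the filling strategies, the sketch below applies them to a hypothetical three-row frame (`sample`). It assigns results back to the column rather than relying on `inplace=True`. Filling with the column mean is included as one common option, not something the original notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing category and one missing amount.
sample = pd.DataFrame({
    "Category": ["Grocery", None, "Food"],
    "Amount": [100.0, np.nan, 300.0],
})

# Replace missing labels with a placeholder and assign the result back.
sample["Category"] = sample["Category"].fillna("Unknown")

# Forward fill copies the previous valid value; backward fill copies the next one.
forward = sample["Amount"].ffill()
backward = sample["Amount"].bfill()

# Filling with the column mean is another option for numeric data.
filled_mean = sample["Amount"].fillna(sample["Amount"].mean())

print(sample["Category"].tolist())
print(forward.tolist(), backward.tolist(), filled_mean.tolist())
```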

## Handling Duplicate Data

In [17]:
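# Flag rows that exactly repeat an earlier row.
print(df.duplicated())
print()
# Count rows whose Category already appeared earlier (29 rows - 7 distinct values = 22).
print(df["Category"].duplicated().sum())
print()
# Keep only the first row for each Category value.
print(df.drop_duplicates("Category"))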

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
dtype: bool

22

         Date    Category       Sub-Category    Amount Payment Mode
0  2023-02-01     Grocery  Fruits and Veggies    456.0         Cash
2  2023-02-03        Food          Restaurant    560.0          UPI
5  2023-02-06     Unknown                 NaN    200.0          UPI
10 2023-02-11  Essentials             Shampoo    780.0         Cash
15 2023-02-17     Clothes               Dress   1000.0         Card
16 2023-02-17       Bills          House Rent  16000.0          UPI
19 2023-02-20     Colthes               Dress     10.0          UPI
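
The difference between whole-row and single-column duplicate checks is easier to see on a small frame. The sketch below uses a hypothetical `sample` table; the `subset` and `keep` parameters control which columns are compared and which occurrence survives.

```python
import pandas as pd

# Hypothetical frame: row 3 repeats row 0 exactly.
sample = pd.DataFrame({
    "Category": ["Grocery", "Grocery", "Food", "Grocery"],
    "Amount": [456.0, 26.0, 560.0, 456.0],
})

# True only where every column matches an earlier row.
full_row_dupes = sample.duplicated()

# True where the Category value alone has been seen before.
category_dupes = sample.duplicated(subset="Category")

# Remove exact duplicate rows, keeping the first occurrence.
deduped = sample.drop_duplicates()

# Keep the last row seen for each Category instead of the first.
last_per_category = sample.drop_duplicates(subset="Category", keep="last")

print(full_row_dupes.tolist())
print(category_dupes.sum())
print(last_per_category)
```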


## Column Transformation

In [22]:
df["Discount"] = np.where(df["Amount"] > 1000, "15%", "0%")
print(df)

         Date    Category       Sub-Category    Amount Payment Mode Discount
0  2023-02-01     Grocery  Fruits and Veggies    456.0         Cash       0%
1  2023-02-02     Grocery                Milk     26.0          UPI       0%
2  2023-02-03        Food          Restaurant    560.0          UPI       0%
3  2023-02-04        Food  Fruits and Veggies    660.0         Cash       0%
4  2023-02-05        Food              Zomato    400.0          UPI       0%
5  2023-02-06     Unknown                 NaN    200.0          UPI       0%
6  2023-02-06     Grocery           Chocolate    100.0         Cash       0%
7  2023-02-08     Grocery      Bread and Milk     56.0         Cash       0%
8  2023-02-09     Grocery             Grocery     30.0         Cash       0%
9  2023-02-10        Food          Restaurant   1200.0          UPI      15%
10 2023-02-11  Essentials             Shampoo    780.0         Cash       0%
11 2023-02-11  Essentials            Food Oil    120.0          UPI       0%
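
`np.where` handles a single condition. When more than two outcomes are needed, `np.select` is one possible extension. In the sketch below the 5% tier and the column names `Discount Rate` and `Net Amount` are assumptions for illustration and do not come from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts covering each tier.
orders = pd.DataFrame({"Amount": [456.0, 660.0, 1200.0]})

# Conditions are checked in order; the first one that matches wins.
conditions = [orders["Amount"] > 1000, orders["Amount"] > 500]
rates = [0.15, 0.05]

# Rows that match no condition receive the default rate.
orders["Discount Rate"] = np.select(conditions, rates, default=0.0)

# Numeric rates can be used directly in further arithmetic.
orders["Net Amount"] = orders["Amount"] * (1 - orders["Discount Rate"])

print(orders)
```

Storing the rate as a number rather than a string such as "15%" keeps the column usable in later calculations.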

## Creating Columns Using Functions

In [28]:
data = {"Months":["January","February","March","April"]}

a = pd.DataFrame(data)
print(a)
print()
def extract(value):
    return value[0:3]

a["Short Months"] = a["Months"].map(extract)
print(a)

     Months
0   January
1  February
2     March
3     April

     Months Short Months
0   January          Jan
1  February          Feb
2     March          Mar
3     April          Apr
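
For simple string slicing, the vectorized `.str` accessor can produce the same result without a helper function. The sketch below rebuilds the months table under the illustrative name `months`.

```python
import pandas as pd

months = pd.DataFrame({"Months": ["January", "February", "March", "April"]})

# .str[:3] slices every string in the column, like map(extract) above.
months["Short Months"] = months["Months"].str[:3]

print(months)
```

`map` with a custom function remains the more flexible choice when the logic does not reduce to a built-in string method.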


## GroupBy Operations

In [39]:
print(df.groupby("Category")["Amount"].sum())
print()
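# The column name in this workbook appears to end with a trailing space ("Sub-Category "), so the key must match it exactly.
print(df.groupby(["Category","Sub-Category "]).agg({"Amount": "sum"}))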

Category
Bills         18724.0
Clothes        1000.0
Colthes          10.0
Essentials     4365.0
Food           3845.0
Grocery        1028.0
Unknown        2090.0
Name: Amount, dtype: float64

                                Amount
Category   Sub-Category               
Bills      House Rent          16000.0
           Mobile               1650.0
Clothes    Dress                1000.0
Colthes    Dress                  10.0
Essentials Bedsheets            1025.0
           Food Oil              120.0
           Lunch Box             890.0
           Perfume              1500.0
           Salt and Sugar         50.0
           Shampoo               780.0
Food       Fruits and Veggies    660.0
           Restaurant           2555.0
           Zomato                630.0
Grocery    Bread and Milk         56.0
           Chocolate             320.0
           Fruits and Veggies    456.0
           Grocery                30.0
           Maggi                 140.0
           Milk            
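
Column names with stray whitespace are easy to mistype. A minimal sketch, assuming a hypothetical `sample` frame with the same kind of trailing space, strips the names first and then computes several statistics at once with named aggregation.

```python
import pandas as pd

# Hypothetical frame; note the trailing space in "Sub-Category ".
sample = pd.DataFrame({
    "Category": ["Food", "Food", "Grocery"],
    "Sub-Category ": ["Restaurant", "Zomato", "Milk"],
    "Amount": [560.0, 400.0, 26.0],
})

# Remove leading and trailing whitespace from every column name.
sample.columns = sample.columns.str.strip()

# Named aggregation: each keyword becomes an output column.
summary = sample.groupby("Category").agg(
    total=("Amount", "sum"),
    average=("Amount", "mean"),
    count=("Amount", "count"),
)

print(summary)
```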

## Merge & Concatenate

In [42]:
data = {"Emp Id":["E01","E03","E04","E06","E07","E08"],
       "Names":["Ram","Shyam","Ravi","Vishnu","Vishal","Krish"],
       "Age":[23,32,41,35,42,34]}
data2 = {"Emp Id":["E01","E02","E03","E04","E05","E06"],
        "Salary":[35000,40000,25000,30000,44000,37000]}

df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df1, df2, on="Emp Id", how="inner")
print(merged_df)
print()
print(pd.concat([df1, df2]))

  Emp Id   Names  Age  Salary
0    E01     Ram   23   35000
1    E03   Shyam   32   25000
2    E04    Ravi   41   30000
3    E06  Vishnu   35   37000

  Emp Id   Names   Age   Salary
0    E01     Ram  23.0      NaN
1    E03   Shyam  32.0      NaN
2    E04    Ravi  41.0      NaN
3    E06  Vishnu  35.0      NaN
4    E07  Vishal  42.0      NaN
5    E08   Krish  34.0      NaN
0    E01     NaN   NaN  35000.0
1    E02     NaN   NaN  40000.0
2    E03     NaN   NaN  25000.0
3    E04     NaN   NaN  30000.0
4    E05     NaN   NaN  44000.0
5    E06     NaN   NaN  37000.0
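
The inner join above keeps only employee IDs present in both tables. The sketch below, using smaller hypothetical tables (`employees`, `salaries`), shows how the `how` and `indicator` parameters of `pd.merge` reveal unmatched rows, and how `ignore_index=True` gives `pd.concat` a fresh index.

```python
import pandas as pd

# Hypothetical tables with partially overlapping IDs.
employees = pd.DataFrame({
    "Emp Id": ["E01", "E03", "E07"],
    "Names": ["Ram", "Shyam", "Vishal"],
})
salaries = pd.DataFrame({
    "Emp Id": ["E01", "E02", "E03"],
    "Salary": [35000, 40000, 25000],
})

# Outer join keeps every ID; _merge records which table each row came from.
outer = pd.merge(employees, salaries, on="Emp Id", how="outer", indicator=True)

# Left join keeps every employee, with NaN where no salary exists.
left = pd.merge(employees, salaries, on="Emp Id", how="left")

# ignore_index=True renumbers rows instead of repeating 0, 1, 2.
stacked = pd.concat([employees, employees], ignore_index=True)

print(outer)
print(left)
print(stacked)
```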


## Comparing DataFrames

In [46]:
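# Avoid naming the variable `dict`, which would shadow the built-in type.
fruit_data = {"Fruits":["mango","apples","banana","papaya"],
              "Price":[150,200,50,40],
              "Quantity":[10,14,10,20]}
df1 = pd.DataFrame(fruit_data)

# Copy df1 (not df) so both frames share the same columns and index.
df2 = df1.copy()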
df2.loc[0,"Price"]=120
df2.loc[1,"Price"]=160
df2.loc[3,"Price"]=70
df2.loc[0,"Quantity"]=18
df2.loc[1,"Quantity"]=20
df2.loc[3,"Quantity"]=15

print(df1.compare(df2))
print()
print(df1.compare(df2, keep_equal=True))

   Price        Quantity      
    self  other     self other
0  150.0  120.0     10.0  18.0
1  200.0  160.0     14.0  20.0
3   40.0   70.0     20.0  15.0

  Price       Quantity      
   self other     self other
0   150   120       10    18
1   200   160       14    20
3    40    70       20    15


## Pivoting Data

In [50]:
# Use a descriptive name instead of `dict`, which would shadow the built-in type.
house_data = {"keys":["k1","k2","k1","k2"],
              "Names":["John","Ben","David","Peter"],
              "Houses":["Red","Blue","Green","Red"],
              "Grades":["A","B","C","D"]}
df = pd.DataFrame(house_data)
print(df.pivot(index="keys", columns="Names", values="Houses"))

Names   Ben  David John Peter
keys                         
k1      NaN  Green  Red   NaN
k2     Blue    NaN  NaN   Red
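
`pivot` raises a `ValueError` when an index/column pair occurs more than once. In that situation `pivot_table` with an aggregation function is one way to proceed. The sketch below uses a hypothetical `sales` frame with a numeric `Points` column that is not part of the original example.

```python
import pandas as pd

# Hypothetical frame where the pair (k1, Red) appears twice.
sales = pd.DataFrame({
    "keys": ["k1", "k1", "k2", "k2"],
    "Houses": ["Red", "Red", "Blue", "Red"],
    "Points": [10, 5, 7, 3],
})

# pivot_table aggregates duplicate pairs; fill_value replaces empty cells.
table = sales.pivot_table(
    index="keys",
    columns="Houses",
    values="Points",
    aggfunc="sum",
    fill_value=0,
)

print(table)
```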


## Melting Data