**Pandas**

An open-source Python library for data analysis and manipulation.

It provides two core data structures: Series (1-D) and DataFrame (2-D).
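To make the distinction between the two structures concrete, here is a minimal sketch. The names `ages` and `people` and their values are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
ages = pd.Series([25, 30, 35], name="Age")

# A DataFrame is a two-dimensional table; each of its columns is a Series.
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

print(ages.ndim)             # 1
print(people.ndim)           # 2
print(type(people["Age"]))   # <class 'pandas.core.series.Series'>
```

Selecting a single column from a DataFrame gives back a Series, which is why the two structures share most of their methods.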

In [2]:
import pandas as pd

In [3]:
# Each key becomes a column; each list holds that column's values
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 25],
    "City": ["New York", "Los Angeles", "Chicago"]
}

df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   25      Chicago


**Creating a DataFrame from a List of Dictionaries**

In [4]:
data=[
    {"Name":"Alice","Age":25,"City":"New York"},
    {"Name":"Bob","Age":30,"City":"Los Angeles"},
    {"Name":"Charlie","Age":25,"City":"Chicago"}
]
df=pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   25      Chicago
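When the dictionaries do not all share the same keys, pandas still builds the table and fills the gaps with `NaN`. A small sketch, using made-up records with deliberately missing keys:

```python
import pandas as pd

# Hypothetical records: the first has no "City", the second has no "Age".
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "City": "Los Angeles"},
]
df_records = pd.DataFrame(records)

print(list(df_records.columns))           # ['Name', 'Age', 'City']
print(df_records["Age"].isna().tolist())  # [False, True]
```

Because `NaN` is a floating-point value, the `Age` column is stored as `float64` here rather than `int64`.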


**Creating a DataFrame from a CSV file**

In [4]:
df=pd.read_csv("dataset.csv")
print(df)

      Name   Age            City
0    Alice    25        New York
1      Bob    30     Los Angeles
2  Charlie    35         Chicago
3    David    40         Houston
4      Eve    45   San Francisco


**Viewing Data**

In [9]:
print(df.head())  #first 5 rows by default
print()
print(df.tail())  #last 5 rows
print()
print(df.head(10)) #top 10 rows

      Name   Age            City
0    Alice    25        New York
1      Bob    30     Los Angeles
2  Charlie    35         Chicago
3    David    40         Houston
4      Eve    45   San Francisco

      Name   Age            City
0    Alice    25        New York
1      Bob    30     Los Angeles
2  Charlie    35         Chicago
3    David    40         Houston
4      Eve    45   San Francisco

      Name   Age            City
0    Alice    25        New York
1      Bob    30     Los Angeles
2  Charlie    35         Chicago
3    David    40         Houston
4      Eve    45   San Francisco


In [10]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1    Age    5 non-null      int64 
 2    City   5 non-null      object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes
None


In [11]:
print(df.describe()) #descriptive statistics

             Age
count   5.000000
mean   35.000000
std     7.905694
min    25.000000
25%    30.000000
50%    35.000000
75%    40.000000
max    45.000000


**Selecting Columns**

In [7]:
print(df["Name"])
print()
print(df[["Name"," Age"]])


0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object

      Name   Age
0    Alice    25
1      Bob    30
2  Charlie    35
3    David    40
4      Eve    45


In [5]:
print(df)
print()
print(df.columns)
print()
print(df.index)

      Name   Age            City
0    Alice    25        New York
1      Bob    30     Los Angeles
2  Charlie    35         Chicago
3    David    40         Houston
4      Eve    45   San Francisco

Index(['Name', ' Age', ' City'], dtype='object')

RangeIndex(start=0, stop=5, step=1)
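The column labels `' Age'` and `' City'` carry a leading space, which usually means the CSV file has a space after each comma. The sketch below shows two ways to deal with that. The file contents are an assumption (the original `dataset.csv` is not shown), so an in-memory string stands in for the file.

```python
import io
import pandas as pd

# Assumed file contents: note the space after each comma.
csv_text = "Name, Age, City\nAlice, 25, New York\nBob, 30, Los Angeles\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))    # ['Name', ' Age', ' City']

# Option 1: skip the whitespace that follows each delimiter while parsing.
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(list(clean.columns))  # ['Name', 'Age', 'City']

# Option 2: strip the labels after loading (the string values keep their spaces).
renamed = raw.rename(columns=str.strip)
print(list(renamed.columns))  # ['Name', 'Age', 'City']
```

With clean labels, selections can be written as `df["Age"]` instead of `df[" Age"]`.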


In [8]:
df

      Name   Age            City
0    Alice    25        New York
1      Bob    30     Los Angeles
2  Charlie    35         Chicago
3    David    40         Houston
4      Eve    45   San Francisco





**Filtering Rows**

In [15]:
print(df[df[" Age"]>30])
print()
print(df[df["Name"]=="Charlie"])
print()
print(df.loc[2])  #loc is used to access a group of rows and columns by label
print()
print(df.loc[df["Name"]=="Charlie"," City"])  #rows where Name is "Charlie", column " City" only

      Name   Age            City
2  Charlie    35         Chicago
3    David    40         Houston
4      Eve    45   San Francisco

      Name   Age      City
2  Charlie    35   Chicago

Name      Charlie
 Age           35
 City     Chicago
Name: 2, dtype: object

2     Chicago
Name:  City, dtype: object
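Filters can also combine several conditions. The sketch below rebuilds a small DataFrame with the same rows as above (using column names without the leading space, which is an assumption made for readability) and shows `&`, `|`-style element-wise logic and `isin()`.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "San Francisco"],
})

# Wrap each condition in parentheses; & is element-wise AND, | is element-wise OR.
over_30_not_houston = people[(people["Age"] > 30) & (people["City"] != "Houston")]
print(over_30_not_houston["Name"].tolist())  # ['Charlie', 'Eve']

# isin() tests membership in a list of values.
selected = people[people["Name"].isin(["Alice", "Eve"])]
print(selected["Age"].tolist())  # [25, 45]
```

The parentheses matter: `&` binds more tightly than `>` in Python, so leaving them out raises an error or gives the wrong mask.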


**Adding New Columns**

In [17]:
df['Salary']=[20000,30000,40000,50000,60000]
print(df)

      Name   Age            City  Salary
0    Alice    25        New York   20000
1      Bob    30     Los Angeles   30000
2  Charlie    35         Chicago   40000
3    David    40         Houston   50000
4      Eve    45   San Francisco   60000
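A new column can also be computed from existing ones. In this sketch the `Bonus` and `Total` columns are hypothetical and exist only to illustrate the two styles: direct assignment changes the DataFrame in place, while `assign()` returns a new one.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Salary": [20000, 30000, 40000],
})

# Direct assignment adds the column to the existing DataFrame.
staff["Bonus"] = staff["Salary"] // 10

# assign() leaves `staff` untouched and returns a copy with the extra column.
with_total = staff.assign(Total=staff["Salary"] + staff["Bonus"])

print(staff["Bonus"].tolist())       # [2000, 3000, 4000]
print(with_total["Total"].tolist())  # [22000, 33000, 44000]
print("Total" in staff.columns)      # False
```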


**Modifying Existing Columns**

In [18]:
df[" Age"]=df[" Age"] + 1
print(df)

      Name   Age            City  Salary
0    Alice    26        New York   20000
1      Bob    31     Los Angeles   30000
2  Charlie    36         Chicago   40000
3    David    41         Houston   50000
4      Eve    46   San Francisco   60000


**Dropping Columns and Rows**

In [19]:
df=df.drop(columns=["Salary"])
print(df)

      Name   Age            City
0    Alice    26        New York
1      Bob    31     Los Angeles
2  Charlie    36         Chicago
3    David    41         Houston
4      Eve    46   San Francisco


In [20]:
df=df.drop(index=4)
print(df)

      Name   Age          City
0    Alice    26      New York
1      Bob    31   Los Angeles
2  Charlie    36       Chicago
3    David    41       Houston


**Grouping Data**

In [26]:
grouped=df.groupby(" City")
print(grouped[" Age"].mean())

 City
 Chicago        36.0
 Houston        41.0
 Los Angeles    31.0
 New York       26.0
Name:  Age, dtype: float64
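In the output above every city appears only once, so each group mean is just that row's age. Grouping is easier to see when keys repeat. The data in this sketch is invented for illustration:

```python
import pandas as pd

visits = pd.DataFrame({
    "City": ["Chicago", "Houston", "Chicago", "Houston", "Chicago"],
    "Age": [20, 30, 40, 50, 30],
})

# One mean per distinct city.
mean_age = visits.groupby("City")["Age"].mean()
print(mean_age["Chicago"])  # 30.0
print(mean_age["Houston"])  # 40.0

# size() counts the rows in each group.
group_sizes = visits.groupby("City").size()
print(group_sizes.to_dict())  # {'Chicago': 3, 'Houston': 2}
```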


**Aggregating Data**

In [28]:
aggregated=df.groupby(" City").agg({" Age": ["mean","min","max"]})
print(aggregated)

               Age        
              mean min max
 City                     
 Chicago      36.0  36  36
 Houston      41.0  41  41
 Los Angeles  31.0  31  31
 New York     26.0  26  26
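Passing a dictionary of lists to `agg()` produces two-level column labels, as in the output above. If flat column names are preferred, named aggregation is one alternative. The sketch uses invented data, and the output names `mean_age`, `min_age`, and `max_age` are arbitrary choices:

```python
import pandas as pd

visits = pd.DataFrame({
    "City": ["Chicago", "Chicago", "Houston"],
    "Age": [30, 40, 41],
})

# Each keyword is an output column: name=(input column, aggregation function).
summary = visits.groupby("City").agg(
    mean_age=("Age", "mean"),
    min_age=("Age", "min"),
    max_age=("Age", "max"),
)

print(list(summary.columns))              # ['mean_age', 'min_age', 'max_age']
print(summary.loc["Chicago", "mean_age"])  # 35.0
```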


**Merging DataFrames**

In [29]:
df1=pd.DataFrame(
    {
        "ID":[1,2,3],
        "Name":["Alice","Bob","Charlie"]
    }
)
df2=pd.DataFrame({
    "ID":[1,2,4] ,
    "Salary" : [50000,60000,70000]
}
)

merged_df=pd.merge(df1,df2,on="ID",how="inner")  #inner join: keep only IDs present in both DataFrames
print(merged_df)

   ID   Name  Salary
0   1  Alice   50000
1   2    Bob   60000


In [30]:
df1

   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie


In [31]:
df2

   ID  Salary
0   1   50000
1   2   60000
2   4   70000


In [32]:
merged_df

   ID   Name  Salary
0   1  Alice   50000
1   2    Bob   60000


In [33]:
merged_df2=pd.merge(df1,df2,on="ID",how="outer")
print(merged_df2)

   ID     Name   Salary
0   1    Alice  50000.0
1   2      Bob  60000.0
2   3  Charlie      NaN
3   4      NaN  70000.0
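Besides `inner` and `outer`, `how` also accepts `left` and `right`. The sketch below reuses the same two DataFrames (under the names `left_df` and `right_df`, chosen here for clarity) to show a left join, and uses `indicator=True` to record where each row of an outer join came from.

```python
import pandas as pd

left_df = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
right_df = pd.DataFrame({"ID": [1, 2, 4], "Salary": [50000, 60000, 70000]})

# Left join: keep every row of left_df; Salary is NaN where no match exists.
left_merged = pd.merge(left_df, right_df, on="ID", how="left")
print(left_merged["ID"].tolist())             # [1, 2, 3]
print(left_merged["Salary"].isna().tolist())  # [False, False, True]

# indicator=True adds a "_merge" column: 'both', 'left_only', or 'right_only'.
tagged = pd.merge(left_df, right_df, on="ID", how="outer", indicator=True)
origin = tagged.set_index("ID")["_merge"].astype(str)
print(origin[3])  # left_only
print(origin[4])  # right_only
```

The `Salary` column becomes `float64` in these results for the same reason as in the outer merge above: `NaN` cannot be stored in an integer column.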


**Joining DataFrames**

In [36]:
df1=pd.DataFrame(
    {
        "Name" : ["Alice","Bob"],
        "Age" : [20,21]
    },
    index=[0,1]
)
df2=pd.DataFrame(
    {
        "City" : ["Chicago","New York"]
    },
    index=[0,2]
)

joined_df=df1.join(df2,how="left")
print(joined_df)

    Name  Age     City
0  Alice   20  Chicago
1    Bob   21      NaN
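`join()` aligns rows on the index rather than on a column. As a sketch of what that means, the code below repeats the example with `how="outer"` and then reproduces the left join with `pd.merge()` on both indexes. The variable names are chosen here for illustration.

```python
import pandas as pd

names = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [20, 21]}, index=[0, 1])
cities = pd.DataFrame({"City": ["Chicago", "New York"]}, index=[0, 2])

# Outer join keeps every index label that appears in either DataFrame.
outer_joined = names.join(cities, how="outer")
print(outer_joined.index.tolist())  # [0, 1, 2]

# The same left join expressed with merge(), matching on both indexes.
via_merge = pd.merge(names, cities, left_index=True, right_index=True, how="left")
print(via_merge.equals(names.join(cities, how="left")))  # True
```

A reasonable rule of thumb is to use `join()` when the keys already live in the index and `merge()` when they live in ordinary columns.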
