
<h1 style="color:#00eeff; font-weight:bold "> 🐼50 Essential Pandas Functions in Python</h1>

---


<h4 style="color:#00eeff; font-weight:bold ">Connect with Me</h4>


🔗 [Facebook](https://www.facebook.com/memuna.gul.2025/)

🔗 [LinkedIn](https://www.linkedin.com/in/memunagul/)


---

## **→ Pandas Read CSV in Python**


In [1]:
# importing libraries
import pandas as pd 
import numpy as np

In [2]:

df=pd.read_csv("people_data.csv")
df

Unnamed: 0,First Name,Last Name,Sex,Email,Date of birth,Job Title
0,Shelby,Terrell,Male,elijah57@example.net,1945-10-26,Games developer
1,Phillip,Summers,Female,bethany14@example.com,1910-03-24,Phytotherapist
2,Kristine,Travis,Male,bthompson@example.com,1992-07-02,Homeopath
3,Yesenia,Martinez,Male,kaitlinkaiser@example.com,2017-08-03,Market researcher
4,Lori,Todd,Male,buchananmanuel@example.net,1938-12-01,Veterinary surgeon


In [3]:
# Read specific columns using read_csv
df=pd.read_csv("people_data.csv",usecols=["First Name","Last Name"])
df

Unnamed: 0,First Name,Last Name
0,Shelby,Terrell
1,Phillip,Summers
2,Kristine,Travis
3,Yesenia,Martinez
4,Lori,Todd


In [4]:
# Setting an Index Column (index_col)
df=pd.read_csv("people_data.csv",index_col="First Name")
df

First Name,Last Name,Sex,Email,Date of birth,Job Title
Shelby,Terrell,Male,elijah57@example.net,1945-10-26,Games developer
Phillip,Summers,Female,bethany14@example.com,1910-03-24,Phytotherapist
Kristine,Travis,Male,bthompson@example.com,1992-07-02,Homeopath
Yesenia,Martinez,Male,kaitlinkaiser@example.com,2017-08-03,Market researcher
Lori,Todd,Male,buchananmanuel@example.net,1938-12-01,Veterinary surgeon


In [5]:
#Handling Missing Values Using read_csv
df=pd.read_csv("people_data.csv",na_values=["N/A","Unknown"])
df

Unnamed: 0,First Name,Last Name,Sex,Email,Date of birth,Job Title
0,Shelby,Terrell,Male,elijah57@example.net,1945-10-26,Games developer
1,Phillip,Summers,Female,bethany14@example.com,1910-03-24,Phytotherapist
2,Kristine,Travis,Male,bthompson@example.com,1992-07-02,Homeopath
3,Yesenia,Martinez,Male,kaitlinkaiser@example.com,2017-08-03,Market researcher
4,Lori,Todd,Male,buchananmanuel@example.net,1938-12-01,Veterinary surgeon
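
The sample file above contains no placeholder strings, so the effect of `na_values` is not visible in that output. The sketch below uses a small, hypothetical in-memory CSV (the `csv_text` contents are made up for illustration) to show the difference. It assumes the usual pandas defaults, under which `"N/A"` is already treated as missing while `"Unknown"` is not.

```python
import io
import pandas as pd

# Hypothetical data: "N/A" and "Unknown" stand in for missing cells.
csv_text = """First Name,Sex,Job Title
Shelby,Male,Games developer
Phillip,N/A,Unknown
Kristine,Male,Homeopath
"""

# Default parsing: only the built-in markers (such as "N/A") become NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# Extra markers: "Unknown" is now treated as missing as well.
marked_df = pd.read_csv(io.StringIO(csv_text), na_values=["N/A", "Unknown"])

default_missing = int(default_df.isnull().sum().sum())
marked_missing = int(marked_df.isnull().sum().sum())
print(default_missing, marked_missing)
```

Comparing the two counts is a quick way to check that a custom marker is actually being picked up.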


In [6]:
# Using nrows in read_csv()
# nrows reads only the first n rows, which is useful for quick previews or partial loads of large datasets
df_2=pd.read_csv('people_data.csv', nrows=3)
df_2

Unnamed: 0,First Name,Last Name,Sex,Email,Date of birth,Job Title
0,Shelby,Terrell,Male,elijah57@example.net,1945-10-26,Games developer
1,Phillip,Summers,Female,bethany14@example.com,1910-03-24,Phytotherapist
2,Kristine,Travis,Male,bthompson@example.com,1992-07-02,Homeopath


In [7]:
# Using skiprows in read_csv()
# skiprows parameter skips unnecessary rows at the start of a file, which is useful for ignoring metadata or extra headers that are not part of the dataset
df= pd.read_csv("people_data.csv")
print("Previous Dataset: ")
print(df)
# using skiprows
df = pd.read_csv("people_data.csv", skiprows = [4,5])
print("Dataset After skipping rows: ")
print(df)

Previous Dataset: 
  First Name Last Name     Sex                       Email Date of birth  \
0     Shelby   Terrell    Male        elijah57@example.net    1945-10-26   
1    Phillip   Summers  Female       bethany14@example.com    1910-03-24   
2   Kristine    Travis    Male       bthompson@example.com    1992-07-02   
3    Yesenia  Martinez    Male   kaitlinkaiser@example.com    2017-08-03   
4       Lori      Todd    Male  buchananmanuel@example.net    1938-12-01   

            Job Title  
0     Games developer  
1      Phytotherapist  
2           Homeopath  
3   Market researcher  
4  Veterinary surgeon  
Dataset After skipping rows: 
  First Name Last Name     Sex                  Email Date of birth  \
0     Shelby   Terrell    Male   elijah57@example.net    1945-10-26   
1    Phillip   Summers  Female  bethany14@example.com    1910-03-24   
2   Kristine    Travis    Male  bthompson@example.com    1992-07-02   

         Job Title  
0  Games developer  
1   Phytotherapist  
2        Homeopath  

In [8]:
df.dtypes


First Name       object
Last Name        object
Sex              object
Email            object
Date of birth    object
Job Title        object
dtype: object

In [9]:
# Parsing Dates 
# converts date columns into datetime objects

df = pd.read_csv("people_data.csv", parse_dates=["Date of birth"])
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   First Name     5 non-null      object        
 1   Last Name      5 non-null      object        
 2   Sex            5 non-null      object        
 3   Email          5 non-null      object        
 4   Date of birth  5 non-null      datetime64[ns]
 5   Job Title      5 non-null      object        
dtypes: datetime64[ns](1), object(5)
memory usage: 372.0+ bytes
None


In [10]:
# Loading a CSV Data from a URL
url = "https://media.geeksforgeeks.org/wp-content/uploads/20241121154629307916/people_data.csv"
df = pd.read_csv(url)
df

Unnamed: 0,First Name,Last Name,Sex,Email,Date of birth,Job Title
0,Shelby,Terrell,Male,elijah57@example.net,1945-10-26,Games developer
1,Phillip,Summers,Female,bethany14@example.com,1910-03-24,Phytotherapist
2,Kristine,Travis,Male,bthompson@example.com,1992-07-02,Homeopath
3,Yesenia,Martinez,Male,kaitlinkaiser@example.com,2017-08-03,Market researcher
4,Lori,Todd,Male,buchananmanuel@example.net,1938-12-01,Veterinary surgeon


## **→ Saving a Pandas Dataframe as a CSV**

In [11]:

# list of name, degree, score
nme = ["aparna", "pankaj", "sudhir", "Geeku"]
deg = ["MBA", "BCA", "M.Tech", "MBA"]
scr = [90, 40, 80, 98]

# dictionary of lists (named `data` to avoid shadowing the built-in dict)
data = {'name': nme, 'degree': deg, 'score': scr}

df = pd.DataFrame(data)

# saving the dataframe
df.to_csv('file1.csv')


In [12]:
# Saving CSV Without Headers and Index

df.to_csv('file2.csv', header=False, index=False)

In [13]:
# Save the CSV file to a Specified Location

# df.to_csv(r'C:\Users\Admin\Desktop\file3.csv')

In [14]:
#DataFrame to CSV file using Tab Separator
users = {'Name': ['Amit', 'Cody', 'Drew'],
    'Age': [20,21,25]}

#create DataFrame
df = pd.DataFrame(users, columns=['Name','Age'])

print("Original DataFrame:")
print(df)
print('Data from Users.csv:')

df.to_csv('Users.csv', sep='\t', index=False,header=True)
# read the file back with the same separator so the columns are split correctly
new_df = pd.read_csv('Users.csv', sep='\t')

print(new_df)

Original DataFrame:
   Name  Age
0  Amit   20
1  Cody   21
2  Drew   25
Data from Users.csv:
   Name  Age
0  Amit   20
1  Cody   21
2  Drew   25


## **→ Loading Excel spreadsheet as pandas DataFrame**

In [15]:
# Import the excel file and call it excel_file
#excel_file = pd.ExcelFile('titanic.xlsx')

In [16]:
# Creating a dataframe using Excel files
#dataframe1 = pd.read_excel('SampleWork.xlsx')

#print(dataframe1)

In [17]:
#  Reading Specific Sheets using 'sheet_name' of read_excel() method. 
#dataframe2 = pd.read_excel('SampleWork.xlsx', sheet_name = 1)

#print(dataframe2)

In [18]:
# Reading Specific Columns using 'usecols' parameter of read_excel() method. 
#require_cols = [0, 3]

# only read specific columns from an excel file
#required_df = pd.read_excel('SampleWork.xlsx', usecols = require_cols)

#print(required_df)

In [19]:
# Handling missing data using 'na_values' parameter of the read_excel() method. 
#dataframe = pd.read_excel('SampleWork.xlsx', na_values = "Missing")
#print(dataframe)

In [20]:
# Skip rows when Reading an Excel File using 'skiprows' parameter of read_excel() method. 
#df = pd.read_excel('SampleWork.xlsx', sheet_name = 1, skiprows = 2)

#print(df)

In [21]:
# Reading all Sheets of the excel file together using 'sheet_name' parameter of the read_excel() method. 
#all_sheets_df = pd.read_excel('SampleWork.xlsx', na_values = "Missing",sheet_name = None)

#print(all_sheets_df)


## **→ Pandas head() Method**

The Pandas head() method returns the first n rows (5 by default) of a DataFrame or Series.

In [22]:
df=pd.read_csv('titanic.csv')

df.head() #by default top 5 rows

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [23]:
df.head(10) #top 10 rows 

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


## **→ Pandas tail() Method**

The .tail() method in Pandas returns the last n rows (5 by default) of a DataFrame or Series.

In [24]:
df.tail()

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


In [25]:
df.tail(10)

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
881,881,0,3,male,33.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
882,882,0,3,female,22.0,0,0,10.5167,S,Third,woman,False,,Southampton,no,True
883,883,0,2,male,28.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
884,884,0,3,male,25.0,0,0,7.05,S,Third,man,True,,Southampton,no,True
885,885,0,3,female,39.0,0,5,29.125,Q,Third,woman,False,,Queenstown,no,False
886,886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


## **→ Pandas sample() Method**

The Pandas sample() method randomly selects rows or columns from a DataFrame.

**Syntax**:

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)  

**Parameters:**

- **n:** int, number of rows (or columns) to return. Cannot be used with frac.
- **frac:** float, fraction of rows (or columns) to return; for example, 0.25 returns a quarter of them. Cannot be used with n.
- **replace:** bool, sample with replacement if True, so the same row can be picked more than once.
- **random_state:** int or numpy.random.RandomState, optional. A fixed integer returns the same sample on every run.
- **axis:** 0 or 'index' for rows and 1 or 'columns' for columns.
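
As a minimal sketch of how these parameters interact, the example below samples from a small made-up DataFrame (`demo` and its columns are hypothetical, not part of the Titanic data). It checks that a fixed `random_state` reproduces the same rows, that `frac` controls the sample size, and that `axis=1` samples columns instead of rows.

```python
import pandas as pd

# Hypothetical 10-row frame used only for illustration.
demo = pd.DataFrame({
    "id": list(range(10)),
    "score": [x * 1.5 for x in range(10)],
})

# The same seed gives the same rows on every call.
first = demo.sample(n=3, random_state=42)
second = demo.sample(n=3, random_state=42)
print(first.equals(second))

# frac=0.5 returns half of the 10 rows.
half = demo.sample(frac=0.5, random_state=0)
print(len(half))

# axis=1 samples columns rather than rows.
one_col = demo.sample(n=1, axis=1, random_state=0)
print(one_col.shape)
```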


In [26]:
# Sampling a Single Random Row
df.sample(n=1)

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
413,413,0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True


In [27]:
#Sample 25% of the DataFrame
df_25=df.sample(frac=0.25)

print("Original Rows : ", len(df))
print("Sample Rows (25%) :", len(df_25))
df_25


Original Rows :  891
Sample Rows (25%) : 223


Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
739,739,0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
378,378,0,3,male,20.0,0,0,4.0125,C,Third,man,True,,Cherbourg,no,True
646,646,0,3,male,19.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
110,110,0,1,male,47.0,0,0,52.0000,S,First,man,True,C,Southampton,no,True
202,202,0,3,male,34.0,0,0,6.4958,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
518,518,1,2,female,36.0,1,0,26.0000,S,Second,woman,False,,Southampton,yes,False
375,375,1,1,female,,1,0,82.1708,C,First,woman,False,,Cherbourg,yes,False
249,249,0,2,male,54.0,1,0,26.0000,S,Second,man,True,,Southampton,no,False
574,574,0,3,male,16.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True


In [28]:
#Sampling with Replacement and a Fixed Random State

#With replacement (replace=True): Each row can be selected multiple times. So you might see duplicate rows in the sample.

#Without replacement (replace=False) (default): Each row is selected only once. No duplicates in the sample.

#If you set a fixed random_state (like 42), you'll get the same sample every time you run the code.

df.sample(n=5, replace=True, random_state=42)


Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
102,102,0,1,male,21.0,0,1,77.2875,S,First,man,True,D,Southampton,no,False
435,435,1,1,female,14.0,1,2,120.0,S,First,child,False,B,Southampton,yes,False
860,860,0,3,male,41.0,2,0,14.1083,S,Third,man,True,,Southampton,no,False
270,270,0,1,male,,0,0,31.0,S,First,man,True,,Southampton,no,True
106,106,1,3,female,21.0,0,0,7.65,S,Third,woman,False,,Southampton,yes,True


## **→ Pandas info() Method**

The info() method prints a concise summary of a DataFrame: index range, column names, non-null counts, dtypes, and memory usage.

**Syntax:**

DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)

**Parameters:**

**verbose :** Whether to print the full summary. None follows the display.max_info_columns setting. True or False overrides the display.max_info_columns setting. 

**buf :** writable buffer, defaults to sys.stdout 

**max_cols :** Column count above which the short summary is printed instead of the full one. None follows the display.max_info_columns setting. 

**memory_usage :** Specifies whether total memory usage of the DataFrame elements (including index) should be displayed. None follows the display.memory_usage setting. True or False overrides the display.memory_usage setting. A value of ‘deep’ is equivalent to True, with deep introspection. Memory usage is shown in human-readable units (base-2 representation).

**show_counts :** Whether to show the non-null counts. If None, then only show if the frame is smaller than max_info_rows and max_info_columns. If True, always show counts. If False, never show counts. Older pandas releases called this parameter null_counts.
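
Because `info()` prints its summary instead of returning it, the `buf` parameter is the usual way to capture the text. The sketch below, on a hypothetical two-column frame, writes the summary into an `io.StringIO` buffer so it can be stored or searched as a string.

```python
import io
import pandas as pd

# Hypothetical frame with one missing value per column.
demo = pd.DataFrame({
    "name": ["a", "b", None],
    "value": [1.0, None, 3.0],
})

# Redirect the summary from sys.stdout into an in-memory buffer.
buffer = io.StringIO()
demo.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
```

The same approach works with an open file handle if the summary should be written to disk.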

In [29]:
#Print the Full Summary of the Dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   891 non-null    int64  
 1   survived     891 non-null    int64  
 2   pclass       891 non-null    int64  
 3   sex          891 non-null    object 
 4   age          714 non-null    float64
 5   sibsp        891 non-null    int64  
 6   parch        891 non-null    int64  
 7   fare         891 non-null    float64
 8   embarked     889 non-null    object 
 9   class        891 non-null    object 
 10  who          891 non-null    object 
 11  adult_male   891 non-null    bool   
 12  deck         203 non-null    object 
 13  embark_town  889 non-null    object 
 14  alive        891 non-null    object 
 15  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(5), object(7)
memory usage: 99.3+ KB


In [30]:
# Print the short summary of the dataframe by setting verbose = False
df.info(verbose = False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Columns: 16 entries, Unnamed: 0 to alone
dtypes: bool(2), float64(2), int64(5), object(7)
memory usage: 99.3+ KB


In [31]:
#  Print a Full Summary of the Dataframe and Exclude the Null-Counts
df.info(verbose=True, show_counts=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
 #   Column       Dtype  
---  ------       -----  
 0   Unnamed: 0   int64  
 1   survived     int64  
 2   pclass       int64  
 3   sex          object 
 4   age          float64
 5   sibsp        int64  
 6   parch        int64  
 7   fare         float64
 8   embarked     object 
 9   class        object 
 10  who          object 
 11  adult_male   bool   
 12  deck         object 
 13  embark_town  object 
 14  alive        object 
 15  alone        bool   
dtypes: bool(2), float64(2), int64(5), object(7)
memory usage: 99.3+ KB


## **→ Pandas dtypes attribute**

Pandas dtypes attribute returns a series with the data type of each column.

In [32]:
df.dtypes

Unnamed: 0       int64
survived         int64
pclass           int64
sex             object
age            float64
sibsp            int64
parch            int64
fare           float64
embarked        object
class           object
who             object
adult_male        bool
deck            object
embark_town     object
alive           object
alone             bool
dtype: object

## **→ Pandas size attribute**

**df.size** returns the total number of elements in a DataFrame or Series. For a DataFrame it is the product of the number of rows and columns; for a Series it is simply the number of elements (rows).



In [33]:
df.size

14256

## **→ Pandas shape attribute**

The **shape** attribute gives the number of rows (records) and columns (attributes) in the DataFrame.

**Return :** A tuple in the form of (rows, columns).

In [34]:
df.shape

(891, 16)

## **→ Pandas ndim attribute**

The **ndim** attribute returns the number of dimensions (or axes) of the DataFrame or Series.

**Return:**

- 1 for a Series (one-dimensional).

- 2 for a DataFrame (two-dimensional).
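
The `size`, `shape`, and `ndim` attributes are closely related. The short sketch below, using a hypothetical 3x2 frame, shows that `size` equals rows times columns for a DataFrame and that a single column (a Series) has one dimension.

```python
import pandas as pd

# Hypothetical frame: 3 rows, 2 columns.
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = demo.shape
print(demo.shape, demo.size, demo.ndim)

# Selecting one column yields a Series with a 1-tuple shape.
column = demo["a"]
print(column.shape, column.size, column.ndim)
```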


In [35]:
df.ndim # as df is 2D dataframe

2

In [36]:
df_new=df['fare']
df_new.ndim # as df_new is 1D (series)

1

## **→ Pandas describe() Method**

The describe() method in Pandas generates descriptive statistics (count, mean, spread, and percentiles) for DataFrame columns.

**Syntax:** DataFrame.describe(percentiles=None, include=None, exclude=None)

**Parameters:**

- **percentiles:** A list of numbers between 0 and 1, specifying which percentiles to return. The default is None, which returns the 25th, 50th, and 75th percentiles.
- **include:** A list of data types to include in the summary. You can specify data types such as int, float, object (for strings), etc. The default is None, meaning all numeric types are included.
- **exclude:** A list of data types to exclude from the summary. This parameter is also None by default, meaning no types are excluded.
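
To illustrate `include` and `exclude`, the sketch below runs `describe()` three ways on a small made-up frame (`demo` is hypothetical). Passing `exclude=[np.number]` is used here to summarise the non-numeric column, and `include="all"` covers every column at once.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame.
demo = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "sex": ["male", "female", "female", "female"],
})

# Default: numeric columns only.
numeric_summary = demo.describe()

# Drop numeric columns, leaving count/unique/top/freq for the text column.
text_summary = demo.describe(exclude=[np.number])

# Every column; statistics that do not apply are shown as NaN.
full_summary = demo.describe(include="all")

print(numeric_summary)
print(text_summary)
print(full_summary)
```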

In [37]:
df.describe()

Unnamed: 0.1,Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,445.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,222.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,445.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,667.5,1.0,3.0,38.0,1.0,0.0,31.0
max,890.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [38]:
# Customizing describe() Method with Percentiles
df.describe(percentiles=[0.20,0.40,0.60,0.80],include=['int','float'])

Unnamed: 0.1,Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,445.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,0.0,1.0,0.42,0.0,0.0,0.0
20%,178.0,0.0,1.0,19.0,0.0,0.0,7.8542
40%,356.0,0.0,2.0,25.0,0.0,0.0,10.5
50%,445.0,0.0,3.0,28.0,0.0,0.0,14.4542
60%,534.0,0.0,3.0,31.8,0.0,0.0,21.6792
80%,712.0,1.0,3.0,41.0,1.0,1.0,39.6875
max,890.0,1.0,3.0,80.0,8.0,6.0,512.3292


**Describing Series of Strings (Object Data Type)**

For string data, the describe() method provides:

**count:** Total number of non-null values.

**unique:** The number of unique values.

**top:** The most frequent value.

**freq:** The frequency of the most common value.

In [39]:
# Describing Series of Strings (Object Data Type)
# If you want to describe a column with string data (i.e., an object data type), the output will be different.

df_sex=df['sex'].describe()
df_sex

count      891
unique       2
top       male
freq       577
Name: sex, dtype: object

## **→Pandas unique() Method**

It returns all the unique values in a particular column.

**Syntax:** Series.unique()

**Return Type:** Numpy array of unique values in that column

In [40]:
df['class'].unique()

array(['Third', 'First', 'Second'], dtype=object)

## **→Pandas nunique() Method**


Returns the number of unique values in each column (or in each row when axis=1).


**Syntax:** DataFrame.nunique(axis=0, dropna=True) 


**Parameters:**


- **axis** : {0 or ‘index’, 1 or ‘columns’}, default 0
- **dropna** : Don’t include NaN in the counts.

**Returns** : nunique : Series
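
The effect of `dropna` and `axis` is easiest to see on a tiny example. The sketch below uses a hypothetical frame with one missing value; the column names only echo the Titanic data and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in "deck".
demo = pd.DataFrame({
    "deck": ["A", "B", np.nan, "A"],
    "pclass": [1, 2, 3, 1],
})

# Default: NaN is ignored, so "deck" has 2 unique values.
per_column = demo.nunique()

# dropna=False counts NaN as its own value.
per_column_with_nan = demo.nunique(dropna=False)

# axis=1 counts unique values within each row.
per_row = demo.nunique(axis=1)

print(per_column, per_column_with_nan, per_row, sep="\n")
```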

In [41]:
# Use nunique() to count the unique values in each column (axis=0, the default).
df.nunique()

Unnamed: 0     891
survived         2
pclass           3
sex              2
age             88
sibsp            7
parch            7
fare           248
embarked         3
class            3
who              3
adult_male       2
deck             7
embark_town      3
alive            2
alone            2
dtype: int64

In [42]:
# Use nunique(axis=1) to count the unique values in each row of a DataFrame.
df.nunique(axis=1)

0      11
1      10
2      12
3      12
4      12
       ..
886    12
887    12
888    12
889    11
890    12
Length: 891, dtype: int64

## **→Pandas isnull() Method**

It returns a boolean object of the same shape indicating which values are NA. Missing values are mapped to True and non-missing values are mapped to False.

**Syntax:** df.isnull()

**Parameter :**  None

**Returns :** DataFrame or Series of booleans with the same shape as the input
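
Since the result is a boolean mask, it combines naturally with `sum()`, `any()`, and `all()`. The sketch below, on a hypothetical three-row frame, counts the missing cells, the rows with at least one missing value, and the rows that are entirely missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
demo = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "deck": [np.nan, np.nan, "C"],
    "fare": [7.25, 71.28, 7.93],
})

mask = demo.isnull()

# True counts as 1, so summing the mask counts missing cells.
total_missing = int(mask.sum().sum())

# any(axis=1) flags rows with at least one missing value.
rows_with_any = int(mask.any(axis=1).sum())

# all(axis=1) flags rows where every value is missing.
rows_all_missing = int(mask.all(axis=1).sum())

print(total_missing, rows_with_any, rows_all_missing)
```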

In [43]:
#check which columns contain at least one missing value
df.isnull().any()

Unnamed: 0     False
survived       False
pclass         False
sex            False
age             True
sibsp          False
parch          False
fare           False
embarked        True
class          False
who            False
adult_male     False
deck            True
embark_town     True
alive          False
alone          False
dtype: bool

In [44]:
df.isnull().sum()

Unnamed: 0       0
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [45]:
#check for null in specific column
df['age'].isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: age, Length: 891, dtype: bool

In [46]:
df['age'].isnull().sum()

np.int64(177)

In [47]:
#rows with at least one null value; count() reports the non-null entries per column among them (709 rows here)
df[df.isnull().any(axis=1)].count()

Unnamed: 0     709
survived       709
pclass         709
sex            709
age            532
sibsp          709
parch          709
fare           709
embarked       707
class          709
who            709
adult_male     709
deck            21
embark_town    707
alive          709
alone          709
dtype: int64

In [48]:
#rows where every value is null; all counts are 0, so there are no such rows
df[df.isnull().all(axis=1)].count()

Unnamed: 0     0
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

## **→Pandas fillna() Method**

fillna() replaces missing (NaN) values in a DataFrame or Series with a constant, with per-column values, or with neighbouring values carried forward or backward.

In [49]:
data = {
    'Player': ['Babar', 'Rizwan', 'Saim', 'Fakhar', 'Chachu', 'Shadab'],
    'Runs': [569, 407, 345, np.nan, 259, np.nan],
    'Average': [56.90, 33.91, 31.36, 19.63, np.nan, np.nan],
    'Strike Rate': [142.6, 124.1, 157.5, 115.44, 193.2, 140.5],
    'Team': ['PZ', 'MS', 'PZ', 'LQ', np.nan, 'IU'],
    '50s': [5, 4, 2, 1, 1, None],
    'Wickets': [0, 0, 8, 0, 2, 5]
}

df = pd.DataFrame(data)

df

Unnamed: 0,Player,Runs,Average,Strike Rate,Team,50s,Wickets
0,Babar,569.0,56.9,142.6,PZ,5.0,0
1,Rizwan,407.0,33.91,124.1,MS,4.0,0
2,Saim,345.0,31.36,157.5,PZ,2.0,8
3,Fakhar,,19.63,115.44,LQ,1.0,0
4,Chachu,259.0,,193.2,,1.0,2
5,Shadab,,,140.5,IU,,5


In [50]:
#Replacement with a constant
df.fillna(0)

Unnamed: 0,Player,Runs,Average,Strike Rate,Team,50s,Wickets
0,Babar,569.0,56.9,142.6,PZ,5.0,0
1,Rizwan,407.0,33.91,124.1,MS,4.0,0
2,Saim,345.0,31.36,157.5,PZ,2.0,8
3,Fakhar,0.0,19.63,115.44,LQ,1.0,0
4,Chachu,259.0,0.0,193.2,0,1.0,2
5,Shadab,0.0,0.0,140.5,IU,0.0,5


In [51]:
#Column wise fill with custom values
df.fillna(
    {
        'Runs':100,
        'Average':df['Average'].mean()
    }
)

Unnamed: 0,Player,Runs,Average,Strike Rate,Team,50s,Wickets
0,Babar,569.0,56.9,142.6,PZ,5.0,0
1,Rizwan,407.0,33.91,124.1,MS,4.0,0
2,Saim,345.0,31.36,157.5,PZ,2.0,8
3,Fakhar,100.0,19.63,115.44,LQ,1.0,0
4,Chachu,259.0,35.45,193.2,,1.0,2
5,Shadab,100.0,35.45,140.5,IU,,5


In [52]:
#Forward fill (fillna(method='ffill') is deprecated in recent pandas; ffill() is the replacement)
df.ffill()


Unnamed: 0,Player,Runs,Average,Strike Rate,Team,50s,Wickets
0,Babar,569.0,56.9,142.6,PZ,5.0,0
1,Rizwan,407.0,33.91,124.1,MS,4.0,0
2,Saim,345.0,31.36,157.5,PZ,2.0,8
3,Fakhar,345.0,19.63,115.44,LQ,1.0,0
4,Chachu,259.0,19.63,193.2,LQ,1.0,2
5,Shadab,259.0,19.63,140.5,IU,1.0,5


In [53]:
#Backward fill (fillna(method='bfill') is deprecated in recent pandas; bfill() is the replacement)
df.bfill()


Unnamed: 0,Player,Runs,Average,Strike Rate,Team,50s,Wickets
0,Babar,569.0,56.9,142.6,PZ,5.0,0
1,Rizwan,407.0,33.91,124.1,MS,4.0,0
2,Saim,345.0,31.36,157.5,PZ,2.0,8
3,Fakhar,259.0,19.63,115.44,LQ,1.0,0
4,Chachu,259.0,,193.2,IU,1.0,2
5,Shadab,,,140.5,IU,,5


In [54]:
#limit the number of fills
df.fillna({
    'Average':df['Average'].mean(),
    'Runs':100
}, limit=1
)

Unnamed: 0,Player,Runs,Average,Strike Rate,Team,50s,Wickets
0,Babar,569.0,56.9,142.6,PZ,5.0,0
1,Rizwan,407.0,33.91,124.1,MS,4.0,0
2,Saim,345.0,31.36,157.5,PZ,2.0,8
3,Fakhar,100.0,19.63,115.44,LQ,1.0,0
4,Chachu,259.0,35.45,193.2,,1.0,2
5,Shadab,,,140.5,IU,,5


In [55]:
#fill each numeric column with its own mean (fillna accepts a Series keyed by column name)
df.fillna(df.mean(numeric_only=True))

Unnamed: 0,Player,Runs,Average,Strike Rate,Team,50s,Wickets
0,Babar,569.0,56.9,142.6,PZ,5.0,0
1,Rizwan,407.0,33.91,124.1,MS,4.0,0
2,Saim,345.0,31.36,157.5,PZ,2.0,8
3,Fakhar,395.0,19.63,115.44,LQ,1.0,0
4,Chachu,259.0,35.45,193.2,,1.0,2
5,Shadab,395.0,35.45,140.5,IU,2.6,5


In [56]:
#inplace modification 
df.fillna(0, inplace=True)
df

Unnamed: 0,Player,Runs,Average,Strike Rate,Team,50s,Wickets
0,Babar,569.0,56.9,142.6,PZ,5.0,0
1,Rizwan,407.0,33.91,124.1,MS,4.0,0
2,Saim,345.0,31.36,157.5,PZ,2.0,8
3,Fakhar,0.0,19.63,115.44,LQ,1.0,0
4,Chachu,259.0,0.0,193.2,0,1.0,2
5,Shadab,0.0,0.0,140.5,IU,0.0,5


## **→Pandas clip() Method**

clip() limits the values in a DataFrame or Series to a given range by setting a minimum threshold, a maximum threshold, or both.

**Syntax:** DataFrame.clip(lower=None, upper=None, axis=None)

**Parameters:** 
- **lower :**  Minimum threshold value. All values below this threshold will be set to it.
- **upper :** Maximum threshold value. All values above this threshold will be set to it.
- **axis :**  Align object with lower and upper along the given axis.
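
As a small sketch of these parameters, the example below clips a hypothetical Series with only a lower bound, only an upper bound, and both. It then applies per-column lower bounds to a DataFrame by passing a Series indexed by column name together with `axis=1`; the data and thresholds are invented for illustration.

```python
import pandas as pd

# Hypothetical scores, one below 0 and one above 100.
scores = pd.Series([-5, 40, 120])

only_lower = scores.clip(lower=0)
only_upper = scores.clip(upper=100)
both = scores.clip(0, 100)
print(only_lower.tolist(), only_upper.tolist(), both.tolist())

# Per-column lower bounds: the Series index is matched to column names.
frame = pd.DataFrame({"Runs": [50, 600], "Strike Rate": [110, 210]})
lower_bounds = pd.Series({"Runs": 100, "Strike Rate": 120})
clipped = frame.clip(lower=lower_bounds, axis=1)
print(clipped)
```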

In [57]:
df_dummy = pd.DataFrame({
    'A': [1, 5, 10],
    'B': [20, 15, 0]
})
df_dummy.clip(3,15)


Unnamed: 0,A,B
0,3,15
1,5,15
2,10,3


In [58]:
# specific lower and upper thresholds for each column of the dataframe
df_01= pd.DataFrame({
    'Runs': [50, 100, 350, 600],
    'Strike Rate': [110, 140, 175, 210]
})

df_01

Unnamed: 0,Runs,Strike Rate
0,50,110
1,100,140
2,350,175
3,600,210


In [59]:
df_01.clip(lower={'Runs' : 100, 'Strike Rate' : 120},
       upper={'Runs' : 300 , 'Strike Rate' : 170},
       axis=1)

Unnamed: 0,Runs,Strike Rate
0,100,120
1,100,140
2,300,170
3,300,170


## **→Pandas columns Attribute** 

`df.columns` attribute returns the column names of a DataFrame.

In [60]:
import seaborn as sns

In [61]:
df=sns.load_dataset('titanic')
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

## **→Pandas sort_values() Method** 

The `sort_values()` method sorts a DataFrame by one or more columns in ascending or descending order.
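
Sorting by more than one column takes a list of column names, and `ascending` can be a matching list of booleans. The sketch below uses a hypothetical four-row frame to sort by `pclass` ascending and, within each class, by `age` descending.

```python
import pandas as pd

# Hypothetical passengers; values are made up for illustration.
demo = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "pclass": [3, 1, 3, 1],
    "age": [22.0, 38.0, 4.0, 35.0],
})

# First key: pclass ascending. Second key: age descending within each class.
ordered = demo.sort_values(["pclass", "age"], ascending=[True, False])
print(ordered)
```

The original row labels are kept, so the index of `ordered` shows where each row came from.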

In [62]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [63]:
df.sort_values('age', axis=0, ascending=True)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
803,1,3,male,0.42,0,1,8.5167,C,Third,child,False,,Cherbourg,yes,False
755,1,2,male,0.67,1,1,14.5000,S,Second,child,False,,Southampton,yes,False
644,1,3,female,0.75,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
469,1,3,female,0.75,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
78,1,2,male,0.83,0,2,29.0000,S,Second,child,False,,Southampton,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
859,0,3,male,,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
863,0,3,female,,8,2,69.5500,S,Third,woman,False,,Southampton,no,False
868,0,3,male,,0,0,9.5000,S,Third,man,True,,Southampton,no,True
878,0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True


In [64]:
# Sorting DataFrame with Custom NaN Value Placement
df.sort_values('age', axis=0, ascending=True, na_position='last')


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
803,1,3,male,0.42,0,1,8.5167,C,Third,child,False,,Cherbourg,yes,False
755,1,2,male,0.67,1,1,14.5000,S,Second,child,False,,Southampton,yes,False
644,1,3,female,0.75,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
469,1,3,female,0.75,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
78,1,2,male,0.83,0,2,29.0000,S,Second,child,False,,Southampton,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
859,0,3,male,,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
863,0,3,female,,8,2,69.5500,S,Third,woman,False,,Southampton,no,False
868,0,3,male,,0,0,9.5000,S,Third,man,True,,Southampton,no,True
878,0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True


In [65]:
df.sort_values('age', axis=0, ascending=True, na_position='first')

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
17,1,2,male,,0,0,13.0000,S,Second,man,True,,Southampton,yes,True
19,1,3,female,,0,0,7.2250,C,Third,woman,False,,Cherbourg,yes,True
26,0,3,male,,0,0,7.2250,C,Third,man,True,,Cherbourg,no,True
28,1,3,female,,0,0,7.8792,Q,Third,woman,False,,Queenstown,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,0,3,male,70.5,0,0,7.7500,Q,Third,man,True,,Queenstown,no,True
493,0,1,male,71.0,0,0,49.5042,C,First,man,True,,Cherbourg,no,True
96,0,1,male,71.0,0,0,34.6542,C,First,man,True,A,Cherbourg,no,True
851,0,3,male,74.0,0,0,7.7750,S,Third,man,True,,Southampton,no,True


## **→Pandas value_counts() Method**

`value_counts()` returns the count of each unique value in a Series (for example, a single DataFrame column), sorted from most to least frequent by default.

In [66]:
#Counting unique string values
df['class'].value_counts()

class
Third     491
First     216
Second    184
Name: count, dtype: int64

In [67]:
df['pclass'].value_counts(normalize=True) # normalize=True returns proportions (fractions that sum to 1) instead of counts

pclass
3    0.551066
1    0.242424
2    0.206510
Name: proportion, dtype: float64

In [68]:
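# bins expects an integer (or a list of bin edges); True is treated as 1, so all ages fall into a single interval
df['age'].value_counts(bins=True)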

(0.339, 80.0]    714
Name: count, dtype: int64

In [69]:
df['age'].value_counts(ascending=True, dropna=True)

age
66.0     1
12.0     1
70.5     1
36.5     1
20.5     1
        ..
28.0    25
30.0    25
18.0    26
22.0    27
24.0    30
Name: count, Length: 88, dtype: int64
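
A minimal sketch of how `bins` behaves when it is given an integer. The `ages` Series below is made up for illustration and is not taken from the Titanic data. With `bins=3`, the value range is cut into three equal-width intervals and every value is counted in exactly one of them.

```python
import pandas as pd

# Hypothetical ages, chosen only for illustration
ages = pd.Series([2, 5, 21, 24, 38, 40, 61, 80])

# bins=3 splits the range 2..80 into three equal-width intervals;
# sort=False keeps the intervals in their natural order
binned = ages.value_counts(bins=3, sort=False)
print(binned)

# normalize=True turns the counts into fractions that sum to 1
shares = ages.value_counts(bins=3, normalize=True, sort=False)
print(shares)
```

Empty intervals are still listed (with a count of 0), so the result always has as many rows as bins.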

## **→Pandas nlargest() Method**

`nlargest()` returns the n largest values of a Series, or the n rows of a DataFrame with the largest values in the given column(s).


In [70]:
df.nlargest(5,"age") # the 5 rows with the largest age

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
630,1,1,male,80.0,0,0,30.0,S,First,man,True,A,Southampton,yes,True
851,0,3,male,74.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
96,0,1,male,71.0,0,0,34.6542,C,First,man,True,A,Cherbourg,no,True
493,0,1,male,71.0,0,0,49.5042,C,First,man,True,,Cherbourg,no,True
116,0,3,male,70.5,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


## **→Pandas nsmallest() Method**

`nsmallest()` returns the n smallest values of a Series, or the n rows of a DataFrame with the smallest values in the given column(s).

In [71]:
df['age'].nsmallest(5) # the 5 smallest age values

803    0.42
755    0.67
469    0.75
644    0.75
78     0.83
Name: age, dtype: float64
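
Both methods accept a `keep` argument that controls what happens when values are tied. The sketch below uses a small made-up `scores` Series (a hypothetical name) to show the default `keep='first'` next to `keep='all'`.

```python
import pandas as pd

# Hypothetical scores with a tie for first place
scores = pd.Series([90, 75, 90, 60, 75], index=['a', 'b', 'c', 'd', 'e'])

top_first = scores.nlargest(1)             # keep='first' (default): only the first tied value
top_all = scores.nlargest(1, keep='all')   # keep every value tied for first place
low_two = scores.nsmallest(2)              # the two smallest values

print(top_first, top_all, low_two, sep='\n\n')
```

With `keep='all'` the result can contain more than n entries, which is usually the safer choice when ties matter.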

## **→Pandas copy() Method**

`df.copy()` creates a duplicate of a DataFrame. The copy can be deep (the default), which duplicates the data, or shallow (`deep=False`), which shares the data with the original.

In [72]:
df_copy = df.copy()  # create a deep copy of the dataset (deep=True by default)

df_copy

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [73]:
df_shallow_copy = df.copy(deep=False) # shallow copy

df_shallow_copy

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
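
The two copies look identical when printed, so the difference only shows up when one of them is modified. The sketch below, using a small made-up frame (`original` and `deep` are illustrative names), checks that a deep copy is independent of its source.

```python
import pandas as pd

# Small illustrative frame (values are made up)
original = pd.DataFrame({'age': [22.0, 38.0, 26.0]})

deep = original.copy()       # deep=True by default: the data is duplicated
deep.loc[0, 'age'] = 99.0    # change the copy only

print(original.loc[0, 'age'])  # the original keeps its value
print(deep.loc[0, 'age'])      # the copy holds the new value
```

With `deep=False` the new object shares its data with the original. Whether a later write through one object is visible in the other depends on the pandas version and on whether Copy-on-Write is enabled, so it is worth verifying in your own environment before relying on it.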


## **→Pandas loc[ ] Method**

`df.loc[]` is used for label-based indexing: you access rows and columns by their labels (names), not by their position.

In [74]:
data = {
    'Player': ['Babar', 'Rizwan', 'Saim', 'Fakhar', 'Chachu'],
    'Runs': [569, 407, 345, 305, 259],
    'Average': [56.90, 33.91, 31.36, 19.63, 28.78],
    'Team': ['PZ', 'MS', 'PZ', 'LQ', 'IU']
}

df = pd.DataFrame(data)

df.set_index('Player', inplace=True)

df

Unnamed: 0_level_0,Runs,Average,Team
Player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Babar,569,56.9,PZ
Rizwan,407,33.91,MS
Saim,345,31.36,PZ
Fakhar,305,19.63,LQ
Chachu,259,28.78,IU


In [75]:
#selecting single row 
df.loc['Babar']  #Returns a single row as a Series.

Runs        569
Average    56.9
Team         PZ
Name: Babar, dtype: object

In [76]:
#selecting single row as a DataFrame (note the double brackets)
df.loc[['Babar']]  #Returns a DataFrame

Unnamed: 0_level_0,Runs,Average,Team
Player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Babar,569,56.9,PZ


In [77]:
#Selecting Multiple rows
df.loc[['Babar', 'Rizwan']]

Unnamed: 0_level_0,Runs,Average,Team
Player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Babar,569,56.9,PZ
Rizwan,407,33.91,MS


In [78]:
#selecting a row and specific column
df.loc['Babar', 'Runs']

np.int64(569)

In [79]:
df.loc['Babar', ['Average', 'Runs']] # Babar average and runs

Average    56.9
Runs        569
Name: Babar, dtype: object

In [80]:
#selecting a range of rows (label slices include both endpoints)
df.loc['Babar':'Chachu']

Unnamed: 0_level_0,Runs,Average,Team
Player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Babar,569,56.9,PZ
Rizwan,407,33.91,MS
Saim,345,31.36,PZ
Fakhar,305,19.63,LQ
Chachu,259,28.78,IU


In [81]:
#select all rows for a column
df.loc[:, 'Runs']

Player
Babar     569
Rizwan    407
Saim      345
Fakhar    305
Chachu    259
Name: Runs, dtype: int64

In [82]:
#conditional selection 
df.loc[df['Runs'] > 400]

Unnamed: 0_level_0,Runs,Average,Team
Player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Babar,569,56.9,PZ
Rizwan,407,33.91,MS


In [83]:
#update a value
df.loc['Babar', 'Runs'] = 500
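
Row conditions and column labels can be combined in a single `loc` call, both for reading and for assigning. A minimal sketch on a shortened version of the player table (`df_players` is an illustrative name):

```python
import pandas as pd

df_players = pd.DataFrame(
    {'Runs': [569, 407, 345], 'Team': ['PZ', 'MS', 'PZ']},
    index=['Babar', 'Rizwan', 'Saim'],
)

# Rows where Runs > 400, restricted to the 'Team' column
top_teams = df_players.loc[df_players['Runs'] > 400, 'Team']
print(top_teams)

# Conditional update: set Runs to 0 for every PZ player
df_players.loc[df_players['Team'] == 'PZ', 'Runs'] = 0
print(df_players)
```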

## **→Pandas iloc[ ] Method**

`.iloc[]` stands for integer location. It selects rows and columns by their integer positions (zero-based, just like list indexing in Python).

In [84]:
df=sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [85]:
#select a row by index number
df.iloc[[0]] #first row

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False


In [86]:
#select multiple rows
df.iloc[[0, 2, 4]]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [87]:
#select row range
df.iloc[0:5]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [88]:
#select row+column(cell value)
df.iloc[3,2]

'female'

In [89]:
#select all rows for a column
df.iloc[:, 2]

0        male
1      female
2      female
3      female
4        male
        ...  
886      male
887    female
888    female
889      male
890      male
Name: sex, Length: 891, dtype: object

In [90]:
#select specific columns for a single row
df.iloc[1,[0,2]]

survived         1
sex         female
Name: 1, dtype: object

In [91]:
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [92]:
#modify a value 
print("Before", df.iloc[2, 1])

df.iloc[2, 1] = 1

print("After", df.iloc[2, 1])

Before 3
After 1
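
One difference between the two indexers is easy to miss: `iloc` slices exclude the end position, while `loc` slices include the end label. A small sketch with a made-up frame (`df_demo` is a hypothetical name):

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_position = df_demo.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = df_demo.loc['a':'c']   # labels 'a' through 'c'; the end label is included
last_row = df_demo.iloc[-1]       # negative positions count from the end

print(len(by_position), len(by_label), last_row['value'])
```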


## **→Pandas rename() Method**

In [93]:
data = {
    'Player Name': ['Babar', 'Rizwan', 'Saim', 'Fakhar', 'Chachu'],
    'Total Runs': [569, 407, 345, 305, 259],
    'Average Runs': [56.90, 33.91, 31.36, 19.63, 28.78],
    'Player Team': ['PZ', 'MS', 'PZ', 'LQ', 'IU']
}

df = pd.DataFrame(data)

df

Unnamed: 0,Player Name,Total Runs,Average Runs,Player Team
0,Babar,569,56.9,PZ
1,Rizwan,407,33.91,MS
2,Saim,345,31.36,PZ
3,Fakhar,305,19.63,LQ
4,Chachu,259,28.78,IU


In [94]:
#Rename Columns
df_rename=df.rename(columns={'Player Name': 'Player', 'Total Runs': 'Runs', 'Average Runs': 'Average', 'Player Team': 'Team'})
df_rename

Unnamed: 0,Player,Runs,Average,Team
0,Babar,569,56.9,PZ
1,Rizwan,407,33.91,MS
2,Saim,345,31.36,PZ
3,Fakhar,305,19.63,LQ
4,Chachu,259,28.78,IU


In [95]:
#Rename Index(rows)
df_rename=df.rename(index={0: 'First', 1: 'Second', 2: 'Third', 3: 'Fourth', 4: 'Fifth'})
df_rename

Unnamed: 0,Player Name,Total Runs,Average Runs,Player Team
First,Babar,569,56.9,PZ
Second,Rizwan,407,33.91,MS
Third,Saim,345,31.36,PZ
Fourth,Fakhar,305,19.63,LQ
Fifth,Chachu,259,28.78,IU


In [96]:
#Rename in place
df.rename(columns={'Player Name' : 'Player'}, inplace=True)

df

Unnamed: 0,Player,Total Runs,Average Runs,Player Team
0,Babar,569,56.9,PZ
1,Rizwan,407,33.91,MS
2,Saim,345,31.36,PZ
3,Fakhar,305,19.63,LQ
4,Chachu,259,28.78,IU
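
Besides a dictionary, `rename()` also accepts a function that is applied to every label, which is handy for cleaning many column names at once. A minimal sketch (`df_cols` is an illustrative name):

```python
import pandas as pd

df_cols = pd.DataFrame({'Player Name': ['Babar'], 'Total Runs': [569]})

# A callable is applied to every column label
snake = df_cols.rename(columns=lambda name: name.lower().replace(' ', '_'))
print(list(snake.columns))

# Labels missing from the mapping are silently ignored by default
unchanged = df_cols.rename(columns={'Not A Column': 'x'})
print(list(unchanged.columns))
```

Because unknown labels are ignored by default, a typo in the mapping does not raise an error; passing `errors='raise'` makes `rename()` complain instead.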


## **→Pandas where() Method**

The `where()` method keeps the values where a condition is True and replaces the others with NaN, unless you specify a replacement with `other`.

In [97]:
data = {
    'Player': ['Babar', 'Rizwan', 'Saim', 'Fakhar', 'Chachu'],
    'Runs': [569, 407, 345, 305, 259],
    'Average': [56.90, 33.91, 31.36, 19.63, 28.78],
    'Team': ['PZ', 'MS', 'PZ', 'LQ', 'IU']
}

df = pd.DataFrame(data)

df

Unnamed: 0,Player,Runs,Average,Team
0,Babar,569,56.9,PZ
1,Rizwan,407,33.91,MS
2,Saim,345,31.36,PZ
3,Fakhar,305,19.63,LQ
4,Chachu,259,28.78,IU


In [98]:
#keep players with Runs>=400
df.where(df['Runs']>=400) # Keeps rows where Runs >= 400; every value in the other rows becomes NaN.

Unnamed: 0,Player,Runs,Average,Team
0,Babar,569.0,56.9,PZ
1,Rizwan,407.0,33.91,MS
2,,,,
3,,,,
4,,,,


In [99]:
#replace the other rows with a custom value
df.where(df['Runs']>=400, other='low') # Keeps rows where Runs >= 400; every value in the other rows becomes 'low'

Unnamed: 0,Player,Runs,Average,Team
0,Babar,569,56.9,PZ
1,Rizwan,407,33.91,MS
2,low,low,low,low
3,low,low,low,low
4,low,low,low,low


In [100]:
#use with multiple conditions
df.where((df['Runs'] > 200) & (df['Average'] > 30)) # Keeps only rows meeting both conditions; the other rows become NaN

Unnamed: 0,Player,Runs,Average,Team
0,Babar,569.0,56.9,PZ
1,Rizwan,407.0,33.91,MS
2,Saim,345.0,31.36,PZ
3,,,,
4,,,,
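
Applied to a whole DataFrame, `where()` blanks out every column of the rows that fail the condition. Calling it on a single Series limits the replacement to that column. The sketch below also shows `mask()`, which is the inverse of `where()`; the `runs` Series reuses the run totals from the table above.

```python
import pandas as pd

runs = pd.Series([569, 407, 345, 305, 259])

kept = runs.where(runs >= 400, other=0)    # keep values >= 400, replace the rest with 0
masked = runs.mask(runs >= 400, other=0)   # mask() replaces where the condition is True

print(kept.tolist())
print(masked.tolist())
```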


## **→ np.where**

In [101]:
#Adding a new column with Series.where (shown for comparison with np.where)
df['Tag'] = df['Runs'].where(df['Runs'] > 400, other='Low')  # Keeps the runs where Runs > 400, otherwise 'Low'
df

Unnamed: 0,Player,Runs,Average,Team,Tag
0,Babar,569,56.9,PZ,569
1,Rizwan,407,33.91,MS,407
2,Saim,345,31.36,PZ,Low
3,Fakhar,305,19.63,LQ,Low
4,Chachu,259,28.78,IU,Low


In [102]:
import numpy as np

# np.where(condition, value_if_true, value_if_false) chooses a label for every row
df['Tag'] = np.where(df['Runs'] > 400, 'Top', 'Average')

df

Unnamed: 0,Player,Runs,Average,Team,Tag
0,Babar,569,56.9,PZ,Top
1,Rizwan,407,33.91,MS,Top
2,Saim,345,31.36,PZ,Average
3,Fakhar,305,19.63,LQ,Average
4,Chachu,259,28.78,IU,Average
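
`np.where` chooses between two values. For three or more labels, one option is `np.select`, which takes a list of conditions and a matching list of choices. This is a sketch with thresholds picked only for illustration (`df_tags` is a hypothetical name):

```python
import numpy as np
import pandas as pd

df_tags = pd.DataFrame({'Runs': [569, 407, 345, 305, 259]})

# Illustrative thresholds: conditions are checked in order, first match wins
conditions = [df_tags['Runs'] > 500, df_tags['Runs'] > 300]
labels = ['Top', 'Average']

# Rows matching none of the conditions get the default label
df_tags['Tag'] = np.select(conditions, labels, default='Low')
print(df_tags)
```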


## **→Pandas drop() Method**

The `drop()` function is used to remove rows or columns from a DataFrame.

In [103]:
df=sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [104]:
#drop a column

df.drop('deck', axis=1, inplace=True)
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [105]:
#drop multiple columns
df.drop(['fare' ,'pclass','sibsp'], axis=1, inplace=True)
df.head()


Unnamed: 0,survived,sex,age,parch,embarked,class,who,adult_male,embark_town,alive,alone
0,0,male,22.0,0,S,Third,man,True,Southampton,no,False
1,1,female,38.0,0,C,First,woman,False,Cherbourg,yes,False
2,1,female,26.0,0,S,Third,woman,False,Southampton,yes,True
3,1,female,35.0,0,S,First,woman,False,Southampton,yes,False
4,0,male,35.0,0,S,Third,man,True,Southampton,no,True


In [106]:
#drop a row by index
df.drop(4)

Unnamed: 0,survived,sex,age,parch,embarked,class,who,adult_male,embark_town,alive,alone
0,0,male,22.0,0,S,Third,man,True,Southampton,no,False
1,1,female,38.0,0,C,First,woman,False,Cherbourg,yes,False
2,1,female,26.0,0,S,Third,woman,False,Southampton,yes,True
3,1,female,35.0,0,S,First,woman,False,Southampton,yes,False
5,0,male,,0,Q,Third,man,True,Queenstown,no,True
...,...,...,...,...,...,...,...,...,...,...,...
886,0,male,27.0,0,S,Second,man,True,Southampton,no,True
887,1,female,19.0,0,S,First,woman,False,Southampton,yes,True
888,0,female,,2,S,Third,woman,False,Southampton,no,False
889,1,male,26.0,0,C,First,man,True,Cherbourg,yes,True


In [107]:
#Drop Multiple rows
df.drop([0, 2, 4])

Unnamed: 0,survived,sex,age,parch,embarked,class,who,adult_male,embark_town,alive,alone
1,1,female,38.0,0,C,First,woman,False,Cherbourg,yes,False
3,1,female,35.0,0,S,First,woman,False,Southampton,yes,False
5,0,male,,0,Q,Third,man,True,Queenstown,no,True
6,0,male,54.0,0,S,First,man,True,Southampton,no,True
7,0,male,2.0,1,S,Third,child,False,Southampton,no,False
...,...,...,...,...,...,...,...,...,...,...,...
886,0,male,27.0,0,S,Second,man,True,Southampton,no,True
887,1,female,19.0,0,S,First,woman,False,Southampton,yes,True
888,0,female,,2,S,Third,woman,False,Southampton,no,False
889,1,male,26.0,0,C,First,man,True,Cherbourg,yes,True


In [108]:
#Drop using column Position 
df.drop(df.columns[3], axis=1)

Unnamed: 0,survived,sex,age,embarked,class,who,adult_male,embark_town,alive,alone
0,0,male,22.0,S,Third,man,True,Southampton,no,False
1,1,female,38.0,C,First,woman,False,Cherbourg,yes,False
2,1,female,26.0,S,Third,woman,False,Southampton,yes,True
3,1,female,35.0,S,First,woman,False,Southampton,yes,False
4,0,male,35.0,S,Third,man,True,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...
886,0,male,27.0,S,Second,man,True,Southampton,no,True
887,1,female,19.0,S,First,woman,False,Southampton,yes,True
888,0,female,,S,Third,woman,False,Southampton,no,False
889,1,male,26.0,C,First,man,True,Cherbourg,yes,True


In [109]:
#errors='ignore' skips labels that do not exist ('column' is not in df) instead of raising a KeyError
df.drop('column', axis=1, errors='ignore')

Unnamed: 0,survived,sex,age,parch,embarked,class,who,adult_male,embark_town,alive,alone
0,0,male,22.0,0,S,Third,man,True,Southampton,no,False
1,1,female,38.0,0,C,First,woman,False,Cherbourg,yes,False
2,1,female,26.0,0,S,Third,woman,False,Southampton,yes,True
3,1,female,35.0,0,S,First,woman,False,Southampton,yes,False
4,0,male,35.0,0,S,Third,man,True,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...
886,0,male,27.0,0,S,Second,man,True,Southampton,no,True
887,1,female,19.0,0,S,First,woman,False,Southampton,yes,True
888,0,female,,2,S,Third,woman,False,Southampton,no,False
889,1,male,26.0,0,C,First,man,True,Cherbourg,yes,True
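
Instead of `axis=0` or `axis=1`, `drop()` also accepts the `index=` and `columns=` keywords, which many readers find easier to follow. A minimal sketch on a made-up frame (`df_small` is an illustrative name); note that without `inplace=True` the original is left untouched.

```python
import pandas as pd

# Hypothetical data for illustration
df_small = pd.DataFrame({'age': [22, 38, 26], 'fare': [7.25, 71.28, 7.92], 'sex': ['m', 'f', 'f']})

no_fare = df_small.drop(columns=['fare'])         # same as drop('fare', axis=1)
no_first = df_small.drop(index=[0])               # same as drop(0)
both = df_small.drop(index=[0], columns=['sex'])  # rows and columns in one call

print(no_fare.columns.tolist(), no_first.index.tolist(), both.shape)
print(df_small.shape)  # the original frame is unchanged
```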


## **→Pandas groupby() Method**

`groupby()` splits a DataFrame into groups based on the values of one or more columns, so that each group can be analyzed or aggregated separately.

In [110]:
df = pd.read_csv('nba.csv')
df

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [111]:
#Grouping by a single column
df.groupby('Team').first() # ->  first non-null value of each column per group (per team)

Unnamed: 0_level_0,Name,Number,Position,Age,Height,Weight,College,Salary
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Atlanta Hawks,Kent Bazemore,24.0,SF,26.0,6-5,201.0,Old Dominion,2000000.0
Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
Brooklyn Nets,Bojan Bogdanovic,44.0,SG,27.0,6-8,216.0,Oklahoma State,3425510.0
Charlotte Hornets,Nicolas Batum,5.0,SG,27.0,6-8,200.0,Virginia Commonwealth,13125306.0
Chicago Bulls,Cameron Bairstow,41.0,PF,25.0,6-9,250.0,New Mexico,845059.0
Cleveland Cavaliers,Matthew Dellavedova,8.0,PG,25.0,6-4,198.0,Saint Mary's,1147276.0
Dallas Mavericks,Justin Anderson,1.0,SG,22.0,6-6,228.0,Virginia,1449000.0
Denver Nuggets,Darrell Arthur,0.0,PF,28.0,6-9,235.0,Kansas,2814000.0
Detroit Pistons,Joel Anthony,50.0,C,33.0,6-9,245.0,UNLV,2500000.0
Golden State Warriors,Leandro Barbosa,19.0,SG,33.0,6-3,194.0,North Carolina,2500000.0
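
`first()` does not simply return the first row of each group: it takes the first non-null value of every column separately. In the NBA data, Bojan Bogdanovic has no College value, which is why the Brooklyn Nets row above shows a college that belongs to a later Nets player. The sketch below reproduces this on a made-up frame (`df_teams` is a hypothetical name) and compares it with `head(1)`, which returns the literal first row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first player of team A has no college
df_teams = pd.DataFrame({
    'Team': ['A', 'A', 'B'],
    'Name': ['P1', 'P2', 'P3'],
    'College': [np.nan, 'Duke', 'UCLA'],
})

first_values = df_teams.groupby('Team').first()   # first non-null value per column
first_rows = df_teams.groupby('Team').head(1)     # literal first row of each group

print(first_values)
print(first_rows)
```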


In [112]:
#Grouping by multiple columns
df.groupby(['Team', 'Position']).first()


Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Number,Age,Height,Weight,College,Salary
Team,Position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Atlanta Hawks,C,Al Horford,15.0,30.0,6-10,245.0,Florida,12000000.0
Atlanta Hawks,PF,Kris Humphries,43.0,31.0,6-9,235.0,Minnesota,1000000.0
Atlanta Hawks,PG,Dennis Schroder,17.0,22.0,6-1,172.0,Wake Forest,1763400.0
Atlanta Hawks,SF,Kent Bazemore,24.0,26.0,6-5,201.0,Old Dominion,2000000.0
Atlanta Hawks,SG,Tim Hardaway Jr.,10.0,24.0,6-6,205.0,Michigan,1304520.0
...,...,...,...,...,...,...,...,...
Washington Wizards,C,Marcin Gortat,13.0,32.0,6-11,240.0,North Carolina State,11217391.0
Washington Wizards,PF,Drew Gooden,90.0,34.0,6-10,250.0,Kansas,3300000.0
Washington Wizards,PG,Ramon Sessions,7.0,30.0,6-3,190.0,Nevada,2170465.0
Washington Wizards,SF,Jared Dudley,1.0,30.0,6-7,225.0,Boston College,4375000.0


In [113]:
#first 2 Players per team
df.groupby('Team').head(2)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
15,Bojan Bogdanovic,Brooklyn Nets,44.0,SG,27.0,6-8,216.0,,3425510.0
16,Markel Brown,Brooklyn Nets,22.0,SG,24.0,6-3,190.0,Oklahoma State,845059.0
30,Arron Afflalo,New York Knicks,4.0,SG,30.0,6-5,210.0,UCLA,8000000.0
31,Lou Amundson,New York Knicks,17.0,PF,33.0,6-9,220.0,UNLV,1635476.0
46,Elton Brand,Philadelphia 76ers,42.0,PF,37.0,6-9,254.0,Duke,
47,Isaiah Canaan,Philadelphia 76ers,0.0,PG,25.0,6-0,201.0,Murray State,947276.0
61,Bismack Biyombo,Toronto Raptors,8.0,C,23.0,6-9,245.0,,2814000.0
62,Bruno Caboclo,Toronto Raptors,20.0,SF,20.0,6-9,205.0,,1524000.0


In [114]:
#two youngest players per team (rows sorted by Team, then by ascending Age)
df.sort_values(['Team', 'Age'], ascending=[True, True]).groupby('Team').head(2)


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
318,Dennis Schroder,Atlanta Hawks,17.0,PG,22.0,6-1,172.0,,1763400.0
310,Tim Hardaway Jr.,Atlanta Hawks,10.0,SG,24.0,6-6,205.0,Michigan,1304520.0
13,James Young,Boston Celtics,13.0,SG,20.0,6-6,215.0,Kentucky,1749840.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
18,Rondae Hollis-Jefferson,Brooklyn Nets,24.0,SG,21.0,6-7,220.0,Arizona,1335480.0
24,Chris McCullough,Brooklyn Nets,1.0,PF,21.0,6-11,200.0,Syracuse,1140240.0
328,Aaron Harrison,Charlotte Hornets,9.0,SG,21.0,6-6,210.0,Kentucky,525093.0
332,Michael Kidd-Gilchrist,Charlotte Hornets,14.0,SF,22.0,6-7,232.0,Kentucky,6331404.0
163,Bobby Portis,Chicago Bulls,5.0,PF,21.0,6-11,230.0,Arkansas,1391160.0
155,Cristiano Felicio,Chicago Bulls,6.0,PF,23.0,6-10,275.0,,525093.0


**Applying Aggregation with GroupBy**

Aggregation is one of the most common operations when using groupby. After grouping the data, you can apply functions like `sum()`, `mean()`, `min()`, `max()`, and more.

In [115]:
df.groupby(['Team', 'Position']).agg(
    total_salary=('Salary', 'sum'),
    avg_salary=('Salary', 'mean'),
    highest_salary=('Salary', 'max' ),
    player_count=('Name', 'count')
)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_salary,avg_salary,highest_salary,player_count
Team,Position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Atlanta Hawks,C,22756250.0,7.585417e+06,12000000.0,3
Atlanta Hawks,PF,23952268.0,5.988067e+06,18671659.0,4
Atlanta Hawks,PG,9763400.0,4.881700e+06,8000000.0,2
Atlanta Hawks,SF,6000000.0,3.000000e+06,4000000.0,2
Atlanta Hawks,SG,10431032.0,2.607758e+06,5746479.0,4
...,...,...,...,...,...
Washington Wizards,C,24490429.0,8.163476e+06,13000000.0,3
Washington Wizards,PF,11300000.0,5.650000e+06,8000000.0,2
Washington Wizards,PG,18022415.0,9.011208e+06,15851950.0,2
Washington Wizards,SF,11158800.0,2.789700e+06,4662960.0,4
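
The same named-aggregation pattern on a tiny made-up frame makes the results easy to check by hand. It also shows that `'count'` skips missing values while `'size'` counts every row (`df_pay` and the output column names are illustrative).

```python
import numpy as np
import pandas as pd

# Hypothetical salaries; one value is missing
df_pay = pd.DataFrame({
    'Team': ['A', 'A', 'A', 'B'],
    'Salary': [100.0, 300.0, np.nan, 50.0],
})

summary = df_pay.groupby('Team').agg(
    total_salary=('Salary', 'sum'),   # NaN is skipped
    avg_salary=('Salary', 'mean'),    # mean of the non-missing values
    n_salaries=('Salary', 'count'),   # non-missing values only
    n_rows=('Salary', 'size'),        # all rows, including NaN
).reset_index()  # turn the group labels back into a regular column

print(summary)
```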


## **→Pandas corr() Method**

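The `.corr()` method calculates the pairwise correlation coefficient (Pearson by default) between the numeric columns of a DataFrame.
It tells you how strongly two columns are linearly related: values near 1 or -1 indicate a strong relationship, values near 0 a weak one.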

In [116]:
df.select_dtypes(include='number').corr()

Unnamed: 0,Number,Age,Weight,Salary
Number,1.0,0.028724,0.206921,-0.112386
Age,0.028724,1.0,0.087183,0.213459
Weight,0.206921,0.087183,1.0,0.138321
Salary,-0.112386,0.213459,0.138321,1.0
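
To get a feel for the scale, the sketch below builds a made-up frame (`df_lin`, a hypothetical name) with one column that rises exactly with `x` and one that falls exactly as `x` rises, which gives the two extreme values of the coefficient.

```python
import pandas as pd

# Hypothetical data with perfectly linear relationships
df_lin = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'double_x': [2, 4, 6, 8],     # rises exactly with x
    'minus_x': [-1, -2, -3, -4],  # falls exactly as x rises
})

corr = df_lin.corr()
print(corr)

# Correlation between two specific columns
r = df_lin['x'].corr(df_lin['minus_x'])
print(r)
```

Compared with these extremes, the NBA coefficients above (all between about -0.11 and 0.21 off the diagonal) indicate only weak linear relationships.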


## **→Pandas query() Method**

The `query()` method filters the rows of a DataFrame using a condition written as a string.

In [117]:
data = {
    'Player': ['Babar', 'Rizwan', 'Saim', 'Fakhar', 'Iftikhar'],
    'Team': ['PZ', 'MS', 'PZ', 'LQ', 'IU'],
    'Runs': [569, 407, 345, 305, 259],
    'Average': [56.90, 33.91, 31.36, 19.63, 28.78],
    'Age': [29, 31, 21, 32, 38]
}

df = pd.DataFrame(data)

df

Unnamed: 0,Player,Team,Runs,Average,Age
0,Babar,PZ,569,56.9,29
1,Rizwan,MS,407,33.91,31
2,Saim,PZ,345,31.36,21
3,Fakhar,LQ,305,19.63,32
4,Iftikhar,IU,259,28.78,38


In [118]:
#players with runs>400
df.query('Runs > 400')

Unnamed: 0,Player,Team,Runs,Average,Age
0,Babar,PZ,569,56.9,29
1,Rizwan,MS,407,33.91,31


In [119]:
#players from 'PZ'
df.query('Team == "PZ"')

Unnamed: 0,Player,Team,Runs,Average,Age
0,Babar,PZ,569,56.9,29
2,Saim,PZ,345,31.36,21


In [120]:
#Multiple conditions AND
df.query('Runs > 400 & Team == "MS"')

df.query('Runs > 400 and Team == "MS"') #same output

Unnamed: 0,Player,Team,Runs,Average,Age
1,Rizwan,MS,407,33.91,31


In [121]:
#OR Condition
df.query('Runs > 400 | Team == "MS"') # OR condition

Unnamed: 0,Player,Team,Runs,Average,Age
0,Babar,PZ,569,56.9,29
1,Rizwan,MS,407,33.91,31


In [122]:
#using variable in query
x='PZ'
df.query('Team == @x') # @ refers to a Python variable

Unnamed: 0,Player,Team,Runs,Average,Age
0,Babar,PZ,569,56.9,29
2,Saim,PZ,345,31.36,21
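
Two more pieces of query syntax are often useful: backticks for column names that contain spaces, and `in` for membership tests. A minimal sketch (`df_q` is an illustrative name):

```python
import pandas as pd

df_q = pd.DataFrame({
    'Team': ['PZ', 'MS', 'PZ', 'LQ'],
    'Total Runs': [569, 407, 345, 305],
})

# Backticks quote a column name that contains a space
high = df_q.query('`Total Runs` > 400')

# "in" tests membership in a list
two_teams = df_q.query('Team in ["MS", "LQ"]')

print(high)
print(two_teams)
```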


## **→Pandas insert() Method**

The `.insert()` function inserts a new column into a DataFrame at the specified column index (position), without overwriting existing columns.

**Syntax:** `df.insert(loc, column, value, allow_duplicates=False)`

In [123]:
data={
    'Player' : ['Babar', 'Rizwan', 'Saim', 'Fakhar', 'Chachu'],
    'Runs' : [569, 407, 345, 305, 259],
    'Average' : [56.90, 33.91, 31.36, 19.63, 28.78],
    'Team' : ['PZ', 'MS', 'PZ', 'LQ', 'IU']     
}
df=pd.DataFrame(data)

df

Unnamed: 0,Player,Runs,Average,Team
0,Babar,569,56.9,PZ
1,Rizwan,407,33.91,MS
2,Saim,345,31.36,PZ
3,Fakhar,305,19.63,LQ
4,Chachu,259,28.78,IU


In [124]:
#insert 'Age' column at position 2
df.insert(2, 'Age', [29, 31, 21, 32, 38])

df

Unnamed: 0,Player,Runs,Age,Average,Team
0,Babar,569,29,56.9,PZ
1,Rizwan,407,31,33.91,MS
2,Saim,345,21,31.36,PZ
3,Fakhar,305,32,19.63,LQ
4,Chachu,259,38,28.78,IU
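
Hard-coding the position can break when columns are added or reordered. One option, sketched below, is to look up the position of an existing column with `df.columns.get_loc()`. The sketch also shows that `insert()` modifies the DataFrame in place and raises a `ValueError` if the column already exists (`df_ins` is an illustrative name).

```python
import pandas as pd

df_ins = pd.DataFrame({'Player': ['Babar', 'Rizwan'], 'Runs': [569, 407], 'Team': ['PZ', 'MS']})

# Place 'Age' right after 'Runs' without hard-coding the position
position = df_ins.columns.get_loc('Runs') + 1
df_ins.insert(position, 'Age', [29, 31])
print(list(df_ins.columns))

# A second insert with the same name is rejected
try:
    df_ins.insert(0, 'Age', [0, 0])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```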


## **→Pandas sum() Method**

In [125]:
data={
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}

df=pd.DataFrame(data)
df

Unnamed: 0,A,B,C
0,1,5,9
1,2,6,10
2,3,7,11
3,4,8,12


In [126]:
#Sum along column(axis=0 by default)
df.sum()

A    10
B    26
C    42
dtype: int64

In [127]:
#sum along rows(axis=1)
df.sum(axis=1)

0    15
1    18
2    21
3    24
dtype: int64

**If you run `df.sum()` without selecting numeric columns, pandas also tries to add up the string columns, which concatenates their values instead of producing a number.**

In [128]:
df=pd.read_csv('nba.csv')
df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [129]:
#summing the values of each numeric column
#equivalent to df.sum(numeric_only=True)
df.select_dtypes('number').sum()

Number    8.079000e+03
Age       1.231100e+04
Weight    1.012360e+05
Salary    2.159837e+09
dtype: float64

In [130]:
df.sum(numeric_only=True) 

Number    8.079000e+03
Age       1.231100e+04
Weight    1.012360e+05
Salary    2.159837e+09
dtype: float64

In [131]:
#summing values across each row (axis=1)
df.select_dtypes('number').sum(axis=1)

0      7730542.0
1      6796476.0
2          262.0
3      1148875.0
4      5000268.0
         ...    
453    2433570.0
454     900228.0
455    2900303.0
456     947557.0
457          0.0
Length: 458, dtype: float64
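
The last row of the output above sums to 0.0 because `sum()` skips missing values, so a row with no numeric data at all still produces a total of zero. If that is misleading for your data, the `min_count` parameter is one way to get NaN instead. A sketch on a made-up frame (`df_nan` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row is entirely missing
df_nan = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, np.nan]})

default_sum = df_nan.sum(axis=1)               # NaN is skipped, so the empty row sums to 0.0
strict_sum = df_nan.sum(axis=1, min_count=1)   # require at least one real value, else NaN

print(default_sum.tolist())
print(strict_sum.tolist())
```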

## **→Pandas mean() Method**

Pandas `.mean()` function returns the mean of the values for the requested axis. 

In [132]:
df=pd.read_csv('nba.csv')
df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [133]:
#mean of each numeric column
df.mean(numeric_only=True)

Number    1.767834e+01
Age       2.693873e+01
Weight    2.215230e+02
Salary    4.842684e+06
dtype: float64

In [134]:
#mean of selected column
df[['Age']].mean()

Age    26.938731
dtype: float64

## **→Pandas median() Method**

Pandas `df.median()` function returns the median of the values for the requested axis.

In [135]:
df.median(numeric_only=True)

Number         13.0
Age            26.0
Weight        220.0
Salary    2839073.0
dtype: float64

In [136]:
# skipna=True is the default, so the result is the same as above
df.median(numeric_only=True, skipna=True)

Number         13.0
Age            26.0
Weight        220.0
Salary    2839073.0
dtype: float64

In [137]:
df[['Age']].median()

Age    26.0
dtype: float64
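
In the NBA data the mean Salary (about 4.84 million) is far above the median (about 2.84 million), which is typical when a few very large values pull the mean upwards. The made-up `salaries` Series below shows the same effect on four numbers.

```python
import pandas as pd

# Hypothetical salaries with one very large value
salaries = pd.Series([1_000_000, 1_200_000, 1_500_000, 20_000_000])

mean_salary = salaries.mean()      # pulled upwards by the single large value
median_salary = salaries.median()  # middle of the sorted values

print(mean_salary, median_salary)
```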

## **→Pandas std() Method**

`std()` function in pandas returns the standard deviation of the values in a DataFrame or Series. It measures how much the data varies (or spreads out) from the average (mean).

In [138]:
df.std(numeric_only=True)

Number    1.596609e+01
Age       4.404016e+00
Weight    2.636834e+01
Salary    5.229238e+06
dtype: float64

In [139]:
# std of single column
df[['Age']].std()

Age    4.404016
dtype: float64

In [140]:
#std of multiple columns
df[['Age', 'Weight']].std()

Age        4.404016
Weight    26.368343
dtype: float64
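
By default pandas computes the sample standard deviation (`ddof=1`, dividing by n - 1), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below compares the two on a small made-up Series (`values` is an illustrative name).

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and population variance 4
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()            # ddof=1: divide by n - 1
population_std = values.std(ddof=0)  # ddof=0: divide by n, like np.std's default

print(sample_std, population_std, np.std(values.to_numpy()))
```

The difference shrinks as the number of rows grows, but it explains why pandas and NumPy can disagree on small samples.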

## **→Pandas merge() Method**

The `merge()` function is designed to merge two DataFrames based on one or more columns with matching values. The basic idea is to identify columns that contain common data between the DataFrames and use them to align rows.

In [141]:
student_info=pd.DataFrame({
    'StudentID': [201, 202, 203, 204],
    'Name': ['Memuna', 'Attiqa', 'Malaiqa', 'Sehrish'],
    'Department': ['CS', 'EE', 'CS', 'BBA']
    
})

student_info

Unnamed: 0,StudentID,Name,Department
0,201,Memuna,CS
1,202,Attiqa,EE
2,203,Malaiqa,CS
3,204,Sehrish,BBA


In [142]:
students_grades = pd.DataFrame({
    'StudentID': [201, 202, 204, 205],
    'GPA': [3.8, 3.4, 3.2, 3.6],
    'Credits': [18, 20, 16, 19]
})

students_grades


Unnamed: 0,StudentID,GPA,Credits
0,201,3.8,18
1,202,3.4,20
2,204,3.2,16
3,205,3.6,19


In [143]:
# inner join: keep only matching rows
pd.merge(student_info, students_grades, on='StudentID', how='inner')

Unnamed: 0,StudentID,Name,Department,GPA,Credits
0,201,Memuna,CS,3.8,18
1,202,Attiqa,EE,3.4,20
2,204,Sehrish,BBA,3.2,16


In [144]:
#outer join: keep all rows from both tables
pd.merge(student_info, students_grades, on='StudentID', how='outer')

Unnamed: 0,StudentID,Name,Department,GPA,Credits
0,201,Memuna,CS,3.8,18.0
1,202,Attiqa,EE,3.4,20.0
2,203,Malaiqa,CS,,
3,204,Sehrish,BBA,3.2,16.0
4,205,,,3.6,19.0


In [145]:
#left join: keep all rows from left table
pd.merge(student_info, students_grades, on='StudentID', how='left')

Unnamed: 0,StudentID,Name,Department,GPA,Credits
0,201,Memuna,CS,3.8,18.0
1,202,Attiqa,EE,3.4,20.0
2,203,Malaiqa,CS,,
3,204,Sehrish,BBA,3.2,16.0


In [146]:
#right join: keep all rows from right table
pd.merge(student_info, students_grades, on='StudentID', how='right')

Unnamed: 0,StudentID,Name,Department,GPA,Credits
0,201,Memuna,CS,3.8,18
1,202,Attiqa,EE,3.4,20
2,204,Sehrish,BBA,3.2,16
3,205,,,3.6,19
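
When debugging a join, it helps to know which table each row came from. Passing `indicator=True` adds a `_merge` column with exactly that information. A minimal sketch on shortened versions of the two tables (`info` and `grades` are illustrative names):

```python
import pandas as pd

info = pd.DataFrame({'StudentID': [201, 203], 'Name': ['Memuna', 'Malaiqa']})
grades = pd.DataFrame({'StudentID': [201, 205], 'GPA': [3.8, 3.6]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(info, grades, on='StudentID', how='outer', indicator=True)
print(merged)
```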


## **→Pandas astype() Method**

The `astype()` function is used to cast (convert) a pandas Series or DataFrame column to a specified data type.

In [147]:
df = pd.read_csv('nba.csv')

df.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [148]:
#convert the Number column datatype (drop rows with missing values first: NaN cannot be cast to int64)
df = df.dropna()

df['Number'] = df['Number'].astype('int64')

type(df['Number'][0])

numpy.int64

In [149]:
#convert the datatype of more than one column at once
df = df.astype({'Team' : 'category', 'Age' : 'int64'})

df.dtypes

Name          object
Team        category
Number         int64
Position      object
Age            int64
Height        object
Weight       float64
College       object
Salary       float64
dtype: object
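
Dropping rows is not the only way to get integers out of a column with missing values. One alternative is pandas' nullable integer dtype `'Int64'` (with a capital I), sketched below on a made-up Series (`numbers` is a hypothetical name).

```python
import numpy as np
import pandas as pd

# Hypothetical jersey numbers with one missing value
numbers = pd.Series([0.0, 99.0, np.nan])

# 'Int64' is the nullable integer dtype: the missing value becomes <NA>
nullable = numbers.astype('Int64')
print(nullable)

# The NumPy dtype 'int64' cannot represent NaN, so this cast fails
try:
    numbers.astype('int64')
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True
print(cast_failed)
```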

## **→Pandas set_index() Method**

`set_index()` method sets one or more existing columns as the index of a DataFrame.

In [150]:
df=pd.read_csv('nba.csv')
df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [151]:
#set a single column as index
df.set_index('Name')

Unnamed: 0_level_0,Team,Number,Position,Age,Height,Weight,College,Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...
Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [152]:
#set multiple columns as an index
df.set_index(['Team', 'Name'], inplace=True)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Number,Position,Age,Height,Weight,College,Salary
Team,Name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
Boston Celtics,Jae Crowder,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
Boston Celtics,John Holland,30.0,SG,27.0,6-5,205.0,Boston University,
Boston Celtics,R.J. Hunter,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
Boston Celtics,Jonas Jerebko,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...
Utah Jazz,Shelvin Mack,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
Utah Jazz,Raul Neto,25.0,PG,24.0,6-1,179.0,,900000.0
Utah Jazz,Tibor Pleiss,21.0,C,26.0,7-3,256.0,,2900000.0
Utah Jazz,Jeff Withey,24.0,C,26.0,7-0,231.0,Kansas,947276.0
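
One reason to set an index is label-based lookup. The sketch below assumes a three-row DataFrame built by hand from rows shown above (the variable names are illustrative) and looks up a player through the two-level index with `.loc`.

```python
import pandas as pd

# Small hand-built sample; the real notebook reads nba.csv instead
players = pd.DataFrame({
    "Team": ["Boston Celtics", "Boston Celtics", "Utah Jazz"],
    "Name": ["Avery Bradley", "Jae Crowder", "Shelvin Mack"],
    "Number": [0, 99, 8],
})

# Without inplace=True the original DataFrame is left untouched
indexed = players.set_index(["Team", "Name"])

# A tuple selects one row through both index levels
row = indexed.loc[("Boston Celtics", "Jae Crowder")]
print(row["Number"])

# A single key selects every row of the first level
celtics = indexed.loc["Boston Celtics"]
print(celtics)
```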


## **→Pandas reset_index() Method**

The `reset_index()` method resets the index of a DataFrame to the default integer index starting from 0. By default, the old index is moved back into the DataFrame as one or more columns; pass `drop=True` to discard it instead.

In [153]:
#reset index of pandas dataframe
df.reset_index(inplace=True)
df

Unnamed: 0,Team,Name,Number,Position,Age,Height,Weight,College,Salary
0,Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Boston Celtics,Jae Crowder,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,Boston Celtics,John Holland,30.0,SG,27.0,6-5,205.0,Boston University,
3,Boston Celtics,R.J. Hunter,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Boston Celtics,Jonas Jerebko,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Utah Jazz,Shelvin Mack,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Utah Jazz,Raul Neto,25.0,PG,24.0,6-1,179.0,,900000.0
455,Utah Jazz,Tibor Pleiss,21.0,C,26.0,7-3,256.0,,2900000.0
456,Utah Jazz,Jeff Withey,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [154]:
# creating new dataset

df_2 = df[df['Position'] == 'PG']

print(df_2.shape)

df_2.head()

(92, 9)


Unnamed: 0,Team,Name,Number,Position,Age,Height,Weight,College,Salary
0,Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
8,Boston Celtics,Terry Rozier,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Boston Celtics,Marcus Smart,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0
11,Boston Celtics,Isaiah Thomas,4.0,PG,27.0,5-9,185.0,Washington,6912869.0
19,Brooklyn Nets,Jarrett Jack,2.0,PG,32.0,6-3,200.0,Georgia Tech,6300000.0


In [155]:
# drop=True discards the old row labels instead of keeping them as a column
df_2.reset_index(drop=True, inplace=True)

df_2

Unnamed: 0,Team,Name,Number,Position,Age,Height,Weight,College,Salary
0,Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Boston Celtics,Terry Rozier,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
2,Boston Celtics,Marcus Smart,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0
3,Boston Celtics,Isaiah Thomas,4.0,PG,27.0,5-9,185.0,Washington,6912869.0
4,Brooklyn Nets,Jarrett Jack,2.0,PG,32.0,6-3,200.0,Georgia Tech,6300000.0
...,...,...,...,...,...,...,...,...,...
87,Portland Trail Blazers,Brian Roberts,2.0,PG,30.0,6-1,173.0,Dayton,2854940.0
88,Utah Jazz,Trey Burke,3.0,PG,23.0,6-1,191.0,Michigan,2658240.0
89,Utah Jazz,Dante Exum,11.0,PG,20.0,6-6,190.0,,3777720.0
90,Utah Jazz,Shelvin Mack,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
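
To make the effect of `drop` concrete, here is a small sketch on a hypothetical three-row DataFrame: the default call keeps the old labels in a new `index` column, while `drop=True` throws them away.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"Position": ["PG", "SF", "PG"], "Age": [25, 25, 22]})

# Filtering keeps the original row labels (0 and 2 here)
pg = df[df["Position"] == "PG"]

kept = pg.reset_index()              # old labels become an 'index' column
dropped = pg.reset_index(drop=True)  # old labels are discarded

print(kept)
print(dropped)
```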


## **→Pandas at[] Method**

Pandas `at[]` returns the single value at the given location in a DataFrame. The location is passed as `[row label, column name]`; `at[]` is label-based, and the example below uses an integer only because the DataFrame has the default integer index.
Unlike `.loc[]`, it accesses only a single value, so a slice such as `dataframe.at[3:6, label]` raises an error.

In [156]:
df.head()

Unnamed: 0,Team,Name,Number,Position,Age,Height,Weight,College,Salary
0,Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Boston Celtics,Jae Crowder,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,Boston Celtics,John Holland,30.0,SG,27.0,6-5,205.0,Boston University,
3,Boston Celtics,R.J. Hunter,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Boston Celtics,Jonas Jerebko,8.0,PF,29.0,6-10,231.0,,5000000.0


In [157]:
df.at[3, 'Name']

'R.J. Hunter'
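
The sketch below, on a hand-built four-row sample, illustrates three points: `at[]` can also assign a value, it works with non-integer labels once the index changes, and `.loc[]` is the accessor to reach for when a range of rows is needed. The jersey number written in the example is made up.

```python
import pandas as pd

players = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland", "R.J. Hunter"],
    "Number": [0, 99, 30, 28],
})

# Read a single cell by row label and column name
name = players.at[3, "Name"]

# Write a single cell (27 is a made-up value for illustration)
players.at[3, "Number"] = 27

# For several rows, use .loc; label slices include both endpoints
subset = players.loc[1:2, "Name"]

# at[] is label-based: with Name as the index, the row key is a string
by_name = players.set_index("Name")
crowder_number = by_name.at["Jae Crowder", "Number"]

print(name, crowder_number)
print(subset)
```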

## **→Pandas iterrows() Method**

The `iterrows()` method is a simple way to iterate over the rows of a DataFrame. It returns an iterator that yields each row as an `(index, row)` tuple, where `row` is a pandas Series.

In [158]:
df.head()

Unnamed: 0,Team,Name,Number,Position,Age,Height,Weight,College,Salary
0,Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Boston Celtics,Jae Crowder,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,Boston Celtics,John Holland,30.0,SG,27.0,6-5,205.0,Boston University,
3,Boston Celtics,R.J. Hunter,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Boston Celtics,Jonas Jerebko,8.0,PF,29.0,6-10,231.0,,5000000.0


In [159]:
for index, row in df.iterrows():
    print(f"index: {index}, Player Name: {row['Name']}, Salary : {row['Salary']}")

index: 0, Player Name: Avery Bradley, Salary : 7730337.0
index: 1, Player Name: Jae Crowder, Salary : 6796117.0
index: 2, Player Name: John Holland, Salary : nan
index: 3, Player Name: R.J. Hunter, Salary : 1148640.0
index: 4, Player Name: Jonas Jerebko, Salary : 5000000.0
index: 5, Player Name: Amir Johnson, Salary : 12000000.0
index: 6, Player Name: Jordan Mickey, Salary : 1170960.0
index: 7, Player Name: Kelly Olynyk, Salary : 2165160.0
index: 8, Player Name: Terry Rozier, Salary : 1824360.0
index: 9, Player Name: Marcus Smart, Salary : 3431040.0
index: 10, Player Name: Jared Sullinger, Salary : 2569260.0
index: 11, Player Name: Isaiah Thomas, Salary : 6912869.0
index: 12, Player Name: Evan Turner, Salary : 3425510.0
index: 13, Player Name: James Young, Salary : 1749840.0
index: 14, Player Name: Tyler Zeller, Salary : 2616975.0
index: 15, Player Name: Bojan Bogdanovic, Salary : 3425510.0
index: 16, Player Name: Markel Brown, Salary : 845059.0
index: 17, Player Name: Wayne Ellington,

In [160]:
# get only the first row: next() returns an (index, row) tuple, and [1] picks the row Series
next(df.iterrows())[1]

Team        Boston Celtics
Name         Avery Bradley
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: 0, dtype: object
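
Because each row comes back as a single Series, `iterrows()` does not preserve column dtypes across a row. The following sketch, assuming a two-column numeric sample, shows an integer column being upcast to float, and uses `itertuples()` as one alternative that keeps the values as stored.

```python
import pandas as pd

# Hypothetical sample: one integer column and one float column
players = pd.DataFrame({
    "Number": [0, 99],
    "Salary": [7730337.0, 6796117.0],
})

# iterrows() packs each row into one Series, so int and float share a dtype
_, row = next(players.iterrows())
row_dtype = str(row.dtype)
print(row_dtype, type(row["Number"]))

# itertuples() yields namedtuples and leaves each value in its own type
first_tuple = next(players.itertuples(index=False))
print(first_tuple)
```

For large DataFrames, vectorized column operations are usually much faster than either form of row iteration.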

## **→Pandas to_numeric() Method**

`to_numeric()` is a top-level pandas function that converts its argument (a scalar, list, or Series) to a numeric type.

In [161]:
df.head()

Unnamed: 0,Team,Name,Number,Position,Age,Height,Weight,College,Salary
0,Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Boston Celtics,Jae Crowder,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,Boston Celtics,John Holland,30.0,SG,27.0,6-5,205.0,Boston University,
3,Boston Celtics,R.J. Hunter,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Boston Celtics,Jonas Jerebko,8.0,PF,29.0,6-10,231.0,,5000000.0


In [162]:
ser=pd.Series(df['Number']).head(5)
ser

0     0.0
1    99.0
2    30.0
3    28.0
4     8.0
Name: Number, dtype: float64

In [163]:
# pass the Series itself, not the string 'ser'
pd.to_numeric(ser,errors='coerce')

0     0.0
1    99.0
2    30.0
3    28.0
4     8.0
Name: Number, dtype: float64

In [164]:
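# Using errors='ignore': if any value cannot be converted, the input is returned unchanged.
# Note: errors='ignore' is deprecated since pandas 2.2 and emits a FutureWarning.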
ser = pd.Series(['Memuna', 15, 12.4, 23, 56, 77,89])

pd.to_numeric(ser, errors ='ignore')


0    Memuna
1        15
2      12.4
3        23
4        56
5        77
6        89
dtype: object

In [165]:
# Using errors='coerce'. It will replace all non-numeric values with NaN.
ser = pd.Series(['Memuna', 15, 12.4, 23, 56, 77,89])

pd.to_numeric(ser, errors ='coerce')

0     NaN
1    15.0
2    12.4
3    23.0
4    56.0
5    77.0
6    89.0
dtype: float64
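
A common follow-up to `errors='coerce'` is finding out which inputs were turned into `NaN`. The sketch below assumes a made-up Series of strings; it also shows that the default behaviour (`errors='raise'`) stops with a `ValueError` on the first bad value.

```python
import pandas as pd

# Hypothetical raw input, e.g. a column read as text
raw = pd.Series(["15", "12.4", "n/a", "23"])

# Default errors='raise': the bad value aborts the conversion
try:
    pd.to_numeric(raw)
    raised = False
except ValueError as exc:
    raised = True
    print("conversion failed:", exc)

# errors='coerce': bad values become NaN, the rest are parsed
nums = pd.to_numeric(raw, errors="coerce")

# Use the NaN positions to report the original offending strings
bad_values = raw[nums.isna()]
print(nums)
print(bad_values)
```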

## **→Pandas to_datetime() Method**

`to_datetime()` converts strings, numbers, and other date-like values into pandas datetime objects.

In [166]:

# date string
d_string = "2023-09-17 14:30:00"

# Convert the string to datetime
dt_obj = pd.to_datetime(d_string)

print(dt_obj)

2023-09-17 14:30:00
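
`to_datetime()` is used more often on a whole column than on a single string. As a rough sketch, assuming a made-up Series of date strings with one invalid entry, the snippet below passes an explicit `format` and uses `errors='coerce'` so the invalid entry becomes `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical column of date strings; the last one is deliberately invalid
dates = pd.Series(["1945-10-26", "1910-03-24", "not a date"])

parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed)

# Once parsed, the .dt accessor exposes the date components
years = parsed.dt.year
print(years)
```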


## **→Pandas to_string() Method**

`to_string()` function in Pandas is specifically designed to render a DataFrame into a console-friendly tabular format as a string output.

In [167]:
df = pd.read_csv('nba.csv')

df.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [168]:
print(df.to_string())

                         Name                    Team  Number Position   Age Height  Weight                College      Salary
0               Avery Bradley          Boston Celtics     0.0       PG  25.0    6-2   180.0                  Texas   7730337.0
1                 Jae Crowder          Boston Celtics    99.0       SF  25.0    6-6   235.0              Marquette   6796117.0
2                John Holland          Boston Celtics    30.0       SG  27.0    6-5   205.0      Boston University         NaN
3                 R.J. Hunter          Boston Celtics    28.0       SG  22.0    6-5   185.0          Georgia State   1148640.0
4               Jonas Jerebko          Boston Celtics     8.0       PF  29.0   6-10   231.0                    NaN   5000000.0
5                Amir Johnson          Boston Celtics    90.0       PF  29.0    6-9   240.0                    NaN  12000000.0
6               Jordan Mickey          Boston Celtics    55.0       PF  21.0    6-8   235.0                    

In [169]:
# Excluding index labels
print(df.to_string(index=False))

 

                    Name                   Team  Number Position  Age Height  Weight               College     Salary
           Avery Bradley         Boston Celtics     0.0       PG 25.0    6-2   180.0                 Texas  7730337.0
             Jae Crowder         Boston Celtics    99.0       SF 25.0    6-6   235.0             Marquette  6796117.0
            John Holland         Boston Celtics    30.0       SG 27.0    6-5   205.0     Boston University        NaN
             R.J. Hunter         Boston Celtics    28.0       SG 22.0    6-5   185.0         Georgia State  1148640.0
           Jonas Jerebko         Boston Celtics     8.0       PF 29.0   6-10   231.0                   NaN  5000000.0
            Amir Johnson         Boston Celtics    90.0       PF 29.0    6-9   240.0                   NaN 12000000.0
           Jordan Mickey         Boston Celtics    55.0       PF 21.0    6-8   235.0                   LSU  1170960.0
            Kelly Olynyk         Boston Celtics    41.0 

In [170]:
# Customizing Missing Values Representation
print(df.to_string(na_rep='Missing'))

                         Name                    Team  Number Position     Age   Height  Weight                College      Salary
0               Avery Bradley          Boston Celtics     0.0       PG    25.0      6-2   180.0                  Texas   7730337.0
1                 Jae Crowder          Boston Celtics    99.0       SF    25.0      6-6   235.0              Marquette   6796117.0
2                John Holland          Boston Celtics    30.0       SG    27.0      6-5   205.0      Boston University     Missing
3                 R.J. Hunter          Boston Celtics    28.0       SG    22.0      6-5   185.0          Georgia State   1148640.0
4               Jonas Jerebko          Boston Celtics     8.0       PF    29.0     6-10   231.0                Missing   5000000.0
5                Amir Johnson          Boston Celtics    90.0       PF    29.0      6-9   240.0                Missing  12000000.0
6               Jordan Mickey          Boston Celtics    55.0       PF    21.0     

In [171]:
# Custom Formatting for Floating-Point Numbers
res = df.to_string(float_format="{:.2f}".format)
print(res)

                         Name                    Team  Number Position   Age Height  Weight                College      Salary
0               Avery Bradley          Boston Celtics    0.00       PG 25.00    6-2  180.00                  Texas  7730337.00
1                 Jae Crowder          Boston Celtics   99.00       SF 25.00    6-6  235.00              Marquette  6796117.00
2                John Holland          Boston Celtics   30.00       SG 27.00    6-5  205.00      Boston University         NaN
3                 R.J. Hunter          Boston Celtics   28.00       SG 22.00    6-5  185.00          Georgia State  1148640.00
4               Jonas Jerebko          Boston Celtics    8.00       PF 29.00   6-10  231.00                    NaN  5000000.00
5                Amir Johnson          Boston Celtics   90.00       PF 29.00    6-9  240.00                    NaN 12000000.00
6               Jordan Mickey          Boston Celtics   55.00       PF 21.00    6-8  235.00                    

In [172]:
# Limiting the Number of Rows and Columns
print(df.to_string(max_rows=3,max_cols=2))

              Name  ...     Salary
0    Avery Bradley  ...  7730337.0
..             ...  ...        ...
457            NaN  ...        NaN
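
The options shown above can be combined in one call. The sketch below assumes a two-row sample in place of `nba.csv` and combines `index=False`, `na_rep`, and `float_format`; the result is an ordinary Python string that can be printed, logged, or written to a file.

```python
import pandas as pd

# Hypothetical two-row sample with one missing salary
df = pd.DataFrame({
    "Name": ["Avery Bradley", "John Holland"],
    "Salary": [7730337.0, float("nan")],
})

text = df.to_string(
    index=False,                    # hide the row labels
    na_rep="Missing",               # text shown for NaN
    float_format="{:.0f}".format,   # no decimal places
)
print(text)
```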


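## **→Pandas concat() Method**

`concat()` joins two or more Series or DataFrames along an axis: rows are stacked with the default `axis=0`, and columns are placed side by side with `axis=1`.
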
In [173]:
# Concatenate two Series vertically (default axis=0); the original index labels are kept
series1 = pd.Series([1, 2, 3])
series2= pd.Series(['A', 'B', 'C'])

print(pd.concat([series1, series2]))

0    1
1    2
2    3
0    A
1    B
2    C
dtype: object


In [174]:
# Combine two Series horizontally with axis=1; each Series becomes a column
series1= pd.Series([1, 2, 3])
series2= pd.Series(['A', 'B', 'C'])

print(pd.concat([series1, series2], axis=1))


   0  1
0  1  A
1  2  B
2  3  C


In [175]:
# Concatenating 2 DataFrames and Assigning Keys

# creating the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
display('df1:', df1)
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7']})
display('df2:', df2)

# concatenating
display('After concatenating:')
display(pd.concat([df1, df2],
                  keys=['key1', 'key2']))

'df1:'

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


'df2:'

Unnamed: 0,A,B
0,A4,B4
1,A5,B5
2,A6,B6
3,A7,B7


'After concatenating:'

Unnamed: 0,Unnamed: 1,A,B
key1,0,A0,B0
key1,1,A1,B1
key1,2,A2,B2
key1,3,A3,B3
key2,0,A4,B4
key2,1,A5,B5
key2,2,A6,B6
key2,3,A7,B7
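
The keys become the outer level of a MultiIndex, which makes it easy to recover each original block later. A minimal sketch, reusing the same `df1` and `df2` values:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"],
                    "B": ["B0", "B1", "B2", "B3"]})
df2 = pd.DataFrame({"A": ["A4", "A5", "A6", "A7"],
                    "B": ["B4", "B5", "B6", "B7"]})

keyed = pd.concat([df1, df2], keys=["key1", "key2"])

# Selecting by key returns the rows that came from that DataFrame
first_block = keyed.loc["key1"]
print(first_block)

# A (key, row label) tuple selects a single row
one_row = keyed.loc[("key2", 0)]
print(one_row["A"])
```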


In [176]:
# Concatenating DataFrames horizontally in Pandas with axis = 1

# creating the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 
                    'B': ['B0', 'B1', 'B2', 'B3']})
display('df1:', df1)
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'], 
                    'D': ['D0', 'D1', 'D2', 'D3']})
display('df2:', df2)

# concatenating
display('After concatenating:')
display(pd.concat([df1, df2],
                  axis = 1))

'df1:'

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


'df2:'

Unnamed: 0,C,D
0,C0,D0
1,C1,D1
2,C2,D2
3,C3,D3


'After concatenating:'

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [177]:
# Concatenating 2 DataFrames with ignore_index = True

# creating the DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 
                    'B': ['B0', 'B1', 'B2', 'B3']})
display('df1:', df1)
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'], 
                    'B': ['B4', 'B5', 'B6', 'B7']})
display('df2:', df2)

# concatenating
display('After concatenating:')
display(pd.concat([df1, df2], 
                  ignore_index = True))

'df1:'

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


'df2:'

Unnamed: 0,A,B
0,A4,B4
1,A5,B5
2,A6,B6
3,A7,B7


'After concatenating:'

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4
5,A5,B5
6,A6,B6
7,A7,B7


In [178]:
#  Concatenating a DataFrame with a Series

# creating the DataFrame
df = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 
                    'B': ['B0', 'B1', 'B2', 'B3']})
display('df:', df)
# creating the Series
series = pd.Series([1, 2, 3, 4])
display('series:', series)

# concatenating
display('After concatenating:')
display(pd.concat([df, series],
                  axis = 1))

'df:'

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


'series:'

0    1
1    2
2    3
3    4
dtype: int64

'After concatenating:'

Unnamed: 0,A,B,0
0,A0,B0,1
1,A1,B1,2
2,A2,B2,3
3,A3,B3,4
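
All the DataFrames concatenated so far share the same columns. When they do not, `concat()` takes the union of the columns by default and fills the gaps with `NaN`; `join='inner'` keeps only the shared columns. The sketch below uses two small made-up DataFrames to show both behaviours.

```python
import pandas as pd

# Hypothetical frames that share only column B
left = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
right = pd.DataFrame({"B": ["B2", "B3"], "C": ["C2", "C3"]})

# Default join='outer': union of columns, missing cells become NaN
outer = pd.concat([left, right], ignore_index=True)
print(outer)

# join='inner': only the columns present in every input
inner = pd.concat([left, right], ignore_index=True, join="inner")
print(inner)
```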


## **→Pandas cov() Method**

`cov()` computes the pairwise covariance of the columns of a DataFrame. By default it returns the sample covariance, normalized by N-1.


In [179]:
# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A":[5, 3, 6, 4], 
                   "B":[11, 2, 4, 3],
                   "C":[4, 3, 8, 5],
                   "D":[5, 4, 2, 8]})

# Print the dataframe
df

Unnamed: 0,A,B,C,D
0,5,11,4,5
1,3,2,3,4
2,6,4,8,2
3,4,3,5,8


In [180]:
# compute the pairwise covariance matrix
df.cov()

Unnamed: 0,A,B,C,D
A,1.666667,2.333333,2.333333,-1.5
B,2.333333,16.666667,-1.0,0.0
C,2.333333,-1.0,4.666667,-2.333333
D,-1.5,0.0,-2.333333,6.25
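
To see where a single entry of this matrix comes from, the sketch below recomputes the covariance of columns `A` and `B` by hand, using the sample covariance formula (sum of products of deviations from the mean, divided by N-1), and compares it with the value from `cov()`.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 3, 6, 4],
                   "B": [11, 2, 4, 3]})

# Deviations of each column from its own mean
a_dev = df["A"] - df["A"].mean()
b_dev = df["B"] - df["B"].mean()

# Sample covariance: divide by N - 1
manual = (a_dev * b_dev).sum() / (len(df) - 1)

from_pandas = df.cov().loc["A", "B"]
print(manual, from_pandas)
```

The deviations are 0.5, -1.5, 1.5, -0.5 for `A` and 6, -3, -1, -2 for `B`; their products sum to 7, and 7 / 3 gives the 2.333333 shown in the matrix above.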


## **→Pandas duplicated() Method**

`duplicated()` identifies duplicate rows or values. It returns a boolean Series that is `True` for every entry that repeats an earlier one; with the default `keep='first'`, the first occurrence itself is marked `False`.

In [181]:
import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 32, 25, 37]
})
duplicates = df[df.duplicated()]
print(duplicates)

    Name  Age
2  Alice   25


In [182]:
#returning a boolean series
df.duplicated()

0    False
1    False
2     True
3    False
dtype: bool

In [183]:
df['Name'].duplicated()

0    False
1    False
2     True
3    False
Name: Name, dtype: bool
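
Which copy counts as "the duplicate" depends on the `keep` parameter. The sketch below reuses the same four-row DataFrame and compares the three settings, then counts the duplicates with `sum()`.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [25, 32, 25, 37],
})

first = df.duplicated()               # later copies are flagged
last = df.duplicated(keep="last")     # earlier copies are flagged
every = df.duplicated(keep=False)     # all copies are flagged

# True counts as 1, so sum() gives the number of flagged rows
n_duplicates = int(first.sum())

print(first.tolist(), last.tolist(), every.tolist(), n_duplicates)
```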

## **→Pandas drop_duplicates() Method**

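The `drop_duplicates()` method removes duplicate rows from a DataFrame and returns the result as a new DataFrame. By default it compares all columns and keeps the first occurrence of each duplicated row.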

In [184]:
data = {
    'InvoiceID': [101, 102, 103, 104, 101, 105, 106, 102],
    'Customer': ['Ali', 'Sara', 'John', 'Ali', 'Ali', 'John', 'Sara', 'Sara'],
    'Product': ['TV', 'Phone', 'TV', 'TV', 'TV', 'Tablet', 'Phone', 'Phone'],
    'Amount': [25000, 50000, 25000, 25000, 25000, 15000, 50000, 50000]
}

df = pd.DataFrame(data)

print(df.shape)
df

(8, 4)


Unnamed: 0,InvoiceID,Customer,Product,Amount
0,101,Ali,TV,25000
1,102,Sara,Phone,50000
2,103,John,TV,25000
3,104,Ali,TV,25000
4,101,Ali,TV,25000
5,105,John,Tablet,15000
6,106,Sara,Phone,50000
7,102,Sara,Phone,50000


In [185]:
df.drop_duplicates()

Unnamed: 0,InvoiceID,Customer,Product,Amount
0,101,Ali,TV,25000
1,102,Sara,Phone,50000
2,103,John,TV,25000
3,104,Ali,TV,25000
5,105,John,Tablet,15000
6,106,Sara,Phone,50000


In [186]:
# Dropping duplicates based on specific columns
df.drop_duplicates(subset=['InvoiceID'])

Unnamed: 0,InvoiceID,Customer,Product,Amount
0,101,Ali,TV,25000
1,102,Sara,Phone,50000
2,103,John,TV,25000
3,104,Ali,TV,25000
5,105,John,Tablet,15000
6,106,Sara,Phone,50000


In [187]:
# keep the last occurrence
df.drop_duplicates(subset=['InvoiceID'], keep='last')

Unnamed: 0,InvoiceID,Customer,Product,Amount
2,103,John,TV,25000
3,104,Ali,TV,25000
4,101,Ali,TV,25000
5,105,John,Tablet,15000
6,106,Sara,Phone,50000
7,102,Sara,Phone,50000


**Drop all duplicates**

With keep=False, all occurrences of duplicate rows are removed, leaving only rows that are entirely unique across all columns.

In [188]:
df.drop_duplicates(keep=False)

Unnamed: 0,InvoiceID,Customer,Product,Amount
2,103,John,TV,25000
3,104,Ali,TV,25000
5,105,John,Tablet,15000
6,106,Sara,Phone,50000
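
Note that the outputs above keep the original row labels, so the index has gaps after rows are removed. One way to get a clean 0..n-1 index in the same call is the `ignore_index=True` parameter, sketched here on a small made-up invoice table.

```python
import pandas as pd

# Hypothetical three-row sample with one repeated invoice
df = pd.DataFrame({
    "InvoiceID": [101, 101, 102],
    "Customer": ["Ali", "Ali", "Sara"],
})

default = df.drop_duplicates()                      # labels 0 and 2 survive
renumbered = df.drop_duplicates(ignore_index=True)  # labels become 0 and 1

print(default)
print(renumbered)
```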


## **→Pandas dropna() Method**

The `dropna()` method drops rows (or columns, with `axis=1`) that contain null values.


In [189]:
df=pd.read_csv('nba.csv')
df.shape

(458, 9)

In [190]:
df=df.dropna()
df.shape

(364, 9)
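
Dropping every row with any missing value removed 94 of the 458 rows here, which may be more than intended. The sketch below, on a three-row sample built from rows shown earlier, shows two narrower options: restricting the check to certain columns with `subset`, and dropping columns instead of rows with `axis=1`.

```python
import numpy as np
import pandas as pd

# Small hand-built sample with one missing College and one missing Salary
df = pd.DataFrame({
    "Name": ["Avery Bradley", "John Holland", "Jonas Jerebko"],
    "College": ["Texas", "Boston University", np.nan],
    "Salary": [7730337.0, np.nan, 5000000.0],
})

any_missing = df.dropna()                     # rows with no NaN at all
salary_known = df.dropna(subset=["Salary"])   # only Salary must be present
complete_cols = df.dropna(axis=1)             # columns with no NaN at all

print(any_missing.shape, salary_known.shape, list(complete_cols.columns))
```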

## **→Pandas mask() Method** 

The `mask()` function is used to replace values where a condition is True.

In [191]:
# importing pandas as pd
import pandas as pd

# Creating the dataframe 
df = pd.DataFrame({"A":[12, 4, 5, 44, 1],
                   "B":[5, 2, 54, 3, 2],
                   "C":[20, 16, 7, 3, 8],
                   "D":[14, 3, 17, 2, 6]})

# Print the dataframe
df

Unnamed: 0,A,B,C,D
0,12,5,20,14
1,4,2,16,3
2,5,54,7,17
3,44,3,3,2
4,1,2,8,6


In [192]:
# replace values greater than 10 with -25
df.mask(df > 10, -25)

Unnamed: 0,A,B,C,D
0,-25,5,-25,-25
1,4,2,-25,3
2,5,-25,7,-25
3,-25,3,3,2
4,1,2,8,6


In [193]:
# importing pandas as pd
import pandas as pd

# Creating the dataframe 
df = pd.DataFrame({"A":[12, 4, 5, None, 1],
                   "B":[7, 2, 54, 3, None],
                   "C":[20, 16, 11, 3, 8],
                   "D":[14, 3, None, 2, 6]})

# replace the NaN values with 1000
df.mask(df.isna(), 1000)

Unnamed: 0,A,B,C,D
0,12.0,7.0,20,14.0
1,4.0,2.0,16,3.0
2,5.0,54.0,11,1000.0
3,1000.0,3.0,3,2.0
4,1.0,1000.0,8,6.0
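
`mask()` has a mirror image, `where()`, which keeps values where the condition is True and replaces the rest. As a small sketch on a made-up DataFrame, the two calls below produce the same result because the conditions are exact opposites.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5],
                   "B": [5, 2, 54]})

# mask: replace where the condition is True
masked = df.mask(df > 10, -25)

# where: keep where the condition is True, replace elsewhere
kept = df.where(df <= 10, -25)

print(masked)
print(masked.equals(kept))
```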


## **→Pandas replace() Method**

`replace()` is used to replace values in a DataFrame or Series with something else.

In [194]:
#replacing single value 
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Unknown']
})

df.replace('Unknown', 'Other', inplace=True)

df

Unnamed: 0,Gender
0,Male
1,Female
2,Male
3,Other


In [195]:
#replacing multiple values 
df.replace({'Male': 'M', 'Female': 'F'}, inplace=True)

df

Unnamed: 0,Gender
0,M
1,F
2,M
3,Other


In [196]:
#replacing values in multiple columns
df = pd.DataFrame({
    'A': [1, 2, 3, 99],
    'B': [99, 5, 99, 7]
})

df.replace(99, 0)

Unnamed: 0,A,B
0,1,0
1,2,5
2,3,0
3,0,7
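
The call above replaces 99 everywhere. To limit a replacement to one column, `replace()` also accepts a nested dictionary of the form `{column: {old: new}}`. A minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 99],
    "B": [99, 5, 99, 7],
})

# Replace 99 with 0 in column A only; column B is left as it is
only_a = df.replace({"A": {99: 0}})
print(only_a)

# A list of old values maps several values to the same replacement
several = df.replace([99, 7], 0)
print(several)
```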
