## **Introduction to Pandas**
Pandas is an open source data manipulation and analysis library for Python. It provides data structures for efficiently handling large datasets and tools for working with structured data. 

1. Data Structures:

Series: A one-dimensional array capable of holding any data type. It is similar to a column in a spreadsheet or a single variable in statistics.

DataFrame: A two-dimensional table with rows and columns. It can be thought of as a spreadsheet or SQL table, where each column can have a different data type.

2. Data Loading and Saving:

Pandas provides functions to read data from various file formats such as CSV, Excel, SQL databases, JSON and more. It also lets you write data back to these formats.

3. Data Cleaning and Preparation:

Pandas offers powerful tools for cleaning and preparing data, including handling missing values, filtering, sorting and merging datasets.

In [1]:
import pandas as pd

import warnings 
warnings.filterwarnings('ignore')

1. Series:                 
      
A Pandas Series is a one-dimensional array of indexed data.              
`Syntax: pd.Series(data, index=index)`
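
As a minimal sketch of the `index` argument in the syntax above, the example below builds the same Series twice, once from a dict and once from a list plus explicit labels. The names `prices` and `prices_explicit` and the fruit data are made up for illustration.

```python
import pandas as pd

# Build a Series from a dict: the keys become the index labels.
prices = pd.Series({"apple": 1.5, "banana": 0.25, "cherry": 4.0})

# The same data passed as a list with an explicit index argument.
prices_explicit = pd.Series([1.5, 0.25, 4.0], index=["apple", "banana", "cherry"])

by_label = prices["banana"]     # label-based lookup
by_position = prices.iloc[0]    # position-based lookup
print(by_label, by_position)
```

When no index is given, Pandas falls back to a default integer index starting at 0, which is what the next cell shows.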

In [2]:
import pandas as pd
import numpy as np

# Creating a Series from a list 
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [3]:
s.describe().T   # Statistical description 

count    5.000000
mean     4.600000
std      2.701851
min      1.000000
25%      3.000000
50%      5.000000
75%      6.000000
max      8.000000
dtype: float64

In [4]:
s.info()   # information of the dataset

<class 'pandas.core.series.Series'>
RangeIndex: 6 entries, 0 to 5
Series name: None
Non-Null Count  Dtype  
--------------  -----  
5 non-null      float64
dtypes: float64(1)
memory usage: 180.0 bytes


2. DataFrame:

- A DataFrame is a two-dimensional table with rows and columns, similar to a spreadsheet or a SQL table. It is the primary data structure in Pandas and is used for most data manipulation and analysis tasks.             
- Each column in a DataFrame is a Series. It allows you to handle heterogeneous data types and supports a wide range of operations.  
- It is the Pandas data structure designed for handling data in tabular form.

`Syntax: pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)`
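
To illustrate the `data`, `index` and `columns` parameters, here is a small sketch that builds a DataFrame from a list of rows. The row labels `r1`/`r2` and the variable name `people` are assumptions chosen for the example.

```python
import pandas as pd

rows = [["Alice", 45], ["Bob", 58]]

# data supplies the values, index the row labels, columns the column names.
people = pd.DataFrame(data=rows, index=["r1", "r2"], columns=["Name", "Age"])

# Selecting a single column returns a Series with its own dtype.
name_col = people["Name"]
print(type(name_col).__name__)
print(people.dtypes)
```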

In [5]:
# Creating a DataFrame from a dictionary
# Sample data for the DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Hannah', 'Ivan', 'Jack', 'Ivan', 'Jack', 'Grace', 'Hannah'],
    'Age': [45, 58, 60, 28, 49, 67, None, 60, 37, 31, 37, 31, 29, 60],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'Berlin', 'Paris', 'Berlin', 'New York', 'Tokyo', 'London', 'Tokyo', 'London', 'Berlin', 'New York'],
    'Salary': [95445, 86235, 62330, 46022, 65999, 88320, 57141, 65413, 87016, 58725, 87016, 58725, 57141, None],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08', '2023-01-09', '2023-01-10', '2023-01-09', '2023-01-10', '2023-01-07', '2023-01-08'],
    'Category': ['A', 'B', 'C', 'A', 'B', 'A', 'C', 'B', 'A', 'B', 'A', 'B', None, 'B']
    }

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City,Salary,Date,Category
0,Alice,45.0,New York,95445.0,2023-01-01,A
1,Bob,58.0,London,86235.0,2023-01-02,B
2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,David,28.0,Paris,46022.0,2023-01-04,A
4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,Grace,,Berlin,57141.0,2023-01-07,C
7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,Jack,31.0,London,58725.0,2023-01-10,B


In [6]:
df.head()    # Top 5 rows of the dataset

Unnamed: 0,Name,Age,City,Salary,Date,Category
0,Alice,45.0,New York,95445.0,2023-01-01,A
1,Bob,58.0,London,86235.0,2023-01-02,B
2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,David,28.0,Paris,46022.0,2023-01-04,A
4,Eva,49.0,Berlin,65999.0,2023-01-05,B


In [7]:
df.head(10)    # Number of specified rows in the dataset

Unnamed: 0,Name,Age,City,Salary,Date,Category
0,Alice,45.0,New York,95445.0,2023-01-01,A
1,Bob,58.0,London,86235.0,2023-01-02,B
2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,David,28.0,Paris,46022.0,2023-01-04,A
4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,Grace,,Berlin,57141.0,2023-01-07,C
7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,Jack,31.0,London,58725.0,2023-01-10,B


In [8]:
df.tail()   # Bottom 5 rows of the dataset

Unnamed: 0,Name,Age,City,Salary,Date,Category
9,Jack,31.0,London,58725.0,2023-01-10,B
10,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
11,Jack,31.0,London,58725.0,2023-01-10,B
12,Grace,29.0,Berlin,57141.0,2023-01-07,
13,Hannah,60.0,New York,,2023-01-08,B


In [9]:
#pip install openpyxl

In [10]:
#df.to_csv('data.csv')
df.to_excel('data.xlsx')
# df.to_json('data.json', orient='records')
# df.to_sql('my_table', engine, index=False, if_exists='replace')

In [11]:
# Reading the data back from a file
# df = pd.read_csv(r"data.csv")   # on a UnicodeDecodeError, try encoding='unicode_escape'
df = pd.read_excel(r"C:\Users\91938\OneDrive\Documents\Python Scripts\Pandas\data.xlsx")
# pd.read_json("data.json")
# sqlite3.connect("database.db")
df

Unnamed: 0.1,Unnamed: 0,Name,Age,City,Salary,Date,Category
0,0,Alice,45.0,New York,95445.0,2023-01-01,A
1,1,Bob,58.0,London,86235.0,2023-01-02,B
2,2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,3,David,28.0,Paris,46022.0,2023-01-04,A
4,4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,6,Grace,,Berlin,57141.0,2023-01-07,C
7,7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,9,Jack,31.0,London,58725.0,2023-01-10,B
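
The extra `Unnamed: 0` column in the output above appears because the index was written to the file and then read back as ordinary data. The sketch below reproduces this with CSV text held in memory (so no file or `openpyxl` is needed) and shows two common ways to avoid it. The frame and variable names are illustrative; `index=False` and `index_col=0` are also accepted by `to_excel` and `read_excel` respectively.

```python
import io
import pandas as pd

frame = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [45, 58]})

# Default behaviour: the index is written as an unnamed first column.
with_index = frame.to_csv()
reread_default = pd.read_csv(io.StringIO(with_index))

# Option 1: do not write the index at all.
reread_no_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

# Option 2: tell the reader that the first column holds the index.
reread_index_col = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(reread_default.columns))    # ['Unnamed: 0', 'Name', 'Age']
print(list(reread_no_index.columns))   # ['Name', 'Age']
```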


### **Exploring Data**

In [12]:
# Display the first few rows 
print(df.head(2))
# Display the last few rows
df.tail(2)

   Unnamed: 0   Name   Age      City   Salary        Date Category
0           0  Alice  45.0  New York  95445.0  2023-01-01        A
1           1    Bob  58.0    London  86235.0  2023-01-02        B


Unnamed: 0.1,Unnamed: 0,Name,Age,City,Salary,Date,Category
12,12,Grace,29.0,Berlin,57141.0,2023-01-07,
13,13,Hannah,60.0,New York,,2023-01-08,B


In [13]:
df.shape   # How many rows and columns (dimensions)

(14, 7)

In [14]:
df.size   # Total elements

98

In [15]:
df.columns   # names of the columns

Index(['Unnamed: 0', 'Name', 'Age', 'City', 'Salary', 'Date', 'Category'], dtype='object')

In [16]:
df.dtypes   # datatype of each column

Unnamed: 0      int64
Name           object
Age           float64
City           object
Salary        float64
Date           object
Category       object
dtype: object

In [17]:
df.info()    # information about the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  14 non-null     int64  
 1   Name        14 non-null     object 
 2   Age         13 non-null     float64
 3   City        14 non-null     object 
 4   Salary      13 non-null     float64
 5   Date        14 non-null     object 
 6   Category    13 non-null     object 
dtypes: float64(2), int64(1), object(4)
memory usage: 916.0+ bytes


In [18]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,14.0,6.5,4.1833,0.0,3.25,6.5,9.75,13.0
Age,13.0,45.538462,14.157539,28.0,31.0,45.0,60.0,67.0
Salary,13.0,70425.230769,16018.646562,46022.0,58725.0,65413.0,87016.0,95445.0


In [19]:
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Name,14,10,Grace,2
City,14,5,New York,3
Date,14,10,2023-01-07,2
Category,13,3,B,6


In [20]:
# Accessing specific columns
df[['Name']]  # df.Name

Unnamed: 0,Name
0,Alice
1,Bob
2,Charlie
3,David
4,Eva
5,Frank
6,Grace
7,Hannah
8,Ivan
9,Jack


In [1]:
name_dit = df['Name'].value_counts()   # Count how often each name occurs (run after df is defined)
name_dit.values

In [21]:
# Accessing Multiple columns
df[['Name', "Age"]]

Unnamed: 0,Name,Age
0,Alice,45.0
1,Bob,58.0
2,Charlie,60.0
3,David,28.0
4,Eva,49.0
5,Frank,67.0
6,Grace,
7,Hannah,60.0
8,Ivan,37.0
9,Jack,31.0


### **Data Cleaning and Preprocessing**
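Data cleaning typically involves handling:
1. Missing Values
2. Outliers
3. Duplicates

Missing values can be handled in three ways:
- Accept: Keeping the missing values as they are
- Delete: Dropping the rows or columns that contain missing values
- Imputation: Replacing the missing values with a substitute such as the mean, the mode, or a neighbouring value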
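
The three options can be sketched side by side on a tiny, made-up column. The frame `scores` and the choice of the mean as the substitute value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"score": [10.0, np.nan, 30.0, np.nan, 50.0]})

accepted = scores                                  # Accept: keep NaN; reductions such as mean() skip it
deleted = scores.dropna()                          # Delete: drop the rows that contain NaN
imputed = scores.fillna(scores["score"].mean())    # Imputation: replace NaN with the column mean

print(accepted["score"].mean(), len(deleted), imputed["score"].tolist())
```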

In [22]:
data = pd.DataFrame({'A': [1, 2, None, 4, None, 6, 7, 1], 'B': [7, None, 9, None, 11, None, 7, 7]})
data

Unnamed: 0,A,B
0,1.0,7.0
1,2.0,
2,,9.0
3,4.0,
4,,11.0
5,6.0,
6,7.0,7.0
7,1.0,7.0


In [23]:
data.isnull().sum()  # Finding the missing values in the dataset

A    2
B    3
dtype: int64

In [24]:
data.isna().sum()  # Same as isnull()

A    2
B    3
dtype: int64

In [25]:
data.notnull().sum()    # Finds the not null values
data.notna().sum()

A    6
B    5
dtype: int64

In [26]:
#data.dropna(inplace=True)

In [27]:
data.dropna(subset=['B'], inplace=True)
data

Unnamed: 0,A,B
0,1.0,7.0
2,,9.0
4,,11.0
6,7.0,7.0
7,1.0,7.0


In [28]:
data['A'] = data['A'].fillna(0)   # assign the result back instead of a chained inplace fill
data

Unnamed: 0,A,B
0,1.0,7.0
2,0.0,9.0
4,0.0,11.0
6,7.0,7.0
7,1.0,7.0


In [29]:
data = pd.Series([1, 6, None, 4, None, 6])
data

0    1.0
1    6.0
2    NaN
3    4.0
4    NaN
5    6.0
dtype: float64

In [30]:
data.interpolate(inplace=True)    # Linear interpolation: each gap is filled from the values on either side
print(data) 

0    1.0
1    6.0
2    5.0
3    4.0
4    5.0
5    6.0
dtype: float64
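
With a single missing value between two known ones, linear interpolation happens to equal their average. With consecutive gaps the difference becomes visible, as this small sketch suggests (the Series `gappy` is made up for the example):

```python
import numpy as np
import pandas as pd

gappy = pd.Series([1.0, np.nan, np.nan, 4.0])

# method='linear' is the default: missing points are spaced evenly
# between the surrounding known values.
filled = gappy.interpolate()
print(filled.tolist())
```

Averaging the two neighbours would have given 2.5 for both gaps, whereas linear interpolation steps evenly from 1.0 to 4.0.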


In [31]:
data = pd.Series([1, None, None, 4, None, 6, None])
data.ffill(inplace=True)   # Forward fill: carry the last valid value forward (fillna(method=...) is deprecated)
data

0    1.0
1    1.0
2    1.0
3    4.0
4    4.0
5    6.0
6    6.0
dtype: float64

In [32]:
data = pd.Series([1, None, None, 4, None, 6, None])
data.bfill(inplace=True)   # Backward fill: pull the next valid value backward; a trailing NaN stays NaN
data

0    1.0
1    4.0
2    4.0
3    4.0
4    6.0
5    6.0
6    NaN
dtype: float64

#### **For Dataset**
#### **Finding Missing Values**

In [33]:
df

Unnamed: 0.1,Unnamed: 0,Name,Age,City,Salary,Date,Category
0,0,Alice,45.0,New York,95445.0,2023-01-01,A
1,1,Bob,58.0,London,86235.0,2023-01-02,B
2,2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,3,David,28.0,Paris,46022.0,2023-01-04,A
4,4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,6,Grace,,Berlin,57141.0,2023-01-07,C
7,7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,9,Jack,31.0,London,58725.0,2023-01-10,B


In [34]:
df.isnull().sum()    # Displaying the number of missing values in each column

Unnamed: 0    0
Name          0
Age           1
City          0
Salary        1
Date          0
Category      1
dtype: int64

In [35]:
df.isna().sum()

Unnamed: 0    0
Name          0
Age           1
City          0
Salary        1
Date          0
Category      1
dtype: int64

In [36]:
df.duplicated().sum()    # Finding the duplicates

np.int64(0)

In [37]:
df[df.duplicated()]    #Displaying the duplicates

Unnamed: 0.1,Unnamed: 0,Name,Age,City,Salary,Date,Category
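
`duplicated()` compares entire rows by default, so a column that is unique per row (such as the `Unnamed: 0` column that came back from the file) hides otherwise identical records. A minimal sketch, using a hypothetical `row_id` column to play that role:

```python
import pandas as pd

people = pd.DataFrame({
    "row_id": [0, 1, 2],                    # unique per row, like a saved index
    "Name": ["Ivan", "Jack", "Ivan"],
    "City": ["Tokyo", "London", "Tokyo"],
})

# Whole-row comparison: row_id makes every row distinct.
full_row_dupes = int(people.duplicated().sum())

# Restrict the comparison to the columns that define a duplicate.
subset_dupes = int(people.duplicated(subset=["Name", "City"]).sum())

# drop_duplicates accepts the same subset argument.
deduped = people.drop_duplicates(subset=["Name", "City"], keep="first")
print(full_row_dupes, subset_dupes, deduped["row_id"].tolist())
```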


#### **Handling missing values**

In [38]:
df.Salary.mean()

np.float64(70425.23076923077)

In [39]:
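# Handling missing values
# df.dropna(inplace=True)   # Drop rows with any missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())                      # numeric column: fill with the mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())             # numeric column: fill with the mean
df['Category'] = df['Category'].fillna(df['Category'].mode()[0])    # categorical column: fill with the mode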
df

Unnamed: 0.1,Unnamed: 0,Name,Age,City,Salary,Date,Category
0,0,Alice,45.0,New York,95445.0,2023-01-01,A
1,1,Bob,58.0,London,86235.0,2023-01-02,B
2,2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,3,David,28.0,Paris,46022.0,2023-01-04,A
4,4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,6,Grace,45.538462,Berlin,57141.0,2023-01-07,C
7,7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,9,Jack,31.0,London,58725.0,2023-01-10,B


In [40]:
df.drop_duplicates(inplace=True)
df

Unnamed: 0.1,Unnamed: 0,Name,Age,City,Salary,Date,Category
0,0,Alice,45.0,New York,95445.0,2023-01-01,A
1,1,Bob,58.0,London,86235.0,2023-01-02,B
2,2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,3,David,28.0,Paris,46022.0,2023-01-04,A
4,4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,6,Grace,45.538462,Berlin,57141.0,2023-01-07,C
7,7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,9,Jack,31.0,London,58725.0,2023-01-10,B


In [41]:
df.reset_index(inplace=True)    # resetting the index; the old index is kept as a new 'index' column
df

Unnamed: 0.1,index,Unnamed: 0,Name,Age,City,Salary,Date,Category
0,0,0,Alice,45.0,New York,95445.0,2023-01-01,A
1,1,1,Bob,58.0,London,86235.0,2023-01-02,B
2,2,2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,3,3,David,28.0,Paris,46022.0,2023-01-04,A
4,4,4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,5,5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,6,6,Grace,45.538462,Berlin,57141.0,2023-01-07,C
7,7,7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,8,8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,9,9,Jack,31.0,London,58725.0,2023-01-10,B


In [42]:
# df.drop(columns=['level_0', 'index'], inplace=True)   # Removing the columns
df.drop(columns=['Unnamed: 0','index'], inplace=True)
df

Unnamed: 0,Name,Age,City,Salary,Date,Category
0,Alice,45.0,New York,95445.0,2023-01-01,A
1,Bob,58.0,London,86235.0,2023-01-02,B
2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,David,28.0,Paris,46022.0,2023-01-04,A
4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,Grace,45.538462,Berlin,57141.0,2023-01-07,C
7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,Jack,31.0,London,58725.0,2023-01-10,B


In [43]:
df.drop_duplicates(inplace=True)
df

Unnamed: 0,Name,Age,City,Salary,Date,Category
0,Alice,45.0,New York,95445.0,2023-01-01,A
1,Bob,58.0,London,86235.0,2023-01-02,B
2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,David,28.0,Paris,46022.0,2023-01-04,A
4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,Grace,45.538462,Berlin,57141.0,2023-01-07,C
7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,Jack,31.0,London,58725.0,2023-01-10,B


In [44]:
df.reset_index(inplace=True)    
df

Unnamed: 0,index,Name,Age,City,Salary,Date,Category
0,0,Alice,45.0,New York,95445.0,2023-01-01,A
1,1,Bob,58.0,London,86235.0,2023-01-02,B
2,2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,3,David,28.0,Paris,46022.0,2023-01-04,A
4,4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,6,Grace,45.538462,Berlin,57141.0,2023-01-07,C
7,7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,9,Jack,31.0,London,58725.0,2023-01-10,B


In [45]:
df.drop('index', axis=1, inplace=True)
df

Unnamed: 0,Name,Age,City,Salary,Date,Category
0,Alice,45.0,New York,95445.0,2023-01-01,A
1,Bob,58.0,London,86235.0,2023-01-02,B
2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
3,David,28.0,Paris,46022.0,2023-01-04,A
4,Eva,49.0,Berlin,65999.0,2023-01-05,B
5,Frank,67.0,Paris,88320.0,2023-01-06,A
6,Grace,45.538462,Berlin,57141.0,2023-01-07,C
7,Hannah,60.0,New York,65413.0,2023-01-08,B
8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A
9,Jack,31.0,London,58725.0,2023-01-10,B


#### **Special Functions**

In [46]:
df['City'].str.contains('New York')

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11     True
Name: City, dtype: bool

In [47]:
df[df['City'].str.contains('New York')]     # Selects the rows where City contains 'New York'

Unnamed: 0,Name,Age,City,Salary,Date,Category
0,Alice,45.0,New York,95445.0,2023-01-01,A
7,Hannah,60.0,New York,65413.0,2023-01-08,B
11,Hannah,60.0,New York,70425.230769,2023-01-08,B
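
`str.contains` treats its pattern as a regular expression and is case-sensitive by default, and it returns a missing value for missing entries. The sketch below shows the `case`, `na` and `regex` arguments on made-up data (the Series `cities` is an assumption for illustration):

```python
import pandas as pd

cities = pd.Series(["New York", "london", None, "York (UK)"])

# case=False ignores case; na=False treats missing entries as non-matches,
# so the result is a clean boolean mask that can be used for filtering.
has_york = cities.str.contains("york", case=False, na=False)

# regex=False matches the text literally, so "(" is not read as a regex group.
has_paren = cities.str.contains("(", regex=False, na=False)

print(has_york.tolist())
print(has_paren.tolist())
```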


In [48]:
df['City'].str.split(' ')   # Splits on space

0     [New, York]
1        [London]
2         [Tokyo]
3         [Paris]
4        [Berlin]
5         [Paris]
6        [Berlin]
7     [New, York]
8         [Tokyo]
9        [London]
10       [Berlin]
11    [New, York]
Name: City, dtype: object

In [49]:
df['City'].str.split(' ').str.get(0)   # gets the first element of the list

0        New
1     London
2      Tokyo
3      Paris
4     Berlin
5      Paris
6     Berlin
7        New
8      Tokyo
9     London
10    Berlin
11       New
Name: City, dtype: object

In [50]:
df[df['Name'].str.startswith('A')]

Unnamed: 0,Name,Age,City,Salary,Date,Category
0,Alice,45.0,New York,95445.0,2023-01-01,A


In [51]:
df[df['Name'].str.endswith('e')]

Unnamed: 0,Name,Age,City,Salary,Date,Category
0,Alice,45.0,New York,95445.0,2023-01-01,A
2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C
6,Grace,45.538462,Berlin,57141.0,2023-01-07,C
10,Grace,29.0,Berlin,57141.0,2023-01-07,B


In [52]:
df['Name1']= df['Name'].str.upper()   # upper case
df

Unnamed: 0,Name,Age,City,Salary,Date,Category,Name1
0,Alice,45.0,New York,95445.0,2023-01-01,A,ALICE
1,Bob,58.0,London,86235.0,2023-01-02,B,BOB
2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C,CHARLIE
3,David,28.0,Paris,46022.0,2023-01-04,A,DAVID
4,Eva,49.0,Berlin,65999.0,2023-01-05,B,EVA
5,Frank,67.0,Paris,88320.0,2023-01-06,A,FRANK
6,Grace,45.538462,Berlin,57141.0,2023-01-07,C,GRACE
7,Hannah,60.0,New York,65413.0,2023-01-08,B,HANNAH
8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A,IVAN
9,Jack,31.0,London,58725.0,2023-01-10,B,JACK


In [53]:
df['Name1']= df['Name'].str.lower()     # lower case
df

Unnamed: 0,Name,Age,City,Salary,Date,Category,Name1
0,Alice,45.0,New York,95445.0,2023-01-01,A,alice
1,Bob,58.0,London,86235.0,2023-01-02,B,bob
2,Charlie,60.0,Tokyo,62330.0,2023-01-03,C,charlie
3,David,28.0,Paris,46022.0,2023-01-04,A,david
4,Eva,49.0,Berlin,65999.0,2023-01-05,B,eva
5,Frank,67.0,Paris,88320.0,2023-01-06,A,frank
6,Grace,45.538462,Berlin,57141.0,2023-01-07,C,grace
7,Hannah,60.0,New York,65413.0,2023-01-08,B,hannah
8,Ivan,37.0,Tokyo,87016.0,2023-01-09,A,ivan
9,Jack,31.0,London,58725.0,2023-01-10,B,jack


In [54]:
# Converting data types
df['Age'] = df['Age'].astype('int')   
df

Unnamed: 0,Name,Age,City,Salary,Date,Category,Name1
0,Alice,45,New York,95445.0,2023-01-01,A,alice
1,Bob,58,London,86235.0,2023-01-02,B,bob
2,Charlie,60,Tokyo,62330.0,2023-01-03,C,charlie
3,David,28,Paris,46022.0,2023-01-04,A,david
4,Eva,49,Berlin,65999.0,2023-01-05,B,eva
5,Frank,67,Paris,88320.0,2023-01-06,A,frank
6,Grace,45,Berlin,57141.0,2023-01-07,C,grace
7,Hannah,60,New York,65413.0,2023-01-08,B,hannah
8,Ivan,37,Tokyo,87016.0,2023-01-09,A,ivan
9,Jack,31,London,58725.0,2023-01-10,B,jack


In [55]:
df.drop('Name1', axis = 1, inplace=True)    # axis=1 drops a column (axis=0 would drop a row)
df

Unnamed: 0,Name,Age,City,Salary,Date,Category
0,Alice,45,New York,95445.0,2023-01-01,A
1,Bob,58,London,86235.0,2023-01-02,B
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
3,David,28,Paris,46022.0,2023-01-04,A
4,Eva,49,Berlin,65999.0,2023-01-05,B
5,Frank,67,Paris,88320.0,2023-01-06,A
6,Grace,45,Berlin,57141.0,2023-01-07,C
7,Hannah,60,New York,65413.0,2023-01-08,B
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
9,Jack,31,London,58725.0,2023-01-10,B


In [56]:
df.reset_index(drop=True, inplace=True)

In [57]:
df.rename(columns={'Category': 'Cate'}, inplace=True)
df

Unnamed: 0,Name,Age,City,Salary,Date,Cate
0,Alice,45,New York,95445.0,2023-01-01,A
1,Bob,58,London,86235.0,2023-01-02,B
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
3,David,28,Paris,46022.0,2023-01-04,A
4,Eva,49,Berlin,65999.0,2023-01-05,B
5,Frank,67,Paris,88320.0,2023-01-06,A
6,Grace,45,Berlin,57141.0,2023-01-07,C
7,Hannah,60,New York,65413.0,2023-01-08,B
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
9,Jack,31,London,58725.0,2023-01-10,B


#### **Data Selection and Indexing**

In [58]:
# Select specific columns
df[['Name']]

Unnamed: 0,Name
0,Alice
1,Bob
2,Charlie
3,David
4,Eva
5,Frank
6,Grace
7,Hannah
8,Ivan
9,Jack


In [59]:
df[['Name', 'Age', 'City']]    # Accessing multiple column 

Unnamed: 0,Name,Age,City
0,Alice,45,New York
1,Bob,58,London
2,Charlie,60,Tokyo
3,David,28,Paris
4,Eva,49,Berlin
5,Frank,67,Paris
6,Grace,45,Berlin
7,Hannah,60,New York
8,Ivan,37,Tokyo
9,Jack,31,London


In [60]:
df[0:7]

Unnamed: 0,Name,Age,City,Salary,Date,Cate
0,Alice,45,New York,95445.0,2023-01-01,A
1,Bob,58,London,86235.0,2023-01-02,B
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
3,David,28,Paris,46022.0,2023-01-04,A
4,Eva,49,Berlin,65999.0,2023-01-05,B
5,Frank,67,Paris,88320.0,2023-01-06,A
6,Grace,45,Berlin,57141.0,2023-01-07,C


In [61]:
df[0:7:2]

Unnamed: 0,Name,Age,City,Salary,Date,Cate
0,Alice,45,New York,95445.0,2023-01-01,A
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
4,Eva,49,Berlin,65999.0,2023-01-05,B
6,Grace,45,Berlin,57141.0,2023-01-07,C


In [62]:
df[-6:-1]

Unnamed: 0,Name,Age,City,Salary,Date,Cate
6,Grace,45,Berlin,57141.0,2023-01-07,C
7,Hannah,60,New York,65413.0,2023-01-08,B
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
9,Jack,31,London,58725.0,2023-01-10,B
10,Grace,29,Berlin,57141.0,2023-01-07,B


In [63]:
df[-6:-1:2]

Unnamed: 0,Name,Age,City,Salary,Date,Cate
6,Grace,45,Berlin,57141.0,2023-01-07,C
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
10,Grace,29,Berlin,57141.0,2023-01-07,B


In [64]:
df[df['Age'] >= 30]    # Select rows based on a condition

Unnamed: 0,Name,Age,City,Salary,Date,Cate
0,Alice,45,New York,95445.0,2023-01-01,A
1,Bob,58,London,86235.0,2023-01-02,B
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
4,Eva,49,Berlin,65999.0,2023-01-05,B
5,Frank,67,Paris,88320.0,2023-01-06,A
6,Grace,45,Berlin,57141.0,2023-01-07,C
7,Hannah,60,New York,65413.0,2023-01-08,B
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
9,Jack,31,London,58725.0,2023-01-10,B
11,Hannah,60,New York,70425.230769,2023-01-08,B


In [65]:
df_filtered = df[df['Age'] > 20]    # Assuming df is your DataFrame 
df_filtered

Unnamed: 0,Name,Age,City,Salary,Date,Cate
0,Alice,45,New York,95445.0,2023-01-01,A
1,Bob,58,London,86235.0,2023-01-02,B
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
3,David,28,Paris,46022.0,2023-01-04,A
4,Eva,49,Berlin,65999.0,2023-01-05,B
5,Frank,67,Paris,88320.0,2023-01-06,A
6,Grace,45,Berlin,57141.0,2023-01-07,C
7,Hannah,60,New York,65413.0,2023-01-08,B
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
9,Jack,31,London,58725.0,2023-01-10,B


In [66]:
df_filtered = df[(df['Age'] > 20) & (df['Name'] == 'David')]
df_filtered

Unnamed: 0,Name,Age,City,Salary,Date,Cate
3,David,28,Paris,46022.0,2023-01-04,A
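
Besides `&` (and), conditions can be combined with `|` (or) and negated with `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch on a made-up frame named `staff`:

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "David", "Frank"],
    "Age": [45, 28, 67],
    "City": ["New York", "Paris", "Paris"],
})

# OR: rows where either condition holds.
young_or_ny = staff[(staff["Age"] < 30) | (staff["City"] == "New York")]

# NOT: rows where the condition does not hold.
not_paris = staff[~(staff["City"] == "Paris")]

# The same OR filter written with query(), which uses the words and/or/not.
same_with_query = staff.query('Age < 30 or City == "New York"')

print(young_or_ny["Name"].tolist(), not_paris["Name"].tolist())
```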


In [67]:
df_filtered = df[df['City'].isin(['New York', 'San Francisco'])]
df_filtered

Unnamed: 0,Name,Age,City,Salary,Date,Cate
0,Alice,45,New York,95445.0,2023-01-01,A
7,Hannah,60,New York,65413.0,2023-01-08,B
11,Hannah,60,New York,70425.230769,2023-01-08,B


In [68]:
df_filtered = df.query('Age > 30')
df_filtered

Unnamed: 0,Name,Age,City,Salary,Date,Cate
0,Alice,45,New York,95445.0,2023-01-01,A
1,Bob,58,London,86235.0,2023-01-02,B
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
4,Eva,49,Berlin,65999.0,2023-01-05,B
5,Frank,67,Paris,88320.0,2023-01-06,A
6,Grace,45,Berlin,57141.0,2023-01-07,C
7,Hannah,60,New York,65413.0,2023-01-08,B
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
9,Jack,31,London,58725.0,2023-01-10,B
11,Hannah,60,New York,70425.230769,2023-01-08,B


In [69]:
df.query('Name == "Alice"')

Unnamed: 0,Name,Age,City,Salary,Date,Cate
0,Alice,45,New York,95445.0,2023-01-01,A


#### **Data Exploration**

In [70]:
print(df['Salary'].sum())
print(df['Age'].max())
print(df['Age'].min())
print(df['Salary'].count())

print(df['Age'].mean())
print(df['Salary'].median())
print(df['Age'].mode())
print(df['Salary'].std())
print(df['Salary'].var())

840212.2307692308
67
28
12
47.416666666666664
65706.0
0    60
Name: Age, dtype: int64
15565.177669503992
242274755.88322574


In [71]:
np.sqrt(242274755.88322574)

np.float64(15565.177669503992)

In [72]:
df.City.unique()    # Unique values in the column

array(['New York', 'London', 'Tokyo', 'Paris', 'Berlin'], dtype=object)

In [73]:
for i in df.columns:
    print(f"{i}: ", df[i].unique())

Name:  ['Alice' 'Bob' 'Charlie' 'David' 'Eva' 'Frank' 'Grace' 'Hannah' 'Ivan'
 'Jack']
Age:  [45 58 60 28 49 67 37 31 29]
City:  ['New York' 'London' 'Tokyo' 'Paris' 'Berlin']
Salary:  [95445.         86235.         62330.         46022.
 65999.         88320.         57141.         65413.
 87016.         58725.         70425.23076923]
Date:  ['2023-01-01' '2023-01-02' '2023-01-03' '2023-01-04' '2023-01-05'
 '2023-01-06' '2023-01-07' '2023-01-08' '2023-01-09' '2023-01-10']
Cate:  ['A' 'B' 'C']


In [74]:
df['City'].nunique()    # number of unique values in a column

5

In [75]:
df['Age'].nlargest()    # largest 5 values

5     67
2     60
7     60
11    60
1     58
Name: Age, dtype: int64

In [76]:
df.nlargest(3, 'Age')   # 3 largest values of Age

Unnamed: 0,Name,Age,City,Salary,Date,Cate
5,Frank,67,Paris,88320.0,2023-01-06,A
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
7,Hannah,60,New York,65413.0,2023-01-08,B


In [77]:
df['Age'].nsmallest()     # 5 smallest values of Age (as a Series)
df.nsmallest(5, 'Age')    # rows with the 5 smallest values of Age

Unnamed: 0,Name,Age,City,Salary,Date,Cate
3,David,28,Paris,46022.0,2023-01-04,A
10,Grace,29,Berlin,57141.0,2023-01-07,B
9,Jack,31,London,58725.0,2023-01-10,B
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
0,Alice,45,New York,95445.0,2023-01-01,A


In [78]:
df

Unnamed: 0,Name,Age,City,Salary,Date,Cate
0,Alice,45,New York,95445.0,2023-01-01,A
1,Bob,58,London,86235.0,2023-01-02,B
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
3,David,28,Paris,46022.0,2023-01-04,A
4,Eva,49,Berlin,65999.0,2023-01-05,B
5,Frank,67,Paris,88320.0,2023-01-06,A
6,Grace,45,Berlin,57141.0,2023-01-07,C
7,Hannah,60,New York,65413.0,2023-01-08,B
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
9,Jack,31,London,58725.0,2023-01-10,B


In [79]:
df['City'].unique()

array(['New York', 'London', 'Tokyo', 'Paris', 'Berlin'], dtype=object)

In [80]:
df['City'].value_counts()    # Counts how many times each unique value occurs in the column

City
New York    3
Berlin      3
London      2
Tokyo       2
Paris       2
Name: count, dtype: int64

In [81]:
df['Cate'].value_counts()

Cate
B    6
A    4
C    2
Name: count, dtype: int64

In [82]:
df['Age'].sort_values(ascending=False)    # sorts by age in descending order

5     67
2     60
7     60
11    60
1     58
4     49
0     45
6     45
8     37
9     31
10    29
3     28
Name: Age, dtype: int64

In [83]:
df.sort_values(by='Age', ascending=True)     # sort by age in ascending order 

Unnamed: 0,Name,Age,City,Salary,Date,Cate
3,David,28,Paris,46022.0,2023-01-04,A
10,Grace,29,Berlin,57141.0,2023-01-07,B
9,Jack,31,London,58725.0,2023-01-10,B
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
0,Alice,45,New York,95445.0,2023-01-01,A
6,Grace,45,Berlin,57141.0,2023-01-07,C
4,Eva,49,Berlin,65999.0,2023-01-05,B
1,Bob,58,London,86235.0,2023-01-02,B
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
7,Hannah,60,New York,65413.0,2023-01-08,B


#### **Data Aggregation:**

In [84]:
df.groupby('City')['Age'].mean()     # groupby city and get the mean of the age 

City
Berlin      41.0
London      44.5
New York    55.0
Paris       47.5
Tokyo       48.5
Name: Age, dtype: float64

In [85]:
salary_analysis = df.groupby('City')['Salary'].sum()    # sum of salary for each city
print(salary_analysis.index.tolist())
print(np.round(salary_analysis.values))

['Berlin', 'London', 'New York', 'Paris', 'Tokyo']
[180281. 144960. 231283. 134342. 149346.]


In [86]:
df.groupby(['Name', 'Cate'])['Salary'].sum()   # sum of salary for each name and category

Name     Cate
Alice    A        95445.000000
Bob      B        86235.000000
Charlie  C        62330.000000
David    A        46022.000000
Eva      B        65999.000000
Frank    A        88320.000000
Grace    B        57141.000000
         C        57141.000000
Hannah   B       135838.230769
Ivan     A        87016.000000
Jack     B        58725.000000
Name: Salary, dtype: float64

In [87]:
# Calculating multiple aggregations simultaneously
df.groupby('City').agg({'Age': 'mean', 'Salary': 'sum'}).sort_values(by='Age', ascending=False).reset_index()

Unnamed: 0,City,Age,Salary
0,New York,55.0,231283.230769
1,Tokyo,48.5,149346.0
2,Paris,47.5,134342.0
3,London,44.5,144960.0
4,Berlin,41.0,180281.0
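
When several aggregations are computed at once, named aggregation lets each result column get an explicit name. The following is a sketch on a small made-up frame; the output names `mean_age`, `total_salary` and `headcount` are arbitrary choices.

```python
import pandas as pd

staff = pd.DataFrame({
    "City": ["Paris", "Paris", "Tokyo"],
    "Age": [28, 67, 60],
    "Salary": [46022.0, 88320.0, 62330.0],
})

# Each keyword is an output column: new_name=(source_column, aggregation).
summary = (
    staff.groupby("City")
    .agg(
        mean_age=("Age", "mean"),
        total_salary=("Salary", "sum"),
        headcount=("Age", "count"),
    )
    .reset_index()
)
print(summary)
```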


In [88]:
df['Name'][4]

'Eva'

In [89]:
print("before", df.loc[4, "Name"])
df.loc[4, 'Name'] = 'Roshan'    # assign through .loc; chained df['Name'][4] = ... may not update df
print("after", df.loc[4, 'Name'])

before Eva
after Roshan
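
Writing a single cell through `.loc` or `.at` in one indexing step avoids the chained-assignment pitfall of `df[col][row] = value`, which recent Pandas versions warn about and which may end up modifying a temporary copy. A minimal sketch on a made-up frame:

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["Eva", "Frank"], "Salary": [65999.0, 88320.0]})

# One indexing operation: row label first, column name second.
staff.loc[0, "Name"] = "Roshan"

# .at is the scalar-only counterpart of .loc.
staff.at[1, "Salary"] = 56000.0

print(staff)
```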


In [90]:
df

Unnamed: 0,Name,Age,City,Salary,Date,Cate
0,Alice,45,New York,95445.0,2023-01-01,A
1,Bob,58,London,86235.0,2023-01-02,B
2,Charlie,60,Tokyo,62330.0,2023-01-03,C
3,David,28,Paris,46022.0,2023-01-04,A
4,Roshan,49,Berlin,65999.0,2023-01-05,B
5,Frank,67,Paris,88320.0,2023-01-06,A
6,Grace,45,Berlin,57141.0,2023-01-07,C
7,Hannah,60,New York,65413.0,2023-01-08,B
8,Ivan,37,Tokyo,87016.0,2023-01-09,A
9,Jack,31,London,58725.0,2023-01-10,B


In [91]:
df.loc[5, 'Salary'] = 56000    # single-step assignment through .loc

In [92]:
# Assign ranks based on Age
df['Rank'] = df['Age'].rank(ascending=False, method='average')
df

Unnamed: 0,Name,Age,City,Salary,Date,Cate,Rank
0,Alice,45,New York,95445.0,2023-01-01,A,7.5
1,Bob,58,London,86235.0,2023-01-02,B,5.0
2,Charlie,60,Tokyo,62330.0,2023-01-03,C,3.0
3,David,28,Paris,46022.0,2023-01-04,A,12.0
4,Roshan,49,Berlin,65999.0,2023-01-05,B,6.0
5,Frank,67,Paris,56000.0,2023-01-06,A,1.0
6,Grace,45,Berlin,57141.0,2023-01-07,C,7.5
7,Hannah,60,New York,65413.0,2023-01-08,B,3.0
8,Ivan,37,Tokyo,87016.0,2023-01-09,A,9.0
9,Jack,31,London,58725.0,2023-01-10,B,10.0
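
The `method` argument decides how tied values are ranked. With `method='average'`, tied values share the mean of the positions they occupy, which is why fractional ranks such as 7.5 appear above. The sketch below compares three methods on a small made-up Series:

```python
import pandas as pd

ages = pd.Series([67, 60, 60, 58])

avg_rank = ages.rank(ascending=False, method="average")   # ties share the mean position
min_rank = ages.rank(ascending=False, method="min")       # ties share the lowest position
dense_rank = ages.rank(ascending=False, method="dense")   # like min, but without gaps afterwards

print(avg_rank.tolist())     # [1.0, 2.5, 2.5, 4.0]
print(min_rank.tolist())     # [1.0, 2.0, 2.0, 4.0]
print(dense_rank.tolist())   # [1.0, 2.0, 2.0, 3.0]
```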


#### **Data Manipulation:**

In [93]:
print(df.loc[1])     # loc selects by label; here it returns the row whose index label is 1

Name             Bob
Age               58
City          London
Salary       86235.0
Date      2023-01-02
Cate               B
Rank             5.0
Name: 1, dtype: object


In [94]:
print(df.loc[1, 'Name'])   # loc takes the row label (here an integer) and the column name

Bob


In [95]:
print(df.loc[2:3, 'Name'])

2    Charlie
3      David
Name: Name, dtype: object


In [96]:
print(df.iloc[0:3])

      Name  Age      City   Salary        Date Cate  Rank
0    Alice   45  New York  95445.0  2023-01-01    A   7.5
1      Bob   58    London  86235.0  2023-01-02    B   5.0
2  Charlie   60     Tokyo  62330.0  2023-01-03    C   3.0


In [97]:
print(df.iloc[0:3, 2])    # iloc uses integer positions for both rows and columns

0    New York
1      London
2       Tokyo
Name: City, dtype: object


In [98]:
print(df.iloc[0:4])

      Name  Age      City   Salary        Date Cate  Rank
0    Alice   45  New York  95445.0  2023-01-01    A   7.5
1      Bob   58    London  86235.0  2023-01-02    B   5.0
2  Charlie   60     Tokyo  62330.0  2023-01-03    C   3.0
3    David   28     Paris  46022.0  2023-01-04    A  12.0


In [99]:
print(df.iloc[1, 4])

2023-01-02


In [100]:
print(df.loc[0:4, "Name":"Salary"])    # both starting and ending index are included 

      Name  Age      City   Salary
0    Alice   45  New York  95445.0
1      Bob   58    London  86235.0
2  Charlie   60     Tokyo  62330.0
3    David   28     Paris  46022.0
4   Roshan   49    Berlin  65999.0


In [101]:
print(df.iloc[0:4, 0:3])     # starting index is included and ending index is excluded

      Name  Age      City
0    Alice   45  New York
1      Bob   58    London
2  Charlie   60     Tokyo
3    David   28     Paris


In [102]:
print(df.iloc[0:4:2, 0:3:2])

      Name      City
0    Alice  New York
2  Charlie     Tokyo
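
With the default 0, 1, 2, ... index, labels and positions coincide, so `loc` and `iloc` can look interchangeable. The difference shows up as soon as the index is something else. A small sketch with a made-up frame whose index labels are 10, 20 and 30:

```python
import pandas as pd

frame = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"]}, index=[10, 20, 30])

by_label = frame.loc[20, "Name"]     # row labelled 20
by_position = frame.iloc[2, 0]       # third row, first column

label_slice = frame.loc[10:20]       # label slice: both ends included
position_slice = frame.iloc[0:1]     # position slice: stop excluded

print(by_label, by_position, len(label_slice), len(position_slice))
```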


#### **Adding New Row**

In [103]:
df.loc[12] = {'Name':'Roshan', 'Age':25, 'City':'Blore', 'Salary':20000, 'Date':'2023-01-06', 'Cate':'B', 'Rank':1.0}

In [104]:
df.loc[13] = ['Bhavan', None, 'Paris', 20000, '2023-01-06', 'A', 5.0]

In [105]:
# Dropping Rows 
df.drop(13, inplace=True)
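
Assigning to a new label with `df.loc[...]` works well for one row at a time. When several rows need to be added, `pd.concat` is a common alternative. This is a sketch on made-up data rather than the notebook's `df`:

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [45, 58]})
new_rows = pd.DataFrame({"Name": ["Roshan"], "Age": [25]})

# ignore_index=True renumbers the combined rows 0..n-1.
staff = pd.concat([staff, new_rows], ignore_index=True)

# Drop the row labelled 0, then renumber again.
staff = staff.drop(index=0).reset_index(drop=True)

print(staff)
```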

#### **Apply**

In [106]:
def categorise(col):
    if col <= 10:
        return "Young"
    elif col <= 19:
        return "Teen Age"
    elif col <= 45:
        return "Youth"
    elif col <= 60:
        return "Adult"
    else:
        return "Senior"
    
categorise(28)

'Youth'

In [107]:
df['Age Category'] = df['Age'].apply(categorise)
df

Unnamed: 0,Name,Age,City,Salary,Date,Cate,Rank,Age Category
0,Alice,45,New York,95445.0,2023-01-01,A,7.5,Youth
1,Bob,58,London,86235.0,2023-01-02,B,5.0,Adult
2,Charlie,60,Tokyo,62330.0,2023-01-03,C,3.0,Adult
3,David,28,Paris,46022.0,2023-01-04,A,12.0,Youth
4,Roshan,49,Berlin,65999.0,2023-01-05,B,6.0,Adult
5,Frank,67,Paris,56000.0,2023-01-06,A,1.0,Senior
6,Grace,45,Berlin,57141.0,2023-01-07,C,7.5,Youth
7,Hannah,60,New York,65413.0,2023-01-08,B,3.0,Adult
8,Ivan,37,Tokyo,87016.0,2023-01-09,A,9.0,Youth
9,Jack,31,London,58725.0,2023-01-10,B,10.0,Youth
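For threshold-based labels like these, a vectorised alternative to `apply` is `pd.cut`. The sketch below reuses the same cut-offs and label names as the function above on a hypothetical `ages` Series; it is an alternative technique, not the approach the notebook itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical ages used only for this illustration
ages = pd.Series([45, 58, 60, 28, 67, 8, 15])

# Bin edges mirror the <= thresholds: (-inf, 10], (10, 19], (19, 45], (45, 60], (60, inf)
bins = [-np.inf, 10, 19, 45, 60, np.inf]
labels = ["Young", "Teen Age", "Youth", "Adult", "Senior"]

# right=True (the default) makes each bin include its upper edge, matching "<="
age_category = pd.cut(ages, bins=bins, labels=labels)

print(age_category.tolist())
```

The result is a categorical Series whose categories keep the order given in `labels`, which can be handy for sorting and grouping.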


In [108]:
data = {'A': [1,2,3], 'B': [4,5,6], 'C': [7,8,9]}
df1 = pd.DataFrame(data)
df1

Unnamed: 0,A,B,C
0,1,4,7
1,2,5,8
2,3,6,9


In [109]:
# Define a function to double the value of each element
def double(x):
    return x*2

# Apply the function to the DataFrame, column-wise (default behaviour)
result = df1.apply(double)
result

Unnamed: 0,A,B,C
0,2,8,14
1,4,10,16
2,6,12,18


In [110]:
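# Define a function that checks whether a value is divisible by two
def div(x):
    return x % 2 == 0

# apply passes each column (a Series) to div; applymap passes each element.
# Both give the same boolean mask here because % works on a whole Series.
# Note: applymap is deprecated since pandas 2.1 in favour of DataFrame.map.
result = df1[df1.apply(div)]
result1 = df1[df1.applymap(div)]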
print(result)
print(result1)

     A    B    C
0  NaN  4.0  NaN
1  2.0  NaN  8.0
2  NaN  6.0  NaN
     A    B    C
0  NaN  4.0  NaN
1  2.0  NaN  8.0
2  NaN  6.0  NaN
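To make the column-wise versus row-wise behaviour of `apply` more visible, the sketch below uses a reduction (max minus min) so that the `axis` argument changes the result. It also shows that the even-number mask can be built without `apply` at all. The name `df_small` is a hypothetical stand-in with the same values as `df1`.

```python
import pandas as pd

# Hypothetical stand-in with the same values as df1
df_small = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# axis=0 (default): the function receives each column as a Series
col_range = df_small.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_range = df_small.apply(lambda row: row.max() - row.min(), axis=1)

# The same even-number filter without apply: arithmetic is already vectorised
evens = df_small.where(df_small % 2 == 0)

print(col_range)
print(row_range)
print(evens)
```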


#### **Merge**

In [111]:
left_data = {
    'ID': [1,2,3,4,5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 28, 24],
}

left_df = pd.DataFrame(left_data)

right_data = {
    'ID': [1,2,4,5,6],
    'City': ['New York', 'London', 'Paris', 'Berlin', 'Tokyo'],
    'Salary': [60000, 75000, 62000, 57000, 80000]
}

right_df = pd.DataFrame(right_data)

In [None]:
df.select_dtypes(include=['int', 'float'])

In [112]:
left_df

Unnamed: 0,ID,Name,Age
0,1,Alice,25
1,2,Bob,30
2,3,Charlie,35
3,4,David,28
4,5,Eva,24


In [113]:
right_df

Unnamed: 0,ID,City,Salary
0,1,New York,60000
1,2,London,75000
2,4,Paris,62000
3,5,Berlin,57000
4,6,Tokyo,80000


In [116]:
# Data Merging and Joining:
# Inner join (default), Merging DataFrames based on a common column
merged_df = pd.merge(left_df, right_df, on='ID')
merged_df

Unnamed: 0,ID,Name,Age,City,Salary
0,1,Alice,25,New York,60000
1,2,Bob,30,London,75000
2,4,David,28,Paris,62000
3,5,Eva,24,Berlin,57000


In [118]:
# Joining DataFrames on the 'ID' column using each join type
# Inner join on 'ID' column
result_inner = pd.merge(left_df, right_df, on= 'ID', how='inner')
print(result_inner)

# Left join on 'ID' column
result_left = pd.merge(left_df, right_df, on='ID', how='left')
print(result_left)

# Right join on 'ID' column
result_right = pd.merge(left_df, right_df, on='ID', how='right')
print(result_right)

# Outer join on 'ID' column
result_outer = pd.merge(left_df, right_df, on='ID', how='outer')
print(result_outer)

   ID   Name  Age      City  Salary
0   1  Alice   25  New York   60000
1   2    Bob   30    London   75000
2   4  David   28     Paris   62000
3   5    Eva   24    Berlin   57000
   ID     Name  Age      City   Salary
0   1    Alice   25  New York  60000.0
1   2      Bob   30    London  75000.0
2   3  Charlie   35       NaN      NaN
3   4    David   28     Paris  62000.0
4   5      Eva   24    Berlin  57000.0
   ID   Name   Age      City  Salary
0   1  Alice  25.0  New York   60000
1   2    Bob  30.0    London   75000
2   4  David  28.0     Paris   62000
3   5    Eva  24.0    Berlin   57000
4   6    NaN   NaN     Tokyo   80000
   ID     Name   Age      City   Salary
0   1    Alice  25.0  New York  60000.0
1   2      Bob  30.0    London  75000.0
2   3  Charlie  35.0       NaN      NaN
3   4    David  28.0     Paris  62000.0
4   5      Eva  24.0    Berlin  57000.0
5   6      NaN   NaN     Tokyo  80000.0
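When comparing join types it can help to see where each row came from. A minimal sketch, assuming frames shaped like `left_df` and `right_df` above (rebuilt here with fewer columns so the block is self-contained), uses the `indicator=True` option of `pd.merge`:

```python
import pandas as pd

# Rebuilt, trimmed versions of the two frames so this block runs on its own
left_df = pd.DataFrame(
    {"ID": [1, 2, 3, 4, 5], "Name": ["Alice", "Bob", "Charlie", "David", "Eva"]}
)
right_df = pd.DataFrame(
    {"ID": [1, 2, 4, 5, 6], "City": ["New York", "London", "Paris", "Berlin", "Tokyo"]}
)

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both"
audit = pd.merge(left_df, right_df, on="ID", how="outer", indicator=True)
print(audit)

# IDs that appear in only one of the two frames
unmatched = audit.loc[audit["_merge"] != "both", "ID"].tolist()
print(unmatched)
```

Rows marked `both` are exactly the rows an inner join would return, so the indicator column is a quick way to check what a left, right or inner join would drop.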


#### **Time Series Analysis**

In [123]:
import datetime
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df['Date']

0    2023-01-01
1    2023-01-02
2    2023-01-03
3    2023-01-04
4    2023-01-05
5    2023-01-06
6    2023-01-07
7    2023-01-08
8    2023-01-09
9    2023-01-10
10   2023-01-07
11   2023-01-08
12   2023-01-06
Name: Date, dtype: datetime64[ns]
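Once a column has a datetime dtype, its components are available through the `.dt` accessor. The sketch below uses a hypothetical `raw` Series and also shows `errors="coerce"`, which turns unparseable strings into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical raw strings; the last one is deliberately invalid
raw = pd.Series(["2023-01-01", "2023-01-02", "not a date"])

# errors="coerce" converts values that do not match the format to NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.dt.day_name())   # weekday names; missing for NaT
print(parsed.dt.month)        # month numbers; missing for NaT
```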

In [124]:
from datetime import date
dt = date.today().day     # Day of the month for the current date
print(dt)

27


In [125]:
dt = date.today().month
print(dt)

9


In [126]:
df['Present'] = pd.date_range(start='2024-09-27', periods=13, freq = 'D')

In [127]:
df.rename(columns={'Date': 'Joining Date'}, inplace=True)    # Renaming column name

In [128]:
df['Year_of_Experience'] = df['Present'].dt.year - df['Joining Date'].dt.year

In [129]:
df

Unnamed: 0,Name,Age,City,Salary,Joining Date,Cate,Rank,Age Category,Present,Year_of_Experience
0,Alice,45,New York,95445.0,2023-01-01,A,7.5,Youth,2024-09-27,1
1,Bob,58,London,86235.0,2023-01-02,B,5.0,Adult,2024-09-28,1
2,Charlie,60,Tokyo,62330.0,2023-01-03,C,3.0,Adult,2024-09-29,1
3,David,28,Paris,46022.0,2023-01-04,A,12.0,Youth,2024-09-30,1
4,Roshan,49,Berlin,65999.0,2023-01-05,B,6.0,Adult,2024-10-01,1
5,Frank,67,Paris,56000.0,2023-01-06,A,1.0,Senior,2024-10-02,1
6,Grace,45,Berlin,57141.0,2023-01-07,C,7.5,Youth,2024-10-03,1
7,Hannah,60,New York,65413.0,2023-01-08,B,3.0,Adult,2024-10-04,1
8,Ivan,37,Tokyo,87016.0,2023-01-09,A,9.0,Youth,2024-10-05,1
9,Jack,31,London,58725.0,2023-01-10,B,10.0,Youth,2024-10-06,1
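Subtracting the `.dt.year` values counts calendar-year boundaries crossed, not elapsed time. A rough alternative, sketched below on hypothetical `joined` and `present` Series, divides the elapsed days by 365.25; the divisor is an approximation chosen for this illustration.

```python
import pandas as pd

# Hypothetical dates used only for this illustration
joined = pd.to_datetime(pd.Series(["2023-01-01", "2023-12-31"]))
present = pd.to_datetime(pd.Series(["2024-09-27", "2024-01-01"]))

# Calendar-year difference: both rows give 1, even though the second is one day
year_diff = present.dt.year - joined.dt.year

# Elapsed time: subtracting datetimes gives a Timedelta, .dt.days extracts whole days
elapsed_days = (present - joined).dt.days
elapsed_years = elapsed_days / 365.25   # approximate conversion to years

print(year_diff.tolist())
print(elapsed_days.tolist())
print(elapsed_years.round(2).tolist())
```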


In [130]:
# Extracting the non-numeric columns (text, categorical and datetime)
df[df.select_dtypes(exclude=['int', 'float']).columns]

Unnamed: 0,Name,Age,City,Joining Date,Cate,Age Category,Present
0,Alice,45,New York,2023-01-01,A,Youth,2024-09-27
1,Bob,58,London,2023-01-02,B,Adult,2024-09-28
2,Charlie,60,Tokyo,2023-01-03,C,Adult,2024-09-29
3,David,28,Paris,2023-01-04,A,Youth,2024-09-30
4,Roshan,49,Berlin,2023-01-05,B,Adult,2024-10-01
5,Frank,67,Paris,2023-01-06,A,Senior,2024-10-02
6,Grace,45,Berlin,2023-01-07,C,Youth,2024-10-03
7,Hannah,60,New York,2023-01-08,B,Adult,2024-10-04
8,Ivan,37,Tokyo,2023-01-09,A,Youth,2024-10-05
9,Jack,31,London,2023-01-10,B,Youth,2024-10-06


In [131]:
# Extracting the numeric columns
df[df.select_dtypes(include=['int', 'float']).columns]

Unnamed: 0,Salary,Rank,Year_of_Experience
0,95445.0,7.5,1
1,86235.0,5.0,1
2,62330.0,3.0,1
3,46022.0,12.0,1
4,65999.0,6.0,1
5,56000.0,1.0,1
6,57141.0,7.5,1
7,65413.0,3.0,1
8,87016.0,9.0,1
9,58725.0,10.0,1
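`select_dtypes` already returns a DataFrame, so it can be used directly, and it accepts generic selectors such as `"number"` and `"datetime"`. A small sketch on a hypothetical `mixed` frame:

```python
import pandas as pd

# Hypothetical frame with text, integer, float and datetime columns
mixed = pd.DataFrame(
    {
        "Name": ["Alice", "Bob"],
        "Age": [45, 58],
        "Salary": [95445.0, 86235.0],
        "Joined": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    }
)

# "number" covers every integer and float dtype in one selector
numeric_cols = mixed.select_dtypes(include="number").columns.tolist()

# "datetime" selects datetime64 columns
datetime_cols = mixed.select_dtypes(include="datetime").columns.tolist()

# Whatever is neither numeric nor datetime (here, the text column)
other_cols = mixed.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print(numeric_cols, datetime_cols, other_cols)
```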


In [133]:
# Frequency aliases: D = day, h = hour, ME = month end, YE = year end
# (pandas older than 2.2 spells the last three as H, M and Y)
date_range = pd.date_range(start='2024-09-27', periods=15, freq='h')
print(date_range)

DatetimeIndex(['2024-09-27 00:00:00', '2024-09-27 01:00:00',
               '2024-09-27 02:00:00', '2024-09-27 03:00:00',
               '2024-09-27 04:00:00', '2024-09-27 05:00:00',
               '2024-09-27 06:00:00', '2024-09-27 07:00:00',
               '2024-09-27 08:00:00', '2024-09-27 09:00:00',
               '2024-09-27 10:00:00', '2024-09-27 11:00:00',
               '2024-09-27 12:00:00', '2024-09-27 13:00:00',
               '2024-09-27 14:00:00'],
              dtype='datetime64[ns]', freq='h')


In [134]:
# Creating a DataFrame with dates as the index
data = {'value': [10, 20, 15, 25]}
index_dates = pd.date_range(start= '2023-08-01', periods=4, freq= 'D')
df1 = pd.DataFrame(data, index = index_dates)
print(df1)

            value
2023-08-01     10
2023-08-02     20
2023-08-03     15
2023-08-04     25
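A DatetimeIndex allows rows to be selected with date strings through `loc`. The sketch below rebuilds a frame like the one above under the hypothetical name `ts` and shows a date-range slice and a whole-month selection.

```python
import pandas as pd

# Rebuilt version of the date-indexed frame so this block runs on its own
ts = pd.DataFrame(
    {"value": [10, 20, 15, 25]},
    index=pd.date_range(start="2023-08-01", periods=4, freq="D"),
)

# Slicing by date strings is label-based, so both end points are included
middle = ts.loc["2023-08-02":"2023-08-03"]

# A partial date string selects every row that falls inside that month
august = ts.loc["2023-08"]

print(middle)
print(len(august))
```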


In [135]:
# Creating a DataFrame with hourly data
data = {'value': [10, 20, 15, 25, 30, 35]}
index_dates = pd.date_range(start='2023-08-01', periods=6, freq='h')   # 'h' replaces the deprecated 'H' alias in pandas 2.2+
df2 = pd.DataFrame(data, index = index_dates)
print(df2)

# Resample to daily frequency and calculate the mean for each day
daily_mean = df2.resample('D').mean()
print(daily_mean)

                     value
2023-08-01 00:00:00     10
2023-08-01 01:00:00     20
2023-08-01 02:00:00     15
2023-08-01 03:00:00     25
2023-08-01 04:00:00     30
2023-08-01 05:00:00     35
            value
2023-08-01   22.5
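Because the six hourly values above all fall on one day, the daily result has a single row. A sketch with two full days of hypothetical hourly data shows the grouping more clearly, along with several aggregations computed at once:

```python
import pandas as pd

# Hypothetical data: 48 hourly readings (0..47) covering two full days
hourly = pd.DataFrame(
    {"value": range(48)},
    index=pd.date_range(start="2023-08-01", periods=48, freq="h"),
)

# resample("D") groups rows into calendar days; agg computes several statistics per day
daily = hourly.resample("D")["value"].agg(["mean", "min", "max"])

print(daily)
```

Each output row is labelled with the start of its day, and the column names come from the list passed to `agg`.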



#### **Pivot Table**

In [136]:
df

Unnamed: 0,Name,Age,City,Salary,Joining Date,Cate,Rank,Age Category,Present,Year_of_Experience
0,Alice,45,New York,95445.0,2023-01-01,A,7.5,Youth,2024-09-27,1
1,Bob,58,London,86235.0,2023-01-02,B,5.0,Adult,2024-09-28,1
2,Charlie,60,Tokyo,62330.0,2023-01-03,C,3.0,Adult,2024-09-29,1
3,David,28,Paris,46022.0,2023-01-04,A,12.0,Youth,2024-09-30,1
4,Roshan,49,Berlin,65999.0,2023-01-05,B,6.0,Adult,2024-10-01,1
5,Frank,67,Paris,56000.0,2023-01-06,A,1.0,Senior,2024-10-02,1
6,Grace,45,Berlin,57141.0,2023-01-07,C,7.5,Youth,2024-10-03,1
7,Hannah,60,New York,65413.0,2023-01-08,B,3.0,Adult,2024-10-04,1
8,Ivan,37,Tokyo,87016.0,2023-01-09,A,9.0,Youth,2024-10-05,1
9,Jack,31,London,58725.0,2023-01-10,B,10.0,Youth,2024-10-06,1


In [138]:
df.pivot_table(index='Name', columns='Cate', values='Salary', aggfunc='sum')

Cate,A,B,C
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alice,95445.0,,
Bob,,86235.0,
Charlie,,,62330.0
David,46022.0,,
Frank,56000.0,,
Grace,,57141.0,57141.0
Hannah,,135838.230769,
Ivan,87016.0,,
Jack,,58725.0,
Roshan,,85999.0,
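With one row per name most cells in the table above are empty, so the aggregation is hard to see. The sketch below uses a hypothetical `sales` frame with repeated groups, a different `aggfunc`, and `fill_value` to replace the empty cells.

```python
import pandas as pd

# Hypothetical data with repeated (City, Cate) combinations
sales = pd.DataFrame(
    {
        "City": ["Paris", "Paris", "Tokyo", "Tokyo", "Paris"],
        "Cate": ["A", "B", "A", "A", "A"],
        "Salary": [100, 200, 300, 500, 300],
    }
)

# One row per City, one column per Cate; each cell is the mean Salary of that group.
# fill_value=0 replaces combinations that never occur (Tokyo / B) instead of NaN.
summary = sales.pivot_table(
    index="City", columns="Cate", values="Salary", aggfunc="mean", fill_value=0
)

print(summary)
```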


In [139]:
cross_tab = pd.crosstab(index=[df['Name']],   # First factor to compare
                        columns=df['Cate'])   # The factor we want as our column
cross_tab

Cate,A,B,C
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alice,1,0,0
Bob,0,1,0
Charlie,0,0,1
David,1,0,0
Frank,1,0,0
Grace,0,1,1
Hannah,0,2,0
Ivan,1,0,0
Jack,0,1,0
Roshan,0,2,0
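`pd.crosstab` counts how often each combination occurs, and `margins=True` adds row and column totals. A short sketch on a hypothetical `sales` frame:

```python
import pandas as pd

# Hypothetical data with repeated (City, Cate) combinations
sales = pd.DataFrame(
    {
        "City": ["Paris", "Paris", "Tokyo", "Tokyo", "Paris"],
        "Cate": ["A", "B", "A", "A", "A"],
    }
)

# Each cell is the number of rows with that City and Cate;
# margins=True appends an "All" row and column holding the totals
counts = pd.crosstab(index=sales["City"], columns=sales["Cate"], margins=True)

print(counts)
```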


In [140]:
# Shifting the data down by two rows; the index stays in place and the first two rows become missing
df_shifted = df.shift(2)
df_shifted

Unnamed: 0,Name,Age,City,Salary,Joining Date,Cate,Rank,Age Category,Present,Year_of_Experience
0,,,,,NaT,,,,NaT,
1,,,,,NaT,,,,NaT,
2,Alice,45.0,New York,95445.0,2023-01-01,A,7.5,Youth,2024-09-27,1.0
3,Bob,58.0,London,86235.0,2023-01-02,B,5.0,Adult,2024-09-28,1.0
4,Charlie,60.0,Tokyo,62330.0,2023-01-03,C,3.0,Adult,2024-09-29,1.0
5,David,28.0,Paris,46022.0,2023-01-04,A,12.0,Youth,2024-09-30,1.0
6,Roshan,49.0,Berlin,65999.0,2023-01-05,B,6.0,Adult,2024-10-01,1.0
7,Frank,67.0,Paris,56000.0,2023-01-06,A,1.0,Senior,2024-10-02,1.0
8,Grace,45.0,Berlin,57141.0,2023-01-07,C,7.5,Youth,2024-10-03,1.0
9,Hannah,60.0,New York,65413.0,2023-01-08,B,3.0,Adult,2024-10-04,1.0
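A common use of `shift` on time series is comparing each value with the previous one. Passing `freq` moves the index labels instead of the data. Both are sketched below on a hypothetical date-indexed frame `ts`.

```python
import pandas as pd

# Hypothetical daily series used only for this illustration
ts = pd.DataFrame(
    {"value": [10, 20, 15, 25]},
    index=pd.date_range(start="2023-08-01", periods=4, freq="D"),
)

# shift(1) moves the data down one row, so each row sees the previous day's value
ts["previous"] = ts["value"].shift(1)
ts["change"] = ts["value"] - ts["previous"]

# With freq, the index is moved forward one day and no values are lost
moved_index = ts[["value"]].shift(1, freq="D")

print(ts)
print(moved_index)
```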


In [141]:
# Ordinal encoding with map and replace
data = pd.DataFrame({'Grade': ['A', 'B', 'C', 'A', 'D']})

grade_mapping = {'A': 3, 'B': 2, 'C': 1, 'D': 0}
data['Grade_ordinalEncoded'] = data['Grade'].map(grade_mapping)
data['Grade1'] = data['Grade'].replace({'A': 3, 'B': 2, 'C': 1, 'D': 0})
print(data)

  Grade  Grade_ordinalEncoded  Grade1
0     A                     3       3
1     B                     2       2
2     C                     1       1
3     A                     3       3
4     D                     0       0
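`map` and `replace` give identical results above because every grade has an entry in the dictionary. They differ when a value is missing from the mapping, as this sketch on a hypothetical `grades` Series shows:

```python
import pandas as pd

# Hypothetical grades; "E" has no entry in the mapping below
grades = pd.Series(["A", "B", "E"], dtype=object)
partial_mapping = {"A": 3, "B": 2}

# map: values without a dictionary entry become NaN
mapped = grades.map(partial_mapping)

# replace: values without a dictionary entry are left unchanged
replaced = grades.replace(partial_mapping)

print(mapped.tolist())
print(replaced.tolist())
```

For encoding, `map` makes unexpected categories show up as missing values, while `replace` silently keeps the original label.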


In [142]:
# Multi Index
branch_df2 = pd.DataFrame(
    [
        [1, 2, 0, 0],
        [3, 4, 0, 0],
        [5, 6, 0, 0],
        [7, 8, 0, 0],
    ],
    index=[2019, 2020, 2021, 2022],
    columns=pd.MultiIndex.from_product([['Delhi', 'Mumbai'], ['Avg_Package', 'Students']])
)
branch_df2

Unnamed: 0_level_0,Delhi,Delhi,Mumbai,Mumbai
Unnamed: 0_level_1,Avg_Package,Students,Avg_Package,Students
2019,1,2,0,0
2020,3,4,0,0
2021,5,6,0,0
2022,7,8,0,0


In [143]:
branch_df2.stack()

Unnamed: 0,Unnamed: 1,Delhi,Mumbai
2019,Avg_Package,1,0
2019,Students,2,0
2020,Avg_Package,3,0
2020,Students,4,0
2021,Avg_Package,5,0
2021,Students,6,0
2022,Avg_Package,7,0
2022,Students,8,0
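Beyond `stack`, data under MultiIndex columns can be selected by level. The sketch below rebuilds a smaller frame of the same shape under the hypothetical name `branches` and shows selection by the outer level, a cross-section on the inner level, and lookup in the stacked result.

```python
import pandas as pd

# Smaller rebuilt frame with the same two-level column layout
branches = pd.DataFrame(
    [[1, 2, 0, 0], [3, 4, 0, 0]],
    index=[2019, 2020],
    columns=pd.MultiIndex.from_product(
        [["Delhi", "Mumbai"], ["Avg_Package", "Students"]]
    ),
)

# Selecting an outer-level label returns the sub-frame beneath it
delhi = branches["Delhi"]

# xs takes a cross-section on the inner level across all cities
students = branches.xs("Students", axis=1, level=1)

# stack moves the inner column level into the row index: rows become (year, metric)
stacked = branches.stack()

print(delhi)
print(students)
print(stacked.loc[(2019, "Students"), "Delhi"])
```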


#### **Calendar**

In [146]:
import calendar 
print(calendar.month(2024, 2))

   February 2024
Mo Tu We Th Fr Sa Su
          1  2  3  4
 5  6  7  8  9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29
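The `calendar` module can also return this information as values rather than printed text. A brief sketch using standard-library functions for the same month:

```python
import calendar

# monthrange returns (weekday of the first day, number of days); Monday is 0
first_weekday, days_in_month = calendar.monthrange(2024, 2)

# isleap reports whether a year has 29 days in February
leap = calendar.isleap(2024)

# monthcalendar returns the weeks as lists of day numbers, with 0 for padding days
weeks = calendar.monthcalendar(2024, 2)

print(first_weekday, days_in_month, leap)
print(weeks[0])
```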



In [147]:
calendar.prcal(2024)

                                  2024

      January                   February                   March
Mo Tu We Th Fr Sa Su      Mo Tu We Th Fr Sa Su      Mo Tu We Th Fr Sa Su
 1  2  3  4  5  6  7                1  2  3  4                   1  2  3
 8  9 10 11 12 13 14       5  6  7  8  9 10 11       4  5  6  7  8  9 10
15 16 17 18 19 20 21      12 13 14 15 16 17 18      11 12 13 14 15 16 17
22 23 24 25 26 27 28      19 20 21 22 23 24 25      18 19 20 21 22 23 24
29 30 31                  26 27 28 29               25 26 27 28 29 30 31

       April                      May                       June
Mo Tu We Th Fr Sa Su      Mo Tu We Th Fr Sa Su      Mo Tu We Th Fr Sa Su
 1  2  3  4  5  6  7             1  2  3  4  5                      1  2
 8  9 10 11 12 13 14       6  7  8  9 10 11 12       3  4  5  6  7  8  9
15 16 17 18 19 20 21      13 14 15 16 17 18 19      10 11 12 13 14 15 16
22 23 24 25 26 27 28      20 21 22 23 24 25 26      17 18 19 20 21 22 23
29 30                     