# Agenda
- Fundamentals of Pandas
  - Purpose
  - Features
- Data Structures
- Introduction to Series
  - Creating and Accessing Pandas Series using different methods
  - Basic Information on Pandas Series
  - Operations and Transformations
  - Querying in Series
- Introduction to DataFrame
  - Creating a DataFrame using different methods
  - Accessing DataFrame
  - Understanding DataFrame Basics
- Introduction to Statistical Operations in Pandas
  - Descriptive Statistics
  - Mean, Median and Standard Deviation
  - Correlation Analysis
- Date and TimeDelta in Pandas
  - Date Handling
  - Time Delta
    - Creating TimeDelta
    - Performing Arithmetic Ops on Date and Time using TimeDelta
    - Resampling Time Series
- Categorical Data Handling
  - Creating a Categorical Variable
  - Counting Occurrences
  - Creating Dummy Variables
  - Label Encoding
- Handling Text Data
- Iteration
  - Iterating Over Rows
  - Apply function
  - Vectorized Operations
  - Iterating over a Series
- Sorting
  - Sorting DataFrame by Column
  - Sorting DataFrame by Multiple Columns
  - Sorting DataFrame by index
  - Sorting Series
- Plotting with Pandas

## Fundamentals 
Pandas is an open-source library built on top of NumPy and is used for data manipulation, data analysis, data cleaning, and data visualization. The name `pandas` is derived from the term "panel data". The library introduces data structures such as Series and DataFrame, which make working with structured data more efficient.
### __Purpose of Pandas__
![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Purpose_of_Pandas.png)

### __Features of Pandas__
![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Updated_Images/Lesson_4/4_01/Features_of_Pandas.png)


## Data Structures
There are two main data structures in pandas: Series and DataFrame.

![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Data_Structures.png)


## Introduction to Series

A pandas Series is a one-dimensional, array-like structure containing a sequence of values and their respective labels (the index).


It can be created with different data inputs:
![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Updated_Images/Lesson_4/4_01/Introduction_to_Series.png)

## Create and Access Series using different methods

To create a Series, pandas provides the `Series` class, which accepts the data and an optional list of index labels. Syntactically:<br>
$$\text{seriesname = pandas.Series(sequence, index = [list of indices])}$$

In [1]:
# Install Pandas 
!pip install pandas



In [2]:
# Import pandas and numpy
import numpy as np
import pandas as pd


In [3]:
# Creating Pandas Series using List
data =  [1,2,3,4,5]
sr = pd.Series(data)

# Create a series with user defined index using list
idx = ['a', 'b', 'c', 'd', 'e']
sr_with_index = pd.Series(data, index =  idx)


In [4]:
print(sr)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [5]:
print(sr_with_index)

a    1
b    2
c    3
d    4
e    5
dtype: int64


In [6]:
arr =  np.array(data)
print(arr)

[1 2 3 4 5]
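
A Series can also be built directly from a NumPy array or from a single scalar value. The following is a minimal sketch; the names `sr_from_arr` and `sr_from_scalar` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4, 5])

# A Series built from a NumPy array gets the default 0..n-1 index
sr_from_arr = pd.Series(arr)

# A scalar is repeated once for every label in the supplied index
sr_from_scalar = pd.Series(7, index=['a', 'b', 'c'])

print(sr_from_arr)
print(sr_from_scalar)
```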


In [7]:
print(sr[4])
print(sr_with_index['e'])

5
5


In [8]:
# Create a Series using Dictionary
sr_with_dict = pd.Series({'Shirts':2, 'Trousers':1, 'Shoes':1, 'Watches':2, 'Wallets':3, 'TShirt': 3, 
                          'Belts':2, 'Socks':2})

In [9]:
print(sr_with_dict)

Shirts      2
Trousers    1
Shoes       1
Watches     2
Wallets     3
TShirt      3
Belts       2
Socks       2
dtype: int64


In [10]:
sr_with_dict['Shirts']

2

In [11]:
# Create a Series using Dictionary with user defined index
dict1 = {'Shirts':2, 'Trousers':1, 'Shoes':1, 'Watches':2, 'Wallets':3, 'TShirt': 3, 'Belts':2, 'Socks':2}
sr_with_dict_idx = pd.Series(dict1, index = ['Shirts', 'Trousers', 'Shoes', 'Watches', 
                                                           'Wallets', 'TShirt', 'Socks', 'Perfume'])

In [12]:
sr_with_dict_idx

Shirts      2.0
Trousers    1.0
Shoes       1.0
Watches     2.0
Wallets     3.0
TShirt      3.0
Socks       2.0
Perfume     NaN
dtype: float64
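
The output above shows two effects of passing both a dictionary and an `index`: keys that are not in the index (`Belts`) are dropped, and index labels that are not in the dictionary (`Perfume`) get `NaN`, which also turns the dtype into `float64`. A smaller sketch of the same behaviour, using made-up names `stock` and `labels`:

```python
import pandas as pd

stock = {'Shirts': 2, 'Belts': 2, 'Socks': 2}
labels = ['Shirts', 'Socks', 'Perfume']

# Only the labels listed in `index` are kept, in that order
sr_aligned = pd.Series(stock, index=labels)
print(sr_aligned)

print('Belts' in sr_aligned.index)    # key not requested in the index
print(sr_aligned.isna())              # Perfume has no value in the dict
```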

## Basic Information in pandas Series
These attributes and methods collectively help in summarizing and understanding the characteristics of the data, facilitating effective data exploration and analysis.

In [13]:
# display the first n rows -  head()
first_n_rows =  sr_with_dict_idx.head(3) # default count is 5
first_n_rows

Shirts      2.0
Trousers    1.0
Shoes       1.0
dtype: float64

In [14]:
# display the last n rows -  tail()
last_n_rows =  sr_with_dict_idx.tail(3) # default count is 5
last_n_rows

TShirt     3.0
Socks      2.0
Perfume    NaN
dtype: float64

In [15]:
# general summary - info()
sr_with_dict_idx.info()

<class 'pandas.core.series.Series'>
Index: 8 entries, Shirts to Perfume
Series name: None
Non-Null Count  Dtype  
--------------  -----  
7 non-null      float64
dtypes: float64(1)
memory usage: 428.0+ bytes


In [16]:
print(type(sr_with_dict_idx))
print(sr_with_dict_idx.index)
print(len(sr_with_dict_idx.index))
print(sr_with_dict_idx.name)
print(sr_with_dict_idx.count())
print(sr_with_dict_idx.dtype)
print(sr_with_dict_idx.memory_usage())

<class 'pandas.core.series.Series'>
Index(['Shirts', 'Trousers', 'Shoes', 'Watches', 'Wallets', 'TShirt', 'Socks',
       'Perfume'],
      dtype='object')
8
None
7
float64
428


In [17]:
# extract index
print(sr_with_dict_idx.index)

Index(['Shirts', 'Trousers', 'Shoes', 'Watches', 'Wallets', 'TShirt', 'Socks',
       'Perfume'],
      dtype='object')


In [22]:
# extract values
print(sr_with_dict_idx.values)

[ 2.  1.  1.  2.  3.  3.  2. nan]


In [19]:
# num of dimensions
print(sr_with_dict_idx.ndim)

1


In [20]:
# shape
print(sr_with_dict_idx.shape)

(8,)


In [21]:
# size
print(sr_with_dict_idx.size)

8


# Date - 21-03-2025

In [23]:
# Statistical Summary - describe()
sr_with_dict_idx.describe()

count    7.000000
mean     2.000000
std      0.816497
min      1.000000
25%      1.500000
50%      2.000000
75%      2.500000
max      3.000000
dtype: float64

In [24]:
# get unique values -  unique()
sr_with_dict_idx.unique()

array([ 2.,  1.,  3., nan])

In [25]:
# get the count of unique values -  nunique()
sr_with_dict_idx.nunique()


3
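
Note that `unique()` lists `NaN` among its results while `nunique()` returns 3, because `nunique()` ignores missing values by default. The sketch below, on a small stand-in Series named `sr_counts`, shows how to include `NaN` in the count and how `value_counts()` gives the frequency of each value.

```python
import numpy as np
import pandas as pd

sr_counts = pd.Series([2.0, 1.0, 1.0, 2.0, 3.0, 3.0, 2.0, np.nan])

n_without_nan = sr_counts.nunique()             # NaN excluded by default
n_with_nan = sr_counts.nunique(dropna=False)    # NaN counted as a distinct value
freq = sr_counts.value_counts()                 # frequency of each non-null value

print(n_without_nan, n_with_nan)
print(freq)
```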

## Operations and Transformations in Series

In [49]:
sr1 = pd.Series([1,2,3,4,5])
sr2 = pd.Series([2,5,1,6,2], index = ['Pens', 'Pencils', 'Eraser', 'Sharpner', 'Divider'])

In [50]:
# Element-wise addition - values are added where the index labels match in both Series
# Labels present in only one Series yield NaN (here no labels match, so every result is NaN)

rs =  sr1 + sr2
print(rs)

0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
Divider    NaN
Eraser     NaN
Pencils    NaN
Pens       NaN
Sharpner   NaN
dtype: float64
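
Every result above is `NaN` because `sr1` and `sr2` share no index labels. As a rough illustration of what happens when some labels do overlap, the sketch below uses two made-up Series `a` and `b`, and `Series.add` with `fill_value` to treat a missing label as 0.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['Pens', 'Pencils', 'Eraser'])
b = pd.Series([10, 20, 30], index=['Pencils', 'Eraser', 'Divider'])

# Labels found in only one Series give NaN
total = a + b

# fill_value=0 substitutes 0 for a label missing on one side
total_filled = a.add(b, fill_value=0)

print(total)
print(total_filled)
```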


In [51]:
# Apply a function to a Series - apply() - allows to apply any function to each value in the series 
res2 =  sr1.apply(lambda x: x**2 + x +1)
print(res2)

0     3
1     7
2    13
3    21
4    31
dtype: int64


In [52]:
# map values using a dictionary -  map()
print(sr2)
mapped_series =  sr2.map({1:'one', 2:'two', 5:'five', 6:'six'})
print(mapped_series)

Pens        2
Pencils     5
Eraser      1
Sharpner    6
Divider     2
dtype: int64
Pens         two
Pencils     five
Eraser       one
Sharpner     six
Divider      two
dtype: object


In [53]:
# map values using a dictionary -  map(); values missing from the dictionary (here 6) become NaN
print(sr2)
mapped_series2 =  sr2.map({1:'one', 2:'two', 5:'five'})
print(mapped_series2)


Pens        2
Pencils     5
Eraser      1
Sharpner    6
Divider     2
dtype: int64
Pens         two
Pencils     five
Eraser       one
Sharpner     NaN
Divider      two
dtype: object
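
When a value has no entry in the mapping dictionary, `map()` returns `NaN` for it, as with `Sharpner` above. One possible way to supply a fallback is to pass a function instead of a dictionary. This is a sketch only; `qty`, `words`, and the fallback string `'other'` are illustrative choices.

```python
import pandas as pd

qty = pd.Series([2, 5, 1, 6], index=['Pens', 'Pencils', 'Eraser', 'Sharpner'])
words = {1: 'one', 2: 'two', 5: 'five'}   # 6 is deliberately absent

# Dictionary mapping: 6 has no entry, so it becomes NaN
mapped = qty.map(words)

# Function mapping: dict.get supplies a fallback for unmapped values
mapped_with_default = qty.map(lambda v: words.get(v, 'other'))

print(mapped)
print(mapped_with_default)
```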


In [54]:
# replace values using a dictionary -  replace()
print(sr2)
rp_series =  sr2.replace({1:'one', 2:'two'})
print(rp_series)

Pens        2
Pencils     5
Eraser      1
Sharpner    6
Divider     2
dtype: int64
Pens        two
Pencils       5
Eraser      one
Sharpner      6
Divider     two
dtype: object


In [55]:
# Sort Series by value - sort_values()

sr2_sorted = sr2.sort_values()
print(sr2_sorted)

Eraser      1
Pens        2
Divider     2
Pencils     5
Sharpner    6
dtype: int64


In [63]:
# Sort Series by index - sort_index()

sr2_sorted_index = sr2_sorted.sort_index()
print(sr2_sorted_index)

Divider     2
Eraser      1
Pencils     5
Pens        2
Sharpner    6
dtype: int64


In [66]:
# Sort Series by value - sort_values() - change order to descending

sr2_sorted_desc = sr2.sort_values(ascending = False)
print(sr2_sorted_desc)

Sharpner    6
Pencils     5
Pens        2
Divider     2
Eraser      1
dtype: int64


In [67]:
# Sort Series by index - sort_index() - change order to descending

sr2_sorted_index_desc = sr2_sorted.sort_index(ascending = False)
print(sr2_sorted_index_desc)

Sharpner    6
Pens        2
Pencils     5
Eraser      1
Divider     2
dtype: int64


In [68]:
# Check for missing values - isnull(), isna()
print(sr_with_dict_idx)

Shirts      2.0
Trousers    1.0
Shoes       1.0
Watches     2.0
Wallets     3.0
TShirt      3.0
Socks       2.0
Perfume     NaN
dtype: float64


In [70]:
sr_with_dict_idx.isnull()  # Display True for Null Values

Shirts      False
Trousers    False
Shoes       False
Watches     False
Wallets     False
TShirt      False
Socks       False
Perfume      True
dtype: bool

In [71]:
sr_with_dict_idx.isnull().sum()

1

In [77]:
~sr_with_dict_idx.isnull()

Shirts       True
Trousers     True
Shoes        True
Watches      True
Wallets      True
TShirt       True
Socks        True
Perfume     False
dtype: bool

In [78]:
sr_with_dict_idx.count()

7

In [79]:
sr_with_dict_idx.notna() # Gives False if the value is missing (NaN)

Shirts       True
Trousers     True
Shoes        True
Watches      True
Wallets      True
TShirt       True
Socks        True
Perfume     False
dtype: bool

In [82]:
# Fill the missing with specified value -  fillna()
print(sr_with_dict_idx)
sr_with_dict_idx_filled =  sr_with_dict_idx.fillna('Missing Value')
print(sr_with_dict_idx_filled)

Shirts      2.0
Trousers    1.0
Shoes       1.0
Watches     2.0
Wallets     3.0
TShirt      3.0
Socks       2.0
Perfume     NaN
dtype: float64
Shirts                2.0
Trousers              1.0
Shoes                 1.0
Watches               2.0
Wallets               3.0
TShirt                3.0
Socks                 2.0
Perfume     Missing Value
dtype: object
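
Filling with a string changes the dtype to `object`, as the output above shows. If the Series should stay numeric, a numeric fill value can be used instead. A minimal sketch on a stand-in Series `sr_gap` (not part of the notebook above):

```python
import numpy as np
import pandas as pd

sr_gap = pd.Series([2.0, 1.0, 3.0, np.nan],
                   index=['Shirts', 'Shoes', 'Wallets', 'Perfume'])

filled_zero = sr_gap.fillna(0)               # replace NaN with a constant
filled_mean = sr_gap.fillna(sr_gap.mean())   # mean() skips NaN by default

print(filled_zero)
print(filled_mean)
print(sr_gap.isna().sum())   # fillna returns a new Series; the original is unchanged
```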


In [83]:
# Querying on Series
print(sr2)

Pens        2
Pencils     5
Eraser      1
Sharpner    6
Divider     2
dtype: int64


In [99]:
# select elements greater than 4
print(sr2[sr2 >4])
# select elements equal to 2
print(sr2[sr2==2])
# select elements not equal to 5
print(sr2[sr2!=5])
# select elements greater than 2 less than 6
print(sr2[(sr2>2) & (sr2<6)])

Pencils     5
Sharpner    6
dtype: int64
Pens       2
Divider    2
dtype: int64
Pens        2
Eraser      1
Sharpner    6
Divider     2
dtype: int64
Pencils    5
dtype: int64
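
Conditions can also be combined with `|` (or) and negated with `~` (not), and `between()` is a shorthand for an inclusive range check. The sketch below rebuilds the same data under the illustrative name `sr_q`.

```python
import pandas as pd

sr_q = pd.Series([2, 5, 1, 6, 2],
                 index=['Pens', 'Pencils', 'Eraser', 'Sharpner', 'Divider'])

low_or_high = sr_q[(sr_q < 2) | (sr_q > 5)]   # OR: either condition holds
not_two = sr_q[~(sr_q == 2)]                  # NOT: invert a boolean mask
in_range = sr_q[sr_q.between(2, 5)]           # both endpoints included by default

print(low_or_high)
print(not_two)
print(in_range)
```

Each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than comparison operators in Python.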


In [100]:
print(sr2)

Pens        2
Pencils     5
Eraser      1
Sharpner    6
Divider     2
dtype: int64


In [102]:
l1 =  [2,3,4,5]
# select elements from series based on a list of values - isin()

print(sr2[sr2.isin(l1)])

Pens       2
Pencils    5
Divider    2
dtype: int64


In [109]:
# select elements using string methods
st_sr =  pd.Series(['Elizabeth', 'Gowtham', 'Jyoti','Penubala', 'Rakesh' ])
print(st_sr)
print(st_sr[st_sr.str.contains('a')])

0    Elizabeth
1      Gowtham
2        Jyoti
3     Penubala
4       Rakesh
dtype: object
0    Elizabeth
1      Gowtham
3     Penubala
4       Rakesh
dtype: object


In [118]:
# Query based on index labels/Numeric Position
print(sr2)
label = ['Pens', 'Eraser', 'Sharpner']
print(sr2.loc[label]) # Selects data points based on user labels

print(sr2.iloc[[1,3,4]]) # Selects data points based on numerical index

print(sr2.iloc[1:4:1])

Pens        2
Pencils     5
Eraser      1
Sharpner    6
Divider     2
dtype: int64
Pens        2
Eraser      1
Sharpner    6
dtype: int64
Pencils     5
Sharpner    6
Divider     2
dtype: int64
Pencils     5
Eraser      1
Sharpner    6
dtype: int64
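
One difference worth noting: a slice with `loc` includes the end label, while a slice with `iloc` excludes the end position, like ordinary Python slicing. A short sketch on the same data, using the illustrative name `sr_s`:

```python
import pandas as pd

sr_s = pd.Series([2, 5, 1, 6, 2],
                 index=['Pens', 'Pencils', 'Eraser', 'Sharpner', 'Divider'])

by_label = sr_s.loc['Pencils':'Sharpner']   # end label 'Sharpner' is included
by_position = sr_s.iloc[1:3]                # position 3 is excluded

print(by_label)
print(by_position)
```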


## Introduction to DataFrame
A Pandas DataFrame is a two-dimensional, tabular data structure with labeled axes (rows and columns).
It is a primary data structure in the Pandas library, providing a versatile and efficient way to handle and manipulate data in Python.
![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/2_Introduction_to_DataFrame/Introduction_to_Pandas_DataFrame.png)


### __Key Features:__
- __Tabular structure:__ The DataFrame is organized as a table with rows and columns, similar to a spreadsheet or SQL table.
- __Labeled axes:__ Both rows and columns are labeled, allowing for easy indexing and referencing of data.
- __Heterogeneous data types:__ Each column in a DataFrame can contain different types of data, such as integers, floats, strings, or even complex objects.
- __Versatility:__ DataFrames can store and handle a wide range of data formats, including CSV, Excel, SQL databases, and more.
- __Data alignment:__ Operations on DataFrames are designed to handle missing values gracefully, aligning data based on labels.
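
Two of the features above, heterogeneous column types and label-based alignment, can be sketched with small made-up frames (`df_people`, `left`, and `right` are illustrative names):

```python
import pandas as pd

# Each column keeps its own dtype
df_people = pd.DataFrame({'Name': ['Alice', 'Bob'],
                          'Age': [21, 25],
                          'Height': [1.65, 1.80]})
print(df_people.dtypes)

# Arithmetic aligns on row labels; labels present on one side only give NaN
left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'x': [10, 20]}, index=['b', 'c'])
summed = left + right
print(summed)
```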

### __Creating a DataFrame Using Different Methods__
Creating a Pandas DataFrame is a fundamental step in data analysis and manipulation.
- Diverse methods are available within Pandas to generate a DataFrame, addressing various data sources and structures.
- Data, whether in Python dictionaries, lists, NumPy arrays, or external files such as CSV and Excel, can be seamlessly transformed into a structured tabular format by Pandas.

In [119]:
# import libraries

import pandas as pd
import numpy as np

Pandas offers a reader function for each file type. For example, use `read_csv` for CSV files, `read_excel` for Excel files, and `read_json` for JSON files.
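
To try `read_csv` without a file on disk, the CSV text can be wrapped in `io.StringIO`. This is a sketch under that assumption; the sample data in `csv_text` is made up, and `index_col=0` is shown as one common option for using the first column as row labels.

```python
import io
import pandas as pd

csv_text = "name,age\nAlice,21\nBob,25\n"

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row labels
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_csv)
print(df_indexed)
```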

In [121]:
# create dataframe by reading csv file
df = pd.read_csv('iris.csv')

In [125]:
df

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica


The syntax to create a DataFrame using the `pd.DataFrame` constructor is:
$$\text{pd.DataFrame(data, columns = [list of column names], index = [list of index])}$$

In [129]:
# Create a DataFrame using Dictionary

data_dict = {'Name': ['Alice', 'Bob','Charlie'],
            'Age':[21, 25, 18], 
            'Salary':[50000, 60000, 30000]}

df_dict =  pd.DataFrame(data_dict)
df_dict

Unnamed: 0,Name,Age,Salary
0,Alice,21,50000
1,Bob,25,60000
2,Charlie,18,30000


In [131]:
# Create a DataFrame using List

data_list = [['Alice', 21 ,50000],['Bob', 25, 60000], ['Charlie', 18, 30000]]
cols =  ['Name', 'Age', 'Salary']

df_dict =  pd.DataFrame(data_list, columns =  cols)
df_dict

Unnamed: 0,Name,Age,Salary
0,Alice,21,50000
1,Bob,25,60000
2,Charlie,18,30000


In [132]:
# Create a DataFrame using List with index 

data_list = [['Alice', 21 ,50000],['Bob', 25, 60000], ['Charlie', 18, 30000]]
cols =  ['Name', 'Age', 'Salary']
idx =  ['Emp1', 'Emp2', 'Emp3']

df_dict =  pd.DataFrame(data_list, columns =  cols, index = idx)
df_dict

Unnamed: 0,Name,Age,Salary
Emp1,Alice,21,50000
Emp2,Bob,25,60000
Emp3,Charlie,18,30000
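
A two-dimensional NumPy array works as input in the same way as a list of lists. A minimal sketch, with illustrative names `values` and `df_arr`:

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Each inner row of the array becomes one row of the DataFrame
df_arr = pd.DataFrame(values, columns=['A', 'B', 'C'], index=['r1', 'r2'])
print(df_arr)
```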


In [134]:
house =  pd.read_csv('HousePrices.csv')

In [135]:
house

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country
0,2014-05-02 00:00:00,3.130000e+05,3.0,1.50,1340,7912,1.5,0,0,3,1340,0,1955,2005,18810 Densmore Ave N,Shoreline,WA 98133,USA
1,2014-05-02 00:00:00,2.384000e+06,5.0,2.50,3650,9050,2.0,0,4,5,3370,280,1921,0,709 W Blaine St,Seattle,WA 98119,USA
2,2014-05-02 00:00:00,3.420000e+05,3.0,2.00,1930,11947,1.0,0,0,4,1930,0,1966,0,26206-26214 143rd Ave SE,Kent,WA 98042,USA
3,2014-05-02 00:00:00,4.200000e+05,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0,857 170th Pl NE,Bellevue,WA 98008,USA
4,2014-05-02 00:00:00,5.500000e+05,4.0,2.50,1940,10500,1.0,0,0,4,1140,800,1976,1992,9105 170th Ave NE,Redmond,WA 98052,USA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,2014-07-09 00:00:00,3.081667e+05,3.0,1.75,1510,6360,1.0,0,0,4,1510,0,1954,1979,501 N 143rd St,Seattle,WA 98133,USA
4596,2014-07-09 00:00:00,5.343333e+05,3.0,2.50,1460,7573,2.0,0,0,3,1460,0,1983,2009,14855 SE 10th Pl,Bellevue,WA 98007,USA
4597,2014-07-09 00:00:00,4.169042e+05,3.0,2.50,3010,7014,2.0,0,0,3,3010,0,2009,0,759 Ilwaco Pl NE,Renton,WA 98059,USA
4598,2014-07-10 00:00:00,2.034000e+05,4.0,2.00,2090,6630,1.0,0,0,3,1070,1020,1974,0,5148 S Creston St,Seattle,WA 98178,USA


In [136]:
data = {'Column_name': [5, 15, 8],
        'Column1': [10, 20, 30],
        'Column2': [100, 200, 300],
        'Another_column': [25, 35, 45]}
df =  pd.DataFrame(data)

In [137]:
df

Unnamed: 0,Column_name,Column1,Column2,Another_column
0,5,10,100,25
1,15,20,200,35
2,8,30,300,45


In [140]:
# Accessing a single Column
df['Another_column']

0    25
1    35
2    45
Name: Another_column, dtype: int64

In [142]:
# Accessing multiple columns
df[['Column1', 'Column2']]

Unnamed: 0,Column1,Column2
0,10,100
1,20,200
2,30,300
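
The bracket style determines the return type: a single label returns a Series, while a list of labels returns a DataFrame, even when the list has one element. A short sketch with an illustrative frame `df_cols`:

```python
import pandas as pd

df_cols = pd.DataFrame({'Column1': [10, 20, 30],
                        'Column2': [100, 200, 300]})

as_series = df_cols['Column1']     # single label -> Series
as_frame = df_cols[['Column1']]    # list of labels -> DataFrame

print(type(as_series), as_series.shape)
print(type(as_frame), as_frame.shape)
```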


In [143]:
# Accessing a row by integer position
df.iloc[0]

Column_name         5
Column1            10
Column2           100
Another_column     25
Name: 0, dtype: int64

In [145]:
# Accessing rows based on a condition

df[df['Column_name']>10]

Unnamed: 0,Column_name,Column1,Column2,Another_column
1,15,20,200,35
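
Several conditions can be combined in one filter with `&` and `|`, each wrapped in parentheses. `DataFrame.query` expresses the same filter as a string. The sketch below uses a reduced copy of the data under the illustrative name `df_f`.

```python
import pandas as pd

df_f = pd.DataFrame({'Column_name': [5, 15, 8],
                     'Column1': [10, 20, 30]})

# Boolean masks combined with & (and)
both = df_f[(df_f['Column_name'] > 6) & (df_f['Column1'] < 30)]

# The same filter written as a query string
same_with_query = df_f.query('Column_name > 6 and Column1 < 30')

print(both)
print(same_with_query)
```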


In [147]:
# Accessing a single value

df.loc[2, 'Column2']

300

In [148]:
df.iloc[2, 2]

300

In [152]:
df.loc[0:1:1, ['Column1', 'Column2']]

Unnamed: 0,Column1,Column2
0,10,100
1,20,200


In [153]:
df.iloc[0:2:1,1:3:1]

Unnamed: 0,Column1,Column2
0,10,100
1,20,200
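
The two slices above return the same rows only because the stop values differ: `loc[0:1]` includes label 1, whereas `iloc[0:2]` stops before position 2. A sketch that makes the difference visible, using an illustrative frame `df_sl`; `at` is shown as an alternative accessor for a single value.

```python
import pandas as pd

df_sl = pd.DataFrame({'Column1': [10, 20, 30],
                      'Column2': [100, 200, 300]})

by_label = df_sl.loc[0:1, 'Column1':'Column2']   # rows 0 and 1, end label included
by_pos = df_sl.iloc[0:1, 0:2]                    # row 0 only, end position excluded

# .at reads one value by row label and column label
scalar = df_sl.at[2, 'Column2']

print(by_label)
print(by_pos)
print(scalar)
```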
