# Pandas

Pandas is a data manipulation library used for data analysis and data cleaning. It provides two core data structures: Series and DataFrame. A Series is a one-dimensional, array-like object, while a DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous data structure laid out as a table of rows and columns.

In [3]:
import pandas as pd

In [2]:
## Series
## A Series is a one-dimensional, array-like object that can hold data of any type. It is like a single column of a table.

In [4]:
numbers = [1,2,3,4,5,6]
series = pd.Series(numbers)
print(series,type(series))

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64 <class 'pandas.core.series.Series'>


In [5]:
## Creating a series from a dictionary
dictionary = {'name':'Aarush','age':'22'}
series_dictionary = pd.Series(dictionary)
print(series_dictionary,type(series_dictionary))

name    Aarush
age         22
dtype: object <class 'pandas.core.series.Series'>


In [6]:
## Creating a series from two separate lists: one of values and one of index labels
values = [10,20,30]
indices = ['a','b','c']
series_indexed = pd.Series(values,index=indices)
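A Series built with custom index labels can be read either by label or by position. The following is a minimal sketch of that idea; the names `labelled`, `by_label`, and `by_position` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same values and index labels as in the cell above
values = [10, 20, 30]
indices = ['a', 'b', 'c']
labelled = pd.Series(values, index=indices)

# Label-based lookup uses the custom index
by_label = labelled['b']

# Position-based lookup ignores the labels
by_position = labelled.iloc[1]

print(by_label, by_position)
```

Both lookups reach the same element here because `'b'` is the label of the second entry.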

A DataFrame is the Pandas counterpart of a spreadsheet: a sheet you would open in Excel can be loaded into a DataFrame.

In [7]:
## DataFrame
## Creating a dataframe from a dictionary of lists
data = {
    'Name':['Krish','John','Jack'],
    'age':[25,30,45],
    'City':['Bangalore','New York','Texas']
}

In [8]:
df = pd.DataFrame(data)
print(df)

    Name  age       City
0  Krish   25  Bangalore
1   John   30   New York
2   Jack   45      Texas


In [9]:
## Converting a dataframe to a NumPy array
import numpy as np
np.array(df)

array([['Krish', 25, 'Bangalore'],
       ['John', 30, 'New York'],
       ['Jack', 45, 'Texas']], dtype=object)
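Besides `np.array(df)`, Pandas offers the `to_numpy()` method on both DataFrames and Series, which is the approach its documentation generally recommends. A small sketch, assuming the same three-row data as above (the names `df_small`, `arr`, and `ages` are illustrative):

```python
import pandas as pd

df_small = pd.DataFrame({
    'Name': ['Krish', 'John', 'Jack'],
    'age': [25, 30, 45],
    'City': ['Bangalore', 'New York', 'Texas'],
})

# Whole frame: columns of mixed types share a single array
arr = df_small.to_numpy()
print(arr.shape)

# Single column: the array keeps that column's numeric type
ages = df_small['age'].to_numpy()
print(ages)
```

Converting a single numeric column keeps a numeric array, which is usually what later NumPy code expects.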

In [10]:
## Create a dataframe from a list of dictionaries
data = [
    {'Name':'Aarush','age':22},
    {'Name':'Abhay','age':21},
    {'Name':'Manish','age':53}
]
dataframe = pd.DataFrame(data)
print(dataframe)

     Name  age
0  Aarush   22
1   Abhay   21
2  Manish   53
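When the dictionaries in the list do not all share the same keys, Pandas fills the gaps with `NaN`. The sketch below illustrates this with a hypothetical variation of the records above, in which the second record has no `age`.

```python
import pandas as pd

# Hypothetical records: the second one is missing the 'age' key
records = [
    {'Name': 'Aarush', 'age': 22},
    {'Name': 'Abhay'},
    {'Name': 'Manish', 'age': 53},
]
frame = pd.DataFrame(records)

# The missing entry becomes NaN, so the column is stored as float
print(frame)
print(frame['age'].isna().sum())
```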


In [11]:
## Reading files such as CSV
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
1,2023-01-02,B,39.0,Product3,110.0,North
2,2023-01-03,C,32.0,Product2,398.0,East
3,2023-01-04,B,8.0,Product1,522.0,East
4,2023-01-05,B,26.0,Product3,869.0,North


In [12]:
df.tail()

Unnamed: 0,Date,Category,Value,Product,Sales,Region
45,2023-02-15,B,99.0,Product2,599.0,West
46,2023-02-16,B,6.0,Product1,938.0,South
47,2023-02-17,B,69.0,Product3,143.0,West
48,2023-02-18,C,65.0,Product3,182.0,North
49,2023-02-19,C,11.0,Product3,708.0,North
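Since `data.csv` itself is not included here, the following sketch reads CSV text from an in-memory buffer instead. The column layout is an assumption based on the output shown above, and the five rows are the ones printed by `df.head()`. Both `head` and `tail` accept the number of rows to return, and `parse_dates` asks `read_csv` to convert a column to datetimes.

```python
import io
import pandas as pd

# Assumed file layout, reconstructed from the head() output above
csv_text = """Date,Category,Value,Product,Sales,Region
2023-01-01,A,28.0,Product1,754.0,East
2023-01-02,B,39.0,Product3,110.0,North
2023-01-03,C,32.0,Product2,398.0,East
2023-01-04,B,8.0,Product1,522.0,East
2023-01-05,B,26.0,Product3,869.0,North
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

first_two = sample.head(2)  # first 2 rows instead of the default 5
last_two = sample.tail(2)   # last 2 rows

print(sample.shape)
print(first_two)
print(last_two)
```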


## Select a single column as a Series


In [13]:
df['Date']

Unnamed: 0,Date
0,2023-01-01
1,2023-01-02
2,2023-01-03
3,2023-01-04
4,2023-01-05
5,2023-01-06
6,2023-01-07
7,2023-01-08
8,2023-01-09
9,2023-01-10


In [14]:
## Pick up the first row by its index label
df.loc[0]

Unnamed: 0,0
Date,2023-01-01
Category,A
Value,28.0
Product,Product1
Sales,754.0
Region,East


In [15]:
## Pick up the first row by its integer position
df.iloc[0]

Unnamed: 0,0
Date,2023-01-01
Category,A
Value,28.0
Product,Product1
Sales,754.0
Region,East
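`loc` and `iloc` return the same row above only because the default index labels happen to be 0, 1, 2, and so on. A minimal sketch of where they differ, using a hypothetical frame `scores` whose index labels are 10, 20, and 30:

```python
import pandas as pd

# Hypothetical frame with index labels that do not start at 0
scores = pd.DataFrame({'Value': [28.0, 39.0, 32.0]}, index=[10, 20, 30])

by_label = scores.loc[10]     # row whose index label is 10
by_position = scores.iloc[0]  # first row, whatever its label is

# There is no row labelled 0, so loc[0] raises KeyError
try:
    scores.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(by_label['Value'], by_position['Value'], label_zero_exists)
```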


In [20]:
df.loc[0,'Date']

'2023-01-01'

In [24]:
## Accessing a single element by row label and column name with at
print(df.at[1,'Value'])

39.0


In [25]:
## iat: accessing a single element by integer row and column position
df.iat[2,2]

np.float64(32.0)
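`at` and `iat` are the scalar counterparts of `loc` and `iloc`: they read or write exactly one cell. A short sketch on a small hypothetical frame (the name `cells` and its contents are illustrative):

```python
import pandas as pd

cells = pd.DataFrame({
    'Category': ['A', 'B', 'C'],
    'Value': [28.0, 39.0, 32.0],
})

# Read one cell by label and by position
v_at = cells.at[1, 'Value']  # row label 1, column 'Value'
v_iat = cells.iat[1, 1]      # second row, second column

# at can also assign a single cell
cells.at[1, 'Value'] = 40.0

print(v_at, v_iat, cells.loc[1, 'Value'])
```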

In [28]:
## Data Manipulation with DataFrame
df['Time'] = '1 pm'  # a scalar is broadcast to every row

In [29]:
df.head()

Unnamed: 0,Date,Category,Value,Product,Sales,Region,Time
0,2023-01-01,A,28.0,Product1,754.0,East,1 pm
1,2023-01-02,B,39.0,Product3,110.0,North,1 pm
2,2023-01-03,C,32.0,Product2,398.0,East,1 pm
3,2023-01-04,B,8.0,Product1,522.0,East,1 pm
4,2023-01-05,B,26.0,Product3,869.0,North,1 pm


## Axis
The axis parameter takes one of two values: 0 refers to the row index and 1 refers to the columns.

In [None]:
## axis = 0 for rows and axis = 1 for columns
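For aggregations, the axis names the direction that gets collapsed. The sketch below uses a small hypothetical frame `grid` to show this with `sum`.

```python
import pandas as pd

grid = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# axis=0 runs down the rows: one total per column
col_totals = grid.sum(axis=0)

# axis=1 runs across the columns: one total per row
row_totals = grid.sum(axis=1)

print(col_totals)
print(row_totals)
```

With `drop`, the same parameter says where to look for the label: `axis=0` searches the row index and `axis=1` searches the column names.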

In [32]:
## Remove a column
df.drop('Time',axis=1)
## This returns a new dataframe; df itself is unchanged because inplace is not set

Unnamed: 0,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
1,2023-01-02,B,39.0,Product3,110.0,North
2,2023-01-03,C,32.0,Product2,398.0,East
3,2023-01-04,B,8.0,Product1,522.0,East
4,2023-01-05,B,26.0,Product3,869.0,North
5,2023-01-06,B,54.0,Product3,192.0,West
6,2023-01-07,A,16.0,Product1,936.0,East
7,2023-01-08,C,89.0,Product1,488.0,West
8,2023-01-09,C,37.0,Product3,772.0,West
9,2023-01-10,A,22.0,Product2,834.0,West


In [33]:
df

Unnamed: 0,Date,Category,Value,Product,Sales,Region,Time
0,2023-01-01,A,28.0,Product1,754.0,East,1 pm
1,2023-01-02,B,39.0,Product3,110.0,North,1 pm
2,2023-01-03,C,32.0,Product2,398.0,East,1 pm
3,2023-01-04,B,8.0,Product1,522.0,East,1 pm
4,2023-01-05,B,26.0,Product3,869.0,North,1 pm
5,2023-01-06,B,54.0,Product3,192.0,West,1 pm
6,2023-01-07,A,16.0,Product1,936.0,East,1 pm
7,2023-01-08,C,89.0,Product1,488.0,West,1 pm
8,2023-01-09,C,37.0,Product3,772.0,West,1 pm
9,2023-01-10,A,22.0,Product2,834.0,West,1 pm
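Because `drop` returns a new DataFrame by default, an alternative to `inplace=True` is to keep the result under a new name. A minimal sketch, assuming a small stand-in frame `table` (hypothetical) and using the `columns=` keyword, which is equivalent to passing `axis=1`:

```python
import pandas as pd

table = pd.DataFrame({'Value': [28.0, 39.0], 'Time': ['1 pm', '1 pm']})

# drop returns a new frame; the original keeps its 'Time' column
trimmed = table.drop(columns='Time')

print(list(table.columns))
print(list(trimmed.columns))
```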


In [35]:
df.drop('Time',axis = 1,inplace = True)
df.head()

Unnamed: 0,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
1,2023-01-02,B,39.0,Product3,110.0,North
2,2023-01-03,C,32.0,Product2,398.0,East
3,2023-01-04,B,8.0,Product1,522.0,East
4,2023-01-05,B,26.0,Product3,869.0,North


In [37]:
## Modify Columns
df['Value'] = df['Value'] + 1
df.head()

Unnamed: 0,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,29.0,Product1,754.0,East
1,2023-01-02,B,40.0,Product3,110.0,North
2,2023-01-03,C,33.0,Product2,398.0,East
3,2023-01-04,B,9.0,Product1,522.0,East
4,2023-01-05,B,27.0,Product3,869.0,North


In [38]:
## Remove a row by its index label
df.drop(0,inplace=True)
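Dropping a row removes its label but does not renumber the remaining rows, so the index then starts at 1. The sketch below shows this on a hypothetical three-row frame and uses `reset_index(drop=True)` to renumber from 0 if that is wanted.

```python
import pandas as pd

rows = pd.DataFrame({'Value': [28.0, 39.0, 32.0]})

# Remove the row labelled 0; labels 1 and 2 remain as they were
remaining = rows.drop(0)
print(remaining.index.tolist())

# Optionally renumber the rows from 0 again
renumbered = remaining.reset_index(drop=True)
print(renumbered.index.tolist())
```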

In [39]:
## Display datatypes of each column
print("Data type :\n",df.dtypes)

Data type :
 Date         object
Category     object
Value       float64
Product      object
Sales       float64
Region       object
dtype: object


In [40]:
## Describe the Dataframe
print("Statistical summary :\n",df.describe())

Statistical summary :
             Value       Sales
count   46.000000   45.000000
mean    53.260870  552.755556
std     29.152804  276.075513
min      3.000000  108.000000
25%     29.000000  338.000000
50%     56.000000  584.000000
75%     71.000000  772.000000
max    100.000000  992.000000
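The `count` row differs between the two columns (46 and 45) because `describe` counts only non-missing values, which suggests both columns contain some `NaN` entries. A small sketch with made-up numbers (the frame `stats` is hypothetical) shows the effect and how `isna().sum()` reports the gaps directly:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Sales value
stats = pd.DataFrame({
    'Value': [3.0, 56.0, 100.0],
    'Sales': [108.0, np.nan, 992.0],
})

summary = stats.describe()
missing = stats.isna().sum()  # number of NaN entries per column

print(summary.loc['count'])
print(missing)
```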


In [41]:
## Group by a column and perform operations
grouped = df.groupby('Category')['Value'].mean()
print("Mean value by category:\n",grouped)

Mean value by category:
 Category
A    50.923077
B    45.142857
C    60.842105
Name: Value, dtype: float64
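A grouped column can also be summarised with several functions at once through `agg`. The following is a sketch on a small hypothetical frame `sales`, not the notebook's own data:

```python
import pandas as pd

sales = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B'],
    'Value': [10.0, 20.0, 30.0, 40.0],
})

# One row per category, one column per aggregation
summary = sales.groupby('Category')['Value'].agg(['mean', 'count', 'max'])
print(summary)
```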
