# pandas in 10 minutes
- This is a short introduction to pandas. For more complex recipes, see the [Cookbook](https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook).


In [1]:
# import NumPy and pandas under their conventional aliases
import numpy as np
import pandas as pd

## Basic data structures in pandas
- pandas provides two types of classes for handling data:
1. Series: a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.
2. DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
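
As a minimal sketch of the two classes described above, the snippet below builds a small Series and a DataFrame from it. The names `s` and `frame` and the sample values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values, each attached to an index label.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: a table whose columns share one row index.
# Column "x" reuses the Series; column "y" is a plain list of the same length.
frame = pd.DataFrame({"x": s, "y": ["p", "q", "r"]})

print(s["b"])            # look up a value by its label
print(frame.shape)       # (number of rows, number of columns)
print(type(frame["x"]))  # a single column comes back as a Series
```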

In [2]:
dates = pd.date_range('20140101',periods=10)
# Creates a DatetimeIndex of 10 sequential dates starting from '2014-01-01', used to label the rows (index) in a DataFrame.
dates

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
               '2014-01-09', '2014-01-10'],
              dtype='datetime64[ns]', freq='D')

In [7]:
df = pd.DataFrame(np.random.randn(10,4), index=dates, columns=list('ABCD'))
# A DataFrame with 10 rows (indexed by dates), 4 columns labeled 'A', 'B', 'C', 'D', filled with random values.
df

Unnamed: 0,A,B,C,D
2014-01-01,0.506059,0.289087,-1.84554,-0.805689
2014-01-02,-2.036319,-0.772767,-1.993456,-0.925015
2014-01-03,-1.855434,-2.203944,-0.423123,1.145439
2014-01-04,-0.024592,-0.482478,-0.654107,0.061867
2014-01-05,1.706519,0.012014,0.103616,-1.291534
2014-01-06,-0.019864,-1.394176,0.199732,-2.435615
2014-01-07,-2.358551,0.475242,-0.594963,1.400826
2014-01-08,-1.668834,-0.637318,-0.768826,0.995356
2014-01-09,-1.220843,-0.611766,2.275665,0.570381
2014-01-10,-0.330113,-0.67154,1.363282,-1.055902


#### Creating a DataFrame by passing a dictionary of objects where the keys are the column labels and the values are the column values.

In [8]:
df2 = pd.DataFrame({
    'A': 1.0,
    'B':pd.Timestamp('20130202'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3]*4, dtype='int32'),
    'E': pd.Categorical(['train','test','train','test']),
    'F': 'foo' 
}
)
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-02-02,1.0,3,train,foo
1,1.0,2013-02-02,1.0,3,test,foo
2,1.0,2013-02-02,1.0,3,train,foo
3,1.0,2013-02-02,1.0,3,test,foo


- The columns of the resulting DataFrame have different dtypes:

In [5]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object

## Viewing data
- Use DataFrame.head() and DataFrame.tail() to view the first and last rows of the data; both return five rows by default.

In [6]:
df.head()  # show the first five rows

Unnamed: 0,A,B,C,D
2014-01-01,1.569233,-0.678559,0.06705,0.819036
2014-01-02,-0.481405,-0.192512,0.947463,0.80632
2014-01-03,-0.864873,-1.100285,-0.093128,0.494835
2014-01-04,-0.198615,0.34425,-1.704664,1.265533
2014-01-05,0.138431,0.729905,-2.103799,-0.38012


In [7]:
df.tail()  # show the last five rows

Unnamed: 0,A,B,C,D
2014-01-06,0.024959,0.416609,1.561498,-0.088178
2014-01-07,-0.32438,-0.401447,-0.039363,-1.063489
2014-01-08,-1.844305,0.490536,1.218289,-0.304004
2014-01-09,-0.092354,-0.800876,0.492006,-1.14021
2014-01-10,0.623552,0.286183,1.184754,0.574648


- DataFrame.index displays the row labels and DataFrame.columns displays the column labels:

In [9]:
# df.index
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

- Return a NumPy representation of the underlying data with DataFrame.to_numpy() without the index or column labels:

In [10]:
df.to_numpy()

array([[ 0.50605879,  0.28908725, -1.84553968, -0.80568936],
       [-2.03631901, -0.7727667 , -1.99345599, -0.92501506],
       [-1.85543409, -2.20394379, -0.42312347,  1.14543946],
       [-0.02459234, -0.4824776 , -0.65410694,  0.06186713],
       [ 1.7065195 ,  0.01201425,  0.10361555, -1.2915342 ],
       [-0.01986443, -1.39417564,  0.19973191, -2.43561488],
       [-2.35855076,  0.47524248, -0.59496267,  1.40082596],
       [-1.66883414, -0.63731758, -0.76882595,  0.99535608],
       [-1.2208426 , -0.61176588,  2.27566466,  0.57038111],
       [-0.33011283, -0.67153957,  1.36328165, -1.05590215]])

### Note:
- NumPy arrays have one dtype for the entire array while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. If the common data type is object, DataFrame.to_numpy() will require copying data.

In [11]:
print(df2.dtypes)
print("-------------------------")
print(df2.to_numpy())

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
-------------------------
[[1.0 Timestamp('2013-02-02 00:00:00') 1.0 3 'train' 'foo']
 [1.0 Timestamp('2013-02-02 00:00:00') 1.0 3 'test' 'foo']
 [1.0 Timestamp('2013-02-02 00:00:00') 1.0 3 'train' 'foo']
 [1.0 Timestamp('2013-02-02 00:00:00') 1.0 3 'test' 'foo']]
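
To see the dtype rule from the note in isolation, here is a small sketch comparing an all-float frame with a frame that mixes floats and strings. The frames `numeric` and `mixed` are hypothetical examples, not part of the data above.

```python
import numpy as np
import pandas as pd

# Every column is float, so one float dtype can hold the whole array.
numeric = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Floats and strings share no common numeric dtype, so NumPy falls back to object.
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

print(numeric.to_numpy().dtype)
print(mixed.to_numpy().dtype)
```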


- Use ```describe()``` for a quick statistical summary of the data. By default it summarizes only the numeric columns.

In [13]:
df.head()

Unnamed: 0,A,B,C,D
2014-01-01,0.506059,0.289087,-1.84554,-0.805689
2014-01-02,-2.036319,-0.772767,-1.993456,-0.925015
2014-01-03,-1.855434,-2.203944,-0.423123,1.145439
2014-01-04,-0.024592,-0.482478,-0.654107,0.061867
2014-01-05,1.706519,0.012014,0.103616,-1.291534


In [14]:
df.describe()

Unnamed: 0,A,B,C,D
count,10.0,10.0,10.0,10.0
mean,-0.730197,-0.599764,-0.233772,-0.233989
std,1.306984,0.787037,1.311198,1.259044
min,-2.358551,-2.203944,-1.993456,-2.435615
25%,-1.808784,-0.74746,-0.740146,-1.02318
50%,-0.775478,-0.624542,-0.509043,-0.371911
75%,-0.021046,-0.111609,0.175703,0.889112
max,1.706519,0.475242,2.275665,1.400826
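
The numeric-only behaviour is easier to see on a frame that also has a text column. The sketch below uses a made-up frame `mixed`; passing `include="all"` is one way to ask for every column.

```python
import pandas as pd

mixed = pd.DataFrame({"price": [1.5, 2.5, 3.5], "label": ["a", "b", "a"]})

# Default: only the numeric column 'price' is summarized.
numeric_summary = mixed.describe()

# include="all" also reports count/unique/top/freq for the text column.
full_summary = mixed.describe(include="all")

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```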


- Transposing the data with DataFrame.T swaps rows and columns:

In [16]:
# df.T
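
Since the cell above is commented out, here is a small sketch of what `.T` does, using a made-up two-by-two frame named `small`: row labels become column labels and vice versa.

```python
import pandas as pd

small = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])

# .T returns a new, transposed DataFrame; 'small' itself is unchanged.
flipped = small.T

print(flipped)
```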

- Use ```sort_index()``` to sort the data by its axis labels, in ascending or descending order.

In [17]:
df.sort_index(axis=1, ascending=False)  # Sorts the column labels in descending order (D, C, B, A) and rearranges the DataFrame accordingly.
# sort_index() sorts by labels: row labels if axis=0, column labels if axis=1.
# ascending=False sorts in descending order (Z to A, or highest to lowest); use True for ascending (A to Z).

Unnamed: 0,D,C,B,A
2014-01-01,-0.805689,-1.84554,0.289087,0.506059
2014-01-02,-0.925015,-1.993456,-0.772767,-2.036319
2014-01-03,1.145439,-0.423123,-2.203944,-1.855434
2014-01-04,0.061867,-0.654107,-0.482478,-0.024592
2014-01-05,-1.291534,0.103616,0.012014,1.706519
2014-01-06,-2.435615,0.199732,-1.394176,-0.019864
2014-01-07,1.400826,-0.594963,0.475242,-2.358551
2014-01-08,0.995356,-0.768826,-0.637318,-1.668834
2014-01-09,0.570381,2.275665,-0.611766,-1.220843
2014-01-10,-1.055902,1.363282,-0.67154,-0.330113
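
For comparison, a minimal sketch of sorting along both axes on a small made-up frame (`small` is hypothetical). It also shows that `sort_index()` returns a new object rather than modifying the original.

```python
import pandas as pd

small = pd.DataFrame({"B": [1, 2, 3], "A": [4, 5, 6]}, index=["y", "z", "x"])

by_rows = small.sort_index()        # axis=0 is the default: sort row labels
by_cols = small.sort_index(axis=1)  # axis=1: sort column labels

print(by_rows)
print(by_cols)
```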


- Use ```sort_values()``` to sort by the values of a column:

In [18]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2014-01-03,-1.855434,-2.203944,-0.423123,1.145439
2014-01-06,-0.019864,-1.394176,0.199732,-2.435615
2014-01-02,-2.036319,-0.772767,-1.993456,-0.925015
2014-01-10,-0.330113,-0.67154,1.363282,-1.055902
2014-01-08,-1.668834,-0.637318,-0.768826,0.995356
2014-01-09,-1.220843,-0.611766,2.275665,0.570381
2014-01-04,-0.024592,-0.482478,-0.654107,0.061867
2014-01-05,1.706519,0.012014,0.103616,-1.291534
2014-01-01,0.506059,0.289087,-1.84554,-0.805689
2014-01-07,-2.358551,0.475242,-0.594963,1.400826
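
`sort_values()` also accepts several columns and a per-column sort direction. The sketch below uses a made-up frame `scores` to show that form.

```python
import pandas as pd

scores = pd.DataFrame({"team": ["b", "a", "b", "a"], "points": [3, 1, 2, 4]})

# Sort by team ascending, then by points descending within each team.
ordered = scores.sort_values(by=["team", "points"], ascending=[True, False])

print(ordered)
```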


## Selection
### Note
- While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods: DataFrame.at[], DataFrame.iat[], DataFrame.loc[] and DataFrame.iloc[]. These are indexers, so they are used with square brackets rather than called like functions.
#### Getitem([])
- For a DataFrame, passing a single label selects a column and yields a Series, equivalent to df.A:

In [27]:
df['A']

2014-01-01    0.506059
2014-01-02   -2.036319
2014-01-03   -1.855434
2014-01-04   -0.024592
2014-01-05    1.706519
2014-01-06   -0.019864
2014-01-07   -2.358551
2014-01-08   -1.668834
2014-01-09   -1.220843
2014-01-10   -0.330113
Freq: D, Name: A, dtype: float64

- For a DataFrame, passing a slice (:) selects matching rows:

In [28]:
df[0:3]  # Just like my_list[0:3], this gives the first 3 rows.

Unnamed: 0,A,B,C,D
2014-01-01,0.506059,0.289087,-1.84554,-0.805689
2014-01-02,-2.036319,-0.772767,-1.993456,-0.925015
2014-01-03,-1.855434,-2.203944,-0.423123,1.145439


#### Selection by label
- Selecting a row matching a label:

In [29]:
df.loc[dates[0]]  # Selects the row of df whose index label equals dates[0].
# .loc[] accesses data by label (not position); dates[0] is the first date in the DatetimeIndex.

A    0.506059
B    0.289087
C   -1.845540
D   -0.805689
Name: 2014-01-01 00:00:00, dtype: float64

- Selecting all rows (:) with selected column labels:

In [30]:
df.loc[:,['A','B']]

Unnamed: 0,A,B
2014-01-01,0.506059,0.289087
2014-01-02,-2.036319,-0.772767
2014-01-03,-1.855434,-2.203944
2014-01-04,-0.024592,-0.482478
2014-01-05,1.706519,0.012014
2014-01-06,-0.019864,-1.394176
2014-01-07,-2.358551,0.475242
2014-01-08,-1.668834,-0.637318
2014-01-09,-1.220843,-0.611766
2014-01-10,-0.330113,-0.67154


- For label slicing, both endpoints are included:

In [31]:
df.loc['20140101':'20140104',['A','B']]

Unnamed: 0,A,B
2014-01-01,0.506059,0.289087
2014-01-02,-2.036319,-0.772767
2014-01-03,-1.855434,-2.203944
2014-01-04,-0.024592,-0.482478
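
The inclusive end point is the main difference from positional slicing. A minimal sketch on a made-up five-row frame (`small` is hypothetical):

```python
import pandas as pd

idx = pd.date_range("20140101", periods=5)
small = pd.DataFrame({"A": range(5)}, index=idx)

# Label slice: both '2014-01-01' and '2014-01-03' are included.
by_label = small.loc["20140101":"20140103"]

# Positional slice: stops before position 2, like a Python list.
by_position = small.iloc[0:2]

print(len(by_label), len(by_position))
```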


- Selecting a single row and column label returns a scalar:

In [32]:
# print(df.loc[dates[0],'A'])
df.loc[dates[0],'A']

np.float64(0.5060587872817447)

- For getting fast access to a scalar (equivalent to the prior method):

In [33]:
df.at[dates[0], 'A']

np.float64(0.5060587872817447)

#### Selection by position
- Select via the position of the passed integers:

In [34]:
df.iloc[3]  # Selects the 4th row (position 3) of df using integer-based indexing.
# .iloc[] stands for integer location; it accesses rows/columns by position, not by label.

A   -0.024592
B   -0.482478
C   -0.654107
D    0.061867
Name: 2014-01-04 00:00:00, dtype: float64

- Integer slices act similarly to NumPy/Python:

In [35]:
df.iloc[3:5, 0:2]  # Selects rows at positions 3 and 4 and columns at positions 0 and 1, using integer-based indexing.
# Both slices exclude their end position: 3:5 stops before 5, and 0:2 stops before 2.

Unnamed: 0,A,B
2014-01-04,-0.024592,-0.482478
2014-01-05,1.706519,0.012014


- Lists of integer position locations:

In [36]:
df.iloc[[1,2,3], [0,2]]  # Selects rows at positions 1, 2, 3 and columns at positions 0 and 2 from the DataFrame df using integer-based indexing.
# rows → [1, 2, 3] (2nd, 3rd, and 4th rows), columns → [0, 2] (1st and 3rd columns)

Unnamed: 0,A,C
2014-01-02,-2.036319,-1.993456
2014-01-03,-1.855434,-0.423123
2014-01-04,-0.024592,-0.654107


- For slicing rows explicitly:

In [37]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2014-01-02,-2.036319,-0.772767,-1.993456,-0.925015
2014-01-03,-1.855434,-2.203944,-0.423123,1.145439


- For slicing columns explicitly:

In [38]:
df.iloc[:, 1:3]

Unnamed: 0,B,C
2014-01-01,0.289087,-1.84554
2014-01-02,-0.772767,-1.993456
2014-01-03,-2.203944,-0.423123
2014-01-04,-0.482478,-0.654107
2014-01-05,0.012014,0.103616
2014-01-06,-1.394176,0.199732
2014-01-07,0.475242,-0.594963
2014-01-08,-0.637318,-0.768826
2014-01-09,-0.611766,2.275665
2014-01-10,-0.67154,1.363282


- For getting a value explicitly:

In [39]:
df.iloc[1,1]

np.float64(-0.7727666957268983)

- For getting fast access to a scalar (equivalent to the prior method):

In [40]:
df.iat[1,1]

np.float64(-0.7727666957268983)
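
The four accessors shown so far can be checked against each other on a tiny frame. This is only a sketch with made-up labels (`r1`, `r2`); it confirms that label-based and position-based lookups reach the same cell.

```python
import pandas as pd

small = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=["r1", "r2"])

by_loc = small.loc["r2", "B"]  # label-based, general purpose
by_at = small.at["r2", "B"]    # label-based, scalar only
by_iloc = small.iloc[1, 1]     # position-based, general purpose
by_iat = small.iat[1, 1]       # position-based, scalar only

print(by_loc, by_at, by_iloc, by_iat)
```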

## Boolean indexing
- Select rows where df.A is greater than 2. No value in column A exceeds 2 in this sample, so the result is empty:

In [41]:
df[df['A']>2] # Returns only the rows where column 'A' has values greater than 2.
#  df['A'] > 2 → creates a boolean Series (True/False for each row)., df[...] → filters rows where condition is True.

Unnamed: 0,A,B,C,D
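
Conditions can also be combined. The sketch below, on a made-up frame `small`, uses `&` (and) and `|` (or); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

small = pd.DataFrame({"A": [-1.0, 0.5, 2.0], "B": [1.0, -2.0, 3.0]})

# Rows where both columns are positive.
both_positive = small[(small["A"] > 0) & (small["B"] > 0)]

# Rows where at least one column is positive.
either_positive = small[(small["A"] > 0) | (small["B"] > 0)]

print(both_positive)
print(either_positive)
```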


- Selecting values from a DataFrame where a boolean condition is met:

In [42]:
df[df>0]

Unnamed: 0,A,B,C,D
2014-01-01,0.506059,0.289087,,
2014-01-02,,,,
2014-01-03,,,,1.145439
2014-01-04,,,,0.061867
2014-01-05,1.706519,0.012014,0.103616,
2014-01-06,,,0.199732,
2014-01-07,,0.475242,,1.400826
2014-01-08,,,,0.995356
2014-01-09,,,2.275665,0.570381
2014-01-10,,,1.363282,
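
Indexing a DataFrame with a boolean DataFrame keeps the shape and fills the non-matching cells with NaN. As a sketch on a made-up frame `small`, this gives the same result as `DataFrame.where()`:

```python
import pandas as pd

small = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

masked = small[small > 0]     # cells that fail the condition become NaN
same = small.where(small > 0)  # explicit spelling of the same operation

print(masked)
print(masked.equals(same))
```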


- Using isin() method for filtering:

In [45]:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two','three','four','five','six','seven','eight','nine']  # add column
df2

Unnamed: 0,A,B,C,D,E
2014-01-01,0.506059,0.289087,-1.84554,-0.805689,one
2014-01-02,-2.036319,-0.772767,-1.993456,-0.925015,one
2014-01-03,-1.855434,-2.203944,-0.423123,1.145439,two
2014-01-04,-0.024592,-0.482478,-0.654107,0.061867,three
2014-01-05,1.706519,0.012014,0.103616,-1.291534,four
2014-01-06,-0.019864,-1.394176,0.199732,-2.435615,five
2014-01-07,-2.358551,0.475242,-0.594963,1.400826,six
2014-01-08,-1.668834,-0.637318,-0.768826,0.995356,seven
2014-01-09,-1.220843,-0.611766,2.275665,0.570381,eight
2014-01-10,-0.330113,-0.67154,1.363282,-1.055902,nine


In [46]:
df2[df2['E'].isin(['one', 'nine'])]

Unnamed: 0,A,B,C,D,E
2014-01-01,0.506059,0.289087,-1.84554,-0.805689,one
2014-01-02,-2.036319,-0.772767,-1.993456,-0.925015,one
2014-01-10,-0.330113,-0.67154,1.363282,-1.055902,nine
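
The mask returned by `isin()` can be inverted with `~` to keep the rows that are not in the list. A minimal sketch with a made-up frame `small`:

```python
import pandas as pd

small = pd.DataFrame({"E": ["one", "two", "three"], "v": [1, 2, 3]})

wanted = ["one", "three"]
keep = small[small["E"].isin(wanted)]   # rows whose E is in the list
drop = small[~small["E"].isin(wanted)]  # rows whose E is not in the list

print(keep)
print(drop)
```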


#### Setting
- Setting a new column automatically aligns the data by the indexes:

In [47]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20140101', periods=6))
s1

2014-01-01    1
2014-01-02    2
2014-01-03    3
2014-01-04    4
2014-01-05    5
2014-01-06    6
Freq: D, dtype: int64

In [48]:
df['F'] = s1
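
Alignment means values are matched by index label, not by order. In the sketch below (the frame and Series are made up), the Series is in a different order, lacks label "c", and has an extra label "z":

```python
import pandas as pd

frame = pd.DataFrame({"A": [10, 20, 30]}, index=["a", "b", "c"])
s = pd.Series([2, 1, 9], index=["b", "a", "z"])

# Values land on the row with the matching label.
# "c" has no match and becomes NaN; "z" is not in the frame and is ignored.
frame["F"] = s

print(frame)
```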

- Setting values by label:

In [49]:
df.at[dates[0], 'A'] = 0

- Setting values by position:

In [50]:
df.iat[1,2] = 0

- Setting by assigning with a NumPy array:

In [51]:
df.loc[:, 'D'] = np.array([5] * len(df))


- The result of the prior setting operations:

In [52]:
df

Unnamed: 0,A,B,C,D,F
2014-01-01,0.0,0.289087,-1.84554,5.0,1.0
2014-01-02,-2.036319,-0.772767,0.0,5.0,2.0
2014-01-03,-1.855434,-2.203944,-0.423123,5.0,3.0
2014-01-04,-0.024592,-0.482478,-0.654107,5.0,4.0
2014-01-05,1.706519,0.012014,0.103616,5.0,5.0
2014-01-06,-0.019864,-1.394176,0.199732,5.0,6.0
2014-01-07,-2.358551,0.475242,-0.594963,5.0,
2014-01-08,-1.668834,-0.637318,-0.768826,5.0,
2014-01-09,-1.220843,-0.611766,2.275665,5.0,
2014-01-10,-0.330113,-0.67154,1.363282,5.0,


- A where operation with setting:

In [53]:
df2 = df.copy()
df2[df2>0] = -df2
df2

Unnamed: 0,A,B,C,D,F
2014-01-01,0.0,-0.289087,-1.84554,-5.0,-1.0
2014-01-02,-2.036319,-0.772767,0.0,-5.0,-2.0
2014-01-03,-1.855434,-2.203944,-0.423123,-5.0,-3.0
2014-01-04,-0.024592,-0.482478,-0.654107,-5.0,-4.0
2014-01-05,-1.706519,-0.012014,-0.103616,-5.0,-5.0
2014-01-06,-0.019864,-1.394176,-0.199732,-5.0,-6.0
2014-01-07,-2.358551,-0.475242,-0.594963,-5.0,
2014-01-08,-1.668834,-0.637318,-0.768826,-5.0,
2014-01-09,-1.220843,-0.611766,-2.275665,-5.0,
2014-01-10,-0.330113,-0.67154,-1.363282,-5.0,
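
The same sign flip can be written without assignment. As a sketch on a made-up frame `small`, `DataFrame.mask()` replaces the cells where the condition is true and returns a new object:

```python
import pandas as pd

small = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# Assignment form, as used above: overwrite positive cells in a copy.
assigned = small.copy()
assigned[assigned > 0] = -assigned

# mask() form: where the condition holds, take the value from -small instead.
masked = small.mask(small > 0, -small)

print(masked)
```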
