<a href="https://colab.research.google.com/github/Avipsa1/UPPP275-Notebooks/blob/main/More_pandas_functionalities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Object Creation

Creating a `Series` by passing a list of values, letting pandas create a default integer index:

In [3]:
s = pd.Series([1, 3, 5, -345, 6, 8])
s

0      1
1      3
2      5
3   -345
4      6
5      8
dtype: int64
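The default integer index can be replaced with explicit labels. The sketch below is a minimal illustration; the letter labels and the name `s_labeled` are assumptions made for this example and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Same values as above, but with hypothetical string labels as the index
s_labeled = pd.Series([1, 3, 5, -345, 6, 8], index=list("abcdef"))

# Values can now be looked up by label instead of by position
value_at_d = s_labeled["d"]
print(value_at_d)
```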

Creating a `DataFrame` by passing a numpy array, with a datetime index and labeled columns:

In [4]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [5]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,1.133905,-0.003135,-0.158519,-0.25001
2013-01-02,-0.819928,-0.865317,1.180552,0.559542
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251
2013-01-04,0.500016,-0.118953,-0.69709,-0.11428
2013-01-05,0.047941,-2.0496,0.811191,-1.887079
2013-01-06,0.055552,0.112397,1.074243,-2.061298


Creating a `DataFrame` by passing a `dict` of objects that can each be converted to a series-like column (scalars are repeated to match the length of the longest column):

In [6]:
df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20220430'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["sample", "data", "sample", "data"]),
    'F': 'UPPP275',
})
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2022-04-30,1.0,3,sample,UPPP275
1,1.0,2022-04-30,1.0,3,data,UPPP275
2,1.0,2022-04-30,1.0,3,sample,UPPP275
3,1.0,2022-04-30,1.0,3,data,UPPP275


Check the data type (dtype) of each column in the `DataFrame`:

In [7]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Inspect the first and last rows of the `DataFrame` (`head` and `tail` return five rows by default):

In [8]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,1.133905,-0.003135,-0.158519,-0.25001
2013-01-02,-0.819928,-0.865317,1.180552,0.559542
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251
2013-01-04,0.500016,-0.118953,-0.69709,-0.11428
2013-01-05,0.047941,-2.0496,0.811191,-1.887079


In [9]:
df.tail()

Unnamed: 0,A,B,C,D
2013-01-02,-0.819928,-0.865317,1.180552,0.559542
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251
2013-01-04,0.500016,-0.118953,-0.69709,-0.11428
2013-01-05,0.047941,-2.0496,0.811191,-1.887079
2013-01-06,0.055552,0.112397,1.074243,-2.061298


Display the index, columns, and the underlying numpy data

In [10]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [11]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [12]:
df.values

array([[ 1.13390529, -0.00313548, -0.15851916, -0.25000953],
       [-0.81992755, -0.86531712,  1.18055161,  0.55954224],
       [-0.70721186, -1.06150197, -1.49789785, -1.76125066],
       [ 0.50001606, -0.1189529 , -0.69708962, -0.11428011],
       [ 0.04794119, -2.04959993,  0.81119125, -1.88707905],
       [ 0.05555213,  0.1123973 ,  1.07424264, -2.061298  ]])
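The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for getting the underlying array. The sketch below builds a small stand-in frame rather than reusing `df`; the name `demo` and the seed are assumptions chosen only so the example is reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# For an all-float frame this holds the same data as demo.values
arr = demo.to_numpy()
print(arr.shape)
```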

`describe` shows quick summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column of your `DataFrame`:

In [13]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.035046,-0.664352,0.118746,-0.919063
std,0.73569,0.831268,1.084056,1.116478
min,-0.819928,-2.0496,-1.497898,-2.061298
25%,-0.518424,-1.012456,-0.562447,-1.855622
50%,0.051747,-0.492135,0.326336,-1.00563
75%,0.3889,-0.03209,1.00848,-0.148212
max,1.133905,0.112397,1.180552,0.559542


You can swap the rows and columns of your `DataFrame` using the `T` attribute, which is shorthand for the `transpose` method.

In [14]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,1.133905,-0.819928,-0.707212,0.500016,0.047941,0.055552
B,-0.003135,-0.865317,-1.061502,-0.118953,-2.0496,0.112397
C,-0.158519,1.180552,-1.497898,-0.69709,0.811191,1.074243
D,-0.25001,0.559542,-1.761251,-0.11428,-1.887079,-2.061298


You can sort your data either by its labels along an axis (`sort_index`) or by the values in one or more columns (`sort_values`).

In [15]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,-0.25001,-0.158519,-0.003135,1.133905
2013-01-02,0.559542,1.180552,-0.865317,-0.819928
2013-01-03,-1.761251,-1.497898,-1.061502,-0.707212
2013-01-04,-0.11428,-0.69709,-0.118953,0.500016
2013-01-05,-1.887079,0.811191,-2.0496,0.047941
2013-01-06,-2.061298,1.074243,0.112397,0.055552


In [16]:
df.sort_values(by='D')

Unnamed: 0,A,B,C,D
2013-01-06,0.055552,0.112397,1.074243,-2.061298
2013-01-05,0.047941,-2.0496,0.811191,-1.887079
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251
2013-01-01,1.133905,-0.003135,-0.158519,-0.25001
2013-01-04,0.500016,-0.118953,-0.69709,-0.11428
2013-01-02,-0.819928,-0.865317,1.180552,0.559542
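`sort_values` also accepts several columns and a sort direction per column. The following is a minimal sketch on made-up data; the frame `demo` and its `group` and `score` columns are hypothetical and not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical data used only to illustrate multi-column sorting
demo = pd.DataFrame({"group": ["b", "a", "b", "a"], "score": [2, 5, 9, 1]})

# Sort by group (ascending), then by score from highest to lowest within each group
ordered = demo.sort_values(by=["group", "score"], ascending=[True, False])
print(ordered)
```

The original row labels travel with the rows, so the index of `ordered` is no longer in increasing order.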


## Selecting columns from a `DataFrame`.

Selecting a single column, 'A', yields a pandas `Series`. This is equivalent to `df.A`.

In [17]:
df['A']

2013-01-01    1.133905
2013-01-02   -0.819928
2013-01-03   -0.707212
2013-01-04    0.500016
2013-01-05    0.047941
2013-01-06    0.055552
Freq: D, Name: A, dtype: float64

Passing a slice to `[]` selects rows of the `DataFrame`, either by position or by index label.

In [18]:
df[1:3]

Unnamed: 0,A,B,C,D
2013-01-02,-0.819928,-0.865317,1.180552,0.559542
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251


In [19]:
df['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,-0.819928,-0.865317,1.180552,0.559542
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251
2013-01-04,0.500016,-0.118953,-0.69709,-0.11428


You can also select values by label using the `loc` attribute.

In [20]:
df.loc[dates[0]]

A    1.133905
B   -0.003135
C   -0.158519
D   -0.250010
Name: 2013-01-01 00:00:00, dtype: float64

In [21]:
df.loc['20130102':'20130104',['A','B']]

Unnamed: 0,A,B
2013-01-02,-0.819928,-0.865317
2013-01-03,-0.707212,-1.061502
2013-01-04,0.500016,-0.118953


To select rows (and columns) by integer position rather than by label, use the `iloc` attribute.

In [22]:
df.iloc[3]

A    0.500016
B   -0.118953
C   -0.697090
D   -0.114280
Name: 2013-01-04 00:00:00, dtype: float64

In [23]:
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2013-01-04,0.500016,-0.118953
2013-01-05,0.047941,-2.0496


In [24]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,-0.819928,1.180552
2013-01-03,-0.707212,-1.497898
2013-01-05,0.047941,0.811191


In [25]:
df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2013-01-02,-0.819928,-0.865317,1.180552,0.559542
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251


In [26]:
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,-0.003135,-0.158519
2013-01-02,-0.865317,1.180552
2013-01-03,-1.061502,-1.497898
2013-01-04,-0.118953,-0.69709
2013-01-05,-2.0496,0.811191
2013-01-06,0.112397,1.074243
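One difference between the two accessors is easy to miss: a label slice with `loc` includes both endpoints, while a position slice with `iloc` excludes the stop position, as Python lists do. The sketch below shows this on a small stand-in frame; `demo` and its integer contents are assumptions made for the example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# Stand-in frame with predictable integer values
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = demo.loc["20130102":"20130104"]  # both endpoints included
by_position = demo.iloc[1:3]                # stop position excluded
print(len(by_label), len(by_position))
```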


## Setting a new column in a `DataFrame`

In [27]:
df['E'] = 10 + 0.5*df['B']

In [28]:
df

Unnamed: 0,A,B,C,D,E
2013-01-01,1.133905,-0.003135,-0.158519,-0.25001,9.998432
2013-01-02,-0.819928,-0.865317,1.180552,0.559542,9.567341
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251,9.469249
2013-01-04,0.500016,-0.118953,-0.69709,-0.11428,9.940524
2013-01-05,0.047941,-2.0496,0.811191,-1.887079,8.9752
2013-01-06,0.055552,0.112397,1.074243,-2.061298,10.056199
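Assigning to `df['E']` modifies the `DataFrame` in place. If the original should stay untouched, `DataFrame.assign` returns a new frame with the extra column. This is a minimal sketch; the frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a single column B
demo = pd.DataFrame({"B": [0.0, 2.0, -4.0]})

# assign returns a copy with the new column; demo itself is not changed
with_e = demo.assign(E=lambda d: 10 + 0.5 * d["B"])
print(with_e)
```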


In [30]:
df['F'] = 'New column'
df

Unnamed: 0,A,B,C,D,E,F
2013-01-01,1.133905,-0.003135,-0.158519,-0.25001,9.998432,New column
2013-01-02,-0.819928,-0.865317,1.180552,0.559542,9.567341,New column
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251,9.469249,New column
2013-01-04,0.500016,-0.118953,-0.69709,-0.11428,9.940524,New column
2013-01-05,0.047941,-2.0496,0.811191,-1.887079,8.9752,New column
2013-01-06,0.055552,0.112397,1.074243,-2.061298,10.056199,New column


## Handling missing data in a `DataFrame`

In [34]:
df.iloc[2, 5]  # Pick the 3rd item (position 2) from the 6th column (position 5)

'New column'

In [40]:
df.iloc[2, 5] = np.nan  # Introduce a missing value with a single indexing call


In [41]:
df

Unnamed: 0,A,B,C,D,E,F
2013-01-01,1.133905,-0.003135,-0.158519,-0.25001,9.998432,New column
2013-01-02,-0.819928,-0.865317,1.180552,0.559542,9.567341,New column
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251,9.469249,
2013-01-04,0.500016,-0.118953,-0.69709,-0.11428,9.940524,New column
2013-01-05,0.047941,-2.0496,0.811191,-1.887079,8.9752,New column
2013-01-06,0.055552,0.112397,1.074243,-2.061298,10.056199,New column
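Writing a value through a single `.iloc[row, column]` call updates the `DataFrame` itself, whereas chaining two indexing steps may write to a temporary copy. Once a value is missing, `isna` can be used to find it. The sketch below uses a small hypothetical frame `demo` rather than `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one text column
demo = pd.DataFrame({"E": [9.9, 9.5, 9.4], "F": ["New column"] * 3})

# Assign with a single .iloc call (row position 2, column position 1)
demo.iloc[2, 1] = np.nan

# Count the missing values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```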


In [42]:
# Drop any rows that have missing data.
# dropna returns a new DataFrame; df itself is left unchanged.
df1 = df.dropna(how="any")
df1

Unnamed: 0,A,B,C,D,E,F
2013-01-01,1.133905,-0.003135,-0.158519,-0.25001,9.998432,New column
2013-01-02,-0.819928,-0.865317,1.180552,0.559542,9.567341,New column
2013-01-04,0.500016,-0.118953,-0.69709,-0.11428,9.940524,New column
2013-01-05,0.047941,-2.0496,0.811191,-1.887079,8.9752,New column
2013-01-06,0.055552,0.112397,1.074243,-2.061298,10.056199,New column


In [43]:
# Fill missing data with a new value.
# fillna returns a new DataFrame; df itself is left unchanged.
df2 = df.fillna(value='Changed Value')
df2

Unnamed: 0,A,B,C,D,E,F
2013-01-01,1.133905,-0.003135,-0.158519,-0.25001,9.998432,New column
2013-01-02,-0.819928,-0.865317,1.180552,0.559542,9.567341,New column
2013-01-03,-0.707212,-1.061502,-1.497898,-1.761251,9.469249,Changed Value
2013-01-04,0.500016,-0.118953,-0.69709,-0.11428,9.940524,New column
2013-01-05,0.047941,-2.0496,0.811191,-1.887079,8.9752,New column
2013-01-06,0.055552,0.112397,1.074243,-2.061298,10.056199,New column
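Both `dropna` and `fillna` return new objects by default, so the result has to be assigned to a name to be kept. `fillna` also accepts a `dict` that maps column names to fill values. A minimal sketch under those assumptions, with a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "label": ["a", None, "c"]})

# The dict gives a separate fill value for each column
filled = demo.fillna({"x": 0.0, "label": "unknown"})

print(filled)
print(demo.isna().sum().sum())  # the original still has its missing values
```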


## Perform descriptive statistics on `DataFrame` columns.

In [44]:
df.mean(numeric_only=True)  # skip the text column F

A    0.035046
B   -0.664352
C    0.118746
D   -0.919063
E    9.667824
dtype: float64

In [45]:
df.mean(axis=1, numeric_only=True)  # mean across the numeric columns of each row

2013-01-01    2.144135
2013-01-02    1.924438
2013-01-03    0.888277
2013-01-04    1.902043
2013-01-05    1.179531
2013-01-06    1.847419
Freq: D, dtype: float64
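The `axis` argument decides the direction of the reduction: `axis=0` (the default) gives one result per column, and `axis=1` gives one result per row. The sketch below uses a hypothetical frame `demo` with a text column to show that `numeric_only=True` leaves non-numeric columns out.

```python
import pandas as pd

# Hypothetical frame: two numeric columns and one text column
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0, 30.0],
    "note": ["x", "y", "z"],
})

col_means = demo.mean(axis=0, numeric_only=True)  # one value per column
row_means = demo.mean(axis=1, numeric_only=True)  # one value per row
print(col_means)
print(row_means)
```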

In [46]:
df.sum(numeric_only=True)  # skip the text column F

A     0.210275
B    -3.986110
C     0.712479
D    -5.514375
E    58.006945
dtype: float64

In [48]:
df.value_counts()

A          B          C          D          E          F         
-0.819928  -0.865317   1.180552   0.559542  9.567341   New column    1
 0.047941  -2.049600   0.811191  -1.887079  8.975200   New column    1
 0.055552   0.112397   1.074243  -2.061298  10.056199  New column    1
 0.500016  -0.118953  -0.697090  -0.114280  9.940524   New column    1
 1.133905  -0.003135  -0.158519  -0.250010  9.998432   New column    1
dtype: int64
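On a frame of continuous values every row is unique, so each count above is 1 (the row with the missing value is left out). `value_counts` tends to be more informative on a single categorical column. A minimal sketch, with a hypothetical `Series` named `categories`:

```python
import pandas as pd

# Hypothetical categorical data
categories = pd.Series(["sample", "data", "sample", "data", "sample"])

# Counts are sorted from most to least frequent
counts = categories.value_counts()
print(counts)
```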

## Merge: Add rows from two dataframes together to create a new dataframe

In [49]:
df = pd.DataFrame(np.random.randn(10, 4))
df

Unnamed: 0,0,1,2,3
0,-1.54103,0.41601,0.583439,1.450867
1,-1.314967,0.520094,0.398222,2.174313
2,-0.43865,-1.137365,1.456801,-0.402792
3,0.132016,0.593165,0.904596,0.552155
4,-1.061227,-1.701369,-1.930179,0.681871
5,-0.097845,1.44438,-0.218857,1.692268
6,-1.166487,-1.715286,-1.193051,0.396796
7,0.528822,1.151142,-0.02763,1.802441
8,0.829739,-0.424182,-0.50964,0.751954
9,-0.530023,-0.581198,-0.250683,-0.044248


In [55]:
# create a second, smaller DataFrame with the same columns
df1 = pd.DataFrame(np.random.randn(5, 4))
df1

Unnamed: 0,0,1,2,3
0,0.242245,0.783854,0.443921,-0.486156
1,0.003054,-0.241546,-0.33473,0.99257
2,-0.04274,0.180556,0.656793,-0.491512
3,0.388042,1.249039,0.823088,-3.000634
4,0.31815,-0.348485,0.393261,-0.121836


In [61]:
pd.concat([df,df1])

Unnamed: 0,0,1,2,3
0,-1.54103,0.41601,0.583439,1.450867
1,-1.314967,0.520094,0.398222,2.174313
2,-0.43865,-1.137365,1.456801,-0.402792
3,0.132016,0.593165,0.904596,0.552155
4,-1.061227,-1.701369,-1.930179,0.681871
5,-0.097845,1.44438,-0.218857,1.692268
6,-1.166487,-1.715286,-1.193051,0.396796
7,0.528822,1.151142,-0.02763,1.802441
8,0.829739,-0.424182,-0.50964,0.751954
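9,-0.530023,-0.581198,-0.250683,-0.044248
0,0.242245,0.783854,0.443921,-0.486156
1,0.003054,-0.241546,-0.33473,0.99257
2,-0.04274,0.180556,0.656793,-0.491512
3,0.388042,1.249039,0.823088,-3.000634
4,0.31815,-0.348485,0.393261,-0.121836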
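`pd.concat` keeps each input's original row labels, so labels can repeat in the result. Passing `ignore_index=True` renumbers the rows instead. The sketch below uses two small hypothetical frames, `top` and `bottom`.

```python
import pandas as pd

# Two hypothetical frames with the same column
top = pd.DataFrame({"v": [1, 2]})
bottom = pd.DataFrame({"v": [3, 4, 5]})

stacked = pd.concat([top, bottom])                        # labels: 0, 1, 0, 1, 2
renumbered = pd.concat([top, bottom], ignore_index=True)  # labels: 0, 1, 2, 3, 4
print(stacked.index.tolist())
print(renumbered.index.tolist())
```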


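You can join two dataframes the way you would in a database join: a column common to both dataframes is used to match rows and combine all columns into a single dataframe. When the key value repeats on both sides, as it does here, every matching pair of rows appears in the result.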

In [62]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

Unnamed: 0,key,lval
0,foo,1
1,foo,2


In [63]:
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right

Unnamed: 0,key,rval
0,foo,4
1,foo,5


In [64]:
merged = pd.merge(left, right, on='key')
merged

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5
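By default `pd.merge` performs an inner join, keeping only keys found in both frames. The `how` parameter selects other join types. The sketch below uses hypothetical frames whose keys only partly overlap; the names `left_demo` and `right_demo` and the `bar` and `baz` keys are assumptions made for the example.

```python
import pandas as pd

# Hypothetical frames: only the key "foo" appears in both
left_demo = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_demo = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

inner = pd.merge(left_demo, right_demo, on="key")               # keys present in both
outer = pd.merge(left_demo, right_demo, on="key", how="outer")  # all keys, NaN where missing
print(inner)
print(outer)
```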


You can find more pandas functionality in the "10 Minutes to pandas" guide: https://pandas.pydata.org/pandas-docs/version/0.22.0/10min.html