# Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. 

The name Pandas is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.


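Pandas can be used to store data in a DataFrame, a two-dimensional table with labeled rows and columns. Think of it as a spreadsheet.
It is very handy for time series and tabular data.
Pandas runs on top of NumPy. Installing pandas with pip pulls in NumPy as a dependency, so a separate NumPy install is normally not needed.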

See more at: https://pandas.pydata.org

## Contents
0. Install pandas
1. My first pandas
2. Import CSV files
3. Data cleansing
4. Pandas append data
5. Selecting values


In [1]:
# use this command to install pandas
#!pip install numpy
!pip install pandas



## 1. My first pandas


In [2]:
# First example
# source: https://www.youtube.com/watch?v=e60ItwlZTKM
import numpy as np
import pandas as pd

def header(msg):
    print('-'* 50 )
    print('[ '+ msg +' ]')
          
header("1. load hard coded data into a df")
df = pd.DataFrame(
    [['Jan',58,42,74,22,2.95],
    ['Feb',61,45,78,26,3.02],
    ['Mar',65,48,84,25,2.34],
    ['Apr',67,50,92,28,1.02],
    ['May',71,53,98,35,0.48],
    ['Jun',75,56,107,41,0.11],
    ['Jul',77,58,105,44,0.0],
    ['Aug',77,59,102,43,0.03],
    ['Sep',77,57,103,40,0.17],
    ['Oct',73,54,96,34,0.81],
    ['Nov',64,48,84,30,1.7],
    ['Dec',58,42,73,21,2.56]],
    index = [0,1,2,3,4,5,6,7,8,9,10,11],      
    columns = ['month','avg_high','avg_low','record_high','record_low','avg_precipitation'])
print(df)

--------------------------------------------------
[ 1. load hard coded data into a df ]
   month  avg_high  avg_low  record_high  record_low  avg_precipitation
0    Jan        58       42           74          22               2.95
1    Feb        61       45           78          26               3.02
2    Mar        65       48           84          25               2.34
3    Apr        67       50           92          28               1.02
4    May        71       53           98          35               0.48
5    Jun        75       56          107          41               0.11
6    Jul        77       58          105          44               0.00
7    Aug        77       59          102          43               0.03
8    Sep        77       57          103          40               0.17
9    Oct        73       54           96          34               0.81
10   Nov        64       48           84          30               1.70
11   Dec        58       42           73          21               2.56

In [3]:
# Read a text file into the dataframe

header("Read txt file into a df")
filename = 'weather_example.txt'
df2 = pd.read_csv(filename)
print(df2)


--------------------------------------------------
[ Read txt file into a df ]
   month  avg_high  avg_low  record_high  record_low  avg_precipitation
0    Jan        58       42           74          22               2.95
1    Feb        61       45           78          26               3.02
2    Mar        65       48           84          25               2.34
3    Apr        67       50           92          28               1.02
4    May        71       53           98          35               0.48
5    Jun        75       56          107          41               0.11
6    Jul        77       58          105          44               0.00
7    Aug        77       59          102          43               0.03
8    Sep        77       57          103          40               0.17
9    Oct        73       54           96          34               0.81
10   Nov        64       48           84          30               1.70
11   Dec        58       42           73          21               2.56

In [4]:
# Example 3 print first 5 or 3 last rows of df
header ('3 . df.head()')
print(df.head())
header("3. df.tail(3)")
print(df.tail(3))


--------------------------------------------------
[ 3 . df.head() ]
  month  avg_high  avg_low  record_high  record_low  avg_precipitation
0   Jan        58       42           74          22               2.95
1   Feb        61       45           78          26               3.02
2   Mar        65       48           84          25               2.34
3   Apr        67       50           92          28               1.02
4   May        71       53           98          35               0.48
--------------------------------------------------
[ 3. df.tail(3) ]
   month  avg_high  avg_low  record_high  record_low  avg_precipitation
9    Oct        73       54           96          34               0.81
10   Nov        64       48           84          30               1.70
11   Dec        58       42           73          21               2.56


In [5]:
# Example 4 printing the data types, index, columns, values

header("4. df.dtypes")
print(df.dtypes)
print(df.index)
print(df.columns)
print(df.values)

--------------------------------------------------
[ 4. df.dtypes ]
month                 object
avg_high               int64
avg_low                int64
record_high            int64
record_low             int64
avg_precipitation    float64
dtype: object
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')
Index(['month', 'avg_high', 'avg_low', 'record_high', 'record_low',
       'avg_precipitation'],
      dtype='object')
[['Jan' 58 42 74 22 2.95]
 ['Feb' 61 45 78 26 3.02]
 ['Mar' 65 48 84 25 2.34]
 ['Apr' 67 50 92 28 1.02]
 ['May' 71 53 98 35 0.48]
 ['Jun' 75 56 107 41 0.11]
 ['Jul' 77 58 105 44 0.0]
 ['Aug' 77 59 102 43 0.03]
 ['Sep' 77 57 103 40 0.17]
 ['Oct' 73 54 96 34 0.81]
 ['Nov' 64 48 84 30 1.7]
 ['Dec' 58 42 73 21 2.56]]


In [6]:
# Example no 5 statistical data
header("5. df.describe()")
print(df.describe())

--------------------------------------------------
[ 5. df.describe() ]
        avg_high    avg_low  record_high  record_low  avg_precipitation
count  12.000000  12.000000    12.000000   12.000000          12.000000
mean   68.583333  51.000000    91.333333   32.416667           1.265833
std     7.366488   6.060303    12.323911    8.240238           1.186396
min    58.000000  42.000000    73.000000   21.000000           0.000000
25%    63.250000  47.250000    82.500000   25.750000           0.155000
50%    69.000000  51.500000    94.000000   32.000000           0.915000
75%    75.500000  56.250000   102.250000   40.250000           2.395000
max    77.000000  59.000000   107.000000   44.000000           3.020000
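
`describe()` reports every statistic at once. When only a few numbers are needed, single aggregates can be requested directly. The sketch below assumes a made-up four-row subset of the weather table (`weather` is a hypothetical name, not a variable from this notebook):

```python
import pandas as pd

# Hypothetical four-row subset of the weather table
weather = pd.DataFrame({'avg_high': [58, 61, 65, 67],
                        'avg_precipitation': [2.95, 3.02, 2.34, 1.02]})

# One statistic for one column
mean_high = weather['avg_high'].mean()

# Different statistics per column; combinations that were not requested are NaN
summary = weather.agg({'avg_high': ['min', 'max'],
                       'avg_precipitation': ['mean']})

print(mean_high)
print(summary)
```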


In [7]:
# example 6 

header("6. df.sort_values('record_high', ascending=False)")
# ascending takes the boolean False, not the string 'False'
print(df.sort_values('record_high', ascending=False))

--------------------------------------------------
[ 6. df.sort_values('record_high', ascending=False) ]
   month  avg_high  avg_low  record_high  record_low  avg_precipitation
5    Jun        75       56          107          41               0.11
6    Jul        77       58          105          44               0.00
8    Sep        77       57          103          40               0.17
7    Aug        77       59          102          43               0.03
4    May        71       53           98          35               0.48
9    Oct        73       54           96          34               0.81
3    Apr        67       50           92          28               1.02
2    Mar        65       48           84          25               2.34
10   Nov        64       48           84          30               1.70
1    Feb        61       45           78          26               3.02
0    Jan        58       42           74          22               2.95
11   Dec        58       42           73          21               2.56

In [8]:
# example 7 slicing records

print(df.avg_low) # index with single column

print(df['avg_low'])# same as above

print(df[2:4]) #rows 2 and 3

print(df[['avg_low', 'avg_high']])

print(df.iloc[3:5, [0,3]])

0     42
1     45
2     48
3     50
4     53
5     56
6     58
7     59
8     57
9     54
10    48
11    42
Name: avg_low, dtype: int64
0     42
1     45
2     48
3     50
4     53
5     56
6     58
7     59
8     57
9     54
10    48
11    42
Name: avg_low, dtype: int64
  month  avg_high  avg_low  record_high  record_low  avg_precipitation
2   Mar        65       48           84          25               2.34
3   Apr        67       50           92          28               1.02
    avg_low  avg_high
0        42        58
1        45        61
2        48        65
3        50        67
4        53        71
5        56        75
6        58        77
7        59        77
8        57        77
9        54        73
10       48        64
11       42        58
  month  record_high
3   Apr           92
4   May           98
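
The slicing cell uses the position-based `iloc`. Its label-based counterpart is `loc`, and the two treat slice end points differently. A minimal sketch, assuming a made-up three-row frame `temps` indexed by month name (not part of the notebook's data):

```python
import pandas as pd

# Hypothetical mini frame, indexed by month name instead of 0..n-1
temps = pd.DataFrame(
    {'avg_high': [58, 61, 65], 'avg_low': [42, 45, 48]},
    index=['Jan', 'Feb', 'Mar'])

# loc selects by label; a label slice includes both end points
by_label = temps.loc['Jan':'Feb', ['avg_low']]

# iloc selects by integer position; the end position is excluded
by_position = temps.iloc[0:2, [1]]

# Both selections contain the rows Jan and Feb of column avg_low
print(by_label.equals(by_position))
```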


In [9]:
# 8. filtering

header('8. df[df.avg_precipitation > 1.0]') # filter on column values
print(df[df.avg_precipitation > 1.0])
# filter on month values
print(df[df['month'].isin(['Jun', 'Jul', 'Aug'])])

--------------------------------------------------
[ 8. df[df.avg_precipitation > 1.0] ]
   month  avg_high  avg_low  record_high  record_low  avg_precipitation
0    Jan        58       42           74          22               2.95
1    Feb        61       45           78          26               3.02
2    Mar        65       48           84          25               2.34
3    Apr        67       50           92          28               1.02
10   Nov        64       48           84          30               1.70
11   Dec        58       42           73          21               2.56
  month  avg_high  avg_low  record_high  record_low  avg_precipitation
5   Jun        75       56          107          41               0.11
6   Jul        77       58          105          44               0.00
7   Aug        77       59          102          43               0.03


In [10]:
#9 assignment (similar to slicing)

df.loc[9, ['avg_precipitation']] = 101.3
print(df.iloc[9:11])

   month  avg_high  avg_low  record_high  record_low  avg_precipitation
9    Oct        73       54           96          34              101.3
10   Nov        64       48           84          30                1.7


In [11]:
# renaming columns 
df.columns=['mon', 'av_hi', 'av_lo', 'rec_hi', 'rec_lo', 'av_rain']
print(df.head())


   mon  av_hi  av_lo  rec_hi  rec_lo  av_rain
0  Jan     58     42      74      22     2.95
1  Feb     61     45      78      26     3.02
2  Mar     65     48      84      25     2.34
3  Apr     67     50      92      28     1.02
4  May     71     53      98      35     0.48
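
Assigning to `df.columns` replaces every column name at once and needs the full list in the right order. To rename only some columns, `rename` with a mapping is an alternative. A small sketch, assuming a made-up two-column frame `frame`:

```python
import pandas as pd

# Hypothetical frame with two of the weather columns
frame = pd.DataFrame({'month': ['Jan', 'Feb'], 'avg_high': [58, 61]})

# rename returns a new DataFrame; columns missing from the mapping keep their name
renamed = frame.rename(columns={'avg_high': 'av_hi'})

print(list(frame.columns))
print(list(renamed.columns))
```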


In [12]:
# 11. Iterate a dataframe
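
The iteration cell is empty in the source. As a minimal sketch of what row iteration can look like, assuming a made-up three-row frame `months`: `iterrows()` yields the index label plus the row as a Series, and `itertuples()` yields one namedtuple per row.

```python
import pandas as pd

# Hypothetical three-row frame
months = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'],
                       'avg_high': [58, 61, 65]})

# iterrows() yields (index label, row as a Series) pairs
for idx, row in months.iterrows():
    print(idx, row['month'], row['avg_high'])

# itertuples() yields namedtuples; fields are accessed as attributes
labels = [f'{t.month}: {t.avg_high}' for t in months.itertuples(index=False)]
print(labels)
```

For calculations on whole columns, vectorized expressions such as `months['avg_high'] * 2` are usually preferable to an explicit loop.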

In [13]:
# write to a csv file
df.to_csv('foo.csv')
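
By default `to_csv` also writes the index as an unnamed first column, which comes back as an extra column when the file is read again. The sketch below shows a round trip with `index=False`. It assumes a made-up frame `frame` and writes to an in-memory buffer instead of a file so that it runs anywhere.

```python
import io
import pandas as pd

# Hypothetical two-row frame
frame = pd.DataFrame({'mon': ['Jan', 'Feb'], 'av_rain': [2.95, 3.02]})

# Write without the index column; StringIO stands in for a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the text back and compare with the original
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(frame))
```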

## 2. Import CSV files

In [14]:
# import csv file and read it into pandas dataframe

import pandas as pd

fietsdata=pd.read_csv('mijntestcsvfietsen.csv')
fietsdata.head()

Unnamed: 0,latitude in degress (fractional value),f2 = longitude in degress (fractional value),d3 = cumulative time in seconds (decimal value),f4 = cumulative distance in miles (fractional value),f5 = instantaneous speed in mph (fractional value),d6 = heart rate in bpm (decimal value),d7 = cadence in rpm (decimal value),d8 = power in watts (decimal value),f9 = elevation in metres (fractional value)
0,52.358598,4.862353,1,0.0,8.0,99,255.0,65535,17
1,52.358601,4.862423,2,0.01,7.6,100,255.0,65535,17
2,52.358606,4.862485,3,0.01,5.6,101,255.0,65535,17
3,52.358605,4.86254,4,0.01,5.7,102,255.0,65535,17
4,52.358603,4.862583,5,0.01,2.9,102,255.0,65535,17
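
Long column names like the ones above are awkward to type. One option is to replace them while reading. The sketch below assumes a made-up CSV text with three columns (it is not the content of `mijntestcsvfietsen.csv`) and hypothetical short names:

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file with long column names
raw = io.StringIO(
    "latitude in degrees,longitude in degrees,heart rate in bpm\n"
    "52.358598,4.862353,99\n"
    "52.358601,4.862423,100\n")

# header=0 marks the first line as the header row to discard;
# names supplies the replacement column names
rides = pd.read_csv(raw, header=0, names=['lat', 'lon', 'heart_rate'])

print(rides)
```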


## 3. Data cleansing


In [15]:
import pandas as pd
import numpy as np
# example dataset of 200 normally distributed values
df = pd.DataFrame({'Data': np.random.normal(size=200)})

# keep only the rows within 3 standard deviations of the mean of column 'Data';
# assign the result, otherwise the filtered frame is discarded
df = df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]
df

# or, if you prefer the other way around:
#df = df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]

Unnamed: 0,Data
0,0.319138
1,0.264166
2,-0.198500
3,1.479193
4,-0.497935
...,...
195,0.170118
196,-0.578287
197,1.023693
198,0.912940
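
With purely random data the 3-sigma filter often removes nothing, so its effect is hard to see. The sketch below makes it visible by planting one obvious outlier. The seed, the outlier value 50.0 and the names `data`, `z` and `cleaned` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable
rng = np.random.default_rng(seed=0)
data = pd.DataFrame({'Data': rng.normal(size=200)})

# Plant an obvious outlier in the first row
data.loc[0, 'Data'] = 50.0

# z-score: distance from the mean in units of standard deviation
z = (data['Data'] - data['Data'].mean()) / data['Data'].std()

# Keep only the rows within 3 standard deviations of the mean
cleaned = data[z.abs() <= 3]

print(len(data), len(cleaned))
```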


In [16]:
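# replace one specific value with NaN
# (np.nan is the supported spelling; the alias np.NaN was removed in NumPy 2.0)
# note: the value must match exactly, and the displayed numbers are rounded

new_df=df.replace(2.196349, np.nan)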
new_df.head()

Unnamed: 0,Data
0,0.319138
1,0.264166
2,-0.1985
3,1.479193
4,-0.497935
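
Replacing a value with NaN is usually a first step; the gaps are then counted, dropped or filled. A minimal sketch, assuming a made-up column in which -999.0 marks a missing reading (`readings` and the sentinel value are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999.0 stands for "no measurement"
readings = pd.DataFrame({'Data': [0.5, -999.0, 1.5, -999.0]})

# Turn the sentinel into NaN so pandas treats it as missing
with_gaps = readings.replace(-999.0, np.nan)

# Count the missing values
n_missing = int(with_gaps['Data'].isna().sum())

# Either drop the incomplete rows ...
dropped = with_gaps.dropna()

# ... or fill them, here with the mean of the remaining values
filled = with_gaps.fillna(with_gaps['Data'].mean())

print(n_missing)
print(filled)
```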


## 4. Pandas append data

source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html

In [2]:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df

Unnamed: 0,A,B
0,1,2
1,3,4


In [4]:
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, df2], ignore_index=True)

Unnamed: 0,A,B
0,1,2
1,3,4
2,5,6
3,7,8


In [9]:
# create an empty dataframe
empty_df = pd.DataFrame(columns = ['Name' , 'Age', 'City' , 'Country'])
empty_df

Unnamed: 0,Name,Age,City,Country
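
An empty DataFrame like this is often meant to be filled row by row. Growing a DataFrame inside a loop copies the data on every step, so a common alternative is to collect the rows in a plain list and build the DataFrame once. The sketch below assumes two made-up records:

```python
import pandas as pd

columns = ['Name', 'Age', 'City', 'Country']

# Hypothetical input records
records = [('Ann', 31, 'Utrecht', 'NL'),
           ('Bob', 45, 'Leeds', 'UK')]

# Collect each row as a dict in an ordinary Python list
rows = []
for name, age, city, country in records:
    rows.append({'Name': name, 'Age': age, 'City': city, 'Country': country})

# Build the DataFrame in one step; columns fixes the column order
people = pd.DataFrame(rows, columns=columns)

print(people)
```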


## 5. Selecting values

In [4]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(5, 15, (10, 3)), columns=list('abc'))
df

Unnamed: 0,a,b,c
0,13,10,9
1,8,7,6
2,13,14,12
3,13,8,8
4,14,7,9
5,14,13,13
6,12,8,12
7,11,10,10
8,11,14,6
9,9,14,8


In [7]:
df = df[df.b > 10]

In [8]:
df

Unnamed: 0,a,b,c
2,13,14,12
5,14,13,13
8,11,14,6
9,9,14,8
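
Several conditions can be combined in one selection. Each condition needs its own parentheses, with `&` for "and" and `|` for "or". The sketch below uses fixed values copied from the first five rows of the random table shown earlier in this section, so the result is repeatable (`sample` is a hypothetical name):

```python
import pandas as pd

# First five rows of the random table, as fixed values
sample = pd.DataFrame({'a': [13, 8, 13, 13, 14],
                       'b': [10, 7, 14, 8, 7],
                       'c': [9, 6, 12, 8, 9]})

# Rows where b is above 7 AND c is below 10
both = sample[(sample.b > 7) & (sample.c < 10)]

# Rows where a is above 13 OR b is above 10
either = sample[(sample.a > 13) | (sample.b > 10)]

# query() expresses the first selection as a string
queried = sample.query('b > 7 and c < 10')

print(both)
print(either)
```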
