# Pandas tutorial

## What is pandas?

- A Python library
- Used for data cleaning, analysis and visualisation
- Well suited to many kinds of data, such as tabular and time-series data
- Built on NumPy and integrates well with other libraries

## Data types in Pandas

Pandas has two main data structures:

- Series
  - Essentially an array
  - One dimensional array-like object
  - Capable of holding any data type
  - Has an index
  
- DataFrames
  - Essentially multiple series in one table
  - Two dimensional tabular data structure
  - Capable of holding many data types
  - Index and columns
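
The two structures described above can be sketched with a few hand-written values. The names `prices` and `fruit` and the data are made up for illustration only.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "pear", "plum"])

# A DataFrame: several columns sharing one index; each column keeps its own dtype
fruit = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "in_stock": [True, False, True]},
    index=["apple", "pear", "plum"],
)

print(prices.index.tolist())  # the index labels of the Series
print(fruit.dtypes)           # one dtype per column
print(type(fruit["price"]))   # selecting a single column gives back a Series
```

Selecting one column of a DataFrame returns a Series, which is why a DataFrame can be thought of as several Series in one table.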

### Exploring Series

In [6]:
import pandas as pd
import numpy as np

# Create a series
series_01 = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])
series_01

a   -0.727250
b    0.960341
c    0.009022
d    0.832035
e    1.342431
dtype: float64

In [8]:
# Access a value by its index label
series_01['a']

-0.72725007863312252

In [9]:
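# Access a value by position; plain series_01[0] is deprecated for label-indexed Series in recent pandas
series_01.iloc[0]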

-0.72725007863312252
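
The two lookups above, by label and by position, can be made explicit with `.loc` and `.iloc`. This is a minimal sketch on a small hypothetical Series `s` with fixed values, so the results are easy to check by eye.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

by_label = s.loc["a"]          # label-based lookup
by_position = s.iloc[0]        # position-based lookup

label_slice = s.loc["a":"b"]   # label slices include the end label
pos_slice = s.iloc[0:1]        # positional slices exclude the end position

print(by_label, by_position)
print(len(label_slice), len(pos_slice))
```

Using `.loc` and `.iloc` avoids ambiguity about whether a key is meant as a label or as a position.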

### Exploring DataFrames

In [11]:
# Build a one-column DataFrame from the Series
df = pd.DataFrame(series_01, columns=['Column 1'])
df

   Column 1
a -0.727250
b  0.960341
c  0.009022
d  0.832035
e  1.342431


In [12]:
# Access column by name
df['Column 1']

a   -0.727250
b    0.960341
c    0.009022
d    0.832035
e    1.342431
Name: Column 1, dtype: float64

In [14]:
# Add new column
df['Column 2'] = df['Column 1'] * 4
df['Column 2']

a   -2.909000
b    3.841363
c    0.036088
d    3.328142
e    5.369724
Name: Column 2, dtype: float64

In [15]:
# Sort rows by the values in Column 2 (ascending by default)
df.sort_values(by='Column 2')

   Column 1  Column 2
a -0.727250 -2.909000
c  0.009022  0.036088
d  0.832035  3.328142
b  0.960341  3.841363
e  1.342431  5.369724
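
`sort_values` returns a new, sorted DataFrame and does not reorder the original. A short sketch on hypothetical data, also showing a descending sort and how `sort_index` restores the label order:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Column 1": [0.5, -1.0, 2.0], "Column 2": [2.0, -4.0, 8.0]},
    index=["a", "b", "c"],
)

descending = df_demo.sort_values(by="Column 2", ascending=False)  # largest first
restored = descending.sort_index()                                # back to a, b, c

print(descending)
print(df_demo.index.tolist())  # the original order is untouched
```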


In [18]:
# Select all rows where the value in Column 2 is less than or equal to 3 (Boolean indexing)
df[df['Column 2'] <= 3]

   Column 1  Column 2
a -0.727250 -2.909000
c  0.009022  0.036088
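
Boolean masks can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on hypothetical data:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Column 1": [-1.0, 0.0, 1.0, 2.0], "Column 2": [-4.0, 0.0, 4.0, 8.0]},
    index=["a", "b", "c", "d"],
)

# Rows where Column 2 <= 3 AND Column 1 >= 0
mask = (df_demo["Column 2"] <= 3) & (df_demo["Column 1"] >= 0)

selected = df_demo[mask]    # rows where the mask is True
excluded = df_demo[~mask]   # rows where the mask is False

print(selected)
print(excluded)
```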


In [19]:
# Apply an anonymous function to each column
df.apply(lambda x: min(x) + max(x))

Column 1    0.615181
Column 2    2.460724
dtype: float64
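
By default `apply` passes each column to the function. With `axis=1` it passes each row instead. The sketch below uses hypothetical data and also shows a vectorized equivalent of the column-wise case, which avoids the Python-level lambda.

```python
import pandas as pd

df_demo = pd.DataFrame({"Column 1": [1.0, 2.0, 3.0], "Column 2": [4.0, 8.0, 12.0]})

per_column = df_demo.apply(lambda col: col.min() + col.max())      # axis=0 is the default
per_row = df_demo.apply(lambda row: row.min() + row.max(), axis=1)  # one result per row

# Same result as per_column, using built-in column-wise reductions
vectorized = df_demo.min() + df_demo.max()

print(per_column)
print(per_row)
```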

In [21]:
# Find the mean of each column (np.mean(df) returns a single scalar in recent pandas versions)
df.mean()

Column 1    0.483316
Column 2    1.933263
dtype: float64
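
The mean can be taken along either axis. A small sketch on hypothetical data, with the overall mean computed on the underlying NumPy array:

```python
import pandas as pd

df_demo = pd.DataFrame({"Column 1": [1.0, 2.0, 3.0], "Column 2": [4.0, 8.0, 12.0]})

col_means = df_demo.mean()            # one mean per column
row_means = df_demo.mean(axis=1)      # one mean per row
overall = df_demo.to_numpy().mean()   # a single mean over every value

print(col_means)
print(row_means)
print(overall)
```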

In [23]:
# Describe our dataset
df.describe()

       Column 1  Column 2
count  5.000000  5.000000
mean   0.483316  1.933263
std    0.833316  3.333264
min   -0.727250 -2.909000
25%    0.009022  0.036088
50%    0.832035  3.328142
75%    0.960341  3.841363
max    1.342431  5.369724


In [34]:
# Read a CSV file without a header row into a DataFrame, supplying the column names
df = pd.read_csv('dist/ratings_Books.csv', names=['UserId', 'ItemId', 'Rating', 'Time'])
df.head()

           UserId  ItemId  Rating        Time
0   AH2L9G3DQHHAJ     116     4.0  1019865600
1  A2IIIDRK3PRRZY     116     1.0  1395619200
2  A1TADCM7YWPQ8M     868     4.0  1031702400
3   AWGH7V0BDOJKB   13714     4.0  1383177600
4  A3UTQPQPM4TQO0   13714     5.0  1374883200
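
The `Time` values look like Unix timestamps in seconds, although the source does not say so. Under that assumption, `pd.to_datetime` with `unit="s"` turns them into readable dates. The sketch inlines the first three rows shown above so that it runs without the CSV file. The column name `When` is made up for illustration.

```python
import io
import pandas as pd

# First rows of the ratings data, inlined in place of the real file
raw = io.StringIO(
    "AH2L9G3DQHHAJ,116,4.0,1019865600\n"
    "A2IIIDRK3PRRZY,116,1.0,1395619200\n"
    "A1TADCM7YWPQ8M,868,4.0,1031702400\n"
)

# names= supplies the column labels because the data has no header row
ratings = pd.read_csv(raw, names=["UserId", "ItemId", "Rating", "Time"])

# Assumption: Time holds seconds since the Unix epoch
ratings["When"] = pd.to_datetime(ratings["Time"], unit="s")

print(ratings.dtypes)
print(ratings[["Time", "When"]])
```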


In [30]:
# Count the non-null values in each column for every user
df.groupby('UserId').count()

                       ItemId  Rating  Time
UserId
A000023026XVLM97BM7KY       1       1     1
A00003322NZ9C82Y46DFN       1       1     1
A0000440NYTE2D2YB089        1       1     1
A00005181SC9PSCD58LCG       3       3     3
A000059624SSMC21VECIQ       1       1     1
A00006923FEAFJLE7GHEL       3       3     3
A00007762BKXYRMOCC0A        1       1     1
A00008222HA6HLYZNCWXK       2       2     2
A00008821J0F472NDY6A2       1       1     1
A000096625CHSNKYTYGZN       1       1     1
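
A grouped count repeats the same number in every column. One possible refinement is to aggregate a single column with several functions and then sort, for example to find the most active users. The sketch below uses a tiny hypothetical ratings table with made-up user IDs rather than the real file.

```python
import pandas as pd

ratings = pd.DataFrame(
    {
        "UserId": ["u1", "u2", "u1", "u3", "u1", "u2"],
        "ItemId": [116, 116, 868, 13714, 13714, 868],
        "Rating": [4.0, 1.0, 4.0, 4.0, 5.0, 3.0],
    }
)

# Number of ratings and mean rating per user
per_user = ratings.groupby("UserId")["Rating"].agg(["count", "mean"])

# Users with the most ratings first
most_active = per_user.sort_values("count", ascending=False)

print(most_active)
```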
