## Pandas

In [6]:
import pandas as pd

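A DataFrame is a table. It can be created from a dictionary (which looks much like JSON): each key becomes a column name, and each list holds that column's values.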

In [7]:
pd.DataFrame({'Andy': [10,20], 'Tim': [30,40]})

,Andy,Tim
0,10,30
1,20,40


In [8]:
# every column needs one value per index label, so both lists have two entries
pd.DataFrame({'Andy': ['Liked It', "Didn't Like It"], 'Time': ["Didn't Like It", "Didn't Like It"]}, index=['Item A', 'Item B'])

,Andy,Time
Item A,Liked It,Didn't Like It
Item B,Didn't Like It,Didn't Like It


Pandas also includes Series.

In [9]:
pd.Series([1,2,3,4,5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

Unlike a DataFrame, which is a two-dimensional table, a Series is a one-dimensional array. It can be given its own index labels:

In [10]:
pd.Series([80,95,92], index=['Math', 'English', 'Science'])

Math       80
English    95
Science    92
dtype: int64
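
As a small sketch of what the index on a Series is for, the snippet below looks up a value by its label and checks that a single DataFrame column is itself a Series. The variable names `grades` and `df` are made up for illustration; the data comes from the cells above.

```python
import pandas as pd

# Hypothetical variable names; the values mirror the cells above.
grades = pd.Series([80, 95, 92], index=['Math', 'English', 'Science'])

# Look up one value by its index label.
print(grades['English'])  # 95

# A single column taken from a DataFrame is a Series.
df = pd.DataFrame({'Andy': [10, 20], 'Tim': [30, 40]})
print(isinstance(df['Andy'], pd.Series))  # True
```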

Although creating Series or DataFrames by hand is useful for small projects, we will often find ourselves working with large datasets.

In that case, we'll need to read the dataset from a file using Pandas.

In [11]:
test_scores = pd.read_csv("./CalculatedGrades.csv")

In [12]:
test_scores.head()

,Name,Age,English,Maths,TotalScore,Percentage
0,Tom,20,103,38,141,44.06
1,Tom,7,52,158,210,65.62
2,Tony,3,110,18,128,40.0
3,Tom,3,92,151,243,75.94
4,Tony,2,103,155,258,80.62


The .shape attribute shows us the dimensions of a DataFrame or Series as a tuple. For a DataFrame, this is (rows, columns).

In [13]:
test_scores.shape

(100, 6)
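
To make the tuple concrete, here is a minimal sketch using the small two-column DataFrame from earlier rather than the CSV file. The names `n_rows`, `n_cols` and `series_shape` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Andy': [10, 20], 'Tim': [30, 40]})

# For a DataFrame, shape unpacks into (rows, columns).
n_rows, n_cols = df.shape

# A Series has a single dimension, so its shape is a one-element tuple.
series_shape = df['Andy'].shape

print(n_rows, n_cols, series_shape)  # 2 2 (2,)
```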

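Luckily the table has been properly indexed, but if we would like to use one of the columns as the index instead, we can pass **index_col** to read_csv. Column positions start at 0, so index_col = 1 selects the second column, Age.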

In [14]:
test_scores = pd.read_csv("./CalculatedGrades.csv", index_col = 1)
test_scores.head()

Age,Name,English,Maths,TotalScore,Percentage
20,Tom,103,38,141,44.06
7,Tom,52,158,210,65.62
3,Tony,110,18,128,40.0
3,Tom,92,151,243,75.94
2,Tony,103,155,258,80.62
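
The sketch below shows index_col on a tiny in-memory stand-in for the file. It assumes the CSV holds exactly the six columns shown by head(), and it only includes the first three rows. index_col also accepts a column name, which can be easier to read than a position.

```python
import io
import pandas as pd

# Stand-in for CalculatedGrades.csv (an assumption): just the first
# three rows shown above, with the six columns reported by .shape.
csv_text = """Name,Age,English,Maths,TotalScore,Percentage
Tom,20,103,38,141,44.06
Tom,7,52,158,210,65.62
Tony,3,110,18,128,40.0
"""

# Select the index column by position (0-based) or by name.
by_position = pd.read_csv(io.StringIO(csv_text), index_col=1)
by_name = pd.read_csv(io.StringIO(csv_text), index_col='Age')

print(by_position.index.name)       # Age
print(by_position.equals(by_name))  # True
```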


The table has a lot of data points, and sometimes we may want to isolate specific values/columns.

We can use [dataFrame].[column] to select a single column, which is returned as a Series.

In [15]:
test_scores.Name


Age
20     Tom
7      Tom
3     Tony
3      Tom
2     Tony
      ... 
0     Tony
3     John
11    Tony
6      Tom
22     Tom
Name: Name, Length: 100, dtype: object

It's also possible to use the [] operator to return a column (e.g. test_scores['Name']).

In [16]:
test_scores['Name'][0]

0    John
0    John
0    John
0    Tony
Name: Name, dtype: object

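test_scores['Name'] returns a Series, which is like an array. However, chaining a second pair of brackets (['Name'][<number>]) does not reliably retrieve a single cell. The number is treated as an index label, so with Age as the index, [0] returned every row whose Age is 0.
To retrieve a cell by its position, we need to use iloc (iloc = integer location).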

Another thing we can use is loc. The **difference between** loc and iloc is that iloc finds a value/cell by its **integer position**, while loc finds it by its **index label** (often a string, but it can be any value that appears in the index).
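
Here is a minimal sketch of the difference between position and label. The Series is a made-up stand-in that uses ages as an integer index with a repeated value, similar to test_scores; the names `names`, `first` and `age_zero` are illustrative only.

```python
import pandas as pd

# Made-up stand-in: an integer index (ages) containing a duplicate label.
names = pd.Series(['Tom', 'John', 'Tony', 'John'],
                  index=[20, 0, 3, 0], name='Name')

# iloc counts positions, so 0 means "the first row".
first = names.iloc[0]

# loc matches labels, so 0 means "every row whose index value is 0".
age_zero = names.loc[0]

print(first)           # Tom
print(list(age_zero))  # ['John', 'John']
```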

In [17]:
test_scores['Name'].iloc[0]

'Tom'

iloc is a bit more versatile because it can return a single value or multiple values.

In [18]:
test_scores['Name'].iloc[0:3]

Age
20     Tom
7      Tom
3     Tony
Name: Name, dtype: object

In [19]:
# can simply print out rows without filtering them
test_scores.iloc[0:3]

Age,Name,English,Maths,TotalScore,Percentage
20,Tom,103,38,141,44.06
7,Tom,52,158,210,65.62
3,Tony,110,18,128,40.0


In [20]:
# can filter by column
test_scores.iloc[0:5, 0:3]

Age,Name,English,Maths
20,Tom,103,38
7,Tom,52,158
3,Tony,110,18
3,Tom,92,151
2,Tony,103,155


In [21]:
test_scores.iloc[0:5, [0,2,3]]

Age,Name,Maths,TotalScore
20,Tom,38,141
7,Tom,158,210
3,Tony,18,128
3,Tom,151,243
2,Tony,155,258


**Excerpt about iloc and loc**
> Choosing between loc and iloc
> When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.
>
> iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.
>
> Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples', 'Potatoet'] (t coming after s in the alphabet).
> 
> This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] return 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
>
> Otherwise, the semantics of using loc are the same as those for iloc.
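
The slicing "gotcha" from the excerpt can be checked with a short sketch. The DataFrame and the fruit Series below are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Made-up DataFrame with the default integer index 0..4.
df = pd.DataFrame({'value': range(5)})

# iloc excludes the end of the slice: positions 0, 1, 2.
print(len(df.iloc[0:3]))  # 3

# loc includes the end of the slice: labels 0, 1, 2, 3.
print(len(df.loc[0:3]))   # 4

# The same inclusive rule applies to string labels.
fruit = pd.Series([1, 2, 3], index=['Apples', 'Oranges', 'Potatoes'])
print(list(fruit.loc['Apples':'Oranges'].index))  # ['Apples', 'Oranges']
```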

**Conditional Selection**

Much like how we compare things in Python and other programming languages, the == operator can be used. Applied to a column, it compares every value and returns a Series of booleans.

In [23]:
test_scores.Name == "Tom"

Age
20     True
7      True
3     False
3      True
2     False
      ...  
0     False
3     False
11    False
6      True
22     True
Name: Name, Length: 100, dtype: bool

Passing this boolean Series to loc keeps only the rows where it is True.

In [27]:
test_scores.loc[test_scores.Name == "Tom"].head(3)

Age,Name,English,Maths,TotalScore,Percentage
20,Tom,103,38,141,44.06
7,Tom,52,158,210,65.62
3,Tom,92,151,243,75.94
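
The two steps can be sketched together on a small stand-in table. It assumes only the first five rows shown by head() and keeps the default integer index for simplicity. The names `scores`, `is_tom` and `toms` are illustrative.

```python
import pandas as pd

# Stand-in data: the first five rows shown by head(), default index.
scores = pd.DataFrame({
    'Name': ['Tom', 'Tom', 'Tony', 'Tom', 'Tony'],
    'English': [103, 52, 110, 92, 103],
    'Maths': [38, 158, 18, 151, 155],
})

# Step 1: the comparison produces a boolean Series (the mask).
is_tom = scores.Name == 'Tom'

# Step 2: loc keeps only the rows where the mask is True.
toms = scores.loc[is_tom]

# True counts as 1, so summing the mask counts the matching rows.
print(is_tom.sum())      # 3
print(list(toms.index))  # [0, 1, 3]
```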
