In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../../../datasets/attacks.csv", encoding="latin-1")

In [3]:
activity = df["Activity"].value_counts()

In [4]:
activity[activity>100]

Surfing         971
Swimming        869
Fishing         431
Spearfishing    333
Bathing         162
Wading          149
Diving          127
Name: Activity, dtype: int64

In [5]:
df[df["Activity"].isin(activity[activity>100].index)]
# This is equivalent to 
# df.loc[df["Activity"].isin(activity[activity>100].index)]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,...,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,...,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.08,2018.06.08,6300.0,,
6,2018.06.03.a,03-Jun-2018,2018.0,Unprovoked,BRAZIL,Pernambuco,"Piedade Beach, Recife",Swimming,Jose Ernesto da Silva,M,...,Tiger shark,"Diario de Pernambuco, 6/4/2018",2018.06.03.a-daSilva.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.06.03.a,2018.06.03.a,6297.0,,
7,2018.05.27,27-May-2018,2018.0,Unprovoked,USA,Florida,"Lighhouse Point Park, Ponce Inlet, Volusia County",Fishing,male,M,...,"Lemon shark, 3'","K. McMurray, TrackingSharks.com",2018.05.27-Ponce.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.27,2018.05.27,6296.0,,
12,2018.05.13.b,13-May-2018,2018.0,Unprovoked,USA,South Carolina,"Hilton Head Island, Beaufort County",Swimming,Jei Turrell,M,...,,"C. Creswell, GSAF & K. McMurray TrackingSharks...",2018.05.13.b-Turrell.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.13.b,2018.05.13.b,6291.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6295,ND.0007,Before 1906,0.0,Unprovoked,AUSTRALIA,,,Fishing,fisherman,M,...,Blue pointers,"NY Sun, 9/9/1906, referring to account by Loui...",ND-0007 - Fisherman-Australia.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0007,ND.0007,8.0,,
6296,ND.0006,Before 1906,0.0,Unprovoked,AUSTRALIA,New South Wales,,Swimming,Arab boy,M,...,Said to involve a grey nurse shark that leapt ...,"L. Becke in New York Sun, 9/9/1906; L. Schultz...",ND-0006-ArabBoy-Prymount.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0006,ND.0006,7.0,,
6297,ND.0005,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,...,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0005,ND.0005,6.0,,
6299,ND.0003,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,...,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0003,ND.0003,4.0,,


### When is the `.loc` not necessary?

As we have seen, the same `df[...]` syntax serves many purposes.

We can use it to select columns, to filter rows with a list (or Series) of True and False values, etc.

All of these work fine without `.loc`.

So, most of the time, `.loc` is optional.

However, in some cases we want pandas to do something specific, such as selecting by both a row label and a column label. There we must use the `.loc` indexer to make sure pandas understands what we want. For example:

- `df[0,"Country"]`

Pandas would look for a single column whose label is the tuple `(0, "Country")` and, since no such column exists here, raise a `KeyError`.

- `df.loc[0,"Country"]`

Pandas will give us the element at row (name) 0, column "Country".
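
The difference between the two forms can be reproduced on a small frame. This is a minimal sketch: `toy` is a hypothetical two-row DataFrame made up for illustration, not the shark dataset.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the two indexing forms
toy = pd.DataFrame({"Country": ["USA", "SPAIN"],
                    "Activity": ["Surfing", "Swimming"]})

# Plain brackets treat the tuple (0, "Country") as ONE column label
try:
    toy[0, "Country"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# .loc reads the same pair as (row label, column label)
value = toy.loc[0, "Country"]

print(raised_key_error)
print(value)
```
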

## Groupby

A groupby is an operation that collapses the rows of a DataFrame that share the same value in a column: not by dropping rows, but by `reducing` the rows that have a repeated value into a single row.

This means that if we group by column `X`, the resulting DataFrame will only have 1 row for each unique value of `X`.

Let's try it with a sample DataFrame

In [6]:
df2 = pd.DataFrame(zip(["Pepe", "Maria", "Alba", "Juan", "Jorge", "Lola"], [7,6,5,8,5,8]), columns=["student","grade"])
df2

Unnamed: 0,student,grade
0,Pepe,7
1,Maria,6
2,Alba,5
3,Juan,8
4,Jorge,5
5,Lola,8


Now let's add some more grades for a few of the same students:

In [7]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df2 = pd.concat([df2, pd.DataFrame([{"student":"Pepe","grade":8},{"student":"Juan","grade":7},{"student":"Lola","grade":8},{"student":"Alba","grade":4}])])

In [8]:
df2

Unnamed: 0,student,grade
0,Pepe,7
1,Maria,6
2,Alba,5
3,Juan,8
4,Jorge,5
5,Lola,8
0,Pepe,8
1,Juan,7
2,Lola,8
3,Alba,4


In [9]:
df2.shape

(10, 2)

In [10]:
len(df2["student"].unique())

6

In [11]:
df2.groupby(by="student")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x118bbee80>

As we can see, the result is not quite what we expected yet: we get a `DataFrameGroupBy` object rather than a DataFrame. That is because we have not yet specified how we want the `reducing` to occur. There are a few different ways to achieve this in pandas.
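
Before choosing a reduction, it can help to look at what the grouped object holds. The sketch below iterates over the groups of a small frame; `grades` is a hypothetical stand-in with the same columns as `df2`.

```python
import pandas as pd

# Hypothetical data with the same columns as df2
grades = pd.DataFrame({"student": ["Pepe", "Juan", "Pepe", "Juan", "Maria"],
                       "grade": [7, 8, 8, 7, 6]})

grouped = grades.groupby(by="student")

# Each iteration yields the group key and the sub-DataFrame of its rows
for name, rows in grouped:
    print(name, rows["grade"].tolist())

# Number of rows that would be reduced into each output row
group_sizes = grouped.size()
print(group_sizes)
```

Nothing is reduced at this point: the grouped object only records which rows belong to which key.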

We can use built-in methods such as `sum`, `mean`, `max`, `min` when we want the same function applied to all columns:

In [12]:
display(df2.groupby(by="student").mean())
display(df2.groupby(by="student").min())
display(df2.groupby(by="student").std())

student,grade
Alba,4.5
Jorge,5.0
Juan,7.5
Lola,8.0
Maria,6.0
Pepe,7.5


student,grade
Alba,4
Jorge,5
Juan,7
Lola,8
Maria,6
Pepe,7


student,grade
Alba,0.707107
Jorge,
Juan,0.707107
Lola,0.0
Maria,
Pepe,0.707107
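
The empty values for Jorge and Maria in the `std` table come from groups with a single row: the sample standard deviation (`ddof=1`, the pandas default) is undefined for one observation, so pandas returns `NaN`. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical data: Alba has two grades, Jorge only one
grades = pd.DataFrame({"student": ["Alba", "Alba", "Jorge"],
                       "grade": [5, 4, 5]})

by_student = grades.groupby("student")["grade"]

sample_std = by_student.std()            # ddof=1: NaN for single-row groups
population_std = by_student.std(ddof=0)  # ddof=0: 0.0 for single-row groups

print(sample_std)
print(population_std)
```
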


Or we can specify the function ourselves with the `.agg` method, using either a dictionary format or a keyword-tuple format. In the keyword format, the keyword becomes the name of the newly generated column.

This is very useful when we want different things done to different columns. So let's add a new column to see how it would work.

In [13]:
import numpy as np

In [14]:
df2["rd"]=[np.random.randint(0,10,3) for _ in range(df2.shape[0])]
df2

Unnamed: 0,student,grade,rd
0,Pepe,7,"[2, 7, 8]"
1,Maria,6,"[8, 5, 3]"
2,Alba,5,"[2, 8, 2]"
3,Juan,8,"[2, 2, 7]"
4,Jorge,5,"[8, 1, 8]"
5,Lola,8,"[7, 5, 8]"
0,Pepe,8,"[0, 5, 3]"
1,Juan,7,"[3, 7, 1]"
2,Lola,8,"[9, 0, 5]"
3,Alba,4,"[1, 2, 0]"


In [15]:
df2.groupby(by="student").agg({"grade":"mean",
                               "rd":lambda args: [el for lst in args for el in lst]})

student,grade,rd
Alba,4.5,"[2, 8, 2, 1, 2, 0]"
Jorge,5.0,"[8, 1, 8]"
Juan,7.5,"[2, 2, 7, 3, 7, 1]"
Lola,8.0,"[7, 5, 8, 9, 0, 5]"
Maria,6.0,"[8, 5, 3]"
Pepe,7.5,"[2, 7, 8, 0, 5, 3]"


#### With renaming

In [16]:
df2.groupby(by="student").agg(mean=("grade","mean"),
                              concat_list=("rd",lambda args: [el for lst in args for el in lst]))

student,mean,concat_list
Alba,4.5,"[2, 8, 2, 1, 2, 0]"
Jorge,5.0,"[8, 1, 8]"
Juan,7.5,"[2, 2, 7, 3, 7, 1]"
Lola,8.0,"[7, 5, 8, 9, 0, 5]"
Maria,6.0,"[8, 5, 3]"
Pepe,7.5,"[2, 7, 8, 0, 5, 3]"


### Shark example

In [17]:
act_country = df.groupby(by=["Country","Activity"]).agg(act_count=("Activity","count"))
act_country

Country,Activity,act_count
PHILIPPINES,USS Hoel DD 533 sunk on 10/24/1944 in the Battle off Samar. 2 crewmen were swimmng alongside a floater net &,1
TONGA,Five men on makeshift raft after their 10 m fishing boat capsized and sank in rough seas. Survivors rescued after 7.5 hours in the water,2
TONGA,Scuba diving,1
AFRICA,Jumped into river,1
ALGERIA,Swimming,1
...,...,...
YEMEN,Diving around anchored liner,2
YEMEN,Diving for coins,2
YEMEN,Standing,1
YEMEN,Swimming at side of small boat,1


In [18]:
act_country.sort_values("act_count",ascending=False)

Country,Activity,act_count
USA,Surfing,564
USA,Swimming,310
AUSTRALIA,Surfing,195
AUSTRALIA,Swimming,156
USA,Fishing,115
...,...,...
INDIAN OCEAN,H.M.S. Cornwall & H.M.S.Dorsetshire sunk by Japanese dive bombers. Officers & men in the water formed a circle with 60 of their dead in the center for 36 hours,1
INDIAN OCEAN,Fell overboard,1
INDIAN OCEAN,Adrift on a 4' raft for 32 days,1
INDIA,Washing a dog,1


In [19]:
# The top 5 combinations "Country-Activity":
act_country.sort_values("act_count",ascending=False).iloc[:5]

Country,Activity,act_count
USA,Surfing,564
USA,Swimming,310
AUSTRALIA,Surfing,195
AUSTRALIA,Swimming,156
USA,Fishing,115
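
As an alternative to sorting and then slicing, `DataFrame.nlargest` can pick the top rows directly. The sketch below compares both on a hypothetical four-row frame whose counts are copied from the outputs above.

```python
import pandas as pd

# Hypothetical subset; the counts are taken from the outputs shown earlier
counts = pd.DataFrame({
    "Country": ["USA", "USA", "AUSTRALIA", "SPAIN"],
    "Activity": ["Surfing", "Swimming", "Surfing", "Swimming"],
    "act_count": [564, 310, 195, 15],
}).set_index(["Country", "Activity"])

# Sort, then slice the first rows
top2_sorted = counts.sort_values("act_count", ascending=False).iloc[:2]

# Same selection in one call
top2_nlargest = counts.nlargest(2, "act_count")

print(top2_nlargest)
```
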


In [20]:
# The top 5 activities in Spain:
# Note: `.loc` is needed to select rows by label in the MultiIndex
act_country.loc["SPAIN"].sort_values("act_count",ascending=False).iloc[:5]

Activity,act_count
Swimming,15
Bathing,4
Fishing,4
Diving,2
Skin diving,2


In [21]:
act_country.index

MultiIndex([(              ' PHILIPPINES', ...),
            (                    ' TONGA', ...),
            (                    ' TONGA', ...),
            (                    'AFRICA', ...),
            (                   'ALGERIA', ...),
            (            'AMERICAN SAMOA', ...),
            (            'AMERICAN SAMOA', ...),
            (            'AMERICAN SAMOA', ...),
            ('ANDAMAN / NICOBAR ISLANDAS', ...),
            (           'ANDAMAN ISLANDS', ...),
            ...
            (               'WEST INDIES', ...),
            (               'WEST INDIES', ...),
            (             'WESTERN SAMOA', ...),
            (                     'YEMEN', ...),
            (                     'YEMEN', ...),
            (                    'YEMEN ', ...),
            (                    'YEMEN ', ...),
            (                    'YEMEN ', ...),
            (                    'YEMEN ', ...),
            (                    'YEMEN ', ...)],
   

In [22]:
act_country.loc["BRAZIL"]

Activity,act_count
Attempting to catch a crocodile,1
Bathing,3
Batin,1
Body boarding,4
Boogie boarding,1
Cleaning fish,1
Diving,1
Fell overboard from the steamship Chala,1
Fishing,3
Fishing boat swamped in a storm,1


In [23]:
act_country.loc[["USA"]]

Country,Activity,act_count
USA,,1
USA,"""Riding waves on a board""",1
USA,"""Swimming vigorously""",1
USA,"16' catamaran capsized previous night, occupants stayed with wreckage until morning, then attempted to swim ashore",1
USA,25-foot cabin cruiser Happy Jack sank in heavy seas,1
USA,...,...
USA,Wreck of the schooner Pohoiki,3
USA,"Yacht Gooney Bird foundered, 4 survivors on raft",1
USA,"Zosimo & his son, Jeffrey Popa, failed to return from overnight fishing trip in a 14' boat, Boat apparently sank, debris recovered but his son & boat were never found",1
USA,fishing boat exploded & sank,2


## .isin()

If we want to check whether each element of a given column is present in a list, tuple, set, etc. (anything that works with the `in` operator, e.g. `7 in range(2,8)`):

#### DON'T:

`df[col] in lst`

##### Why not? 

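It would make the syntax ambiguous: it reads as if we were checking whether the Series that represents the column, as a whole, is present in the list, rather than checking each of its elements one by one. In practice pandas usually refuses to guess and raises a `ValueError` (the truth value of a Series is ambiguous) when the list is not empty.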

##### Solution:

To avoid this type of confusion and still allow for an easy use of this operation, the developers of the pandas library wrote the `.isin` method.

#### DO:

`df[col].isin(lst)`


In [24]:
df["Year"].isin(range(1990,2021))

0         True
1         True
2         True
3         True
4         True
         ...  
25718    False
25719    False
25720    False
25721    False
25722    False
Name: Year, Length: 25723, dtype: bool
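
The contrast between `in` and `.isin` can be reproduced on a small Series. This is a sketch under assumptions: `years` and `wanted` are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical values, only to illustrate the two behaviours
years = pd.Series([2018.0, 1985.0, 2005.0])
wanted = [2005, 2018]

# `in` compares the whole Series against each list element and then
# needs a single True/False, which pandas refuses to guess
try:
    years in wanted
    in_raised = False
except ValueError:
    in_raised = True

# .isin checks each element separately and returns a boolean mask
mask = years.isin(wanted)

print(in_raised)
print(years[mask])
```

The resulting mask can be passed straight to `df[...]`, exactly like the comparisons used for filtering earlier.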

## .str

Similarly, the `.str` accessor is another tool provided by the pandas team. It lets us call the string methods (`lower`, `upper`, `split`, and many more) on all the elements of a Series, treating those elements as strings.

This is roughly equivalent to doing an `apply` over the Series that calls the method on each string value (missing values stay as `NaN`), but it allows for a cleaner and more direct syntax. Let's take a look at some examples.
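
The rough equivalence with `apply` can be checked on a small Series. In this sketch `countries` is a hypothetical Series, and the `isinstance` guard is an assumption about how one might write the manual version so that missing values are skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical values, including one missing entry
countries = pd.Series(["USA", "new zealand", np.nan])

# Vectorised string method through the .str accessor
via_str = countries.str.title()

# Manual version: call the method only on actual strings
via_apply = countries.apply(lambda v: v.title() if isinstance(v, str) else v)

print(via_str)
print(via_apply)
```
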

In [25]:
df["Country"].str.title()

0              Usa
1              Usa
2              Usa
3        Australia
4           Mexico
           ...    
25718          NaN
25719          NaN
25720          NaN
25721          NaN
25722          NaN
Name: Country, Length: 25723, dtype: object
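
`.str` methods can also be chained, which is handy for cleaning labels. The index printed earlier shows `' PHILIPPINES'` and `'YEMEN '` with stray spaces, so `'YEMEN'` and `'YEMEN '` end up as separate groups. A minimal sketch of normalising such values; `raw` is a hypothetical Series, and `"Fiji"` is an invented entry added only to show the case change.

```python
import pandas as pd

# Hypothetical sample; the first three values mimic labels seen in the index above
raw = pd.Series([" PHILIPPINES", "YEMEN", "YEMEN ", "Fiji"])

# Remove surrounding whitespace, then unify the case
clean = raw.str.strip().str.upper()

n_raw = raw.nunique()      # 'YEMEN' and 'YEMEN ' count as different values
n_clean = clean.nunique()  # after cleaning they collapse into one

print(clean.tolist())
print(n_raw, n_clean)
```
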

In [56]:
import re
# Case-insensitive regex match; na=False marks missing countries as False
isla = df["Country"].str.match(".*island.*", flags=re.IGNORECASE, na=False)

In [57]:
df[isla]

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
90,2017.09.06.R,Reported 06-Sep-2017,2017.0,Unprovoked,SOLOMON ISLANDS,,Owarigi Island,Spearfishing,Bartholmew,M,...,,BBC,2017.09.06-Bartholomew.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2017.09.06.R,2017.09.06.R,6213.0,,
306,2016.02.10.R,Reported 10-Feb-2016,2016.0,Invalid,CAYMAN ISLANDS,Grand Cayman,Stingray City Bar,Feeding stingrays?,Richard Branson,M,...,No shark involvement,R. Branson,2016.02.10.R-Branson-stingray.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2016.02.10.R,2016.02.10.R,5997.0,,
578,2014.03.13,13-Mar-2014,2014.0,Invalid,CAYMAN ISLANDS,,,Scuba diving / culling lionfish,Jason Dimitri,M,...,Invalid,"You Tube, posted 4/12/2014",2014.03.13-Dimitri.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2014.03.13,2014.03.13,5725.0,,
993,2010.09.02,02-Sep-2010,2010.0,Unprovoked,SOLOMON ISLANDS,Western Province,,,Benjamin D'Emden,M,...,,"The Daily Telegraph, 9/3/2010",2010.09.02-D'Emden.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2010.09.02,2010.09.02,5310.0,,
1094,2009.09.01,01-Sep-2009,2009.0,Provoked,SOLOMON ISLANDS,Makira-Ulawa Province,"Kirakira, Makira Island (formerly San Cristobal)",Fishing,male,M,...,,"Solomon Star, 9/4/2009",2009.09.01-Kirakira.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2009.09.01,2009.09.01,5209.0,,
1301,2007.12.19,19-Dec-2007,2007.0,Invalid,BRITISH VIRGIN ISLANDS,Green Bay,,Scuba diving,Wayne Francis Johanning,M,...,Invalid,"C. Johannson, GSAF",2007.12.19-Johanning.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2007.12.19,2007.12.19,5002.0,,
1329,2007.09.17,17-Sep-2007,2007.0,Unprovoked,SOLOMON ISLANDS,Marovo Lagoon,Kicha Island,Spearfishing,Corey Howell,M,...,Gray reef shark,L. Choquette,2007.09.17-Corey.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2007.09.17,2007.09.17,4974.0,,
1608,2005.03.05,05-Mar-2005,2005.0,Unprovoked,SOLOMON ISLANDS,Santa Isabel Province,,Diving,male,M,...,,"Solomon Star, 3/9/2005",2005.03.05-SolomonIslands.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2005.03.05,2005.03.05,4695.0,,
2091,1999.11.00.b,Nov-1999,1999.0,Unprovoked,MARSHALL ISLANDS,Alinglaplap Atoll,Island J4H,Swimming,Dally Bayo,M,...,"Grey reef shark, 1.2 m [4']",www.svcherokee.com/pages/ Ailingilaplap.htm,1999.11.00.b-Bayo.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,1999.11.00.b,1999.11.00.b,4212.0,,
2092,1999.11.00.a,Nov-1999,1999.0,Unprovoked,MARSHALL ISLANDS,Alinglaplap Atoll,Island J4H,Swimming,Morson Daniel,M,...,"Grey reef shark, 1.2 m [4']",www.svcherokee.com/pages/ Ailingilaplap.htm,1999.11.00.a-Morson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,1999.11.00.a,1999.11.00.a,4211.0,,
