https://www.youtube.com/watch?v=oH3wYKvwpJ8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=23

Question: Could you explain how to read the pandas documentation?

https://pandas.pydata.org/pandas-docs/stable/api.html

Question: What is the difference between **ufo.isnull()** and **pd.isnull(ufo)**?

In [2]:
import pandas as pd

In [3]:
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv('http://bit.ly/uforeports')

In [4]:
# use 'isnull' as a top-level function
pd.isnull(ufo).head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,False,True,False,False,False
1,False,True,False,False,False
2,False,True,False,False,False
3,False,True,False,False,False
4,False,True,False,False,False


In [5]:
# equivalent: use 'isnull' as a DataFrame method
ufo.isnull().head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,False,True,False,False,False
1,False,True,False,False,False
2,False,True,False,False,False
3,False,True,False,False,False
4,False,True,False,False,False


Documentation for **isnull**

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html
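The two calls above give the same result on a DataFrame. One practical difference, sketched below, is that the top-level function also accepts inputs that have no `isnull` method, such as scalars and plain lists. The small `df` here is a made-up stand-in, not the real UFO dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame, used only for illustration
df = pd.DataFrame({
    'City': ['Ithaca', 'Holyoke', None],
    'Colors Reported': [np.nan, 'RED', np.nan],
})

# The DataFrame method and the top-level function agree on a DataFrame
same = df.isnull().equals(pd.isnull(df))

# The top-level function also works on a scalar and on a plain Python list
scalar_check = pd.isnull(np.nan)
list_check = pd.isnull(['a', None, np.nan])

print(same)
print(scalar_check)
print(list_check)
```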

Question: Why are DataFrame slices inclusive when using **.loc**, but exclusive when using **.iloc**?

In [12]:
ufo.loc[0:4, :]  # 'loc' returns the rows labeled 0 through 4 (stop label included)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


# LOC

**label-based** slicing is inclusive of the start and stop

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html

In [9]:
# position-based slicing is inclusive of the start and exclusive of the stop
ufo.iloc[0:4, :]  # 'iloc' returns positions 0 through 3 (NumPy syntax)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00


# ILOC

**position-based** slicing is inclusive of the start and exclusive of the stop

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html

In [16]:
# 'iloc' is simply following NumPy's slicing convention...
ufo.values[0:4, :]

array([['Ithaca', nan, 'TRIANGLE', 'NY', '6/1/1930 22:00'],
       ['Willingboro', nan, 'OTHER', 'NJ', '6/30/1930 20:00'],
       ['Holyoke', nan, 'OVAL', 'CO', '2/15/1931 14:00'],
       ['Abilene', nan, 'DISK', 'KS', '6/1/1931 13:00']], dtype=object)

In [17]:
# ...and NumPy is simply following Python's slicing convention
'python'[0:4]

'pyth'

In [18]:
list(range(0, 4))

[0, 1, 2, 3]

In [20]:
# 'loc' is inclusive of the stopping label because you don't necessarily know what label will come after it
ufo.loc[0:4, 'City':'State']

Unnamed: 0,City,Colors Reported,Shape Reported,State
0,Ithaca,,TRIANGLE,NY
1,Willingboro,,OTHER,NJ
2,Holyoke,,OVAL,CO
3,Abilene,,DISK,KS
4,New York Worlds Fair,,LIGHT,NY
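
The UFO data uses the default integer index, so labels and positions look alike and the two slicing rules are easy to confuse. A minimal sketch with string row labels makes the difference easier to see. The frame and its labels 'a' to 'd' are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels (not part of the UFO dataset)
df = pd.DataFrame(
    {'City': ['Ithaca', 'Willingboro', 'Holyoke', 'Abilene'],
     'State': ['NY', 'NJ', 'CO', 'KS']},
    index=['a', 'b', 'c', 'd'],
)

# Label-based: the stop label 'c' is included
by_label = df.loc['a':'c', :]

# Position-based: the stop position 2 is excluded
by_position = df.iloc[0:2, :]

print(list(by_label.index))
print(list(by_position.index))
```

With labels there is no general way to say "the label after 'c'", which is why including the stop label is the more convenient rule for `loc`.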


Question: How do I randomly sample rows from a DataFrame?


In [22]:
# sample 3 rows from the DataFrame without replacement (new in pandas 0.16.1)
ufo.sample(n=3)  # output varies from run to run

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
3355,Kissimmee,,DISK,FL,8/20/1976 20:00
6806,Las Vegas,,LIGHT,NV,6/15/1992 22:00
3773,Zimmerman,,TEARDROP,MN,6/15/1978 18:00


Documentation for **sample**

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html

In [23]:
# use the 'random_state' parameter for reproducibility
ufo.sample(n=3, random_state=42)  # a fixed random_state returns the same rows on every run

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
217,Norridgewock,,DISK,ME,9/15/1952 14:00
12282,Ipava,,TRIANGLE,IL,10/1/1998 21:15
17933,Ellinwood,,FIREBALL,KS,11/13/2000 22:00
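
As a quick check of the reproducibility claim, the sketch below samples twice with the same `random_state` and compares the results. It uses a small made-up frame instead of the UFO data so that it runs without a download.

```python
import pandas as pd

# Hypothetical frame with 100 rows, used only for illustration
df = pd.DataFrame({'value': range(100)})

first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# Same seed, same rows in the same order
same_rows = first.equals(second)

# Sampling is without replacement by default, so no row appears twice
no_duplicates = first.index.is_unique

print(same_rows, no_duplicates)
```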


In [25]:
# sample 75% of the DataFrame's rows without replacement
ufo.sample(frac=0.75, random_state=99).head(3)  # 'frac' sets the fraction of rows; random_state makes it reproducible

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
6250,Sunnyvale,,OTHER,CA,12/16/1989 0:00
8656,Corpus Christi,,,TX,9/13/1995 0:10
2729,Mentor,,DISK,OH,8/8/1974 10:00


In [27]:
# machine learning use case: split the rows into training and testing sets
# sample 75% of the DataFrame's rows without replacement
train = ufo.sample(frac=0.75, random_state=99)

## Selecting the rows that I want

In [31]:
# store the remaining 25% of the rows in another DataFrame
test = ufo.loc[~ufo.index.isin(train.index), :]

In [30]:
test.head(3)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00
5,Valley City,,DISK,ND,9/15/1934 15:30
8,Eklutna,,CIGAR,AK,10/15/1936 17:00


Documentation for **isin**

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.isin.html
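
The split above can be sanity-checked by confirming that the two pieces do not overlap and together cover every row. The sketch below does this on a small made-up frame and also shows `drop` as an equivalent way to get the remaining rows. It assumes the index labels are unique. With duplicate labels, filtering by label would also remove unsampled rows that share a label with a sampled one.

```python
import pandas as pd

# Hypothetical frame with 20 rows and a unique default index
df = pd.DataFrame({'value': range(20)})

# 75% of the rows for training, sampled without replacement
train = df.sample(frac=0.75, random_state=99)

# Remaining rows: keep the labels that are not in the training index
test = df.loc[~df.index.isin(train.index), :]

# Alternative with the same result here: drop the training labels
test_alt = df.drop(train.index)

# The pieces are disjoint and together account for every row
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```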