# Section 1: Getting Started With Pandas

We will begin by introducing the `Series`, `DataFrame`, and `Index` classes, which are the basic building blocks of the pandas library, and showing how to work with them. By the end of this section, you will be able to create DataFrames and perform operations on them to inspect and filter the data.

## Getting the Dataset

In [None]:
# Do Not Modify the links
!mkdir -p data
# Meteorite_Landings.csv Dataset
!wget -P ./data https://raw.githubusercontent.com/stefmolin/pandas-workshop/main/data/Meteorite_Landings.csv


--2023-05-19 13:02:39--  https://raw.githubusercontent.com/stefmolin/pandas-workshop/main/data/Meteorite_Landings.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4769811 (4.5M) [text/plain]
Saving to: ‘./data/Meteorite_Landings.csv’


2023-05-19 13:02:39 (61.1 MB/s) - ‘./data/Meteorite_Landings.csv’ saved [4769811/4769811]



In [None]:
!head -10 ./data/Meteorite_Landings.csv

name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
Aachen,1,Valid,L5,21,Fell,01/01/1880 12:00:00 AM,50.775000,6.083330,"(50.775, 6.08333)"
Aarhus,2,Valid,H6,720,Fell,01/01/1951 12:00:00 AM,56.183330,10.233330,"(56.18333, 10.23333)"
Abee,6,Valid,EH4,107000,Fell,01/01/1952 12:00:00 AM,54.216670,-113.000000,"(54.21667, -113.0)"
Acapulco,10,Valid,Acapulcoite,1914,Fell,01/01/1976 12:00:00 AM,16.883330,-99.900000,"(16.88333, -99.9)"
Achiras,370,Valid,L6,780,Fell,01/01/1902 12:00:00 AM,-33.166670,-64.950000,"(-33.16667, -64.95)"
Adhi Kot,379,Valid,EH4,4239,Fell,01/01/1919 12:00:00 AM,32.100000,71.800000,"(32.1, 71.8)"
Adzhi-Bogdo (stone),390,Valid,LL3-6,910,Fell,01/01/1949 12:00:00 AM,44.833330,95.166670,"(44.83333, 95.16667)"
Agen,392,Valid,H5,30000,Fell,01/01/1814 12:00:00 AM,44.216670,0.616670,"(44.21667, 0.61667)"
Aguada,398,Valid,L6,1620,Fell,01/01/1930 12:00:00 AM,-31.600000,-65.233330,"(-31.6, -65.23333)"


## Anatomy of a DataFrame

A **DataFrame** is composed of one or more **Series**. The names of the **Series** form the column names, and the row labels form the **Index**.

In [None]:
import pandas as pd

meteorites = pd.read_csv('./data/Meteorite_Landings.csv', nrows=5, index_col=1)
meteorites

id,name,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
1,Aachen,Valid,L5,21,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
2,Aarhus,Valid,H6,720,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
6,Abee,Valid,EH4,107000,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
10,Acapulco,Valid,Acapulcoite,1914,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
370,Achiras,Valid,L6,780,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


*Source: [NASA's Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh)*

#### Series:

In [None]:
meteorites['name']

id
1        Aachen
2        Aarhus
6          Abee
10     Acapulco
370     Achiras
Name: name, dtype: object

#### Columns:

In [None]:
meteorites.columns

Index(['name', 'nametype', 'recclass', 'mass (g)', 'fall', 'year', 'reclat',
       'reclong', 'GeoLocation'],
      dtype='object')

#### Index:

In [None]:
meteorites.index

Int64Index([1, 2, 6, 10, 370], dtype='int64', name='id')
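
The relationship between these three classes can be checked on a tiny hand-built frame. The sketch below retypes three rows from the preview above; the variable names `sample` and `name_column` are illustrative only and are not used elsewhere in this section.

```python
import pandas as pd

# A small frame retyped from the first three rows of the preview above
sample = pd.DataFrame(
    {
        'name': ['Aachen', 'Aarhus', 'Abee'],
        'recclass': ['L5', 'H6', 'EH4'],
        'mass (g)': [21, 720, 107000],
    },
    index=pd.Index([1, 2, 6], name='id'),
)

name_column = sample['name']         # selecting one column gives a Series

print(type(sample).__name__)         # the container class
print(type(name_column).__name__)    # the class of a single column
print(name_column.name)              # the Series name is the column label
print(sample.columns.tolist())       # column labels live in an Index
print(sample.index.tolist())         # row labels live in an Index as well
```

Every column shares the row labels of the frame, which is why `name_column.index` and `sample.index` hold the same values.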

## Creating DataFrames

We can create DataFrames from a variety of sources such as other Python objects, flat files, web scraping, and API requests. Here, we will see just a couple of examples, but be sure to check out [this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) in the documentation for a complete list.

### Using a flat file

In [None]:
import pandas as pd

meteorites = pd.read_csv('./data/Meteorite_Landings.csv', index_col=1)

*Tip: There are many parameters to this function to handle some initial processing while reading in the file &ndash; be sure to check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).*
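
As a minimal sketch of what a few of those parameters do, the example below reads three lines copied from the file preview through an in-memory buffer instead of the file on disk (an assumption made so the example is self-contained). `usecols`, `index_col`, and `nrows` are existing `read_csv()` parameters; the names `csv_text` and `subset` are illustrative.

```python
import io

import pandas as pd

# Header plus three data lines copied from the file preview
csv_text = '''name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
Aachen,1,Valid,L5,21,Fell,01/01/1880 12:00:00 AM,50.775000,6.083330,"(50.775, 6.08333)"
Aarhus,2,Valid,H6,720,Fell,01/01/1951 12:00:00 AM,56.183330,10.233330,"(56.18333, 10.23333)"
Abee,6,Valid,EH4,107000,Fell,01/01/1952 12:00:00 AM,54.216670,-113.000000,"(54.21667, -113.0)"
'''

subset = pd.read_csv(
    io.StringIO(csv_text),                        # stands in for the file path
    usecols=['name', 'id', 'mass (g)', 'fall'],   # read only these columns
    index_col='id',                               # use the id column as row labels
    nrows=2,                                      # stop after two data rows
)
print(subset)
```

Limiting columns and rows at read time can be useful for a quick look at a large file before loading all of it.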

### Using data from an API

Collect the data from [NASA's Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh) using the Socrata Open Data API (SODA) with the `requests` library:

In [None]:
import requests

response = requests.get(
    'https://data.nasa.gov/resource/gh4g-9sfh.json',
    params={'$limit': 50_000}
)

if response.ok:
    payload = response.json()
else:
    print(f'Request was not successful and returned code: {response.status_code}.')
    payload = None

In [None]:
payload[:2]

[{'name': 'Aachen',
  'id': '1',
  'nametype': 'Valid',
  'recclass': 'L5',
  'mass': '21',
  'fall': 'Fell',
  'year': '1880-01-01T00:00:00.000',
  'reclat': '50.775000',
  'reclong': '6.083330',
  'geolocation': {'latitude': '50.775', 'longitude': '6.08333'}},
 {'name': 'Aarhus',
  'id': '2',
  'nametype': 'Valid',
  'recclass': 'H6',
  'mass': '720',
  'fall': 'Fell',
  'year': '1951-01-01T00:00:00.000',
  'reclat': '56.183330',
  'reclong': '10.233330',
  'geolocation': {'latitude': '56.18333', 'longitude': '10.23333'}}]

Create the DataFrame with the resulting payload:

In [None]:
import pandas as pd

df = pd.DataFrame(payload)
df.head(3)

,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
0,Aachen,1,Valid,L5,21,Fell,1880-01-01T00:00:00.000,50.775,6.08333,"{'latitude': '50.775', 'longitude': '6.08333'}",,
1,Aarhus,2,Valid,H6,720,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,"{'latitude': '56.18333', 'longitude': '10.23333'}",,
2,Abee,6,Valid,EH4,107000,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,"{'latitude': '54.21667', 'longitude': '-113.0'}",,


In [None]:
df.geolocation.dtype

dtype('O')
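
The API returns every value as a string and nests the coordinates in a dictionary, which is why the column above has the `object` dtype. One possible way to tidy this up, sketched here on the two records shown earlier, is to flatten the nested field with `pd.json_normalize()` and convert the numeric-looking columns with `pd.to_numeric()`. This is an illustrative approach rather than a step the workshop itself takes, and the names `records`, `flat`, and `numeric_columns` are assumptions.

```python
import pandas as pd

# The two records shown in the payload preview above
records = [
    {'name': 'Aachen', 'id': '1', 'nametype': 'Valid', 'recclass': 'L5',
     'mass': '21', 'fall': 'Fell', 'year': '1880-01-01T00:00:00.000',
     'reclat': '50.775000', 'reclong': '6.083330',
     'geolocation': {'latitude': '50.775', 'longitude': '6.08333'}},
    {'name': 'Aarhus', 'id': '2', 'nametype': 'Valid', 'recclass': 'H6',
     'mass': '720', 'fall': 'Fell', 'year': '1951-01-01T00:00:00.000',
     'reclat': '56.183330', 'reclong': '10.233330',
     'geolocation': {'latitude': '56.18333', 'longitude': '10.23333'}},
]

# Nested dictionaries become dotted column names such as geolocation.latitude
flat = pd.json_normalize(records)

# Convert the string columns that hold numbers
numeric_columns = ['id', 'mass', 'reclat', 'reclong',
                   'geolocation.latitude', 'geolocation.longitude']
for column in numeric_columns:
    flat[column] = pd.to_numeric(flat[column])

print(flat.dtypes)
```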

In [None]:
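# Build a list of row dictionaries; each dictionary becomes one row
records = []
for i in range(10):
    records.append({'i': i * 7, 'j': i + 1})
my_df = pd.DataFrame(records)
# index=False leaves the row labels (0-9) out of the file
my_df.to_csv('my_df.csv', index=False)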

In [None]:
!cat my_df.csv

i,j
0,1
7,2
14,3
21,4
28,5
35,6
42,7
49,8
56,9
63,10


*Tip: `df.to_csv('data.csv')` writes this data to a new file called `data.csv`.*
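
To see what the `index` argument changes, the sketch below writes a three-row frame to a string instead of a file (calling `to_csv()` without a path returns the CSV text) and reads it back. The frame is a shortened stand-in for `my_df`, and the names `with_index`, `without_index`, and `round_trip` are illustrative.

```python
import io

import pandas as pd

my_df = pd.DataFrame({'i': [0, 7, 14], 'j': [1, 2, 3]})

# With no path, to_csv() returns the CSV text instead of writing a file
with_index = my_df.to_csv()
without_index = my_df.to_csv(index=False)

print(with_index.splitlines()[0])      # header has an empty first field for the index
print(without_index.splitlines()[0])   # header holds only the column names

# Reading the index-free text back gives an equal DataFrame
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.equals(my_df))
```

Writing the default integer index usually just adds an extra unnamed column when the file is read again, so `index=False` is a common choice when the row labels carry no information.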

## Inspecting the data
Now that we have some data, we need to perform an initial inspection of it. This gives us information on what the data looks like, how many rows/columns there are, and how much data we have. 

Let's inspect the `meteorites` data.

#### How many rows and columns are there?

In [None]:
meteorites.shape

(45716, 9)

#### What are the column names?

In [None]:
meteorites.columns

Index(['name', 'nametype', 'recclass', 'mass (g)', 'fall', 'year', 'reclat',
       'reclong', 'GeoLocation'],
      dtype='object')

#### What type of data does each column currently hold?

In [None]:
meteorites.dtypes

name            object
nametype        object
recclass        object
mass (g)       float64
fall            object
year            object
reclat         float64
reclong        float64
GeoLocation     object
dtype: object

#### What does the data look like?

In [None]:
meteorites.head(3)

id,name,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
1,Aachen,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
2,Aarhus,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
6,Abee,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"


Sometimes there may be extraneous data at the end of the file, so checking the bottom few rows is also important:

In [None]:
meteorites.tail(10)

id,name,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
31354,Zerkaly,Valid,H5,16000.0,Found,01/01/1956 12:00:00 AM,52.13333,81.96667,"(52.13333, 81.96667)"
54609,Zhaoping,Valid,"Iron, IAB complex",2000000.0,Found,01/01/1983 12:00:00 AM,24.23333,111.18333,"(24.23333, 111.18333)"
30405,Zhigansk,Valid,"Iron, IIIAB",900000.0,Found,01/01/1966 12:00:00 AM,68.0,128.3,"(68.0, 128.3)"
30406,Zhongxiang,Valid,Iron,100000.0,Found,01/01/1981 12:00:00 AM,31.2,112.5,"(31.2, 112.5)"
31355,Zillah 001,Valid,L6,1475.0,Found,01/01/1990 12:00:00 AM,29.037,17.0185,"(29.037, 17.0185)"
31356,Zillah 002,Valid,Eucrite,172.0,Found,01/01/1990 12:00:00 AM,29.037,17.0185,"(29.037, 17.0185)"
30409,Zinder,Valid,"Pallasite, ungrouped",46.0,Found,01/01/1999 12:00:00 AM,13.78333,8.96667,"(13.78333, 8.96667)"
30410,Zlin,Valid,H4,3.3,Found,01/01/1939 12:00:00 AM,49.25,17.66667,"(49.25, 17.66667)"
31357,Zubkovsky,Valid,L6,2167.0,Found,01/01/2003 12:00:00 AM,49.78917,41.5046,"(49.78917, 41.5046)"
30414,Zulu Queen,Valid,L3.7,200.0,Found,01/01/1976 12:00:00 AM,33.98333,-115.68333,"(33.98333, -115.68333)"


#### Get some information about the DataFrame

In [None]:
meteorites.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45716 entries, 1 to 30414
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         45716 non-null  object 
 1   nametype     45716 non-null  object 
 2   recclass     45716 non-null  object 
 3   mass (g)     45585 non-null  float64
 4   fall         45716 non-null  object 
 5   year         45425 non-null  object 
 6   reclat       38401 non-null  float64
 7   reclong      38401 non-null  float64
 8   GeoLocation  38401 non-null  object 
dtypes: float64(3), object(6)
memory usage: 3.5+ MB
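
The non-null counts reported by `info()` can also be obtained directly, which helps when we want the numbers as data rather than as printed text. The sketch below uses a small frame whose missing values are made up for illustration (they do not reflect the actual rows in the dataset); `isna()`, `sum()`, and `count()` are existing pandas methods.

```python
import pandas as pd

# Hypothetical gaps: the None entries here are invented for the example
sample = pd.DataFrame({
    'name': ['Aachen', 'Aarhus', 'Abee', 'Zlin'],
    'mass (g)': [21.0, None, 107000.0, 3.3],
    'reclat': [50.775, 56.18333, None, None],
})

missing_per_column = sample.isna().sum()   # number of missing values per column
non_null_per_column = sample.count()       # number of non-null values per column

print(missing_per_column)
print(non_null_per_column)
```

For each column, the two results add up to the number of rows.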


### Exercise 1.1

##### Create a DataFrame by reading in the `2019_Yellow_Taxi_Trip_Data.csv` file. Examine the first 5 rows.

In [None]:
# Complete this exercise in the exercises.ipynb file
# 2019_Yellow_Taxi_Trip_Data.csv Dataset
!wget -P ./data https://github.com/stefmolin/pandas-workshop/raw/main/data/2019_Yellow_Taxi_Trip_Data.csv

--2023-05-19 13:03:06--  https://github.com/stefmolin/pandas-workshop/raw/main/data/2019_Yellow_Taxi_Trip_Data.csv
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/stefmolin/pandas-workshop/main/data/2019_Yellow_Taxi_Trip_Data.csv [following]
--2023-05-19 13:03:06--  https://raw.githubusercontent.com/stefmolin/pandas-workshop/main/data/2019_Yellow_Taxi_Trip_Data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1000622 (977K) [text/plain]
Saving to: ‘./data/2019_Yellow_Taxi_Trip_Data.csv’


2023-05-19 13:03:06 (15.3 MB/s) - ‘./data/2019_Yellow_Taxi_Trip_Data.csv’ saved [1000622/1000622]



### Exercise 1.2

##### Find the dimensions (number of rows and number of columns) in the data.

In [None]:
# Complete this exercise in the exercises.ipynb file

## Extracting subsets

A crucial part of working with DataFrames is extracting subsets of the data: finding rows that meet a certain set of criteria, isolating columns/rows of interest, etc. After narrowing down our data, we are closer to discovering insights. This section will be the backbone of many analysis tasks.

#### Selecting columns

We can select columns as attributes if their names would be valid Python variables:

In [None]:
meteorites.name

id
1            Aachen
2            Aarhus
6              Abee
10         Acapulco
370         Achiras
            ...    
31356    Zillah 002
30409        Zinder
30410          Zlin
31357     Zubkovsky
30414    Zulu Queen
Name: name, Length: 45716, dtype: object

If they aren't, we have to select them as keys. However, we can select multiple columns at once this way:

In [None]:
meteorites[['name', 'mass (g)']]

id,name,mass (g)
1,Aachen,21.0
2,Aarhus,720.0
6,Abee,107000.0
10,Acapulco,1914.0
370,Achiras,780.0
...,...,...
31356,Zillah 002,172.0
30409,Zinder,46.0
30410,Zlin,3.3
31357,Zubkovsky,2167.0


#### Selecting rows

In [None]:
meteorites[0:12]

id,name,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
1,Aachen,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
2,Aarhus,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
6,Abee,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
10,Acapulco,Valid,Acapulcoite,1914.0,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
370,Achiras,Valid,L6,780.0,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"
379,Adhi Kot,Valid,EH4,4239.0,Fell,01/01/1919 12:00:00 AM,32.1,71.8,"(32.1, 71.8)"
390,Adzhi-Bogdo (stone),Valid,LL3-6,910.0,Fell,01/01/1949 12:00:00 AM,44.83333,95.16667,"(44.83333, 95.16667)"
392,Agen,Valid,H5,30000.0,Fell,01/01/1814 12:00:00 AM,44.21667,0.61667,"(44.21667, 0.61667)"
398,Aguada,Valid,L6,1620.0,Fell,01/01/1930 12:00:00 AM,-31.6,-65.23333,"(-31.6, -65.23333)"
417,Aguila Blanca,Valid,L,1440.0,Fell,01/01/1920 12:00:00 AM,-30.86667,-64.55,"(-30.86667, -64.55)"


In [None]:
meteorites.iloc[10]

name                 Aioun el Atrouss
nametype                        Valid
recclass                 Diogenite-pm
mass (g)                       1000.0
fall                             Fell
year           01/01/1974 12:00:00 AM
reclat                       16.39806
reclong                      -9.57028
GeoLocation      (16.39806, -9.57028)
Name: 423, dtype: object

#### Indexing

We use `iloc[]` to select rows and columns by their position:

In [None]:
meteorites.iloc[100:104, [0, 3, 4, 6]]

id,name,mass (g),fall,reclat
5026,Benton,2840.0,Fell,45.95
48975,Berduc,270.0,Fell,-31.91
5028,Béréba,18000.0,Fell,11.65
5029,Berlanguillas,1440.0,Fell,41.68333


In [None]:
meteorites.loc[100:104, ['name','fall']]

id,name,fall
100,Acfer 091,Found
101,Acfer 092,Found
102,Acfer 093,Found
103,Acfer 094,Found
104,Acfer 095,Found


We use `loc[]` to select by name:

In [None]:
meteorites.loc[100:104, 'mass (g)':'year']

id,mass (g),fall,year
100,3487.0,Found,01/01/1990 12:00:00 AM
101,244.0,Found,01/01/1990 12:00:00 AM
102,420.0,Found,01/01/1990 12:00:00 AM
103,82.0,Found,01/01/1990 12:00:00 AM
104,104.0,Found,01/01/1990 12:00:00 AM


#### Filtering with Boolean masks

A **Boolean mask** is an array-like structure of Boolean values &ndash; it's a way to specify which rows/columns we want to select (`True`) and which we don't (`False`).

Here's an example that builds up a Boolean mask in two steps: first for meteorites weighing more than 50 grams, and then for those that were also found on Earth (i.e., they were not observed falling):

In [None]:
mask = (meteorites['mass (g)'] > 50)
meteorites[mask].describe()

,mass (g),reclat,reclong
count,19874.0,16069.0,16069.0
mean,30438.53,-15.472352,37.029264
std,870531.3,47.626357,78.800465
min,50.01,-87.36667,-165.43333
25%,108.7115,-72.0,0.0
50%,265.0,0.0,26.0
75%,962.0,26.98333,75.10028
max,60000000.0,81.16667,178.2


In [None]:
n_mask = (meteorites['mass (g)'] > 50) & (meteorites.fall == 'Found')
meteorites[n_mask].shape

(18854, 9)

**Important**: Take note of the syntax here. We surround each condition with parentheses, and we use bitwise operators (`&`, `|`, `~`) instead of logical operators (`and`, `or`, `not`).
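
The sketch below shows why the logical operators don't work here, using four rows retyped from the previews above. Python's `and` asks each Series for a single truth value, which pandas refuses to provide, whereas `&`, `|`, and `~` operate element by element. The variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        'name': ['Aachen', 'Aarhus', 'Zerkaly', 'Zinder'],
        'mass (g)': [21.0, 720.0, 16000.0, 46.0],
        'fall': ['Fell', 'Fell', 'Found', 'Found'],
    },
    index=pd.Index([1, 2, 31354, 30409], name='id'),
)

heavy = sample['mass (g)'] > 50
found = sample['fall'] == 'Found'

# `and` needs one True/False per operand, so pandas raises a ValueError
raised = False
try:
    heavy and found
except ValueError as error:
    raised = True
    print(f'ValueError: {error}')

combined = heavy & found   # element-wise AND
either = heavy | found     # element-wise OR
not_found = ~found         # element-wise NOT

print(sample[combined])
```

The parentheses matter because `&` binds more tightly than comparison operators such as `>` and `==`, so an unparenthesized condition would be grouped incorrectly.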

We can use a Boolean mask to select the subset of meteorites weighing more than 1 million grams (1,000 kilograms or roughly 2,205 pounds) that were observed falling:

In [None]:
meteorites[(meteorites['mass (g)'] > 1e6) & (meteorites.fall == 'Fell')]

id,name,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
2278,Allende,Valid,CV3,2000000.0,Fell,01/01/1969 12:00:00 AM,26.96667,-105.31667,"(26.96667, -105.31667)"
12171,Jilin,Valid,H5,4000000.0,Fell,01/01/1976 12:00:00 AM,44.05,126.16667,"(44.05, 126.16667)"
12379,Kunya-Urgench,Valid,H5,1100000.0,Fell,01/01/1998 12:00:00 AM,42.25,59.2,"(42.25, 59.2)"
17922,Norton County,Valid,Aubrite,1100000.0,Fell,01/01/1948 12:00:00 AM,39.68333,-99.86667,"(39.68333, -99.86667)"
23593,Sikhote-Alin,Valid,"Iron, IIAB",23000000.0,Fell,01/01/1947 12:00:00 AM,46.16,134.65333,"(46.16, 134.65333)"


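*Tip: Boolean masks can be used with `loc[]` and `iloc[]`. Note that `iloc[]` expects a Boolean array rather than a Boolean `Series`, so convert the mask first (e.g., with `to_numpy()`).*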
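
A minimal sketch of combining a mask with column selection, on four rows retyped from the previews above. With `loc[]` the Boolean `Series` can be passed directly; with `iloc[]` the sketch passes a plain Boolean array obtained through `to_numpy()`. The names `via_loc` and `via_iloc` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        'name': ['Aachen', 'Aarhus', 'Zerkaly', 'Zinder'],
        'mass (g)': [21.0, 720.0, 16000.0, 46.0],
        'fall': ['Fell', 'Fell', 'Found', 'Found'],
    },
    index=pd.Index([1, 2, 31354, 30409], name='id'),
)

mask = sample['fall'] == 'Found'

# loc: rows where the mask is True, columns selected by name
via_loc = sample.loc[mask, ['name', 'mass (g)']]

# iloc: the mask as a plain Boolean array, columns selected by position
via_iloc = sample.iloc[mask.to_numpy(), [0, 1]]

print(via_loc)
print(via_loc.equals(via_iloc))
```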

An alternative to this is the `query()` method:

In [None]:
meteorites.query("`mass (g)` > 1e6 and fall == 'Fell'")

id,name,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
2278,Allende,Valid,CV3,2000000.0,Fell,01/01/1969 12:00:00 AM,26.96667,-105.31667,"(26.96667, -105.31667)"
12171,Jilin,Valid,H5,4000000.0,Fell,01/01/1976 12:00:00 AM,44.05,126.16667,"(44.05, 126.16667)"
12379,Kunya-Urgench,Valid,H5,1100000.0,Fell,01/01/1998 12:00:00 AM,42.25,59.2,"(42.25, 59.2)"
17922,Norton County,Valid,Aubrite,1100000.0,Fell,01/01/1948 12:00:00 AM,39.68333,-99.86667,"(39.68333, -99.86667)"
23593,Sikhote-Alin,Valid,"Iron, IIAB",23000000.0,Fell,01/01/1947 12:00:00 AM,46.16,134.65333,"(46.16, 134.65333)"


*Tip: Here, we can use both logical operators and bitwise operators.*
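
As a quick check of that tip, the sketch below runs the same filter three ways on three rows retyped from the outputs above: `query()` with `and`, `query()` with `&`, and an ordinary Boolean mask. The variable names are illustrative, and the parentheses in the `&` version are added for readability.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        'name': ['Aachen', 'Allende', 'Hoba'],
        'mass (g)': [21.0, 2000000.0, 60000000.0],
        'fall': ['Fell', 'Fell', 'Found'],
    },
    index=pd.Index([1, 2278, 11890], name='id'),
)

# Column names that are not valid identifiers go inside backticks
with_and = sample.query("`mass (g)` > 1e6 and fall == 'Fell'")
with_ampersand = sample.query("(`mass (g)` > 1e6) & (fall == 'Fell')")

# The equivalent Boolean mask
with_mask = sample[(sample['mass (g)'] > 1e6) & (sample['fall'] == 'Fell')]

print(with_and)
print(with_and.equals(with_ampersand), with_and.equals(with_mask))
```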

## Calculating summary statistics

In the next section of this workshop, we will discuss data cleaning for a more meaningful analysis of our datasets; however, we can already extract some interesting insights from the `meteorites` data by calculating summary statistics.

#### How many of the meteorites were found versus observed falling?

In [None]:
meteorites.fall.value_counts()

Found    44609
Fell      1107
Name: fall, dtype: int64

In [None]:
meteorites.year.str.split(' ').str[0].str.split('/').str[2]

id
1        1880
2        1951
6        1952
10       1976
370      1902
         ... 
31356    1990
30409    1999
30410    1939
31357    2003
30414    1976
Name: year, Length: 45716, dtype: object

In [None]:
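# Parse the date strings; values that cannot be parsed become NaT
meteorites['year'] = pd.to_datetime(meteorites['year'], errors='coerce')
# Extract the calendar year (NaT rows become NaN, so the result is float)
meteorites['yr'] = meteorites['year'].dt.year
meteorites.yr.value_counts()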


2003.0    3323
1979.0    3046
1998.0    2697
2006.0    2456
1988.0    2296
          ... 
1801.0       1
1750.0       1
1741.0       1
1779.0       1
1792.0       1
Name: yr, Length: 244, dtype: int64

*Tip: Pass in `normalize=True` to see this result as proportions of the total (multiply by 100 for percentages). Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) for additional functionality.*
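
A small sketch of the effect of `normalize=True`, using a four-value Series invented for illustration (the names `fall`, `counts`, and `shares` are assumptions):

```python
import pandas as pd

# Made-up values: three found meteorites and one observed fall
fall = pd.Series(['Found', 'Found', 'Found', 'Fell'], name='fall')

counts = fall.value_counts()                 # absolute counts per value
shares = fall.value_counts(normalize=True)   # counts divided by the total

print(counts)
print(shares)
```

The normalized values always sum to 1, so they can be read as the share of rows that hold each value.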

#### What was the mass of the average meteorite?

In [None]:
meteorites['mass (g)'].mean()

13278.078548601512

**Important**: The mean isn't always the best measure of central tendency. If there are outliers in the distribution, the mean will be skewed. Here, the mean is being pulled higher by some very heavy meteorites &ndash; the distribution is [right-skewed](https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/).

Taking a look at some quantiles at the extremes of the distribution shows that the mean is between the 95th and 99th percentiles of the distribution, so it isn't a good measure of central tendency here:

In [None]:
meteorites['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])

0.01        0.44
0.05        1.10
0.50       32.60
0.95     4000.00
0.99    50600.00
Name: mass (g), dtype: float64

A better measure in this case is the median (50th percentile), since it is robust to outliers:

In [None]:
meteorites['mass (g)'].median()

32.6
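
The same effect is visible on just the first five masses from the file preview, where a single heavy meteorite (Abee, 107,000 g) dominates the mean. This is a small illustrative sketch; the names `masses`, `mean_mass`, `median_mass`, and `below_mean` are assumptions.

```python
import pandas as pd

# Masses (in grams) of the first five meteorites in the file preview
masses = pd.Series([21, 720, 107000, 1914, 780], name='mass (g)')

mean_mass = masses.mean()       # pulled upward by the 107,000 g value
median_mass = masses.median()   # middle value after sorting

# Count how many values fall below the mean
below_mean = (masses < mean_mass).sum()

print(mean_mass, median_mass, below_mean)
```

Four of the five values sit below the mean, while the median stays at a mass that an actual meteorite in the sample has.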

#### What was the mass of the heaviest meteorite?

In [None]:
meteorites['mass (g)'].max()

60000000.0

Let's extract the information on this meteorite:

In [None]:
meteorites.loc[meteorites['mass (g)'].idxmax()]

name                             Hoba
nametype                        Valid
recclass                    Iron, IVB
mass (g)                   60000000.0
fall                            Found
year           01/01/1920 12:00:00 AM
reclat                      -19.58333
reclong                      17.91667
GeoLocation     (-19.58333, 17.91667)
Name: 11890, dtype: object
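
The lookup above works in two steps: `idxmax()` returns the index label of the row holding the maximum, and `loc[]` then selects the row with that label. A sketch on three rows retyped from the outputs above (variable names are illustrative):

```python
import pandas as pd

sample = pd.DataFrame(
    {
        'name': ['Aachen', 'Allende', 'Hoba'],
        'mass (g)': [21.0, 2000000.0, 60000000.0],
    },
    index=pd.Index([1, 2278, 11890], name='id'),
)

# idxmax() gives the row label (here an id), not the row position
heaviest_id = sample['mass (g)'].idxmax()

# loc[] selects by label, so it pairs naturally with idxmax()
heaviest = sample.loc[heaviest_id]

print(heaviest_id)
print(heaviest)
```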

#### How many different types of meteorite classes are represented in this dataset?

In [None]:
meteorites.recclass.nunique()

466

Some examples:

In [None]:
meteorites.recclass.unique()[:14]

array(['L5', 'H6', 'EH4', 'Acapulcoite', 'L6', 'LL3-6', 'H5', 'L',
       'Diogenite-pm', 'Unknown', 'H4', 'H', 'Iron, IVA', 'CR2-an'],
      dtype=object)

*Note: All fields preceded with "rec" are the values recommended by The Meteoritical Society. Check out [this Wikipedia article](https://en.wikipedia.org/wiki/Meteorite_classification) for some information on meteorite classes.*

#### Get some summary statistics on the data itself
We can get common summary statistics for all columns at once. By default, this will only be numeric columns, but here, we will summarize everything together:

In [None]:
meteorites.describe(include='all')

,name,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
count,45716,45716,45716,45585.0,45716,45425,38401.0,38401.0,38401
unique,45716,2,466,,2,266,,,17100
top,Aachen,Valid,L6,,Found,01/01/2003 12:00:00 AM,,,"(0.0, 0.0)"
freq,1,45641,8285,,44609,3323,,,6214
mean,,,,13278.08,,,-39.12258,61.074319,
std,,,,574988.9,,,46.378511,80.647298,
min,,,,0.0,,,-87.36667,-165.43333,
25%,,,,7.2,,,-76.71424,0.0,
50%,,,,32.6,,,-71.5,35.66667,
75%,,,,202.6,,,0.0,157.16667,


**Important**: `NaN` values signify missing data. For instance, the `fall` column contains strings, so there is no value for `mean`; likewise, `mass (g)` is numeric, so we don't have entries for the categorical summary statistics (`unique`, `top`, `freq`).
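
The pattern of `NaN` values can be reproduced on a tiny mixed-type frame. The sketch below uses three rows retyped from the outputs above; the names `sample` and `summary` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'fall': ['Fell', 'Fell', 'Found'],
    'mass (g)': [21.0, 720.0, 60000000.0],
})

summary = sample.describe(include='all')
print(summary)

# The text column has no mean; the numeric column has no unique/top/freq
print(pd.isna(summary.loc['mean', 'fall']))
print(pd.isna(summary.loc['unique', 'mass (g)']))
```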

#### Check out the documentation for more descriptive statistics:

- [Series](https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats)
- [DataFrame](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats)

### Exercise 1.3

##### Using the data in the `2019_Yellow_Taxi_Trip_Data.csv` file, calculate summary statistics for the `fare_amount`, `tip_amount`, `tolls_amount`, and `total_amount` columns.

In [None]:
# Complete this exercise in the exercises.ipynb file

### Exercise 1.4

##### Find the dimensions (number of rows and number of columns) in the data.

In [None]:
# Complete this exercise in the exercises.ipynb file