# Pandas - basics

This part is intended for trainees with no previous programming experience.  
Most exercises can be answered with a single line of code; they are meant to familiarise you with Python syntax and pandas.

## Prelude - loading the data

We will use a data set describing yellow taxi trips. Just execute the following cells without modifying them.

In [1]:
import urllib.request
# Please execute this at the start of the notebook

base_url = 'https://raw.githubusercontent.com/Aenori/PythonDataAnalysis_public/main/help_files/pandas/'
helper_file = '2019_Yellow_Taxi_Trip_Data.csv'

urllib.request.urlretrieve(base_url + helper_file, helper_file)

('2019_Yellow_Taxi_Trip_Data.csv', <http.client.HTTPMessage at 0x7fa01c2910a0>)

In [2]:
import pandas as pd
df_taxi = pd.read_csv(helper_file)
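
As a side note, `pd.read_csv` reads date columns as plain text unless told otherwise. The sketch below is only an illustration: it uses two rows copied from the taxi data, trimmed to three columns and read from an in-memory string, to show the optional `parse_dates` argument. The names `csv_text`, `df_default` and `df_parsed` are made up for this example.

```python
import io

import pandas as pd

# Two rows copied from the taxi data, trimmed to three columns for brevity.
csv_text = (
    "tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance\n"
    "2019-10-23T16:39:42.000,2019-10-23T17:14:10.000,7.93\n"
    "2019-10-23T16:32:08.000,2019-10-23T16:45:26.000,2.0\n"
)

# Default behaviour: the two datetime columns are read as text.
df_default = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, pandas converts the listed columns to datetime values.
df_parsed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

print(df_default.dtypes)
print(df_parsed.dtypes)
```

The exercises below keep the default loading, so the datetime columns of `df_taxi` stay as text.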

## Part 1 - Dataframe global overview

Every exercise in this part can be solved with a single attribute access or method call.

### Exercise 1: number of rows and columns

What is the number of rows and columns of the DataFrame `df_taxi`? Several methods give this information; use the one that returns only these two numbers.

In [3]:
df_taxi.shape

(10000, 18)
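
To make the difference between the possible answers concrete, here is a small sketch on a toy DataFrame. `df_demo` is a made-up stand-in for `df_taxi`, so the example runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy DataFrame used instead of df_taxi.
df_demo = pd.DataFrame(
    {"fare_amount": [29.5, 10.5, 9.5], "passenger_count": [1, 1, 1]}
)

# shape is an attribute (no parentheses): a (rows, columns) tuple.
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)

print(len(df_demo))          # number of rows only
print(len(df_demo.columns))  # number of columns only
print(df_demo.size)          # rows * columns, i.e. the total number of cells
```

`df_demo.info()` also reports these numbers, but together with a lot of other information, which is why `shape` is the expected answer here.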

### Exercise 2: list of columns

List the columns of `df_taxi`.

In [4]:
df_taxi.columns

Index(['vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'ratecodeid', 'store_and_fwd_flag',
       'pulocationid', 'dolocationid', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge'],
      dtype='object')
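
The result above is a pandas `Index`, not a plain Python list. A minimal sketch, again on a made-up `df_demo`, of how one might turn it into a list or test whether a column exists:

```python
import pandas as pd

# Hypothetical toy DataFrame used instead of df_taxi.
df_demo = pd.DataFrame({"vendorid": [2, 1], "trip_distance": [7.93, 2.0]})

column_index = df_demo.columns          # a pandas Index object
column_list = df_demo.columns.tolist()  # a plain Python list of column names

print(type(column_index).__name__)
print(column_list)

# Membership test: is there a column with this name?
print("trip_distance" in df_demo.columns)
```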

### Exercise 3: detailed list of columns

List the columns of `df_taxi` along with their types and some other information, such as the memory usage of the `DataFrame`.

In [5]:
df_taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   vendorid               10000 non-null  int64  
 1   tpep_pickup_datetime   10000 non-null  object 
 2   tpep_dropoff_datetime  10000 non-null  object 
 3   passenger_count        10000 non-null  int64  
 4   trip_distance          10000 non-null  float64
 5   ratecodeid             10000 non-null  int64  
 6   store_and_fwd_flag     10000 non-null  object 
 7   pulocationid           10000 non-null  int64  
 8   dolocationid           10000 non-null  int64  
 9   payment_type           10000 non-null  int64  
 10  fare_amount            10000 non-null  float64
 11  extra                  10000 non-null  float64
 12  mta_tax                10000 non-null  float64
 13  tip_amount             10000 non-null  float64
 14  tolls_amount           10000 non-null  float64
 15  imp
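
`info()` prints a report and returns nothing. If the types or the memory usage are needed as values, the `dtypes` attribute and the `memory_usage()` method return them as pandas Series. A short sketch on a made-up `df_demo`:

```python
import pandas as pd

# Hypothetical toy DataFrame used instead of df_taxi.
df_demo = pd.DataFrame(
    {
        "vendorid": [2, 1],
        "trip_distance": [7.93, 2.0],
        "store_and_fwd_flag": ["N", "N"],
    }
)

# Series mapping each column name to its type.
print(df_demo.dtypes)

# Bytes used by the index and by each column; deep=True also counts
# the content of text columns.
memory_per_column = df_demo.memory_usage(deep=True)
print(memory_per_column)
print(memory_per_column.sum())
```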

### Exercise 4: DataFrame sample

Show the first 5 rows of the `DataFrame`.

In [6]:
df_taxi.head()

| | vendorid | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | ratecodeid | store_and_fwd_flag | pulocationid | dolocationid | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2019-10-23T16:39:42.000 | 2019-10-23T17:14:10.000 | 1 | 7.93 | 1 | N | 138 | 170 | 1 | 29.5 | 1.0 | 0.5 | 7.98 | 6.12 | 0.3 | 47.9 | 2.5 |
| 1 | 1 | 2019-10-23T16:32:08.000 | 2019-10-23T16:45:26.000 | 1 | 2.0 | 1 | N | 11 | 26 | 1 | 10.5 | 1.0 | 0.5 | 0.0 | 0.0 | 0.3 | 12.3 | 0.0 |
| 2 | 2 | 2019-10-23T16:08:44.000 | 2019-10-23T16:21:11.000 | 1 | 1.36 | 1 | N | 163 | 162 | 1 | 9.5 | 1.0 | 0.5 | 2.0 | 0.0 | 0.3 | 15.8 | 2.5 |
| 3 | 2 | 2019-10-23T16:22:44.000 | 2019-10-23T16:43:26.000 | 1 | 1.0 | 1 | N | 170 | 163 | 1 | 13.0 | 1.0 | 0.5 | 4.32 | 0.0 | 0.3 | 21.62 | 2.5 |
| 4 | 2 | 2019-10-23T16:45:11.000 | 2019-10-23T16:58:49.000 | 1 | 1.96 | 1 | N | 163 | 236 | 1 | 10.5 | 1.0 | 0.5 | 0.5 | 0.0 | 0.3 | 15.3 | 2.5 |
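
`head()` returns 5 rows by default but accepts another number, and it has close relatives for looking at other parts of the data. A brief sketch on a made-up `df_demo`:

```python
import pandas as pd

# Hypothetical toy DataFrame used instead of df_taxi.
df_demo = pd.DataFrame({"trip_distance": [7.93, 2.0, 1.36, 1.0, 1.96, 0.5, 3.2]})

first_five = df_demo.head()     # first 5 rows (the default)
first_three = df_demo.head(3)   # first 3 rows
last_two = df_demo.tail(2)      # last 2 rows

# 2 rows picked at random; random_state makes the pick reproducible.
random_two = df_demo.sample(n=2, random_state=0)

print(first_three)
print(last_two)
print(random_two)
```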


### Exercise 5: statistical description

In one line, extract some statistics about all numerical columns: min, max, quantiles...

In [8]:
df_taxi.describe()

| | vendorid | passenger_count | trip_distance | ratecodeid | pulocationid | dolocationid | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 1.6337 | 1.4977 | 3.01525 | 1.0842 | 166.9004 | 166.3135 | 1.301 | 15.106313 | 1.960235 | 0.49245 | 2.634494 | 0.623447 | 0.29814 | 22.564659 | 2.2828 |
| std | 0.481817 | 1.139353 | 4.148063 | 0.418244 | 63.791288 | 68.525953 | 0.486644 | 13.954762 | 1.39294 | 0.071544 | 3.4098 | 6.437507 | 0.032537 | 19.209255 | 0.720946 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | -52.0 | -4.5 | -0.5 | 0.0 | -6.12 | -0.3 | -65.92 | -2.5 |
| 25% | 1.0 | 1.0 | 0.92 | 1.0 | 132.0 | 132.0 | 1.0 | 7.0 | 1.0 | 0.5 | 0.0 | 0.0 | 0.3 | 12.375 | 2.5 |
| 50% | 2.0 | 1.0 | 1.5 | 1.0 | 162.0 | 163.0 | 1.0 | 10.0 | 1.0 | 0.5 | 2.0 | 0.0 | 0.3 | 16.3 | 2.5 |
| 75% | 2.0 | 2.0 | 2.76 | 1.0 | 234.0 | 236.0 | 2.0 | 16.0 | 3.5 | 0.5 | 3.25 | 0.0 | 0.3 | 22.88 | 2.5 |
| max | 2.0 | 6.0 | 38.11 | 5.0 | 265.0 | 265.0 | 4.0 | 176.0 | 7.0 | 0.5 | 43.0 | 612.0 | 0.3 | 671.8 | 2.75 |
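
By default `describe()` only covers numerical columns and reports the 25%, 50% and 75% quantiles. The sketch below, on a made-up `df_demo`, shows two optional arguments that may be useful: `percentiles` to ask for other quantiles and `include="all"` to also summarise text columns.

```python
import pandas as pd

# Hypothetical toy DataFrame used instead of df_taxi.
df_demo = pd.DataFrame(
    {
        "fare_amount": [29.5, 10.5, 9.5, 13.0, 10.5],
        "store_and_fwd_flag": ["N", "N", "N", "N", "N"],
    }
)

# Default: numerical columns only.
stats = df_demo.describe()

# Ask for the 10% and 90% quantiles.
custom = df_demo.describe(percentiles=[0.1, 0.9])

# Also summarise non-numerical columns (count, unique, top, freq).
everything = df_demo.describe(include="all")

print(stats)
print(custom)
print(everything)
```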


## Part 2 - column information

### Exercise 2: list of columns

List the columns of `df_taxi`.
