<div style="float: right; margin: 0px 15px 15px 0px;">
<img src="https://upload.wikimedia.org/wikipedia/commons/b/b6/HULT_IBS_Logo_Outline_Black_%28cropped%29.png" width=150/>
</div>
<h1> Python for Data Analysis: Methods & Tools </h1>
<em> <strong>Python for everyday people </strong></em>
<br><br>
Written by Felipe Dominguez - Adjunct Professor<br>
Hult International Business School <br>
<br>
<h1><u> Chapter 07 - Exploratory Analysis with Pandas</u></h1>
<em> Describe it!</em>

<h3>1. Exploring DataFrames</h3><br>
<div align = justify>
In Chapter 06, you were introduced to the <b>Python Pandas</b> library. This chapter will focus on exploring, analyzing, and performing calculations on Pandas dataframes. Get ready! <br><br>
   When working with a new dataset, it's important that you <b>become familiar</b> with it by understanding what information it contains and thinking about how to use it. One useful way to do this is with Pandas's built-in methods, such as <b>.head()</b> and <b>.info()</b>, which provide a preview of the first rows and information about each column's data type and number of non-null values, among other details. <br><br>
    As a general first step when importing a dataset, you must <b>check the data type of each column</b>. This is crucial because Pandas infers each column's data type when it reads the file, and the inferred type is not always the one you need. For example, if a column in a csv file contains dates, Pandas will most likely read it as a string (object). Therefore, if you would like to take advantage of Pandas's date-time features and functions, you must ensure this column is identified as a datetime type.<br><br>
    After previewing the data (head and info) and converting data types, it is crucial to understand whether the data align with the real world and with your case study. For example, if you are analyzing the microeconomics of Chile using a global dataset (UN, World Bank, etc.), you must ensure the data only includes information about Chile. Or, if you are analyzing a demographic group (35-45 years old), you can check if your data's descriptive statistics (mean, standard deviation, etc.) align with that group. Additionally, you should be aware of categorical variables, which are analyzed differently than numeric values. If the dataset does not reflect what was expected or intended, it is the analyst's responsibility to find out why. This process of discovery may lead to useful insights that you were not initially aware of.<br><br>
For this chapter, you will work with a dataset containing information about taxi trips in New York from April 2019 to June 2019, provided by <a href = 'https://www.mavenanalytics.io/'>Maven Analytics</a>.
</div>
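<div align = justify>
As a small illustration of the data type check described above, the sketch below reads a tiny in-memory csv twice: once with the defaults and once asking Pandas to parse the date column while reading. The sample text and the column name <em>pickup</em> are made up for this example and are not part of the taxi dataset.
</div>

```python
import io
import pandas as pd

# Hypothetical two-row sample; this is not the real taxi file
csv_text = (
    "pickup,fare_amount\n"
    "2019-04-17 08:58:00,7.0\n"
    "2019-04-17 12:36:00,7.5\n"
)

# Default import: the date column is read as text
df_raw = pd.read_csv(io.StringIO(csv_text))
print(df_raw.dtypes)

# parse_dates asks pandas to convert the listed columns to datetime
df_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["pickup"])
print(df_parsed.dtypes)
```

<div align = justify>
Comparing the two printed outputs shows why checking data types right after importing is worthwhile: the same file can produce a text column or a datetime column depending on how it is read.
</div>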

In [4]:
# import packages
import numpy as np
import pandas as pd

In [5]:
# Code 1.1.
# Import the NY taxi trips dataset. This code will not produce output.
df_taxi = pd.read_csv(filepath_or_buffer = "./__resources/2019_taxi_trips.csv")

In [6]:
# Code 1.2.
# Show the first 5 rows of df_taxi
df_taxi.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,2,4/17/19 8:58,4/17/19 9:04,Central Harlem,Central Harlem North,1,1.2,7.0,0.0,0.5,0.0,0,0.3,7.8,2,1,0.0
1,2,4/17/19 12:36,4/17/19 12:43,Central Harlem,Central Harlem North,1,1.12,7.0,0.0,0.5,0.0,0,0.3,7.8,2,1,0.0
2,2,4/17/19 12:17,4/17/19 12:25,Central Harlem,Central Harlem North,1,1.28,7.0,0.0,0.5,0.0,0,0.3,7.8,2,1,0.0
3,2,4/17/19 15:49,4/17/19 15:58,Central Harlem,Central Harlem North,1,1.03,7.0,0.0,0.5,0.0,0,0.3,7.8,2,1,0.0
4,2,4/18/19 13:22,4/18/19 13:29,Central Harlem,Central Harlem North,1,1.35,7.0,0.0,0.5,0.0,0,0.3,7.8,2,1,0.0


In [7]:
# Code 1.3. 
# retrieve information about df_taxi
df_taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997632 entries, 0 to 997631
Data columns (total 17 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   VendorID               997632 non-null  int64  
 1   lpep_pickup_datetime   997632 non-null  object 
 2   lpep_dropoff_datetime  997632 non-null  object 
 3   PULocationID           997632 non-null  object 
 4   DOLocationID           997632 non-null  object 
 5   passenger_count        997632 non-null  int64  
 6   trip_distance          997632 non-null  float64
 7   fare_amount            997632 non-null  float64
 8   extra                  997632 non-null  float64
 9   mta_tax                997632 non-null  float64
 10  tip_amount             997632 non-null  float64
 11  tolls_amount           997632 non-null  int64  
 12  improvement_surcharge  997632 non-null  float64
 13  total_amount           997632 non-null  float64
 14  payment_type           997632 non-nu

<h4>1.1. pandas .describe()</h4><br>
<div align = justify>
The <b>.describe()</b> method in the pandas library is a useful tool for obtaining a summary of a dataset. While codes 1.2 and 1.3 provide a general overview of the New York Taxi trips data, there are still many questions that cannot be answered with these two functions alone. For example, you may want to know the most frequent vendor ID, the average tip amount, or the maximum amount paid. In order to answer these questions and gain a deeper understanding of the data, you can use the .describe() method.<br><br>
<h5>a. Looking at numeric variables</h5><br>
    By default, the <b>.describe()</b> method returns statistical summaries of <b>numeric variables</b> in the dataset. However, it can also be configured to provide <b>summaries of categorical variables</b>. This method is useful for quickly checking if the data aligns with its documentation and identifying any extreme values within each feature (column) of the data.
    
</div>



In [8]:
# Code 1.1.1.
df_taxi.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
count,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0
mean,1.834926,1.31707,2.692922,12.32298,0.400701,0.485981,1.051496,0.0,0.29151,14.933453,1.47044,1.02048,0.425761
std,0.371247,0.978563,2.79461,10.90192,0.584892,0.092054,1.932947,0.0,0.055105,11.750769,0.522848,0.141637,0.994419
min,1.0,0.0,0.0,-250.0,-4.5,-0.5,-21.22,0.0,-0.3,-250.0,1.0,1.0,-2.75
25%,2.0,1.0,1.02,6.5,0.0,0.5,0.0,0.0,0.3,8.3,1.0,1.0,0.0
50%,2.0,1.0,1.8,9.5,0.0,0.5,0.0,0.0,0.3,11.8,1.0,1.0,0.0
75%,2.0,1.0,3.28,14.5,0.5,0.5,1.86,0.0,0.3,18.05,2.0,1.0,0.0
max,2.0,9.0,202.1,1562.5,7.25,3.55,224.57,0.0,0.3,1563.3,5.0,2.0,2.75


<div align = justify><br>
The .describe( ) method, as shown in Code 1.1.1, can output <b>a large number of statistical measures and decimal points</b>, which can be overwhelming to interpret. To improve the appearance and interpretability of the output, it is <b>highly recommended to round the values to a lower precision</b>. This can be done with the <b>round()</b> method by specifying a fixed number of decimals. Remember that printing an excessive number of decimal points can hinder the interpretation of the analysis rather than enhance it, so strike a balance between precision and readability when presenting data. Code 1.1.2. demonstrates the new output after reducing the number of decimal points.
</div>

In [9]:
# Code 1.1.2.

# Round all numbers to two decimals.
df_taxi.describe().round(2)

Unnamed: 0,VendorID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
count,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0,997632.0
mean,1.83,1.32,2.69,12.32,0.4,0.49,1.05,0.0,0.29,14.93,1.47,1.02,0.43
std,0.37,0.98,2.79,10.9,0.58,0.09,1.93,0.0,0.06,11.75,0.52,0.14,0.99
min,1.0,0.0,0.0,-250.0,-4.5,-0.5,-21.22,0.0,-0.3,-250.0,1.0,1.0,-2.75
25%,2.0,1.0,1.02,6.5,0.0,0.5,0.0,0.0,0.3,8.3,1.0,1.0,0.0
50%,2.0,1.0,1.8,9.5,0.0,0.5,0.0,0.0,0.3,11.8,1.0,1.0,0.0
75%,2.0,1.0,3.28,14.5,0.5,0.5,1.86,0.0,0.3,18.05,2.0,1.0,0.0
max,2.0,9.0,202.1,1562.5,7.25,3.55,224.57,0.0,0.3,1563.3,5.0,2.0,2.75


<div align = justify>
The new table is <b>more organized and easier</b> to interpret than the original. It allows you to quickly identify key trends and patterns in the data, and highlights any extreme values that may be present. Using this table as a starting point for your analysis can help you get a better understanding of the dataset and identify areas that may require further investigation. Additionally, a well-organized table conveys a degree of professionalism to your audience.<br><br>
In the next stage of your analysis, it is important to also perform descriptive statistics on the <b>categorical features</b> of the dataset. This will provide additional insights into the characteristics and distribution of these variables, and can help you to better understand the relationships between different features in the data. By running descriptive statistics on both numeric and categorical features, you will have a more comprehensive understanding of the data and be better equipped to draw meaningful conclusions from your analysis.
</div>
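<div align = justify>
One possible way to act on the extreme values that .describe() highlights is sketched below. It uses a small made-up DataFrame (df_sample is a hypothetical name) that mimics the negative fare seen in the taxi data, keeps only the min and max rows of the summary, and counts how many fares are negative.
</div>

```python
import pandas as pd

# Hypothetical sample that mimics the negative fare seen in the taxi data
df_sample = pd.DataFrame(
    {
        "fare_amount": [7.0, 9.5, -250.0, 14.5],
        "trip_distance": [1.2, 1.8, 0.0, 3.28],
    }
)

# Keep only the rows of the summary that reveal extreme values
summary = df_sample.describe().round(2)
extremes = summary.loc[["min", "max"]]
print(extremes)

# Count how many observations have a negative fare
n_negative_fares = int((df_sample["fare_amount"] < 0).sum())
print(n_negative_fares)
```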

<h5> b. Looking at categorical variables</h5><br><br>
<div align = justify>
As shown in the info table in Code 1.3, most of the features in the dataset are numeric, but there are a few that are stored as objects (strings), including the <b>lpep_pickup_datetime</b>, <b>lpep_dropoff_datetime</b>, <b>PULocationID</b>, and <b>DOLocationID</b> columns. The first two contain dates that have been stored as strings, and it may be useful to transform them into proper <b>date data types</b> for further analysis.<br><br>
To obtain summary statistics for categorical variables using the .describe() method, you can specify the <b>include</b> optional parameter and set it to <b>object</b> to indicate that you want to include only categorical data in the summary. As demonstrated in Code 1.1.3, the resulting summary table for categorical variables differs from the one for numeric variables, as it provides information about the <b>frequency</b>, number of <b>unique values</b>, and <b>mode</b> of each categorical feature rather than statistical metrics. This information can be useful for understanding the distribution and characteristics of the categorical data in the dataset.
</div>


In [10]:
# Code 1.1.3.

# describe statistics in categorical variables
df_taxi.describe(include = 'object')

Unnamed: 0,lpep_pickup_datetime,lpep_dropoff_datetime,PULocationID,DOLocationID
count,997632,997632,997632,997632
unique,91291,91297,257,257
top,5/30/19 18:14,6/7/19 18:57,East Harlem North,East Harlem North
freq,43,44,72921,39555


<br><br><br><br><br><br><br><br><br>

<h4>1.2. Diving deeper into categorical variables</h4><br>
<div align = justify>
While the .describe( ) method is a useful tool for obtaining summary statistics for categorical data, it does not provide a comprehensive overview of all the values within the variables. To gain a more detailed understanding of the frequency of different categorical values, you can use the <b>.value_counts()</b> method, which is a Pandas Series method that returns the counts of unique values in a Pandas Series.<br><br>    
To apply the .value_counts() method to a column in a Pandas DataFrame, <b>you must retrieve the column as a Pandas Series</b> using the following syntax:

</div>

~~~
    DataFrame['column_name'].value_counts()
~~~

<br>
<em><b>Note: </b>The .value_counts() method can be used with both numeric and non-numeric data. It is a useful tool for getting a detailed breakdown of the frequency of different values within a categorical variable.</em>
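<div align = justify>
If you are interested in proportions rather than raw counts, .value_counts() accepts a <b>normalize</b> parameter. The sketch below uses a made-up Series of pickup locations (not the real dataset) to compare both outputs.
</div>

```python
import pandas as pd

# Hypothetical sample of pickup locations
pickups = pd.Series(
    ["East Harlem North", "Astoria", "East Harlem North", "Elmhurst"],
    name="PULocationID",
)

# Absolute frequency of each value, most frequent first
counts = pickups.value_counts()
print(counts)

# Relative frequency (share of all observations)
shares = pickups.value_counts(normalize=True)
print(shares)
```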

In [11]:
# Code 1.2.
df_taxi['PULocationID'].value_counts()

East Harlem North                    72921
East Harlem South                    66529
Central Harlem                       59818
Elmhurst                             47574
Astoria                              45273
                                     ...  
Rossville/Woodrow                        1
Jamaica Bay                              1
Eltingville/Annadale/Prince's Bay        1
Oakwood                                  1
Rikers Island                            1
Name: PULocationID, Length: 257, dtype: int64

<h3>2. Convert data types</h3><br>
<h4>2.1. pandas .astype()</h4><br>
<div align = justify>
As mentioned in section 1.1.b., the features <em>lpep_pickup_datetime and lpep_dropoff_datetime</em> were read as string types, but they actually represent dates. Additionally, the "VendorID" feature represents a categorical variable despite being a number. Oftentimes, you will need to convert some features of your DataFrame to the data type that you need. In this case, you must transform <em>VendorID</em> to string and <em>lpep_pickup_datetime and lpep_dropoff_datetime</em> to dates.<br><br>
The most common way to convert a feature's data type is to use the <b>astype()</b> method. This method changes the type of a complete column. The structure is as follows:
</div>

~~~
    Single feature:
    df_name['column_name'] = df_name['column_name'].astype(datatype)
    
    Multiple features:
    df_name[['column_name_1', 'column_name_2', ... 'column_name_n']] = df_name[['column_name_1', 'column_name_2', ... 'column_name_n']].astype(datatype)
~~~ 
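<div align = justify>
When different columns need different target types, astype() also accepts a dictionary that maps column names to data types. The following is a minimal sketch on a made-up DataFrame (df_sample and its values are hypothetical); columns that are not listed in the dictionary keep their original type.
</div>

```python
import pandas as pd

# Hypothetical sample with three numeric columns
df_sample = pd.DataFrame(
    {
        "VendorID": [1, 2, 2],
        "passenger_count": [1, 3, 2],
        "fare_amount": [7.0, 9.5, 14.5],
    }
)

# Convert two columns at once; fare_amount is left untouched
df_sample = df_sample.astype({"VendorID": "str", "passenger_count": "float64"})
print(df_sample.dtypes)
```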
    

In [12]:
# Code 2.1.1. Transform VendorID into a string type
df_taxi['VendorID'] = df_taxi['VendorID'].astype('str')

<h4>2.1.2. Working with date data types</h4><br>
<div align = justify>
Working with date data types offers several advantages. For instance, you can extract the year or the month directly from a date and use it for different purposes. Currently, the <em>lpep_pickup_datetime and lpep_dropoff_datetime</em> features are stored as strings. The best way to transform these columns into a native pandas datetime data type is to follow this structure:
</div>

~~~
    df_name['column_name'] = pd.to_datetime(df_name['column_name'])
~~~
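<div align = justify>
pd.to_datetime() can also be told which format to expect and what to do with values it cannot parse. The sketch below is only an illustration: the sample strings imitate the month/day/year layout shown in Code 1.2, and the format string is an assumption about that layout rather than something documented by the data provider.
</div>

```python
import pandas as pd

# Hypothetical strings in the same layout as the taxi file, plus one bad value
raw = pd.Series(["4/17/19 8:58", "4/18/19 13:22", "not a date"])

# format states the expected layout; errors="coerce" turns bad values into NaT
parsed = pd.to_datetime(raw, format="%m/%d/%y %H:%M", errors="coerce")
print(parsed)
```

<div align = justify>
Counting the missing values afterwards (for example with parsed.isna().sum()) is a quick way to find out how many entries did not match the expected format.
</div>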


In [13]:
# Code 2.1.2.a. Transform strings into pandas datetime type.
df_taxi['lpep_pickup_datetime'] = pd.to_datetime(df_taxi['lpep_pickup_datetime'])
df_taxi['lpep_dropoff_datetime'] = pd.to_datetime(df_taxi['lpep_dropoff_datetime'])
df_taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997632 entries, 0 to 997631
Data columns (total 17 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   VendorID               997632 non-null  object        
 1   lpep_pickup_datetime   997632 non-null  datetime64[ns]
 2   lpep_dropoff_datetime  997632 non-null  datetime64[ns]
 3   PULocationID           997632 non-null  object        
 4   DOLocationID           997632 non-null  object        
 5   passenger_count        997632 non-null  int64         
 6   trip_distance          997632 non-null  float64       
 7   fare_amount            997632 non-null  float64       
 8   extra                  997632 non-null  float64       
 9   mta_tax                997632 non-null  float64       
 10  tip_amount             997632 non-null  float64       
 11  tolls_amount           997632 non-null  int64         
 12  improvement_surcharge  997632 non-null  floa

<br><br><br><div align = justify>
Code 2.1.2.a. has changed the data type of the columns lpep_pickup_datetime and lpep_dropoff_datetime to a <b>datetime data type</b> (datetime64[ns]), which allows for more efficient manipulation of date and time data using specific datetime methods. For example, code 2.1.2.b. allows you to <b>extract the year</b> of the pickup datetime column. Furthermore, you can store the year of the date as a separate feature in your dataframe, as code 2.1.2.c. depicts.<br><br>
    Remember that you need to become familiar with the dataset and check if the data make sense for the analysis you want to perform. In this case, you've been given a dataset to analyze the New York taxi trips during 2019. However, <b>are you sure that the data is aligned with the request? </b>
Let's continue checking this dataset on the next section.
</div>
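<div align = justify>
The year is only one of the components available through the <b>.dt</b> accessor. As a rough sketch on a made-up DataFrame (the column names pickup and dropoff are hypothetical), the example below extracts the month and the weekday name, and computes the trip duration in minutes by subtracting two datetime columns.
</div>

```python
import pandas as pd

# Hypothetical sample with two trips
df_sample = pd.DataFrame(
    {
        "pickup": pd.to_datetime(["2019-04-17 08:58:00", "2019-04-18 13:22:00"]),
        "dropoff": pd.to_datetime(["2019-04-17 09:04:00", "2019-04-18 13:29:00"]),
    }
)

# Extract date components through the .dt accessor
df_sample["month_pu"] = df_sample["pickup"].dt.month
df_sample["weekday_pu"] = df_sample["pickup"].dt.day_name()

# Subtracting two datetime columns gives a timedelta; convert it to minutes
df_sample["duration_min"] = (
    df_sample["dropoff"] - df_sample["pickup"]
).dt.total_seconds() / 60

print(df_sample)
```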

In [14]:
# Code 2.1.2.b.
df_taxi['lpep_pickup_datetime'].dt.year.head()

0    2019
1    2019
2    2019
3    2019
4    2019
Name: lpep_pickup_datetime, dtype: int64

In [15]:
# Code 2.1.2.c. 
df_taxi['year_pu'] = df_taxi['lpep_pickup_datetime'].dt.year
df_taxi['year_do'] = df_taxi['lpep_dropoff_datetime'].dt.year
df_taxi.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge,year_pu,year_do
0,2,2019-04-17 08:58:00,2019-04-17 09:04:00,Central Harlem,Central Harlem North,1,1.2,7.0,0.0,0.5,0.0,0,0.3,7.8,2,1,0.0,2019,2019
1,2,2019-04-17 12:36:00,2019-04-17 12:43:00,Central Harlem,Central Harlem North,1,1.12,7.0,0.0,0.5,0.0,0,0.3,7.8,2,1,0.0,2019,2019
2,2,2019-04-17 12:17:00,2019-04-17 12:25:00,Central Harlem,Central Harlem North,1,1.28,7.0,0.0,0.5,0.0,0,0.3,7.8,2,1,0.0,2019,2019
3,2,2019-04-17 15:49:00,2019-04-17 15:58:00,Central Harlem,Central Harlem North,1,1.03,7.0,0.0,0.5,0.0,0,0.3,7.8,2,1,0.0,2019,2019
4,2,2019-04-18 13:22:00,2019-04-18 13:29:00,Central Harlem,Central Harlem North,1,1.35,7.0,0.0,0.5,0.0,0,0.3,7.8,2,1,0.0,2019,2019


<h3>3. Operations with Pandas</h3>

<h4>3.1. Filter the dataset</h4><br>
<div align = justify>
    When working with data, it's common to encounter messy or inconsistent information. To ensure that your analysis is accurate and meaningful, it's crucial to thoroughly examine the data to confirm that it aligns with your expectations. Code 3.1.1. and 3.1.2. identify observations that have information from a year other than 2019. To perform an analysis based on 2019, you must remove any data point that does not belong to the year 2019 before going further.
</div>

In [16]:
# Code 3.1.1
df_taxi[df_taxi['year_pu'] != 2019].head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge,year_pu,year_do
23430,2,2008-12-31 23:21:00,2008-12-31 23:27:00,East Harlem North,Central Harlem,1,1.08,6.5,0.0,0.5,0.0,0,0.3,7.3,2,1,0.0,2008,2008
47620,2,2009-01-01 09:16:00,2009-01-01 09:27:00,East Tremont,Fordham South,1,0.73,5.5,0.0,0.5,0.0,0,0.3,6.3,2,1,0.0,2009,2009
71690,2,2008-12-31 23:42:00,2008-12-31 23:47:00,East Harlem North,Central Harlem,1,0.57,4.5,0.0,0.5,0.0,0,0.3,5.3,2,1,0.0,2008,2008
87932,2,2010-09-23 01:02:00,2010-09-23 10:14:00,Kew Gardens,South Ozone Park,1,2.05,8.5,0.0,0.5,0.0,0,0.3,9.3,2,1,0.0,2010,2010
103944,2,2010-09-23 00:01:00,2010-09-23 18:19:00,Forest Hills,Kew Gardens,1,1.29,9.5,0.0,0.5,0.0,0,0.3,10.3,2,1,0.0,2010,2010


In [17]:
# Code 3.1.2.
df_taxi[df_taxi['year_do'] != 2019].head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge,year_pu,year_do
23430,2,2008-12-31 23:21:00,2008-12-31 23:27:00,East Harlem North,Central Harlem,1,1.08,6.5,0.0,0.5,0.0,0,0.3,7.3,2,1,0.0,2008,2008
47620,2,2009-01-01 09:16:00,2009-01-01 09:27:00,East Tremont,Fordham South,1,0.73,5.5,0.0,0.5,0.0,0,0.3,6.3,2,1,0.0,2009,2009
71690,2,2008-12-31 23:42:00,2008-12-31 23:47:00,East Harlem North,Central Harlem,1,0.57,4.5,0.0,0.5,0.0,0,0.3,5.3,2,1,0.0,2008,2008
87932,2,2010-09-23 01:02:00,2010-09-23 10:14:00,Kew Gardens,South Ozone Park,1,2.05,8.5,0.0,0.5,0.0,0,0.3,9.3,2,1,0.0,2010,2010
103944,2,2010-09-23 00:01:00,2010-09-23 18:19:00,Forest Hills,Kew Gardens,1,1.29,9.5,0.0,0.5,0.0,0,0.3,10.3,2,1,0.0,2010,2010


<div align = justify>
As presented in Chapter 06 - Slicing with conditions (section 3.3.), it is possible to selectively extract (slice) data from a DataFrame using logical conditions. In this particular case, you must filter the dataset and remove any observations that do not belong to the year 2019. As a reminder, the structure for slicing a DataFrame based on conditions is as follows:
</div>

~~~
i. DataFrame.loc[ rows, columns ][ condition 1 ][ condition 2 ][ condition n ]
ii. DataFrame[ (condition_1) & (condition_2) & ... & (condition_n) ]
~~~
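<div align = justify>
A third pattern that some analysts find easier to read is to store the combined condition in a variable (a boolean mask) and pass it to <b>.loc</b> together with the columns to keep. The sketch below applies it to a made-up DataFrame; df_sample and is_2019 are hypothetical names used only for this illustration.
</div>

```python
import pandas as pd

# Hypothetical sample with pickup and drop-off years
df_sample = pd.DataFrame(
    {
        "year_pu": [2019, 2008, 2019, 2019],
        "year_do": [2019, 2008, 2019, 2020],
        "fare_amount": [7.0, 6.5, 9.5, 8.5],
    }
)

# Each condition is wrapped in parentheses and combined with &
is_2019 = (df_sample["year_pu"] == 2019) & (df_sample["year_do"] == 2019)

# .loc takes the row mask first and the columns to keep second
df_2019 = df_sample.loc[is_2019, ["year_pu", "fare_amount"]]
print(df_2019)
```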

<div align = justify><br>
Code 3.1.3 creates a new DataFrame, df_taxis_filtered, by filtering out observations where the pickup or drop-off year is not 2019. Now, df_taxis_filtered returns an empty DataFrame if you try to slice any observation with a year other than 2019.<br><br>
    
<em><b>Note: </b>It is a good practice to create a new dataframe when filtering or performing significant operations on your dataset. That way, you can come back to the original data easily and quickly.</em>
</div>

In [19]:
# Code 3.1.3.

# Create a new filtered data frame (Best practice)
df_taxis_filtered = df_taxi[(df_taxi['year_pu'] == 2019) & (df_taxi['year_do'] == 2019)]

# print any observations with a pick up or drop off year different than 2019 (Results in empty dataframes!)
print(df_taxis_filtered[df_taxis_filtered['year_pu'] != 2019])
print(df_taxis_filtered[df_taxis_filtered['year_do'] != 2019])

Empty DataFrame
Columns: [VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge, year_pu, year_do]
Index: []
Empty DataFrame
Columns: [VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge, year_pu, year_do]
Index: []


<h4>3.2. Mathematical operations</h4><br>
<div align = justify>
    Working with pandas DataFrames has multiple advantages. For example, you can perform mathematical operations on their features. You can use standard mathematical operations such as multiplication, addition, and division on a feature, and pandas will apply them to each element of the column. <br><br>
    Code 3.2.1 demonstrates this by adding a new column to the df_example DataFrame, which is the result of multiplying column_1 by 100. As expected, pandas multiplies each observation in column_1 by 100 and stores the result in the new feature called column_3.    
</div>


In [20]:
# Code 3.2.1.
# create a dataframe
df_example = pd.DataFrame(
                            {
                                "column_1": [1.0, 2.0, 3.0, 4.0],
                                "column_2": pd.Categorical(["test", "train", "test", "train"]),
                            }
                        )

df_example['column_3'] = df_example['column_1'] * 100
df_example

Unnamed: 0,column_1,column_2,column_3
0,1.0,test,100.0
1,2.0,train,200.0
2,3.0,test,300.0
3,4.0,train,400.0


<h5>3.2.1. Calculate the percent of tips per trip</h5><br>
<div align = justify>
Suppose you are tasked with calculating the <b>percentage of tips per trip in the New York taxis trips dataset</b> and only returning observations where the client gave a tip to the driver. One way to accomplish this is by performing <b>mathematical operations on the columns of the DataFrame.</b> <br><br>
    Code 3.2.1.1 calculates the percentage of tips by dividing the 'tip_amount' feature by the 'total_amount' column and then multiplying the result by 100. When dividing, there is always a risk of dividing by zero. Pandas does not raise an error in that case: dividing 0 by 0 produces NaN, while dividing a non-zero number by 0 produces an infinite value (inf). To handle the NaN results, you can tell Pandas to replace any null observations in the new feature with 0.<br><br>
    Code 3.2.1.2 then replaces any NaN values with 0, allowing you to provide the final DataFrame that only contains observations where the client gave a tip to the driver. In fact, the data shows that <b>38.2%</b> of the taxi trips in New York between April and June 2019 resulted in a tip for the driver.

<em><b>Note: </b>Depending on your analysis, you might want to remove those observations where the total amount was reimbursed (negative).</em>
</div>
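<div align = justify>
The sketch below illustrates the division cases discussed above on a made-up DataFrame (trips and trips_2019 are hypothetical names). It also takes an explicit <b>.copy()</b> of the filtered data before adding a column, which is one common way to avoid the "A value is trying to be set on a copy of a slice" warning, and it assigns the result of fillna() back to the column instead of using inplace. This is an alternative pattern offered for comparison, not the code used in this chapter.
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering the three division cases
trips = pd.DataFrame(
    {
        "tip_amount": [2.0, 0.0, 1.0, 0.0],
        "total_amount": [10.0, 0.0, 0.0, 8.0],
        "year_pu": [2019, 2019, 2019, 2019],
    }
)

# .copy() makes the filtered result an independent DataFrame
trips_2019 = trips[trips["year_pu"] == 2019].copy()

# 2/10 -> 20.0, 0/0 -> NaN, 1/0 -> inf, 0/8 -> 0.0
trips_2019["percent_tips"] = (
    trips_2019["tip_amount"] / trips_2019["total_amount"] * 100
)
raw_percent = trips_2019["percent_tips"].tolist()

# Turn infinite values into NaN, then replace every NaN with 0
trips_2019["percent_tips"] = (
    trips_2019["percent_tips"].replace([np.inf, -np.inf], np.nan).fillna(0)
)
print(trips_2019)
```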

In [21]:
# Code 3.2.1.1.

# Create a new column with the percentage of tips per trip
df_taxis_filtered['percent_tips'] = df_taxis_filtered['tip_amount'] / df_taxis_filtered['total_amount'] * 100

# print filtered dataset
df_taxis_filtered[['tip_amount', 'total_amount', 'percent_tips' ]][df_taxis_filtered['percent_tips'] != 0].head()




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_taxis_filtered['percent_tips'] = df_taxis_filtered['tip_amount'] / df_taxis_filtered['total_amount'] * 100


Unnamed: 0,tip_amount,total_amount,percent_tips
602712,0.0,0.0,
602713,0.0,0.0,
602714,0.0,0.0,
602715,0.0,0.0,
602716,0.0,0.0,


In [23]:
# Code 3.2.1.2.

# Replace NaN values (trips where both tip_amount and total_amount were 0) with 0.
df_taxis_filtered['percent_tips'].fillna(0, inplace = True)

# Filter the dataset to see only those observations where a client gave the driver a tip.
df_taxis_filtered[['tip_amount', 'total_amount', 'percent_tips' ]][df_taxis_filtered['percent_tips'] > 0].head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_taxis_filtered['percent_tips'].fillna(0, inplace = True)


Unnamed: 0,tip_amount,total_amount,percent_tips
616574,-0.76,-4.56,16.666667
616575,-0.76,-4.56,16.666667
616576,7.25,31.5,23.015873
616577,-0.01,-5.31,0.188324
616578,8.56,51.36,16.666667


<h4>3.3. What is the average number of trips per week?</h4><br>
<div align = justify>
Let's delve into the New York taxi trips dataset to determine the average number of trips per week between April and June of 2019. To calculate this metric, you must first identify the week in which each trip took place, count the number of trips for each week, and then calculate the average. This process may seem daunting at first, but Pandas provides methods to make this calculation simple.<br>
    
<h5>3.3.1. Taking advantage of date pandas data types</h5><br>
In section 2.1.2. you transformed the features lpep_pickup_datetime and lpep_dropoff_datetime into a pandas datetime data type. One advantage of this transformation is the ability to use the <b>.dt.isocalendar()</b> method and its <b>week</b> column to extract the week number (based on the ISO calendar) for a given date.<br>
    Code 3.3.1. demonstrates how to calculate the week number of each observation.
</div>
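<div align = justify>
To see what .dt.isocalendar() returns, the sketch below applies it to three made-up dates. The result is a small DataFrame with the columns year, week, and day. The last date is included to show that ISO weeks can cross calendar years, so when a dataset spans more than one year it may be safer to keep the ISO year next to the week number.
</div>

```python
import pandas as pd

# Hypothetical dates; the last one falls in the first ISO week of 2019
dates = pd.Series(pd.to_datetime(["2019-04-17", "2019-04-22", "2018-12-31"]))

# isocalendar() returns a DataFrame with year, week and day columns
iso = dates.dt.isocalendar()
print(iso)

# ISO weeks start on Monday, so 2019-04-22 opens week 17
weeks = iso["week"].tolist()
iso_years = iso["year"].tolist()
```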

In [20]:
# Code 3.3.1.

# Store the week number of each pick up and drop off date into a new column
df_taxis_filtered['week_pu'] = df_taxis_filtered['lpep_pickup_datetime'].dt.isocalendar().week
df_taxis_filtered['week_do'] = df_taxis_filtered['lpep_dropoff_datetime'].dt.isocalendar().week

# print the first 5 observations
df_taxis_filtered.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_taxis_filtered['week_pu'] = df_taxis_filtered['lpep_pickup_datetime'].dt.isocalendar().week
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_taxis_filtered['week_do'] = df_taxis_filtered['lpep_dropoff_datetime'].dt.isocalendar().week


Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,...,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge,year_pu,year_do,percent_tips,week_pu,week_do
0,2,2019-04-17 08:58:00,2019-04-17 09:04:00,Central Harlem,Central Harlem North,1,1.2,7.0,0.0,0.5,...,0.3,7.8,2,1,0.0,2019,2019,0.0,16,16
1,2,2019-04-17 12:36:00,2019-04-17 12:43:00,Central Harlem,Central Harlem North,1,1.12,7.0,0.0,0.5,...,0.3,7.8,2,1,0.0,2019,2019,0.0,16,16
2,2,2019-04-17 12:17:00,2019-04-17 12:25:00,Central Harlem,Central Harlem North,1,1.28,7.0,0.0,0.5,...,0.3,7.8,2,1,0.0,2019,2019,0.0,16,16
3,2,2019-04-17 15:49:00,2019-04-17 15:58:00,Central Harlem,Central Harlem North,1,1.03,7.0,0.0,0.5,...,0.3,7.8,2,1,0.0,2019,2019,0.0,16,16
4,2,2019-04-18 13:22:00,2019-04-18 13:29:00,Central Harlem,Central Harlem North,1,1.35,7.0,0.0,0.5,...,0.3,7.8,2,1,0.0,2019,2019,0.0,16,16


<h5>3.3.2. Grouping by</h5><br>
<div align = justify>
    Similar to the SQL language, you can group observations in a DataFrame based on its features using Pandas' <b>groupby method</b>. In the background, Pandas splits the object, applies an aggregate function to each group (e.g., mean, sum, among others), and combines the results. <a href = https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html >Group by </a>is useful for grouping large amounts of data and performing computations based on those groups. <br><br>
    Code 3.3.2.1. demonstrates how to use the groupby method to perform a sum operation based on the VendorID feature. As shown in the code output, the resulting DataFrame has only two rows (Vendor ID 1 and 2) and each column provides the sum of all the numeric values in the original DataFrame for that group.
</div>

~~~
    df_grouped = df.groupby(by = ['column_name_1', 'column_name_2', ..., 'column_name_n']).function()
~~~
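<div align = justify>
You are not limited to a single aggregate function per group. As a minimal sketch on a made-up DataFrame (df_sample and its values are hypothetical), the example below selects one column after grouping and uses <b>.agg()</b> to compute several statistics at once.
</div>

```python
import pandas as pd

# Hypothetical sample with two vendors
df_sample = pd.DataFrame(
    {
        "VendorID": ["1", "2", "2", "1"],
        "fare_amount": [7.0, 9.5, 14.5, 6.5],
        "tip_amount": [0.0, 1.86, 2.0, 0.0],
    }
)

# Group by vendor, pick one column, and apply several aggregate functions
fare_by_vendor = df_sample.groupby("VendorID")["fare_amount"].agg(
    ["sum", "mean", "count"]
)
print(fare_by_vendor)
```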


In [21]:
# Code 3.3.2.1.

# Group DataFrame by Vendor
df_taxis_f_group = df_taxis_filtered.groupby(['VendorID']).sum(numeric_only = True)
df_taxis_f_group.head()

Unnamed: 0_level_0,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge,year_pu,year_do,percent_tips,week_pu,week_do
VendorID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1,190553,482761.2,2201775.7,122600.83,80104.45,146201.79,0,48076.5,2598759.27,244387,168546,67053.25,332492958,332492958,871657.5,3352485,3352593
2,1123366,2203691.95,10091636.47,277143.83,404713.72,902782.65,0,242736.0,12298886.77,1222525,849491,357691.5,1681673556,1681673556,5361993.0,16993454,16994645


<h5>3.3.2. Calculate the average number of trips per week</h5><br>
<div align = justify>
You are close to completing this chapter. The next step is to group the DataFrame based on the new <em>week_pu</em> column created previously. To count the number of trips per week, you can use the <b>.count() method</b>, which will count the number of observations per week number. Additionally, instead of using the entire DataFrame, you can choose to include only one of its columns, such as VendorID. Code 3.3.2.1 demonstrates this operation.<br><br>
</div>

In [22]:
# Code 3.3.2.1.

# group dataframe by week number and count the number of observations
df_taxis_f_group = df_taxis_filtered[["VendorID", "week_pu"]].groupby(by = ['week_pu']).count()

df_taxis_f_group

Unnamed: 0_level_0,VendorID
week_pu,Unnamed: 1_level_1
16,79877
17,106106
18,109203
19,109443
20,113567
21,104215
22,103320
23,102270
24,106196
25,63409


<div align = justify>
Finally, to calculate the average number of trips per week, you just need to calculate the mean of this new DataFrame. <br><br>
Code 3.3.2.2. presents the solution to the question: What is the average number of taxi trips per week between April and June of 2019? Well, now you know that between April and June of 2019 there were 99,760 taxi trips per week on average.
</div>
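<div align = justify>
The same count-then-average logic can be written more compactly with <b>.size()</b>, which counts the rows in each group without needing a helper column. The sketch below runs on a made-up DataFrame with six trips (df_sample is a hypothetical name). Keep in mind that any weeks only partially covered by the data are still counted as full weeks in this kind of average.
</div>

```python
import pandas as pd

# Hypothetical sample: six trips spread over three weeks
df_sample = pd.DataFrame({"week_pu": [16, 16, 17, 17, 17, 18]})

# size() returns the number of rows in each group as a Series
trips_per_week = df_sample.groupby("week_pu").size()
print(trips_per_week)

# The mean of that Series is the average number of trips per week
avg_trips_per_week = trips_per_week.mean()
print(avg_trips_per_week)
```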

In [23]:
# Code 3.3.2.2.

# average number of trips per week
trips_p_week = int(df_taxis_f_group['VendorID'].mean())

print(f"""The average number of taxi trips in NY between April and June of 2019 is:
{trips_p_week}""")


The average number of taxi trips in NY between April and June of 2019 is:
99760


<h3>4. Descriptive statistics again</h3><br>
<div align = justify>
Throughout this chapter, you have performed various operations on the original dataset and have gained a deeper understanding of the data. <b>It is highly recommended to periodically review the descriptive statistics</b> of your DataFrame. Comparing the new results with the original dataset can help ensure that the results are meaningful and aligned with your research goals before proceeding with further data analysis.
    
</div>

In [36]:
print(f"{'-'*40}   Numeric Descriptive Statistics   {'-'*40}")
df_taxis_filtered.describe().round(2)

----------------------------------------   Numeric Descriptive Statistics   ----------------------------------------


Unnamed: 0,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge,year_pu,year_do,percent_tips,week_pu,week_do
count,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0,997606.0
mean,1.32,2.69,12.32,0.4,0.49,1.05,0.0,0.29,14.93,1.47,1.02,0.43,2019.0,2019.0,6.25,20.39,20.4
std,0.98,2.79,10.9,0.58,0.09,1.93,0.0,0.06,11.75,0.52,0.14,0.99,0.0,0.0,8.59,2.71,2.71
min,0.0,0.0,-250.0,-4.5,-0.5,-21.22,0.0,-0.3,-250.0,1.0,1.0,-2.75,2019.0,2019.0,-0.0,16.0,16.0
25%,1.0,1.02,6.5,0.0,0.5,0.0,0.0,0.3,8.3,1.0,1.0,0.0,2019.0,2019.0,0.0,18.0,18.0
50%,1.0,1.8,9.5,0.0,0.5,0.0,0.0,0.3,11.8,1.0,1.0,0.0,2019.0,2019.0,0.0,20.0,20.0
75%,1.0,3.28,14.5,0.5,0.5,1.86,0.0,0.3,18.05,2.0,1.0,0.0,2019.0,2019.0,16.67,23.0,23.0
max,9.0,202.1,1562.5,7.25,3.55,224.57,0.0,0.3,1563.3,5.0,2.0,2.75,2019.0,2019.0,100.0,25.0,25.0


In [35]:
print(f"{'-'*10} Categorical Descriptive Statistics {'-'*10}")
df_taxis_filtered.describe(include = object)

---------- Categorical Descriptive Statistics ----------


Unnamed: 0,VendorID,PULocationID,DOLocationID
count,997606,997606,997606
unique,2,257,257
top,2,East Harlem North,East Harlem North
freq,832924,72918,39554


<h3>5. Summary</h3><br>
<div align = justify>
Congratulations! You've made it this far. You are ready to start diving into more complex programs to analyze data, like regressions, machine learning algorithms, and more!
Best of luck in your path with Python!!
    
In this chapter you covered:
    <ul>
        <li>How to explore DataFrames</li>
        <li>Differences between exploring numeric and categorical variables</li>
        <li>How to convert data types in DataFrames</li>
        <li>Operating with DataFrames</li>
    </ul>
</div>

<div align = center>
    <h1> AWESOME WORK!</h1>
</div>