<h1 align=center><font size = 5> New York Taxi Analysis</font></h1>

## 1. Introduction

A taxi company faces the problem of assigning cabs to passengers efficiently. One of the main issues is determining the duration of the current trip so that the company can predict when the cab will be free for the next one. You are challenged to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission. To build the best model, you need to analyze the data and find the features most strongly related to trip duration. In this project, you will practice data analysis and data visualization skills in Python, including describing the data, handling missing values, data cleansing, feature engineering, and feature selection, to gain insight into the data and determine how the different variables relate to the target variable **Trip Duration**.

## 2. Data description


### File descriptions
train.csv - contains 1458644 trip records

weather_data_nyc.csv - daily weather information for the dates covered by the trips

fastest_routes_train_part_1.csv

fastest_routes_train_part_2.csv


### Data fields
#### train.csv:
- id - a unique identifier for each trip
- vendor_id - a code indicating the provider associated with the trip record
- pickup_datetime - date and time when the meter was engaged
- dropoff_datetime - date and time when the meter was disengaged
- passenger_count - the number of passengers in the vehicle (driver entered value)
- pickup_longitude - the longitude where the meter was engaged
- pickup_latitude - the latitude where the meter was engaged
- dropoff_longitude - the longitude where the meter was disengaged
- dropoff_latitude - the latitude where the meter was disengaged
- store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle did not have a connection to the server (Y = store and forward; N = not a store and forward trip)
- trip_duration - duration of the trip in seconds; this is the target variable in the training data

#### weather_data_nyc.csv

Weather data collected from the National Weather Service, covering the first six months of 2016 for a weather station in Central Park. For each day it contains the minimum, maximum, and average temperature, precipitation, new snowfall, and current snow depth. Temperature is measured in degrees Fahrenheit and depth in inches. T means that there is a trace of precipitation.

- date - the date the data was collected
- maximum temperature
- minimum temperature
- average temperature
- precipitation
- snow fall
- snow depth

#### fastest routes

This is suggested information about the fastest route from the starting street to the ending street of each trip, which can help you estimate the duration.

- id - a unique identifier for each trip
- starting_street - the street where the trip starts
- end_street - the street where the trip ends
- total_distance
- total_travel_time
- number_of_steps
- street_for_each_step
- distance_per_step
- travel_time_per_step
- step_maneuvers
- step_direction
- step_location_list



<h1 align=center><font size = 5>Project Requirements</font></h1>

Complete the following tasks in this project:



### Import Python Package

In [1]:
import pandas as pd

### <span style="color:blue">1.Load data</span>
- Load the data described above into Pandas DataFrames

In [2]:
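# Load the trip and weather tables
df_trip = pd.read_csv('data/trip.csv')
df_weather = pd.read_csv('data/weather_data_nyc.csv')
# The fastest-routes data is split into two files; read both parts
df_1 = pd.read_csv('data/fastest_routes_train_part_1.csv')
df_2 = pd.read_csv('data/fastest_routes_train_part_2.csv')
# Stack the parts and renumber the rows so the index has no duplicates
df_final_routes = pd.concat([df_1, df_2], ignore_index=True)
# Save without the index so reloading does not add an extra unnamed column
df_final_routes.to_csv('data/final_routes.csv', index=False)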

### <span style="color:blue">2.Reformat data type</span>
- Some datetime fields have the wrong data type. Write code to convert them to the correct type.

In [4]:
#write your code here
df_trip.dtypes


id                     object
vendor_id               int64
pickup_datetime        object
dropoff_datetime       object
passenger_count         int64
pickup_longitude      float64
pickup_latitude       float64
dropoff_longitude     float64
dropoff_latitude      float64
store_and_fwd_flag     object
trip_duration           int64
dtype: object
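
The output above shows `pickup_datetime` and `dropoff_datetime` stored as `object` (plain strings). One possible conversion is sketched below on a tiny made-up frame; `df_trip_demo` and the timestamp format string are assumptions for illustration and should be checked against the real file.

```python
import pandas as pd

# Hypothetical stand-in for df_trip; the real frame comes from the load step.
df_trip_demo = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38"],
})

# Parse the two timestamp columns; the format string is an assumption.
for col in ["pickup_datetime", "dropoff_datetime"]:
    df_trip_demo[col] = pd.to_datetime(df_trip_demo[col], format="%Y-%m-%d %H:%M:%S")

# Sanity check: elapsed seconds between pickup and dropoff,
# which can be compared with the trip_duration column.
elapsed = (
    df_trip_demo["dropoff_datetime"] - df_trip_demo["pickup_datetime"]
).dt.total_seconds()
```

Comparing `elapsed` with `trip_duration` is a quick way to spot inconsistent records after the conversion.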

### <span style="color:blue">3.Descriptive Statistics </span>
- Use descriptive statistics to find insights in the three tables. Write your findings in the report.

In [4]:
#write your code here
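
As a starting point, the sketch below summarizes a small made-up frame with standard pandas calls. The frame `df_demo` and its values are illustrative only; the same calls can be applied to each of the three tables.

```python
import pandas as pd

# Made-up sample with the same column names as train.csv.
df_demo = pd.DataFrame({
    "passenger_count": [1, 2, 1, 0, 6],
    "trip_duration": [455, 663, 2124, 1, 86390],
    "store_and_fwd_flag": ["N", "N", "Y", "N", None],
})

# Count, mean, std, min, quartiles, and max for the numeric columns.
summary = df_demo.describe()

# Number of missing values per column.
missing = df_demo.isna().sum()

# Frequency of each category, including missing entries.
flag_counts = df_demo["store_and_fwd_flag"].value_counts(dropna=False)
```

Extreme values in `summary` (for example a one-second trip or a zero passenger count) are candidates for the data cleansing step.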

#### 3.1. Univariate Analysis
- Univariate analysis is the simplest form of data analysis. “Uni” means “one”, so the analysis looks at only one variable at a time. It does not deal with causes or relationships (unlike regression); its main purpose is to describe: it takes data, summarizes it, and finds patterns in it.
- Your objective is to find features that correlate with trip duration and use them to predict the duration of a taxi trip. As a first step, you need to understand each column (feature) in train.csv, which will help you find the best features.



##### 3.1.1. Distribution of trip duration
- Visualize the distribution of trip duration. It helps to transform the duration to a log10 scale first.
- Write your findings

In [5]:
#write your code here
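
A minimal sketch of the log10 transform, using made-up durations and NumPy to compute the histogram counts. The bin count and range are assumptions chosen for the example; the resulting arrays can be passed to any plotting library.

```python
import numpy as np
import pandas as pd

# Made-up durations in seconds; real values come from the trip_duration column.
durations = pd.Series([1, 60, 455, 663, 900, 2124, 3600, 86390], name="trip_duration")

# log10 compresses the long right tail; durations must be positive,
# otherwise the result is -inf or NaN.
log_duration = np.log10(durations)

# Histogram over 10^0 .. 10^5 seconds with one bin per order of magnitude.
counts, edges = np.histogram(log_duration, bins=5, range=(0, 5))
```

On the log scale, very short and very long trips show up as separate tails instead of being squeezed against the axis.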

##### 3.1.2. Pickup latitude and Pickup longitude
- Use data visualization and write your findings 

In [6]:
#write your code here
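
One way to explore the coordinates is to separate points far outside the city and bin the rest into a grid. The sketch below uses made-up points, and the bounding box is a rough assumed range around New York City, not an official boundary. The same approach applies to the dropoff columns.

```python
import numpy as np
import pandas as pd

# Made-up pickup points; the last one is deliberately far from New York.
pickups = pd.DataFrame({
    "pickup_longitude": [-73.982, -73.980, -73.979, -73.776, -121.933],
    "pickup_latitude": [40.768, 40.739, 40.764, 40.645, 37.389],
})

# Rough bounding box around the city (assumed values for illustration).
LON_RANGE = (-74.3, -73.7)
LAT_RANGE = (40.4, 41.0)

in_box = (
    pickups["pickup_longitude"].between(*LON_RANGE)
    & pickups["pickup_latitude"].between(*LAT_RANGE)
)
outliers = pickups[~in_box]

# 2D histogram of the remaining points: a coarse density map.
density, lon_edges, lat_edges = np.histogram2d(
    pickups.loc[in_box, "pickup_longitude"],
    pickups.loc[in_box, "pickup_latitude"],
    bins=3,
    range=[LON_RANGE, LAT_RANGE],
)
```

The `density` grid can be drawn as an image or heatmap, and `outliers` lists the records to inspect before deciding whether to drop them.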

##### 3.1.3. Dropoff latitude and Dropoff longitude
- Use data visualization and write your findings 

In [7]:
#write your code here

##### 3.1.4. Pickup datetime and Dropoff datetime
- It is better to visualize the trips by hour, day, week, month...

In [8]:
#write your code here
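
The sketch below shows one way to derive hour, weekday, and month from a datetime column and count trips per hour. The timestamps are made up, and the code assumes the column has already been converted to a datetime type.

```python
import pandas as pd

# Made-up pickup times, already parsed as datetimes.
pickup = pd.to_datetime(pd.Series([
    "2016-01-04 08:15:00",  # Monday
    "2016-01-04 08:45:00",  # Monday
    "2016-01-09 23:10:00",  # Saturday
    "2016-06-30 17:05:00",  # Thursday
]))

# Calendar parts extracted with the .dt accessor.
parts = pd.DataFrame({
    "hour": pickup.dt.hour,
    "weekday": pickup.dt.day_name(),
    "month": pickup.dt.month,
})

# Trips per hour of day, ordered by hour; suitable for a bar chart.
trips_per_hour = parts["hour"].value_counts().sort_index()
```

The same `value_counts` pattern on `weekday` or `month` gives the counts for the other time scales.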

##### 3.1.5. Vendor 
- Use data visualization and write your findings

In [9]:
#write your code here

##### 3.1.6 Passenger_count

In [10]:
#write your code here
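
For a discrete column such as `passenger_count` (or `vendor_id`), frequency counts are usually enough. A small sketch on made-up values:

```python
import pandas as pd

# Made-up passenger counts; the real column is driver-entered.
passenger_count = pd.Series([1, 1, 2, 1, 0, 5, 6, 1], name="passenger_count")

# Number of trips for each passenger count, ordered by count value.
counts = passenger_count.value_counts().sort_index()

# Share of all trips for each passenger count.
share = counts / counts.sum()

# Trips recorded with zero passengers are worth a closer look.
zero_passenger_trips = int((passenger_count == 0).sum())
```

`counts` can be drawn as a bar chart, and `zero_passenger_trips` flags records that may be data-entry errors.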

##### 3.1.7 Add more analysis

In [11]:
#write your code here

#### 3.2. Bivariate Analysis and Multivariate Analysis
- Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.
<br><br/>
- Multivariate Data Analysis is a statistical technique used to analyze data that originates from more than one variable.
<br><br/>
- Now look at the relationship between each variable and the target variable **trip_duration**. We’ll start with a few very simple questions.

##### 3.2.1 How do the pickup location, the drop-off location, and the direct distance between them affect trip duration?

In [12]:
#find your answer here
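
The direct distance is not a column in the data, so it has to be derived from the coordinates. The sketch below uses the haversine (great-circle) formula as one possible choice; the function name `haversine_km`, the Earth radius constant, and the sample trips are assumptions for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Made-up trips with the same column names as train.csv.
trips = pd.DataFrame({
    "pickup_latitude": [40.768, 40.739, 40.764],
    "pickup_longitude": [-73.982, -73.980, -73.979],
    "dropoff_latitude": [40.766, 40.731, 40.710],
    "dropoff_longitude": [-73.965, -73.999, -74.005],
    "trip_duration": [455, 663, 2124],
})

trips["distance_km"] = haversine_km(
    trips["pickup_latitude"], trips["pickup_longitude"],
    trips["dropoff_latitude"], trips["dropoff_longitude"],
)

# Pearson correlation between direct distance and duration.
corr = trips["distance_km"].corr(trips["trip_duration"])
```

A scatter plot of `distance_km` against `trip_duration` (ideally on log axes) complements the single correlation number.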

##### 3.2.2 How does the pickup datetime affect trip duration? Do quieter days and hours lead to faster trips?

In [13]:
#find your answer here
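
One way to approach this question is to group trips by hour and by day type and compare typical durations. The sketch below uses made-up trips and the median, which is an assumed choice here because it is less sensitive to a few very long trips than the mean.

```python
import pandas as pd

# Made-up trips; 2016-01-04 is a Monday and 2016-01-10 is a Sunday.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-01-04 03:10:00", "2016-01-04 03:40:00",
        "2016-01-04 17:05:00", "2016-01-04 17:30:00",
        "2016-01-10 17:15:00",
    ]),
    "trip_duration": [300, 420, 900, 1100, 700],
})

trips["hour"] = trips["pickup_datetime"].dt.hour
# dayofweek is 0 for Monday through 6 for Sunday.
is_weekend = trips["pickup_datetime"].dt.dayofweek >= 5
trips["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Typical duration per hour of day and per day type.
median_by_hour = trips.groupby("hour")["trip_duration"].median()
median_by_day_type = trips.groupby("day_type")["trip_duration"].median()
```

Plotting `median_by_hour` as a line chart makes rush-hour effects easy to see.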

##### 3.2.3 How are the number of passengers and the vendor correlated with the duration of the trip?

In [14]:
#find your answer here
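
Because `vendor_id` and `passenger_count` take only a few distinct values, a grouped summary is often more informative than a single correlation coefficient. A sketch on made-up data:

```python
import pandas as pd

# Made-up trips with the same column names as train.csv.
trips = pd.DataFrame({
    "vendor_id": [1, 1, 2, 2, 2, 1],
    "passenger_count": [1, 1, 1, 5, 5, 2],
    "trip_duration": [400, 600, 500, 800, 1000, 700],
})

# Number of trips and median duration for each vendor.
by_vendor = trips.groupby("vendor_id")["trip_duration"].agg(["count", "median"])

# Number of trips and median duration for each passenger count.
by_passengers = trips.groupby("passenger_count")["trip_duration"].agg(["count", "median"])
```

The `count` column matters when reading the medians: a group with very few trips gives an unreliable estimate.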

##### 3.2.4 Add more of your own questions:

#### 3.3 More Analysis with External data
- We have two other tables: weather and fastest routes. Use your analysis to find more features that are correlated with trip duration.

##### 3.3.1 How does weather affect the total time of a trip? How do snow and rain affect trip duration?

In [15]:
#find your answer here
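
To relate weather to duration, the daily weather rows have to be joined to the trips by date, and the trace marker `T` has to become a number first. The sketch below uses made-up values; the ISO date format, the replacement value `TRACE_VALUE`, and the exact column spellings are assumptions to verify against the real file.

```python
import pandas as pd

# Made-up daily weather; "T" marks a trace amount.
weather = pd.DataFrame({
    "date": ["2016-01-22", "2016-01-23", "2016-01-24"],
    "precipitation": ["0.00", "1.50", "T"],
    "snow fall": ["0.0", "10.0", "T"],
})

# Made-up trips, with pickup times already parsed.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-01-22 09:00:00", "2016-01-23 09:00:00",
        "2016-01-23 18:00:00", "2016-01-24 12:00:00",
    ]),
    "trip_duration": [600, 1500, 1300, 900],
})

TRACE_VALUE = 0.01  # assumed small stand-in for a trace amount

weather["date"] = pd.to_datetime(weather["date"])
for col in ["precipitation", "snow fall"]:
    # Replace the trace marker, then convert the strings to floats.
    weather[col] = pd.to_numeric(weather[col].replace("T", str(TRACE_VALUE)))

# Join each trip to the weather of its pickup day.
trips["date"] = trips["pickup_datetime"].dt.normalize()
merged = trips.merge(weather, on="date", how="left")

# Compare typical durations on days with and without snowfall.
merged["snow"] = (merged["snow fall"] > 0).map({True: "snow", False: "no snow"})
median_by_snow = merged.groupby("snow")["trip_duration"].median()
```

Using `how="left"` keeps every trip even if a date is missing from the weather table, which shows up as missing values to investigate.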

- Another external data set is the fastest route of each trip, which includes the pickup/dropoff streets and the total distance/duration between these two points, together with a sequence of travel steps such as turns or entering a highway.
- This is suggested information about the fastest route from the starting street to the ending street of each trip, which can help you estimate the duration.

##### 3.3.2 How do the numbers of left turns, right turns, and total turns affect trip duration?

In [16]:
#find your answer here
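
Turn counts have to be derived from the `step_direction` column. The sketch below assumes that each value is a single string with one direction per step separated by `|`; that encoding and the sample values are assumptions, so inspect a few real rows before relying on it.

```python
import pandas as pd

# Made-up routes; the "|" separator is an assumed encoding.
routes = pd.DataFrame({
    "id": ["id_a", "id_b", "id_c"],
    "step_direction": ["left|straight|right|left", "right|right", "straight"],
})

# Split each route into a list of per-step directions.
steps = routes["step_direction"].str.split("|")

# Count exact matches only; variants such as "slight left" are not included.
routes["left_turns"] = steps.apply(lambda s: s.count("left"))
routes["right_turns"] = steps.apply(lambda s: s.count("right"))
routes["total_turns"] = routes["left_turns"] + routes["right_turns"]
```

After merging `routes` with the trip table on `id`, the new columns can be compared with `trip_duration` like any other numeric feature.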

### <span style="color:blue">4.Feature Selection </span>

- Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in.
<br><br/>
- Read more here: https://www.kaggle.com/sz8416/6-ways-for-feature-selection
<br><br/>
- After engineering new features and before starting the modelling, we will visualize the relationships between our features using a correlation matrix. For this, all input features must be converted to a numerical format. The visualisation uses the heatmap plot from the seaborn package.

#### 4.1 Correlation
- Read more about feature selection with correlation: https://towardsdatascience.com/feature-selection-correlation-and-p-value-da8921bfb3cf

In [17]:
#write your code here
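
A minimal sketch of correlation-based ranking with pandas, on a made-up feature table. The column `distance_km` is a hypothetical engineered feature, and encoding `store_and_fwd_flag` as 0/1 is one assumed way to make it numeric.

```python
import pandas as pd

# Made-up feature table; distance_km is a hypothetical engineered feature.
features = pd.DataFrame({
    "distance_km": [1.0, 2.0, 3.0, 4.0, 5.0],
    "passenger_count": [1, 2, 1, 3, 1],
    "store_and_fwd_flag": ["N", "N", "Y", "N", "N"],
    "trip_duration": [310, 590, 920, 1180, 1520],
})

# Convert the categorical flag to 0/1 so every column is numeric.
features["store_and_fwd_flag"] = (features["store_and_fwd_flag"] == "Y").astype(int)

# Pairwise Pearson correlations between all columns.
corr_matrix = features.corr()

# Rank the features by absolute correlation with the target.
target_corr = (
    corr_matrix["trip_duration"]
    .drop("trip_duration")
    .abs()
    .sort_values(ascending=False)
)
```

`corr_matrix` is the table that a heatmap would display. Pearson correlation only measures linear association, so a low value does not rule out a non-linear relationship.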

#### 4.2 Using feature importance

- Install the lightgbm package
- Use lightgbm for feature selection


In [18]:
#write your code here

### <span style="color:blue">5.Conclusion </span>
- In the report, list all the features you selected to predict trip duration