# **PROBLEM STATEMENT:**

*Running EDA on the NYC Taxi Trip dataset*

# **IMPORT LIBRARIES**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt

#  IMPORTING DATASET

In [None]:
#import the data from a csv file.
data = pd.read_csv("../input/nyc-trip-duration/nyc_taxi_trip_duration.csv")

# **BASIC DATA EXPLORATION**

NOW WE WILL EXPLORE THE DATASET AND MODIFY IT AS REQUIRED FOR FURTHER ANALYSIS

In [None]:
data.head()

In [None]:
#Checking the shape of dataset
data.shape

There are around 7 lakh (700,000) records with 11 columns.

In [None]:
#Checking null values
data.isnull().sum()

There are no null values in any column, according to the isnull function.

In [None]:
#Converting the timestamp columns to datetime

data['pickup_datetime']=pd.to_datetime(data['pickup_datetime'])
data['dropoff_datetime']=pd.to_datetime(data['dropoff_datetime'])

In [None]:
#Calculate and assign new columns to the dataframe such as weekday,
#month and pickup_hour which will help us to gain more insights from the data.
data['month'] = data.pickup_datetime.dt.month
data['weekday_num'] = data.pickup_datetime.dt.weekday
data['pickup_hour'] = data.pickup_datetime.dt.hour
data['dropoff_hour'] = data.dropoff_datetime.dt.hour

In [None]:
data.head()


# UNIVARIATE ANALYSIS

Univariate analysis is the analysis of one variable. Its main purpose is to describe patterns in data consisting of a single variable.

# **PASSENGERS**

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:

plt.figure(figsize = (20,5))
#pass the column via x=; newer seaborn versions treat the first positional argument as `data`
sns.boxplot(x=data.passenger_count)
plt.show()

From the box plot above we can say that:

1. There are trips with 0 passengers.
2. A few trips show 7, 8 or 9 passengers; these may be outliers.
3. Most trips have 1 or 2 passengers.

**Since a trip with 0 passengers is nearly impossible, we will replace 0 with a passenger count of 1, as the mean and median are both close to 1.**
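The replacement described above can also be derived from the data instead of being hard-coded. Below is a minimal sketch, assuming a small made-up series in place of `data['passenger_count']`; the names `fill_value` and `cleaned` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up stand-in for data['passenger_count']; not real trip records.
passenger_count = pd.Series([1, 0, 2, 1, 1, 5, 0, 1])

# Median of the valid (non-zero) counts, used as the replacement value.
fill_value = int(passenger_count[passenger_count > 0].median())

# Series.mask replaces the entries where the condition is True.
cleaned = passenger_count.mask(passenger_count == 0, fill_value)

print(fill_value)
print(cleaned.tolist())
```

Using the median of the non-zero counts keeps the replacement tied to the data, so it would still be reasonable if the typical passenger count were not 1.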

In [None]:
data.passenger_count.describe()

In [None]:
data['passenger_count']=data.passenger_count.map(lambda x:1 if x==0 else x)

In [None]:
data['passenger_count'].head(10)

In [None]:
data['passenger_count'].tail(10)

In [None]:
data.passenger_count.value_counts()

In [None]:
sns.countplot(x=data.passenger_count)
plt.show()

FROM THE ABOVE FIGURE IT IS EVIDENT THAT MOST TRIPS WERE TAKEN BY A SINGLE PASSENGER

# **VENDOR**
***
Let's see which of the two vendors has the larger market share in New York.

In [None]:
data['vendor_id'].value_counts()

In [None]:
sns.countplot(x=data.vendor_id)
plt.show()

From the counts above it is evident that both vendors have almost the same market presence; to be precise, vendor 2 leads by a small margin.
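To put a number on the vendor comparison, the counts can be expressed as percentage shares. This is a minimal sketch on made-up vendor IDs (the real split will differ); it only assumes that `value_counts(normalize=True)` returns fractions that sum to 1.

```python
import pandas as pd

# Made-up stand-in for data['vendor_id']; not real trip records.
vendor_id = pd.Series([2, 1, 2, 2, 1, 2, 1, 2, 1, 2])

# normalize=True returns fractions instead of raw counts.
share = vendor_id.value_counts(normalize=True).mul(100).round(1)

print(share)
```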

In [None]:
data.head(50)

# **TRIP DURATION**

In [None]:
data['trip_duration'].describe()

In [None]:
data['trip_duration'].value_counts()

In [None]:
plt.figure(figsize=(20,5))
sns.boxplot(x=data['trip_duration'])
plt.show()

Some trips last more than 100,000 seconds.

Observations:
- Some durations are as low as 1 second, which points towards trips with 0 km distance.
- Most trips took between 10 and 20 minutes to complete.
- The mean and the mode are not the same, which suggests that the trip duration distribution is skewed to the right.

Let's analyze further.
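The skew can also be checked numerically rather than read off the box plot. Below is a minimal sketch on made-up durations with one extreme value, assuming that a mean well above the median together with a positive `Series.skew()` indicates a right-skewed distribution.

```python
import pandas as pd

# Made-up durations in seconds, with one extreme outlier; not real trip records.
trip_duration = pd.Series([300, 420, 480, 600, 660, 720, 900, 1200, 3600, 86000])

mean_s = trip_duration.mean()
median_s = trip_duration.median()
skewness = trip_duration.skew()  # sample skewness; positive means a long right tail

print(f"mean={mean_s:.0f}s median={median_s:.0f}s skew={skewness:.2f}")
```

The median is much less sensitive to a few very long trips than the mean, which is why the two drift apart when outliers are present.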

In [None]:
data[data.trip_duration > 86400]

In [None]:
data[data.trip_duration < 86400]

In [None]:
data.trip_duration.groupby(pd.cut(data.trip_duration, np.arange(1,7200,600))).count().plot(kind='barh')
plt.xlabel('Trip Counts')
plt.ylabel('Trip Duration (seconds)')
plt.show()

Most trips fall in the first bin, between 1 and 601 seconds.
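It helps to see how `pd.cut` assigns values to these bins. The sketch below uses the same bin edges as the plot but made-up durations; note that the intervals are open on the left and closed on the right, so a value equal to the first edge or beyond the last edge is not counted in any bin.

```python
import numpy as np
import pandas as pd

# Made-up durations in seconds; not real trip records.
trip_duration = pd.Series([45, 300, 590, 601, 650, 1250, 1900, 7000])

bins = np.arange(1, 7200, 600)           # edges 1, 601, 1201, ..., 6601
binned = pd.cut(trip_duration, bins)     # intervals such as (1, 601]

# observed=False keeps empty bins in the result.
counts = trip_duration.groupby(binned, observed=False).count()

print(counts.head(4))
print("values outside all bins:", binned.isna().sum())
```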

**store_and_fwd_flag**


In [None]:
data.store_and_fwd_flag.value_counts()

**TRIP PER PICKUP HOUR**

In [None]:
sns.countplot(x=data.pickup_hour)
plt.show()

This shows the pickup count for each of the 24 hours of the day.

### Observation
- It is in line with the general trend of taxi pickups, which start increasing from 6 AM and decline from late evening, around 8 PM. There is no unusual behavior here.

<a id=week_trip></a>
## Total trips per weekday
***
Let's take a look now at the distribution of taxi pickups across the week.

In [None]:
plt.figure(figsize = (8,6))
sns.countplot(x=data.weekday_num)
plt.xlabel('Weekday (0 = Monday)')
plt.ylabel('Pickup counts')
plt.show()

### Observation
- Here we can see an increasing trend of taxi pickups from Monday to Friday. The trend declines from Saturday to Monday, which is expected, as some office-goers prefer to stay at home and rest on weekends.
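When reading the weekday plot, keep in mind that `dt.weekday` numbers the days from Monday = 0 to Sunday = 6. A minimal sketch on made-up timestamps shows the mapping, with `dt.day_name()` as an optional, more readable label.

```python
import pandas as pd

# Made-up pickup timestamps; not real trip records.
pickups = pd.to_datetime(pd.Series([
    "2016-03-14 17:24:55",  # a Monday
    "2016-03-18 08:10:00",  # a Friday
    "2016-03-19 23:45:10",  # a Saturday
    "2016-03-20 02:05:30",  # a Sunday
]))

weekday_num = pickups.dt.weekday      # Monday = 0 ... Sunday = 6
weekday_name = pickups.dt.day_name()  # English day names

print(pd.DataFrame({"weekday_num": weekday_num, "weekday_name": weekday_name}))
```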

Let's drill down more to see the hourwise pickup pattern across the week

In [None]:
n = sns.FacetGrid(data, col='weekday_num')
n.map(plt.hist, 'pickup_hour')
plt.show()

### Interesting find:
 - Taxi pickups increase in the late-night hours over the weekend, possibly due to more outstation rides or late-night leisure activities nearby.
 - Early-morning pickups, i.e. before 5 AM, increase over the weekend, whereas office-hour pickups, i.e. after 7 AM, decrease, as expected.
 - Taxi pickups seem to be consistent across the week at hour 15, i.e. at 3 PM.
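These patterns can be compared in a table as well as in the histograms. The sketch below is one possible approach on made-up rows: it labels each trip as weekday or weekend (the `day_type` column is an illustrative name, not part of the dataset) and computes the share of pickups per hour within each group.

```python
import pandas as pd

# Made-up rows; not real trip records.
trips = pd.DataFrame({
    "weekday_num": [0, 0, 2, 4, 5, 5, 6, 6],
    "pickup_hour": [8, 18, 8, 15, 1, 15, 2, 15],
})

# Saturday (5) and Sunday (6) count as the weekend.
trips["day_type"] = trips["weekday_num"].ge(5).map({True: "weekend", False: "weekday"})

# normalize="columns" turns counts into shares within each day type.
hourly_share = pd.crosstab(trips["pickup_hour"], trips["day_type"], normalize="columns")

print(hourly_share.round(2))
```

Normalizing within each column makes weekdays and weekends comparable even though there are five weekdays and only two weekend days.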

<a id=month_trip></a>
## Total trips per month
***
Let's take a look at the trip distribution across the months to understand if there is any difference in taxi pickups between months.

In [None]:
sns.countplot(x=data.month)
plt.ylabel('Trip Counts')
plt.xlabel('Months')
plt.show()

<a id=bivariate></a>
# Bivariate Analysis
***
Bivariate analysis is used to find out if there is a relationship between two sets of values. It usually involves the variables X and Y.


<img src='https://i.pinimg.com/originals/c8/d4/0e/c8d40e9ec4ffd4f3af527eb40ba80462.gif' align='centre'/>

In [None]:
data.dtypes

In [None]:
#select all numeric columns; the columns derived via .dt may be int32 rather than int64
numerical=data.select_dtypes(include='number')

In [None]:
numerical.dtypes

In [None]:
#checking correlation
numerical.corr()

In [None]:
#plotting heatmap to show relationship between variables
plt.figure(figsize=(30,10),dpi=140)
correlation=numerical.dropna().corr(method='pearson')
sns.heatmap(correlation,linewidth=2)
plt.show()

Observation:
1) There is a strong relationship between pickup hour and dropoff hour.
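Strong pairs can also be listed directly instead of being read off the heatmap colours. Below is a minimal sketch on a made-up numeric frame: it keeps only the upper triangle of the correlation matrix, so each pair appears once, and sorts the pairs by absolute Pearson correlation.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns; not real trip records.
numeric = pd.DataFrame({
    "pickup_hour":     [0, 5, 9, 13, 18, 22],
    "dropoff_hour":    [0, 5, 9, 14, 18, 23],
    "passenger_count": [1, 2, 1, 1, 3, 1],
})

corr = numeric.corr(method="pearson")

# Mask everything except the strict upper triangle (drops the diagonal and mirrored pairs).
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) pairs and sort by absolute correlation.
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```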

**TRIP DURATION PER PICKUP HOUR**

We need to aggregate the trip duration to plot it against the pickup hour. The aggregation measure can be the sum, mean, median or mode of the duration.
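The choice of aggregation measure matters when durations contain outliers. A minimal sketch on made-up rows compares several measures side by side with `groupby(...).agg`; the column names mirror the notebook, but the values are illustrative only.

```python
import pandas as pd

# Made-up rows; not real trip records.
trips = pd.DataFrame({
    "pickup_hour":   [8, 8, 8, 15, 15, 22],
    "trip_duration": [300, 600, 3000, 400, 800, 500],
})

# One row per group, one column per aggregation measure.
summary = trips.groupby("pickup_hour")["trip_duration"].agg(["sum", "mean", "median"])

print(summary)
```

In the first group a single long trip pulls the mean well above the median, so the median may be the more robust choice when extreme durations are present.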

In [None]:
group1=data.groupby('pickup_hour').trip_duration.mean()
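#x and y must be passed as keyword arguments in recent seaborn versions
sns.pointplot(x=group1.index,y=group1.values)
plt.xlabel('Pickup hour')
plt.ylabel('Mean trip duration (seconds)')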
plt.show()