# Bus data preparation

This script wrangles data regarding the Circulator bus and the Metro-Bus service in DC. The data is taken as CSV files from:

LINK: opendata.dc.gov/datasets/e20d1dc981174440b38ac768de4eb921_54

After selecting the desired columns, the script exports each dataframe as a new CSV file.

Some info about the raw data:

**DC Circulator Bus Stops**. The dataset contains locations and attributes of DC Circulator Stops, created as part of the DC Geographic Information System (DC GIS) for the D.C. Office of the Chief Technology Officer (OCTO) and participating D.C. government agencies. A database provided by the District Department of Transportation identified DC Circulator Stops. DC Circulator is a local transit option that provides access between key and popular locations in the District. All DC GIS data is stored and exported in Maryland State Plane coordinates NAD 83 meters. METADATA CONTENT IS IN PROCESS OF VALIDATION AND SUBJECT TO CHANGE.

**WMATA Metro Bus Stops**. The dataset contains locations and attributes of metro bus stops, created as part of the DC Geographic Information System (DC GIS) for the D.C. Office of the Chief Technology Officer (OCTO) and participating D.C. government agencies. A database provided by WMATA contains bus stop locations and attributes. All DC GIS data is stored and exported in Maryland State Plane coordinates NAD 83 meters. METADATA CONTENT IS IN PROCESS OF VALIDATION AND SUBJECT TO CHANGE.

In [2]:
import pandas as pd
import numpy as np


In [21]:
Bus_df= pd.read_csv('DC_Circulator_Stops.csv',sep=',')
Bus2_df= pd.read_csv('Metro_Bus_Stops.csv',sep=',')

Bus2_df.head()

| | X | Y | OBJECTID | LATITUDE | LONGITUDE | ROUTES |
|---|---|---|---|---|---|---|
| 0 | -77.20453 | 38.831697 | 2001 | 38.831697 | -77.20453 | 16A,16B,16E,29C,29G,29K,29KV1,29N,3A |
| 1 | -77.166547 | 38.823334 | 2002 | 38.823334 | -77.166547 | 29C,29G,29K,29KV1,29N |
| 2 | -77.181217 | 38.82739 | 2003 | 38.82739 | -77.181217 | 29C,29G,29K,29KV1,29N |
| 3 | -77.142189 | 38.817771 | 2004 | 38.817771 | -77.142189 | 17A,17B,17M,29C,29G,29K,29KV1,29N,7A,7AC,7F,7H |
| 4 | -77.14161 | 38.817882 | 2005 | 38.817882 | -77.14161 | NaN |


Let's check whether there are **null values** in these data sets:

In [35]:
len(Bus_df[Bus_df.notnull()]), len(Bus_df)

(148, 148)

The Circulator bus stops file contains no null values. Let's check the Metro-Bus stops file next.
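One caveat on the check above: indexing a DataFrame with a boolean DataFrame such as `Bus_df.notnull()` keeps every row and only masks individual cells, so the two lengths match whether or not nulls are present. A stricter check, sketched here on a small made-up frame (the `toy` data is purely illustrative and not taken from the Circulator file), counts nulls per column or compares against `dropna()`:

```python
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only
toy = pd.DataFrame({'LINE': ['A', 'B', None], 'STOP': ['s1', 's2', 's3']})

# Masking with a boolean frame preserves the shape, so the null stays hidden
masked_len = len(toy[toy.notnull()])

# Per-column null counts reveal it
null_counts = toy.isnull().sum()

# Number of rows left after dropping every row that contains a null
complete_rows = len(toy.dropna())

print(masked_len, len(toy), complete_rows)
print(null_counts)
```

Applied to the real data, `Bus_df.isnull().sum()` would show at a glance which columns, if any, contain missing values.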

In [19]:
len(Bus2_df[Bus2_df.ROUTES.notnull()]),len(Bus2_df[Bus2_df.ROUTES.isnull()])

(9863, 908)

About 8.4% of the **METRO-BUS** data (908 of 10,771 rows) **must be cleaned**, as there is no information about the routes that pass through those stops. I keep only the non-null rows:

In [36]:
Bus2_df=Bus2_df[Bus2_df.ROUTES.notnull()]

Double-check that no NaN values remain in the dataframe.

In [37]:
len(Bus2_df[Bus2_df.ROUTES.isnull()])

0
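As a cross-check on the figures above, the share of rows without route information and an equivalent `dropna` call can be sketched as follows. The counts are the ones reported in the cell output above; the `toy` frame is a hypothetical stand-in for the Metro Bus file.

```python
import pandas as pd

# Counts reported by the notnull / isnull check above
with_routes, without_routes = 9863, 908
missing_share = without_routes / float(with_routes + without_routes)
print(round(100 * missing_share, 1))  # percentage of stops lacking routes

# Hypothetical stand-in frame; dropna(subset=...) selects the same rows as
# the boolean filter on ROUTES.notnull()
toy = pd.DataFrame({'OBJECTID': [1, 2, 3], 'ROUTES': ['16A', None, '29C,29G']})
filtered = toy[toy.ROUTES.notnull()]
dropped = toy.dropna(subset=['ROUTES'])
print(filtered.equals(dropped))
```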

Now, we take only the columns that we are interested in:

In [38]:
Bus_stops_df=Bus_df[['LINE','STOP','Y','\xef\xbb\xbfX']]
Bus2_stops_df=Bus2_df[['ROUTES','LONGITUDE','LATITUDE','OBJECTID']]
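The `'\xef\xbb\xbfX'` key is the UTF-8 byte-order mark (bytes EF BB BF) fused onto the first header name, `X`. One possible way to avoid it, assuming the file is UTF-8 with a leading BOM, is to read it with `encoding='utf-8-sig'`, which strips the mark so the column is plain `X`. The sketch below uses an in-memory sample with made-up values rather than the real file:

```python
import io
import pandas as pd

# In-memory stand-in for a CSV that starts with a UTF-8 byte-order mark
raw = b'\xef\xbb\xbfX,Y,LINE,STOP\n-77.0,38.9,National Mall,EXAMPLE STOP\n'

# 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present
sample = pd.read_csv(io.BytesIO(raw), encoding='utf-8-sig')
print(list(sample.columns))
```

With this approach, the column selection and rename steps would refer to `'X'` instead of the BOM-prefixed name.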

For clarity and consistency with the other data sets, the X and Y columns are renamed to Longitude and Latitude.
* X and LONGITUDE = Longitude
* Y and LATITUDE = Latitude

In [39]:
#   BUS1 STOPS CIRCULATOR
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
Bus_stops_df=Bus_stops_df.assign(Text=Bus_stops_df['STOP'] + '. Line: ' + Bus_stops_df['LINE'])
Bus_stops_df=Bus_stops_df.rename(columns={'\xef\xbb\xbfX': 'Longitude', 'Y': 'Latitude'})

#   BUS2 STOPS METRO-BUS
Bus2_stops_df=Bus2_stops_df.assign(Text='Routes: ' + Bus2_stops_df['ROUTES'])
Bus2_stops_df=Bus2_stops_df.rename(columns={'LONGITUDE': 'Longitude', 'LATITUDE': 'Latitude'})

#Bus_stops_df.head(1)


In [40]:
#Bus_stops_df.head()
Bus_stops_df.tail()

| | LINE | STOP | Latitude | Longitude | Text |
|---|---|---|---|---|---|
| 143 | National Mall | E STREET AND COLUMBUS CIRCLE NE | 38.896434 | -77.007957 | E STREET AND COLUMBUS CIRCLE NE. Line: Nationa... |
| 144 | Union Station - Navy Yard | PENNSYLVANIA AVENUE AND 7TH STREET SE | 38.884788 | -76.995952 | PENNSYLVANIA AVENUE AND 7TH STREET SE. Line: U... |
| 145 | Woodley Park - Adams Morgan - McPherson Square | 14TH STREET AND U STREET NW | 38.917202 | -77.032106 | 14TH STREET AND U STREET NW. Line: Woodley Par... |
| 146 | Woodley Park - Adams Morgan - McPherson Square | MOUNT PLEASANT STREET AND IRVING STREET NW | 38.928596 | -77.03718 | MOUNT PLEASANT STREET AND IRVING STREET NW. Li... |
| 147 | Georgetown - Union Station; Dupont Circle - Ro... | PENNSYLVANIA AVENUE AND 28TH STREET NW | 38.904889 | -77.057219 | PENNSYLVANIA AVENUE AND 28TH STREET NW. Line: ... |


In [41]:
#Bus2_stops_df.head()
Bus2_stops_df.tail()

| | ROUTES | Longitude | Latitude | OBJECTID | Text |
|---|---|---|---|---|---|
| 10766 | 86 | -76.955346 | 38.951954 | 10369 | Routes: 86 |
| 10767 | W19 | -77.169099 | 38.598694 | 9628 | Routes: W19 |
| 10768 | F8 | -76.953121 | 38.954511 | 10370 | Routes: F8 |
| 10769 | W19 | -77.157619 | 38.600797 | 9629 | Routes: W19 |
| 10770 | 86,86V2 | -76.95604 | 38.949919 | 10371 | Routes: 86,86V2 |


Finally, both dataframes are exported as CSV files:

In [42]:
Bus_stops_df.to_csv('Bus_stops_df.csv', sep=',')
Bus2_stops_df.to_csv('Metro-Bus_stops_df.csv', sep=',')
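Note that `to_csv` writes the row index as an unnamed first column by default, which appears as `Unnamed: 0` when the file is read back with `read_csv`. If the index is not needed downstream (an assumption about how these files are consumed), passing `index=False` omits it. A small round-trip sketch on made-up data:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for Bus_stops_df
toy = pd.DataFrame({'Latitude': [38.9, 38.8], 'Longitude': [-77.0, -76.9]},
                   index=[143, 144])

# Default behaviour: the index is written as the first, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```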