_by  <a href="mailto:X00193937@mytudublin.ie">Jorge Jimenez Garcia</a>_ : X00193937

# Table of contents:

* [Introduction](#p1)
    * [Motivation](#p1.1)
    * [Dataset overview](#p1.2)
    * [Imports and tools used](#p1.3)
    * [Data Quality](#p1.4)
* [Exploratory Data Analysis](#p2)


## Introduction
<a name='p1'></a>

### Motivation
<a name='p1.1'></a>

In light of the climate crisis, alternative and sustainable modes of transportation are emerging as a valuable way to lower a city's carbon footprint. There has also been growing discussion of how the walkability and accessibility of cities relate to their quality of life. With so many transportation topics in the spotlight, it seemed fitting to examine a dataset on the subject.

To this end, we will study New York City's Citywide Mobility Survey, or CMS for short. NYC was picked because it is generally understood to be a transport-friendly city, and because its size allows it to collect a wide variety of information about a diverse population.

[The New York CMS](https://data.cityofnewyork.us/browse?q=Citywide%20Mobility%20Survey&sortBy=relevance) is a yearly survey of the city's population, conducted by the Department of Transportation, that assesses citizens' views of transport, their usage of it, and other demographic data about public and private transport users. The survey comprises several datasets that contain information about the respondents, their households, each individual trip they made, and their vehicles.

In this project, we will study the [Person data from the 2019 version of the report](https://data.cityofnewyork.us/Transportation/Citywide-Mobility-Survey-Person-Survey-2019/6bqn-qdwq). 

### Dataset
<a name='p1.2'></a>

The CMS is a statistically valid sample of nearly 3000 NYC residents across the 10 designated geographic survey zones, with approximately 300 respondents per zone. The dataset also contains incomplete information about the other members of each respondent's household, for a total of 8286 person entries, of which 3346 are respondents.

The dataset contains a variety of attributes covering the survey's results as well as general information about the respondents. It has 165 attributes in total, and we will not use all of them. The ones most likely to be useful are as follows:

* __cms\_zone__: Categorical; The area where the respondent lives, from a set of predefined areas by the Department of Transportation (Inner Brooklyn, Middle Queens, Outer Queens, Manhattan Core, Northern Bronx, Northern Manhattan, Outer Brooklyn, Staten Island, Inner Queens, Southern Bronx)
* __num\_trips__: Discrete; The number of trips a person made during the survey period
* __num_walk_trips__: Discrete; The number of trips done on foot
* __num_transit_trips__: Discrete; The number of trips made using public transport
* __num_bike_trips__: Discrete; The number of trips made by bike
* __num_taxi_trips__: Discrete; The number of trips made by taxi
* __num_tnc_trips__: Discrete; The number of trips made using a vehicle-for-hire service (e.g. Uber, Lyft, etc.)
* __age__: Categorical; The age of the respondent, in ranges (Under 5, 5 to 15, 16 to 17, 18 to 24, 25 to 34, etc.)
* __employment__: Categorical; Type of employment (Full-time, Part-time, Self-employed, Not employed, Unpaid Volunteer or Intern)
* __student__: Categorical; If the respondent is currently a student and of what type (Not a student, Full-time, Part-time)
* __industry__: Categorical; Work industry (Financial Services, Real Estate, Capital Goods, Business Services, etc.)
* __work_cms_zone__: Categorical; The area where the respondent works (Inner Brooklyn, Middle Queens, Outer Queens, Manhattan Core, Northern Bronx, Northern Manhattan, Outer Brooklyn, Staten Island, Inner Queens, Southern Bronx)
* __work\_mode__: Categorical; Typical mode of transportation to work (Walk, Other, Household Vehicle, Rental/Carshare/Work Vehicle, Bus, Ferry, Rail, Taxi or TNC, Scooter)

[_This information is sourced from the CMS's Data Dictionary_](https://data.cityofnewyork.us/api/views/6bqn-qdwq/files/038d8557-22ba-4268-b323-d03ea3c82a88?download=true&filename=Open%20Data%20Dictionary_CMS%20%20Person%20Survey%202019.xlsx)

In [1]:
import pandas as pd

attributes = ['cms_zone', 'num_trips', 'num_walk_trips', 'num_transit_trips', 'num_bike_trips', 
              'num_taxi_trips', 'num_tnc_trips','age','employment','student','industry',
              'work_cms_zone','work_mode']

df = pd.read_csv('Citywide_Mobility_Survey_-_Person_Survey_2019.csv')[attributes]
print(df.shape[0])
df.head(10)

8286


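| | cms_zone | num_trips | num_walk_trips | num_transit_trips | num_bike_trips | num_taxi_trips | num_tnc_trips | age | employment | student | industry | work_cms_zone | work_mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Inner Brooklyn | NaN | NaN | NaN | NaN | NaN | NaN | 9 | 3 | 0 | 995 | NaN | 995 |
| 1 | Inner Brooklyn | NaN | NaN | NaN | NaN | NaN | NaN | 8 | 6 | 0 | 995 | NaN | 995 |
| 2 | Inner Brooklyn | 23.0 | 1.0 | 3.0 | 11.0 | 0.0 | 0.0 | 5 | 6 | 1 | 995 | NaN | 995 |
| 3 | Middle Queens | NaN | NaN | NaN | NaN | NaN | NaN | 8 | 1 | 0 | 995 | NaN | 995 |
| 4 | Middle Queens | NaN | NaN | NaN | NaN | NaN | NaN | 7 | 6 | 0 | 995 | NaN | 995 |
| 5 | Middle Queens | 15.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 5 | 2 | 2 | 8 | NaN | 100 |
| 6 | Middle Queens | 30.0 | 22.0 | 9.0 | 0.0 | 0.0 | 0.0 | 7 | 7 | 0 | 15 | NaN | 105 |
| 7 | Middle Queens | NaN | NaN | NaN | NaN | NaN | NaN | 5 | 2 | 0 | 995 | NaN | 995 |
| 8 | Middle Queens | NaN | NaN | NaN | NaN | NaN | NaN | 9 | 3 | 0 | 995 | NaN | 995 |
| 9 | Middle Queens | NaN | NaN | NaN | NaN | NaN | NaN | 5 | 6 | 0 | 995 | NaN | 995 |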
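Since only 13 of the 165 attributes are needed, the column selection could also be pushed into the read itself. The following is a minimal sketch of that idea using the `usecols` parameter of `pd.read_csv` on a small in-memory CSV. The `hh_id` and `survey_weight` columns are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Toy CSV standing in for the real export. `hh_id` and `survey_weight`
# are hypothetical extra columns, used only to show the filtering.
toy_csv = io.StringIO(
    "hh_id,cms_zone,num_trips,age,survey_weight\n"
    "1,Inner Brooklyn,,9,1.5\n"
    "2,Middle Queens,15,5,0.8\n"
)

wanted = ['cms_zone', 'num_trips', 'age']

# usecols makes pandas parse only the requested columns
toy_df = pd.read_csv(toy_csv, usecols=wanted)
print(toy_df.columns.tolist())
```

Note that `usecols` keeps the columns in the order they appear in the file rather than the order of the list, so indexing with `[attributes]` afterwards is still useful if a specific column order matters.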


### Imports
<a name='p1.3'></a>

For the analysis, we will use the standard Python data analysis toolkit: NumPy, Matplotlib, seaborn, pingouin and SciPy, alongside pandas.

In [2]:
# pandas was already imported above to show an overview of the dataset
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import pingouin
from scipy import stats

### Data Quality
<a name='p1.4'></a>

As previously mentioned, the dataset contains both the respondents and the members of their households. Since the survey works on a person-by-person basis, household members did not answer it themselves and therefore have missing data on almost all of the attributes we are interested in, so there is little to be gained from keeping them. This same fact makes them easy to identify, and they will be removed from the dataset, since they were only included as part of the bigger CMS dataset.

Some other attributes will be reworked to make the data clearer. For instance, `num_trips` is assumed here to be the number of trips made by car: the attribute does not appear to be the total across modes, the codebook does not specify which mode of transportation it covers, and car is the only mode left without its own column.
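The claim that `num_trips` is not the total can be checked against the three respondent rows visible in the preview above (index 2, 5 and 6). This is only a sketch on those three rows: it shows that the column differs from the sum of the per-mode columns there, but it does not by itself establish what the column measures.

```python
import pandas as pd

mode_cols = ['num_walk_trips', 'num_transit_trips', 'num_bike_trips',
             'num_taxi_trips', 'num_tnc_trips']

# The three respondent rows shown in the dataset preview (index 2, 5 and 6)
preview = pd.DataFrame(
    {
        'num_trips': [23.0, 15.0, 30.0],
        'num_walk_trips': [1.0, 2.0, 22.0],
        'num_transit_trips': [3.0, 1.0, 9.0],
        'num_bike_trips': [11.0, 0.0, 0.0],
        'num_taxi_trips': [0.0, 0.0, 0.0],
        'num_tnc_trips': [0.0, 0.0, 0.0],
    },
    index=[2, 5, 6],
)

# Sum the per-mode counts row by row and compare against num_trips
mode_sum = preview[mode_cols].sum(axis=1)
is_total = preview['num_trips'].eq(mode_sum)

print(pd.DataFrame({'num_trips': preview['num_trips'],
                    'mode_sum': mode_sum,
                    'is_total': is_total}))
```

Running the same comparison on the full dataframe would show how often the two quantities agree across all respondents.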

In [3]:
df.rename(columns={'num_trips': 'num_car_trips'}, inplace=True)

#### Missing Data

In [4]:
df.isna().sum()

cms_zone                0
num_car_trips        4940
num_walk_trips       4940
num_transit_trips    4940
num_bike_trips       4940
num_taxi_trips       4940
num_tnc_trips        4940
age                     0
employment              0
student                 0
industry                0
work_cms_zone        6599
work_mode               0
dtype: int64

As we can see from the missing data analysis, there are 4940 instances where the trip-count attributes (`num_car_trips`, `num_walk_trips`, and so on) are missing. These are the household members of the actual survey respondents. This matches the information reported in the codebook and mentioned earlier, which states that the dataset contains 3346 survey respondents: of the 8286 instances in total, 4940 are 'missing' or 'not applicable', so 3346 respondents remain.
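Equal per-column counts suggest, but do not guarantee, that the same rows are missing in every trip column. A minimal sketch of how that could be verified is shown below, using toy rows patterned on the preview rather than the real data. The final lines simply redo the arithmetic from the counts reported above.

```python
import numpy as np
import pandas as pd

trip_cols = ['num_car_trips', 'num_walk_trips', 'num_transit_trips',
             'num_bike_trips', 'num_taxi_trips', 'num_tnc_trips']

# Toy rows: two household members (all NaN) and two respondents
toy = pd.DataFrame(
    [
        [np.nan] * 6,
        [23, 1, 3, 11, 0, 0],
        [np.nan] * 6,
        [15, 2, 1, 0, 0, 0],
    ],
    columns=trip_cols,
)

# Number of missing trip columns per row: expect either 0 or all 6
missing_per_row = toy[trip_cols].isna().sum(axis=1)
all_or_nothing = missing_per_row.isin([0, len(trip_cols)]).all()
n_respondents = int((missing_per_row == 0).sum())

# Arithmetic check using the counts reported in this notebook
total_rows, missing_rows = 8286, 4940
expected_respondents = total_rows - missing_rows

print(all_or_nothing, n_respondents, expected_respondents)
```

If `all_or_nothing` came out false on the real data, some rows would be only partially filled in and dropping on any single column would not be equivalent to dropping on all of them.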

This leaves the missing values in `work_cms_zone` to address. These probably correspond to respondents who are unemployed or who do not work in the city. We therefore have to avoid dropping the NaN instances in `work_cms_zone`, since they could be important for distinguishing, for example, between respondents who work inside the city and those who do not.

In [5]:
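miss = ['num_car_trips', 'num_walk_trips', 'num_transit_trips', 'num_bike_trips', 'num_taxi_trips', 'num_tnc_trips']
# Household members have NaN in the trip-count columns, so dropping rows with
# NaN in any of them keeps only the survey respondents. Restricting the drop
# to these columns leaves the NaN values in work_cms_zone untouched.
# (The previous mask() call replaced NaN with NaN and had no effect.)
df.dropna(subset=miss, inplace=True)
df.reset_index(drop=True, inplace=True)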

We double-check that we did not drop the NaN instances in `work_cms_zone`:

In [6]:
assert df.isna().sum()['work_cms_zone'] > 0
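Because these NaN values are kept on purpose, later summaries need to keep them visible as well. The sketch below, on toy values, shows one possible way to do that with `value_counts(dropna=False)`, plus a boolean flag for whether a work zone was recorded. The `has_work_zone` name is an illustrative assumption and is not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy column mimicking work_cms_zone with some missing entries
toy = pd.DataFrame({
    'work_cms_zone': ['Manhattan Core', np.nan, 'Inner Queens',
                      np.nan, 'Manhattan Core'],
})

# By default value_counts() silently skips NaN; dropna=False keeps it
zone_counts = toy['work_cms_zone'].value_counts(dropna=False)
default_counts = toy['work_cms_zone'].value_counts()

# Hypothetical helper flag: True where a work zone was recorded
has_work_zone = toy['work_cms_zone'].notna()

print(zone_counts)
print(has_work_zone.tolist())
```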

## Exploratory Data Analysis
<a name='p2'></a>
