# Analyzing Spatial Data from the Ford GoBike Bike Sharing Service
### by Lucas Valerio de Oliveira

## Table of Contents
- [Introduction](#intro)
- [Preliminary Wrangling](#wrangling)
    - [Gathering and Assessing Data](#gatheringassessing)
- [Univariate Exploration](#uniexp)
- [Bivariate Exploration](#biexp)
- [Multivariate Exploration](#multiexp)
- [References](#refs)

<a id=intro></a>
## Introduction

According to Wikipedia, Ford GoBike (since renamed Bay Wheels) is a public bike sharing system in California's San Francisco Bay Area. It is the first regional and large-scale bike sharing system deployed in California and on the West Coast of the United States, and it was established as Bay Area Bike Share in August 2013. As of January 2018, the system had more than 2,600 bikes at 262 stations in San Francisco, the East Bay and San Jose.

In this study, the trip data published by the bike sharing program for February 2019 is analyzed, first through an exploratory analysis and then through an explanatory analysis.

<a id=wrangling></a>
## Preliminary Wrangling

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.cluster import KMeans
import matplotlib.cm as cm
from geopy.geocoders import Nominatim
import time

%matplotlib inline

ModuleNotFoundError: No module named 'geopy'

<a id=gatheringassessing></a>
### Gathering and Assessing Data

In [None]:
df = pd.read_csv('201902-fordgobike-tripdata.csv')

In [None]:
def checkDataFrame(df, dfname=''):
    '''
    Print a summary of a DataFrame: shape, info, descriptive statistics
    and the value counts of every column.
    '''

    print('Dataframe Summary\n')
    print(dfname)
    print('='*100)
    print('\tRows: {} Columns: {}\n'.format(df.shape[0], df.shape[1]))
    print('-'*100)
    # df.info() prints directly and returns None, so it is not wrapped in print()
    df.info(verbose=True)
    print('-'*100)
    print(df.describe())
    print('-'*100)
    # Value counts for every column
    for col in df.columns:
        print(df[col].value_counts())
        print('-'*100)
    print('Summary END')
    print('='*100)

In [None]:
checkDataFrame(df,'fordgobike')

### What is the structure of your dataset?

The dataset has 183,412 rows and 16 columns. Some columns contain null values that need to be assessed to decide whether they should be treated or removed. Regarding data types, the date and time variables need to be converted to the datetime type, and the ID fields need to be converted to strings. Finally, the member birth year variable should be examined, since some users have a birth year of 1878, which needs to be investigated. The data covers February 2019.
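
The null counts and data types described above can be tabulated in one small table. The sketch below is only an illustration: `sample` is a made-up stand-in for the trip table and `null_summary` is a hypothetical helper, neither of which is part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the trip table (assumption); the notebook would pass `df` instead.
sample = pd.DataFrame({
    "duration_sec": [520, 1800, 84548],
    "start_station_id": [21.0, None, 3.0],
    "member_birth_year": [1984.0, None, 1878.0],
    "member_gender": ["Male", None, "Female"],
})

def null_summary(frame):
    """Return the dtype, null count and null percentage of every column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_null": frame.isna().sum(),
        "pct_null": (frame.isna().mean() * 100).round(2),
    })

summary = null_summary(sample)
print(summary)
```

Calling `null_summary(df)` on the real data would show at a glance which columns need a type conversion and which contain nulls.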


### What is/are the main feature(s) of interest in your dataset?

What interested me most was finding out how the data is distributed spatially and then building a graphical representation of that distribution on a real map. The latitude and longitude columns can be used for that.

In addition, I will try to find out which factors influence trip duration: date and time, user age, start and end stations, and user gender.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The first part of my analysis evaluates the latitude and longitude data in relation to trip duration, age, gender and user type, to see how these characteristics are distributed across the map.
For trip duration, I mainly need to evaluate the start time, the station information and the user characteristics.

<a id=cleaning></a>
### Cleaning

In this section we adjust some fields and improve the quality of the data.

In [None]:
clean_df = df.copy()

-------------------
#### Fix all wrong types and remove null birth and gender data

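As stated earlier, we need to fix the data types to ensure better data quality: the time columns are converted to datetime, the ID columns to strings, and the gender column to a category. Rows with a null gender or birth year are removed, and the birth year is then stored as an integer.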

In [None]:
datetime_cols = ['start_time','end_time']
id_cols =['start_station_id','end_station_id','bike_id']

for i in datetime_cols:
    clean_df[i] = pd.to_datetime(clean_df[i])
for i in id_cols:
    clean_df[i] = clean_df[i].astype(str)
clean_df = clean_df[(clean_df['member_gender'].notnull())]
clean_df = clean_df[(clean_df.member_birth_year.notnull())]
clean_df.member_gender = clean_df.member_gender.astype('category')
clean_df.member_birth_year = clean_df.member_birth_year.astype('int')

**Test**

In [None]:
clean_df[id_cols].dtypes

In [None]:
clean_df[datetime_cols].dtypes

In [None]:
clean_df[['member_gender','member_birth_year']].dtypes


In [None]:
clean_df['hour_value'] = pd.DatetimeIndex(clean_df.start_time).hour
clean_df['day_week']  = pd.DatetimeIndex(clean_df.start_time).dayofweek
clean_df['day']  = pd.DatetimeIndex(clean_df.start_time).day

-------------------
#### Check and adjust Birth Year values
Rows with a null birth year are removed rather than filled, since a wrong fill could bias the final results. The distribution of birth years is then evaluated and the outliers are removed.

In [None]:
plt.boxplot(clean_df.member_birth_year)
plt.ylabel('Birth Year');

In [None]:
# Based on the boxplot, keep only riders born after 1940 (younger than about 80 years).
lower_range = 1940
clean_df = clean_df[(clean_df.member_birth_year > lower_range)]

In [None]:
plt.boxplot(clean_df.member_birth_year)
plt.ylabel('Birth Year');
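
The 1940 cutoff above was read off the boxplot by eye. As a hedged alternative, the lower whisker of a boxplot can be computed explicitly with the usual 1.5 × IQR rule. The birth years below are made-up values chosen for illustration, not taken from the dataset.

```python
import pandas as pd

# Made-up birth years (assumption), including one implausible value.
birth_year = pd.Series([1878, 1960, 1970, 1980, 1985, 1990, 1990, 1995, 2000])

# Quartiles and interquartile range
q1, q3 = birth_year.quantile([0.25, 0.75])
iqr = q3 - q1

# Lower fence of a standard boxplot: values below it are drawn as outliers
lower_fence = q1 - 1.5 * iqr

kept = birth_year[birth_year >= lower_fence]
print(lower_fence, len(kept))
```

On the real data, the fence would be computed from `clean_df.member_birth_year` and could differ from the hand-picked 1940.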

-------------------
#### Re-check Birth Year values
Before moving on to the stations, we make sure that no null birth years remain in the data.

In [None]:
clean_df = clean_df[~(clean_df.member_birth_year.isna())]

**Test**

In [None]:
clean_df.describe()

### Analyze null data from bike stations
In this section we investigate the integrity of the latitude and longitude variables and how they are distributed on the map.

First I plot all stations on the map and examine the trips whose stations have no ID and no name but do have latitude and longitude.

In [None]:
# Split the trips by whether both the start and end station names are present
mask = (clean_df.start_station_name.notnull() & clean_df.end_station_name.notnull())
station_null_df = clean_df[~mask]
station_notnull_df = clean_df[mask]

In [None]:
#Lets plot all stations and plot missing

fig = go.Figure()

fig.add_trace(go.Scattermapbox(
        lat=station_null_df.start_station_latitude,
        lon=station_null_df.start_station_longitude,
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=15
        ),
        text=['Null Start Data'],
        name='Null Start Data'
    ))
fig.add_trace(go.Scattermapbox(
        lat=station_null_df.end_station_latitude,
        lon=station_null_df.end_station_longitude,
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=10
        ),
        text=['Null End Data'],
        name='Null End Data'
    ))

fig.add_trace(go.Scattermapbox(
        lat=station_notnull_df.start_station_latitude,
        lon=station_notnull_df.start_station_longitude,
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=15
        ),
        text=['Not Null start Data'],
        name='Not Null start Data'
    ))

fig.add_trace(go.Scattermapbox(
        lat=station_notnull_df.end_station_latitude,
        lon=station_notnull_df.end_station_longitude,
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=13
        ),
        text=['Not Null end Data'],
        name='Not Null end Data'
    ))

fig.update_layout(
    hovermode='closest',
    mapbox=dict(
        style='carto-positron',
        bearing=0,
        center=go.layout.mapbox.Center(
            lat=37.6,
            lon=-122.1
        ),
        pitch=0,
        zoom=8
    )
)

fig.show()

The previous visualization shows that the trips with null station data fall in the same area as the trips with named stations, so removing them will not bias the study toward any region. I therefore choose to remove these rows: each station corresponds to a street name, and recovering the missing names would require matching every coordinate by hand, which would be very laborious. If these rows were relevant to the study, however, it would be worth recovering the names, since they could carry information about the behavior of that area.
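
The visual check can be complemented with a rough numeric one: test whether every unnamed point falls inside the bounding box of the named stations. This is only a sketch on made-up coordinates; `inside_bounding_box` is a hypothetical helper, and a bounding box is a much coarser test than the map itself.

```python
import pandas as pd

def inside_bounding_box(points, reference,
                        lat="start_station_latitude",
                        lon="start_station_longitude"):
    """Flag the rows of `points` that lie inside the lat/lon box spanned by `reference`."""
    lat_ok = points[lat].between(reference[lat].min(), reference[lat].max())
    lon_ok = points[lon].between(reference[lon].min(), reference[lon].max())
    return lat_ok & lon_ok

# Made-up coordinates (assumption) standing in for named and unnamed stations.
named = pd.DataFrame({
    "start_station_latitude": [37.33, 37.80, 37.87],
    "start_station_longitude": [-121.89, -122.40, -122.27],
})
unnamed = pd.DataFrame({
    "start_station_latitude": [37.40, 37.41, 40.00],
    "start_station_longitude": [-121.94, -121.93, -121.90],
})

flags = inside_bounding_box(unnamed, named)
share_inside = flags.mean()
print(flags.tolist(), share_inside)
```

In the notebook, `station_null_df` and `station_notnull_df` would take the place of `unnamed` and `named`.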

In [None]:
# Copy so that the new columns added later do not write into a slice of the original frame
clean_df = station_notnull_df.copy()

The stations appear to form clusters: nearby latitude and longitude pairs are similar and can be grouped together to define the regions under study.

### How are the stations distributed? What are the regions of the study?

To answer these questions we look at the relationship between latitude and longitude for each station, and then check whether the coordinate pairs form groups of similar locations.

In [None]:
f, ax = plt.subplots(1,2,figsize=(15,7))
plot_start = sns.scatterplot(ax=ax[0],data=clean_df,x='start_station_latitude',y='start_station_longitude');
plot_end = sns.scatterplot(ax=ax[1],data=clean_df,x='end_station_latitude',y='end_station_longitude');

Note that the data is separated into regions. I will obtain these groupings through clustering and assign a centroid to each of them, which I will call a macro region. Then I will increase the number of clusters to obtain the micro regions within the data.

In [None]:
#Lets create macro and micro regions with Kmeans Clustering
kmeans_start_macro = KMeans(n_clusters=3, random_state=0).fit(clean_df[['start_station_latitude','start_station_longitude']])
kmeans_end_macro = KMeans(n_clusters=3, random_state=0).fit(clean_df[['end_station_latitude','end_station_longitude']])
clean_df['start_station_macro_region']=kmeans_start_macro.labels_
clean_df['end_station_macro_region']=kmeans_end_macro.labels_

kmeans_start_micro = KMeans(n_clusters=6, random_state=0).fit(clean_df[['start_station_latitude','start_station_longitude']])
kmeans_end_micro = KMeans(n_clusters=6, random_state=0).fit(clean_df[['end_station_latitude','end_station_longitude']])
clean_df['start_station_micro_region']=kmeans_start_micro.labels_
clean_df['end_station_micro_region']=kmeans_end_micro.labels_

clean_df['start_station_macro_region']= clean_df['start_station_macro_region'].astype('category')
clean_df['end_station_macro_region']=clean_df['end_station_macro_region'].astype('category')
clean_df['start_station_micro_region']=clean_df['start_station_micro_region'].astype('category')
clean_df['end_station_micro_region']=clean_df['end_station_micro_region'].astype('category')

In parallel, I want to find the names of the regions, since the clustering method only provides a centroid for each one. For this I built a function based on the GeoPy Python library, which looks up the address of a coordinate through the Nominatim geocoding service.

In [None]:
#Function to get Lat Long Macro and micro names
geolocator = Nominatim(user_agent="test_id2")
def findaddress(latlonglist,geolocator):
    '''Reverse-geocode each (lat, long) pair and return a list of [address, pair] entries.'''
    address = []
    for i in latlonglist:
        lat = i[0]
        long = i[1]
        string = str(lat)+','+str(long)
        location = geolocator.reverse(string)
        loc_addr = location.raw['address']
        address.append([loc_addr,i])
        time.sleep(1)
    return address

In [None]:
# Reverse-geocode the cluster centers to define the macro and micro region labels
macro_start_names = findaddress(kmeans_start_macro.cluster_centers_,geolocator)
macro_end_names = findaddress(kmeans_end_macro.cluster_centers_,geolocator)
micro_start_names = findaddress(kmeans_start_micro.cluster_centers_,geolocator)
micro_end_names = findaddress(kmeans_end_micro.cluster_centers_,geolocator)

As an example, we retrieve the names for the macro and micro regions. These functions can be used to attach new information to the centroids and enrich the study data.

In [None]:
macro_start_names

In [None]:
micro_start_names
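
One possible way to use these results is to turn each address into a short label and map it back onto the cluster numbers. The sketch below assumes the `[address, center]` layout returned by `findaddress` and assumes the address dictionaries contain keys such as `city`, `town`, `county` or `state`; which keys are actually present varies by location. The helpers `region_name` and `label_names` and the example addresses are hypothetical.

```python
def region_name(address, keys=("city", "town", "county", "state")):
    """Return the first available place name in an address dict, or 'unknown'."""
    for key in keys:
        if key in address:
            return address[key]
    return "unknown"

def label_names(named_centers):
    """Map cluster label -> readable name.

    `named_centers` is a list of [address_dict, center] pairs in cluster order,
    so the position in the list is the cluster label.
    """
    return {label: region_name(addr)
            for label, (addr, _center) in enumerate(named_centers)}

# Made-up output in the shape produced by findaddress (assumption).
example_centers = [
    [{"city": "Example City", "state": "Example State"}, (1.0, 2.0)],
    [{"county": "Example County"}, (3.0, 4.0)],
]

names = label_names(example_centers)
labels = [0, 1, 0]                       # cluster labels of three trips
readable = [names[label] for label in labels]
print(names, readable)
```

With pandas, the same mapping could be applied to a region column through `Series.map(names)`.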

The following are the macro and micro regions defined by the K-means clustering method:

In [None]:
f, ax = plt.subplots(1,2,figsize=(15,7))
plot_start = sns.scatterplot(ax=ax[0],data=clean_df,
                             x='start_station_latitude',y='start_station_longitude', hue="start_station_macro_region");

plot_end = sns.scatterplot(ax=ax[1],data=clean_df,
                           x='end_station_latitude',y='end_station_longitude', hue="start_station_macro_region");
ax[0].set_title('Macro Regions - Start Trip');
ax[1].set_title('Macro Regions - End Trip');

In [None]:
f, ax = plt.subplots(1,2,figsize=(15,7))
plot_start = sns.scatterplot(ax=ax[0],data=clean_df,
                             x='start_station_latitude',y='start_station_longitude', hue="start_station_micro_region");

plot_end = sns.scatterplot(ax=ax[1],data=clean_df,
                           x='end_station_latitude',y='end_station_longitude', hue="start_station_micro_region");

In [None]:
clean_df.info()

With the data properly adjusted, the next section evaluates each variable that is important for the spatial analysis and for the trips made.

In [None]:
#Save clean data as master
clean_df.to_csv('clean_master_fordgobike.csv', index=False)

In [None]:
#load master data
master_df = clean_df.copy()

<a id=uniexp></a>
## Univariate Exploration



### Regarding time variables

We start by evaluating the distribution of trips over time in the following ways: by hour of the day, by day of the week, and by day of the month.

#### How are trips distributed over the hours of the day?

In [None]:
#by hour
plt.figure(figsize=(10,5))
plt.bar(master_df['hour_value'].value_counts().index,master_df['hour_value'].value_counts(),color='#28627A',width=1)
plt.xlabel('Hour');
plt.ylabel('Frequency');
plt.title('Start Trip hour vs Frequency');

#### Observation:

The chart above shows that demand for the service peaks at two times of day: around 8 am and around 5 pm. This matches the times when people leave home for work and other appointments and when they return. Demand is lowest in the early hours of the morning.


#### How are trips distributed over the days of the week?

In [None]:
#byweek
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.figure(figsize=(5,5))
plt.barh(master_df['day_week'].value_counts().index,master_df['day_week'].value_counts(),color='#28627A')
plt.yticks(np.arange(0,7),weekday)
plt.ylabel('Day in Week');
plt.xlabel('Frequency');
plt.title('Start Trip day of Week vs Frequency');

#### Observation:

Across the week, the service has higher demand from Monday to Friday, and Thursday is the day with the highest demand. On weekends, demand decreases by almost half.
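
The weekday versus weekend comparison can be made more precise by comparing the average number of trips per calendar day, which corrects for the fact that a month contains more weekdays than weekend days. The following is a minimal sketch on made-up start times; the column names `day_type` and `date` are introduced here for illustration only.

```python
import pandas as pd

# Made-up start times (assumption): 1 Feb 2019 is a Friday, 4 Feb 2019 a Monday.
trips = pd.DataFrame({"start_time": pd.to_datetime(
    ["2019-02-01 08:00"] * 4 + ["2019-02-02 11:00"] * 2
    + ["2019-02-03 12:00"] * 2 + ["2019-02-04 17:00"] * 4
)})

trips["date"] = trips["start_time"].dt.date
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
is_weekend = trips["start_time"].dt.dayofweek >= 5
trips["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Trips per calendar day, then the mean of those daily totals per day type
per_day = trips.groupby(["day_type", "date"]).size()
mean_per_day = per_day.groupby(level="day_type").mean()
ratio = mean_per_day["weekend"] / mean_per_day["weekday"]
print(mean_per_day, ratio)
```

A ratio close to 0.5 on the real data would support the "almost half" reading of the chart.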


#### How are trips distributed over the days of the month?

In [None]:
#bydayinmonth
plt.figure(figsize=(7,10))
plt.barh(master_df['day'].value_counts().index,master_df['day'].value_counts(),color='#28627A')
plt.yticks(np.arange(0,29,step=1))
plt.ylabel('All Days in February/2019');
plt.xlabel('Frequency');
plt.title('Start Trip in each day of February 2019 vs Frequency');

#### Observation:

Over the month, the same pattern appears: higher use on weekdays and lower use on weekends.
An important point to investigate is how trips differ between weekdays and weekends in terms of duration and distance traveled. We can also check whether the age of the people who use the service during the week differs from that of weekend users.
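
The dataset has no distance column, so any analysis of distances needs one to be derived. One common option, shown here as a sketch rather than as the method used in this notebook, is the great-circle (haversine) distance between the start and end coordinates. It measures the straight line over the Earth's surface, not the route actually ridden; the function name and the sample coordinates are illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Illustrative coordinates (assumption), roughly within the study area.
trip_km = haversine_km(37.7766, -122.3947, 37.7896, -122.4011)
print(round(trip_km, 2))
```

Applied row by row to the start and end coordinate columns, this would give a distance that can be compared between weekdays and weekends alongside the duration.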

### Regarding user variables

#### What is the distribution of genders across the data from bike trips?

In [None]:
plt.figure(figsize=(5,5))
gender = master_df['member_gender'].value_counts()
plt.bar(gender.index,gender,color='#28627A')
plt.xlabel('Gender');
plt.ylabel('Frequency');
plt.title('Trips by Gender');

#### Observation:
The program is clearly used more by men: male users account for about 80,000 more trips than female users.


#### How are users' birth years distributed?

In [None]:
plt.figure(figsize=(6,6))
genderyear = master_df['member_birth_year']
plt.hist(genderyear,bins = 15, density=True, color='#28627A')
plt.xlabel('Birth Year range');
plt.ylabel('Density');
plt.title('Density birth year values');

#### Observation:
The birth years are concentrated on the right side of the distribution, with the highest frequencies around 1990 and a long tail toward earlier years. The earliest birth years are close to 1940, showing that there are users in all age groups, with a greater concentration of young people and adults and fewer elderly users.

#### How often are users subscribed to the bike sharing program?

In [None]:
plt.figure(figsize=(5,5))
user_types = master_df['user_type'].value_counts()
plt.bar(user_types.index,user_types,color='#28627A',width=0.5)
plt.xlabel('User Type');
plt.ylabel('Frequency');
plt.title('User type vs Frequency');

#### Observation:

The number of trips by subscribers is approximately 8 times greater than the number by casual customers. This indicates a high subscription rate among the people who use the bikes.

### Regarding travel-related variables 

#### What are the most used stations? Is there a segregation of regions in terms of the distribution of latitudes and longitudes?

In [None]:
startstation = master_df['start_station_name'].value_counts()
plt.figure(figsize=(10,5))
fig = plt.barh(startstation[:10].index,startstation[:10],color='#28627A')
plt.xlabel('Frequency');
plt.ylabel('Start Stations');
plt.title('Top 10 Most Used Start Stations');

In [None]:
endstation = master_df['end_station_name'].value_counts()
plt.figure(figsize=(10,5))
fig = plt.barh(endstation[:10].index,endstation[:10],color='#28627A')
plt.xlabel('Frequency');
plt.ylabel('End Stations');
plt.title('Top 10 Most Used End Stations');

In [None]:
master_df[['start_station_latitude','start_station_longitude']].hist(bins=5,color='#28627A');
master_df[['end_station_latitude','end_station_longitude']].hist(bins=5,color='#28627A');

#### Observation
In a quick analysis of origins and destinations, the most frequent start station is Market St at 10th St and the most frequent end station is San Francisco Caltrain Station 2. This suggests that there is a pattern in how people move around the region.

In addition, the histograms reinforce the idea that trips happen within specific groups of stations, so we can group the stations with clustering techniques. This enriches our data about the regions of the program, since there are almost no trips between the large groups. With origin and destination data plus user information, we could study the urban mobility pattern of the region.

#### What is the distribution of travel times?

In [None]:
plt.hist(data=master_df, x='duration_sec',color='#28627A');
plt.xlabel('Trip Duration in Seconds');

In [None]:
master_df['duration_sec'].describe(percentiles=[.99])

In [None]:
master_df = master_df[master_df['duration_sec'] < 3600]

In [None]:
bins = np.arange(0, 3177, 60)
ticks = np.arange(0, 3177, 900)
plt.hist(data=master_df, x='duration_sec', bins=bins,color='#28627A');
plt.xticks(ticks, ticks);
plt.xlabel('Trip Duration in Seconds');

#### Observation
The plotted trip durations had to be adjusted because the data contained a very high maximum of 84,548 seconds. We can therefore disregard trips of one hour (3,600 seconds) or longer, because they deviate strongly from the general distribution of trip durations.
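
Before applying a cutoff like this, it helps to check how much data it removes. The sketch below uses made-up durations, with one extreme value equal to the maximum reported above, and compares the 99th percentile with the one-hour threshold.

```python
import pandas as pd

# Made-up durations in seconds (assumption): mostly short trips plus one extreme value.
durations = pd.Series([300] * 98 + [900, 84548])

p99 = durations.quantile(0.99)          # 99th percentile (linear interpolation)
kept = durations[durations < 3600]      # same one-hour threshold as in the notebook
share_removed = 1 - len(kept) / len(durations)

print(p99, len(kept), share_removed)
```

If the share removed on the real data is around one percent or less, the cutoff changes the shape of the histogram without discarding a meaningful part of the trips.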

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The variables that caught my attention most were latitude and longitude, since their distributions are concentrated in fixed regions, as seen in the charts of the data cleaning section. I will look at how they are related and group them to generate more information about the study regions. It was not necessary to transform the latitude and longitude data.
 
### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The trip duration variable showed a very unusual distribution. After checking the 99th percentile, I found that a few extremely long trips distorted the histogram, so I removed them and generated a new distribution chart that looks much more realistic.

<a id=biexp></a>
## Bivariate Exploration

Let's analyze the correlations and look for what relationships exist between the data

#### Which columns are correlated?

In [None]:
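# Correlate numeric columns only; recent pandas versions raise an error on non-numeric columns otherwise
corr = master_df.corr(numeric_only=True)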

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

#### Observation

The correlation matrix shows that the latitude and longitude columns are strongly correlated with each other, while the age and duration variables are only partially related. We will therefore use the regions to continue the study. In general, the matrix does not reveal much that is new, but it supports the spatial approach, because it demonstrates a strong relationship between the coordinates.

From the region plots shown earlier, we can see that trips between macro regions almost never happen. Between micro regions, however, especially those that belong to the macro regions at the bottom of the chart, many trips start in one micro region and end in another.
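
The claim that trips rarely cross macro regions can be quantified with a cross-tabulation of start region against end region. The sketch below uses made-up labels and hypothetical column names. It assumes that start and end labels refer to the same regions; that is only guaranteed when both come from one fitted clustering model, whereas the notebook fits the start and end models separately.

```python
import pandas as pd

# Made-up region labels (assumption), one row per trip.
trips = pd.DataFrame({
    "start_region": [0, 0, 0, 1, 1, 2, 2, 2],
    "end_region":   [0, 0, 1, 1, 1, 2, 2, 2],
})

# Each row shows where trips starting in that region end, as shares summing to 1
flows = pd.crosstab(trips["start_region"], trips["end_region"], normalize="index")

# Overall share of trips whose start and end regions differ
cross_share = (trips["start_region"] != trips["end_region"]).mean()
print(flows, cross_share)
```

A `cross_share` close to zero for the macro labels and noticeably higher for the micro labels would match the pattern described above.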


### Evaluating data against macro regions
#### How does birth year vary across the macro regions?

In [None]:
plt.figure(figsize=(8,8))
sns.violinplot(data=master_df, x='start_station_macro_region', y='member_birth_year', color='crimson', inner='quartile');
plt.xlabel('Macro region')
plt.ylabel('Birth year')
plt.title('Birth Year vs Macro Region');

#### Observation

The chart shows that riders in macro region 1 have birth years concentrated around 1995, so the members of this region are in general younger than the members of the other macro regions, whose birth years are more spread out. The members of macro region 1 have birth years distributed around 1985, while birth years in macro region 2 are most frequent in 1988 and later.

#### How does the year of birth vary with travel times?


In [None]:
sns.displot(master_df, x="member_birth_year", y="duration_sec");

#### Observation

The plot shows a higher concentration of trips for members born between 1980 and 2000, indicated by the darker color. In addition, long trips become less common as member age increases. In general, younger members make longer trips, but most trips are concentrated around 500 seconds, which is a little over 8 minutes. This is a useful baseline for evaluating how each area behaves in relation to the overall picture.
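
To put numbers on this relationship, trips can be grouped by birth decade and summarized with the median duration, which is less sensitive to a few long trips than the mean. This is a sketch on made-up values; the `birth_decade` column is introduced here for illustration.

```python
import pandas as pd

# Made-up trips (assumption).
trips = pd.DataFrame({
    "member_birth_year": [1955, 1958, 1984, 1987, 1992, 1995, 1999],
    "duration_sec":      [400,  600,  500,  700,  450,  550,  900],
})

# Floor each birth year to its decade, e.g. 1987 -> 1980
trips["birth_decade"] = (trips["member_birth_year"] // 10) * 10

by_decade = trips.groupby("birth_decade")["duration_sec"].agg(["count", "median"])
by_decade["median_min"] = by_decade["median"] / 60   # seconds to minutes
print(by_decade)
```

The `count` column matters as much as the median here, since decades with few riders give unstable estimates.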

#### How do average trip durations relate to macro regions?

In [None]:
sns.barplot(data=master_df, x="start_station_macro_region", y="duration_sec")
plt.xlabel('Macro region')
plt.ylabel('Mean trip duration (s)')
plt.title('Trip Duration vs Macro Region');

#### Observation:

Region 0 has the longest average trip duration. This can be used to characterize the regions of the study and to understand how trip duration relates to location.

#### How is gender related to macro regions?

In [None]:
sns.countplot(data=master_df,x='start_station_macro_region',hue='member_gender');

#### Observation

Here gender follows the same pattern as in the overall data: there is a large difference between the number of trips by men and by women.

#### How is user type related to macro regions?

In [None]:
sns.countplot(data=master_df,x='start_station_macro_region',hue='user_type');

#### Observation

Here user type follows the same pattern as in the overall data: there is a large difference between subscribers and customers.

#### What are the relationships between the temporal variables and the macro regions of the study?

In [None]:
plt.figure(figsize = [15, 15]);


plt.subplot(3, 1, 1);
sns.countplot(data = master_df, x = 'hour_value', hue = 'start_station_macro_region', palette = 'ch:s=.25,rot=-.35');
ax = plt.subplot(3, 1, 2);

sns.countplot(data = master_df, x = 'day_week', hue = 'start_station_macro_region', palette = 'ch:s=.25,rot=-.35');
ax.legend(ncol = 2); # re-arrange legend to reduce overlapping

ax = plt.subplot(3, 1, 3);
sns.countplot(data = master_df, x = 'day', hue = 'start_station_macro_region', palette = 'ch:s=.25,rot=-.35');
ax.legend(loc = 1, ncol = 2); # re-arrange legend to remove overlapping


#### Observation

Across the hours of the day, the days of the week and the days of the month, macro region 0 has more trips than the other regions, and it follows the same pattern observed in the univariate analysis. The same pattern can be seen in the other regions.

### Evaluating data for micro regions

#### What is the relationship of the date of birth year within the micro regions?

In [None]:
plt.figure(figsize=(8,8))
sns.violinplot(data=master_df, x='start_station_micro_region', y='member_birth_year', color='crimson', inner='quartile');
plt.xlabel('Micro region')
plt.ylabel('Birth year')
plt.title('Birth Year vs Micro Region');

#### Observation

When we evaluate the micro regions, two of them stand out from the others: region 1 and region 3. As seen at the start of this section, one micro region is practically identical to a macro region, so the similar behavior of micro region 1 and macro region 1 suggests that they contain the same data. Micro region 3, in turn, behaves like macro region 2, which is where that micro region is located.

#### What is the average trip duration in micro regions?


In [None]:
sns.barplot(data=master_df, x="start_station_micro_region", y="duration_sec")
plt.xlabel('Micro region')
plt.ylabel('Mean trip duration (s)')
plt.title('Trip Duration vs Micro Region');

#### Observation

Micro regions 2 and 4 have the longest average trip durations, which are longer than or close to the values of the macro regions that contain them. The shortest average duration is in micro region 1, which has the same value as its macro region.

#### How are user types and genders distributed across micro regions?

In [None]:
sns.countplot(data=master_df,x='start_station_micro_region',hue='user_type');

In [None]:
sns.countplot(data=master_df,x='start_station_micro_region',hue='member_gender');

#### Observation

Here user type and gender follow the same pattern as in the overall data: subscribers far outnumber customers, and men far outnumber women.


#### What are the relationships between micro regions, user data and the time variables?

In [None]:

plt.figure(figsize = [25, 20]);


plt.subplot(6, 1, 1);
sns.countplot(data = master_df, x = 'hour_value', hue = 'start_station_micro_region', palette = 'ch:s=.25,rot=-.35');
ax = plt.subplot(6, 1, 2);

sns.countplot(data = master_df, x = 'day_week', hue = 'start_station_micro_region', palette = 'ch:s=.25,rot=-.35');
ax.legend(ncol = 2); # re-arrange legend to reduce overlapping

ax = plt.subplot(6, 1, 3);
sns.countplot(data = master_df, x = 'day', hue = 'start_station_micro_region', palette = 'ch:s=.25,rot=-.35');
ax.legend(loc = 1, ncol = 2); # re-arrange legend to remove overlapping

ax = plt.subplot(6, 1, 4);
sns.countplot(data = master_df, x = 'day_week', hue = 'member_gender', palette = 'Greens');

ax = plt.subplot(6, 1, 5);
sns.countplot(data = master_df, x = 'day_week', hue = 'user_type', palette = 'autumn');

ax = plt.subplot(6, 1, 6);

sns.countplot(data = master_df, x = 'day', hue = 'user_type', palette = 'spring');

#### Observation

Here the micro regions show the variation in the data in a little more detail than the macro view. This shows the value of adding derived information to the data: the same question, asked at a different scale, produces quite different results.

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Regarding the macro regions, we observed an important characteristic in the mean trip durations of the regions and in how the distribution of ages varies in each of them. The region with the most bike trips at all hours and on all days was region 0, followed by region 2 and finally region 1. This is interesting because region 0 not only shows the most use of the service but also has the most users, while the region with the fewest users shows steady use over time.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes. Micro regions show large variations compared with the first approach using macro regions, and they clarify the relationship between distances and routes better than macro regions do, since they capture the movements between zones.

<a id=multiexp></a>
## Multivariate Exploration

In this section I discuss, in a limited way, how the bike stations relate to trip duration, time of departure and arrival, and trip origin and destination.

### Macro region 0

I start with macro region 0. First I analyze the departure and arrival characteristics of the region in relation to the trip duration, grouping the data by the stations of the region.

#### Questions:

- **In macro region 0, how are spatial data distributed in terms of travel duration?**

- **In macro region 0, at what hour of the day, on average, is each station used?**

- **In macro region 0, what is the pattern of displacement of people?**

In [None]:
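# Average the numeric columns per start station (string and category columns are skipped)
macro0 = master_df[master_df['start_station_macro_region'] == 0].groupby('start_station_name').mean(numeric_only=True)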
plt.figure(figsize = [20, 20]);
ax= plt.subplot(2, 2, 1);
sns.scatterplot(x="start_station_latitude", y="start_station_longitude",hue='duration_sec',
                   size='duration_sec', sizes=(200, 1000) ,data=macro0,palette = 'ch:s=0.8,rot=0')
ax = plt.subplot(2, 2, 2);
sns.scatterplot(x="end_station_latitude", y="end_station_longitude",hue='duration_sec',
                   size='duration_sec', sizes=(200, 1000) ,data=macro0,palette = 'ch:s=0.8,rot=0');

In [None]:
plt.figure(figsize = [20, 10]);
ax= plt.subplot(1, 2, 1);
sns.scatterplot(x="start_station_latitude", y="start_station_longitude",hue='hour_value',
                   size='hour_value', sizes=(0, 1000) ,data=macro0,palette = 'ch:s=.25,rot=-.35')
ax = plt.subplot(1, 2, 2);
sns.scatterplot(x="end_station_latitude", y="end_station_longitude",hue='hour_value',
                   size='hour_value', sizes=(0, 1000) ,data=macro0,palette = 'ch:s=.25,rot=-.35');

In [None]:
macro0_stations = master_df[master_df['start_station_macro_region'] == 0]
plt.figure(figsize = [20, 10]);
ax= plt.subplot(1, 2, 1);
sns.scatterplot(x="start_station_latitude", y="start_station_longitude",hue='start_station_micro_region',
                    s=300,data=macro0_stations,palette = 'Set2')
ax = plt.subplot(1, 2, 2);
sns.scatterplot(x="end_station_latitude", y="end_station_longitude",hue='start_station_micro_region',
                    s=300,data=macro0_stations,palette = 'Set2');

#### Observation:

The first graph shows that trips starting at the periphery of region 0 are long, while trips in the center are short. The second graph shows that the central stations, with their short trips, are used throughout both the morning and the afternoon, whereas the peripheral stations tend to be used either in the morning or in the afternoon. This may be the effect of people who travel to the central region for their daily activities and return home at the end of the day.
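The center-versus-periphery reading above comes from visual inspection of the scatter plots. As a rough way to put a number on it, the sketch below correlates each station's distance from the region centroid with its mean trip duration. It assumes a frame shaped like `macro0` (one row per start station, with the averaged columns); the function name `duration_vs_centrality`, the planar distance approximation and the demo values are my own illustrative choices and are not part of the original analysis.

```python
import numpy as np
import pandas as pd


def duration_vs_centrality(station_means: pd.DataFrame) -> float:
    """Spearman correlation between distance from the region centroid and mean duration."""
    lat = station_means["start_station_latitude"]
    lon = station_means["start_station_longitude"]
    # Approximate planar offsets from the centroid in km (reasonable at city scale).
    dy_km = (lat - lat.mean()) * 111.0
    dx_km = (lon - lon.mean()) * 111.0 * np.cos(np.radians(lat.mean()))
    dist_km = np.sqrt(dx_km ** 2 + dy_km ** 2)
    # Pearson correlation of the ranks is the Spearman rank correlation.
    return float(dist_km.rank().corr(station_means["duration_sec"].rank()))


# Made-up station averages, shaped like `macro0`, only to show the call.
demo_means = pd.DataFrame({
    "start_station_latitude": [37.77, 37.78, 37.80, 37.74],
    "start_station_longitude": [-122.42, -122.41, -122.40, -122.45],
    "duration_sec": [400.0, 450.0, 900.0, 1000.0],
})
rho = duration_vs_centrality(demo_means)
print(round(rho, 2))
```

A value close to 1 would support the idea that trips get longer as stations get farther from the center; a value near 0 would suggest the visual impression is not backed by the station averages.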

### Macro region 1


#### Questions:

- **In macro region 1, how are spatial data distributed in terms of travel duration?**

- **In macro region 1, in relation to the stations, what is the average hour at which most people use them?**

- **In macro region 1, what is the pattern of displacement of people?**

In [None]:
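# Average the numeric columns per start station in macro region 1
# (numeric_only=True avoids a TypeError on text columns in pandas >= 2.0).
macro1 = master_df[master_df['start_station_macro_region'] == 1].groupby('start_station_name').mean(numeric_only=True)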
plt.figure(figsize = [20, 10]);
ax= plt.subplot(1, 2, 1);
sns.scatterplot(x="start_station_latitude", y="start_station_longitude",hue='duration_sec',
                   size='duration_sec', sizes=(250, 1500) ,data=macro1,palette = 'ch:s=0.8,rot=0')
ax = plt.subplot(1, 2, 2);
sns.scatterplot(x="end_station_latitude", y="end_station_longitude",hue='duration_sec',
                   size='duration_sec', sizes=(250, 1500) ,data=macro1,palette = 'ch:s=0.8,rot=0');

In [None]:
plt.figure(figsize = [20, 10]);
ax= plt.subplot(1, 2, 1);
sns.scatterplot(x="start_station_latitude", y="start_station_longitude",hue='hour_value',
                   size='hour_value', sizes=(0, 1000) ,data=macro1,palette = 'ch:s=.25,rot=-.35')
ax = plt.subplot(1, 2, 2);
sns.scatterplot(x="end_station_latitude", y="end_station_longitude",hue='hour_value',
                   size='hour_value', sizes=(0, 1000) ,data=macro1,palette = 'ch:s=.25,rot=-.35');

In [None]:
macro1_stations = master_df[master_df['start_station_macro_region'] == 1]
plt.figure(figsize = [20, 10]);
ax= plt.subplot(1, 2, 1);
sns.scatterplot(x="start_station_latitude", y="start_station_longitude",hue='start_station_micro_region',
                    s=300,data=macro1_stations,palette = 'Set2')
ax = plt.subplot(1, 2, 2);
sns.scatterplot(x="end_station_latitude", y="end_station_longitude",hue='start_station_micro_region',
                    s=300,data=macro1_stations,palette = 'Set2');

#### Observation

Macro region 1 has some interesting characteristics. First, it is composed of 3 micro regions, with short trips in the center and trip duration increasing with distance from it. Second, the highest average hour of use occurs in the central region, which is likely due to people returning home, while in the more remote areas the average hour is closer to the early morning, indicating trips toward the center. Finally, the last graph shows the trips between micro regions: movement from the 3 zones converges on the central position of the data. That is, the most likely destination is the central region, and trips toward the periphery are less common.
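The movement between micro regions described above could also be tabulated as an origin-destination matrix. The sketch below is one possible way to do that with `pd.crosstab`. Note that it assumes an `end_station_micro_region` column, which does not appear in the cells above and would have to be derived the same way as `start_station_micro_region`; the function name and the demo rows are likewise hypothetical.

```python
import pandas as pd


def micro_region_flows(trips: pd.DataFrame) -> pd.DataFrame:
    """Share of trips from each start micro region (rows) to each end micro region (columns)."""
    return pd.crosstab(
        trips["start_station_micro_region"],
        trips["end_station_micro_region"],  # hypothetical column, see the note above
        normalize="index",  # each row sums to 1
    )


# Made-up trips, only to show the shape of the result.
demo_trips = pd.DataFrame({
    "start_station_micro_region": [0, 0, 0, 1, 1, 2],
    "end_station_micro_region": [1, 1, 0, 1, 1, 1],
})
flows = micro_region_flows(demo_trips)
print(flows)
```

If one column holds most of the mass in every row, that column is the micro region where trips converge, which is the pattern the last graph suggests for the central zone.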

### Macro region 2

#### Questions:

- **In macro region 2, how are spatial data distributed in terms of travel duration?**

- **In macro region 2, in relation to the stations, what is the average hour at which most people use them?**

- **In macro region 2, what is the pattern of displacement of people?**

In [None]:
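# Average the numeric columns per start station in macro region 2
# (numeric_only=True avoids a TypeError on text columns in pandas >= 2.0).
macro2 = master_df[master_df['start_station_macro_region'] == 2].groupby('start_station_name').mean(numeric_only=True)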
plt.figure(figsize = [20, 10]);
ax= plt.subplot(1, 2, 1);
sns.scatterplot(x="start_station_latitude", y="start_station_longitude",hue='duration_sec',
                   size='duration_sec', sizes=(300, 1500) ,data=macro2,palette = 'ch:s=0.8,rot=0')
ax = plt.subplot(1, 2, 2);
sns.scatterplot(x="end_station_latitude", y="end_station_longitude",hue='duration_sec',
                   size='duration_sec', sizes=(300, 1500) ,data=macro2,palette = 'ch:s=0.8,rot=0');

In [None]:
plt.figure(figsize = [20, 10]);
ax= plt.subplot(1, 2, 1);
sns.scatterplot(x="start_station_latitude", y="start_station_longitude",hue='hour_value',
                   size='hour_value', sizes=(0, 1000) ,data=macro2,palette = 'ch:s=.25,rot=-.35')
ax = plt.subplot(1, 2, 2);
sns.scatterplot(x="end_station_latitude", y="end_station_longitude",hue='hour_value',
                   size='hour_value', sizes=(0, 1000) ,data=macro2,palette = 'ch:s=.25,rot=-.35');

In [None]:
macro2_stations = master_df[master_df['start_station_macro_region'] == 2]
plt.figure(figsize = [20, 10]);
ax= plt.subplot(1, 2, 1);
sns.scatterplot(x="start_station_latitude", y="start_station_longitude",hue='start_station_micro_region',
                    s=300,data=macro2_stations,palette = 'Set2')
ax = plt.subplot(1, 2, 2);
sns.scatterplot(x="end_station_latitude", y="end_station_longitude",hue='start_station_micro_region',
                    s=300,data=macro2_stations,palette = 'Set2');

#### Observation

Region 2 shows the same trip-duration and time-of-day behavior as the others. A curious fact, however, is that when we evaluate the micro regions we see trips by people who leave macro zone 2 and head toward macro zone 1. That is, here we have mobility between macro zones, which we did not observe in the previous cases.
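To check how common trips between macro zones are, one option is to compute, for each start macro region, the fraction of trips that end in a different macro region. The sketch below assumes an `end_station_macro_region` column, which is not shown in the cells above and is hypothetical here, as are the function name and the demo rows.

```python
import pandas as pd


def cross_region_share(trips: pd.DataFrame) -> pd.Series:
    """Fraction of trips from each start macro region that end in a different macro region."""
    # end_station_macro_region is a hypothetical column, see the note above.
    crosses = trips["start_station_macro_region"] != trips["end_station_macro_region"]
    return crosses.astype(float).groupby(trips["start_station_macro_region"]).mean()


# Made-up trips: only region 2 has trips that leave the region.
demo_trips = pd.DataFrame({
    "start_station_macro_region": [0, 0, 1, 1, 2, 2, 2, 2],
    "end_station_macro_region": [0, 0, 1, 1, 2, 1, 1, 2],
})
share = cross_region_share(demo_trips)
print(share)
```

A share of 0 for regions 0 and 1 and a positive share for region 2 would match the observation made from the plots.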

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

We observed that the central areas of the spatial distributions analyzed have heavy user movement, but with shorter trip durations. People living on the outskirts tend to make longer trips, and the destination is often the center. As for the temporal variable, use of the service in the center is concentrated at around 15:00 in all regions, indicating that toward the end of the day people are moving from the center to other regions.
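The plots in this section color each station by the mean of `hour_value`. A mean can sit near the middle of the day even when a station is used mostly in the morning and in the late afternoon, so it may be worth looking at the full hourly distribution as well. The sketch below assumes `hour_value` holds integer hours from 0 to 23; the function name `hourly_profile` and the demo rows are illustrative only.

```python
import pandas as pd


def hourly_profile(trips: pd.DataFrame, station: str) -> pd.Series:
    """Number of trips starting at `station` for each hour of the day (0-23)."""
    hours = trips.loc[trips["start_station_name"] == station, "hour_value"]
    # Hours with no trips are filled with 0 so the profile always has 24 entries.
    return hours.value_counts().reindex(range(24), fill_value=0)


# Made-up trips: station "A" is used at 8:00 and 17:00 only.
demo_trips = pd.DataFrame({
    "start_station_name": ["A"] * 7 + ["B"] * 2,
    "hour_value": [8, 8, 8, 17, 17, 17, 17, 12, 13],
})
profile = hourly_profile(demo_trips, "A")
mean_hour = demo_trips.loc[demo_trips["start_station_name"] == "A", "hour_value"].mean()
print(f"mean hour: {mean_hour:.1f}, busiest hour: {profile.idxmax()}")
```

In this made-up case the mean hour is about 13.1 while the busiest hour is 17, which shows how the two summaries can differ for a station with separate morning and afternoon peaks.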

### Were there any interesting or surprising interactions between features?

Among all the macro regions studied, only macro region 2 showed trips between macro zones. Moreover, it is clear that movement between the micro regions is tied to the central part of each macro region studied.

<a id=refs></a>
## References

- https://jakevdp.github.io/PythonDataScienceHandbook/04.13-geographic-data-with-basemap.html
- https://geopy.readthedocs.io/en/stable/index.html?highlight=latitude#geopy.location.Location.latitude
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
- https://plotly.com/python/scattermapbox/
- https://www.geeksforgeeks.org/get-the-city-state-and-country-names-from-latitude-and-longitude-using-python/
