## Context:
Uber Technologies, Inc. is an American multinational transportation network company headquartered in San Francisco. It operates in more than 785 metropolitan areas and has more than 110 million users worldwide. The ride-sharing market is highly volatile, with demand fluctuating significantly with time of day, location, weather conditions, local events, and so on. The key to success in this business is the ability to detect patterns in these fluctuations and meet demand at any moment. As a newly hired Data Scientist at Uber's New York office, you have been tasked with extracting insights from the data that will help the business grow. Good insights are actionable and make business sense.

## Objective:
Extract actionable insights that will enable the business to grow in the market.

## Key questions:
1. Which variables influence ride requests?
2. Which factor affects ride requests the most? What could be the plausible reasons for this?
3. What are your recommendations to Uber management for capitalizing on fluctuating demand?

## Data description:

* pickup_dt: Date and time of the pickup.
* borough: New York City borough.
* pickups: Number of pickups in the period.
* spd: Wind speed in miles per hour.
* vsb: Visibility in miles, to the nearest tenth.
* temp: Temperature in Fahrenheit.
* dewp: Dew point in Fahrenheit.
* slp: Sea-level pressure.
* pcp01: 1-hour liquid precipitation.
* pcp06: 6-hour liquid precipitation.
* pcp24: 24-hour liquid precipitation.
* sd: Snow depth in inches.
* hday: Whether the day is a holiday (Y) or not (N).

### Importing the necessary libraries: pandas, numpy, seaborn, matplotlib.pyplot

In [2]:
# Library to suppress warnings or deprecation notes 
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Libraries to help with reading and manipulating data
import pandas as pd  # pandas, for manipulating in-memory datasets
import numpy as np  # numpy, for numerical operations and functions

# Libraries to help with data visualization
import seaborn as sns  # visualization library: charts and plots of various kinds
import matplotlib.pyplot as plt  # visualization library: charts and plots of various kinds
%matplotlib inline

### Loading the dataset

In [5]:
data = pd.read_csv('Uber.csv')

In [6]:
data.head(20)

Unnamed: 0,pickup_dt,borough,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd,hday
0,2015-01-01 01:00:00,Bronx,152,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
1,2015-01-01 01:00:00,Brooklyn,1519,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
2,2015-01-01 01:00:00,EWR,0,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
3,2015-01-01 01:00:00,Manhattan,5258,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
4,2015-01-01 01:00:00,Queens,405,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
5,2015-01-01 01:00:00,Staten Island,6,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
6,2015-01-01 01:00:00,,4,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
7,2015-01-01 02:00:00,Bronx,120,3.0,10.0,30.0,6.0,1023.0,0.0,0.0,0.0,0.0,Y
8,2015-01-01 02:00:00,Brooklyn,1229,3.0,10.0,30.0,6.0,1023.0,0.0,0.0,0.0,0.0,Y
9,2015-01-01 02:00:00,EWR,0,3.0,10.0,30.0,6.0,1023.0,0.0,0.0,0.0,0.0,Y


In [7]:
data.tail(20)

Unnamed: 0,pickup_dt,borough,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd,hday
29081,2015-06-30 20:00:00,,5,8.0,10.0,80.0,58.0,1012.4,0.0,0.0,0.0,0.0,N
29082,2015-06-30 21:00:00,Bronx,57,7.0,10.0,76.0,61.0,1012.4,0.0,0.0,0.0,0.0,N
29083,2015-06-30 21:00:00,Brooklyn,829,7.0,10.0,76.0,61.0,1012.4,0.0,0.0,0.0,0.0,N
29084,2015-06-30 21:00:00,EWR,0,7.0,10.0,76.0,61.0,1012.4,0.0,0.0,0.0,0.0,N
29085,2015-06-30 21:00:00,Manhattan,4313,7.0,10.0,76.0,61.0,1012.4,0.0,0.0,0.0,0.0,N
29086,2015-06-30 21:00:00,Queens,483,7.0,10.0,76.0,61.0,1012.4,0.0,0.0,0.0,0.0,N
29087,2015-06-30 21:00:00,Staten Island,2,7.0,10.0,76.0,61.0,1012.4,0.0,0.0,0.0,0.0,N
29088,2015-06-30 22:00:00,Bronx,53,5.0,10.0,76.0,64.0,1011.9,0.0,0.0,0.0,0.0,N
29089,2015-06-30 22:00:00,Brooklyn,860,5.0,10.0,76.0,64.0,1011.9,0.0,0.0,0.0,0.0,N
29090,2015-06-30 22:00:00,EWR,0,5.0,10.0,76.0,64.0,1011.9,0.0,0.0,0.0,0.0,N


In [8]:
data.sample(20)

Unnamed: 0,pickup_dt,borough,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd,hday
5877,2015-02-07 03:00:00,Manhattan,1217,0.0,10.0,25.0,6.0,1021.4,0.0,0.0,0.0,8.0,N
26887,2015-06-17 11:00:00,Bronx,66,9.0,10.0,71.0,57.0,1018.3,0.0,0.0,0.0,0.0,N
4435,2015-01-29 02:00:00,Manhattan,402,7.0,10.0,28.0,1.0,1023.7,0.0,0.0,0.279167,8.0,N
29097,2015-06-30 23:00:00,Manhattan,3828,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0,N
22189,2015-05-19 17:00:00,Manhattan,3855,1.0,4.0,64.666667,59.0,1012.7,0.0,0.0,0.0,0.0,N
22955,2015-05-24 09:00:00,Staten Island,1,6.0,10.0,57.0,37.0,1027.6,0.0,0.0,0.03,0.0,N
18559,2015-04-27 07:00:00,,5,5.0,10.0,50.0,29.0,1004.7,0.0,0.01,0.845,0.0,N
15187,2015-04-06 08:00:00,EWR,0,0.0,10.0,45.0,33.0,1025.3,0.0,0.0,0.111667,0.0,N
25365,2015-06-08 03:00:00,Manhattan,423,0.0,10.0,64.0,45.0,1019.8,0.0,0.09,0.0,0.0,N
20877,2015-05-11 15:00:00,,1,3.333333,8.0,75.666667,64.666667,1020.0,0.0,0.0,0.0,0.0,N


**Observations**

* pickup_dt contains the date and time of the pickup.
* borough contains the name of the New York City borough where the pickup took place.
* pickups contains the number of pickups in the borough at the specified time.
* All weather-related variables are numeric.
* hday (holiday) is a categorical variable.

### Checking the shape of the dataset

In [9]:
data.shape

(29101, 13)

* The dataset contains 29,101 rows (observations) and 13 columns (attributes).

### Checking info()
#### Inspect the information about the structure of the dataset

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pickup_dt  29101 non-null  object 
 1   borough    26058 non-null  object 
 2   pickups    29101 non-null  int64  
 3   spd        29101 non-null  float64
 4   vsb        29101 non-null  float64
 5   temp       29101 non-null  float64
 6   dewp       29101 non-null  float64
 7   slp        29101 non-null  float64
 8   pcp01      29101 non-null  float64
 9   pcp06      29101 non-null  float64
 10  pcp24      29101 non-null  float64
 11  sd         29101 non-null  float64
 12  hday       29101 non-null  object 
dtypes: float64(9), int64(1), object(3)
memory usage: 2.9+ MB


**Observations**

* All columns have 29,101 non-null observations except borough, which has 26,058, so 3,043 values are missing in that column.
* The pickup_dt column is read as the 'object' data type, but it should be 'datetime'.
* The borough and hday (holiday) columns should be categorical variables.
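
The observations above can also be checked and addressed at load time. The following is a minimal sketch, not the notebook's own method: it reads a few rows copied from the `head()` output (with only four of the columns, for brevity) from an in-memory string, asks `pd.read_csv` to parse the date column and to store `borough` and `hday` as categories, and counts missing values per column. The names `csv_text`, `sample`, and `missing_per_column` are hypothetical.

```python
import io

import pandas as pd

# A few rows copied from the head() output, including one row with a missing borough.
csv_text = """pickup_dt,borough,pickups,hday
2015-01-01 01:00:00,Bronx,152,Y
2015-01-01 01:00:00,Brooklyn,1519,Y
2015-01-01 01:00:00,EWR,0,Y
2015-01-01 01:00:00,,4,Y
"""

# Parse the timestamp and store the two text columns as categories while reading.
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["pickup_dt"],
    dtype={"borough": "category", "hday": "category"},
)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

print(sample.dtypes)
print(missing_per_column)
```

On the full file, the same `isna().sum()` call should report the gap in `borough` that `info()` shows.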

# Data Preprocessing

In [14]:
# copy the data to another variable so the original data is not modified
df = data.copy()

In [15]:
# The date-time column should be converted to the datetime data type.
# Categorical columns should be converted to the category data type.
# (This reduces the memory needed to store the dataframe, because every class
# in a categorical column is represented by an integer code under the hood.
# It is also useful later, during model building.)

df['pickup_dt'] = pd.to_datetime(df.pickup_dt)
df['borough'] = df.borough.astype('category')
df['hday'] = df.hday.astype('category')

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   pickup_dt  29101 non-null  datetime64[ns]
 1   borough    26058 non-null  category      
 2   pickups    29101 non-null  int64         
 3   spd        29101 non-null  float64       
 4   vsb        29101 non-null  float64       
 5   temp       29101 non-null  float64       
 6   dewp       29101 non-null  float64       
 7   slp        29101 non-null  float64       
 8   pcp01      29101 non-null  float64       
 9   pcp06      29101 non-null  float64       
 10  pcp24      29101 non-null  float64       
 11  sd         29101 non-null  float64       
 12  hday       29101 non-null  category      
dtypes: category(2), datetime64[ns](1), float64(9), int64(1)
memory usage: 2.5 MB
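
The drop in reported memory usage after the conversion comes from how the category dtype stores values. As a rough illustration on synthetic data (the series below repeats the six borough names and is not the real dataset), this sketch compares the memory of an object column with its categorical version and shows the integer codes behind the labels. The variable names are hypothetical.

```python
import pandas as pd

# Synthetic column: the six borough names repeated 1,000 times each.
names = ["Bronx", "Brooklyn", "EWR", "Manhattan", "Queens", "Staten Island"]
as_object = pd.Series(names * 1000, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"object: {object_bytes} bytes, category: {category_bytes} bytes")

# Each label is stored once; the rows hold small integer codes.
print(as_category.cat.categories.tolist())
first_codes = as_category.cat.codes.head(6).tolist()
print(first_codes)
```

The saving grows with the number of rows and shrinks as the number of distinct labels approaches the number of rows.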


### Viewing the first rows of the dataset


In [17]:
df.head(5)

Unnamed: 0,pickup_dt,borough,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd,hday
0,2015-01-01 01:00:00,Bronx,152,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
1,2015-01-01 01:00:00,Brooklyn,1519,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
2,2015-01-01 01:00:00,EWR,0,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
3,2015-01-01 01:00:00,Manhattan,5258,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
4,2015-01-01 01:00:00,Queens,405,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y


In [18]:
df.sample(5)

Unnamed: 0,pickup_dt,borough,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd,hday
6789,2015-02-12 21:00:00,Bronx,45,10.0,10.0,36.0,18.0,1006.4,0.0,0.0,0.0,8.0,Y
4608,2015-01-30 04:00:00,Manhattan,376,3.0,5.266667,34.333333,22.666667,1016.5,0.0,0.0,0.0,6.083333,N
1048,2015-01-07 15:00:00,,1,16.0,10.0,22.0,4.0,1014.4,0.0,0.0,0.0,0.416667,N
6058,2015-02-08 06:00:00,Manhattan,501,6.0,7.0,36.0,27.0,1015.1,0.0,0.0,0.0,8.0,N
13293,2015-03-25 09:00:00,,1,0.0,10.0,35.0,19.0,1029.1,0.0,0.0,0.213333,0.0,N



### Summary of the data

In [19]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
pickups,29101.0,490.215903,995.649536,0.0,1.0,54.0,449.0,7883.0
spd,29101.0,5.984924,3.699007,0.0,3.0,6.0,8.0,21.0
vsb,29101.0,8.818125,2.442897,0.0,9.1,10.0,10.0,10.0
temp,29101.0,47.669042,19.814969,2.0,32.0,46.0,64.5,89.0
dewp,29101.0,30.823065,21.283444,-16.0,14.0,30.0,50.0,73.0
slp,29101.0,1017.817938,7.768796,991.4,1012.5,1018.2,1022.9,1043.4
pcp01,29101.0,0.00383,0.018933,0.0,0.0,0.0,0.0,0.28
pcp06,29101.0,0.026129,0.093125,0.0,0.0,0.0,0.0,1.24
pcp24,29101.0,0.090464,0.219402,0.0,0.0,0.0,0.05,2.1
sd,29101.0,2.529169,4.520325,0.0,0.0,0.0,2.958333,19.0


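**Observations**

* pickups ranges from 0 to 7,883. The mean (about 490) is far above the median (54), so the distribution is strongly right-skewed, and 25% of the rows have at most 1 pickup.
* temp ranges from 2 to 89 degrees Fahrenheit, with a mean of about 47.7.
* vsb is at its maximum of 10 miles for the median and the 75th percentile, so visibility is usually good.
* pcp01 and pcp06 are 0 up to the 75th percentile, so most hours have no recent precipitation.
* sd has a median of 0 and a maximum of 19 inches.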
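
The overall summary pools all boroughs together, so part of the spread in `pickups` may simply reflect differences between boroughs. A minimal sketch of a per-borough summary, using the first ten rows printed by `data.head(20)` as sample data (the names `sample` and `summary` are hypothetical, and `dropna=False` assumes pandas 1.1 or later):

```python
import numpy as np
import pandas as pd

# First ten rows of the head() output: borough and pickups only.
sample = pd.DataFrame(
    {
        "borough": ["Bronx", "Brooklyn", "EWR", "Manhattan", "Queens",
                    "Staten Island", np.nan, "Bronx", "Brooklyn", "EWR"],
        "pickups": [152, 1519, 0, 5258, 405, 6, 4, 120, 1229, 0],
    }
)

# dropna=False keeps the rows whose borough is missing as their own group.
summary = sample.groupby("borough", dropna=False)["pickups"].agg(
    ["count", "mean", "median", "max"]
)
print(summary)
```

Run on the full dataframe, the same call would show whether the skew in `pickups` holds within each borough or mostly comes from mixing boroughs of very different size.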

In [21]:
df.describe(exclude='number').T

Unnamed: 0,count,unique,top,freq,first,last
pickup_dt,29101,4343,2015-01-01 01:00:00,7,2015-01-01 01:00:00,2015-06-30 23:00:00
borough,26058,6,Bronx,4343,NaT,NaT
hday,29101,2,N,27980,NaT,NaT
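
Because `pickup_dt` has hourly granularity (4,343 unique timestamps), one natural follow-up for the questions about what drives demand is to derive time features from it. This is a sketch under assumptions, not a step taken in the notebook so far: the column names `hour`, `day_name`, and `month` are hypothetical, and the four rows are copied from the `head()` and `tail()` outputs.

```python
import pandas as pd

# Four rows copied from the head() and tail() outputs above.
sample = pd.DataFrame(
    {
        "pickup_dt": pd.to_datetime(
            [
                "2015-01-01 01:00:00",
                "2015-01-01 02:00:00",
                "2015-06-30 21:00:00",
                "2015-06-30 22:00:00",
            ]
        ),
        "borough": ["Manhattan", "Bronx", "Manhattan", "Bronx"],
        "pickups": [5258, 120, 4313, 53],
    }
)

# The .dt accessor exposes the components of a datetime column.
sample["hour"] = sample["pickup_dt"].dt.hour
sample["day_name"] = sample["pickup_dt"].dt.day_name()
sample["month"] = sample["pickup_dt"].dt.month_name()

print(sample)
```

With these columns in place, pickups can be grouped by hour of day or day of week to look for recurring demand patterns.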


In [None]:
# In univariate analysis of numerical variables we want to study their central tendency
# and dispersion.
# The function below draws a boxplot and a histogram for any numerical column
# on a shared x-axis, so the same plotting code can be reused for every variable.
# It takes the dataframe and the column name as input and draws both plots
# on a new figure; it does not return anything.
 
 
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
 
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; showmeans=True adds a marker (a triangle by default) at the mean
    # histogram: pass bins only when the caller provided a value
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
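
The boxplot drawn by this function flags points beyond the whiskers, which by default extend 1.5 times the interquartile range past the quartiles. The same rule can be sketched numerically. The helper `iqr_bounds` below is hypothetical, and the data are the first ten `pickups` values from the `head()` output, not the full column.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# First ten pickups values from the head() output.
pickups = pd.Series([152, 1519, 0, 5258, 405, 6, 4, 120, 1229, 0])

lower, upper = iqr_bounds(pickups)
outliers = pickups[(pickups < lower) | (pickups > upper)]

print(f"lower fence: {lower}, upper fence: {upper}")
print(outliers.tolist())
```

In this small sample only the Manhattan value of 5258 falls outside the fences. For a right-skewed count such as `pickups`, many points above the upper fence are to be expected and do not by themselves indicate bad data.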