# Project I - Pandas

This Jupyter Notebook runs tests on datasets using the **pandas** library. By experimenting independently, the goal is to analyze the dataset used in Project 1 and become familiar with the library.

In [2]:
# Importing libraries
import pandas as pd

In [73]:
path = 'C:/Users/thiagoPanini/Downloads/chicago.csv'
df_bike = pd.read_csv(path)
df_bike.head()

Unnamed: 0,Start Time,End Time,Trip Duration,Start Station,End Station,User Type,Gender,Birth Year
0,2017-01-01 00:00:36,2017-01-01 00:06:32,356,Canal St & Taylor St,Canal St & Monroe St (*),Customer,,
1,2017-01-01 00:02:54,2017-01-01 00:08:21,327,Larrabee St & Menomonee St,Sheffield Ave & Kingsbury St,Subscriber,Male,1984.0
2,2017-01-01 00:06:06,2017-01-01 00:18:31,745,Orleans St & Chestnut St (NEXT Apts),Ashland Ave & Blackhawk St,Subscriber,Male,1985.0
3,2017-01-01 00:07:28,2017-01-01 00:12:51,323,Franklin St & Monroe St,Clinton St & Tilden St,Subscriber,Male,1990.0
4,2017-01-01 00:07:57,2017-01-01 00:20:53,776,Broadway & Barry Ave,Sedgwick St & North Ave,Subscriber,Male,1990.0


In [3]:
# Checking the number of rows and columns of the DataFrame
df_bike.shape

(1551505, 8)

In [4]:
# Checking the data type of each column
df_bike.dtypes

Start Time        object
End Time          object
Trip Duration      int64
Start Station     object
End Station       object
User Type         object
Gender            object
Birth Year       float64
dtype: object

In [24]:
# Columns holding str values are often reported as object. Let's inspect the first row of the Start Time column

df_bike['Start Time'][0]

'2017-01-01 00:00:36'

In [18]:
# Checking the Python type of the first value in each column reported as "object"

print(f"Type of column 'Start Time': {type(df_bike['Start Time'][0])}")
print(f"Type of column 'End Time': {type(df_bike['End Time'][0])}")
print(f"Type of column 'Start Station': {type(df_bike['Start Station'][0])}")
print(f"Type of column 'End Station': {type(df_bike['End Station'][0])}")
print(f"Type of column 'User Type': {type(df_bike['User Type'][0])}")
print(f"Type of column 'Gender': {type(df_bike['Gender'][0])}")

Type of column 'Start Time': <class 'str'>
Type of column 'End Time': <class 'str'>
Type of column 'Start Station': <class 'str'>
Type of column 'End Station': <class 'str'>
Type of column 'User Type': <class 'str'>
Type of column 'Gender': <class 'float'>
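
The `float` reported for 'Gender' is most likely explained by the first row holding a missing value: pandas represents it as `NaN`, which is a Python float. The sketch below reproduces this on a small, made-up Series (`gender` is a hypothetical stand-in, not the real column) and checks the first non-null value instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_bike['Gender']; the first entry is missing
gender = pd.Series([np.nan, "Male", "Male"], dtype=object)

# The first element is NaN, which is a float
first_value = gender[0]
print(type(first_value), pd.isna(first_value))

# Looking at the first non-null element shows the actual stored type
first_valid = gender.dropna().iloc[0]
print(type(first_valid))
```

Inspecting only row 0 can therefore be misleading for columns that contain nulls.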


In [9]:
# Checking the type of the dataset read with pandas
type(df_bike)

pandas.core.frame.DataFrame

In [25]:
# Checking the summary information available for the dataset
df_bike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1551505 entries, 0 to 1551504
Data columns (total 8 columns):
Start Time       1551505 non-null object
End Time         1551505 non-null object
Trip Duration    1551505 non-null int64
Start Station    1551505 non-null object
End Station      1551505 non-null object
User Type        1551505 non-null object
Gender           1234638 non-null object
Birth Year       1234822 non-null float64
dtypes: float64(1), int64(1), object(6)
memory usage: 94.7+ MB


The info() output above shows that some columns contain null values, namely 'Gender' and 'Birth Year': their non-null counts (1,234,638 and 1,234,822) are lower than the total number of rows (1,551,505), so the remaining entries are missing.
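
As a minimal sketch, the missing entries can also be counted directly. The frame `df_demo` below is a hypothetical toy stand-in for `df_bike`, with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_bike (values are made up)
df_demo = pd.DataFrame({
    "Trip Duration": [356, 327, 745],
    "Gender": [np.nan, "Male", "Female"],
    "Birth Year": [np.nan, 1984.0, np.nan],
})

# Number of missing values per column
null_counts = df_demo.isnull().sum()
print(null_counts)

# Same figure derived the way info() presents it:
# total number of rows minus the non-null count of each column
missing_from_count = len(df_demo) - df_demo.count()
print(missing_from_count)
```

Applied to `df_bike`, the same two expressions should agree with the gap between the total row count and the non-null counts listed by info().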

In [26]:
# Checking the number of unique entries in each column
df_bike.nunique()

Start Time       1374908
End Time         1338926
Trip Duration      12096
Start Station        582
End Station          581
User Type              3
Gender                 2
Birth Year            82
dtype: int64

The nunique() method returns the number of distinct values in each column. By default it ignores null entries: 'Gender' is reported with 2 distinct values, but there are 3 if the null entry is counted.
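
A small sketch of this behavior, using a made-up Series (`gender` is hypothetical): the `dropna` parameter of `nunique()` controls whether nulls count as a distinct value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'Gender' column
gender = pd.Series(["Male", "Female", np.nan, "Male", np.nan])

# Default behavior: NaN is not counted as a distinct value
distinct_without_nan = gender.nunique()
print(distinct_without_nan)  # 2

# dropna=False counts NaN as one more distinct value
distinct_with_nan = gender.nunique(dropna=False)
print(distinct_with_nan)  # 3

# value_counts(dropna=False) shows how many rows fall into each value, NaN included
counts = gender.value_counts(dropna=False)
print(counts)
```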

In [27]:
# Showing descriptive statistics for each column of the dataset
df_bike.describe()

Unnamed: 0,Trip Duration,Birth Year
count,1551505.0,1234822.0
mean,939.7778,1980.864
std,1617.702,10.99154
min,60.0,1899.0
25%,392.0,1975.0
50%,670.0,1984.0
75%,1127.0,1989.0
max,86338.0,2016.0


Note that, by default, describe() only summarizes columns with a numeric dtype.
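
If a summary of the text columns is also wanted, `describe()` accepts an `include` argument. The sketch below uses a hypothetical two-column frame (`df_demo`) rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one text column
df_demo = pd.DataFrame({
    "Trip Duration": [356, 327, 745],
    "User Type": ["Customer", "Subscriber", "Subscriber"],
})

# Default: only the numeric column is summarized
numeric_summary = df_demo.describe()
print(numeric_summary)

# include="all" also reports count, unique, top and freq for non-numeric columns
full_summary = df_demo.describe(include="all")
print(full_summary)
```

Statistics that do not apply to a column (for example the mean of 'User Type') appear as NaN in the combined table.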

In [34]:
# Listing all column names of the dataset (returned as an Index object)
df_bike.columns

Index(['Start Time', 'End Time', 'Trip Duration', 'Start Station',
       'End Station', 'User Type', 'Gender', 'Birth Year'],
      dtype='object')

In [35]:
for column in df_bike.columns:
    print(column)

Start Time
End Time
Trip Duration
Start Station
End Station
User Type
Gender
Birth Year


In [37]:
for i, v in enumerate(df_bike.columns):
    print(i, v)

0 Start Time
1 End Time
2 Trip Duration
3 Start Station
4 End Station
5 User Type
6 Gender
7 Birth Year


## Selecting and Indexing Datasets with .loc and .iloc
For a better understanding of the many possibilities these two indexers offer, reading the following link is recommended:

[Selecting-and-indexing-with-loc-iloc](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)

![Pandas-selections-and-indexing.png](attachment:Pandas-selections-and-indexing.png)

## .loc
Indexing by the label (name) of the Dataset's rows and columns.
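
A minimal sketch of label-based selection on a hypothetical toy frame (`df_demo`, with made-up labels). One detail worth noticing is that label slices in `.loc` include both endpoints.

```python
import pandas as pd

# Hypothetical toy frame with string labels on both axes
df_demo = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

# Label slice on columns: both "a" and "b" are included
cols_slice = df_demo.loc[:, "a":"b"]

# Label slice on rows combined with a list of column labels
rows_slice = df_demo.loc["x":"y", ["c"]]

# Boolean mask on rows combined with a single column label
mask_rows = df_demo.loc[df_demo["a"] > 1, "b"]

print(cols_slice)
print(rows_slice)
print(mask_rows)
```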

In [43]:
# Checking the Dataset
df_bike.head()

Unnamed: 0,Start Time,End Time,Trip Duration,Start Station,End Station,Gender,Birth Year
0,2017-01-01 00:00:36,2017-01-01 00:06:32,356,Canal St & Taylor St,Canal St & Monroe St (*),,
1,2017-01-01 00:02:54,2017-01-01 00:08:21,327,Larrabee St & Menomonee St,Sheffield Ave & Kingsbury St,Male,1984.0
2,2017-01-01 00:06:06,2017-01-01 00:18:31,745,Orleans St & Chestnut St (NEXT Apts),Ashland Ave & Blackhawk St,Male,1985.0
3,2017-01-01 00:07:28,2017-01-01 00:12:51,323,Franklin St & Monroe St,Clinton St & Tilden St,Male,1990.0
4,2017-01-01 00:07:57,2017-01-01 00:20:53,776,Broadway & Barry Ave,Sedgwick St & North Ave,Male,1990.0


In [47]:
for i, v in enumerate(df_bike.columns):
    print(i, v)

0 Start Time
1 End Time
2 Trip Duration
3 Start Station
4 End Station
5 Gender
6 Birth Year


In [39]:
# Selecting all data between Trip Duration and User Type
df_teste = df_bike.loc[:,'Trip Duration':'User Type']
df_teste.head()

Unnamed: 0,Trip Duration,Start Station,End Station,User Type
0,356,Canal St & Taylor St,Canal St & Monroe St (*),Customer
1,327,Larrabee St & Menomonee St,Sheffield Ave & Kingsbury St,Subscriber
2,745,Orleans St & Chestnut St (NEXT Apts),Ashland Ave & Blackhawk St,Subscriber
3,323,Franklin St & Monroe St,Clinton St & Tilden St,Subscriber
4,776,Broadway & Barry Ave,Sedgwick St & North Ave,Subscriber


In [4]:
# Selecting only Start Station and Start Time
df_teste = df_bike.loc[:, ['Start Station', 'Start Time']]
df_teste.head()

Unnamed: 0,Start Station,Start Time
0,Canal St & Taylor St,2017-01-01 00:00:36
1,Larrabee St & Menomonee St,2017-01-01 00:02:54
2,Orleans St & Chestnut St (NEXT Apts),2017-01-01 00:06:06
3,Franklin St & Monroe St,2017-01-01 00:07:28
4,Broadway & Barry Ave,2017-01-01 00:07:57


In [8]:
# Selecting the columns Start Time, End Time, User Type and Gender
df_teste = df_bike.loc[:, ['Start Time', 'End Time','User Type', 'Gender']]
df_teste.head()

Unnamed: 0,Start Time,End Time,User Type,Gender
0,2017-01-01 00:00:36,2017-01-01 00:06:32,Customer,
1,2017-01-01 00:02:54,2017-01-01 00:08:21,Subscriber,Male
2,2017-01-01 00:06:06,2017-01-01 00:18:31,Subscriber,Male
3,2017-01-01 00:07:28,2017-01-01 00:12:51,Subscriber,Male
4,2017-01-01 00:07:57,2017-01-01 00:20:53,Subscriber,Male


In [9]:
# Selecting columns from Start Time to Gender only where Gender is Female
df_teste = df_bike.loc[df_bike['Gender'] == 'Female', 'Start Time': 'Gender']
df_teste.head()

Unnamed: 0,Start Time,End Time,Trip Duration,Start Station,End Station,User Type,Gender
10,2017-01-01 00:17:13,2017-01-01 11:03:34,38781,Wilton Ave & Diversey Pkwy,Halsted St & Wrightwood Ave,Subscriber,Female
14,2017-01-01 00:25:47,2017-01-01 00:39:53,846,Ravenswood Ave & Lawrence Ave,Clarendon Ave & Gordon Ter,Subscriber,Female
18,2017-01-01 00:27:28,2017-01-01 00:42:44,916,Millennium Park,Michigan Ave & 18th St,Subscriber,Female
20,2017-01-01 00:27:52,2017-01-01 00:33:46,354,Paulina Ave & North Ave,Damen Ave & Division St,Subscriber,Female
22,2017-01-01 00:30:10,2017-01-01 01:10:34,2424,Lake Shore Dr & Ohio St,Broadway & Barry Ave,Subscriber,Female


In [11]:
# Selecting all columns where Trip Duration is greater than 1000
df_teste = df_bike.loc[df_bike['Trip Duration'] > 1000, :]
df_teste.head()

Unnamed: 0,Start Time,End Time,Trip Duration,Start Station,End Station,User Type,Gender,Birth Year
10,2017-01-01 00:17:13,2017-01-01 11:03:34,38781,Wilton Ave & Diversey Pkwy,Halsted St & Wrightwood Ave,Subscriber,Female,1988.0
15,2017-01-01 00:25:47,2017-01-01 00:43:23,1056,Clark St & Congress Pkwy,Wolcott Ave & Polk St,Subscriber,Male,1984.0
22,2017-01-01 00:30:10,2017-01-01 01:10:34,2424,Lake Shore Dr & Ohio St,Broadway & Barry Ave,Subscriber,Female,1985.0
26,2017-01-01 00:35:23,2017-01-01 01:05:22,1799,McClurg Ct & Illinois St,Fairbanks Ct & Grand Ave,Customer,,
29,2017-01-01 00:35:34,2017-01-01 01:05:26,1792,McClurg Ct & Illinois St,Fairbanks Ct & Grand Ave,Customer,,


In [14]:
# Selecting all columns where Start Station ends with 'Illinois St'
df_teste = df_bike.loc[df_bike['Start Station'].str.endswith('Illinois St'), :]
df_teste.head()

Unnamed: 0,Start Time,End Time,Trip Duration,Start Station,End Station,User Type,Gender,Birth Year
23,2017-01-01 00:32:58,2017-01-01 00:41:59,541,LaSalle St & Illinois St,State St & Kinzie St,Customer,,
26,2017-01-01 00:35:23,2017-01-01 01:05:22,1799,McClurg Ct & Illinois St,Fairbanks Ct & Grand Ave,Customer,,
29,2017-01-01 00:35:34,2017-01-01 01:05:26,1792,McClurg Ct & Illinois St,Fairbanks Ct & Grand Ave,Customer,,
31,2017-01-01 00:37:19,2017-01-01 01:05:22,1683,McClurg Ct & Illinois St,Fairbanks Ct & Grand Ave,Customer,,
45,2017-01-01 00:56:09,2017-01-01 01:10:51,882,McClurg Ct & Illinois St,Millennium Park,Customer,,


In [18]:
# Selecting rows from February: characters 5 and 6 of the Start Time string hold the month
df_teste = df_bike.loc[df_bike['Start Time'].apply(lambda x: x[5:7] == '02')]
df_teste.head()

Unnamed: 0,Start Time,End Time,Trip Duration,Start Station,End Station,User Type,Gender,Birth Year
111942,2017-02-01 00:04:02,2017-02-01 00:13:27,565,Columbus Dr & Randolph St,Clark St & Chicago Ave,Subscriber,Female,1991.0
111943,2017-02-01 00:04:38,2017-02-01 00:10:18,340,Morgan St & Lake St,Milwaukee Ave & Grand Ave,Subscriber,Male,1981.0
111944,2017-02-01 00:07:11,2017-02-01 00:11:49,278,Clinton St & Madison St,Clinton St & Tilden St,Subscriber,Male,1977.0
111945,2017-02-01 00:08:44,2017-02-01 00:13:36,292,University Ave & 57th St,Ellis Ave & 60th St,Subscriber,Male,1976.0
111946,2017-02-01 00:15:29,2017-02-01 00:25:07,578,Hermitage Ave & Polk St,Peoria St & Jackson Blvd,Subscriber,Female,1984.0


In [26]:
# Storing the February mask in a variable and reusing it together with a column slice
test = df_bike['Start Time'].apply(lambda x: x[5:7] == '02')
df_teste2 = df_bike.loc[test, 'Start Time':'Trip Duration']
df_teste2.head()

Unnamed: 0,Start Time,End Time,Trip Duration
111942,2017-02-01 00:04:02,2017-02-01 00:13:27,565
111943,2017-02-01 00:04:38,2017-02-01 00:10:18,340
111944,2017-02-01 00:07:11,2017-02-01 00:11:49,278
111945,2017-02-01 00:08:44,2017-02-01 00:13:36,292
111946,2017-02-01 00:15:29,2017-02-01 00:25:07,578
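
The February mask above is built with `apply` and a lambda. As a hedged alternative sketch (not the approach used in this notebook), the same mask can be written with the vectorized `.str` accessor or by parsing the strings with `pd.to_datetime`. The frame `df_demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with timestamps stored as strings, as in df_bike
df_demo = pd.DataFrame({
    "Start Time": [
        "2017-01-31 23:59:00",
        "2017-02-01 00:04:02",
        "2017-02-01 00:04:38",
    ],
    "Trip Duration": [100, 565, 340],
})

# Approach used in the notebook: element-wise lambda
mask_apply = df_demo["Start Time"].apply(lambda x: x[5:7] == "02")

# Vectorized string slicing, no lambda needed
mask_str = df_demo["Start Time"].str[5:7] == "02"

# Parsing to datetime and comparing the month number
mask_dt = pd.to_datetime(df_demo["Start Time"]).dt.month == 2

february = df_demo.loc[mask_dt, "Start Time":"Trip Duration"]
print(february)
```

All three masks select the same rows here; the datetime version does not depend on the exact character positions of the month.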


In [28]:
# Setting User Type as the index of the Dataset (modifies df_bike in place)
df_bike.set_index('User Type', inplace=True)
df_bike.head()

User Type,Start Time,End Time,Trip Duration,Start Station,End Station,Gender,Birth Year
Customer,2017-01-01 00:00:36,2017-01-01 00:06:32,356,Canal St & Taylor St,Canal St & Monroe St (*),,
Subscriber,2017-01-01 00:02:54,2017-01-01 00:08:21,327,Larrabee St & Menomonee St,Sheffield Ave & Kingsbury St,Male,1984.0
Subscriber,2017-01-01 00:06:06,2017-01-01 00:18:31,745,Orleans St & Chestnut St (NEXT Apts),Ashland Ave & Blackhawk St,Male,1985.0
Subscriber,2017-01-01 00:07:28,2017-01-01 00:12:51,323,Franklin St & Monroe St,Clinton St & Tilden St,Male,1990.0
Subscriber,2017-01-01 00:07:57,2017-01-01 00:20:53,776,Broadway & Barry Ave,Sedgwick St & North Ave,Male,1990.0


In [29]:
# Selecting all rows whose index label is 'Customer'
df_bike.loc['Customer']

User Type,Start Time,End Time,Trip Duration,Start Station,End Station,Gender,Birth Year
Customer,2017-01-01 00:00:36,2017-01-01 00:06:32,356,Canal St & Taylor St,Canal St & Monroe St (*),,
Customer,2017-01-01 00:14:57,2017-01-01 00:26:22,685,Daley Center Plaza,Canal St & Monroe St (*),,
Customer,2017-01-01 00:15:03,2017-01-01 00:26:28,685,Daley Center Plaza,Canal St & Monroe St (*),,
Customer,2017-01-01 00:17:01,2017-01-01 00:29:49,768,Dayton St & North Ave,Ogden Ave & Chicago Ave,,
Customer,2017-01-01 00:18:28,2017-01-01 00:31:05,757,Canal St & Madison St,LaSalle St & Illinois St,,
Customer,2017-01-01 00:30:07,2017-01-01 00:34:32,265,Wilton Ave & Diversey Pkwy,Halsted St & Wrightwood Ave,,
Customer,2017-01-01 00:32:58,2017-01-01 00:41:59,541,LaSalle St & Illinois St,State St & Kinzie St,,
Customer,2017-01-01 00:35:23,2017-01-01 01:05:22,1799,McClurg Ct & Illinois St,Fairbanks Ct & Grand Ave,,
Customer,2017-01-01 00:35:34,2017-01-01 01:05:26,1792,McClurg Ct & Illinois St,Fairbanks Ct & Grand Ave,,
Customer,2017-01-01 00:37:19,2017-01-01 01:05:22,1683,McClurg Ct & Illinois St,Fairbanks Ct & Grand Ave,,


In [42]:
# Resetting the index

df_bike = df_bike.reset_index(drop=True)
df_bike.head()

Unnamed: 0,Start Time,End Time,Trip Duration,Start Station,End Station,Gender,Birth Year
0,2017-01-01 00:00:36,2017-01-01 00:06:32,356,Canal St & Taylor St,Canal St & Monroe St (*),,
1,2017-01-01 00:02:54,2017-01-01 00:08:21,327,Larrabee St & Menomonee St,Sheffield Ave & Kingsbury St,Male,1984.0
2,2017-01-01 00:06:06,2017-01-01 00:18:31,745,Orleans St & Chestnut St (NEXT Apts),Ashland Ave & Blackhawk St,Male,1985.0
3,2017-01-01 00:07:28,2017-01-01 00:12:51,323,Franklin St & Monroe St,Clinton St & Tilden St,Male,1990.0
4,2017-01-01 00:07:57,2017-01-01 00:20:53,776,Broadway & Barry Ave,Sedgwick St & North Ave,Male,1990.0
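
Note that the output above no longer has a 'User Type' column: `reset_index(drop=True)` discards the old index instead of turning it back into a column. A minimal sketch of the difference, on a hypothetical toy frame (`df_demo`):

```python
import pandas as pd

# Hypothetical toy frame standing in for df_bike
df_demo = pd.DataFrame({
    "User Type": ["Customer", "Subscriber"],
    "Trip Duration": [356, 327],
})

indexed = df_demo.set_index("User Type")

# Default: the old index comes back as a regular column
restored = indexed.reset_index()
print(restored.columns.tolist())

# drop=True: the old index is thrown away
dropped = indexed.reset_index(drop=True)
print(dropped.columns.tolist())
```

If the column is still needed later, calling `reset_index()` without `drop=True` is the safer choice.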


## .iloc
Indexing by the integer position of the Dataset's rows and columns.
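
A minimal sketch of position-based selection on a hypothetical toy frame (`df_demo`). Unlike label slices in `.loc`, position slices in `.iloc` follow Python conventions and exclude the end position.

```python
import pandas as pd

# Hypothetical toy frame; the string index shows that .iloc ignores labels
df_demo = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

# Positions 0 and 1 only: the end position 2 is excluded
first_two_cols = df_demo.iloc[:, :2]
first_two_rows = df_demo.iloc[:2, :]

# A single row position and column position returns a scalar
single_value = df_demo.iloc[0, 2]

# Negative positions count from the end, as in Python lists
last_col = df_demo.iloc[:, -1]

print(first_two_cols)
print(single_value)
```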

In [48]:
for i, v in enumerate(df_bike.columns):
    print(i, v)

0 Start Time
1 End Time
2 Trip Duration
3 Start Station
4 End Station
5 Gender
6 Birth Year


In [50]:
# Selects the first two columns of the Dataset
df_teste = df_bike.iloc[:, :2]
df_teste.head()

Unnamed: 0,Start Time,End Time
0,2017-01-01 00:00:36,2017-01-01 00:06:32
1,2017-01-01 00:02:54,2017-01-01 00:08:21
2,2017-01-01 00:06:06,2017-01-01 00:18:31
3,2017-01-01 00:07:28,2017-01-01 00:12:51
4,2017-01-01 00:07:57,2017-01-01 00:20:53


In [51]:
# Selects only the first two columns of the first three rows
df_teste = df_bike.iloc[:3, :2]
df_teste.head()

Unnamed: 0,Start Time,End Time
0,2017-01-01 00:00:36,2017-01-01 00:06:32
1,2017-01-01 00:02:54,2017-01-01 00:08:21
2,2017-01-01 00:06:06,2017-01-01 00:18:31


In [72]:
# Selects the column at position 5 (Gender); a single position returns a Series
df_teste = df_bike.iloc[:, 5]
df_teste.head()

0     NaN
1    Male
2    Male
3    Male
4    Male
Name: Gender, dtype: object