# <font color='blue'>Data Science Academy - Formação Cientista de Dados</font>
# <font color='blue'>Author: Evandro Eulálio Cleto</font>

## <font color='blue'>Start Date: 23/05/2023</font>
## <font color='blue'>End Date: /2023</font>


![title](imagens/Apresent_Proj.png)

## <font color='blue'>Project objective:</font>
### <font color='blue'>Answer 10 business questions through data analysis using SparkSQL, PandaSQL, SQLAlchemy, MySQL, and Docker</font>

# Project architecture

![title](imagens/Infro_Projeto.png)

This was a challenging project because it was developed on Linux Ubuntu 22.04, virtualized with Oracle VM VirtualBox on a Windows 11 host.

The project began by downloading a Microsoft Excel dataset from https://data.world/makeovermonday/2018w51, on Linux.

The dataset was loaded into a dataframe with the Pandas package, which was also used for the exploratory analysis.

The Pandas dataframe was written to a table in the MySQL DBMS, and the SQLAlchemy library was used to connect the DBMS to Python.

The data was extracted from MySQL with the PandaSQL library, using SQLAlchemy as the connector to Python.

MySQL runs in a Docker container. Because VirtualBox lacks KVM support, both Docker and the container were installed from the command line.

The Docker installation guide is here: https://github.com/EvandroCleto/Projeto03_V3_Analise_Risco_Transporte/blob/main/Guia_Instalacao_Docker_Linux.txt

The guide for installing the MySQL container is here: https://github.com/EvandroCleto/Projeto03_V3_Analise_Risco_Transporte/blob/main/Guia_Instalacao_MySQL_Docker.txt

The 10 business questions were answered with SparkSQL queries, whose results fed charts plotted with the Matplotlib package.







In [None]:
# This package records the versions of the other packages used in this Jupyter notebook.
#!pip install -q -U watermark

In [1]:
# Imports findspark and initializes it
# findspark -> provides findspark.init() to make pyspark importable as a regular library.
#!pip install findspark
import findspark
findspark.init()

In [None]:
#https://pypi.org/project/mysql-connector-python/
# Installs the MySQL connector
#!pip install mysql-connector-python

In [None]:
# https://www.sqlalchemy.org/
# sqlalchemy -> simplifies connecting to the DBMS
#!pip install -q sqlalchemy

In [None]:
# https://pypi.org/project/pandasql/
# pandasql -> runs SQL queries against pandas DataFrames
#!pip install -q pandasql

In [2]:
# Imports
import pandasql
import sqlalchemy
import pandas as pd
from pandasql import sqldf
from sqlalchemy import create_engine
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

In [3]:
# MySQL connection test
import mysql.connector
from mysql.connector import errorcode
try:
	db_connection = mysql.connector.connect(host='localhost', user='root', password='402676Ev', database='evandro')
	print("Database connection made!")
except mysql.connector.Error as error:
	if error.errno == errorcode.ER_BAD_DB_ERROR:
		print("Database doesn't exist")
	elif error.errno == errorcode.ER_ACCESS_DENIED_ERROR:
		print("User name or password is wrong")
	else:
		print(error)
else:
	db_connection.close()

Database connection made!


In [4]:
# Versions of the packages used in this Jupyter notebook
%reload_ext watermark
%watermark -a "Data Science Academy" --iversions

Author: Data Science Academy

findspark : 2.0.1
pandasql  : 0.7.3
sqlalchemy: 1.4.39
pandas    : 1.4.4



In [69]:
# Loading the London Bus Safety Performance dataset
df1 = pd.read_excel("dados/TFL_Bus_Safety.xlsx", sheet_name = 'Sheet1',index_col = False)

In [70]:
# Data dimensions
df1.shape

(23158, 12)

In [71]:
# Variable types
df1.dtypes

Year                                  int64
Date Of Incident             datetime64[ns]
Route                                object
Operator                             object
Group Name                           object
Bus Garage                           object
Borough                              object
Injury Result Description            object
Incident Event Type                  object
Victim Category                      object
Victims Sex                          object
Victims Age                          object
dtype: object

In [72]:
# Renames the columns that contain spaces in their names to avoid problems when running queries
df1.rename(columns={'Date Of Incident': 'Date_Incident', 'Group Name':'Group_Name', 'Bus Garage':'Bus_Garage', 
                    'Injury Result Description':'Injury_Description', 'Incident Event Type':'Incident_Type',
                    'Victim Category':'Victim_Category','Victims Sex':'Victims_Sex','Victims Age':'Victims_Age'}, 
           inplace = True)
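The manual mapping above could also be derived programmatically. The following is a minimal sketch, assuming the only goal is to remove spaces from the names. `normalize_columns` is a hypothetical helper, not part of the notebook, and it yields names such as `Date_Of_Incident` rather than the shortened names chosen by hand above.

```python
def normalize_columns(columns):
    """Strip surrounding whitespace and replace inner spaces with underscores."""
    return [str(col).strip().replace(" ", "_") for col in columns]


# A few of the original column names, used here only as sample input
original = ["Year", "Date Of Incident", "Group Name", "Victims Sex"]
renamed = normalize_columns(original)
print(renamed)
```

With Pandas, the result could be assigned back with `df1.columns = normalize_columns(df1.columns)`.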

In [216]:
# Displays the first 10 records
df1.head(10)

Unnamed: 0,Year,Date_Incident,Route,Operator,Group_Name,Bus_Garage,Borough,Injury_Description,Incident_Type,Victim_Category,Victims_Sex,Victims_Age
0,2015,2015-01-01,1,London General,Go-Ahead,Garage Not Available,Southwark,Injuries treated on scene,Onboard Injuries,Passenger,Male,Child
1,2015,2015-01-01,4,Metroline,Metroline,Garage Not Available,Islington,Injuries treated on scene,Onboard Injuries,Passenger,Male,Unknown
2,2015,2015-01-01,5,East London,Stagecoach,Garage Not Available,Havering,Taken to Hospital – Reported Serious Injury or...,Onboard Injuries,Passenger,Male,Elderly
3,2015,2015-01-01,5,East London,Stagecoach,Garage Not Available,None London Borough,Taken to Hospital – Reported Serious Injury or...,Onboard Injuries,Passenger,Male,Elderly
4,2015,2015-01-01,6,Metroline,Metroline,Garage Not Available,Westminster,Reported Minor Injury - Treated at Hospital,Onboard Injuries,Pedestrian,Female,Elderly
5,2015,2015-01-01,6,Metroline,Metroline,Garage Not Available,Westminster,Taken to Hospital – Reported Serious Injury or...,Onboard Injuries,Passenger,Female,Elderly
6,2015,2015-01-01,8,Selkent,Stagecoach,Garage Not Available,City of London,Injuries treated on scene,Onboard Injuries,Passenger,Male,Adult
7,2015,2015-01-01,9,London United,London United,Garage Not Available,Hammersmith & Fulham,Injuries treated on scene,Onboard Injuries,Conductor,Unknown,Unknown
8,2015,2015-01-01,10,London United,London United,Garage Not Available,Westminster,Injuries treated on scene,Onboard Injuries,Passenger,Female,Elderly
9,2015,2015-01-01,11,London General,Go-Ahead,Garage Not Available,City of London,Taken to Hospital – Reported Serious Injury or...,Onboard Injuries,Passenger,Female,Adult


## Connecting to the MySQL DBMS in Docker

In [76]:
db_connection = mysql.connector.connect(user='root', password='402676Ev',
                              host='localhost',
                              database='evandro')

In [77]:
# Opens a cursor -> a way to traverse objects in a database (navigate tables, metadata, etc.)
dbcursor = db_connection.cursor()

In [78]:
# Cursor object
dbcursor

<mysql.connector.cursor_cext.CMySQLCursor at 0x7f77f5e2e8e0>

In [79]:
# Enables autocommit
db_connection.autocommit = True

In [80]:
# Drops the database (if it exists)
dbcursor.execute('DROP DATABASE IF EXISTS tb_transporte')

In [81]:
# Creates the database in the DBMS
dbcursor.execute('CREATE DATABASE tb_transporte')

In [82]:
# Closes the cursor (the connection itself stays open)
dbcursor.close()

True

## Connecting to the new database created in the MySQL DBMS in Docker

In [83]:
db_connection = mysql.connector.connect(user='root', password='402676Ev',
                              host='localhost',
                              database='tb_transporte')

In [84]:
# Enables autocommit
db_connection.autocommit = True

## Creating the SQLAlchemy Engine to Connect to MySQL in Docker

In [85]:
# Creates the SQLAlchemy engine
engine = create_engine('mysql+mysqlconnector://root:402676Ev@localhost/tb_transporte')
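The connection cells in this notebook carry the password in plain text. As a hedged alternative, the sketch below assembles the same kind of SQLAlchemy URL from environment variables. The variable names (`MYSQL_USER`, `MYSQL_PASSWORD`, `MYSQL_HOST`, `MYSQL_DATABASE`) and the helper `build_mysql_url` are assumptions made for illustration, not something the project defines.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Assemble a SQLAlchemy URL for the mysql-connector driver from a mapping."""
    user = env.get("MYSQL_USER", "root")
    # Escape characters such as '@' or '/' that would otherwise break the URL
    password = quote_plus(env.get("MYSQL_PASSWORD", ""))
    host = env.get("MYSQL_HOST", "localhost")
    database = env.get("MYSQL_DATABASE", "tb_transporte")
    return f"mysql+mysqlconnector://{user}:{password}@{host}/{database}"


# Example with an explicit mapping instead of the real environment
example_url = build_mysql_url({"MYSQL_USER": "root", "MYSQL_PASSWORD": "p@ss/word"})
print(example_url)
```

Under these assumptions, `create_engine(build_mysql_url())` would replace the literal URL.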

In [86]:
# The to_sql() method saves the Pandas dataframe to the MySQL table
# If the table already exists, it is overwritten
df1.to_sql('tb_transporte', engine, if_exists= 'replace', index= False)

23158

## Loading MySQL Data into Pandas Dataframes with PandaSQL

In [87]:
# Metadata query to get the details of a table
pd.read_sql_query('''select ordinal_position, column_name, data_type 
                    from information_schema.columns 
                    where table_name = 'tb_transporte'
                    ''',
                 engine).head(12)

Unnamed: 0,ORDINAL_POSITION,COLUMN_NAME,DATA_TYPE
0,7,Borough,text
1,6,Bus_Garage,text
2,2,Date_Incident,datetime
3,5,Group_Name,text
4,9,Incident_Type,text
5,8,Injury_Description,text
6,4,Operator,text
7,3,Route,text
8,10,Victim_Category,text
9,12,Victims_Age,text


In [231]:
# Checks the number of rows in the table
pd.read_sql_query('select count(*) numero_linhas from tb_transporte', engine)

Unnamed: 0,numero_linhas
0,23158


## Answering the business questions

### Question 1: How many incidents occurred per gender?

In [266]:
perg1 = pd.read_sql_query('select IF(GROUPING(Victims_Sex), "Total", Victims_Sex) AS Victims_Sex, count(*) as Qtde from tb_transporte group by Victims_Sex with rollup', engine)


In [233]:
perg1

Unnamed: 0,Victims_Sex,Qtde
0,Female,11847
1,Male,7709
2,Unknown,3602
3,Total,23158
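In the query above, `WITH ROLLUP` appends a grand-total row to the grouped counts and `GROUPING()` is used to label it. The sketch below illustrates the same idea with Python's built-in `sqlite3` module and a handful of made-up rows. SQLite has no `WITH ROLLUP`, so the total row is appended with `UNION ALL` instead. The table name `incidents` and the helper column `is_total` are assumptions for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (Victims_Sex TEXT)")
conn.executemany(
    "INSERT INTO incidents VALUES (?)",
    [("Female",), ("Female",), ("Male",), ("Unknown",)],  # made-up rows
)

# Per-group counts plus one extra row holding the grand total
query = """
SELECT Victims_Sex, Qtde FROM (
    SELECT Victims_Sex, COUNT(*) AS Qtde, 0 AS is_total
    FROM incidents GROUP BY Victims_Sex
    UNION ALL
    SELECT 'Total', COUNT(*), 1 FROM incidents
)
ORDER BY is_total, Victims_Sex
"""
rollup_rows = conn.execute(query).fetchall()
conn.close()
print(rollup_rows)
```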


### Question 2: Which age group was most involved in the incidents?

In [260]:
perg2 = pd.read_sql_query('select IF(GROUPING(Victims_Age), "Total", Victims_Age) AS Victims_Age, count(*) as Qtde from tb_transporte group by Victims_Age with rollup order by count(*) desc', engine)

In [261]:
perg2

Unnamed: 0,Victims_Age,Qtde
0,Total,23158
1,Adult,10754
2,Unknown,7135
3,Elderly,2769
4,Child,2181
5,Youth,319


### Question 3: What is the percentage of incidents per event type (Incident Event Type)?

In [238]:
perg3 = pd.read_sql_query('Select Incident_Type, count(Incident_Type) as Qtde, round((count(Incident_Type) / (Select count(Incident_Type) as total from tb_transporte)) * 100,2) as Percentual from tb_transporte group by Incident_Type with rollup order by count(*) desc', engine)

In [239]:
perg3

Unnamed: 0,Incident_Type,Qtde,Percentual
0,,23158,100.0
1,Slip Trip Fall,6981,30.15
2,Onboard Injuries,6563,28.34
3,Personal Injury,4596,19.85
4,Collision Incident,4166,17.99
5,Assault,590,2.55
6,Activity Incident Event,114,0.49
7,Vandalism Hooliganism,73,0.32
8,Safety Critical Failure,66,0.28
9,Fire,6,0.03


### Question 4: How did incidents evolve by month over time?

In [240]:
perg4 = pd.read_sql_query('Select distinct(DATE_FORMAT(Date_Incident, "%M/%Y")) Ultima_Data, Incident_Type, count(CAST(Incident_Type AS UNSIGNED)) OVER (PARTITION BY Incident_Type ORDER BY DATE_FORMAT(Date_Incident, "%M/%Y")) as Evolucao_Acidentes  from tb_transporte ORDER BY 1,2', engine)

In [241]:
perg4

Unnamed: 0,Ultima_Data,Incident_Type,Evolucao_Acidentes
0,April/2015,Assault,10
1,April/2015,Collision Incident,94
2,April/2015,Onboard Injuries,362
3,April/2015,Safety Critical Failure,1
4,April/2015,Vandalism Hooliganism,3
...,...,...,...
255,September/2018,Collision Incident,4166
256,September/2018,Personal Injury,4596
257,September/2018,Safety Critical Failure,66
258,September/2018,Slip Trip Fall,6981
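The window in the query above is ordered by `DATE_FORMAT(Date_Incident, "%M/%Y")`, a string that starts with the month name, so the running count appears to accumulate in alphabetical order (April before January) rather than chronologically. The sketch below, using Python's built-in `sqlite3` module and made-up rows, groups by a `YYYY-MM` key, which sorts chronologically, and then accumulates the counts in Python. The table name `incidents` and the sample data are assumptions. In MySQL, the equivalent key would be `DATE_FORMAT(Date_Incident, '%Y-%m')`.

```python
import sqlite3
from itertools import accumulate

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (Date_Incident TEXT, Incident_Type TEXT)")
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?)",
    [
        ("2015-01-01", "Assault"),
        ("2015-04-01", "Assault"),
        ("2015-04-01", "Assault"),
        ("2016-01-01", "Assault"),
    ],  # made-up rows
)

# Monthly counts for one incident type, ordered by a chronologically sortable key
monthly = conn.execute(
    """
    SELECT strftime('%Y-%m', Date_Incident) AS ym, COUNT(*) AS qtde
    FROM incidents
    WHERE Incident_Type = 'Assault'
    GROUP BY ym
    ORDER BY ym
    """
).fetchall()
conn.close()

months = [ym for ym, _ in monthly]
running_total = list(accumulate(qtde for _, qtde in monthly))
print(list(zip(months, running_total)))
```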


### Question 5: For “Collision Incident” events, which month had the highest number of incidents involving female victims?

In [242]:
perg5 = pd.read_sql_query('Select DATE_FORMAT(Date_Incident, "%M/%Y") Mes_Ano, Incident_Type, Victims_Sex, count(Incident_Type) as Qtde from tb_transporte where Incident_Type = "Collision Incident" and Victims_Sex = "Female" group by Date_Incident, Incident_Type, Victims_Sex order by count(*) desc limit 5', engine)

In [243]:
perg5

Unnamed: 0,Mes_Ano,Incident_Type,Victims_Sex,Qtde
0,November/2016,Collision Incident,Female,63
1,September/2016,Collision Incident,Female,56
2,August/2017,Collision Incident,Female,52
3,July/2017,Collision Incident,Female,49
4,June/2016,Collision Incident,Female,47


### Question 6: What was the average number of incidents per month involving children (Child)?

In [244]:
perg6 = pd.read_sql_query('Select DATE_FORMAT(Date_Incident, "%M/%Y") Mes_Ano, Victims_Age, count(Incident_Type) as Qtde from tb_transporte where Victims_Age = "Child" group by Date_Incident, Victims_Age order by count(*) desc limit 15', engine)

In [245]:
perg6

Unnamed: 0,Mes_Ano,Victims_Age,Qtde
0,October/2017,Child,70
1,July/2018,Child,70
2,April/2017,Child,70
3,June/2018,Child,70
4,June/2017,Child,69
5,August/2017,Child,68
6,July/2017,Child,68
7,September/2018,Child,67
8,March/2018,Child,65
9,June/2016,Child,64
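The query above lists the months with the most incidents involving children, while the question asks for a monthly average. One way to obtain it is to average the monthly counts in an outer query. The sketch below shows the idea with Python's built-in `sqlite3` module and made-up rows. The table name `incidents` and the data are assumptions, and months with no incidents involving children are left out of the average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (Date_Incident TEXT, Victims_Age TEXT)")
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?)",
    [
        ("2015-01-01", "Child"),
        ("2015-01-01", "Child"),
        ("2015-01-01", "Adult"),
        ("2015-02-01", "Child"),
        ("2015-02-01", "Child"),
        ("2015-02-01", "Child"),
        ("2015-02-01", "Child"),
    ],  # made-up rows
)

# Inner query: incidents per month; outer query: mean of those monthly counts
avg_per_month = conn.execute(
    """
    SELECT AVG(qtde) FROM (
        SELECT COUNT(*) AS qtde
        FROM incidents
        WHERE Victims_Age = 'Child'
        GROUP BY strftime('%Y-%m', Date_Incident)
    )
    """
).fetchone()[0]
conn.close()
print(avg_per_month)
```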


### Question 7: For incidents described as “Injuries treated on scene” (column Injury Result Description), what is the total number of incidents involving male and female victims?

In [246]:
perg7 = pd.read_sql_query('Select Injury_Description, Victims_Sex, count(Incident_Type) as Qtde from tb_transporte where Injury_Description = "Injuries treated on scene" group by Injury_Description, Victims_Sex order by count(*) desc', engine)

In [247]:
perg7

Unnamed: 0,Injury_Description,Victims_Sex,Qtde
0,Injuries treated on scene,Female,8816
1,Injuries treated on scene,Male,5632
2,Injuries treated on scene,Unknown,2888


### Question 8: In 2017, which month had the most incidents involving the elderly (Elderly)?

In [248]:
perg8 = pd.read_sql_query('Select DATE_FORMAT(Date_Incident, "%M/%Y") Mes_Ano, Victims_Age, count(Incident_Type) as Qtde from tb_transporte where Victims_Age = "Elderly" and YEAR(Date_Incident) = "2017" group by Date_Incident, Victims_Age ORDER BY count(Incident_Type) desc', engine)

In [249]:
perg8

Unnamed: 0,Mes_Ano,Victims_Age,Qtde
0,July/2017,Elderly,81
1,September/2017,Elderly,78
2,March/2017,Elderly,77
3,April/2017,Elderly,75
4,August/2017,Elderly,70
5,May/2017,Elderly,69
6,October/2017,Elderly,69
7,November/2017,Elderly,68
8,December/2017,Elderly,67
9,January/2017,Elderly,66


### Question 9: By operator, how are incidents distributed over time?

In [250]:
perg9 = pd.read_sql_query('Select distinct(YEAR(Date_Incident)) Ano, Operator,count(CAST(operator AS UNSIGNED)) OVER (PARTITION BY Operator ORDER BY year(Date_Incident)) as Evolucao_Acidentes  from tb_transporte ORDER BY 1,2; ', engine)

In [251]:
perg9

Unnamed: 0,Ano,Operator,Evolucao_Acidentes
0,2015,Abellio London,117
1,2015,Abellio West,27
2,2015,Arriva Kent Thameside,73
3,2015,Arriva London North,789
4,2015,Arriva London South,482
...,...,...,...
78,2018,Metroline West,1232
79,2018,Quality Line,142
80,2018,Selkent,1808
81,2018,Sullivan Bus & Coach,1


### Question 10: What is the most common incident type involving cyclists?

In [262]:
perg10 = pd.read_sql_query('Select Victim_Category Categoria, Incident_Type, count(*) Qtde from tb_transporte where Victim_Category = "Cyclist" group by Victim_Category, Incident_Type ORDER BY 1,2; ', engine)


In [253]:
perg10

Unnamed: 0,Categoria,Incident_Type,Qtde
0,Cyclist,Collision Incident,256
1,Cyclist,Onboard Injuries,4
2,Cyclist,Personal Injury,8
3,Cyclist,Slip Trip Fall,7
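The result above is sorted alphabetically by incident type, so the most common type still has to be picked out by its count. A minimal sketch, using only the rows shown in the table above, selects the row with the largest `Qtde` and its share of the listed cyclist incidents. The variable names are illustrative.

```python
# Rows copied from the result table above: (Incident_Type, Qtde)
cyclist_rows = [
    ("Collision Incident", 256),
    ("Onboard Injuries", 4),
    ("Personal Injury", 8),
    ("Slip Trip Fall", 7),
]

# Row with the highest count, and its share of all listed cyclist incidents
most_common = max(cyclist_rows, key=lambda row: row[1])
share = most_common[1] / sum(qtde for _, qtde in cyclist_rows)
print(most_common[0], f"{share:.1%}")
```

In SQL, the same answer could be obtained by ordering the query by `Qtde` in descending order and keeping the first row.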
