# Partido para Quê?

As a Brazilian, I have many doubts about whether Brazil needs so many political parties. There are currently 35 parties, and an [Oxford study](https://www.bbc.com/portuguese/brasil-43288018) has shown that they could be reduced to only 2. The objective of this project is therefore to answer the title question, "Partido para quê?" (which means **"Parties for what?"** in Portuguese), through an analysis of voting patterns.


**Clustering techniques** will be used to group Brazilian senators based on their voting history over the past 3 years.

In [1]:
# Install Requirements
!pip install pandas sqlalchemy





# Data

## Description

The data is available in the file `database.db`. This database contains 2 tables:
- **Senadores**: Table with the Senators info (id, code, name, sex, state, party).
- **Votos**: Table with the voting info (id, senator, session, vote)
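
The two tables described above can also be inspected programmatically. The following is a minimal sketch, assuming SQLAlchemy's `inspect` helper. `describe_tables` is a hypothetical helper name, and the in-memory database (with assumed column types) only stands in for `database.db`.

```python
from sqlalchemy import create_engine, inspect, text


def describe_tables(engine):
    """Return {table_name: [column names]} for every table in the database."""
    inspector = inspect(engine)
    return {
        table: [col["name"] for col in inspector.get_columns(table)]
        for table in inspector.get_table_names()
    }


# Toy in-memory database standing in for database.db.
# The column types below are assumptions made for illustration.
demo_engine = create_engine("sqlite://")
with demo_engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE Senadores ("
        "SenadorID INTEGER PRIMARY KEY, SenadorCod INTEGER, NomeCompleto TEXT, "
        "Sexo TEXT, Estado TEXT, PartidoSigla TEXT)"
    ))
    conn.execute(text(
        "CREATE TABLE Votos ("
        "VotoID INTEGER PRIMARY KEY, SenadorID INTEGER, SessaoID INTEGER, Voto TEXT)"
    ))

schema = describe_tables(demo_engine)
print(schema)
```

On the real data, calling `describe_tables(create_engine('sqlite:///database.db'))` would list the actual columns of each table.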

All the data was collected through the [Senate API](http://legis.senado.leg.br/dadosabertos/docs/ui/index.html). The script that collects the data is available in this repository. TODO: REPOLINK.

## Visualization

Let's start by loading the data from the database into two **Pandas** `DataFrame` objects.

In [19]:
# Imports
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Create DB Engine and Session
engine = create_engine('sqlite:///database.db')
DBSession = sessionmaker(bind=engine)
session = DBSession()

# All fields from the Senadores table
query_senators = 'SELECT * FROM Senadores'
senators = pd.read_sql(query_senators, session.bind)

# Show first 5 rows
senators.head()


Unnamed: 0,SenadorID,SenadorCod,NomeCompleto,Sexo,Estado,PartidoSigla
0,1,4981,Acir Marcos Gurgacz,Masculino,RO,PDT
1,2,5140,Airton Sandoval Santana,Masculino,SP,MDB
2,3,945,Alvaro Fernandes Dias,Masculino,PR,PODE
3,4,4988,Ana Amélia de Lemos,Feminino,RS,PP
4,5,5529,Antonio Augusto Junho Anastasia,Masculino,MG,PSDB


We can do the same for the voting table.

In [20]:

# All fields from the Votos table
query_votes = 'SELECT * FROM Votos'
votes = pd.read_sql(query_votes, session.bind)

# Close DB session
session.close()

# Show first 5 values
votes.head()

Unnamed: 0,VotoID,SenadorID,SessaoID,Voto
0,0,40,44809,Sim
1,3,40,44949,Sim
2,7,40,45483,Sim
3,8,40,44589,Sim
4,9,40,42415,Sim


## Cleaning and Preparation

Let's check whether the DataFrames have any missing values.

In [21]:
# Missing values check
print("Total of missing entries in Senators: " + str(senators.isnull().sum().sum()))
print("Total of missing entries in Votes: " + str(votes.isnull().sum().sum()))

Total of missing entries in Senators: 0
Total of missing entries in Votes: 0


Apparently, there are no missing values in either table.
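
A grand total of zero is reassuring, but two finer-grained checks can be useful: a per-column count of missing entries, and a check that every vote refers to a known senator. The sketch below uses small made-up frames (`toy_senators` and `toy_votes` are hypothetical names) that only mirror the column names shown above.

```python
import pandas as pd

# Toy frames mirroring the real column names (all values are made up).
toy_senators = pd.DataFrame({
    'SenadorID': [1, 2],
    'PartidoSigla': ['P1', 'P2'],
})
toy_votes = pd.DataFrame({
    'VotoID': [0, 1, 2],
    'SenadorID': [1, 2, 3],        # senator 3 is deliberately unknown
    'SessaoID': [100, 100, 100],
    'Voto': ['Sim', 'Não', None],  # one deliberately missing vote
})

# Missing entries per column instead of one grand total.
missing_per_column = toy_votes.isnull().sum()

# Votes whose SenadorID has no matching row in the senators table.
orphans = toy_votes[~toy_votes['SenadorID'].isin(toy_senators['SenadorID'])]

print(missing_per_column)
print(orphans)
```

The same two expressions can be run on the real `votes` and `senators` DataFrames.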

Checking the ~~bad~~ [API Documentation](http://legis.senado.leg.br/dadosabertos/docs/ui/index.html), we can see that the `Voto` column can take several values. Apart from `Sim` (yes) and `Não` (no), every value represents some kind of abstention.

**These values are listed below.**


In [22]:
# All distinct vote values
votes['Voto'].unique()

array(['Sim', 'MIS', 'AP ', 'P-NRV', 'Não', 'LS ', 'Abstenção',
       'Presidente (art. 51 RISF)', 'LP ', 'NCom'], dtype=object)
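
Some labels in this output carry a trailing space (`'AP '`, `'LS '`, `'LP '`). The sketch below, run on a small made-up sample rather than the real data, shows one way to strip that whitespace and count how often each label occurs.

```python
import pandas as pd

# Toy sample reusing labels from the output above (not the real data).
sample = pd.Series(['Sim', 'AP ', 'Não', 'LS ', 'Sim', 'AP '], name='Voto')

# Strip surrounding whitespace so 'AP ' and 'AP' count as the same label.
cleaned = sample.str.strip()

# Frequency of each distinct label.
counts = cleaned.value_counts()
print(counts)
```

On the real data, the equivalent call would be `votes['Voto'].str.strip().value_counts()`.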

These values will be remapped as follows.

Vote | Value
-------|-------
Sim | 1
Não | 0
All others (abstentions) | 0.5

For this, the following function was defined and applied.

In [27]:
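def votes_remap(v):
    """Map a vote label to a number: 'Sim' -> 1, 'Não' -> 0, anything else -> 0.5."""
    # Non-string values were already remapped, so leave them unchanged
    # (this keeps the cell safe to re-run)
    if not isinstance(v, str):
        return v
    if v == 'Sim':
        return 1.0
    if v == 'Não':
        return 0.0
    # Every other label counts as an abstention
    return 0.5

votes['Voto'] = votes['Voto'].apply(votes_remap).astype(float)
votes.head()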


Unnamed: 0,VotoID,SenadorID,SessaoID,Voto
0,0,40,44809,1.0
1,3,40,44949,1.0
2,7,40,45483,1.0
3,8,40,44589,1.0
4,9,40,42415,1.0
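
The same remapping can also be written without a Python-level function. This is an alternative sketch, not the method used above: it relies on `Series.map` returning `NaN` for labels missing from the dictionary, which `fillna` then replaces with 0.5. The sample series is made up for illustration.

```python
import pandas as pd

# Toy sample with labels taken from the list of distinct vote values.
toy_votes = pd.Series(['Sim', 'Não', 'Abstenção', 'MIS', 'AP '], name='Voto')

# 'Sim' and 'Não' get their numeric value; every other label becomes NaN
# and is then filled with 0.5.
remapped = toy_votes.map({'Sim': 1.0, 'Não': 0.0}).fillna(0.5)
print(remapped.tolist())
```

One caveat: a genuinely missing vote would also become 0.5 with this approach, so it assumes the column has no missing values, as checked earlier.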


Alright, now we have two clean DataFrames containing the information needed to perform the analysis. However, the data is still split across two DataFrames and is not yet in the right format (**Instances x Attributes**).


First, let's build a **Pivot Table** from the `votes` DataFrame. This way, each row will be a senator and each column will hold that senator's vote in one session.
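
A minimal sketch of such a pivot is shown here on made-up data. It is one possible way to do it, not necessarily the one used in this project. Two choices are assumptions: duplicate (senator, session) pairs are averaged, and a senator with no record for a session gets 0.5, the same value used for abstentions.

```python
import pandas as pd

# Toy votes already remapped to numbers (values are made up).
toy_votes = pd.DataFrame({
    'SenadorID': [1, 1, 2, 2, 3],
    'SessaoID':  [100, 101, 100, 101, 100],
    'Voto':      [1.0, 0.0, 1.0, 0.5, 0.0],
})

# One row per senator, one column per session.
vote_matrix = toy_votes.pivot_table(
    index='SenadorID', columns='SessaoID', values='Voto', aggfunc='mean'
)

# Senator 3 has no record for session 101, which yields NaN.
# Assumption: treat a missing record like an abstention (0.5).
vote_matrix = vote_matrix.fillna(0.5)
print(vote_matrix)
```

The resulting matrix has the **Instances x Attributes** shape that clustering algorithms expect.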

In [None]:
As we can see, most of the fields in the tables are self-explanatory. The detailed format will be except