# Women's Representation in Society: Timeline Analysis

**Objective**: analyze how women's representation in society has evolved over the years.

### Procedure:
#### 1- Data collection via web scraping
- Data source: United Nations
- Women's share of legislators and managers: http://data.un.org/Data.aspx?d=GenderStat&f=inID%3a120
- Women's share of labour force: http://data.un.org/Data.aspx?d=GenderStat&f=inID%3a107
- Gender Inequality Index: http://hdr.undp.org/en/indicators/68606

#### 2- Join the datasets

#### 3- Load the result into Tableau for visual data analysis

#### Insights / visualization
1- Which year has the most data for labour force?
2- How do the world's countries compare on labour force?
   Which are the 5 best and the 5 worst countries?
3- How do the world's regions compare on labour force?
4- Is labour force representation correlated with the GDI (the Gender Inequality Index, stored as `GDI` in the code)?
5- How did representation evolve over the years in the worst and in the best country?
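
Questions 1, 2, and 4 can also be checked directly in pandas before moving to Tableau. The sketch below is only an illustration: it uses a small made-up DataFrame (the countries and values are not UN data) with the column names used later in this notebook, and it takes the top and bottom 2 countries instead of 5 because the sample is tiny.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only, not UN data.
sample = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Year": [2005, 2006, 2005, 2006, 2006, 2007],
    "Value Labour Force": [40.0, 41.0, 30.0, 31.0, 48.0, 49.0],
    "GDI": [0.30, 0.28, 0.60, 0.58, 0.15, 0.14],
})

# 1- Year with the most labour force observations
rows_per_year = sample.groupby("Year")["Value Labour Force"].count()
busiest_year = rows_per_year.idxmax()

# 2- Best and worst countries in that year (use n=5 on the real data)
in_year = sample[sample["Year"] == busiest_year]
best = in_year.nlargest(2, "Value Labour Force")["Country"].tolist()
worst = in_year.nsmallest(2, "Value Labour Force")["Country"].tolist()

# 4- Pearson correlation between labour force share and the index
corr = sample["Value Labour Force"].corr(sample["GDI"])

print(busiest_year, best, worst, round(corr, 2))
```

On the real data, `sample` would be replaced by the merged table built at the end of this notebook.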

## Women's Share of Labour Force

In [1]:
#Libraries
from selenium import webdriver
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import time

In [2]:
# Open url
driver = webdriver.Chrome('chromedriver.exe')
url = 'http://data.un.org/Data.aspx?d=GenderStat&f=inID%3a107'
driver.get(url)

In [3]:
itens_tabela = []

# first page
# grab the full table as HTML
tabela_completa_selenium = driver.find_element_by_xpath('//*[@id="divData"]/div[1]').get_attribute('innerHTML')
# parse the HTML with BeautifulSoup (explicit parser avoids the "no parser specified" warning)
tabela_completa_bs = BeautifulSoup(tabela_completa_selenium, 'html.parser')
# find all td elements
elementos_td = tabela_completa_bs.find_all('td')
# extract the text of each cell and store the page as one list
itens_tabela.append([x.text for x in elementos_td])

In [4]:
# remaining pages: click "next" 83 times
for page in range(0, 83):
    # locate and click the next-page button
    driver.find_element_by_xpath('//*[@id="linkNextB"]').click()
    # give the new page time to load before reading it
    time.sleep(2)
    # grab the full table as HTML
    tabela_completa_selenium = driver.find_element_by_xpath('//*[@id="divData"]/div[1]').get_attribute('innerHTML')
    # parse the HTML with BeautifulSoup
    tabela_completa_bs = BeautifulSoup(tabela_completa_selenium, 'html.parser')
    # find all td elements
    elementos_td = tabela_completa_bs.find_all('td')
    # extract the text of each cell and store the page as one list
    itens_tabela.append([x.text for x in elementos_td])

In [5]:
# flatten the per-page lists into a single 1-D array
# (this also works when the last page has fewer rows than the others)
tabela1 = np.array([item for pagina in itens_tabela for item in pagina])

In [6]:
# reshape: every table row has 7 cells
tabela1 = tabela1.reshape(-1, 7)

In [7]:
# convert to a DataFrame
df_tabela1 = pd.DataFrame(tabela1)

In [8]:
# last page: read the page currently shown in the browser
# (rows already collected by the loop are removed later by drop_duplicates)
# grab the full table as HTML
tabela_completa_selenium = driver.find_element_by_xpath('//*[@id="divData"]/div[1]').get_attribute('innerHTML')
# parse the HTML with BeautifulSoup
tabela_completa_bs = BeautifulSoup(tabela_completa_selenium, 'html.parser')
# find all td elements
elementos_td = tabela_completa_bs.find_all('td')
# extract the text of each cell, then reshape into rows of 7 cells
itens_tabela = [x.text for x in elementos_td]
tabela1 = np.array(itens_tabela).reshape(-1, 7)
# convert to a DataFrame
df_tabela2 = pd.DataFrame(tabela1)
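
The first-page, loop, and last-page cells above repeat the same read-and-parse steps. A minimal sketch of folding them into a single loop follows. The names `scrape_all_pages`, `parse_cells`, `read_html`, and `click_next` are hypothetical; with Selenium, the two callables would wrap the `get_attribute('innerHTML')` and `.click()` calls used above. So that the sketch runs without a browser, it parses cells with the standard library's `html.parser` instead of BeautifulSoup, and a fake in-memory pager stands in for the driver.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> element, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def parse_cells(html):
    collector = CellCollector()
    collector.feed(html)
    return collector.cells


def scrape_all_pages(read_html, click_next, n_pages):
    """Read page 1, then click 'next' and read again until n_pages are read."""
    cells = parse_cells(read_html())
    for _ in range(n_pages - 1):
        click_next()
        cells.extend(parse_cells(read_html()))
    return cells


# Fake pager standing in for the Selenium driver (illustrative data only).
pages = [
    "<table><tr><td>A</td><td>2006</td></tr><tr><td>A</td><td>2005</td></tr></table>",
    "<table><tr><td>B</td><td>2006</td></tr></table>",
]
state = {"page": 0}


def click_next():
    state["page"] += 1


all_cells = scrape_all_pages(lambda: pages[state["page"]], click_next, n_pages=len(pages))
print(all_cells)
```

With a real driver, `click_next` would also have to wait for the new page to load (for example with `time.sleep`, as in the loop above) before `read_html` is called again.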

In [9]:
women_labour_force = pd.concat([df_tabela1, df_tabela2], ignore_index=True)

In [10]:
women_labour_force.rename(columns={0: "Country", 1: "Subgroup", 2: "Year", 3: "Source", 4: "Unit", 
                                   5: "Value Labour Force"}, inplace=True)

In [11]:
women_labour_force.drop(columns=['Subgroup', 'Source', 'Unit', 6], inplace=True)
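
The flatten, reshape, rename, and drop steps above can be folded into one helper. The sketch below assumes, as the cells above do, that every table row has 7 cells in the order Country, Subgroup, Year, Source, Unit, value, plus one trailing unused cell. `cells_to_frame`, `COLUMNS`, and the `Extra` label are hypothetical names, and the sample rows are made up.

```python
import numpy as np
import pandas as pd

# Assumed cell order within each scraped row; "Extra" is the unused 7th cell.
COLUMNS = ["Country", "Subgroup", "Year", "Source", "Unit", "Value", "Extra"]


def cells_to_frame(pages, value_name):
    """Turn a list of per-page cell lists into a Country/Year/value DataFrame."""
    flat = [cell for page in pages for cell in page]
    if len(flat) % len(COLUMNS) != 0:
        raise ValueError("cell count is not a multiple of 7; the page layout may have changed")
    table = np.array(flat, dtype=object).reshape(-1, len(COLUMNS))
    frame = pd.DataFrame(table, columns=COLUMNS)
    return frame[["Country", "Year", "Value"]].rename(columns={"Value": value_name})


# Two pages of different lengths (illustrative values only).
pages = [
    ["X", "s", "2006", "src", "Percent", "41.5", "",
     "X", "s", "2005", "src", "Percent", "41.3", ""],
    ["Y", "s", "2006", "src", "Percent", "30.0", ""],
]
labour = cells_to_frame(pages, "Value Labour Force")
print(labour)
```

The length check fails early if a page returns an unexpected number of cells, instead of producing silently misaligned rows.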

In [12]:
women_labour_force.loc[30:50]

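|    | Country | Year | Value Labour Force |
|----|---------|------|--------------------|
| 30 | Albania | 1998 | 41.5 |
| 31 | Albania | 1997 | 41.3 |
| 32 | Albania | 1996 | 41.1 |
| 33 | Albania | 1995 | 40.9 |
| 34 | Albania | 1994 | 40.7 |
| 35 | Albania | 1993 | 40.5 |
| 36 | Albania | 1992 | 40.4 |
| 37 | Albania | 1991 | 40.3 |
| 38 | Albania | 1990 | 40.2 |
| 39 | Albania | 1989 | 40.1 |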


## Women's share of Legislators and Managers

In [13]:
# Open url
driver = webdriver.Chrome('chromedriver.exe')
url = 'http://data.un.org/Data.aspx?d=GenderStat&f=inID%3a120'
driver.get(url)

In [14]:
itens_tabela = []

# first page
# grab the full table as HTML
tabela_completa_selenium = driver.find_element_by_xpath('//*[@id="divData"]/div[1]').get_attribute('innerHTML')
# parse the HTML with BeautifulSoup (explicit parser avoids the "no parser specified" warning)
tabela_completa_bs = BeautifulSoup(tabela_completa_selenium, 'html.parser')
# find all td elements
elementos_td = tabela_completa_bs.find_all('td')
# extract the text of each cell and store the page as one list
itens_tabela.append([x.text for x in elementos_td])

In [15]:
# remaining pages: click "next" 25 times
for page in range(0, 25):
    # locate and click the next-page button
    driver.find_element_by_xpath('//*[@id="linkNextB"]').click()
    # give the new page time to load before reading it
    time.sleep(2)
    # grab the full table as HTML
    tabela_completa_selenium = driver.find_element_by_xpath('//*[@id="divData"]/div[1]').get_attribute('innerHTML')
    # parse the HTML with BeautifulSoup
    tabela_completa_bs = BeautifulSoup(tabela_completa_selenium, 'html.parser')
    # find all td elements
    elementos_td = tabela_completa_bs.find_all('td')
    # extract the text of each cell and store the page as one list
    itens_tabela.append([x.text for x in elementos_td])

In [16]:
# flatten the per-page lists into a single 1-D array
# (this also works when the last page has fewer rows than the others)
tabela1 = np.array([item for pagina in itens_tabela for item in pagina])

In [17]:
# reshape: every table row has 7 cells
tabela1 = tabela1.reshape(-1, 7)

In [18]:
# convert to a DataFrame
df_tabela1 = pd.DataFrame(tabela1)

In [19]:
# last page: read the page currently shown in the browser
# (rows already collected by the loop are removed later by drop_duplicates)
# grab the full table as HTML
tabela_completa_selenium = driver.find_element_by_xpath('//*[@id="divData"]/div[1]').get_attribute('innerHTML')
# parse the HTML with BeautifulSoup
tabela_completa_bs = BeautifulSoup(tabela_completa_selenium, 'html.parser')
# find all td elements
elementos_td = tabela_completa_bs.find_all('td')
# extract the text of each cell, then reshape into rows of 7 cells
itens_tabela = [x.text for x in elementos_td]
tabela1 = np.array(itens_tabela).reshape(-1, 7)
# convert to a DataFrame
df_tabela2 = pd.DataFrame(tabela1)

In [20]:
women_share_politics = pd.concat([df_tabela1, df_tabela2], ignore_index=True)

In [21]:
women_share_politics.rename(columns={0: "Country", 1: "Subgroup", 2: "Year", 3: "Source", 4: "Unit", 
                                     5: "Value Legislators_Managers"}, inplace=True)

In [22]:
women_share_politics.drop(columns=['Subgroup', 'Source', 'Unit', 6], inplace=True)

In [23]:
women_share_politics.loc[30:50]

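|    | Country | Year | Value Legislators_Managers |
|----|---------|------|----------------------------|
| 30 | Australia | 1988 | 39.0 |
| 31 | Australia | 1987 | 38.8 |
| 32 | Australia | 1986 | 39.0 |
| 33 | Australia | 1985 | 17.6 |
| 34 | Austria | 2006 | 28.6 |
| 35 | Austria | 2005 | 27.2 |
| 36 | Austria | 2004 | 27.6 |
| 37 | Austria | 2003 | 27.2 |
| 38 | Austria | 2002 | 29.0 |
| 39 | Austria | 2001 | 29.3 |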


## Gender Inequality Index

In [24]:
driver = webdriver.Chrome('chromedriver.exe')
url = 'http://hdr.undp.org/en/indicators/68606#'
driver.get(url)

In [25]:
# read the whole table (read_html returns a list of DataFrames)
df = pd.read_html(driver.find_element_by_tag_name('table').get_attribute('outerHTML'))

In [26]:
# assign the first (and only) table to the gdi DataFrame
gdi = df[0]
gdi.columns

Index(['HDI Rank', 'Country', '1995', 'Unnamed: 3', '2000', 'Unnamed: 5',
       '2005', 'Unnamed: 7', '2010', 'Unnamed: 9', '2011', 'Unnamed: 11',
       '2012', 'Unnamed: 13', '2013', 'Unnamed: 15', '2014', 'Unnamed: 17',
       '2015', 'Unnamed: 19', '2016', 'Unnamed: 21', '2017', 'Unnamed: 23',
       '2018', 'Unnamed: 25', '2019', 'Unnamed: 27'],
      dtype='object')

In [27]:
# drop the unneeded columns
gdi.drop(columns=['Unnamed: 3', 'Unnamed: 5', 'Unnamed: 7', 'Unnamed: 9', 'Unnamed: 11', 'Unnamed: 13',
                  'Unnamed: 15', 'Unnamed: 17', 'Unnamed: 19', 'Unnamed: 21', 'Unnamed: 23', 'Unnamed: 25',
                  'Unnamed: 27'],
         inplace=True)
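
Listing every `Unnamed: N` column by hand is brittle if the source table gains or loses a year. As a sketch of an alternative, the columns can be filtered by name prefix. The frame below is a tiny made-up stand-in with the same column pattern; `wide` and `cleaned` are illustrative names.

```python
import pandas as pd

# Tiny stand-in for the scraped table (values are illustrative).
wide = pd.DataFrame({
    "HDI Rank": [1, 2],
    "Country": ["A", "B"],
    "1995": ["0.5", ".."],
    "Unnamed: 3": [None, None],
    "2000": ["0.4", "0.6"],
    "Unnamed: 5": [None, None],
})

# Keep every column whose name does not start with "Unnamed".
cleaned = wide.loc[:, ~wide.columns.str.startswith("Unnamed")]
print(list(cleaned.columns))
```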

In [28]:
# reshape from wide (one column per year) to long (one row per country and year)
gdi = gdi.melt(id_vars='Country',
               value_vars=['1995', '2000', '2005', '2010', '2011', '2012',
                           '2013', '2014', '2015', '2016', '2017', '2018', '2019'],
               ignore_index=True)

In [29]:
gdi.rename(columns={'variable':'Year'}, inplace=True)
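
`melt` turns one column per year into one row per country and year. Its `var_name` parameter names the new year column in the same call, which would make the separate `rename` unnecessary. A small sketch on made-up data:

```python
import pandas as pd

# Illustrative wide table: one column per year.
wide = pd.DataFrame({
    "Country": ["A", "B"],
    "1995": ["0.5", ".."],
    "2000": ["0.4", "0.6"],
})

# Long format: one row per (Country, Year) pair.
long = wide.melt(id_vars="Country", value_vars=["1995", "2000"],
                 var_name="Year", value_name="value")
print(long)
```

The `Year` values come out as strings because they were column names, which matters later when `Year` is used as a merge key.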

In [30]:
gdi.loc[0:29]

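|   | Country | Year | value |
|---|---------|------|-------|
| 0 | Afghanistan | 1995 | .. |
| 1 | Albania | 1995 | .. |
| 2 | Algeria | 1995 | 0.682 |
| 3 | Angola | 1995 | .. |
| 4 | Argentina | 1995 | 0.427 |
| 5 | Armenia | 1995 | 0.479 |
| 6 | Australia | 1995 | 0.180 |
| 7 | Austria | 1995 | 0.186 |
| 8 | Azerbaijan | 1995 | .. |
| 9 | Bahamas | 1995 | .. |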


In [48]:
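# '..' marks a missing value in the source table (np.NaN was removed in NumPy 2.0; np.nan works everywhere)
gdi = gdi.replace('..', np.nan)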
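
Replacing `'..'` with NaN leaves the remaining values as strings until the later `astype` call. A possible alternative, sketched here on made-up values, is `pd.to_numeric` with `errors='coerce'`, which converts to float and turns any non-numeric marker into NaN in one step. Note that it would also silently blank out any other unexpected string, which may or may not be desirable.

```python
import pandas as pd

# Illustrative column with the '..' missing-value marker.
values = pd.Series(["..", "0.682", "0.427", ".."])

# Non-numeric entries become NaN; the rest become floats.
numeric = pd.to_numeric(values, errors="coerce")
print(numeric.isna().sum(), numeric.dtype)
```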

In [49]:
# export to Excel
gdi.to_excel("gdi.xlsx")

## Women's Share (Representation in Society)

**Join of tables:** women_labour_force and women_share_politics

In [39]:
# merge the two tables, keeping rows that appear in only one of them
women_representation = pd.merge(women_labour_force, women_share_politics, on=['Country', 'Year'], how='outer')

In [40]:
#drop duplicates
women_representation.drop_duplicates(keep='first', ignore_index=True, inplace=True)
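
An outer merge keeps rows that appear in only one of the two tables, which is why many rows end up with a value in just one of the value columns. The `indicator=True` option of `pd.merge` adds a `_merge` column showing where each row came from. The sketch below uses made-up rows; `labour`, `politics`, and `merged` are illustrative names.

```python
import pandas as pd

# Illustrative inputs; keys are strings, as they are right after scraping.
labour = pd.DataFrame({"Country": ["A", "A", "B"], "Year": ["2006", "2005", "2006"],
                       "Value Labour Force": ["40.0", "39.5", "30.0"]})
politics = pd.DataFrame({"Country": ["A", "C"], "Year": ["2006", "2006"],
                         "Value Legislators_Managers": ["25.0", "43.5"]})

# _merge is 'both', 'left_only', or 'right_only' for each output row.
merged = pd.merge(labour, politics, on=["Country", "Year"], how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

Counting the `_merge` categories is a quick way to see how much the two indicators overlap before exporting.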

In [41]:
women_representation

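|      | Country | Year | Value Labour Force | Value Legislators_Managers |
|------|---------|------|--------------------|----------------------------|
| 0 | Afghanistan | 2006 | 29.4 | |
| 1 | Afghanistan | 2005 | 29.0 | |
| 2 | Afghanistan | 2004 | 28.8 | |
| 3 | Afghanistan | 2003 | 28.8 | |
| 4 | Afghanistan | 2002 | 28.6 | |
| ... | ... | ... | ... | ... |
| 4191 | Serbia | 2006 | | 24.9 |
| 4192 | Serbia | 2005 | | 24.8 |
| 4193 | Serbia | 2004 | | 25.9 |
| 4194 | St. Helena | 1998 | | 43.5 |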


In [61]:
women_representation = women_representation.astype({'Year': 'int', 'Value Labour Force': 'float', 
                                                    'Value Legislators_Managers': 'float'})

In [62]:
women_representation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4196 entries, 0 to 4195
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Country                     4196 non-null   object 
 1   Year                        4196 non-null   int32  
 2   Value Labour Force          4158 non-null   float64
 3   Value Legislators_Managers  1265 non-null   float64
dtypes: float64(2), int32(1), object(1)
memory usage: 114.9+ KB


In [63]:
# export to Excel
women_representation.to_excel("v2_women_representation.xlsx")

In [50]:
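# merge women_representation and gdi
# (cast Year to int on both sides so the merge keys have the same dtype;
#  gdi['Year'] holds strings after melt)
women_standing = pd.merge(women_representation.astype({'Year': 'int'}), gdi.astype({'Year': 'int'}),
                          on=['Country', 'Year'], how='outer')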

In [51]:
women_standing.rename(columns={'value':'GDI'}, inplace=True)
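
`pd.merge` expects the key columns to have compatible dtypes on both sides. In current pandas versions, an integer `Year` on one side and a string `Year` on the other raises a `ValueError` rather than matching rows. The sketch below, on made-up rows, shows the mismatch and one way to align the keys first; all names in it are illustrative.

```python
import pandas as pd

# Illustrative inputs: Year is an int on the left and a string on the right.
left = pd.DataFrame({"Country": ["A"], "Year": [2006], "Value Labour Force": [40.0]})
right = pd.DataFrame({"Country": ["A"], "Year": ["2006"], "GDI": ["0.3"]})

# Record whether merging on mismatched key dtypes is rejected.
try:
    pd.merge(left, right, on=["Country", "Year"], how="outer")
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Cast the string key to int so both sides agree, then merge.
right_fixed = right.astype({"Year": "int"})
merged = pd.merge(left, right_fixed, on=["Country", "Year"], how="outer")
print(mismatch_raised, len(merged))
```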

In [58]:
women_standing = women_standing.astype({'Year': 'int', 'Value Labour Force': 'float', 'Value Legislators_Managers': 'float', 'GDI':'float'})

In [59]:
women_standing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6426 entries, 0 to 6425
Data columns (total 5 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Country                     6413 non-null   object 
 1   Year                        6426 non-null   int32  
 2   Value Labour Force          4158 non-null   float64
 3   Value Legislators_Managers  1265 non-null   float64
 4   GDI                         2124 non-null   float64
dtypes: float64(3), int32(1), object(1)
memory usage: 226.0+ KB


In [53]:
#drop duplicates
women_standing.drop_duplicates(keep='first', ignore_index=True, inplace=True)

In [57]:
# export to Excel
women_standing.to_excel("v3_women_standing.xlsx")