## 16Types

**This notebook extracts an approximate share of the population for each MBTI type from the website 16Personalities (which has a large database on this topic). It will be useful in the EDA to cross-check figures.**

<br>

We use requests and BeautifulSoup to scrape the data.

In [66]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import ast

<br>

We choose the countries for which we want data: in our case, English-speaking countries (leaving aside some of the smaller ones).

In [238]:
countries = ['canada', 'new-zealand', 'ireland', 'australia', 'united-states','united-kingdom']

<br>

The following code extracts, for each country, the percentage of each type along with the total population and the total number of respondents.

In [168]:
d = {}
n = {}

for country in countries:
    url = 'https://www.16personalities.com/country-profiles/'+country+'#region-switches'
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Extract the embedded list of MBTI type percentages for this country
    table = str(soup.find_all("country-profiles-top-ten-list"))
    table = re.split(r'\[|\]', table)  # raw string avoids invalid-escape warnings
    result = ast.literal_eval(table[2])  # positional index; depends on the page's markup
    d[country] = pd.DataFrame(result)

    # Extract the population and the number of respondents for this country.
    # The positional indices below also depend on the page's markup.
    demographics = str(soup.find_all('div', {"class": "demographics"}))
    result = re.split('<|>', demographics)
    population = result[8]
    respondents = result[-5]
    n[country] = [int(population.replace(',', '')), int(respondents.replace(',', ''))]
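
The bracket-based extraction can be tried offline on a small string. The sketch below is only an illustration: `tag_text` is a hypothetical stand-in for the stringified tag (the real markup of the page is an assumption and may differ), and `extract_records` is a helper name introduced here, not part of the notebook.

```python
import ast
import re

# Hypothetical stand-in for str(tag); the real page markup may differ.
tag_text = (
    '<country-profiles-top-ten-list :types="'
    "[{'code': 'infp', 'name': 'Turbulent Mediator', 'percentage': '12.75'}, "
    "{'code': 'enfp', 'name': 'Turbulent Campaigner', 'percentage': '8.46'}]"
    '"></country-profiles-top-ten-list>'
)


def extract_records(text):
    """Return the first bracketed Python-style list literal found in text."""
    match = re.search(r"\[.*?\]", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no bracketed list found; the page layout may have changed")
    # literal_eval only parses literals, so it never executes scraped text
    return ast.literal_eval(match.group(0))


records = extract_records(tag_text)
print(records[0]["code"], records[0]["percentage"])
```

Searching for the bracketed span and failing loudly when it is missing is one way to avoid depending on a fixed position such as `table[2]`.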


<br>

Since we have a dictionary of per-country DataFrames, we concatenate them into a single one.

In [169]:
frames = []
for key, value in d.items():
    df = value.copy()
    df['Country'] = key
    frames.append(df)
# Dicts preserve insertion order (Python 3.7+), so rows follow the order of `countries`
final_df = pd.concat(frames, axis=0)
final_df

,code,name,percentage,Country
0,infp,Turbulent Mediator,12.75,canada
1,enfp,Turbulent Campaigner,8.46,canada
2,infj,Turbulent Advocate,6.50,canada
3,isfj,Turbulent Defender,5.42,canada
4,enfp,Assertive Campaigner,4.98,canada
...,...,...,...,...
27,istp,Turbulent Virtuoso,1.11,united-kingdom
28,entj,Turbulent Commander,1.06,united-kingdom
29,istp,Assertive Virtuoso,1.01,united-kingdom
30,estj,Turbulent Executive,1.00,united-kingdom
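
As an alternative to building the combined frame in a loop, the same result can be sketched in one expression. The frames in `d` below are toy stand-ins with illustrative values, not scraped data.

```python
import pandas as pd

# Toy stand-ins for the scraped per-country tables (illustrative values only).
d = {
    "canada": pd.DataFrame({"code": ["infp", "enfp"], "percentage": ["12.75", "8.46"]}),
    "ireland": pd.DataFrame({"code": ["infp", "infj"], "percentage": ["11.00", "6.00"]}),
}

# assign() returns a copy with the new column, so the frames in d stay untouched;
# ignore_index=True gives the combined frame a unique 0..n-1 index.
final_df = pd.concat(
    [df.assign(Country=country) for country, df in d.items()],
    ignore_index=True,
)
print(final_df)
```

Note that `ignore_index=True` replaces the repeated per-country index with a unique one, which differs from the output shown above.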


<br>

The "percentage" column is not stored as a float, so we convert its type and divide by 100 to obtain a fraction.

In [149]:
final_df['percentage'] = final_df['percentage'].astype('float')
final_df['percentage'] = final_df['percentage'] / 100  # percent -> fraction

<br>

We then calculate the population and the number of respondents for each type.

In [254]:
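amount = []
# Relies on final_df rows being grouped by country in the same order as `countries`,
# because the list is later assigned to the column by position
for country in countries:
    result = final_df['percentage'][final_df['Country'] == country].multiply(n[country][0])
    amount.extend(result.tolist())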

In [258]:
final_df['population'] = amount

In [261]:
amount = []
# Same calculation using the number of respondents (n[country][1])
for country in countries:
    result = final_df['percentage'][final_df['Country'] == country].multiply(n[country][1])
    amount.extend(result.tolist())

In [262]:
final_df['respondents'] = amount
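
The loops above assign values by position, so they depend on the row order of `final_df`. A minimal sketch of an order-independent variant, using toy values and a hypothetical helper frame named `totals`, maps each row's country to its total instead:

```python
import pandas as pd

# Toy inputs (illustrative values, not scraped data); rows are deliberately interleaved.
final_df = pd.DataFrame({
    "Country": ["ireland", "canada", "ireland", "canada"],
    "code": ["infp", "infp", "enfp", "enfp"],
    "percentage": [0.5, 0.25, 0.5, 0.75],
})
n = {"canada": [1000, 100], "ireland": [200, 20]}  # [population, respondents]

# One row per country, so Series.map can look totals up by country name
totals = pd.DataFrame.from_dict(n, orient="index", columns=["population", "respondents"])

final_df["population"] = final_df["percentage"] * final_df["Country"].map(totals["population"])
final_df["respondents"] = final_df["percentage"] * final_df["Country"].map(totals["respondents"])
print(final_df)
```

Because the lookup is keyed on the country name, the result does not change if the rows are shuffled or the index contains duplicates.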

<br>

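Finally, we check that the data is consistent. Most per-country sums differ from the scraped totals only by rounding; the United States percentages sum to 0.972, so its totals fall about 3% short. This is accurate enough for what we need.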

In [266]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [269]:
final_df.groupby('Country').sum(numeric_only=True)

Country,percentage,population,respondents
australia,1.0,22744188.696,1722174.193
canada,1.0,35092816.033,2132754.364
ireland,1.0,4892794.231,210652.063
new-zealand,1.0,4438393.0,374865.0
united-kingdom,1.0,64088222.0,3002723.0
united-states,0.972,312370535.808,22043156.94


In [275]:
for k, v in n.items():
    print(k + ' --> population = {0}, respondents = {1}'.format(v[0], v[1]))

united-kingdom --> population = 64088222, respondents = 3002723
united-states --> population = 321368864, respondents = 22678145
australia --> population = 22751014, respondents = 1722691
ireland --> population = 4892305, respondents = 210631
new-zealand --> population = 4438393, respondents = 374865
canada --> population = 35099836, respondents = 2133181
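
The comparison can also be made explicit rather than read off by eye. The sketch below copies the population figures from the two outputs above and computes the ratio of summed to scraped totals; the 1% `TOLERANCE` is a hypothetical threshold chosen for illustration.

```python
# Population column sums and scraped totals, copied from the outputs above.
summed = {
    "australia": 22744188.696,
    "canada": 35092816.033,
    "ireland": 4892794.231,
    "new-zealand": 4438393.0,
    "united-kingdom": 64088222.0,
    "united-states": 312370535.808,
}
scraped = {
    "australia": 22751014,
    "canada": 35099836,
    "ireland": 4892305,
    "new-zealand": 4438393,
    "united-kingdom": 64088222,
    "united-states": 321368864,
}

TOLERANCE = 0.01  # hypothetical threshold: flag gaps larger than 1%

ratios = {country: summed[country] / scraped[country] for country in scraped}
flagged = [c for c, r in ratios.items() if abs(r - 1) > TOLERANCE]

for country, ratio in ratios.items():
    print(f"{country}: {ratio:.4f}")
print("outside tolerance:", flagged)
```

With these figures only the United States falls outside the tolerance, which matches its percentage sum of 0.972.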


<br>

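For our analysis we are interested in the share of the population belonging to each type, so for the time being we extract only this information. Summing by `code` merges the Assertive and Turbulent variants of each type, and dividing by the number of countries gives an unweighted average across countries.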

In [287]:
types = final_df.groupby('code').sum(numeric_only=True)

In [289]:
types['percentage'].divide(len(countries)).sum()

0.9952666666666667

In [290]:
percent_types = types['percentage'].divide(len(countries))

In [292]:
percent_types.to_csv('../../data/16personalities.csv')
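
The average above gives every country the same weight regardless of its size. If a population-weighted share were preferred, it could be sketched as follows; the country names and figures are toy values, and this is an alternative to the notebook's method rather than what it computes.

```python
import pandas as pd

# Toy data (illustrative only): estimated population per type and country.
final_df = pd.DataFrame({
    "Country": ["a", "a", "b", "b"],
    "code": ["infp", "enfp", "infp", "enfp"],
    "population": [60.0, 40.0, 300.0, 700.0],
})

# Sum the estimated head counts per type, then normalise by the overall total,
# so larger countries contribute proportionally more.
by_type = final_df.groupby("code")["population"].sum()
weighted_share = by_type / by_type.sum()
print(weighted_share)
```

Normalising by the summed head counts also makes the shares add up to 1 even when a country's percentages do not.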