# MGT-499 Statistics and Data Science - Individual Assignment

In [551]:
# Importing all the libraries I am using
import numpy as np
import math
import pandas as pd
#import my_module # here I can add all my functions
import os  # used to change (os.chdir(path)) or query (os.getcwd()) the working directory
import plotly.express as px  # interactive boxplots
import matplotlib.pyplot as plt 
import seaborn as sb
import bokeh.io # for more visual demonstration
import bokeh.models #for more visual demonstration
import bokeh.plotting #for more visual demonstration
#df_gps = pd.read_stata("data/global_preference_survey.dta") # Import dataset on Global Preference Survey from local folder

This notebook contains the individual assignment for the class MGT-499 Statistics and Data Science. Important information:
- **Content**: the assignment is divided into two main parts, namely data cleaning (2 datasets) and Exploratory Data Analysis, for a total of 13 main questions (see table of contents). Some of these main questions are divided into sub-questions. In the first part, the questions are very specific, while in the second part they are more open.
- **Deadline**: Tuesday 8th of November at 23:59. 
- **Final Output**: a Jupyter notebook, which we (teachers) can run. 
- **Answering the Questions**: you will find the questions in markdown cells below. Under each of these cells, you will find one or more cells for answers. Type your answer there. For the answer to be correct, the cell with the answer must run without error (unless specified). You can use markdown cells for the answers that require text.
- **Submission**: submit the assignment on Moodle, under [Individual Assignment](https://moodle.epfl.ch/mod/assign/view.php?id=1222846)

## Content
- [Polity5 Dataset](#polity5)  
    - [Question 1: Import the data and get a first glance](#question1)
    - [Question 2: Select some variables](#question2)
    - [Question 3: Missing Values](#question3)
    - [Question 4: Check Polity2](#question4)
- [Quality of Government (QOG) Environmental Indicators Dataset](#qog)  
    - [Question 5: Import the data and do few fixes](#question5)
    - [Question 6: Merge QOG and Polity5 ... first attempt](#question6)
    - [Question 7: Merge QOG and Polity5 ... second attempt](#question7)
    - [Question 8: Clean the merged dataframe](#question8)
- [Exploratory Data Analysis](#eda)
    - [Question 9: Selecting the ingredients for the recipe (how I select the variables)](#question9)  
    - [Question 10: Picking the right quantity of each ingredient (how I select my sample)](#question10)
    - [Question 11: Tasting and preparing the ingredients (univariate analysis)](#question11)
    - [Question 12: Cooking the ingredients together (bivariate analysis)](#question12)
    - [Question 13: Tasting the new recipe (conclusion)](#question13)

## Polity5 data <a class="anchor" id="polity5"></a>

Polity5 is a widely used democracy scale. The raw data as well as the codebook are available [here](http://www.systemicpeace.org/inscrdata.html). For this assignment, we have slightly modified the original version; for example, we have added the iso3 code for countries to save you time. You can find the modified version [here](https://raw.githubusercontent.com/edoardochiarotti/class_datascience/main/Notebooks/Assignment/individual_assignment/data/polity2_iso3.csv).

### Question 1: import the data and get a first glance <a class="anchor" id="question1"></a>

1a) Import the csv 'polity2_iso3.csv' (file provided in the link [here](https://raw.githubusercontent.com/edoardochiarotti/class_datascience/main/Notebooks/Assignment/individual_assignment/data/polity2_iso3.csv)) as a panda dataframe (ignore the warning message) **(1 point)**

In [552]:
url = "https://raw.githubusercontent.com/edoardochiarotti/class_datascience/main/Notebooks/Assignment/individual_assignment/data/polity2_iso3.csv"
data = pd.read_csv(url, low_memory=False)  # low_memory=False because some columns contain mixed data types
type(data)

pandas.core.frame.DataFrame

1b) Display the first 10 rows **(1 point)**

In [553]:
data.head(10) #with the head function I can decide how many rows I want to display

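| | iso3 | year | p5 | cyear | ccode | scode | country | flag | fragment | democ | ... | interim | bmonth | bday | byear | bprec | post | change | d5 | sf | regtrans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 1800 | 0 | 2711800 | 271 | WRT | Wuerttemburg | 0 | NaN | 0 | ... | NaN | 1.0 | 1.0 | 1800.0 | 1.0 | -7.0 | 88.0 | 1.0 | NaN | NaN |
| 1 | NaN | 1800 | 0 | 7301800 | 730 | KOR | Korea | 0 | NaN | 5 | ... | NaN | 1.0 | 1.0 | 1800.0 | 1.0 | 1.0 | 88.0 | 1.0 | NaN | NaN |
| 2 | NaN | 1800 | 0 | 2451800 | 245 | BAV | Bavaria | 0 | NaN | 0 | ... | NaN | 1.0 | 1.0 | 1800.0 | 1.0 | -10.0 | 88.0 | 1.0 | NaN | NaN |
| 3 | NaN | 1801 | 0 | 7301801 | 730 | KOR | Korea | 0 | NaN | 5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | 1801 | 0 | 2711801 | 271 | WRT | Wuerttemburg | 0 | NaN | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | 1801 | 0 | 2451801 | 245 | BAV | Bavaria | 0 | NaN | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | NaN | 1802 | 0 | 7301802 | 730 | KOR | Korea | 0 | NaN | 5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | NaN | 1802 | 0 | 2711802 | 271 | WRT | Wuerttemburg | 0 | NaN | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | NaN | 1802 | 0 | 2451802 | 245 | BAV | Bavaria | 0 | NaN | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | NaN | 1803 | 0 | 7301803 | 730 | KOR | Korea | 0 | NaN | 5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |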


1c) Display the data types of all the variables included in the data **(1 point)**

In [554]:
data.dtypes #shows the type of data from each column  


iso3         object
year          int64
p5            int64
cyear         int64
ccode         int64
scode        object
country      object
flag          int64
fragment    float64
democ         int64
autoc         int64
polity        int64
polity2     float64
durable      object
xrreg         int64
xrcomp        int64
xropen        int64
xconst        int64
parreg        int64
parcomp       int64
exrec       float64
exconst       int64
polcomp     float64
prior        object
emonth       object
eday         object
eyear        object
eprec        object
interim      object
bmonth       object
bday         object
byear        object
bprec        object
post         object
change       object
d5           object
sf           object
regtrans     object
dtype: object

1d) By looking at your answer in 1c, what is the difference between the different types of variables? Why is the type of some variables defined as object? **(1 point)**

My data shows 3 different types of variables: 
- object -> the pandas dtype for text (strings). It is also used when a column mixes numbers and strings, which is why some variables that look numeric are typed as object
- int64 -> the column holds only integers, with no missing values
- float64 -> the column holds decimal numbers, or integers together with missing values (NaN), because NaN is stored as a float

-> The suffix 64 is the number of bits used to store each value
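
To make the distinction concrete, here is a small sketch on made-up data (the column names and values are hypothetical, not taken from Polity5). It shows how a single missing value or a stray string changes the dtype that pandas infers:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) to illustrate how pandas infers dtypes
toy = pd.DataFrame({
    "only_ints": [1, 2, 3],            # whole numbers, nothing missing
    "ints_with_nan": [1, np.nan, 3],   # NaN is a float, so the column becomes float
    "mixed": [1, "two", 3],            # numbers and text together
})

print(toy.dtypes)
```

The same mechanism would explain why a column that looks numeric in the preview can still be reported as object.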

### Question 2. Select some variables <a class="anchor" id="question2"></a>

2a) Create a subset dataframe that contains the variables 'iso3', 'country', 'year', 'polity2' and display it **(1 point)**

In [555]:
# 2a) subset with the columns 'iso3', 'country', 'year', 'polity2'
subset_1 = data.loc[:, ["iso3", "country", "year", "polity2"]]
print(subset_1)

      iso3       country  year  polity2
0      NaN  Wuerttemburg  1800     -7.0
1      NaN         Korea  1800      1.0
2      NaN       Bavaria  1800    -10.0
3      NaN         Korea  1801      1.0
4      NaN  Wuerttemburg  1801     -7.0
...    ...           ...   ...      ...
17569  ZWE      Zimbabwe  2014      4.0
17570  ZWE      Zimbabwe  2015      4.0
17571  ZWE      Zimbabwe  2016      4.0
17572  ZWE      Zimbabwe  2017      4.0
17573  ZWE      Zimbabwe  2018      4.0

[17574 rows x 4 columns]


2b) Display the type of the variable "year" **(1 point)**

In [556]:
# 2b) display type of the single column "year"
subset_1["year"].dtypes

dtype('int64')

2c) Convert the variable "year" to string **(1 point)**
<br>
Hint: if you get a warning message of the type "SettingWithCopyWarning", it is because you did not subset the data in the right way. Go back to your class notes and check the different ways to subset a dataframe, and try again. If you do it correctly, you will not get the warning message.

In [557]:
#  2c) convert integer to string (pandas reports strings as "object")

subset_1["year"] = subset_1["year"].astype(str)
print(subset_1.dtypes)  # check that the conversion worked


iso3        object
country     object
year        object
polity2    float64
dtype: object
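
The hint about `SettingWithCopyWarning` can be illustrated with a minimal sketch on a made-up frame (names and values are hypothetical). Taking an explicit `.copy()` of the subset is one common way to make sure later assignments touch an independent dataframe:

```python
import pandas as pd

# Toy frame standing in for the Polity data (values are made up)
df = pd.DataFrame({"iso3": ["AAA", "BBB"], "year": [2000, 2001], "polity2": [5.0, -3.0]})

# Take an explicit copy so later assignments modify an independent frame
sub = df.loc[:, ["iso3", "year"]].copy()
sub["year"] = sub["year"].astype(str)

print(sub["year"].tolist())
print(df["year"].dtype)  # the original column is left untouched
```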


### Question 3: Missing Values <a class="anchor" id="question3"></a>

3a) Subset the rows that have iso3 missing and display **(1 point)**

In [558]:
#  3a) missing iso3 subset

#print(subset_1.isna().sum()) #check which columns have NaN's 

subset_iso3 = subset_1.loc[subset_1["iso3"].isna()].copy()  # rows with missing iso3; .copy() makes the subset independent of subset_1

print(subset_iso3)

     iso3       country  year  polity2
0     NaN  Wuerttemburg  1800     -7.0
1     NaN         Korea  1800      1.0
2     NaN       Bavaria  1800    -10.0
3     NaN         Korea  1801      1.0
4     NaN  Wuerttemburg  1801     -7.0
...   ...           ...   ...      ...
1272  NaN    Montenegro  2018      9.0
1273  NaN   Sudan-North  2018     -4.0
1274  NaN       Vietnam  2018     -7.0
1275  NaN      Ethiopia  2018      1.0
1276  NaN        Serbia  2018      8.0

[1270 rows x 4 columns]


3b) Display the countries that have missing iso3. What can you tell by looking at them? Any similarities? **(1 point)**

In [559]:
# 3b) countries with missing iso3
# These countries show different patterns:
# -> One group consists of former states that are now part of a larger country: for example, Bavaria and Saxony are now within Germany, while Tuscany is within Italy
# -> Another group consists of countries that no longer exist, such as Germany West, Czechoslovakia and Yugoslavia, because they were merged into other countries or split into independent countries
subset_iso3.country.unique()

3c) Display the countries with missing iso3 from 2011. **(1 point)**

In [560]:
#  3c) displays all countries from 2011 onwards that don't have an iso code
subset_iso3["year"] = subset_iso3["year"].astype(int)  # convert year back to integer so it can be used in numeric comparisons (<, >, ==)
subset_iso3[(subset_iso3["iso3"].isna() & (subset_iso3["year"] > 2010))].country.unique()
#print(subset_iso3)

array(['Montenegro', 'Serbia', 'Ethiopia', 'Sudan-North', 'Vietnam'],
      dtype=object)

3d) Display the rows for which the column "country" contains the word "Serbia". By looking at the result, can you tell what happened to Serbia in 2006? **(1 point)**
<br>
Hint: the most general way of doing this is to use a combination of re.search and list comprehension. To display the full subset, you can use print(df.to_string()).
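
The approach named in the hint can be sketched as follows, on a made-up frame (the rows are hypothetical). `re.search` is applied to each country name inside a list comprehension, and the resulting list of booleans is used to select rows:

```python
import re
import pandas as pd

# Toy data (hypothetical rows), including one missing country name
toy = pd.DataFrame({
    "country": ["Serbia", "Serbia and Montenegro", "Korea", None],
    "year": [2006, 2005, 1800, 1900],
})

# Build a boolean list: True where the pattern is found in the country name.
# The isinstance check skips missing values, which re.search cannot handle.
mask = [isinstance(name, str) and re.search("Serbia", name) is not None
        for name in toy["country"]]

print(toy.loc[mask].to_string())
```

Compared with `str.contains`, this route accepts any regular expression and any extra condition inside the comprehension, at the cost of being more verbose.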

In [561]:
# 3d) Serbia dive in
subset_serbia = subset_iso3.loc[subset_iso3["country"].str.contains("Serbia")]  # extract the rows whose country name contains "Serbia"
subset_serbia[(subset_serbia["year"] == 2006) & (subset_serbia["country"] == "Serbia")]  # only Serbia in 2006 (not displayed, since it is not the last expression of the cell)
print(subset_serbia.to_string())  # see the bigger picture to compare
# In 2006 Serbia and Montenegro split: Montenegro became independent from Serbia, and Serbia's polity2 score jumped from 6 to 8.

     iso3                country  year  polity2
224   NaN                 Serbia  1830     -7.0
230   NaN                 Serbia  1831     -7.0
252   NaN                 Serbia  1832     -7.0
261   NaN                 Serbia  1833     -7.0
272   NaN                 Serbia  1834     -7.0
286   NaN                 Serbia  1835     -7.0
295   NaN                 Serbia  1836     -7.0
301   NaN                 Serbia  1837     -7.0
318   NaN                 Serbia  1838      2.0
333   NaN                 Serbia  1839      2.0
344   NaN                 Serbia  1840      2.0
357   NaN                 Serbia  1841      2.0
363   NaN                 Serbia  1842      2.0
369   NaN                 Serbia  1843      2.0
387   NaN                 Serbia  1844      2.0
394   NaN                 Serbia  1845      2.0
410   NaN                 Serbia  1846      2.0
420   NaN                 Serbia  1847      2.0
429   NaN                 Serbia  1848      2.0
439   NaN                 Serbia  1849  

3e) Write a function that does the operation in 3d and use it to display the subset that has the word "sudan" (all lower cap) in country. Then do the same for the word "vietnam" (all lower cap). **(1 point)**
<br>
Hint: options of functions can be very useful.

In [562]:
# 3e) function for the country dive-in
def subsetting_country(country_name):
    # same operation as in 3d; case=False makes the match case-insensitive, so lower-case input works
    return subset_iso3.loc[subset_iso3["country"].str.contains(country_name, case=False)]
subsetting_country("sudan")


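| | iso3 | country | year | polity2 |
|---|---|---|---|---|
| 1234 | NaN | Sudan-North | 2011 | -4.0 |
| 1241 | NaN | Sudan-North | 2012 | -4.0 |
| 1244 | NaN | Sudan-North | 2013 | -4.0 |
| 1252 | NaN | Sudan-North | 2014 | -4.0 |
| 1256 | NaN | Sudan-North | 2015 | -4.0 |
| 1262 | NaN | Sudan-North | 2016 | -4.0 |
| 1270 | NaN | Sudan-North | 2017 | -4.0 |
| 1273 | NaN | Sudan-North | 2018 | -4.0 |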


3f) Replace nan values in iso3 with correct iso3 for the 5 countries found in 3c from 2011 onwards, and display the subset with the fixed values to check that everything worked. **(1 point)**
<br>
Hint: the correct iso3 for these 5 countries are "ETH","MNE","SRB","SDN","VNM".
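
One possible vectorized way to do this replacement is sketched below on made-up rows. The dictionary pairs each country name with the code listed in the hint; the frame `toy` and the names `iso3_fix` and `mask` are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical) mixing cases that should and should not be filled
toy = pd.DataFrame({
    "iso3": [np.nan, np.nan, np.nan, "ZWE"],
    "country": ["Serbia", "Serbia", "Bavaria", "Zimbabwe"],
    "year": [2005, 2012, 1800, 2012],
})

# Country name -> ISO3 code, using the five codes given in the hint
iso3_fix = {"Ethiopia": "ETH", "Montenegro": "MNE", "Serbia": "SRB",
            "Sudan-North": "SDN", "Vietnam": "VNM"}

# Only fill rows that are missing, belong to one of the five countries, and are from 2011 onwards
mask = toy["iso3"].isna() & toy["country"].isin(list(iso3_fix)) & (toy["year"] > 2010)
toy.loc[mask, "iso3"] = toy.loc[mask, "country"].map(iso3_fix)

print(toy)
```

A boolean mask plus `map` avoids looping over the index row by row, which tends to matter on larger frames.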

In [563]:
# 3f) correcting iso codes

print(subset_iso3.to_string())

for index in subset_iso3.index:
    if subset_iso3.loc[index, 'country'] == 'Ethiopia' and subset_iso3.loc[index,"year"] >2010:
        subset_iso3.loc[index,'iso3']="ETH" 
    if subset_iso3.loc[index, 'country'] == 'Montenegro' and subset_iso3.loc[index,"year"] >2010:
        subset_iso3.loc[index,'iso3']="MNE" 
    if subset_iso3.loc[index, 'country'] == 'Serbia' and subset_iso3.loc[index,"year"] >2010:
        subset_iso3.loc[index,'iso3']="SRB" 
    if subset_iso3.loc[index, 'country'] == 'Sudan-North' and subset_iso3.loc[index,"year"] >2010:
        subset_iso3.loc[index,'iso3']="SDN" 
    if subset_iso3.loc[index, 'country'] == 'Vietnam' and subset_iso3.loc[index,"year"] >2010:
        subset_iso3.loc[index,'iso3']="VNM" 


print(subset_iso3.to_string())


     iso3                country  year  polity2
0     NaN           Wuerttemburg  1800     -7.0
1     NaN                  Korea  1800      1.0
2     NaN                Bavaria  1800    -10.0
3     NaN                  Korea  1801      1.0
4     NaN           Wuerttemburg  1801     -7.0
5     NaN                Bavaria  1801    -10.0
6     NaN                  Korea  1802      1.0
7     NaN           Wuerttemburg  1802     -7.0
8     NaN                Bavaria  1802    -10.0
9     NaN                  Korea  1803      1.0
10    NaN                Bavaria  1803    -10.0
11    NaN           Wuerttemburg  1803     -7.0
12    NaN                Bavaria  1804    -10.0
13    NaN                  Korea  1804      1.0
14    NaN           Wuerttemburg  1804     -7.0
15    NaN           Wuerttemburg  1805     -7.0
16    NaN                  Korea  1805      1.0
17    NaN                Bavaria  1805    -10.0
18    NaN                Bavaria  1806    -10.0
19    NaN           Wuerttemburg  1806  

3g) Drop the remaining rows which have nan in "iso3" and display the new number of rows of the dataframe (how many are they?) **(1 point)**

In [564]:
# 3g

subset_iso3_clean=subset_iso3.dropna(subset=["iso3"]) #dropping NaN for missing iso3

print(subset_iso3_clean)


     iso3      country  year  polity2
1230  MNE   Montenegro  2011      9.0
1231  SRB       Serbia  2011      8.0
1233  ETH     Ethiopia  2011     -3.0
1234  SDN  Sudan-North  2011     -4.0
1235  VNM      Vietnam  2011     -7.0
1236  MNE   Montenegro  2012      9.0
1238  SRB       Serbia  2012      8.0
1239  ETH     Ethiopia  2012     -3.0
1240  VNM      Vietnam  2012     -7.0
1241  SDN  Sudan-North  2012     -4.0
1242  VNM      Vietnam  2013     -7.0
1243  MNE   Montenegro  2013      9.0
1244  SDN  Sudan-North  2013     -4.0
1245  ETH     Ethiopia  2013     -3.0
1247  SRB       Serbia  2013      8.0
1248  SRB       Serbia  2014      8.0
1249  ETH     Ethiopia  2014     -3.0
1251  MNE   Montenegro  2014      9.0
1252  SDN  Sudan-North  2014     -4.0
1253  VNM      Vietnam  2014     -7.0
1254  SRB       Serbia  2015      8.0
1255  ETH     Ethiopia  2015     -3.0
1256  SDN  Sudan-North  2015     -4.0
1258  VNM      Vietnam  2015     -7.0
1259  MNE   Montenegro  2015      9.0
1260  MNE   

### Question 4: Check Polity2 <a class="anchor" id="question4"></a>

4a) Display the first and last year included in the dataset **(1 point)**

In [565]:
# 4a) first and last year in data
print(subset_iso3["year"].iloc[[0,-1]])


0       1800
1276    2018
Name: year, dtype: int32
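
Taking the first and last row only gives the first and last year if the rows happen to be sorted by year. A sketch that does not rely on row order, using made-up values, could look like this:

```python
import pandas as pd

# Toy data (hypothetical): years stored as strings and deliberately unsorted
toy = pd.DataFrame({"year": ["1803", "1800", "2018", "1999"]})

years = toy["year"].astype(int)  # compare as numbers, not as text
first_year, last_year = years.min(), years.max()

print(first_year, last_year)
```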


4b) What do the values in "polity2" represent? **(1 point)**

Answer 4b: 

4c) Do we have weird values for polity2? If yes, why? What should we do about them? Transform the data accordingly. **(1 point)**

Answer 4c:

In [566]:
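# 4c) check the range of polity2 for out-of-scale values
(subset_iso3 == 'Not Available').sum()  # counts cells holding the literal string 'Not Available' (not displayed, since it is not the last expression)
Max = subset_iso3["polity2"].max()
Min = subset_iso3["polity2"].min()
print(Max, Min)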

10.0 -10.0
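
A range check like the one above can be extended to flag and neutralize out-of-scale values. The sketch below uses made-up data and assumes that special codes such as -66, -77 and -88 (the standardized codes described in the Polity codebook) could appear in the column; whether they do in this file has to be checked on the real data:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two special codes and one missing value
toy = pd.DataFrame({"polity2": [10.0, -7.0, -66.0, -88.0, np.nan]})

# Valid scores lie between -10 and +10; non-missing values outside are flagged
out_of_range = ~toy["polity2"].between(-10, 10) & toy["polity2"].notna()
print(toy.loc[out_of_range, "polity2"].value_counts())

# Assumption: flagged codes carry no score information, so they are set to missing
toy["polity2_clean"] = toy["polity2"].where(~out_of_range)
print(toy)
```

Setting the codes to missing is only one option; the right treatment depends on what each code means for the analysis.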


4d) Make a map that shows the number of observations of polity2 by country **(1 point)**
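
A possible route to such a map is sketched below on made-up rows: first count the non-missing `polity2` values per `iso3`, then pass the counts to `plotly.express.choropleth`, which matches ISO-3 codes by default. The helper name `make_map` and the column name `n_obs` are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical rows); one polity2 value is missing
toy = pd.DataFrame({
    "iso3": ["SRB", "SRB", "VNM", "ETH"],
    "polity2": [8.0, None, -7.0, -3.0],
})

# Count non-missing polity2 values per country
obs_per_country = (toy.groupby("iso3")["polity2"].count()
                      .reset_index(name="n_obs"))
print(obs_per_country)


def make_map(counts):
    """Return a plotly choropleth of n_obs keyed by ISO3 code."""
    import plotly.express as px  # imported here so the counting step runs without plotly
    return px.choropleth(counts, locations="iso3", color="n_obs",
                         title="Number of polity2 observations by country")
```

In a notebook, `make_map(obs_per_country).show()` would then render the interactive map.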

In [567]:
# Answer 4d


4e) Store the final dataframe (the one you obtained after 4d) in an object called df_pol **(1 point)**

In [591]:
# 4e)
df_pol=subset_iso3


## Quality of Government Environmental Indicators <a class="anchor" id="qog"></a>

The QoG Environmental Indicators dataset (QoG-EI) (Povitkina, Marina, Natalia Alvarado Pachon & Cem Mert Dalli. 2021. The Quality of Government Environmental Indicators Dataset, version Sep21. University of Gothenburg: The Quality of Government Institute, https://www.gu.se/en/quality-government) is a compilation of indicators measuring countries' environmental performance over time, including the presence and stringency of environmental policies, environmental outcomes (emissions, deforestation, etc.), and public opinion on the environment. Codebook and data are available [here](https://www.gu.se/en/quality-government/qog-data/data-downloads/environmental-indicators-dataset).

### Question 5: Import the data and do few fixes <a class="anchor" id="question5"></a>

5a) Import data from the Quality of Government Environmental Indicators Dataset and display the variables types and the number of rows **(1 point)**
<br>
Hint: When you go on the webpage of the Environmental Indicators Dataset, you can directly import from a URL by copying the link address of the dataset! 

In [596]:
# 5a) import the QoG Environmental Indicators dataset
url_2 = "https://www.qogdata.pol.gu.se/data/qog_ei_sept21.csv"
df_coq = pd.read_csv(url_2, encoding="latin-1")  # read the file with latin-1 encoding
type(df_coq)

pandas.core.frame.DataFrame

5b) Rename the variable "ccodealp" to "iso3" **(1 point)**

In [570]:
# Answer 5b


5c) Check whether the variables "year" and "iso3" are strings; if not, convert them to string **(1 point)**
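
The renaming and type conversion asked for in 5b and 5c can be sketched on a made-up frame (the values are hypothetical; only the column names come from the questions):

```python
import pandas as pd

# Toy frame (hypothetical values) with the QoG-style column name
toy = pd.DataFrame({"ccodealp": ["USA", "CHE"], "year": [2012, 2013]})

# rename returns a new frame, so the result is assigned back
toy = toy.rename(columns={"ccodealp": "iso3"})

# Convert both key columns to string so they match the key types of the other dataset
for col in ["year", "iso3"]:
    toy[col] = toy[col].astype(str)

print(toy.dtypes)
```

Matching key types matters for the later merge, since pandas does not treat the integer 2012 and the string "2012" as the same key.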

In [571]:
# Answer 5c


### Question 6: Merge QOG and Polity5 ... issues with QOG? <a class="anchor" id="question6"></a>

6a) Get a subset of the dataframe that includes the variables "cname", "iso3", "year" and "cckp_temp", and display the number of rows. **(1 point)**

In [572]:
# Answer 6a


6b) Merge this subset (left) and the clean version of the polity data (right), using the argument how="left". Was the merge successful? If yes, how many rows does the merged dataframe have? Is it the same number of rows as the subset in 6a? **(1 point)**

In [573]:
# Answer 6b


6c) Do the same by adding the argument validate="one-to-one". Can you make some hypotheses on why you get an error? **(1 point)**
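
What the `validate` argument checks can be sketched on made-up data (all values are hypothetical). Note that pandas spells the option `"one_to_one"` (or `"1:1"`). When a key combination is repeated on either side, the merge raises `pandas.errors.MergeError`:

```python
import pandas as pd

# Toy left table (hypothetical values): the key USA-2012 appears twice
left = pd.DataFrame({"iso3": ["USA", "USA", "CHE"],
                     "year": ["2012", "2012", "2012"],
                     "cckp_temp": [9.1, None, 6.0]})
right = pd.DataFrame({"iso3": ["USA", "CHE"],
                      "year": ["2012", "2012"],
                      "polity2": [10.0, 10.0]})

# Without validation the merge runs silently, duplicates included
merged = left.merge(right, on=["iso3", "year"], how="left")
print(len(left), len(merged))

# With validation, the repeated key is reported as an error
try:
    left.merge(right, on=["iso3", "year"], how="left", validate="one_to_one")
    merge_error = None
except pd.errors.MergeError as err:
    merge_error = str(err)
print(merge_error)
```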

In [574]:
# Answer 6c


6d) Consider the subset of the QOG you obtained in 6a and write a code to (i) count the number of observations for the variable "cckp_temp" for each combination of iso3 and year, (ii) store the results in a dataframe. For example, the combination "USA-2012" should have 1 observation for "cckp_temp", so the result of your code should be 1. The code should do this for all iso3-year combinations of your subset dataframe, and store the results in a dataframe. **(1 point)**
<br>
Hint: it should not take you more than 2 lines of code.
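
A two-line version of such a count can be sketched with `groupby` on made-up data (the values are hypothetical, and the result column name `n_obs` is an assumption):

```python
import numpy as np
import pandas as pd

# Toy subset (hypothetical values): USA-2012 is duplicated, CHE-2013 is missing
toy = pd.DataFrame({
    "iso3": ["USA", "USA", "CHE", "CHE"],
    "year": ["2012", "2012", "2012", "2013"],
    "cckp_temp": [9.1, 9.3, 6.0, np.nan],
})

# count() ignores missing values, so it returns the number of actual observations
counts = toy.groupby(["iso3", "year"])["cckp_temp"].count().reset_index(name="n_obs")
print(counts)
```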

In [575]:
# Answer 6d


6e) Use the code in 6d to write a function that displays all rows of the dataframe obtained in 6a that have more than one observation of "cckp_temp" for each iso3-year combination, and check if it works. **(1 point)**
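
One way to turn the count into a reusable function is sketched below. The function name `show_duplicates` and its parameters are hypothetical; it uses `transform("count")` so that each row carries the number of observations of its own iso3-year group:

```python
import pandas as pd


def show_duplicates(df, value_col, keys=("iso3", "year")):
    """Return the rows whose key combination has more than one observation of value_col."""
    n_obs = df.groupby(list(keys))[value_col].transform("count")
    return df.loc[n_obs > 1]


# Toy subset (hypothetical values): only USA-2012 is duplicated
toy = pd.DataFrame({
    "iso3": ["USA", "USA", "CHE"],
    "year": ["2012", "2012", "2012"],
    "cckp_temp": [9.1, 9.3, 6.0],
})

dups = show_duplicates(toy, "cckp_temp")
print(dups)
```

Passing the dataframe and the value column as arguments lets the same function be reused later on a different dataframe.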

In [576]:
# Answer 6e


6f) Which countries have more than one observation for each iso3-year combination? Deal with these countries in the subset dataframe created in 6a to make sure you no longer have double observations for iso3-year combinations, and check that after your fix this is actually the case. **(1 point)**
<br>
Hint: should we keep a country with all missing values?
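
The idea behind the hint can be sketched on made-up data. Two hypothetical country names share one iso3 code, and one of them has no temperature values at all. Dropping the all-missing country is one possible fix, and it is checked with `duplicated`:

```python
import numpy as np
import pandas as pd

# Toy subset (hypothetical): two entities share the code XXX, one has only missing values
toy = pd.DataFrame({
    "cname": ["Oldland", "Oldland", "Newland", "Newland"],
    "iso3": ["XXX", "XXX", "XXX", "XXX"],
    "year": ["2012", "2013", "2012", "2013"],
    "cckp_temp": [np.nan, np.nan, 20.1, 20.3],
})

# Number of non-missing values per country name, repeated on every row of that country
n_valid = toy.groupby("cname")["cckp_temp"].transform("count")

# Keep only countries with at least one actual observation
fixed = toy.loc[n_valid > 0]

# Check: no iso3-year combination should appear twice any more
print(fixed.duplicated(subset=["iso3", "year"]).any())
```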

In [577]:
# Answer 6f


6g) If your check went well, now you can perform the same operation directly in the QOG dataframe (not in the subset dataframe created in 6a). How many rows does the QOG dataframe have now? **(1 point)**

In [578]:
# Answer 6g


### Question 7: Merge QOG and Polity5 ... issues with Polity5? <a class="anchor" id="question7"></a>

7a) Merge the cleaned QOG dataframe (left) and the Polity dataframe (right) using the options how="left" and validate="one_to_one". Does it work? Why? **(1 point)**

In [579]:
# Answer 7a


7b) Use the function you wrote in 6e to check what's wrong in the "clean" version of Polity **(1 point)**

In [580]:
# Answer 7b


7c) Drop or fix the countries that create troubles directly in the "clean" version of Polity and motivate your choices. **(1 point)**

In [581]:
# Answer 7c


7d) Try now to merge the "clean-clean" versions of QOG and Polity (the ones you obtained in 6g and 7c), always using the options how="left" and validate="one_to_one". Does it work, and why? How many rows does the resulting merged dataframe have? **(1 point)**

In [582]:
# Answer 7d


### Question 8: Clean the merged dataframe <a class="anchor" id="question8"></a>

8a) In the merged dataframe, order the columns so that you have the "index" variables first and the variables with actual values last. **(1 point)**
<br>
Hint: index variables are "iso3", "year" and other similar variables you can find, and the variables with actual values are "polity2", "cckp_temp" and other similar variables you can find.
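
A generic way to reorder columns is sketched below on a made-up one-row frame (the values are hypothetical). The index columns are listed explicitly and every other column is appended after them in its existing order:

```python
import pandas as pd

# Toy merged frame (hypothetical values) with columns in arbitrary order
toy = pd.DataFrame({"polity2": [10.0], "iso3": ["USA"], "cckp_temp": [9.1],
                    "year": ["2012"], "cname": ["United States"]})

index_cols = ["iso3", "year", "cname"]                             # identifiers first
value_cols = [c for c in toy.columns if c not in index_cols]      # everything else after

toy = toy[index_cols + value_cols]
print(toy.columns.tolist())
```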

In [583]:
# Answer 8a


8b) Rename "cname" as "country" and "country" as "country_polity". **(1 point)**

In [584]:
# Answer 8b


8c) Save the clean merged dataframe as a csv in a subfolder called "clean_data" in your working directory **(1 point)**
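
Saving into a subfolder that may not exist yet can be sketched with `pathlib`. In this sketch a temporary directory stands in for the working directory, the frame is made up, and the file name is borrowed from the link given in Question 9:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (hypothetical values) standing in for the merged dataframe
toy = pd.DataFrame({"iso3": ["USA"], "year": ["2012"], "polity2": [10.0]})

# A temporary directory stands in for the working directory in this sketch
base_dir = Path(tempfile.mkdtemp())
out_dir = base_dir / "clean_data"
out_dir.mkdir(parents=True, exist_ok=True)  # create the subfolder if it is missing

out_path = out_dir / "df_qog_polity_merged.csv"
toy.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column on re-import
print(out_path.exists())
```

In the notebook itself, `Path.cwd()` would take the place of the temporary directory.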

In [585]:
# Answer 8c


## Exploratory Data Analysis <a class="anchor" id="eda"></a>

In this section you will define a research question and perform a preliminary Exploratory Data Analysis (EDA) to address - or better, start addressing - the question at hand. This exercise will be done along the lines of the analysis done by our own Quentin Gallea in "*A recipe to empirically answer any question quickly*" ([Towards Data Science, 2022](https://towardsdatascience.com/a-recipe-to-empirically-answer-any-question-quickly-22e48c867dd5)). In this article, Quentin shows the first steps of an EDA that aims to explore whether heat waves have pushed governments to implement regulations against climate change (causal link). The logic is that, as it gets hotter and hotter, governments become more aware of climate change, and the problems it can cause to society, and start addressing it. In Quentin's analysis, heat waves (proxied by temperature) are the "main explanatory variable", rainfall is the "explanatory variable for heterogeneity", and regulations against climate change (proxied by the Environmental Policy Stringency Index) are the "outcome variable". He finds that countries with relatively high temperatures have indeed implemented more regulations against climate change. This is especially true when rainfall levels are low: when it does not rain, the damage of extreme heat is more evident to legislators, who therefore apply stricter regulations against these phenomena.
<br>
<br>
In this exercise, you will be asked to do a similar analysis on a research question of your choice, using at least two of the variables of the dataset we have created in the former questions (QOG + Polity). For example, "what is the average temperature in 2010?" is not a valid research question (univariate), while "what is the impact of high temperatures on the stringency of climate regulations?" is a valid research question (at least bivariate). As before, we will ask you some (this time more general and open) questions, and you should report your answer in the cells below each question. Use a mix of markdown and code cells to answer (markdown for text and code for graphs and tables). We should be able to run all the graphs, i.e. screenshots of graphs are not accepted. Note that for now we have put only one markdown cell and one code cell for the answer, but feel free to add as many cells as you need.
<br>
Beyond the Python code, we will grade the interpretations of the results and the coding decisions you make.
<br>
<br>
Let your creativity guide you and let's have some fun!

### Question 9: Selecting the ingredients (how I select the variables) <a class="anchor" id="question9"></a>
We have saved the clean merged data that resulted from the previous questions in "clean_data_prepared_EDA" (it should be the same as the one you saved in "clean_data"). Import the clean merged data from "clean_data_prepared_EDA" using this [link](https://raw.githubusercontent.com/edoardochiarotti/class_datascience/main/Notebooks/Assignment/individual_assignment/clean_data_prepared_EDA/df_qog_polity_merged.csv). Explore the variables in the newly obtained dataframe by checking the documentation of QOG and Polity. Then, define a research question that addresses a causal link between at least two of these variables. Describe the research question, why you are addressing it and the variables of interest (outcome variable, main explanatory variable and explanatory variable for heterogeneity). **(3 points)**

Answer 9:

In [586]:
# Answer 9:

### Question 10: Picking the right quantity of each ingredient (how I select my sample) <a class="anchor" id="question10"></a>
Explore the data availability of your variables of interest and select a clean sample for the analysis. Describe this sample with the help of summary-statistics tables and maps. **(3 points)**
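
Checking data availability and fixing a sample can be sketched as follows on made-up data (the values are hypothetical, and the two variables are only placeholders for whichever variables are chosen):

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical values) with gaps in both variables
toy = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011],
    "cckp_temp": [9.1, np.nan, 9.3, 6.0],
    "polity2": [10.0, 8.0, np.nan, 10.0],
})

# Non-missing observations per year, for each variable of interest
availability = toy.groupby("year")[["cckp_temp", "polity2"]].count()
print(availability)

# Keep only rows where every variable of interest is observed
sample = toy.dropna(subset=["cckp_temp", "polity2"])
print(sample.describe())
```

The availability table helps decide which years to keep before any rows are dropped.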

Answer 10:

In [587]:
# Answer 10:

### Question 11: Tasting and preparing the ingredients (univariate analysis) <a class="anchor" id="question11"></a>
Do a univariate analysis for each variable you have chosen (outcome variable, main explanatory variable and explanatory variable for heterogeneity):
- Prepare the variable, for example see if you need to transform the data further, i.e. log-transform, define a categorical variable, deal with outliers, etc.
- Understand the nature of the variable, i.e. continuous, categorical, binary, etc., which then allows you to pick the right statistical tool in the bivariate analysis.
- Get an idea of the variable's behaviour across time and space.

Describe these steps and the conclusions you can draw with the help of histograms, tables, maps and line graphs. **(3 points)**
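
Two of the preparation steps named above, a log-transform and the definition of a categorical variable, can be sketched on a made-up skewed variable (the column name `x` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy variable (hypothetical) spanning several orders of magnitude
toy = pd.DataFrame({"x": [1.0, 10.0, 100.0, 1000.0]})

# Log-transform to compress the right tail (only valid for strictly positive values)
toy["log_x"] = np.log(toy["x"])

# Split at the median into two equally sized groups
toy["x_group"] = pd.qcut(toy["x"], q=2, labels=["low", "high"])

print(toy)
```

A median split like this is a simple way to build the grouping variable used to look at heterogeneity.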

Answer 11:

In [588]:
# Answer 11:

### Question 12: Cooking the ingredients together (bivariate analysis) <a class="anchor" id="question12"></a>

Considering the "nature" of your variables (continuous, categorical, binary, etc.), pick the right tool / tools for a preliminary bivariate analysis, i.e. correlation tables, bar/line graphs, scatter plots, etc. Use these tools to describe your preliminary bivariate analysis and your findings. **(3 points)**
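
For two continuous variables, a first bivariate check is a correlation, computed overall and then within the groups of the heterogeneity variable. The sketch below uses made-up numbers and hypothetical column names, chosen so that the pooled correlation hides two opposite relationships:

```python
import pandas as pd

# Toy data (hypothetical): the relationship is positive in one group, negative in the other
toy = pd.DataFrame({
    "temp": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "policy": [2.0, 4.0, 6.0, 6.0, 4.0, 2.0],
    "rain_group": ["low", "low", "low", "high", "high", "high"],
})

# Pearson correlation on the pooled sample
overall = toy["temp"].corr(toy["policy"])

# Same correlation within each group of the heterogeneity variable
by_group = {name: g["temp"].corr(g["policy"]) for name, g in toy.groupby("rain_group")}

print(overall)
print(by_group)
```

This is why splitting by the heterogeneity variable is worth doing before concluding that two variables are unrelated.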

Answer 12:

In [589]:
# Answer 12:

### Question 13: Tasting the new recipe (conclusion) <a class="anchor" id="question13"></a>

Explain what you learned, the problems you faced, and what you would do next (you can suggest other data you would like to have, etc.). **(2 points)**

Answer 13: