# MGT-499 Statistics and Data Science - Individual Assignment

In [1]:
# Import here what you need
import pandas as pd
import numpy as np

This notebook contains the individual assignment for the class MGT-499 Statistics and Data Science. Important information:
- **Content**: the assignment is divided into two main parts, namely data cleaning (2 datasets) and Exploratory Data Analysis, for a total of 13 main questions (see table of contents). Some of these main questions are divided into sub-questions. In the first part, the questions are very specific, while in the second part they are more open.
- **Deadline**: Tuesday 8th of November at 23:59. 
- **Final Output**: a Jupyter notebook, which we (teachers) can run. 
- **Answering the Questions**: you will find the questions in markdown cells below. Under each of these cells, you will find one or more cells for answers. Type your answer there. For the answer to be correct, the cell with the answer must run without error (unless specified). You can use markdown cells for the answers that require text.
- **Submission**: submit the assignment on Moodle, under [Individual Assignment](https://moodle.epfl.ch/mod/assign/view.php?id=1222846)

## Content
- [Polity5 Dataset](#polity5)  
    - [Question 1: Import the data and get a first glance](#question1)
    - [Question 2: Select some variables](#question2)
    - [Question 3: Missing Values](#question3)
    - [Question 4: Check Polity2](#question4)
- [Quality of Government (QOG) Environmental Indicators Dataset](#qog)  
    - [Question 5: Import the data and do a few fixes](#question5)
    - [Question 6: Merge QOG and Polity5 ... first attempt](#question6)
    - [Question 7: Merge QOG and Polity5 ... second attempt](#question7)
    - [Question 8: Clean the merged dataframe](#question8)
- [Exploratory Data Analysis](#eda)
    - [Question 9: Selecting the ingredients for the recipe (how I select the variables)](#question9)  
    - [Question 10: Picking the right quantity of each ingredient (how I select my sample)](#question10)
    - [Question 11: Tasting and preparing the ingredients (univariate analysis)](#question11)
    - [Question 12: Cooking the ingredients together (bivariate analysis)](#question12)
    - [Question 13: Tasting the new recipe (conclusion)](#question13)

## Polity5 data <a class="anchor" id="polity5"></a>

Polity5 is a widely used democracy scale. The raw data and the codebook are available [here](http://www.systemicpeace.org/inscrdata.html). For this assignment, we have slightly modified the original version; for example, we have added the iso3 code for countries to save you time. You can find the modified version [here](https://raw.githubusercontent.com/edoardochiarotti/class_datascience/main/Notebooks/Assignment/individual_assignment/data/polity2_iso3.csv).

### Question 1: import the data and get a first glance <a class="anchor" id="question1"></a>

1a) Import the csv 'polity2_iso3.csv' (file provided in the link [here](https://raw.githubusercontent.com/edoardochiarotti/class_datascience/main/Notebooks/Assignment/individual_assignment/data/polity2_iso3.csv)) as a pandas dataframe (ignore the warning message) **(1 point)**

In [2]:
# Answer 1a
url = "https://raw.githubusercontent.com/edoardochiarotti/class_datascience/main/Notebooks/Assignment/individual_assignment/data/polity2_iso3.csv"

df=pd.read_csv(url)


1b) Display the first 10 rows **(1 point)**

In [3]:
# Answer 1b
df.head(10)

Unnamed: 0,iso3,year,p5,cyear,ccode,scode,country,flag,fragment,democ,...,interim,bmonth,bday,byear,bprec,post,change,d5,sf,regtrans
0,,1800,0,2711800,271,WRT,Wuerttemburg,0,,0,...,,1.0,1.0,1800.0,1.0,-7.0,88.0,1.0,,
1,,1800,0,7301800,730,KOR,Korea,0,,5,...,,1.0,1.0,1800.0,1.0,1.0,88.0,1.0,,
2,,1800,0,2451800,245,BAV,Bavaria,0,,0,...,,1.0,1.0,1800.0,1.0,-10.0,88.0,1.0,,
3,,1801,0,7301801,730,KOR,Korea,0,,5,...,,,,,,,,,,
4,,1801,0,2711801,271,WRT,Wuerttemburg,0,,0,...,,,,,,,,,,
5,,1801,0,2451801,245,BAV,Bavaria,0,,0,...,,,,,,,,,,
6,,1802,0,7301802,730,KOR,Korea,0,,5,...,,,,,,,,,,
7,,1802,0,2711802,271,WRT,Wuerttemburg,0,,0,...,,,,,,,,,,
8,,1802,0,2451802,245,BAV,Bavaria,0,,0,...,,,,,,,,,,
9,,1803,0,7301803,730,KOR,Korea,0,,5,...,,,,,,,,,,


1c) Display the data types of all the variables included in the data **(1 point)**

In [4]:
# Answer 1c
df.dtypes

iso3         object
year          int64
p5            int64
cyear         int64
ccode         int64
scode        object
country      object
flag          int64
fragment    float64
democ         int64
autoc         int64
polity        int64
polity2     float64
durable      object
xrreg         int64
xrcomp        int64
xropen        int64
xconst        int64
parreg        int64
parcomp       int64
exrec       float64
exconst       int64
polcomp     float64
prior        object
emonth       object
eday         object
eyear        object
eprec        object
interim      object
bmonth       object
bday         object
byear        object
bprec        object
post         object
change       object
d5           object
sf           object
regtrans     object
dtype: object

1d) By looking at your answer in 1c, what is the difference between the different types of variables? Why is the type of some variables defined as object? **(1 point)**

Answer 1d:
Pandas uses its own names for data types. Here is a description:

|Pandas type|Native Python type|Description|
|:-------|:-------|:----------|
|`object` | `str` (or mixed) | The most general dtype. It is assigned to a column that holds text or mixed types (numbers and strings). |
|`int64` | `int` | Whole numbers. 64 is the number of bits used to store each value. |
|`float64` | `float` | Numbers with decimals. If a column contains numbers and NaNs, pandas defaults to float64, because NaN is itself a float. |

Our data frame contains `object` (e.g., strings such as `country`), `int64` and `float64` columns. Several variables that look numeric in 1b, such as `bmonth`, `byear` and `post`, are nevertheless typed as `object`. A likely reason is that these columns mix numbers with non-numeric entries (for example blank strings), which is also what the warning at import suggests.
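To make the dtype rules above concrete, here is a minimal sketch on made-up data. The column names and values are hypothetical and do not come from Polity5; the sketch only illustrates how pandas infers `int64`, `float64` and `object`, and one possible way to convert a mixed column.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not from Polity5) to illustrate dtype inference
toy = pd.DataFrame({
    "only_ints": [1, 2, 3],            # whole numbers -> int64
    "ints_with_nan": [1, np.nan, 3],   # a missing value forces float64
    "mixed": [1, "1", "n/a"],          # numbers and strings -> object
})
print(toy.dtypes)

# An object column that mostly holds numbers can be converted explicitly;
# errors="coerce" turns entries that cannot be parsed into NaN
mixed_as_number = pd.to_numeric(toy["mixed"], errors="coerce")
print(mixed_as_number.dtype)
```

Whether such a conversion is appropriate depends on what the non-numeric entries mean, so it is worth inspecting them with `unique()` first.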

### Question 2. Select some variables <a class="anchor" id="question2"></a>

2a) Create a subset dataframe that contains the variables 'iso3', 'country', 'year', 'polity2' and display it **(1 point)**

In [5]:
# Answer 2a
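# Select the columns with a list and take an explicit copy,
# so that later assignments do not trigger a SettingWithCopyWarning
df_subset = df.loc[:, ['iso3', 'country', 'year', 'polity2']].copy()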
df_subset

Unnamed: 0,iso3,country,year,polity2
0,,Wuerttemburg,1800,-7.0
1,,Korea,1800,1.0
2,,Bavaria,1800,-10.0
3,,Korea,1801,1.0
4,,Wuerttemburg,1801,-7.0
...,...,...,...,...
17569,ZWE,Zimbabwe,2014,4.0
17570,ZWE,Zimbabwe,2015,4.0
17571,ZWE,Zimbabwe,2016,4.0
17572,ZWE,Zimbabwe,2017,4.0


2b) Display the type of the variable "year" **(1 point)**

In [6]:
# Answer 2b
df_subset.dtypes["year"]

dtype('int64')

2c) Convert the variable "year" to string **(1 point)**
<br>
Hint: if you get a warning message of the type "SettingWithCopyWarning", it is because you did not subset the data in the right way. Go back to your class notes and check the different ways to subset a dataframe, and try again. If you do it correctly, you will not get the warning message.
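The hint above can be illustrated with a small sketch on a toy dataframe (the values are hypothetical stand-ins for the Polity5 data). It assumes the usual remedy: select the columns with `.loc` and a list, take an explicit `.copy()`, and only then modify the subset.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for the Polity5 data
toy = pd.DataFrame({
    "iso3": ["CHE", "CHE", "FRA"],
    "year": [2016, 2017, 2017],
    "polity2": [10.0, 10.0, 9.0],
    "flag": [0, 0, 0],
})

# Select columns with a list and take an explicit copy, so that later
# assignments modify an independent dataframe rather than a view of `toy`
toy_subset = toy.loc[:, ["iso3", "year", "polity2"]].copy()
toy_subset["year"] = toy_subset["year"].astype(str)

print(toy_subset.dtypes)
print(toy["year"].dtype)  # the original column is left unchanged
```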

In [7]:
# Answer 2c
df_subset['year'] = df_subset['year'].apply(str)

df_subset.dtypes

iso3        object
country     object
year        object
polity2    float64
dtype: object

In [8]:
print(df_subset.dtypes["year"])

object


In [9]:
df_subset.dtypes

iso3        object
country     object
year        object
polity2    float64
dtype: object

In [10]:
df_subset.dtypes["year"]

dtype('O')

### Question 3: Missing Values <a class="anchor" id="question3"></a>

3a) Subset the rows that have iso3 missing and display **(1 point)**

In [11]:
# Answer 3a
df_subset['iso3'].isna()

0         True
1         True
2         True
3         True
4         True
         ...  
17569    False
17570    False
17571    False
17572    False
17573    False
Name: iso3, Length: 17574, dtype: bool
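The output above is only the boolean mask, not the subset of rows the question asks for. A minimal sketch of the missing step, on toy data with hypothetical values, is to pass the mask back to the dataframe as an indexer:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with two missing iso3 codes
toy = pd.DataFrame({
    "iso3": [np.nan, "KOR", np.nan, "ZWE"],
    "country": ["Bavaria", "Korea South", "Saxony", "Zimbabwe"],
})

missing_mask = toy["iso3"].isna()   # boolean Series, one value per row
missing_rows = toy[missing_mask]    # keeps only the rows where the mask is True
print(missing_rows)
```

Applied to the notebook, the same pattern would read `df_subset[df_subset['iso3'].isna()]`.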

In [12]:
df.loc[0, 'iso3']

nan

In [14]:
df_subset

Unnamed: 0,iso3,country,year,polity2
0,,Wuerttemburg,1800,-7.0
1,,Korea,1800,1.0
2,,Bavaria,1800,-10.0
3,,Korea,1801,1.0
4,,Wuerttemburg,1801,-7.0
...,...,...,...,...
17569,ZWE,Zimbabwe,2014,4.0
17570,ZWE,Zimbabwe,2015,4.0
17571,ZWE,Zimbabwe,2016,4.0
17572,ZWE,Zimbabwe,2017,4.0


3b) Display the countries that have missing iso3. What can you tell by looking at them? Any similarities? **(1 point)**

In [15]:
# Answer 3b
df_subset[(df_subset['iso3'].isna())].country.unique()

array(['Wuerttemburg', 'Korea', 'Bavaria', 'Saxony', 'Parma', 'Tuscany',
       'Sardinia', 'Modena', 'Two Sicilies', 'Baden', 'Gran Colombia',
       'United Province CA', 'Serbia', 'Orange Free State', 'Yemen North',
       'Czechoslovakia', 'USSR', 'Germany West', 'Germany East',
       'Pakistan', 'South Vietnam', 'Yemen South', 'Vietnam',
       'Yugoslavia', 'Ethiopia', 'Serbia and Montenegro', 'Montenegro',
       'Sudan-North'], dtype=object)

3c) Display the countries with missing iso3 from 2011. **(1 point)**

In [16]:
# Answer 3c
# 'year' is stored as a string (see 2c), so cast it to int for the comparison
missing_since_2011 = df_subset[(df_subset['iso3'].isna()) & (df_subset['year'].astype(int) >= 2011)]
countries_missing_since_2011 = missing_since_2011['country'].unique()
print(countries_missing_since_2011)

['Montenegro' 'Serbia' 'Ethiopia' 'Sudan-North' 'Vietnam']


3d) Display the rows for which the column "country" contains the word "Serbia". By looking at the result, can you tell what happened to Serbia in 2006? **(1 point)**
<br>
Hint: the most general way of doing this is to use a combination of re.search and list comprehension. To display the full subset, you can use print(df.to_string()).

In [17]:
# Answer 3d
df_subset_serbia = df_subset[(df_subset['country'] == 'Serbia')]
print(df_subset_serbia)

     iso3 country  year  polity2
224   NaN  Serbia  1830     -7.0
230   NaN  Serbia  1831     -7.0
252   NaN  Serbia  1832     -7.0
261   NaN  Serbia  1833     -7.0
272   NaN  Serbia  1834     -7.0
...   ...     ...   ...      ...
1248  NaN  Serbia  2014      8.0
1254  NaN  Serbia  2015      8.0
1263  NaN  Serbia  2016      8.0
1269  NaN  Serbia  2017      8.0
1276  NaN  Serbia  2018      8.0

[104 rows x 4 columns]


In [18]:
print(df_subset_serbia.to_string())

     iso3 country  year  polity2
224   NaN  Serbia  1830     -7.0
230   NaN  Serbia  1831     -7.0
252   NaN  Serbia  1832     -7.0
261   NaN  Serbia  1833     -7.0
272   NaN  Serbia  1834     -7.0
286   NaN  Serbia  1835     -7.0
295   NaN  Serbia  1836     -7.0
301   NaN  Serbia  1837     -7.0
318   NaN  Serbia  1838      2.0
333   NaN  Serbia  1839      2.0
344   NaN  Serbia  1840      2.0
357   NaN  Serbia  1841      2.0
363   NaN  Serbia  1842      2.0
369   NaN  Serbia  1843      2.0
387   NaN  Serbia  1844      2.0
394   NaN  Serbia  1845      2.0
410   NaN  Serbia  1846      2.0
420   NaN  Serbia  1847      2.0
429   NaN  Serbia  1848      2.0
439   NaN  Serbia  1849      2.0
450   NaN  Serbia  1850      2.0
467   NaN  Serbia  1851      2.0
475   NaN  Serbia  1852      2.0
482   NaN  Serbia  1853      2.0
492   NaN  Serbia  1854      2.0
502   NaN  Serbia  1855      2.0
523   NaN  Serbia  1856      2.0
533   NaN  Serbia  1857      2.0
542   NaN  Serbia  1858     -9.0
553   NaN 

The dataset has a gap with no data for Serbia (1920-2006). This is because Serbia was an autonomous kingdom until 1918. From that year until 2006 it was part of other states (e.g., Yugoslavia). In 2006 it became independent again.

In [19]:
df_subset_serbia_twothousandsix = df_subset[(df_subset['country'] == 'Serbia') & (df_subset['year'] == '2006')]
df_subset_serbia_twothousandsix

Unnamed: 0,iso3,country,year,polity2
1213,,Serbia,2006,8.0


3e) Write a function that does the operation in 3d and use it to display the subset that has the word "sudan" (all lowercase) in country. Then do the same for the word "vietnam" (all lowercase). **(1 point)**
<br>
Hint: options of functions can be very useful.

In [20]:
# Answer 3e
df_subset[(df_subset['country'] == 'sudan')]

Unnamed: 0,iso3,country,year,polity2


In [21]:
df_subset[(df_subset['country'] == 'vietnam')]

Unnamed: 0,iso3,country,year,polity2
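Both cells above return empty dataframes because `==` tests exact, case-sensitive equality, while the question asks for rows that contain the word. A minimal sketch of a function along the lines of the hints (re.search, a list comprehension, and an option for case sensitivity) could look like this. The function name, its parameters and the toy years are illustrative assumptions, not part of the assignment.

```python
import re
import pandas as pd

def rows_matching_country(df, word, ignore_case=True, column="country"):
    """Return the rows of df whose `column` contains `word` (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.escape(word)  # treat the word literally, not as a regex
    mask = [re.search(pattern, str(name), flags) is not None for name in df[column]]
    return df[mask]

# Toy data: country names appear in the dataset, the years are made up
toy = pd.DataFrame({
    "country": ["Serbia", "Serbia and Montenegro", "Sudan-North",
                "South Sudan", "Vietnam", "South Vietnam", "Zimbabwe"],
    "year": ["2006", "2005", "2011", "2011", "1976", "1975", "2017"],
})

print(rows_matching_country(toy, "Serbia").to_string())
print(rows_matching_country(toy, "sudan").to_string())
print(rows_matching_country(toy, "vietnam").to_string())
```

With `ignore_case=False` the lowercase search terms would match nothing, which mirrors the empty results above.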


3f) Replace nan values in iso3 with correct iso3 for the 5 countries found in 3c from 2011 onwards, and display the subset with the fixed values to check that everything worked. **(1 point)**
<br>
Hint: the correct iso3 for these 5 countries are "ETH","MNE","SRB","SDN","VNM".

In [22]:
df_subset.isna().sum() 

iso3       1270
country       0
year          0
polity2     263
dtype: int64

In [23]:
# Answer 3f
new_iso3 = df_subset.copy()

countries = ['Montenegro', 'Serbia', 'Ethiopia', 'Sudan-North', 'Vietnam']
iso3_real_codes = ['MNE', 'SRB', 'ETH', 'SDN', 'VNM']

# 'year' was converted to string in 2c, so cast it back to int for the comparison
from_2011 = new_iso3['year'].astype(int) >= 2011

for country_name, iso3_code in zip(countries, iso3_real_codes):
    # Fill iso3 only for this country's rows from 2011 onwards
    mask = (new_iso3['country'] == country_name) & from_2011
    new_iso3.loc[mask, 'iso3'] = iso3_code

# Display all rows of the five countries to check the result
print(new_iso3[new_iso3['country'].isin(countries)].to_string())
            

     iso3      country  year  polity2
224   NaN       Serbia  1830     -7.0
230   NaN       Serbia  1831     -7.0
252   NaN       Serbia  1832     -7.0
261   NaN       Serbia  1833     -7.0
272   NaN       Serbia  1834     -7.0
286   NaN       Serbia  1835     -7.0
295   NaN       Serbia  1836     -7.0
301   NaN       Serbia  1837     -7.0
318   NaN       Serbia  1838      2.0
333   NaN       Serbia  1839      2.0
344   NaN       Serbia  1840      2.0
357   NaN       Serbia  1841      2.0
363   NaN       Serbia  1842      2.0
369   NaN       Serbia  1843      2.0
387   NaN       Serbia  1844      2.0
394   NaN       Serbia  1845      2.0
410   NaN       Serbia  1846      2.0
420   NaN       Serbia  1847      2.0
429   NaN       Serbia  1848      2.0
439   NaN       Serbia  1849      2.0
450   NaN       Serbia  1850      2.0
467   NaN       Serbia  1851      2.0
475   NaN       Serbia  1852      2.0
482   NaN       Serbia  1853      2.0
492   NaN       Serbia  1854      2.0
502   NaN   

3g) Drop the remaining rows which have nan in "iso3" and display the new number of rows of the dataframe (how many are there?) **(1 point)**

In [24]:
# Answer 3g
new_iso3_drop = new_iso3.dropna(subset=['iso3'])
new_iso3_drop

Unnamed: 0,iso3,country,year,polity2
1230,MNE,Montenegro,2011,9.0
1231,SRB,Serbia,2011,8.0
1232,SSD,South Sudan,2011,0.0
1233,ETH,Ethiopia,2011,-3.0
1234,SDN,Sudan-North,2011,-4.0
...,...,...,...,...
17569,ZWE,Zimbabwe,2014,4.0
17570,ZWE,Zimbabwe,2015,4.0
17571,ZWE,Zimbabwe,2016,4.0
17572,ZWE,Zimbabwe,2017,4.0


In [25]:
len(new_iso3_drop)

16344

### Question 4: Check Polity2 <a class="anchor" id="question4"></a>

4a) Display the first and last year included in the dataset **(1 point)**

In [26]:
# Answer 4a
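# Use min/max rather than the first/last row, so the answer does not depend on row order
# ('year' holds four-digit strings, so string ordering matches numeric ordering)
print(df_subset['year'].min())
print(df_subset['year'].max())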

1800
2018


4b) What do the values in "polity2" represent? **(1 point)**

Answer 4b: 
Polity2 scores the political regime of a country in a given year. It ranges from -10 (full autocracy) to 10 (full democracy).

In [27]:
new_iso3_drop['polity2']

1230     9.0
1231     8.0
1232     0.0
1233    -3.0
1234    -4.0
        ... 
17569    4.0
17570    4.0
17571    4.0
17572    4.0
17573    4.0
Name: polity2, Length: 16344, dtype: float64

4c) Do we have weird values for polity2? If yes, why? What should we do about them? Transform the data accordingly. **(1 point)**

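Answer 4c: Yes, we have weird values. The variable should only take values between -10 and 10, but it also contains some 'nan' values as well as -88 and -66. According to the Polity codebook, -66 and -88 are special codes (foreign interruption and transition periods, respectively) rather than regime scores, so they should not enter the analysis as numbers. We therefore keep only the observations with values between -10 and 10.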

In [30]:
# Answer 4c
new_iso3_drop['polity2'].unique()

array([  9.,   8.,   0.,  -3.,  -4.,  -7.,   1.,  -6.,  -8., -10.,  nan,
        -1.,  -2.,  -9.,  -5.,   3.,   5.,   7.,   2.,   6.,  10.,   4.,
       -88., -66.])

In [29]:
# Filter new_iso3_drop with its own mask (indexing df_subset with a mask built
# from new_iso3_drop fails, because the two dataframes have different indexes).
# between(-10, 10) drops the special codes (-66, -88) and the NaN values.
new_iso3_drop_clean = new_iso3_drop[new_iso3_drop['polity2'].between(-10, 10)].copy()
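An alternative to dropping rows is to recode the special values as missing and decide later how to handle them. The sketch below uses toy data with placeholder iso3 codes; treating -66 and -88 as missing is an assumption based on their role as special codes, not something the assignment prescribes.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); "AAA" and "BBB" are placeholder codes
toy = pd.DataFrame({
    "iso3": ["AAA", "AAA", "BBB", "BBB"],
    "polity2": [8.0, -66.0, -88.0, np.nan],
})

# Assumption: -66 and -88 are special codes, not scores, so treat them as missing
special_codes = [-66, -88]
toy["polity2"] = toy["polity2"].replace(special_codes, np.nan)

# Keep only rows with a valid score in [-10, 10]; rows with NaN are dropped here
toy_clean = toy[toy["polity2"].between(-10, 10)]
print(toy_clean)
```

Recoding first keeps the country-year rows available, which can matter later when counting observations per country.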

4d) Make a map that shows the number of observations of polity2 by country **(1 point)**

In [None]:
# Answer 4d


4e) Store the final dataframe (the one you obtained after 4d) in an object called df_pol **(1 point)**

In [None]:
# Answer 4e


## Quality of Government Environmental Indicators <a class="anchor" id="qog"></a>

The QoG Environmental Indicators dataset (QoG-EI) (Povitkina, Marina, Natalia Alvarado Pachon & Cem Mert Dalli. 2021. The Quality of Government Environmental Indicators Dataset, version Sep21. University of Gothenburg: The Quality of Government Institute, https://www.gu.se/en/quality-government) is a compilation of indicators measuring countries' environmental performance over time, including the presence and stringency of environmental policies, environmental outcomes (emissions, deforestation, etc.), and public opinion on the environment. Codebook and data are available [here](https://www.gu.se/en/quality-government/qog-data/data-downloads/environmental-indicators-dataset).

### Question 5: Import the data and do a few fixes <a class="anchor" id="question5"></a>

5a) Import data from the Quality of Government Environmental Indicators Dataset and display the variables types and the number of rows **(1 point)**
<br>
Hint: When you go on the webpage of the Environmental Indicators Dataset, you can directly import from a URL by copying the link address of the dataset! 

In [None]:
# Answer 5a
url = 'https://www.qogdata.pol.gu.se/data/qog_ei_sept21.csv'

df_five = pd.read_csv(url , encoding='latin-1')
df_five.dtypes

In [None]:
len(df_five.index)

5b) Rename the variable "ccodealp" to "iso3" **(1 point)**

In [None]:
# Answer 5b
df_five.rename(columns = {'ccodealp': 'iso3'}, inplace = True)

5c) Check that the variables "year" and "iso3" are strings; if not, convert them to string **(1 point)**

In [None]:
# Answer 5c
print(df_five.dtypes["iso3"])

In [None]:
print(df_five.dtypes["year"])

In [None]:
df_five['year'] = df_five['year'].apply(str) #we convert 'year' into a string

print(df_five.dtypes["year"])

### Question 6: Merge QOG and Polity5 ... issues with QOG? <a class="anchor" id="question6"></a>

6a) Get a subset of the dataframe that includes the variables "cname", "iso3", "year" and "cckp_temp", and display the number of rows. **(1 point)**

In [None]:
# Answer 6a
# Select the columns with a list and take an explicit copy of the subset
df_five_subset = df_five.loc[:, ['cname', 'iso3', 'year', 'cckp_temp']].copy()
df_five_subset

In [None]:
len(df_five_subset)

6b) Merge this subset (left) and the clean version of the polity data (right), using the argument how="left". Was the merge successful? If yes, how many rows does the merged dataframe have? Is it the same number of rows as the subset in 6a? **(1 point)**

In [None]:
# Answer 6b


6c) Do the same by adding the argument validate="one_to_one". Can you make some hypotheses on why you get an error? **(1 point)**

In [None]:
# Answer 6c
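# The cell was left empty. What follows is a minimal sketch on toy data, not the
# assignment's actual result: the frames, the placeholder codes "AAA"/"BBB" and
# the values are hypothetical. It shows how duplicated keys on one side change a
# left merge, and how validate="one_to_one" turns that into an explicit error.

```python
import pandas as pd

# Toy frames (hypothetical); the right one has a duplicated iso3-year key
left = pd.DataFrame({
    "iso3": ["AAA", "BBB"],
    "year": ["2010", "2010"],
    "cckp_temp": [10.5, 22.1],
})
right = pd.DataFrame({
    "iso3": ["AAA", "AAA", "BBB"],
    "year": ["2010"] * 3,
    "polity2": [8.0, 7.0, -3.0],
})

# Without validation the merge succeeds, but the duplicated key adds a row
merged = left.merge(right, on=["iso3", "year"], how="left")
print(len(left), len(merged))

# With validation pandas checks that the keys are unique on both sides
try:
    left.merge(right, on=["iso3", "year"], how="left", validate="one_to_one")
    validation_failed = False
except pd.errors.MergeError as err:
    validation_failed = True
    print("MergeError:", err)
```

Note that the key columns must have the same type in both frames (here both `year` columns are strings), otherwise the merge fails for a different reason.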


6d) Consider the subset of the QOG you obtained in 6a and write a code to (i) count the number of observations for the variable "cckp_temp" for each combination of iso3 and year, (ii) store the results in a dataframe. For example, the combination "USA-2012" should have 1 observation for "cckp_temp", so the result of your code should be 1. The code should do this for all iso3-year combinations of your subset dataframe, and store the results in a dataframe. **(1 point)**
<br>
Hint: it should not take you more than 2 lines of code.

In [None]:
# Answer 6d


6e) Use the code in 6d to write a function that displays all rows of the dataframe obtained in 6a that have more than one observation of "cckp_temp" for each iso3-year combination, and check if it works. **(1 point)**

In [None]:
# Answer 6e
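# The cells for 6d and 6e were left empty. Below is a minimal sketch on toy data
# of one possible approach, assuming a groupby on the iso3-year keys. The helper
# name, the country names and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy subset (hypothetical); "AAA"-"2012" appears twice with a value
toy = pd.DataFrame({
    "cname": ["A-land", "A-land (old)", "B-land", "B-land"],
    "iso3": ["AAA", "AAA", "BBB", "BBB"],
    "year": ["2012", "2012", "2011", "2012"],
    "cckp_temp": [10.1, 10.3, 25.0, np.nan],
})

# 6d-style count: non-missing cckp_temp values per iso3-year combination
counts = (toy.groupby(["iso3", "year"])["cckp_temp"]
             .count()
             .reset_index(name="n_obs"))
print(counts)

def rows_with_duplicates(df, value_col, keys=("iso3", "year")):
    """Return the rows whose key combination has more than one observation."""
    keys = list(keys)
    # transform("count") repeats each group's count on every row of the group
    n_obs = df.groupby(keys)[value_col].transform("count")
    return df[n_obs > 1]

print(rows_with_duplicates(toy, "cckp_temp"))
```

Because `count` ignores NaN, a combination whose only value is missing gets a count of 0 rather than 1.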


6f) Which countries have more than one observation for each iso3-year combination? Deal with these countries in the subset dataframe created in 6a to make sure you no longer have double observations for iso3-year combinations, and check that after your fix this is actually the case. **(1 point)**
<br>
Hint: should we keep a country with all missing values?

In [None]:
# Answer 6f


6g) If your check went well, now you can perform the same operation directly in the QOG dataframe (not in the subset dataframe created in 6a). How many rows does the QOG dataframe have now? **(1 point)**

In [None]:
# Answer 6g


### Question 7: Merge QOG and Polity5 ... issues with Polity5? <a class="anchor" id="question7"></a>

7a) Merge the cleaned QOG dataframe (left) and the Polity dataframe (right) using the options how="left" and validate="one_to_one". Does it work? Why? **(1 point)**

In [None]:
# Answer 7a


7b) Use the function you wrote in 6e to check what's wrong in the "clean" version of Polity **(1 point)**

In [None]:
# Answer 7b


7c) Drop or fix the countries that create troubles directly in the "clean" version of Polity and motivate your choices. **(1 point)**

In [None]:
# Answer 7c


7d) Try now to merge the "clean-clean" versions of QOG and Polity (the ones you obtained in 6g and 7c), always using the options how="left" and validate="one_to_one". Does it work, and why? How many rows does the resulting merged dataframe have? **(1 point)**

In [None]:
# Answer 7d


### Question 8: Clean the merged dataframe <a class="anchor" id="question8"></a>

8a) In the merged dataframe, order the columns so that you have the "index" variables first and the variables with actual values last. **(1 point)**
<br>
Hint: index variables are "iso3", "year" and other similar variables you can find, and the variables with actual values are "polity2", "cckp_temp" and other similar variables you can find.

In [None]:
# Answer 8a


8b) Rename "cname" as "country" and "country" as "country_polity". **(1 point)**

In [None]:
# Answer 8b
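# The cells for 8a and 8b were left empty. Below is a minimal sketch on a one-row
# toy dataframe with illustrative values. The choice of which columns count as
# "index" variables is an assumption based on the hint in 8a.

```python
import pandas as pd

# Toy merged frame (illustrative values), with columns in arbitrary order
toy = pd.DataFrame({
    "cname": ["Switzerland"],
    "cckp_temp": [5.5],
    "iso3": ["CHE"],
    "polity2": [10.0],
    "year": ["2010"],
    "country": ["Switzerland"],
})

# 8a: index variables first, then every remaining column in its current order
index_cols = ["iso3", "year", "cname", "country"]
value_cols = [col for col in toy.columns if col not in index_cols]
toy = toy[index_cols + value_cols]

# 8b: both renames in one call, so "cname" -> "country" does not clash
# with the existing "country" column
toy = toy.rename(columns={"cname": "country", "country": "country_polity"})
print(list(toy.columns))
```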


8c) Save the clean merged dataframe as a csv in a subfolder called "clean_data" in your working directory **(1 point)**

In [None]:
# Answer 8c
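# The cell for 8c was left empty. Below is a minimal sketch of saving a
# dataframe to a "clean_data" subfolder. The helper name and the default file
# name are assumptions, and the demo writes to a temporary directory; in the
# notebook one would pass the working directory (for example `Path.cwd()`).

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_clean_csv(df, base_dir, folder="clean_data",
                   filename="df_qog_polity_merged.csv"):
    """Write df to base_dir/folder/filename, creating the folder if needed."""
    out_dir = Path(base_dir) / folder
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    df.to_csv(out_path, index=False)  # index=False avoids an extra index column
    return out_path

# Toy frame (illustrative values)
toy = pd.DataFrame({"iso3": ["CHE"], "year": ["2010"], "polity2": [10.0]})

demo_dir = tempfile.mkdtemp()  # stand-in for the working directory
saved_path = save_clean_csv(toy, demo_dir)

# Read the file back; dtype keeps "year" as a string, as in the notebook
check = pd.read_csv(saved_path, dtype={"year": str})
print(check)
```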


## Exploratory Data Analysis <a class="anchor" id="eda"></a>

In this section you will define a research question and perform a preliminary Exploratory Data Analysis (EDA) to address - or better, start addressing - the question at hand. This exercise follows the analysis done by our own Quentin Gallea in "*A recipe to empirically answer any question quickly*" ([Towards Data Science, 2022](https://towardsdatascience.com/a-recipe-to-empirically-answer-any-question-quickly-22e48c867dd5)). In this article, Quentin shows the first steps of an EDA that explores whether heat waves have pushed governments to implement regulations against climate change (causal link). The logic is that, as it gets hotter and hotter, governments become more aware of climate change and the problems it can cause to society, and start addressing it. In Quentin's analysis, heat waves (proxied by temperature) are the "main explanatory variable", rainfall is the "explanatory variable for heterogeneity", and regulations against climate change (proxied by the Environmental Policy Stringency Index) are the "outcome variable". He finds that countries with relatively high temperatures have indeed implemented more regulations against climate change. This is especially true when rainfall levels are low: when it does not rain, the damage of extreme heat is more evident to legislators, who therefore apply stricter regulations against these phenomena.
<br>
<br>
In this exercise, you will be asked to do a similar analysis on a research question of your choice, using at least two of the variables of the dataset we have created in the former questions (QOG + Polity). For example, "what is the average temperature in 2010?" is not a valid research question (univariate), while "what is the impact of high temperatures on the stringency of climate regulations?" is a valid research question (at least bivariate). As before, we will ask you some (this time more general and open) questions, and you should report your answer in the cells below each question. Use a mix of markdown and code cells to answer (markdown for text and code for graphs and tables). We should be able to run all the graphs, i.e. screenshots of graphs are not accepted. Note that for now we have put only one markdown cell and one code cell for the answer, but feel free to add as many cells as you need.
<br>
Beyond the Python code, we will grade the interpretation of the results and the coding decisions you make.
<br>
<br>
Let your creativity guide you and let's have some fun!

### Question 9: Selecting the ingredients (how I select the variables) <a class="anchor" id="question9"></a>
We have saved the clean merged data that resulted from the previous questions in "clean_data_prepared_EDA" (it should be the same as the one you saved in "clean_data"). Import the clean merged data from "clean_data_prepared_EDA" using this [link](https://raw.githubusercontent.com/edoardochiarotti/class_datascience/main/Notebooks/Assignment/individual_assignment/clean_data_prepared_EDA/df_qog_polity_merged.csv). Explore the variables in the newly obtained dataframe by checking the documentation of QOG and Polity. Then, define a research question that addresses a causal link between at least two of these variables. Describe the research question, why you are addressing it and the variables of interest (outcome variable, main explanatory variable and explanatory variable for heterogeneity). **(3 points)**

Answer 9:

In [None]:
# Answer 9:
url = "https://raw.githubusercontent.com/edoardochiarotti/class_datascience/main/Notebooks/Assignment/individual_assignment/clean_data_prepared_EDA/df_qog_polity_merged.csv"

df_eda=pd.read_csv(url)
list(df_eda)

Answer 9: How do CO2 levels affect each country's regulations against climate change?
In my analysis, CO2 levels (proxied by temperature) are the "main explanatory variable", and regulations against climate change are the "outcome variable".

### Question 10: Picking the right quantity of each ingredient (how I select my sample) <a class="anchor" id="question10"></a>
Explore the data availability of your variables of interest and select a clean sample for the analysis. Describe this sample with the help of summary-statistics tables and maps. **(3 points)**

Answer 10:

In [None]:
# Answer 10:

### Question 11: Tasting and preparing the ingredients (univariate analysis) <a class="anchor" id="question11"></a>
Do a univariate analysis for each variable you have chosen (outcome variable, main explanatory variable and explanatory variable for heterogeneity):
- Prepare the variable, for example see if you need to transform the data further, i.e. log-transform, define a categorical variable, deal with outliers, etc.
- Understand the nature of the variable, i.e. continuous, categorical, binary, etc., which then allows to pick the right statistical tool in the bivariate analysis.
- Get an idea of the variable's behaviour across time and space.

Describe these steps and the conclusions you can draw with the help of histograms, tables, maps and line graphs. **(3 points)**

Answer 11:

In [None]:
# Answer 11:
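# The preparation steps listed in Question 11 can be sketched on toy data as
# follows. The variable `emissions`, its values and the placeholder iso3 codes
# are hypothetical. The regime cut-offs follow a common convention for Polity
# scores (-10 to -6 autocracy, -5 to 5 anocracy, 6 to 10 democracy), used here
# as an assumption.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "iso3": ["AAA", "BBB", "CCC", "DDD"],
    "emissions": [0.5, 4.0, 35.0, 900.0],   # strongly right-skewed variable
    "polity2": [-8.0, 2.0, 7.0, 10.0],
})

# Log-transform a skewed, strictly positive variable
toy["log_emissions"] = np.log(toy["emissions"])

# Turn the polity2 score into a categorical variable
toy["regime"] = pd.cut(
    toy["polity2"],
    bins=[-10, -6, 5, 10],
    labels=["autocracy", "anocracy", "democracy"],
    include_lowest=True,   # so that -10 itself falls in the first bin
)

# Summary statistics before and after the transformation, and category counts
print(toy[["emissions", "log_emissions"]].describe())
print(toy["regime"].value_counts())
```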

### Question 12: Cooking the ingredients together (bivariate analysis) <a class="anchor" id="question12"></a>

Considering the "nature" of your variables (continuous, categorical, binary, etc.), pick the right tool / tools for a preliminary bivariate analysis, i.e. correlation tables, bar/line graphs, scatter plots, etc. Use these tools to describe your preliminary bivariate analysis and your findings. **(3 points)**

Answer 12:

In [None]:
# Answer 12:
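# For two continuous variables, a correlation table is one of the tools named in
# Question 12. Below is a minimal sketch on toy data. The columns `x`, `y` and
# `group` are hypothetical stand-ins for the main explanatory variable, the
# outcome variable and a categorical variable for heterogeneity.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "group": ["low", "low", "low", "high", "high", "high"],
})

# Pearson correlation table for the two continuous variables
corr_table = toy[["x", "y"]].corr()
print(corr_table)

# Heterogeneity: the same correlation computed separately within each group
corr_by_group = {
    name: part["x"].corr(part["y"])
    for name, part in toy.groupby("group")
}
print(corr_by_group)
```

A correlation describes association only; on its own it does not establish the causal link the research question is about.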

### Question 13: Tasting the new recipe (conclusion) <a class="anchor" id="question13"></a>

Explain what you learned, the problems you faced, and what you would do next (you can suggest other data you would like to have, etc.). **(2 points)**

Answer 13: