# Applied Data Analysis and Machine Learning - Class Project
In this class project, you will work with data from the Global Preferences Survey, a globally representative dataset on risk and time preferences, positive and negative reciprocity, altruism, and trust. <br>
Further information can be found on the website (https://gps.iza.org/) and in the paper "Global Evidence on Economic Preferences" by Falk, Becker, Dohmen, Enke, Huffman, and Sunde, published in *The Quarterly Journal of Economics* 133(4): 1645–1692, 2018 (https://doi.org/10.1093/qje/qjy013).

**IMPORTANT:** <br>
Please enter the matriculation numbers of all group members here:
1. XXXXXX
2. YYYYYY
3. ZZZZZZ

In this class project, you will use the different techniques taught in the course: data handling, data visualization, and machine learning.

First, load the necessary packages. <br>
If you want to use additional libraries, add them to the following cell:

In [1]:
# Data handling and plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Interactive maps
import folium
from folium.plugins import MarkerCluster

# Apply seaborn's default plot theme
sns.set()

## Problem 1 - Data Handling
Your work is based on a dataset containing the characteristics and preferences of individual respondents (`individual_new.csv`):
- *country*: Country name
- *isocode*: Three-letter country code (ISO 3166-1 alpha-3)
- *ison*: Three-digit country code (ISO 3166-1 numeric)
- *region*: Subnational region of interview
- *language*: Interview language
- *date*: Interview date
- *id*: Respondent ID
- *wgt*: Sampling weight of the observation
- *patience*: Level of patience
- *risktaking*: Willingness to take risks
- *posrecip*: Positive reciprocity
- *negrecip*: Negative reciprocity
- *altruism*: Level of altruism
- *trust*: Level of trust
- *subj_math_skills*: Subjective math skills from 0 to 10
- *female*: Indicator for female
- *age*: Age

Note that the variables *patience*, *risktaking*, *posrecip*, *negrecip*, *altruism* and *trust* are normalized to mean 0 and standard deviation 1.

In [2]:
data = pd.read_csv("individual_new.csv", sep=",")
data

| Unnamed: 0 | country | isocode | ison | region | language | date | id | wgt | patience | risktaking | posrecip | negrecip | altruism | trust | subj_math_skills | female | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Turkey | TUR | 792 | Adana | Turkish | 07 Mar 12 | 7100800000001 | 0.271783 | 0.047176 | 1.020203 | 0.594384 | -0.367175 | -0.139953 | 1.679754 | 7.0 | 1 | 26 |
| 1 | Turkey | TUR | 792 | Adana | Turkish | 08 Mar 12 | 7100800000002 | 0.271783 | -0.675698 | 0.387177 | 0.662234 | 0.077251 | -0.139953 | 0.950434 | 3.0 | 1 | 50 |
| 2 | Turkey | TUR | 792 | Adana | Turkish | 08 Mar 12 | 7100800000003 | 0.442259 | 0.318254 | 1.020203 | -0.000930 | 0.077251 | -0.606967 | 0.585774 | 7.0 | 1 | 21 |
| 3 | Turkey | TUR | 792 | Adana | Turkish | 07 Mar 12 | 7100800000004 | 1.423671 | 0.498972 | 1.271527 | 0.959891 | 0.077251 | 0.560569 | 0.585774 | 7.0 | 0 | 24 |
| 4 | Turkey | TUR | 792 | Adana | Turkish | 07 Mar 12 | 7100800000005 | 0.705356 | 0.589331 | 1.122619 | 1.325398 | -0.367175 | 0.327062 | 1.679754 | 9.0 | 0 | 24 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 80332 | Egypt | EGY | 818 | Aswan | Arabic | 22 Nov 12 | 7400200001196 | 1.293363 | -0.675698 | 1.150581 | 0.662234 | 0.373536 | -0.451296 | -0.143547 | 5.0 | 1 | 39 |
| 80333 | Egypt | EGY | 818 | Aswan | Arabic | 22 Nov 12 | 7400200001197 | 0.759782 | -0.548160 | -1.697871 | 0.364577 | -0.732848 | -0.758613 | 0.221113 | 5.0 | 0 | 28 |
| 80334 | Egypt | EGY | 818 | Aswan | Arabic | 22 Nov 12 | 7400200001198 | 0.415252 | -0.803235 | -1.874741 | -0.528394 | 0.521678 | -0.139953 | 0.585774 | 5.0 | 1 | 45 |
| 80335 | Egypt | EGY | 818 | Aswan | Arabic | 22 Nov 12 | 7400200001199 | 1.139673 | -1.313386 | -0.047083 | -0.000930 | -0.488991 | -0.139953 | 0.585774 | 7.0 | 0 | 27 |


#### a)
Explore the dataset, for example:
- Check for missing values
- Check for the correct datatype of the variables

In [3]:
# Insert your code here.
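
As a starting point, the two checks listed above could look roughly like the sketch below. It runs on a tiny hand-made frame (`sample`, with invented values) instead of the real file, so the column subset and the numbers are assumptions for illustration only. The date format string is inferred from values such as `07 Mar 12` in the preview.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of individual_new.csv (values are made up).
sample = pd.DataFrame({
    "country": ["Turkey", "Turkey", "Egypt", "Egypt"],
    "date": ["07 Mar 12", "08 Mar 12", "22 Nov 12", "22 Nov 12"],
    "patience": [0.05, np.nan, -0.68, -0.55],
    "subj_math_skills": [7.0, 3.0, np.nan, 5.0],
    "female": [1, 1, 0, 1],
})

# Number of missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Data types as read; `date` starts out as a plain string column.
print(sample.dtypes)

# Parse dates written as day, abbreviated month name, two-digit year.
sample["date"] = pd.to_datetime(sample["date"], format="%d %b %y")
print(sample.dtypes)
```

On the real data, the same calls would be applied to `data` directly; whether other columns also need a type conversion is something to check rather than assume.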

#### b)
Identify potential correlations between different variables. <br>
Are there regional differences in certain preferences?

In [4]:
# Insert your code here.
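
One possible way to approach both questions is a correlation matrix over the numeric columns plus group means per country. The sketch below uses randomly generated stand-in data (`sample`), so its output says nothing about the real survey. Grouping by `country` is just one choice; `region` would work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the real analysis would use `data` instead.
rng = np.random.default_rng(0)
n = 200
sample = pd.DataFrame({
    "country": rng.choice(["Turkey", "Egypt"], size=n),
    "patience": rng.normal(size=n),
    "risktaking": rng.normal(size=n),
    "age": rng.integers(18, 80, size=n),
})

prefs = ["patience", "risktaking"]

# Pairwise Pearson correlations between preferences and age.
corr = sample[prefs + ["age"]].corr()
print(corr.round(2))

# Unweighted mean of each preference per country.
country_means = sample.groupby("country")[prefs].mean()
print(country_means.round(2))
```

These group means ignore the sampling weights in `wgt`, which is a simplification.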

## Problem 2 - Data Visualization
#### a)
To gain some first insights into the data, create meaningful plots on preferences, countries, demographics, etc. <br>
You can use any kind of plot that you deem useful: histograms, line plots, etc.

In [5]:
# Insert your code here.
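
To illustrate the two plot types mentioned above, here is a small sketch with a histogram and a line plot. The data are random stand-ins, and the choice of `patience`, the `female` split, and the 10-year age bands are assumptions made only for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the real plots would use `data` instead.
rng = np.random.default_rng(1)
sample = pd.DataFrame({
    "patience": rng.normal(size=300),
    "age": rng.integers(18, 80, size=300),
    "female": rng.integers(0, 2, size=300),
})

fig, (ax_hist, ax_age) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: distribution of one preference, split by the `female` indicator.
for value in (0, 1):
    subset = sample.loc[sample["female"] == value, "patience"]
    ax_hist.hist(subset, bins=20, alpha=0.6, label=f"female = {value}")
ax_hist.set_xlabel("patience (standardized)")
ax_hist.set_ylabel("number of respondents")
ax_hist.legend()

# Right panel: mean patience per 10-year age band as a line plot.
age_band = pd.cut(sample["age"], bins=list(range(10, 91, 10)))
band_means = sample.groupby(age_band, observed=True)["patience"].mean()
ax_age.plot([str(band) for band in band_means.index], band_means.to_numpy(), marker="o")
ax_age.set_xlabel("age band")
ax_age.set_ylabel("mean patience")
ax_age.tick_params(axis="x", rotation=45)

fig.tight_layout()
```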

#### b)
Pick at least one preference measure. <br>
For this measure, create an interactive map with `folium` that shows the average for each country in the given year. <br>
*Hint 1: Be cautious with country names.* <br>
*Hint 2: Consider the sampling weights when taking the (weighted) average.*

In [6]:
# Insert your code here.
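
Hint 2 can be made concrete with a weighted mean per country and year. The sketch below covers only this aggregation step, not the map itself, and works on a hand-made frame with invented numbers. The helper `weighted_mean` is a hypothetical name introduced here. Keying the result on `isocode` instead of the country name is one way to sidestep the naming issue from Hint 1 when joining to country geometries later.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows (values are made up).
sample = pd.DataFrame({
    "isocode": ["TUR", "TUR", "EGY", "EGY"],
    "date": ["07 Mar 12", "08 Mar 12", "22 Nov 12", "22 Nov 12"],
    "wgt": [1.0, 3.0, 2.0, 2.0],
    "trust": [1.0, 0.0, 0.5, -0.5],
})
sample["year"] = pd.to_datetime(sample["date"], format="%d %b %y").dt.year


def weighted_mean(group: pd.DataFrame, value: str, weight: str = "wgt") -> float:
    """Weighted average of `value`, skipping rows where value or weight is missing."""
    valid = group[[value, weight]].dropna()
    return float(np.average(valid[value], weights=valid[weight]))


# One row per country and year with the weighted average of the chosen measure.
rows = [
    {"isocode": iso, "year": year, "trust": weighted_mean(group, "trust")}
    for (iso, year), group in sample.groupby(["isocode", "year"])
]
country_year = pd.DataFrame(rows)
print(country_year)
```

For the two Turkish rows, the weighted mean is (1.0 · 1.0 + 3.0 · 0.0) / 4.0 = 0.25, whereas the unweighted mean would be 0.5.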

## Problem 3 - Supervised Machine Learning
#### a)
Try to predict the subjects' countries of origin using the information provided. <br>
Report the performance measures for different sets of predictor variables.

In [7]:
# Insert your code here.
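
A comparison across predictor sets could be organized as in the following sketch. It uses scikit-learn, which is not among the packages loaded above, and a synthetic two-country dataset in which the preferences differ by country by construction. The model (logistic regression), the metric (accuracy), and the feature sets are assumptions chosen for brevity, not a recommendation for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: two countries whose preferences differ by construction.
rng = np.random.default_rng(2)
n = 400
country = np.repeat(["A", "B"], n // 2)
shift = np.where(country == "A", -1.0, 1.0)
sample = pd.DataFrame({
    "country": country,
    "patience": rng.normal(loc=shift),
    "trust": rng.normal(loc=0.5 * shift),
    "age": rng.integers(18, 80, size=n),
})

feature_sets = {
    "age only": ["age"],
    "preferences": ["patience", "trust"],
}

scores = {}
for name, columns in feature_sets.items():
    # Hold out 25% of the rows, keeping the country shares equal in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        sample[columns], sample["country"],
        test_size=0.25, random_state=0, stratify=sample["country"],
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

With many countries in the real data, accuracy alone can be hard to interpret, so additional measures may be worth reporting.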

#### b)
Try to fill in the missing values of the preference measures using appropriate prediction models.

In [8]:
# Insert your code here.
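
The general pattern for model-based gap filling is to fit on the rows where the target is observed and to predict it where it is missing. The sketch below shows this with a linear regression from scikit-learn on synthetic data. The target (`trust`), the predictors, and the assumption that the predictors themselves are complete are all choices made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data in which `trust` depends on two other preferences.
rng = np.random.default_rng(3)
n = 300
altruism = rng.normal(size=n)
posrecip = rng.normal(size=n)
trust = 0.6 * altruism + 0.3 * posrecip + rng.normal(scale=0.5, size=n)
sample = pd.DataFrame({"altruism": altruism, "posrecip": posrecip, "trust": trust})

# Remove roughly 20% of the target values to mimic gaps.
sample.loc[rng.random(n) < 0.2, "trust"] = np.nan

target = "trust"
predictors = ["altruism", "posrecip"]  # assumed to contain no missing values
observed = sample[target].notna()

# Fit on rows with an observed target, then predict the target where it is missing.
model = LinearRegression().fit(sample.loc[observed, predictors], sample.loc[observed, target])
filled = sample.copy()
filled.loc[~observed, target] = model.predict(sample.loc[~observed, predictors])

print(f"filled {int((~observed).sum())} missing values")
```

Observed values are left untouched; only the gaps receive predictions.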

## Problem 4 - Unsupervised Machine Learning
#### a)
Use the preference measures to cluster *individuals*. <br>
What is the optimal number of clusters? <br>
Can you provide an intuition for the clusters you identified?

In [9]:
# Insert your code here.
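
One common way to choose the number of clusters is to compare silhouette scores across candidate values of k. The sketch below does this with k-means from scikit-learn on synthetic data with three clearly separated groups, so the "right" answer is known in advance. The real preference data will be far less clear-cut, and other criteria (for example the elbow method) may point to a different k.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: three well-separated groups in two made-up dimensions.
rng = np.random.default_rng(4)
centers = np.array([[-3.0, -3.0], [0.0, 3.0], [3.0, -3.0]])
points = np.vstack([rng.normal(loc=center, scale=0.5, size=(100, 2)) for center in centers])
sample = pd.DataFrame(points, columns=["patience", "risktaking"])

# Put all features on the same scale before computing distances.
X = StandardScaler().fit_transform(sample)

# Silhouette score for each candidate number of clusters (higher is better).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

To build an intuition for the resulting clusters, the cluster centers (mean of each preference per cluster) are a natural place to look.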

#### b)
Use the preference measures to cluster *countries*. <br>
What is the optimal number of clusters? <br>
Can you provide an intuition for the clusters you identified?

In [10]:
# Insert your code here.
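
Clustering countries requires one row per country, so the individual observations have to be aggregated first. The sketch below uses unweighted country means and agglomerative clustering from scikit-learn on synthetic data for six invented countries (`C1` to `C6`). The aggregation, the algorithm, and the fixed choice of two clusters are all assumptions for illustration; on the real data, weighted means using `wgt` may be more appropriate.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: six made-up countries forming two groups by construction.
rng = np.random.default_rng(5)
true_shift = {"C1": 1.0, "C2": 1.0, "C3": 1.0, "C4": -1.0, "C5": -1.0, "C6": -1.0}
frames = [
    pd.DataFrame({
        "country": name,
        "patience": rng.normal(loc=shift, size=50),
        "trust": rng.normal(loc=-shift, size=50),
    })
    for name, shift in true_shift.items()
]
sample = pd.concat(frames, ignore_index=True)

# Aggregate individuals to one preference profile per country (unweighted means).
country_profiles = sample.groupby("country")[["patience", "trust"]].mean()

# Standardize the profiles, then group countries hierarchically into two clusters.
X = StandardScaler().fit_transform(country_profiles)
country_profiles["cluster"] = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(country_profiles)
```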