# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code.
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Imports and set magics:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from matplotlib_venn import venn2

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject


# Read and clean data

We import our datasets from CSV files into pandas DataFrames and keep only the variables we need.

In [2]:
student = pd.read_csv('asgdnkz7.csv')

#Keep only the columns we need
cols_to_keep_s = ['IDSTUD', 'ASBG01', 'ASBG04', 'ASBGDML', 'ASBGSLM', 'ASBGSCM', 'ASMMAT01']

student = student[cols_to_keep_s]

#Drop rows with missing values for our primary variables
student = student.dropna(subset=['IDSTUD', 'ASBG01', 'ASBGSCM', 'ASMMAT01'])

#Make the gender variable a dummy 
student['ASBG01'] = student['ASBG01'].replace({1: 0, 2: 1}).astype(int)

#Turn the number-of-books variable into a dummy (1 if the category is above 3; missing values become 0)
student['ASBG04'] = student['ASBG04'].apply(lambda x: 1 if x > 3 else 0)

#Rename variables
student = student.rename(columns={
    'ASBG01': 'male',
    'ASBG04': 'many_books',
    'ASBGDML': 'bad_beh_mat', 
    'ASBGSLM': 'like_mat', 
    'ASBGSCM': 'conf_mat', 
    'ASMMAT01': 'score_mat'
})

student

Unnamed: 0,IDSTUD,male,many_books,bad_beh_mat,like_mat,conf_mat,score_mat
0,50010101.0,1,0,10.17203,7.98325,9.67422,564.16173
1,50010102.0,1,1,9.65042,8.17345,9.42421,589.88375
2,50010103.0,0,0,9.15042,9.72723,11.70061,514.63289
3,50010106.0,0,0,12.34583,3.85307,6.32925,422.80754
4,50010107.0,1,0,10.79976,3.85307,7.70454,470.52185
...,...,...,...,...,...,...,...
3688,52020115.0,0,0,11.19445,8.14669,8.85216,519.89076
3689,52020120.0,1,0,9.65042,8.14669,9.22621,535.43639
3690,52020121.0,1,0,10.17203,9.31319,11.14456,523.01763
3691,52020123.0,0,0,8.61844,7.63586,7.91361,419.64660


In [7]:
teacher = pd.read_csv('atgdnkz7.csv')

#Keep only the columns we need
cols_to_keep_t = ['IDTEALIN', 'IDLINK', 'ATBG02', 'ATBG05AC', 'ATDMMEM']

teacher = teacher[cols_to_keep_t]

#Drop rows with missing values for our primary variables
teacher = teacher.dropna(subset=['IDTEALIN', 'IDLINK', 'ATBG02', 'ATDMMEM'])

#Make the gender variable a dummy 
teacher['ATBG02'] = teacher['ATBG02'].replace({1: 0, 2: 1}).astype(int)

#Make the major in mathematics variable a dummy
teacher['ATBG05AC'] = teacher['ATBG05AC'].replace({2: 0})


#Turn the categorical education variable into dummies
#(the column names below assume ATDMMEM takes exactly five values, in sorted order)
dummies = pd.get_dummies(teacher['ATDMMEM'])
dummies.columns = ['mat_and_educ', 'educ', 'mat', 'other_major', 'no_major']

dummies = dummies.astype(int)

#Append the new dummies to the teacher dataset (column-wise concatenation on the index)
teacher = pd.concat([teacher, dummies], axis=1)

#Rename variables (the teacher gender column is ATBG02, not the student column ASBG01)
teacher = teacher.rename(columns={
    'ATBG02': 'male_t',
    'ATBG05AC': 'mat_major',
})

teacher

Unnamed: 0,IDTEALIN,IDLINK,male_t,mat_major,ATDMMEM,mat_and_educ,educ,mat,other_major,no_major
0,50010101.0,1.0,0,1.0,3.0,0,0,1,0,0
1,50020103.0,3.0,1,1.0,1.0,1,0,0,0,0
2,50020204.0,4.0,1,1.0,3.0,0,0,1,0,0
3,50030202.0,2.0,1,0.0,4.0,0,0,0,1,0
4,50030404.0,4.0,1,0.0,4.0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
302,52000101.0,1.0,1,1.0,3.0,0,0,1,0,0
305,52000303.0,3.0,0,1.0,1.0,1,0,0,0,0
306,52010101.0,1.0,1,0.0,2.0,0,1,0,0,0
307,52020101.0,1.0,1,0.0,4.0,0,0,0,1,0


In [None]:
ts_link = pd.read_csv('astdnkz7.csv')

#Keep only the columns we need
cols_to_keep_ts = ['IDSTUD', 'IDTEALIN', 'IDLINK', 'MATSUBJ']
ts_link = ts_link[cols_to_keep_ts]

#Drop rows with missing values for our primary variables
ts_link = ts_link.dropna(subset=['IDSTUD', 'IDTEALIN', 'IDLINK', 'MATSUBJ'])
ts_link
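
The three cells above repeat the same pattern: select columns, then drop rows with missing values in the key variables. As a minimal sketch, that pattern could live in `dataproject.py` as a helper. The function name `keep_and_drop` and the toy data are assumptions made for illustration; they are not part of the original module or the real files.

```python
import pandas as pd


def keep_and_drop(df, cols_to_keep, required):
    """Keep only `cols_to_keep` and drop rows with missing values in `required`.

    Hypothetical helper, not part of the original `dataproject.py`.
    """
    missing = set(required) - set(cols_to_keep)
    if missing:
        raise ValueError(f"required columns not kept: {sorted(missing)}")
    # .copy() returns an independent frame, so later column edits do not
    # trigger pandas' SettingWithCopyWarning
    return df[cols_to_keep].dropna(subset=required).copy()


# Toy data (made up); only the ID column names follow the link file
toy = pd.DataFrame({
    'IDSTUD': [1.0, 2.0, None],
    'IDTEALIN': [10.0, None, 30.0],
    'IDLINK': [1.0, 2.0, 3.0],
    'EXTRA': ['a', 'b', 'c'],
})

clean = keep_and_drop(toy, ['IDSTUD', 'IDTEALIN', 'IDLINK'], ['IDSTUD', 'IDTEALIN'])
print(clean.shape)
```

With a helper like this, each cleaning cell reduces to one call followed by the dataset-specific recoding.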

## Explore each data set

To **explore the raw data**, we provide **static** and **interactive plots** of the main variables.

We start with the student dataset.

In [None]:
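def plot_func(column):
    """Plot overlaid histograms of `column`, one for each value of the `male` dummy."""
    for gender in student['male'].unique():
        student[student['male'] == gender][column].plot(kind='hist', rwidth=0.8, alpha=0.5, label=f'male = {gender}')
    plt.xlabel(column)
    plt.legend()
    plt.show()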

columns = ['many_books', 'bad_beh_mat', 'like_mat', 'conf_mat', 'score_mat']
widgets.interact(plot_func, column=widgets.Dropdown(options=columns, value='many_books', description='Column:'))

Explain what you see when moving elements of the interactive plot around. 

We then look at the teacher dataset.

In [None]:
teacher['ATDMMEM'].plot(kind='hist', rwidth=0.8)
plt.show()

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of an (inner) **merge**:

In [None]:
plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('Data X', 'Data Y'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped')
v.get_label_by_id('110').set_text('included')
plt.show()

Here we are dropping elements from both data set X and data set Y. A left join would keep all observations in data X intact and subset only from Y. 

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 
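
One way to check which observations are thrown away is to compare row counts before and after the merge and to inspect an outer merge with `indicator=True`. The sketch below uses made-up toy frames; only the column names `IDSTUD`, `IDTEALIN` and `score_mat` follow the notebook, and the values are not from the real files.

```python
import pandas as pd

# Toy data (made up); only the column names follow the notebook
student_toy = pd.DataFrame({
    'IDSTUD': [1, 2, 3, 4],
    'score_mat': [510.0, 480.0, 530.0, 495.0],
})
link_toy = pd.DataFrame({
    'IDSTUD': [2, 3, 4, 5],
    'IDTEALIN': [10, 10, 20, 20],
})

# Inner merge: only students present in both data sets survive
inner = pd.merge(student_toy, link_toy, on='IDSTUD', how='inner')

# Outer merge with indicator=True adds a `_merge` column that records
# whether each row came from the left frame, the right frame, or both
outer = pd.merge(student_toy, link_toy, on='IDSTUD', how='outer', indicator=True)
dropped = outer[outer['_merge'] != 'both']

print(f'rows: student={len(student_toy)}, link={len(link_toy)}, inner={len(inner)}')
print(dropped[['IDSTUD', '_merge']])
```

The rows in `dropped` are exactly those an inner merge discards. If a key value appears more than once in either frame, the merged data can have more rows than the inputs, so the row counts are worth checking in both directions.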

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 
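
As a minimal sketch of such an aggregation, the example below groups a made-up toy frame by the `male` dummy and reports the mean, standard deviation and group size of `score_mat`. The column names follow the cleaned student data; the values are invented for illustration.

```python
import pandas as pd

# Toy data (made up); column names follow the cleaned student data
toy = pd.DataFrame({
    'male': [0, 0, 1, 1, 1],
    'many_books': [0, 1, 0, 1, 1],
    'score_mat': [500.0, 540.0, 480.0, 520.0, 545.0],
})

# Mean, standard deviation and group size of the maths score by gender
summary = toy.groupby('male')['score_mat'].agg(['mean', 'std', 'count'])
print(summary)
```

Grouping by a list such as `['male', 'many_books']` gives the same statistics for each combination of the two dummies.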

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.