# Exercise 19 - Statistical moments of histograms using Pandas
You are given data on Olympic athletes in the file `ex19_data.csv`, a filtered version of the complete dataset found at [Kaggle.com](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results/data).

From this, you are going to analyse medal distributions.<br>

Documentation for the pandas library can be found online at:
https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html <br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html <br>




## 19.1

Using `pandas.read_csv`, load the `ex19_data.csv` file available on Moodle. <br>

A Gold medal counts as 3 points, a Silver medal as 2 points and a Bronze medal as 1 point. If the `Medal` field for an athlete is `NaN`, assume the athlete has not won a medal (0 points).<br>

Add a new column `Score` to the DataFrame, which assigns to every `Medal` entry the points defined above.

**HINT**: Try initializing a new `pandas.Series`
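
As an illustration of the scoring rule, here is a minimal sketch on a small made-up table (the names and rows are hypothetical, not taken from the exercise file). It uses `Series.map` with a dictionary, which is only one possible way to build the new `Series`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table (hypothetical values).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Medal": ["Gold", np.nan, "Bronze", "Silver"],
})

# Points per medal, as defined in the task.
points = {"Gold": 3, "Silver": 2, "Bronze": 1}

# map() leaves NaN where there is no medal; fillna(0) turns that into 0 points.
toy["Score"] = toy["Medal"].map(points).fillna(0).astype(int)
print(toy)
```

The same three steps (map, fill, cast) should carry over to the full DataFrame once it is loaded.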




In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# YOUR ANSWER HERE

## 19.2
Divide the pandas DataFrame into two, one for _male_ and the other for _female_ participants. <br>

For both genders extract the following data from the DataFrame: <br>

**1)** Total participant number per age  <br>
**2)** The total score per age (sum the points for every medal won by athletes of a given age) <br>
**3)** The average score per participant for each age (total score divided by the number of participants of that age). <br>

In case an athlete participated in multiple competitions, you can count them as multiple participants. <br>
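
The three per-age quantities can be sketched with `groupby` on a toy table. The column names `Sex` and `Age`, the values `"F"`/`"M"`, and the helper `per_age` are assumptions made for this illustration; check them against the actual file.

```python
import pandas as pd

# Toy data (hypothetical); one row per participation.
toy = pd.DataFrame({
    "Sex":   ["F", "F", "F", "M", "M", "M"],
    "Age":   [20, 20, 25, 20, 25, 25],
    "Score": [3, 0, 1, 0, 2, 0],
})

# Boolean masks split the table by gender.
female = toy[toy["Sex"] == "F"]
male = toy[toy["Sex"] == "M"]

def per_age(frame):
    """Return participants, total score and mean score, indexed by age."""
    grouped = frame.groupby("Age")["Score"]
    return pd.DataFrame({
        "participants": grouped.size(),   # number of rows per age
        "total_score": grouped.sum(),     # summed points per age
        "mean_score": grouped.mean(),     # total_score / participants
    })

female_stats = per_age(female)
male_stats = per_age(male)
print(female_stats)
```

Because every row counts as one participant, an athlete who appears in several competitions is counted several times, as the task allows.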


In [None]:
# YOUR ANSWER HERE

## 19.3
Plot these 6 distributions into a single figure with 6 subplots.

_(row 0 corresponds to female participants, row 1 to male participants)_
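
A possible layout for the figure, sketched with random placeholder data (the arrays, labels and figure size are assumptions; replace them with the six distributions you computed):

```python
import numpy as np
import matplotlib.pyplot as plt

ages = np.arange(15, 40)
rng = np.random.default_rng(0)
# Placeholder data: axis 0 = (female, male), axis 1 = the three quantities.
data = rng.random((2, 3, ages.size))

row_labels = ["Female", "Male"]
col_labels = ["Participants", "Total score", "Average score"]

# 2 rows x 3 columns; axes is a 2-D array of Axes objects.
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)
for i, row in enumerate(row_labels):
    for j, col in enumerate(col_labels):
        ax = axes[i, j]
        ax.bar(ages, data[i, j])
        ax.set_title(f"{row}: {col}")
        ax.set_xlabel("Age")
fig.tight_layout()
```

In a notebook the figure is rendered at the end of the cell; in a plain script you would additionally call `plt.show()`.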

In [None]:
# YOUR ANSWER HERE

## 19.4

Define a function which computes the following (weighted) statistical quantities of the participant ages for each of the 6 distributions:

* The mean
* Standard deviation
* Skewness
* Kurtosis

**HINT**: You can convert a pandas Series to a NumPy array using the `to_numpy` method. `DataFrame.index` returns the index column of a DataFrame.

**Another HINT**:

* The weighted mean is given as $\bar{x} = \sum_{i} x_i \cdot w(x_i) / \sum_{i} w(x_i)$.

* The $k$-th weighted central moment is given as $m_k = \sum_{i} (x_i - \bar{x})^k \cdot w(x_i) / \sum_{i} w(x_i)$.

* The skewness is given as $m_3 / m_2^{3/2}$. <br>

* The kurtosis is given as $m_4 / m_2^2$. To be clear: $x$ is the age, $w(x)$ are the different distributions.
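
The formulas above translate almost line by line into NumPy. The following is a sketch, not a reference solution: the function name `weighted_moments` and the small test arrays are made up for illustration, and the standard deviation is taken as $\sqrt{m_2}$.

```python
import numpy as np

def weighted_moments(x, w):
    """Return (mean, std, skewness, kurtosis) of x weighted by w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    total = w.sum()
    mean = (x * w).sum() / total

    def central_moment(k):
        # k-th weighted central moment m_k
        return ((x - mean) ** k * w).sum() / total

    m2 = central_moment(2)
    std = np.sqrt(m2)
    skewness = central_moment(3) / m2 ** 1.5
    kurtosis = central_moment(4) / m2 ** 2
    return mean, std, skewness, kurtosis

# Tiny symmetric example: ages 1, 2, 3 with weights 1, 2, 1.
mean, std, skew, kurt = weighted_moments([1.0, 2.0, 3.0], [1.0, 2.0, 1.0])
print(mean, std, skew, kurt)
```

For this symmetric example the mean is 2, $m_2 = 0.5$, the skewness vanishes and the kurtosis is $0.5 / 0.25 = 2$, which is a quick sanity check before applying the function to the real distributions.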

In [None]:
# YOUR ANSWER HERE

# Exercise 20 - Molecular Dynamics of Gold and Copper


In this exercise, you will analyze the structural information of a given material by using simple statistical tools.

The AuCu bi-layer is a two-dimensional (2-D) film. In the ground state (i.e., near zero temperature, $T \sim 0$K), it consists of a completely flat layer of Au atoms above a completely flat layer of Cu atoms (see the figure below). At $T \sim 0$K, computational simulations (specifically, Density Functional Theory calculations) predict a value of $0.227$nm for the distance between the two layers. If the system is heated up, the atoms vibrate, changing the structural properties of the film (e.g., the layers are no longer completely flat).


The results of molecular dynamics simulations, which calculate the time evolution (in steps of $1$fs) of the AuCu bi-layer at specific temperatures, have been uploaded to Moodle. The simulation, $6878$ time steps in total, is divided into three `numpy` files:


- `MD_AuCu_100k.npy`,  $T \sim 100$K  with $500$ time steps,
- `MD_AuCu_300k.npy`,  $T \sim 300$K  with $1378$ time steps,
- `MD_AuCu_300k_2.npy`,  $T \sim 300$K  with $5000$ time steps.



Moreover, an additional simulation was performed independently, heating up the system and, simultaneously, applying a strain which elongates the film along its lateral directions:
- `MD_AuCu_strain.npy`,  $T \sim 300$K  with $5000$ time steps.

Technical information which may help you and/or answer your most curious questions:
- The simulations include $50$ atoms of Cu and $50$ atoms of Au.

- Periodic boundary conditions are applied (i.e., simulations of an infinitely large 2-D film).
- A rigid drift (i.e., the rigid translation) of the system along z occurs, but it is not physically relevant (there is nothing but the 2-D AuCu film in the simulation universe).

- Positions are expressed in Ångström.
- The stored arrays have four indices:
    * the time step (in fs)
    * the element type (0 = Cu, 1 = Au)
    * an index to identify the individual atom
    * the three xyz coordinates.

In [10]:
import matplotlib.pyplot as plt
plt.figure(dpi=120)
plt.imshow(plt.imread('ex20_vmd.png'))
plt.axis('off')
plt.show()


## 20.1
Load the 4 files using `NumPy` and join the three arrays from the non-strained simulations: 

* the one at 100K and
* the two at 300K. <br>

Make a scatter plot of the atoms' positions at time zero _(the first point in the 100K simulation)_ on a `matplotlib` 3D axis.
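
The joining and plotting steps can be tried out on synthetic arrays first. In this sketch the helper `fake_run` and its random positions are hypothetical stand-ins for `np.load(...)` of the real files, and the axis order is assumed to be (time, element, atom, xyz) as listed above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def fake_run(n_steps):
    # Stand-in for np.load("<file>.npy"): (time, element, atom, xyz)
    return rng.random((n_steps, 2, 50, 3))

run_100k = fake_run(500)
run_300k = fake_run(1378)
run_300k_2 = fake_run(5000)

# Join along the time axis (axis 0); the other axes must match.
trajectory = np.concatenate([run_100k, run_300k, run_300k_2], axis=0)
print(trajectory.shape)  # (6878, 2, 50, 3)

# 3D scatter of the first frame, one colour per element.
frame0 = trajectory[0]
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for element, label in enumerate(["Cu", "Au"]):
    x, y, z = frame0[element].T   # (50, 3) -> three arrays of length 50
    ax.scatter(x, y, z, label=label)
ax.set_xlabel("x (Å)")
ax.set_ylabel("y (Å)")
ax.set_zlabel("z (Å)")
ax.legend()
```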

In [None]:
# YOUR ANSWER HERE

## 20.2
At every time step, calculate the average position along z of the Cu and Au layers, separately.


In [8]:
# YOUR ANSWER HERE

## 20.3
At every time step, calculate the standard deviation of the z coordinate both for the Cu- and Au atoms, separately.
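
Both the per-layer mean and the per-layer standard deviation are reductions over the atom axis. A minimal sketch on a synthetic trajectory (the array `traj` and its noise level are made up; the axis order is assumed to be time, element, atom, xyz):

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_atoms = 4, 50

# Synthetic trajectory: small thermal noise around two flat layers.
traj = rng.normal(scale=0.05, size=(n_steps, 2, n_atoms, 3))
traj[:, 1, :, 2] += 2.27   # put the Au layer 2.27 Å above the Cu layer

z = traj[..., 2]            # z coordinates, shape (n_steps, 2, n_atoms)
z_mean = z.mean(axis=-1)    # average over atoms, shape (n_steps, 2)
z_std = z.std(axis=-1)      # spread over atoms,  shape (n_steps, 2)

cu_mean, au_mean = z_mean[:, 0], z_mean[:, 1]
cu_std, au_std = z_std[:, 0], z_std[:, 1]
print(cu_mean.shape, au_std.shape)
```

Reducing over the last axis keeps one value per time step and per element, so the Cu and Au results can be read off by indexing the element axis.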


In [7]:
# YOUR ANSWER HERE

## 20.4
Calculate the average distance between the Au and Cu layers.
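
One way to obtain the layer distance is to subtract the mean z positions of the two layers at each time step. The sketch below uses a synthetic trajectory with an artificial drift (all values hypothetical) to show that a rigid translation along z cancels in the difference; it also converts from Ångström to nm so the result can be compared with the $0.227$nm ground-state value.

```python
import numpy as np

ANGSTROM_TO_NM = 0.1
rng = np.random.default_rng(1)
n_steps, n_atoms = 6, 50

# Synthetic trajectory, axis order (time, element, atom, xyz).
traj = rng.normal(scale=0.05, size=(n_steps, 2, n_atoms, 3))
traj[:, 1, :, 2] += 2.27                  # Au layer 2.27 Å above Cu

# Artificial rigid drift: every atom is shifted by the same amount per step.
drift = np.linspace(0.0, 5.0, n_steps)
traj[..., 2] += drift[:, None, None]

z_mean = traj[..., 2].mean(axis=-1)       # shape (n_steps, 2)
distance_nm = (z_mean[:, 1] - z_mean[:, 0]) * ANGSTROM_TO_NM
print(distance_nm)
```

Since the drift shifts both layers equally, it drops out of the difference and no explicit drift correction is needed for this quantity.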

In [6]:
# YOUR ANSWER HERE

## 20.5
Plot in a single graph the average distance (from task 20.4) and the standard deviations of the Cu and Au atoms (from task 20.3) as functions of time.

Also show the ground-state value for the distance between the two layers.

In [None]:
# YOUR ANSWER HERE

## 20.6
Repeat the calculation of the average distance and the standard deviation for the strained simulation, and draw a similar plot.

In [None]:
# YOUR ANSWER HERE

## 20.7
Comment on the effects of raising the temperature from $100$K to $300$K.


In [None]:
# YOUR ANSWER HERE

## 20.8 
Comment on the effects of strain with respect to the non-strained film.

In [5]:
# YOUR ANSWER HERE

# Exercise 21 - Introduction to PyTorch
`PyTorch` is a software library designed for creating and customizing neural network (NN) algorithms for machine learning.

More tutorials on how to use PyTorch can be found at: https://pytorch.org/tutorials/ 


In [4]:
import torch
from torch import nn,tensor

## 21.1 PyTorch Tensors
Help can be found at: https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html

The basic data type in PyTorch is the `tensor`. Tensors resemble NumPy arrays, but are optimized for use inside machine learning models. <br> 

In order to familiarize yourself with tensors: <br>

**1)** Construct one tensor from a Python list <br>

**2)** Construct one tensor from a random numpy array <br>

**3)** Generate three tensors:

* one with random elements
* one with all elements equal to 1
* one with all zeros

of shape (3,3) using the corresponding functions in the `torch` module. <br>

**4)** Tensors, like NumPy arrays, support most of the standard mathematical operations. Compute three different mathematical operations (of your choice) using the previously generated tensors <br>

**5)** Transform the results of your operations to numpy arrays
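
The steps above could look roughly like the following sketch. The variable names and the three chosen operations are arbitrary; any other operations would do.

```python
import numpy as np
import torch

# 1) From a Python list, 2) from a random NumPy array.
from_list = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
from_numpy = torch.from_numpy(np.random.rand(2, 2))  # keeps dtype float64

# 3) Three (3, 3) tensors from the torch factory functions.
rand = torch.rand(3, 3)
ones = torch.ones(3, 3)
zeros = torch.zeros(3, 3)

# 4) Three operations: sum, element-wise product, matrix product.
total = rand + ones
product = rand * ones
matmul = ones @ rand

# 5) Back to NumPy (for CPU tensors that do not track gradients).
total_np = total.numpy()
print(type(total_np), total_np.shape)
```

Note that `torch.from_numpy` shares memory with the source array, so modifying one in place also changes the other.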

In [11]:
# YOUR ANSWER HERE

## 21.2 Datasets
In PyTorch, data are grouped in `Datasets`. To define a custom dataset, you need to derive a custom class from the `torch.utils.data.Dataset` class (see class inheritance in Python).<br>

A Dataset class must implement three functions: `__init__`, `__len__`, and `__getitem__`, as in the following example found at: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files


    import os
    import pandas as pd
    from torchvision.io import read_image
    from torch.utils.data import Dataset
    
    class CustomImageDataset(Dataset):
        def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
            self.img_labels = pd.read_csv(annotations_file)
            self.img_dir = img_dir
            self.transform = transform
            self.target_transform = target_transform

        def __len__(self):
            return len(self.img_labels)

        def __getitem__(self, idx):
            img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
            image = read_image(img_path)
            label = self.img_labels.iloc[idx, 1]
            if self.transform:
                image = self.transform(image)
            if self.target_transform:
                label = self.target_transform(label)
            return image, label

In the example above, the initializer stores a table of image file names and labels (`img_labels`) in the Dataset instance.
The `__getitem__(n)` function reads the _nth_ image file from the hard drive. 


**1)** Create two random tensors (X,Y) of length 200.<br>

**2)** Define a custom dataset which holds the two tensors (X and Y). The method `__getitem__(n)` should (trivially) return the _nth_ values of these tensors

**3)** Test if the code works
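
A minimal sketch of such an in-memory dataset is shown below. The class name `PairDataset` and the length check are choices made for this illustration and are not part of the PyTorch API.

```python
import torch
from torch.utils.data import Dataset

class PairDataset(Dataset):
    """Holds two tensors of equal length and returns them pairwise."""

    def __init__(self, x, y):
        if len(x) != len(y):
            raise ValueError("x and y must have the same length")
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

X = torch.rand(200)
Y = torch.rand(200)
dataset = PairDataset(X, Y)

# Quick check: length and one sample.
x5, y5 = dataset[5]
print(len(dataset), x5, y5)
```

Unlike the image example, nothing is read from disk here; `__getitem__` only indexes tensors that are already in memory.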

In [None]:
# YOUR ANSWER HERE

## 21.3 Neural Layers
`Neural layers` are functions which apply transformations to tensors.
The linear transformation is among the simplest:
https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html#nn-linear

**1)** Create a linear layer whose input and output are both one-element tensors <br>

**2)** Guess the parameters of the transformation by applying it to different tensors _(notice that the parameters are randomly initialized every time you create a new layer)_<br>

**3)** Chain two layers in order to produce, from a scalar input, first a 3D vector and finally a 3 by 3 matrix
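
Since a one-to-one linear layer computes $y = w x + b$, probing it with the inputs 0 and 1 is enough to recover both parameters. The sketch below shows this, followed by one possible reading of the third task, in which the second layer outputs 9 values that are reshaped to a 3 by 3 matrix (this reshaping is an assumption, not something the task prescribes).

```python
import torch
from torch import nn

layer = nn.Linear(1, 1)   # y = w * x + b with random w and b

with torch.no_grad():
    y0 = layer(torch.tensor([0.0]))   # equals b
    y1 = layer(torch.tensor([1.0]))   # equals w + b

bias_guess = y0.item()
weight_guess = (y1 - y0).item()
print(weight_guess, bias_guess)
print(layer.weight.item(), layer.bias.item())   # compare with the guesses

# Scalar -> 3D vector -> 9 values, reshaped to a (3, 3) matrix.
to_vector = nn.Linear(1, 3)
to_flat = nn.Linear(3, 9)
with torch.no_grad():
    vec = to_vector(torch.tensor([2.0]))   # shape (3,)
    mat = to_flat(vec).reshape(3, 3)       # shape (3, 3)
print(vec.shape, mat.shape)
```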

In [None]:
# YOUR ANSWER HERE