This notebook was written by [felipe.alonso@urjc.es](mailto:felipe.alonso@urjc.es) and [jorge.calero@urjc.es](mailto:jorge.calero@urjc.es).

This notebook is an introduction to descriptive statistics using pandas. The first step when working with data is to perform an exploratory analysis to build some intuition about how the data are distributed.


# 1. Load libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 2. Pandas

The pandas library includes two types of data structures:

- [**Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html): a one-dimensional ndarray with axis labels (including time series).
- [**DataFrames**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame): a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. A DataFrame can be thought of as a dict-like container for Series objects, and it is the primary pandas data structure.
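As a minimal sketch of the two properties mentioned above (label alignment and the dict-of-Series view), the snippet below uses small made-up Series; the names `a`, `b`, `total`, `frame` and `column` are illustrative only.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative values).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Arithmetic aligns on labels, not on position. Labels that appear in
# only one of the operands produce NaN in the result.
total = a + b
print(total)

# A DataFrame behaves like a dict of Series: each column is a Series,
# and all columns share the same row index.
frame = pd.DataFrame({"a": a, "b": b})
column = frame["a"]
print(type(column))
```

Only the labels `y` and `z` exist in both Series, so only those two entries of `total` hold a number.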

Let's see some examples.

## 2.1 Series

In [None]:
s = pd.Series([1,2,3,"catorce"])
print(s)

In [None]:
# s.index
# s[3]
# s[0:3]
# s.values
# s.values * 2

The index can be set explicitly when the Series is created:

In [None]:
s = pd.Series(data =[1,2,3,"catorce"], index = ['hola','que','tal','estas'])
print(s)

In [None]:
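# s['hola']    # access by label
# s.iloc[0]    # access by position (plain s[0] is deprecated when the index is not integer-based)
# s[0:3]       # positional slice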
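Once a Series has a custom index, it helps to be explicit about whether an access is by label or by position. The following sketch rebuilds the same Series and uses `.loc` and `.iloc`; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, "catorce"], index=["hola", "que", "tal", "estas"])

by_label = s.loc["hola"]            # label-based access
by_position = s.iloc[0]             # position-based access
first_three = s.iloc[0:3]           # positional slice: end position excluded
label_slice = s.loc["hola":"tal"]   # label slice: end label included

print(by_label, by_position)
print(first_three)
print(label_slice)
```

Both slices return the same three elements here, but for different reasons: `iloc` stops before position 3, while `loc` includes the end label `'tal'`.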

## 2.2 DataFrames

[DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) can be constructed by using:

- Numpy ndarray
- Dictionaries

In [None]:
# from Numpy ndarray
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),columns=['a', 'b', 'c'])
df

In [None]:
# from dictionary
queen_dict = {}
queen_dict["name"]=["Freddie" ,"Roger", "Brian", "John"]
queen_dict["surname"] = ["Mercury","Taylor", "May", "Deacon"]
queen_dict["year_of_birth"]= [1946,1949, 1947, 1951]

queen_df = pd.DataFrame(data = queen_dict)
queen_df

### Pandas operations

In [None]:
# main properties
queen_df.index
queen_df.columns
#queen_df.shape

In [None]:
# change index
queen_df.index = ['voice','drums','guitar','bass']
queen_df

In [None]:
# reset index
queen_df.reset_index()
#queen_df.reset_index(drop = True)

In [None]:
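# set index
# set_index returns a new DataFrame; queen_df keeps its 'voice', 'drums', ... index
# and its 'name' column, both of which the following cells rely on
queen_df.set_index('name')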

In [None]:
# Get column values
queen_df['name'] # note that this is a Series
#queen_df['name'].values

In [None]:
# Get several columns
queen_df[['surname','name']] # note that this is a DataFrame
#queen_df[['surname','name']].values

In [None]:
# Get rows
queen_df.loc['voice']

#queen_df.loc['voice'].values
#queen_df.loc[['voice','drums'],'name']
#queen_df.loc[['voice','drums'],['name','surname']]
#queen_df.iloc[0]
#queen_df.iloc[0:2]
#queen_df.iloc[0:2,1:]
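The difference between `loc` (labels) and `iloc` (positions) is easiest to see with slices. This is a small self-contained sketch that rebuilds the Queen DataFrame with the instrument index used above.

```python
import pandas as pd

queen_df = pd.DataFrame(
    {
        "name": ["Freddie", "Roger", "Brian", "John"],
        "surname": ["Mercury", "Taylor", "May", "Deacon"],
        "year_of_birth": [1946, 1949, 1947, 1951],
    },
    index=["voice", "drums", "guitar", "bass"],
)

by_label = queen_df.loc["voice":"guitar"]   # end label included: 3 rows
by_position = queen_df.iloc[0:2]            # end position excluded: 2 rows
single_row = queen_df.loc["voice"]          # a single row is returned as a Series
cell = queen_df.loc["drums", "surname"]     # row label plus column label: one value

print(by_label)
print(by_position)
print(cell)
```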

# 3. Load data in Pandas

In [None]:
df = pd.read_excel('./data/Datos.xlsx')
df.shape

In [None]:
# some options
df.head()
df.tail()
df.sample(10)
df.dtypes
df.drop(columns = 'Id')
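Note that `drop` returns a new DataFrame and leaves the original untouched. The sketch below illustrates this with a hypothetical stand-in for the spreadsheet: only the column names `Id` and `Alcoh_dr` come from this notebook, the values are made up.

```python
import pandas as pd

# Hypothetical toy data; the real values live in Datos.xlsx.
toy = pd.DataFrame({"Id": [1, 2, 3], "Alcoh_dr": [0.0, 5.2, 1.1]})

without_id = toy.drop(columns="Id")   # new DataFrame without the Id column

print(list(toy.columns))         # the original still contains Id
print(list(without_id.columns))
```

To keep the result, assign it to a name, for example `df = df.drop(columns='Id')`.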

## 3.1 Summary statistics

In [None]:
# Mean values
df.mean()       # columns (axis=0) 
df.mean(axis=1) # rows

In [None]:
# sample standard deviation (ddof=1 divides by n - 1; this is the pandas default)
df.std(ddof=1)
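The `ddof` ("delta degrees of freedom") argument controls the divisor used for the variance: `n - ddof`. The following sketch compares both choices on a small made-up Series.

```python
import numpy as np
import pandas as pd

# Illustrative values: mean 5, sum of squared deviations 32.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std(ddof=1)         # divides by n - 1 (pandas default)
population_std = values.std(ddof=0)     # divides by n
numpy_std = np.std(values.to_numpy())   # NumPy defaults to ddof=0

print(sample_std, population_std, numpy_std)
```

Because pandas and NumPy use different defaults, calling `std` on a DataFrame and on its underlying array can give slightly different numbers for the same data.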

In [None]:
# others
df.count()
df.min()
df.median()
df.max()

In [None]:
# summary
df.describe(percentiles = [.05, .25, .5, .75, .95])

<div class="alert alert-block alert-info">
<b>Exercise:</b> Calculate the mean value of a (random) sample of size 10 for the variable Sat_fat_dr.</div>

In [None]:
# Your code here
# ...

df['Sat_fat_dr'].sample(10).mean()
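Since `sample` draws at random, the result changes on every run. If a reproducible result is needed, one option is to pass `random_state`. The sketch below assumes hypothetical values for `Sat_fat_dr`, because the real data come from the spreadsheet.

```python
import pandas as pd

# Hypothetical toy values standing in for the Sat_fat_dr column.
toy = pd.DataFrame({"Sat_fat_dr": [float(v) for v in range(20, 50)]})

# The same seed yields the same sample, and therefore the same mean.
first = toy["Sat_fat_dr"].sample(10, random_state=0).mean()
second = toy["Sat_fat_dr"].sample(10, random_state=0).mean()

print(first, second)
```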

## 3.2 Filtering

Let's find the rows for which `Alcoh_dr` is 0.

In [None]:
df[df.Alcoh_dr == 0]

Now, let's select the `Calor_dr` and `Calor_ffq` columns for the rows where `Alcoh_dr` is greater than 0.

In [None]:
# option 1
df[df.Alcoh_dr > 0][['Calor_dr','Calor_ffq']]

# option 2
df.loc[df.Alcoh_dr > 0,['Calor_dr','Calor_ffq']]
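Both options rely on a boolean mask: a Series of `True`/`False` values with one entry per row. The sketch below makes the mask explicit and shows how two conditions can be combined with `&`. The column names come from this notebook, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data using the column names from this notebook.
toy = pd.DataFrame(
    {
        "Alcoh_dr": [0.0, 3.5, 0.0, 12.0],
        "Calor_dr": [1800.0, 2200.0, 1500.0, 2600.0],
        "Calor_ffq": [1900.0, 2100.0, 1400.0, 2500.0],
    }
)

# One boolean per row.
mask = toy["Alcoh_dr"] > 0
selected = toy.loc[mask, ["Calor_dr", "Calor_ffq"]]

# Combine conditions with & (and) or | (or); each condition needs its own
# parentheses because & binds more tightly than the comparison operators.
both = toy.loc[
    (toy["Alcoh_dr"] > 0) & (toy["Calor_dr"] > 2500),
    ["Calor_dr", "Calor_ffq"],
]

print(selected)
print(both)
```

Selecting rows and columns in a single `.loc` call, as in option 2, is usually the safer choice when the result will later be modified, because it avoids chained indexing.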

## 3.3 Do some plotting

In [None]:
# histograms

# option 1: using pandas
df.Alcoh_dr.hist(bins=50, grid=False)
plt.show()

# option 2: using matplotlib
plt.hist(df.Alcoh_dr.values, bins=50)
plt.show()

In [None]:
# boxplots
df[['Calor_dr','Calor_ffq']].boxplot()
plt.show()

In [None]:
# scatter plots
df.plot.scatter(x = 'Calor_dr', y = 'Calor_ffq')
plt.show()

# References

- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)