# Pandas information methods

Pandas, and DataFrames in particular, will be our main tool for holding and exploring data.

Series and DataFrames have many methods that give us information about the data they contain.

**Series and DataFrame: information methods**
- .info
- .describe
- .count, .mean, .median, .max, .std, ...
- .agg
- .corr
- .groupby

**Pandas API Reference**

https://pandas.pydata.org/docs/reference/index.html

In [1]:
import pandas as pd

### Dataset Auto MPG
Source: UC Irvine Machine Learning Repository

https://archive.ics.uci.edu/ml/datasets/auto+mpg

In [None]:
# The dataset can also be loaded from the Seaborn library, which ships it as a sample dataset.
# Seaborn is a statistical visualization library built on top of matplotlib.
# https://seaborn.pydata.org/
# import seaborn as sns
# df = sns.load_dataset('mpg')

In [2]:
# Load dataset from local datasets repository
df = pd.read_csv('../datasets/auto-mpg.csv')

In [3]:
# Show the dataset; the display also reports the number of rows and columns
df

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino
...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,usa,ford mustang gl
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup
395,32.0,4,135.0,84.0,2295,11.6,82,usa,dodge rampage
396,28.0,4,120.0,79.0,2625,18.6,82,usa,ford ranger


In [4]:
# shape gives the dimensions of the dataset directly: (rows, columns)
df.shape

(398, 9)

In [5]:
# Displaying a Series (a single column) also shows some information: its length and data type
df.mpg

0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ... 
393    27.0
394    44.0
395    32.0
396    28.0
397    31.0
Name: mpg, Length: 398, dtype: float64

In [6]:
# show columns and data types
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model_year        int64
origin           object
name             object
dtype: object

In [7]:
# Show DataFrame info:
# columns, non-null counts, data types, memory usage...
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
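
The `Non-Null Count` column above reports 392 for `horsepower` against 398 rows, so six values are missing there. A minimal sketch of how missing values can be counted directly; the `toy` DataFrame and its values are made up for illustration and are not the Auto MPG data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (illustrative assumption, not the Auto MPG dataset)
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 30.0],
    "horsepower": [130.0, np.nan, 95.0, np.nan],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# count() gives the complement: the number of non-null values per column
non_null_per_column = toy.count()
print(non_null_per_column)
```

On the real DataFrame the same calls would be `df.isna().sum()` and `df.count()`.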


In [8]:
# basic statistical data, only for numeric variables
df.describe()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
count,398.0,398.0,398.0,392.0,398.0,398.0,398.0
mean,23.514573,5.454774,193.425879,104.469388,2970.424623,15.56809,76.01005
std,7.815984,1.701004,104.269838,38.49116,846.841774,2.757689,3.697627
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0
25%,17.5,4.0,104.25,75.0,2223.75,13.825,73.0
50%,23.0,4.0,148.5,93.5,2803.5,15.5,76.0
75%,29.0,8.0,262.0,126.0,3608.0,17.175,79.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0


- count - number of non-null values in the column
- mean - arithmetic mean of the values
- std (standard deviation) - measure of dispersion: roughly, the typical distance between the values and their mean

The next 5 values are the FIVE-NUMBER SUMMARY.
- Together they split the sorted values into 4 parts, each holding the same number of data points
- min - minimum value
- 25th percentile, first quartile or Q1: 25% of the data lie between the minimum and Q1
- 50th percentile or MEDIAN: 50% of the data lie between the minimum and the median
- 75th percentile, third quartile or Q3: 75% of the data lie between the minimum and Q3
- max - maximum value
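
The quartile-based summary and the standard deviation described above can be reproduced by hand. The sketch below uses a small made-up Series (`values` is an illustrative assumption) so the numbers are easy to check; it relies on `Series.quantile` and on the sample standard deviation (divisor `n - 1`), which is what `std` computes by default:

```python
import pandas as pd

# Hypothetical toy values, chosen so the results are easy to verify by hand
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# min, Q1, median, Q3 and max in a single call
five_numbers = values.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_numbers)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by (n - 1)
squared_deviations = (values - values.mean()) ** 2
manual_std = (squared_deviations.sum() / (len(values) - 1)) ** 0.5
print(manual_std, values.std())
```

The last line prints the same number twice, which shows that std is a root-mean-square deviation rather than a plain average of distances.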

In [9]:
# We can restrict the statistics to the non-numeric columns only.
# A different set of statistics is computed for them: count, unique, top, freq.

df.describe(exclude = 'number')

,origin,name
count,398,398
unique,3,305
top,usa,ford pinto
freq,249,6


- count: number of non-null values in the column
- unique: number of distinct values
- top: most frequent value
- freq: number of occurrences of the most frequent value
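
Each of these statistics can also be obtained on its own. A minimal sketch with a made-up column (the `origin` Series below is illustrative, not the dataset), using `nunique` and `value_counts`:

```python
import pandas as pd

# Hypothetical toy column (illustrative assumption)
origin = pd.Series(["usa", "japan", "usa", "europe", "usa"], name="origin")

n_unique = origin.nunique()      # "unique": number of distinct values
counts = origin.value_counts()   # occurrences per value, most frequent first
top = counts.index[0]            # "top": the most frequent value
freq = counts.iloc[0]            # "freq": how many times it occurs

print(n_unique, top, freq)
```

`value_counts` is often more informative than `describe` for a categorical column, because it shows the full distribution and not only the most frequent value.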

### Categorical vs. numerical variables

Simplifying:
- **Numerical** variables: those that can be measured (**quantitative** variables)
- **Categorical** variables: those that cannot be measured (**qualitative** variables)

`describe` reflects this split: by default it summarizes only the numerical columns (int and float), while `exclude='number'` selects the remaining columns, here of type object, which are the categorical ones.
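
To list the columns of each kind explicitly, `select_dtypes` can be used. A minimal sketch on a made-up DataFrame (the `toy` frame is an illustrative assumption with the same kinds of columns as the dataset):

```python
import pandas as pd

# Hypothetical toy data (illustrative assumption)
toy = pd.DataFrame({
    "mpg": [18.0, 27.0],
    "cylinders": [8, 4],
    "origin": ["usa", "japan"],
})

# Columns with int or float data types
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else; in this example, the text column
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Keep in mind that the data type is only a first approximation: an integer column such as `cylinders` takes few distinct values and could reasonably be treated as categorical in some analyses.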

In [10]:
# Summary of statistics applied to a Series/column
# Has the same meaning as in the case of a DataFrame
df.mpg.describe()

count    398.000000
mean      23.514573
std        7.815984
min        9.000000
25%       17.500000
50%       23.000000
75%       29.000000
max       46.600000
Name: mpg, dtype: float64

In [None]:
# We can apply a function to the whole dataframe to calculate statistics (applied by columns).

In [11]:
# df.mean() does not work here because of the non-numeric columns, so we restrict it to the numeric ones
df.mean(numeric_only=True)


mpg               23.514573
cylinders          5.454774
displacement     193.425879
horsepower       104.469388
weight          2970.424623
acceleration      15.568090
model_year        76.010050
dtype: float64

In [12]:
#df.median()
df.median(numeric_only=True)

mpg               23.0
cylinders          4.0
displacement     148.5
horsepower        93.5
weight          2803.5
acceleration      15.5
model_year        76.0
dtype: float64

In [13]:
df.max()
#df.agg('max')

mpg                         46.6
cylinders                      8
displacement               455.0
horsepower                 230.0
weight                      5140
acceleration                24.8
model_year                    82
origin                       usa
name            vw rabbit custom
dtype: object

In [14]:
df.agg('max')

mpg                         46.6
cylinders                      8
displacement               455.0
horsepower                 230.0
weight                      5140
acceleration                24.8
model_year                    82
origin                       usa
name            vw rabbit custom
dtype: object

In [None]:
# We can also apply the functions only to a series/column.

In [15]:
df.mpg.mean()
#df.mpg.agg('mean')

23.514572864321607

In [16]:
df.mpg.agg('mean')

23.514572864321607

In [None]:
# It is possible to apply several functions in a single line (either to the whole dataframe or only to a series/column).

In [22]:
# df.agg(['mean', 'std']) fails here (the non-numeric columns cannot be averaged)

In [17]:
df.mpg.agg(['mean', 'std'])

mean    23.514573
std      7.815984
Name: mpg, dtype: float64
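
One way to apply several functions to a whole DataFrame that also contains text columns is to keep only the numeric columns first, or to pass a dict that names the functions for each column. A minimal sketch on made-up data (the `toy` frame is an illustrative assumption):

```python
import pandas as pd

# Hypothetical toy data (illustrative assumption)
toy = pd.DataFrame({
    "mpg": [18.0, 27.0, 32.0],
    "weight": [3504, 2790, 2295],
    "origin": ["usa", "usa", "japan"],
})

# Option 1: drop the non-numeric columns, then aggregate with several functions
summary = toy.select_dtypes(include="number").agg(["mean", "std"])
print(summary)

# Option 2: a dict selects the columns and the functions applied to each one;
# combinations that were not requested appear as NaN in the result
per_column = toy.agg({"mpg": ["mean", "max"], "weight": ["min"]})
print(per_column)
```

Both calls return a DataFrame with one row per function and one column per variable.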

#### Correlation

In our work as data scientists or analysts, a common objective is
to find relationships between different variables.
But be careful: correlation between two variables does not imply causation!

**Correlation** measures the strength and direction of the relationship between two variables.

Correlation does not explain why there is a relationship between variables; it only
indicates that one exists and measures its strength.

Statistical mantra no. 1: **CORRELATION DOES NOT IMPLY CAUSATION.**

The correlation coefficient always lies between -1 and 1.
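
To see where the limits -1 and 1 come from, the sketch below computes the Pearson coefficient by hand (this is the default method of `corr` in pandas) for two perfectly linear relationships. The Series `x`, `y_up`, `y_down` and the helper `pearson` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: y_up rises exactly with x, y_down falls exactly with x
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1
y_down = -3 * x + 10


def pearson(a, b):
    """Pearson correlation: co-variation of a and b, scaled by their spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = ((a_centered ** 2).sum() * (b_centered ** 2).sum()) ** 0.5
    return numerator / denominator


r_up = pearson(x, y_up)      # perfect direct relationship, close to 1
r_down = pearson(x, y_down)  # perfect inverse relationship, close to -1
print(r_up, r_down)

# pandas gives the same value
print(x.corr(y_up))
```

Values near 0 indicate little or no linear relationship; the sign tells whether the variables move in the same or in opposite directions.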

In [18]:
# Compute the correlation over the whole dataframe with the "corr" method.
# The result is a table that is symmetric about its main diagonal.
# mpg: miles per gallon
# displacement: engine displacement
df.corr(numeric_only=True)

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
mpg,1.0,-0.775396,-0.804203,-0.778427,-0.831741,0.420289,0.579267
cylinders,-0.775396,1.0,0.950721,0.842983,0.896017,-0.505419,-0.348746
displacement,-0.804203,0.950721,1.0,0.897257,0.932824,-0.543684,-0.370164
horsepower,-0.778427,0.842983,0.897257,1.0,0.864538,-0.689196,-0.416361
weight,-0.831741,0.896017,0.932824,0.864538,1.0,-0.417457,-0.306564
acceleration,0.420289,-0.505419,-0.543684,-0.689196,-0.417457,1.0,0.288137
model_year,0.579267,-0.348746,-0.370164,-0.416361,-0.306564,0.288137,1.0


In [None]:
# Some observations:
# - the diagonal is always 1: every variable is perfectly correlated with itself
# - weight and miles per gallon (mpg) have a fairly strong inverse (negative) correlation
# - the number of cylinders is highly correlated with horsepower
# - acceleration is only weakly correlated with model year
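
When only one variable is of interest, it is often easier to read a single sorted column of the correlation table than the whole matrix. A minimal sketch, using five rows copied from the table displayed at the start of the notebook (so the values are far too few to be representative; the `sample` name is an assumption):

```python
import pandas as pd

# Five rows taken from the displayed table (illustration only)
sample = pd.DataFrame({
    "mpg": [18.0, 15.0, 27.0, 32.0, 44.0],
    "weight": [3504, 3693, 2790, 2295, 2130],
    "model_year": [70, 70, 82, 82, 82],
    "origin": ["usa", "usa", "usa", "usa", "europe"],
})

# Correlation of every numeric column with mpg, from most negative to most positive
corr_with_mpg = (
    sample.corr(numeric_only=True)["mpg"]
    .drop("mpg")      # the correlation of mpg with itself is always 1
    .sort_values()
)
print(corr_with_mpg)
```

On the full DataFrame the same expression would be `df.corr(numeric_only=True)["mpg"].drop("mpg").sort_values()`.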

#### groupby

In [None]:
# Categorical variables can be used to split the data into groups, to which we can apply the same functions used above (mean, max, ...).

In [None]:
# First create the grouped object (a special kind of grouped dataframe).
# Then apply the function to it.

In [19]:
df_groupby_origin = df.groupby('origin')

In [20]:
df_groupby_origin = df.groupby('origin')
df_groupby_origin.mean(numeric_only=True)

origin,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
europe,27.891429,4.157143,109.142857,80.558824,2423.3,16.787143,75.814286
japan,30.450633,4.101266,102.708861,79.835443,2221.227848,16.172152,77.443038
usa,20.083534,6.248996,245.901606,119.04898,3361.931727,15.033735,75.610442
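
A grouped object can also be narrowed to one column and combined with `agg` to get several statistics per group. A minimal sketch on made-up data (the `toy` frame and its values are illustrative assumptions, not the Auto MPG data):

```python
import pandas as pd

# Hypothetical toy data (illustrative assumption)
toy = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan", "europe"],
    "mpg": [18.0, 22.0, 30.0, 34.0, 26.0],
    "weight": [3500, 3000, 2200, 2000, 2400],
})

# Select one column of the grouped object, then aggregate it
mean_mpg = toy.groupby("origin")["mpg"].mean()
print(mean_mpg)

# Several statistics per group in a single call
mpg_stats = toy.groupby("origin")["mpg"].agg(["mean", "max", "count"])
print(mpg_stats)
```

The result is indexed by the group labels, so a single group can be read with, for example, `mean_mpg["japan"]`.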
