# Scatter diagram

In [None]:
import pandas as pd

It is often necessary to understand how different quantities depend on each other. Take, for example, data on people's `height` and `weight`.

In [None]:
df = pd.read_csv('./datasets/HeightWeight.csv')

In [None]:
df.head()

Let's study the numerical characteristics:

In [None]:
df.describe()

and histograms:

In [None]:
ar = df.hist()

The histograms resemble normal distributions. Note also that the data describes adults.
To find the relationship between height and weight, both need to be shown on the same graph rather than on separate histograms. Put height on the X axis and weight on the Y axis:

In [None]:
ar = df.plot(x='Height', y='Weight') 

The result is a tangled mess.

The original data was not sorted, and each person's height is in no way connected to the height of the previous person. Let's get rid of the mess of jumping values connected by lines by sorting the data:

In [None]:
ar = df.sort_values('Height').plot(x='Height', y='Weight') 

A graph where values are connected by lines works well when it illustrates a continuous relationship.
Here it is much better to mark individual combinations of height and weight with dots. A special chart type, the scatter plot, does exactly that. Pass the value `scatter` to the `kind` parameter of the `plot()` method:

In [None]:
ar = df.plot(x='Height', y='Weight', kind='scatter') 

The graph shows the relationship between the two quantities. It also shows which data is typical and which is anomalous.
So, if we are told that there is a person with a height of 190 cm and a weight of 50 kg, we will answer that either the measurement was wrong or this is a very rare person.

# Correlation

The obvious drawback of this graph is that a huge number of points in the middle merge into a single mass. Inside the cloud of values, areas of higher density cannot be told apart. There are two ways to make the graph clearer.

Method 1

Make the dots semi-transparent by setting the `alpha` parameter. Try different values to find the one that works best:

In [None]:
ar = df.plot(x='Height', y='Weight', kind='scatter', alpha=0.03) 

Method 2

When there are many points and no single one is interesting on its own, the data is displayed in a special way. The graph is divided into cells, and the points in each cell are counted. Then the cells are filled with color: the more points, the more saturated the color.

This type of graph is selected by setting the `kind` parameter to `hexbin`.
The number of cells along the horizontal axis is set by the `gridsize` parameter, the analogue of `bins` for `hist()`.

In [None]:
ar = df.plot(x='Height', y='Weight', kind='hexbin', gridsize=20, figsize=(8, 6), sharex=False, grid=True) 
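
The counting step behind this kind of chart can be sketched without any plotting. The example below is only an illustration: it uses rectangular cells from `np.histogram2d` instead of hexagons, and the `height` and `weight` arrays are synthetic stand-ins, not the real dataset.

```python
import numpy as np

# Synthetic stand-in for the Height and Weight columns (an assumption, not the real data)
rng = np.random.default_rng(0)
height = rng.normal(68, 2, size=5000)
weight = 127 + 3 * (height - 68) + rng.normal(0, 10, size=5000)

# Split the plane into a 20 x 20 grid and count the points that fall into each cell
counts, x_edges, y_edges = np.histogram2d(height, weight, bins=20)

print(counts.shape)       # one count per cell
print(int(counts.sum()))  # every point is counted in exactly one cell
print(int(counts.max()))  # the densest cell, which would get the most saturated color
```

A hexbin chart does the same counting on a hexagonal grid and maps each count to a color.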

# Correlation coefficient

The Pearson correlation coefficient is computed with the `corr()` method. Call the method on the column with the first quantity and pass the column with the second quantity as the argument. Which one comes first does not matter:

In [None]:
df['Height'].corr(df['Weight'])

In [None]:
df.corr()

A correlation of `0.5` indicates a relationship, but not a very strong one. An increase in height tends to be accompanied by an increase in weight, but this is not always the case.

Is the opposite true: does height increase with weight gain? As far as we know from experience, no. Although correlation describes how two quantities vary together, it does not prove a causal relationship. We cannot conclude that gaining weight makes us taller.
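
For reference, the value returned by `corr()` can be reproduced by hand from the standard Pearson formula. This is a minimal sketch on synthetic data: the `demo` frame and the `pearson` helper are hypothetical names introduced for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (an assumption, not the real data)
rng = np.random.default_rng(1)
x = rng.normal(68, 2, size=1000)
demo = pd.DataFrame({'Height': x,
                     'Weight': 3 * x + rng.normal(0, 10, size=1000)})

def pearson(a, b):
    """Covariance of a and b divided by the product of their spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson(demo['Height'], demo['Weight'])
r_pandas = demo['Height'].corr(demo['Weight'])
r_swapped = demo['Weight'].corr(demo['Height'])

print(r_manual, r_pandas, r_swapped)
```

All three values agree. The formula is symmetric in its two arguments, which is why the order of the columns does not matter and why the coefficient alone says nothing about which quantity influences the other.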

# Scatterplot matrix

Earlier we found a correlation between two quantities. However, real-life phenomena are much more complex and can depend on many factors. For example, it is interesting to study not only height and weight, but also how they are affected by age and sex.

Unfortunately, it is impossible to draw a clear graph for four parameters at once. However, you can build scatterplots in pairs: height and weight, height and age, weight and sex, weight and age, and so on, for a total of 16 panels (a 4 x 4 grid). In pandas, this problem is solved not by `df.plot()`, but by a special function: `pd.plotting.scatter_matrix(df)`.

In the data set, height, weight and age are known for each person. In the sex column, the value 1 denotes male and 0 denotes female. Let's build the scatterplot matrix:

In [None]:
df = pd.read_csv('./datasets/HeightWeight2.csv')

In [None]:
ar = pd.plotting.scatter_matrix(df, figsize=(10, 10))

For all pairs of columns, except those involving sex, the correlation coefficient can be found. It is enough to call the `corr()` method without parameters.

In [None]:
df.corr()

# More charts

Let's automate the drawing process a little and make the graph more informative.

In [None]:
df.head()

Let's start with a regular chart:

In [None]:
ax = df.plot(x = 'Weight',
             y = 'Height',
             kind = 'scatter',                    
             style = 'o',                          
             alpha = 0.05,                          
             figsize = (8, 4.5),                   
             grid = True)

Add the median height for each weight to the chart. This line follows the trend that the correlation describes:

In [None]:
ax = df.plot(x = 'Weight',
             y = 'Height',
             kind = 'scatter',                    
             style = 'o',                          
             alpha = 0.05,                          
             figsize = (8, 4.5),                   
             grid = True)
(df.groupby('Weight')['Height'].agg(['median'])
         .plot(ax = ax,
               y = 'median',
               style = '-r',
               alpha = 0.1, 
               legend = True,          
               label = 'median on ' + 'Weight',    
               grid = True              
              ))

The median line looks bad because the values we grouped by were never binned: weight is a continuous quantity, so almost every group holds very few points.

Fix this by rounding the data:
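
A quick check on synthetic data shows what rounding does to the groups. The `demo` frame below is a made-up stand-in for the real dataset, so the exact numbers are only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (an assumption, not the real data)
rng = np.random.default_rng(2)
demo = pd.DataFrame({'Weight': rng.normal(127, 12, size=2000)})
demo['Height'] = 68 + 0.08 * (demo['Weight'] - 127) + rng.normal(0, 1.5, size=2000)

# Grouping by raw float values: nearly every row forms its own group
raw_groups = demo.groupby('Weight')['Height'].median()

# Grouping after rounding: many rows share each group, so the median is stable
rounded_groups = demo.round().groupby('Weight')['Height'].median()

print(len(raw_groups), len(rounded_groups))
```

With far fewer groups, each median is computed over many points, and the resulting line is much smoother.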

In [None]:
ax = df.plot(x = 'Weight',
             y = 'Height',
             kind = 'scatter',                    
             style = 'o',                          
             alpha = 0.05,                          
             figsize = (8, 4.5),                   
             grid = True)
(df.round().groupby('Weight')['Height'].agg(['median'])
         .plot(ax = ax,
               y = 'median',
               style = '-r',
               alpha = 1, 
               legend = True,          
               label = 'median on ' + 'Weight',    
               grid = True              
              ))

This is already close to ideal, but a small amount of data at the edges needs to be cut off. This can be done using quantiles:

In [None]:
llimit = df['Weight'].quantile(0.01)
rlimit = df['Weight'].quantile(0.99)
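
To see how much data these limits keep, here is a small sketch on a synthetic `weight` series (an assumption standing in for the real `Weight` column):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Weight column (an assumption, not the real data)
rng = np.random.default_rng(3)
weight = pd.Series(rng.normal(127, 12, size=10000), name='Weight')

# 1st and 99th percentiles, computed the same way as in the cell above
llimit = weight.quantile(0.01)
rlimit = weight.quantile(0.99)

# Share of the values that lie between the two limits
inside = weight.between(llimit, rlimit)
print(round(inside.mean(), 3))
```

About 98% of the values stay inside the limits, and only the sparse tails on both sides are left out.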

In [None]:
ax = df.plot(x = 'Weight',
             y = 'Height',
             kind = 'scatter',                    
             style = 'o',                          
             alpha = 0.05,                          
             figsize = (8, 4.5),  
             xlim = (llimit, rlimit),
             grid = True)
(df.round().groupby('Weight')['Height'].agg(['median'])
         .plot(ax = ax,
               y = 'median',
               style = '-r',
               alpha = 1, 
               legend = True,          
               label = 'median on ' + 'Weight',    
               grid = True              
              ))

What remains is automation.

Create a function that receives:
1. the dataframe
2. the name of the X column
3. the name of the Y column

In [None]:
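def draw_graph(df, column_x, column_y):
    """Scatter plot of column_y against column_x with a median line on top."""

    # Limit the X axis to the 1st-99th percentile range to hide extreme values
    llimit = df[column_x].quantile(0.01)
    rlimit = df[column_x].quantile(0.99)

    # Semi-transparent scatter plot of the raw data
    ax = df.plot(x = column_x,
                 y = column_y,
                 kind = 'scatter',
                 style = 'o',
                 alpha = 0.05,
                 figsize = (8, 4.5),
                 xlim = (llimit, rlimit),
                 grid = True)
    # Median of column_y for each rounded value of column_x, drawn as a red line
    (df.round().groupby(column_x)[column_y].agg(['median'])
             .plot(ax = ax,
                   y = 'median',
                   style = '-r',
                   alpha = 1,
                   legend = True,
                   label = 'median on ' + column_x,
                   grid = True
                  ))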

Now, to draw the graph, we just need to call the function with the necessary parameters:

In [None]:
draw_graph(df, 'Weight', 'Height')

In [None]:
draw_graph(df, 'Height', 'Weight')

In [None]:
draw_graph(df, 'Age', 'Weight')

We are limited only by our imagination. For best results, try to keep the number of parameters you pass small.

In [None]:
def hist_compare(df1, df2, column, bins, lims):
    ax = df1.plot(kind = 'hist',
                 y = column,
                 bins = bins,
                 range = lims,
                 alpha = 0.3,
                 grid = True,
                 legend = True,           
                 label = column,
                 figsize = (8, 4.5))
    df2.plot(kind = 'hist',
                  y = column,
                  ax = ax,               
                  bins = bins,
                  range = lims,
                  alpha = 0.4,
                  grid = True,
                  legend = True,
                  label = column 
                          )

In [None]:
hist_compare(df.query('Height < 68'), df.query('Height >=68') , 'Weight', bins=30, lims=(90,160))

What could be optimized or automated here?