# Python for Data Science & Analysis 
## Intro to Pandas for Data Querying

# Part 1: What is EDA?

## What is a standard *Technical* EDA workflow?


### The Ideal Method

There is NO single method for EDA ...

### Non-Ideal Method

#### Summary of Method
* data quality, structure, etc. metrics
    * `.info()`
* descriptive stats for all single columns
* descriptive stats for all pairs of columns
    * ie., correlation
* domain-specific row subsets
    * all single cols, all pairs 
* introduce factoring columns
    * ie., groupby
    * factor all single columns
* introduce domain-specific novel columns
    * eg., bmi from w/h^2
* domain-specific visuals
    * eg., geo-plotting for geo data
* lightweight predictive modelling
    * eg., linear regression, etc. 
    
    

#### Technical EDA

* Step 0. Show all key data quality metrics (mostly non-statistical, quality checks)
    * missing data, descriptive stats of all columns...
* Step 1. Plot all univariate (single-columns) distributions
    * `.describe()`
    * eg., `sns.histplot` (`sns.distplot` in older seaborn versions, where it is now deprecated) for *all* single columns
* Step 2. Plot all plausible grouped (factored or groupby) univariate distributions
    * ie., subset rows of a single column (based on something domain specific)
    * plot distributions of those
* Step 3. Show all multivariate (multi-column, correlation) descriptive statistics
    * correlation coefs between all pairs of columns, etc.
* Step 4. Plot all multivariate distributions
    * eg., start with linear regression for all pairs of variables
    * eg., group continuous columns by discrete columns
* Step 5a. Introduce more domain-specific querying
    * ie., domain-specific `WHERE` conditions
    * subset rows and repeat (0 - 4)
    * subset columns and rows and repeat (0 - 4)
* Step 5b. Introduce factoring columns (derived columns for groupby)
    * aka. enrichment, feature engineering (aka. adding a column)
    * grouping an existing real-number column by meaningful categories
        * eg., young, old, etc.
    * then use this col on others
    * this captures higher-level (hierarchical) patterns
* Step 5c. (Maybe:) Derived continuous columns 
    * eg., bmi from weight/height
    * plots, etc.
* Step 6. (Maybe:) Introduce novel domain-specific visualizations
    * eg., geo maps, technical heatmaps, ...
* Step 7. Lightweight Stat Modelling
    * linear regression = regression (continuous)
        * simple trend prediction
    * logistic regression = classification (discrete)
        * sorting data points into groups (labels/classes/etc.)
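
A few of these steps can be sketched on a tiny, made-up table. The data, the column names (`weight_kg`, `height_m`, `age_group`) and the age cut-offs are illustrative assumptions, not part of the titanic dataset used later; this is a sketch of the idea rather than a prescribed workflow.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
people = pd.DataFrame({
    'age':       [15, 34, 67, 45, 8, 72],
    'weight_kg': [55.0, 80.0, 70.0, 90.0, 25.0, None],
    'height_m':  [1.65, 1.80, 1.70, 1.75, 1.20, 1.60],
})

# Step 0: data quality, eg. missing values per column
missing = people.isna().sum()

# Step 3: correlation coefficients between all pairs of columns
correlations = people.corr()

# Step 5b: a factoring column, grouping ages into categories
people['age_group'] = pd.cut(
    people['age'],
    bins=[0, 18, 60, 120],              # assumed cut-offs
    labels=['young', 'adult', 'old'],
)

# Step 5c: a derived continuous column
people['bmi'] = people['weight_kg'] / people['height_m'] ** 2

# Use the factoring column on another column
mean_bmi = people.groupby('age_group', observed=True)['bmi'].mean()

print(missing, correlations, mean_bmi, sep='\n\n')
```

The missing weight simply drops out of the `old` group's mean, because pandas skips missing values in `.mean()` by default.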

# Part 2: EDA with Pandas

Exercise:

Import pandas and read in the titanic csv file as a dataframe saved as df.



In [4]:
#Solution
import pandas as pd

df = pd.read_csv('datasets/titanic.csv')


## How do we EDA with Pandas?

* data quality, structure, etc. metrics
    * $\rightarrow$ `.info()`
* descriptive stats for all single columns
    * $\rightarrow$ `.describe()`, `.mean()`, `.value_counts()`
* descriptive stats for all pairs of columns
    * $\rightarrow$ `.corr()`
* domain-specific row subsets
    * $\rightarrow$ `df.loc`
* introduce factoring columns 
    * $\rightarrow$ `.groupby`
* introduce domain-specific novel columns
    * $\rightarrow$ `df['bmi'] = df['w'] / df['h'] ** 2`
* domain-specific visuals
    * $\rightarrow$ `df.plot`
    

## How do I understand the structure of a dataset?

Exercise:

Use .info() to view the structure of your dataframe.

In [3]:
#Solution
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB
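
The `Non-Null Count` column is often easier to read as a share of missing values. A minimal sketch on a small made-up frame (the values are invented; only the column names echo the titanic data), followed by the same arithmetic applied to the `age` line of the output above:

```python
import pandas as pd

# Hypothetical toy frame with some missing entries
toy = pd.DataFrame({
    'age':  [22.0, None, 26.0, None],
    'deck': [None, 'C', None, None],
    'fare': [7.25, 71.28, 7.92, 53.10],
})

# isna() gives True/False per entry; the mean of booleans
# is the share of True values, ie. the share of missing entries
missing_share = toy.isna().mean()
print(missing_share)

# Reading the .info() output above: age has 714 non-null values in 891 rows
age_missing = 1 - 714 / 891
print(round(age_missing, 3))
```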


## How do I find possible relations between columns?

A correlation coefficient is a value between -1 and 1 (inclusive) that measures the strength of correlation between two variables (columns) in the data.

A coefficient of 1 is perfect positive correlation (when visualised on a scatter plot, points appear in a perfect straight line with a positive gradient).

A coefficient of -1 is perfect negative correlation (when visualised on a scatter plot, points appear in a perfect straight line with a negative gradient).

If there is no linear correlation the coefficient will be 0.

When visualising data on a scatter plot, if the data points form an elliptical cloud then they are consistent with a bivariate normal distribution (and later we will see that this means that *linear* regression may be appropriate).

The more circular the ellipse is, the less correlated the variables are. If the ellipse is longer and thinner then the coefficient of correlation will be closer to -1 or 1.
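
The default coefficient in pandas is the Pearson correlation: the covariance of the two columns divided by the product of their standard deviations. A small sketch with made-up numbers, checking a hand calculation against pandas and confirming the two perfect cases described above:

```python
import pandas as pd

# Made-up values, roughly on a straight line
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation by hand: covariance / (std of x * std of y)
r_manual = x.cov(y) / (x.std() * y.std())

# The same quantity from pandas
r_pandas = x.corr(y)

# Perfect straight lines give exactly +1 or -1 (up to rounding)
perfect_pos = x.corr(2 * x + 1)
perfect_neg = x.corr(-3 * x)

print(r_manual, r_pandas, perfect_pos, perfect_neg)
```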

**Important**

Just because two variables are correlated doesn't mean one has caused the other.

For example, if there is a correlation between ice cream sales and murder rates it would most likely be incorrect to say that selling more ice creams causes more murders. Sometimes a correlation is spurious, and sometimes it may be another underlying variable (e.g. temperature - which when high might cause more ice cream sales and more hot tempers!).

Calculating a correlation coefficient is part of exploring the data to look for links that should then be further investigated.

We can do this quickly in a data set with many possible pairs of columns, using the .corr() method.

Exercise:

Use the dataframe method .corr() to see the correlation coefficients between all the pairs of variables. Notice which ones are closer to 1 or -1.

In [4]:
#Solution
df.corr(numeric_only=True)  # pandas >= 2.0 raises an error on text columns without numeric_only=True


| | survived | pclass | age | sibsp | parch | fare | adult_male | alone |
|---|---|---|---|---|---|---|---|---|
| survived | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 | -0.55708 | -0.203367 |
| pclass | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 | 0.094035 | 0.135207 |
| age | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 | 0.280328 | 0.19827 |
| sibsp | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 | -0.253586 | -0.584471 |
| parch | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 | -0.349943 | -0.583398 |
| fare | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 | -0.182024 | -0.271832 |
| adult_male | -0.55708 | 0.094035 | 0.280328 | -0.253586 | -0.349943 | -0.182024 | 1.0 | 0.404744 |
| alone | -0.203367 | 0.135207 | 0.19827 | -0.584471 | -0.583398 | -0.271832 | 0.404744 | 1.0 |


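A score of $1$ means that the two columns *contain the same information*: one can be predicted exactly from the other along an increasing straight line ($-1$ is the same with a decreasing line). $0$ means there is no *linear* association between them (ie., no linear predictability). Values between these suggest some level of shared information between columns.

We are interested in the sign & magnitude of these entries... above, we can see: a moderate negative correlation between survived and pclass ($-0.34$): moving from class 1 $\rightarrow$ 3, survival falls. The largest in magnitude involving survived is with adult_male ($-0.56$).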

Aside: the options for the `method` argument of `.corr()` are

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation

At this stage we don't get into which to use when, but some of you may enjoy reading further about these now you are aware of them.
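
As a taste of how the options differ, here is a sketch on made-up data where `y` always increases with `x`, but not along a straight line. Only `pearson` and `spearman` are compared; the data is invented for illustration.

```python
import pandas as pd

# Made-up data: y grows with x, but as a curve rather than a line
curve = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6]})
curve['y'] = curve['x'] ** 3

# Pearson measures straight-line association
pearson = curve.corr(method='pearson').loc['x', 'y']

# Spearman compares the rank order of the values instead
spearman = curve.corr(method='spearman').loc['x', 'y']

print(pearson, spearman)
```

Because the rank order of `x` and `y` is identical, the Spearman coefficient is 1, while the Pearson coefficient falls short of 1 since the points do not lie on a straight line.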

## How do I select columns from a dataframe?

Exercise:

1) Use addressing to select the age column from the dataframe.

2) Select the age and fare columns from the dataframe.

In [5]:
#Solution
df['age']


0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

In [6]:
#Solution
df[ ['age', 'fare'] ]


| | age | fare |
|---|---|---|
| 0 | 22.0 | 7.2500 |
| 1 | 38.0 | 71.2833 |
| 2 | 26.0 | 7.9250 |
| 3 | 35.0 | 53.1000 |
| 4 | 35.0 | 8.0500 |
| ... | ... | ... |
| 886 | 27.0 | 13.0000 |
| 887 | 19.0 | 30.0000 |
| 888 | NaN | 23.4500 |
| 889 | 26.0 | 30.0000 |


## How do I select rows from a dataframe?

If you wish to obtain a subset of rows, you should use `.loc` ("locate").


We use `df.loc` to locate rows,

`df.loc[  row-indexes,  column-names ]`

Exercise:

1) Select the first row from the age column.

2) Select the first 5 rows from the age and fare columns.

3) Select the 1st, 11th, and 8th rows from the survived column (in that order).

In [None]:
#Solution
df.loc[0, 'age'] # SELECT age FROM df WHERE rowid = 0


In [8]:
#Solution
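df.loc[0:4, ['age', 'fare']]  # .loc slices include the end label, so 0:4 gives five rows


| | age | fare |
|---|---|---|
| 0 | 22.0 | 7.25 |
| 1 | 38.0 | 71.2833 |
| 2 | 26.0 | 7.925 |
| 3 | 35.0 | 53.1 |
| 4 | 35.0 | 8.05 |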


In [9]:
#Solution
df.loc[ [0, 10, 7], 'survived']


0     0
10    1
7     0
Name: survived, dtype: int64

## How do I select rows on cleaned datasets?

Operations which modify the indexes of a dataset (eg., by removing rows) may produce surprising results when used with `.loc`:

Exercise:

1) Use the dataframe method .dropna() to remove the rows with NAs in. Save this as clean.

2) Display the first 5 rows and all columns of the clean dataframe.

3) Display the first 5 rows and all columns of the original dataframe and compare.

4) What happens if you try to address row 2 in the clean data set?

In [14]:
#Solution
clean = df.dropna()
clean.loc[0:5, :]

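| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |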


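We have dropped rows indexed `0, 2, ...`, so they don't show up: `.loc` selects by index label, so `0:5` returns only the remaining rows whose labels lie between 0 and 5 (here `1` and `3`), not the first five rows of `clean`. Addressing row 2 directly with `clean.loc[2]` raises a `KeyError`, because that label no longer exists.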
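
A small sketch of the same effect on a made-up frame. `.iloc` (pandas' position-based indexer) and `.reset_index(drop=True)` are not introduced in the text above; they are shown here as two possible ways to get "the first n rows" after cleaning, not as the only ones.

```python
import pandas as pd

# Made-up frame with missing ages at labels 2 and 4
toy = pd.DataFrame({'age': [22.0, 38.0, None, 35.0, None, 54.0]})
clean = toy.dropna()                      # keeps labels 0, 1, 3, 5

# Label-based: only labels between 0 and 2 that still exist
by_label = clean.loc[0:2, 'age']

# Position-based: the first three rows, whatever their labels
by_position = clean.iloc[0:3]['age']

# Alternatively, renumber the rows 0, 1, 2, ... after cleaning
renumbered = clean.reset_index(drop=True)
first_three = renumbered.loc[0:2, 'age']

# Asking .loc for a dropped label raises KeyError
try:
    clean.loc[2]
    missing_label_error = False
except KeyError:
    missing_label_error = True

print(list(by_label.index), list(by_position.index), list(first_three))
```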

## How does Pandas indexing differ from lists?

With `.loc`, you cannot use negative indexes to count back from the end, as you can with lists. With `.loc`, indexes refer to labels in a *column* (the index column), not to a sequence position.

Reminder:

In [1]:
data = [1, 2, 3]
data[-2:]

[2, 3]

In [2]:
data[0:3]

[1, 2, 3]
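
The contrast can be sketched on a small made-up Series. `.iloc`, pandas' position-based counterpart to `.loc`, is not covered above and appears here only for comparison.

```python
import pandas as pd

data = [10, 20, 30]
ages = pd.Series(data)                    # default labels 0, 1, 2

# Lists and .iloc both count positions, so negatives count from the end
last_two_list = data[-2:]
last_two_iloc = ages.iloc[-2:].tolist()

# .loc looks up labels, and there is no label -1 in this index
try:
    ages.loc[-1]
    negative_label_found = True
except KeyError:
    negative_label_found = False

# .loc slices include the end label; list slices exclude the end position
loc_slice = ages.loc[0:2].tolist()        # labels 0, 1 and 2
list_slice = data[0:2]                    # positions 0 and 1

print(last_two_list, last_two_iloc, negative_label_found, loc_slice, list_slice)
```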

## How do I use comparison operators to filter rows?

If we want a subset of rows, we should use `df.loc`...

Let's look at an example before explaining how it works... 

Exercise:

Run the code below. It first creates a boolean column called `query` (`True` where the passenger survived, `False` otherwise).

The `query` column is used to select just the rows where the passenger survived; then only the fare column is selected, and the mean of this column is found.

After the comma you will see a very similar piece of code, except that `~query` (NOT `query`) has been used. This selects the rows where `query` is not `True` (ie., the passengers who didn't survive); again the fare column is selected and averaged.

In [5]:
query = df['survived'] == 1

df.loc[  query ,  'fare'  ].mean(),  df.loc[ ~query, 'fare'].mean()


(48.39540760233917, 22.117886885245877)

The query is a *comparison* across every entry in the `survived` column, which yields a `True` or a `False` for each row:

In [10]:
query


0      False
1       True
2       True
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Name: survived, Length: 891, dtype: bool

When used as a filter, a boolean column such as `query` selects the indexes of the rows which are `True`.

In [None]:
# the `True` row labels visible in the truncated output above (not the complete list)
query_index = [1, 2, 3, 887, 889]

When we use that query as an index, the rows corresponding to the `True` values are selected:

Exercise:

1) Create a query (a column of True/False entries) where the fare is greater than or equal to 500.

2) Use your query to select the rows where the fare is 500 or more and view the age column.

In [11]:
#Solution
df.loc[ df['fare'] >= 500 , 'age']


258    35.0
679    36.0
737    35.0
Name: age, dtype: float64

When using a comparison as index, pandas will run the comparison over every row and *keep the row indexes* where the comparison comes out `True`. 
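
The mechanics can be traced step by step on a made-up frame (the values are invented; `mask` and `kept_labels` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'fare': [7.25, 512.33, 8.05, 600.00],
    'age':  [22.0, 35.0, 35.0, 36.0],
})

# 1. The comparison runs over every row, giving one True/False per row
mask = toy['fare'] >= 500

# 2. The row labels where the comparison came out True
kept_labels = mask[mask].index.tolist()

# 3. Filtering with the mask keeps exactly those rows
ages = toy.loc[mask, 'age']

# Selecting by the explicit list of labels gives the same result
same_rows = toy.loc[kept_labels, 'age']

print(kept_labels)
print(ages)
```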

## What are the standard comparison operators?

* `==` is equal to
* `!=` is not equal to
* `>` is greater than
* `<` is less than
* `>=` is greater than or equal to
* `<=` is less than or equal to
* `df['col'].isin(values)` entries of the column are in `values` (eg., a list or set)
* EXTRA:
    * pandas has a wide variety of other comparison operators
    * eg., `df['embark_town'].str.contains('town$')`
        * "which `embark_town` end in `town` ?"

### Example

Locate the rows where the class is "in" `{First, Third}`, select the fare column, take the average.


`SELECT mean(fare) FROM df WHERE class IN ("First", "Third")`

Exercise:

1) Create a query that selects the class column and uses .isin(\['First', 'Third'\]) to make a column with True if a passenger is in 1st or 3rd class.

2) Use this query to find the mean fare paid by the passengers who were in 1st or 3rd class.

In [6]:
#Solution
df.loc[  
       df['class'].isin(['First', 'Third'])  # row index filter
    , 'fare'                                 # columns
].mean() # query


35.20807298444123

Average `fare` for second and third class passengers:

Exercise:

Find the mean fare for passengers who were not in first class.

Hint: Your query to narrow the rows down needs to have the class column not equal (!=) to 'First'.

In [14]:
#Solution
df.loc[  df['class'] != 'First', 'fare'].mean()


15.580054518518512

EXTRA:

Exercise:

1) Create a query that uses the 'embark_town' column with .str.contains('tow?n').

2) What does the regex pattern 'tow?n' match to apart from 'town' ?

In [10]:
#Solution
df['embark_town'].str.contains('tow?n')

0       True
1      False
2       True
3       True
4       True
       ...  
886     True
887     True
888     True
889    False
890     True
Name: embark_town, Length: 891, dtype: object
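
To see what the pattern does without the full dataset, here is a sketch on a short made-up Series of port names plus one missing entry. In a regex, `w?` means "zero or one `w`", so `'tow?n'` matches both `town` and `ton`. The `na=False` argument is one possible way to treat missing entries as non-matches.

```python
import pandas as pd

# Small illustrative Series, including a missing value
towns = pd.Series(['Southampton', 'Cherbourg', 'Queenstown', None])

# 'w?' makes the w optional: matches 'town' and 'ton'
optional_w = towns.str.contains('tow?n')

# '$' anchors the match to the end of the string
ends_in_town = towns.str.contains('town$')

# Count missing entries as False rather than leaving them missing
no_missing = towns.str.contains('tow?n', na=False)

print(optional_w.iloc[:3].tolist())
print(ends_in_town.iloc[:3].tolist())
print(no_missing.tolist())
```

Missing entries are neither `True` nor `False` by default, which is a likely reason the output above is reported with `dtype: object` rather than `bool`.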

## How do I combine comparisons?

In Python, the logical operators are `and`, `or`, `not`... these do not work with pandas (and many other data processing libraries).

They are not designed to work *across* a dataset, ie., down columns: to use them this way, you'd need to loop.

In [11]:
("@" not in "mburgess@qa.com") and (len("M") == 1) or ("London".isalpha())


True

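We must always bracket comparisons before combining them (if there is only one, brackets are optional). The operators `&` and `|` bind more tightly than comparisons such as `>=` and `==`, so without brackets Python groups the expression differently from what was intended.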
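
Both points can be checked on a made-up frame. The sketch below shows `and` failing on whole columns, the bracketed `&` working row by row, and the unbracketed form failing because of operator precedence.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'age': [30, 60, 70], 'survived': [1, 1, 0]})

# `and` asks each side for a single True/False; a whole column cannot give one
try:
    (toy['age'] >= 50) and (toy['survived'] == 1)
    and_worked = True
except ValueError:
    and_worked = False

# `&` combines the two comparisons row by row
both = (toy['age'] >= 50) & (toy['survived'] == 1)

# Without brackets, `&` is applied first: Python reads this as
# toy['age'] >= (50 & toy['survived']) == 1, a chained comparison,
# which fails for the same reason as `and`
try:
    toy['age'] >= 50 & toy['survived'] == 1
    unbracketed_worked = True
except ValueError:
    unbracketed_worked = False

print(and_worked, unbracketed_worked, both.tolist())
```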

### `&` $\equiv$ AND

In [19]:
(df['age'] <= 50) & (df['survived'] == 1)


0      False
1       True
2       True
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Length: 891, dtype: bool

In [20]:
df.loc[ 
      (df['age'] >= 50) & (df['survived'] == 1) # row index filter
    , 'fare'                                    # selecting columns
].mean() # query


65.58441111111112

### `|` $\equiv$ OR

In [20]:
(df['age'] >= 50) | (df['survived'] == 1)


0      False
1       True
2       True
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Length: 891, dtype: bool

In [22]:
df.loc[ (df['age'] >= 50) | (df['fare'] >= 100), 'survived'].mean()


0.5166666666666667

### `~` $\equiv$ NOT

In [23]:
~((df['age'] >= 50) | (df['fare'] >= 100))


0      True
1      True
2      True
3      True
4      True
       ... 
886    True
887    True
888    True
889    True
890    True
Length: 891, dtype: bool

In [24]:
df.loc[ ~((df['age'] >= 50) | (df['fare'] >= 100)), 'survived'].mean()


0.3631647211413748

## How do I compute the frequency of matches?

A mean across a column of booleans (ie., a comparison) counts `True` as `1` and `False` as `0`, and therefore gives the relative frequency (proportion) of rows which match the comparison.

In [25]:
((df['age'] >= 50) | (df['fare'] >= 100)).mean()


0.13468013468013468

In [26]:
results = [True, True, False, True]

sum(results)/len(results)

0.75
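
The same calculation on a pandas Series, as a small sketch with the values from the list above. `.sum()` gives the count of matches and `.mean()` the proportion; `.value_counts(normalize=True)` is another way to see both proportions at once.

```python
import pandas as pd

matches = pd.Series([True, True, False, True])

count = matches.sum()            # number of True values
proportion = matches.mean()      # share of True values

# Proportion of each value, as a plain dictionary
breakdown = matches.value_counts(normalize=True).to_dict()

print(count, proportion, breakdown)
```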

## Aside: In summary...


* `&` and
* `|` or
* `~` not

## Exercise (30 min)

* create your own notebook, and with the `titanic` dataset 

* building on any prior solutions...
    * compute *and* interpret
        * .info()
        * provide comments which describe the data quality based on `.info()`
* descriptive stats for all pairs of columns
    * .corr()
    * interpret & provide comments around which columns show association
    * what columns "make sense" correlated? can you explain why?

* select the first five rows (not using `head`) for:
    * all columns
    * age column
    * age and fare columns
    * select rows `[0, 3, 5]`
    * select the last five rows:
        * HINTS: the last row has an index of `len(df) - 1`
            * `len(df) - 5 : len(df) - 1`
        * HINTS: `:` means "all"
            * from the beginning to the end
* find the rates of survival for different age groups
    * `.loc[ ... , 'survived']`
    * `df['age'] <= 18`
    * `(df['age'] < ...) & (df['age'] > ...)`

## Extension

Start to create your own Pandas cheat sheet using the full official documentation:

* Using the Pandas user guide: https://pandas.pydata.org/docs/user_guide/10min.html
* You may like to explore more functions available for descriptive statistics: https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics
