## Pandas

pandas ('panel-data') is the main library for working with tabular data in Python on small data sets (as a rule of thumb, less than 1GB). Before learning how to read and write CSV/Excel files, we will go over the basics of pandas.

----

The main object you will work with in Pandas is a dataframe (`pd.DataFrame`).
A dataframe is a table, but it offers much more than just a matrix of values. 

![Anatomy of a dataframe](https://media.geeksforgeeks.org/wp-content/cdn-uploads/creating_dataframe1.png) 

([source](https://media.geeksforgeeks.org/wp-content/cdn-uploads/creating_dataframe1.png))



### Series

A dataframe is composed of columns, and each column is a series: a 1-D ndarray with axis labels. We can create a series from a list or an array of values.

In [None]:
import pandas as pd
import numpy as np


student_grades = pd.Series(
    data=np.random.normal(90, 1.5, size=5), 
    index=list('ABCDE'), name='student_grades'
)
print(student_grades, 
    student_grades.shape,
    student_grades.values,
    student_grades.name, sep='\n\n')
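
As a complement to the random example above, here is a minimal sketch of a series built from a plain list, with made-up grades chosen so the output is reproducible. It shows label-based lookup, position-based lookup, and boolean filtering.

```python
import pandas as pd

# Hypothetical grades, fixed so the result is reproducible
grades = pd.Series([88.5, 91.0, 79.5], index=['A', 'B', 'C'], name='grades')

by_label = grades['B']          # label-based lookup
by_position = grades.iloc[0]    # position-based lookup
above_80 = grades[grades > 80]  # boolean filtering keeps the labels

print(by_label, by_position, list(above_80.index), sep='\n')
```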



### DataFrame

A dataframe is a collection of series objects, known as columns. Unlike arrays, dataframes are potentially heterogeneous, as each column can have its own data type.

----

We will now create a dataframe without giving it any special column names (label-based identifiers for columns, axis 1) or row names (the index: label-based identifiers for rows, axis 0).

In [None]:
import string

# Create 5 columns of 20 values each, sampled from a standard normal distribution
random_numbers = pd.DataFrame(
    data=np.random.normal(size=(20, 5)))

# Add a 6th column that contains random strings
random_numbers[5] = np.random.choice(['dog', 'cat', 'bear', 'bird'],
                                     size=random_numbers.shape[0])

print(random_numbers,
      random_numbers.columns,
      random_numbers.index, sep='\n\n')



We can get some information on our dataframe using `df.info()`, e.g., the number of non-null values in each column, the column names and their data types.

In [None]:
random_numbers.info()

#### $\color{dodgerblue}{\text{Exercise}}$
Referring to columns using an integer index doesn't add much over arrays. This is why we can use column names.

Change the name of the dataframe columns using multiple ways. Print the new column names after each change to see what happened.
*   First using assignment on creation.
*   Second, update the column names by using the `pd.DataFrame.rename` method (e.g., change names, capitalization, etc.).


In [None]:
course_grades = pd.DataFrame(
    data=[
          # Specifying each column values for each row
          ('Python 101', 'Fall', 95, 2021), 
          ('Python 101', 'Spring', 85, 2020), 
          ('Python 101', 'Fall', 90, 2019), 
          ('Python 102', 'Fall', 95, 2021), 
          ('Python 102', 'Summer', 100, 2020), 
          ('Python 102', 'Fall', 90, 2019), 
    ], columns=['Course', 'Term', 'Average', 'Year'])

print(course_grades)

course_grades = course_grades.rename(
    dict(zip(course_grades.columns, 
             ['Course Name', 'Semester', 'Mean Grade', 'Date'])), 
    axis=1)

print(course_grades) # Changed, because we reassigned the result; `rename` alone returns a copy


The crux of the previous exercise is that in pandas you have to be aware of the effects of your actions. By default, many methods return a modified copy of the dataframe rather than changing it in place. Using `inplace=True` is a matter of choice, but there are [debates](https://github.com/pandas-dev/pandas/issues/16529) for and against it.
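
A minimal sketch of the copy-versus-reassignment behaviour, using a tiny made-up dataframe (the column names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

renamed = df.rename(columns={'a': 'A'})  # returns a new dataframe
print(list(df.columns))                  # the original is untouched
print(list(renamed.columns))

df = df.rename(columns={'b': 'B'})       # reassign to keep the change
print(list(df.columns))
```

Reassigning the result keeps the code explicit about which object holds the change.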

### Selection

In pandas you can select columns, rows or both in multiple ways. To demonstrate and practice it we will load an example dataset from a library that we'll get to know later.

In [None]:
import seaborn as sns

# mpg (miles per gallon) is a good data set for this section as it contains 
#  both numeric and string columns

mpg = sns.load_dataset('mpg')

print(mpg.info())

mpg.head() # shows the first five rows, to get a feel for the data

We can use square brackets to retrieve a single column (i.e., a series)

In [None]:
mpg['origin']

You can also use `mpg.origin` to get the same result. Although alluring for newcomers, this is not recommended:
* You cannot retrieve a column this way if it has spaces in it (`df.total price`).
* You cannot store the column name in another variable. (`x = 'col_name'; df.x`)
* You cannot retrieve a couple of columns together.

**Use square brackets syntax.**
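
The sketch below illustrates these points on a made-up dataframe (the column names `total price` and `count` are hypothetical). It also shows one further pitfall: a column that shares its name with a dataframe method is hidden by attribute access.

```python
import pandas as pd

orders = pd.DataFrame({'total price': [10.0, 12.5], 'count': [1, 2]})

col = 'total price'
prices = orders[col]   # works with spaces and with names stored in variables

# `count` is also a DataFrame method, so orders.count is the method,
# not the column; square brackets always return the column
is_method = callable(orders.count)
counts = orders['count']

print(prices.tolist(), is_method, counts.tolist())
```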

In [None]:
mpg[['model_year', 'weight']] # You can retrieve multiple columns in a new order

An important thing to note is that assigning to a column with square brackets
modifies the dataframe itself. Whether a *selection* is a view or a copy of the
underlying data depends on the operation and on your pandas version, so do not
rely on changes to a selected object being reflected in the original dataframe.

In [None]:
mpg['model_year'] += 1900 # Shorthand for mpg['model_year'] = mpg['model_year'] + 1900
mpg.head() # The original values are greater by 1900

There are four main ways to select rows and columns based on an index. We will cover only two of them. 

----

#### $\color{dodgerblue}{\text{Exercise}}$

`iloc` stands for integer-location. We know that a dataframe is in some sense a collection of NumPy arrays, and we know how to index 2-D arrays. So we know how to use iloc. 

Fill the code below to select every third row (axis=0) and every second column (axis=1) in the dataframe.

In [None]:
mpg.iloc[2::3, 1::2]
mpg.iloc[0, 1] # This returns the first row, second column

# Booleans also work (clunky for demo purposes).
#  6th row and below, mid-three columns
mpg.iloc[5:, [False, False, False, True, True, True, False, False, False]] 

`iloc` is flexible, but it only accepts positions (integers, slices, or boolean masks), not labels. For more info take a look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html).
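
To make the slicing easy to check by eye, here is a small sketch on a 6x4 table of consecutive integers (the column names are placeholders):

```python
import numpy as np
import pandas as pd

# A 6x4 table holding the integers 0..23
df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('wxyz'))

every_third_row = df.iloc[::3, 1::2]   # rows 0 and 3; columns 'x' and 'z'
single_cell = df.iloc[0, 1]            # first row, second column

print(every_third_row)
print(single_cell)
```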

----

The next method of selection is `loc`, selection by label (also by boolean indexing). 

In [None]:
mpg.loc[:, ['mpg', 'model_year']] 

In [None]:
mpg.set_index(['name', 'origin'])

Up until now, we didn't make use of the dataframe row index (the row labels). They were always integers, so we could select them just as with `iloc`. If required, the index can be set to another column, for example one containing strings.

In [None]:
mpg.set_index(['name']).loc[('toyota corolla')]

However, a much cleaner way to put this in the current case (and most) would be to just use `loc` with boolean indexing.

In [None]:
# mpg['name'] == 'toyota corolla' # returns a series of booleans
mpg.loc[mpg['name'] == 'toyota corolla']

This is very specific and flexible, but it can require slightly different syntax from what we know (see below).


In [None]:
mpg.loc[
        # Select non-japanese models
        (mpg['origin'] != 'japan') 
        # Models from 1975 or earlier (`~` negates the condition)
        & ~(mpg['model_year'] > 1975) 
        # Find if the model name contains 'volvo' or 'ford'
        & (mpg['name'].str.contains('volvo|ford')) 
    ] 

You've noticed that we didn't use the regular `and` and `not` keywords when chaining conditions. Here you are required to use bitwise operators.

The short version is:
* When chaining conditions use parentheses. 
* Instead of `and` use `&`
* Instead of `or` use `|`
* Instead of `not` use `~`

If you want the long version, go [here](https://towardsdatascience.com/bitwise-operators-and-chaining-comparisons-in-pandas-d3a559487525). 
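
A short sketch of why the keywords fail, on a made-up dataframe: `and` asks for a single truth value of a whole series, which pandas refuses to guess, while `&` and `~` work element by element.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 5, 10], 'label': ['a', 'b', 'a']})

# `and` needs one truth value per operand, which is ambiguous for a series
try:
    df[(df['x'] > 2) and (df['label'] == 'a')]
    raised = False
except ValueError:
    raised = True

# The element-wise version uses & and parentheses
selected = df[(df['x'] > 2) & (df['label'] == 'a')]
negated = df[~(df['x'] > 2)]

print(raised, selected['x'].tolist(), negated['x'].tolist())
```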

### Setting and mutating

There are multiple ways by which you can update existing values in the dataframe or add new ones. 

#### $\color{dodgerblue}{\text{Exercise}}$
Setting with enlargement is a method in which we index nonexistent labels and set their values. Create a new column called 'kpg' (kilometers per gallon; mpg multiplied by 1.609).

In [None]:
mpg['kpg'] = mpg['mpg'] * 1.609
mpg.head()

The same goes for adding new rows. 

In [None]:
# We are using some null values for Lada, as we don't have the mpg data
mpg.loc[mpg.shape[0]] = (
    np.nan, 4, 95.69, 78, 2535.32, 23, 1977, 'soviet union', 'Lada Niva', np.nan)
mpg.tail()

### GroupBy

"Group by" is a way to do one or more of the following steps: 
* Split the dataframe into groups.
* Apply a function to each group (e.g., calculate summary statistics).
* Recombine the results into a dataframe.
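
The three steps can be sketched by hand on a tiny made-up table and compared with the single `groupby` call that performs them together:

```python
import pandas as pd

df = pd.DataFrame({
    'origin': ['usa', 'japan', 'usa', 'japan'],
    'mpg': [18.0, 30.0, 22.0, 34.0],
})

# Split: one sub-dataframe per group; apply: the mean of each; combine: a series
manual = pd.Series(
    {name: group['mpg'].mean() for name, group in df.groupby('origin')})

# The same three steps in a single call
built_in = df.groupby('origin')['mpg'].mean()

print(manual.to_dict(), built_in.to_dict())
```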



In [None]:
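# numeric_only=True skips string columns such as `name`, which recent pandas
# versions refuse to average
mpg.groupby(['origin', 'cylinders']).mean(numeric_only=True).round(2)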

Let's break it down.

`groupby` takes the name(s) of the column(s) whose values identify the groups, i.e., the names of each group. 

In [None]:
gb = mpg.groupby('origin')
gb.groups['japan'] # Returns the indices from the original dataframe

We can grab a specific group from the groupby object:

In [None]:
usa = gb.get_group('usa')
usa.head()

We can apply all sorts of transformations or aggregations on the group.

In [None]:
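# Keep only numeric columns; recent pandas versions raise on string columns
numeric_usa = usa.select_dtypes('number')
numeric_usa.agg('mean') # or numeric_usa.agg(my_unique_agg_function)
usa.mean(numeric_only=True)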

And we can iterate over groups, which is a common matplotlib-pandas idiom.

In [None]:
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 2, sharex=True, sharey=True)

for (group_name, group_df), ax in zip(gb, axs.flat):
    ax.scatter(*group_df[['acceleration', 'horsepower']].values.T)
    ax.set_title(group_name)
    ax.annotate(f'r({group_df.shape[0]}) = ' +
        f"{group_df[['acceleration', 'horsepower']].corr().min().iloc[0]:.2f}",
        xy=(0.525, 0.9,), xycoords='axes fraction')
    


Looping over groups offers more control than the built-in plotting in pandas, which is more useful for simple exploration. See [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

In [None]:
ax = mpg.plot.scatter('acceleration', 'horsepower', )
ax.annotate(f'r({mpg.shape[0]}) = ' +
        f"{mpg[['acceleration', 'horsepower']].corr().min().iloc[0]:.2f}",
        xy=(0.525, 0.9,), xycoords='axes fraction')

OK. So we have our aggregated dataframe, and we know what each step in making it does.

#### $\color{dodgerblue}{\text{Exercise}}$

Aggregate the mean and standard deviation of the dataframe, by `origin` and `cylinders`.

In [None]:
# Drop the string column `name`, which cannot be averaged
grouped = mpg.drop(columns='name').groupby(
    ['origin', 'cylinders']).agg(['mean', 'std']).round(2)
grouped.head()

The result is a `MultiIndex`ed data frame. Here are the basics. See more [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html).

----

Simple indexing, returning a column.

In [None]:
grouped['mpg']['mean']

The indexing with MultiIndex can get very complicated. 

In [None]:
grouped.loc[(slice('europe', 'japan'), slice(None)), ('mpg', 'mean')]
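
For simpler cases, here is a minimal sketch of three common MultiIndex lookups on a made-up series of group means: a full key, the outer level only, and the inner level only via `xs`.

```python
import pandas as pd

df = pd.DataFrame({
    'origin': ['europe', 'europe', 'japan', 'usa'],
    'cylinders': [4, 6, 4, 8],
    'mpg': [28.0, 20.0, 32.0, 15.0],
})
means = df.groupby(['origin', 'cylinders'])['mpg'].mean()

one_cell = means.loc[('europe', 4)]         # full key: a single value
one_origin = means.loc['europe']            # outer level only
four_cyl = means.xs(4, level='cylinders')   # inner level only

print(one_cell, one_origin.to_dict(), four_cyl.to_dict())
```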

Note that the `origin` and `cylinders` columns are now missing from the dataframe; they were turned into row indices (so far we have only seen integer indices). We have shown how to use them in indexing operations. But sometimes we want to return them to the dataframe (e.g., to use them in further analysis). 

In [None]:
 # Note that this returns a new dataframe, we can also use `inplace` argument.
grouped.reset_index(level='origin')

Or skip this in the first place.

In [None]:
mpg.groupby(['origin'], as_index=False).mean(numeric_only=True)

### Transform

Often we would want the aggregation operation to return a data structure that has the same dimensions as the original. For example, when we want to add summary statistics of each group or subject (e.g., think of an experiment with many trials per participant). 

----


`assign` is a function that returns a new dataframe with an additional column, and can be used for elegant chaining.

In [None]:
mpg.assign(
    weight_by_origin=mpg.groupby(
        'origin', sort=False)['weight'].transform('mean')).tail()
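
To make the difference in dimensions concrete, here is a sketch with made-up reaction times (the column names `subject` and `rt` are hypothetical). An aggregation returns one row per group, while `transform` returns one row per original row, which makes it easy to center each trial on its subject's mean.

```python
import pandas as pd

df = pd.DataFrame({
    'subject': ['s1', 's1', 's2', 's2', 's2'],
    'rt': [400.0, 500.0, 300.0, 350.0, 400.0],
})

per_group = df.groupby('subject')['rt'].mean()           # one row per group
per_row = df.groupby('subject')['rt'].transform('mean')  # one row per trial

# Typical use: center each trial on its subject's mean
df = df.assign(rt_centered=df['rt'] - per_row)

print(len(per_group), len(per_row), df['rt_centered'].tolist())
```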

Note that we have `NaN` values in some of the columns, especially `horsepower`. 



In [None]:
mpg.info()

One way of imputation is to fill the missing values with some central tendency measure. We can do it with the mean or median, for example.



#### $\color{dodgerblue}{\text{Exercise}}$

Fill in the code below, returning the `horsepower` column where missing values will be filled with the dataset median for the column. 

In [None]:
mpg['horsepower'].fillna(mpg['horsepower'].agg('median'))

If we want to fill the missing values in the column using the mean of a specific group, here is one option. 

Note the use of the `values` attribute. This assigns a plain NumPy array, so the values are set by position rather than aligned on the index. 

----

`describe` is a method to get a quick summary of the different columns. 

In [None]:
mpg['horsepower'] = mpg['horsepower'].fillna(mpg.groupby('cylinders')['horsepower'].transform(
    'mean')).values
mpg.describe().round(2)
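
The group-wise imputation is easier to verify on a tiny made-up table, where each missing value should receive the mean of its own `cylinders` group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cylinders': [4, 4, 4, 8, 8],
    'horsepower': [80.0, np.nan, 100.0, 150.0, np.nan],
})

# transform returns the group mean aligned to every row (NaNs are skipped)
group_means = df.groupby('cylinders')['horsepower'].transform('mean')
df['horsepower'] = df['horsepower'].fillna(group_means)

print(df['horsepower'].tolist())
```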

### I / O

So far we either created our dataframes by hand, or imported them from a built-in dataset. However, usually you would be working on files.

Now is a good time to tell you that Colab runs on a Linux-based machine. As with any computer, we have folders. 

We can use the exclamation mark to run commands on the shell ("Command prompt") of our current machine. This will be very useful later when we get to installing new libraries, which are not already installed on Colab. 

Here are the current folder contents, and the contents of the `sample_data` folder that Colab offers us.

In [None]:
! dir . # Reveals the files in the current folder

In [None]:
! dir ./sample_data # Reveals the files in the folder below the current

We see that Colab offers us two famous datasets, "mnist" and "california housing". The files are split into training and testing datasets, so we can easily train a machine learning model on the training set and test it on the test set. 

We will use these files to demonstrate how we read and write data to and from files.

#### $\color{dodgerblue}{\text{Exercise}}$

Fill in the code below to load the california datasets. 

In [None]:
train = pd.read_csv('sample_data/california_housing_train.csv')
test = pd.read_csv('sample_data/california_housing_test.csv')

Here we combine the two files, by concatenating the two dataframes. 

In [None]:
df = pd.concat(
    [train, test]
)

## `train.append(test)` was another option in older pandas versions,
# but `DataFrame.append` was removed in pandas 2.0

#### $\color{dodgerblue}{\text{Exercise}}$

Fill in the code below to save the new dataframe into a CSV file named 'california_combined', and place it in the same directory as the original files. 

In [None]:
df.to_csv('sample_data/california_combined.csv', index=False)

Now let's see if we saved it correctly. 

In case you are not using Colab, Jupyter Notebook or a similar tool, you might want to use some module to view files and folders on disk from within Python. One out of many options here would be:

In [None]:
import glob # Elegantly filter the list of files in the folder
glob.glob('sample_data/*.csv')
# glob.glob('sample_data/[cali]*.csv')

Pandas can work with many [other formats](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) besides CSV, like Excel, SPSS, Stata, and more. The process is pretty much the same as we did here.  
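
For readers working outside Colab, here is a self-contained sketch of the write-then-read round trip. It uses a temporary folder and a made-up dataframe, so the file name and data are assumptions rather than part of the workshop files.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'city': ['Haifa', 'Eilat'], 'temp_c': [24.5, 31.0]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.csv')  # hypothetical file name
    # index=False keeps the row labels out of the file; otherwise they
    # come back as an extra unnamed column when the file is read
    df.to_csv(path, index=False)
    loaded = pd.read_csv(path)

print(list(loaded.columns), loaded['temp_c'].tolist())
```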

### Misc.



Often times we want to view the individual values in a specific column or the whole dataframe. `unique` and `value_counts` are useful here. 

In [None]:
print(mpg['name'].value_counts(), 
      mpg['name'].unique(), sep='\n\n')

Note that to do the same for several columns at once we need to use the `apply` method, which iterates over the columns (or rows). 

In [None]:
print(mpg[['cylinders', 'origin']].apply(pd.Series.unique))

## A more common form would be 
mpg[['cylinders', 'origin']].apply(lambda s: s.unique())



#### query


`query` is a row-selection method of pandas dataframes that can be very elegant, especially after long chaining operations. 

Say we want to take all models weighing more than 2000 that were made in the USA. Then we keep only the summaries of the groups that make less than 25 miles per gallon. 

Consider the two following options:

In [None]:
# This is verbose and error prone
# We need to groupby etc. inside loc to get the same dimensions
mpg.loc[(
    mpg['weight'] > 2000) & (mpg['origin'] == 'usa')].groupby(
        'cylinders').median(numeric_only=True).loc[mpg.loc[(mpg['weight'] > 2000)
         & (mpg['origin'] == 'usa')].groupby(
             'cylinders').median(numeric_only=True)['mpg'] < 25]


In [None]:
# This is succinct, readable and easy to debug
mpg.query('weight > 2000 & origin == "usa"'
    ).groupby('cylinders').median(numeric_only=True).query('mpg < 25')
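
`query` can also refer to regular Python variables by prefixing them with `@`. A minimal sketch on a made-up table, checked against the equivalent `loc` expression (`min_weight` is a hypothetical variable name):

```python
import pandas as pd

df = pd.DataFrame({
    'weight': [1800, 2500, 3200],
    'origin': ['japan', 'usa', 'usa'],
})

min_weight = 2000  # a regular Python variable

with_loc = df.loc[(df['weight'] > min_weight) & (df['origin'] == 'usa')]
with_query = df.query('weight > @min_weight and origin == "usa"')

print(with_query['weight'].tolist(), with_loc.equals(with_query))
```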


#### where and mask

`where` is another method, which returns a copy of the dataframe, setting to `NaN` every row or cell that is not True according to the filter expression. This is useful if you want to filter rows or columns but get an object which has the same dimensions. 

In [None]:
mpg.where(mpg['name'] == 'buick skylark 320')

The inverse of `where` is `mask`.

In [None]:
mpg.mask(mpg['name'] == 'buick skylark 320')
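
On a short made-up series the two methods are easy to compare side by side. The sketch also uses the `other` argument, which replaces the filtered-out cells with a value instead of `NaN`.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

kept = s.where(s > 2)              # NaN where the condition is False
hidden = s.mask(s > 2)             # NaN where the condition is True
clipped = s.where(s > 2, other=0)  # replace with a value instead of NaN

print(kept.tolist(), hidden.tolist(), clipped.tolist())
```

In all three cases the result has the same length as the original series.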

### Reshaping

Reshaping is the act of changing the structure of a dataframe, like turning rows into columns and vice versa (e.g., "Pivot table"). 

As with any task, Pandas offers a variety of reshaping options. Here are the basics.

In [None]:
tips = sns.load_dataset('tips')
tips.head()

In [None]:
pivotted = pd.pivot_table(tips, 
               values=['total_bill', 'tip'], index=['smoker', 'time'], 
               columns='size',
               aggfunc='sum') # Can also be median, your own function, etc.

pivotted

Reshaping a wide dataframe to long can be achieved using `stack`.

In [None]:
tips.groupby(['smoker']).mean(numeric_only=True).stack().reset_index().rename(
    columns={'level_1': 'Variable', 0: 'Mean Value'}
)
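
A minimal sketch of what `stack` does, on a made-up two-row summary table: the column labels move into an inner level of the row index, and `unstack` moves them back.

```python
import pandas as pd

wide = pd.DataFrame(
    {'tip': [3.0, 2.5], 'total_bill': [20.0, 18.0]},
    index=pd.Index(['No', 'Yes'], name='smoker'))

long_form = wide.stack()      # columns become an inner row-index level
back = long_form.unstack()    # unstack reverses the operation

print(long_form.index.nlevels,
      long_form.loc[('Yes', 'tip')],
      back.loc['No', 'total_bill'])
```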

Crosstabbing is another common operation. 

In [None]:
pd.crosstab(tips['smoker'], tips['time'])

That's it for pandas; we have merely scratched the surface. We didn't cover some very powerful features like windowed operations (e.g., cumulative/rolling sum), time series and categorical data. To continue on your own, head over [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html). 

## Seaborn

Seaborn is a visualization library built on top of Matplotlib. 

In [None]:
import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()

Seaborn offers many types of plots. Most can be drawn straight onto your subplots object and using variables taken straight from a data frame. 

In [None]:
fig, axs = plt.subplots(1, 2)

ax1 = sns.scatterplot(
    data=tips, x='total_bill', y='tip', hue='smoker', 
    ax=axs[0],
    legend=None,
)

ax1.set(xlabel='Total Bill ($)', ylabel='Tip ($)')



ax2 = sns.pointplot(
    data=tips, x='day', y='tip', hue='smoker',
    ax=axs[1],
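    # `err_style` is not a pointplot argument, so it is omitted here;
    # in seaborn >= 0.13, `linestyle='none'` replaces `join=False`
    join=False, dodge=0.3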
)

ax2.set(xlabel='Day', ylabel='Tip ($)')

fig.tight_layout()

#### $\color{dodgerblue}{\text{Exercise}}$

Generate a figure with subplots in a 3X1 array.
Using the `tips` dataset, draw the following from topmost to bottommost.

* A horizontal boxplot (`sns.boxplot`) of the number of guests in a party (`size`) on each day. Separate the boxes into different hues based on the sex of the person paying the waiter.  
* A histogram (`sns.histplot`) of the relative size of `tip` to `total_bill`, with the color of the histograms based on whether there is a `smoker` in the party. 
* A line (`sns.lineplot`) showing the trend in `total_bill` across the values of `size`. Set the style of the lines to whether there is a smoker in the party. 

In [None]:
fig, axs = plt.subplots(3, 1, figsize=(4, 6))

# Numeric variable on x and categorical on y gives horizontal boxes
sns.boxplot(ax=axs.flat[0], data=tips, x='size', y='day', hue='sex')
axs.flat[0].legend().remove()

sns.histplot(ax=axs.flat[1], data=tips, x=tips['tip'] / tips['total_bill'], 
             hue='smoker', alpha=0.5)

sns.lineplot(data=tips, ax=axs.flat[2], x='size', y='total_bill', 
             style='smoker')

The plots we used so far are axes-level plots. They either accept an `ax` argument to be plotted on, or create and return a new ax if not given one. 



In [None]:
df = sns.load_dataset("penguins")
df.head()

In [None]:
ax = sns.kdeplot(data=df, x='bill_length_mm', y='body_mass_g', 
                 hue="sex", fill=True, alpha=0.5)
type(ax)

# If we want a handle to the Figure object that Seaborn did not give us
fig = plt.gcf()
fig.set_facecolor('grey')

ax == fig.axes[0]



Seaborn also can generate figure level plots. 

They do not accept an `ax` argument, and are drawn into their own object (e.g., a `PairGrid`, `JointGrid` or `FacetGrid`), which creates and manages its own Matplotlib Figure.

In [None]:
pairplot_grid = sns.pairplot(df, hue="species")
print(pairplot_grid.axes) # Just like a Figure

In [None]:
df.columns

Another illustrative multi-grid plot. 

In [None]:
marginal_hist = sns.jointplot(data=df, x='bill_length_mm',
                              y='bill_depth_mm',
                              hue="species")


In [None]:
# numeric_only=True skips the string columns (species, island, sex)
sns.heatmap(df.corr(numeric_only=True), annot=True)

To sum up, most multi grid plots are very useful for exploration of data. 

One last thing to learn about Seaborn is its `FacetGrid` object. While offering slightly less control than directly using subplots and looping over groups from a GroupBy operation, it can produce nice graphs quickly. See more [here](https://seaborn.pydata.org/tutorial/axis_grids.html#grid-tutorial).

In [None]:
from matplotlib import pyplot as plt
plt.style.use('ggplot') # Set a nice scheme

g = sns.FacetGrid(tips, col="sex", row="smoker", hue='day')
g.map(sns.scatterplot, "total_bill", "tip", alpha=.7)
g.add_legend()

# Statistical analyses using Python

There are several libraries in Python that make common statistical tests accessible. 

## Statsmodels

Statsmodels is one of the major statistics libraries in Python (there is a slightly less rich statistics module under `scipy`, see [here](https://docs.scipy.org/doc/scipy/reference/reference/stats.html)). We will touch very briefly on two common analyses using statsmodels.



#### Independent Samples t-test


Here we will test a simple hypothesis about the penguins dataset. As you can see, the `bill_length_mm` attribute is different between the three species. 

In [None]:
penguins = sns.load_dataset('penguins')

sns.jointplot(data=penguins, x='bill_length_mm',
                              y='bill_depth_mm',
                              hue="species")


#### $\color{dodgerblue}{\text{Exercise}}$

The Statsmodels t-test function receives two arrays, x1 and x2, and some other arguments. 

Use the Penguin data set to test the hypothesis that the bill length of the `Adelie` species is smaller than that of the `Chinstrap` group.

Sample 8 observations from each group. 

Use a significance level of 0.001.

Print out the results using `str.format` method or an f-string.

In [None]:
import seaborn as sns
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from statsmodels.stats.weightstats import ttest_ind

from statsmodels.formula.api import ols



penguins = sns.load_dataset('penguins')

n = 8
dependent_var = 'bill_length_mm'

## There are three unique species. Remove the one that we don't need. 
# penguins['species'].unique()

adelie_data = penguins.query('species == "Adelie"').sample(
    n=n, random_state=42)[dependent_var]
chinstrap_data = penguins.query('species == "Chinstrap"').sample(
    n=n, random_state=42)[dependent_var]

t, p_value, dof = ttest_ind(adelie_data, chinstrap_data, 
                           alternative='smaller')

print(f't({dof}) = {t:.2f}, p-value ' + 
      (f'= {p_value:.3f}' if p_value >= 0.001  else '< .001')
      )
p_value < .001
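
To see what goes into the statistic, here is a hand computation of the pooled-variance t statistic with NumPy. It assumes the equal-variance form (the `usevar='pooled'` default of `ttest_ind`) and uses made-up samples chosen so the arithmetic is easy to follow. It is a sketch of the formula, not a replacement for the library call.

```python
import numpy as np

# Hypothetical samples, chosen so the arithmetic is easy to follow
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 3.0, 4.0, 5.0, 6.0])

n1, n2 = len(x1), len(x2)
dof = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / dof
standard_error = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t = (x1.mean() - x2.mean()) / standard_error
print(f't({dof}) = {t:.2f}')
```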

#### Linear regression

Linear regression is another common statistical analysis. Let's use it to predict the bill length using body mass and sex (as a dummy variable). Then we will see how our model generalizes in prediction. 

In [None]:
train, test = train_test_split(penguins.dropna(), test_size=0.2, random_state=10)

Let's first look at our data. 

In [None]:
fig, axs = plt.subplots(1, 2, sharex=True, sharey=True)
for group_name, group_df, ax in zip(['Train', 'Test'], [train, test], axs.flat):
    ax2 = sns.scatterplot(data=group_df, x='body_mass_g',
                              y='bill_length_mm',
                              hue="sex", ax=ax)
    ax.set_title(group_name)
axs[0].legend().remove()
axs[1].legend(bbox_to_anchor=(1.1, 1.05), title='sex')


In [None]:
reg = ols('bill_length_mm ~ C(sex) * body_mass_g', data=train.dropna()).fit() 
print(reg.summary())

We see that only the `body_mass_g` variable is a statistically significant predictor for `bill_length_mm`, while neither `sex` nor their interaction is significant.

Now let's use our regression model to compare the fit for the training and testing samples. 

In [None]:
train['predicted_values'] = reg.predict()
test['predicted_values'] = reg.predict(test)

fig, axs = plt.subplots(1, 2, sharex=True, sharey=True)
for group_name, group_df, ax in zip(
    ['Train', 'Test'], [train, test], axs.flat):

    sns.scatterplot(data=group_df, x='bill_length_mm',
                              y='predicted_values', alpha=1,
                              hue="sex", ax=ax)
    
    y_min, y_max = group_df['predicted_values'].describe()[['min', 'max']].values

    ax.set(
           xlabel='Actual Bill Length',
           ylabel='Predicted Bill Length')
    
    x_ref = y_ref = np.linspace(y_min, y_max, 100)
    ax.plot(x_ref, y_ref, color='black', linewidth=1)
    
    r_square = (
        group_df[['bill_length_mm', 'predicted_values']].corr().pow(2).min().iloc[0])
    ax.set_title(f'{group_name} - $r^{2}$ = {r_square:.2f}')

axs[0].legend().remove()
axs[1].legend(bbox_to_anchor=(1.1, 1.05), title='sex')

## Installing libraries

To install packages in Python you need to use a package manager. Luckily, Colab comes with `pip`, the package installer for Python. A package manager keeps track of which packages you have in your current environment, and installs or updates packages accordingly when you want to install a new one. 


In [None]:
# To get a list of the installed libraries
! pip list

In [None]:
!pip install pingouin


### Pingouin

We just installed [Pingouin](https://pingouin-stats.org/), a Pandas based library useful for many common statistical tests. 


#### $\color{dodgerblue}{\text{Exercise}}$

Import `pingouin` and load the `attention` dataset from Seaborn. 
Conduct a mixed ANOVA (one between-group factor and one repeated-measures factor) using Pingouin, with the following parameters:

* `subject` is the participant ID.
* `attention` is the between group factor.
* `solutions` is the within group (repeated) factor. 
* `score` is the dependent variable. 

In [None]:
import pingouin as pg
import seaborn as sns
df = sns.load_dataset('attention')

In [None]:
pg.mixed_anova(
    data=df, 
    dv='score', 
    between='attention', 
    within='solutions',
    subject='subject',
    effsize="np2" # Partial eta-square effect size
    )



### robusta

**[robusta](https://eitanhemed.github.io/robusta/_build/html/index.html)** is a statistical hypothesis testing library for Python that I am currently developing. It is based on an interface between R and Python. Here is a [demo](https://colab.research.google.com/drive/1jmwYpEGcpFr4CF6ZA5HMiQ2LcHbZqzO_?usp=sharing) of the current state. 
If we have time we can install it later, when we will be learning about local installations of Python. 



# Local Python installation

If you want to install Python locally, there are many ways you can go about it. Today we will look at one that would fit most academic researchers that intend on using existing Python tools for data analysis. 

We will install JupyterLab. JupyterLab is a notebook interface for working with Python (and some other languages, such as R and Julia). 

* [Download](https://github.com/jupyterlab/jupyterlab-desktop#download) and install JupyterLab App.



# Using your own Google Drive with Colab

Sometimes we want to use Google Drive to load or save files, instead of using Python locally. 

First you need to connect your Google Drive. We need to import the `drive` module. 

In [None]:
# Some additional imports just for the demo.
import os
import matplotlib.pyplot as plt
import numpy as np

# What is actually essential - import the drive module 
from google.colab import drive

Now we need to connect Drive. We are mounting it as a folder in the current machine. 


The current folder prior to mounting.

In [None]:
os.listdir('.') # Show the contents of the current directory

# This can also be done with the following line, when uncommented
#!ls

Run this and follow the instructions. 

In [None]:
drive.mount('./drive') # Mount drive in the current directory

The contents of the current folder after mounting.

In [None]:
os.listdir('.')

For the sake of the demo create a new folder in your Google Drive using the following. 

In [None]:
new_folder_in_drive = 'drive/MyDrive/python_workshop/example_directory'

# Try to create the directory
os.makedirs(new_folder_in_drive,
          exist_ok=True) 

# Change the current directory to be the new folder
os.chdir(new_folder_in_drive)
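
The effect of `exist_ok=True` can be tried without Drive. The sketch below uses a temporary folder as a stand-in for the mounted Drive path, so the nested path is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Hypothetical nested path, standing in for the Drive folder
    target = os.path.join(base, 'python_workshop', 'example_directory')

    os.makedirs(target, exist_ok=True)   # creates both levels
    os.makedirs(target, exist_ok=True)   # second call does nothing

    try:
        os.makedirs(target)              # the default is exist_ok=False
        raised = False
    except FileExistsError:
        raised = True

    created = os.path.isdir(target)

print(created, raised)
```

With `exist_ok=True` the cell can be re-run safely, which is convenient in a notebook.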

Now we will plot some data and save it in our new Drive folder. 

In [None]:
# Create some figure
fig, ax = plt.subplots()

# Generate some random data
a = np.random.random((16, 16))

# Plot the data
ax.imshow(a, cmap='hot', interpolation='nearest')

# Save the new plot
fig.savefig('random_heatmap.png')


In [None]:
## To end the session uncomment and run the following line
#drive.flush_and_unmount()

These are the essentials of using your Google Drive on Colab. Here is [additional info](https://colab.research.google.com/notebooks/io.ipynb).