## DataFrames with Pandas (Data Containers IV)

- `pd.DataFrame({})` produces a DataFrame from a dictionary
- `pd.DataFrame(dict(list(zip([],[]))))` produces a DataFrame from a list of names and a list of lists (one inner list per column)

In [None]:
import pandas as pd
import numpy as np

In [None]:
foo1 = [True, False, False, True, True, False]
foo2 = ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"]
foo3 = [13, 88, 1233, 55, 233, 18]

In [None]:
# We have already imported pandas as pd
# use a dictionary to get a data frame
foo_df = pd.DataFrame({'healthy': foo1, 
                       'tissue': foo2, 
                       'quantity': foo3})

In [None]:
foo_df

**Exercise 4.1**

In [None]:
cities = ['Munich', 'Paris', 'Amsterdam', 'Madrid', 'Istanbul']
dist = [584, 1054, 653, 2301, 2191]

In [None]:
distDictDF = pd.DataFrame({"City": cities,
                           "Distance": dist})
distDictDF

#### From Lists (just as an example)

In [None]:
# names
list_names = ['healthy', 'tissue', 'quantity']

# columns are a list of lists
list_cols = [foo1, foo2, foo3]
list_cols

In [None]:
zip_list = list(zip(list_names, list_cols))
zip_list

In [None]:
zip_dict = dict(zip_list)
zip_dict # True dictionary

In [None]:
zip_df = pd.DataFrame(zip_dict)
zip_df

### Working with DataFrames

In [None]:
foo_df['healthy'] #indexing by name, an individual column "Series", 1-dim array

In [None]:
foo_df[['healthy']] # DataFrame with 1 column

In [None]:
foo_df.healthy # Get the Series using a `.` notation

In [None]:
# Add a new column and populate it with the value (which is recycled, i.e. "broadcast" over the length of the Series)
foo_df['new'] = 0
foo_df

In [None]:
# But this doesn't work: dot notation sets a plain Python attribute on the object, not a new column
foo_df.new2 = 4
foo_df
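The difference between the two assignment styles can be checked directly. This is a minimal sketch on a small stand-in frame (`demo_df` is a hypothetical name, not part of the notebook): attribute assignment leaves the column list untouched, while bracket assignment adds the column.

```python
import pandas as pd

# Hypothetical two-row stand-in for foo_df
demo_df = pd.DataFrame({'tissue': ['Liver', 'Brain'], 'quantity': [13, 88]})

demo_df.new2 = 4                      # stored as an attribute of the object
print('new2' in demo_df.columns)      # False

demo_df['new2'] = 4                   # bracket notation creates a real column
print('new2' in demo_df.columns)      # True
```

Dot notation is therefore best kept for reading existing columns; use brackets whenever a column is created.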

# Mini-case study - exercise 4.8

In [None]:
import pandas as pd
mtcars = pd.read_csv('../00-data/mtcars.csv')
mtcars.head()

1. Calculate the correlation between `mpg` and `wt` and test if it is significant.

In [None]:
# There is no standalone corr() function; corr() is a method of DataFrame and Series:
# corr(mtcars['mpg'], mtcars['wt'])

In [None]:
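# A correlation matrix of the numeric columns
# (numeric_only=True skips text columns such as the model name; needs pandas >= 1.5)
cor_mat = mtcars.corr(numeric_only=True)
cor_mat.loc['mpg', 'wt'] # select the mpg/wt entry by label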

In [None]:
# Targeted correlation between two Series in the DataFrame
mtcars['mpg'].corr(mtcars['wt'])

In [None]:
# Using NumPy
r = np.corrcoef(mtcars['mpg'], mtcars['wt'])
r

In [None]:
# import pingouin as pg 
# pg.corr(x=mtcars['mpg'], y=mtcars['wt'])

In [None]:
import scipy.stats as stats
stats.pearsonr(mtcars['mpg'], mtcars['wt'])

2. Visualize the relationship in an XY scatter plot (bonus points for a regression line).


In [None]:
# seaborn is needed for the regression plots below
import seaborn as sns
# sns.scatterplot(x='wt', y='mpg', data=mtcars)

import matplotlib.pyplot as plt # Low level plotting
y = mtcars.mpg
x = mtcars.wt
plt.scatter(x,y)
# plt.show()

In [None]:
sns.regplot(x="wt", y="mpg", data = mtcars, ci=None)

In [None]:
sns.lmplot(x="wt", y="mpg", data=mtcars, ci=None)

In [None]:
# from sklearn.linear_model import LinearRegression
# model = LinearRegression().fit(mtcars[['wt']], mtcars['mpg']) # X must be 2-dimensional

In [None]:
from statsmodels.formula.api import ols
model = ols("mpg ~ wt", data=mtcars)
results = model.fit()
results.summary()

For plotting:

- Continuous vs continuous (i.e. scatter plot), but also
- Continuous vs categorical, or 
- Categorical vs continuous

**Y-Axis**
- Dependent variable (i.e. _dependent_ on the independent variable)
- Response (i.e. the _outcome_)
- f(x) (i.e. y _as a function of_ x)

**X-Axis**
- Independent variable (i.e. decided upon by the experimenter)
- Predictor (a variable that _predicts_ a specific response, i.e. y)

3. Convert `wt` column from pounds to kg (bonus points for adding it to the DataFrame).

In [None]:
# Avoid making unnecessary new objects that are separate from your DataFrame
#twocol = mtcars[['mpg','wt']]
#twocol['kilo'] = twocol['wt']/2.2046 # use wt here; y holds mpg
#twocol

In [None]:
mtcars["wt"]

In [None]:
mtcars["wt"].apply(lambda x: x/2.2046)

In [None]:
mtcars['wt_kg'] = mtcars['wt']/2.2046226218

In [None]:
# Alternatively:
# mtcars['wt_kg_2'] = mtcars['wt']*0.453592
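A note on units: in the R documentation for `mtcars`, `wt` is recorded in units of 1000 lb. Assuming the CSV keeps that convention, dividing by 2.2046 gives thousands of kilograms rather than kilograms. The sketch below uses two rows typed in by hand (`cars` and `LB_TO_KG` are hypothetical names) to show the conversion to plain kilograms.

```python
import pandas as pd

# Two illustrative rows with the wt values listed for these models in mtcars
cars = pd.DataFrame({'model': ['Mazda RX4', 'Datsun 710'],
                     'wt': [2.620, 2.320]})      # assumed unit: 1000 lb

LB_TO_KG = 0.45359237                            # kilograms per pound

# Scale to pounds first, then convert to kilograms
cars['wt_kg'] = cars['wt'] * 1000 * LB_TO_KG
print(cars)
```

If the exercise only asks for the conversion factor, the notebook's `mtcars['wt']/2.2046226218` is equivalent up to the factor of 1000.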

# Indexing


In [None]:
foo_df

In [None]:
foo_df['tissue']

In [None]:
foo_df[['tissue']]

In [None]:
foo_df.tissue

Index by position (as an integer) with `.iloc[]`

In [None]:
# First row, as a Series
foo_df.iloc[0] 

In [None]:
# First row, as a DataFrame
foo_df.iloc[[0]] 

In [None]:
foo_df.iloc[[0, 1]] 

Using two dimensions:

In [None]:
# To get all columns, use : after the comma
foo_df.iloc[0,:] # : == everything 

In [None]:
# Valid
# foo_df.iloc[0,]

Indexing begins at 0, and slices are _exclusive_ of the stop position
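A short sketch of what this means for `start:stop` slices, rebuilding `foo_df` from the lists defined earlier so that the cell runs on its own: the row at `start` is included and the row at `stop` is not.

```python
import pandas as pd

foo_df = pd.DataFrame({'healthy': [True, False, False, True, True, False],
                       'tissue': ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"],
                       'quantity': [13, 88, 1233, 55, 233, 18]})

first_two = foo_df.iloc[0:2]              # positions 0 and 1; position 2 is excluded
print(first_two['tissue'].tolist())       # ['Liver', 'Brain']

one_row = foo_df.iloc[2:3]                # a slice of length 1 is still a DataFrame
print(one_row.shape)                      # (1, 3)
```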

In [None]:
# The first two columns, all rows
foo_df.iloc[:,:2] # 0, 1, and exclude 2

In [None]:
# One column as a Series
foo_df.iloc[:,1]

In [None]:
# The same, as a DataFrame
foo_df.iloc[:,1:2]

In [None]:
# counting from the opposite direction
# -1 is the last row
foo_df.iloc[-1,]
# forward: 0  1  2  3  4  5
# Reverse -6 -5 -4 -3 -2 -1

In [None]:
foo_df

In [None]:
foo_df.iloc[::-1] # reverse the row order; DataFrames have no .reverse() method

**Exercise 5.1**

Using foo_df, retrieve:

The 2nd to 3rd rows

In [None]:
# foo_df.iloc[2:3, :] # only the 3rd row
# foo_df.iloc[[2,3],:] # the 3rd & the 4th row using a list
# foo_df.iloc[2:4] # the 3rd & the 4th row using : notation
# foo_df.iloc[0:2,] # 1st & 2nd rows
# foo_df[-1:2,:] # Computer says "no"
# foo_df.iloc[3:,:] # 4th to the end 

# yes :)
foo_df.iloc[[1, 2]] # 2nd & 3rd rows using a list
# foo_df.iloc[1:3,:] # 2nd & 3rd rows using : notation


The last 2 rows

In [None]:
foo_df[-2:]

In [None]:
# The last two no matter how long
foo_df.iloc[-2:]
# foo_df.iloc[[-1,-2],:] # specify the order with a list

In [None]:
# Hard coding positions
# foo_df.iloc[ [4, 5],:]
# foo_df.iloc[4:,:]

A random row in foo_df

In [None]:
# import random
# foo_df.iloc[[random.randrange(0, len(foo_df))]]
foo_df.sample()

From the 4th to the last row
(But without hard-coding, i.e. regardless of how many rows my data frame contains)

In [None]:
# foo_df[3:]
foo_df.iloc[3:,:]


**Exercise 5.2**

Using `.iloc[]` with:



In [None]:
# Integers? yes :)
foo_df.iloc[4,]


In [None]:
# Floats? Computer says no
# foo_df[0.1:]

In [None]:
# Strings (Characters)? Computer says "no"
# foo_df.iloc[:,'tissue']
# foo_df.iloc['Brain'] # need to look inside the tissue column
# foo_df.iloc['A'] # There is no 'A' anyway
# foo_df[heart:]   # No object defined 

In [None]:
# A heterogeneous list? Computer says "no"
# foo_df.iloc[:,[1, 'quantity']]



In [None]:
# A homogeneous list?
foo_df.iloc[[1,4,-1,-1,3,2,0]]


**Exercise 5.3**

Use indexing to obtain all the odd rows


In [None]:
foo_df.iloc[1::2] # 2nd, 4th, 6th

In [None]:
foo_df.iloc[lambda x: x.index % 2 == 1]

In [None]:
foo_df[foo_df.index % 2 != 0]

Use indexing to obtain all the even rows

In [None]:
foo_df.iloc[::2] # 1st, 3rd, 5th

In [None]:
foo_df.iloc[lambda x: x.index % 2 == 0]

# Logical Expressions

Relational and logical operators

In [None]:
foo_df[foo_df.quantity >= 233]

In [None]:
foo_df[(foo_df.tissue == "Heart") | (foo_df.quantity >= 233)]
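The element-wise operators are `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `==` or `>=`. A minimal sketch, rebuilding `foo_df` so the cell is self-contained (the result names are only illustrative):

```python
import pandas as pd

foo_df = pd.DataFrame({'healthy': [True, False, False, True, True, False],
                       'tissue': ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"],
                       'quantity': [13, 88, 1233, 55, 233, 18]})

is_heart = foo_df.tissue == "Heart"
is_high = foo_df.quantity >= 233

both = foo_df[foo_df.healthy & is_high]       # AND: healthy with a high quantity
either = foo_df[is_heart | is_high]           # OR: Heart, or a high quantity
neither = foo_df[~(is_heart | is_high)]       # NOT: everything else

print(both.tissue.tolist())       # ['Intestine']
print(either.tissue.tolist())     # ['Testes', 'Intestine', 'Heart']
print(neither.tissue.tolist())    # ['Liver', 'Brain', 'Muscle']
```

Storing each condition in a named variable first keeps longer filters readable.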

**Exercise 5.4**

Only “healthy” samples.


In [None]:
foo_df[foo_df.healthy]


Only “unhealthy” samples.


In [None]:
foo_df[-foo_df.healthy]


In [None]:
foo_df[~foo_df.healthy]


**Exercise 5.5**


Only low quantity samples, those below 100.

Midrange: Quantity between 100 and 1000.


Tails of the distribution: Quantity below 100 or beyond 1000.
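One possible way to approach these three filters, as a sketch rather than the official solution. It assumes "between" means strictly greater than 100 and strictly less than 1000, and it rebuilds `foo_df` so that the cell runs on its own.

```python
import pandas as pd

foo_df = pd.DataFrame({'healthy': [True, False, False, True, True, False],
                       'tissue': ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"],
                       'quantity': [13, 88, 1233, 55, 233, 18]})

# Low: below 100
low = foo_df[foo_df.quantity < 100]

# Midrange: both conditions must hold, so combine them with &
midrange = foo_df[(foo_df.quantity > 100) & (foo_df.quantity < 1000)]

# Tails: either condition may hold, so combine them with |
tails = foo_df[(foo_df.quantity < 100) | (foo_df.quantity > 1000)]

print(low.tissue.tolist())        # ['Liver', 'Brain', 'Muscle', 'Heart']
print(midrange.tissue.tolist())   # ['Intestine']
print(tails.tissue.tolist())      # ['Liver', 'Brain', 'Testes', 'Muscle', 'Heart']
```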


**Exercise 5.6**

Only “Heart” samples.

“Heart” and “liver” samples


Everything except “intestines”
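A sketch of one way to write these filters, again rebuilding `foo_df` so the cell is self-contained. String comparison is case-sensitive, so the values must match the capitalised spelling used in the `tissue` column (`"Liver"`, `"Intestine"`).

```python
import pandas as pd

foo_df = pd.DataFrame({'healthy': [True, False, False, True, True, False],
                       'tissue': ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"],
                       'quantity': [13, 88, 1233, 55, 233, 18]})

# Only Heart samples
heart = foo_df[foo_df.tissue == "Heart"]

# Heart and Liver samples: isin() tests membership in a list of values
heart_liver = foo_df[foo_df.tissue.isin(["Heart", "Liver"])]

# Everything except Intestine
no_intestine = foo_df[foo_df.tissue != "Intestine"]

print(heart.tissue.tolist())          # ['Heart']
print(heart_liver.tissue.tolist())    # ['Liver', 'Heart']
print(len(no_intestine))              # 5
```

Rows come back in their original order, which is why Liver is listed before Heart.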
