### **Handling Data** ###

- We often need to work with multi-dimensional data
- Lists are not inherently multi-dimensional, although you can nest lists within lists (or tuples within lists)
- **Exercise:** Create a multi-dimensional list of (x, y) coordinate pairs stored as tuples

In [1]:
coordinates=[(1,1),(1,2),(1,3),(2,1),(2,2),(2,3)]

type(coordinates)

list

In [2]:
coordinates[0]

(1, 1)

In [3]:
coordinates[3]

(2, 1)

In [4]:
type(coordinates[1])

tuple
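
Because each element of `coordinates` is itself a tuple, a second index reaches inside it. A minimal sketch that reuses the list from the cells above; the names `x_of_fourth`, `y_of_fourth`, `x_values` and `y_values` are only for illustration:

```python
coordinates = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]

# First index picks the tuple, second index picks the value inside it
x_of_fourth = coordinates[3][0]
y_of_fourth = coordinates[3][1]

# Tuple unpacking in a comprehension separates the x and y values
x_values = [x for x, y in coordinates]
y_values = [y for x, y in coordinates]

print(x_of_fourth, y_of_fourth)
print(x_values)
print(y_values)
```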

### **Dictionaries** ###

- A dictionary is a Python data structure that stores items as key-value pairs.
- It is like a list, but instead of a numerical index it uses an assigned key to reference each item.
- Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be any immutable type; strings and numbers can always be keys.
- You can also build a dictionary with the dict() function, which takes a sequence of key-value pairs (here, a list of tuples) rather than the curly-brace syntax.

In [7]:
nbaTeams =dict([('Los Angeles','Lakers'),('Toronto','Raptors'),('Miami','Heat')])

In [8]:
nbaTeams.get("Toronto")

'Raptors'

In [9]:
nbaTeams.get("Raptors")

- This returns None, because "Raptors" is a value, not a key
- Keys must be unique, so the dictionary cannot hold two teams under the key "Toronto"
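
The points above can be sketched with the curly-brace syntax. This is an illustrative example rather than part of the original notebook; the variable names and the default string are made up:

```python
# Curly-brace literal: builds the same mapping as the dict([...]) call
nba_teams = {'Los Angeles': 'Lakers', 'Toronto': 'Raptors', 'Miami': 'Heat'}

# .get() returns None (or a default you supply) when the key is missing
missing = nba_teams.get('Raptors')
with_default = nba_teams.get('Raptors', 'not a city key')

# Assigning to an existing key overwrites the old value; keys stay unique
nba_teams['Los Angeles'] = 'Clippers'

# Finding a key from its value means searching through the items
city = next((c for c, team in nba_teams.items() if team == 'Raptors'), None)

print(missing, with_default, city)
```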

### **Numpy** ###

- While lists and dictionaries have their use cases, they can be clunky and limiting for numerical data
- Python has a number of libraries commonly used in (data) science to deal with n-dimensional arrays
- One such library is NumPy (Numerical Python)

In [13]:
import numpy as np   # Easier to call np
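
One way to see why plain lists feel clunky for numerical work is to compare arithmetic on a list with arithmetic on an array. A small sketch with made-up values:

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# With a plain list, * repeats the list, so element-wise maths needs a loop
repeated = values * 2
doubled_list = [v * 2 for v in values]

# With a NumPy array, arithmetic is applied to every element at once
arr = np.array(values)
doubled_arr = arr * 2

print(repeated)
print(doubled_list)
print(doubled_arr)
```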

### **Numpy Arrays** ###

- To create an array, call the np.array() function on a list (or nested lists)

In [15]:
import numpy as np

arr = np.array([1,2,3,4,5])

In [31]:
print(arr)

[1 2 3 4 5]


In [21]:
mdArr = np.array([[1,2,3],[4,5,6],[7,8,9]])
   #Square brackets define your array

In [24]:
print(mdArr)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [26]:
array3d = np.array([[[0,0,0],[0,0,0]],[[1,1,1],[1,1,1]]])  # np.array makes this a 3-D array rather than a nested list

In [27]:
print(array3d)

[[[0 0 0]
  [0 0 0]]

 [[1 1 1]
  [1 1 1]]]


In [28]:
mdArr[0]

array([1, 2, 3])

In [32]:
mdArr[1]

array([4, 5, 6])

In [33]:
mdArr[0,1]

np.int64(2)
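
Indexing with a comma generalises to whole rows, columns and sub-blocks. A brief sketch using the same `mdArr`; the names `row`, `column` and `corner` are illustrative:

```python
import numpy as np

mdArr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(mdArr.ndim)    # number of dimensions
print(mdArr.shape)   # (rows, columns)

row = mdArr[1, :]          # second row
column = mdArr[:, 1]       # second column
corner = mdArr[0:2, 0:2]   # top-left 2x2 block

print(row)
print(column)
print(corner)
```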

In [35]:
for x in mdArr:
    print(x)

[1 2 3]
[4 5 6]
[7 8 9]


In [37]:
for x in mdArr:    
    for y in x:   # Iterating through values
        print(y)

1
2
3
4
5
6
7
8
9


In [38]:
for x in mdArr:
    for y in x:
        print(x)
        print(y)

[1 2 3]
1
[1 2 3]
2
[1 2 3]
3
[4 5 6]
4
[4 5 6]
5
[4 5 6]
6
[7 8 9]
7
[7 8 9]
8
[7 8 9]
9
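
Nested loops work, but NumPy also offers ways to visit every value without them. A possible sketch using the `.flat` iterator and `np.ndenumerate`, neither of which appears in the original cells:

```python
import numpy as np

mdArr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# .flat visits every value in row-major order without nested loops
flat_values = [int(v) for v in mdArr.flat]

# np.ndenumerate yields ((row, column), value) pairs
positions = [(idx, int(v)) for idx, v in np.ndenumerate(mdArr)]

print(flat_values)
print(positions[:3])
```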


- Finding specific values in your data is one of the most common things you will need to do (e.g., RTs < 2.5 SD, acc == True, cond == 1)
- You can select those values with a Boolean mask in a few different ways

In [43]:
newArr = [5,6,2,4,7,8,2,1,0]

newArr = np.array(newArr)

In [49]:
mask = newArr > 2

In [51]:
print(mask)

[ True  True False  True  True  True False False False]


In [53]:
newArr[mask]   # Return the values at positions where the mask is True

array([5, 6, 4, 7, 8])

In [57]:
cleanData = newArr[newArr>2]  # Same thing but shorter code

In [59]:
newArr2 = [0,1,2,3,4,5,6,7,8]

newArr2 = np.array(newArr2)

In [64]:
x = newArr[newArr2 > 4]  # A mask built from one array can index another array of the same length

In [65]:
print(x)

[8 2 1 0]


In [None]:
cleanData2 = rt[(cond == 1) & (block > 4)]  # Example: rt, cond and block stand for equal-length arrays
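
Combining conditions on arrays uses the element-wise operators `&`, `|` and `~`; Python's `and`/`or` raise an error on multi-element arrays. The sketch below uses invented values for `rt`, `cond` and `block`, which are placeholders and not data from the notebook:

```python
import numpy as np

# Hypothetical trial data, invented for illustration
rt = np.array([0.52, 0.61, 0.48, 0.75, 0.66, 0.58])
cond = np.array([1, 2, 1, 1, 2, 1])
block = np.array([3, 5, 5, 6, 2, 4])

# & is element-wise AND; each comparison needs its own parentheses
both = rt[(cond == 1) & (block > 4)]

# | is element-wise OR, ~ negates a mask
either = rt[(cond == 1) | (block > 4)]
not_cond1 = rt[~(cond == 1)]

print(both)
print(either)
print(not_cond1)
```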

### **Creating a large numpy array** ###

In [71]:
import numpy as np

data = np.loadtxt(fname='inflammation-01.csv', delimiter=',')

In [72]:
print(type(data))

<class 'numpy.ndarray'>


In [73]:
print(data.shape)

(60, 40)


In [74]:
data[10,35]

np.float64(3.0)

In [75]:
print(data[0:4, 0:10])

[[0. 0. 1. 3. 1. 2. 4. 7. 8. 3.]
 [0. 1. 2. 1. 2. 1. 3. 2. 2. 6.]
 [0. 1. 1. 3. 3. 2. 6. 2. 5. 9.]
 [0. 0. 2. 0. 4. 2. 2. 1. 6. 7.]]


In [76]:
data[0,:]

array([ 0.,  0.,  1.,  3.,  1.,  2.,  4.,  7.,  8.,  3.,  3.,  3., 10.,
        5.,  7.,  4.,  7.,  7., 12., 18.,  6., 13., 11., 11.,  7.,  7.,
        4.,  6.,  8.,  8.,  4.,  4.,  5.,  7.,  3.,  4.,  2.,  3.,  0.,
        0.])

In [78]:
data[0,5:-1]

array([ 2.,  4.,  7.,  8.,  3.,  3.,  3., 10.,  5.,  7.,  4.,  7.,  7.,
       12., 18.,  6., 13., 11., 11.,  7.,  7.,  4.,  6.,  8.,  8.,  4.,
        4.,  5.,  7.,  3.,  4.,  2.,  3.,  0.])

In [80]:
x = data[0:1,0:1]

In [82]:
print(x) # Only one value

[[0.]]


In [88]:
meanval = np.mean(data)

In [89]:
print(meanval)

6.14875


In [94]:
meanval = np.mean(data,0)   # Mean along axis 0 (down the rows): one value per column

In [93]:
print(meanval)

[ 0.          0.45        1.11666667  1.75        2.43333333  3.15
  3.8         3.88333333  5.23333333  5.51666667  5.95        5.9
  8.35        7.73333333  8.36666667  9.5         9.58333333 10.63333333
 11.56666667 12.35       13.25       11.96666667 11.03333333 10.16666667
 10.          8.66666667  9.15        7.25        7.33333333  6.58333333
  6.06666667  5.95        5.11666667  3.6         3.3         3.56666667
  2.48333333  1.5         1.13333333  0.56666667]


In [95]:
meanval = np.mean(data[0],0)

In [98]:
np.mean(data[0,:])  # Row

np.float64(5.45)

In [99]:
np.mean(data[:,0])  # Column

np.float64(0.0)
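
The effect of the axis argument is easier to check on an array small enough to work out by hand. A minimal sketch with a made-up 2x3 array:

```python
import numpy as np

small = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

overall = np.mean(small)              # one mean over all six values
per_column = np.mean(small, axis=0)   # collapses the rows: one mean per column
per_row = np.mean(small, axis=1)      # collapses the columns: one mean per row

print(overall)
print(per_column)
print(per_row)
```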

In [101]:
from numpy import random  # Lets you write random.randint(...) instead of np.random.randint(...)

In [102]:
random.randint(100)

69

In [103]:
randData = random.randint(5,10, size =[10,50])

In [104]:
print(randData)

[[8 5 7 8 6 9 6 7 8 9 6 9 9 6 9 9 6 6 9 6 8 9 5 7 8 9 5 7 8 9 8 9 7 5 5 5
  6 5 6 6 5 9 5 9 9 9 9 5 8 6]
 [8 8 5 7 5 6 7 6 9 5 8 8 6 9 8 7 6 9 7 7 7 5 5 6 7 7 6 6 8 5 8 8 7 6 6 9
  6 9 6 8 8 5 8 8 5 6 7 7 5 6]
 [6 5 5 9 9 7 8 7 5 5 6 8 9 5 8 8 7 9 6 9 9 6 7 5 8 9 8 5 8 8 7 8 7 8 9 9
  5 8 9 5 9 7 6 6 8 6 8 5 8 8]
 [7 9 7 8 8 7 8 9 5 7 7 6 8 7 5 5 8 6 8 5 8 9 8 7 6 8 6 9 8 5 6 8 9 7 7 5
  7 9 9 7 5 9 5 7 9 5 8 5 9 6]
 [7 9 7 7 5 6 5 6 8 9 7 8 5 7 8 5 5 6 8 8 9 6 6 5 5 6 5 8 5 6 5 9 6 8 8 5
  6 5 6 6 8 7 9 6 7 5 5 6 6 7]
 [8 6 9 9 6 7 8 8 9 7 6 8 8 7 7 5 5 9 7 6 6 8 8 7 5 5 8 6 5 8 7 7 8 5 5 6
  5 8 9 7 6 9 5 5 9 8 8 9 6 9]
 [8 8 9 5 8 8 5 6 5 6 5 6 8 9 9 7 7 5 6 7 7 6 6 8 9 5 8 5 7 8 9 7 8 6 9 8
  6 7 6 8 9 6 9 7 8 8 5 5 5 8]
 [7 7 9 6 6 5 6 5 9 8 9 5 6 9 9 9 6 6 9 7 7 8 5 5 5 5 5 7 7 9 7 5 8 7 9 8
  5 6 8 6 8 6 5 5 9 6 8 7 6 9]
 [6 6 8 9 7 6 6 7 6 5 7 9 5 5 8 7 7 6 7 9 6 7 9 7 6 5 7 9 5 7 8 8 6 8 8 6
  5 6 9 7 5 5 6 7 7 7 6 7 5 8]
 [6 6 7 5 5 9 7 6 9 8 7 7 5 5 5 8 7 8 6 8 6 8 5 9 7 8 6

In [106]:
random.rand()  # Returns a random float in the range [0, 1)

0.9432023134812372

In [115]:
random.choice(data[0,:])  # Picks one random element from the first row of data

np.float64(4.0)
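
Random draws change on every run, which can make results hard to reproduce. One option, not used in the cells above, is NumPy's seeded `Generator` interface; the seed value 42 and the variable names are arbitrary choices for this sketch:

```python
import numpy as np

# A seeded generator: the same seed gives the same sequence of draws
rng = np.random.default_rng(seed=42)

ints = rng.integers(5, 10, size=(10, 50))      # integers from 5 up to, but not including, 10
unit = rng.random()                            # float in [0, 1)
pick = rng.choice(np.array([2.0, 4.0, 6.0]))   # one random element

# A second generator with the same seed reproduces the same integers
rng_again = np.random.default_rng(seed=42)
same_ints = rng_again.integers(5, 10, size=(10, 50))

print(ints.shape, unit, pick)
```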

- More information and documentation on NumPy: https://numpy.org/doc/2.1/index.html

### **Pandas** ###

- Pandas is a software library written for the Python programming language for data manipulation and analysis.
- It is built on top of the NumPy library, which means that many of the functions defined for NumPy arrays also apply to pandas data structures.
- Key difference: pandas is set up to handle labelled, mixed-type tabular data, which NumPy arrays handle poorly
- **Pandas has two main data structures:**

> **Series:** A pandas Series is a one-dimensional labelled array capable of holding data of any type (integer, string, float, Python objects, etc.).

> **DataFrame:** The primary data structure in pandas, the DataFrame is a two-dimensional, mutable, and potentially heterogeneous tabular data structure with labelled axes (rows and columns).
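
As a rough illustration of the two structures, the sketch below builds one of each by hand. The recipe values and column names are invented for the example:

```python
import pandas as pd

# Series: one-dimensional, with an index label for every value
amounts = pd.Series([4, 1, 2], index=['flour', 'sugar', 'eggs'])

# DataFrame: two-dimensional, and each column can hold a different type
recipe = pd.DataFrame({
    'ingredient': ['flour', 'sugar', 'eggs'],
    'amount': [4, 1, 2],
    'unit': ['cups', 'cup', 'large'],
})

print(amounts['sugar'])   # look up a value by its label
print(recipe.shape)       # (rows, columns)
print(recipe['amount'])   # each column of a DataFrame is a Series
```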

In [118]:
import pandas as pd

In [119]:
pd.Series(['4 cups','1 cup','2 large','1 can'])

0     4 cups
1      1 cup
2    2 large
3      1 can
dtype: object

In [121]:
volumes = pd.Series(data=['4 cups','1 cup','2 large','1 can'])

In [122]:
s = pd.Series(data=[1,'2',3,4,'5',6,7,8,'99','100'])

In [123]:
print(s)

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8     99
9    100
dtype: object


In [125]:
s.astype('int') # Convert every value to an integer; strings must contain valid integers

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8     99
9    100
dtype: int64

In [126]:
x = s.astype('int')

In [127]:
x.mean()

np.float64(23.5)

In [129]:
s.astype('str') # Convert every value to a string (astype returns a new Series; s itself is unchanged)

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8     99
9    100
dtype: object
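
`astype('int')` fails with an error as soon as one entry is not a valid integer. When the data may contain such entries, `pd.to_numeric` with `errors='coerce'` is one alternative. The series below is a made-up example:

```python
import pandas as pd

# Hypothetical series with one entry that is not a number
mixed = pd.Series([1, '2', 'three', '4'])

# errors='coerce' turns entries that cannot be parsed into NaN instead of raising
numbers = pd.to_numeric(mixed, errors='coerce')

print(numbers)
print(numbers.mean())   # NaN is skipped when computing the mean
```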

In [135]:
data = pd.Series([1,2,pd.NA,4,5]) # pd.NA marks a missing value

In [131]:
print(data)

0       1
1       2
2    <NA>
3       4
4       5
dtype: object


In [134]:
data.dropna(inplace=True)  # Dropping the NA

In [133]:
print(data)

0    1
1    2
3    4
4    5
dtype: object


In [136]:
data = pd.Series([1,2,pd.NA,4,5])

In [139]:
data.fillna('Null', inplace=True)  # Replace NA with the string 'Null'

In [140]:
print(data)

0       1
1       2
2    Null
3       4
4       5
dtype: object
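
Both `dropna` and `fillna` also work without `inplace=True`, in which case they return a new Series and leave the original alone. A small sketch; the `dtype='Int64'` (pandas' nullable integer type) is an addition here so that the values stay integers alongside the missing entry:

```python
import pandas as pd

data = pd.Series([1, 2, pd.NA, 4, 5], dtype='Int64')

dropped = data.dropna()    # new Series without the missing row
filled = data.fillna(0)    # new Series with 0 in place of NA

print(dropped)
print(filled)
print(data.isna().sum())   # the original still has its missing value
```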


In [143]:
data = [1,2,3,4,5]

In [145]:
data = pd.Series([1,2,3,4,5])

In [146]:
data.apply(np.sqrt)

0    1.000000
1    1.414214
2    1.732051
3    2.000000
4    2.236068
dtype: float64

In [148]:
data.apply(lambda x: x + 1) # Transform each element by applying a function

0    2
1    3
2    4
3    5
4    6
dtype: int64
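
For simple arithmetic, `apply` is not strictly needed: because pandas builds on NumPy, operators and NumPy functions can act on the whole Series at once. A short sketch comparing the two routes:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])

via_apply = data.apply(lambda x: x + 1)   # calls the function once per element
vectorized = data + 1                     # same result in one operation
roots = np.sqrt(data)                     # NumPy functions accept a Series directly

print(via_apply.equals(vectorized))
print(roots)
```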

### **Pandas DataFrame** ###

In [151]:
data = pd.read_csv('RTdata.csv')

In [152]:
data['subjs']

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
15    16
16    17
17    18
Name: subjs, dtype: int64

In [153]:
data['subjs'][5]  # Indexing 

np.int64(6)

In [155]:
data = pd.read_csv('RTdata.csv', index_col='subjs') # Set the index column when reading in the data

In [156]:
data.loc[:,'K']

subjs
1     0.254095
2     1.020760
3     2.882098
4     1.024061
5     2.093723
6     1.012419
7     2.904390
8     1.922144
9     2.359120
10    2.088641
11    1.693146
12    1.072074
13    3.813035
14    2.791595
15    0.064458
16    4.186431
17    3.112951
18    4.045537
Name: K, dtype: float64

In [157]:
data[:]['K']  # All rows, then the column named 'K'

subjs
1     0.254095
2     1.020760
3     2.882098
4     1.024061
5     2.093723
6     1.012419
7     2.904390
8     1.922144
9     2.359120
10    2.088641
11    1.693146
12    1.072074
13    3.813035
14    2.791595
15    0.064458
16    4.186431
17    3.112951
18    4.045537
Name: K, dtype: float64

In [159]:
data.iloc[:,4] # All rows, column at integer position 4

subjs
1     0.254095
2     1.020760
3     2.882098
4     1.024061
5     2.093723
6     1.012419
7     2.904390
8     1.922144
9     2.359120
10    2.088641
11    1.693146
12    1.072074
13    3.813035
14    2.791595
15    0.064458
16    4.186431
17    3.112951
18    4.045537
Name: K, dtype: float64
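
The difference between `.loc` (labels) and `.iloc` (integer positions) can be sketched on a tiny frame. Since `RTdata.csv` is not available here, the frame below types in the first three rows of the RT data by hand:

```python
import pandas as pd

# First three rows of the RT data, entered manually for illustration
data = pd.DataFrame(
    {
        'runcode': [23887, 23888, 23889],
        'sex': ['m', 'f', 'm'],
        'race': ['asian', 'caucasian', 'caucasian'],
        'RTs': [0.268098, 0.810172, 0.625572],
        'K': [0.254095, 1.020760, 2.882098],
    },
    index=pd.Index([1, 2, 3], name='subjs'),
)

by_label = data.loc[2, 'K']        # row labelled 2, column named 'K'
by_position = data.iloc[1, 4]      # second row, fifth column
label_slice = data.loc[1:2, 'K']   # label slices include the end label
pos_slice = data.iloc[0:2, 4]      # position slices exclude the end position

print(by_label, by_position)
print(label_slice)
```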

- Groupby creates groupings or categories based on common values in columns

In [167]:
data.groupby(['sex', 'race']).mean()

sex  race       runcode       RTs       K
f    asian      23897.166667  0.737243  1.620218
f    caucasian  23891.0       0.630609  1.645856
m    asian      23895.333333  0.208388  2.044557
m    caucasian  23896.166667  0.374321  2.924689
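
A grouped summary can also be restricted to chosen columns or extended to several statistics. The sketch below uses a small invented frame, not the RT data. Selecting the numeric columns before calling `mean()` is a cautious habit, since recent pandas versions may raise an error when asked to average text columns:

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    'sex': ['f', 'f', 'm', 'm', 'f', 'm'],
    'RTs': [0.50, 0.70, 0.20, 0.40, 0.60, 0.30],
    'K': [1.0, 2.0, 3.0, 4.0, 3.0, 2.0],
})

# Mean of selected numeric columns within each group
means = df.groupby('sex')[['RTs', 'K']].mean()

# agg() computes several summaries of one column at once
summary = df.groupby('sex')['RTs'].agg(['mean', 'count'])

print(means)
print(summary)
```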


In [174]:
data.loc[:'sex']

subjs  runcode  sex  race       RTs       K
1      23887    m    asian      0.268098  0.254095
2      23888    f    caucasian  0.810172  1.02076
3      23889    m    caucasian  0.625572  2.882098
4      23890    f    asian      0.892729  1.024061
5      23891    m    caucasian  0.4957    2.093723
6      23892    f    caucasian  0.117297  1.012419
7      23893    f    caucasian  0.964358  2.90439
8      23894    m    caucasian  0.131785  1.922144
9      23895    f    asian      0.5298    2.35912
10     23896    f    asian      0.917709  2.088641


In [178]:
titanic = pd.read_csv('Titanic.csv')

In [179]:
print(titanic)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                                Heikkinen, Miss Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ... 

In [181]:
titanic["Name"].str.lower()   

0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                  heikkinen, miss laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                          graham, miss margaret edith
888              johnston, miss catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object

In [182]:
titanic["Name"].str.split(",")

0                             [Braund,  Mr. Owen Harris]
1      [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                               [Heikkinen,  Miss Laina]
3        [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                            [Allen,  Mr. William Henry]
                             ...                        
886                             [Montvila,  Rev. Juozas]
887                       [Graham,  Miss Margaret Edith]
888           [Johnston,  Miss Catherine Helen "Carrie"]
889                             [Behr,  Mr. Karl Howell]
890                               [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object
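
The lists produced by `.str.split` can be taken apart further with the same `.str` accessor. A possible sketch on three names copied from the output above (the Titanic file itself is not loaded here):

```python
import pandas as pd

# Three names copied from the output above
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss Laina',
    'Allen, Mr. William Henry',
])

# .str[0] takes the first piece of each split list: the surname
surnames = names.str.split(',').str[0]

# expand=True spreads the pieces into separate DataFrame columns
parts = names.str.split(',', expand=True)
given = parts[1].str.strip()   # remove the space left after the comma

print(surnames)
print(given)
```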