## Numpy

In [1]:
import numpy as np

1. Create an `ndarray` of shape (3,5). All values of the array are 100.

In [2]:
a = np.full((3,5),100)
a

array([[100, 100, 100, 100, 100],
       [100, 100, 100, 100, 100],
       [100, 100, 100, 100, 100]])

2. Create a border of zeros that surrounds the result in Question 1, so that the output is of shape (5,7), the outside values are 0 and the inside values are 100.

In [3]:
b = np.zeros((5,7))
b[1:-1,1:-1] = a
b

array([[  0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0., 100., 100., 100., 100., 100.,   0.],
       [  0., 100., 100., 100., 100., 100.,   0.],
       [  0., 100., 100., 100., 100., 100.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.]])
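As an alternative to slice assignment, the same border can be sketched with `np.pad`. This is not the approach used in the cell above. One difference is that it keeps the integer dtype of the input instead of producing floats; the variable names here are illustrative.

```python
import numpy as np

inner = np.full((3, 5), 100)

# Add one row/column of zeros on every side of the array.
bordered = np.pad(inner, pad_width=1, mode="constant", constant_values=0)

print(bordered.shape)  # (5, 7)
print(bordered)
```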

3. Create a 5x5 matrix with values 1,2,3,4 just below the diagonal.


In [4]:
c = np.zeros((5,5))
c[1:,:-1] = np.diag([1,2,3,4])
c

array([[0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 2., 0., 0., 0.],
       [0., 0., 3., 0., 0.],
       [0., 0., 0., 4., 0.]])
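A shorter variant, offered as a sketch rather than the intended answer: `np.diag` accepts an offset `k`, and `k=-1` places the values on the first subdiagonal, so the 5x5 matrix can be built in one call.

```python
import numpy as np

# Four values on the diagonal just below the main one give a 5x5 matrix.
sub = np.diag([1, 2, 3, 4], k=-1)
print(sub)
```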

4. Create a checkerboard 8x8 matrix using the tile function

In [5]:
section = np.array([[0,1],[1,0]])
np.tile(section, (4,4))

array([[0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0]])

5. Sort a 2d-array by the nth column

In [6]:
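n = 1  # 1-based column number, so column index n-1 is used below
arr = np.array([[0,1,2,3],[100,0,1,7],[8,5,-1,9]])
# Sort the rows by column n in descending order; sorted() returns a list of row arrays
sorted(arr,key=lambda row:row[n-1],reverse=True)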

[array([100,   0,   1,   7]), array([ 8,  5, -1,  9]), array([0, 1, 2, 3])]
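The built-in `sorted` returns a Python list of rows. If the result should stay a 2D array, one possible sketch uses `argsort` on the chosen column and fancy indexing; `col` is a zero-based index introduced here for illustration.

```python
import numpy as np

arr = np.array([[0, 1, 2, 3], [100, 0, 1, 7], [8, 5, -1, 9]])
col = 0  # zero-based column index (illustrative name)

order = arr[:, col].argsort()   # row order that sorts the column ascending
ascending = arr[order]
descending = arr[order[::-1]]   # same ordering as the sorted(..., reverse=True) call

print(descending)
```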

6. Compute a matrix rank

In [7]:
mat = np.eye(4)
np.linalg.matrix_rank(mat)

4

7. Consider an array of dimension (5,5,3). How can you multiply it by an array with dimensions (5,5)?

In [8]:
a = np.arange(75).reshape((5,5,3))
b = np.arange(25).reshape((5,5))
a * b[:,:,None]

array([[[   0,    0,    0],
        [   3,    4,    5],
        [  12,   14,   16],
        [  27,   30,   33],
        [  48,   52,   56]],

       [[  75,   80,   85],
        [ 108,  114,  120],
        [ 147,  154,  161],
        [ 192,  200,  208],
        [ 243,  252,  261]],

       [[ 300,  310,  320],
        [ 363,  374,  385],
        [ 432,  444,  456],
        [ 507,  520,  533],
        [ 588,  602,  616]],

       [[ 675,  690,  705],
        [ 768,  784,  800],
        [ 867,  884,  901],
        [ 972,  990, 1008],
        [1083, 1102, 1121]],

       [[1200, 1220, 1240],
        [1323, 1344, 1365],
        [1452, 1474, 1496],
        [1587, 1610, 1633],
        [1728, 1752, 1776]]])
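Indexing with `None` adds a trailing axis of length 1, so `b` is treated as shape (5,5,1) and broadcast across the last axis of `a`. As a cross-check (a sketch, not part of the original answer), the same product can be written with `np.einsum`:

```python
import numpy as np

a = np.arange(75).reshape((5, 5, 3))
b = np.arange(25).reshape((5, 5))

by_broadcast = a * b[:, :, None]            # b viewed as shape (5, 5, 1)
by_einsum = np.einsum("ijk,ij->ijk", a, b)  # explicit index notation

print(np.array_equal(by_broadcast, by_einsum))  # True
```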

8. Consider a 16x16 array. How can you get the block sum (block size 4x4)?

In [9]:
a = np.arange(256).reshape((16,16))
a.reshape((4,4,4,4)).sum(axis=(1,3))

array([[ 408,  472,  536,  600],
       [1432, 1496, 1560, 1624],
       [2456, 2520, 2584, 2648],
       [3480, 3544, 3608, 3672]])
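The reshape splits each axis into (block index, position within block), and summing over the two "within block" axes leaves one value per block. A small sketch that checks this against explicit slicing; `k` and `expected` are names introduced here:

```python
import numpy as np

a = np.arange(256).reshape((16, 16))
k = 4  # block size

# Axes after reshape: (block row, row in block, block col, col in block)
block_sums = a.reshape(16 // k, k, 16 // k, k).sum(axis=(1, 3))

# Cross-check by summing each 4x4 slice directly.
expected = np.array([[a[i:i + k, j:j + k].sum() for j in range(0, 16, k)]
                     for i in range(0, 16, k)])

print(np.array_equal(block_sums, expected))  # True
```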

9. Get the n largest values of an array

In [10]:
n = 10
a = np.arange(256)
import heapq
heapq.nlargest(n,a)

[255, 254, 253, 252, 251, 250, 249, 248, 247, 246]
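`heapq.nlargest` works on any iterable. A NumPy-only sketch, as one possible alternative, uses `np.partition` to move the `n` largest values to the end of the array and then sorts only those:

```python
import numpy as np

n = 10
values = np.arange(256)

# After partitioning at -n, the last n entries are the n largest (in no particular order).
top_unsorted = np.partition(values, -n)[-n:]
top_n = np.sort(top_unsorted)[::-1]  # largest first

print(top_n)
```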

10. Implement the Game of Life using numpy arrays

In [114]:
def lifeGame(board, niter):
    ''' 
    Implementation of Conway's game of life

    params: 
    board: original state, numpy 2d array
    niter: number of iterations, nonnegative integer

    return:
    board: final state
    '''

    m = len(board)
    n = len(board[0])
    for _ in range(niter):
        extended_board = np.zeros((m+2,n+2))
        extended_board[1:-1,1:-1] = board
        board = transition(extended_board)
    
    return board

def transition(bd):
    ''' Transition of the state for an extended board; the rule is based on https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life '''
    number_of_living_neighbors = bd[:-2,:-2] + bd[:-2,1:-1] + bd[:-2,2:] + bd[1:-1,:-2] + bd[1:-1,2:] + bd[2:,:-2] + bd[2:,1:-1] + bd[2:,2:]
    return np.array([[1 if ((bd[i+1][j+1] == 1 and number_of_living_neighbors[i][j] == 2)
                            or (bd[i+1][j+1] == 1 and number_of_living_neighbors[i][j] == 3)
                            or (bd[i+1][j+1] == 0 and number_of_living_neighbors[i][j] == 3)) else 0 for j in range(len(bd[0])-2) ] for i in range(len(bd)-2)])
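The nested list comprehension in `transition` can also be expressed with boolean masks. The following is a minimal sketch, assuming the board contains only 0 and 1; `transition_vectorized` is a hypothetical name, not part of the solution above.

```python
import numpy as np

def transition_vectorized(bd):
    """Apply one Game of Life step to a zero-padded board `bd`."""
    # Sum of the eight shifted views gives the living-neighbor count per inner cell.
    neighbors = (bd[:-2, :-2] + bd[:-2, 1:-1] + bd[:-2, 2:]
                 + bd[1:-1, :-2] + bd[1:-1, 2:]
                 + bd[2:, :-2] + bd[2:, 1:-1] + bd[2:, 2:])
    alive = bd[1:-1, 1:-1] == 1
    survive = alive & ((neighbors == 2) | (neighbors == 3))
    born = ~alive & (neighbors == 3)
    return (survive | born).astype(int)

# Usage: a horizontal "blinker" becomes vertical after one step.
blinker = np.zeros((5, 5), dtype=int)
blinker[2, 1:4] = 1
next_state = transition_vectorized(np.pad(blinker, 1))
print(next_state)
```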



In [116]:
# Implementation example
board = np.zeros((15,16))
board[6][7] = board[6][8] = board[7][6] = board[7][8] = board[8][8] = 1
print(board)
for i in range(1,6):
    print(lifeGame(board, i))
# The result matches an animation in https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 1 0 0 

Check https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises.ipynb for more Numpy exercises.

## Pandas

In [11]:
import pandas as pd
import seaborn as sb

1. Using NumPy, create a Pandas DataFrame with five rows and three columns

In [14]:
arr = np.arange(15).reshape(5,3)
df = pd.DataFrame(data = arr)
df

,0,1,2
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11
4,12,13,14


The following six questions depend on the penguins dataset.

In [17]:
df_penguins = sb.load_dataset('penguins')
df_penguins

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


2. What function can we use to drop the rows that have missing data?

In [16]:
df_penguins.dropna()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
...,...,...,...,...,...,...,...
338,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


3. By default, will the function in Question 2 modify df_penguins or will it return a copy?

In [19]:
df_penguins
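# dropna() returns a new DataFrame by default; df_penguins still contains the rows with missing values (e.g. rows 3 and 339)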

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female
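The difference between the default behavior and `inplace=True` can be seen on a small example. This is a sketch on made-up data (`toy` is a hypothetical DataFrame, not the penguins dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})
rows_before = len(toy)

cleaned = toy.dropna()            # returns a new DataFrame; toy is unchanged
print(len(toy), len(cleaned))     # 3 1

toy.dropna(inplace=True)          # modifies toy itself and returns None
print(len(toy))                   # 1
```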


4. Let df_full be the dataframe obtained from dropping rows with missing data. What is the average bill length of a penguin, in millimeters, in this (df_full) data set?

In [21]:
df_full = df_penguins.dropna()
df_full.bill_length_mm.mean()

43.99279279279279

5. Which species has the longest flippers?

In [32]:
df_full.sort_values(by = 'flipper_length_mm', ascending = False).iloc[0].species

'Gentoo'
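Sorting the whole frame is not strictly needed to find a single maximum. One possible alternative, shown here on made-up rows rather than the real dataset, looks up the row label of the maximum with `idxmax`:

```python
import pandas as pd

# Hypothetical sample rows, for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "flipper_length_mm": [181.0, 230.0, 195.0],
})

# idxmax returns the index label of the largest value in the column.
longest = toy.loc[toy["flipper_length_mm"].idxmax(), "species"]
print(longest)  # Gentoo
```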

6. Which two species have the most similar mean weight?

In [33]:
df_full.groupby('species').body_mass_g.mean()

species
Adelie       3706.164384
Chinstrap    3733.088235
Gentoo       5092.436975
Name: body_mass_g, dtype: float64

Adelie and Chinstrap have the most similar mean weight.
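With three species the answer can be read off directly. For more groups, the closest pair could be found programmatically; the sketch below reuses the means printed above and compares every pair.

```python
from itertools import combinations

import pandas as pd

# Mean body mass per species, copied from the groupby output above.
means = pd.Series({"Adelie": 3706.164384,
                   "Chinstrap": 3733.088235,
                   "Gentoo": 5092.436975})

# Pick the pair of species with the smallest absolute difference in mean mass.
closest = min(combinations(means.index, 2),
              key=lambda pair: abs(means[pair[0]] - means[pair[1]]))
print(closest)  # ('Adelie', 'Chinstrap')
```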

7. How could you sort by species first, then by body mass?

In [34]:
df_full.sort_values(by = ['species', 'body_mass_g'])

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
58,Adelie,Biscoe,36.5,16.6,181.0,2850.0,Female
64,Adelie,Biscoe,36.4,17.1,184.0,2850.0,Female
54,Adelie,Biscoe,34.5,18.1,187.0,2900.0,Female
98,Adelie,Dream,33.1,16.1,178.0,2900.0,Female
116,Adelie,Torgersen,38.6,17.0,188.0,2900.0,Female
...,...,...,...,...,...,...,...
331,Gentoo,Biscoe,49.8,15.9,229.0,5950.0,Male
297,Gentoo,Biscoe,51.1,16.3,220.0,6000.0,Male
337,Gentoo,Biscoe,48.8,16.2,222.0,6000.0,Male
253,Gentoo,Biscoe,59.6,17.0,230.0,6050.0,Male


The following three questions depend on the taxis dataset.

In [36]:
df_taxis = sb.load_dataset('taxis')
df_taxis

,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.60,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.00,0.0,9.30,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.70,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.10,0.0,13.40,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6428,2019-03-31 09:51:53,2019-03-31 09:55:27,1,0.75,4.5,1.06,0.0,6.36,green,credit card,East Harlem North,Central Harlem North,Manhattan,Manhattan
6429,2019-03-31 17:38:00,2019-03-31 18:34:23,1,18.74,58.0,0.00,0.0,58.80,green,credit card,Jamaica,East Concourse/Concourse Village,Queens,Bronx
6430,2019-03-23 22:55:18,2019-03-23 23:14:25,1,4.14,16.0,0.00,0.0,17.30,green,cash,Crown Heights North,Bushwick North,Brooklyn,Brooklyn
6431,2019-03-04 10:09:25,2019-03-04 10:14:29,1,1.12,6.0,0.00,0.0,6.80,green,credit card,East New York,East Flatbush/Remsen Village,Brooklyn,Brooklyn


8. The 'pickup' column contains the date and time the customer was picked up, but it's a string. Add a column to the DataFrame, 'pickup_time', containing the value in 'pickup' as a DateTime.

In [49]:
df_taxis['pickup_time'] = pd.to_datetime(df_taxis['pickup'])
df_taxis['pickup_time']

0      2019-03-23 20:21:09
1      2019-03-04 16:11:55
2      2019-03-27 17:53:01
3      2019-03-10 01:23:59
4      2019-03-30 13:27:42
               ...        
6428   2019-03-31 09:51:53
6429   2019-03-31 17:38:00
6430   2019-03-23 22:55:18
6431   2019-03-04 10:09:25
6432   2019-03-13 19:31:22
Name: pickup_time, Length: 6433, dtype: datetime64[ns]

9. Create a subset of the dataframe to create a new DataFrame, taxis_one_day. This new DataFrame should have pickup_time values between '2019-03-23 00:06:00' (inclusive) and '2019-03-24 00:00:00' (exclusive).

In [64]:
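import datetime
# The lower bound here is midnight (00:00:00), slightly wider than the 00:06:00 given in the question; the upper bound is exclusive
start = datetime.datetime(2019,3,23)
end = datetime.datetime(2019,3,24)
taxis_one_day = df_taxis[(df_taxis['pickup_time'] >= start) & (df_taxis['pickup_time'] < end)].copy()
taxis_one_day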

,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough,pickup_time
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.60,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan,2019-03-23 20:21:09
8,2019-03-23 11:48:50,2019-03-23 12:06:14,1,3.63,15.0,1.00,0.0,19.30,yellow,credit card,East Harlem South,Midtown Center,Manhattan,Manhattan,2019-03-23 11:48:50
17,2019-03-23 20:50:49,2019-03-23 21:02:07,1,2.60,10.5,2.00,0.0,16.30,yellow,credit card,Midtown Center,East Harlem South,Manhattan,Manhattan,2019-03-23 20:50:49
117,2019-03-23 09:39:25,2019-03-23 09:56:45,0,3.60,15.5,3.75,0.0,22.55,yellow,credit card,Yorkville East,Penn Station/Madison Sq West,Manhattan,Manhattan,2019-03-23 09:39:25
144,2019-03-23 18:35:01,2019-03-23 18:47:39,1,3.20,12.5,2.00,0.0,17.80,yellow,credit card,UN/Turtle Bay South,East Village,Manhattan,Manhattan,2019-03-23 18:35:01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6325,2019-03-23 20:52:40,2019-03-23 21:10:12,1,3.62,14.5,0.00,0.0,15.80,green,cash,Long Island City/Hunters Point,Steinway,Queens,Queens,2019-03-23 20:52:40
6331,2019-03-23 11:27:00,2019-03-23 12:20:11,1,7.67,28.0,0.00,0.0,28.00,green,cash,Jackson Heights,Maspeth,Queens,Queens,2019-03-23 11:27:00
6338,2019-03-23 18:05:38,2019-03-23 18:25:36,1,2.82,14.0,0.00,0.0,14.80,green,credit card,Claremont/Bathgate,Spuyten Duyvil/Kingsbridge,Bronx,Bronx,2019-03-23 18:05:38
6427,2019-03-23 18:26:09,2019-03-23 18:49:12,1,7.07,20.0,0.00,0.0,20.00,green,cash,Parkchester,East Harlem South,Bronx,Manhattan,2019-03-23 18:26:09
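To make the inclusive and exclusive bounds concrete, here is a sketch on a few made-up timestamps using the exact limits stated in the question. The names `times`, `start`, `end` and `mask` are illustrative.

```python
import pandas as pd

# Hypothetical timestamps around both boundaries.
times = pd.to_datetime(pd.Series([
    "2019-03-22 23:59:59",
    "2019-03-23 00:06:00",
    "2019-03-23 12:00:00",
    "2019-03-24 00:00:00",
]))

start = pd.Timestamp("2019-03-23 00:06:00")  # inclusive
end = pd.Timestamp("2019-03-24 00:00:00")    # exclusive

mask = (times >= start) & (times < end)
print(mask.tolist())  # [False, True, True, False]
```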


10. From the dataframe in Question 9, take the mean of the numeric columns, grouped at one-hour intervals. Save the result as df_means, and display it.

In [71]:
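# Group by the hour of the parsed timestamp; numeric_only=True skips the text and datetime columns
taxis_one_day['pickup_hour'] = taxis_one_day['pickup_time'].dt.hour
df_means = taxis_one_day.groupby('pickup_hour').mean(numeric_only=True)
df_means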

pickup_hour,passengers,distance,fare,tip,tolls,total
0,1.0,1.911667,8.583333,1.415,0.0,12.965
1,1.25,1.325,7.875,1.525,0.0,12.575
2,1.727273,1.739091,8.181818,1.641818,0.0,13.169091
3,1.5,3.3775,11.75,2.41,0.0,17.335
4,2.0,0.95,5.5,0.915,0.0,10.215
5,2.0,1.27,6.0,0.98,0.0,10.53
6,1.0,0.4,21.5,0.0,0.0,23.133333
7,2.333333,0.98,5.25,1.165,0.0,9.298333
8,1.0,0.02,2.5,0.0,0.0,3.3
9,1.5,1.352,7.4,1.674,0.0,12.124


Check https://colab.research.google.com/github/CodeSolid/CodeSolid.github.io/blob/main/booksource/exercises/PandasExercises.ipynb for more Pandas exercises.