NumPy Exercises for Data Analysis (Python) - 20 questions

1.How to create a boolean array?

In [0]:
import numpy as np

In [3]:
np.full((3, 3), False, dtype=bool)

array([[False, False, False],
       [False, False, False],
       [False, False, False]])

2.How to extract items that satisfy a given condition from 1D array?

In [4]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[arr%2==1]


array([1, 3, 5, 7, 9])

3.How to replace items that satisfy a condition with another value in numpy array?

In [7]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[arr%2==0]=2
arr


array([2, 1, 2, 3, 2, 5, 2, 7, 2, 9])

4.How to reshape an array?

In [11]:
arr = np.arange(10)
arr.reshape(2, -1) 

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

5.How to print only 3 decimal places in python numpy array?

In [12]:
arr = np.random.random([6,4])

# Limit to 3 decimal places
np.set_printoptions(precision=3)
arr[:4]

array([[0.269, 0.639, 0.867, 0.049],
       [0.702, 0.109, 0.321, 0.971],
       [0.218, 0.725, 0.651, 0.923],
       [0.971, 0.381, 0.578, 0.778]])

6.How to replace items that satisfy a condition without affecting the original array?

In [13]:
arr = np.arange(12)
out = np.where(arr % 2 == 0, -1, arr)
print(arr)
out

[ 0  1  2  3  4  5  6  7  8  9 10 11]


array([-1,  1, -1,  3, -1,  5, -1,  7, -1,  9, -1, 11])

7.How to stack two arrays vertically?

In [15]:
x = np.arange(10).reshape(2,-1)
y = np.repeat(1, 10).reshape(2,-1)

np.concatenate([x, y], axis=0)



array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

8.How to stack two arrays horizontally?

In [16]:
x = np.arange(10).reshape(2,-1)
y = np.repeat(1, 10).reshape(2,-1)

np.concatenate([x, y], axis=1)

array([[0, 1, 2, 3, 4, 1, 1, 1, 1, 1],
       [5, 6, 7, 8, 9, 1, 1, 1, 1, 1]])
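
As a shorthand for the two `np.concatenate` calls above, NumPy also provides `np.vstack` and `np.hstack`. A minimal sketch using the same `x` and `y` (the names `stacked_v` and `stacked_h` are only illustrative):

```python
import numpy as np

x = np.arange(10).reshape(2, -1)
y = np.repeat(1, 10).reshape(2, -1)

# vstack joins along axis 0 (rows), hstack along axis 1 (columns)
stacked_v = np.vstack([x, y])
stacked_h = np.hstack([x, y])

print(stacked_v.shape)  # (4, 5)
print(stacked_h.shape)  # (2, 10)
```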

9.How to generate custom sequences in numpy without hardcoding?

In [18]:
a = np.array([1,2,3])
np.r_[np.repeat(a, 3), np.tile(a, 3)]

array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])

10.How to get the common items between two python numpy arrays?

In [19]:
a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])
np.intersect1d(a,b)

array([2, 4])

11.How to remove from one array those items that exist in another?

In [20]:
a = np.array([1,2,3,4,5])
b = np.array([2,3,4,5,7])

np.setdiff1d(a,b)

array([1])

12.How to get the positions where elements of two arrays match?

In [21]:
a = np.array([1,2,3,4,5])
b = np.array([2,3,4,5,7])

np.where(a == b)


(array([], dtype=int64),)
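
The result above is empty because `a` and `b` never hold the same value at the same position. As an illustration, here is the same call on two made-up arrays that do agree at some positions:

```python
import numpy as np

a = np.array([1, 2, 3, 2, 3, 4, 3, 4, 5, 6])
b = np.array([7, 2, 10, 2, 7, 4, 9, 4, 9, 8])

# np.where on a boolean array returns a tuple with one index array per dimension
match_positions = np.where(a == b)[0]
print(match_positions)  # [1 3 5 7]
```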

13.How to extract all numbers between a given range from a numpy array?

In [23]:
x = np.arange(10)

index = np.where((x >= 5) & (x <= 8))
x[index]

array([5, 6, 7, 8])

14.How to make a python function that handles scalars to work on numpy arrays?

In [24]:
def maxx(a, b):
    """Return the larger of two scalars (named maxx to avoid shadowing the built-in max)."""
    if a >= b:
        return a
    else:
        return b

pair_max = np.vectorize(maxx, otypes=[float])

x = np.array([5, 7, 9, 8, 6, 4, 5])
y = np.array([6, 3, 4, 8, 9, 7, 1])

pair_max(x, y)

array([6., 7., 9., 8., 9., 7., 5.])
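
`np.vectorize` is mainly a convenience wrapper that loops in Python, so when a built-in ufunc already exists it is usually the simpler choice. For the pairwise maximum specifically, `np.maximum` could be used instead; a small sketch on the same inputs (the name `pair_max_builtin` is only illustrative):

```python
import numpy as np

x = np.array([5, 7, 9, 8, 6, 4, 5])
y = np.array([6, 3, 4, 8, 9, 7, 1])

# np.maximum compares element by element and keeps the integer dtype
pair_max_builtin = np.maximum(x, y)
print(pair_max_builtin)  # [6 7 9 8 9 7 5]
```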

15.How to swap two columns in a 2d numpy array?

In [28]:
arr = np.arange(9).reshape(3,3)
arr
arr[:, [1,0,2]]

array([[1, 0, 2],
       [4, 3, 5],
       [7, 6, 8]])

16.How to swap two rows in a 2d numpy array?

In [29]:
arr = np.arange(9).reshape(3,3)

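# Swap rows 0 and 2 with an explicit row order; arr[::-1] reverses all rows,
# which matches this result only because the array has 3 rows
arr[[2, 1, 0], :]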

array([[6, 7, 8],
       [3, 4, 5],
       [0, 1, 2]])

17.How to reverse the columns of a 2D array?

In [30]:
arr = np.arange(9).reshape(3,3)
arr[:, ::-1]

array([[2, 1, 0],
       [5, 4, 3],
       [8, 7, 6]])

18.How to create a 2D array containing random floats between 5 and 10?

In [32]:
# Integer part from randint (5 to 9) plus a fractional part in [0, 1) gives floats in [5, 10)
rand_arr = np.random.randint(low=5, high=10, size=(5,3)) + np.random.random((5,3))
rand_arr

array([[5.607, 6.418, 5.756],
       [7.815, 9.581, 7.529],
       [5.512, 9.054, 9.162],
       [6.982, 6.06 , 6.889],
       [5.406, 8.827, 9.293]])
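
The same range can also be sampled in a single call. A minimal sketch, assuming the newer `Generator` API is available (NumPy 1.17 or later); the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility

# uniform draws floats from the half-open interval [low, high)
rand_arr = rng.uniform(low=5, high=10, size=(5, 3))

print(rand_arr.shape)  # (5, 3)
```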

19.How to import a dataset with numbers and texts keeping the text intact in python numpy?

In [34]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')
iris[:3]

array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa'],
       [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa'],
       [b'4.7', b'3.2', b'1.3', b'0.2', b'Iris-setosa']], dtype=object)

20.How to extract a particular column from 1D array of tuples?

In [36]:
species = np.array([row[4] for row in iris])
species[:5]

array([b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa',
       b'Iris-setosa'], dtype='|S15')

101 Pandas Exercises for Data Analysis

1.How to get the items of series A not present in series B?

In [0]:
import pandas as pd

In [40]:
pd1 = pd.Series([1, 2, 3, 4, 5])
pd2 = pd.Series([4, 5, 6, 7, 8])
pd1[~pd1.isin(pd2)]

0    1
1    2
2    3
dtype: int64

2.How to get the items not common to both series A and series B?

In [42]:
pd1 = pd.Series([1, 2, 3, 4, 5])
pd2 = pd.Series([4, 5, 6, 7, 8])
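pd_u = pd.Series(np.union1d(pd1, pd2))      # all distinct items from both series
pd_i = pd.Series(np.intersect1d(pd1, pd2))  # items common to both series
pd_u[~pd_u.isin(pd_i)]                      # keep only items that are not common

0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64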

3.How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?

In [44]:
pd1 = pd.Series([1, 2, 3, 4, 5,6,7,8,9])
np.percentile(pd1, q=[0, 25, 50, 75, 100])

array([1., 3., 5., 7., 9.])
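
pandas can produce the same five-number summary without going through NumPy. A short sketch using `Series.quantile` on the same values (the names `ser` and `five_num` are only illustrative):

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# quantile takes fractions in [0, 1] rather than percentages
five_num = ser.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_num.tolist())  # [1.0, 3.0, 5.0, 7.0, 9.0]
```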

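4.How to keep only the top 2 most frequent values as they are and replace everything else with ‘Other’?

In [56]:
pd1 = pd.Series([1, 2, 3, 4, 5,6,7,8,9,9,9,8,8,8])
val = pd1.value_counts()  # counts sorted in descending order
# Cast to object first so the string 'Other' can sit next to the integers
pd1 = pd1.astype(object)
pd1[~pd1.isin(val.index[:2])] = 'Other'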
pd1

0     Other
1     Other
2     Other
3     Other
4     Other
5     Other
6     Other
7         8
8         9
9         9
10        9
11        8
12        8
13        8
dtype: object

5.How to bin a numeric series to 10 groups of equal size?

In [57]:

pd1 = pd.Series(np.random.random(20))
print(pd1.head())

pd.qcut(pd1, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1], 
        labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']).head()

0    0.758360
1    0.447395
2    0.014378
3    0.328555
4    0.200232
dtype: float64


0    7th
1    5th
2    1st
3    3rd
4    2nd
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]
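
When custom labels are not needed, passing `q=10` is a shorter way to ask for deciles. The sketch below, on seeded random data (the seed is arbitrary), also checks that the groups come out the same size:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
ser = pd.Series(rng.random(20))

# q=10 asks for deciles; labels=False returns the bin number (0 to 9) instead of an interval
deciles = pd.qcut(ser, q=10, labels=False)

# With 20 distinct values, each of the 10 bins should hold 2 items
bin_sizes = deciles.value_counts().sort_index()
print(bin_sizes.tolist())
```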

6.How to find the positions of numbers that are multiples of 3 from a series?

In [58]:

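pd1 = pd.Series(np.random.randint(1, 10, 7))

print(pd1)
# Pass a plain NumPy array; calling np.argwhere on a Series directly can raise a warning or error in some pandas versions
np.argwhere((pd1 % 3 == 0).to_numpy())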

0    4
1    1
2    9
3    6
4    1
5    7
6    5
dtype: int64


array([[2],
       [3]])

7.How to get the positions of items of series A in another series B?

In [59]:
pd1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
pd2 = pd.Series([1, 3, 10, 13])

[np.where(i == pd1)[0].tolist()[0] for i in pd2]



[5, 4, 0, 8]
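
The list comprehension above scans `pd1` once per item. As an alternative sketch, `pd.Index.get_indexer` does the lookup in one call, assuming the values being searched are unique (the names `ser1`, `ser2` and `positions` are only illustrative):

```python
import pandas as pd

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# get_indexer returns the position of each target value, or -1 if it is absent;
# it requires the values of ser1 to be unique
positions = pd.Index(ser1).get_indexer(ser2).tolist()
print(positions)  # [5, 4, 0, 8]
```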

8.How to compute the mean squared error on a truth and predicted series?

In [62]:
truth = pd.Series([10, 9, 6, 5])
pred = pd.Series([1, 3, 10, 13])

# Square the errors first, then average them
mse = np.mean((truth - pred)**2)
mse

49.25

9.How to convert the first character of each element in a series to uppercase?

In [63]:
pd2 = pd.Series(['pav', 'kum','chau'])
pd2.map(lambda x: x.title())

0     Pav
1     Kum
2    Chau
dtype: object

10.How to calculate the number of characters in each word in a series?

In [64]:

pd2 = pd.Series(['pav', 'kum','chau'])

pd2.map(lambda x: len(x))

0    3
1    3
2    4
dtype: int64

11.How to convert a series of date-strings to a timeseries?

In [65]:
pd1 = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
pd.to_datetime(pd1)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

12.How to get the day of month, week number, day of year and day of week from a series of date strings?

In [67]:
pd1 = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
from dateutil.parser import parse
ts = pd1.map(lambda x: parse(x))
ts

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]
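
Once the strings are parsed, the individual date components are available through the `.dt` accessor. A minimal sketch, using ISO-formatted versions of the same six dates to keep the parsing step trivial and assuming pandas 1.1 or later for `isocalendar()`:

```python
import pandas as pd

# ISO-formatted versions of the six dates above (time of day dropped, it does not affect the results)
ts = pd.to_datetime(pd.Series([
    '2010-01-01', '2011-02-02', '2012-03-03',
    '2013-04-04', '2014-05-05', '2015-06-06',
]))

day_of_month = ts.dt.day.tolist()
week_number = ts.dt.isocalendar().week.tolist()  # ISO 8601 week numbers
day_of_year = ts.dt.dayofyear.tolist()
day_of_week = ts.dt.day_name().tolist()

print(day_of_month)  # [1, 2, 3, 4, 5, 6]
print(week_number)   # [53, 5, 9, 14, 19, 23]
print(day_of_year)   # [1, 33, 63, 94, 125, 157]
print(day_of_week)   # ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']
```

Note that 1 January 2010 falls in ISO week 53 because that week belongs to 2009.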

13.How to convert year-month string to dates corresponding to the 4th day of the month?

In [69]:
pd1 = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

from dateutil.parser import parse

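# Prepend the day so every date falls on the 4th; otherwise parse() fills in today's day of month
ser_ts = pd1.map(lambda x: parse('04 ' + x))
ser_ts

0   2010-01-04
1   2011-02-04
2   2012-03-04
dtype: datetime64[ns]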

14.How to get the mean of a series grouped by another series?

In [71]:
pd1 = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
pd2 = pd.Series(np.linspace(1, 10, 10))

pd2.groupby(pd1).mean()

apple     5.000000
banana    9.000000
carrot    3.833333
dtype: float64

15.How to compute the euclidean distance between two series?

In [72]:
p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
 
sum((p - q)**2)**.5

18.16590212458495
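
The Euclidean distance is the L2 norm of the difference vector, so `np.linalg.norm` gives the same number. A small sketch on the same two series (the name `dist` is only illustrative):

```python
import numpy as np
import pandas as pd

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# norm defaults to the L2 norm for a 1D input
dist = np.linalg.norm(p - q)
print(round(dist, 4))  # 18.1659
```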

16.How to replace missing spaces in a string with the least frequent character?

In [73]:
my_str = 'dbc deb abed gade'

ser = pd.Series(list(my_str))
freq = ser.value_counts()
print(freq)
least_freq = freq.dropna().index[-1] 
"".join(ser.replace(' ', least_freq))

d    4
     3
b    3
e    3
a    2
c    1
g    1
dtype: int64


'dbcgdebgabedggade'

17.How to create a TimeSeries starting ‘2000-01-01’ and covering 10 weekends (Saturdays) after that, with random numbers as values?

In [74]:
pd1 = pd.Series(np.random.randint(1,10,10), pd.date_range('2000-01-01', periods=10, freq='W-SAT'))
pd1

2000-01-01    3
2000-01-08    4
2000-01-15    1
2000-01-22    2
2000-01-29    3
2000-02-05    1
2000-02-12    6
2000-02-19    1
2000-02-26    4
2000-03-04    3
Freq: W-SAT, dtype: int64

18.How to fill an intermittent time series so all missing dates show up with values of previous non-missing date?

In [75]:
pd1 = pd.Series([1,10,3, np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))

pd1.resample('D').ffill()

2000-01-01     1.0
2000-01-02     1.0
2000-01-03    10.0
2000-01-04    10.0
2000-01-05    10.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     NaN
Freq: D, dtype: float64

19.How to import only every nth row from a csv file to create a dataframe?

In [77]:
chunks = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', chunksize=50)
# DataFrame.append was removed in pandas 2.0, so build the frame from the first row of each 50-row chunk
df2 = pd.DataFrame([chunk.iloc[0, :] for chunk in chunks])
df2

        crim    zn  indus  chas    nox     rm    age     dis   rad    tax  ptratio       b  lstat  medv
0    0.00632  18.0   2.31   0.0  0.538  6.575   65.2    4.09   1.0  296.0     15.3   396.9   4.98  24.0
50   0.08873  21.0   5.64   0.0  0.439  5.963   45.7  6.8147   4.0  243.0     16.8  395.56  13.45  19.7
100  0.14866   0.0   8.56   0.0   0.52  6.727   79.9  2.7778   5.0  384.0     20.9  394.76   9.42  27.5
150   1.6566   0.0  19.58   0.0  0.871  6.122   97.3   1.618   5.0  403.0     14.7   372.8   14.1  21.5
200  0.01778  95.0   1.47   0.0  0.403  7.135   13.9  7.6534   3.0  402.0     17.0   384.3   4.45  32.9
250   0.1403  22.0   5.86   0.0  0.431  6.487   13.0  7.3967   7.0  330.0     19.1  396.28    5.9  24.4
300  0.04417  70.0   2.24   0.0    0.4  6.871   47.4  7.8278   5.0  358.0     14.8  390.86   6.07  24.8
350  0.06211  40.0   1.25   0.0  0.429   6.49   44.4  8.7921   1.0  335.0     19.7   396.9   5.98  22.9
400  25.0461   0.0   18.1   0.0  0.693  5.987  100.0  1.5888  24.0  666.0     20.2   396.9  26.77   5.6
450  6.71772   0.0   18.1   0.0  0.713  6.749   92.6  2.3236  24.0  666.0     20.2    0.32  17.44  13.4
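
Another option is to let `read_csv` discard the unwanted lines while parsing, by passing a callable to `skiprows`. The sketch below uses a small made-up in-memory CSV in place of the real file; `csv_text`, `n` and `every_nth` are illustrative names only:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the real file: a header plus 10 data rows
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

n = 3  # keep every 3rd data row
# skiprows receives the 0-based line number in the file; line 0 is the header
every_nth = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda line: line > 0 and (line - 1) % n != 0,
)
print(every_nth['id'].tolist())  # [0, 3, 6, 9]
```

Unlike the chunked version, the resulting index is renumbered from 0 rather than keeping the original row numbers.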


20.How to change column values when importing csv to a dataframe?

In [78]:
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', 
                 converters={'medv': lambda x: 'High' if float(x) > 25 else 'Low'})

print(df.head())

      crim    zn  indus  chas    nox  ...  tax  ptratio       b  lstat  medv
0  0.00632  18.0   2.31     0  0.538  ...  296     15.3  396.90   4.98   Low
1  0.02731   0.0   7.07     0  0.469  ...  242     17.8  396.90   9.14   Low
2  0.02729   0.0   7.07     0  0.469  ...  242     17.8  392.83   4.03  High
3  0.03237   0.0   2.18     0  0.458  ...  222     18.7  394.63   2.94  High
4  0.06905   0.0   2.18     0  0.458  ...  222     18.7  396.90   5.33  High

[5 rows x 14 columns]
