## 101 Pandas Exercises for Data Analysis 
https://www.machinelearningplus.com/python/101-pandas-exercises-python/



#### 1. How to import pandas and check the version?

In [1]:
import pandas as pd 
import numpy as np 

In [2]:
pd.__version__

'0.25.1'

In [3]:
#Book solution
pd.show_versions(as_json=True)

{'system': {'commit': None, 'python': '3.7.4.final.0', 'python-bits': 64, 'OS': 'Windows', 'OS-release': '10', 'machine': 'AMD64', 'processor': 'Intel64 Family 6 Model 142 Stepping 10, GenuineIntel', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'None', 'LOCALE': 'None.None'}, 'dependencies': {'pandas': '0.25.1', 'numpy': '1.18.1', 'pytz': '2019.3', 'dateutil': '2.8.0', 'pip': '19.2.3', 'setuptools': '41.4.0', 'Cython': '0.29.13', 'pytest': '5.2.1', 'hypothesis': None, 'sphinx': '2.2.0', 'blosc': None, 'feather': None, 'xlsxwriter': '1.2.1', 'lxml.etree': '4.4.1', 'html5lib': '1.0.1', 'pymysql': None, 'psycopg2': None, 'jinja2': '2.10.3', 'IPython': '7.8.0', 'pandas_datareader': None, 'bs4': '4.8.0', 'bottleneck': '1.2.1', 'fastparquet': None, 'gcsfs': None, 'matplotlib': '3.1.1', 'numexpr': '2.7.0', 'odfpy': None, 'openpyxl': '3.0.0', 'pandas_gbq': None, 'pyarrow': None, 'pytables': None, 's3fs': None, 'scipy': '1.3.1', 'sqlalchemy': '1.3.9', 'tables': '3.5.2', 'xarray': None, 'xlr

#### 2. How to create a series from a list, numpy array and dict?
Create a pandas series from each of the items below: a list, a numpy array, and a dictionary.

mylist = list('abcedfghijklmnopqrstuvwxyz')

myarr = np.arange(26)

mydict = dict(zip(mylist, myarr))

In [4]:
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

In [5]:
mylist = pd.Series(mylist)

myarr = pd.Series(myarr)

mydict = pd.Series(mydict)


In [6]:
print(mylist[:5])

0    a
1    b
2    c
3    e
4    d
dtype: object


In [7]:
print(myarr[:5])

0    0
1    1
2    2
3    3
4    4
dtype: int32


In [8]:
print(mydict[:5])

a    0
b    1
c    2
e    3
d    4
dtype: int64
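
A small sketch of how the three constructors differ in the index they produce. The names `letters`, `numbers`, `from_list`, `from_array` and `from_dict` are illustrative and not part of the exercise.

```python
import numpy as np
import pandas as pd

letters = list("abc")
numbers = np.arange(3)

# A list or array gets a default integer index (0, 1, 2, ...)
from_list = pd.Series(letters)
from_array = pd.Series(numbers)

# A dict uses its keys as the index and its values as the data
from_dict = pd.Series(dict(zip(letters, numbers)))

print(from_list.index.tolist())
print(from_dict.index.tolist())
```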


#### 3. How to convert the index of a series into a column of a dataframe?

Difficulty Level: L1

Convert the series ser into a dataframe with its index as another column on the dataframe.

In [9]:
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)
ser[:5]

a    0
b    1
c    2
e    3
d    4
dtype: int64

In [10]:
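# Build the frame from the index and the values, keeping the default row labels.
# (Passing index=ser would use the series values as row labels.)
df = pd.DataFrame({"Col1": ser.index, "Col2": ser.values})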
df.head()

  Col1  Col2
0    a     0
1    b     1
2    c     2
3    e     3
4    d     4


In [11]:
# Solution 
df = ser.to_frame().reset_index()
print(df.head())

  index  0
0     a  0
1     b  1
2     c  2
3     e  3
4     d  4
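
The default column names `index` and `0` are not very descriptive. One possible variation, sketched here with made-up column names `letter` and `value`, names both columns during the conversion.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3), index=list("abc"))

# rename_axis names the index; reset_index(name=...) names the value column
df = ser.rename_axis("letter").reset_index(name="value")
print(df)
```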


#### 4. How to combine many series to form a dataframe?
Difficulty Level: L1

Combine ser1 and ser2 to form a dataframe.

In [12]:
#input
import numpy as np
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

In [13]:
df = pd.DataFrame({'Col_1': ser1,
                   'Col_2': ser2})

In [14]:
df.head()

  Col_1  Col_2
0     a      0
1     b      1
2     c      2
3     e      3
4     d      4


In [15]:
# Book Solution 1
df = pd.concat([ser1, ser2], axis=1) #axis = 1 = Columns
type(df)

pandas.core.frame.DataFrame

In [16]:
# Book Solution 2
df = pd.DataFrame({'col1': ser1, 'col2': ser2})
print(df.head())

  col1  col2
0    a     0
1    b     1
2    c     2
3    e     3
4    d     4
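
With `pd.concat` the columns are labelled `0` and `1` unless the series carry names. A minimal sketch of one way to label them at concat time, using the `keys` argument (the labels `col1` and `col2` are just examples):

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(list("abc"))
ser2 = pd.Series(np.arange(3))

# With axis=1, the keys become the column labels
df = pd.concat([ser1, ser2], axis=1, keys=["col1", "col2"])
print(df)
```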


#### 5. How to assign a name to a series?
Difficulty Level: L1

Give a name to the series ser calling it ‘alphabets’.

In [17]:
#input
ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))

In [18]:
ser.rename('alphabets')[:5]

0    a
1    b
2    c
3    e
4    d
Name: alphabets, dtype: object

In [19]:
# Book Solution
ser.name = 'alphabets'
ser.head()

0    a
1    b
2    c
3    e
4    d
Name: alphabets, dtype: object
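
The name of a series and the name of its index are two separate attributes. A short sketch of setting both, where the index name `position` is an arbitrary choice for illustration:

```python
import pandas as pd

ser = pd.Series(list("abc"))

# rename(<scalar>) sets the series name; rename_axis sets the index name
named = ser.rename("alphabets").rename_axis("position")

print(named.name)
print(named.index.name)
```

Both calls return new objects, so `ser` itself stays unnamed.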

#### 6. How to get the items of series A not present in series B?
Difficulty Level: L2

From ser1 remove items present in ser2.

In [20]:
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

In [21]:
ser1 = ser1[~(ser1.isin(ser2))]

In [22]:
ser1

0    1
1    2
2    3
dtype: int64

In [23]:
# Book Solution
ser1[~ser1.isin(ser2)]

0    1
1    2
2    3
dtype: int64

#### 7. How to get the items not common to both series A and series B?
Difficulty Level: L2

Get all items of ser1 and ser2 not common to both.

In [24]:
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

In [25]:
not_common = pd.concat([ser1[~ser1.isin(ser2)],ser2[~ser2.isin(ser1)]])
not_common

0    1
1    2
2    3
2    6
3    7
4    8
dtype: int64

In [26]:
#Book solution

ser_u = pd.Series(np.union1d(ser1, ser2))  # union
ser_i = pd.Series(np.intersect1d(ser1, ser2))  # intersect
ser_u[~ser_u.isin(ser_i)]

0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64

In [27]:
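# Compare the values element-wise; `.all() == .all()` only compares two truthiness flags
np.array_equal(ser_u[~ser_u.isin(ser_i)].values, not_common.values)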

True
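
Items that are in exactly one of the two series form the symmetric difference. As an alternative sketch (not the book's solution), NumPy's `np.setxor1d` computes it in one call and returns sorted unique values:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Sorted unique values that appear in only one of the two inputs
sym_diff = pd.Series(np.setxor1d(ser1.to_numpy(), ser2.to_numpy()))
print(sym_diff.tolist())
```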

#### 8. How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?
Difficulty Level: L2

Compute the minimum, 25th percentile, median, 75th, and maximum of ser.

In [28]:
#input 
np.random.seed(10)
ser = pd.Series(np.random.normal(10, 5, 25))

In [29]:
ser[:5]

0    16.657933
1    13.576395
2     2.272999
3     9.958081
4    13.106680
dtype: float64

In [30]:
ser.describe()

count    25.000000
mean     10.485604
std       5.398767
min       0.111359
25%       6.399572
50%      11.143151
75%      13.576395
max      21.924837
dtype: float64

In [31]:
#Book solution
np.percentile(ser, q=[0, 25, 50, 75, 100])

array([ 0.1113586 ,  6.3995722 , 11.14315065, 13.57639487, 21.92483665])
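
A pandas-only variant is `Series.quantile`, which takes fractions between 0 and 1 rather than percentages. The sketch below uses a small hand-made series so the result is easy to check by eye:

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Min, quartiles and max in one call; the result is indexed by the quantile
summary = ser.quantile([0, 0.25, 0.5, 0.75, 1])
print(summary)
```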

#### 9. How to get frequency counts of unique items of a series?
Difficulty Level: L1

Calculate the frequency counts of each unique value in ser.

In [32]:
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

In [33]:
ser.value_counts()

a    7
h    5
e    5
c    3
d    3
b    3
g    3
f    1
dtype: int64

In [34]:
# Book Solution
ser.value_counts()

a    7
h    5
e    5
c    3
d    3
b    3
g    3
f    1
dtype: int64

#### 10. How to keep only the top 2 most frequent values as is and replace everything else with ‘Other’?
Difficulty Level: L2

From ser, keep the top 2 most frequent items as is and replace everything else with ‘Other’.

In [35]:
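# Note: this builds a separate RandomState and discards it; it does not seed
# np.random, so the values below change from run to run
np.random.RandomState(100)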
ser = pd.Series(np.random.randint(1, 5, [12]))
print(ser)
ser.value_counts()[:2]

0     1
1     3
2     1
3     3
4     3
5     1
6     4
7     2
8     1
9     4
10    2
11    3
dtype: int32


3    4
1    4
dtype: int64

In [36]:
ser.value_counts()[:2].index

Int64Index([3, 1], dtype='int64')

In [37]:
ser[~(ser.isin(ser.value_counts()[:2].index))] = 'Other'

In [38]:
ser

0         1
1         3
2         1
3         3
4         3
5         1
6     Other
7     Other
8         1
9     Other
10    Other
11        3
dtype: object

In [39]:
# Book Solution
print("Top 2 Freq:", ser.value_counts())
ser[~ser.isin(ser.value_counts().index[:2])] = 'Other'
ser

Top 2 Freq: 3        4
1        4
Other    4
dtype: int64


0         1
1         3
2         1
3         3
4         3
5         1
6     Other
7     Other
8         1
9     Other
10    Other
11        3
dtype: object
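
The cells above assign into `ser` in place, which is why the book solution's printout already contains ‘Other’. A non-mutating sketch with `Series.where` follows. The input values are made up so that the two most frequent items are unambiguous, and the cast to `object` is an assumption made to keep the mixed int/string result explicit.

```python
import pandas as pd

ser = pd.Series([1, 1, 1, 2, 2, 3, 4])

# value_counts sorts by frequency, so the first two labels are the top 2
top2 = ser.value_counts().index[:2]

# Keep values where the condition holds, substitute 'Other' elsewhere
result = ser.astype(object).where(ser.isin(top2), "Other")
print(result.tolist())
```

When several values tie for second place, which one counts as "top 2" depends on the ordering `value_counts` returns.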

#### 11. How to bin a numeric series to 10 groups of equal size?
Difficulty Level: L2

Bin the series ser into 10 equal deciles and replace the values with the bin name.

In [40]:
ser = pd.Series(np.random.random(20))
ser

0     0.093460
1     0.821106
2     0.151152
3     0.384114
4     0.944261
5     0.987625
6     0.456305
7     0.826123
8     0.251374
9     0.597372
10    0.902832
11    0.534558
12    0.590201
13    0.039282
14    0.357182
15    0.079613
16    0.305460
17    0.330719
18    0.773830
19    0.039959
dtype: float64

In [41]:
# Book Solution
pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1], 
        labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']).head()

0     2nd
1     8th
2     3rd
3     5th
4    10th
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]
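
When the bin names are not needed, `pd.qcut` also accepts an integer `q` and `labels=False`, which returns the decile number (0 to 9) for each value. A minimal sketch on an evenly spaced series, chosen so each decile receives exactly two values:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(20, dtype=float))

# q=10 asks for deciles; labels=False returns integer bin codes
deciles = pd.qcut(ser, q=10, labels=False)
print(deciles.value_counts().sort_index())
```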

#### 12. How to convert a numpy array to a dataframe of given shape? (L1)
Difficulty Level: L1

Reshape the series ser into a dataframe with 7 rows and 5 columns.

In [42]:
#input
ser = pd.Series(np.random.randint(1, 10, 35))
ser

0     2
1     3
2     6
3     1
4     2
5     1
6     2
7     1
8     3
9     2
10    2
11    1
12    1
13    6
14    1
15    5
16    7
17    7
18    1
19    3
20    4
21    4
22    3
23    7
24    1
25    6
26    2
27    4
28    7
29    6
30    6
31    2
32    9
33    1
34    6
dtype: int32

In [43]:
pd.DataFrame(ser.values.reshape(7,5))

   0  1  2  3  4
0  2  3  6  1  2
1  1  2  1  3  2
2  2  1  1  6  1
3  5  7  7  1  3
4  4  4  3  7  1
5  6  2  4  7  6
6  6  2  9  1  6


In [44]:
# Book Solution
df = pd.DataFrame(ser.values.reshape(7,5))
df

   0  1  2  3  4
0  2  3  6  1  2
1  1  2  1  3  2
2  2  1  1  6  1
3  5  7  7  1  3
4  4  4  3  7  1
5  6  2  4  7  6
6  6  2  9  1  6


#### 13. How to find the positions of numbers that are multiples of 3 from a series?
Difficulty Level: L2

Find the positions of numbers that are multiples of 3 from ser.

Input

In [45]:
ser = pd.Series(np.random.randint(1, 10, 7))
ser

0    9
1    6
2    8
3    9
4    6
5    9
6    2
dtype: int32

In [46]:
mask = ser.values%3 == 0
print(ser[mask])
ser[mask].index

0    9
1    6
3    9
4    6
5    9
dtype: int32


Int64Index([0, 1, 3, 4, 5], dtype='int64')

In [47]:
# Book Solution
print(ser)
np.argwhere(ser % 3==0)

0    9
1    6
2    8
3    9
4    6
5    9
6    2
dtype: int32


array([[0],
       [1],
       [3],
       [4],
       [5]], dtype=int64)
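
If a flat array of positions is preferred over the column vector that `np.argwhere` returns, one option is `np.flatnonzero` on the underlying NumPy array. This is a sketch using the same values as the series printed above:

```python
import numpy as np
import pandas as pd

ser = pd.Series([9, 6, 8, 9, 6, 9, 2])

# Positions (not index labels) where the value is a multiple of 3
positions = np.flatnonzero(ser.to_numpy() % 3 == 0)
print(positions.tolist())
```

Working on `ser.to_numpy()` keeps the result positional even when the series has a non-default index.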

#### 14. How to extract items at given positions from a series?
Difficulty Level: L1

From ser, extract the items at positions in list pos.

Input

In [48]:
ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

In [49]:
ser[pos]

0     a
4     e
8     i
14    o
20    u
dtype: object

In [50]:
# Book Solution
ser.take(pos)

0     a
4     e
8     i
14    o
20    u
dtype: object
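
`ser[pos]` selects by index label, which matches the positions here only because the index is the default 0..25. A sketch with a hypothetical non-default index shows the positional selectors `iloc` and `take`:

```python
import pandas as pd

# The index labels (10..14) deliberately differ from the positions (0..4)
ser = pd.Series(list("abcde"), index=[10, 11, 12, 13, 14])
pos = [0, 2, 4]

by_position = ser.iloc[pos]   # positional, like ser.take(pos)
print(by_position)
```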

#### 15. How to stack two series vertically and horizontally?
Difficulty Level: L1

Stack ser1 and ser2 vertically and horizontally (to form a dataframe).

Input

In [51]:
ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

In [52]:
#Vertically
pd.concat([ser1,ser2])

0    0
1    1
2    2
3    3
4    4
0    a
1    b
2    c
3    d
4    e
dtype: object

In [53]:
#Horizontally
pd.concat([ser1,ser2], axis=1)

   0  1
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e


In [54]:
# Book solution
# Vertical (Series.append was removed in pandas 2.0; there, use pd.concat([ser1, ser2]))
ser1.append(ser2)

# Horizontal
df = pd.concat([ser1, ser2], axis=1)
print(df)

   0  1
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e


#### 16. How to get the positions of items of series A in another series B?
Difficulty Level: L2

Get the positions of items of ser2 in ser1 as a list.

Input

In [55]:
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

In [56]:
ser1[ser1.isin(ser2)].index

Int64Index([0, 4, 5, 8], dtype='int64')

In [57]:
# Book Solution 1
[np.where(i == ser1)[0].tolist()[0] for i in ser2]

# Book Solution 2
[pd.Index(ser1).get_loc(i) for i in ser2]

[5, 4, 0, 8]
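
The `isin` approach returns positions in the order of `ser1`, while the book solutions follow the order of `ser2`. A vectorized sketch that keeps the `ser2` order is `Index.get_indexer`, assuming the values of `ser1` are unique:

```python
import pandas as pd

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# Position in ser1 of each item of ser2; -1 would mark an item not found
positions = pd.Index(ser1).get_indexer(ser2).tolist()
print(positions)
```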

#### 17. How to compute the mean squared error on a truth and predicted series?
Difficulty Level: L2

Compute the mean squared error of truth and pred series.

Input

In [58]:
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

In [59]:
# Book Solution
np.mean((truth-pred)**2)

0.1867779426084901
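
Because `pred` is random, the value above changes on every run. A tiny worked sketch with fixed, made-up numbers makes the arithmetic checkable by hand: the errors are -0.5, 0, 0 and 1, their squares are 0.25, 0, 0 and 1, and the mean of those is 0.3125.

```python
import pandas as pd

truth = pd.Series([0, 1, 2, 3])
pred = pd.Series([0.5, 1.0, 2.0, 2.0])

# Mean of the squared element-wise differences
mse = ((truth - pred) ** 2).mean()
print(mse)
```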

#### 18. How to convert the first character of each element in a series to uppercase?
Difficulty Level: L2

Convert the first character of each word in ser to upper case.

In [60]:
ser = pd.Series(['how', 'to', 'kick', 'ass?'])
ser

0     how
1      to
2    kick
3    ass?
dtype: object

In [61]:
ser = pd.Series([x[:1].upper()+x[1:] for x in ser])
ser

0     How
1      To
2    Kick
3    Ass?
dtype: object

In [62]:
# Solution 1
ser.map(lambda x: x.title())

0     How
1      To
2    Kick
3    Ass?
dtype: object

In [63]:
# Solution 2
ser.map(lambda x: x[0].upper() + x[1:])

0     How
1      To
2    Kick
3    Ass?
dtype: object

In [64]:
# Solution 3
pd.Series([i.title() for i in ser])

0     How
1      To
2    Kick
3    Ass?
dtype: object
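
The solutions above agree on this input, but `title()` also lowercases the rest of each word, so they can differ on mixed-case strings. A sketch using the vectorized `.str` accessor, where the word `iPhone` is an illustrative addition:

```python
import pandas as pd

ser = pd.Series(["how", "to", "iPhone"])

# title() capitalizes the first letter and lowercases the remainder
titled = ser.str.title()

# Uppercase only the first character and leave the rest untouched
first_upper = ser.str[:1].str.upper() + ser.str[1:]

print(titled.tolist())
print(first_upper.tolist())
```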

#### 19. How to calculate the number of characters in each word in a series?
Difficulty Level: L2

Input

In [65]:
ser = pd.Series(['how', 'to', 'kick', 'ass?'])

In [66]:
ser.map(lambda x: len(x))

0    3
1    2
2    4
3    4
dtype: int64

In [67]:
# Book Solution
ser.map(lambda x: len(x))

0    3
1    2
2    4
3    4
dtype: int64
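
The same count is available without a lambda through the `.str` accessor. A minimal sketch with example words that are not from the exercise:

```python
import pandas as pd

ser = pd.Series(["how", "to", "count", "chars"])

# Vectorized string length for each element
lengths = ser.str.len()
print(lengths.tolist())
```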

#### 20. How to compute difference of differences between consecutive numbers of a series?
Difficulty Level: L1

Compute the difference of differences between the consecutive numbers of ser.

Input

In [68]:
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

In [69]:
len(ser)

8

In [70]:
dif_ser = []
for i in range(len(ser)-1):
  dif_ser.append(ser[i+1]-ser[i])

dif_ser

[2, 3, 4, 5, 6, 6, 8]

In [71]:
# Solution
print(ser.diff().tolist())
ser.diff().diff().tolist()


[nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]


[nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]
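
The manual loop above stops at the first differences. If the leading `nan` entries are not wanted, a sketch with `np.diff` and `n=2` returns only the second-order differences as integers:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# n=2 applies the difference twice; the result is two elements shorter
second_diff = np.diff(ser.to_numpy(), n=2)
print(second_diff.tolist())
```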

#### 21. How to convert a series of date-strings to a timeseries?
Difficulty Level: L2

Input

In [72]:
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

In [73]:
pd.to_datetime(ser)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

In [74]:
# Solution 1
from dateutil.parser import parse
ser.map(lambda x: parse(x))

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

In [75]:
# Solution 2
pd.to_datetime(ser)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

#### 22. How to get the day of month, week number, day of year and day of week from a series of date strings?

Difficulty Level: L2

Get the day of month, week number, day of year and day of week from ser.

Input

In [86]:
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
ser = pd.to_datetime(ser)

In [87]:
ser.dt.day_name()

0       Friday
1    Wednesday
2     Saturday
3     Thursday
4       Monday
5     Saturday
dtype: object

In [88]:
ser.dt.dayofweek

0    4
1    2
2    5
3    3
4    0
5    5
dtype: int64

In [89]:
ser.dt.week

0    53
1     5
2     9
3    14
4    19
5    23
dtype: int64

In [92]:
ser.dt.dayofyear

0      1
1     33
2     63
3     94
4    125
5    157
dtype: int64

In [95]:
ser.dt.daysinmonth

0    31
1    28
2    31
3    30
4    31
5    30
dtype: int64
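
The task also asks for the day of the month, which is `ser.dt.day` (`daysinmonth` gives the length of the month instead). The sketch below shows it on the first three dates, together with the ISO week number via `Series.dt.isocalendar()`. That call is an assumption about the installed version: it needs pandas 1.1 or later, where it replaces the deprecated `dt.week`.

```python
import pandas as pd

ser = pd.Series(pd.to_datetime(["2010-01-01", "2011-02-02", "2012-03-03"]))

# Day of the month (1..31)
day_of_month = ser.dt.day

# ISO week number (requires pandas >= 1.1)
week_number = ser.dt.isocalendar().week

print(day_of_month.tolist())
print(week_number.tolist())
```

1 Jan 2010 falls in ISO week 53 because that week belongs to the ISO year 2009.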

#### 23. How to convert year-month string to dates corresponding to the 4th day of the month?
Difficulty Level: L2

Convert ser to dates that fall on the 4th of the respective months.

Input

In [110]:
ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

In [105]:
ser = pd.to_datetime(ser)
ser

0   2010-01-01
1   2011-02-01
2   2012-03-01
dtype: datetime64[ns]

In [108]:
ser = ser.map(lambda x: x.replace(day=4))
ser

0   2010-01-04
1   2011-02-04
2   2012-03-04
dtype: datetime64[ns]

In [111]:
#Book solution

# Solution 1
from dateutil.parser import parse
# Parse the date
ser_ts = ser.map(lambda x: parse(x))

# Construct date string with date as 4
ser_datestr = ser_ts.dt.year.astype('str') + '-' + ser_ts.dt.month.astype('str') + '-' + '04'

# Format it.
[parse(i).strftime('%Y-%m-%d') for i in ser_datestr]

# Solution 2
ser.map(lambda x: parse('04 ' + x))

0   2010-01-04
1   2011-02-04
2   2012-03-04
dtype: datetime64[ns]
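
A vectorized sketch of the same idea as Solution 2: prepend the day to each string and hand `pd.to_datetime` an explicit format. It assumes English month abbreviations, since `%b` is locale-dependent.

```python
import pandas as pd

ser = pd.Series(["Jan 2010", "Feb 2011", "Mar 2012"])

# '04 Jan 2010' etc., parsed with an explicit day / month-abbreviation / year format
dates = pd.to_datetime("04 " + ser, format="%d %b %Y")
print(dates)
```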

#### 24. How to filter words that contain at least 2 vowels from a series?
Difficulty Level: L3

From ser, extract words that contain at least 2 vowels.

Input 

In [113]:
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])