## NumPy & Pandas Built-In Methods with Examples

In [113]:
# import the required NumPy and pandas libraries
import numpy as np
import pandas as pd

Dataset used in this exercise:
1. Kaggle's healthy_lifestyle_city_2021.csv

### NumPy

1. np.array()

Creates a NumPy array from a Python list (a one-dimensional array in this example; nested lists give higher dimensions)

In [114]:
l = [ 6, 5, 8, 3,7,9]
a = np.array(l)
print(a)
type(a)

[6 5 8 3 7 9]


numpy.ndarray

2. np.sort()

Returns a sorted copy of the array in ascending order; the original array is left unchanged

In [115]:
print(np.sort(a)) # uses array a in number 1
type(a)

[3 5 6 7 8 9]


numpy.ndarray

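3. np.linspace(start, stop, num)

Creates a new 1D NumPy array of num evenly spaced values over the interval from start to stop (both included by default). The third argument is the number of samples, not the step size.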

In [116]:
print(np.linspace( 0 , 10 , 50 ))

[ 0.          0.20408163  0.40816327  0.6122449   0.81632653  1.02040816
  1.2244898   1.42857143  1.63265306  1.83673469  2.04081633  2.24489796
  2.44897959  2.65306122  2.85714286  3.06122449  3.26530612  3.46938776
  3.67346939  3.87755102  4.08163265  4.28571429  4.48979592  4.69387755
  4.89795918  5.10204082  5.30612245  5.51020408  5.71428571  5.91836735
  6.12244898  6.32653061  6.53061224  6.73469388  6.93877551  7.14285714
  7.34693878  7.55102041  7.75510204  7.95918367  8.16326531  8.36734694
  8.57142857  8.7755102   8.97959184  9.18367347  9.3877551   9.59183673
  9.79591837 10.        ]
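
The difference between a sample count and a step size is easy to check on a small case. The following is a minimal sketch (the names `pts`, `step` and `stepped` are illustrative, not from the notebook) that uses `retstep=True` to report the spacing `np.linspace` chose, and compares it with `np.arange`, which takes a step and excludes the stop value.

```python
import numpy as np

# Third argument is the number of samples; retstep=True also returns the spacing.
pts, step = np.linspace(0, 10, 5, retstep=True)
print(pts)
print(step)

# np.arange takes the step instead and stops before the end value.
stepped = np.arange(0, 10, 2.5)
print(stepped)
```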


4. np.average()

Averages over all the values in the numpy array

In [117]:
print(np.average(a))

6.333333333333333


5. np.argsort()

Returns the indices of a NumPy array so that the indexed values would be sorted.

In [118]:
print(np.argsort(a))

[3 1 0 4 2 5]
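
One common use of these indices is to reorder the array itself, or a second array that runs parallel to it. A small sketch, assuming a made-up `labels` array that is not part of the notebook:

```python
import numpy as np

a = np.array([6, 5, 8, 3, 7, 9])
order = np.argsort(a)          # indices that would sort a

print(a[order])                # same values as np.sort(a)
print(a[order[::-1]])          # descending order

# Hypothetical labels, reordered to follow the sorted values.
labels = np.array(["f", "e", "h", "c", "g", "i"])
print(labels[order])
```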


6. np.arange()

Creates an array of evenly spaced values.
With one argument it counts from 0 up to, but not including, that number. With three arguments it takes the first number, the stop value (which is excluded), and the step size.

In [119]:
np.arange(6)

array([0, 1, 2, 3, 4, 5])

In [120]:
np.arange(2, 9, 2)

array([2, 4, 6, 8])

7. arr.reshape()

Using arr.reshape() will give a new shape to an array without changing the data

In [121]:
b = a.reshape(3, 2)
print(b)

[[6 5]
 [8 3]
 [7 9]]
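
Two details of reshape are worth a quick check: one dimension can be given as -1 so that NumPy infers it, and for a contiguous array such as this one the result is usually a view that shares memory with the original. The sketch below is illustrative only and rebuilds `a` so that it runs on its own.

```python
import numpy as np

a = np.array([6, 5, 8, 3, 7, 9])
b = a.reshape(3, -1)           # -1 lets NumPy infer the second dimension
print(b.shape)

# For a contiguous array like this one, reshape returns a view.
print(np.shares_memory(a, b))
b[0, 0] = 60                   # writing through the view ...
print(a[0])                    # ... is visible in the original array
```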


8. np.newaxis

Increases the dimensions of your array by one dimension when used once. This means that a 1D array will become a 2D array, a 2D array will become a 3D array, and so on.

In [122]:
c = np.array([1, 2, 3, 4, 5, 6])
c.shape

(6,)

In [123]:
d = c[np.newaxis, :]
print(d)
d.shape

[[1 2 3 4 5 6]]


(1, 6)
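
np.newaxis can also be placed second to get a column vector, and the two forms combine through broadcasting. A minimal sketch (the names `col`, `row` and `table` are illustrative):

```python
import numpy as np

c = np.array([1, 2, 3, 4, 5, 6])
col = c[:, np.newaxis]         # column vector, shape (6, 1)
row = c[np.newaxis, :]         # row vector, shape (1, 6)

table = col * row              # broadcasting gives a (6, 6) multiplication table
print(table.shape)
print(table[2, 3])             # 3 * 4
```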

9. np.nonzero(a)

Returns the indices of the nonzero elements in NumPy array

In [124]:
e = np.array([ 10 , 3 , 7 , 1 , 0 ])
print(np.nonzero(e)) 

(array([0, 1, 2, 3], dtype=int64),)
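
Because a boolean array counts True as nonzero, np.nonzero is often applied to a condition rather than to the raw values. The sketch below shows that use and the tuple returned for a 2D input; the array `m` is a made-up example.

```python
import numpy as np

e = np.array([10, 3, 7, 1, 0])
idx = np.nonzero(e > 2)[0]     # positions where the condition holds
print(idx)
print(e[idx])

# For a 2D array the result has one index array per dimension.
m = np.array([[0, 4], [5, 0]])
rows, cols = np.nonzero(m)
print(list(zip(rows.tolist(), cols.tolist())))
```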


10. np.cumsum()

Calculates the cumulative sum of the elements in a NumPy array

In [125]:
print(e) # displays e
print(np.cumsum(e)) # running total of e

[10  3  7  1  0]
[10 13 20 21 21]


11. np.var()

Calculates the variance of a numpy array.

In [126]:
print(e)
print(np.var(e))

[10  3  7  1  0]
14.16


12. np.std()

Calculates the standard deviation of a numpy array.

In [127]:
print(e)
print(np.std(e))

[10  3  7  1  0]
3.7629775444453557
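
By default np.var and np.std divide by N (the population form). A short sketch of the `ddof` parameter, which switches to the sample form that divides by N - 1, and of the relationship between the two functions:

```python
import numpy as np

e = np.array([10, 3, 7, 1, 0])
pop_var = np.var(e)            # population variance, divides by N
samp_var = np.var(e, ddof=1)   # sample variance, divides by N - 1
print(pop_var, samp_var)

# The standard deviation is the square root of the variance.
print(np.isclose(np.std(e), np.sqrt(pop_var)))
```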


13. np.diff()

Calculates the difference between consecutive values in a NumPy array; n sets how many times the differencing is applied.

In [128]:
print(e)
print(np.diff(e, n = 1 ))

[10  3  7  1  0]
[-7  4 -6 -1]
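
Two follow-up checks make the behaviour concrete: n=2 differences the result a second time, and np.cumsum undoes a first difference when the first element is put back. A minimal sketch (the name `rebuilt` is illustrative):

```python
import numpy as np

e = np.array([10, 3, 7, 1, 0])
d1 = np.diff(e)                # first differences
d2 = np.diff(e, n=2)           # differences of the differences
print(d1)
print(d2)

# Prepending the first element and taking the cumulative sum restores e.
rebuilt = np.concatenate(([e[0]], d1)).cumsum()
print(rebuilt)
```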


14. np.flip()

Reverses the contents of an array along an axis. If no axis is given, np.flip() reverses the contents along all axes of the input array.

In [129]:
print(e)
print(np.flip(e))

[10  3  7  1  0]
[ 0  1  7  3 10]
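
The axis argument only matters for arrays with more than one dimension. A small sketch on a made-up 2 x 3 array `m`:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]

rows_reversed = np.flip(m, axis=0)   # swap the row order
cols_reversed = np.flip(m, axis=1)   # reverse each row
both_reversed = np.flip(m)           # no axis: reverse along every axis

print(rows_reversed)
print(cols_reversed)
print(both_reversed)
```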


15. np.max()

Returns the maximum element in the array. The method form e.max() used below is equivalent to np.max(e).

In [130]:
print(e)
e.max()

[10  3  7  1  0]


10

16. np.min()

Returns the minimum element in the array. The method form e.min() used below is equivalent to np.min(e).

In [131]:
print(e)
e.min()

[10  3  7  1  0]


0

17. np.sum()

Returns the sum of all elements in the array. The method form e.sum() used below is equivalent to np.sum(e).

In [132]:
print(e)
e.sum()

[10  3  7  1  0]


21

18. arr.flatten()

Flattens a multi-dimensional array to a 1D array. flatten() is an array method (NumPy has no np.flatten function) and always returns a copy, so changes to the new array won't change the parent array.

In [133]:
x = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
x.flatten()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [134]:
a1 = x.flatten()
a1[0] = 99
print(x) # Original array
print(a1) # New array

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
[99  2  3  4  5  6  7  8  9 10 11 12]


19. np.ravel()

Flattens a multi-dimensional array to a 1D array. ravel() returns a view of the parent array whenever possible, so changes to the new array can affect the parent array as well. Because it avoids a copy in that case, it is memory efficient.

In [135]:
a2 = x.ravel()
a2[0] = 98
print(x) # Original array
print(a2) # New array

[[98  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
[98  2  3  4  5  6  7  8  9 10 11 12]
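
Whether a result is a copy or a view can be checked with np.shares_memory. The sketch below compares flatten and ravel on the same array, and also shows a case (a transposed array) where ravel has to fall back to a copy.

```python
import numpy as np

x = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

flat_copy = x.flatten()        # always a copy
flat_view = x.ravel()          # a view here, because x is contiguous
transposed_flat = x.T.ravel()  # the transpose is not C-contiguous, so this is a copy

print(np.shares_memory(x, flat_copy))
print(np.shares_memory(x, flat_view))
print(np.shares_memory(x, transposed_flat))
```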


20. np.logspace()

Generate evenly spaced numbers on a log scale.

In [136]:
np.logspace(start = 1, stop = 78, num = 50, endpoint = True, base = 10, dtype=None, axis = 0)

array([1.00000000e+01, 3.72759372e+02, 1.38949549e+04, 5.17947468e+05,
       1.93069773e+07, 7.19685673e+08, 2.68269580e+10, 1.00000000e+12,
       3.72759372e+13, 1.38949549e+15, 5.17947468e+16, 1.93069773e+18,
       7.19685673e+19, 2.68269580e+21, 1.00000000e+23, 3.72759372e+24,
       1.38949549e+26, 5.17947468e+27, 1.93069773e+29, 7.19685673e+30,
       2.68269580e+32, 1.00000000e+34, 3.72759372e+35, 1.38949549e+37,
       5.17947468e+38, 1.93069773e+40, 7.19685673e+41, 2.68269580e+43,
       1.00000000e+45, 3.72759372e+46, 1.38949549e+48, 5.17947468e+49,
       1.93069773e+51, 7.19685673e+52, 2.68269580e+54, 1.00000000e+56,
       3.72759372e+57, 1.38949549e+59, 5.17947468e+60, 1.93069773e+62,
       7.19685673e+63, 2.68269580e+65, 1.00000000e+67, 3.72759372e+68,
       1.38949549e+70, 5.17947468e+71, 1.93069773e+73, 7.19685673e+74,
       2.68269580e+76, 1.00000000e+78])
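
Note that start and stop are exponents, not the end values themselves. A smaller sketch makes this easier to see, and shows that the result matches raising the base to an np.linspace of exponents:

```python
import numpy as np

powers_of_ten = np.logspace(0, 3, num=4)          # 10**0 ... 10**3
same = 10 ** np.linspace(0, 3, num=4)             # equivalent construction
powers_of_two = np.logspace(0, 4, num=5, base=2)  # 2**0 ... 2**4

print(powers_of_ten)
print(powers_of_two)
```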

### Pandas

1. pd.Series()

Creates a Series from a list of values, letting pandas create a default integer index

In [137]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
type(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


pandas.core.series.Series
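
A short sketch of how the NaN entry is treated by common Series methods, and of passing explicit labels instead of the default integer index. The `named` Series and its labels are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s.count())   # counts non-missing values only
print(s.mean())    # NaN is skipped by default

# Hypothetical labels instead of the default integer index.
named = pd.Series([1.5, 2.5], index=["a", "b"])
print(named["b"])
```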

2. pd.read_csv()

Imports a CSV file; creates a pandas DataFrame from a local CSV.

In [138]:
# raw string, so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'B:\WTF\Week 9\healthy_lifestyle_city_2021.csv')

3. head()

Displays the first 5 rows in a dataframe

In [139]:
df.head()

| | City | Rank | Sunshine hours(City) | Cost of a bottle of water(City) | Obesity levels(Country) | Life expectancy(years) (Country) | Pollution(Index score) (City) | Annual avg. hours worked | Happiness levels(Country) | Outdoor activities(City) | Number of take out places(City) | Cost of a monthly gym membership(City) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Amsterdam | 1 | 1858 | £1.92 | 20.40% | 81.2 | 30.93 | 1434 | 7.44 | 422 | 1048 | £34.90 |
| 1 | Sydney | 2 | 2636 | £1.48 | 29.00% | 82.1 | 26.86 | 1712 | 7.22 | 406 | 1103 | £41.66 |
| 2 | Vienna | 3 | 1884 | £1.94 | 20.10% | 81.0 | 17.33 | 1501 | 7.29 | 132 | 1008 | £25.74 |
| 3 | Stockholm | 4 | 1821 | £1.72 | 20.60% | 81.8 | 19.63 | 1452 | 7.35 | 129 | 598 | £37.31 |
| 4 | Copenhagen | 5 | 1630 | £2.19 | 19.70% | 79.8 | 21.24 | 1380 | 7.64 | 154 | 523 | £32.53 |


4. sort_values()
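Pass a column name and specify ascending=True or False to sort the DataFrame by that column. Note that 'Sunshine hours(City)' is stored as text in this dataset (it has dtype object and contains a "-" entry for Geneva), so the rows are ordered as strings rather than as numbers.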

In [140]:
df2 = df.sort_values('Sunshine hours(City)',ascending=True)
print(df2.head())

        City  Rank Sunshine hours(City) Cost of a bottle of water(City)  \
19    Geneva    20                    -                           £2.62   
23    Taipei    24                 1405                           £0.57   
27    Dublin    28                 1453                           £1.40   
32  Brussels    33                 1546                           £2.11   
36    Zurich    37                 1566                           £3.20   

   Obesity levels(Country)  Life expectancy(years) (Country)  \
19                  19.50%                              82.6   
23                   6.20%                              75.4   
27                  25.30%                              80.5   
32                  22.10%                              80.4   
36                  19.50%                              82.6   

   Pollution(Index score) (City) Annual avg. hours worked  \
19                         27.25                     1557   
23                         49.32          
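
The "-" entry sorting first is a sign that the column is being compared as text. One way to sort numerically is to convert the column with pd.to_numeric first. The sketch below is an assumption about how this could be handled, not part of the original notebook; it uses three rows copied from the dataset and a hypothetical helper column `sunshine_num`.

```python
import pandas as pd

# Three rows copied from the dataset; "-" marks a missing value.
demo = pd.DataFrame({
    "City": ["Sydney", "Geneva", "Taipei"],
    "Sunshine hours(City)": ["2636", "-", "1405"],
})

# Sorting the text column compares strings, so "-" comes before any digit.
as_text = demo.sort_values("Sunshine hours(City)")

# errors="coerce" turns unparseable entries such as "-" into NaN.
demo["sunshine_num"] = pd.to_numeric(demo["Sunshine hours(City)"], errors="coerce")
as_number = demo.sort_values("sunshine_num")   # NaN rows are placed last by default

print(as_text["City"].tolist())
print(as_number["City"].tolist())
```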

5. to_csv()
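Exports a DataFrame to a CSV file; use this method to save it to local storage. Here the sorted frame df2 is saved, to match the file name.

In [141]:
# raw string, so the backslashes in the Windows path are not read as escape sequences
df2.to_csv(r'B:\WTF\Week 9\healthy_lifestyle_city_2021_sorted.csv')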
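
By default to_csv also writes the row index as an unnamed first column, which comes back as "Unnamed: 0" when the file is read again. The sketch below shows the effect of index=False using an in-memory buffer instead of a file; the small `frame` is made up for illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame({"City": ["Amsterdam", "Sydney"], "Rank": [1, 2]})

# Default: the index is written as an extra, unnamed column.
buf = io.StringIO()
frame.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False leaves the index out of the file.
buf = io.StringIO()
frame.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```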

6. pd.date_range()

Format: pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=_NoDefault.no_default, inclusive=None, **kwargs)

Returns a DatetimeIndex of equally spaced time points

In [142]:
dates = pd.date_range(start = '1/1/2018', periods = 8)
dates

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', freq='D')
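
The spacing defaults to one day, as the freq='D' in the output shows. A brief sketch of two other ways to call it: with a different frequency alias ("MS" for month start), and with an explicit end date instead of a number of periods.

```python
import pandas as pd

# Three consecutive month starts.
month_starts = pd.date_range(start="2018-01-01", periods=3, freq="MS")

# Daily points between two dates, both ends included.
span = pd.date_range(start="2018-01-01", end="2018-01-08")

print(month_starts)
print(len(span))
```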

7. pd.DataFrame()

Creates a two-dimensional, size-mutable, potentially heterogeneous table with labeled axes (rows and columns)

In [143]:
df3 = pd.DataFrame(np.random.randn(8,4), index = dates, columns = list("ABCD"))
df3

| | A | B | C | D |
|---|---|---|---|---|
| 2018-01-01 | 0.70159 | 1.666007 | -0.576277 | -0.053822 |
| 2018-01-02 | -0.57591 | -0.611593 | -0.344368 | -0.780003 |
| 2018-01-03 | 1.249545 | -0.354353 | -0.243451 | 0.250312 |
| 2018-01-04 | -0.264001 | 2.614185 | -0.665672 | 0.825756 |
| 2018-01-05 | 0.193754 | -1.314808 | 1.447198 | 0.987283 |
| 2018-01-06 | 0.017679 | 1.191284 | 0.471902 | 0.524517 |
| 2018-01-07 | 0.305624 | -0.527995 | -0.488307 | -1.105889 |
| 2018-01-08 | 1.156672 | -0.460331 | -0.364484 | 0.336539 |


8. DataFrame.shape

An attribute (used without parentheses) that returns a tuple with the number of rows, then the number of columns

In [148]:
df.shape

(44, 12)

9. DataFrame.size

An attribute that returns the number of elements, i.e. the number of rows times the number of columns in the DataFrame

In [149]:
df.size

528

10. DataFrame.info()

df.info() prints a summary of the DataFrame: the row range (RangeIndex), the column names, the non-null count of each column, and the data type of each column.

In [150]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44 entries, 0 to 43
Data columns (total 12 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   City                                    44 non-null     object 
 1   Rank                                    44 non-null     int64  
 2   Sunshine hours(City)                    44 non-null     object 
 3   Cost of a bottle of water(City)         44 non-null     object 
 4   Obesity levels(Country)                 44 non-null     object 
 5   Life expectancy(years) (Country)        44 non-null     float64
 6   Pollution(Index score) (City)           44 non-null     object 
 7   Annual avg. hours worked                44 non-null     object 
 8   Happiness levels(Country)               44 non-null     float64
 9   Outdoor activities(City)                44 non-null     int64  
 10  Number of take out places(City)         44 non-null     int64  


11. DataFrame.describe()

To get basic statistics for the numeric columns, use df.describe(). It gives the count, mean, standard deviation, and the five-number summary (min, 25%, 50%, 75%, max).

In [151]:
df.describe()

| | Rank | Life expectancy(years) (Country) | Happiness levels(Country) | Outdoor activities(City) | Number of take out places(City) |
|---|---|---|---|---|---|
| count | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 |
| mean | 22.5 | 78.175 | 6.435 | 213.977273 | 1443.113636 |
| std | 12.845233 | 5.30437 | 0.991202 | 127.190297 | 1388.80327 |
| min | 1.0 | 56.3 | 3.57 | 23.0 | 250.0 |
| 25% | 11.75 | 75.4 | 5.87 | 125.25 | 548.0 |
| 50% | 22.5 | 80.4 | 6.9 | 189.5 | 998.0 |
| 75% | 33.25 | 81.8 | 7.175 | 288.25 | 1674.25 |
| max | 44.0 | 83.2 | 7.8 | 585.0 | 6417.0 |
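
Only five of the twelve columns appear above because the others are stored as text (dtype object), for example prices like "£1.92" and percentages like "20.40%". One possible way to bring such columns into describe() is to strip the symbols and convert to float. This is a sketch under that assumption, using three values copied from the dataset and hypothetical column names `water_gbp` and `obesity_pct`.

```python
import pandas as pd

# A few values copied from the dataset, still stored as text.
raw = pd.DataFrame({
    "Cost of a bottle of water(City)": ["£1.92", "£1.48", "£1.94"],
    "Obesity levels(Country)": ["20.40%", "29.00%", "20.10%"],
})

# Strip the currency and percent symbols, then convert to float.
clean = pd.DataFrame({
    "water_gbp": raw["Cost of a bottle of water(City)"]
        .str.replace("£", "", regex=False).astype(float),
    "obesity_pct": raw["Obesity levels(Country)"]
        .str.rstrip("%").astype(float),
})

print(raw.describe())     # text columns: count, unique, top, freq
print(clean.describe())   # numeric columns: count, mean, std, min, quartiles, max
```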


12. DataFrame.columns

To get the names of all the variables in a DataFrame, use the df.columns attribute (no parentheses).

In [152]:
df.columns

Index(['City', 'Rank', 'Sunshine hours(City)',
       'Cost of a bottle of water(City)', 'Obesity levels(Country)',
       'Life expectancy(years) (Country)', 'Pollution(Index score) (City)',
       'Annual avg. hours worked', 'Happiness levels(Country)',
       'Outdoor activities(City)', 'Number of take out places(City)',
       'Cost of a monthly gym membership(City)'],
      dtype='object')

13. DataFrame.nunique()

To get the number of unique values of each variable, use df.nunique(). It returns the count of distinct values in each column, not the values themselves.

In [153]:
df.nunique()

City                                      44
Rank                                      44
Sunshine hours(City)                      40
Cost of a bottle of water(City)           39
Obesity levels(Country)                   28
Life expectancy(years) (Country)          27
Pollution(Index score) (City)             44
Annual avg. hours worked                  23
Happiness levels(Country)                 30
Outdoor activities(City)                  43
Number of take out places(City)           44
Cost of a monthly gym membership(City)    44
dtype: int64

14. DataFrame.groupby()

groupby() groups a pandas DataFrame by one or more columns so that an aggregation (such as a sum or a maximum) can be computed for each group. It is a simple way to summarize data.

In [160]:
df3 = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

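# function names are given as strings; newer pandas versions warn when the builtins sum and max are passed
df3.groupby("col3").agg({"col1": "sum", "col2": "max"})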

| col3 | col1 | col2 |
|---|---|---|
| A | 1 | 2 |
| B | 8 | 10 |
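
An alternative spelling of the same aggregation is pandas' named aggregation, where each keyword names an output column and maps it to a (column, function) pair. A minimal sketch; the output names `col1_sum` and `col2_max` are chosen here for illustration.

```python
import pandas as pd

df3 = pd.DataFrame([[1, 2, "A"],
                    [5, 8, "B"],
                    [3, 10, "B"]],
                   columns=["col1", "col2", "col3"])

# Each keyword is an output column: (input column, aggregation name).
summary = df3.groupby("col3").agg(
    col1_sum=("col1", "sum"),
    col2_max=("col2", "max"),
)
print(summary)
```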


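15. DataFrame.tail()

Displays the last 5 rows of a DataFrame by default; pass a number to change how many rows are shown.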
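
No example is given for tail() in the notebook, so the following is a small sketch on a made-up seven-row frame rather than on the dataset.

```python
import pandas as pd

# Hypothetical frame with seven rows.
demo = pd.DataFrame({"Rank": [1, 2, 3, 4, 5, 6, 7]})

last_five = demo.tail()    # default: last 5 rows
last_two = demo.tail(2)    # explicit row count

print(last_five)
print(last_two)
```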

16. DataFrame.isna()

df.isna() returns a DataFrame of the same shape with True wherever a value is missing. Chaining .sum() gives the number of missing values per column, and selecting a single column first gives the count for just that variable.

In [154]:
df.isna()

| | City | Rank | Sunshine hours(City) | Cost of a bottle of water(City) | Obesity levels(Country) | Life expectancy(years) (Country) | Pollution(Index score) (City) | Annual avg. hours worked | Happiness levels(Country) | Outdoor activities(City) | Number of take out places(City) | Cost of a monthly gym membership(City) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False |
| 5 | False | False | False | False | False | False | False | False | False | False | False | False |
| 6 | False | False | False | False | False | False | False | False | False | False | False | False |
| 7 | False | False | False | False | False | False | False | False | False | False | False | False |
| 8 | False | False | False | False | False | False | False | False | False | False | False | False |
| 9 | False | False | False | False | False | False | False | False | False | False | False | False |
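
Counting the missing values usually means chaining .sum() onto isna(). The sketch below uses a small made-up frame with a few gaps so that the counts are not all zero.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing entries.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
})

per_column = demo.isna().sum()       # missing values in each column
total = demo.isna().sum().sum()      # missing values in the whole frame
only_b = demo["b"].isna().sum()      # missing values in one column

print(per_column)
print(total, only_b)
```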


17. DataFrame.rename()

If you want to rename the column headers, use the df.rename() method, as demonstrated below. It returns a new DataFrame and leaves the original unchanged.

In [162]:
df4 = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col_A", "col2", "col3"])
print(df4)
df4.rename(columns = {"col_A":"col1"})

   col_A  col2 col3
0      1     2    A
1      5     8    B
2      3    10    B


| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 2 | A |
| 1 | 5 | 8 | B |
| 2 | 3 | 10 | B |
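
The call above displays the renamed frame but does not store it. A short sketch showing that the original keeps its column names unless the result is assigned (the name `renamed` is illustrative):

```python
import pandas as pd

df4 = pd.DataFrame([[1, 2, "A"],
                    [5, 8, "B"],
                    [3, 10, "B"]],
                   columns=["col_A", "col2", "col3"])

renamed = df4.rename(columns={"col_A": "col1"})   # returns a new DataFrame

print(list(df4.columns))       # original is unchanged
print(list(renamed.columns))
```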


18. df.drop()

If you want to delete a column, use the df.drop() method:

In [163]:
df5 = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

print(df5.drop(columns = ["col1"]))

   col2 col3
0     2    A
1     8    B
2    10    B


19. df.iloc

Selects rows and columns by integer position; df.iloc[1] returns the second row as a Series

In [164]:
df.iloc[1]

City                                      Sydney
Rank                                           2
Sunshine hours(City)                        2636
Cost of a bottle of water(City)            £1.48
Obesity levels(Country)                   29.00%
Life expectancy(years) (Country)            82.1
Pollution(Index score) (City)              26.86
Annual avg. hours worked                    1712
Happiness levels(Country)                   7.22
Outdoor activities(City)                     406
Number of take out places(City)             1103
Cost of a monthly gym membership(City)    £41.66
Name: 1, dtype: object
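
Position-based selection with iloc differs from label-based selection with loc, which becomes visible once the index is not 0, 1, 2, and so on. A minimal sketch with three cities from the dataset and a made-up index of 10, 20, 30:

```python
import pandas as pd

# Index labels 10, 20, 30 are hypothetical, chosen to differ from positions.
demo = pd.DataFrame(
    {"City": ["Amsterdam", "Sydney", "Vienna"], "Rank": [1, 2, 3]},
    index=[10, 20, 30],
)

print(demo.iloc[0]["City"])    # first row by position
print(demo.loc[10, "City"])    # same row by label
print(demo.iloc[-1, 0])        # last row, first column

# iloc slices exclude the end position; loc slices include the end label.
print(len(demo.iloc[0:1]))
print(len(demo.loc[10:20]))
```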

20. DataFrame.dtypes

You can print the data type of every column using the dtypes attribute

In [146]:
df.dtypes

City                                       object
Rank                                        int64
Sunshine hours(City)                       object
Cost of a bottle of water(City)            object
Obesity levels(Country)                    object
Life expectancy(years) (Country)          float64
Pollution(Index score) (City)              object
Annual avg. hours worked                   object
Happiness levels(Country)                 float64
Outdoor activities(City)                    int64
Number of take out places(City)             int64
Cost of a monthly gym membership(City)     object
dtype: object