## Table of Contents

1. **[Pandas](#pandas)**

2. **[Data Structures](#structures)**
    
3. **[Pandas Series](#series)**
    - 3.1 - [Creating a Series](#creatingS)
    - 3.2 - [Manipulating Series](#manipulatingS)

4. **[Pandas DataFrames](#dataframes)**
    - 4.1 - [Creating DataFrames](#creatingDF)
    - 4.2 - [Manipulating DataFrames](#manipulatingDF)

5. **[Reading Data from Different Sources](#reading_data)**


<a id="pandas"> </a>
## 1. Pandas
### Introduction to Pandas
*Pandas* contains data structures and data manipulation tools designed for data cleaning and analysis.

While pandas adopts many coding idioms from `NumPy`, the biggest difference is that *pandas is designed for working with tabular or heterogeneous data*. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

The name Pandas is derived from the term “*panel data*”, an econometrics term for **multidimensional structured data sets**.

#### **How to install pandas?**

1. If pandas is not installed yet, we can install it from a notebook cell with:

`!pip install pandas`

2. Once it is installed, we import it. By convention, the pandas library is imported under the alias `pd`:

In [1]:
import pandas as pd

So from now on we will write `pd.` instead of `pandas.`.

<a id="structures"> </a>
## 2. Data Structures
#### Introduction to Data Structures

Pandas has two main data structures:<br>
1. A Series is a one-dimensional labeled array that can hold data of any type (integer, string, boolean, float, Python objects, and so on). Its axis labels are collectively called an index.<br>
2. A DataFrame is a two-dimensional labeled data structure with columns. Each column can have a different data type.
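
The difference between the two structures can be seen in a minimal sketch. The names `ages` and `people` and their values are made up for illustration only.

```python
import pandas as pd

# A Series: one column of values plus an index of labels.
ages = pd.Series([45, 12, 54], index=["a", "b", "c"])

# A DataFrame: several columns that share one row index.
people = pd.DataFrame(
    {"age": [45, 12, 54], "height_m": [1.35, 1.21, 1.50]},
    index=["a", "b", "c"],
)

print(ages.ndim, ages.shape)      # 1 (3,)
print(people.ndim, people.shape)  # 2 (3, 2)
print(list(people.columns))       # ['age', 'height_m']
```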
  

<a id="series"> </a>
## 3. Pandas Series
#### Introduction to Pandas Series and Creating Series

A pandas Series is a one-dimensional labeled array capable of holding any data type. In practice, all values of a Series share a single data type, much like an array or a column in a table; mixed values are stored with the generic `object` type.

Pandas assigns an index label to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.


<a id="creatingS"> </a>
### 3.1 Creating a Series

#### **1. To create a numeric series** 

In [2]:
# create a numeric series
numbers = range(1,100,5) #from range 1 to 100 in steps of 5
pd.Series(numbers)

0      1
1      6
2     11
3     16
4     21
5     26
6     31
7     36
8     41
9     46
10    51
11    56
12    61
13    66
14    71
15    76
16    81
17    86
18    91
19    96
dtype: int64

The output also gives the data type of the series as `int64`.

Because no index was supplied, the index labels run from 0 to 19.

> *In pandas, the row labels are called the 'index'.*

#### **2. To create an object series** 

In [3]:
# create an object series
string = "Hi" , "How" ,"are", "you", "?"
pd.Series(string)

0     Hi
1    How
2    are
3    you
4      ?
dtype: object

The output gives the data type of the series as `object`

#### **3. To create a series by giving both numeric and string values** 

In [4]:
# create a Series with an arbitrary list
s = pd.Series([345, 'London', 34.5, -34.45, 'Happy Birthday'])
s

0               345
1            London
2              34.5
3            -34.45
4    Happy Birthday
dtype: object

> Here the numeric values are stored as `object` because the `object` data type can hold a number, but a numeric data type cannot hold a string. By storing data as `object`, however, we lose reliable arithmetic on the series: operations run element by element, and an operation such as adding a number fails when it reaches a string.
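
A small sketch of that trade-off follows. The series `mixed` is a shortened version of the one above. Using `pd.to_numeric` with `errors="coerce"` is one possible way to recover the numbers, not something the original notebook does.

```python
import pandas as pd

mixed = pd.Series([345, "London", 34.5])
print(mixed.dtype)  # object

# Adding a number fails because "London" + 1 is not defined.
try:
    mixed + 1
    arithmetic_failed = False
except TypeError:
    arithmetic_failed = True
print(arithmetic_failed)  # True

# Coerce to numbers: anything that is not numeric becomes NaN.
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric.dtype)  # float64
print(numeric.sum())  # 379.5
```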

#### **4. To set index values for a series**
We can label a series with something other than the default integers. To do that, we pass the labels explicitly through the `index` argument.

In [5]:
marks = [60, 89, 74, 86]

subject = ["Maths", "Science", "English" , "Social Science"]

pd.Series(marks, index = subject) # (values, index labels)


Maths             60
Science           89
English           74
Social Science    86
dtype: int64

The index is added using the argument `index=`. **The data type of the series continues to be numeric**.

#### **5. To create a series from a dictionary**

In [6]:
data = {'Maths': 60, 'Science': 89, 'English': 76, 'Social Science': 86}

pd.Series(data)

Maths             60
Science           89
English           76
Social Science    86
dtype: int64

In [7]:
data = {'Maths': 60, 'Science': 'I\'m a string :D', 'English': 76, 'Social Science': 86}

pd.Series(data)

Maths                          60
Science           I'm a string :D
English                        76
Social Science                 86
dtype: object

When a dict is passed, the index of the resulting Series takes the dict's keys in insertion order, as both outputs above show. Older pandas releases sorted the keys instead.
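
A quick check of the key order, assuming a recent pandas and Python 3.7 or later. If a sorted index is wanted, `sort_index` can be called explicitly.

```python
import pandas as pd

data = {"Maths": 60, "Science": 89, "English": 76}
marks = pd.Series(data)

# The index follows the order in which the keys were written.
print(list(marks.index))               # ['Maths', 'Science', 'English']

# Sorting by label is a separate, explicit step.
print(list(marks.sort_index().index))  # ['English', 'Maths', 'Science']
```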

#### **6. A series with missing values**

In [8]:
data = {'Maths': 60, 'Science': 89, 'English': 76, 'Social Science': 86}

subjects = ["Maths", "Science", "Arts and Crafts" , "Social Science"]

marks_series = pd.Series(data, index = subjects)

print(marks_series)

Maths              60.0
Science            89.0
Arts and Crafts     NaN
Social Science     86.0
dtype: float64


> The data type of this series is `float64` because it contains a `NaN` value. `NaN` is a floating-point value, so the whole series has to be a `float` series.

When we pass both a dictionary and an `index`, pandas does not overwrite the dictionary's keys. Instead, it looks up each label of the index (here, the `subjects` list) among the keys of `data` and takes the matching value. `Maths`, `Science` and `Social Science` are found, so they keep their values. `Arts and Crafts` is not a key of `data`, so there is no value for it and pandas marks it as missing, which is displayed as `NaN` (*not a number*).

The key `English` is not among the requested labels, so it is left out and its value `76` does not appear in the series.
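
The lookup behaviour can be checked on a two-key dictionary. This is a minimal sketch. The final `fillna(0)` step is only one possible way to handle the gap and is an assumption, not part of the original notebook.

```python
import pandas as pd

data = {"Maths": 60, "English": 76}
labels = ["Maths", "Arts and Crafts"]

marks = pd.Series(data, index=labels)

print(marks["Maths"])             # 60.0
print(marks.isnull().tolist())    # [False, True]
print("English" in marks.index)   # False

# Replace the missing value with 0 and go back to integers.
filled = marks.fillna(0).astype("int64")
print(filled.tolist())            # [60, 0]
```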

<a id="manipulatingS"> </a>
### 3.2 Manipulating Series 
#### Series Methods

**1. To check for null values using `.isnull`**

In [9]:
marks_series.isnull()

Maths              False
Science            False
Arts and Crafts     True
Social Science     False
dtype: bool

`False` indicates that the value is not null.

**2. To check for null values using `.notnull`**

In [10]:
marks_series.notnull()

Maths               True
Science             True
Arts and Crafts    False
Social Science      True
dtype: bool

`True` indicates that the value is not null.

**3. To know the subjects in which marks score is more than 75**

In [11]:
marks_series[marks_series > 75]
#In marks_series, show all the indexes for which
#the value of marks_series is greater than 75.

Science           89.0
Social Science    86.0
dtype: float64
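
The expression inside the brackets is itself a Series of booleans, often called a mask. The sketch below rebuilds a small series with the same values and shows the mask explicitly. Combining two conditions with `&` is an extra illustration not used in the original notebook.

```python
import pandas as pd

marks = pd.Series({"Maths": 60.0, "Science": 89.0, "Social Science": 86.0})

# The comparison produces one True/False per item.
mask = marks > 75
print(mask.tolist())               # [False, True, True]

# Indexing with the mask keeps only the items marked True.
print(marks[mask].index.tolist())  # ['Science', 'Social Science']

# Conditions are combined with & (and) or | (or), each in parentheses.
between = marks[(marks > 75) & (marks < 88)]
print(between.index.tolist())      # ['Social Science']
```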

**4. To assign 68 marks to 'Arts and Crafts'**

In [12]:
marks_series["Arts and Crafts"] = 68
marks_series

Maths              60.0
Science            89.0
Arts and Crafts    68.0
Social Science     86.0
dtype: float64

> Since the series was already a `float` series, it keeps its `float64` type even if every value is now a whole number. We can convert it to `int64` if we wish.
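
A sketch of that conversion with `astype`, using the same four marks. The second part assumes a pandas version that supports the nullable `Int64` type, which is one option when missing values are still present.

```python
import numpy as np
import pandas as pd

marks = pd.Series(
    [60.0, 89.0, 68.0, 86.0],
    index=["Maths", "Science", "Arts and Crafts", "Social Science"],
)

# No missing values left, so the conversion to int64 is safe.
as_int = marks.astype("int64")
print(as_int.tolist())  # [60, 89, 68, 86]

# With a NaN present, plain int64 cannot represent the gap;
# the nullable "Int64" type keeps it as a missing value.
with_gap = pd.Series([60.0, np.nan])
nullable = with_gap.astype("Int64")
print(nullable.isna().tolist())  # [False, True]
```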

**5. To check whether Maths marks are 73**

In [13]:
marks_series.Maths == 73

False

In [14]:
# or you may use

marks_series["Maths"] == 73

False

**6. Sorting a numeric series**

> A pandas Series comes with built-in methods, such as `sort_values`.

In [15]:
# create a pandas series
import numpy as np
values = pd.Series([23, 45, np.nan, 41, 23, 34, 55, np.nan, 34, 20])
values

0    23.0
1    45.0
2     NaN
3    41.0
4    23.0
5    34.0
6    55.0
7     NaN
8    34.0
9    20.0
dtype: float64

In [16]:
# ascending order
values.sort_values(ascending = True)

9    20.0
0    23.0
4    23.0
5    34.0
8    34.0
3    41.0
1    45.0
6    55.0
2     NaN
7     NaN
dtype: float64

In [17]:
# descending order
values.sort_values(ascending = False)

6    55.0
1    45.0
3    41.0
5    34.0
8    34.0
0    23.0
4    23.0
9    20.0
2     NaN
7     NaN
dtype: float64

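> **Caution:** The `sort_values` method returns a new, sorted series in which every value keeps its original index label, so the index is reordered together with the values rather than renumbered. The original series is left unchanged, and `NaN` values are placed at the end by default.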
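
The sketch below illustrates these points on a shorter series. The `na_position` argument and `reset_index(drop=True)` are optional extras shown here for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([23, np.nan, 41, 20])

# Each value keeps its label; NaN goes last by default.
ordered = values.sort_values()
print(ordered.index.tolist())      # [3, 0, 2, 1]

# NaN can be placed first instead.
nan_first = values.sort_values(na_position="first")
print(nan_first.index.tolist())    # [1, 3, 0, 2]

# reset_index(drop=True) renumbers the sorted result from 0.
renumbered = ordered.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]

# The original series is untouched.
print(values.index.tolist())       # [0, 1, 2, 3]
```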

**7. Sorting a series of strings**

In [18]:
# create a pandas series
string_values = pd.Series(["a", "j", "d", "f", "t", "a"])

string_values

0    a
1    j
2    d
3    f
4    t
5    a
dtype: object

In [19]:
# ascending order
string_values.sort_values(ascending = True)

0    a
5    a
2    d
3    f
1    j
4    t
dtype: object

In [20]:
# descending order
string_values.sort_values(ascending = False)

4    t
1    j
3    f
2    d
0    a
5    a
dtype: object

**8. Rank a Series**

The `rank` method assigns a *rank* to every value, in ascending or descending order.

In [21]:
# recall the marks_series
print(marks_series)
print('-----------------------')

marks_series.rank( ascending=True)

Maths              60.0
Science            89.0
Arts and Crafts    68.0
Social Science     86.0
dtype: float64
-----------------------


Maths              1.0
Science            4.0
Arts and Crafts    2.0
Social Science     3.0
dtype: float64

<a id="dataframes"> </a>
## 4. Pandas DataFrames
#### Introduction to Dataframes and Creating Dataframes

A DataFrame is a *multidimensional analogue of a Series*. It is a tabular representation of data containing an ordered collection of columns, each of which can be a different type (numeric, string, boolean, and so on).

***The DataFrame has both a row and column index***; it can be thought of as a ***dict of Series all sharing the same index***. In a data frame, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.

While a DataFrame is physically two-dimensional, it can be used to represent higher dimensional data in a tabular format using hierarchical indexing.

<a id="creatingDF"> </a>
### 4.1 Creating DataFrames and DataFrame Methods

**1. Creating a data frame from a dictionary**

> Dictionaries are written with curly braces, inside of which are `key:value` pairs separated by commas. When a dictionary is passed to `pd.DataFrame`, each `key:value` pair becomes a column: the `key` is the *name* of the column, and each element of the `value` sequence is a *value* in that column.

In [22]:
data = {'Subject': ['Maths', 'History', 'Science', 'English', 'Georaphy', 'Art'],
        'Marks': (45, 65, 78, 65, 80, 78),
        'CGPA': [2.5, 3.0, 3.5, 2.0, 4.0, 4.0]}

df = pd.DataFrame(data)
print(df)

    Subject  Marks  CGPA
0     Maths     45   2.5
1   History     65   3.0
2   Science     78   3.5
3   English     65   2.0
4  Georaphy     80   4.0
5       Art     78   4.0


> **Note:** As with a Series, the resulting DataFrame is assigned an index automatically, and we can change that index if needed. Also note that the **Marks** values are given as a tuple; lists and tuples both work.

**2. To create dataframe from series**

In [23]:
Subject = pd.Series(['Maths', 'History', 'Science', 'English', 'Georaphy', 'Art'])
Marks = pd.Series([45, 65, 78, 65, 80, 78])
CGPA = pd.Series([2.5, 3.0, 3.5, 2.0, 4.0, 4.0])

In [24]:
pd.DataFrame([Subject,Marks,CGPA], index = ['Subject','Marks','CGPA'])

             0        1        2        3         4    5
Subject  Maths  History  Science  English  Georaphy  Art
Marks       45       65       78       65        80   78
CGPA       2.5      3.0      3.5      2.0       4.0  4.0


If we want a vertical dataframe, we use `.T`. The `T` stands for *transpose*.

In [25]:
pd.DataFrame([Subject,Marks,CGPA], index = ['Subject','Marks','CGPA']).T

    Subject  Marks  CGPA
0     Maths     45   2.5
1   History     65   3.0
2   Science     78   3.5
3   English     65   2.0
4  Georaphy     80   4.0
5       Art     78   4.0


**Remark:** Assign a name to the data frame and then use `.T` to transpose it.
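
One side effect of the transposed construction is worth knowing: each Series is first stored as a row of mixed values, so the columns of the result typically end up with the generic `object` type. The sketch below compares it with passing a dictionary of Series, which keeps each column's own type. The lower-case names and the shortened data are chosen only for this illustration.

```python
import pandas as pd

subject = pd.Series(["Maths", "History"])
marks = pd.Series([45, 65])
cgpa = pd.Series([2.5, 3.0])

# Rows first, then transpose: the numeric columns become object.
transposed = pd.DataFrame(
    [subject, marks, cgpa], index=["Subject", "Marks", "CGPA"]
).T
print(transposed["Marks"].dtype)  # object

# One Series per column: each column keeps its own dtype.
by_column = pd.DataFrame({"Subject": subject, "Marks": marks, "CGPA": cgpa})
print(by_column["Marks"].dtype)   # int64
print(by_column["CGPA"].dtype)    # float64
```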

**4. To create dataframe from lists**

In [26]:
Subject = ['Maths', 'History', 'Science', 'English', 'Georaphy', 'Art']
Marks = [45, 65, 78, 65, 80, 78]
CGPA = [2.5, 3.0, 3.5, 2.0, 4.0, 4.0]

In [27]:
pd.DataFrame([Subject,Marks,CGPA], index = ['Subject','Marks','CGPA']).T

    Subject  Marks  CGPA
0     Maths     45   2.5
1   History     65   3.0
2   Science     78   3.5
3   English     65   2.0
4  Georaphy     80   4.0
5       Art     78   4.0


**5. To read data from csv file**

In practice, data is often too complicated to be manually entered. Data usually exists in spreadsheets, databases or other kinds of data repositories; so there should be a way to convert data from how it is usually stored into a dataframe.

For example, we can use `csv` files. `csv` stands for "*comma separated values*"; it is a plain-text format in which spreadsheets can be saved, and pandas can read it using the function `read_csv`. This function takes a *file name* (path) as its argument. If the file is readable, its contents are returned directly as a *dataframe*.

In [28]:
data = pd.read_csv("./data/example.csv")
data

   Age  Weight (in kg)  Height (in m)
0   45              60           1.35
1   12              43           1.21
2   54              78           1.50
3   26              65           1.21
4   68              50           1.32
5   21              43           1.52
6   10              32           1.65
7   57              34           1.61
8   75              23           1.24
9   32              21           1.52


In [29]:
type(data)

pandas.core.frame.DataFrame

On checking the type, we see that the file was read as a pandas DataFrame.
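
To try `read_csv` without a file on disk, the text can be wrapped in `io.StringIO`, which `read_csv` accepts in place of a path. This is a minimal sketch; the two sample rows are copied from the table above.

```python
import io

import pandas as pd

csv_text = (
    "Age,Weight (in kg),Height (in m)\n"
    "45,60,1.35\n"
    "12,43,1.21\n"
)

# The first line is used as the column names.
frame = pd.read_csv(io.StringIO(csv_text))

print(frame.shape)          # (2, 3)
print(list(frame.columns))  # ['Age', 'Weight (in kg)', 'Height (in m)']
print(frame["Age"].dtype)   # int64
```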

In [30]:
print(data)

    Age  Weight (in kg)  Height (in m)
0    45              60           1.35
1    12              43           1.21
2    54              78           1.50
3    26              65           1.21
4    68              50           1.32
5    21              43           1.52
6    10              32           1.65
7    57              34           1.61
8    75              23           1.24
9    32              21           1.52
10   23              53           1.50
11   34              65           1.76
12   55              89           1.65
13   23              45           1.75
14   56              76           1.69
15   67              78           1.85
16   26              65           1.21
17   56              74           1.69
18   67              78           1.85
19   26              65           1.21
20   68              50           1.32
21   56              76           1.69
22   67              78           1.85


**6. To print head of the data**

In [31]:
data.head()

   Age  Weight (in kg)  Height (in m)
0   45              60           1.35
1   12              43           1.21
2   54              78           1.50
3   26              65           1.21
4   68              50           1.32


By default, the `.head()` will display **first** five rows. However, we can set the desired number of rows to be displayed.

Say we want to see the first 9 rows, we write the number 9 in the parentheses.

In [32]:
data.head(9)

   Age  Weight (in kg)  Height (in m)
0   45              60           1.35
1   12              43           1.21
2   54              78           1.50
3   26              65           1.21
4   68              50           1.32
5   21              43           1.52
6   10              32           1.65
7   57              34           1.61
8   75              23           1.24


**7. To print tail of the data**

In [33]:
data.tail()

    Age  Weight (in kg)  Height (in m)
18   67              78           1.85
19   26              65           1.21
20   68              50           1.32
21   56              76           1.69
22   67              78           1.85


By default, the `.tail()` will display **last** five rows. However, we can set the desired number of rows to be displayed.

Say we want to see the last 14 rows, we write the number 14 in the parentheses.

In [34]:
data.tail(14)

    Age  Weight (in kg)  Height (in m)
9    32              21           1.52
10   23              53           1.50
11   34              65           1.76
12   55              89           1.65
13   23              45           1.75
14   56              76           1.69
15   67              78           1.85
16   26              65           1.21
17   56              74           1.69
18   67              78           1.85


**8. To obtain the dimension of the data**

In [35]:
data.shape

(23, 3)

$\uparrow\uparrow\uparrow$ (23 rows and 3 columns)

**9. To know the data types of a data frame**

In [36]:
data.dtypes

Age                 int64
Weight (in kg)      int64
Height (in m)     float64
dtype: object

We see the data type of each variable.

**10. To know some information of the data**

In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             23 non-null     int64  
 1   Weight (in kg)  23 non-null     int64  
 2   Height (in m)   23 non-null     float64
dtypes: float64(1), int64(2)
memory usage: 680.0 bytes


The first part of the output, `RangeIndex: 23 entries, 0 to 22`, gives the number of rows in the data: there are 23 rows, numbered from 0 to 22. There are three columns in total: `Data columns (total 3 columns)`.

The line `Age 23 non-null int64` indicates that the column named 'Age' has 23 non-null observations with the data type 'int64'.

Finally, the memory used to store this dataframe is 680 bytes.

**11. To check the data type of column in the data frame**

In [38]:
type(data.Age)

pandas.core.series.Series

In [39]:
type(data["Weight (in kg)"])

pandas.core.series.Series

In [40]:
type(data["Height (in m)"])

pandas.core.series.Series

**Note that every column of the data frame is a pandas Series.**

<a id="manipulatingDF"> </a>
### 4.2  Manipulating DataFrames 
#### Manipulating the Dataframes

### Add new column and rows


> CAUTION:
>1. DataFrame[column] works for any column name, but DataFrame.column only works when the column name is a valid Python variable name.
>2. New columns cannot be created with the ` data.BMI ` syntax.
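
The second caution can be verified on a small, made-up frame. In this sketch, assigning through the attribute syntax only sets a plain Python attribute on the object, while bracket assignment creates a real column. The warning filter is there because pandas may emit a `UserWarning` for this pattern.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"weight": [60, 43], "height": [1.35, 1.21]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # This sets an attribute on the object; no column is created.
    frame.bmi = frame["weight"] / frame["height"] ** 2

created_by_attribute = "bmi" in frame.columns
print(created_by_attribute)     # False

# Bracket syntax creates the column.
frame["bmi"] = frame["weight"] / frame["height"] ** 2
print("bmi" in frame.columns)   # True
```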


**1. Adding a new column to the data set**

In [41]:
data["BMI"] = data["Weight (in kg)"] / data["Height (in m)"]**2

In [42]:
data

   Age  Weight (in kg)  Height (in m)        BMI
0   45              60           1.35  32.921811
1   12              43           1.21  29.369579
2   54              78           1.50  34.666667
3   26              65           1.21  44.395875
4   68              50           1.32  28.696051
5   21              43           1.52  18.611496
6   10              32           1.65  11.753903
7   57              34           1.61  13.116778
8   75              23           1.24  14.958377
9   32              21           1.52   9.089335


In [43]:
data.shape

(23, 4)

$\uparrow\uparrow\uparrow$ Our data set now has 4 columns.

**2. Adding a new row to the data set**

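A new row can be added by assigning to `.loc` with an index label that does not exist yet. This changes the dataframe in place, so we first make a copy with `copy()` and add the row to the copy, leaving the original `data` untouched. We also have to be careful about the index: assigning to a label that already exists would overwrite that row instead of adding a new one.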

In [44]:
data_copy = data.copy()
data_copy.loc[23] = [45, 85, 1.8, 26.3]

In [45]:
data_copy

    Age  Weight (in kg)  Height (in m)        BMI
0  45.0            60.0           1.35  32.921811
1  12.0            43.0           1.21  29.369579
2  54.0            78.0           1.50  34.666667
3  26.0            65.0           1.21  44.395875
4  68.0            50.0           1.32  28.696051
5  21.0            43.0           1.52  18.611496
6  10.0            32.0           1.65  11.753903
7  57.0            34.0           1.61  13.116778
8  75.0            23.0           1.24  14.958377
9  32.0            21.0           1.52   9.089335


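A new row with the index label 23 has been added to the copy. Because the new row contains floating-point values, the `Age` and `Weight (in kg)` columns are now displayed as floats.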
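
Another way to add a row, not used in the original notebook, is `pd.concat`, which returns a new dataframe and leaves the input untouched. A minimal sketch on a made-up two-column frame:

```python
import pandas as pd

frame = pd.DataFrame({"Age": [45, 12], "Weight": [60, 43]})

# The new row is a one-row DataFrame with the same column names.
new_row = pd.DataFrame([{"Age": 30, "Weight": 70}])

# ignore_index=True renumbers the rows 0, 1, 2.
extended = pd.concat([frame, new_row], ignore_index=True)

print(extended.shape)             # (3, 2)
print(extended["Age"].tolist())   # [45, 12, 30]
print(len(frame))                 # 2
```

Since all values here are integers, both columns stay `int64` in the result.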

**3. Indexing a dataframe using `.iloc`**

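`DataFrame.iloc[]` selects rows and columns by integer position (0, 1, 2, and so on), regardless of the index labels. It is useful when the index labels are something other than the numeric sequence 0, 1, 2, 3, ..., n, or when the user doesn't know the index labels. Positions start at 0, so position 2 is the third row.

We shall work on the BMI data set.

#### Select the row at position 2 (the third row)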

In [46]:
data.iloc[2]

Age               54.000000
Weight (in kg)    78.000000
Height (in m)      1.500000
BMI               34.666667
Name: 2, dtype: float64

#### Select the rows at positions 4, 7 and 10

In [47]:
data.iloc[[4,7,10]]

    Age  Weight (in kg)  Height (in m)        BMI
4    68              50           1.32  28.696051
7    57              34           1.61  13.116778
10   23              53           1.50  23.555556


>We use two square brackets since we are passing a list of row numbers to be accessed.

The result of `data.iloc[[4,7,10]]` is itself a dataframe, a subset of the original dataframe.

#### Select the rows at positions 12 to 16

In [48]:
data.iloc[12:17]

    Age  Weight (in kg)  Height (in m)        BMI
12   55              89           1.65  32.690542
13   23              45           1.75  14.693878
14   56              76           1.69  26.609713
15   67              78           1.85  22.790358
16   26              65           1.21  44.395875


#### Select the column at position 1 (the second column)

In [49]:
data.iloc[:, 1]

0     60
1     43
2     78
3     65
4     50
5     43
6     32
7     34
8     23
9     21
10    53
11    65
12    89
13    45
14    76
15    78
16    65
17    74
18    78
19    65
20    50
21    76
22    78
Name: Weight (in kg), dtype: int64

#### Select the last column

In [50]:
data.iloc[:,-1]

0     32.921811
1     29.369579
2     34.666667
3     44.395875
4     28.696051
5     18.611496
6     11.753903
7     13.116778
8     14.958377
9      9.089335
10    23.555556
11    20.983988
12    32.690542
13    14.693878
14    26.609713
15    22.790358
16    44.395875
17    25.909457
18    22.790358
19    44.395875
20    28.696051
21    26.609713
22    22.790358
Name: BMI, dtype: float64

To select the last column we use -1, to select the second last column we use -2

#### Select the first two columns

In [51]:
data.iloc[:,0:2]
# select two columns starting from the 0 column

   Age  Weight (in kg)
0   45              60
1   12              43
2   54              78
3   26              65
4   68              50
5   21              43
6   10              32
7   57              34
8   75              23
9   32              21


#### Select the first two columns and the rows at positions 5 to 10

In [52]:
data.iloc[5:11, 0:2]
# Select rows 5 to 10 (the end position 11 is excluded) in the first two columns

    Age  Weight (in kg)
5    21              43
6    10              32
7    57              34
8    75              23
9    32              21
10   23              53


**4. Indexing a dataframe using `.loc`**

`DataFrame.loc[]` selects by index labels rather than by position. It returns the matching row or dataframe if the label exists in the data frame; asking for a single label that does not exist raises a `KeyError`. <br>
`DataFrame.loc[row_names, column_names]` is used to select rows or columns based on their names.

#### Select 1 to 5 rows and 2nd and 4th columns

In [53]:
data.loc[1:5,["Weight (in kg)","BMI"]]

   Weight (in kg)        BMI
1              43  29.369579
2              78  34.666667
3              65  44.395875
4              50  28.696051
5              43  18.611496


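**Note:** Here the row names are numbers. Unlike `.iloc`, a `.loc` slice includes both endpoints, so `1:5` returns the rows labeled 1 through 5.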
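
The difference between position-based and label-based slicing is easiest to see when the index does not start at 0. The frame below, with labels 100 to 103, is made up for this sketch.

```python
import pandas as pd

frame = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

# .iloc works on positions and excludes the end position.
by_position = frame.iloc[1:3]
print(by_position["value"].tolist())  # [20, 30]

# .loc works on labels and includes the end label.
by_label = frame.loc[101:103]
print(by_label["value"].tolist())     # [20, 30, 40]
```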

**5. Selecting columns by specifying column names**

#### Select the column 'Age'

In [54]:
data.Age

0     45
1     12
2     54
3     26
4     68
5     21
6     10
7     57
8     75
9     32
10    23
11    34
12    55
13    23
14    56
15    67
16    26
17    56
18    67
19    26
20    68
21    56
22    67
Name: Age, dtype: int64

> **Remark:** Using this method we can select only one column.

In [55]:
# OR
data["Age"]

0     45
1     12
2     54
3     26
4     68
5     21
6     10
7     57
8     75
9     32
10    23
11    34
12    55
13    23
14    56
15    67
16    26
17    56
18    67
19    26
20    68
21    56
22    67
Name: Age, dtype: int64

#### Select the column 'Age' and 'BMI'

In [56]:
data[["Age","BMI"]]

   Age        BMI
0   45  32.921811
1   12  29.369579
2   54  34.666667
3   26  44.395875
4   68  28.696051
5   21  18.611496
6   10  11.753903
7   57  13.116778
8   75  14.958377
9   32   9.089335


**6. Sort the data frame on the basis of values in a column**

Each column of a pandas DataFrame is a pandas Series. The `.sort_values()` method of a DataFrame works similarly to that of `pandas.Series`, except that we name the column to sort by.

In [57]:
# print head() of 'data'
data.head()

   Age  Weight (in kg)  Height (in m)        BMI
0   45              60           1.35  32.921811
1   12              43           1.21  29.369579
2   54              78           1.50  34.666667
3   26              65           1.21  44.395875
4   68              50           1.32  28.696051


In [58]:
# sort the data frame on basis of 'Age' values
# by default the values will get sorted in ascending order
data.sort_values('Age')

#Note: 'ascending = False' will sort the data frame in descending order

    Age  Weight (in kg)  Height (in m)        BMI
6    10              32           1.65  11.753903
1    12              43           1.21  29.369579
5    21              43           1.52  18.611496
13   23              45           1.75  14.693878
10   23              53           1.50  23.555556
19   26              65           1.21  44.395875
3    26              65           1.21  44.395875
16   26              65           1.21  44.395875
9    32              21           1.52   9.089335
11   34              65           1.76  20.983988


> **Remark:** Note that each row keeps its original index label after sorting. `data` itself is not modified, because `sort_values` returns a new dataframe.
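
`sort_values` can also sort by several columns, with a separate direction for each. The sketch below uses a made-up four-row frame; `ignore_index=True` assumes pandas 1.0 or later.

```python
import pandas as pd

frame = pd.DataFrame({"Age": [26, 23, 26, 23], "Weight": [65, 45, 60, 53]})

# Sort by Age ascending, then by Weight descending within each Age.
ordered = frame.sort_values(["Age", "Weight"], ascending=[True, False])
print(ordered.index.tolist())     # [3, 1, 0, 2]

# ignore_index=True renumbers the sorted rows from 0.
renumbered = frame.sort_values(
    ["Age", "Weight"], ascending=[True, False], ignore_index=True
)
print(renumbered.index.tolist())  # [0, 1, 2, 3]

# The original frame keeps its order.
print(frame["Age"].tolist())      # [26, 23, 26, 23]
```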

**7. Rank the dataframe**

In [59]:
# rank the data frame 'data' in descending order based on 'BMI'
# method = 'min' gives every group of tied 'BMI' values the lowest rank in that group
data['BMI_ranked'] = data['BMI'].rank(ascending = False, method = 'min')
data

   Age  Weight (in kg)  Height (in m)        BMI  BMI_ranked
0   45              60           1.35  32.921811         5.0
1   12              43           1.21  29.369579         7.0
2   54              78           1.50  34.666667         4.0
3   26              65           1.21  44.395875         1.0
4   68              50           1.32  28.696051         8.0
5   21              43           1.52  18.611496        18.0
6   10              32           1.65  11.753903        22.0
7   57              34           1.61  13.116778        21.0
8   75              23           1.24  14.958377        19.0
9   32              21           1.52   9.089335        23.0


From the above data frame, we can see that 'BMI = 44.395875' is repeating thrice; thus the method = 'min' will assign the minimum rank (=1) to all the three values of BMI. The rank '4' will be assigned to the second largest value of BMI and so on. Thus, there is no rank equal to 2 and 3.

In [60]:
# method = 'dense' gives tied BMI values the same rank and leaves no gaps between ranks
data['BMI_densed_rank'] = data['BMI'].rank(method = 'dense')
data

   Age  Weight (in kg)  Height (in m)        BMI  BMI_ranked  BMI_densed_rank
0   45              60           1.35  32.921811         5.0             15.0
1   12              43           1.21  29.369579         7.0             13.0
2   54              78           1.50  34.666667         4.0             16.0
3   26              65           1.21  44.395875         1.0             17.0
4   68              50           1.32  28.696051         8.0             12.0
5   21              43           1.52  18.611496        18.0              6.0
6   10              32           1.65  11.753903        22.0              2.0
7   57              34           1.61  13.116778        21.0              3.0
8   75              23           1.24  14.958377        19.0              5.0
9   32              21           1.52   9.089335        23.0              1.0


Here, dense method assigns minimum rank (=1) to minimum value (=9.089335) of the BMI. Rank 2 will be assigned to BMI value greater than min=9.089335 and so on. Thus, no rank is skipped in the dense method.
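
The tie-handling methods can be compared side by side on a short series with one value repeated three times. The values are rounded stand-ins chosen for this sketch, and the default method `'average'` is included for contrast.

```python
import pandas as pd

bmi = pd.Series([44.4, 34.7, 44.4, 32.9, 44.4])

# 'min': ties share the lowest rank, the next ranks are skipped.
min_rank = bmi.rank(ascending=False, method="min")
print(min_rank.tolist())    # [1.0, 4.0, 1.0, 5.0, 1.0]

# 'dense': ties share a rank and no rank is skipped.
dense_rank = bmi.rank(ascending=False, method="dense")
print(dense_rank.tolist())  # [1.0, 2.0, 1.0, 3.0, 1.0]

# 'average' (the default): ties get the mean of ranks 1, 2 and 3.
avg_rank = bmi.rank(ascending=False)
print(avg_rank.tolist())    # [2.0, 4.0, 2.0, 5.0, 2.0]
```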

**8. To check for missing values**

We shall import a new dataset.

In [61]:
missing_data = pd.read_csv("./data/example_missingdata.csv")
missing_data

    Age  Weight (in kg)  Height (in m)
0  45.0            60.0           1.35
1  12.0            43.0           1.21
2  54.0            78.0           1.50
3  26.0            65.0           1.21
4  68.0            50.0           1.32
5  21.0             NaN           1.52
6  10.0            32.0           1.65
7  57.0            34.0           1.61
8  75.0            23.0           1.24
9  32.0            21.0           1.52


In [62]:
missing_data.isnull()

     Age  Weight (in kg)  Height (in m)
0  False           False          False
1  False           False          False
2  False           False          False
3  False           False          False
4  False           False          False
5  False            True          False
6  False           False          False
7  False           False          False
8  False           False          False
9  False           False          False


In [63]:
missing_data.isnull().sum()

Age               1
Weight (in kg)    2
Height (in m)     1
dtype: int64

The method `.isnull()` checks whether each value is missing. `sum()` then counts the 'True' values in each column, so the final output gives the number of missing values per column.

Here, we see there are 2 missing values in the 'Weight (in kg)' column and one missing value in each of the other columns.
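
Once the missing values are counted, two common next steps are dropping the incomplete rows or filling the gaps. Neither is part of the original notebook; the sketch below shows both on a made-up frame, filling with the column mean as one possible choice.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Age": [45, np.nan, 54], "Weight": [60, 43, np.nan]})

# Missing values per column, and in the whole frame.
print(frame.isnull().sum().to_dict())   # {'Age': 1, 'Weight': 1}
print(int(frame.isnull().sum().sum()))  # 2

# Option 1: keep only the rows without any missing value.
complete = frame.dropna()
print(complete.shape)                   # (1, 2)

# Option 2: fill each gap with the mean of its column.
filled = frame.fillna(frame.mean())
print(filled["Age"].tolist())           # [45.0, 49.5, 54.0]
```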

<a id="reading_data"> </a>
## 5.  Reading Data from Different Sources
#### Reading Data From Different Sources
Note that the file names are used as examples only. You can import your own files to run the examples below.

**1. Read a `.xlsx` file**

In [64]:
pd.read_excel('./data/example.xlsx')

    Age  Gender    Salary
0  45.0    Male   40000.0
1  12.0    Male       0.0
2  54.0  Female  150000.0
3  26.0    Male   30000.0
4  64.0  Female   15000.0
5  21.0  Female   25600.0


**2. Read a `.zip` file**

In [65]:
import zipfile
with zipfile.ZipFile('./data/data.zip') as z:
    with z.open('example.csv') as f:
        file = pd.read_csv(f)
        print(file.head())

   Age  Weight (in kg)  Height (in m)
0   45              60           1.35
1   12              43           1.21
2   54              78           1.50
3   26              65           1.21
4   68              50           1.32


**3. Read a `.html` file**

In [66]:
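# read_html returns a list of DataFrames, one per table found in the file
# it needs an HTML parser such as lxml to be installed
tables = pd.read_html('Sheet1.html', header=1, index_col=0)
tables[0]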

ImportError: lxml not found, please install it

**4. Read a `.txt` file**

In [None]:
data = pd.read_csv('./data/example.txt', sep="\t")
data.head()

       Country  Birth rate  Life expectancy
0      Vietnam       1.822        74.828244
1      Vanuatu       3.869        70.819488
2        Tonga       3.911        72.150659
3  Timor-Leste       5.578        61.999854
4     Thailand       1.579        73.927659


**5. Read a `.json` file**

In [None]:
pd.read_json('iris.json')

**6. Read a `.xml` file**

In [None]:
import xml.etree.ElementTree as ET 

tree = ET.parse("xml_file.xml")
root = tree.getroot() 

df_col = ["Name", "Gender", "Marks"]
rows = []

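for node in root: 
  name = node.attrib.get("name")
  # find() returns None when the child element is missing, so check its result
  gender_node = node.find("gender")
  marks_node = node.find("marks")
  gender = gender_node.text if gender_node is not None else None
  marks = marks_node.text if marks_node is not None else None
  
  rows.append({"Name": name, "Gender": gender, "Marks": marks})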

xml_df = pd.DataFrame(rows, columns = df_col)
xml_df
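
To try the same idea without an `xml_file.xml` on disk, the XML can be parsed from a string. The document below is a hypothetical layout that matches what the loop above expects (a `name` attribute plus `gender` and `marks` child elements); the real file may look different. `findtext` is used as a shortcut that returns `None` when a child element is missing.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample; the second student has no <marks> element.
xml_text = """
<students>
  <student name="Asha"><gender>F</gender><marks>82</marks></student>
  <student name="Ravi"><gender>M</gender></student>
</students>
"""

root = ET.fromstring(xml_text)

rows = []
for node in root:
    rows.append({
        "Name": node.attrib.get("name"),
        "Gender": node.findtext("gender"),  # None if the child is absent
        "Marks": node.findtext("marks"),
    })

xml_df = pd.DataFrame(rows, columns=["Name", "Gender", "Marks"])

# Element text is always a string, so convert the marks to numbers.
xml_df["Marks"] = pd.to_numeric(xml_df["Marks"])
print(xml_df.shape)  # (2, 3)
```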