# Pandas in Python
<hr>

## Practice Lab: Selecting data in a Dataframe
<hr>
### Objectives

After completing this lab you will be able to:

*   Use the Pandas library to create a DataFrame and a Series
*   Locate data in the DataFrame using <code>loc()</code> and <code>iloc()</code> functions
*   Use slicing


### **Exercise 1: Pandas: DataFrame and Series**
<hr>

**Pandas** is a popular data analysis library built on top of the Python programming language. It provides two main data structures for manipulating data:
 
* DataFrame
* Series

A **DataFrame** is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

* A Pandas DataFrame is often created by loading a dataset from existing storage.
* The storage can be a SQL database, a CSV file, an Excel file, and so on.
* A DataFrame can also be created from lists, from a dictionary, or from a list of dictionaries.
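
As a minimal sketch of the last point, the snippet below builds the same small table from a dictionary of lists, from a list of dictionaries, and from a list of lists. The variable names and the two sample rows are assumptions chosen for illustration only.

```python
import pandas as pd

# Illustrative sample data (not the lab's full dataset)
names = ['Rose', 'John']
ids = [1, 2]

# From a dictionary of lists: each key becomes a column
from_dict = pd.DataFrame({'Name': names, 'ID': ids})

# From a list of dictionaries: each dictionary becomes a row
from_records = pd.DataFrame([
    {'Name': 'Rose', 'ID': 1},
    {'Name': 'John', 'ID': 2},
])

# From a list of lists: column labels are passed separately
from_lists = pd.DataFrame([['Rose', 1], ['John', 2]], columns=['Name', 'ID'])

# Compare the three results
print(from_dict.equals(from_records))
print(from_dict.equals(from_lists))
```

Which form is most convenient usually depends on how the raw data arrives: column by column, or record by record.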

**Series** represents a one-dimensional array of indexed data.
It has two main components:
1. An array of actual data.
2. An associated array of indexes or data labels.

The index is used to access individual data values. You can also get a column of a dataframe as a **Series**. You can think of a Pandas series as a 1-D dataframe. 
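
The two components of a Series can be inspected directly. The sketch below assumes a small Series named `salaries` with made-up labels; it is only meant to show where the data and the index live.

```python
import pandas as pd

# Illustrative values and labels (assumed for this sketch)
salaries = pd.Series([100000, 80000, 50000], index=['Rose', 'John', 'Jane'])

# The two components described above
print(salaries.values)   # the underlying array of data
print(salaries.index)    # the associated labels

# A label from the index retrieves the matching value
print(salaries['John'])
```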


##### To install Pandas
1. In a conda environment: <code>conda install pandas</code>
2. With pip: <code>pip install pandas</code>

In [1]:
# let us import the Pandas Library
import pandas as pd

Let us consider a dictionary named <code>dict</code>, with keys and values as shown below.

We then create a dataframe from the dictionary by passing it to the function <code>pd.DataFrame()</code>.


In [2]:
# Defining a dictionary named "dict" (this name shadows Python's built-in dict type, so prefer another name in real code)
dict = {
    'Name': ['Rose', 'John', 'Jane', 'Mary', 'Alex', 'David', 'Lina', 'Sophia', 'Mark'],
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Department': [
        'Architect Group', 'Software Group', 'Design Team', 'Infrastructure',
        'HR Department', 'Finance', 'Support', 'Research', 'Marketing'
    ],
    'Salary': [100000, 80000, 50000, 60000, 75000, 90000, 55000, 95000, 70000],
    'Age': [32, 28, 25, 35, 30, 40, 29, 34, 27],
    'Experience (Years)': [8, 5, 3, 10, 6, 12, 4, 9, 2],
    'City': ['New York', 'London', 'Paris', 'Berlin', 'Tokyo', 'Sydney', 'Toronto', 'Dubai', 'Rome'],
    'Joining Year': [2015, 2018, 2020, 2012, 2017, 2010, 2019, 2014, 2021],
    'Performance Score': [9, 8, 7, 8, 9, 9, 7, 10, 8]
}

# Casting dictionary to DataFrame
df = pd.DataFrame(dict)

# Displaying the result
df

Unnamed: 0,Name,ID,Department,Salary,Age,Experience (Years),City,Joining Year,Performance Score
0,Rose,1,Architect Group,100000,32,8,New York,2015,9
1,John,2,Software Group,80000,28,5,London,2018,8
2,Jane,3,Design Team,50000,25,3,Paris,2020,7
3,Mary,4,Infrastructure,60000,35,10,Berlin,2012,8
4,Alex,5,HR Department,75000,30,6,Tokyo,2017,9
5,David,6,Finance,90000,40,12,Sydney,2010,9
6,Lina,7,Support,55000,29,4,Toronto,2019,7
7,Sophia,8,Research,95000,34,9,Dubai,2014,10
8,Mark,9,Marketing,70000,27,2,Rome,2021,8


We can see a direct correspondence between the dictionary and the table: the keys become the column labels, and each list of values becomes the contents of that column. The rows are labelled with a default integer index starting at 0.
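
This correspondence can be checked programmatically. The following sketch uses a two-column dictionary named `data`, an assumed stand-in for the larger one above.

```python
import pandas as pd

# Small illustrative dictionary (a subset of the lab's columns)
data = {'Name': ['Rose', 'John'], 'ID': [1, 2]}
frame = pd.DataFrame(data)

# Keys become column labels, in insertion order
print(list(frame.columns) == list(data.keys()))

# Each list becomes the contents of its column
print(frame['Name'].tolist() == data['Name'])

# Rows get a default integer index starting at 0
print(list(frame.index))
```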


#### **Column Selection:**

To select a column in a Pandas DataFrame, we can access it by its column name. Passing the name inside double brackets returns the column as a DataFrame.

Let's retrieve the data present in the <code>ID</code> column.


In [3]:
x = df[['ID']]
x

Unnamed: 0,ID
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9


Let's use the <code>type()</code> function and check the type of the variable.


In [4]:
#check the type of x
type(x)

pandas.core.frame.DataFrame

The output shows us that the type of the variable is a DataFrame object.


#### Access to multiple columns

Let us retrieve the data for <code>Department</code>, <code>Salary</code> and <code>ID</code> columns


In [5]:
#Retrieving the Department, Salary and ID columns and assigning them to a variable y

y = df[['Department','Salary','ID']]
y

Unnamed: 0,Department,Salary,ID
0,Architect Group,100000,1
1,Software Group,80000,2
2,Design Team,50000,3
3,Infrastructure,60000,4
4,HR Department,75000,5
5,Finance,90000,6
6,Support,55000,7
7,Research,95000,8
8,Marketing,70000,9


### Practice


##### Problem 1: Create a dataframe to display the result as below:



<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%204/images/Student_data.png" width="300" alt="Student Data">
</center>


In [6]:
student_dict = {
    'Student':['David','Samuel','Tery','Evan'],
    'Age':[27,24,22,32],
    'Country':['UK','Canada','China','USA'],
    'Course':['Python','Data Structures','Machine Learning','Web Development'],
    'Marks':[85,72,89,76]}

df1 = pd.DataFrame(student_dict)
df1

Unnamed: 0,Student,Age,Country,Course,Marks
0,David,27,UK,Python,85
1,Samuel,24,Canada,Data Structures,72
2,Tery,22,China,Machine Learning,89
3,Evan,32,USA,Web Development,76


##### Problem 2: Retrieve the Marks column and assign it to a variable b


In [7]:
b = df1[['Marks']]
b

Unnamed: 0,Marks
0,85
1,72
2,89
3,76


##### Problem 3: Retrieve the Country and Course columns and assign it to a variable c


In [8]:
c = df1[['Country', 'Course']]
c

Unnamed: 0,Country,Course
0,UK,Python
1,Canada,Data Structures
2,China,Machine Learning
3,USA,Web Development


In [9]:
# Get the Student column as a series Object

x = df1['Student']
x

0     David
1    Samuel
2      Tery
3      Evan
Name: Student, dtype: object

In [10]:
#check the type of x
type(x)

pandas.core.series.Series
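
The difference between single and double brackets is easy to miss, so here is a minimal side-by-side sketch. The frame `students` is an assumed two-row stand-in for the practice data.

```python
import pandas as pd

# Illustrative data mirroring two columns of the practice DataFrame
students = pd.DataFrame({'Student': ['David', 'Samuel'], 'Marks': [85, 72]})

as_series = students['Marks']      # single brackets -> Series
as_frame = students[['Marks']]     # double brackets -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series is one-dimensional, while the DataFrame keeps a column axis even when it holds a single column.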

### **Exercise 2: <code>loc()</code> and <code>iloc()</code> functions**
<hr>
<code>loc()</code> is a label-based selection method, which means that we pass the label (name) of the row or column that we want to select. It is used with square brackets rather than parentheses, as the syntax below shows. When a range (slice) of labels is passed, the last element of the range is included.

Simple syntax for your understanding: 

 - loc[row_label, column_label]

<code>iloc()</code> is an integer-position-based selection method, which means that we pass an integer position to select a specific row or column. It is also used with square brackets. When a range (slice) is passed, the last element of the range is not included.

Simple syntax for your understanding: 
   
 - iloc[row_index, column_index]
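
To make the inclusive and exclusive behaviour concrete, the sketch below slices a small frame both ways. The frame `scores` and its labels are assumptions for illustration.

```python
import pandas as pd

# Illustrative frame with string row labels (assumed for this sketch)
scores = pd.DataFrame(
    {'Math': [90, 80, 70, 60], 'Art': [65, 75, 85, 95]},
    index=['a', 'b', 'c', 'd'],
)

by_label = scores.loc['a':'c', 'Math']   # label slice: 'c' is included
by_position = scores.iloc[0:2, 0]        # position slice: position 2 is excluded

print(len(by_label), len(by_position))
```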



In [11]:
df

Unnamed: 0,Name,ID,Department,Salary,Age,Experience (Years),City,Joining Year,Performance Score
0,Rose,1,Architect Group,100000,32,8,New York,2015,9
1,John,2,Software Group,80000,28,5,London,2018,8
2,Jane,3,Design Team,50000,25,3,Paris,2020,7
3,Mary,4,Infrastructure,60000,35,10,Berlin,2012,8
4,Alex,5,HR Department,75000,30,6,Tokyo,2017,9
5,David,6,Finance,90000,40,12,Sydney,2010,9
6,Lina,7,Support,55000,29,4,Toronto,2019,7
7,Sophia,8,Research,95000,34,9,Dubai,2014,10
8,Mark,9,Marketing,70000,27,2,Rome,2021,8


In [12]:
# Access the value on the first row and first column

df.iloc[0,0]

'Rose'

In [13]:
# Access the value on the first row and the third column

df.iloc[0,2]

'Architect Group'

In [14]:
# Access the column using the name
# Because we are selecting the column by its label (name), we use loc(); passing a column name to iloc() would raise an error

df.loc[0, "Salary"]

np.int64(100000)
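
The `np.int64(...)` wrapper in the output appears because pandas stores integer columns in NumPy arrays, so a single selected element comes back as a NumPy scalar (the exact display depends on the installed NumPy version). A minimal sketch, assuming a small frame named `pay`, of how to check this and convert to a built-in `int`:

```python
import numpy as np
import pandas as pd

# Illustrative one-column frame (values assumed for this sketch)
pay = pd.DataFrame({'Salary': [100000, 80000]})

value = pay.loc[0, 'Salary']
print(isinstance(value, np.integer))   # checks for a NumPy integer scalar

# Convert to a built-in int when plain Python types are needed
plain = int(value)
print(type(plain).__name__, plain)
```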

Let us create a new dataframe called 'df2' from 'df' and set its "Name" column as the index using the method set_index().

This replaces the default integer row labels with the values of the 'Name' column. We can then look up any piece of employee information by passing the employee's name together with a column name.

In other words, we move from position-based access to name-based access:

 - From: df.iloc[row_index, column_index]
 - To: df.loc[employee_name, column_name]


In [15]:
df2 = df
df2 = df2.set_index("Name")
df2

Name,ID,Department,Salary,Age,Experience (Years),City,Joining Year,Performance Score
Rose,1,Architect Group,100000,32,8,New York,2015,9
John,2,Software Group,80000,28,5,London,2018,8
Jane,3,Design Team,50000,25,3,Paris,2020,7
Mary,4,Infrastructure,60000,35,10,Berlin,2012,8
Alex,5,HR Department,75000,30,6,Tokyo,2017,9
David,6,Finance,90000,40,12,Sydney,2010,9
Lina,7,Support,55000,29,4,Toronto,2019,7
Sophia,8,Research,95000,34,9,Dubai,2014,10
Mark,9,Marketing,70000,27,2,Rome,2021,8


In [16]:
#To display the first 5 rows of new dataframe

df2.head()

Name,ID,Department,Salary,Age,Experience (Years),City,Joining Year,Performance Score
Rose,1,Architect Group,100000,32,8,New York,2015,9
John,2,Software Group,80000,28,5,London,2018,8
Jane,3,Design Team,50000,25,3,Paris,2020,7
Mary,4,Infrastructure,60000,35,10,Berlin,2012,8
Alex,5,HR Department,75000,30,6,Tokyo,2017,9


In [17]:
#Now, let us access the column using the name

df2.loc['Jane', 'Salary']

np.int64(50000)
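
A short sketch of what `set_index()` changes, using an assumed three-row frame named `staff`: the chosen column leaves the regular columns, label and position lookups reach the same cell, and `reset_index()` undoes the change.

```python
import pandas as pd

# Illustrative subset of the employee table
staff = pd.DataFrame({
    'Name': ['Rose', 'John', 'Jane'],
    'Salary': [100000, 80000, 50000],
})

by_name = staff.set_index('Name')   # returns a new DataFrame; staff is unchanged

# After set_index, 'Name' is the index and no longer a regular column
print(list(by_name.columns))

# Label lookup and position lookup reach the same cell
print(by_name.loc['Jane', 'Salary'] == by_name.iloc[2, 0])

# reset_index() moves the labels back into a column
print(list(by_name.reset_index().columns))
```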

### Practice

Use the <code>loc()</code> function to get the Department of Jane in the newly created dataframe df2.


In [18]:
df2.loc['Jane','Department']

'Design Team'

Use the <code>iloc()</code> function to get the Salary of Mary in the newly created dataframe df2.


In [19]:
df2.iloc[3, 2]  # Mary is the 4th row (position 3); Salary is the 3rd column (position 2)

np.int64(60000)

### **Exercise 3: Slicing**
<hr>
Slicing uses the [] operator to select a set of rows and/or columns from a DataFrame.

To slice out a set of rows, we use this syntax: data[start:stop].

Here, start is the index at which the selection begins, and stop is the index one step BEYOND the last row you want to select. We can slice using either integer positions or labels (row labels and column names).

> NOTE: When slicing in pandas, the start bound is included in the output.

So if you want to select rows 0, 1, and 2 your code would look like this: df.iloc[0:3].

It means you are telling Python to start at index 0 and select rows 0, 1, 2 up to but not including 3.

> NOTE: Labels must be found in the DataFrame or we will get a KeyError.
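
As a hedged illustration of that note, the sketch below looks up one label that exists and one that does not. The helper `lookup_salary` and the frame `staff` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Illustrative frame indexed by name (assumed for this sketch)
staff = pd.DataFrame({'Salary': [100000, 80000]}, index=['Rose', 'John'])

def lookup_salary(frame, name):
    """Return the salary for name, or None when the label is absent."""
    try:
        return frame.loc[name, 'Salary']
    except KeyError:
        return None

print(lookup_salary(staff, 'Rose'))
print(lookup_salary(staff, 'Zoe'))   # 'Zoe' is not a row label
```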

Indexing by labels (i.e. using <code>loc()</code>) differs from indexing by integers (i.e. using <code>iloc()</code>). With <code>loc()</code>, both the start bound and the stop bound are inclusive. When using <code>loc()</code>, integers can be used, but the integers refer to the index label and not the position.

For example, with the default integer index, using <code>loc()</code> to select rows 1:4 returns four rows (labels 1, 2, 3 and 4), whereas using <code>iloc()</code> to select rows 1:4 returns three rows (positions 1, 2 and 3).
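
The difference can be seen on a small frame. This is a minimal sketch with assumed data; the second frame, `shifted`, uses integer labels that do not start at 0 to show that `loc` works on labels rather than positions.

```python
import pandas as pd

# Illustrative frame with the default integer index 0..5
frame = pd.DataFrame({'value': [10, 20, 30, 40, 50, 60]})

label_slice = frame.loc[1:4]      # labels 1, 2, 3 and 4
position_slice = frame.iloc[1:4]  # positions 1, 2 and 3

print(label_slice['value'].tolist())
print(position_slice['value'].tolist())

# With integer labels that do not start at 0, labels and positions diverge
shifted = pd.DataFrame({'value': [10, 20, 30, 40, 50, 60]},
                       index=[10, 11, 12, 13, 14, 15])
print(shifted.loc[11:12, 'value'].tolist())   # selected by label
print(shifted.iloc[1:3]['value'].tolist())    # selected by position
```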

<h4 id="data">We can also select a specific data value using a row and column location within the DataFrame and iloc indexing.</h4>


In [20]:
# let us do the slicing using old dataframe df

df.iloc[0:2, 0:3]

Unnamed: 0,Name,ID,Department
0,Rose,1,Architect Group
1,John,2,Software Group


In [21]:
#let us do the slicing using loc() function on old dataframe df where index column is having labels as 0,1,2

df.loc[0:2,'ID':'Department']

Unnamed: 0,ID,Department
0,1,Architect Group
1,2,Software Group
2,3,Design Team


In [22]:
#let us do the slicing using loc() function on new dataframe df2 where index column is Name having labels: Rose, John and Jane

df2.loc['Rose':'Jane', 'ID':'Department']

Name,ID,Department
Rose,1,Architect Group
John,2,Software Group
Jane,3,Design Team


### Practice

Using the <code>loc()</code> function, slice the old dataframe df to retrieve the Name, ID and Department columns for the rows with index labels 2 and 3.


In [23]:
df.loc[2:3, 'Name':'Department']

Unnamed: 0,Name,ID,Department
2,Jane,3,Design Team
3,Mary,4,Infrastructure


## Practice Lab: Loading data with Pandas
<hr>

### Objectives

After completing this lab you will be able to:

*   Use Pandas to access and view data


<h3>Table of Contents</h3>
<div style="margin-top: 20px">
    <ul>
        <li><a href="#About-the-Dataset">About the Dataset</a></li>
        <li><a href="#Introduction-of-Pandas">Introduction of <code>Pandas</code></a></li>
        <li><a href="#Viewing-Data-and-Accessing-Data">Viewing Data and Accessing Data</a></li>
        <li><a href="#Practice-on-DataFrame">Practice on DataFrame</a></li>
    </ul>

</div>


## About the Dataset


The table has one row for each order and several columns.

<ul>
    <li><b>OrderID</b>: A unique identifier for each order</li>
    <li><b>Product</b>: The name of the product purchased</li>
    <li><b>Category</b>: The category to which the product belongs (e.g., Electronics, Furniture, Stationery)</li>
    <li><b>Quantity</b>: The number of units purchased for that product</li>
    <li><b>Price</b>: The price per unit of the product</li>
    <li><b>Total</b>: The total cost for the product (calculated as Quantity × Price)</li>
    <li><b>OrderDate</b>: The date when the order was placed</li>
    <li><b>CustomerCity</b>: The city where the customer resides</li>
    
</ul>

You can see the dataset here:

<font size="1">
<table style="font-size:medium; border:1px solid black; border-collapse:collapse;">
  <tr>
    <th>OrderID</th>
    <th>Product</th>
    <th>Category</th>
    <th>Quantity</th>
    <th>Price</th>
    <th>Total</th>
    <th>OrderDate</th>
    <th>CustomerCity</th>
  </tr>
  <tr>
    <td>1</td>
    <td>Laptop</td>
    <td>Electronics</td>
    <td>2</td>
    <td>800</td>
    <td>1600</td>
    <td>2022-01-10</td>
    <td>New York</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Smartphone</td>
    <td>Electronics</td>
    <td>3</td>
    <td>600</td>
    <td>1800</td>
    <td>2022-02-15</td>
    <td>Los Angeles</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Desk Chair</td>
    <td>Furniture</td>
    <td>5</td>
    <td>150</td>
    <td>750</td>
    <td>2022-03-12</td>
    <td>Chicago</td>
  </tr>
  <tr>
    <td>4</td>
    <td>Notebook</td>
    <td>Stationery</td>
    <td>10</td>
    <td>2</td>
    <td>20</td>
    <td>2022-04-05</td>
    <td>Houston</td>
  </tr>
  <tr>
    <td>5</td>
    <td>Monitor</td>
    <td>Electronics</td>
    <td>1</td>
    <td>300</td>
    <td>300</td>
    <td>2022-05-21</td>
    <td>Miami</td>
  </tr>
</table>
</font>


In [24]:
# Import required library

import pandas as pd

#### **Reading csv file**
After the import command, we have access to a large number of pre-built classes and functions. One way pandas lets you work with data is through a dataframe. Let's go through the process of going from a comma-separated values (<b>.csv</b>) file to a dataframe. The variable <code>csv_path</code> stores the path of the <b>.csv</b> file, which is passed as an argument to the <code>read_csv</code> function. The result is stored in the object <code>df3</code>; <code>df</code> is a common short form used in the names of variables that refer to a Pandas dataframe.

In [25]:
# Read data from CSV file

csv_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/LXjSAttmoxJfEG6il1Bqfw/Product-sales.csv'
df3 = pd.read_csv(csv_path)

We can use the method <code>head()</code> to examine the first five rows of a dataframe:


In [26]:
# Print first five rows of the dataframe

df3.head()

Unnamed: 0,OrderID,Product,Category,Quantity,Price,Total,OrderDate,CustomerCity
0,1,Laptop,Electronics,2,800,1600,2022-01-10,New York
1,2,Smartphone,Electronics,3,600,1800,2022-02-15,Los Angeles
2,3,Desk Chair,Furniture,5,150,750,2022-03-12,Chicago
3,4,Notebook,Stationery,10,2,20,2022-04-05,Houston
4,5,Monitor,Electronics,1,300,300,2022-05-21,Miami
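
<code>read_csv</code> accepts file-like objects as well as paths and URLs. The sketch below reads two rows copied from the dataset preview out of an in-memory string, an assumption made so that it runs without network access; the name `orders` is also illustrative.

```python
import io
import pandas as pd

# Two rows copied from the dataset preview, held in memory instead of a file
csv_text = """OrderID,Product,Category,Quantity,Price,Total,OrderDate,CustomerCity
1,Laptop,Electronics,2,800,1600,2022-01-10,New York
2,Smartphone,Electronics,3,600,1800,2022-02-15,Los Angeles
"""

orders = pd.read_csv(io.StringIO(csv_text))

print(orders.shape)            # (rows, columns)
print(orders.head(1))          # head(n) returns the first n rows
print(orders.dtypes['Price'])  # numeric columns are parsed as numbers
```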


#### **Reading xlsx file**
We pass the path of the Excel file to the function <code>read_excel</code>. The result is a dataframe, as before:


In [27]:
# Loading excel file data

xlsx_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n9LOuKI9SlUa1b5zkaCMeg/Product-sales.xlsx'

In [28]:
# Loading excel file into dataframe
# If an Excel-related error occurs, install 'openpyxl' in the environment in which Python is running
# Command -> conda install openpyxl  OR   pip install openpyxl

df4 = pd.read_excel(xlsx_path)
df4.head()

Unnamed: 0,OrderID,Product,Category,Quantity,Price,Total,OrderDate,CustomerCity
0,1,Laptop,Electronics,2,800,1600,2022-01-10,New York
1,2,Smartphone,Electronics,3,600,1800,2022-02-15,Los Angeles
2,3,Desk Chair,Furniture,5,150,750,2022-03-12,Chicago
3,4,Notebook,Stationery,10,2,20,2022-04-05,Houston
4,5,Monitor,Electronics,1,300,300,2022-05-21,Miami


We can access the column <b>Quantity</b> and assign it to a new dataframe <b>x</b>:


In [29]:
# Access the column Quantity

x = df4[['Quantity']]
x

Unnamed: 0,Quantity
0,2
1,3
2,5
3,10
4,1


The process is shown in the figure:


### Viewing Data and Accessing Data


You can also get a column as a series. You can think of a Pandas series as a 1-D dataframe. Just use a single pair of brackets:


In [30]:
# Get the column as a series

x = df4['Product']
x

0        Laptop
1    Smartphone
2    Desk Chair
3      Notebook
4       Monitor
Name: Product, dtype: object

You can also get a column as a dataframe by using double brackets. For example, we can assign the column <b>Quantity</b> to <code>x</code> and check its type:


In [31]:
# Get the column as a dataframe

x = df4[['Quantity']]
type(x)

pandas.core.frame.DataFrame

You can do the same thing for multiple columns; we just put the dataframe name, in this case, <code>df4</code>, followed by the list of column headers enclosed in double brackets. The result is a new dataframe comprising the specified columns:


In [32]:
# Access to multiple columns

y = df4[['Product','Category', 'Quantity']]
y

Unnamed: 0,Product,Category,Quantity
0,Laptop,Electronics,2
1,Smartphone,Electronics,3
2,Desk Chair,Furniture,5
3,Notebook,Stationery,10
4,Monitor,Electronics,1


## Indexing

One way to access individual elements is the <code>iloc</code> indexer, where you can access the 1st row and the 1st column as follows:


In [33]:
# Access the value on the first row and the first column

df4.iloc[0, 0]

np.int64(1)

You can access the 2nd row and the 1st column as follows:


In [34]:
# Access the value on the second row and the first column

df4.iloc[1,0]

np.int64(2)

You can access the 1st row and the 3rd column as follows:


In [35]:
# Access the value on the first row and the third column

df4.iloc[0,2]

'Electronics'

In [36]:
# Access the value on the second row and the third column
df4.iloc[1,2]

'Electronics'

You can also access a value using the column name. The following examples use <code>loc</code> with the row label and the column name:


In [37]:
# Access the column using the name

df4.loc[0, 'Product']

'Laptop'

In [38]:
# Access the column using the name

df4.loc[1, 'Product']

'Smartphone'

In [39]:
# Access the column using the name

df4.loc[1, 'CustomerCity']

'Los Angeles'

In [40]:
# Access the column using the name

df4.loc[1, 'Total']

np.int64(1800)

## Slicing the dataframe

You can perform slicing using both the index and the name of the column:


In [41]:
# Slicing the dataframe

df4.iloc[0:2, 0:3]

Unnamed: 0,OrderID,Product,Category
0,1,Laptop,Electronics
1,2,Smartphone,Electronics


In [42]:
# Slicing the dataframe using name

df4.loc[0:2, 'OrderID':'Category']

Unnamed: 0,OrderID,Product,Category
0,1,Laptop,Electronics
1,2,Smartphone,Electronics
2,3,Desk Chair,Furniture


## Practice on DataFrame


Use a variable <code>q</code> to store the column <b>Price</b> as a dataframe


In [43]:
q = df4[['Price']]
q

Unnamed: 0,Price
0,800
1,600
2,150
3,2
4,300


Assign the variable <code>q</code> to the dataframe that is made up of the column <b>Product</b> and <b>Category</b>:


In [44]:
q = df4[['Product', 'Category']]
q

Unnamed: 0,Product,Category
0,Laptop,Electronics
1,Smartphone,Electronics
2,Desk Chair,Furniture
3,Notebook,Stationery
4,Monitor,Electronics


Access the 2nd row and the 3rd column of <code>df4</code>:


In [45]:
df4.iloc[1,2]

'Electronics'

Use the following list to replace the index of the dataframe <code>df4</code> with characters and assign the result to <code>df4_new</code>. Find the element corresponding to the row index <code>a</code> and column <code>'CustomerCity'</code>. Then select the rows <code>a</code> through <code>d</code> for the column <code>'CustomerCity'</code>.


In [46]:
new_index=['a','b','c','d','e']

df4_new = df4.copy()  # copy so that relabelling df4_new leaves df4's index unchanged
df4_new.index = new_index
df4_new.loc['a', 'CustomerCity']
df4_new.loc['a':'d', "CustomerCity"]

a       New York
b    Los Angeles
c        Chicago
d        Houston
Name: CustomerCity, dtype: object
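
When relabelling an index, it matters whether the new variable is an independent copy or just a second name for the same DataFrame. The sketch below uses assumed names (`original`, `alias`, `independent`) and a two-row frame to show the difference.

```python
import pandas as pd

# Illustrative frame (values assumed for this sketch)
original = pd.DataFrame({'CustomerCity': ['New York', 'Chicago']})

alias = original               # a second name for the same object
independent = original.copy()  # a separate DataFrame with the same contents

independent.index = ['a', 'b']
print(list(original.index))    # relabelling the copy does not touch original

alias.index = ['x', 'y']
print(list(original.index))    # relabelling the alias also relabels original
```

Using <code>copy()</code> is the safer choice whenever the original dataframe is needed again with its old index.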