# 10 minutes to Pandas
[Pandas](https://pandas.pydata.org/docs/index.html) is one of the most popular tools for data analysis in Python. This open-source library is the backbone of many data projects and is used for data cleaning and data manipulation.
With Pandas, you gain greater control over complex data sets. It’s an essential tool in the data analysis tool belt. If you’re not using Pandas, you’re not making the most of your data.

In this walkthrough, we'll look at some of the most commonly used Pandas operations and commands:

* [Create a DataFrame](#Create-a-DataFrame)
* [Import Dataset](#Import-Dataset)
* [Row/Column Selection, Addition, and Deletion](#Row/Column-Selection,-Addition,-and-Deletion)
* [Slice a DataFrame](#Slice-a-DataFrame)
* [Group data](#Group-data)

---
First, let's import Pandas and check which version is installed.

In [1]:
import pandas as pd
pd.__version__

'1.4.1'

### Create a DataFrame
A Pandas DataFrame can be created from various inputs, such as:
1. [Lists](#1.-Lists)
2. [Dict](#2.-Dict)
3. [Series](#3.-Series)
4. [Numpy Arrays](#4.Numpy-Arrays)
5. Another DataFrame

#### 1. Lists

In [2]:
data = [1,2,3,4,5]
df = pd.DataFrame(data)
df

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5


In [3]:
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
df

Unnamed: 0,Name,Age
0,Alex,10
1,Bob,12
2,Clarke,13


#### 2. Dict
The dictionary keys are by default taken as column names.

In [4]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Tom,28
1,Jack,34
2,Steve,29
3,Ricky,42
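If the dictionary keys should become row labels rather than column names, `pd.DataFrame.from_dict` with `orient='index'` is one option. The sketch below is only an illustration; the row labels `row1` and `row2` are hypothetical and not part of the original data.

```python
import pandas as pd

# Outer keys ('row1', 'row2') are hypothetical row labels;
# inner keys become the column names.
data = {
    'row1': {'Name': 'Tom', 'Age': 28},
    'row2': {'Name': 'Jack', 'Age': 34},
}

# orient='index' treats each outer key as a row instead of a column
df = pd.DataFrame.from_dict(data, orient='index')
print(df)
```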


In [5]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
df
# Note: missing values are filled with NaN (Not a Number).

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0
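Once a DataFrame contains NaN values, the usual next step is to find them and decide what to do with them. A minimal sketch, reusing the same list of dicts; filling with `0` is an arbitrary choice made for illustration only.

```python
import pandas as pd

records = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(records)

# Count the missing values in each column
missing_per_column = df.isna().sum()

# Replace missing values with a placeholder (0 is an arbitrary choice)
filled = df.fillna(0)

# Or drop every row that contains at least one missing value
complete_rows = df.dropna()

print(missing_per_column)
print(filled)
print(complete_rows)
```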


#### 3. Series

In [6]:
data = {
    'a' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'b' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
}

df = pd.DataFrame(data)
df

Unnamed: 0,a,b
a,1.0,1
b,2.0,2
c,3.0,3
d,,4
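The output above shows two things worth spelling out: the Series are aligned on the union of their index labels, and a column that receives a NaN is upcast to float. The sketch below checks both; the column names `short` and `longer` are chosen only for this illustration.

```python
import pandas as pd

short = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
longer = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

df = pd.DataFrame({'short': short, 'longer': longer})

# The row index is the union of both Series indexes
print(list(df.index))

# 'short' has no value for label 'd', so it holds a NaN and becomes float;
# 'longer' has a value for every label and keeps its integer dtype
print(df.dtypes)
```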


#### 4.Numpy Arrays

In [7]:
import numpy as np

# Create a Numpy array
array = np.array([[1, 1, 1], [2, 4, 8], [3, 9, 27], 
                  [4, 16, 64], [5, 25, 125], [6, 36, 216], 
                  [7, 49, 343]])
  
# Create index
index_values = ['first', 'second', 'third',
                'fourth', 'fifth', 'sixth', 'seventh']
   
# Create column names
column_values = ['number', 'squares', 'cubes']
  
# Create the dataframe
df = pd.DataFrame(data = array, 
                  index = index_values, 
                  columns = column_values)
df

Unnamed: 0,number,squares,cubes
first,1,1,1
second,2,4,8
third,3,9,27
fourth,4,16,64
fifth,5,25,125
sixth,6,36,216
seventh,7,49,343
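The fifth input type listed earlier, another DataFrame, has no example of its own. A minimal sketch, assuming you want either an independent copy or a subset of columns; the variable names are illustrative only.

```python
import pandas as pd

source = pd.DataFrame({'Name': ['Tom', 'Jack'], 'Age': [28, 34]})

# copy=True makes the new DataFrame independent of the source
clone = pd.DataFrame(source, copy=True)
clone.loc[0, 'Name'] = 'Thomas'

# Passing columns= keeps only the listed columns
names_only = pd.DataFrame(source, columns=['Name'])

print(source)      # unchanged by the edit to clone
print(clone)
print(names_only)
```

`source.copy()` gives the same independent copy and is the more common spelling.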


### Import Dataset

In [8]:
# Import a CSV file into dataframe
df = pd.read_csv("data/employees_info.csv")
df

Unnamed: 0,name,gender,job title,department,phone,email,residence,address,time zone
0,Maximilianus Kertess,M,Biostatistician II,Research and Development,752-802-7785,mkertess0@imgur.com,Canada,18411 Sugar Point,America/Edmonton
1,Hayley Robshaw,F,Administrative Assistant I,Marketing,540-813-4217,hrobshaw1@weibo.com,China,1281 Namekagon Lane,Asia/Chongqing
2,Margaret Anetts,F,Food Chemist,Services,593-431-7020,manetts2@skyrock.com,Belgium,7404 Dakota Alley,Europe/Brussels
3,West Kittoe,M,Nuclear Power Engineer,Services,762-464-6485,wkittoe3@ustream.tv,Poland,92750 Buell Drive,Europe/Warsaw
4,Bucky Gillon,M,Dental Hygienist,Legal,419-969-8647,bgillon4@home.pl,China,3068 Morrow Street,Asia/Chongqing
...,...,...,...,...,...,...,...,...,...
995,Dillon Howatt,M,Clinical Specialist,Research and Development,464-126-8895,dhowattrn@google.com,Mexico,402 Pond Lane,America/Mexico_City
996,Carlyle McKinnon,M,Programmer II,Product Management,327-200-2857,cmckinnonro@mail.ru,Croatia,5 Amoth Place,Europe/Zagreb
997,Des Fayter,M,Assistant Media Planner,Accounting,133-479-3458,dfayterrp@ca.gov,Sweden,79 Larry Pass,Europe/Stockholm
998,Amberly Gabb,F,Media Manager II,Product Management,919-701-1347,agabbrq@nyu.edu,Paraguay,985 Summer Ridge Road,America/Asuncion
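`read_csv` also accepts options that limit what gets loaded. The sketch below uses an in-memory string as a stand-in for a file on disk, so it runs without `data/employees_info.csv`; the two rows are taken from the output above, and only three of the nine columns are included.

```python
import io
import pandas as pd

# Stand-in for a CSV file; read_csv accepts any file-like object
csv_text = """name,gender,department
Hayley Robshaw,F,Marketing
Bucky Gillon,M,Legal
"""

# usecols loads only the listed columns
df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'department'])

# nrows limits how many data rows are read
first_row = pd.read_csv(io.StringIO(csv_text), nrows=1)

print(df)
print(first_row)
```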


In [9]:
# Get all column names
df.columns

Index(['name', 'gender', 'job title', 'department', 'phone', 'email',
       'residence', 'address', 'time zone'],
      dtype='object')

In [10]:
# head() and tail() return the first/last five records of the dataframe by default
df.head()

Unnamed: 0,name,gender,job title,department,phone,email,residence,address,time zone
0,Maximilianus Kertess,M,Biostatistician II,Research and Development,752-802-7785,mkertess0@imgur.com,Canada,18411 Sugar Point,America/Edmonton
1,Hayley Robshaw,F,Administrative Assistant I,Marketing,540-813-4217,hrobshaw1@weibo.com,China,1281 Namekagon Lane,Asia/Chongqing
2,Margaret Anetts,F,Food Chemist,Services,593-431-7020,manetts2@skyrock.com,Belgium,7404 Dakota Alley,Europe/Brussels
3,West Kittoe,M,Nuclear Power Engineer,Services,762-464-6485,wkittoe3@ustream.tv,Poland,92750 Buell Drive,Europe/Warsaw
4,Bucky Gillon,M,Dental Hygienist,Legal,419-969-8647,bgillon4@home.pl,China,3068 Morrow Street,Asia/Chongqing


In [11]:
# Or specify how many records to display by passing a number
df.tail(3)

Unnamed: 0,name,gender,job title,department,phone,email,residence,address,time zone
997,Des Fayter,M,Assistant Media Planner,Accounting,133-479-3458,dfayterrp@ca.gov,Sweden,79 Larry Pass,Europe/Stockholm
998,Amberly Gabb,F,Media Manager II,Product Management,919-701-1347,agabbrq@nyu.edu,Paraguay,985 Summer Ridge Road,America/Asuncion
999,Leena Glowinski,F,Civil Engineer,Product Management,909-270-4772,lglowinskirr@hc360.com,South Africa,51 Stang Pass,Africa/Johannesburg


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        1000 non-null   object
 1   gender      1000 non-null   object
 2   job title   1000 non-null   object
 3   department  1000 non-null   object
 4   phone       1000 non-null   object
 5   email       1000 non-null   object
 6   residence   1000 non-null   object
 7   address     1000 non-null   object
 8   time zone   1000 non-null   object
dtypes: object(9)
memory usage: 70.4+ KB


In [13]:
df.describe()

Unnamed: 0,name,gender,job title,department,phone,email,residence,address,time zone
count,1000,1000,1000,1000,1000,1000,1000,1000,1000
unique,1000,2,183,12,1000,1000,124,1000,162
top,Maximilianus Kertess,M,Recruiter,Marketing,752-802-7785,mkertess0@imgur.com,China,18411 Sugar Point,Asia/Chongqing
freq,1,509,20,98,1,1,175,1,82
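Every column in the employee data is text, which is why `describe()` reports `count`, `unique`, `top` and `freq`. For numeric columns the summary looks different. A small sketch using the Name/Age data from earlier in this walkthrough:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 42]})

# On a mixed DataFrame, describe() summarizes only the numeric columns by default
numeric_summary = df.describe()

# Describing only the text column gives count/unique/top/freq instead
text_summary = df[['Name']].describe()

print(numeric_summary)
print(text_summary)
```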


### Row/Column Selection, Addition, and Deletion

In [14]:
# Create a test dataframe
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Tom,28
1,Jack,34
2,Steve,29
3,Ricky,42


In [15]:
# Select a column
df['Name']

0      Tom
1     Jack
2    Steve
3    Ricky
Name: Name, dtype: object

In [16]:
# Select a row
df.loc[0]

Name    Tom
Age      28
Name: 0, dtype: object

In [17]:
# Select a cell by giving the row index and column name in a single loc call
df.loc[0, 'Name']

'Tom'
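There are several equivalent ways to read a single cell. The sketch below compares them on the same test data: `loc` and `at` take labels, `iloc` and `iat` take integer positions, and `at`/`iat` are meant for single-value access.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 42]})

by_label = df.loc[0, 'Name']      # row label 0, column label 'Name'
by_position = df.iloc[0, 0]       # first row, first column
scalar_label = df.at[0, 'Name']   # label-based, single value only
scalar_position = df.iat[0, 0]    # position-based, single value only

print(by_label, by_position, scalar_label, scalar_position)
```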

In [18]:
# Add a column by assigning a list with one value per row
gender = ['M','M','M','F']
df['Gender'] = gender
df

Unnamed: 0,Name,Age,Gender
0,Tom,28,M
1,Jack,34,M
2,Steve,29,M
3,Ricky,42,F


In [19]:
# Delete a column
del df['Age']
df

Unnamed: 0,Name,Gender
0,Tom,M
1,Jack,M
2,Steve,M
3,Ricky,F
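The cells above cover columns; rows can be added and removed in a similar way. A minimal sketch, assuming a hypothetical new employee `Anna` who is not part of the original data. Note that `DataFrame.append` was removed in pandas 2.0, so `pd.concat` is used here.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Gender': ['M', 'M', 'M', 'F']})

# Add a row: build a one-row DataFrame and concatenate it.
# 'Anna' is a made-up record used only for illustration.
new_row = pd.DataFrame({'Name': ['Anna'], 'Gender': ['F']})
df = pd.concat([df, new_row], ignore_index=True)

# Delete a row by its index label (drop returns a new DataFrame)
df = df.drop(index=1)

# drop can also remove columns, as an alternative to del
without_gender = df.drop(columns=['Gender'])

print(df)
print(without_gender)
```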


### Slice a DataFrame
Sometimes selecting a single Series isn’t enough. For more complex selections, Pandas provides DataFrame slicing through the “loc” (label-based) and “iloc” (position-based) indexers.
Multiple rows can be selected using the ‘ : ’ slice operator.
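One detail that is easy to miss: a `loc` slice includes its end label, while an `iloc` slice (like plain Python slicing) excludes its end position. A small sketch on the four-row test data used earlier:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 42]})

# Label-based: rows labelled 1, 2 AND 3
by_label = df.loc[1:3]

# Position-based: rows at positions 1 and 2 only
by_position = df.iloc[1:3]

# Plain [] slicing with integers is also position-based
plain_slice = df[1:3]

print(len(by_label), len(by_position), len(plain_slice))
```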

In [20]:
# Let's use the csv data we just imported
df = pd.read_csv("data/employees_info.csv")
df

Unnamed: 0,name,gender,job title,department,phone,email,residence,address,time zone
0,Maximilianus Kertess,M,Biostatistician II,Research and Development,752-802-7785,mkertess0@imgur.com,Canada,18411 Sugar Point,America/Edmonton
1,Hayley Robshaw,F,Administrative Assistant I,Marketing,540-813-4217,hrobshaw1@weibo.com,China,1281 Namekagon Lane,Asia/Chongqing
2,Margaret Anetts,F,Food Chemist,Services,593-431-7020,manetts2@skyrock.com,Belgium,7404 Dakota Alley,Europe/Brussels
3,West Kittoe,M,Nuclear Power Engineer,Services,762-464-6485,wkittoe3@ustream.tv,Poland,92750 Buell Drive,Europe/Warsaw
4,Bucky Gillon,M,Dental Hygienist,Legal,419-969-8647,bgillon4@home.pl,China,3068 Morrow Street,Asia/Chongqing
...,...,...,...,...,...,...,...,...,...
995,Dillon Howatt,M,Clinical Specialist,Research and Development,464-126-8895,dhowattrn@google.com,Mexico,402 Pond Lane,America/Mexico_City
996,Carlyle McKinnon,M,Programmer II,Product Management,327-200-2857,cmckinnonro@mail.ru,Croatia,5 Amoth Place,Europe/Zagreb
997,Des Fayter,M,Assistant Media Planner,Accounting,133-479-3458,dfayterrp@ca.gov,Sweden,79 Larry Pass,Europe/Stockholm
998,Amberly Gabb,F,Media Manager II,Product Management,919-701-1347,agabbrq@nyu.edu,Paraguay,985 Summer Ridge Road,America/Asuncion


In [21]:
# Get multiple rows by giving a range of positions (the end position is excluded)
df[1:5]

Unnamed: 0,name,gender,job title,department,phone,email,residence,address,time zone
1,Hayley Robshaw,F,Administrative Assistant I,Marketing,540-813-4217,hrobshaw1@weibo.com,China,1281 Namekagon Lane,Asia/Chongqing
2,Margaret Anetts,F,Food Chemist,Services,593-431-7020,manetts2@skyrock.com,Belgium,7404 Dakota Alley,Europe/Brussels
3,West Kittoe,M,Nuclear Power Engineer,Services,762-464-6485,wkittoe3@ustream.tv,Poland,92750 Buell Drive,Europe/Warsaw
4,Bucky Gillon,M,Dental Hygienist,Legal,419-969-8647,bgillon4@home.pl,China,3068 Morrow Street,Asia/Chongqing


In [22]:
# df.loc[<ROWS_TO_SELECT>, <COLUMNS_TO_SELECT>]
df.loc[:,["name","job title"]]

Unnamed: 0,name,job title
0,Maximilianus Kertess,Biostatistician II
1,Hayley Robshaw,Administrative Assistant I
2,Margaret Anetts,Food Chemist
3,West Kittoe,Nuclear Power Engineer
4,Bucky Gillon,Dental Hygienist
...,...,...
995,Dillon Howatt,Clinical Specialist
996,Carlyle McKinnon,Programmer II
997,Des Fayter,Assistant Media Planner
998,Amberly Gabb,Media Manager II


In [23]:
# We can also select rows that meet a condition. For example, get all employees
# who work in the Accounting department, keeping only the 'name', 'job title'
# and 'department' columns.
accounting_employees = df.loc[df["department"] == "Accounting", ["name", "job title", "department"]]
accounting_employees

Unnamed: 0,name,job title,department
10,Ann-marie Daveley,Quality Engineer,Accounting
18,Willem Tamsett,Graphic Designer,Accounting
34,Jaynell Baselli,Mechanical Systems Engineer,Accounting
43,Gale Workman,Business Systems Development Analyst,Accounting
68,Ingemar O' Dooley,Senior Financial Analyst,Accounting
...,...,...,...
937,Jessalyn Domerque,Senior Editor,Accounting
943,Eberhard Josselson,Account Coordinator,Accounting
945,Egor Pittam,Analyst Programmer,Accounting
960,Michelina Gimert,Safety Technician III,Accounting
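Conditions can be combined. The sketch below uses four rows taken from the employee output above as a small in-memory stand-in for the CSV. Each condition needs its own parentheses, and the operators are `&` (and), `|` (or) and `~` (not) rather than Python's `and`/`or`/`not`.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Hayley Robshaw', 'Margaret Anetts', 'West Kittoe', 'Bucky Gillon'],
    'gender': ['F', 'F', 'M', 'M'],
    'department': ['Marketing', 'Services', 'Services', 'Legal'],
})

# Both conditions must hold
women_in_services = df.loc[
    (df['department'] == 'Services') & (df['gender'] == 'F'),
    ['name', 'department'],
]

# isin() matches any value in a list
marketing_or_legal = df.loc[df['department'].isin(['Marketing', 'Legal']), 'name']

# ~ negates a condition
not_services = df.loc[~(df['department'] == 'Services')]

print(women_in_services)
print(marketing_or_legal)
print(not_services)
```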


### Group data

In [24]:
# Group by department, then by name, and take the first row of each group
grouped = df.groupby(["department","name"]).first()
grouped

department,name,gender,job title,phone,email,residence,address,time zone
Accounting,Abie Hawkings,M,Office Assistant II,247-553-0319,ahawkings3v@woothemes.com,Lebanon,07 Loeprich Center,Asia/Beirut
Accounting,Aleece Gosland,F,Editor,924-906-4001,agoslandmm@nature.com,Brazil,19 Merrick Park,America/Sao_Paulo
Accounting,Ambrosius Dun,M,Civil Engineer,263-465-6885,adunaq@diigo.com,Yemen,91 Jenna Alley,Asia/Aden
Accounting,Andree McGenn,F,Administrative Officer,124-944-2608,amcgenncc@opera.com,Germany,52455 Loftsgordon Crossing,Europe/Berlin
Accounting,Ann-marie Daveley,F,Quality Engineer,804-815-7609,adaveleya@delicious.com,Philippines,908 Di Loreto Crossing,Asia/Manila
...,...,...,...,...,...,...,...,...
Training,Vick Shovelbottom,M,VP Sales,518-654-2238,vshovelbottomcn@springer.com,Poland,95 Pierstorff Drive,Europe/Warsaw
Training,Webb Finicj,M,Quality Control Specialist,567-227-4839,wfinicjbo@mail.ru,Brazil,5 Forster Hill,America/Fortaleza
Training,Win Gronav,M,Structural Analysis Engineer,951-279-9818,wgronavg3@nydailynews.com,Indonesia,82 Spohn Way,Asia/Jakarta
Training,Witty Weekley,M,Office Assistant III,113-252-8572,wweekleyhz@jugem.jp,China,95 Grayhawk Crossing,Asia/Chongqing
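Because each name appears only once in the data, the grouping above mostly rearranges the rows under a two-level index. Grouping is more often followed by an aggregation. A minimal sketch on four rows taken from the employee output, standing in for the full CSV; the result column names `employees` and `first_name` are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Hayley Robshaw', 'Margaret Anetts', 'West Kittoe', 'Bucky Gillon'],
    'gender': ['F', 'F', 'M', 'M'],
    'department': ['Marketing', 'Services', 'Services', 'Legal'],
})

# Number of rows in each department
headcount = df.groupby('department').size()

# Count names per (department, gender) pair
by_gender = df.groupby(['department', 'gender'])['name'].count()

# Named aggregation: several summaries at once
summary = df.groupby('department').agg(
    employees=('name', 'count'),
    first_name=('name', 'first'),
)

print(headcount)
print(by_gender)
print(summary)
```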
