### PANDAS: INTRODUCTION

> It is often said that 80% of data analysis is spent on cleaning and preparing data. This section focuses on a small but important part of that work: data manipulation and cleaning with Pandas.

### Data Structures

**Pandas has two main data structures:**<br>
* **Series -** A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

* **DataFrame -** A two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, an SQL table, or a dict of Series objects.
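As a rough illustration of how the two structures relate, the sketch below builds two Series and then a DataFrame from a dict of those Series. The names `prices`, `stock`, and `table` are made up for this example.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
prices = pd.Series([1.5, 0.25, 3.0], index=['apple', 'banana', 'cherry'])
stock = pd.Series([10, 120, 45], index=['apple', 'banana', 'cherry'])

# A DataFrame: columns that share one index, here built from a dict of Series
table = pd.DataFrame({'price': prices, 'stock': stock})

print(prices.index.tolist())   # labels of the Series
print(table.shape)             # (rows, columns)
print(table.dtypes)            # each column keeps its own dtype
```

Selecting a single column, for example `table['price']`, gives back a Series.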

### Series Data Structure:

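**pandas.core.series.Series(data, index, dtype, copy)**<br>
* **data -** the values; accepts an ndarray, list, scalar constant, dictionary, etc.<br>
* **index -** the axis labels. Labels must be hashable and match the data in length; they need not be unique, as the duplicate-label examples in this section show. When omitted, it defaults to 0, 1, ..., n-1.<br>
* **dtype -** the data type of the values; it is inferred from the data when omitted.<br>
* **copy -** whether to copy the input data; False by default in the pandas version used here. It only affects Series and one-dimensional ndarray inputs.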
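The `index` and `dtype` parameters can be seen side by side in a minimal sketch. The variable names `inferred` and `forced` are introduced here only for illustration.

```python
import pandas as pd

values = [1, 2, 3]

inferred = pd.Series(values)                       # dtype inferred from the data
forced = pd.Series(values, index=['a', 'b', 'c'],  # explicit labels
                   dtype='float64')                # explicit element type

print(inferred.dtype, inferred.index.tolist())
print(forced.dtype, forced.index.tolist())
```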

In [1]:
# importing required modules
import pandas as pd
import numpy as np

In [2]:
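# creating an empty Series; an explicit dtype avoids the default-dtype
# warning and keeps the output the same across pandas versions
s = pd.Series(dtype = 'float64')
print (s, type(s))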

Series([], dtype: float64) <class 'pandas.core.series.Series'>




In [9]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data)
print (s)
print (type(s))
print (s[0], s[3])

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
0        apple
1       banana
2       cherry
3    pineapple
dtype: object
<class 'pandas.core.series.Series'>
apple pineapple


In [7]:
arr_data = np.array([100, 300, 200, 600, 500])
s = pd.Series(arr_data, copy = False)
print (arr_data, type(arr_data))
print (s, type(s))
s[0] = 999
arr_data[2] = 888
print (arr_data, type(arr_data))
print (s, type(s))

[100 300 200 600 500] <class 'numpy.ndarray'>
0    100
1    300
2    200
3    600
4    500
dtype: int32 <class 'pandas.core.series.Series'>
[999 300 888 600 500] <class 'numpy.ndarray'>
0    999
1    300
2    888
3    600
4    500
dtype: int32 <class 'pandas.core.series.Series'>


In [8]:
arr_data = np.array([100, 300, 200, 600, 500])
s = pd.Series(arr_data, copy = True)
print (arr_data, type(arr_data))
print (s, type(s))
s[0] = 999
arr_data[2] = 888
print (arr_data, type(arr_data))
print (s, type(s))

[100 300 200 600 500] <class 'numpy.ndarray'>
0    100
1    300
2    200
3    600
4    500
dtype: int32 <class 'pandas.core.series.Series'>
[100 300 888 600 500] <class 'numpy.ndarray'>
0    999
1    300
2    200
3    600
4    500
dtype: int32 <class 'pandas.core.series.Series'>


In [15]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data, index = [100, 101, 102, 103])
print (s)
print (type(s))
print (s[100], type(s[100]), s[103], type(s[103]))

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
100        apple
101       banana
102       cherry
103    pineapple
dtype: object
<class 'pandas.core.series.Series'>
apple <class 'str'> pineapple <class 'str'>


In [20]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data, index = [100, 103, 100, 103])
print (s)
print (type(s))
print (s[100])
print (s[103])
print (type(s[103]))

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
100        apple
103       banana
100       cherry
103    pineapple
dtype: object
<class 'pandas.core.series.Series'>
100     apple
100    cherry
dtype: object
103       banana
103    pineapple
dtype: object
<class 'pandas.core.series.Series'>
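The cell above shows that index labels may repeat, and that looking up a repeated label returns a Series rather than a single value. A minimal sketch for checking this up front, using `Index.is_unique` and `Index.duplicated`:

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry', 'pineapple'],
              index=[100, 103, 100, 103])

print(s.index.is_unique)       # False: labels repeat
print(s.index.duplicated())    # marks the second and later occurrences
print(s.loc[100].tolist())     # all values stored under label 100
```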


In [18]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data, index = ['fruit-1', 'fruit-2', 'fruit-3', 'fruit-4'])
print (s)
print (type(s))
# .iloc is explicit positional access; plain s[0] on a string index is deprecated
print (s['fruit-1'], s.iloc[0])
print (s['fruit-3'], s.iloc[2])

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
fruit-1        apple
fruit-2       banana
fruit-3       cherry
fruit-4    pineapple
dtype: object
<class 'pandas.core.series.Series'>
apple apple
cherry cherry


In [22]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))
s = pd.Series(data = arr_data, index = ['fruit-1', 'fruit-2', 'fruit-2', 'fruit-1'])
print (s)
print (type(s))
print (s['fruit-1'])
print (s['fruit-2'])
print (s.iloc[0])
print (s.iloc[2])

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
fruit-1        apple
fruit-2       banana
fruit-2       cherry
fruit-1    pineapple
dtype: object
<class 'pandas.core.series.Series'>
fruit-1        apple
fruit-1    pineapple
dtype: object
fruit-2    banana
fruit-2    cherry
dtype: object
apple
cherry


In [24]:
# create a Series from a dictionary
dict_data = {'apple':100, 'banana':202, 'coconut':450, 'mango':435}
print (dict_data, type(dict_data))
s = pd.Series(dict_data)
print (s)

{'apple': 100, 'banana': 202, 'coconut': 450, 'mango': 435} <class 'dict'>
apple      100
banana     202
coconut    450
mango      435
dtype: int64


In [25]:
dict_data = {'apple':100, 'banana':202, 'coconut':450, 'mango':435}
print (dict_data, type(dict_data))
s = pd.Series(dict_data, index = ['banana', 'mango', 'apple', 'coconut'])
print (s)

{'apple': 100, 'banana': 202, 'coconut': 450, 'mango': 435} <class 'dict'>
banana     202
mango      435
apple      100
coconut    450
dtype: int64


In [31]:
dict_data = {'apple':100, 'banana':202, 'coconut':450, 'mango':435}
print (dict_data, type(dict_data))
s = pd.Series(data = dict_data, index = ['banana', 'lime', 'coconut', 'mango', 'guava', 'apple', 'mango', 'apple', 'coconut'])
print (s)
print (s.iloc[0], s.iloc[3], s.iloc[4], s.iloc[5])

{'apple': 100, 'banana': 202, 'coconut': 450, 'mango': 435} <class 'dict'>
banana     202.0
lime         NaN
coconut    450.0
mango      435.0
guava        NaN
apple      100.0
mango      435.0
apple      100.0
coconut    450.0
dtype: float64
202.0 435.0 nan 100.0
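In the output above, labels that are missing from the dictionary ('lime', 'guava') become NaN, and because NaN is a float the whole Series is upcast from int64 to float64. One way to inspect and handle those gaps, sketched with `isna`, `dropna`, and `fillna` on a shorter index:

```python
import pandas as pd

dict_data = {'apple': 100, 'banana': 202, 'coconut': 450, 'mango': 435}
s = pd.Series(dict_data, index=['banana', 'lime', 'mango', 'guava'])

print(s.isna())                       # True where the label had no dictionary entry
print(s.dropna().astype('int64'))     # drop the gaps, then restore the integer dtype
print(s.fillna(0).astype('int64'))    # or substitute a default value instead
```

Whether to drop or fill depends on what a missing label means in the data; the default of 0 here is only a placeholder.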


In [32]:
# create a Series from a scalar
s = pd.Series(5, index = [0, 1, 2, 3, 4])
print (s)

0    5
1    5
2    5
3    5
4    5
dtype: int64


In [34]:
s = pd.Series(5, index = [0, 1, 2, 0, 1, 2])
print (s)
print (s[0])

0    5
1    5
2    5
0    5
1    5
2    5
dtype: int64
0    5
0    5
dtype: int64


In [40]:
# Create a Series from a list
s = pd.Series(data = [101, 303, 202, 404, 505], index = ['red', 'blue', 'brown', 'black', 'silver'])
print (s)
print (s['blue'], s.iloc[1])
# print (s['golden'])   # error - KeyError

red       101
blue      303
brown     202
black     404
silver    505
dtype: int64
303 303
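An integer key on a label-indexed Series is ambiguous between label and position, so the explicit `.loc` (label) and `.iloc` (position) accessors are usually clearer. The sketch below also shows `Series.get` as one way to avoid the KeyError for a missing label such as 'golden'.

```python
import pandas as pd

s = pd.Series([101, 303, 202, 404, 505],
              index=['red', 'blue', 'brown', 'black', 'silver'])

print(s.loc['blue'])          # label-based lookup
print(s.iloc[1])              # position-based lookup
print(s.get('golden'))        # None instead of KeyError for a missing label
print(s.get('golden', -1))    # or a caller-chosen default
```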


In [43]:
s = pd.Series(data = [101, 303, 202, 404, 505], index = ['red', 'blue', 'brown', 'black', 'silver'])
print (s)
print (s.sort_values())
print (s.sort_index())

red       101
blue      303
brown     202
black     404
silver    505
dtype: int64
red       101
brown     202
blue      303
black     404
silver    505
dtype: int64
black     404
blue      303
brown     202
red       101
silver    505
dtype: int64


In [50]:
s = pd.Series(data = ['red', 'blue', 'brown', 'black', 'silver'])
print (s)
print (s[0])
print (s[3])
print (s[0:4])
print (s[-5:-1])

0       red
1      blue
2     brown
3     black
4    silver
dtype: object
red
black
0      red
1     blue
2    brown
3    black
dtype: object
0      red
1     blue
2    brown
3    black
dtype: object


In [53]:
s = pd.Series(data = [101, 303, 202, 404, 505], index = ['red', 'blue', 'brown', 'black', 'silver'])
print (s)
print (s['brown'])
print (s[['brown', 'red', 'silver', 'blue']])

red       101
blue      303
brown     202
black     404
silver    505
dtype: int64
202
brown     202
red       101
silver    505
blue      303
dtype: int64


In [60]:
# a notebook cell echoes only the value of its last expression
var1 = 100
var2 = 200
var1
var2

200

In [61]:
var1 = 100
var2 = 200
print (var1)
var2

100


200

In [62]:
var1 = 100
var2 = 200
print (var1)
print (var2)

100
200


### DataFrame Data Structure:

### Create DataFrame

In [63]:
user_data = [['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']]
print (user_data, type(user_data))
user_columns = ['name', 'age', 'gender', 'job']
print (user_columns, type(user_columns))
user1 = pd.DataFrame(data = user_data)
print (user1)
user1

[['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']] <class 'list'>
['name', 'age', 'gender', 'job'] <class 'list'>
       0   1  2        3
0  alice  19  F  student
1   john  26  M  student


       0   1  2        3
0  alice  19  F  student
1   john  26  M  student


In [64]:
user_data = [['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']]
print (user_data, type(user_data))
user_columns = ['name', 'age', 'gender', 'job']
print (user_columns, type(user_columns))
user1 = pd.DataFrame(data = user_data, columns = user_columns)
print (user1)
user1

[['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']] <class 'list'>
['name', 'age', 'gender', 'job'] <class 'list'>
    name  age gender      job
0  alice   19      F  student
1   john   26      M  student


    name  age gender      job
0  alice   19      F  student
1   john   26      M  student


In [67]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'],
             'emp_age':[34, 35, 45, 43]}
emp_id = [100, 101, 102, 103]
print (data_dict)
df = pd.DataFrame(data = data_dict)
df

{'emp_name': ['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age': [34, 35, 45, 43]}


  emp_name  emp_age
0     Amal       34
1    Kamal       35
2    Bimal       45
3  Shyamal       43


In [68]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'],
         'emp_age':[34, 35, 45, 43]}
emp_id = [100, 101, 102, 103]
print (data_dict)
df = pd.DataFrame(data = data_dict, index = emp_id)
df

{'emp_name': ['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age': [34, 35, 45, 43]}


    emp_name  emp_age
100     Amal       34
101    Kamal       35
102    Bimal       45
103  Shyamal       43


In [70]:
df = df.reset_index()
df

   index emp_name  emp_age
0    100     Amal       34
1    101    Kamal       35
2    102    Bimal       45
3    103  Shyamal       43


In [71]:
df.rename(columns = {'index':'emp_id', 'emp_name':'emp_fname'})

   emp_id emp_fname  emp_age
0     100      Amal       34
1     101     Kamal       35
2     102     Bimal       45
3     103   Shyamal       43


In [72]:
df

   index emp_name  emp_age
0    100     Amal       34
1    101    Kamal       35
2    102    Bimal       45
3    103  Shyamal       43


In [73]:
df.rename(columns = {'index':'emp_id', 'emp_name':'emp_fname'}, inplace = True)
df

   emp_id emp_fname  emp_age
0     100      Amal       34
1     101     Kamal       35
2     102     Bimal       45
3     103   Shyamal       43
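The `rename` cells above differ only in whether the result is kept: without `inplace = True`, a new DataFrame is returned and `df` is left unchanged. A sketch of the assignment style, which achieves the same result by keeping the returned object; `renamed` is a name introduced here, and the two-row frame is a shortened stand-in for `df`.

```python
import pandas as pd

df = pd.DataFrame({'index': [100, 101], 'emp_name': ['Amal', 'Kamal'],
                   'emp_age': [34, 35]})

# rename returns a new DataFrame; the original keeps its column names
renamed = df.rename(columns={'index': 'emp_id', 'emp_name': 'emp_fname'})

print(df.columns.tolist())
print(renamed.columns.tolist())
```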


In [76]:
# DataFrame to NumPy ndarray
arr_data = df[['emp_fname', 'emp_age']].to_numpy()
print (arr_data, type(arr_data))

[['Amal' 34]
 ['Kamal' 35]
 ['Bimal' 45]
 ['Shyamal' 43]] <class 'numpy.ndarray'>


In [77]:
arr_data = df.to_numpy()
print (arr_data, type(arr_data))

[[100 'Amal' 34]
 [101 'Kamal' 35]
 [102 'Bimal' 45]
 [103 'Shyamal' 43]] <class 'numpy.ndarray'>
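Because a NumPy array has a single dtype, converting columns of mixed types yields an `object` array, as in the outputs above. Selecting only numeric columns first keeps a numeric dtype. A small sketch on a shortened stand-in for `df`; `mixed` and `numeric` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'emp_id': [100, 101], 'emp_fname': ['Amal', 'Kamal'],
                   'emp_age': [34, 35]})

mixed = df.to_numpy()                            # strings and ints together
numeric = df[['emp_id', 'emp_age']].to_numpy()   # integer columns only

print(mixed.dtype)     # object
print(numeric.dtype)   # an integer dtype
print(numeric.shape)
```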


In [107]:
user_data = [['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']]
user_columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame(data = user_data, columns = user_columns)
user1

    name  age gender      job
0  alice   19      F  student
1   john   26      M  student


In [109]:
user_data = dict(name = ['eric', 'paul'], age = [22, 58], gender = ['M', 'F'], job = ['student', 'manager'])
print (user_data)
user2 = pd.DataFrame(data = user_data)
user2

{'name': ['eric', 'paul'], 'age': [22, 58], 'gender': ['M', 'F'], 'job': ['student', 'manager']}


   name  age gender      job
0  eric   22      M  student
1  paul   58      F  manager


In [84]:
user_data = dict(name = ['peter', 'julie'], age = [33, 44], gender = ['M', 'F'], job = ['engineer', 'scientist'])
print (user_data)
user3 = pd.DataFrame(data = user_data)
user3

{'name': ['peter', 'julie'], 'age': [33, 44], 'gender': ['M', 'F'], 'job': ['engineer', 'scientist']}


    name  age gender        job
0  peter   33      M   engineer
1  julie   44      F  scientist


### Concatenate DataFrame

In [93]:
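# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# on current versions use pd.concat([user1, user2]) instead
users = user1.append(user2)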
users

    name  age gender      job
0  alice   19      F  student
1   john   26      M  student
0   eric   22      M  student
1   paul   58      F  manager


In [94]:
# on pandas >= 2.0 (no DataFrame.append): pd.concat([user1, user2], ignore_index = True)
users = user1.append(user2, ignore_index = True)
users

    name  age gender      job
0  alice   19      F  student
1   john   26      M  student
2   eric   22      M  student
3   paul   58      F  manager


In [95]:
# on pandas >= 2.0 (no DataFrame.append): pd.concat([users, user3], ignore_index = True)
users = users.append(user3, ignore_index = True)
users

    name  age gender        job
0  alice   19      F    student
1   john   26      M    student
2   eric   22      M    student
3   paul   58      F    manager
4  peter   33      M   engineer
5  julie   44      F  scientist


In [98]:
# on pandas >= 2.0 (no DataFrame.append): pd.concat([user1, user2, user3], ignore_index = True)
users = user1.append(user2).append(user3, ignore_index = True)
users

    name  age gender        job
0  alice   19      F    student
1   john   26      M    student
2   eric   22      M    student
3   paul   58      F    manager
4  peter   33      M   engineer
5  julie   44      F  scientist


In [99]:
users = pd.concat([user1, user2, user3])
users

    name  age gender        job
0  alice   19      F    student
1   john   26      M    student
0   eric   22      M    student
1   paul   58      F    manager
0  peter   33      M   engineer
1  julie   44      F  scientist


In [110]:
users = pd.concat([user1, user2, user3], ignore_index = True) 
users

    name  age gender        job
0  alice   19      F    student
1   john   26      M    student
2   eric   22      M    student
3   paul   58      F    manager
4  peter   33      M   engineer
5  julie   44      F  scientist
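Besides `ignore_index`, `pd.concat` accepts a `keys` argument that records which input each row came from. The sketch below uses shortened two-column frames, and the labels 'u1' and 'u2' are made up for illustration.

```python
import pandas as pd

user1 = pd.DataFrame({'name': ['alice', 'john'], 'age': [19, 26]})
user2 = pd.DataFrame({'name': ['eric', 'paul'], 'age': [22, 58]})

# keys adds an outer index level naming the source of each row
tagged = pd.concat([user1, user2], keys=['u1', 'u2'])
print(tagged)
print(tagged.loc['u2'])        # rows that came from user2

# ignore_index=True discards the original labels and renumbers from 0
flat = pd.concat([user1, user2], ignore_index=True)
print(flat.index.tolist())
```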
