
# Pandas in Python

What is Pandas?

Pandas is an open-source library that provides high-performance data manipulation and analysis tools for Python.
The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.
Wes McKinney began developing it in 2008.

Data analysis involves a lot of processing, such as restructuring, cleaning, and merging. Several tools are available for fast data processing, such as NumPy, SciPy, and Pandas. We prefer Pandas here because working with labeled, tabular data in it is fast, simple, and more expressive than with lower-level array tools.


In [1]:
# Import Pandas and check the version

import pandas as pd
print(pd.__version__)

1.4.4


Python Pandas Data Structures:

Pandas provides two data structures for processing data: Series and DataFrame.


# Series:
        '''A Series is a one-dimensional labeled array that can hold values of any data type.
           The row labels of a Series are called the index.
           We can easily convert a list, tuple, or dictionary into a Series using the pd.Series() constructor.
           A Series cannot contain multiple columns.'''
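
The role of the index can be illustrated with a short sketch. The fruit names and integer labels below are made-up sample data; the point is only that `.loc` looks values up by index label while `.iloc` looks them up by position.

```python
import pandas as pd

# Hypothetical sample data: the dict keys become the index labels.
fruits = pd.Series({1: 'orange', 2: 'mango', 3: 'cherry'})

by_label = fruits.loc[1]       # value stored under index label 1
by_position = fruits.iloc[1]   # value at position 1 (the second entry)

print(list(fruits.index))      # the row labels
print(by_label, by_position)
```

Because the labels start at 1 while positions start at 0, the two lookups return different entries here.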


In [2]:
# Create a Series from a list, a NumPy array, and a dict:

# From list:
print("Using List")
mylst = ['orange', 'banana', 'mango', 'carry']
lst = pd.Series(mylst)
print(lst)
print(type(lst))

Using List
0    orange
1    banana
2     mango
3     carry
dtype: object
<class 'pandas.core.series.Series'>


In [3]:
# From Numpy Array

print("Using Numpy Array")
import numpy as np
myarr = np.arange(5)
arr = pd.Series(myarr)
print(arr)
print(type(arr))

Using Numpy Array
0    0
1    1
2    2
3    3
4    4
dtype: int32
<class 'pandas.core.series.Series'>


In [4]:
# From Dictionary

print("Using Dictionary")
mydict = {1:'ogange', 2:'mango', 3:'carry'}
dct = pd.Series(mydict)
print(dct)
print(type(dct))

Using Dictionary
1    ogange
2     mango
3     carry
dtype: object
<class 'pandas.core.series.Series'>


In [5]:
# Convert the index of a Series into a column of a DataFrame

df1 = dct.to_frame().reset_index()
print(df1.head())

   index       0
0      1  ogange
1      2   mango
2      3   carry


In [6]:
# Combine several Series to form a DataFrame
df2 = pd.concat([lst, arr], axis = 1)
print(df2)

        0  1
0  orange  0
1  banana  1
2   mango  2
3   carry  3
4     NaN  4


In [7]:
# Combine several Series into a DataFrame with named columns
df3 = pd.DataFrame({'col1':lst, 'col2':arr})
print(df3)

     col1  col2
0  orange     0
1  banana     1
2   mango     2
3   carry     3
4     NaN     4
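
The NaN in the last row above appears because Pandas aligns Series on their index labels and fills positions that have no matching label. A minimal sketch of that alignment, using two hypothetical Series with only partly overlapping labels:

```python
import pandas as pd

# Hypothetical Series whose index labels overlap only on 'b'.
left = pd.Series(['x', 'y'], index=['a', 'b'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Rows are matched by label, not by position; gaps are filled with NaN.
aligned = pd.DataFrame({'left': left, 'right': right})
print(aligned)
```

Note that the integer column becomes float64 once it has to hold a NaN.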


# DataFrame:
        '''A Pandas DataFrame is a widely used two-dimensional data structure with labeled axes (rows and columns).
       It is the standard way to store data that has two different indexes, i.e., a row index and a column index.'''

In [8]:
# Create an empty DataFrame

df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


In [9]:
# Create DataFrame using List
lst = ['orange', 'mango', 'carry', 'banana']
df1 = pd.DataFrame(lst)
print(df1)

        0
0  orange
1   mango
2   carry
3  banana


In [10]:
# Create a DataFrame from Dict
dct = {'ID': [101, 102, 103, 104], 'Department': ['B.Sc', 'B.Tach', 'M.Tech', 'Phd']}
df2 = pd.DataFrame(dct)
print(df2)

    ID Department
0  101       B.Sc
1  102     B.Tach
2  103     M.Tech
3  104        Phd


In [36]:
# Create DataFrame using Dictionary
dct1 = {'ID': [1,2,3,4,5,6,7,8,9,10],
       'Name': ['Raj', 'Ram', 'Ramya', 'Vidya', 'Vinay', 'Shanti', 'Bhagya', 'Arun', 'Avani', 'Shivu'],
       'email_id': ['raj@gmail.com', 'ram@gmail.com', 'ramya@gmail.com', 'vidya@gmail.com', 'vinay@gmail.com', 'shanti@gmail.com', 'bhagya@gmail.com', 'arun@gmail.com', 'avani@gmail.com','shivu@gmail.com'],
       'Expirence': [2,5,3,4,2,2.5,8,9,6,7],
       'Salary': [25000,30000,15000,50000,42000,23000,20000,35000,40000,25000],
       'Place': ['Bengalore','Hubli','Mysore', 'Pune','Bengalore','Pune', 'Hubli','Bengalore','Bengalore','Bengalore']}

In [37]:
df_dct1 = pd.DataFrame(dct1)
df_dct1

   ID    Name          email_id  Expirence  Salary      Place
0   1     Raj     raj@gmail.com        2.0   25000  Bengalore
1   2     Ram     ram@gmail.com        5.0   30000      Hubli
2   3   Ramya   ramya@gmail.com        3.0   15000     Mysore
3   4   Vidya   vidya@gmail.com        4.0   50000       Pune
4   5   Vinay   vinay@gmail.com        2.0   42000  Bengalore
5   6  Shanti  shanti@gmail.com        2.5   23000       Pune
6   7  Bhagya  bhagya@gmail.com        8.0   20000      Hubli
7   8    Arun    arun@gmail.com        9.0   35000  Bengalore
8   9   Avani   avani@gmail.com        6.0   40000  Bengalore
9  10   Shivu   shivu@gmail.com        7.0   25000  Bengalore


In [11]:
# Create a DataFrame from a dict of Series:

dct = {'one': pd.Series([1,2,3,4,5,6,7], index = ['a', 'b', 'c','d','e','f','g']),
      'two': pd.Series([11,12,13,14,15,16,17], index = ['a', 'b','c','d','e','f','g'])}

df3 = pd.DataFrame(dct)
print(df3)

   one  two
a    1   11
b    2   12
c    3   13
d    4   14
e    5   15
f    6   16
g    7   17


In [12]:
# Adding the columns to the DataFrame
df3['three'] = pd.Series([21,22,23,24,25,26,27], index = ['a', 'b','c','d','e', 'f', 'g'])
df3['four'] = pd.Series([31,32,33,34,35,36,37], index = ['a', 'b', 'c', 'd', 'e','f', 'g'])

df3['five'] = df3['one'] + df3['two']
print(df3)

   one  two  three  four  five
a    1   11     21    31    12
b    2   12     22    32    14
c    3   13     23    33    16
d    4   14     24    34    18
e    5   15     25    35    20
f    6   16     26    36    22
g    7   17     27    37    24
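
Assigning to `df3['name']` changes the DataFrame in place. As an alternative sketch (the small frame below is a hypothetical stand-in for `df3`), `assign` and `drop` return new DataFrames and leave the original untouched:

```python
import pandas as pd

# Hypothetical stand-in for df3 with just two columns.
df = pd.DataFrame({'one': [1, 2, 3], 'two': [11, 12, 13]}, index=['a', 'b', 'c'])

# assign returns a new DataFrame with the extra column; df itself is unchanged.
with_sum = df.assign(five=df['one'] + df['two'])

# drop removes a column, again returning a new DataFrame.
without_two = with_sum.drop(columns=['two'])

print(with_sum)
print(without_two)
```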


# Importing Dataset:


In [13]:
# Datasets can be imported in several ways.
# From an Excel file (reading .xlsx files requires the openpyxl package):
df_ex = pd.read_excel("Attribute DataSet.xlsx")

In [14]:
# Print first 5 Rows of dataset
df_ex.head()

     Dress_ID    Style    Price  Rating  Size  Season  NeckLine  SleeveLength  waiseline       Material  FabricType  Decoration  Pattern Type  Recommendation
0  1006032852     Sexy      Low     4.6     M  Summer    o-neck     sleevless     empire            NaN     chiffon     ruffles        animal               1
1  1212192089   Casual      Low     0.0     L  Summer    o-neck         Petal    natural     microfiber         NaN     ruffles        animal               0
2  1190380701  vintage     High     0.0     L  Automn    o-neck          full    natural       polyster         NaN         NaN         print               0
3   966005983    Brief  Average     4.6     L  Spring    o-neck          full    natural           silk     chiffon  embroidary         print               1
4   876339541     cute      Low     4.5     M  Summer    o-neck     butterfly    natural  chiffonfabric     chiffon         bow           dot               0


In [15]:
# Print last 5 rows of dataset
df_ex.tail()

      Dress_ID   Style    Price  Rating  Size  Season   NeckLine  SleeveLength  waiseline  Material  FabricType  Decoration  Pattern Type  Recommendation
495  713391965  Casual      Low     4.7     M  Spring     o-neck          full    natural  polyster         NaN         NaN         solid               1
496  722565148    Sexy      Low     4.3  free  Summer     o-neck          full     empire    cotton         NaN         NaN           NaN               0
497  532874347  Casual  Average     4.7     M  Summer     v-neck          full     empire    cotton         NaN        lace         solid               1
498  655464934  Casual  Average     4.6     L  winter  boat-neck     sleevless     empire      silk  broadcloth    applique         print               1
499  919930954  Casual      Low     4.4  free  Summer     v-neck         short     empire    cotton    Corduroy        lace         solid               0


In [16]:
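# Import a dataset from a CSV file.
# Note: haberman.csv has no header row, so with the default settings pandas uses the
# first record (30, 64, 1, 1) as the column labels; pass header=None to read_csv to avoid this.
df_csv = pd.read_csv("haberman.csv")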

In [17]:
# First 5 rows 
df_csv.head()

   30  64   1  1.1
0  30  62   3    1
1  30  65   0    1
2  31  59   2    1
3  31  65   4    1
4  33  58  10    1


In [18]:
# Last 5 rows
df_csv.tail()

     30  64  1  1.1
300  75  62  1    1
301  76  67  0    1
302  77  65  3    1
303  78  65  1    2
304  83  58  2    2
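
The column labels `30, 64, 1, 1.1` in the outputs above are really the first data record, which `read_csv` treated as a header. A minimal sketch of reading a header-less CSV, using an in-memory sample instead of the real file; the column names are placeholders chosen for this example, not taken from the dataset:

```python
import io
import pandas as pd

# Three records in the same shape as the data above, with no header line.
raw = io.StringIO("30,64,1,1\n30,62,3,1\n30,65,0,1\n")

# Placeholder column names (an assumption made for this sketch).
names = ['age', 'year', 'nodes', 'status']

# header=None tells pandas that the first line is data, not column labels.
df = pd.read_csv(raw, header=None, names=names)
print(df)
```

With `header=None`, the first record stays in the data as row 0 instead of being consumed as column labels.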


In [19]:
# Import a dataset from an HTML page (read_html returns a list of DataFrames, one per table found)
df_html = pd.read_html('https://www.basketball-reference.com/leagues/NBA_2015_totals.html')

In [20]:
df_html[0]

      Rk          Player  Pos  Age   Tm    G   GS    MP   FG  FGA  ...   FT%  ORB  DRB  TRB  AST  STL  BLK  TOV   PF   PTS
0      1      Quincy Acy   PF   24  NYK   68   22  1287  152  331  ...  .784   79  222  301   68   27   22   60  147   398
1      2    Jordan Adams   SG   20  MEM   30    0   248   35   86  ...  .609    9   19   28   16   16    7   14   24    94
2      3    Steven Adams    C   21  OKC   70   67  1771  217  399  ...  .502  199  324  523   66   38   86   99  222   537
3      4     Jeff Adrien   PF   28  MIN   17    0   215   19   44  ...  .579   23   54   77   15    4    9    9   30    60
4      5   Arron Afflalo   SG   29  TOT   78   72  2502  375  884  ...  .843   27  220  247  129   41    7  116  167  1035
...  ...             ...  ...  ...  ...  ...  ...   ...  ...  ...  ...   ...  ...  ...  ...  ...  ...  ...  ...  ...   ...
670  490  Thaddeus Young   PF   26  TOT   76   68  2434  451  968  ...  .655  127  284  411  173  124   25  117  171  1071
671  490  Thaddeus Young   PF   26  MIN   48   48  1605  289  641  ...  .682   75  170  245  135   86   17   75  115   685
672  490  Thaddeus Young   PF   26  BRK   28   20   829  162  327  ...  .606   52  114  166   38   38    8   42   56   386
673  491     Cody Zeller    C   22  CHO   62   45  1487  172  373  ...  .774   97  265  362  100   34   49   62  156   472


In [21]:
# Import a dataset from JSON (here, a URL that returns a JSON array)
df_json = pd.read_json('https://api.github.com/repos/pandas-dev/pandas/issues')

In [22]:
df_json['user'][0]

{'login': 'WillAyd',
 'id': 609873,
 'node_id': 'MDQ6VXNlcjYwOTg3Mw==',
 'avatar_url': 'https://avatars.githubusercontent.com/u/609873?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/WillAyd',
 'html_url': 'https://github.com/WillAyd',
 'followers_url': 'https://api.github.com/users/WillAyd/followers',
 'following_url': 'https://api.github.com/users/WillAyd/following{/other_user}',
 'gists_url': 'https://api.github.com/users/WillAyd/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/WillAyd/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/WillAyd/subscriptions',
 'organizations_url': 'https://api.github.com/users/WillAyd/orgs',
 'repos_url': 'https://api.github.com/users/WillAyd/repos',
 'events_url': 'https://api.github.com/users/WillAyd/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/WillAyd/received_events',
 'type': 'User',
 'site_admin': False}
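
As the output above shows, each entry of the `user` column is a nested dictionary. One way to flatten such records into ordinary columns is `pd.json_normalize`. The sketch below uses made-up records shaped loosely like the issues payload; the field values are hypothetical:

```python
import pandas as pd

# Hypothetical records: each one carries a nested 'user' object.
records = [
    {'number': 1, 'title': 'first issue', 'user': {'login': 'alice', 'id': 11}},
    {'number': 2, 'title': 'second issue', 'user': {'login': 'bob', 'id': 22}},
]

# Nested keys become dotted column names such as 'user.login'.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```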

In [23]:
# Import a dataset and get insights from the data
df_att = pd.read_excel("Attribute DataSet.xlsx")

In [24]:
# First 5 rows
df_att.head()

     Dress_ID    Style    Price  Rating  Size  Season  NeckLine  SleeveLength  waiseline       Material  FabricType  Decoration  Pattern Type  Recommendation
0  1006032852     Sexy      Low     4.6     M  Summer    o-neck     sleevless     empire            NaN     chiffon     ruffles        animal               1
1  1212192089   Casual      Low     0.0     L  Summer    o-neck         Petal    natural     microfiber         NaN     ruffles        animal               0
2  1190380701  vintage     High     0.0     L  Automn    o-neck          full    natural       polyster         NaN         NaN         print               0
3   966005983    Brief  Average     4.6     L  Spring    o-neck          full    natural           silk     chiffon  embroidary         print               1
4   876339541     cute      Low     4.5     M  Summer    o-neck     butterfly    natural  chiffonfabric     chiffon         bow           dot               0


In [25]:
# Last 5 rows
df_att.tail()

      Dress_ID   Style    Price  Rating  Size  Season   NeckLine  SleeveLength  waiseline  Material  FabricType  Decoration  Pattern Type  Recommendation
495  713391965  Casual      Low     4.7     M  Spring     o-neck          full    natural  polyster         NaN         NaN         solid               1
496  722565148    Sexy      Low     4.3  free  Summer     o-neck          full     empire    cotton         NaN         NaN           NaN               0
497  532874347  Casual  Average     4.7     M  Summer     v-neck          full     empire    cotton         NaN        lace         solid               1
498  655464934  Casual  Average     4.6     L  winter  boat-neck     sleevless     empire      silk  broadcloth    applique         print               1
499  919930954  Casual      Low     4.4  free  Summer     v-neck         short     empire    cotton    Corduroy        lace         solid               0


In [26]:
# Columns of Dataset
df_att.columns

Index(['Dress_ID', 'Style', 'Price', 'Rating', 'Size', 'Season', 'NeckLine',
       'SleeveLength', 'waiseline', 'Material', 'FabricType', 'Decoration',
       'Pattern Type', 'Recommendation'],
      dtype='object')

In [27]:
# Check the data type of each column
df_att.dtypes

Dress_ID            int64
Style              object
Price              object
Rating            float64
Size               object
Season             object
NeckLine           object
SleeveLength       object
waiseline          object
Material           object
FabricType         object
Decoration         object
Pattern Type       object
Recommendation      int64
dtype: object

In [28]:
# Summary statistics for the numeric columns
df_att.describe()

           Dress_ID    Rating  Recommendation
count         500.0     500.0           500.0
mean    905541700.0    3.5286            0.42
std     173619000.0  2.005364        0.494053
min     444282000.0       0.0             0.0
25%     767316400.0       3.7             0.0
50%     908329600.0       4.6             0.0
75%    1039534000.0       4.8             1.0
max    1253973000.0       5.0             1.0
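
By default `describe()` summarizes only the numeric columns, which is why the text columns are missing above. A small sketch of summarizing the non-numeric columns instead; the four-row frame is hypothetical sample data, not the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one text column and one numeric column.
df = pd.DataFrame({
    'Style': ['Casual', 'Sexy', 'Casual', 'cute'],
    'Rating': [4.7, 4.3, 4.6, 4.5],
})

# Excluding numeric dtypes leaves the text columns, which are summarized
# by count, number of unique values, most frequent value, and its frequency.
text_summary = df.describe(exclude=[np.number])
print(text_summary)
```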


In [29]:
# Check for Null values
df_att.isnull()

     Dress_ID  Style  Price  Rating   Size  Season  NeckLine  SleeveLength  waiseline  Material  FabricType  Decoration  Pattern Type  Recommendation
0       False  False  False   False  False   False     False         False      False      True       False       False         False           False
1       False  False  False   False  False   False     False         False      False     False        True       False         False           False
2       False  False  False   False  False   False     False         False      False     False        True        True         False           False
3       False  False  False   False  False   False     False         False      False     False       False       False         False           False
4       False  False  False   False  False   False     False         False      False     False       False       False         False           False
...       ...    ...    ...     ...    ...     ...       ...           ...        ...       ...         ...         ...           ...             ...
495     False  False  False   False  False   False     False         False      False     False        True        True         False           False
496     False  False  False   False  False   False     False         False      False     False        True        True          True           False
497     False  False  False   False  False   False     False         False      False     False        True       False         False           False
498     False  False  False   False  False   False     False         False      False     False       False       False         False           False


In [30]:
# Count the null values in each column
df_att.isnull().sum()

Dress_ID            0
Style               0
Price               2
Rating              0
Size                0
Season              2
NeckLine            3
SleeveLength        2
waiseline          87
Material          128
FabricType        266
Decoration        236
Pattern Type      109
Recommendation      0
dtype: int64
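
Once the null counts are known, a common next step is to decide how to treat the gaps. The sketch below, on a hypothetical four-row frame, shows the share of missing values per column and two possible treatments; the placeholder value 'unknown' is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in the 'Material' column.
df = pd.DataFrame({
    'Material': ['silk', np.nan, 'cotton', np.nan],
    'Rating': [4.6, 0.0, 4.7, 4.4],
})

missing_share = df.isnull().mean()            # fraction of missing values per column
filled = df.fillna({'Material': 'unknown'})   # replace gaps with a placeholder label
dropped = df.dropna(subset=['Material'])      # or discard rows missing 'Material'

print(missing_share)
print(filled)
print(dropped)
```

Which treatment is appropriate depends on the analysis; columns that are mostly empty may be better dropped altogether.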

In [31]:
# Check for duplicate rows
df_att.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
495    False
496    False
497    False
498    False
499    False
Length: 500, dtype: bool
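
The boolean Series above is hard to scan for 500 rows. A short sketch, on hypothetical sample rows, of counting the duplicates and removing them:

```python
import pandas as pd

# Hypothetical sample in which row 2 repeats row 1 exactly.
df = pd.DataFrame({
    'Dress_ID': [1, 2, 2, 3],
    'Style': ['Casual', 'Sexy', 'Sexy', 'cute'],
})

# duplicated() marks every repeat of an earlier row; summing counts them.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = df.drop_duplicates()

print(n_duplicates)
print(deduped)
```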