##### Data Structures in Pandas Library
Pandas provides two main data structures for manipulating data:
- Series
- DataFrame
###### Pandas Series
- A pandas Series is a one-dimensional labeled array (comparable to a column in an Excel sheet) capable of holding data of any type, e.g. integers, floats or strings.
- The axis labels are collectively called the index.
###### Creating a Series
- A pandas Series can be created by loading data from existing storage (a database, CSV file or Excel file). It can also be created from lists, dictionaries, scalar values, etc.
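
As a minimal sketch of the in-memory options, the cell below builds a Series from a list, a dictionary and a scalar. The variable names and values are made up for illustration.

```python
import pandas as pd

# From a list: pandas assigns a default integer index 0..n-1
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 1.5, 'b': 2.5})

# From a scalar: the value is repeated for every label in the given index
from_scalar = pd.Series(7, index=['x', 'y', 'z'])

print(from_list)
print(from_dict)
print(from_scalar)
```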
###### Pandas DataFrame
- A pandas DataFrame is a two-dimensional data structure with labeled axes (rows and columns).
###### Creating DataFrame
- A pandas DataFrame can be created by loading a dataset from existing storage (a database, CSV file or Excel file). It can also be created from lists, dictionaries, a list of dictionaries, etc.
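
A short sketch of the in-memory constructors mentioned above, using invented sample values:

```python
import pandas as pd

# From a dictionary of lists: each key becomes a column
df_from_dict = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Age': [27, 24]})

# From a list of dictionaries: each dictionary becomes a row;
# keys missing from a row are filled with NaN
df_from_records = pd.DataFrame([
    {'Name': 'Gaurav', 'Age': 22},
    {'Name': 'Anuj'},
])

# From a list of lists, with the column names supplied explicitly
df_from_lists = pd.DataFrame([['Jai', 27], ['Princi', 24]], columns=['Name', 'Age'])

print(df_from_dict)
print(df_from_records)
print(df_from_lists)
```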

##### Methods
###### describe() method
Pandas **describe()** is used to view basic statistical details such as percentiles, mean and standard deviation of a Series or of the numeric columns of a DataFrame.
- **Syntax**: DataFrame.describe(percentiles = None, include = None, exclude = None)
- **Parameters**:
   * **percentiles**: list-like of numbers between 0 and 1, giving the percentiles to return. Default is None, which returns the 25th, 50th and 75th percentiles.
   * **include**: list of data types to be included while describing the DataFrame. Default is None (numeric columns only).
   * **exclude**: list of data types to be excluded while describing the DataFrame. Default is None.

- **Return type**: Statistical summary of the DataFrame (a Series when called on a Series).
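
Before the CSV-based examples, here is a small self-contained sketch of the parameters. The `scores` frame is invented so the cell runs without `nba.csv`.

```python
import pandas as pd

scores = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D', 'E'],
    'points': [10, 20, 30, 40, 50],
})

# By default only numeric columns are summarized
default_summary = scores.describe()

# Request the 10th and 90th percentiles instead of the default quartiles
custom_summary = scores.describe(percentiles=[0.1, 0.9])

# include='all' also summarizes non-numeric columns (count, unique, top, freq)
full_summary = scores.describe(include='all')

print(default_summary)
print(custom_summary)
print(full_summary)
```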


In [2]:
# Example
import pandas as pd
data = pd.read_csv('nba.csv')
print(data.head())

            Name            Team  Number Position   Age Height  Weight  \
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0   
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0   
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0   
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0   
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0   

             College     Salary  
0              Texas  7730337.0  
1          Marquette  6796117.0  
2  Boston University        NaN  
3      Georgia State  1148640.0  
4                NaN  5000000.0  


In [8]:
df = data['Age'].head()
df.unique()    # To print all the unique values

array([25., 27., 22., 29.])

In [10]:
df.nunique()  # To give the total count of unique values

4

In [15]:
# Using Describe function
print(data.describe()) # It gives several statistical measures, including mean, std, quartiles etc

           Number         Age      Weight        Salary
count  457.000000  457.000000  457.000000  4.460000e+02
mean    17.678337   26.938731  221.522976  4.842684e+06
std     15.966090    4.404016   26.368343  5.229238e+06
min      0.000000   19.000000  161.000000  3.088800e+04
25%      5.000000   24.000000  200.000000  1.044792e+06
50%     13.000000   26.000000  220.000000  2.839073e+06
75%     25.000000   30.000000  240.000000  6.500000e+06
max     99.000000   40.000000  307.000000  2.500000e+07


##### Explanation of the description of numerical columns
- **count**: Total number of non-empty (non-null) values
- **mean**: Mean of the column values
- **std**: Standard deviation of the column values
- **min**: Minimum value from the column
- **25%**: 25th percentile
- **50%**: 50th percentile (the median)
- **75%**: 75th percentile
- **max**: Maximum value from the column
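
To connect these rows to the individual Series methods, the sketch below compares a few `describe()` entries with direct calls. The `weights` values are sample numbers with one missing entry added for illustration.

```python
import pandas as pd

weights = pd.Series([180.0, 235.0, 205.0, 185.0, None], name='Weight')
summary = weights.describe()

# count ignores the missing value
print(summary['count'], weights.count())

# std is the sample standard deviation, the same value Series.std() returns
print(summary['std'], weights.std())

# the 50% row is the median
print(summary['50%'], weights.median())
```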

In [17]:
# pandas describe() behavior for numeric dtypes
import pandas as pd
data = pd.read_csv('nba.csv')

# removing null values to avoid errors
data.dropna(inplace = True)

# percentile list
percentile = [.20, .40, .60, .80]

# list of dtypes to include
include = ['object', 'float', 'int']

# calling describe method
desc = data.describe(percentiles = percentile, include = include)

#display
desc   # For the columns with strings, NaN was returned for numeric operations


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
count,364,364,364.0,364,364.0,364,364.0,364,364.0
unique,364,30,,5,,17,,115,
top,Avery Bradley,New Orleans Pelicans,,SG,,6-9,,Kentucky,
freq,1,16,,87,,49,,22,
mean,,,16.82967,,26.615385,,219.785714,,4620311.0
std,,,14.994162,,4.233591,,24.793099,,5119716.0
min,,,0.0,,19.0,,161.0,,55722.0
20%,,,4.0,,23.0,,195.0,,947276.0
40%,,,9.0,,25.0,,212.0,,1638754.0
50%,,,12.0,,26.0,,220.0,,2515440.0


In [18]:
# Describing series of strings
import pandas as pd

# making data frame
data = pd.read_csv('nba.csv')

# removing null values to avoid errors
data.dropna(inplace = True)

# Calling describe method
desc = data['Name'].describe()

# display
desc

count               364
unique              364
top       Avery Bradley
freq                  1
Name: Name, dtype: object

##### Dealing with Rows and Columns in Pandas DataFrame
We can perform basic operations on rows/columns like selecting, deleting, adding and renaming.
###### Column Selection
To select columns in a pandas DataFrame, we can access them by their column names.

In [20]:
# Example
import pandas as pd

data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']
       }
df = pd.DataFrame(data)

print(df[['Name','Qualification']])





     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd


###### Other methods of selecting multiple columns in a DataFrame
- using the basic method (the one described above)
- using loc[]
- using iloc[]
- using .ix (deprecated in pandas 0.20 and removed in pandas 1.0; use loc[] or iloc[] instead)

In [26]:
# Another example of using basic method
import pandas as pd

data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']
        }
df = pd.DataFrame(data)
df[df.columns[1:4]] # using DataFrame slicing

Unnamed: 0,Age,Address,Qualification
0,27,Delhi,Msc
1,24,Kanpur,MA
2,22,Allahabad,MCA
3,32,Kannauj,Phd


##### Select multiple columns in a pandas DataFrame using loc[]


In [48]:
# Example
import pandas as pd

data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']
        }
df = pd.DataFrame(data)

# select three rows and two columns                
df.loc[1:3,['Name', 'Address']]


Unnamed: 0,Name,Address
1,Princi,Kanpur
2,Gaurav,Allahabad
3,Anuj,Kannauj


In [15]:
df.loc[1:3]

Unnamed: 0,Name,Age,Address,Qualification
1,Princi,24,Kanpur,MA
2,Gaurav,22,Allahabad,MCA
3,Anuj,32,Kannauj,Phd


##### Select one to another columns


In [39]:
# In this example, loc[] is used to select a range of columns by name, from 'Name' to 'Address' (both inclusive)
import pandas as pd

data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']
        }
df = pd.DataFrame(data)  # assign to df; reusing the name pd would shadow the pandas module

df.loc[0:1, 'Name':'Address']

Unnamed: 0,Name,Age,Address
0,Jai,27,Delhi
1,Princi,24,Kanpur


In [49]:
# selecting the first row and all columns
df.loc[0, :]

Name               Jai
Age                 27
Address          Delhi
Qualification      Msc
Name: 0, dtype: object

##### Select multiple columns in pandas using iloc[]
- iloc[] is intended for integer-location (position) based indexing

In [8]:
# Selecting all rows and the first three columns
# Note: iloc follows Python slicing, i.e. the end position is not included in the result
import pandas as pd

data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']
        }

df = pd.DataFrame(data)
df.iloc[:,:3]


Unnamed: 0,Name,Age,Address
0,Jai,27,Delhi
1,Princi,24,Kanpur
2,Gaurav,22,Allahabad
3,Anuj,32,Kannauj


In [16]:
# Selecting all or some columns, one to another using .iloc

import pandas as pd

data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']
        }
df = pd.DataFrame(data)

# iloc[row slicing, column slicing]
df.iloc[0:2, 1:3]


Unnamed: 0,Age,Address
0,27,Delhi
1,24,Kanpur
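
The difference in slice endpoints between `loc[]` and `iloc[]` is easy to miss, so here is a minimal side-by-side sketch on a reduced version of the frame used above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
    'Age': [27, 24, 22, 32],
})

# loc slices by label and includes both endpoints: labels 0, 1 and 2
by_label = df.loc[0:2, 'Name']

# iloc slices by position and excludes the end: positions 0 and 1
by_position = df.iloc[0:2, 0]

print(len(by_label), len(by_position))
```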


##### Adding a column to an existing DataFrame
- By using DataFrame.insert() method
- By using DataFrame.assign() method
- Using Dictionary
- Using list
- Using .loc[]

##### using list

In [17]:
# Creating original DataFrame
import pandas as pd

data = {
        'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']
        }
df = pd.DataFrame(data)
print(df)


     Name  Height Qualification
0     Jai     5.1           Msc
1  Princi     6.2            MA
2  Gaurav     5.1           Msc
3    Anuj     5.2           Msc


In [19]:
# Adding another column called Address using a list
# Note: the length of the list must match the length of the index, otherwise a ValueError is raised
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
df['Address'] = address
print(df)


     Name  Height Qualification    Address
0     Jai     5.1           Msc      Delhi
1  Princi     6.2            MA  Bangalore
2  Gaurav     5.1           Msc    Chennai
3    Anuj     5.2           Msc      Patna


##### Using .insert()

In [26]:
# Adding a new column to an existing DataFrame using DataFrame.insert()
# It gives the freedom to add a column at any position we like and not just at the end.
# Arguments: insertion position, column name, column values, and whether duplicate column names are allowed

import pandas as pd
data = {
        'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']
        }
df = pd.DataFrame(data)

# using DataFrame.insert() to add a column
df.insert(2,'Age',[21,23,24,21], True)

print(df)




     Name  Height  Age Qualification
0     Jai     5.1   21           Msc
1  Princi     6.2   23            MA
2  Gaurav     5.1   24           Msc
3    Anuj     5.2   21           Msc


##### Using .assign()

In [27]:
# Adding columns to pandas DataFrame using DataFrame.assign()
# This method will create a new dataframe with a new column added to the old dataframe

import pandas as pd

data = {
        'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']
        }

df = pd.DataFrame(data)
df2 = df.assign(address=['Delhi', 'Bangalore', 'Chennai', 'Patna'])
print(df2)

     Name  Height Qualification    address
0     Jai     5.1           Msc      Delhi
1  Princi     6.2            MA  Bangalore
2  Gaurav     5.1           Msc    Chennai
3    Anuj     5.2           Msc      Patna
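
Because `assign()` returns a new DataFrame, the original stays unchanged. The sketch below checks this and also passes a callable, which receives the frame and derives a column from it. The `Tall` column and its threshold are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Height': [5.1, 6.2]})

# The callable is evaluated on the DataFrame being assigned to
df2 = df.assign(Tall=lambda d: d['Height'] > 6)

print(df.columns.tolist())   # the original frame has no 'Tall' column
print(df2)
```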


##### Using Dictionary

In [28]:
print(df) # the old dataframe without address column

     Name  Height Qualification
0     Jai     5.1           Msc
1  Princi     6.2            MA
2  Gaurav     5.1           Msc
3    Anuj     5.2           Msc


In [44]:
# Adding a column to the DataFrame using a dictionary
# Intended use: the keys match the values of an existing column ('Name') and the dictionary values become the new column

import pandas as pd

data = {
        'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']
        }
# Dictionary keyed by name
address = {'Jai': 'Delhi', 'Princi': 'Bangalore',
           'Gaurav': 'Patna', 'Anuj': 'Chennai'}
df = pd.DataFrame(data)

# In recent pandas versions, direct assignment aligns the dictionary keys with the DataFrame index (0..3), not with the 'Name' column
df['Address'] = address
print(df) # Not the desired result: no key matches an index label, so every value is NaN


     Name  Height Qualification Address
0     Jai     5.1           Msc     NaN
1  Princi     6.2            MA     NaN
2  Gaurav     5.1           Msc     NaN
3    Anuj     5.2           Msc     NaN
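
One way to get the intended column, sketched here under the assumption that every name should be looked up in the dictionary, is `Series.map()`, which translates each value of `Name` through the mapping:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
    'Height': [5.1, 6.2, 5.1, 5.2],
})
address = {'Jai': 'Delhi', 'Princi': 'Bangalore',
           'Gaurav': 'Patna', 'Anuj': 'Chennai'}

# Look up each value of 'Name' in the dictionary;
# a name missing from the dictionary would produce NaN
df['Address'] = df['Name'].map(address)
print(df)
```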


##### Using .loc[]

In [37]:
# Adding a new column to an existing pandas DataFrame using DataFrame.loc[]

import pandas as pd

data = {
        'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
       'Height': [5.1, 6.2, 5.1, 5.2],
       'Qualification': ['Msc', 'MA', 'Msc', 'Msc']
        }
df = pd.DataFrame(data)

# Create a list with new column values
address = ["Delhi", "Bangalore", "Chennai", "Patna"]
age = [22, 25, 23, 24]

# Add the new column using loc
df.loc[:,'Address'] = address

print(df)



     Name  Height Qualification    Address
0     Jai     5.1           Msc      Delhi
1  Princi     6.2            MA  Bangalore
2  Gaurav     5.1           Msc    Chennai
3    Anuj     5.2           Msc      Patna


##### Adding more than one column in existing DataFrame

In [41]:
import pandas as pd

data = {
        'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc'],
        'Address': ['Delhi', 'Bangalore', 'Chennai', 'Patna']
        }
df = pd.DataFrame(data)

# Defining data for additional columns
age = [22, 25, 23, 24]
state = ['NCT', 'Karnataka', 'Tamil Nadu', 'Bihar']

# Adding multiple columns using dictionary assignment
new_data = {'Age': age, 'State': state }
df = df.assign(**new_data)
print(df)



     Name  Height Qualification    Address  Age       State
0     Jai     5.1           Msc      Delhi   22         NCT
1  Princi     6.2            MA  Bangalore   25   Karnataka
2  Gaurav     5.1           Msc    Chennai   23  Tamil Nadu
3    Anuj     5.2           Msc      Patna   24       Bihar


##### Column Deletion
To delete a column in a pandas DataFrame, you can use the **drop()** method. Columns are deleted by passing their column names.

- **Syntax**: DataFrame.drop(labels = None, axis = 0, index = None,
  columns = None, level = None, inplace = False, errors = 'raise')
- **Parameters**:
   1. **labels**: String or list of strings referring to row or column names.
   2. **axis**: int or string value; 0 or 'index' for rows and 1 or 'columns' for columns.
   3. **index or columns**: Single label or list. These are an alternative to passing labels with axis and cannot be combined with labels.
   4. **level**: Used to specify the level when the DataFrame has a multi-level index.
   5. **inplace**: Makes changes to the original DataFrame if True.
   6. **errors**: With errors = 'ignore', labels that do not exist are skipped and the remaining labels are dropped.
- **Return type**: DataFrame with the specified labels dropped (None when inplace = True).

Note: Rows or columns can be removed using an index label or column name using this method.
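
The parameter combinations above can be sketched on a tiny frame. The column names, row labels and values here are invented, and `'Salary'` is deliberately absent to show `errors = 'ignore'`.

```python
import pandas as pd

df = pd.DataFrame(
    {'Team': ['A', 'B', 'C'], 'Age': [25, 27, 22], 'Weight': [180, 205, 185]},
    index=['p1', 'p2', 'p3'],
)

# columns=... is equivalent to labels=..., axis=1
no_weight = df.drop(columns=['Weight'])

# index=... drops rows by label
no_p2 = df.drop(index='p2')

# errors='ignore' skips labels that do not exist instead of raising KeyError
lenient = df.drop(columns=['Weight', 'Salary'], errors='ignore')

print(no_weight)
print(no_p2)
print(lenient)
```

Without `inplace = True`, each call returns a new DataFrame and `df` itself keeps all three columns and rows.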

In [54]:
# Example

import pandas as pd

# making dataframe from csv file
data = pd.read_csv('nba.csv', index_col = 'Name')
"""
The parameter index_col ="Name" sets the column named "Name" as the index of the DataFrame. 
The index in a DataFrame is a special column that identifies the rows, 
making it easier to access rows by their labels (in this case, the names).
"""

# dropping passed columns
data.drop(['Team', 'Weight'], axis = 1, inplace = True)
"""
The parameter axis = 1 indicates that the operation is to be performed on columns 
(axis=0 would indicate rows).
"""
print(data)

               Number Position   Age Height            College     Salary
Name                                                                     
Avery Bradley     0.0       PG  25.0    6-2              Texas  7730337.0
Jae Crowder      99.0       SF  25.0    6-6          Marquette  6796117.0
John Holland     30.0       SG  27.0    6-5  Boston University        NaN
R.J. Hunter      28.0       SG  22.0    6-5      Georgia State  1148640.0
Jonas Jerebko     8.0       PF  29.0   6-10                NaN  5000000.0
...               ...      ...   ...    ...                ...        ...
Shelvin Mack      8.0       PG  26.0    6-3             Butler  2433333.0
Raul Neto        25.0       PG  24.0    6-1                NaN   900000.0
Tibor Pleiss     21.0        C  26.0    7-3                NaN  2900000.0
Jeff Withey      24.0        C  26.0    7-0             Kansas   947276.0
NaN               NaN      NaN   NaN    NaN                NaN        NaN

[458 rows x 6 columns]


##### **Dealing with Rows**
We can perform basic operations on rows like **selecting, deleting, adding** and **renaming**.

###### **Row Selection**
- Methods to retrieve rows from a DataFrame:
 1. .loc[] - using index labels.
 2. .iloc[] - using integer location.
###### **Extracting rows using .loc[]**
- **Syntax**: DataFrame.loc[]
- **Parameters**:
   1. **Index label**: String or list of strings giving the index labels of the rows.
- **Return type**: DataFrame or Series depending on parameters
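
How the return type depends on the argument can be sketched without the CSV file. The small frame below uses shortened, made-up labels.

```python
import pandas as pd

df = pd.DataFrame(
    {'Team': ['Celtics', 'Celtics', 'Jazz'], 'Age': [25, 22, 28]},
    index=['Avery', 'R.J.', 'Trevor'],
)

one_row = df.loc['Avery']               # single label -> Series
as_frame = df.loc[['Avery']]            # list of labels -> DataFrame, even with one label
label_slice = df.loc['Avery':'R.J.']    # label slice -> DataFrame, both ends included

print(type(one_row).__name__, type(as_frame).__name__, len(label_slice))
```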
  
    

In [58]:
# Example 1
import pandas as pd

# making a DataFrame from csv file
df = pd.read_csv('nba.csv', index_col = 'Name')

# retrieving row by loc[] method
first = df.loc['Avery Bradley']
second = df.loc['R.J. Hunter']

print(first, '\n\n\n', second) # A Series is returned each time because a single label was passed

Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: Avery Bradley, dtype: object 


 Team        Boston Celtics
Number                28.0
Position                SG
Age                   22.0
Height                 6-5
Weight               185.0
College      Georgia State
Salary           1148640.0
Name: R.J. Hunter, dtype: object


In [62]:
# Example 2: Multiple labels
"""
In this example, the Name column is used as the index and
two rows are extracted at the same time by passing
a list of labels.
"""

import pandas as pd

# making data frame from csv file
data = pd.read_csv('nba.csv', index_col = 'Name')

# retrieve rows by loc method
rows = data.loc[['Avery Bradley','R.J. Hunter']]

# checking data type of rows
print(type(rows))

# display
rows

<class 'pandas.core.frame.DataFrame'>


Name,Team,Number,Position,Age,Height,Weight,College,Salary
Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0


In [63]:
# Example 3: Extracting multiple rows with the same index label
"""
In this example, the Team column is used as the index and one team name is passed to .loc[]
to check whether all rows with that label are returned.
"""
import pandas as pd

# making a dataframe from csv file

data = pd.read_csv('nba.csv', index_col = 'Team')

# retrieving rows by loc method
rows = data.loc['Utah Jazz']

# checking data type of rows
print(type(rows))

# display
rows # This returns all rows with team name 'Utah Jazz' in the form of a DataFrame.

<class 'pandas.core.frame.DataFrame'>


Team,Name,Number,Position,Age,Height,Weight,College,Salary
Utah Jazz,Trevor Booker,33.0,PF,28.0,6-8,228.0,Clemson,4775000.0
Utah Jazz,Trey Burke,3.0,PG,23.0,6-1,191.0,Michigan,2658240.0
Utah Jazz,Alec Burks,10.0,SG,24.0,6-6,214.0,Colorado,9463484.0
Utah Jazz,Dante Exum,11.0,PG,20.0,6-6,190.0,,3777720.0
Utah Jazz,Derrick Favors,15.0,PF,24.0,6-10,265.0,Georgia Tech,12000000.0
Utah Jazz,Rudy Gobert,27.0,C,23.0,7-1,245.0,,1175880.0
Utah Jazz,Gordon Hayward,20.0,SF,26.0,6-8,226.0,Butler,15409570.0
Utah Jazz,Rodney Hood,5.0,SG,23.0,6-8,206.0,Duke,1348440.0
Utah Jazz,Joe Ingles,2.0,SF,28.0,6-8,226.0,,2050000.0
Utah Jazz,Chris Johnson,23.0,SF,26.0,6-6,206.0,Dayton,981348.0


In [64]:
# Example 4: Extracting rows between two index labels
"""
In this example, two index labels are passed as a slice and all the rows that
fall between those two labels are returned (both index labels inclusive).
"""
import pandas as pd

# making dataframe from csv file
data = pd.read_csv('nba.csv', index_col = 'Name')

# retrieving rows by loc method
rows = data.loc['Avery Bradley':'Isaiah Thomas']

# Checking data type of rows
print(type(rows))

# display
rows   # all the rows between the two index labels (both inclusive) are returned as a DataFrame

<class 'pandas.core.frame.DataFrame'>


Name,Team,Number,Position,Age,Height,Weight,College,Salary
Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0
