# Advanced Pandas - Remaining Functionality

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

According to the Wikipedia page on pandas, the name is derived from the term “panel data”, an econometrics term for multidimensional structured data sets.

pandas consists of the following elements:

* A set of labeled array data structures, the primary of which are Series and DataFrame
* Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing
* An integrated group by engine for aggregating and transforming data sets
* Date range generation (date_range) and custom date offsets enabling the implementation of customized frequencies
* Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.
* Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)
* Moving window statistics (rolling mean, rolling standard deviation, etc.)

### Some quick references

__[10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html#min)__

### Managing your working directory in Python

In [1]:
import os

In [2]:
os.getcwd()

'/home/mohit/anaconda3/a_basic/Pandas/Advance Pandas'

**Set your current working directory**

In [3]:
# List all objects in the current working directory
os.listdir()

['02-Groupby.ipynb',
 '01-Missing Data.ipynb',
 'Exercises',
 '05-RemainingFunctionalities.ipynb',
 'data',
 '03-Data Input and Output.ipynb',
 '.ipynb_checkpoints',
 '00-JoinsVisual.jpg',
 '04-JoinMergeOperations.ipynb',
 '00-Data Science with Python_Lesson 07_Data Manipulation with Python_Pandas.pptx']
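The cells above only inspect the working directory. A minimal sketch of actually changing it with `os.chdir` follows; the temporary folder is an assumption made so the example runs anywhere, and in practice you would pass your own project path.

```python
import os
import tempfile

# Remember where we started so the change can be undone.
original_dir = os.getcwd()

# A temporary directory stands in for a real project folder (assumption for this sketch).
with tempfile.TemporaryDirectory() as project_dir:
    os.chdir(project_dir)        # set the working directory
    current_dir = os.getcwd()    # confirm the change
    print(current_dir)
    os.chdir(original_dir)       # switch back before the folder is removed
```

Relative paths such as `"data/titanic.csv"` are resolved against the current working directory, which is why it is worth checking before reading files.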

## Create Data in Pandas DataFrame

In [7]:
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt

In [10]:
pwd

'/home/mohit/anaconda3/a_basic/Pandas/Advance Pandas'

## Titanic Kaggle Data __[Data Description](https://www.kaggle.com/c/titanic/data)__

In [14]:
titanic_train = pd.read_csv("data/titanic.csv",sep=',')   

In [15]:
titanic_train.shape

(891, 12)

In [16]:
titanic_train.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [17]:
titanic_train.Name[:10]

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object

In [18]:
type(titanic_train[["PassengerId", 'Pclass','Name']][:10])

pandas.core.frame.DataFrame

In [19]:
titanic_train[["PassengerId", 'Pclass','Name']][:10]

,PassengerId,Pclass,Name
0,1,3,"Braund, Mr. Owen Harris"
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,3,3,"Heikkinen, Miss. Laina"
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,5,3,"Allen, Mr. William Henry"
5,6,3,"Moran, Mr. James"
6,7,1,"McCarthy, Mr. Timothy J"
7,8,3,"Palsson, Master. Gosta Leonard"
8,9,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
9,10,2,"Nasser, Mrs. Nicholas (Adele Achem)"


In [20]:
type(titanic_train.Pclass)

pandas.core.series.Series

In [21]:
titanic_train[:5]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [22]:
#type(titanic_train.PassengerId)
#type(titanic_train.PassengerId.values)
(titanic_train.Name.values)

array(['Braund, Mr. Owen Harris',
       'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
       'Heikkinen, Miss. Laina',
       'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
       'Allen, Mr. William Henry', 'Moran, Mr. James',
       'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
       'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
       'Nasser, Mrs. Nicholas (Adele Achem)',
       'Sandstrom, Miss. Marguerite Rut', 'Bonnell, Miss. Elizabeth',
       'Saundercock, Mr. William Henry', 'Andersson, Mr. Anders Johan',
       'Vestrom, Miss. Hulda Amanda Adolfina',
       'Hewlett, Mrs. (Mary D Kingcome) ', 'Rice, Master. Eugene',
       'Williams, Mr. Charles Eugene',
       'Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)',
       'Masselmani, Mrs. Fatima', 'Fynney, Mr. Joseph J',
       'Beesley, Mr. Lawrence', 'McGowan, Miss. Anna "Annie"',
       'Sloper, Mr. William Thompson', 'Palsson, Miss. Torborg Danira',
       'Asplund, Mrs. Carl Oscar 

### DataFrame Index

In [23]:
titanic_train.index

RangeIndex(start=0, stop=891, step=1)

In [24]:
titanic_train.index = titanic_train.PassengerId +1000

In [25]:
titanic_train.head()

PassengerId,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1001,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1002,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
1003,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
1004,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
1005,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
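Assigning to `.index` directly works, as shown above. As an alternative, `set_index` and `reset_index` move a column into the index and back. The sketch below uses a tiny made-up frame (the values are illustrative, not taken from the Titanic file).

```python
import pandas as pd

# Small illustrative frame; the values are invented for this sketch.
df = pd.DataFrame({"PassengerId": [1, 2, 3], "Name": ["A", "B", "C"]})

# set_index moves the column into the index (the column is removed by default).
indexed = df.set_index("PassengerId")

# drop=False keeps the column as well as using it for the index.
kept = df.set_index("PassengerId", drop=False)

# reset_index turns the index back into an ordinary column.
restored = indexed.reset_index()

print(indexed)
print(restored.columns.tolist())
```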


**Get a list of columns**

In [26]:
titanic_train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [27]:
titanic_train.columns = ['PassengerId', 'Survived', 'Passengerclass', 'PassName', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'EmbarkedPort']
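Assigning a full list to `.columns` requires every name to be listed in the right order. When only a few columns change, `DataFrame.rename` with a mapping is a possible alternative. The frame below is a made-up stand-in for the Titanic data.

```python
import pandas as pd

# Illustrative frame with a few of the Titanic column names.
df = pd.DataFrame({"Pclass": [3, 1], "Name": ["A", "B"], "Fare": [7.25, 71.28]})

# Only the columns named in the mapping are renamed; the rest are untouched.
renamed = df.rename(columns={"Pclass": "Passengerclass", "Name": "PassName"})

print(renamed.columns.tolist())
print(df.columns.tolist())  # the original frame is unchanged
```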

In [28]:
titanic_train.head()

PassengerId,PassengerId,Survived,Passengerclass,PassName,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
1001,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1002,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
1003,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
1004,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
1005,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**Get Data Type for each column**

In [29]:
titanic_train.dtypes
#type(titanic_train.dtypes)
#titanic_train.dtypes.index

PassengerId         int64
Survived            int64
Passengerclass      int64
PassName           object
Sex                object
Age               float64
SibSp               int64
Parch               int64
Ticket             object
Fare              float64
Cabin              object
EmbarkedPort       object
dtype: object

In [30]:
titanic_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1001 to 1891
Data columns (total 12 columns):
PassengerId       891 non-null int64
Survived          891 non-null int64
Passengerclass    891 non-null int64
PassName          891 non-null object
Sex               891 non-null object
Age               714 non-null float64
SibSp             891 non-null int64
Parch             891 non-null int64
Ticket            891 non-null object
Fare              891 non-null float64
Cabin             204 non-null object
EmbarkedPort      889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB


In [32]:
type(titanic_train.dtypes)

pandas.core.series.Series

In [33]:
titanic_train.dtypes['Age']

dtype('float64')

In [34]:
print("Categorical Columns")
titanic_train.dtypes == 'object'

Categorical Columns


PassengerId       False
Survived          False
Passengerclass    False
PassName           True
Sex                True
Age               False
SibSp             False
Parch             False
Ticket             True
Fare              False
Cabin              True
EmbarkedPort       True
dtype: bool

In [35]:
type(titanic_train.dtypes == 'object')

pandas.core.series.Series

In [36]:
titanic_train.columns[titanic_train.dtypes == 'object']

Index(['PassName', 'Sex', 'Ticket', 'Cabin', 'EmbarkedPort'], dtype='object')

In [37]:
titanic_train.columns

Index(['PassengerId', 'Survived', 'Passengerclass', 'PassName', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'EmbarkedPort'],
      dtype='object')

In [38]:
print("Numerical Columns")
titanic_train.columns[titanic_train.dtypes != 'object']

Numerical Columns


Index(['PassengerId', 'Survived', 'Passengerclass', 'Age', 'SibSp', 'Parch',
       'Fare'],
      dtype='object')
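Comparing `dtypes` against `'object'` works for the split shown above. `DataFrame.select_dtypes` is another way to make the same split. The sketch below uses a small invented frame and selects numeric columns with `include="number"` and everything else with `exclude="number"`.

```python
import pandas as pd

# Illustrative frame mixing text and numeric columns.
df = pd.DataFrame({"Name": ["A", "B"], "Age": [22.0, None], "Pclass": [3, 1]})

# Columns with a numeric dtype (int64 and float64 here).
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric.
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```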

**Get Descriptive Statistics for Numeric Columns**

In [39]:
#?pd.read_csv

In [40]:
print( titanic_train.describe() )

       PassengerId    Survived  Passengerclass         Age       SibSp  \
count   891.000000  891.000000      891.000000  714.000000  891.000000   
mean    446.000000    0.383838        2.308642   29.699118    0.523008   
std     257.353842    0.486592        0.836071   14.526497    1.102743   
min       1.000000    0.000000        1.000000    0.420000    0.000000   
25%     223.500000    0.000000        2.000000   20.125000    0.000000   
50%     446.000000    0.000000        3.000000   28.000000    0.000000   
75%     668.500000    1.000000        3.000000   38.000000    1.000000   
max     891.000000    1.000000        3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


**Get Descriptive Statistics for Categorical Columns**

In [41]:
#?pd.DataFrame.describe

In [42]:
categorical = titanic_train.dtypes[titanic_train.dtypes == "object"].index
print(categorical)

titanic_train[categorical].describe()

Index(['PassName', 'Sex', 'Ticket', 'Cabin', 'EmbarkedPort'], dtype='object')


,PassName,Sex,Ticket,Cabin,EmbarkedPort
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Toufik, Mr. Nakli",male,1601,B96 B98,S
freq,1,577,7,4,644


In [43]:
titanic_train.describe(include='all')

,PassengerId,Survived,Passengerclass,PassName,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Toufik, Mr. Nakli",male,,,,1601.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,



**Individual Statistics**

In [46]:
## How do we get individual column statistics?
titanic_train['Age'].mean()

29.69911764705882

In [47]:
titanic_train['Age'].max()

80.0

In [48]:
titanic_train['Age'].min()

0.42

In [52]:
np.max(titanic_train.Age)

80.0

In [53]:
np.max(titanic_train['Age'])

80.0
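The `Age` column has missing values (714 non-null out of 891), and the pandas statistics above skip them by default. The sketch below shows that behaviour on a small invented Series, and computes several statistics in one call with `agg`.

```python
import numpy as np
import pandas as pd

# Illustrative ages with one missing value.
age = pd.Series([22.0, 38.0, np.nan, 35.0], name="Age")

# Several statistics at once; NaN is ignored by default.
summary = age.agg(["mean", "min", "max"])
print(summary)

# count() ignores the missing value, len() does not.
print(age.count(), len(age))

# With skipna=False a single NaN makes the result NaN.
mean_with_nan = age.mean(skipna=False)
print(mean_with_nan)
```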

### Two ways to remove a column from a DataFrame

In [54]:
titanic_train.head()

PassengerId,PassengerId,Survived,Passengerclass,PassName,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
1001,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1002,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
1003,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
1004,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
1005,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [55]:
del titanic_train["PassengerId"]     # Remove PassengerId

In [56]:
titanic_train.head()

PassengerId,Survived,Passengerclass,PassName,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
1001,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1002,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
1003,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
1004,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
1005,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [57]:
titanic_train.drop(['PassName'],axis=1).head()

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
1001,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1002,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
1003,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
1004,1,1,female,35.0,1,0,113803,53.1,C123,S
1005,0,3,male,35.0,0,0,373450,8.05,,S


In [58]:
titanic_train.head()

PassengerId,Survived,Passengerclass,PassName,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
1001,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1002,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
1003,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
1004,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
1005,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [59]:
titanic_train.drop(['PassName'],axis=1,inplace=True)

In [60]:
titanic_train.head()

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
1001,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1002,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
1003,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
1004,1,1,female,35.0,1,0,113803,53.1,C123,S
1005,0,3,male,35.0,0,0,373450,8.05,,S
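The cells above show that `drop` returns a new DataFrame unless `inplace=True` is passed, which is why `PassName` was still present in In [58]. A small sketch of the same behaviour on an invented frame, using the `columns=` keyword in place of `axis=1`:

```python
import pandas as pd

# Illustrative frame; values are invented.
df = pd.DataFrame({"Name": ["A", "B"], "Age": [22, 38], "Fare": [7.25, 71.28]})

# drop returns a new frame and leaves df untouched.
trimmed = df.drop(columns=["Name"])

print(df.columns.tolist())       # still contains Name
print(trimmed.columns.tolist())  # Name removed
```

Reassigning the result (`df = df.drop(columns=["Name"])`) has the same effect as `inplace=True`. Unlike `del`, `drop` can remove several columns in one call.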


### Sorting Values

In [61]:
titanic_train.head()

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
1001,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1002,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
1003,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
1004,1,1,female,35.0,1,0,113803,53.1,C123,S
1005,0,3,male,35.0,0,0,373450,8.05,,S


In [62]:
titanic_train.sort_values(by='Age', inplace=True)

In [63]:
titanic_train.head()

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
1804,1,3,male,0.42,0,1,2625,8.5167,,C
1756,1,2,male,0.67,1,1,250649,14.5,,S
1645,1,3,female,0.75,2,1,2666,19.2583,,C
1470,1,3,female,0.75,2,1,2666,19.2583,,C
1079,1,2,male,0.83,0,2,248738,29.0,,S


In [64]:
titanic_train.sort_values(by='Age',ascending=False ,inplace=True)

In [65]:
titanic_train.head()

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
1631,1,1,male,80.0,0,0,27042,30.0,A23,S
1852,0,3,male,74.0,0,0,347060,7.775,,S
1097,0,1,male,71.0,0,0,PC 17754,34.6542,A5,C
1494,0,1,male,71.0,0,0,PC 17609,49.5042,,C
1117,0,3,male,70.5,0,0,370369,7.75,,Q


In [66]:
titanic_train["Ticket"][:10]

PassengerId
1631         27042
1852        347060
1097      PC 17754
1494      PC 17609
1117        370369
1746     WE/P 5735
1673    C.A. 24580
1034    C.A. 24579
1457         13509
1281        336439
Name: Ticket, dtype: object

In [67]:
sorted(titanic_train["Ticket"])[0:15]   # Check the first 15 sorted ticket values

['110152',
 '110152',
 '110152',
 '110413',
 '110413',
 '110413',
 '110465',
 '110465',
 '110564',
 '110813',
 '111240',
 '111320',
 '111361',
 '111361',
 '111369']

In [68]:
# Sorting by multiple columns
titanic_train.sort_values(by=['Age','Ticket'],ascending=[False,True] ,inplace=True)

In [69]:
titanic_train.head()

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort
1631,1,1,male,80.0,0,0,27042,30.0,A23,S
1852,0,3,male,74.0,0,0,347060,7.775,,S
1494,0,1,male,71.0,0,0,PC 17609,49.5042,,C
1097,0,1,male,71.0,0,0,PC 17754,34.6542,A5,C
1117,0,3,male,70.5,0,0,370369,7.75,,Q
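Two details of the multi-column sort above are easy to miss: `Ticket` holds strings, so it is ordered character by character, not numerically, and rows with a missing `Age` go to the end unless `na_position` says otherwise. The sketch below illustrates both on an invented frame.

```python
import numpy as np
import pandas as pd

# Illustrative data: Ticket is text, Age has one missing value.
df = pd.DataFrame({
    "Age": [30.0, np.nan, 30.0, 5.0],
    "Ticket": ["9", "110152", "10", "A/5 2"],
})

# Age descending, Ticket ascending within equal ages, missing ages first.
ordered = df.sort_values(
    by=["Age", "Ticket"],
    ascending=[False, True],
    na_position="first",
)

print(ordered)
# Among the two rows with Age 30, ticket "10" comes before "9"
# because strings are compared character by character.
```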


### Dealing with Categorical Data

In [70]:
titanic_train["Cabin"].describe()  # Check number of unique cabins

count         204
unique        147
top       B96 B98
freq            4
Name: Cabin, dtype: object

**We will now deal with data that is numeric in type but categorical in nature**

In [71]:
titanic_train["Survived"].dtypes

dtype('int64')

In [72]:
titanic_train["Survived"].describe()

count    891.000000
mean       0.383838
std        0.486592
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64

In [73]:
titanic_train["Survived"].unique()

array([1, 0])

In [74]:
new_survived = pd.Categorical(titanic_train["Survived"])

In [75]:
new_survived[0:10]

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Categories (2, int64): [0, 1]

In [76]:
type(new_survived)

pandas.core.arrays.categorical.Categorical

In [77]:
new_survived.describe()

categories,counts,freqs
0,549,0.616162
1,342,0.383838


In [80]:
#?new_survived.rename_categories

In [81]:
new_survived = new_survived.rename_categories(["Died","Survived"])              

new_survived.describe()

categories,counts,freqs
Died,549,0.616162
Survived,342,0.383838


In [82]:
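# rename_categories also accepts a dict mapping old names to new names.
# The keys must match the current categories. new_survived already holds
# "Died"/"Survived", so the keys 0 and 1 match nothing and the categories
# are left unchanged, as the output shows.
new_survived2 = new_survived.rename_categories({0:"Passenger_Died",1:"Passenger_Survived"})     
new_survived2.describe()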

categories,counts,freqs
Died,549,0.616162
Survived,342,0.383838
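The output above still shows `Died` and `Survived` because a dict passed to `rename_categories` is matched against the current category names. The sketch below compares the list form and the dict form on a small invented Categorical.

```python
import pandas as pd

# Illustrative 0/1 survival flags.
survived = pd.Categorical([1, 0, 0, 1, 0])

# List form: names are assigned by position (0 -> Died, 1 -> Survived).
by_list = survived.rename_categories(["Died", "Survived"])

# Dict form on the original codes: keys 0 and 1 exist, so both are renamed.
by_dict = survived.rename_categories({0: "Died", 1: "Survived"})

# Dict form on already renamed data: keys 0 and 1 no longer exist,
# so nothing changes.
unchanged = by_list.rename_categories({0: "Passenger_Died", 1: "Passenger_Survived"})

# Using the current names as keys does rename them.
relabelled = by_list.rename_categories(
    {"Died": "Passenger_Died", "Survived": "Passenger_Survived"}
)

print(list(unchanged.categories))
print(list(relabelled.categories))
```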


In [83]:
titanic_train['Survived_New']=new_survived2

In [84]:
titanic_train.head(10)

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort,Survived_New
1631,1,1,male,80.0,0,0,27042,30.0,A23,S,Survived
1852,0,3,male,74.0,0,0,347060,7.775,,S,Died
1494,0,1,male,71.0,0,0,PC 17609,49.5042,,C,Died
1097,0,1,male,71.0,0,0,PC 17754,34.6542,A5,C,Died
1117,0,3,male,70.5,0,0,370369,7.75,,Q,Died
1673,0,2,male,70.0,0,0,C.A. 24580,10.5,,S,Died
1746,0,1,male,70.0,1,1,WE/P 5735,71.0,B22,S,Died
1034,0,2,male,66.0,0,0,C.A. 24579,10.5,,S,Died
1055,0,1,male,65.0,0,1,113509,61.9792,B30,C,Died
1457,0,1,male,65.0,0,0,13509,26.55,E38,S,Died


In [86]:
titanic_train.Parch.unique()

array([0, 1, 4, 2, 3, 6, 5])

In [None]:
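# The column was renamed to "Passengerclass" in In [27]
new_Pclass = pd.Categorical(titanic_train["Passengerclass"],
                           ordered=True)

# Categories are renamed by position: 1 -> Class1, 2 -> Class2, 3 -> Class3
new_Pclass = new_Pclass.rename_categories(["Class1","Class2","Class3"])     

new_Pclass.describe()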

In [None]:
titanic_train.Sex.unique()

In [None]:
new_Gender = pd.Categorical(titanic_train["Sex"],
                           ordered=True)

new_Gender = new_Gender.rename_categories(["0","1"])     

new_Gender.describe()

In [None]:
new_Gender.unique()

In [None]:
titanic_train["Pclass"] = new_Pclass

In [None]:
titanic_train["Cabin"].unique()   # Check unique cabins

In [None]:
? pd.Categorical
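Passing `ordered=True`, as in the cells above, tells pandas that the categories have a rank. A minimal sketch of what that enables, using invented passenger classes and hypothetical labels `Class1` to `Class3`:

```python
import pandas as pd

# Illustrative passenger classes with an explicit order 1 < 2 < 3.
pclass = pd.Categorical([3, 1, 2, 3], categories=[1, 2, 3], ordered=True)

# Renaming is positional, so 1 -> Class1, 2 -> Class2, 3 -> Class3.
pclass = pclass.rename_categories(["Class1", "Class2", "Class3"])

# Ordered categoricals support min/max and comparisons.
print(pclass.min(), pclass.max())
below_first = pclass > "Class1"
print(below_first)
```

On an unordered Categorical, `min`, `max` and comparisons such as `>` raise a `TypeError`.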

## Indexing and Slicing DataFrames

We can use column names or row labels to index data points in a DataFrame.

In [87]:
titanic_train.head()

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort,Survived_New
1631,1,1,male,80.0,0,0,27042,30.0,A23,S,Survived
1852,0,3,male,74.0,0,0,347060,7.775,,S,Died
1494,0,1,male,71.0,0,0,PC 17609,49.5042,,C,Died
1097,0,1,male,71.0,0,0,PC 17754,34.6542,A5,C,Died
1117,0,3,male,70.5,0,0,370369,7.75,,Q,Died


In [88]:
titanic_train[10:15]

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort,Survived_New
1281,0,3,male,65.0,0,0,336439,7.75,,Q,Died
1439,0,1,male,64.0,1,4,19950,263.0,C23 C25 C27,S,Died
1546,0,1,male,64.0,0,0,693,26.0,,S,Died
1276,1,1,female,63.0,1,0,13502,77.9583,D7,S,Survived
1484,1,3,female,63.0,0,0,4134,9.5875,,S,Survived


In [89]:
titanic_train[1:2]

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort,Survived_New
1852,0,3,male,74.0,0,0,347060,7.775,,S,Died


In [90]:
titanic_train[['Fare']].head()

PassengerId,Fare
1631,30.0
1852,7.775
1494,49.5042
1097,34.6542
1117,7.75


In [91]:
titanic_train[['Fare','Age']].head()

PassengerId,Fare,Age
1631,30.0,80.0
1852,7.775,74.0
1494,49.5042,71.0
1097,34.6542,71.0
1117,7.75,70.5


**For advanced indexing we can use the two methods below**
* loc
* iloc

**loc uses the actual row and column labels for slicing data**

In [92]:
# General form (a pattern, not runnable as written):
# dataframe.loc[row_labels, column_labels]

In [93]:
titanic_train.head()

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort,Survived_New
1631,1,1,male,80.0,0,0,27042,30.0,A23,S,Survived
1852,0,3,male,74.0,0,0,347060,7.775,,S,Died
1494,0,1,male,71.0,0,0,PC 17609,49.5042,,C,Died
1097,0,1,male,71.0,0,0,PC 17754,34.6542,A5,C,Died
1117,0,3,male,70.5,0,0,370369,7.75,,Q,Died


In [94]:
titanic_train.loc[[1022,1038],:]

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort,Survived_New
1022,1,2,male,34.0,0,0,248698,13.0,D56,S,Survived
1038,0,3,male,21.0,0,0,A./5. 2152,8.05,,S,Died


In [95]:
titanic_train.loc[1001:1020,:]

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort,Survived_New
1001,0,3,male,22.0,1,0,A/5 21171,7.2500,,S,Died
1321,0,3,male,22.0,0,0,A/5 21172,7.2500,,S,Died
1213,0,3,male,22.0,0,0,A/5 21174,7.2500,,S,Died
1377,1,3,female,22.0,0,0,C 7077,7.2500,,S,Survived
1374,0,1,male,22.0,0,0,PC 17760,135.6333,,C,Died
...,...,...,...,...,...,...,...,...,...,...,...
1860,0,3,male,,0,0,2629,7.2292,,C,Died
1027,0,3,male,,0,0,2631,7.2250,,C,Died
1532,0,3,male,,0,0,2641,7.2292,,C,Died
1355,0,3,male,,0,0,2647,7.2250,,C,Died


In [96]:
titanic_train.loc[1010:1015, ['Fare','Age']]

PassengerId,Fare,Age
1010,30.0708,14.0
1040,11.2417,14.0
1687,39.6875,14.0
1015,7.8542,14.0
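The two slices above return rows such as 1040 and 1687 that lie outside the numeric range requested. The frame is currently sorted by `Age`, and a `loc` slice runs from the first label to the last label in the current row order, both ends included. The sketch below reproduces this on a small invented frame and contrasts it with `iloc`.

```python
import pandas as pd

# Illustrative frame whose index is not in sorted order.
df = pd.DataFrame(
    {"Age": [80.0, 22.0, 14.0, 22.0, 5.0]},
    index=[1631, 1001, 1010, 1020, 1005],
)

# Label slice: every row from label 1001 through label 1020, in row order.
by_label = df.loc[1001:1020]

# After sorting the index, the same kind of slice follows numeric order.
by_sorted_label = df.sort_index().loc[1001:1010]

# Position slice: rows at positions 1 and 2 (the end position is excluded).
by_position = df.iloc[1:3]

print(by_label.index.tolist())
print(by_sorted_label.index.tolist())
print(by_position.index.tolist())
```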


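In [97]:
# PassName was dropped in In [59], so this assignment raises a KeyError
titanic_train.index=titanic_train['PassName']

KeyError: 'PassName'

In [None]:
titanic_train.head(5)

In [98]:
# The index still holds PassengerId + 1000, not names, so name labels are not found
titanic_train.loc['Heikkinen, Miss. Laina':'Allen, Mr. William Henry',:]

KeyError: 'Heikkinen, Miss. Laina'

In [None]:
# Not run: the PassengerId column was deleted in In [55], so this would raise an AttributeError
titanic_train.index=titanic_train.PassengerId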

In [99]:
titanic_train.head()

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort,Survived_New
1631,1,1,male,80.0,0,0,27042,30.0,A23,S,Survived
1852,0,3,male,74.0,0,0,347060,7.775,,S,Died
1494,0,1,male,71.0,0,0,PC 17609,49.5042,,C,Died
1097,0,1,male,71.0,0,0,PC 17754,34.6542,A5,C,Died
1117,0,3,male,70.5,0,0,370369,7.75,,Q,Died


In [100]:
# The index labels are PassengerId + 1000 (set in In [24]), so use the offset values
titanic_train.loc[[1001,1100,1090,1080],:]

In [101]:
# Fails for the same reason as In [98]: the index holds numbers, not passenger names
titanic_train.loc['Allen, Mr. William Henry':'Heikkinen, Miss. Laina',:]

KeyError: 'Allen, Mr. William Henry'

In [102]:
#titanic_train.loc[1:5,:]

**Similarly, iloc uses integer positions to slice data**

In [103]:
# PassName was dropped in In [59], so attribute access fails as well
titanic_train.index=titanic_train.PassName

AttributeError: 'DataFrame' object has no attribute 'PassName'

In [104]:
titanic_train.head()

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort,Survived_New
1631,1,1,male,80.0,0,0,27042,30.0,A23,S,Survived
1852,0,3,male,74.0,0,0,347060,7.775,,S,Died
1494,0,1,male,71.0,0,0,PC 17609,49.5042,,C,Died
1097,0,1,male,71.0,0,0,PC 17754,34.6542,A5,C,Died
1117,0,3,male,70.5,0,0,370369,7.75,,Q,Died


In [105]:
titanic_train.iloc[2:4,:]

PassengerId,Survived,Passengerclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,EmbarkedPort,Survived_New
1494,0,1,male,71.0,0,0,PC 17609,49.5042,,C,Died
1097,0,1,male,71.0,0,0,PC 17754,34.6542,A5,C,Died


# Let's Do It Together

In [106]:
## Read HR_Employee_Attrition_Data.csv file as pandas DataFrame

In [107]:
## Read first 6 and last 7 rows 

In [108]:
## List all columns and data types for each

In [109]:
## Find descriptive statistics for all columns

In [110]:
## Find average age of people whose Attrition is Yes and No separately

In [111]:
## Find average daily rate for people whose Attrition is Yes and No separately

In [112]:
## Find the Department where Attrition rate is highest

In [113]:
## For people who Travel Rarely and are from the Sales department, what is the average daily rate?